Learning Hadoop 2

Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2.


What you will learn from this book

  • Write distributed applications using the MapReduce framework
  • Go beyond MapReduce and process data in real time with Samza and iteratively with Spark
  • Become familiar with data mining approaches that work with very large datasets
  • Prototype applications on a VM and deploy them to a local cluster or to a cloud infrastructure (Amazon Web Services)
  • Perform batch and real-time data analysis using SQL-like tools
  • Build data processing flows using Apache Pig and see how it allows easy incorporation of custom functionality
  • Define and orchestrate complex workflows and pipelines with Apache Oozie
  • Manage data lifecycle and changes over time


What it is about

Learning Hadoop 2 introduces you to the world of building data processing applications with the wide variety of tools supported by the platform. Starting from the core components of the framework – HDFS and YARN – this book guides you through building analytics and data processing applications using a variety of approaches.

Who it is for

This book is aimed at system and application developers interested in learning how to solve practical problems using the Hadoop framework and related components. Prerequisites are familiarity with the Unix/Linux command-line interface and experience with the Java programming language. Familiarity with Hadoop 1 is a plus.

How it is structured

Each chapter illustrates a key component of Hadoop 2 with a hands-on approach, complete with use cases and best practices. Each topic is illustrated in the context of a data analysis or processing application built around a dataset generated from Twitter’s message stream.

Garry Turkington

CTO, Improve Digital

Garry's current focus is on the infrastructural and architectural challenges around processing large data volumes. Before joining Improve Digital he worked at Amazon and in the public sector with a specialization in distributed computing.

Gabriele Modena

Data Scientist, Improve Digital

Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry, where he did research in machine learning and artificial intelligence.


© 2014 Gabriele Modena, Garry Turkington