Learning Hadoop 2

Design and implement data processing, lifecycle management, and analytic workflows with the cutting-edge toolbox of Hadoop 2.


What you will learn from this book

  • Write distributed applications using the MapReduce framework
  • Go beyond MapReduce and process data in real time with Samza and iteratively with Spark
  • Become familiar with data mining approaches that work with very large datasets
  • Prototype applications on a VM and deploy them to a local cluster or to a cloud infrastructure (Amazon Web Services)
  • Perform batch and real-time data analysis using SQL-like tools
  • Build data processing flows using Apache Pig and see how it allows easy incorporation of custom functionality
  • Define and orchestrate complex workflows and pipelines with Apache Oozie
  • Manage data lifecycle and changes over time


What it is about

Learning Hadoop 2 introduces you to the world of building data processing applications with the wide variety of tools supported by the platform. Starting from the core components of the framework – HDFS and YARN – this book guides you through building analytics and data processing applications using a variety of approaches.

Who it is for

This book is aimed at system and application developers interested in learning how to solve practical problems using the Hadoop framework and related components. Prerequisites are familiarity with the Unix/Linux command-line interface and experience with the Java programming language. Familiarity with Hadoop 1 is a plus.

How it is structured

Each chapter illustrates a key component of Hadoop 2 with a hands-on approach, complete with use cases and best practices. Each topic is illustrated in the context of a data analysis or processing application built around a dataset generated from Twitter’s message stream.

Garry Turkington

CTO, Improve Digital

Garry's current focus is on the infrastructural and architectural challenges around processing large data volumes. Before joining Improve Digital he worked at Amazon and in the public sector with a specialization in distributed computing.

Gabriele Modena

Data Scientist, Improve Digital

Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry, where he did research in machine learning and artificial intelligence.


© 2014 Gabriele Modena, Garry Turkington