Overview of Big Data with Hadoop
Apache Hadoop is open-source data management software that helps organizations analyze massive volumes of structured and unstructured data, and it is one of the most widely discussed technologies in the industry. Used by large websites such as Facebook, Yahoo, and eBay, Hadoop can also be provisioned in cloud environments such as the Windows HDInsight Service, where you pay only for the computing resources you use.
The class format is 50% lecture and 50% lab. The lab portion consists of ten hands-on exercises, including setting up Hadoop in pseudo-distributed mode, managing files in HDFS, writing MapReduce programs in Java, monitoring Hadoop, and working with Sqoop, Hive, and Pig.
Duration
5 Days
Prerequisite for Big Data with Hadoop
Prior knowledge of Core Java and SQL will be helpful but is not mandatory.
Course Outline for Big Data with Hadoop
Day 01
Introduction to Big Data
- What qualifies as Big Data
- Business use cases for Big Data
- Big Data requirements for the traditional data warehousing and BI space
- Big Data solutions
Introduction to Hadoop
- The amount of data processed in today's world
- What Hadoop is and why it is important
- Hadoop comparison with traditional systems
- Hadoop history
- Hadoop main components and architecture
Hadoop Distributed File System (HDFS)
- HDFS overview and design
- HDFS architecture
- HDFS file storage
- Component failures and recoveries
- Block placement
- Balancing the Hadoop cluster
Working with HDFS
- Ways of accessing data in HDFS
- Common HDFS operations and commands
- Different HDFS commands
- Internals of a file read in HDFS
- Data copying with ‘distcp’
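A minimal sketch of common HDFS operations via the FileSystem API in Scala (the same operations are available from the `hdfs dfs` command line). The directory and file names are placeholders, and it assumes a core-site.xml on the classpath pointing fs.defaultFS at the cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: basic HDFS operations via the FileSystem API.
object HdfsBasics {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Create a directory and copy a local file into it (paths are placeholders).
    fs.mkdirs(new Path("/user/student/input"))
    fs.copyFromLocalFile(new Path("file:///tmp/sample.txt"),
                         new Path("/user/student/input/sample.txt"))

    // List the directory, similar to `hdfs dfs -ls`.
    fs.listStatus(new Path("/user/student/input"))
      .foreach(s => println(s"${s.getLen}\t${s.getPath}"))

    fs.close()
  }
}
```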
Map-Reduce Abstraction
- What MapReduce is and why it is popular
- The big picture of MapReduce
- MapReduce process and terminology
- MapReduce component failures and recoveries
- Working with MapReduce
- Lab: Working with MapReduce
Day 02
Programming MapReduce Jobs
- Java MapReduce implementation
- Map() and Reduce() methods
- Java MapReduce calling code
- Lab: Programming Word Count
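As a reference point for the Word Count lab, here is a minimal MapReduce sketch written in Scala against the standard Hadoop Java API (the class names and input/output paths are illustrative; the Java version taught in class follows the same structure).

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Mapper: emit (word, 1) for every token in the input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reducer: sum the counts for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))     // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1)))   // output path must not already exist
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```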
MapReduce Features
- Joining Data Sets in MapReduce Jobs
- How to write a Map-Side Join
- How to write a Reduce-Side Join
- MapReduce Counters
- Built-in and user-defined counters
- Retrieving MapReduce counters
- Lab: Map-Side Join
Troubleshooting MapReduce Jobs
- How to Find and Review Logs for YARN MapReduce Jobs
- Understanding log messages
- Viewing and Filtering MapReduce Activities
Day 03
Hive – This module covers Hive concepts, loading and querying data in Hive, and Hive UDFs.
Topics – Hive background, Hive use cases, an overview of Hive, Hive vs. Pig, Hive architecture and components, the Hive metastore, limitations of Hive, comparison with traditional databases, Hive data types and data models, partitions and buckets, Hive tables (managed and external), importing data, querying data, managing outputs, Hive scripts, Hive UDFs, and a Hive demo on a healthcare data set.
Hands On:
- Understanding the MapReduce flow behind Hive SQL
- Creating static partition tables
- Creating dynamic partition tables
- Loading an unstructured text file into a table using the Regex SerDe
- Loading a JSON file into a table using the JSON SerDe
- Creating transactional tables
- Creating views and indexes
- Creating ORC and Parquet tables and using compression techniques
- Creating SequenceFile tables
- Writing Java code for a UDF
- Writing Java code to connect to Hive and perform CRUD operations over JDBC (see the JDBC sketch after this list)
- Using Sqoop to import RDBMS data into HDFS
- Using Sqoop to import RDBMS data into Hive
- Using Sqoop to import RDBMS data into HBase
- Using Sqoop to export data from HDFS back into an RDBMS
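For the JDBC exercise above, a minimal sketch in Scala (the course uses Java; the JDBC flow is identical). It assumes a HiveServer2 instance at localhost:10000, the hive-jdbc driver on the classpath, and an illustrative table name; INSERT ... VALUES requires Hive 0.14 or later.

```scala
import java.sql.DriverManager

object HiveJdbcSketch {
  def main(args: Array[String]): Unit = {
    // Assumed HiveServer2 endpoint and database; adjust to your cluster.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    try {
      val stmt = conn.createStatement()
      stmt.execute("CREATE TABLE IF NOT EXISTS patients (id INT, name STRING) STORED AS ORC")
      stmt.execute("INSERT INTO patients VALUES (1, 'demo')")   // Hive 0.14+ syntax
      val rs = stmt.executeQuery("SELECT id, name FROM patients")
      while (rs.next()) println(s"${rs.getInt(1)} ${rs.getString(2)}")
    } finally conn.close()
  }
}
```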
Scala
Duration: 4 Hours
Basics:
- Hello World
- Primitive Types
- Type inference
- Vars vs Vals
- Lazy Vals
- Methods
- Pass By Name
- No parens/Brackets
- Default Arguments
- Named Arguments
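A small sketch tying several of the basics above together (all names and values are illustrative):

```scala
object BasicsDemo {
  // val is immutable, var is mutable; types are inferred.
  val greeting = "Hello World"
  var counter = 0

  // A lazy val is evaluated only on first access.
  lazy val expensive = { println("computed"); 42 }

  // Default and named arguments.
  def area(width: Double, height: Double = 1.0): Double = width * height

  // Pass-by-name: the argument is re-evaluated each time it is used in the body.
  def twice(body: => Unit): Unit = { body; body }

  def main(args: Array[String]): Unit = {
    counter += 1
    println(greeting)
    println(area(width = 3.0))   // named argument, default height
    twice(println("hi"))
    println(expensive)
  }
}
```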
Classes:
- Introduction
- Inheritance
- Primary and Auxiliary Constructors
- Private Constructors
- Uniform Access
- Case Classes
- Objects
- Traits
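A compact sketch of the class-related topics above (illustrative names):

```scala
// A trait can declare abstract members and provide concrete methods.
trait Greeter {
  def name: String
  def greet(): String = s"Hello, $name"
}

// Primary constructor in the class signature; an auxiliary constructor below.
class Person(val name: String, val age: Int) extends Greeter {
  def this(name: String) = this(name, 0)
}

// Case classes get equals/hashCode/toString/copy and pattern matching for free.
case class Point(x: Int, y: Int)

object ClassesDemo extends App {
  val p = new Person("Ada")
  println(p.greet())                 // Hello, Ada
  val moved = Point(1, 2).copy(x = 5)
  println(moved)                     // Point(5,2)
}
```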
Day 04
Scala (continued)
Collections:
- Lists
- Collection Manipulation
- Simple Methods
- Methods With Functions
- Use Cases With Common Methods
- Tuples
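A short sketch of the collection topics above (illustrative values):

```scala
object CollectionsDemo extends App {
  val xs = List(3, 1, 4, 1, 5, 9)

  // Methods taking functions: map, filter, reduce.
  val doubled = xs.map(_ * 2)
  val evens   = xs.filter(_ % 2 == 0)
  val sum     = xs.reduce(_ + _)

  // Tuples group values of different types without a dedicated class.
  val pair: (String, Int) = ("answer", 42)

  println(doubled)                          // List(6, 2, 8, 2, 10, 18)
  println(evens)                            // List(4)
  println(s"$sum ${pair._1} ${pair._2}")    // 23 answer 42
}
```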
Types:
- Type parameterization
- Covariance
- Contravariance
- Type Upper Bounds
- ‘Nothing’ Type
Options:
- Option Implementation
- Like Lists
- Practice Application
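A short sketch of Option used like a zero-or-one-element collection (the lookup data is illustrative):

```scala
object OptionsDemo extends App {
  def lookup(id: Int): Option[String] =
    Map(1 -> "alice", 2 -> "bob").get(id)   // Map#get already returns an Option

  // map/filter/getOrElse work on Option just as they do on List.
  val found   = lookup(1).map(_.toUpperCase).getOrElse("unknown")
  val missing = lookup(9).map(_.toUpperCase).getOrElse("unknown")

  println(found)     // ALICE
  println(missing)   // unknown
}
```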
Anonymous Classes:
- Introduction
- Structural Typing
- Anonymous Classes With Structural Typing
Special Methods:
- Apply
- Update
Closure and functions
Currying:
- Introduction
- Applications
Implicits:
- Implicit Values/Parameters
- Implicit Conversions
- With Anonymous Classes
- Implicit Classes
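A short sketch of implicit values, implicit parameters, and implicit classes (names are illustrative):

```scala
object ImplicitsDemo extends App {
  // Implicit value picked up by the implicit parameter list below.
  implicit val defaultGreeting: String = "Hello"
  def greet(name: String)(implicit greeting: String): String = s"$greeting, $name"

  // An implicit class adds a method to an existing type ("extension method").
  implicit class IntTimes(n: Int) {
    def times(body: => Unit): Unit = (1 to n).foreach(_ => body)
  }

  println(greet("Scala"))    // Hello, Scala
  2.times(println("tick"))   // prints "tick" twice
}
```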
For Loops:
- Introduction
- Coding Style
- With Options
- And flatMap
- Guards
- Definitions
Var Args:
- Introduction
- Ascribing the _* type
Partial Functions:
- Introduction
- Match
- Match Values/Constants
- Match Types
- Extractors
- If Conditions
- Or
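A short sketch of match expressions and partial functions covering several of the cases above (types and values are illustrative):

```scala
object MatchDemo extends App {
  case class User(name: String, age: Int)

  def describe(x: Any): String = x match {
    case 0                   => "zero"                 // constant pattern
    case n: Int if n > 0     => s"positive int $n"     // type pattern with a guard
    case User(name, _)       => s"user $name"          // extractor pattern
    case _: String | _: Char => "text"                 // or-pattern
    case _                   => "something else"
  }

  // A partial function is defined only for some inputs; collect applies it
  // where it is defined and skips the rest.
  val evenLabels: PartialFunction[Int, String] = { case n if n % 2 == 0 => s"$n is even" }

  println(List(1, 2, 3, 4).collect(evenLabels))   // List(2 is even, 4 is even)
  println(describe(User("Ada", 36)))              // user Ada
  println(describe(-1))                           // something else
}
```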
Working with XML & JSON
Performance tuning guidelines
Packing and deployment
Introduction of Spark
- Evolution of distributed systems
- Why we need a new generation of distributed systems
- Limitations of MapReduce in Hadoop
- Understanding the need for batch vs. real-time analytics
- Batch analytics (Hadoop ecosystem overview) and real-time analytics options
- Introduction to stream and in-memory analysis
- What is Spark?
- A Brief History: Spark
Using Scala to Create Spark Applications
- Invoking Spark Shell
- Creating the SparkContext
- Loading a File in Shell
- Performing Some Basic Operations on Files in Spark Shell
- Building a Spark Project with sbt
- Running a Spark Project with sbt
- Caching Overview
- Distributed Persistence
- Spark Streaming Overview
- Example: Streaming Word Count
- Testing Tips in Scala
- Performance Tuning Tips in Spark
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
Hands On:
- Installing Spark
- Installing SBT and maven for building the project
- Writing code to convert HDFS data into an RDD
- Writing code to perform different transformations and actions
- Understanding the tasks and stages of a Spark job
- Writing code that uses different storage levels and caching
- Creating and using broadcast variables and accumulators (see the sketch after this list)
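A minimal Spark core sketch in the spirit of the exercises above, covering an RDD built from HDFS, transformations and actions, an explicit storage level, a broadcast variable, and an accumulator. The application name and HDFS paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo"))

    // Transformation pipeline over an HDFS file (path is a placeholder).
    val lines = sc.textFile("hdfs:///user/student/input/sample.txt")
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

    // Broadcast a small lookup set; count skipped words with an accumulator
    // (accumulator updates inside transformations may be re-applied on task retry).
    val stopWords = sc.broadcast(Set("the", "a", "an"))
    val skipped   = sc.longAccumulator("skipped")

    val counts = words
      .filter { w => val keep = !stopWords.value.contains(w); if (!keep) skipped.add(1); keep }
      .map((_, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_ONLY)   // explicit storage level instead of cache()

    println(s"distinct words: ${counts.count()}, stop words skipped: ${skipped.value}")
    counts.saveAsTextFile("hdfs:///user/student/output/wordcount")
    sc.stop()
  }
}
```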
Running SQL queries using Spark SQL
- Starting Point: SQLContext
- Creating DataFrames
- DataFrame Operations
- Running SQL Queries Programmatically
- Interoperating with RDDs
- Inferring the Schema Using Reflection
- Data Sources
- Generic Load/Save Functions
- Save Modes
- Saving to Persistent Tables
- Parquet Files
- Loading Data Programmatically
- Partition Discovery
- Schema Merging
- JSON Datasets
- Hive Tables
- JDBC To Other Databases
- HBase Integration
- Reading Solr results as a DataFrame
- Troubleshooting
- Performance Tuning
- Caching Data In Memory
- Compatibility with Apache Hive
- Unsupported Hive Functionality
Hands On:
- Writing code to create SparkContext, HiveContext, and HBaseContext objects
- Writing code to run Hive queries using Spark SQL
- Writing code to load and transform text file data and convert it into a DataFrame
- Writing code to read and store JSON files as DataFrames
- Writing code to read and store Parquet files as DataFrames
- Reading and writing data to an RDBMS (for example, MySQL) using Spark SQL
- Caching DataFrames
- Java code to read Solr results as a DataFrame (a combined Spark SQL sketch follows this list)
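A combined sketch of several of the exercises above, using the Spark 2.x SparkSession entry point (the outline also mentions SQLContext, which works similarly). All paths, the JDBC URL, table name, and credentials are placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sparksql-demo")
      .enableHiveSupport()        // lets spark.sql() see Hive tables via the metastore
      .getOrCreate()

    // Read JSON into a DataFrame, query it with SQL, and cache it in memory.
    val people = spark.read.json("hdfs:///user/student/people.json")
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.cache()
    adults.show()

    // Write the result as Parquet, and also to an RDBMS over JDBC.
    adults.write.mode(SaveMode.Overwrite).parquet("hdfs:///user/student/adults.parquet")
    adults.write.mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/demo")
      .option("dbtable", "adults")
      .option("user", "demo").option("password", "secret")
      .save()

    spark.stop()
  }
}
```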
Day 05
Spark Streaming
- Micro batch
- Discretized Streams (DStreams)
- Input DStreams and Receivers
- DStream to RDD
- Basic Sources
- Advanced Sources
- Transformations on DStreams
- Output Operations on DStreams
- Design Patterns for using foreachRDD
- DataFrame and SQL Operations
- Checkpointing
- Socket stream
- File Stream
- Stateful operations
- How stateful operations work
- Window Operations
- Join Operations
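A minimal Spark Streaming sketch touching several of the topics above: a socket source, DStream transformations, a window operation, and checkpointing. The host, port, checkpoint path, and intervals are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo")
    val ssc  = new StreamingContext(conf, Seconds(5))       // 5-second micro-batches
    ssc.checkpoint("hdfs:///user/student/checkpoints")      // needed for stateful/window ops

    // Basic source: a TCP socket (e.g. feed it with `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

    // Windowed word count: 30-second window, sliding every 10 seconds.
    val counts = words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```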
Tuning Spark
- Data Serialization
- Memory Tuning
- Determining Memory Consumption
- Tuning Data Structures
- Serialized RDD Storage
- Garbage Collection Tuning
- Other Considerations
- Level of Parallelism
- Memory Usage of Reduce Tasks
- Broadcasting Large Variables
- Data Locality
- Summary
Spark ML Programming
- Data types
- Classification and regression
- Collaborative filtering
- Alternating least squares (ALS)
Hands On:
- Writing code to process Flume data using Spark Streaming
- Writing code to process network data using Spark Streaming
- Writing code to process Kafka data using Spark Streaming
- Writing code for SVMs, logistic regression, and linear regression (see the sketch after this list)
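A minimal spark.ml classification sketch for the regression/classification item above, fitting a logistic regression model on a tiny, made-up in-memory dataset (labels and feature values are purely illustrative).

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-demo").getOrCreate()
    import spark.implicits._

    // A tiny, made-up training set: a label plus a two-element feature vector.
    val training = Seq(
      (1.0, Vectors.dense(2.0, 30.0)),
      (0.0, Vectors.dense(1.0, 12.0)),
      (1.0, Vectors.dense(3.0, 45.0)),
      (0.0, Vectors.dense(0.5, 10.0))
    ).toDF("label", "features")

    // Fit the model and show predictions on the training data.
    val model = new LogisticRegression().setMaxIter(10).fit(training)
    model.transform(training).select("features", "label", "prediction").show()

    spark.stop()
  }
}
```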
Data Loading: Here we will learn the different data loading options available in Hadoop and look in detail at Flume and Sqoop to demonstrate how to bring various kinds of data, such as web server logs, streaming data, RDBMS tables, and Twitter tweets, into HDFS.
Flume and Sqoop
Learning Objectives – In this class, you will understand how multiple Hadoop ecosystem components work together in a Hadoop implementation to solve Big Data problems. We will discuss multiple data sets and the specifications of the project.
Kafka
- Introduction
- Basic Kafka Concepts
- Kafka vs Other Messaging Systems
- Intra-Cluster Replication
- An Inside Look at Kafka’s Components
- Cluster Administration
- Using Kafka Connect to Move Data
Hands On:
- Using Flume to capture and transport network data
- Using Flume to capture and transport web server log data
- Using Flume to capture and transport Twitter data
- Creating a Kafka topic and configuring its replication factor and number of partitions
- Loading data into a Kafka topic
- Reading data from a Kafka topic (a producer/consumer sketch follows this list)
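A minimal Kafka producer and consumer sketch for the last two exercises, using the standard Kafka Java client from Scala. The broker address, topic name, and consumer group id are placeholders; the topic itself is assumed to exist already (created with the desired partitions and replication factor). The long-based poll() shown here is the older API; newer clients take a Duration.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaDemo {
  val broker = "localhost:9092"   // placeholder broker address
  val topic  = "demo-topic"       // placeholder topic name

  // Produce one string record to the topic.
  def produce(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", broker)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord(topic, "key-1", "hello kafka"))
    producer.close()
  }

  // Read records from the beginning of the topic and print them.
  def consume(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", broker)
    props.put("group.id", "demo-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList(topic))
    val records = consumer.poll(1000L)
    records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()
  }

  def main(args: Array[String]): Unit = { produce(); consume() }
}
```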