Overview of Big Data with Hadoop
Apache Hadoop is open-source data management software that helps organizations analyze massive volumes of structured and unstructured data, and it is one of the most widely discussed technologies in the industry. Used by large websites such as Facebook, Yahoo, and eBay, Hadoop can also be provisioned in cloud environments such as the Windows HDInsight Service, where you pay only for the computing resources you use.
The class format is 50% lecture and 50% lab. The lab portion consists of ten hands-on exercises, including setting up Hadoop in pseudo-distributed mode, managing files in HDFS, writing MapReduce programs in Java, monitoring Hadoop, and working with Sqoop, Hive, and Pig.
Duration
5 Days
Prerequisite for Big Data with Hadoop
Prior knowledge of Core Java and SQL will be helpful but is not mandatory.
Course Outline for Big Data with Hadoop
Day 01
Introduction to Big Data
- What qualifies as Big Data
- Business use cases for Big Data
- Big Data requirements for the traditional data warehousing and BI space
- Big Data solutions
Introduction to Hadoop
- The amount of data processed in today's world
- What Hadoop is and why it is important
- Hadoop comparison with traditional systems
- Hadoop history
- Hadoop main components and architecture
Hadoop Distributed File System (HDFS)
- HDFS overview and design
- HDFS architecture
- HDFS file storage
- Component failures and recoveries
- Block placement
- Balancing the Hadoop cluster
Working with HDFS
- Ways of accessing data in HDFS
- Common HDFS operations and commands
- Different HDFS commands
- Internals of a file read in HDFS
- Data copying with ‘distcp’
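A minimal sketch of common HDFS operations via the FileSystem API in Scala (the same operations are available from the `hdfs dfs` command line). The directory and file names are placeholders, and it assumes a core-site.xml on the classpath pointing fs.defaultFS at the cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: basic HDFS operations via the FileSystem API.
object HdfsBasics {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // Create a directory and copy a local file into it (paths are placeholders).
    fs.mkdirs(new Path("/user/student/input"))
    fs.copyFromLocalFile(new Path("file:///tmp/sample.txt"),
                         new Path("/user/student/input/sample.txt"))

    // List the directory, similar to `hdfs dfs -ls`.
    fs.listStatus(new Path("/user/student/input"))
      .foreach(s => println(s"${s.getLen}\t${s.getPath}"))

    fs.close()
  }
}
```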
Map-Reduce Abstraction
- What MapReduce is and why it is popular
- The big picture of MapReduce
- MapReduce process and terminology
- MapReduce component failures and recoveries
- Working with MapReduce
- Lab: Working with MapReduce
Day 02
Programming MapReduce Jobs
- Java MapReduce implementation
- Map() and Reduce() methods
- Java MapReduce calling code
- Lab: Programming Word Count
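As a reference point for the Word Count lab, here is a minimal MapReduce sketch written in Scala against the standard Hadoop Java API (the class names and input/output paths are illustrative; the Java version taught in class follows the same structure).

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Mapper: emit (word, 1) for every token in the input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reducer: sum the counts for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance()
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))     // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1)))   // output path must not already exist
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```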
MapReduce Features
- Joining Data Sets in MapReduce Jobs
- How to write a Map-Side Join
- How to write a Reduce-Side Join
- MapReduce Counters
- Built-in and user-defined counters
- Retrieving MapReduce counters
- Lab: Map-Side Join
Troubleshooting MapReduce Jobs
- How to Find and Review Logs for YARN MapReduce Jobs
- Understanding log messages
- Viewing and Filtering MapReduce Activities
Day 03
Hive – This module covers Hive concepts, loading and querying data in Hive, and Hive UDFs.
Topics – Hive background, Hive use cases, an overview of Hive, Hive vs. Pig, Hive architecture and components, the Hive metastore, limitations of Hive, comparison with traditional databases, Hive data types and data models, partitions and buckets, Hive tables (managed and external), importing data, querying data, managing outputs, Hive scripts, Hive UDFs, and a Hive demo on a healthcare data set.
Hands On:
- Understanding the MapReduce flow behind Hive SQL
- Creating static partition tables
- Creating dynamic partition tables
- Loading an unstructured text file into a table using the Regex SerDe
- Loading a JSON file into a table using the JSON SerDe
- Creating transactional tables
- Creating views and indexes
- Creating ORC and Parquet tables and using compression techniques
- Creating SequenceFile tables
- Writing Java code for a UDF
- Writing Java code to connect to Hive and perform CRUD operations over JDBC (see the JDBC sketch after this list)
- Using Sqoop to import RDBMS data into HDFS
- Using Sqoop to import RDBMS data into Hive
- Using Sqoop to import RDBMS data into HBase
- Using Sqoop to export data from HDFS back into an RDBMS
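For the JDBC exercise above, a minimal sketch in Scala (the course uses Java; the JDBC flow is identical). It assumes a HiveServer2 instance at localhost:10000, the hive-jdbc driver on the classpath, and an illustrative table name; INSERT ... VALUES requires Hive 0.14 or later.

```scala
import java.sql.DriverManager

object HiveJdbcSketch {
  def main(args: Array[String]): Unit = {
    // Assumed HiveServer2 endpoint and database; adjust to your cluster.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    try {
      val stmt = conn.createStatement()
      stmt.execute("CREATE TABLE IF NOT EXISTS patients (id INT, name STRING) STORED AS ORC")
      stmt.execute("INSERT INTO patients VALUES (1, 'demo')")   // Hive 0.14+ syntax
      val rs = stmt.executeQuery("SELECT id, name FROM patients")
      while (rs.next()) println(s"${rs.getInt(1)} ${rs.getString(2)}")
    } finally conn.close()
  }
}
```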
Scala
Duration: 4 Hours
Basics:
- Hello World
- Primitive Types
- Type inference
- Vars vs Vals
- Lazy Vals
- Methods
- Pass By Name
- No parens/Brackets
- Default Arguments
- Named Arguments
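A small sketch tying several of the basics above together (all names and values are illustrative):

```scala
object BasicsDemo {
  // val is immutable, var is mutable; types are inferred.
  val greeting = "Hello World"
  var counter = 0

  // A lazy val is evaluated only on first access.
  lazy val expensive = { println("computed"); 42 }

  // Default and named arguments.
  def area(width: Double, height: Double = 1.0): Double = width * height

  // Pass-by-name: the argument is re-evaluated each time it is used in the body.
  def twice(body: => Unit): Unit = { body; body }

  def main(args: Array[String]): Unit = {
    counter += 1
    println(greeting)
    println(area(width = 3.0))   // named argument, default height
    twice(println("hi"))
    println(expensive)
  }
}
```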
Classes:
- Introduction
- Inheritance
- Primary and Auxiliary Constructors
- Private Constructors
- Uniform Access
- Case Classes
- Objects
- Traits
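A compact sketch of the class-related topics above (illustrative names):

```scala
// A trait can declare abstract members and provide concrete methods.
trait Greeter {
  def name: String
  def greet(): String = s"Hello, $name"
}

// Primary constructor in the class signature; an auxiliary constructor below.
class Person(val name: String, val age: Int) extends Greeter {
  def this(name: String) = this(name, 0)
}

// Case classes get equals/hashCode/toString/copy and pattern matching for free.
case class Point(x: Int, y: Int)

object ClassesDemo extends App {
  val p = new Person("Ada")
  println(p.greet())                 // Hello, Ada
  val moved = Point(1, 2).copy(x = 5)
  println(moved)                     // Point(5,2)
}
```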
Day 04
Scala (continued)
Collections:
- Lists
- Collection Manipulation
- Simple Methods
- Methods With Functions
- Use Cases With Common Methods
- Tuples
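A short sketch of the collection topics above (illustrative values):

```scala
object CollectionsDemo extends App {
  val xs = List(3, 1, 4, 1, 5, 9)

  // Methods taking functions: map, filter, reduce.
  val doubled = xs.map(_ * 2)
  val evens   = xs.filter(_ % 2 == 0)
  val sum     = xs.reduce(_ + _)

  // Tuples group values of different types without a dedicated class.
  val pair: (String, Int) = ("answer", 42)

  println(doubled)                          // List(6, 2, 8, 2, 10, 18)
  println(evens)                            // List(4)
  println(s"$sum ${pair._1} ${pair._2}")    // 23 answer 42
}
```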
Types:
- Type parameterization
- Covariance
- Contravariance
- Type Upper Bounds
- ‘Nothing’ Type
Options:
- Option Implementation
- Like Lists
- Practice Application
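A short sketch of Option used like a zero-or-one-element collection (the lookup data is illustrative):

```scala
object OptionsDemo extends App {
  def lookup(id: Int): Option[String] =
    Map(1 -> "alice", 2 -> "bob").get(id)   // Map#get already returns an Option

  // map/filter/getOrElse work on Option just as they do on List.
  val found   = lookup(1).map(_.toUpperCase).getOrElse("unknown")
  val missing = lookup(9).map(_.toUpperCase).getOrElse("unknown")

  println(found)     // ALICE
  println(missing)   // unknown
}
```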
Anonymous Classes:
- Introduction
- Structural Typing
- Anonymous Classes With Structural Typing
Special Methods:
- Apply
- Update
Closure and functions
Currying:
- Introduction
- Applications
Implicits:
- Implicit Values/Parameters
- Implicit Conversions
- With Anonymous Classes
- Implicit Classes
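A short sketch of implicit values, implicit parameters, and implicit classes (names are illustrative):

```scala
object ImplicitsDemo extends App {
  // Implicit value picked up by the implicit parameter list below.
  implicit val defaultGreeting: String = "Hello"
  def greet(name: String)(implicit greeting: String): String = s"$greeting, $name"

  // An implicit class adds a method to an existing type ("extension method").
  implicit class IntTimes(n: Int) {
    def times(body: => Unit): Unit = (1 to n).foreach(_ => body)
  }

  println(greet("Scala"))    // Hello, Scala
  2.times(println("tick"))   // prints "tick" twice
}
```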
For Loops:
- Introduction
- Coding Style
- With Options
- And flatMap
- Guards
- Definitions
Var Args:
- Introduction
- Ascribing the _* type
Partial Functions:
- Introduction
- Match
- Match Values/Constants
- Match Types
- Extractors
- If Conditions
- Or
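A short sketch of match expressions and partial functions covering several of the cases above (types and values are illustrative):

```scala
object MatchDemo extends App {
  case class User(name: String, age: Int)

  def describe(x: Any): String = x match {
    case 0                   => "zero"                 // constant pattern
    case n: Int if n > 0     => s"positive int $n"     // type pattern with a guard
    case User(name, _)       => s"user $name"          // extractor pattern
    case _: String | _: Char => "text"                 // or-pattern
    case _                   => "something else"
  }

  // A partial function is defined only for some inputs; collect applies it
  // where it is defined and skips the rest.
  val evenLabels: PartialFunction[Int, String] = { case n if n % 2 == 0 => s"$n is even" }

  println(List(1, 2, 3, 4).collect(evenLabels))   // List(2 is even, 4 is even)
  println(describe(User("Ada", 36)))              // user Ada
  println(describe(-1))                           // something else
}
```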
Working with XML & JSON
Performance tuning guidelines
Packing and deployment
Introduction of Spark
- Evolution of distributed systems
- Why we need a new generation of distributed systems
- Limitations of MapReduce in Hadoop
- Understanding the need for batch vs. real-time analytics
- Batch analytics (Hadoop ecosystem overview) and real-time analytics options
- Introduction to stream and in-memory analysis
- What is Spark?
- A Brief History: Spark
Using Scala to Create Spark Applications
- Invoking Spark Shell
- Creating the SparkContext
- Loading a File in Shell
- Performing Some Basic Operations on Files in Spark Shell
- Building a Spark Project with sbt
- Running a Spark Project with sbt
- Caching Overview
- Distributed Persistence
- Spark Streaming Overview
- Example: Streaming Word Count
- Testing Tips in Scala
- Performance Tuning Tips in Spark
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
Hands On:
- Installing Spark
- Installing SBT and maven for building the project
- Writing code to convert HDFS data into an RDD
- Writing code to perform different transformations and actions
- Understanding the tasks and stages of a Spark job
- Writing code that uses different storage levels and caching
- Creating and using broadcast variables and accumulators (see the sketch after this list)
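A minimal Spark core sketch in the spirit of the exercises above, covering an RDD built from HDFS, transformations and actions, an explicit storage level, a broadcast variable, and an accumulator. The application name and HDFS paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo"))

    // Transformation pipeline over an HDFS file (path is a placeholder).
    val lines = sc.textFile("hdfs:///user/student/input/sample.txt")
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

    // Broadcast a small lookup set; count skipped words with an accumulator
    // (accumulator updates inside transformations may be re-applied on task retry).
    val stopWords = sc.broadcast(Set("the", "a", "an"))
    val skipped   = sc.longAccumulator("skipped")

    val counts = words
      .filter { w => val keep = !stopWords.value.contains(w); if (!keep) skipped.add(1); keep }
      .map((_, 1))
      .reduceByKey(_ + _)
      .persist(StorageLevel.MEMORY_ONLY)   // explicit storage level instead of cache()

    println(s"distinct words: ${counts.count()}, stop words skipped: ${skipped.value}")
    counts.saveAsTextFile("hdfs:///user/student/output/wordcount")
    sc.stop()
  }
}
```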
Running SQL queries using Spark SQL
- Starting Point: SQLContext
- Creating DataFrames
- DataFrame Operations
- Running SQL Queries Programmatically
- Interoperating with RDDs
- Inferring the Schema Using Reflection
- Data Sources
- Generic Load/Save Functions
- Save Modes
- Saving to Persistent Tables
- Parquet Files
- Loading Data Programmatically
- Partition Discovery
- Schema Merging
- JSON Datasets
- Hive Tables
- JDBC To Other Databases
- HBase Integration
- Reading Solr results as a DataFrame
- Troubleshooting
- Performance Tuning
- Caching Data In Memory
- Compatibility with Apache Hive
- Unsupported Hive Functionality
Hands On:
- Writing code to create SparkContext, HiveContext, and HBaseContext objects
- Writing code to run Hive queries using Spark SQL
- Writing code to load and transform text file data and convert it into a DataFrame
- Writing code to read and store JSON files as DataFrames
- Writing code to read and store Parquet files as DataFrames
- Reading and writing data to an RDBMS (for example, MySQL) using Spark SQL
- Caching DataFrames
- Java code to read Solr results as a DataFrame (a combined Spark SQL sketch follows this list)
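A combined sketch of several of the exercises above, using the Spark 2.x SparkSession entry point (the outline also mentions SQLContext, which works similarly). All paths, the JDBC URL, table name, and credentials are placeholders.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sparksql-demo")
      .enableHiveSupport()        // lets spark.sql() see Hive tables via the metastore
      .getOrCreate()

    // Read JSON into a DataFrame, query it with SQL, and cache it in memory.
    val people = spark.read.json("hdfs:///user/student/people.json")
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.cache()
    adults.show()

    // Write the result as Parquet, and also to an RDBMS over JDBC.
    adults.write.mode(SaveMode.Overwrite).parquet("hdfs:///user/student/adults.parquet")
    adults.write.mode(SaveMode.Append)
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/demo")
      .option("dbtable", "adults")
      .option("user", "demo").option("password", "secret")
      .save()

    spark.stop()
  }
}
```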
Day 05
Spark Streaming
- Micro batch
- Discretized Streams (DStreams)
- Input DStreams and Receivers
- DStream to RDD
- Basic Sources
- Advanced Sources
- Transformations on DStreams
- Output Operations on DStreams
- Design Patterns for using foreachRDD
- DataFrame and SQL Operations
- Checkpointing
- Socket stream
- File Stream
- Stateful operations
- How stateful operations work
- Window Operations
- Join Operations
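A minimal Spark Streaming sketch touching several of the topics above: a socket source, DStream transformations, a window operation, and checkpointing. The host, port, checkpoint path, and intervals are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo")
    val ssc  = new StreamingContext(conf, Seconds(5))       // 5-second micro-batches
    ssc.checkpoint("hdfs:///user/student/checkpoints")      // needed for stateful/window ops

    // Basic source: a TCP socket (e.g. feed it with `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

    // Windowed word count: 30-second window, sliding every 10 seconds.
    val counts = words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```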
Tuning Spark
- Data Serialization
- Memory Tuning
- Determining Memory Consumption
- Tuning Data Structures
- Serialized RDD Storage
- Garbage Collection Tuning
- Other Considerations
- Level of Parallelism
- Memory Usage of Reduce Tasks
- Broadcasting Large Variables
- Data Locality
- Summary
Spark ML Programming
- Data types
- Classification and regression
- Collaborative filtering
- Alternating least squares (ALS)
Hands On:
- Writing code to process Flume data using Spark Streaming
- Writing code to process network data using Spark Streaming
- Writing code to process Kafka data using Spark Streaming
- Writing code for SVMs, logistic regression, and linear regression (see the sketch after this list)
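A minimal spark.ml classification sketch for the regression/classification item above, fitting a logistic regression model on a tiny, made-up in-memory dataset (labels and feature values are purely illustrative).

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-demo").getOrCreate()
    import spark.implicits._

    // A tiny, made-up training set: a label plus a two-element feature vector.
    val training = Seq(
      (1.0, Vectors.dense(2.0, 30.0)),
      (0.0, Vectors.dense(1.0, 12.0)),
      (1.0, Vectors.dense(3.0, 45.0)),
      (0.0, Vectors.dense(0.5, 10.0))
    ).toDF("label", "features")

    // Fit the model and show predictions on the training data.
    val model = new LogisticRegression().setMaxIter(10).fit(training)
    model.transform(training).select("features", "label", "prediction").show()

    spark.stop()
  }
}
```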
Data Loading: Here we will learn the different data loading options available in Hadoop and look in detail at Flume and Sqoop to demonstrate how to bring various kinds of data, such as web server logs, streaming data, RDBMS tables, and Twitter tweets, into HDFS.
Flume and Sqoop
Learning Objectives – In this class, you will understand how multiple Hadoop ecosystem components work together in a Hadoop implementation to solve Big Data problems. We will discuss multiple data sets and the specifications of the project.
Kafka
- Introduction
- Basic Kafka Concepts
- Kafka vs Other Messaging Systems
- Intra-Cluster Replication
- An Inside Look at Kafka’s Components
- Cluster Administration
- Using Kafka Connect to Move Data
Hands On:
- Using Flume to capture and transport network data
- Using Flume to capture and transport web server log data
- Using Flume to capture and transport Twitter data
- Creating a Kafka topic and configuring its replication factor and number of partitions
- Loading data into a Kafka topic
- Reading data from a Kafka topic (a producer/consumer sketch follows this list)
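A minimal Kafka producer and consumer sketch for the last two exercises, using the standard Kafka Java client from Scala. The broker address, topic name, and consumer group id are placeholders; the topic itself is assumed to exist already (created with the desired partitions and replication factor). The long-based poll() shown here is the older API; newer clients take a Duration.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaDemo {
  val broker = "localhost:9092"   // placeholder broker address
  val topic  = "demo-topic"       // placeholder topic name

  // Produce one string record to the topic.
  def produce(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", broker)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord(topic, "key-1", "hello kafka"))
    producer.close()
  }

  // Read records from the beginning of the topic and print them.
  def consume(): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", broker)
    props.put("group.id", "demo-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList(topic))
    val records = consumer.poll(1000L)
    records.asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()
  }

  def main(args: Array[String]): Unit = { produce(); consume() }
}
```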