Hadoop, Spark and Scala Training

Hadoop is a framework that allows distributed processing of large data sets across clusters of computers using simple programming models. Spark is a fast, general-purpose cluster computing framework written in the Scala programming language.

Created by

Stalwart Learning

Duration

3 Days

Location

https://stalwartlearning.com



Course Description

Overview of Hadoop, Spark and Scala

Hadoop is a framework that allows distributed processing of large data sets across clusters of computers using simple programming models.

Spark is a fast, general-purpose cluster computing framework, while Scala is the programming language in which Spark is written.

  • During this training, participants will
    • Learn about Hadoop and the traditional models it replaces
    • Understand the HDFS architecture
    • Understand MapReduce
    • Learn about Impala and Hive
    • Understand RDD lineage
    • Understand Pig

Duration

3 Days

Prerequisites for Hadoop, Spark and Scala

  • Familiarity with Java
  • Intermediate-level exposure to data analytics

Course Outline for Hadoop, Spark and Scala

Lesson 1
  • Traditional Models
  • Problems with Traditional Large-Scale Systems
  • Understanding Hadoop
  • The Hadoop Ecosystem
Lesson 2
  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS (see the sketch below)
  • Resource Management: YARN
  • Resource Management: YARN Architecture
  • Resource Management: Using YARN
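
As a first taste of the "Using HDFS" topic, here is a minimal Scala sketch that lists a directory through Hadoop's FileSystem API. The /user/training path is a placeholder, and the cluster address is assumed to come from a core-site.xml on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsListing {
  def main(args: Array[String]): Unit = {
    // fs.defaultFS (the NameNode address) is read from core-site.xml.
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // List one directory; the path is a placeholder for this sketch.
    fs.listStatus(new Path("/user/training"))
      .foreach(s => println(s"${s.getLen}\t${s.getPath}"))

    fs.close()
  }
}
```
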
Lesson 3
  • MapReduce (see the sketch below)
  • Characteristics of MapReduce
  • Advanced MapReduce
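
Real MapReduce jobs are written against the Hadoop API, but the map → shuffle → reduce flow itself can be shown with plain Scala collections. The word-count sketch below is purely conceptual, not a runnable Hadoop job.

```scala
object MapReduceConcept {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data on hadoop", "spark runs on hadoop")

    // Map phase: emit a (word, 1) pair for every word.
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle phase: group all pairs that share a key.
    val shuffled = mapped.groupBy { case (word, _) => word }

    // Reduce phase: sum the counts for each key.
    val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

    reduced.foreach(println)  // e.g. (hadoop,2), (on,2), (spark,1), ...
  }
}
```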


Lesson 4
  • Introducing Impala and Hive
  • Importance of Impala and Hive
  • Differences Between Impala and Hive
  • How Impala and Hive Work
  • Hive and Traditional Databases
Lesson 5
  • Understanding the Metastore
  • Creating Databases and Tables in Hive and Impala
  • Loading Data into Hive and Impala Tables (see the sketch below)
  • Understanding HCatalog
  • Impala on the Cluster
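
Lesson 5 itself works in the Hive and Impala shells; as one hedged way to run the same DDL and LOAD DATA statements programmatically, the Scala sketch below goes through Spark with Hive support enabled. The training database, customers table, and HDFS path are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object HiveTables {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() lets Spark share the Hive metastore.
    val spark = SparkSession.builder()
      .appName("HiveTables")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS training")
    spark.sql(
      """CREATE TABLE IF NOT EXISTS training.customers
        |(id INT, name STRING, city STRING)
        |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""".stripMargin)

    // Move a file already sitting in HDFS into the table's directory.
    spark.sql("LOAD DATA INPATH '/user/training/customers.csv' INTO TABLE training.customers")

    spark.sql("SELECT city, COUNT(*) AS n FROM training.customers GROUP BY city").show()
    spark.stop()
  }
}
```
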
Lesson 6
  • Various File Formats
  • Hadoop Tool Support for File Formats
  • Understanding Avro Schemas (see the sketch below)
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
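
Avro schemas are plain JSON documents. This small sketch parses a made-up Customer schema with the Avro library; giving the optional city field a null default is the usual habit that keeps later schema evolution backward compatible.

```scala
import org.apache.avro.Schema

object AvroSchemaDemo {
  def main(args: Array[String]): Unit = {
    // A made-up record schema, written inline as JSON.
    val json =
      """{
        |  "type": "record",
        |  "name": "Customer",
        |  "fields": [
        |    {"name": "id",   "type": "int"},
        |    {"name": "name", "type": "string"},
        |    {"name": "city", "type": ["null", "string"], "default": null}
        |  ]
        |}""".stripMargin

    val schema = new Schema.Parser().parse(json)
    println(schema.getName)                          // Customer
    schema.getFields.forEach(f => println(f.name())) // field names
  }
}
```
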
Lesson 7
  • Overview of Data File Partitioning
  • Partitioning in Impala and Hive (see the sketch below)
  • Using Partitions
  • Bucketing in Hive
  • Advanced Hive Concepts
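
To make the partitioning idea concrete, here is a hedged Spark-with-Hive sketch: PARTITIONED BY stores each country value in its own HDFS subdirectory, so a query that filters on country scans only the matching directories. The table and column names are invented.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionedTable")
      .enableHiveSupport()
      .getOrCreate()

    // One subdirectory per country value, e.g. .../orders/country=IN/
    spark.sql(
      """CREATE TABLE IF NOT EXISTS training.orders
        |(id INT, amount DOUBLE)
        |PARTITIONED BY (country STRING)""".stripMargin)

    // Partition pruning: only the country='IN' directory is scanned.
    spark.sql("SELECT COUNT(*) FROM training.orders WHERE country = 'IN'").show()
    spark.stop()
  }
}
```
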
Lesson 8
  • Overview of Sqoop
  • Basic Imports and Exports in Sqoop
  • Improving Sqoop's Performance
  • Limitations of Sqoop
  • Understanding Sqoop 2
  • Understanding Apache Flume
  • Basic Flume Architecture
  • Understanding Flume Sources
  • Understanding Flume Sinks
  • Understanding Flume Channels
  • Configuring Flume
  • Understanding HBase (see the sketch below)
  • HBase Architecture
  • Data Storage in HBase
  • Comparing HBase and RDBMSs
  • Using HBase
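
Sqoop and Flume are driven from the command line and configuration files, but HBase exposes a JVM client API. The Scala sketch below writes and reads a single cell; it assumes a pre-created customers table with an info column family, and an hbase-site.xml on the classpath.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseCell {
  def main(args: Array[String]): Unit = {
    val conf  = HBaseConfiguration.create()            // reads hbase-site.xml
    val conn  = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("customers"))

    // Write one cell: row key "row1", column info:city.
    val put = new Put(Bytes.toBytes("row1"))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Bengaluru"))
    table.put(put)

    // Read the same cell back by row key.
    val result = table.get(new Get(Bytes.toBytes("row1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))))

    table.close()
    conn.close()
  }
}
```
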
Lesson 9
  • Understanding Pig
  • Components of Pig
  • Comparing Pig and SQL
  • Using Pig
Lesson 10
  • Understanding Apache Spark
  • What Is the Spark Shell?
  • Understanding RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark (see the sketch below)
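
The classic first Spark program ties these topics together: transformations are built lazily on RDDs by passing functions, and nothing executes until an action runs. The HDFS paths below are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // Each step returns a new RDD; the job only runs at the action below.
    val counts = sc.textFile("hdfs:///user/training/input")    // placeholder path
      .flatMap(_.split("\\s+"))        // functional style: functions passed in
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///user/training/wordcounts")  // action
    sc.stop()
  }
}
```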
Lesson 11
  • Exploring RDDs
  • Key-Value Pair RDDs
  • Other Pair RDD Operations (see the sketch below)
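
A short sketch of the key-value operations this lesson covers, using two tiny made-up datasets keyed by user id: reduceByKey aggregates per key, and join lines up matching keys across RDDs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PairRddOps"))

    // Made-up (userId, pageViews) and (userId, name) datasets.
    val views = sc.parallelize(Seq((1, 10), (2, 3), (1, 5)))
    val names = sc.parallelize(Seq((1, "asha"), (2, "ravi")))

    val totals = views.reduceByKey(_ + _)   // (1,15), (2,3)
    val joined = totals.join(names)         // (1,(15,asha)), (2,(3,ravi))

    joined.sortByKey().collect().foreach(println)
    sc.stop()
  }
}
```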

Lesson 12
  • Comparing Spark Applications and the Spark Shell
  • Building the SparkContext (see the sketch below)
  • Creating a Spark Application (Scala and Java)
  • Spark on YARN: Client Mode
  • Spark on YARN: Cluster Mode
  • Dynamic Resource Allocation
  • Configuring Spark Properties
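
A minimal standalone application, as opposed to the shell where sc already exists: the driver builds its own SparkContext and can set Spark properties in code. The app name and memory setting are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // In the Spark shell, sc is pre-built; an application builds its own.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.executor.memory", "2g")  // one way to configure properties

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).sum())
    sc.stop()
  }
}
```

Packaged as a jar, this would typically be launched with spark-submit, where --master yarn plus --deploy-mode client or cluster decides whether the driver runs on the submitting machine or inside the cluster.
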
Lesson 13
  • Spark on a Cluster
  • Understanding RDD Partitions
  • Partitioning of File-Based RDDs
  • HDFS and Data Locality
  • Parallel Operations on Partitions
  • Understanding Stages and Tasks
  • Controlling the Level of Parallelism (see the sketch below)
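
The level of parallelism is simply the number of partitions, and each partition becomes one task per stage. This sketch shows the three usual control points; the input path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Parallelism {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Parallelism"))

    // 1. Hint a minimum partition count when reading a file.
    val rdd = sc.textFile("hdfs:///user/training/input", 8)
    println(rdd.getNumPartitions)

    // 2. Repartition explicitly (triggers a shuffle).
    val wider = rdd.repartition(16)

    // 3. Give shuffle operations their own partition count.
    val counts = wider.map(line => (line.length, 1)).reduceByKey(_ + _, 4)
    println(counts.getNumPartitions)  // 4

    sc.stop()
  }
}
```
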
Lesson 14
  • Understanding RDD Lineage
  • Overview of Caching
  • Distributed Persistence (see the sketch below)
  • Storage Levels of RDD Persistence
  • Choosing the Correct RDD Persistence Storage Level
  • RDD Fault Tolerance
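
Persistence in one picture: without it, every action replays the whole lineage; with it, the second action reuses the cached partitions. MEMORY_AND_DISK spills what does not fit in memory, and lost partitions are rebuilt from lineage. The path is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CachingDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CachingDemo"))

    val parsed = sc.textFile("hdfs:///user/training/input")  // placeholder
      .map(_.toLowerCase)

    // Keep the result around instead of recomputing the lineage above.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    println(parsed.count())                               // materialises the cache
    println(parsed.filter(_.contains("spark")).count())   // reuses cached data

    sc.stop()
  }
}
```
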
Lesson 15
  • Spark Use Cases
  • Iterative Algorithms in Spark
  • Understanding Machine Learning
  • Graph Processing and Analysis
  • Example: k-means (see the sketch below)
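
k-means is the lesson's example of an iterative algorithm, which is exactly where Spark's in-memory caching shines. A hedged sketch using Spark MLlib on a toy 2-D dataset:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansDemo"))

    // A toy dataset with two obvious clusters.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.5, 0.3),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 8.7)
    )).cache()  // iterative algorithms re-read the data, so cache it

    // k = 2 clusters, at most 20 iterations.
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```
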
Lesson 16
  • Spark SQL and the SQL Context
  • Creating DataFrames (see the sketch below)
  • Transforming and Querying DataFrames
  • Impala vs. Spark SQL
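
DataFrames and SQL are two views of the same engine, which is the point of this lesson. The sketch below builds a small made-up DataFrame and queries it both ways, using the modern SparkSession entry point rather than the older SQLContext; spark.read.json or .parquet would create DataFrames from files in the same spirit.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameDemo").getOrCreate()
    import spark.implicits._

    // A made-up in-memory DataFrame.
    val people = Seq(("asha", 34), ("ravi", 28), ("meera", 41)).toDF("name", "age")

    // 1. The DataFrame API.
    people.filter($"age" > 30).select("name").show()

    // 2. The same query through SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```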
