Overview of Hadoop, Spark and Scala
Hadoop is a framework that allows distributed processing of large data sets across a cluster of computers using simple programming models.
Spark is a fast, general-purpose cluster computing framework, and Scala is the programming language in which Spark is written.
During this training, participants will:
- Learn about traditional models and Hadoop
- Understand HDFS Architecture
- Understand MapReduce
- Learn about Impala and Hive
- Understand RDD lineage
- Understand Pig
Duration
3 Days
Prerequisites for Hadoop, Spark and Scala
- Familiarity with Java
- Intermediate-level exposure to data analytics
Course Outline for Hadoop, Spark and Scala
Lesson 1
- Traditional-models
- Problems: Traditional Large-Scale-Systems
- Understanding of Hadoop
- Hadoop-Eco-System
Lesson 2
- Distributed Processing: On a Cluster
- Storage: HDFS-Architecture
- Storage: Using-HDFS
- Resource-Management: YARN
- Resource-Management: YARN-Architecture
- Resource-Management: Using YARN
Lesson 3
- MapReduce
- Characteristics of MapReduce
- Advanced MapReduce
Lesson 4
- Introducing Impala/Hive
- Importance of Impala/Hive
- Difference: Impala/Hive
- How Impala/Hive Work
- Hive & Traditional-Database
Lesson 5
- Understanding Meta-store
- Creating: Databases & Tables in Hive & Impala
- Loading Data into Tables of Hive & Impala
- Understanding HCatalog
- Impala: cluster
Lesson 6
- Various File Formats
- Hadoop Tool Support: File Formats
- Understanding Avro Schemas
- Understanding Avro with Hive/Sqoop
- Evolution: Avro Schema
Lesson 7
- Overview: Data File Partitioning
- Partitioning: Impala/Hive
- Using Partition
- Bucketing: Hive
- Advanced Concepts: Hive
Lesson 8
- Overview of Sqoop
- Basic: Imports & Exports
- Improving Sqoop Performance
- Limitations: Sqoop
- Understanding Sqoop 2
- Understanding Apache Flume
- Basic: Flume Architecture
- Understanding Flume-Sources
- Understanding Flume-Sinks
- Understanding Flume-Channels
- Configuration of Flume
- Understanding HBase
- Architecture: HBase
- Data storage: HBase
- Comparing HBase & RDBMS
- Using HBase
Lesson 9
- Understanding Pig
- Components: Pig
- Comparing Pig & SQL
- Using Pig
Lesson 10
- Understanding Apache Spark
- What is Spark Shell
- Understanding RDDs (Resilient Distributed Datasets)
- Functional Programming: Spark
Lesson 11
- Exploring RDD
- Other Pair RDD Operations
- Key-Value Pair RDDs
Lesson 12
- Comparing Spark Applications/Spark Shell
- Build Spark Context
- Creating: Spark-Application (Scala and Java)
- Spark on YARN: Client-Mode
- Spark on YARN: Cluster-Mode
- Dynamic-Resource-Allocation
- Configuration: Spark-Properties
Lesson 13
- Spark: Cluster
- Understanding RDD-Partitions
- Partitioning: File-based RDDs
- HDFS & Data-Locality
- Parallel Operations: Partitions
- Understanding Stages & Tasks
- Controlling: Level of Parallelism
Lesson 14
- Understanding RDD Lineage
- Overview of Caching
- Distributed-Persistence
- Storage Levels of RDD Persistence
- Choosing the Correct RDD Persistence Storage Level
- RDD: Fault tolerance
Lesson 15
- Use Cases: Spark
- Iterative Algorithms: Spark
- Understanding Machine Learning
- Graph Processing & Analysis
- Example: k-means
Lesson 16
- Spark SQL & the SQL Context
- Creation of Data-Frames
- Transforming/Querying of Data-Frames
- Impala vs. Spark SQL