
Hadoop, Spark and Scala

Hadoop is a framework that enables distributed processing of large data sets across clusters of computers using simple programming models.

Spark is a fast, general-purpose cluster-computing framework, while Scala is the programming language in which Spark is written.

  • Objectives of the session:
    • Learn about traditional models and Hadoop
    • Understand the HDFS architecture
    • Understand MapReduce
    • Learn about Impala and Hive
    • Understand RDD lineage
    • Understand Pig

Duration: 3 days

  • Familiarity with Java
  • Intermediate-level exposure to data analytics
  • Traditional Models
  • Problems with Traditional Large-Scale Systems
  • Understanding Hadoop
  • The Hadoop Ecosystem
  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS
  • Resource Management: YARN
  • Resource Management: YARN Architecture
  • Resource Management: Using YARN
  • MapReduce
  • Characteristics of MapReduce
  • Advanced MapReduce
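The MapReduce topics above can be illustrated with a plain-Python word count that mimics the three phases of the model — map, shuffle, and reduce. This is a conceptual sketch only, not Hadoop API code:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In real Hadoop the map and reduce functions run on different nodes of the cluster, and the shuffle moves data between them over the network; the logic per record, however, is exactly this simple.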

Sqoop overview: basic imports and exports in Sqoop, improving Sqoop's performance, limitations of Sqoop, and Sqoop 2.

  • Introducing Impala and Hive
  • Importance of Impala and Hive
  • Differences Between Impala and Hive
  • How Impala and Hive Work
  • Hive and Traditional Databases
  • Understanding the Metastore
  • Creating Databases and Tables in Hive and Impala
  • Loading Data into Hive and Impala Tables
  • Understanding HCatalog
  • Impala on a Cluster
  • Various File Formats
  • Hadoop Tool Support for File Formats
  • Understanding Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Overview of Data File Partitioning
  • Partitioning in Impala and Hive
  • Using Partitions
  • Bucketing in Hive
  • Advanced Hive Concepts
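The partitioning and bucketing topics above can be sketched in plain Python: a partition column maps each row to a directory (so queries filtering on it skip whole directories), while bucketing hashes a key into a fixed number of files. This is a conceptual sketch; Hive's actual hash function and file layout differ:

```python
NUM_BUCKETS = 4

def partition_path(table, row):
    """Hive-style partitioning: the partition column value becomes a
    directory name under the table's storage location."""
    return f"{table}/country={row['country']}/"

def bucket_id(key, num_buckets=NUM_BUCKETS):
    """Hive-style bucketing: hash the bucketing key into a fixed number
    of buckets, which helps sampling and map-side joins."""
    return hash(key) % num_buckets

row = {"id": 1, "country": "IN", "name": "asha"}
print(partition_path("sales", row))   # sales/country=IN/
print(bucket_id(row["id"]))           # an integer in [0, 4)
```

A query such as `WHERE country = 'IN'` can then read only the `country=IN` directory instead of scanning the whole table — the core benefit of partition pruning.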
  • Overview of Sqoop
  • Basic Imports and Exports
  • Improving Sqoop Performance
  • Limitations of Sqoop
  • Understanding Sqoop 2
  • Understanding Apache Flume
  • Basic Flume Architecture
  • Understanding Flume Sources
  • Understanding Flume Sinks
  • Understanding Flume Channels
  • Configuring Flume
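The Flume source, sink, and channel topics above come together in a single agent configuration file. Below is the minimal single-node example from the Flume user guide, in Flume's properties format: a netcat source feeding a logger sink through an in-memory channel:

```properties
# Name the components on agent a1
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: listen for lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: log events at INFO level (useful for testing)
a1.sinks.k1.type = logger

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

In production the logger sink would typically be replaced by an HDFS sink, and the memory channel by a file channel when durability matters more than throughput.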
  • Understanding HBase
  • HBase Architecture
  • Data Storage in HBase
  • Comparing HBase and RDBMS
  • Using HBase
  • Understanding Pig
  • Components of Pig
  • Comparing Pig and SQL
  • Using Pig
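The HBase-vs-RDBMS comparison above largely comes down to the data model: instead of fixed columns, HBase stores each table as a sorted map from row key to versioned column-family:qualifier cells. A rough plain-Python sketch of that nested-map model (illustrative only, not the HBase client API):

```python
from collections import defaultdict

# row_key -> {"family:qualifier": {timestamp: value}}
table = defaultdict(lambda: defaultdict(dict))

def put(row, column, value, ts):
    """Write one cell version; old versions are kept, keyed by timestamp."""
    table[row][column][ts] = value

def get(row, column):
    """Return the most recent version of a cell, as HBase does by default."""
    versions = table[row][column]
    return versions[max(versions)] if versions else None

put("row1", "info:name", "asha", ts=1)
put("row1", "info:name", "arun", ts=2)   # newer version of the same cell
print(get("row1", "info:name"))  # arun
```

Rows in different column families can hold entirely different qualifiers, which is why HBase is described as sparse and schemaless compared with a relational table.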
  • Understanding Apache Spark
  • What Is the Spark Shell?
  • Understanding RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark
  • Exploring RDDs
  • Other Pair RDD Operations

Key-Value Pair RDDs
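A key-value (pair) RDD holds two-element tuples, and operations such as reduceByKey combine the values that share a key. A plain-Python simulation of reduceByKey's semantics (illustrative only; the real operation runs distributed across partitions):

```python
def reduce_by_key(pairs, fn):
    """Simulate Spark's reduceByKey: merge all values sharing a key with fn."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # {'a': 4, 'b': 2}
```

Because the merge function must be associative and commutative, Spark can apply it within each partition first and only then combine partial results across the network, which is what makes reduceByKey cheaper than a naive group-then-reduce.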

  • Comparing Spark Applications and the Spark Shell
  • Building a SparkContext
  • Creating a Spark Application (Scala and Java)
  • Spark on YARN: Client Mode
  • Spark on YARN: Cluster Mode
  • Dynamic Resource Allocation
  • Configuring Spark Properties
  • Spark on a Cluster
  • Understanding RDD Partitions
  • Partitioning of File-Based RDDs
  • HDFS and Data Locality
  • Parallel Operations on Partitions
  • Understanding Stages and Tasks
  • Controlling the Level of Parallelism
  • Understanding RDD Lineage
  • Overview of Caching
  • Distributed Persistence
  • Storage Levels of RDD Persistence
  • Choosing the Correct RDD Persistence Storage Level
  • RDD Fault Tolerance
  • Use Cases for Spark
  • Iterative Algorithms in Spark
  • Understanding Machine Learning
  • Graph Processing and Analysis
  • Example: k-means
  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Impala vs. Spark SQL
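The iterative-algorithm and k-means topics above are a natural fit for Spark because every pass reuses the same cached dataset. The algorithm itself can be sketched in plain Python (1-D points, fixed iteration count, no Spark dependency):

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D points: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans(points, centers=[0.0, 10.0]))  # [1.0, 9.0]
```

In Spark the assignment step would be a map over a persisted RDD or DataFrame of points and the re-centering a reduceByKey, so caching the input once avoids re-reading it from HDFS on every iteration — the use case RDD persistence was designed for.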