Overview of Spark
The Spark training provides students with a solid technical introduction to the Spark architecture and how Spark works. Participants learn the basic building blocks of Spark, including RDDs and the distributed compute engine, as well as higher-level concepts that provide a simpler and more capable interface, including Spark SQL and DataFrames.
The course also covers more advanced skills, such as using Spark Streaming to process streaming data, and gives an overview of Spark graph processing (GraphX and GraphFrames) and Spark machine learning (SparkML Pipelines). Lastly, participants explore possible performance issues, troubleshooting, cluster deployment techniques, and optimization strategies.
All students will:
- Understand the need for Spark in data processing and how the Spark architecture distributes computations to cluster nodes
- Be familiar with the basic installation, setup, and layout of Spark
- Use Spark for interactive and ad-hoc operations
- Use Datasets, DataFrames, and Spark SQL to efficiently process structured data
- Understand the basics of RDDs (Resilient Distributed Datasets), data partitioning, pipelining, and computations
- Understand Spark’s data caching and its usage
- Understand performance implications and optimizations when using Spark
- Be familiar with Spark graph processing and SparkML machine learning
Duration
2 Days
Prerequisites for Spark
- Fundamental knowledge of any programming language and a basic understanding of databases and SQL or another database query language
- A working knowledge of Linux- or Unix-based systems is helpful but not mandatory
Course Outline for Spark
Introduction to Spark
- Overview, Motivations, Spark Systems
- Spark Ecosystem
- Spark vs. Hadoop
- Typical Spark Deployment and Usage Environments
RDDs and Spark Architecture
- RDD Concepts, Partitions, Lifecycle, Lazy Evaluation
- Working with RDDs: Creating and Transforming (map, filter, etc.)
- Caching – Concepts, Storage Type, Guidelines
Datasets/DataFrames and Spark SQL
- Introduction and Usage
- Creating and Using a Dataset
- Working with JSON
- Using the Dataset DSL
- Using SQL with Spark
- Data Formats
- Optimizations: Catalyst and Tungsten
- Datasets vs. DataFrames vs. RDDs
Creating Spark Applications
- Overview, Basic Driver Code, SparkConf
- Creating and Using a SparkContext/SparkSession
- Building and Running Applications
- Application Lifecycle
- Cluster Managers
- Logging and Debugging
Spark Streaming
- Overview and Streaming Basics
- Structured Streaming
- DStreams (Discretized Streams)
- Architecture, Stateless, Stateful, and Windowed Transformations
- Spark Streaming API
- Programming and Transformations
Performance Characteristics and Tuning
- The Spark UI
- Narrow vs. Wide Dependencies
- Minimizing Data Processing and Shuffling
- Caching – Concepts, Storage Type, Guidelines
- Using Caching
- Using Broadcast Variables and Accumulators
Spark GraphX Overview
- Introduction
- Constructing Simple Graphs
- GraphX API
- Shortest Path Example
MLlib Overview
- Introduction
- Feature Vectors
- Clustering / Grouping, K-Means
- Recommendations
- Classification