Spark Training

The Spark training provides students with a solid technical introduction to the Spark architecture and how Spark works. Participants learn the basic building blocks of Spark, including RDDs and the distributed compute engine, as well as the higher-level abstractions that provide a simpler and more capable interface, including Spark SQL and DataFrames.

This course also covers more advanced skills, such as using Spark Streaming to process streaming data, and provides an overview of Spark graph processing (GraphX and GraphFrames) and Spark machine learning (SparkML Pipelines). Lastly, participants explore possible performance issues, troubleshooting, cluster deployment techniques, and optimization strategies.

Objectives

All students will:

  • Understand the need for Spark in data processing, and how the Spark architecture distributes computations across cluster nodes
  • Be familiar with the basic installation, setup, and layout of Spark
  • Use Spark for interactive and ad-hoc operations
  • Use Datasets, DataFrames, and Spark SQL to process structured data efficiently
  • Understand the basics of RDDs (Resilient Distributed Datasets), data partitioning, pipelining, and computations
  • Understand Spark’s data caching and how to use it
  • Understand performance implications and optimizations when using Spark
  • Be familiar with Spark graph processing (GraphX) and machine learning with SparkML

Duration

2 Days

Prerequisites

  • Fundamental knowledge of any programming language
  • Basic understanding of databases, SQL, and query languages for databases
  • Working knowledge of Linux- or Unix-based systems is helpful, but not mandatory
Course Outline

Spark Introduction

  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Typical Spark Deployment and Usage Environments
RDDs

  • RDD Concepts, Partitions, Lifecycle, Lazy Evaluation
  • Working with RDDs: Creating and Transforming (map, filter, etc.)
  • Caching – Concepts, Storage Types, Guidelines
DataSets, DataFrames, and Spark SQL

  • Introduction and Usage
  • Creating and Using a DataSet
  • Working with JSON
  • Using the DataSet DSL
  • Using SQL with Spark
  • Data Formats
  • Optimizations: Catalyst and Tungsten
  • DataSets vs. DataFrames vs. RDDs
Spark Applications

  • Overview, Basic Driver Code, SparkConf
  • Creating and Using a SparkContext/SparkSession
  • Building and Running Applications
  • Application Lifecycle
  • Cluster Managers
  • Logging and Debugging
Spark Streaming

  • Overview and Streaming Basics
  • Structured Streaming
  • DStreams (Discretized Streams)
  • Architecture; Stateless, Stateful, and Windowed Transformations
  • Spark Streaming API
  • Programming and Transformations
Performance and Tuning

  • The Spark UI
  • Narrow vs. Wide Dependencies
  • Minimizing Data Processing and Shuffling
  • Caching – Concepts, Storage Type, Guidelines
  • Using Caching
  • Using Broadcast Variables and Accumulators
Graph Processing: GraphX and GraphFrames

  • Introduction
  • Constructing Simple Graphs
  • GraphX API
  • Shortest Path Example
Machine Learning: SparkML

  • Introduction
  • Feature Vectors
  • Clustering / Grouping, K-Means
  • Recommendations
  • Classifications