Spark Training

The Spark training provides students with a solid technical introduction to the Spark architecture and how Spark works. Participants learn the basic building blocks of Spark, including RDDs and the distributed compute engine, as well as the higher-level abstractions that provide a simpler and more capable interface, including Spark SQL and DataFrames.

This course also covers more advanced skills, such as using Spark Streaming to process streaming data, and provides an overview of Spark graph processing (GraphX and GraphFrames) and Spark machine learning (SparkML Pipelines). Lastly, participants explore possible performance issues, troubleshooting, cluster deployment techniques, and optimization strategies.

Objectives

All students will:

  • Understand the need for Spark in data processing, and how the Spark architecture distributes computations across cluster nodes
  • Be familiar with the basic installation, setup, and layout of Spark
  • Use Spark for interactive and ad-hoc operations
  • Use Datasets, DataFrames, and Spark SQL to process structured data efficiently
  • Understand the basics of RDDs (Resilient Distributed Datasets), data partitioning, pipelining, and computations
  • Understand Spark’s data caching and how to use it
  • Understand performance implications and optimizations when using Spark
  • Be familiar with Spark graph processing (GraphX) and machine learning with SparkML

Duration

2 Days

Prerequisites

  • Fundamental knowledge of any programming language
  • Basic understanding of databases, SQL, and query languages for databases
  • Working knowledge of Linux- or Unix-based systems is helpful, but not mandatory
Course Outline

Spark Introduction

  • Overview, Motivations, Spark Systems
  • Spark Ecosystem
  • Spark vs. Hadoop
  • Typical Spark Deployment and Usage Environments
RDDs

  • RDD Concepts, Partitions, Lifecycle, Lazy Evaluation
  • Working with RDDs: Creating and Transforming (map, filter, etc.)
  • Caching – Concepts, Storage Types, Guidelines
DataSets, DataFrames, and Spark SQL

  • Introduction and Usage
  • Creating and Using a DataSet
  • Working with JSON
  • Using the DataSet DSL
  • Using SQL with Spark
  • Data Formats
  • Optimizations: Catalyst and Tungsten
  • DataSets vs. DataFrames vs. RDDs
Spark Applications

  • Overview, Basic Driver Code, SparkConf
  • Creating and Using a SparkContext/SparkSession
  • Building and Running Applications
  • Application Lifecycle
  • Cluster Managers
  • Logging and Debugging
Spark Streaming

  • Overview and Streaming Basics
  • Structured Streaming
  • DStreams (Discretized Streams)
  • Architecture; Stateless, Stateful, and Windowed Transformations
  • Spark Streaming API
  • Programming and Transformations
Performance and Tuning

  • The Spark UI
  • Narrow vs. Wide Dependencies
  • Minimizing Data Processing and Shuffling
  • Caching – Concepts, Storage Type, Guidelines
  • Using Caching
  • Using Broadcast Variables and Accumulators
Graph Processing: GraphX and GraphFrames

  • Introduction
  • Constructing Simple Graphs
  • GraphX API
  • Shortest Path Example
Machine Learning: SparkML

  • Introduction
  • Feature Vectors
  • Clustering / Grouping, K-Means
  • Recommendations
  • Classifications