Overview of Pentaho Data Integration
Pentaho Data Integration (PDI) is part of the Pentaho Open Source BI Suite, which includes a wide range of software to support business decision making.
Please find below the objectives of this course:
- Getting Started with PDI
- Getting Started with Transformations
- Manipulating Real World Data
- Filtering, Searching, and Performing Other Useful Operations with Data
- Transforming the Row set
- Performing Advanced Operations with Databases
- Creating Basic Task Flows (Jobs)
- Real-Time Data Integration
- Performance Tuning
- Parallelization, Clustering, and Partitioning
Duration
5 Days
Prerequisite for Pentaho Data Integration
Knowledge of ETL concepts
Course Outline for Pentaho Data Integration
Getting Started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
- Using PDI in real-world scenarios
- Loading data warehouses or data marts
- Integrating data
- Cleansing data
- Migrating information
- Exporting data
- Integrating PDI with other Pentaho tools
Demo – Installing PDI
Demo – starting and customizing Spoon (launch commands at the end of this section)
- Setting preferences in the Options window
- Storing transformations and jobs in a repository
Demo – creating a hello world transformation
- Directing the Kettle engine with transformations
- Exploring the Spoon interface
- Designing a transformation
- Running and previewing the transformation
Demo – installing MySQL on Windows
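The Spoon demos above assume a standard PDI installation. As a quick, illustrative reference (the installation path below is hypothetical), Spoon is normally launched from the data-integration folder that the PDI archive unpacks to:

    cd /opt/pentaho/data-integration    # hypothetical installation path
    ./spoon.sh                          # Linux/macOS
    Spoon.bat                           # Windows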
Getting Started with Transformations
- Designing and previewing transformations
- Demo – creating a simple transformation and getting familiar with the design process
- Understanding the Kettle rowset
- Seeing the results in the Execution Results pane
- Demo – generating a range of dates and inspecting the data as it is being created
- Adding or modifying fields by using various PDI steps
- Demo – avoiding errors while converting the estimated time from string to integer
- The error handling functionality
- Demo – configuring the error handling to see the description of the errors
- Personalizing the error handling
Manipulating Real-world Data
- Reading data from files
- Demo – reading results of football matches from files
- Reading several files at once
- Regular expressions
- Troubleshooting reading files
- Sending data to files
- Getting system information
- Running transformations from a terminal window
- PDI transformation files
- Getting data from XML files
- Kettle variables
- Calculations on groups of rows
- Numeric fields
- Filtering
- Filtering rows using the Filter rows step
- Looking up data
- The Stream lookup step
- Data cleaning
- Cleansing data with PDI
- Controlling the Flow of Data
- Splitting streams
- Copying rows
- Distributing rows
- Splitting the stream based on conditions
- PDI steps for splitting the stream based on conditions
- Merging streams
- PDI options for merging streams
- Treating invalid data by splitting and merging streams
- Treating rows with invalid data
- Transforming Your Data by Coding
- Doing simple tasks with the JavaScript step
- Using the JavaScript language in PDI
- Using transformation predefined constants
- Testing the script using the Test script button
- Reading and parsing unstructured files with JavaScript
- Looping over the dataset rows
- Doing simple tasks with the Java Class step (see the sketch after this list)
- Using the Java language in PDI
- Sending rows to the next step
- Data types equivalence
- Transforming the dataset with Java
- Avoiding coding by using purpose-built steps
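As a taste of the "Transforming Your Data by Coding" topics above, below is a minimal, illustrative sketch of the kind of code typed into a User Defined Java Class step. It follows the standard processRow() template; the input field name and the String output field greeting are hypothetical and would have to be declared in the step's Fields tab:

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        // Read one row from the incoming stream.
        Object[] r = getRow();

        // A null row means there is nothing left to process.
        if (r == null) {
            setOutputDone();
            return false;
        }

        // Make sure the output row array is large enough for the added field.
        Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());

        // "name" is a hypothetical incoming field; "greeting" is a hypothetical new field.
        String name = get(Fields.In, "name").getString(r);
        get(Fields.Out, "greeting").setValue(outputRow, "Hello, " + name);

        // Send the row to the next step.
        putRow(data.outputRowMeta, outputRow);
        return true;
    }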
Transforming the Row Set
- Converting rows to columns
- Demo – enhancing the file by converting rows to columns
- Converting row data to column data by using the Row Denormalizer
- Aggregating data with a Row Denormalizer
- Demo – aggregating football matches data with the Row Denormalizer
- Using the Row Denormalizer for aggregating data
- Normalizing data
- Demo – enhancing the matches file by normalizing the dataset
- Modifying the dataset with a Row Normalizer
- Summarizing: PDI steps that operate on sets of rows
- Generating a customized time dimension dataset by using Kettle variables
- Demo – creating the time dimension dataset
- Getting variables
- Demo – parameterizing the start and end date of the time dimension dataset
Working with Databases
- Introducing the Steel Wheels sample database
- Connecting to the Steel Wheels database
- Demo – creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Querying a database
- Demo – getting data about shipped orders
- Getting data from the database with the Table input step
- Generating a new dataset by using the SELECT statement
- Making flexible queries by using parameters
- Demo – getting orders on a range of dates using parameters
- Adding parameters to your queries
- Making flexible queries by using Kettle variables
- Demo – getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries (see the query sketch after this list)
- Sending data to a database
- Demo – loading a table with a list of manufacturers
- Inserting or updating data by using other PDI steps
- Demo – inserting new products or updating existing ones
- Demo – testing the update of existing products
- Eliminating data from a database
- Demo – deleting data about discontinued items
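As a small illustration of the querying topics above, a Table input step can combine positional ? parameters, supplied by the step selected in "Insert data from step", with Kettle variables that are substituted when "Replace variables in script" is checked. The table and column names below are only illustrative:

    SELECT ordernumber, orderdate, status
    FROM   orders                        -- illustrative table name
    WHERE  orderdate BETWEEN ? AND ?     -- ? placeholders filled from the incoming stream
      AND  status = '${ORDER_STATUS}'    -- Kettle variable replaced before execution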
Performing Advanced Operations with Databases
- Demo – populating the Jigsaw database
- Exploring the Jigsaw database model
- Doing simple lookups
- Demo – using a Database lookup step to create a list of products to buy
- Performing complex lookups
- Demo – using a Database join step to create a list of suggested products to buy
- Joining data to the stream data by using a Database join step
- Introducing dimensional modeling
- Loading dimensions with data
- Demo – loading a region dimension with a Combination lookup/update step
- Describing data with dimensions
- Loading a Type I SCD with a Combination lookup/update step
- Storing a history of changes
- Demo – keeping a history of changes in products by using the Dimension lookup/update
- Keeping an entire history of data with a Type II SCD
- Loading Type II SCDs with the Dimension lookup/update step
Creating Basic Task Flows (Jobs)
Introducing PDI jobs
Demo – creating a folder with a Kettle job
Executing processes with PDI jobs
Using Spoon to design and run jobs
Demo – creating a simple job and getting familiar with the design process
Changing the flow of execution on the basis of conditions
Looking at the results in the Execution results window
Running transformations from jobs
Demo – generating a range of dates and inspecting how things are running
Using the Transformation job entry
Receiving arguments and parameters in a job
Demo – generating a hello world file by using arguments and parameters
Using named parameters in jobs
Running jobs from a terminal window
Demo – executing the hello world job from a terminal window
Demo – calling the hello world transformation with fixed arguments and parameters
Deciding between using a command-line argument or a named parameter (see the command examples below)
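For the terminal-window demos above, the sketch below shows how Pan (transformations) and Kitchen (jobs) are typically invoked on Linux; the file names, parameter name, and argument value are hypothetical, and Windows uses pan.bat/kitchen.bat with the same options:

    # Run a transformation, passing a named parameter and one command-line argument.
    ./pan.sh -file=/home/pdi/hello_world.ktr -param:GREETING=Hello argument1

    # Run a job with a named parameter at a more detailed log level.
    ./kitchen.sh -file=/home/pdi/hello_world.kjb -param:GREETING=Hello -level=Detailed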
Creating Advanced Transformations and Jobs
Re-using part of your transformations
Demo – calculating statistics with the use of a sub-transformation
Creating a job as a process flow
Demo – generating top average scores by copying and getting rows
Using the copy/get rows mechanism to transfer data between transformations
Demo – generating custom files by executing a transformation for every input row
Executing for each row
Enhancing your processes with the use of variables
Demo – generating custom messages by setting a variable with the name of the examination file
Case Study: Developing and Implementing a Sample Data Mart
Exploring the sales data mart
Deciding the level of granularity
Loading the dimensions
Demo – loading the dimensions for the sales datamart
Extending the sales datamart model
Loading a fact table with aggregated data
Demo – loading the sales fact table by looking up dimensions
Getting the data from the source – SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for Type I SCD
Obtaining the surrogate key for Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Getting facts and dimensions together
Demo – loading the fact table using a range of dates obtained from the command line
Demo – loading the SALES star schema model
Automating the administrative tasks
Demo – automating the loading of the sales data mart
Real-Time Data Integration
- Introduction to Real-Time ETL
- Real-Time Challenges
- Requirements
- Transformation Streaming
- A Practical Example of Transformation Streaming
- Debugging
- Third-Party Software and Real-Time Integration
- Java Message Service
- Creating a JMS Connection and Session
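The JMS topics above are easier to grasp with a concrete provider. The following is a minimal, self-contained sketch using the standard javax.jms API, assuming ActiveMQ as the broker; the broker URL and queue name are illustrative:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class JmsConnectionSketch {
        public static void main(String[] args) throws Exception {
            // Assumption: an ActiveMQ broker is listening on its default port.
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();

            // A non-transacted session with automatic message acknowledgement.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // "incoming.rows" is an illustrative queue name.
            Queue queue = session.createQueue("incoming.rows");
            MessageConsumer consumer = session.createConsumer(queue);

            // Block until one message arrives, then print its text body.
            TextMessage message = (TextMessage) consumer.receive();
            System.out.println(message.getText());

            connection.close();
        }
    }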
Performance Tuning
- Transformation Performance: Finding the Weakest Link
- Finding Bottlenecks by Simplifying
- Finding Bottlenecks by Measuring
- Copying Rows of Data
- Improving Transformation Performance
- Using Lazy Conversion for Reading Text Files
- Single-File Parallel Reading, Multi-File Parallel Reading
- Configuring the NIO Block Size
- Changing Disks and Reading Text Files
- Improving Performance in Writing Text Files
- Using Lazy Conversion for Writing Text Files
- Parallel Files Writing
- Improving Database Performance
- Avoiding Dynamic SQL, Handling Roundtrips
- Handling Relational Databases
- Sorting Data, Sorting on the Database, Sorting in Parallel
- Reducing CPU Usage
- Optimizing the Use of JavaScript
- Launching Multiple Copies of a Step
- Selecting and Removing Values
- Managing Thread Priorities
- Adding Static Data to Rows of Data
- Limiting the Number of Step Copies
- Avoiding Excessive Logging
- Improving Job Performance
- Loops in Jobs
- Database Connection Pools
Parallelization, Clustering, and Partitioning
- Multi-Threading
- Row Distribution, Row Merging, and Row Redistribution
- Data Pipelining
- Consequences of Multi-Threading
- Database Connections
- Order of Execution, Parallel Execution in a Job
- Using Carte as a Slave Server (see the startup sketch after this list)
- The Configuration File
- Defining Slave Servers
- Remote Execution, Monitoring Slave Servers
- Carte Security, Services
- Clustering Transformations
- Defining a Cluster Schema, Designing Clustered Transformations
- Execution and Monitoring
- Metadata Transformations
- Rules
- Data Pipelining, Partitioning
- Defining a Partitioning Schema
- Objectives of Partitioning, Implementing Partitioning
- Internal Variables
- Database Partitions
- Partitioning in a Clustered Transformation
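As noted in the Carte item above, a slave server is started from the command line before remote execution or clustering can be configured; a minimal sketch (the host name and port are illustrative):

    # Start a Carte slave server listening on localhost, port 8081.
    ./carte.sh localhost 8081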
Bonus Topics
- Working with Repositories
- Pan/Kitchen – Launching Transformations and Jobs from the Command-Line