Pentaho Data Integration Training

Pentaho Data Integration (PDI) is part of the Pentaho Open Source BI Suite, which includes software of all kinds to support business decision making. Topics include Getting Started with PDI, Getting Started with Transformations, and more.

Created by

Stalwart Learning

Duration

5 Days

https://stalwartlearning.com


Course Description

Overview of Pentaho Data Integration

Pentaho Data Integration (PDI) is part of the Pentaho Open Source BI Suite, which includes software of all kinds to support business decision making.

The objectives of this course are listed below:

  • Getting Started with PDI
  • Getting Started with Transformations
  • Manipulating Real World Data
  • Filtering, Searching, and Performing Other Useful Operations with Data
  • Transforming the Row Set
  • Performing Advanced Operations with Databases
  • Creating Basic Task Flows (Jobs)
  • Real-Time Data Integration
  • Performance Tuning
  • Parallelization, Clustering, and Partitioning

Duration

5 Days

Prerequisite for Pentaho Data Integration

Knowledge of ETL concepts

Course Outline for Pentaho Data Integration

Getting Started with Pentaho Data Integration
  • Pentaho Data Integration and Pentaho BI Suite  
  • Using PDI in real-world scenarios
  • Loading data warehouses or data marts
  • Integrating data
  • Data Cleansing
  • Migrating information
  • Exporting data
  • Integrating PDI with other Pentaho tools

Demo – Installing PDI

Demo – starting and customizing Spoon

  • Setting preferences in the Options window
  • Storing transformations and jobs in a repository  

Demo – creating a hello world transformation

  • Directing the Kettle engine with transformations
  • Exploring the Spoon interface
  • Designing a transformation
  • Running and previewing the transformation

Demo – installing MySQL on Windows

Getting Started with Transformations
  • Designing and previewing transformations
  • Demo – creating a simple transformation and getting familiar with the design process
  • Understanding the Kettle row set
  • Seeing the results in the Execution Results pane
  • Demo – generating a range of dates and inspecting the data as it is being created
  • Adding and modifying fields using various PDI steps
  • Demo – avoiding errors while converting the estimated time from string to integer
  • The error handling functionality
  • Demo – configuring the error handling to see the description of the errors
  • Personalizing the error handling
Manipulating Real-world Data
  • Reading data from files
  • Demo – reading results of football matches from files
  • Reading several files at once
  • Demo – reading all your files at a time using a single Text file input step
  • Regular expressions
  • Troubleshooting reading files
  • Sending data to files
  • Demo – sending the results of matches to a plain file
  • Getting system information
  • Demo – reading and writing matches files with flexibility
  • Running transformations from a terminal window
  • Demo – running the matches transformation from a terminal window
  • Demo – getting data from an XML file with information about countries
  • PDI transformation files
  • Getting data from XML files
  • Kettle variables
Filtering, Searching, and Performing Other Useful Operations with Data
  • Sorting data
  • Demo – sorting information about matches with the Sort rows step
  • Calculations on groups of rows
  • Demo – calculating football match statistics by grouping data
  • The Group by step
  • Numeric fields
  • Filtering
  • Demo – counting frequent words by filtering
  • Demo – refining the counting task by filtering even more
  • Filtering rows using the Filter rows step
  • Looking up data
  • Demo – finding out which language people speak
  • The Stream lookup step
  • Data cleaning
  • Demo – fixing words before counting them
  • Cleansing data with PDI
Controlling the Flow of Data
  • Splitting streams
  • Demo – browsing new features of PDI by copying a dataset
  • Copying rows
  • Distributing rows
  • Demo – assigning tasks by distributing
  • Splitting the stream based on conditions
  • Demo – assigning tasks by filtering priorities with the Filter rows step
  • PDI steps for splitting the stream based on conditions
  • Demo – assigning tasks by filtering priorities with the Switch/Case step
  • Merging streams
  • Demo – gathering progress and merging it all together
  • PDI options for merging streams
  • Demo – giving priority to Bouchard by using the Append streams step
  • Treating invalid data by splitting and merging streams
  • Demo – treating errors in the estimated time to avoid discarding rows
  • Treating rows with invalid data
Transforming Your Data by Coding
  • Doing simple tasks with the JavaScript step
  • Demo – counting frequent words by coding in JavaScript
  • Using the JavaScript language in PDI
  • Using transformation predefined constants
  • Testing the script using the Test script button
  • Reading and parsing unstructured files with JavaScript
  • Demo – changing a list of house descriptions with JavaScript
  • Looping over the dataset rows
  • Doing simple tasks with the Java Class step (a minimal sketch follows this list)
  • Demo – counting frequent words by coding in Java
  • Using the Java language in PDI
  • Sending rows to the next step
  • Data types equivalence
  • Transforming the dataset with Java
  • Demo – splitting the field to rows using Java
  • Avoiding coding by using purpose-built steps
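
For orientation only: the Java Class demos above are built around PDI's User Defined Java Class step, where the developer writes the body of a processRow() method in Java. The following minimal sketch, which is not taken from the courseware, shows the general shape of such code for splitting a text field into one row per word. The field names "line" and "word" are hypothetical and would have to be declared as the step's input and output fields.

    // Sketch of a User Defined Java Class processRow() body (illustrative only).
    // Assumption: incoming rows carry a String field named "line", and an output
    // field named "word" has been declared in the step's Fields tab.
    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
            throws KettleException {
        Object[] r = getRow();                 // fetch the next row from the input stream
        if (r == null) {                       // no more input: tell PDI we are finished
            setOutputDone();
            return false;
        }

        String line = get(Fields.In, "line").getString(r);
        if (line == null) {
            return true;                       // nothing to split on this row
        }

        // Emit one output row per word; createOutputRow() copies the incoming fields.
        for (String word : line.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
            get(Fields.Out, "word").setValue(outputRow, word);
            putRow(data.outputRowMeta, outputRow);
        }
        return true;
    }

Note that putRow() may be called any number of times per input row, which is what makes this kind of one-to-many transformation possible.
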
Transforming the Row Set
  • Converting rows to columns
  • Demo – enhancing the file by converting rows to columns
  • Converting row data to column data by using the Row Denormalizer
  • Aggregating data with a Row Denormalizer
  • Demo – aggregating football matches data with the Row Denormalizer
  • Using the Row Denormalizer for aggregating data
  • Normalizing data
  • Demo – enhancing the matches file by normalizing the dataset
  • Modifying the dataset with a Row Normalizer
  • Summarizing the PDI steps that operate on sets of rows
  • Generating a customized time dimension dataset using Kettle variables
  • Demo – creating the time dimension dataset
  • Getting variables
  • Demo – parameterizing the start and end date of the time dimension dataset
Working with Databases
  • Introducing the Steel Wheels sample database
  • Connecting to the Steel Wheels database
  • Demo – creating a connection to the Steel Wheels database
  • Connecting to Relational Database Management Systems
  • Querying a database
  • Demo – getting data about shipped orders
  • Getting data from the database with the Table input step
  • Generating a new dataset using a SELECT statement
  • Making flexible queries using parameters
  • Demo – getting orders on a range of dates using parameters
  • Adding parameters to your queries
  • Making flexible queries by using Kettle variables
  • Demo – getting orders in a range of dates by using Kettle variables
  • Using Kettle variables in your queries
  • Sending data to a database
  • Demo – loading a table with a list of manufacturers
  • Inserting or updating data by using other PDI steps
  • Demo – inserting new products or updating existing ones
  • Demo – testing the update of existing products
  • Eliminating data from a database
  • Demo – deleting data about discontinued items
Performing Advanced Operations with Databases
  • Demo – populating the Jigsaw database
  • Exploring the Jigsaw database model
  • Doing simple lookups
  • Demo – using a Database lookup step to create a list of products to buy
  • Performing complex lookups
  • Demo – using a Database join step to create a list of suggested products to buy
  • Joining data from the database to the stream data by using a Database join step
  • Introducing dimensional modeling
  • Loading dimensions with data
  • Demo – loading a region dimension with a Combination lookup/update step
  • Describing data with dimensions
  • Loading Type I SCDs with a Combination lookup/update step
  • Storing a history of changes
  • Demo – keeping a history of changes in products by using the Dimension lookup/update step
  • Keeping an entire history of data with a Type II SCD
  • Loading Type II SCDs with the Dimension lookup/update step
Creating Basic Task Flows (Jobs)

Introducing PDI jobs

Demo – creating a folder with a Kettle job

Executing processes with PDI jobs

Using Spoon to design and run jobs

Demo – creating a simple job and getting familiar with the design process

Changing the flow of execution on the basis of conditions

Looking at the results in the Execution results window

Running transformations from jobs

Demo – generating a range of dates and inspecting how things  are running

Using the Transformation job entry

Receiving arguments and parameters in a job

Demo – generating a hello world file by using arguments and  parameters

Using named parameters in jobs

Running jobs from a terminal window

Demo – executing the hello world job from a terminal window

Demo – calling the hello world transformation with fixed arguments  and parameters

Deciding between the use of a command-line argument and a named parameter
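
As a side note, the hello world transformation and job run from a terminal window above can also be launched from Java code through the Kettle API. The sketch below is an illustration only, not part of the courseware; the transformation file name and the named parameter are hypothetical.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunHelloWorld {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                            // initialize Kettle and its plugins

            TransMeta transMeta = new TransMeta("hello_world.ktr");   // hypothetical transformation file
            Trans trans = new Trans(transMeta);

            // Hypothetical named parameter; it must be defined in the transformation itself.
            trans.setParameterValue("HELLO_FOLDER", "/tmp/hello");
            trans.activateParameters();

            trans.execute(null);                                 // null = no command-line arguments
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                System.err.println("The transformation finished with errors.");
            }
        }
    }

Named parameters set this way play the same role as the -param:NAME=value options accepted on the Pan and Kitchen command lines.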

Creating Advanced Transformations and Jobs

Re-using part of your transformations

Demo – calculating statistics with the use of a sub-transformation

Creating a job as a process flow

Demo – generating top average scores by copying and getting rows

Using the copy/get rows mechanism to transfer data between transformations

Demo – generating custom files by executing a transformation for  every input row

Executing for each row

Enhancing your processes with the use of variables

Demo – generating custom messages by setting a variable with the  name of the examination file

Case Study: Developing and Implementing a Sample Datamart

Exploring the sales data mart

Deciding the level of granularity

Loading the dimensions

Demo – loading the dimensions for the sales datamart

Extending the sales datamart model

Loading a fact table with aggregated data

Demo – loading the sales fact table by looking up dimensions

Getting the data from the source with SQL queries

Translating the business keys into surrogate keys

Obtaining the surrogate key for Type I SCD

Obtaining the surrogate key for Type II SCD

Obtaining the surrogate key for the Junk dimension

Obtaining the surrogate key for the Time dimension

Getting facts and dimensions together

Demo – loading the fact table using a range of dates obtained  from the command line

Demo – loading the SALES star schema model

Automating the administrative tasks

Demo – automating the loading of the sales data mart

Real-Time Data Integration
  • Introduction to Real-Time ETL
  • Real-Time Challenges
  • Requirements
  • Transformation Streaming
  • A Practical Example of Transformation Streaming
  • Debugging
  • Third-Party Software and Real-Time Integration
  • Java Message Service
  • Creating a JMS Connection and Session
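
For the JMS topics above, the following is a minimal sketch of creating a JMS connection and session and reading a single message. It assumes an ActiveMQ broker; the broker URL and queue name are placeholders rather than values from the course material.

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class JmsSessionExample {
        public static void main(String[] args) throws Exception {
            // Placeholder broker URL; any JMS provider's ConnectionFactory works the same way.
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");

            Connection connection = factory.createConnection();
            connection.start();

            // Non-transacted session with automatic message acknowledgement.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            Queue queue = session.createQueue("incoming.rows");  // placeholder queue name
            MessageConsumer consumer = session.createConsumer(queue);

            TextMessage message = (TextMessage) consumer.receive(5000);  // wait up to 5 seconds
            if (message != null) {
                System.out.println("Received: " + message.getText());
            }

            session.close();
            connection.close();
        }
    }

A message producer is obtained from the same session via session.createProducer(queue); the session is the factory for consumers, producers, and messages.
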
Performance Tuning
  • Transformation Performance: Finding the Weakest Link
  • Finding Bottlenecks by Simplifying
  • Finding Bottlenecks by Measuring
  • Copying Rows of Data
  • Improving Transformation Performance
  • Using Lazy Conversion for Reading Text Files
  • Single-File Parallel Reading, Multi-File Parallel Reading
  • Configuring the NIO Block Size
  • Changing Disks and Reading Text Files
  • Improving Performance in Writing Text Files
  • Using Lazy Conversion for Writing Text Files
  • Parallel Files Writing
  • Improving Database Performance
  • Avoiding Dynamic SQL, Handling Roundtrips
  • Handling Relational Databases
  • Sorting Data, Sorting on the Database, Sorting in Parallel
  • Reducing CPU Usage
  • Optimizing the Use of JavaScript
  • Launching Multiple Copies of a Step
  • Selecting and Removing Values
  • Managing Thread Priorities
  • Adding Static Data to Rows of Data
  • Limiting the Number of Step Copies
  • Avoiding Excessive Logging
  • Improving Job Performance
  • Loops in Jobs
  • Database Connection Pools
Parallelization, Clustering, and Partitioning
  • Multi-Threading
  • Row Distribution, Row Merging, and Row Redistribution
  • Data Pipelining
  • Consequences of Multi-Threading
  • Database Connections
  • Order of Execution, Parallel Execution in a Job
  • Using Carte as a Slave Server
  • The Configuration File
  • Defining Slave Servers
  • Remote Execution, Monitoring Slave Servers
  • Carte Security, Services
  • Clustering Transformations
  • Defining a Cluster Schema, Designing Clustered Transformations
  • Execution and Monitoring
  • Metadata Transformations
  • Rules
  • Data Pipelining, Partitioning
  • Defining a Partitioning Schema
  • Objectives of Partitioning, Implementing Partitioning
  • Internal Variables
  • Database Partitions
  • Partitioning in a Clustered Transformation
Bonus Topics
  • Working with Repositories
  • Pan/Kitchen – Launching Transformations and Jobs from the Command-Line
