Overview of Pentaho Data Integration
Pentaho Data Integration (PDI) is part of the Pentaho Open Source BI Suite, which includes a wide range of software to support business decision making.
Please find below the objectives of this course:
- Getting Started with PDI
- Getting Started with Transformations
- Manipulating Real World Data
- Filtering, Searching, and Performing Other Useful Operations with Data
- Transforming the Row set
- Performing Advanced Operations with Databases
- Creating Basic Task Flows (Jobs)
- Real-Time Data Integration
- Performance Tuning
- Parallelization, Clustering, and Partitioning
Duration
5 Days
Prerequisite for Pentaho Data Integration
Knowledge of ETL concepts
Course Outline for Pentaho Data Integration
Getting Started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
- Using PDI in real-world scenarios
- Loading data warehouses or data marts
- Integrating data
- Cleansing data
- Migrating information
- Exporting data
- Integrating PDI with other Pentaho tools
Demo – Installing PDI
Demo – starting and customizing Spoon (launch commands at the end of this section)
- Setting preferences in the Options window
- Storing transformations and jobs in a repository
Demo – creating a hello world transformation
- Directing the Kettle engine with transformations
- Exploring the Spoon interface
- Designing a transformation
- Running and previewing the transformation
Demo – installing MySQL on Windows
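The Spoon demos above assume a standard PDI installation. As a quick, illustrative reference (the installation path below is hypothetical), Spoon is normally launched from the data-integration folder that the PDI archive unpacks to:

    cd /opt/pentaho/data-integration    # hypothetical installation path
    ./spoon.sh                          # Linux/macOS
    Spoon.bat                           # Windows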
Getting Started with Transformations
- Designing and previewing transformations
- Demo – creating a simple transformation and getting familiar with the design process
- Understanding the Kettle rowset
- Seeing the results in the Execution Results pane
- Demo – generating a range of dates and inspecting the data as it is being created
- Adding or modifying fields by using various PDI steps
- Demo – avoiding errors while converting the estimated time from string to integer
- The error handling functionality
- Demo – configuring the error handling to see the description of the errors
- Personalizing the error handling
Manipulating Real-world Data
- Reading data from files
- Demo – reading results of football matches from files
- Reading several files at once
- Regular expressions
- Troubleshooting reading files
- Sending data to files
- Getting system information
- Running transformations from a terminal window
- PDI transformation files
- Getting data from XML files
- Kettle variables
- Calculations on groups of rows
- Numeric fields
- Filtering
- Filtering rows using the Filter rows step
- Looking up data
- The Stream lookup step
- Data cleaning
- Cleansing data with PDI
- Controlling the Flow of Data
- Splitting streams
- Copying rows
- Distributing rows
- Splitting the stream based on conditions
- PDI steps for splitting the stream based on conditions
- Merging streams
- PDI options for merging streams
- Treating invalid data by splitting and merging streams
- Treating rows with invalid data
- Transforming Your Data by Coding
- Doing simple tasks with the JavaScript step
- Using the JavaScript language in PDI
- Using transformation predefined constants
- Testing the script using the Test script button
- Reading and parsing unstructured files with JavaScript
- Looping over the dataset rows
- Doing simple tasks with the Java Class step (see the sketch after this list)
- Using the Java language in PDI
- Sending rows to the next step
- Data types equivalence
- Transforming the dataset with Java
- Avoiding coding by using purpose-built steps
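As a taste of the "Transforming Your Data by Coding" topics above, below is a minimal, illustrative sketch of the kind of code typed into a User Defined Java Class step. It follows the standard processRow() template; the input field name and the String output field greeting are hypothetical and would have to be declared in the step's Fields tab:

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        // Read one row from the incoming stream.
        Object[] r = getRow();

        // A null row means there is nothing left to process.
        if (r == null) {
            setOutputDone();
            return false;
        }

        // Make sure the output row array is large enough for the added field.
        Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());

        // "name" is a hypothetical incoming field; "greeting" is a hypothetical new field.
        String name = get(Fields.In, "name").getString(r);
        get(Fields.Out, "greeting").setValue(outputRow, "Hello, " + name);

        // Send the row to the next step.
        putRow(data.outputRowMeta, outputRow);
        return true;
    }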
Transforming the Row Set
- Converting rows to columns
- Demo – enhancing the file by converting rows to columns
- Converting row data to column data by using the Row Denormalizer
- Aggregating data with a Row Denormalizer
- Demo – aggregating football matches data with the Row Denormalizer
- Using the Row Denormalizer for aggregating data
- Normalizing data
- Demo – enhancing the matches file by normalizing the dataset
- Modifying the dataset with a Row Normalizer
- Summarizing: PDI steps that operate on sets of rows
- Generating a customized time dimension dataset by using Kettle variables
- Demo – creating the time dimension dataset
- Getting variables
- Demo – parameterizing the start and end date of the time dimension dataset
Working with Databases
- Introducing the Steel Wheels sample database
- Connecting to the Steel Wheels database
- Demo – creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Querying a database
- Demo – getting data about shipped orders
- Getting data from the database with the Table input step
- Generating a new dataset by using the SELECT statement
- Making flexible queries by using parameters
- Demo – getting orders on a range of dates using parameters
- Adding parameters to your queries
- Making flexible queries by using Kettle variables
- Demo – getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries (see the query sketch after this list)
- Sending data to a database
- Demo – loading a table with a list of manufacturers
- Inserting or updating data by using other PDI steps
- Demo – inserting new products or updating existing ones
- Demo – testing the update of existing products
- Eliminating data from a database
- Demo – deleting data about discontinued items
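As a small illustration of the querying topics above, a Table input step can combine positional ? parameters, supplied by the step selected in "Insert data from step", with Kettle variables that are substituted when "Replace variables in script" is checked. The table and column names below are only illustrative:

    SELECT ordernumber, orderdate, status
    FROM   orders                        -- illustrative table name
    WHERE  orderdate BETWEEN ? AND ?     -- ? placeholders filled from the incoming stream
      AND  status = '${ORDER_STATUS}'    -- Kettle variable replaced before execution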
Performing Advanced Operations with Databases
- Demo – populating the Jigsaw database
- Exploring the Jigsaw database model
- Doing simple lookups
- Demo – using a Database lookup step to create a list of products to buy
- Performing complex lookups
- Demo – using a Database join step to create a list of suggested products to buy
- Joining data to the stream data by using a Database join step
- Introducing dimensional modeling
- Loading dimensions with data
- Demo – loading a region dimension with a Combination lookup/update step
- Describing data with dimensions
- Loading a Type I SCD with a Combination lookup/update step
- Storing a history of changes
- Demo – keeping a history of changes in products by using the Dimension lookup/update
- Keeping an entire history of data with a Type II SCD
- Loading Type II SCDs with the Dimension lookup/update step
Creating Basic Task Flows (Jobs)
Introducing PDI jobs
Demo – creating a folder with a Kettle job
Executing processes with PDI jobs
Using Spoon to design and run jobs
Demo – creating a simple job and getting familiar with the design process
Changing the flow of execution on the basis of conditions
Looking at the results in the Execution results window
Running transformations from jobs
Demo – generating a range of dates and inspecting how things are running
Using the Transformation job entry
Receiving arguments and parameters in a job
Demo – generating a hello world file by using arguments and parameters
Using named parameters in jobs
Running jobs from a terminal window
Demo – executing the hello world job from a terminal window
Demo – calling the hello world transformation with fixed arguments and parameters
Deciding between using a command-line argument or a named parameter (see the command examples below)
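For the terminal-window demos above, the sketch below shows how Pan (transformations) and Kitchen (jobs) are typically invoked on Linux; the file names, parameter name, and argument value are hypothetical, and Windows uses pan.bat/kitchen.bat with the same options:

    # Run a transformation, passing a named parameter and one command-line argument.
    ./pan.sh -file=/home/pdi/hello_world.ktr -param:GREETING=Hello argument1

    # Run a job with a named parameter at a more detailed log level.
    ./kitchen.sh -file=/home/pdi/hello_world.kjb -param:GREETING=Hello -level=Detailed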
Creating Advanced Transformations and Jobs
Re-using part of your transformations
Demo – calculating statistics with the use of a sub-transformation
Creating a job as a process flow
Demo – generating top average scores by copying and getting rows
Using the copy/get rows mechanism to transfer data between transformations
Demo – generating custom files by executing a transformation for every input row
Executing for each row
Enhancing your processes with the use of variables
Demo – generating custom messages by setting a variable with the name of the examination file
Case Study: Developing and Implementing a Sample Data Mart
Exploring the sales data mart
Deciding the level of granularity
Loading the dimensions
Demo – loading the dimensions for the sales datamart
Extending the sales datamart model
Loading a fact table with aggregated data
Demo – loading the sales fact table by looking up dimensions
Getting the data from the source – SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for Type I SCD
Obtaining the surrogate key for Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Getting facts and dimensions together
Demo – loading the fact table using a range of dates obtained from the command line
Demo – loading the SALES star schema model
Automating the administrative tasks
Demo – automating the loading of the sales data mart
Real-Time Data Integration
- Introduction to Real-Time ETL
- Real-Time Challenges
- Requirements
- Transformation Streaming
- A Practical Example of Transformation Streaming
- Debugging
- Third-Party Software and Real-Time Integration
- Java Message Service
- Creating a JMS Connection and Session
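The JMS topics above are easier to grasp with a concrete provider. The following is a minimal, self-contained sketch using the standard javax.jms API, assuming ActiveMQ as the broker; the broker URL and queue name are illustrative:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class JmsConnectionSketch {
        public static void main(String[] args) throws Exception {
            // Assumption: an ActiveMQ broker is listening on its default port.
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            Connection connection = factory.createConnection();
            connection.start();

            // A non-transacted session with automatic message acknowledgement.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // "incoming.rows" is an illustrative queue name.
            Queue queue = session.createQueue("incoming.rows");
            MessageConsumer consumer = session.createConsumer(queue);

            // Block until one message arrives, then print its text body.
            TextMessage message = (TextMessage) consumer.receive();
            System.out.println(message.getText());

            connection.close();
        }
    }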
Performance Tuning
- Transformation Performance: Finding the Weakest Link
- Finding Bottlenecks by Simplifying
- Finding Bottlenecks by Measuring
- Copying Rows of Data
- Improving Transformation Performance
- Using Lazy Conversion for Reading Text Files
- Single-File Parallel Reading, Multi-File Parallel Reading
- Configuring the NIO Block Size
- Changing Disks and Reading Text Files
- Improving Performance in Writing Text Files
- Using Lazy Conversion for Writing Text Files
- Parallel Files Writing
- Improving Database Performance
- Avoiding Dynamic SQL, Handling Roundtrips
- Handling Relational Databases
- Sorting Data, Sorting on the Database, Sorting in Parallel
- Reducing CPU Usage
- Optimizing the Use of JavaScript
- Launching Multiple Copies of a Step
- Selecting and Removing Values
- Managing Thread Priorities
- Adding Static Data to Rows of Data
- Limiting the Number of Step Copies
- Avoiding Excessive Logging
- Improving Job Performance
- Loops in Jobs
- Database Connection Pools
Parallelization, Clustering, and Partitioning
- Multi-Threading
- Row Distribution, Row Merging, and Row Redistribution
- Data Pipelining
- Consequences of Multi-Threading
- Database Connections
- Order of Execution, Parallel Execution in a Job
- Using Carte as a Slave Server (see the startup sketch after this list)
- The Configuration File
- Defining Slave Servers
- Remote Execution, Monitoring Slave Servers
- Carte Security, Services
- Clustering Transformations
- Defining a Cluster Schema, Designing Clustered Transformations
- Execution and Monitoring
- Metadata Transformations
- Rules
- Data Pipelining, Partitioning
- Defining a Partitioning Schema
- Objectives of Partitioning, Implementing Partitioning
- Internal Variables
- Database Partitions
- Partitioning in a Clustered Transformation
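As noted in the Carte item above, a slave server is started from the command line before remote execution or clustering can be configured; a minimal sketch (the host name and port are illustrative):

    # Start a Carte slave server listening on localhost, port 8081.
    ./carte.sh localhost 8081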
Bonus Topics
- Working with Repositories
- Pan/Kitchen – Launching Transformations and Jobs from the Command-Line