
Pentaho Data Integration (PDI) is part of the Pentaho open source BI Suite, which includes software of all sorts to support business decision making.
Please find below the objectives of this session:
- Getting Started with PDI
- Getting Started with Transformations
- Manipulating Real World Data
- Filtering, Searching, and Performing Other Useful Operations with Data
- Transforming the Row set
- Performing Advanced Operations with Databases
- Creating Basic Task Flows (Jobs)
- Real-Time Data Integration
- Performance Tuning
- Parallelization, Clustering, and Partitioning
Duration: 5 Days
Prerequisites: Knowledge of ETL concepts
- Pentaho Data Integration and Pentaho BI Suite
- Using PDI in real-world scenarios
- Loading data warehouses or data marts
- Integrating data
- Data Cleansing
- Migrating information
- Exporting data
- Integrating PDI with other Pentaho tools
Demo – Installing PDI
Demo – starting and customizing Spoon
- Setting preferences in the Options window
- Storing transformations and jobs in a repository
Demo – creating a hello world transformation
- Directing the Kettle engine with transformations
- Exploring the Spoon interface
- Designing a transformation
- Running and previewing the transformation
Demo – installing MySQL on Windows
- Designing and previewing transformations
Demo – creating a simple transformation and getting familiar with the design process
- Understanding the Kettle rowset
- Seeing the results in the Execution Results pane
Demo – generating a range of dates and inspecting the data as it is being created
- Adding/modifying fields using various PDI steps
Demo – avoiding errors while converting the estimated time from string to integer
- The error handling functionality
Demo – configuring the error handling to see the description of the errors
- Personalizing the error handling
- Reading data from files
Demo – reading results of football matches from files
- Reading several files at once
Demo – reading all your files at a time using a single text file input step
- Regular expressions
- Troubleshooting reading files
- Sending data to files
Demo – sending the results of matches to a plain file
- Getting system information
Demo – reading and writing matches files with flexibility
- Running transformations from a terminal window
Demo – running the matches transformation from a terminal window
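The demo above uses the Pan utility, which runs transformations outside Spoon. A typical invocation looks like this (the file path and log level are illustrative):

    # Linux/macOS; on Windows use Pan.bat with the same options
    ./pan.sh -file=/home/pdi/matches.ktr -level=Basic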
Demo – getting data from an XML file with information about countries
- PDI transformation files
- Getting data from XML files
- Kettle variables
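Regarding the last item: Kettle variables are commonly defined in the kettle.properties file and referenced with the ${} syntax wherever a step accepts variables. A minimal illustration (the variable name and path are assumptions, chosen to match the countries demo above):

    # $HOME/.kettle/kettle.properties
    # Available in every transformation and job as ${COUNTRIES_FILE}
    COUNTRIES_FILE=/home/pdi/files/countries.xml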
Filtering, Searching, and Performing Other Useful Operations with Data
- Sorting data
Demo – sorting information about matches with the Sort rows step
- Calculations on groups of rows
Demo – calculating football match statistics by grouping data with the Group by step
- Numeric fields
- Filtering
Demo – counting frequent words by filtering
Demo – refining the counting task by filtering even more
- Filtering rows using the Filter rows step
- Looking up data
Demo – finding out which language people speak
- The Stream lookup step
- Data cleaning
Demo – fixing words before counting them
- Cleansing data with PDI
Controlling the Flow of Data
- Splitting streams
Demo – browsing new features of PDI by copying a dataset
- Copying rows
- Distributing rows
Demo – assigning tasks by distributing
- Splitting the stream based on conditions
Demo – assigning tasks by filtering priorities with the Filter rows step
- PDI steps for splitting the stream based on conditions
Demo – assigning tasks by filtering priorities with the Switch/Case step
- Merging streams
Demo – gathering progress and merging it all together
- PDI options for merging streams
Demo – giving priority to Bouchard by using the Append streams step
- Treating invalid data by splitting and merging streams
Demo – treating errors in the estimated time to avoid discarding rows
- Treating rows with invalid data
Transforming Your Data by Coding
- Doing simple tasks with the JavaScript step (a short code sketch follows this unit)
Demo – counting frequent words by coding in JavaScript
- Using the JavaScript language in PDI
- Using transformation predefined constants
- Testing the script using the Test script button
- Reading and parsing unstructured files with JavaScript
Demo – changing a list of house descriptions with JavaScript
- Looping over the dataset rows
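As a sketch of what the code in the Modified Java Script Value step looks like, the word-counting demo above starts with something like this (the incoming field name line is an assumption; the new word field would be declared in the step's Fields grid):

    // Lower-case the incoming line and strip punctuation so that
    // identical words can be grouped and counted in later steps
    var word = line.toLowerCase();
    word = word.replace(/[.,;:()"!?]/g, "");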
- Doing simple tasks with the Java Class step
Demo – counting frequent words by coding in Java
- Using the Java language in PDI
- Sending rows to the next step
- Data types equivalence
- Transforming the dataset with Java
Demo – splitting the field to rows using Java
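To give a feel for the Java alternative, the word-counting and field-splitting demos above reduce to a User Defined Java Class body along these lines (only the processRow() method is written in the step; the field names line and word are assumptions):

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
            throws KettleException {
        Object[] r = getRow();            // read one row from the input stream
        if (r == null) {                  // no more input: signal completion
            setOutputDone();
            return false;
        }
        String line = get(Fields.In, "line").getString(r);
        // Emit one output row per word found in the incoming line
        for (String word : line.toLowerCase().split("\\W+")) {
            Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
            get(Fields.Out, "word").setValue(outputRow, word);
            putRow(data.outputRowMeta, outputRow);
        }
        return true;
    }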
- Avoiding coding by using purpose-built steps
- Converting rows to columns
Demo – enhancing the file by converting rows to columns
- Converting row data to column data by using the Row Denormalizer
- Aggregating data with a Row Denormalizer
Demo – aggregating football matches data with the Row Denormalizer
- Using the Row Denormalizer for aggregating data
- Normalizing data
Demo – enhancing the matches file by normalizing the dataset
- Modifying the dataset with a Row Normalizer
- Summarizing: PDI steps that operate on sets of rows
- Generating a custom time dimension dataset by using Kettle variables
Demo – creating the time dimension dataset
- Getting variables
Demo – parameterizing the start and end date of the time dimension dataset
- Introducing the Steel Wheels sample database
- Connecting to the Steel Wheels database
Demo – creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Querying a database
Demo – getting data about shipped orders
- Getting data from the database with the Table input step
- Generating a new dataset using the SELECT statement
- Making flexible queries using parameters
Demo – getting orders in a range of dates using parameters
- Adding parameters to your queries
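As a sketch, a Table input query with positional parameters looks like the following; each ? is filled with a value coming from the step chosen in the "Insert data from step" option (table and column names follow the Steel Wheels sample):

    SELECT ordernumber, orderdate, status
    FROM   orders
    WHERE  orderdate BETWEEN ? AND ?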
- Making flexible queries by using Kettle variables
Demo – getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries
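The same query rewritten with Kettle variables; this form requires checking the "Replace variables in script?" option of the Table input step (the variable names are illustrative):

    SELECT ordernumber, orderdate, status
    FROM   orders
    WHERE  orderdate BETWEEN '${DATE_FROM}' AND '${DATE_TO}'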
- Sending data to a database
Demo – loading a table with a list of manufacturers
- Inserting or updating data by using other PDI steps
Demo – inserting new products or updating existing ones
Demo – testing the update of existing products
- Eliminating data from a database
Demo – deleting data about discontinued items
Demo – populating the Jigsaw database
- Exploring the Jigsaw database model
- Doing simple lookups
Demo – using a Database lookup step to create a list of products to buy
- Performing complex lookups
Demo – using a Database join step to create a list of suggested products to buy
- Joining data from the database to the stream data by using a Database join step
- Introducing dimensional modeling
- Loading dimensions with data
Demo – loading a region dimension with a Combination lookup/update step
- Describing data with dimensions
- Loading Type I SCD with a Combination lookup/update
- Storing history of changes
Demo – keeping a history of changes in products by using the Dimension lookup/update
- Keeping an entire history of data with a Type II SCD
- Loading Type II SCDs with the Dimension lookup/update step
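As an illustration of what the Dimension lookup/update step maintains for a Type II SCD, a product dimension table typically carries a surrogate key plus versioning columns like these (table and column names are illustrative):

    CREATE TABLE dim_products (
      product_tk   BIGINT,       -- technical (surrogate) key generated by the step
      version      INT,          -- version number of this record
      date_from    DATETIME,     -- start of the validity period
      date_to      DATETIME,     -- end of the validity period
      product_code VARCHAR(10),  -- business (natural) key
      product_name VARCHAR(45)   -- attribute whose history is kept
    );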
- Introducing PDI jobs
Demo – creating a folder with a Kettle job
- Executing processes with PDI jobs
- Using Spoon to design and run jobs
Demo – creating a simple job and getting familiar with the design process
- Changing the flow of execution on the basis of conditions
- Looking at the results in the Execution results window
- Running transformations from jobs
Demo – generating a range of dates and inspecting how things are running
Using the Transformation job entry
Receiving arguments and parameters in a job
Demo – generating a hello world file by using arguments and parameters
Using named parameters in jobs
Running jobs from a terminal window
Demo – executing the hello world job from a terminal window
Demo – calling the hello world transformation with fixed arguments and parameters
Deciding between a command-line argument and a named parameter
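Both options from the demos above come down to how Kitchen is invoked: a named parameter is passed with -param, while plain command-line arguments simply trail the options (the file path and values are illustrative):

    # Named parameter
    ./kitchen.sh -file=/home/pdi/hello_world.kjb -param:HELLOFOLDER=my_folder
    # Plain command-line argument
    ./kitchen.sh -file=/home/pdi/hello_world.kjb Maria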
Creating Advanced Transformations and Jobs
Re-using part of your transformations
Demo – calculating statistics by using a sub-transformation
Creating a job as a process flow
Demo – generating top average scores by copying and getting rows
Using the copy/get rows mechanism to transfer data between transformations
Demo – generating custom files by executing a transformation for every input row
Executing for each row
Enhancing your processes with the use of variables
Demo – generating custom messages by setting a variable with the name of the examination file
Case Study: Developing and Implementing a Sample Data Mart
Exploring the sales data mart
Deciding the level of granularity
Loading the dimensions
Demo – loading the dimensions for the sales data mart
Extending the sales data mart model
Loading a fact table with aggregated data
Demo – loading the sales fact table by looking up dimensions
Getting the data from the source with SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for Type I SCD
Obtaining the surrogate key for Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Getting facts and dimensions together
Demo – loading the fact table using a range of dates obtained from the command line
Demo – loading the SALES star schema model
Automating the administrative tasks
Demo – automating the loading of the sales data mart
- Introduction to Real-Time ETL
- Real-Time Challenges
- Requirements
- Transformation Streaming
- A Practical Example of Transformation Streaming
- Debugging
- Third-Party Software and Real-Time Integration
- Java Message Service
- Creating a JMS Connection and Session
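The last two items use the standard javax.jms API. A minimal sketch of obtaining a session, with the vendor-specific ConnectionFactory lookup omitted (class and method names below are an assumption for illustration):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.JMSException;
    import javax.jms.Session;

    public class JmsSessionSketch {
        public static Session openSession(ConnectionFactory factory) throws JMSException {
            Connection connection = factory.createConnection();
            connection.start();
            // Non-transacted session with automatic message acknowledgement
            return connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        }
    }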
- Transformation Performance: Finding the Weakest Link
- Finding Bottlenecks by Simplifying
- Finding Bottlenecks by Measuring
- Copying Rows of Data
- Improving Transformation Performance
- Using Lazy Conversion for Reading Text Files
- Single-File Parallel Reading, Multi-File Parallel Reading
- Configuring the NIO Block Size
- Changing Disks and Reading Text Files
- Improving Performance in Writing Text Files
- Using Lazy Conversion for Writing Text Files
- Parallel File Writing
- Improving Database Performance
- Avoiding Dynamic SQL, Handling Roundtrips
- Handling Relational Databases
- Sorting Data, Sorting on the Database, Sorting in Parallel
- Reducing CPU Usage
- Optimizing the Use of JavaScript
- Launching Multiple Copies of a Step
- Selecting and Removing Values
- Managing Thread Priorities
- Adding Static Data to Rows of Data
- Limiting the Number of Step Copies
- Avoiding Excessive Logging
- Improving Job Performance
- Loops in Jobs
- Database Connection Pools
- Multi-Threading
- Row Distribution, Row Merging, and Row Redistribution
- Data Pipelining
- Consequences of Multi-Threading
- Database Connections
- Order of Execution, Parallel Execution in a Job
- Using Carte as a Slave Server
- The Configuration File
- Defining Slave Servers
- Remote Execution, Monitoring Slave Servers
- Carte Security, Services
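For reference, a slave server is started with the Carte utility, either by giving a host and port directly or by pointing it at the XML configuration file mentioned above (the port and file path are illustrative):

    # Linux/macOS; on Windows use Carte.bat with the same arguments
    ./carte.sh localhost 8081
    ./carte.sh /home/pdi/carte-config.xml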
- Clustering Transformations
- Defining a Cluster Schema, Designing Clustered Transformations
- Execution and Monitoring
- Metadata Transformations
- Rules
- Data Pipelining, Partitioning
- Defining a Partitioning Schema
- Objectives of Partitioning, Implementing Partitioning
- Internal Variables
- Database Partitions
- Partitioning in a Clustered Transformation
- Working with Repositories
- Pan/Kitchen – Launching Transformations and Jobs from the Command-Line