
Pentaho Data Integration

Pentaho Data Integration (PDI) is part of the Pentaho open source BI Suite, a collection of software that supports business decision making.

  • The objectives of this course are listed below:
    • Getting Started with PDI
    • Getting Started with Transformations
    • Manipulating Real World Data
    • Filtering, Searching, and Performing Other Useful Operations with Data
    • Transforming the Row set
    • Performing Advanced Operations with Databases
    • Creating Basic Task Flows (Jobs)
    • Real-Time Data Integration
    • Performance Tuning
    • Parallelization, Clustering, and Partitioning

Duration: 5 Days

Prerequisites: Knowledge of ETL concepts

    • Pentaho Data Integration and Pentaho BI Suite  
    • Using PDI in real-world scenarios
    • Loading data warehouses or data marts
    • Integrating data
    • Cleansing data
    • Migrating information
    • Exporting data
    • Integrating PDI with other Pentaho tools

    Demo – Installing PDI

    Demo – starting and customizing Spoon

    • Setting preferences in the Options window
    • Storing transformations and jobs in a repository  

    Demo – creating a hello world transformation

    Directing the Kettle engine with transformations

    • Exploring the Spoon interface
    • Designing a transformation
    • Running and previewing the transformation

    Demo – installing MySQL on Windows

  • Designing and previewing transformations

    Demo – creating a simple transformation and getting familiar with the design process

    Understanding the Kettle rowset

    Seeing the results in the Execution Results pane

    Demo – generating a range of dates and inspecting the data as  it is being created

    Adding/modifying fields using various PDI steps

    Demo – avoiding errors while converting the estimated time  from string to integer

    The error handling functionality

    Demo – configuring the error handling to see the description  of the errors

    Personalizing the error handling

  • Reading data from files

    Demo – reading results of football matches from files
    • Reading several files at once

    Demo – reading all your files at once using a single Text file input step
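
    A minimal sketch of the kind of configuration this demo relies on (the folder and file names
    here are made up for illustration): in the Text file input step, the files grid accepts a
    directory plus a regular expression instead of a fixed file name, for example:

        File/Directory    : /pdi_files/input
        Wildcard (RegExp) : matches_.*\.txt

    With that wildcard, files such as matches_2012.txt and matches_2013.txt are all read by the
    same single step.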

    • Regular expressions
    • Troubleshooting reading files
    • Sending data to files

    Demo  – sending the results of matches to a plain file

    • Getting system information

    Demo – reading and writing matches files with flexibility


    • Running transformations from a terminal window

    Demo  – running the matches transformation from a terminal window
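
    As a reference for this demo, a hedged sketch of how such a transformation might be launched
    with Pan, the command-line runner for transformations (the file path and log level shown are
    examples only):

        sh pan.sh -file=/home/pdi/transformations/matches.ktr -level=Basic

    On Windows the equivalent call is pan.bat with the same options.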

    Demo – getting data from an XML file with information about countries

    • PDI transformation files
    • Getting data from XML files
    • Kettle variables
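
    As a quick illustration of Kettle variables (the variable name and path below are assumptions
    for the example): a variable can be defined once in the kettle.properties file located in the
    .kettle directory of the user's home, and then referenced from most steps and job entries
    using the ${} notation.

        # ~/.kettle/kettle.properties
        LABSINPUT=/home/pdi/input

    A step reading the countries file could then point to ${LABSINPUT}/countries.xml instead of a
    hard-coded path.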

    Filtering, Searching, and Performing Other Useful Operations with Data

    Sorting data

    Demo – sorting information about matches with the Sort rows step

    • Calculations on groups of rows

    Demo – calculating football match statistics by grouping data

    • The Group by step

    • Numeric fields
    • Filtering

    Demo – counting frequent words by filtering

    Demo – refining the counting task by filtering even more

    • Filtering rows using the Filter rows step
    • Looking up data

    Demo – finding out which language people speak

    • The Stream lookup step
    • Data cleaning

    Demo – fixing words before counting them

    • Cleansing data with PDI
    • Controlling the Flow of Data
    • Splitting streams

    Demo – browsing new features of PDI by copying a dataset

    • Copying rows
    • Distributing rows

    Demo – assigning tasks by distributing

    • Splitting the stream based on conditions

    Demo – assigning tasks by filtering priorities with the Filter rows step

    • PDI steps for splitting the stream based on conditions

    Demo – assigning tasks by filtering priorities with the Switch/Case step

    • Merging streams

    Demo – gathering progress and merging it all together

    • PDI options for merging streams

    Demo – giving priority to Bouchard by using the Append streams step

    • Treating invalid data by splitting and merging streams

    Demo – treating errors in the estimated time to avoid discarding rows

    • Treating rows with invalid data
    • Transforming Your Data by Coding
    • Doing simple tasks with the JavaScript step

    Demo – counting frequent words by coding in JavaScript

    • Using the JavaScript language in PDI
    • Using transformation predefined constants
    • Testing the script using the Test script button
    • Reading and parsing unstructured files with JavaScript
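
    As a rough illustration of the kind of script used in the word-counting demo above (the
    incoming field name 'line' and the new fields are assumptions for this sketch), the Modified
    Java Script Value step exposes each incoming field as a variable and lets new fields be
    declared in the grid under the script:

        // 'line' is an incoming string field
        var clean_line = line.toLowerCase().replace(/[.,;:!?"]/g, "");
        var words      = clean_line.split(" ");
        var word_count = words.length;

    clean_line and word_count are then added as output fields in the Fields grid so that the
    following steps can use them.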

    Demo – changing a list of house descriptions with JavaScript

    • Looping over the dataset rows
    • Doing simple tasks with the Java Class step

    Demo – counting frequent words by coding in Java

    • Using the Java language in PDI
    • Sending rows to the next step
    • Data types equivalence
    • Transforming the dataset with Java

    Demo – splitting the field to rows using Java
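
    A minimal, hedged sketch of the User Defined Java Class code behind a "split a field to rows"
    task (the field names 'tags' and 'tag' are assumptions; 'tag' would be declared in the step's
    Fields tab):

        public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
            Object[] r = getRow();                      // read one incoming row
            if (r == null) {                            // no more input rows
                setOutputDone();
                return false;
            }
            String tags = get(Fields.In, "tags").getString(r);
            for (String value : tags.split(",")) {      // one output row per value
                Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
                get(Fields.Out, "tag").setValue(outputRow, value.trim());
                putRow(data.outputRowMeta, outputRow);
            }
            return true;                                // ask the engine for the next row
        }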

    • Avoiding coding by using purpose-built steps
  • Converting rows to columns

    Demo – enhancing the file by converting rows to columns
    • Converting row data to column data by using the Row Denormaliser
    • Aggregating data with a Row Denormaliser

    Demo – aggregating football matches data with the Row Denormaliser

    • Using the Row Denormaliser for aggregating data
    • Normalizing data

    Demo – enhancing the matches file by normalizing the dataset

    • Modifying the dataset with a Row Normalizer
    • Summarizing: PDI steps which operate on sets of rows
    • Generating a custom time dimension dataset using Kettle variables

    Demo – creating the time dimension dataset

    Getting variables

    Demo – parameterizing the start and end date of the time  dimension dataset

  • Introducing the Steel Wheels sample database

    Connecting to the Steel Wheels database

    Demo – creating a connection to the Steel Wheels database

    • Connecting with Relational Database Management Systems
    • Querying a database

    Demo – getting data about shipped orders

    • Getting data from the database with the Table input step
    • Generating a new dataset using the SELECT statement  
    • Making flexible queries using parameters

    Demo – getting orders in a range of dates using parameters

    • Adding parameters to your queries
    • Making flexible queries by using Kettle variables

    Demo – getting orders in a range of dates by using Kettle variables
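
    A hedged example of what the Table input SQL for these demos might look like (table and column
    names are assumed to follow the Steel Wheels ORDERS table; ORDER_STATUS is an invented
    variable name):

        SELECT ORDERNUMBER, ORDERDATE, STATUS
        FROM   ORDERS
        WHERE  ORDERDATE BETWEEN ? AND ?
          AND  STATUS = '${ORDER_STATUS}'

    The two ? placeholders are filled, in order, with the fields coming from the step selected in
    "Insert data from step", while ${ORDER_STATUS} is only expanded when the "Replace variables in
    script?" option is checked.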

    • Using Kettle variables in your queries
    • Sending data to a database

    Demo – loading a table with a list of manufacturers

    • Inserting or updating data by using other PDI steps

    Demo – inserting new products or updating existing ones

    Demo – testing the update of existing products

    • Eliminating data from a database

    Demo – deleting data about discontinued items

  • Demo – populating the Jigsaw database
    • Exploring the Jigsaw database model
    • Doing simple lookups

    Demo – using a Database lookup step to create a list of products to buy

    Performing complex lookups

    Demo – using a Database join step to create a list of suggested  products to buy

    • Joining data from the database to the stream data by using a Database join step
    • Introducing dimensional modeling
    • Loading dimensions with data

    Demo – loading a region dimension with a Combination  lookup/update step

    • Describing data with dimensions
    • Loading Type I SCD with a Combination lookup/update
    • Storing history of changes

    Demo – keeping a history of changes in products by using the  Dimension lookup/update

    • Keeping an entire history of data with a Type II SCD
    • Loading Type II SCDs with the Dimension lookup/update step
  • Introducing PDI jobs

    Demo – creating a folder with a Kettle job

    Executing processes with PDI jobs

    Using Spoon to design and run jobs

    Demo – creating a simple job and getting familiar with the design  process

    Changing the flow of execution on the basis of conditions

    Looking at the results in the Execution Results window

    Running transformations from jobs

    Demo – generating a range of dates and inspecting how things  are running

    Using the Transformation job entry

    Receiving arguments and parameters in a job

    Demo – generating a hello world file by using arguments and  parameters

    Using named parameters in jobs

    Running jobs from a terminal window

    Demo – executing the hello world job from a terminal window

    Demo – calling the hello world transformation with fixed arguments  and parameters

    Deciding between the use of a command-line argument or a named parameter
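
    For reference, a sketch of how the hello world job might be launched with Kitchen, the
    command-line runner for jobs (the file path, parameter name, and argument value are examples
    only):

        sh kitchen.sh -file=/home/pdi/jobs/hello_world.kjb -param:HELLOFOLDER=chapter_files Maria

    The -param: option supplies a value for a named parameter defined in the job, while the
    trailing value is passed as a plain command-line argument; kitchen.bat accepts the same
    options on Windows.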

    Creating Advanced Transformations and Jobs

    Re-using part of your transformations

    Demo – calculating statistics with the use of a sub-transformation

    Creating a job as a process flow

    Demo – generating top average scores by copying and getting rows

    Using the copy/get rows mechanism to transfer data between transformations

    Demo – generating custom files by executing a transformation for  every input row

    Executing for each row

    Enhancing your processes with the use of variables

    Demo – generating custom messages by setting a variable with the  name of the examination file

    Case Study: Developing and Implementing a Sample Data Mart

    Exploring the sales data mart

    Deciding the level of granularity

    Loading the dimensions

    Demo – loading the dimensions for the sales data mart

    Extending the sales data mart model

    Loading a fact table with aggregated data

    Demo – loading the sales fact table by looking up dimensions

    Getting the data from the source with SQL queries

    Translating the business keys into surrogate keys

    Obtaining the surrogate key for Type I SCD

    Obtaining the surrogate key for Type II SCD

    Obtaining the surrogate key for the Junk dimension

    Obtaining the surrogate key for the Time dimension

    Getting facts and dimensions together

    Demo – loading the fact table using a range of dates obtained  from the command line

    Demo – loading the SALES star schema model

    Automating the administrative tasks

    Demo – automating the loading of the sales data mart

    Real-Time Data Integration

  • Introduction to Real-Time ETL
  • Real-Time Challenges
  • Requirements
  • Transformation Streaming
  • A Practical Example of Transformation Streaming
  • Debugging
  • Third-Party Software and Real-Time Integration
  • Java Message Service
  • Creating a JMS Connection and Session
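
    To make the JMS items above concrete, a minimal sketch of creating a connection and session
    with the standard javax.jms API (the ActiveMQ broker URL and queue name are assumptions for
    the example):

        import javax.jms.Connection;
        import javax.jms.ConnectionFactory;
        import javax.jms.Destination;
        import javax.jms.MessageConsumer;
        import javax.jms.Session;
        import org.apache.activemq.ActiveMQConnectionFactory;

        public class JmsSessionSketch {
            public static void main(String[] args) throws Exception {
                // Broker URL and queue name are illustrative
                ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
                Connection connection = factory.createConnection();
                connection.start();
                // Non-transacted session, messages acknowledged automatically
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Destination queue = session.createQueue("pdi.incoming");
                MessageConsumer consumer = session.createConsumer(queue);
                // ... receive messages here and feed them to a streaming transformation ...
                consumer.close();
                session.close();
                connection.close();
            }
        }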

    Performance Tuning

  • Transformation Performance: Finding the Weakest Link
  • Finding Bottlenecks by Simplifying
  • Finding Bottlenecks by Measuring
  • Copying Rows of Data
  • Improving Transformation Performance
  • Using Lazy Conversion for Reading Text Files
  • Single-File Parallel Reading, Multi-File Parallel Reading
  • Configuring the NIO Block Size
  • Changing Disks and Reading Text Files
  • Improving Performance in Writing Text Files
  • Using Lazy Conversion for Writing Text Files
  • Parallel Files Writing
  • Improving Database Performance
  • Avoiding Dynamic SQL, Handling Roundtrips
  • Handling Relational Databases
  • Sorting Data, Sorting on the Database, Sorting in Parallel
  • Reducing CPU Usage
  • Optimizing the Use of JavaScript
  • Launching Multiple Copies of a Step
  • Selecting and Removing Values
  • Managing Thread Priorities
  • Adding Static Data to Rows of Data
  • Limiting the Number of Step Copies
  • Avoiding Excessive Logging
  • Improving Job Performance
  • Loops in Jobs
  • Database Connection Pools

    Parallelization, Clustering, and Partitioning

  • Multi-Threading
  • Row Distribution, Row Merging, and Row Redistribution
  • Data Pipelining
  • Consequences of Multi-Threading
  • Database Connections
  • Order of Execution, Parallel Execution in a Job
  • Using Carte as a Slave Server (see the startup sketch at the end of this list)
  • The Configuration File
  • Defining Slave Servers
  • Remote Execution, Monitoring Slave Servers
  • Carte Security, Services
  • Clustering Transformations
  • Defining a Cluster Schema, Designing Clustered Transformations
  • Execution and Monitoring
  • Metadata Transformations
  • Rules
  • Data Pipelining, Partitioning
  • Defining a Partitioning Schema
  • Objectives of Partitioning, Implementing Partitioning
  • Internal Variables
  • Database Partitions
  • Partitioning in a Clustered Transformation
  • Working with Repositories
  • Pan/Kitchen – Launching Transformations and Jobs from the Command-Line
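
    Following up on the Carte items above, a short sketch of bringing up a Carte slave server (the
    host name and port are examples; the default credentials shipped with PDI are typically
    cluster / cluster, stored in pwd/kettle.pwd):

        sh carte.sh 192.168.1.10 8081

    The same host name and port are then used when the slave server is defined in Spoon or in a
    cluster schema for remote and clustered execution.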