Wednesday, 4 January 2017

Learning Pentaho

It’s easy to start and learn Pentaho on its own. In-build sample code and easy to connect various database with drag and drop option base on Java. Adding to beauty it has been provide with Kittel and Pan Scripts, which make it unique in nature. My first project is just for presentation point of view. With time I explore some of best feature inside Pentaho.

Projecting to requirement# setting up ETL (Extract, Transform, Load) to extract data from source system to target database, is going to be first phase in every metadata or migration projects. Sometime business need complete data or incremental data. Easy part is move complete set of data in one go, but what about incremental data logic!

In short, I am presenting incremental data extraction concept on tables in this document.

Higher level# we build one audit table which hold our job and time to start for next go. Start time can be your business start date or sysdate. Next is building ETL in two phases, first phase which pulls the data from source system and dump (Insert or Update) in target database for backup or analysis or conversion requirement. Second phase which control or watch all the dependent process get completed and finally update your log table.

What I try to do? In one transformation I try to move the data for number of tables and maintain their joint log buy running them against one Job Code.

Detail level# to understand this concept I added screen shots taken at various level.

Step1. To create log table in source schema database

Figure 1. Log table with two column (Job and Start Date)

Step2. To create ETL in Pentaho, developer has to build Table and index to hold data in target system. (Target system can be Oracle or MySQl or any other system to meet business requirement). In one project there is possibility of having any number of tables with different structure from source objects, just like view in Oracle Database. It means Pentaho gives you capability to select some column from table A and some column from table B and move it to single table of target database.

Figure 2.Shows list of source database objects and target database object (Naming standard “_IN” source object and “_IU” target database object)

It has workflow like graphical representation. We have applied some set of rules while fetching data from source system. Most standard system maintains value in WHO columns (WHO stand for “log of who has updated or created record in past”). Its fetch records base on creation or last update greater than the log table set value for respective job.

Figure 3. Definition of source records extraction for Po_Lines_All table.

To populate data in target object we just have to select object and columns that mapped to source table. Additional one can set “Table Field” which has to compare target column “Steam Filed” to maintain data integrity in case of records are getting update.

Figure 4.To show sample target object definition.

Till this point it was normal process to build ETL, so in next phase we set additional three nods.

Nodes one “LOG_IN” will hold Job name and system date till the process reach to last node “LOG_IU”. On reach it system will update log table defined in LOG_IU. Update value signify till what date you have data imported to target system. In next run system will check log table and start processing job for update or insert by looking at WHO columns.

Figure 5.Monitor logic flow

** On LOG_IN node **

Block function has capability to hold any job. i.e. For set duration or making it dependent on other job. Using this fetcher I was able to update one record for all the data migrating object in log table. This way we have used single JOB CODE for number of migrated object and also we are able to bundle them as a part of one transformation. To complete our process we still have to review the definition of Block node and LOG_IU node. Below screen shot is for Block node definition. This shows lists of steps need to complete to move flow last node.

Figure 6.To show steps name. On Completing this steps workflow will move to LOG_IU.

Figure 7.LOG_IU.

Once all the data get migrate system will update log table start data column.

Figure 8.To Show various jobs in running status and Block node is checking competition status

Above figure will even give you brief idea about loading speed and records that are process so far with their status. As ETL analyst one call think of calling the transformation by using Kettel and move it next level.

Some more screen shot for your knowledge purpose to Database connections and In/Out tables.

Arpit