Learning Pentaho
It’s easy to start learning Pentaho on your own. It comes with in-built sample code, and connecting to various databases is easy with its drag-and-drop, Java-based interface. Adding to its beauty, it ships with the Kitchen and Pan scripts, which make it unique in nature. My first project was just from a presentation point of view; with time I explored some of the best features inside Pentaho.
Project requirement: setting up ETL (Extract, Transform, Load) to extract data from a source system to a target database is going to be the first phase in every metadata or migration project. Sometimes the business needs complete data, sometimes incremental data. The easy part is moving the complete data set in one go, but what about the incremental data logic?
In short, I am presenting the concept of incremental data extraction on tables in this document.
Higher level: we build one audit table which holds our job and the start time for the next run. The start time can be your business start date or sysdate. Next is building the ETL in two phases: the first phase pulls the data from the source system and dumps it (Insert or Update) into the target database for backup, analysis or conversion requirements; the second phase controls or watches all the dependent processes until they complete and finally updates your log table.
What did I try to do? In one transformation I try to move the data for a number of tables and maintain their joint log by running them against one Job Code.
Detail level: to understand this concept I have added screenshots taken at various levels.
Step 1. Create a log table in the source schema database.
Figure 1. Log table with two columns (Job and Start Date)
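A minimal sketch of such a log table, assuming Oracle-style syntax and the column names JOB_CODE and START_DATE (the actual names and types shown in Figure 1 may differ):

-- Hypothetical log table: one row per job, tracking how far data has been pulled.
CREATE TABLE etl_job_log (
    job_code   VARCHAR2(100) NOT NULL,   -- identifies the migration job
    start_date DATE          NOT NULL,   -- data has been extracted up to this date
    CONSTRAINT etl_job_log_pk PRIMARY KEY (job_code)
);

-- Seed the first run with the business start date (or SYSDATE).
INSERT INTO etl_job_log (job_code, start_date)
VALUES ('PO_MIGRATION', SYSDATE);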
Step 2. To create the ETL in Pentaho, the developer has to build the table and index that will hold the data in the target system. (The target system can be Oracle, MySQL or any other system that meets the business requirement.) In one project there is the possibility of having any number of tables whose structure differs from the source objects, just like a view in an Oracle database. It means Pentaho gives you the capability to select some columns from table A and some columns from table B and move them into a single table of the target database.
Figure 2. List of source database objects and target database objects (naming standard: “_IN” for source objects and “_IU” for target database objects)
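For example, the target object can be shaped like a view over two source tables. A rough sketch of pulling some columns from table A and some from table B into one combined row set; the table and column names are my assumptions based on standard Oracle purchasing tables:

-- Illustrative only: a source query that takes some columns from "table A"
-- (po_headers_all) and some from "table B" (po_lines_all); Pentaho then lands
-- the combined rows in one target table, much like a view in an Oracle database.
SELECT h.po_header_id,
       h.segment1          AS po_number,
       l.po_line_id,
       l.item_description,
       l.quantity,
       l.creation_date,
       l.last_update_date
FROM   po_headers_all h
JOIN   po_lines_all   l
       ON l.po_header_id = h.po_header_id;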
It has a workflow-like graphical representation. We have applied a set of rules while fetching data from the source system. Most standard systems maintain values in WHO columns (WHO stands for the “log of who has created or updated a record in the past”). The fetch picks up records whose creation or last update date is greater than the value set in the log table for the respective job.
Figure 3. Definition of the source record extraction for the Po_Lines_All table.
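The extraction rule might look roughly like the query below; the WHO column names (CREATION_DATE, LAST_UPDATE_DATE), the job code and the log table name are my assumptions:

-- Incremental pull: only rows created or updated after the date logged for this job.
SELECT pla.*
FROM   po_lines_all pla
WHERE  GREATEST(pla.creation_date, pla.last_update_date) >
       (SELECT l.start_date
        FROM   etl_job_log l
        WHERE  l.job_code = 'PO_MIGRATION');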
To populate data in the target object we just have to select the object and the columns that map to the source table. Additionally, one can set the “Table Field” which is compared against the incoming “Stream Field” to maintain data integrity in case records are getting updated.
Figure 4. Sample target object definition.
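In plain SQL, this insert-or-update behaviour (the “Table Field” compared against the “Stream Field”) is roughly a merge. A sketch with an assumed key column and a hypothetical incoming_rows source standing in for the stream:

-- Roughly what the insert/update into the target does: match on the key field,
-- update the row if it already exists in the target, insert it otherwise.
MERGE INTO po_lines_target tgt
USING incoming_rows src                     -- the stream coming from the source query
ON    (tgt.po_line_id = src.po_line_id)     -- "Table Field" compared with "Stream Field"
WHEN MATCHED THEN
    UPDATE SET tgt.item_description = src.item_description,
               tgt.quantity         = src.quantity,
               tgt.last_update_date = src.last_update_date
WHEN NOT MATCHED THEN
    INSERT (po_line_id, item_description, quantity, last_update_date)
    VALUES (src.po_line_id, src.item_description, src.quantity, src.last_update_date);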
Till this point it was the normal process of building an ETL; in the next phase we add three additional nodes.
Node one, “LOG_IN”, will hold the job name and system date until the process reaches the last node, “LOG_IU”. On reaching it, the system will update the log table defined in LOG_IU. The updated value signifies up to what date you have data imported into the target system. In the next run the system will check the log table and start processing the job for update or insert by looking at the WHO columns.
Figure 5. Monitor logic flow
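My reading of the LOG_IN node is that it simply selects the job code and the current system date, and that single row is held back until every table load has finished; a sketch (job code assumed):

-- The one row produced by LOG_IN: job code plus the current run date.
-- It waits at the Block step and is only written out by LOG_IU at the end.
SELECT 'PO_MIGRATION' AS job_code,
       SYSDATE        AS start_date
FROM   dual;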
** On the LOG_IN node **
The Block function has the capability to hold any job, i.e. for a set duration or by making it dependent on other jobs. Using this feature I was able to update one record in the log table for all the objects being migrated. This way we have used a single JOB CODE for a number of migrated objects and we are also able to bundle them as part of one transformation. To complete our process we still have to review the definitions of the Block node and the LOG_IU node. The screenshot below is the Block node definition; it shows the list of steps that need to complete before the flow moves to the last node.
Figure 6. Step names in the Block node. On completing these steps the workflow will move to LOG_IU.
Figure 7. LOG_IU node definition.
Once all the data gets migrated, the system will update the log table's start date column.
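The final update carried out by LOG_IU then amounts to something like this (names assumed, matching the earlier sketches):

-- Advance the job's start date so the next run only picks up newer or changed rows.
UPDATE etl_job_log
SET    start_date = SYSDATE          -- or the run date captured by LOG_IN
WHERE  job_code   = 'PO_MIGRATION';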
Figure 8. Various jobs in running status while the Block node checks their completion status.
The above figure will even give you a brief idea about loading speed and the records processed so far, with their status. As an ETL analyst one can think of calling the transformation by using Kettle and moving it to the next level.
Some more screenshots, for your reference, of the database connections and the In/Out tables.
By Arpit