Monday 6 May 2013

How to Read Multiple Files in Pentaho PDI (ETL)

Reading several files at the same time

Sometimes you have several files to read, all with the same structure but different data. In this recipe, you will see how to read those files in a single step. The example uses a list of files containing names of museums in Italy.

Getting ready
--------------------------------------

You must have a group of text files in a directory, all with the same format. In this recipe, the names of these files start with museums_italy_ and end with the .txt extension, for example, museums_italy_1.txt, museums_italy_2.txt, museums_italy_roma.txt, museums_italy_genova.txt, and so on.

Each file has a list of names of museums, one museum on each line.
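
If you do not have such files at hand, a small script can create a test set. The following is only a sketch: the file contents are placeholders rather than real museum names, the helper name create_sample_files is made up for this example, and the .txt extension is assumed because the regular expression used later in the recipe requires it.

```python
import tempfile
from pathlib import Path


def create_sample_files(directory: Path) -> list[Path]:
    """Write a few placeholder museum files, one museum name per line."""
    # Placeholder content only; these are not real museum names.
    samples = {
        "museums_italy_1.txt": ["Placeholder Museum A", "Placeholder Museum B"],
        "museums_italy_2.txt": ["Placeholder Museum C"],
        "museums_italy_roma.txt": ["Placeholder Museum D", "Placeholder Museum E"],
    }
    created = []
    for name, museums in samples.items():
        path = directory / name
        path.write_text("\n".join(museums) + "\n", encoding="utf-8")
        created.append(path)
    return created


if __name__ == "__main__":
    # A temporary directory is used here; point this at any directory you like.
    target = Path(tempfile.mkdtemp(prefix="museums_"))
    for path in create_sample_files(target):
        print(path)
```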

How to do it...
--------------------------------------

Carry out the following steps:
1. Create a new transformation.
2. Drop a Text file input step onto the work area.
3. On the File tab, type the directory where the files are in the File or directory textbox.
4. In the Regular Expression textbox, type: museums_italy_.*\.txt
5. Then click on the Add button. The grid will be populated with a row that holds the directory and the regular expression.
The directory can be given as ${Internal.Transformation.Filename.Directory}. This is a variable that will be replaced at run-time with the full path of the directory containing the current transformation. Note that the variable will be undefined until you save the transformation, so you must save before running a preview of the step.
You don't have to type the complete name of the ${Internal.Transformation.Filename.Directory} variable. You can select it from a list that is created automatically when you press Ctrl+Space.


6. Under the Fields tab, add one row: type museum for the Name column and String under the Type column.
7. Save the transformation in the same directory where the museum files are located.

When you preview the step, you will obtain a dataset with the content of all the museum files.
---------------------
How it works...
--------------
With Kettle, it is possible to read more than one file at a time using a single Text File Input step.
In order to get the content of several files, you can add names to the grid row by row. If the names of the files share the path and some part of their names, you can also specify them by using regular expressions, as shown in the recipe. If you enter a regular expression, Kettle will take all the files whose names match it. In the recipe, the files that matched museums_italy_.*\.txt were considered as input files.

museums_italy_.*\.txt means "all the files starting with museums_italy_ and having the txt extension". You can test whether the regular expression is correct by clicking on the Show filename(s)... button, which shows a list of all the files that match the expression.

If you fill the grid with the names of several files (with or without using regular expressions), Kettle will create a dataset with the content of all of those files, one after the other.
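
The behavior described above can be approximated outside Kettle with a short Python sketch. It is not Kettle's implementation: the function name read_matching_files is made up for this example, the sketch assumes that the expression has to match the whole file name, and it sorts the files only to get a predictable order, which Kettle may not do.

```python
import re
import tempfile
from pathlib import Path


def read_matching_files(directory: Path, pattern: str) -> list[str]:
    """Return the lines of every file in `directory` whose name matches `pattern`."""
    regex = re.compile(pattern)
    rows = []
    # Sorting is an assumption of this sketch, made only for a stable order.
    for path in sorted(directory.iterdir()):
        # fullmatch requires the expression to cover the entire file name.
        if path.is_file() and regex.fullmatch(path.name):
            rows.extend(path.read_text(encoding="utf-8").splitlines())
    return rows


if __name__ == "__main__":
    # Build a throwaway directory with placeholder data to try the function.
    folder = Path(tempfile.mkdtemp(prefix="museums_"))
    (folder / "museums_italy_1.txt").write_text("Placeholder A\n", encoding="utf-8")
    (folder / "museums_italy_roma.txt").write_text("Placeholder B\n", encoding="utf-8")
    (folder / "notes.txt").write_text("ignored\n", encoding="utf-8")
    for row in read_matching_files(folder, r"museums_italy_.*\.txt"):
        print(row)
```

The file notes.txt is skipped because its name does not match the expression, while the rows of the two matching files end up in a single list, one file after the other.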
To learn more about regular expressions, you can visit the following URLs:

http://www.regular-expressions.info/quickstart.html
http://java.sun.com/docs/books/tutorial/essential/regex/

There's more...
--------------------------------------

In the recipe, you read several files. It might happen that you have to read just one file, but you don't know the exact name of the file.

One example of that is a file whose name is a fixed text followed by the current year and month as in samplefile_201012.txt.
The recipe is useful in cases like that as well. In this example, if you don't know the name of the file, you will still be able to read it by typing the following regular expression: samplefile_20[0-9][0-9](0[1-9]|1[0-2])\.txt
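
To see which names this expression accepts, you can try it against a few candidates. The sketch below uses Python's re module rather than Kettle, and assumes, as before, that the expression has to match the whole file name. The helper name is_monthly_sample_file and all the candidate names other than samplefile_201012.txt are made up for this example.

```python
import re

# Fixed text, a year from 2000 to 2099, a month from 01 to 12, then .txt
PATTERN = re.compile(r"samplefile_20[0-9][0-9](0[1-9]|1[0-2])\.txt")


def is_monthly_sample_file(name: str) -> bool:
    """Return True if the whole file name matches the expression."""
    return PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    candidates = [
        "samplefile_201012.txt",  # valid year and month
        "samplefile_201013.txt",  # month 13 does not exist
        "samplefile_201000.txt",  # month 00 does not exist
        "samplefile_199912.txt",  # the year does not start with 20
        "samplefile_201012.csv",  # wrong extension
    ]
    for name in candidates:
        print(name, is_monthly_sample_file(name))
```

Note that the expression accepts any valid year and month, not only the current one. If the directory holds files for several months, all of them will match and be read.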
