Extract-Transform-Load (ETL) is a process used to take information from one or more sources, normalize it to some convenient schema, and then insert it into some other repository.
A common use is data warehousing, where regular updates from one or more systems are merged and refined so that analysis can be done using more specialized tools. Typically the same process is run over and over as new data appears in the source application(s).
To demonstrate this, we'll use a use case that is similar in concept. In our ETL development example, the human resources department uses a recruiting tool built on a dBase- or FoxPro-like data structure, but the employee data is stored in Microsoft SQL Server.
For reference, here is the new_empl.dbf file; using an XSLT transform that reads a table and displays it as HTML, we see this:
new_empl.dbf:

| NAME | STREET | CITY | STATE | ZIP | DEAR WHO | TEL HOME | BIRTH DATE | HIRE DATE | INSIDE |
|---|---|---|---|---|---|---|---|---|---|
| Guiles, Makenzie | 145 Meadowview Road | South Hadley | MA | 01075 | Macy | (413)555-6225 | 19770201 | 20060703 | yes |
| Forbes, Andrew | 12 Yarmouth Drive | Holyoke | MA | 01040 | Andy | (413)555-8863 | 19600330 | 20060710 | yes |
| Barra, Alexander | 4327 Spring Drive | Sandwich | MA | 02537 | Al | (508)555-8974 | 19861221 | 20060717 | no |
| Mellon, Philip | 123 Pendexter Street | Boston | MA | 01321 | Phil | (781)555-4289 | 19700625 | 20060724 | no |
| Clark, Pamela | 44 Mayberry Circle | Bedford | MA | 01730 | Pam | (860)555-6447 | 19500502 | 20060731 | yes |
Our steps, then, will be to Extract, Transform, and Load.
Using the XML Converters, extracting the data into something usable is very simple. Just do File|Open, select the new_empl.dbf file, and choose the dBase III converter to transform the data into XML. The URL that is used shows that anywhere XML can be used, we can also use any data source for which we have a converter. Our URL in this case is:
adapter:dBase_III?file:///c:/etl/new_empl.dbf
Adding the adapter: prefix enables the converter engine. The dBase_III portion chooses the converter; there are many converters available, including some very sophisticated ones such as EDI, which automatically handles EDIFACT, X12, HL7, IATA, EANCOM, and other EDI dialects. The file:///c:/etl/new_empl.dbf portion is just the location of the file we want to use as input.
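To make that concrete, here is a minimal sketch (our own illustration, not part of the demo files) of using the converter URL anywhere an XML document URL is accepted, in this case inside an XSLT document() call. It assumes the stylesheet is run in an environment whose URI resolver understands adapter: URLs:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative only: count the rows that the dBase III converter produces. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- The converter URL is used exactly like any other document URL. -->
    <xsl:variable name="employees"
                  select="document('adapter:dBase_III?file:///c:/etl/new_empl.dbf')"/>
    <xsl:value-of select="concat('Rows converted: ', count($employees/table/row))"/>
  </xsl:template>
</xsl:stylesheet>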
The result of this extraction will be an XML file similar to this:
<?xml version="1.0" encoding="UTF-8"?>
<table date="20060731" rows="5">
  <row row="1">
    <NAME>Guiles, Makenzie</NAME>
    <STREET>145 Meadowview Road</STREET>
    <CITY>South Hadley</CITY>
    <STATE>MA</STATE>
    <ZIP>01075</ZIP>
    <DEAR_WHO>Macy</DEAR_WHO>
    <TEL_HOME>(413)555-6225</TEL_HOME>
    <BIRTH_DATE>19770201</BIRTH_DATE>
    <HIRE_DATE>20060703</HIRE_DATE>
    <INSIDE>yes</INSIDE>
  </row>
  ...
</table>
While we are at it, let's find out what our target schema is. To do this, we use the DB to XML Data Source module. We're using the standard demonstration database that comes with SQL Server, called Northwind.
To build our map, we'll use SQL/XML, which lets us work with any arbitrary subset of the available data; the defaults are fine for this demonstration. Connect to your database server (localhost in our case) and you should see the schema of your database. Open the Northwind database, drag the Employees table onto the canvas, and choose "SELECT (SQL/XML)..." from the menu. Since we're also going to use this later to insert data into the database, switch to the UPDATE tab, drag Employees again, and this time choose "INSERT...".
Now, save this as etl-target.rdbxml.
(For more details on SQL/XML, see the SQL/XML Tutorial.)
We're going to use a series of XSLT transforms to modify this. (We could do it in one large XSLT file, but we're breaking it into steps for demonstration purposes. In a production ETL operation, each step would likely be more complicated and/or use different technologies or methods.)
The four transforms are:
- etl-code-1.xsl
- etl-code-2.xsl
- etl-code-3.xsl
- etl-code-4.xsl

For the last step above (the schema mapping), we can use the XSLT mapper to generate the code for us; an illustrative sketch of the first step follows below.
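The first transform, for example, turns the compact YYYYMMDD dates from the dBase file into the dashed form the target database expects. A minimal sketch of such a date-fixing step might look like the following (our own example, not the actual etl-code-1.xsl):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch of a date-fixing step: copies everything through
     unchanged, but rewrites BIRTH_DATE and HIRE_DATE from YYYYMMDD to YYYY-MM-DD. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Identity template: copy nodes as-is by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Rewrite the two date fields. -->
  <xsl:template match="BIRTH_DATE|HIRE_DATE">
    <xsl:copy>
      <xsl:value-of select="concat(substring(., 1, 4), '-',
                                   substring(., 5, 2), '-',
                                   substring(., 7, 2))"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>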
We use the output from the third step on the left, and the etl-target.rdbxml we saved earlier on the right, and draw lines connecting the corresponding fields.
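The mapper generates this mapping stylesheet for us, but conceptually it is just field-for-field renaming into the shape the etl-target.rdbxml map expects. A hand-written sketch of the same idea might look like this (illustrative only: the LAST_NAME and FIRST_NAME element names are assumed outputs of the earlier name-split step, and the exact structure the map expects is not shown in this article):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch of the schema-mapping step: renames the converter's
     field names to column names used by the Employees target. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/table">
    <Employees>
      <xsl:for-each select="row">
        <row>
          <LastName><xsl:value-of select="LAST_NAME"/></LastName>
          <FirstName><xsl:value-of select="FIRST_NAME"/></FirstName>
          <BirthDate><xsl:value-of select="BIRTH_DATE"/></BirthDate>
          <HireDate><xsl:value-of select="HIRE_DATE"/></HireDate>
          <Address><xsl:value-of select="STREET"/></Address>
          <City><xsl:value-of select="CITY"/></City>
          <Region><xsl:value-of select="STATE"/></Region>
          <PostalCode><xsl:value-of select="ZIP"/></PostalCode>
          <HomePhone><xsl:value-of select="TEL_HOME"/></HomePhone>
        </row>
      </xsl:for-each>
    </Employees>
  </xsl:template>
</xsl:stylesheet>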
At this point, loading is just a matter of writing the output of the last XSLT transform step into the etl-target.rdbxml map we built earlier. But rather than run all of those steps one by one, we can use the XML Pipeline to automate the task.
And now we can execute our XML pipeline in the simulator by choosing XMLPipeline|Execute.
Creating the core of a service is as simple as pressing a button. Answer a couple of questions, and code is created. Following the instructions in the generated code will let you run it. Since the development environment includes an integrated Java IDE, you can run it right from there. Here's what a portion of the code might look like for the standard Java deployer:
public boolean init() throws Exception {
    // Create all the objects required for the pipeline
    convertToXML = new ConvertToXML("Convert to XML", getEnvironment());
    convertToXML.setAdapter("adapter:dBase_III");
    fixDates = new SaxonXSLTOperation("Fix Dates", ContentType.UNKNOWN, getEnvironment());
    nameSplit = new SaxonXSLTOperation("Name Split", ContentType.UNKNOWN, getEnvironment());
    inside_OutsideSales = new SaxonXSLTOperation("Inside/Outside Sales", ContentType.UNKNOWN, getEnvironment());
    mapSchema = new SaxonXSLTOperation("Map Schema", ContentType.XML, getEnvironment());
    reportOnOutput = new SaxonXSLTOperation("Report on output", ContentType.XML, getEnvironment());
    edge_2 = new EdgeImpl("edge_2", getEnvironment(), DataType.NODE, DataType.DOCUMENT);
    edge_3 = new EdgeImpl("edge_3", getEnvironment(), DataType.ANY, DataType.DOCUMENT);
    edge_4 = new EdgeImpl("edge_4", getEnvironment(), DataType.ANY, DataType.DOCUMENT);
    edge_5 = new EdgeImpl("edge_5", getEnvironment(), DataType.ANY, DataType.DOCUMENT);
    edge_6 = new EdgeImpl("edge_6", getEnvironment(), DataType.NODE, DataType.DOCUMENT);

    // Connect all the operation objects with edge objects.
    convertToXML.setInputUrl("new_empl.dbf", DataType.NONE);
    convertToXML.addOutputEdge(edge_2);
    fixDates.setScriptUrl("etl-step-1.xsl");
    fixDates.addInputEdge(edge_2);
    fixDates.addOutputEdge(edge_3);
    nameSplit.setScriptUrl("etl-step-2.xsl");
    nameSplit.addInputEdge(edge_3);
    nameSplit.addOutputEdge(edge_4);
    inside_OutsideSales.setScriptUrl("etl-step-3.xsl");
    inside_OutsideSales.addInputEdge(edge_4);
    inside_OutsideSales.addOutputEdge(edge_5);
    mapSchema.setScriptUrl("etl-step-4.xsl");
    mapSchema.addInputEdge(edge_5);
    mapSchema.addOutputEdge(edge_6);
    mapSchema.setOutputUrl("db:///c:/etl/etl-target.rdbxml", DataType.NODE);
    reportOnOutput.setScriptUrl("generic_table_dumper.xsl");
    reportOnOutput.addInputEdge(edge_6);
    reportOnOutput.setOutputUrl("etl-log.html", DataType.NODE);

    return true;
}
But we might want to see what's happening, so after that we just drag and drop our pal generic_table_dumper.xsl from earlier onto the pipeline, and draw a line from the output of our last step to the start of this one. We'll also set the output of this step to be etl-log.html.
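For the curious, a generic table-to-HTML dumper of this kind can be surprisingly small. Here is a minimal sketch of the idea (our own illustration, not the shipped generic_table_dumper.xsl):

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch of a generic "dump any table-shaped XML as HTML" step. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="yes"/>
  <xsl:template match="/">
    <html>
      <body>
        <table border="1">
          <!-- Header row taken from the element names of the first record. -->
          <tr>
            <xsl:for-each select="/*/*[1]/*">
              <th><xsl:value-of select="name()"/></th>
            </xsl:for-each>
          </tr>
          <!-- One row per record, one cell per field. -->
          <xsl:for-each select="/*/*">
            <tr>
              <xsl:for-each select="*">
                <td><xsl:value-of select="."/></td>
              </xsl:for-each>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>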
Now when we run it, the input data is refined and written to the database, and we get a log of the results that should look something like this when you open the .html file:
new_empl.dbf:

| EmployeeID | LastName | FirstName | Title | TitleOfCourtesy | BirthDate | HireDate | Address | City | Region | PostalCode | Country | HomePhone | Extension | Notes | ReportsTo | PhotoPath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | Guiles | Makenzie | | Macy | 1977-02-01 00:00:00.0 | 2006-07-03 00:00:00.0 | 145 Meadowview Road | South Hadley | MA | 01075 | USA | (413)555-6225 | | | 8 | |
| 11 | Forbes | Andrew | | Andy | 1960-03-30 00:00:00.0 | 2006-07-10 00:00:00.0 | 12 Yarmouth Drive | Holyoke | MA | 01040 | USA | (413)555-8863 | | | 8 | |
| 12 | Barra | Alexander | | Al | 1986-12-21 00:00:00.0 | 2006-07-17 00:00:00.0 | 4327 Spring Drive | Sandwich | MA | 02537 | USA | (508)555-8974 | | | 5 | |
| 13 | Mellon | Philip | | Phil | 1970-06-25 00:00:00.0 | 2006-07-24 00:00:00.0 | 123 Pendexter Street | Boston | MA | 01321 | USA | (781)555-4289 | | | 5 | |
| 14 | Clark | Pamela | | Pam | 1950-05-02 00:00:00.0 | 2006-07-31 00:00:00.0 | 44 Mayberry Circle | Bedford | MA | 01730 | USA | (860)555-6447 | | | 8 | |
At this point, what we've done is build a service that can handle extracting information from one source, transforming it in a series of operations, and loading it into another source. We've then gone on and built a logging system, and even packaged up the whole set of steps as a single pipeline and generated the code to support it.
This is only the beginning. Other steps could involve conditional processing, validation of intermediate and/or final results, error handling, and conversion to other file formats. Entire pipelines can be included as modules inside of other pipelines, and parameters can be passed in to XSLT and XQuery.
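As a taste of that last point, here is a minimal, hypothetical sketch of an XSLT step that accepts a pipeline-supplied parameter; the earliest-hire parameter name and the filtering rule are our own invention:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: a pipeline-supplied parameter controlling which
     hires a step lets through; "earliest-hire" is a hypothetical name. -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="earliest-hire" select="'20060101'"/>
  <xsl:template match="/table">
    <table>
      <!-- Keep only rows hired on or after the supplied YYYYMMDD date. -->
      <xsl:copy-of select="row[HIRE_DATE &gt;= $earliest-hire]"/>
    </table>
  </xsl:template>
</xsl:stylesheet>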
See the other demonstrations for more examples of the power of XML Pipelines.