The Misunderstanding of ETL - Extract Transform Load


When it comes to migrating data, many clients who need some form of data migration or data integration throw out the term ETL: Extract, Transform, and Load.

When I worked for a leading Software as a Service (SaaS) company performing enterprise data migrations, our management team had a very difficult time understanding why most enterprise data migrations were difficult and complex. Even the sales folks would ask, "Why can't we just click a button and have all of the customer's source data automatically migrate into our system?"

Well, to understand why, let's look at the leading charts and diagrams for what exactly ETL is. Searching Google Images for "ETL diagram" yields the following as the first results:

ETL Diagram 1

ETL Diagram #1

The above diagram from Google Images doesn't really express much, other than that there are some source systems and that the large blue arrow points at a Data Warehouse target. Beyond the Extract, Transform, and Load process not being described very well by this diagram, notice that there is no mention of analysis anywhere within it.

ETL Diagram 2

ETL Diagram #2

Let's look at the next diagram, #2. This one is a bit better, as it does describe the flow of ETL: extract the important data into a staging area, then transform and load it into a data warehouse.
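The extract-stage-transform-load flow that diagram describes can be sketched in a few lines of code. This is a minimal illustration, not a production pipeline: the source systems, field names, and in-memory "warehouse" below are all hypothetical stand-ins.

```python
# A minimal sketch of the Extract -> Staging -> Transform -> Load flow.
# The sources, field names, and in-memory "warehouse" are illustrative.

def extract(sources):
    """Pull raw records from each source system into a staging list."""
    staging = []
    for source_name, records in sources.items():
        for record in records:
            # Tag each record with its origin so it can be traced later.
            staging.append({"source": source_name, **record})
    return staging

def transform(staging):
    """Apply simple business rules: normalize keys, drop incomplete rows."""
    transformed = []
    for row in staging:
        if not row.get("part_number"):
            continue  # incomplete record, omit from the load
        transformed.append({
            "source": row["source"],
            "part_number": row["part_number"].strip().upper(),
            "description": (row.get("description") or "").strip(),
        })
    return transformed

def load(transformed, warehouse):
    """Append the transformed rows into the target 'warehouse'."""
    warehouse.extend(transformed)
    return warehouse

# Two hypothetical source systems with slightly inconsistent data.
sources = {
    "sql_server_1": [{"part_number": " ab-100 ", "description": "Bolt"}],
    "oracle": [{"part_number": "AB-200", "description": None},
               {"part_number": "", "description": "orphan row"}],
}

warehouse = load(transform(extract(sources)), [])
print(warehouse)
```

Even this toy version shows where the trouble starts: the "business rules" in `transform` silently drop the orphan row, which is exactly the kind of decision that should come out of analysis, not be buried in code.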

ETL Diagram 3

ETL Diagram #3

Finally, let's check out the third diagram. This one is similar to the second, and provides some examples of the source data systems: SQL Server, a second instance of SQL Server, Oracle, and perhaps an XML source. It shows the same process of extracting data into staging, then transforming it and loading it into a warehouse. It then adds another step of loading the data into a mart, although in my 10 years of experience, I had never heard of loading data into a 'Mart'.

None of the above three diagrams mention any analysis step. And, over the years that I have worked for one of the industry's leading Interactive Parts Catalog SaaS companies for large equipment manufacturers, the process was just about the same: customer sends some data, a technical consultant extracts the data, transforms it according to some business rules, and then loads it into the SaaS system via a publishing engine.

So what's the problem?

Well, if the task at hand was to simply migrate some small dataset, where the data is consistent and the transformations are simple, then there is no problem. But this is hardly ever the case.

In reality, enterprise migrations look a lot more like the second and third diagrams: there are multiple source data systems, the data must be ingested and aggregated (extracted), and then transformed and loaded into a target system. However, it is hardly ever the case that the source data systems are consistent, or that simple transformations will result in an accurate load into the target system. Because the data is hardly ever consistent between, or even within, the source systems, at Data Migration Services we do not adhere to this flawed ETL process. Instead, we follow an A-ETL process, in which a thorough analysis is performed before the ETL steps are even considered.
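To make the "A" in A-ETL concrete, here is a rough sketch of the kind of up-front profiling we mean: scanning the raw sources for missing keys and conflicting values across systems before any extract, transform, or load work begins. The record layout and field names are illustrative assumptions, not a real customer schema.

```python
# A minimal sketch of analysis-first profiling: flag missing keys and
# cross-source conflicts *before* any ETL work starts. Field names and
# source systems are hypothetical.

from collections import defaultdict

def analyze(sources):
    """Report missing keys and conflicting values across source systems."""
    seen = defaultdict(list)  # part_number -> [(source, description), ...]
    findings = []
    for source_name, records in sources.items():
        for record in records:
            key = (record.get("part_number") or "").strip().upper()
            if not key:
                findings.append(f"{source_name}: record missing part_number")
                continue
            seen[key].append((source_name, record.get("description")))
    for key, occurrences in seen.items():
        descriptions = {desc for _, desc in occurrences}
        if len(occurrences) > 1 and len(descriptions) > 1:
            findings.append(
                f"{key}: conflicting descriptions across sources {occurrences}")
    return findings

# Two hypothetical legacy systems describing the same part differently.
sources = {
    "legacy_catalog": [{"part_number": "AB-100", "description": "Bolt"}],
    "erp_export": [{"part_number": "ab-100", "description": "Hex bolt"},
                   {"description": "no key at all"}],
}

for finding in analyze(sources):
    print(finding)
```

Each finding is a question for the business ("which description wins?", "what do we do with keyless records?") that gets answered before the ETL is built, rather than after users discover the mess in the target system.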

A-ETL versus ETL-A

We propose that any enterprise data migration must begin with analysis, hence the term A-ETL. This stands in stark contrast to the standard ETL process, which begins with (E)xtracting the data and usually ends with a mixed mess in the target system. In many of these cases, when the users of the target system begin testing the data, they find inaccurate results, missing data, additional data that should have been omitted, and so on. Only when these issues are reported to the technical team that performed the ETL does that team analyze the problems. Analysis thus becomes one of the last steps, hence the term ETL-A, where analysis comes last. But this hardly works either.

As an anecdote, I once worked on an enterprise data migration in which approximately 10 different legacy parts catalog systems were in use around the world for a large Original Equipment Manufacturer. My team and I thoroughly analyzed the data, locating the outliers and potential problems before ever beginning the ETL process (that is, we followed our A-ETL process: analysis first, then Extract, Transform, and Load). We learned that the same OEM was simultaneously upgrading their PLM system to Teamcenter, and Teamcenter needed to hold the same type of data. Two projects were therefore running in parallel, with similar data and end goals, and at first I was surprised to hear that the Teamcenter technical consultants, a team of five, had been working on their migration for well over a year. Only later, once I realized how the industry is plagued by this ETL process, with analysis generally coming only after problems surface in the target system, did I stop being surprised at how long the Teamcenter data migration was taking compared to ours.

Conclusion

Although analysis is completely omitted from the standard ETL workflow, we generally find that some kind of analysis is eventually performed, but only at the end of the project, once the data has already been through the load phase and end users are testing (a kind of ETL-A process). Instead, at iDataMigration, we always perform a thorough analysis up front, following what we call the new and improved A-ETL process. This A-ETL process is depicted in the following diagram:

New & Improved ETL Diagram

2018 - New and Improved A ETL Diagram
www.iDataMigration.com

Free File Analysis Tool

Do you have a dataset that needs to be analyzed before beginning the ETL process? If the data sources are already databases, you should be able to query the data and run a number of custom reports against it (meaning you may not need a tool). However, if the data exists in a nested folder structure and contains various file formats, we have a tool you can use to analyze and profile that data. Simply zip the root folder (it can contain nested folders and files; the tool will recurse through them for the analysis) and upload it into our Free Analyze Files Tool, located at the bottom of the page. The top of that page discusses the tool and includes a guide on how we built it.
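The kind of recursive profiling such a tool performs can be sketched in a few lines: walk the folder tree and tally file counts and total bytes per file extension. This is an illustrative stand-in, not the actual tool's implementation.

```python
# A rough sketch of recursive file profiling: walk a folder tree and
# tally file counts and total size per extension. Illustrative only.

import os
import tempfile
from collections import Counter

def profile_folder(root):
    """Return per-extension file counts and byte totals under root."""
    counts = Counter()
    sizes = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(no extension)"
            counts[ext] += 1
            sizes[ext] += os.path.getsize(os.path.join(dirpath, name))
    return counts, sizes

# Demonstrate on a small temporary folder tree with nested files.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "nested"))
for relpath, content in [("a.xml", "<x/>"),
                         ("nested/b.csv", "1,2"),
                         ("nested/c.csv", "3,4,5")]:
    with open(os.path.join(root, relpath), "w") as f:
        f.write(content)

counts, sizes = profile_folder(root)
print(dict(counts))
```

A summary like this is often the first analysis artifact in an A-ETL engagement: it tells you which formats you actually have to parse before any extraction code gets written.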

Have a project or need some help?

Help us help you - contact us and let us know what your goal or objective is, and we'll see if it's something we can help you with.