We'll describe the modern data engineering stack, and how Fivetran pulls in data from different sources, loads them into an analytics database like snowflake or redshift, and how to then clean and transform data with a tool like DBT.
Now, the main task of a data engineer is to collect data from various sources that exist in a company and move them into an analytics database. The next step is to ensure the data is accurate by validating and cleaning the data. And the final step is to organize and transform the data so it is easier for employees to perform analysis. This is the work involved in building a data pipeline. Doing this is still harder than it still sounds, but many steps have been made a lot easier over the last few years.
In this post we’ll discuss the data engineering tools that make this task a bit easier. Then in a later post, we’ll talk about the data engineering stack involved when additional data engineering work is required, and data engineers need to move beyond these tools.
Probably the main step that has become easier for data engineers is simply moving the data into a data warehouse. This is referred to as extracting and loading the data. This data can come from a company’s database as well as from third party tools like Hubspot for sales data, or Google Analytics to track visitors to a website.
What’s nice is that extract and load tools like Fivetran, Stitch, and Meltano can pull from either our internal database, or from third party tools like Hubspot or Google Analytics directly into our analytics database. Fivetran is currently the easiest to use, but also more expensive.
To get these services to work, all we need to do is find our account API keys for each service, as well as our analytics database, and enter them into Fivetran.
A tool like Fivetran will know how to retrieve data from each of the APIs, and will create related tables in our analytics database. For example, these are all of the tables Fivetran will create in our analytics database for Hubspot.
Yes, this is a lot, but all of these tables will be tucked underneath a Fivetran Hubspot schema, and we can then pull the data we need into a different schema for further analysis. And if we want to choose fewer tables to load in, Fivetran allows us to do so.
So all the work of connecting to the related API, creating all of the related tables, and then loading that data into tables is performed for us by an extract and load tool, and all we needed to do was plug in our credentials into a tool like Fivetran. For example, here are the credentials we would need to enter to pull data from a database like AWS RDS – and Fivetran recreate the tables in an analytics database.
Transforming the Data
Now that we have loaded the data into an analytics database, there is still more work to do. For example, take another look at our diagram showing the analytics pipeline.
Focusing on the section to the left, one thing to note is that an EL tool like Fivetran pulls in a lot of data. And if we consider this will include sales and marketing data from Salesforce or Hubspot, purchasing data from Stripe, or product usage data from database, you may be able to think of some of the tasks still left for us.
First, we’ll want to clean up some of the data from each of these sources – selecting just the columns we want, and maybe changing some of the column names so that the naming is consistent across our sources. We can select the original source data, make these light transformations, and populate new tables called our staging tables. Second, once our data is cleaned, we’ll want to merge the records across these sources – for example, we want to identify that the firstname.lastname@example.org that we see from our hubspot data is the same Sam Arnold that we see paying for our product in our stripe data, as opposed to treating this same person as separate users. This merging of our data is called integration. And finally, we’ll then need to connect these different tables – say our users table with our products table – to see which users purchased which product. We perform this step in the marts stage.
The marts stage are the tables that different teams will query. But the prior work of cleaning, merging, and then data modeling has already been performed by data engineers.
Now to perform all of the work above – the transformation step – DBT (Data Build Tool) is really the only game in town. It allows us to update each of these tables, and understands that before updating any mart tables, it first needs to look at changes made in the staging tables, and integration tables that they depend on. In other words, DBT keeps track of all of the dependencies in our data modeling. DBT calls this the data lineage, and it even generates a nice diagram illustrating this.
So here we can see that some of the data from the `hubspot_contacts` table is derived from the `stg_hubspot_companies` table. So if the data in `stg_hubspot_companies` is changed, then this may impact the data in `stg_hubspot_contacts`. DBT has other features like automatically performing these transformations to our data on a regular basis, as we get new data, and allowing us to combine Python (or more specifically, Jinja) with our SQL queries.
After data engineers have done the work to transform the data and organize the data into data marts, they may have to help construct some data dashboards using a tool like Looker or Mode. And at that point, the data pipeline they built is complete.
This article covered the components of a modern data engineering stack. As we can see, this involves collecting data from various sources, loading that data into an analytics warehouse and then transforming that data using SQL and DBT to serve to internal stakeholders. It should be mentioned that all of the tools we highlighted were tools that can be used when not much customization is required. We’ll describe when and how data engineers need to build more custom built pipelines in a later post.