
How to bring external data into the data warehouse#

While we generally take the stance that any data that can come through Ed-Fi should come through Ed-Fi, we recognize that there are many kinds of data that either don't fit into Ed-Fi at all, or need to get into the warehouse quickly and can't wait for a full integration.

This article describes the general framework EDU uses to think about this type of work. Following a consistent approach will keep your warehouse organized and legible as it grows.

General approach#

Following the ELT pattern allows for a pretty consistent set of steps to get data into the warehouse:

  • Create a table shell in the raw database where that data will land
  • Write an Airflow DAG to extract data from a source system and load it to a raw table
  • Write staging dbt models to clean up and organize the raw data
  • Write dimensional dbt models to prepare the data for analytic use
  • Write any downstream metrics or reports on top of the dimensional models

Extract-Load#

The Raw Database#

Before we can load data to the warehouse, we need somewhere to put it. Generally we create a new schema in the raw database for each new source system we're adding, with one table shell for each file, API endpoint, or database table we're bringing in.

See more on how we structure our raw tables here.
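For illustration only, a table shell for a hypothetical attendance file from a hypothetical "state_portal" source system might look something like this (Snowflake-flavored SQL; follow the conventions in the guide linked above for real tables):

```sql
-- Illustrative only: schema, table, and column names are hypothetical.
create schema if not exists raw.state_portal;

create table if not exists raw.state_portal.attendance_upload (
    school_year      int,
    student_id       varchar,
    attendance_date  date,
    attendance_code  varchar
);
```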

Extract-Load Airflow DAGs#

Airflow DAGs are a deep topic which we won't attempt to fully cover here (there are lots of free tutorials online), but all Extract-Load DAGs look functionally identical:

  • Connect to a source system and pull data out
  • Put the data in the data lake (e.g. S3)
  • Copy the data from the lake into a raw table in the database

Essentially, DAGs just define the sequence of tasks (usually Python functions) that perform this work in order, along with the schedule on which they should run. We have some example DAGs that show what this looks like in common situations.

Creating new DAGs

Any valid Python file that contains an Airflow DAG and is placed in the airflow/dags folder of the project repository will create a new DAG (once the code is deployed to a server).
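As a rough sketch (assuming a recent Airflow 2.x release; the dag_id, schedule, and task bodies are hypothetical placeholders), such a file might look like:

```python
# A minimal Extract-Load DAG sketch. The dag_id, schedule, and task
# bodies below are hypothetical placeholders, not production code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_lake(**context):
    """Connect to the source system, pull data out, and write it to the data lake (e.g. S3)."""
    ...  # e.g. call a REST API and upload the response to the lake


def load_to_raw(**context):
    """Copy the file from the lake into a raw table in the warehouse."""
    ...  # e.g. run a COPY statement against the raw table


with DAG(
    dag_id="state_portal_attendance",   # hypothetical source system
    schedule="0 5 * * *",               # daily at 5am
    start_date=datetime(2024, 7, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_lake", python_callable=extract_to_lake)
    load = PythonOperator(task_id="load_to_raw", python_callable=load_to_raw)

    extract >> load   # load only runs after extract succeeds
```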

Key Questions Before Writing a DAG#

Before you begin writing a DAG, you need to answer a couple of questions about the data you want to load.

The first question to answer is: "How do we get access to this data in an automated way?" This could be a database connection, a REST API, web scraping, an SFTP server, etc. In some cases there may be no automated way to get to the data at all (e.g. it gets emailed to you once a quarter). In these cases you may just need to set up a folder somewhere that Airflow can access (e.g. an S3 folder or an SFTP server), and have a process by which new files get uploaded there manually.

The second question to answer is: how is this data updated? There are a few basic update strategies:

  1. Drop and replace

     This is the simplest: if you only expect one file per year (but it might update a few times), you can create a process that first deletes any data from the year you're about to load, then loads the new file (see the sketch after this list). This ensures that the database will only contain the valid records from the most recent file, while leaving old copies of that data in the data lake in case you want to reference them. You could also 'soft delete' outdated records by setting an is_deleted flag to true before loading the new data.

  2. Delta Processing

     This is the best approach: you only load records that have changed since your last pull (this is how the Ed-Fi DAG works). Beware though: some systems make it relatively easy to pull newly inserted or updated records but do not track deleted records. If you only pull new/updated rows, you will leave behind records in the warehouse that have since been deleted from the source system. If the system you're pulling from exposes full Change Data Capture (CDC) tables/streams or itself implements soft-deletes, you can correctly do Delta Processing.

  3. Append Only

     In this approach you just append new records/files to the raw table. This can work in limited cases where you only need to know what the data looked like as of a particular date, or where you have other strategies downstream to handle duplicates and deletes.

  4. Merge

     Another possible strategy is to merge incoming data into the existing raw table. While there are some benefits to this strategy (you can handle updates, inserts, and deletes directly), we generally avoid it. With the other strategies, if an error in duplicate handling is found, you can update the logic and rerun dbt, because you still have all the raw data to work with. With a merge strategy, the 'raw' table itself becomes a source of truth about how data has changed over time, and it is more challenging to replay history when you need to update the rules or fix an error.
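To make Drop and Replace concrete, here is a minimal sketch (Snowflake-flavored, reusing the hypothetical raw.state_portal.attendance_upload table from above; the external stage name is also made up):

```sql
-- 1. Clear out whatever was previously loaded for the year we're about to reload
delete from raw.state_portal.attendance_upload
where school_year = 2024;

-- 2. Load the newest file from the data lake
copy into raw.state_portal.attendance_upload
from @state_portal_stage/attendance/2024/        -- hypothetical external stage over the lake
file_format = (type = 'csv' skip_header = 1);
```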

Using Snapshots

If your source system does not allow for delta processing but you need to track how data changes over time, you can combine either the Drop and Replace or Append Only strategy with dbt snapshots to capture that history.
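For example, a snapshot over the hypothetical raw table from earlier might look roughly like this (the snapshot name, unique key, and target schema are assumptions; check the snapshot configuration options for your dbt version):

```sql
{% snapshot state_portal_attendance_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key="student_id || '-' || attendance_date",
        strategy='check',
        check_cols='all'
    )
}}

-- assumes a dbt source is defined for the hypothetical raw.state_portal schema
select * from {{ source('state_portal', 'attendance_upload') }}

{% endsnapshot %}
```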

Data Transformation: dbt#

Once you have data loaded in raw, you can begin modeling it in dbt. New dbt models go in dbt/models, organized into folders by their stage of data processing and/or data domain. They can be tested directly from your laptop before ultimately being merged to production. Make sure you pay attention to schema configuration so your model is built in the right section of the warehouse.
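As a sketch, a first staging model over the hypothetical raw table above might look like the following (the file path, model name, and schema value are illustrative, not prescriptive):

```sql
-- e.g. dbt/models/staging/state_portal/stg_state_portal__attendance.sql (hypothetical path/name)

{{ config(schema='staging') }}   -- controls which section of the warehouse the model builds in

with source as (

    select * from {{ source('state_portal', 'attendance_upload') }}

),

renamed as (

    select
        school_year,
        student_id,
        attendance_date,
        upper(attendance_code) as attendance_code   -- light cleanup / standardization

    from source

)

select * from renamed
```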

See our guide on how we organize dbt for more information on how we sequence and name our transformation pipelines.