
SQL CookBook for dbt: source testing

An oft-overlooked and fundamentally important part of using dbt

Hugo Lu
Data Engineer Things
5 min read · Aug 31, 2023


About me

I’m Hugo Lu — I started my career working in M&A in London before moving to JUUL and falling into data engineering. After a brief stint back in finance, I headed up the Data function at London-based Fintech Codat. I’m now CEO at Orchestra, which is a data release pipeline tool that helps Data Teams release data into production reliably and efficiently.

SQL Cookbook dbt Index

  1. Source testing (this one)

Introduction

Testing sources is probably one of the most helpful things you can do as an analytics engineer managing a data release pipeline.

Sources are the building blocks of everything you do. If the raw tables getting created in your warehouse or data lake contain data with issues, they will necessarily propagate into your models if you don’t catch them early. This leads to headaches, time spent fixing bugs instead of delivering value, and wasted compute resources from running queries over and over again.

In this part of my SQL CookBook for Analytics Engineers, I’ll dive into some common errors and how you can avoid these using source-testing in dbt.

Common Errors or Data Quality Issues

Source testing is designed to do one thing, and that’s catch data quality issues or “errors” before they get propagated into production — production refers to the environment in which end users (typically business stakeholders) access data.

There is, as it turns out, a list of errors. They're not as multi-faceted as errors in software; in data, they can, for the most part, actually be named:

  1. Recency; data may be missing the most recent date or partition
  2. Rows included; the number of rows ingested may be higher or lower than expected for a given date or partition
  3. Completeness; data may be missing for a given date or partition
  4. Value averages or totals; the value averages or totals for a given column may be too high or too low
  5. Primary key; there could be a duplicate in the primary key, which indicates either that a process is failing or the primary key is incorrect
  6. Nulls; values may be null where they should not be null
  7. Relationships; IDs may exist in table A that should exist in table B but don’t

People often say dbt doesn't make it easy for analytics engineers to write tests on their sources. This drives me insane. If you read the docs and spend an hour trying, you will be able to write tests. There are even packages like dbt-expectations where the tests already exist. I even wrote an extremely complicated Python test so you don't need to use an observability tool.

Anyway, let’s dive in:

How to write tests the easy way

packages:
  - package: calogica/dbt_expectations
    version: [">=0.9.0", "<0.10.0"]
    # see https://github.com/calogica/dbt-expectations/releases/latest for the latest version tag

Run dbt deps

Then in your .yml file with your sources, you’ll be able to do things like:

sources:
  - name: snowflake
    database: SNOWFLAKE_SAMPLE_DATA
    schema: tpcds_sf100tcl
    tables:
      - name: customer
        description: "Snowflake sample customer dataset"
        columns:
          - name: C_CUSTOMER_ID
            description: "The primary key for this table"
            tests:
              - unique
              - not_null
              - dbt_expectations.expect_row_values_to_have_recent_data:
                  datepart: day
                  interval: 1
                  row_condition: 'C_CUSTOMER_ID is not null' # optional

There are lots of tests you can use for the errors listed in 1–7. Recency is included in the example above. I like using dbt expectations for:

  • Calculating value expectations, e.g. max. These work for rows updated too (as you're basically doing a count by date / partition)
  • expect_column_to_exist
  • Recency
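As a sketch of the value-expectation case, here's how a max-value check might look on the same sample customer source. The column and bounds are illustrative assumptions, not values from this article:

```yaml
columns:
  - name: C_BIRTH_YEAR
    tests:
      # fails if the maximum birth year falls outside the expected range
      - dbt_expectations.expect_column_max_to_be_between:
          min_value: 1900
          max_value: 2005
```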

Note we can actually do quite a lot with dbt's out-of-the-box tests too: unique, not_null, accepted_values, and relationships cover errors 5, 6, and 7 above.
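For instance, errors 5, 6, and 7 from the list above map directly onto dbt's built-in generic tests. A sketch on a hypothetical orders source (the source and column names are made up for illustration):

```yaml
sources:
  - name: jaffle_shop          # hypothetical source name
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique         # error 5: duplicate primary keys
              - not_null       # error 6: unexpected nulls
          - name: customer_id
            tests:
              # error 7: every order's customer_id must exist in the customers table
              - relationships:
                  to: source('jaffle_shop', 'customers')
                  field: customer_id
```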

Writing custom tests

Something we’ve not found in the above is a test for completeness.

To do this, we’ll need to write a custom test. Head on over to your tests folder in dbt. Make a new folder called “generic”. Add a .sql file called completeness_hourly.sql and in it, put:


{% test completeness_hourly(model, column_name, num_days) %}

WITH staging AS (

    SELECT
        DATE_TRUNC(HOUR, {{ column_name }}) AS hour_,
        COUNT(*) AS rows_ -- not strictly necessary, but helpful for testing rolling counts of rows per hour
    FROM {{ model }}
    WHERE {{ column_name }} >= DATEADD(DAY, -{{ num_days }}, CURRENT_DATE())
    GROUP BY 1

),

diffs AS (

    SELECT
        hour_,
        LAG(hour_) OVER (ORDER BY hour_ DESC) AS prev_hour,
        DATEDIFF(HOUR, hour_, LAG(hour_) OVER (ORDER BY hour_ DESC)) AS hourly_diff
    FROM staging

)

-- Corrected with thanks to Emanuel Oliveira and Cai Parry-Jones
SELECT *
FROM diffs
WHERE hourly_diff != 1

{% endtest %}

This test takes data from the last X days and groups it by hour. It then calculates the difference in hours between each hour present in the table and the previous one. If data is expected in the table for every hour, the test will fail when rows are missing for any hour. This is a basic example of a completeness test.
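Once the generic test exists, applying it is just another line in the source's .yml file. A sketch, assuming an hourly-loaded events table with a loaded_at timestamp column (both names are hypothetical):

```yaml
sources:
  - name: analytics            # hypothetical source name
    tables:
      - name: events
        columns:
          - name: loaded_at
            tests:
              # fail if any hour in the last 3 days has no rows
              - completeness_hourly:
                  num_days: 3
```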

How to test sources

The easiest way to test your sources is by running dbt build. The dbt build command runs source tests before building the models that depend on them, which is extremely powerful.

If, for whatever reason, you don’t want to do this, then the way you test sources is by running one of the three below:

dbt test --select source:source_name.table_name

dbt test --select source:source_name.*

dbt test --select source:*,tag:tag_name

The first command tests a single source table. Recall all sources need to belong to a source group or "source_name". If you need to test every table in a source, just run source:source_name.*. Most commonly, you'll want to run a subset of sources based on tags that map to a schedule, e.g. hourly; the third command intersects the source: and tag: selectors so tests run only on sources with a specific tag.

Really, though, there should be no use-case where you need to run dbt test on its own. You should always be running dbt build --select tag:hourly (or similar) in the context of testing sources.

Conclusion

A lot of bad data points we see in models come from sources. These lead to questions like:

  • Where is my data?
  • Why is there no customer name?
  • What’s the reason for this massive spike / drop?

Contrary to what many people in the data discourse say, these can actually be prevented by having some basic testing on source tables and a robust data release pipeline using a tool like Orchestra. Sure, data contracts can help with solving the root cause (why the data producers send you stuff which is bad in the first place!) but testing sources gives you 90% of what you need to prevent bad data reaching end users. A valuable asset for any analytics engineer.

What’s so powerful about this is that, as long as tests are set up correctly on sources, tests on downstream models offer little marginal value in respect of the errors they catch.

Really, testing downstream tables should be focussed on catching errors introduced to the data through changes to the models themselves, which are necessarily beyond the scope of source testing anyway. Unless you’re worried an analyst has put a weird filter in a model that filters out recent data, there is no reason to add a recency test to downstream models. This is not true of primary key tests: joins and aggregations can introduce duplicates, so those are worth re-testing downstream.

So yeah — test your sources. I bet these are what fail most.


Hugo Lu - I write about how to be good at Data engineering and do the coolest data stuff. I am the CEO @ Orchestra, a data release pipeline management platform