Building a modern data stack in a box using DuckDB, dbt, Meltano and Streamlit
TL;DR: I’ve integrated DuckDB, dbt, Meltano, and Streamlit into an all-in-one Modern Data Stack that runs on your laptop, simplifying the process of building data pipelines.
See the final dashboard here: Dashboard link
A few months ago, I came across this article by Jacob Matson in which he talks about creating a Modern Data Stack (MDS) in a box with DuckDB. I already had some experience building a data pipeline with DuckDB, and the idea of blending various innovative tools into an end-to-end proof of concept on your laptop caught my attention. I also came across this video by mehdio, which shows how useful DuckDB and dbt can be for building and testing local data pipelines and running analysis on your local compute only. I wanted to replicate the project and build a web application on top of it, so I put together a small project using the same WHO Air Quality data with DuckDB, dbt, Meltano and Streamlit.
Here’s what each tool does:
- DuckDB — open-source, in-process analytical database that runs locally and lets you explore data through SQL or Python
- dbt — open-source tool for defining data transformations in SQL
- Meltano — open-source data integration platform used to pull data from various sources
- Streamlit — open-source library for building web apps in Python
Steps to run the project locally
- Clone the project from GitHub and `cd` into the cloned directory.
- Install all the dependencies required to run the project: `pip install -r requirements.txt`
- Install the Meltano extractors, loaders and utilities defined in the `meltano.yml` file: `make build`
- Create the pipelines: `make pipeline`. This step takes the file from the `data` folder and generates a parquet file from it. After that, the required dbt dependencies are installed. Finally, the `dbt-duckdb build` command is run, which generates all the tables and views from the models folder.
- Launch the Streamlit app on port 8501: `make streamlit-visuals`
Putting it all together:
```shell
pip install -r requirements.txt
make build
make pipeline
make streamlit-visuals
```
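Conceptually, the pipeline boils down to: read the raw CSV, persist it, then aggregate per city. A minimal stdlib-only sketch of that flow (the column names and toy data here are illustrative assumptions, not the project’s exact schema):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Toy stand-in for the raw WHO CSV (the real file lives in ./data).
raw_csv = """city,year,pm10_concentration
Delhi,2019,210
Delhi,2020,180
Oslo,2019,16
"""

# Extract: parse rows the way the tap would.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: average PM10 per city, mirroring the dbt aggregate model.
by_city = defaultdict(list)
for row in rows:
    by_city[row["city"]].append(float(row["pm10_concentration"]))

avg_pm10 = {city: mean(values) for city, values in by_city.items()}
print(avg_pm10)  # {'Delhi': 195.0, 'Oslo': 16.0}
```

In the real project, Meltano handles the extract step, target-parquet the load, and dbt the aggregation — but the data flow is the same.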
Understanding the `meltano.yml` file
The first part is the tap (extractor), ‘tap-spreadsheets-anywhere’. This tap lets us read data files stored in the `data` folder of our project directory.
```yaml
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
      config:
        tables:
          - path: 'file://./data'
            name: air_quality_2023
            pattern: who_ambient_air_quality_2023.csv
            start_date: '2001-01-01T00:00:00Z'
            prefer_schema_as_string: true
            key_properties: ['who_region', 'iso3']
            format: csv
```
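The `pattern` key is matched as a regular expression against file names found under `path`; only matching files become streams. A rough sketch of that matching logic using Python’s `re` module (the candidate file names are made up for illustration):

```python
import re

# tap-spreadsheets-anywhere treats `pattern` as a regex and matches it
# against file names found under `path` (here, ./data).
pattern = re.compile("who_ambient_air_quality_2023.csv")

candidates = [
    "who_ambient_air_quality_2023.csv",  # matched, becomes the stream
    "who_ambient_air_quality_2022.csv",  # not matched (different year)
    "README.md",                         # not matched
]

matched = [name for name in candidates if pattern.search(name)]
print(matched)  # ['who_ambient_air_quality_2023.csv']
```

Note that the unescaped `.` in the pattern matches any character, which is usually harmless for a file name this specific.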
The next part is the loader (target-parquet), which takes data from a Meltano tap and writes parquet files to the destination folder. In a later stage those parquet files are loaded into DuckDB, as configured in the dbt sources file.
```yaml
  loaders:
    - name: target-parquet
      variant: estrategiahq
      pip_url: git+https://github.com/estrategiahq/target-parquet.git
      config:
        destination_path: $MELTANO_PROJECT_ROOT/data/data_catalog/air_quality
        compression_method: snappy
        logging_level: info
        disable_collection: true
```
The final piece of this Meltano config file is the utilities section, where the `dbt-duckdb` configuration is defined. This lets us execute dbt commands against our data, with the results loaded into DuckDB.
```yaml
  utilities:
    - name: dbt-duckdb
      variant: jwills
      pip_url: dbt-core~=1.5.2 dbt-duckdb~=1.5.2 git+https://github.com/meltano/dbt-ext.git@main duckdb==0.8.1
      executable: dbt_invoker
```
Understanding the dbt part
The `transform` directory contains all the dbt project code: the SQL models that are later built into tables and views and loaded into DuckDB.
The `dbt-duckdb` adapter requires a `profiles.yml` file with the simple configuration below. Since the adapter supports multiple threads, parallel execution makes runs even faster.
```yaml
config:
  send_anonymous_usage_stats: False
  use_colors: True
meltano:
  target: "{{ env_var('MELTANO_ENVIRONMENT', 'dev') }}"
  outputs:
    dev:
      type: duckdb
      threads: 4
      external_root: "{{ env_var('MELTANO_PROJECT_ROOT') }}/data"
      path: "{{ env_var('DBT_DUCKDB_PATH') }}"
```
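dbt resolves the `env_var(...)` calls above at render time: it reads the environment variable and falls back to the default if one is given. A rough stand-in for that behavior (this helper is illustrative, not dbt’s internal implementation):

```python
import os

def env_var(name, default=None):
    # Rough stand-in for dbt's env_var(): read the variable,
    # fall back to the default, fail loudly if neither exists.
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"env var {name!r} is not set and has no default")
    return value

os.environ.pop("MELTANO_ENVIRONMENT", None)  # simulate an unset variable
target = env_var("MELTANO_ENVIRONMENT", "dev")
print(target)  # dev
```

This is why the profile works both inside a Meltano run (where `MELTANO_ENVIRONMENT` is set) and in a bare local shell, where it falls back to `dev`.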
Below is the final SQL model, which defines the key analytical columns (`avg_pm10_concentration`, `avg_pm25_concentration`, `avg_no2_concentration`) along with the particle categories.
```sql
with aggregated_data as (
    select *
    from {{ ref('air_quality_aggregate') }}
)

select
    country_name,
    city,
    year,
    avg_pm10_concentration,
    avg_pm25_concentration,
    avg_no2_concentration,
    avg_pm10_tempcov,
    avg_pm25_tempcov,
    avg_no2_tempcov,
    population,
    latitude,
    longitude,
    case
        when avg_pm10_concentration > 50 then 'High'
        when avg_pm10_concentration > 20 then 'Moderate'
        else 'Low'
    end as pm10_category,
    case
        when avg_pm25_concentration > 25 then 'High'
        when avg_pm25_concentration > 10 then 'Moderate'
        else 'Low'
    end as pm25_category,
    case
        when avg_no2_concentration > 40 then 'High'
        when avg_no2_concentration > 20 then 'Moderate'
        else 'Low'
    end as no2_category
from aggregated_data
```
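The CASE thresholds above can be mirrored in plain Python, which also makes them easy to unit-test; the cut-offs are taken directly from the SQL (a sketch, not part of the project’s code):

```python
def pm10_category(avg_pm10):
    # Same thresholds as the SQL CASE: >50 High, >20 Moderate, else Low.
    if avg_pm10 > 50:
        return "High"
    if avg_pm10 > 20:
        return "Moderate"
    return "Low"

def pm25_category(avg_pm25):
    # >25 High, >10 Moderate, else Low.
    if avg_pm25 > 25:
        return "High"
    if avg_pm25 > 10:
        return "Moderate"
    return "Low"

def no2_category(avg_no2):
    # >40 High, >20 Moderate, else Low.
    if avg_no2 > 40:
        return "High"
    if avg_no2 > 20:
        return "Moderate"
    return "Low"

print(pm10_category(60), pm25_category(12), no2_category(5))  # High Moderate Low
```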
Dashboard
The app lets users select a country and city from the sidebar, and it displays visualizations and a data table for the selected location. The code below imports the streamlit and duckdb packages and opens a connection to a DuckDB database file named `mdsbox.db` in read-only mode.
```python
import streamlit as st
import duckdb

# Open the DuckDB database file in read-only mode.
aqi_duck = duckdb.connect('mdsbox.db', read_only=True)
```
A SQL query is constructed to fetch data from the `air_quality_final` table for the selected country and city. The result is stored in a DataFrame called `df`.
```python
query = f"""
    SELECT * FROM air_quality_final
    WHERE country_name = '{selected_country}'
    AND city = '{selected_city}'
    ORDER BY year
"""
df = aqi_duck.execute(query).fetchdf()
```
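One caveat with the f-string above: interpolating user-selected values straight into SQL is fragile (a city name containing a quote would break the query). DB-API drivers, including DuckDB’s Python client, accept `?` placeholder parameters instead. A sketch of the same pattern using the stdlib `sqlite3` module as a stand-in, with a toy in-memory table (the `selected_country`/`selected_city` values would come from the sidebar widgets):

```python
import sqlite3

# Toy in-memory table mimicking air_quality_final.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE air_quality_final (country_name TEXT, city TEXT, year INTEGER)")
conn.execute("INSERT INTO air_quality_final VALUES ('India', 'Delhi', 2019)")

selected_country, selected_city = "India", "Delhi"

# `?` placeholders let the driver handle quoting and escaping.
rows = conn.execute(
    """
    SELECT * FROM air_quality_final
    WHERE country_name = ? AND city = ?
    ORDER BY year
    """,
    (selected_country, selected_city),
).fetchall()
print(rows)  # [('India', 'Delhi', 2019)]
```

The same placeholder style works with `aqi_duck.execute(query, params)` in DuckDB.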
Visualizations are displayed using Streamlit’s `st.line_chart` method for three different air quality metrics.
```python
st.line_chart(data=df, x='year', y='avg_pm10_concentration')
st.line_chart(data=df, x='year', y='avg_pm25_concentration')
st.line_chart(data=df, x='year', y='avg_no2_concentration')
```
A table is displayed using Streamlit’s `st.dataframe` method, showing specific columns from the DataFrame `df`.
```python
st.dataframe(df[['year', 'country_name', 'city', 'pm10_category', 'pm25_category', 'no2_category']],
             column_config={
                 "year": "Year",
                 "country_name": "Country",
                 "city": "City",
                 "pm10_category": "PM10 Type",
                 "pm25_category": "PM25 Type",
                 "no2_category": "NO2 Type",
             },
             hide_index=True,
             use_container_width=True)
```
Conclusion
Embarking on the journey to build a compact yet robust Modern Data Stack (MDS) in a box proved to be an enjoyable and enlightening experience. Several key highlights underscored the merits of this approach:
- The shared philosophy among tools like dbt, DuckDB, and Meltano revolves around empowering the local user and facilitating easy interaction through Python. This alignment not only simplifies the user experience but also underscores the strength of a customizable and user-centric Modern Data Stack.
- The entire development process happened seamlessly on my laptop, offering a sense of complete control without needing a cloud data warehouse, and leveraging my local compute to experiment quickly before going to production.
References
- Dataset — WHO Air Quality Dataset May 2023
- YouTube video by mehdio explaining the usage of DuckDB and dbt
- Github — NBA monte carlo simulation project