Building a modern data stack in a box using DuckDB, dbt, Meltano and Streamlit

Rahul Soni
Data Engineer Things
5 min read · Nov 13, 2023


Modern data stack hosted on a laptop

TLDR: I’ve integrated DuckDB, dbt, Meltano, and Streamlit to create an all-in-one Modern Data Stack which can be used on your laptop, simplifying the process of building data pipelines.
See the final dashboard here: Dashboard link

A few months ago, I came across this article by Jacob Matson in which he talks about creating a Modern Data Stack (MDS) in a box with DuckDB. I had previous experience building a data pipeline with DuckDB, and the idea of blending various innovative tools for an end-to-end proof of concept on your laptop caught my attention. I also came across this video by mehdio, where he showed how useful DuckDB and dbt can be for building and testing local data pipelines and running analyses using only your local compute. I wanted to replicate the project and build a web application on top of it, so I put together a small project using the same WHO Air Quality data with DuckDB, dbt, Meltano and Streamlit.

Here’s what each tool does:

  1. DuckDB — an open-source, in-process analytical database that runs locally and lets you explore data through SQL or Python (see the quick sketch after this list)
  2. dbt — an open-source tool for transforming data using SQL
  3. Meltano — an open-source data integration platform used to pull data from various sources
  4. Streamlit — an open-source library for building web apps in Python
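
To get a feel for how lightweight DuckDB is, here is a minimal sketch of exploring the project’s CSV from Python (the exact file path is an assumption for illustration):

import duckdb

# DuckDB runs in-process: no server needed, just `pip install duckdb`.
# The CSV path below is an assumption based on the Meltano config shown later.
duckdb.sql("SELECT * FROM 'data/who_ambient_air_quality_2023.csv' LIMIT 5").show()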

Steps to run the project locally

  1. Clone the project from GitHub and cd into the cloned directory
  2. Install all the dependencies required to run the project: pip install -r requirements.txt
  3. Install the Meltano extractors, loaders and utilities defined in the meltano.yml file: make build
  4. Create the pipelines: make pipeline
    Explanation:
    - This step takes the file from the data folder and generates a parquet file from it.
    - Then the required dbt dependencies are installed.
    - Finally, the dbt-duckdb build command is run, which generates all the tables and views from the models folder.
  5. Launch the Streamlit app on port 8501: make streamlit-visuals

Putting it all together:

pip install -r requirements.txt

make build
make pipeline
make streamlit-visuals
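
For reference, the Makefile targets behind these commands presumably look something like the sketch below. The real recipes live in the repo; the dbt invocation syntax follows the Meltano dbt extension, and app.py is a placeholder name for the Streamlit script.

# A sketch of the likely Makefile, based on the steps described above
build:
	meltano install

pipeline:
	meltano run tap-spreadsheets-anywhere target-parquet
	meltano invoke dbt-duckdb deps
	meltano invoke dbt-duckdb build

streamlit-visuals:
	streamlit run app.py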

Understanding the meltano.yml file

The first part is the tap (extractor), ‘tap-spreadsheets-anywhere’. This tap lets us read data files stored in the data folder of our project directory.

plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
      config:
        tables:
          - path: 'file://./data'
            name: air_quality_2023
            pattern: who_ambient_air_quality_2023.csv
            start_date: '2001-01-01T00:00:00Z'
            prefer_schema_as_string: true
            key_properties: ['who_region', 'iso3']
            format: csv

The next part is the loader (target-parquet), which takes data from a Meltano tap and writes parquet files to the destination folder. At a later stage, those parquet files are loaded into DuckDB, as configured in the dbt sources file.

  loaders:
    - name: target-parquet
      variant: estrategiahq
      pip_url: git+https://github.com/estrategiahq/target-parquet.git
      config:
        destination_path: $MELTANO_PROJECT_ROOT/data/data_catalog/air_quality
        compression_method: snappy
        logging_level: info
        disable_collection: true
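
After the pipeline runs, you can sanity-check the loader’s output directly, since DuckDB reads parquet files (including globs) natively; a minimal sketch:

import duckdb

# Peek at the parquet files written by target-parquet
# (destination_path from the config above, relative to the project root)
duckdb.sql("SELECT * FROM 'data/data_catalog/air_quality/*.parquet' LIMIT 5").show()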

The final piece of the Meltano config file is the utilities section, where the dbt-duckdb configuration is defined. This lets us execute dbt commands against our data, with the results loaded into DuckDB.

  utilities:
    - name: dbt-duckdb
      variant: jwills
      pip_url: dbt-core~=1.5.2 dbt-duckdb~=1.5.2 git+https://github.com/meltano/dbt-ext.git@main duckdb==0.8.1
      executable: dbt_invoker

Understanding the dbt part

The transform directory contains all the dbt project code. The SQL models written there are transformed into tables and views and loaded into DuckDB.

The dbt-duckdb adapter requires a profiles.yml file with the simple configuration below. Since the adapter supports multiple threads, models can be built in parallel, which speeds up execution.

config:
  send_anonymous_usage_stats: False
  use_colors: True
meltano:
  target: "{{ env_var('MELTANO_ENVIRONMENT', 'dev') }}"
  outputs:
    dev:
      type: duckdb
      threads: 4
      external_root: "{{ env_var('MELTANO_PROJECT_ROOT') }}/data"
      path: "{{ env_var('DBT_DUCKDB_PATH') }}"
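
As mentioned earlier, the parquet files produced by target-parquet are exposed to dbt through a sources file. Here is a minimal sketch of what that can look like with dbt-duckdb, which lets a source table point at an external parquet location; the source name and exact path are assumptions based on the loader config above.

version: 2

sources:
  - name: air_quality
    tables:
      - name: air_quality_2023
        meta:
          # A sketch: the actual path in the repo's sources file may differ
          external_location: "data/data_catalog/air_quality/*.parquet"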

Below is the final SQL model, which selects the key analytical fields such as avg_pm10_concentration, avg_pm25_concentration and avg_no2_concentration, and derives a category for each pollutant.

with aggregated_data as (
    select *
    from {{ ref('air_quality_aggregate') }}
)

select
    country_name,
    city,
    year,
    avg_pm10_concentration,
    avg_pm25_concentration,
    avg_no2_concentration,
    avg_pm10_tempcov,
    avg_pm25_tempcov,
    avg_no2_tempcov,
    population,
    latitude,
    longitude,
    case
        when avg_pm10_concentration > 50 then 'High'
        when avg_pm10_concentration > 20 then 'Moderate'
        else 'Low'
    end as pm10_category,
    case
        when avg_pm25_concentration > 25 then 'High'
        when avg_pm25_concentration > 10 then 'Moderate'
        else 'Low'
    end as pm25_category,
    case
        when avg_no2_concentration > 40 then 'High'
        when avg_no2_concentration > 20 then 'Moderate'
        else 'Low'
    end as no2_category
from aggregated_data
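
A model like this pairs naturally with a few dbt schema tests on the derived fields; here is a minimal sketch, assuming the model is named air_quality_final, as the dashboard queries later suggest:

version: 2

models:
  - name: air_quality_final  # an assumption; not copied from the repo
    columns:
      - name: country_name
        tests:
          - not_null
      - name: pm10_category
        tests:
          - accepted_values:
              values: ['High', 'Moderate', 'Low']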

Dashboard

The app allows users to select a country and city from the sidebar, and it displays visualizations and a data table based on the selected location. The code below imports streamlit and duckdb dependencies and establishes a connection to a DuckDB database file named ‘mdsbox.db’ in read-only mode.

import streamlit as st
import duckdb


aqi_duck = duckdb.connect('mdsbox.db', read_only=True)
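
The selected_country and selected_city values used below come from the sidebar widgets mentioned above; here is a minimal sketch of how they might be populated (the actual widget code in the repo may differ):

# Populate the sidebar selectors from the data itself (a sketch)
countries = aqi_duck.execute(
    "SELECT DISTINCT country_name FROM air_quality_final ORDER BY 1"
).fetchdf()["country_name"]
selected_country = st.sidebar.selectbox("Country", countries)

cities = aqi_duck.execute(
    "SELECT DISTINCT city FROM air_quality_final WHERE country_name = ? ORDER BY 1",
    [selected_country],
).fetchdf()["city"]
selected_city = st.sidebar.selectbox("City", cities)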

A SQL query is constructed to fetch data from the ‘air_quality_final’ table based on the selected country and city. The result is stored in a DataFrame called ‘df’.

query = f"""
SELECT * FROM air_quality_final
WHERE country_name = '{selected_country}'
AND city = '{selected_city}'
ORDER BY year
"""
df = aqi_duck.execute(query).fetchdf()
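
Since the query interpolates the selected values straight into the SQL string, an equivalent that uses DuckDB’s parameter binding is slightly safer; a sketch:

# Equivalent query using parameter binding instead of f-string interpolation
df = aqi_duck.execute(
    """
    SELECT * FROM air_quality_final
    WHERE country_name = ? AND city = ?
    ORDER BY year
    """,
    [selected_country, selected_city],
).fetchdf()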

Visualizations are displayed using Streamlit’s st.line_chart method for three different air quality metrics.

st.line_chart(data=df, x='year', y='avg_pm10_concentration')
st.line_chart(data=df, x='year', y='avg_pm25_concentration')
st.line_chart(data=df, x='year', y='avg_no2_concentration')

A table is displayed using Streamlit’s st.dataframe method, showing specific columns from the DataFrame 'df'.

st.dataframe(
    df[['year', 'country_name', 'city', 'pm10_category', 'pm25_category', 'no2_category']],
    column_config={
        "year": "Year",
        "country_name": "Country",
        "city": "City",
        "pm10_category": "PM10 Type",
        "pm25_category": "PM25 Type",
        "no2_category": "NO2 Type",
    },
    hide_index=True,
    use_container_width=True,
)

Conclusion

Embarking on the journey to build a compact yet robust Modern Data Stack (MDS) in a box proved to be an enjoyable and enlightening experience. Several key highlights underscored the merits of this approach:

  1. The shared philosophy among tools like dbt, DuckDB, and Meltano revolves around empowering the local user and facilitating easy interaction through Python. This alignment not only simplifies the user experience but also underscores the strength of a customizable and user-centric Modern Data Stack.
  2. The entire development process happened seamlessly on my laptop, offering a sense of complete control. No cloud data warehouse was needed; I leveraged my local compute to experiment quickly before going to production.

References

  1. Dataset — WHO Air Quality Dataset, May 2023
  2. YouTube — video by mehdio explaining the usage of DuckDB and dbt
  3. GitHub — NBA Monte Carlo simulation project
