Building a modern data stack in a box using DuckDB, dbt, Meltano and Streamlit
TL;DR: I’ve integrated DuckDB, dbt, Meltano, and Streamlit into an all-in-one Modern Data Stack that runs on your laptop, simplifying the process of building data pipelines.
See the final dashboard here: Dashboard link
A few months ago, I came across this article by Jacob Matson in which he talks about creating a Modern Data Stack (MDS) in a box with DuckDB. I already had some experience building a data pipeline with DuckDB, and the idea of blending various innovative tools into an end-to-end proof of concept on your laptop caught my attention. I also came across this video by mehdio, which shows how useful DuckDB and dbt can be for building and testing local data pipelines and running analysis on your local compute only. I wanted to replicate the project and build a web application on top of it, so I put together a small project using the same WHO Air Quality data with DuckDB, dbt, Meltano and Streamlit.
Here’s what each tool does:
- DuckDB — open-source, in-process analytical database that runs locally and lets you explore data through SQL or Python
- dbt — open-source tool for defining data transformations in SQL
- Meltano — open-source data integration platform used to pull data from various sources
- Streamlit — open-source library for building web apps in Python
Steps to run the project locally
- Clone the project from GitHub and `cd` into the cloned directory.
- Install all the dependencies required to run the project: `pip install -r requirements.txt`
- Install the Meltano extractors, loaders and utilities defined in the `meltano.yml` file: `make build`
- Create the pipelines: `make pipeline`. This step takes the file from the `data` folder and generates a parquet file from it. After that, the required dbt dependencies are installed. Finally, the `dbt-duckdb build` command is run, which generates all the tables and views from the models folder.
- Launch the Streamlit app on port 8501: `make streamlit-visuals`
Putting it all together:
```shell
pip install -r requirements.txt
make build
make pipeline
make streamlit-visuals
```
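Conceptually, the pipeline boils down to: read the raw CSV, persist it, then aggregate per city. A minimal stdlib-only sketch of that flow (the column names and toy data here are illustrative assumptions, not the project’s exact schema):

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Toy stand-in for the raw WHO CSV (the real file lives in ./data).
raw_csv = """city,year,pm10_concentration
Delhi,2019,210
Delhi,2020,180
Oslo,2019,16
"""

# Extract: parse rows the way the tap would.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: average PM10 per city, mirroring the dbt aggregate model.
by_city = defaultdict(list)
for row in rows:
    by_city[row["city"]].append(float(row["pm10_concentration"]))

avg_pm10 = {city: mean(values) for city, values in by_city.items()}
print(avg_pm10)  # {'Delhi': 195.0, 'Oslo': 16.0}
```

In the real project, Meltano handles the extract step, target-parquet the load, and dbt the aggregation — but the data flow is the same.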
Understanding the `meltano.yml` file
The first part is the tap (extractor), ‘tap-spreadsheets-anywhere’. This tap lets us read data files stored in the `data` folder of our project directory.
```yaml
plugins:
  extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
      config:
        tables:
          - path: 'file://./data'
            name: air_quality_2023
            pattern: who_ambient_air_quality_2023.csv
            start_date: '2001-01-01T00:00:00Z'
            prefer_schema_as_string: true
            key_properties: ['who_region', 'iso3']
            format: csv
```
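The `pattern` key is matched as a regular expression against file names found under `path`; only matching files become streams. A rough sketch of that matching logic using Python’s `re` module (the candidate file names are made up for illustration):

```python
import re

# tap-spreadsheets-anywhere treats `pattern` as a regex and matches it
# against file names found under `path` (here, ./data).
pattern = re.compile("who_ambient_air_quality_2023.csv")

candidates = [
    "who_ambient_air_quality_2023.csv",  # matched, becomes the stream
    "who_ambient_air_quality_2022.csv",  # not matched (different year)
    "README.md",                         # not matched
]

matched = [name for name in candidates if pattern.search(name)]
print(matched)  # ['who_ambient_air_quality_2023.csv']
```

Note that the unescaped `.` in the pattern matches any character, which is usually harmless for a file name this specific.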
The next part is the loader (target-parquet), which takes data from a Meltano tap and writes parquet files to the destination folder. In a later stage those parquet files are loaded into DuckDB, as configured in the dbt sources file.
```yaml
  loaders:
    - name: target-parquet
      variant: estrategiahq
      pip_url: git+https://github.com/estrategiahq/target-parquet.git
      config:
        destination_path: $MELTANO_PROJECT_ROOT/data/data_catalog/air_quality
        compression_method: snappy
        logging_level: info
        disable_collection: true
```
The final piece of this Meltano config file is the utilities section, where the `dbt-duckdb` configuration is defined. This lets us execute dbt commands against our data, with the results loaded into DuckDB.
```yaml
  utilities:
    - name: dbt-duckdb
      variant: jwills
      pip_url: dbt-core~=1.5.2 dbt-duckdb~=1.5.2 git+https://github.com/meltano/dbt-ext.git@main duckdb==0.8.1
      executable: dbt_invoker
```
Understanding the dbt part
The `transform` directory contains all the dbt project code: the SQL models that are later built into tables and views and loaded into DuckDB.
The `dbt-duckdb` adapter requires a `profiles.yml` file with the simple configuration below. Since the adapter supports multiple threads, parallel execution makes runs even faster.
```yaml
config:
  send_anonymous_usage_stats: False
  use_colors: True
meltano:
  target: "{{ env_var('MELTANO_ENVIRONMENT', 'dev') }}"
  outputs:
    dev:
      type: duckdb
      threads: 4
      external_root: "{{ env_var('MELTANO_PROJECT_ROOT') }}/data"
      path: "{{ env_var('DBT_DUCKDB_PATH') }}"
```
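dbt resolves the `env_var(...)` calls above at render time: it reads the environment variable and falls back to the default if one is given. A rough stand-in for that behavior (this helper is illustrative, not dbt’s internal implementation):

```python
import os

def env_var(name, default=None):
    # Rough stand-in for dbt's env_var(): read the variable,
    # fall back to the default, fail loudly if neither exists.
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"env var {name!r} is not set and has no default")
    return value

os.environ.pop("MELTANO_ENVIRONMENT", None)  # simulate an unset variable
target = env_var("MELTANO_ENVIRONMENT", "dev")
print(target)  # dev
```

This is why the profile works both inside a Meltano run (where `MELTANO_ENVIRONMENT` is set) and in a bare local shell, where it falls back to `dev`.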
Below is the final SQL model, which defines the key analytical columns (`avg_pm10_concentration`, `avg_pm25_concentration`, `avg_no2_concentration`) along with the particle categories.
```sql
with aggregated_data as (
    select *
    from {{ ref('air_quality_aggregate') }}
)

select
    country_name,
    city,
    year,
    avg_pm10_concentration,
    avg_pm25_concentration,
    avg_no2_concentration,
    avg_pm10_tempcov,
    avg_pm25_tempcov,
    avg_no2_tempcov,
    population,
    latitude,
    longitude,
    case
        when avg_pm10_concentration > 50 then 'High'
        when avg_pm10_concentration > 20 then 'Moderate'
        else 'Low'
    end as pm10_category,
    case
        when avg_pm25_concentration > 25 then 'High'
        when avg_pm25_concentration > 10 then 'Moderate'
        else 'Low'
    end as pm25_category,
    case
        when avg_no2_concentration > 40 then 'High'
        when avg_no2_concentration > 20 then 'Moderate'
        else 'Low'
    end as no2_category
from aggregated_data
```
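The CASE thresholds above can be mirrored in plain Python, which also makes them easy to unit-test; the cut-offs are taken directly from the SQL (a sketch, not part of the project’s code):

```python
def pm10_category(avg_pm10):
    # Same thresholds as the SQL CASE: >50 High, >20 Moderate, else Low.
    if avg_pm10 > 50:
        return "High"
    if avg_pm10 > 20:
        return "Moderate"
    return "Low"

def pm25_category(avg_pm25):
    # >25 High, >10 Moderate, else Low.
    if avg_pm25 > 25:
        return "High"
    if avg_pm25 > 10:
        return "Moderate"
    return "Low"

def no2_category(avg_no2):
    # >40 High, >20 Moderate, else Low.
    if avg_no2 > 40:
        return "High"
    if avg_no2 > 20:
        return "Moderate"
    return "Low"

print(pm10_category(60), pm25_category(12), no2_category(5))  # High Moderate Low
```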
Dashboard
The app lets users select a country and city from the sidebar, and it displays visualizations and a data table for the selected location. The code below imports the streamlit and duckdb packages and opens a connection to a DuckDB database file named `mdsbox.db` in read-only mode.
```python
import streamlit as st
import duckdb

# Open the DuckDB database file in read-only mode.
aqi_duck = duckdb.connect('mdsbox.db', read_only=True)
```
A SQL query is constructed to fetch data from the `air_quality_final` table for the selected country and city. The result is stored in a DataFrame called `df`.
```python
query = f"""
    SELECT * FROM air_quality_final
    WHERE country_name = '{selected_country}'
    AND city = '{selected_city}'
    ORDER BY year
"""
df = aqi_duck.execute(query).fetchdf()
```
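One caveat with the f-string above: interpolating user-selected values straight into SQL is fragile (a city name containing a quote would break the query). DB-API drivers, including DuckDB’s Python client, accept `?` placeholder parameters instead. A sketch of the same pattern using the stdlib `sqlite3` module as a stand-in, with a toy in-memory table (the `selected_country`/`selected_city` values would come from the sidebar widgets):

```python
import sqlite3

# Toy in-memory table mimicking air_quality_final.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE air_quality_final (country_name TEXT, city TEXT, year INTEGER)")
conn.execute("INSERT INTO air_quality_final VALUES ('India', 'Delhi', 2019)")

selected_country, selected_city = "India", "Delhi"

# `?` placeholders let the driver handle quoting and escaping.
rows = conn.execute(
    """
    SELECT * FROM air_quality_final
    WHERE country_name = ? AND city = ?
    ORDER BY year
    """,
    (selected_country, selected_city),
).fetchall()
print(rows)  # [('India', 'Delhi', 2019)]
```

The same placeholder style works with `aqi_duck.execute(query, params)` in DuckDB.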
Visualizations are displayed using Streamlit’s `st.line_chart` method for three different air quality metrics.
```python
st.line_chart(data=df, x='year', y='avg_pm10_concentration')
st.line_chart(data=df, x='year', y='avg_pm25_concentration')
st.line_chart(data=df, x='year', y='avg_no2_concentration')
```
A table is displayed using Streamlit’s `st.dataframe` method, showing specific columns from the DataFrame `df`.
```python
st.dataframe(df[['year', 'country_name', 'city', 'pm10_category', 'pm25_category', 'no2_category']],
             column_config={
                 "year": "Year",
                 "country_name": "Country",
                 "city": "City",
                 "pm10_category": "PM10 Type",
                 "pm25_category": "PM25 Type",
                 "no2_category": "NO2 Type",
             },
             hide_index=True,
             use_container_width=True)
```
Conclusion
Embarking on the journey to build a compact yet robust Modern Data Stack (MDS) in a box proved to be an enjoyable and enlightening experience. Several key highlights underscored the merits of this approach:
- The shared philosophy among tools like dbt, DuckDB, and Meltano revolves around empowering the local user and facilitating easy interaction through Python. This alignment not only simplifies the user experience but also underscores the strength of a customizable and user-centric Modern Data Stack.
- The entire development process happened seamlessly on my laptop, offering a sense of complete control without needing a cloud data warehouse, and leveraging my local compute to experiment quickly before going to production.
References
- Dataset — WHO Air Quality Dataset May 2023
- YouTube video by mehdio explaining the usage of DuckDB and dbt
- Github — NBA monte carlo simulation project