Data Engineer Things
Insights and ideas on data and engineering.
ETL
Data Architecture
Optimization
Interview Guide
Career Growth
AI in Data Engineering
Create your own Gemini AI-chatbot with a twist using Python, Jinja2 and NiceGUI
Discover the basics of using Gemini with Python via VertexAI, creating a Web UI with NiceGUI and using Jinja2 to construct modular prompts
Volker Janz
Apr 25
Trending Now
Pydantic for Experts: Reusing & Importing Validators
Advanced techniques for reusing and importing validation across Python models.
Yaakov Bressler
Apr 21
How to think about Internal Data Products as a Data Engineer
Data Products are all the rage, but why?
Hugo Lu
Apr 17
I completed a Senior Data Engineer Code Challenge for fun, and this is how it went. PART II
Question: Using MySQL’s public employee sample database, create a DAG to move data from the employee’s table to BigQuery.
Jennifer Ebe
Apr 27
I spent another 8 hours understanding the design of Amazon Redshift. Here’s what I found.
All insights from Redshift academic paper: Amazon Redshift re-invented in 2022
Vu Trinh
Mar 16
Why did Databricks build the Photon engine?
The Lakehouse, its motivation, and the difference between Photon and the existing engine.
Vu Trinh
Apr 6
Deleting Duplicate Data With No Primary Key — A Data Engineer's Worst Nightmare?
Here’s how you can delete duplicate data with no primary key
Saikat Dutta
Mar 12
Latest stories
Understanding Snowflake Table Locks
A hands-on look at table locks.
Jonathan Duran
May 16
Automate Dbt Date Logic with Python — Part 2
Simplifying Our Models and Tests From Part 1 Using Meta Config
Leo Godin
May 14
The Inheritance Schema Design Pattern for MongoDB Data Modelling
In the world of NoSQL databases, particularly MongoDB, designing an efficient data model is crucial for optimal application performance…
Karen Zhang
May 12
How I build an ETL pipeline with AWS Glue, Lambda, and Terraform
A Step-by-Step Guide
Lorena Gongang
May 12
Enhance your data quality tests with the dataform-assertions package
dbt is no longer the only choice for testing data pipelines
Fumiaki Kobayashi
May 12
My Data Pipeline Orchestrators Journey
Originally Posted at: www.junaideffendi.com
Junaid Effendi
May 5
I spent 5 hours understanding more about the Delta Lake table format
All insights from the paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
Vu Trinh
May 4
What is something we have but don’t own and is never working when you need it.
Testing is difficult but pains could be eased with unified tooling. Here we explore the pros and cons of testing with new tools to help us
Peter Flook
May 2
Installing (and Switching between) Different Versions of Python
How to install and switch between different Python versions.
Yaakov Bressler
May 1
How We Integrate 1000++ Hive Tables into Data Warehouse Without ETL Seamlessly
Migrating our data warehouse to Greenplum enables us to access data from Hive in real time, eliminate storage issues, and much more!
Bernard Adhitya
Apr 26
Why Stream Processing is a Terrible Market (Yet We Are Still Investing in It)
TL;DR: It’s a challenging market, yet it holds promising prospects.
Yingjun Wu
Apr 25
Speeding Up Power BI: The Case for Surrogate Keys in Dimensional Modeling
Exploring the transition from composite to surrogate keys for enhanced performance and maintainability in data warehousing.
Ivanna Ditlevsen Jurkiv
Apr 25
Do We Need the Lakehouse Architecture?
When data lakes and data warehouses are not enough.
Vu Trinh
Apr 20
AWS Glue: essential tips for enhancing ETL development and operations
Explore 12 essential tips for Data Engineers and ETL Developers using AWS Glue
George Matheou
Apr 17
Best Practices for Writing Maintainable and Testable Spark Code in Scala
Enhancing Scalability and Reliability Through Structured Spark Development Practices
Thomas Cardenas
Apr 17
A Closer Look Into Databricks’s Photon Engine
Part 2 of Databricks’s Photon paper note: Vectorization
Vu Trinh
Apr 13
Memory Management in Apache Spark
Apache Spark’s performance advantage over MapReduce is greatest in the use-cases involving repeated computations. Much of this performance…
Solon Das
Apr 11
Simple Real-time Sentiment Analysis with Apache Spark and Kafka
It’s been a long time since my last article. Just an update, at the beginning of this year, finally I’ve officially become an Analytics…
Caesario Kisty
Apr 9
What do you do if you encounter technical interview questions that stump you?
We’ve all been there — you’re in the middle of a technical interview, and you get hit with a question that leaves you stumped. It’s a…
Ayush Thakur
Apr 9
Quickstart on Open Data Lakehouse with StarRocks + S3 + Delta Lake
Delta Lake, the open-source storage layer on top of data lakes like S3, brings reliability and structure to your data. But querying that…
Albert Wong
Apr 3
Cracking the Code: Tied Scores, a Window Functions Perspective
In modern businesses, performance dashboards have become essential tools for recognizing and celebrating top performers. Serving as a…
Nicholas Piesco
Apr 3
Deploying a Spark-based application as a Windows application
*Note: Not a medium member? Use this link to view the full article. **Note: If you want to generate or validate data, take a look at Data…
Peter Flook
Apr 3
What is real time analytics, and who are the top players in this OLAP space?
Imagine a world where your data insights are as fresh as this morning’s coffee. That’s the magic of real-time analytics, where you analyze…
Albert Wong
Apr 1
Expanding dbt metadata features
Leveraging custom properties for enriched documentation and SQL adaptability
Arthur Chaves
Apr 1
Easter egg hunt with BigQuery and User-Defined Functions (UDFs)
Discover the basics of extensibility with BigQuery User-Defined Functions (UDFs) with a little Easter egg hunt game
Volker Janz
Apr 1