Data Engineer Things

Insights and ideas on data and engineering.

Create your own Gemini AI-chatbot with a twist using Python, Jinja2 and NiceGUI

Discover the basics of using Gemini with Python via VertexAI, creating a Web UI with NiceGUI and using Jinja2 to construct modular prompts

Apr 25

Trending Now

Pydantic for Experts: Reusing & Importing Validators

Advanced techniques for reusing and importing validation across Python models.

Yaakov Bressler

Apr 21

How to think about Internal Data Products as a Data Engineer

Data Products are all the rage, but why?

Apr 17

I completed a Senior Data Engineer Code Challenge for fun, and this is how it went. PART II

Question: Using MySQL’s public employee sample database, create a DAG to move data from the employee’s table to BigQuery.

Apr 27

I spent another 8 hours understanding the design of Amazon Redshift. Here’s what I found.

All insights from the Redshift academic paper: Amazon Redshift re-invented in 2022

Mar 16

Why did Databricks build the Photon engine?

The Lakehouse, its motivation, and the difference between Photon and the existing engine.

Apr 6

Deleting Duplicate Data With No Primary Key — A Data Engineer's Worst Nightmare?

Here’s how you can delete duplicate data with no primary key.

Mar 12

Latest stories

Understanding Snowflake Table Locks

A hands-on look at table locks.

May 16

Automate Dbt Date Logic with Python — Part 2

Simplifying Our Models and Tests From Part 1 Using Meta Config

May 14

The Inheritance Schema Design Pattern for MongoDB Data Modelling

In the world of NoSQL databases, particularly MongoDB, designing an efficient data model is crucial for optimal application performance…

May 12

How I build an ETL pipeline with AWS Glue, Lambda, and Terraform

A Step-by-Step Guide

May 12

Enhance your data quality tests with the dataform-assertions package

dbt is no longer the only choice for testing data pipelines

Fumiaki Kobayashi

May 12

My Data Pipeline Orchestrators Journey

Originally Posted at: www.junaideffendi.com

May 5

I spent 5 hours understanding more about the Delta Lake table format

All insights from the paper: Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores

May 4

What is something we have but don’t own and is never working when you need it.

Testing is difficult, but the pain could be eased with unified tooling. Here we explore the pros and cons of testing with new tools to help us

May 2

Installing (and Switching between) Different Versions of Python

How to install and switch between different Python versions.

Yaakov Bressler

May 1

How We Integrate 1000++ Hive Tables into Data Warehouse Without ETL Seamlessly

Migrating our data warehouse to Greenplum enables us to access data from Hive in real time, eliminate storage issues, and much more!

Bernard Adhitya

Apr 26

Why Stream Processing is a Terrible Market (Yet We Are Still Investing in It)

TL;DR: It’s a challenging market, yet it holds promising prospects.

Apr 25

Speeding Up Power BI: The Case for Surrogate Keys in Dimensional Modeling

Exploring the transition from composite to surrogate keys for enhanced performance and maintainability in data warehousing.

Ivanna Ditlevsen Jurkiv

Apr 25

Do We Need the Lakehouse Architecture?

When data lakes and data warehouses are not enough.

Apr 20

AWS Glue: essential tips for enhancing ETL development and operations

Explore 12 essential tips for data engineers and ETL developers using AWS Glue.

Apr 17

Best Practices for Writing Maintainable and Testable Spark Code in Scala

Enhancing Scalability and Reliability Through Structured Spark Development Practices

Thomas Cardenas

Apr 17

A Closer Look Into Databricks’s Photon Engine

Part 2 of Databricks’s Photon paper note: Vectorization

Apr 13

Memory Management in Apache Spark

Apache Spark’s performance advantage over MapReduce is greatest in the use-cases involving repeated computations. Much of this performance…

Apr 11

Simple Real-time Sentiment Analysis with Apache Spark and Kafka

It’s been a long time since my last article. Just an update, at the beginning of this year, finally I’ve officially become an Analytics…

Apr 9

What do you do if you encounter technical interview questions that stump you?

We’ve all been there — you’re in the middle of a technical interview, and you get hit with a question that leaves you stumped. It’s a…

Apr 9

Quickstart on Open Data Lakehouse with StarRocks + S3 + Delta Lake

Delta Lake, the open-source storage layer on top of data lakes like S3, brings reliability and structure to your data. But querying that…

Apr 3

Cracking the Code: Tied Scores, a Window Functions Perspective

In modern businesses, performance dashboards have become essential tools for recognizing and celebrating top performers. Serving as a…

Nicholas Piesco

Apr 3

Deploying a Spark-based application as a Windows application

*Note: Not a Medium member? Use this link to view the full article. **Note: If you want to generate or validate data, take a look at Data…

Apr 3

What is real time analytics, and who are the top players in this OLAP space?

Imagine a world where your data insights are as fresh as this morning’s coffee. That’s the magic of real-time analytics, where you analyze…

Apr 1

Expanding dbt metadata features

Leveraging custom properties for enriched documentation and SQL adaptability

Apr 1

Easter egg hunt with BigQuery and User-Defined Functions (UDFs)

Discover the basics of extending BigQuery with User-Defined Functions (UDFs) through a little Easter egg hunt game

Apr 1
