I spent 3 hours figuring out how BigQuery inserts, deletes and updates data internally. Here’s what I found.

“Just open the file and modify the data, … No?”

Vu Trinh
Data Engineer Things


This was originally published at https://vutr.substack.com

Image created by the author

Intro

In BigQuery, you can do these things with SQL:

  • Load data using the INSERT statement.
  • Update data using the UPDATE or MERGE statement.
  • Delete data using the DELETE statement.
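
For example, here is a minimal sketch of each statement, assuming a hypothetical mydataset.users table:

  -- Load new rows.
  INSERT INTO mydataset.users (id, name)
  VALUES ('ABC', 'Alice'), ('DEF', 'Bob');

  -- Update existing rows by merging in a staging table.
  MERGE mydataset.users AS t
  USING mydataset.users_staging AS s
  ON t.id = s.id
  WHEN MATCHED THEN
    UPDATE SET name = s.name
  WHEN NOT MATCHED THEN
    INSERT (id, name) VALUES (s.id, s.name);

  -- Delete rows that match a filter.
  DELETE FROM mydataset.users
  WHERE id = 'DEF';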

Besides SQL, you can also interact with the data in BigQuery in other ways. All of these approaches are straightforward from the user's perspective. But have you ever wondered how BigQuery executes these operations under the hood?

If yes, then this article is for you.

This article is my note on the section Architecture of BigQuery — Storage from the book Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale — 2019.

Background info before we get started

When you load data into BigQuery, it is stored on Colossus, Google's scalable storage system. Once data is written to a file, that file can never be modified again; in other words, the files are immutable.

This makes it easy to parallelize processing, because the system just needs to hand a copy of the data files to the workers without worrying that the data is being modified somewhere else. (If files could be modified, guaranteeing data consistency across, say, 100 workers would be a nightmare!)

Immutability also makes some optimizations more convenient. For example, an OLAP database usually maintains metadata such as the min-max values of each data chunk, which is used to skip unnecessary data files. If the system allowed data to be modified after the initial write, it would need to recalculate those min-max values on every modification, slowing down the whole OLAP system.

Moreover, because every table modification results in entirely new files, immutability makes features like table snapshots, time travel, and table cloning easier to implement. Other cloud data warehouses, such as Snowflake, Databricks, and Redshift, also operate on immutable data storage.
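
Two of those features are exposed directly in SQL. Here is a minimal sketch, assuming a hypothetical mydataset.users table:

  -- A snapshot is a read-only, point-in-time copy of the table.
  CREATE SNAPSHOT TABLE mydataset.users_snapshot
  CLONE mydataset.users;

  -- A clone is a lightweight, writable copy of the table.
  CREATE TABLE mydataset.users_clone
  CLONE mydataset.users;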

Now that we have the background information, let's move on to how inserts, deletes, and updates are implemented in BigQuery.

The storage set

BigQuery doesn’t treat each file as an atomic unit of data; instead, it has an abstraction called a storage set.

A storage set is created in response to a load job, streaming job, or Data Manipulation Language (DML) query.

A transaction modifies a table's data by creating a new set of files (because existing files are immutable); these new files belong to a new storage set. The storage sets themselves are also immutable.

Image created by the author

Storage sets are what allow modifications in BigQuery to provide the ACID guarantees.

This means operations are Atomic (all or nothing), Consistent (once they commit, they are visible everywhere), Isolated (transactions can be executed independently of one another), and Durable (a committed transaction won't be lost).

A storage set has a life cycle: it starts in the PENDING state, progresses to the COMMITTED state once the file-writing process finishes, and finally moves to the GARBAGE state when it is no longer needed, marking it as ready for the garbage collector.

While data is still being written to a storage set, that data is never visible to users. Only when the storage set reaches the COMMITTED state does the data become available.

Image created by the author

Storage sets also carry size information, which is how a dry run can determine how much data a query would scan without actually running it.

The dry run shows how much data a random query of mine would scan.

INSERT, UPDATE, and DELETE

Whole new files.

INSERT

When an INSERT statement is executed, the data is written to a new set of files in storage. These files belong to a new storage set, which is then added to the metadata. (This metadata holds information such as the commit timestamp of the storage set, which files belong to the storage set, which storage sets belong to a table, and so on.)

Image created by the author

DELETE

Removing rows from a table (physically, from the data files) is more complicated than the INSERT operation. Because files are immutable, the system cannot open a file and discard the targeted rows from it. Let's go through the DELETE operation with the illustration below:

Image created by the author

Assume you want to delete the record where id = ABC, and this record lives in file Z; file Z belongs to storage set 1 along with files X and Y.

  1. To execute the delete operation, the system creates a new file Z2 with the same data as file Z, except for the record with id = ABC. File Z2 belongs to the new storage set 2.
  2. Storage set 2 must also point to files X and Y, but not to file Z.
  3. Finally, storage set 1 is marked GARBAGE.
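
For reference, the statement that triggers this whole rewrite is a single line of SQL (the table name here is made up):

  -- Deleting one row causes file Z to be rewritten as Z2 inside a new storage set.
  DELETE FROM mydataset.users
  WHERE id = 'ABC';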

UPDATE

UPDATE is implemented as a combination of the INSERT and DELETE operations. Instead of updating the record in place inside a file, the system creates a new file containing the latest version of that record and deletes the old file.
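
In SQL terms, a single statement like the sketch below (table name assumed) is executed internally as a delete of the old row version plus an insert of the new one, producing a new storage set:

  -- Internally: remove the old version of the row, write the new version to new files.
  UPDATE mydataset.users
  SET name = 'Alice Cooper'
  WHERE id = 'ABC';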

Storage optimization

The compaction process

Storage fragmentation can happen as you write or update data over time.

Suppose you write 200 KB of data into BigQuery every two minutes, and each 200 KB write gets its own storage set and its own file. (I can't find the maximum file size in BigQuery storage.) After a month, you'll have only about 4 GB of data, which is tiny for a cloud data warehouse like BigQuery, yet it will be spread across more than 20,000 files and storage sets. This will undoubtedly harm query performance, because BigQuery needs to spend time operating on many files and their associated metadata.

To overcome this challenge, the storage optimizer rearranges data files into a form that is more optimal for reading. This process runs automatically behind the scenes: data may initially be written into many small files as the user's requests arrive, and the optimizer periodically rewrites and compacts it into larger files.

Let’s check the following illustration for a better understanding.

Image created by the author

Partitioning

Partitioning is an optimization technique in BigQuery that divides a large table into smaller parts called "partitions". Based on the query's filter, the system only needs to read the relevant files and can skip everything else.

For example, if you need data from 2024-01-01 to 2024-01-15 and your table is partitioned on the date column, only the files belonging to the partitions from 2024-01-01 through 2024-01-15 need to be read.
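
To make this concrete, here is a minimal sketch in BigQuery SQL, assuming a hypothetical orders table partitioned by its order_date column:

  -- Create a table partitioned by day on the order_date column.
  CREATE TABLE mydataset.orders (
    order_id STRING,
    order_date DATE,
    amount NUMERIC
  )
  PARTITION BY order_date;

  -- Only the partitions for 2024-01-01 through 2024-01-15 are scanned.
  SELECT order_id, amount
  FROM mydataset.orders
  WHERE order_date BETWEEN '2024-01-01' AND '2024-01-15';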

Internally, a partition behaves much like a small table of its own. Data from one partition is stored separately from other partitions. This allows features like data expiration, data insertion, and data deletion to be executed efficiently at partition granularity (because a partition is just like a table).

At the time of writing, BigQuery limits the number of partitions per table to 4,000. Over-partitioning (physically dividing the table into too many partitions) hurts performance: the more partitions you have, the more metadata is produced, and the query optimizer struggles when it has to read many metadata files.

BigQuery represents partitions in the metadata by associating each storage set with a partition ID. This way, BigQuery can apply the partition filter at the metadata layer without opening the physical data. Let's check another illustration here:

Image created by the author

Clustering

Besides partitioning, BigQuery also exposes another optimization technique that allows users to control how data is stored internally.

That technique is clustering. Clustering stores data semi-sorted by a key built from one or more columns (at most four, according to the documentation). Data files are assigned non-overlapping ranges of the key space. This allows efficient lookups and range scans, because the query engine only needs to open the files whose key range contains the requested key.
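
A minimal sketch of a clustered table and a query that benefits from it, again with made-up names:

  -- A table partitioned by day and clustered by customer_id.
  CREATE TABLE mydataset.events (
    event_ts TIMESTAMP,
    customer_id STRING,
    payload STRING
  )
  PARTITION BY DATE(event_ts)
  CLUSTER BY customer_id;

  -- A lookup like this only needs to open files whose key range covers 'C-123'.
  SELECT *
  FROM mydataset.events
  WHERE DATE(event_ts) = '2024-01-10'
    AND customer_id = 'C-123';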

Image created by the author

Re-clustering

Because the data needs to be maintained in a particular order (the cluster key), newly arriving data can end up in files whose key ranges overlap, which degrades the clustering property of the table.

To solve this problem, BigQuery lets users write data whenever they need to; the data is written to new files as usual and then re-clustered periodically in the background.

To decide when to re-cluster, BigQuery maintains a clustering ratio, which indicates the fraction of the data that is completely clustered; if the ratio drops too low, BigQuery rewrites the data in sorted order. This rewrite produces a new storage set (the storage set is the atomic unit of data, remember?). BigQuery handles re-clustering automatically; users don't need to worry about it.

One more illustration to clear the idea:

Image created by the author

Time travel

BigQuery supports time travel over a configurable window (a minimum of two days and a maximum of seven days). This feature allows you to revisit the past state of a table at any point within that window.

Image created by the author

This can be useful if you accidentally delete data and want to restore it. It is also helpful for debugging, when you want to check a table's data before some transformation was applied. To enable time travel, BigQuery keeps track of the timestamps at which storage set transitions happen.
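
In SQL, time travel is exposed through the FOR SYSTEM_TIME AS OF clause. A minimal sketch with a hypothetical table:

  -- Query the table as it looked 24 hours ago.
  SELECT *
  FROM mydataset.orders
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);

  -- Restore accidentally deleted rows into a new table.
  CREATE TABLE mydataset.orders_recovered AS
  SELECT *
  FROM mydataset.orders
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);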

Outro

In this article, I've given you a glimpse into how data operations are handled physically inside BigQuery.

The key here is the immutability of the data files, which is entirely different from traditional OLTP databases, where files can be modified after being written.

Thank you for reading this far. See you in my future blog.

My newsletter is a place where I note down things I learn from people smarter than me in the form of a weekly blog-style email.

So, if you want to learn and grow with me, subscribe here: https://vutr.substack.com.
