I spent 5 hours learning how Google manages terabytes of metadata for BigQuery.
How Google manages metadata at a large scale.

To sustain my work, I’ve enabled the Medium paywall. If you’re already a Medium member, I deeply appreciate your support! But if you prefer to read for FREE, my newsletter is open to you: vutr.substack.com. Either way, you’re helping me continue writing!
Intro
The importance of metadata — data about data — should not be underestimated. It is vital in optimizing storage, query performance, and governance in data warehouse and lakehouse systems.
- How do we know which files belong to a table? We use metadata.
- How does the query engine know which files it can skip? The query engine uses metadata.
- How do we enforce ACID across many files in object storage? Again, we use metadata.
Managing metadata for small datasets is usually straightforward, but the situation changes significantly with massive tables. Metadata is typically much smaller than the dataset itself, yet this assumption holds mainly when the system records only high-level (coarse-grained) metadata. The more detailed the metadata, the more valuable it becomes for optimizing the query engine and storage management.
Fine-grained metadata tracks information at a much more granular level — such as metadata for each data block or each column within those blocks. This can quickly scale to millions of metadata objects, matching the scale of the underlying data. The challenge lies in efficiently managing this metadata without letting it become a bottleneck.
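To make the idea concrete, here is a minimal sketch of how a query engine can use fine-grained metadata, such as per-block min/max statistics for a column, to skip blocks that cannot match a predicate. All names here (`Block`, `ColumnStats`, `prune_blocks`) are illustrative assumptions, not BigQuery's actual data structures:

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical per-block, per-column statistics (not BigQuery's real format).
@dataclass
class ColumnStats:
    min_value: int
    max_value: int

@dataclass
class Block:
    path: str
    stats: Dict[str, ColumnStats]  # column name -> stats for that column

def prune_blocks(blocks: List[Block], column: str, value: int) -> List[Block]:
    """Keep only blocks whose recorded min/max range could contain `value`."""
    survivors = []
    for block in blocks:
        s = block.stats.get(column)
        # Without stats we must scan the block; with stats we can skip it
        # whenever the predicate value falls outside the recorded range.
        if s is None or s.min_value <= value <= s.max_value:
            survivors.append(block)
    return survivors

blocks = [
    Block("part-0", {"user_id": ColumnStats(1, 100)}),
    Block("part-1", {"user_id": ColumnStats(101, 200)}),
    Block("part-2", {"user_id": ColumnStats(201, 300)}),
]

# A query like `WHERE user_id = 150` only needs to read one of the three blocks.
matching = prune_blocks(blocks, "user_id", 150)
print([b.path for b in matching])  # ['part-1']
```

The catch, as the paragraph above notes, is that these statistics exist for every block and every column, so the metadata store itself can grow to millions of entries and must be queried just as efficiently as the data.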
Google’s BigQuery, a cloud-based data warehouse, tackles these challenges head-on. Rather than treating metadata as a secondary concern, BigQuery employs innovative techniques to manage metadata at scale, treating it with the same priority as the data. This allows the system to efficiently store and query billions of metadata objects with high performance.