DATA ARCHITECTURE

Why Do You Need a Data Architect?

And how one can help you make key decisions.

Gaurav Thalpati
Data Engineer Things
9 min read · Feb 13, 2023


Have you ever faced any of these questions?

  • Data Architect? Do you really need one for this project?
  • Why can’t one of the senior Data Engineers take this up as an additional activity?
  • Can we have just one engineer to architect, model, administer, and manage our platform?
  • Can’t the Tech PM perform this role? They have several years of data experience.

Whenever a Program Manager running a “Data and Analytics” program puts up a case to onboard/hire a “Data Architect,” these are some questions that the Program Sponsors/Leaders would generally ask.

While this may not be the case for larger enterprises or strategic programs, convincing leadership to onboard a dedicated Data Architect for SMBs (Small and Medium Businesses) implementing data platforms often takes real effort.

This blog post highlights key design decisions you must make while implementing a data platform. These decisions are not easy to make and need expertise across cloud and data technologies, a good understanding of warehousing concepts, and practical experience working in several data programs across industries.

Key Design Decisions for Implementing a Data Platform.

While most Data Engineers are good at Python, SQL, and Spark programming, they may lack the experience across data projects and cloud technologies required to provide the inputs these platform design decisions need.

That’s where a Data Architect can help.

A Data Architect’s primary responsibility is to design an architecture for implementing a data platform that can support various data use cases, including batch, streaming, ETL, BI, and ML. Designing such systems needs experience, expertise, and the ability to communicate and collaborate with different internal departments.

A Data Architect can help you get started and guide you as you progress on your data journey.

The following section covers key design decisions you must make to implement a data platform. Data Architects can create design documents and help you make these decisions, which can have a long-term impact on all internal and external platform users.

Key Design Decisions — Table of Contents

Designing Ingestion Framework
Designing Control and Audit Tables
Data Lake or Data Warehouse or a Lakehouse?
File and Open Table Formats
Data Catalog
ETL or ELT?
Data Validations and Transformations
Data Protection
Workflow Orchestration
Access Control and Sensitive Data Handling

Designing Ingestion Framework

Creating a data ingestion framework is the starting point for most data programs. It can become extremely complex if you try to implement a generic framework that can ingest data from various sources (CSV, RDBMS, NoSQL, Kafka).

A Data Architect can help you decide how to design generic ingestion frameworks that can be used across source systems and are also simple enough to be quickly adopted by all data engineers.

Key Decisions/Considerations

  • Should you use a No Code/Low Code/GUI-based tool like AWS Glue or Azure Data Factory, or build a custom framework on managed Spark like Amazon EMR, Databricks, or Azure Synapse Analytics?
  • You can also use products like Fivetran or Airbyte, which need no management or admin effort and provide practical “out of the box” features like continuous replication, incremental extraction, and target schema creation.
  • Every cloud provider has several tools that can perform ingestion. For example, in AWS you can ingest data from an RDBMS source using AWS DMS, AWS Glue, Amazon EMR, or Amazon Athena (federated queries). Which one is most suitable for your use case? (A minimal sketch of the custom-framework route follows this list.)
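To make the build-vs-buy trade-off concrete, here is a minimal sketch of what the custom, config-driven route on managed Spark might look like. The source registry, paths, and connection details are illustrative assumptions, not a reference to any specific product.

```python
# A minimal sketch of a config-driven ingestion framework: each source is
# described by a small dict, and one generic function reads it with Spark
# and lands it in the lake. All names/paths below are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("generic-ingestion").getOrCreate()

# Hypothetical source registry -- in practice this would live in a control
# table or a YAML file maintained by the team.
SOURCES = {
    "orders_csv": {"format": "csv", "options": {"header": "true"},
                   "path": "s3://raw-zone/orders/"},
    "customers_db": {"format": "jdbc",
                     "options": {"url": "jdbc:postgresql://host:5432/crm",
                                 "dbtable": "public.customers",
                                 "user": "ingest", "password": "***"}},
}

def ingest(source_name: str, target_path: str) -> None:
    """Read one configured source and append it as Parquet in the lake."""
    cfg = SOURCES[source_name]
    reader = spark.read.format(cfg["format"]).options(**cfg["options"])
    df = reader.load(cfg["path"]) if "path" in cfg else reader.load()
    df.write.mode("append").parquet(target_path)

ingest("orders_csv", "s3://curated-zone/orders/")
```

The simplicity is the point: a framework like this is easy for every data engineer to adopt, but the Architect still has to decide whether building and maintaining it beats a managed tool.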

Designing Control and Audit Tables

You will need to design control and audit tables. These tables feed operational dashboards, such as the number of files pending, completed, or aborted. They also help you understand which records each batch loaded into the target tables. Designing control tables becomes more complex for systems that require streaming or micro-batch processing.

Key Decisions/Considerations

  • How to design the control tables, their schemas, and attributes?
  • What is the best approach to identify which records in the target table were inserted or updated by which batch?
  • How to implement an operational database that can store execution details across cloud, SaaS products, and on-prem systems, giving you a single pane of glass? For example, can you store the batch execution details for an SAP (on-prem) extract, Amazon EMR processing, and a Snowflake (target) load in one place? (A minimal control-table sketch follows this list.)
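As a starting point for the schema discussion, here is a minimal sketch of a batch control table created via Spark SQL. The table and column names are illustrative assumptions, and Delta is assumed only as an example table format.

```python
# A minimal sketch of a batch control table. Names and the Delta format are
# illustrative assumptions; adjust the USING clause for Iceberg/Hudi.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("control-tables").getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS ops")
spark.sql("""
    CREATE TABLE IF NOT EXISTS ops.batch_control (
        batch_id        BIGINT,
        source_system   STRING,     -- e.g. 'sap', 'crm_postgres', 'clickstream'
        object_name     STRING,     -- file or table being ingested
        status          STRING,     -- 'PENDING' | 'RUNNING' | 'COMPLETED' | 'ABORTED'
        records_read    BIGINT,
        records_loaded  BIGINT,
        started_at      TIMESTAMP,
        finished_at     TIMESTAMP
    ) USING delta
""")

# Each pipeline run inserts/updates its own row; operational dashboards then
# aggregate this table (files pending vs. completed, records loaded per batch).
```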

Data Lake or Data Warehouse or a Lakehouse?

You need to decide on an approach for storing data that can serve both analytics and ML use cases. How do you know which one suits you best? A Data Architect can guide this decision by asking the right questions and walking through the key design considerations.

Key Decisions/Considerations

  • Deciding the best architecture. Should you implement Data Lake + Data Warehouse to support all your workloads, or should you implement a Lakehouse that can help keep data in one place?
  • How many buckets (AWS) or storage accounts (Azure) should be created? What should their lifecycle policies be? (A small lifecycle-policy example follows below.)
  • Decide on tools for implementing a lakehouse. What is the right combination for you?

Typical tool combinations fall into an AWS stack, an Azure stack, or other (multi-cloud/SaaS) stacks.
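To illustrate the lifecycle-policy decision on AWS, here is a minimal boto3 sketch. The bucket name, prefix, and retention periods are illustrative assumptions; the right values depend on your cost and compliance requirements.

```python
# A minimal sketch of a lifecycle policy on a raw-zone bucket. Bucket name,
# prefix, and retention periods are illustrative assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-company-raw-zone",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-files",
                "Filter": {"Prefix": "landing/"},
                "Status": "Enabled",
                # Move older raw files to cheaper storage, then delete them.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```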

File and Open Table Formats

Most modern data platforms are now adopting the lakehouse architecture, which provides the “best of both worlds” from lakes and warehouses. The technology that powers the lakehouse is the open table format (storage framework) that gives it ACID capabilities. A Data Architect can help you decide which one you should choose.

Key Decisions/Considerations

  • Apache Hudi vs. Apache Iceberg vs. Delta Lake — which one to choose?
  • You will find comparison documents and blogs with benchmarking numbers, but how do you decide whether those numbers are relevant to your use case?
  • Do these work well with the rest of your tech stack? For example, would Iceberg integrate with ADLS (storage) and Synapse Serverless SQL (compute)? (A short table-format sketch follows this list.)
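As one concrete illustration, here is a minimal PySpark sketch using Delta Lake (chosen only as an example; Hudi and Iceberg offer equivalent capabilities). Paths are illustrative, and a Spark session with the Delta Lake package configured is assumed.

```python
# A minimal sketch of writing and updating a table with an open table format
# (Delta Lake here). Paths are illustrative; requires the Delta Lake package.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("table-format-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "ok"), (2, "late")], ["order_id", "status"])

# ACID write: readers see either the old or the new version, never partial files.
df.write.format("delta").mode("overwrite").save("s3://curated-zone/orders_delta/")

# Row-level update -- the capability plain Parquet files on a lake lack.
spark.sql(
    "UPDATE delta.`s3://curated-zone/orders_delta/` SET status = 'ok' WHERE order_id = 2"
)
```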

Data Catalog

Metadata Management or Data Cataloging is another vital activity that needs careful consideration. You will have to decide whether you should use a cloud-native tech stack or an external product to implement it.

Key Decisions/Considerations
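The considerations here mirror the build-vs-buy question above: cloud-native cataloging versus an external product. As one cloud-native example, here is a minimal boto3 sketch that registers lake data in the AWS Glue Data Catalog via a crawler; the crawler name, IAM role, and paths are illustrative assumptions, and an external catalog product would replace this with its own connectors.

```python
# A minimal sketch of cataloging the curated zone with an AWS Glue crawler.
# Crawler name, role ARN, database, and path are illustrative assumptions.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="curated-zone-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="curated",
    Targets={"S3Targets": [{"Path": "s3://curated-zone/"}]},
    Schedule="cron(0 2 * * ? *)",  # crawl nightly so new tables/partitions get cataloged
)
glue.start_crawler(Name="curated-zone-crawler")
```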

ETL or ELT?

Where do you want to run your processing: before the data is loaded into the target (ETL), or after the raw data lands in the lake/warehouse (ELT)? This is another question I’ve often come across, and finalizing the approach depends on multiple factors like tech stack, architecture, and available skills (a short sketch contrasting the two follows).

Key Decisions/Considerations

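Here is a minimal PySpark sketch contrasting the two approaches. The warehouse connection details, paths, and table names are illustrative assumptions, and the JDBC driver is assumed to be on the classpath.

```python
# A minimal ETL vs. ELT sketch. Paths, credentials, and table names are
# illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()

warehouse_jdbc = {
    "url": "jdbc:postgresql://warehouse-host:5432/analytics",  # hypothetical warehouse
    "user": "loader",
    "password": "***",
}

raw = spark.read.parquet("s3://raw-zone/orders/")

# --- ETL: transform in Spark *before* loading into the warehouse --------------
cleaned = raw.dropDuplicates(["order_id"]).filter("order_amount > 0")
(cleaned.write.format("jdbc").options(**warehouse_jdbc)
    .option("dbtable", "analytics.orders").mode("append").save())

# --- ELT: load the raw data as-is, then transform with SQL in the warehouse ---
(raw.write.format("jdbc").options(**warehouse_jdbc)
    .option("dbtable", "staging.orders_raw").mode("append").save())
# Warehouse-side transformation (often managed with a tool like dbt):
#   CREATE TABLE analytics.orders AS
#   SELECT DISTINCT order_id, order_amount FROM staging.orders_raw
#   WHERE order_amount > 0;
```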

Data Validations and Transformations

The traditional approach of validating, rejecting, and cleaning data before it gets loaded into the warehouse might not hold in today’s world, where the focus is on ingesting data and making it available as soon as possible. However, you still need to check and improve data quality so it is suitable for accurate insights and ML predictions. A Data Architect can help you decide which validations are essential and the best approach to transformations before data is made available to your consumers.

Key Decisions/Considerations

  • How to decide the “must have” data validations?
  • Are any “out-of-the-box” features available within your tools for implementing data quality checks, e.g., Databricks DLT expectations or the AWS Labs deequ library? Should you use these features or write custom validation code?
  • Should you consider open-source tools like Great Expectations, Soda, or other Python libraries?
  • Do you want to consider Data Observability platforms to provide run-time health status and reduce “data downtime”? (A hand-rolled validation sketch follows this list.)
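For illustration, here is a hand-rolled PySpark sketch of a few “must have” checks; these are the same kinds of rules that tools like Great Expectations, Soda, or DLT expectations give you declaratively. Column names and thresholds are illustrative assumptions.

```python
# A minimal sketch of hand-rolled data quality checks in PySpark.
# Column names and rules are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://raw-zone/orders/")

checks = {
    # Primary key must be populated and unique.
    "no_null_keys":  df.filter(F.col("order_id").isNull()).count() == 0,
    "unique_keys":   df.count() == df.select("order_id").distinct().count(),
    # Business rule: amounts must be non-negative.
    "valid_amounts": df.filter(F.col("order_amount") < 0).count() == 0,
    # The batch should not be empty.
    "not_empty":     df.count() > 0,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    # In a real pipeline this would also write to the audit table and stop the DAG.
    raise ValueError(f"Data quality checks failed: {failed}")
```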

Data Protection

Data should be secured at all times; whether it is “data at rest” or “data in transit,” it should be protected. You will have to decide on the encryption keys (cloud-managed or customer-managed), how many keys to create, who should have access to them, and at what level. (A customer-managed-key sketch follows below.)

Key Decisions/Considerations
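To make the customer-managed-key option concrete, here is a minimal boto3 sketch that creates a KMS key and enforces it for encryption at rest on a lake bucket. The alias and bucket names are illustrative assumptions.

```python
# A minimal sketch of the customer-managed-key route on AWS.
# Alias and bucket names are illustrative assumptions.
import boto3

kms = boto3.client("kms")
s3 = boto3.client("s3")

# Customer-managed key: you control rotation and the key policy (who can use it).
key = kms.create_key(Description="Data lake encryption key")
kms.create_alias(AliasName="alias/data-lake-key",
                 TargetKeyId=key["KeyMetadata"]["KeyId"])

# Enforce encryption at rest on the bucket with that key.
s3.put_bucket_encryption(
    Bucket="my-company-raw-zone",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": key["KeyMetadata"]["Arn"],
            }
        }]
    },
)
```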

Workflow Orchestration

Native cloud tech stack vs. external third-party product is a debate that comes up for every platform feature, and orchestration is no exception. A Data Architect can help you decide between Amazon Managed Workflows for Apache Airflow, Airflow on EC2, or Astronomer. The Data Architect cannot make these decisions independently, but they can help you list the critical considerations that feed into them.

Key Decisions/Considerations

  • Should you use cloud-native tools like AWS Step Functions or ADF?
  • Should you consider Amazon MWAA (Amazon Managed Workflows for Apache Airflow) or Databricks Workflows?
  • You get a lot more flexibility with self-managed Airflow on EC2. Is the additional admin effort worth that flexibility?
  • Other managed/SaaS products — Astronomer, Dagster? (A minimal Airflow DAG sketch follows this list.)
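For reference, here is a minimal Airflow DAG sketch; the same definition runs largely unchanged on Amazon MWAA, Astronomer, or self-managed Airflow, which is part of what makes this an operations decision rather than a coding one. The task callables and schedule are illustrative assumptions.

```python
# A minimal Airflow DAG sketch. Task bodies and schedule are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_orders():
    print("ingest raw files into the lake")  # placeholder for the real ingestion call

def run_dq_checks():
    print("run data quality checks")  # placeholder for the real validation call

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_orders", python_callable=ingest_orders)
    validate = PythonOperator(task_id="run_dq_checks", python_callable=run_dq_checks)
    ingest >> validate  # run validations only after ingestion succeeds
```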

Access Control and Sensitive Data Handling

Before you provide access to your data consumers, you will have to implement an access control strategy based on each consumer’s role. These consumers can be analysts running queries, BI dashboards, or ML models. You will have to implement access controls at a granular level (table, column, row), not just at the file level. You will also need to abstract sensitive data so that it is not exposed to unauthorized users.

Key Decisions/Considerations

  • AWS — AWS Lake Formation can help you hide columns. How do you implement hashing/tokenization if you need to use sensitive columns as search filters or join conditions? (See the sketch after this list.)
  • Azure — Dynamic Masking can be used once data is in the cloud, but is there a way to abstract PII data before it gets loaded to the cloud?
  • Others — Snowflake/Databricks dynamic masking, or do you want to use SaaS products like Immuta or Raito for access control?
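As one way to keep sensitive columns usable in joins and filters, here is a minimal PySpark hashing sketch. The column names and salt handling are illustrative assumptions; tokenization with a lookup vault would be the reversible alternative.

```python
# A minimal sketch of salted hashing for a PII column so it can still be used
# in joins/filters. Columns, paths, and salt handling are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pii-hashing").getOrCreate()
customers = spark.read.parquet("s3://raw-zone/customers/")

SALT = "replace-with-a-secret-from-your-kms-or-secrets-manager"  # never hard-code

masked = (
    customers
    .withColumn("email_hash", F.sha2(F.concat(F.col("email"), F.lit(SALT)), 256))
    .drop("email")  # only the hash leaves the restricted zone
)
masked.write.mode("overwrite").parquet("s3://curated-zone/customers_masked/")

# Downstream joins still work because the same input always hashes to the same value:
#   orders.join(masked, orders["email_hash"] == masked["email_hash"])
```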

This is a never-ending list.

By no means is this a comprehensive list. I’ve mentioned just a few of the everyday decisions/challenges. Many of these depend on your cloud platform, org-approved tool stack, and available expertise within data teams.

As technology advances and new architectural patterns evolve, new challenges will emerge. The challenges faced by organizations implementing Data Mesh might be similar, but specific to their domains.

Please use this list only as a reference and prepare to face new, complex, more specific challenges with every new use case or functional requirement.

OR

Just onboard a “Data Architect” who can guide you to make the right choices!

Disclaimers

  • Most examples here are based on AWS or Azure, as I have experience working on these two platforms. GCP is not covered because I lack hands-on experience with the Google Cloud Platform.
  • I’ve not used some of these tools/technologies/products but have some level of understanding of their features and how to leverage them.
  • Data Architects can help you to make most of these decisions; however, they cannot do this independently. These need to be taken collectively with inputs from all relevant stakeholders.
  • Finding a Data Architect with expertise across all these technologies is difficult. Go with the best candidate that you have for this role!
  • Senior Data Engineers can be mentored to play this role, provided they have the required expertise. But they will need more time to do justice to both roles simultaneously. Every data program needs Data Engineer, Data Architect, and Project Manager roles.
  • I’ve excluded data modeling challenges, such as selecting the correct modeling technique, the right set of tables and attributes, etc. You need a strong domain expert and experienced data modeler to help make these decisions.

If you have faced any of these challenges, or others that are not on the list, please comment or highlight them in this article so we can all learn about the most common challenges different teams face.

For further questions, please DM me on LinkedIn or Twitter, and I’ll be happy to share my views!

You can also subscribe to my Substack newsletter and YouTube channel to learn about various data concepts, architectures, and tools. Links below.
