What is a data lakehouse? How is ELT different from ETL? And what the heck is a semantic layer?
Below is a short attempt to summarize my learnings about the purpose of the individual components of the data engineering stack, how they relate to each other, and the relevant market offerings for each.
The above is an approximation of the modern data flow. It comprises:
There are many variations on the model above. Some variations leverage the older ETL pattern, which transforms data before it reaches the warehouse, making the OLAP schema the only one present in the warehouse.
Another variation is deploying advanced BI tools, like Preset or Power BI, that can handle both internally-facing and externally-facing needs.
One of the critical decisions during the design of a data pipeline is designating where data is transformed from its OLTP-optimized form to its OLAP shape.
Data transformation is necessary because the nature of queries differs between OLTP (transactional model) and OLAP (analytical model). OLTP workloads operate on full individual rows through CRUD operations. Conversely, OLAP queries are typically column-oriented. In other words, OLAP queries want to know the aggregate values of a small group of columns, but across many rows simultaneously. Keep this in mind when we discuss columnar data warehouses. 💡
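To make the contrast concrete, here is a minimal sketch in Python using SQLite and a hypothetical orders table (table and column names are invented for illustration):

```python
import sqlite3

# Hypothetical "orders" table to contrast OLTP and OLAP access patterns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, status TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, status, amount) VALUES (?, ?, ?)",
    [("alice", "shipped", 30.0), ("bob", "pending", 12.5), ("alice", "shipped", 7.5)],
)

# OLTP-style query: fetch one full row by its key (the read half of CRUD).
row = conn.execute("SELECT * FROM orders WHERE id = 2").fetchone()

# OLAP-style query: aggregate a single column across many rows.
total_by_status = dict(
    conn.execute("SELECT status, SUM(amount) FROM orders GROUP BY status").fetchall()
)
```

The first query touches every column of one row; the second touches one column of every row. That difference in shape is what the OLTP-to-OLAP transformation is optimizing for.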
There are several points during the lifecycle of the data when we can perform the transition from row-oriented OLTP to column-oriented OLAP.
What is the difference between transforming data in your data warehouse and building a pre-aggregation in the semantic layer? One consideration is that all analytical queries should be pointed at the OLAP models, meaning that the OLAP model needs to be complete. However, making the analytical models too coarse-grained can be harmful, as it reduces data exploration possibilities for downstream tools. Once again, the name of the game is balance and understanding the underlying business data and requirements.
For this reason, pre-aggregations exist. They don't take away querying freedom; instead, they introduce a side mechanism that speeds up frequent queries. In other words, pre-aggregations make frequent queries fast and keep the rare ones possible. A bit more information can be found here.
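A toy sketch of the idea, with invented table and column names: the raw fact rows stay fully queryable, while a rollup answers the frequent question instantly.

```python
# Raw fact rows: every rare or exploratory query can still scan these.
raw_sales = [
    {"day": "2024-01-01", "product": "a", "revenue": 10.0},
    {"day": "2024-01-01", "product": "b", "revenue": 5.0},
    {"day": "2024-01-02", "product": "a", "revenue": 8.0},
]

# Pre-aggregation, built once ahead of time (in a real stack the semantic
# layer materializes and refreshes this rollup for you).
revenue_per_day = {}
for r in raw_sales:
    revenue_per_day[r["day"]] = revenue_per_day.get(r["day"], 0.0) + r["revenue"]

# Frequent query: served from the rollup, no scan of the raw rows.
fast_answer = revenue_per_day["2024-01-01"]

# Rare query: still possible, answered by scanning the raw rows.
rare_answer = sum(r["revenue"] for r in raw_sales if r["product"] == "b")
```

The rollup trades a little storage and refresh work for speed on the hot path, without deleting the fine-grained data the cold path needs.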
Admittedly, this area sounds theoretically logical, but in practical terms, it still feels murky to me.
Looking at the schema above, one question is why move data from the OLTP database to the OLAP database if the initial format is the same. It can feel like doing so only spreads the ETL duties across more components, making things more complex without apparent advantages.
First, it's important to understand what a "columnar" data warehouse is.
As the name suggests, a columnar data warehouse turns the tables loaded into it ninety degrees. Rows become columns and vice versa. In fact, AWS Redshift is essentially a Postgres database that stores individual rows as columns, with a few bits of magic sprinkled on top.
Why does this help? Analytical queries usually query a large portion of table rows but ask only about a couple of columns.
SELECT column1, column2, AVG(column3) AS average
FROM your_table
WHERE condition
GROUP BY column1, column2
HAVING aggregate_condition
ORDER BY average DESC;
A simple analytical query
With the columnar design, all values for the same column are stored contiguously in the database, which means that these values are concentrated on only a few pages on disk and not spread out like in a row-oriented design.
As a result of this design, queries that compute a SUM or AVG over a column become very fast, but queries that SELECT * for a couple of rows become slower.
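The trade-off can be sketched in a few lines of Python, modeling the two storage layouts for the same hypothetical table:

```python
# The same table, stored two ways (column names are invented).
row_store = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]

# Columnar layout: each column's values sit together, the way a columnar
# warehouse packs them onto adjacent pages on disk.
column_store = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [30.0, 12.5, 7.5],
}

# Aggregating one column reads a single contiguous list...
total = sum(column_store["amount"])

# ...while reconstructing one full row has to touch every column.
row_two = tuple(column_store[col][1] for col in ("id", "name", "amount"))
```

In the row store the situation is mirrored: `row_store[1]` is one lookup, but summing `amount` means visiting every row and plucking out one field.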
The second characteristic that drove the change from ETL to ELT is the decoupling of storage and compute in cloud warehouses, combined with cheap storage.
These two developments create a constellation where storing the same data several times over in different shapes is affordable. Unlike in the past, the benefits of having all data available in a form close to the raw one, like in a data warehouse, outweigh the storage costs.
In addition, decoupled compute means that transformations consume compute from a different pool than user queries. Consequently, heavy computations to transform data no longer slow down clients trying to query the warehouse.
The semantic layer is the data warehouse equivalent of object-relational mapping (ORM) for the transactional database.
It aims to provide several capabilities:
The semantic layer is the newest member of the data stack. As such, it is still relatively immature. An adjacent concept to the semantic layer is Headless BI.
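Sticking with the ORM analogy, a hypothetical mini semantic layer might define each business metric once and compile it to SQL on demand, so every downstream tool gets the same definition. All names below are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """One business metric, defined in a single place."""

    name: str
    expression: str  # SQL aggregate over the underlying model
    table: str

    def to_sql(self, group_by: str) -> str:
        # Every BI tool asking for this metric gets identical SQL,
        # instead of each tool re-implementing "revenue" its own way.
        return (
            f"SELECT {group_by}, {self.expression} AS {self.name} "
            f"FROM {self.table} GROUP BY {group_by}"
        )


revenue = Metric(name="revenue", expression="SUM(amount)", table="orders")
sql = revenue.to_sql(group_by="region")
```

This is only a sketch of the idea; real semantic layers also handle joins, caching, pre-aggregations, and access control on top of the metric definitions.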
The world of data engineering is vast and evolving fast. Having a cohesive model of how the individual components in the stack interact is useful, as it helps with choosing the right tool for the right job. This is especially true because individual parts of the toolchain are highly specialized, and the language used to describe them can be vague, particularly since the advent of AI, when all data-related tools became "AI enablers".
Hopefully, the notes above help you with navigating the data landscape a bit!