This article is guest authored by Ryan Blue, one of Iceberg’s creators.
The modern data stack has proven to be incredibly valuable. For example, Fivetran makes it easier than ever to load all of your data sources into your data warehouse, immediately ready for consumption or transformation. The transition from ETL to ELT led by Fivetran was one of several trends that converged to make the modern data stack possible, but the initial spark was the arrival of cloud data warehouses: Snowflake and BigQuery.
The future of the modern data stack is tightly coupled to how those warehouses evolve — and that’s where Iceberg comes in. Last month, Snowflake announced upcoming support for native Iceberg tables. A little earlier this year, Google announced BigLake, a project that will bring together BigQuery and open standards, like Iceberg and Parquet. Those are radical changes! In this post, I’ll take a deeper look at what Iceberg is and why it is being built into the foundation of the modern data stack.
What is Iceberg?
Apache Iceberg is a modern table format for analytic tables, created by my team at Netflix and later adopted at companies like Apple, LinkedIn, Stripe, Airbnb, Pinterest, and Expedia. At Netflix, we used Iceberg to transform our data lake into a cloud-native data warehouse, by building the guarantees of SQL into data lake tables.
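To give a feel for what "table format" means, here is a hypothetical Python sketch of Iceberg's metadata hierarchy: a catalog points at a table metadata file, which lists snapshots, and each snapshot points at a manifest list that ultimately enumerates data files. The real format stores these as files in object storage (not Python dicts), and the field names below are simplified from the spec:

```python
# Simplified, hypothetical view of an Iceberg table's metadata. In a real
# table this lives in object storage and the manifest lists are Avro files.
table_metadata = {
    "current-snapshot-id": 2,
    "schema": [{"name": "id", "type": "long"}, {"name": "ts", "type": "timestamp"}],
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "s3://bucket/tbl/snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "s3://bucket/tbl/snap-2.avro"},
    ],
}

def current_snapshot(meta):
    # A reader starts from the current snapshot id and follows the
    # pointers down to the data files -- no query server required.
    sid = meta["current-snapshot-id"]
    return next(s for s in meta["snapshots"] if s["snapshot-id"] == sid)

print(current_snapshot(table_metadata)["manifest-list"])
# s3://bucket/tbl/snap-2.avro
```

Because every reader resolves the same metadata tree, any engine that understands the format sees an identical, consistent table.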
If you’re looking at Iceberg from a data lake background, its features are impressive: queries can time travel, transactions are safe so queries never lie, partitioning (data layout) is automatic and can be updated, schema evolution is reliable — no more zombie data! — and a lot more.
On the other hand, if you’re coming from a data warehouse or modern data stack background, that probably sounds horrible! Data lakes can’t rename columns? Queries might just give wrong answers? Partitioning is a manual process that people mess up all the time? Sigh. Yes.
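To make the time-travel feature above concrete, here is a toy Python sketch of snapshot-based reads. This is an illustration of the idea only, not Iceberg's actual implementation (which tracks snapshots in metadata and manifest files): each commit produces a new immutable snapshot, and old snapshots stay readable.

```python
class Table:
    """Toy snapshot-based table illustrating time travel (not real Iceberg)."""

    def __init__(self):
        self.snapshots = {}   # snapshot id -> immutable tuple of rows
        self.current_id = 0

    def commit(self, rows):
        # Every commit creates a new snapshot; nothing is overwritten.
        self.current_id += 1
        self.snapshots[self.current_id] = tuple(rows)
        return self.current_id

    def read(self, snapshot_id=None):
        # Read the latest snapshot by default, or "time travel" to an old one.
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

t = Table()
first = t.commit(["a"])
t.commit(["a", "b"])
print(t.read())        # latest snapshot: ('a', 'b')
print(t.read(first))   # time travel to the first snapshot: ('a',)
```

Because snapshots are immutable, a long-running query keeps reading a consistent snapshot even while writers commit new ones, which is how Iceberg delivers safe, repeatable reads.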
These flaws can be managed, and data lakes have benefits that make them attractive. The main benefit is flexibility: there is a wealth of projects and processing frameworks to choose from. For example, Spark and Trino come from the data lake world and are commonly used in the same architecture because they’re good at different things. Trino prioritizes speed and is great for ad hoc SQL queries across multiple sources. Spark prioritizes reliability and can mix Python, Scala, and SQL together to handle any workload. And there are more examples, like Flink for streaming or Python DataFrames that run on GPUs.
Iceberg’s appeal is that you aren’t forced to choose between data lakes and data warehouses. It brings together the best capabilities of both: you can be productive and rely on the guarantees of a data warehouse using any processing tool or query engine, like a data lake.
Why add Iceberg to data warehouses?
It’s easy to see how data lakes are improved by Iceberg’s warehouse features — nobody likes unreliable queries — but warehouses already have mature tables with standard SQL behavior that deliver great performance. So what’s the compelling reason to build Iceberg into data warehouses?
I think the answer is the incredible demand for different ways to work with data. People want the right tool for the job and to use what they’re comfortable and productive with. They expect to be able to load data into hundreds of Python containers in parallel to test model parameters, or to use a streaming framework to easily sessionize events, or to query the same tables from a BI tool as they are consumed by Spark. These tasks are not easy to do in a data warehouse.
Warehouses are built to accept SQL queries and quickly produce manageable-sized results. They do this really, really well. But a crucial assumption is baked into this model: that the query layer handles all access to data. This is how access controls are enforced, how queries are isolated from concurrent changes, and it enables features like result set caching and background data maintenance.
Having a single way to access data is a useful assumption when building a data warehouse, but it is also an Achilles heel. With Iceberg, warehouses no longer need a runtime query layer to deliver safe and correct queries.
To understand the trade-off, think of a data warehouse like a combined bakery/sandwich shop, where the sandwich shop (query layer) is the only way to get bread from the bakery (storage layer). Customers wait a little while for someone to take their order, make it, and check them out. This works great when a steady stream of people want lunch. The sandwich shop employees make sandwiches and ensure that each customer is checked out.
But the sandwich line breaks down quickly in other situations. What if a person in line orders 50 sandwiches? And what if a deli opens next door and wants bread from the bakery? It would make no sense to send deli employees to stand in the sandwich line to buy loaves of bread as needed; the bakery needs a different way to handle catering and wholesale.
A query layer limits flexibility in similar ways:
- If you spun up 1,000 Python jobs at once to read the same warehouse table, the chances of success are low. This is the “thundering herd” problem. You can fix it by scaling up query capacity (more employees working the counter), but making that work is not simple or cheap.
- Stream processing is at odds with a query-centric view of data, so it must depend on custom integration and protocols, or messy polling and deduplication.
- Stacking full query engines on one another duplicates work and compounds their flaws. If you used Trino to load data from a Spark query, it would be slower and no more reliable! This is like sending a deli employee to buy bread in the sandwich line while customers are waiting.
By building support for Iceberg, data warehouses can skip the query layer and share data directly.
Iceberg was built on the assumption that there is no single query layer. Instead, many different processes all use the same underlying data and coordinate through the table format, along with a very lightweight catalog. Iceberg enables the direct data access needed by all of these use cases and, uniquely, does it without compromising the SQL behavior of data warehouses.
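The coordination described above boils down to the catalog performing an atomic compare-and-swap of the table's current metadata pointer: a writer's commit succeeds only if no one else committed first, otherwise it re-reads and retries. A minimal, hypothetical Python sketch of that protocol (real catalogs implement it with Hive metastore locks, a database transaction, or a REST service, not an in-process lock):

```python
import threading

class Catalog:
    """Toy catalog: writers commit by atomically swapping the metadata pointer."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None  # points at the latest table metadata

    def current(self):
        return self._current

    def commit(self, expected, new_metadata):
        # Compare-and-swap: succeed only if the table hasn't changed since
        # the writer last read it; otherwise the writer must retry.
        with self._lock:
            if self._current is not expected:
                return False  # conflict: another writer got there first
            self._current = new_metadata
            return True

cat = Catalog()
base = cat.current()
print(cat.commit(base, {"snapshot": 1}))  # True: first writer wins
print(cat.commit(base, {"snapshot": 2}))  # False: stale base, must retry
```

Because the only point of coordination is this single pointer swap, any number of engines can read and write the same table without routing their data through one query server.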
How does the modern data stack change?
What does it mean to build a warehouse that is independent of the query layer, and how does that change the modern data stack?
To start with the most obvious, the modern data stack will get a lot more options. An independent warehouse lets you use the processing pattern or query layer that is best for your task and for your team. If you want to stream data through Flink, train models in Spark and run BI on top of Snowflake, you can! All the projects from the data lake space can now operate reliably on the same warehouse and can be brought into the modern data stack, without maintaining pipelines to copy data in or out for them.
This means you can use your tool of choice – Pandas, Trino, Snowflake, Spark, and others – with data cleansed, normalized, and delivered to your warehouse by Fivetran. Integrating all of your data sources and getting insights from them is easier, and faster, than ever before.
Next, I think the role that data warehouses play today will evolve into two separate functions: a query layer and a storage layer. Returning to the bakery/sandwich shop analogy, say the bakery now supplies a deli and a bánh mì place. Now it needs to be good at making rye and baguettes! Similarly, there’s a lot more for the storage layer to do, not least of which are unsolved challenges like consistent security controls or data maintenance and automation. (Quick disclosure: An independent storage and automation platform is what we’re building at my company, Tabular.)
On top of needing new capabilities, does it still make sense for our hypothetical bakery to be coupled to its sandwich shop? Probably not: the owner could choose to sell only stale leftovers to the sandwich shop’s competitors to drive more sweet PB&J business. I think there are conflicts of interest that will naturally lead to separating not only query and storage responsibilities, but query and storage businesses.
This highlights another critical reason for Iceberg adoption in the modern data stack: it is an open standard governed by the ASF. Iceberg is the result of collaboration from a broad and vibrant community of companies like Netflix and Apple, and this community is trusted by vendors like Google, AWS, Snowflake, Fivetran, Tabular, Starburst, Dremio, Cloudera and more. That trust is incredibly important because it means the companies that make up the modern data stack can confidently invest time and R&D into Iceberg, as well as support for it in their products.
Register for the webinar: “Modernizing your data lake with Fivetran, Amazon S3 and Apache Iceberg” on August 31 at 9am PST to hear more from Ryan Blue, co-creator of Apache Iceberg and CEO of Tabular.