Fairy tales often emphasize the importance of moderation, compromise and combining the best characteristics of things. Goldilocks needed a bowl of porridge that was neither scalding nor frigid but just right.
The scalding bowl is a data lake in which data swims without a schema. You can replicate any data, structured or unstructured, into it and impose order on what you’ve stored only when you need to analyze it.
Just as you can blow on a spoonful of hot porridge to cool it down, you can organize and analyze portions of data from a data lake as needed.
The frigid bowl is the data warehouse, a scalable repository in which data is defined by schemas, making it well suited to analyzing organized data. The downside is its inability to absorb media files and unstructured data: it’s thick, cool and unable to properly dissolve many ingredients.
The third bowl, the palatable one, is the data lakehouse. It’s at the right temperature: neither too ordered nor too unstructured for timely analytics.
In this article, we’ll outline everything you need to know about data lakehouses and why they’re a great choice for companies today.
Table of Contents
- What is a data lakehouse?
- Data lakehouse architecture
- Data warehouses vs. data lakes vs. data lakehouses
- The critical advantages of a data lakehouse
- Are data lakehouses the future?
What is a data lakehouse?
A data lakehouse is a modern data solution that combines the best features of a data warehouse and a data lake: the flexibility of a data lake (which lets you store unstructured data) and the management methods of a data warehouse.
By merging the two existing solutions, a data lakehouse enables easy data movement between a data lake’s low-cost, unstructured storage and the more rigid data warehouse.
As a result, it’s easier to implement schema and governance using a data warehouse’s tools.
Extended flexibility and easier governance speed up many data processing steps, including collection from multiple sources, cleaning, validation and transformation.
A data lakehouse also removes the critical limitations of both the data lake and the data warehouse. For example, data lakes require additional tools and techniques to support SQL queries. This slows down business intelligence and reporting.
Since lakehouses also employ the mechanisms of a data warehouse, however, they can handle SQL queries more efficiently.
Lakehouses are thus an efficient way to support advanced analytics, artificial intelligence, machine learning and reporting — all of which can improve business intelligence and analysis.
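To make that concrete, here’s a minimal sketch of the idea using the open-source DuckDB engine; the directory of Parquet order files and its columns are hypothetical. The point is that SQL runs directly against the open-format files in lake storage:

```python
# A minimal sketch of SQL directly over lake storage, assuming DuckDB is
# installed (pip install duckdb) and a hypothetical folder of Parquet
# order files. No separate load step into a warehouse is needed.
import duckdb

top_customers = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_parquet('data/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
""").df()

print(top_customers)
```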
Data lakehouse architecture
There are five key layers that make up a data lakehouse architecture.
Let’s take a closer look.
1. Ingestion
In the first layer, data from multiple sources is collected and delivered to the storage layer. This data can be pulled from either internal or external sources, such as:
- NoSQL databases
- Relational database management systems (RDBMSs)
- Software-as-a-Service (SaaS) applications
- Customer relationship management (CRM) applications
- Social media
- Websites
Organizations can use tools like AWS Database Migration Service (AWS DMS) to import data from NoSQL databases and RDBMSs. Many other tools handle streaming data and other ingestion functions.
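As a hedged illustration of what this layer does, here’s a minimal Python sketch that lands raw source events in object storage with boto3; the bucket name, key layout and event payload are all hypothetical:

```python
# A hedged sketch of the ingestion layer, assuming boto3 is installed and
# AWS credentials are configured. The bucket name, key layout and event
# payload are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def land_raw_event(event: dict, source: str) -> None:
    """Write one raw source record to the lake's landing zone, untransformed."""
    ts = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S%f")
    s3.put_object(
        Bucket="example-lakehouse-bucket",   # hypothetical bucket
        Key=f"landing/{source}/{ts}.json",   # partitioned by source and time
        Body=json.dumps(event).encode("utf-8"),
    )

# e.g. an event pulled from a CRM application
land_raw_event({"user_id": 42, "action": "signup"}, source="crm")
```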
2. Storage
Data lakehouses use open-source file formats, such as Apache Parquet, to store unstructured, structured and semi-structured data. They’re designed to store all data types as objects in object stores like Amazon S3.
Lakehouses keep the schemas of structured and semi-structured data sets in the metadata layer, which makes them easier to apply at read time.
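Here’s a minimal sketch of the open-format idea using the pyarrow library: the schema is stored alongside the data, so any Parquet-aware engine can recover and apply it when the file is read.

```python
# A minimal sketch of the storage layer, assuming pyarrow is installed
# (pip install pyarrow). Data is written in an open columnar format
# (Parquet); the schema travels with the file.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["DE", "US", "BR"],
})
pq.write_table(table, "users.parquet")

# Any Parquet-aware engine can recover and apply the schema at read time.
print(pq.read_schema("users.parquet"))
```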
3. Metadata
Metadata is data that provides information about other pieces of data. A data lakehouse’s metadata layer is its key advantage over the data lake and the data warehouse.
The metadata layer houses a catalog that provides metadata for every object in the lake storage. It also enables users to implement features such as ACID transactions, indexing, caching and data versioning.
The metadata layer also lets users implement data warehouse schema architectures, such as star and snowflake schemas, and improves schema management. Auditing and data governance can be performed directly on the data lake, which enhances data integrity across the entire data pipeline.
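As a hedged example of these features in practice, here’s what ACID writes and data versioning look like with the open-source deltalake package, one implementation of such a metadata layer; the table path and contents are hypothetical:

```python
# A hedged sketch of the metadata layer's features, using the open-source
# deltalake package (pip install deltalake pandas) as one implementation.
# The table path and contents are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "lake/users"

# Each write is an ACID transaction recorded in the table's log.
write_deltalake(path, pd.DataFrame({"user_id": [1, 2]}), mode="overwrite")
write_deltalake(path, pd.DataFrame({"user_id": [3]}), mode="append")

dt = DeltaTable(path)
print(dt.version())  # latest committed version, e.g. 1

# Data versioning: read the table as it looked at an earlier commit.
print(DeltaTable(path, version=0).to_pandas())
```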
4. APIs
Lakehouses use APIs to speed up task processing and provide advanced analytics. Metadata APIs relay which data items specific applications need, making that data easier to retrieve.
End users and developers can consume these APIs from an array of languages and libraries.
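Here’s a minimal sketch of that idea, assuming pyarrow and reusing the hypothetical users.parquet file from the storage example; the dataset API declares which columns and rows a client needs, so the engine reads only those from storage:

```python
# A minimal sketch of the API-layer idea, assuming pyarrow and reusing the
# hypothetical users.parquet file from the storage example. The dataset
# API declares what a client needs, so the scan reads only that data.
import pyarrow.dataset as ds

dataset = ds.dataset("users.parquet", format="parquet")

# Column projection and filter pushdown: irrelevant columns and rows are
# skipped at the storage layer rather than filtered after the fact.
german_users = dataset.to_table(
    columns=["user_id"],
    filter=ds.field("country") == "DE",
)
print(german_users)
```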
5. Consumption
The consumption layer hosts client apps, which can access all of the metadata and data stored in the lake.
Any user within a business can tap a lakehouse for various analytics tasks, including data visualization, SQL queries, machine learning and business intelligence.
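As a hedged sketch of the consumption layer, here are two client workloads, a BI-style SQL query and a data-science DataFrame load, reading the same hypothetical Parquet file; duckdb and pandas stand in for client applications:

```python
# A hedged sketch of the consumption layer: two client workloads reading
# the same lake file. Assumes duckdb and pandas are installed and reuses
# the hypothetical users.parquet file from earlier examples.
import duckdb
import pandas as pd

# A BI-style SQL query over the open-format file...
counts = duckdb.sql(
    "SELECT country, COUNT(*) AS users FROM 'users.parquet' GROUP BY country"
).df()
print(counts)

# ...and a data-science workload loading the same file as a DataFrame,
# ready for feature engineering or model training.
features = pd.read_parquet("users.parquet")
print(features.head())
```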
Data warehouses vs. data lakes vs. data lakehouses
To fully understand the benefits of data lakehouses, let’s take a closer look at all three prominent architectures and their key features.
Data warehouse
A data warehouse is a central storage system where data is organized into tables and columns. The data is taken from different sources and stored in schema-defined tables.
One of the key advantages of a data warehouse is the use of relational database schemas to define structured data, which makes for fast analytics and SQL compatibility.
A warehouse stores data in two ways: fast storage (like SSDs) or cheap object stores (like Amazon S3). Typically, data is loaded in batches based on a predetermined schedule — either daily or every few hours.
Modern warehouses have been adapted to current business needs and can support real-time loading, making data instantly accessible to businesses.
Data lake
A data lake is a central repository that lets you store structured and unstructured data at a large scale. You can store images, videos, free-form text and other media, along with neatly organized table schemas.
Data lakes were created to be flexible and to support analytics and data science that relied on unstructured data. This flexibility is why they beat out data warehouses, which support only structured data.
Data lakehouse
A data lakehouse combines both approaches: it can instantly load data (structured or unstructured) and make it immediately available to analysts, data scientists and other users.
A data lakehouse has six key features:
- ACID transactions support: Data lakehouses enable ACID transactions, another data warehouse feature, to ensure consistency as multiple parties concurrently read and write data.
- BI support: BI and analytics professionals work from the same data repository and get access to up-to-date data that has been cleaned and integrated, which enhances business intelligence.
- Open storage formats: Data lakehouses use open storage formats. You can use various tools with them; you’re not locked into one vendor’s monolithic data analytics architecture.
- Schema and governance capabilities: They support schema-on-read, in which the software accessing the data determines its structure on the fly. A data lakehouse also supports schemas for structured data and implements schema enforcement to ensure that the data uploaded to a table matches the schema (see the sketch after this list).
- Support for diverse data types and workloads: Data lakehouses can hold both structured and unstructured data, so you can use them to store, transform and analyze things like images, video, audio and text, as well as semi-structured data like JSON files.
- Decoupled storage and compute: Storage resources are decoupled from compute resources, so you can scale either one separately to meet the needs of your workloads, whether they be for machine learning, business intelligence and analytics, or data science.
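As flagged in the schema bullet above, here’s a hedged sketch of schema enforcement using the open-source deltalake package; the table is hypothetical and the exact exception raised may vary by version:

```python
# A hedged sketch of schema enforcement, again using the open-source
# deltalake package; the exact exception type may vary by version.
import pandas as pd
from deltalake import write_deltalake

path = "lake/orders"
write_deltalake(path, pd.DataFrame({"order_id": [1], "amount": [9.99]}))

try:
    # `amount` arrives as a string here -- a schema mismatch.
    write_deltalake(
        path,
        pd.DataFrame({"order_id": [2], "amount": ["oops"]}),
        mode="append",
    )
except Exception as exc:
    print(f"Write rejected by schema enforcement: {exc}")
```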
The critical advantages of a data lakehouse
Data lakehouses have six significant advantages over warehouses and lakes:
- Scalability: Data lakehouses run on cloud platforms, which means they have high scalability.
- Improved data management: Lakehouses store diverse data and support multiple use cases, including advanced reporting, machine learning and analytics.
- Enhanced data reliability: In a data lakehouse, data goes through fewer transfers to reach its destination, which reduces the chances of errors and quality loss.
- Lower costs: A data lakehouse offers a way for organizations to avoid maintaining a separate data warehouse (for a single source of truth) and data lake (for cost-effective storage of historical data and media files). Data lakehouses also support huge volumes of storage more cost-effectively than data warehouses.
- Actionable data: A lakehouse’s flexible architecture allows for better data organization, which prevents data stagnation and ensures that analysts get the most recent, accurate data.
- Fewer redundancies: Because data is unified in a lakehouse, excessive copies of it can be deleted. This reduces storage requirements and is especially helpful for companies that run several warehouses or lakes.
Are data lakehouses the future?
Databricks coined the term “lakehouse” in 2020, building on Delta Lake, its open-source project aimed at bringing reliability to data lakes.
While we can’t recommend that every business immediately migrate to a data lakehouse, specific scenarios are ideal for the system. These include when an organization wants to extend the capabilities of an existing data lake or eliminate the redundancies caused by running multiple systems.
Fivetran supports multiple storage systems in a data pipeline, be it a data lakehouse or any other popular data warehouse or data lake. Wherever you pipe your data, we’ve got you covered. Sign up today for a free trial.