What is a Data Lakehouse

  • May 19, 2021

For years, data warehousing has been the go-to single source of truth for enterprises. From analytics to business intelligence, it has reigned supreme. More recently, data lakes hit the scene as an alternative that took advantage of the cloud's less expensive storage options. Yet both data warehouses and data lakes have drawbacks that the newer data lakehouse architecture looks to address.

What is a data lakehouse?

First, to understand the benefits of a data lakehouse, it's important to understand what it is and how it works. A data lakehouse adds a metadata layer on top of a data lake architecture. It combines the best of data warehousing and data lakes by pairing the cheaper cloud-based storage that data lakes use with the data management and structure of data warehouses. In doing so, the lakehouse allows organizations to quickly query massive amounts of structured and unstructured data sitting in cloud stores like Azure Data Lake Storage (ADLS).

Notably, data lakehouses separate storage and compute, allowing you to apply different query engines to the data. In this way, you can use the best engine – whether it is Athena/Presto, Redshift Spectrum, Spark or something else – for the amount and type of data that you want to query.
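
To make that separation concrete, here is a minimal sketch of querying raw files directly in cloud storage with Spark. The storage account, container, and column names are hypothetical; the same files could just as easily be scanned by Presto, Athena, or Redshift Spectrum without moving them.

```python
# Minimal sketch: query raw Parquet files sitting in ADLS directly with Spark.
# The storage path and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-query").getOrCreate()

# Compute attaches to storage; no warehouse-style load step is required.
events = spark.read.parquet("abfss://lake@myaccount.dfs.core.windows.net/events/")
events.createOrReplaceTempView("events")

# Query with whichever engine fits; here, Spark SQL.
spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY event_type
""").show()
```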

Data warehouse vs data lake vs data lakehouse

In comparing the strengths and weaknesses of the three different architecture types, each has its own merits. Yet, it’s clear that the data lakehouse’s unique ability to marry the strengths of data warehousing and data lakes is the reason behind its rise in popularity. 

Data Warehouse

Strengths:
- Fixed, structured data eases governance.
- Viewed as a single source of truth.
- Established ecosystem of extract, transform, load (ETL) tools and processes.

Weaknesses:
- The storage and human resources required to manage a data warehouse can be expensive.
- As the amount of data increases, performance decreases.
- Requires structured data that is formatted for the data warehouse.
- Challenging to build a streaming interface, requiring more parts and steps.

Data Lake

Strengths:
- Takes advantage of the least expensive storage types, allowing you to inexpensively store terabytes of data.
- Supports both structured and unstructured data, giving you the flexibility to store everything from JSON to images and blobs.
- Ecosystem of tools that can read a wide variety of data quickly and inexpensively.
- Scales easily.

Weaknesses:
- Less structure means that data lakes are more difficult to govern and manage. As a result, they are not seen as a single source of truth.
- Data lakes often require more sophisticated developers, which has resulted in a skills gap and fewer resources available for enterprises interested in pursuing a data lake.

Data Lakehouse

Strengths:
- Adds a metadata layer to provide structure to the data lake for manageability and governance.
- Makes the data transactional for increased user confidence.
- Low-cost interface allows you to take advantage of cheaper cloud storage options.
- Connects with popular reporting tools.
- Enforces schemas (and supports schema evolution).

Weaknesses:
- Requires a Spark engine and the resources to run it.

Data lakehouse tools and features

Enterprises interested in pursuing a data lakehouse strategy have a number of tools available to them: Databricks Delta, Apache Iceberg and Apache Hudi are a few examples. In addition, the major cloud providers have built functionality that approaches a lakehouse architecture. For example, Microsoft’s Azure Synapse Analytics allows you to use Azure Data Lake Storage with tools like Spark. (For more info, see our blog on Azure Synapse Analytics.) 

Key features of a Delta storage layer – specifically, the Delta technology from Delta.io – include: 

  • Transaction layer – Supports ACID (atomic, consistent, isolated, durable) transactions through tools like Spark, providing a cheaper, faster method for updates, deletes, and schema enforcement.
  • Versioning – Automatically maintains previous versions of data records, giving you greater data protection via a stored history, and optimizes how the data is stored so that queries across prior versions stay fast. This also permits 'Time Travel', which lets you query those versions directly (see the second sketch after this list).
  • Transaction log – Tracks every data transaction in the lakehouse. With this functionality, you can update, insert, and merge incoming data into your existing data with a single block of code, much like traditional SQL (see the first sketch after this list). The transaction log is also what makes Time Travel possible, allowing a user to look back at the data from a specific point in time, which is great for querying lineage and changes over time.
  • Connectivity via Spark (and all of Spark's interfaces) – Provides at-scale access to the data in Delta, giving users the ability to query the data lakehouse. Data analysts can manage the lakehouse's applied structure, specifying the tables, columns, etc. that they want to expose to end users. At the speed of Spark, you can go into the query interface and feed other tools like Tableau or Power BI with exactly the data they need.
  • Multi-cloud capability – Creates one great advantage: the ability to implement across the three major cloud providers (AWS, Microsoft, Google). The lakehouse can be unified across all three with a product like Databricks (which is leading the way in lakehouse implementation); however, it is entirely possible to take advantage of Delta and the lakehouse methodology without Databricks.
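
To illustrate the transaction layer and transaction log, here is a hedged sketch of a transactional upsert into a Delta table with Spark. It assumes the delta-spark package is installed and configured; the table path, landing path, and customer_id column are hypothetical, not a prescribed layout.

```python
# Sketch of an ACID upsert (merge) into a Delta table with Spark.
# Requires the delta-spark package; paths and column names are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-merge")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

target = DeltaTable.forPath(spark, "/lake/customers")  # existing Delta table
updates = spark.read.parquet("/landing/customers/")    # incoming batch

# One transactional block: matched rows are updated, new rows are inserted.
# Readers never see a half-applied batch.
(
    target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```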
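
And a short sketch of Time Travel against the same hypothetical table, reusing the Delta-enabled `spark` session from the sketch above. The version number and timestamp are placeholders you would read from the table's history.

```python
# Sketch of Delta Time Travel; assumes the Delta-enabled `spark` session and
# the hypothetical /lake/customers table from the previous example.

# Inspect the transaction log to see the available versions.
spark.sql("DESCRIBE HISTORY delta.`/lake/customers`").show()

# Read the table as it existed at a specific version...
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lake/customers")

# ...or as of a specific point in time.
snapshot = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-05-18")
    .load("/lake/customers")
)

# Compare versions to see how the data has changed over time.
print(v0.count(), snapshot.count())
```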

Is a data lakehouse better?

Data lakehouses address several challenges people have with data warehouses and data lakes. Yet whether a data lakehouse is best for you ultimately depends on your use case. If you are interested in optimizing an existing data lake for ROI and total cost of ownership (TCO), a lakehouse could be a good fit. Similarly, if scalability and reliability top your list of concerns, a data lakehouse architecture may be ideal.

As data lakehouse adoption is not yet widespread, you may be hesitant to adopt this methodology. Having strong resources to guide you through the challenges will be a must as this framework gains wider market share.

It’s not my intention to say that the lakehouse will replace warehouses (or even lakes for that matter).  It IS my intention to say that current technology is allowing us to re-think what we have been doing for the last 20 years.  Many data professionals are still trying to make older technologies and methods work with constantly growing amounts of data; there is no doubt many will succeed, but until then the lakehouse definitely appears to be the next step for data.

You may find that data lakehouses aren’t the best fit for your needs – not yet, anyway. But if they are, we suggest that you have the right team in place to support a lakehouse as it can require significant data-centric development resources.   

Need help determining if a data lakehouse architecture is a fit for your needs? Reach out to our team today.

 

 

Anthony Taylor

Anthony Taylor is a Senior Systems Integration Advisor specializing in data engineering for analytics with cloud technologies, the Community of Practice Lead for Azure Data Products and the Practice Lead for Databricks. He designs and builds large-scale serverless data platforms in Azure and AWS, leading and mentoring a team of cloud data professionals across multiple industries. Taylor is an AWS Certified Solutions Architect and a Data Analytics and Visualization instructor at the University of California, Irvine.
