
The Lakehouse: An uplift of Data Warehouse Architecture

In short, the initial data warehouse architecture was designed to provide analytical insights by collecting data from various heterogeneous sources into a centralized repository, which acted as a fulcrum for decision support and business intelligence (BI). But it has continued to face numerous challenges, such as time-consuming data model design because only schema-on-write is supported, the inability to store unstructured data, the tight integration of compute and storage into an on-premises appliance, and so on.

This article intends to highlight how the architectural pattern has been enhanced to transform the traditional data warehouse, rolling first into the second-generation platform, the Data Lake, and eventually into the Lakehouse. Although the present data warehouse supports a three-tier architecture with an online analytical processing (OLAP) server as the middle tier, a consolidated platform for machine learning and data science with metadata, caching, and indexing layers is not yet available as a separate tier.

 


Architecture of the Traditional Data Warehouse Platform
Organizations and enterprises were looking for an alternative, more advanced data warehousing system to resolve the complexities and problems associated with traditional data warehouse systems mentioned at the beginning, driven by rapidly growing unstructured datasets such as audio, video, etc. After the Apache Hadoop ecosystem surfaced in 2006, the major bottleneck of extracting and transforming raw data into a structured format (rows and columns) prior to loading it into a traditional data warehousing system was resolved by leveraging HDFS (Hadoop Distributed File System). HDFS handles large data sets running on commodity hardware and accommodates applications whose data sets are typically gigabytes to terabytes in size. Besides, it can scale horizontally by adding new nodes to the cluster to accommodate a massive volume of data in any format on demand. You can read here how Apache Hadoop can be installed and configured on a multi-node cluster.
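
To make this concrete, here is a minimal sketch, assuming a PySpark environment, an HDFS namenode reachable at namenode:8020, and a hypothetical local log file, of landing raw data on HDFS exactly as it arrives, with no structure imposed at write time:

```python
# Minimal sketch: land raw data on HDFS as-is (paths and hostnames are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-landing-zone").getOrCreate()

# Read raw, line-oriented log data; no structure is imposed on it here.
raw_logs = spark.read.text("file:///data/incoming/clickstream-2021-06-01.log")

# Persist the lines unchanged into an HDFS landing directory; HDFS stores the
# bytes across the cluster's data nodes and scales out as nodes are added.
raw_logs.write.mode("append").text("hdfs://namenode:8020/landing/clickstream/")
```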

Another major relief achieved with the Hadoop ecosystem (Apache Hive) is that it supports schema-on-read. Due to the strict schema-on-write principle of traditional data warehouses, ETL steps were very time-consuming because the data had to adhere to pre-designed tablespaces. In a one-line statement, we can define a Data Lake as a repository that stores huge amounts of raw data in its native formats (structured, unstructured, and semi-structured) for big data processing and subsequent analysis, predictive analytics, and the execution of machine learning code/apps to build algorithms.
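
As a small illustration of schema-on-read, the sketch below (file path and field names are assumptions) lets Spark derive a schema from semi-structured JSON files only at the moment they are read, rather than forcing a table design up front:

```python
# Minimal schema-on-read sketch (path and field names are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# The JSON events were dropped into the lake without any upfront table design;
# Spark infers a schema from the files at read time.
events = spark.read.json("hdfs://namenode:8020/landing/clickstream-json/")

events.printSchema()                               # schema discovered from the data itself
events.select("user_id", "event_type").show(5)     # hypothetical field names
```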


 


Data Lake Architecture
Even though a Data Lake doesn't require data transformation prior to loading, as there isn't any schema the data has to fit, maintaining data quality is a big concern, and the Data Lake is not fully equipped to handle data governance and security. Machine learning (ML) and data science applications need to process a large amount of data with non-SQL code in order to be deployed and run successfully over a Data Lake, but they are not well served there because the available processing engines are far less optimized than SQL engines. However, these engines alone are not sufficient to solve all the problems with Data Lakes and replace data warehouses: features like ACID properties and efficient access methods such as indexes are still missing in Data Lakes. On top of that, ML and data science applications encounter data management problems such as data quality, consistency, and isolation. Data Lakes also require additional tools and techniques to support SQL queries so that business intelligence and reporting can be executed.
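
One way to picture that last point: the raw Parquet files sitting in the lake have no SQL interface of their own, so an extra engine such as Spark SQL has to be layered on top before BI-style queries can run. A minimal sketch, with paths and column names assumed:

```python
# Minimal sketch of SQL layered over raw lake files (paths/columns are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-the-lake").getOrCreate()

# Expose the raw Parquet files as a queryable view.
orders = spark.read.parquet("hdfs://namenode:8020/lake/orders/")
orders.createOrReplaceTempView("orders")

# A BI-style aggregation that the lake alone could not answer without this extra engine.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_revenue.show()
```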

The idea of the Data Lakehouse is at an early phase; it puts processing power on top of Data Lake storage such as S3, HDFS, Azure Blob, etc. The Lakehouse combines the benefit of the Data Lake's low-cost storage in an open format accessible by a variety of systems with the Data Warehouse's powerful management and optimization features. The concept of the Data Lakehouse has been introduced by Databricks and AWS.
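
As one concrete, hedged illustration of this combination, the sketch below uses the open-source Delta Lake format, which keeps data as Parquet in a low-cost object store while a transaction log adds warehouse-style management features such as ACID writes. The bucket name and columns are assumptions, and the delta-spark package is assumed to be on the classpath:

```python
# Hedged Lakehouse sketch using Delta Lake on object storage
# (bucket, columns, and the presence of the delta-spark package are assumptions).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

sales = spark.createDataFrame(
    [("2021-06-01", "laptop", 1200.0), ("2021-06-01", "phone", 650.0)],
    ["order_date", "product", "amount"],
)

# Transactional (ACID) write to cheap object storage in an open, Parquet-based format.
sales.write.format("delta").mode("append").save("s3a://demo-bucket/lakehouse/sales")

# The same table remains readable by DataFrame code and SQL engines alike.
spark.read.format("delta").load("s3a://demo-bucket/lakehouse/sales").show()
```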



The multi-layered Lakehouse architecture
The Lakehouse will have the capacity to boost the speed of advanced analytics workloads and give them better data management features. At a bird's-eye view, the Lakehouse is segmented into five layers, namely the Ingestion layer, Storage layer, Metadata layer, API layer, and eventually the Consumption layer.
  • The Ingestion layer, the first layer in the Lakehouse, takes care of pulling data from a variety of sources and delivering it to the Storage layer. A variety of components can be used in this layer to ingest data, such as Apache Kafka for data streaming from IoT devices and the like, Apache Sqoop to import data from an RDBMS, and many more that also support batch data processing (see the streaming-ingestion sketch after this list).
  • Data Lakehouses are best suited to cloud repository services due to the separation of the computing and storage layers, although the Lakehouse can also be implemented on-premises by leveraging the Hadoop Distributed File System (HDFS) platform. Its design is supposed to allow keeping all kinds of data in low-cost object stores, e.g., AWS S3, as objects using a standard file format such as Apache Parquet.
  • The Metadata layer in the Lakehouse holds the responsibility of providing metadata (data giving information about other data pieces) for all objects in the lake storage. Besides, it provides the opportunity to fulfill management features such as
    • ACID transactions to ensure that concurrent transactions see a consistent view of the data
    • Caching to cache files from the cloud object store on faster storage devices such as SSDs and RAM on the processing nodes
    • Indexing for faster query-making
  • The API layer in the Lakehouse facilitates two types of API, namely declarative DataFrame APIs and a SQL API. With the help of the DataFrame APIs, data scientists can consume the data directly to execute their various applications; some machine learning libraries like TensorFlow and Spark MLlib can read open file formats like Parquet and query the metadata layer directly. Similarly, the SQL API can be utilized to get the data for BI (a combination of business analytics, data mining, data visualization, etc.) and various reporting tools (a sketch of both API styles follows this list).
  • Finally, the Consumption layer hosts various tools and apps like Power BI, Tableau, and many more. The Lakehouse's consumption layer can be utilized by all users across an organization to carry out all sorts of analytics tasks, including business intelligence dashboards, data visualization, SQL queries, and machine learning jobs.
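
The streaming-ingestion sketch referenced in the Ingestion layer bullet above might look like this, assuming a Kafka broker at broker-1:9092, a topic named iot-sensor-readings, the spark-sql-kafka and delta-spark packages on the classpath, and hypothetical S3 paths:

```python
# Hedged Ingestion-layer sketch: Kafka -> Lakehouse storage via Structured Streaming
# (broker, topic, packages, and paths are assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingestion").getOrCreate()

iot_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "iot-sensor-readings")
    .load()
)

# Keep the raw payload; parsing and flattening can happen later, schema-on-read style.
raw_events = iot_stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Deliver the stream to the Storage layer as an open-format table.
query = (
    raw_events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3a://demo-bucket/checkpoints/iot/")
    .start("s3a://demo-bucket/lakehouse/iot_raw/")
)
query.awaitTermination()
```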
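
And the two API styles described in the API layer bullet can be sketched against the same hypothetical table: a declarative DataFrame pipeline for data science work and a SQL query for BI and reporting.

```python
# Hedged API-layer sketch: DataFrame API and SQL API over one Lakehouse table
# (table path, column names, and the delta-spark package are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lakehouse-apis").getOrCreate()

path = "s3a://demo-bucket/lakehouse/sales"

# 1) Declarative DataFrame API, e.g. to prepare features for an ML library.
features = (
    spark.read.format("delta").load(path)
    .groupBy("product")
    .agg(F.sum("amount").alias("total_amount"), F.count("amount").alias("orders"))
)

# 2) SQL API for BI and reporting tools.
spark.read.format("delta").load(path).createOrReplaceTempView("sales")
report = spark.sql(
    "SELECT order_date, SUM(amount) AS revenue FROM sales GROUP BY order_date"
)

features.show()
report.show()
```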

The Lakehouse architecture is best suited to provide a single point of access within an organization for all sorts of data, regardless of purpose.

Conclusion

Data-cleaning complexity, query compatibility, the monolithic architecture of the Lakehouse, caching for hot data, etc. are a few limitations to be considered before completely relying upon the Lakehouse architecture. Despite all the hype around Data Lakehouses, it's worth remembering that the concept is at a very nascent stage. Going forward, there will be a requirement for tools that enable data discovery, data usage metrics, data governance capabilities, etc. on the Lakehouse.

Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.

Reference: http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

Written by
Gautam Goswami

He can be reached at gautambangalore@gmail.com for real-time POC development, hands-on technical training, and help with the design and development of any Hadoop/Big Data, Apache Kafka, or streaming-data related task. Gautam is an advisor and an educator as well. Prior to this, he served as a Senior Technical Architect in various technologies and business domains across numerous countries. He is passionate about sharing knowledge through blogs and training workshops on various Big Data related technologies, frameworks, and related systems.

 

 

 
