In short, the initial data warehouse architecture was designed to deliver analytical insights by collecting data from various heterogeneous sources into a centralized repository, acting as the fulcrum for decision support and business intelligence (BI). However, it continues to face numerous challenges, such as time-consuming data model design because it supports only schema-on-write, the inability to store unstructured data, the tight coupling of compute and storage in an on-premises appliance, and so on.
This article has aimed to highlight how the architectural pattern has been enhanced, transforming the traditional data warehouse by rolling over to the second-generation platform, the Data Lake, and eventually to the Lakehouse. Although the present data warehouse supports a three-tier architecture with an online analytical processing (OLAP) server as the middle tier, it still does not offer, as a separate tier, a consolidated platform for machine learning and data science with a metadata, caching, and indexing layer.
Another major relief achieved with the Hadoop ecosystem (Apache Hive) is support for schema-on-read. Under the strict schema-on-write principle of traditional data warehouses, ETL steps were very time-consuming because incoming data had to conform to pre-designed tablespaces. In a single sentence, a Data Lake can be defined as a repository that stores huge amounts of raw data in its native formats (structured, unstructured, and semi-structured) for big data processing and subsequent analysis, predictive analytics, running machine learning code/applications to build algorithms, and so on.
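As a minimal sketch of schema-on-read, the PySpark snippet below applies a schema only at the moment the raw files are read, rather than when they are written into the lake. The lake path, schema fields, and application name are illustrative assumptions, not details from the original article.

```python
# Minimal schema-on-read sketch with PySpark: the schema is applied when the
# raw files are read, not when they are written into the lake.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Hypothetical landing path holding raw clickstream events in native JSON form.
raw_path = "s3a://demo-data-lake/raw/clickstream/"

# The same raw files can be projected through different schemas per use case.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("revenue", DoubleType()),
])

events = spark.read.schema(event_schema).json(raw_path)
events.createOrReplaceTempView("clickstream_events")

# Analysts query the raw data directly; no upfront ETL into fixed tablespaces.
spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM clickstream_events
    GROUP BY event_type
""").show()
```

Because no schema is enforced at write time, the same raw files can later be re-read with a different projection for a new use case without reloading the lake.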
The idea of the Data Lakehouse is at an early phase and places processing power on top of Data Lakes such as S3, HDFS, Azure Blob, and so on. The Lakehouse combines the benefits of the Data Lake's low-cost storage in an open format accessible by a variety of systems with the Data Warehouse's powerful management and optimization features. The concept of the Data Lakehouse was introduced by Databricks and AWS.
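The sketch below illustrates that combination using Delta Lake, the open table format that originated at Databricks, over object storage: low-cost lake files gain ACID transactions, schema enforcement, and time travel. The bucket path, sample records, and session configuration are assumptions for illustration and presume the delta-spark package is on the classpath.

```python
# Lakehouse-style sketch: Delta Lake adds warehouse-like management (ACID
# transactions, schema enforcement, time travel) on top of lake storage.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    # Assumes the Delta Lake (delta-spark) package is available.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical curated zone on the data lake.
table_path = "s3a://demo-data-lake/curated/sales_delta"

sales = spark.createDataFrame(
    [("INV-1001", "IN", 250.0), ("INV-1002", "US", 410.5)],
    ["invoice_id", "country", "amount"],
)

# Writing in the open Delta format keeps the data accessible to many engines.
sales.write.format("delta").mode("append").save(table_path)

# BI queries and ML feature pipelines can read the same table.
spark.read.format("delta").load(table_path).groupBy("country").sum("amount").show()

# Time travel: read an earlier version of the table for reproducibility.
spark.read.format("delta").option("versionAsOf", 0).load(table_path).show()
```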
The Lakehouse architecture is best suited to provide a single point of access within an organization for all sorts of data, regardless of purpose.
Conclusion
Data-cleaning complexity, query compatibility, the monolithic architecture of the Lakehouse, caching for hot data, and similar concerns are a few limitations to consider before relying completely on a Lakehouse architecture. Despite all the hype around Data Lakehouses, it is worth remembering that the concept is still at a very nascent stage. Going forward, there will be a need for tools that enable data discovery, data usage metrics, data governance capabilities, and more on the Lakehouse.
Hope you have enjoyed this read. Please like and share if you find this composition valuable.
Reference: http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
Written by
Gautam Goswami
He can be reached at gautambangalore@gmail.com for real-time POC development and hands-on technical training, as well as for design and development help on any Hadoop/Big Data processing task, Apache Kafka, streaming data, etc. Gautam is an advisor and educator. Before that, he served as a Sr. Technical Architect across different technologies and business domains in numerous countries. He is passionate about sharing knowledge through blogs and conducting workshops on various Big Data-related technologies, frameworks, and related systems.