Tech Threads

Basic concept of a Data Lake

The infographic represents the basic concept of a Data Lake, where we can use the ELT approach (Extract, Load, and then Transform) instead of the traditional ETL process (Extract, Transform, and then Load). ETL applies to traditional data warehousing systems, where data must conform to a structured format (rows and columns). By leveraging HDFS (Hadoop Distributed File System), we can build a Data Lake that stores data in any format for processing and analysis. Data can be loaded directly into the lake without transformation, and the transformation can be performed later on demand (a minimal sketch of this flow follows the list below). The Data Lake concept offers tremendous advantages and benefits:

  1.  A huge volume of data can be stored in a distributed manner.
  2.  Data format is not a criterion in a Data Lake; any format can be stored, whether structured, semi-structured, or unstructured.
  3.  Semi-structured and unstructured data cannot be stored as-is in a traditional data warehousing system; mandatory pre-processing steps are needed to convert it into a structured format before loading. These steps are expensive and time-consuming, and the chance of data loss or corruption is high.
  4.  Commodity hardware can be utilized to build a Data Lake, and it is fault-tolerant.
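To make the ELT flow concrete, here is a minimal PySpark sketch (not part of the original article) that first lands raw source files in an HDFS-backed lake path untouched, and only later parses and shapes them when an analysis needs it. The namenode address, paths, and field names are hypothetical assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical HDFS locations for the lake; adjust the namenode host/port and paths.
RAW_PATH     = "hdfs://namenode:8020/datalake/raw/clickstream/"
CURATED_PATH = "hdfs://namenode:8020/datalake/curated/clickstream/"

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# "E" and "L": land the source data in the lake as-is, with no schema enforcement.
raw = spark.read.text("file:///staging/clickstream/*.json")   # read source files as raw text lines
raw.write.mode("append").text(RAW_PATH)                       # load into the lake untouched

# "T" on demand: parse and aggregate the raw data only when a consumer asks for it.
events = spark.read.json(RAW_PATH)                            # schema inferred at read time
daily_counts = (
    events
    .filter(F.col("event_type") == "purchase")                # hypothetical field names
    .groupBy(F.to_date("event_time").alias("day"))
    .count()
)
daily_counts.write.mode("overwrite").parquet(CURATED_PATH)    # curated, columnar output

spark.stop()

The key point of the sketch is that the expensive structuring step (schema inference, filtering, aggregation) happens after loading and only for the data actually being analyzed, which is the opposite of the ETL pattern used by a traditional warehouse.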


Written by
Gautam Goswami

He can be reached at gautambangalore@gmail.com for real-time POC development and hands-on technical training, as well as for designing, developing, or assisting with any Hadoop/Big Data related task. Gautam is an advisor and an educator. Before that, he served as a Sr. Technical Architect across various technologies and business domains in numerous countries.
He is passionate about sharing knowledge through blogs and conducting workshops on various Big Data frameworks and related technologies.
