
Bridging the Gap: Unlocking the Power of HDFS-Based Data Lakes with Streaming Databases

The rapid development of big data technologies has highlighted the need for a smooth bridge between batch processing systems and real-time data analytics. HDFS (Hadoop Distributed File System) based data lakes, which provide scalable and affordable storage for vast amounts of heterogeneous data, have become a key component of modern data architectures. However, HDFS’s static, batch-oriented nature frequently poses difficulties when it has to interact with dynamic, real-time data operations. This article examines how streaming databases can help close that gap by enabling real-time data ingestion, transformation, and analysis within HDFS-based data lakes.

Streaming databases can rely on HDFS-based data lakes to handle, process, and store large volumes of streaming data efficiently. The pairing works because HDFS-based data lakes are designed to store and manage big data in a distributed manner, while streaming databases specialize in real-time processing and querying.

Data Ingestion

Using popular distributed data streaming platforms such as Apache Kafka, data can be ingested into streaming databases from real-time sources such as IoT sensors, log files, social media feeds, and financial transaction systems. After ingestion, streaming databases can forward or periodically offload the data into an HDFS-based data lake for long-term storage and batch processing. In Apache Druid, for example, deep storage guarantees long-term data persistence even if segments are deleted from the live cluster after compaction, which ensures data longevity and protects against data loss.
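As a minimal sketch of the ingestion step, the statement below registers a Kafka topic as a source in a streaming SQL database, using RisingWave-style syntax; the topic name, broker address, and column schema are hypothetical.

```sql
-- Hypothetical example: register a Kafka topic as a streaming source
-- (RisingWave-style syntax; topic, broker, and columns are illustrative).
CREATE SOURCE iot_sensor_events (
    device_id   VARCHAR,
    temperature DOUBLE PRECISION,
    event_time  TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'iot-sensor-events',
    properties.bootstrap.server = 'kafka-broker:9092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
```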

Similarly, RisingWave, an open-source streaming SQL database, can send data to WebHDFS directly, and thereby to HDFS indirectly, using the CREATE SINK command. WebHDFS is a RESTful API that exposes HDFS operations over HTTP, so through RisingWave’s WebHDFS support you can browse HDFS and read, write, and remove files without installing the Hadoop client libraries.
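A minimal sketch of such a sink is shown below; the connector name and parameters are assumptions modeled on the general shape of RisingWave’s file sinks, so verify them against the CREATE SINK documentation before use.

```sql
-- Hypothetical sketch: offload a materialized view to WebHDFS.
-- Connector and parameter names are assumptions; check the RisingWave
-- CREATE SINK documentation for the exact options.
CREATE SINK sensor_archive_sink FROM sensor_readings_mv
WITH (
    connector = 'webhdfs',
    webhdfs.endpoint = 'namenode-host:9870',
    webhdfs.path = '/datalake/sensor_readings/',
    type = 'append-only'
) FORMAT PLAIN ENCODE PARQUET;
```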

Real-Time Processing vs. Long-Term Storage

Streaming databases are optimized for real-time analysis, such as anomaly detection, alerting, or running continuous queries, and typically operate on in-memory datasets or temporary storage for high-speed access. HDFS-based data lakes, in contrast, act as the “single source of truth” for historical and archived data; they are designed for scalability, letting you store raw, semi-structured, or structured data at a lower cost than a streaming database.

After streaming data has been processed and analyzed in real time within the streaming database, the results can be transported to an HDFS-based data lake and stored there for downstream use cases such as machine learning model training or batch analytics.
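Once the offloaded data lands in HDFS, batch engines can query it by defining an external table over the files. The Hive-style DDL below is an illustrative sketch; the HDFS path, columns, and partitioning scheme are hypothetical.

```sql
-- Illustrative Hive DDL: expose offloaded Parquet files in the data lake
-- to batch engines. Path, columns, and partitioning are hypothetical.
CREATE EXTERNAL TABLE IF NOT EXISTS sensor_readings_archive (
    device_id   STRING,
    temperature DOUBLE,
    event_time  TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs://namenode:8020/datalake/sensor_readings/';
```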

Data Lifecycle Management

The two systems also divide the data lifecycle between hot and cold data: streaming databases handle hot data that is accessed or processed immediately, while HDFS manages cold data that must be retained for purposes such as compliance, backup, or historical analysis. As an offloading mechanism, streaming systems periodically push data (either raw or transformed) to HDFS, which keeps the database efficient and scalable by preventing it from becoming overloaded.

Query and Analytics Workflow

In hybrid workflows, real-time insights are derived from the streaming database, while batch processing and deep analysis are performed on historical data stored in the HDFS-based data lake using tools like Hive, Spark, or Presto. For example, an e-commerce platform can detect fraud in real time using a streaming database while storing its transactional logs in HDFS; later, it can run customer behavior analysis over the entire dataset using Spark on the HDFS data lake, as sketched below.
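The query below is an illustrative batch-analytics example that Hive, Spark SQL, or Presto could run over transaction logs in the data lake; the table and column names are hypothetical.

```sql
-- Illustrative batch query over transaction logs stored in the HDFS
-- data lake (Hive / Spark SQL / Presto); names are hypothetical.
SELECT
    customer_id,
    COUNT(*)        AS order_count,
    SUM(amount)     AS total_spent,
    MAX(order_time) AS last_order_time
FROM transaction_logs
WHERE dt >= '2024-01-01'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100;
```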

Frameworks and Tools for Facilitating Integration

Many contemporary tools facilitate the smooth integration of HDFS-based data lakes with streaming databases. Streaming databases such as Apache Druid and RisingWave provide built-in connectors, as mentioned in the data ingestion section. Additionally, Kafka’s HDFS sink connector simplifies moving data from Kafka topics into HDFS: real-time data, after being analyzed in the streaming database, can be published to a Kafka topic and subsequently transferred from that topic into the HDFS-based data lake.
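As a sketch of the first half of that pipeline, the statement below publishes analyzed results from a streaming SQL database to a Kafka topic, using RisingWave-style syntax; the materialized view, topic, and broker address are hypothetical, and a Kafka Connect HDFS sink would then pick the topic up.

```sql
-- Hypothetical sketch: publish analyzed results to a Kafka topic
-- (RisingWave-style syntax); a Kafka Connect HDFS sink connector can
-- then move the topic's data into the HDFS data lake.
CREATE SINK fraud_alerts_to_kafka FROM fraud_alerts_mv
WITH (
    connector = 'kafka',
    topic = 'fraud-alerts',
    properties.bootstrap.server = 'kafka-broker:9092'
) FORMAT PLAIN ENCODE JSON (force_append_only = 'true');
```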

Benefits of Streaming Databases with HDFS as a Data Lake

In terms of scalability, HDFS can handle petabytes of data, accommodating the high throughput of streaming sources. Storing large volumes of data in HDFS is also more cost-efficient than relying solely on a high-performance streaming database for all data. HDFS supports multiple storage formats (e.g., Parquet, ORC, Avro), catering to diverse analytics needs. From a durability and fault-tolerance perspective, HDFS replicates data across nodes, providing a level of durability that streaming systems might not guarantee on their own.

In summary, streaming databases rely on HDFS-based data lakes to complement their real-time capabilities with cost-effective, durable, and scalable long-term storage and analytics, enabling businesses to balance real-time and historical data processing efficiently. However, HDFS-based data lakes face challenges such as data governance, latency, cluster maintenance overhead, and performance issues.

If you find this content valuable, please share it and give it a thumbs up!

You can connect with me on LinkedIn.

Written by
Gautam Goswami 

