Forging Apache Druid with Apache Kafka for real-time streaming analytics


Apache Druid is a real-time analytics database designed for fast slice-and-dice analysis on massive data volumes. It works best with event-oriented data and is frequently used as the database backend for analytical application GUIs and for highly concurrent APIs that require fast aggregations. Druid is most effective where real-time ingestion, fast query performance, and high uptime are crucial.
Apache Kafka, meanwhile, is gaining strong momentum as a distributed event streaming platform with low latency, fault tolerance, and high throughput, capable of handling thousands of messages per second.
However, several steps are usually needed before real-time data, ingested continuously into Kafka from various live data sources, can be visualized or analyzed to support a business decision. From a bird’s-eye view, the typical options can be outlined as follows:

  • Consume data from a Kafka topic by subscribing to it, then run data pipelines for transformation, validation, cleansing, and so on before loading the data into a permanent repository or database for querying.
  • Integrate a Kafka connector (such as the JDBC Kafka Sink Connector) to pull data from a Kafka topic and push it into an RDBMS or other repository for querying and visualization. Note that the JDBC Kafka Sink Connector has a Schema Registry dependency.
  • Use the Confluent REST Proxy to consume messages directly from the Kafka topic and persist them into an RDBMS via adapters.

With Apache Druid, however, we can connect directly to Apache Kafka so that real-time data is ingested continuously and can be queried immediately to make business decisions on the spot, without involving any third-party system or application. Another advantage of Apache Druid is that we need not install or configure any third-party UI application to view the data published to the Kafka topic.

In this article, we will see how to install Apache Druid, configure it with a single-node Apache Kafka cluster, publish a few string messages using Kafka’s built-in console producer, and finally view them in Apache Druid.

Assumptions:-

  • The system has a minimum of 8 GB RAM and a 250 GB SSD, with Ubuntu 22.04.2 (amd64) as the operating system.
  • OpenJDK 11 is installed and the JAVA_HOME environment variable is configured.
  • Python 3 or Python 2, along with Perl 5, is available on the system.
  • A single-node Apache Kafka 2.7.0 cluster is up and running with Apache ZooKeeper 3.5.6. (A separate article explains how to set up a multi-node Kafka cluster.)

Installation -> Configuration -> Start:-

  1. Apache Druid 26.0.0, the latest version at the time of writing, can be downloaded from https://www.apache.org/dyn/closer.cgi?path=/druid/26.0.0/apache-druid-26.0.0-bin.tar.gz
  2. Open a terminal and extract the downloaded tarball, then change to the distribution directory.
  3. Stop the Kafka broker and ZooKeeper if they are already running.
  4. Apache Druid also depends on ZooKeeper, so the Kafka broker can use the same ZooKeeper instance that is bundled and preconfigured with Druid. The ZooKeeper instance that was previously running with the Kafka broker can be set aside, and we can switch back to it whenever we are not running Druid.
  5. Open the zoo.cfg file under /apache-druid-26.0.0/conf/zk and add the following line (the key is lowercase):
    server.1=127.0.0.1:2888:3888
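
The edit in step 5 can also be scripted so that it is safe to repeat. The following is only a minimal sketch in Python: the function name ensure_server_entry is hypothetical, the path passed to it depends on where the tarball was extracted, and the entry is written with a lowercase key (server.1), which is the form ZooKeeper reads.

```python
from pathlib import Path

# Entry from step 5, written with the lowercase key that ZooKeeper expects.
SERVER_ENTRY = "server.1=127.0.0.1:2888:3888"


def ensure_server_entry(zoo_cfg: Path, entry: str = SERVER_ENTRY) -> bool:
    """Append `entry` to zoo.cfg unless an identical line is already present.

    Returns True if the file was modified, False if it was left unchanged.
    """
    text = zoo_cfg.read_text() if zoo_cfg.exists() else ""
    existing = [line.strip() for line in text.splitlines()]
    if entry in existing:
        return False
    # Make sure the new entry starts on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    zoo_cfg.write_text(text + entry + "\n")
    return True
```

For example, ensure_server_entry(Path("apache-druid-26.0.0/conf/zk/zoo.cfg")) would add the line once and do nothing on later runs, assuming the script is run from the directory that contains the extracted distribution.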

6. Open a terminal, navigate to /apache-druid-26.0.0/bin, and execute ./start-druid
The startup log of the Druid services should appear on the terminal.

7. Open a browser (preferably an up-to-date version of Firefox, which is available by default on Ubuntu 22.04) and go to http://localhost:8888. The Druid web console should be displayed in the browser.
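
Besides opening the console in a browser, the startup can be checked from a script. Druid processes expose a /status/health endpoint that answers with true once the process is up. The sketch below polls it using only the Python standard library. It assumes the same localhost:8888 address as step 7; the function names, retry count, and delay are illustrative choices, not Druid defaults.

```python
import time
import urllib.request
from typing import Callable

# Assumption: Druid is reachable on localhost:8888, as in step 7.
HEALTH_URL = "http://localhost:8888/status/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 3.0) -> str:
    """Return the body of the health endpoint, or "" if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip()
    except OSError:  # covers refused connections, timeouts, and HTTP errors
        return ""


def wait_until_healthy(fetch: Callable[[], str] = fetch_health,
                       attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll until the endpoint answers "true" or the attempts run out."""
    for attempt in range(attempts):
        if fetch() == "true":
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_until_healthy() right after ./start-druid gives a simple yes/no answer before moving on to the Kafka steps. The fetch parameter exists only so the polling logic can be exercised without a running Druid instance.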


8. Start the Kafka broker and create a new topic named FirstTopic using the kafka-topics.sh script from a terminal. (A separate article explains how to create a topic using the built-in script.)
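
As a rough illustration of step 8, the sketch below builds the kafka-topics.sh invocation in Python and runs it with subprocess. The kafka_home argument is a placeholder for wherever Kafka 2.7.0 is installed, the broker address localhost:9092 matches the one used later in the Druid console, and a single partition with replication factor 1 is an assumption that suits a single-node cluster.

```python
import subprocess
from typing import List


def create_topic_command(kafka_home: str, topic: str,
                         bootstrap: str = "localhost:9092",
                         partitions: int = 1,
                         replication: int = 1) -> List[str]:
    """Build the argument list for creating a topic with kafka-topics.sh."""
    script = f"{kafka_home.rstrip('/')}/bin/kafka-topics.sh"
    return [
        script, "--create",
        "--topic", topic,
        "--bootstrap-server", bootstrap,
        "--partitions", str(partitions),
        "--replication-factor", str(replication),
    ]


def create_topic(kafka_home: str, topic: str) -> None:
    """Run the command; raises CalledProcessError if the script exits non-zero."""
    subprocess.run(create_topic_command(kafka_home, topic), check=True)
```

With the broker running, create_topic("/path/to/kafka", "FirstTopic") would create the topic used in the rest of the article; the path shown is hypothetical.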

9. In the browser, open the “Load Data” menu and click on “Streaming”, then click “Start a new streaming spec”.

10. Click on Apache Kafka

11. Set Bootstrap servers to “localhost:9092” and Topic to “FirstTopic”. Druid will fetch real-time data from this topic as soon as messages are published to it. Select the “Start of stream” radio button and then click the “Start” button.
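
The values entered in step 11 correspond to the ioConfig part of a Druid Kafka supervisor spec. The sketch below shows roughly what that fragment looks like when built in Python. It is not a complete spec: a full supervisor spec also needs a dataSchema and a tuningConfig, which the later console steps fill in. The function name is hypothetical, and mapping “Start of stream” to useEarliestOffset is my reading of the console option.

```python
import json
from typing import Any, Dict


def kafka_io_config(topic: str, bootstrap_servers: str,
                    from_start: bool = True) -> Dict[str, Any]:
    """Build only the ioConfig fragment of a Kafka supervisor spec."""
    return {
        "type": "kafka",
        "topic": topic,
        # Passed through to the Kafka consumer that Druid creates.
        "consumerProperties": {"bootstrap.servers": bootstrap_servers},
        # True reads from the earliest available offset ("Start of stream").
        "useEarliestOffset": from_start,
    }


if __name__ == "__main__":
    print(json.dumps(kafka_io_config("FirstTopic", "localhost:9092"), indent=2))
```

Seeing the fragment as JSON can make it easier to recognize the same fields when the console shows the generated spec in its final steps.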

12. Run the built-in console producer script from Kafka’s bin directory and publish a few messages to “FirstTopic”.
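
Step 12 can be done interactively by typing messages into the console producer, or driven from a script. The following sketch assumes a placeholder kafka_home directory and the localhost:9092 broker from step 11. It feeds the messages to kafka-console-producer.sh on standard input, one message per line, so a message must not contain a newline itself.

```python
import subprocess
from typing import Iterable, List


def console_producer_command(kafka_home: str, topic: str,
                             bootstrap: str = "localhost:9092") -> List[str]:
    """Build the argument list for Kafka's built-in console producer."""
    script = f"{kafka_home.rstrip('/')}/bin/kafka-console-producer.sh"
    return [script, "--bootstrap-server", bootstrap, "--topic", topic]


def to_stdin_payload(messages: Iterable[str]) -> str:
    """Join messages one per line, the format the console producer reads."""
    return "".join(f"{message}\n" for message in messages)


def publish(kafka_home: str, topic: str, messages: Iterable[str]) -> None:
    """Send the messages and let the producer exit at end of input."""
    subprocess.run(console_producer_command(kafka_home, topic),
                   input=to_stdin_payload(messages), text=True, check=True)
```

For instance, publish("/path/to/kafka", "FirstTopic", ["hello druid", "second message"]) would send two string messages; the path and the message texts are made up for illustration.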

13. Click on “Next Parse Data” to proceed to the “Parse Data” tab in the browser. The published messages should be displayed immediately, each with a timestamp.

Many other options are available for setting the input format, patterns, and so on. These will be covered in upcoming articles on Druid’s features.
To know more about Druid, you can visit https://druid.apache.org

The following short video was recorded while performing the steps above:

https://vimeo.com/836518053

References:- https://druid.apache.org/docs/latest/design/index.html

Hope you have enjoyed this read. Please like and share if you feel this composition is valuable. Watch this space for additional articles on handling real-time data streaming.


Written by
Gautam Goswami

Gautam can be reached at gautambangalore@gmail.com for real-time POC development and hands-on technical training, as well as for help with designing and developing Hadoop/Big Data, Apache Kafka, and streaming data solutions. Gautam is an advisor and an educator. Before that, he worked as a Sr. Technical Architect across various technologies and business domains in several countries.
He is passionate about sharing knowledge through blogs and training workshops on Big Data technologies, systems, and related topics.
