Apache Druid is a real-time analytics database designed for fast slice-and-dice analysis on large data volumes. It works best with event-oriented data and is frequently used as the database backend for analytical application GUIs and for highly concurrent APIs that need fast aggregations. Druid is particularly effective where real-time ingestion, fast query performance, and high uptime are crucial.
At the other end, Apache Kafka is gaining strong momentum as a distributed event streaming platform that offers low latency, fault tolerance, and high throughput, and is capable of handling thousands of messages per second.
However, several steps have to be completed before real-time data that is continuously ingested into Kafka from various live data sources can be visualized and analyzed to support a business goal or decision. As a bird’s eye view, these steps can be outlined as
But Apache Druid can connect directly to Apache Kafka, so that real-time data can be ingested continuously and then queried to make business decisions on the spot, without involving any third-party system or application. Another advantage of Apache Druid is that we do not need to install or configure any third-party UI application to view the data published to the Kafka topic.
In this article, we walk through how to install Apache Druid, configure it with a single-node Apache Kafka cluster, publish a few String messages using Kafka’s built-in console producer, and finally view them in Apache Druid.
Assumptions:-
Installation -> Configuration -> Start:-
6. Open a terminal and navigate to /apache-druid-26.0.0/bin and execute ./start-druid
The following should appear on the terminal
7. Open a browser (for example, an up-to-date version of Firefox, which is available by default with Ubuntu 22.04) and go to http://localhost:8888. The following page should be displayed in the browser
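The check in step 7 can also be scripted. The sketch below is a minimal example and not part of the original walkthrough; it assumes the Druid service on port 8888 exposes the standard /status/health endpoint, and the function names are illustrative only.

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Return the health endpoint URL for a Druid service base URL."""
    return base_url.rstrip("/") + "/status/health"


def is_druid_up(base_url: str = "http://localhost:8888", timeout: float = 3.0) -> bool:
    """Return True if the service answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL:
        # treat all of these as "not up yet".
        return False
```

Calling is_druid_up() in a short retry loop after running ./start-druid is one way to wait until the console is reachable before opening the browser.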
8. Start the Kafka broker and create a new topic named FirstTopic using the kafka-topics.sh script from a terminal. (You can read here how to create a topic using the built-in script)
9. In the browser, open the “Load Data” menu and click on “Streaming”, then choose “Start a new streaming spec”
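As a rough illustration of step 8, the sketch below assembles the kafka-topics.sh command line for a single-node broker. The install path /opt/kafka, the function name, and the choice of one partition with replication factor 1 are assumptions made for this example, not values from the article.

```python
import shlex


def create_topic_command(kafka_home: str, topic: str,
                         bootstrap: str = "localhost:9092") -> list[str]:
    """Build the argument list for creating a topic on a single-node broker."""
    script = kafka_home.rstrip("/") + "/bin/kafka-topics.sh"
    return [
        script, "--create",
        "--topic", topic,
        "--bootstrap-server", bootstrap,
        # Assumed values: a single-node cluster cannot replicate beyond 1.
        "--partitions", "1",
        "--replication-factor", "1",
    ]


# "/opt/kafka" is a placeholder; replace it with the actual Kafka directory.
print(shlex.join(create_topic_command("/opt/kafka", "FirstTopic")))
```

The printed line can be pasted into a terminal, or the list can be passed to subprocess.run once the broker is running.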
10. Click on Apache Kafka
11. Set the Bootstrap servers to “localhost:9092” and the Topic to “FirstTopic”. Druid will fetch the real-time data from this topic as soon as messages are published to the Kafka topic “FirstTopic”. Select the “Start of stream” radio button and then click the “Start” button.
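For readers who prefer to see the settings from step 11 as data, the sketch below builds the Kafka connection fragment of a Druid supervisor spec. It is only a fragment under stated assumptions: the field names follow the Kafka ingestion ioConfig as I understand it, useEarliestOffset is assumed to be the counterpart of the “Start of stream” choice, and the input format and data schema that a complete spec needs are left out. The function name is illustrative.

```python
import json


def kafka_io_config(bootstrap_servers: str, topic: str,
                    from_start: bool = True) -> dict:
    """Build the Kafka connection part of a supervisor spec's ioConfig."""
    return {
        "type": "kafka",
        "consumerProperties": {"bootstrap.servers": bootstrap_servers},
        "topic": topic,
        # True reads from the earliest offset, False from the latest.
        "useEarliestOffset": from_start,
    }


print(json.dumps(kafka_io_config("localhost:9092", "FirstTopic"), indent=2))
```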
12. Execute the built-in console producer script available inside Kafka’s bin directory and start publishing a few messages/data into “FirstTopic”.
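Step 12 can be sketched as follows. This is a minimal example that pipes a few String messages into Kafka’s console producer from a script; the Kafka directory passed in, the function names, and the sample messages are placeholders chosen for illustration.

```python
import subprocess


def console_producer_command(kafka_home: str, topic: str,
                             bootstrap: str = "localhost:9092") -> list[str]:
    """Build the argument list for Kafka's built-in console producer."""
    script = kafka_home.rstrip("/") + "/bin/kafka-console-producer.sh"
    return [script, "--bootstrap-server", bootstrap, "--topic", topic]


def to_stdin_payload(messages: list[str]) -> str:
    """Join messages one per line, the format the console producer reads."""
    return "".join(f"{message}\n" for message in messages)


def publish(kafka_home: str, topic: str, messages: list[str]) -> None:
    """Send the messages through the console producer (needs a running broker)."""
    subprocess.run(
        console_producer_command(kafka_home, topic),
        input=to_stdin_payload(messages),
        text=True,
        check=True,
    )
```

For example, publish("/opt/kafka", "FirstTopic", ["hello druid", "second message"]) would send two messages, assuming Kafka is installed under /opt/kafka and the broker from step 8 is running.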
13. Click on “Next Parse Data” and proceed to the “Parse Data” tab in the browser. The published messages should be displayed immediately, each with a timestamp.
There are many other options available to set the input format, patterns, etc. These will be explained in upcoming articles on Druid’s features.
To know more about Druid, you can visit https://druid.apache.org
At the following URL, you can watch a short video that was captured while performing the above steps.
References:- https://druid.apache.org/docs/latest/design/index.html
Hope you have enjoyed this read. Please like and share if you feel this composition is valuable. Watch this space for additional articles on handling real-time data streaming.
Gautam can be reached at gautambangalore@gmail.com for real-time POC development and hands-on technical training, as well as for help with designing and developing any task related to Hadoop/Big Data handling, Apache Kafka, streaming data, etc. Gautam is an advisor and an educator. Before that, he worked as a Sr. Technical Architect across different technologies and business domains in numerous countries.
He is passionate about sharing knowledge through blogs and workshops on Big Data technologies, systems, and related topics.