Categories: Tech Threads

Integrating Apache Kafka in KRaft Mode with RisingWave for Event Streaming Analytics

Over the past few years, Apache Kafka has emerged as the top event streaming platform for streaming data/event ingestion. However, in earlier version 3.5 of Apache Kafka, Zookeeper was the additional and mandatory component for managing and coordinating the Kafka cluster. Relying on ZooKeeper on the operational multi-node Kafka cluster introduced complexity and could be a single point of failure. Zookeeper is completely a separate system having its own configuration file syntax, management tools, and deployment patterns. In-depth skill with experience is necessary to manage and deploy two individual distributed systems and eventually up and running Kafka cluster. Having expertise in Kafka administration without Zookeeper won’t be able to help to come out from the crisis, especially in the production environment where Zookeeper runs in a completely isolated environment (Cloud).

Kafka’s reliance on ZooKeeper for metadata management was eliminated by introducing the Apache Kafka Raft (KRaft) consensus protocol. This eliminates the need for and configuration of two distinct systems—ZooKeeper and Kafka—and significantly simplifies Kafka’s architecture by transferring metadata management into Kafka itself. Apache Kafka has officially deprecated ZooKeeper in version 3.5 and the latest version of Kafka which is 3.8, improved the KRaft metadata version-related messages. There is no use unless we consume the ingested events from the Kafka topic and process them further to achieve business value.

RisingWave on the other hand makes processing streaming data easy, dependable, and efficient once event streaming flows to it from Kafka topic. Impressively RisingWave excels in delivering consistently updated materialized views, which are persistent data structures reflecting the outcomes of stream processing with incremental updates.

In this article, I am going to explain step by step how to install and configure the latest version of Apache Kafka, version 3.8 on a single-node cluster running on Ubuntu-22.04 and subsequently integrate it with RisingWave that was too installed and configured on the same node.

Assumptions:-

OpenJDK version 17.0.12 has been installed and configured including setting JAVA_HOME on ~/.bashrc file
SSH connection has been installed and configured. Later this node could be clubbed with a multi-node cluster on-prem.
PostgreSQL client version 14.12 (not the PostgreSQL server) has been installed and configured. This is mandatory to connect via psql with the RisingWave streaming database. psql is a command-line interface for interacting with PostgreSQL databases that is included in the PostgreSQL package. Since RisingWave is wire-compatible with PostgreSQL, by using psql, we will connect to RisingWave so that SQL queries can be issued and manage database objects. You can refer here to install and configure psql on Ubuntu

Installation and Configuration of Apache Kafka-3.8 with KRaft:-

The binary version of Kafka 3.8 which is the latest can be downloaded from here
Extract the tar ball and after extraction, the entire directory “kafka_2.13-3.8.0” moved to /usr/local/kafka. Make sure we should have “root” privilege
We can create a location directory as “kafka-logs” where Kafka logs will be stored under /usr/local. Make sure the created directory has read-write permissions.
As a configuration step, navigate to “kraft” directory available inside “/usr/local/kafka_2.13-3.8.0/config” and open the properties in vi editor to manipulate/update key-value pair. The following keys should have the corresponding values.
In KRaft mode each Kafka server can be configured as a controller, a broker, or both using the roles property. Since it is a single-node cluster, so I am setting broker and controller both

process.roles=broker,controller

and subsequently node.id=1, num.partitions=5 and delete.topic.enable=true.

Start and Verify the cluster:-

The unique cluster id generation and other required properties can be created by using the built-in script kafka-storage.sh available inside the bin directory

Make sure the files checkpoint and meta.properties got generated inside the created directory “ kafka-logs”. Unique cluster id available inside meta.properties file

Start the broker using the following command from the terminal.

Make sure the following should be displayed on the terminal.

Topic Creation:-

Using Apache Kafka’s built-in script kafka-topics.sh available inside the bin directory, I can create topic on the running Kafka broker using the terminal. Created one topic named “UPIStream” with a number of partitions 3. You can read here how to use Kafka’s built-in scripts.

Make RisingWave functional as a single instance on standalone mode:-

As said above, RisingWave in the standalone mode has been installed and configured on the same node where Kafka 3.8 on KRaft mode is operational. The RisingWave in standalone mode leverages the embedded SQLite database to store metadata and data in the file system. Before that, we need to install and configure the PostgreSQL client as mentioned in the assumptions.

Open a terminal and execute the following curl command

$ curl https://risingwave.com/sh | sh

We can start a RisingWave instance by running the following command on the terminal.

$./risingwave

Open a terminal to connect to RisingWave using the following command

$ psql -h 127.0.0.1 -p 4566 -d dev -U root

Connecting Kafka broker with RisingWave:-

Here I am going to connect RisingWave with the Kafka broker that I want to receive events from the created topic “UPIStream”. I need to create a source in RisingWave using the CREATE SOURCE command. When creating a source, I can choose to persist the data from the Kafka topic in RisingWave by using the CREATE TABLE command and specifying the connection settings and data format. There are more additional parameters available while connecting to Kafka broker. You can refer here (https://docs.risingwave.com/docs/current/ingest-from-kafka/) to know more.

Adding the following to simply connect the topic “UPIStream” on the psql terminal.

Continuous pushing of events from Kafka topic to RisingWave:-

Using a developed simulator in Java, I have published a stream of upi transaction events at an interval of 0.5 second in the following JSON format to the created topic “UPIStream”. Here is the one stream of event.

{“timestamp”:”2024-08-20 22:39:20.866″,”upiID”:”9902480505@pnb”,”name”:”Brahma Gupta Sr.”,”note”:” “,”amount”:”2779.00″,”currency”:”INR”,”Latitude”:”22.5348319″,”Longitude”:”15.1863628″,”deviceOS”:”iOS”,”targetApp”:”GPAY”,”merchantTransactionId”:”3619d3c01f5ad14f521b320100d46318b9″,”merchantUserId”:”11185368866533@sbi”}

Verify and analyzing events on RisingWave:-

Move to the psql terminal that is already connected with the RisingWave single instance and consuming all the published events from the Kafka topic “UPIStream” and storing on the source “UPI_Transaction_Stream”. On the other side Java simulator is running and continuously publishing individual event with different data to topic “UPIStream” at an interval of 0.5 second and subsequently each event is getting ingested to the RisingWave instance for further processing/analyzing.

After processing/modifying the events using the Materialized views, I could sink or send those events back to the different Kafka topic so that downstream applications can consumes those for further analytics. I’ll articulate this on my upcoming blog, please stay tuned J.

Since I have not done any processing, modification, or computations on the ingested events in the running RisingWave instance, I created a simple Materialized view to observe a few fields in the events to make sure integration with Apache Kafka on KRaft mode with RisingWave is working absolutely fine or not. And the answer is a big YES :).

Final Note:-

Especially for the on-premises deployment of a multi-node Kafka cluster, Apache Kafka 3.8 is an excellent release where we completely bypass the ZooKeeper dependency. Besides, it’s easy to set up a development environment for those who want to explore more about event streaming platforms like Apache Kafka. On the other hand, RisingWave functions as a streaming database that innovatively utilizes materialized views to power continuous analytics and data transformations for time-sensitive applications like alerting, monitoring, and trading. Ultimately, it’s becoming a game-changer as Apache Kafka joins forces with RisingWave to unlock business value from real-time stream processing.

Ref:-

https://docs.risingwave.com/docs/current/intro/
https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+3.8.0

I hope you enjoyed reading this. If you found this article valuable, please consider liking and sharing it.

Written by
Gautam Goswami

Can be reached for real-time POC development and hands-on technical training at gautambangalore@gmail.com. Besides, to design, develop just as help in any Hadoop/Big Data handling related task, Apache Kafka, Streaming Data etc. Gautam is a advisor and furthermore an Educator as well. Before that, he filled in as Sr. Technical Architect in different technologies and business space across numerous nations.
He is energetic about sharing information through blogs, preparing workshops on different Big Data related innovations, systems and related technologies.

Page: 1 2

Next Principle Of Data Science »

Previous « Criticality in Data Stream Processing and a Few Effective Approaches

View Comments

Hot Data: Where Real-Time Insight Begins

Hot data means the data currently being created, accessed, and queried at real-time or near… Read More

2 months ago

Tech Threads

Is TOON the Next Lightweight Hero in Event Stream Processing with Apache Kafka?

The data serialization format is a key factor when dealing with stream processing, as it… Read More

4 months ago

Tech Threads

Using Schema Registry to Manage Real-Time Data Streams in AI Pipelines

In today's AI-powered systems, real-time data is essential rather than optional. Real-time data streaming has… Read More

5 months ago

Tech Threads

AI on the Fly: Real-Time Data Streaming from Apache Kafka To Live Dashboards

In the current fast-paced digital age, many data sources generate an unending flow of information,… Read More

7 months ago

Tech Threads

Real-Time at Sea: Harnessing Data Stream Processing to Power Smarter Maritime Logistics

According to the International Chamber of Shipping, the maritime industry has increased fourfold in the… Read More

8 months ago

Tech Threads

Driving Streaming Intelligence On-Premises: Real-Time ML with Apache Kafka and Flink

Lately, companies, in their efforts to engage in real-time decision-making by exploiting big data, have… Read More

9 months ago