
Streaming real-time data from Kafka 3.7.0 to Flink 1.18.1 for processing

Over the past few years, Apache Kafka has emerged as the leading standard for streaming data. Today Kafka is ubiquitous, adopted by at least 80% of the Fortune 100. This widespread adoption is attributed to Kafka’s architecture, which goes far beyond basic messaging. Its architectural versatility makes it well suited to streaming data at ‘internet’ scale, while providing the fault tolerance and data consistency that mission-critical applications need.

Flink is a high-throughput, unified batch and stream processing engine, known for its ability to handle continuous data streams at scale. It integrates with Kafka and supports exactly-once semantics, so that each event is reflected in the results exactly once, even when failures occur. This makes Flink a natural choice as a stream processor for Kafka. Although Apache Flink is widely used for real-time data processing, up-to-date resources and examples for learning it can be hard to find.

In this article, I will guide you step by step through integrating Kafka 3.7.0 (the Scala 2.13 build, kafka_2.13-3.7.0) with Flink 1.18.1 to consume data from a topic and process it within Flink on a single-node cluster. Ubuntu 22.04 LTS is the operating system used on the cluster.

Assumptions:-

The system has a minimum of 8 GB RAM and a 250 GB SSD, with Ubuntu 22.04.2 (amd64) as the operating system.
OpenJDK 11 is installed and the JAVA_HOME environment variable is configured.
Python 3 or Python 2, along with Perl 5, is available on the system.
A single-node Apache Kafka 3.7.0 cluster is up and running with Apache ZooKeeper 3.5.6. (Setting up a Kafka cluster is covered separately.)

Installation and starting of Flink-1.18.1:-

The binary distribution of Flink-1.18.1 can be downloaded from https://www.apache.org/dyn/closer.lua/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz
Extract the archive flink-1.18.1-bin-scala_2.12.tgz on the terminal using $ tar -xzf flink-1.18.1-bin-scala_2.12.tgz. After successful extraction, the directory flink-1.18.1 will be created. Please make sure that the bin/, conf/, and examples/ directories are present inside it.
Navigate to the flink-1.18.1 directory through the terminal and execute $ ./bin/start-cluster.sh to start the single-node Flink cluster.
Moreover, we can use Flink’s web UI to monitor the status of the cluster and running jobs by opening http://localhost:8081 in a browser.
The Flink cluster can be stopped by executing $ ./bin/stop-cluster.sh
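
As a quick sanity check after starting the cluster, the same port 8081 that serves the web UI also exposes Flink's REST API. The sketch below is only an illustration, not part of the original setup: it assumes the cluster runs on localhost with the default port, and the class name FlinkClusterCheck and the helper overviewUri are hypothetical. It uses only the HTTP client that ships with Java 11 to request the /overview endpoint and print whatever the cluster returns.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class FlinkClusterCheck {

    // Builds the URL of Flink's REST "overview" endpoint for a given host and port.
    public static URI overviewUri(String host, int port) {
        return URI.create("http://" + host + ":" + port + "/overview");
    }

    public static void main(String[] args) {
        // Assumption: single-node cluster on this machine, default REST/web UI port.
        URI uri = overviewUri("localhost", 8081);

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();

        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // A running cluster should answer with HTTP 200 and a JSON summary.
            System.out.println("HTTP " + response.statusCode());
            System.out.println(response.body());
        } catch (Exception e) {
            // Connection refused usually means the cluster has not been started.
            System.out.println("Flink does not appear to be reachable at " + uri + ": " + e);
        }
    }
}
```

If the request fails with a connection error, the cluster is most likely not running, or it is listening on a different host or port than assumed here.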

List of dependent jars:-

The following jars should be included on the classpath/build file

I’ve created a basic Java program using Eclipse IDE 2023-12 to continuously consume messages within Flink from a Kafka topic. Dummy string messages are published to the topic using Kafka’s built-in kafka-console-producer script. When a message arrives in Flink, no real data transformation is applied; an additional string is simply appended to it and the result is printed, to verify that messages are being streamed to Flink continuously.
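
For readers who want a starting point, a program of this kind could look roughly like the sketch below. It is not the exact code from the recording. It assumes the Flink streaming API and the Flink Kafka connector are on the classpath, and the class name, broker address (localhost:9092), topic name, consumer group id, and appended text are all placeholders chosen for illustration. It uses Flink's KafkaSource builder to read string values from a topic, appends a marker to each message, and prints the result.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToFlinkJob {

    // Placeholder values (assumptions); replace them with your own settings.
    static final String BOOTSTRAP_SERVERS = "localhost:9092";
    static final String TOPIC = "flink-demo-topic";
    static final String GROUP_ID = "flink-demo-group";

    // Appends a fixed marker so the printed output shows the message passed through Flink.
    public static String appendMarker(String message) {
        return message + " :: received in Flink";
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read only the record values, decoded as plain strings.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers(BOOTSTRAP_SERVERS)
                .setTopics(TOPIC)
                .setGroupId(GROUP_ID)
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // No event-time logic is needed here, so watermarks are disabled.
        DataStream<String> messages =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka source");

        messages
                .map(KafkaToFlinkJob::appendMarker)
                .returns(Types.STRING) // explicit result type for the method reference
                .print();

        // The job keeps running and consuming until it is cancelled.
        env.execute("Kafka to Flink demo");
    }
}
```

With the job running, each line typed into the console producer for the same topic should appear in the job's standard output with the marker appended.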

The entire execution has been screen-recorded. If you are interested, you can watch it here:

https://vimeo.com/920423458?share=copy

I hope you enjoyed reading this. Please stay tuned for another upcoming article where I will explain how to stream messages/data from Flink to a Kafka topic.

Written by
Gautam Goswami

Gautam can be reached for real-time POC development and hands-on technical training at gautambangalore@gmail.com, as well as for help with designing and developing solutions for Hadoop/Big Data processing, Apache Kafka, streaming data, and related tasks. He is an advisor and an educator. Before that, he worked as a Sr. Technical Architect across different technologies and business domains in several countries.
He is passionate about sharing knowledge through blogs and training workshops on Big Data technologies, systems, and related topics.
