
Is TOON the Next Lightweight Hero in Event Stream Processing with Apache Kafka?

 
The data serialization format is a key factor in stream processing: it determines how efficiently data is forwarded on the wire and how it is stored, understood, and processed by a distributed system. Because the format directly influences the speed, reliability, scalability, and maintainability of the entire pipeline, choosing the right one can prevent expensive lock-in and keep our streaming infrastructure stable as data volume and complexity grow. In a stream-processing platform where ingestion systems such as Apache Kafka and processing engines like Flink or Spark must handle millions of events per second at low latency, efficient data formats are essential to keeping CPU usage down.

Exploring TOON as a data format:

Early in the new millennium, Douglas Crockford popularized JSON, a format designed with humans in mind. It is ubiquitous, readable, and accessible wherever APIs consume data or return responses. However, one drawback of JSON has become apparent in the current AI era: it is rather verbose. At the same time, real-time data streaming has started to have a significant impact on modern AI models for applications that need quick decisions.

TOON stands for Token-Oriented Object Notation, a lightweight, line-oriented data format. It is human-readable like JSON (far more so than binary formats), yet more compact and more structured than raw text. TOON is built to be very simple to parse: an array is declared on a header line that names its length and fields, and each following line is a single data row whose values are separated by a delimiter such as a comma. This line-oriented design matters for streaming environments because a parser can process one line at a time and does not need to build a full in-memory parse tree (unlike JSON), which makes it suitable for low-memory contexts, embedded systems, or logs. Here is a simple example of JSON with a shoes array that contains information about two shoes (two objects):

{
  "shoes": [
    { "id": 1, "name": "Nike", "type": "running" },
    { "id": 2, "name": "Adidas", "type": "walking" }
  ]
}

Now, let's convert the same data into TOON; it will look like this:

shoes[2]{id,name,type}:
  1,Nike,running
  2,Adidas,walking

Simple, right? In TOON, we don't need quotes, braces, or repeated keys: shoes[2]{id,name,type}: declares an array of two objects with the fields id, name, and type, and the following lines are simply the data rows. We can already see how TOON reduces token usage by roughly 30-50%, depending on the shape of the data.
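To make the conversion concrete, here is a minimal sketch in Java of how a uniform JSON array could be flattened into TOON-style rows. It uses the Jackson library for JSON parsing; the class and method names are illustrative, not part of any official TOON library, and the sketch assumes every object in the array shares the same flat fields.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ToonSketch {

    // Illustrative helper: flattens a uniform JSON array into TOON-style text.
    // Assumes every element is a flat object with identical field names.
    static String toToon(String arrayName, JsonNode array) {
        Iterator<String> fieldIt = array.get(0).fieldNames();
        List<String> fields = new ArrayList<>();
        fieldIt.forEachRemaining(fields::add);

        StringBuilder out = new StringBuilder();
        // Header line: array name, element count, and the shared field list.
        out.append(arrayName)
           .append('[').append(array.size()).append(']')
           .append('{').append(String.join(",", fields)).append("}:\n");

        // One comma-separated data row per object, indented under the header.
        for (JsonNode item : array) {
            List<String> values = new ArrayList<>();
            for (String f : fields) {
                values.add(item.get(f).asText());
            }
            out.append("  ").append(String.join(",", values)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"shoes\":[{\"id\":1,\"name\":\"Nike\",\"type\":\"running\"},"
                    + "{\"id\":2,\"name\":\"Adidas\",\"type\":\"walking\"}]}";
        JsonNode root = new ObjectMapper().readTree(json);
        System.out.print(toToon("shoes", root.get("shoes")));
        // Prints:
        // shoes[2]{id,name,type}:
        //   1,Nike,running
        //   2,Adidas,walking
    }
}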

Is TOON better than JSON?

As we know, real-time data streaming plays a key role for AI models, since it allows them to handle and respond to data as it arrives instead of relying only on static, pre-collected datasets. When building a platform or architecture where processed streaming data eventually feeds into AI systems such as TensorFlow, TOON provides several advantages over JSON, especially for large language models (LLMs), where JSON is considered heavyweight for data exchange because thousands of tokens are spent on quotes, braces, colons, and repeated keys. With TOON, uniform data sets need roughly 30-50% fewer tokens, and the reduced syntactic clutter makes the data easier for LLMs to consume. Like JSON, TOON supports nesting: it can express a simple object, an array of values, an array of objects, and an array of objects with nested fields. In the last case, TOON remains easy to read while being much smaller than the equivalent JSON, as sketched below. In short, TOON is a token-efficient serialization format primarily designed for streaming, low-memory environments, and LLM contexts.
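For instance, nested data might look like the following. This is an illustrative sketch based on published TOON examples, in which nested objects are expressed through indentation while uniform arrays keep the tabular header form; the field names are made up for the example.

order:
  id: 1001
  customer:
    name: Alice
    city: Berlin
  items[2]{sku,qty}:
    A-101,2
    B-207,1

The equivalent JSON would repeat the keys sku and qty for every item and wrap everything in quotes, braces, and brackets, which is where most of the token savings come from.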

What about Apache Kafka?

Before ingesting data streams from various real-time sources through producers into a multi-node Apache Kafka cluster, we first need a TOON parser that can translate its structural markers into a common internal representation such as JSON, since TOON is a hierarchically annotated, nested format. Second, we need a schema-extraction layer for the TOON format that normalizes fields such as rich metadata and embedded annotations; this step is necessary to enforce consistent types before producing messages to a Kafka topic. On top of that, we need data-validation rules so that malformed frames or unsupported TOON constructs can be handled gracefully. If the incoming stream carries large embedded objects from the producers to the Kafka topic, pre-serialization compression becomes essential as well. Finally, we should design a Kafka message-key scheme specific to TOON identifiers in order to preserve ordering and enable efficient deserialization for consumers in downstream applications. A community-driven Java implementation of TOON has been released on GitHub under the MIT license and can be useful if message producers are to be developed in Java. A rough sketch of such a producer follows.
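The sketch below shows one way a TOON payload might be sent to Kafka and keyed by an identifier extracted from its header line. The class name, topic name, bootstrap address, and the key-extraction rule are assumptions made for the example; they are not part of the community TOON library or of Kafka itself. Since TOON is plain UTF-8 text, Kafka's built-in StringSerializer is enough here until a dedicated TOON serializer with schema validation is in place.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ToonProducerSketch {

    // Illustrative key rule: use the array name from the TOON header line
    // (everything before the first '[') so records for the same entity
    // land on the same partition and preserve ordering.
    static String keyFromToonHeader(String toonPayload) {
        String header = toonPayload.lines().findFirst().orElse("");
        int bracket = header.indexOf('[');
        return bracket > 0 ? header.substring(0, bracket) : header;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The TOON frame produced earlier in the article.
        String toon = "shoes[2]{id,name,type}:\n  1,Nike,running\n  2,Adidas,walking\n";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("toon-events", keyFromToonHeader(toon), toon));
            producer.flush();
        }
    }
}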

Takeaway

TOON is a new data serialization format designed to minimize the number of tokens when exchanging structured data, primarily with language models. Although it is most beneficial in LLM-specific pipelines, we can also use it to ingest stream data into a Kafka topic, since it is a compact, token-efficient serialization format. TOON is not Kafka-native and is still relatively young compared to JSON, Avro, or Protobuf. For deeply nested structures or highly heterogeneous incoming messages, JSON or binary formats may remain the better choice. Because TOON is not widely supported yet, we may need to write custom serializer/deserializer code when integrating with existing message producers and with consumers in downstream applications across the stream-processing platform. But if efficient parsing and minimal overhead are the main concerns, TOON could be a very well-suited message payload format for Apache Kafka.

Thank you for reading! If you found this article valuable, please consider liking and sharing it.

You can connect with me on LinkedIn.

Written by
Gautam Goswami 

