
Using Schema Registry to Manage Real-Time Data Streams in AI Pipelines

In today’s AI-powered systems, real-time data is essential rather than optional. Real-time data streaming now plays an important role in modern AI applications that need to make quick decisions. However, as data streams grow in complexity and speed, ensuring data consistency becomes a significant engineering challenge. AI models depend heavily on the input data used to train them, so the quality of that data is critical: it must not be corrupted or contain errors. If the quality of the input data is compromised, the accuracy, reliability, and fairness of the model’s predictions can be significantly affected.

AI models are developed and trained to identify patterns and make predictions based on input data. If we integrate these trained and tested models with real-time data stream processing pipelines, predictions can be made on the fly. Real-time data streaming plays a key role here because it allows models to handle and respond to data as it arrives, instead of relying only on static, historical datasets. You can read my previous article, “AI on the Fly: Real-Time Data Streaming from Apache Kafka To Live Dashboards”, for more on this.

The big question, then, is how we can ensure that real-time data streamed from various sources is free of errors and bad records. AI systems make decisions by spotting patterns in the data they were trained on; if that data contains mistakes, inconsistencies, or noise, the model may learn the wrong patterns, leading to outputs that are biased, inaccurate, or even risky. Moreover, as data streams become more complex and attain greater velocity, managing data consistency and schema evolution is a formidable engineering challenge. This is where a Schema Registry comes in. It is the central location for all data schema definitions, and so plays a vital role in maintaining integrity across real-time pipelines, ensuring compatibility among the various downstream applications (separate from the raw sources), and handling the scalability needs of those pipelines.


What a Schema Registry Does

A Schema Registry is essential for data governance and consistency in a streaming platform because it ensures that messages follow a pre-defined structure before they are published to an Apache Kafka topic. By validating every incoming message against a known schema, the registry prevents corrupt, invalid, or incompatible data from being sent to Kafka, where it might disrupt downstream processing, crash consumers, or hurt data quality. It stores a versioned history of the schemas of all registered data streams, including every schema change. Before constructing a Kafka-based data pipeline, we need to register the schema of the data available at the source with the Schema Registry. The Schema Registry is a distributed storage layer for schemas that uses Kafka itself as its underlying storage mechanism. It assigns a unique ID to each registered schema, so instead of appending the whole schema to every record, the producer only needs to embed the schema ID in each message it sends to a Kafka topic, as the sketch below illustrates.
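
To make this concrete, here is a minimal producer-side sketch in Python using the confluent-kafka client (installed with the avro extra). The registry URL, broker address, topic name, and the UserEvent schema are illustrative assumptions rather than part of any specific deployment. The Avro serializer registers the schema if needed, obtains its ID from the registry, and prepends that ID to each serialized message instead of the full schema.

# pip install "confluent-kafka[avro]"
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Illustrative Avro schema; in practice this describes your source data.
SCHEMA_STR = """
{
  "type": "record",
  "name": "UserEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "action",  "type": "string"},
    {"name": "ts",      "type": "long"}
  ]
}
"""

# Assumed endpoints for a local setup.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_serializer = AvroSerializer(schema_registry, SCHEMA_STR)
producer = Producer({"bootstrap.servers": "localhost:9092"})

topic = "user-events"
event = {"user_id": "u-42", "action": "login", "ts": 1718000000000}

# The serializer registers the schema (under the subject "user-events-value")
# if it is not already registered, and prepends the returned schema ID to the
# serialized payload instead of the full schema definition.
payload = avro_serializer(event, SerializationContext(topic, MessageField.VALUE))
producer.produce(topic=topic, value=payload)
producer.flush()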

Where to Integrate the Schema Registry

Preferably, Apache Kafka serves as the data ingestion layer of the streaming ecosystem, and the Schema Registry should be integrated at the point where data is produced, as well as at the downstream consumers that continuously read messages from Kafka topics and push them to processing engines like Flink. (Flink 1.18.1 can also consume messages from a Kafka topic directly, without additional consumers; you can read about that here.) Before ingesting a stream from the sources into a Kafka topic, we register its schema with the Schema Registry, and the producer includes the schema identifier in the message payload. Using this identifier, every downstream consumer fetches the correct, already-registered schema from the Schema Registry and then deserializes and validates the incoming data. This ensures schema consistency, enables backward/forward compatibility, and lets data formats evolve safely over time, reducing the risk of runtime errors caused by incompatible data. A consumer-side sketch follows below.
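
As a complement to the producer sketch above, here is a hedged consumer-side sketch, again assuming a local broker and registry and the illustrative user-events topic. The deserializer reads the schema ID embedded in each message and fetches the matching schema from the registry before decoding.

from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Assumed endpoints for a local setup.
schema_registry = SchemaRegistryClient({"url": "http://localhost:8081"})
avro_deserializer = AvroDeserializer(schema_registry)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "user-events-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-events"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # The deserializer looks up the writer schema by the ID embedded
        # in the message and returns a plain Python dict.
        event = avro_deserializer(
            msg.value(), SerializationContext(msg.topic(), MessageField.VALUE)
        )
        print(event)  # e.g. hand off to a Flink job or feature pipeline here
finally:
    consumer.close()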

Which Schema Registry to Use

The Schema Registry is an external process that runs on a server outside the Apache Kafka cluster; it is essentially a database for schemas. Multiple schema registries are available, supporting various data serialization formats such as Avro, JSON, or Protobuf. Some are cloud-native and require a license subscription, while others are open source.

  • Confluent Schema Registry

Confluent Schema Registry is the most popular and widely used schema registry. It was developed by Confluent for use with Apache Kafka and can be deployed as a standalone service or as part of the Confluent Platform. It provides a REST API for registering and retrieving schemas, and supports the Avro, JSON Schema, and Protobuf data formats. It is available under the Confluent Community License as well as the Confluent Enterprise (subscription) license, which adds advanced features. A rough sketch of the REST API is shown below.
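
As an illustration of that REST API, the sketch below registers an Avro schema under a subject and then retrieves it by ID, using Python’s requests library. The registry URL, subject name, and schema are assumptions made for the example; the /subjects/{subject}/versions and /schemas/ids/{id} endpoints follow the Confluent Schema Registry API.

import json
import requests

REGISTRY_URL = "http://localhost:8081"   # assumed local registry
SUBJECT = "user-events-value"            # illustrative subject name

avro_schema = {
    "type": "record",
    "name": "UserEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action",  "type": "string"},
        {"name": "ts",      "type": "long"},
    ],
}

# Register a new schema version under the subject.
resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(avro_schema)}),
)
resp.raise_for_status()
schema_id = resp.json()["id"]
print(f"Registered schema ID: {schema_id}")

# Retrieve the schema back by its global ID (what consumers do internally).
resp = requests.get(f"{REGISTRY_URL}/schemas/ids/{schema_id}")
resp.raise_for_status()
print(resp.json()["schema"])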

  • Apicurio Registry

An open-source, cloud-native schema registry developed by Red Hat. Apicurio Registry offers multiple storage options and can be configured to keep its data in different backend storage systems depending on the use case, including Apache Kafka, PostgreSQL, or Microsoft SQL Server. It supports adding, removing, and updating the following artifact types:

    • OpenAPI
    • AsyncAPI
    • GraphQL
    • Apache Avro
    • Google Protocol Buffers
    • JSON Schema
    • Kafka Connect schema
    • WSDL
    • XML Schema (XSD)

  • AWS Glue Schema Registry

AWS Glue Schema Registry is a fully managed schema registry provided by AWS that is tightly integrated with AWS services such as Kinesis, MSK (Managed Streaming for Apache Kafka), and Lambda. It offers automatic schema registration and evolution, and it integrates with AWS Glue ETL jobs and streaming services. This schema registry is a good fit when operating entirely within AWS and looking for serverless, managed schema governance.

  • Karapace

Karapace is a free and open-source schema registry licensed under Apache 2.0. It is a 1-to-1 replacement for Confluent’s Schema Registry and the Apache Kafka REST proxy. It stores schemas in a central repository, which clients access to serialize and deserialize messages. The schemas maintain their own version histories and can be checked for compatibility between their different versions. Karapace supports the Avro, JSON Schema, and Protobuf data formats. Because its API matches Confluent’s, a compatibility check can be sketched as shown below.
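
As an illustration of that compatibility checking, the hedged sketch below asks the registry whether a proposed new schema version (with an added optional field) is compatible with the latest registered version of a subject. It assumes a Karapace or Confluent-compatible registry at a local URL and the illustrative user-events-value subject; the /compatibility/subjects/{subject}/versions/latest endpoint comes from the Confluent-compatible REST API that Karapace implements.

import json
import requests

REGISTRY_URL = "http://localhost:8081"   # assumed Karapace/Confluent-compatible registry
SUBJECT = "user-events-value"            # illustrative subject name

# Proposed new version: adds an optional field with a default, which is
# typically a backward-compatible change.
new_schema = {
    "type": "record",
    "name": "UserEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "action",  "type": "string"},
        {"name": "ts",      "type": "long"},
        {"name": "device",  "type": ["null", "string"], "default": None},
    ],
}

resp = requests.post(
    f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(new_schema)}),
)
resp.raise_for_status()
print("Compatible:", resp.json().get("is_compatible"))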

Takeaways

In today’s AI pipelines, which demand low-latency data ingestion and real-time inference, schema registries are a key element in ensuring consistency and interoperability across large numbers of distributed systems. By decoupling data producers and consumers through a centralized schema contract, registries enable strict validation, backward/forward compatibility constraints, and controlled evolution. This ensures that upstream changes don’t break downstream consumers, one of the most important properties in a streaming architecture built on technologies like Apache Kafka or Flink. In addition, schema registries provide versioning, metadata management, and integration with tooling, increasing the observability, monitoring, and operational reliability of the data layer. Since AI systems increasingly rely on streaming data to make real-time decisions, a schema registry should be treated not merely as a best practice but as a core design principle for building scalable, fault-tolerant, and production-grade ML infrastructure.

Thank you for reading! If you found this article valuable, please consider liking and sharing it.

You can connect with me on LinkedIn.

Written by
Gautam Goswami 

 

 

 
