Tech Threads

A Short Introduction to Apache Iceberg


A table can be defined as an arrangement of data in rows and columns and in a similar fashion, if you visualize from the Big Data perspective, the large number of individual files that hold the actual data can be organized in a tabular manner too.

We are already familiar with Apache Hive that works as a data warehouse system to query and analyze large datasets stored in the HDFS (Hadoop Distributed File System) or Amazon S3. Also an integral part of the big data ecosystem. Hive is a simple directory-based design where actual data files are getting stored at the folder/directory level in HDFS. If interested, you can read here how to import data into Hive tables. Hive keeps track of data at the folder level not in actual data files. Because of the directory-based model in Hive, listings are much slower, renames are not atomic, and results are eventually consistent. To work with data in a table, Hive needs to perform file list operations and this causes a performance bottleneck while executing SQL queries. Apache Iceberg is a new table format for storing large, slow-moving tabular data and can improve on the more standard table layout built into Hive, Trino, and Spark

The giant OTT platform Netflix originally developed Iceberg to decode their established issues related to managing/storing huge volumes of data in tables probably in petabyte-scales. Later in 2018, Iceberg was open-sourced as an Apache Incubator project.

The main aim to designed and developed Iceberg was basically to address the data consistency and performance issues that Hive having. Below we can see few major issues that Hive holding as said above and how resolved by Apache Iceberg.

– Schema Evolution

In nutshell, Schema evolution permits us to update the schema used to write new data while maintaining backward compatibility with the schemas of our old data. To make schema evolution support in Hive, actual data files need to be modified or rewritten. As an example, if we want to handle schema changes/evolution in Hive ORC tables like column deletions occurring at source database which is MySQL by leveraging Flume to import data, here are few major steps we need to follow like

Taking backup of old schema file -> Move the New AVSC Schema File to HDFS -> Create AVRO table with new location set and scheme location set ->Verify the data in AVRO after Schema changes in MySQL ->Take a Backup of the current ORC table and Drop the Original ORC table -> Create ORC table with a new location set -> Insert the data into the ORC table and eventually verify the ORC table after the schema changes. Then Continue the Incremental loads from the Next day onwards with the new target directory which was created after the schema changes.

But Apache Iceberg schema updates are metadata changes only and because of that, no data files need to be rewritten to perform the update.

Using unique IDs, Iceberg tracks each column in a table. While we add a new column, a new ID would assign to it to avoid any existing data usage by mistake.

Following schema evolution changes are currently supporting by Apache Iceberg

  • Add – add a new column to the table or to a nested struct
  • Drop – remove an existing column from the table or a nested struct
  • Rename – rename an existing column or field in a nested struct
  • Update – widen the type of a column, struct field, map key, map value, or list element
  • Reorder – change the order of columns or fields in a nested struct

To ensure schema evolution changes are unfettered and free of side-effects as well as without rewriting files, Apache Iceberg never read existing values from another column while adding a new column. Similarly for dropping or updating a column or field, Iceberg does not change the values in any other column.

– Partition Evolution

In Apache Hive partitioning can be done by dividing a table into related groups based on the values of a particular column like date, city, country, etc, Partitioning reduces the query response time in Apache Hive as data is stored in horizontal slices. In Hive partitioning, partitions are explicit and appear as a column and must be given partition values. Due to this approach, Hive having several issues like can’t validate partition values so fully dependent on the writer to produce the correct value, 100% dependent on the user to write queries correctly, Working queries are tightly coupled with the table’s partitioning scheme, so partitioning configuration cannot be changed without breaking queries, etc.

Apache Iceberg introduces the concept of hidden partitioning where the reading of unnecessary partitions can be avoided automatically. Data consumers that fire the queries don’t need to know how the table is partitioned and add extra filters to their queries. Iceberg partition layouts can evolve as needed. Iceberg can hide partitioning because it does not require user-maintained partition columns. Iceberg produces partition values by taking a column value and optionally transforming it.

Apache Iceberg is used in production where a single table can contain tens of petabytes of data and can read these huge tables without leveraging distributed SQL engine. It was developed for gigantic tables. By using a set of Java API that Iceberg produces, we can manage table metadata, like schema, partition spec, metadata, and data files that store table data.

Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.

Reference:- https://iceberg.apache.org

Written by
Gautam Goswami

Can be reached for real-time POC development and hands-on technical training at gautambangalore@gmail.com. Besides, to design, develop just as help in any Hadoop/Big Data handling related task, Apache Kafka, Streaming Data etc. Gautam is a advisor and furthermore an Educator as well. Before that, he filled in as Sr. Technical Architect in different technologies and business space across numerous nations.
He is energetic about sharing information through blogs, preparing workshops on different Big Data related innovations, systems and related technologies.

 

Page: 1 2

Recent Posts

Transferring real-time data processed within Apache Flink to Kafka

Transferring real-time data processed within Apache Flink to Kafka and ultimately to Druid for analysis/decision-making.… Read More

3 weeks ago

Streaming real-time data from Kafka 3.7.0 to Flink 1.18.1 for processing

Over the past few years, Apache Kafka has emerged as the leading standard for streaming… Read More

2 months ago

Why Apache Kafka and Apache Flink work incredibly well together to boost real-time data analytics

When data is analyzed and processed in real-time, it can yield insights and actionable information… Read More

3 months ago

Integrating rate-limiting and backpressure strategies synergistically to handle and alleviate consumer lag in Apache Kafka

Apache Kafka stands as a robust distributed streaming platform. However, like any system, it is… Read More

3 months ago

Leveraging Apache Kafka for the Distribution of Large Messages (in gigabyte size range)

In today's data-driven world, the capability to transport and circulate large amounts of data, especially… Read More

5 months ago

The Zero Copy principle subtly encourages Apache Kafka to be more efficient.

The Apache Kafka, a distributed event streaming technology, can process trillions of events each day… Read More

6 months ago