Tech Threads

A Short Introduction to Apache Iceberg


A table can be defined as an arrangement of data in rows and columns and in a similar fashion, if you visualize from the Big Data perspective, the large number of individual files that hold the actual data can be organized in a tabular manner too.

We are already familiar with Apache Hive that works as a data warehouse system to query and analyze large datasets stored in the HDFS (Hadoop Distributed File System) or Amazon S3. Also an integral part of the big data ecosystem. Hive is a simple directory-based design where actual data files are getting stored at the folder/directory level in HDFS. If interested, you can read here how to import data into Hive tables. Hive keeps track of data at the folder level not in actual data files. Because of the directory-based model in Hive, listings are much slower, renames are not atomic, and results are eventually consistent. To work with data in a table, Hive needs to perform file list operations and this causes a performance bottleneck while executing SQL queries. Apache Iceberg is a new table format for storing large, slow-moving tabular data and can improve on the more standard table layout built into Hive, Trino, and Spark

The giant OTT platform Netflix originally developed Iceberg to decode their established issues related to managing/storing huge volumes of data in tables probably in petabyte-scales. Later in 2018, Iceberg was open-sourced as an Apache Incubator project.

The main aim to designed and developed Iceberg was basically to address the data consistency and performance issues that Hive having. Below we can see few major issues that Hive holding as said above and how resolved by Apache Iceberg.

– Schema Evolution

In nutshell, Schema evolution permits us to update the schema used to write new data while maintaining backward compatibility with the schemas of our old data. To make schema evolution support in Hive, actual data files need to be modified or rewritten. As an example, if we want to handle schema changes/evolution in Hive ORC tables like column deletions occurring at source database which is MySQL by leveraging Flume to import data, here are few major steps we need to follow like

Taking backup of old schema file -> Move the New AVSC Schema File to HDFS -> Create AVRO table with new location set and scheme location set ->Verify the data in AVRO after Schema changes in MySQL ->Take a Backup of the current ORC table and Drop the Original ORC table -> Create ORC table with a new location set -> Insert the data into the ORC table and eventually verify the ORC table after the schema changes. Then Continue the Incremental loads from the Next day onwards with the new target directory which was created after the schema changes.

But Apache Iceberg schema updates are metadata changes only and because of that, no data files need to be rewritten to perform the update.

Using unique IDs, Iceberg tracks each column in a table. While we add a new column, a new ID would assign to it to avoid any existing data usage by mistake.

Following schema evolution changes are currently supporting by Apache Iceberg

  • Add – add a new column to the table or to a nested struct
  • Drop – remove an existing column from the table or a nested struct
  • Rename – rename an existing column or field in a nested struct
  • Update – widen the type of a column, struct field, map key, map value, or list element
  • Reorder – change the order of columns or fields in a nested struct

To ensure schema evolution changes are unfettered and free of side-effects as well as without rewriting files, Apache Iceberg never read existing values from another column while adding a new column. Similarly for dropping or updating a column or field, Iceberg does not change the values in any other column.

– Partition Evolution

In Apache Hive partitioning can be done by dividing a table into related groups based on the values of a particular column like date, city, country, etc, Partitioning reduces the query response time in Apache Hive as data is stored in horizontal slices. In Hive partitioning, partitions are explicit and appear as a column and must be given partition values. Due to this approach, Hive having several issues like can’t validate partition values so fully dependent on the writer to produce the correct value, 100% dependent on the user to write queries correctly, Working queries are tightly coupled with the table’s partitioning scheme, so partitioning configuration cannot be changed without breaking queries, etc.

Apache Iceberg introduces the concept of hidden partitioning where the reading of unnecessary partitions can be avoided automatically. Data consumers that fire the queries don’t need to know how the table is partitioned and add extra filters to their queries. Iceberg partition layouts can evolve as needed. Iceberg can hide partitioning because it does not require user-maintained partition columns. Iceberg produces partition values by taking a column value and optionally transforming it.

Apache Iceberg is used in production where a single table can contain tens of petabytes of data and can read these huge tables without leveraging distributed SQL engine. It was developed for gigantic tables. By using a set of Java API that Iceberg produces, we can manage table metadata, like schema, partition spec, metadata, and data files that store table data.

Hope you have enjoyed this read. Please like and share if you feel this composition is valuable.

Reference:- https://iceberg.apache.org

Written by
Gautam Goswami

Can be reached for real-time POC development and hands-on technical training at gautambangalore@gmail.com. Besides, to design, develop just as help in any Hadoop/Big Data handling related task, Apache Kafka, Streaming Data etc. Gautam is a advisor and furthermore an Educator as well. Before that, he filled in as Sr. Technical Architect in different technologies and business space across numerous nations.
He is energetic about sharing information through blogs, preparing workshops on different Big Data related innovations, systems and related technologies.

 

Page: 1 2

Recent Posts

The Role of Materialized Views in Modern Data Stream Processing Architectures + RisingWave

Incremental computation in data streaming means updating results as fresh data comes in, without redoing… Read More

2 days ago

Unlocking the Power of Patterns in Event Stream Processing (ESP): The Critical Role of Apache Flink’s FlinkCEP Library

We call this an event when a button is pressed, a sensor detects a temperature… Read More

2 weeks ago

Real-Time Redefined: Apache Flink and Apache Paimon Influence Data Streaming’s Future

Apache Paimon is made to function well with constantly flowing data, which is typical of… Read More

1 month ago

Revolutionize Stream Processing with the Power of Data Fabric

A data fabric is an innovative system designed to seamlessly integrate and organize data from… Read More

2 months ago

Bridging the Gap: Unlocking the Power of HDFS-Based Data Lakes with Streaming Databases

Big data technologies' quick development has brought attention to the necessity of a smooth transition… Read More

2 months ago

Which Flow Is Best for Your Data Needs: Time Series vs. Streaming Databases

Data is being generated from various sources, including electronic devices, machines, and social media, across… Read More

2 months ago