Categories: Tech Threads

How checksums smartly manage data integrity in HDFS (Hadoop Distributed File System).

Ensuring data integrity is a basic necessity, indeed the backbone, of any big data processing environment that aims to produce accurate outcomes. The same applies to data movement with traditional data storage systems (RDBMS, document repositories, etc.) performed by various applications: data transportation over the network, device-to-device transfer, ETL processes, and many more. In short, data integrity can be defined as the assurance of the accuracy and consistency of data throughout its entire life cycle.
In a big data processing environment, data at rest is persisted in a distributed manner because of its huge volume, so achieving data integrity on top of it is challenging. The Hadoop Distributed File System (HDFS) was built to store any type of data in a distributed manner in the form of data blocks (it breaks a huge volume of data down into a set of individual blocks) while committing to data integrity. A data block in HDFS can become corrupt for multiple reasons, ranging from faulty I/O operations on a system disk to network failures.

Internally, HDFS smartly utilizes checksums for data integrity. A checksum is a small-sized datum derived from a block of digital data for the purpose of detecting errors. HDFS computes checksums for the data in each block and stores them in a separate hidden file in the same HDFS namespace. The default checksum algorithm is the 32-bit Cyclic Redundancy Check (CRC-32), because each checksum is only 4 bytes long and adds less than 1% storage overhead. HDFS likewise verifies the checksums while reading data from the data nodes; if there is a discrepancy in a checksum value, a ChecksumException is thrown to the client during retrieval of the data for processing. By default, a checksum is computed for every 512 bytes of data (the dfs.bytes-per-checksum setting).
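
To make the chunk-level idea concrete, here is a minimal, self-contained sketch (not HDFS's internal implementation, which uses its own DataChecksum utilities) that computes one CRC-32 value per 512-byte chunk of a buffer, mirroring the default dfs.bytes-per-checksum value:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkChecksumDemo {

    // Illustration only: HDFS uses its own checksum classes, but the
    // principle is the same -- one small checksum per fixed-size chunk.
    static final int BYTES_PER_CHECKSUM = 512; // mirrors dfs.bytes-per-checksum default

    static List<Long> checksumChunks(byte[] data) {
        List<Long> checksums = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += BYTES_PER_CHECKSUM) {
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - offset);
            CRC32 crc = new CRC32();
            crc.update(data, offset, len);
            checksums.add(crc.getValue()); // 4-byte CRC stored for each 512-byte chunk
        }
        return checksums;
    }

    public static void main(String[] args) {
        byte[] block = new byte[2048];            // stand-in for (part of) an HDFS block
        System.out.println(checksumChunks(block)); // 4 checksums for 4 chunks
    }
}
```

Verification on read is simply the reverse: recompute the CRC over each chunk and compare it with the stored value, raising an error on any mismatch.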

Once data blocks are received by the data nodes in a multi-node cluster, they too compute and store checksums before writing the data to disk, and these stored checksums are compared with freshly computed ones when a client reads the data. Besides this, every data node maintains a persistent log of checksum verifications to keep track of when each data block was last verified. Checksum verification can also be disabled programmatically on the client side when submitting a job to the cluster. Starting with Apache Hadoop 3.1, the checksum of a file stored in HDFS can be compared with that of a locally stored file; see https://issues.apache.org/jira/browse/HDFS-13056.
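
As a client-side sketch, assuming a reachable HDFS and a hypothetical file path, the snippet below disables checksum verification for reads via FileSystem.setVerifyChecksum() and then fetches the file-level checksum exposed by FileSystem.getFileChecksum():

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt"); // hypothetical path

        // Skip client-side checksum verification for subsequent reads.
        fs.setVerifyChecksum(false);
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int read = in.read(buffer); // read without verifying checksums
            System.out.println("Read " + read + " bytes without checksum verification");
        }

        // File-level checksum; with dfs.checksum.combine.mode=COMPOSITE_CRC
        // (introduced by HDFS-13056) it can be compared with a local copy.
        FileChecksum checksum = fs.getFileChecksum(file);
        System.out.println(checksum.getAlgorithmName() + " : " + checksum);
    }
}
```

Skipping verification trades safety for a small read-time speedup, so it is normally reserved for cases where corruption is handled elsewhere.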

Written by
Gautam Goswami

