Tech Threads

How to understand a data pipeline easily

A data pipeline can be visualized as the extraction, transformation and then loading of data into a storage area, referred to as a database system or data warehousing system. Data enters one end of this multi-stage process in a particular shape / form and comes out the other end in a different (desired) shape / form. The number of stages in the pipeline depends on the input data. Fewer stages are needed if the input data is already fairly clean and requires little transformation. If the input is complex, for example unstructured data (blogs with images, emails etc.), the number of stages increases. These stages may connect to one or many sources of data, and may run on a single server or across multiple servers.
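To make the idea of stages concrete, here is a minimal sketch in plain Python. The stage names (`extract`, `clean`, `transform`, `load`), the sample input and the in-memory list standing in for the storage area are all assumptions made for illustration; a real pipeline would read from actual sources and write to a database or data warehouse.

```python
from typing import Callable, Iterable, List

Record = dict
Stage = Callable[[Iterable[Record]], Iterable[Record]]


def extract(rows: Iterable[str]) -> Iterable[Record]:
    # Wrap each raw input line in a record (hypothetical source format).
    for raw in rows:
        yield {"raw": raw}


def clean(records: Iterable[Record]) -> Iterable[Record]:
    # Drop empty lines and trim whitespace; clean input would skip this stage.
    for rec in records:
        text = rec["raw"].strip()
        if text:
            yield {"text": text}


def transform(records: Iterable[Record]) -> Iterable[Record]:
    # Reshape the record into the desired output form.
    for rec in records:
        yield {"text": rec["text"].lower(), "length": len(rec["text"])}


def load(records: Iterable[Record], storage: List[Record]) -> List[Record]:
    # Append to an in-memory list that stands in for the storage area.
    for rec in records:
        storage.append(rec)
    return storage


def run_pipeline(source, stages: List[Stage], storage: List[Record]) -> List[Record]:
    # Chain the stages so the output of one becomes the input of the next.
    data = source
    for stage in stages:
        data = stage(data)
    return load(data, storage)


warehouse = run_pipeline(
    ["  Hello World ", "", "Data Pipeline"],
    [extract, clean, transform],
    [],
)
print(warehouse)
```

Because the stages are passed in as a list, messier input can be handled by adding stages, and already clean input by removing them, without changing `run_pipeline`.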


Nowadays, Hadoop has been widely adopted by many major organizations. We can leverage a Hadoop cluster to build a data pipeline for extraction, loading and then transformation. The Hadoop platform provides a highly scalable and fault-tolerant infrastructure built on cheap commodity hardware. In fact, if we need to extract specific information from millions of tweets in a Twitter stream, we need a data pipeline, because the data supplied by Twitter is unstructured.
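As a rough illustration of extracting specific information from unstructured tweets, the sketch below counts hashtags using a map step and a reduce step. It is plain Python that only mimics the map/reduce style of processing on a single machine; it does not use any Hadoop API, and the sample tweets, the hashtag pattern and the function names are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def map_hashtags(tweet: str):
    # Map step: emit a (hashtag, 1) pair for every hashtag in one tweet.
    return [(tag.lower(), 1) for tag in HASHTAG.findall(tweet)]


def reduce_counts(pairs):
    # Reduce step: sum the counts emitted for each hashtag.
    counts = Counter()
    for tag, n in pairs:
        counts[tag] += n
    return dict(counts)


# Made-up sample input standing in for a tweet stream.
tweets = [
    "Loving the new #Hadoop release! #BigData",
    "no tags here, just text",
    "#bigdata pipelines on commodity hardware",
]

pairs = [pair for tweet in tweets for pair in map_hashtags(tweet)]
counts = reduce_counts(pairs)
print(counts)
```

On a real cluster the map step would run in parallel over partitions of the tweet data and the reduce step would combine the partial results, which is where the scalability of the platform comes in.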

Written by
Gautam Goswami

