Tech Threads

How to Understand a Data Pipeline Easily

A data pipeline can be visualized as the extraction, transformation, and loading of data into a storage area, typically a database or data warehouse system. Data enters one end of this multi-stage process in a particular shape or form and comes out the other end in a different (desired) shape or form. The number of stages depends on the input data. Fewer stages are needed if the input is already clean and requires little transformation. Complex input, such as unstructured data (blogs with images, emails, etc.), increases the number of stages. The stages may connect to one or many data sources, and they may run on a single server or across multiple servers.
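To make the idea of stages concrete, here is a minimal Python sketch of a pipeline that passes records through a list of stages before loading them into storage. The stage names (`extract`, `clean`), the comma-separated input format, and the plain list standing in for a database are all assumptions made for illustration, not part of any particular product.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Stage = Callable[[Iterable], Iterator[Record]]


def extract(raw_lines: Iterable[str]) -> Iterator[Record]:
    """Stage 1 (hypothetical): turn raw 'name,amount' lines into records."""
    for line in raw_lines:
        name, amount = line.split(",")
        yield {"name": name, "amount": amount}


def clean(records: Iterable[Record]) -> Iterator[Record]:
    """Stage 2 (hypothetical): normalize names and convert amounts to numbers."""
    for record in records:
        yield {
            "name": record["name"].strip().title(),
            "amount": float(record["amount"]),
        }


def run_pipeline(source: Iterable, stages: List[Stage], storage: list) -> list:
    """Feed the source through each stage in order, then load the result."""
    data = source
    for stage in stages:
        data = stage(data)
    storage.extend(data)  # the "load" step; a list stands in for a database
    return storage


# Raw input needs both stages before it can be loaded.
warehouse = run_pipeline(["  alice ,10.5", "bob, 3"], [extract, clean], [])

# Input that is already clean needs no transformation stages at all.
ready = [{"name": "Carol", "amount": 7.0}]
warehouse_clean = run_pipeline(ready, [], [])
```

The second call illustrates the point above: the cleaner the input, the shorter the list of stages.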

Nowadays, Hadoop has been widely adopted by many major organizations. We can leverage a Hadoop cluster to build a data pipeline for extraction, loading, and then transformation. The Hadoop platform provides a highly scalable, fault-tolerant infrastructure built on cheap commodity hardware. In fact, if we need to extract specific information from millions of tweets in a Twitter stream, we need a data pipeline, because the data Twitter supplies is unstructured.
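As a small-scale illustration of pulling specific information out of unstructured tweet text, the sketch below counts hashtags across a handful of tweets. It runs on a single machine with plain Python rather than on a Hadoop cluster, and the sample tweets and the `hashtag_counts` helper are made up for this example; a real pipeline would read from a streaming source and distribute the work.

```python
import re
from collections import Counter
from typing import Iterable

# A hashtag is '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtag_counts(tweets: Iterable[str]) -> Counter:
    """Extract hashtags from free-form tweet text and count them, ignoring case."""
    counts: Counter = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Hypothetical sample input standing in for a tweet stream.
sample_tweets = [
    "Loving #Hadoop and #BigData",
    "New #hadoop cluster is up!",
    "no tags here",
]
counts = hashtag_counts(sample_tweets)
```

On a cluster, the same logic maps naturally onto a map step (emit each hashtag) and a reduce step (sum the counts per hashtag).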

Written by
Gautam Goswami
