Tech Threads

Establishing a Data Lake for a multi-channel e-commerce application to understand customers' buying patterns

Post-order-fulfillment data is becoming a very important asset for e-commerce vendors who want to understand the complete buying pattern of their customers, especially vendors who sell a wide range of products, from electronics to apparel. Extraction and transformation become time-consuming operations when partially structured data moves in from various sources and finally lands in a relational data warehouse. Data extracted from social media is semi-structured (JSON or XML); as an example, Facebook provides information in JSON format through its Graph API, and the same applies to the Twitter streaming API. Besides social media, there are networks like Bazaarvoice that connect brands and retailers to the authentic voices of the people who shop with them. To accommodate the data extracted from these sources, an additional parsing step has to be carried out to shape the data into tabular (row-column) form before loading it into a data warehouse for analysis.
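As a minimal sketch of that parsing step, the PySpark snippet below reads a dump of tweets collected from the Twitter streaming API and flattens a few fields into a row-column layout that a warehouse table could accept. The HDFS paths and the selected fields (id_str, text, user.screen_name, created_at) are assumptions made for illustration; the exact payload depends on the API and collector used.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-social-json").getOrCreate()

# Hypothetical HDFS location where raw tweets (one JSON object per line) were dumped.
raw_tweets = spark.read.json("hdfs:///datalake/raw/twitter/")

# Keep only the columns needed for buying-pattern analysis and rename them
# into a flat, tabular (row-column) layout.
tweets_tabular = raw_tweets.select(
    F.col("id_str").alias("tweet_id"),
    F.col("user.screen_name").alias("author"),
    F.col("created_at").alias("posted_at"),
    F.col("text").alias("message"),
)

# Persist the flattened rows; from here they could be loaded into a warehouse table.
tweets_tabular.write.mode("overwrite").parquet("hdfs:///datalake/curated/twitter/")
```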


Data generation, and its subsequent propagation from social media, is continuous and can be referred to as streaming data. With traditional data storage systems, we are handicapped in storing that massive volume of data because of disk/storage space scarcity. Even though e-commerce sites sell through multiple channels such as web, mobile and teleshopping, it is straightforward to pull the placed-order data out of the e-commerce database because it is already stored in row-column format. Analyzing customers' order data alone, however, is not sufficient to understand the complete buying pattern.
Nowadays, customers' rating and review data on the product detail page has started to play a deciding role in whether an item is added to the shopping cart and eventually ordered. An online buyer has various options for choosing the right product by accumulating information from social media, any number of blogs, emails and so on. Interestingly, the data in emails and blogs is unstructured, and it is another influencing factor in the customer's choice. Converting all of this data into a common format and analyzing it on top of that is a tough challenge. The key insights obtained from an intricate analysis of the entire data set captured in the data lake help in taking strong strategic business decisions that boost revenue growth. With a traditional RDBMS data warehousing system, the entire process is expensive and time-consuming in terms of ETL, and there is uncertainty about whether the goal can be achieved at all.
Using Hadoop and its ecosystem components, we can establish a Data Lake that ingests and persists data in any format and subsequently processes it to meet our requirements. Industries have started adopting Hadoop for its massively parallel storage and distributed parallel processing. On top of that, Hadoop is an open-source framework that can be customized as needed; companies like MapR, Hortonworks and Cloudera have customized specific areas of the framework and released their own distributions of Hadoop to the market.

By making effective use of the Hadoop Distributed File System (HDFS), we can establish the Data Lake irrespective of data format. In HDFS, data is stored in a distributed manner, and by installing Hadoop (HDFS) on a horizontally scaled cluster we can eliminate the storage scarcity that is a major concern with traditional storage systems and warehouses. If the infrastructure for the cluster setup is a concern, we can leverage cloud computing; for example, Amazon Web Services (AWS) EC2 instances can be used to create multiple cluster nodes with configurable resources.

Next, we need to configure two components, namely Apache Flume and Apache Sqoop, with HDFS. Flume and Sqoop belong to the Hadoop ecosystem and are responsible for ingesting data into HDFS. Using Flume, we can ingest streaming data such as server logs and social media feeds like Twitter into HDFS. With Sqoop, we can transfer data from an RDBMS into HDFS and, similarly, from HDFS back to an RDBMS. With these two components, semi-structured data (mainly from social media) and tabular row-column data from the RDBMS can be ingested into the Data Lake, which is nothing but HDFS.

Finally, for unstructured data such as blogs and emails, we have to develop a data pipeline through which the required data can be ingested into HDFS. As a first step, all the required unstructured data should be dumped into HDFS; since HDFS is simply a distributed file system, there is no constraint on the format of the data it can hold. MapReduce programs or Spark can then be used to convert the dumped unstructured data into the desired form so that it can be blended with the other ingested data inside the lake for analysis and for finding the appropriate insights, as sketched below.
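The following PySpark sketch illustrates that last conversion step under stated assumptions: it reads raw blog/review text dumped into HDFS, extracts product identifiers with a simple regular expression, and writes the result as tabular data so it can be blended with the order data ingested through Sqoop. The paths, the PROD-&lt;number&gt; mention pattern and the orders layout are hypothetical; a real pipeline would use whatever extraction logic (NLP, custom parsers) the data demands.

```python
import re
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("structure-unstructured-text").getOrCreate()

# Hypothetical HDFS folder holding raw, unstructured blog/email text files.
raw_text = spark.sparkContext.wholeTextFiles("hdfs:///datalake/raw/blogs/")

# Assumed convention: products are mentioned in the text as PROD-<number>.
mention_pattern = re.compile(r"PROD-\d+")

def extract_mentions(path_and_body):
    path, body = path_and_body
    return [Row(source_file=path, product_id=m) for m in mention_pattern.findall(body)]

# Turn free text into structured rows (source_file, product_id).
mentions = spark.createDataFrame(raw_text.flatMap(extract_mentions))

# Order data previously ingested from the RDBMS via Sqoop (assumed Parquet layout
# with a product_id column).
orders = spark.read.parquet("hdfs:///datalake/curated/orders/")

# Blend the two sets: how often is each ordered product also being talked about?
blended = (orders.join(mentions, on="product_id", how="left")
                 .groupBy("product_id")
                 .agg({"source_file": "count"})
                 .withColumnRenamed("count(source_file)", "mention_count"))

blended.write.mode("overwrite").parquet("hdfs:///datalake/insights/product_mentions/")
```

The same blending idea extends to the review, rating and social media data already sitting in the lake; once everything is in a tabular form, ordinary joins and aggregations are enough to surface the buying-pattern insights discussed above.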

Written by
Gautam Goswami

He can be reached at gautambangalore@gmail.com for real-time POC development, hands-on technical training, and help with designing and developing any Hadoop/Big Data related task. Gautam is an advisor and an educator as well. Before that, he worked as a Sr. Technical Architect across different technologies and business domains in numerous countries.
He is passionate about sharing knowledge through blogs and training workshops on various Big Data related technologies, frameworks and related areas.

