Tech Threads

Establishing a Data Lake for a multi-channel e-commerce application to understand customers' buying patterns

Post-order-fulfillment data is becoming a very important asset for e-commerce vendors who want to understand the complete buying pattern of their customers, especially vendors who sell a wide range of products, from electronics to apparel. Extraction and transformation become time-consuming operations when partially structured data moves in from various sources and finally lands in a relational data warehouse. Data extracted from social media is semi-structured (JSON or XML); as an example, Facebook provides information in JSON format through its Graph API, and the same applies to the Twitter streaming API. Besides social media, there are networks like Bazaarvoice that connect brands and retailers to the authentic voices of the people who shop with them. To accommodate the data extracted from these sources, an additional parsing step has to be carried out to shape the data into tabular (row-column) form before loading it into a data warehouse for analysis.
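As a minimal sketch of that parsing step, the PySpark snippet below reads a dump of tweets collected from the Twitter streaming API and flattens a few fields into a row-column layout that a warehouse table could accept. The HDFS paths and the selected fields (id_str, text, user.screen_name, created_at) are assumptions made for illustration; the exact payload depends on the API and collector used.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-social-json").getOrCreate()

# Hypothetical HDFS location where raw tweets (one JSON object per line) were dumped.
raw_tweets = spark.read.json("hdfs:///datalake/raw/twitter/")

# Keep only the columns needed for buying-pattern analysis and rename them
# into a flat, tabular (row-column) layout.
tweets_tabular = raw_tweets.select(
    F.col("id_str").alias("tweet_id"),
    F.col("user.screen_name").alias("author"),
    F.col("created_at").alias("posted_at"),
    F.col("text").alias("message"),
)

# Persist the flattened rows; from here they could be loaded into a warehouse table.
tweets_tabular.write.mode("overwrite").parquet("hdfs:///datalake/curated/twitter/")
```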


Data generation, and its subsequent propagation from social media, is continuous and can be referred to as streaming data. With traditional data storage systems, we are handicapped in storing that massive volume of data because of disk/storage space scarcity. Even though e-commerce sites sell through multiple channels such as web, mobile and teleshopping, it is straightforward to pull the placed-order data out of the e-commerce database because it is already stored in row-column format. Analyzing customers' order data alone, however, is not sufficient to understand the complete buying pattern.
Nowadays, customers' rating and review data on the product detail page has started to play a deciding role in whether an item is added to the shopping cart and eventually ordered. An online buyer has various options for choosing the right product by accumulating information from social media, any number of blogs, emails and so on. Interestingly, the data in emails and blogs is unstructured, and it is another influencing factor in the customer's choice. Converting all of this data into a common format and analyzing it on top of that is a tough challenge. The key insights obtained from an intricate analysis of the entire data set captured in the data lake help in taking strong strategic business decisions that boost revenue growth. With a traditional RDBMS data warehousing system, the entire process is expensive and time-consuming in terms of ETL, and there is uncertainty about whether the goal can be achieved at all.
Using Hadoop and its ecosystem components, we can establish a Data Lake that ingests and persists data in any format and subsequently processes it to meet our requirements. Industries have started adopting Hadoop for its massively parallel storage and distributed parallel processing. On top of that, Hadoop is an open-source framework that can be customized as needed; companies like MapR, Hortonworks and Cloudera have customized specific areas of the framework and released their own distributions of Hadoop to the market.

By making effective use of the Hadoop Distributed File System (HDFS), we can establish the Data Lake irrespective of data format. In HDFS, data is stored in a distributed manner, and by installing Hadoop (HDFS) on a horizontally scaled cluster we can eliminate the storage scarcity that is a major concern with traditional storage systems and warehouses. If the infrastructure for the cluster setup is a concern, we can leverage cloud computing; for example, Amazon Web Services (AWS) EC2 instances can be used to create multiple cluster nodes with configurable resources.

Next, we need to configure two components, namely Apache Flume and Apache Sqoop, with HDFS. Flume and Sqoop belong to the Hadoop ecosystem and are responsible for ingesting data into HDFS. Using Flume, we can ingest streaming data such as server logs and social media feeds like Twitter into HDFS. With Sqoop, we can transfer data from an RDBMS into HDFS and, similarly, from HDFS back to an RDBMS. With these two components, semi-structured data (mainly from social media) and tabular row-column data from the RDBMS can be ingested into the Data Lake, which is nothing but HDFS.

Finally, for unstructured data such as blogs and emails, we have to develop a data pipeline through which the required data can be ingested into HDFS. As a first step, all the required unstructured data should be dumped into HDFS; since HDFS is simply a distributed file system, there is no constraint on the format of the data it can hold. MapReduce programs or Spark can then be used to convert the dumped unstructured data into the desired form so that it can be blended with the other ingested data inside the lake for analysis and for finding the appropriate insights, as sketched below.
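The following PySpark sketch illustrates that last conversion step under stated assumptions: it reads raw blog/review text dumped into HDFS, extracts product identifiers with a simple regular expression, and writes the result as tabular data so it can be blended with the order data ingested through Sqoop. The paths, the PROD-&lt;number&gt; mention pattern and the orders layout are hypothetical; a real pipeline would use whatever extraction logic (NLP, custom parsers) the data demands.

```python
import re
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("structure-unstructured-text").getOrCreate()

# Hypothetical HDFS folder holding raw, unstructured blog/email text files.
raw_text = spark.sparkContext.wholeTextFiles("hdfs:///datalake/raw/blogs/")

# Assumed convention: products are mentioned in the text as PROD-<number>.
mention_pattern = re.compile(r"PROD-\d+")

def extract_mentions(path_and_body):
    path, body = path_and_body
    return [Row(source_file=path, product_id=m) for m in mention_pattern.findall(body)]

# Turn free text into structured rows (source_file, product_id).
mentions = spark.createDataFrame(raw_text.flatMap(extract_mentions))

# Order data previously ingested from the RDBMS via Sqoop (assumed Parquet layout
# with a product_id column).
orders = spark.read.parquet("hdfs:///datalake/curated/orders/")

# Blend the two sets: how often is each ordered product also being talked about?
blended = (orders.join(mentions, on="product_id", how="left")
                 .groupBy("product_id")
                 .agg({"source_file": "count"})
                 .withColumnRenamed("count(source_file)", "mention_count"))

blended.write.mode("overwrite").parquet("hdfs:///datalake/insights/product_mentions/")
```

The same blending idea extends to the review, rating and social media data already sitting in the lake; once everything is in a tabular form, ordinary joins and aggregations are enough to surface the buying-pattern insights discussed above.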

Written by
Gautam Goswami

He can be reached at gautambangalore@gmail.com for real-time POC development, hands-on technical training, and help with designing and developing any Hadoop/Big Data related task. Gautam is an advisor and an educator as well. Before that, he worked as a Sr. Technical Architect across different technologies and business domains in numerous countries.
He is passionate about sharing knowledge through blogs and training workshops on various Big Data related technologies, frameworks and related areas.

