Tech Threads

Semi-Structured Data

Semi-structured data lies between structured and unstructured data. Data stored in a traditional database system or an Excel sheet is structured data: it is organized in columns and rows. Unstructured data is any data or piece of information that does not fit naturally into the rows and columns of a database/RDBMS. Emails, Facebook comments, and newspaper articles are examples of unstructured data.


Semi-structured data does not follow a strict data model, yet it is neither raw data nor typed data as in a traditional database system. To represent information as semi-structured data, a defined format has to be followed. JSON (JavaScript Object Notation) and XML are commonly used for this, and also for transporting the data over the wire. A format-specific parser is required at the data consumer end to retrieve the desired data from JSON or XML.
JSON is lightweight, more compact than XML, and easily human readable, but traditional relational database systems were not designed to store or query it natively. NoSQL databases such as HBase, MongoDB, and Cassandra, as well as the Hadoop Distributed File System (HDFS), can be leveraged to store, query, and analyze it. In a typical client-server web application, JSON is widely used for bi-directional data interchange.
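The point about parsers and the relative compactness of JSON can be illustrated with a minimal sketch. The record below is hypothetical (it is not taken from any real system), and the XML element names are assumptions chosen to mirror the JSON keys. Only Python's standard library is used.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, serialized in both formats for comparison.
json_text = '{"name": "Example Ltd", "location": "Pune", "employees": 42}'
xml_text = (
    "<company>"
    "<name>Example Ltd</name>"
    "<location>Pune</location>"
    "<employees>42</employees>"
    "</company>"
)

# JSON: the parser returns native dicts, strings, and numbers.
from_json = json.loads(json_text)

# XML: the parser returns an element tree, and every value arrives as text,
# so the numeric field has to be converted by the consumer.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}
from_xml["employees"] = int(from_xml["employees"])

print(from_json == from_xml)            # both parsers recover the same record
print(len(json_text) < len(xml_text))   # the JSON text is the shorter of the two
```

In both cases the consumer cannot read the fields without a parser for the format. For this small record the JSON form is shorter because XML repeats each field name in a closing tag.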
Here is a sample of unstructured data: "Two companies named ABCD and EFGH are located in Bangalore and Chennai respectively. ABCD is a pharmaceutical company and has 150 employees. It is a medical drugs supplier and is associated with HDFC Bank for all business transactions. EFGH manufactures PVC pipes, has 300 employees, and conducts its financial transactions with State Bank Of India." The above information can be transformed into semi-structured data using the JSON format. It can then be persisted in a NoSQL database and transmitted over the wire as a REST service request/response.
    [
        {
            "CompanyName": "ABCD",
            "Description": "pharmaceutical company",
            "Type": "Medical drugs supplier",
            "EmployNo": "150",
            "BusineesBank": "HDFC Bank",
            "Location": "Bangalore"
        },
        {
            "CompanyName": "EFGH",
            "Description": "Manufacturing company",
            "Type": "PVC Pipes",
            "EmployNo": "300",
            "BusineesBank": "State Bank Of India",
            "Location": "Chennai"
        }
    ]
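Once the information is in JSON, a consumer can parse it and query individual fields, which is not possible with the free-text paragraph. The following is a minimal sketch in Python using the two records above, with the keys spelled exactly as in the sample. The helper name find_by_location is hypothetical and introduced only for illustration.

```python
import json

# The two company records from the sample, keys spelled as in the article.
companies_json = """
[
  {"CompanyName": "ABCD", "Description": "pharmaceutical company",
   "Type": "Medical drugs supplier", "EmployNo": "150",
   "BusineesBank": "HDFC Bank", "Location": "Bangalore"},
  {"CompanyName": "EFGH", "Description": "Manufacturing company",
   "Type": "PVC Pipes", "EmployNo": "300",
   "BusineesBank": "State Bank Of India", "Location": "Chennai"}
]
"""

# Parse the text into a list of dictionaries.
companies = json.loads(companies_json)


def find_by_location(records, location):
    """Return the names of the companies located in the given city."""
    return [r["CompanyName"] for r in records if r.get("Location") == location]


# EmployNo is stored as a string in the sample, so convert before summing.
total_employees = sum(int(r["EmployNo"]) for r in companies)

print(find_by_location(companies, "Chennai"))  # ['EFGH']
print(total_employees)                         # 450
```

Note that the sample stores the employee count as a string. Nothing in the format enforces a type, so the consumer has to decide how to interpret each value.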
The Facebook Graph API is one example: it returns semi-structured data in JSON format when a specific node is queried with an HTTP GET request to its REST endpoint.
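A rough sketch of what such a request looks like from the client side is shown below. It only builds the request URL and parses a response body; it makes no network call. The helper name build_node_url, the placeholder token, and the sample response values are all assumptions for illustration, and the unversioned host path is used for brevity.

```python
import json
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"


def build_node_url(node_id, fields, access_token):
    """Build the GET URL for reading selected fields of a single node."""
    query = urlencode({"fields": ",".join(fields), "access_token": access_token})
    return f"{GRAPH_BASE}/{node_id}?{query}"


# "me" refers to the user who owns the access token; the token is a placeholder.
url = build_node_url("me", ["id", "name"], "YOUR_ACCESS_TOKEN")
print(url)

# Illustrative response body with hypothetical values, parsed like any other JSON.
sample_body = '{"id": "1234567890", "name": "Example User"}'
node = json.loads(sample_body)
print(node["name"])
```

Sending the request would additionally need an HTTP client and a valid access token, which are outside the scope of this sketch.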

Written by
Gautam Goswami

Gautam can be reached at gautambangalore@gmail.com for real-time POC development and hands-on technical training, as well as for help with the design and development of any Hadoop/Big Data handling related task. He is an advisor and an educator. Before that, he worked as a Sr. Technical Architect across different technologies and business domains in numerous countries.
He is passionate about sharing knowledge through blogs and training workshops on Big Data related technologies and systems.
