2  Structured, Semi-structured and Unstructured Data

2.1 What Structure Means

“Structure” in the data sense refers to whether a dataset conforms to a predefined, machine-readable schema. A schema is a specification of the fields a record must contain, the type each field takes, and the relationships between fields. The degree to which a dataset adheres to such a schema determines how easy it is to query, store efficiently, and analyse with off-the-shelf statistical tools.

Note: A working definition

Data is called structured when every record fits a fixed set of columns with declared types, semi-structured when records carry tags or markers that identify fields but different records may carry different fields, and unstructured when the raw content has no machine-enforced schema at all.

The three categories are not watertight. A single business process often generates all three at once: an e-commerce order produces a structured row in the orders table, a semi-structured JSON payload for the shipping API, and unstructured customer-review text after delivery.
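The e-commerce example above can be made concrete. The sketch below (in Python, with illustrative field names and values) shows the same order in all three forms:

```python
import json

# Structured: a fixed set of typed columns, one row in an orders table.
structured_row = {
    "order_id": 1001,
    "customer_id": 42,
    "amount": 499.00,
    "order_date": "2024-03-01",
}

# Semi-structured: a nested JSON payload; fields may vary between orders.
semi_structured = json.loads("""
{
  "order_id": 1001,
  "shipping": {"city": "Bengaluru", "pincode": "560001"},
  "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]
}
""")

# Unstructured: free text; no field boundaries a machine can rely on.
unstructured = "Delivery was quick but the box arrived dented."

print(sorted(structured_row.keys()))
print(semi_structured["items"][0]["sku"])
```

Note that only the first form can be queried without any parsing; the second needs a JSON parser, and the third needs a model.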

2.2 Structured Data

Structured data is the oldest and most heavily standardised form of organisational data. It lives in relational database tables, CSV files, and well-formed spreadsheets.

Note: Identifying features

Every record (row) has the same fields (columns). Each field has a declared type (integer, decimal, date, varchar). Primary keys uniquely identify rows and foreign keys express relationships across tables. The schema is enforced at write time, so an order row without a price is rejected by the database.

Transactional systems such as SBI core banking, Flipkart’s orders database, and the GST filings portal store the overwhelming majority of their records as structured data, and SQL is the near-universal query language for them.
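Write-time schema enforcement can be demonstrated in a few lines. The sketch below uses Python's built-in SQLite driver; the table and column names are illustrative, not drawn from any real system:

```python
import sqlite3

# Schema-on-write: the database declares types and constraints up front.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        price    REAL NOT NULL,
        placed   TEXT NOT NULL
    )
""")

# A complete row is accepted.
conn.execute("INSERT INTO orders VALUES (1, 499.0, '2024-03-01')")

# An order row without a price violates NOT NULL and is rejected at write time.
try:
    conn.execute("INSERT INTO orders (order_id, placed) VALUES (2, '2024-03-02')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

After both statements run, the table holds exactly one row: the malformed record never entered storage, which is the defining property of schema-on-write.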

Tip: Why SEM-ready data is almost always structured

Confirmatory factor analysis, regression, and other latent-variable methods operate on a participant-by-variable matrix. Even when the raw input is a survey form, the analysable object is a structured rectangle with one row per respondent.

2.3 Semi-structured Data

Semi-structured data carries enough markup to make field boundaries machine-readable, but the collection of fields is not fixed across records.

Note: Identifying features

Self-describing: each field is labelled inline (as an XML tag, a JSON key, or a YAML entry). Schema is loose or enforced only at read time. Records are typically nested, so an order can contain a list of line items, each with its own attributes. Exchange formats for APIs and event streams are almost always semi-structured.

The dominant formats are JSON (web APIs, MongoDB), XML (enterprise B2B, UPI message formats used by NPCI), and YAML (configuration). A single Swiggy order moving from app to kitchen to rider is carried as JSON payloads at every step.

Tip: Flattening for analysis

Semi-structured data is rarely analysed in its raw nested form. The first preparation step is usually to flatten or pivot the nested elements into a rectangular frame, for example by unnesting the items array into one row per line item.
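The unnesting step can be sketched with the standard library alone (at scale one would reach for a tool such as pandas' `json_normalize`; the payload below is illustrative):

```python
import json

# A nested order: one header, a list of line items.
payload = json.loads("""
{"order_id": 1001,
 "customer": {"id": 42, "city": "Bengaluru"},
 "items": [{"sku": "A1", "qty": 2, "price": 199.0},
           {"sku": "B7", "qty": 1, "price": 101.0}]}
""")

# Unnest: one output row per line item, with header fields repeated.
rows = [
    {
        "order_id": payload["order_id"],
        "customer_id": payload["customer"]["id"],
        "sku": item["sku"],
        "qty": item["qty"],
        "price": item["price"],
    }
    for item in payload["items"]
]

for r in rows:
    print(r)
```

The result is a rectangular frame with fixed columns, i.e. structured data ready for any tabular method.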

2.4 Unstructured Data

Unstructured data has no pre-declared field layout. The content itself carries the meaning and must be parsed or modelled before any statistical summary is possible.

Note: Identifying features

No schema; the raw bytes are the payload. Dominant examples are free-form text (emails, customer reviews, call-centre transcripts, social-media posts), images, audio, video, PDFs, and scanned documents. Storage is usually file-based (object stores, data lakes) rather than row-based.

A single customer-support interaction at Ola produces a chat log (text), a call recording (audio), possibly a ride-map screenshot (image), and a free-text resolution note. None of these fits a table without additional processing.

Warning: The cost of unstructured data

Unstructured data is expensive to analyse. Before a correlation, mean, or regression can be computed, the raw content must be converted into numerical features (word frequencies, sentiment scores, image embeddings). That conversion stage often takes longer than the modelling that follows.

2.5 The Spectrum of Structure

Structured (RDBMS, CSV): fixed schema, schema-on-write
  → Semi-structured (JSON, XML, YAML): flexible schema, schema-on-read
    → Unstructured (text, images, audio, video): no schema, parse at use

The three categories sit on a continuum of schema rigidity. Moving left to right, storage becomes cheaper and more flexible, but the preparation effort required before analysis grows sharply.

2.6 Storage Systems for Each Type

The choice of storage technology is dictated by the dominant structure of the data it will hold.

Note: Structured → relational databases

MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and cloud equivalents (Amazon RDS, Azure SQL). Optimised for transactional integrity, joins, and aggregation. Query language is SQL. Suited to customer masters, order books, GL ledgers.

Note: Semi-structured → document and key-value stores

MongoDB (document), Cassandra and DynamoDB (wide-column), Redis (key-value), Elasticsearch (search). These tolerate heterogeneous records, scale horizontally, and are the common backbone of modern web and mobile apps.

Note: Unstructured → object stores and data lakes

Amazon S3, Azure Blob Storage, Google Cloud Storage, Hadoop HDFS. These store opaque files of any size. Analysis layers (Apache Spark, Databricks, Snowflake, BigQuery) read directly from these stores and impose a schema only at query time.

2.7 The 80/20 Observation

Industry surveys repeatedly report that roughly 80 percent of enterprise data is unstructured or semi-structured, leaving only 20 percent in well-governed relational tables.

Important: Why the split matters for analytics

Most classical analytics curricula and most R and SAS training examples use the 20 percent that is already structured. The bulk of the data sitting in the organisation (emails, documents, images, sensor streams) demands additional engineering before it can enter a model. Organisations that want to exploit this larger pool invest in text mining, computer vision, and audio analytics pipelines.

Published estimates (IDC, Gartner) vary with industry: financial services is closer to 60/40 because of heavy transactional volume, whereas healthcare and media are closer to 90 percent unstructured due to imaging, notes, and rich media.

2.8 Analytical Implications

The structure type of a dataset shapes the methods that can be applied to it.

Structure       | Typical methods                                                                              | Preparation burden
Structured      | Descriptive statistics, regression, ANOVA, CFA, SEM, classification trees                   | Low: already tabular
Semi-structured | Unnesting, then the same methods as structured; graph analytics when relationships dominate | Medium: flatten, join, type-cast
Unstructured    | Text mining, topic models, sentiment analysis, image classification, speech-to-text         | High: feature extraction required before standard methods apply
Tip: A practical workflow

For a mixed-input project, the typical sequence is: extract structured features from unstructured sources, merge them with the existing structured tables on a common key, and then run the standard statistical pipeline on the combined rectangle.
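The workflow in miniature: below, a sentiment score extracted from review text (a toy word-count heuristic, purely illustrative) is merged back into a structured customer table on a common key before modelling. All names and values are made up for the example:

```python
# Existing structured table.
customers = [
    {"customer_id": 1, "orders": 12},
    {"customer_id": 2, "orders": 3},
]

# Unstructured source, keyed by the same id.
reviews = {
    1: "great service, fast delivery, great prices",
    2: "late delivery and poor packaging",
}

POSITIVE = {"great", "fast", "good"}

def toy_sentiment(text):
    """Fraction of words that appear in a tiny positive-word list."""
    words = text.replace(",", " ").split()
    return sum(w in POSITIVE for w in words) / len(words)

# Merge on customer_id; the combined rectangle feeds the standard pipeline.
combined = [
    {**row, "sentiment": round(toy_sentiment(reviews[row["customer_id"]]), 2)}
    for row in customers
]
print(combined)
```

A production pipeline would replace `toy_sentiment` with a trained model, but the shape of the operation, extract then join on a key, is the same.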

2.9 Convergence: Turning Unstructured into Structured

Almost every analysis ultimately reduces to a structured matrix. Unstructured inputs are therefore converted into structured features before modelling.

NoteText → numbers

A corpus of customer reviews can be summarised as a document-term matrix (one row per review, one column per term, cell = term frequency or TF-IDF). From that matrix every tabular method becomes available. A sentiment model can also collapse each review into a single numeric score that becomes one new column in the customer table.
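A minimal document-term matrix can be built with the standard library alone (real pipelines would use a vectorizer with proper tokenisation, stop-word removal, and TF-IDF weighting; the two reviews here are illustrative):

```python
from collections import Counter

# One document per review.
reviews = [
    "fast delivery fast refund",
    "slow delivery",
]

# Count term frequencies per document.
counts = [Counter(r.split()) for r in reviews]

# The column set is the union of all terms, in a fixed order.
vocab = sorted(set().union(*counts))

# One row per review, one column per term, cell = term frequency.
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)  # ['delivery', 'fast', 'refund', 'slow']
print(dtm)    # [[1, 2, 1, 0], [1, 0, 0, 1]]
```

Each row of `dtm` is now an ordinary numeric record, so any tabular method applies.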

NoteImages and audio → embeddings

A deep-learning model can compress an image or audio clip into a fixed-length numeric vector (an embedding). These vectors are then stored as additional columns and used as predictors in downstream models.

The resulting feature matrix, whether a document-term matrix or a table of embedding columns, is ordinary structured data and can be fed into clustering, regression, or factor models without any further special handling.

2.10 Deciding How to Treat a New Dataset

New dataset arrives.
1. Does every record share the same fixed columns?
   Yes → treat as structured: load into a data frame, use SQL or dplyr.
   No → go to step 2.
2. Are fields labelled inline with tags or keys?
   Yes → treat as semi-structured: parse the JSON or XML, unnest, flatten.
   No → treat as unstructured: run a text, image, or audio feature-extraction pipeline first.
All three routes end at the same place: standard statistical methods on a rectangular frame.

The decision drives the first line of the analysis script: read.csv() for structured inputs, jsonlite::fromJSON() for semi-structured, and a text-mining or image-processing stage for unstructured.
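The decision sequence can be sketched as a rough heuristic in Python; the checks below are illustrative, not a robust format detector, and real data deserves manual inspection:

```python
import csv
import io
import json

def guess_structure(sample: str) -> str:
    """Crudely classify a text sample by the decision sequence above."""
    # Fields labelled inline with keys? Try parsing as JSON.
    try:
        json.loads(sample)
        return "semi-structured"
    except ValueError:
        pass
    # Same fixed columns on every record? Check for consistent CSV widths.
    rows = list(csv.reader(io.StringIO(sample)))
    widths = {len(r) for r in rows if r}
    if len(rows) > 1 and len(widths) == 1 and widths != {1}:
        return "structured"
    # Otherwise: no machine-readable field boundaries.
    return "unstructured"

print(guess_structure('{"order_id": 1001}'))           # semi-structured
print(guess_structure("id,price\n1,499.0\n2,249.0"))   # structured
print(guess_structure("Delivery was quick."))          # unstructured
```

The point is not the heuristic itself but that the classification happens before any analysis code is written.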

2.11 Summary

Summary of data concepts introduced in this chapter

Structure types
  Structured: fixed schema; rectangular rows and columns
  Semi-structured: self-describing with tags or keys; nesting allowed
  Unstructured: no schema; raw text, images, audio, or video

Schema modes
  Schema-on-write: structure enforced at load time by the database
  Schema-on-read: structure imposed at query time (typical of data lakes)

Exchange formats
  CSV: flat rows and columns; structured exchange
  JSON: hierarchical keys and values; semi-structured
  XML: tag-delimited markup; semi-structured

Storage systems
  RDBMS: MySQL, PostgreSQL, Oracle for structured data
  Document store: MongoDB, Couchbase for semi-structured data
  Object store: Amazon S3, Azure Blob, HDFS for any file type

Preparation effort
  Low: structured inputs enter models directly
  Medium: semi-structured inputs need flattening and type casts
  High: unstructured inputs need feature extraction first

Every subsequent chapter in this book works with structured data by the time modelling begins, but the source material may start in any of the three forms. Recognising the starting structure and the amount of preparation needed is the first judgement a data analyst makes on any new project.