2  Structured, Semi-structured and Unstructured Data

2.1 What Structure Means

“Structure” in the data sense refers to whether a dataset conforms to a predefined, machine-readable schema. A schema is a specification of the fields a record must contain, the type each field takes, and the relationships between fields. The degree to which a dataset adheres to such a schema determines how easy it is to query, store efficiently, and analyse with off-the-shelf statistical tools.

Note: A working definition

Data is called structured when every record fits a fixed set of columns with declared types, semi-structured when records carry tags or markers that identify fields but different records may carry different fields, and unstructured when the raw content has no machine-enforced schema at all.

The three categories are not watertight. A single business process often generates all three at once: an e-commerce order produces a structured row in the orders table, a semi-structured JSON payload for the shipping API, and unstructured customer-review text after delivery.
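The e-commerce example above can be made concrete. The sketch below (in Python, with illustrative field names and values) shows the same order in all three forms:

```python
import json

# Structured: a fixed set of typed columns, one row in an orders table.
structured_row = {
    "order_id": 1001,
    "customer_id": 42,
    "amount": 499.00,
    "order_date": "2024-03-01",
}

# Semi-structured: a nested JSON payload; fields may vary between orders.
semi_structured = json.loads("""
{
  "order_id": 1001,
  "shipping": {"city": "Bengaluru", "pincode": "560001"},
  "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]
}
""")

# Unstructured: free text; no field boundaries a machine can rely on.
unstructured = "Delivery was quick but the box arrived dented."

print(sorted(structured_row.keys()))
print(semi_structured["items"][0]["sku"])
```

Note that only the first form can be queried without any parsing; the second needs a JSON parser, and the third needs a model.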

2.2 Structured Data

Structured data is the oldest and most heavily standardised form of organisational data. It lives in relational database tables, CSV files, and well-formed spreadsheets.

Note: Identifying features

Every record (row) has the same fields (columns). Each field has a declared type (integer, decimal, date, varchar). Primary keys uniquely identify rows and foreign keys express relationships across tables. The schema is enforced at write time, so an order row without a price is rejected by the database.

Transactional systems such as SBI core banking, Flipkart’s orders database, and the GST filings portal store the overwhelming majority of their records as structured data, and SQL is the near-universal query language for them.
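Write-time schema enforcement can be demonstrated in a few lines. The sketch below uses Python's built-in SQLite driver; the table and column names are illustrative, not drawn from any real system:

```python
import sqlite3

# Schema-on-write: the database declares types and constraints up front.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        price    REAL NOT NULL,
        placed   TEXT NOT NULL
    )
""")

# A complete row is accepted.
conn.execute("INSERT INTO orders VALUES (1, 499.0, '2024-03-01')")

# An order row without a price violates NOT NULL and is rejected at write time.
try:
    conn.execute("INSERT INTO orders (order_id, placed) VALUES (2, '2024-03-02')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

After both statements run, the table holds exactly one row: the malformed record never entered storage, which is the defining property of schema-on-write.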

Tip: Why SEM-ready data is almost always structured

Confirmatory factor analysis, regression, and other latent-variable methods operate on a participant-by-variable matrix. Even when the raw input is a survey form, the analysable object is a structured rectangle with one row per respondent.

2.3 Semi-structured Data

Semi-structured data carries enough markup to make field boundaries machine-readable, but the collection of fields is not fixed across records.

Note: Identifying features

Self-describing: each field is labelled inline (as an XML tag, a JSON key, or a YAML entry). Schema is loose or enforced only at read time. Records are typically nested, so an order can contain a list of line items, each with its own attributes. Exchange formats for APIs and event streams are almost always semi-structured.

The dominant formats are JSON (web APIs, MongoDB), XML (enterprise B2B, UPI message formats used by NPCI), and YAML (configuration). A single Swiggy order moving from app to kitchen to rider is carried as JSON payloads at every step.

Tip: Flattening for analysis

Semi-structured data is rarely analysed in its raw nested form. The first preparation step is usually to flatten or pivot the nested elements into a rectangular frame, for example by unnesting the items array into one row per line item.
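The unnesting step can be sketched with the standard library alone (at scale one would reach for a tool such as pandas' `json_normalize`; the payload below is illustrative):

```python
import json

# A nested order: one header, a list of line items.
payload = json.loads("""
{"order_id": 1001,
 "customer": {"id": 42, "city": "Bengaluru"},
 "items": [{"sku": "A1", "qty": 2, "price": 199.0},
           {"sku": "B7", "qty": 1, "price": 101.0}]}
""")

# Unnest: one output row per line item, with header fields repeated.
rows = [
    {
        "order_id": payload["order_id"],
        "customer_id": payload["customer"]["id"],
        "sku": item["sku"],
        "qty": item["qty"],
        "price": item["price"],
    }
    for item in payload["items"]
]

for r in rows:
    print(r)
```

The result is a rectangular frame with fixed columns, i.e. structured data ready for any tabular method.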

2.4 Unstructured Data

Unstructured data has no pre-declared field layout. The content itself carries the meaning and must be parsed or modelled before any statistical summary is possible.

Note: Identifying features

No schema; the raw bytes are the payload. Dominant examples are free-form text (emails, customer reviews, call-centre transcripts, social-media posts), images, audio, video, PDFs, and scanned documents. Storage is usually file-based (object stores, data lakes) rather than row-based.

A single customer-support interaction at Ola produces a chat log (text), a call recording (audio), possibly a ride-map screenshot (image), and a free-text resolution note. None of these fits a table without additional processing.

Warning: The cost of unstructured data

Unstructured data is expensive to analyse. Before a correlation, mean, or regression can be computed, the raw content must be converted into numerical features (word frequencies, sentiment scores, image embeddings). That conversion stage often takes longer than the modelling that follows.

2.5 The Spectrum of Structure

Structured (RDBMS, CSV): fixed schema, schema-on-write
  → Semi-structured (JSON, XML, YAML): flexible schema, schema-on-read
    → Unstructured (text, images, audio, video): no schema, parse at use

The three categories sit on a continuum of schema rigidity. Moving left to right, storage becomes cheaper and more flexible, but the preparation effort required before analysis grows sharply.

2.6 Storage Systems for Each Type

The choice of storage technology is dictated by the dominant structure of the data it will hold.

Note: Structured → relational databases

MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and cloud equivalents (Amazon RDS, Azure SQL). Optimised for transactional integrity, joins, and aggregation. Query language is SQL. Suited to customer masters, order books, GL ledgers.

Note: Semi-structured → document and key-value stores

MongoDB (document), Cassandra and DynamoDB (wide-column), Redis (key-value), Elasticsearch (search). These tolerate heterogeneous records, scale horizontally, and are the common backbone of modern web and mobile apps.

Note: Unstructured → object stores and data lakes

Amazon S3, Azure Blob Storage, Google Cloud Storage, Hadoop HDFS. These store opaque files of any size. Analysis layers (Apache Spark, Databricks, Snowflake, BigQuery) read directly from these stores and impose a schema only at query time.

2.7 The 80/20 Observation

Industry surveys repeatedly report that roughly 80 percent of enterprise data is unstructured or semi-structured, leaving only 20 percent in well-governed relational tables.

Important: Why the split matters for analytics

Most classical analytics curricula and most R and SAS training examples use the 20 percent that is already structured. The bulk of the data sitting in the organisation (emails, documents, images, sensor streams) demands additional engineering before it can enter a model. Organisations that want to exploit this larger pool invest in text mining, computer vision, and audio analytics pipelines.

Published estimates (IDC, Gartner) vary with industry: financial services is closer to 60/40 because of heavy transactional volume, whereas healthcare and media are closer to 90 percent unstructured due to imaging, notes, and rich media.

2.8 Analytical Implications

The structure type of a dataset shapes the methods that can be applied to it.

Structure       | Typical methods                                                                              | Preparation burden
Structured      | Descriptive statistics, regression, ANOVA, CFA, SEM, classification trees                   | Low: already tabular
Semi-structured | Unnesting, then the same methods as structured; graph analytics when relationships dominate | Medium: flatten, join, type-cast
Unstructured    | Text mining, topic models, sentiment analysis, image classification, speech-to-text         | High: feature extraction required before standard methods apply
Tip: A practical workflow

For a mixed-input project, the typical sequence is: extract structured features from unstructured sources, merge them with the existing structured tables on a common key, and then run the standard statistical pipeline on the combined rectangle.
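The workflow in miniature: below, a sentiment score extracted from review text (a toy word-count heuristic, purely illustrative) is merged back into a structured customer table on a common key before modelling. All names and values are made up for the example:

```python
# Existing structured table.
customers = [
    {"customer_id": 1, "orders": 12},
    {"customer_id": 2, "orders": 3},
]

# Unstructured source, keyed by the same id.
reviews = {
    1: "great service, fast delivery, great prices",
    2: "late delivery and poor packaging",
}

POSITIVE = {"great", "fast", "good"}

def toy_sentiment(text):
    """Fraction of words that appear in a tiny positive-word list."""
    words = text.replace(",", " ").split()
    return sum(w in POSITIVE for w in words) / len(words)

# Merge on customer_id; the combined rectangle feeds the standard pipeline.
combined = [
    {**row, "sentiment": round(toy_sentiment(reviews[row["customer_id"]]), 2)}
    for row in customers
]
print(combined)
```

A production pipeline would replace `toy_sentiment` with a trained model, but the shape of the operation, extract then join on a key, is the same.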

2.9 Convergence: Turning Unstructured into Structured

Almost every analysis ultimately reduces to a structured matrix. Unstructured inputs are therefore converted into structured features before modelling.

NoteText → numbers

A corpus of customer reviews can be summarised as a document-term matrix (one row per review, one column per term, cell = term frequency or TF-IDF). From that matrix every tabular method becomes available. A sentiment model can also collapse each review into a single numeric score that becomes one new column in the customer table.
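A minimal document-term matrix can be built with the standard library alone (real pipelines would use a vectorizer with proper tokenisation, stop-word removal, and TF-IDF weighting; the two reviews here are illustrative):

```python
from collections import Counter

# One document per review.
reviews = [
    "fast delivery fast refund",
    "slow delivery",
]

# Count term frequencies per document.
counts = [Counter(r.split()) for r in reviews]

# The column set is the union of all terms, in a fixed order.
vocab = sorted(set().union(*counts))

# One row per review, one column per term, cell = term frequency.
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)  # ['delivery', 'fast', 'refund', 'slow']
print(dtm)    # [[1, 2, 1, 0], [1, 0, 0, 1]]
```

Each row of `dtm` is now an ordinary numeric record, so any tabular method applies.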

NoteImages and audio → embeddings

A deep-learning model can compress an image or audio clip into a fixed-length numeric vector (an embedding). These vectors are then stored as additional columns and used as predictors in downstream models.

The resulting feature matrix, whether a document-term matrix or a table of embedding columns, is ordinary structured data and can be fed into clustering, regression, or factor models without any further special handling.

2.10 Deciding How to Treat a New Dataset

New dataset arrives.
1. Does every record share the same fixed columns?
   Yes → treat as structured: load into a data frame, use SQL or dplyr.
   No → go to step 2.
2. Are fields labelled inline with tags or keys?
   Yes → treat as semi-structured: parse the JSON or XML, unnest, flatten.
   No → treat as unstructured: run a text, image, or audio feature-extraction pipeline first.
All three routes end at the same place: standard statistical methods on a rectangular frame.

The decision drives the first line of the analysis script: read.csv() for structured inputs, jsonlite::fromJSON() for semi-structured, and a text-mining or image-processing stage for unstructured.
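The decision sequence can be sketched as a rough heuristic in Python; the checks below are illustrative, not a robust format detector, and real data deserves manual inspection:

```python
import csv
import io
import json

def guess_structure(sample: str) -> str:
    """Crudely classify a text sample by the decision sequence above."""
    # Fields labelled inline with keys? Try parsing as JSON.
    try:
        json.loads(sample)
        return "semi-structured"
    except ValueError:
        pass
    # Same fixed columns on every record? Check for consistent CSV widths.
    rows = list(csv.reader(io.StringIO(sample)))
    widths = {len(r) for r in rows if r}
    if len(rows) > 1 and len(widths) == 1 and widths != {1}:
        return "structured"
    # Otherwise: no machine-readable field boundaries.
    return "unstructured"

print(guess_structure('{"order_id": 1001}'))           # semi-structured
print(guess_structure("id,price\n1,499.0\n2,249.0"))   # structured
print(guess_structure("Delivery was quick."))          # unstructured
```

The point is not the heuristic itself but that the classification happens before any analysis code is written.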

2.11 Summary

Summary of data concepts introduced in this chapter

Structure types
  Structured: fixed schema; rectangular rows and columns
  Semi-structured: self-describing with tags or keys; nesting allowed
  Unstructured: no schema; raw text, images, audio, or video

Schema modes
  Schema-on-write: structure enforced at load time by the database
  Schema-on-read: structure imposed at query time (typical of data lakes)

Exchange formats
  CSV: flat rows and columns; structured exchange
  JSON: hierarchical keys and values; semi-structured
  XML: tag-delimited markup; semi-structured

Storage systems
  RDBMS: MySQL, PostgreSQL, Oracle for structured data
  Document store: MongoDB, Couchbase for semi-structured data
  Object store: Amazon S3, Azure Blob, HDFS for any file type

Preparation effort
  Low: structured inputs enter models directly
  Medium: semi-structured inputs need flattening and type casts
  High: unstructured inputs need feature extraction first

Every subsequent chapter in this book works with structured data by the time modelling begins, but the source material may start in any of the three forms. Recognising the starting structure and the amount of preparation needed is the first judgement a data analyst makes on any new project.