2 Structured, Semi-structured and Unstructured Data
2.1 What Structure Means
“Structure” in the data sense refers to whether a dataset conforms to a predefined, machine-readable schema. A schema is a specification of the fields a record must contain, the type each field takes, and the relationships between fields. The degree to which a dataset adheres to such a schema determines how easy it is to query, store efficiently, and analyse with off-the-shelf statistical tools.
Data is called structured when every record fits a fixed set of columns with declared types, semi-structured when records carry tags or markers that identify fields but different records may carry different fields, and unstructured when the raw content has no machine-enforced schema at all.
The three categories are not watertight. A single business process often generates all three at once: an e-commerce order produces a structured row in the orders table, a semi-structured JSON payload for the shipping API, and unstructured customer-review text after delivery.
2.2 Structured Data
Structured data is the oldest and most heavily standardised form of organisational data. It lives in relational database tables, CSV files, and well-formed spreadsheets.
Every record (row) has the same fields (columns). Each field has a declared type (integer, decimal, date, varchar). Primary keys uniquely identify rows and foreign keys express relationships across tables. The schema is enforced at write time, so an order row without a price is rejected by the database.
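Schema-on-write enforcement can be seen in miniature with SQLite. The table below is a hypothetical example, but the behaviour is standard: the `NOT NULL` constraint plays the role of the price check described above, and a non-conforming row is rejected by the database itself.

```python
import sqlite3

# In-memory database with a declared schema: typed columns plus constraints.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        price    REAL NOT NULL
    )
""")

# A well-formed row is accepted.
conn.execute("INSERT INTO orders VALUES (1, 'Asha', 499.0)")

# A row without a price violates the schema and is rejected at write time.
try:
    conn.execute("INSERT INTO orders (order_id, customer) VALUES (2, 'Ravi')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The rejection happens before the row is stored, which is exactly what "schema enforced at write time" means in practice.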
Transactional systems such as SBI core banking, Flipkart’s orders database, and the GST filings portal store the overwhelming majority of their records as structured data. SQL is the standard query language across all of these systems.
Confirmatory factor analysis, regression, and other latent-variable methods operate on a participant-by-variable matrix. Even when the raw input is a survey form, the analysable object is a structured rectangle with one row per respondent.
2.3 Semi-structured Data
Semi-structured data carries enough markup to make field boundaries machine-readable, but the collection of fields is not fixed across records.
Semi-structured data is self-describing: each field is labelled inline (as an XML tag, a JSON key, or a YAML entry). The schema is loose, or enforced only at read time. Records are typically nested, so an order can contain a list of line items, each with its own attributes. Exchange formats for APIs and event streams are almost always semi-structured.
The dominant formats are JSON (web APIs, MongoDB), XML (enterprise B2B, UPI message formats used by NPCI), and YAML (configuration). A single Swiggy order moving from app to kitchen to rider is carried as JSON payloads at every step.
Semi-structured data is rarely analysed in its raw nested form. The first preparation step is usually to flatten or pivot the nested elements into a rectangular frame, for example by unnesting the items array into one row per line item.
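The unnesting step can be sketched with the standard library alone. The payload below is a hypothetical order in the style described above; the list comprehension turns the nested `items` array into one flat row per line item, repeating the order-level fields on each row.

```python
import json

# A hypothetical food-delivery order payload: nested, self-describing JSON.
payload = json.loads("""
{
  "order_id": "OD123",
  "customer": "Asha",
  "items": [
    {"sku": "dosa", "qty": 2, "price": 90.0},
    {"sku": "chai", "qty": 1, "price": 20.0}
  ]
}
""")

# Flatten: one rectangular row per line item, order fields repeated.
rows = [
    {"order_id": payload["order_id"],
     "customer": payload["customer"],
     **item}
    for item in payload["items"]
]

for row in rows:
    print(row)
```

The result is a rectangle with the same columns in every row, ready for any tabular method.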
2.4 Unstructured Data
Unstructured data has no pre-declared field layout. The content itself carries the meaning and must be parsed or modelled before any statistical summary is possible.
There is no schema: the raw bytes are the payload. Dominant examples are free-form text (emails, customer reviews, call-centre transcripts, social-media posts), images, audio, video, PDFs, and scanned documents. Storage is usually file-based (object stores, data lakes) rather than row-based.
A single customer-support interaction at Ola produces a chat log (text), a call recording (audio), possibly a ride-map screenshot (image), and a free-text resolution note. None of these fits a table without additional processing.
Unstructured data is expensive to analyse. Before a correlation, mean, or regression can be computed, the raw content must be converted into numerical features (word frequencies, sentiment scores, image embeddings). That conversion stage is often longer than the modelling that follows.
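In the very simplest case the conversion step can be a few lines. The sketch below turns free-text reviews into one numeric feature each; the word lists are illustrative toys, not a real sentiment lexicon.

```python
# Toy lexicon: illustrative word lists, not a real sentiment resource.
POSITIVE = {"good", "great", "fast", "excellent"}
NEGATIVE = {"bad", "slow", "poor", "late"}

def sentiment_score(text: str) -> int:
    """Count positive words minus negative words in one review."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "great food and fast delivery",
    "delivery was slow and the food arrived late",
]
scores = [sentiment_score(r) for r in reviews]
print(scores)  # → [2, -2]: one numeric feature per review
```

Real pipelines replace the toy lexicon with trained models, but the shape of the output is the same: unstructured text in, one or more numeric columns out.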
2.5 The Spectrum of Structure
flowchart LR
A[Structured<br/>RDBMS, CSV] --> B[Semi-structured<br/>JSON, XML, YAML]
B --> C[Unstructured<br/>Text, images, audio, video]
A -.-> A1[Fixed schema<br/>Schema-on-write]
B -.-> B1[Flexible schema<br/>Schema-on-read]
C -.-> C1[No schema<br/>Parse at use]
The three categories sit on a continuum of schema rigidity. Moving left to right, storage becomes cheaper and more flexible, but the preparation effort required before analysis grows sharply.
2.6 Storage Systems for Each Type
The choice of storage technology is dictated by the dominant structure of the data it will hold.
For structured data: relational systems such as MySQL, PostgreSQL, Oracle, Microsoft SQL Server, and their cloud equivalents (Amazon RDS, Azure SQL). These are optimised for transactional integrity, joins, and aggregation; the query language is SQL. Suited to customer masters, order books, and GL ledgers.
For semi-structured data: NoSQL stores such as MongoDB (document), Cassandra and DynamoDB (wide-column), Redis (key-value), and Elasticsearch (search). These tolerate heterogeneous records, scale horizontally, and are the common backbone of modern web and mobile apps.
For unstructured data: object stores and distributed file systems such as Amazon S3, Azure Blob Storage, Google Cloud Storage, and Hadoop HDFS. These store opaque files of any size. Analysis layers (Apache Spark, Databricks, Snowflake, BigQuery) read directly from these stores and impose a schema only at query time.
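The schema-on-read pattern can be sketched with plain files: the store keeps opaque lines of text with no enforcement at write time, and a schema (the field names and types chosen here are assumptions for the example) is imposed only when the data is read.

```python
import json

# Raw lines as they might sit in an object store: no schema enforced on write.
raw_lines = [
    '{"sensor": "t1", "reading": "21.5"}',
    '{"sensor": "t2", "reading": "19.0", "unit": "C"}',
]

def read_with_schema(line: str) -> dict:
    """Impose a schema at read time: select the needed fields, cast types."""
    record = json.loads(line)
    return {"sensor": str(record["sensor"]),
            "reading": float(record["reading"])}

records = [read_with_schema(line) for line in raw_lines]
print(records)
```

Note that the second raw line carries an extra `unit` field the first does not; the reader, not the store, decides which fields exist and what types they take.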
2.7 The 80/20 Observation
Industry surveys repeatedly report that roughly 80 percent of enterprise data is unstructured or semi-structured, leaving only 20 percent in well-governed relational tables.
Most classical analytics curricula and most R and SAS training examples use the 20 percent that is already structured. The bulk of the data sitting in the organisation (emails, documents, images, sensor streams) demands additional engineering before it can enter a model. Organisations that want to exploit this larger pool invest in text mining, computer vision, and audio analytics pipelines.
Published estimates (IDC, Gartner) vary with industry: financial services is closer to 60/40 because of heavy transactional volume, whereas healthcare and media are closer to 90 percent unstructured due to imaging, notes, and rich media.
2.8 Analytical Implications
The structure type of a dataset shapes the methods that can be applied to it.
| Structure | Typical methods | Preparation burden |
|---|---|---|
| Structured | Descriptive statistics, regression, ANOVA, CFA, SEM, classification trees | Low: already tabular |
| Semi-structured | Unnesting → same methods as structured; graph analytics when relationships dominate | Medium: flatten, join, type-cast |
| Unstructured | Text mining, topic models, sentiment analysis, image classification, speech-to-text | High: feature extraction required before standard methods apply |
For a mixed-input project, the typical sequence is: extract structured features from unstructured sources, merge them with the existing structured tables on a common key, and then run the standard statistical pipeline on the combined rectangle.
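That merge step can be sketched in a few lines. The customer table and the sentiment scores below are hypothetical; the point is the join on a common key that produces one combined rectangle.

```python
# Hypothetical structured table keyed by customer_id.
customers = [
    {"customer_id": 1, "region": "South", "orders": 12},
    {"customer_id": 2, "region": "North", "orders": 3},
]

# A feature extracted upstream from unstructured review text.
review_sentiment = {1: 0.8, 2: -0.4}

# Merge on the common key to form the combined rectangle.
combined = [
    {**row, "sentiment": review_sentiment.get(row["customer_id"])}
    for row in customers
]
print(combined)
```

From here the combined table feeds the standard statistical pipeline like any other structured input.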
2.9 Convergence: Turning Unstructured into Structured
Almost every analysis ultimately reduces to a structured matrix. Unstructured inputs are therefore converted into structured features before modelling.
A corpus of customer reviews can be summarised as a document-term matrix (one row per review, one column per term, cell = term frequency or TF-IDF). From that matrix every tabular method becomes available. A sentiment model can also collapse each review into a single numeric score that becomes one new column in the customer table.
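A minimal document-term matrix needs only the standard library. The two-review corpus here is a toy, and the cells hold raw term frequencies rather than TF-IDF, but the shape is exactly the one described: one row per review, one column per term.

```python
from collections import Counter

reviews = ["fast delivery fast service", "late delivery"]

# Vocabulary: one column per distinct term across the corpus, sorted for stability.
vocab = sorted({w for r in reviews for w in r.split()})

# Document-term matrix: one row per review, cell = raw term frequency.
dtm = [[Counter(r.split())[term] for term in vocab] for r in reviews]

print(vocab)  # → ['delivery', 'fast', 'late', 'service']
print(dtm)    # → [[1, 2, 0, 1], [1, 0, 1, 0]]
```

Weighting the counts (e.g. TF-IDF) changes the cell values but not the rectangular structure.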
A deep-learning model can compress an image or audio clip into a fixed-length numeric vector (an embedding). These vectors are then stored as additional columns and used as predictors in downstream models.
In either case the resulting feature matrix is ordinary structured data and can be fed into clustering, regression, or factor models without any further special handling.
2.10 Deciding How to Treat a New Dataset
flowchart TD
A[New dataset arrives] --> B{Does every record<br/>share the same<br/>fixed columns?}
B -- Yes --> C[Treat as structured:<br/>load into a data frame,<br/>use SQL or dplyr]
B -- No --> D{Are fields labelled inline<br/>with tags or keys?}
D -- Yes --> E[Treat as semi-structured:<br/>parse JSON or XML,<br/>unnest, flatten]
D -- No --> F[Treat as unstructured:<br/>extract features first<br/>text, image or audio pipeline]
C --> G[Standard statistical methods]
E --> G
F --> G
The decision drives the first line of the analysis script: read.csv() for structured inputs, jsonlite::fromJSON() for semi-structured, and a text-mining or image-processing stage for unstructured.
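As a rough sketch of that dispatch, the file extension can stand in for the decision tree above (a crude proxy; real pipelines inspect the content, and the function and step names here are illustrative).

```python
def first_step(filename: str) -> str:
    """Choose the first processing step from the file extension (illustrative)."""
    structured = (".csv", ".tsv")
    semi_structured = (".json", ".xml", ".yaml")
    if filename.endswith(structured):
        return "load into a data frame"
    if filename.endswith(semi_structured):
        return "parse and flatten"
    return "feature extraction pipeline"

print(first_step("orders.csv"))    # → load into a data frame
print(first_step("payload.json"))  # → parse and flatten
print(first_step("reviews.txt"))   # → feature extraction pipeline
```

The extension check is only a heuristic: a CSV full of free-text comments still needs the unstructured path for those columns.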
2.11 Summary
| Concept | Description |
|---|---|
| Structure Types | |
| Structured | Fixed schema; rectangular rows and columns |
| Semi-structured | Self-describing with tags or keys; nested allowed |
| Unstructured | No schema; raw text, images, audio or video |
| Schema Modes | |
| Schema-on-write | Structure enforced at load time by the database |
| Schema-on-read | Structure imposed at query time (typical of data lakes) |
| Exchange Formats | |
| CSV | Flat rows and columns; structured exchange |
| JSON | Hierarchical keys and values; semi-structured |
| XML | Tag-delimited markup; semi-structured |
| Storage Systems | |
| RDBMS | MySQL, PostgreSQL, Oracle for structured data |
| Document store | MongoDB, Couchbase for semi-structured data |
| Object store | Amazon S3, Azure Blob, HDFS for any file type |
| Preparation Effort | |
| Low | Structured inputs enter models directly |
| Medium | Semi-structured inputs need flattening and type casts |
| High | Unstructured inputs need feature extraction first |
Every subsequent chapter in this book works with structured data by the time modelling begins, but the source material may start in any of the three forms. Recognising the starting structure and the amount of preparation needed is the first judgement a data analyst makes on any new project.