```mermaid
flowchart BT
    D[Data: raw observations] --> I[Information: data + context]
    I --> K[Knowledge: information + pattern]
    K --> W[Wisdom: knowledge + judgement]
```
1 Basic Concepts of Data
1.1 What Is Data
Data are recorded facts, observations, and measurements about entities in the world. An entity can be a customer, a transaction, a sensor reading, a tweet, a product, a delivery route, or an employee’s response to a survey item. Each measurement is a single fact about a single entity at a single moment. A dataset is an organised collection of such measurements.
Data are symbolic representations of facts, events, or measurements, stored in a form that can be communicated, processed, and analysed. The symbols may be numbers (revenue of ₹4.2 crore), text (a product review), images (a photograph of a defect), timestamps, or categories.
Data are representations. A recorded temperature of 36.7 °C is a representation of a body temperature at a moment, produced by a particular thermometer under particular conditions. Errors of measurement, of recording, of transmission, and of interpretation are always possible. Analytics therefore begins with the recognition that data describe the world imperfectly.
1.2 The DIKW Hierarchy
The Data-Information-Knowledge-Wisdom hierarchy is the most widely cited model of how raw observations become useful in decision-making. Each tier adds context, pattern, or judgement to the tier below.
Data: “1250, 1340, 1180, 1420, 1375.” Numbers without labels or units.
Information: “Daily active users on the Flipkart grocery app in Bengaluru on the first five days of February 2026 were 1250, 1340, 1180, 1420, 1375.” The same numbers gain a unit, a scope, and a time window.
Knowledge: “Bengaluru daily active users grew 10 percent week-on-week during the February promotional campaign, and growth was concentrated in the 25-to-34 age group.” A pattern has been extracted by aggregation and comparison.
Wisdom: “Given that the 25-to-34 segment responds strongly to weekday promotions but not to weekend promotions, the next campaign should concentrate on Tuesday and Wednesday pushes.” A decision has been formed by combining the pattern with business context.
Most analytical value is created in the transitions between tiers: from data to information (cleaning, labelling, contextualising), from information to knowledge (summarising, modelling, testing), and from knowledge to wisdom (interpretation, action, governance). Almost every chapter of this book describes techniques for one of these transitions.
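These transitions can be sketched in code. The following Python fragment is a minimal illustration built on the hypothetical Flipkart figures above; the 10 percent threshold in the decision rule is an invented assumption, not a recommendation:

```python
# Data: raw observations with no labels or units.
daily_counts = [1250, 1340, 1180, 1420, 1375]

# Information: the same numbers with unit, scope, and time window attached.
information = {
    "metric": "daily active users",
    "scope": "Flipkart grocery app, Bengaluru",
    "dates": ["2026-02-01", "2026-02-02", "2026-02-03",
              "2026-02-04", "2026-02-05"],
    "values": daily_counts,
}

# Knowledge: a pattern extracted by aggregation and comparison,
# here the mean level over the window.
mean_dau = sum(information["values"]) / len(information["values"])

# Wisdom: a decision rule that combines the pattern with business
# context (the 10% growth threshold is an illustrative assumption).
def next_campaign_action(week_on_week_growth: float) -> str:
    if week_on_week_growth >= 0.10:
        return "concentrate on weekday promotional pushes"
    return "revisit campaign design before scaling"

print(mean_dau)                       # 1313.0
print(next_campaign_action(0.10))
```

Each step of the sketch corresponds to one transition: labelling turns data into information, aggregation turns information into knowledge, and the decision rule turns knowledge into wisdom.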
1.3 Quantitative and Qualitative Data
The first and most general classification of data distinguishes the numeric from the non-numeric.
Quantitative data are numeric measurements on which arithmetic operations are meaningful. Examples: monthly revenue, customer age, number of transactions, product rating on a 1-to-5 scale treated as interval, latency in milliseconds.
Qualitative data are non-numeric observations that describe qualities, categories, or narratives. Examples: product category, customer feedback text, colour of a delivery bike, interview transcripts, images of store shelves.
Chapter 6 refines the quantitative side into four levels of measurement (nominal, ordinal, interval, ratio), a distinction that governs which statistical operations are valid on a given variable.
An Aadhaar number, a pincode, and a customer ID are stored as digits but they do not support arithmetic. Adding two pincodes is meaningless. Whether data are quantitative is a question of meaning, not of storage format.
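This distinction is easy to demonstrate in code. A minimal sketch, with invented values, showing why digit-string identifiers should be stored as text even though they look numeric:

```python
# Digits-as-identifiers vs digits-as-quantities: a minimal illustration
# with invented values.
record = {
    "customer_id": "CUST-004217",   # identifier: no arithmetic is meaningful
    "pincode": "560001",            # identifier: keep as a string
    "monthly_revenue": 42_00_000,   # quantity in rupees: arithmetic applies
    "transactions": 37,             # quantity: a count
}

# Meaningful: arithmetic on genuine quantities.
revenue_per_transaction = record["monthly_revenue"] / record["transactions"]

# Meaningless: "adding" two pincodes. Storing them as strings makes the
# type system reject the operation instead of silently producing 1120002.
try:
    record["pincode"] + 560001      # str + int raises TypeError
except TypeError:
    print("pincode arithmetic rejected")
```

Storing identifiers as strings is therefore not a technicality: it encodes the fact that they are nominal labels, not quantities.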
1.4 Sources: Primary and Secondary Data
Data enter an analysis from one of two sources.
Primary data are collected first-hand by the analyst or organisation for the specific question at hand: survey responses, laboratory experiments, in-store observation, A/B-test logs collected by the product team, focus-group transcripts.
Secondary data were collected previously by someone else, for a different purpose, and are now re-used: government releases (RBI bulletins, NSSO surveys, GSTN filings), subscription databases (Prowess, Capitaline, Euromonitor), published research, industry reports, Kaggle datasets, open government data on data.gov.in.
| Dimension | Primary | Secondary |
|---|---|---|
| Control over definitions | High | Low |
| Cost of collection | High | Low (often free) |
| Time to obtain | Long | Short |
| Fit to the exact research question | High | Variable |
| Sample size and scope | Limited by budget | Often very large |
| Ownership and permission | Usually owned | Licence-dependent |
Many real studies combine the two. An investigation of branch profitability at a bank might pair primary customer-satisfaction survey data (collected by the bank) with secondary district-level economic indicators (from RBI and the Ministry of Statistics and Programme Implementation).
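Combining the two sources usually means joining on a shared key. A minimal sketch of the bank example, with every figure invented for illustration:

```python
# Primary: branch-level satisfaction scores from the bank's own survey.
primary_survey = {
    "BLR-01": {"district": "Bengaluru Urban", "satisfaction": 4.1},
    "MYS-01": {"district": "Mysuru", "satisfaction": 3.6},
}

# Secondary: district-level indicators re-used from a government source
# (values invented here).
secondary_indicators = {
    "Bengaluru Urban": {"per_capita_income": 623000},
    "Mysuru": {"per_capita_income": 287000},
}

# Combine: attach the secondary indicator to each primary record via the
# shared "district" key -- the analytical equivalent of a database join.
combined = {
    branch: {**row, **secondary_indicators[row["district"]]}
    for branch, row in primary_survey.items()
}

print(combined["BLR-01"])
# {'district': 'Bengaluru Urban', 'satisfaction': 4.1, 'per_capita_income': 623000}
```

The join key (here the district name) is what makes the pairing possible; in practice it must be defined identically in both sources, which is itself a data-quality concern.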
1.5 Internal and External Data
A complementary source distinction, orthogonal to primary and secondary, is whether the data originate inside or outside the organisation that analyses them.
Internal data are generated by the organisation’s own operations: transactional systems (ERP, CRM, POS), web and app logs, payroll, inventory, customer-support tickets. They are typically rich, high-volume, and governed by the organisation’s data policies.
External data originate outside the organisation: macroeconomic indicators, competitor pricing, social-media content, weather, geospatial data, credit-bureau data, third-party consumer panels. They are usually purchased, subscribed to, scraped under terms of service, or accessed via APIs.
1.6 Data as a Strategic Asset
Organisations now treat data on a par with physical plant, capital, and brand. The reasoning has three parts.
First, data accumulate at near-zero marginal cost. Every transaction creates a new record, and the marginal cost of recording it is close to zero. Over years a large retailer accumulates a transaction history that no new entrant can replicate, because the history itself took years to generate.
Second, accumulated data power analytical products. Recommendation engines, credit-risk scorecards, fraud detectors, logistics optimisers, and pricing models are trained on historical data. The accumulated dataset is the fuel; the model is the engine.
Third, data infrastructure enables services that could not otherwise exist. UPI payments, Aadhaar-based e-KYC, GST invoice matching, Ola ride-hail surge pricing, and Amazon next-day delivery all exist because the underlying data infrastructure makes them feasible. The product and the data are inseparable.
If data are an asset they must be inventoried, maintained, secured, and depreciated. The DPDP Act 2023 in India and the GDPR in the European Union impose legal duties on holders of personal data: consent, purpose limitation, data minimisation, breach notification, and in some cases a right to erasure. Chapter 3 develops the organisational implications of this asset view.
1.7 Dimensions of Data Quality
Data quality is multi-dimensional; a dataset can be strong on one axis and weak on another. Seven dimensions are in widespread use.
| Dimension | Question it answers |
|---|---|
| Accuracy | Do the recorded values match reality? |
| Completeness | Are all required values present? |
| Consistency | Do the same facts agree across systems? |
| Timeliness | Are the values current enough for the decision? |
| Validity | Do values conform to the allowed format or range? |
| Uniqueness | Is each real-world entity represented once? |
| Integrity | Do relationships across tables hold (foreign keys, constraints)? |
A dataset good enough for a marketing segmentation exercise may be unsuitable for a credit decision. “Fit for purpose” is the operational test of data quality; the seven dimensions above are the axes along which fitness is assessed.
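Several of these dimensions can be checked mechanically. A minimal sketch auditing completeness, validity, and uniqueness on invented records (the validity rule encodes the six-digit Indian pincode format, which does not begin with zero):

```python
import re

# Invented records with deliberate defects on three dimensions.
records = [
    {"customer_id": "C001", "pincode": "560001", "age": 34},
    {"customer_id": "C002", "pincode": "56001",  "age": None},  # invalid pin, missing age
    {"customer_id": "C001", "pincode": "400001", "age": 51},    # duplicate ID
]

# Indian pincodes: six digits, first digit non-zero.
PIN_RE = re.compile(r"^[1-9][0-9]{5}$")

# Completeness: share of records with the required "age" value present.
completeness = sum(r["age"] is not None for r in records) / len(records)

# Validity: share of pincodes conforming to the allowed format.
validity = sum(bool(PIN_RE.match(r["pincode"])) for r in records) / len(records)

# Uniqueness: share of distinct customer IDs among all IDs.
ids = [r["customer_id"] for r in records]
uniqueness = len(set(ids)) / len(ids)

print(f"completeness={completeness:.2f} "
      f"validity={validity:.2f} uniqueness={uniqueness:.2f}")
# completeness=0.67 validity=0.67 uniqueness=0.67
```

Accuracy, consistency, timeliness, and integrity generally need external references (ground truth, a second system, a decision deadline, a schema), which is why they are harder to audit from a single table alone.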
1.8 Structure of a Dataset
Almost every analytical dataset shares a common shape: rectangular, with one row per observation and one column per variable. The Wickham (2014) “tidy data” principles formalise this shape.
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
A single column holding both city and state (“Mumbai, Maharashtra”) violates the “each variable forms a column” rule. A wide table with one column per month (“Jan”, “Feb”, “Mar”) instead of one “month” column and one “value” column makes aggregation harder. Both problems are fixed by reshaping (Chapter 8).
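Both fixes can be sketched in a few lines. The following Python fragment, with invented values, splits the combined city-state column and melts the wide month columns into a long month-value form:

```python
# Problem 1: one column holding two variables ("Mumbai, Maharashtra").
untidy_location = [{"branch": "B1", "city_state": "Mumbai, Maharashtra"}]
tidy_location = [
    {"branch": r["branch"],
     "city": r["city_state"].split(", ")[0],
     "state": r["city_state"].split(", ")[1]}
    for r in untidy_location
]

# Problem 2: one column per month instead of a "month" column and a
# "value" column. Melt each wide row into one row per observation.
untidy_wide = [{"store": "S1", "Jan": 100, "Feb": 120, "Mar": 90}]
tidy_long = [
    {"store": r["store"], "month": m, "value": r[m]}
    for r in untidy_wide
    for m in ("Jan", "Feb", "Mar")
]

print(tidy_location[0])
# {'branch': 'B1', 'city': 'Mumbai', 'state': 'Maharashtra'}
print(len(tidy_long))   # 3 rows: one per store-month observation
```

In the tidy long form, aggregating across months becomes a simple group-by on the "month" column rather than arithmetic across differently named columns.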
1.9 The Data Lifecycle
Data move through a recognisable sequence of stages from the moment they are generated to the moment they are archived or destroyed. A coherent data strategy treats each stage explicitly.
```mermaid
flowchart LR
    G[Generation] --> C[Collection]
    C --> S[Storage]
    S --> P[Processing]
    P --> A[Analysis]
    A --> U[Use and Dissemination]
    U --> R[Archival or Disposal]
```
- Generation: transactions, sensor readings, form submissions, clicks.
- Collection: ingestion into a system of record, with metadata and provenance attached.
- Storage: databases, data warehouses, data lakes, object stores, with retention policies.
- Processing: cleaning, transformation, integration across sources.
- Analysis: descriptive, exploratory, confirmatory, predictive, and prescriptive work.
- Use and Dissemination: dashboards, reports, models in production, decisions.
- Archival or Disposal: cold storage, legal retention, or secure deletion when the data are no longer needed or when regulation requires erasure.
Under the DPDP Act 2023, personal data in India must be collected for a specified purpose, processed lawfully, retained only as long as necessary, and erased when the purpose is fulfilled unless another legal basis applies. The seven stages above are therefore not only operational; they are the units at which governance, consent, and auditing attach.
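The disposal stage in particular lends itself to automation. A sketch of a retention check, assuming an illustrative 365-day retention period rather than any specific legal requirement:

```python
from datetime import date, timedelta

# Illustrative retention period; real periods are set by policy and law.
RETENTION = timedelta(days=365)

# Invented records, each stamped with the date its purpose was fulfilled.
records = [
    {"id": "r1", "purpose_fulfilled_on": date(2024, 1, 10)},
    {"id": "r2", "purpose_fulfilled_on": date(2025, 6, 1)},
]

def due_for_erasure(record, today):
    """Flag a record once the retention period after purpose fulfilment lapses."""
    return today - record["purpose_fulfilled_on"] > RETENTION

today = date(2025, 7, 1)
to_erase = [r["id"] for r in records if due_for_erasure(r, today)]
print(to_erase)   # ['r1']
```

A production system would also check for competing legal bases (litigation holds, statutory retention) before erasing, which is why disposal is a governance decision and not merely a scheduled job.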
1.10 Summary
| Concept | Description |
|---|---|
| **DIKW Hierarchy** | |
| Data | Raw observations |
| Information | Data placed in context |
| Knowledge | Pattern extracted from information |
| Wisdom | Judgement applied to knowledge |
| **Data Types** | |
| Quantitative | Numeric and arithmetic-amenable |
| Qualitative | Categorical or narrative |
| **Sources** | |
| Primary | Collected first-hand |
| Secondary | Re-used from another source |
| **Location** | |
| Internal | Generated inside the organisation |
| External | Originating outside the organisation |
| **Structure** | |
| Tidy data | One row per observation, one column per variable |