1  Basic Concepts of Data

1.1 What Is Data

Data are recorded facts, observations, and measurements about entities in the world. An entity can be a customer, a transaction, a sensor reading, a tweet, a product, a delivery route, or an employee’s response to a survey item. Each measurement is a single fact about a single entity at a single moment. A dataset is an organised collection of such measurements.

Note: A working definition

Data are symbolic representations of facts, events, or measurements, stored in a form that can be communicated, processed, and analysed. The symbols may be numbers (revenue of ₹4.2 crore), text (a product review), images (a photograph of a defect), timestamps, or categories.

Important: Data are not the same as truth

Data are representations. A recorded temperature of 36.7 °C is a representation of a body temperature at a moment, produced by a particular thermometer under particular conditions. Errors of measurement, of recording, of transmission, and of interpretation are always possible. Analytics therefore begins with the recognition that data describe the world imperfectly.

1.2 The DIKW Hierarchy

The Data-Information-Knowledge-Wisdom (DIKW) hierarchy is one of the most widely cited models of how raw observations become useful in decision-making. Each tier adds context, pattern, or judgement to the tier below.

flowchart BT
  D[Data: raw observations] --> I[Information: data + context]
  I --> K[Knowledge: information + pattern]
  K --> W[Wisdom: knowledge + judgement]

Note: Each tier in concrete terms

Data: “1250, 1340, 1180, 1420, 1375.” Numbers without labels or units.

Information: “Daily active users on the Flipkart grocery app in Bengaluru on the first five days of February 2026 were 1250, 1340, 1180, 1420, 1375.” The same numbers gain a unit, a scope, and a time window.

Knowledge: “Bengaluru daily active users grew 10 percent week-on-week during the February promotional campaign, and growth was concentrated in the 25-to-34 age group.” A pattern has been extracted by aggregation and comparison.

Wisdom: “Given that the 25-to-34 segment responds strongly to weekday promotions but not to weekend promotions, the next campaign should concentrate on Tuesday and Wednesday pushes.” A decision has been formed by combining the pattern with business context.

Tip: Why the hierarchy matters for analytics

Most analytical value is created in the transitions between tiers: from data to information (cleaning, labelling, contextualising), from information to knowledge (summarising, modelling, testing), and from knowledge to wisdom (interpretation, action, governance). Almost every chapter of this book describes techniques for one of these transitions.
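The first two transitions in the concrete example above can be sketched in a few lines of Python. This is only an illustration: the `Series` class and its field names are hypothetical, and the first-to-last change computed here is a simple comparison, not the week-on-week growth figure quoted in the example.

```python
from dataclasses import dataclass
from statistics import mean

# Data: bare numbers with no labels or units.
raw = [1250, 1340, 1180, 1420, 1375]


@dataclass
class Series:
    """Information: the same numbers plus a metric, a scope, and a time window.

    Hypothetical container, introduced only for this sketch.
    """
    metric: str
    scope: str
    dates: list
    values: list


dau = Series(
    metric="daily active users",
    scope="grocery app, Bengaluru",
    dates=[f"2026-02-0{d}" for d in range(1, 6)],
    values=raw,
)

# A small step toward knowledge: summarise and compare.
average = mean(dau.values)
change_pct = (dau.values[-1] - dau.values[0]) / dau.values[0] * 100
peak_day = dau.dates[dau.values.index(max(dau.values))]

print(f"{dau.metric} ({dau.scope}): mean {average:.0f}, "
      f"first-to-last change {change_pct:.1f}%, peak on {peak_day}")
```

The step from knowledge to wisdom is deliberately absent: choosing what to do about the pattern depends on business context that no summary statistic supplies.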

1.3 Quantitative and Qualitative Data

The first and most general classification of data distinguishes the numeric from the non-numeric.

Note: Quantitative data

Numeric measurements on which arithmetic operations are meaningful. Examples: monthly revenue, customer age, number of transactions, product rating on a 1-to-5 scale treated as interval, latency in milliseconds.

Note: Qualitative data

Non-numeric observations that describe qualities, categories, or narratives. Examples: product category, customer feedback text, colour of a delivery bike, interview transcripts, images of store shelves.

Chapter 6 refines this two-way split into four levels of measurement (nominal and ordinal on the qualitative side, interval and ratio on the quantitative side), a distinction that governs which statistical operations are valid on a given variable.

Warning: Numeric does not mean quantitative

An Aadhaar number, a pincode, and a customer ID are stored as digits but they do not support arithmetic. Adding two pincodes is meaningless. Whether data are quantitative is a question of meaning, not of storage format.
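A short Python sketch makes the warning concrete. The customer IDs below are invented for illustration, and storing identifiers as strings is one common convention rather than the only option.

```python
from statistics import mean

# Example pincodes and hypothetical customer IDs, stored as text on purpose.
pincodes = ["560001", "400001", "110001"]
customer_ids = ["007421", "013388"]

# Arithmetic runs if the codes are cast to numbers, but the result is not
# a pincode and describes no place.
meaningless = mean(int(p) for p in pincodes)

# Casting can also destroy information: leading zeros vanish.
round_trip = str(int(customer_ids[0]))

# Operations that are meaningful on identifiers: equality, counting, grouping.
by_prefix = {}
for p in pincodes:
    by_prefix.setdefault(p[0], []).append(p)

print(meaningless, round_trip, by_prefix)
```

Keeping such fields as text lets the type system enforce the point of the warning: an accidental sum or average fails immediately instead of producing a plausible-looking number.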

1.4 Sources: Primary and Secondary Data

Data enter an analysis from one of two sources.

Note: Primary data

Collected first-hand by the analyst or organisation for the specific question at hand. Examples: survey responses, laboratory experiments, in-store observation, A/B-test logs collected by the product team, focus-group transcripts.

Note: Secondary data

Collected previously by someone else, for a different purpose, and now re-used. Examples: government releases (RBI bulletins, NSSO surveys, GSTN filings), subscription databases (Prowess, Capitaline, Euromonitor), published research, industry reports, Kaggle datasets, open government data on data.gov.in.

| Dimension | Primary | Secondary |
|---|---|---|
| Control over definitions | High | Low |
| Cost of collection | High | Low (often free) |
| Time to obtain | Long | Short |
| Fit to the exact research question | High | Variable |
| Sample size and scope | Limited by budget | Often very large |
| Ownership and permission | Usually owned | Licence-dependent |

Tip: A mixed-source pattern

Many real studies combine the two. An investigation of branch profitability at a bank might pair primary customer-satisfaction survey data (collected by the bank) with secondary district-level economic indicators (from RBI and the Ministry of Statistics and Programme Implementation).
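In code, the mixed-source pattern is a join on a shared key. The sketch below assumes the key is the district name. Every branch code, score, and indicator value is hypothetical, and `enrich` is an illustrative helper rather than a library function.

```python
# Primary data (hypothetical): mean satisfaction per branch from the bank's own survey.
branch_survey = [
    {"branch": "B01", "district": "Pune", "satisfaction": 4.2},
    {"branch": "B02", "district": "Nashik", "satisfaction": 3.6},
    {"branch": "B03", "district": "Latur", "satisfaction": 3.9},
]

# Secondary data (hypothetical values): a district-level indicator published elsewhere.
district_indicator = {"Pune": 112.0, "Nashik": 97.5}


def enrich(rows, indicator):
    """Left join: keep every branch and attach the indicator, or None if absent."""
    return [{**row, "indicator": indicator.get(row["district"])} for row in rows]


combined = enrich(branch_survey, district_indicator)

# Districts missing from the secondary source surface as gaps, not as dropped rows.
unmatched = [row["branch"] for row in combined if row["indicator"] is None]
print(unmatched)
```

A left join is used so that gaps in the secondary source stay visible. This reflects the "Control over definitions: Low" row of the table: the analyst cannot dictate which districts the external publisher covers.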

1.5 Internal and External Data

A complementary source distinction, orthogonal to primary and secondary, is whether the data originate inside or outside the organisation that analyses them.

Note: Internal data

Generated by the organisation’s own operations: transactional systems (ERP, CRM, POS), web and app logs, payroll, inventory, customer-support tickets. Typically rich, high-volume, and governed by the organisation’s data policies.

Note: External data

Originates outside the organisation: macroeconomic indicators, competitor pricing, social-media content, weather, geospatial data, credit-bureau data, third-party consumer panels. Usually purchased, subscribed to, scraped under terms of service, or accessed via APIs.

1.6 Data as a Strategic Asset

Organisations now treat data on a par with physical plant, capital, and brand. The reasoning has three parts.

Note: 1. Data compound

Every transaction creates a new record, and the marginal cost of recording it is close to zero. Over years a large retailer accumulates a transaction history that no new entrant can replicate, because the history itself took years to generate.

Note: 2. Data become an input to models

Recommendation engines, credit-risk scorecards, fraud detectors, logistics optimisers, and pricing models are trained on historical data. The accumulated dataset is the fuel; the model is the engine.

Note: 3. Data enable new products

UPI payments, Aadhaar-based e-KYC, GST invoice matching, Ola ride-hail surge pricing, and Amazon next-day delivery all exist because the underlying data infrastructure makes them feasible. The product and the data are inseparable.

Important: The asset view comes with obligations

If data are an asset, they must be inventoried, maintained, secured, and depreciated. The DPDP Act 2023 in India and the GDPR in the European Union impose legal duties on holders of personal data: consent, purpose limitation, data minimisation, breach notification, and in some cases a right to erasure. Chapter 3 develops the organisational implications of this asset view.

1.7 Dimensions of Data Quality

Data quality is multi-dimensional; a dataset can be strong on one axis and weak on another. Seven dimensions are in widespread use.

| Dimension | Question it answers |
|---|---|
| Accuracy | Do the recorded values match reality? |
| Completeness | Are all required values present? |
| Consistency | Do the same facts agree across systems? |
| Timeliness | Are the values current enough for the decision? |
| Validity | Do values conform to the allowed format or range? |
| Uniqueness | Is each real-world entity represented once? |
| Integrity | Do relationships across tables hold (foreign keys, constraints)? |

Tip: Quality is measured against purpose

A dataset good enough for a marketing segmentation exercise may be unsuitable for a credit decision. “Fit for purpose” is the operational test of data quality; the seven dimensions above are the axes along which fitness is assessed.
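Three of the seven dimensions (completeness, validity, uniqueness) can be checked mechanically, as the minimal sketch below suggests. The records, the 0-to-120 age range, and the six-digit pincode rule are assumptions made for illustration. Accuracy, consistency, timeliness, and integrity need a reference source, a second system, a clock, or related tables, and are not attempted here.

```python
import re
from collections import Counter

# Hypothetical customer records with deliberate defects.
records = [
    {"customer_id": "C001", "age": 34, "pincode": "560001"},
    {"customer_id": "C002", "age": None, "pincode": "400001"},
    {"customer_id": "C003", "age": 212, "pincode": "4000"},
    {"customer_id": "C001", "age": 34, "pincode": "560001"},
]


def completeness(rows, field):
    """Share of rows in which the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def validity(rows, field, is_valid):
    """Share of present values that satisfy the rule is_valid."""
    present = [r[field] for r in rows if r[field] is not None]
    return sum(is_valid(v) for v in present) / len(present)


def duplicates(rows, key):
    """Key values that occur more than once (a uniqueness violation)."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


report = {
    "age_completeness": completeness(records, "age"),
    "age_validity": validity(records, "age", lambda a: 0 <= a <= 120),
    "pincode_validity": validity(
        records, "pincode", lambda p: re.fullmatch(r"\d{6}", p) is not None
    ),
    "duplicate_ids": duplicates(records, "customer_id"),
}
print(report)
```

Whether a completeness of 0.75 is acceptable is not something the code can decide; that threshold comes from the purpose the data must serve.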

1.8 Structure of a Dataset

Almost every analytical dataset shares a common shape: rectangular, with one row per observation and one column per variable. The Wickham (2014) “tidy data” principles formalise this shape.

Note: Three tidy rules

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Warning: Common violations

A single column holding both city and state (“Mumbai, Maharashtra”) violates the “each variable forms a column” rule. A wide table with one column per month (“Jan”, “Feb”, “Mar”) instead of one “month” column and one “value” column makes aggregation harder. Both problems are fixed by reshaping (Chapter 8).
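Both violations can be repaired in one pass. The sketch below uses plain Python dictionaries with hypothetical sales figures. The column names `location` and `value` and the helper `tidy` are illustrative, and this is not necessarily the method Chapter 8 uses.

```python
# Hypothetical wide table: one column per month, city and state fused together.
wide = [
    {"location": "Mumbai, Maharashtra", "Jan": 120, "Feb": 135, "Mar": 150},
    {"location": "Pune, Maharashtra", "Jan": 80, "Feb": 95, "Mar": 90},
]
MONTHS = ["Jan", "Feb", "Mar"]


def tidy(rows):
    """Split the fused column and melt the month columns into (month, value) rows."""
    long_rows = []
    for row in rows:
        city, state = [part.strip() for part in row["location"].split(",", 1)]
        for month in MONTHS:
            long_rows.append(
                {"city": city, "state": state, "month": month, "value": row[month]}
            )
    return long_rows


tidy_rows = tidy(wide)

# With one "month" column, aggregation becomes a single loop over rows.
total_by_month = {}
for r in tidy_rows:
    total_by_month[r["month"]] = total_by_month.get(r["month"], 0) + r["value"]
print(total_by_month)
```

After reshaping, adding a fourth month means adding rows, not a new column, so the aggregation code does not change.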

1.9 The Data Lifecycle

Data move through a recognisable sequence of stages from the moment they are generated to the moment they are archived or destroyed. A coherent data strategy treats each stage explicitly.

flowchart LR
  G[Generation] --> C[Collection]
  C --> S[Storage]
  S --> P[Processing]
  P --> A[Analysis]
  A --> U[Use and Dissemination]
  U --> R[Archival or Disposal]

Note: The stages, in practice

  • Generation: transactions, sensor readings, form submissions, clicks.
  • Collection: ingestion into a system of record, with metadata and provenance attached.
  • Storage: databases, data warehouses, data lakes, object stores, with retention policies.
  • Processing: cleaning, transformation, integration across sources.
  • Analysis: descriptive, exploratory, confirmatory, predictive, and prescriptive work.
  • Use and Dissemination: dashboards, reports, models in production, decisions.
  • Archival or Disposal: cold storage, legal retention, or secure deletion when the data are no longer needed or when regulation requires erasure.

Important: The lifecycle is regulated

Under the DPDP Act 2023, personal data in India must be collected for a specified purpose, processed lawfully, retained only as long as necessary, and erased when the purpose is fulfilled unless another legal basis applies. The seven stages above are therefore not only operational; they are the units at which governance, consent, and auditing attach.
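The retention rule as summarised in the paragraph above can be expressed as a small decision function. This is a simplification for illustration only: the `PersonalRecord` class, its field names, and the single `other_legal_basis` flag are hypothetical, and the sketch encodes the paragraph's summary, not the text of the Act.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PersonalRecord:
    """Hypothetical record carrying the metadata needed for a retention decision."""
    subject_id: str
    purpose: str
    purpose_fulfilled_on: Optional[date] = None  # None while the purpose is still live
    other_legal_basis: bool = False              # assumed flag, e.g. a statutory retention duty


def due_for_erasure(record: PersonalRecord, today: date) -> bool:
    """True when the purpose is fulfilled and no other basis justifies retention."""
    if record.purpose_fulfilled_on is None:
        return False
    if record.other_legal_basis:
        return False
    return record.purpose_fulfilled_on <= today


today = date(2026, 3, 1)
active = PersonalRecord("S1", "loan servicing")
closed = PersonalRecord("S2", "loan servicing", date(2026, 1, 15))
held = PersonalRecord("S3", "loan servicing", date(2026, 1, 15), other_legal_basis=True)

print([due_for_erasure(r, today) for r in (active, closed, held)])
```

The check is only possible because purpose and fulfilment date were attached as metadata at the Collection stage, which is one reason the lifecycle treats provenance as part of ingestion.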

1.10 Summary

Summary of data concepts introduced in this chapter

| Concept | Description |
|---|---|
| DIKW Hierarchy | |
| Data | Raw observations |
| Information | Data placed in context |
| Knowledge | Pattern extracted from information |
| Wisdom | Judgement applied to knowledge |
| Data Types | |
| Quantitative | Numeric and arithmetic-amenable |
| Qualitative | Categorical or narrative |
| Sources | |
| Primary | Collected first-hand |
| Secondary | Re-used from another source |
| Location | |
| Internal | Generated inside the organisation |
| External | Originating outside the organisation |
| Structure | |
| Tidy data | One row per observation, one column per variable |