```mermaid
flowchart BT
    D[Data: raw observations] --> I[Information: data + context]
    I --> K[Knowledge: information + pattern]
    K --> W[Wisdom: knowledge + judgement]
```
1 Basic Concepts of Data
1.1 What Is Data
Data are recorded facts, observations, and measurements about entities in the world. An entity can be a customer, a transaction, a sensor reading, a tweet, a product, a delivery route, or an employee’s response to a survey item. Each measurement is a single fact about a single entity at a single moment. A dataset is an organised collection of such measurements.
Data are symbolic representations of facts, events, or measurements, stored in a form that can be communicated, processed, and analysed. The symbols may be numbers (revenue of ₹4.2 crore), text (a product review), images (a photograph of a defect), timestamps, or categories.
Data are representations. A recorded temperature of 36.7 °C is a representation of a body temperature at a moment, produced by a particular thermometer under particular conditions. Errors of measurement, of recording, of transmission, and of interpretation are always possible. Analytics therefore begins with the recognition that data describe the world imperfectly.
1.2 The DIKW Hierarchy
The Data-Information-Knowledge-Wisdom hierarchy is the most widely cited model of how raw observations become useful in decision-making. Each tier adds context, pattern, or judgement to the tier below.
Data: “1250, 1340, 1180, 1420, 1375.” Numbers without labels or units.
Information: “Daily active users on the Flipkart grocery app in Bengaluru on the first five days of February 2026 were 1250, 1340, 1180, 1420, 1375.” The same numbers gain a unit, a scope, and a time window.
Knowledge: “Bengaluru daily active users grew 10 percent week-on-week during the February promotional campaign, and growth was concentrated in the 25-to-34 age group.” A pattern has been extracted by aggregation and comparison.
Wisdom: “Given that the 25-to-34 segment responds strongly to weekday promotions but not to weekend promotions, the next campaign should concentrate on Tuesday and Wednesday pushes.” A decision has been formed by combining the pattern with business context.
Most analytical value is created in the transitions between tiers: from data to information (cleaning, labelling, contextualising), from information to knowledge (summarising, modelling, testing), and from knowledge to wisdom (interpretation, action, governance). Almost every chapter of this book describes techniques for one of these transitions.
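These transitions can be sketched in code. The following Python fragment is a minimal illustration built on the hypothetical Flipkart figures above; the 10 percent threshold in the decision rule is an invented assumption, not a recommendation:

```python
# Data: raw observations with no labels or units.
daily_counts = [1250, 1340, 1180, 1420, 1375]

# Information: the same numbers with unit, scope, and time window attached.
information = {
    "metric": "daily active users",
    "scope": "Flipkart grocery app, Bengaluru",
    "dates": ["2026-02-01", "2026-02-02", "2026-02-03",
              "2026-02-04", "2026-02-05"],
    "values": daily_counts,
}

# Knowledge: a pattern extracted by aggregation and comparison,
# here the mean level over the window.
mean_dau = sum(information["values"]) / len(information["values"])

# Wisdom: a decision rule that combines the pattern with business
# context (the 10% growth threshold is an illustrative assumption).
def next_campaign_action(week_on_week_growth: float) -> str:
    if week_on_week_growth >= 0.10:
        return "concentrate on weekday promotional pushes"
    return "revisit campaign design before scaling"

print(mean_dau)                       # 1313.0
print(next_campaign_action(0.10))
```

Each step of the sketch corresponds to one transition: labelling turns data into information, aggregation turns information into knowledge, and the decision rule turns knowledge into wisdom.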
1.3 Quantitative and Qualitative Data
The first and most general classification of data distinguishes the numeric from the non-numeric.
Quantitative data are numeric measurements on which arithmetic operations are meaningful. Examples: monthly revenue, customer age, number of transactions, product rating on a 1-to-5 scale treated as interval, latency in milliseconds.
Qualitative data are non-numeric observations that describe qualities, categories, or narratives. Examples: product category, customer feedback text, colour of a delivery bike, interview transcripts, images of store shelves.
Chapter 6 refines the quantitative side into four levels of measurement (nominal, ordinal, interval, ratio), a distinction that governs which statistical operations are valid on a given variable.
An Aadhaar number, a pincode, and a customer ID are stored as digits but they do not support arithmetic. Adding two pincodes is meaningless. Whether data are quantitative is a question of meaning, not of storage format.
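This distinction is easy to demonstrate in code. A minimal sketch, with invented values, showing why digit-string identifiers should be stored as text even though they look numeric:

```python
# Digits-as-identifiers vs digits-as-quantities: a minimal illustration
# with invented values.
record = {
    "customer_id": "CUST-004217",   # identifier: no arithmetic is meaningful
    "pincode": "560001",            # identifier: keep as a string
    "monthly_revenue": 42_00_000,   # quantity in rupees: arithmetic applies
    "transactions": 37,             # quantity: a count
}

# Meaningful: arithmetic on genuine quantities.
revenue_per_transaction = record["monthly_revenue"] / record["transactions"]

# Meaningless: "adding" two pincodes. Storing them as strings makes the
# type system reject the operation instead of silently producing 1120002.
try:
    record["pincode"] + 560001      # str + int raises TypeError
except TypeError:
    print("pincode arithmetic rejected")
```

Storing identifiers as strings is therefore not a technicality: it encodes the fact that they are nominal labels, not quantities.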
1.4 Sources: Primary and Secondary Data
Data enter an analysis from one of two sources.
Primary data are collected first-hand by the analyst or organisation for the specific question at hand: survey responses, laboratory experiments, in-store observation, A/B-test logs collected by the product team, focus-group transcripts.
Secondary data were collected previously by someone else, for a different purpose, and are now re-used: government releases (RBI bulletins, NSSO surveys, GSTN filings), subscription databases (Prowess, Capitaline, Euromonitor), published research, industry reports, Kaggle datasets, open government data on data.gov.in.
| Dimension | Primary | Secondary |
|---|---|---|
| Control over definitions | High | Low |
| Cost of collection | High | Low (often free) |
| Time to obtain | Long | Short |
| Fit to the exact research question | High | Variable |
| Sample size and scope | Limited by budget | Often very large |
| Ownership and permission | Usually owned | Licence-dependent |
Many real studies combine the two. An investigation of branch profitability at a bank might pair primary customer-satisfaction survey data (collected by the bank) with secondary district-level economic indicators (from RBI and the Ministry of Statistics and Programme Implementation).
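Combining the two sources usually means joining on a shared key. A minimal sketch of the bank example, with every figure invented for illustration:

```python
# Primary: branch-level satisfaction scores from the bank's own survey.
primary_survey = {
    "BLR-01": {"district": "Bengaluru Urban", "satisfaction": 4.1},
    "MYS-01": {"district": "Mysuru", "satisfaction": 3.6},
}

# Secondary: district-level indicators re-used from a government source
# (values invented here).
secondary_indicators = {
    "Bengaluru Urban": {"per_capita_income": 623000},
    "Mysuru": {"per_capita_income": 287000},
}

# Combine: attach the secondary indicator to each primary record via the
# shared "district" key -- the analytical equivalent of a database join.
combined = {
    branch: {**row, **secondary_indicators[row["district"]]}
    for branch, row in primary_survey.items()
}

print(combined["BLR-01"])
# {'district': 'Bengaluru Urban', 'satisfaction': 4.1, 'per_capita_income': 623000}
```

The join key (here the district name) is what makes the pairing possible; in practice it must be defined identically in both sources, which is itself a data-quality concern.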
1.5 Internal and External Data
A complementary source distinction, orthogonal to primary and secondary, is whether the data originate inside or outside the organisation that analyses them.
Internal data are generated by the organisation’s own operations: transactional systems (ERP, CRM, POS), web and app logs, payroll, inventory, customer-support tickets. They are typically rich, high-volume, and governed by the organisation’s data policies.
External data originate outside the organisation: macroeconomic indicators, competitor pricing, social-media content, weather, geospatial data, credit-bureau data, third-party consumer panels. They are usually purchased, subscribed to, scraped under terms of service, or accessed via APIs.
1.6 Data as a Strategic Asset
Organisations now treat data on a par with physical plant, capital, and brand. The reasoning has three parts.
First, data accumulate at near-zero marginal cost. Every transaction creates a new record, and the marginal cost of recording it is close to zero. Over years a large retailer accumulates a transaction history that no new entrant can replicate, because the history itself took years to generate.
Second, accumulated data power analytical products. Recommendation engines, credit-risk scorecards, fraud detectors, logistics optimisers, and pricing models are trained on historical data. The accumulated dataset is the fuel; the model is the engine.
Third, data infrastructure enables services that could not otherwise exist. UPI payments, Aadhaar-based e-KYC, GST invoice matching, Ola ride-hail surge pricing, and Amazon next-day delivery all exist because the underlying data infrastructure makes them feasible. The product and the data are inseparable.
If data are an asset they must be inventoried, maintained, secured, and depreciated. The DPDP Act 2023 in India and the GDPR in the European Union impose legal duties on holders of personal data: consent, purpose limitation, data minimisation, breach notification, and in some cases a right to erasure. Chapter 3 develops the organisational implications of this asset view.
1.7 Dimensions of Data Quality
Data quality is multi-dimensional; a dataset can be strong on one axis and weak on another. Seven dimensions are in widespread use.
| Dimension | Question it answers |
|---|---|
| Accuracy | Do the recorded values match reality? |
| Completeness | Are all required values present? |
| Consistency | Do the same facts agree across systems? |
| Timeliness | Are the values current enough for the decision? |
| Validity | Do values conform to the allowed format or range? |
| Uniqueness | Is each real-world entity represented once? |
| Integrity | Do relationships across tables hold (foreign keys, constraints)? |
A dataset good enough for a marketing segmentation exercise may be unsuitable for a credit decision. “Fit for purpose” is the operational test of data quality; the seven dimensions above are the axes along which fitness is assessed.
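Several of these dimensions can be checked mechanically. A minimal sketch auditing completeness, validity, and uniqueness on invented records (the validity rule encodes the six-digit Indian pincode format, which does not begin with zero):

```python
import re

# Invented records with deliberate defects on three dimensions.
records = [
    {"customer_id": "C001", "pincode": "560001", "age": 34},
    {"customer_id": "C002", "pincode": "56001",  "age": None},  # invalid pin, missing age
    {"customer_id": "C001", "pincode": "400001", "age": 51},    # duplicate ID
]

# Indian pincodes: six digits, first digit non-zero.
PIN_RE = re.compile(r"^[1-9][0-9]{5}$")

# Completeness: share of records with the required "age" value present.
completeness = sum(r["age"] is not None for r in records) / len(records)

# Validity: share of pincodes conforming to the allowed format.
validity = sum(bool(PIN_RE.match(r["pincode"])) for r in records) / len(records)

# Uniqueness: share of distinct customer IDs among all IDs.
ids = [r["customer_id"] for r in records]
uniqueness = len(set(ids)) / len(ids)

print(f"completeness={completeness:.2f} "
      f"validity={validity:.2f} uniqueness={uniqueness:.2f}")
# completeness=0.67 validity=0.67 uniqueness=0.67
```

Accuracy, consistency, timeliness, and integrity generally need external references (ground truth, a second system, a decision deadline, a schema), which is why they are harder to audit from a single table alone.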
1.8 Structure of a Dataset
Almost every analytical dataset shares a common shape: rectangular, with one row per observation and one column per variable. The Wickham (2014) “tidy data” principles formalise this shape.
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
A single column holding both city and state (“Mumbai, Maharashtra”) violates the “each variable forms a column” rule. A wide table with one column per month (“Jan”, “Feb”, “Mar”) instead of one “month” column and one “value” column makes aggregation harder. Both problems are fixed by reshaping (Chapter 8).
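Both fixes can be sketched in a few lines. The following Python fragment, with invented values, splits the combined city-state column and melts the wide month columns into a long month-value form:

```python
# Problem 1: one column holding two variables ("Mumbai, Maharashtra").
untidy_location = [{"branch": "B1", "city_state": "Mumbai, Maharashtra"}]
tidy_location = [
    {"branch": r["branch"],
     "city": r["city_state"].split(", ")[0],
     "state": r["city_state"].split(", ")[1]}
    for r in untidy_location
]

# Problem 2: one column per month instead of a "month" column and a
# "value" column. Melt each wide row into one row per observation.
untidy_wide = [{"store": "S1", "Jan": 100, "Feb": 120, "Mar": 90}]
tidy_long = [
    {"store": r["store"], "month": m, "value": r[m]}
    for r in untidy_wide
    for m in ("Jan", "Feb", "Mar")
]

print(tidy_location[0])
# {'branch': 'B1', 'city': 'Mumbai', 'state': 'Maharashtra'}
print(len(tidy_long))   # 3 rows: one per store-month observation
```

In the tidy long form, aggregating across months becomes a simple group-by on the "month" column rather than arithmetic across differently named columns.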
1.9 The Data Lifecycle
Data move through a recognisable sequence of stages from the moment they are generated to the moment they are archived or destroyed. A coherent data strategy treats each stage explicitly.
```mermaid
flowchart LR
    G[Generation] --> C[Collection]
    C --> S[Storage]
    S --> P[Processing]
    P --> A[Analysis]
    A --> U[Use and Dissemination]
    U --> R[Archival or Disposal]
```
- Generation: transactions, sensor readings, form submissions, clicks.
- Collection: ingestion into a system of record, with metadata and provenance attached.
- Storage: databases, data warehouses, data lakes, object stores, with retention policies.
- Processing: cleaning, transformation, integration across sources.
- Analysis: descriptive, exploratory, confirmatory, predictive, and prescriptive work.
- Use and Dissemination: dashboards, reports, models in production, decisions.
- Archival or Disposal: cold storage, legal retention, or secure deletion when the data are no longer needed or when regulation requires erasure.
Under the DPDP Act 2023, personal data in India must be collected for a specified purpose, processed lawfully, retained only as long as necessary, and erased when the purpose is fulfilled unless another legal basis applies. The seven stages above are therefore not only operational; they are the units at which governance, consent, and auditing attach.
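The disposal stage in particular lends itself to automation. A sketch of a retention check, assuming an illustrative 365-day retention period rather than any specific legal requirement:

```python
from datetime import date, timedelta

# Illustrative retention period; real periods are set by policy and law.
RETENTION = timedelta(days=365)

# Invented records, each stamped with the date its purpose was fulfilled.
records = [
    {"id": "r1", "purpose_fulfilled_on": date(2024, 1, 10)},
    {"id": "r2", "purpose_fulfilled_on": date(2025, 6, 1)},
]

def due_for_erasure(record, today):
    """Flag a record once the retention period after purpose fulfilment lapses."""
    return today - record["purpose_fulfilled_on"] > RETENTION

today = date(2025, 7, 1)
to_erase = [r["id"] for r in records if due_for_erasure(r, today)]
print(to_erase)   # ['r1']
```

A production system would also check for competing legal bases (litigation holds, statutory retention) before erasing, which is why disposal is a governance decision and not merely a scheduled job.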
1.10 Summary
| Concept | Description |
|---|---|
| **DIKW Hierarchy** | |
| Data | Raw observations |
| Information | Data placed in context |
| Knowledge | Pattern extracted from information |
| Wisdom | Judgement applied to knowledge |
| **Data Types** | |
| Quantitative | Numeric and arithmetic-amenable |
| Qualitative | Categorical or narrative |
| **Sources** | |
| Primary | Collected first-hand |
| Secondary | Re-used from another source |
| **Location** | |
| Internal | Generated inside the organisation |
| External | Originating outside the organisation |
| **Structure** | |
| Tidy data | One row per observation, one column per variable |