3  Data in Organizations

3.1 Why Organizations Need Data Systems

A modern organization generates data at every touchpoint: a customer opens an app, a warehouse scans a carton, a plant sensor logs a temperature, an HR system records a leave request. Without deliberate systems, this output is fragmented across applications, stored in incompatible formats, and effectively invisible to decision makers. Data systems are the organizational plumbing that captures, moves, stores, and serves this output so that it can support both daily operations and longer-horizon analysis.

Note: The two jobs data systems must do

Every organization needs data infrastructure that can simultaneously (a) keep the business running by serving live transactions, and (b) support reporting and analysis by preserving a consistent historical view. These two jobs pull in opposite directions, which is why most organizations end up with two layers of data systems rather than one.

3.2 Operational versus Analytical Data

The single most important architectural distinction inside an organization is between systems that run the business and systems that analyse it.

Note: OLTP (Online Transaction Processing)

Systems optimised for short, frequent, read-write operations. A single transaction touches few rows and must complete in milliseconds. Examples: the core banking system at HDFC, the order-placement table at Flipkart, the POS database at Reliance Retail. OLTP schemas are highly normalised to avoid update anomalies.

Note: OLAP (Online Analytical Processing)

Systems optimised for long, read-heavy queries that aggregate millions of rows. A single query may scan an entire year of sales. Examples: the enterprise data warehouse at Infosys, the revenue cube at Airtel, the supply-chain warehouse at Maruti Suzuki. OLAP schemas are denormalised (star, snowflake) to speed aggregation.
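
The contrast between the two access patterns can be sketched with Python's built-in sqlite3; the orders table and its contents are invented purely for illustration:

```python
import sqlite3

# In-memory database standing in for both workloads (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER, amount REAL, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(i, i % 100, 10.0 * (i % 7 + 1), f"2024-{i % 12 + 1:02d}-01")
     for i in range(1, 1001)],
)

# OLTP-style access: a point lookup touching one row via the primary key.
row = conn.execute("SELECT amount FROM orders WHERE order_id = ?", (42,)).fetchone()

# OLAP-style access: an aggregation that must scan every row.
monthly = conn.execute(
    "SELECT substr(order_date, 6, 2) AS month, SUM(amount) "
    "FROM orders GROUP BY month"
).fetchall()

print(row[0])        # one value served by a single indexed read
print(len(monthly))  # twelve monthly totals derived from a full scan
```

The point lookup and the full-table aggregation are the workload shapes that pull OLTP and OLAP schema design in opposite directions.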

Important: Why running analytics on the OLTP system is discouraged

A long aggregation query on the transactional database competes for the same resources that serve live customer requests. Organizations therefore replicate operational data on a periodic cadence into a separate analytical store so that the two workloads never interfere.

3.3 Data Across Functional Areas

Each functional area of the organization is both a producer and a consumer of data.

| Function | Systems that generate data | Analytical uses |
| --- | --- | --- |
| Sales and Marketing | CRM (Salesforce), campaign tools, web analytics | Segmentation, churn prediction, campaign ROI |
| Finance | ERP (SAP, Oracle Fusion), GL, billing | Variance analysis, forecasting, audit |
| Operations and Supply Chain | WMS, TMS, MES, IoT sensors | Demand planning, OEE, route optimisation |
| Human Resources | HRIS (Workday, SuccessFactors) | Attrition modelling, workforce planning |
| Customer Service | Call-centre, ticketing, chat (Zendesk) | First-call resolution, sentiment, staffing |
| Risk and Compliance | Core banking, trading systems, KYC | Fraud detection, credit scoring, AML |

Tip: Crossing the functional boundary

The highest-impact analyses usually require joining data that originated in different functions. Linking marketing spend to sales revenue to customer-service tickets tells a richer story than any one of those datasets alone. The ability to make those joins depends entirely on shared keys (customer ID, product code) that are managed centrally.
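
A minimal sketch of such a cross-functional join, using invented per-function datasets that happen to share a customer ID:

```python
# Hypothetical per-function datasets, all keyed by the same shared customer_id.
marketing_spend = {101: 500.0, 102: 120.0, 103: 0.0}    # campaign attribution
sales_revenue   = {101: 2300.0, 102: 0.0, 103: 450.0}   # order totals
service_tickets = {101: 1, 102: 4, 103: 0}              # support volume

# The join is only possible because all three systems agree on the key.
joined = {
    cid: {
        "spend": marketing_spend[cid],
        "revenue": sales_revenue[cid],
        "tickets": service_tickets[cid],
    }
    for cid in marketing_spend.keys() & sales_revenue.keys() & service_tickets.keys()
}

# A question no single dataset can answer on its own:
# which customers cost marketing money but generated no revenue?
unconverted = [cid for cid, r in joined.items()
               if r["spend"] > 0 and r["revenue"] == 0]
print(unconverted)  # [102]
```

If the three systems used unreconciled identifiers, the dictionary-key intersection above would be empty and the question would be unanswerable, which is the argument for centrally managed keys.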

3.4 Warehouses, Marts, Lakes and Lakehouses

Analytical data is held in one of four architectural patterns that have emerged over the last three decades.

Note: Data warehouse

A centralised, integrated, subject-oriented repository of structured data drawn from multiple source systems. Bill Inmon’s classical definition emphasises that it is “integrated, non-volatile and time-variant”. Implementations include Teradata, Oracle Exadata, Amazon Redshift, Google BigQuery, Snowflake.

Note: Data mart

A subject-specific subset of the warehouse, usually serving a single department (finance mart, sales mart). Smaller, faster to query, and easier to permission than the central warehouse.

Note: Data lake

A storage layer that holds raw files of any format (structured, semi-structured, unstructured) at low cost. Implementations: Amazon S3 with AWS Glue catalog, Azure Data Lake Storage, Hadoop HDFS. The lake is schema-on-read: structure is imposed by the query, not by the loader.
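
Schema-on-read can be demonstrated in a few lines; here a temporary directory stands in for S3 or ADLS, and the records are invented:

```python
import json
import os
import tempfile

# A "lake" is just cheap file storage; a temp directory stands in for S3/ADLS.
lake = tempfile.mkdtemp()
raw = [
    {"event": "click", "user": "u1", "ts": "2024-05-01T10:00:00"},
    {"event": "purchase", "user": "u2", "amount": 499.0},  # heterogeneous record
    {"event": "click", "user": "u1"},                      # missing field
]
path = os.path.join(lake, "events.jsonl")
with open(path, "w") as f:
    for rec in raw:
        f.write(json.dumps(rec) + "\n")

# Schema-on-read: the loader stored raw bytes without validation;
# structure is imposed only now, by the query.
def read_clicks(p):
    with open(p) as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("event") == "click":  # tolerate ragged, mixed records
                yield {"user": rec["user"], "ts": rec.get("ts")}

clicks = list(read_clicks(path))
print(len(clicks))  # 2 click events recovered despite the ragged input
```

A warehouse would have rejected the mismatched records at load time (schema-on-write); the lake accepts everything and defers the decision to each reader.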

Note: Lakehouse

A convergent pattern that places warehouse-style governance and performance directly on top of lake storage, using open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi. Databricks and Snowflake both converge toward this model.

Important: Choosing between them

Warehouses and marts are preferred for reliable, governed reporting on structured business metrics. Lakes are preferred for machine-learning projects that need access to raw logs, images, and text. Most large organizations now operate both, with a lakehouse increasingly positioned as the single surface that spans the two.

3.5 ETL and ELT Pipelines

Data does not move from source systems to the warehouse by itself. A pipeline extracts it, reshapes it to match the target schema, and loads it into the analytical store; the order of the last two steps is what distinguishes the two dominant patterns.

```mermaid
flowchart LR
  A[Source systems<br/>ERP, CRM, logs] --> B[Extract]
  B --> C{Pattern}
  C -- ETL --> D[Transform<br/>in staging]
  D --> E[Load into<br/>warehouse]
  C -- ELT --> F[Load raw into<br/>cloud warehouse]
  F --> G[Transform<br/>with SQL/dbt]
  E --> H[Analytics layer]
  G --> H
```

Note: ETL (Extract, Transform, Load)

The traditional pattern. Data is extracted from the source, cleaned and transformed on a dedicated server, and then loaded into the warehouse. Tools: Informatica, Talend, IBM DataStage. Favoured when compute in the warehouse is expensive.

Note: ELT (Extract, Load, Transform)

The cloud-native variant. Raw data is loaded directly into the warehouse first, then transformed in place using SQL. Tools: Fivetran or Airbyte for ingestion, dbt for transformations, Snowflake or BigQuery as compute. Favoured when warehouse compute is cheap and elastic.
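
The ELT pattern can be sketched end to end with an in-memory SQLite database standing in for the cloud warehouse; table names and records are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for Snowflake / BigQuery

# Extract + Load: raw records land first, completely untransformed
# (whitespace, casing, and duplicates preserved exactly as extracted).
conn.execute("CREATE TABLE raw_customers (id INTEGER, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [(1, "  Asha Rao ", "in"),
     (2, "Ben Lee", "US"),
     (2, "Ben Lee", "US")],  # duplicate row from the source
)

# Transform: done in place with SQL, the way a dbt model would express it.
conn.execute("""
    CREATE TABLE dim_customers AS
    SELECT DISTINCT id, TRIM(name) AS name, UPPER(country) AS country
    FROM raw_customers
""")

rows = conn.execute("SELECT * FROM dim_customers ORDER BY id").fetchall()
print(rows)  # [(1, 'Asha Rao', 'IN'), (2, 'Ben Lee', 'US')]
```

Because the raw table is retained, the transformation can be rerun or revised later without going back to the source system, which is the main operational advantage ELT has over ETL.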

Tip: Scheduling and orchestration

Production pipelines run on schedulers that manage dependencies, retries and alerts. Apache Airflow, Prefect, and Dagster are the dominant open-source orchestrators.
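
At their core, all of these orchestrators execute a dependency graph of tasks. A toy dependency-aware runner, with an invented four-task pipeline, shows the idea (real orchestrators add scheduling, retries, and alerting on top):

```python
# A hypothetical pipeline expressed as task -> list of upstream dependencies.
tasks = {
    "extract":   [],
    "load":      ["extract"],
    "transform": ["load"],
    "report":    ["transform"],
}

def run(dag):
    """Run each task once all of its dependencies have completed."""
    done, order = set(), []
    while len(done) < len(dag):
        for task, deps in dag.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)  # a real orchestrator executes the task here
                done.add(task)
    return order

print(run(tasks))  # ['extract', 'load', 'transform', 'report']
```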

3.6 Master Data Management

Operational systems tend to record the same entities under different identifiers. A single retail customer might appear as CUST_00421 in the POS system, as arjun@example.com in the loyalty app, and as +91-98xxxxxxxx in the contact centre. Without reconciliation, every cross-system analysis double-counts or misses that customer.

Note: What MDM does

Master Data Management is the set of processes and tools that maintain a single authoritative version of core entities (customer, product, employee, supplier, location) and propagate that version to every consuming system. Implementations include Informatica MDM, Reltio, Microsoft MDS.
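
The core mechanic is merging source records into a single golden record under a survivorship rule. A deliberately naive sketch, reusing the customer from above and applying a "longest non-null value wins" rule (real MDM tools use far richer matching and survivorship logic):

```python
# Three hypothetical source records describing the same person.
pos     = {"source": "POS", "id": "CUST_00421", "name": "Arjun M", "phone": None}
loyalty = {"source": "loyalty", "email": "arjun@example.com", "name": "Arjun Mehta"}
contact = {"source": "contact-centre", "phone": "+91-98xxxxxxxx", "name": "A. Mehta"}

def golden(records, fields):
    """Naive survivorship: per field, keep the longest non-null value."""
    merged = {}
    for field in fields:
        candidates = [r[field] for r in records if r.get(field)]
        merged[field] = max(candidates, key=len) if candidates else None
    return merged

master = golden([pos, loyalty, contact], ["name", "email", "phone"])
print(master["name"])  # 'Arjun Mehta' — the most complete variant survives
```

The golden record is then propagated back to every consuming system, so that all three sources converge on one identity.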

Important: The cost of poor master data

Without a reliable golden customer ID, customer-lifetime-value, churn, and cross-sell analyses become structurally unreliable. Most organizations that undertake advanced analytics find that master-data quality is the binding constraint long before any algorithmic sophistication.

3.7 Data Governance, Ownership and Stewardship

Governance is the set of policies that determine who may collect, access, change, and delete data, and how those decisions are enforced.

Note: Key governance roles

Data owner: accountable for a dataset (usually a senior business executive).
Data steward: responsible for day-to-day quality and definitions.
Data custodian: operates the storage and access layers (typically IT).
Chief Data Officer: enterprise-wide accountability for data as an asset.

Note: Access control

Role-based access control (RBAC) restricts who can read or modify each dataset. Sensitive columns are often masked or tokenised. Every access should be logged for audit.
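
A minimal sketch of RBAC column filtering with an audit log; the roles, grants, and row are all invented for illustration:

```python
# Hypothetical role-to-column grants; raw PII is withheld from analysts.
GRANTS = {
    "analyst": {"customer_id", "order_total"},           # no direct identifiers
    "steward": {"customer_id", "order_total", "email"},  # broader access
}

def serve_row(row, role):
    """Return only the columns granted to this role, and log the access."""
    allowed = GRANTS[role]
    out = {k: v for k, v in row.items() if k in allowed}
    print(f"AUDIT: role={role} read columns={sorted(out)}")  # access is logged
    return out

row = {"customer_id": 421, "email": "arjun@example.com", "order_total": 2300.0}
print(serve_row(row, "analyst"))  # the email column is withheld
```

In production this enforcement lives in the warehouse's grant system or a policy engine rather than application code, but the shape — role, grant, filter, audit — is the same.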

Tip: Data catalogs

A data catalog (Alation, Collibra, DataHub, Atlan) documents what datasets exist, what each column means, who owns it, and how fresh it is. A functioning catalog is the difference between data that can be reused and data that has to be rediscovered on every project.

3.8 Privacy and Regulation

Organizations do not hold data unconditionally. Regulators impose substantive constraints on how personal data may be collected, stored, and processed.

Note: Digital Personal Data Protection Act, 2023 (India)

The DPDP Act requires notice-and-consent for processing personal data of Indian residents, mandates purpose limitation and data minimisation, creates rights of correction and erasure, and prescribes obligations for Significant Data Fiduciaries. Penalties can reach INR 250 crore per breach.

Note: GDPR (European Union)

Applies extraterritorially to any organization processing personal data of EU residents. Core principles: lawfulness, fairness, transparency, purpose limitation, data minimisation, accuracy, storage limitation, integrity and confidentiality, accountability. Penalties can reach 4 percent of global annual turnover.

Important: Consequences for analytics teams

Personally identifiable fields must be either removed or pseudonymised before they enter an analytical sandbox. Consent records must be honoured on every downstream use, and deletion requests must propagate to derived datasets. Governance is therefore a direct input to the design of every pipeline, not an afterthought.
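
One common pseudonymisation technique is a keyed hash: the same identifier always maps to the same token, so joins still work inside the sandbox, but the raw value never enters it. A sketch with an illustrative key and record:

```python
import hashlib
import hmac

# The key must be held outside the analytical sandbox and rotated on policy;
# this value is illustrative only.
SECRET_KEY = b"rotate-me"

def pseudonymise(value: str) -> str:
    """Deterministic keyed hash: stable token, no recoverable raw value."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"email": "arjun@example.com", "order_total": 2300.0}
safe = {
    "customer_token": pseudonymise(record["email"]),
    "order_total": record["order_total"],
}

# Joins across datasets survive because the mapping is deterministic...
assert pseudonymise("arjun@example.com") == safe["customer_token"]
# ...but no raw identifier crosses into the sandbox.
print("email" not in safe)
```

Deletion requests then reduce to deleting the key-to-identity mapping (or rotating the key), after which the tokens in derived datasets can no longer be linked back to a person.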

3.9 Data Roles in Organizations

A functioning data organization distributes responsibility across specialist roles.

| Role | Primary responsibility |
| --- | --- |
| Data engineer | Builds and maintains pipelines, schemas and storage |
| Data analyst | Produces dashboards, KPIs and descriptive analysis |
| Business intelligence developer | Designs reports in Power BI, Tableau, Looker |
| Data scientist | Builds predictive and prescriptive models |
| ML engineer | Productionises and monitors models |
| Data architect | Designs the enterprise data model and integrations |
| Data steward | Defines and monitors data quality |
| Chief Data Officer | Sets strategy and accountability for data as an asset |

Tip: Overlap is normal

The boundary between analyst, scientist and engineer is fluid and organization-specific. What matters is that extraction, cleaning, modelling, visualisation, and governance are all performed by someone, not that each is performed by a differently named specialist.

3.10 A Reference Data Architecture

```mermaid
flowchart TD
  subgraph Sources
    A1[ERP]
    A2[CRM]
    A3[Web & mobile logs]
    A4[IoT & sensors]
    A5[External feeds]
  end
  Sources --> B[Ingestion<br/>Fivetran / Kafka / Airflow]
  B --> C[(Data Lake<br/>S3 / ADLS)]
  C --> D[Warehouse /<br/>Lakehouse<br/>Snowflake / Databricks]
  D --> E1[BI Layer<br/>Power BI, Tableau]
  D --> E2[Data Science<br/>R, Python, Spark]
  D --> E3[Applications<br/>APIs, reverse ETL]
  F[Governance & Catalog<br/>Collibra / DataHub] -.-> C
  F -.-> D
```

The left-to-right flow is almost universal across medium and large organizations today: many sources feed a lake, the lake feeds a warehouse or lakehouse, the warehouse serves reporting, data science and operational applications, and a governance layer oversees the entire stack.

3.11 Summary

Summary of data concepts introduced in this chapter

| Concept | Description |
| --- | --- |
| **Access patterns** | |
| OLTP | Transaction-focused: fast inserts, updates and short reads |
| OLAP | Analysis-focused: long aggregations over historical data |
| **Storage layers** | |
| Data warehouse | Integrated, non-volatile store of structured data |
| Data mart | Subject-specific subset of the warehouse |
| Data lake | Raw files of any format; cheap, schema-on-read |
| Lakehouse | Warehouse governance on top of lake storage (Delta, Iceberg, Hudi) |
| **Pipeline patterns** | |
| ETL | Transform before loading; traditional pattern |
| ELT | Load raw first, transform in warehouse; cloud-native pattern |
| **Governance** | |
| MDM | Master Data Management: single authoritative view of core entities |
| Data owner | Accountable business executive for a dataset |
| Data steward | Day-to-day quality and definitions |
| CDO | Chief Data Officer: enterprise-wide data accountability |
| **Privacy regulations** | |
| DPDP Act 2023 | India's consent and purpose-limitation law for personal data |
| GDPR | EU-wide privacy regulation; applies extraterritorially |

An organization’s ability to run advanced analytics is a function of how cleanly each of these layers has been built and how well the governance overlay has been enforced. The methods chapters later in the book assume a clean analytical dataset; this chapter names the systems and roles that are responsible for producing it.