3 Data in Organizations
3.1 Why Organizations Need Data Systems
A modern organization generates data at every touchpoint: a customer opens an app, a warehouse scans a carton, a plant sensor logs a temperature, an HR system records a leave request. Without deliberate systems, this output is fragmented across applications, stored in incompatible formats, and effectively invisible to decision makers. Data systems are the organizational plumbing that captures, moves, stores, and serves this output so that it can support both daily operations and longer-horizon analysis.
Every organization needs data infrastructure that can simultaneously (a) keep the business running by serving live transactions, and (b) support reporting and analysis by preserving a consistent historical view. These two jobs pull in opposite directions, which is why most organizations end up with two layers of data systems rather than one.
3.2 Operational versus Analytical Data
The single most important architectural distinction inside an organization is between systems that run the business and systems that analyse it.
**OLTP (online transaction processing).** Systems optimised for short, frequent, read-write operations. A single transaction touches few rows and must complete in milliseconds. Examples: the core banking system at HDFC, the order-placement table at Flipkart, the POS database at Reliance Retail. OLTP schemas are highly normalised to avoid update anomalies.
**OLAP (online analytical processing).** Systems optimised for long, read-heavy queries that aggregate millions of rows. A single query may scan an entire year of sales. Examples: the enterprise data warehouse at Infosys, the revenue cube at Airtel, the supply-chain warehouse at Maruti Suzuki. OLAP schemas are denormalised (star, snowflake) to speed aggregation.
A long aggregation query on the transactional database competes for the same resources that serve live customer requests. Organizations therefore replicate operational data on a periodic cadence into a separate analytical store so that the two workloads never interfere.
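The contrast between the two access patterns can be sketched against a single toy table. This is a minimal illustration using Python's built-in `sqlite3`; the table, column names, and figures are invented, not taken from any real system.

```python
import sqlite3

# In-memory stand-in for a transactional table that both workloads hit.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount      REAL,
        order_date  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders (customer_id, amount, order_date) VALUES (?, ?, ?)",
    [(1, 250.0, "2024-01-15"), (2, 99.0, "2024-02-03"), (1, 40.0, "2024-02-10")],
)

# OLTP-style access: a point read touching one row by primary key.
row = conn.execute("SELECT amount FROM orders WHERE order_id = ?", (1,)).fetchone()

# OLAP-style access: an aggregation that must scan the whole table.
total, n = conn.execute("SELECT SUM(amount), COUNT(*) FROM orders").fetchone()
print(row[0], total, n)  # 250.0 389.0 3
```

On three rows the difference is invisible; on a few hundred million, the scan in the second query is exactly the workload that gets moved to a separate analytical store.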
3.3 Data Across Functional Areas
Each functional area of the organization is both a producer and a consumer of data.
| Function | Systems that generate data | Analytical uses |
|---|---|---|
| Sales and Marketing | CRM (Salesforce), campaign tools, web analytics | Segmentation, churn prediction, campaign ROI |
| Finance | ERP (SAP, Oracle Fusion), GL, billing | Variance analysis, forecasting, audit |
| Operations and Supply Chain | WMS, TMS, MES, IoT sensors | Demand planning, OEE, route optimisation |
| Human Resources | HRIS (Workday, SuccessFactors) | Attrition modelling, workforce planning |
| Customer Service | Call-centre, ticketing, chat (Zendesk) | First-call resolution, sentiment, staffing |
| Risk and Compliance | Core banking, trading systems, KYC | Fraud detection, credit scoring, AML |
The highest-impact analyses usually require joining data that originated in different functions. Linking marketing spend to sales revenue to customer-service tickets tells a richer story than any one of those datasets alone. The ability to make those joins depends entirely on shared keys (customer ID, product code) that are managed centrally.
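A minimal sketch of such a cross-functional join, again with `sqlite3`; the three tables stand in for marketing, sales, and service systems, and every name and number is invented for illustration. The join works only because all three tables carry the same `customer_id` key.

```python
import sqlite3

# Three toy datasets from different functions, linked by a shared key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE marketing_spend (customer_id INTEGER, spend REAL);
    CREATE TABLE sales_revenue   (customer_id INTEGER, revenue REAL);
    CREATE TABLE service_tickets (customer_id INTEGER, tickets INTEGER);
    INSERT INTO marketing_spend VALUES (1, 50.0), (2, 80.0);
    INSERT INTO sales_revenue   VALUES (1, 400.0), (2, 120.0);
    INSERT INTO service_tickets VALUES (1, 0), (2, 5);
""")

# Spend-to-revenue ratio alongside ticket volume, per customer.
rows = conn.execute("""
    SELECT m.customer_id, s.revenue / m.spend AS roi, t.tickets
    FROM marketing_spend m
    JOIN sales_revenue   s USING (customer_id)
    JOIN service_tickets t USING (customer_id)
    ORDER BY m.customer_id
""").fetchall()
print(rows)  # [(1, 8.0, 0), (2, 1.5, 5)]
```

If the three systems used uncoordinated identifiers, the `USING (customer_id)` clause would silently match nothing, which is precisely the failure that master data management (section 3.6) exists to prevent.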
3.4 Warehouses, Marts, Lakes and Lakehouses
Analytical data is held in one of four architectural patterns that have emerged over the last three decades.
**Data warehouse.** A centralised, integrated, subject-oriented repository of structured data drawn from multiple source systems. Bill Inmon’s classical definition emphasises that it is “integrated, non-volatile and time-variant”. Implementations include Teradata, Oracle Exadata, Amazon Redshift, Google BigQuery, Snowflake.
**Data mart.** A subject-specific subset of the warehouse, usually serving a single department (finance mart, sales mart). Smaller, faster to query, and easier to permission than the central warehouse.
**Data lake.** A storage layer that holds raw files of any format (structured, semi-structured, unstructured) at low cost. Implementations: Amazon S3 with AWS Glue catalog, Azure Data Lake Storage, Hadoop HDFS. The lake is schema-on-read: structure is imposed by the query, not by the loader.
**Lakehouse.** A convergent pattern that places warehouse-style governance and performance directly on top of lake storage, using open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi. Databricks and Snowflake are both moving toward this model.
Warehouses and marts are preferred for reliable, governed reporting on structured business metrics. Lakes are preferred for machine-learning projects that need access to raw logs, images, and text. Most large organizations now operate both, with a lakehouse increasingly positioned as the single surface that spans the two.
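The lake's schema-on-read idea can be shown in miniature: the "lake" below is just raw JSON-lines text on which the loader imposed no structure, and each query decides at read time which fields it cares about. The event and field names are invented for illustration.

```python
import json

# Raw, schema-less "lake" contents: records need not share fields.
raw = """\
{"event": "click", "user": "u1", "ts": 1700000000}
{"event": "purchase", "user": "u2", "amount": 250.0}
{"event": "click", "user": "u1"}
"""

# Schema-on-read: structure is imposed by the query, not the loader,
# so missing fields are handled per query rather than at load time.
records = [json.loads(line) for line in raw.splitlines()]
clicks = sum(1 for r in records if r.get("event") == "click")
revenue = sum(r.get("amount", 0.0) for r in records)
print(clicks, revenue)  # 2 250.0
```

A warehouse loader would have rejected the third record for its missing timestamp; the lake accepts it and leaves the decision to whoever queries it later.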
3.5 ETL and ELT Pipelines
Data does not move from source systems to the warehouse by itself. A pipeline extracts it, transforms it to match the target schema, and loads it into the analytical store.
```mermaid
flowchart LR
A[Source systems<br/>ERP, CRM, logs] --> B[Extract]
B --> C{Pattern}
C -- ETL --> D[Transform<br/>in staging]
D --> E[Load into<br/>warehouse]
C -- ELT --> F[Load raw into<br/>cloud warehouse]
F --> G[Transform<br/>with SQL/dbt]
E --> H[Analytics layer]
G --> H
```
**ETL (extract, transform, load).** The traditional pattern. Data is extracted from the source, cleaned and transformed on a dedicated server, and then loaded into the warehouse. Tools: Informatica, Talend, IBM DataStage. Favoured when compute in the warehouse is expensive.
**ELT (extract, load, transform).** The cloud-native variant. Raw data is loaded directly into the warehouse first, then transformed in place using SQL. Tools: Fivetran or Airbyte for ingestion, dbt for transformations, Snowflake or BigQuery as compute. Favoured when warehouse compute is cheap and elastic.
Production pipelines run on schedulers that manage dependencies, retries and alerts. Apache Airflow, Prefect, and Dagster are the dominant open-source orchestrators.
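The ELT pattern can be sketched end to end with `sqlite3` standing in for a cloud warehouse: raw rows are loaded as-is, then reshaped in place with SQL, much as dbt models would. Table names, the messy source formatting, and the figures are all invented for illustration.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in "cloud warehouse"

# L: load the raw extract untransformed, messy formatting and all.
wh.execute("CREATE TABLE raw_sales (sold_on TEXT, amount TEXT)")
wh.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("2024-01-05", " 100.0 "), ("2024-01-20", "250"), ("2024-02-02", "75.5")],
)

# T: transform inside the warehouse with SQL: clean, cast, aggregate.
wh.executescript("""
    CREATE TABLE monthly_sales AS
    SELECT substr(sold_on, 1, 7) AS month,
           SUM(CAST(trim(amount) AS REAL)) AS total
    FROM raw_sales
    GROUP BY month
    ORDER BY month;
""")
rows = wh.execute("SELECT * FROM monthly_sales").fetchall()
print(rows)  # [('2024-01', 350.0), ('2024-02', 75.5)]
```

In the ETL variant, the trimming and casting would instead happen on a staging server before anything touched `monthly_sales`; the end state is the same, but raw_sales would never exist in the warehouse.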
3.6 Master Data Management
Operational systems tend to duplicate the same entities under different identifiers. A single retail customer might appear as CUST_00421 in the POS, as email arjun@example.com in the loyalty app, and as phone number +91-98xxxxxxxx in the contact centre. Without reconciliation, every cross-system analysis is wrong.
Master Data Management is the set of processes and tools that maintain a single authoritative version of core entities (customer, product, employee, supplier, location) and propagate that version to every consuming system. Implementations include Informatica MDM, Reltio, Microsoft MDS.
Without a reliable golden customer ID, customer-lifetime-value, churn, and cross-sell analyses become structurally unreliable. Most organizations that undertake advanced analytics find that master-data quality, not algorithmic sophistication, is the binding constraint.
3.7 Data Governance, Ownership and Stewardship
Governance is the set of policies that determine who may collect, access, change, and delete data, and how those decisions are enforced.
- **Data owner:** accountable for a dataset (usually a senior business executive).
- **Data steward:** responsible for day-to-day quality and definitions.
- **Data custodian:** operates the storage and access layers (typically IT).
- **Chief Data Officer:** enterprise-wide accountability for data as an asset.
Role-based access control (RBAC) restricts who can read or modify each dataset. Sensitive columns are often masked or tokenised. Every access should be logged for audit.
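A toy version of RBAC with column masking might look like the following. The roles, dataset names, access levels, and salt are all invented; a real deployment would hold the permission table in the warehouse's own grant system and the salt in a secrets store.

```python
import hashlib

# Illustrative permission table: role -> dataset -> access level.
PERMISSIONS = {
    "analyst": {"sales_mart": "masked"},
    "steward": {"sales_mart": "full"},
}
SENSITIVE = {"email", "phone"}
SALT = "rotate-me"  # placeholder; keep real salts out of source code

def tokenise(value: str) -> str:
    """Deterministic pseudonym: same input, same token, original unrecoverable."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

def read_row(role: str, dataset: str, row: dict) -> dict:
    level = PERMISSIONS.get(role, {}).get(dataset)
    if level is None:
        raise PermissionError(f"{role} may not read {dataset}")
    if level == "full":
        return row
    # Masked access: sensitive columns are tokenised before release.
    return {k: (tokenise(v) if k in SENSITIVE else v) for k, v in row.items()}

row = {"customer_id": 421, "email": "arjun@example.com", "amount": 250.0}
masked = read_row("analyst", "sales_mart", row)
print(masked["customer_id"], masked["email"] != row["email"])  # 421 True
```

Because the tokenisation is deterministic, the analyst can still join and count by the masked email; they simply cannot recover the address itself.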
A data catalog (Alation, Collibra, DataHub, Atlan) documents what datasets exist, what each column means, who owns it, and how fresh it is. A functioning catalog is the difference between data that can be reused and data that has to be rediscovered on every project.
3.8 Privacy and Regulation
Organizations do not hold data unconditionally. Regulators impose substantive constraints on how personal data may be collected, stored, and processed.
**Digital Personal Data Protection (DPDP) Act, 2023 (India).** The DPDP Act requires notice-and-consent for processing personal data of Indian residents, mandates purpose limitation and data minimisation, creates rights of correction and erasure, and prescribes obligations for Significant Data Fiduciaries. Penalties can reach INR 250 crore per breach.
**General Data Protection Regulation (GDPR, EU).** Applies extraterritorially to any organization processing personal data of EU residents. Core principles: lawfulness, fairness, transparency, purpose limitation, data minimisation, accuracy, storage limitation, integrity and confidentiality, accountability. Penalties can reach 4 percent of global annual turnover.
Personally identifiable fields must be either removed or pseudonymised before they enter an analytical sandbox. Consent records must be honoured on every downstream use, and deletion requests must propagate to derived datasets. Governance is therefore a direct input to the design of every pipeline, not an afterthought.
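The requirement that deletion propagate to derived datasets is worth making concrete. In this sketch, the source table and every derived dataset are plain dictionaries keyed by customer ID; all names and scores are invented, and a real pipeline would track derived tables through its lineage metadata rather than a hard-coded list.

```python
# Source-of-truth table plus derived datasets that still carry the key.
source = {421: {"name": "Arjun"}, 422: {"name": "Meera"}}
derived = {
    "churn_scores": {421: 0.8, 422: 0.2},
    "clv_segments": {421: "high", 422: "low"},
}

def erase(customer_id: int) -> None:
    """Honour an erasure request in the source and in all derived datasets."""
    source.pop(customer_id, None)
    for table in derived.values():
        table.pop(customer_id, None)

erase(421)
print(421 in source, any(421 in t for t in derived.values()))  # False False
```

Deleting only from `source` would satisfy a naive reading of the request while the churn score kept the person alive in every downstream report, which is exactly why governance must be designed into the pipeline.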
3.9 Data Roles in Organizations
A functioning data organization distributes responsibility across specialist roles.
| Role | Primary responsibility |
|---|---|
| Data engineer | Builds and maintains pipelines, schemas and storage |
| Data analyst | Produces dashboards, KPIs and descriptive analysis |
| Business intelligence developer | Designs reports in Power BI, Tableau, Looker |
| Data scientist | Builds predictive and prescriptive models |
| ML engineer | Productionises and monitors models |
| Data architect | Designs the enterprise data model and integrations |
| Data steward | Defines and monitors data quality |
| Chief Data Officer | Sets strategy and accountability for data as an asset |
The boundary between analyst, scientist and engineer is fluid and organization-specific. What matters is that extraction, cleaning, modelling, visualisation, and governance are all performed by someone, not that each is performed by a differently named specialist.
3.10 A Reference Data Architecture
```mermaid
flowchart TD
subgraph Sources
A1[ERP]
A2[CRM]
A3[Web & mobile logs]
A4[IoT & sensors]
A5[External feeds]
end
Sources --> B[Ingestion<br/>Fivetran / Kafka / Airflow]
B --> C[(Data Lake<br/>S3 / ADLS)]
C --> D[Warehouse /<br/>Lakehouse<br/>Snowflake / Databricks]
D --> E1[BI Layer<br/>Power BI, Tableau]
D --> E2[Data Science<br/>R, Python, Spark]
D --> E3[Applications<br/>APIs, reverse ETL]
F[Governance & Catalog<br/>Collibra / DataHub] -.-> C
F -.-> D
```
The left-to-right flow is almost universal across medium and large organizations today: many sources feed a lake, the lake feeds a warehouse or lakehouse, the warehouse serves reporting, data science and operational applications, and a governance layer oversees the entire stack.
3.11 Summary
| Concept | Description |
|---|---|
| **Access Patterns** | |
| OLTP | Transaction-focused: fast inserts, updates and short reads |
| OLAP | Analysis-focused: long aggregations over historical data |
| **Storage Layers** | |
| Data warehouse | Integrated, non-volatile store of structured data |
| Data mart | Subject-specific subset of the warehouse |
| Data lake | Raw files of any format; cheap, schema-on-read |
| Lakehouse | Warehouse governance on top of lake storage (Delta, Iceberg, Hudi) |
| **Pipeline Patterns** | |
| ETL | Transform before loading; traditional pattern |
| ELT | Load raw first, transform in warehouse; cloud-native pattern |
| **Governance** | |
| MDM | Master Data Management: single authoritative view of core entities |
| Data owner | Accountable business executive for a dataset |
| Data steward | Day-to-day quality and definitions |
| CDO | Chief Data Officer: enterprise-wide data accountability |
| **Privacy Regulations** | |
| DPDP Act 2023 | India's consent and purpose-limitation law for personal data |
| GDPR | EU-wide privacy regulation; applies extraterritorially |
An organization’s ability to run advanced analytics is a function of how cleanly each of these layers has been built and how well the governance overlay has been enforced. The methods chapters later in the book assume a clean analytical dataset; this chapter names the systems and roles that are responsible for producing it.