3 Data in Organizations
3.1 Why Organizations Need Data Systems
A modern organization generates data at every touchpoint: a customer opens an app, a warehouse scans a carton, a plant sensor logs a temperature, an HR system records a leave request. Without deliberate systems, this output is fragmented across applications, stored in incompatible formats, and effectively invisible to decision makers. Data systems are the organizational plumbing that captures, moves, stores, and serves this output so that it can support both daily operations and longer-horizon analysis.
Every organization needs data infrastructure that can simultaneously (a) keep the business running by serving live transactions, and (b) support reporting and analysis by preserving a consistent historical view. These two jobs pull in opposite directions, which is why most organizations end up with two layers of data systems rather than one.
3.2 Operational versus Analytical Data
The single most important architectural distinction inside an organization is between systems that run the business and systems that analyse it.
**OLTP (online transaction processing).** Systems optimised for short, frequent, read-write operations. A single transaction touches few rows and must complete in milliseconds. Examples: the core banking system at HDFC, the order-placement table at Flipkart, the POS database at Reliance Retail. OLTP schemas are highly normalised to avoid update anomalies.
**OLAP (online analytical processing).** Systems optimised for long, read-heavy queries that aggregate millions of rows. A single query may scan an entire year of sales. Examples: the enterprise data warehouse at Infosys, the revenue cube at Airtel, the supply-chain warehouse at Maruti Suzuki. OLAP schemas are denormalised (star, snowflake) to speed aggregation.
A long aggregation query on the transactional database competes for the same resources that serve live customer requests. Organizations therefore replicate operational data on a periodic cadence into a separate analytical store so that the two workloads never interfere.
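The contrast between the two access patterns can be sketched against a single toy table. This is a minimal illustration using Python's built-in `sqlite3`; the table, column names, and figures are invented, not taken from any real system.

```python
import sqlite3

# In-memory stand-in for a transactional table that both workloads hit.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount      REAL,
        order_date  TEXT
    )
""")
conn.executemany(
    "INSERT INTO orders (customer_id, amount, order_date) VALUES (?, ?, ?)",
    [(1, 250.0, "2024-01-15"), (2, 99.0, "2024-02-03"), (1, 40.0, "2024-02-10")],
)

# OLTP-style access: a point read touching one row by primary key.
row = conn.execute("SELECT amount FROM orders WHERE order_id = ?", (1,)).fetchone()

# OLAP-style access: an aggregation that must scan the whole table.
total, n = conn.execute("SELECT SUM(amount), COUNT(*) FROM orders").fetchone()
print(row[0], total, n)  # 250.0 389.0 3
```

On three rows the difference is invisible; on a few hundred million, the scan in the second query is exactly the workload that gets moved to a separate analytical store.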
3.3 Data Across Functional Areas
Each functional area of the organization is both a producer and a consumer of data.
| Function | Systems that generate data | Analytical uses |
|---|---|---|
| Sales and Marketing | CRM (Salesforce), campaign tools, web analytics | Segmentation, churn prediction, campaign ROI |
| Finance | ERP (SAP, Oracle Fusion), GL, billing | Variance analysis, forecasting, audit |
| Operations and Supply Chain | WMS, TMS, MES, IoT sensors | Demand planning, OEE, route optimisation |
| Human Resources | HRIS (Workday, SuccessFactors) | Attrition modelling, workforce planning |
| Customer Service | Call-centre, ticketing, chat (Zendesk) | First-call resolution, sentiment, staffing |
| Risk and Compliance | Core banking, trading systems, KYC | Fraud detection, credit scoring, AML |
The highest-impact analyses usually require joining data that originated in different functions. Linking marketing spend to sales revenue to customer-service tickets tells a richer story than any one of those datasets alone. The ability to make those joins depends entirely on shared keys (customer ID, product code) that are managed centrally.
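A minimal sketch of such a cross-functional join, again with `sqlite3`; the three tables stand in for marketing, sales, and service systems, and every name and number is invented for illustration. The join works only because all three tables carry the same `customer_id` key.

```python
import sqlite3

# Three toy datasets from different functions, linked by a shared key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE marketing_spend (customer_id INTEGER, spend REAL);
    CREATE TABLE sales_revenue   (customer_id INTEGER, revenue REAL);
    CREATE TABLE service_tickets (customer_id INTEGER, tickets INTEGER);
    INSERT INTO marketing_spend VALUES (1, 50.0), (2, 80.0);
    INSERT INTO sales_revenue   VALUES (1, 400.0), (2, 120.0);
    INSERT INTO service_tickets VALUES (1, 0), (2, 5);
""")

# Spend-to-revenue ratio alongside ticket volume, per customer.
rows = conn.execute("""
    SELECT m.customer_id, s.revenue / m.spend AS roi, t.tickets
    FROM marketing_spend m
    JOIN sales_revenue   s USING (customer_id)
    JOIN service_tickets t USING (customer_id)
    ORDER BY m.customer_id
""").fetchall()
print(rows)  # [(1, 8.0, 0), (2, 1.5, 5)]
```

If the three systems used uncoordinated identifiers, the `USING (customer_id)` clause would silently match nothing, which is precisely the failure that master data management (section 3.6) exists to prevent.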
3.4 Warehouses, Marts, Lakes and Lakehouses
Analytical data is held in one of four architectural patterns that have emerged over the last three decades.
**Data warehouse.** A centralised, integrated, subject-oriented repository of structured data drawn from multiple source systems. Bill Inmon’s classical definition emphasises that it is “integrated, non-volatile and time-variant”. Implementations include Teradata, Oracle Exadata, Amazon Redshift, Google BigQuery, Snowflake.
**Data mart.** A subject-specific subset of the warehouse, usually serving a single department (finance mart, sales mart). Smaller, faster to query, and easier to permission than the central warehouse.
**Data lake.** A storage layer that holds raw files of any format (structured, semi-structured, unstructured) at low cost. Implementations: Amazon S3 with AWS Glue catalog, Azure Data Lake Storage, Hadoop HDFS. The lake is schema-on-read: structure is imposed by the query, not by the loader.
**Lakehouse.** A convergent pattern that places warehouse-style governance and performance directly on top of lake storage, using open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi. Databricks and Snowflake are both moving toward this model.
Warehouses and marts are preferred for reliable, governed reporting on structured business metrics. Lakes are preferred for machine-learning projects that need access to raw logs, images, and text. Most large organizations now operate both, with a lakehouse increasingly positioned as the single surface that spans the two.
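The lake's schema-on-read idea can be shown in miniature: the "lake" below is just raw JSON-lines text on which the loader imposed no structure, and each query decides at read time which fields it cares about. The event and field names are invented for illustration.

```python
import json

# Raw, schema-less "lake" contents: records need not share fields.
raw = """\
{"event": "click", "user": "u1", "ts": 1700000000}
{"event": "purchase", "user": "u2", "amount": 250.0}
{"event": "click", "user": "u1"}
"""

# Schema-on-read: structure is imposed by the query, not the loader,
# so missing fields are handled per query rather than at load time.
records = [json.loads(line) for line in raw.splitlines()]
clicks = sum(1 for r in records if r.get("event") == "click")
revenue = sum(r.get("amount", 0.0) for r in records)
print(clicks, revenue)  # 2 250.0
```

A warehouse loader would have rejected the third record for its missing timestamp; the lake accepts it and leaves the decision to whoever queries it later.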
3.5 ETL and ELT Pipelines
Data does not move from source systems to the warehouse by itself. A pipeline extracts it, transforms it to match the target schema, and loads it into the analytical store.
```mermaid
flowchart LR
A[Source systems<br/>ERP, CRM, logs] --> B[Extract]
B --> C{Pattern}
C -- ETL --> D[Transform<br/>in staging]
D --> E[Load into<br/>warehouse]
C -- ELT --> F[Load raw into<br/>cloud warehouse]
F --> G[Transform<br/>with SQL/dbt]
E --> H[Analytics layer]
G --> H
```
**ETL (extract, transform, load).** The traditional pattern. Data is extracted from the source, cleaned and transformed on a dedicated server, and then loaded into the warehouse. Tools: Informatica, Talend, IBM DataStage. Favoured when compute in the warehouse is expensive.
**ELT (extract, load, transform).** The cloud-native variant. Raw data is loaded directly into the warehouse first, then transformed in place using SQL. Tools: Fivetran or Airbyte for ingestion, dbt for transformations, Snowflake or BigQuery as compute. Favoured when warehouse compute is cheap and elastic.
Production pipelines run on schedulers that manage dependencies, retries and alerts. Apache Airflow, Prefect, and Dagster are the dominant open-source orchestrators.
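The ELT pattern can be sketched end to end with `sqlite3` standing in for a cloud warehouse: raw rows are loaded as-is, then reshaped in place with SQL, much as dbt models would. Table names, the messy source formatting, and the figures are all invented for illustration.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in "cloud warehouse"

# L: load the raw extract untransformed, messy formatting and all.
wh.execute("CREATE TABLE raw_sales (sold_on TEXT, amount TEXT)")
wh.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("2024-01-05", " 100.0 "), ("2024-01-20", "250"), ("2024-02-02", "75.5")],
)

# T: transform inside the warehouse with SQL: clean, cast, aggregate.
wh.executescript("""
    CREATE TABLE monthly_sales AS
    SELECT substr(sold_on, 1, 7) AS month,
           SUM(CAST(trim(amount) AS REAL)) AS total
    FROM raw_sales
    GROUP BY month
    ORDER BY month;
""")
rows = wh.execute("SELECT * FROM monthly_sales").fetchall()
print(rows)  # [('2024-01', 350.0), ('2024-02', 75.5)]
```

In the ETL variant, the trimming and casting would instead happen on a staging server before anything touched `monthly_sales`; the end state is the same, but raw_sales would never exist in the warehouse.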
3.6 Master Data Management
Operational systems tend to duplicate the same entities under different identifiers. A single retail customer might appear as CUST_00421 in the POS, as email arjun@example.com in the loyalty app, and as phone number +91-98xxxxxxxx in the contact centre. Without reconciliation, every cross-system analysis is wrong.
Master Data Management is the set of processes and tools that maintain a single authoritative version of core entities (customer, product, employee, supplier, location) and propagate that version to every consuming system. Implementations include Informatica MDM, Reltio, Microsoft MDS.
Without a reliable golden customer ID, customer-lifetime-value, churn, and cross-sell analyses become structurally unreliable. Most organizations that undertake advanced analytics find that master-data quality, not algorithmic sophistication, is the binding constraint.
3.7 Data Governance, Ownership and Stewardship
Governance is the set of policies that determine who may collect, access, change, and delete data, and how those decisions are enforced.
- **Data owner:** accountable for a dataset (usually a senior business executive).
- **Data steward:** responsible for day-to-day quality and definitions.
- **Data custodian:** operates the storage and access layers (typically IT).
- **Chief Data Officer:** enterprise-wide accountability for data as an asset.
Role-based access control (RBAC) restricts who can read or modify each dataset. Sensitive columns are often masked or tokenised. Every access should be logged for audit.
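A toy version of RBAC with column masking might look like the following. The roles, dataset names, access levels, and salt are all invented; a real deployment would hold the permission table in the warehouse's own grant system and the salt in a secrets store.

```python
import hashlib

# Illustrative permission table: role -> dataset -> access level.
PERMISSIONS = {
    "analyst": {"sales_mart": "masked"},
    "steward": {"sales_mart": "full"},
}
SENSITIVE = {"email", "phone"}
SALT = "rotate-me"  # placeholder; keep real salts out of source code

def tokenise(value: str) -> str:
    """Deterministic pseudonym: same input, same token, original unrecoverable."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

def read_row(role: str, dataset: str, row: dict) -> dict:
    level = PERMISSIONS.get(role, {}).get(dataset)
    if level is None:
        raise PermissionError(f"{role} may not read {dataset}")
    if level == "full":
        return row
    # Masked access: sensitive columns are tokenised before release.
    return {k: (tokenise(v) if k in SENSITIVE else v) for k, v in row.items()}

row = {"customer_id": 421, "email": "arjun@example.com", "amount": 250.0}
masked = read_row("analyst", "sales_mart", row)
print(masked["customer_id"], masked["email"] != row["email"])  # 421 True
```

Because the tokenisation is deterministic, the analyst can still join and count by the masked email; they simply cannot recover the address itself.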
A data catalog (Alation, Collibra, DataHub, Atlan) documents what datasets exist, what each column means, who owns it, and how fresh it is. A functioning catalog is the difference between data that can be reused and data that has to be rediscovered on every project.
3.8 Privacy and Regulation
Organizations do not hold data unconditionally. Regulators impose substantive constraints on how personal data may be collected, stored, and processed.
**Digital Personal Data Protection (DPDP) Act, 2023 (India).** The DPDP Act requires notice-and-consent for processing personal data of Indian residents, mandates purpose limitation and data minimisation, creates rights of correction and erasure, and prescribes obligations for Significant Data Fiduciaries. Penalties can reach INR 250 crore per breach.
**General Data Protection Regulation (GDPR, EU).** Applies extraterritorially to any organization processing personal data of EU residents. Core principles: lawfulness, fairness, transparency, purpose limitation, data minimisation, accuracy, storage limitation, integrity and confidentiality, accountability. Penalties can reach 4 percent of global annual turnover.
Personally identifiable fields must be either removed or pseudonymised before they enter an analytical sandbox. Consent records must be honoured on every downstream use, and deletion requests must propagate to derived datasets. Governance is therefore a direct input to the design of every pipeline, not an afterthought.
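The requirement that deletion propagate to derived datasets is worth making concrete. In this sketch, the source table and every derived dataset are plain dictionaries keyed by customer ID; all names and scores are invented, and a real pipeline would track derived tables through its lineage metadata rather than a hard-coded list.

```python
# Source-of-truth table plus derived datasets that still carry the key.
source = {421: {"name": "Arjun"}, 422: {"name": "Meera"}}
derived = {
    "churn_scores": {421: 0.8, 422: 0.2},
    "clv_segments": {421: "high", 422: "low"},
}

def erase(customer_id: int) -> None:
    """Honour an erasure request in the source and in all derived datasets."""
    source.pop(customer_id, None)
    for table in derived.values():
        table.pop(customer_id, None)

erase(421)
print(421 in source, any(421 in t for t in derived.values()))  # False False
```

Deleting only from `source` would satisfy a naive reading of the request while the churn score kept the person alive in every downstream report, which is exactly why governance must be designed into the pipeline.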
3.9 Data Roles in Organizations
A functioning data organization distributes responsibility across specialist roles.
| Role | Primary responsibility |
|---|---|
| Data engineer | Builds and maintains pipelines, schemas and storage |
| Data analyst | Produces dashboards, KPIs and descriptive analysis |
| Business intelligence developer | Designs reports in Power BI, Tableau, Looker |
| Data scientist | Builds predictive and prescriptive models |
| ML engineer | Productionises and monitors models |
| Data architect | Designs the enterprise data model and integrations |
| Data steward | Defines and monitors data quality |
| Chief Data Officer | Sets strategy and accountability for data as an asset |
The boundary between analyst, scientist and engineer is fluid and organization-specific. What matters is that extraction, cleaning, modelling, visualisation, and governance are all performed by someone, not that each is performed by a differently named specialist.
3.10 A Reference Data Architecture
```mermaid
flowchart TD
subgraph Sources
A1[ERP]
A2[CRM]
A3[Web & mobile logs]
A4[IoT & sensors]
A5[External feeds]
end
Sources --> B[Ingestion<br/>Fivetran / Kafka / Airflow]
B --> C[(Data Lake<br/>S3 / ADLS)]
C --> D[Warehouse /<br/>Lakehouse<br/>Snowflake / Databricks]
D --> E1[BI Layer<br/>Power BI, Tableau]
D --> E2[Data Science<br/>R, Python, Spark]
D --> E3[Applications<br/>APIs, reverse ETL]
F[Governance & Catalog<br/>Collibra / DataHub] -.-> C
F -.-> D
```
The left-to-right flow is almost universal across medium and large organizations today: many sources feed a lake, the lake feeds a warehouse or lakehouse, the warehouse serves reporting, data science and operational applications, and a governance layer oversees the entire stack.
3.11 Summary
| Concept | Description |
|---|---|
| **Access Patterns** | |
| OLTP | Transaction-focused: fast inserts, updates and short reads |
| OLAP | Analysis-focused: long aggregations over historical data |
| **Storage Layers** | |
| Data warehouse | Integrated, non-volatile store of structured data |
| Data mart | Subject-specific subset of the warehouse |
| Data lake | Raw files of any format; cheap, schema-on-read |
| Lakehouse | Warehouse governance on top of lake storage (Delta, Iceberg, Hudi) |
| **Pipeline Patterns** | |
| ETL | Transform before loading; traditional pattern |
| ELT | Load raw first, transform in warehouse; cloud-native pattern |
| **Governance** | |
| MDM | Master Data Management: single authoritative view of core entities |
| Data owner | Accountable business executive for a dataset |
| Data steward | Day-to-day quality and definitions |
| CDO | Chief Data Officer: enterprise-wide data accountability |
| **Privacy Regulations** | |
| DPDP Act 2023 | India's consent and purpose-limitation law for personal data |
| GDPR | EU-wide privacy regulation; applies extraterritorially |
An organization’s ability to run advanced analytics is a function of how cleanly each of these layers has been built and how well the governance overlay has been enforced. The methods chapters later in the book assume a clean analytical dataset; this chapter names the systems and roles that are responsible for producing it.