4  Big Data

4.1 What Makes Data “Big”

The term “big data” entered common use around the mid-2000s to describe datasets that traditional single-machine databases could no longer process within an acceptable time budget. The threshold is deliberately relative: a 100 GB file was “big” in 2005 and is routine today, while a petabyte-per-day stream remains big in 2026.

Note: An operational definition

Data is considered big when its volume, velocity, or variety exceeds the capacity of a single conventional server to store, ingest, or process it within the time the business needs. The response is typically a distributed architecture that spreads the work across many machines.

The phrase is therefore less a statement about raw size and more a statement about the architecture required. When a business problem forces a move from one server to a cluster, the data has become big for that organization.

4.2 The Five Vs

Doug Laney (META Group, 2001) framed the original three Vs (Volume, Velocity, Variety). Two further Vs (Veracity, Value) were added by IBM and industry practitioners and are now standard.

Note: Volume

The sheer amount of data. UPI processed over 17 billion transactions in a single month in 2025 (NPCI); Aadhaar has over 1.3 billion enrolled identities. These systems generate terabytes to petabytes of data per day.

Note: Velocity

The speed at which data arrives and must be processed. A credit-card fraud system must score a transaction in under 100 milliseconds; a ride-hailing platform must match drivers and riders every few seconds. Velocity drives the need for streaming systems.

Note: Variety

The mix of formats: structured rows, semi-structured JSON, unstructured text, images, audio, video, clickstreams, sensor readings. A single customer-service interaction at a bank may produce all of these simultaneously.

Note: Veracity

The trustworthiness and accuracy of incoming data. Sensor dropouts, duplicated events, spoofed identities, and inconsistent reference data all degrade veracity. Big systems typically accept that a non-trivial fraction of inputs is noisy and must be filtered.

Note: Value

The business outcome the data is expected to deliver. Volume without value is waste. The justification for every big data investment is the marginal business decision the additional data enables: better pricing, lower fraud, tighter inventory, richer personalisation.

4.3 Scale in Units

Data volumes are described on a geometric scale; each unit is a thousand times larger than the previous one.

| Unit | Bytes | Rough intuition |
|---|---|---|
| Gigabyte (GB) | 10^9 | A single high-resolution film |
| Terabyte (TB) | 10^12 | A small organisation’s annual transaction log |
| Petabyte (PB) | 10^15 | A large e-commerce platform’s annual clickstream |
| Exabyte (EB) | 10^18 | A national telecom’s yearly network logs |
| Zettabyte (ZB) | 10^21 | Global annual internet traffic |

Tip: A useful rule of thumb

Workloads up to a few hundred GB still fit a modern single server with enough RAM. Workloads in the low TB range are borderline and often move to a clustered database or warehouse. Workloads in the PB range almost always require a distributed file system and a distributed compute engine.
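As a quick worked check on the scale, here is a minimal Python helper (illustrative, not from the text) that expresses a byte count in the decimal, base-1000 units of the table above:

```python
# Decimal units, base 1000, matching the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_size(n_bytes: float) -> str:
    """Express a byte count in the largest unit with a value >= 1."""
    value, unit = float(n_bytes), UNITS[0]
    for next_unit in UNITS[1:]:
        if value < 1000:
            break
        value /= 1000          # each step up the scale divides by a thousand
        unit = next_unit
    return f"{value:.1f} {unit}"

print(human_size(3e9))     # a film-sized file -> "3.0 GB"
print(human_size(2.5e15))  # a clickstream archive -> "2.5 PB"
```

The same helper makes the rule of thumb concrete: anything that prints in GB fits one server, low TB is borderline, PB needs a cluster.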

4.4 Batch versus Streaming Processing

Once data volumes grow, the processing pattern forks into two families with different latency profiles.

Note: Batch processing

Data is accumulated over a window (an hour, a day) and processed in bulk. Latency is measured in minutes to hours. Suited to overnight reports, model retraining, billing runs. Implementations: Apache Spark batch jobs, Hive, legacy MapReduce.

Note: Stream processing

Each record is processed as it arrives. Latency is measured in milliseconds to seconds. Suited to fraud detection, real-time personalisation, live dashboards, alerting. Implementations: Apache Kafka Streams, Apache Flink, Spark Structured Streaming.
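The contrast can be sketched in a few lines of Python, using a hypothetical per-card spend total: the batch version aggregates a completed window in bulk, while the streaming version updates its state on every arriving event.

```python
from collections import defaultdict

# A tiny event window; card IDs and amounts are illustrative.
events = [("card_1", 120), ("card_2", 999), ("card_1", 40)]

# Batch: accumulate the whole window first, then aggregate in bulk.
def batch_totals(window):
    totals = defaultdict(int)
    for card, amount in window:
        totals[card] += amount
    return dict(totals)

# Streaming: state is updated one record at a time, as each event arrives.
class StreamingTotals:
    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, card, amount):
        self.totals[card] += amount   # current answer after every event
        return self.totals[card]

assert batch_totals(events) == {"card_1": 160, "card_2": 999}

s = StreamingTotals()
latest = [s.on_event(c, a) for c, a in events]   # [120, 999, 160]
```

Both end in the same state; the difference is *when* the answer is available — after the window closes, or after every event.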

Important: The Lambda and Kappa architectures

Many production systems blend the two. The Lambda architecture maintains parallel batch and streaming pipelines and reconciles them at the serving layer. The Kappa architecture uses a single streaming pipeline for both real-time and historical processing, replaying the event log when needed. The trend since 2020 has been towards Kappa-style designs because maintaining two pipelines for the same logic is operationally expensive.

4.5 Distributed Storage

Single machines do not hold petabytes, so storage is spread across many servers with built-in replication.

Note: HDFS (Hadoop Distributed File System)

Splits large files into blocks (typically 128 MB) and replicates each block across three nodes by default. Originally developed at Yahoo to run on clusters of commodity hardware; still widely deployed on-premises.
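These two defaults make a back-of-the-envelope sketch easy. The helper below (illustrative, assuming the default block size and replication factor) computes how many blocks a file occupies and its raw replicated footprint on the cluster:

```python
import math

BLOCK_MB = 128      # default HDFS block size
REPLICATION = 3     # default replication factor

def hdfs_footprint(file_mb):
    """Blocks occupied and raw cluster storage consumed by one file."""
    blocks = math.ceil(file_mb / BLOCK_MB)   # last block may be partial
    raw_mb = file_mb * REPLICATION           # every byte is stored three times
    return blocks, raw_mb

# A 1 GB (1024 MB) file: 8 blocks, 3072 MB of raw cluster storage.
blocks, raw = hdfs_footprint(1024)
print(blocks, raw)
```

The factor-of-three raw cost is one reason object stores, which handle durability internally, changed the economics for greenfield projects.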

Note: Cloud object storage

Amazon S3, Azure Data Lake Storage, Google Cloud Storage. Effectively limitless, pay-per-byte, durable by design. Object stores have largely replaced HDFS for greenfield projects because they decouple storage from compute.

Tip: Open table formats on top of object storage

Delta Lake, Apache Iceberg, and Apache Hudi sit on top of cloud object storage and add the table semantics (ACID transactions, schema evolution, time travel) that classical HDFS-backed Hive tables offered. They are the substrate of the modern lakehouse.

4.6 Distributed Processing: MapReduce to Spark

Processing also has to be spread across many nodes. Two generations of engines dominate.

Note: MapReduce (first generation)

Google’s 2004 paper introduced the Map/Reduce paradigm: decompose a computation into a parallel map step followed by a global reduce step. Hadoop MapReduce popularised the pattern. Writes intermediate results to disk; robust but slow.

Note: Apache Spark (current generation)

An in-memory, DAG-based engine that subsumes MapReduce and adds SQL, streaming, machine-learning (MLlib), and graph-processing (GraphX) libraries. Spark is the de facto standard for distributed batch and machine-learning workloads.

Word count is the canonical illustration of the pattern, and it mirrors a familiar single-machine idiom: in base R, lapply over documents plays the role of a distributed map and table plays the role of a distributed reduceByKey; Spark runs the same two steps spread across a cluster.
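A minimal single-machine sketch of the same word count in Python makes the three phases explicit — the map step is independently parallelisable per document, the shuffle groups by key, and the reduce is parallelisable per key:

```python
from collections import defaultdict

docs = ["big data is big", "data moves fast"]   # illustrative input

# Map: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the pairs by key, as the framework does between phases.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: sum the counts for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

On a cluster, the map and reduce loops run on many nodes at once and the shuffle moves data between them; the logic is unchanged.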

4.7 Streaming Platforms

Stream processing requires a durable, high-throughput pipe that many producers can write to and many consumers can read from independently.

Note: Apache Kafka

A distributed, append-only log. Producers publish events to named topics; consumers subscribe and read at their own pace. Replicated for durability, partitioned for throughput. Kafka is the backbone of event-driven architectures at LinkedIn, Uber, Flipkart, and most large Indian fintechs.
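The core abstraction can be sketched in a few lines of Python — an in-memory, single-partition stand-in (not the Kafka API) showing how independent consumers read the same append-only log at their own pace by tracking their own offsets:

```python
class MiniLog:
    """Toy single-partition log: appended by producers, read at
    independent offsets by any number of consumers."""

    def __init__(self):
        self.records = []   # the log itself: append-only, never mutated
        self.offsets = {}   # consumer name -> next offset to read

    def produce(self, event):
        self.records.append(event)

    def consume(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)  # commit new position
        return batch

log = MiniLog()
for e in ["signup", "click", "purchase"]:
    log.produce(e)

first = log.consume("billing")        # reads all three events at once
a1 = log.consume("analytics", 2)      # an independent consumer, own pace
a2 = log.consume("analytics", 2)      # resumes exactly where it left off
```

Real Kafka adds partitioning for throughput and replication for durability, but the decoupling shown here — producers never wait for consumers — is the whole idea.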

Note: Apache Flink and Spark Structured Streaming

Stream-processing engines that consume from Kafka (and other sources), apply transformations, windowed aggregations, and joins, and emit results downstream. Flink is preferred for sub-second event-time processing; Structured Streaming is preferred where the team is already invested in Spark.

Tip: Exactly-once semantics

Modern streaming stacks (Kafka + Flink, Kafka + Spark Structured Streaming with checkpointing) can guarantee that each event affects downstream state exactly once, even in the presence of failures. This has widened the set of workloads that can be trusted to a streaming pipeline rather than reconciled in batch.
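The intuition behind the guarantee can be sketched in plain Python: if the consumer's offset is committed atomically with its state, replaying the log after a failure cannot double-count an event. (Illustrative only — real stacks achieve the atomicity with Kafka transactions or engine checkpoints.)

```python
events = [10, 20, 30, 40]               # the durable event log

# Offset and result live in one structure, standing in for a state
# store that is updated atomically.
state = {"offset": 0, "total": 0}

def process_from(log, state):
    """Resume from the committed offset; skip anything already applied."""
    for i in range(state["offset"], len(log)):
        state["total"] += log[i]
        state["offset"] = i + 1          # committed together with the total

process_from(events, state)              # first run: total = 100
process_from(events, state)              # replay after a "failure": no change
assert state["total"] == 100
```

Without the committed offset, the replay would count every event twice — exactly the reconciliation problem that used to be pushed to a batch lane.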

4.8 NoSQL Database Families

When the access pattern or the data format does not fit a relational table, one of four NoSQL families is typically used.

| Family | Model | Examples | Good for |
|---|---|---|---|
| Key-value | Simple key to value lookup | Redis, DynamoDB | Session stores, caches, feature stores |
| Document | Nested JSON-like documents | MongoDB, Couchbase | Content management, mobile apps |
| Wide-column | Partitioned column families | Cassandra, HBase, ScyllaDB | Time-series, IoT, massive writes |
| Graph | Nodes and edges | Neo4j, Amazon Neptune | Fraud rings, recommendations, knowledge graphs |

Important: Polyglot persistence

Large systems rarely pick one. A single application at an e-commerce company might use Postgres for orders, MongoDB for the product catalog, Redis for sessions, Cassandra for clickstreams, and Neo4j for recommendations. Each database is chosen for its fit to one access pattern.
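Of the four families, the graph case benefits most from a concrete sketch, because its access pattern — traversal — is the one a relational join handles worst. The toy Python below (all account names hypothetical) finds a "fraud ring" as the connected component around a suspect account, using breadth-first search over an adjacency map:

```python
from collections import deque

# Accounts linked by shared devices or phone numbers — edges in a
# fraud graph. Names are illustrative.
edges = {
    "acct_a": ["acct_b"],
    "acct_b": ["acct_a", "acct_c"],
    "acct_c": ["acct_b"],
    "acct_x": ["acct_y"],
    "acct_y": ["acct_x"],
}

def ring(start, edges):
    """Breadth-first search: every account reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(sorted(ring("acct_a", edges)))  # ['acct_a', 'acct_b', 'acct_c']
```

A graph database runs exactly this kind of traversal as a first-class query, at billions of edges, which is why it earns a slot alongside the relational store rather than replacing it.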

4.9 Big Data in Practice

A short survey of use cases that only become feasible at big-data scale.

Note: Real-time fraud detection

A card transaction is scored against a streaming feature store (recent velocity, merchant risk, device fingerprint) and either approved or challenged in under 100 milliseconds. Deployed at every major bank and card network.

Note: Recommendation and personalisation

Every view, click and purchase updates user and item embeddings that feed homepage and search ranking. Flipkart, Amazon India, Hotstar, and Spotify operate recommendation systems at this scale.

Note: Telematics and predictive maintenance

Sensor streams from vehicles, elevators, jet engines, and factory equipment are aggregated into anomaly-detection models that flag components likely to fail. GE Aviation and Siemens run established examples; Ola Electric publishes similar work for its scooter fleet.

Note: Regulatory reporting

RBI, SEBI, and GSTN require the aggregation of transactions across millions of counterparties for supervisory reports. The GSTN alone processes over a billion invoices per month, which is a big-data problem by any definition.

4.10 Limitations and Challenges

Not every problem benefits from a big-data stack.

Warning: Infrastructure and skills cost

A petabyte-scale warehouse, a Spark cluster, and a Kafka deployment each require specialist engineering. Many organizations that adopted Hadoop between 2012 and 2018 later retired it because the operational cost exceeded the value it delivered.

Warning: Privacy and ethical exposure

More data means more personal data. The DPDP Act in India and the GDPR in Europe impose stiff constraints on retention, consent, and cross-border transfer. Analytical projects that ignore these constraints accumulate legal risk rapidly.

Warning: Signal-to-noise ratio

Doubling data volume does not double insight. Many large datasets contain repetitive or low-signal records; disciplined sampling can produce the same analytical conclusions at a fraction of the compute cost.
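A quick simulation makes the point; the order values below are synthetic and the figures illustrative. A 1% random sample of a seeded dataset estimates the population mean to within a tiny fraction of its value:

```python
import random

random.seed(42)

# A synthetic stand-in for a large order-value dataset.
population = [random.gauss(500, 50) for _ in range(100_000)]
full_mean = sum(population) / len(population)

# A 1% simple random sample.
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

# The gap is on the order of the standard error: sigma/sqrt(n) ~ 1.6 here.
print(abs(full_mean - sample_mean))
```

When the estimate from the sample is indistinguishable from the full-data answer for the decision at hand, the remaining 99% of the compute bill buys nothing.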

Important: “Small data” is still the common case

Most management research and most intra-organisational analyses operate on samples of a few hundred to a few thousand observations, for which a laptop, a CSV and R or Python are fully sufficient. Big data is an architectural response to a specific class of problems, not the default setting.

4.11 A Reference Big Data Stack

```mermaid
flowchart TD
  subgraph Sources
    S1[App & web events]
    S2[IoT sensors]
    S3[Transactional DBs]
    S4[External APIs]
  end
  Sources --> K[Kafka<br/>event bus]
  K --> ST[Stream processing<br/>Flink / Spark Streaming]
  K --> L[(Data Lake<br/>S3 / ADLS)]
  ST --> OLSTORE[(Low-latency store<br/>Cassandra / Redis)]
  L --> LH[Lakehouse<br/>Delta / Iceberg / Hudi]
  LH --> BI[BI & reporting]
  LH --> DS[Data science & ML<br/>Spark, Python, R]
  OLSTORE --> APP[Real-time applications]
```

The reference stack splits into two lanes: a low-latency lane (Kafka → stream processor → key-value store → application) that serves real-time decisions, and a batch lane (Kafka or direct ingestion → lake → lakehouse → BI and data-science layers) that serves analysis and training. Both lanes feed the same underlying event log, which is why modern designs converge on Kappa-style architectures.

4.12 Summary

Summary of data concepts introduced in this chapter

| Concept | Description |
|---|---|
| **The 5 Vs** | |
| Volume | Sheer amount of data, TB to PB or more |
| Velocity | Speed at which data arrives and must be processed |
| Variety | Mix of structured, semi-structured and unstructured formats |
| Veracity | Trustworthiness and accuracy of incoming data |
| Value | Business outcome the data is expected to deliver |
| **Processing Modes** | |
| Batch | Accumulated over a window; latency minutes to hours |
| Streaming | Records processed as they arrive; latency milliseconds to seconds |
| **Distributed Storage** | |
| HDFS | Hadoop Distributed File System; on-premises, replicated blocks |
| Object store | Amazon S3, Azure Blob, GCS; cloud, decoupled from compute |
| **Processing Engines and Messaging** | |
| MapReduce | First-generation batch engine; disk-based, robust but slow |
| Spark | In-memory, DAG-based engine; SQL, ML and streaming libraries |
| Kafka | Distributed append-only event log; backbone of streaming stacks |
| **NoSQL Families** | |
| Key-value | Redis, DynamoDB; simple key lookup, caches and feature stores |
| Document | MongoDB, Couchbase; nested JSON-like records |
| Wide-column | Cassandra, HBase; partitioned column families, heavy writes |
| Graph | Neo4j, Neptune; nodes and edges, fraud rings, recommendations |

Big data is not a single technology but a stack of interlocking choices made in response to volume, velocity, and variety that no single server can absorb. The rest of this book mostly operates on the post-processed, rectangular output of such stacks, so the names in this chapter explain where that rectangular dataset came from and why it arrived in the shape it did.