4 Big Data
4.1 What Makes Data “Big”
The term “big data” entered common use around the mid-2000s to describe datasets that traditional single-machine databases could no longer process within an acceptable time budget. The threshold is deliberately relative: a 100 GB file was “big” in 2005 and is routine today, while a petabyte-per-day stream remains big in 2026.
Data is considered big when its volume, velocity, or variety exceeds the capacity of a single conventional server to store, ingest, or process it within the time the business needs. The response is typically a distributed architecture that spreads the work across many machines.
The phrase is therefore less a statement about raw size and more a statement about the architecture required. When a business problem forces a move from one server to a cluster, the data has become big for that organization.
4.2 The Five Vs
Doug Laney (META Group, 2001) framed the original three Vs (Volume, Velocity, Variety). Two further Vs (Veracity, Value) were added by IBM and industry practitioners and are now standard.
**Volume.** The sheer amount of data. UPI processed over 17 billion transactions in a single month in 2025 (NPCI); Aadhaar has over 1.3 billion enrolled identities. These systems generate terabytes to petabytes of data per day.
**Velocity.** The speed at which data arrives and must be processed. A credit-card fraud system must score a transaction in under 100 milliseconds; a ride-hailing platform must match drivers and riders every few seconds. Velocity drives the need for streaming systems.
**Variety.** The mix of formats: structured rows, semi-structured JSON, unstructured text, images, audio, video, clickstreams, sensor readings. A single customer-service interaction at a bank may produce all of these simultaneously.
**Veracity.** The trustworthiness and accuracy of incoming data. Sensor dropouts, duplicated events, spoofed identities, and inconsistent reference data all degrade veracity. Big-data systems typically accept that a non-trivial fraction of inputs is noisy and must be filtered.
**Value.** The business outcome the data is expected to deliver. Volume without value is waste. The justification for every big data investment is the marginal business decision the additional data enables: better pricing, lower fraud, tighter inventory, richer personalisation.
4.3 Scale in Units
Data volumes are described on a geometric scale; each unit is a thousand times larger than the previous one.
| Unit | Bytes | Rough intuition |
|---|---|---|
| Gigabyte (GB) | 10^9 | A single high-resolution film |
| Terabyte (TB) | 10^12 | A small organisation’s annual transaction log |
| Petabyte (PB) | 10^15 | A large e-commerce platform’s annual clickstream |
| Exabyte (EB) | 10^18 | A national telecom’s yearly network logs |
| Zettabyte (ZB) | 10^21 | Global annual internet traffic |
Workloads up to a few hundred GB still fit a modern single server with enough RAM. Workloads in the low TB range are borderline and often move to a clustered database or warehouse. Workloads in the PB range almost always require a distributed file system and a distributed compute engine.
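The thousand-fold ladder is easy to mechanise. A minimal sketch in Python, using the decimal (power-of-1000) units from the table above:

```python
# Decimal (SI) units, each 1,000x the previous, matching the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_size(n_bytes: float) -> str:
    """Format a raw byte count using power-of-1000 units."""
    for unit in UNITS:
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} YB"

print(human_size(10**12))       # 1.0 TB
print(human_size(3 * 10**15))   # 3.0 PB
```

Note these are decimal units; storage vendors and operating systems sometimes report binary units (1 GiB = 2^30 bytes) instead, which differ by about 7% at the GB scale.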
4.4 Batch versus Streaming Processing
Once data volumes grow, the processing pattern forks into two families with different latency profiles.
**Batch.** Data is accumulated over a window (an hour, a day) and processed in bulk. Latency is measured in minutes to hours. Suited to overnight reports, model retraining, billing runs. Implementations: Apache Spark batch jobs, Hive, legacy MapReduce.
**Streaming.** Each record is processed as it arrives. Latency is measured in milliseconds to seconds. Suited to fraud detection, real-time personalisation, live dashboards, alerting. Implementations: Apache Kafka Streams, Apache Flink, Spark Structured Streaming.
Many production systems blend the two. The Lambda architecture maintains parallel batch and streaming pipelines and reconciles them at the serving layer. The Kappa architecture uses a single streaming pipeline for both real-time and historical processing, replaying the event log when needed. The trend since 2020 has been towards Kappa-style designs because maintaining two pipelines for the same logic is operationally expensive.
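The Kappa claim, that one piece of logic can serve both latency profiles, can be illustrated with a toy aggregator in plain Python (an illustrative sketch, not a real engine; the event shapes are invented):

```python
from collections import defaultdict

def make_aggregator():
    """Running sum per key; the same logic serves both modes."""
    totals = defaultdict(int)
    def update(event):
        totals[event["key"]] += event["amount"]
        return dict(totals)
    return update

log = [{"key": "upi", "amount": 5},
       {"key": "card", "amount": 3},
       {"key": "upi", "amount": 2}]

streaming = make_aggregator()
for event in log:                  # events arrive one at a time
    live_view = streaming(event)   # always-current answer

batch = make_aggregator()
for event in log:                  # replay the full log later
    report = batch(event)

print(live_view)              # {'upi': 7, 'card': 3}
print(report == live_view)    # True: one logic, two latency profiles
```

The Lambda architecture, by contrast, would implement `update` twice, once in a batch framework and once in a streaming one, and reconcile the two answers at serving time.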
4.5 Distributed Storage
Single machines do not hold petabytes, so storage is spread across many servers with built-in replication.
**HDFS (Hadoop Distributed File System).** Splits large files into blocks (typically 128 MB) and replicates each block across three nodes by default. Originally developed at Yahoo, following the design of Google's GFS, to run commodity-hardware data centres; still widely deployed on-premise.
**Cloud object stores.** Amazon S3, Azure Data Lake Storage, Google Cloud Storage. Effectively limitless, pay-per-byte, durable by design. Object stores have largely replaced HDFS for greenfield projects because they decouple storage from compute.
**Open table formats.** Delta Lake, Apache Iceberg, and Apache Hudi sit on top of cloud object storage and add the table semantics (ACID transactions, schema evolution, time travel) that raw object storage, and classical HDFS-backed Hive tables before it, lacked. They are the substrate of the modern lakehouse.
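The arithmetic of block splitting and replication is worth seeing once. A sketch, assuming the 128 MB default block size and three-way replication described above:

```python
import math

BLOCK = 128 * 1024**2   # 128 MiB block size (the HDFS default)
REPLICAS = 3            # default replication factor

def hdfs_footprint(file_bytes: int):
    """Blocks a file splits into, and total bytes actually stored
    once every block is copied to three nodes."""
    blocks = math.ceil(file_bytes / BLOCK)
    return blocks, file_bytes * REPLICAS

blocks, stored = hdfs_footprint(1024**4)   # a 1 TiB file
print(blocks)              # 8192 blocks to track and place
print(stored / 1024**4)    # 3.0 TiB of raw disk consumed
```

The 3x raw-storage overhead is the price of surviving node failures without losing data; erasure coding in later HDFS versions and in object stores reduces it.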
4.6 Distributed Processing: MapReduce to Spark
Processing also has to be spread across many nodes. Two generations of engines dominate.
**MapReduce.** Google’s 2004 paper introduced the Map/Reduce paradigm: decompose a computation into a parallel map step followed by a global reduce step. Hadoop MapReduce popularised the pattern. It writes intermediate results to disk; robust but slow.
**Apache Spark.** An in-memory, DAG-based engine that subsumes MapReduce and adds SQL, streaming, machine learning (MLlib) and graph processing (GraphX) libraries. Spark is the de-facto standard for distributed batch and machine-learning workloads.
In base-R terms, a Spark word-count has a direct single-machine analogue: lapply over chunks of text corresponds to a distributed map, and table corresponds to a distributed reduceByKey.
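The map/shuffle/reduce decomposition can likewise be emulated on one machine in plain Python; the sketch below shows the three phases of a word count (the paradigm, not Spark's API):

```python
from collections import defaultdict
from itertools import chain

docs = ["big data is big", "data moves fast"]

# Map: each document emits (word, 1) pairs -- runs in parallel per split.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in docs
)

# Shuffle: group the pairs by key (the framework does this between phases).
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the counts per word -- runs in parallel per key.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

In a real cluster the map tasks run on the nodes holding the input blocks, and the shuffle moves data across the network, which is usually the expensive step.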
4.7 Streaming Platforms
Stream processing requires a durable, high-throughput pipe that many producers can write to and many consumers can read from independently.
**Apache Kafka.** A distributed, append-only log. Producers publish events to named topics; consumers subscribe and read at their own pace. Replicated for durability, partitioned for throughput. Kafka is the backbone of event-driven architectures at LinkedIn, Uber, Flipkart, and most large Indian fintechs.
**Flink and Spark Structured Streaming.** Stream-processing engines that consume from Kafka (and other sources), apply transformations, windowed aggregations, and joins, and emit results downstream. Flink is preferred for sub-second event-time processing; Structured Streaming is preferred where the team is already invested in Spark.
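Kafka's core abstraction, an append-only log read at each consumer's own pace, can be modelled in a few lines. A deliberately simplified sketch (one partition, no replication; class and names invented for illustration):

```python
class TopicPartition:
    """Toy model of one Kafka topic partition: an append-only list of
    events, with each consumer tracking its own read offset."""
    def __init__(self):
        self.log = []
        self.offsets = {}            # consumer name -> next offset to read

    def produce(self, event):
        self.log.append(event)       # append only; nothing is overwritten

    def consume(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

tp = TopicPartition()
for e in ["signup", "click", "purchase"]:
    tp.produce(e)

print(tp.consume("fraud-scorer"))    # ['signup', 'click', 'purchase']
print(tp.consume("analytics", 2))    # ['signup', 'click'] -- own pace
print(tp.consume("analytics", 2))    # ['purchase']
```

Because the log is never mutated, a slow consumer, a new consumer, or a Kappa-style reprocessing job can all read the same history independently.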
Modern streaming stacks (Kafka + Flink, Kafka + Spark Structured Streaming with checkpointing) can guarantee that each event affects downstream state exactly once, even in the presence of failures. This changed what could be trusted to a streaming pipeline rather than reconciled in batch.
4.8 NoSQL Database Families
When the access pattern or the data format does not fit a relational table, one of four NoSQL families is typically used.
| Family | Model | Examples | Good for |
|---|---|---|---|
| Key-value | Simple key to value lookup | Redis, DynamoDB | Session stores, caches, feature stores |
| Document | Nested JSON-like documents | MongoDB, Couchbase | Content management, mobile apps |
| Wide-column | Partitioned column families | Cassandra, HBase, ScyllaDB | Time-series, IoT, massive writes |
| Graph | Nodes and edges | Neo4j, Amazon Neptune | Fraud rings, recommendations, knowledge graphs |
Large systems rarely pick one. A single application at an e-commerce company might use Postgres for orders, MongoDB for the product catalog, Redis for sessions, Cassandra for clickstreams, and Neo4j for recommendations. Each database is chosen for its fit to one access pattern.
4.9 Big Data in Practice
A short survey of use cases that only become feasible at big-data scale.
**Real-time fraud detection.** A card transaction is scored against a streaming feature store (recent velocity, merchant risk, device fingerprint) and either approved or challenged in under 100 milliseconds. Deployed at every major bank and card network.
**Recommendation and personalisation.** Every view, click and purchase updates user and item embeddings that feed homepage and search ranking. Flipkart, Amazon India, Hotstar, and Spotify operate recommendation systems at this scale.
**Predictive maintenance.** Sensor streams from vehicles, elevators, jet engines, and factory equipment are aggregated into anomaly-detection models that flag components likely to fail. GE Aviation and Siemens run established examples; Ola Electric publishes similar work for its scooter fleet.
**Regulatory reporting.** RBI, SEBI, and GSTN require the aggregation of transactions across millions of counterparties for supervisory reports. The GSTN alone processes over a billion invoices per month, which is a big-data problem by any definition.
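One of the streaming features mentioned, recent transaction velocity, reduces to a sliding-window count per card. A minimal single-machine sketch (the class and window size are illustrative, not any particular vendor's API):

```python
from collections import deque

class VelocityFeature:
    """Count of a card's transactions in the trailing window --
    a sketch of the 'recent velocity' feature described above."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.times = {}   # card id -> deque of recent event timestamps

    def update(self, card, ts):
        q = self.times.setdefault(card, deque())
        q.append(ts)
        while q and ts - q[0] > self.window:   # evict expired events
            q.popleft()
        return len(q)     # feature value fed to the fraud model

vf = VelocityFeature(window_seconds=60)
print(vf.update("card-1", ts=0))     # 1
print(vf.update("card-1", ts=30))    # 2
print(vf.update("card-1", ts=100))   # 1 -- earlier events expired
```

In production this state lives in a distributed store keyed by card, and the update runs inside the stream processor so the model sees a fresh value within milliseconds.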
4.10 Limitations and Challenges
Not every problem benefits from a big-data stack.
**Operational complexity.** A petabyte-scale warehouse, a Spark cluster, and a Kafka deployment each require specialist engineering. Many organizations that adopted Hadoop between 2012 and 2018 later retired it because the operational cost exceeded the value it delivered.
**Privacy and regulation.** More data means more personal data. The DPDP Act in India and the GDPR in Europe impose stiff constraints on retention, consent, and cross-border transfer. Analytical projects that ignore these constraints accumulate legal risk rapidly.
**Diminishing returns.** Doubling data volume does not double insight. Many large datasets contain repetitive or low-signal records; disciplined sampling can produce the same analytical conclusions at a fraction of the compute cost.
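The sampling point can be demonstrated directly: on a large, repetitive dataset, a small simple random sample recovers essentially the same summary statistic. A sketch with synthetic data:

```python
import random

random.seed(0)
# A large, repetitive "transaction log": 200,000 purchase amounts.
population = [random.gauss(500, 120) for _ in range(200_000)]

full_mean = sum(population) / len(population)

# A 2.5% simple random sample answers the same question far more cheaply.
sample = random.sample(population, 5_000)
sample_mean = sum(sample) / len(sample)

print(round(full_mean))    # close to the true mean of 500
print(round(sample_mean))  # within a few units of the full-data answer
```

The standard error of the sample mean shrinks with the square root of the sample size, not the population size, which is why a laptop-sized sample often suffices for a warehouse-sized question.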
Most management research and most intra-organisational analyses operate on samples of a few hundred to a few thousand observations, for which a laptop, a CSV and R or Python are fully sufficient. Big data is an architectural response to a specific class of problems, not the default setting.
4.11 A Reference Big Data Stack
```mermaid
flowchart TD
    subgraph Sources
        S1[App & web events]
        S2[IoT sensors]
        S3[Transactional DBs]
        S4[External APIs]
    end
    Sources --> K[Kafka<br/>event bus]
    K --> ST[Stream processing<br/>Flink / Spark Streaming]
    K --> L[(Data Lake<br/>S3 / ADLS)]
    ST --> OLSTORE[(Low-latency store<br/>Cassandra / Redis)]
    L --> LH[Lakehouse<br/>Delta / Iceberg / Hudi]
    LH --> BI[BI & reporting]
    LH --> DS[Data science & ML<br/>Spark, Python, R]
    OLSTORE --> APP[Real-time applications]
```
The reference stack splits into two lanes: a low-latency lane (Kafka → stream processor → key-value store → application) that serves real-time decisions, and a batch lane (Kafka or direct ingestion → lake → lakehouse → BI and data-science layers) that serves analysis and training. Both lanes feed the same underlying event log, which is why modern designs converge on Kappa-style architectures.
4.12 Summary
| Concept | Description |
|---|---|
| The 5 Vs | |
| Volume | Sheer amount of data, TB to PB or more |
| Velocity | Speed at which data arrives and must be processed |
| Variety | Mix of structured, semi-structured and unstructured formats |
| Veracity | Trustworthiness and accuracy of incoming data |
| Value | Business outcome the data is expected to deliver |
| Processing Modes | |
| Batch | Accumulated over a window; latency minutes to hours |
| Streaming | Records processed as they arrive; latency milliseconds to seconds |
| Distributed Storage | |
| HDFS | Hadoop Distributed File System; on-premise, replicated blocks |
| Object store | Amazon S3, Azure Blob, GCS; cloud, decoupled from compute |
| Processing Engines | |
| MapReduce | First-generation batch engine; disk-based, robust but slow |
| Spark | In-memory, DAG-based engine; SQL, ML and streaming libraries |
| Kafka | Distributed append-only event log; backbone of streaming stacks |
| NoSQL Families | |
| Key-value | Redis, DynamoDB; simple key lookup, caches and feature stores |
| Document | MongoDB, Couchbase; nested JSON-like records |
| Wide-column | Cassandra, HBase; partitioned column families, heavy writes |
| Graph | Neo4j, Neptune; nodes and edges, fraud rings, recommendations |
Big data is not a single technology but a stack of interlocking choices made in response to volume, velocity, and variety that no single server can absorb. The rest of this book mostly operates on the post-processed, rectangular output of such stacks, so the names in this chapter explain where that rectangular dataset came from and why it arrived in the shape it did.