4  Big Data

4.1 What Makes Data “Big”

The term “big data” entered common use around the mid-2000s to describe datasets that traditional single-machine databases could no longer process within an acceptable time budget. The threshold is deliberately relative: a 100 GB file was “big” in 2005 and is routine today, while a petabyte-per-day stream remains big in 2026.

Note: An operational definition

Data is considered big when its volume, velocity, or variety exceeds the capacity of a single conventional server to store, ingest, or process it within the time the business needs. The response is typically a distributed architecture that spreads the work across many machines.

The phrase is therefore less a statement about raw size and more a statement about the architecture required. When a business problem forces a move from one server to a cluster, the data has become big for that organization.

4.2 The Five Vs

Doug Laney (META Group, 2001) framed the original three Vs (Volume, Velocity, Variety). Two further Vs (Veracity, Value) were added by IBM and industry practitioners and are now standard.

Note: Volume

The sheer amount of data. UPI processed over 17 billion transactions in a single month in 2025 (NPCI); Aadhaar has over 1.3 billion enrolled identities. These systems generate terabytes to petabytes of data per day.

Note: Velocity

The speed at which data arrives and must be processed. A credit-card fraud system must score a transaction in under 100 milliseconds; a ride-hailing platform must match drivers and riders every few seconds. Velocity drives the need for streaming systems.

Note: Variety

The mix of formats: structured rows, semi-structured JSON, unstructured text, images, audio, video, clickstreams, sensor readings. A single customer-service interaction at a bank may produce all of these simultaneously.

Note: Veracity

The trustworthiness and accuracy of incoming data. Sensor dropouts, duplicated events, spoofed identities, and inconsistent reference data all degrade veracity. Big systems typically accept that a non-trivial fraction of inputs is noisy and must be filtered.

Note: Value

The business outcome the data is expected to deliver. Volume without value is waste. The justification for every big data investment is the marginal business decision the additional data enables: better pricing, lower fraud, tighter inventory, richer personalisation.

4.3 Scale in Units

Data volumes are described on a geometric scale; each unit is a thousand times larger than the previous one.

| Unit | Bytes | Rough intuition |
|---|---|---|
| Gigabyte (GB) | 10^9 | A single high-resolution film |
| Terabyte (TB) | 10^12 | A small organisation’s annual transaction log |
| Petabyte (PB) | 10^15 | A large e-commerce platform’s annual clickstream |
| Exabyte (EB) | 10^18 | A national telecom’s yearly network logs |
| Zettabyte (ZB) | 10^21 | Global annual internet traffic |

Tip: A useful rule of thumb

Workloads up to a few hundred GB still fit a modern single server with enough RAM. Workloads in the low TB range are borderline and often move to a clustered database or warehouse. Workloads in the PB range almost always require a distributed file system and a distributed compute engine.
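As a quick worked check on the scale, here is a minimal Python helper (illustrative, not from the text) that expresses a byte count in the decimal, base-1000 units of the table above:

```python
# Decimal units, base 1000, matching the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_size(n_bytes: float) -> str:
    """Express a byte count in the largest unit with a value >= 1."""
    value, unit = float(n_bytes), UNITS[0]
    for next_unit in UNITS[1:]:
        if value < 1000:
            break
        value /= 1000          # each step up the scale divides by a thousand
        unit = next_unit
    return f"{value:.1f} {unit}"

print(human_size(3e9))     # a film-sized file -> "3.0 GB"
print(human_size(2.5e15))  # a clickstream archive -> "2.5 PB"
```

The same helper makes the rule of thumb concrete: anything that prints in GB fits one server, low TB is borderline, PB needs a cluster.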

4.4 Batch versus Streaming Processing

Once data volumes grow, the processing pattern forks into two families with different latency profiles.

Note: Batch processing

Data is accumulated over a window (an hour, a day) and processed in bulk. Latency is measured in minutes to hours. Suited to overnight reports, model retraining, billing runs. Implementations: Apache Spark batch jobs, Hive, legacy MapReduce.

Note: Stream processing

Each record is processed as it arrives. Latency is measured in milliseconds to seconds. Suited to fraud detection, real-time personalisation, live dashboards, alerting. Implementations: Apache Kafka Streams, Apache Flink, Spark Structured Streaming.
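The contrast can be sketched in a few lines of Python, using a hypothetical per-card spend total: the batch version aggregates a completed window in bulk, while the streaming version updates its state on every arriving event.

```python
from collections import defaultdict

# A tiny event window; card IDs and amounts are illustrative.
events = [("card_1", 120), ("card_2", 999), ("card_1", 40)]

# Batch: accumulate the whole window first, then aggregate in bulk.
def batch_totals(window):
    totals = defaultdict(int)
    for card, amount in window:
        totals[card] += amount
    return dict(totals)

# Streaming: state is updated one record at a time, as each event arrives.
class StreamingTotals:
    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, card, amount):
        self.totals[card] += amount   # current answer after every event
        return self.totals[card]

assert batch_totals(events) == {"card_1": 160, "card_2": 999}

s = StreamingTotals()
latest = [s.on_event(c, a) for c, a in events]   # [120, 999, 160]
```

Both end in the same state; the difference is *when* the answer is available — after the window closes, or after every event.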

Important: The Lambda and Kappa architectures

Many production systems blend the two. The Lambda architecture maintains parallel batch and streaming pipelines and reconciles them at the serving layer. The Kappa architecture uses a single streaming pipeline for both real-time and historical processing, replaying the event log when needed. The trend since 2020 has been towards Kappa-style designs because maintaining two pipelines for the same logic is operationally expensive.

4.5 Distributed Storage

Single machines do not hold petabytes, so storage is spread across many servers with built-in replication.

Note: HDFS (Hadoop Distributed File System)

Splits large files into blocks (typically 128 MB) and replicates each block across three nodes by default. Originally developed at Yahoo to run on clusters of commodity hardware; still widely deployed on-premises.
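These two defaults make a back-of-the-envelope sketch easy. The helper below (illustrative, assuming the default block size and replication factor) computes how many blocks a file occupies and its raw replicated footprint on the cluster:

```python
import math

BLOCK_MB = 128      # default HDFS block size
REPLICATION = 3     # default replication factor

def hdfs_footprint(file_mb):
    """Blocks occupied and raw cluster storage consumed by one file."""
    blocks = math.ceil(file_mb / BLOCK_MB)   # last block may be partial
    raw_mb = file_mb * REPLICATION           # every byte is stored three times
    return blocks, raw_mb

# A 1 GB (1024 MB) file: 8 blocks, 3072 MB of raw cluster storage.
blocks, raw = hdfs_footprint(1024)
print(blocks, raw)
```

The factor-of-three raw cost is one reason object stores, which handle durability internally, changed the economics for greenfield projects.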

Note: Cloud object storage

Amazon S3, Azure Data Lake Storage, Google Cloud Storage. Effectively limitless, pay-per-byte, durable by design. Object stores have largely replaced HDFS for greenfield projects because they decouple storage from compute.

Tip: Open table formats on top of object storage

Delta Lake, Apache Iceberg, and Apache Hudi sit on top of cloud object storage and add the table semantics (ACID transactions, schema evolution, time travel) that classical HDFS-backed Hive tables offered. They are the substrate of the modern lakehouse.

4.6 Distributed Processing: MapReduce to Spark

Processing also has to be spread across many nodes. Two generations of engines dominate.

Note: MapReduce (first generation)

Google’s 2004 paper introduced the Map/Reduce paradigm: decompose a computation into a parallel map step followed by a global reduce step. Hadoop MapReduce popularised the pattern. Writes intermediate results to disk; robust but slow.

Note: Apache Spark (current generation)

An in-memory, DAG-based engine that subsumes MapReduce and adds SQL, streaming, machine-learning (MLlib), and graph-processing (GraphX) libraries. Spark is the de facto standard for distributed batch and machine-learning workloads.

Word count is the canonical illustration of the pattern, and it mirrors a familiar single-machine idiom: in base R, lapply over documents plays the role of a distributed map and table plays the role of a distributed reduceByKey; Spark runs the same two steps spread across a cluster.
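A minimal single-machine sketch of the same word count in Python makes the three phases explicit — the map step is independently parallelisable per document, the shuffle groups by key, and the reduce is parallelisable per key:

```python
from collections import defaultdict

docs = ["big data is big", "data moves fast"]   # illustrative input

# Map: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the pairs by key, as the framework does between phases.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: sum the counts for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

On a cluster, the map and reduce loops run on many nodes at once and the shuffle moves data between them; the logic is unchanged.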

4.7 Streaming Platforms

Stream processing requires a durable, high-throughput pipe that many producers can write to and many consumers can read from independently.

Note: Apache Kafka

A distributed, append-only log. Producers publish events to named topics; consumers subscribe and read at their own pace. Replicated for durability, partitioned for throughput. Kafka is the backbone of event-driven architectures at LinkedIn, Uber, Flipkart, and most large Indian fintechs.
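The core abstraction can be sketched in a few lines of Python — an in-memory, single-partition stand-in (not the Kafka API) showing how independent consumers read the same append-only log at their own pace by tracking their own offsets:

```python
class MiniLog:
    """Toy single-partition log: appended by producers, read at
    independent offsets by any number of consumers."""

    def __init__(self):
        self.records = []   # the log itself: append-only, never mutated
        self.offsets = {}   # consumer name -> next offset to read

    def produce(self, event):
        self.records.append(event)

    def consume(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)  # commit new position
        return batch

log = MiniLog()
for e in ["signup", "click", "purchase"]:
    log.produce(e)

first = log.consume("billing")        # reads all three events at once
a1 = log.consume("analytics", 2)      # an independent consumer, own pace
a2 = log.consume("analytics", 2)      # resumes exactly where it left off
```

Real Kafka adds partitioning for throughput and replication for durability, but the decoupling shown here — producers never wait for consumers — is the whole idea.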

Note: Apache Flink and Spark Structured Streaming

Stream-processing engines that consume from Kafka (and other sources), apply transformations, windowed aggregations, and joins, and emit results downstream. Flink is preferred for sub-second event-time processing; Structured Streaming is preferred where the team is already invested in Spark.

Tip: Exactly-once semantics

Modern streaming stacks (Kafka + Flink, Kafka + Spark Structured Streaming with checkpointing) can guarantee that each event affects downstream state exactly once, even in the presence of failures. This has widened the set of workloads that can be trusted to a streaming pipeline rather than reconciled in batch.
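The intuition behind the guarantee can be sketched in plain Python: if the consumer's offset is committed atomically with its state, replaying the log after a failure cannot double-count an event. (Illustrative only — real stacks achieve the atomicity with Kafka transactions or engine checkpoints.)

```python
events = [10, 20, 30, 40]               # the durable event log

# Offset and result live in one structure, standing in for a state
# store that is updated atomically.
state = {"offset": 0, "total": 0}

def process_from(log, state):
    """Resume from the committed offset; skip anything already applied."""
    for i in range(state["offset"], len(log)):
        state["total"] += log[i]
        state["offset"] = i + 1          # committed together with the total

process_from(events, state)              # first run: total = 100
process_from(events, state)              # replay after a "failure": no change
assert state["total"] == 100
```

Without the committed offset, the replay would count every event twice — exactly the reconciliation problem that used to be pushed to a batch lane.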

4.8 NoSQL Database Families

When the access pattern or the data format does not fit a relational table, one of four NoSQL families is typically used.

| Family | Model | Examples | Good for |
|---|---|---|---|
| Key-value | Simple key to value lookup | Redis, DynamoDB | Session stores, caches, feature stores |
| Document | Nested JSON-like documents | MongoDB, Couchbase | Content management, mobile apps |
| Wide-column | Partitioned column families | Cassandra, HBase, ScyllaDB | Time-series, IoT, massive writes |
| Graph | Nodes and edges | Neo4j, Amazon Neptune | Fraud rings, recommendations, knowledge graphs |

Important: Polyglot persistence

Large systems rarely pick one. A single application at an e-commerce company might use Postgres for orders, MongoDB for the product catalog, Redis for sessions, Cassandra for clickstreams, and Neo4j for recommendations. Each database is chosen for its fit to one access pattern.
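Of the four families, the graph case benefits most from a concrete sketch, because its access pattern — traversal — is the one a relational join handles worst. The toy Python below (all account names hypothetical) finds a "fraud ring" as the connected component around a suspect account, using breadth-first search over an adjacency map:

```python
from collections import deque

# Accounts linked by shared devices or phone numbers — edges in a
# fraud graph. Names are illustrative.
edges = {
    "acct_a": ["acct_b"],
    "acct_b": ["acct_a", "acct_c"],
    "acct_c": ["acct_b"],
    "acct_x": ["acct_y"],
    "acct_y": ["acct_x"],
}

def ring(start, edges):
    """Breadth-first search: every account reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(sorted(ring("acct_a", edges)))  # ['acct_a', 'acct_b', 'acct_c']
```

A graph database runs exactly this kind of traversal as a first-class query, at billions of edges, which is why it earns a slot alongside the relational store rather than replacing it.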

4.9 Big Data in Practice

A short survey of use cases that only become feasible at big-data scale.

Note: Real-time fraud detection

A card transaction is scored against a streaming feature store (recent velocity, merchant risk, device fingerprint) and either approved or challenged in under 100 milliseconds. Deployed at every major bank and card network.

Note: Recommendation and personalisation

Every view, click and purchase updates user and item embeddings that feed homepage and search ranking. Flipkart, Amazon India, Hotstar, and Spotify operate recommendation systems at this scale.

Note: Telematics and predictive maintenance

Sensor streams from vehicles, elevators, jet engines, and factory equipment are aggregated into anomaly-detection models that flag components likely to fail. GE Aviation and Siemens run established examples; Ola Electric publishes similar work for its scooter fleet.

Note: Regulatory reporting

RBI, SEBI, and GSTN require the aggregation of transactions across millions of counterparties for supervisory reports. The GSTN alone processes over a billion invoices per month, which is a big-data problem by any definition.

4.10 Limitations and Challenges

Not every problem benefits from a big-data stack.

Warning: Infrastructure and skills cost

A petabyte-scale warehouse, a Spark cluster, and a Kafka deployment each require specialist engineering. Many organizations that adopted Hadoop between 2012 and 2018 later retired it because the operational cost exceeded the value it delivered.

Warning: Privacy and ethical exposure

More data means more personal data. The DPDP Act in India and the GDPR in Europe impose stiff constraints on retention, consent, and cross-border transfer. Analytical projects that ignore these constraints accumulate legal risk rapidly.

Warning: Signal-to-noise ratio

Doubling data volume does not double insight. Many large datasets contain repetitive or low-signal records; disciplined sampling can produce the same analytical conclusions at a fraction of the compute cost.
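A quick simulation makes the point; the order values below are synthetic and the figures illustrative. A 1% random sample of a seeded dataset estimates the population mean to within a tiny fraction of its value:

```python
import random

random.seed(42)

# A synthetic stand-in for a large order-value dataset.
population = [random.gauss(500, 50) for _ in range(100_000)]
full_mean = sum(population) / len(population)

# A 1% simple random sample.
sample = random.sample(population, 1_000)
sample_mean = sum(sample) / len(sample)

# The gap is on the order of the standard error: sigma/sqrt(n) ~ 1.6 here.
print(abs(full_mean - sample_mean))
```

When the estimate from the sample is indistinguishable from the full-data answer for the decision at hand, the remaining 99% of the compute bill buys nothing.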

Important: “Small data” is still the common case

Most management research and most intra-organisational analyses operate on samples of a few hundred to a few thousand observations, for which a laptop, a CSV and R or Python are fully sufficient. Big data is an architectural response to a specific class of problems, not the default setting.

4.11 A Reference Big Data Stack

```mermaid
flowchart TD
  subgraph Sources
    S1[App & web events]
    S2[IoT sensors]
    S3[Transactional DBs]
    S4[External APIs]
  end
  Sources --> K[Kafka<br/>event bus]
  K --> ST[Stream processing<br/>Flink / Spark Streaming]
  K --> L[(Data Lake<br/>S3 / ADLS)]
  ST --> OLSTORE[(Low-latency store<br/>Cassandra / Redis)]
  L --> LH[Lakehouse<br/>Delta / Iceberg / Hudi]
  LH --> BI[BI & reporting]
  LH --> DS[Data science & ML<br/>Spark, Python, R]
  OLSTORE --> APP[Real-time applications]
```

The reference stack splits into two lanes: a low-latency lane (Kafka → stream processor → key-value store → application) that serves real-time decisions, and a batch lane (Kafka or direct ingestion → lake → lakehouse → BI and data-science layers) that serves analysis and training. Both lanes feed the same underlying event log, which is why modern designs converge on Kappa-style architectures.

4.12 Summary

Summary of data concepts introduced in this chapter

| Concept | Description |
|---|---|
| **The 5 Vs** | |
| Volume | Sheer amount of data, TB to PB or more |
| Velocity | Speed at which data arrives and must be processed |
| Variety | Mix of structured, semi-structured and unstructured formats |
| Veracity | Trustworthiness and accuracy of incoming data |
| Value | Business outcome the data is expected to deliver |
| **Processing Modes** | |
| Batch | Accumulated over a window; latency minutes to hours |
| Streaming | Records processed as they arrive; latency milliseconds to seconds |
| **Distributed Storage** | |
| HDFS | Hadoop Distributed File System; on-premises, replicated blocks |
| Object store | Amazon S3, Azure Blob, GCS; cloud, decoupled from compute |
| **Processing Engines and Messaging** | |
| MapReduce | First-generation batch engine; disk-based, robust but slow |
| Spark | In-memory, DAG-based engine; SQL, ML and streaming libraries |
| Kafka | Distributed append-only event log; backbone of streaming stacks |
| **NoSQL Families** | |
| Key-value | Redis, DynamoDB; simple key lookup, caches and feature stores |
| Document | MongoDB, Couchbase; nested JSON-like records |
| Wide-column | Cassandra, HBase; partitioned column families, heavy writes |
| Graph | Neo4j, Neptune; nodes and edges, fraud rings, recommendations |

Big data is not a single technology but a stack of interlocking choices made in response to volume, velocity, and variety that no single server can absorb. The rest of this book mostly operates on the post-processed, rectangular output of such stacks, so the names in this chapter explain where that rectangular dataset came from and why it arrived in the shape it did.