The False Urgency of Real-Time

There's a pervasive assumption in modern data engineering that real-time is always better than batch. If you can process data in milliseconds, why would you wait hours? The answer is simple: cost, complexity, and necessity. Real-time processing architectures are significantly more expensive to build, operate, and debug than batch alternatives. Before committing to real-time, you need to ask whether the use case actually demands it.

The question isn't "can we do this in real-time?" — it's "does the business value of real-time justify the engineering cost?" A fraud detection system that delays alerts by 4 hours is useless. But a marketing dashboard that refreshes hourly instead of instantly? The business impact of that delay is essentially zero. Matching the processing model to the actual latency requirement is one of the most consequential architecture decisions in data engineering.

Understanding the Processing Models

Batch processing collects data over a period (hourly, daily, weekly), then processes it all at once. This is the traditional ETL model: extract from source systems, transform in a staging area, load into the data warehouse. Tools like Apache Spark, dbt, and Airflow are optimized for batch workloads. Batch is simple, predictable, and cost-efficient — you spin up compute when the job runs and shut it down when it's done.
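The extract-transform-load cycle can be sketched in a few lines of plain Python. The `extract`, `transform`, and `load` functions here are illustrative stand-ins for the Spark or dbt steps described above; a real pipeline would be triggered by Airflow or cron rather than called directly:

```python
from datetime import date

def extract(run_date):
    # Stand-in for pulling one day's records from a source system.
    return [
        {"order_id": 1, "amount": 40.0, "day": str(run_date)},
        {"order_id": 2, "amount": 60.0, "day": str(run_date)},
    ]

def transform(records):
    # Stand-in for the SQL/Spark transformation step: aggregate daily revenue.
    total = sum(r["amount"] for r in records)
    return {"day": records[0]["day"], "revenue": total, "orders": len(records)}

def load(warehouse, row):
    # Stand-in for writing the result to a warehouse table.
    warehouse.append(row)

def run_daily_batch(run_date, warehouse):
    # One scheduled run: spin up, process the whole day at once, shut down.
    load(warehouse, transform(extract(run_date)))

warehouse = []
run_daily_batch(date(2024, 1, 15), warehouse)
print(warehouse[0])  # {'day': '2024-01-15', 'revenue': 100.0, 'orders': 2}
```

The key property is that compute exists only for the duration of `run_daily_batch`; between runs there is nothing to operate or monitor.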

Stream processing handles data continuously as it arrives, record by record or in micro-batches (every few seconds). Tools like Apache Kafka, Apache Flink, Spark Structured Streaming, and Amazon Kinesis are designed for streaming. Stream processing provides low-latency results but requires always-on infrastructure, more complex state management, and sophisticated error handling.
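The record-by-record model can be sketched with a plain Python generator standing in for a Kafka or Kinesis consumer. The per-key running totals illustrate the state-management burden mentioned above: that state must survive for the entire life of the always-on job:

```python
def event_stream():
    # Stand-in for a Kafka/Kinesis consumer yielding events as they arrive.
    yield {"user": "a", "amount": 10.0}
    yield {"user": "b", "amount": 5.0}
    yield {"user": "a", "amount": 7.5}

def process_stream(stream):
    # Per-key running totals: state the streaming job must maintain forever.
    totals = {}
    for event in stream:
        key = event["user"]
        totals[key] = totals.get(key, 0.0) + event["amount"]
        # In a real system each update would be emitted downstream here,
        # and the state would be checkpointed for failure recovery.
    return totals

print(process_stream(event_stream()))  # {'a': 17.5, 'b': 5.0}
```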

Micro-batch is a hybrid approach that processes data in small batches (every 1-15 minutes). Spark Structured Streaming runs in micro-batch mode by default. This approach offers a pragmatic middle ground: latency measured in minutes rather than hours (good enough for many use cases) with significantly less complexity than true event-by-event streaming.
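The micro-batch idea reduces to grouping buffered events into fixed time windows. This sketch assigns events to 5-minute buckets by timestamp, a simplified stand-in for what an engine like Spark Structured Streaming does on each trigger:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute micro-batches

def assign_batches(events):
    # Group (timestamp, payload) events by the 5-minute window they fall into.
    batches = defaultdict(list)
    for ts, payload in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        batches[window_start].append(payload)
    return dict(batches)

events = [(10, "a"), (250, "b"), (301, "c"), (610, "d")]
print(assign_batches(events))  # {0: ['a', 'b'], 300: ['c'], 600: ['d']}
```

Each bucket is then processed as an ordinary small batch, which is why micro-batch inherits so much of batch's operational simplicity.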

When Real-Time Is Worth the Investment

Real-time processing is justified when the value of the insight degrades rapidly with time, as in the fraud-detection example above: an alert that arrives hours after the fraudulent transaction is worthless.

When Batch Is the Right Choice

Batch processing is the right choice for the majority of analytical and reporting workloads, and choosing it saves significant engineering effort and infrastructure cost.

If the decision the data supports is made daily, the data pipeline should run daily. Investing in real-time infrastructure for a daily decision is over-engineering.

Architecture Comparison

Batch architecture is straightforward: a scheduler (Airflow, cron) triggers extraction jobs at defined intervals. Data flows through a sequence of transformations (typically SQL in dbt or Spark). The output lands in a data warehouse where BI tools query it. The infrastructure is simple — compute spins up for the job and shuts down after. Costs are predictable and proportional to data volume.

Streaming architecture is fundamentally different. A message broker (Kafka, Kinesis, Pulsar) receives events continuously from producers (application servers, IoT devices, change data capture from databases). Stream processing engines (Flink, Spark Streaming, ksqlDB) consume from the broker, apply transformations, and write results to a serving layer (database, cache, search index, or another Kafka topic). The infrastructure runs 24/7, and costs scale with throughput rather than data volume.

The operational complexity of streaming is significantly higher. State management (maintaining counts, aggregations, or windows across a continuous stream), exactly-once processing guarantees (ensuring events aren't processed twice or skipped), late data handling (events that arrive out of order), and backpressure management (handling bursts that exceed processing capacity) are all engineering challenges that don't exist in batch.
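One of those challenges, late data, can be illustrated with a simple watermark: events whose timestamp lags too far behind the newest event seen so far are discarded. The 60-second allowance is an arbitrary illustration; real engines like Flink make this lateness threshold configurable:

```python
ALLOWED_LATENESS = 60  # seconds an event may lag the newest timestamp seen

def process_with_watermark(events):
    # Track the max timestamp seen; drop events older than the watermark.
    max_ts = float("-inf")
    accepted, dropped = [], []
    for ts, payload in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - ALLOWED_LATENESS
        if ts >= watermark:
            accepted.append(payload)
        else:
            dropped.append(payload)  # arrived too late to be counted
    return accepted, dropped

# 'b' arrives 100 seconds behind the newest event and is discarded.
events = [(100, "a"), (200, "c"), (100, "b"), (260, "d")]
print(process_with_watermark(events))  # (['a', 'c', 'd'], ['b'])
```

Batch pipelines never face this decision: by the time the job runs, all of the day's events have already arrived.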

The Lambda and Kappa Architectures

When organizations need both real-time and batch capabilities, two architectural patterns have emerged:

Lambda architecture runs batch and streaming pipelines in parallel. The batch layer processes all historical data and produces "correct" results with higher latency. The streaming layer processes recent data and produces "approximate" results with low latency. A serving layer merges both views. The downside: you maintain two separate codebases that implement the same business logic, creating a consistency and maintenance burden.
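The serving-layer merge can be sketched as follows: the batch view is authoritative up to its high-water mark, and the speed layer fills in only the periods batch hasn't covered yet. The period keys and revenue figures are purely illustrative:

```python
def merge_views(batch_view, speed_view, batch_high_water):
    # Batch results are "correct" up to the high-water mark; the speed
    # layer supplies approximate results for anything newer.
    merged = dict(batch_view)
    for period, value in speed_view.items():
        if period > batch_high_water:
            merged[period] = value
    return merged

batch_view = {"2024-01-14": 1000, "2024-01-15": 1200}   # exact, hours old
speed_view = {"2024-01-15": 1190, "2024-01-16": 300}    # approximate, fresh
print(merge_views(batch_view, speed_view, "2024-01-15"))
# {'2024-01-14': 1000, '2024-01-15': 1200, '2024-01-16': 300}
```

Note how the speed layer's slightly-off figure for 2024-01-15 is discarded once batch has covered that day; keeping the two pipelines' logic in agreement is exactly the dual-codebase burden described above.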

Kappa architecture uses a single streaming pipeline for everything. Historical reprocessing is done by replaying events from the message broker's log (Kafka retains events for configurable periods, potentially indefinitely). This eliminates the dual-codebase problem but requires your streaming infrastructure to handle both real-time processing and batch-scale reprocessing.
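Replay in a Kappa setup can be sketched with an append-only log and consumer offsets: reprocessing means resetting the offset to zero and running the same streaming code over history. The `Log` class here is a plain list standing in for a retained Kafka topic:

```python
class Log:
    # Append-only event log, standing in for a retained Kafka topic.
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        return self.events[offset:]

def consume(log, offset, state):
    # The single streaming codebase: identical logic for live and replayed data.
    for amount in log.read_from(offset):
        state["total"] = state.get("total", 0) + amount
    return len(log.events)  # new offset to resume from

log = Log()
for amount in [10, 20, 30]:
    log.append(amount)

live_state = {}
offset = consume(log, 0, live_state)   # live processing
replayed_state = {}
consume(log, 0, replayed_state)        # full reprocessing: replay from offset 0
print(live_state, replayed_state)  # {'total': 60} {'total': 60}
```

Because replay and live consumption share one code path, a logic fix only has to be made once, which is the whole point of Kappa.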

In practice, most organizations end up with a pragmatic hybrid: streaming for the few use cases that truly need it, batch for everything else, and a shared metadata layer (data catalog, schema registry) that ensures consistency across both. Don't feel pressured to adopt a "pure" architecture — the goal is solving business problems, not architectural elegance.

Cost Comparison

The cost difference between batch and streaming is substantial and often underestimated:

Infrastructure costs: A batch pipeline that runs for 2 hours daily on a 4-node Spark cluster costs roughly $200-400/month on AWS. An equivalent streaming pipeline running 24/7 on a Kafka cluster plus Flink workers costs $2,000-5,000/month — 5-10x more for the same data volume.
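The compute-hour gap behind those figures is easy to verify. The $0.50 node-hour rate below is an illustrative assumption, not an actual AWS price:

```python
NODE_HOUR_RATE = 0.50  # illustrative on-demand rate per node-hour

# Batch: 4 nodes, 2 hours per day, 30 days.
batch_node_hours = 4 * 2 * 30        # 240 node-hours/month
batch_cost = batch_node_hours * NODE_HOUR_RATE

# Streaming: the same 4 nodes, but running 24/7.
stream_node_hours = 4 * 24 * 30      # 2,880 node-hours/month
stream_cost = stream_node_hours * NODE_HOUR_RATE

print(batch_cost, stream_cost, stream_cost / batch_cost)  # 120.0 1440.0 12.0
```

The always-on factor alone is 12x (24 hours of runtime versus 2); real-world ratios like the 5-10x above vary with relative cluster sizing, broker overhead, and reserved-capacity discounts, but the ratio is driven by runtime hours, not data volume.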

Engineering costs: Streaming pipelines require specialized skills (Kafka administration, Flink programming, distributed systems debugging) that are rarer and more expensive than batch skills (SQL, Airflow, dbt). The talent premium adds 20-40% to team costs.

Operational costs: Streaming systems require 24/7 monitoring and on-call support because they're always running. Batch failures can wait until morning; streaming failures need immediate attention because data is flowing continuously.

Our Recommendation

Start with batch. Build your data warehouse, your transformation pipeline, and your BI layer on batch processing. This gives you 80% of the value with 20% of the complexity. Then, identify the specific use cases where batch latency is genuinely insufficient — where the business can quantify the value of faster processing. Build streaming infrastructure only for those use cases, and keep everything else on batch.

For organizations that need "near real-time" (minutes, not seconds), consider micro-batch as a middle ground. Spark Structured Streaming with a 5-minute trigger interval provides minute-level freshness with batch-like simplicity. This covers many "real-time" requirements that don't actually need sub-second latency.

Need Help With This?

Neural Vector Insights helps organizations turn these concepts into production reality. Let's talk about your project.

Start a Conversation