The False Urgency of Real-Time
There's a pervasive assumption in modern data engineering that real-time is always better than batch. If you can process data in milliseconds, why would you wait hours? The answer comes down to cost, complexity, and necessity: real-time processing architectures are significantly more expensive to build, operate, and debug than batch alternatives, and most use cases don't actually demand real-time latency. Before committing to real-time, you need to ask whether yours does.
The question isn't "can we do this in real-time?" — it's "does the business value of real-time justify the engineering cost?" A fraud detection system that delays alerts by 4 hours is useless. But a marketing dashboard that refreshes hourly instead of instantly? The business impact of that delay is essentially zero. Matching the processing model to the actual latency requirement is one of the most consequential architecture decisions in data engineering.
Understanding the Processing Models
Batch processing collects data over a period (hourly, daily, weekly), then processes it all at once. This is the traditional ETL model: extract from source systems, transform in a staging area, load into the data warehouse. Tools like Apache Spark, dbt, and Airflow are optimized for batch workloads. Batch is simple, predictable, and cost-efficient — you spin up compute when the job runs and shut it down when it's done.
Stream processing handles data continuously as it arrives, record by record or in micro-batches (every few seconds). Tools like Apache Kafka, Apache Flink, Spark Structured Streaming, and Amazon Kinesis are designed for streaming. Stream processing provides low-latency results but requires always-on infrastructure, more complex state management, and sophisticated error handling.
Micro-batch is a hybrid approach that processes data in small batches (every 1-15 minutes). Spark Structured Streaming runs in micro-batch mode by default. This approach offers a pragmatic middle ground: latency measured in minutes rather than hours (good enough for many use cases) with significantly less complexity than true event-by-event streaming.
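The micro-batch idea is easy to picture in a few lines of plain Python. This is a toy sketch, not Spark code: events carry timestamps, and everything that falls inside the same interval is aggregated together once, when the window closes. The event data and the 3-second interval are invented for illustration.

```python
def micro_batch(events, batch_interval=3):
    """Group a stream of (timestamp, value) events into micro-batches.

    Events whose timestamps fall inside the same `batch_interval`-second
    window are processed together: latency is bounded by the interval,
    but each window is handled with simple batch-style logic.
    """
    batches = {}
    for ts, value in events:
        window_start = (ts // batch_interval) * batch_interval
        batches.setdefault(window_start, []).append(value)
    # Each window is aggregated exactly once, when it closes.
    return {w: sum(vals) for w, vals in sorted(batches.items())}

events = [(0, 10), (1, 5), (2, 7), (3, 2), (5, 1), (7, 4)]
print(micro_batch(events))  # {0: 22, 3: 3, 6: 4}
```

Shrinking `batch_interval` toward zero approaches event-by-event streaming; growing it toward a day turns this into classic batch. The knob is the latency/complexity trade-off described above.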
When Real-Time Is Worth the Investment
Real-time processing is justified when the value of the insight degrades rapidly with time. These use cases fall into several categories:
- Fraud and anomaly detection: A fraudulent transaction detected 4 hours later is a loss. Detected in 100 milliseconds, it's a prevented loss. The dollar value of faster detection directly justifies the infrastructure cost.
- Operational monitoring: Server health, application performance, and IoT sensor monitoring need real-time processing because delayed alerts mean extended outages. The cost of downtime almost always exceeds the cost of real-time monitoring infrastructure.
- Personalization and recommendations: A recommendation engine that reflects what a user did 30 seconds ago is more relevant than one based on yesterday's behavior. E-commerce and media companies see measurable conversion improvements from real-time personalization.
- Financial trading: Market data processing and trading algorithms operate on millisecond timescales where latency directly equals money.
- Logistics and ride-sharing: Vehicle tracking, route optimization, and dynamic pricing require continuous processing of location data to function.
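To make the fraud case concrete, here is a minimal sliding-window sketch of the kind of check a stream processor would run per event. The 60-second window, the $500 threshold, and the card IDs are all illustrative; a real system would use richer features than raw spend.

```python
from collections import defaultdict, deque

def detect_bursts(transactions, window_s=60, threshold=500.0):
    """Flag cards whose spend inside a sliding `window_s`-second window
    exceeds `threshold`.

    `transactions` is an iterable of (timestamp, card_id, amount),
    assumed ordered by timestamp, as a stream processor would see it.
    """
    recent = defaultdict(deque)   # card_id -> deque of (ts, amount)
    totals = defaultdict(float)   # card_id -> spend inside the window
    alerts = []
    for ts, card, amount in transactions:
        q = recent[card]
        q.append((ts, amount))
        totals[card] += amount
        # Evict transactions that have slid out of the window.
        while q and q[0][0] <= ts - window_s:
            _, old_amount = q.popleft()
            totals[card] -= old_amount
        if totals[card] > threshold:
            alerts.append((ts, card))
    return alerts

txns = [(0, "A", 200), (20, "A", 200), (40, "A", 200), (400, "B", 100)]
print(detect_bursts(txns))  # [(40, 'A')]
```

The same check run as a nightly batch job would produce the same alert 4 hours to a day later, which is exactly the delay the text argues turns a prevented loss into a realized one.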
When Batch Is the Right Choice
Batch processing is the right choice for the majority of analytical and reporting workloads — and choosing it saves significant engineering effort and infrastructure cost:
- Executive dashboards and BI reports: Most business reporting is consumed daily or weekly. A dashboard that refreshes at 6 AM every morning is perfectly adequate for a 9 AM leadership meeting.
- Data warehouse loading: Transforming and loading data from operational systems into an analytical warehouse is inherently a batch operation. Even if source data arrives continuously, the transformation logic (joins across multiple sources, aggregations, slowly changing dimensions) is best expressed as batch SQL.
- ML model training: Model training processes historical data in bulk. Even if the model serves predictions in real-time, the training pipeline is almost always batch.
- Month-end and quarter-end reporting: Financial close, regulatory reporting, and compliance audits operate on fixed time periods. There's no benefit to processing this data in real-time.
- Historical analysis and data science: Exploratory analysis, cohort studies, and trend analysis are retrospective by nature. Batch processing is the natural fit.
If the decision the data supports is made daily, the data pipeline should run daily. Investing in real-time infrastructure for a daily decision is over-engineering.
Architecture Comparison
Batch architecture is straightforward: a scheduler (Airflow, cron) triggers extraction jobs at defined intervals. Data flows through a sequence of transformations (typically SQL in dbt or Spark). The output lands in a data warehouse where BI tools query it. The infrastructure is simple — compute spins up for the job and shuts down after. Costs are predictable and proportional to data volume.
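The batch shape reduces to three functions run on a schedule. The sketch below stubs the source and the warehouse as in-memory lists, and the table and column names are invented; in practice the transform step would be SQL in dbt or Spark, and the trigger would be Airflow or cron.

```python
def extract(source_rows):
    """Pull the day's raw rows from a source system (stubbed as a list)."""
    return list(source_rows)

def transform(rows):
    """Aggregate raw order rows into daily revenue per region."""
    revenue = {}
    for row in rows:
        revenue[row["region"]] = revenue.get(row["region"], 0) + row["amount"]
    return [{"region": r, "daily_revenue": v} for r, v in sorted(revenue.items())]

def load(warehouse, rows):
    """Append the transformed rows to the warehouse table (stubbed)."""
    warehouse.extend(rows)

# One scheduled run: compute spins up, the job runs, compute shuts down.
warehouse = []
raw = [{"region": "EU", "amount": 120}, {"region": "US", "amount": 80},
       {"region": "EU", "amount": 30}]
load(warehouse, transform(extract(raw)))
print(warehouse)  # [{'region': 'EU', 'daily_revenue': 150}, {'region': 'US', 'daily_revenue': 80}]
```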
Streaming architecture is fundamentally different. A message broker (Kafka, Kinesis, Pulsar) receives events continuously from producers (application servers, IoT devices, change data capture from databases). Stream processing engines (Flink, Spark Streaming, ksqlDB) consume from the broker, apply transformations, and write results to a serving layer (database, cache, search index, or another Kafka topic). The infrastructure runs 24/7, and costs scale with throughput rather than data volume.
The operational complexity of streaming is significantly higher. State management (maintaining counts, aggregations, or windows across a continuous stream), exactly-once processing guarantees (ensuring events aren't processed twice or skipped), late data handling (events that arrive out of order), and backpressure management (handling bursts that exceed processing capacity) are all engineering challenges that don't exist in batch.
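Late data is a good example of a problem batch never has. The toy sketch below mimics (very crudely) what streaming engines call a watermark: events arriving too far behind the latest timestamp seen are dropped, because their window may already have been emitted. The 10-second windows and 5 seconds of allowed lateness are made-up numbers.

```python
def windowed_counts(events, window_s=10, allowed_lateness_s=5):
    """Count events per tumbling window, dropping events that arrive more
    than `allowed_lateness_s` behind the maximum timestamp seen so far
    (a crude stand-in for a streaming engine's watermark).
    """
    counts = {}
    dropped = []
    watermark = float("-inf")
    for ts in events:
        watermark = max(watermark, ts - allowed_lateness_s)
        if ts < watermark:
            dropped.append(ts)  # too late: its window may already be final
            continue
        window = (ts // window_s) * window_s
        counts[window] = counts.get(window, 0) + 1
    return counts, dropped

arrival_order = [1, 12, 3, 14, 2]  # 3 and 2 arrive out of order
print(windowed_counts(arrival_order))  # ({0: 1, 10: 2}, [3, 2])
```

A batch job sees the whole day's data at once, so "late" events are simply sorted into place before processing. In streaming, every one of these decisions (how long to wait, what to do with stragglers, how to revise emitted results) must be engineered explicitly.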
The Lambda and Kappa Architectures
When organizations need both real-time and batch capabilities, two architectural patterns have emerged:
Lambda architecture runs batch and streaming pipelines in parallel. The batch layer processes all historical data and produces "correct" results with higher latency. The streaming layer processes recent data and produces "approximate" results with low latency. A serving layer merges both views. The downside: you maintain two separate codebases that implement the same business logic, creating a consistency and maintenance burden.
Kappa architecture uses a single streaming pipeline for everything. Historical reprocessing is done by replaying events from the message broker's log (Kafka retains events for configurable periods, potentially indefinitely). This eliminates the dual-codebase problem but requires your streaming infrastructure to handle both real-time processing and batch-scale reprocessing.
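The Kappa idea can be pictured as replaying a retained log through the same handler used for live traffic. This is a toy sketch (Kafka's actual consumer API looks different, and the event schema here is invented); the point is that one copy of the business logic serves both live processing and historical rebuilds.

```python
class EventLog:
    """Stand-in for a retained broker log (e.g. a Kafka topic)."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self, from_offset=0):
        # Historical reprocessing = re-reading the log from an old offset.
        return iter(self.events[from_offset:])

def handler(state, event):
    """The single copy of business logic, used for live and replayed events."""
    state[event["user"]] = state.get(event["user"], 0) + event["clicks"]
    return state

log = EventLog()
for e in [{"user": "a", "clicks": 2}, {"user": "b", "clicks": 1},
          {"user": "a", "clicks": 3}]:
    log.append(e)

# After a bug fix or schema change, rebuild state by replaying the log.
state = {}
for event in log.replay():
    state = handler(state, event)
print(state)  # {'a': 5, 'b': 1}
```

Contrast this with Lambda, where the `handler` logic would exist twice: once in the streaming job and once, reimplemented, in the batch layer.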
In practice, most organizations end up with a pragmatic hybrid: streaming for the few use cases that truly need it, batch for everything else, and a shared metadata layer (data catalog, schema registry) that ensures consistency across both. Don't feel pressured to adopt a "pure" architecture — the goal is solving business problems, not architectural elegance.
Cost Comparison
The cost difference between batch and streaming is substantial and often underestimated:
Infrastructure costs: A batch pipeline that runs for 2 hours daily on a 4-node Spark cluster costs roughly $200-400/month on AWS. An equivalent streaming pipeline running 24/7 on a Kafka cluster plus Flink workers costs $2,000-5,000/month — 5-10x more for the same data volume.
Engineering costs: Streaming pipelines require specialized skills (Kafka administration, Flink programming, distributed systems debugging) that are rarer and more expensive than batch skills (SQL, Airflow, dbt). The talent premium adds 20-40% to team costs.
Operational costs: Streaming systems require 24/7 monitoring and on-call support because they're always running. Batch failures can wait until morning; streaming failures need immediate attention because data is flowing continuously.
Our Recommendation
Start with batch. Build your data warehouse, your transformation pipeline, and your BI layer on batch processing. This gives you 80% of the value with 20% of the complexity. Then, identify the specific use cases where batch latency is genuinely insufficient — where the business can quantify the value of faster processing. Build streaming infrastructure only for those use cases, and keep everything else on batch.
For organizations that need "near real-time" (minutes, not seconds), consider micro-batch as a middle ground. Spark Structured Streaming with a 5-minute trigger interval provides minute-level freshness with batch-like simplicity. This covers many "real-time" requirements that don't actually need sub-second latency.
Need Help With This?
Neural Vector Insights helps organizations turn these concepts into production reality. Let's talk about your project.
Start a Conversation