Overview
Lambda Architecture addresses the challenge of building distributed data processing systems that provide both accurate batch computations and real-time query responses. The architecture separates data processing into distinct layers optimized for their specific characteristics: a batch layer for complete and accurate results, a speed layer for low-latency approximate results, and a serving layer that merges outputs from both.
The architecture emerged from the need to balance consistency and availability in distributed systems while maintaining query performance. Traditional batch processing systems achieve high accuracy but suffer from processing delays measured in hours or days. Pure streaming systems provide low latency but struggle with correctness guarantees and fault tolerance. Lambda Architecture reconciles these competing requirements through a dual-pipeline approach.
The three-layer structure processes all data through both batch and streaming paths. The batch layer continuously recomputes comprehensive views from the complete dataset, providing eventual consistency and error correction. The speed layer processes only recent data to fill the temporal gap between batch processing runs. The serving layer indexes batch views for fast random access while merging them with real-time updates from the speed layer.
This separation enables systems to handle diverse query patterns across different timescales. Analytics queries requiring historical accuracy access the batch layer. Monitoring dashboards needing current values query the speed layer. User-facing applications combine both layers to provide complete and current results.
[Raw Data] → Batch Layer → Batch Views ↘
Serving Layer → Query Results
[Raw Data] → Speed Layer → Real-time Views ↗
The architecture assumes immutable data, append-only storage, and the ability to recompute results from raw inputs. These assumptions enable fault tolerance through recomputation rather than complex coordination protocols. When errors occur in either layer, the batch layer eventually corrects them during its next complete processing cycle.
Key Principles
Lambda Architecture operates on several foundational principles that define its behavior and characteristics.
Immutability of data forms the core assumption. All incoming data appends to an immutable master dataset without updates or deletions. This immutability simplifies reasoning about system state and enables recomputation as the primary fault tolerance mechanism. When processing errors occur, the system reprocesses the original immutable data rather than attempting to fix corrupted state.
Separate batch and speed layers handle different temporal characteristics of data processing. The batch layer processes the entire dataset periodically, computing comprehensive views with complete accuracy. Processing runs take hours but guarantee correctness. The speed layer processes only data that arrived since the last batch processing run, computing incremental views with low latency measured in seconds. Results may be approximate but provide current information.
Recomputation rather than incremental updates handles the batch layer. Each batch processing run starts from the complete raw dataset and recomputes all views from scratch. This approach eliminates the complexity of incremental updates and the possibility of accumulating errors. While computationally expensive, modern distributed computing frameworks make full recomputation feasible for many workloads.
Human fault tolerance refers to the ability to correct logic errors in processing code. When developers discover bugs in batch processing logic, they fix the code and rerun batch processing. The batch layer recomputes all views with corrected logic, eventually replacing incorrect results. This differs from traditional database systems where logic errors permanently corrupt data.
Query merging combines results from batch and speed layers at read time. The serving layer merges batch views (complete but slightly stale) with speed layer views (current but incremental) to produce final query results. This merge operation must handle overlapping time ranges and potential inconsistencies between layers.
Eventually consistent writes characterize the system's consistency model. The speed layer may provide approximate or incomplete results for recent data. The batch layer eventually processes all data and produces correct comprehensive views. Applications tolerate temporary inconsistencies in exchange for low latency.
The batch layer implementation typically uses distributed computing frameworks designed for large-scale data processing. These frameworks distribute computation across clusters of commodity hardware, achieving horizontal scalability through data partitioning. Processing logic expresses computation as transformations over immutable datasets.
The speed layer requires different characteristics: low-latency processing of individual events or small batches, support for incremental computation, and the ability to handle out-of-order data. Stream processing frameworks address these requirements through windowing operations, state management, and event-time processing.
Design Considerations
Lambda Architecture introduces significant complexity in exchange for specific benefits. Understanding when to adopt this architecture requires evaluating trade-offs against alternative approaches.
Consistency requirements determine Lambda Architecture's suitability. Applications needing strong consistency for recent data should consider alternative architectures. Lambda Architecture provides eventual consistency with a temporal gap between speed and batch layer results. Financial systems requiring immediate consistency for all transactions typically need different architectures. Analytics systems that tolerate temporary inconsistencies for recent data while requiring accuracy for historical analysis align well with Lambda Architecture.
Query latency expectations affect architectural decisions. The speed layer adds complexity specifically to reduce query latency for recent data. Applications where users tolerate batch processing delays (daily reports, historical analytics) may not benefit from the speed layer. Real-time dashboards, monitoring systems, and user-facing applications showing current activity justify the additional complexity.
Development and operational costs increase substantially with Lambda Architecture. Teams maintain separate codebases for batch and speed layers, often implementing similar logic in different frameworks. Operations teams manage multiple distributed systems with different operational characteristics. Organizations must weigh these costs against the value of low-latency queries over current data.
Data volume and velocity impact architectural choices. Lambda Architecture handles high data volumes through horizontal scaling in the batch layer. High-velocity data streams benefit from the speed layer's incremental processing. Systems with moderate data volumes and velocities may find simpler architectures sufficient. Extremely high velocities may push the speed layer's capabilities, requiring careful resource allocation.
Reprocessing requirements favor Lambda Architecture when historical data needs reanalysis with updated logic. The immutable master dataset enables full reprocessing without data migration. Systems where business logic evolves frequently and historical data must reflect new logic benefit from recomputation capabilities. Static analytical logic with infrequent changes reduces this advantage.
Kappa Architecture represents an alternative that eliminates the batch layer, processing all data through a single streaming pipeline. This simplification reduces operational complexity and eliminates code duplication. Kappa Architecture suits systems where stream processing frameworks can handle complete historical reprocessing, the speed layer alone meets accuracy requirements, and development resources are constrained. Lambda Architecture provides better separation of concerns when batch and streaming characteristics differ significantly.
Data freshness tolerance varies by use case. Lambda Architecture's speed layer typically maintains recent data (minutes to hours) while the batch layer handles historical data. Applications requiring sub-second freshness may need pure streaming architectures. Those tolerating hourly or daily freshness might not need a speed layer at all.
# Decision matrix for Lambda Architecture adoption
class ArchitectureDecision
def recommend_lambda?(requirements)
score = 0
# Strong indicators for Lambda
score += 2 if requirements[:data_volume] == :large
score += 2 if requirements[:query_latency] == :low
score += 2 if requirements[:reprocessing_needed] == :frequent
score += 1 if requirements[:eventual_consistency] == :acceptable
# Indicators against Lambda
score -= 2 if requirements[:strong_consistency] == :required
score -= 2 if requirements[:operational_complexity] == :limited
score -= 1 if requirements[:team_size] == :small
score > 3
end
end
The architectural decision should account for team expertise. Lambda Architecture requires skills in both batch processing frameworks (MapReduce, Spark) and stream processing systems (Storm, Flink, Kafka Streams). Teams experienced in only one paradigm face a steeper learning curve.
Implementation Approaches
Lambda Architecture implementations vary based on technology choices, scale requirements, and organizational constraints. Several distinct approaches address different operational contexts.
Full-scale distributed implementation deploys separate clusters for batch and speed layers using enterprise-grade frameworks. The batch layer runs on Hadoop or Spark clusters processing terabytes or petabytes of data. The speed layer operates on Storm, Flink, or Kafka Streams clusters handling thousands of events per second. The serving layer uses distributed databases like Cassandra or HBase for indexed access to batch views, with in-memory stores like Redis for speed layer results.
This approach maximizes throughput and fault tolerance through horizontal scaling. Batch processing distributes across hundreds or thousands of nodes. Speed layer processing partitions across multiple stream processors. The serving layer replicates data across availability zones. Operations teams manage complex distributed systems with sophisticated monitoring, alerting, and deployment automation.
Organizations adopt this approach when data volumes exceed single-machine capabilities, query loads require distributed serving infrastructure, and operational expertise exists for managing multiple clusters. The implementation cost includes hardware or cloud infrastructure, specialized personnel, and ongoing operational overhead.
Hybrid cloud implementation splits components across different infrastructure providers based on cost and performance characteristics. Batch processing might run on commodity hardware or spot instances due to its fault-tolerant nature. Speed layer processing requires more reliable infrastructure with better SLAs. The serving layer might use managed database services to reduce operational burden.
# Configuration for hybrid deployment
class HybridArchitecture
def initialize
@batch_config = {
provider: 'aws_spot',
instance_type: 'compute_optimized',
scaling: 'horizontal',
cost_priority: 'minimize'
}
@speed_config = {
provider: 'aws_on_demand',
instance_type: 'memory_optimized',
scaling: 'vertical_then_horizontal',
latency_priority: 'minimize'
}
@serving_config = {
provider: 'managed_service',
database: 'dynamodb',
caching: 'elasticache',
availability: 'multi_region'
}
end
end
Simplified small-scale implementation reduces complexity for moderate data volumes. The batch layer might run on a single powerful server using in-memory processing frameworks. The speed layer could use lightweight stream processors or even cron jobs processing recent data incrementally. The serving layer might combine batch and speed results in a single database with appropriate indexing.
This approach suits organizations with limited operational resources, data volumes measured in gigabytes or low terabytes, and query loads manageable by single-server databases. Implementation complexity decreases substantially while maintaining Lambda Architecture's conceptual benefits.
Serverless implementation eliminates infrastructure management by using cloud-native services. Batch processing runs on AWS Lambda, Google Cloud Functions, or Azure Functions triggered periodically. Speed layer processing uses cloud stream processing services. The serving layer uses managed databases with auto-scaling capabilities.
# Serverless batch processor trigger
require 'aws-sdk-lambda'
class BatchTrigger
def schedule_processing(data_partition)
lambda = Aws::Lambda::Client.new
lambda.invoke({
function_name: 'batch_processor',
invocation_type: 'Event',
payload: {
partition: data_partition,
timestamp: Time.now.utc.iso8601
}.to_json
})
end
end
This approach minimizes operational overhead and allows fine-grained cost optimization. Organizations pay only for actual computation time. The implementation works well for variable workloads with unpredictable patterns. Limitations include function execution time limits, memory constraints, and cold start latencies.
Micro-batch compromise implements both layers using the same streaming framework with different window sizes. Large windows (hours) provide batch-like processing. Small windows (seconds) provide real-time processing. This reduces code duplication while maintaining separate logical layers.
Spark Streaming's micro-batch architecture naturally fits this approach. Developers implement processing logic once, then configure different batch intervals for batch and speed layers. The serving layer merges results from different window sizes. This compromise sacrifices some theoretical separation but gains practical simplicity.
Ruby Implementation
Ruby's ecosystem includes tools for implementing Lambda Architecture components, though the language sees less adoption for large-scale batch processing compared to JVM languages. Ruby's strengths in API development and data transformation make it suitable for specific Lambda Architecture roles.
Batch layer processing in Ruby typically uses smaller-scale approaches or interfaces with existing batch processing systems. The Hadoop Streaming API allows Ruby scripts to process data in Hadoop clusters.
#!/usr/bin/env ruby
# Hadoop streaming mapper
STDIN.each_line do |line|
user_id, timestamp, event_type, value = line.chomp.split(',')
# Emit key-value pairs for reducer
puts "#{user_id}\t#{event_type}:#{value}"
end
#!/usr/bin/env ruby
# Hadoop streaming reducer
require 'json'
current_key = nil
aggregates = Hash.new { |h, k| h[k] = [] }
STDIN.each_line do |line|
key, value = line.chomp.split("\t")
if key != current_key && current_key
# Output aggregated results for previous key
result = {
user_id: current_key,
events: aggregates
}
puts JSON.generate(result)
aggregates.clear
end
current_key = key
event_type, event_value = value.split(':')
aggregates[event_type] << event_value.to_f
end
# Output final key
if current_key
result = { user_id: current_key, events: aggregates }
puts JSON.generate(result)
end
For smaller batch processing, Ruby excels at orchestrating workflows and data transformations. The parallel gem enables concurrent processing on multi-core systems.
require 'parallel'
require 'json'
class BatchProcessor
def process_partition(partition_files)
Parallel.map(partition_files, in_processes: 8) do |file|
process_file(file)
end
end
private
def process_file(file)
aggregates = Hash.new(0)
File.foreach(file) do |line|
data = JSON.parse(line)
key = extract_key(data)
aggregates[key] += data['value'].to_f
end
aggregates
end
def extract_key(data)
"#{data['user_id']}:#{data['event_type']}"
end
end
Speed layer processing suits Ruby better given the language's performance characteristics. The bunny gem provides RabbitMQ integration for message stream processing.
require 'bunny'
require 'json'
require 'redis'
class SpeedLayerProcessor
def initialize
@redis = Redis.new
@connection = Bunny.new
@connection.start
@channel = @connection.create_channel
end
def process_stream(queue_name)
queue = @channel.queue(queue_name, durable: true)
queue.subscribe(block: true) do |delivery_info, properties, body|
event = JSON.parse(body)
process_event(event)
@channel.ack(delivery_info.delivery_tag)
end
end
private
def process_event(event)
key = "speed:#{event['user_id']}:#{event['event_type']}"
# Increment counter with expiration
@redis.multi do |redis|
redis.incrbyfloat(key, event['value'].to_f)
redis.expire(key, 3600) # Keep for 1 hour
end
end
end
The kafka-ruby gem integrates with Apache Kafka for higher-throughput streaming scenarios.
require 'kafka'
require 'json'
class KafkaSpeedLayer
def initialize(brokers)
@kafka = Kafka.new(brokers)
@consumer = @kafka.consumer(group_id: 'speed-layer')
end
def process_topic(topic)
@consumer.subscribe(topic)
@consumer.each_message do |message|
event = JSON.parse(message.value)
# Process with windowing
window_key = calculate_window(message.timestamp)
update_window(window_key, event)
end
end
private
def calculate_window(timestamp)
window_size = 300 # 5-minute windows
(timestamp / window_size) * window_size
end
def update_window(window_key, event)
# Update windowed aggregates
key = "window:#{window_key}:#{event['user_id']}"
# Implementation depends on storage backend
end
end
Serving layer implementation leverages Ruby web frameworks. A Sinatra or Rails application provides the query API that merges batch and speed layer results.
require 'sinatra'
require 'redis'
require 'cassandra'
class ServingLayer < Sinatra::Base
configure do
set :redis, Redis.new
set :cassandra, Cassandra.cluster.connect('analytics')
end
get '/user/:user_id/stats' do
user_id = params[:user_id]
# Query batch layer (Cassandra)
batch_results = query_batch_layer(user_id)
# Query speed layer (Redis)
speed_results = query_speed_layer(user_id)
# Merge results
merged = merge_layers(batch_results, speed_results)
json merged
end
private
def query_batch_layer(user_id)
statement = settings.cassandra.prepare(
'SELECT event_type, SUM(value) as total
FROM user_aggregates
WHERE user_id = ?
GROUP BY event_type'
)
results = settings.cassandra.execute(statement, arguments: [user_id])
results.each_with_object({}) do |row, hash|
hash[row['event_type']] = row['total']
end
end
def query_speed_layer(user_id)
keys = settings.redis.keys("speed:#{user_id}:*")
keys.each_with_object({}) do |key, hash|
event_type = key.split(':').last
value = settings.redis.get(key).to_f
hash[event_type] = value
end
end
def merge_layers(batch, speed)
merged = batch.dup
speed.each do |event_type, value|
merged[event_type] = (merged[event_type] || 0) + value
end
merged
end
end
Data ingestion pipeline in Ruby handles incoming data distribution to both batch and speed layers.
require 'kafka'
require 's3'
class DataIngestionPipeline
def initialize(kafka_brokers, s3_bucket)
@kafka = Kafka.new(kafka_brokers)
@producer = @kafka.producer
@s3 = S3::Service.new(access_key_id: ENV['AWS_ACCESS_KEY'],
secret_access_key: ENV['AWS_SECRET_KEY'])
@bucket = @s3.buckets.find(s3_bucket)
end
def ingest(event)
timestamp = Time.now.utc
# Send to speed layer (Kafka)
@producer.produce(
event.to_json,
topic: 'events',
partition_key: event['user_id']
)
# Append to batch layer (S3)
append_to_batch_storage(event, timestamp)
@producer.deliver_messages
end
private
def append_to_batch_storage(event, timestamp)
partition_path = "events/year=#{timestamp.year}/" \
"month=#{timestamp.month}/" \
"day=#{timestamp.day}/" \
"hour=#{timestamp.hour}/" \
"#{timestamp.to_i}-#{SecureRandom.uuid}.json"
object = @bucket.objects.build(partition_path)
object.content = event.to_json
object.save
end
end
Performance Considerations
Lambda Architecture's performance characteristics span multiple dimensions: batch throughput, stream latency, query response time, and resource utilization. Each layer exhibits different performance properties requiring distinct optimization approaches.
Batch layer throughput determines how quickly comprehensive views refresh. Processing time grows with dataset size but should scale sub-linearly through parallelization. A well-implemented batch layer processes terabytes of data in hours rather than days.
Data partitioning strategy significantly impacts batch performance. Partitioning by time (year, month, day) enables incremental processing where only new partitions require computation. Partitioning by key (user ID, region) distributes computation evenly across nodes. Hybrid partitioning schemes combine temporal and key-based approaches.
# Partition strategy impacts batch performance
class PartitionStrategy
def time_based_partitions(start_date, end_date)
(start_date..end_date).map do |date|
"data/year=#{date.year}/month=#{date.month}/day=#{date.day}"
end
end
def key_based_partitions(key, num_partitions)
partition_id = Digest::MD5.hexdigest(key.to_s).to_i(16) % num_partitions
"data/partition=#{partition_id}"
end
def hybrid_partitions(date, key, num_partitions)
partition_id = Digest::MD5.hexdigest(key.to_s).to_i(16) % num_partitions
"data/year=#{date.year}/month=#{date.month}/day=#{date.day}/partition=#{partition_id}"
end
end
Compression reduces storage and I/O costs while increasing CPU utilization. Modern codecs like Snappy or LZ4 balance compression ratios with CPU overhead. Column-oriented formats like Parquet achieve high compression for analytical workloads through encoding schemes optimized for repeated values.
Speed layer latency measures the delay between event arrival and query availability. Micro-batch streaming (Spark Streaming) introduces latency equal to batch intervals, typically seconds. True streaming systems (Flink, Storm) process events individually with millisecond latencies.
State management dominates speed layer performance. Systems maintaining large amounts of state (windowed aggregates, session data) require efficient state backends. In-memory state provides lowest latency but limits scale. Disk-backed state (RocksDB) enables larger state sizes with higher latency. Distributed state stores sacrifice latency for horizontal scaling.
# State management strategies for speed layer
class StateManager
def initialize(backend: :memory)
@backend = create_backend(backend)
end
def update_window(window_key, event)
current = @backend.get(window_key) || {}
updated = merge_event(current, event)
@backend.put(window_key, updated)
end
private
def create_backend(type)
case type
when :memory
InMemoryBackend.new # Fast, limited capacity
when :rocksdb
RocksDBBackend.new # Larger capacity, slower
when :redis
RedisBackend.new # Distributed, network overhead
end
end
end
Windowing granularity affects speed layer throughput. Fine-grained windows (1-second) generate more state updates. Coarse-grained windows (5-minute) reduce update frequency but increase staleness. Sliding windows require more computation than tumbling windows due to overlapping calculations.
Query performance depends on serving layer implementation. Indexed databases (Cassandra, HBase) provide sub-second lookups for point queries. Range queries perform worse, requiring multiple index lookups. Pre-aggregated views trade write-time computation for read-time performance.
The merge operation combining batch and speed results adds overhead to every query. Minimizing merge complexity improves query latency. Simple addition (sum of counters) merges efficiently. Complex operations (median, percentile) require more computation.
# Query merge optimization
class QueryMerger
# Fast merge for simple aggregates
def merge_sums(batch_sum, speed_sum)
batch_sum + speed_sum
end
# Slower merge for complex aggregates
def merge_percentiles(batch_values, speed_values)
combined = batch_values + speed_values
combined.sort[combined.length * 0.95]
end
end
Caching frequently accessed queries reduces load on both batch and speed layers. A caching layer (Redis, Memcached) in front of the serving layer handles repeated queries with microsecond latencies. Cache invalidation strategies must account for batch layer updates and speed layer freshness requirements.
Resource utilization patterns differ across layers. The batch layer exhibits periodic high utilization during processing runs and low utilization between runs. Elastic scaling matches compute resources to processing schedules. The speed layer requires sustained resource allocation to handle continuous event streams. The serving layer experiences variable load following query patterns.
Cost optimization strategies exploit these patterns. Batch processing uses spot instances or preemptible VMs for cost reduction. Speed layer processing requires reliable instances with SLA guarantees. The serving layer benefits from auto-scaling groups responding to query load variations.
Data locality optimizations reduce network transfer costs. Co-locating batch processing with storage (HDFS, S3) minimizes data movement. Processing frameworks schedule tasks on nodes containing required data partitions. The speed layer processes events near their source, reducing ingestion latency.
Real-World Applications
Lambda Architecture deployments span diverse industries and use cases, each adapting the architecture to specific requirements and constraints.
Real-time analytics platforms represent the canonical Lambda Architecture application. Companies process clickstream data from web and mobile applications, computing user engagement metrics, conversion funnels, and behavioral cohorts. The batch layer reprocesses complete history to maintain accurate lifetime value calculations and cross-session analysis. The speed layer provides current session metrics and real-time dashboard updates.
A social media platform processes billions of events daily. The batch layer runs nightly Spark jobs computing user follower counts, post engagement rates, and content recommendations from the complete dataset. The speed layer maintains current counts using Kafka Streams, updating within seconds as users interact. The serving layer merges historical trends with current activity for profile pages and analytics dashboards.
# Real-time analytics query merging
class AnalyticsPlatform
def user_metrics(user_id)
# Batch metrics: complete history, refreshed nightly
batch_metrics = cassandra.execute(
'SELECT * FROM user_metrics WHERE user_id = ?',
user_id
).first
# Speed metrics: last 24 hours, updated in real-time
speed_metrics = redis.hgetall("metrics:#{user_id}:realtime")
{
total_posts: batch_metrics['total_posts'] + speed_metrics['recent_posts'].to_i,
total_likes: batch_metrics['total_likes'] + speed_metrics['recent_likes'].to_i,
followers: batch_metrics['followers'], # Updated nightly
current_session_time: speed_metrics['session_time'].to_i
}
end
end
Fraud detection systems use Lambda Architecture to balance detection accuracy with response time. The batch layer trains machine learning models on complete transaction history, identifying patterns across months or years. The speed layer applies lightweight rules to incoming transactions, flagging suspicious activity within milliseconds. The serving layer combines model scores with real-time rule violations for comprehensive risk assessment.
Financial institutions process millions of transactions hourly. The batch layer executes complex graph analysis detecting fraud rings and multi-account schemes. The speed layer evaluates individual transactions against velocity rules and blocklists. Both layers feed a serving layer that determines whether to approve, decline, or review transactions.
IoT sensor networks generate continuous data streams requiring both real-time monitoring and historical analysis. The batch layer processes complete sensor histories to identify long-term trends, equipment degradation patterns, and seasonal effects. The speed layer monitors current readings for anomaly detection and threshold violations. The serving layer provides operational dashboards showing both real-time status and historical context.
A manufacturing facility instruments thousands of sensors across production equipment. The batch layer analyzes vibration patterns, temperature cycles, and performance metrics to predict maintenance needs. The speed layer alerts operators immediately when sensors exceed safety thresholds. Engineers access the serving layer for dashboards combining current equipment status with maintenance predictions.
Content recommendation engines balance personalization accuracy with freshness. The batch layer computes collaborative filtering models from complete user interaction histories. These models identify long-term preferences and cross-user patterns. The speed layer incorporates recent user actions—clicks, purchases, ratings—to adjust recommendations immediately. The serving layer combines stable base recommendations with real-time preference updates.
Video streaming platforms process viewing histories for millions of users. The batch layer runs matrix factorization algorithms overnight, generating base recommendations. The speed layer tracks current viewing sessions, adjusting recommendations based on what users watch now. The serving layer merges both for homepage recommendation lists that reflect both stable preferences and current interests.
Log aggregation and analysis platforms ingest application logs, metrics, and events from distributed systems. The batch layer indexes complete logs in searchable formats, enabling complex historical queries. The speed layer maintains recent logs in memory for fast access to current system state. The serving layer provides unified search across both layers, automatically routing queries based on time ranges.
Operations teams query logs to debug production issues. Recent logs (last hour) come from the speed layer with sub-second response times. Historical queries (last week) access the batch layer with seconds of latency. The serving layer presents a unified interface, automatically determining which layer to query based on requested time ranges.
# Log query routing based on time range
class LogServingLayer
SPEED_LAYER_THRESHOLD = 3600 # 1 hour
def query_logs(filters)
time_range = filters[:end_time] - filters[:start_time]
cutoff = Time.now.utc - SPEED_LAYER_THRESHOLD
if filters[:start_time] >= cutoff
# All recent data, use speed layer only
query_speed_layer(filters)
elsif filters[:end_time] < cutoff
# All historical data, use batch layer only
query_batch_layer(filters)
else
# Spans both layers, query and merge
speed_results = query_speed_layer(
filters.merge(start_time: cutoff)
)
batch_results = query_batch_layer(
filters.merge(end_time: cutoff)
)
merge_results(batch_results, speed_results)
end
end
end
Reference
Architecture Components
| Component | Responsibility | Update Frequency | Consistency |
|---|---|---|---|
| Batch Layer | Complete recomputation from raw data | Hours to days | Strongly consistent |
| Speed Layer | Incremental computation on recent data | Seconds to minutes | Eventually consistent |
| Serving Layer | Index batch views, merge with speed views | Continuous | Mixed consistency |
| Master Dataset | Immutable append-only raw data storage | Real-time appends | Strongly consistent |
Layer Characteristics
| Characteristic | Batch Layer | Speed Layer |
|---|---|---|
| Data Volume | Complete dataset | Recent data only |
| Latency | High (hours) | Low (seconds) |
| Throughput | Very high | Moderate |
| Fault Tolerance | Recomputation | State checkpointing |
| Complexity | Moderate | High |
| Resource Usage | Periodic spikes | Sustained |
Processing Patterns
| Pattern | Batch Implementation | Speed Implementation | Use Case |
|---|---|---|---|
| Aggregation | MapReduce, Spark | Windowed streams | Counters, sums |
| Joins | Hash joins, sort-merge | Stream-stream, stream-table | Data enrichment |
| Filtering | Predicate pushdown | Event filtering | Data selection |
| Grouping | Shuffle operations | Key-based partitioning | Category analysis |
| Sorting | Distributed sort | Per-window sorting | Rankings, top-N |
Technology Stack Options
| Layer | JVM Ecosystem | Alternative Options | Ruby Integration |
|---|---|---|---|
| Batch | Hadoop, Spark, Hive | Dask, Ray | Hadoop Streaming, JRuby Spark |
| Speed | Storm, Flink, Kafka Streams | Pulsar, Akka Streams | kafka-ruby, bunny |
| Serving | Cassandra, HBase, Druid | PostgreSQL, MongoDB, DynamoDB | cassandra-driver, mongo, aws-sdk |
| Message Queue | Kafka, Pulsar | RabbitMQ, Amazon Kinesis | kafka-ruby, bunny, aws-sdk-kinesis |
| Storage | HDFS, S3, Azure Blob | Google Cloud Storage, MinIO | aws-sdk-s3, azure-storage-blob |
Query Merge Strategies
| Aggregate Type | Merge Operation | Complexity | Accuracy Trade-off |
|---|---|---|---|
| Sum | Addition | O(1) | None |
| Count | Addition | O(1) | None |
| Average | Weighted average | O(1) | None |
| Min/Max | Comparison | O(1) | None |
| Median | Re-sort | O(n log n) | Approximation possible |
| Percentile | Re-compute | O(n log n) | Approximation recommended |
| Distinct Count | HyperLogLog union | O(1) | Approximate |
| Set Union | Set merge | O(n) | None |
Performance Tuning Parameters
| Parameter | Typical Range | Impact | Trade-off |
|---|---|---|---|
| Batch Interval | 1-24 hours | Freshness vs resource usage | Longer intervals reduce cost |
| Speed Window Size | 1-300 seconds | Latency vs accuracy | Smaller windows increase overhead |
| Partition Count | 10-1000 | Parallelism vs overhead | More partitions increase coordination |
| Replication Factor | 2-5 | Fault tolerance vs cost | Higher replication improves availability |
| Cache TTL | 1-300 seconds | Freshness vs load | Longer TTL reduces backend queries |
Data Partitioning Schemes
| Scheme | Batch Layer | Speed Layer | Query Pattern |
|---|---|---|---|
| Time-based | Year/month/day | Tumbling windows | Range queries |
| Hash-based | Hash(key) mod N | Partition key | Point lookups |
| Range-based | Key ranges | Key groups | Range scans |
| Composite | Time + hash | Time + partition | Mixed queries |
Failure Scenarios
| Failure Type | Batch Layer Response | Speed Layer Response | Recovery Time |
|---|---|---|---|
| Node failure | Reschedule task | Reprocess from checkpoint | Minutes |
| Network partition | Continue with available nodes | Buffer events | Seconds to minutes |
| Logic bug | Reprocess with fixed code | Deploy fix, recompute windows | Hours to days |
| Data corruption | Recompute from source | Clear state, reprocess | Hours |
| Storage failure | Replica failover | Checkpoint restoration | Minutes |
Cost Optimization Strategies
| Strategy | Batch Layer | Speed Layer | Serving Layer |
|---|---|---|---|
| Resource Type | Spot instances | On-demand instances | Reserved instances |
| Scaling | Elastic scaling | Auto-scaling | Dynamic scaling |
| Compression | High ratio codecs | Low latency codecs | Query-optimized formats |
| Data Lifecycle | Archive to cold storage | Short retention | Tiered storage |
| Reprocessing | Incremental when possible | Stateless when possible | Cache frequently accessed |