Overview
ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) represent two distinct architectural patterns for data pipeline construction. Both patterns address the same fundamental problem: moving data from source systems to analytical data stores while applying necessary transformations. The critical difference lies in where transformation logic executes within the pipeline.
ETL executes transformations on data after extraction but before loading into the target system. An intermediate processing layer accepts raw data from sources, applies business logic and data quality rules, then writes the transformed output to the destination. This pattern emerged during the era of on-premise data warehouses with limited processing capacity and expensive storage.
ELT loads raw data directly into the target system after extraction, deferring all transformation work until after loading completes. The destination system—typically a modern cloud data warehouse—handles transformation execution using its native processing capabilities. This pattern gained prominence with the rise of scalable cloud data platforms offering cheap storage and elastic compute.
The distinction affects system architecture, resource allocation, operational complexity, and cost structures. ETL centralizes transformation logic in dedicated processing infrastructure. ELT distributes transformation work to the target platform. Neither pattern inherently surpasses the other; each offers advantages for specific requirements and constraints.
ETL Pattern Flow:
Source DB → Extract → Transform (Staging) → Load → Warehouse
ELT Pattern Flow:
Source DB → Extract → Load → Warehouse → Transform (In-Place)
Key Principles
Data Movement Stages
Both patterns divide data pipeline work into three conceptual stages. Extraction pulls data from source systems—databases, APIs, files, streaming platforms. Loading writes data to a destination system—data warehouse, data lake, analytical database. Transformation converts raw data into analysis-ready formats through cleaning, enrichment, aggregation, and business logic application. The temporal ordering of these stages defines the architectural pattern.
Transformation Location
ETL performs transformation work outside the target system. Extract processes pull raw data from sources into a dedicated transformation environment—application servers, container clusters, distributed processing frameworks. After transformation completes, load processes write results to the destination. The target system receives only final, transformed data.
ELT moves raw data directly into the target system with minimal or no transformation during transit. Load processes write source data to staging tables or raw data zones within the destination platform. Transformation executes using the target system's processing engine—SQL queries, stored procedures, transformation frameworks built on the platform's compute layer.
Processing Architecture
ETL requires separate processing infrastructure between sources and destinations. This infrastructure handles transformation workloads—data validation, type conversion, joins, aggregations, deduplication. Organizations deploy custom servers, cluster managers, or distributed processing platforms specifically for ETL work.
ELT consolidates processing onto the destination platform. Modern cloud data warehouses provide columnar storage, massively parallel processing, and elastic scaling. These capabilities handle transformation workloads that previously required dedicated infrastructure. The target system becomes both storage and compute layer.
Data Persistence Strategy
ETL maintains transient data during transformation. Raw source data exists temporarily in staging areas or processing memory. Only transformed results persist in the target system. This approach minimizes destination storage requirements but complicates troubleshooting—raw data disappears after transformation.
ELT retains raw data in the destination system. Load processes write unmodified source data to persistent storage. Subsequent transformation steps read from raw tables and write to transformed tables. Both layers exist simultaneously within the target platform. This strategy increases storage consumption but provides data lineage and audit capabilities.
Schema Application Timing
ETL applies schema and structure during transformation. Extract processes handle various source formats—relational tables, JSON documents, CSV files, XML streams. Transform logic converts these formats into target schema structures before loading. The destination receives data conforming to its expected schema.
ELT defers schema application until transformation execution. Load processes write data in flexible formats—JSON columns, variant types, semi-structured storage. The target system stores data without enforcing strict schemas initially. Transform processes apply structure when converting raw to analytical tables, supporting schema evolution and exploratory analysis.
Dependency Management
ETL creates external dependencies on transformation infrastructure. Pipeline execution requires dedicated servers, orchestration platforms, and processing frameworks to remain operational. Infrastructure failures block the entire pipeline. Scaling requires provisioning additional transformation capacity.
ELT reduces external dependencies by consolidating work onto the target platform. Pipeline execution depends primarily on source system availability and destination platform health. The target system's native scaling handles increased transformation load. Infrastructure requirements simplify to orchestration and metadata management.
Implementation Approaches
Staging-Based ETL Architecture
Staging-based ETL establishes an intermediate persistence layer between sources and destinations. Extract processes write raw data to staging databases or file systems. Transform processes read from staging, apply business logic, then write to separate transformed staging areas. Load processes transfer final data from transformed staging to production destinations.
This approach isolates transformation work from source and target systems. Staging areas buffer data during processing, protecting sources from long-running queries and destinations from corrupt data. Multiple transformation stages can chain together, each reading from previous output. Failed transformations leave staging data intact for debugging and reprocessing.
The architecture requires managing staging infrastructure—databases, file systems, cleanup policies. Storage costs accumulate as staging retains data across processing cycles. Orchestration complexity increases with multiple staging layers. However, the separation provides clear data lineage and rollback points.
Stream-Based ETL Architecture
Stream-based ETL processes data continuously without persistent intermediate storage. Extract components emit data to message queues or event streams. Transform logic subscribes to streams, applies transformations in memory, then publishes results to output streams. Load components consume transformed streams and write to destinations.
This pattern handles high-velocity data sources—application logs, sensor feeds, user activity streams. Low-latency requirements make batch staging impractical. Streaming frameworks like Apache Kafka, Apache Flink, or cloud-native streaming services provide the backbone infrastructure.
Transform logic operates on bounded windows or maintains stateful processing. State management becomes critical—streaming frameworks provide mechanisms for maintaining aggregates, joins, and session data. Failure recovery requires replaying source streams or restoring checkpointed state.
Raw Data Lake ELT Architecture
Raw data lake ELT loads all source data into a centralized data lake storage layer without transformation. Extract processes write data in native formats—JSON, Parquet, Avro, CSV—to object storage organized by source and ingestion time. Load completes immediately upon successful write.
Transform processes execute as separate jobs reading from the data lake. SQL engines, distributed processing frameworks, or cloud data warehouses query data lake files directly. Transformations write results back to the data lake in curated zones—bronze (raw), silver (cleaned), gold (aggregated).
This architecture maximizes raw data retention and supports schema evolution. New analytical requirements query historical raw data without reingestion. Different teams apply competing transformation logic to the same raw inputs. Storage costs remain low using object storage pricing.
In-Database ELT Architecture
In-database ELT consolidates all pipeline work within the target database platform. Extract processes load raw data directly to staging tables using bulk import APIs. Transform logic executes as SQL queries, stored procedures, or database-native processing languages. Results populate production analytical tables within the same database.
Modern cloud data warehouses optimize this pattern through separation of storage and compute. Multiple compute clusters query the same underlying storage. Transform jobs scale independently of analytical queries. Query optimizers leverage columnar compression, partitioning, and caching.
This approach minimizes data movement and external dependencies. All processing leverages database optimizations—parallel execution, pushdown predicates, materialized views. Security and access control use database-native mechanisms. However, the pattern couples transformation logic to specific database SQL dialects and capabilities.
Hybrid Push-Pull Architecture
Hybrid architectures combine ETL and ELT patterns for different data flows. Critical, high-quality data receives ETL treatment—validated and transformed before loading. Exploratory or rapidly changing data follows ELT patterns—loaded raw for flexible transformation. The same pipeline incorporates both patterns based on data characteristics and requirements.
Decision criteria include data volume, transformation complexity, quality requirements, and latency targets. Small, well-understood datasets benefit from ETL validation. Large, evolving datasets favor ELT flexibility. Mission-critical reports demand ETL quality assurance. Ad-hoc analysis accepts ELT trade-offs.
Implementation requires sophisticated orchestration coordinating multiple patterns. Infrastructure provisions both dedicated transformation capacity and in-database processing. Clear governance defines which datasets follow which patterns. Teams understand trade-offs and select patterns deliberately rather than defaulting to single approaches.
Design Considerations
Data Volume Impact
Data volume fundamentally shapes pattern selection. ETL scales linearly with volume—doubling data volume requires doubling transformation capacity. Processing infrastructure must handle peak data loads. Storage requirements span temporary staging across transformation duration.
ELT decouples scaling from volume through destination platform capabilities. Cloud data warehouses provide elastic compute that scales independently of storage. Transformation load becomes queries against stored data rather than data movement operations. Storage scales automatically without capacity planning.
Consider ETL when transformation dramatically reduces data volume through aggregation or filtering. Loading summary data requires less destination capacity than loading raw data for in-place transformation. However, modern compression often minimizes the difference—columnar formats achieve 10x compression ratios even on raw data.
Transformation Complexity Assessment
Transformation complexity includes computational requirements, dependency chains, and logic sophistication. Simple transformations—type conversions, column renaming, null handling—execute efficiently in either pattern. Complex transformations—multi-way joins, window functions, recursive hierarchies, machine learning models—strain different resources depending on pattern.
ETL handles complex logic through specialized processing frameworks. Spark, Flink, or custom applications provide programming language flexibility and algorithmic control. Distributed processing frameworks handle complex joins and aggregations across massive datasets. Machine learning pipelines integrate naturally with Python or JVM ecosystems.
ELT relies on SQL capabilities and database-native functions. Modern analytical databases provide extensive SQL support—window functions, recursive CTEs, user-defined functions. However, complex algorithms or external library dependencies may prove awkward or impossible in SQL. Consider ETL when transformation logic exceeds SQL expressiveness.
Latency Requirements
Latency requirements measure acceptable delay between source changes and availability in analytical systems. Batch latency tolerates hours or days between updates. Near real-time requires minutes. Real-time demands seconds or milliseconds.
ETL introduces latency through batch processing cycles. Extract jobs poll sources periodically. Transform jobs process accumulated batches. Load jobs wait for transformation completion. End-to-end latency accumulates across stages. Stream-based ETL reduces latency but increases infrastructure complexity.
ELT supports incremental updates through micro-batch processing. Continuous ingestion loads small data increments to staging tables. Transform queries process recent data only, avoiding full recomputation. Destination platforms support concurrent reads and writes, enabling near real-time analytics.
Quality Assurance Priorities
Quality requirements determine validation timing and error handling strategies. High-quality data demands validation before reaching analytical systems. Rejecting invalid data prevents downstream corruption. Low-latency requirements may accept eventual consistency or quality degradation.
ETL validates data during transformation, before loading to destinations. Quality rules execute in controlled environments with full context. Invalid records divert to error queues or reject files. Only verified data reaches production systems. However, validation latency blocks the entire pipeline.
ELT loads all data first, validating during transformation queries. Invalid data persists in raw tables but filters out during analytical table population. Error handling spans multiple transformation layers. Debugging requires querying raw data to identify quality issues. This approach favors completeness over immediate quality.
Infrastructure Flexibility
Infrastructure choices constrain pattern viability. Existing investments in transformation platforms favor ETL patterns. Organizations with mature Spark clusters or custom processing frameworks leverage these capabilities. Migration costs outweigh ELT benefits when transformation infrastructure already exists.
Cloud-native organizations favor ELT patterns built on managed services. Modern data warehouses include sufficient transformation capabilities without additional infrastructure. Operational overhead decreases through consolidated platforms. Cost models shift from fixed infrastructure to consumption-based pricing.
Consider infrastructure maturity, team expertise, and operational capacity. ETL requires managing separate transformation infrastructure—servers, clusters, orchestration, monitoring. ELT centralizes operations on the destination platform but demands expertise in that platform's transformation capabilities.
Regulatory Compliance Constraints
Regulatory requirements affect data residency, retention, and access patterns. Some regulations mandate specific data handling procedures that constrain pattern selection.
ETL provides control points for implementing compliance requirements. Transformation stages enforce data masking, encryption, or deletion before data reaches destinations. Extract processes filter data based on consent or retention policies. Audit logs track data lineage across transformation stages.
ELT stores raw data in destination systems, potentially exposing sensitive information. Transformation logic must handle masking and filtering within the target platform. Row-level security and column masking features become critical. Some regulations prohibit storing certain data types even temporarily, eliminating ELT for those data flows.
Ruby Implementation
ETL Implementation with Ruby
Ruby serves as a transformation engine in custom ETL pipelines. Applications extract data through database adapters, HTTP clients, or file parsers, then apply transformation logic using Ruby's expressive syntax before loading results.
require 'pg'
require 'json'
class CustomerETL
def initialize(source_conn, dest_conn)
@source = source_conn
@dest = dest_conn
end
def extract
@source.exec(<<-SQL)
SELECT customer_id, email, created_at,
last_purchase_date, total_purchases
FROM customers
WHERE updated_at > NOW() - INTERVAL '1 day'
SQL
end
def transform(raw_data)
raw_data.map do |row|
{
id: row['customer_id'].to_i,
email_domain: row['email'].split('@').last.downcase,
signup_date: Date.parse(row['created_at']),
days_since_purchase: days_between(row['last_purchase_date'], Date.today),
purchase_category: categorize_purchases(row['total_purchases'].to_f),
risk_score: calculate_risk(row)
}
end
end
def load(transformed_data)
@dest.prepare('insert_customer', <<-SQL)
INSERT INTO customer_analytics
(id, email_domain, signup_date, days_since_purchase,
purchase_category, risk_score)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (id) DO UPDATE SET
email_domain = EXCLUDED.email_domain,
days_since_purchase = EXCLUDED.days_since_purchase,
purchase_category = EXCLUDED.purchase_category,
risk_score = EXCLUDED.risk_score
SQL
transformed_data.each do |record|
@dest.exec_prepared('insert_customer', record.values)
end
end
def run
raw_data = extract
transformed = transform(raw_data)
load(transformed)
end
private
def days_between(past_date, current_date)
return 9999 if past_date.nil?
(current_date - Date.parse(past_date)).to_i
end
def categorize_purchases(total)
case total
when 0...100 then 'low'
when 100...1000 then 'medium'
when 1000...10000 then 'high'
else 'vip'
end
end
def calculate_risk(row)
days_inactive = days_between(row['last_purchase_date'], Date.today)
total = row['total_purchases'].to_f
return 0 if total > 5000
return 10 if days_inactive < 30
return 50 if days_inactive < 90
return 80 if days_inactive < 180
100
end
end
# Execute pipeline
source = PG.connect(dbname: 'production')
dest = PG.connect(dbname: 'analytics')
pipeline = CustomerETL.new(source, dest)
pipeline.run
This ETL implementation transforms data in Ruby's application layer before loading. The transformation logic handles business rules, data quality checks, and computed fields outside the database. This pattern works for moderate data volumes where Ruby's processing capacity suffices.
Stream Processing ETL with Ruby
Ruby processes streaming data through message queue consumers, applying transformations before producing to downstream systems.
require 'kafka'
require 'json'
class EventETL
def initialize(kafka_brokers, source_topic, dest_topic)
@kafka = Kafka.new(kafka_brokers)
@consumer = @kafka.consumer(group_id: 'event-etl')
@producer = @kafka.producer
@source_topic = source_topic
@dest_topic = dest_topic
end
def process
@consumer.subscribe(@source_topic)
@consumer.each_message do |message|
raw_event = JSON.parse(message.value)
transformed = transform_event(raw_event)
if transformed[:valid]
@producer.produce(
transformed.to_json,
topic: @dest_topic,
key: transformed[:user_id].to_s
)
else
log_invalid_event(raw_event, transformed[:errors])
end
@producer.deliver_messages
end
end
private
def transform_event(event)
errors = []
# Validate required fields
errors << 'missing user_id' unless event['user_id']
errors << 'missing timestamp' unless event['timestamp']
# Enrich with computed fields
timestamp = Time.parse(event['timestamp']) rescue nil
errors << 'invalid timestamp' unless timestamp
# Classify event
event_type = classify_event(event['action'])
{
valid: errors.empty?,
errors: errors,
user_id: event['user_id'],
action: event['action'],
event_type: event_type,
timestamp: timestamp&.iso8601,
processed_at: Time.now.iso8601,
session_id: event['session_id'],
properties: sanitize_properties(event['properties'] || {})
}
end
def classify_event(action)
return 'engagement' if action.match?(/click|view|scroll/)
return 'conversion' if action.match?(/purchase|signup|subscribe/)
return 'navigation' if action.match?(/page|route/)
'other'
end
def sanitize_properties(props)
props.select { |k, v| k.match?(/^[a-z_]+$/) && !v.nil? }
end
def log_invalid_event(event, errors)
puts "Invalid event: #{errors.join(', ')} - #{event.inspect}"
end
end
# Run streaming ETL
etl = EventETL.new(
['kafka1:9092', 'kafka2:9092'],
'raw-events',
'transformed-events'
)
etl.process
Stream-based ETL transforms data in real-time as events flow through message queues. Ruby consumes raw events, applies validation and enrichment, then produces transformed events for downstream consumers.
ELT Implementation with Ruby
Ruby orchestrates ELT pipelines by coordinating extract and load operations, then triggering transformation queries in the target database.
require 'pg'
require 'aws-sdk-s3'
require 'csv'
class SalesELT
def initialize(source_db, warehouse_db, s3_bucket)
@source = source_db
@warehouse = warehouse_db
@s3 = Aws::S3::Client.new
@bucket = s3_bucket
end
def extract_to_s3
# Extract raw data from source
result = @source.exec(<<-SQL)
SELECT order_id, customer_id, product_id, quantity,
price, order_date, status
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day'
SQL
# Write to S3 as CSV
csv_data = CSV.generate do |csv|
csv << result.fields
result.each { |row| csv << row.values }
end
key = "raw/orders/#{Date.today}.csv"
@s3.put_object(bucket: @bucket, key: key, body: csv_data)
key
end
def load_from_s3(s3_key)
# Load raw data into warehouse staging table
@warehouse.exec(<<-SQL)
CREATE TABLE IF NOT EXISTS staging_orders (
order_id INTEGER,
customer_id INTEGER,
product_id INTEGER,
quantity INTEGER,
price DECIMAL(10,2),
order_date DATE,
status VARCHAR(50),
loaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
SQL
# Use warehouse's native bulk load
copy_command = <<-SQL
COPY staging_orders
FROM 's3://#{@bucket}/#{s3_key}'
CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=YYY'
CSV
IGNOREHEADER 1
SQL
@warehouse.exec(copy_command)
end
def transform_in_warehouse
# All transformation logic executes in the warehouse
@warehouse.exec(<<-SQL)
-- Create or replace analytics table with transformations
CREATE TABLE IF NOT EXISTS order_analytics AS
SELECT
o.order_id,
o.customer_id,
o.product_id,
o.quantity,
o.price,
o.quantity * o.price AS total_amount,
o.order_date,
o.status,
CASE
WHEN o.quantity * o.price >= 1000 THEN 'high_value'
WHEN o.quantity * o.price >= 100 THEN 'medium_value'
ELSE 'low_value'
END AS value_segment,
DATE_PART('year', o.order_date) AS order_year,
DATE_PART('month', o.order_date) AS order_month,
DATE_PART('dow', o.order_date) AS order_day_of_week
FROM staging_orders o
WHERE o.status = 'completed';
-- Create aggregate summary table
CREATE TABLE IF NOT EXISTS daily_order_summary AS
SELECT
order_date,
COUNT(*) AS order_count,
SUM(quantity * price) AS revenue,
AVG(quantity * price) AS avg_order_value,
COUNT(DISTINCT customer_id) AS unique_customers
FROM staging_orders
WHERE status = 'completed'
GROUP BY order_date;
SQL
end
def run
puts "Extracting data to S3..."
s3_key = extract_to_s3
puts "Loading data to warehouse..."
load_from_s3(s3_key)
puts "Transforming in warehouse..."
transform_in_warehouse
puts "ELT pipeline complete"
end
end
# Execute ELT pipeline
source = PG.connect(dbname: 'production')
warehouse = PG.connect(
host: 'warehouse.example.com',
dbname: 'analytics'
)
pipeline = SalesELT.new(source, warehouse, 'company-data-lake')
pipeline.run
This ELT implementation loads raw data directly into the warehouse, deferring transformation until data resides in the target system. Ruby orchestrates the pipeline but delegates transformation work to SQL executing in the warehouse.
Hybrid Pattern with Ruby
Real-world Ruby applications often combine ETL and ELT approaches based on specific requirements for each data flow.
class HybridDataPipeline
def initialize(connections)
@source = connections[:source]
@staging = connections[:staging]
@warehouse = connections[:warehouse]
end
def process_customer_data
# ETL pattern: Transform customer PII before loading
raw = @source.exec('SELECT * FROM customers')
transformed = raw.map do |row|
{
customer_id: row['id'],
# Hash email for privacy
email_hash: Digest::SHA256.hexdigest(row['email']),
# Mask phone numbers
phone_masked: mask_phone(row['phone']),
signup_date: row['created_at'],
segment: calculate_segment(row)
}
end
load_to_warehouse(transformed, 'customer_dim')
end
def process_order_events
# ELT pattern: Load raw events, transform in warehouse
raw = @source.exec(<<-SQL)
SELECT event_id, user_id, event_type,
event_data, event_timestamp
FROM order_events
WHERE event_timestamp > NOW() - INTERVAL '1 hour'
SQL
# Minimal transformation: just convert to JSON
raw.each do |row|
@warehouse.exec_params(<<-SQL, [
row['event_id'],
row['user_id'],
row['event_type'],
row['event_data'].to_json,
row['event_timestamp']
])
INSERT INTO raw_order_events
(event_id, user_id, event_type, event_data, event_timestamp)
VALUES ($1, $2, $3, $4, $5)
SQL
end
# Trigger warehouse-based transformation
@warehouse.exec(<<-SQL)
INSERT INTO order_event_analytics
SELECT
event_id,
user_id,
event_type,
event_data::json->>'order_id' AS order_id,
(event_data::json->>'amount')::DECIMAL AS amount,
event_timestamp,
DATE_TRUNC('hour', event_timestamp) AS event_hour
FROM raw_order_events
WHERE event_timestamp > NOW() - INTERVAL '1 hour'
SQL
end
private
def mask_phone(phone)
return nil if phone.nil?
phone.gsub(/\d(?=\d{4})/, '*')
end
def calculate_segment(customer)
total_orders = customer['total_orders'].to_i
return 'vip' if total_orders > 50
return 'regular' if total_orders > 10
'new'
end
def load_to_warehouse(data, table)
data.each do |record|
columns = record.keys.join(', ')
placeholders = (1..record.size).map { |i| "$#{i}" }.join(', ')
@warehouse.exec_params(
"INSERT INTO #{table} (#{columns}) VALUES (#{placeholders})",
record.values
)
end
end
end
Hybrid approaches apply ETL where data sensitivity or validation requirements demand pre-load transformation, while using ELT for high-volume or rapidly changing data that benefits from warehouse processing capabilities.
Tools & Ecosystem
Traditional ETL Platforms
Traditional ETL platforms provide graphical interfaces for designing data flows, pre-built connectors for common sources, and transformation engines for processing data outside target systems.
Talend Open Studio offers drag-and-drop ETL design with code generation to Java. Pentaho Data Integration (Kettle) provides visual transformation designers outputting to XML definitions. Both support complex transformation logic, scheduling, and monitoring. These tools arose when destination systems lacked processing capacity for transformation work.
Ruby integration occurs through custom components or REST APIs. Talend generates Java code that can invoke Ruby scripts for specific transformation logic. Pentaho supports BSF (Bean Scripting Framework) for embedding Ruby transformations within larger data flows.
Modern ELT Platforms
Modern ELT platforms coordinate extract and load operations while leveraging destination database engines for transformation.
Fivetran automates data replication from sources to warehouses, handling schema detection, incremental updates, and change data capture. Stitch (Singer specification) provides open-source data ingestion with Python-based extractors. Both load raw data into warehouses like Snowflake, BigQuery, or Redshift, enabling SQL-based transformation.
dbt (data build tool) revolutionized ELT transformation by treating SQL transformations as software engineering artifacts. Projects organize transformation logic into modular SQL files with dependencies, tests, and documentation. dbt compiles these definitions into database-specific SQL and executes transformations in the target warehouse.
# Ruby script to trigger dbt transformations
require 'open3'
require 'json'
class DBTRunner
def initialize(project_dir)
@project_dir = project_dir
end
def run_models(models: nil, full_refresh: false)
cmd = ['dbt', 'run', '--project-dir', @project_dir]
cmd += ['--models', models] if models
cmd << '--full-refresh' if full_refresh
stdout, stderr, status = Open3.capture3(*cmd)
if status.success?
parse_results(stdout)
else
raise "dbt run failed: #{stderr}"
end
end
def test
stdout, stderr, status = Open3.capture3(
'dbt', 'test', '--project-dir', @project_dir
)
parse_test_results(stdout)
end
private
def parse_results(output)
output.lines
.select { |l| l.include?('OK created') }
.map { |l| l.match(/model (\S+)/)[1] }
end
def parse_test_results(output)
failures = output.lines.select { |l| l.include?('FAIL') }
{ passed: failures.empty?, failures: failures }
end
end
# Orchestrate ELT with Ruby
runner = DBTRunner.new('/data/dbt_project')
runner.run_models(models: 'staging')
runner.run_models(models: 'marts')
test_results = runner.test
puts "Tests: #{test_results[:passed] ? 'PASSED' : 'FAILED'}"
Cloud Data Warehouse Platforms
Cloud data warehouses function as both destinations and transformation engines in ELT architectures.
Snowflake provides separated storage and compute with elastic scaling. Multiple virtual warehouses query the same data concurrently. Transformation workloads execute in dedicated compute clusters without affecting analytical queries. Support for semi-structured data (JSON, Avro, Parquet) enables flexible schema evolution.
Google BigQuery offers serverless SQL execution with automatic scaling. Storage costs separate from compute costs—pay only for queries executed. BigQuery ML enables training machine learning models directly in SQL transformations. Federated queries access external data sources without loading.
Amazon Redshift combines columnar storage with PostgreSQL compatibility. Spectrum queries extend to data lake files without loading. Concurrency scaling handles query spikes. Recent updates add semi-structured data support and materialized views for incremental transformation.
Ruby applications interact through database adapters:
require 'pg'
class SnowflakeELT
def initialize
@conn = PG.connect(
host: 'account.snowflakecomputing.com',
dbname: 'analytics',
user: 'etl_user',
password: ENV['SNOWFLAKE_PASSWORD']
)
end
def incremental_transform
@conn.exec(<<-SQL)
MERGE INTO customer_summary target
USING (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent,
MAX(order_date) AS last_order_date
FROM raw_orders
WHERE processed_at > CURRENT_TIMESTAMP - INTERVAL '1 hour'
GROUP BY customer_id
) source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
order_count = target.order_count + source.order_count,
total_spent = target.total_spent + source.total_spent,
last_order_date = GREATEST(target.last_order_date, source.last_order_date)
WHEN NOT MATCHED THEN INSERT
(customer_id, order_count, total_spent, last_order_date)
VALUES
(source.customer_id, source.order_count, source.total_spent,
source.last_order_date)
SQL
end
def create_materialized_view
@conn.exec(<<-SQL)
CREATE OR REPLACE MATERIALIZED VIEW daily_revenue AS
SELECT
DATE_TRUNC('day', order_date) AS date,
product_category,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM orders
WHERE status = 'completed'
GROUP BY 1, 2
SQL
end
end
Orchestration and Workflow Tools
Orchestration platforms schedule and monitor data pipeline execution regardless of ETL or ELT pattern choice.
Apache Airflow defines workflows as Python DAGs (Directed Acyclic Graphs) with task dependencies. Operators handle common actions—database queries, file transfers, API calls. Airflow schedules execution, tracks state, and handles retries. Ruby tasks integrate through BashOperator executing Ruby scripts or PythonOperator calling Ruby via subprocess.
Prefect provides modern workflow orchestration with dynamic DAGs and Pythonic API. Luigi offers simpler dependency management for batch pipelines. Cloud-native options include AWS Step Functions, Google Cloud Composer (managed Airflow), and Azure Data Factory.
# Generate Airflow DAG from Ruby configuration
require 'json'
class AirflowDAGGenerator
def generate(config)
dag = {
dag_id: config[:name],
schedule_interval: config[:schedule],
default_args: {
owner: 'data-team',
retries: 2,
retry_delay: { __type: 'timedelta', minutes: 5 }
},
tasks: generate_tasks(config[:pipeline])
}
File.write("dags/#{config[:name]}.json", JSON.pretty_generate(dag))
end
private
def generate_tasks(pipeline)
pipeline.map.with_index do |step, idx|
{
task_id: step[:name],
operator: 'BashOperator',
bash_command: build_command(step),
dependencies: step[:depends_on] || []
}
end
end
def build_command(step)
case step[:type]
when 'extract'
"ruby extract_#{step[:source]}.rb"
when 'transform'
"dbt run --models #{step[:models]}"
when 'load'
"ruby load_#{step[:target]}.rb"
end
end
end
# Usage
config = {
name: 'daily_sales_elt',
schedule: '@daily',
pipeline: [
{ name: 'extract_orders', type: 'extract', source: 'postgres' },
{ name: 'load_staging', type: 'load', target: 'snowflake',
depends_on: ['extract_orders'] },
{ name: 'transform_sales', type: 'transform', models: 'sales.*',
depends_on: ['load_staging'] }
]
}
generator = AirflowDAGGenerator.new
generator.generate(config)
Stream Processing Frameworks
Stream processing frameworks handle real-time ETL for event-driven architectures.
Apache Kafka provides distributed event streaming with topics, partitions, and consumer groups. Kafka Connect offers pre-built connectors for common sources and sinks. Kafka Streams library processes data streams with stateful transformations.
Apache Flink executes complex event processing with exactly-once semantics. Supports tumbling, sliding, and session windows for time-based aggregations. Flink's SQL API enables declarative stream transformations.
Ruby integration typically occurs through message queue clients rather than embedding Ruby in streaming frameworks:
require 'kafka'
require 'avro'
class KafkaETL
def initialize(brokers)
@kafka = Kafka.new(brokers)
@consumer = @kafka.consumer(group_id: 'ruby-etl')
@producer = @kafka.producer
end
def stream_transform(input_topic, output_topic, schema)
@consumer.subscribe(input_topic)
@consumer.each_message(automatically_mark_as_processed: false) do |msg|
begin
# Deserialize Avro message
decoded = Avro::IO::DatumReader.new(schema).read(
Avro::IO::BinaryDecoder.new(StringIO.new(msg.value))
)
# Transform
transformed = transform_record(decoded)
# Serialize and produce
buffer = StringIO.new
writer = Avro::IO::DatumWriter.new(schema)
encoder = Avro::IO::BinaryEncoder.new(buffer)
writer.write(transformed, encoder)
@producer.produce(
buffer.string,
topic: output_topic,
key: msg.key
)
@consumer.mark_message_as_processed(msg)
rescue => e
handle_error(msg, e)
end
end
ensure
@producer.shutdown
end
private
def transform_record(record)
# Apply transformations
record['processed_at'] = Time.now.to_i
record['category'] = classify(record['event_type'])
record
end
def classify(event_type)
event_type.start_with?('user_') ? 'engagement' : 'system'
end
def handle_error(message, error)
puts "Error processing message: #{error.message}"
# Send to dead letter queue
end
end
Performance Considerations
Data Volume Scaling
ETL performance degrades linearly with data volume. Doubling source data doubles extraction time, transformation compute requirements, and load duration. Transformation clusters require capacity planning for peak loads. Horizontal scaling adds processing nodes but introduces coordination overhead.
ELT performance scales with destination platform capabilities rather than data volume. Cloud data warehouses provide elastic compute that adjusts to workload demands. Storage scales independently—adding terabytes costs storage fees without affecting query performance. Columnar compression reduces actual data footprint by 10-50x.
Consider source data growth projections. ETL requires provisioning transformation infrastructure for future peak volumes. ELT delegates scaling to managed platforms with consumption-based pricing. However, transformation query costs increase with data volume—partition pruning and incremental processing become critical optimizations.
Transformation Complexity Impact
Complex transformations strain different resources depending on pattern. Multi-way joins, window functions, and recursive operations consume CPU and memory. ETL distributes this load across dedicated transformation clusters. ELT executes transformations within database engine optimizers.
Modern query optimizers generate parallel execution plans exploiting columnar storage and partition elimination. Transformations compile to efficient operations—aggregations leverage bitmap indices, joins use hash or merge algorithms. However, poorly written SQL triggers full table scans or cartesian products degrading performance catastrophically.
ETL frameworks provide programmatic control over transformation execution. Custom code manages memory explicitly, implements specialized algorithms, or integrates external libraries. This flexibility handles transformations exceeding SQL expressiveness or database capabilities.
Benchmark transformation performance in realistic conditions:
require 'benchmark'
def benchmark_etl_transform(data, iterations: 3)
results = Benchmark.bm(20) do |x|
x.report('ruby_transform:') do
iterations.times do
transformed = data.map do |row|
{
customer_id: row[:id],
revenue_bucket: categorize_revenue(row[:revenue]),
days_active: calculate_days_active(row),
risk_score: complex_risk_calculation(row)
}
end
end
end
x.report('sql_transform:') do
iterations.times do
@db.exec(<<-SQL)
SELECT
customer_id,
CASE
WHEN revenue < 100 THEN 'low'
WHEN revenue < 1000 THEN 'medium'
ELSE 'high'
END AS revenue_bucket,
CURRENT_DATE - last_purchase_date AS days_active,
(recency_score * 0.3 + frequency_score * 0.3 +
monetary_score * 0.4) AS risk_score
FROM customers
WHERE updated_at > CURRENT_DATE - 1
SQL
end
end
end
results
end
Network Transfer Bottlenecks
ETL transfers data multiple times—extract from source, stage intermediate results, load to destination. Network bandwidth constrains throughput. Extracting terabytes from production databases impacts application performance. Loading transformed data crosses network boundaries.
ELT minimizes data movement. Extract and load transfer raw data once from source to destination. Transformation occurs in-place without additional transfers. Cloud data warehouses colocate storage and compute, eliminating transfer costs for transformation queries.
Optimize network efficiency through compression. GZIP, Snappy, or LZ4 reduce transfer volume by 5-10x. Incremental extraction pulls only changed records using timestamps or change data capture. Parallel loading leverages multiple connections for higher throughput.
require 'zlib'
require 'net/http'
require 'json'
class OptimizedExtractor
def extract_compressed(source_url, last_sync_time)
# Request compressed response
uri = URI(source_url)
request = Net::HTTP::Get.new(uri)
request['Accept-Encoding'] = 'gzip'
request['If-Modified-Since'] = last_sync_time.httpdate
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
http.request(request)
end
return [] if response.code == '304' # Not modified
# Decompress response
body = if response['Content-Encoding'] == 'gzip'
Zlib::GzipReader.new(StringIO.new(response.body)).read
else
response.body
end
JSON.parse(body)
end
def parallel_load(data, connection_pool, batch_size: 1000)
batches = data.each_slice(batch_size).to_a
threads = batches.map.with_index do |batch, idx|
Thread.new do
conn = connection_pool.checkout
begin
load_batch(conn, batch)
ensure
connection_pool.checkin(conn)
end
end
end
threads.each(&:join)
end
private
def load_batch(conn, batch)
values = batch.map { |row| "(#{row.values.map { |v| conn.escape_literal(v) }.join(',')})" }
conn.exec(<<-SQL)
INSERT INTO staging_table (col1, col2, col3)
VALUES #{values.join(',')}
SQL
end
end
Incremental Processing Strategies
Full refresh processing recomputes all transformations on entire datasets. Simple to implement but wastes resources recalculating unchanged data. Acceptable for small datasets or infrequent updates.
Incremental processing updates only changed or new records. Requires tracking modification timestamps or change events. Reduces computation 10-100x for large, slowly changing datasets.
ETL implementations maintain high-water marks tracking last processed timestamps. Extract queries filter for records modified since last run. Transform processes only new batches. Load operations merge results into existing destinations.
ELT implementations leverage database capabilities for incremental updates. Merge statements handle upserts efficiently. Partition pruning skips unchanged data. Materialized views incrementally refresh based on base table changes.
class IncrementalETL
def initialize(state_store, connections)
@state = state_store
@source = connections[:source]
@dest = connections[:dest]
end
def extract_incremental(table)
last_sync = @state.get("#{table}_last_sync") || '1970-01-01'
result = @source.exec_params(<<-SQL, [last_sync])
SELECT * FROM #{table}
WHERE updated_at > $1
ORDER BY updated_at
SQL
max_timestamp = result.map { |r| r['updated_at'] }.max
{ data: result.to_a, max_timestamp: max_timestamp }
end
def load_incremental(table, data, max_timestamp)
data.each do |row|
columns = row.keys.join(', ')
values = row.values.map { |v| @dest.escape_literal(v) }.join(', ')
@dest.exec(<<-SQL)
INSERT INTO #{table} (#{columns})
VALUES (#{values})
ON CONFLICT (id) DO UPDATE SET
#{row.keys.map { |k| "#{k} = EXCLUDED.#{k}" }.join(', ')},
updated_at = EXCLUDED.updated_at
SQL
end
@state.set("#{table}_last_sync", max_timestamp)
end
def run_incremental_pipeline
['customers', 'orders', 'products'].each do |table|
result = extract_incremental(table)
next if result[:data].empty?
transformed = transform_data(result[:data])
load_incremental(table, transformed, result[:max_timestamp])
puts "Processed #{result[:data].size} records from #{table}"
end
end
private
def transform_data(data)
# Apply transformations
data.map { |row| transform_row(row) }
end
def transform_row(row)
row # Transformation logic here
end
end
Query Optimization for ELT
ELT transformation performance depends on SQL query efficiency. Poor queries scan entire tables unnecessarily. Proper optimization leverages database capabilities—partition pruning, predicate pushdown, column pruning.
Partition tables by date or category columns commonly used in filters. Queries filtering on partition keys skip irrelevant partitions. Cluster tables by frequently joined columns improving join performance. Create covering indices for lookup patterns.
Materialized views precompute expensive aggregations or joins. Incremental refresh updates views based on changed base data. Query rewriting automatically substitutes materialized views when applicable.
class ELTOptimizer
def initialize(warehouse_conn)
@db = warehouse_conn
end
def create_partitioned_table
@db.exec(<<-SQL)
CREATE TABLE IF NOT EXISTS orders_partitioned (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
amount DECIMAL(10,2),
status VARCHAR(50)
)
PARTITION BY RANGE (order_date)
SQL
# Create partitions for each month
(Date.new(2024, 1, 1)..Date.today).select { |d| d.day == 1 }.each do |month|
partition_name = "orders_#{month.strftime('%Y%m')}"
next_month = month.next_month
@db.exec(<<-SQL)
CREATE TABLE IF NOT EXISTS #{partition_name}
PARTITION OF orders_partitioned
FOR VALUES FROM ('#{month}') TO ('#{next_month}')
SQL
end
end
def create_optimized_views
# Materialized view for daily aggregates
@db.exec(<<-SQL)
CREATE MATERIALIZED VIEW IF NOT EXISTS daily_order_summary AS
SELECT
order_date,
status,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders_partitioned
GROUP BY order_date, status
SQL
# Clustered table for customer analytics
@db.exec(<<-SQL)
CREATE TABLE IF NOT EXISTS customer_analytics_clustered
CLUSTER BY (customer_id) AS
SELECT
customer_id,
COUNT(DISTINCT order_id) AS total_orders,
SUM(amount) AS lifetime_value,
MAX(order_date) AS last_order_date,
MIN(order_date) AS first_order_date
FROM orders_partitioned
WHERE status = 'completed'
GROUP BY customer_id
SQL
end
def analyze_query_performance(query)
explain_result = @db.exec("EXPLAIN ANALYZE #{query}")
parse_explain_output(explain_result)
end
private
def parse_explain_output(result)
output = result.map { |row| row.values.first }.join("\n")
{
uses_index: output.include?('Index Scan'),
uses_partition_pruning: output.include?('Partition'),
execution_time: output.match(/Execution Time: ([\d.]+) ms/)[1].to_f
}
end
end
Reference
Pattern Comparison Matrix
| Aspect | ETL | ELT |
|---|---|---|
| Transformation Location | External processing layer | Target database platform |
| Infrastructure | Dedicated transformation servers | Warehouse-native compute |
| Data Storage | Transient during processing | Raw data persists |
| Schema Application | During transformation | Post-load in destination |
| Scaling Model | Linear with data volume | Elastic with destination platform |
| Latency Profile | Batch processing cycles | Micro-batch or near real-time |
| Query Language | Programming languages | SQL and database functions |
| Cost Structure | Fixed infrastructure costs | Consumption-based pricing |
| Data Lineage | External tracking required | Raw data available for audit |
| Operational Complexity | Multiple systems to manage | Consolidated platform management |
Decision Framework
| Factor | Choose ETL When | Choose ELT When |
|---|---|---|
| Data Volume | Transformation reduces volume significantly | Raw and transformed data similar size |
| Transformation Complexity | Requires custom algorithms or external libraries | Expressible in SQL or database functions |
| Infrastructure | Existing transformation capacity available | Cloud warehouse with elastic compute |
| Data Sensitivity | PII must be masked before loading | Data can be loaded raw with access controls |
| Latency Requirements | Batch processing acceptable | Near real-time analytics needed |
| Quality Requirements | Validation before destination critical | Can validate during transformation queries |
| Team Expertise | Strong in application programming | Strong in SQL and database optimization |
| Cost Model | Predictable fixed costs preferred | Variable consumption-based costs acceptable |
| Regulatory Compliance | Data must be filtered pre-load | Destination provides adequate controls |
| Schema Evolution | Schema relatively stable | Frequent schema changes expected |
Common Ruby ETL Gems
| Gem | Purpose | Use Case |
|---|---|---|
| pg | PostgreSQL adapter | Source and destination connectivity |
| mysql2 | MySQL adapter | Relational database access |
| sequel | Database toolkit | Cross-database abstraction |
| httparty | HTTP client | REST API extraction |
| nokogiri | XML parser | XML source data parsing |
| csv | CSV processing | CSV file extraction and loading |
| aws-sdk-s3 | S3 integration | Cloud storage operations |
| kafka-ruby | Kafka client | Stream processing |
| resque | Background jobs | Asynchronous ETL execution |
| sidekiq | Background processing | Job queue management |
Common ELT Transformation Patterns
| Pattern | SQL Approach | When to Use |
|---|---|---|
| Type 1 Dimension | UPDATE with MERGE | Overwrite historical values |
| Type 2 Dimension | INSERT with versioning | Preserve history with timestamps |
| Incremental Aggregate | SUM with partition filtering | Add new data to existing summaries |
| Full Refresh Aggregate | DROP and CREATE | Recompute entire aggregation |
| Deduplicate | ROW_NUMBER with QUALIFY | Remove duplicate records |
| Pivot | CASE statements or PIVOT | Convert rows to columns |
| Unpivot | UNION or UNPIVOT | Convert columns to rows |
| Slowly Changing Dimension | MERGE with condition logic | Track dimension changes over time |
| Surrogate Key Generation | SEQUENCE or ROW_NUMBER | Generate internal identifiers |
| Incremental Delete | DELETE with timestamp filter | Remove stale records |
Pipeline Orchestration Patterns
| Orchestration Need | ETL Approach | ELT Approach |
|---|---|---|
| Job Scheduling | Cron or workflow engine | Warehouse scheduling features |
| Dependency Management | External DAG definition | Stored procedure dependencies |
| Error Handling | Application retry logic | Transaction rollback and retry |
| Monitoring | Application metrics collection | Database query monitoring |
| State Management | External state store | Database tables track progress |
| Parallel Execution | Multiple worker processes | Concurrent warehouse sessions |
| Incremental Processing | Application tracks timestamps | Database merge statements |
| Data Quality Checks | Pre-load validation | Post-load test queries |
Performance Optimization Checklist
| Optimization | ETL Implementation | ELT Implementation |
|---|---|---|
| Reduce Data Transfer | Compress extract payloads | Load directly to destination |
| Incremental Processing | Track high-water marks | Use merge statements with filters |
| Parallel Execution | Multiple transformation workers | Concurrent warehouse compute |
| Optimize Transformations | Vectorized operations | Partition pruning and indices |
| Minimize Memory Usage | Stream processing | Query result limits |
| Batch Operations | Bulk inserts | Multi-row statements |
| Cache Lookups | In-memory lookup tables | Materialized views |
| Schema Optimization | Pre-aggregate before load | Clustered tables and partitions |
| Resource Scaling | Horizontal node addition | Elastic warehouse resizing |
| Query Tuning | Algorithm selection | EXPLAIN plan analysis |
Ruby ETL Framework Template
| Component | Implementation Pattern |
|---|---|
| Connection Management | Pool connections with retry logic |
| Extract Interface | Adapters for each source type |
| Transform Engine | Chain of transformation functions |
| Load Interface | Bulk loading with batching |
| Error Handling | Dead letter queues for failures |
| State Management | Persist high-water marks |
| Logging | Structured JSON logging |
| Metrics | Track volume and duration |
| Testing | Unit tests for transformations |
| Configuration | Environment-based config |
ELT Warehouse Capabilities
| Capability | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Elastic Compute | Virtual warehouses | Serverless slots | Concurrency scaling |
| Storage Format | Columnar micro-partitions | Capacitor columnar | Columnar blocks |
| Semi-Structured | VARIANT type | JSON columns | SUPER type |
| Query Language | ANSI SQL plus extensions | Standard SQL | PostgreSQL-compatible |
| Transformation Tool | Tasks and streams | Scheduled queries | Stored procedures |
| Incremental Updates | MERGE statement | MERGE statement | MERGE statement |
| Materialized Views | Automatic refresh | Manual refresh | Automatic refresh |
| External Data | External tables | Federated queries | Spectrum queries |
| Change Tracking | Streams | Audit logs | System tables |
| Cost Model | Per-second compute billing | Query bytes scanned | Hourly compute billing |