Overview
Query optimization is the process of improving database query performance by reducing execution time, minimizing resource consumption, and maximizing throughput. Database queries are often the primary bottleneck in data-driven applications, where poorly optimized queries can degrade response times from milliseconds to seconds or minutes.
Database management systems execute queries through a multi-stage process. The query parser validates syntax and constructs an abstract syntax tree. The query optimizer analyzes possible execution strategies and selects an execution plan based on cost estimates. The execution engine carries out the plan, accessing data through storage structures and applying operations like filtering, joining, and aggregating. Each stage presents optimization opportunities.
Query optimization operates at multiple levels. Physical optimization involves storage structures like indexes, partitions, and data layout. Logical optimization focuses on rewriting queries to equivalent but more efficient forms. Application-level optimization addresses how applications construct and execute queries, including caching strategies and query batching.
The cost of unoptimized queries scales with data growth. A query performing a full table scan on a thousand-row table might execute in milliseconds, but the same query on a million-row table could take seconds. As data volumes increase, optimization transitions from optional to mandatory for application viability.
# Unoptimized query scanning entire table
User.where("email LIKE ?", "%@example.com").count
# Leading wildcard forces a full table scan
# Optimized with index on email column and a prefix pattern
User.where("email LIKE ?", "alice@%").count
# B-tree index scan works for prefix patterns; leading-wildcard
# searches need a trigram or reversed-column index instead
Query optimization requires understanding both database internals and application access patterns. Different databases implement different optimization strategies, but fundamental principles apply across systems. The optimizer makes decisions based on statistics about data distribution, cardinality estimates, and available indexes.
Key Principles
Query execution plans define how databases execute queries. The optimizer generates multiple candidate plans and estimates their cost based on factors including disk I/O operations, CPU processing, memory usage, and network transfer. The plan with the lowest estimated cost gets selected for execution. Understanding execution plans reveals why queries perform poorly and guides optimization strategies.
Indexes accelerate data retrieval by maintaining sorted structures that allow rapid lookups. A B-tree index enables logarithmic-time searches instead of linear table scans. Indexes trade write performance and storage space for read performance. Each insert, update, or delete operation must update all relevant indexes, creating overhead. Selecting which columns to index requires analyzing query patterns and balancing read versus write workloads.
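A single-column index is typically added through a migration; the table and column names below are illustrative, not from any particular schema:

```ruby
# Illustrative migration: B-tree index on users.email
class AddIndexToUsersEmail < ActiveRecord::Migration[7.0]
  def change
    # Accelerates lookups by email at the cost of slightly
    # slower inserts and updates on the users table
    add_index :users, :email
  end
end
```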
Cardinality estimates predict how many rows operations will process. The optimizer uses these estimates to choose join orders and access methods. Outdated statistics lead to poor cardinality estimates, causing the optimizer to select suboptimal plans. Databases maintain statistics about column value distributions, distinct value counts, and data correlations. Regular statistics updates ensure accurate optimizer decisions.
Join algorithms determine how databases combine tables. Nested loop joins iterate through one table for each row in another, efficient for small datasets or when one side has few rows. Hash joins build hash tables for one input and probe with the other, effective for equi-joins on large datasets. Merge joins sort both inputs and merge them, optimal when inputs arrive pre-sorted. The optimizer selects join algorithms based on data characteristics and available indexes.
Query selectivity measures the fraction of rows a predicate matches. A highly selective predicate matches few rows, making indexes effective. A poorly selective predicate matches many rows, potentially making full table scans faster than index lookups. The optimizer estimates selectivity from column statistics and uses these estimates to decide between table scans and index access.
Database transactions isolate concurrent operations through locking mechanisms. Long-running queries hold locks longer, increasing contention and reducing concurrency. Query optimization reduces lock duration, improving system throughput. Different isolation levels provide different trade-offs between consistency and performance.
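ActiveRecord exposes isolation levels and row locks directly; a sketch, with the model, account_id, and amount assumed for illustration (isolation support varies by adapter):

```ruby
# Keep transactions short so locks release quickly
Account.transaction(isolation: :serializable) do
  account = Account.lock.find(account_id)  # SELECT ... FOR UPDATE
  account.update!(balance: account.balance - amount)
end
```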
# Understanding selectivity
# High selectivity - index useful
User.where(email: "specific@example.com").first
# Matches one row out of millions
# Low selectivity - table scan may be faster
User.where(active: true).count
# Matches 95% of rows, index overhead exceeds benefit
Cache effectiveness depends on query predictability and data access patterns. Databases cache data pages, index pages, and query results. Repeated queries benefit from cached data, avoiding disk I/O. Query optimization positions frequently accessed data for cache retention and structures queries to maximize cache hit rates.
Implementation Approaches
Index strategy optimization begins with identifying query patterns. Analyze application queries to determine which columns appear in WHERE clauses, JOIN conditions, and ORDER BY operations. Single-column indexes support equality and range queries on individual columns. Composite indexes combine multiple columns, supporting queries filtering or sorting on column combinations. Index column order matters—lead with columns used in equality predicates, preferring higher-selectivity columns among them, and place range or sort columns last.
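A composite index following these rules might be declared as follows (table and column names illustrative):

```ruby
class AddStatusCreatedAtIndexToOrders < ActiveRecord::Migration[7.0]
  def change
    # Equality-filtered column leads; created_at serves the sort
    add_index :orders, [:status, :created_at]
  end
end

# Served efficiently by the index above:
# Order.where(status: "paid").order(created_at: :desc)
```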
Covering indexes include all columns a query references, allowing the database to satisfy the query entirely from the index without accessing the table. This eliminates table lookups, significantly reducing I/O. Create covering indexes for frequent queries where the additional storage overhead justifies the performance gain.
# Query requiring table access
User.where(status: 'active').pluck(:id, :email, :created_at)
# With covering index: CREATE INDEX idx_covering
# ON users(status, id, email, created_at)
# Database reads only index, no table access needed
Query rewriting transforms queries into equivalent forms with better performance characteristics. Replace correlated subqueries with joins when possible, as subqueries execute once per outer row while joins process sets. Transform DISTINCT operations into GROUP BY when aggregation occurs. Rewrite NOT IN clauses as LEFT JOIN with NULL checks, as NOT IN prevents index usage in some databases.
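The NOT IN rewrite can be sketched in ActiveRecord (models assumed for illustration):

```ruby
# Anti-pattern: subquery form that can defeat index usage and
# silently returns no rows if posts.user_id contains NULL
User.where("id NOT IN (SELECT user_id FROM posts)")

# Rewrite: LEFT JOIN with a NULL check finds users with no posts
User.left_joins(:posts).where(posts: { id: nil })
```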
Partition pruning limits query scans to relevant data partitions. Partition tables by commonly filtered columns like date ranges. When queries filter on partition keys, the database scans only relevant partitions, reducing data volume. Range partitioning divides data by value ranges, list partitioning by discrete values, and hash partitioning distributes data evenly.
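With PostgreSQL declarative partitioning, a range-partitioned table can be sketched via raw DDL in a migration (table name and ranges illustrative):

```ruby
class PartitionEvents < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      CREATE TABLE events (
        id bigserial,
        occurred_at timestamptz NOT NULL,
        payload jsonb
      ) PARTITION BY RANGE (occurred_at);

      CREATE TABLE events_2024_q1 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
    SQL
  end
end
# Queries filtering on occurred_at prune to matching partitions
```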
Denormalization trades storage space and update complexity for query performance. Maintain redundant data copies to avoid expensive joins. Store aggregated values to prevent runtime calculations. Denormalization works best for read-heavy workloads where reads outnumber writes significantly. Track data consistency carefully, as updates must maintain all copies.
Materialized views precompute and store query results, updating periodically or on demand. Create materialized views for complex aggregations, expensive joins, or frequently accessed computed data. Refresh strategies balance data freshness against computation cost. Incremental refresh updates only changed data, while full refresh recomputes everything.
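On PostgreSQL this might look like the following sketch (view definition illustrative):

```ruby
class CreateUserStats < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      CREATE MATERIALIZED VIEW user_stats AS
        SELECT user_id, COUNT(*) AS post_count
        FROM posts
        GROUP BY user_id;
    SQL
  end
end

# Refresh on a schedule or after bulk loads
ActiveRecord::Base.connection.execute("REFRESH MATERIALIZED VIEW user_stats")
```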
Query batching combines multiple operations into single database round trips. Execute multiple independent queries in one batch to reduce network overhead. Use bulk insert operations instead of individual inserts. Batch updates reduce transaction overhead and lock contention.
# Individual operations - N database round trips
users.each { |user| user.update(processed: true) }
# Batched update - single query
User.where(id: user_ids).update_all(processed: true)
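Bulk writes follow the same principle; insert_all (Rails 6+) issues a single multi-row INSERT, bypassing validations and callbacks:

```ruby
# One INSERT statement instead of one round trip per row
User.insert_all([
  { email: "a@example.com", name: "A" },
  { email: "b@example.com", name: "B" }
])
```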
Connection pooling reuses database connections across requests, eliminating connection establishment overhead. Configure pool sizes based on concurrent request volumes. Monitor pool exhaustion, which indicates insufficient capacity or connection leaks.
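Pool settings live in config/database.yml; the values below are illustrative starting points, not recommendations for any specific workload:

```yaml
production:
  adapter: postgresql
  # Size the pool to match the server's thread count
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  # Seconds to wait for a free connection before raising
  checkout_timeout: 5
```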
Ruby Implementation
ActiveRecord provides query interface abstractions over SQL. The query interface builds SQL statements through method chaining, delaying execution until result access. This lazy evaluation enables query optimization by combining operations before database interaction.
# Query construction without execution
query = User.where(active: true)
query = query.where("created_at > ?", 1.week.ago)
query = query.order(created_at: :desc)
# No database query yet
# Execution occurs on result access
users = query.limit(10).to_a
# Single optimized SQL query executes
Eager loading prevents N+1 query problems by loading associated records in minimal queries. The includes method loads associations using separate queries or LEFT OUTER JOINs. The preload method always uses separate queries, while eager_load forces JOIN loading. Choose based on whether filtering on associations occurs.
# N+1 problem - executes N+1 queries
posts = Post.all
posts.each { |post| puts post.author.name }
# SELECT * FROM posts
# SELECT * FROM users WHERE id = 1
# SELECT * FROM users WHERE id = 2
# ... (N more queries)
# Eager loading - executes 2 queries
posts = Post.includes(:author)
posts.each { |post| puts post.author.name }
# SELECT * FROM posts
# SELECT * FROM users WHERE id IN (1, 2, 3, ...)
Select specific columns with select or pluck to reduce data transfer and memory usage. Avoid retrieving columns the application doesn't use. The pluck method returns arrays of values directly without instantiating model objects, reducing memory overhead for large result sets.
# Loads entire model objects
User.where(active: true).map(&:email)
# Returns only email column values
User.where(active: true).pluck(:email)
# Reduces memory and instantiation overhead
Database-specific ActiveRecord features expose underlying database capabilities. PostgreSQL supports array columns, JSON operations, full-text search, and window functions through Arel or raw SQL. MySQL provides spatial data types and functions. Use these features when they provide significant performance benefits.
# PostgreSQL array containment query
Product.where("tags @> ARRAY[?]::varchar[]", ['sale'])
# PostgreSQL JSON query
User.where("preferences->>'theme' = ?", 'dark')
# Window function for ranking
User.select("*, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank")
The Bullet gem detects N+1 queries and unused eager loading in development. It analyzes executed queries and raises warnings when optimization opportunities exist. Configure Bullet to raise exceptions, log warnings, or display browser alerts.
# config/environments/development.rb
config.after_initialize do
  Bullet.enable = true
  Bullet.alert = true
  Bullet.bullet_logger = true
  Bullet.rails_logger = true
end
Query profiling with explain analyzes query execution plans. Call explain on ActiveRecord relations to see how the database executes the query. Examine plan output for sequential scans, join types, and cost estimates.
# Display query execution plan
puts User.where(email: 'user@example.com').explain
# Example output:
# QUERY PLAN
# Index Scan using index_users_on_email on users
# Index Cond: (email = 'user@example.com')
# Cost: 0.42..8.44 rows=1
Raw SQL access provides full control when ActiveRecord abstractions limit optimization. Use find_by_sql for complex queries or database-specific features. Sanitize inputs to prevent SQL injection.
# Complex query with raw SQL
sql = <<~SQL
  SELECT users.*, COUNT(posts.id) AS post_count
  FROM users
  LEFT JOIN posts ON posts.user_id = users.id
  WHERE users.active = true
  GROUP BY users.id
  HAVING COUNT(posts.id) > ?
  ORDER BY post_count DESC
SQL
User.find_by_sql([sql, 10])
The Sequel gem offers an alternative to ActiveRecord with different optimization characteristics. Sequel provides more direct control over SQL generation and lower overhead for some operations. Consider Sequel for performance-critical applications or when ActiveRecord abstractions impede optimization.
# Sequel query construction
DB[:users]
.where(active: true)
.where { created_at > Date.today - 30 }
.select(:id, :email)
.order(:created_at)
Performance Considerations
Query profiling identifies performance bottlenecks through execution time measurement and plan analysis. Enable query logging to capture all database interactions. Analyze slow query logs to find queries exceeding time thresholds. Aggregate query metrics to identify patterns—frequent slow queries impact performance more than occasional expensive operations.
Database EXPLAIN commands reveal execution plans and cost estimates. Each database provides EXPLAIN variants with different detail levels. PostgreSQL EXPLAIN ANALYZE executes queries and reports actual timings versus estimates, exposing optimizer inaccuracies. MySQL EXPLAIN shows join types and key usage. Interpret costs relatively rather than absolutely—compare alternative query formulations.
# ActiveRecord explain with timing (explain options require Rails 7.1+)
User.joins(:posts)
    .where("posts.created_at > ?", 1.week.ago)
    .explain(:analyze, :verbose)
Index effectiveness depends on query predicates and data distribution. Monitor index usage statistics to identify unused indexes consuming storage and write performance. Drop unused indexes. Identify missing indexes from queries performing sequential scans on large tables. Test index additions in staging environments before production deployment.
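On PostgreSQL, per-index usage counters are available in the pg_stat_user_indexes view; a query sketch for finding never-used indexes:

```ruby
# Indexes with zero scans since the last stats reset
# are candidates for removal
sql = <<~SQL
  SELECT indexrelname, idx_scan
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY indexrelname
SQL
ActiveRecord::Base.connection.select_all(sql).to_a
```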
Query execution time correlates with processed row counts. Minimize rows examined through effective filtering. Apply filters early in query execution to reduce intermediate result sizes. Choose join orders placing selective filters first.
Caching eliminates redundant query execution. Query result caching stores complete result sets keyed by query text and parameters. Application-level caching stores computed values or object graphs. Choose cache invalidation strategies balancing freshness against cache hit rates. Time-based expiration works for data tolerating staleness. Event-based invalidation maintains consistency but adds complexity.
# Rails query caching - automatic within action
def index
  # First execution queries database
  @users = User.where(active: true).to_a
  # Identical SQL within the same action is served from the query cache
  @users_again = User.where(active: true).to_a
  # Note: a different statement such as .count issues different
  # SQL and does NOT hit the query cache
end
# Manual caching for computed values
def expensive_calculation
  Rails.cache.fetch("users/statistics", expires_in: 1.hour) do
    # Expensive query executed only on cache miss
    User.includes(:posts, :comments).map do |user|
      { id: user.id, score: calculate_user_score(user) }
    end
  end
end
Connection pool configuration impacts concurrency and resource usage. Pool size determines maximum concurrent database connections. Insufficient pool size causes requests to wait for available connections, increasing latency. Excessive pool size wastes database resources and operating system file descriptors. Monitor pool utilization and checkout wait times to right-size pools.
Database statistics maintenance ensures accurate optimizer decisions. Stale statistics cause cardinality misestimates, leading to poor join orders and access method choices. Schedule regular statistics updates, especially after significant data changes. Some databases support automatic statistics collection, but manual updates may be necessary for critical tables.
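A manual refresh can run from a maintenance task; PostgreSQL syntax shown, with the table name illustrative:

```ruby
# Recompute planner statistics after a large data change
ActiveRecord::Base.connection.execute("ANALYZE users")
```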
Read replicas distribute query load across multiple database instances. Route read queries to replicas and write queries to the primary. Implement replication lag monitoring to handle eventual consistency. Consider multi-region replicas for geographic distribution, reducing network latency for distant users.
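Rails 6+ routes read and write roles through connects_to; a sketch assuming a primary_replica entry exists in database.yml:

```ruby
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Explicitly send a block of reads to the replica
ActiveRecord::Base.connected_to(role: :reading) do
  User.where(active: true).count
end
```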
Prepared statements reduce query parsing overhead for repeated execution. The database parses and plans the query once, reusing the plan for subsequent executions with different parameters. ActiveRecord uses prepared statements automatically for parameterized queries. Benefits increase with query frequency and parsing complexity.
Common Patterns
N+1 query elimination represents the most common query optimization pattern. Load associated records in batches rather than individually within loops. Use eager loading, explicit preloading, or batch loading to reduce query counts from N+1 to 2 or 3.
# N+1 anti-pattern
def index
  @users = User.all
  # One posts query per user: N extra queries
  @titles = @users.flat_map { |user| user.posts.map(&:title) }
end
# Optimized with eager loading
def index
  @users = User.includes(:posts)
  # 2 queries total
  @titles = @users.flat_map { |user| user.posts.map(&:title) }
end
Batch processing aggregates operations to reduce overhead. Group individual database operations into batches, processing multiple records per query. Use find_in_batches or in_batches for memory-efficient iteration over large datasets.
# Memory-efficient batch processing
User.where(status: 'pending').find_in_batches(batch_size: 1000) do |batch|
  batch.each do |user|
    process_user(user)
  end
end
Counter cache columns maintain aggregate counts without expensive queries. Store counts of associated records directly on parent records. Database triggers or application callbacks update counters when associated records change.
class Post < ApplicationRecord
  belongs_to :user, counter_cache: true
end
# posts_count column on users table auto-maintained
# Fast count access without query
user.posts_count
# Instead of: user.posts.count (requires COUNT query)
Selective column loading retrieves only required columns, reducing data transfer and memory usage. Specify columns explicitly with select or retrieve specific values with pluck. Exclude large columns like text blobs or binary data unless needed.
# Load only required columns
User.select(:id, :email, :name)
    .where(active: true)
# Even more efficient for simple values
User.where(active: true).pluck(:email)
Scope query construction builds reusable query fragments. Define scopes for common query patterns, improving code readability and maintainability. Chain scopes to compose complex queries from simple building blocks.
class User < ApplicationRecord
  scope :active, -> { where(active: true) }
  scope :recent, -> { where("created_at > ?", 1.month.ago) }
  scope :with_posts, -> { joins(:posts).distinct }
end

# Compose scopes
User.active.recent.with_posts
Exists queries check for record presence without loading data. Use exists? instead of count or any? when only checking existence matters. The database can stop searching after finding the first match.
# Inefficient - counts all records
if user.posts.count > 0
  # ...
end
# Optimized - stops at first match
if user.posts.exists?
  # ...
end
Index-only scans retrieve data entirely from indexes without accessing tables. Structure indexes to include all columns queries reference. The database reads index pages exclusively, reducing I/O.
# Query requiring table access
User.where(active: true).pluck(:email)
# With covering index on (active, email)
# Database satisfies query from index alone
Query result pagination limits result set sizes and enables progressive loading. Use limit and offset for simple pagination, or keyset pagination for better performance on large offsets. Keyset pagination uses WHERE conditions instead of OFFSET, maintaining constant query time regardless of page number.
# Offset-based pagination - performance degrades with offset
User.limit(20).offset(page * 20)
# Keyset pagination - consistent performance
User.where("id > ?", last_seen_id).limit(20)
Practical Examples
This example demonstrates optimizing a slow dashboard query loading user activity summaries. The initial implementation executes separate queries for each metric, creating performance problems with many users.
# Unoptimized implementation
class DashboardController < ApplicationController
  def index
    @users = User.where(active: true)
    @user_stats = @users.map do |user|
      {
        user: user,
        post_count: user.posts.count,
        comment_count: user.comments.count,
        last_activity: [user.posts.maximum(:created_at),
                        user.comments.maximum(:created_at)].compact.max
      }
    end
  end
end
# Executes queries:
# 1 for users
# N for post counts
# N for comment counts
# N for last post times
# N for last comment times
# Total: 4N + 1 queries
The optimized version combines data loading into minimal queries using joins and aggregations.
# Optimized with aggregation query
class DashboardController < ApplicationController
  def index
    @user_stats = User.where(active: true)
                      .select("users.*,
                               COUNT(DISTINCT posts.id) AS post_count,
                               COUNT(DISTINCT comments.id) AS comment_count,
                               GREATEST(MAX(posts.created_at),
                                        MAX(comments.created_at)) AS last_activity")
                      .left_joins(:posts, :comments)
                      .group("users.id")
  end
end
# Executes 1 query total
# Performance improvement: O(N) to O(1) query count
This example optimizes a product search with multiple filters. The initial version loads the entire table into memory and filters in Ruby.
# Unoptimized search filtering in application memory
class ProductSearch
  def search(params)
    results = Product.all.to_a  # loads every product row
    results = results.select { |p| p.category == params[:category] } if params[:category]
    results = results.select { |p| p.price <= params[:max_price].to_f } if params[:max_price]
    results = results.select { |p| p.name.downcase.include?(params[:keyword].downcase) } if params[:keyword]
    results
  end
end
The optimized version builds a single query using ActiveRecord's lazy evaluation and proper indexing.
# Optimized with query composition
class ProductSearch
  def search(params)
    query = Product.all
    query = query.where(category: params[:category]) if params[:category]
    query = query.where("price <= ?", params[:max_price]) if params[:max_price]
    if params[:keyword]
      # Escape % and _ so user input matches literally
      keyword = ActiveRecord::Base.sanitize_sql_like(params[:keyword])
      query = query.where("name ILIKE ?", "%#{keyword}%")
    end
    query.order(created_at: :desc)
  end
end
# Database indexes:
# CREATE INDEX idx_products_category ON products(category)
# CREATE INDEX idx_products_price ON products(price)
# CREATE INDEX idx_products_name_trgm ON products USING gin(name gin_trgm_ops)
# (for PostgreSQL trigram search)
This example shows optimizing a report requiring data from multiple tables with complex calculations.
# Unoptimized report with object instantiation
class SalesReport
  def generate(date_range)
    orders = Order.where(created_at: date_range)
    orders.map do |order|
      {
        order_id: order.id,
        customer: order.customer.name,
        total: order.line_items.sum(&:price),
        tax: order.line_items.sum { |item| item.price * item.tax_rate }
      }
    end
  end
end
# Problems:
# - Loads full ActiveRecord objects
# - N+1 for customers and line items
# - Calculates in Ruby instead of database
The optimized version uses SQL aggregations and selective loading.
# Optimized with SQL aggregations
class SalesReport
  def generate(date_range)
    Order.where(created_at: date_range)
         .joins(:customer, :line_items)
         .select("orders.id AS order_id,
                  customers.name AS customer,
                  SUM(line_items.price) AS total,
                  SUM(line_items.price * line_items.tax_rate) AS tax")
         .group("orders.id, customers.name")
         .map { |row| row.attributes }
  end
end
# Single query with database aggregation
# No object instantiation overhead
# Results as plain hashes
Reference
Common Query Optimization Techniques
| Technique | Description | When to Apply |
|---|---|---|
| Add Index | Create B-tree or other index structure | Column appears in WHERE, JOIN, or ORDER BY clauses frequently |
| Covering Index | Include all query columns in index | Query executes frequently and accesses few columns |
| Eager Loading | Load associations with includes or preload | Accessing associations in loops |
| Select Specific Columns | Use select or pluck for subset of columns | Not all model attributes needed |
| Batch Processing | Process records in groups with find_in_batches | Iterating over large result sets |
| Counter Cache | Store association counts on parent model | Frequently counting associations |
| Query Caching | Cache query results in memory | Query results stable over time period |
| Denormalization | Store redundant data to avoid joins | Read-heavy workload with expensive joins |
| Partition Tables | Split table into smaller physical segments | Very large tables with time or range-based access |
| Materialized Views | Precompute and store complex query results | Expensive aggregations or joins executed frequently |
ActiveRecord Optimization Methods
| Method | Purpose | Example Usage |
|---|---|---|
| includes | Eager load associations (JOIN or separate query) | User.includes(:posts) |
| preload | Eager load with separate queries always | User.preload(:posts) |
| eager_load | Eager load with LEFT OUTER JOIN always | User.eager_load(:posts) |
| joins | Inner join without loading associated records | User.joins(:posts).where(posts: { published: true }) |
| select | Specify columns to retrieve | User.select(:id, :email) |
| pluck | Retrieve column values without models | User.pluck(:email) |
| find_each | Batch iteration using primary-key ranges (no OFFSET) | User.find_each(batch_size: 1000) { block } |
| find_in_batches | Batch iteration yielding arrays | User.find_in_batches { block } |
| exists? | Check record existence efficiently | User.where(email: email).exists? |
| explain | Display query execution plan | User.where(active: true).explain |
Index Types and Use Cases
| Index Type | Description | Optimal Use Case |
|---|---|---|
| B-tree | Balanced tree structure, default type | Equality and range queries on ordered data |
| Hash | Hash table structure | Exact match queries only |
| GiST | Generalized Search Tree | Full-text search, geometric data, network addresses |
| GIN | Generalized Inverted Index | Array containment, full-text search, JSONB queries |
| BRIN | Block Range Index | Very large tables with natural ordering |
| Partial | Index on subset of rows | Queries filtering on specific values frequently |
| Expression | Index on computed values | Queries on function results or expressions |
Query Performance Indicators
| Indicator | Meaning | Action |
|---|---|---|
| Seq Scan | Full table scan | Add index or filter earlier |
| Index Scan | Index used for retrieval | Generally good, verify selectivity |
| Index Only Scan | Data retrieved from index alone | Optimal for covering indexes |
| Nested Loop | Join algorithm iterating rows | Acceptable for small datasets, consider hash join for large |
| Hash Join | Join using hash table | Good for large equi-joins |
| Merge Join | Join on pre-sorted inputs | Optimal when inputs sorted on join key |
| High Cost | Large cost estimate | Review query structure and available indexes |
| Rows Mismatch | Actual vs estimated rows differ significantly | Update database statistics |
Database Statistics Commands
| Database | Update Statistics | Analyze Query |
|---|---|---|
| PostgreSQL | ANALYZE table_name | EXPLAIN ANALYZE query |
| MySQL | ANALYZE TABLE table_name | EXPLAIN query |
| SQLite | ANALYZE table_name | EXPLAIN QUERY PLAN query |
Connection Pool Configuration
| Setting | Purpose | Typical Value |
|---|---|---|
| pool | Maximum connections | 5-25 depending on concurrency |
| timeout | SQLite busy timeout (milliseconds) | 5000 |
| checkout_timeout | Seconds to wait for a free pooled connection | 5 |
| reaping_frequency | Dead connection reaping interval (seconds) | 60 |