CrackedRuby - Database Statistics

Overview

Database statistics are metadata that describe the characteristics of data stored in tables and indexes. Database management systems collect and maintain these statistics to make informed decisions about query execution plans. The query optimizer uses statistics to estimate the cost of different execution strategies and select the most efficient approach.

Statistics typically include row counts, distinct value counts, data distribution histograms, null value frequencies, and index selectivity measures. When a query executes, the optimizer consults these statistics to predict how many rows each operation will return, which indexes to use, and what join algorithms to apply.

The accuracy of statistics directly affects query performance. Outdated or missing statistics lead to suboptimal execution plans that may perform full table scans instead of index lookups, choose incorrect join orders, or allocate insufficient memory for operations. Database systems provide mechanisms to update statistics manually or automatically, though the timing and frequency of updates requires careful configuration.

# PostgreSQL statistics example through ActiveRecord
ActiveRecord::Base.connection.execute(<<-SQL)
  SELECT 
    schemaname,
    tablename,
    last_analyze,
    last_autoanalyze,
    n_live_tup,
    n_dead_tup
  FROM pg_stat_user_tables
  WHERE tablename = 'users';
SQL

Key Principles

Database statistics operate on three fundamental components: table statistics, column statistics, and index statistics. Table statistics track the number of rows, pages, and storage characteristics. Column statistics measure value distribution, including the number of distinct values, most common values with their frequencies, and histograms showing value ranges. Index statistics capture index depth, leaf page count, and clustering factor.

The cardinality estimate forms the core of query optimization. Cardinality refers to the number of distinct values in a column or the number of rows an operation returns. The optimizer uses cardinality estimates to calculate the cost of table scans, index lookups, and join operations. A high cardinality column (many distinct values) typically benefits from indexes, while low cardinality columns (few distinct values) may not.

Selectivity measures the fraction of rows that satisfy a predicate. The optimizer calculates selectivity by dividing the number of matching rows by the total row count. High selectivity (few matching rows) favors index usage, while low selectivity (many matching rows) may make full table scans more efficient due to reduced random I/O.

Histograms provide detailed distribution information by dividing column values into buckets. Each bucket contains a value range and the frequency of values within that range. Equi-depth histograms maintain approximately equal row counts per bucket, while equi-width histograms use fixed value ranges. Histograms help the optimizer estimate selectivity for range predicates and identify data skew.

Data skew occurs when values distribute unevenly across a column. Skewed data challenges the optimizer because average selectivity estimates fail to capture actual distribution patterns. For example, a status column might have 95% of rows with 'active' status and 5% distributed across other statuses. Without accurate statistics, the optimizer might incorrectly estimate that a query filtering for 'active' will return only a small fraction of rows.

The analyze operation scans tables to collect statistics. Full table scans examine every row, while sampling approaches read a subset of rows and extrapolate statistics. Sampling trades accuracy for speed, with larger sample sizes producing more accurate statistics at the cost of longer analysis time. Database systems balance this trade-off through configurable sample rates.

Auto-vacuum in PostgreSQL combines dead tuple removal with statistics updates. The autovacuum daemon monitors table activity and triggers analyze operations based on the number of inserted, updated, or deleted rows since the last analysis. This automation maintains reasonably current statistics without manual intervention, though high-velocity tables may require manual analyze calls or adjusted thresholds.

# Accessing column statistics in PostgreSQL
result = ActiveRecord::Base.connection.execute(<<-SQL)
  SELECT 
    attname,
    n_distinct,
    correlation,
    most_common_vals,
    most_common_freqs
  FROM pg_stats
  WHERE tablename = 'orders'
  AND attname = 'status';
SQL

stats = result.first
puts "Distinct values: #{stats['n_distinct']}"
puts "Common values: #{stats['most_common_vals']}"

Ruby Implementation

Ruby applications interact with database statistics primarily through ORM libraries and direct database adapters. ActiveRecord, the Rails ORM, provides access to statistics through raw SQL execution while Sequel offers similar capabilities with its dataset interface.

The pg gem for PostgreSQL exposes statistics through system catalog queries. The pg_stats view provides human-readable statistics, while pg_statistic contains the raw statistical data. Applications query these views to inspect current statistics and make runtime decisions based on data characteristics.

require 'pg'

class DatabaseStats
  def initialize(connection)
    @conn = connection
  end

  def table_statistics(table_name)
    query = <<-SQL
      SELECT 
        schemaname,
        tablename,
        n_live_tup AS live_rows,
        n_dead_tup AS dead_rows,
        n_mod_since_analyze AS modifications,
        last_analyze,
        last_autoanalyze
      FROM pg_stat_user_tables
      WHERE tablename = $1
    SQL
    
    @conn.exec_params(query, [table_name]).first
  end

  def column_statistics(table_name, column_name)
    query = <<-SQL
      SELECT 
        attname AS column_name,
        n_distinct,
        array_length(most_common_vals, 1) AS num_common_vals,
        most_common_freqs,
        correlation,
        null_frac
      FROM pg_stats
      WHERE tablename = $1 AND attname = $2
    SQL
    
    @conn.exec_params(query, [table_name, column_name]).first
  end

  def analyze_table(table_name)
    @conn.exec("ANALYZE #{@conn.quote_ident(table_name)}")
  end
end

conn = PG.connect(dbname: 'production')
stats = DatabaseStats.new(conn)

# Check if statistics are stale
table_stats = stats.table_statistics('users')
modifications = table_stats['modifications'].to_i
live_rows = table_stats['live_rows'].to_i

if modifications > (live_rows * 0.1)
  puts "Statistics stale, analyzing..."
  stats.analyze_table('users')
end

ActiveRecord applications trigger statistics updates through migrations or maintenance tasks. The execute method runs ANALYZE commands directly, while connection adapters provide database-specific interfaces.

class UpdateDatabaseStatistics < ActiveRecord::Migration[7.0]
  def up
    case ActiveRecord::Base.connection.adapter_name
    when 'PostgreSQL'
      execute 'ANALYZE VERBOSE users'
      execute 'ANALYZE VERBOSE orders'
    when 'MySQL'
      execute 'ANALYZE TABLE users'
      execute 'ANALYZE TABLE orders'
    end
  end

  def down
    # Statistics updates are not reversible
  end
end

Sequel provides a statistics extension for some adapters, though direct SQL execution remains the most reliable approach across database systems.

require 'sequel'

DB = Sequel.connect('postgres://localhost/production')

# Define a statistics helper
module Sequel
  class Database
    def analyze_table(table)
      run "ANALYZE #{quote_schema_table(table)}"
    end

    def table_row_estimate(table)
      self[Sequel.lit('pg_stat_user_tables')]
        .where(tablename: table.to_s)
        .get(:n_live_tup)
    end
  end
end

DB.analyze_table(:users)
estimated_rows = DB.table_row_estimate(:users)
puts "Estimated rows: #{estimated_rows}"

Applications monitoring query performance extract statistics programmatically to identify problematic queries. The pg_stat_statements extension tracks query execution statistics, providing complementary data to table statistics.

class QueryPerformanceMonitor
  def initialize(connection)
    @conn = connection
  end

  def slow_queries(min_duration_ms = 1000)
    query = <<-SQL
      SELECT 
        query,
        calls,
        mean_exec_time,
        max_exec_time,
        stddev_exec_time
      FROM pg_stat_statements
      WHERE mean_exec_time > $1
      ORDER BY mean_exec_time DESC
      LIMIT 20
    SQL
    
    @conn.exec_params(query, [min_duration_ms])
  end

  def reset_statistics
    @conn.exec('SELECT pg_stat_statements_reset()')
  end
end

Performance Considerations

Statistics accuracy determines query execution efficiency. The optimizer relies on statistics to estimate operation costs, with errors compounding through multi-step execution plans. A 2x cardinality estimation error at one step can lead to 10x or greater errors in final result size estimates, causing the optimizer to select wildly inappropriate plans.

Sampling rates control the trade-off between analysis time and accuracy. PostgreSQL's default_statistics_target parameter sets the number of most common values to track and histogram bucket count. Values range from 1 to 10,000, with 100 as the default. Increasing this parameter improves estimate accuracy for columns with many distinct values but increases analysis time and catalog storage.

# Adjusting statistics target for specific columns
ActiveRecord::Base.connection.execute(<<-SQL)
  ALTER TABLE products 
  ALTER COLUMN category_id 
  SET STATISTICS 500;
SQL

# Analyze to regenerate statistics with new target
ActiveRecord::Base.connection.execute('ANALYZE products')

Tables with high modification rates require frequent statistics updates. The optimizer makes poor decisions when statistics reflect old data distributions. A table that doubles in size without updated statistics may continue using index scans when full table scans become more efficient, or vice versa.

Analyze timing affects both accuracy and system load. Running analyze during off-peak hours minimizes impact on application performance, but delays can leave statistics stale during business hours. High-volume tables benefit from scheduled analyze operations after batch loads or at regular intervals.

Index statistics particularly affect join operations. The optimizer uses index clustering factor to estimate the cost of index scans. A clustering factor near 1.0 indicates data physically ordered by the index, making index scans efficient. High clustering factors suggest random data distribution, increasing I/O costs for index access.

# Monitor index statistics
class IndexAnalyzer
  def initialize(connection)
    @conn = connection
  end

  def index_effectiveness(table_name)
    query = <<-SQL
      SELECT 
        indexrelname AS index_name,
        idx_scan,
        idx_tup_read,
        idx_tup_fetch,
        CASE 
          WHEN idx_scan = 0 THEN 0
          ELSE idx_tup_fetch::float / idx_scan
        END AS avg_tuples_per_scan
      FROM pg_stat_user_indexes
      WHERE schemaname = 'public'
      AND relname = $1
      ORDER BY idx_scan DESC
    SQL
    
    @conn.exec_params(query, [table_name])
  end

  def unused_indexes(min_scans = 10)
    query = <<-SQL
      SELECT 
        schemaname,
        relname AS table_name,
        indexrelname AS index_name,
        idx_scan
      FROM pg_stat_user_indexes
      WHERE schemaname = 'public'
      AND idx_scan < $1
      ORDER BY pg_relation_size(indexrelid) DESC
    SQL
    
    @conn.exec_params(query, [min_scans])
  end
end

analyzer = IndexAnalyzer.new(ActiveRecord::Base.connection.raw_connection)
unused = analyzer.unused_indexes(100)
unused.each do |idx|
  puts "Rarely used: #{idx['table_name']}.#{idx['index_name']} (#{idx['idx_scan']} scans)"
end

Partitioned tables require per-partition statistics. The optimizer needs statistics for each partition to generate efficient plans that prune irrelevant partitions. Missing partition statistics lead to scanning unnecessary partitions, negating the benefits of partitioning.

Memory allocation during query execution depends on row count estimates. Operations like hash joins and sorts allocate work memory based on optimizer estimates. Underestimates cause operations to spill to disk, dramatically reducing performance. Overestimates waste memory that could serve other queries.

Practical Examples

Consider an e-commerce application with an orders table containing millions of rows. The table has columns for order_date, status, customer_id, and total_amount. Without proper statistics, queries filtering by date range perform poorly.

# Initial state: no statistics
class Order < ActiveRecord::Base
end

# Query with implicit EXPLAIN
result = Order.where(order_date: 1.year.ago..Time.current)
              .where(status: 'shipped')
              .limit(100)

# The optimizer might choose inefficient plan:
# Seq Scan on orders (cost=0.00..150000.00 rows=50000 width=100)
#   Filter: (order_date >= '2024-01-01' AND status = 'shipped')

After analyzing the table, the optimizer generates accurate estimates and selects better plans.

# Update statistics
ActiveRecord::Base.connection.execute('ANALYZE orders')

# Same query now uses index
# Index Scan using idx_orders_date on orders 
#   (cost=0.43..8.45 rows=120 width=100)
#   Index Cond: (order_date >= '2024-01-01' AND order_date <= '2025-01-01')
#   Filter: (status = 'shipped')

# Verify statistics accuracy
stats_query = <<-SQL
  SELECT 
    n_live_tup,
    n_dead_tup,
    last_analyze,
    last_autoanalyze
  FROM pg_stat_user_tables
  WHERE tablename = 'orders'
SQL

stats = ActiveRecord::Base.connection.execute(stats_query).first
puts "Live rows: #{stats['n_live_tup']}"
puts "Last analyzed: #{stats['last_analyze']}"

A reporting system generates daily aggregates from a large events table. The table grows by millions of rows per day, with bulk inserts during batch processing. Statistics become stale quickly, causing report queries to timeout.

class EventProcessor
  def initialize
    @conn = ActiveRecord::Base.connection
  end

  def process_daily_events(date)
    # Bulk insert events
    events = fetch_external_events(date)
    Event.insert_all(events)

    # Statistics now stale, update them
    @conn.execute('ANALYZE events')

    # Generate reports with accurate optimizer estimates
    generate_daily_reports(date)
  end

  def generate_daily_reports(date)
    # Complex aggregation query benefits from accurate statistics
    query = <<-SQL
      SELECT 
        event_type,
        COUNT(*) as event_count,
        AVG(duration) as avg_duration,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration) as p95_duration
      FROM events
      WHERE event_date = $1
      GROUP BY event_type
      HAVING COUNT(*) > 100
    SQL

    @conn.exec_params(query, [date])
  end

  private

  def fetch_external_events(date)
    # Simulated external data fetch
    []
  end
end

A multi-tenant application partitions tables by tenant_id. Each partition requires separate statistics for efficient query routing and partition pruning.

class TenantDataManager
  def initialize(connection)
    @conn = connection
  end

  def create_tenant_partition(tenant_id)
    partition_name = "orders_tenant_#{tenant_id}"

    @conn.execute(<<-SQL)
      CREATE TABLE #{partition_name} 
      PARTITION OF orders 
      FOR VALUES IN (#{tenant_id})
    SQL

    # Create indexes on partition
    @conn.execute(<<-SQL)
      CREATE INDEX idx_#{partition_name}_date 
      ON #{partition_name}(order_date)
    SQL

    # Analyze new partition
    @conn.execute("ANALYZE #{partition_name}")
  end

  def partition_statistics
    query = <<-SQL
      SELECT 
        schemaname,
        tablename,
        n_live_tup,
        last_analyze
      FROM pg_stat_user_tables
      WHERE tablename LIKE 'orders_tenant_%'
      ORDER BY tablename
    SQL

    @conn.execute(query)
  end

  def analyze_stale_partitions(threshold_days = 7)
    query = <<-SQL
      SELECT tablename
      FROM pg_stat_user_tables
      WHERE tablename LIKE 'orders_tenant_%'
      AND (
        last_analyze IS NULL 
        OR last_analyze < CURRENT_DATE - INTERVAL '#{threshold_days} days'
      )
    SQL

    stale_partitions = @conn.execute(query)
    stale_partitions.each do |partition|
      puts "Analyzing stale partition: #{partition['tablename']}"
      @conn.execute("ANALYZE #{partition['tablename']}")
    end
  end
end

Tools & Ecosystem

PostgreSQL provides several built-in views and functions for statistics access. The pg_stats view presents statistics in human-readable format, while pg_statistic contains raw statistical data. The pg_stat_user_tables view tracks table-level access and modification statistics.

# PostgreSQL statistics views
module PostgresStats
  class Catalog
    def initialize(connection)
      @conn = connection
    end

    def table_io_statistics(table_name)
      query = <<-SQL
        SELECT 
          heap_blks_read,
          heap_blks_hit,
          CASE 
            WHEN (heap_blks_read + heap_blks_hit) = 0 THEN 0
            ELSE (heap_blks_hit::float / (heap_blks_read + heap_blks_hit)) * 100
          END AS cache_hit_ratio,
          idx_blks_read,
          idx_blks_hit
        FROM pg_statio_user_tables
        WHERE relname = $1
      SQL

      @conn.exec_params(query, [table_name]).first
    end
  end
end

The pg_stat_statements extension tracks SQL statement execution statistics, including execution count, total time, and resource usage. This extension requires installation and configuration in postgresql.conf.

# Enable and query pg_stat_statements
class StatementAnalyzer
  def self.enable
    ActiveRecord::Base.connection.execute(
      'CREATE EXTENSION IF NOT EXISTS pg_stat_statements'
    )
  end

  def self.top_queries_by_time(limit = 10)
    ActiveRecord::Base.connection.execute(<<-SQL)
      SELECT 
        query,
        calls,
        total_exec_time,
        mean_exec_time,
        max_exec_time
      FROM pg_stat_statements
      ORDER BY total_exec_time DESC
      LIMIT #{limit}
    SQL
  end

  def self.queries_by_io(limit = 10)
    ActiveRecord::Base.connection.execute(<<-SQL)
      SELECT 
        query,
        calls,
        shared_blks_read,
        shared_blks_hit,
        shared_blks_dirtied,
        shared_blks_written
      FROM pg_stat_statements
      ORDER BY (shared_blks_read + shared_blks_dirtied) DESC
      LIMIT #{limit}
    SQL
  end
end

MySQL uses the information_schema database for statistics access. The STATISTICS table provides index cardinality information, while TABLES contains row counts and data length.

# MySQL statistics access
class MysqlStats
  def initialize(connection)
    @conn = connection
  end

  def table_statistics(table_name)
    query = <<-SQL
      SELECT 
        TABLE_ROWS,
        DATA_LENGTH,
        INDEX_LENGTH,
        DATA_FREE,
        AUTO_INCREMENT,
        UPDATE_TIME
      FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME = ?
    SQL

    @conn.exec_query(query, 'SQL', [[nil, table_name]]).first
  end

  def index_cardinality(table_name)
    query = <<-SQL
      SELECT 
        INDEX_NAME,
        COLUMN_NAME,
        SEQ_IN_INDEX,
        CARDINALITY,
        SUB_PART
      FROM information_schema.STATISTICS
      WHERE TABLE_SCHEMA = DATABASE()
      AND TABLE_NAME = ?
      ORDER BY INDEX_NAME, SEQ_IN_INDEX
    SQL

    @conn.exec_query(query, 'SQL', [[nil, table_name]])
  end
end

The pg_qualstats extension provides detailed predicate usage statistics, identifying which columns appear frequently in WHERE clauses. This information guides index creation decisions.

Third-party tools like pganalyze and pgBadger parse PostgreSQL logs to provide statistics visualization and query performance analysis. These tools aggregate statistics over time and identify trends that individual statistics queries might miss.

Common Pitfalls

Stale statistics represent the most frequent issue affecting query performance. Applications that perform bulk operations without subsequent analyze calls leave the optimizer with outdated information. A table that doubles in size retains statistics indicating half the actual row count, causing the optimizer to select plans appropriate for the smaller dataset.

# Incorrect: bulk insert without statistics update
def import_users(user_records)
  User.insert_all(user_records)
  # Statistics still reflect pre-import row count
end

# Correct: update statistics after bulk operations
def import_users(user_records)
  User.insert_all(user_records)
  ActiveRecord::Base.connection.execute('ANALYZE users')
end

Insufficient statistics targets cause poor estimates for high-cardinality columns. The default statistics target of 100 may inadequately represent columns with millions of distinct values. The optimizer lacks detailed distribution information and falls back to assumptions that may not match reality.

# Symptom: poor selectivity estimates for high-cardinality column
explain = ActiveRecord::Base.connection.execute(<<-SQL)
  EXPLAIN ANALYZE
  SELECT * FROM products WHERE sku = 'ABC-12345'
SQL
# Shows: rows=1000 (estimated) vs rows=1 (actual)

# Solution: increase statistics target for specific column
ActiveRecord::Base.connection.execute(<<-SQL)
  ALTER TABLE products 
  ALTER COLUMN sku 
  SET STATISTICS 1000;
  
  ANALYZE products;
SQL

Analyzing tables during peak load competes with application queries for I/O and CPU resources. Full table scans for statistics collection on large tables can saturate I/O subsystems and increase query latency. Schedule analyze operations during maintenance windows or use sampling to reduce impact.

Missing statistics on expression indexes lead to poor plan choices. The optimizer cannot estimate selectivity for functional indexes without statistics on the indexed expression. Create statistics on expressions used in indexes and frequently occurring WHERE clauses.

# Create expression index
ActiveRecord::Base.connection.execute(<<-SQL)
  CREATE INDEX idx_users_lower_email 
  ON users(LOWER(email))
SQL

# Create statistics for the expression
ActiveRecord::Base.connection.execute(<<-SQL)
  CREATE STATISTICS stat_users_lower_email 
  ON (LOWER(email)) 
  FROM users
SQL

ActiveRecord::Base.connection.execute('ANALYZE users')

Correlation statistics measure the relationship between physical row order and logical column order. Low correlation indicates random data distribution, making index scans expensive due to random I/O patterns. The optimizer may choose table scans over index scans when correlation is poor, even for selective queries.

Autovacuum configuration errors prevent automatic statistics updates. Tables with autovacuum disabled or inappropriate thresholds accumulate stale statistics. Monitor autovacuum activity and adjust parameters based on table modification patterns.

# Monitor autovacuum effectiveness
def check_autovacuum_settings
  query = <<-SQL
    SELECT 
      relname,
      n_mod_since_analyze,
      n_live_tup,
      CASE 
        WHEN n_live_tup = 0 THEN 0
        ELSE (n_mod_since_analyze::float / n_live_tup) * 100
      END AS percent_modified,
      last_autovacuum,
      last_autoanalyze
    FROM pg_stat_user_tables
    WHERE n_mod_since_analyze > 10000
    ORDER BY n_mod_since_analyze DESC
  SQL

  ActiveRecord::Base.connection.execute(query)
end

Reference

Statistics Views (PostgreSQL)

View	Purpose	Key Columns
pg_stats	Human-readable statistics	tablename, attname, n_distinct, most_common_vals, histogram_bounds
pg_statistic	Raw statistical data	starelid, staattnum, stanumbers, stavalues
pg_stat_user_tables	Table access statistics	n_live_tup, n_dead_tup, n_mod_since_analyze, last_analyze
pg_stat_user_indexes	Index usage statistics	idx_scan, idx_tup_read, idx_tup_fetch
pg_statio_user_tables	Table I/O statistics	heap_blks_read, heap_blks_hit, idx_blks_read

Statistics Columns

Column	Type	Description
n_distinct	numeric	Number of distinct values (-1 for unique, negative for fraction)
correlation	numeric	Statistical correlation between physical and logical order
most_common_vals	anyarray	Array of most frequent values
most_common_freqs	real array	Frequencies of most common values
histogram_bounds	anyarray	Histogram bucket boundaries
null_frac	real	Fraction of null values

ANALYZE Commands

Database	Command	Effect
PostgreSQL	ANALYZE	Update statistics for all tables
PostgreSQL	ANALYZE table_name	Update statistics for specific table
PostgreSQL	ANALYZE VERBOSE	Show progress during analysis
MySQL	ANALYZE TABLE table_name	Update index statistics
SQLite	ANALYZE	Update all statistics
SQLite	ANALYZE table_name	Update specific table statistics

Configuration Parameters (PostgreSQL)

Parameter	Default	Range	Purpose
default_statistics_target	100	1-10000	Number of histogram buckets and most common values
autovacuum_analyze_threshold	50	0+	Minimum modifications before autovacuum analyze
autovacuum_analyze_scale_factor	0.1	0-1	Fraction of table size to trigger analyze

Common Statistics Queries

# Check statistics freshness
def statistics_age(table_name)
  query = <<-SQL
    SELECT 
      schemaname,
      tablename,
      last_analyze,
      last_autoanalyze,
      GREATEST(last_analyze, last_autoanalyze) AS last_stats_update,
      CURRENT_TIMESTAMP - GREATEST(last_analyze, last_autoanalyze) AS stats_age
    FROM pg_stat_user_tables
    WHERE tablename = $1
  SQL
  
  ActiveRecord::Base.connection.exec_query(
    query, 'SQL', [[nil, table_name]]
  ).first
end

# Find tables needing analysis
def tables_needing_analysis(modification_threshold = 0.1)
  query = <<-SQL
    SELECT 
      schemaname,
      tablename,
      n_live_tup,
      n_dead_tup,
      n_mod_since_analyze,
      CASE 
        WHEN n_live_tup = 0 THEN 0
        ELSE n_mod_since_analyze::float / n_live_tup
      END AS modification_ratio
    FROM pg_stat_user_tables
    WHERE n_live_tup > 0
    AND (n_mod_since_analyze::float / n_live_tup) > $1
    ORDER BY n_mod_since_analyze DESC
  SQL
  
  ActiveRecord::Base.connection.exec_query(
    query, 'SQL', [[nil, modification_threshold]]
  )
end

Database Statistics