CrackedRuby - Subqueries and CTEs

Overview

Subqueries and Common Table Expressions (CTEs) are SQL constructs that enable complex data retrieval by breaking down queries into manageable, logical components. A subquery is a query nested within another SQL statement, while a CTE defines a temporary named result set that exists only during query execution.

Subqueries appear in various clauses of SQL statements including SELECT, FROM, WHERE, and HAVING. They execute before the outer query and pass their results to the containing statement. Subqueries can return single values, single rows, single columns, or complete result sets depending on their context and purpose.

CTEs, introduced in SQL:1999 standard and implemented in PostgreSQL 8.4, MySQL 8.0, and SQLite 3.8.3, use the WITH clause to define one or more named temporary result sets. Unlike subqueries that can appear inline anywhere in a query, CTEs appear at the beginning and can be referenced multiple times within the main query. CTEs support recursive definitions, making them suitable for hierarchical data traversal and iterative operations.

-- Subquery example
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- CTE example
WITH department_avg AS (
  SELECT department_id, AVG(salary) as avg_salary
  FROM employees
  GROUP BY department_id
)
SELECT e.name, e.salary, da.avg_salary
FROM employees e
JOIN department_avg da ON e.department_id = da.department_id
WHERE e.salary > da.avg_salary;

Both constructs serve similar purposes but differ in reusability, readability, and execution characteristics. The choice between them depends on query complexity, database capabilities, and specific performance requirements.

Key Principles

Subqueries are self-contained SELECT statements embedded within another query. The database executes the subquery first, stores its result temporarily, then uses that result to evaluate the outer query. Subqueries can be correlated or non-correlated. Non-correlated subqueries execute once and pass their result to the outer query. Correlated subqueries reference columns from the outer query and execute once for each row processed by the outer query.

-- Non-correlated subquery
SELECT product_name
FROM products
WHERE category_id IN (SELECT id FROM categories WHERE active = true);

-- Correlated subquery
SELECT e1.name, e1.salary
FROM employees e1
WHERE e1.salary > (
  SELECT AVG(e2.salary)
  FROM employees e2
  WHERE e2.department_id = e1.department_id
);

Subqueries return different result types based on their usage context. Scalar subqueries return a single value and work with comparison operators. Row subqueries return a single row with multiple columns. Table subqueries return multiple rows and columns, typically used with IN, EXISTS, or as derived tables in FROM clauses.

CTEs define named result sets using the WITH clause followed by a query definition. The CTE name becomes available for the duration of the query execution and can be referenced multiple times. Multiple CTEs can be defined in a single WITH clause, separated by commas, with later CTEs able to reference earlier ones.

WITH active_users AS (
  SELECT id, email, created_at
  FROM users
  WHERE status = 'active'
),
recent_orders AS (
  SELECT user_id, COUNT(*) as order_count
  FROM orders
  WHERE created_at > NOW() - INTERVAL '30 days'
  GROUP BY user_id
)
SELECT au.email, COALESCE(ro.order_count, 0) as orders
FROM active_users au
LEFT JOIN recent_orders ro ON au.id = ro.user_id;

Recursive CTEs extend basic CTEs with self-referencing capability. They consist of two parts: an anchor member that defines the initial result set, and a recursive member that references the CTE itself. The database executes the anchor member first, then repeatedly executes the recursive member using the previous iteration's results until no more rows are returned.

WITH RECURSIVE category_tree AS (
  -- Anchor member: root categories
  SELECT id, name, parent_id, 1 as level
  FROM categories
  WHERE parent_id IS NULL
  
  UNION ALL
  
  -- Recursive member: child categories
  SELECT c.id, c.name, c.parent_id, ct.level + 1
  FROM categories c
  INNER JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY level, name;

The evaluation order matters significantly. For subqueries in WHERE clauses, the database may optimize execution order based on the query plan. For CTEs, the database materializes the result set before using it in the main query, though some databases perform inline expansion as an optimization.

Scope rules determine where subqueries and CTEs can access data. Subqueries have access to columns from outer queries when correlated. CTEs defined in the same WITH clause can reference earlier CTEs but not later ones. Neither can reference each other circularly unless using recursive syntax.

Ruby Implementation

Ruby database libraries provide multiple approaches for constructing subqueries and CTEs. ActiveRecord, the default ORM in Rails applications, supports subqueries through method chaining and SQL fragments. Sequel, an alternative Ruby ORM, offers explicit support for CTEs and subqueries through its dataset API.

ActiveRecord subqueries typically use the where method with nested queries or the select method for scalar subqueries. The ORM converts Ruby query objects into SQL subqueries automatically when passed to query methods.

# Subquery in WHERE clause using ActiveRecord
avg_salary = Employee.select('AVG(salary)')
high_earners = Employee.where("salary > (#{avg_salary.to_sql})")

# Using Arel for type-safe subqueries
employees = Employee.arel_table
subquery = employees.project(employees[:salary].average)
Employee.where(employees[:salary].gt(subquery))

# Subquery in FROM clause (derived table)
dept_stats = Employee
  .select('department_id, AVG(salary) as avg_salary')
  .group(:department_id)

Employee
  .from("employees e")
  .joins("INNER JOIN (#{dept_stats.to_sql}) ds ON e.department_id = ds.department_id")
  .where('e.salary > ds.avg_salary')

ActiveRecord lacks native CTE support in older versions, requiring raw SQL or extensions. Rails 7.1 introduced with method for CTE support, while earlier versions need the with_cte gem or manual SQL construction.

# Rails 7.1+ native CTE support
active_users = User.where(status: 'active')
recent_orders = Order.where('created_at > ?', 30.days.ago)
  .group(:user_id)
  .select('user_id, COUNT(*) as order_count')

User
  .with(active_users: active_users, recent_orders: recent_orders)
  .from('active_users au')
  .joins('LEFT JOIN recent_orders ro ON au.id = ro.user_id')
  .select('au.email, COALESCE(ro.order_count, 0) as orders')

# Pre-Rails 7.1 using raw SQL
cte_sql = <<-SQL
  WITH active_users AS (
    #{User.where(status: 'active').to_sql}
  ),
  recent_orders AS (
    #{Order.where('created_at > ?', 30.days.ago).to_sql}
  )
  SELECT au.email, COALESCE(ro.order_count, 0) as orders
  FROM active_users au
  LEFT JOIN recent_orders ro ON au.id = ro.user_id
SQL

User.find_by_sql(cte_sql)

Sequel provides first-class support for CTEs through the with and with_recursive methods. Sequel's dataset API allows chaining CTE definitions and referencing them in the main query.

# Sequel CTE implementation
DB[:employees].with(:dept_avg,
  DB[:employees]
    .select(:department_id, Sequel.function(:avg, :salary).as(:avg_salary))
    .group(:department_id)
).select(Sequel[:e][:name], Sequel[:e][:salary], Sequel[:da][:avg_salary])
  .from(Sequel[:employees].as(:e))
  .join(Sequel[:dept_avg].as(:da), department_id: :department_id)
  .where(Sequel[:e][:salary] > Sequel[:da][:avg_salary])

# Recursive CTE in Sequel
DB[:categories].with_recursive(:category_tree,
  DB[:categories].where(parent_id: nil).select(:id, :name, :parent_id, Sequel.lit('1 AS level')),
  DB[:categories]
    .select(Sequel[:c][:id], Sequel[:c][:name], Sequel[:c][:parent_id], Sequel.lit('ct.level + 1'))
    .from(Sequel[:categories].as(:c))
    .join(Sequel[:category_tree].as(:ct), id: Sequel[:c][:parent_id]),
  union_all: true
).from(:category_tree).order(:level, :name)

For complex scenarios requiring dynamic query construction, both ORMs support building queries programmatically. This approach handles conditional logic and varying query structures.

# Dynamic subquery construction with ActiveRecord
def find_employees_above_threshold(department_id: nil, threshold_type: :avg)
  base_query = Employee.all
  base_query = base_query.where(department_id: department_id) if department_id

  stat_subquery = case threshold_type
  when :avg
    base_query.select('AVG(salary)')
  when :median
    base_query.select('PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)')
  when :percentile_75
    base_query.select('PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary)')
  end

  Employee.where("salary > (#{stat_subquery.to_sql})")
end

# Dynamic CTE with Sequel
def build_category_report(include_inactive: false, max_depth: nil)
  base_categories = DB[:categories]
  base_categories = base_categories.where(active: true) unless include_inactive

  recursive_member = DB[:categories]
    .select(Sequel[:c][:id], Sequel[:c][:name], Sequel[:c][:parent_id], Sequel.lit('ct.level + 1'))
    .from(Sequel[:categories].as(:c))
    .join(Sequel[:category_tree].as(:ct), id: Sequel[:c][:parent_id])
  
  recursive_member = recursive_member.where(Sequel.lit('ct.level < ?', max_depth)) if max_depth

  DB[:categories].with_recursive(:category_tree,
    base_categories.select(:id, :name, :parent_id, Sequel.lit('1 AS level')),
    recursive_member,
    union_all: true
  ).from(:category_tree)
end

Raw SQL execution provides maximum control when ORM abstractions prove limiting. Ruby's database adapters execute SQL strings directly, returning result sets as arrays of hashes.

# Execute raw SQL with parameter binding
sql = <<-SQL
  WITH RECURSIVE subordinates AS (
    SELECT id, name, manager_id, 1 as level
    FROM employees
    WHERE id = ?
    
    UNION ALL
    
    SELECT e.id, e.name, e.manager_id, s.level + 1
    FROM employees e
    INNER JOIN subordinates s ON e.manager_id = s.id
    WHERE s.level < ?
  )
  SELECT * FROM subordinates ORDER BY level, name
SQL

ActiveRecord::Base.connection.exec_query(
  sql,
  'Subordinates Query',
  [[nil, employee_id], [nil, max_depth]]
)

Practical Examples

A common scenario involves finding records that exceed departmental averages. This pattern appears in reporting systems, performance dashboards, and anomaly detection.

# Find products priced above their category average
# Using subquery approach
Product.where(
  'price > (SELECT AVG(price) FROM products p2 WHERE p2.category_id = products.category_id)'
)

# Using CTE approach (Rails 7.1+)
category_averages = Product
  .select('category_id, AVG(price) as avg_price')
  .group(:category_id)

Product
  .with(category_averages: category_averages)
  .joins('JOIN category_averages ca ON products.category_id = ca.category_id')
  .where('products.price > ca.avg_price')

# Sequel CTE version
DB[:products].with(:category_averages,
  DB[:products]
    .select(:category_id, Sequel.function(:avg, :price).as(:avg_price))
    .group(:category_id)
).join(:category_averages, category_id: :category_id)
  .where(Sequel[:products][:price] > Sequel[:category_averages][:avg_price])

Hierarchical data traversal requires recursive CTEs. Organization charts, category trees, and bill-of-materials structures benefit from this approach.

# Organization chart traversal
class Employee < ApplicationRecord
  def self.find_all_subordinates(manager_id, max_levels: 10)
    sql = <<-SQL
      WITH RECURSIVE subordinates AS (
        SELECT id, name, manager_id, email, 1 as depth, 
               ARRAY[id] as path
        FROM employees
        WHERE manager_id = $1
        
        UNION ALL
        
        SELECT e.id, e.name, e.manager_id, e.email, s.depth + 1,
               s.path || e.id
        FROM employees e
        INNER JOIN subordinates s ON e.manager_id = s.id
        WHERE s.depth < $2
          AND NOT e.id = ANY(s.path)  -- Prevent cycles
      )
      SELECT * FROM subordinates
      ORDER BY depth, name
    SQL

    find_by_sql([sql, manager_id, max_levels])
  end
end

# Using Sequel for hierarchical categories
class Category < Sequel::Model
  def self.tree_from_root(root_id)
    dataset.with_recursive(:category_tree,
      dataset.where(id: root_id)
        .select(:id, :name, :parent_id, Sequel.lit('1 AS depth'),
                Sequel.lit("ARRAY[id] AS path")),
      dataset.select(
          Sequel[:c][:id], Sequel[:c][:name], Sequel[:c][:parent_id],
          Sequel.lit('ct.depth + 1'),
          Sequel.lit('ct.path || c.id')
        )
        .from(Sequel[:categories].as(:c))
        .join(Sequel[:category_tree].as(:ct), id: Sequel[:c][:parent_id])
        .where(Sequel.lit('NOT c.id = ANY(ct.path)')),
      union_all: true
    ).from(:category_tree).order(:depth, :name)
  end
end

Multi-step data aggregation combines multiple CTEs to perform complex calculations. Sales analysis, financial reporting, and inventory management often require this pattern.

# Complex sales report with multiple CTEs
class SalesReport
  def self.quarterly_analysis(year, quarter)
    quarter_start = Date.new(year, (quarter - 1) * 3 + 1, 1)
    quarter_end = quarter_start.end_of_quarter

    sql = <<-SQL
      WITH sales_data AS (
        SELECT 
          DATE_TRUNC('month', o.created_at) as month,
          o.customer_id,
          SUM(oi.quantity * oi.unit_price) as revenue,
          COUNT(DISTINCT o.id) as order_count
        FROM orders o
        JOIN order_items oi ON o.id = oi.order_id
        WHERE o.created_at BETWEEN $1 AND $2
          AND o.status = 'completed'
        GROUP BY DATE_TRUNC('month', o.created_at), o.customer_id
      ),
      monthly_totals AS (
        SELECT 
          month,
          SUM(revenue) as total_revenue,
          SUM(order_count) as total_orders,
          COUNT(DISTINCT customer_id) as unique_customers
        FROM sales_data
        GROUP BY month
      ),
      customer_segments AS (
        SELECT 
          customer_id,
          SUM(revenue) as customer_revenue,
          CASE
            WHEN SUM(revenue) > 10000 THEN 'high_value'
            WHEN SUM(revenue) > 1000 THEN 'medium_value'
            ELSE 'low_value'
          END as segment
        FROM sales_data
        GROUP BY customer_id
      )
      SELECT 
        mt.month,
        mt.total_revenue,
        mt.total_orders,
        mt.unique_customers,
        COUNT(cs.customer_id) FILTER (WHERE cs.segment = 'high_value') as high_value_customers,
        COUNT(cs.customer_id) FILTER (WHERE cs.segment = 'medium_value') as medium_value_customers,
        COUNT(cs.customer_id) FILTER (WHERE cs.segment = 'low_value') as low_value_customers
      FROM monthly_totals mt
      CROSS JOIN customer_segments cs
      GROUP BY mt.month, mt.total_revenue, mt.total_orders, mt.unique_customers
      ORDER BY mt.month
    SQL

    Order.connection.exec_query(sql, 'Quarterly Analysis', 
      [[nil, quarter_start], [nil, quarter_end]])
  end
end

Existence checks using EXISTS subqueries perform better than IN clauses for large datasets. This pattern verifies relationships without materializing complete result sets.

# Find users with recent activity using EXISTS
# ActiveRecord with Arel
users = User.arel_table
orders = Order.arel_table

subquery = orders
  .project(1)
  .where(orders[:user_id].eq(users[:id]))
  .where(orders[:created_at].gt(30.days.ago))
  .exists

User.where(subquery)

# Equivalent SQL approach
User.where(
  'EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.created_at > ?)',
  30.days.ago
)

# Find products never ordered
products = Product.arel_table
order_items = OrderItem.arel_table

subquery = order_items
  .project(1)
  .where(order_items[:product_id].eq(products[:id]))
  .exists

Product.where(subquery.not)

Window function analysis combined with CTEs enables complex ranking and partition operations.

# Top N per category with running totals
def top_selling_products_per_category(limit: 5)
  sql = <<-SQL
    WITH product_sales AS (
      SELECT 
        p.id,
        p.name,
        p.category_id,
        c.name as category_name,
        SUM(oi.quantity) as units_sold,
        SUM(oi.quantity * oi.unit_price) as revenue
      FROM products p
      JOIN order_items oi ON p.id = oi.product_id
      JOIN categories c ON p.category_id = c.id
      GROUP BY p.id, p.name, p.category_id, c.name
    ),
    ranked_products AS (
      SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY revenue DESC) as rank,
        SUM(revenue) OVER (PARTITION BY category_id ORDER BY revenue DESC) as running_total
      FROM product_sales
    )
    SELECT * FROM ranked_products
    WHERE rank <= $1
    ORDER BY category_name, rank
  SQL

  Product.connection.exec_query(sql, 'Top Products', [[nil, limit]])
end

Design Considerations

Choosing between subqueries and CTEs depends on readability, reusability, and database optimizer behavior. Subqueries work well for simple, one-time filters or calculations embedded directly in query logic. CTEs excel when the same derived data appears multiple times in a query or when query logic requires multiple logical steps.

Readability improves with CTEs for complex queries. Named result sets make intent explicit and allow top-down reading of query logic. Subqueries embedded in WHERE or JOIN clauses require mental context switching and inside-out reading.

# Subquery approach - harder to follow
Customer.where(
  "id IN (SELECT customer_id FROM orders WHERE status = 'completed' 
           AND created_at > (SELECT MAX(created_at) - INTERVAL '30 days' FROM orders))"
).where(
  "total_spent > (SELECT AVG(total_spent) FROM customers 
                  WHERE created_at < (SELECT MIN(created_at) FROM orders WHERE status = 'completed'))"
)

# CTE approach - clearer intent
sql = <<-SQL
  WITH recent_cutoff AS (
    SELECT MAX(created_at) - INTERVAL '30 days' as cutoff_date FROM orders
  ),
  recent_customers AS (
    SELECT DISTINCT customer_id 
    FROM orders, recent_cutoff
    WHERE status = 'completed' 
      AND orders.created_at > recent_cutoff.cutoff_date
  ),
  avg_spending AS (
    SELECT AVG(total_spent) as avg_spent FROM customers
  )
  SELECT c.* 
  FROM customers c
  WHERE c.id IN (SELECT customer_id FROM recent_customers)
    AND c.total_spent > (SELECT avg_spent FROM avg_spending)
SQL

Correlated subqueries introduce performance considerations. The database executes a correlated subquery once per row of the outer query, potentially causing N+1 query problems at the SQL level. Converting to joins or CTEs often improves performance.

# Correlated subquery - executes once per product
Product.where(
  'price > (SELECT AVG(price) FROM products p2 WHERE p2.category_id = products.category_id)'
)

# Join alternative - single execution
Product
  .joins(
    'JOIN (SELECT category_id, AVG(price) as avg_price FROM products GROUP BY category_id) avgs 
     ON products.category_id = avgs.category_id'
  )
  .where('products.price > avgs.avg_price')

EXISTS subqueries perform better than IN clauses when checking for row existence. EXISTS stops at the first match, while IN materializes the complete subquery result. For large result sets, this difference becomes significant.

# IN clause - materializes all order IDs
User.where(id: Order.where('created_at > ?', 30.days.ago).select(:user_id))

# EXISTS - stops at first match per user
User.where(
  'EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.created_at > ?)',
  30.days.ago
)

CTE materialization behavior varies across databases. PostgreSQL materializes CTEs by default in versions before 12, preventing optimizer pushdown. PostgreSQL 12+ added inline expansion optimization. MySQL 8.0+ and recent SQLite versions also optimize CTE execution. Understanding database-specific behavior guides CTE usage decisions.

Recursive CTEs require termination guarantees. Circular references or missing base cases cause infinite loops. Include cycle detection in path tracking and depth limits in recursive members.

# Safe recursive CTE with cycle detection and depth limit
sql = <<-SQL
  WITH RECURSIVE paths AS (
    SELECT id, parent_id, name, 1 as depth, ARRAY[id] as path
    FROM categories
    WHERE parent_id IS NULL
    
    UNION ALL
    
    SELECT c.id, c.parent_id, c.name, p.depth + 1, p.path || c.id
    FROM categories c
    INNER JOIN paths p ON c.parent_id = p.id
    WHERE p.depth < 10  -- Depth limit
      AND NOT c.id = ANY(p.path)  -- Cycle detection
  )
  SELECT * FROM paths
SQL

Scalar subqueries must return exactly one row and one column. Multiple rows or NULL results cause errors or unexpected behavior. Use LIMIT 1 or aggregate functions to guarantee single-value results.

# Unsafe scalar subquery - might return multiple rows
Product.select('*, (SELECT name FROM categories WHERE id = category_id) as category_name')

# Safe version with guaranteed single result
Product.select(
  '*, (SELECT name FROM categories WHERE categories.id = products.category_id LIMIT 1) as category_name'
)

Performance Considerations

Query execution plans reveal how databases handle subqueries and CTEs. Subqueries may be executed as nested loops, hash joins, or merged into the outer query depending on optimizer decisions. CTEs traditionally materialized results but modern optimizers perform inline expansion when beneficial.

Examining execution plans identifies performance bottlenecks. Both ActiveRecord and Sequel provide methods to retrieve query plans.

# Analyze query performance with EXPLAIN
query = Employee.where('salary > (SELECT AVG(salary) FROM employees)')
puts query.explain

# Sequel query plan analysis
plan = DB[:employees]
  .where(Sequel[:salary] > DB[:employees].select(Sequel.function(:avg, :salary)))
  .explain
puts plan

# PostgreSQL-specific detailed analysis
ActiveRecord::Base.connection.execute(
  "EXPLAIN (ANALYZE, BUFFERS, VERBOSE) #{query.to_sql}"
).each { |row| puts row }

Subquery execution time depends on correlation. Non-correlated subqueries execute once before the outer query. Correlated subqueries execute once per outer row, multiplying execution time by row count. For 10,000 outer rows, a correlated subquery with 10ms execution time takes 100 seconds total.

# Non-correlated - executes once
Product.where('category_id IN (SELECT id FROM categories WHERE active = true)')

# Correlated - executes 10,000 times for 10,000 products
Product.where(
  'price > (SELECT AVG(price) FROM products p WHERE p.category_id = products.category_id)'
)

# Optimized with window function - single execution
sql = <<-SQL
  SELECT p.*
  FROM (
    SELECT *,
           AVG(price) OVER (PARTITION BY category_id) as cat_avg
    FROM products
  ) p
  WHERE p.price > p.cat_avg
SQL
Product.find_by_sql(sql)

CTE materialization forces result storage before main query execution. This prevents index usage on CTE results and duplicates computation if the CTE appears in multiple JOIN branches. Modern PostgreSQL (12+) optimizes simple CTEs through inline expansion, but complex CTEs still materialize.

# Materialized CTE - forces computation before use
sql = <<-SQL
  WITH category_products AS (
    SELECT * FROM products WHERE category_id = 5
  )
  SELECT * FROM category_products WHERE price > 100
SQL

# Optimized version - allows index usage
Product.where(category_id: 5, price: 100..)

# Force inline in PostgreSQL 12+
sql = <<-SQL
  WITH category_products AS NOT MATERIALIZED (
    SELECT * FROM products WHERE category_id = 5
  )
  SELECT * FROM category_products WHERE price > 100
SQL

Index utilization affects subquery performance dramatically. Subqueries with proper indexes on join columns and WHERE predicates execute orders of magnitude faster than full table scans.

# Add indexes for subquery performance
class AddSubqueryIndexes < ActiveRecord::Migration[7.0]
  def change
    add_index :orders, [:user_id, :created_at]
    add_index :order_items, [:order_id, :product_id]
    add_index :products, [:category_id, :price]
  end
end

# Query using indexes effectively
User.where(
  'EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id 
           AND orders.created_at > ? AND orders.status = ?)',
  30.days.ago, 'completed'
)

Recursive CTE performance depends on hierarchy depth and branching factor. Wide, shallow trees perform better than narrow, deep trees. Each recursion level processes results from the previous level, so depth directly impacts total rows processed.

# Monitor recursive CTE performance
def analyze_hierarchy_performance(root_id)
  sql = <<-SQL
    WITH RECURSIVE category_tree AS (
      SELECT id, parent_id, name, 1 as level
      FROM categories WHERE id = $1
      
      UNION ALL
      
      SELECT c.id, c.parent_id, c.name, ct.level + 1
      FROM categories c
      JOIN category_tree ct ON c.parent_id = ct.id
    )
    SELECT 
      level,
      COUNT(*) as nodes_at_level,
      SUM(COUNT(*)) OVER (ORDER BY level) as cumulative_nodes
    FROM category_tree
    GROUP BY level
    ORDER BY level
  SQL

  Category.connection.exec_query(sql, 'Hierarchy Stats', [[nil, root_id]])
end

Caching subquery results reduces repeated computation. Application-level caching stores subquery results in Redis or Memcached for reuse across requests.

# Cache expensive subquery results
class Product < ApplicationRecord
  def self.active_category_ids
    Rails.cache.fetch('active_category_ids', expires_in: 1.hour) do
      Category.where(active: true).pluck(:id)
    end
  end

  def self.in_active_categories
    where(category_id: active_category_ids)
  end
end

# Cache CTE results for complex reports
class SalesMetrics
  def self.daily_summary(date)
    cache_key = "sales_summary:#{date}"
    
    Rails.cache.fetch(cache_key, expires_in: 24.hours) do
      sql = <<-SQL
        WITH daily_orders AS (
          SELECT id, customer_id, total_amount
          FROM orders
          WHERE DATE(created_at) = $1
        ),
        customer_stats AS (
          SELECT customer_id, COUNT(*) as order_count, SUM(total_amount) as total
          FROM daily_orders
          GROUP BY customer_id
        )
        SELECT 
          COUNT(DISTINCT customer_id) as unique_customers,
          SUM(order_count) as total_orders,
          SUM(total) as revenue
        FROM customer_stats
      SQL

      Order.connection.exec_query(sql, 'Daily Summary', [[nil, date]]).first
    end
  end
end

Common Patterns

Filtering with subqueries represents the most common pattern. The WHERE clause uses subqueries to filter rows based on aggregated or derived conditions.

# Find above-average performers
Employee.where('salary > (SELECT AVG(salary) FROM employees)')

# Find products in top-selling categories
Product.where(
  'category_id IN (
    SELECT category_id FROM products 
    GROUP BY category_id 
    HAVING SUM(units_sold) > 1000
  )'
)

# Find customers with no recent orders
Customer.where(
  'NOT EXISTS (
    SELECT 1 FROM orders 
    WHERE orders.customer_id = customers.id 
    AND orders.created_at > ?
  )', 90.days.ago
)

Derived tables in FROM clauses treat subquery results as temporary tables. This pattern enables aggregation followed by filtering or joining.

# Aggregate then filter - requires derived table
sql = <<-SQL
  SELECT ds.*
  FROM (
    SELECT category_id, AVG(price) as avg_price, COUNT(*) as product_count
    FROM products
    GROUP BY category_id
  ) ds
  WHERE ds.product_count > 10 AND ds.avg_price > 50
SQL

Product.connection.exec_query(sql)

# Equivalent using CTE for readability
sql = <<-SQL
  WITH category_stats AS (
    SELECT category_id, AVG(price) as avg_price, COUNT(*) as product_count
    FROM products
    GROUP BY category_id
  )
  SELECT * FROM category_stats
  WHERE product_count > 10 AND avg_price > 50
SQL

Multiple CTE chaining builds complex logic incrementally. Each CTE performs one logical step, and later CTEs reference earlier results.

# Multi-step analysis using CTE chain
def customer_segmentation_report
  sql = <<-SQL
    WITH customer_orders AS (
      SELECT 
        customer_id,
        COUNT(*) as order_count,
        SUM(total_amount) as lifetime_value,
        MAX(created_at) as last_order_date
      FROM orders
      WHERE status = 'completed'
      GROUP BY customer_id
    ),
    customer_segments AS (
      SELECT 
        *,
        CASE
          WHEN lifetime_value > 5000 AND order_count > 10 THEN 'vip'
          WHEN lifetime_value > 1000 THEN 'regular'
          WHEN order_count > 0 THEN 'occasional'
          ELSE 'inactive'
        END as segment,
        CASE
          WHEN last_order_date > NOW() - INTERVAL '30 days' THEN 'active'
          WHEN last_order_date > NOW() - INTERVAL '90 days' THEN 'at_risk'
          ELSE 'churned'
        END as status
      FROM customer_orders
    )
    SELECT 
      segment,
      status,
      COUNT(*) as customer_count,
      AVG(lifetime_value) as avg_ltv,
      AVG(order_count) as avg_orders
    FROM customer_segments
    GROUP BY segment, status
    ORDER BY segment, status
  SQL

  Order.connection.exec_query(sql)
end

Recursive hierarchy traversal follows a standard pattern: anchor member selects root nodes, recursive member joins children to parents while incrementing depth.

# Standard recursive traversal pattern
def self.hierarchy_from_root(root_id, max_depth: nil)
  depth_condition = max_depth ? "AND level < #{max_depth}" : ""
  
  sql = <<-SQL
    WITH RECURSIVE tree AS (
      -- Anchor: root node
      SELECT id, parent_id, name, 1 as level, 
             CAST(name AS TEXT) as path
      FROM #{table_name}
      WHERE id = $1
      
      UNION ALL
      
      -- Recursive: children
      SELECT t.id, t.parent_id, t.name, tree.level + 1,
             tree.path || ' > ' || t.name
      FROM #{table_name} t
      INNER JOIN tree ON t.parent_id = tree.id
      WHERE true #{depth_condition}
    )
    SELECT * FROM tree ORDER BY path
  SQL

  find_by_sql([sql, root_id])
end

# Bill of materials explosion
def explode_bom(product_id, quantity = 1)
  sql = <<-SQL
    WITH RECURSIVE bom AS (
      SELECT 
        id, component_id, quantity_required, 1 as level,
        quantity_required * $2 as total_quantity
      FROM bill_of_materials
      WHERE product_id = $1
      
      UNION ALL
      
      SELECT 
        b.id, b.component_id, b.quantity_required, bom.level + 1,
        b.quantity_required * bom.total_quantity
      FROM bill_of_materials b
      INNER JOIN bom ON b.product_id = bom.component_id
      WHERE bom.level < 10
    )
    SELECT 
      component_id,
      SUM(total_quantity) as total_needed,
      MAX(level) as max_depth
    FROM bom
    GROUP BY component_id
  SQL

  connection.exec_query(sql, 'BOM Explosion', [[nil, product_id], [nil, quantity]])
end

Ranking and top-N queries combine window functions with CTEs or subqueries to identify top performers per group.

# Top 3 products per category
sql = <<-SQL
  WITH ranked_products AS (
    SELECT 
      p.*,
      ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY units_sold DESC) as rank
    FROM products p
  )
  SELECT * FROM ranked_products
  WHERE rank <= 3
  ORDER BY category_id, rank
SQL

# Alternative using lateral join (PostgreSQL)
sql = <<-SQL
  SELECT c.name as category, p.name, p.units_sold
  FROM categories c
  CROSS JOIN LATERAL (
    SELECT name, units_sold
    FROM products
    WHERE category_id = c.id
    ORDER BY units_sold DESC
    LIMIT 3
  ) p
  ORDER BY c.name, p.units_sold DESC
SQL

Gap detection finds missing sequences using recursive CTEs or self-joins with subqueries.

# Find missing invoice numbers
sql = <<-SQL
  WITH RECURSIVE number_series AS (
    SELECT MIN(invoice_number) as num FROM invoices
    UNION ALL
    SELECT num + 1 FROM number_series
    WHERE num < (SELECT MAX(invoice_number) FROM invoices)
  )
  SELECT ns.num as missing_number
  FROM number_series ns
  LEFT JOIN invoices i ON ns.num = i.invoice_number
  WHERE i.invoice_number IS NULL
SQL

# Find date gaps in time series data
sql = <<-SQL
  WITH RECURSIVE date_series AS (
    SELECT DATE(MIN(recorded_at)) as date FROM measurements
    UNION ALL
    SELECT date + INTERVAL '1 day'
    FROM date_series
    WHERE date < (SELECT DATE(MAX(recorded_at)) FROM measurements)
  )
  SELECT ds.date as missing_date
  FROM date_series ds
  LEFT JOIN measurements m ON DATE(m.recorded_at) = ds.date
  WHERE m.id IS NULL
SQL

Reference

Subquery Types

Type	Returns	Usage Context	Example
Scalar	Single value	Comparison operators, SELECT list	SELECT name WHERE salary > (SELECT AVG(salary) FROM employees)
Row	Single row, multiple columns	Row comparisons	WHERE (name, dept) = (SELECT name, dept FROM table WHERE id = 1)
Column	Multiple rows, single column	IN, ANY, ALL operators	WHERE id IN (SELECT user_id FROM orders)
Table	Multiple rows and columns	FROM clause, EXISTS	FROM (SELECT * FROM products WHERE active) p

CTE Syntax Patterns

Pattern	PostgreSQL	MySQL 8.0+	SQLite 3.8.3+
Basic CTE	WITH cte AS (SELECT ...)	WITH cte AS (SELECT ...)	WITH cte AS (SELECT ...)
Multiple CTEs	WITH c1 AS (...), c2 AS (...)	WITH c1 AS (...), c2 AS (...)	WITH c1 AS (...), c2 AS (...)
Recursive CTE	WITH RECURSIVE cte AS (...)	WITH RECURSIVE cte AS (...)	WITH RECURSIVE cte AS (...)
Materialization hint	WITH cte AS MATERIALIZED (...)	Not supported	Not supported
No materialization	WITH cte AS NOT MATERIALIZED (...)	Not supported	Not supported

Ruby ORM Methods

Operation	ActiveRecord	Sequel	Description
Basic subquery	where(column: Model.select(:col))	where(column: Model.select(:col))	Use subquery result in WHERE
EXISTS check	where('EXISTS (...)')	where(Sequel.function(:exists, subquery))	Check row existence
Scalar subquery	select('(SELECT ...)')	select(Sequel.lit('(SELECT ...)'))	Inline scalar value
CTE (Rails 7.1+)	with(name: query)	with(:name, query)	Define CTE
Recursive CTE	N/A (use raw SQL)	with_recursive(:name, anchor, recursive)	Define recursive CTE
Raw SQL	find_by_sql(sql)	fetch(sql)	Execute custom SQL

Subquery Operators

Operator	Purpose	Returns	Example
IN	Value in set	Boolean	WHERE id IN (SELECT user_id FROM orders)
NOT IN	Value not in set	Boolean	WHERE id NOT IN (SELECT blocked_id FROM blocks)
EXISTS	Row exists	Boolean	WHERE EXISTS (SELECT 1 FROM orders WHERE user_id = users.id)
NOT EXISTS	No rows exist	Boolean	WHERE NOT EXISTS (SELECT 1 FROM orders WHERE user_id = users.id)
ANY	Compare to any value	Boolean	WHERE price > ANY (SELECT price FROM products WHERE featured)
ALL	Compare to all values	Boolean	WHERE price > ALL (SELECT price FROM products WHERE category_id = 5)
Comparison	Single value comparison	Boolean	WHERE salary > (SELECT AVG(salary) FROM employees)

Recursive CTE Components

Component	Purpose	Required	Example
Anchor member	Initial result set	Yes	SELECT id FROM categories WHERE parent_id IS NULL
Recursive member	Self-referencing query	Yes	SELECT c.id FROM categories c JOIN tree ON c.parent_id = tree.id
UNION ALL	Combine anchor and recursive	Yes	anchor UNION ALL recursive
Termination condition	Prevent infinite loops	Recommended	WHERE depth < 10 AND NOT id = ANY(path)
Path tracking	Detect cycles	Optional	ARRAY[id] for anchor, path \|\| id for recursive

Performance Optimization Checklist

Technique	Impact	When to Use
Convert correlated to join	High	Subquery executes per outer row
Use EXISTS instead of IN	Medium-High	Checking existence with large result sets
Add indexes on join columns	High	Subqueries join to outer query
Use LIMIT in scalar subqueries	Low	Subquery might return multiple rows
Materialize complex CTEs	Medium	CTE used once, complex computation
Inline simple CTEs	Medium	Simple CTE used once, PostgreSQL 12+
Cache subquery results	High	Same subquery in multiple queries
Replace recursive with iterative	Varies	Very deep hierarchies, performance critical

Common Pitfalls and Solutions

Problem	Symptom	Solution
N+1 subquery execution	Slow correlated subquery	Convert to JOIN or window function
Multiple row scalar subquery	Error or wrong results	Add LIMIT 1 or use aggregate
NULL in NOT IN	Unexpected empty results	Use NOT EXISTS or filter NULLs
Infinite recursive loop	Query timeout	Add depth limit and cycle detection
CTE materialization overhead	Slow despite indexes	Use NOT MATERIALIZED hint or subquery
Missing indexes	Full table scans	Add indexes on join and WHERE columns
Overly complex nesting	Unreadable queries	Split into CTEs or temp tables

Subqueries and CTEs