Overview
Subqueries and Common Table Expressions (CTEs) are SQL constructs that enable complex data retrieval by breaking down queries into manageable, logical components. A subquery is a query nested within another SQL statement, while a CTE defines a temporary named result set that exists only during query execution.
Subqueries appear in various clauses of SQL statements including SELECT, FROM, WHERE, and HAVING. They execute before the outer query and pass their results to the containing statement. Subqueries can return single values, single rows, single columns, or complete result sets depending on their context and purpose.
CTEs, introduced in SQL:1999 standard and implemented in PostgreSQL 8.4, MySQL 8.0, and SQLite 3.8.3, use the WITH clause to define one or more named temporary result sets. Unlike subqueries that can appear inline anywhere in a query, CTEs appear at the beginning and can be referenced multiple times within the main query. CTEs support recursive definitions, making them suitable for hierarchical data traversal and iterative operations.
-- Subquery example
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- CTE example
WITH department_avg AS (
SELECT department_id, AVG(salary) as avg_salary
FROM employees
GROUP BY department_id
)
SELECT e.name, e.salary, da.avg_salary
FROM employees e
JOIN department_avg da ON e.department_id = da.department_id
WHERE e.salary > da.avg_salary;
Both constructs serve similar purposes but differ in reusability, readability, and execution characteristics. The choice between them depends on query complexity, database capabilities, and specific performance requirements.
Key Principles
Subqueries are self-contained SELECT statements embedded within another query. The database executes the subquery first, stores its result temporarily, then uses that result to evaluate the outer query. Subqueries can be correlated or non-correlated. Non-correlated subqueries execute once and pass their result to the outer query. Correlated subqueries reference columns from the outer query and execute once for each row processed by the outer query.
-- Non-correlated subquery
SELECT product_name
FROM products
WHERE category_id IN (SELECT id FROM categories WHERE active = true);
-- Correlated subquery
SELECT e1.name, e1.salary
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e1.department_id
);
Subqueries return different result types based on their usage context. Scalar subqueries return a single value and work with comparison operators. Row subqueries return a single row with multiple columns. Table subqueries return multiple rows and columns, typically used with IN, EXISTS, or as derived tables in FROM clauses.
CTEs define named result sets using the WITH clause followed by a query definition. The CTE name becomes available for the duration of the query execution and can be referenced multiple times. Multiple CTEs can be defined in a single WITH clause, separated by commas, with later CTEs able to reference earlier ones.
WITH active_users AS (
SELECT id, email, created_at
FROM users
WHERE status = 'active'
),
recent_orders AS (
SELECT user_id, COUNT(*) as order_count
FROM orders
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY user_id
)
SELECT au.email, COALESCE(ro.order_count, 0) as orders
FROM active_users au
LEFT JOIN recent_orders ro ON au.id = ro.user_id;
Recursive CTEs extend basic CTEs with self-referencing capability. They consist of two parts: an anchor member that defines the initial result set, and a recursive member that references the CTE itself. The database executes the anchor member first, then repeatedly executes the recursive member using the previous iteration's results until no more rows are returned.
WITH RECURSIVE category_tree AS (
-- Anchor member: root categories
SELECT id, name, parent_id, 1 as level
FROM categories
WHERE parent_id IS NULL
UNION ALL
-- Recursive member: child categories
SELECT c.id, c.name, c.parent_id, ct.level + 1
FROM categories c
INNER JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT * FROM category_tree ORDER BY level, name;
The evaluation order matters significantly. For subqueries in WHERE clauses, the database may optimize execution order based on the query plan. For CTEs, the database materializes the result set before using it in the main query, though some databases perform inline expansion as an optimization.
Scope rules determine where subqueries and CTEs can access data. Subqueries have access to columns from outer queries when correlated. CTEs defined in the same WITH clause can reference earlier CTEs but not later ones. Neither can reference each other circularly unless using recursive syntax.
Ruby Implementation
Ruby database libraries provide multiple approaches for constructing subqueries and CTEs. ActiveRecord, the default ORM in Rails applications, supports subqueries through method chaining and SQL fragments. Sequel, an alternative Ruby ORM, offers explicit support for CTEs and subqueries through its dataset API.
ActiveRecord subqueries typically use the where method with nested queries or the select method for scalar subqueries. The ORM converts Ruby query objects into SQL subqueries automatically when passed to query methods.
# Subquery in WHERE clause using ActiveRecord
avg_salary = Employee.select('AVG(salary)')
high_earners = Employee.where("salary > (#{avg_salary.to_sql})")
# Using Arel for type-safe subqueries
employees = Employee.arel_table
subquery = employees.project(employees[:salary].average)
Employee.where(employees[:salary].gt(subquery))
# Subquery in FROM clause (derived table)
dept_stats = Employee
.select('department_id, AVG(salary) as avg_salary')
.group(:department_id)
Employee
.from("employees e")
.joins("INNER JOIN (#{dept_stats.to_sql}) ds ON e.department_id = ds.department_id")
.where('e.salary > ds.avg_salary')
ActiveRecord lacks native CTE support in older versions, requiring raw SQL or extensions. Rails 7.1 introduced with method for CTE support, while earlier versions need the with_cte gem or manual SQL construction.
# Rails 7.1+ native CTE support
active_users = User.where(status: 'active')
recent_orders = Order.where('created_at > ?', 30.days.ago)
.group(:user_id)
.select('user_id, COUNT(*) as order_count')
User
.with(active_users: active_users, recent_orders: recent_orders)
.from('active_users au')
.joins('LEFT JOIN recent_orders ro ON au.id = ro.user_id')
.select('au.email, COALESCE(ro.order_count, 0) as orders')
# Pre-Rails 7.1 using raw SQL
cte_sql = <<-SQL
WITH active_users AS (
#{User.where(status: 'active').to_sql}
),
recent_orders AS (
#{Order.where('created_at > ?', 30.days.ago).to_sql}
)
SELECT au.email, COALESCE(ro.order_count, 0) as orders
FROM active_users au
LEFT JOIN recent_orders ro ON au.id = ro.user_id
SQL
User.find_by_sql(cte_sql)
Sequel provides first-class support for CTEs through the with and with_recursive methods. Sequel's dataset API allows chaining CTE definitions and referencing them in the main query.
# Sequel CTE implementation
DB[:employees].with(:dept_avg,
DB[:employees]
.select(:department_id, Sequel.function(:avg, :salary).as(:avg_salary))
.group(:department_id)
).select(Sequel[:e][:name], Sequel[:e][:salary], Sequel[:da][:avg_salary])
.from(Sequel[:employees].as(:e))
.join(Sequel[:dept_avg].as(:da), department_id: :department_id)
.where(Sequel[:e][:salary] > Sequel[:da][:avg_salary])
# Recursive CTE in Sequel
DB[:categories].with_recursive(:category_tree,
DB[:categories].where(parent_id: nil).select(:id, :name, :parent_id, Sequel.lit('1 AS level')),
DB[:categories]
.select(Sequel[:c][:id], Sequel[:c][:name], Sequel[:c][:parent_id], Sequel.lit('ct.level + 1'))
.from(Sequel[:categories].as(:c))
.join(Sequel[:category_tree].as(:ct), id: Sequel[:c][:parent_id]),
union_all: true
).from(:category_tree).order(:level, :name)
For complex scenarios requiring dynamic query construction, both ORMs support building queries programmatically. This approach handles conditional logic and varying query structures.
# Dynamic subquery construction with ActiveRecord
def find_employees_above_threshold(department_id: nil, threshold_type: :avg)
base_query = Employee.all
base_query = base_query.where(department_id: department_id) if department_id
stat_subquery = case threshold_type
when :avg
base_query.select('AVG(salary)')
when :median
base_query.select('PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)')
when :percentile_75
base_query.select('PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary)')
end
Employee.where("salary > (#{stat_subquery.to_sql})")
end
# Dynamic CTE with Sequel
def build_category_report(include_inactive: false, max_depth: nil)
base_categories = DB[:categories]
base_categories = base_categories.where(active: true) unless include_inactive
recursive_member = DB[:categories]
.select(Sequel[:c][:id], Sequel[:c][:name], Sequel[:c][:parent_id], Sequel.lit('ct.level + 1'))
.from(Sequel[:categories].as(:c))
.join(Sequel[:category_tree].as(:ct), id: Sequel[:c][:parent_id])
recursive_member = recursive_member.where(Sequel.lit('ct.level < ?', max_depth)) if max_depth
DB[:categories].with_recursive(:category_tree,
base_categories.select(:id, :name, :parent_id, Sequel.lit('1 AS level')),
recursive_member,
union_all: true
).from(:category_tree)
end
Raw SQL execution provides maximum control when ORM abstractions prove limiting. Ruby's database adapters execute SQL strings directly, returning result sets as arrays of hashes.
# Execute raw SQL with parameter binding
sql = <<-SQL
WITH RECURSIVE subordinates AS (
SELECT id, name, manager_id, 1 as level
FROM employees
WHERE id = ?
UNION ALL
SELECT e.id, e.name, e.manager_id, s.level + 1
FROM employees e
INNER JOIN subordinates s ON e.manager_id = s.id
WHERE s.level < ?
)
SELECT * FROM subordinates ORDER BY level, name
SQL
ActiveRecord::Base.connection.exec_query(
sql,
'Subordinates Query',
[[nil, employee_id], [nil, max_depth]]
)
Practical Examples
A common scenario involves finding records that exceed departmental averages. This pattern appears in reporting systems, performance dashboards, and anomaly detection.
# Find products priced above their category average
# Using subquery approach
Product.where(
'price > (SELECT AVG(price) FROM products p2 WHERE p2.category_id = products.category_id)'
)
# Using CTE approach (Rails 7.1+)
category_averages = Product
.select('category_id, AVG(price) as avg_price')
.group(:category_id)
Product
.with(category_averages: category_averages)
.joins('JOIN category_averages ca ON products.category_id = ca.category_id')
.where('products.price > ca.avg_price')
# Sequel CTE version
DB[:products].with(:category_averages,
DB[:products]
.select(:category_id, Sequel.function(:avg, :price).as(:avg_price))
.group(:category_id)
).join(:category_averages, category_id: :category_id)
.where(Sequel[:products][:price] > Sequel[:category_averages][:avg_price])
Hierarchical data traversal requires recursive CTEs. Organization charts, category trees, and bill-of-materials structures benefit from this approach.
# Organization chart traversal
class Employee < ApplicationRecord
def self.find_all_subordinates(manager_id, max_levels: 10)
sql = <<-SQL
WITH RECURSIVE subordinates AS (
SELECT id, name, manager_id, email, 1 as depth,
ARRAY[id] as path
FROM employees
WHERE manager_id = $1
UNION ALL
SELECT e.id, e.name, e.manager_id, e.email, s.depth + 1,
s.path || e.id
FROM employees e
INNER JOIN subordinates s ON e.manager_id = s.id
WHERE s.depth < $2
AND NOT e.id = ANY(s.path) -- Prevent cycles
)
SELECT * FROM subordinates
ORDER BY depth, name
SQL
find_by_sql([sql, manager_id, max_levels])
end
end
# Using Sequel for hierarchical categories
class Category < Sequel::Model
def self.tree_from_root(root_id)
dataset.with_recursive(:category_tree,
dataset.where(id: root_id)
.select(:id, :name, :parent_id, Sequel.lit('1 AS depth'),
Sequel.lit("ARRAY[id] AS path")),
dataset.select(
Sequel[:c][:id], Sequel[:c][:name], Sequel[:c][:parent_id],
Sequel.lit('ct.depth + 1'),
Sequel.lit('ct.path || c.id')
)
.from(Sequel[:categories].as(:c))
.join(Sequel[:category_tree].as(:ct), id: Sequel[:c][:parent_id])
.where(Sequel.lit('NOT c.id = ANY(ct.path)')),
union_all: true
).from(:category_tree).order(:depth, :name)
end
end
Multi-step data aggregation combines multiple CTEs to perform complex calculations. Sales analysis, financial reporting, and inventory management often require this pattern.
# Complex sales report with multiple CTEs
class SalesReport
def self.quarterly_analysis(year, quarter)
quarter_start = Date.new(year, (quarter - 1) * 3 + 1, 1)
quarter_end = quarter_start.end_of_quarter
sql = <<-SQL
WITH sales_data AS (
SELECT
DATE_TRUNC('month', o.created_at) as month,
o.customer_id,
SUM(oi.quantity * oi.unit_price) as revenue,
COUNT(DISTINCT o.id) as order_count
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.created_at BETWEEN $1 AND $2
AND o.status = 'completed'
GROUP BY DATE_TRUNC('month', o.created_at), o.customer_id
),
monthly_totals AS (
SELECT
month,
SUM(revenue) as total_revenue,
SUM(order_count) as total_orders,
COUNT(DISTINCT customer_id) as unique_customers
FROM sales_data
GROUP BY month
),
customer_segments AS (
SELECT
customer_id,
SUM(revenue) as customer_revenue,
CASE
WHEN SUM(revenue) > 10000 THEN 'high_value'
WHEN SUM(revenue) > 1000 THEN 'medium_value'
ELSE 'low_value'
END as segment
FROM sales_data
GROUP BY customer_id
)
SELECT
mt.month,
mt.total_revenue,
mt.total_orders,
mt.unique_customers,
COUNT(cs.customer_id) FILTER (WHERE cs.segment = 'high_value') as high_value_customers,
COUNT(cs.customer_id) FILTER (WHERE cs.segment = 'medium_value') as medium_value_customers,
COUNT(cs.customer_id) FILTER (WHERE cs.segment = 'low_value') as low_value_customers
FROM monthly_totals mt
CROSS JOIN customer_segments cs
GROUP BY mt.month, mt.total_revenue, mt.total_orders, mt.unique_customers
ORDER BY mt.month
SQL
Order.connection.exec_query(sql, 'Quarterly Analysis',
[[nil, quarter_start], [nil, quarter_end]])
end
end
Existence checks using EXISTS subqueries perform better than IN clauses for large datasets. This pattern verifies relationships without materializing complete result sets.
# Find users with recent activity using EXISTS
# ActiveRecord with Arel
users = User.arel_table
orders = Order.arel_table
subquery = orders
.project(1)
.where(orders[:user_id].eq(users[:id]))
.where(orders[:created_at].gt(30.days.ago))
.exists
User.where(subquery)
# Equivalent SQL approach
User.where(
'EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.created_at > ?)',
30.days.ago
)
# Find products never ordered
products = Product.arel_table
order_items = OrderItem.arel_table
subquery = order_items
.project(1)
.where(order_items[:product_id].eq(products[:id]))
.exists
Product.where(subquery.not)
Window function analysis combined with CTEs enables complex ranking and partition operations.
# Top N per category with running totals
def top_selling_products_per_category(limit: 5)
sql = <<-SQL
WITH product_sales AS (
SELECT
p.id,
p.name,
p.category_id,
c.name as category_name,
SUM(oi.quantity) as units_sold,
SUM(oi.quantity * oi.unit_price) as revenue
FROM products p
JOIN order_items oi ON p.id = oi.product_id
JOIN categories c ON p.category_id = c.id
GROUP BY p.id, p.name, p.category_id, c.name
),
ranked_products AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY revenue DESC) as rank,
SUM(revenue) OVER (PARTITION BY category_id ORDER BY revenue DESC) as running_total
FROM product_sales
)
SELECT * FROM ranked_products
WHERE rank <= $1
ORDER BY category_name, rank
SQL
Product.connection.exec_query(sql, 'Top Products', [[nil, limit]])
end
Design Considerations
Choosing between subqueries and CTEs depends on readability, reusability, and database optimizer behavior. Subqueries work well for simple, one-time filters or calculations embedded directly in query logic. CTEs excel when the same derived data appears multiple times in a query or when query logic requires multiple logical steps.
Readability improves with CTEs for complex queries. Named result sets make intent explicit and allow top-down reading of query logic. Subqueries embedded in WHERE or JOIN clauses require mental context switching and inside-out reading.
# Subquery approach - harder to follow
Customer.where(
"id IN (SELECT customer_id FROM orders WHERE status = 'completed'
AND created_at > (SELECT MAX(created_at) - INTERVAL '30 days' FROM orders))"
).where(
"total_spent > (SELECT AVG(total_spent) FROM customers
WHERE created_at < (SELECT MIN(created_at) FROM orders WHERE status = 'completed'))"
)
# CTE approach - clearer intent
sql = <<-SQL
WITH recent_cutoff AS (
SELECT MAX(created_at) - INTERVAL '30 days' as cutoff_date FROM orders
),
recent_customers AS (
SELECT DISTINCT customer_id
FROM orders, recent_cutoff
WHERE status = 'completed'
AND orders.created_at > recent_cutoff.cutoff_date
),
avg_spending AS (
SELECT AVG(total_spent) as avg_spent FROM customers
)
SELECT c.*
FROM customers c
WHERE c.id IN (SELECT customer_id FROM recent_customers)
AND c.total_spent > (SELECT avg_spent FROM avg_spending)
SQL
Correlated subqueries introduce performance considerations. The database executes a correlated subquery once per row of the outer query, potentially causing N+1 query problems at the SQL level. Converting to joins or CTEs often improves performance.
# Correlated subquery - executes once per product
Product.where(
'price > (SELECT AVG(price) FROM products p2 WHERE p2.category_id = products.category_id)'
)
# Join alternative - single execution
Product
.joins(
'JOIN (SELECT category_id, AVG(price) as avg_price FROM products GROUP BY category_id) avgs
ON products.category_id = avgs.category_id'
)
.where('products.price > avgs.avg_price')
EXISTS subqueries perform better than IN clauses when checking for row existence. EXISTS stops at the first match, while IN materializes the complete subquery result. For large result sets, this difference becomes significant.
# IN clause - materializes all order IDs
User.where(id: Order.where('created_at > ?', 30.days.ago).select(:user_id))
# EXISTS - stops at first match per user
User.where(
'EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.created_at > ?)',
30.days.ago
)
CTE materialization behavior varies across databases. PostgreSQL materializes CTEs by default in versions before 12, preventing optimizer pushdown. PostgreSQL 12+ added inline expansion optimization. MySQL 8.0+ and recent SQLite versions also optimize CTE execution. Understanding database-specific behavior guides CTE usage decisions.
Recursive CTEs require termination guarantees. Circular references or missing base cases cause infinite loops. Include cycle detection in path tracking and depth limits in recursive members.
# Safe recursive CTE with cycle detection and depth limit
sql = <<-SQL
WITH RECURSIVE paths AS (
SELECT id, parent_id, name, 1 as depth, ARRAY[id] as path
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.parent_id, c.name, p.depth + 1, p.path || c.id
FROM categories c
INNER JOIN paths p ON c.parent_id = p.id
WHERE p.depth < 10 -- Depth limit
AND NOT c.id = ANY(p.path) -- Cycle detection
)
SELECT * FROM paths
SQL
Scalar subqueries must return exactly one row and one column. Multiple rows or NULL results cause errors or unexpected behavior. Use LIMIT 1 or aggregate functions to guarantee single-value results.
# Unsafe scalar subquery - might return multiple rows
Product.select('*, (SELECT name FROM categories WHERE id = category_id) as category_name')
# Safe version with guaranteed single result
Product.select(
'*, (SELECT name FROM categories WHERE categories.id = products.category_id LIMIT 1) as category_name'
)
Performance Considerations
Query execution plans reveal how databases handle subqueries and CTEs. Subqueries may be executed as nested loops, hash joins, or merged into the outer query depending on optimizer decisions. CTEs traditionally materialized results but modern optimizers perform inline expansion when beneficial.
Examining execution plans identifies performance bottlenecks. Both ActiveRecord and Sequel provide methods to retrieve query plans.
# Analyze query performance with EXPLAIN
query = Employee.where('salary > (SELECT AVG(salary) FROM employees)')
puts query.explain
# Sequel query plan analysis
plan = DB[:employees]
.where(Sequel[:salary] > DB[:employees].select(Sequel.function(:avg, :salary)))
.explain
puts plan
# PostgreSQL-specific detailed analysis
ActiveRecord::Base.connection.execute(
"EXPLAIN (ANALYZE, BUFFERS, VERBOSE) #{query.to_sql}"
).each { |row| puts row }
Subquery execution time depends on correlation. Non-correlated subqueries execute once before the outer query. Correlated subqueries execute once per outer row, multiplying execution time by row count. For 10,000 outer rows, a correlated subquery with 10ms execution time takes 100 seconds total.
# Non-correlated - executes once
Product.where('category_id IN (SELECT id FROM categories WHERE active = true)')
# Correlated - executes 10,000 times for 10,000 products
Product.where(
'price > (SELECT AVG(price) FROM products p WHERE p.category_id = products.category_id)'
)
# Optimized with window function - single execution
sql = <<-SQL
SELECT p.*
FROM (
SELECT *,
AVG(price) OVER (PARTITION BY category_id) as cat_avg
FROM products
) p
WHERE p.price > p.cat_avg
SQL
Product.find_by_sql(sql)
CTE materialization forces result storage before main query execution. This prevents index usage on CTE results and duplicates computation if the CTE appears in multiple JOIN branches. Modern PostgreSQL (12+) optimizes simple CTEs through inline expansion, but complex CTEs still materialize.
# Materialized CTE - forces computation before use
sql = <<-SQL
WITH category_products AS (
SELECT * FROM products WHERE category_id = 5
)
SELECT * FROM category_products WHERE price > 100
SQL
# Optimized version - allows index usage
Product.where(category_id: 5, price: 100..)
# Force inline in PostgreSQL 12+
sql = <<-SQL
WITH category_products AS NOT MATERIALIZED (
SELECT * FROM products WHERE category_id = 5
)
SELECT * FROM category_products WHERE price > 100
SQL
Index utilization affects subquery performance dramatically. Subqueries with proper indexes on join columns and WHERE predicates execute orders of magnitude faster than full table scans.
# Add indexes for subquery performance
class AddSubqueryIndexes < ActiveRecord::Migration[7.0]
def change
add_index :orders, [:user_id, :created_at]
add_index :order_items, [:order_id, :product_id]
add_index :products, [:category_id, :price]
end
end
# Query using indexes effectively
User.where(
'EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id
AND orders.created_at > ? AND orders.status = ?)',
30.days.ago, 'completed'
)
Recursive CTE performance depends on hierarchy depth and branching factor. Wide, shallow trees perform better than narrow, deep trees. Each recursion level processes results from the previous level, so depth directly impacts total rows processed.
# Monitor recursive CTE performance
def analyze_hierarchy_performance(root_id)
sql = <<-SQL
WITH RECURSIVE category_tree AS (
SELECT id, parent_id, name, 1 as level
FROM categories WHERE id = $1
UNION ALL
SELECT c.id, c.parent_id, c.name, ct.level + 1
FROM categories c
JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT
level,
COUNT(*) as nodes_at_level,
SUM(COUNT(*)) OVER (ORDER BY level) as cumulative_nodes
FROM category_tree
GROUP BY level
ORDER BY level
SQL
Category.connection.exec_query(sql, 'Hierarchy Stats', [[nil, root_id]])
end
Caching subquery results reduces repeated computation. Application-level caching stores subquery results in Redis or Memcached for reuse across requests.
# Cache expensive subquery results
class Product < ApplicationRecord
def self.active_category_ids
Rails.cache.fetch('active_category_ids', expires_in: 1.hour) do
Category.where(active: true).pluck(:id)
end
end
def self.in_active_categories
where(category_id: active_category_ids)
end
end
# Cache CTE results for complex reports
class SalesMetrics
def self.daily_summary(date)
cache_key = "sales_summary:#{date}"
Rails.cache.fetch(cache_key, expires_in: 24.hours) do
sql = <<-SQL
WITH daily_orders AS (
SELECT id, customer_id, total_amount
FROM orders
WHERE DATE(created_at) = $1
),
customer_stats AS (
SELECT customer_id, COUNT(*) as order_count, SUM(total_amount) as total
FROM daily_orders
GROUP BY customer_id
)
SELECT
COUNT(DISTINCT customer_id) as unique_customers,
SUM(order_count) as total_orders,
SUM(total) as revenue
FROM customer_stats
SQL
Order.connection.exec_query(sql, 'Daily Summary', [[nil, date]]).first
end
end
end
Common Patterns
Filtering with subqueries represents the most common pattern. The WHERE clause uses subqueries to filter rows based on aggregated or derived conditions.
# Find above-average performers
Employee.where('salary > (SELECT AVG(salary) FROM employees)')
# Find products in top-selling categories
Product.where(
'category_id IN (
SELECT category_id FROM products
GROUP BY category_id
HAVING SUM(units_sold) > 1000
)'
)
# Find customers with no recent orders
Customer.where(
'NOT EXISTS (
SELECT 1 FROM orders
WHERE orders.customer_id = customers.id
AND orders.created_at > ?
)', 90.days.ago
)
Derived tables in FROM clauses treat subquery results as temporary tables. This pattern enables aggregation followed by filtering or joining.
# Aggregate then filter - requires derived table
sql = <<-SQL
SELECT ds.*
FROM (
SELECT category_id, AVG(price) as avg_price, COUNT(*) as product_count
FROM products
GROUP BY category_id
) ds
WHERE ds.product_count > 10 AND ds.avg_price > 50
SQL
Product.connection.exec_query(sql)
# Equivalent using CTE for readability
sql = <<-SQL
WITH category_stats AS (
SELECT category_id, AVG(price) as avg_price, COUNT(*) as product_count
FROM products
GROUP BY category_id
)
SELECT * FROM category_stats
WHERE product_count > 10 AND avg_price > 50
SQL
Multiple CTE chaining builds complex logic incrementally. Each CTE performs one logical step, and later CTEs reference earlier results.
# Multi-step analysis using CTE chain
def customer_segmentation_report
sql = <<-SQL
WITH customer_orders AS (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(total_amount) as lifetime_value,
MAX(created_at) as last_order_date
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
),
customer_segments AS (
SELECT
*,
CASE
WHEN lifetime_value > 5000 AND order_count > 10 THEN 'vip'
WHEN lifetime_value > 1000 THEN 'regular'
WHEN order_count > 0 THEN 'occasional'
ELSE 'inactive'
END as segment,
CASE
WHEN last_order_date > NOW() - INTERVAL '30 days' THEN 'active'
WHEN last_order_date > NOW() - INTERVAL '90 days' THEN 'at_risk'
ELSE 'churned'
END as status
FROM customer_orders
)
SELECT
segment,
status,
COUNT(*) as customer_count,
AVG(lifetime_value) as avg_ltv,
AVG(order_count) as avg_orders
FROM customer_segments
GROUP BY segment, status
ORDER BY segment, status
SQL
Order.connection.exec_query(sql)
end
Recursive hierarchy traversal follows a standard pattern: anchor member selects root nodes, recursive member joins children to parents while incrementing depth.
# Standard recursive traversal pattern
def self.hierarchy_from_root(root_id, max_depth: nil)
depth_condition = max_depth ? "AND level < #{max_depth}" : ""
sql = <<-SQL
WITH RECURSIVE tree AS (
-- Anchor: root node
SELECT id, parent_id, name, 1 as level,
CAST(name AS TEXT) as path
FROM #{table_name}
WHERE id = $1
UNION ALL
-- Recursive: children
SELECT t.id, t.parent_id, t.name, tree.level + 1,
tree.path || ' > ' || t.name
FROM #{table_name} t
INNER JOIN tree ON t.parent_id = tree.id
WHERE true #{depth_condition}
)
SELECT * FROM tree ORDER BY path
SQL
find_by_sql([sql, root_id])
end
# Bill of materials explosion
def explode_bom(product_id, quantity = 1)
sql = <<-SQL
WITH RECURSIVE bom AS (
SELECT
id, component_id, quantity_required, 1 as level,
quantity_required * $2 as total_quantity
FROM bill_of_materials
WHERE product_id = $1
UNION ALL
SELECT
b.id, b.component_id, b.quantity_required, bom.level + 1,
b.quantity_required * bom.total_quantity
FROM bill_of_materials b
INNER JOIN bom ON b.product_id = bom.component_id
WHERE bom.level < 10
)
SELECT
component_id,
SUM(total_quantity) as total_needed,
MAX(level) as max_depth
FROM bom
GROUP BY component_id
SQL
connection.exec_query(sql, 'BOM Explosion', [[nil, product_id], [nil, quantity]])
end
Ranking and top-N queries combine window functions with CTEs or subqueries to identify top performers per group.
# Top 3 products per category
sql = <<-SQL
WITH ranked_products AS (
SELECT
p.*,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY units_sold DESC) as rank
FROM products p
)
SELECT * FROM ranked_products
WHERE rank <= 3
ORDER BY category_id, rank
SQL
# Alternative using lateral join (PostgreSQL)
sql = <<-SQL
SELECT c.name as category, p.name, p.units_sold
FROM categories c
CROSS JOIN LATERAL (
SELECT name, units_sold
FROM products
WHERE category_id = c.id
ORDER BY units_sold DESC
LIMIT 3
) p
ORDER BY c.name, p.units_sold DESC
SQL
Gap detection finds missing sequences using recursive CTEs or self-joins with subqueries.
# Find missing invoice numbers
sql = <<-SQL
WITH RECURSIVE number_series AS (
SELECT MIN(invoice_number) as num FROM invoices
UNION ALL
SELECT num + 1 FROM number_series
WHERE num < (SELECT MAX(invoice_number) FROM invoices)
)
SELECT ns.num as missing_number
FROM number_series ns
LEFT JOIN invoices i ON ns.num = i.invoice_number
WHERE i.invoice_number IS NULL
SQL
# Find date gaps in time series data
sql = <<-SQL
WITH RECURSIVE date_series AS (
SELECT DATE(MIN(recorded_at)) as date FROM measurements
UNION ALL
SELECT date + INTERVAL '1 day'
FROM date_series
WHERE date < (SELECT DATE(MAX(recorded_at)) FROM measurements)
)
SELECT ds.date as missing_date
FROM date_series ds
LEFT JOIN measurements m ON DATE(m.recorded_at) = ds.date
WHERE m.id IS NULL
SQL
Reference
Subquery Types
| Type | Returns | Usage Context | Example |
|---|---|---|---|
| Scalar | Single value | Comparison operators, SELECT list | SELECT name WHERE salary > (SELECT AVG(salary) FROM employees) |
| Row | Single row, multiple columns | Row comparisons | WHERE (name, dept) = (SELECT name, dept FROM table WHERE id = 1) |
| Column | Multiple rows, single column | IN, ANY, ALL operators | WHERE id IN (SELECT user_id FROM orders) |
| Table | Multiple rows and columns | FROM clause, EXISTS | FROM (SELECT * FROM products WHERE active) p |
CTE Syntax Patterns
| Pattern | PostgreSQL | MySQL 8.0+ | SQLite 3.8.3+ |
|---|---|---|---|
| Basic CTE | WITH cte AS (SELECT ...) | WITH cte AS (SELECT ...) | WITH cte AS (SELECT ...) |
| Multiple CTEs | WITH c1 AS (...), c2 AS (...) | WITH c1 AS (...), c2 AS (...) | WITH c1 AS (...), c2 AS (...) |
| Recursive CTE | WITH RECURSIVE cte AS (...) | WITH RECURSIVE cte AS (...) | WITH RECURSIVE cte AS (...) |
| Materialization hint | WITH cte AS MATERIALIZED (...) | Not supported | Not supported |
| No materialization | WITH cte AS NOT MATERIALIZED (...) | Not supported | Not supported |
Ruby ORM Methods
| Operation | ActiveRecord | Sequel | Description |
|---|---|---|---|
| Basic subquery | where(column: Model.select(:col)) | where(column: Model.select(:col)) | Use subquery result in WHERE |
| EXISTS check | where('EXISTS (...)') | where(Sequel.function(:exists, subquery)) | Check row existence |
| Scalar subquery | select('(SELECT ...)') | select(Sequel.lit('(SELECT ...)')) | Inline scalar value |
| CTE (Rails 7.1+) | with(name: query) | with(:name, query) | Define CTE |
| Recursive CTE | N/A (use raw SQL) | with_recursive(:name, anchor, recursive) | Define recursive CTE |
| Raw SQL | find_by_sql(sql) | fetch(sql) | Execute custom SQL |
Subquery Operators
| Operator | Purpose | Returns | Example |
|---|---|---|---|
| IN | Value in set | Boolean | WHERE id IN (SELECT user_id FROM orders) |
| NOT IN | Value not in set | Boolean | WHERE id NOT IN (SELECT blocked_id FROM blocks) |
| EXISTS | Row exists | Boolean | WHERE EXISTS (SELECT 1 FROM orders WHERE user_id = users.id) |
| NOT EXISTS | No rows exist | Boolean | WHERE NOT EXISTS (SELECT 1 FROM orders WHERE user_id = users.id) |
| ANY | Compare to any value | Boolean | WHERE price > ANY (SELECT price FROM products WHERE featured) |
| ALL | Compare to all values | Boolean | WHERE price > ALL (SELECT price FROM products WHERE category_id = 5) |
| Comparison | Single value comparison | Boolean | WHERE salary > (SELECT AVG(salary) FROM employees) |
Recursive CTE Components
| Component | Purpose | Required | Example |
|---|---|---|---|
| Anchor member | Initial result set | Yes | SELECT id FROM categories WHERE parent_id IS NULL |
| Recursive member | Self-referencing query | Yes | SELECT c.id FROM categories c JOIN tree ON c.parent_id = tree.id |
| UNION ALL | Combine anchor and recursive | Yes | anchor UNION ALL recursive |
| Termination condition | Prevent infinite loops | Recommended | WHERE depth < 10 AND NOT id = ANY(path) |
| Path tracking | Detect cycles | Optional | ARRAY[id] for anchor, path || id for recursive |
Performance Optimization Checklist
| Technique | Impact | When to Use |
|---|---|---|
| Convert correlated to join | High | Subquery executes per outer row |
| Use EXISTS instead of IN | Medium-High | Checking existence with large result sets |
| Add indexes on join columns | High | Subqueries join to outer query |
| Use LIMIT in scalar subqueries | Low | Subquery might return multiple rows |
| Materialize complex CTEs | Medium | CTE used once, complex computation |
| Inline simple CTEs | Medium | Simple CTE used once, PostgreSQL 12+ |
| Cache subquery results | High | Same subquery in multiple queries |
| Replace recursive with iterative | Varies | Very deep hierarchies, performance critical |
Common Pitfalls and Solutions
| Problem | Symptom | Solution |
|---|---|---|
| N+1 subquery execution | Slow correlated subquery | Convert to JOIN or window function |
| Multiple row scalar subquery | Error or wrong results | Add LIMIT 1 or use aggregate |
| NULL in NOT IN | Unexpected empty results | Use NOT EXISTS or filter NULLs |
| Infinite recursive loop | Query timeout | Add depth limit and cycle detection |
| CTE materialization overhead | Slow despite indexes | Use NOT MATERIALIZED hint or subquery |
| Missing indexes | Full table scans | Add indexes on join and WHERE columns |
| Overly complex nesting | Unreadable queries | Split into CTEs or temp tables |