CrackedRuby - Columnar vs Row Storage

Overview

Database storage engines organize data in two primary patterns: row-based storage and columnar storage. Row-based storage stores complete records contiguously on disk, grouping all attributes of a single entity together. Columnar storage inverts this relationship, storing all values of a single attribute contiguously across multiple records. This architectural decision affects query performance, compression ratios, storage costs, and the types of workloads the database can handle efficiently.

Traditional relational databases like PostgreSQL and MySQL use row-based storage by default. These systems optimize for transactional workloads where applications frequently read and write complete records. A typical e-commerce transaction reads a customer record, updates inventory, and writes an order—all operations that benefit from accessing complete rows quickly.

Analytical databases like Amazon Redshift, Google BigQuery, and Apache Parquet use columnar storage. These systems optimize for analytical queries that aggregate values across millions of records but only access a few columns. A typical analytics query might calculate average purchase amounts by region, reading only two columns from billions of rows.

# Row storage conceptual representation
users_table = [
  { id: 1, name: "Alice", email: "alice@example.com", age: 30 },
  { id: 2, name: "Bob", email: "bob@example.com", age: 25 },
  { id: 3, name: "Carol", email: "carol@example.com", age: 35 }
]
# Stored on disk: [1,Alice,alice@example.com,30][2,Bob,bob@example.com,25][3,Carol,carol@example.com,35]

# Columnar storage conceptual representation
users_columnar = {
  id: [1, 2, 3],
  name: ["Alice", "Bob", "Carol"],
  email: ["alice@example.com", "bob@example.com", "carol@example.com"],
  age: [30, 25, 35]
}
# Stored on disk: id:[1,2,3] name:[Alice,Bob,Carol] email:[alice@...,bob@...,carol@...] age:[30,25,35]

The storage pattern choice impacts more than query speed. Columnar formats achieve superior compression because identical data types stored contiguously compress more efficiently than heterogeneous row data. A column of integers compresses better than alternating integers, strings, and dates within rows. Columnar databases routinely achieve 10x compression ratios on analytical data, reducing storage costs substantially.

Write patterns differ dramatically between formats. Row storage writes complete records efficiently—insert a row, update a row, delete a row. Columnar storage must fragment these operations across multiple column files, making single-record writes expensive. Most columnar systems batch writes and periodically reorganize data to maintain query performance.

Key Principles

Row-based storage organizes data in a record-oriented fashion. Each row contains all columns for a single entity, stored contiguously in memory or on disk. The database engine reads complete rows sequentially, which maps naturally to how applications think about data. When an application requests user ID 42, the system seeks to one location and reads all attributes in a single operation.

Physical storage layout determines access patterns. Row stores place related data close together physically. All attributes of record N sit adjacent to each other, followed by all attributes of record N+1. This layout optimizes for scenarios where the application needs multiple columns from the same row. The disk head performs one seek operation and reads a contiguous block containing the entire record.

Columnar storage inverts this organization. All values for column A appear together, followed by all values for column B. The system stores each column in a separate structure, often in separate files or disk pages. Reading a single complete record requires accessing multiple locations—one per column—but reading a single column across all records requires accessing only one structure.

Compression characteristics differ fundamentally between formats. Row storage compresses heterogeneous data—integers next to strings next to timestamps. General-purpose compression algorithms like LZ4 or Snappy work reasonably well but cannot exploit patterns within specific data types. A compressed row block might achieve 2-3x compression.

Columnar storage compresses homogeneous data. A column of timestamps compresses with delta encoding—store the first timestamp and subsequent deltas. A column of repeated values uses run-length encoding—store value and count. A column with low cardinality uses dictionary encoding—store unique values once and references elsewhere. These type-specific encodings achieve 10-20x compression on analytical data.

# Compression advantage in columnar format
row_data = [
  [1, "pending", "2024-01-15"],
  [2, "pending", "2024-01-15"],
  [3, "pending", "2024-01-15"],
  [4, "completed", "2024-01-16"]
]
# Row compression: compress mixed data types together
# Limited patterns to exploit

columnar_data = {
  id: [1, 2, 3, 4],                                    # Sequential integers: delta encode
  status: ["pending", "pending", "pending", "completed"], # Low cardinality: dictionary encode
  date: ["2024-01-15", "2024-01-15", "2024-01-15", "2024-01-16"] # Repeated values: RLE
}
# Columnar compression exploits patterns within each type

Query execution patterns diverge based on storage format. Analytical queries typically filter on few columns and aggregate across many rows. A query computing monthly revenue by product category reads two columns—amount and category—from millions of rows. Columnar storage reads only those two column files, ignoring unused columns entirely. Row storage reads complete rows, discarding unused columns after reading them from disk.

Transactional queries exhibit opposite characteristics. OLTP workloads read and write complete records frequently. An e-commerce checkout reads customer data (name, address, payment method), updates inventory (quantity, location), and writes an order (items, totals, timestamp). Each operation touches many columns from few rows. Row storage handles these access patterns efficiently with single seeks and writes.

Write amplification affects columnar systems significantly. Inserting a single row in columnar storage requires updating multiple column files. The system must locate the appropriate position in each column structure and insert the new value, potentially triggering reorganization to maintain sort order or compression. Row stores write the complete record to one location, requiring one disk operation.

Most columnar systems batch writes to mitigate this overhead. New data accumulates in a row-oriented write buffer, then periodically flushes to columnar format. This approach trades write latency for write throughput—individual writes are slower due to buffering, but overall throughput increases by amortizing reorganization costs across many rows.

Performance Considerations

Query selectivity determines which storage format performs better. Selectivity measures the ratio of columns accessed to total columns. Low selectivity queries—accessing few columns from many rows—favor columnar storage. High selectivity queries—accessing many columns from few rows—favor row storage.

Consider a table with 100 columns and 10 million rows. An analytical query calculating average salary by department reads two columns from all rows: selectivity is 2%. Columnar storage reads 2GB of data (2 columns × 1GB average per column). Row storage reads 100GB (all columns × all rows). The columnar system processes 50x less data.

Conversely, a transactional query retrieving complete customer records for 100 specific IDs reads all columns from few rows. Row storage performs 100 seeks and reads 100 records contiguously. Columnar storage performs 100 × 100 = 10,000 seeks (100 column files × 100 records), fragmenting access across many locations. Row storage dominates for this workload.

# Performance comparison simulation
class RowStore
  def initialize(rows:, columns:)
    @data = Array.new(rows) { |i| Array.new(columns) { |j| "value_#{i}_#{j}" } }
  end

  def aggregate_column(col_index)
    @data.map { |row| row[col_index] }.sum # Reads all rows, accesses target column
  end

  def fetch_row(row_index)
    @data[row_index] # Single access returns complete row
  end
end

class ColumnStore
  def initialize(rows:, columns:)
    @data = Array.new(columns) { |j| Array.new(rows) { |i| "value_#{i}_#{j}" } }
  end

  def aggregate_column(col_index)
    @data[col_index].sum # Single column access, no wasted I/O
  end

  def fetch_row(row_index)
    @data.map { |col| col[row_index] } # Must access every column file
  end
end

# Analytical query: aggregate one column
row_store = RowStore.new(rows: 1_000_000, columns: 100)
col_store = ColumnStore.new(rows: 1_000_000, columns: 100)

# ColumnStore reads 1M values from one array
# RowStore reads 1M rows × 100 columns = 100M values, keeps 1M

Cache efficiency differs between storage patterns. Modern CPUs cache recently accessed memory in hierarchical levels (L1, L2, L3). Row storage places related data contiguously, improving spatial locality when accessing multiple columns from the same row. One cache line load brings multiple attributes into cache simultaneously.

Columnar storage optimizes for temporal locality within single columns. Reading sequential values from one column keeps the cache warm for that specific access pattern. Switching between columns requires loading different cache lines, but analytical queries typically process one column completely before moving to the next.

Compression affects both storage size and query performance. Compressed data reduces disk I/O—reading compressed blocks from disk transfers less data than uncompressed blocks. The CPU decompression cost trades off against reduced I/O time. On I/O-bound workloads, compression improves performance despite CPU overhead.

Columnar compression enables predicate pushdown optimizations. Many columnar formats store metadata about each column block—minimum value, maximum value, null count. The query executor checks metadata before decompressing blocks. A query filtering for values greater than 1000 skips blocks where maximum value is 500, avoiding decompression entirely.

# Predicate pushdown with columnar metadata
class ColumnBlock
  attr_reader :data, :min_value, :max_value

  def initialize(data)
    @data = data
    @min_value = data.min
    @max_value = data.max
  end

  def matches_predicate?(predicate)
    case predicate[:op]
    when :greater_than
      @max_value > predicate[:value] # Skip block if max <= threshold
    when :less_than
      @min_value < predicate[:value] # Skip block if min >= threshold
    when :equals
      @min_value <= predicate[:value] && @max_value >= predicate[:value]
    end
  end
end

ages = [
  ColumnBlock.new([18, 22, 25, 29]),
  ColumnBlock.new([45, 52, 58, 61]),
  ColumnBlock.new([31, 35, 38, 42])
]

# Query: WHERE age > 50
predicate = { op: :greater_than, value: 50 }
ages.select { |block| block.matches_predicate?(predicate) }
# => [ColumnBlock with ages 45-61]
# Skips first and third blocks without reading data

Update performance degrades significantly in columnar stores. Modifying a single column value requires locating and updating that value within a compressed, encoded column file. The system may need to decompress the block, update the value, recompress the block, and potentially reorganize adjacent blocks if size changes.

Row stores update complete records atomically. The database locates the row, modifies the target column value in place or writes a new version, and commits. Most row stores use versioning systems (MVCC) that append new row versions rather than updating in place, maintaining both read and write performance.

Parallel processing capabilities favor columnar formats for analytical workloads. Column-oriented data partitions naturally—different threads or processes read different column files simultaneously without coordination. A query aggregating three columns spawns three parallel readers, each processing one column file independently.

Row storage parallelizes by partitioning rows across threads. Each thread processes a subset of rows, reading all necessary columns from its partition. This approach works well but requires more coordination when threads need to aggregate results, as each thread produces partial results across all columns.

Design Considerations

Workload characteristics determine the appropriate storage format. Online Transaction Processing (OLTP) workloads perform frequent small reads and writes of complete records. Applications insert new orders, update inventory quantities, and retrieve customer profiles. These operations touch many columns from few rows, making row storage the natural choice.

Online Analytical Processing (OLAP) workloads scan large portions of tables, aggregating values across millions of rows while accessing few columns. Business intelligence queries compute sales by region, customer lifetime value, or inventory turnover. These queries read 2-5 columns from billions of rows, where columnar storage excels.

# Workload pattern analysis
class WorkloadAnalyzer
  def analyze(queries)
    queries.map do |query|
      {
        query: query[:sql],
        selectivity: query[:columns_accessed].size.to_f / query[:total_columns],
        row_count: query[:rows_scanned],
        pattern: classify_pattern(query)
      }
    end
  end

  private

  def classify_pattern(query)
    selectivity = query[:columns_accessed].size.to_f / query[:total_columns]
    row_ratio = query[:rows_scanned].to_f / query[:total_rows]

    if selectivity > 0.5 && row_ratio < 0.01
      :oltp # Many columns, few rows
    elsif selectivity < 0.2 && row_ratio > 0.1
      :olap # Few columns, many rows
    else
      :mixed
    end
  end
end

analyzer = WorkloadAnalyzer.new
queries = [
  { sql: "SELECT * FROM users WHERE id = 42",
    columns_accessed: ["id", "name", "email", "address", "phone"],
    total_columns: 5, rows_scanned: 1, total_rows: 1_000_000 },
  { sql: "SELECT AVG(price) FROM sales WHERE date >= '2024-01'",
    columns_accessed: ["price", "date"],
    total_columns: 20, rows_scanned: 500_000, total_rows: 1_000_000 }
]
results = analyzer.analyze(queries)
# => First query: :oltp pattern (100% selectivity, 0.0001% rows)
# => Second query: :olap pattern (10% selectivity, 50% rows)

Data modification frequency impacts format selection. Tables receiving frequent inserts, updates, or deletes require efficient write patterns. Row storage handles these operations naturally—write complete rows, update in place or append new versions, mark rows as deleted. Write throughput remains high even with continuous modifications.

Columnar systems struggle with frequent modifications. Each write operation updates multiple column files, and deletes create fragmentation requiring periodic compaction. Most columnar databases append changes to separate delta stores, periodically merging deltas into main columnar storage. This architecture introduces latency—recent writes remain in row-oriented delta stores until the merge completes.

Query latency requirements constrain format choice. Interactive applications requiring sub-second response times need predictable, low-latency queries. Row storage provides consistent performance for point queries—lookup by primary key returns results in milliseconds. Columnar systems introduce variability—cache state, compression format, and data organization affect query time.

Analytical workloads tolerate higher latency. Business intelligence queries computing monthly trends or customer segments complete in seconds or minutes. Users accept longer execution times in exchange for scanning larger datasets. Columnar storage trades individual query latency for overall throughput—queries run longer but process more data.

Hybrid approaches combine both formats. Some databases maintain row and columnar representations simultaneously, routing queries to the appropriate format based on access pattern. Amazon Aurora maintains a row-oriented primary store and asynchronously replicates to columnar format for analytics. Applications write to the row store and query from the columnar store, avoiding format-specific tradeoffs.

# Hybrid storage router
class HybridStore
  def initialize
    @row_store = RowStore.new(rows: 1000, columns: 50)
    @column_store = ColumnStore.new(rows: 1000, columns: 50)
  end

  def execute_query(query)
    strategy = select_strategy(query)
    case strategy
    when :row_optimized
      @row_store.execute(query)
    when :column_optimized
      @column_store.execute(query)
    when :hybrid
      combine_results(
        @row_store.execute(query[:row_part]),
        @column_store.execute(query[:column_part])
      )
    end
  end

  private

  def select_strategy(query)
    if query[:columns_accessed] > query[:total_columns] * 0.7
      :row_optimized
    elsif query[:rows_scanned] > query[:total_rows] * 0.1
      :column_optimized
    else
      :hybrid
    end
  end
end

Data freshness requirements affect storage decisions. Real-time applications need immediate visibility of writes. Row storage provides read-after-write consistency—once a transaction commits, subsequent reads see the new data. Columnar systems often introduce replication lag—writes go to row-oriented buffers, then asynchronously convert to columnar format.

Some use cases tolerate eventual consistency. Data warehouse applications accept minutes or hours of replication lag. Overnight batch processes load daily transaction data into columnar format for next-day analytics. The delay between data generation and analytical availability matches business requirements.

Storage costs versus query costs create economic tradeoffs. Columnar compression reduces storage by 10x compared to row storage, significantly lowering storage costs. However, columnar databases often require more powerful hardware for write operations and delta merging. The total cost of ownership includes storage, compute, and operational complexity.

Row storage consumes more disk space but runs efficiently on standard hardware. The operational complexity remains lower—no delta merging, simpler backup procedures, straightforward replication. Organizations with low storage costs and high operational costs may prefer row storage despite higher disk usage.

Ruby Implementation

Ruby applications interact with both row and columnar databases through standard interfaces. The database adapter layer abstracts storage format, presenting consistent APIs regardless of underlying storage. Applications query PostgreSQL (row storage) and Redshift (columnar storage) using similar ActiveRecord or Sequel syntax.

Row-based database access patterns match Ruby's object-oriented nature. ORM frameworks map database rows to Ruby objects, where each object represents one complete record with all attributes accessible as instance variables or hash keys. This mapping aligns naturally with row storage—one database row becomes one Ruby object.

require 'pg'

class UserStore
  def initialize(connection_string)
    @conn = PG.connect(connection_string)
  end

  def find_user(user_id)
    result = @conn.exec_params(
      'SELECT id, name, email, created_at FROM users WHERE id = $1',
      [user_id]
    )
    return nil if result.ntuples.zero?

    row = result[0]
    {
      id: row['id'].to_i,
      name: row['name'],
      email: row['email'],
      created_at: Time.parse(row['created_at'])
    }
  end

  def create_user(name:, email:)
    @conn.exec_params(
      'INSERT INTO users (name, email, created_at) VALUES ($1, $2, $3) RETURNING id',
      [name, email, Time.now]
    )
  end

  def update_user(user_id, attributes)
    set_clause = attributes.keys.map.with_index { |k, i| "#{k} = $#{i + 2}" }.join(', ')
    @conn.exec_params(
      "UPDATE users SET #{set_clause} WHERE id = $1",
      [user_id] + attributes.values
    )
  end
end

# Row storage optimizes these patterns
store = UserStore.new('postgresql://localhost/myapp')
user = store.find_user(42) # Single seek, read complete row
store.update_user(42, name: 'Alice Smith') # Single row update

Columnar database access requires different query patterns. Ruby applications querying analytical databases should minimize column selection and maximize row scanning. The query layer specifies only necessary columns, allowing the columnar engine to read fewer column files.

require 'pg'

class AnalyticsQuery
  def initialize(redshift_connection)
    @conn = redshift_connection
  end

  def monthly_revenue_by_category
    result = @conn.exec(<<~SQL)
      SELECT
        DATE_TRUNC('month', order_date) AS month,
        category,
        SUM(amount) AS revenue
      FROM orders
      WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
      GROUP BY 1, 2
      ORDER BY 1 DESC, 3 DESC
    SQL

    result.map do |row|
      {
        month: Date.parse(row['month']),
        category: row['category'],
        revenue: row['revenue'].to_f
      }
    end
  end

  def customer_segmentation
    # Columnar query pattern: few columns, many rows, heavy aggregation
    result = @conn.exec(<<~SQL)
      SELECT
        customer_id,
        COUNT(*) AS order_count,
        SUM(amount) AS total_spent,
        MAX(order_date) AS last_order_date
      FROM orders
      WHERE order_date >= CURRENT_DATE - INTERVAL '24 months'
      GROUP BY customer_id
      HAVING COUNT(*) >= 5
    SQL

    result.map do |row|
      {
        customer_id: row['customer_id'].to_i,
        order_count: row['order_count'].to_i,
        total_spent: row['total_spent'].to_f,
        last_order_date: Date.parse(row['last_order_date'])
      }
    end
  end
end

Ruby's CSV library demonstrates columnar data transformation. Applications can convert row-oriented CSV files into columnar structures for analysis, or export columnar query results to row-oriented formats for reporting.

require 'csv'

class ColumnarTransformer
  def self.csv_to_columnar(filename)
    columns = {}

    CSV.foreach(filename, headers: true) do |row|
      row.each do |column_name, value|
        columns[column_name] ||= []
        columns[column_name] << value
      end
    end

    columns
  end

  def self.columnar_to_csv(columns, filename)
    return if columns.empty?

    row_count = columns.values.first.size
    headers = columns.keys

    CSV.open(filename, 'w', write_headers: true, headers: headers) do |csv|
      row_count.times do |i|
        row = headers.map { |h| columns[h][i] }
        csv << row
      end
    end
  end
end

# Convert CSV to columnar for analysis
columnar_data = ColumnarTransformer.csv_to_columnar('sales.csv')
# => { "date" => [...], "product" => [...], "amount" => [...] }

# Analyze single column without loading complete rows into memory
total_revenue = columnar_data['amount'].map(&:to_f).sum
# Reads only amount column, ignores other columns

# Export columnar results to CSV
ColumnarTransformer.columnar_to_csv(columnar_data, 'output.csv')

The Parquet gem provides Ruby access to Apache Parquet files, a standard columnar format. Parquet files store metadata describing column encodings, compression types, and statistics. Ruby applications read Parquet files directly or generate them for consumption by other systems.

require 'parquet'

class ParquetHandler
  def write_columnar_data(data, filename)
    schema = infer_schema(data)

    Parquet::Writer.open(filename, schema) do |writer|
      row_count = data.values.first.size
      row_count.times do |i|
        row = data.transform_values { |values| values[i] }
        writer << row
      end
    end
  end

  def read_columnar_projection(filename, columns)
    result = Hash.new { |h, k| h[k] = [] }

    Parquet::Reader.open(filename) do |reader|
      reader.each do |row|
        columns.each do |col|
          result[col] << row[col] if row.key?(col)
        end
      end
    end

    result
  end

  private

  def infer_schema(data)
    schema = {}
    data.each do |column_name, values|
      sample = values.first
      schema[column_name] = case sample
      when Integer then :int64
      when Float then :double
      when String then :binary
      when Date then :int32 # Store as days since epoch
      end
    end
    schema
  end
end

handler = ParquetHandler.new

# Write columnar data to Parquet
sales_data = {
  'order_id' => [1, 2, 3],
  'amount' => [100.50, 250.75, 75.25],
  'date' => [Date.today, Date.today, Date.today]
}
handler.write_columnar_data(sales_data, 'sales.parquet')

# Read specific columns only (columnar projection)
revenue_data = handler.read_columnar_projection('sales.parquet', ['amount'])
# Reads only amount column from file

Ruby applications interfacing with cloud analytics platforms handle columnar data implicitly. Amazon Athena, Google BigQuery, and Snowflake present SQL interfaces while using columnar storage internally. Ruby code submits queries and receives results without managing storage format directly.

require 'aws-sdk-athena'

class AthenaAnalytics
  def initialize
    @client = Aws::Athena::Client.new(region: 'us-east-1')
    @database = 'analytics_db'
    @output_location = 's3://my-athena-results/'
  end

  def execute_columnar_query(sql)
    response = @client.start_query_execution(
      query_string: sql,
      query_execution_context: { database: @database },
      result_configuration: { output_location: @output_location }
    )

    query_execution_id = response.query_execution_id
    wait_for_completion(query_execution_id)

    fetch_results(query_execution_id)
  end

  def aggregate_metrics
    # Athena uses columnar Parquet files in S3
    # Query reads only necessary columns from millions of rows
    sql = <<~SQL
      SELECT
        date_trunc('day', timestamp) AS day,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
      FROM events
      WHERE timestamp >= current_date - interval '30' day
      GROUP BY 1
      ORDER BY 1
    SQL

    execute_columnar_query(sql)
  end

  private

  def wait_for_completion(query_execution_id)
    loop do
      response = @client.get_query_execution(query_execution_id: query_execution_id)
      status = response.query_execution.status.state

      break if ['SUCCEEDED', 'FAILED', 'CANCELLED'].include?(status)
      sleep 1
    end
  end

  def fetch_results(query_execution_id)
    results = []
    next_token = nil

    loop do
      response = @client.get_query_results(
        query_execution_id: query_execution_id,
        next_token: next_token
      )

      response.result_set.rows.each do |row|
        results << row.data.map(&:var_char_value)
      end

      next_token = response.next_token
      break unless next_token
    end

    results
  end
end

Tools & Ecosystem

PostgreSQL dominates row-oriented database deployments. The database engine stores complete rows in heap files, using B-tree indexes for efficient lookups. PostgreSQL extensions like TimescaleDB add time-series optimizations while maintaining row storage fundamentals. Ruby applications use the pg gem for native connectivity.

require 'pg'
require 'sequel'

# Sequel provides ORM layer over PostgreSQL
DB = Sequel.connect('postgres://localhost/production')

class Order < Sequel::Model
  # Sequel assumes row storage: operations work on complete records
  plugin :timestamps
  plugin :validation_helpers

  many_to_one :customer
  one_to_many :line_items
end

# Operations match row storage strengths
order = Order[42] # Retrieves complete row efficiently
order.update(status: 'shipped', shipped_at: Time.now) # Updates single row
new_order = Order.create(customer_id: 100, total: 250.00) # Inserts complete row

Amazon Redshift implements columnar storage for analytics. Based on PostgreSQL, Redshift replaces the storage engine with a columnar format optimized for scanning large tables. The query interface remains SQL, but internal execution differs substantially. Ruby applications connect using the pg gem with Redshift-specific connection strings.

Redshift organizes data into 1MB blocks, each containing values from a single column. The system applies compression automatically—choosing run-length encoding for low-cardinality columns, delta encoding for sequential values, and dictionary encoding for repeated strings. Query performance depends on column encodings and sort keys.

require 'pg'

class RedshiftLoader
  def initialize(connection_string)
    @conn = PG.connect(connection_string)
  end

  def load_from_s3(table:, s3_path:, columns:)
    # Redshift COPY command loads data efficiently into columnar format
    copy_command = <<~SQL
      COPY #{table} (#{columns.join(', ')})
      FROM '#{s3_path}'
      IAM_ROLE 'arn:aws:iam::account:role/RedshiftRole'
      FORMAT AS PARQUET
    SQL

    @conn.exec(copy_command)
  end

  def analyze_table(table)
    # Update column statistics for query optimization
    @conn.exec("ANALYZE #{table}")
  end

  def vacuum_table(table)
    # Reclaim space and resort columnar data
    @conn.exec("VACUUM SORT ONLY #{table}")
  end
end

loader = RedshiftLoader.new('postgresql://redshift-cluster:5439/analytics')
loader.load_from_s3(
  table: 'events',
  s3_path: 's3://data-bucket/events/2024/',
  columns: ['event_id', 'user_id', 'event_type', 'timestamp']
)
loader.analyze_table('events')

Apache Parquet serves as the standard columnar file format. Parquet files contain column chunks with embedded metadata, supporting predicate pushdown and column pruning. Spark, Hive, and Presto read Parquet natively. Ruby applications use the parquet gem or shell out to command-line tools.

DuckDB provides an embedded columnar database similar to SQLite. The database runs in-process without a separate server, querying Parquet files directly. DuckDB excels at analytical queries on local data, making it suitable for data science workflows and batch processing.

require 'duckdb'

class LocalAnalytics
  def initialize
    @db = DuckDB::Database.open(':memory:')
    @conn = @db.connect
  end

  def query_parquet(filename, sql)
    # DuckDB queries Parquet files directly without loading into tables
    full_sql = sql.gsub('SOURCE', filename)
    result = @conn.query(full_sql)

    result.map do |row|
      Hash[result.columns.zip(row)]
    end
  end

  def aggregate_sales(parquet_files)
    # Query across multiple Parquet files
    file_list = parquet_files.map { |f| "'#{f}'" }.join(', ')

    sql = <<~SQL
      SELECT
        product_category,
        SUM(amount) AS total_revenue,
        COUNT(*) AS order_count
      FROM read_parquet([#{file_list}])
      WHERE order_date >= CURRENT_DATE - INTERVAL 90 DAY
      GROUP BY product_category
      ORDER BY total_revenue DESC
    SQL

    @conn.query(sql).to_a
  end
end

analytics = LocalAnalytics.new
results = analytics.aggregate_sales([
  'data/sales_2024_q1.parquet',
  'data/sales_2024_q2.parquet'
])

ClickHouse implements a columnar database optimized for real-time analytics. The system ingests data continuously, stores it in columnar format, and executes queries with sub-second latency. ClickHouse uses specialized column encodings and aggressive compression. Ruby applications interface through HTTP or native TCP protocols.

BigQuery provides managed columnar analytics at scale. Google's service stores data in a proprietary columnar format, automatically partitioning and clustering tables for optimal query performance. Ruby applications use the google-cloud-bigquery gem to submit queries and retrieve results.

require 'google/cloud/bigquery'

class BigQueryAnalytics
  def initialize(project_id:, dataset:)
    @bigquery = Google::Cloud::Bigquery.new(project_id: project_id)
    @dataset = @bigquery.dataset(dataset)
  end

  def scan_large_table(table_name, columns, filter)
    # BigQuery charges by bytes scanned: limit columns to reduce cost
    query = @dataset.query(<<~SQL)
      SELECT #{columns.join(', ')}
      FROM `#{table_name}`
      WHERE #{filter}
    SQL

    query.map { |row| row.to_h }
  end

  def create_partitioned_table(table_name)
    # Partition by date column for efficient filtering
    table = @dataset.create_table(table_name) do |schema|
      schema.timestamp 'timestamp', mode: :required
      schema.string 'user_id'
      schema.float 'amount'
    end

    table.time_partitioning_type = 'DAY'
    table.time_partitioning_field = 'timestamp'
  end
end

ActiveRecord adapters abstract storage differences. The same Ruby code queries both PostgreSQL (row storage) and Redshift (columnar storage). Performance characteristics differ, but the application interface remains consistent. Developers switch databases by changing connection configuration.

Practical Examples

E-commerce order processing demonstrates row storage optimization. The application handles customer checkouts, inventory updates, and order fulfillment—all operations touching complete records. Each transaction reads and writes many columns from few rows.

require 'sequel'

DB = Sequel.connect('postgres://localhost/ecommerce')

class OrderProcessor
  def checkout(customer_id:, items:)
    DB.transaction do
      # Read complete customer record (many columns, one row)
      customer = DB[:customers].where(id: customer_id).first
      raise "Customer not found" unless customer

      # Calculate order total
      total = items.sum { |item| item[:quantity] * item[:price] }

      # Create order record (insert complete row)
      order_id = DB[:orders].insert(
        customer_id: customer_id,
        order_date: Time.now,
        status: 'pending',
        subtotal: total,
        tax: total * 0.08,
        total: total * 1.08,
        shipping_address: customer[:default_address],
        billing_address: customer[:billing_address]
      )

      # Update inventory for each item (multiple row updates)
      items.each do |item|
        rows_updated = DB[:inventory]
          .where(product_id: item[:product_id])
          .where { quantity >= item[:quantity] }
          .update(quantity: Sequel[:quantity] - item[:quantity])

        raise "Insufficient inventory" if rows_updated.zero?

        # Insert line item (complete row)
        DB[:order_items].insert(
          order_id: order_id,
          product_id: item[:product_id],
          quantity: item[:quantity],
          unit_price: item[:price],
          total: item[:quantity] * item[:price]
        )
      end

      order_id
    end
  end

  def get_order_details(order_id)
    # Read complete order record plus related items
    order = DB[:orders].where(id: order_id).first
    items = DB[:order_items].where(order_id: order_id).all

    {
      order: order,
      items: items,
      customer: DB[:customers].where(id: order[:customer_id]).first
    }
  end
end

processor = OrderProcessor.new
order_id = processor.checkout(
  customer_id: 42,
  items: [
    { product_id: 100, quantity: 2, price: 29.99 },
    { product_id: 101, quantity: 1, price: 149.99 }
  ]
)
# Row storage handles this efficiently: complete records, atomic updates

Business intelligence reporting demonstrates columnar storage advantages. Queries aggregate millions of rows, computing metrics across specific dimensions while ignoring most columns.

require 'pg'

class SalesAnalytics
  def initialize(redshift_connection)
    @conn = redshift_connection
  end

  def monthly_sales_report(year:)
    # Scans millions of rows but reads only 4 columns
    result = @conn.exec_params(<<~SQL, [year])
      SELECT
        DATE_TRUNC('month', order_date) AS month,
        product_category,
        SUM(total) AS revenue,
        COUNT(*) AS order_count,
        COUNT(DISTINCT customer_id) AS unique_customers
      FROM orders
      WHERE EXTRACT(YEAR FROM order_date) = $1
      GROUP BY 1, 2
      ORDER BY 1, 3 DESC
    SQL

    result.map do |row|
      {
        month: Date.parse(row['month']),
        category: row['product_category'],
        revenue: row['revenue'].to_f,
        orders: row['order_count'].to_i,
        customers: row['unique_customers'].to_i
      }
    end
  end

  def customer_lifetime_value
    # Reads 3 columns from potentially billions of rows
    result = @conn.exec(<<~SQL)
      SELECT
        customer_id,
        MIN(order_date) AS first_order,
        MAX(order_date) AS last_order,
        COUNT(*) AS total_orders,
        SUM(total) AS lifetime_value
      FROM orders
      GROUP BY customer_id
      HAVING COUNT(*) >= 5
    SQL

    result.map do |row|
      {
        customer_id: row['customer_id'].to_i,
        tenure_days: (Date.parse(row['last_order']) - Date.parse(row['first_order'])).to_i,
        orders: row['total_orders'].to_i,
        ltv: row['lifetime_value'].to_f
      }
    end
  end

  def product_performance_matrix
    # Complex aggregation reading few columns from many rows
    result = @conn.exec(<<~SQL)
      WITH monthly_sales AS (
        SELECT
          product_id,
          DATE_TRUNC('month', order_date) AS month,
          SUM(quantity) AS units_sold,
          SUM(total) AS revenue
        FROM order_items
        JOIN orders USING (order_id)
        WHERE order_date >= CURRENT_DATE - INTERVAL '12 months'
        GROUP BY 1, 2
      )
      SELECT
        product_id,
        AVG(units_sold) AS avg_monthly_units,
        STDDEV(units_sold) AS units_volatility,
        SUM(revenue) AS total_revenue,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_revenue
      FROM monthly_sales
      GROUP BY product_id
      ORDER BY total_revenue DESC
      LIMIT 100
    SQL

    result.to_a
  end
end

analytics = SalesAnalytics.new(PG.connect('postgresql://redshift:5439/warehouse'))

# Columnar storage reads only necessary columns
# Row storage would read complete order records (20+ columns)
report = analytics.monthly_sales_report(year: 2024)
ltv_data = analytics.customer_lifetime_value

Log analysis benefits from columnar storage when analyzing structured logs at scale. Applications write logs with many attributes but queries typically filter and aggregate on few fields.

require 'duckdb'
require 'csv'

class LogAnalyzer
  def initialize(log_directory)
    @db = DuckDB::Database.open(':memory:')
    @conn = @db.connect
    @log_directory = log_directory
  end

  def load_logs
    # Convert CSV logs to Parquet for columnar analysis
    Dir.glob("#{@log_directory}/*.csv").each do |csv_file|
      parquet_file = csv_file.sub('.csv', '.parquet')

      @conn.execute(<<~SQL)
        COPY (SELECT * FROM read_csv_auto('#{csv_file}'))
        TO '#{parquet_file}' (FORMAT PARQUET, COMPRESSION ZSTD)
      SQL
    end
  end

  def error_rate_by_service
    # Read only timestamp, service, and status columns
    result = @conn.query(<<~SQL)
      SELECT
        service_name,
        COUNT(*) AS total_requests,
        SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS errors,
        SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS error_rate
      FROM read_parquet('#{@log_directory}/*.parquet')
      WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL 1 HOUR
      GROUP BY service_name
      ORDER BY error_rate DESC
    SQL

    result.to_a
  end

  def latency_percentiles(service:)
    result = @conn.query(<<~SQL)
      SELECT
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY response_time_ms) AS p50,
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95,
        PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time_ms) AS p99,
        MAX(response_time_ms) AS max
      FROM read_parquet('#{@log_directory}/*.parquet')
      WHERE service_name = '#{service}'
        AND timestamp >= CURRENT_TIMESTAMP - INTERVAL 24 HOUR
    SQL

    result.first.to_h
  end
end

analyzer = LogAnalyzer.new('/var/log/application')
analyzer.load_logs
errors = analyzer.error_rate_by_service
latencies = analyzer.latency_percentiles(service: 'api-gateway')

Time-series data collection uses hybrid storage. Recent data remains in row-oriented tables for fast inserts. Older data moves to columnar format for efficient analytical queries. The application queries both stores and merges results.

require 'sequel'

class TimeSeriesStore
  def initialize
    @pg = Sequel.connect('postgres://localhost/timeseries')
    @redshift = Sequel.connect('postgresql://redshift:5439/analytics')
  end

  def record_metric(metric_name:, value:, tags:)
    # Recent data: fast writes to row storage
    @pg[:metrics].insert(
      metric_name: metric_name,
      value: value,
      tags: Sequel.pg_jsonb(tags),
      timestamp: Time.now
    )
  end

  def query_metric(metric_name:, start_time:, end_time:, aggregation: :avg)
    cutoff = Time.now - 86400 # 24 hours ago

    # Recent data from row storage
    recent_data = @pg[:metrics]
      .where(metric_name: metric_name)
      .where { timestamp >= [start_time, cutoff].max }
      .where { timestamp <= end_time }
      .select(:timestamp, :value)
      .all

    # Historical data from columnar storage
    historical_data = if start_time < cutoff
      @redshift[:metrics_history]
        .where(metric_name: metric_name)
        .where { timestamp >= start_time }
        .where { timestamp < cutoff }
        .select(:timestamp, :value)
        .all
    else
      []
    end

    # Merge and aggregate
    all_data = recent_data + historical_data
    aggregate_values(all_data, aggregation)
  end

  def archive_old_data
    cutoff = Time.now - 86400

    # Move data from row storage to columnar
    old_data = @pg[:metrics].where { timestamp < cutoff }.all

    @redshift[:metrics_history].multi_insert(old_data)
    @pg[:metrics].where { timestamp < cutoff }.delete

    old_data.count
  end

  private

  def aggregate_values(data, aggregation)
    values = data.map { |row| row[:value] }

    case aggregation
    when :avg
      values.sum.to_f / values.size
    when :sum
      values.sum
    when :max
      values.max
    when :min
      values.min
    end
  end
end

store = TimeSeriesStore.new

# Fast writes to row storage
store.record_metric(
  metric_name: 'api.response_time',
  value: 45.2,
  tags: { service: 'api-gateway', region: 'us-east-1' }
)

# Query spans both stores, columnar handles historical aggregation
avg_response = store.query_metric(
  metric_name: 'api.response_time',
  start_time: Time.now - 7 * 86400,
  end_time: Time.now,
  aggregation: :avg
)

# Periodic archival moves data to columnar format
archived_count = store.archive_old_data

Reference

Storage Format Comparison

Aspect	Row Storage	Columnar Storage
Data Organization	Complete records stored together	Column values stored together
Read Pattern	Entire row in single I/O operation	Multiple I/O operations per row
Write Pattern	Single write per row	Multiple writes across columns
Compression Ratio	2-3x typical	10-20x typical
Insert Performance	Excellent single row inserts	Poor single row inserts
Update Performance	Efficient in-place updates	Expensive cross-column updates
Scan Performance	Poor for column subsets	Excellent for column subsets
Point Query	Fast for complete records	Slow for complete records
Analytical Query	Inefficient column scanning	Optimized column scanning
OLTP Suitability	Excellent	Poor
OLAP Suitability	Poor	Excellent

Query Selectivity Guidelines

Columns Accessed	Rows Scanned	Recommended Format	Reasoning
Many (>50%)	Few (<1%)	Row	Reading complete records efficiently
Few (<20%)	Many (>10%)	Columnar	Column subset with massive scans
Many (>50%)	Many (>10%)	Row or Hybrid	Full table scan of all data
Few (<20%)	Few (<1%)	Row	Point queries benefit from row layout
Many (>50%)	Medium (1-10%)	Row	Record-oriented access pattern
Few (<20%)	Medium (1-10%)	Columnar	Analytical aggregation pattern

Compression Techniques

Technique	Best For	Compression Ratio	Columnar Advantage
Dictionary Encoding	Low cardinality columns	5-10x	Entire column shares dictionary
Run Length Encoding	Repeated values	10-100x	Long runs in sorted columns
Delta Encoding	Sequential values	5-20x	Timestamps or IDs in column
Bit Packing	Small integer ranges	2-8x	Uniform type per column
LZ4 Compression	General purpose	2-3x	Works on any format
Snappy Compression	Fast general purpose	1.5-2x	Works on any format

Database System Characteristics

System	Storage Format	Primary Use Case	Ruby Gem	Write Throughput
PostgreSQL	Row	OLTP transactions	pg	Excellent
MySQL	Row	OLTP transactions	mysql2	Excellent
Amazon Redshift	Columnar	Data warehouse analytics	pg	Moderate
Google BigQuery	Columnar	Managed analytics	google-cloud-bigquery	Good
ClickHouse	Columnar	Real-time analytics	clickhouse	Excellent
DuckDB	Columnar	Embedded analytics	duckdb	Good
Apache Parquet	Columnar file	Data interchange	parquet	Batch only
Snowflake	Columnar	Cloud data warehouse	odbc	Good

Workload Classification

Pattern	Column Access	Row Access	Updates	Queries	Best Format
OLTP	High (80%+)	Low (<1%)	Frequent	Simple	Row
OLAP	Low (<20%)	High (50%+)	Rare	Complex aggregations	Columnar
Hybrid Transaction	Medium (40-60%)	Low (<5%)	Moderate	Mixed	Row or Hybrid
Hybrid Analytics	Low-Medium (20-40%)	Medium (10-50%)	Moderate	Mixed	Hybrid or Columnar
Append-Only Logs	Few (2-5)	Very High (100%)	None	Time-based aggregations	Columnar
User Profile Lookup	High (90%+)	Very Low (<0.1%)	Moderate	Primary key lookups	Row

Ruby Adapter Methods

Operation	Row-Optimized Pattern	Columnar-Optimized Pattern
Single Record Fetch	Model.find(id)	Avoid, use batch queries
Batch Fetch	Model.where(id: ids).all	Acceptable with few columns
Aggregation	Scan entire table	SELECT few columns with GROUP BY
Insert Single	Model.create(attrs)	Batch into buffer
Insert Batch	Model.import(records)	Optimal, load in batch
Update Single	Model.update(id, attrs)	Expensive, consider append
Update Batch	Update individual rows	Append new versions
Column Subset	Wasteful, reads all columns	SELECT specific columns only

Performance Tuning Guidelines

Scenario	Row Storage Optimization	Columnar Storage Optimization
Point Queries	Add B-tree indexes	Avoid, use row storage
Range Scans	Index on filter columns	Partition by common filters
Aggregations	Pre-compute in tables	Use column statistics
Joins	Index foreign keys	Broadcast small tables
Sorting	Sort on indexed columns	Define sort keys
Filtering	Compound indexes	Predicate pushdown
Updates	MVCC for concurrency	Batch updates, vacuum
Inserts	Single row commits	Batch load thousands
Compression	Enable page compression	Choose encoding per column
Partitioning	Range partition large tables	Partition by date or key column

Columnar vs Row Storage