CrackedRuby CrackedRuby

Overview

Partitioning strategies define methods for dividing large datasets across multiple physical or logical storage units. The primary goal involves distributing data to enable parallel processing, reduce query latency, and manage datasets that exceed single-node capacity. Each partition holds a subset of the total data based on specific criteria that determine data placement.

Partitioning differs from replication. Replication creates identical copies of data across multiple nodes for availability and fault tolerance. Partitioning divides data into distinct, non-overlapping subsets where each record exists in exactly one partition. Many production systems combine both techniques to achieve scalability and high availability.

The concept originated in database systems managing datasets exceeding single-machine capacity. Modern distributed systems implement partitioning at multiple layers: application logic, database engines, message queues, and cache layers. The choice of partitioning strategy affects query patterns, data distribution, rebalancing complexity, and system maintainability.

# Conceptual representation of partitioned data
class PartitionedUserStore
  def initialize(partition_count: 4)
    @partitions = Array.new(partition_count) { [] }
    @partition_count = partition_count
  end
  
  def partition_for(user_id)
    user_id.hash % @partition_count
  end
  
  def store_user(user)
    partition_index = partition_for(user.id)
    @partitions[partition_index] << user
  end
  
  def find_user(user_id)
    partition_index = partition_for(user_id)
    @partitions[partition_index].find { |u| u.id == user_id }
  end
end

Key Principles

Partitioning strategies follow several fundamental principles that determine their effectiveness and operational characteristics.

Data Distribution: Each partitioning strategy uses a specific algorithm to map records to partitions. The mapping function determines which partition stores each record based on one or more attributes. The quality of data distribution directly impacts query performance and resource usage. Uneven distribution creates hotspots where some partitions receive disproportionate traffic.

Partition Key Selection: The partition key determines how data distributes across partitions. Selecting an appropriate key requires understanding access patterns, data cardinality, and query requirements. High-cardinality keys generally produce better distribution than low-cardinality keys. Immutable attributes make better partition keys than mutable ones because changing a partition key requires moving data between partitions.

Query Routing: Applications or middleware must route queries to the correct partition. Single-partition queries access one partition directly, offering optimal performance. Multi-partition queries access multiple partitions and require result aggregation, increasing latency and complexity. The partition key should align with common query predicates to maximize single-partition query opportunities.

Partition Independence: Each partition operates independently, enabling parallel processing and isolated failures. Operations on one partition do not lock or block operations on other partitions. This independence provides the scalability benefits of partitioning but complicates operations requiring cross-partition coordination.

Rebalancing Complexity: Adding or removing partitions requires redistributing data. The complexity of rebalancing varies by strategy. Some strategies support incremental rebalancing with minimal disruption. Others require extensive data movement. Rebalancing involves reading data from existing partitions, applying the new partition mapping, and writing data to new locations while maintaining service availability.

# Partition key selection example
class OrderPartitioner
  # Good: immutable, high cardinality
  def by_order_id(order_id)
    order_id.hash % partition_count
  end
  
  # Poor: mutable, may create hotspots
  def by_status(status)
    status.hash % partition_count
  end
  
  # Good for time-series queries
  def by_created_date(created_at)
    (created_at.to_i / 86400) % partition_count
  end
end

Partition Granularity: The number of partitions affects performance and operational complexity. Too few partitions limit scalability and create larger units of data movement during rebalancing. Too many partitions increase metadata overhead and complicate query routing. Many systems use partition counts that are powers of two to simplify hash calculations and enable efficient partition splitting.

Cross-Partition Operations: Transactions spanning multiple partitions require distributed coordination protocols, increasing latency and reducing availability. Join operations across partitions require fetching data from multiple sources and merging results. Secondary indexes may exist within each partition or as separate partitioned structures, each with different consistency and performance characteristics.

Implementation Approaches

Multiple partitioning strategies exist, each optimized for different access patterns and data characteristics.

Range Partitioning: Range partitioning assigns contiguous ranges of partition key values to each partition. Records with keys falling within a range reside in the corresponding partition. Range partitioning works well for range queries and time-series data where recent data receives more access than historical data.

Range boundaries can be defined statically at system initialization or adjusted dynamically based on data distribution. Static boundaries simplify routing logic but may create uneven distribution as data evolves. Dynamic boundaries adapt to changing data but require coordination during boundary adjustments.

class RangePartitioner
  def initialize(ranges)
    # ranges = [[0, 1000], [1001, 5000], [5001, 10000]]
    @ranges = ranges.sort_by(&:first)
  end
  
  def partition_for(key)
    @ranges.find_index { |min, max| key >= min && key <= max }
  end
  
  def supports_range_query?(min_key, max_key)
    start_partition = partition_for(min_key)
    end_partition = partition_for(max_key)
    start_partition == end_partition
  end
end

Range partitioning enables efficient range queries when the query predicate uses the partition key. A query for records with keys between 100 and 500 can be routed to the single partition containing that range. However, range partitioning risks creating hotspots when recent data receives significantly more traffic than older data.

Hash Partitioning: Hash partitioning applies a hash function to the partition key and uses the result to determine partition placement. The hash function distributes keys uniformly across the partition space, preventing hotspots caused by skewed key distribution or sequential key assignment.

Hash partitioning provides better load distribution than range partitioning for workloads with uniform random access patterns. The deterministic nature of hash functions ensures the same key always maps to the same partition, enabling consistent routing without centralized metadata.

class HashPartitioner
  def initialize(partition_count)
    @partition_count = partition_count
    @hash_function = method(:consistent_hash)
  end
  
  def partition_for(key)
    @hash_function.call(key) % @partition_count
  end
  
  private
  
  def consistent_hash(key)
    # Use a hash function with good distribution properties
    key.to_s.bytes.reduce(0) { |hash, byte| (hash * 31 + byte) & 0x7FFFFFFF }
  end
end

Hash partitioning complicates range queries. A query for keys between 100 and 500 may require accessing all partitions because the hash function distributes sequential keys across different partitions. Applications must either accept multi-partition query overhead or maintain secondary indexes for range queries.

List Partitioning: List partitioning assigns specific partition key values to each partition. Each partition contains an explicit list of values it stores. List partitioning works well when the partition key has known, discrete values with predictable access patterns, such as geographic regions or product categories.

class ListPartitioner
  def initialize(partition_assignments)
    # partition_assignments = { 0 => ['US', 'CA'], 1 => ['UK', 'DE'], 2 => ['JP', 'CN'] }
    @partition_assignments = partition_assignments
    @value_to_partition = build_reverse_mapping(partition_assignments)
  end
  
  def partition_for(key)
    @value_to_partition[key] || raise "Unknown partition key: #{key}"
  end
  
  private
  
  def build_reverse_mapping(assignments)
    assignments.each_with_object({}) do |(partition_id, values), mapping|
      values.each { |value| mapping[value] = partition_id }
    end
  end
end

List partitioning enables fine-grained control over data placement. Different regions or categories can be isolated on separate partitions based on capacity requirements or compliance regulations. Adding new values requires updating partition assignments, which may be complex in distributed systems.

Composite Partitioning: Composite partitioning combines multiple strategies in hierarchy. A common pattern applies range partitioning at the first level and hash partitioning at the second level. This approach balances the benefits of both strategies: range partitioning enables efficient range queries while hash partitioning prevents hotspots within each range partition.

class CompositePartitioner
  def initialize(range_count, hash_count)
    @range_partitioner = RangePartitioner.new(generate_ranges(range_count))
    @hash_partitioner = HashPartitioner.new(hash_count)
  end
  
  def partition_for(range_key, hash_key)
    range_partition = @range_partitioner.partition_for(range_key)
    hash_partition = @hash_partitioner.partition_for(hash_key)
    range_partition * @hash_partitioner.partition_count + hash_partition
  end
  
  def total_partitions
    @range_partitioner.range_count * @hash_partitioner.partition_count
  end
end

Design Considerations

Selecting a partitioning strategy requires evaluating access patterns, data characteristics, and operational requirements.

Access Pattern Analysis: The dominant query patterns determine the optimal partitioning strategy. Workloads with point lookups benefit from hash partitioning because it provides fast, single-partition access for individual keys. Workloads with range queries benefit from range partitioning when queries align with partition boundaries.

Mixed workloads present trade-offs. Hash partitioning optimizes point lookups at the expense of range queries. Range partitioning optimizes range queries but risks hotspots from sequential key access. Composite partitioning attempts to balance both patterns but increases complexity.

Write-heavy workloads require even load distribution to prevent partition overload. Hash partitioning generally distributes writes more uniformly than range partitioning. Read-heavy workloads may tolerate uneven distribution if caching mitigates hotspot effects.

Data Characteristics: The distribution of partition key values affects strategy selection. Uniformly distributed keys work well with hash partitioning. Skewed distributions require careful range boundary selection or list partitioning that isolates high-volume keys. Time-series data naturally fits range partitioning where time serves as the partition key.

Key cardinality impacts partition granularity. Low-cardinality keys limit the number of meaningful partitions. A dataset with only ten distinct partition key values cannot effectively use one hundred partitions. High-cardinality keys enable fine-grained partitioning but may complicate rebalancing.

Rebalancing Requirements: Different strategies vary in rebalancing complexity. Hash partitioning with fixed partition counts enables straightforward rebalancing by rehashing a subset of keys. Range partitioning supports partition splitting by dividing ranges but requires more complex coordination to maintain range boundaries.

The frequency of partition count changes influences strategy selection. Systems that frequently add capacity require strategies with efficient rebalancing. Systems with stable capacity can use simpler strategies. Consistent hashing reduces rebalancing overhead by minimizing the number of keys that move during partition changes.

Cross-Partition Query Costs: Operations requiring data from multiple partitions incur additional latency and coordination overhead. Hash partitioning creates more cross-partition queries for range predicates and joins. Range partitioning reduces cross-partition queries for range predicates but may require multi-partition access for point lookups if query predicates do not include the partition key.

Aggregation queries often require accessing all partitions. The partition strategy affects aggregation performance through factors like partition count and data locality. Fewer, larger partitions reduce coordination overhead but limit parallelism. More, smaller partitions increase parallelism but add coordination costs.

Operational Complexity: Hash partitioning offers simpler operations because partition assignment follows a deterministic function. Range partitioning requires managing partition boundaries and monitoring distribution quality. List partitioning requires maintaining value-to-partition mappings and handling new values.

Monitoring requirements differ by strategy. Hash partitioning primarily monitors overall load distribution. Range partitioning monitors both load distribution and range boundary effectiveness. List partitioning tracks value distribution and identifies values requiring repartitioning.

class PartitioningStrategySelector
  def recommend(characteristics)
    if characteristics[:query_type] == :range_heavy && 
       characteristics[:key_distribution] == :sequential
      { strategy: :range, reason: "Range queries on sequential data" }
    elsif characteristics[:query_type] == :point_lookup && 
          characteristics[:write_pattern] == :random
      { strategy: :hash, reason: "Point lookups with uniform writes" }
    elsif characteristics[:key_type] == :categorical && 
          characteristics[:distinct_values] < 100
      { strategy: :list, reason: "Low-cardinality categorical data" }
    else
      { strategy: :composite, reason: "Mixed access patterns" }
    end
  end
end

Ruby Implementation

Ruby applications interact with partitioned systems through database adapters, ORMs, and custom partitioning logic. Most partitioning occurs at the database layer, with Ruby code handling query routing and partition awareness.

ActiveRecord with Partitioned Tables: PostgreSQL supports native table partitioning. ActiveRecord can query partitioned tables transparently, but explicit partition management requires custom SQL or extensions.

class User < ApplicationRecord
  # Table partitioned by created_at in PostgreSQL
  self.table_name = 'users'
  
  # Partition-aware query methods
  def self.in_partition(start_date, end_date)
    where(created_at: start_date..end_date)
  end
  
  def self.create_partition(year, month)
    partition_name = "users_#{year}_#{month}"
    start_date = Date.new(year, month, 1)
    end_date = start_date.next_month
    
    connection.execute(<<-SQL)
      CREATE TABLE IF NOT EXISTS #{partition_name}
      PARTITION OF users
      FOR VALUES FROM ('#{start_date}') TO ('#{end_date}')
    SQL
  end
end

Application-Level Partitioning: Applications can implement partitioning logic when the database does not support native partitioning or when partitioning spans multiple database instances.

class ShardedUserRepository
  def initialize(connections)
    @connections = connections
    @partition_count = connections.size
  end
  
  def find(user_id)
    shard = shard_for(user_id)
    @connections[shard].exec_params(
      'SELECT * FROM users WHERE id = $1',
      [user_id]
    ).first
  end
  
  def create(user_attributes)
    user_id = user_attributes[:id]
    shard = shard_for(user_id)
    @connections[shard].exec_params(
      'INSERT INTO users (id, name, email) VALUES ($1, $2, $3)',
      [user_id, user_attributes[:name], user_attributes[:email]]
    )
  end
  
  def find_by_range(min_id, max_id)
    # Range query requires querying multiple shards
    shards_needed = (min_id..max_id).map { |id| shard_for(id) }.uniq
    
    results = shards_needed.flat_map do |shard|
      @connections[shard].exec_params(
        'SELECT * FROM users WHERE id BETWEEN $1 AND $2',
        [min_id, max_id]
      )
    end
    
    results.sort_by { |user| user['id'] }
  end
  
  private
  
  def shard_for(user_id)
    user_id.hash.abs % @partition_count
  end
end

Sequel with Sharding Plugin: Sequel provides sharding support through plugins that handle routing and connection management.

require 'sequel'

DB = Sequel.connect('postgres://localhost/main')
DB.extension :pg_array, :pg_json

# Configure shards
DB.extension :server_block
DB.servers = {
  shard_0: { host: 'db1.example.com', database: 'users' },
  shard_1: { host: 'db2.example.com', database: 'users' },
  shard_2: { host: 'db3.example.com', database: 'users' }
}

class User < Sequel::Model
  plugin :sharding
  
  def self.shard_for_id(user_id)
    "shard_#{user_id.hash.abs % 3}".to_sym
  end
  
  def self.[](user_id)
    server(shard_for_id(user_id)).first(id: user_id)
  end
  
  def self.create_on_shard(attributes)
    user_id = attributes[:id]
    server(shard_for_id(user_id)).insert(attributes)
  end
end

Custom Partition Router: For complex partitioning logic, implement a dedicated router class that encapsulates partition selection and query distribution.

class PartitionRouter
  def initialize(partitions)
    @partitions = partitions
    @strategy = HashPartitioner.new(partitions.size)
  end
  
  def execute_on_partition(key, &block)
    partition_index = @strategy.partition_for(key)
    partition = @partitions[partition_index]
    
    partition.with_connection do |conn|
      block.call(conn)
    end
  end
  
  def execute_on_all_partitions(&block)
    results = @partitions.map do |partition|
      partition.with_connection do |conn|
        block.call(conn)
      end
    end
    
    merge_results(results)
  end
  
  def transaction_on_partition(key, &block)
    execute_on_partition(key) do |conn|
      conn.transaction(&block)
    end
  end
  
  private
  
  def merge_results(results)
    # Merge strategy depends on query type
    results.flatten.uniq
  end
end

Performance Considerations

Partitioning strategies directly impact query latency, throughput, and resource usage.

Query Latency: Single-partition queries achieve minimal latency by accessing only the data needed. The partition strategy determines how many queries can be resolved with single-partition access. Hash partitioning optimizes point lookups, providing O(1) partition selection. Range partitioning optimizes range queries when predicates align with partition keys.

Multi-partition queries incur additional latency from parallel execution and result aggregation. The maximum query latency equals the slowest partition's response time plus aggregation overhead. Skewed data distribution creates variance in partition response times, increasing tail latency.

require 'benchmark'

class PartitionPerformanceTester
  def initialize(partitioner, data_store)
    @partitioner = partitioner
    @data_store = data_store
  end
  
  def benchmark_point_lookup(key)
    Benchmark.measure do
      partition = @partitioner.partition_for(key)
      @data_store.fetch(partition, key)
    end
  end
  
  def benchmark_range_query(min_key, max_key)
    Benchmark.measure do
      partitions = (min_key..max_key).map { |k| @partitioner.partition_for(k) }.uniq
      
      results = partitions.map do |partition|
        @data_store.fetch_range(partition, min_key, max_key)
      end
      
      results.flatten.sort
    end
  end
end

Throughput Scaling: Partitioning increases throughput by distributing load across multiple storage units. The theoretical maximum throughput equals the sum of individual partition throughput. Actual throughput depends on load distribution quality and partition independence.

Uneven load distribution limits throughput scaling. If one partition handles 50% of requests, overall throughput cannot exceed twice that partition's capacity regardless of other partition availability. Hash partitioning generally achieves better load distribution than range partitioning for write-heavy workloads.

Hotspot Prevention: Hotspots occur when a small number of partitions receive disproportionate traffic. Sequential key assignment in range-partitioned systems creates hotspots on the partition containing the most recent keys. Hash partitioning prevents sequential hotspots by distributing consecutive keys across partitions.

Application-level caching reduces hotspot impact by serving frequently accessed data from memory. However, caching effectiveness varies with key access patterns. Zipfian distributions where a few keys dominate traffic benefit more from caching than uniform distributions.

Partition Count Selection: The optimal partition count balances parallelism and overhead. More partitions enable finer-grained load distribution and better parallelism but increase metadata overhead and query routing complexity. Fewer partitions reduce overhead but limit scalability.

Many systems use partition counts between 10 and 1000. Very small partition counts (less than 10) limit parallelism. Very large partition counts (more than 1000) increase coordination overhead. The partition count should exceed the number of physical servers to enable flexible data distribution.

Join Performance: Joins across partitioned tables require either co-located data or cross-partition data fetching. Co-location partitions related tables using the same partition key, enabling single-partition joins. Cross-partition joins fetch data from multiple partitions and merge results, increasing latency and network traffic.

class PartitionedJoinOptimizer
  def initialize(left_partitioner, right_partitioner)
    @left_partitioner = left_partitioner
    @right_partitioner = right_partitioner
  end
  
  def colocated?(left_key, right_key)
    @left_partitioner.partition_for(left_key) == 
      @right_partitioner.partition_for(right_key)
  end
  
  def execute_join(left_table, right_table, join_key)
    if colocated_join_possible?(join_key)
      execute_local_joins(left_table, right_table, join_key)
    else
      execute_distributed_join(left_table, right_table, join_key)
    end
  end
  
  private
  
  def colocated_join_possible?(join_key)
    @left_partitioner.class == @right_partitioner.class &&
      @left_partitioner.partition_count == @right_partitioner.partition_count
  end
end

Real-World Applications

Production systems implement partitioning at multiple levels with different strategies for different use cases.

Time-Series Data: Logging platforms and metrics systems partition by time ranges. Each partition contains data for a specific time window, such as one day or one week. Range partitioning enables efficient queries for recent data and simplifies data retention by dropping old partitions.

class TimeSeriesPartitioner
  def initialize(partition_duration_seconds)
    @partition_duration = partition_duration_seconds
  end
  
  def partition_for(timestamp)
    timestamp.to_i / @partition_duration
  end
  
  def partition_name(timestamp)
    partition_id = partition_for(timestamp)
    start_time = Time.at(partition_id * @partition_duration)
    "metrics_#{start_time.strftime('%Y%m%d_%H%M%S')}"
  end
  
  def create_partitions_for_range(start_time, end_time)
    current = start_time
    partitions = []
    
    while current < end_time
      partitions << partition_name(current)
      current += @partition_duration
    end
    
    partitions
  end
  
  def drop_old_partitions(retention_days)
    cutoff = Time.now - (retention_days * 86400)
    partition_for(cutoff)
  end
end

User Data Sharding: Social networks and SaaS platforms partition user data by user ID. Hash partitioning distributes users evenly across shards. Each shard contains all data for a subset of users, enabling single-shard queries for user-specific operations.

User sharding complicates features requiring cross-user operations, such as social graphs or shared content. These systems often maintain separate partitioned structures for relationships and use denormalization to reduce cross-shard queries.

Geographic Distribution: Global applications partition by geographic region to reduce latency and comply with data residency regulations. List partitioning assigns regions to partitions. Users in Europe access European partitions while users in Asia access Asian partitions.

Geographic partitioning requires routing logic that determines partition selection based on user location or explicit region configuration. Applications must handle users moving between regions and features requiring cross-region data access.

class GeographicPartitioner
  REGION_ASSIGNMENTS = {
    'us-east' => 0,
    'us-west' => 1,
    'eu-west' => 2,
    'eu-central' => 3,
    'asia-pacific' => 4
  }
  
  def partition_for(region)
    REGION_ASSIGNMENTS[region] or raise "Unknown region: #{region}"
  end
  
  def nearest_region(client_ip)
    # Use GeoIP lookup or similar
    lookup_region_for_ip(client_ip)
  end
  
  def cross_region_query(regions, query)
    partitions = regions.map { |r| partition_for(r) }
    
    results = partitions.map do |partition|
      execute_query_on_partition(partition, query)
    end
    
    aggregate_cross_region_results(results)
  end
end

Tenant Isolation: Multi-tenant systems partition by tenant ID to provide isolation and predictable performance. Each tenant's data resides in dedicated partitions. This approach simplifies tenant migration and enables per-tenant backups and performance monitoring.

Small tenants may share partitions while large tenants receive dedicated partitions. The partitioning logic accounts for tenant size and growth when assigning partitions.

Common Pitfalls

Partitioning introduces complexity that creates several common failure modes and performance issues.

Partition Key Changes: Changing a record's partition key requires moving data between partitions. Most systems treat partition keys as immutable because data migration is expensive and error-prone. Applications using mutable attributes as partition keys must implement cross-partition updates or redesign their partitioning strategy.

Attempting to update a partition key without proper migration logic creates data inconsistency. The record may exist in the wrong partition, become inaccessible, or create duplicates across partitions.

class PartitionKeyMigrator
  def initialize(old_partitioner, new_partitioner, data_store)
    @old_partitioner = old_partitioner
    @new_partitioner = new_partitioner
    @data_store = data_store
  end
  
  def migrate_record(key, record)
    old_partition = @old_partitioner.partition_for(key)
    new_partition = @new_partitioner.partition_for(key)
    
    return if old_partition == new_partition
    
    # Read from old partition
    data = @data_store.fetch(old_partition, key)
    
    # Write to new partition
    @data_store.store(new_partition, key, data)
    
    # Delete from old partition
    @data_store.delete(old_partition, key)
  rescue StandardError => e
    # Handle migration failure
    log_migration_error(key, e)
    raise
  end
end

Unbalanced Partitions: Poor partition key selection creates uneven data distribution. Some partitions grow larger than others, causing storage imbalance and performance degradation. The overloaded partitions become bottlenecks limiting overall system throughput.

Celebrity users in social networks create hotspots in user-partitioned systems. A small number of users with many followers generate disproportionate traffic. These systems require special handling for high-volume users, such as separate storage or enhanced caching.

Cross-Partition Transaction Complexity: Distributed transactions across partitions require two-phase commit or similar protocols. These protocols increase latency and reduce availability. Transaction failure handling becomes more complex when some partitions commit while others abort.

Applications should minimize cross-partition transactions through careful schema design and partition key selection. Denormalizing data to avoid cross-partition queries trades storage efficiency for operational simplicity.

Partition Count Changes: Changing the number of partitions requires rebalancing data. All keys must be reevaluated against the new partition mapping. For hash-partitioned systems, this requires rehashing all keys. Systems with billions of records require careful planning to rebalance without service disruption.

Consistent hashing reduces rebalancing overhead but adds complexity to the partition selection algorithm. Virtual nodes distribute each physical partition across multiple positions in the hash space, minimizing data movement during rebalancing.

class PartitionRebalancer
  def initialize(old_partition_count, new_partition_count)
    @old_partition_count = old_partition_count
    @new_partition_count = new_partition_count
  end
  
  def keys_to_move(key)
    old_partition = hash_with_count(key, @old_partition_count)
    new_partition = hash_with_count(key, @new_partition_count)
    
    old_partition != new_partition
  end
  
  def rebalance_plan(keys)
    moves = keys.select { |key| keys_to_move(key) }
    
    {
      keys_moving: moves.size,
      keys_staying: keys.size - moves.size,
      percentage_moving: (moves.size.to_f / keys.size * 100).round(2)
    }
  end
  
  private
  
  def hash_with_count(key, partition_count)
    key.hash.abs % partition_count
  end
end

Query Pattern Mismatch: Using hash partitioning for a workload dominated by range queries creates performance problems. Every range query becomes a multi-partition query, increasing latency and resource usage. Similarly, range partitioning for uniformly random access creates unnecessary partition scan overhead.

Monitoring query patterns after deployment reveals mismatches between partitioning strategy and actual usage. Applications may need to maintain multiple indexes or change partitioning strategies based on observed access patterns.

Metadata Synchronization: Partition metadata, such as range boundaries or partition counts, must remain synchronized across all query routers. Stale metadata causes queries to access wrong partitions or fail entirely. Updates to partition metadata require careful coordination to prevent inconsistency.

Configuration management systems or distributed coordination services manage partition metadata. All query nodes must receive metadata updates before partition changes take effect.

Reference

Partitioning Strategy Comparison

Strategy Best For Query Pattern Distribution Quality Rebalancing Complexity
Range Time-series data, range queries Excellent for range predicates Prone to hotspots Moderate - split ranges
Hash Point lookups, uniform writes Poor for range queries Excellent - uniform High - rehash all keys
List Categorical data, known values Good for category queries Depends on value distribution Low - update mappings
Composite Mixed workloads Balanced Very good High - multiple levels

Partition Key Selection Criteria

Characteristic Good Partition Key Poor Partition Key
Cardinality High - millions of values Low - tens of values
Mutability Immutable Frequently updated
Distribution Uniform access pattern Highly skewed
Query Alignment Used in WHERE clauses Rarely in predicates
Size Small - integer or UUID Large - text blobs

Query Type Performance

Query Type Hash Partitioning Range Partitioning List Partitioning
Point lookup by key Single partition - fast May span partitions Single partition - fast
Range query by key All partitions - slow Single partition - fast May span partitions
Equality on non-key All partitions - slow All partitions - slow All partitions - slow
Aggregation All partitions All partitions All partitions
Join on key Colocated if same hash Colocated within range Colocated if same list

Partition Count Guidelines

System Size Recommended Partitions Rationale
Single server 4-16 Enable parallelism without overhead
Small cluster (2-10 servers) 10-100 Multiple partitions per server
Medium cluster (10-50 servers) 100-500 Balance distribution and overhead
Large cluster (50+ servers) 500-1000+ Enable fine-grained distribution

Rebalancing Strategies

Approach Data Movement Service Impact Implementation Complexity
Stop-and-copy All affected keys Full downtime Low
Incremental migration Gradual movement Minimal disruption High
Consistent hashing Minimal - only moved keys No downtime Moderate
Virtual nodes Distributed evenly No downtime High

Ruby Adapter Support

Database Native Partitioning ActiveRecord Support Sequel Support Custom Required
PostgreSQL Yes - declarative Query only Query only Management commands
MySQL Yes - native Query only Query only Management commands
MongoDB Yes - sharding Via driver Via driver Minimal
Redis No N/A N/A Full implementation
Cassandra Yes - built-in Via driver No Minimal

Performance Metrics

Metric Target Indicates Problem When
Partition size variance Within 20% of mean Greater than 50% variance
Cross-partition query ratio Less than 10% Greater than 30%
Hotspot partition load Less than 2x average Greater than 5x average
Rebalancing duration Minutes to hours Days to weeks
Query latency p99 2-3x single partition 10x+ single partition