CrackedRuby

Overview

NoSQL data modeling represents a fundamental shift from relational database design principles. Where relational databases normalize data to eliminate redundancy and maintain consistency through ACID transactions, NoSQL databases optimize for specific access patterns, horizontal scalability, and performance at the cost of strict consistency guarantees.

The data modeling approach varies significantly across NoSQL database types. Document databases like MongoDB store semi-structured JSON-like documents. Key-value stores like Redis organize data as simple key-value pairs. Column-family stores like Cassandra arrange data in wide rows with dynamic columns. Graph databases like Neo4j represent entities and relationships as nodes and edges.

NoSQL data modeling requires a query-first approach. The database schema design derives directly from application query patterns rather than from an idealized representation of domain entities. This inverts the traditional relational modeling process where application queries adapt to the normalized schema.

A document database example demonstrates this approach:

# Relational approach: separate tables
users_table: { id: 1, name: "Alice", email: "alice@example.com" }
orders_table: { id: 101, user_id: 1, total: 50.00 }
items_table: { id: 1001, order_id: 101, product: "Book", price: 50.00 }

# NoSQL document approach: embedded structure
{
  _id: 1,
  name: "Alice",
  email: "alice@example.com",
  orders: [
    {
      order_id: 101,
      total: 50.00,
      items: [
        { product: "Book", price: 50.00 }
      ]
    }
  ]
}

The embedded structure eliminates joins and retrieves all related data in a single query. This improves read performance but introduces challenges for data consistency and updates across multiple documents.

Key Principles

NoSQL data modeling operates on several core principles that differ markedly from relational design.

Denormalization accepts data duplication to optimize read performance. Rather than splitting data across multiple tables and joining them at query time, NoSQL models embed or duplicate data where the application reads it. A product catalog might store the category name directly with each product rather than maintaining a separate categories collection and joining them. This duplication trades storage space and update complexity for query performance.
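As a concrete sketch (field names are illustrative, in the style of the document examples above), a denormalized product document carries its category name so the catalog page reads a single document:

```ruby
# Illustrative document shape: the category name is duplicated onto the
# product, so displaying the product never touches a categories collection.
product = {
  _id: "product_1",
  name: "Espresso Machine",
  category_id: "cat_9",
  category_name: "Kitchen"  # duplicated from the categories collection
}

# The read path uses only this one document:
label = "#{product[:name]} (#{product[:category_name]})"
```

The cost appears on writes: renaming the "Kitchen" category means updating every product that embeds it.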

Schema flexibility allows documents in the same collection to have different structures. A user collection might contain basic user documents alongside documents with additional fields for premium users. The database does not enforce a rigid schema. The application handles variations in document structure. This flexibility supports rapid iteration and heterogeneous data but requires careful validation in application code.

Query-driven design starts with the application's read and write patterns. The data model optimizes for the most frequent and performance-critical queries. An analytics application that aggregates user activity by day might store pre-calculated daily summaries rather than raw events requiring aggregation at query time. The storage structure directly supports the query pattern.
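A minimal pure-Ruby sketch of that rollup, with a hypothetical event shape: raw events are pre-aggregated into the per-day documents the dashboard actually queries, so the read path is a lookup rather than a scan-and-count:

```ruby
# Hypothetical raw events as the application might record them.
events = [
  { user_id: 1, action: "click", at: "2024-01-15" },
  { user_id: 1, action: "view",  at: "2024-01-15" },
  { user_id: 1, action: "click", at: "2024-01-16" }
]

# Roll events up into the shape the dashboard query needs: one summary
# document per day with per-action counts.
daily_summaries = events.group_by { |e| e[:at] }.map do |date, day|
  counts = day.group_by { |e| e[:action] }.transform_values(&:count)
  { date: date, counts: counts }
end
# => [{ date: "2024-01-15", counts: { "click" => 1, "view" => 1 } },
#     { date: "2024-01-16", counts: { "click" => 1 } }]
```

In production this aggregation would run on write or in a background job, with the summaries stored as their own documents.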

Atomic operations occur at the document level in document databases. Updates to a single document complete atomically, but updates across multiple documents lack transaction guarantees. This constraint influences whether related data embeds in a single document (atomic updates) or separates into multiple documents (eventual consistency).

Scalability considerations affect data distribution across cluster nodes. NoSQL databases shard data horizontally across machines. The shard key determines how data distributes. A poorly chosen shard key creates hotspots where certain shards receive disproportionate traffic. An e-commerce application sharding orders by customer_id distributes load evenly if customers place orders uniformly. Sharding by order_date creates hotspots during peak shopping periods when all new orders target the same shard.
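The effect of a hashed key can be sketched in plain Ruby (`shard_for` is a hypothetical helper; real deployments use the database's built-in hashed shard keys rather than application code):

```ruby
require "digest"

# Hashing a monotonically increasing key spreads writes across shards
# instead of concentrating them on whichever shard owns the newest range.
def shard_for(key, shard_count)
  Digest::MD5.hexdigest(key.to_s).to_i(16) % shard_count
end

sequential_ids = (1..8).map { |n| "order_#{n}" }
placements = sequential_ids.map { |id| shard_for(id, 4) }
# placements scatters the sequential ids over shards 0..3, while sharding
# on the raw order number would send all new orders to the same shard
```

The trade-off: hashed keys give even writes but turn range queries (for example, "all orders this week") into scatter-gather operations across every shard.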

Embedding versus referencing represents the central modeling decision. Embedding stores related data within a parent document:

{
  _id: "post_123",
  title: "NoSQL Modeling",
  author: { name: "Bob", email: "bob@example.com" },
  comments: [
    { text: "Great post", user: "Alice" },
    { text: "Thanks", user: "Charlie" }
  ]
}

Referencing stores related data in separate documents:

# Post document
{ _id: "post_123", title: "NoSQL Modeling", author_id: "user_456" }

# Author document
{ _id: "user_456", name: "Bob", email: "bob@example.com" }

# Comment documents
{ _id: "comment_1", post_id: "post_123", text: "Great post", user_id: "user_789" }
{ _id: "comment_2", post_id: "post_123", text: "Thanks", user_id: "user_101" }

Embedding provides faster reads and atomic updates for the entire structure. Referencing allows independent updates and prevents unbounded document growth.

Data access patterns determine the optimal structure. One-to-few relationships (a user with a few addresses) typically embed. One-to-many relationships (a product with thousands of reviews) typically reference. One-to-squillions relationships (a server with millions of log entries) always reference to prevent document size from growing without bound.

Design Considerations

The choice between embedding and referencing depends on several factors that interact in complex ways.

Read-to-write ratio influences the embedding decision. Applications that read data far more often than writing benefit from embedding because reads require fewer queries. A product catalog that displays product details with category information but rarely updates categories benefits from embedding the category name with each product. Applications with frequent writes to related data may prefer referencing to avoid updating multiple embedded copies.

Data duplication tolerance varies by data type. Relatively static data like category names tolerates duplication well. When a category name changes, updating all products that embed it requires multiple document updates but occurs infrequently. Rapidly changing data like stock prices should not duplicate because keeping all copies synchronized becomes expensive and error-prone.

Document size limits constrain embedding strategies. MongoDB limits documents to 16MB. Embedding too much data into a single document eventually hits this limit. A blog post with thousands of comments cannot embed all comments in the post document. The application must reference comments in a separate collection or implement hybrid approaches like embedding recent comments and referencing archived comments.

Query patterns directly shape the model. An application that displays user profiles with their recent activity embeds recent activity in the user document. An application that displays activity timelines across multiple users queries an activities collection. The first access pattern optimizes for user-centric queries, the second for activity-centric queries. The same data requires different models for different query patterns.

Consistency requirements determine transaction boundaries. Applications requiring strict consistency for related data must keep that data within a single document. Financial transactions that update account balances atomically must embed the balance in the account document. Applications tolerating eventual consistency can reference related data across documents and handle temporary inconsistencies.

Update isolation requirements affect whether data embeds or references. Updating embedded data requires reading and writing the entire parent document. Concurrent updates to the same document create conflicts requiring retry logic. Referencing isolates updates to independent documents, reducing contention. A social network where users update their profiles while others comment on their posts benefits from separating profile data from activity data.

Bounded versus unbounded relationships fundamentally constrain the design. A user with at most five shipping addresses can embed all addresses. A product with an unlimited number of reviews cannot embed all reviews. Embedded arrays that grow without bound eventually cause document size problems and performance degradation. The model must reference unbounded collections.

Access pattern frequency prioritizes optimization efforts. The 80/20 rule applies: optimize the model for the 20% of queries that account for 80% of load. Rare administrative queries can use less efficient access patterns. The user-facing query that loads a product detail page must be fast. The administrative query that audits all products with pricing errors can tolerate slower performance.

Common Patterns

Several established patterns address recurring NoSQL data modeling challenges.

Subset pattern embeds a subset of related data for common queries while maintaining the full dataset in a separate collection. A movie document embeds the five most recent reviews for quick display on the movie detail page. A separate reviews collection stores all reviews for the reviews page and analysis.

# Movie document with embedded recent reviews
{
  _id: "movie_123",
  title: "The Matrix",
  recent_reviews: [
    { text: "Amazing movie", rating: 5, date: "2024-01-15" },
    { text: "Classic", rating: 5, date: "2024-01-10" }
  ],
  review_count: 15234
}

# Reviews collection with all reviews
{ _id: "review_1", movie_id: "movie_123", text: "Amazing movie", rating: 5 }
{ _id: "review_2", movie_id: "movie_123", text: "Classic", rating: 5 }
# ... thousands more reviews

Computed pattern stores pre-calculated values to avoid expensive computations at query time. An analytics dashboard displaying user activity statistics embeds daily counts in the user document rather than aggregating raw events on each page load.

{
  _id: "user_456",
  name: "Alice",
  stats: {
    total_posts: 342,
    total_comments: 1284,
    monthly_activity: [
      { month: "2024-01", posts: 28, comments: 105 },
      { month: "2023-12", posts: 31, comments: 98 }
    ]
  }
}

Bucket pattern groups time-series data into documents representing time periods. Instead of storing each sensor reading in a separate document, the application stores hourly buckets containing all readings for that hour. This reduces the number of documents and improves query performance for time-range queries.

{
  _id: "sensor_123_2024_01_15_14",
  sensor_id: "sensor_123",
  date: "2024-01-15",
  hour: 14,
  readings: [
    { minute: 0, temperature: 72.3 },
    { minute: 1, temperature: 72.4 },
    { minute: 2, temperature: 72.3 }
    # ... 60 readings
  ],
  summary: {
    avg_temp: 72.5,
    min_temp: 72.1,
    max_temp: 72.9
  }
}

Extended reference pattern duplicates frequently accessed fields from referenced documents to avoid additional queries. An order document references a user but embeds the user's name and email for display purposes. The application queries the user collection only when full user details are needed.

{
  _id: "order_789",
  user_id: "user_456",
  user_name: "Alice",  # duplicated from user document
  user_email: "alice@example.com",  # duplicated from user document
  total: 99.99,
  items: [...]
}

Attribute pattern handles sparse or heterogeneous attributes by storing them as an array of key-value pairs. Product specifications vary widely by category. Electronics have screen size and battery life. Clothing has size and color. Rather than creating fields for every possible attribute, the model stores attributes as an array.

{
  _id: "product_101",
  name: "Laptop",
  category: "Electronics",
  attributes: [
    { key: "screen_size", value: "15 inches" },
    { key: "ram", value: "16GB" },
    { key: "storage", value: "512GB SSD" }
  ]
}

Polymorphic pattern stores documents with different structures in the same collection using a type field to distinguish them. An events collection stores various event types with type-specific fields.

# Page view event
{ _id: "evt_1", type: "page_view", user_id: "user_123", page: "/home" }

# Purchase event
{ _id: "evt_2", type: "purchase", user_id: "user_123", amount: 50.00, items: [...] }

# Login event
{ _id: "evt_3", type: "login", user_id: "user_123", ip_address: "192.168.1.1" }

Outlier pattern handles documents that would exceed size limits by moving overflow data to separate documents. Most blog posts have a reasonable number of comments that embed in the post document. Posts with thousands of comments store recent comments embedded and older comments in separate overflow documents.

# Normal post
{
  _id: "post_123",
  title: "Regular Post",
  comments: [
    { text: "Comment 1" },
    { text: "Comment 2" }
  ]
}

# Viral post with overflow
{
  _id: "post_999",
  title: "Viral Post",
  has_overflow: true,
  comments: [
    # most recent 100 comments
  ],
  overflow_refs: ["post_999_overflow_1", "post_999_overflow_2"]
}

Ruby Implementation

Ruby applications interact with NoSQL databases through database-specific client libraries and object mapping frameworks.

MongoDB with Mongoid provides an ActiveRecord-like interface for MongoDB:

# Define a User model with embedded addresses
class User
  include Mongoid::Document
  include Mongoid::Timestamps
  
  field :name, type: String
  field :email, type: String
  
  embeds_many :addresses
  has_many :orders
end

class Address
  include Mongoid::Document
  
  field :street, type: String
  field :city, type: String
  field :state, type: String
  field :zip, type: String
  
  embedded_in :user
end

class Order
  include Mongoid::Document
  
  field :total, type: BigDecimal
  field :status, type: String
  
  belongs_to :user
  embeds_many :items
end

class Item
  include Mongoid::Document
  
  field :product_name, type: String
  field :price, type: BigDecimal
  field :quantity, type: Integer
  
  embedded_in :order
end

Creating and querying embedded documents:

# Create user with embedded addresses
user = User.create(
  name: "Alice",
  email: "alice@example.com",
  addresses: [
    { street: "123 Main St", city: "Boston", state: "MA", zip: "02101" },
    { street: "456 Oak Ave", city: "Cambridge", state: "MA", zip: "02139" }
  ]
)

# Query returns user with all embedded addresses in one query
user = User.find_by(email: "alice@example.com")
user.addresses.each { |addr| puts addr.city }

# Create order with referenced user and embedded items
order = Order.create(
  user: user,
  total: 150.00,
  status: "pending",
  items: [
    { product_name: "Book", price: 50.00, quantity: 2 },
    { product_name: "Pen", price: 25.00, quantity: 2 }
  ]
)

Redis with Ohm models data as Redis keys with typed fields:

# Define a User model backed by Redis hashes
class User < Ohm::Model
  attribute :name
  attribute :email
  index :email
  
  set :followers, :User
  list :recent_posts, :Post
  counter :login_count
end

class Post < Ohm::Model
  attribute :title
  attribute :content
  reference :author, :User  # defines author_id and an index on it
end

# Create users and relationships
alice = User.create(name: "Alice", email: "alice@example.com")
bob = User.create(name: "Bob", email: "bob@example.com")

# Add follower (stores set in Redis)
alice.followers.add(bob)

# Create post with reference
post = Post.create(
  title: "NoSQL Patterns",
  content: "Data modeling strategies...",
  author: alice
)

# Add to recent posts list
alice.recent_posts.push(post)

# Increment counter
alice.incr(:login_count)

# Query by index
User.find(email: "alice@example.com").first
Post.find(author_id: alice.id)

Implementing the subset pattern with Mongoid:

class Movie
  include Mongoid::Document
  
  field :title, type: String
  field :review_count, type: Integer, default: 0
  
  embeds_many :recent_reviews, class_name: "ReviewSummary"
  has_many :reviews
end

class ReviewSummary
  include Mongoid::Document
  
  field :text, type: String
  field :rating, type: Integer
  field :posted_at, type: DateTime
  
  embedded_in :movie
end

class Review
  include Mongoid::Document
  
  field :text, type: String
  field :rating, type: Integer
  field :posted_at, type: DateTime
  
  belongs_to :movie
end

# Add review and update subset
def add_review(movie, text, rating)
  # Create full review
  review = movie.reviews.create(
    text: text,
    rating: rating,
    posted_at: Time.now
  )
  
  # Update subset (keep 5 most recent)
  movie.recent_reviews.create(
    text: text,
    rating: rating,
    posted_at: review.posted_at
  )
  
  # Maintain subset size
  if movie.recent_reviews.count > 5
    oldest = movie.recent_reviews.asc(:posted_at).first
    oldest.destroy
  end
  
  movie.inc(review_count: 1)
end

Implementing the computed pattern with background updates:

class User
  include Mongoid::Document
  
  field :name, type: String
  embeds_one :stats, class_name: "UserStats"
end

class UserStats
  include Mongoid::Document
  
  field :total_posts, type: Integer, default: 0
  field :total_comments, type: Integer, default: 0
  field :last_activity, type: DateTime
  
  embedded_in :user
end

class Post
  include Mongoid::Document
  
  field :content, type: String
  field :posted_at, type: DateTime
  
  belongs_to :user
  
  after_create :update_user_stats
  
  def update_user_stats
    user.stats ||= user.build_stats
    user.stats.inc(total_posts: 1)
    user.stats.set(last_activity: posted_at)
  end
end

Implementing the bucket pattern for time-series data:

class SensorReading
  include Mongoid::Document
  
  field :sensor_id, type: String
  field :date, type: Date
  field :hour, type: Integer
  field :readings, type: Array, default: []
  field :summary, type: Hash, default: {}
  
  index({ sensor_id: 1, date: 1, hour: 1 }, { unique: true })
  
  def add_reading(minute, temperature)
    # Reassign rather than mutate in place so the change is tracked and saved
    self.readings = readings + [{ minute: minute, temperature: temperature }]
    update_summary
  end
  
  def update_summary
    temps = readings.map { |r| r[:temperature] }
    self.summary = {
      avg_temp: temps.sum / temps.size.to_f,
      min_temp: temps.min,
      max_temp: temps.max,
      count: temps.size
    }
    save
  end
end

# Record reading in appropriate bucket
def record_temperature(sensor_id, temperature)
  now = Time.now
  bucket = SensorReading.find_or_create_by(
    sensor_id: sensor_id,
    date: now.to_date,
    hour: now.hour
  )
  
  bucket.add_reading(now.min, temperature)
end

# Query bucket for hour range
def readings_for_period(sensor_id, start_time, end_time)
  SensorReading.where(
    sensor_id: sensor_id,
    :date.gte => start_time.to_date,
    :date.lte => end_time.to_date
  ).to_a
end

Practical Examples

E-commerce order system demonstrates embedding versus referencing decisions:

# Product catalog uses extended reference pattern
class Product
  include Mongoid::Document
  
  field :name, type: String
  field :price, type: BigDecimal
  field :category_id, type: String
  field :category_name, type: String  # duplicated for quick access
  
  belongs_to :category
end

class Category
  include Mongoid::Document
  
  field :name, type: String
  has_many :products
end

# Orders embed items but reference users
class Order
  include Mongoid::Document
  
  field :order_number, type: String
  field :total, type: BigDecimal
  field :status, type: String
  field :placed_at, type: DateTime
  
  # Reference the user via belongs_to; duplicate only the display fields
  field :user_name, type: String
  field :user_email, type: String
  
  # Embed order items (bounded collection)
  embeds_many :items, class_name: "OrderItem"
  
  belongs_to :user
end

class OrderItem
  include Mongoid::Document
  
  field :product_id, type: String
  field :product_name, type: String  # snapshot at order time
  field :price, type: BigDecimal     # price at order time
  field :quantity, type: Integer
  
  embedded_in :order
end

# Creating an order snapshots product data
def place_order(user, cart_items)
  order = Order.new(
    order_number: generate_order_number,
    user: user,
    user_name: user.name,
    user_email: user.email,
    status: "pending",
    placed_at: Time.now
  )
  
  cart_items.each do |cart_item|
    product = Product.find(cart_item[:product_id])
    
    order.items.build(
      product_id: product.id.to_s,
      product_name: product.name,
      price: product.price,
      quantity: cart_item[:quantity]
    )
  end
  
  order.total = order.items.sum { |item| item.price * item.quantity }
  order.save
  
  order
end

Social media feed uses denormalization for performance:

class User
  include Mongoid::Document
  
  field :username, type: String
  field :display_name, type: String
  field :avatar_url, type: String
  
  has_many :posts
  has_and_belongs_to_many :following, class_name: "User", inverse_of: :followers
  has_and_belongs_to_many :followers, class_name: "User", inverse_of: :following
end

# Post duplicates user info for feed display
class Post
  include Mongoid::Document
  
  field :content, type: String
  field :posted_at, type: DateTime
  
  # belongs_to :author (below) supplies the author_id foreign key
  
  # Denormalized author data for feed queries
  field :author_username, type: String
  field :author_display_name, type: String
  field :author_avatar_url, type: String
  
  # Embedded comments (subset pattern)
  embeds_many :recent_comments, class_name: "Comment", inverse_of: :post
  field :total_comments, type: Integer, default: 0
  
  belongs_to :author, class_name: "User"
  
  index({ author_id: 1, posted_at: -1 })
end

class Comment
  include Mongoid::Document
  
  field :content, type: String
  field :user_id, type: String
  field :username, type: String  # denormalized
  field :posted_at, type: DateTime
  
  embedded_in :post, touch: true
end

# Feed generation queries denormalized data
def generate_feed(user, limit = 50)
  following_ids = user.following.pluck(:id)
  
  Post.in(author_id: following_ids)
      .desc(:posted_at)
      .limit(limit)
      .to_a
  # Returns posts with embedded author info - no additional queries
end

# Updating user profile propagates changes
def update_user_profile(user, new_display_name, new_avatar_url)
  user.update(
    display_name: new_display_name,
    avatar_url: new_avatar_url
  )
  
  # Update denormalized data in posts (eventual consistency)
  Post.where(author_id: user.id)
      .update_all(
        author_display_name: new_display_name,
        author_avatar_url: new_avatar_url
      )
end

Analytics system uses the bucket pattern for metrics:

class MetricBucket
  include Mongoid::Document
  
  field :metric_name, type: String
  field :date, type: Date
  field :hour, type: Integer
  field :values, type: Array, default: []
  field :aggregates, type: Hash, default: {}
  
  index({ metric_name: 1, date: 1, hour: 1 }, { unique: true })
  
  def record_value(value)
    # Reassign rather than mutate in place, and trim before saving so the
    # bucket stays bounded
    new_values = values + [{ timestamp: Time.now, value: value }]
    new_values.shift while new_values.size > 1000
    self.values = new_values
    update_aggregates
  end
  
  def update_aggregates
    vals = values.map { |v| v[:value] }
    self.aggregates = {
      count: vals.size,
      sum: vals.sum,
      avg: vals.sum / vals.size.to_f,
      min: vals.min,
      max: vals.max
    }
    save
  end
end

# Daily summary uses computed pattern
class DailySummary
  include Mongoid::Document
  
  field :metric_name, type: String
  field :date, type: Date
  field :hourly_averages, type: Array, default: []
  field :daily_stats, type: Hash, default: {}
  
  index({ metric_name: 1, date: 1 }, { unique: true })
end

# Background job computes daily summaries
def compute_daily_summary(metric_name, date)
  buckets = MetricBucket.where(
    metric_name: metric_name,
    date: date
  ).order_by(hour: :asc)
  
  hourly_avgs = buckets.map do |bucket|
    { hour: bucket.hour, avg: bucket.aggregates["avg"] }
  end
  
  all_values = hourly_avgs.map { |h| h[:avg] }
  
  DailySummary.find_or_create_by(
    metric_name: metric_name,
    date: date
  ).update(
    hourly_averages: hourly_avgs,
    daily_stats: {
      avg: all_values.sum / all_values.size.to_f,
      min: all_values.min,
      max: all_values.max
    }
  )
end

Performance Considerations

Document structure directly affects query performance. Embedded documents return in a single query, while referenced documents require multiple queries or joins.
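A back-of-envelope comparison with hypothetical latencies makes the gap concrete: a naive N+1 referenced read issues one query for the parent plus one per related document.

```ruby
# Hypothetical numbers: one embedded read versus a naive N+1 referenced
# read (one query for the post, one per comment).
per_query_ms  = 1.0
comment_count = 20

embedded_read_ms   = per_query_ms                       # single document
referenced_read_ms = per_query_ms * (1 + comment_count) # post + N comments
# referenced_read_ms is 21x the embedded cost in this model
```

A batched `$in` query reduces the referenced case to two round trips, which is still double the embedded cost.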

Index strategy becomes more critical in NoSQL databases because query patterns are predefined. Create indexes for the fields your queries filter, sort, or range-scan on, but avoid indexing fields no query uses, since every index slows writes. MongoDB uses B-tree indexes similar to relational databases. Compound indexes support queries on multiple fields.

class User
  include Mongoid::Document
  
  field :email, type: String
  field :created_at, type: DateTime
  field :last_login, type: DateTime
  
  # Single field indexes
  index({ email: 1 }, { unique: true })
  
  # Compound index for common query
  index({ created_at: 1, last_login: -1 })
end

Document size affects read and write performance. MongoDB reads entire documents from disk. Large documents slow queries even when applications access only a few fields. The 16MB document limit prevents pathological cases, but documents approaching megabytes degrade performance.

Retrieving a 1KB document from memory takes microseconds. Retrieving a 1MB document takes milliseconds. Applications that query documents frequently should keep them small. Move large binary data like images to separate storage. Use the subset pattern to embed only frequently accessed data.

Write amplification occurs when denormalized data requires updating multiple documents. Updating a user's display name requires updating the user document plus all posts, comments, and other documents that denormalize the display name. This trade-off exchanges write performance for read performance.

# Updating denormalized data
def update_username(user, new_username)
  # 1 write to user document
  user.update(username: new_username)
  
  # N writes to posts
  Post.where(author_id: user.id)
      .update_all(author_username: new_username)
  
  # M writes to comments (when comments live in their own collection)
  Comment.where(user_id: user.id)
         .update_all(username: new_username)
end

Query selectivity measures how many documents a query examines versus returns. Highly selective queries examine few documents to return results. Poorly selective queries scan many documents. Indexes improve selectivity by allowing the database to skip non-matching documents.

A query finding users by unique email examines one document:

User.find_by(email: "alice@example.com")  # examines 1 document

A query finding users created in the last month without an index scans all documents:

User.where(:created_at.gte => 1.month.ago)  # scans all documents without index

Embedded array performance degrades as arrays grow. MongoDB rewrites a document on update, so each append to a large embedded array costs work proportional to the document's size (under the legacy MMAPv1 storage engine, growth could also force the document to move to a new location on disk). Frequent additions to large embedded arrays cause performance problems.

Keep embedded arrays bounded. Use the subset pattern to limit embedded collection size. Store unbounded collections in separate documents with references.
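An application-side version of that trim can be sketched in plain Ruby (`append_bounded` is a hypothetical helper; MongoDB can also enforce the bound server-side with `$push` plus `$slice`):

```ruby
# Keep only the newest `limit` entries after each append, so the embedded
# array never grows past its bound.
MAX_RECENT = 5

def append_bounded(recent, comment, limit = MAX_RECENT)
  (recent + [comment]).last(limit)
end

recent = %w[c1 c2 c3 c4 c5]
recent = append_bounded(recent, "c6")
# recent == ["c2", "c3", "c4", "c5", "c6"] -- oldest entry dropped
```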

Connection pooling reduces overhead from establishing database connections. Ruby applications create a pool of persistent connections shared across requests:

# config/mongoid.yml
production:
  clients:
    default:
      database: myapp_production
      hosts:
        - localhost:27017
      options:
        min_pool_size: 5
        max_pool_size: 20
        wait_queue_timeout: 1

Projection retrieves only needed fields from documents, reducing network transfer and memory usage:

# Retrieve only name and email
User.only(:name, :email).where(:created_at.gte => 1.week.ago)

# Exclude large fields
Post.without(:content).desc(:posted_at).limit(50)

Explain plans reveal query performance characteristics:

query = User.where(email: "alice@example.com")
puts query.explain

# Shows:
# - Index usage
# - Documents examined
# - Execution time
# - Query stages

Analyze explain output to identify missing indexes, inefficient queries, and performance bottlenecks.

Reference

Embedding vs Referencing Decision Matrix

| Factor | Embed | Reference |
| --- | --- | --- |
| Relationship cardinality | One-to-few | One-to-many, one-to-squillions |
| Data access pattern | Read together | Read independently |
| Update frequency | Infrequent | Frequent |
| Data duplication | Acceptable | Unacceptable |
| Atomic update requirement | Required | Not required |
| Document size impact | Small increase | No impact |
| Query pattern | Single entity with related data | Multiple entities or complex queries |

Common Data Modeling Patterns

| Pattern | Use Case | Implementation | Trade-offs |
| --- | --- | --- | --- |
| Subset | Unbounded one-to-many with common access to a subset | Embed subset in parent, reference full collection | Duplicates subset data, requires sync |
| Computed | Expensive aggregations needed frequently | Pre-calculate and store results | Requires update logic, may be stale |
| Bucket | High-volume time-series data | Group time periods in documents | Fixed granularity, bucketing logic |
| Extended Reference | Frequently accessed fields from referenced entity | Duplicate key fields in referencing document | Data duplication, sync overhead |
| Attribute | Sparse or heterogeneous attributes | Array of key-value pairs | Loses type safety, indexing limitations |
| Polymorphic | Multiple entity types in one collection | Type field distinguishes documents | Schema variations, complex queries |
| Outlier | Most documents normal, few exceed limits | Separate overflow documents for outliers | Complexity for outlier cases |

MongoDB Document Size Guidelines

| Data Type | Guideline | Reasoning |
| --- | --- | --- |
| Embedded documents | Under 100 per array | Performance degrades with large arrays |
| Document size | Keep under 1MB | Network transfer and memory overhead |
| Text fields | Use GridFS for multi-megabyte content | Avoids document size issues |
| Binary data | Store references to external storage | Documents not optimized for binary |
| Unbounded arrays | Use separate collection | Prevents unlimited growth |
| Frequently updated arrays | Consider separate collection | Reduces write amplification |

Index Types and Usage

| Index Type | Syntax | Use Case | Limitations |
| --- | --- | --- | --- |
| Single field | field: 1 | Queries on one field | Direction matters for sort |
| Compound | field1: 1, field2: -1 | Queries on multiple fields | Field order matters |
| Multikey | array_field: 1 | Queries on array elements | One array per compound index |
| Text | field: "text" | Text search | Language-specific, one per collection |
| Geospatial | location: "2dsphere" | Location queries | GeoJSON format required |
| Hashed | field: "hashed" | Sharding distribution | No range queries |
| TTL | timestamp: 1 with expireAfterSeconds | Automatic document expiration | Only on date fields |

Query Performance Checklist

| Action | Purpose | Implementation |
| --- | --- | --- |
| Create indexes for query fields | Skip document scans | Define indexes in model |
| Use projection | Reduce data transfer | Specify only and without |
| Limit result sets | Control memory usage | Add limit to queries |
| Use covered queries | Avoid document access | Index all queried and returned fields |
| Monitor slow queries | Identify problems | Enable profiling |
| Analyze explain plans | Verify index usage | Call explain on queries |
| Batch writes | Reduce round trips | Use bulk operations |
| Use connection pooling | Reuse connections | Configure pool size |

Document Structure Anti-patterns

| Anti-pattern | Problem | Solution |
| --- | --- | --- |
| Unbounded arrays | Document size grows without limit | Reference separate collection |
| Massive duplication | High storage cost, consistency issues | Reference instead of duplicate |
| Deep nesting | Difficult queries, poor performance | Flatten structure or separate collections |
| Generic keys | Cannot index effectively | Use fixed field names |
| Large binary in documents | Slow queries, high memory | Use GridFS or external storage |
| No indexes | Full collection scans | Create appropriate indexes |
| Over-indexing | Slow writes, wasted storage | Index only queried fields |

Sharding Considerations

| Shard Key Characteristic | Impact | Example |
| --- | --- | --- |
| High cardinality | Even distribution | user_id, UUID |
| Low cardinality | Hotspots | country, category |
| Monotonically increasing | Single-shard writes | timestamp, auto-increment |
| Randomly distributed | Even writes | hashed user_id |
| Query pattern alignment | Targeted queries | Match common query fields |
| Immutable | No chunk migrations | Set at document creation |

Ruby Client Performance Settings

| Setting | Recommended Value | Purpose |
| --- | --- | --- |
| min_pool_size | 5-10 | Maintain ready connections |
| max_pool_size | 10-50 | Limit concurrent connections |
| wait_queue_timeout | 1-5 seconds | Fail fast on pool exhaustion |
| socket_timeout | 5-30 seconds | Detect network issues |
| connect_timeout | 5-10 seconds | Fail fast on unavailable server |
| max_idle_time | 300 seconds | Close idle connections |
| server_selection_timeout | 30 seconds | Replica set failover time |