Overview
NoSQL data modeling represents a fundamental shift from relational database design principles. Where relational databases normalize data to eliminate redundancy and maintain consistency through ACID transactions, NoSQL databases optimize for specific access patterns, horizontal scalability, and performance, typically relaxing the strict consistency guarantees relational systems provide.
The data modeling approach varies significantly across NoSQL database types. Document databases like MongoDB store semi-structured JSON-like documents. Key-value stores like Redis organize data as simple key-value pairs. Column-family stores like Cassandra arrange data in wide rows with dynamic columns. Graph databases like Neo4j represent entities and relationships as nodes and edges.
NoSQL data modeling requires a query-first approach. The database schema design derives directly from application query patterns rather than from an idealized representation of domain entities. This inverts the traditional relational modeling process where application queries adapt to the normalized schema.
A document database example demonstrates this approach:
# Relational approach: separate tables
users_table: { id: 1, name: "Alice", email: "alice@example.com" }
orders_table: { id: 101, user_id: 1, total: 50.00 }
items_table: { id: 1001, order_id: 101, product: "Book", price: 50.00 }
# NoSQL document approach: embedded structure
{
_id: 1,
name: "Alice",
email: "alice@example.com",
orders: [
{
order_id: 101,
total: 50.00,
items: [
{ product: "Book", price: 50.00 }
]
}
]
}
The embedded structure eliminates joins and retrieves all related data in a single query. This improves read performance but introduces challenges for data consistency and updates across multiple documents.
Key Principles
NoSQL data modeling operates on several core principles that differ markedly from relational design.
Denormalization accepts data duplication to optimize read performance. Rather than splitting data across multiple tables and joining them at query time, NoSQL models embed or duplicate data where the application reads it. A product catalog might store the category name directly with each product rather than maintaining a separate categories collection and joining them. This duplication trades storage space and update complexity for query performance.
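The trade-off can be sketched in plain Ruby with hypothetical product hashes (not tied to any particular database): reads become self-contained, but a category rename becomes a multi-record update.

```ruby
# Hypothetical denormalized product records: each product carries a copy
# of its category's name so reads need no join or second lookup.
products = [
  { id: 1, name: "Laptop",  category_id: 10, category_name: "Electronics" },
  { id: 2, name: "Monitor", category_id: 10, category_name: "Electronics" },
  { id: 3, name: "Desk",    category_id: 20, category_name: "Furniture" }
]

# Read path: one record contains everything the page needs.
laptop = products.find { |p| p[:id] == 1 }
laptop[:category_name]  # => "Electronics"

# Write path: renaming a category must touch every duplicated copy.
def rename_category(products, category_id, new_name)
  products.each do |p|
    p[:category_name] = new_name if p[:category_id] == category_id
  end
end

rename_category(products, 10, "Computing")
```

The single-record read is what the duplication buys; the loop in `rename_category` is what it costs.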
Schema flexibility allows documents in the same collection to have different structures. A user collection might contain basic user documents alongside documents with additional fields for premium users. The database does not enforce a rigid schema. The application handles variations in document structure. This flexibility supports rapid iteration and heterogeneous data but requires careful validation in application code.
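A minimal sketch of handling that variation in application code, using hypothetical user hashes and a made-up `plan` field:

```ruby
# Hypothetical mixed user documents in one collection: the premium user
# carries extra fields the basic document lacks.
users = [
  { name: "Alice", email: "alice@example.com" },
  { name: "Bob",   email: "bob@example.com", plan: "premium", renews_on: "2024-06-01" }
]

# The application, not the database, absorbs the structural variation.
def plan_for(user)
  user.fetch(:plan, "basic")
end

users.map { |u| plan_for(u) }  # => ["basic", "premium"]
```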
Query-driven design starts with the application's read and write patterns. The data model optimizes for the most frequent and performance-critical queries. An analytics application that aggregates user activity by day might store pre-calculated daily summaries rather than raw events requiring aggregation at query time. The storage structure directly supports the query pattern.
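The pre-aggregation idea can be shown with a plain-Ruby sketch (hypothetical event hashes): the summary is computed once at write or batch time, so the dashboard's hot query becomes a key lookup.

```ruby
require "date"

# Hypothetical raw activity events (what a normalized model would store).
events = [
  { user: "alice", at: Date.new(2024, 1, 15) },
  { user: "alice", at: Date.new(2024, 1, 15) },
  { user: "alice", at: Date.new(2024, 1, 16) }
]

# Query-driven alternative: pre-compute the per-user daily counts the
# dashboard reads, instead of aggregating raw events on every page load.
daily_counts = events.group_by { |e| [e[:user], e[:at]] }
                     .transform_values(&:size)

daily_counts[["alice", Date.new(2024, 1, 15)]]  # => 2
```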
Atomic operations occur at the document level in document databases. Updates to a single document complete atomically; updates spanning multiple documents historically lacked transaction guarantees (MongoDB added multi-document transactions in version 4.0, but they carry extra overhead and most designs avoid depending on them). This constraint influences whether related data embeds in a single document (atomic updates) or separates into multiple documents (eventual consistency).
Scalability considerations affect data distribution across cluster nodes. NoSQL databases shard data horizontally across machines, and the shard key determines how data distributes. A poorly chosen shard key creates hotspots where certain shards receive disproportionate traffic. An e-commerce application sharding orders by customer_id distributes load evenly when orders spread across many customers. Sharding by order_date routes every new order to the shard holding the latest date range, concentrating all write traffic on a single shard.
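The hotspot effect can be simulated in plain Ruby (the shard count, range boundaries, and hashing scheme below are illustrative assumptions, not any database's actual routing):

```ruby
require "digest"

SHARDS = 4

# Routing by a monotonically increasing key (order id standing in for a
# timestamp): consecutive writes fall in the same range, so one shard
# absorbs all new traffic. Hypothetical range boundary every 1000 keys.
def range_shard(sortable_key)
  sortable_key / 1000 % SHARDS
end

# Routing by a hash of the key spreads writes evenly across shards.
def hashed_shard(key)
  Digest::MD5.hexdigest(key.to_s).to_i(16) % SHARDS
end

new_orders = (1..100).to_a
range_hits  = new_orders.map { |id| range_shard(id) }.tally
hashed_hits = new_orders.map { |id| hashed_shard(id) }.tally

range_hits  # => {0=>100}  all 100 writes land on one shard
# hashed_hits spreads the same 100 writes across multiple shards
```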
Embedding versus referencing represents the central modeling decision. Embedding stores related data within a parent document:
{
_id: "post_123",
title: "NoSQL Modeling",
author: { name: "Bob", email: "bob@example.com" },
comments: [
{ text: "Great post", user: "Alice" },
{ text: "Thanks", user: "Charlie" }
]
}
Referencing stores related data in separate documents:
# Post document
{ _id: "post_123", title: "NoSQL Modeling", author_id: "user_456" }
# Author document
{ _id: "user_456", name: "Bob", email: "bob@example.com" }
# Comment documents
{ _id: "comment_1", post_id: "post_123", text: "Great post", user_id: "user_789" }
{ _id: "comment_2", post_id: "post_123", text: "Thanks", user_id: "user_101" }
Embedding provides faster reads and atomic updates for the entire structure. Referencing allows independent updates and prevents unbounded document growth.
Data access patterns determine the optimal structure. One-to-few relationships (a user with a few addresses) typically embed. One-to-many relationships (a product with thousands of reviews) typically reference. One-to-squillions relationships (a server with millions of log entries) always reference to prevent document size from growing without bound.
Design Considerations
The choice between embedding and referencing depends on several factors that interact in complex ways.
Read-to-write ratio influences the embedding decision. Applications that read data far more often than writing benefit from embedding because reads require fewer queries. A product catalog that displays product details with category information but rarely updates categories benefits from embedding the category name with each product. Applications with frequent writes to related data may prefer referencing to avoid updating multiple embedded copies.
Data duplication tolerance varies by data type. Relatively static data like category names tolerates duplication well. When a category name changes, updating all products that embed it requires multiple document updates but occurs infrequently. Rapidly changing data like stock prices should not duplicate because keeping all copies synchronized becomes expensive and error-prone.
Document size limits constrain embedding strategies. MongoDB limits documents to 16MB. Embedding too much data into a single document eventually hits this limit. A blog post with thousands of comments cannot embed all comments in the post document. The application must reference comments in a separate collection or implement hybrid approaches like embedding recent comments and referencing archived comments.
Query patterns directly shape the model. An application that displays user profiles with their recent activity embeds recent activity in the user document. An application that displays activity timelines across multiple users queries an activities collection. The first access pattern optimizes for user-centric queries, the second for activity-centric queries. The same data requires different models for different query patterns.
Consistency requirements determine transaction boundaries. Applications requiring strict consistency for related data must keep that data within a single document. Financial transactions that update account balances atomically must embed the balance in the account document. Applications tolerating eventual consistency can reference related data across documents and handle temporary inconsistencies.
Update isolation requirements affect whether data embeds or references. Updating embedded data requires reading and writing the entire parent document. Concurrent updates to the same document create conflicts requiring retry logic. Referencing isolates updates to independent documents, reducing contention. A social network where users update their profiles while others comment on their posts benefits from separating profile data from activity data.
Bounded versus unbounded relationships fundamentally constrain the design. A user with at most five shipping addresses can embed all addresses. A product with an unlimited number of reviews cannot embed all reviews. Embedded arrays that grow without bound eventually cause document size problems and performance degradation. The model must reference unbounded collections.
Access pattern frequency prioritizes optimization efforts. The 80/20 rule applies: optimize the model for the 20% of queries that account for 80% of load. Rare administrative queries can use less efficient access patterns. The user-facing query that loads a product detail page must be fast. The administrative query that audits all products with pricing errors can tolerate slower performance.
Common Patterns
Several established patterns address recurring NoSQL data modeling challenges.
Subset pattern embeds a subset of related data for common queries while maintaining the full dataset in a separate collection. A movie document embeds the five most recent reviews for quick display on the movie detail page. A separate reviews collection stores all reviews for the reviews page and analysis.
# Movie document with embedded recent reviews
{
_id: "movie_123",
title: "The Matrix",
recent_reviews: [
{ text: "Amazing movie", rating: 5, date: "2024-01-15" },
{ text: "Classic", rating: 5, date: "2024-01-10" }
],
review_count: 15234
}
# Reviews collection with all reviews
{ _id: "review_1", movie_id: "movie_123", text: "Amazing movie", rating: 5 }
{ _id: "review_2", movie_id: "movie_123", text: "Classic", rating: 5 }
# ... thousands more reviews
Computed pattern stores pre-calculated values to avoid expensive computations at query time. An analytics dashboard displaying user activity statistics embeds daily counts in the user document rather than aggregating raw events on each page load.
{
_id: "user_456",
name: "Alice",
stats: {
total_posts: 342,
total_comments: 1284,
monthly_activity: [
{ month: "2024-01", posts: 28, comments: 105 },
{ month: "2023-12", posts: 31, comments: 98 }
]
}
}
Bucket pattern groups time-series data into documents representing time periods. Instead of storing each sensor reading in a separate document, the application stores hourly buckets containing all readings for that hour. This reduces the number of documents and improves query performance for time-range queries.
{
_id: "sensor_123_2024_01_15_14",
sensor_id: "sensor_123",
date: "2024-01-15",
hour: 14,
readings: [
{ minute: 0, temperature: 72.3 },
{ minute: 1, temperature: 72.4 },
{ minute: 2, temperature: 72.3 }
# ... 60 readings
],
summary: {
avg_temp: 72.5,
min_temp: 72.1,
max_temp: 72.9
}
}
Extended reference pattern duplicates frequently accessed fields from referenced documents to avoid additional queries. An order document references a user but embeds the user's name and email for display purposes. The application queries the user collection only when full user details are needed.
{
_id: "order_789",
user_id: "user_456",
user_name: "Alice", # duplicated from user document
user_email: "alice@example.com", # duplicated from user document
total: 99.99,
items: [...]
}
Attribute pattern handles sparse or heterogeneous attributes by storing them as an array of key-value pairs. Product specifications vary widely by category. Electronics have screen size and battery life. Clothing has size and color. Rather than creating fields for every possible attribute, the model stores attributes as an array.
{
_id: "product_101",
name: "Laptop",
category: "Electronics",
attributes: [
{ key: "screen_size", value: "15 inches" },
{ key: "ram", value: "16GB" },
{ key: "storage", value: "512GB SSD" }
]
}
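A plain-Ruby sketch of querying this shape (hypothetical product hashes): one predicate shape covers every attribute, which is the same property a database index on `attributes.key` and `attributes.value` exploits.

```ruby
# Hypothetical products storing heterogeneous specs as key-value pairs,
# mirroring the attribute-pattern document above.
products = [
  { name: "Laptop", attributes: [
      { key: "screen_size", value: "15 inches" },
      { key: "ram",         value: "16GB" } ] },
  { name: "Phone",  attributes: [
      { key: "screen_size", value: "6 inches" },
      { key: "ram",         value: "8GB" } ] }
]

# One query shape works for any attribute, known or not at design time.
def with_attribute(products, key, value)
  products.select do |p|
    p[:attributes].any? { |a| a[:key] == key && a[:value] == value }
  end
end

with_attribute(products, "ram", "16GB").map { |p| p[:name] }  # => ["Laptop"]
```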
Polymorphic pattern stores documents with different structures in the same collection using a type field to distinguish them. An events collection stores various event types with type-specific fields.
# Page view event
{ _id: "evt_1", type: "page_view", user_id: "user_123", page: "/home" }
# Purchase event
{ _id: "evt_2", type: "purchase", user_id: "user_123", amount: 50.00, items: [...] }
# Login event
{ _id: "evt_3", type: "login", user_id: "user_123", ip_address: "192.168.1.1" }
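Reading polymorphic documents means dispatching on the type field. A minimal plain-Ruby sketch (hypothetical event hashes matching the examples above, with a fallback for unknown types):

```ruby
# Hypothetical heterogeneous events in one collection; the type field
# tells the reader which fields to expect.
events = [
  { type: "page_view", user_id: "user_123", page: "/home" },
  { type: "purchase",  user_id: "user_123", amount: 50.00 },
  { type: "login",     user_id: "user_123", ip_address: "192.168.1.1" }
]

# Dispatch on type; unknown types get a safe fallback rather than a crash.
def describe(event)
  case event[:type]
  when "page_view" then "viewed #{event[:page]}"
  when "purchase"  then "spent $#{format('%.2f', event[:amount])}"
  when "login"     then "logged in from #{event[:ip_address]}"
  else "unknown event"
  end
end

events.map { |e| describe(e) }
# => ["viewed /home", "spent $50.00", "logged in from 192.168.1.1"]
```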
Outlier pattern handles documents that would exceed size limits by moving overflow data to separate documents. Most blog posts have a reasonable number of comments that embed in the post document. Posts with thousands of comments store recent comments embedded and older comments in separate overflow documents.
# Normal post
{
_id: "post_123",
title: "Regular Post",
comments: [
{ text: "Comment 1" },
{ text: "Comment 2" }
]
}
# Viral post with overflow
{
_id: "post_999",
title: "Viral Post",
has_overflow: true,
comments: [
# most recent 100 comments
],
overflow_refs: ["post_999_overflow_1", "post_999_overflow_2"]
}
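The overflow bookkeeping can be sketched in plain Ruby (field names mirror the example above; the cap value and `overflow_store` hash are illustrative assumptions):

```ruby
EMBED_LIMIT = 100  # assumed cap on embedded comments per post

# Outlier-pattern append: normal posts just embed; a post exceeding the
# cap spills its oldest comments into a separate overflow document.
def add_comment(post, overflow_store, comment)
  post[:comments] << comment
  return post unless post[:comments].size > EMBED_LIMIT

  post[:has_overflow] = true
  post[:overflow_refs] ||= []
  overflow_id = "#{post[:_id]}_overflow_#{post[:overflow_refs].size + 1}"
  # Move the oldest comments out, keeping the newest EMBED_LIMIT embedded.
  overflow_store[overflow_id] = post[:comments].shift(post[:comments].size - EMBED_LIMIT)
  post[:overflow_refs] << overflow_id
  post
end

post  = { _id: "post_999", comments: (1..EMBED_LIMIT).map { |i| { text: "Comment #{i}" } } }
store = {}
add_comment(post, store, { text: "Comment 101" })
# post keeps the newest 100 comments; "Comment 1" moves to the overflow doc
```

Only the rare viral post pays the extra complexity; the normal read path stays a single-document fetch.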
Ruby Implementation
Ruby applications interact with NoSQL databases through database-specific client libraries and object mapping frameworks.
MongoDB with Mongoid provides an ActiveRecord-like interface for MongoDB:
# Define a User model with embedded addresses
class User
include Mongoid::Document
include Mongoid::Timestamps
field :name, type: String
field :email, type: String
embeds_many :addresses
has_many :orders
end
class Address
include Mongoid::Document
field :street, type: String
field :city, type: String
field :state, type: String
field :zip, type: String
embedded_in :user
end
class Order
include Mongoid::Document
field :total, type: BigDecimal
field :status, type: String
belongs_to :user
embeds_many :items
end
class Item
include Mongoid::Document
field :product_name, type: String
field :price, type: BigDecimal
field :quantity, type: Integer
embedded_in :order
end
Creating and querying embedded documents:
# Create user with embedded addresses
user = User.create(
name: "Alice",
email: "alice@example.com",
addresses: [
{ street: "123 Main St", city: "Boston", state: "MA", zip: "02101" },
{ street: "456 Oak Ave", city: "Cambridge", state: "MA", zip: "02139" }
]
)
# Query returns user with all embedded addresses in one query
user = User.find_by(email: "alice@example.com")
user.addresses.each { |addr| puts addr.city }
# Create order with referenced user and embedded items
order = Order.create(
user: user,
total: 150.00,
status: "pending",
items: [
{ product_name: "Book", price: 50.00, quantity: 2 },
{ product_name: "Pen", price: 25.00, quantity: 2 }
]
)
Redis with Ohm models data as Redis keys with typed fields:
# Define a User model backed by Redis hashes
class User < Ohm::Model
attribute :name
attribute :email
index :email
set :followers, :User
list :recent_posts, :Post
counter :login_count
end
class Post < Ohm::Model
attribute :title
attribute :content
reference :author, :User # defines an author_id attribute and an index on it
end
# Create users and relationships
alice = User.create(name: "Alice", email: "alice@example.com")
bob = User.create(name: "Bob", email: "bob@example.com")
# Add follower (stores set in Redis)
alice.followers.add(bob)
# Create post with reference
post = Post.create(
title: "NoSQL Patterns",
content: "Data modeling strategies...",
author: alice
)
# Add to recent posts list
alice.recent_posts.push(post)
# Increment counter
alice.incr(:login_count)
# Query by index
User.find(email: "alice@example.com").first
Post.find(author_id: alice.id)
Implementing the subset pattern with Mongoid:
class Movie
include Mongoid::Document
field :title, type: String
field :review_count, type: Integer, default: 0
embeds_many :recent_reviews, class_name: "ReviewSummary"
has_many :reviews
end
class ReviewSummary
include Mongoid::Document
field :text, type: String
field :rating, type: Integer
field :posted_at, type: DateTime
embedded_in :movie
end
class Review
include Mongoid::Document
field :text, type: String
field :rating, type: Integer
field :posted_at, type: DateTime
belongs_to :movie
end
# Add review and update subset
def add_review(movie, text, rating)
# Create full review
review = movie.reviews.create(
text: text,
rating: rating,
posted_at: Time.now
)
# Update subset (keep 5 most recent)
movie.recent_reviews.create(
text: text,
rating: rating,
posted_at: review.posted_at
)
# Maintain subset size
if movie.recent_reviews.count > 5
oldest = movie.recent_reviews.asc(:posted_at).first
oldest.destroy
end
movie.inc(review_count: 1)
end
Implementing the computed pattern with background updates:
class User
include Mongoid::Document
field :name, type: String
embeds_one :stats, class_name: "UserStats"
end
class UserStats
include Mongoid::Document
field :total_posts, type: Integer, default: 0
field :total_comments, type: Integer, default: 0
field :last_activity, type: DateTime
embedded_in :user
end
class Post
include Mongoid::Document
field :content, type: String
field :posted_at, type: DateTime
belongs_to :user
after_create :update_user_stats
def update_user_stats
if user.stats.nil?
user.build_stats
user.save # persist the embedded stats document before atomic updates
end
user.stats.inc(total_posts: 1)
user.stats.set(last_activity: posted_at)
end
end
Implementing the bucket pattern for time-series data:
class SensorReading
include Mongoid::Document
field :sensor_id, type: String
field :date, type: Date
field :hour, type: Integer
field :readings, type: Array, default: []
field :summary, type: Hash, default: {}
index({ sensor_id: 1, date: 1, hour: 1 }, { unique: true })
def add_reading(minute, temperature)
readings << { minute: minute, temperature: temperature }
update_summary
end
def update_summary
temps = readings.map { |r| r[:temperature] }
self.summary = {
avg_temp: temps.sum / temps.size.to_f,
min_temp: temps.min,
max_temp: temps.max,
count: temps.size
}
save
end
end
# Record reading in appropriate bucket
def record_temperature(sensor_id, temperature)
now = Time.now
bucket = SensorReading.find_or_create_by(
sensor_id: sensor_id,
date: now.to_date,
hour: now.hour
)
bucket.add_reading(now.min, temperature)
end
# Query buckets overlapping the period (date granularity; filter by hour in the caller if needed)
def readings_for_period(sensor_id, start_time, end_time)
SensorReading.where(
sensor_id: sensor_id,
:date.gte => start_time.to_date,
:date.lte => end_time.to_date
).to_a
end
Practical Examples
E-commerce order system demonstrates embedding versus referencing decisions:
# Product catalog uses extended reference pattern
class Product
include Mongoid::Document
field :name, type: String
field :price, type: BigDecimal
field :category_id, type: String
field :category_name, type: String # duplicated for quick access
belongs_to :category
end
class Category
include Mongoid::Document
field :name, type: String
has_many :products
end
# Orders embed items but reference users
class Order
include Mongoid::Document
field :order_number, type: String
field :total, type: BigDecimal
field :status, type: String
field :placed_at, type: DateTime
# Reference user (belongs_to below defines user_id) with duplicated display fields
field :user_name, type: String
field :user_email, type: String
# Embed order items (bounded collection)
embeds_many :items
belongs_to :user
end
class OrderItem
include Mongoid::Document
field :product_id, type: String
field :product_name, type: String # snapshot at order time
field :price, type: BigDecimal # price at order time
field :quantity, type: Integer
embedded_in :order
end
# Creating an order snapshots product data
def place_order(user, cart_items)
order = Order.new(
order_number: generate_order_number,
user: user,
user_name: user.name,
user_email: user.email,
status: "pending",
placed_at: Time.now
)
cart_items.each do |cart_item|
product = Product.find(cart_item[:product_id])
order.items.build(
product_id: product.id.to_s,
product_name: product.name,
price: product.price,
quantity: cart_item[:quantity]
)
end
order.total = order.items.sum { |item| item.price * item.quantity }
order.save
order
end
Social media feed uses denormalization for performance:
class User
include Mongoid::Document
field :username, type: String
field :display_name, type: String
field :avatar_url, type: String
has_many :posts
has_and_belongs_to_many :following, class_name: "User", inverse_of: :followers
has_and_belongs_to_many :followers, class_name: "User", inverse_of: :following
end
# Post duplicates user info for feed display
class Post
include Mongoid::Document
field :content, type: String
field :posted_at, type: DateTime
# Reference to author (belongs_to below defines author_id)
# Denormalized author data for feed queries
field :author_username, type: String
field :author_display_name, type: String
field :author_avatar_url, type: String
# Embedded comments (subset pattern)
embeds_many :recent_comments
field :total_comments, type: Integer, default: 0
belongs_to :author, class_name: "User"
index({ author_id: 1, posted_at: -1 })
end
class Comment
include Mongoid::Document
field :content, type: String
field :user_id, type: String
field :username, type: String # denormalized
field :posted_at, type: DateTime
embedded_in :post, touch: true
end
# Feed generation queries denormalized data
def generate_feed(user, limit = 50)
following_ids = user.following.pluck(:id)
Post.in(author_id: following_ids)
.desc(:posted_at)
.limit(limit)
.to_a
# Returns posts with embedded author info - no additional queries
end
# Updating user profile propagates changes
def update_user_profile(user, new_display_name, new_avatar_url)
user.update(
display_name: new_display_name,
avatar_url: new_avatar_url
)
# Update denormalized data in posts (eventual consistency)
Post.where(author_id: user.id)
.update_all(
author_display_name: new_display_name,
author_avatar_url: new_avatar_url
)
end
Analytics system uses the bucket pattern for metrics:
class MetricBucket
include Mongoid::Document
field :metric_name, type: String
field :date, type: Date
field :hour, type: Integer
field :values, type: Array, default: []
field :aggregates, type: Hash, default: {}
index({ metric_name: 1, date: 1, hour: 1 }, { unique: true })
def record_value(value)
values << { timestamp: Time.now, value: value }
# Cap bucket size before persisting so the trim is saved along with the value
values.shift if values.size > 1000
update_aggregates
end
def update_aggregates
vals = values.map { |v| v[:value] }
self.aggregates = {
count: vals.size,
sum: vals.sum,
avg: vals.sum / vals.size.to_f,
min: vals.min,
max: vals.max
}
save
end
end
# Daily summary uses computed pattern
class DailySummary
include Mongoid::Document
field :metric_name, type: String
field :date, type: Date
field :hourly_averages, type: Array, default: []
field :daily_stats, type: Hash, default: {}
index({ metric_name: 1, date: 1 }, { unique: true })
end
# Background job computes daily summaries
def compute_daily_summary(metric_name, date)
buckets = MetricBucket.where(
metric_name: metric_name,
date: date
).order_by(hour: :asc)
hourly_avgs = buckets.map do |bucket|
{ hour: bucket.hour, avg: bucket.aggregates["avg"] }
end
all_values = hourly_avgs.map { |h| h[:avg] }.compact
return if all_values.empty?
DailySummary.find_or_create_by(
metric_name: metric_name,
date: date
).update(
hourly_averages: hourly_avgs,
daily_stats: {
avg: all_values.sum / all_values.size.to_f,
min: all_values.min,
max: all_values.max
}
)
end
Performance Considerations
Document structure directly affects query performance. Embedded documents return in a single query, while referenced documents require multiple queries or joins.
Index strategy becomes more critical in NoSQL databases because query patterns are predefined. Create indexes for every field used in queries, sorts, or range scans. MongoDB uses B-tree indexes similar to relational databases. Compound indexes support queries on multiple fields.
class User
include Mongoid::Document
field :email, type: String
field :created_at, type: DateTime
field :last_login, type: DateTime
# Single field indexes
index({ email: 1 }, { unique: true })
# Compound index for common query
index({ created_at: 1, last_login: -1 })
end
Document size affects read and write performance. MongoDB reads entire documents from disk. Large documents slow queries even when applications access only a few fields. The 16MB document limit prevents pathological cases, but documents approaching megabytes degrade performance.
Retrieving a 1KB document from memory takes microseconds. Retrieving a 1MB document takes milliseconds. Applications that query documents frequently should keep them small. Move large binary data like images to separate storage. Use the subset pattern to embed only frequently accessed data.
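A quick way to feel the difference is to measure payload size directly. This sketch uses the JSON encoding as a rough stand-in for BSON (actual BSON sizes differ somewhat; the documents are hypothetical):

```ruby
require "json"

# A lean document versus the same logical entity with a huge embedded array.
small_doc = { _id: 1, name: "Alice" }
large_doc = { _id: 2, name: "Bob",
              activity: Array.new(10_000) { |i| { seq: i, note: "entry #{i}" } } }

# JSON bytesize as an order-of-magnitude proxy for on-disk document size.
def approx_size_bytes(doc)
  JSON.generate(doc).bytesize
end

approx_size_bytes(small_doc)  # a few dozen bytes
approx_size_bytes(large_doc)  # hundreds of kilobytes for one entity
```

Every read of `large_doc` pays for the whole array even if the query needs only the name, which is exactly what the subset pattern avoids.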
Write amplification occurs when denormalized data requires updating multiple documents. Updating a user's display name requires updating the user document plus all posts, comments, and other documents that denormalize the display name. This trade-off exchanges write performance for read performance.
# Updating denormalized data
def update_username(user, new_username)
# 1 write to user document
user.update(username: new_username)
# N writes to posts
Post.where(author_id: user.id.to_s)
.update_all(author_username: new_username)
# M writes to comments
Comment.where(user_id: user.id.to_s)
.update_all(username: new_username)
end
Query selectivity measures how many documents a query examines versus returns. Highly selective queries examine few documents to return results. Poorly selective queries scan many documents. Indexes improve selectivity by allowing the database to skip non-matching documents.
A query finding users by unique email examines one document:
User.find_by(email: "alice@example.com") # examines 1 document
A query finding users created in the last month without an index scans all documents:
User.where(:created_at.gte => 1.month.ago) # scans all documents without index
Embedded array performance degrades as arrays grow. MongoDB reads and rewrites the entire document on update, so each append to a large embedded array grows more expensive, and queries pay for the full array even when they need only a few elements. Frequent additions to large embedded arrays cause performance problems.
Keep embedded arrays bounded. Use the subset pattern to limit embedded collection size. Store unbounded collections in separate documents with references.
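The bounding logic itself is small. A plain-Ruby sketch of a capped append (the cap value is an illustrative assumption), the same trim the subset pattern performs on its embedded array:

```ruby
MAX_RECENT = 5  # assumed cap on the embedded array

# Capped append: keep only the newest `cap` entries, discarding the oldest.
def push_capped(array, item, cap = MAX_RECENT)
  array.push(item)
  array.shift while array.size > cap
  array
end

recent = []
(1..8).each { |i| push_capped(recent, i) }
recent  # => [4, 5, 6, 7, 8]
```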
Connection pooling reduces overhead from establishing database connections. Ruby applications create a pool of persistent connections shared across requests:
# config/mongoid.yml
production:
clients:
default:
database: myapp_production
hosts:
- mongodb://localhost:27017
options:
min_pool_size: 5
max_pool_size: 20
wait_queue_timeout: 1
Projection retrieves only needed fields from documents, reducing network transfer and memory usage:
# Retrieve only name and email
User.only(:name, :email).where(:created_at.gte => 1.week.ago)
# Exclude large fields
Post.without(:content).desc(:posted_at).limit(50)
Explain plans reveal query performance characteristics:
query = User.where(email: "alice@example.com")
puts query.explain
# Shows:
# - Index usage
# - Documents examined
# - Execution time
# - Query stages
Analyze explain output to identify missing indexes, inefficient queries, and performance bottlenecks.
Reference
Embedding vs Referencing Decision Matrix
| Factor | Embed | Reference |
|---|---|---|
| Relationship cardinality | One-to-few | One-to-many, One-to-squillions |
| Data access pattern | Read together | Read independently |
| Update frequency | Infrequent | Frequent |
| Data duplication | Acceptable | Unacceptable |
| Atomic update requirement | Required | Not required |
| Document size impact | Small increase | No impact |
| Query pattern | Single entity with related data | Multiple entities or complex queries |
Common Data Modeling Patterns
| Pattern | Use Case | Implementation | Trade-offs |
|---|---|---|---|
| Subset | Unbounded one-to-many with common access to subset | Embed subset in parent, reference full collection | Duplicates subset data, requires sync |
| Computed | Expensive aggregations needed frequently | Pre-calculate and store results | Requires update logic, may be stale |
| Bucket | High-volume time-series data | Group time periods in documents | Fixed granularity, bucketing logic |
| Extended Reference | Frequently accessed fields from referenced entity | Duplicate key fields in referencing document | Data duplication, sync overhead |
| Attribute | Sparse or heterogeneous attributes | Array of key-value pairs | Loses type safety, indexing limitations |
| Polymorphic | Multiple entity types in one collection | Type field distinguishes documents | Schema variations, complex queries |
| Outlier | Most documents normal, few exceed limits | Separate overflow documents for outliers | Complexity for outlier cases |
MongoDB Document Size Guidelines
| Data Type | Guideline | Reasoning |
|---|---|---|
| Embedded documents | Under 100 per array | Performance degrades with large arrays |
| Document size | Keep under 1MB | Network transfer and memory overhead |
| Text fields | Use GridFS for multi-megabyte content | Avoids document size issues |
| Binary data | Store references to external storage | Documents not optimized for binary |
| Unbounded arrays | Use separate collection | Prevents unlimited growth |
| Frequently updated arrays | Consider separate collection | Reduces write amplification |
Index Types and Usage
| Index Type | Syntax | Use Case | Limitations |
|---|---|---|---|
| Single field | field: 1 | Queries on one field | Direction matters for sort |
| Compound | field1: 1, field2: -1 | Queries on multiple fields | Field order matters |
| Multikey | array_field: 1 | Queries on array elements | One array per compound index |
| Text | field: text | Text search | Language-specific, one per collection |
| Geospatial | location: 2dsphere | Location queries | GeoJSON format required |
| Hashed | field: hashed | Sharding distribution | No range queries |
| TTL | timestamp: 1 with expireAfterSeconds | Automatic document expiration | Only on date fields |
Query Performance Checklist
| Action | Purpose | Implementation |
|---|---|---|
| Create indexes for query fields | Skip document scans | Define indexes in model |
| Use projection | Reduce data transfer | Specify only and without |
| Limit result sets | Control memory usage | Add limit to queries |
| Use covered queries | Avoid document access | Index all queried and returned fields |
| Monitor slow queries | Identify problems | Enable profiling |
| Analyze explain plans | Verify index usage | Call explain on queries |
| Batch writes | Reduce round trips | Use bulk operations |
| Use connection pooling | Reuse connections | Configure pool size |
Document Structure Anti-patterns
| Anti-pattern | Problem | Solution |
|---|---|---|
| Unbounded arrays | Document size grows without limit | Reference separate collection |
| Massive duplication | High storage cost, consistency issues | Reference instead of duplicate |
| Deep nesting | Difficult queries, poor performance | Flatten structure or separate collections |
| Generic keys | Cannot index effectively | Use fixed field names |
| Large binary in documents | Slow queries, high memory | Use GridFS or external storage |
| No indexes | Full collection scans | Create appropriate indexes |
| Over-indexing | Slow writes, wasted storage | Index only queried fields |
Sharding Considerations
| Shard Key Characteristic | Impact | Example |
|---|---|---|
| High cardinality | Even distribution | user_id, UUID |
| Low cardinality | Hotspots | country, category |
| Monotonically increasing | Single shard writes | timestamp, auto-increment |
| Randomly distributed | Even writes | hashed user_id |
| Query pattern alignment | Targeted queries | Match common query fields |
| Immutable | No chunk migrations | Set at document creation |
Ruby Client Performance Settings
| Setting | Recommended Value | Purpose |
|---|---|---|
| min_pool_size | 5-10 | Maintain ready connections |
| max_pool_size | 10-50 | Limit concurrent connections |
| wait_queue_timeout | 1-5 seconds | Fail fast on pool exhaustion |
| socket_timeout | 5-30 seconds | Detect network issues |
| connect_timeout | 5-10 seconds | Fail fast on unavailable server |
| max_idle_time | 300 seconds | Close idle connections |
| server_selection_timeout | 30 seconds | Replica set failover time |