Overview
Distributed tracing tracks individual requests through multiple services in a distributed system. Each request receives a unique identifier that propagates across service boundaries, creating a complete picture of the request's path through the architecture. This trace data includes timing information, service interactions, and contextual metadata.
The concept originated at Google with their Dapper paper in 2010, which described how to instrument large-scale distributed systems with minimal performance overhead. Modern distributed tracing systems build on these principles, standardized through specifications like OpenTelemetry and Zipkin's B3 propagation.
A trace consists of spans—individual units of work representing operations within services. Spans contain timing data, tags, and logs. Parent-child relationships between spans create a directed acyclic graph showing causality and service dependencies.
# Basic trace structure
trace = {
trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
spans: [
{
span_id: "span-001",
parent_span_id: nil,
operation: "web.request",
start_time: 1234567890.123,
duration: 0.450,
tags: { "http.method" => "GET", "http.url" => "/api/users" }
},
{
span_id: "span-002",
parent_span_id: "span-001",
operation: "database.query",
start_time: 1234567890.200,
duration: 0.150,
tags: { "db.type" => "postgresql", "db.statement" => "SELECT * FROM users" }
}
]
}
Distributed tracing solves critical problems in microservices architectures: identifying performance bottlenecks across services, debugging failures that span multiple components, understanding service dependencies, and measuring end-to-end latency. Without tracing, correlating logs and metrics from dozens of services becomes impractical.
Key Principles
Distributed tracing operates on several foundational concepts that determine how trace data gets collected, propagated, and analyzed across service boundaries.
Context Propagation moves trace identifiers across process boundaries. Each service extracts context from incoming requests and injects context into outgoing requests. This propagation occurs through HTTP headers, message queue metadata, or RPC frameworks. The W3C Trace Context specification standardizes this propagation with traceparent and tracestate headers.
# Context propagation in HTTP
class TracingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Extract trace context from incoming request (version-trace_id-parent_id-flags)
    _version, trace_id, parent_span_id, _flags = env['HTTP_TRACEPARENT']&.split('-')
    # Create span for this service's work
    # (create_span and format_traceparent are application-defined helpers)
    span = create_span(trace_id: trace_id, parent_span_id: parent_span_id)
    status, headers, body = @app.call(env)
    # Echo the context on the response so callers can correlate it
    headers['traceparent'] = format_traceparent(span)
    [status, headers, body]
  ensure
    span&.finish
  end
end
Sampling determines which requests to trace. Full tracing of every request creates excessive overhead and storage costs. Head-based sampling makes decisions at trace initiation—typically keeping 1-5% of requests. Tail-based sampling examines completed traces and keeps interesting ones (errors, slow requests). Probabilistic sampling selects traces randomly while deterministic sampling uses request attributes.
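The deterministic variant can be sketched in a few lines. This is a minimal illustration, assuming trace IDs are 32-character hex strings as in W3C Trace Context; the helper name `sampled?` and the choice to compare the low 64 bits against the ratio are assumptions made for this example, not the algorithm of any particular SDK.

```ruby
# Deterministic ratio sampling: the decision is derived from the trace ID,
# so every service that applies the same ratio reaches the same verdict.
def sampled?(trace_id_hex, ratio)
  return true if ratio >= 1.0
  return false if ratio <= 0.0

  # Interpret the last 16 hex characters (the low 64 bits) as an integer.
  value = trace_id_hex[-16..].to_i(16)
  value < (ratio * 2**64).to_i
end

trace_id = '0af7651916cd43dd8448eb211c80319c'
puts "keep at 1%:   #{sampled?(trace_id, 0.01)}"
puts "keep at 100%: #{sampled?(trace_id, 1.0)}"
```

Because no random number is drawn, the same trace ID always yields the same answer, which avoids partially recorded traces when several services sample independently.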
Span Relationships define how operations relate within a trace. Child spans represent work performed as part of a parent operation. Spans can have multiple children but only one parent, forming a tree structure. Follows-from relationships indicate asynchronous work that continues after the parent completes, such as background jobs or message processing.
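As a rough illustration of the tree structure, the sketch below rebuilds the hierarchy from a flat span list. It assumes spans are plain hashes with `span_id`, `parent_span_id`, and `operation` keys, as in the basic trace structure shown earlier; `build_children_index` and `render_tree` are hypothetical helper names.

```ruby
# Index spans by their parent ID; the root span is stored under the nil key.
def build_children_index(spans)
  spans.group_by { |span| span[:parent_span_id] }
end

# Walk the index depth-first and return one indented line per span.
def render_tree(index, parent_id = nil, depth = 0, lines = [])
  (index[parent_id] || []).each do |span|
    lines << "#{'  ' * depth}#{span[:operation]}"
    render_tree(index, span[:span_id], depth + 1, lines)
  end
  lines
end

spans = [
  { span_id: 'span-001', parent_span_id: nil, operation: 'web.request' },
  { span_id: 'span-002', parent_span_id: 'span-001', operation: 'database.query' },
  { span_id: 'span-003', parent_span_id: 'span-001', operation: 'cache.lookup' }
]
puts render_tree(build_children_index(spans))
```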
Tags and Logs attach metadata to spans. Tags are key-value pairs describing the operation—HTTP status codes, database table names, user IDs. Tags enable filtering and aggregation in analysis tools. Logs represent timestamped events within a span—exceptions thrown, cache hits, retries attempted. Unlike application logs, span logs maintain association with the specific request context.
# Span with tags and logs
span = tracer.start_span('process_payment')
span.set_tag('payment.amount', 99.99)
span.set_tag('payment.currency', 'USD')
span.set_tag('user.id', user_id)
begin
result = payment_gateway.charge(card, amount)
span.log_kv(event: 'payment_successful', transaction_id: result.id)
rescue PaymentError => e
span.set_tag('error', true)
span.log_kv(event: 'payment_failed', error_message: e.message)
raise
ensure
span.finish
end
Baggage carries data through the entire trace. Unlike tags, which stay local to a span, baggage propagates to all downstream services. This enables correlating requests by business context such as tenant ID, feature flags, or experiment variants. Baggage increases network overhead because it travels with every request, so use it sparingly.
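To make the overhead concrete, here is a minimal sketch of how baggage entries could be serialized into a single header value, loosely following the comma-separated `key=value` layout of the W3C Baggage header. The helper names `encode_baggage` and `decode_baggage` are assumptions for this example, and the sketch ignores optional per-entry properties.

```ruby
require 'erb'
require 'uri'

# Serialize baggage entries into one header value with percent-encoded values.
def encode_baggage(entries)
  entries.map { |key, value| "#{key}=#{ERB::Util.url_encode(value.to_s)}" }.join(',')
end

# Parse a header value back into a Hash of strings.
def decode_baggage(header)
  header.to_s.split(',').each_with_object({}) do |member, result|
    # Drop optional ";property" metadata, which this sketch does not model.
    pair = member.split(';').first.to_s.strip
    key, value = pair.split('=', 2)
    next if key.nil? || key.empty? || value.nil?

    result[key] = URI.decode_www_form_component(value)
  end
end

header = encode_baggage('tenant.id' => 'acme', 'experiment' => 'checkout v2')
puts header
puts decode_baggage(header).inspect
```

Every entry adds bytes to each outgoing request, which is why a handful of short keys is a reasonable ceiling.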
Clock Synchronization affects trace accuracy. Spans from different services may have timestamps from different clocks. Clock skew can make child spans appear to start before parents or last longer than their parents. Tracing systems handle this through clock correction algorithms or by relying on duration measurements rather than absolute timestamps.
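One simple correction strategy can be sketched as follows. It assumes durations are trustworthy, that the child is a synchronous operation that must fit inside its parent, and that spans are hashes with `start_time` and `duration` in seconds; `correct_skew` is a hypothetical helper, not the algorithm of a specific backend.

```ruby
# Shift a child span so it fits inside its parent's time window.
# The duration is left untouched; only the start time moves.
def correct_skew(parent, child)
  parent_end = parent[:start_time] + parent[:duration]
  child_end  = child[:start_time] + child[:duration]

  offset =
    if child[:start_time] < parent[:start_time]
      parent[:start_time] - child[:start_time] # child appears to start too early
    elsif child_end > parent_end && child[:duration] <= parent[:duration]
      parent_end - child_end                   # child appears to finish too late
    else
      0.0
    end

  child.merge(start_time: child[:start_time] + offset, skew_adjustment: offset)
end

parent = { start_time: 10.0, duration: 0.5 }
child  = { start_time: 9.75, duration: 0.25 }
puts correct_skew(parent, child).inspect
```

Recording the applied offset in `skew_adjustment` keeps the correction visible, so an analyst can tell adjusted timestamps from measured ones.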
Ruby Implementation
Ruby provides several libraries for distributed tracing, with OpenTelemetry emerging as the standard approach. The opentelemetry-sdk gem implements the OpenTelemetry specification for creating, exporting, and managing traces.
Basic Tracer Setup requires configuring a tracer provider, processors, and exporters. The provider manages tracer instances. Processors transform span data before export. Exporters send spans to backend systems.
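require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'user-service'
  c.service_version = '1.2.0'
  # Use OTLP exporter for sending to collectors
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        # The Ruby OTLP/HTTP exporter expects the full traces path
        endpoint: 'http://collector:4318/v1/traces',
        compression: 'gzip'
      )
    )
  )
end

# Configure sampling - keep 10% of root traces, follow the parent's decision otherwise.
# The sampler is set on the tracer provider (or through the OTEL_TRACES_SAMPLER and
# OTEL_TRACES_SAMPLER_ARG environment variables); the factory methods supply
# defaults for ParentBased's remaining arguments.
samplers = OpenTelemetry::SDK::Trace::Samplers
OpenTelemetry.tracer_provider.sampler =
  samplers.parent_based(root: samplers.trace_id_ratio_based(0.1))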
Manual Instrumentation creates spans explicitly in application code. This gives precise control over what operations to trace and what attributes to capture.
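# A top-level local variable is not visible inside def, so expose the tracer through a method
def tracer
OpenTelemetry.tracer_provider.tracer('user-service', '1.0')
end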
def process_user_registration(params)
tracer.in_span('process_registration', attributes: {
'user.email' => params[:email],
'user.plan' => params[:plan]
}) do |span|
# Validate input
tracer.in_span('validate_input') do
validate_registration_params(params)
end
# Create user record
user = tracer.in_span('database.create_user', attributes: {
'db.system' => 'postgresql',
'db.operation' => 'INSERT'
}) do |db_span|
User.create!(params)
end
# Send welcome email (async operation)
tracer.in_span('queue.enqueue_email', attributes: {
'messaging.system' => 'sidekiq',
'messaging.destination' => 'emails'
}) do
WelcomeEmailJob.perform_async(user.id)
end
span.set_attribute('user.id', user.id)
user
rescue ValidationError => e
span.record_exception(e)
span.set_attribute('error', true)
raise
end
end
Automatic Instrumentation instruments common libraries and frameworks without code changes. OpenTelemetry provides instrumentation gems for Rack, Rails, Sinatra, Faraday, Net::HTTP, and database adapters.
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all' # provided by the opentelemetry-instrumentation-all gem

OpenTelemetry::SDK.configure do |c|
  c.use_all # Enable all available instrumentations

  # Or selectively enable instrumentations instead (use_all and use are alternatives):
  # c.use 'OpenTelemetry::Instrumentation::Rack'
  # c.use 'OpenTelemetry::Instrumentation::Rails'
  # c.use 'OpenTelemetry::Instrumentation::ActiveRecord'
  # c.use 'OpenTelemetry::Instrumentation::Faraday'
end
Context Propagation in HTTP Clients requires injecting trace context into outgoing requests. Instrumented libraries handle this automatically, but manual HTTP clients need explicit context injection.
require 'net/http'
require 'json'
require 'opentelemetry/sdk'
def fetch_user_data(user_id)
uri = URI("https://api.example.com/users/#{user_id}")
tracer.in_span('http.request', attributes: {
'http.method' => 'GET',
'http.url' => uri.to_s
}) do |span|
request = Net::HTTP::Get.new(uri)
# Inject trace context into headers
OpenTelemetry.propagation.inject(request)
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
http.request(request)
end
span.set_attribute('http.status_code', response.code.to_i)
JSON.parse(response.body)
end
end
Background Job Instrumentation traces asynchronous work by propagating context through job queues. Sidekiq instrumentation automatically handles this propagation.
class ProcessOrderJob
include Sidekiq::Job
def perform(order_id)
# Trace context automatically extracted from job metadata
tracer.in_span('process_order', attributes: {
'order.id' => order_id
}) do |span|
order = Order.find(order_id)
tracer.in_span('payment.charge') do
charge_payment(order)
end
tracer.in_span('inventory.reserve') do
reserve_inventory(order)
end
tracer.in_span('shipping.create_label') do
create_shipping_label(order)
end
span.set_attribute('order.status', order.status)
end
end
end
# Enqueuing jobs preserves trace context
tracer.in_span('enqueue_order_processing') do
ProcessOrderJob.perform_async(order.id)
end
Custom Span Processors filter or transform spans before export. This enables removing sensitive data, adding environment-specific tags, or implementing custom sampling logic.
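class SensitiveDataProcessor < OpenTelemetry::SDK::Trace::SpanProcessor
  SENSITIVE_ATTRS = ['user.password', 'credit_card.number', 'api_key'].freeze

  def on_start(span, parent_context)
    # The span is still writable here. Attributes cannot be deleted, so mask
    # sensitive values that were passed at span creation.
    SENSITIVE_ATTRS.each do |attr|
      span.set_attribute(attr, '[REDACTED]') if span.attributes&.key?(attr)
    end
    # Add environment tag to all spans
    span.set_attribute('environment', ENV.fetch('RACK_ENV', 'development'))
  end

  def on_finish(span)
    # The span is read-only once it has finished. Attributes added after
    # on_start have to be scrubbed in the exporter or in the collector.
  end

  def force_flush(timeout: nil)
    OpenTelemetry::SDK::Trace::Export::SUCCESS
  end

  def shutdown(timeout: nil)
    OpenTelemetry::SDK::Trace::Export::SUCCESS
  end
end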
OpenTelemetry::SDK.configure do |c|
c.add_span_processor(SensitiveDataProcessor.new)
end
Tools & Ecosystem
The distributed tracing ecosystem includes collection agents, storage backends, analysis interfaces, and instrumentation libraries. Different tools serve different architectural patterns and scaling requirements.
OpenTelemetry Collector receives, processes, and exports telemetry data. The collector decouples instrumentation from backend systems—applications send to the collector, which forwards to multiple backends. Collectors aggregate data from many services, batch exports, and handle retry logic.
# collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
exporters:
  # Jaeger ingests OTLP natively; recent collector releases no longer ship
  # the dedicated jaeger exporter, so an OTLP exporter is used here.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Modify attributes first, then batch right before export
      processors: [attributes, batch]
      exporters: [otlp/jaeger, otlp/tempo]
Jaeger provides trace storage, search, and visualization. Jaeger stores traces in Cassandra, Elasticsearch, or memory. The UI shows trace timelines, service dependencies, and operation statistics. Jaeger supports multiple ingestion formats including Zipkin and OpenTelemetry.
Zipkin offers similar capabilities with a simpler architecture. Zipkin stores traces in MySQL, Elasticsearch, or Cassandra. The B3 propagation format originated with Zipkin and remains widely used. Zipkin requires less infrastructure than Jaeger but provides fewer advanced features.
Grafana Tempo stores traces in object storage (S3, GCS) at lower cost than Elasticsearch or Cassandra. Tempo integrates with Grafana for visualization and correlates traces with logs and metrics. Tempo trades query flexibility for storage efficiency: it was designed around retrieval by trace ID rather than indexed tag search, and recent versions add TraceQL for attribute-based queries.
Datadog APM and New Relic provide commercial tracing solutions with automatic instrumentation, anomaly detection, and extensive integrations. These SaaS platforms handle all infrastructure but incur per-span costs at scale.
Ruby-Specific Gems extend tracing capabilities:
| Gem | Purpose | Integration |
|---|---|---|
| ddtrace | Datadog instrumentation | Rails, Sinatra, Grape, Sidekiq, databases |
| newrelic_rpm | New Relic instrumentation | Rails, background jobs, external services |
| skylight | Application profiling with tracing | Rails, Grape, Sinatra, background jobs |
| scout_apm | Application monitoring | Rails, Sinatra, database queries, external calls |
| elastic-apm | Elastic APM instrumentation | Rails, Sinatra, Sidekiq, HTTP clients |
Trace Context Standards ensure interoperability between tracing systems:
The W3C Trace Context specification defines traceparent and tracestate headers. The traceparent format is version-trace_id-parent_id-trace_flags. Example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01.
# Parsing W3C traceparent header
def parse_traceparent(header)
return nil unless header
parts = header.split('-')
return nil unless parts.length == 4
{
version: parts[0],
trace_id: parts[1],
parent_id: parts[2],
trace_flags: parts[3]
}
end
# Generating traceparent header
def generate_traceparent(trace_id, span_id, sampled)
flags = sampled ? '01' : '00'
"00-#{trace_id}-#{span_id}-#{flags}"
end
The B3 propagation format uses separate headers: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, and X-B3-Sampled. The single-header B3 format combines these into one header named b3: trace_id-span_id-sampled-parent_span_id, where the last two fields are optional.
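A parser for the single-header form might look like the sketch below. It assumes the value arrives in the `b3` header with the field order trace ID, span ID, sampling state, parent span ID, where the last two fields are optional and the sampling state is `0`, `1`, or `d` (debug). The function name `parse_b3_single` and the shape of the returned hash are choices made for this example.

```ruby
# Parse a single-header B3 value:
#   {trace_id}-{span_id}[-{sampling_state}[-{parent_span_id}]]
def parse_b3_single(header)
  return nil if header.nil? || header.empty?

  # A lone sampling state carries a decision without any IDs.
  return { sampled: header != '0', debug: header == 'd' } if %w[0 1 d].include?(header)

  trace_id, span_id, state, parent_span_id = header.split('-')
  # Trace IDs are 16 or 32 hex characters, span IDs are 16 hex characters.
  return nil unless trace_id&.match?(/\A(\h{16}|\h{32})\z/) && span_id&.match?(/\A\h{16}\z/)

  {
    trace_id: trace_id,
    span_id: span_id,
    sampled: state.nil? ? nil : state != '0', # nil means the decision is deferred
    debug: state == 'd',
    parent_span_id: parent_span_id
  }
end

puts parse_b3_single('80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1').inspect
```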
Integration & Interoperability
Distributed tracing requires integration across multiple layers: HTTP frameworks, database clients, message queues, RPC systems, and third-party services. Each integration point must propagate context and create meaningful spans.
Rails Integration instruments controllers, views, and Active Record automatically. Additional instrumentation captures middleware, caching, and mailers.
# config/application.rb
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'
module MyApp
class Application < Rails::Application
config.after_initialize do
OpenTelemetry::SDK.configure do |c|
c.service_name = 'rails-app'
c.use_all
end
end
end
end
# Manual instrumentation in controller
class UsersController < ApplicationController
def create
tracer = OpenTelemetry.tracer_provider.tracer('users_controller')
tracer.in_span('validate_and_create_user') do |span|
span.set_attribute('user.role', params[:role])
@user = User.new(user_params)
if @user.save
span.set_attribute('user.id', @user.id)
render json: @user, status: :created
else
span.set_attribute('error', true)
span.add_event('validation_failed', attributes: {
'errors' => @user.errors.full_messages.join(', ')
})
render json: @user.errors, status: :unprocessable_entity
end
end
end
end
Database Integration traces SQL queries with query text, duration, and connection information. Sensitive data in queries should be sanitized before adding to spans.
# Active Record instrumentation captures queries automatically
class UserService
def find_active_users(min_orders)
tracer.in_span('query_active_users', attributes: {
'db.system' => 'postgresql',
'db.operation' => 'SELECT'
}) do |span|
users = User.joins(:orders)
.where('orders.created_at > ?', 30.days.ago)
.group('users.id')
.having('COUNT(orders.id) >= ?', min_orders)
# length loads the records once; count on a grouped relation returns a Hash
span.set_attribute('db.row_count', users.length)
users
end
end
end
Message Queue Integration propagates context through message metadata. Producers inject context, consumers extract it and create continuation spans.
# RabbitMQ integration with Bunny
class OrderEventPublisher
def publish_order_created(order)
tracer.in_span('publish_order_created', attributes: {
'messaging.system' => 'rabbitmq',
'messaging.destination' => 'orders.created'
}) do |span|
headers = {}
OpenTelemetry.propagation.inject(headers)
channel.default_exchange.publish(
order.to_json,
routing_key: 'orders.created',
headers: headers,
persistent: true
)
span.set_attribute('order.id', order.id)
end
end
end
class OrderEventConsumer
def handle_message(delivery_info, metadata, payload)
# Extract trace context from message headers
context = OpenTelemetry.propagation.extract(metadata.headers || {})
OpenTelemetry::Context.with_current(context) do
tracer.in_span('process_order_created', attributes: {
'messaging.system' => 'rabbitmq',
'messaging.source' => delivery_info.routing_key
}) do |span|
order_data = JSON.parse(payload)
span.set_attribute('order.id', order_data['id'])
process_order(order_data)
end
end
end
end
gRPC Integration propagates context through metadata. The grpc gem supports automatic context injection and extraction.
require 'grpc'
require 'opentelemetry/instrumentation/grpc'
# Client-side propagation
class UserServiceClient
def get_user(user_id)
tracer.in_span('grpc.user_service.get_user', attributes: {
'rpc.system' => 'grpc',
'rpc.service' => 'UserService',
'rpc.method' => 'GetUser'
}) do |span|
stub = UserService::Stub.new('localhost:50051', :this_channel_is_insecure)
# Context automatically injected by instrumentation
response = stub.get_user(GetUserRequest.new(id: user_id))
span.set_attribute('rpc.grpc.status_code', 0)
response
rescue GRPC::BadStatus => e
span.set_attribute('error', true)
span.set_attribute('rpc.grpc.status_code', e.code)
raise
end
end
end
External API Integration requires propagating context to services outside your control. Many APIs ignore trace headers but recording the outbound call still provides value.
class StripeService
def charge_customer(customer_id, amount)
tracer.in_span('stripe.charge', attributes: {
'external.service' => 'stripe',
'payment.amount' => amount,
'payment.currency' => 'usd'
}) do |span|
headers = { 'Idempotency-Key' => SecureRandom.uuid }
OpenTelemetry.propagation.inject(headers)
response = Stripe::Charge.create(
{
customer: customer_id,
amount: amount,
currency: 'usd'
},
headers
)
span.set_attribute('payment.id', response.id)
span.set_attribute('payment.status', response.status)
response
end
end
end
Real-World Applications
Production distributed tracing deployments balance comprehensive coverage against performance overhead and cost. Real-world implementations require careful sampling strategies, data retention policies, and integration with incident response workflows.
High-Traffic API Platform serving 100,000 requests per second combines two sampling stages. Head-based sampling in the application keeps a 1% baseline of requests, raised to 5% for authenticated traffic. Whether a request fails or runs slowly is unknown when its trace starts, so keeping errors and slow requests is left to tail-based sampling in the collector, which can only act on the traces the application exports to it.
# Custom head-based sampler that raises the rate for high-value requests.
# Samplers are duck-typed: they respond to should_sample? and description.
class AdaptiveSampler
  Samplers = OpenTelemetry::SDK::Trace::Samplers

  def initialize(default_rate: 0.01, authenticated_rate: 0.05)
    @default = Samplers.trace_id_ratio_based(default_rate)             # 1%
    @authenticated = Samplers.trace_id_ratio_based(authenticated_rate) # 5%
  end

  def description
    'AdaptiveSampler'
  end

  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    # Only attributes supplied at span creation are visible at this point
    authenticated = attributes && attributes['user.authenticated']
    sampler = authenticated ? @authenticated : @default
    # Delegate to the built-in ratio sampler, which builds the sampling result
    sampler.should_sample?(
      trace_id: trace_id, parent_context: parent_context, links: links,
      name: name, kind: kind, attributes: attributes
    )
  end
end
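The tail-based half of that setup normally runs in the collector, but the decision logic can be sketched in plain Ruby. The example assumes a completed trace is an array of span hashes with `start_time`, `duration`, and an optional `error` flag; `keep_trace?`, its thresholds, and the injectable `rng` are hypothetical and exist only for illustration.

```ruby
# Tail-based decision: inspect a completed trace and decide whether to keep it.
def keep_trace?(spans, latency_threshold: 1.0, baseline_ratio: 0.01, rng: Random.new)
  return false if spans.empty?

  # Keep every trace that contains a failed span.
  return true if spans.any? { |span| span[:error] }

  # Keep every trace whose end-to-end latency exceeds the threshold.
  started  = spans.map { |span| span[:start_time] }.min
  finished = spans.map { |span| span[:start_time] + span[:duration] }.max
  return true if finished - started > latency_threshold

  # Otherwise keep a small random share as a baseline.
  rng.rand < baseline_ratio
end

slow_trace = [
  { start_time: 0.0, duration: 1.5 },
  { start_time: 0.25, duration: 1.0 }
]
puts keep_trace?(slow_trace)
```

Unlike a head-based sampler, this function needs every span of the trace, which is why tail-based sampling requires buffering spans until the trace completes.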
Microservices Platform with 50 services uses distributed tracing to identify cascading failures. When service A slows down, tracing reveals which downstream services cause the latency. The platform automatically creates service dependency graphs from trace data.
# Service dependency tracking
class DependencyAnalyzer
def analyze_trace(trace)
dependencies = {}
trace.spans.each do |span|
next unless span.attributes['peer.service']
caller = span.attributes['service.name']
callee = span.attributes['peer.service']
dependencies[caller] ||= {}
dependencies[caller][callee] ||= {
call_count: 0,
total_duration: 0.0,
error_count: 0
}
dependencies[caller][callee][:call_count] += 1
dependencies[caller][callee][:total_duration] += span.duration
dependencies[caller][callee][:error_count] += 1 if span.attributes['error']
end
dependencies
end
end
E-Commerce Platform correlates traces with business metrics. Each trace includes order value, customer segment, and AB test variant. This enables analyzing performance by business dimension—high-value customer checkout latency versus low-value customer latency.
# Business context propagation
class CheckoutController < ApplicationController
def create
tracer.in_span('checkout', attributes: {
'cart.value' => current_cart.total,
'cart.item_count' => current_cart.items.count,
'customer.segment' => current_user.segment,
'experiment.checkout_flow' => current_experiment_variant
}) do |span|
order = tracer.in_span('create_order') do
Order.create_from_cart!(current_cart)
end
span.set_attribute('order.id', order.id)
span.set_attribute('order.total', order.total)
# Business event logging
# Attribute keys must be strings; 'revenue': would create symbol keys
span.add_event('order_created', attributes: {
'revenue' => order.total,
'items' => order.items.count,
'payment_method' => order.payment_method
})
redirect_to order_path(order)
end
end
end
SaaS Application implements tenant-aware tracing. Each span includes tenant ID, plan tier, and feature flags. This isolates performance issues to specific tenants and identifies which features cause slowdowns.
# Tenant context middleware
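class TenantTracingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    tenant = env['current_tenant']
    tracer.in_span('request', attributes: {
      'tenant.id' => tenant.id,
      'tenant.plan' => tenant.plan,
      'tenant.created_at' => tenant.created_at.iso8601
    }) do |span|
      # Add tenant features as baggage. set_value returns a new context rather
      # than mutating the current one, and baggage values are strings.
      context = OpenTelemetry::Baggage.set_value(
        'tenant.feature.advanced_analytics',
        tenant.feature_enabled?(:advanced_analytics).to_s
      )
      # Make the baggage-carrying context current for the downstream call
      OpenTelemetry::Context.with_current(context) do
        status, headers, body = @app.call(env)
        span.set_attribute('http.status_code', status)
        [status, headers, body]
      end
    end
  end
end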
Background Processing System traces job execution across multiple workers. Parent-child relationships show which jobs spawn additional work. Trace analysis identifies jobs that trigger cascading job creation.
# Job execution tracing
class TracedJob
include Sidekiq::Job
def perform(*args)
tracer.in_span("job.#{self.class.name}", attributes: {
'job.class' => self.class.name,
'job.args' => args.inspect,
'job.id' => jid,
'job.queue' => queue_name
}) do |span|
start_time = Time.now
begin
result = execute(*args)
span.set_attribute('job.duration', Time.now - start_time)
span.set_attribute('job.status', 'success')
result
rescue => e
span.record_exception(e)
span.set_attribute('job.status', 'failed')
span.set_attribute('job.error', e.class.name)
raise
end
end
end
end
Common Pitfalls
Distributed tracing implementations face recurring problems that reduce effectiveness or create operational issues. Understanding these pitfalls prevents wasted effort and data quality problems.
Context Propagation Failures occur when trace context does not cross service boundaries. Missing instrumentation for HTTP clients or message queues breaks trace continuity. Each service receives a new trace ID instead of continuing the parent trace.
# WRONG: Context not propagated
def call_external_service(data)
# Missing context injection - creates new trace
uri = URI('https://api.example.com/process')
response = Net::HTTP.post(uri, data.to_json, 'Content-Type' => 'application/json')
JSON.parse(response.body)
end
# CORRECT: Context propagated
def call_external_service(data)
tracer.in_span('external_api_call') do |span|
uri = URI('https://api.example.com/process')
request = Net::HTTP::Post.new(uri)
request['Content-Type'] = 'application/json'
# Inject context into headers
OpenTelemetry.propagation.inject(request)
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
http.request(request, data.to_json)
end
span.set_attribute('http.status_code', response.code.to_i)
JSON.parse(response.body)
end
end
Excessive Span Creation generates too many spans per trace, overwhelming storage and making traces hard to analyze. Creating spans for every method call or loop iteration produces megabyte-sized traces. Limit spans to significant operations—HTTP requests, database queries, external service calls.
# WRONG: Too many spans
def process_items(items)
tracer.in_span('process_items') do
items.each do |item|
# Creating 1000s of spans for each item
tracer.in_span("process_item_#{item.id}") do
tracer.in_span('validate') do
validate(item)
end
tracer.in_span('transform') do
transform(item)
end
tracer.in_span('save') do
save(item)
end
end
end
end
end
# CORRECT: Aggregate spans
def process_items(items)
tracer.in_span('process_items', attributes: {
'items.count' => items.count
}) do |span|
processed = 0
errors = 0
items.each do |item|
# Process without individual spans
validate(item)
transform(item)
save(item)
processed += 1
rescue => e
errors += 1
end
span.set_attribute('items.processed', processed)
span.set_attribute('items.errors', errors)
end
end
Sensitive Data Exposure adds passwords, API keys, or personal information to spans. Span data typically has wider access than application databases. Sanitize attributes before adding them to spans.
# WRONG: Sensitive data in span
def authenticate_user(email, password)
tracer.in_span('authenticate', attributes: {
'user.email' => email,
'user.password' => password # NEVER include passwords
}) do
User.authenticate(email, password)
end
end
# CORRECT: Sanitized attributes
def authenticate_user(email, password)
tracer.in_span('authenticate', attributes: {
'user.email' => email.gsub(/@.*/, '@***') # Partially redact email
}) do |span|
user = User.authenticate(email, password)
# Attribute values cannot be nil, so set the ID only on success
span.set_attribute('user.id', user.id) if user
span.set_attribute('auth.success', !user.nil?)
user
end
end
Sampling Inconsistency creates incomplete traces when different services use different sampling decisions. Head-based sampling must propagate the sampling decision to all downstream services. The trace flags in context indicate whether the trace is sampled.
# Respecting parent sampling decision
# A ParentBased sampler is configured once on the tracer provider. Spans started
# while the parent context is current then inherit its sampled flag.
def process_with_context(parent_context)
  # Read the propagated decision (useful for logging or skipping expensive work)
  parent_span = OpenTelemetry::Trace.current_span(parent_context)
  sampled = parent_span.context.trace_flags.sampled?

  # Make the parent context current so the new span becomes its child
  OpenTelemetry::Context.with_current(parent_context) do
    tracer.in_span('child_operation', attributes: { 'parent.sampled' => sampled }) do |span|
      # This span is sampled only if the parent was sampled
      perform_work
    end
  end
end
Missing Error Information creates spans that finish successfully even when operations fail. In the Ruby SDK, in_span records an exception that escapes its block and marks the span as failed, but an exception rescued inside the block leaves no error on the span unless you record it explicitly. Always record exceptions and set an error status when operations fail.
# WRONG: Exception swallowed, span finishes without error indication
def fetch_user(id)
  tracer.in_span('fetch_user') do
    User.find(id)
  rescue ActiveRecord::RecordNotFound
    nil
  end
end
# CORRECT: Exception recorded before falling back
def fetch_user(id)
  tracer.in_span('fetch_user', attributes: { 'user.id' => id }) do |span|
    User.find(id)
  rescue ActiveRecord::RecordNotFound => e
    span.record_exception(e)
    span.status = OpenTelemetry::Trace::Status.error('user not found')
    span.set_attribute('error.type', 'not_found')
    nil
  end
end
Clock Skew Problems distort traces when services have unsynchronized clocks. Spans appear in wrong order or have negative durations. Use NTP synchronization across all services or rely on relative timing within services.
Cardinality Explosions occur when high-cardinality values become span attributes. User IDs, request IDs, and timestamps create millions of unique attribute combinations, overwhelming trace storage indexes.
# WRONG: High cardinality attribute
tracer.in_span('query_users', attributes: {
'query.timestamp' => Time.now.to_f, # Unique every time
'query.request_id' => SecureRandom.uuid # Unique every time
}) do
# ...
end
# CORRECT: Bounded cardinality
tracer.in_span('query_users', attributes: {
'query.hour' => Time.now.hour, # Only 24 values
'query.type' => 'user_search' # Fixed set of values
}) do
# ...
end
Reference
Trace Components
| Component | Description | Cardinality |
|---|---|---|
| Trace | Complete request path across services | One per sampled request |
| Span | Single operation within a service | Multiple per trace |
| Trace ID | Unique identifier for entire trace | 128-bit random value |
| Span ID | Unique identifier for individual span | 64-bit random value |
| Parent Span ID | Links span to its parent | References another span ID |
| Attributes | Key-value metadata on spans | 0 to 128 per span typically |
| Events | Timestamped log entries within span | 0 to many per span |
| Links | Connects spans across trace boundaries | Used for batch processing |
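
The identifier sizes in the table translate directly into code. The sketch below generates IDs with Ruby's standard `SecureRandom`; the helper names are assumptions for this example, and a real SDK ships its own ID generator.

```ruby
require 'securerandom'

# 128-bit trace ID rendered as 32 lowercase hex characters.
def generate_trace_id
  loop do
    id = SecureRandom.hex(16)
    # W3C Trace Context treats an all-zero ID as invalid, so retry in that case.
    return id unless id == '0' * 32
  end
end

# 64-bit span ID rendered as 16 lowercase hex characters.
def generate_span_id
  loop do
    id = SecureRandom.hex(8)
    return id unless id == '0' * 16
  end
end

puts "00-#{generate_trace_id}-#{generate_span_id}-01"
```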
Span Attributes
Standard semantic conventions define common attribute names:
| Attribute | Type | Example | Use Case |
|---|---|---|---|
| service.name | string | user-service | Service identification |
| http.method | string | GET | HTTP request method |
| http.url | string | /api/users/123 | Request URL |
| http.status_code | integer | 200 | Response status |
| db.system | string | postgresql | Database type |
| db.statement | string | SELECT * FROM users | SQL query |
| db.operation | string | SELECT | Database operation |
| messaging.system | string | rabbitmq | Message queue type |
| messaging.destination | string | orders.created | Queue or topic name |
| rpc.system | string | grpc | RPC framework |
| error | boolean | true | Error occurred |
| peer.service | string | payment-service | Called service name |
Context Propagation Formats
| Format | Header | Structure | Example |
|---|---|---|---|
| W3C Trace Context | traceparent | version-traceid-parentid-flags | 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 |
| W3C Trace State | tracestate | vendor specific state | congo=t61rcWkgMzE |
| B3 Multi | X-B3-TraceId, X-B3-SpanId | Separate headers | TraceId: 80f198ee56343ba864fe8b2a57d3eff7 |
| B3 Single | b3 | traceid-spanid-sampled | 80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1 |
| Jaeger | uber-trace-id | traceid:spanid:parentid:flags | 5af7651916cd43dd8448eb211c80319c:b7ad6b7169203331:0:1 |
Sampling Strategies
| Strategy | Decision Point | Use Case | Trade-off |
|---|---|---|---|
| Probabilistic | Trace start | Consistent load | May miss errors |
| Rate Limiting | Trace start | Cost control | May miss bursts |
| Parent-based | Propagated | Cross-service consistency | Depends on parent decision |
| Tail-based | Trace completion | Keep interesting traces | Requires buffering all spans |
| Adaptive | Dynamic | Respond to conditions | Complex implementation |
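
The rate-limiting strategy in the table is commonly built on a token bucket. The following is a minimal sketch under that assumption; `RateLimitingSampler` and its injectable `clock` are hypothetical names chosen for this example and are not part of the OpenTelemetry Ruby API.

```ruby
# Rate-limiting sampler: admit at most `rate` new traces per second.
class RateLimitingSampler
  def initialize(rate:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @rate = rate.to_f
    @clock = clock
    @tokens = @rate # start with a full bucket
    @last = clock.call
    @mutex = Mutex.new
  end

  # Returns true when a token is available, consuming it.
  def sample?
    @mutex.synchronize do
      now = @clock.call
      # Refill in proportion to elapsed time, capped at the bucket size.
      @tokens = [@tokens + (now - @last) * @rate, @rate].min
      @last = now
      next false if @tokens < 1.0

      @tokens -= 1.0
      true
    end
  end
end

sampler = RateLimitingSampler.new(rate: 100)
puts sampler.sample?
```

A burst larger than the bucket is dropped until tokens refill, which is the "may miss bursts" trade-off listed in the table.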
Span Lifecycle
| Phase | Description | Methods |
|---|---|---|
| Creation | Span initialized with trace context | start_span |
| Active | Span receiving attributes and events | set_attribute, add_event |
| Completion | Span duration finalized | finish |
| Export | Span sent to backend | Automatic via processor |
Ruby Tracing Configuration
| Option | Purpose | Default |
|---|---|---|
| service_name | Identify service in traces | unknown_service |
| service_version | Track deployments | nil |
| sampler | Control trace sampling | ParentBased(AlwaysOn) |
| span_processors | Process spans before export | BatchSpanProcessor |
| propagators | Context propagation format | TraceContext, Baggage |
| resource_attributes | Service metadata | hostname, process info |
Common Span Kinds
| Kind | Purpose | Direction |
|---|---|---|
| CLIENT | Outbound request initiator | Outgoing |
| SERVER | Request handler | Incoming |
| PRODUCER | Message queue publisher | Outgoing |
| CONSUMER | Message queue subscriber | Incoming |
| INTERNAL | Internal operation | Neither |
Performance Impact
| Instrumentation Type | Overhead | Sampling Recommendation |
|---|---|---|
| Automatic HTTP/Database | 1-3% latency | 5-10% sampling |
| Manual business logic | 2-5% latency | Selective instrumentation |
| Message queues | Minimal | 10-25% sampling |
| Background jobs | 1-2% latency | 25-50% sampling |