CrackedRuby

Overview

Distributed tracing tracks individual requests through multiple services in a distributed system. Each request receives a unique identifier that propagates across service boundaries, creating a complete picture of the request's path through the architecture. This trace data includes timing information, service interactions, and contextual metadata.

The concept originated at Google with their Dapper paper in 2010, which described how to instrument large-scale distributed systems with minimal performance overhead. Modern distributed tracing systems build on these principles, standardized through specifications like OpenTelemetry and Zipkin's B3 propagation.

A trace consists of spans—individual units of work representing operations within services. Spans contain timing data, tags, and logs. Parent-child relationships between spans create a directed acyclic graph showing causality and service dependencies.

# Basic trace structure
trace = {
  trace_id: "a1b2c3d4e5f60718a1b2c3d4e5f60718", # 128-bit ID as 32 hex characters
  spans: [
    {
      span_id: "span-001",
      parent_span_id: nil,
      operation: "web.request",
      start_time: 1234567890.123,
      duration: 0.450,
      tags: { "http.method" => "GET", "http.url" => "/api/users" }
    },
    {
      span_id: "span-002",
      parent_span_id: "span-001",
      operation: "database.query",
      start_time: 1234567890.200,
      duration: 0.150,
      tags: { "db.type" => "postgresql", "db.statement" => "SELECT * FROM users" }
    }
  ]
}
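
The parent-child links can be turned back into a tree for display. The following is a minimal sketch using plain hashes shaped like the example above; the `cache.read` span and the helper names `children_by_parent` and `render_tree` are illustrative assumptions, not part of any tracing library.

```ruby
# Flat span list, shaped like the hash example above (values are illustrative).
SPANS = [
  { span_id: 'span-001', parent_span_id: nil,        operation: 'web.request',    duration: 0.450 },
  { span_id: 'span-002', parent_span_id: 'span-001', operation: 'database.query', duration: 0.150 },
  { span_id: 'span-003', parent_span_id: 'span-001', operation: 'cache.read',     duration: 0.010 }
].freeze

# Group spans by parent ID so the children of any span can be looked up directly.
def children_by_parent(spans)
  spans.group_by { |span| span[:parent_span_id] }
end

# Walk the tree depth-first and return one indented line per span.
def render_tree(spans, parent_id = nil, depth = 0, index = children_by_parent(spans))
  (index[parent_id] || []).flat_map do |span|
    line = format('%s%s (%.0f ms)', '  ' * depth, span[:operation], span[:duration] * 1000)
    [line] + render_tree(spans, span[:span_id], depth + 1, index)
  end
end

puts render_tree(SPANS)
```

Root spans are the ones whose parent ID is nil, so the same walk also handles traces that arrive with their spans out of order.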

Distributed tracing solves critical problems in microservices architectures: identifying performance bottlenecks across services, debugging failures that span multiple components, understanding service dependencies, and measuring end-to-end latency. Without tracing, correlating logs and metrics from dozens of services becomes impractical.

Key Principles

Distributed tracing operates on several foundational concepts that determine how trace data gets collected, propagated, and analyzed across service boundaries.

Context Propagation moves trace identifiers across process boundaries. Each service extracts context from incoming requests and injects context into outgoing requests. This propagation occurs through HTTP headers, message queue metadata, or RPC frameworks. The W3C Trace Context specification standardizes this propagation with traceparent and tracestate headers.

# Context propagation in HTTP (Rack middleware sketch)
class TracingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Extract trace context from the incoming traceparent header:
    # version-trace_id-parent_id-flags
    _version, trace_id, parent_span_id, _flags = env['HTTP_TRACEPARENT'].to_s.split('-')

    # Create span for this service's work
    # (create_span and format_traceparent are application-defined helpers)
    span = create_span(trace_id: trace_id, parent_span_id: parent_span_id)

    status, headers, body = @app.call(env)

    # Echo the context on the response so callers can correlate it;
    # outgoing requests to other services need the same header injected
    headers['traceparent'] = format_traceparent(span)

    [status, headers, body]
  ensure
    span&.finish
  end
end

Sampling determines which requests to trace. Full tracing of every request creates excessive overhead and storage costs. Head-based sampling makes decisions at trace initiation—typically keeping 1-5% of requests. Tail-based sampling examines completed traces and keeps interesting ones (errors, slow requests). Probabilistic sampling selects traces randomly while deterministic sampling uses request attributes.
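
A deterministic head-based decision can be derived from the trace ID itself, so every service that sees the same ID reaches the same verdict. The sketch below assumes trace IDs are 32-character hex strings; `sampled?` is a hypothetical helper, not an SDK method.

```ruby
require 'securerandom'

# Deterministic head-based sampling sketch: the decision depends only on the
# trace ID, so it can be recomputed anywhere without coordination.
def sampled?(trace_id_hex, rate)
  return true if rate >= 1.0
  return false if rate <= 0.0

  # Interpret the last 16 hex characters (64 bits) as an unsigned integer
  # and keep the trace when it falls below rate * 2^64.
  value = trace_id_hex[-16..].to_i(16)
  value < (rate * 2**64)
end

trace_ids = Array.new(10_000) { SecureRandom.hex(16) }
kept = trace_ids.count { |id| sampled?(id, 0.05) }
puts "kept #{kept} of #{trace_ids.size} traces"
```

Because random trace IDs are uniformly distributed, the kept fraction approaches the configured rate as traffic grows.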

Span Relationships define how operations relate within a trace. Child spans represent work performed as part of a parent operation. Spans can have multiple children but only one parent, forming a tree structure. Follows-from relationships indicate asynchronous work that continues after the parent completes, such as background jobs or message processing.

Tags and Logs attach metadata to spans. Tags are key-value pairs describing the operation—HTTP status codes, database table names, user IDs. Tags enable filtering and aggregation in analysis tools. Logs represent timestamped events within a span—exceptions thrown, cache hits, retries attempted. Unlike application logs, span logs maintain association with the specific request context.

# Span with tags and logs
span = tracer.start_span('process_payment')
span.set_tag('payment.amount', 99.99)
span.set_tag('payment.currency', 'USD')
span.set_tag('user.id', user_id)

begin
  result = payment_gateway.charge(card, amount)
  span.log_kv(event: 'payment_successful', transaction_id: result.id)
rescue PaymentError => e
  span.set_tag('error', true)
  span.log_kv(event: 'payment_failed', error_message: e.message)
  raise
ensure
  span.finish
end

Baggage carries data through the entire trace. Unlike tags, which stay local to a span, baggage propagates to all downstream services. This enables correlating requests by business context, such as tenant ID, feature flags, or experiment variants. Baggage increases network overhead because it travels with every request, so use it sparingly.
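
On the wire, baggage is commonly sent as a single `baggage` header of comma-separated `key=value` pairs with percent-encoded values. The following sketch shows the idea with the standard library only; `encode_baggage`, `decode_baggage`, and the 8192-byte cap are assumptions for illustration, and real services would normally rely on the SDK's baggage propagator.

```ruby
require 'erb'

# Size cap used by this sketch; treat the exact number as an assumption.
MAX_BAGGAGE_BYTES = 8192

# Serialize a hash into a baggage-style header value.
def encode_baggage(entries)
  header = entries.map { |key, value| "#{key}=#{ERB::Util.url_encode(value.to_s)}" }.join(',')
  raise ArgumentError, 'baggage too large' if header.bytesize > MAX_BAGGAGE_BYTES

  header
end

# Parse a baggage-style header value back into a hash, ignoring
# optional ";property" suffixes and malformed members.
def decode_baggage(header)
  header.to_s.split(',').each_with_object({}) do |member, result|
    key, value = member.split(';').first.to_s.strip.split('=', 2)
    next if key.nil? || key.empty? || value.nil?

    result[key] = value.gsub(/%\h\h/) { |hex| hex[1..].hex.chr }
  end
end

header = encode_baggage('tenant.id' => 'acme corp', 'experiment' => 'b')
puts header
puts decode_baggage(header).inspect
```

The size check makes the overhead concern concrete: every entry added here is paid for on every downstream request.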

Clock Synchronization affects trace accuracy. Spans from different services may have timestamps from different clocks. Clock skew can make child spans appear to start before parents or last longer than their parents. Tracing systems handle this through clock correction algorithms or by relying on duration measurements rather than absolute timestamps.
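
One simple correction heuristic shifts a child span so that it sits inside its parent, assuming network latency is the same in both directions. This is a sketch of that idea only; `TimedSpan` and `skew_offset` are hypothetical names, and production backends use more careful adjustments.

```ruby
# Minimal span record for the sketch; times are in seconds.
TimedSpan = Struct.new(:name, :start_time, :duration, keyword_init: true) do
  def end_time
    start_time + duration
  end
end

# Returns the offset (in seconds) to add to the child's timestamps so that it
# fits inside the parent, or 0.0 when no correction is possible or needed.
def skew_offset(parent, child)
  return 0.0 if child.duration > parent.duration # cannot fit; leave untouched

  fits = child.start_time >= parent.start_time && child.end_time <= parent.end_time
  return 0.0 if fits

  # Assume equal latency each way: centre the child inside the parent.
  slack = (parent.duration - child.duration) / 2.0
  (parent.start_time + slack) - child.start_time
end

parent = TimedSpan.new(name: 'web.request', start_time: 100.0, duration: 0.4)
child  = TimedSpan.new(name: 'db.query', start_time: 99.9, duration: 0.2)
puts format('shift child by %.3f s', skew_offset(parent, child))
```

Durations are left untouched because they are measured by a single clock; only the absolute start time is adjusted.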

Ruby Implementation

Ruby provides several libraries for distributed tracing, with OpenTelemetry emerging as the standard approach. The opentelemetry-sdk gem implements the OpenTelemetry specification for creating, exporting, and managing traces.

Basic Tracer Setup requires configuring a tracer provider, processors, and exporters. The provider manages tracer instances. Processors transform span data before export. Exporters send spans to backend systems.

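require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'user-service'
  c.service_version = '1.2.0'

  # Use OTLP exporter for sending to collectors.
  # An explicit endpoint must include the /v1/traces path.
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: 'http://collector:4318/v1/traces',
        compression: 'gzip'
      )
    )
  )
end

# Configure sampling - keep 10% of root traces, follow the parent's decision otherwise.
# The sampler is set on the tracer provider (or through the OTEL_TRACES_SAMPLER and
# OTEL_TRACES_SAMPLER_ARG environment variables) rather than inside the configure block.
samplers = OpenTelemetry::SDK::Trace::Samplers
OpenTelemetry.tracer_provider.sampler = samplers.parent_based(
  root: samplers.trace_id_ratio_based(0.1)
)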

Manual Instrumentation creates spans explicitly in application code. This gives precise control over what operations to trace and what attributes to capture.

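# A top-level local variable is not visible inside method bodies,
# so expose the tracer through a helper method
def tracer
  OpenTelemetry.tracer_provider.tracer('user-service', '1.0')
end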

def process_user_registration(params)
  tracer.in_span('process_registration', attributes: {
    'user.email' => params[:email],
    'user.plan' => params[:plan]
  }) do |span|
    
    # Validate input
    tracer.in_span('validate_input') do
      validate_registration_params(params)
    end
    
    # Create user record
    user = tracer.in_span('database.create_user', attributes: {
      'db.system' => 'postgresql',
      'db.operation' => 'INSERT'
    }) do |db_span|
      User.create!(params)
    end
    
    # Send welcome email (async operation)
    tracer.in_span('queue.enqueue_email', attributes: {
      'messaging.system' => 'sidekiq',
      'messaging.destination' => 'emails'
    }) do
      WelcomeEmailJob.perform_async(user.id)
    end
    
    span.set_attribute('user.id', user.id)
    user
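  rescue ValidationError => e
    # in_span records the exception and sets error status when it propagates,
    # so add only the extra detail needed for filtering
    span.set_attribute('error.type', e.class.name)
    raise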
  end
end

Automatic Instrumentation instruments common libraries and frameworks without code changes. OpenTelemetry provides instrumentation gems for Rack, Rails, Sinatra, Faraday, Net::HTTP, and database adapters.

require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all' # from the opentelemetry-instrumentation-all gem

OpenTelemetry::SDK.configure do |c|
  c.use_all # Enable all available instrumentations

  # Or, instead of use_all, selectively enable instrumentations
  # c.use 'OpenTelemetry::Instrumentation::Rack'
  # c.use 'OpenTelemetry::Instrumentation::Rails'
  # c.use 'OpenTelemetry::Instrumentation::ActiveRecord'
  # c.use 'OpenTelemetry::Instrumentation::Faraday'
end

Context Propagation in HTTP Clients requires injecting trace context into outgoing requests. Instrumented libraries handle this automatically, but manual HTTP clients need explicit context injection.

require 'net/http'
require 'json'
require 'opentelemetry/sdk'

def fetch_user_data(user_id)
  uri = URI("https://api.example.com/users/#{user_id}")
  
  tracer.in_span('http.request', attributes: {
    'http.method' => 'GET',
    'http.url' => uri.to_s
  }) do |span|
    request = Net::HTTP::Get.new(uri)
    
    # Inject trace context into headers
    OpenTelemetry.propagation.inject(request)
    
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
      http.request(request)
    end
    
    span.set_attribute('http.status_code', response.code.to_i)
    JSON.parse(response.body)
  end
end

Background Job Instrumentation traces asynchronous work by propagating context through job queues. Sidekiq instrumentation automatically handles this propagation.

class ProcessOrderJob
  include Sidekiq::Job
  
  def perform(order_id)
    # Trace context automatically extracted from job metadata
    tracer.in_span('process_order', attributes: {
      'order.id' => order_id
    }) do |span|
      order = Order.find(order_id)
      
      tracer.in_span('payment.charge') do
        charge_payment(order)
      end
      
      tracer.in_span('inventory.reserve') do
        reserve_inventory(order)
      end
      
      tracer.in_span('shipping.create_label') do
        create_shipping_label(order)
      end
      
      span.set_attribute('order.status', order.status)
    end
  end
end

# Enqueuing jobs preserves trace context
tracer.in_span('enqueue_order_processing') do
  ProcessOrderJob.perform_async(order.id)
end

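Custom Span Processors hook into span start and finish. In the Ruby SDK the finish hook is named on_finish, and the span is already read-only when it runs. A processor can therefore add environment-specific tags at start and detect sensitive attributes at finish, but redaction has to happen where the attributes are set or later in the Collector.

class SensitiveDataProcessor < OpenTelemetry::SDK::Trace::SpanProcessor
  SENSITIVE_ATTRS = ['user.password', 'credit_card.number', 'api_key'].freeze

  def on_start(span, parent_context)
    # Spans are still writable here, so add the environment tag to every span now
    span.set_attribute('environment', ENV.fetch('RACK_ENV', 'development'))
  end

  def on_finish(span)
    # The span has ended and its attributes are frozen, so this hook can only
    # report leaks; it cannot delete or change attributes
    leaked = SENSITIVE_ATTRS & (span.attributes || {}).keys
    return if leaked.empty?

    OpenTelemetry.logger.warn("Sensitive span attributes on #{span.name}: #{leaked.join(', ')}")
  end

  def force_flush(timeout: nil)
    OpenTelemetry::SDK::Trace::Export::SUCCESS
  end

  def shutdown(timeout: nil)
    OpenTelemetry::SDK::Trace::Export::SUCCESS
  end
end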

OpenTelemetry::SDK.configure do |c|
  c.add_span_processor(SensitiveDataProcessor.new)
end

Tools & Ecosystem

The distributed tracing ecosystem includes collection agents, storage backends, analysis interfaces, and instrumentation libraries. Different tools serve different architectural patterns and scaling requirements.

OpenTelemetry Collector receives, processes, and exports telemetry data. The collector decouples instrumentation from backend systems—applications send to the collector, which forwards to multiple backends. Collectors aggregate data from many services, batch exports, and handle retry logic.

# collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  
  attributes:
    actions:
      - key: environment
        value: production
        action: insert

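exporters:
  # Recent Collector releases no longer ship the dedicated jaeger exporter;
  # Jaeger accepts OTLP directly
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp/jaeger, otlp/tempo]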

Jaeger provides trace storage, search, and visualization. Jaeger stores traces in Cassandra, Elasticsearch, or memory. The UI shows trace timelines, service dependencies, and operation statistics. Jaeger supports multiple ingestion formats including Zipkin and OpenTelemetry.

Zipkin offers similar capabilities with a simpler architecture. Zipkin stores traces in MySQL, Elasticsearch, or Cassandra. The B3 propagation format originated with Zipkin and remains widely used. Zipkin requires less infrastructure than Jaeger but provides fewer advanced features.

Grafana Tempo stores traces in object storage (S3, GCS) at lower cost than Elasticsearch or Cassandra. Tempo integrates with Grafana for visualization and correlates traces with logs and metrics. Tempo trades query flexibility for storage efficiency—traces are retrieved by ID rather than arbitrary tag searches.

Datadog APM and New Relic provide commercial tracing solutions with automatic instrumentation, anomaly detection, and extensive integrations. These SaaS platforms handle all infrastructure but incur per-span costs at scale.

Ruby-Specific Gems extend tracing capabilities:

| Gem | Purpose | Integration |
|-----|---------|-------------|
| ddtrace | Datadog instrumentation | Rails, Sinatra, Grape, Sidekiq, databases |
| newrelic_rpm | New Relic instrumentation | Rails, background jobs, external services |
| skylight | Application profiling with tracing | Rails, Grape, Sinatra, background jobs |
| scout_apm | Application monitoring | Rails, Sinatra, database queries, external calls |
| elastic-apm | Elastic APM instrumentation | Rails, Sinatra, Sidekiq, HTTP clients |

Trace Context Standards ensure interoperability between tracing systems:

The W3C Trace Context specification defines traceparent and tracestate headers. The traceparent format is version-trace_id-parent_id-trace_flags. Example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01.

# Parsing W3C traceparent header
def parse_traceparent(header)
  return nil unless header
  
  parts = header.split('-')
  return nil unless parts.length == 4
  
  {
    version: parts[0],
    trace_id: parts[1],
    parent_id: parts[2],
    trace_flags: parts[3]
  }
end

# Generating traceparent header
def generate_traceparent(trace_id, span_id, sampled)
  flags = sampled ? '01' : '00'
  "00-#{trace_id}-#{span_id}-#{flags}"
end
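
Splitting on hyphens accepts malformed headers, so a stricter check is often useful before trusting incoming context. The sketch below validates the version-00 layout described above: lowercase hex fields of fixed width, no all-zero IDs, and no version `ff`. The helper names `valid_traceparent?` and `sampled_flag?` are illustrative.

```ruby
# version(2)-trace_id(32)-parent_id(16)-flags(2), lowercase hex only
TRACEPARENT_PATTERN = /\A([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})\z/.freeze

def valid_traceparent?(header)
  match = TRACEPARENT_PATTERN.match(header.to_s)
  return false unless match

  version, trace_id, parent_id, _flags = match.captures
  return false if version == 'ff'                               # reserved, invalid version
  return false if trace_id == '0' * 32 || parent_id == '0' * 16 # all-zero IDs are invalid

  true
end

# The lowest bit of the flags field is the sampled flag.
def sampled_flag?(header)
  return false unless valid_traceparent?(header)

  (header.split('-').last.hex & 0x01) == 1
end

example = '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01'
puts valid_traceparent?(example)
puts sampled_flag?(example)
```

A service that receives an invalid header would typically start a fresh trace rather than propagate the bad value.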

The B3 propagation format uses separate headers: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, and X-B3-Sampled. The single-header B3 format combines these into one header named b3: trace_id-span_id-sampled-parent_span_id, where the sampled and parent_span_id fields are optional.
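
A small parser illustrates the single-header layout. This is a sketch under the assumption that the trace ID is 16 or 32 hex characters, the span ID is 16, and a sampling state of `1` or `d` (debug) means sampled; `parse_b3_single` is a hypothetical helper.

```ruby
B3_TRACE_ID = /\A(?:\h{16}|\h{32})\z/.freeze
B3_SPAN_ID  = /\A\h{16}\z/.freeze

# Parses "trace_id-span_id[-sampling_state[-parent_span_id]]".
# Returns nil for values that do not carry a usable trace and span ID.
def parse_b3_single(header)
  trace_id, span_id, sampling_state, parent_span_id = header.to_s.strip.split('-')
  return nil unless B3_TRACE_ID.match?(trace_id.to_s) && B3_SPAN_ID.match?(span_id.to_s)

  {
    trace_id: trace_id,
    span_id: span_id,
    # nil means the decision is deferred to the receiver
    sampled: sampling_state && %w[1 d].include?(sampling_state),
    parent_span_id: parent_span_id
  }
end

puts parse_b3_single('80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1').inspect
```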

Integration & Interoperability

Distributed tracing requires integration across multiple layers: HTTP frameworks, database clients, message queues, RPC systems, and third-party services. Each integration point must propagate context and create meaningful spans.

Rails Integration instruments controllers, views, and Active Record automatically. Additional instrumentation captures middleware, caching, and mailers.

# config/initializers/opentelemetry.rb
# Configure in an initializer so instrumentation is installed while Rails boots,
# before the middleware stack is finalized
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'rails-app'
  c.use_all
end

# Manual instrumentation in controller
class UsersController < ApplicationController
  def create
    tracer = OpenTelemetry.tracer_provider.tracer('users_controller')
    
    tracer.in_span('validate_and_create_user') do |span|
      span.set_attribute('user.role', params[:role])
      
      @user = User.new(user_params)
      
      if @user.save
        span.set_attribute('user.id', @user.id)
        render json: @user, status: :created
      else
        span.set_attribute('error', true)
        span.add_event('validation_failed', attributes: {
          'errors' => @user.errors.full_messages.join(', ')
        })
        render json: @user.errors, status: :unprocessable_entity
      end
    end
  end
end

Database Integration traces SQL queries with query text, duration, and connection information. Sensitive data in queries should be sanitized before adding to spans.

# Active Record instrumentation captures queries automatically
class UserService
  def find_active_users(min_orders)
    tracer.in_span('query_active_users', attributes: {
      'db.system' => 'postgresql',
      'db.operation' => 'SELECT'
    }) do |span|
      users = User.joins(:orders)
        .where('orders.created_at > ?', 30.days.ago)
        .group('users.id')
        .having('COUNT(orders.id) >= ?', min_orders)
      
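      # On a grouped relation, count returns a hash of per-group counts;
      # length loads the rows and returns an integer
      span.set_attribute('db.row_count', users.length)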
      users
    end
  end
end

Message Queue Integration propagates context through message metadata. Producers inject context, consumers extract it and create continuation spans.

# RabbitMQ integration with Bunny
class OrderEventPublisher
  def publish_order_created(order)
    tracer.in_span('publish_order_created', attributes: {
      'messaging.system' => 'rabbitmq',
      'messaging.destination' => 'orders.created'
    }) do |span|
      headers = {}
      OpenTelemetry.propagation.inject(headers)
      
      channel.default_exchange.publish(
        order.to_json,
        routing_key: 'orders.created',
        headers: headers,
        persistent: true
      )
      
      span.set_attribute('order.id', order.id)
    end
  end
end

class OrderEventConsumer
  def handle_message(delivery_info, metadata, payload)
    # Extract trace context from message headers
    context = OpenTelemetry.propagation.extract(metadata.headers || {})
    
    OpenTelemetry::Context.with_current(context) do
      tracer.in_span('process_order_created', attributes: {
        'messaging.system' => 'rabbitmq',
        'messaging.source' => delivery_info.routing_key
      }) do |span|
        order_data = JSON.parse(payload)
        span.set_attribute('order.id', order_data['id'])
        
        process_order(order_data)
      end
    end
  end
end

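gRPC Integration propagates context through call metadata. The grpc gem does not do this on its own; an instrumentation gem (opentelemetry-instrumentation-grpc in the example below) or custom interceptors handle injection and extraction.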

require 'grpc'
require 'opentelemetry/instrumentation/grpc'

# Client-side propagation
class UserServiceClient
  def get_user(user_id)
    tracer.in_span('grpc.user_service.get_user', attributes: {
      'rpc.system' => 'grpc',
      'rpc.service' => 'UserService',
      'rpc.method' => 'GetUser'
    }) do |span|
      stub = UserService::Stub.new('localhost:50051', :this_channel_is_insecure)
      
      # Context automatically injected by instrumentation
      response = stub.get_user(GetUserRequest.new(id: user_id))
      
      span.set_attribute('rpc.grpc.status_code', 0)
      response
    rescue GRPC::BadStatus => e
      span.set_attribute('error', true)
      span.set_attribute('rpc.grpc.status_code', e.code)
      raise
    end
  end
end

External API Integration requires propagating context to services outside your control. Many APIs ignore trace headers but recording the outbound call still provides value.

class StripeService
  def charge_customer(customer_id, amount)
    tracer.in_span('stripe.charge', attributes: {
      'external.service' => 'stripe',
      'payment.amount' => amount,
      'payment.currency' => 'usd'
    }) do |span|
      headers = { 'Idempotency-Key' => SecureRandom.uuid }
      OpenTelemetry.propagation.inject(headers)
      
      response = Stripe::Charge.create(
        {
          customer: customer_id,
          amount: amount,
          currency: 'usd'
        },
        headers
      )
      
      span.set_attribute('payment.id', response.id)
      span.set_attribute('payment.status', response.status)
      response
    end
  end
end

Real-World Applications

Production distributed tracing deployments balance comprehensive coverage against performance overhead and cost. Real-world implementations require careful sampling strategies, data retention policies, and integration with incident response workflows.

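High-Traffic API Platform serving 100,000 requests per second combines two sampling stages. Head-based sampling in each service decides from attributes known at request start, such as keeping 5% of authenticated requests and 1% of the rest. It cannot single out errors or slow requests, because neither is known when the trace starts. Tail-based sampling in the collector examines completed traces and keeps those with errors or high latency.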

# Custom head sampler with attribute-dependent rates.
# Errors and latency are unknown when a span starts, so those traces are
# selected by tail-based sampling in the collector instead.
class AdaptiveSampler
  SAMPLERS = OpenTelemetry::SDK::Trace::Samplers

  def initialize(default_rate: 0.01, authenticated_rate: 0.05)
    # Delegate the trace-ID arithmetic to the SDK's ratio samplers
    @default = SAMPLERS.trace_id_ratio_based(default_rate)             # 1%
    @authenticated = SAMPLERS.trace_id_ratio_based(authenticated_rate) # 5%
  end

  def description
    'AdaptiveSampler'
  end

  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    # Sample authenticated requests at the higher rate
    authenticated = attributes && attributes['user.authenticated']
    sampler = authenticated ? @authenticated : @default

    sampler.should_sample?(
      trace_id: trace_id, parent_context: parent_context, links: links,
      name: name, kind: kind, attributes: attributes
    )
  end
end
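
A tail-based decision, by contrast, looks at a completed trace. The following is a minimal sketch of such a policy over plain span hashes; the field names (`:error`, `:duration`, `:parent_span_id`), the one-second threshold, and the `keep_trace?` helper are assumptions for illustration, not collector configuration.

```ruby
# Decide whether to keep a finished trace.
# spans: array of hashes with :parent_span_id, :duration (seconds), :error.
def keep_trace?(spans, slow_threshold: 1.0, baseline_rate: 0.01, rng: Random.new)
  # Keep every trace that contains a failed span
  return true if spans.any? { |span| span[:error] }

  # Keep traces whose root span exceeded the latency threshold
  root = spans.find { |span| span[:parent_span_id].nil? }
  return true if root && root[:duration] > slow_threshold

  # Otherwise keep a small random baseline of healthy traces
  rng.rand < baseline_rate
end

failed = [{ parent_span_id: nil, duration: 0.2, error: false },
          { parent_span_id: 'a', duration: 0.1, error: true }]
slow   = [{ parent_span_id: nil, duration: 2.5, error: false }]

puts keep_trace?(failed)
puts keep_trace?(slow)
```

The cost of this approach is that every span of a trace must be buffered until the trace completes, which is why it usually runs in the collector rather than in each service.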

Microservices Platform with 50 services uses distributed tracing to identify cascading failures. When service A slows down, tracing reveals which downstream services cause the latency. The platform automatically creates service dependency graphs from trace data.

# Service dependency tracking
class DependencyAnalyzer
  def analyze_trace(trace)
    dependencies = {}
    
    trace.spans.each do |span|
      next unless span.attributes['peer.service']
      
      caller = span.attributes['service.name']
      callee = span.attributes['peer.service']
      
      dependencies[caller] ||= {}
      dependencies[caller][callee] ||= {
        call_count: 0,
        total_duration: 0.0,
        error_count: 0
      }
      
      dependencies[caller][callee][:call_count] += 1
      dependencies[caller][callee][:total_duration] += span.duration
      dependencies[caller][callee][:error_count] += 1 if span.attributes['error']
    end
    
    dependencies
  end
end

E-Commerce Platform correlates traces with business metrics. Each trace includes order value, customer segment, and AB test variant. This enables analyzing performance by business dimension—high-value customer checkout latency versus low-value customer latency.

# Business context propagation
class CheckoutController < ApplicationController
  def create
    tracer.in_span('checkout', attributes: {
      'cart.value' => current_cart.total,
      'cart.item_count' => current_cart.items.count,
      'customer.segment' => current_user.segment,
      'experiment.checkout_flow' => current_experiment_variant
    }) do |span|
      
      order = tracer.in_span('create_order') do
        Order.create_from_cart!(current_cart)
      end
      
      span.set_attribute('order.id', order.id)
      span.set_attribute('order.total', order.total)
      
      # Business event logging (attribute keys must be strings, values primitives)
      span.add_event('order_created', attributes: {
        'revenue' => order.total.to_f,
        'items' => order.items.count,
        'payment_method' => order.payment_method
      })
      
      redirect_to order_path(order)
    end
  end
end

SaaS Application implements tenant-aware tracing. Each span includes tenant ID, plan tier, and feature flags. This isolates performance issues to specific tenants and identifies which features cause slowdowns.

# Tenant context middleware
class TenantTracingMiddleware
  def call(env)
    tenant = env['current_tenant']
    
    tracer.in_span('request', attributes: {
      'tenant.id' => tenant.id,
      'tenant.plan' => tenant.plan,
      'tenant.created_at' => tenant.created_at.iso8601
    }) do |span|
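      # Add tenant features as baggage. set_value returns a new context, which
      # must be made current to reach downstream calls; baggage values are strings
      context = OpenTelemetry::Baggage.set_value(
        'tenant.feature.advanced_analytics',
        tenant.feature_enabled?(:advanced_analytics).to_s
      )

      status, headers, body = OpenTelemetry::Context.with_current(context) do
        @app.call(env)
      end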
      
      span.set_attribute('http.status_code', status)
      [status, headers, body]
    end
  end
end

Background Processing System traces job execution across multiple workers. Parent-child relationships show which jobs spawn additional work. Trace analysis identifies jobs that trigger cascading job creation.

# Job execution tracing
class TracedJob
  include Sidekiq::Job
  
  def perform(*args)
    tracer.in_span("job.#{self.class.name}", attributes: {
      'job.class' => self.class.name,
      'job.args' => args.inspect,
      'job.id' => jid,
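      # Sidekiq jobs have no queue_name method; read the queue from the job options
      'job.queue' => self.class.get_sidekiq_options['queue'].to_s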
    }) do |span|
      start_time = Time.now
      
      begin
        result = execute(*args)
        
        span.set_attribute('job.duration', Time.now - start_time)
        span.set_attribute('job.status', 'success')
        result
      rescue => e
        span.record_exception(e)
        span.set_attribute('job.status', 'failed')
        span.set_attribute('job.error', e.class.name)
        raise
      end
    end
  end
end

Common Pitfalls

Distributed tracing implementations face recurring problems that reduce effectiveness or create operational issues. Understanding these pitfalls prevents wasted effort and data quality problems.

Context Propagation Failures occur when trace context does not cross service boundaries. Missing instrumentation for HTTP clients or message queues breaks trace continuity. Each service receives a new trace ID instead of continuing the parent trace.

# WRONG: Context not propagated
def call_external_service(data)
  # Missing context injection - creates new trace
  uri = URI('https://api.example.com/process')
  response = Net::HTTP.post(uri, data.to_json, 'Content-Type' => 'application/json')
  JSON.parse(response.body)
end

# CORRECT: Context propagated
def call_external_service(data)
  tracer.in_span('external_api_call') do |span|
    uri = URI('https://api.example.com/process')
    request = Net::HTTP::Post.new(uri)
    request['Content-Type'] = 'application/json'
    
    # Inject context into headers
    OpenTelemetry.propagation.inject(request)
    
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
      http.request(request, data.to_json)
    end
    
    span.set_attribute('http.status_code', response.code.to_i)
    JSON.parse(response.body)
  end
end

Excessive Span Creation generates too many spans per trace, overwhelming storage and making traces hard to analyze. Creating spans for every method call or loop iteration produces megabyte-sized traces. Limit spans to significant operations—HTTP requests, database queries, external service calls.

# WRONG: Too many spans
def process_items(items)
  tracer.in_span('process_items') do
    items.each do |item|
      # Creating 1000s of spans for each item
      tracer.in_span("process_item_#{item.id}") do
        tracer.in_span('validate') do
          validate(item)
        end
        tracer.in_span('transform') do
          transform(item)
        end
        tracer.in_span('save') do
          save(item)
        end
      end
    end
  end
end

# CORRECT: Aggregate spans
def process_items(items)
  tracer.in_span('process_items', attributes: {
    'items.count' => items.count
  }) do |span|
    processed = 0
    errors = 0
    
    items.each do |item|
      # Process without individual spans
      validate(item)
      transform(item)
      save(item)
      processed += 1
    rescue => e
      errors += 1
    end
    
    span.set_attribute('items.processed', processed)
    span.set_attribute('items.errors', errors)
  end
end

Sensitive Data Exposure adds passwords, API keys, or personal information to spans. Span data typically has wider access than application databases. Sanitize attributes before adding them to spans.

# WRONG: Sensitive data in span
def authenticate_user(email, password)
  tracer.in_span('authenticate', attributes: {
    'user.email' => email,
    'user.password' => password  # NEVER include passwords
  }) do
    User.authenticate(email, password)
  end
end

# CORRECT: Sanitized attributes
def authenticate_user(email, password)
  tracer.in_span('authenticate', attributes: {
    'user.email' => email.sub(/\A[^@]+/, '***')  # Redact the local part, keep the domain
  }) do |span|
    user = User.authenticate(email, password)
    span.set_attribute('user.id', user.id) if user  # Attribute values cannot be nil
    span.set_attribute('auth.success', !user.nil?)
    user
  end
end
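
Sanitizing can also be centralized in one helper that every call site passes its attributes through. This is a sketch only: the key pattern, the choice to hash email values, and the name `sanitize_attributes` are assumptions, and a truncated hash is a pseudonym rather than true anonymization.

```ruby
require 'digest'

# Keys matching this pattern are dropped entirely.
SENSITIVE_KEY_PATTERN = /password|secret|token|api[_.-]?key|card/i.freeze

# Returns a copy of the attributes that is safer to attach to a span.
def sanitize_attributes(attributes)
  attributes.each_with_object({}) do |(key, value), clean|
    next if key.match?(SENSITIVE_KEY_PATTERN)

    clean[key] =
      if key.end_with?('email')
        # Replace the address with a stable pseudonym so requests can still be grouped
        Digest::SHA256.hexdigest(value.to_s.downcase)[0, 16]
      else
        value
      end
  end
end

raw = { 'user.email' => 'Alice@example.com', 'user.password' => 'hunter2', 'http.method' => 'POST' }
puts sanitize_attributes(raw).inspect
```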

Sampling Inconsistency creates incomplete traces when different services use different sampling decisions. Head-based sampling must propagate the sampling decision to all downstream services. The trace flags in context indicate whether the trace is sampled.

# Respecting parent sampling decision
def process_with_context(parent_context)
  # The sampled flag travels in the parent's span context
  parent_span = OpenTelemetry::Trace.current_span(parent_context)
  sampled = parent_span.context.trace_flags.sampled?
  OpenTelemetry.logger.debug("parent sampled: #{sampled}")

  # With a parent-based sampler configured once on the tracer provider,
  # spans started under the parent context inherit its decision
  OpenTelemetry::Context.with_current(parent_context) do
    tracer.in_span('child_operation') do |span|
      # This span is sampled only if the parent was sampled
      perform_work
    end
  end
end

Missing Error Information creates spans that finish without any error indication even though an exception occurred. OpenTelemetry's in_span records an escaping exception and sets error status automatically, but spans managed manually with start_span and finish do not. Always record exceptions and set error status when operations fail.

# WRONG: Manually managed span hides the failure
def fetch_user(id)
  span = tracer.start_span('fetch_user')
  User.find(id)
ensure
  span&.finish  # Exception escapes, span finishes without error indication
end

# CORRECT: Exception recorded
def fetch_user(id)
  span = tracer.start_span('fetch_user', attributes: { 'user.id' => id })
  User.find(id)
rescue ActiveRecord::RecordNotFound => e
  span&.record_exception(e)
  span&.status = OpenTelemetry::Trace::Status.error('user not found')
  raise
ensure
  span&.finish
end

Clock Skew Problems distort traces when services have unsynchronized clocks. Spans appear in wrong order or have negative durations. Use NTP synchronization across all services or rely on relative timing within services.

Cardinality Explosions occur when high-cardinality values become span attributes. User IDs, request IDs, and timestamps create millions of unique attribute combinations, overwhelming trace storage indexes.

# WRONG: High cardinality attribute
tracer.in_span('query_users', attributes: {
  'query.timestamp' => Time.now.to_f,  # Unique every time
  'query.request_id' => SecureRandom.uuid  # Unique every time
}) do
  # ...
end

# CORRECT: Bounded cardinality
tracer.in_span('query_users', attributes: {
  'query.hour' => Time.now.hour,  # Only 24 values
  'query.type' => 'user_search'  # Fixed set of values
}) do
  # ...
end

Reference

Trace Components

| Component | Description | Cardinality |
|-----------|-------------|-------------|
| Trace | Complete request path across services | One per sampled request |
| Span | Single operation within a service | Multiple per trace |
| Trace ID | Unique identifier for entire trace | 128-bit random value |
| Span ID | Unique identifier for individual span | 64-bit random value |
| Parent Span ID | Links span to its parent | References another span ID |
| Attributes | Key-value metadata on spans | 0 to 128 per span typically |
| Events | Timestamped log entries within span | 0 to many per span |
| Links | Connects spans across trace boundaries | Used for batch processing |

Span Attributes

Standard semantic conventions define common attribute names:

| Attribute | Type | Example | Use Case |
|-----------|------|---------|----------|
| service.name | string | user-service | Service identification |
| http.method | string | GET | HTTP request method |
| http.url | string | /api/users/123 | Request URL |
| http.status_code | integer | 200 | Response status |
| db.system | string | postgresql | Database type |
| db.statement | string | SELECT * FROM users | SQL query |
| db.operation | string | SELECT | Database operation |
| messaging.system | string | rabbitmq | Message queue type |
| messaging.destination | string | orders.created | Queue or topic name |
| rpc.system | string | grpc | RPC framework |
| error | boolean | true | Error occurred |
| peer.service | string | payment-service | Called service name |

Context Propagation Formats

| Format | Header | Structure | Example |
|--------|--------|-----------|---------|
| W3C Trace Context | traceparent | version-traceid-parentid-flags | 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 |
| W3C Trace State | tracestate | vendor specific state | congo=t61rcWkgMzE |
| B3 Multi | X-B3-TraceId, X-B3-SpanId | Separate headers | TraceId: 80f198ee56343ba864fe8b2a57d3eff7 |
| B3 Single | b3 | traceid-spanid-sampled | 80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1 |
| Jaeger | uber-trace-id | traceid:spanid:parentid:flags | 5af7651916cd43dd8448eb211c80319c:b7ad6b7169203331:0:1 |

Sampling Strategies

| Strategy | Decision Point | Use Case | Trade-off |
|----------|----------------|----------|-----------|
| Probabilistic | Trace start | Consistent load | May miss errors |
| Rate Limiting | Trace start | Cost control | May miss bursts |
| Parent-based | Propagated | Cross-service consistency | Depends on parent decision |
| Tail-based | Trace completion | Keep interesting traces | Requires buffering all spans |
| Adaptive | Dynamic | Respond to conditions | Complex implementation |
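
Rate limiting is the one strategy in this list that the earlier examples do not show. A token bucket is one common way to implement it; the sketch below is a standalone illustration with a hypothetical `RateLimitingSampler` class and an injectable clock, not an SDK sampler.

```ruby
# Token-bucket sampler sketch: allows at most max_per_second new traces on
# average, with bursts up to the bucket size.
class RateLimitingSampler
  def initialize(max_per_second, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @max = max_per_second.to_f
    @clock = clock
    @tokens = @max
    @last = clock.call
    @mutex = Mutex.new
  end

  # Returns true when a token is available for this trace.
  def sample?
    @mutex.synchronize do
      now = @clock.call
      # Refill in proportion to elapsed time, capped at the bucket size
      @tokens = [@max, @tokens + (now - @last) * @max].min
      @last = now
      return false if @tokens < 1.0

      @tokens -= 1.0
      true
    end
  end
end

sampler = RateLimitingSampler.new(100)
kept = Array.new(1_000) { sampler.sample? }.count(true)
puts "kept #{kept} of 1000 back-to-back requests"
```

The trade-off from the table shows up directly: once the bucket is empty, every request in a burst is dropped until tokens refill, regardless of how interesting it is.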

Span Lifecycle

| Phase | Description | Methods |
|-------|-------------|---------|
| Creation | Span initialized with trace context | start_span |
| Active | Span receiving attributes and events | set_attribute, add_event |
| Completion | Span duration finalized | finish |
| Export | Span sent to backend | Automatic via processor |

Ruby Tracing Configuration

| Option | Purpose | Default |
|--------|---------|---------|
| service_name | Identify service in traces | unknown_service |
| service_version | Track deployments | nil |
| sampler | Control trace sampling | ParentBased(AlwaysOn) |
| span_processors | Process spans before export | BatchSpanProcessor |
| propagators | Context propagation format | TraceContext, Baggage |
| resource_attributes | Service metadata | hostname, process info |

Common Span Kinds

| Kind | Purpose | Direction |
|------|---------|-----------|
| CLIENT | Outbound request initiator | Outgoing |
| SERVER | Request handler | Incoming |
| PRODUCER | Message queue publisher | Outgoing |
| CONSUMER | Message queue subscriber | Incoming |
| INTERNAL | Internal operation | Neither |

Performance Impact

| Instrumentation Type | Overhead | Sampling Recommendation |
|----------------------|----------|-------------------------|
| Automatic HTTP/Database | 1-3% latency | 5-10% sampling |
| Manual business logic | 2-5% latency | Selective instrumentation |
| Message queues | Minimal | 10-25% sampling |
| Background jobs | 1-2% latency | 25-50% sampling |