CrackedRuby - Distributed Tracing

Overview

Distributed tracing tracks individual requests through multiple services in a distributed system. Each request receives a unique identifier that propagates across service boundaries, creating a complete picture of the request's path through the architecture. This trace data includes timing information, service interactions, and contextual metadata.

The concept originated at Google with their Dapper paper in 2010, which described how to instrument large-scale distributed systems with minimal performance overhead. Modern distributed tracing systems build on these principles, standardized through specifications like OpenTelemetry and Zipkin's B3 propagation.

A trace consists of spans—individual units of work representing operations within services. Spans contain timing data, tags, and logs. Parent-child relationships between spans create a directed acyclic graph showing causality and service dependencies.

# Basic trace structure
trace = {
  trace_id: "0af7651916cd43dd8448eb211c80319c", # 128-bit ID as 32 hex characters
  spans: [
    {
      span_id: "span-001",
      parent_span_id: nil,
      operation: "web.request",
      start_time: 1234567890.123,
      duration: 0.450,
      tags: { "http.method" => "GET", "http.url" => "/api/users" }
    },
    {
      span_id: "span-002",
      parent_span_id: "span-001",
      operation: "database.query",
      start_time: 1234567890.200,
      duration: 0.150,
      tags: { "db.type" => "postgresql", "db.statement" => "SELECT * FROM users" }
    }
  ]
}
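
The parent-child links are what turn a flat span list into a readable call tree. The sketch below rebuilds that tree from hashes shaped like the structure above; `span_tree_lines` is a hypothetical helper written for illustration, not part of any tracing library.

```ruby
# Build a parent -> children index from a flat span list and walk it depth-first.
def span_tree_lines(spans)
  children = spans.group_by { |s| s[:parent_span_id] }
  lines = []
  walk = lambda do |parent_id, depth|
    (children[parent_id] || []).sort_by { |s| s[:start_time] }.each do |span|
      millis = (span[:duration] * 1000).round
      lines << "#{'  ' * depth}#{span[:operation]} (#{millis}ms)"
      walk.call(span[:span_id], depth + 1)
    end
  end
  walk.call(nil, 0) # root spans have no parent
  lines
end

spans = [
  { span_id: 'span-002', parent_span_id: 'span-001', operation: 'database.query',
    start_time: 1234567890.200, duration: 0.150 },
  { span_id: 'span-001', parent_span_id: nil, operation: 'web.request',
    start_time: 1234567890.123, duration: 0.450 }
]
puts span_tree_lines(spans)
```

Spans can arrive in any order, so the index is built first and ordering is applied only while walking.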

Distributed tracing solves critical problems in microservices architectures: identifying performance bottlenecks across services, debugging failures that span multiple components, understanding service dependencies, and measuring end-to-end latency. Without tracing, correlating logs and metrics from dozens of services becomes impractical.

Key Principles

Distributed tracing operates on several foundational concepts that determine how trace data gets collected, propagated, and analyzed across service boundaries.

Context Propagation moves trace identifiers across process boundaries. Each service extracts context from incoming requests and injects context into outgoing requests. This propagation occurs through HTTP headers, message queue metadata, or RPC frameworks. The W3C Trace Context specification standardizes this propagation with traceparent and tracestate headers.

# Context propagation in HTTP
class TracingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Extract trace context from the incoming traceparent header
    # (format: version-trace_id-parent_id-flags)
    _version, trace_id, parent_span_id, _flags = env['HTTP_TRACEPARENT'].to_s.split('-')

    # Create span for this service's work
    span = create_span(trace_id: trace_id, parent_span_id: parent_span_id)

    status, headers, body = @app.call(env)

    # Add trace context to the response headers
    headers['traceparent'] = format_traceparent(span)

    [status, headers, body]
  ensure
    span&.finish
  end
end

Sampling determines which requests to trace. Full tracing of every request creates excessive overhead and storage costs. Head-based sampling makes decisions at trace initiation—typically keeping 1-5% of requests. Tail-based sampling examines completed traces and keeps interesting ones (errors, slow requests). Probabilistic sampling selects traces randomly while deterministic sampling uses request attributes.
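
A deterministic head-based decision can be sketched as a pure function of the trace ID, so every service that sees the same ID reaches the same answer without coordination. The helper name `sampled?` and the choice of the low 64 bits are assumptions made for this illustration; real SDK samplers may differ in detail.

```ruby
# Deterministic ratio sampling: keep a trace when the low 64 bits of its ID
# fall below rate * 2^64.
def sampled?(trace_id_hex, rate)
  return false if rate <= 0
  return true if rate >= 1

  value = trace_id_hex[-16, 16].to_i(16) # last 16 hex characters = low 64 bits
  value < (rate * 2**64).floor
end

trace_id = '0af7651916cd43dd8448eb211c80319c'
puts sampled?(trace_id, 0.5)
puts sampled?(trace_id, 0.6)
```

Because the trace ID is random, the fraction of IDs below the threshold approximates the configured rate.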

Span Relationships define how operations relate within a trace. Child spans represent work performed as part of a parent operation. Spans can have multiple children but only one parent, forming a tree structure. Follows-from relationships indicate asynchronous work that continues after the parent completes, such as background jobs or message processing.

Tags and Logs attach metadata to spans. Tags are key-value pairs describing the operation—HTTP status codes, database table names, user IDs. Tags enable filtering and aggregation in analysis tools. Logs represent timestamped events within a span—exceptions thrown, cache hits, retries attempted. Unlike application logs, span logs maintain association with the specific request context.

# Span with tags and logs
span = tracer.start_span('process_payment')
span.set_tag('payment.amount', 99.99)
span.set_tag('payment.currency', 'USD')
span.set_tag('user.id', user_id)

begin
  result = payment_gateway.charge(card, amount)
  span.log_kv(event: 'payment_successful', transaction_id: result.id)
rescue PaymentError => e
  span.set_tag('error', true)
  span.log_kv(event: 'payment_failed', error_message: e.message)
  raise
ensure
  span.finish
end

Baggage carries data through the entire trace. Unlike tags, which stay local to a span, baggage propagates to all downstream services. This enables correlating requests by business context such as tenant ID, feature flags, or experiment variants. Baggage increases network overhead because it travels with every request, so use it sparingly.
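
Because baggage rides along as a request header, its cost is easy to see once it is serialized. The following is a simplified sketch of encoding and decoding a W3C-style `baggage` header; `encode_baggage`, `decode_baggage`, and the 8192-byte cap are illustrative assumptions, and the real specification has more rules (key validation, properties) than are handled here.

```ruby
require 'erb'
require 'uri'

MAX_BAGGAGE_BYTES = 8192 # assumed cap for this sketch

# Serialize key-value pairs as "k1=v1,k2=v2" with percent-encoded values.
def encode_baggage(entries)
  header = entries.map { |k, v| "#{k}=#{ERB::Util.url_encode(v.to_s)}" }.join(',')
  raise ArgumentError, 'baggage too large' if header.bytesize > MAX_BAGGAGE_BYTES

  header
end

# Parse the header back into a Hash of strings, ignoring optional properties.
def decode_baggage(header)
  header.to_s.split(',').each_with_object({}) do |member, out|
    key, value = member.strip.split('=', 2)
    next if key.nil? || value.nil?

    value = value.split(';', 2).first.to_s
    out[key] = URI.decode_www_form_component(value)
  end
end

header = encode_baggage({ 'tenant.id' => 42, 'experiment' => 'beta checkout' })
puts header
puts decode_baggage(header).inspect
```

Every value comes back as a string, which is one reason to keep baggage to a few short identifiers.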

Clock Synchronization affects trace accuracy. Spans from different services may have timestamps from different clocks. Clock skew can make child spans appear to start before parents or last longer than their parents. Tracing systems handle this through clock correction algorithms or by relying on duration measurements rather than absolute timestamps.
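
One simple correction strategy shifts a child span so that it fits inside its parent. The sketch below assumes symmetric network latency and uses integer microseconds; `correct_skew` and the hash keys are hypothetical names, and production backends apply more careful heuristics.

```ruby
# Shift a child span that falls outside its parent's time window.
# Spans are hashes with :start_us and :duration_us (integer microseconds).
def correct_skew(parent, child)
  parent_end = parent[:start_us] + parent[:duration_us]
  child_end = child[:start_us] + child[:duration_us]
  fits = child[:start_us] >= parent[:start_us] && child_end <= parent_end

  # Leave the child alone if it already fits or can never fit.
  return child if fits || child[:duration_us] > parent[:duration_us]

  # Assume equal latency in both directions: center the child in the parent.
  slack = parent[:duration_us] - child[:duration_us]
  child.merge(start_us: parent[:start_us] + slack / 2)
end

parent = { start_us: 1_000_000, duration_us: 450_000 }
child = { start_us: 900_000, duration_us: 150_000 } # appears to start before its parent
puts correct_skew(parent, child).inspect
```

Only the start time moves; the measured duration comes from a single clock and is left unchanged.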

Ruby Implementation

Ruby provides several libraries for distributed tracing, with OpenTelemetry emerging as the standard approach. The opentelemetry-sdk gem implements the OpenTelemetry specification for creating, exporting, and managing traces.

Basic Tracer Setup requires configuring a tracer provider, processors, and exporters. The provider manages tracer instances. Processors transform span data before export. Exporters send spans to backend systems.

require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'user-service'
  c.service_version = '1.2.0'
  
  # Use OTLP exporter for sending to collectors
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(
      OpenTelemetry::Exporter::OTLP::Exporter.new(
        endpoint: 'http://collector:4318/v1/traces', # explicit endpoints need the traces path
        compression: 'gzip'
      )
    )
  )
end

# Configure sampling - keep 10% of traces. The sampler is set on the tracer
# provider (or through the OTEL_TRACES_SAMPLER environment variables).
OpenTelemetry.tracer_provider.sampler =
  OpenTelemetry::SDK::Trace::Samplers.parent_based(
    root: OpenTelemetry::SDK::Trace::Samplers.trace_id_ratio_based(0.1)
  )

Manual Instrumentation creates spans explicitly in application code. This gives precise control over what operations to trace and what attributes to capture.

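# Defined as a method so the tracer is visible inside other methods
# (a top-level local variable would not be)
def tracer
  OpenTelemetry.tracer_provider.tracer('user-service', '1.0')
end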

def process_user_registration(params)
  tracer.in_span('process_registration', attributes: {
    'user.email' => params[:email],
    'user.plan' => params[:plan]
  }) do |span|
    
    # Validate input
    tracer.in_span('validate_input') do
      validate_registration_params(params)
    end
    
    # Create user record
    user = tracer.in_span('database.create_user', attributes: {
      'db.system' => 'postgresql',
      'db.operation' => 'INSERT'
    }) do |db_span|
      User.create!(params)
    end
    
    # Send welcome email (async operation)
    tracer.in_span('queue.enqueue_email', attributes: {
      'messaging.system' => 'sidekiq',
      'messaging.destination' => 'emails'
    }) do
      WelcomeEmailJob.perform_async(user.id)
    end
    
    span.set_attribute('user.id', user.id)
    user
  rescue ValidationError => e
    span.record_exception(e)
    span.set_attribute('error', true)
    raise
  end
end

Automatic Instrumentation instruments common libraries and frameworks without code changes. OpenTelemetry provides instrumentation gems for Rack, Rails, Sinatra, Faraday, Net::HTTP, and database adapters.

require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all' # from the opentelemetry-instrumentation-all gem

OpenTelemetry::SDK.configure do |c|
  c.use_all # Enable all available instrumentations

  # Or, instead of use_all, selectively enable instrumentations
  # c.use 'OpenTelemetry::Instrumentation::Rack'
  # c.use 'OpenTelemetry::Instrumentation::Rails'
  # c.use 'OpenTelemetry::Instrumentation::ActiveRecord'
  # c.use 'OpenTelemetry::Instrumentation::Faraday'
end

Context Propagation in HTTP Clients requires injecting trace context into outgoing requests. Instrumented libraries handle this automatically, but manual HTTP clients need explicit context injection.

require 'json'
require 'net/http'
require 'opentelemetry/sdk'

def fetch_user_data(user_id)
  uri = URI("https://api.example.com/users/#{user_id}")
  
  tracer.in_span('http.request', attributes: {
    'http.method' => 'GET',
    'http.url' => uri.to_s
  }) do |span|
    request = Net::HTTP::Get.new(uri)
    
    # Inject trace context into headers
    OpenTelemetry.propagation.inject(request)
    
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
      http.request(request)
    end
    
    span.set_attribute('http.status_code', response.code.to_i)
    JSON.parse(response.body)
  end
end

Background Job Instrumentation traces asynchronous work by propagating context through job queues. Sidekiq instrumentation automatically handles this propagation.

class ProcessOrderJob
  include Sidekiq::Job
  
  def perform(order_id)
    # Trace context automatically extracted from job metadata
    tracer.in_span('process_order', attributes: {
      'order.id' => order_id
    }) do |span|
      order = Order.find(order_id)
      
      tracer.in_span('payment.charge') do
        charge_payment(order)
      end
      
      tracer.in_span('inventory.reserve') do
        reserve_inventory(order)
      end
      
      tracer.in_span('shipping.create_label') do
        create_shipping_label(order)
      end
      
      span.set_attribute('order.status', order.status)
    end
  end
end

# Enqueuing jobs preserves trace context
tracer.in_span('enqueue_order_processing') do
  ProcessOrderJob.perform_async(order.id)
end

Custom Span Processors filter or transform spans before export. This enables removing sensitive data, adding environment-specific tags, or implementing custom sampling logic.

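class SensitiveDataProcessor < OpenTelemetry::SDK::Trace::SpanProcessor
  SENSITIVE_ATTRS = ['user.password', 'credit_card.number', 'api_key'].freeze

  def on_start(span, parent_context)
    # The span is still writable here, so redact sensitive attributes that
    # were passed at span creation. Attributes added later are not covered.
    SENSITIVE_ATTRS.each do |attr|
      span.set_attribute(attr, '[REDACTED]') if span.attributes&.key?(attr)
    end

    # Add environment tag to all spans
    span.set_attribute('environment', ENV.fetch('RACK_ENV', 'development'))
  end

  def on_finish(span)
    # Finished spans are read-only, so nothing can be changed at this point
  end

  def force_flush(timeout: nil)
    OpenTelemetry::SDK::Trace::Export::SUCCESS
  end

  def shutdown(timeout: nil)
    OpenTelemetry::SDK::Trace::Export::SUCCESS
  end
end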

OpenTelemetry::SDK.configure do |c|
  c.add_span_processor(SensitiveDataProcessor.new)
end

Tools & Ecosystem

The distributed tracing ecosystem includes collection agents, storage backends, analysis interfaces, and instrumentation libraries. Different tools serve different architectural patterns and scaling requirements.

OpenTelemetry Collector receives, processes, and exports telemetry data. The collector decouples instrumentation from backend systems—applications send to the collector, which forwards to multiple backends. Collectors aggregate data from many services, batch exports, and handle retry logic.

# collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  
  attributes:
    actions:
      - key: environment
        value: production
        action: insert

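exporters:
  # Jaeger accepts OTLP natively; the dedicated jaeger exporter is no longer
  # shipped with current collector releases
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/jaeger, otlp/tempo]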

Jaeger provides trace storage, search, and visualization. Jaeger stores traces in Cassandra, Elasticsearch, or memory. The UI shows trace timelines, service dependencies, and operation statistics. Jaeger supports multiple ingestion formats including Zipkin and OpenTelemetry.

Zipkin offers similar capabilities with a simpler architecture. Zipkin stores traces in MySQL, Elasticsearch, or Cassandra. The B3 propagation format originated with Zipkin and remains widely used. Zipkin requires less infrastructure than Jaeger but provides fewer advanced features.

Grafana Tempo stores traces in object storage (S3, GCS) at lower cost than Elasticsearch or Cassandra. Tempo integrates with Grafana for visualization and correlates traces with logs and metrics. Tempo trades query flexibility for storage efficiency: it was designed around retrieving traces by ID, although recent versions add attribute search through the TraceQL query language.

Datadog APM and New Relic provide commercial tracing solutions with automatic instrumentation, anomaly detection, and extensive integrations. These SaaS platforms handle all infrastructure but incur per-span costs at scale.

Ruby-Specific Gems extend tracing capabilities:

Gem	Purpose	Integration
ddtrace	Datadog instrumentation	Rails, Sinatra, Grape, Sidekiq, databases
newrelic_rpm	New Relic instrumentation	Rails, background jobs, external services
skylight	Application profiling with tracing	Rails, Grape, Sinatra, background jobs
scout_apm	Application monitoring	Rails, Sinatra, database queries, external calls
elastic-apm	Elastic APM instrumentation	Rails, Sinatra, Sidekiq, HTTP clients

Trace Context Standards ensure interoperability between tracing systems:

The W3C Trace Context specification defines traceparent and tracestate headers. The traceparent format is version-trace_id-parent_id-trace_flags. Example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01.

# Parsing W3C traceparent header
def parse_traceparent(header)
  return nil unless header
  
  parts = header.split('-')
  return nil unless parts.length == 4
  
  {
    version: parts[0],
    trace_id: parts[1],
    parent_id: parts[2],
    trace_flags: parts[3]
  }
end

# Generating traceparent header
def generate_traceparent(trace_id, span_id, sampled)
  flags = sampled ? '01' : '00'
  "00-#{trace_id}-#{span_id}-#{flags}"
end
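
The parser above accepts any four dash-separated fields. A stricter variant, sketched here under the assumption that only version 00 style headers need to be handled, also checks field lengths, rejects all-zero IDs, and decodes the sampled bit. `parse_traceparent_strict` is a hypothetical name.

```ruby
# Lowercase hex only: 2-char version, 32-char trace ID, 16-char parent ID, 2-char flags.
TRACEPARENT_PATTERN = /\A([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})\z/

def parse_traceparent_strict(header)
  match = TRACEPARENT_PATTERN.match(header.to_s.strip)
  return nil unless match

  version, trace_id, parent_id, flags = match.captures
  return nil if version == 'ff' # reserved, always invalid
  return nil if trace_id == '0' * 32 || parent_id == '0' * 16

  {
    version: version,
    trace_id: trace_id,
    parent_id: parent_id,
    sampled: (flags.to_i(16) & 0x01) == 1 # lowest bit of the flags byte
  }
end

puts parse_traceparent_strict('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01').inspect
```

Returning nil for malformed input lets the caller fall back to starting a new trace instead of propagating a broken one.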

The B3 propagation format uses separate headers: X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId, and X-B3-Sampled. The single-header B3 format combines these into one header named b3: trace_id-span_id-sampling_state-parent_span_id, where the last two fields are optional.
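
A single-header B3 value can be parsed along the same lines as traceparent. This is a minimal sketch; `parse_b3_single` is a hypothetical helper, and it assumes the sampling state is one of 0, 1, or d (debug).

```ruby
def parse_b3_single(header)
  value = header.to_s.strip
  # A lone sampling state is valid without any IDs.
  return { sampled: value != '0', debug: value == 'd' } if %w[0 1 d].include?(value)

  trace_id, span_id, state, parent_span_id = value.split('-')
  return nil unless trace_id.to_s.match?(/\A(\h{16}|\h{32})\z/)
  return nil unless span_id.to_s.match?(/\A\h{16}\z/)

  {
    trace_id: trace_id,
    span_id: span_id,
    sampled: state.nil? ? nil : state != '0', # nil means the decision is deferred
    debug: state == 'd',
    parent_span_id: parent_span_id
  }
end

puts parse_b3_single('80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1').inspect
```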

Integration & Interoperability

Distributed tracing requires integration across multiple layers: HTTP frameworks, database clients, message queues, RPC systems, and third-party services. Each integration point must propagate context and create meaningful spans.

Rails Integration instruments controllers, views, and Active Record automatically. Additional instrumentation captures middleware, caching, and mailers.

# config/application.rb
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/all'

module MyApp
  class Application < Rails::Application
    config.after_initialize do
      OpenTelemetry::SDK.configure do |c|
        c.service_name = 'rails-app'
        c.use_all
      end
    end
  end
end

# Manual instrumentation in controller
class UsersController < ApplicationController
  def create
    tracer = OpenTelemetry.tracer_provider.tracer('users_controller')
    
    tracer.in_span('validate_and_create_user') do |span|
      span.set_attribute('user.role', params[:role])
      
      @user = User.new(user_params)
      
      if @user.save
        span.set_attribute('user.id', @user.id)
        render json: @user, status: :created
      else
        span.set_attribute('error', true)
        span.add_event('validation_failed', attributes: {
          'errors' => @user.errors.full_messages.join(', ')
        })
        render json: @user.errors, status: :unprocessable_entity
      end
    end
  end
end

Database Integration traces SQL queries with query text, duration, and connection information. Sensitive data in queries should be sanitized before adding to spans.

# Active Record instrumentation captures queries automatically
class UserService
  def find_active_users(min_orders)
    tracer.in_span('query_active_users', attributes: {
      'db.system' => 'postgresql',
      'db.operation' => 'SELECT'
    }) do |span|
      users = User.joins(:orders)
        .where('orders.created_at > ?', 30.days.ago)
        .group('users.id')
        .having('COUNT(orders.id) >= ?', min_orders)
        .to_a # Load once; count on a grouped relation returns a Hash

      span.set_attribute('db.row_count', users.size)
      users
    end
  end
end

Message Queue Integration propagates context through message metadata. Producers inject context, consumers extract it and create continuation spans.

# RabbitMQ integration with Bunny
class OrderEventPublisher
  def publish_order_created(order)
    tracer.in_span('publish_order_created', attributes: {
      'messaging.system' => 'rabbitmq',
      'messaging.destination' => 'orders.created'
    }) do |span|
      headers = {}
      OpenTelemetry.propagation.inject(headers)
      
      channel.default_exchange.publish(
        order.to_json,
        routing_key: 'orders.created',
        headers: headers,
        persistent: true
      )
      
      span.set_attribute('order.id', order.id)
    end
  end
end

class OrderEventConsumer
  def handle_message(delivery_info, metadata, payload)
    # Extract trace context from message headers
    context = OpenTelemetry.propagation.extract(metadata.headers || {})
    
    OpenTelemetry::Context.with_current(context) do
      tracer.in_span('process_order_created', attributes: {
        'messaging.system' => 'rabbitmq',
        'messaging.source' => delivery_info.routing_key
      }) do |span|
        order_data = JSON.parse(payload)
        span.set_attribute('order.id', order_data['id'])
        
        process_order(order_data)
      end
    end
  end
end

gRPC Integration propagates context through call metadata. The OpenTelemetry gRPC instrumentation gem handles context injection and extraction automatically.

require 'grpc'
require 'opentelemetry/instrumentation/grpc'

# Client-side propagation
class UserServiceClient
  def get_user(user_id)
    tracer.in_span('grpc.user_service.get_user', attributes: {
      'rpc.system' => 'grpc',
      'rpc.service' => 'UserService',
      'rpc.method' => 'GetUser'
    }) do |span|
      stub = UserService::Stub.new('localhost:50051', :this_channel_is_insecure)
      
      # Context automatically injected by instrumentation
      response = stub.get_user(GetUserRequest.new(id: user_id))
      
      span.set_attribute('rpc.grpc.status_code', 0)
      response
    rescue GRPC::BadStatus => e
      span.set_attribute('error', true)
      span.set_attribute('rpc.grpc.status_code', e.code)
      raise
    end
  end
end

External API Integration requires propagating context to services outside your control. Many APIs ignore trace headers but recording the outbound call still provides value.

class StripeService
  def charge_customer(customer_id, amount)
    tracer.in_span('stripe.charge', attributes: {
      'external.service' => 'stripe',
      'payment.amount' => amount,
      'payment.currency' => 'usd'
    }) do |span|
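      # Stripe ignores trace headers, so no context is injected; the span
      # still records the outbound call. Request options such as the
      # idempotency key go in the second argument.
      response = Stripe::Charge.create(
        {
          customer: customer_id,
          amount: amount,
          currency: 'usd'
        },
        { idempotency_key: SecureRandom.uuid }
      )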
      
      span.set_attribute('payment.id', response.id)
      span.set_attribute('payment.status', response.status)
      response
    end
  end
end

Real-World Applications

Production distributed tracing deployments balance comprehensive coverage against performance overhead and cost. Real-world implementations require careful sampling strategies, data retention policies, and integration with incident response workflows.

High-Traffic API Platform serving 100,000 requests per second implements adaptive sampling. Head-based sampling keeps a 1% baseline of requests, with a higher rate for authenticated traffic. Because errors and latency are unknown when a trace starts, tail-based sampling in the collector keeps 100% of traces that contain errors or slow requests.

# Custom head sampler for high-value traces
# Samplers are duck-typed: they need should_sample? and description.
class AdaptiveSampler
  Samplers = OpenTelemetry::SDK::Trace::Samplers

  def initialize(default_rate: 0.01, authenticated_rate: 0.05)
    @default = Samplers.trace_id_ratio_based(default_rate)             # 1%
    @authenticated = Samplers.trace_id_ratio_based(authenticated_rate) # 5%
  end

  def description
    'AdaptiveSampler'
  end

  def should_sample?(trace_id:, parent_context:, links:, name:, kind:, attributes:)
    # Only attributes supplied at span creation are visible here; errors and
    # durations are not known yet and belong to tail-based sampling.
    authenticated = attributes && attributes['user.authenticated']
    sampler = authenticated ? @authenticated : @default

    # Delegate so the result is a proper sampling Result object
    sampler.should_sample?(
      trace_id: trace_id, parent_context: parent_context, links: links,
      name: name, kind: kind, attributes: attributes
    )
  end
end
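
The tail-based half of this strategy decides after the whole trace is available. Collectors implement this with buffering and configurable policies; the sketch below only illustrates the decision rule, with `keep_trace?`, the span hash keys, and the thresholds all being assumptions for the example.

```ruby
# Decide whether to keep a completed trace.
# spans: array of hashes with :parent_span_id, :duration (seconds), optional :error.
def keep_trace?(trace_id_hex, spans, slow_threshold: 1.0, baseline_rate: 0.01)
  # Keep every trace that contains an error.
  return true if spans.any? { |s| s[:error] }

  # Keep every trace whose root span exceeded the latency threshold.
  root = spans.find { |s| s[:parent_span_id].nil? }
  return true if root && root[:duration] > slow_threshold

  # Otherwise keep a deterministic baseline based on the trace ID.
  trace_id_hex[-16, 16].to_i(16) < (baseline_rate * 2**64).floor
end

fast = [{ parent_span_id: nil, duration: 0.2 }]
slow = [{ parent_span_id: nil, duration: 1.5 }]
puts keep_trace?('f' * 32, fast)
puts keep_trace?('f' * 32, slow)
```

The price of this accuracy is that every span must be held until its trace completes, which is why the decision usually lives in the collector and not in each service.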

Microservices Platform with 50 services uses distributed tracing to identify cascading failures. When service A slows down, tracing reveals which downstream services cause the latency. The platform automatically creates service dependency graphs from trace data.

# Service dependency tracking
class DependencyAnalyzer
  def analyze_trace(trace)
    dependencies = {}
    
    trace.spans.each do |span|
      next unless span.attributes['peer.service']
      
      caller = span.attributes['service.name']
      callee = span.attributes['peer.service']
      
      dependencies[caller] ||= {}
      dependencies[caller][callee] ||= {
        call_count: 0,
        total_duration: 0.0,
        error_count: 0
      }
      
      dependencies[caller][callee][:call_count] += 1
      dependencies[caller][callee][:total_duration] += span.duration
      dependencies[caller][callee][:error_count] += 1 if span.attributes['error']
    end
    
    dependencies
  end
end

E-Commerce Platform correlates traces with business metrics. Each trace includes order value, customer segment, and AB test variant. This enables analyzing performance by business dimension—high-value customer checkout latency versus low-value customer latency.

# Business context propagation
class CheckoutController < ApplicationController
  def create
    tracer.in_span('checkout', attributes: {
      'cart.value' => current_cart.total,
      'cart.item_count' => current_cart.items.count,
      'customer.segment' => current_user.segment,
      'experiment.checkout_flow' => current_experiment_variant
    }) do |span|
      
      order = tracer.in_span('create_order') do
        Order.create_from_cart!(current_cart)
      end
      
      span.set_attribute('order.id', order.id)
      span.set_attribute('order.total', order.total)
      
      # Business event logging
      span.add_event('order_created', attributes: {
        'revenue' => order.total,
        'items' => order.items.count,
        'payment_method' => order.payment_method
      })
      
      redirect_to order_path(order)
    end
  end
end

SaaS Application implements tenant-aware tracing. Each span includes tenant ID, plan tier, and feature flags. This isolates performance issues to specific tenants and identifies which features cause slowdowns.

# Tenant context middleware
class TenantTracingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    tenant = env['current_tenant']

    tracer.in_span('request', attributes: {
      'tenant.id' => tenant.id,
      'tenant.plan' => tenant.plan,
      'tenant.created_at' => tenant.created_at.iso8601
    }) do |span|
      # Add tenant features as baggage. set_value returns a new context,
      # which must be made current for downstream calls to see it.
      context = OpenTelemetry::Baggage.set_value(
        'tenant.feature.advanced_analytics',
        tenant.feature_enabled?(:advanced_analytics).to_s
      )

      status, headers, body = OpenTelemetry::Context.with_current(context) do
        @app.call(env)
      end

      span.set_attribute('http.status_code', status)
      [status, headers, body]
    end
  end
end

Background Processing System traces job execution across multiple workers. Parent-child relationships show which jobs spawn additional work. Trace analysis identifies jobs that trigger cascading job creation.

# Job execution tracing
class TracedJob
  include Sidekiq::Job
  
  def perform(*args)
    tracer.in_span("job.#{self.class.name}", attributes: {
      'job.class' => self.class.name,
      'job.args' => args.inspect,
      'job.id' => jid,
      'job.queue' => self.class.get_sidekiq_options['queue'].to_s
    }) do |span|
      start_time = Time.now
      
      begin
        result = execute(*args)
        
        span.set_attribute('job.duration', Time.now - start_time)
        span.set_attribute('job.status', 'success')
        result
      rescue => e
        span.record_exception(e)
        span.set_attribute('job.status', 'failed')
        span.set_attribute('job.error', e.class.name)
        raise
      end
    end
  end
end

Common Pitfalls

Distributed tracing implementations face recurring problems that reduce effectiveness or create operational issues. Understanding these pitfalls prevents wasted effort and data quality problems.

Context Propagation Failures occur when trace context does not cross service boundaries. Missing instrumentation for HTTP clients or message queues breaks trace continuity. Each service receives a new trace ID instead of continuing the parent trace.

# WRONG: Context not propagated
def call_external_service(data)
  # Missing context injection - creates new trace
  uri = URI('https://api.example.com/process')
  response = Net::HTTP.post(uri, data.to_json, 'Content-Type' => 'application/json')
  JSON.parse(response.body)
end

# CORRECT: Context propagated
def call_external_service(data)
  tracer.in_span('external_api_call') do |span|
    uri = URI('https://api.example.com/process')
    request = Net::HTTP::Post.new(uri)
    request['Content-Type'] = 'application/json'
    
    # Inject context into headers
    OpenTelemetry.propagation.inject(request)
    
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
      http.request(request, data.to_json)
    end
    
    span.set_attribute('http.status_code', response.code.to_i)
    JSON.parse(response.body)
  end
end

Excessive Span Creation generates too many spans per trace, overwhelming storage and making traces hard to analyze. Creating spans for every method call or loop iteration produces megabyte-sized traces. Limit spans to significant operations—HTTP requests, database queries, external service calls.

# WRONG: Too many spans
def process_items(items)
  tracer.in_span('process_items') do
    items.each do |item|
      # Creating 1000s of spans for each item
      tracer.in_span("process_item_#{item.id}") do
        tracer.in_span('validate') do
          validate(item)
        end
        tracer.in_span('transform') do
          transform(item)
        end
        tracer.in_span('save') do
          save(item)
        end
      end
    end
  end
end

# CORRECT: Aggregate spans
def process_items(items)
  tracer.in_span('process_items', attributes: {
    'items.count' => items.count
  }) do |span|
    processed = 0
    errors = 0
    
    items.each do |item|
      # Process without individual spans
      validate(item)
      transform(item)
      save(item)
      processed += 1
    rescue => e
      errors += 1
    end
    
    span.set_attribute('items.processed', processed)
    span.set_attribute('items.errors', errors)
  end
end

Sensitive Data Exposure adds passwords, API keys, or personal information to spans. Span data typically has wider access than application databases. Sanitize attributes before adding them to spans.

# WRONG: Sensitive data in span
def authenticate_user(email, password)
  tracer.in_span('authenticate', attributes: {
    'user.email' => email,
    'user.password' => password  # NEVER include passwords
  }) do
    User.authenticate(email, password)
  end
end

# CORRECT: Sanitized attributes
def authenticate_user(email, password)
  tracer.in_span('authenticate', attributes: {
    'user.email' => email.gsub(/@.*/, '@***')  # Partially redact email
  }) do |span|
    user = User.authenticate(email, password)
    span.set_attribute('user.id', user.id) if user # nil is not a valid attribute value
    span.set_attribute('auth.success', !user.nil?)
    user
  end
end
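
Sanitizing can also be centralized so individual call sites do not have to remember it. The helper below is a sketch: `sanitize_attributes` and the key pattern are illustrative assumptions, and a real deny-list should be tuned to the application's own attribute names.

```ruby
# Keys that look like credentials or payment data are redacted, not dropped,
# so it stays visible that the attribute was present.
SENSITIVE_KEY_PATTERN = /password|secret|token|api[_.-]?key|authorization|card/i

def sanitize_attributes(attributes)
  attributes.each_with_object({}) do |(key, value), clean|
    next if value.nil? # nil is not a valid attribute value

    name = key.to_s
    clean[name] = name.match?(SENSITIVE_KEY_PATTERN) ? '[REDACTED]' : value
  end
end

puts sanitize_attributes(
  'user.id' => 7,
  'user.password' => 'hunter2',
  'api_key' => 'abc123',
  'note' => nil
).inspect
```

The result can be passed as the `attributes:` argument when a span is created.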

Sampling Inconsistency creates incomplete traces when different services make different sampling decisions for the same trace. Head-based sampling must propagate the sampling decision to all downstream services. The trace flags in context indicate whether the trace is sampled.

# Respecting parent sampling decision
def process_with_context(parent_context)
  # The sampling decision travels in the parent's trace flags
  parent_span = OpenTelemetry::Trace.current_span(parent_context)
  sampled = parent_span.context.trace_flags.sampled?

  # A parent-based sampler, configured once on the tracer provider, makes
  # spans started under the parent context inherit that decision
  OpenTelemetry::Context.with_current(parent_context) do
    tracer.in_span('child_operation') do |span|
      # span.recording? matches `sampled` when the parent decision is respected
      perform_work
    end
  end
end

Missing Error Information creates spans that look successful even though the operation failed. in_span records exceptions that propagate out of its block, but an exception rescued inside the block, or a span managed manually with start_span and finish, carries no error indication unless the code records it explicitly.

# WRONG: Exception swallowed without being recorded
def fetch_user(id)
  tracer.in_span('fetch_user') do
    User.find(id)
  rescue ActiveRecord::RecordNotFound
    nil  # Span finishes without any error indication
  end
end

# CORRECT: Exception recorded before it is handled
def fetch_user(id)
  tracer.in_span('fetch_user', attributes: { 'user.id' => id }) do |span|
    User.find(id)
  rescue ActiveRecord::RecordNotFound => e
    span.record_exception(e)
    span.status = OpenTelemetry::Trace::Status.error('user not found')
    span.set_attribute('error.type', 'not_found')
    nil
  end
end

Clock Skew Problems distort traces when services have unsynchronized clocks. Spans appear in wrong order or have negative durations. Use NTP synchronization across all services or rely on relative timing within services.

Cardinality Explosions occur when high-cardinality values become span attributes. User IDs, request IDs, and timestamps create millions of unique attribute combinations, overwhelming trace storage indexes.

# WRONG: High cardinality attribute
tracer.in_span('query_users', attributes: {
  'query.timestamp' => Time.now.to_f,  # Unique every time
  'query.request_id' => SecureRandom.uuid  # Unique every time
}) do
  # ...
end

# CORRECT: Bounded cardinality
tracer.in_span('query_users', attributes: {
  'query.hour' => Time.now.hour,  # Only 24 values
  'query.type' => 'user_search'  # Fixed set of values
}) do
  # ...
end
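
Bucketing is a general way to bound cardinality while keeping the signal. The helpers below are a small sketch; `status_class`, `latency_bucket`, and the bucket boundaries are illustrative choices, not a convention.

```ruby
# Collapse an HTTP status code into its class: 404 becomes "4xx".
def status_class(code)
  "#{code.to_i / 100}xx"
end

# Collapse a duration in seconds into one of three fixed buckets.
def latency_bucket(seconds)
  if seconds < 0.1
    'under_100ms'
  elsif seconds < 1.0
    '100ms_to_1s'
  else
    'over_1s'
  end
end

puts status_class(404)
puts latency_bucket(0.35)
```

The exact value can still be recorded on a span event when it is needed for debugging, while the bucketed attribute is the one used for filtering and aggregation.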

Reference

Trace Components

Component	Description	Cardinality
Trace	Complete request path across services	One per sampled request
Span	Single operation within a service	Multiple per trace
Trace ID	Unique identifier for entire trace	128-bit random value
Span ID	Unique identifier for individual span	64-bit random value
Parent Span ID	Links span to its parent	References another span ID
Attributes	Key-value metadata on spans	0 to 128 per span typically
Events	Timestamped log entries within span	0 to many per span
Links	Connects spans across trace boundaries	Used for batch processing

Span Attributes

Standard semantic conventions define common attribute names:

Attribute	Type	Example	Use Case
service.name	string	user-service	Service identification
http.method	string	GET	HTTP request method
http.url	string	https://api.example.com/users/123	Full request URL
http.status_code	integer	200	Response status
db.system	string	postgresql	Database type
db.statement	string	SELECT * FROM users	SQL query
db.operation	string	SELECT	Database operation
messaging.system	string	rabbitmq	Message queue type
messaging.destination	string	orders.created	Queue or topic name
rpc.system	string	grpc	RPC framework
error	boolean	true	Error occurred
peer.service	string	payment-service	Called service name

Context Propagation Formats

Format	Header	Structure	Example
W3C Trace Context	traceparent	version-traceid-parentid-flags	00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
W3C Trace State	tracestate	vendor specific state	congo=t61rcWkgMzE
B3 Multi	X-B3-TraceId, X-B3-SpanId	Separate headers	TraceId: 80f198ee56343ba864fe8b2a57d3eff7
B3 Single	b3	traceid-spanid-sampled	80f198ee56343ba864fe8b2a57d3eff7-e457b5a2e4d86bd1-1
Jaeger	uber-trace-id	traceid:spanid:parentid:flags	5af7651916cd43dd8448eb211c80319c:b7ad6b7169203331:0:1

Sampling Strategies

Strategy	Decision Point	Use Case	Trade-off
Probabilistic	Trace start	Consistent load	May miss errors
Rate Limiting	Trace start	Cost control	May miss bursts
Parent-based	Propagated	Cross-service consistency	Depends on parent decision
Tail-based	Trace completion	Keep interesting traces	Requires buffering all spans
Adaptive	Dynamic	Respond to conditions	Complex implementation
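
The rate-limiting strategy in the table is commonly built on a token bucket. The class below is a sketch under that assumption; `RateLimitingSampler` and its injectable `clock` parameter are hypothetical and exist here so the behavior can be exercised deterministically.

```ruby
# Token-bucket sampler: allows roughly traces_per_second sampled traces,
# with bursts up to the bucket capacity.
class RateLimitingSampler
  def initialize(traces_per_second, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @rate = traces_per_second.to_f
    @capacity = [@rate, 1.0].max
    @tokens = @capacity # start with a full bucket
    @clock = clock
    @last = @clock.call
    @mutex = Mutex.new
  end

  def sample?
    @mutex.synchronize do
      now = @clock.call
      # Refill in proportion to elapsed time, never above capacity.
      @tokens = [@tokens + (now - @last) * @rate, @capacity].min
      @last = now
      next false if @tokens < 1.0

      @tokens -= 1.0
      true
    end
  end
end

sampler = RateLimitingSampler.new(2)
puts 3.times.map { sampler.sample? }.inspect
```

A burst larger than the bucket is dropped until tokens refill, which is the "may miss bursts" trade-off listed in the table.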

Span Lifecycle

Phase	Description	Methods
Creation	Span initialized with trace context	start_span
Active	Span receiving attributes and events	set_attribute, add_event
Completion	Span duration finalized	finish
Export	Span sent to backend	Automatic via processor

Ruby Tracing Configuration

Option	Purpose	Default
service_name	Identify service in traces	unknown_service
service_version	Track deployments	nil
sampler	Control trace sampling	ParentBased(AlwaysOn)
span_processors	Process spans before export	BatchSpanProcessor
propagators	Context propagation format	TraceContext, Baggage
resource_attributes	Service metadata	hostname, process info

Common Span Kinds

Kind	Purpose	Direction
CLIENT	Outbound request initiator	Outgoing
SERVER	Request handler	Incoming
PRODUCER	Message queue publisher	Outgoing
CONSUMER	Message queue subscriber	Incoming
INTERNAL	Internal operation	Neither

Performance Impact

Instrumentation Type	Overhead	Sampling Recommendation
Automatic HTTP/Database	1-3% latency	5-10% sampling
Manual business logic	2-5% latency	Selective instrumentation
Message queues	Minimal	10-25% sampling
Background jobs	1-2% latency	25-50% sampling
