CrackedRuby

Overview

A/B testing, also known as split testing, is a statistical method for comparing two or more versions of a variable to determine which performs better. The methodology originated in agricultural experiments in the 1920s and was adapted for marketing and web optimization in the 1990s. In software development, A/B testing enables data-driven decisions by exposing different user segments to different implementations and measuring the results.

The core mechanism divides traffic randomly between a control group (version A) and one or more treatment groups (versions B, C, etc.). Each group experiences only one variant, and the system tracks conversion metrics or other key performance indicators. Statistical analysis determines whether observed differences are significant or result from random variation.

A/B testing operates on the principle of controlled experimentation. By changing only one variable at a time and maintaining consistent conditions for all groups, the test isolates the impact of the specific change. Random assignment controls for the confounding variables that plague observational studies.

# Basic A/B test structure
require 'digest'

class ButtonColorTest
  VARIANTS = ['blue', 'green']
  
  def assign_variant(user_id)
    # Deterministic assignment based on user ID. Object#hash is seeded
    # per process in Ruby, so a stable digest is used instead to keep
    # users in the same variant across sessions and deploys.
    digest = Digest::MD5.hexdigest(user_id.to_s).to_i(16)
    VARIANTS[digest % VARIANTS.length]
  end
  
  def track_conversion(user_id, converted)
    variant = assign_variant(user_id)
    Analytics.track(
      event: 'button_click',
      variant: variant,
      converted: converted,
      user_id: user_id
    )
  end
end

Applications span web interface optimization, feature rollouts, pricing strategies, recommendation algorithms, and email marketing campaigns. Companies use A/B testing to validate assumptions, reduce risk when launching changes, and incrementally improve conversion rates. The methodology scales from simple color changes to complex algorithmic modifications.

Key Principles

A/B testing rests on hypothesis testing from statistical inference. The null hypothesis states that no difference exists between variants. The alternative hypothesis claims a difference exists. The test collects evidence to reject or fail to reject the null hypothesis at a predetermined significance level, typically α = 0.05.

Random assignment is fundamental to validity. Users must be randomly distributed between control and treatment groups to ensure groups are statistically equivalent before treatment. Non-random assignment introduces selection bias, invalidating results. Hash-based assignment provides deterministic randomization, ensuring users consistently see the same variant across sessions.

Sample size determines statistical power—the probability of detecting a real effect when one exists. Insufficient samples increase Type II errors (false negatives), while excessive samples waste resources and prolong testing. Power analysis calculates required sample size based on expected effect size, baseline conversion rate, and desired confidence level.

# Sample size calculation for proportion test
def calculate_sample_size(baseline_rate, minimum_detectable_effect, alpha = 0.05, power = 0.8)
  # minimum_detectable_effect is relative: 0.20 means a 20% lift over baseline.
  # The z-scores below are hardcoded for the default alpha and power values.
  p1 = baseline_rate
  p2 = baseline_rate * (1 + minimum_detectable_effect)
  
  # Pooled proportion
  p_avg = (p1 + p2) / 2.0
  
  # Z-scores for alpha and power
  z_alpha = 1.96  # Two-tailed test at α = 0.05
  z_power = 0.84  # Power = 0.80
  
  # Sample size per variant
  numerator = (z_alpha * Math.sqrt(2 * p_avg * (1 - p_avg)) + 
               z_power * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2
  denominator = (p2 - p1)**2
  
  (numerator / denominator).ceil
end

# Example: 5% baseline, detect a 20% relative improvement (5% -> 6%)
sample_size = calculate_sample_size(0.05, 0.20)
# => 8149 users per variant

Statistical significance measures the probability that observed differences resulted from chance. The p-value represents this probability. A p-value below the significance threshold (typically 0.05) suggests the difference is statistically significant. However, statistical significance does not guarantee practical significance—a statistically significant 0.1% improvement may not justify implementation costs.

Confidence intervals provide more information than p-values alone. A 95% confidence interval means that if the experiment were repeated 100 times, approximately 95 intervals would contain the true effect size. Wide confidence intervals indicate high uncertainty; narrow intervals suggest precise estimates.
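
Computing an interval for the difference between two conversion rates makes this concrete. The following is a minimal sketch using the normal approximation; the counts are hypothetical.

```ruby
# 95% confidence interval for the difference between two conversion
# rates (normal approximation); z = 1.96 corresponds to 95% coverage
def diff_confidence_interval(conversions_a, n_a, conversions_b, n_b, z = 1.96)
  p_a = conversions_a.to_f / n_a
  p_b = conversions_b.to_f / n_b
  se = Math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
  diff = p_b - p_a
  [diff - z * se, diff + z * se]
end

low, high = diff_confidence_interval(250, 5000, 280, 5100)
# An interval that spans zero means the difference is not
# statistically significant at the 95% level.
```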

Multiple testing corrections become necessary when running multiple tests simultaneously or comparing multiple variants. The probability of finding at least one false positive increases with the number of comparisons. The Bonferroni correction divides the significance level by the number of tests, though this conservative approach reduces statistical power.
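
The arithmetic behind the correction fits in a few lines. This sketch shows the adjusted per-test threshold and the family-wise error rate that accumulates without any correction:

```ruby
# Bonferroni-adjusted significance level for a family of tests
def bonferroni_alpha(alpha, num_tests)
  alpha / num_tests.to_f
end

# Probability of at least one false positive across independent tests
def family_wise_error_rate(alpha, num_tests)
  1 - (1 - alpha)**num_tests
end

bonferroni_alpha(0.05, 20)
# => 0.0025

family_wise_error_rate(0.05, 20).round(2)
# => 0.64
```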

Ruby Implementation

Ruby's A/B testing ecosystem includes several mature libraries. The Split gem provides a Redis-backed testing framework with dashboard visualization. Field Test offers database-backed experiments with ActiveRecord integration. Vanity supports both Redis and ActiveRecord backends with comprehensive reporting.

# Split gem implementation
require 'split'

Split.configure do |config|
  config.db_failover = true
  config.db_failover_on_db_error = ->(error) { Rails.logger.error(error.message) }
  config.persistence = Split::Persistence::RedisAdapter
end

# Define experiment
Split::ExperimentCatalog.find_or_create('button_color', 'blue', 'green', 'red')

# In controller
class PurchaseController < ApplicationController
  def show
    @button_color = ab_test('button_color')
  end
  
  def complete
    ab_finished('button_color', reset: false)
    # Conversion tracked
  end
end

# View helper automatically tracks impressions
<%= link_to 'Purchase', purchase_path, class: "btn-#{@button_color}" %>

Field Test provides ActiveRecord integration with migration-based experiment definitions. This approach offers version control for experiment configuration and seamless integration with existing database infrastructure.

# Field Test reads experiment definitions from config/field_test.yml
# rather than a Ruby configuration block:

# config/field_test.yml
experiments:
  search_algorithm:
    variants:
      - elastic
      - postgres_fts
      - algolia
    weights:
      - 34
      - 33
      - 33
  pricing_model:
    variants:
      - monthly
      - annual
      - usage_based
    winner: annual  # lock winning variant

# Migration for participants table
class CreateFieldTestTables < ActiveRecord::Migration[7.0]
  def change
    create_table :field_test_memberships do |t|
      t.string :participant_type
      t.string :participant_id
      t.string :experiment
      t.string :variant
      t.datetime :created_at
    end
    
    create_table :field_test_events do |t|
      t.string :name
      t.references :membership
      t.datetime :created_at
    end
    
    add_index :field_test_memberships, [:participant_type, :participant_id, :experiment],
              name: 'index_field_test_memberships_on_participant'
  end
end

# Usage in application
class SearchController < ApplicationController
  def index
    @algorithm = field_test(:search_algorithm, participant: current_user)
    @results = case @algorithm
               when 'elastic'
                 ElasticSearchService.search(params[:q])
               when 'postgres_fts'
                 PostgresFullTextSearch.search(params[:q])
               when 'algolia'
                 AlgoliaService.search(params[:q])
               end
  end
  
  def track_click
    field_test_converted(:search_algorithm, participant: current_user)
  end
end

Custom implementations provide maximum flexibility when existing gems do not meet specific requirements. A basic implementation requires variant assignment, event tracking, and statistical analysis.

# Custom A/B testing framework
class ABTest
  attr_reader :name, :variants, :created_at
  
  def initialize(name, variants, options = {})
    @name = name
    @variants = variants
    @weights = options[:weights] || equal_weights
    @created_at = Time.current
  end
  
  def assign_variant(participant_id)
    # Hash-based consistent assignment
    hash_value = Digest::MD5.hexdigest("#{name}-#{participant_id}").to_i(16)
    cumulative_weight = 0
    
    @weights.each_with_index do |weight, index|
      cumulative_weight += weight
      return @variants[index] if (hash_value % 100) < cumulative_weight
    end
    
    @variants.last
  end
  
  def track_impression(participant_id, variant)
    Redis.current.hincrby("ab_test:#{name}:impressions", variant, 1)
    Redis.current.sadd("ab_test:#{name}:participants:#{variant}", participant_id)
  end
  
  def track_conversion(participant_id, variant)
    Redis.current.hincrby("ab_test:#{name}:conversions", variant, 1)
    Redis.current.sadd("ab_test:#{name}:converted:#{variant}", participant_id)
  end
  
  def results
    @variants.map do |variant|
      impressions = Redis.current.hget("ab_test:#{name}:impressions", variant).to_i
      conversions = Redis.current.hget("ab_test:#{name}:conversions", variant).to_i
      
      {
        variant: variant,
        impressions: impressions,
        conversions: conversions,
        conversion_rate: impressions > 0 ? conversions.to_f / impressions : 0
      }
    end
  end
  
  def statistical_significance
    control = results.first
    variants = results[1..]
    
    variants.map do |treatment|
      z_score = calculate_z_score(control, treatment)
      p_value = 2 * (1 - normal_cdf(z_score.abs))
      
      {
        variant: treatment[:variant],
        z_score: z_score,
        p_value: p_value,
        significant: p_value < 0.05
      }
    end
  end
  
  private
  
  def equal_weights
    weight = 100.0 / @variants.length
    Array.new(@variants.length, weight)
  end
  
  def calculate_z_score(control, treatment)
    p1 = control[:conversion_rate]
    p2 = treatment[:conversion_rate]
    n1 = control[:impressions]
    n2 = treatment[:impressions]
    
    p_pool = (control[:conversions] + treatment[:conversions]).to_f / (n1 + n2)
    se = Math.sqrt(p_pool * (1 - p_pool) * (1.0/n1 + 1.0/n2))
    
    (p2 - p1) / se
  end
  
  def normal_cdf(x)
    # Approximation of standard normal CDF
    0.5 * (1 + Math.erf(x / Math.sqrt(2)))
  end
end

# Usage
test = ABTest.new('checkout_flow', ['single_page', 'multi_step'])
variant = test.assign_variant(user.id)
test.track_impression(user.id, variant)

# After conversion
test.track_conversion(user.id, variant)

# Analyze results
test.results
# => [{variant: "single_page", impressions: 5000, conversions: 250, conversion_rate: 0.05},
#     {variant: "multi_step", impressions: 5100, conversions: 280, conversion_rate: 0.0549}]

test.statistical_significance
# => [{variant: "multi_step", z_score: 1.1, p_value: 0.27, significant: false}]

Multi-armed bandit algorithms provide an alternative to fixed-split testing. Rather than maintaining equal traffic allocation, bandit algorithms dynamically adjust allocation based on observed performance, sending more traffic to better-performing variants.

class EpsilonGreedyBandit
  def initialize(variants, epsilon = 0.1)
    @variants = variants
    @epsilon = epsilon
    @counts = Hash.new(0)
    @values = Hash.new(0.0)
  end
  
  def select_variant
    if rand < @epsilon
      # Exploration: random variant
      @variants.sample
    else
      # Exploitation: best variant
      @variants.max_by { |v| @values[v] }
    end
  end
  
  def update(variant, reward)
    @counts[variant] += 1
    n = @counts[variant]
    value = @values[variant]
    
    # Incremental average
    @values[variant] = ((n - 1) / n.to_f) * value + (1 / n.to_f) * reward
  end
  
  def best_variant
    @variants.max_by { |v| @values[v] }
  end
end

# Usage
bandit = EpsilonGreedyBandit.new(['variant_a', 'variant_b', 'variant_c'])

# On each request
variant = bandit.select_variant

# After observing outcome (1 for success, 0 for failure)
bandit.update(variant, converted ? 1 : 0)

Practical Examples

A common A/B testing scenario involves optimizing a call-to-action button. The hypothesis states that changing button text from "Buy Now" to "Add to Cart" increases conversion rates. The test requires variant assignment, impression tracking, conversion tracking, and statistical analysis.

# Complete button text test implementation
class ButtonTextExperiment
  include FieldTest::Helpers
  
  EXPERIMENT_NAME = 'cta_button_text'
  
  # Field Test reads experiment definitions from config/field_test.yml:
  #
  #   experiments:
  #     cta_button_text:
  #       variants:
  #         - buy_now
  #         - add_to_cart
  #         - get_yours
  #
  # Hypothesis: softer CTA language increases checkout completion rate.
  
  def assign_and_render(user)
    variant = field_test(EXPERIMENT_NAME, participant: user)
    button_text = case variant
                  when 'buy_now' then 'Buy Now'
                  when 'add_to_cart' then 'Add to Cart'
                  when 'get_yours' then 'Get Yours Today'
                  end
    
    { variant: variant, text: button_text }
  end
  
  def track_conversion(user)
    field_test_converted(EXPERIMENT_NAME, participant: user)
  end
  
  def self.analyze_results
    results = FieldTest::Experiment.find(EXPERIMENT_NAME)
    variants = results.variants
    
    variants.map do |variant|
      participants = FieldTest::Membership
        .where(experiment: EXPERIMENT_NAME, variant: variant)
        .count
      
      conversions = FieldTest::Event
        .joins(:membership)
        .where(
          name: EXPERIMENT_NAME,
          field_test_memberships: { variant: variant }
        ).count
      
      {
        variant: variant,
        participants: participants,
        conversions: conversions,
        rate: participants > 0 ? conversions.to_f / participants : 0
      }
    end
  end
end

# Controller integration
class ProductsController < ApplicationController
  def show
    @product = Product.find(params[:id])
    @cta_button = ButtonTextExperiment.new.assign_and_render(current_user)
  end
  
  def add_to_cart
    # Add product to cart logic
    ButtonTextExperiment.new.track_conversion(current_user)
    redirect_to cart_path
  end
end

# View
<%= button_to @cta_button[:text], 
              add_to_cart_path(@product),
              class: 'btn-primary',
              data: { variant: @cta_button[:variant] } %>

Email subject line testing demonstrates A/B testing in asynchronous contexts. The test measures open rates across different subject line variations, accounting for temporal patterns and audience segmentation.

class EmailSubjectLineTest
  def initialize(campaign_id)
    @campaign_id = campaign_id
    # Single-quoted strings keep the #{...} placeholders literal;
    # interpolate_subject substitutes them per recipient at send time
    @variants = {
      'control' => 'Weekly Newsletter - #{Date.today.strftime("%B %d")}',
      'urgency' => 'Last Chance: Exclusive Offers Inside',
      'personalized' => '#{first_name}, Your Weekly Update'
    }
  end
  
  def send_campaign(subscribers)
    # Stratified sampling by user segment
    segments = subscribers.group_by(&:segment)
    
    segments.each do |segment, users|
      users.shuffle.each_slice((users.length / 3.0).ceil).with_index do |batch, index|
        variant = @variants.keys[index % @variants.keys.length]
        send_batch(batch, variant, segment)
      end
    end
  end
  
  def send_batch(users, variant, segment)
    users.each do |user|
      subject = interpolate_subject(variant, user)
      
      EmailJob.perform_later(
        user_id: user.id,
        campaign_id: @campaign_id,
        subject: subject,
        variant: variant,
        segment: segment
      )
      
      track_sent(user.id, variant, segment)
    end
  end
  
  def track_sent(user_id, variant, segment)
    Redis.current.sadd("email_test:#{@campaign_id}:sent:#{variant}", user_id)
    Redis.current.hincrby("email_test:#{@campaign_id}:sent_by_segment:#{variant}", segment, 1)
  end
  
  def track_open(user_id, variant, segment)
    Redis.current.sadd("email_test:#{@campaign_id}:opened:#{variant}", user_id)
    Redis.current.hincrby("email_test:#{@campaign_id}:opened_by_segment:#{variant}", segment, 1)
  end
  
  def results_by_segment
    @variants.keys.flat_map do |variant|
      segments = Redis.current.hkeys("email_test:#{@campaign_id}:sent_by_segment:#{variant}")
      
      segments.map do |segment|
        sent = Redis.current.hget("email_test:#{@campaign_id}:sent_by_segment:#{variant}", segment).to_i
        opened = Redis.current.hget("email_test:#{@campaign_id}:opened_by_segment:#{variant}", segment).to_i
        
        {
          variant: variant,
          segment: segment,
          sent: sent,
          opened: opened,
          open_rate: sent > 0 ? opened.to_f / sent : 0
        }
      end
    end
  end
  
  private
  
  def interpolate_subject(variant, user)
    subject = @variants[variant]
    subject.gsub('#{first_name}', user.first_name)
           .gsub('#{Date.today.strftime("%B %d")}', Date.today.strftime("%B %d"))
  end
end

# Background job for tracking opens
class EmailOpenTrackingJob < ApplicationJob
  def perform(user_id, campaign_id, variant, segment)
    test = EmailSubjectLineTest.new(campaign_id)
    test.track_open(user_id, variant, segment)
  end
end

Search algorithm testing represents complex A/B testing involving performance metrics beyond simple conversion rates. The test compares search relevance, latency, and user satisfaction across different search implementations.

class SearchAlgorithmTest
  ALGORITHMS = {
    'postgres_fts' => PostgresFullTextSearch,
    'elasticsearch' => ElasticsearchService,
    'hybrid' => HybridSearchService
  }
  
  def initialize
    @experiment = ABTest.new('search_algorithm', ALGORITHMS.keys, weights: [20, 40, 40])
  end
  
  def execute_search(user, query)
    variant = @experiment.assign_variant(user.id)
    service = ALGORITHMS[variant]
    
    start_time = Time.current
    results = service.search(query)
    latency = Time.current - start_time
    
    track_search(user.id, variant, query, results.count, latency)
    
    { results: results, variant: variant, latency: latency }
  end
  
  def track_search(user_id, variant, query, result_count, latency)
    @experiment.track_impression(user_id, variant)
    
    # Track detailed metrics
    Redis.current.lpush("search_test:queries:#{variant}", {
      user_id: user_id,
      query: query,
      result_count: result_count,
      latency: latency,
      timestamp: Time.current.to_i
    }.to_json)
  end
  
  def track_result_click(user_id, variant, position)
    @experiment.track_conversion(user_id, variant)
    
    Redis.current.hincrby("search_test:click_positions:#{variant}", position, 1)
  end
  
  def track_search_abandon(user_id, variant)
    Redis.current.hincrby("search_test:abandons", variant, 1)
  end
  
  def analyze_metrics
    ALGORITHMS.keys.map do |variant|
      queries = Redis.current.lrange("search_test:queries:#{variant}", 0, -1)
                     .map { |q| JSON.parse(q) }
      
      click_positions = Redis.current.hgetall("search_test:click_positions:#{variant}")
      abandons = Redis.current.hget("search_test:abandons", variant).to_i
      
      search_count = queries.length
      {
        variant: variant,
        total_searches: search_count,
        avg_latency: search_count.zero? ? 0 : queries.sum { |q| q['latency'] } / search_count,
        avg_results: search_count.zero? ? 0 : queries.sum { |q| q['result_count'] }.to_f / search_count,
        # Non-abandonment rate stands in for click-through
        click_through_rate: search_count.zero? ? 0 : (search_count - abandons).to_f / search_count,
        avg_click_position: calculate_avg_position(click_positions)
      }
    end
  end
  
  private
  
  def calculate_avg_position(positions)
    total_clicks = positions.values.map(&:to_i).sum
    return 0 if total_clicks == 0
    
    weighted_sum = positions.sum { |pos, count| pos.to_i * count.to_i }
    weighted_sum.to_f / total_clicks
  end
end

Design Considerations

Experiment design begins with defining clear hypotheses and success metrics. The hypothesis should specify the expected direction and magnitude of change. Vague hypotheses like "this will improve conversion" provide insufficient guidance. Specific hypotheses like "changing button color to green will increase checkout completion by 15%" enable proper experiment design and sample size calculation.

Success metrics must be measurable, relevant to business objectives, and sensitive to the experimental manipulation. Primary metrics drive decision-making, while secondary metrics provide context and guard against unintended consequences. A checkout flow test might use completion rate as the primary metric, with average order value and time-to-complete as secondary metrics.

Traffic allocation strategies balance statistical power against opportunity cost. Equal allocation maximizes statistical power but exposes half the traffic to potentially inferior variants. Weighted allocation reduces exposure to poor variants but requires larger samples to detect effects. Multi-armed bandit approaches dynamically adjust allocation, reducing regret while maintaining exploration.

class TrafficAllocationStrategy
  def self.equal_split(variants)
    weight = 100.0 / variants.length
    variants.map { |v| [v, weight] }.to_h
  end
  
  def self.weighted(variants, control_weight: 50)
    treatment_weight = (100 - control_weight) / (variants.length - 1.0)
    {
      variants.first => control_weight
    }.merge(
      variants[1..].map { |v| [v, treatment_weight] }.to_h
    )
  end
  
  def self.thompson_sampling(variants, alpha_beta_params)
    # Beta distribution sampling for conversion rate estimation
    samples = variants.map do |variant|
      alpha, beta = alpha_beta_params[variant]
      [variant, rand_beta(alpha, beta)]
    end
    
    # Allocate traffic in proportion to the sampled conversion rates
    # (one posterior draw per variant)
    total = samples.sum { |_, sample| sample }
    samples.map { |variant, sample| [variant, (sample / total) * 100] }.to_h
  end
  
  # NOTE: `private` has no effect on `def self.` methods; use
  # private_class_method to hide the samplers below if desired.
  
  def self.rand_beta(alpha, beta)
    # Beta distribution random sampling
    x = rand_gamma(alpha, 1)
    y = rand_gamma(beta, 1)
    x / (x + y)
  end
  
  def self.rand_gamma(shape, scale)
    # Gamma distribution random sampling (Marsaglia and Tsang method)
    d = shape - 1.0/3.0
    c = 1.0 / Math.sqrt(9.0 * d)
    
    loop do
      x = rand_normal(0, 1)
      v = (1.0 + c * x) ** 3
      next if v <= 0
      
      u = rand
      if u < 1.0 - 0.0331 * x**4
        return d * v * scale
      end
      
      if Math.log(u) < 0.5 * x**2 + d * (1.0 - v + Math.log(v))
        return d * v * scale
      end
    end
  end
  
  def self.rand_normal(mean, std_dev)
    # Box-Muller transform
    u1 = rand
    u2 = rand
    z = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
    mean + z * std_dev
  end
end

Test duration requires balancing statistical requirements against business needs. Tests must run long enough to achieve statistical significance and capture weekly cycles, but not so long that results become stale. Minimum duration depends on traffic volume and effect size. Sites with millions of daily visitors might complete tests in days, while lower-traffic sites require weeks or months.
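
A first-cut duration estimate follows from the required sample size and daily traffic. The helper below is a sketch; the even split across variants and the traffic figures are assumptions for illustration.

```ruby
# Rough test duration: total required participants divided by daily
# traffic, assuming traffic splits evenly across variants
def estimated_test_days(sample_per_variant, num_variants, daily_visitors)
  (sample_per_variant * num_variants / daily_visitors.to_f).ceil
end

estimated_test_days(4000, 2, 1000)
# => 8
# In practice, round up to complete weeks to capture weekly cycles.
```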

Temporal patterns affect test validity. Weekly cycles in user behavior necessitate running tests for complete weeks. Testing during atypical periods like holidays introduces bias. Sequential testing with early stopping rules allows monitoring results continuously, but requires adjusted significance thresholds to maintain error rates.

Segment-based analysis reveals whether effects differ across user populations. A change might benefit mobile users while harming desktop users, producing neutral aggregate results that mask important variation. Pre-defining segments prevents post-hoc fishing for significant results. Common segments include device type, geographic region, user tenure, and referral source.

Implementation Approaches

Server-side testing executes variant logic on the backend, providing full control over user experience and enabling complex experiments involving multiple systems. This approach integrates with application code and databases, supporting experiments on algorithms, data processing, and backend workflows.

# Server-side testing approach
class FeatureFlag
  def initialize(flag_name, user)
    @flag_name = flag_name
    @user = user
  end
  
  def enabled?
    return false unless experiment_active?
    
    variant = assign_variant
    track_impression(variant)
    
    variant != 'control'
  end
  
  def variant
    return 'control' unless experiment_active?
    
    variant = assign_variant
    track_impression(variant)
    variant
  end
  
  private
  
  def experiment_active?
    config = Rails.cache.fetch("experiment:#{@flag_name}", expires_in: 5.minutes) do
      ExperimentConfig.find_by(name: @flag_name)
    end
    
    config&.active?
  end
  
  def assign_variant
    hash_value = Digest::MD5.hexdigest("#{@flag_name}-#{@user.id}").to_i(16)
    
    cumulative = 0
    ExperimentConfig.find_by(name: @flag_name).variants.each do |variant|
      cumulative += variant['weight']
      return variant['name'] if (hash_value % 100) < cumulative
    end
    
    'control'
  end
  
  def track_impression(variant)
    AnalyticsJob.perform_later(
      event: 'experiment_impression',
      properties: {
        experiment: @flag_name,
        variant: variant,
        user_id: @user.id
      }
    )
  end
end

# Usage in application code
class RecommendationController < ApplicationController
  def index
    flag = FeatureFlag.new('recommendation_algorithm', current_user)
    
    @recommendations = case flag.variant
                       when 'collaborative_filtering'
                         CollaborativeFiltering.recommend(current_user)
                       when 'content_based'
                         ContentBased.recommend(current_user)
                       else
                         PopularItems.recommend
                       end
  end
end

Client-side testing executes variant selection in the browser using JavaScript, enabling rapid iteration without deploying backend changes. This approach works well for visual changes, user interface experiments, and single-page applications. However, client-side testing introduces latency and potential flash-of-content issues.

Hybrid approaches combine server-side assignment with client-side rendering. The server determines variant assignment and passes it to the client for rendering, avoiding assignment flickering while maintaining client-side flexibility.

# Hybrid approach: server assigns, client renders
class ExperimentController < ApplicationController
  def assign
    experiments = params[:experiments].map do |experiment_name|
      variant = assign_variant(experiment_name, current_user)
      track_assignment(experiment_name, variant, current_user)
      
      { experiment: experiment_name, variant: variant }
    end
    
    render json: { experiments: experiments }
  end
  
  private
  
  def assign_variant(experiment, user)
    cache_key = "experiment_assignment:#{experiment}:#{user.id}"
    
    Rails.cache.fetch(cache_key, expires_in: 30.days) do
      ExperimentService.assign(experiment, user)
    end
  end
end

# Client-side rendering (JavaScript)
# async function loadExperiments() {
#   const response = await fetch('/experiments/assign', {
#     method: 'POST',
#     headers: { 'Content-Type': 'application/json' },
#     body: JSON.stringify({ experiments: ['checkout_flow', 'pricing_display'] })
#   });
#   
#   const data = await response.json();
#   data.experiments.forEach(exp => {
#     applyVariant(exp.experiment, exp.variant);
#   });
# }

Progressive rollout combines A/B testing with gradual feature deployment. Initial rollout targets a small percentage of users, increasing exposure as confidence grows. This approach mitigates risk when launching major changes or untested features.

class ProgressiveRollout
  def initialize(feature_name)
    @feature_name = feature_name
  end
  
  def enabled_for?(user)
    rollout_percentage = get_rollout_percentage
    return false if rollout_percentage == 0
    return true if rollout_percentage == 100
    
    hash_value = Digest::MD5.hexdigest("#{@feature_name}-#{user.id}").to_i(16)
    (hash_value % 100) < rollout_percentage
  end
  
  def increase_rollout(percentage)
    current = get_rollout_percentage
    new_percentage = [current + percentage, 100].min
    
    Redis.current.set("rollout:#{@feature_name}", new_percentage)
    log_rollout_change(current, new_percentage)
    
    new_percentage
  end
  
  def rollback
    Redis.current.set("rollout:#{@feature_name}", 0)
    log_rollout_change(get_rollout_percentage, 0)
  end
  
  private
  
  def get_rollout_percentage
    Redis.current.get("rollout:#{@feature_name}").to_i
  end
  
  def log_rollout_change(from, to)
    RolloutLog.create(
      feature: @feature_name,
      from_percentage: from,
      to_percentage: to,
      changed_at: Time.current
    )
  end
end

Common Pitfalls

Peeking at results before reaching predetermined sample size inflates Type I error rates. Each intermediate analysis increases the probability of finding a false positive. Continuous monitoring requires sequential testing methods like alpha spending functions or Bayesian approaches that account for multiple looks.
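
A small Monte Carlo simulation illustrates the inflation. Both variants below share the same true conversion rate, so every "significant" result is a false positive; stopping at the first interim look that crosses |z| > 1.96 yields far more than the nominal 5%. All figures here are simulated, not real experiment data.

```ruby
# Simulate repeated experiments with no true difference between variants,
# peeking after every batch and stopping at the first |z| > 1.96
def peeking_false_positive_rate(trials: 1_000, looks: 10, n_per_look: 200, rate: 0.1)
  false_positives = 0

  trials.times do
    a_conv = b_conv = 0
    a_n = b_n = 0

    looks.times do
      # Accumulate another batch of observations for each variant
      n_per_look.times do
        a_n += 1
        a_conv += 1 if rand < rate
        b_n += 1
        b_conv += 1 if rand < rate
      end

      p1 = a_conv.to_f / a_n
      p2 = b_conv.to_f / b_n
      pool = (a_conv + b_conv).to_f / (a_n + b_n)
      se = Math.sqrt(pool * (1 - pool) * (1.0 / a_n + 1.0 / b_n))
      next if se.zero?

      # Declare "significance" at the first interim look that crosses the threshold
      if ((p2 - p1) / se).abs > 1.96
        false_positives += 1
        break
      end
    end
  end

  false_positives.to_f / trials
end

peeking_false_positive_rate
# Typically returns well above the nominal 0.05 with 10 interim looks
```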

Sample size errors arise from underestimating required participants. Underpowered tests fail to detect real effects, leading to false conclusions that changes have no impact. Power analysis before testing prevents this mistake, though unexpected baseline conversion rates can still cause problems.

class SampleSizeValidator
  def initialize(experiment_name)
    @experiment = Experiment.find_by(name: experiment_name)
  end
  
  def sufficient_sample?
    required = calculate_required_sample
    actual = current_sample_size
    
    actual >= required
  end
  
  def progress_report
    required = calculate_required_sample
    actual = current_sample_size
    
    {
      required_per_variant: required,
      actual_per_variant: actual,
      progress_percentage: (actual.to_f / required * 100).round(2),
      estimated_days_remaining: estimate_completion_days
    }
  end
  
  private
  
  def calculate_required_sample
    baseline = @experiment.baseline_conversion_rate
    mde = @experiment.minimum_detectable_effect
    
    calculate_sample_size(baseline, mde)
  end
  
  def current_sample_size
    Participation.where(experiment: @experiment.name)
                 .group(:variant)
                 .count
                 .values
                 .min || 0
  end
  
  def estimate_completion_days
    daily_rate = Participation.where(experiment: @experiment.name)
                             .where('created_at > ?', 7.days.ago)
                             .count / 7.0
    
    remaining = calculate_required_sample - current_sample_size
    return 0 if remaining <= 0
    return nil if daily_rate.zero?  # no recent traffic to extrapolate from
    
    (remaining / daily_rate).ceil
  end
end

Multiple testing problems occur when running many tests simultaneously or comparing multiple variants. The probability of finding at least one false positive increases with the number of comparisons. Without correction, 5% significance level with 20 independent tests yields 64% chance of at least one false positive.

Selection bias invalidates results when assignment is non-random. Allowing users to choose variants, targeting specific user segments differently, or assignment based on observable characteristics introduces bias. Randomization must be truly random and independent of user characteristics.

Novelty effects cause temporary changes in behavior when users encounter new experiences. Initial excitement or confusion creates artificial effects that dissipate over time. Tests must run long enough for novelty to wear off, typically requiring several weeks for interface changes.

Interaction effects between simultaneous experiments complicate interpretation. Running tests on overlapping user segments can produce misleading results when experiments interact. A search algorithm test might interact with a pricing test, for example, if search quality affects price sensitivity. Careful experiment isolation or factorial designs address this issue.

class ExperimentIsolation
  # Returns true when the proposed experiment overlaps an experiment the
  # user already participates in on both surfaces and metrics.
  def self.check_conflicts(experiment_name, user)
    active_experiments = get_active_experiments(user)
    proposed_experiment = Experiment.find_by(name: experiment_name)
    
    active_experiments.any? do |active|
      surfaces_overlap?(active, proposed_experiment) &&
        metrics_overlap?(active, proposed_experiment)
    end
  end
  
  def self.surfaces_overlap?(exp1, exp2)
    (exp1.surfaces & exp2.surfaces).any?
  end
  
  def self.metrics_overlap?(exp1, exp2)
    (exp1.metrics & exp2.metrics).any?
  end
  
  def self.get_active_experiments(user)
    Participation.where(user: user)
                 .includes(:experiment)
                 .map(&:experiment)
                 .select(&:active?)
  end
  # A bare `private` does not apply to singleton methods; hide the helper explicitly
  private_class_method :get_active_experiments
end

Survivorship bias affects analysis when only certain users complete the experimental flow. Users who abandon during checkout never convert, but this abandonment may correlate with variant assignment. Analyzing only users who complete the full flow introduces bias. Intent-to-treat analysis includes all assigned users regardless of completion.
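The two denominators can be contrasted with plain Ruby hashes: intent-to-treat keeps every assigned user in the denominator, while a completers-only analysis silently drops abandoners and inflates the rate. Field names here are illustrative:

```ruby
# Each record: whether the user completed the flow and whether they converted.
users = [
  { completed: true,  converted: true  },
  { completed: false, converted: false },
  { completed: true,  converted: false },
  { completed: false, converted: false }
]

# Intent-to-treat: all assigned users count, completed or not.
itt_rate = users.count { |u| u[:converted] } / users.size.to_f
# => 0.25

# Completers-only: abandoners vanish from the denominator.
completers = users.select { |u| u[:completed] }
completer_rate = completers.count { |u| u[:converted] } / completers.size.to_f
# => 0.5
```

If abandonment correlates with variant assignment, the completers-only figure compares systematically different populations; the intent-to-treat figure does not.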

Statistical significance without practical significance wastes resources implementing negligible improvements. A statistically significant 0.1% conversion increase might not justify implementation and maintenance costs. Effect size and confidence intervals provide better insight than p-values alone.
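A confidence interval for the difference in proportions makes practical significance visible: if the entire interval sits below the smallest lift worth shipping, a significant p-value alone should not trigger implementation. A minimal sketch using the normal approximation from the Reference table below:

```ruby
# 95% confidence interval for the difference between two conversion rates,
# using the normal approximation: (p2 - p1) ± z * SE.
def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z: 1.96)
  p1 = conv_a / n_a.to_f
  p2 = conv_b / n_b.to_f
  se = Math.sqrt(p1 * (1 - p1) / n_a + p2 * (1 - p2) / n_b)
  diff = p2 - p1
  { effect: diff, lower: diff - z * se, upper: diff + z * se }
end

ci = diff_confidence_interval(1000, 50_000, 1050, 50_000)
# An absolute lift of 0.1 percentage points: compare the whole interval
# against the minimum effect worth implementing, not just against zero.
```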

Reference

| Statistical Concept | Formula | Typical Value |
|---|---|---|
| Z-score for significance | (p2 - p1) / SE | ±1.96 for α=0.05 |
| Standard error for proportions | sqrt(p1(1-p1)/n1 + p2(1-p2)/n2) | Varies by sample |
| Confidence interval | estimate ± Z * SE | 95% most common |
| Sample size per variant | ((Zα + Zβ)² * 2p(1-p)) / (p2-p1)² | Depends on effect |
| Minimum detectable effect | Smallest meaningful difference | 10-20% typical |
| Statistical power | 1 - P(Type II error) | 0.80 standard |

| Test Duration Guideline | Minimum Period | Rationale |
|---|---|---|
| Minimum test length | 1-2 weeks | Capture weekly patterns |
| Traffic volume factor | Until sample size met | Statistical validity |
| Business cycle consideration | Complete cycle period | Avoid seasonal bias |
| Novelty effect mitigation | 2-4 weeks | Allow adaptation |

| Assignment Method | Implementation | Trade-offs |
|---|---|---|
| Hash-based | MD5(user_id + experiment) | Deterministic, fast, stateless |
| Database lookup | Store assignments in DB | Persistent, queryable, slower |
| Cookie-based | Client-side storage | Works for anonymous, can be cleared |
| Session-based | Server session storage | Temporary, session-dependent |

| Metric Type | Examples | Considerations |
|---|---|---|
| Conversion rate | Purchases, signups, clicks | Binary outcome, standard analysis |
| Continuous value | Revenue, session duration | Requires different statistical tests |
| Count data | Page views, items added | Poisson or negative binomial models |
| Time to event | Days to conversion | Survival analysis methods |

| Common Error | Prevention | Detection |
|---|---|---|
| Peeking problem | Use sequential testing | Monitor analysis frequency |
| Insufficient sample | Calculate before starting | Track power analysis |
| Multiple testing | Apply corrections | Count comparisons |
| Selection bias | Enforce random assignment | Check group balance |
| Novelty effects | Run longer tests | Monitor time trends |

| Ruby Gem | Backend | Features | Use Case |
|---|---|---|---|
| Split | Redis | Dashboard, goals, segments | High-traffic web apps |
| Field Test | ActiveRecord | Rails integration, migrations | Standard Rails apps |
| Vanity | Redis or AR | Reporting, metrics collection | Marketing experiments |
| Scientist | In-memory | Refactoring experiments | Code migration testing |

| Experiment Configuration | Value | Purpose |
|---|---|---|
| Significance level (α) | 0.05 | Type I error threshold |
| Statistical power (1-β) | 0.80 | Type II error control |
| Minimum sample ratio | 100:1 | Conversions per variant minimum |
| Test isolation period | 24 hours | Prevent interaction effects |
| Maximum concurrent tests | 3-5 | Limit interaction complexity |