CrackedRuby - Lock-Free Programming

Overview

Lock-free programming refers to concurrent programming techniques that allow multiple threads to access shared data structures without using traditional mutual exclusion locks. Instead of blocking threads when conflicts occur, lock-free algorithms use atomic operations and careful memory ordering to ensure correctness while allowing continuous progress.

The approach emerged from research into non-blocking synchronization in the 1980s and 1990s, addressing the fundamental limitations of lock-based concurrency: deadlock, priority inversion, convoying, and the inability to guarantee system-wide progress when a thread holding a lock fails or is delayed. Lock-free algorithms guarantee that at least one thread makes progress within a finite number of steps, regardless of the execution state of other threads.

Lock-free programming operates on hardware-level atomic instructions, particularly Compare-And-Swap (CAS), which atomically compares a memory location to an expected value and updates it to a new value only if it matches. Modern processors provide these primitives as single CPU instructions, making them the foundation for building higher-level concurrent data structures without locks.

# Traditional lock-based counter
class LockedCounter
  def initialize
    @count = 0
    @mutex = Mutex.new
  end
  
  def increment
    @mutex.synchronize { @count += 1 }
  end
end

# Lock-free counter using atomic operations
require 'concurrent'

class LockFreeCounter
  def initialize
    @count = Concurrent::AtomicFixnum.new(0)
  end
  
  def increment
    @count.increment
  end
end

The distinction between lock-free, wait-free, and obstruction-free algorithms defines different progress guarantees. Lock-free guarantees system-wide progress but individual threads may starve. Wait-free guarantees every thread completes in bounded steps. Obstruction-free guarantees progress only when a thread runs in isolation. Most practical implementations target lock-free properties as the optimal balance between complexity and guarantees.

Key Principles

Lock-free programming rests on atomic read-modify-write operations provided by processor architectures. The Compare-And-Swap (CAS) operation forms the primary building block, atomically reading a memory location, comparing it to an expected value, and writing a new value only if the comparison succeeds. This atomic triplet executes without interruption, providing the linearization point for lock-free algorithms.

# Conceptual CAS operation
def compare_and_swap(location, expected, new_value)
  atomically do
    current = location.read
    if current == expected
      location.write(new_value)
      return true
    else
      return false
    end
  end
end

Memory ordering defines the visibility and ordering of memory operations across threads. Sequential consistency provides the strongest guarantee where operations appear to execute in program order, but modern processors use weaker models for performance. Acquire-release semantics create synchronization points: acquire operations prevent later operations from moving before them, while release operations prevent earlier operations from moving after them. Lock-free algorithms must explicitly specify memory ordering to ensure correctness across architectures.

The ABA problem represents a fundamental challenge in lock-free programming. A thread reads value A from a location, another thread changes it to B and back to A, and the first thread's CAS succeeds despite intervening modifications. This false positive corrupts data structures that depend on the absence of modifications. Solutions include tagged pointers (combining value with version counter), hazard pointers (preventing premature reclamation), or garbage collection.

# ABA problem demonstration
class Stack
  Node = Struct.new(:value, :next)
  
  def initialize
    @head = Concurrent::AtomicReference.new(nil)
  end
  
  def push(value)
    loop do
      old_head = @head.get
      new_node = Node.new(value, old_head)
      return if @head.compare_and_set(old_head, new_node)
    end
  end
  
  def pop
    loop do
      old_head = @head.get
      return nil unless old_head
      next_node = old_head.next
      # ABA problem: old_head might have been removed, reused, and
      # replaced with different node at same memory location
      return old_head.value if @head.compare_and_set(old_head, next_node)
    end
  end
end

Retry loops characterize lock-free algorithm structure. Operations attempt their work optimistically, assuming no conflicts. When CAS fails due to concurrent modifications, the operation restarts. The loop continues until success, with each iteration representing a potential linearization point. The algorithm must ensure that repeated failures don't prevent other threads from making progress, distinguishing lock-free from merely obstruction-free approaches.

Linearization points define the atomic instant where an operation takes effect. In lock-free structures, identifying linearization points proves more complex than in locked code where the lock acquisition serves as an obvious synchronization boundary. Each successful CAS represents a potential linearization point, but failed operations may linearize at different points depending on observed state. Correct linearization point identification enables reasoning about concurrent behavior and proving correctness.

Memory reclamation presents the central challenge in lock-free memory management. Threads cannot immediately free nodes accessed by other threads without risking use-after-free errors. Garbage-collected languages like Ruby simplify this problem, as the GC handles reclamation safely. Languages requiring manual memory management employ techniques like hazard pointers (threads announce pointers they're accessing), epoch-based reclamation (threads mark entry/exit from critical sections), or reference counting schemes.

False sharing degrades performance when multiple threads modify variables on the same cache line. Processors maintain cache coherence at cache line granularity (typically 64 bytes), so concurrent modifications to nearby memory locations cause cache line ping-ponging between cores even when modifying logically independent data. Lock-free structures must pad shared variables to separate cache lines, trading memory for performance.

Ruby Implementation

Ruby provides atomic operations through the concurrent-ruby gem, which wraps platform-specific atomic primitives. The gem offers atomic references, numeric types, and boolean values as building blocks for lock-free structures. Ruby's Global Interpreter Lock (GIL) in MRI limits true parallelism, but atomic operations remain valuable for coordinating with C extensions, ensuring memory visibility across threads, and preparing code for truly parallel Ruby implementations.

require 'concurrent'

# Atomic numeric operations
atomic_num = Concurrent::AtomicFixnum.new(0)
atomic_num.increment              # => 1
atomic_num.decrement              # => 0
atomic_num.compare_and_set(0, 5)  # => true
atomic_num.update { |v| v * 2 }   # Retries until successful

# Atomic reference for object storage
atomic_ref = Concurrent::AtomicReference.new(nil)
atomic_ref.set([1, 2, 3])
atomic_ref.compare_and_set([1, 2, 3], [4, 5, 6])
old_value = atomic_ref.get_and_set([7, 8, 9])

The AtomicReference class implements compare-and-swap for object references, enabling lock-free data structures. The compare_and_set method uses object identity comparison, checking if the current reference equals the expected reference using equal? semantics. This identity-based comparison avoids the ABA problem for objects since Ruby's garbage collector ensures that reclaimed objects don't reappear at the same memory address.

class LockFreeStack
  Node = Struct.new(:value, :next)
  
  def initialize
    @head = Concurrent::AtomicReference.new(nil)
  end
  
  def push(value)
    loop do
      old_head = @head.get
      new_node = Node.new(value, old_head)
      break if @head.compare_and_set(old_head, new_node)
      # CAS failed, another thread modified head, retry
    end
  end
  
  def pop
    loop do
      old_head = @head.get
      return nil unless old_head
      next_node = old_head.next
      # Successfully update head to next node
      return old_head.value if @head.compare_and_set(old_head, next_node)
      # CAS failed, retry with current head
    end
  end
  
  def empty?
    @head.get.nil?
  end
end

The concurrent-ruby gem also provides higher-level abstractions built on atomic primitives. Concurrent::Array offers thread-safe array operations, Concurrent::Hash provides lock-free hash table access, and Concurrent::Map implements a lock-free concurrent map using atomic operations internally. These structures handle the complexity of lock-free implementation while exposing simple interfaces.

# High-level concurrent collections
concurrent_map = Concurrent::Map.new
concurrent_map.put_if_absent(:key, 'value')
concurrent_map.compute(:counter) { |old| (old || 0) + 1 }

# Atomic boolean for flags
flag = Concurrent::AtomicBoolean.new(false)
flag.make_true   # Atomically set to true
flag.make_false  # Atomically set to false

Ruby's thread-safe queue implementations in the Thread::Queue and Thread::SizedQueue classes use internal locking rather than lock-free algorithms, prioritizing simplicity and avoiding the GIL complications. For truly lock-free queues in performance-critical code, developers must implement custom structures using AtomicReference primitives or interface with native extensions that bypass the GIL.

The update method on atomic types implements the common retry pattern concisely. It accepts a block that transforms the current value, retrying the entire block if a concurrent modification occurs. This pattern handles the CAS retry loop internally, simplifying code that performs atomic updates.

# Atomic update with retry pattern
counter = Concurrent::AtomicFixnum.new(10)

# Manual CAS retry loop
loop do
  current = counter.get
  new_value = current * 2
  break if counter.compare_and_set(current, new_value)
end

# Equivalent using update
counter.update { |v| v * 2 }

# More complex transformations
stats = Concurrent::AtomicReference.new({ count: 0, sum: 0 })
stats.update do |current|
  {
    count: current[:count] + 1,
    sum: current[:sum] + value
  }
end

JRuby and TruffleRuby provide better concurrent performance characteristics than MRI by eliminating the GIL, allowing true parallel execution of Ruby code. In these implementations, lock-free structures deliver their full performance benefits, with atomic operations executing as actual lock-free processor instructions rather than being serialized by a global lock.

Practical Examples

Lock-free counters represent the simplest lock-free data structure, requiring only atomic increment and decrement operations. Multiple threads can increment a counter simultaneously without coordination beyond the atomic operation itself. The structure demonstrates the basic retry pattern where failed operations immediately retry until success.

require 'concurrent'

class MetricsCollector
  def initialize
    @request_count = Concurrent::AtomicFixnum.new(0)
    @error_count = Concurrent::AtomicFixnum.new(0)
    @total_latency = Concurrent::AtomicFixnum.new(0)
  end
  
  def record_request(latency_ms, error: false)
    @request_count.increment
    @error_count.increment if error
    @total_latency.update { |current| current + latency_ms }
  end
  
  def stats
    count = @request_count.get
    errors = @error_count.get
    latency = @total_latency.get
    
    {
      requests: count,
      errors: errors,
      error_rate: count > 0 ? errors.to_f / count : 0.0,
      avg_latency: count > 0 ? latency.to_f / count : 0.0
    }
  end
  
  def reset
    @request_count.set(0)
    @error_count.set(0)
    @total_latency.set(0)
  end
end

# Usage across multiple threads
collector = MetricsCollector.new

threads = 10.times.map do
  Thread.new do
    1000.times do
      latency = rand(10..500)
      error = rand < 0.05
      collector.record_request(latency, error: error)
    end
  end
end

threads.each(&:join)
puts collector.stats
# => {:requests=>10000, :errors=>~500, :error_rate=>~0.05, :avg_latency=>~255}

Lock-free stacks implement a LIFO data structure using a single atomic reference to the head node. Push operations create a new node pointing to the current head, then atomically swap it in. Pop operations atomically replace the head with its next pointer. Both operations retry on concurrent modifications until success.

class LockFreeStack
  Node = Struct.new(:value, :next)
  
  def initialize
    @head = Concurrent::AtomicReference.new(nil)
    @size = Concurrent::AtomicFixnum.new(0)
  end
  
  def push(value)
    new_node = Node.new(value, nil)
    loop do
      old_head = @head.get
      new_node.next = old_head
      if @head.compare_and_set(old_head, new_node)
        @size.increment
        return value
      end
      # Retry if another thread modified head
    end
  end
  
  def pop
    loop do
      old_head = @head.get
      return nil unless old_head
      
      next_node = old_head.next
      if @head.compare_and_set(old_head, next_node)
        @size.decrement
        return old_head.value
      end
      # Retry if another thread modified head
    end
  end
  
  def peek
    head = @head.get
    head ? head.value : nil
  end
  
  def size
    @size.get
  end
  
  def empty?
    @head.get.nil?
  end
end

# Demonstrate concurrent access
stack = LockFreeStack.new

# Multiple producers
producers = 5.times.map do |i|
  Thread.new do
    100.times { |j| stack.push("#{i}-#{j}") }
  end
end

# Multiple consumers
consumed = Concurrent::Array.new
consumers = 5.times.map do
  Thread.new do
    50.times do
      value = stack.pop
      consumed << value if value
    end
  end
end

producers.each(&:join)
consumers.each(&:join)

puts "Stack size: #{stack.size}"
puts "Consumed items: #{consumed.size}"

Lock-free queues require more complexity than stacks due to maintaining both head and tail pointers. The Michael-Scott queue algorithm uses atomic operations on both ends, with dummy nodes to simplify boundary conditions. Enqueue adds nodes at the tail, dequeue removes from the head, and both operations require CAS loops with careful ordering.

class LockFreeQueue
  Node = Struct.new(:value, :next) do
    def initialize(value, next_node = nil)
      super(value, Concurrent::AtomicReference.new(next_node))
    end
  end
  
  def initialize
    dummy = Node.new(nil)
    @head = Concurrent::AtomicReference.new(dummy)
    @tail = Concurrent::AtomicReference.new(dummy)
  end
  
  def enqueue(value)
    new_node = Node.new(value)
    
    loop do
      tail = @tail.get
      next_node = tail.next.get
      
      # Check if tail is still consistent
      if tail == @tail.get
        if next_node.nil?
          # Try to link new node at end of list
          if tail.next.compare_and_set(nil, new_node)
            # Enqueue done, try to swing tail to new node
            @tail.compare_and_set(tail, new_node)
            return
          end
        else
          # Tail is behind, try to advance it
          @tail.compare_and_set(tail, next_node)
        end
      end
    end
  end
  
  def dequeue
    loop do
      head = @head.get
      tail = @tail.get
      next_node = head.next.get
      
      # Check if head is still consistent
      if head == @head.get
        if head == tail
          # Queue is empty or tail is behind
          return nil if next_node.nil?
          # Tail is behind, try to advance it
          @tail.compare_and_set(tail, next_node)
        else
          # Read value before CAS
          value = next_node.value
          # Try to swing head to next node
          if @head.compare_and_set(head, next_node)
            return value
          end
        end
      end
    end
  end
  
  def empty?
    head = @head.get
    tail = @tail.get
    next_node = head.next.get
    head == tail && next_node.nil?
  end
end

# Producer-consumer pattern
queue = LockFreeQueue.new

# Producers add items
producers = 3.times.map do |producer_id|
  Thread.new do
    100.times do |i|
      queue.enqueue("P#{producer_id}-#{i}")
      sleep(0.001)
    end
  end
end

# Consumers process items
results = Concurrent::Array.new
consumers = 3.times.map do
  Thread.new do
    loop do
      item = queue.dequeue
      break unless item
      results << item
      sleep(0.001)
    end
  end
end

producers.each(&:join)
sleep(0.5)  # Let consumers drain queue
consumers.each(&:kill)

puts "Processed #{results.size} items"

Lock-free object pools manage reusable objects without locking, useful for reducing allocation overhead in performance-critical code. The pool maintains a stack of available objects, with acquire operations popping from the stack and release operations pushing back. The structure combines lock-free stack principles with object lifecycle management.

class LockFreeObjectPool
  Node = Struct.new(:object, :next)
  
  def initialize(size, &factory)
    @head = Concurrent::AtomicReference.new(nil)
    @factory = factory
    @available = Concurrent::AtomicFixnum.new(0)
    
    # Pre-populate pool
    size.times { release(factory.call) }
  end
  
  def acquire
    loop do
      old_head = @head.get
      return @factory.call unless old_head
      
      next_node = old_head.next
      if @head.compare_and_set(old_head, next_node)
        @available.decrement
        return old_head.object
      end
    end
  end
  
  def release(object)
    loop do
      old_head = @head.get
      new_node = Node.new(object, old_head)
      if @head.compare_and_set(old_head, new_node)
        @available.increment
        return
      end
    end
  end
  
  def available_count
    @available.get
  end
end

# Example: Connection pool
ConnectionStub = Struct.new(:id) do
  def reset
    # Reset connection state
  end
end

pool = LockFreeObjectPool.new(10) { |i| ConnectionStub.new(i) }

threads = 20.times.map do
  Thread.new do
    100.times do
      conn = pool.acquire
      # Use connection
      sleep(0.01)
      conn.reset
      pool.release(conn)
    end
  end
end

threads.each(&:join)
puts "Pool size: #{pool.available_count}"

Design Considerations

Lock-free programming makes sense when lock contention significantly degrades performance and the algorithm complexity remains manageable. High contention scenarios with short critical sections benefit most, as the overhead of acquiring locks dominates execution time. Low contention workloads see minimal benefit and may perform worse due to retry overhead.

Traditional locks provide simpler reasoning and implementation at the cost of potential blocking. A thread acquiring a lock guarantees exclusive access, making state changes straightforward to reason about. Lock-free algorithms require understanding memory ordering, ABA problems, and linearization points, significantly increasing implementation complexity. Choose locks for complex state transitions or when correctness matters more than maximum throughput.

The presence of garbage collection strongly influences lock-free algorithm viability. Languages with GC like Ruby automatically solve memory reclamation problems that plague manual memory management. Without GC, implementing safe lock-free structures requires sophisticated techniques like hazard pointers or epoch-based reclamation, adding substantial complexity. Manual memory management often tips the balance toward simpler lock-based approaches.

Real-time systems and latency-sensitive applications benefit from lock-free guarantees. Locks can cause unbounded delays through priority inversion or preemption of lock holders. Lock-free algorithms guarantee system-wide progress regardless of thread scheduling, eliminating worst-case blocking scenarios. However, individual operations may still experience high latency due to repeated retries under extreme contention.

Algorithm composability differs significantly between approaches. Multiple lock-based operations compose through acquiring multiple locks in consistent order, preventing deadlock through lock ordering disciplines. Lock-free operations compose poorly; combining multiple CAS operations into a larger atomic operation typically requires redesigning the entire algorithm. This limitation restricts lock-free techniques to isolated data structures rather than complex transactional scenarios.

Debugging complexity increases dramatically with lock-free code. Race conditions manifest inconsistently and only under specific timing conditions, making reproduction difficult. Traditional debugging tools designed for sequential code provide limited insight into concurrent interactions. Testing requires stress testing with multiple threads and intentional scheduling perturbations to expose bugs that might occur rarely in production.

# Lock-based approach: Simple but blocks threads
class LockedCache
  def initialize
    @cache = {}
    @mutex = Mutex.new
  end
  
  def get(key)
    @mutex.synchronize { @cache[key] }
  end
  
  def put(key, value)
    @mutex.synchronize { @cache[key] = value }
  end
end

# Lock-free approach: More complex but non-blocking
class LockFreeCache
  def initialize
    @cache = Concurrent::AtomicReference.new({})
  end
  
  def get(key)
    @cache.get[key]
  end
  
  def put(key, value)
    loop do
      old_map = @cache.get
      new_map = old_map.merge(key => value)
      break if @cache.compare_and_set(old_map, new_map)
    end
  end
end

Read-heavy workloads with infrequent writes favor lock-free structures using atomic references. Readers access the reference without synchronization, while writers create new structures and atomically swap them in. This copy-on-write pattern works well for configuration data, routing tables, or caches where reads vastly outnumber writes and stale reads remain acceptable for brief periods.

Write-heavy workloads require careful evaluation. Lock-free structures under high write contention experience many CAS failures and retries, potentially performing worse than well-designed locks. Coarse-grained locks may outperform lock-free alternatives when contention causes excessive retry overhead. Profile real workloads before committing to lock-free implementations for write-dominated scenarios.

Performance Considerations

Lock-free algorithms eliminate the overhead of lock acquisition and release, reducing latency in uncontended cases. A CAS operation executes in tens of nanoseconds, while lock operations require hundreds of nanoseconds for the synchronization overhead. This advantage disappears under contention when CAS operations repeatedly fail and retry, potentially performing worse than locks that queue waiting threads efficiently.

Cache coherence traffic dominates lock-free performance characteristics. Every CAS operation, successful or not, generates cache coherence messages to ensure atomic visibility across cores. Under high contention, multiple cores hammer the same cache line with CAS operations, causing cache line bouncing that saturates memory bandwidth. This coherence traffic often limits scalability more than the algorithm itself.

The retry penalty scales with contention levels. Each failed CAS wastes CPU cycles re-reading state and recomputing updates. Under extreme contention, threads spend more time retrying than making progress. Exponential backoff can reduce contention by having threads wait briefly after failures, trading latency for throughput, but determining optimal backoff strategies remains difficult.

# Measuring contention impact
require 'benchmark'
require 'concurrent'

def benchmark_counter(counter_class, thread_count, iterations)
  counter = counter_class.new
  
  Benchmark.measure do
    threads = thread_count.times.map do
      Thread.new do
        iterations.times { counter.increment }
      end
    end
    threads.each(&:join)
  end
end

class LockedCounter
  def initialize
    @count = 0
    @mutex = Mutex.new
  end
  
  def increment
    @mutex.synchronize { @count += 1 }
  end
end

class LockFreeCounter
  def initialize
    @count = Concurrent::AtomicFixnum.new(0)
  end
  
  def increment
    @count.increment
  end
end

# Low contention: 2 threads
puts "Low contention (2 threads):"
puts "Locked: #{benchmark_counter(LockedCounter, 2, 100_000)}"
puts "Lock-free: #{benchmark_counter(LockFreeCounter, 2, 100_000)}"

# High contention: 8 threads
puts "\nHigh contention (8 threads):"
puts "Locked: #{benchmark_counter(LockedCounter, 8, 100_000)}"
puts "Lock-free: #{benchmark_counter(LockFreeCounter, 8, 100_000)}"

Memory ordering requirements impose performance costs. Sequential consistency provides simplest semantics but requires expensive memory barriers. Acquire-release semantics reduce overhead by allowing some reordering while maintaining synchronization points. Relaxed ordering offers maximum performance but requires expert understanding to use correctly. Ruby's atomic operations typically use acquire-release semantics as a reasonable balance.

False sharing severely degrades multi-threaded performance when unrelated variables share cache lines. Two threads modifying separate variables on the same cache line invalidate each other's caches repeatedly, serializing what should be parallel operations. Lock-free structures must pad shared state to separate cache lines, wasting memory to preserve performance.

# False sharing mitigation through padding
class PaddedCounter
  CACHE_LINE_SIZE = 64
  
  # Pad to separate cache lines
  attr_accessor :count, :_pad1, :_pad2, :_pad3, :_pad4, :_pad5, :_pad6, :_pad7
  
  def initialize
    @count = Concurrent::AtomicFixnum.new(0)
    # Padding ensures count sits on its own cache line
  end
end

Read scaling differs significantly from write scaling. Lock-free structures using atomic loads scale reads to arbitrary thread counts since reads don't modify cache lines. Write operations scale poorly due to cache coherence requirements. Designs should separate read and write paths, using techniques like per-thread counters that aggregate periodically to achieve better write scalability.

Specialized hardware instructions beyond CAS can improve performance. Double-width CAS (DCAS) atomically updates two memory locations, simplifying some algorithms. Load-linked/store-conditional (LL/SC) primitives avoid ABA problems inherently. However, portability suffers when relying on platform-specific instructions, and abstractions like Ruby's concurrent-ruby hide these details, providing consistent but potentially suboptimal performance across platforms.

Measuring lock-free performance requires careful methodology. Microbenchmarks often show lock-free advantages that disappear in real applications. Contention patterns, memory allocation rates, and system load all affect relative performance. Always benchmark with realistic workloads on target hardware under expected load conditions before concluding lock-free provides benefits.

Common Pitfalls

The ABA problem corrupts data structures when values change and revert between checks. A thread reads value A, intending to CAS it to C. Another thread changes A to B then back to A. The first thread's CAS succeeds despite missing the B state, potentially violating invariants. Solutions include tagged pointers that increment version counters with each change, making A-B-A transitions distinguishable.

# ABA problem in stack
class VulnerableStack
  Node = Struct.new(:value, :next)
  
  def initialize
    @head = Concurrent::AtomicReference.new(nil)
  end
  
  def pop
    loop do
      old_head = @head.get
      return nil unless old_head
      
      next_node = old_head.next
      # Between these lines, another thread might:
      # 1. Pop this node
      # 2. Pop another node
      # 3. Push this node back
      # Now old_head points to reused node with different next
      return old_head.value if @head.compare_and_set(old_head, next_node)
    end
  end
end

# ABA-safe version with version counter
class SafeStack
  class VersionedNode
    attr_reader :value, :next, :version
    
    def initialize(value, next_node, version = 0)
      @value = value
      @next = next_node
      @version = version
    end
  end
  
  def initialize
    @head = Concurrent::AtomicReference.new(nil)
  end
  
  def pop
    loop do
      old_head = @head.get
      return nil unless old_head
      
      next_node = old_head.next
      # Version counter makes reused nodes distinguishable
      new_next = next_node ? VersionedNode.new(
        next_node.value,
        next_node.next,
        next_node.version + 1
      ) : nil
      
      return old_head.value if @head.compare_and_set(old_head, new_next)
    end
  end
end

Memory ordering violations cause subtle bugs when operations appear reordered. Compilers and processors reorder operations for optimization unless explicit ordering constraints exist. Reading a flag then accessing data it protects may see stale data if the read reorders before initialization completes. Ruby's atomic operations provide sufficient ordering for most use cases, but understanding the guarantees remains critical.

Livelock occurs when threads continuously retry operations without making progress. Under extreme contention, threads may spin indefinitely retrying CAS operations that repeatedly fail. Unlike deadlock where threads block permanently, livelocked threads consume CPU without useful work. Exponential backoff or yielding after failed attempts can mitigate livelock by reducing contention temporarily.

# Livelock prevention with backoff
class BackoffCounter
  def initialize
    @count = Concurrent::AtomicFixnum.new(0)
  end
  
  def increment
    attempts = 0
    loop do
      current = @count.get
      return if @count.compare_and_set(current, current + 1)
      
      # Exponential backoff after failures
      attempts += 1
      if attempts > 3
        sleep(2 ** (attempts - 3) * 0.001)  # 1ms, 2ms, 4ms...
      end
    end
  end
end

Reference equality vs value equality confusion causes CAS failures. Ruby's compare_and_set uses reference equality (equal?) not value equality (==). Two arrays containing identical elements fail comparison if they're different objects. This behavior prevents accidental matches but requires careful handling of value types.

atomic_ref = Concurrent::AtomicReference.new([1, 2, 3])

# This fails: different array object with same values
atomic_ref.compare_and_set([1, 2, 3], [4, 5, 6])  # => false

# This succeeds: same array object
old_array = atomic_ref.get
atomic_ref.compare_and_set(old_array, [4, 5, 6])  # => true

Neglecting memory visibility leads to infinite loops or stale reads. Without proper synchronization, one thread's writes may never become visible to other threads. While Ruby's GIL masks some visibility issues in MRI, code must remain correct for GIL-free implementations. Always use atomic operations or proper synchronization for shared state.

Incorrect retry logic corrupts state when operations have side effects. If the block passed to an atomic update has side effects beyond the returned value, retries execute those effects multiple times. Keep update blocks pure, computing new state from current state without external effects.

counter = Concurrent::AtomicFixnum.new(0)
log = []

# Wrong: side effects in update block
counter.update do |v|
  log << "Update: #{v}"  # Executes multiple times on retry
  v + 1
end

# Right: separate side effects from update
old_value = counter.get
counter.update { |v| v + 1 }
log << "Update: #{old_value}"

Unbounded retry loops waste CPU under contention. Operations that continuously retry without success consume resources without progress. Implement retry limits or backoff strategies to prevent resource exhaustion. Consider falling back to locking if retry counts exceed thresholds indicating pathological contention.

Testing challenges hide concurrency bugs. Lock-free code may work correctly under low thread counts but fail under production load. Race conditions appear rarely and non-deterministically. Stress tests with high thread counts, deliberate scheduling jitter, and memory sanitizers help expose issues, but cannot prove correctness. Formal verification or careful reasoning about linearization points provides stronger guarantees than testing alone.

Reference

Atomic Operations

Operation	Description	Use Case
get	Read current value	Snapshot shared state
set	Write new value	Initialize or reset state
compare_and_set	Atomic CAS operation	Core lock-free primitive
get_and_set	Swap with new value	Reset with old value retrieval
update	Retry block until success	Atomic transformations
increment	Atomic add 1	Counters and statistics
decrement	Atomic subtract 1	Reference counting

Progress Guarantees

Guarantee	Definition	Trade-offs
Wait-free	Every thread completes in bounded steps	Most complex, highest overhead
Lock-free	System makes progress, individual threads may starve	Balanced complexity and guarantees
Obstruction-free	Progress when thread runs alone	Simplest, weakest guarantees
Blocking	Uses locks, threads may wait indefinitely	Simplest implementation, blocking delays

Memory Ordering Models

Model	Visibility Guarantee	Performance	Complexity
Sequential Consistency	Total order across threads	Slowest	Simplest reasoning
Acquire-Release	Synchronization at marked points	Moderate	Moderate reasoning
Relaxed	Minimal ordering constraints	Fastest	Expert-level reasoning

Common Lock-Free Data Structures

Structure	Complexity	ABA Vulnerable	Best Use Case
Counter	Simple	No	Metrics and statistics
Stack	Simple	Yes	LIFO work queues
Queue	Moderate	Yes	FIFO work distribution
Hash Table	Complex	Yes	Concurrent lookups
Skip List	Complex	Yes	Ordered concurrent access

Concurrent-Ruby Atomic Types

Class	Purpose	Key Methods
AtomicFixnum	Thread-safe integer	increment, decrement, update
AtomicBoolean	Thread-safe boolean	make_true, make_false, value
AtomicReference	Thread-safe object reference	get, set, compare_and_set
AtomicMarkableReference	Reference with boolean mark	get, set, compare_and_set

ABA Problem Mitigation

Technique	Mechanism	Trade-offs
Garbage Collection	GC prevents address reuse	Language-dependent
Version Counters	Tag values with monotonic counter	Increased memory, counter overflow
Hazard Pointers	Announce accessed pointers	Manual memory management complexity
Epoch-Based Reclamation	Batch reclamation by epoch	Delayed reclamation, memory overhead

Performance Characteristics

Scenario	Lock-Based	Lock-Free	Winner
Low contention	Fast	Faster	Lock-free
High read contention	Moderate	Fast	Lock-free
High write contention	Moderate	Slow	Lock-based
Complex transactions	Fast	Very complex	Lock-based
Predictable latency	Variable	Variable with retries	Depends on contention
Simple operations	Overhead	Minimal overhead	Lock-free

Debugging Checklist

Check	Issue Detected	Resolution
Reference vs value equality	CAS failing unexpectedly	Use reference equality consistently
Side effects in update blocks	Duplicate effects on retry	Move side effects outside update
Memory visibility	Infinite loops, stale reads	Use proper atomic operations
False sharing	Poor multi-core scaling	Pad variables to separate cache lines
Livelock	High CPU, no progress	Add backoff or retry limits
ABA corruption	Rare data structure corruption	Use version counters or hazard pointers

Ruby Implementation Decision Matrix

Scenario	Recommendation	Rationale
Simple counter/flag	AtomicFixnum/AtomicBoolean	Built-in, optimal performance
Reference swapping	AtomicReference	Clean semantics for object references
Complex state	Lock-based with Mutex	Simpler reasoning, less error-prone
High read, low write	Atomic with copy-on-write	Readers never block
Queue/Stack	Concurrent collections or custom lock-free	Depends on performance requirements
MRI production code	Consider lock-based	GIL limits parallelism benefits
JRuby/TruffleRuby	Lock-free where appropriate	True parallelism enables benefits

Lock-Free Programming