Array Set Operations

Array Set Operations in Ruby provide methods for treating arrays as mathematical sets, including union, intersection, difference, and uniqueness operations.


Overview

Ruby arrays support mathematical set operations through built-in methods that treat array elements as members of sets. These operations include union (|), intersection (&), difference (-), and uniqueness methods (uniq, uniq!). Ruby implements set operations directly on the Array class, eliminating the need for separate data structures in many cases.

The core set operations maintain Ruby's approach of returning new arrays rather than modifying existing ones, except for bang methods. All set operations use eql? and hash methods to determine element equality, making them compatible with custom objects that override these methods.
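
A quick illustration: == treats numerically equal values of different classes as equal, but eql? does not, so set operations distinguish them.

1 == 1.0      # => true
[1] & [1.0]   # => [] - eql? distinguishes Integer from Float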

numbers = [1, 2, 3, 4, 5]
evens = [2, 4, 6, 8]

# Union - all unique elements from both arrays
numbers | evens
# => [1, 2, 3, 4, 5, 6, 8]

# Intersection - elements present in both arrays
numbers & evens
# => [2, 4]

# Difference - elements in first array but not in second
numbers - evens
# => [1, 3, 5]

Set operations handle duplicate elements by removing them from results. The order of elements in results follows specific rules: union preserves order from the left operand followed by new elements from the right operand, while intersection preserves order from the left operand.

duplicates = [1, 2, 2, 3, 3, 4]
duplicates.uniq
# => [1, 2, 3, 4]

# Block form for custom uniqueness criteria
words = ["hello", "world", "help", "work"]
words.uniq { |word| word[0] }
# => ["hello", "world"]

Ruby's Set class from the standard library provides additional set operations and better performance for frequent set manipulations, but array set operations remain sufficient for most use cases and integrate seamlessly with existing array-based code.
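
Set operations mirror the array operators; a brief sketch (Set also adds predicates such as subset? that Array lacks):

require 'set'

a = Set[1, 2, 3]
b = Set[3, 4]

a | b                        # => #<Set: {1, 2, 3, 4}>
a & b                        # => #<Set: {3}>
a.subset?(Set[1, 2, 3, 4])   # => true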

Basic Usage

Array set operations work with any elements that implement eql? and hash, including strings, numbers, symbols, and custom objects. The union operator | combines two arrays and removes duplicates, returning a new array containing all unique elements.

fruits = ["apple", "banana", "cherry"]
vegetables = ["carrot", "banana", "spinach"]

all_produce = fruits | vegetables
# => ["apple", "banana", "cherry", "carrot", "spinach"]

# Order preservation - elements already present are skipped,
# new elements from the right are appended in order
numbers_a = [1, 2, 3, 2]
numbers_b = [2, 4, 5, 1]
numbers_a | numbers_b
# => [1, 2, 3, 4, 5]

The intersection operator & returns elements that exist in both arrays, preserving order from the left operand and removing duplicates from the result.

programming_languages = ["ruby", "python", "javascript", "ruby"]
web_languages = ["javascript", "html", "css", "python"]

common_languages = programming_languages & web_languages
# => ["python", "javascript"]

# Works with any objects that implement eql? and hash
class Language
  attr_reader :name
  
  def initialize(name)
    @name = name
  end
  
  def eql?(other)
    other.is_a?(Language) && name == other.name
  end
  
  def hash
    name.hash
  end
end

ruby_lang = Language.new("ruby")
python_lang = Language.new("python")
js_lang = Language.new("javascript")

backend = [ruby_lang, python_lang]
frontend = [js_lang, Language.new("python")]

backend & frontend
# => [python_lang] - backend's Language instance named "python"

The difference operator - removes elements from the left array that appear in the right array, maintaining order and removing all occurrences of matching elements.

all_tasks = ["coding", "testing", "debugging", "documenting", "coding"]
completed_tasks = ["testing", "coding"]

remaining_tasks = all_tasks - completed_tasks
# => ["debugging", "documenting"]

# Multiple differences can be chained
high_priority = ["urgent_bug", "client_demo", "security_patch"]
medium_priority = ["refactoring", "client_demo", "documentation"]
low_priority = ["cleanup", "security_patch"]

critical_only = high_priority - medium_priority - low_priority
# => ["urgent_bug"]

The uniq method removes duplicate elements from an array, preserving the first occurrence of each element and maintaining original order for unique elements.

measurements = [10.5, 12.3, 10.5, 15.7, 12.3, 18.9]
unique_measurements = measurements.uniq
# => [10.5, 12.3, 15.7, 18.9]

# Block form allows custom uniqueness criteria
employees = [
  {name: "Alice", department: "Engineering"},
  {name: "Bob", department: "Sales"},
  {name: "Carol", department: "Engineering"},
  {name: "Dave", department: "Marketing"}
]

unique_departments = employees.uniq { |emp| emp[:department] }
# => [{name: "Alice", department: "Engineering"}, 
#     {name: "Bob", department: "Sales"}, 
#     {name: "Dave", department: "Marketing"}]

The uniq! method modifies the array in place, returning the array if changes were made or nil if the array was already unique.

data = [1, 1, 2, 3, 3, 4]
result = data.uniq!
# data is now [1, 2, 3, 4]
# result is [1, 2, 3, 4]

already_unique = [1, 2, 3, 4]
result = already_unique.uniq!
# already_unique remains [1, 2, 3, 4]
# result is nil

Advanced Usage

Complex set operations often require combining multiple operations or applying custom logic through blocks. Chaining set operations creates sophisticated filtering and data manipulation patterns.

# Multi-stage filtering using chained set operations
inventory_a = ["laptop", "mouse", "keyboard", "monitor", "laptop"]
inventory_b = ["mouse", "printer", "scanner", "keyboard"]
inventory_c = ["tablet", "laptop", "headphones", "mouse"]

# Items in A but not in B or C, plus items common to B and C
complex_filter = (inventory_a - inventory_b - inventory_c) | 
                 (inventory_b & inventory_c)
# => ["monitor", "mouse"]

# Using uniq with complex block logic
sales_data = [
  {product: "Widget A", region: "North", quarter: 1, revenue: 1000},
  {product: "Widget B", region: "South", quarter: 1, revenue: 1500},
  {product: "Widget A", region: "North", quarter: 2, revenue: 1200},
  {product: "Widget C", region: "North", quarter: 1, revenue: 800}
]

# Unique products per region (keeping first occurrence)
unique_region_products = sales_data.uniq { |sale| 
  [sale[:product], sale[:region]] 
}
# => First occurrence of each product-region combination

Set operations work with nested arrays and complex data structures, requiring careful consideration of equality semantics. Custom objects used in set operations must implement proper eql? and hash methods.

class Product
  attr_reader :name, :category, :price
  
  def initialize(name, category, price)
    @name, @category, @price = name, category, price
  end
  
  # Define equality based on name and category only
  def eql?(other)
    other.is_a?(Product) && 
    name == other.name && 
    category == other.category
  end
  
  def hash
    [name, category].hash
  end
  
  def to_s
    "#{name} (#{category}): $#{price}"
  end
end

electronics = [
  Product.new("Laptop", "Computer", 999),
  Product.new("Mouse", "Accessory", 29),
  Product.new("Laptop", "Computer", 1299)  # Different price, same product
]

accessories = [
  Product.new("Mouse", "Accessory", 25),    # Different price, same product
  Product.new("Keyboard", "Accessory", 79),
  Product.new("Monitor", "Display", 399)
]

# Union treats products with same name/category as identical
all_products = electronics | accessories
# => Contains one Laptop, one Mouse, one Keyboard, one Monitor

# Custom filtering with set operations
expensive_electronics = electronics.select { |p| p.price > 500 }
affordable_accessories = accessories.select { |p| p.price < 100 }

available_options = expensive_electronics | affordable_accessories
# => Products over $500 or accessories under $100

Nested set operations require understanding how Ruby handles array elements during comparison. Arrays compare element-wise, making set operations on arrays of arrays possible but potentially expensive.

coordinate_sets = [
  [[1, 2], [3, 4], [1, 2]],
  [[3, 4], [5, 6], [7, 8]],
  [[1, 2], [3, 4], [9, 10]]
]

# Find coordinates that appear in every set
common_coordinates = coordinate_sets.reduce(:&)
# => [[3, 4]] - the only coordinate present in all three sets

# Unique coordinates across all sets
all_coordinates = coordinate_sets.flatten(1).uniq
# => [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

# Complex nested operations
matrix_a = [[1, 0], [0, 1], [1, 1]]
matrix_b = [[0, 1], [0, 0], [1, 1]]
matrix_c = [[1, 1], [0, 0], [0, 1]]

# Rows unique to matrix_a compared to others
unique_to_a = matrix_a - matrix_b - matrix_c
# => [[1, 0]]

Performance-critical applications can combine set operations with other array methods for efficient data processing. The |= operator accumulates unique elements across iterations; note that it reassigns the variable to a new union array rather than mutating in place.

# Accumulating unique values across multiple sources
unique_tags = []
data_sources = [
  ["ruby", "programming", "backend"],
  ["javascript", "frontend", "programming"],
  ["python", "data", "programming"],
  ["ruby", "web", "backend"]
]

data_sources.each do |tags|
  unique_tags |= tags  # Equivalent to: unique_tags = unique_tags | tags
end
# unique_tags => ["ruby", "programming", "backend", "javascript", "frontend", "python", "data", "web"]

# Complex data transformation combining multiple operations
user_permissions = [
  {user: "alice", permissions: ["read", "write", "admin"]},
  {user: "bob", permissions: ["read", "write"]},
  {user: "carol", permissions: ["read", "admin"]}
]

all_permissions = user_permissions
  .flat_map { |up| up[:permissions] }
  .uniq
# => ["read", "write", "admin"]

admin_users = user_permissions
  .select { |up| up[:permissions].include?("admin") }
  .map { |up| up[:user] }
# => ["alice", "carol"]

Performance & Memory

Array set operations create new arrays and iterate through elements multiple times, making performance considerations important for large datasets. Understanding the algorithmic complexity helps choose appropriate operations for different scenarios.

The union operator | has O(n + m) complexity where n and m are the lengths of the input arrays. Ruby creates a hash table internally to track seen elements, making element lookup O(1) on average but requiring additional memory proportional to the number of unique elements.

# Performance comparison for different union approaches
large_array_a = (1..10_000).to_a
large_array_b = (5_000..15_000).to_a

require 'benchmark'
require 'set'

Benchmark.bm(20) do |x|
  x.report("Array union |") do
    1000.times { large_array_a | large_array_b }
  end
  
  x.report("Manual union") do
    1000.times do
      seen = {}
      result = []
      (large_array_a + large_array_b).each do |elem|
        unless seen[elem]
          seen[elem] = true
          result << elem
        end
      end
    end
  end
  
  x.report("Set union") do
    require 'set'
    1000.times do
      (Set.new(large_array_a) | Set.new(large_array_b)).to_a
    end
  end
end

# Typical results show Array#| performs comparably to Set operations
# for moderate-sized arrays but Set becomes more efficient for very large datasets

Intersection & has similar O(n + m) complexity. MRI builds its lookup hash from the second array and then walks the first, so the memory cost tracks the right-hand operand rather than the smaller array. Because the result contains only the unique common elements, intersection is generally cheaper than union when the arrays share few elements.

# Memory-efficient intersection for large arrays
def memory_efficient_intersection(array_a, array_b)
  # Use the smaller array for the hash table
  smaller, larger = array_a.size <= array_b.size ? [array_a, array_b] : [array_b, array_a]
  
  seen = {}
  smaller.each { |elem| seen[elem] = true }
  
  # Note: unlike Array#&, this keeps duplicates and ordering from the
  # larger array rather than from the left operand
  larger.select { |elem| seen[elem] }
end
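
A quick usage check for the helper above:

memory_efficient_intersection([1, 2, 3], [2, 3, 4])
# => [2, 3]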

# Comparison with built-in intersection
huge_array = (1..100_000).to_a
small_array = (50_000..50_100).to_a

Benchmark.bm(25) do |x|
  x.report("Built-in intersection") do
    100.times { huge_array & small_array }
  end
  
  x.report("Memory-efficient") do
    100.times { memory_efficient_intersection(huge_array, small_array) }
  end
end

The uniq method performance depends heavily on the hash and eql? methods of array elements. Objects with expensive hash calculations will slow down uniqueness operations significantly.

class ExpensiveHash
  attr_reader :data
  
  def initialize(data)
    @data = data
  end
  
  # Intentionally expensive hash calculation
  def hash
    data.chars.map(&:ord).reduce(0) { |sum, ord| sum + ord ** 2 }
  end
  
  def eql?(other)
    other.is_a?(ExpensiveHash) && data == other.data
  end
end

class EfficientHash
  attr_reader :data
  
  def initialize(data)
    @data = data
    @hash = data.hash  # Cache the hash
  end
  
  def hash
    @hash
  end
  
  def eql?(other)
    other.is_a?(EfficientHash) && data == other.data
  end
end

# Performance comparison with different hash implementations
expensive_objects = Array.new(1000) { |i| ExpensiveHash.new("item_#{i % 100}") }
efficient_objects = Array.new(1000) { |i| EfficientHash.new("item_#{i % 100}") }

Benchmark.bm(20) do |x|
  x.report("Expensive hash uniq") do
    10.times { expensive_objects.uniq }
  end
  
  x.report("Efficient hash uniq") do
    10.times { efficient_objects.uniq }
  end
end

For applications processing large datasets repeatedly, consider using the Set class from Ruby's standard library, which provides better performance for frequent set operations at the cost of some memory overhead. Set is conceptually unordered, although MRI's Hash-backed implementation does preserve insertion order.

require 'set'

# Set class advantages for repeated operations
large_data = (1..50_000).to_a.shuffle
query_data = (25_000..75_000).to_a

# One-time conversion cost, better repeated operation performance
data_set = Set.new(large_data)

Benchmark.bm(25) do |x|
  x.report("Array intersection") do
    100.times { large_data & query_data }
  end
  
  x.report("Set intersection") do
    100.times { data_set & query_data }
  end
  
  x.report("Array difference") do
    100.times { large_data - query_data }
  end
  
  x.report("Set difference") do
    100.times { data_set - query_data }
  end
end

Memory usage patterns differ significantly between operations. Union operations require memory for both input arrays plus the result, while in-place operations like uniq! modify existing arrays and may reduce memory usage.

# Memory usage demonstration (Unix-only: shells out to ps for resident set size)
def memory_usage_kb
  GC.start
  `ps -o pid,rss -p #{Process.pid}`.split.last.to_i
end

initial_memory = memory_usage_kb

# Large array with duplicates
large_array = Array.new(100_000) { |i| i % 10_000 }
after_creation = memory_usage_kb

# Non-destructive uniq
unique_array = large_array.uniq
after_uniq = memory_usage_kb

# Destructive uniq
large_array.uniq!
after_uniq_bang = memory_usage_kb

puts "Memory usage:"
puts "Initial: #{initial_memory} KB"
puts "After creation: #{after_creation} KB (+#{after_creation - initial_memory} KB)"
puts "After uniq: #{after_uniq} KB (+#{after_uniq - after_creation} KB)"
puts "After uniq!: #{after_uniq_bang} KB (#{after_uniq_bang - after_uniq} KB change)"

Common Pitfalls

Array set operations contain several subtle behaviors that can cause unexpected results. Understanding these pitfalls prevents bugs and helps write more robust code.

Order preservation in set operations follows specific rules that may not match intuitive expectations. Union operations preserve order from the left operand, then append new elements from the right operand in their original order.

# Order preservation pitfall
first_list = ["c", "a", "b", "a"]
second_list = ["b", "d", "a", "e"]

result = first_list | second_list
# => ["c", "a", "b", "d", "e"]
# Note: "a" and "b" from second_list are not included (already present)
# "d" and "e" are appended in their original order

# This can lead to unexpected behavior when order matters
priorities = ["high", "medium", "low", "medium"]
new_priorities = ["critical", "high", "medium"]

all_priorities = priorities | new_priorities
# => ["high", "medium", "low", "critical"]
# "critical" ends up last, not first!

# Solution: control order explicitly
all_priorities = new_priorities | priorities
# => ["critical", "high", "medium", "low"]

Duplicate handling varies between operations in ways that can surprise developers. Set operations remove duplicates from results but handle duplicates in input arrays differently.

# Duplicate handling inconsistencies
array_with_dups = [1, 2, 2, 3, 3, 3]
another_array = [2, 4, 4, 5]

# Union removes duplicates completely
union_result = array_with_dups | another_array
# => [1, 2, 3, 4, 5] - all duplicates gone

# Intersection returns each common element only once
intersection_result = array_with_dups & another_array
# => [2] - a single 2, despite multiple 2s in the left operand

# Difference removes ALL occurrences of matching elements
difference_result = array_with_dups - another_array
# => [1, 3, 3, 3] - all 2s removed, but 3s preserved with duplicates

# This can lead to data loss
important_data = [1, 1, 2, 3]  # The repeated 1 has significance
filter_out = [2]

remaining = important_data - filter_out
# => [1, 1, 3] - works as expected

# But this loses data
filter_out_with_dup = [1, 2]
remaining = important_data - filter_out_with_dup
# => [3] - both 1s removed, losing important duplicate!
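
When duplicate counts carry meaning, a count-aware difference avoids this loss. A minimal sketch using Enumerable#tally (Ruby 2.7+) that removes only as many occurrences as appear in the right-hand array:

# Hypothetical helper - not a built-in method
def multiset_difference(left, right)
  counts = right.tally   # occurrences to remove, per element
  left.reject do |elem|
    next false unless counts.fetch(elem, 0) > 0
    counts[elem] -= 1
    true
  end
end

multiset_difference([1, 1, 2, 3], [1, 2])
# => [1, 3] - only one 1 removed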

Type coercion and object equality cause subtle bugs when arrays contain mixed types or custom objects. Ruby uses eql? and hash methods, not ==, for set operations.

# Type coercion pitfall
mixed_types = [1, 1.0, "1", :one]
result = mixed_types.uniq
# => [1, 1.0, "1", :one] - all different because 1.eql?(1.0) is false

# This affects numeric operations unexpectedly
integers = [1, 2, 3, 4, 5]
floats = [1.0, 3.0, 5.0]

common_nums = integers & floats
# => [] - empty! No common elements because 1 != 1.0 with eql?

# Solution: convert types consistently
common_nums = integers.map(&:to_f) & floats
# => [1.0, 3.0, 5.0]

# Custom object equality pitfall
class Person
  attr_reader :name, :age
  
  def initialize(name, age)
    @name, @age = name, age
  end
  
  # Implements == but not eql? or hash
  def ==(other)
    other.is_a?(Person) && name == other.name
  end
end

people_a = [Person.new("Alice", 30), Person.new("Bob", 25)]
people_b = [Person.new("Alice", 35), Person.new("Carol", 28)]

# This doesn't work as expected because eql? is not implemented
common_people = people_a & people_b
# => [] - empty! eql? falls back to object identity

# Fix by implementing proper eql? and hash
class Person
  def eql?(other)
    other.is_a?(Person) && name == other.name
  end
  
  def hash
    name.hash
  end
end
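
With eql? and hash defined, the intersection matches by name and keeps the instance from the left operand:

common_people = people_a & people_b
# => [Alice from people_a] - one Person, matched by name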

Block-based uniq operations create subtle traps around the relationship between the block return value and element selection. The method keeps the first element that produces each unique block value.

# Block uniqueness pitfall
products = [
  {name: "Widget A", price: 100, category: "tools"},
  {name: "Widget B", price: 150, category: "tools"},
  {name: "Widget C", price: 100, category: "gadgets"},
  {name: "Widget D", price: 200, category: "tools"}
]

# Expecting unique prices, but this keeps first item for each price
unique_prices = products.uniq { |p| p[:price] }
# => [{name: "Widget A", price: 100, category: "tools"},
#     {name: "Widget B", price: 150, category: "tools"},
#     {name: "Widget D", price: 200, category: "tools"}]
# Widget C is lost! It has same price as Widget A

# This behavior is correct but often misunderstood
users = ["alice@example.com", "bob@company.com", "carol@example.com"]
unique_domains = users.uniq { |email| email.split("@").last }
# => ["alice@example.com", "bob@company.com"]
# "carol@example.com" is filtered out, not included as unique domain

# Solution: extract the values you actually want
actual_unique_prices = products.map { |p| p[:price] }.uniq
# => [100, 150, 200]

actual_unique_domains = users.map { |email| email.split("@").last }.uniq
# => ["example.com", "company.com"]

Performance pitfalls occur when chaining multiple set operations without considering the intermediate arrays created. Each operation creates a new array, leading to unnecessary memory allocation and processing.

# Performance pitfall with chained operations
large_dataset = (1..100_000).to_a
filter_a = (50_000..150_000).to_a
filter_b = (75_000..125_000).to_a
filter_c = (90_000..110_000).to_a

# Inefficient: creates intermediate arrays
result = large_dataset - filter_a - filter_b - filter_c

# More efficient: combine filters first
combined_filter = filter_a | filter_b | filter_c
result = large_dataset - combined_filter
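
# Ruby 2.6+ also provides Array#difference, which accepts multiple
# arrays and subtracts them in a single call
result = large_dataset.difference(filter_a, filter_b, filter_c)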

# Or use Set for multiple operations
require 'set'
dataset_set = Set.new(large_dataset)
filter_set = Set.new(filter_a + filter_b + filter_c)
result = (dataset_set - filter_set).to_a

Mutation during iteration can cause unexpected behavior when using set operations within loops or iterators. Arrays should not be modified while set operations are being performed on them.

# Mutation pitfall
data = [1, 2, 3, 4, 5]
filter = [2, 4]

# If other code mutates data mid-operation, the result is unpredictable
result = data - filter

# Safer approach: snapshot the array so the operation works on a stable copy
result = data.dup - filter

# Thread safety pitfall
shared_array = [1, 2, 3, 4, 5]
threads = []

10.times do |i|
  threads << Thread.new do
    # Multiple threads modifying and reading shared_array
    shared_array.concat([i + 6])
    unique_elements = shared_array.uniq  # Could see inconsistent state
  end
end

threads.each(&:join)

Reference

Array Set Operation Methods

Array#|(other)  - takes an Array; returns a new Array. Union: unique elements from both arrays.
Array#&(other)  - takes an Array; returns a new Array. Intersection: elements common to both arrays.
Array#-(other)  - takes an Array; returns a new Array. Difference: receiver's elements not present in other.
Array#uniq      - optional block; returns a new Array with duplicate elements removed.
Array#uniq!     - optional block; removes duplicates in place, returning the Array, or nil if nothing changed.

Set Operation Behaviors

Union (|): left operand order, then new elements from the right; all duplicates removed from the result; compares with eql? and hash.
Intersection (&): left operand order for common elements; duplicates removed from the result; compares with eql? and hash.
Difference (-): left operand order; every matching occurrence removed; compares with eql? and hash.
uniq: original order maintained; first occurrence kept; compares with eql? and hash.
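
A compact demonstration of these rules:

a = [1, 2, 2, 3]
b = [2, 3, 3, 4]

a | b   # => [1, 2, 3, 4]  left order, then new elements from the right
a & b   # => [2, 3]        left order, one occurrence each
a - b   # => [1]           every matching occurrence removed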

Complexity Characteristics

Array#|     - time O(n + m), space O(n + m). Builds a hash table for duplicate detection.
Array#&     - time O(n + m), space O(m). Hash table built from the second array.
Array#-     - time O(n + m), space O(m). Hash table of the elements to remove.
Array#uniq  - time O(n), space O(n). Hash table of seen elements.
Array#uniq! - time O(n), space O(n). Same hash table; modifies the array in place.

Element Equality Requirements

For custom objects used in set operations:

class CustomObject
  attr_reader :key

  def initialize(key)
    @key = key
  end

  # Required: define when two objects count as equal;
  # must be consistent with hash
  def eql?(other)
    other.is_a?(CustomObject) && key == other.key
  end

  # Required: equal objects must return the same hash code
  def hash
    key.hash
  end
end
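
A quick check that the contract holds, assuming the sketch above:

a = CustomObject.new("x")
b = CustomObject.new("x")
a.eql?(b)          # => true
[a, b].uniq.size   # => 1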

Common Operation Patterns

# Multiple set operations
array1 | array2 | array3           # Multiple union
array1 & array2 & array3           # Multiple intersection  
array1 - array2 - array3           # Multiple difference

# Combining with other Array methods
array.select { condition }.uniq    # Filter then unique
array.uniq.map { transform }       # Unique then transform
(array1 | array2).sort             # Union then sort

# Accumulating unique elements (|= reassigns; it does not mutate in place)
result = []
arrays.each { |arr| result |= arr }  # Accumulate unique elements

# Uniqueness by derived key
array.uniq { |elem| derived_key }  # keeps the first element per block value

Performance Considerations

Large arrays (>10,000 elements): consider the Set class - better performance for repeated operations.
Frequent set operations: cache intermediate results - avoids recomputation.
Custom objects: keep hash cheap - slow hash calculations dominate runtime.
Memory-constrained code: use uniq! where safe - modifies in place, saving memory.
Order not significant: prefer Set - more efficient for pure set semantics.

Error Conditions

nil operand: raises TypeError - [1, 2] | nil
Non-array operand: raises TypeError - [1, 2] & "string"
Block that ignores its argument: no error - array.uniq { nil } maps every element to the same key, keeping only the first element
Custom object without hash: falls back to Object#hash (identity-based), so logically equal objects are treated as distinct

Standard Library Integration

# Set class compatibility
require 'set'
array = [1, 2, 3]
set = Set[1, 2, 4]

# Convert between Array and Set
array.to_set | set                 # Array to Set conversion
(array.to_set & set).to_a          # Set operations with Array result

# Enumerable methods with set operations  
array.uniq.each_with_index         # Chain with Enumerable methods
array.group_by(&:class).transform_values(&:uniq)  # Group and unique