Array Uniqueness

This guide covers array uniqueness operations in Ruby, including deduplication methods, custom comparison logic, and performance considerations for large datasets.

Overview

Ruby provides several methods for removing duplicate elements from arrays. The primary methods are Array#uniq and Array#uniq!, which remove duplicate elements based on element equality or custom block logic. Ruby determines uniqueness using each element's eql? and hash methods (the same comparison Hash keys use), keeping the first occurrence of each unique element and preserving the original array order.

The uniq method returns a new array with duplicates removed, while uniq! modifies the original array in place. Both methods accept an optional block that defines custom uniqueness criteria.

numbers = [1, 2, 2, 3, 3, 3, 4]
numbers.uniq
# => [1, 2, 3, 4]

words = ['apple', 'APPLE', 'banana', 'BANANA']
words.uniq(&:downcase)
# => ['apple', 'banana']

Ruby also provides Array#| (union operator) which combines arrays while removing duplicates, and set operations through the Set class for more complex uniqueness requirements.
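
A quick illustration of both alternatives:

a = [1, 2, 2, 3]
b = [3, 4, 4]

# Union operator combines both arrays and drops duplicates
a | b
# => [1, 2, 3, 4]

require 'set'
Set.new(a + b).to_a
# => [1, 2, 3, 4]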

Basic Usage

The uniq method removes duplicates whether or not they are adjacent. Ruby compares elements using their eql? and hash methods, so objects with the same value but different object identities count as duplicates.

# Basic deduplication
items = [1, 1, 2, 3, 2, 4, 3]
items.uniq
# => [1, 2, 3, 4]

# String deduplication
names = ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob']
names.uniq
# => ['Alice', 'Bob', 'Charlie']

# Mixed data types
mixed = [1, '1', 1, 2, '2', 2]
mixed.uniq
# => [1, '1', 2, '2']
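
Because comparison uses eql? rather than ==, numeric values of different classes stay separate even though == treats them as equal:

1 == 1.0
# => true
1.eql?(1.0)
# => false

[1, 1.0, 1].uniq
# => [1, 1.0]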

The uniq! method modifies the original array and returns the array if changes were made, or nil if no duplicates were found.

numbers = [5, 5, 6, 7, 6, 8]
result = numbers.uniq!
# numbers is now [5, 6, 7, 8]
# result is [5, 6, 7, 8]

no_duplicates = [1, 2, 3, 4]
result = no_duplicates.uniq!
# no_duplicates remains [1, 2, 3, 4]  
# result is nil

Both methods accept blocks for custom uniqueness logic. The block receives each element and should return a value used for comparison.

people = [
  { name: 'Alice', age: 30 },
  { name: 'Bob', age: 25 },
  { name: 'Alice', age: 35 },
  { name: 'Charlie', age: 30 }
]

# Remove duplicates by name
unique_names = people.uniq { |person| person[:name] }
# => [{name: 'Alice', age: 30}, {name: 'Bob', age: 25}, {name: 'Charlie', age: 30}]

# Remove duplicates by age
unique_ages = people.uniq { |person| person[:age] }
# => [{name: 'Alice', age: 30}, {name: 'Bob', age: 25}, {name: 'Alice', age: 35}]

Advanced Usage

Complex uniqueness scenarios often require sophisticated block logic or preprocessing. Ruby's block-based uniqueness allows for multi-attribute comparison, normalized comparison, and conditional uniqueness rules.

For multi-attribute uniqueness, return an array of values from the block:

products = [
  { name: 'Laptop', brand: 'Dell', price: 999 },
  { name: 'Laptop', brand: 'HP', price: 899 },
  { name: 'Laptop', brand: 'Dell', price: 1099 },
  { name: 'Mouse', brand: 'Dell', price: 25 },
  { name: 'Mouse', brand: 'HP', price: 30 }
]

# Unique by name AND brand combination
unique_products = products.uniq { |p| [p[:name], p[:brand]] }
# => [
#   {name: 'Laptop', brand: 'Dell', price: 999},
#   {name: 'Laptop', brand: 'HP', price: 899}, 
#   {name: 'Mouse', brand: 'Dell', price: 25},
#   {name: 'Mouse', brand: 'HP', price: 30}
# ]

Normalized uniqueness handles case sensitivity, whitespace, and other formatting differences:

emails = [
  ' alice@example.com ',
  'ALICE@EXAMPLE.COM',
  'bob@test.org',
  '  BOB@TEST.ORG  ',
  'charlie@demo.net'
]

# Normalize emails for comparison
unique_emails = emails.uniq { |email| email.strip.downcase }
# => [' alice@example.com ', 'bob@test.org', 'charlie@demo.net']

# Complex normalization with regex
phone_numbers = [
  '(555) 123-4567',
  '555-123-4567', 
  '5551234567',
  '(555) 987-6543',
  '555.987.6543'
]

unique_phones = phone_numbers.uniq do |phone|
  phone.gsub(/\D/, '') # Remove all non-digits
end
# => ['(555) 123-4567', '(555) 987-6543']

Conditional uniqueness applies different rules based on element properties:

transactions = [
  { type: 'debit', amount: 100, account: 'checking' },
  { type: 'debit', amount: 100, account: 'savings' },
  { type: 'credit', amount: 100, account: 'checking' },
  { type: 'credit', amount: 100, account: 'checking' },
  { type: 'debit', amount: 200, account: 'checking' }
]

# For credits: unique by amount and account
# For debits: unique by amount only
unique_transactions = transactions.uniq do |txn|
  if txn[:type] == 'credit'
    [txn[:type], txn[:amount], txn[:account]]
  else
    [txn[:type], txn[:amount]]
  end
end
# => [
#   {type: 'debit', amount: 100, account: 'checking'},
#   {type: 'credit', amount: 100, account: 'checking'},
#   {type: 'debit', amount: 200, account: 'checking'}
# ]

Chaining uniqueness operations handles multiple deduplication steps:

log_entries = [
  'ERROR: Database connection failed at 10:30',
  'ERROR: database connection failed at 10:30',
  'INFO: User login successful',
  'ERROR: Database Connection Failed at 10:30', 
  'WARN: Low disk space',
  'error: database connection failed at 10:30'
]

# First normalize case, then remove duplicates
normalized_logs = log_entries
  .map(&:downcase)
  .uniq
  .sort
# => ['error: database connection failed at 10:30', 'info: user login successful', 'warn: low disk space']
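
Mapping before uniq discards the original formatting. Passing the normalization as a block to uniq keeps the first occurrence in its original form instead:

log_entries.uniq { |line| line.downcase }
# => [
#   'ERROR: Database connection failed at 10:30',
#   'INFO: User login successful',
#   'WARN: Low disk space'
# ]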

Performance & Memory

Array uniqueness operations have significant performance implications for large datasets. The uniq method has O(n) time complexity but requires additional memory for the new array, while uniq! modifies in place but still requires temporary storage for duplicate tracking.

Ruby uses a hash internally to track seen elements, making lookup operations efficient. However, memory usage increases with array size and element complexity:

# Memory-efficient approach for large arrays
large_array = (1..1_000_000).to_a.concat((1..500_000).to_a)

# uniq creates new array - doubles memory usage temporarily
unique_copy = large_array.uniq # Uses ~24MB additional memory

# uniq! modifies in place - more memory efficient  
large_array.uniq! # Uses ~12MB additional memory for tracking

Custom blocks add computational overhead. Complex block logic multiplies processing time:

require 'benchmark'

data = Array.new(100_000) { |i| { id: i % 50_000, value: rand(1000) } }

Benchmark.bm(20) do |x|
  x.report('simple uniq:') do
    data.uniq { |item| item[:id] }
  end
  
  x.report('complex block:') do  
    data.uniq { |item| [item[:id], item[:value] > 500 ? 'high' : 'low'] }
  end
  
  x.report('expensive operation:') do
    data.uniq { |item| item.to_s.hash }
  end
end

#                          user     system      total        real
# simple uniq:         0.125000   0.000000   0.125000 (  0.127234)
# complex block:       0.234000   0.000000   0.234000 (  0.235678) 
# expensive operation: 0.445000   0.000000   0.445000 (  0.447890)

For large datasets with simple uniqueness requirements, consider preprocessing or alternative data structures:

# Set for O(1) insertion and lookup
require 'set'

large_numbers = Array.new(1_000_000) { rand(100_000) }

# Using Set - faster for repeated uniqueness checks
unique_set = large_numbers.to_set
# Convert back to array if needed
unique_array = unique_set.to_a

# Hash-based manual deduplication for custom logic
# (complex_data and custom_key_function are placeholders for your
# own collection and key-extraction logic)
seen = {}
unique_items = []
complex_data.each do |item|
  key = custom_key_function(item)
  unless seen[key]
    seen[key] = true        # mark this key as seen
    unique_items << item    # keep only the first item per key
  end
end

Memory-conscious strategies reduce peak memory usage:

# Process in chunks for very large datasets
def unique_in_chunks(array, chunk_size = 10_000)
  seen = Set.new
  unique_items = []
  
  array.each_slice(chunk_size) do |chunk|
    chunk.each do |item|
      unless seen.include?(item)
        seen.add(item)
        unique_items << item
      end
    end
    
    # Optional: periodic cleanup
    GC.start if seen.size % 50_000 == 0
  end
  
  unique_items
end
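
A hypothetical call, assuming an input much larger than the chunk size:

values = Array.new(2_000_000) { rand(250_000) }
unique_values = unique_in_chunks(values)
unique_values.size
# => close to 250_000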

Common Pitfalls

Object identity versus value equality creates the most frequent uniqueness confusion. uniq compares elements using eql? and hash, not object identity (equal?):

# Same content, different objects - considered equal
str1 = String.new('hello')
str2 = String.new('hello')
[str1, str2].uniq
# => ['hello'] - only one element

# Object identity doesn't affect uniqueness
a = [1, 2]
b = [1, 2] 
[a, b].uniq
# => [[1, 2]] - only one element because arrays have same content

Mutable objects can cause unexpected behavior when modified after uniqueness operations:

arrays = [[1, 2], [3, 4], [1, 2]]
unique_arrays = arrays.uniq
# => [[1, 2], [3, 4]]

# Modifying original affects the unique result
arrays[0] << 3
unique_arrays
# => [[1, 2, 3], [3, 4]] - first array is now modified

Frozen objects prevent this issue:

arrays = [[1, 2].freeze, [3, 4].freeze, [1, 2].freeze]
unique_arrays = arrays.uniq
arrays[0] << 3 # Raises FrozenError

Uniqueness for custom objects depends on proper eql? and hash implementations:

class Person
  attr_reader :name, :age
  
  def initialize(name, age)
    @name, @age = name, age
  end
  
  # No eql? or hash overrides yet
end

people = [
  Person.new('Alice', 30),
  Person.new('Alice', 30),
  Person.new('Bob', 25)
]

people.uniq.size
# => 3 - all considered different (object identity)

class Person
  # Value-based equality for general comparison
  def ==(other)
    other.is_a?(Person) && name == other.name && age == other.age
  end

  # uniq compares elements with eql? and hash, so both must be defined
  alias eql? ==

  def hash
    [name, age].hash
  end
end

people.uniq.size 
# => 2 - Alice duplicates removed
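
Struct-based value objects sidestep the boilerplate: Struct defines ==, eql?, and hash from its members, so uniq works without extra code. A minimal sketch with a hypothetical PersonStruct:

PersonStruct = Struct.new(:name, :age)

people = [
  PersonStruct.new('Alice', 30),
  PersonStruct.new('Alice', 30),
  PersonStruct.new('Bob', 25)
]

people.uniq.size
# => 2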

Block return values are compared with eql? and hash as well, so the same caveats apply to them. Mixing types in the block result does not raise an error, but keys that look interchangeable can still count as distinct. A Symbol is never eql? to the equivalent String, and 1 is never eql? to 1.0:

data = [1, 2, 3, 'a', 'b']

# The block returns each element's class as the comparison key
data.uniq { |x| x.class }
# => [1, 'a']

# Symbol and String keys stay separate
statuses = [{ status: :active }, { status: 'active' }, { status: :active }]
statuses.uniq { |s| s[:status] }
# => [{status: :active}, {status: 'active'}]

Nil values in blocks require careful handling:

records = [
  { name: 'Alice', email: 'alice@example.com' },
  { name: 'Bob', email: nil },
  { name: 'Charlie', email: nil },
  { name: 'David', email: 'david@example.com' }
]

# Naive approach - nil values are equal
records.uniq { |r| r[:email] }
# => Only one record with nil email

# Better approach - handle nil explicitly
records.uniq { |r| r[:email] || "no_email_#{r[:name]}" }
# => Keeps all records with unique keys

Reference

Core Methods

Method Parameters Returns Description
Array#uniq &block (optional) Array Returns new array with duplicates removed
Array#uniq! &block (optional) Array or nil Removes duplicates in place, returns array if modified or nil
Array#| other_array Array Returns union of both arrays with duplicates removed

Set Operations

Method Parameters Returns Description
Set.new enumerable (optional) Set Creates set with unique elements
Set#to_a None Array Converts set to array
Set#add element Set Adds element to set
Set#include? element Boolean Checks if element exists
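
The same methods in use:

require 'set'

seen = Set.new
seen.add('alice')
seen.add('alice')
seen.include?('alice')
# => true
seen.to_a
# => ['alice']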

Block Usage Patterns

Pattern Example Use Case
Attribute-based array.uniq { |x| x.name } Remove duplicates by object attribute
Multi-attribute array.uniq { |x| [x.name, x.age] } Composite uniqueness key
Normalized array.uniq { |x| x.downcase } Case-insensitive uniqueness
Conditional array.uniq { |x| x.type == 'A' ? x.id : x.code } Different keys based on element

Performance Characteristics

Operation Time Complexity Space Complexity Notes
Array#uniq O(n) O(n) Creates new array
Array#uniq! O(n) O(n) temporary Modifies in place
Array#uniq with block O(n * block_time) O(n) Block evaluation adds overhead
Set.new O(n) O(n) Fastest for repeated uniqueness checks

Common Return Values

Scenario uniq Returns uniq! Returns
Duplicates found New array without duplicates Modified original array
No duplicates New array (copy of original) nil
Empty array Empty array nil
Single element Single-element array nil
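
The edge-case rows in action:

[].uniq!
# => nil

[42].uniq!
# => nil

[1, 2, 3].uniq!
# => nil (no duplicates, original unchanged)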

Error Conditions

Error Type Cause Example
FrozenError Calling uniq! on frozen array [1, 2, 2].freeze.uniq!

Custom objects with missing or inconsistent eql? and hash implementations do not raise errors; uniq silently treats each instance as unique instead.
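
Demonstrating the frozen-array case:

numbers = [1, 2, 2].freeze

begin
  numbers.uniq!
rescue FrozenError => e
  puts e.message
end
# prints something like: can't modify frozen Array: [1, 2, 2]

# The non-destructive form still works on frozen arrays
numbers.uniq
# => [1, 2]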

Memory Usage Guidelines

Array Size Recommended Approach Reasoning
< 1,000 elements Array#uniq Memory overhead negligible
1,000 - 100,000 elements Array#uniq! or Set Balance performance and memory
> 100,000 elements Chunked processing or Set Avoid memory pressure
Repeated operations Set for storage O(1) uniqueness checks