CrackedRuby

String Comparison

Overview

String comparison in Ruby covers equality testing, ordering, case-insensitive matching, and pattern matching. The core String class provides case-sensitive operators (==, <=>), case-insensitive methods (casecmp, casecmp?), and regular expression matching. It does not perform locale-aware collation, so ordering follows byte values rather than language-specific rules.

The core comparison methods operate on String objects and return different types based on the operation. Equality methods return boolean values, while ordering methods return -1, 0, or 1 following the spaceship operator convention, or nil when the operands cannot be compared. Ruby compares strings byte by byte, and two strings with incompatible encodings are never equal, even when their bytes match.

# Basic equality comparison
"hello" == "hello"        # => true
"hello" == "Hello"        # => false

# Ordering comparison
"apple" <=> "banana"      # => -1
"zebra" <=> "apple"       # => 1

Ruby's comparison methods do not raise on mismatched encodings: == returns false, and casecmp and casecmp? return nil, when the two encodings are incompatible. For ordering, <=> compares bytes up to the length of the shorter string; if those bytes are identical, the shorter string sorts first.

# Case-insensitive comparison
"Hello".casecmp("HELLO")  # => 0
"Hello".casecmp?("hello") # => true

# Pattern matching
"ruby123".match?(/\d+/)   # => true
"ruby123" =~ /\d+/        # => 4

Basic Usage

String equality comparison uses the == operator and the eql? method, both performing case-sensitive, byte-by-byte comparison. The two differ only for non-String arguments: eql? always returns false for them, while == defers to the other object if it responds to to_str. The != operator negates ==. All three return boolean values.

# Standard equality operations
name1 = "John"
name2 = "john"
name3 = "John"

name1 == name3           # => true
name1 == name2           # => false
name1 != name2           # => true
name1.eql?(name3)        # => true
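
The distinction between ==, eql?, and equal? only becomes visible with non-String arguments or when object identity matters. The sketch below illustrates this with a hypothetical Label class (not part of Ruby), relying on the documented behavior that String#== defers to the other object when it responds to to_str.

```ruby
# Hypothetical wrapper class, used only for illustration
class Label
  def initialize(text)
    @text = text
  end

  # Responding to to_str makes String#== defer to Label#==
  def to_str
    @text
  end

  def ==(other)
    other.respond_to?(:to_str) && @text == other.to_str
  end
end

a = "John"
b = a.dup
label = Label.new("John")

same_content  = (a == b)          # => true  (same bytes)
same_type     = a.eql?(b)         # => true  (same bytes, both Strings)
same_object   = a.equal?(b)       # => false (two distinct objects)
via_to_str    = (a == label)      # => true  (String#== calls label == a)
strict_string = a.eql?(label)     # => false (eql? requires a String)
```

Because Hash uses eql? for key lookup, a Label instance would not find an entry stored under the String key "John" even though == reports them as equal.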

The spaceship operator <=> performs lexicographical ordering comparison, returning -1 when the left operand comes before the right, 0 for equal strings, and 1 when the left operand comes after the right. This method forms the basis for string sorting operations.

# Ordering comparison examples
words = ["zebra", "apple", "banana"]

"apple" <=> "banana"     # => -1 (apple comes before banana)
"banana" <=> "banana"    # => 0 (equal strings)
"zebra" <=> "apple"      # => 1 (zebra comes after apple)

# Sorting based on comparison
words.sort               # => ["apple", "banana", "zebra"]
words.sort.reverse       # => ["zebra", "banana", "apple"]
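
Because <=> orders by byte value, every uppercase ASCII letter sorts before every lowercase one. The following sketch, using an illustrative word list, shows the default order next to two common ways of sorting without regard to case.

```ruby
mixed = ["banana", "Zebra", "apple", "Cherry"]

# Default sort: byte order, so "C" and "Z" come before "a" and "b"
byte_order = mixed.sort
# => ["Cherry", "Zebra", "apple", "banana"]

# Case-insensitive sort by comparing downcased keys
by_downcase = mixed.sort_by(&:downcase)
# => ["apple", "banana", "Cherry", "Zebra"]

# Case-insensitive sort using casecmp as the comparator (ASCII strings)
by_casecmp = mixed.sort { |a, b| a.casecmp(b) }
# => ["apple", "banana", "Cherry", "Zebra"]

# A prefix sorts before any longer string that starts with it
prefix_order = ("app" <=> "apple")
# => -1
```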

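Case-insensitive comparison uses the casecmp and casecmp? methods. The casecmp method returns the same integer values as <=> but ignores case differences for ASCII letters (A-Z, a-z) only, while casecmp? applies Unicode case folding and returns a boolean for equality testing only. Both return nil when the encodings are incompatible or the argument is not a String.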

# Case-insensitive operations
text1 = "Programming"
text2 = "PROGRAMMING"
text3 = "programming"

text1.casecmp(text2)     # => 0 (equal when ignoring case)
text1.casecmp?(text3)    # => true (equal when ignoring case)
text1.casecmp("Ruby")    # => -1 (Programming sorts before Ruby alphabetically)

# Practical case-insensitive matching
user_input = "YES"
["yes", "y", "true"].any? { |option| user_input.casecmp?(option) }  # => true
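
The two case-insensitive methods disagree on non-ASCII text, which matters as soon as input contains accented letters. A short sketch of the difference, with the strings written as escapes so the code points are unambiguous:

```ruby
lower = "\u00E4\u00F6\u00FC"   # "äöü"
upper = "\u00C4\u00D6\u00DC"   # "ÄÖÜ"

# casecmp folds only ASCII letters, so these compare as different strings
ascii_fold = lower.casecmp(upper)      # => 1

# casecmp? applies Unicode case folding
unicode_fold = lower.casecmp?(upper)   # => true
```

For user-facing text, casecmp? is usually the safer choice for equality, while casecmp remains useful when an ordering is needed and the data is known to be ASCII.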

Regular expression matching provides pattern-based comparison through several methods. The match? method returns boolean results, match returns MatchData objects, and the =~ operator returns the index of the first match or nil.

# Pattern matching variations
email = "user@example.com"
phone = "555-123-4567"
code = "ABC123"

# Boolean pattern matching
email.match?(/\A[\w+\-.]+@[a-z\d\-]+\.[a-z]+\z/i)     # => true
phone.match?(/\d{3}-\d{3}-\d{4}/)                      # => true

# Match with position information
code =~ /\d+/                                          # => 3
phone =~ /\d{3}/                                       # => 0

# Full match data
result = email.match(/(\w+)@(\w+\.\w+)/)
result[1]                                              # => "user"
result[2]                                              # => "example.com"
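
One practical difference between these methods is that match? does not populate $~ (or the related $1, $2 variables), whereas =~ and match do. The helper names below are hypothetical and exist only to isolate each call in its own method, since $~ is local to the current method.

```ruby
# match? answers yes/no without recording match data
def match_data_after_match_p(text, pattern)
  text.match?(pattern)
  $~                      # still nil: match? did not set it
end

# =~ records the match in $~
def match_data_after_tilde(text, pattern)
  text =~ pattern
  $~
end

untouched = match_data_after_match_p("ABC123", /\d+/)   # => nil
recorded  = match_data_after_tilde("ABC123", /\d+/)[0]  # => "123"
```

When only a yes/no answer is needed, match? is the lighter option because it skips building the MatchData object.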

Error Handling & Debugging

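Encoding incompatibility is a frequent source of surprising comparison results. Comparison itself does not raise: when two strings have incompatible encodings (for example, a UTF-8 string and an ASCII-8BIT string that both contain non-ASCII bytes), == returns false and casecmp and casecmp? return nil. Ruby raises Encoding::CompatibilityError for operations that must combine the two strings, such as concatenation, or when a regular expression is matched against a string in an incompatible encoding.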

# Incompatible encodings: comparison returns false, concatenation raises
utf8_string = "café".encode('UTF-8')
binary_string = "\xFF\xFE".b              # ASCII-8BIT copy of the bytes

utf8_string == binary_string              # => false, no exception
utf8_string.casecmp?(binary_string)       # => nil (incompatible encodings)

begin
  combined = utf8_string + binary_string
rescue Encoding::CompatibilityError => e
  puts "Encoding error: #{e.message}"
  # Handle by converting to a compatible encoding first
  safe_binary = binary_string.encode('UTF-8', invalid: :replace, undef: :replace)
  combined = utf8_string + safe_binary    # no exception
end

Invalid byte sequences do not make == raise, because it compares raw bytes, but they can produce misleading results and they cause regular expression matching to raise ArgumentError. Check valid_encoding? before comparing data from untrusted sources.

# Handling invalid byte sequences
def safe_compare(str1, str2)
  return false unless str1.valid_encoding? && str2.valid_encoding?
  
  # Ensure compatible encodings
  if str1.encoding != str2.encoding
    begin
      str2 = str2.encode(str1.encoding)
    rescue EncodingError  # undefined conversion, or no converter between the encodings
      return false
    end
  end
  
  str1 == str2
end

# Test with potentially problematic data
good_string = "hello"
bad_string = "\xFF\xFE"

safe_compare(good_string, bad_string)  # => false (safely handled)
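
Rejecting malformed input is not always desirable. As an alternative, String#scrub replaces invalid byte sequences so that the rest of the text can still be compared or matched. The sample string below is made up for illustration.

```ruby
raw = "caf\xC3\xA9 \xFF menu"          # valid UTF-8 except for the \xFF byte

raw_valid = raw.valid_encoding?        # => false

# Replace each invalid sequence with "?" (the default is U+FFFD)
cleaned = raw.scrub("?")

cleaned_valid = cleaned.valid_encoding?          # => true
cleaned_equal = (cleaned == "caf\u00E9 ? menu")  # => true
cleaned_match = cleaned.match?(/menu/)           # => true (no ArgumentError)
```

Scrubbing changes the data, so it suits display and search more than it suits security checks, where rejecting the input is usually the better choice.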

Regular expression comparison errors typically stem from invalid patterns or catastrophic backtracking. A pattern with nested quantifiers can take exponential time on certain inputs, so patterns from untrusted sources need both validation and a time limit.

# Regex error handling and validation
require 'timeout'

def safe_regex_match(string, pattern, timeout = 1)
  return false unless string.valid_encoding?
  
  begin
    regex = Regexp.new(pattern)
    
    # Use timeout to prevent catastrophic backtracking
    Timeout.timeout(timeout) do
      string.match?(regex)
    end
    
  rescue RegexpError => e
    puts "Invalid regex pattern: #{e.message}"
    false
  rescue Timeout::Error
    puts "Regex execution timed out"
    false
  end
end

# Examples of problematic patterns
complex_string = "a" * 1000 + "!"
evil_regex = /^(a+)+$/

safe_regex_match(complex_string, evil_regex)  # => false (no match; older Ruby versions hit the timeout here)
safe_regex_match("test", "[invalid")          # => false (invalid pattern)
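
On Ruby 3.2 and later the regex engine accepts a time limit directly, which avoids wrapping the match in Timeout. The sketch below assumes the timeout: keyword of Regexp.new and the Regexp::TimeoutError class, both introduced in Ruby 3.2, and falls back to Timeout on older versions. The helper name bounded_match? is hypothetical.

```ruby
require 'timeout'

# Returns true/false for a match, or false if the time limit is exceeded.
def bounded_match?(string, pattern, seconds = 1.0)
  if Regexp.respond_to?(:timeout)
    # Ruby 3.2+: rebuild the pattern with a per-regexp time limit
    begin
      limited = Regexp.new(pattern.source, pattern.options, timeout: seconds)
      limited.match?(string)
    rescue Regexp::TimeoutError
      false
    end
  else
    # Older Ruby: fall back to the Timeout module
    begin
      Timeout.timeout(seconds) { pattern.match?(string) }
    rescue Timeout::Error
      false
    end
  end
end

nested = /\A(a+)+\z/
matches     = bounded_match?("aaa", nested)    # => true
no_match    = bounded_match?("aaa!", nested)   # => false
```

A process-wide default can also be set with Regexp.timeout = seconds on Ruby 3.2+, which applies to every pattern that does not specify its own limit.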

Performance & Memory

String comparison performance depends on string length, how much of the content matches, and the comparison method used. Equality comparison short-circuits when the byte lengths differ, making it very cheap for strings of different sizes. When the lengths match, the bytes are compared in order and the comparison stops at the first difference, so the worst case grows linearly with string length.

# Performance characteristics demonstration
require 'benchmark'

short_string = "hello"
long_string = "hello" + ("x" * 10_000)
different_length = "hi"
similar_long = "hello" + ("x" * 9_999) + "y"

long_copy = long_string.dup  # allocate outside the timed blocks

n = 100_000                  # same iteration count so the rows are comparable
Benchmark.bm(20) do |x|
  x.report("equal short:") { n.times { short_string == "hello" } }
  x.report("different length:") { n.times { long_string == different_length } }
  x.report("equal long:") { n.times { long_string == long_copy } }
  x.report("similar long:") { n.times { long_string == similar_long } }
end

# Expected pattern (exact timings vary by machine and Ruby version):
# - Different-length comparisons return immediately
# - Equal short strings are very fast
# - Same-length long strings cost more the later the first differing byte occurs

Case-insensitive comparison methods (casecmp, casecmp?) do extra work to fold character case, so they run slower than case-sensitive operations. The cost grows with string length, and casecmp? does more work on non-ASCII text because it applies Unicode case folding, whereas casecmp folds only ASCII letters.

# Case-insensitive performance analysis
mixed_case = "ThIs Is A tEsT StRiNg WiTh MiXeD cAsE" * 100
lower_case = mixed_case.downcase
upper_case = mixed_case.upcase
lower_copy = lower_case.dup  # allocate outside the timed block

Benchmark.bm(25) do |x|
  x.report("case sensitive equal:") do
    10_000.times { lower_case == lower_copy }
  end
  
  x.report("case insensitive equal:") do
    10_000.times { mixed_case.casecmp?(upper_case) }
  end
  
  x.report("manual downcase compare:") do
    10_000.times { mixed_case.downcase == upper_case.downcase }
  end
end

# casecmp? often outperforms manual case conversion because it may avoid
# creating intermediate strings, but results depend on the string content
# and Ruby version, so measure with representative data

Regular expression matching performance depends heavily on pattern complexity and string content. Simple, anchored patterns perform best, while patterns that backtrack heavily can take exponential time. Compiling a pattern also has a cost, so building a Regexp inside a loop is slower than reusing a single compiled object.

# Regex performance optimization strategies
email_pattern = /\A[\w+\-.]+@[a-z\d\-]+\.[a-z]+\z/i
compiled_pattern = Regexp.new('\A[\w+\-.]+@[a-z\d\-]+\.[a-z]+\z', Regexp::IGNORECASE)

test_emails = ["user@example.com", "invalid.email", "another@test.org"] * 1000

Benchmark.bm(25) do |x|
  x.report("literal regex:") do
    test_emails.each { |email| email.match?(email_pattern) }
  end
  
  x.report("compiled regex:") do
    test_emails.each { |email| email.match?(compiled_pattern) }
  end
  
  x.report("string pattern:") do
    pattern_string = '\A[\w+\-.]+@[a-z\d\-]+\.[a-z]+\z'
    test_emails.each { |email| email.match?(Regexp.new(pattern_string, Regexp::IGNORECASE)) }
  end
end

# A regex literal and a Regexp.new object stored in a variable are both
# compiled once and perform the same; calling Regexp.new inside the loop
# recompiles the pattern on every iteration and is the slow case

Common Pitfalls

Unicode normalization differences create subtle comparison failures when visually identical strings contain different byte sequences. Characters can be represented using composed or decomposed forms, causing equality comparisons to fail despite identical appearance.

# Unicode normalization issues
composed = "\u00E9"              # Single code point U+00E9 (é)
decomposed = "e\u0301"           # 'e' + combining acute accent

composed == decomposed           # => false (different byte sequences)
composed.bytes                   # => [195, 169]
decomposed.bytes                 # => [101, 204, 129]

# Solution: normalize before comparison
# (String#unicode_normalize is built in since Ruby 2.2; no require is needed)

composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)      # => true
composed.unicode_normalize(:nfd) == decomposed.unicode_normalize(:nfd)      # => true

# Defensive comparison function
def unicode_safe_compare(str1, str2)
  str1.unicode_normalize(:nfc) == str2.unicode_normalize(:nfc)
end

unicode_safe_compare(composed, decomposed)  # => true
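
Normalization and case folding are independent concerns, and user-facing comparisons often need both. The following is a minimal sketch, assuming both inputs are UTF-8 (unicode_normalize raises for non-Unicode encodings). The helper name text_equal? is hypothetical.

```ruby
# Compare two UTF-8 strings ignoring both normalization form and case.
def text_equal?(str1, str2)
  # casecmp? can return nil, so coerce the result to a strict boolean
  str1.unicode_normalize(:nfc).casecmp?(str2.unicode_normalize(:nfc)) == true
end

upper_composed   = "CAF\u00C9"      # "CAFÉ" with a precomposed É
lower_decomposed = "cafe\u0301"     # "café" with a combining accent

folded_equal  = text_equal?(upper_composed, lower_decomposed)  # => true
accent_differs = text_equal?("caf\u00E9", "cafe")              # => false
```

Normalizing first matters here: case folding alone would still leave the composed and decomposed forms with different code point sequences.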

Encoding assumptions cause failures when processing text from external sources. Strings may arrive with unexpected encodings, leading to comparison errors or incorrect results. Web forms, file uploads, and database content commonly exhibit encoding inconsistencies.

# Encoding assumption problems
# Simulating data from different sources
web_input = "naïve".encode('ISO-8859-1')      # From web form
db_value = "naïve".encode('UTF-8')            # From database
file_content = "naïve".encode('Windows-1252') # From uploaded file

# Direct comparison fails
web_input == db_value                         # => false

# Defensive encoding handling
def normalize_encoding(string, target_encoding = 'UTF-8')
  return string if string.encoding.name == target_encoding
  
  string.encode(target_encoding, 
                invalid: :replace, 
                undef: :replace, 
                replace: '?')
rescue EncodingError
  # Fallback (for example, no converter exists): treat the bytes as binary.
  # String#b returns a copy, so the caller's string is not mutated.
  string.b.encode(target_encoding,
                  invalid: :replace,
                  undef: :replace)
end

# Safe comparison with encoding normalization
def encoding_safe_compare(str1, str2)
  normalized1 = normalize_encoding(str1)
  normalized2 = normalize_encoding(str2)
  normalized1 == normalized2
end

encoding_safe_compare(web_input, db_value)   # => true

Case-sensitive comparison failures occur when developers assume case-insensitive behavior in contexts where Ruby performs exact matching. This particularly affects user input validation, where users may enter data in unexpected cases.

# Case sensitivity pitfalls
valid_commands = ["START", "STOP", "PAUSE", "RESUME"]
user_input = "start"

# Common mistake: case-sensitive comparison
if valid_commands.include?(user_input)
  puts "Valid command"
else
  puts "Invalid command"  # This executes unexpectedly
end

# Correct approach: normalize case for comparison
normalized_commands = valid_commands.map(&:downcase)
if normalized_commands.include?(user_input.downcase)
  puts "Valid command"    # This executes as expected
end

# Alternative: case-insensitive matching
def valid_command?(input, valid_list)
  valid_list.any? { |cmd| cmd.casecmp?(input) }
end

valid_command?(user_input, valid_commands)   # => true

# More defensive validation for untrusted input
class CommandValidator
  def initialize(valid_commands)
    # Copy before freezing so the caller's strings are left untouched
    @valid_commands = valid_commands.map { |cmd| cmd.dup.freeze }.freeze
  end

  def valid?(input)
    return false unless input.is_a?(String)
    return false unless input.valid_encoding?

    candidate = input.strip
    @valid_commands.any? { |cmd| cmd.casecmp?(candidate) }
  end
end

validator = CommandValidator.new(valid_commands)
validator.valid?("  Start  ")  # => true (handles whitespace)
validator.valid?(nil)          # => false (handles nil input)
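
Scanning the list with any? is linear in the number of commands. For larger lists, one option is to normalize the valid values once and look them up in a Set. This sketch assumes ASCII command names, where downcase and casecmp? agree. The constant and method names are illustrative.

```ruby
require 'set'

# Normalize once, up front, so each lookup is a single hash probe
VALID_COMMANDS = Set.new(%w[START STOP PAUSE RESUME].map(&:downcase)).freeze

def known_command?(input)
  return false unless input.is_a?(String) && input.valid_encoding?

  VALID_COMMANDS.include?(input.strip.downcase)
end

padded  = known_command?("  Start  ")   # => true
unknown = known_command?("restart")     # => false
missing = known_command?(nil)           # => false
```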

Reference

Comparison Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #==(other) | other (Object) | Boolean | Case-sensitive equality comparison |
| #!=(other) | other (Object) | Boolean | Case-sensitive inequality comparison (negates ==) |
| #eql?(other) | other (Object) | Boolean | Case-sensitive equality; false unless other is a String |
| #<=>(other) | other (String) | Integer or nil | Lexicographical ordering comparison (-1, 0, 1); nil if other is not comparable |
| #casecmp(other) | other (String) | Integer or nil | Case-insensitive ordering comparison (ASCII case folding only) |
| #casecmp?(other) | other (String) | Boolean or nil | Case-insensitive equality comparison (Unicode case folding) |

Pattern Matching Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #match?(pattern, pos=0) | pattern (Regexp/String), pos (Integer) | Boolean | Tests if pattern matches string |
| #match(pattern, pos=0) | pattern (Regexp/String), pos (Integer) | MatchData or nil | Returns match data object |
| #=~(pattern) | pattern (Regexp) | Integer or nil | Returns index of first match |
| #scan(pattern) | pattern (Regexp/String) | Array | Returns all matches as array |
| #include?(substring) | substring (String) | Boolean | Tests if string contains substring |

Comparison Return Values

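| Operation | Equal Strings | Left < Right | Left > Right | Non-String Argument |
| --- | --- | --- | --- | --- |
| ==, eql? | true | false | false | false |
| != | false | true | true | true |
| <=> | 0 | -1 | 1 | nil |
| casecmp | 0 | -1 | 1 | nil |
| casecmp? | true | false | false | nil |

casecmp and casecmp? also return nil when the two strings have incompatible encodings. In that case == returns false, and <=> still returns an integer based on byte order.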

Encoding Compatibility

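| String 1 Encoding | String 2 Encoding | Comparison Result |
| --- | --- | --- |
| UTF-8 | UTF-8 | Direct comparison |
| ASCII-8BIT | ASCII-8BIT | Byte-level comparison |
| UTF-8 (ASCII only) | ASCII-8BIT | Compatible; compared byte by byte |
| UTF-8 (non-ASCII) | ASCII-8BIT (non-ASCII) | Incompatible: == returns false, casecmp? returns nil |
| ISO-8859-1 (non-ASCII) | Windows-1252 (non-ASCII) | Incompatible: == returns false, casecmp? returns nil |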

Regular Expression Flags

| Flag | Symbol | Description | Example |
| --- | --- | --- | --- |
| Case-insensitive | i or Regexp::IGNORECASE | Ignores case differences | /pattern/i |
| Multiline | m or Regexp::MULTILINE | . matches newlines | /pattern/m |
| Extended | x or Regexp::EXTENDED | Ignores whitespace and comments | /pattern/x |
| Unicode | u | Fixes the pattern's encoding to UTF-8 (sets Regexp::FIXEDENCODING) | /pattern/u |

Common Error Types

| Error | Cause | Prevention |
| --- | --- | --- |
| Encoding::CompatibilityError | Concatenating or regex-matching strings with incompatible encodings | Normalize encodings before combining or matching |
| RegexpError | Invalid regex pattern | Rescue RegexpError around Regexp.new |
| Timeout::Error | Catastrophic backtracking | Use timeout wrappers for complex patterns |
| NoMethodError | Calling string methods on nil | Check for nil or use safe navigation (&.) |

Performance Characteristics

| Operation | Time Complexity | Notes |
| --- | --- | --- |
| == (different lengths) | O(1) | Short-circuits on length check |
| == (same length) | O(n) | Character-by-character comparison |
| casecmp? | O(n) | Case normalization overhead |
| <=> | O(n) | Full lexicographical comparison |
| match? (simple pattern) | O(n) | Linear scan for pattern |
| match? (complex pattern) | O(2^n) worst case | Potential exponential backtracking |