CrackedRuby - String Substitution (sub, gsub)

Overview

Ruby provides two primary methods for string substitution: String#sub and String#gsub. The sub method replaces the first occurrence of a pattern within a string, while gsub replaces all occurrences. Both methods accept regular expressions or string literals as search patterns and support various replacement mechanisms including literal strings, blocks, and special replacement sequences.

These methods form the core of Ruby's text processing capabilities. String#sub and String#gsub create new string objects, leaving the original unchanged, while their destructive counterparts sub! and gsub! modify the string in place. Ruby implements pattern matching using the Onigmo regular expression engine, which supports Unicode and advanced regex features.

The substitution methods integrate closely with Ruby's global variables for match data access. After calling sub or gsub, Ruby populates $~ with the MatchData object containing capture groups, and sets numbered variables like $1, $2 for individual groups.

text = "Hello world, hello universe"
result = text.gsub(/hello/i, "hi")
# => "hi world, hi universe"

# Accessing match data after substitution
# Single quotes matter: in double quotes, "\1" is the octal escape for U+0001
"abc123def".sub(/(\d+)/, '[\1]')
# => "abc[123]def"
puts $1  # => "123"

Ruby's substitution methods preserve the original string's encoding and process multibyte characters correctly. The non-destructive methods work on frozen strings because they return new instances, which keeps shared frozen strings safe to read concurrently; the destructive variants raise FrozenError on a frozen receiver.
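
The frozen-string behavior can be sketched as follows. The constant and method names are illustrative only and are not part of the original article.

```ruby
# Illustrative names; not part of any library.
FROZEN_GREETING = "hello world".freeze

# Non-destructive gsub returns a new string and leaves the receiver alone.
def shout(text)
  text.gsub("hello", "HELLO")
end

# Destructive gsub! cannot modify a frozen receiver.
def shout!(text)
  text.gsub!("hello", "HELLO")
rescue FrozenError => e
  "failed: #{e.class}"
end

shout(FROZEN_GREETING)   # => "HELLO world"
shout!(FROZEN_GREETING)  # => "failed: FrozenError"
```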

Basic Usage

The fundamental syntax for String#sub and String#gsub accepts a pattern as the first argument and a replacement as the second. The pattern can be a regular expression, a string, or any object that converts implicitly to a string via to_str. Ruby matches string patterns literally, while regular expressions enable pattern-based matching.

# Basic string replacement
text = "Ruby programming language"
text.sub("Ruby", "Python")           # => "Python programming language"
text.gsub("programming", "scripting") # => "Ruby scripting language"

# Regular expression patterns
email = "user@example.com"
email.sub(/@.*/, "@newdomain.org")   # => "user@newdomain.org"

# Case-insensitive matching
text.gsub(/ruby/i, "RUBY")          # => "RUBY programming language"
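
The difference between literal string patterns and regular expressions shows up as soon as the search text contains a metacharacter. The sketch below uses a made-up version string; `Regexp.escape` is the standard way to turn arbitrary text into a literal pattern.

```ruby
VERSION_TEXT = "v1.2.3 and v1x2"

# A String pattern matches literally, so "." means a dot.
VERSION_TEXT.gsub("1.2", "N")     # => "vN.3 and v1x2"

# In a Regexp, an unescaped "." matches any character.
VERSION_TEXT.gsub(/1.2/, "N")     # => "vN.3 and vN"

# Regexp.escape builds a literal pattern from arbitrary text.
LITERAL_PATTERN = Regexp.new(Regexp.escape("1.2"))
VERSION_TEXT.gsub(LITERAL_PATTERN, "N")   # => "vN.3 and v1x2"
```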

The replacement argument supports several formats. String literals provide direct substitution, while special escape sequences like \1, \2 reference capture groups from the pattern match. Ruby also supports named capture group references using \k<name> syntax.

# Capture group references
phone = "123-456-7890"
phone.gsub(/(\d{3})-(\d{3})-(\d{4})/, '(\1) \2-\3')
# => "(123) 456-7890"

# Named capture groups
date = "2024-03-15"
date.sub(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/, '\k<month>/\k<day>/\k<year>')
# => "03/15/2024"

Block-based replacements provide dynamic substitution logic. Ruby passes the matched string to the block, allowing complex transformation during replacement. The block's return value becomes the replacement text.

# Block-based replacement
text = "The price is $19.99 and $29.99"
text.gsub(/\$\d+\.\d+/) do |match|
  price = match[1..-1].to_f
  "$#{(price * 1.08).round(2)}"
end
# => "The price is $21.59 and $32.39"

# Transforming each match in a block
"hello world".gsub(/(\w+)/) { |word| word.capitalize }
# => "Hello World"
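
The block only receives the matched text, so capture groups have to be read from the current MatchData. A minimal sketch, with hypothetical method names, using `Regexp.last_match` (the same object as `$~`):

```ruby
# Swap "key=value" pairs by reading the numbered groups of the current match.
def swap_pairs(text)
  text.gsub(/(\w+)=(\w+)/) do
    m = Regexp.last_match
    "#{m[2]}=#{m[1]}"
  end
end

# Named groups are available by name on the same MatchData object.
def us_date(text)
  text.gsub(/(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/) do
    md = Regexp.last_match
    "#{md[:m]}/#{md[:d]}/#{md[:y]}"
  end
end

swap_pairs("a=1 b=2")       # => "1=a 2=b"
us_date("due 2024-03-15")   # => "due 03/15/2024"
```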

The destructive variants sub! and gsub! modify the original string object. These methods return the string itself if replacements occur, or nil if no matches exist. This behavior enables conditional logic based on whether substitution happened.

text = "original text"
result = text.gsub!("original", "modified")
puts text    # => "modified text"
puts result  # => "modified text"

# Returns nil when no matches
text = "unchanged"
result = text.sub!("missing", "replacement")
puts text    # => "unchanged"
puts result  # => nil
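
Because the bang methods return nil when nothing matched, chaining them (for example `text.gsub!(...).strip`) can fail with NoMethodError. One way to use the return value for conditional logic is sketched below; `normalize!` is a hypothetical helper.

```ruby
# Returns true if either substitution changed the string, false otherwise.
def normalize!(text)
  changed = false
  changed = true if text.gsub!(/\s{2,}/, " ")  # collapse runs of whitespace
  changed = true if text.sub!(/\A /, "")       # drop one leading space
  changed
end

line = +"  two   spaces"   # unary plus gives an unfrozen string
normalize!(line)           # => true
line                       # => "two spaces"
normalize!(line)           # => false (nothing left to change)
```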

Advanced Usage

Advanced substitution patterns leverage Ruby's full regular expression capabilities, including lookaheads, lookbehinds, and complex grouping structures. These patterns enable sophisticated text transformations that handle edge cases and maintain formatting requirements.

# Lookahead assertions for conditional replacement
html = "<div>Content</div><span>More content</span>"
# Replace opening tags only if followed by specific content
html.gsub(/<(\w+)(?=>[^<]*Content)/, '<\1 class="highlighted"')
# => "<div class=\"highlighted\">Content</div><span>More content</span>"

# Lookbehind for context-aware replacement
text = "pre-process and post-process but not reprocess"
text.gsub(/(?<!re)process/, "handle")
# => "pre-handle and post-handle but not reprocess"

Hash-based replacements map matched text to replacement text. For each match, Ruby looks up the matched text as a key in the hash and substitutes the corresponding value; a match with no corresponding key is replaced with an empty string unless the hash supplies a default. This approach works efficiently for multiple simultaneous replacements.

# Hash replacement mapping
text = 'Use <b> & "quotes"'
entities = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;'
}
text.gsub(/[&<>"]/, entities)
# => "Use &lt;b&gt; &amp; &quot;quotes&quot;"

# Pattern-based hash keys
abbreviations = {
  /\bDr\./ => "Doctor",
  /\bProf\./ => "Professor",
  /\bMr\./ => "Mister"
}
"Dr. Smith and Prof. Jones".gsub(Regexp.union(abbreviations.keys)) do |match|
  abbreviations.find { |pattern, replacement| pattern =~ match }[1]
end
# => "Doctor Smith and Professor Jones"
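
The lookup semantics of a hash replacement are easy to get wrong, so here is a small sketch. The constant names are illustrative; the behavior shown is that a match is used as a hash key and a missing key falls back to the hash's default.

```ruby
ENTITY_MAP = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Each match is looked up as a key; the value becomes the replacement.
"a < b & c".gsub(/[&<>]/, ENTITY_MAP)      # => "a &lt; b &amp; c"

# A match without a key ("a" here) is replaced by an empty string...
"a < b & c".gsub(/[&<>a]/, ENTITY_MAP)     # => " &lt; b &amp; c"

# ...unless the hash has a default, here a block that returns the key itself.
KEEP_UNKNOWN = Hash.new { |_hash, key| key }.update(ENTITY_MAP)
"a < b & c".gsub(/[&<>a]/, KEEP_UNKNOWN)   # => "a &lt; b &amp; c"
```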

Unicode handling requires careful attention to character boundaries and normalization. Ruby's substitution methods work correctly with multibyte characters, but complex Unicode scenarios may need additional processing.

# Unicode character handling
text = "café naïve résumé"
# Replace accented characters with base characters
text.gsub(/[áàâäã]/, 'a').gsub(/[éèêë]/, 'e').gsub(/[íìîï]/, 'i')
# => "cafe naive resume"

# Complex Unicode patterns
emoji_text = "Hello 👋 world 🌍 from Ruby 💎"
emoji_text.gsub(/[\u{1F000}-\u{1FFFF}]/, '[emoji]')
# => "Hello [emoji] world [emoji] from Ruby [emoji]"
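
Listing accented characters by hand does not scale. As an alternative sketch (not the approach used above), the string can be decomposed with `String#unicode_normalize` and the combining marks removed with the `\p{Mn}` property; `strip_accents` is a hypothetical helper name.

```ruby
# Decompose accented characters (NFD), then drop the nonspacing marks.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

strip_accents("café naïve résumé")   # => "cafe naive resume"
```

This only handles characters that decompose into a base letter plus marks; letters such as "ø" or "ß" are left unchanged.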

Metaprogramming integration allows dynamic pattern generation and replacement logic. This approach proves valuable when building domain-specific text processors or template systems.

class TextProcessor
  def initialize(rules)
    @rules = rules
  end

  def process(text)
    @rules.each do |pattern, handler|
      if handler.is_a?(Proc)
        text = text.gsub(pattern, &handler)
      else
        text = text.gsub(pattern, handler)
      end
    end
    text
  end
end

processor = TextProcessor.new({
  /\b(\w+)@(\w+\.\w+)\b/ => proc { |match|
    # $1 and $2 are not visible here: match variables are local to the frame that calls gsub
    "[EMAIL:#{match}]"
  },
  /\b\d{3}-\d{3}-\d{4}\b/ => "[PHONE]"
})

text = "Contact john@example.com or call 555-123-4567"
processor.process(text)
# => "Contact [EMAIL:john@example.com] or call [PHONE]"
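
One caveat with rule tables like this: `$~`, `$1` and friends are local to the method frame that calls gsub, so a proc created elsewhere and passed with `&` does not see that call's captures. A sketch of one workaround, assuming hypothetical names `apply_rule`, `EMAIL_RULE` and `EMAIL_HANDLER`, is to read the MatchData inside a literal block and hand it to the handler explicitly:

```ruby
EMAIL_RULE = /\b(\w+)@(\w+\.\w+)\b/

# The handler receives a MatchData, so it never depends on $1 or $2.
EMAIL_HANDLER = ->(m) { "[EMAIL:#{m[1]} at #{m[2]}]" }

def apply_rule(text, pattern, handler)
  # The literal block shares this method's frame, so Regexp.last_match is set.
  text.gsub(pattern) { handler.call(Regexp.last_match) }
end

apply_rule("Contact john@example.com", EMAIL_RULE, EMAIL_HANDLER)
# => "Contact [EMAIL:john at example.com]"
```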

Common Pitfalls

Regular expression escaping represents the most frequent source of substitution errors. Special regex characters require escaping when intended as literals, and replacement strings need careful handling of backslashes to avoid unintended capture group references. Dollar signs, by contrast, have no special meaning in a Ruby replacement string.

# Problematic unescaped patterns
text = "Calculate $10.50 + $5.25"
# A String pattern is matched literally, so no escaping is needed
text.gsub("$10.50", "$15.75")  # => "Calculate $15.75 + $5.25"
# Wrong - the unescaped . matches any character, so this would also match "$10x50"
text.gsub(/\$10.50/, "$15.75")

# Correct escaping
text.gsub(/\$10\.50/, "$15.75")
# => "Calculate $15.75 + $5.25"

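# Replacement string escaping issues
code = "var x = 10;"
# Wrong - in double quotes, "\1" is the octal escape for the character U+0001
code.gsub("var", "let \1")       # => "let \u0001 x = 10;"
# Also wrong - "\\1" is a reference to capture group 1, which is empty here
code.gsub("var", "let \\1")      # => "let  x = 10;"
# Correct - double the backslash again to insert a literal \1
code.gsub("var", "let \\\\1")    # => "let \\1 x = 10;"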
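
When the replacement text itself contains backslashes, the block form sidesteps the escaping rules because a block's return value is inserted verbatim. A minimal sketch with hypothetical helper names:

```ruby
BACKSLASH = "\\"   # a single backslash character

# String replacement: the backslash must be doubled once more.
def to_windows_escaped(path)
  path.gsub("/", "\\\\")
end

# Block replacement: no replacement-sequence processing is applied.
def to_windows_block(path)
  path.gsub("/") { BACKSLASH }
end

to_windows_escaped("C:/dir/file")  # => "C:\\dir\\file"
to_windows_block("C:/dir/file")    # => "C:\\dir\\file"
```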

Global replacement behavior in gsub can produce unexpected results when matches could overlap. gsub scans the string once from left to right, resumes after the end of each match, and never rescans inserted replacement text, so a replacement that contains the search pattern cannot cause an infinite loop.

# Overlapping pattern problems
text = "aaaaa"
text.gsub(/aa/, "bb")          # => "bbbba" - not "bbbbb"

# Replacement containing search pattern
text = "cat"
text.gsub("cat", "caterpillar") # => "caterpillar" (correct)
text = "cat cat"
text.gsub("cat", "cat dog")     # => "cat dog cat dog" (correct)

# Replacement text is never rescanned
text = "A B C"
text.gsub(" ", ", ")           # => "A, B, C" (the inserted spaces are not matched again)
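
The single left-to-right pass means one gsub call may leave text that still matches the pattern. The sketch below, using hypothetical helper names, contrasts a single pass with repeating `gsub!` until it returns nil.

```ruby
# One pass: each "//" pair is replaced, but the results are not rescanned.
def collapse_once(path)
  path.gsub("//", "/")
end

# Repeat until gsub! reports that nothing matched.
def collapse_fully(path)
  result = path.dup
  nil while result.gsub!("//", "/")
  result
end

collapse_once("a////b")       # => "a//b"
collapse_fully("a////b")      # => "a/b"
"a////b".gsub(%r{/+}, "/")    # => "a/b" (a quantifier does it in one pass)
```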

Encoding mismatches between patterns and target strings cause subtle failures. Ruby does not transcode operands automatically; it only checks that their encodings are compatible, so mixed-encoding scenarios require explicit handling to avoid Encoding::CompatibilityError.

# Encoding compatibility issues
utf8_string = "café"                 # UTF-8 in a UTF-8 source file
ascii_pattern = Regexp.new("caf".encode("US-ASCII"))

# An ASCII-only pattern is compatible with any ASCII-compatible string
utf8_string.sub(ascii_pattern, "restaurant")
# => "restauranté"

# Problems arise with incompatible encodings
binary_string = "café".b             # same bytes, tagged as BINARY (ASCII-8BIT)
# A UTF-8 regexp against a non-ASCII binary string raises Encoding::CompatibilityError
begin
  binary_string.sub(/café/u, "coffee")
rescue Encoding::CompatibilityError
  # Retag the bytes as UTF-8 before matching
  binary_string.force_encoding('UTF-8').sub(/café/, "coffee")
end
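
A related failure mode is a string that is tagged UTF-8 but contains invalid bytes; matching a Regexp against it typically raises ArgumentError. A hedged sketch of a guard using `String#valid_encoding?` and `String#scrub` follows; `safe_replace` is a hypothetical helper.

```ruby
# Replace invalid byte sequences with "?" before substituting.
def safe_replace(text)
  clean = text.valid_encoding? ? text : text.scrub("?")
  clean.gsub(/bar/, "baz")
end

broken = "caf\xE9 bar"     # \xE9 on its own is not valid UTF-8
broken.valid_encoding?     # => false
safe_replace(broken)       # => "caf? baz"
```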

Performance degradation occurs with poorly constructed regular expressions, especially those with excessive backtracking. Catastrophic backtracking can cause substitution operations to take exponential time with certain input patterns.

# Dangerous pattern with potential backtracking
# The blow-up happens when the match fails, so this input contains no "@"
email = "a" * 25 + "!example.com"
# Nested quantifiers can backtrack exponentially on such input
# (Ruby 3.2 and later mitigate many of these cases with match memoization)
slow_pattern = /^(a+)+@/

# Better approach: remove the nested quantifier
fast_pattern = /^a+@/

require 'benchmark'
Benchmark.bm do |x|
  x.report("slow") { email.sub(slow_pattern, "user@") }
  x.report("fast") { email.sub(fast_pattern, "user@") }
end
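
When the nested structure cannot simply be removed, an atomic group is another option: once `(?>...)` has matched, the engine does not backtrack into it. The sketch below uses hypothetical names and only runs the safe patterns.

```ruby
ATOMIC_PATTERN = /\A(?>a+)@/   # atomic group: never gives characters back
LINEAR_PATTERN = /\Aa+@/       # plain equivalent without nesting

def rewrite_local_part(address, pattern)
  address.sub(pattern, "user@")
end

rewrite_local_part("aaa@example.com", ATOMIC_PATTERN)  # => "user@example.com"
rewrite_local_part("aaa@example.com", LINEAR_PATTERN)  # => "user@example.com"

# A failing match returns the input unchanged, without heavy backtracking.
no_at_sign = "a" * 5000 + "!"
rewrite_local_part(no_at_sign, ATOMIC_PATTERN) == no_at_sign  # => true
```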

Performance & Memory

String substitution performance varies significantly based on pattern complexity, string length, and replacement strategy. Simple string literal patterns perform faster than regular expressions, while complex regex patterns with multiple alternations or nested groups can degrade performance substantially.

require 'benchmark'

text = "The quick brown fox jumps over the lazy dog" * 1000

Benchmark.bm(15) do |x|
  # String literal replacement (fastest)
  x.report("string literal") { text.gsub("fox", "cat") }

  # Simple regex (moderate speed)
  x.report("simple regex") { text.gsub(/fox/, "cat") }

  # Complex regex (slower)
  x.report("complex regex") { text.gsub(/f[o]x|c[a]t/, "animal") }

  # Block replacement (slowest due to proc calls)
  x.report("block replace") { text.gsub(/\w+/) { |word| word.upcase } }
end

Memory allocation patterns differ between sub, gsub, and their destructive variants. Non-destructive methods always allocate new string objects, while destructive methods may reallocate the internal buffer when the replacement text changes the string's length significantly.

require 'objspace'

original = "small text" * 1000
puts "Original size: #{ObjectSpace.memsize_of(original)} bytes"

# Non-destructive creates new string
result1 = original.gsub("small", "very large replacement")
puts "New string size: #{ObjectSpace.memsize_of(result1)} bytes"
puts "Original unchanged: #{ObjectSpace.memsize_of(original)} bytes"

# Destructive may reallocate internal buffer
text_copy = original.dup
text_copy.gsub!("small", "big")
puts "Modified in place: #{ObjectSpace.memsize_of(text_copy)} bytes"

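Large-scale text processing benefits from several optimization strategies. Ruby compiles a regexp literal without interpolation only once, so storing patterns in instance variables or constants mainly pays off when they are built dynamically (with Regexp.new or interpolation); it also keeps them named and reusable. Processing texts in batches keeps that setup cost out of the per-text loop.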

class TextProcessor
  def initialize
    # Precompile frequently used patterns
    @email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
    @phone_pattern = /\b\d{3}-\d{3}-\d{4}\b/
    @url_pattern = /https?:\/\/[^\s]+/
  end

  def sanitize_bulk(texts)
    # Process multiple texts efficiently
    texts.map! do |text|
      # Each gsub call allocates a new string; only the patterns are reused
      text.gsub(@email_pattern, '[EMAIL]')
          .gsub(@phone_pattern, '[PHONE]')
          .gsub(@url_pattern, '[URL]')
    end
  end
end

# Benchmark stored patterns vs inline literals doing the same work
require 'benchmark'

processor = TextProcessor.new
texts = Array.new(1000) { "Contact john@test.com or 555-123-4567" }

Benchmark.bm(12) do |x|
  x.report("bulk process") { processor.sanitize_bulk(texts.dup) }
  x.report("individual") {
    texts.map { |text|
      text.gsub(/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/, '[EMAIL]')
          .gsub(/\b\d{3}-\d{3}-\d{4}\b/, '[PHONE]')
          .gsub(/https?:\/\/[^\s]+/, '[URL]')
    }
  }
end
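
Chained gsub calls allocate one intermediate string per call and can re-replace text produced by an earlier step. A single pass with `Regexp.union` and a hash avoids both; the sketch below uses made-up constants to show the difference.

```ruby
SWAPS = { "cat" => "dog", "dog" => "cat" }.freeze
SWAP_PATTERN = Regexp.union(SWAPS.keys)   # /cat|dog/, built once and reused

# One pass: each match is looked up in the hash exactly once.
def swap_single_pass(text)
  text.gsub(SWAP_PATTERN, SWAPS)
end

# Two passes: the second call rewrites the output of the first.
def swap_chained(text)
  text.gsub("cat", "dog").gsub("dog", "cat")
end

swap_single_pass("cat chases dog")  # => "dog chases cat"
swap_chained("cat chases dog")      # => "cat chases cat"
```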

Memory-conscious applications should prefer destructive operations when the original string is no longer needed. This approach avoids creating intermediate string objects, reducing garbage collection pressure.

def process_file_lines(filename)
  File.foreach(filename) do |line|  # stream lines instead of loading the whole file
    # Use destructive operations to avoid extra allocations
    line.chomp!
    line.gsub!(/\s+/, ' ')        # Normalize whitespace
    line.gsub!(/[^\w\s]/, '')     # Remove punctuation
    line.strip!

    yield line if block_given?
  end
end

# Usage: each yielded line has already been cleaned in place
process_file_lines('large_file.txt') do |cleaned_line|
  # Process cleaned line
end

Reference

Core Substitution Methods

Method	Parameters	Returns	Description
`#sub(pattern, replacement)`	pattern (Regexp/String), replacement (String/Hash)	String	Replaces first occurrence of pattern
`#sub(pattern) { block }`	pattern (Regexp/String), block	String	Replaces first match with block result
`#sub!(pattern, replacement)`	pattern (Regexp/String), replacement (String/Hash)	String or nil	Destructively replaces first occurrence
`#sub!(pattern) { block }`	pattern (Regexp/String), block	String or nil	Destructively replaces with block result
`#gsub(pattern, replacement)`	pattern (Regexp/String), replacement (String/Hash)	String	Replaces all occurrences of pattern
`#gsub(pattern) { block }`	pattern (Regexp/String), block	String	Replaces all matches with block results
`#gsub!(pattern, replacement)`	pattern (Regexp/String), replacement (String/Hash)	String or nil	Destructively replaces all occurrences
`#gsub!(pattern) { block }`	pattern (Regexp/String), block	String or nil	Destructively replaces with block results

Replacement Sequences

Sequence	Description	Example
`\1`, `\2`, ...	Capture group references	`"abc".sub(/(.)/, '\1\1')` → `"aabc"`
`\k<name>`	Named capture group reference	`"abc".sub(/(?<char>.)/, '\k<char>\k<char>')` → `"aabc"`
`\&`	Entire match	`"hello".gsub(/l+/, '[\&]')` → `"he[ll]o"`
`` \` ``	String before match	``"hello".sub(/l+/, '[\`]')`` → `"he[he]o"`
`\'`	String after match	`"hello".sub(/l+/, "[\\']")` → `"he[o]o"`
`\+`	Last capture group	`"abc".sub(/(.)(.)/, '\+')` → `"bc"`
`\\`	Literal backslash	`"test".sub(/t/, '\\\\')` → `"\\est"`
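
The sequences above can be checked in one runnable snippet. The hash name is illustrative. Note that `\'` is written in a double-quoted string here, because inside single quotes `\'` is just an escaped quote.

```ruby
# Each entry exercises one replacement sequence.
SEQUENCE_DEMO = {
  numbered:    "abc".sub(/(.)/, '\1\1'),                     # "aabc"
  named:       "abc".sub(/(?<char>.)/, '\k<char>\k<char>'),  # "aabc"
  whole_match: "hello".gsub(/l+/, '[\&]'),                   # "he[ll]o"
  pre_match:   "hello".sub(/l+/, '[\`]'),                    # "he[he]o"
  post_match:  "hello".sub(/l+/, "[\\']"),                   # "he[o]o"
  backslash:   "test".sub(/t/, '\\\\')                       # "\\est"
}.freeze
```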

Global Variables Set by Substitution

Variable	Type	Description
`$~`	MatchData or nil	Complete match information object
`$&`	String or nil	The matched string
`$1`, `$2`, ...	String or nil	Capture group contents
`$+`	String or nil	Last successful capture group
`` $` ``	String or nil	String preceding the match
`$'`	String or nil	String following the match
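
A short sketch of reading these variables right after a substitution. `match_globals` is a hypothetical helper; the variables are read in the same method that calls sub, because they are local to that frame.

```ruby
# Run a substitution, then snapshot the match-related globals.
def match_globals(text, pattern)
  text.sub(pattern, "-")
  { match: $&, pre: $`, post: $', first: $1, last_group: $+, data: $~ }
end

info = match_globals("hello world", /(o) (w)/)
info[:match]       # => "o w"
info[:pre]         # => "hell"
info[:post]        # => "orld"
info[:first]       # => "o"
info[:last_group]  # => "w"

# When the Regexp does not match, the variables are nil.
match_globals("abc", /x/)[:data]  # => nil
```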

Regular Expression Options

Option	Symbol	Description	Usage
Case insensitive	i	Ignore case in matching	`/pattern/i`
Multiline	m	`.` matches newlines	`/pattern/m`
Extended	x	Ignore whitespace and comments	`/pat tern/x`
UTF-8 encoding	u	Fixes the pattern's encoding to UTF-8	`/\p{Letter}/u`
ASCII	(?a)	Inline option restricting character classes to ASCII; Ruby has no trailing `a` flag	`/(?a)[[:alpha:]]/`
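
The three trailing flags used most often with sub and gsub can be seen side by side in this sketch. The sample strings and the `PHONE_PATTERN` constant are made up for illustration.

```ruby
# i: ignore case
"Ruby ruby RUBY".gsub(/ruby/i, "rb")   # => "rb rb rb"

# m: let "." match a newline
"a\nb".sub(/a.b/, "X")                 # => "a\nb" (no match without m)
"a\nb".sub(/a.b/m, "X")                # => "X"

# x: whitespace and comments inside the pattern are ignored
PHONE_PATTERN = /
  (\d{3})   # prefix
  -
  (\d{4})   # line number
/x
"call 555-1234".sub(PHONE_PATTERN, '\2-\1')   # => "call 1234-555"
```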

Performance Characteristics

Operation	Time Complexity	Memory Impact	Best Use Case
String literal search	O(n)	Minimal	Fixed text replacement
Simple regex	O(n)	Low	Pattern-based matching
Complex regex	O(n) to exponential, depending on backtracking	Moderate	Advanced text processing
Block replacement	O(n × block time)	Variable	Dynamic replacements
Hash replacement	O(n) plus one hash lookup per match	Low	Multiple simultaneous replacements

Common Pattern Examples

# Email validation and replacement
email_pattern = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/
text.gsub(email_pattern, '[EMAIL]')

# Phone number formatting
phone_pattern = /(\d{3})(\d{3})(\d{4})/
phone.gsub(phone_pattern, '(\1) \2-\3')

# URL replacement
url_pattern = /https?:\/\/[^\s]+/
content.gsub(url_pattern, '<a href="\&">\&</a>')

# HTML tag removal
tag_pattern = /<[^>]*>/
html.gsub(tag_pattern, '')

# Whitespace normalization
whitespace_pattern = /\s+/
text.gsub(whitespace_pattern, ' ')
