CrackedRuby

String Searching and Matching

Overview

Ruby provides multiple approaches for searching and matching patterns within strings. The core functionality combines string methods with regular expression support through the Regexp class. String methods like #include?, #match, #scan, and #gsub handle common searching tasks, while regular expressions enable complex pattern matching operations.

The String class includes methods that accept either string literals or regex patterns as arguments. When using string literals, Ruby performs exact character matching. When using regular expressions, Ruby applies pattern matching rules with support for anchors, quantifiers, character classes, and capture groups.

text = "Ruby programming language"

# String literal matching
text.include?("Ruby")
# => true

# Regular expression matching
text.match(/\b\w+ing\b/)
# => #<MatchData "programming">

Ruby's matching operations return different data types depending on the method used. Boolean methods like #include? return true or false. Methods like #match return MatchData objects containing match details. Extraction methods like #scan return arrays of matched strings.

The MatchData object provides access to captured groups, match positions, and the original string. This object supports indexed access for captured groups and named access when using named capture groups in regular expressions.

match = "Email: user@domain.com".match(/(\w+)@(\w+)\.(\w+)/)
match[1]  # => "user"
match[2]  # => "domain"
match[3]  # => "com"
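
Beyond indexed groups, a MatchData object also exposes the surrounding text and the match offsets. The following is a minimal sketch; the helper name describe_match is hypothetical and only gathers values that MatchData already provides.

```ruby
# Hypothetical helper: collect the main pieces of information a MatchData holds.
def describe_match(text, pattern)
  m = text.match(pattern)
  return nil unless m

  {
    matched: m[0],            # full matched text
    before:  m.pre_match,     # text before the match
    after:   m.post_match,    # text after the match
    span:    m.offset(0),     # [start, end] character offsets
    groups:  m.named_captures # group name => captured text
  }
end

info = describe_match("Email: user@domain.com", /(?<user>\w+)@(?<host>\w+\.\w+)/)
p info
```

Returning a plain hash keeps the example easy to inspect; real code would usually work with the MatchData object directly.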

Basic Usage

The #include? method performs substring searches within strings. This method returns a boolean indicating whether the target substring exists anywhere in the string. The comparison is always case-sensitive.

message = "Welcome to Ruby programming"
message.include?("Ruby")     # => true
message.include?("Python")   # => false
message.include?("ruby")     # => false (case sensitive)

The #match method applies regular expressions to strings and returns a MatchData object for successful matches or nil for failed matches. This method stops at the first match found.

email = "Contact us at support@example.com"
match = email.match(/\w+@\w+\.\w+/)
# => #<MatchData "support@example.com">

match.begin(0)  # => 14 (starting position)
match.end(0)    # => 33 (ending position)

The #scan method finds all matches of a pattern within a string and returns them as an array. When the pattern contains capture groups, #scan returns an array of arrays containing the captured groups.

text = "Phone numbers: 555-1234 and 555-5678"
numbers = text.scan(/\d{3}-\d{4}/)
# => ["555-1234", "555-5678"]

# With capture groups
formatted = text.scan(/(\d{3})-(\d{4})/)
# => [["555", "1234"], ["555", "5678"]]
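
Because #scan with capture groups returns one inner array per match, the groups can be destructured directly in a block. A small sketch, assuming a hypothetical phone_parts helper and the same phone pattern as above:

```ruby
# Hypothetical helper: turn each [prefix, line] pair from #scan into a hash.
def phone_parts(text)
  text.scan(/(\d{3})-(\d{4})/).map do |prefix, line|
    { prefix: prefix, line: line }
  end
end

p phone_parts("Phone numbers: 555-1234 and 555-5678")
```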

The #gsub method replaces matches with specified replacement text. The replacement can be a string literal or a block that processes each match. When using a string replacement, captured groups are accessible using backslash notation.

phone = "Call 555-1234 or 555-5678"
formatted = phone.gsub(/(\d{3})-(\d{4})/, '(\1) \2')
# => "Call (555) 1234 or (555) 5678"

# Using a block
redacted = phone.gsub(/\d{3}-\d{4}/) { |match| "XXX-XXXX" }
# => "Call XXX-XXXX or XXX-XXXX"
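
Two further replacement forms may be useful: #gsub also accepts a hash that maps matched text to replacements, and replacement strings can refer to named groups with \k<name>. The helper names and sample data below are illustrative assumptions.

```ruby
# Hash replacement: each matched string is looked up in the hash.
def expand_abbreviations(text)
  expansions = { "RoR" => "Ruby on Rails", "regex" => "regular expression" }
  text.gsub(/RoR|regex/, expansions)
end

# Named backreferences: \k<name> inserts the text of a named group.
def format_phone(text)
  text.gsub(/(?<prefix>\d{3})-(?<line>\d{4})/, '(\k<prefix>) \k<line>')
end

puts expand_abbreviations("RoR uses regex")
puts format_phone("Call 555-1234")
```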

Position-based searching uses #index and #rindex methods to find character positions. The #index method returns the position of the first match, while #rindex returns the position of the last match.

document = "Ruby on Rails uses Ruby syntax"
first_ruby = document.index("Ruby")   # => 0
last_ruby = document.rindex("Ruby")   # => 19
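
The optional offset argument of #index makes it possible to walk through every occurrence. This is a sketch with a hypothetical all_positions helper; it advances one character past each hit, so overlapping occurrences are also reported.

```ruby
# Hypothetical helper: collect the start position of every occurrence.
def all_positions(haystack, needle)
  positions = []
  offset = 0
  while (found = haystack.index(needle, offset))
    positions << found
    offset = found + 1 # continue searching just after this hit
  end
  positions
end

p all_positions("Ruby on Rails uses Ruby syntax", "Ruby")
```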

Advanced Usage

Named capture groups provide semantic access to matched portions by assigning names to capturing parentheses. The resulting MatchData object supports bracket notation with symbol keys to access named groups.

log_entry = "2024-08-29 14:30:25 [ERROR] Database connection failed"
pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.+)/

match = log_entry.match(pattern)
match[:date]     # => "2024-08-29"
match[:level]    # => "ERROR"
match[:message]  # => "Database connection failed"
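
When every named group is needed, MatchData#named_captures returns them as a hash with string keys. The sketch below wraps the log pattern in a hypothetical parse_log_line helper and converts the keys to symbols.

```ruby
# Hypothetical helper: parse one log line into a hash, or nil if it does not match.
def parse_log_line(line)
  pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.+)/
  match = line.match(pattern)
  return nil unless match

  match.named_captures.transform_keys(&:to_sym)
end

p parse_log_line("2024-08-29 14:30:25 [ERROR] Database connection failed")
```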

Lookahead and lookbehind assertions match patterns based on context without including the context in the match result. Positive lookahead ((?=...)) requires the pattern to be followed by specific text, and negative lookahead ((?!...)) requires that it not be. Positive lookbehind ((?<=...)) and negative lookbehind ((?<!...)) apply the same conditions to the text immediately before the match.

password = "SecurePass123!"

# Positive lookahead - find a run of letters followed by a digit
password.scan(/[A-Za-z]+(?=\d)/)
# => ["SecurePass"]

# Negative lookbehind - find digits not preceded by letters
text = "Room 101, Suite 202, Building A5"
text.scan(/(?<![A-Za-z])\d+/)
# => ["101", "202"]
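
Two more lookaround uses, sketched with hypothetical helpers: a positive lookbehind that extracts numbers preceded by a dollar sign, and stacked lookaheads that check several conditions from the same starting position. The password rules here are assumptions chosen for illustration only.

```ruby
# Positive lookbehind: digits that directly follow a literal "$".
def dollar_amounts(text)
  text.scan(/(?<=\$)\d+/)
end

# Stacked lookaheads: at least one digit, one uppercase letter, 8+ characters.
def strong_password?(candidate)
  candidate.match?(/\A(?=.*\d)(?=.*[A-Z]).{8,}\z/)
end

p dollar_amounts('Total: $45, fee: $5, count: 3')
p strong_password?("SecurePass123!")
p strong_password?("securepass")
```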

The String#match? method provides boolean testing without creating MatchData objects, improving performance when only existence checking is needed. This method returns true or false without capturing groups or position information.

def valid_email?(email)
  email.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i)
end

valid_email?("user@example.com")  # => true
valid_email?("invalid.email")     # => false
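
One observable consequence is that #match? leaves the last-match variable $~ untouched, while #match sets it. The two method names below are hypothetical and exist only to run each call in a fresh method scope.

```ruby
# #match records the match in $~ for the current method.
def last_match_after_match
  "Ruby 3.2".match(/\d/)
  $~
end

# #match? answers true or false without recording anything in $~.
def last_match_after_match_p
  "Ruby 3.2".match?(/\d/)
  $~
end

p last_match_after_match
p last_match_after_match_p
```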

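Special global variables capture match information after regex operations. The $~ variable contains the last MatchData object, while $1, $2, etc., contain captured groups from the last match. Despite the $ prefix, these variables are local to the current method and thread, and they are reset to nil when a match fails.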

"Ruby version 3.2.1".match(/(\d+)\.(\d+)\.(\d+)/)
$1  # => "3"
$2  # => "2"
$3  # => "1"
$~.captures  # => ["3", "2", "1"]
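
Regexp.last_match gives the same information through a method call, which some style guides prefer over the punctuation-style variables. A brief sketch with a hypothetical version_parts helper:

```ruby
# Hypothetical helper: extract a three-part version number as integers.
def version_parts(text)
  return nil unless text =~ /(\d+)\.(\d+)\.(\d+)/

  # Regexp.last_match(n) reads the same data as $n.
  [Regexp.last_match(1), Regexp.last_match(2), Regexp.last_match(3)].map(&:to_i)
end

p version_parts("Ruby version 3.2.1")
```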

Conditional replacements in #gsub use blocks to implement complex replacement logic. The block receives each match as an argument and returns the replacement string.

text = "Process items: item1, item2, item3, item4"
result = text.gsub(/item(\d+)/) do |match|
  number = $1.to_i
  number.even? ? "[#{match}]" : match.upcase
end
# => "Process items: ITEM1, [item2], ITEM3, [item4]"
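
The same block form suits simple template substitution. This is a sketch only: the {{name}} placeholder syntax and the render helper are assumptions, and unknown placeholders are left unchanged.

```ruby
# Hypothetical helper: replace {{key}} placeholders with values from a hash.
def render(template, values)
  template.gsub(/\{\{(\w+)\}\}/) do |placeholder|
    key = Regexp.last_match(1).to_sym
    values.fetch(key, placeholder).to_s # keep the placeholder if the key is missing
  end
end

puts render("Hello {{name}}, you have {{count}} items in {{place}}", { name: "Ada", count: 3 })
```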

Performance & Memory

String searching performance varies significantly between different approaches. Literal string matching with #include? outperforms regular expression matching for simple substring detection. Regular expressions provide more functionality but consume additional CPU cycles for pattern compilation and matching.

require 'benchmark'

text = "Large document with repeated content" * 1000
target = "repeated"
pattern = /repeated/

Benchmark.bm(15) do |x|
  x.report("String#include?") { 10000.times { text.include?(target) } }
  x.report("String#match?")   { 10000.times { text.match?(pattern) } }
  x.report("String#index")    { 10000.times { text.index(target) } }
end

# Sample output (timings vary by Ruby version and hardware):
#                      user     system      total        real
# String#include?  0.055000   0.000000   0.055000 (  0.054892)
# String#match?    0.892000   0.000000   0.892000 (  0.891234)
# String#index     0.067000   0.000000   0.067000 (  0.066543)

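Ruby compiles a regex literal without interpolation once, when the code is parsed, so reusing such a literal inside a loop does not recompile it. Patterns built with interpolation or Regexp.new are rebuilt each time the expression is evaluated. Storing those patterns in variables or constants removes this overhead when the same pattern is used repeatedly.

# Inefficient - interpolated regex is rebuilt on each iteration
prefix = "text"
1000.times do |i|
  "text#{i}".match(/#{prefix}\d+/)
end

# Efficient - build once, reuse
PATTERN = /#{prefix}\d+/
1000.times do |i|
  "text#{i}".match(PATTERN)
end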
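
A quick way to see when Ruby builds a new Regexp object is to compare object identity. In this sketch, which uses two hypothetical methods, a plain literal returns the same object on every call, while an interpolated literal returns a fresh object each time.

```ruby
# A literal without interpolation is created once and reused.
def literal_pattern
  /text\d+/
end

# An interpolated literal is rebuilt whenever the expression runs.
def interpolated_pattern(prefix)
  /#{prefix}\d+/
end

p literal_pattern.equal?(literal_pattern)
p interpolated_pattern("text").equal?(interpolated_pattern("text"))
```

Object identity is only a proxy for compilation cost, but it shows which form repeats work on each evaluation.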

The #scan method processes entire strings and builds result arrays, which can consume significant memory with large inputs or many matches. For large inputs, pass a block to #scan: it yields each match as it is found instead of collecting the matches into an array. Avoid #gsub for this purpose, because it builds a modified copy of the whole string.

# Memory-intensive approach
large_text = File.read("large_log_file.txt")  # 100MB file
all_errors = large_text.scan(/ERROR: .+/)    # Stores all matches

# Memory-efficient approach
error_count = 0
large_text.scan(/ERROR: .+/) do |match|
  error_count += 1
  # Process match immediately without storing
  process_error(match)  # process_error stands in for application logic
end
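
When the pattern never spans lines, reading the file line by line avoids loading it into memory at all. A sketch under that assumption, with a hypothetical count_error_lines helper and a temporary file standing in for a real log:

```ruby
require 'tempfile'

# Hypothetical helper: count matching lines without reading the whole file.
def count_error_lines(path)
  count = 0
  File.foreach(path) do |line|
    count += 1 if line.match?(/ERROR: .+/)
  end
  count
end

# Demonstration with a small temporary file.
Tempfile.create("sample_log") do |file|
  file.write("INFO: started\nERROR: disk full\nERROR: timeout\n")
  file.flush
  puts count_error_lines(file.path)
end
```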

Character classes and anchors affect regex performance. Replacing broad constructs like .+ with specific character classes reduces the number of ways the engine can attempt a match, which limits backtracking. Anchoring a pattern with \A and \z lets a non-matching input fail quickly instead of being retried from every position in the string.

# Slower - allows backtracking through entire string
/email: .+@.+\..+/

# Faster - anchored pattern with specific character classes
/\Aemail: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\z/

Common Pitfalls

Case sensitivity in string matching catches many developers off guard. String methods like #include? and #index perform exact character matching, and regular expressions are also case-sensitive unless the i flag is set.

text = "Ruby Programming Language"

# Case-sensitive matching fails
text.include?("ruby")          # => false
text.match(/ruby/)             # => nil

# Solutions
text.downcase.include?("ruby") # => true
text.match(/ruby/i)            # => #<MatchData "Ruby">
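
Another option avoids creating a downcased copy of the text: build a case-insensitive pattern from the search term. The helper below is a hypothetical sketch; Regexp.escape keeps characters such as + from being read as metacharacters.

```ruby
# Hypothetical helper: case-insensitive substring test for arbitrary terms.
def include_ignoring_case?(text, term)
  text.match?(Regexp.new(Regexp.escape(term), Regexp::IGNORECASE))
end

p include_ignoring_case?("Ruby Programming Language", "ruby")
p include_ignoring_case?("C++ and Ruby", "c++")
p include_ignoring_case?("Ruby", "python")
```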

Regular expression escaping becomes problematic when dynamically constructing patterns from user input. Special regex characters like ., *, +, ?, [, ], {, }, (, ), ^, $, |, \ require escaping to match literally.

user_input = "What costs $5.99?"
text = "Q: What costs $5.99? A: Lunch."

# Dangerous - treats $, . and ? as regex metacharacters
text.match(/#{user_input}/)
# => nil ($ acts as an end-of-line anchor)

# Safe - escapes special characters
escaped = Regexp.escape(user_input)
text.match(/#{escaped}/)
# => #<MatchData "What costs $5.99?">
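
For several literal terms at once, Regexp.union escapes each string and joins them with alternation. A short sketch with a hypothetical mentions_any? helper:

```ruby
# Hypothetical helper: true if any of the literal terms appears in the text.
def mentions_any?(text, terms)
  text.match?(Regexp.union(terms)) # strings are escaped automatically
end

p mentions_any?('Lunch was $5.99 today', ['$5.99', 'a+b'])
p mentions_any?('version 1x0', ['1.0']) # "." is literal here, so no match
```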

Match-variable clobbering occurs when a later regex operation overwrites match information that has not been read yet. Despite the $ prefix, $~, $1, $2, etc. are local to the current method frame and thread, so a match inside a called method does not change the caller's values and threads do not race on them. The risk lies within a single method: every later match replaces the previous values, and a failed match resets them to nil.

def parse_release(text)
  text.match(/v(\d+)\.(\d+)/)
  text.match(/build-(\d+)/)  # Replaces $~ from the version match
  major = $1                 # Build number (or nil), not the major version
  minor = $2                 # nil - the build pattern has only one group
  [major, minor]
end

# Better approach - use MatchData directly
def parse_version(text)
  match = text.match(/v(\d+)\.(\d+)/)
  return nil unless match
  [match[1], match[2]]
end
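
The scoping of these variables can be checked directly. In this sketch, using two hypothetical methods, the inner method performs its own match, yet the outer method still sees its own captures afterwards because $~ and $1 are local to each method frame.

```ruby
# Performs its own match; its $1 belongs to this method only.
def inner_match
  "build-42" =~ /build-(\d+)/
  $1
end

# The call to inner_match does not disturb this method's $1 and $2.
def outer_match
  "v3.2" =~ /v(\d+)\.(\d+)/
  inner_match
  [$1, $2]
end

p inner_match
p outer_match
```

Using MatchData objects is still the clearer style, since it keeps each result in an explicit local variable.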

Encoding mismatches cause unexpected behavior when searching strings with different character encodings. Ruby strings carry an encoding, and a search raises an error when the string and the search term both contain non-ASCII characters in incompatible encodings. ASCII-only content is compatible across most encodings, so the problem often appears only once accented or non-Latin characters show up in real data.

utf8_string = "Café"                        # UTF-8 (default source encoding)
latin1_term = "Café".encode('ISO-8859-1')

# Encoding mismatch causes exception
utf8_string.include?(latin1_term)
# => raises Encoding::CompatibilityError

# Solution - ensure compatible encodings
utf8_term = latin1_term.encode('UTF-8')
utf8_string.include?(utf8_term)
# => true
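
One defensive approach is to transcode both sides to UTF-8 before searching. The utf8_include? helper below is a hypothetical sketch; it assumes each input is correctly labeled with its real encoding, and it would raise for binary strings containing bytes that cannot be converted.

```ruby
# Hypothetical helper: compare in UTF-8 regardless of the inputs' encodings.
def utf8_include?(haystack, needle)
  haystack.encode(Encoding::UTF_8).include?(needle.encode(Encoding::UTF_8))
end

latin1_text = "Caf\u00E9".encode(Encoding::ISO_8859_1) # "Café" in ISO-8859-1
p utf8_include?(latin1_text, "\u00E9")                 # search term "é" in UTF-8
```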

Greedy quantifiers in regular expressions match more text than expected, particularly with patterns containing .+ or .*. These quantifiers first consume as many characters as they can, then backtrack only as far as needed for the rest of the pattern to match, which produces the longest possible match rather than the shortest.

html = '<div class="main">Content here</div><div class="footer">More content</div>'

# Greedy - matches entire string between first < and last >
html.match(/<.+>/)
# => #<MatchData "<div class=\"main\">Content here</div><div class=\"footer\">More content</div>">

# Non-greedy - matches first complete tag
html.match(/<.+?>/)
# => #<MatchData "<div class=\"main\">">
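
A negated character class is a common alternative to a non-greedy quantifier: [^>]+ cannot run past the closing bracket, so no backtracking is needed to stop there. A sketch with a hypothetical tags helper (suitable for simple cases only, not for parsing arbitrary HTML):

```ruby
# Hypothetical helper: list every tag-like token in a string.
def tags(html)
  html.scan(/<[^>]+>/)
end

html = '<div class="main">Content here</div><div class="footer">More content</div>'
p tags(html)
```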

Reference

Core String Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #include?(str) | str (String) | Boolean | Tests substring presence |
| #index(pattern, offset=0) | pattern (String/Regexp), offset (Integer) | Integer or nil | First match position |
| #rindex(pattern, offset=-1) | pattern (String/Regexp), offset (Integer) | Integer or nil | Last match position |
| #match(pattern, pos=0) | pattern (Regexp/String), pos (Integer) | MatchData or nil | Pattern matching with details |
| #match?(pattern, pos=0) | pattern (Regexp/String), pos (Integer) | Boolean | Pattern matching test only |
| #scan(pattern) | pattern (String/Regexp) | Array | All matches as array |
| #gsub(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String | Replace all matches |
| #gsub!(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String or nil | Replace all matches in-place |
| #sub(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String | Replace first match |
| #sub!(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String or nil | Replace first match in-place |

MatchData Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #[] | index (Integer/Symbol/String) | String or nil | Access captured group |
| #begin | index (Integer) | Integer | Match start position |
| #end | index (Integer) | Integer | Match end position |
| #captures | None | Array | All captured groups |
| #named_captures | None | Hash | Named captured groups |
| #names | None | Array | Names of capture groups |
| #offset | index (Integer) | Array | Start and end positions |
| #pre_match | None | String | Text before match |
| #post_match | None | String | Text after match |
| #string | None | String | Original string that was searched |
| #values_at | *indexes (Integer/Range) | Array | Multiple captured groups |

Regular Expression Flags

| Flag | Symbol | Description | Example |
| --- | --- | --- | --- |
| Case-insensitive | i | Ignores character case | /ruby/i |
| Multiline | m | Dot matches newlines | /start.+end/m |
| Extended | x | Ignores whitespace, allows comments | /\d{3} \s \d{4}/x |
| Unicode | u | Treats the pattern as UTF-8 encoded | /\p{L}+/u |
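
The x and m flags are easiest to understand in use. In this sketch, with hypothetical helper names, the extended pattern is spread over several lines with comments, and the multiline flag lets the dot cross a line break.

```ruby
# Extended mode: whitespace and comments inside the pattern are ignored.
def split_phone(text)
  pattern = /
    (\d{3})   # three-digit prefix
    -         # literal separator
    (\d{4})   # four-digit line number
  /x
  match = text.match(pattern)
  match && match.captures
end

# Multiline mode: "." also matches newline characters.
def spans_lines?(text)
  text.match?(/start.+end/m)
end

p split_phone("555-1234")
p spans_lines?("start\nmiddle\nend")
p "start\nmiddle\nend".match?(/start.+end/) # without m, the dot stops at "\n"
```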

Character Classes

| Pattern | Description | Equivalent |
| --- | --- | --- |
| \d | Digit character | [0-9] |
| \D | Non-digit character | [^0-9] |
| \w | Word character | [A-Za-z0-9_] |
| \W | Non-word character | [^A-Za-z0-9_] |
| \s | Whitespace character | [ \t\r\n\f\v] |
| \S | Non-whitespace character | [^ \t\r\n\f\v] |
| . | Any character except newline | [^\n] |

Anchors and Boundaries

| Pattern | Description | Usage |
| --- | --- | --- |
| ^ | Start of line | /^Ruby/ |
| $ | End of line | /Ruby$/ |
| \A | Start of string | /\ARuby/ |
| \z | End of string | /Ruby\z/ |
| \Z | End of string or before final newline | /Ruby\Z/ |
| \b | Word boundary | /\bRuby\b/ |
| \B | Non-word boundary | /\BRuby\B/ |
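
The difference between line anchors and string anchors shows up as soon as the input contains a newline. The anchor_report helper below is a hypothetical sketch that applies each anchor from the table to the same text.

```ruby
# Hypothetical helper: test the word "Ruby" against each anchor type.
def anchor_report(text)
  {
    line_start:           text.match?(/^Ruby/),   # start of any line
    string_start:         text.match?(/\ARuby/),  # start of the whole string
    line_end:             text.match?(/Ruby$/),   # end of any line
    string_end:           text.match?(/Ruby\z/),  # very end of the string
    before_final_newline: text.match?(/Ruby\Z/)   # end, ignoring one trailing newline
  }
end

p anchor_report("intro\nRuby\n")
```

For validating whole inputs, \A and \z are usually the intended anchors, since ^ and $ accept a match on any single line.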

Global Variables

| Variable | Description | Content |
| --- | --- | --- |
| $~ | Last MatchData object | MatchData instance |
| $& | Entire matched string | String |
| $` | Text before match | String |
| $' | Text after match | String |
| $1, $2, ... | Captured groups | String or nil |
| $+ | Last captured group | String or nil |