Overview
Ruby provides multiple approaches for searching and matching patterns within strings. The core functionality combines string methods with regular expression support through the Regexp class. String methods like #include?, #match, #scan, and #gsub handle common searching tasks, while regular expressions enable complex pattern matching operations.
The String class includes methods that accept either string literals or regex patterns as arguments. When using string literals, Ruby performs exact character matching. When using regular expressions, Ruby applies pattern matching rules with support for anchors, quantifiers, character classes, and capture groups.
text = "Ruby programming language"
# String literal matching
text.include?("Ruby")
# => true
# Regular expression matching
text.match(/\b\w+ing\b/)
# => #<MatchData "programming">
Ruby's matching operations return different data types depending on the method used. Boolean methods like #include? return true or false. Methods like #match return MatchData objects containing match details. Extraction methods like #scan return arrays of matched strings.
The MatchData object provides access to captured groups, match positions, and the original string. This object supports indexed access for captured groups and named access when using named capture groups in regular expressions.
match = "Email: user@domain.com".match(/(\w+)@(\w+)\.(\w+)/)
match[1] # => "user"
match[2] # => "domain"
match[3] # => "com"
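The accessors mentioned above can be seen together in a minimal sketch. The helper name describe_match and the sample text are hypothetical and chosen only for illustration; the MatchData methods themselves are standard.

```ruby
# Hypothetical helper: collect the main pieces of information a MatchData exposes.
def describe_match(text, pattern)
  match = text.match(pattern)
  return nil unless match

  {
    matched: match[0],          # the full matched text
    captures: match.captures,   # array of capture groups
    before: match.pre_match,    # text before the match
    after: match.post_match,    # text after the match
    span: match.offset(0)       # [start, end] character positions of the match
  }
end

info = describe_match("Email: user@domain.com today", /(\w+)@(\w+)\.(\w+)/)
info[:matched]  # => "user@domain.com"
info[:captures] # => ["user", "domain", "com"]
info[:before]   # => "Email: "
info[:after]    # => " today"
info[:span]     # => [7, 22]
```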
Basic Usage
The #include? method performs substring searches within strings. This method returns a boolean indicating whether the target substring exists anywhere in the string. The comparison is always case-sensitive.
message = "Welcome to Ruby programming"
message.include?("Ruby") # => true
message.include?("Python") # => false
message.include?("ruby") # => false (case sensitive)
The #match method applies regular expressions to strings and returns a MatchData object for successful matches or nil for failed matches. This method stops at the first match found.
email = "Contact us at support@example.com"
match = email.match(/\w+@\w+\.\w+/)
# => #<MatchData "support@example.com">
match.begin(0) # => 14 (starting position)
match.end(0) # => 33 (ending position)
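Because #match returns nil on failure, calling MatchData methods on the result without a check raises NoMethodError. A minimal sketch of guarding against that follows; the helper name first_email is hypothetical.

```ruby
# Hypothetical helper: return the first email-like token, or nil when none exists.
def first_email(text)
  match = text.match(/\w+@\w+\.\w+/)
  match && match[0] # short-circuits to nil when there was no match
end

first_email("Contact us at support@example.com") # => "support@example.com"
first_email("No address here")                   # => nil

# String#[] with a Regexp is a shortcut that returns the matched text or nil.
"Contact us at support@example.com"[/\w+@\w+\.\w+/] # => "support@example.com"
```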
The #scan method finds all matches of a pattern within a string and returns them as an array. When the pattern contains capture groups, #scan returns an array of arrays containing the captured groups.
text = "Phone numbers: 555-1234 and 555-5678"
numbers = text.scan(/\d{3}-\d{4}/)
# => ["555-1234", "555-5678"]
# With capture groups
formatted = text.scan(/(\d{3})-(\d{4})/)
# => [["555", "1234"], ["555", "5678"]]
The #gsub method replaces matches with specified replacement text. The replacement can be a string literal or a block that processes each match. When using a string replacement, captured groups are accessible using backslash notation.
phone = "Call 555-1234 or 555-5678"
formatted = phone.gsub(/(\d{3})-(\d{4})/, '(\1) \2')
# => "Call (555) 1234 or (555) 5678"
# Using a block
redacted = phone.gsub(/\d{3}-\d{4}/) { |match| "XXX-XXXX" }
# => "Call XXX-XXXX or XXX-XXXX"
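Two further replacement forms may be useful here: #gsub also accepts a Hash as the replacement, and a replacement string can refer to named groups with \k<name>. The sketch below assumes hypothetical helper names and sample data.

```ruby
# Hash replacement: each matched text is looked up as a key in the hash.
def expand_abbreviations(text)
  table = { "mon" => "Monday", "tue" => "Tuesday" } # illustrative data
  text.gsub(/\b(?:mon|tue)\b/, table)
end

expand_abbreviations("open mon and tue") # => "open Monday and Tuesday"

# Named backreferences: \k<name> inserts the text of a named group.
def swap_phone_parts(text)
  text.gsub(/(?<prefix>\d{3})-(?<line>\d{4})/, '\k<line>-\k<prefix>')
end

swap_phone_parts("Call 555-1234") # => "Call 1234-555"
```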
Position-based searching uses the #index and #rindex methods to find character positions. The #index method returns the position of the first match, while #rindex returns the position of the last match.
document = "Ruby on Rails uses Ruby syntax"
first_ruby = document.index("Ruby") # => 0
last_ruby = document.rindex("Ruby") # => 19
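The optional offset argument of #index makes it possible to walk through every occurrence. The following is a minimal sketch; the helper name all_positions is hypothetical.

```ruby
# Hypothetical helper: collect the start position of every occurrence of needle.
def all_positions(text, needle)
  positions = []
  start = 0
  while (found = text.index(needle, start))
    positions << found
    start = found + 1 # advance by one so overlapping occurrences are found too
  end
  positions
end

all_positions("one two one", "one") # => [0, 8]
all_positions("aaa", "aa")          # => [0, 1]
all_positions("abc", "z")           # => []
```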
Advanced Usage
Named capture groups provide semantic access to matched portions by assigning names to capturing parentheses. The resulting MatchData object supports bracket notation with symbol keys to access named groups.
log_entry = "2024-08-29 14:30:25 [ERROR] Database connection failed"
pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.+)/
match = log_entry.match(pattern)
match[:date] # => "2024-08-29"
match[:level] # => "ERROR"
match[:message] # => "Database connection failed"
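When all named groups are needed at once, MatchData#named_captures returns them as a Hash with string keys. A minimal sketch, assuming a hypothetical helper named parse_log_line and the same log format as above:

```ruby
# Hypothetical helper: turn one log line into a hash of its named groups.
def parse_log_line(line)
  pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<message>.+)/
  match = line.match(pattern)
  match ? match.named_captures : nil
end

entry = parse_log_line("2024-08-29 14:30:25 [ERROR] Database connection failed")
entry["level"]   # => "ERROR"
entry["time"]    # => "14:30:25"
entry.keys       # => ["date", "time", "level", "message"]

parse_log_line("not a log line") # => nil
```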
Lookahead and lookbehind assertions match patterns based on context without including the context in the match result. Positive lookahead ((?=...)) requires the pattern to be followed by specific text. Negative lookahead ((?!...)) requires that the pattern not be followed by specific text. Lookbehind assertions ((?<=...) and (?<!...)) apply the same tests to the text immediately before the match.
password = "SecurePass123!"
# Positive lookahead - find the letters that are followed by a digit
password.scan(/[A-Za-z]+(?=\d)/)
# => ["SecurePass"]
# Negative lookbehind - find digits not preceded by letters
text = "Room 101, Suite 202, Building A5"
text.scan(/(?<![A-Za-z])\d+/)
# => ["101", "202"]
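The two remaining assertion types, positive lookbehind and negative lookahead, can be sketched the same way. The helper names and sample strings below are hypothetical.

```ruby
# Positive lookbehind: amounts that are immediately preceded by a dollar sign.
def dollar_amounts(text)
  text.scan(/(?<=\$)\d+(?:\.\d{2})?/)
end

dollar_amounts("Lunch $12.50, taxi 30, tip $3") # => ["12.50", "3"]

# Negative lookahead: count "foo" only where it is not followed by "bar".
def count_foo_without_bar(text)
  text.scan(/foo(?!bar)/).size
end

count_foo_without_bar("foo foobar foo") # => 2
```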
The String#match? method provides boolean testing without creating MatchData objects, improving performance when only existence checking is needed. This method returns true or false without capturing groups or position information.
def valid_email?(email)
  email.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i)
end
valid_email?("user@example.com") # => true
valid_email?("invalid.email") # => false
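One practical consequence is that #match? leaves $~ unset, while #match populates it. The sketch below illustrates the difference with two hypothetical helpers, each of which starts with an empty $~ because it runs in its own method.

```ruby
# Hypothetical helper: inspect $~ after a boolean-only test.
def last_match_after_predicate
  "user@example.com".match?(/@/)
  $~ # match? does not create MatchData, so this is still nil
end

# Hypothetical helper: inspect $~ after a regular match.
def last_match_after_match
  "user@example.com".match(/@/)
  $~ # match stores its MatchData here
end

last_match_after_predicate      # => nil
last_match_after_match[0]       # => "@"
```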
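Special global variables capture match information after regex operations. The $~ variable contains the last MatchData object, while $1, $2, etc., contain captured groups from the last match. Despite the $ prefix, these variables are local to the current method and thread, and a failed match resets them to nil.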
"Ruby version 3.2.1".match(/(\d+)\.(\d+)\.(\d+)/)
$1 # => "3"
$2 # => "2"
$3 # => "1"
$~.captures # => ["3", "2", "1"]
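Regexp.last_match is a more readable way to reach the same information: called without arguments it returns $~, and with an index it returns the corresponding group. A minimal sketch, with a hypothetical helper name:

```ruby
# Hypothetical helper: extract a three-part version number as integers.
def version_parts(text)
  return nil unless text =~ /(\d+)\.(\d+)\.(\d+)/

  # Regexp.last_match(n) is equivalent to $n for the most recent match.
  [Regexp.last_match(1), Regexp.last_match(2), Regexp.last_match(3)].map(&:to_i)
end

version_parts("Ruby version 3.2.1") # => [3, 2, 1]
version_parts("no version here")    # => nil
```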
Conditional replacements in #gsub use blocks to implement complex replacement logic. The block receives each match as an argument and returns the replacement string.
text = "Process items: item1, item2, item3, item4"
result = text.gsub(/item(\d+)/) do |match|
  number = $1.to_i
  number.even? ? "[#{match}]" : match.upcase
end
# => "Process items: ITEM1, [item2], ITEM3, [item4]"
Performance & Memory
String searching performance varies between approaches. Literal string matching with #include? typically outperforms regular expression matching for simple substring detection. Regular expressions provide more functionality, but the regex engine adds overhead even when the pattern is a fixed string.
require 'benchmark'
text = "Large document with repeated content" * 1000
target = "repeated"
pattern = /repeated/
Benchmark.bm(15) do |x|
  x.report("String#include?") { 10000.times { text.include?(target) } }
  x.report("String#match?") { 10000.times { text.match?(pattern) } }
  x.report("String#index") { 10000.times { text.index(target) } }
end
# Sample output (timings vary by machine and Ruby version):
# user system total real
# String#include? 0.055000 0.000000 0.055000 ( 0.054892)
# String#match? 0.892000 0.000000 0.892000 ( 0.891234)
# String#index 0.067000 0.000000 0.067000 ( 0.066543)
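A regex literal without interpolation is compiled once, when the code is parsed, so repeating it inside a loop costs nothing extra. A literal that contains #{...} interpolation is rebuilt each time it is evaluated. Building such a pattern once and storing it in a variable or constant removes this overhead when the same pattern is used repeatedly.
prefix = "text"
# Inefficient - interpolation builds a new Regexp on every iteration
1000.times do |i|
  "text#{i}".match(/#{prefix}\d+/)
end
# Efficient - build once, reuse
pattern = /#{prefix}\d+/
1000.times do |i|
  "text#{i}".match(pattern)
end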
The #scan method processes entire strings and builds result arrays, which can consume significant memory with large inputs or many matches. For processing large amounts of data, consider passing a block to #scan: each match is yielded as it is found and no result array is built. Using #gsub with a block for this purpose is less suitable, because it assembles a full copy of the string.
# Memory-intensive approach
large_text = File.read("large_log_file.txt") # 100MB file
all_errors = large_text.scan(/ERROR: .+/) # Stores all matches
# Memory-efficient approach
error_count = 0
large_text.scan(/ERROR: .+/) do |match|
  error_count += 1
  # Process match immediately without storing
  process_error(match)
end
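Character classes and anchors affect regex performance. Narrow character classes like [0-9] or [^@] stop a quantifier at the first character that cannot belong to the match, which leaves less room for backtracking than a broad pattern like .+ does. Anchoring patterns to string boundaries with \A and \z lets the engine give up after one attempt instead of retrying at every position when no match exists.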
# Slower - allows backtracking through entire string
/email: .+@.+\..+/
# Faster - anchored pattern with specific character classes
/\Aemail: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\z/
Common Pitfalls
Case sensitivity in string matching catches many developers unexpectedly. String methods like #include? and #index perform exact character matching, while regular expressions require explicit case-insensitive flags.
text = "Ruby Programming Language"
# Case-sensitive matching fails
text.include?("ruby") # => false
text.match(/ruby/) # => nil
# Solutions
text.downcase.include?("ruby") # => true
text.match(/ruby/i) # => #<MatchData "Ruby">
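For whole-string comparison, String#casecmp? avoids allocating a downcased copy, and the i flag can be combined with Regexp.escape when the search term comes from a variable. The helper names in this sketch are hypothetical.

```ruby
# Hypothetical helper: case-insensitive equality of two whole strings.
def same_word?(a, b)
  a.casecmp?(b)
end

same_word?("Ruby", "ruby")  # => true
same_word?("Ruby", "Rails") # => false

# Hypothetical helper: case-insensitive substring test for an arbitrary term.
def mentions?(text, term)
  text.match?(/#{Regexp.escape(term)}/i)
end

mentions?("Ruby Programming Language", "ruby")   # => true
mentions?("Ruby Programming Language", "python") # => false
```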
Regular expression escaping becomes problematic when dynamically constructing patterns from user input. Special regex characters like ., *, +, ?, [, ], {, }, (, ), ^, $, |, and \ require escaping to match literally.
text = "Total: What costs $5.99?"
user_input = "What costs $5.99?"
# Dangerous - treats ., $, and ? as regex metacharacters
text.match(/#{user_input}/)
# => nil ($ is read as an end-of-line anchor, so nothing matches)
# Safe - escapes special characters
escaped = Regexp.escape(user_input)
text.match(/#{escaped}/)
# => #<MatchData "What costs $5.99?">
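When several literal terms need to be matched, Regexp.union builds an alternation and escapes string arguments automatically. A minimal sketch follows; the helper name any_term_pattern and the sample terms are hypothetical.

```ruby
# Hypothetical helper: build one pattern that matches any of the given literal terms.
def any_term_pattern(terms)
  Regexp.union(terms) # string elements are escaped before being joined with |
end

pattern = any_term_pattern(["$5.99", "C++", "a.b"])

"Price: $5.99".match?(pattern) # => true
"I write C++".match?(pattern)  # => true
"Price: 5x99".match?(pattern)  # => false
"axb".match?(pattern)          # => false, because the dot is matched literally
```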
Match variables are easy to overwrite by accident. The $1, $2, etc., variables are reset by every regex operation in the same method, including failed ones, so a second match placed between the first match and the read silently changes their values. They are local to the current method and thread, so calls into other methods and concurrent threads do not disturb them, but long parsing logic within one method becomes fragile.
def parse_release(text)
  text.match(/v(\d+)\.(\d+)/)
  text.match(/build-(\d+)/) # A later match in the same method resets $~
  major = $1 # Now the build number (or nil), not the major version
  minor = $2 # nil, because the second pattern has only one group
  [major, minor]
end
# Better approach - use MatchData directly
def parse_version(text)
  match = text.match(/v(\d+)\.(\d+)/)
  return nil unless match
  [match[1], match[2]]
end
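Another way to avoid the numbered variables is to put a regex literal with named groups on the left side of =~, which assigns each group to a local variable of the same name. This only works with a literal that contains no interpolation. The helper name version_numbers below is hypothetical.

```ruby
# Hypothetical helper: named groups become local variables (major, minor).
def version_numbers(text)
  if /v(?<major>\d+)\.(?<minor>\d+)/ =~ text
    [major.to_i, minor.to_i]
  end
end

version_numbers("release v3.2") # => [3, 2]
version_numbers("no version")   # => nil
```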
Encoding mismatches cause unexpected behavior when searching strings with different character encodings. Every Ruby string carries an encoding, and search methods require the receiver and the argument to have compatible encodings. When both contain non-ASCII characters in different encodings, Ruby raises Encoding::CompatibilityError instead of returning a result.
utf8_string = "Café" # UTF-8, the default source encoding
latin1_needle = "é".encode('ISO-8859-1')
# Encoding mismatch causes exception
utf8_string.include?(latin1_needle)
# => raises Encoding::CompatibilityError
# Solution - ensure compatible encodings
utf8_needle = latin1_needle.encode('UTF-8')
utf8_string.include?(utf8_needle)
# => true
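One way to avoid such mismatches is to transcode incoming data to UTF-8 once, at the point where it enters the program. The sketch below assumes the source encoding is known; the helper name to_utf8 and the sample bytes are hypothetical.

```ruby
# Hypothetical helper: label raw bytes with their real encoding, then transcode to UTF-8.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

latin1_bytes = "Caf\xE9".b # raw bytes, as they might arrive from a legacy file

converted = to_utf8(latin1_bytes, "ISO-8859-1")
converted                # => "Café"
converted.encoding       # => #<Encoding:UTF-8>
converted.include?("é")  # => true
```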
Greedy quantifiers in regular expressions match more text than expected, particularly with patterns containing .+ or .*. These quantifiers first consume as many characters as possible, then backtrack only as far as needed for the rest of the pattern to match, so the result is the longest possible match.
html = '<div class="main">Content here</div><div class="footer">More content</div>'
# Greedy - matches entire string between first < and last >
html.match(/<.+>/)
# => #<MatchData "<div class=\"main\">Content here</div><div class=\"footer\">More content</div>">
# Non-greedy - matches first complete tag
html.match(/<.+?>/)
# => #<MatchData "<div class=\"main\">">
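A negated character class is a common alternative to the non-greedy quantifier: [^>]+ cannot run past the closing bracket, so no backtracking is needed to find the end of the tag. The helper names below are hypothetical, and this is a sketch for simple markup rather than a general HTML parser.

```ruby
# Hypothetical helper: the first tag in the string, or nil.
def first_tag(html)
  html[/<[^>]+>/]
end

# Hypothetical helper: every tag in the string, in order.
def all_tags(html)
  html.scan(/<[^>]+>/)
end

html = '<div class="main">Content here</div><div class="footer">More content</div>'

first_tag(html) # => "<div class=\"main\">"
all_tags(html)  # => ["<div class=\"main\">", "</div>", "<div class=\"footer\">", "</div>"]
```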
Reference
Core String Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#include?(str) | str (String) | Boolean | Tests substring presence |
#index(pattern, offset=0) | pattern (String/Regexp), offset (Integer) | Integer or nil | First match position |
#rindex(pattern, offset=-1) | pattern (String/Regexp), offset (Integer) | Integer or nil | Last match position |
#match(pattern, pos=0) | pattern (Regexp/String), pos (Integer) | MatchData or nil | Pattern matching with details |
#match?(pattern, pos=0) | pattern (Regexp/String), pos (Integer) | Boolean | Pattern matching test only |
#scan(pattern) | pattern (String/Regexp) | Array | All matches as array |
#gsub(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String | Replace all matches |
#gsub!(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String or nil | Replace all matches in-place |
#sub(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String | Replace first match |
#sub!(pattern, replacement) | pattern (String/Regexp), replacement (String/Hash) or block | String or nil | Replace first match in-place |
MatchData Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#[] | index (Integer/Symbol) | String or nil | Access captured group |
#begin | index (Integer) | Integer | Match start position |
#end | index (Integer) | Integer | Match end position |
#captures | None | Array | All captured groups |
#named_captures | None | Hash | Named captured groups |
#names | None | Array | Names of capture groups |
#offset | index (Integer) | Array | Start and end positions |
#pre_match | None | String | Text before match |
#post_match | None | String | Text after match |
#string | None | String | Original matched string |
#values_at | *indexes (Integer/Range) | Array | Multiple captured groups |
Regular Expression Flags
Flag | Symbol | Description | Example |
---|---|---|---|
Case-insensitive | i | Ignores character case | /ruby/i |
Multiline | m | Dot matches newlines | /start.+end/m |
Extended | x | Ignores whitespace, allows comments | /\d{3} \s \d{4}/x |
Unicode | u | Interprets the pattern as UTF-8 | /\p{L}+/u |
Character Classes
Pattern | Description | Equivalent |
---|---|---|
\d | Digit character | [0-9] |
\D | Non-digit character | [^0-9] |
\w | Word character | [A-Za-z0-9_] |
\W | Non-word character | [^A-Za-z0-9_] |
\s | Whitespace character | [ \t\r\n\f\v] |
\S | Non-whitespace character | [^ \t\r\n\f\v] |
. | Any character except newline | [^\n] |
Anchors and Boundaries
Pattern | Description | Usage |
---|---|---|
^ | Start of line | /^Ruby/ |
$ | End of line | /Ruby$/ |
\A | Start of string | /\ARuby/ |
\z | End of string | /Ruby\z/ |
\Z | End of string or before final newline | /Ruby\Z/ |
\b | Word boundary | /\bRuby\b/ |
\B | Non-word boundary | /\BRuby\B/ |
Global Variables
Variable | Description | Content |
---|---|---|
$~ | Last MatchData object | MatchData instance |
$& | Entire matched string | String |
$` | Text before match | String |
$' | Text after match | String |
$1, $2, ... | Captured groups | String or nil |
$+ | Last captured group | String or nil |