Overview
Ruby implements regular expressions through the Regexp class, providing pattern matching capabilities for string processing and validation. Regular expressions in Ruby follow the Onigmo engine syntax, supporting Unicode and multibyte characters with extensive metacharacter support.
The core classes involved in regex operations include Regexp for pattern definition, MatchData for capture results, and String methods that accept regex parameters. Ruby supports both literal regex syntax using forward slashes and constructor-based creation through Regexp.new.
# Literal syntax
pattern = /[a-z]+/i
# Constructor syntax
pattern = Regexp.new('[a-z]+', Regexp::IGNORECASE)
# Basic matching
text = "Hello World"
result = text.match(/[A-Z][a-z]+/)
# => #<MatchData "Hello">
Regular expressions in Ruby are first-class objects that can be stored in variables, passed as arguments, and built dynamically at runtime. A match attempt returns a MatchData object on success or nil on failure.
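As a minimal sketch of that behavior, the snippet below stores a pattern in a constant, passes it to a method, and handles both outcomes of a match. The names WORD and first_match are hypothetical helpers introduced only for illustration.

```ruby
# A Regexp is an ordinary object: store it, pass it around, reuse it.
WORD = /[a-z]+/i

# Hypothetical helper: returns the first matched text, or nil when nothing matches.
def first_match(pattern, text)
  match = text.match(pattern) # MatchData on success, nil on failure
  match && match[0]
end

found   = first_match(WORD, "123 abc") # "abc"
missing = first_match(WORD, "123 456") # nil
```

Checking for nil before reading captures avoids a NoMethodError when the input does not match.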
Ruby's regex implementation supports named captures, lookahead and lookbehind assertions, conditional expressions, and atomic grouping. The engine processes patterns left-to-right with backtracking, making certain complex patterns computationally expensive.
Basic Usage
Creating regular expressions requires understanding Ruby's literal and constructor syntax. Literal expressions use forward slashes with optional modifiers, while constructors accept string patterns and flag constants.
# Case-insensitive matching
email_pattern = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
# Multiline pattern with comments
phone_pattern = /
\A # Start of string
\d{3} # Area code
[-\s]? # Optional separator
\d{3} # Exchange
[-\s]? # Optional separator
\d{4} # Number
\z # End of string
/x
# Constructor with flags (id is an example value; escape it with Regexp.escape if it comes from user input)
id = 42
dynamic_pattern = Regexp.new("user_#{id}", Regexp::IGNORECASE | Regexp::MULTILINE)
Character classes define sets of acceptable characters using square brackets. Ruby supports predefined classes like \d for digits, \w for word characters, and \s for whitespace, plus Unicode property classes.
# Custom character class
hex_color = /\A#[0-9a-fA-F]{6}\z/
# Negated character class
non_digits = /[^\d]+/
# Unicode property class
japanese_text = /\p{Hiragana}+/
# Range-based class
uppercase_letters = /[A-Z]+/
Quantifiers specify repetition patterns for characters or groups. Ruby supports greedy, lazy, and possessive quantifiers that affect backtracking behavior during pattern matching.
text = "<tag>content</tag>"
# Greedy quantifier (default)
greedy_match = text.match(/<.+>/)
# => #<MatchData "<tag>content</tag>">
# Lazy quantifier
lazy_match = text.match(/<.+?>/)
# => #<MatchData "<tag>">
# Possessive quantifier (no backtracking)
possessive_pattern = /\d++/
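The effect of a possessive quantifier is easiest to see on a pattern that only matches if the quantifier gives characters back. The following is a small illustration with made-up input, not a benchmark.

```ruby
text = "12345"

# Greedy: \d+ first takes all five digits, then backtracks one so the literal 5 can match.
greedy = text.match(/\d+5/)

# Possessive: \d++ takes all the digits it can and never gives any back,
# so the trailing literal 5 has nothing left to match.
possessive = text.match(/\d++5/)

greedy_text = greedy && greedy[0] # "12345"
possessive_failed = possessive.nil? # true
```

Possessive quantifiers can therefore change what a pattern matches, not just how fast it runs.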
Anchors position matches at specific string boundaries. Ruby provides start-of-string (\A), end-of-string (\z), line boundaries (^, $), and word boundaries (\b).
# Full string validation
valid_username = /\A[a-zA-Z0-9_]{3,20}\z/
# Line-based matching (^ and $ always match at line boundaries in Ruby; /m would let . span newlines)
extract_lines = /^Error:.+$/
# Word boundary matching
find_whole_word = /\btest\b/
# Position matching example
log_entry = "2023-10-15 ERROR: Database connection failed"
timestamp = log_entry.match(/\A\d{4}-\d{2}-\d{2}/)
# => #<MatchData "2023-10-15">
Performance & Memory
Regular expression performance depends on pattern complexity, input string length, and backtracking behavior. Simple character classes and literal strings perform faster than complex alternations and nested quantifiers.
Catastrophic backtracking occurs when patterns with nested quantifiers encounter input that forces exponential evaluation paths. The regex engine explores every possible matching combination before failing, causing performance degradation.
require 'benchmark'
# Problematic pattern with nested quantifiers
catastrophic_pattern = /(a+)+b/
safe_pattern = /a+b/
test_string = "a" * 20 + "c" # No 'b' at end forces backtracking
# Note: Ruby 3.2+ memoizes match states for many patterns, so the gap may be much smaller there
Benchmark.bm(15) do |x|
  x.report("Safe pattern:") do
    10000.times { safe_pattern.match(test_string) }
  end
  x.report("Catastrophic:") do
    10.times { catastrophic_pattern.match(test_string) } # Far fewer iterations
  end
end
Atomic grouping prevents backtracking within group boundaries using (?>pattern) syntax. This optimization technique improves performance by eliminating redundant evaluation paths.
# Without atomic grouping - allows backtracking
inefficient = /\d+\.\d+/
# With atomic grouping - prevents backtracking
efficient = /(?>\d+)\.\d+/
# Benchmark comparison with large numeric strings
large_number = "1234567890123456789.0"
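Because an atomic group discards its backtracking positions, it can also change the result of a match. The sketch below contrasts the two behaviors on a short made-up string and then checks that the atomic version of the decimal pattern still matches the sample number.

```ruby
text = "aaab"

# Plain group: a+ gives back one "a" so that the following "ab" can match.
plain = text.match(/a+ab/)

# Atomic group: (?>a+) keeps every "a" it consumed, so "ab" can never match.
atomic = text.match(/(?>a+)ab/)

plain_text = plain && plain[0] # "aaab"
atomic_failed = atomic.nil? # true

# The decimal pattern is unaffected, since \d+ never needs to give digits back to \.
large_number = "1234567890123456789.0"
decimal_ok = large_number.match?(/(?>\d+)\.\d+/) # true
```

Atomic groups are safest when whatever follows the group could never match the characters the group consumed.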
Memory usage increases with pattern complexity and capture group count. Named captures create additional references, while non-capturing groups (?:pattern) reduce memory overhead without affecting matching behavior.
# Memory-intensive pattern with many captures
memory_heavy = /(.*)(.*)(.*)(.*)(.*)(.*)(.*)(.*)/
# Optimized with non-capturing groups
memory_light = /(?:.*){8}/
# Named captures balance readability and memory
balanced = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
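To make the difference between group types concrete, the following sketch shows what each kind of group leaves in the resulting MatchData. The sample strings are invented for illustration.

```ruby
# Capturing groups: every parenthesized group is stored.
both = /(\d+)-(\d+)/.match("10-15").captures # ["10", "15"]

# Non-capturing group: the first group still participates in matching but stores nothing.
one = /(?:\d+)-(\d+)/.match("10-15").captures # ["15"]

# Named captures: values are reachable by name as well as by position.
date = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/.match("2023-10-15")
year = date[:year] # "2023"
parts = date.named_captures # {"year"=>"2023", "month"=>"10", "day"=>"15"}
```

Use a capturing group only when the captured text is read afterwards; otherwise (?:...) keeps the MatchData smaller and the numbering stable.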
Pattern compilation cost matters for frequently used patterns. Ruby compiles a regex literal without interpolation once, when the code is loaded, and reuses that object; patterns built with Regexp.new are compiled on every call, so dynamic patterns benefit from manual caching.
# Cached literal pattern (automatic)
CONSTANT_PATTERN = /[a-z]+/i
# Manual caching for dynamic patterns
class PatternCache
  def initialize
    @cache = {}
  end

  def get_pattern(base, flags = 0)
    key = [base, flags]
    @cache[key] ||= Regexp.new(base, flags)
  end
end
cache = PatternCache.new
pattern = cache.get_pattern('user_\d+', Regexp::IGNORECASE)
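The difference between literal and constructed patterns can be observed through object identity. This sketch assumes CRuby behavior; literal_word and built_word are hypothetical method names used only for the demonstration.

```ruby
def literal_word
  /[a-z]+/i # literal without interpolation: the same object on every call
end

def built_word(word)
  # Compiled again on each call; Regexp.escape keeps the input literal.
  Regexp.new(Regexp.escape(word), Regexp::IGNORECASE)
end

same_literal = literal_word.equal?(literal_word) # true
same_built = built_word("a").equal?(built_word("a")) # false
equal_value = built_word("a") == built_word("a") # true: same source and options
```

Two separately built patterns compare equal with == but are distinct objects, which is why a cache keyed on source and flags avoids the repeated compilation.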
Error Handling & Debugging
Regular expression compilation errors occur when patterns contain invalid syntax or unsupported constructs. Ruby raises RegexpError exceptions during pattern creation, not during matching operations.
begin
  # Invalid quantifier syntax
  invalid_pattern = Regexp.new('*invalid')
rescue RegexpError => e
  puts "Pattern compilation failed: #{e.message}"
  # => Pattern compilation failed: target of repeat operator is not specified
end
begin
  # Unmatched parenthesis
  unbalanced = Regexp.new('(unclosed group')
rescue RegexpError => e
  puts "Syntax error: #{e.message}"
end
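When pattern strings come from configuration or user input, the rescue can be wrapped in a small helper. The following is one possible sketch; compile_pattern is a hypothetical name, and returning nil on failure is a design assumption rather than a Ruby convention.

```ruby
# Hypothetical helper: compile a pattern string, returning nil if it is invalid.
def compile_pattern(source, flags = 0)
  Regexp.new(source, flags)
rescue RegexpError
  nil
end

valid = compile_pattern('\d{3}-\d{4}')
invalid = compile_pattern('(unclosed group')

valid_is_regexp = valid.is_a?(Regexp) # true
invalid_is_nil = invalid.nil? # true
```

Because compilation is the only step that raises RegexpError, validating once up front means later match calls need no rescue for syntax problems.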
Encoding mismatches between patterns and strings cause matching failures or unexpected results. Ruby requires compatible encodings for successful pattern matching operations.
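# Encoding compatibility check
def safe_match(pattern, text)
  if pattern.encoding != text.encoding
    # Regexp.new takes no encoding argument, so rebuild the pattern
    # from its source string transcoded to the text's encoding
    flags = pattern.options & (Regexp::IGNORECASE | Regexp::EXTENDED | Regexp::MULTILINE)
    compatible_pattern = Regexp.new(pattern.source.encode(text.encoding), flags)
    text.match(compatible_pattern)
  else
    text.match(pattern)
  end
rescue EncodingError => e
  # Covers Encoding::CompatibilityError and failed conversions
  puts "Encoding mismatch: #{e.message}"
  nil
end
# Example usage
utf8_text = "café".encode('UTF-8')
# Regexp has no force_encoding method; the source string carries the encoding
ascii_pattern = Regexp.new('caf.'.encode('US-ASCII'))
result = safe_match(ascii_pattern, utf8_text)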
Debugging complex patterns requires systematic decomposition and testing. Ruby provides the MatchData object with detailed capture information and position data.
def debug_regex(pattern, text)
  match = text.match(pattern)
  unless match
    # Return nil so callers always get MatchData or nil
    puts "No match found"
    return nil
  end
  puts "Full match: '#{match[0]}' at position #{match.begin(0)}-#{match.end(0)}"
  puts "Captures:"
  match.captures.each_with_index do |capture, index|
    puts " Group #{index + 1}: '#{capture}'"
  end
  if match.names.any?
    puts "Named captures:"
    match.names.each do |name|
      puts " #{name}: '#{match[name]}'"
    end
  end
  match
end
# Usage example
email_pattern = /(?<user>[\w+\-.]+)@(?<domain>[a-z\d\-]+(?:\.[a-z\d\-]+)*)/i
result = debug_regex(email_pattern, "Contact: john.doe@example.com")
Timeout protection prevents runaway regex operations from blocking application threads. Ruby's Timeout module can wrap regex operations with time limits; Ruby 3.2 and later also provide a built-in Regexp.timeout setting.
require 'timeout'
def safe_regex_match(pattern, text, timeout_seconds = 5)
  Timeout.timeout(timeout_seconds) do
    text.match(pattern)
  end
rescue Timeout::Error
  puts "Regex operation timed out after #{timeout_seconds} seconds"
  nil
rescue RegexpError => e
  puts "Regex compilation error: #{e.message}"
  nil
end
# Example with potentially slow pattern
slow_pattern = /(a+)+b/
problematic_input = "a" * 25 + "c"
result = safe_regex_match(slow_pattern, problematic_input, 2)
Common Pitfalls
Greedy quantifiers consume maximum possible characters before attempting to match subsequent pattern elements, often producing unexpected results when developers expect minimal matching behavior.
html = '<div class="main">Content</div>'
# Common mistake - greedy quantifier matches too much
wrong_pattern = /<.*>/
wrong_match = html.match(wrong_pattern)
# => #<MatchData "<div class=\"main\">Content</div>">
# Correct approach - lazy quantifier
correct_pattern = /<.*?>/
correct_match = html.match(correct_pattern)
# => #<MatchData "<div class=\"main\">">
# Alternative - negated character class (more efficient)
efficient_pattern = /<[^>]*>/
efficient_match = html.match(efficient_pattern)
# => #<MatchData "<div class=\"main\">">
Anchor confusion leads to validation bypasses when developers use line anchors (^, $) instead of string anchors (\A, \z) for input validation. Line anchors match line boundaries within multiline strings.
# Dangerous validation - uses line anchors
unsafe_email_check = /^[\w\.-]+@[\w\.-]+\.[a-z]{2,}$/i
# Multiline input bypasses validation
malicious_input = "valid@email.com\n<script>alert('xss')</script>"
unsafe_result = malicious_input.match(unsafe_email_check)
# => #<MatchData "valid@email.com"> (matches despite script tag)
# Secure validation - uses string anchors
safe_email_check = /\A[\w\.-]+@[\w\.-]+\.[a-z]{2,}\z/i
safe_result = malicious_input.match(safe_email_check)
# => nil (correctly rejects malicious input)
Escape sequence handling differs between single and double-quoted strings when constructing regex patterns. Backslashes require additional escaping in double-quoted strings.
# Single-quoted string preserves literal backslashes
single_quoted = Regexp.new('\d+\.\d+')
# Equivalent to: /\d+\.\d+/
# Double-quoted string requires escaped backslashes
double_quoted = Regexp.new("\\d+\\.\\d+")
# Equivalent to: /\d+\.\d+/
# Common mistake - insufficient escaping in double quotes
wrong_pattern = Regexp.new("\d+\.\d+") # Interprets \d as literal d
# Creates pattern: /d+.d+/ (matches 'd' instead of digits)
# Testing the difference (p prints the inspect form; puts would print only the matched text)
test_string = "123.45"
p single_quoted.match(test_string) # => #<MatchData "123.45">
p double_quoted.match(test_string) # => #<MatchData "123.45">
p wrong_pattern.match(test_string) # => nil
Unicode handling requires explicit character classes or property matching. ASCII-based patterns fail with international characters unless properly configured with Unicode support.
# ASCII-only pattern fails with Unicode
ascii_name = /\A[a-zA-Z\s]+\z/
unicode_name = "José María"
ascii_result = unicode_name.match(ascii_name)
# => nil (fails to match accented characters)
# Unicode-aware pattern
unicode_pattern = /\A[\p{L}\s]+\z/
unicode_result = unicode_name.match(unicode_pattern)
# => #<MatchData "José María">
# Alternative with an explicit list of accented characters
spanish_pattern = /\A[a-zA-ZáéíóúÁÉÍÓÚñÑ\s]+\z/
spanish_result = unicode_name.match(spanish_pattern)
# => #<MatchData "José María">
Case sensitivity mistakes occur when patterns don't specify appropriate flags for intended matching behavior. Ruby regex defaults to case-sensitive matching unless explicitly configured otherwise.
# Case-sensitive matching (default)
case_sensitive = /hello/
mixed_case_text = "Hello World"
sensitive_result = mixed_case_text.match(case_sensitive)
# => nil (doesn't match due to case difference)
# Case-insensitive matching
case_insensitive = /hello/i
insensitive_result = mixed_case_text.match(case_insensitive)
# => #<MatchData "Hello">
# Dynamic case handling
def flexible_match(pattern_string, text, ignore_case = false)
  flags = ignore_case ? Regexp::IGNORECASE : 0
  pattern = Regexp.new(pattern_string, flags)
  text.match(pattern)
end
result = flexible_match("HELLO", "hello world", true)
# => #<MatchData "hello">
Reference
Regex Literal Syntax
Syntax | Description | Example |
---|---|---|
/pattern/ | Basic literal regex | /hello/ |
/pattern/flags | Literal with modifiers | /hello/i |
%r{pattern} | Alternative delimiter | %r{https?://} |
`%r\|pattern\|flags` | Alternative delimiter with modifiers | |
Constructor Methods
Method | Parameters | Returns | Description |
---|---|---|---|
Regexp.new(pattern, flags = 0) | pattern (String), flags (Integer) | Regexp | Creates regex from string |
Regexp.compile(pattern, flags = 0) | pattern (String), flags (Integer) | Regexp | Alias for new |
Regexp.escape(string) | string (String) | String | Escapes special characters |
Regexp.union(*patterns) | patterns (Array) | Regexp | Creates alternation pattern |
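As a brief illustration of the last two methods, the sketch below escapes a string so it is matched literally and combines several alternatives into one pattern. The input strings are made up for the example.

```ruby
user_input = "1+1=2?"
haystack = "is 1+1=2? yes"

# Unescaped, + and ? act as quantifiers, so the text is not found literally.
unescaped_hit = Regexp.new(user_input).match?(haystack) # false

# Escaped, every metacharacter is matched as plain text.
escaped = Regexp.escape(user_input) # "1\\+1=2\\?"
escaped_hit = Regexp.new(escaped).match?(haystack) # true

# Regexp.union escapes String arguments and keeps Regexp arguments (with their flags) as they are.
keywords = Regexp.union("cat", "dog", /bird/i)
bird_hit = "I saw a Bird".match?(keywords) # true
union_source = Regexp.union("a.b", "c").source # "a\\.b|c"
```

Escaping is the usual safeguard whenever text from outside the program is interpolated into a pattern.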
Modifier Flags
Flag | Constant | Description | Literal |
---|---|---|---|
i | Regexp::IGNORECASE | Case-insensitive matching | /pattern/i |
m | Regexp::MULTILINE | Multiline mode (. matches newlines) | /pattern/m |
x | Regexp::EXTENDED | Extended mode (ignores whitespace and comments) | /pattern/x |
o | N/A | Performs #{...} interpolation only once | /pattern/o |
Character Classes
Class | Description | Equivalent |
---|---|---|
. | Any character except newline | [^\n] |
\d | Digit character | [0-9] |
\D | Non-digit character | [^0-9] |
\w | Word character | [a-zA-Z0-9_] |
\W | Non-word character | [^a-zA-Z0-9_] |
\s | Whitespace character | [ \t\r\n\f\v] |
\S | Non-whitespace character | [^ \t\r\n\f\v] |
Quantifiers
Quantifier | Type | Description | Example |
---|---|---|---|
* | Greedy | Zero or more | /a*/ |
+ | Greedy | One or more | /a+/ |
? | Greedy | Zero or one | /a?/ |
{n} | Exact | Exactly n times | /a{3}/ |
{n,} | Greedy | n or more times | /a{3,}/ |
{n,m} | Greedy | Between n and m times | /a{3,5}/ |
*? | Lazy | Zero or more (minimal) | /a*?/ |
+? | Lazy | One or more (minimal) | /a+?/ |
?? | Lazy | Zero or one (minimal) | /a??/ |
Anchors
Anchor | Description | Example |
---|---|---|
\A | Start of string | /\A[A-Z]/ |
\z | End of string | /\d\z/ |
\Z | End of string (before final newline) | /\w\Z/ |
^ | Start of line | /^Error/ |
$ | End of line | /\d$/ |
\b | Word boundary | /\btest\b/ |
\B | Non-word boundary | /\Btest\B/ |
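The distinctions in the table are easiest to check against a string that contains newlines. The sketch below uses an invented two-line string that ends with a trailing newline.

```ruby
text = "first\nsecond\n"

strict_end = text.match?(/second\z/) # false: the string ends with "\n", not "second"
lenient_end = text.match?(/second\Z/) # true: \Z allows one trailing newline
line_end = text.match?(/first$/) # true: $ matches before any newline
line_start = text.match?(/^second/) # true: ^ matches after any newline
string_start = text.match?(/\Asecond/) # false: \A matches only at position 0
```

No flag is involved in any of these results; in Ruby, ^ and $ are always line anchors.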
Groups and Captures
Syntax | Description | Example |
---|---|---|
(pattern) | Capturing group | /(a+)(b+)/ |
(?:pattern) | Non-capturing group | /(?:a+)b+/ |
(?<name>pattern) | Named capture | /(?<year>\d{4})/ |
(?'name'pattern) | Named capture (alternative) | /(?'year'\d{4})/ |
\1, \2, ... | Backreference by number | /(a+)\1/ |
\k<name> | Backreference by name | /(?<word>\w+)\k<word>/ |
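A backreference matches the same text that its group captured earlier in the match. The following sketch uses the numbered example from the table and a named variant that finds doubled words; the sample sentence is made up.

```ruby
# Numbered backreference: \1 must repeat exactly what group 1 captured.
repeated = /(a+)\1/.match("aaaa")
whole = repeated[0] # "aaaa"
half = repeated[1] # "aa"

# Named backreference: find a word immediately followed by itself.
doubled = /\b(?<word>\w+)\s+\k<word>\b/i
typo = "This is is a test"[doubled] # "is is"
clean = "This is a test"[doubled] # nil
```

In the first pattern the group has to backtrack from four characters to two before the backreference can succeed, which is why patterns with backreferences can be slow on long inputs.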
Lookahead and Lookbehind
Syntax | Type | Description | Example |
---|---|---|---|
(?=pattern) | Positive lookahead | Look ahead for pattern | /\d+(?=px)/ |
(?!pattern) | Negative lookahead | Look ahead, not pattern | /\d+(?!px)/ |
(?<=pattern) | Positive lookbehind | Look behind for pattern | /(?<=\$)\d+/ |
(?<!pattern) | Negative lookbehind | Look behind, not pattern | /(?<!\$)\d+/ |
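Lookarounds check the surrounding text without consuming it. The sketch below runs the table's examples against invented strings and shows one way negative lookahead interacts with backtracking.

```ruby
css = "width: 100px; height: 50em"

# Positive lookahead: digits followed by "px"; "px" is not part of the match.
px_values = css.scan(/\d+(?=px)/) # ["100"]

# Negative lookahead with a greedy quantifier: \d+ backtracks from "100" to "10",
# which is followed by "0p" rather than "px", so "10" is reported.
naive = css.scan(/\d+(?!px)/) # ["10", "50"]

# A possessive quantifier forbids that backtracking, leaving only the non-px number.
strict = css.scan(/\d++(?!px)/) # ["50"]

# Positive lookbehind: digits preceded by a dollar sign.
prices = "$20 and 30".scan(/(?<=\$)\d+/) # ["20"]
```

Pairing a negative lookaround with a possessive quantifier or an atomic group is a common way to keep partial matches from slipping through.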
String Methods with Regex
Method | Parameters | Returns | Description |
---|---|---|---|
#match(pattern, pos = 0) | pattern (Regexp), pos (Integer) | MatchData or nil | Returns match data |
#match?(pattern, pos = 0) | pattern (Regexp), pos (Integer) | Boolean | Returns boolean result |
#scan(pattern) | pattern (Regexp) | Array | Returns all matches |
#gsub(pattern, replacement) | pattern (Regexp), replacement (String) | String | Global substitution |
#gsub!(pattern, replacement) | pattern (Regexp), replacement (String) | String or nil | In-place substitution |
#sub(pattern, replacement) | pattern (Regexp), replacement (String) | String | Single substitution |
#split(pattern, limit = 0) | pattern (Regexp), limit (Integer) | Array | Split by pattern |
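A short sketch of these methods applied to one invented log line may help show how their return values differ.

```ruby
log = "2023-10-15 ERROR: disk full; 2023-10-16 WARN: disk at 90%"
date = /\d{4}-\d{2}-\d{2}/

dates = log.scan(date) # ["2023-10-15", "2023-10-16"]
# With capture groups, scan returns one array of captures per match.
pairs = log.scan(/(ERROR|WARN): ([^;]+)/) # [["ERROR", "disk full"], ["WARN", "disk at 90%"]]
masked = log.gsub(date, "DATE") # every date replaced
first_only = log.sub(/\d{4}/, "YYYY") # only the first year replaced
fields = "a, b;c".split(/[,;]\s*/) # ["a", "b", "c"]
has_error = log.match?(/ERROR/) # true, without building a MatchData
# gsub! returns nil when nothing was replaced.
unchanged = "abc".dup.gsub!(/x/, "y") # nil
```

Prefer match? when only a yes/no answer is needed, since it skips creating the MatchData object.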
MatchData Methods
Method | Returns | Description |
---|---|---|
#[](index) | String or nil | Access capture by index |
#captures | Array | All captured groups |
#named_captures | Hash | Named captures as hash |
#names | Array | Names of capture groups |
#begin(index) | Integer | Start position of capture |
#end(index) | Integer | End position of capture |
#offset(index) | Array | [begin, end] positions |
#pre_match | String | String before match |
#post_match | String | String after match |
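The methods in this table can be exercised together on a single match. The address and surrounding text below are invented for the example.

```ruby
m = /(?<user>\w+)@(?<domain>[\w.]+)/.match("Contact: jdoe@example.com today")

full = m[0] # "jdoe@example.com"
by_index = m[1] # "jdoe"
by_name = m[:domain] # "example.com"
captures = m.captures # ["jdoe", "example.com"]
named = m.named_captures # {"user"=>"jdoe", "domain"=>"example.com"}
names = m.names # ["user", "domain"]
start_pos = m.begin(0) # 9
end_pos = m.end(0) # 25 (exclusive)
domain_span = m.offset(:domain) # [14, 25]
before = m.pre_match # "Contact: "
after = m.post_match # " today"
```

Positions are character offsets into the original string, and the end offset points one past the last matched character.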