Overview
Regular expression literals in Ruby provide a concise syntax for creating Regexp objects using forward-slash delimiters. Ruby recognizes the /pattern/flags syntax when it parses the source, creating Regexp objects with the specified match options and behaviors.
# Basic literal syntax
email_pattern = /\w+@\w+\.\w+/
puts email_pattern.class
# => Regexp
# With case-insensitive flag
name_pattern = /john/i
puts name_pattern.match?("JOHN")
# => true
Ruby compiles non-interpolated regex literals while parsing the script. This compilation happens once per literal occurrence, regardless of execution frequency. The parser recognizes flag characters (i, m, x, o) immediately following the closing delimiter and applies the corresponding match options.
# Extended flag (x) permits whitespace and comments inside the pattern
complex_pattern = /
  \A        # Start of string
  (\d{3})   # Area code
  [-.]?     # Optional separator
  (\d{3})   # Exchange
  [-.]?     # Optional separator
  (\d{4})   # Number
  \z        # End of string
/x
phone = "555-123-4567"
match = complex_pattern.match(phone)
puts match[1], match[2], match[3]
# => 555
# => 123
# => 4567
Regular expression literals support string interpolation when patterns contain variable content. Ruby re-compiles interpolated literals on each evaluation (unless the o flag is used), creating new Regexp objects from the expanded string content.
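The difference between static and interpolated literals can be observed through object identity. The sketch below is only an illustration: static_literal and dynamic_literal are hypothetical helper names, not part of any API.

```ruby
# Hypothetical helpers for illustration only
def static_literal
  /\d+/          # non-interpolated: compiled once, same object on every call
end

def dynamic_literal(unit)
  /\d+#{unit}/   # interpolated: a new Regexp is built on every call
end

same_static   = static_literal.equal?(static_literal)
same_dynamic  = dynamic_literal("px").equal?(dynamic_literal("px"))
equal_dynamic = dynamic_literal("px") == dynamic_literal("px")

puts same_static    # => true
puts same_dynamic   # => false
puts equal_dynamic  # => true
```

The two interpolated results are distinct objects, yet they compare equal with == because their source and options match.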
Basic Usage
Ruby regex literals use forward slash delimiters with optional trailing flags. The syntax creates Regexp objects that inherit standard matching methods and capture group functionality.
# Simple matching operations
text = "Ruby version 3.1.0"
version_pattern = /\d+\.\d+\.\d+/
if text.match(version_pattern)
puts "Version found: #{$&}"
end
# => Version found: 3.1.0
# Named captures
log_pattern = /(?<timestamp>\d{4}-\d{2}-\d{2}) (?<level>\w+): (?<message>.*)/
log_entry = "2024-01-15 ERROR: Database connection failed"
match = log_pattern.match(log_entry)
puts match[:timestamp], match[:level], match[:message]
# => 2024-01-15
# => ERROR
# => Database connection failed
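Named captures can also be assigned directly to local variables. This is a minimal sketch that assumes the regex literal sits on the left side of =~ and contains no interpolation, since Ruby performs the assignment only in that form.

```ruby
log_entry = "2024-01-15 ERROR: Database connection failed"

# The literal must be on the left of =~ and must not use interpolation
if /(?<level>[A-Z]+): (?<message>.*)/ =~ log_entry
  puts level    # => ERROR
  puts message  # => Database connection failed
end
```

If the match fails, the variables are set to nil, so checking the result of =~ first avoids surprises.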
Flag characters modify regex behavior during compilation. Ruby recognizes four standard flags: case-insensitive (i), multiline (m, which makes . match newlines), extended (x), and interpolate-once (o).
# Case sensitivity comparison
strict_pattern = /hello/
loose_pattern = /hello/i
test_string = "Hello World"
puts strict_pattern.match?(test_string) # => false
puts loose_pattern.match?(test_string) # => true
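# In Ruby, ^ and $ always match at line boundaries, with or without m
multiline_text = "First line\nSecond line\nThird line"
line_pattern = /^Second/
puts line_pattern.match?(multiline_text)
# => true
# The m flag makes . match newline characters
puts(/First.*Third/.match?(multiline_text))
# => false
puts(/First.*Third/m.match?(multiline_text))
# => true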
String interpolation within regex literals enables dynamic pattern construction. Ruby evaluates interpolated expressions and reconstructs the pattern string before compilation.
# Dynamic pattern building
def create_word_boundary(word)
/\b#{Regexp.escape(word)}\b/i
end
search_term = "Ruby"
pattern = create_word_boundary(search_term)
text = "Learning Ruby programming with ruby-lang.org"
matches = text.scan(pattern)
puts matches.length
# => 2
# Variable-based patterns
prefixes = %w[Mr Mrs Dr Prof]
title_pattern = /\A(#{prefixes.join('|')})\.\s+(\w+)/
name = "Dr. Smith"
match = title_pattern.match(name)
puts match[1], match[2]
# => Dr
# => Smith
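Joining strings with | works only while none of them contain regex metacharacters. As a hedged alternative, Regexp.union escapes each string before combining them. The tokens list below is a hypothetical example chosen to include metacharacters.

```ruby
# Hypothetical list that includes regex metacharacters
tokens = ["a.b", "c+d", "Dr"]

joined = /\A(#{tokens.join('|')})\z/      # metacharacters stay active
united = /\A(#{Regexp.union(tokens)})\z/  # each string is escaped first

joined_loose = joined.match?("axb")
united_loose = united.match?("axb")
united_exact = united.match?("a.b")

puts joined_loose  # => true  ('.' matched any character)
puts united_loose  # => false
puts united_exact  # => true
```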
Regex literals interact with Ruby's special match variables ($~, $1, $& and related), which are updated after every matching operation and reset to nil when a match fails. These variables provide access to match components without explicit MatchData object handling.
email_text = "Contact: user@domain.com for support"
email_pattern = /(\w+)@(\w+)\.(\w+)/
# Global variables populated after match
if email_text =~ email_pattern
puts "Username: #{$1}"
puts "Domain: #{$2}"
puts "TLD: #{$3}"
puts "Full match: #{$&}"
puts "Pre-match: #{$`}"
puts "Post-match: #{$'}"
end
# => Username: user
# => Domain: domain
# => TLD: com
# => Full match: user@domain.com
# => Pre-match: Contact:
# => Post-match: for support
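The same information is available through Regexp.last_match, which some readers find easier to scan than the punctuation variables. A small sketch reusing the email example:

```ruby
email_text = "Contact: user@domain.com for support"

if email_text =~ /(\w+)@(\w+)\.(\w+)/
  m = Regexp.last_match                    # MatchData for the most recent match
  same_capture = (Regexp.last_match(1) == $1)
  same_prefix  = (m.pre_match == $`)
  puts m[0]          # => user@domain.com
  puts same_capture  # => true
  puts same_prefix   # => true
end
```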
Advanced Usage
Complex regex literals leverage character classes, quantifiers, lookarounds, and atomic groups for sophisticated pattern matching. Ruby's regex engine supports advanced features including conditional expressions and recursive patterns.
# Nested parentheses matching with recursion
balanced_parens = /
\A
(
[^()]++ # Non-parentheses characters
| # OR
\( # Opening paren
(?: # Non-capturing group
[^()]++ # More non-paren chars
| # OR
\g<1> # Recursive reference to group 1
)*
\) # Closing paren
)*
\z
/x
test_expressions = [
"(simple)",
"((nested))",
"(incomplete",
"text(with)parens",
"((deeply)(nested)(groups))"
]
test_expressions.each do |expr|
puts "#{expr}: #{balanced_parens.match?(expr)}"
end
# => (simple): true
# => ((nested)): true
# => (incomplete: false
# => text(with)parens: true
# => ((deeply)(nested)(groups)): true
Conditional expressions within regex literals enable pattern branching based on capture group presence or match success. Ruby evaluates conditions during matching and selects appropriate pattern branches.
# IP address validation with conditional logic
ip_pattern = /
\A
(
# First three octets
(?:
(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d) # 0-255
\. # Dot separator
){3}
)
# Fourth octet (captured for conditional)
(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)
# Conditional: if group 2 exists, ensure no trailing chars
(?(2)\z|$)
\z
/x
ip_addresses = [
"192.168.1.1",
"10.0.0.256", # Invalid
"172.16.0.1",
"300.1.1.1", # Invalid
"127.0.0.1"
]
ip_addresses.each do |ip|
puts "#{ip}: #{ip_pattern.match?(ip) ? 'valid' : 'invalid'}"
end
# => 192.168.1.1: valid
# => 10.0.0.256: invalid
# => 172.16.0.1: valid
# => 300.1.1.1: invalid
# => 127.0.0.1: valid
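A smaller case may make the conditional syntax easier to follow. The sketch below is an illustration and is not taken from the IP example: it requires a closing quote only when an opening quote was captured, and the quoted_word name is hypothetical.

```ruby
# Group 1 optionally captures an opening quote; (?(1)") demands a
# closing quote only when group 1 took part in the match
quoted_word = /\A(")?\w+(?(1)")\z/

inputs  = ['"word"', 'word', '"word', 'word"']
results = inputs.map { |input| quoted_word.match?(input) }

inputs.zip(results).each do |input, ok|
  puts "#{input}: #{ok}"
end
# => "word": true
# => word: true
# => "word: false
# => word": false
```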
Atomic groups and possessive quantifiers prevent backtracking in complex patterns, improving performance and preventing catastrophic backtracking scenarios.
# Email validation with atomic groups
email_pattern = /
\A
# Username part with atomic grouping
(?>[\w.-]+) # Atomic group - no backtrack
@
# Domain part
(?>[a-zA-Z0-9-]+\.)+ # Atomic subdomain groups
[a-zA-Z]{2,} # TLD
\z
/x
# Performance comparison pattern (catastrophic backtracking)
bad_pattern = /^(a+)+b$/
good_pattern = /^(?>a+)+b$/
# Test string that would cause issues
test_string = "a" * 25 + "c" # No 'b' at end
require 'timeout'
# bad_pattern can backtrack exponentially on this input (Ruby 3.2 and later
# mitigate many such cases); good_pattern fails fast
begin
Timeout.timeout(1) do
puts good_pattern.match?(test_string)
end
rescue Timeout::Error
puts "Pattern timed out"
end
# => false
# Complex URL parsing with multiple atomic groups
url_pattern = /
\A
(?<scheme>https?) # Protocol
:\/\/
(?: # Optional user info
(?<user>(?>[^\/:@]+)) # Username (atomic)
(?::(?<pass>(?>[^\/@]+)))? # Password (atomic)
@
)?
(?<host>(?>[^\/:\?]+)) # Hostname (atomic)
(?::(?<port>\d+))? # Optional port
(?<path>(?>[^\?\#]*)) # Path (atomic)
(?:\?(?<query>(?>[^\#]*)))? # Query string (atomic)
(?:\#(?<fragment>.*))? # Fragment
\z
/x
complex_url = "https://user:pass@example.com:8080/path/to/resource?param=value&other=data#section"
match = url_pattern.match(complex_url)
if match
puts "Scheme: #{match[:scheme]}"
puts "User: #{match[:user]}"
puts "Host: #{match[:host]}"
puts "Port: #{match[:port]}"
puts "Path: #{match[:path]}"
puts "Query: #{match[:query]}"
puts "Fragment: #{match[:fragment]}"
end
# => Scheme: https
# => User: user
# => Host: example.com
# => Port: 8080
# => Path: /path/to/resource
# => Query: param=value&other=data
# => Fragment: section
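What "no backtracking" means for atomic groups and possessive quantifiers is easiest to see on a tiny input. In this sketch the greedy pattern succeeds by giving back one character, while the atomic and possessive forms refuse to give anything back and therefore fail.

```ruby
input = "aaab"

greedy     = /a+ab/      # a+ can give back one 'a' so that "ab" matches
atomic     = /(?>a+)ab/  # the group keeps every 'a'; "ab" can never match
possessive = /a++ab/     # same effect as the atomic group

greedy_result     = greedy.match?(input)
atomic_result     = atomic.match?(input)
possessive_result = possessive.match?(input)

puts greedy_result      # => true
puts atomic_result      # => false
puts possessive_result  # => false
```

The trade-off is that an atomic group can reject input a backtracking pattern would accept, so it suits cases where the consumed text can never be needed by what follows.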
The interpolate-once flag (o) optimizes interpolated patterns by performing interpolation and compilation only the first time the literal is evaluated. Ruby caches the resulting Regexp and reuses it for every later evaluation of that literal, even when the interpolated variables change.
# Once-compiled flag demonstration
def search_logs(pattern_string, logs)
  # Without 'o' flag - interpolates and compiles on every call
  normal_pattern = /#{pattern_string}/i
  # With 'o' flag - interpolates once; later calls ignore pattern_string
  cached_pattern = /#{pattern_string}/io
  logs.each do |log|
    # On the first call both patterns behave the same; on later calls
    # cached_pattern still holds the first call's pattern
    puts log if log.match?(cached_pattern)
  end
end
logs = [
"2024-01-15 INFO: User login successful",
"2024-01-15 ERROR: Database timeout",
"2024-01-15 INFO: Cache cleared",
"2024-01-15 WARN: High memory usage"
]
# Pattern is compiled once and reused
search_logs("ERROR|WARN", logs)
# => 2024-01-15 ERROR: Database timeout
# => 2024-01-15 WARN: High memory usage
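The caching behavior of the o flag is easy to verify directly. In this sketch, build_once is a hypothetical helper. The second call passes a different argument but still receives the Regexp built during the first call.

```ruby
# Hypothetical helper for illustration only
def build_once(word)
  /#{word}/o   # interpolated on the first evaluation only
end

first  = build_once("cat")
second = build_once("dog")   # "dog" is ignored; the cached Regexp is returned

puts first.source          # => cat
puts second.source         # => cat
puts first.equal?(second)  # => true
```

For this reason the o flag is best kept for values that never change during the life of the process.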
Common Pitfalls
Regex literal compilation timing creates unexpected behavior when patterns contain variables that change between evaluations. Ruby compiles non-interpolated literals at parse time and interpolated literals during execution, but with the o flag the first interpolated value is cached and every later value is ignored.
# Compilation timing gotcha
patterns = []
words = %w[cat dog bird]
# WRONG: every pattern ends up matching 'cat' (the first value)
words.each do |word|
  # The literal is interpolated only on its first evaluation;
  # later iterations reuse that cached Regexp
  patterns << /#{word}/o # 'o' flag causes issues here
end
p patterns.map { |re| re.source }
# => ["cat", "cat", "cat"]
# CORRECT: Compile without 'o' flag or use Regexp.new
corrected_patterns = words.map do |word|
  Regexp.new(word) # Explicit compilation
end
p corrected_patterns.map { |re| re.source }
# => ["cat", "dog", "bird"]
Match variables such as $~, $1, and $& look like globals but are local to the current method frame and thread, so a match inside a called method does not disturb the caller's values. The real hazard lies within a single method: every matching operation replaces them, and a failed match resets them to nil, so earlier results are lost unless they are saved first.
# Match variable overwriting
def extract_parts(email)
  email =~ /(\w+)@(\w+)\.(\w+)/
  # A second match in the same method replaces $~, $1, $2, ...
  email =~ /\.(\w+)\z/
  { username: $1, domain: $2 } # $1 is now the TLD, $2 is nil
end
parts = extract_parts("user@example.com")
puts "Username: #{parts[:username]}" # => Username: com (WRONG!)
puts "Domain: #{parts[:domain].inspect}" # => Domain: nil (WRONG!)
# Match variables are frame-local: a match inside a called method
# does not change the caller's $1
"abc" =~ /(b)/
extract_parts("user@example.com")
puts $1 # => b
# CORRECT: Use explicit match objects
def safe_extract_parts(email)
match = /(\w+)@(\w+)\.(\w+)/.match(email)
return nil unless match
{
username: match[1],
domain: match[2],
tld: match[3]
}
end
result = safe_extract_parts("user@example.com")
puts "Domain: #{result[:domain]}" # => Domain: example
puts "Username: #{result[:username]}" # => Username: user
Character class ranges are defined by code point and do not depend on locale settings. Range definitions like [a-z] cover only ASCII letters, so they silently skip accented and other non-ASCII characters in Unicode text.
# Encoding-dependent character classes
text = "café naïve résumé"
# ASCII-only pattern may miss accented characters
ascii_pattern = /[a-zA-Z]+/
unicode_pattern = /\p{L}+/
puts "ASCII matches: #{text.scan(ascii_pattern)}"
# => ASCII matches: ["caf", "na", "ve", "r", "sum"]
puts "Unicode matches: #{text.scan(unicode_pattern)}"
# => Unicode matches: ["café", "naïve", "résumé"]
# Incomplete: ASCII ranges on non-ASCII text
# Ruby does not consult the system locale; [a-z] is a code point range
german_text = "Größe"
ascii_range_pattern = /[a-z]/i
p german_text.scan(ascii_range_pattern)
# The result omits ö, which lies outside the a-z range
# BETTER: Use explicit Unicode properties
safe_pattern = /\p{L}/
p german_text.scan(safe_pattern)
# => ["G", "r", "ö", "ß", "e"]
Backtracking catastrophes occur when nested quantifiers create exponential evaluation paths. Certain pattern structures cause Ruby's regex engine to explore vast numbers of backtrack possibilities.
# Catastrophic backtracking example
# CAUTION - on Ruby versions without match memoization (before 3.2),
# matching test_input against this pattern can take exponential time
catastrophic_pattern = /^(a+)+b$/
test_input = "a" * 20 + "c" # No 'b' to match
# A backtracking engine tries every way to split the 'a's between
# the inner (a+) and the outer (+), on the order of 2^20 attempts
# SOLUTION: Use atomic groups or possessive quantifiers
safe_pattern = /^(?>a+)+b$/ # Atomic group
# OR
possessive_pattern = /^a++b$/ # Possessive quantifier
# Both patterns fail immediately when 'b' is not found
puts safe_pattern.match?(test_input) # => false (fast)
puts possessive_pattern.match?(test_input) # => false (fast)
# Real-world example: HTML tag matching
html = "<div><span>content</span></div><malformed"
# Forward slashes inside a /.../ literal must be escaped as \/
# BAD: nested quantifiers can cause issues
bad_html_pattern = /(<(\w+)>.*<\/\2>)+/
# BETTER: atomic groups prevent backtracking
good_html_pattern = /(?>(<(\w+)>.*?<\/\2>))+/
require 'benchmark'
Benchmark.bm do |x|
  x.report("atomic groups:") do
    1000.times { good_html_pattern.match(html) }
  end
end
Anchor behavior in Ruby differs from many other regex flavors, causing patterns to match in unexpected positions. The ^ and $ anchors always match at line boundaries, with or without the multiline flag; the m flag only makes . match newlines. Use \A and \z to anchor to the whole string.
# Line anchor confusion
sensitive_data = "public info\nsecret: password123\nmore public"
# ^ matches at the start of any line, even without the m flag
line_boundary_pattern = /^secret:/
puts line_boundary_pattern.match?(sensitive_data)
# => true (dangerous - secret found mid-string)
# The m flag does not change anchors; it only lets . match newlines
multiline_pattern = /^secret:/m
puts multiline_pattern.match?(sensitive_data)
# => true (same result as without m)
# SOLUTION: Use \A and \z for string boundaries
safe_pattern = /\Asecret:/
puts safe_pattern.match?(sensitive_data)
# => false (safe - only matches string start)
# Complete boundary reference
test_string = "first\nsecond\nthird"
patterns = {
  '^second' => /^second/, # Matches - line start
  '^second/m' => /^second/m, # Matches - m does not affect ^
  '\\Asecond' => /\Asecond/, # No match - string start
  'second$' => /second$/, # Matches - line end
  'second$/m' => /second$/m, # Matches - m does not affect $
  'second\\z' => /second\z/ # No match - string end
}
patterns.each do |name, pattern|
  puts "#{name}: #{pattern.match?(test_string)}"
end
# => ^second: true
# => ^second/m: true
# => \Asecond: false
# => second$: true
# => second$/m: true
# => second\z: false
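The practical consequence shows up in input validation. The following sketch uses a hypothetical input string to compare line anchors with string anchors, and also shows how \Z differs from \z when the input ends with a newline.

```ruby
# Hypothetical input: a valid-looking first line followed by extra content
input = "alice\n<script>alert(1)</script>"

line_anchored   = /^\w+$/
string_anchored = /\A\w+\z/

line_result   = line_anchored.match?(input)
string_result = string_anchored.match?(input)

puts line_result    # => true  (the first line alone satisfies the pattern)
puts string_result  # => false (the whole string must match)

# \Z tolerates one trailing newline, \z does not
trailing_upper_z = /\A\w+\Z/.match?("alice\n")
trailing_lower_z = /\A\w+\z/.match?("alice\n")

puts trailing_upper_z  # => true
puts trailing_lower_z  # => false
```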
Reference
Literal Syntax
Syntax | Description | Example |
---|---|---|
/pattern/ | Basic literal | /hello/ |
/pattern/i | Case-insensitive | /hello/i |
/pattern/m | Multiline mode | /a.b/m |
/pattern/x | Extended mode | /a b c/x |
/pattern/o | Interpolate once | /#{var}/o |
/#{expr}/ | Interpolation | /#{word}\b/ |
Flag Characters
Flag | Name | Effect | Regexp Constant |
---|---|---|---|
i | Case-insensitive | Ignores case in matching | Regexp::IGNORECASE |
m | Multiline | . also matches newline characters | Regexp::MULTILINE |
x | Extended | Ignores whitespace and comments | Regexp::EXTENDED |
o | Interpolate once | Caches interpolated patterns | Not applicable |
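The constants in the table are bit flags, so a literal's flags can be inspected with Regexp#options or supplied to Regexp.new. A brief sketch:

```ruby
literal = /hello/im
built   = Regexp.new("hello", Regexp::IGNORECASE | Regexp::MULTILINE)

expected_bits = Regexp::IGNORECASE | Regexp::MULTILINE
has_extended  = (literal.options & Regexp::EXTENDED) != 0

puts literal.options == expected_bits  # => true
puts built.options == literal.options  # => true
puts has_extended                      # => false
```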
Anchor Behavior
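Anchor | Default Mode | Multiline Mode (/m) |
---|---|---|
^ | String or line start | String or line start (unchanged) |
$ | String or line end | String or line end (unchanged) |
\A | String start | String start (unchanged) |
\z | String end | String end (unchanged) |
\Z | String end or before final newline | String end or before final newline (unchanged) |
In Ruby the m flag changes only whether . matches a newline; anchors behave identically in both modes.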
Global Match Variables
Variable | Description | Set By |
---|---|---|
$& | Entire match | =~ , match |
$1 , $2 , ... | Capture groups | =~ , match |
$+ | Last capture group | =~ , match |
$` | Pre-match string | =~ , match |
$' | Post-match string | =~ , match |
$~ | MatchData object | =~ , match |
Interpolation Compilation
Context | Compilation Timing | Regexp Object |
---|---|---|
/static/ | Parse time | Single cached object |
/#{var}/ | Runtime | New object each evaluation |
/#{var}/o | First runtime | Cached after first use |
Regexp.new(str) | Runtime | New object each call |
Performance Characteristics
Pattern Type | Backtracking Risk | Optimization |
---|---|---|
(a+)+ | High | Use (?>a+)+ |
.* | Medium | Use .*? when possible |
[a-zA-Z]+ | Low | Character classes are fast |
\w+ | Low | Built-in classes optimized |
\A...\z | Low | Anchors prevent unnecessary work |
Character Class Equivalents
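Short Form | Long Form | Unicode-Aware Counterpart |
---|---|---|
\d | [0-9] | \p{Digit} |
\w | [a-zA-Z0-9_] | \p{Word} |
\s | [ \t\r\n\f\v] | \p{Space} |
\D | [^0-9] | \P{Digit} |
\W | [^a-zA-Z0-9_] | \P{Word} |
\S | [^ \t\r\n\f\v] | \P{Space} |
The short forms match ASCII characters only, while the \p{...} properties also match non-ASCII characters, so the two columns are not interchangeable on Unicode text.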
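The short forms and the Unicode properties in this table differ on non-ASCII input. The sketch below uses escaped code points (an Arabic-Indic digit and an accented letter, both chosen only as examples) so that it does not depend on the source file's encoding.

```ruby
arabic_digit = "\u0663"  # ARABIC-INDIC DIGIT THREE
accented     = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE

short_digit = /\d/.match?(arabic_digit)
prop_digit  = /\p{Digit}/.match?(arabic_digit)
posix_digit = /[[:digit:]]/.match?(arabic_digit)
short_word  = /\w/.match?(accented)
prop_word   = /\p{Word}/.match?(accented)

puts short_digit  # => false
puts prop_digit   # => true
puts posix_digit  # => true
puts short_word   # => false
puts prop_word    # => true
```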
Common Quantifier Patterns
Pattern | Meaning | Backtracking |
---|---|---|
* | Zero or more (greedy) | Yes |
*? | Zero or more (lazy) | Yes |
*+ | Zero or more (possessive) | No |
+ | One or more (greedy) | Yes |
+? | One or more (lazy) | Yes |
++ | One or more (possessive) | No |
? | Zero or one (greedy) | Yes |
?? | Zero or one (lazy) | Yes |
?+ | Zero or one (possessive) | No |
Atomic Group Syntax
Syntax | Description | Backtracking |
---|---|---|
(pattern) | Capturing group | Yes |
(?:pattern) | Non-capturing group | Yes |
(?>pattern) | Atomic group | No |
(?<name>pattern) | Named capture | Yes |
\g<name> | Recursive reference | Context-dependent |