Overview
Ruby implements regular expression options through single-character flags that modify pattern matching behavior. These options control case sensitivity, multiline matching, extended syntax, Unicode handling, and encoding interpretation. Ruby supports both inline flag syntax within patterns and method-level option parameters.
The core options include case-insensitive matching (i), multiline mode (m), extended syntax (x), and encoding-specific options (n, e, s, u). Each option changes how the regular expression engine processes patterns and target strings.
# Literal suffix and constructor options
/pattern/i.match("PATTERN") # Case-insensitive
Regexp.new("pattern", Regexp::IGNORECASE)
# Inline options
/(?i)pattern/.match("PATTERN") # Inline case-insensitive
/(?i:pattern)/.match("PATTERN") # Group-specific option
Ruby's regex engine processes options before compilation, making them part of the compiled pattern rather than runtime modifiers. This affects caching, performance, and pattern reuse across different contexts.
The options system integrates with Ruby's encoding framework, allowing patterns to specify encoding assumptions and conversion behavior. Encoding options become critical when processing text with mixed encodings or when working with binary data patterns.
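Because options are fixed when the pattern is compiled, they can be read back from the compiled object. The following is a minimal sketch using Regexp#source, Regexp#options, and Regexp#==; the variable names are illustrative and not part of the original text.

```ruby
# Two patterns with identical source text but different options
plain    = /ruby/
caseless = /ruby/i

# Regexp#source returns the pattern text without the options
same_source = plain.source == caseless.source

# Regexp#options returns the option bits as an Integer
has_ignorecase = (caseless.options & Regexp::IGNORECASE) != 0

# Regexp#== also compares options, so these two patterns are not equal
equal = plain == caseless

puts "same source: #{same_source}"
puts "IGNORECASE set: #{has_ignorecase}"
puts "equal: #{equal}"
```

Since the flags belong to the object, a pattern compiled once can be passed around and reused without re-specifying its options.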
Basic Usage
Ruby provides multiple syntaxes for applying regex options. The suffix notation attaches flags directly to regex literals, while the constructor method accepts option constants or integer flags.
# Suffix notation with literals
text = "Hello WORLD"
/hello/i.match(text) # Case-insensitive match
# => #<MatchData "Hello">
/hello/im.match("hello\nworld") # Case-insensitive + multiline
# => #<MatchData "hello">
# Constructor with option constants
regex = Regexp.new("hello", Regexp::IGNORECASE | Regexp::MULTILINE)
regex.match("Hello\nWorld")
# => #<MatchData "Hello">
The multiline option (m) changes dot (.) behavior to match newline characters, while case-insensitive (i) performs Unicode-aware case folding by default.
# Multiline option affects dot metacharacter
text = "first line\nsecond line"
/first.*second/.match(text) # nil (dot doesn't match newline)
/first.*second/m.match(text) # Matches across lines
# => #<MatchData "first line\nsecond line">
# Case-insensitive with Unicode
/café/i.match("CAFÉ") # Unicode case folding
# => #<MatchData "CAFÉ">
Extended syntax (x) enables whitespace and comments within patterns, improving readability for complex expressions.
# Extended syntax for readable patterns
phone_pattern = /
\(? # Optional opening parenthesis
(\d{3}) # Area code
\)? # Optional closing parenthesis
[-.\s]? # Optional separator
(\d{3}) # Exchange
[-.\s]? # Optional separator
(\d{4}) # Number
/x
phone_pattern.match("(555) 123-4567")
# => #<MatchData "(555) 123-4567" 1:"555" 2:"123" 3:"4567">
Inline options provide pattern-specific control without modifying the entire expression. Group-specific options affect only enclosed portions.
# Inline options within patterns
/(?i)hello world/.match("HELLO world") # Entire pattern case-insensitive
/hello (?i:world)/.match("hello WORLD") # Only "world" case-insensitive
/(?i)hello (?-i:world)/.match("HELLO world") # Disable case-insensitive for "world"
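A bare inline option such as (?i) takes effect from the point where it appears and, when written inside a group, stops at the end of that group. The sketch below illustrates this scoping with a made-up pattern; the variable names are assumptions for illustration.

```ruby
# (?i) placed inside a group applies only until that group closes
scoped = /(a(?i)b)c/

inside_group  = scoped.match?("aBc") # "b" is matched case-insensitively
outside_group = scoped.match?("aBC") # "c" is outside the group, so it stays case-sensitive
before_option = scoped.match?("ABc") # "a" comes before (?i), so it stays case-sensitive

puts [inside_group, outside_group, before_option].inspect
```

Placing (?i) at the very start of a pattern therefore behaves like a pattern-wide flag only because nothing precedes it and no group closes its scope.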
Advanced Usage
Complex patterns benefit from option combinations and strategic inline modifications. Multiple options combine through bitwise operations or concatenated suffix notation.
# Multiple options with different syntaxes
complex_text = <<~TEXT
Name: John Doe
Email: john@example.com
Name: Jane Smith
Email: jane@example.org
TEXT
# Combine case-insensitive, multiline, and extended
contact_pattern = /
name:\s* # Name field
(.+?) # Capture name (non-greedy)
\s+ # Whitespace
email:\s* # Email field
(\S+@\S+) # Email pattern
/imx
matches = complex_text.scan(contact_pattern)
# => [["John Doe", "john@example.com"], ["Jane Smith", "jane@example.org"]]
Encoding options control string interpretation and pattern matching behavior across different character sets. These options become essential when processing mixed-encoding content.
# Encoding-specific options
binary_data = "\x80\x81\x82hello\x83\x84"
# ASCII-8BIT pattern matching (force_encoding retags binary_data in place)
/hello/n.match(binary_data.force_encoding("ASCII-8BIT"))
# => #<MatchData "hello">
# A UTF-8 pattern cannot be matched against non-ASCII binary data
begin
  /hello/u.match(binary_data)
rescue Encoding::CompatibilityError => e
  puts e.message # reports an incompatible encoding regexp match (UTF-8 regexp, binary string)
end
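The encoding a flag assigns to a pattern can be checked directly. This is a small sketch using Regexp#encoding and Regexp#fixed_encoding?; the variable names are illustrative.

```ruby
# Compare how encoding flags are recorded on the compiled pattern
default_pattern = /hello/  # ASCII-only source, no encoding flag
utf8_pattern    = /hello/u # u pins the pattern to UTF-8

[default_pattern, utf8_pattern].each do |pattern|
  puts "#{pattern.inspect}: encoding=#{pattern.encoding}, fixed=#{pattern.fixed_encoding?}"
end

# A fixed-encoding pattern can still match ASCII-only text in another compatible encoding
ascii_only  = "say hello".encode("US-ASCII")
ascii_match = utf8_pattern.match(ascii_only)
```

A fixed-encoding pattern only becomes a problem when the target string contains non-ASCII bytes in a different encoding, as in the binary example above.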
Conditional option application allows dynamic pattern construction based on runtime conditions. This pattern proves useful for user-configurable search functionality.
class SearchPattern
  def initialize(term, case_sensitive: false, whole_words: false)
    options = 0
    options |= Regexp::IGNORECASE unless case_sensitive
    pattern = whole_words ? "\\b#{Regexp.escape(term)}\\b" : Regexp.escape(term)
    @regex = Regexp.new(pattern, options)
  end
  def match(text)
    @regex.match(text)
  end
end
# Configure search behavior
search = SearchPattern.new("Test", case_sensitive: false, whole_words: true)
search.match("These are tests") # nil ("tests" is not the whole word "test")
search.match("This is a test case") # Matches "test" (case-insensitive whole word)
Nested option groups provide fine-grained control over pattern sections, allowing different matching rules within the same expression.
# Complex nested option handling
log_pattern = /
(?<timestamp>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}) # Timestamp
\s+
\[(?<level>(?i:error|warn|info|debug))\] # Case-insensitive level
\s+
(?<message>.*) # Message content
/x
log_entry = "2024-01-15 10:30:45 [ERROR] Database connection failed"
match = log_pattern.match(log_entry)
# => #<MatchData "2024-01-15 10:30:45 [ERROR] Database connection failed"
# timestamp:"2024-01-15 10:30:45" level:"ERROR" message:"Database connection failed">
Performance & Memory
Regular expression options can affect matching performance and memory usage. Case-insensitive matching has to fold case while comparing, which typically adds processing overhead; the size of the effect depends on the pattern, the input, and the Ruby version.
require 'benchmark'
# The only "fox" sits at the end, so both patterns scan the same amount of text
text = ("The Quick Brown Dog Jumps Over The Lazy Cat " * 1000) + "fox"
simple_pattern = /fox/
case_insensitive = /fox/i
Benchmark.bm(20) do |x|
  x.report("case-sensitive:") { 10000.times { simple_pattern.match(text) } }
  x.report("case-insensitive:") { 10000.times { case_insensitive.match(text) } }
end
# Case-insensitive matching is usually the slower of the two; the exact gap
# varies by Ruby version, pattern, and input, so measure on your own data
Extended syntax (x) adds parsing overhead during regex compilation but doesn't affect runtime matching performance. The trade-off benefits maintainability for complex patterns.
# Extended syntax compilation overhead
compact_source = '(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})'
extended_source = <<~'PATTERN'
  (\d{3}) # Area code
  [-.\s]? # Separator
  (\d{3}) # Exchange
  [-.\s]? # Separator
  (\d{4}) # Number
PATTERN
# Compilation time may differ; both sources compile to equivalent matchers
Benchmark.bm(15) do |x|
  x.report("compact regex:") { 10000.times { Regexp.new(compact_source) } }
  x.report("extended regex:") { 10000.times { Regexp.new(extended_source, Regexp::EXTENDED) } }
end
Multiline option performance depends on string content and pattern complexity. With multiline enabled, a greedy .* can run to the end of the input and then backtrack, rather than stopping at each line boundary.
# Multiline performance characteristics
large_text = "line1\n" * 10000 + "target" + "\nline2\n" * 10000
# Without multiline - .* stops at each newline, so this pattern never matches here
fast_pattern = /line1.*target/
# With multiline - .* runs to the end of the text, then backtracks to "target"
slow_pattern = /line1.*target/m
Benchmark.bm(20) do |x|
  x.report("without multiline:") { 1000.times { fast_pattern.match(large_text) } }
  x.report("with multiline:") { 1000.times { slow_pattern.match(large_text) } }
end
# The multiline version can be slower because .* consumes the rest of the document
# before backtracking; note the two patterns also return different results (nil vs. a match)
Encoding options can affect performance because the engine handles characters according to the pattern's encoding. The n option treats the pattern as binary (no encoding), which may avoid multibyte handling; any gain should be measured rather than assumed.
# Encoding option performance comparison
ascii_text = "simple ascii text for pattern matching" * 1000
unicode_pattern = /text/u
ascii_pattern = /text/n
Benchmark.bm(20) do |x|
  x.report("unicode handling:") { 10000.times { unicode_pattern.match(ascii_text) } }
  x.report("ascii-only:") { 10000.times { ascii_pattern.match(ascii_text) } }
end
# Any difference is likely small for ASCII-only input; confirm with your own measurements
Common Pitfalls
Case-insensitive matching doesn't work identically across all Unicode characters. Some characters have multiple case representations or context-dependent case folding rules.
# Unicode case folding edge cases
german_text = "Straße" # Contains ß (sharp s)
# ß traditionally has no single-character uppercase and is written "SS" in capitals
# Whether ß and "ss" match each other case-insensitively depends on the engine's
# multi-character case folding, so check the result on your Ruby version
/STRASSE/i.match(german_text)
/STRAßE/i.match("STRASSE")
/straße/i.match("STRASSE")
Multiline option confusion commonly occurs between m (multiline) and the ^/$ anchor behavior. The m option affects dot matching, not anchor behavior: in Ruby, ^ and $ always match at line boundaries, while \A and \z anchor to the start and end of the whole string.
text = "first line\nsecond line\nthird line"
# Common misconception: m is needed for ^ and $ to match at line boundaries
/^second/.match(text) # Matches even without m
# => #<MatchData "second">
/^second/m.match(text) # Same result; m does not change anchors
# => #<MatchData "second">
# Multiline affects dot metacharacter
/first.*third/.match(text) # nil - dot doesn't cross lines
/first.*third/m.match(text) # Match - dot crosses lines
# => #<MatchData "first line\nsecond line\nthird">
# To anchor to the whole string, use \A and \z
/\Asecond/.match(text) # nil - "second" is not at the start of the string
/\Afirst/.match(text)
# => #<MatchData "first">
Extended syntax whitespace handling trips up developers when patterns contain meaningful whitespace. Unescaped whitespace outside character classes is ignored in extended mode.
# Whitespace handling in extended mode
pattern_with_spaces = /hello world/x # Matches "helloworld"
pattern_with_spaces.match("hello world") # nil
# Escape whitespace or use character classes
correct_pattern = /hello\ world/x # Escaped space
alt_pattern = /hello[ ]world/x # Space in character class
both_match = /hello\s+world/x # Whitespace class
correct_pattern.match("hello world")
# => #<MatchData "hello world">
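The same caution applies to the # character, since extended mode treats an unescaped # as the start of a comment. The sketch below is an illustrative example with made-up patterns, not taken from the original text.

```ruby
# In extended mode an unescaped # starts a comment that runs to the end of the line
commented = /item # \d+/x  # everything after "item" is a comment
escaped   = /item \# \d+/x # \# matches a literal "#", followed by digits

commented_match = commented.match("item")
escaped_match   = escaped.match("item#42")
spaced_match    = escaped.match("item #42") # the space in the pattern was ignored

puts commented_match.inspect
puts escaped_match.inspect
puts spaced_match.inspect
```

Escaping both whitespace and # wherever they are meant literally keeps an extended pattern from silently matching something different.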
Inline option scope creates confusion when developers expect options to apply globally. Group-specific options only affect their immediate scope.
# Scope confusion with inline options
mixed_case = "Hello WORLD"
# Wrong expectation - developers expect global case-insensitive
/(?i:hello) world/.match(mixed_case) # nil - "world" is case-sensitive
# Correct approaches
/(?i:hello) (?i:world)/.match(mixed_case) # Explicit for both groups
/(?i)hello world/.match(mixed_case) # Global option
# => #<MatchData "Hello WORLD">
Encoding option interactions with string encodings cause compatibility errors when patterns and strings have incompatible encoding assumptions.
# Encoding compatibility problems
utf8_string = "café".encode("UTF-8")
ascii_pattern = /caf/n # Pattern with no encoding (binary)
# This matches, though Ruby may warn about a binary regexp matched against a UTF-8 string
ascii_pattern.match(utf8_string)
# => #<MatchData "caf">
# But this fails
binary_string = "\xC3\xA9".force_encoding("ASCII-8BIT") # UTF-8 bytes as binary
utf8_pattern = /é/u
begin
  utf8_pattern.match(binary_string)
rescue Encoding::CompatibilityError
  # incompatible encoding regexp match (UTF-8 regexp with a binary string)
end
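One way to avoid this error is to re-tag the bytes with the pattern's encoding before matching, after checking that they are valid in that encoding. The helper below, match_utf8_bytes, is a hypothetical name introduced for this sketch; it is one possible approach, not a standard API.

```ruby
# Hypothetical helper: re-tag raw bytes as UTF-8 before matching a UTF-8 pattern
def match_utf8_bytes(pattern, bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8) # dup so the caller's string is untouched
  return nil unless text.valid_encoding?           # refuse bytes that are not valid UTF-8
  pattern.match(text)
end

utf8_pattern = /caf\u00E9/             # the \u escape makes this a UTF-8 pattern
raw_bytes    = "caf\xC3\xA9 au lait".b # String#b returns an ASCII-8BIT copy
invalid      = "caf\xFF".b             # not a valid UTF-8 byte sequence

result  = match_utf8_bytes(utf8_pattern, raw_bytes)
skipped = match_utf8_bytes(utf8_pattern, invalid)

puts result.inspect
puts skipped.inspect
```

Checking valid_encoding? first matters because force_encoding only changes the label on the bytes; it does not convert or validate them.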
Option precedence confusion arises when combining method options with inline options. Inline options override method-level options within their scope.
# Option precedence rules
base_pattern = "(?-i)HELLO (?i:world)" # Mixed case sensitivity
# Method option gets overridden by inline options
regex = Regexp.new(base_pattern, Regexp::IGNORECASE)
# "HELLO" is forced case-sensitive by (?-i)
# "world" is forced case-insensitive by (?i:)
regex.match("hello WORLD") # nil - "hello" doesn't match "HELLO"
regex.match("HELLO world") # Match
# => #<MatchData "HELLO world">
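The same scoping explains what happens when one compiled pattern is interpolated into another: Regexp#to_s renders the options as an inline group, so the embedded part keeps its own flags. A minimal sketch with illustrative names:

```ruby
inner = /hello/i

# Regexp#to_s renders the options as an inline option group
embedded = inner.to_s

# Interpolating a Regexp uses to_s, so the inner options travel with it
outer = /#{inner} world/

keeps_inner_option = outer.match?("HELLO world") # "hello" part is still case-insensitive
outer_still_strict = outer.match?("HELLO WORLD") # " world" remains case-sensitive

puts embedded
puts keeps_inner_option
puts outer_still_strict
```

Interpolating inner.source instead would drop the flags and leave the outer pattern's options in charge.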
Reference
Option Flags
Flag | Constant | Description |
---|---|---|
i | Regexp::IGNORECASE | Case-insensitive matching with Unicode folding |
m | Regexp::MULTILINE | Dot matches newline characters |
x | Regexp::EXTENDED | Ignore whitespace and allow comments |
n | Regexp::NOENCODING | Treat the pattern as binary (no encoding) |
e | None (sets Regexp::FIXEDENCODING) | EUC-JP encoding assumption |
s | None (sets Regexp::FIXEDENCODING) | Windows-31J (Shift_JIS variant) encoding assumption |
u | None (sets Regexp::FIXEDENCODING) | UTF-8 encoding assumption |
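The constants are bit flags, so the Integer returned by Regexp#options can be decoded back into names. The helper option_names and the OPTION_NAMES table below are hypothetical names introduced for this sketch.

```ruby
# Hypothetical lookup table from option bit to constant name
OPTION_NAMES = {
  Regexp::IGNORECASE    => "IGNORECASE",
  Regexp::EXTENDED      => "EXTENDED",
  Regexp::MULTILINE     => "MULTILINE",
  Regexp::FIXEDENCODING => "FIXEDENCODING",
  Regexp::NOENCODING    => "NOENCODING"
}.freeze

# Hypothetical helper: list the option constants set on a compiled pattern
def option_names(regexp)
  OPTION_NAMES.select { |bit, _name| (regexp.options & bit) != 0 }.values
end

puts option_names(/a/imx).inspect
puts option_names(/a/u).inspect
puts option_names(/a/).inspect
```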
Constructor Methods
Method | Parameters | Returns | Description |
---|---|---|---|
Regexp.new(pattern, options) | pattern (String), options (Integer) | Regexp | Create regex with option flags |
Regexp.compile(pattern, options) | pattern (String), options (Integer) | Regexp | Alias for Regexp.new |
Regexp.new(pattern, options, encoding) | pattern (String), options (Integer), encoding (String) | Regexp | Legacy form; the third argument accepted only "n" or "N" (no encoding) and is deprecated or removed in recent Ruby versions |
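As a quick check of the constructor forms above, the sketch below builds the same pattern from a string, through the Regexp.compile alias, and by copying an existing Regexp. The variable names are illustrative.

```ruby
flags = Regexp::IGNORECASE | Regexp::EXTENDED

from_string  = Regexp.new("hello", flags)
from_compile = Regexp.compile("hello", flags)
from_regexp  = Regexp.new(/hello/ix) # copying a Regexp keeps its options

literal = /hello/ix

puts from_string == literal
puts from_compile == literal
puts from_regexp.options == literal.options
```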
Inline Option Syntax
Syntax | Scope | Description |
---|---|---|
(?i) | Rest of pattern (or enclosing group) | Enable case-insensitive from this point on |
(?-i) | Rest of pattern (or enclosing group) | Disable case-insensitive from this point on |
(?i:...) | Group | Enable case-insensitive for group only |
(?-i:...) | Group | Disable case-insensitive for group only |
(?imx) | Rest of pattern (or enclosing group) | Multiple options combined |
(?imx:...) | Group | Multiple options for group |
Option Combinations
Combination | Effect |
---|---|
/pattern/i | Case-insensitive |
/pattern/im | Case-insensitive + multiline |
/pattern/imx | Case-insensitive + multiline + extended |
Regexp::IGNORECASE \| Regexp::MULTILINE | Bitwise OR for multiple flags |
Encoding Options
Option | Encoding | Use Case |
---|---|---|
n | None | Binary data, ASCII-only patterns |
e | EUC-JP | Japanese EUC text processing |
s | Shift-JIS (Windows-31J) | Japanese Shift-JIS text processing |
u | UTF-8 | Unicode text processing (default) |
Performance Characteristics
Option | Compilation | Runtime | Memory |
---|---|---|---|
None | Fastest | Baseline | Minimal |
i | Moderate | Typically slower (varies by input) | Higher (Unicode tables) |
m | Fast | Variable (depends on dot usage) | Minimal |
x | Slower | Same as baseline | Minimal |
imx | Slowest | Combined overhead | Higher |
Error Conditions
Error Type | Cause | Solution |
---|---|---|
Encoding::CompatibilityError | Pattern/string encoding mismatch | Use compatible encodings or force encoding |
RegexpError | Invalid inline option syntax | Check option syntax and position |
ArgumentError | Invalid option constant | Use valid Regexp constants |
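The first two error conditions can be reproduced in a few lines. This is a minimal sketch with made-up inputs; the errors hash is only there to record which exception class each call raised.

```ruby
# Collect the error class raised by each problematic call
errors = {}

begin
  Regexp.new("(?z)abc") # "z" is not a valid inline option
rescue RegexpError => e
  errors[:syntax] = e.class
end

begin
  /caf\u00E9/.match("caf\xC3\xA9".b) # UTF-8 pattern against non-ASCII binary bytes
rescue Encoding::CompatibilityError => e
  errors[:encoding] = e.class
end

puts errors.inspect
```

Note that an invalid inline option in a regex literal is reported when the file is parsed, so rescuing RegexpError applies to patterns built at runtime with Regexp.new.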