Overview
MatchData represents the result of a regular expression match operation in Ruby. When a regex successfully matches against a string, Ruby creates a MatchData object containing the matched text, captured groups, and positional information. This object serves as the primary interface for extracting data from pattern matching operations.
Ruby creates MatchData objects through several mechanisms. String#match returns a MatchData object when a pattern matches (and nil otherwise), while Regexp#match provides the same functionality from the pattern's side. The =~ operator performs matching but returns the match position rather than the MatchData object itself. The block forms of String#scan and String#gsub do not receive MatchData objects as arguments: scan yields the captured groups (or the matched string when the pattern has no groups), and gsub yields the matched string. Inside those blocks, the current MatchData is available through $~ or Regexp.last_match.
# Basic MatchData creation
text = "Phone: 555-1234"
match_data = text.match(/(\d{3})-(\d{4})/)
# => #<MatchData "555-1234" 1:"555" 2:"1234">
# Named captures create more structured data
pattern = /(?<area>\d{3})-(?<number>\d{4})/
match_data = text.match(pattern)
# => #<MatchData "555-1234" area:"555" number:"1234">
# Multiple matches through scanning
email = "Contact: john@example.com and mary@test.org"
email.scan(/@(\w+)\.(\w+)/) do |domain, tld|
  # Each iteration receives the captured groups as block arguments
end
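As a minimal sketch of how the current match can be reached from a block-based method, the example below calls Regexp.last_match inside a gsub block and also shows the position returned by =~. The helper name mask_phone_numbers and the sample strings are illustrative assumptions, not part of any library.

```ruby
# Hypothetical helper: masks the last four digits of each phone number.
def mask_phone_numbers(text)
  text.gsub(/(\d{3})-(\d{4})/) do |_matched|
    # The block argument is the matched String, not a MatchData.
    # Regexp.last_match returns the MatchData for the current match.
    m = Regexp.last_match
    "#{m[1]}-****"
  end
end

puts mask_phone_numbers("Call 555-1234 or 555-9876")
# => Call 555-**** or 555-****

# =~ returns the index where the match starts, or nil.
position = "Phone: 555-1234" =~ /\d{3}-\d{4}/
puts position
# => 7
```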
The MatchData object contains the complete match as the zeroth element, followed by each captured group in sequence. Named captures provide additional hash-like access using the capture name as a key. Ruby maintains both numeric and named access methods simultaneously when named captures exist.
MatchData objects integrate with Ruby's special global variables through automatic assignment. The $~ variable contains the most recent MatchData object, while the numbered variables $1, $2, etc., contain individual capture groups. These variables are local to the current thread and method scope, and they update after each match operation in that scope, creating implicit state that affects subsequent code.
The object provides extensive introspection capabilities beyond basic group access. Methods exist for retrieving match positions, calculating offsets, accessing the original string, and examining the pattern used for matching. This comprehensive interface supports complex text processing operations while maintaining Ruby's emphasis on readable code.
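The introspection methods can be sketched as follows. The helper describe_match and the sample order text are hypothetical; pre_match, post_match, regexp, and string are standard MatchData methods.

```ruby
# Hypothetical helper that gathers the introspection data for one match.
def describe_match(match_data)
  {
    matched: match_data[0],          # the complete match
    before: match_data.pre_match,    # text before the match
    after: match_data.post_match,    # text after the match
    pattern: match_data.regexp,      # the Regexp that produced the match
    source: match_data.string        # the string that was searched
  }
end

info = describe_match("Order #4521 shipped on Monday".match(/#(\d+)/))
info.each { |key, value| puts "#{key}: #{value.inspect}" }
```

Here before is "Order " and after is " shipped on Monday", so the three pieces concatenate back to the original string.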
Basic Usage
MatchData objects primarily serve as containers for extracted information from successful pattern matches. The most common operations involve accessing the complete match and individual captured groups through numeric indexing.
The [] method provides array-like access to match components. Index 0 returns the entire matched substring, while positive integers return captured groups in order. Negative indices work backward from the last capture group, following Ruby's standard array indexing conventions.
text = "Date: 2024-03-15"
match_data = text.match(/(\d{4})-(\d{2})-(\d{2})/)
# Complete match and individual captures
match_data[0] # => "2024-03-15"
match_data[1] # => "2024"
match_data[2] # => "03"
match_data[3] # => "15"
# Negative indexing
match_data[-1] # => "15"
match_data[-2] # => "03"
# Range access
match_data[1..2] # => ["2024", "03"]
match_data[1, 2] # => ["2024", "03"]
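A common follow-up to numeric indexing is to destructure the captures into named local variables. The sketch below assumes a hypothetical helper, parse_date_parts, and also shows values_at for picking non-adjacent groups.

```ruby
# Hypothetical helper: converts an ISO-style date string to integers.
def parse_date_parts(text)
  match_data = text.match(/(\d{4})-(\d{2})-(\d{2})/)
  return nil unless match_data

  # captures omits index 0, so the three groups line up with three variables.
  year, month, day = match_data.captures.map(&:to_i)
  { year: year, month: month, day: day }
end

parts = parse_date_parts("Date: 2024-03-15")
puts parts[:year]   # => 2024
puts parts[:month]  # => 3

# values_at picks several groups by index in a single call.
m = "Date: 2024-03-15".match(/(\d{4})-(\d{2})-(\d{2})/)
puts m.values_at(1, 3).join("/")  # => 2024/15
```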
Named captures extend the basic indexing model by allowing string or symbol-based access. When a pattern contains named groups, the resulting MatchData object supports both numeric and name-based retrieval of the same captured content.
log_line = "2024-03-15 14:30:22 ERROR Invalid user input"
pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) (?<level>\w+) (?<message>.*)/
match_data = log_line.match(pattern)
# Named access with strings and symbols
match_data['date'] # => "2024-03-15"
match_data[:time] # => "14:30:22"
match_data['level'] # => "ERROR"
match_data[:message] # => "Invalid user input"
# Numeric access still works
match_data[1] # => "2024-03-15"
match_data[2] # => "14:30:22"
The captures method returns all captured groups as an array, excluding the complete match. This proves useful when processing multiple captures uniformly or when the number of captures varies between different pattern matches.
# Processing variable capture counts
patterns = [
  /(\w+)/, # Single capture
  /(\w+)\s+(\w+)/, # Two captures
  /(\w+)\s+(\w+)\s+(\w+)/ # Three captures
]
text = "Ruby programming language"
patterns.each do |pattern|
  match_data = text.match(pattern)
  if match_data
    puts "Captures: #{match_data.captures.inspect}"
  end
end
# => Captures: ["Ruby"]
# => Captures: ["Ruby", "programming"]
# => Captures: ["Ruby", "programming", "language"]
Position information retrieval uses the begin, end, and offset methods. These methods accept a group number or name and return character positions within the original string. The offset method returns both start and end positions as a two-element array.
text = "Error on line 42 at position 15"
match_data = text.match(/line (\d+) at position (\d+)/)
# Position of complete match
match_data.begin(0) # => 9
match_data.end(0) # => 31
# Positions of captured groups
match_data.begin(1) # => 14 (start of "42")
match_data.end(1) # => 16 (one past the end of "42")
# Combined position data
match_data.offset(1) # => [14, 16]
match_data.offset(2) # => [29, 31]
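Offsets are most useful for slicing the original string around a capture. The following sketch assumes a hypothetical helper, bracket_group, that wraps one capture group in square brackets using the positions from offset.

```ruby
# Hypothetical helper: wraps one capture group in square brackets.
def bracket_group(text, pattern, group = 1)
  match_data = text.match(pattern)
  return text unless match_data && match_data[group]

  start_pos, end_pos = match_data.offset(group)
  # end_pos is one past the last matched character, so exclusive ranges fit.
  text[0...start_pos] + "[" + text[start_pos...end_pos] + "]" + text[end_pos..-1]
end

pattern = /line (\d+) at position (\d+)/
puts bracket_group("Error on line 42 at position 15", pattern)
# => Error on line [42] at position 15
puts bracket_group("Error on line 42 at position 15", pattern, 2)
# => Error on line 42 at position [15]
```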
The names method returns an array of all named capture group names as strings. The named_captures method returns a hash mapping capture names to their matched values, providing a convenient way to work with structured data extraction.
url_pattern = /^(?<protocol>https?):\/\/(?<host>[^\/]+)(?<path>\/.*)?$/
url = "https://api.example.com/v1/users"
match_data = url.match(url_pattern)
# Available named captures
match_data.names # => ["protocol", "host", "path"]
# All named captures as hash
match_data.named_captures
# => {"protocol"=>"https", "host"=>"api.example.com", "path"=>"/v1/users"}
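Because named_captures returns String keys, code that prefers Symbol keys usually converts them once at the boundary. The sketch below assumes a hypothetical parse_url helper and uses Hash#transform_keys for the conversion.

```ruby
# Hypothetical helper: returns the named captures with Symbol keys.
def parse_url(url)
  match_data = url.match(%r{\A(?<protocol>https?)://(?<host>[^/]+)(?<path>/.*)?\z})
  return nil unless match_data

  # Groups that did not participate in the match (here: path) map to nil.
  match_data.named_captures.transform_keys(&:to_sym)
end

parsed = parse_url("https://api.example.com/v1/users")
puts parsed[:host]  # => api.example.com
puts parsed[:path]  # => /v1/users
```

The %r{...} literal is used only so the slashes in the URL do not need escaping.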
Error Handling & Debugging
MatchData operations can fail in several ways that require careful handling. The most common issue occurs when attempting to access captured groups that do not exist, either due to optional captures that failed to match or incorrect indexing assumptions.
Accessing non-existent numeric indices returns nil rather than raising an exception. This behavior can mask logical errors when code assumes certain captures will always be present. Defensive programming requires explicit nil checks or using methods that validate capture existence.
text = "Partial match"
# Pattern with optional second and third capture groups
match_data = text.match(/(\w+)( \w+)?( \w+)?/)
match_data[1] # => "Partial"
match_data[2] # => " match"
match_data[3] # => nil (third group didn't participate in the match)
# Unsafe: assumes capture exists
def extract_third_word(match_data)
  match_data[3].strip # NoMethodError if capture is nil
end
# Safe: validates capture existence
def extract_third_word_safe(match_data)
  third = match_data[3]
  third ? third.strip : nil
end
# Alternative: provide default value
def extract_word_with_default(match_data, index, default = "")
  capture = match_data[index]
  capture ? capture.strip : default
end
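The same nil checks can be written more compactly with the safe navigation operator. This is a minimal sketch with a hypothetical helper name, third_word, covering both a failed match and an unmatched optional group.

```ruby
# Hypothetical helper: returns the third word, or nil when it is absent.
def third_word(text)
  match_data = text.match(/(\w+)( \w+)?( \w+)?/)
  return nil unless match_data # the whole match failed

  # &. skips the call to strip when the capture is nil.
  match_data[3]&.strip
end

puts third_word("one two three").inspect  # => "three"
puts third_word("one two").inspect        # => nil
puts third_word("!!!").inspect            # => nil
```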
Named capture access fails differently from numeric access. Hash-style access with a string or symbol that is not a group name in the pattern raises an IndexError instead of returning nil, and method-style access raises a NoMethodError because MatchData does not define a method per capture name. A named group that exists but did not participate in the match returns nil. Code that looks up names dynamically should check MatchData#names first or rescue IndexError.
pattern = /(?<year>\d{4})-(?<month>\d{2})/
text = "2024-03"
match_data = text.match(pattern)
# Hash-style access raises IndexError for names the pattern does not define
# match_data['day'] # => IndexError: undefined group name reference: day
# match_data[:day] # => IndexError: undefined group name reference: day
# Method-style access raises NoMethodError
# match_data.day # => NoMethodError: undefined method `day'
# Safe named access with validation
def safe_named_capture(match_data, name)
  if match_data.names.include?(name.to_s)
    match_data[name]
  else
    nil
  end
end
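An alternative to checking names up front is to rescue the IndexError that MatchData#[] raises for an unknown group name. The helper below, fetch_capture, is a hypothetical sketch of that approach with an optional default value.

```ruby
# Hypothetical helper: returns a default instead of raising for unknown names.
def fetch_capture(match_data, name, default = nil)
  # A known group that did not match yields nil, so fall back to the default.
  match_data[name] || default
rescue IndexError
  # Raised when the pattern defines no group with this name.
  default
end

m = "2024-03".match(/(?<year>\d{4})-(?<month>\d{2})/)
puts fetch_capture(m, :year)            # => 2024
puts fetch_capture(m, :day, "unknown")  # => unknown
```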
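Pattern compilation errors surface at runtime when a Regexp is built from a string (through Regexp.new, or when a String is passed to String#match), whereas an invalid regex literal is rejected when the source file is parsed. This delayed error detection can cause runtime failures in production code that processes user input or dynamic patterns.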
# Invalid patterns raise RegexpError when Regexp.new compiles them
begin
  text = "test string"
  invalid_pattern = "([unclosed_group" # Unterminated character class and group
  match_data = text.match(Regexp.new(invalid_pattern))
rescue RegexpError => e
  puts "Pattern error: #{e.message}"
  # Handle invalid pattern gracefully
end
# Validate patterns before use in critical code
def safe_match(text, pattern_string)
  pattern = Regexp.new(pattern_string)
  text.match(pattern)
rescue RegexpError => e
  # Use the application's logger here (for example Rails.logger.error in Rails)
  warn "Invalid regex pattern: #{pattern_string} - #{e.message}"
  nil
end
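When user input is meant to be matched literally rather than interpreted as a pattern, escaping it avoids the compilation error altogether. The sketch below assumes a hypothetical helper, literal_pattern, built on Regexp.escape.

```ruby
# Hypothetical helper: builds a pattern that matches user text literally.
def literal_pattern(user_text)
  # Regexp.escape neutralizes metacharacters such as ( and [.
  Regexp.new(Regexp.escape(user_text))
end

pattern = literal_pattern("([unclosed_group")
m = "prefix ([unclosed_group suffix".match(pattern)
puts m[0]        # => ([unclosed_group
puts m.begin(0)  # => 7
```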
Global variable side effects create debugging challenges when multiple match operations occur in sequence. The $~, $1, $2, etc. variables are local to the current thread and method scope, so other threads and called methods do not clobber them, but every match within the same scope replaces them, potentially overwriting previous results unexpectedly.
# Global variables create hidden dependencies
text1 = "First: 123"
text2 = "Second: 456"
match1 = text1.match(/(\d+)/)
puts $1 # => "123"
match2 = text2.match(/(\d+)/) # This overwrites $1
puts $1 # => "456" (not "123"!)
# Safer approach: avoid global variables
def process_matches(texts)
  results = []
  texts.each do |text|
    match_data = text.match(/(\d+)/)
    if match_data
      results << match_data[1] # Use MatchData directly
    end
  end
  results
end
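The scoping of the match variables can be checked with a small experiment. In this sketch, the method names contains_digits? and demo_scope are hypothetical; it shows that a match inside a called method leaves the caller's $1 alone, while a second match in the same method replaces it.

```ruby
# Hypothetical helper that performs its own match internally.
def contains_digits?(text)
  !(text =~ /\d+/).nil?
end

def demo_scope
  "First: 123" =~ /(\d+)/
  contains_digits?("Second: 456") # sets $~ only inside contains_digits?
  after_helper = $1               # still refers to the first match
  "Third: 789" =~ /(\d+)/         # same scope: replaces $1
  [after_helper, $1]
end

puts demo_scope.inspect
# => ["123", "789"]
```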
Memory retention problems can occur when MatchData objects keep large strings alive. A MatchData object holds a reference to the string it was matched against (exposed through the string method), preventing garbage collection of that text until the MatchData object itself is released.
# Potential memory issue with large strings
def extract_small_data(large_text)
  match_data = large_text.match(/important_data: (\w+)/)
  return nil unless match_data
  # Problematic: returning match_data would keep the entire source text reachable
  # Better: return only the captured group
  match_data[1]
end
# Release the MatchData as early as possible when memory is critical
def extract_with_memory_management(large_text)
  match_data = large_text.match(/data: (\w+)/)
  if match_data
    extracted_data = match_data[1].dup # Create independent copy
    match_data = nil # Drop the local reference to the MatchData
    extracted_data
  end
end
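When only a single capture or a yes/no answer is needed, the caller does not have to hold a MatchData at all. The sketch below uses two hypothetical helpers built on String#[] with a pattern and on String#match?.

```ruby
# Hypothetical helper: returns only the captured text.
def important_value(large_text)
  # String#[] with a pattern and a group index returns just that capture.
  large_text[/important_data: (\w+)/, 1]
end

# Hypothetical helper: answers whether the marker is present.
def mentions_important_data?(large_text)
  # match? returns true or false and does not set $~.
  large_text.match?(/important_data: \w+/)
end

log = "header important_data: abc123 footer"
puts important_value(log)           # => abc123
puts mentions_important_data?(log)  # => true
```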
Common Pitfalls
MatchData indexing follows Ruby's array conventions, but the presence of the complete match at index 0 creates confusion for developers expecting captured groups to start at index 0. This off-by-one error appears frequently in code that processes capture groups sequentially.
# Common mistake: expecting first capture at index 0
phone = "Call 555-1234 today"
match_data = phone.match(/(\d{3})-(\d{4})/)
# Wrong: treats complete match as first capture
area_code = match_data[0] # => "555-1234" (complete match, not area code)
# Correct: first capture is at index 1
area_code = match_data[1] # => "555"
number = match_data[2] # => "1234"
# Iteration pitfall: including index 0
match_data.length.times do |i|
  puts "Group #{i}: #{match_data[i]}"
end
# => Group 0: 555-1234 (this is the complete match!)
# => Group 1: 555
# => Group 2: 1234
# Correct iteration over captures only
match_data.captures.each_with_index do |capture, i|
  puts "Capture #{i + 1}: #{capture}"
end
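The manual i + 1 adjustment can be avoided by asking the enumerator to start counting at 1. This is a small sketch; numbered_captures is a hypothetical helper name.

```ruby
# Hypothetical helper: labels captures with their 1-based group numbers.
def numbered_captures(match_data)
  match_data.captures.each.with_index(1).map do |capture, group_number|
    "Group #{group_number}: #{capture}"
  end
end

m = "Call 555-1234 today".match(/(\d{3})-(\d{4})/)
puts numbered_captures(m)
# => Group 1: 555
# => Group 2: 1234

# Destructuring: name index 0 explicitly, then the captures.
_whole, area_code, number = m.to_a
puts "#{area_code} / #{number}"
# => 555 / 1234
```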
Named capture access exhibits inconsistent behavior between string and symbol keys in different Ruby contexts. While MatchData objects accept both forms, other parts of the Ruby ecosystem may expect one specific type, leading to nil returns when the wrong key type is used.
pattern = /(?<protocol>https?):\/\/(?<domain>\w+)/
url = "https://example"
match_data = url.match(pattern)
# Both string and symbol access work on MatchData
match_data['protocol'] # => "https"
match_data[:protocol] # => "https"
# But named_captures always returns string keys
named_data = match_data.named_captures
# => {"protocol"=>"https", "domain"=>"example"}
# This fails silently
named_data[:protocol] # => nil (symbol key doesn't exist)
named_data['protocol'] # => "https" (string key works)
# Consistent approach: choose one key type throughout
def extract_protocol(match_data)
  # MatchData accepts either key type; the hash from named_captures needs strings
  match_data[:protocol]
end
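Implicit state from the match variables appears automatically after every match operation. These variables do not persist across method boundaries, because each method scope has its own $~, but within a single method any later match silently replaces them. Code that reads $1 several lines after the match it depends on can break when another match is inserted in between, a bug that is difficult to trace.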
class DataProcessor
  def process_phone(text)
    match = text.match(/(\d{3})-(\d{4})/)
    if match
      @area_code = $1 # Relies on $1 still referring to the match above
      @number = $2 # A match added between these lines would change $2
    end
  end
  def process_date(text)
    # Sets $1..$3 in this method's scope only; process_phone is unaffected
    match = text.match(/(\d{4})-(\d{2})-(\d{2})/)
    if match
      @year = $1
      @month = $2
      @day = $3
    end
  end
  def format_phone
    # Instance variables keep the values captured in process_phone
    "#{@area_code}-#{@number}"
  end
end
# Safer approach: avoid the special variables entirely
class SafeDataProcessor
  def process_phone(text)
    match = text.match(/(\d{3})-(\d{4})/)
    if match
      @area_code = match[1] # Use MatchData directly
      @number = match[2]
    end
  end
end
Optional capture groups create nil values that propagate through string operations, causing unexpected NoMethodError exceptions when methods are called on nil captures. This issue becomes more severe with complex patterns containing multiple optional groups.
# Pattern with multiple optional groups
log_pattern = /(\d{4})-(\d{2})-(\d{2})( (\d{2}):(\d{2}):(\d{2}))?( (\w+))?/
# Complete log entry
full_log = "2024-03-15 14:30:22 ERROR"
full_match = full_log.match(log_pattern)
# All captures present: ["2024", "03", "15", " 14:30:22", "14", "30", "22", " ERROR", "ERROR"]
# Incomplete log entry
partial_log = "2024-03-15"
partial_match = partial_log.match(log_pattern)
# Some captures are nil: ["2024", "03", "15", nil, nil, nil, nil, nil, nil]
# Unsafe processing
def format_timestamp(match_data)
  date = "#{match_data[1]}-#{match_data[2]}-#{match_data[3]}"
  time = "#{match_data[5]}:#{match_data[6]}:#{match_data[7]}" # Becomes "::" when the groups are nil
  level = match_data[9].upcase # NoMethodError if nil
  "#{date} #{time} #{level}"
end
# Safe processing with nil checks
def format_timestamp_safe(match_data)
  date = "#{match_data[1]}-#{match_data[2]}-#{match_data[3]}"
  time = if match_data[5] && match_data[6] && match_data[7]
    "#{match_data[5]}:#{match_data[6]}:#{match_data[7]}"
  else
    "00:00:00"
  end
  level = match_data[9] ? match_data[9].upcase : "UNKNOWN"
  "#{date} #{time} #{level}"
end
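Named groups combined with non-capturing wrappers make the optional parts easier to default than counting to group 9. The sketch below is one possible rewrite under that assumption; format_log_entry and its pattern are hypothetical and simplified.

```ruby
# Hypothetical helper: formats a log line, filling in missing parts.
def format_log_entry(line)
  m = line.match(/(?<date>\d{4}-\d{2}-\d{2})(?: (?<time>\d{2}:\d{2}:\d{2}))?(?: (?<level>\w+))?/)
  return nil unless m

  # Unmatched optional groups are nil, so || supplies the defaults.
  time = m[:time] || "00:00:00"
  level = (m[:level] || "unknown").upcase
  "#{m[:date]} #{time} #{level}"
end

puts format_log_entry("2024-03-15 14:30:22 error")
# => 2024-03-15 14:30:22 ERROR
puts format_log_entry("2024-03-15")
# => 2024-03-15 00:00:00 UNKNOWN
```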
Case sensitivity in named captures causes failures when capture names don't match the expected case. Ruby treats capture names as case-sensitive, so a name that differs only in case is an unknown group, and looking it up raises IndexError rather than returning the capture.
# Pattern with mixed-case named captures
mixed_pattern = /(?<FirstName>\w+) (?<lastName>\w+)/
name = "John Smith"
match_data = name.match(mixed_pattern)
# Case must match exactly
match_data['FirstName'] # => "John"
# match_data['firstname'] # => IndexError (wrong case)
match_data['lastName'] # => "Smith"
# match_data['lastname'] # => IndexError (wrong case)
# Debugging helper for case issues
def debug_named_captures(match_data)
  puts "Available names: #{match_data.names.inspect}"
  match_data.names.each do |name|
    puts "#{name}: #{match_data[name].inspect}"
  end
end
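Where capture names come from patterns you do not control, a case-insensitive lookup is one way to tolerate the mismatch. The helper below, capture_ignoring_case, is a hypothetical sketch that resolves the actual name through MatchData#names first.

```ruby
# Hypothetical helper: looks up a named capture without regard to case.
def capture_ignoring_case(match_data, name)
  actual = match_data.names.find { |candidate| candidate.casecmp?(name.to_s) }
  # Returns nil when no group name matches, instead of raising IndexError.
  actual && match_data[actual]
end

m = "John Smith".match(/(?<FirstName>\w+) (?<lastName>\w+)/)
puts capture_ignoring_case(m, "firstname")       # => John
puts capture_ignoring_case(m, :LASTNAME)         # => Smith
puts capture_ignoring_case(m, "middle").inspect  # => nil
```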
Capture behavior changes when a group is nested or repeated, creating confusion about which occurrence the group (and any backreference to it) refers to. A repeated group retains only the text from its final iteration, which affects both the MatchData content and any backreferences used within the pattern itself.
# Repeated capture groups - only last occurrence is captured
repeated_text = "word1, word2, word3"
match_data = repeated_text.match(/(\w+,?\s*)+/)
# Only captures the last iteration of the group
match_data[1] # => "word3" (not "word1" or all words)
# Nested captures can be confusing
# The trailing \s* lets the outer group repeat across the separating spaces
nested_pattern = /((\w+)\s+(\w+)\s*)+/
nested_text = "first second third fourth"
nested_match = nested_text.match(nested_pattern)
# Each group keeps only its final iteration
nested_match[1] # => "third fourth"
nested_match[2] # => "third"
nested_match[3] # => "fourth"
# To capture all occurrences, use scan instead
all_words = repeated_text.scan(/(\w+)/)
# => [["word1"], ["word2"], ["word3"]]
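scan returns only the captured strings. When the position of every occurrence is also needed, one option is to call Regexp#match repeatedly with a start position. The helper all_matches below is a hypothetical sketch of that loop.

```ruby
# Hypothetical helper: returns one MatchData per occurrence of pattern.
def all_matches(text, pattern)
  matches = []
  position = 0
  while position <= text.length && (m = pattern.match(text, position))
    matches << m
    # Step past empty matches so the loop always advances.
    position = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  matches
end

found = all_matches("word1, word2, word3", /(\w+)/)
puts found.map { |m| m[1] }.inspect
# => ["word1", "word2", "word3"]
puts found.map { |m| m.offset(1) }.inspect
# => [[0, 5], [7, 12], [14, 19]]
```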
Reference
Core Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#[](index) | index (Integer, String, Symbol) | String or nil | Returns capture group by numeric index or name |
#[](start, length), #[](range) | start, length (Integer) or range (Range) | Array<String> | Returns multiple captures as array |
#begin(index) | index (Integer, String, Symbol) | Integer or nil | Returns start position of capture group |
#captures | None | Array<String> | Returns all capture groups excluding complete match (nil for unmatched groups) |
#end(index) | index (Integer, String, Symbol) | Integer or nil | Returns end position of capture group |
#length | None | Integer | Returns total number of captures plus complete match |
#named_captures | None | Hash<String, String> | Returns hash of named captures with string keys |
#names | None | Array<String> | Returns array of all named capture group names |
#offset(index) | index (Integer, String, Symbol) | Array<Integer> | Returns [begin, end] positions for capture |
#post_match | None | String | Returns portion of string after complete match |
#pre_match | None | String | Returns portion of string before complete match |
#regexp | None | Regexp | Returns the regular expression used for matching |
#size | None | Integer | Alias for #length |
#string | None | String | Returns the original string that was matched against |
#to_a | None | Array<String> | Returns array containing complete match and all captures |
#to_s | None | String | Returns the complete matched substring |
#values_at(*indices) | *indices (Integer) | Array<String> | Returns captures at specified indices |
Indexing Behavior
Index Type | Example | Result | Notes |
---|---|---|---|
Positive Integer | match_data[1] | First capture group | Index 0 is complete match |
Negative Integer | match_data[-1] | Last capture group | Standard Ruby array indexing |
String | match_data['name'] | Named capture value | Case-sensitive; unknown names raise IndexError |
Symbol | match_data[:name] | Named capture value | Same lookup as the string form |
Range | match_data[1..3] | Array of captures | Includes both endpoints |
Start/Length | match_data[1, 2] | Array of captures | Two captures starting at index 1 |
Position Methods
Method | Index Parameter | Returns | Description |
---|---|---|---|
#begin(0) | Complete match | Start of entire match | Character position in original string |
#begin(n) | Capture group n | Start of capture | nil if the group did not match; IndexError if n is out of range |
#end(0) | Complete match | End of entire match | One past last matched character |
#end(n) | Capture group n | End of capture | nil if the group did not match; IndexError if n is out of range |
#offset(n) | Any valid index | [begin, end] | Combined position information |
Global Variables Side Effects
Variable | Content | Updated When |
---|---|---|
$~ | Complete MatchData object | After each match attempt (nil when it fails) |
$& | Complete matched substring | After each match attempt (nil when it fails) |
$1, $2, etc. | Individual capture groups | After each match attempt (nil when it fails) |
$+ | Last (highest-numbered) capture | After each match attempt (nil when it fails) |
$` | Pre-match string | After each match attempt (nil when it fails) |
$' | Post-match string | After each match attempt (nil when it fails) |
Common Return Values
Scenario | Method Call | Return Value | Type |
---|---|---|---|
Valid numeric index | match_data[1] | Captured substring | String |
Invalid numeric index | match_data[99] | nil | NilClass |
Valid named capture | match_data['name'] | Captured substring | String |
Invalid named capture | match_data['missing'] | Raises IndexError | N/A |
Non-capturing group | Pattern (?:word) | Not accessible | N/A |
Optional group (unmatched) | match_data[n] | nil | NilClass |
Empty capture | Pattern () matches empty | "" | String |
Error Conditions
Condition | Method | Error Type | Prevention |
---|---|---|---|
nil MatchData | Any method call | NoMethodError | Check match result before use |
Invalid pattern | text.match(bad_pattern) | RegexpError | Validate patterns before matching |
Method on nil capture | match_data[n].method | NoMethodError | Check capture existence first |
Unknown capture name | match_data['missing'] | IndexError | Check names before access |
Non-existent named method | match_data.undefined_name | NoMethodError | Use hash-style access |
Pattern Examples
# Basic captures
/(\w+)/ # Single capture group
/(\w+)\s+(\w+)/ # Two capture groups
/(?<name>\w+)/ # Named capture group
# Optional captures
/(\w+)(\s+(\w+))?/ # Second group optional
/(?<req>\w+)(?<opt>\s+\w+)?/ # Named optional capture
# Nested captures
/((\w+)\s+(\w+))/ # Nested grouping
/(\w+(?:\s+\w+)*)/ # Non-capturing nested group
# Complex patterns
/^(?<protocol>https?):\/\/(?<host>[^\/]+)(?<path>\/.*)?$/ # URL parsing
/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/ # Date matching