Overview
Capture groups in Ruby regular expressions use parentheses to define portions of a pattern that should be extracted separately from the overall match. When a regular expression contains capture groups, Ruby stores the matched text for each group and makes it available through several mechanisms, including the MatchData object, numbered global variables, and named capture syntax.
Ruby implements capture groups through the Regexp class and MatchData objects. The String#match method returns a MatchData instance that contains the overall match at index 0 and the capture groups at indices 1, 2, 3, and so on. When the pattern contains groups, the String#scan method returns an array with one inner array of group values per match.
text = "Contact: john.doe@example.com"
pattern = /(\w+)\.(\w+)@(\w+\.\w+)/
match = text.match(pattern)
# => #<MatchData "john.doe@example.com" 1:"john" 2:"doe" 3:"example.com">
puts match[0] # "john.doe@example.com" (full match)
puts match[1] # "john" (first capture group)
puts match[2] # "doe" (second capture group)
puts match[3] # "example.com" (third capture group)
Named capture groups use the (?<name>pattern) syntax to assign names to groups, making them accessible through both numeric indices and descriptive names:
pattern = /(?<first>\w+)\.(?<last>\w+)@(?<domain>\w+\.\w+)/
match = text.match(pattern)
puts match[:first] # "john"
puts match[:last] # "doe"
puts match[:domain] # "example.com"
The global variables $1, $2, $3, etc., automatically receive the values of capture groups after a successful match operation. They describe the most recent match in the current method and thread: the next successful match replaces them, and a failed match resets them to nil:
text.match(pattern)
puts $1 # "john"
puts $2 # "doe"
puts $3 # "example.com"
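The numbered globals are shorthand for the most recent MatchData, which Ruby also exposes as $~ and through Regexp.last_match. A minimal sketch, reusing the sample address from above (the variable names are only illustrative):

```ruby
text = "Contact: john.doe@example.com"

if text =~ /(\w+)\.(\w+)@(\w+\.\w+)/
  last = Regexp.last_match        # the MatchData also available as $~
  first_name = last[1]            # same value as $1
  domain = Regexp.last_match(3)   # same value as $3
end
```

Reading from Regexp.last_match tends to be easier to follow than the punctuation-style globals, while behaving the same way.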
Basic Usage
Capture groups segment regular expression matches into discrete components for individual processing. The most common usage involves extracting structured data from formatted strings like dates, email addresses, URLs, or log entries.
Basic numbered capture groups use standard parentheses syntax. Each opening parenthesis creates a new group numbered sequentially from left to right:
log_line = "2024-03-15 14:32:45 ERROR Database connection failed"
pattern = /(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)/
match = log_line.match(pattern)
date = match[1] # "2024-03-15"
time = match[2] # "14:32:45"
level = match[3] # "ERROR"
message = match[4] # "Database connection failed"
The String#scan method with capture groups returns an array of arrays, where each inner array contains the capture group values for one match:
text = "Prices: $12.99, $45.50, $8.75"
pattern = /\$(\d+)\.(\d{2})/
prices = text.scan(pattern)
# => [["12", "99"], ["45", "50"], ["8", "75"]]
prices.each do |dollars, cents|
  total_cents = dollars.to_i * 100 + cents.to_i
  puts "#{dollars}.#{cents} = #{total_cents} cents"
end
Named capture groups improve readability and maintenance for complex patterns. The (?<name>pattern) syntax creates groups accessible by name rather than position:
url = "https://api.example.com:8080/users/123?format=json"
pattern = %r{
(?<protocol>https?)://
(?<host>[^:]+)
(:(?<port>\d+))?
(?<path>/[^?]+)?
(\?(?<query>.+))?
}x
match = url.match(pattern)
puts match[:protocol] # "https"
puts match[:host] # "api.example.com"
puts match[:port] # "8080"
puts match[:path] # "/users/123"
puts match[:query] # "format=json"
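When every group is named, MatchData#named_captures returns all of them as a Hash with String keys, which is convenient for passing the result on. A short sketch using a compact variant of the URL pattern (the pattern and variable names here are illustrative, not a full URL parser):

```ruby
url = "https://api.example.com:8080/users/123?format=json"
pattern = %r{(?<protocol>https?)://(?<host>[^:/]+)(?::(?<port>\d+))?(?<path>/[^?]*)?}

match = url.match(pattern)
parts = match.named_captures
# parts has the keys "protocol", "host", "port", and "path"
port = parts["port"]&.to_i   # nil when the optional port group did not match
```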
Nested capture groups create hierarchical grouping where outer groups contain inner groups. Ruby numbers groups by the order of their opening parentheses:
text = "Version: Ruby 3.2.1"
pattern = /(Ruby (\d+)\.(\d+)\.(\d+))/
match = text.match(pattern)
puts match[1] # "Ruby 3.2.1" (outer group)
puts match[2] # "3" (first inner group)
puts match[3] # "2" (second inner group)
puts match[4] # "1" (third inner group)
The gsub method can reference capture groups in replacement strings through \1, \2, etc., or \k<name> for named groups:
text = "John Smith, Jane Doe, Bob Johnson"
# Swap first and last names
swapped = text.gsub(/(\w+) (\w+)/, '\2, \1')
# => "Smith, John, Doe, Jane, Johnson, Bob"
# With named groups
swapped = text.gsub(/(?<first>\w+) (?<last>\w+)/, '\k<last>, \k<first>')
# => "Smith, John, Doe, Jane, Johnson, Bob"
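Replacement strings can only rearrange captured text. When the replacement needs to call methods on a group, the block form of gsub is an option: the block runs once per match, and the groups for that match are available through Regexp.last_match. A small sketch under that assumption:

```ruby
text = "John Smith, Jane Doe"

formatted = text.gsub(/(?<first>\w+) (?<last>\w+)/) do
  m = Regexp.last_match                 # MatchData for the current match
  "#{m[:last].upcase}, #{m[:first]}"    # the block's value replaces the match
end
```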
Performance & Memory
Capture group performance depends on pattern complexity, input size, and the number of groups defined. Each capture group requires additional memory allocation to store matched text, and complex nested groups can significantly impact matching speed.
Simple capture groups add minimal overhead to regular expression matching. The primary cost comes from creating MatchData objects and storing captured text:
require 'benchmark'
text = "email@domain.com " * 1000
simple_pattern = /\w+@\w+\.\w+/
capture_pattern = /(\w+)@(\w+)\.(\w+)/
Benchmark.bm(15) do |bm|
  bm.report("no captures:") do
    10000.times { text.scan(simple_pattern) }
  end
  bm.report("with captures:") do
    10000.times { text.scan(capture_pattern) }
  end
end
# Sample output: capture groups added roughly 15-25% overhead in this run;
# absolute timings depend on the Ruby version and hardware
# user system total real
# no captures: 0.180000 0.000000 0.180000 ( 0.182463)
# with captures: 0.220000 0.010000 0.230000 ( 0.228951)
Nested capture groups add to memory usage and processing time. Every additional pair of parentheses is one more group that must be tracked and stored:
# Memory-efficient: 2 capture groups
simple = /(\w+)@(\w+\.\w+)/
# Memory-intensive: 5 capture groups due to nesting
nested = /((\w+)@((\w+)\.(\w+)))/
text = "test@example.com"
puts simple.match(text).size # 3 elements (full match + 2 groups)
puts nested.match(text).size # 6 elements (full match + 5 groups)
Named capture groups consume additional memory for storing name-to-index mappings but provide better code maintainability. The performance difference becomes negligible for most applications:
# Named groups have minimal additional overhead
named_pattern = /(?<user>\w+)@(?<domain>\w+\.\w+)/
numbered_pattern = /(\w+)@(\w+\.\w+)/
# Both patterns perform similarly on the same input
text = "user@domain.com " * 1000
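One way to check this claim on a particular machine is to time both patterns directly. The sketch below assumes the two patterns above and an arbitrary iteration count; it shows no expected timings, since they vary between Ruby versions and hardware:

```ruby
require 'benchmark'

named_pattern = /(?<user>\w+)@(?<domain>\w+\.\w+)/
numbered_pattern = /(\w+)@(\w+\.\w+)/
text = "user@domain.com " * 1000

Benchmark.bm(10) do |bm|
  bm.report("named:")    { 100.times { text.scan(named_pattern) } }
  bm.report("numbered:") { 100.times { text.scan(numbered_pattern) } }
end

# scan returns the same nested arrays for both patterns
same_results = text.scan(named_pattern) == text.scan(numbered_pattern)
```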
Large input processing benefits from limiting capture group usage to necessary extractions only. Unnecessary groups waste memory and processing time:
# Inefficient: captures everything even when not needed
inefficient = /(\w+)(\s+)(\w+)(\s+)(\w+)(\s+)(.+)/
# Efficient: captures only required parts
efficient = /\w+\s+\w+\s+(\w+)\s+(.+)/
log_line = "2024 03 15 14:32:45 ERROR Database connection failed"
# Both extract the same data but efficient uses less memory
Complex alternation patterns with capture groups can cause exponential backtracking. Atomic groups and possessive quantifiers help optimize such patterns:
# Potentially slow with backtracking
slow_pattern = /((a|b)*c)*/
# Optimized with atomic grouping
fast_pattern = /(?>(a|b)*c)*/
# For capture groups with alternation, be specific
specific = /(jpg|png|gif)/
general = /(\w+)/ # Less efficient if you only want image extensions
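Atomic groups and possessive quantifiers work by refusing to give back text they have already matched. The toy patterns below are chosen only to make that visible; they are not taken from the examples above:

```ruby
backtracking = /a+ab/
atomic = /(?>a+)ab/
possessive = /a++ab/

input = "aaab"
backtracking_result = backtracking.match?(input)  # true: a+ gives back one "a" so "ab" can match
atomic_result = atomic.match?(input)              # false: (?>a+) keeps every "a" it consumed
possessive_result = possessive.match?(input)      # false: a++ behaves like the atomic group here
```

The same property that removes backtracking can also change what a pattern matches, so it is worth testing an optimized pattern against inputs the original accepted.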
Error Handling & Debugging
Invalid regular expression patterns with malformed capture groups raise RegexpError when the pattern is compiled. Unmatched parentheses, invalid named group syntax, and illegal backreferences cause immediate failures:
begin
  # Unmatched opening parenthesis
  pattern = Regexp.new("(abc")
rescue RegexpError => e
  puts e.message # "end pattern with unmatched parenthesis: /(abc/"
end
begin
  # Invalid named group syntax
  pattern = Regexp.new("(?<>abc)")
rescue RegexpError => e
  puts e.message # "group name is empty: /(?<>abc)/"
end
begin
  # Invalid backreference: group 2 does not exist
  pattern = Regexp.new("\\2(\\w+)")
rescue RegexpError => e
  puts e.message # "invalid backref number/name: /\2(\w+)/"
end
Missing matches return nil instead of MatchData objects. Accessing capture groups on nil raises NoMethodError. Always check for a successful match before accessing groups:
text = "no email here"
pattern = /(\w+)@(\w+\.\w+)/
match = text.match(pattern)
# Unsafe: will raise NoMethodError if no match
# puts match[1]
# Safe: check for match first
if match
  puts "Email user: #{match[1]}"
  puts "Domain: #{match[2]}"
else
  puts "No email found"
end
# Alternative: use safe navigation
email_user = text.match(pattern)&.[](1)
puts email_user || "No email found"
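For the common case of pulling out a single group, String#[] accepts a pattern plus a group index and returns nil when nothing matches, so no intermediate MatchData check is needed. A brief sketch with made-up input strings:

```ruby
pattern = /(\w+)@(\w+\.\w+)/

user = "mail alice@example.com today"[pattern, 1]   # first group of the first match
missing = "no email here"[pattern, 1]               # nil, and no exception is raised
```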
Named capture groups with duplicate names create ambiguous references. When several groups share a name, lookup by name returns the last of those groups that took part in the match:
pattern = /(?<value>\d+)\.(?<value>\d+)/ # Duplicate name
text = "12.34"
match = text.match(pattern)
# Returns the second group, not the first
puts match[:value] # "34", not "12"
# Access by index to get specific groups
puts match[1] # "12"
puts match[2] # "34"
The global variables $1, $2, etc., always describe the most recent match attempt in the current scope, so they are easy to misread once several matches run in sequence. A failed match resets them to nil, and any later successful match replaces them. Read them immediately after the match they belong to, or hold on to the MatchData object instead:
# First match sets the global variables
"abc123".match(/([a-z]+)(\d+)/)
puts $1 # "abc"
puts $2 # "123"
# A failed match resets them to nil
"xyz".match(/([a-z]+)(\d+)/)
p $1 # nil
p $2 # nil
# Prefer MatchData objects, which later matches do not change
def safe_extract(text, pattern)
  match = text.match(pattern)
  return nil unless match
  [match[1], match[2]]
end
Debugging complex capture group patterns requires systematic testing of each group individually. The Regexp#names method lists the named capture groups in a pattern, and MatchData#names returns the same list for the pattern that produced the match, whether or not each group took part in it:
pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
puts pattern.names # ["year", "month", "day"]
text = "Date: 2024-03-15"
match = text.match(pattern)
puts match.names # ["year", "month", "day"]
# Inspect all captures
match.captures.each_with_index do |capture, index|
  name = match.names[index] || index + 1
  puts "Group #{name}: #{capture.inspect}"
end
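When a group captures unexpected text, it can also help to see where in the input each group matched. MatchData#offset returns the start and end positions for a group, given its index or name. A sketch built on the date pattern above:

```ruby
pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
match = "Date: 2024-03-15".match(pattern)

report = match.names.map do |name|
  start, stop = match.offset(name)   # character offsets, end exclusive
  "#{name}=#{match[name]} at #{start}...#{stop}"
end
# report[0] is "year=2024 at 6...10"
```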
Common Pitfalls
Capture group numbering changes when groups are added or removed from patterns, breaking code that relies on specific indices. Named groups provide stability but numeric access can still shift:
# Original pattern with 2 groups
original = /(\w+)@(\w+\.\w+)/
text = "user@example.com"
match = text.match(original)
user = match[1] # "user"
domain = match[2] # "example.com"
# Modified pattern adds a group at the beginning
modified = /(\w+:)?(\w+)@(\w+\.\w+)/
match = text.match(modified)
# Now user is at index 2, not 1!
user = match[2] # Still works but fragile
domain = match[3] # Domain moved to index 3
# Solution: use named groups for stability
stable = /(?<protocol>\w+:)?(?<user>\w+)@(?<domain>\w+\.\w+)/
match = text.match(stable)
user = match[:user] # Always works regardless of group changes
domain = match[:domain] # Position independent
Non-capturing groups, written (?:pattern), group elements without creating capture groups, which can confuse developers who expect captured values:
# Mixing capturing and non-capturing groups
# (%r{} avoids having to escape the slashes in "://")
pattern = %r{(?:https?://)(\w+)\.(\w+)}
url = "https://example.com"
match = url.match(pattern)
puts match.size # 3, not 4 (protocol not captured)
puts match[0] # "https://example.com"
puts match[1] # "example" (first capture group)
puts match[2] # "com" (second capture group)
# match[3] is nil - the protocol group was non-capturing
Optional capture groups produce nil when they do not take part in the match, which breaks string operations that expect actual text. Always handle the possibility of a missing group:
pattern = /\A([\w.]+)(?::(\d+))?/ # Port is optional
servers = ["localhost", "example.com:8080", "test.local:"]
servers.each do |server|
  match = server.match(pattern)
  host = match[1]
  port = match[2] # nil for "localhost" and for "test.local:" (no digits after the colon)
  # Unsafe: port.to_i turns nil into 0 without warning, and Integer(port) raises TypeError
  # port_num = port.to_i
  # Safe: convert only when the group matched
  port_num = port&.to_i
  puts "Host: #{host}, Port: #{port_num || 'default'}"
end
A capture group inside a quantifier keeps only its last repetition, with String#match and String#scan alike. This behavior confuses developers expecting all repetitions:
text = "rgb(255,128,64)"
pattern = /rgb\((\d+,?)+\)/
# match only captures the last repetition
match = text.match(pattern)
puts match[1] # "64" (only the last number)
# scan with an unquantified group returns one capture per match
colors = text.scan(/(\d+)/)
p colors.flatten # ["255", "128", "64"]
# Better pattern for this case
better_pattern = /rgb\((\d+),(\d+),(\d+)\)/
match = text.match(better_pattern)
puts match[1], match[2], match[3] # "255", "128", "64"
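When the number of repetitions is not fixed, one option is to capture the whole repeated run in a single group and split it afterwards. A minimal sketch, assuming the values are separated by plain commas:

```ruby
text = "rgb(255,128,64)"

inner = text[/rgb\(([\d,]+)\)/, 1]        # the full run between the parentheses
channels = inner.split(",").map(&:to_i)   # one Integer per repetition
```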
Backreferences such as \1 and \2 within a pattern refer to the text a group has already matched, not to the group's pattern. Misunderstanding this leads to incorrect pattern logic:
# Correct: matches repeated words
repeated_words = /(\w+)\s+\1/
text = "the the quick brown brown fox"
matches = text.scan(repeated_words)
p matches # [["the"], ["brown"]]
# Incorrect assumption: \1 doesn't repeat the pattern (\w+)
# It matches the exact text that was captured by group 1
wrong_assumption = "abc123 def456".match(/(\w+)\s+\1/)
p wrong_assumption # nil - "abc123" != "def456"
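Named groups can be referenced inside the pattern with \k<name>, which follows the same rule: it matches the captured text, not the group's pattern. A sketch that requires a quoted string to close with the quote character it opened with (the group names are illustrative):

```ruby
quoted = /(?<quote>["'])(?<body>.*?)\k<quote>/

single = quoted.match(%q{say 'hello' now})   # opening and closing quotes agree
mixed = quoted.match(%q{say 'hello" now})    # nil: the quotes differ
```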
Scope issues arise when code reads $1 after another match has run in the same scope. $~ and the variables derived from it are local to the current method (blocks share them with the enclosing method) and to the current thread, so a match inside a called method leaves the caller's values alone, but any match in the same method or block replaces them:
def process_emails(text)
  emails = []
  text.scan(/(\w+)@(\w+\.\w+)/) do
    # Dangerous: sub runs a new match in this scope and replaces $~
    domain = $2.sub(/\.com\z/, ".org")
    # $1 is now nil, because the sub pattern has no capture groups
    emails << "#{$1}@#{domain}"
  end
  emails
end
# Safer approach: take the groups as block parameters
# (normalize_domain stands for any application-specific helper)
def safe_process_emails(text)
  text.scan(/(\w+)@(\w+\.\w+)/).map do |user, domain|
    normalized_domain = normalize_domain(domain)
    "#{user}@#{normalized_domain}"
  end
end
Reference
Core Classes and Methods
Class/Method | Purpose | Returns |
---|---|---|
Regexp.new(pattern) | Create regex with capture groups | Regexp object |
String#match(pattern) | Find first match with groups | MatchData or nil |
String#match?(pattern) | Test for a match without creating MatchData or setting $~ | true or false |
String#scan(pattern) | Find all matches with groups | Array of captures |
String#gsub(pattern, replacement) | Replace with group references | New string |
MatchData Object Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#[](index) | Integer index | String or nil | Access group by numeric index |
#[](name) | Symbol or String | String or nil | Access named group |
#captures | None | Array | All capture groups as array |
#names | None | Array | Names of all named groups |
#named_captures | None | Hash | Hash of name to capture value |
#size | None | Integer | Total captures plus full match |
#values_at(*indices) | Indices | Array | Multiple groups by index |
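The methods in this table can be tried side by side on one match. The date string below is only sample input:

```ruby
match = "2024-03-15".match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)

all_groups = match.captures            # ["2024", "03", "15"]
by_name = match.named_captures         # {"year"=>"2024", "month"=>"03", "day"=>"15"}
group_count = match.size               # 4 (full match plus three groups)
year_and_day = match.values_at(1, 3)   # ["2024", "15"]
```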
Capture Group Syntax
Syntax | Type | Description | Example |
---|---|---|---|
(pattern) | Basic | Creates numbered group | /(\w+)@(\w+)/ |
(?<name>pattern) | Named | Creates named group | /(?<user>\w+)@(?<domain>\w+)/ |
(?:pattern) | Non-capturing | Groups without capture | %r{(?:https?://)(\w+)} |
(?=pattern) | Lookahead | Positive lookahead | /\w+(?=@)/ |
(?!pattern) | Lookahead | Negative lookahead | /\w+(?!@)/ |
(?<=pattern) | Lookbehind | Positive lookbehind | /(?<=@)\w+/ |
(?<!pattern) | Lookbehind | Negative lookbehind | /(?<!@)\w+/ |
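Lookarounds test the surrounding text without consuming or capturing it, which is the main difference from the capturing forms in the table. A short comparison on a sample address:

```ruby
address = "john@example.com"

user = address[/\w+(?=@)/]            # "john": the "@" is checked but not consumed
domain = address[/(?<=@)[\w.]+/]      # "example.com"
with_group = address.match(/(\w+)@/)  # full match is "john@", group 1 is "john"
lookaround_groups = address.match(/\w+(?=@)/).captures  # [] since a lookahead captures nothing
```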
Global Variables
Variable | Content | Scope |
---|---|---|
$& | Full match text | Method-local and thread-local |
$1, $2, $3 ... | Capture group 1, 2, 3... | Method-local and thread-local |
$+ | Last capture group that matched | Method-local and thread-local |
$` | Text before match | Method-local and thread-local |
$' | Text after match | Method-local and thread-local |
Replacement String References
Reference | Type | Description | Example (double-quoted string) |
---|---|---|---|
\1, \2, \3 ... | Numbered | Reference numbered groups | "\\2, \\1" |
\k<name> | Named | Reference named groups | "\\k<last>, \\k<first>" |
\& | Full match | Entire matched text | "[\\&]" |
\` | Pre-match | Text before match | "\\`_\\&" |
\' | Post-match | Text after match | "\\&_\\'" |
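The special references are easiest to understand on a tiny input. The sketch below applies three of them to the word "cat", using double-quoted strings where each backslash is written twice:

```ruby
bracketed = "cat".sub(/a/, "[\\&]")   # "c[a]t": the match wrapped in brackets
pre = "cat".sub(/a/, "\\`")           # "cct": the match replaced by the text before it
post = "cat".sub(/a/, "\\'")          # "ctt": the match replaced by the text after it
```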
Common Patterns
Pattern | Purpose | Groups | Example Match |
---|---|---|---|
/\A(\w+)@(\w+\.\w+)\z/ | Email validation | user, domain | john@example.com |
/\A(\d{4})-(\d{2})-(\d{2})\z/ | Date parsing | year, month, day | 2024-03-15 |
/\A(https?):\/\/([\w.]+)(?::(\d+))?/ | URL components | protocol, host, port | https://example.com:8080 |
/\A(\+?\d{1,3})[-.\s]?(\d{3})[-.\s]?(\d{4})\z/ | Phone numbers | country, area, number | +1-555-1234 |
/\A([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{2}==\|[A-Za-z0-9+\/]{3}=)?\z/ | Base64 validation | data, padding | SGVsbG8= |
Error Types
Error | Cause | Example |
---|---|---|
RegexpError | Invalid pattern syntax | Regexp.new("(unclosed") |
NoMethodError | Accessing groups on nil | "text".match(/\d+/)[1] |
IndexError | Undefined group name | "text".match(/(?<word>\w+)/)[:missing] |
TypeError | Wrong argument type | "text".match(123) |