CrackedRuby - Capture Groups

Overview

Capture groups in Ruby regular expressions use parentheses to define portions of a pattern that should be extracted separately from the overall match. When a regular expression contains capture groups, Ruby stores the matched text for each group and makes it available through several mechanisms including the MatchData object, numbered global variables, and named capture syntax.

Ruby implements capture groups through the Regexp class and MatchData objects. The String#match method returns a MatchData instance that contains the overall match at index 0 and subsequent capture groups at indices 1, 2, 3, and so on. The String#scan method returns arrays of capture group matches when the pattern contains groups.

text = "Contact: john.doe@example.com"
pattern = /(\w+)\.(\w+)@(\w+\.\w+)/
match = text.match(pattern)
# => #<MatchData "john.doe@example.com" 1:"john" 2:"doe" 3:"example.com">

puts match[0]  # "john.doe@example.com" (full match)
puts match[1]  # "john" (first capture group)
puts match[2]  # "doe" (second capture group)
puts match[3]  # "example.com" (third capture group)

Named capture groups use the (?<name>pattern) syntax to assign names to groups, making them accessible through both numeric indices and descriptive names:

pattern = /(?<first>\w+)\.(?<last>\w+)@(?<domain>\w+\.\w+)/
match = text.match(pattern)
puts match[:first]   # "john"
puts match[:last]    # "doe"
puts match[:domain]  # "example.com"

The special variables $1, $2, $3, etc., automatically receive the values of capture groups after a successful match operation. Despite the $ prefix they are local to the current method and thread, and the next match attempt in that scope replaces them (a failed match resets them to nil):

text.match(pattern)
puts $1  # "john"
puts $2  # "doe"
puts $3  # "example.com"
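
The same values can be read without the $ variables. A minimal sketch, reusing the numbered email pattern from earlier, that goes through Regexp.last_match instead:

```ruby
text = "Contact: john.doe@example.com"
pattern = /(\w+)\.(\w+)@(\w+\.\w+)/

if text =~ pattern
  whole  = Regexp.last_match(0)  # same value as $&
  first  = Regexp.last_match(1)  # same value as $1
  domain = Regexp.last_match(3)  # same value as $3
end

puts whole   # "john.doe@example.com"
puts first   # "john"
puts domain  # "example.com"
```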

Basic Usage

Capture groups segment regular expression matches into discrete components for individual processing. The most common usage involves extracting structured data from formatted strings like dates, email addresses, URLs, or log entries.

Basic numbered capture groups use standard parentheses syntax. Each opening parenthesis creates a new group numbered sequentially from left to right:

log_line = "2024-03-15 14:32:45 ERROR Database connection failed"
pattern = /(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.+)/
match = log_line.match(pattern)

date = match[1]      # "2024-03-15"
time = match[2]      # "14:32:45"
level = match[3]     # "ERROR"
message = match[4]   # "Database connection failed"

The String#scan method with capture groups returns an array of arrays, where each inner array contains the capture group values for one match:

text = "Prices: $12.99, $45.50, $8.75"
pattern = /\$(\d+)\.(\d{2})/
prices = text.scan(pattern)
# => [["12", "99"], ["45", "50"], ["8", "75"]]

prices.each do |dollars, cents|
  total_cents = dollars.to_i * 100 + cents.to_i
  puts "#{dollars}.#{cents} = #{total_cents} cents"
end

Named capture groups improve readability and maintenance for complex patterns. The (?<name>pattern) syntax creates groups accessible by name rather than position. When a pattern contains named groups, Ruby treats its plain parentheses as non-capturing:

url = "https://api.example.com:8080/users/123?format=json"
pattern = %r{
  (?<protocol>https?)://
  (?<host>[^:/?]+)
  (:(?<port>\d+))?
  (?<path>/[^?]+)?
  (\?(?<query>.+))?
}x

match = url.match(pattern)
puts match[:protocol]  # "https"
puts match[:host]      # "api.example.com"
puts match[:port]      # "8080"
puts match[:path]      # "/users/123"
puts match[:query]     # "format=json"
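
Optional named groups return nil when they take no part in the match. The sketch below uses a simplified variant of the pattern above to show this together with MatchData#named_captures; the fallback port of 80 is an assumption made purely for illustration:

```ruby
pattern = %r{
  (?<protocol>https?)://
  (?<host>[^:/?]+)
  (?::(?<port>\d+))?
  (?<path>/[^?]*)?
}x

full = "https://api.example.com:8080/users/123".match(pattern)
bare = "http://example.com".match(pattern)

p full.named_captures  # hash of group name to captured text
p bare[:port]          # nil, the port group did not participate
p bare[:path]          # nil, the path group did not participate

# Fall back to an assumed default when the group is missing
port = (bare[:port] || "80").to_i
puts port              # 80
```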

Nested capture groups create hierarchical grouping where outer groups contain inner groups. Ruby numbers groups by the order of their opening parentheses:

text = "Version: Ruby 3.2.1"
pattern = /(Ruby (\d+)\.(\d+)\.(\d+))/
match = text.match(pattern)

puts match[1]  # "Ruby 3.2.1" (outer group)
puts match[2]  # "3" (first inner group)
puts match[3]  # "2" (second inner group)
puts match[4]  # "1" (third inner group)

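The gsub method uses capture groups in replacement strings through \1, \2, etc., or \k<name> for named groups. Write the replacement in single quotes so that Ruby does not interpret the backslash sequences itself: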

text = "John Smith, Jane Doe, Bob Johnson"
# Swap first and last names
swapped = text.gsub(/(\w+) (\w+)/, '\2, \1')
# => "Smith, John, Doe, Jane, Johnson, Bob"

# With named groups
swapped = text.gsub(/(?<first>\w+) (?<last>\w+)/, '\k<last>, \k<first>')
# => "Smith, John, Doe, Jane, Johnson, Bob"
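
When the replacement needs logic rather than a fixed template, gsub also accepts a block. The sketch below is one way to do it; upcasing the last name is only an illustrative transformation:

```ruby
text = "John Smith, Jane Doe, Bob Johnson"

formatted = text.gsub(/(?<first>\w+) (?<last>\w+)/) do
  m = Regexp.last_match  # MatchData for the match currently being replaced
  "#{m[:last].upcase}, #{m[:first]}"
end

puts formatted  # "SMITH, John, DOE, Jane, JOHNSON, Bob"
```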

Performance & Memory

Capture group performance depends on pattern complexity, input size, and the number of groups defined. Each capture group requires additional memory allocation to store matched text, and complex nested groups can significantly impact matching speed.

Simple capture groups add minimal overhead to regular expression matching. The primary cost comes from creating MatchData objects and storing captured text:

require 'benchmark'

text = "email@domain.com " * 1000
simple_pattern = /\w+@\w+\.\w+/
capture_pattern = /(\w+)@(\w+)\.(\w+)/

Benchmark.bm(15) do |bm|
  bm.report("no captures:") do
    10000.times { text.scan(simple_pattern) }
  end

  bm.report("with captures:") do
    10000.times { text.scan(capture_pattern) }
  end
end

# Sample results from one run (they vary with hardware and Ruby version);
# here capture groups add roughly 20-25% overhead
#                      user     system      total        real
# no captures:     0.180000   0.000000   0.180000 (  0.182463)
# with captures:   0.220000   0.010000   0.230000 (  0.228951)

Nested capture groups add to memory usage and processing time. Every opening parenthesis, at any nesting level, creates another group that must be tracked and stored:

# Memory-efficient: 2 capture groups
simple = /(\w+)@(\w+\.\w+)/

# Memory-intensive: 5 capture groups due to nesting
nested = /((\w+)@((\w+)\.(\w+)))/

text = "test@example.com"
puts simple.match(text).size   # 3 elements (full match + 2 groups)
puts nested.match(text).size   # 6 elements (full match + 5 groups)

Named capture groups consume additional memory for storing name-to-index mappings but provide better code maintainability. The performance difference becomes negligible for most applications:

# Named groups have minimal additional overhead
named_pattern = /(?<user>\w+)@(?<domain>\w+\.\w+)/
numbered_pattern = /(\w+)@(\w+\.\w+)/

# Both patterns perform similarly on the same input
text = "user@domain.com " * 1000

Large input processing benefits from limiting capture group usage to necessary extractions only. Unnecessary groups waste memory and processing time:

# Inefficient: captures everything even when not needed
inefficient = /(\w+)(\s+)(\w+)(\s+)(\w+)(\s+)(.+)/

# Efficient: captures only required parts
efficient = /\w+\s+\w+\s+(\w+)\s+(.+)/

log_line = "2024 03 15 14:32:45 ERROR Database connection failed"
# Both extract the same data but efficient uses less memory
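
To make the comparison concrete, the sketch below runs both patterns on the log line and compares what they expose. The third pattern, noncapturing, is an added variant (not part of the original example) showing that (?:...) still allows grouping and repetition without storing anything:

```ruby
log_line = "2024 03 15 14:32:45 ERROR Database connection failed"

inefficient  = /(\w+)(\s+)(\w+)(\s+)(\w+)(\s+)(.+)/
efficient    = /\w+\s+\w+\s+(\w+)\s+(.+)/
noncapturing = /(?:\w+\s+){2}(\w+)\s+(.+)/  # added variant for illustration

a = inefficient.match(log_line)
b = efficient.match(log_line)
c = noncapturing.match(log_line)

p a.values_at(5, 7)  # ["15", "14:32:45 ERROR Database connection failed"]
p b.captures         # same two values
p c.captures         # same two values

puts a.size  # 8 (full match + 7 groups)
puts b.size  # 3 (full match + 2 groups)
```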

Complex alternation patterns with capture groups can cause exponential backtracking. Atomic groups and possessive quantifiers help optimize such patterns:

# Potentially slow with backtracking
slow_pattern = /((a|b)*c)*/

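# Optimized with atomic grouping: once the group has matched, the engine
# never backtracks into it (the outer group also stops capturing)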
fast_pattern = /(?>(a|b)*c)*/

# For capture groups with alternation, be specific
specific = /(jpg|png|gif)/
general = /(\w+)/  # Less efficient if you only want image extensions
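
Atomic groups and possessive quantifiers change what a pattern can match, not only how fast it runs, because they refuse to give characters back. A minimal sketch with deliberately tiny patterns (chosen for illustration, not taken from the examples above):

```ruby
greedy     = /a+ab/      # a+ can give back one "a" so that "ab" still matches
atomic     = /(?>a+)ab/  # the atomic group keeps every "a" it consumed
possessive = /a++ab/     # possessive quantifier, same effect as the atomic group

puts "aaab".match?(greedy)      # true
puts "aaab".match?(atomic)      # false
puts "aaab".match?(possessive)  # false
```

The trade-off is that an atomic rewrite is only safe when the discarded backtracking positions could never have led to a match you care about.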

Error Handling & Debugging

Patterns built at run time with Regexp.new raise RegexpError when a capture group is malformed; the same mistake in a regex literal is reported as a SyntaxError when the file is parsed. Unmatched parentheses, invalid named group syntax, and illegal backreferences all cause immediate failures:

begin
  # Unmatched opening parenthesis
  pattern = Regexp.new("(abc")
rescue RegexpError => e
  puts e.message  # "end pattern with unmatched parenthesis"
end

begin
  # Invalid named group syntax
  pattern = Regexp.new("(?<>abc)")
rescue RegexpError => e
  puts e.message  # "group name is empty"
end

begin
  # Invalid backreference
  pattern = Regexp.new("\\2(\\w+)")
rescue RegexpError => e
  puts e.message  # "invalid backref number/name"
end

Missing matches return nil instead of MatchData objects. Accessing capture groups on nil raises NoMethodError. Always check for successful matches before accessing groups:

text = "no email here"
pattern = /(\w+)@(\w+\.\w+)/
match = text.match(pattern)

# Unsafe: will raise NoMethodError if no match
# puts match[1]

# Safe: check for match first
if match
  puts "Email user: #{match[1]}"
  puts "Domain: #{match[2]}"
else
  puts "No email found"
end

# Alternative: use safe navigation
email_user = text.match(pattern)&.[](1)
puts email_user || "No email found"
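
Two further nil-safe idioms, sketched here on the same pattern: String#[] with a pattern and a group index, and the block form of String#match, whose block runs only when the match succeeds:

```ruby
pattern = /(\w+)@(\w+\.\w+)/

missing = "no email here"[pattern, 1]
domain  = "bob@example.com"[pattern, 2]

label = "bob@example.com".match(pattern) { |m| "#{m[1]} at #{m[2]}" }
none  = "no email here".match(pattern) { |m| "#{m[1]} at #{m[2]}" }

p missing  # nil
p domain   # "example.com"
p label    # "bob at example.com"
p none     # nil
```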

Named capture groups with duplicate names create ambiguous references. Access by name returns the last group with that name that took part in the match:

pattern = /(?<value>\d+)\.(?<value>\d+)/  # Duplicate name
text = "12.34"
match = text.match(pattern)

# Returns the second group, not the first
puts match[:value]  # "34", not "12"

# Access by index to get specific groups
puts match[1]  # "12"
puts match[2]  # "34"

The special variables $1, $2, etc., always reflect the most recent match attempt in the current method, which is not necessarily the match you have in mind. A failed match resets them to nil, and any intervening successful match replaces them. Prefer MatchData objects, and check match success before reading groups:

# First match sets the special variables
"abc123".match(/([a-z]+)(\d+)/)
puts $1  # "abc"
puts $2  # "123"

# A failed match resets them to nil
"xyz".match(/([a-z]+)(\d+)/)
p $1  # nil
p $2  # nil

# An unrelated successful match silently replaces them
"2024-03-15".match(/(\d+)-(\d+)/)
puts $1  # "2024", no longer "abc"

# Use MatchData objects instead of the special variables
def safe_extract(text, pattern)
  match = text.match(pattern)
  return nil unless match
  [match[1], match[2]]
end

Debugging complex capture group patterns requires systematic testing of each group individually. The Regexp#names method lists the named capture groups in a pattern, and MatchData#names returns the same list for the pattern that produced the match, whether or not each group participated:

pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
puts pattern.names  # ["year", "month", "day"]

text = "Date: 2024-03-15"
match = text.match(pattern)
puts match.names    # ["year", "month", "day"]

# Inspect all captures
match.captures.each_with_index do |capture, index|
  name = match.names[index] || index + 1
  puts "Group #{name}: #{capture.inspect}"
end
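
For groups that may not participate, MatchData#named_captures and MatchData#begin make the gaps visible. A small sketch, assuming a variant of the date pattern in which the day is optional:

```ruby
pattern = /(?<year>\d{4})-(?<month>\d{2})(?:-(?<day>\d{2}))?/
match = "Date: 2024-03".match(pattern)

captures = match.named_captures

p match.names         # ["year", "month", "day"] (all names, matched or not)
p captures["year"]    # "2024"
p captures["day"]     # nil, the optional group did not participate
p match.begin(:year)  # 6, offset where the year group starts
p match.begin(:day)   # nil
```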

Common Pitfalls

Capture group numbering changes when groups are added or removed from patterns, breaking code that relies on specific indices. Named groups provide stability but numeric access can still shift:

# Original pattern with 2 groups
original = /(\w+)@(\w+\.\w+)/
text = "user@example.com"
match = text.match(original)
user = match[1]     # "user"
domain = match[2]   # "example.com"

# Modified pattern adds a group at the beginning
modified = /(\w+:)?(\w+)@(\w+\.\w+)/
match = text.match(modified)
# Now user is at index 2, not 1!
user = match[2]     # Still works but fragile
domain = match[3]   # Domain moved to index 3

# Solution: use named groups for stability
stable = /(?<protocol>\w+:)?(?<user>\w+)@(?<domain>\w+\.\w+)/
match = text.match(stable)
user = match[:user]       # Always works regardless of group changes
domain = match[:domain]   # Position independent

Non-capturing groups (?:pattern) group elements without creating capture groups but can confuse developers who expect captured values:

# Mixing capturing and non-capturing groups
pattern = /(?:https?:\/\/)(\w+)\.(\w+)/
url = "https://example.com"
match = url.match(pattern)

puts match.size    # 3, not 4 (protocol not captured)
puts match[0]      # "https://example.com"
puts match[1]      # "example" (first capture group)
puts match[2]      # "com" (second capture group)
# match[3] is nil - the protocol group was non-capturing

Optional capture groups yield nil when they do not participate in the match. That breaks string operations expecting actual text, and conversions such as to_i silently coerce it. Always handle the possibility of nil or empty groups:

pattern = /([\w.]+)(:(\d*))?/  # Port is optional
servers = ["localhost", "example.com:8080", "test.local:"]

servers.each do |server|
  match = server.match(pattern)
  host = match[1]
  port = match[3]  # nil for "localhost", "" for "test.local:"

  # Misleading: nil.to_i and "".to_i both return 0, hiding the missing port
  # port_num = port.to_i

  # Safe: treat nil and empty values as "no port given"
  port_num = port.nil? || port.empty? ? nil : port.to_i
  puts "Host: #{host}, Port: #{port_num || 'default'}"
end

Capture groups inside quantifiers keep only their last repetition, whether the pattern is used with String#match or String#scan. This behavior confuses developers expecting all repetitions:

text = "rgb(255,128,64)"
pattern = /rgb\((\d+,?)+\)/

# match keeps only the last repetition of the group
match = text.match(pattern)
puts match[1]  # "64" (only the last number)

# scan with a non-repeated group returns every occurrence
colors = text.scan(/(\d+)/)
p colors.flatten  # ["255", "128", "64"]

# Better pattern for this case
better_pattern = /rgb\((\d+),(\d+),(\d+)\)/
match = text.match(better_pattern)
puts match[1], match[2], match[3]  # "255", "128", "64"
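
When the number of repetitions is not fixed, one common workaround (a sketch, not the only option) is to capture the whole repeated run as a single group and split it in a second step:

```ruby
text = "rgb(255,128,64)"

channels =
  if (m = text.match(/rgb\(([\d,]+)\)/))
    m[1].split(",").map(&:to_i)  # m[1] is "255,128,64"
  else
    []
  end

p channels  # [255, 128, 64]
```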

Backreferences \1, \2 within patterns refer to already-matched group content, not the group pattern itself. Misunderstanding this leads to incorrect pattern logic:

# Correct: matches repeated words
repeated_words = /(\w+)\s+\1/
text = "the the quick brown brown fox"
matches = text.scan(repeated_words)
p matches  # [["the"], ["brown"]]

# \1 does not re-apply the pattern (\w+);
# it matches the exact text that group 1 captured
no_match = "abc123 def456".match(/(\w+)\s+\1/)
p no_match  # nil ("def456" does not repeat any captured text)

Variable scope issues arise because special variables like $1 are shared across an entire method body, including its blocks, so any other regex operation in the same method overwrites them. They are local to each method frame, however, so a regex that runs inside a called method does not disturb the caller's values:

def process_emails(text)
  emails = []
  text.scan(/(\w+)@(\w+\.\w+)/) do
    # Dangerous: this sub runs another regex in the same method,
    # so $1 and $2 no longer refer to scan's captures afterwards
    normalized_domain = $2.sub(/\.COM\z/i, ".com")

    # $1 is now nil, so the user name is lost
    emails << "#{$1}@#{normalized_domain}"
  end
  emails
end

# Safer approach: avoid global variables
def safe_process_emails(text)
  text.scan(/(\w+)@(\w+\.\w+)/).map do |user, domain|
    normalized_domain = normalize_domain(domain)
    "#{user}@#{normalized_domain}"
  end
end
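
The scoping rules can be checked directly. In the sketch below, normalize_domain is a hypothetical helper (the examples above never define it) and scope_demo exists only for illustration:

```ruby
# Hypothetical helper: it runs its own regex internally
def normalize_domain(domain)
  domain.downcase.sub(/\Awww\./, "")
end

def scope_demo
  "Bob@WWW.Example.com" =~ /(\w+)@([\w.]+)/
  before = $1           # "Bob"

  normalize_domain($2)  # regex runs in a different method frame
  after_call = $1       # still "Bob"

  $2 =~ /\Awww\./i      # regex runs in this method frame
  after_inline = $1     # nil, the new match has no group 1

  [before, after_call, after_inline]
end

p scope_demo  # ["Bob", "Bob", nil]
```

Block parameters, as in safe_process_emails, sidestep the question entirely because nothing is read from shared state.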

Reference

Core Classes and Methods

Class/Method	Purpose	Returns
`Regexp.new(pattern)`	Create regex with capture groups	`Regexp` object
`String#match(pattern)`	Find first match with groups	`MatchData` or `nil`
`String#match?(pattern)`	Test for a match without creating a `MatchData` object or setting `$~`	`true` or `false`
`String#scan(pattern)`	Find all matches with groups	Array of captures
`String#gsub(pattern, replacement)`	Replace with group references	Modified string

MatchData Object Methods

Method	Parameters	Returns	Description
`#[](index)`	Integer index	String or `nil`	Access group by numeric index
`#[](name)`	Symbol or String	String or `nil`	Access named group
`#captures`	None	Array	All capture groups as array
`#names`	None	Array	Names of all named groups
`#named_captures`	None	Hash	Hash of name to capture value
`#size`	None	Integer	Total captures plus full match
`#values_at(*indices)`	Indices	Array	Multiple groups by index
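
A short sketch exercising several of these methods on one match (the date string is just sample input):

```ruby
m = "2024-03-15".match(/(\d{4})-(\d{2})-(\d{2})/)

year, month, day = m.captures       # destructure all groups at once
first_and_last = m.values_at(1, 3)  # pick specific groups by index

puts m.size           # 4 (full match + 3 groups)
p [year, month, day]  # ["2024", "03", "15"]
p first_and_last      # ["2024", "15"]
p m.to_a.first        # "2024-03-15"
```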

Capture Group Syntax

Syntax	Type	Description	Example
`(pattern)`	Basic	Creates numbered group	`/(\w+)@(\w+)/`
`(?<name>pattern)`	Named	Creates named group	`/(?<user>\w+)@(?<domain>\w+)/`
`(?:pattern)`	Non-capturing	Groups without capture	`/(?:https?:\/\/)(\w+)/`
`(?=pattern)`	Lookahead	Positive lookahead	`/\w+(?=@)/`
`(?!pattern)`	Lookahead	Negative lookahead	`/\w+(?!@)/`
`(?<=pattern)`	Lookbehind	Positive lookbehind	`/(?<=@)\w+/`
`(?<!pattern)`	Lookbehind	Negative lookbehind	`/(?<!@)\w+/`

Global Variables

Variable	Content	Scope
`$&`	Full match text	Thread- and method-local
`$1, $2, $3...`	Capture group 1, 2, 3...	Thread- and method-local
`$+`	Highest-numbered group that matched	Thread- and method-local
`` $` ``	Text before match	Thread- and method-local
`$'`	Text after match	Thread- and method-local
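
A minimal sketch that reads each of these variables after a single match; the input string is made up for illustration, and the values are copied into locals immediately because any later regex operation would replace them:

```ruby
index = ("id: alpha=42;" =~ /(\w+)=(\d+)/)

# Copy the values right away, before another match can overwrite them
full, key, last, before, after = $&, $1, $+, $`, $'

p index   # 4
p full    # "alpha=42"
p key     # "alpha"
p last    # "42"
p before  # "id: "
p after   # ";"
```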

Replacement String References

Reference	Type	Description	Example
`\1, \2, \3...`	Numbered	Reference numbered groups	`'\2, \1'`
`\k<name>`	Named	Reference named groups	`'\k<last>, \k<first>'`
`\&`	Full match	Entire matched text	`'[\&]'`
`` \` ``	Pre-match	Text before match	`` '\`_\&' ``
`\'`	Post-match	Text after match	`"\\&_\\'"`
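
Quoting matters for these references because double-quoted Ruby strings process backslash escapes before the regex engine ever sees them. A small sketch of the difference, using made-up inputs:

```ruby
swapped_single = "ab".sub(/(a)(b)/, '\2\1')    # "ba"
swapped_double = "ab".sub(/(a)(b)/, "\\2\\1")  # "ba", backslashes escaped
broken         = "ab".sub(/(a)(b)/, "\2\1")    # "\u0002\u0001", octal escapes

wrapped = "price".sub(/ric/, '[\&]')   # "p[ric]e"
pre     = "price".sub(/ric/, '<\`>')   # "p<p>e"
post    = "price".sub(/ric/, "<\\'>")  # "p<e>e"

p swapped_single, swapped_double, broken
p wrapped, pre, post
```

The post-match reference is written with double quotes here because \' inside a single-quoted string is itself an escape for a literal quote character.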

Common Patterns

Pattern	Purpose	Groups	Example Match
`/^(\w+)@(\w+\.\w+)$/`	Email validation	user, domain	`john@example.com`
`/^(\d{4})-(\d{2})-(\d{2})$/`	Date parsing	year, month, day	`2024-03-15`
`/^(https?):\/\/([\w.-]+)(?::(\d+))?/`	URL components	protocol, host, port	`https://example.com:8080`
`/^(\+?\d{1,3})[-.\s]?(\d{3})[-.\s]?(\d{4})$/`	Phone numbers	country, area, number	`+1-555-1234`
`/^([A-Za-z0-9+\/]{4})*([A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)?$/`	Base64 validation	data, padding	`SGVsbG8=`

Error Types

Error	Cause	Example
`RegexpError`	Invalid pattern syntax	`Regexp.new("(unclosed")`
`NoMethodError`	Accessing groups on `nil`	`"text".match(/\d+/)[1]`
`IndexError`	Undefined group name reference	`"text".match(/(?<word>\w+)/)[:missing]`
`TypeError`	Wrong argument type	`"text".match(123)`
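
The sketch below triggers several of these failures and reports the exception class raised. The error_class helper is hypothetical, written only for this demonstration:

```ruby
# Hypothetical helper: returns the class of the error raised by the block,
# or nil when the block completes normally
def error_class
  yield
  nil
rescue StandardError => e
  e.class
end

p error_class { Regexp.new("(unclosed") }                 # RegexpError
p error_class { "text".match(/\d+/)[1] }                  # NoMethodError
p error_class { "text".match(/(?<word>\w+)/)[:missing] }  # IndexError
p error_class { "text".match(123) }                       # TypeError
```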
