CrackedRuby - Regular Expression Syntax

Overview

Ruby implements regular expressions through the Regexp class, which provides pattern matching for string processing and validation. Patterns follow the syntax of the Onigmo engine, which handles Unicode and other multibyte encodings and offers a large set of metacharacters.

The core classes involved in regex operations include Regexp for pattern definition, MatchData for capture results, and String methods that accept regex parameters. Ruby supports both literal regex syntax using forward slashes and constructor-based creation through Regexp.new.

# Literal syntax
pattern = /[a-z]+/i

# Constructor syntax
pattern = Regexp.new('[a-z]+', Regexp::IGNORECASE)

# Basic matching
text = "Hello World"
result = text.match(/[A-Z][a-z]+/)
# => #<MatchData "Hello">

Regular expressions in Ruby are first-class objects that can be stored in variables, passed as arguments, and built at runtime. A match attempt returns a MatchData object on success and nil on failure.
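Because a failed match yields nil, code that reads from the result needs a guard. The following is a minimal sketch of three common ways to handle both outcomes; the variable names are illustrative only.

```ruby
pattern = /[A-Z][a-z]+/

hit  = "Hello World".match(pattern)   # MatchData
miss = "hello world".match(pattern)   # nil

# Guard before reading from the result
hit_text  = hit ? hit[0] : "(no match)"    # "Hello"
miss_text = miss ? miss[0] : "(no match)"  # "(no match)"

# The safe navigation operator skips the call when the receiver is nil
first_word = "hello world".match(pattern)&.[](0)   # nil

# match? answers yes/no without building a MatchData
has_capital = "Hello World".match?(/[A-Z]/)        # true
```

When only a yes/no answer is needed, match? is usually the simpler choice because there is no result object to check.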

Ruby's regex implementation supports named captures, lookahead and lookbehind assertions, conditional expressions, and atomic grouping. The engine processes patterns left-to-right with backtracking, making certain complex patterns computationally expensive.
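A short sketch of three of the features listed above, using made-up sample strings. Each line uses String#[] with a pattern, which returns the matched text or nil.

```ruby
# Named capture: groups are addressable by name
date = "2023-10-15".match(/(?<year>\d{4})-(?<month>\d{2})/)
year = date[:year]                        # "2023"

# Lookahead: digits only when followed by "px"; "px" itself is not consumed
width = "width: 120px"[/\d+(?=px)/]       # "120"

# Lookbehind: digits only when preceded by "$"
price = 'cost: $45'[/(?<=\$)\d+/]         # "45"

# Negative lookahead: "foo" only when not followed by "bar"
plain_foo = "foobar foobaz"[/foo(?!bar)\w+/]   # "foobaz"
```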

Basic Usage

Ruby offers two ways to create a regular expression. Literals use forward slashes with optional modifier letters after the closing slash, while Regexp.new accepts a string pattern and integer flag constants.

# Case-insensitive matching
email_pattern = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i

# Multiline pattern with comments
phone_pattern = /
  \A            # Start of string
  \d{3}         # Area code
  [-\s]?        # Optional separator
  \d{3}         # Exchange
  [-\s]?        # Optional separator
  \d{4}         # Number
  \z            # End of string
/x

# Constructor with flags (id is a placeholder value for illustration)
id = 42
dynamic_pattern = Regexp.new("user_#{id}", Regexp::IGNORECASE | Regexp::MULTILINE)

Character classes define sets of acceptable characters using square brackets. Ruby supports predefined classes like \d for digits, \w for word characters, and \s for whitespace, plus Unicode property classes.

# Custom character class
hex_color = /\A#[0-9a-fA-F]{6}\z/

# Negated character class
non_digits = /[^\d]+/

# Unicode property class
japanese_text = /\p{Hiragana}+/

# Range-based class
uppercase_letters = /[A-Z]+/

Quantifiers specify repetition patterns for characters or groups. Ruby supports greedy, lazy, and possessive quantifiers that affect backtracking behavior during pattern matching.

text = "<tag>content</tag>"

# Greedy quantifier (default)
greedy_match = text.match(/<.+>/)
# => #<MatchData "<tag>content</tag>">

# Lazy quantifier
lazy_match = text.match(/<.+?>/)
# => #<MatchData "<tag>">

# Possessive quantifier (no backtracking)
possessive_pattern = /\d++/
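The effect of a possessive quantifier is easiest to see when the rest of the pattern needs a character back. A minimal sketch with an arbitrary digit string:

```ruby
# Greedy \d+ gives back one digit so the trailing \d can match
greedy = "12345".match(/\d+\d/)
greedy_text = greedy && greedy[0]        # "12345"

# Possessive \d++ keeps every digit it consumed, so the trailing \d
# has nothing left to match and the whole attempt fails
possessive = "12345".match(/\d++\d/)     # nil

# When nothing has to be given back, both forms match the same text
same = "id 42;"[/\d++;/] == "id 42;"[/\d+;/]   # true
```

Possessive quantifiers are therefore safe only where the quantified part never has to give characters back to the rest of the pattern.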

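Anchors position matches at specific string boundaries. Ruby provides start-of-string (\A), end-of-string (\z), line boundaries (^, $), and word boundaries (\b). Unlike in many other engines, ^ and $ always match at line boundaries in Ruby; no flag is needed to enable that.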

# Full string validation
valid_username = /\A[a-zA-Z0-9_]{3,20}\z/

# Line-based matching (^ and $ match at every line; the m flag would let .+ run across newlines)
extract_lines = /^Error:.+$/

# Word boundary matching
find_whole_word = /\btest\b/

# Position matching example
log_entry = "2023-10-15 ERROR: Database connection failed"
timestamp = log_entry.match(/\A\d{4}-\d{2}-\d{2}/)
# => #<MatchData "2023-10-15">
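The difference between the line anchors and the string anchors shows up as soon as the input contains a newline. A small sketch with made-up strings:

```ruby
line = "done\n"

dollar  = line.match?(/done$/)    # true:  $ matches before a newline
upper_z = line.match?(/done\Z/)   # true:  \Z tolerates one trailing newline
lower_z = line.match?(/done\z/)   # false: \z is the absolute end of the string

# ^ matches at the start of every line, \A only at the start of the string
text = "first\nsecond"
line_starts   = text.scan(/^\w+/)     # ["first", "second"]
string_starts = text.scan(/\A\w+/)    # ["first"]
```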

Performance & Memory

Regular expression performance depends on pattern complexity, input string length, and backtracking behavior. Simple character classes and literal strings perform faster than complex alternations and nested quantifiers.

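Catastrophic backtracking occurs when a pattern with nested quantifiers meets input that almost matches. Before reporting failure, the engine tries every way of dividing the input among the quantifiers, and the number of combinations grows exponentially with input length. Ruby 3.2 and later memoize match states for many patterns, so the benchmark below may show a much smaller gap on recent versions.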

require 'benchmark'

# Problematic pattern with nested quantifiers
catastrophic_pattern = /(a+)+b/
safe_pattern = /a+b/

test_string = "a" * 20 + "c"  # No 'b' at end forces backtracking

Benchmark.bm(15) do |x|
  x.report("Safe pattern:") do
    10000.times { safe_pattern.match(test_string) }
  end

  x.report("Catastrophic:") do
    10.times { catastrophic_pattern.match(test_string) }  # Much fewer iterations
  end
end

Atomic grouping with the (?>pattern) syntax prevents the engine from backtracking into the group once it has matched. This discards alternatives that cannot lead to a match and removes redundant evaluation paths.

# Without atomic grouping - allows backtracking
inefficient = /\d+\.\d+/

# With atomic grouping - prevents backtracking
efficient = /(?>\d+)\.\d+/

# Sample input: both patterns match it; they differ only in the work done when a match fails
large_number = "1234567890123456789.0"
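An atomic group also changes what can match, not just how fast. The sketch below uses a deliberately small input to show the group refusing to give characters back.

```ruby
# Plain quantifier: a+ backs off one "a" so the literal "ab" can match
plain = "aaab".match?(/a+ab/)            # true

# Atomic group: once (?>a+) has taken the leading run of "a" characters
# it never gives one back, so the following literal "a" cannot match
atomic = "aaab".match?(/(?>a+)ab/)       # false

# In the decimal pattern the digits never need to be given back,
# so the atomic version accepts the same input as the plain one
decimal = "1234567890123456789.0".match?(/(?>\d+)\.\d+/)   # true
```

This is why atomic grouping suits tokens such as digit runs, where a shorter match of the token could never help the rest of the pattern succeed.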

Memory usage increases with pattern complexity and capture group count. Named captures create additional references, while non-capturing groups (?:pattern) reduce memory overhead without affecting matching behavior.

# Memory-intensive pattern with many captures
memory_heavy = /(.*)(.*)(.*)(.*)(.*)(.*)(.*)(.*)/

# Optimized with non-capturing groups
memory_light = /(?:.*){8}/

# Named captures balance readability and memory
balanced = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
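To make the difference concrete, this sketch matches the same date string with capturing, non-capturing, and named groups and inspects what each MatchData stores.

```ruby
date = "2023-10-15"

capturing     = date.match(/(\d{4})-(\d{2})-(\d{2})/)
non_capturing = date.match(/(?:\d{4})-(?:\d{2})-(?:\d{2})/)
named         = date.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)

stored     = capturing.captures         # ["2023", "10", "15"]
not_stored = non_capturing.captures     # [] (nothing kept per group)
whole      = non_capturing[0]           # "2023-10-15" (overall match is unchanged)
by_name    = named.named_captures       # {"year"=>"2023", "month"=>"10", "day"=>"15"}
```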

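Compiling a pattern takes time, so frequently used patterns benefit from being compiled once and reused. Ruby compiles a literal without interpolation once, when the code is parsed. A literal containing #{} interpolation, or a call to Regexp.new, is compiled every time it is evaluated, so dynamic patterns need manual caching.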

# Cached literal pattern (automatic)
CONSTANT_PATTERN = /[a-z]+/i

# Manual caching for dynamic patterns
class PatternCache
  def initialize
    @cache = {}
  end

  def get_pattern(base, flags = 0)
    key = [base, flags]
    @cache[key] ||= Regexp.new(base, flags)
  end
end

cache = PatternCache.new
pattern = cache.get_pattern('user_\d+', Regexp::IGNORECASE)
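The caching behavior can be observed through object identity. In this sketch the three method names are hypothetical; equal? reports whether two calls returned the very same Regexp object.

```ruby
def literal_pattern
  /user_\d+/i          # no interpolation: compiled once
end

def interpolated_pattern(prefix)
  /#{prefix}_\d+/      # compiled again on every call
end

def once_pattern(prefix)
  /#{prefix}_\d+/o     # interpolated and compiled on the first call only
end

same_literal  = literal_pattern.equal?(literal_pattern)                             # true
same_dynamic  = interpolated_pattern("user").equal?(interpolated_pattern("user"))  # false
first_source  = once_pattern("user").source     # "user_\\d+"
second_source = once_pattern("admin").source    # still "user_\\d+"
```

The last line shows the cost of the o flag: later values of the interpolated variable are silently ignored, which is why an explicit cache keyed on the inputs is usually the safer design.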

Error Handling & Debugging

Regular expression compilation errors occur when a pattern contains invalid syntax or unsupported constructs. Regexp.new raises RegexpError when the pattern is created, not when it is matched. An invalid regex literal is rejected even earlier, as a SyntaxError when the file is parsed.

begin
  # Invalid quantifier syntax
  invalid_pattern = Regexp.new('*invalid')
rescue RegexpError => e
  puts "Pattern compilation failed: #{e.message}"
  # => Pattern compilation failed: target of repeat operator is not specified: /*invalid/
end

begin
  # Unmatched parenthesis
  unbalanced = Regexp.new('(unclosed group')
rescue RegexpError => e
  puts "Syntax error: #{e.message}"
end
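When patterns come from user input or configuration, the rescue can be wrapped in a small helper. The name compile_or_nil is hypothetical, chosen for this sketch only.

```ruby
# Hypothetical helper: returns a Regexp, or nil when the source is invalid
def compile_or_nil(source)
  Regexp.new(source)
rescue RegexpError
  nil
end

valid   = compile_or_nil('\d+')         # a Regexp equivalent to /\d+/
invalid = compile_or_nil('(unclosed')   # nil

# Callers can then branch on the result instead of rescuing everywhere
status = invalid ? "ok" : "rejected"    # "rejected"
```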

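Encoding mismatches between patterns and strings cause matching failures. A pattern with a fixed encoding (for example, a literal containing non-ASCII characters) raises Encoding::CompatibilityError when matched against a string in a different encoding that contains non-ASCII characters. Patterns made only of ASCII characters adapt to the encoding of the string.

# Encoding compatibility check
def safe_match(pattern, text)
  if pattern.fixed_encoding? && pattern.encoding != text.encoding
    # Rebuild the pattern from its source, transcoded to the text's encoding
    flags = pattern.options & (Regexp::IGNORECASE | Regexp::EXTENDED | Regexp::MULTILINE)
    compatible_pattern = Regexp.new(pattern.source.encode(text.encoding), flags)
    text.match(compatible_pattern)
  else
    text.match(pattern)
  end
rescue EncodingError => e
  # Covers Encoding::CompatibilityError and failed transcoding
  puts "Encoding mismatch: #{e.message}"
  nil
end

# Example usage
latin1_text = "café".encode('ISO-8859-1')
utf8_pattern = /café/   # non-ASCII literal, so the encoding is fixed to UTF-8
result = safe_match(utf8_pattern, latin1_text)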

Debugging complex patterns requires systematic decomposition and testing. Ruby provides the MatchData object with detailed capture information and position data.

def debug_regex(pattern, text)
  match = text.match(pattern)
  return "No match found" unless match

  puts "Full match: '#{match[0]}' at position #{match.begin(0)}-#{match.end(0)}"
  puts "Captures:"

  match.captures.each_with_index do |capture, index|
    puts "  Group #{index + 1}: '#{capture}'"
  end

  if match.names.any?
    puts "Named captures:"
    match.names.each do |name|
      puts "  #{name}: '#{match[name]}'"
    end
  end

  match
end

# Usage example
email_pattern = /(?<user>[\w+\-.]+)@(?<domain>[a-z\d\-]+(?:\.[a-z\d\-]+)*)/i
result = debug_regex(email_pattern, "Contact: john.doe@example.com")

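Timeout protection prevents runaway regex operations from blocking application threads. Ruby's Timeout module can wrap a match in a time limit. Ruby 3.2 and later also offer a built-in limit through Regexp.timeout= and the timeout: keyword of Regexp.new, which raise Regexp::TimeoutError when exceeded.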

require 'timeout'

def safe_regex_match(pattern, text, timeout_seconds = 5)
  Timeout::timeout(timeout_seconds) do
    text.match(pattern)
  end
rescue Timeout::Error
  puts "Regex operation timed out after #{timeout_seconds} seconds"
  nil
rescue RegexpError => e
  puts "Regex compilation error: #{e.message}"
  nil
end

# Example with potentially slow pattern
slow_pattern = /(a+)+b/
problematic_input = "a" * 25 + "c"
result = safe_regex_match(slow_pattern, problematic_input, 2)

Common Pitfalls

Greedy quantifiers consume maximum possible characters before attempting to match subsequent pattern elements, often producing unexpected results when developers expect minimal matching behavior.

html = '<div class="main">Content</div>'

# Common mistake - greedy quantifier matches too much
wrong_pattern = /<.*>/
wrong_match = html.match(wrong_pattern)
# => #<MatchData "<div class=\"main\">Content</div>">

# Correct approach - lazy quantifier
correct_pattern = /<.*?>/
correct_match = html.match(correct_pattern)
# => #<MatchData "<div class=\"main\">">

# Alternative - negated character class (more efficient)
efficient_pattern = /<[^>]*>/
efficient_match = html.match(efficient_pattern)
# => #<MatchData "<div class=\"main\">">

Anchor confusion leads to validation bypasses when developers use line anchors (^, $) instead of string anchors (\A, \z) for input validation. Line anchors match line boundaries within multiline strings.

# Dangerous validation - uses line anchors
unsafe_email_check = /^[\w\.-]+@[\w\.-]+\.[a-z]{2,}$/i

# Multiline input bypasses validation
malicious_input = "valid@email.com\n<script>alert('xss')</script>"
unsafe_result = malicious_input.match(unsafe_email_check)
# => #<MatchData "valid@email.com"> (matches despite script tag)

# Secure validation - uses string anchors
safe_email_check = /\A[\w\.-]+@[\w\.-]+\.[a-z]{2,}\z/i
safe_result = malicious_input.match(safe_email_check)
# => nil (correctly rejects malicious input)

Escape sequence handling differs between single and double-quoted strings when constructing regex patterns. Backslashes require additional escaping in double-quoted strings.

# Single-quoted string preserves literal backslashes
single_quoted = Regexp.new('\d+\.\d+')
# Equivalent to: /\d+\.\d+/

# Double-quoted string requires escaped backslashes
double_quoted = Regexp.new("\\d+\\.\\d+")
# Equivalent to: /\d+\.\d+/

# Common mistake - insufficient escaping in double quotes
wrong_pattern = Regexp.new("\d+\.\d+")  # Interprets \d as literal d
# Creates pattern: /d+.d+/ (matches 'd' instead of digits)

# Testing the difference
test_string = "123.45"
p single_quoted.match(test_string)    # => #<MatchData "123.45">
p double_quoted.match(test_string)    # => #<MatchData "123.45">
p wrong_pattern.match(test_string)    # => nil
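A related escaping problem arises when runtime text is interpolated into a pattern: metacharacters in that text keep their special meaning unless they are escaped. A minimal sketch using Regexp.escape, with a made-up version string:

```ruby
version = "1.5"

# Unescaped: the dot in the version matches any character
loose = Regexp.new("v#{version}")
loose_hit = loose.match?("v1x5")         # true (probably unintended)

# Escaped: the dot matches only a literal dot
strict = Regexp.new("v#{Regexp.escape(version)}")
strict_miss = strict.match?("v1x5")      # false
strict_hit  = strict.match?("v1.5")      # true
strict_src  = strict.source              # "v1\\.5"
```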

Matching non-ASCII text requires Unicode-aware classes such as \p{L} or POSIX brackets like [[:alpha:]]. Ranges such as [a-zA-Z], and the shorthand \w, cover only ASCII characters, so they reject accented and other international letters.

# ASCII-only pattern fails with Unicode
ascii_name = /^[a-zA-Z\s]+$/
unicode_name = "José María"
ascii_result = unicode_name.match(ascii_name)
# => nil (fails to match accented characters)

# Unicode-aware pattern
unicode_pattern = /\A[\p{L}\s]+\z/
unicode_result = unicode_name.match(unicode_pattern)
# => #<MatchData "José María">

# Alternative listing specific accented characters
spanish_pattern = /\A[a-zA-ZáéíóúÁÉÍÓÚñÑ\s]+\z/
spanish_result = unicode_name.match(spanish_pattern)
# => #<MatchData "José María">
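The same name can be used to compare what each kind of class actually covers. This sketch assumes a UTF-8 source file, which is Ruby's default.

```ruby
name = "José María"

ascii_words = name.scan(/\w+/)             # ["Jos", "Mar", "a"]
unicode     = name.scan(/\p{L}+/)          # ["José", "María"]
posix       = name.scan(/[[:alpha:]]+/)    # ["José", "María"]
```

The first result shows the failure mode: \w silently splits words at each accented letter rather than raising an error.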

Case sensitivity mistakes occur when patterns don't specify appropriate flags for intended matching behavior. Ruby regex defaults to case-sensitive matching unless explicitly configured otherwise.

# Case-sensitive matching (default)
case_sensitive = /hello/
mixed_case_text = "Hello World"
sensitive_result = mixed_case_text.match(case_sensitive)
# => nil (doesn't match due to case difference)

# Case-insensitive matching
case_insensitive = /hello/i
insensitive_result = mixed_case_text.match(case_insensitive)
# => #<MatchData "Hello">

# Dynamic case handling
def flexible_match(pattern_string, text, ignore_case = false)
  flags = ignore_case ? Regexp::IGNORECASE : 0
  pattern = Regexp.new(pattern_string, flags)
  text.match(pattern)
end

result = flexible_match("HELLO", "hello world", true)
# => #<MatchData "hello">

Reference

Regex Literal Syntax

Syntax	Description	Example
`/pattern/`	Basic literal regex	`/hello/`
`/pattern/flags`	Literal with modifiers	`/hello/i`
`%r{pattern}`	Alternative delimiter	`%r{https?://}`
`%r|pattern|flags`	Custom delimiter with flags	`%r|/path/|i`

Constructor Methods

Method	Parameters	Returns	Description
`Regexp.new(pattern, flags)`	pattern (String), flags (Integer)	`Regexp`	Creates regex from string
`Regexp.compile(pattern, flags)`	pattern (String), flags (Integer)	`Regexp`	Alias for new
`Regexp.escape(string)`	string (String)	`String`	Escapes special characters
`Regexp.union(*patterns)`	patterns (Array)	`Regexp`	Creates alternation pattern
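As a brief illustration of Regexp.union from the table above: string arguments are escaped automatically, while Regexp arguments keep their own flags. The word lists here are arbitrary.

```ruby
# Strings are escaped, so "a.b" matches only a literal dot
keywords = Regexp.union("if", "else", "a.b")
source   = keywords.source              # "if|else|a\\.b"
dot_ok   = keywords.match?("a.b")       # true
dot_bad  = keywords.match?("axb")       # false

# Each Regexp keeps its own flags inside the union
mixed    = Regexp.union(/cat/i, /dog/)
cat_up   = mixed.match?("CAT")          # true
dog_up   = mixed.match?("DOG")          # false

# An array is accepted too, which suits word lists built at runtime
stop_words = Regexp.union(%w[the and or])
found = "salt and pepper".scan(/\b#{stop_words}\b/)   # ["and"]
```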

Modifier Flags

Flag	Constant	Description	Literal
`i`	`Regexp::IGNORECASE`	Case-insensitive matching	`/pattern/i`
`m`	`Regexp::MULTILINE`	Dot also matches newline	`/pattern/m`
`x`	`Regexp::EXTENDED`	Extended mode: ignores whitespace and comments in the pattern	`/pattern/x`
`o`	N/A	Perform `#{}` interpolation only once	`/pattern/o`
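The flags are easiest to tell apart by their effect on a match. A short sketch with made-up inputs; note that in Ruby the m flag changes the dot, not the ^ and $ anchors.

```ruby
text = "line one\nline two"

without_m = text[/one.line/]       # nil: dot does not match "\n"
with_m    = text[/one.line/m]      # "one\nline"

# x: whitespace and comments inside the pattern are ignored
extended = "2023-10".match?(/ \d{4} - \d{2} /x)        # true

# Flags combine as a bitmask in the constructor
combined = Regexp.new("hello.world", Regexp::IGNORECASE | Regexp::MULTILINE)
combined_hit = combined.match?("HELLO\nWORLD")         # true

# A flag can also be scoped to part of a pattern
scoped = /(?i:hello) world/
scoped_hit  = scoped.match?("HELLO world")             # true
scoped_miss = scoped.match?("HELLO WORLD")             # false
```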

Character Classes

Class	Description	Equivalent
`.`	Any character except newline	`[^\n]`
`\d`	Digit character	`[0-9]`
`\D`	Non-digit character	`[^0-9]`
`\w`	Word character	`[a-zA-Z0-9_]`
`\W`	Non-word character	`[^a-zA-Z0-9_]`
`\s`	Whitespace character	`[\t\n\v\f\r ]`
`\S`	Non-whitespace character	`[^\t\n\v\f\r ]`

Quantifiers

Quantifier	Type	Description	Example
`*`	Greedy	Zero or more	`/a*/`
`+`	Greedy	One or more	`/a+/`
`?`	Greedy	Zero or one	`/a?/`
`{n}`	Exact	Exactly n times	`/a{3}/`
`{n,}`	Greedy	n or more times	`/a{3,}/`
`{n,m}`	Greedy	Between n and m times	`/a{3,5}/`
`*?`	Lazy	Zero or more minimal	`/a*?/`
`+?`	Lazy	One or more minimal	`/a+?/`
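`??`	Lazy	Zero or one minimal	`/a??/`
`*+`	Possessive	Zero or more, no backtracking	`/a*+/`
`++`	Possessive	One or more, no backtracking	`/a++/`
`?+`	Possessive	Zero or one, no backtracking	`/a?+/`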

Anchors

Anchor	Description	Example
`\A`	Start of string	`/\A[A-Z]/`
`\z`	End of string	`/\d\z/`
`\Z`	End of string, or just before a trailing newline	`/\w\Z/`
`^`	Start of line	`/^Error/`
`$`	End of line	`/\d$/`
`\b`	Word boundary	`/\btest\b/`
`\B`	Non-word boundary	`/\Btest\B/`

Groups and Captures

Syntax	Description	Example
`(pattern)`	Capturing group	`/(a+)(b+)/`
`(?:pattern)`	Non-capturing group	`/(?:a+)b+/`
`(?<name>pattern)`	Named capture	`/(?<year>\d{4})/`
`(?'name'pattern)`	Named capture alternative	`/(?'year'\d{4})/`
`\1, \2`	Backreference by number	`/(a+)\1/`
`\k<name>`	Backreference by name	`/(?<word>\w+)\k<word>/`

Lookahead and Lookbehind

Syntax	Type	Description	Example
`(?=pattern)`	Positive lookahead	Look ahead for pattern	`/\d+(?=px)/`
`(?!pattern)`	Negative lookahead	Look ahead not pattern	`/\d+(?!px)/`
`(?<=pattern)`	Positive lookbehind	Look behind for pattern	`/(?<=\$)\d+/`
`(?<!pattern)`	Negative lookbehind	Look behind not pattern	`/(?<!\$)\d+/`

String Methods with Regex

Method	Parameters	Returns	Description
`#match(pattern, pos)`	pattern (Regexp), pos (Integer)	`MatchData` or `nil`	Returns match data
`#match?(pattern, pos)`	pattern (Regexp), pos (Integer)	`Boolean`	Returns boolean result
`#scan(pattern)`	pattern (Regexp)	`Array`	Returns all matches
`#gsub(pattern, replacement)`	pattern (Regexp), replacement (String)	`String`	Global substitution
`#gsub!(pattern, replacement)`	pattern (Regexp), replacement (String)	`String` or `nil`	In-place substitution
`#sub(pattern, replacement)`	pattern (Regexp), replacement (String)	`String`	Single substitution
`#split(pattern, limit)`	pattern (Regexp), limit (Integer)	`Array`	Split by pattern
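A compact sketch of the methods in the table above, applied to one made-up log line. The block form of gsub is also shown; it receives each match and returns its replacement.

```ruby
log = "GET /a 200, GET /b 404, POST /c 500"

codes    = log.scan(/\d{3}/)             # ["200", "404", "500"]
requests = log.scan(/(\w+) (\/\w+)/)     # [["GET", "/a"], ["GET", "/b"], ["POST", "/c"]]
first    = log.sub(/\d{3}/, "???")       # replaces only the first status code
labelled = log.gsub(/\d{3}/) { |code| code.to_i >= 400 ? "ERR" : "OK" }
entries  = log.split(/,\s*/)             # three entries
head, rest = log.split(/,\s*/, 2)        # limit 2: first entry and the remainder

# Backreferences in a replacement string need single quotes (or doubled backslashes)
swapped = "John Smith".sub(/(\w+) (\w+)/, '\2, \1')    # "Smith, John"
```

With capture groups in the pattern, scan returns an array of arrays rather than an array of strings, which is a frequent source of surprise.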

MatchData Methods

Method	Returns	Description
`#[](index)`	`String` or `nil`	Access capture by index
`#captures`	`Array`	All captured groups
`#named_captures`	`Hash`	Named captures as hash
`#names`	`Array`	Names of capture groups
`#begin(index)`	`Integer`	Start position of capture
`#end(index)`	`Integer`	End position of capture
`#offset(index)`	`Array`	Begin and end positions
`#pre_match`	`String`	String before match
`#post_match`	`String`	String after match
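The MatchData methods above can be exercised on a single match. The input string is made up for this sketch; positions are character offsets from the start of the string.

```ruby
m = "Contact: john@example.com today".match(/(?<user>\w+)@(?<domain>[\w.]+)/)

whole    = m[0]                 # "john@example.com"
user     = m[:user]             # "john"
groups   = m.captures           # ["john", "example.com"]
by_name  = m.named_captures     # {"user"=>"john", "domain"=>"example.com"}
names    = m.names              # ["user", "domain"]
start_at = m.begin(0)           # 9
end_at   = m.end(0)             # 25
span     = m.offset(:domain)    # [14, 25]
before   = m.pre_match          # "Contact: "
after    = m.post_match         # " today"
```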
