CrackedRuby

Regular Expression Syntax

A comprehensive guide to Ruby's regular expression syntax, covering pattern creation, matching operations, metacharacters, and optimization techniques.

Overview

Ruby implements regular expressions through the Regexp class, providing pattern matching capabilities for string processing and validation. Regular expressions in Ruby follow the Onigmo engine syntax, supporting Unicode and multibyte characters with extensive metacharacter support.

The core classes involved in regex operations include Regexp for pattern definition, MatchData for capture results, and String methods that accept regex parameters. Ruby supports both literal regex syntax using forward slashes and constructor-based creation through Regexp.new.

# Literal syntax
pattern = /[a-z]+/i

# Constructor syntax  
pattern = Regexp.new('[a-z]+', Regexp::IGNORECASE)

# Basic matching
text = "Hello World"
result = text.match(/[A-Z][a-z]+/)
# => #<MatchData "Hello">

Regular expressions in Ruby are first-class objects that can be stored in variables, passed as arguments, and modified at runtime. The pattern matching process returns either a MatchData object for successful matches or nil for failures.
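
As a small illustration of patterns as ordinary objects, the sketch below stores a Regexp in a constant, passes it to a helper, and handles the nil that a failed match returns. The names WORD and first_word are hypothetical and exist only for this example.

```ruby
# WORD is an ordinary Regexp object held in a constant (illustrative name).
WORD = /[A-Za-z]+/

# Hypothetical helper: returns the first match as a String, or nil.
def first_word(text, pattern = WORD)
  match = text.match(pattern)   # MatchData on success, nil on failure
  match && match[0]
end

p first_word("42 apples")          # "apples"
p first_word("12345")              # nil
p first_word("42 apples", /\d+/)   # "42"
```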

Ruby's regex implementation supports named captures, lookahead and lookbehind assertions, conditional expressions, and atomic grouping. The engine processes patterns left-to-right with backtracking, making certain complex patterns computationally expensive.

Basic Usage

Creating regular expressions requires understanding Ruby's literal and constructor syntax. Literal expressions use forward slashes with optional modifiers, while constructors accept string patterns and flag constants.

# Case-insensitive matching
email_pattern = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i

# Multiline pattern with comments
phone_pattern = /
  \A            # Start of string
  \d{3}         # Area code
  [-\s]?        # Optional separator
  \d{3}         # Exchange
  [-\s]?        # Optional separator  
  \d{4}         # Number
  \z            # End of string
/x

# Constructor with flags (id is an example value)
id = 42
dynamic_pattern = Regexp.new("user_#{id}", Regexp::IGNORECASE | Regexp::MULTILINE)

Character classes define sets of acceptable characters using square brackets. Ruby supports predefined classes like \d for digits, \w for word characters, and \s for whitespace, plus Unicode property classes.

# Custom character class
hex_color = /\A#[0-9a-fA-F]{6}\z/

# Negated character class
non_digits = /[^\d]+/

# Unicode property class
japanese_text = /\p{Hiragana}+/

# Range-based class
uppercase_letters = /[A-Z]+/

Quantifiers specify repetition patterns for characters or groups. Ruby supports greedy, lazy, and possessive quantifiers that affect backtracking behavior during pattern matching.

text = "<tag>content</tag>"

# Greedy quantifier (default)
greedy_match = text.match(/<.+>/)
# => #<MatchData "<tag>content</tag>">

# Lazy quantifier
lazy_match = text.match(/<.+?>/)
# => #<MatchData "<tag>">

# Possessive quantifier (no backtracking)
possessive_pattern = /\d++/
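
The possessive pattern above is only defined, not exercised. A minimal sketch of how it differs from the greedy form, using an input chosen for illustration:

```ruby
text = "12345"

# Greedy: \d+ first takes all five digits, then gives one back so "5" can match.
greedy = text.match(/\d+5/)

# Possessive: \d++ keeps every digit it took, so the trailing "5" can never match.
possessive = text.match(/\d++5/)

p greedy       # #<MatchData "12345">
p possessive   # nil
```

The possessive form fails quickly because the engine has no saved positions to retry, which is the property that makes it useful against backtracking-heavy inputs.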

Anchors position matches at specific string boundaries. Ruby provides start-of-string (\A), end-of-string (\z), line boundaries (^, $), and word boundaries (\b).

# Full string validation
valid_username = /\A[a-zA-Z0-9_]{3,20}\z/

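# Line-based matching (^ and $ always match at line boundaries in Ruby;
# the m flag only makes . match newlines, so it is omitted here)
extract_lines = /^Error:.+$/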

# Word boundary matching  
find_whole_word = /\btest\b/

# Position matching example
log_entry = "2023-10-15 ERROR: Database connection failed"
timestamp = log_entry.match(/\A\d{4}-\d{2}-\d{2}/)
# => #<MatchData "2023-10-15">

Performance & Memory

Regular expression performance depends on pattern complexity, input string length, and backtracking behavior. Simple character classes and literal strings perform faster than complex alternations and nested quantifiers.

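Catastrophic backtracking occurs when patterns with nested quantifiers encounter input that forces exponential evaluation paths. The regex engine explores every possible matching combination before failing, so running time can grow exponentially with input length. Ruby 3.2 and later memoize many such patterns, which makes the slowdown most visible on older versions.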

require 'benchmark'

# Problematic pattern with nested quantifiers
catastrophic_pattern = /(a+)+b/
safe_pattern = /a+b/

test_string = "a" * 20 + "c"  # No 'b' at end forces backtracking

Benchmark.bm(15) do |x|
  x.report("Safe pattern:") do
    10000.times { safe_pattern.match(test_string) }
  end
  
  x.report("Catastrophic:") do
    10.times { catastrophic_pattern.match(test_string) }  # Much fewer iterations
  end
end

Atomic grouping prevents backtracking within group boundaries using (?>pattern) syntax. This optimization technique improves performance by eliminating redundant evaluation paths.

# Without atomic grouping - allows backtracking
inefficient = /\d+\.\d+/

# With atomic grouping - prevents backtracking
efficient = /(?>\d+)\.\d+/

# Benchmark comparison with large numeric strings
large_number = "1234567890123456789.0"
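
The snippet above stops before comparing the two patterns. The sketch below does not measure time; it only shows that both patterns agree on well-formed input and that an atomic group never gives characters back. The short "2024" input is an assumption picked to make the difference visible.

```ruby
inefficient = /\d+\.\d+/
efficient   = /(?>\d+)\.\d+/
large_number = "1234567890123456789.0"

# Both patterns find the same match in well-formed input.
same = inefficient.match(large_number)[0] == efficient.match(large_number)[0]

# The difference shows when a later element needs a character back.
plain_group  = "2024".match(/(?:\d+)\d/)   # \d+ gives back "4" for the final \d
atomic_group = "2024".match(/(?>\d+)\d/)   # (?>\d+) keeps "2024"; nothing is left

p same           # true
p plain_group    # #<MatchData "2024">
p atomic_group   # nil
```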

Memory usage increases with pattern complexity and capture group count. Named captures create additional references, while non-capturing groups (?:pattern) reduce memory overhead without affecting matching behavior.

# Memory-intensive pattern with many captures
memory_heavy = /(.*)(.*)(.*)(.*)(.*)(.*)(.*)(.*)/

# Optimized with non-capturing groups
memory_light = /(?:.*){8}/

# Named captures balance readability and memory
balanced = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
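
A short usage sketch for the named-capture pattern, assuming an input string made up for this example. Named groups can be read by symbol or string, or collected into a hash.

```ruby
balanced = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
match = "Released on 2023-10-15".match(balanced)

p match[:year]      # "2023"
p match["month"]    # "10"
p match.names       # ["year", "month", "day"]

# named_captures returns a Hash keyed by group name.
date_parts = match.named_captures
p date_parts["day"] # "15"
```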

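Compiling a pattern has a cost, so frequently used patterns benefit from reusing the compiled Regexp object. Ruby compiles a literal without interpolation once, when the code is parsed, and reuses that object. Patterns built with Regexp.new or with interpolation are compiled again on every evaluation unless you cache them yourself.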

# Cached literal pattern (automatic)
CONSTANT_PATTERN = /[a-z]+/i

# Manual caching for dynamic patterns
class PatternCache
  def initialize
    @cache = {}
  end
  
  def get_pattern(base, flags = 0)
    key = [base, flags]
    @cache[key] ||= Regexp.new(base, flags)
  end
end

cache = PatternCache.new
pattern = cache.get_pattern('user_\d+', Regexp::IGNORECASE)
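
The difference between literal and dynamic patterns can be observed with object identity. This is a minimal sketch; literal_pattern and dynamic_pattern are hypothetical helper names used only here.

```ruby
# No interpolation: the literal is compiled once and the same object is returned.
def literal_pattern
  /user_\d+/
end

# Interpolation: a new Regexp is built on every call.
def dynamic_pattern(prefix)
  /#{Regexp.escape(prefix)}_\d+/
end

p literal_pattern.equal?(literal_pattern)                    # true
p dynamic_pattern("user").equal?(dynamic_pattern("user"))    # false
p dynamic_pattern("user") == dynamic_pattern("user")         # true, same source and options
```

Equal-but-not-identical results are the case a cache such as PatternCache is meant to avoid.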

Error Handling & Debugging

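Regular expression compilation errors occur when patterns contain invalid syntax or unsupported constructs. Ruby raises RegexpError when a pattern is created with Regexp.new, not during matching operations. An invalid regex literal is rejected even earlier, as a SyntaxError when the file is parsed.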

begin
  # Invalid quantifier syntax
  invalid_pattern = Regexp.new('*invalid')
rescue RegexpError => e
  puts "Pattern compilation failed: #{e.message}"
  # => Pattern compilation failed: target of repeat operator is not specified: /*invalid/
end

begin
  # Unmatched parenthesis
  unbalanced = Regexp.new('(unclosed group')
rescue RegexpError => e
  puts "Syntax error: #{e.message}"
end
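
When patterns come from configuration or user input, the rescue can be wrapped in a small helper. The sketch below is one possible shape; compile_or_nil is a hypothetical name, not part of Ruby.

```ruby
# Hypothetical helper: returns a Regexp, or nil when the source is invalid.
def compile_or_nil(source, flags = 0)
  Regexp.new(source, flags)
rescue RegexpError
  nil
end

p compile_or_nil('\d+')               # /\d+/
p compile_or_nil('(unclosed group')   # nil
p compile_or_nil('*invalid')          # nil
```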

Encoding mismatches between patterns and strings can cause matching failures. A pattern containing non-ASCII characters is fixed to its encoding, and matching it against a non-ASCII string in an incompatible encoding raises Encoding::CompatibilityError. ASCII-only patterns are not fixed and match strings in any ASCII-compatible encoding.

# Encoding compatibility check
def safe_match(pattern, text)
  if pattern.fixed_encoding? && pattern.encoding != text.encoding
    # Rebuild the pattern from its source, transcoded to the text's encoding
    source = pattern.source.encode(text.encoding)
    flags = pattern.options & (Regexp::IGNORECASE | Regexp::MULTILINE | Regexp::EXTENDED)
    text.match(Regexp.new(source, flags))
  else
    text.match(pattern)
  end
rescue EncodingError => e
  # Covers Encoding::CompatibilityError and failed transcoding
  puts "Encoding mismatch: #{e.message}"
  nil
end

# Example usage
utf8_text = "café".encode('UTF-8')
latin1_pattern = Regexp.new("café".encode('ISO-8859-1'))
result = safe_match(latin1_pattern, utf8_text)

Debugging complex patterns requires systematic decomposition and testing. Ruby provides the MatchData object with detailed capture information and position data.

def debug_regex(pattern, text)
  match = text.match(pattern)
  unless match
    puts "No match found"
    return nil
  end
  
  puts "Full match: '#{match[0]}' at position #{match.begin(0)}-#{match.end(0)}"
  puts "Captures:"
  
  match.captures.each_with_index do |capture, index|
    puts "  Group #{index + 1}: '#{capture}'"
  end
  
  if match.names.any?
    puts "Named captures:"
    match.names.each do |name|
      puts "  #{name}: '#{match[name]}'"
    end
  end
  
  match
end

# Usage example
email_pattern = /(?<user>[\w+\-.]+)@(?<domain>[a-z\d\-]+(?:\.[a-z\d\-]+)*)/i
result = debug_regex(email_pattern, "Contact: john.doe@example.com")

Timeout protection prevents runaway regex operations from blocking application threads. Ruby's Timeout module can wrap a match in a time limit. Ruby 3.2 and later also provide Regexp.timeout= and a timeout: keyword on Regexp.new for the same purpose.

require 'timeout'

def safe_regex_match(pattern, text, timeout_seconds = 5)
  Timeout::timeout(timeout_seconds) do
    text.match(pattern)
  end
rescue Timeout::Error
  puts "Regex operation timed out after #{timeout_seconds} seconds"
  nil
rescue RegexpError => e
  puts "Regex compilation error: #{e.message}"
  nil
end

# Example with potentially slow pattern
slow_pattern = /(a+)+b/
problematic_input = "a" * 25 + "c"
result = safe_regex_match(slow_pattern, problematic_input, 2)

Common Pitfalls

Greedy quantifiers consume maximum possible characters before attempting to match subsequent pattern elements, often producing unexpected results when developers expect minimal matching behavior.

html = '<div class="main">Content</div>'

# Common mistake - greedy quantifier matches too much
wrong_pattern = /<.*>/
wrong_match = html.match(wrong_pattern)
# => #<MatchData "<div class=\"main\">Content</div>">

# Correct approach - lazy quantifier
correct_pattern = /<.*?>/  
correct_match = html.match(correct_pattern)
# => #<MatchData "<div class=\"main\">">

# Alternative - negated character class (more efficient)
efficient_pattern = /<[^>]*>/
efficient_match = html.match(efficient_pattern)
# => #<MatchData "<div class=\"main\">">

Anchor confusion leads to validation bypasses when developers use line anchors (^, $) instead of string anchors (\A, \z) for input validation. Line anchors match line boundaries within multiline strings.

# Dangerous validation - uses line anchors
unsafe_email_check = /^[\w\.-]+@[\w\.-]+\.[a-z]{2,}$/i

# Multiline input bypasses validation
malicious_input = "valid@email.com\n<script>alert('xss')</script>"
unsafe_result = malicious_input.match(unsafe_email_check)
# => #<MatchData "valid@email.com"> (matches despite script tag)

# Secure validation - uses string anchors  
safe_email_check = /\A[\w\.-]+@[\w\.-]+\.[a-z]{2,}\z/i
safe_result = malicious_input.match(safe_email_check)
# => nil (correctly rejects malicious input)

Escape sequence handling differs between single and double-quoted strings when constructing regex patterns. Backslashes require additional escaping in double-quoted strings.

# Single-quoted string preserves literal backslashes
single_quoted = Regexp.new('\d+\.\d+')
# Equivalent to: /\d+\.\d+/

# Double-quoted string requires escaped backslashes  
double_quoted = Regexp.new("\\d+\\.\\d+")
# Equivalent to: /\d+\.\d+/

# Common mistake - insufficient escaping in double quotes
wrong_pattern = Regexp.new("\d+\.\d+")  # Interprets \d as literal d
# Creates pattern: /d+.d+/ (matches 'd' instead of digits)

# Testing the difference
test_string = "123.45"
# p prints the inspect form; puts would print only the matched text
p single_quoted.match(test_string)    # => #<MatchData "123.45">
p double_quoted.match(test_string)    # => #<MatchData "123.45">
p wrong_pattern.match(test_string)    # => nil
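
A related concern when building patterns from strings is text that should be matched literally. The sketch below uses Regexp.escape for that; user_input is a hypothetical runtime value.

```ruby
user_input = "1.5"   # hypothetical value supplied at runtime

unescaped = Regexp.new(user_input)                  # "." matches any character
escaped   = Regexp.new(Regexp.escape(user_input))   # "." matches only a dot

p unescaped.match?("125")   # true, probably not what was intended
p escaped.match?("125")     # false
p escaped.match?("1.5")     # true
```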

Unicode handling requires explicit character classes or property matching. ASCII-based patterns fail with international characters unless properly configured with Unicode support.

# ASCII-only pattern fails with Unicode
ascii_name = /\A[a-zA-Z\s]+\z/
unicode_name = "José María"
ascii_result = unicode_name.match(ascii_name)
# => nil (fails to match accented characters)

# Unicode-aware pattern  
unicode_pattern = /\A[\p{L}\s]+\z/
unicode_result = unicode_name.match(unicode_pattern)
# => #<MatchData "José María">

# Alternative listing the accented characters explicitly
spanish_pattern = /\A[a-zA-ZáéíóúÁÉÍÓÚñÑ\s]+\z/
spanish_result = unicode_name.match(spanish_pattern)
# => #<MatchData "José María">

Case sensitivity mistakes occur when patterns don't specify appropriate flags for intended matching behavior. Ruby regex defaults to case-sensitive matching unless explicitly configured otherwise.

# Case-sensitive matching (default)
case_sensitive = /hello/
mixed_case_text = "Hello World"
sensitive_result = mixed_case_text.match(case_sensitive)
# => nil (doesn't match due to case difference)

# Case-insensitive matching
case_insensitive = /hello/i
insensitive_result = mixed_case_text.match(case_insensitive)  
# => #<MatchData "Hello">

# Dynamic case handling
def flexible_match(pattern_string, text, ignore_case = false)
  flags = ignore_case ? Regexp::IGNORECASE : 0
  pattern = Regexp.new(pattern_string, flags)
  text.match(pattern)
end

result = flexible_match("HELLO", "hello world", true)
# => #<MatchData "hello">
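
Case-insensitivity can also be limited to part of a pattern with an inline group flag, which the examples above do not show. A minimal sketch, with made-up input strings:

```ruby
# Only "hello" is case-insensitive; " World" must match exactly.
partial = /(?i:hello) World/

p partial.match?("HELLO World")   # true
p partial.match?("hello WORLD")   # false

# casefold? reports whether the whole-pattern IGNORECASE flag is set.
p(/hello/i.casefold?)   # true
p partial.casefold?     # false, the flag is scoped to the group
```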

Reference

Regex Literal Syntax

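| Syntax | Description | Example |
| --- | --- | --- |
| `/pattern/` | Basic literal regex | `/hello/` |
| `/pattern/flags` | Literal with modifiers | `/hello/i` |
| `%r{pattern}` | Alternative delimiter | `%r{https?://}` |
| `%r{pattern}flags` | Alternative delimiter with modifiers | `%r{https?://}i` |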

Constructor Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| `Regexp.new(pattern, flags = 0)` | `pattern` (String), `flags` (Integer) | Regexp | Creates regex from string |
| `Regexp.compile(pattern, flags = 0)` | `pattern` (String), `flags` (Integer) | Regexp | Alias for `new` |
| `Regexp.escape(string)` | `string` (String) | String | Escapes special characters |
| `Regexp.union(*patterns)` | `patterns` (Array) | Regexp | Creates alternation pattern |
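
A brief sketch of Regexp.union from the table above, with keyword lists invented for the example. Strings passed to it are escaped automatically, and Regexp arguments keep their own flags.

```ruby
keywords = ["c++", "ruby", "c#"]

# Each String is escaped, so "+" and "#" are matched literally.
any_keyword = Regexp.union(keywords)
p "I use c++ and ruby".scan(any_keyword)   # ["c++", "ruby"]

# Flags stay attached to the Regexp they came from.
mixed = Regexp.union(/cat/i, "dog")
p mixed.match?("CAT")   # true
p mixed.match?("DOG")   # false
```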

Modifier Flags

| Flag | Constant | Description | Literal |
| --- | --- | --- | --- |
| `i` | `Regexp::IGNORECASE` | Case-insensitive matching | `/pattern/i` |
| `m` | `Regexp::MULTILINE` | Multiline mode (`.` matches newlines) | `/pattern/m` |
| `x` | `Regexp::EXTENDED` | Extended mode (ignores whitespace and comments) | `/pattern/x` |
| `o` | N/A | Performs `#{}` interpolation only once | `/pattern/o` |

Character Classes

| Class | Description | Equivalent |
| --- | --- | --- |
| `.` | Any character except newline | `[^\n]` |
| `\d` | Digit character | `[0-9]` |
| `\D` | Non-digit character | `[^0-9]` |
| `\w` | Word character | `[a-zA-Z0-9_]` |
| `\W` | Non-word character | `[^a-zA-Z0-9_]` |
| `\s` | Whitespace character | `[ \t\r\n\f\v]` |
| `\S` | Non-whitespace character | `[^ \t\r\n\f\v]` |

Quantifiers

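| Quantifier | Type | Description | Example |
| --- | --- | --- | --- |
| `*` | Greedy | Zero or more | `/a*/` |
| `+` | Greedy | One or more | `/a+/` |
| `?` | Greedy | Zero or one | `/a?/` |
| `{n}` | Exact | Exactly n times | `/a{3}/` |
| `{n,}` | Greedy | n or more times | `/a{3,}/` |
| `{n,m}` | Greedy | Between n and m times | `/a{3,5}/` |
| `*?` | Lazy | Zero or more (minimal) | `/a*?/` |
| `+?` | Lazy | One or more (minimal) | `/a+?/` |
| `??` | Lazy | Zero or one (minimal) | `/a??/` |
| `*+` | Possessive | Zero or more (no backtracking) | `/a*+/` |
| `++` | Possessive | One or more (no backtracking) | `/a++/` |
| `?+` | Possessive | Zero or one (no backtracking) | `/a?+/` |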

Anchors

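| Anchor | Description | Example |
| --- | --- | --- |
| `\A` | Start of string | `/\A[A-Z]/` |
| `\z` | End of string | `/\d\z/` |
| `\Z` | End of string, or just before a final newline | `/\w\Z/` |
| `^` | Start of line (no flag required) | `/^Error/` |
| `$` | End of line (no flag required) | `/\d$/` |
| `\b` | Word boundary | `/\btest\b/` |
| `\B` | Non-word boundary | `/\Btest\B/` |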
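
The end-of-string anchors are easy to confuse, so here is a small comparison on inputs chosen for illustration:

```ruby
with_newline = "admin\n"

p with_newline.match?(/\Aadmin\z/)   # false, \z allows nothing after "admin"
p with_newline.match?(/\Aadmin\Z/)   # true, \Z tolerates one final newline
p with_newline.match?(/^admin$/)     # true

multi = "x\nadmin\ny"
p multi.match?(/^admin$/)            # true, line anchors match the middle line
p multi.match?(/\Aadmin\Z/)          # false, string anchors see the whole input
```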

Groups and Captures

| Syntax | Description | Example |
| --- | --- | --- |
| `(pattern)` | Capturing group | `/(a+)(b+)/` |
| `(?:pattern)` | Non-capturing group | `/(?:a+)b+/` |
| `(?<name>pattern)` | Named capture | `/(?<year>\d{4})/` |
| `(?'name'pattern)` | Named capture (alternative) | `/(?'year'\d{4})/` |
| `\1`, `\2`, ... | Backreference by number | `/(a+)\1/` |
| `\k<name>` | Backreference by name | `/(?<word>\w+)\k<word>/` |
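
Backreferences are easier to follow with a concrete case. The sketch below finds and collapses a doubled word; the sample sentence is an assumption made for the example.

```ruby
# A word, a space, then the same word again.
doubled = /\b(?<word>\w+) \k<word>\b/
text = "This is is a test"

p text.match(doubled)[0]            # "is is"
p text.gsub(doubled, '\k<word>')    # "This is a test"

# Numbered form: one character followed by itself.
p "hello".match(/(\w)\1/)[0]        # "ll"
```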

Lookahead and Lookbehind

| Syntax | Type | Description | Example |
| --- | --- | --- | --- |
| `(?=pattern)` | Positive lookahead | Look ahead for pattern | `/\d+(?=px)/` |
| `(?!pattern)` | Negative lookahead | Look ahead, not pattern | `/\d+(?!px)/` |
| `(?<=pattern)` | Positive lookbehind | Look behind for pattern | `/(?<=\$)\d+/` |
| `(?<!pattern)` | Negative lookbehind | Look behind, not pattern | `/(?<!\$)\d+/` |
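
The table's example patterns behave as follows on sample input invented for this sketch. Note that a negative lookahead after a quantifier can succeed by backtracking to a shorter match.

```ruby
css = "width: 100px; margin: 2em; height: 50px"

# Positive lookahead: digits followed by "px"; "px" is not part of the match.
p css.scan(/\d+(?=px)/)       # ["100", "50"]

# Negative lookahead backtracks into shorter numbers ("10" out of "100px").
p css.scan(/\d+(?!px)/)       # ["10", "2", "5"]

# Also forbidding a following digit keeps whole numbers only.
p css.scan(/\d+(?!\d|px)/)    # ["2"]

# Positive lookbehind: digits preceded by "$".
p "cost $25, qty 3".scan(/(?<=\$)\d+/)   # ["25"]
```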

String Methods with Regex

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| `#match(pattern, pos = 0)` | `pattern` (Regexp), `pos` (Integer) | MatchData or nil | Returns match data |
| `#match?(pattern, pos = 0)` | `pattern` (Regexp), `pos` (Integer) | Boolean | Returns boolean result |
| `#scan(pattern)` | `pattern` (Regexp) | Array | Returns all matches |
| `#gsub(pattern, replacement)` | `pattern` (Regexp), `replacement` (String) | String | Global substitution |
| `#gsub!(pattern, replacement)` | `pattern` (Regexp), `replacement` (String) | String or nil | In-place substitution |
| `#sub(pattern, replacement)` | `pattern` (Regexp), `replacement` (String) | String | Single substitution |
| `#split(pattern, limit = 0)` | `pattern` (Regexp), `limit` (Integer) | Array | Split by pattern |
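
A compact tour of the String methods listed above, run against a log line made up for the example:

```ruby
log = "2023-10-15 ERROR db down; 2023-10-16 WARN slow query"

# scan returns every match; with groups it returns one array per match.
dates   = log.scan(/\d{4}-\d{2}-\d{2}/)
entries = log.scan(/(\d{4}-\d{2}-\d{2}) (\w+)/)

# gsub and sub accept \1-style references in a single-quoted replacement.
reordered = log.gsub(/(\d{4})-(\d{2})-(\d{2})/, '\3/\2/\1')
masked    = log.sub(/\d{4}/, "YYYY")   # first occurrence only

parts = log.split(/;\s*/)

p dates       # ["2023-10-15", "2023-10-16"]
p entries     # [["2023-10-15", "ERROR"], ["2023-10-16", "WARN"]]
p reordered   # "15/10/2023 ERROR db down; 16/10/2023 WARN slow query"
p masked      # "YYYY-10-15 ERROR db down; 2023-10-16 WARN slow query"
p parts       # ["2023-10-15 ERROR db down", "2023-10-16 WARN slow query"]
```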

MatchData Methods

| Method | Returns | Description |
| --- | --- | --- |
| `#[](index)` | String or nil | Access capture by index |
| `#captures` | Array | All captured groups |
| `#named_captures` | Hash | Named captures as hash |
| `#names` | Array | Names of capture groups |
| `#begin(index)` | Integer | Start position of capture |
| `#end(index)` | Integer | End position of capture |
| `#offset(index)` | Array | [begin, end] positions |
| `#pre_match` | String | String before match |
| `#post_match` | String | String after match |
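
The MatchData methods above can be tried on a single match. This is a minimal sketch with an input string and a simplified email pattern chosen for illustration only.

```ruby
m = "Contact: john.doe@example.com today".match(/(?<user>[\w.]+)@(?<domain>[\w.]+)/)

p m[0]               # "john.doe@example.com"
p m.captures         # ["john.doe", "example.com"]
p m.names            # ["user", "domain"]
p m.pre_match        # "Contact: "
p m.post_match       # " today"
p m.begin(0)         # 9
p m.end(0)           # 29
p m.offset(:domain)  # [18, 29]
```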