Overview
Ruby extends the String class with numerous methods that handle text processing, encoding transformations, and pattern operations. These extensions form the backbone of text manipulation in Ruby applications, providing methods for case conversion, substring extraction, pattern matching, and character encoding operations.
The String class includes methods for modifying content (#gsub, #tr, #squeeze), extracting information (#scan, #match, #include?), and transforming format (#upcase, #downcase, #capitalize). Ruby handles string encoding through methods like #encode, #force_encoding, and #valid_encoding?, supporting multiple character encodings including UTF-8, ASCII, and ISO-8859-1.
text = "Hello, World!"
text.upcase # => "HELLO, WORLD!"
text.gsub(/[aeiou]/, '*') # => "H*ll*, W*rld!"
text.include?("World") # => true
String interpolation works seamlessly with these methods:
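name = "ruby developer"
# Core Ruby has no #titleize (that method comes from ActiveSupport), so capitalize each word
puts "Welcome #{name.split.map(&:capitalize).join(' ')}!" # prints "Welcome Ruby Developer!"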
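The encoding system converts between character sets, with options that control what happens to characters the target encoding cannot represent:
utf8_string = "café".encode('UTF-8')
# "é" is valid UTF-8 but has no ASCII equivalent, so the :undef option applies
ascii_string = utf8_string.encode('ASCII', undef: :replace)
# => "caf?"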
Basic Usage
String case conversion methods transform text between different capitalization formats. Ruby provides #upcase, #downcase, #capitalize, and #swapcase. Each accepts optional mapping options such as :ascii or :turkic for restricted or language-specific conversion.
text = "Mixed Case String"
text.upcase # => "MIXED CASE STRING"
text.downcase # => "mixed case string"
text.capitalize # => "Mixed case string"
text.swapcase # => "mIXED cASE sTRING"
The #gsub method performs pattern-based substitutions using regular expressions or strings. It also accepts a block for complex replacement logic.
text = "The quick brown fox jumps"
text.gsub(/\b\w{5}\b/, '[WORD]') # => "The [WORD] [WORD] fox [WORD]"
text.gsub(/(\w+)/) { |word| word.reverse } # => "ehT kciuq nworb xof spmuj"
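Beyond string and block replacements, #gsub also accepts a Hash, and replacement strings can refer to capture groups. The following is a minimal sketch of those forms; the ESCAPES table and the helper method names are hypothetical and exist only for illustration.

```ruby
# Hypothetical lookup table for illustration.
ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

# Hash replacement: each matched text is looked up as a key in the hash.
def escape_html(text)
  text.gsub(/[&<>]/, ESCAPES)
end

# Backreferences: \1 and \2 refer to the first and second capture groups.
# Single quotes keep the backslashes literal.
def swap_name(text)
  text.gsub(/(\w+), (\w+)/, '\2 \1')
end

# #sub has the same interface but replaces only the first match.
def mask_first_number(text)
  text.sub(/\d+/, '#')
end

escape_html("a < b && c > d")         # => "a &lt; b &amp;&amp; c &gt; d"
swap_name("Matsumoto, Yukihiro")      # => "Yukihiro Matsumoto"
mask_first_number("room 12, floor 3") # => "room #, floor 3"
```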
Character translation uses the #tr and #tr_s methods. Both map each character in one set to the corresponding character in another; #tr_s additionally squeezes runs of translated characters into a single character.
"hello world".tr('l', 'x') # => "hexxo worxd"
"hello world".tr('a-z', 'A-Z') # => "HELLO WORLD"
"bookkeeper".tr_s('k', 'c') # => "booceeper"
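String scanning with #scan extracts every match of a pattern into an array. The method works with regular expressions and string patterns; when the pattern contains capture groups, each match becomes an array of the captured values.
text = "Phone: 555-1234, Fax: 555-5678"
text.scan(/\d{3}-\d{4}/) # => ["555-1234", "555-5678"]
# [\d-]+ rather than \S+, which would also capture the trailing comma
text.scan(/(\w+):\s*([\d-]+)/) # => [["Phone", "555-1234"], ["Fax", "555-5678"]]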
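Because #scan with two capture groups yields pairs, the result converts directly to a Hash, and #scan also accepts a block that receives each match in turn. A short sketch of both follows; the method names are hypothetical.

```ruby
# Pairs of captures convert directly to a Hash with Array#to_h.
def contact_numbers(text)
  text.scan(/(\w+):\s*([\d-]+)/).to_h
end

# Block form: the block is called once per match instead of building an array.
def word_lengths(text)
  lengths = []
  text.scan(/\w+/) { |word| lengths << word.length }
  lengths
end

contact_numbers("Phone: 555-1234, Fax: 555-5678")
# => {"Phone"=>"555-1234", "Fax"=>"555-5678"}
word_lengths("The quick brown fox") # => [3, 5, 5, 3]
```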
Advanced Usage
String extensions support complex text processing through method chaining and block-based transformations. The #gsub method accepts advanced regular expression patterns with named captures and lookarounds.
html_text = "<p>Hello <strong>world</strong>!</p>"
clean_text = html_text
  .gsub(/<[^>]+>/, '') # Remove HTML tags
  .squeeze(' ') # Collapse multiple spaces
  .strip # Remove leading/trailing whitespace
# => "Hello world!"
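The same chaining style can combine #delete, #tr, #gsub, #strip, and #squeeze into a reusable helper. The sketch below builds a URL-style slug; the slugify name and its exact rules are assumptions made for illustration, and it only handles ASCII letters and digits.

```ruby
# Hypothetical helper: turns a title into a lowercase, hyphen-separated slug.
def slugify(title)
  title
    .downcase
    .delete("'")            # drop apostrophes entirely
    .tr('_', ' ')           # treat underscores as spaces
    .gsub(/[^a-z0-9 ]/, '') # remove remaining punctuation
    .strip                  # trim leading/trailing spaces
    .squeeze(' ')           # collapse runs of spaces
    .tr(' ', '-')           # spaces become hyphens
end

slugify("  Ruby's String_Methods: A Tour! ") # => "rubys-string-methods-a-tour"
slugify("Hello,   World")                    # => "hello-world"
```

Each step returns a new string, so the original title is left untouched.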
Pattern extraction becomes sophisticated with named captures and complex regular expressions:
log_entry = "2024-01-15 14:30:25 ERROR DatabaseConnection timeout after 30s"
pattern = /(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<message>.*)/
match = log_entry.match(pattern)
{
  timestamp: "#{match[:date]} #{match[:time]}",
  severity: match[:level],
  details: match[:message]
}
# => {:timestamp=>"2024-01-15 14:30:25", :severity=>"ERROR", :details=>"DatabaseConnection timeout after 30s"}
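Since #match returns nil when the pattern does not match, indexing the result directly can raise NoMethodError on malformed input. A minimal sketch of a guarded parser follows, using MatchData#named_captures to get a Hash keyed by group name; the parse_log_line name and LOG_PATTERN constant are hypothetical.

```ruby
LOG_PATTERN = /(?<date>\d{4}-\d{2}-\d{2})\s+(?<time>\d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<message>.*)/

# Returns a Hash of captures keyed by group name (as strings),
# or nil when the line does not match the expected format.
def parse_log_line(line)
  match = line.match(LOG_PATTERN)
  return nil unless match

  match.named_captures
end

entry = parse_log_line("2024-01-15 14:30:25 ERROR DatabaseConnection timeout after 30s")
entry["level"]                  # => "ERROR"
parse_log_line("not a log line") # => nil
```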
The #partition and #rpartition methods split a string at the first or last occurrence of a delimiter, respectively. Each returns a three-element array: the part before the delimiter, the delimiter itself, and the part after it.
email = "user@example.com"
username, at_sign, domain = email.partition('@')
# username => "user", at_sign => "@", domain => "example.com"
filepath = "/home/user/documents/file.txt"
directory, separator, filename = filepath.rpartition('/')
# directory => "/home/user/documents", separator => "/", filename => "file.txt"
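When the delimiter is absent, #partition returns the whole string followed by two empty strings, while #rpartition returns two empty strings followed by the whole string. Checking the middle element is therefore a simple way to detect a missing delimiter. The sketch below assumes a hypothetical "key = value" line format.

```ruby
# Hypothetical parser for "key = value" lines.
# Splits at the first '=' only, so the value may itself contain '='.
def parse_setting(line)
  key, separator, value = line.partition('=')
  return nil if separator.empty? # delimiter not found

  [key.strip, value.strip]
end

parse_setting("url = http://example.com?a=1") # => ["url", "http://example.com?a=1"]
parse_setting("no delimiter here")            # => nil

"abc".partition('@')  # => ["abc", "", ""]
"abc".rpartition('@') # => ["", "", "abc"]
```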
String encoding conversion handles character set transformations with configurable error handling. The #encode method distinguishes invalid byte sequences (:invalid) from characters that have no equivalent in the target encoding (:undef), and accepts a replacement string for both.
mixed_encoding = "Café naïve résumé".encode('UTF-8')
# Convert with replacement characters
ascii_version = mixed_encoding.encode('ASCII',
  invalid: :replace,
  undef: :replace,
  replace: '?')
# => "Caf? na?ve r?sum?"
# Convert with numeric character references.
# :replace must be a String, so per-character logic goes in :fallback.
xml_safe = mixed_encoding.encode('ASCII',
  fallback: ->(char) { "&##{char.ord};" })
# => "Caf&#233; na&#239;ve r&#233;sum&#233;"
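For XML output specifically, #encode also has a built-in :xml option, listed in the Reference section. As a sketch: with xml: :text, undefined characters become hexadecimal character references and the characters &, <, and > are escaped; xml: :attr additionally escapes double quotes and wraps the result in quotes. The method names below are hypothetical.

```ruby
# Escape text content for an ASCII-only XML document.
def xml_text_ascii(text)
  text.encode('US-ASCII', xml: :text)
end

# Escape and quote a value for use as an XML attribute.
def xml_attr_ascii(text)
  text.encode('US-ASCII', xml: :attr)
end

xml_text_ascii("Caf\u00E9 <b>") # => "Caf&#xE9; &lt;b&gt;"
xml_attr_ascii('say "hi"')      # => "\"say &quot;hi&quot;\""
```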
Common Pitfalls
String mutating methods create confusion between destructive and non-destructive operations. Methods ending with exclamation marks modify the original string, while others return new strings.
original = "hello world"
result = original.upcase # original unchanged, result = "HELLO WORLD"
original.upcase! # original modified to "HELLO WORLD"
# Common mistake: expecting mutation
text = "sample text"
text.gsub(/\s+/, '_') # text still equals "sample text"
text = text.gsub(/\s+/, '_') # Correct: reassign result
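A related trap is that bang methods return nil when they make no change, so their return value should not be chained or assigned. The sketch below contrasts an unsafe and a safe pattern; both method names are hypothetical.

```ruby
# Unsafe: relies on the return value of upcase!, which is nil
# when the string is already uppercase.
def shout_unsafe(text)
  text.dup.upcase!
end

# Safe: mutate a copy, then return the copy itself.
def shout(text)
  copy = text.dup
  copy.upcase!
  copy
end

shout_unsafe("hello") # => "HELLO"
shout_unsafe("HELLO") # => nil
shout("HELLO")        # => "HELLO"
```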
Regular expression escaping causes problems when user input contains special characters. The Regexp.escape method escapes them so the input is matched literally.
user_input = "What is $5.00 + $3.50?"
text = "Q: What is $5.00 + $3.50?"
# Wrong: $, ., + and ? act as regex metacharacters, so the pattern silently fails to match
text.gsub(/#{user_input}/, 'REDACTED') # => "Q: What is $5.00 + $3.50?"
# Correct: escape special characters
text.gsub(/#{Regexp.escape(user_input)}/, 'REDACTED') # => "Q: REDACTED"
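Two alternatives avoid manual escaping altogether: passing the input to #gsub as a String, which is always matched literally, and building a pattern from several literal strings with Regexp.union, which escapes each one. A minimal sketch with hypothetical method names:

```ruby
# A String pattern is matched literally, so no escaping is needed.
def redact_literal(text, phrase)
  text.gsub(phrase, 'REDACTED')
end

# Regexp.union escapes each String and joins them as alternatives.
def redact_any(text, phrases)
  text.gsub(Regexp.union(phrases), 'REDACTED')
end

redact_literal("Total: $5.00 (approx.)", "$5.00") # => "Total: REDACTED (approx.)"
redact_any("a+b and c.d", ["a+b", "c.d"])         # => "REDACTED and REDACTED"
redact_any("cxd", ["c.d"])                        # => "cxd" (the dot is literal)
```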
Encoding issues arise when mixing strings with different encodings or reading files without specifying encoding. Ruby raises Encoding::CompatibilityError when it combines two strings that both contain non-ASCII characters in incompatible encodings; ASCII-only strings combine without error.
utf8_string = "résumé".encode('UTF-8')
latin1_string = "café".encode('ISO-8859-1')
begin
  # Raises Encoding::CompatibilityError: both strings contain non-ASCII bytes
  result = utf8_string + latin1_string
rescue Encoding::CompatibilityError
  # Handle the mismatch by transcoding to a common encoding
  result = utf8_string + latin1_string.encode('UTF-8')
end
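The difference between #force_encoding and #encode is a frequent source of these errors: the former only relabels the bytes, while the latter converts them. The sketch below assumes a hypothetical input of raw ISO-8859-1 bytes, as might come from a file read in binary mode, and uses #valid_encoding? and #scrub to detect and repair a wrong label.

```ruby
# Hypothetical raw input: "café" stored as ISO-8859-1 bytes (0xE9 is "é").
LATIN1_BYTES = "caf\xE9".b # binary (ASCII-8BIT) string

# Relabels the bytes as UTF-8 without converting them.
def mislabel_as_utf8(bytes)
  bytes.dup.force_encoding('UTF-8')
end

# Declares the real source encoding, then converts to UTF-8.
def transcode_latin1(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

mislabeled = mislabel_as_utf8(LATIN1_BYTES)
mislabeled.valid_encoding?     # => false (a lone 0xE9 is not valid UTF-8)
mislabeled.scrub('?')          # => "caf?" (invalid bytes replaced)
transcode_latin1(LATIN1_BYTES) # => "café"
```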
Case conversion for some languages requires language-specific rules. Since Ruby 2.4 the standard case methods apply full Unicode case mapping, so accented characters convert correctly, but the default mapping ignores language-specific rules such as the Turkish dotted and dotless i.
turkish_text = "İstanbul"
turkish_text.downcase # => "i̇stanbul" (i plus a combining dot; incorrect for Turkish)
# Pass the :turkic option for Turkic languages
turkish_text.downcase(:turkic) # => "istanbul"
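As a sketch of how this looks with core Ruby alone (version 2.4 or later), the case methods accept a :turkic option that maps between dotted i/İ and dotless ı/I. The wrapper method names are hypothetical.

```ruby
# Uppercase using Turkic rules: "i" maps to "İ" (U+0130).
def turkish_upcase(text)
  text.upcase(:turkic)
end

# Lowercase using Turkic rules: "I" maps to "ı" (U+0131).
def turkish_downcase(text)
  text.downcase(:turkic)
end

turkish_upcase("istanbul")  # => "İSTANBUL"
"istanbul".upcase           # => "ISTANBUL" (default mapping)
turkish_downcase("ISPARTA") # => "ısparta"
"ISPARTA".downcase          # => "isparta" (default mapping)
```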
The #tr method performs character-by-character replacement, not substring replacement. This creates unexpected results when replacing multi-character sequences.
text = "hello"
text.tr('ll', 'x') # => "hexxo" (each 'l' becomes 'x')
text.gsub('ll', 'x') # => "hexo" (substring 'll' becomes 'x')
# For multi-character replacement, use gsub
"bookkeeper".tr('kk', 'c') # => "boocceeper" (each 'k' becomes 'c')
"bookkeeper".gsub('kk', 'c') # => "booceeper" (substring 'kk' becomes 'c')
Reference
Case Conversion Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#upcase | *options (optional) | String | Returns uppercase copy |
#upcase! | *options (optional) | String/nil | Modifies string to uppercase; returns nil if unchanged |
#downcase | *options (optional) | String | Returns lowercase copy |
#downcase! | *options (optional) | String/nil | Modifies string to lowercase; returns nil if unchanged |
#capitalize | *options (optional) | String | Returns copy with first character uppercase and the rest lowercase |
#capitalize! | *options (optional) | String/nil | Modifies string capitalizing first character; returns nil if unchanged |
#swapcase | *options (optional) | String | Returns copy with case swapped |
#swapcase! | *options (optional) | String/nil | Modifies string swapping case; returns nil if unchanged |
Pattern Matching and Substitution
Method | Parameters | Returns | Description |
---|---|---|---|
#gsub(pattern, replacement) | pattern (Regexp/String), replacement (String/Hash) | String | Returns copy with pattern replaced |
#gsub!(pattern, replacement) | pattern (Regexp/String), replacement (String/Hash) | String/nil | Modifies string replacing pattern |
#sub(pattern, replacement) | pattern (Regexp/String), replacement (String/Hash) | String | Returns copy with first pattern replaced |
#sub!(pattern, replacement) | pattern (Regexp/String), replacement (String/Hash) | String/nil | Modifies string replacing first pattern |
#scan(pattern) | pattern (Regexp/String) | Array | Returns array of pattern matches |
#match(pattern, pos=0) | pattern (Regexp), pos (Integer) | MatchData/nil | Returns match data or nil |
Character Translation
Method | Parameters | Returns | Description |
---|---|---|---|
#tr(from_str, to_str) | from_str (String), to_str (String) | String | Returns copy with characters translated |
#tr!(from_str, to_str) | from_str (String), to_str (String) | String/nil | Modifies string translating characters |
#tr_s(from_str, to_str) | from_str (String), to_str (String) | String | Returns copy with characters translated and squeezed |
#tr_s!(from_str, to_str) | from_str (String), to_str (String) | String/nil | Modifies string translating and squeezing |
#delete(other_str) | other_str (String) | String | Returns copy with characters removed |
#delete!(other_str) | other_str (String) | String/nil | Modifies string removing characters |
#squeeze(other_str=nil) | other_str (String) | String | Returns copy with consecutive characters squeezed |
#squeeze!(other_str=nil) | other_str (String) | String/nil | Modifies string squeezing consecutive characters |
String Splitting and Partitioning
Method | Parameters | Returns | Description |
---|---|---|---|
#split(pattern=nil, limit=0) | pattern (Regexp/String/nil), limit (Integer) | Array | Splits string into array |
#partition(sep) | sep (String/Regexp) | Array | Returns [before, separator, after] |
#rpartition(sep) | sep (String/Regexp) | Array | Returns [before, separator, after] from right |
#lines(separator=$/) | separator (String) | Array | Returns array of lines |
#chars | None | Array | Returns array of characters |
#bytes | None | Array | Returns array of byte values |
Encoding Operations
Method | Parameters | Returns | Description |
---|---|---|---|
#encode(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Returns string in specified encoding |
#encode!(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Modifies string encoding |
#force_encoding(encoding) | encoding (String/Encoding) | String | Changes encoding without conversion |
#encoding | None | Encoding | Returns current encoding |
#valid_encoding? | None | Boolean | Checks if string has valid encoding |
#ascii_only? | None | Boolean | Checks if string contains only ASCII |
Encoding Options
Option | Values | Description |
---|---|---|
:invalid | :replace | How to handle invalid bytes (raises by default) |
:undef | :replace | How to handle undefined conversions (raises by default) |
:replace | String | Replacement string for invalid/undefined |
:fallback | Hash/Proc/Method | Fallback for undefined characters |
:xml | :text, :attr | XML entity conversion mode |
:cr_newline | Boolean | Convert LF to CR |
:crlf_newline | Boolean | Convert LF to CRLF |
:universal_newline | Boolean | Convert CRLF and CR to LF |