String Encoding

Overview

String encoding in Ruby controls how text characters map to byte sequences in memory and storage. Ruby represents strings as sequences of bytes combined with an encoding object that defines how to interpret those bytes as characters. Every string carries encoding metadata that determines character boundaries, case conversion rules, and validity constraints.

The Encoding class provides the foundation for all encoding operations. Ruby includes over 100 built-in encodings, from ASCII and UTF-8 to legacy formats like Windows-1252 and ISO-8859-1. The String#encoding method returns the encoding object, while String#encode converts between encodings.

str = "Hello, 世界"
str.encoding               # => #<Encoding:UTF-8>
str.bytesize              # => 13 (bytes in memory)
str.size                  # => 9 (logical characters)
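
The Encoding class can also enumerate and look up the built-in encodings directly. A small sketch (exact counts and alias lists vary by Ruby version):

Encoding.list.size          # => 100+ available encodings
Encoding.find('BINARY')     # => #<Encoding:ASCII-8BIT> (alias lookup)
Encoding::UTF_8.names       # => ["UTF-8", "CP65001", ...]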

Ruby strings are either valid or broken with respect to their declared encoding. Valid strings contain only byte sequences that form complete characters in that encoding. Broken strings contain invalid byte sequences, including strings that merely end with a truncated multibyte character; Ruby can still perform many operations on them.

# Valid UTF-8
valid = "café".force_encoding('UTF-8')
valid.valid_encoding?     # => true

# Broken encoding - invalid UTF-8 bytes
broken = "\xFF\xFE".force_encoding('UTF-8') 
broken.valid_encoding?    # => false

# String operations still work
broken.upcase             # => "\xFF\xFE"
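
A string cut off mid-character is also reported as invalid. A sketch using a lone UTF-8 lead byte:

# Incomplete - ends with a partial multibyte character
incomplete = "caf\xC3".force_encoding('UTF-8')  # 0xC3 begins the 2-byte é
incomplete.valid_encoding?  # => false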

The default source encoding for Ruby files is UTF-8. Ruby automatically handles encoding for string literals, but external data requires explicit encoding management. File operations, network communication, and database interactions all involve encoding decisions that affect correctness and performance.
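
As an illustration of the source-encoding default, a magic comment on a file's first line overrides it, and __ENCODING__ reports the encoding in effect:

# encoding: ISO-8859-1
# The comment above must be the file's first line to take effect
__ENCODING__          # => #<Encoding:ISO-8859-1>
"literal".encoding    # => #<Encoding:ISO-8859-1>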

Basic Usage

The primary encoding operations involve checking current encoding, converting between encodings, and forcing encoding interpretation. The String#encoding method returns the encoding object, while String#force_encoding changes the encoding label without converting bytes.

str = "résumé"
str.encoding                    # => #<Encoding:UTF-8>
str.force_encoding('ASCII-8BIT') # Change label only
str.encoding                    # => #<Encoding:ASCII-8BIT>
str.valid_encoding?             # => false (invalid ASCII bytes)

The String#encode method converts byte sequences from one encoding to another. This differs from force_encoding because it actually transforms the underlying bytes to match the target encoding's representation.

utf8_str = "café"               # UTF-8 source
latin1_str = utf8_str.encode('ISO-8859-1')
latin1_str.encoding             # => #<Encoding:ISO-8859-1>
latin1_str.bytes                # => [99, 97, 102, 233] (different bytes)

# Converting back
utf8_again = latin1_str.encode('UTF-8')
utf8_again == utf8_str          # => true

The Encoding.default_external and Encoding.default_internal settings control automatic conversion during I/O operations. Ruby applies these defaults when reading files or network data without explicit encoding specification.

# Check current defaults
Encoding.default_external       # => #<Encoding:UTF-8>
Encoding.default_internal       # => nil

# File operations use these defaults
File.write('test.txt', 'café')
content = File.read('test.txt')
content.encoding                # => #<Encoding:UTF-8>
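
The defaults can also be overridden per stream with an encoding in the mode string. Here 'legacy.txt' stands in for a hypothetical Latin-1 file transcoded to UTF-8 on read:

File.open('legacy.txt', 'r:ISO-8859-1:UTF-8') do |f|
  f.read.encoding   # => #<Encoding:UTF-8> (bytes transcoded on input)
end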

String concatenation between different encodings follows specific precedence rules. ASCII-compatible encodings can combine with ASCII strings, but incompatible encodings raise Encoding::CompatibilityError.

utf8_str = "café"               # UTF-8
ascii_str = "shop"              # ASCII (compatible with UTF-8)
result = utf8_str + ascii_str   # Works fine
result.encoding                 # => #<Encoding:UTF-8>

# Incompatible encodings
utf16_str = "test".encode('UTF-16')
utf8_str + utf16_str            # => Encoding::CompatibilityError
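
Encoding.compatible? checks these rules ahead of time, returning the encoding the result would have, or nil when concatenation would raise:

Encoding.compatible?("café", "shop")                   # => #<Encoding:UTF-8>
Encoding.compatible?("café", "test".encode('UTF-16'))  # => nil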

Pattern matching with regular expressions requires encoding compatibility between the pattern and target string. Ruby automatically promotes ASCII patterns to match the string's encoding when possible.

utf8_text = "The café is open"
ascii_pattern = /caf/           # ASCII pattern
utf8_text =~ ascii_pattern      # => 4 (works due to ASCII compatibility)

# Unicode character classes work with proper encoding
utf8_text.scan(/\p{L}+/)        # => ["The", "café", "is", "open"]

Error Handling & Debugging

Encoding errors fall into several categories, each requiring different handling strategies. Encoding::UndefinedConversionError occurs when the target encoding cannot represent specific characters. Encoding::InvalidByteSequenceError happens when source bytes are invalid for their declared encoding.

# UndefinedConversionError - character doesn't exist in target encoding
utf8_str = "café"
begin
  ascii_str = utf8_str.encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  puts "Cannot convert: #{e.error_char}"     # => "Cannot convert: é"
  puts "Source encoding: #{e.source_encoding}" # => UTF-8
  puts "Target encoding: #{e.destination_encoding}" # => US-ASCII
end
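
Encoding::InvalidByteSequenceError carries similar diagnostics. A sketch that triggers it with a broken UTF-8 input:

broken_utf8 = "caf\xFF".force_encoding('UTF-8')
begin
  broken_utf8.encode('UTF-16')
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.inspect}"  # => "\xFF"
  puts "Source encoding: #{e.source_encoding}"    # => UTF-8
end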

The encode method accepts options to handle conversion errors gracefully. The :invalid and :undef options specify replacement strategies, while :replace sets the replacement string.

problematic = "café\xFF"        # Mixed valid UTF-8 and invalid byte
# Handle both undefined conversion and invalid bytes
safe_ascii = problematic.encode('ASCII', 
  :invalid => :replace,         # Replace invalid bytes
  :undef => :replace,           # Replace undefined characters  
  :replace => '?')              # Use '?' as replacement
# => "caf??"

# XML-safe replacements
xml_safe = problematic.encode('ASCII',
  :invalid => :replace,
  :undef => :replace, 
  :replace => '&#xFFFD;')       # Unicode replacement character entity

Debugging encoding issues requires examining both the byte-level representation and the logical character structure. The String#bytes method reveals the actual byte sequence, while String#codepoints shows Unicode values.

suspicious = "caf\xE9"          # Suspicious string
suspicious.encoding             # => #<Encoding:UTF-8> (default)
suspicious.valid_encoding?      # => false

# Examine the bytes
suspicious.bytes                # => [99, 97, 102, 233]
# 233 (0xE9) is invalid UTF-8 sequence start

# Try different encoding interpretation
latin1_version = suspicious.force_encoding('ISO-8859-1')
latin1_version.valid_encoding?  # => true
latin1_version                  # => "café" (correct interpretation)

# Convert to proper UTF-8
fixed = latin1_version.encode('UTF-8')
fixed.bytes                     # => [99, 97, 102, 195, 169] (valid UTF-8)

The String#scrub method removes or replaces invalid byte sequences, providing a robust way to clean untrusted input. This method works regardless of the specific encoding errors present.

# String with multiple encoding problems
broken = "good\xFF\xFEbad\x80text"
broken.valid_encoding?          # => false

# Remove invalid sequences
cleaned = broken.scrub
cleaned                         # => "goodbadtext"

# Custom replacement
cleaned_custom = broken.scrub('') 
cleaned_custom                  # => "good�bad�text"

# Block form for complex replacement logic
smart_clean = broken.scrub do |invalid_bytes|
  invalid_bytes.unpack('H*')[0] # Show hex representation
end
smart_clean                     # => "goodfffebadjpegtext"

File encoding detection requires heuristic analysis since most file formats don't include explicit encoding metadata. Gems such as rchardet or charlock_holmes provide sophisticated detection, but simple patterns often suffice for known data sources.

# Simple BOM detection for UTF files
def detect_encoding(file_path)
  bytes = File.binread(file_path, 3).to_s
  if bytes.start_with?("\xEF\xBB\xBF".b)    # UTF-8 BOM
    'UTF-8'
  elsif bytes.start_with?("\xFF\xFE".b)     # UTF-16LE BOM
    'UTF-16LE'
  elsif bytes.start_with?("\xFE\xFF".b)     # UTF-16BE BOM
    'UTF-16BE'
  else
    'UTF-8'                                 # Assume UTF-8 when no BOM is present
  end
end
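
A typical use, with 'data.txt' standing in for any incoming file:

content = File.read('data.txt', encoding: detect_encoding('data.txt'))
content.encoding                # reflects the detected encoding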

Performance & Memory

Encoding conversions carry significant performance costs, especially for large strings. Converting UTF-8 to ASCII requires scanning every byte, and converting to or from UTF-16 rewrites the byte representation of every character. Avoiding unnecessary conversions improves both speed and memory usage.

# Performance comparison setup
require 'benchmark'

large_text = "café" * 100_000   # 400KB UTF-8 string
iterations = 1000

Benchmark.bm(20) do |x|
  x.report("force_encoding:") do
    # dup so the shared string keeps its UTF-8 label between runs
    iterations.times { large_text.dup.force_encoding('ASCII-8BIT') }
  end

  x.report("encode ASCII:") do
    # é is valid UTF-8 but undefined in ASCII, so handle :undef
    iterations.times { large_text.encode('ASCII', :undef => :replace) }
  end

  x.report("encode UTF-16:") do
    iterations.times { large_text.encode('UTF-16') }
  end
end

# Results show force_encoding is nearly free (just metadata change)
# while encode operations involve actual byte transformation

Memory allocation patterns differ significantly between encoding operations. force_encoding changes only metadata without allocating new strings, while encode always creates new string objects with potentially different byte lengths.

original = "café" * 10_000
puts "Original: #{original.bytesize} bytes"

# Force encoding on a copy - changes only the label, no byte copying
forced = original.dup.force_encoding('ASCII-8BIT')
puts "Forced: #{forced.bytesize} bytes"         # Same byte count

# Encoding conversion - new allocation required
utf16_version = original.encode('UTF-16')
puts "UTF-16: #{utf16_version.bytesize} bytes"  # Larger (BOM plus 2 bytes per character)

ascii_version = original.encode('ASCII', :undef => :replace)
puts "ASCII: #{ascii_version.bytesize} bytes"   # Smaller (each é collapses to '?')

String operations on encoded text show varying performance characteristics. Character-based operations like String#size require encoding-aware scanning, while byte-based operations like String#bytesize access cached metadata.

text = "résumé" * 50_000        # Mixed ASCII/UTF-8 content

Benchmark.bm(15) do |x|
  x.report("bytesize:") do
    100_000.times { text.bytesize }     # O(1) - metadata lookup
  end
  
  x.report("size:") do  
    100_000.times { text.size }         # O(n) - character counting
  end
  
  x.report("valid?:") do
    10_000.times { text.valid_encoding? } # O(n) - full validation scan
  end
end
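
ASCII-only strings sidestep most of this cost: Ruby tracks a 7-bit flag internally, so character counting collapses to the cached byte count. A small sketch:

ascii_text = "resume" * 50_000
ascii_text.ascii_only?          # => true
ascii_text.size                 # O(1) - equals bytesize for ASCII-only strings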

Bulk processing benefits from encoding normalization strategies that minimize conversion overhead. Converting all inputs to a common encoding once, rather than converting repeatedly during processing, reduces total computational cost.

# Inefficient: convert during each operation
def process_files_inefficient(file_paths)
  results = []
  file_paths.each do |path|
    content = File.read(path, encoding: 'ASCII-8BIT')  # Read as binary
    utf8_content = content.encode('UTF-8', :invalid => :replace, :undef => :replace)
    processed = utf8_content.upcase.gsub(/\s+/, ' ')   # Multiple operations
    results << processed.encode('ASCII', :invalid => :replace, :undef => :replace)
  end
  results
end

# Efficient: batch convert once
def process_files_efficient(file_paths) 
  # Read and normalize all files first
  normalized_contents = file_paths.map do |path|
    content = File.read(path, encoding: 'ASCII-8BIT')
    content.encode('UTF-8', :invalid => :replace, :undef => :replace)
  end
  
  # Process in consistent encoding
  results = normalized_contents.map do |content|
    content.upcase.gsub(/\s+/, ' ')
  end
  
  # Convert outputs once if needed
  results.map { |r| r.encode('ASCII', :invalid => :replace, :undef => :replace) }
end

Common Pitfalls

Encoding assumption errors represent the most frequent category of string encoding problems. Developers often assume UTF-8 encoding for external data, but file systems, databases, and network protocols may use different encodings without explicit indication.

# Dangerous assumption - file might not be UTF-8
def read_user_file_wrong(path)
  content = File.read(path)       # Assumes UTF-8
  content.upcase                  # May fail with invalid UTF-8
end

# Safe approach with encoding detection/handling  
def read_user_file_safe(path)
  # Read as binary first to inspect bytes
  binary_content = File.read(path, encoding: 'ASCII-8BIT')
  
  # Try UTF-8 first
  if binary_content.force_encoding('UTF-8').valid_encoding?
    return binary_content.force_encoding('UTF-8')
  end
  
  # Fallback to Latin-1 (can represent any byte)
  binary_content.force_encoding('ISO-8859-1').encode('UTF-8')
end

The force_encoding vs encode confusion leads to subtle bugs where strings appear correct but contain invalid byte sequences. force_encoding changes interpretation without validating compatibility, while encode performs actual conversion.

# Common mistake: using force_encoding for conversion
latin1_bytes = "café".encode('ISO-8859-1')
# Wrong way - just changes label, doesn't convert bytes
wrong_utf8 = latin1_bytes.force_encoding('UTF-8')  
wrong_utf8.valid_encoding?      # => false! (invalid UTF-8 bytes)

# Correct way - actually converts byte sequences  
right_utf8 = latin1_bytes.encode('UTF-8')
right_utf8.valid_encoding?      # => true

Regular expression encoding compatibility creates unexpected match failures. Patterns compiled with one encoding may fail to match strings with different encodings, even when the content appears identical.

# Pattern containing only ASCII characters compiles as US-ASCII
ascii_pattern = /caf/
ascii_pattern.encoding          # => #<Encoding:US-ASCII>

utf8_text = "I love café food"  
utf8_text.encoding              # => #<Encoding:UTF-8>

# This works due to ASCII compatibility promotion
result = utf8_text =~ ascii_pattern  # => 7

# But a pattern holding raw non-ASCII bytes in another encoding is incompatible
latin1_pattern = Regexp.new("caf\xE9".force_encoding('ISO-8859-1'))
utf8_text =~ latin1_pattern          # raises Encoding::CompatibilityError

# Solution: build the pattern from a string in the text's encoding
compatible_pattern = Regexp.new('café'.encode(utf8_text.encoding))
utf8_text =~ compatible_pattern      # => 7

JSON and XML parsing libraries exhibit inconsistent encoding behavior. Some libraries respect source encoding, others force UTF-8, and many provide configuration options that change default behavior between versions.

require 'json'

# JSON standard requires UTF-8, but input might arrive in another encoding
latin1_json = '{"name": "café"}'.encode('ISO-8859-1')

# This might fail depending on JSON library version and configuration
begin
  # Some JSON parsers expect UTF-8 only
  parsed = JSON.parse(latin1_json)
rescue JSON::ParserError, EncodingError
  # Convert to UTF-8 first
  parsed = JSON.parse(latin1_json.encode('UTF-8'))
end

Database encoding mismatches cause data corruption that persists across application restarts. Character data stored with incorrect encoding interpretation becomes permanently corrupted unless detected and repaired quickly.

# Simulate database encoding mismatch
class DatabaseSimulator
  def initialize
    @storage = []
  end
  
  # Wrong: stores UTF-8 bytes as Latin-1
  def store_wrong(text)
    # Application sends UTF-8, database interprets as Latin-1
    stored_bytes = text.encode('UTF-8').force_encoding('ISO-8859-1')
    @storage << stored_bytes
    stored_bytes
  end
  
  # Retrieval compounds the problem
  def retrieve_wrong(index)
    stored = @storage[index]
    # Database transcodes its "Latin-1" data to UTF-8 on the way out,
    # double-encoding the original UTF-8 bytes
    stored.encode('UTF-8')            # => mojibake like "cafÃ©"
  end
  
  # Correct approach - consistent encoding throughout
  def store_correct(text)
    utf8_text = text.encode('UTF-8')  # Normalize input
    @storage << utf8_text
    utf8_text
  end
  
  def retrieve_correct(index)
    @storage[index]                   # Already UTF-8
  end
end

# Demonstration of corruption
db = DatabaseSimulator.new
original = "café"

# Wrong way corrupts data
db.store_wrong(original)
retrieved_wrong = db.retrieve_wrong(0)
retrieved_wrong                       # => "cafÃ©" (valid UTF-8, wrong characters)
retrieved_wrong == original           # => false

# Correct way preserves data
db.store_correct(original)  
retrieved_correct = db.retrieve_correct(1)
retrieved_correct.valid_encoding?     # => true
retrieved_correct == original         # => true

Reference

Core Encoding Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| String#encoding | none | Encoding | Returns the encoding object for the string |
| String#force_encoding(encoding) | encoding (String/Encoding) | String | Changes the encoding label without converting bytes |
| String#encode(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Converts the string to the specified encoding |
| String#encode!(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | In-place encoding conversion |
| String#valid_encoding? | none | Boolean | Tests whether the string contains only valid byte sequences |
| String#ascii_only? | none | Boolean | Tests whether the string contains only ASCII characters |
| String#scrub(replace = nil) | replace (String), optional block | String | Removes or replaces invalid byte sequences |

Encoding Information Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| String#bytesize | none | Integer | Returns the byte count |
| String#size / String#length | none | Integer | Returns the character count |
| String#bytes | none | Array<Integer> | Returns an array of byte values |
| String#codepoints | none | Array<Integer> | Returns an array of Unicode code points |
| String#each_byte | optional block | Enumerator/self | Iterates over each byte |
| String#each_codepoint | optional block | Enumerator/self | Iterates over each code point |

Encoding Class Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Encoding.list | none | Array<Encoding> | Returns all available encodings |
| Encoding.find(name) | name (String) | Encoding | Finds an encoding by name or alias |
| Encoding.default_external | none | Encoding | Returns the default external encoding |
| Encoding.default_internal | none | Encoding/nil | Returns the default internal encoding |
| Encoding.compatible?(obj1, obj2) | two objects | Encoding/nil | Returns the compatible encoding, or nil |

Conversion Options

| Option | Values | Description |
| --- | --- | --- |
| :invalid | :replace (nil raises) | How to handle invalid byte sequences |
| :undef | :replace (nil raises) | How to handle undefined character conversions |
| :replace | String | Replacement string for invalid/undefined characters (default '?', or U+FFFD for Unicode targets) |
| :fallback | Hash/Proc | Character-specific replacement mappings |
| :xml | :text, :attr | XML-safe replacement mode |
| :cr_newline | Boolean | Convert LF to CR during transcoding |
| :crlf_newline | Boolean | Convert LF to CRLF during transcoding |
| :universal_newline | Boolean | Convert CRLF and CR to LF during transcoding |

Common Encoding Names

| Encoding | Aliases | Byte Width | Description |
| --- | --- | --- | --- |
| UTF-8 | CP65001 | 1-4 bytes | Unicode, web standard |
| UTF-16 | (UTF-16BE/UTF-16LE are separate encodings) | 2 or 4 bytes | Unicode, common on Windows |
| US-ASCII | ASCII | 1 byte | 7-bit ASCII characters only |
| ASCII-8BIT | BINARY | 1 byte | Binary data, no character interpretation |
| ISO-8859-1 | ISO8859-1 ("Latin-1") | 1 byte | Western European languages |
| Windows-1252 | CP1252 | 1 byte | Windows Western European |
| Shift_JIS | SJIS | 1-2 bytes | Japanese character encoding |

Exception Hierarchy

EncodingError
├── Encoding::UndefinedConversionError
├── Encoding::InvalidByteSequenceError  
├── Encoding::ConverterNotFoundError
└── Encoding::CompatibilityError
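
Because the specific subclasses all inherit from EncodingError, a single rescue can act as a catch-all when the precise failure mode doesn't matter:

begin
  "café".encode('ASCII')
rescue EncodingError => e
  e.class   # => Encoding::UndefinedConversionError
end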

Encoding Detection Patterns

| Byte Sequence | Encoding | Location |
| --- | --- | --- |
| EF BB BF | UTF-8 | BOM at start |
| FF FE | UTF-16LE | BOM at start |
| FE FF | UTF-16BE | BOM at start |
| 00 xx 00 xx | UTF-16BE | Pattern in text |
| xx 00 xx 00 | UTF-16LE | Pattern in text |
| All bytes < 0x80 | ASCII | Throughout |
| Valid UTF-8 sequences | UTF-8 | Heuristic analysis |

Performance Characteristics

| Operation | Time Complexity | Notes |
| --- | --- | --- |
| String#bytesize | O(1) | Cached metadata |
| String#size | O(n) | Must count characters (O(1) for ASCII-only strings) |
| String#valid_encoding? | O(n) | Full scan; result cached until the string is modified |
| String#force_encoding | O(1) | Metadata change only |
| String#encode | O(n) | Byte transformation required |
| String#scrub | O(n) | Scans and potentially copies |