Language Encoding and Magic Comments

A comprehensive guide to character encoding handling and magic comment syntax in Ruby for text processing and source file configuration.

Overview

Ruby provides comprehensive support for character encodings through its encoding system, which manages how text is represented, processed, and converted between different character sets. Magic comments allow developers to specify encoding and other directives directly in source files.

The encoding system centers around the Encoding class and string encoding methods. Every string object has an associated encoding that determines how its bytes are interpreted as characters. Ruby can convert between encodings, validate encoding correctness, and handle encoding mismatches.

Magic comments are special line comments placed at the top of a source file. The encoding directive must sit on the first line (or on the second, when the first is a shebang) and tells the interpreter how to decode the file. Other directives, such as frozen_string_literal, control further language features.

# -*- coding: utf-8 -*-
# This file uses UTF-8 encoding

str = "Hello, 世界"
str.encoding  # => #<Encoding:UTF-8>
str.bytesize  # => 13 (bytes)
str.length    # => 9 (characters)

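Since Ruby 2.0, a source file without an encoding magic comment is read as UTF-8, so string literals are UTF-8 by default regardless of the environment. What does depend on the environment is Encoding.default_external, the encoding applied to data read from files and other I/O.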
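To make the distinction between bytes and characters concrete, here is a minimal sketch that reads the same three bytes under two encoding labels. The variable names are illustrative only, and the snippet assumes a UTF-8 source file (the default).

```ruby
# Three bytes that spell the single character "世" in UTF-8.
bytes = "\xE4\xB8\x96".b   # String#b returns a binary (ASCII-8BIT) copy

# Attach two different encoding labels to copies of the same bytes.
as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1)

as_utf8.length     # => 1  (one three-byte character)
as_latin1.length   # => 3  (three one-byte characters)
as_utf8.bytes == as_latin1.bytes  # => true, only the interpretation differs
```

The encoding is a label on the string, not a property of the bytes, which is why the same data can report different lengths.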

Basic Usage

String Encoding Operations

The String#encoding method returns the encoding of a string, while String#encode converts between encodings:

# Check current encoding
text = "Ruby encoding"
text.encoding.name  # => "UTF-8"

# Convert to different encoding
iso_text = text.encode("ISO-8859-1")
iso_text.encoding.name  # => "ISO-8859-1"

# Convert back to UTF-8
utf8_text = iso_text.encode("UTF-8")

The String#force_encoding method changes the encoding interpretation without converting bytes:

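# A "\x" literal in a UTF-8 source file is still tagged UTF-8;
# String#b returns a copy tagged as binary (ASCII-8BIT).
binary_data = "\x48\x65\x6c\x6c\x6f".b
binary_data.encoding.name  # => "ASCII-8BIT"

# Reinterpret as UTF-8 (force_encoding modifies the receiver and returns it)
text = binary_data.force_encoding("UTF-8")
text.encoding  # => #<Encoding:UTF-8>
text  # => "Hello"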
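The difference between the two methods is easiest to see side by side. The following sketch (variable names are illustrative, UTF-8 source file assumed) transcodes one copy of a string and merely relabels another.

```ruby
original = "é"   # UTF-8, bytes C3 A9

# encode: same character, new bytes
transcoded = original.encode("ISO-8859-1")
transcoded.bytes            # => [233]

# force_encoding: same bytes, new label (dup keeps `original` untouched)
relabeled = original.dup.force_encoding("ISO-8859-1")
relabeled.bytes             # => [195, 169]
relabeled.encode("UTF-8")   # => "Ã©"  (classic mojibake)
```

Use encode when the text is correct and a different byte representation is needed; use force_encoding only when the bytes are correct and the label is wrong.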

Magic Comment Syntax

Magic comments specify source file encoding using various formats:

# -*- coding: utf-8 -*-
# Standard Emacs-style format

# coding: utf-8
# Simple format

# encoding: utf-8
# Alternative format

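Each directive must be written as a line comment. A =begin/=end block is not scanned for magic comments, so a coding: line inside one has no effect.

The encoding magic comment must appear on the first line of the file, or on the second line when the first is a shebang. Ruby matches the directive loosely (the text "coding" followed by ":" or "=" and an encoding name), which is why all of the line-comment formats above are recognized.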
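The placement rule can be checked empirically. The sketch below writes small source files to a temporary directory, loads them, and reports the source encoding each one ended up with. The helper literal_encoding_for and the global $demo_encoding are hypothetical names invented for this demonstration.

```ruby
require "tmpdir"

# Hypothetical helper: write `source` to a temp file, load it, and return
# whatever the file stored in $demo_encoding.
def literal_encoding_for(source)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "demo.rb")
    File.write(path, source)
    $demo_encoding = nil
    load path
    $demo_encoding
  end
end

first_line = literal_encoding_for(<<~'RUBY')
  # coding: iso-8859-1
  $demo_encoding = __ENCODING__
RUBY

after_shebang = literal_encoding_for(<<~'RUBY')
  #!/usr/bin/env ruby
  # coding: iso-8859-1
  $demo_encoding = __ENCODING__
RUBY

too_late = literal_encoding_for(<<~'RUBY')
  require "set"
  # coding: iso-8859-1
  $demo_encoding = __ENCODING__
RUBY

first_line     # => #<Encoding:ISO-8859-1>
after_shebang  # => #<Encoding:ISO-8859-1>
too_late       # => #<Encoding:UTF-8>, the directive was ignored
```

__ENCODING__ reports the source encoding of the file in which it appears, so it is a convenient way to confirm that a directive took effect.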

Encoding Validation

Use String#valid_encoding? to check if a string's bytes are valid for its declared encoding:

valid_utf8 = "Hello, 世界"
valid_utf8.valid_encoding?  # => true

# Invalid UTF-8 sequence
invalid = "\xFF\xFE".force_encoding("UTF-8")
invalid.valid_encoding?  # => false
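Once a string is known to be invalid, String#scrub (available since Ruby 2.1) is one way to repair it. A brief sketch, with an illustrative input string:

```ruby
raw = "abc \xFF def"       # \xFF can never appear in UTF-8
raw.valid_encoding?        # => false

raw.scrub                  # => "abc \uFFFD def" (default replacement is U+FFFD)
raw.scrub("?")             # => "abc ? def"

# A block receives the offending bytes and returns their replacement.
raw.scrub { |bad| "<#{bad.unpack1('H*').upcase}>" }  # => "abc <FF> def"
```

scrub returns a new string and leaves the receiver unchanged; scrub! repairs in place.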

Advanced Usage

Transcoding with Options

The String#encode method accepts options for handling conversion edge cases:

source = "café naïve résumé"

# Replace characters that ASCII cannot represent
result = source.encode("ASCII",
  invalid: :replace,
  undef: :replace,
  replace: "?")
# => "caf? na?ve r?sum?"

# XML numeric character references for unrepresentable characters
# (do not strip them first, or there is nothing left to escape)
xml_safe = source.encode("ASCII", xml: :text)
# => "caf&#xE9; na&#xEF;ve r&#xE9;sum&#xE9;"

# Custom replacement
custom = source.encode("ASCII",
  fallback: {"é" => "e", "ï" => "i"})
# => "cafe naive resume"
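Two further variations are worth a quick sketch: the :attr form of the xml option, and a callable fallback. The lambda name to_escape and the sample string are illustrative choices, not part of any API.

```ruby
source = "café <b>"

# :text escapes &, <, > and turns unrepresentable characters into references.
source.encode("US-ASCII", xml: :text)
# => "caf&#xE9; &lt;b&gt;"

# :attr does the same, also escapes double quotes, and wraps the result in quotes.
source.encode("US-ASCII", xml: :attr)
# => "\"caf&#xE9; &lt;b&gt;\""

# A callable fallback receives each unrepresentable character as a string.
to_escape = ->(char) { format("\\u%04X", char.ord) }
source.encode("US-ASCII", fallback: to_escape)
# => "caf\\u00E9 <b>"
```

A callable is handy when the mapping cannot be enumerated in advance, whereas a Hash raises for any character it does not list.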

Encoding Conversion Chains

Complex encoding workflows can chain multiple conversions:

class EncodingProcessor
  def self.normalize_text(input, target_encoding = "UTF-8")
    # First ensure we have a valid encoding
    working = if input.valid_encoding?
      input
    else
      input.encode("UTF-8",
        invalid: :replace,
        undef: :replace)
    end

    # Convert to target with error handling
    working.encode(target_encoding,
      invalid: :replace,
      undef: :replace,
      universal_newline: true)
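  rescue EncodingError
    # Fallback (for example, no converter exists for target_encoding):
    # plain ASCII with "?" in place of anything unrepresentable.
    # Ruby has no iconv-style "//TRANSLIT//IGNORE" suffix.
    input.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")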
  end
end

messy_input = "Mixed\xFFencoding\x80text"
clean = EncodingProcessor.normalize_text(messy_input)

IO and File Encoding

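File operations tag or transcode data according to the encoding given in the open mode or in the encoding options. Without one, they fall back to Encoding.default_external, which is usually derived from the locale. Magic comments affect only literals in the source file, never file I/O: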

# Specify encoding when opening files
File.open("data.txt", "r:utf-8") do |file|
  content = file.read
  content.encoding  # => #<Encoding:UTF-8>
end

# Write with specific encoding
File.open("output.txt", "w:iso-8859-1") do |file|
  file.write("Content in Latin-1")
end

# Transcode while reading
File.open("input.txt", "r:iso-8859-1:utf-8") do |file|
  utf8_content = file.read  # Auto-converted to UTF-8
end
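The three open modes above can be exercised in one round trip. This is a sketch under the assumption of a writable temporary directory; latin1_round_trip is a hypothetical helper name.

```ruby
require "tmpdir"

# Hypothetical helper: write text as Latin-1, then read it back three ways.
def latin1_round_trip(text)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "latin1.txt")

    # The string is transcoded from UTF-8 to Latin-1 as it is written.
    File.open(path, "w:iso-8859-1") { |file| file.write(text) }

    {
      bytes_on_disk: File.binread(path).bytes,
      tagged:        File.read(path, encoding: "iso-8859-1"),
      transcoded:    File.read(path, encoding: "iso-8859-1:utf-8")
    }
  end
end

result = latin1_round_trip("café")
result[:bytes_on_disk]        # => [99, 97, 102, 233]
result[:tagged].encoding      # => #<Encoding:ISO-8859-1>
result[:transcoded]           # => "café" (UTF-8 again)
```

The "external:internal" form does the transcoding during the read, so the rest of the program only ever sees UTF-8.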

Regular Expressions and Encoding

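A regexp literal takes its encoding from the source file. An ASCII-only pattern without an encoding flag stays compatible with any ASCII-compatible string, while a pattern that contains non-ASCII characters (or carries the /u flag) is fixed to its encoding: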

pattern = /café/u  # Unicode regex
text = "I love café au lait"

# Encoding must be compatible
if pattern.encoding.name == text.encoding.name
  matches = text.scan(pattern)
end

# Fixed string with compatible encoding
fixed_pattern = pattern.source.encode(text.encoding)
compatible_regex = Regexp.new(fixed_pattern)
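Regexp#encoding and Regexp#fixed_encoding? show which patterns are flexible and which are pinned. A small sketch with illustrative variable names, assuming a UTF-8 source file:

```ruby
ascii_only = /caf/     # no non-ASCII characters, no encoding flag
non_ascii  = /café/    # non-ASCII literal pins the pattern to UTF-8
forced     = /caf/u    # the /u flag pins an ASCII-only pattern to UTF-8

ascii_only.encoding          # => #<Encoding:US-ASCII>
ascii_only.fixed_encoding?   # => false
non_ascii.encoding           # => #<Encoding:UTF-8>
non_ascii.fixed_encoding?    # => true
forced.fixed_encoding?       # => true

# A Latin-1 string containing a non-ASCII byte (dup avoids touching the literal).
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")

ascii_only =~ latin1         # => 0, the flexible pattern adapts to the string
begin
  forced =~ latin1
rescue Encoding::CompatibilityError => e
  e.class                    # => Encoding::CompatibilityError
end
```

Leaving off /u on ASCII-only patterns keeps them usable against strings in any ASCII-compatible encoding.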

Error Handling & Debugging

Common Encoding Exceptions

Ruby raises specific exceptions for encoding problems:

begin
  # Incompatible encoding operation: both strings contain non-ASCII
  # characters, so neither encoding can absorb the other
  "café" + "\xE9".force_encoding("ISO-8859-1")
rescue Encoding::CompatibilityError => e
  puts "Encoding mismatch: #{e.message}"
end

begin
  # Invalid byte sequence: \xFF can never appear in UTF-8
  "abc\xFFdef".encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.inspect}"
  puts "Bytes to re-read: #{e.readagain_bytes.inspect}"
end

begin
  # Undefined conversion
  "café".encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Cannot convert: #{e.source_encoding} -> #{e.destination_encoding}"
  puts "Problem character: #{e.error_char.inspect}"
end

Debugging Encoding Issues

Create diagnostic tools for encoding problems:

class EncodingDebugger
  def self.analyze_string(str)
    puts "String: #{str.inspect}"
    puts "Encoding: #{str.encoding.name}"
    puts "Bytesize: #{str.bytesize}"
    puts "Length: #{str.length}"
    puts "Valid: #{str.valid_encoding?}"

    if str.bytesize < 50  # Avoid huge output
      puts "Bytes: #{str.bytes.map { |b| '0x%02X' % b }.join(' ')}"
    end

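    # Check for common problematic sequences (byte-level, so it works
    # whatever encoding the string is tagged with)
    if str.bytes.include?(0xFF)
      puts "⚠️  Contains 0xFF bytes (never valid in UTF-8)"
    end

    # length always equals bytesize for binary strings, so test for
    # non-ASCII bytes instead
    if str.encoding == Encoding::BINARY && !str.ascii_only?
      puts "⚠️  Binary encoding but non-ASCII content detected"
    end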
  end

  def self.safe_convert(str, to_encoding)
    str.encode(to_encoding)
  rescue Encoding::UndefinedConversionError => e
    puts "Failed conversion: #{e.error_char} not available in #{to_encoding}"
    str.encode(to_encoding, undef: :replace, replace: '?')
  rescue Encoding::InvalidByteSequenceError => e
    puts "Invalid byte sequence: #{e.error_bytes.inspect}"
    str.encode(to_encoding, invalid: :replace, replace: '?')
  end
end

# Usage
problematic = "Text with \xFF invalid bytes"
EncodingDebugger.analyze_string(problematic)

Validation Strategies

Implement comprehensive encoding validation:

module EncodingValidator
  def self.validate_text(text, expected_encoding = nil)
    return false if text.nil? || text.empty?

    # Check basic validity
    return false unless text.valid_encoding?

    # Check expected encoding if specified
    if expected_encoding
      expected = Encoding.find(expected_encoding)
      return false unless text.encoding == expected
    end

    # Check for control characters (optional)
    control_chars = text.chars.select { |c| c.ord < 32 && !"\t\n\r".include?(c) }
    return false unless control_chars.empty?

    true
  end

  def self.sanitize_input(input, target_encoding = "UTF-8")
    return "" if input.nil?

    # Convert to string if needed
    text = input.to_s

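    # Binary-tagged input carries no encoding information; assume UTF-8.
    # dup first so the caller's string is not modified (and a frozen
    # string does not raise).
    if text.encoding == Encoding::ASCII_8BIT
      text = text.dup.force_encoding("UTF-8")
    end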

    # Clean up invalid sequences
    unless text.valid_encoding?
      text = text.encode("UTF-8",
        invalid: :replace,
        undef: :replace,
        replace: "")
    end

    # Convert to target encoding
    text.encode(target_encoding,
      invalid: :replace,
      undef: :replace)
  rescue EncodingError
    ""  # Return empty string when the conversion itself is impossible
  end
end

Production Patterns

Web Application Encoding

Web frameworks such as Rails expect UTF-8 text, so input from forms, uploads, and external services should be normalized at the boundary:

class TextProcessor
  # Ensure UTF-8 for web content
  def self.prepare_for_web(input)
    return "" if input.nil?

    # Normalize to UTF-8
    text = input.to_s.encode("UTF-8",
      invalid: :replace,
      undef: :replace,
      replace: "")

    # Remove or replace problematic characters
    text.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/, "")
        .strip
  end

  # Handle file uploads with unknown encoding
  def self.detect_and_convert(file_content)
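    # Try common encodings in order. ISO-8859-1 accepts every byte
    # sequence, so it must come last or later candidates are never tried.
    encodings = ["UTF-8", "Windows-1252", "ISO-8859-1"]

    encodings.each do |encoding|
      begin
        # dup so the caller's string keeps its original encoding tag
        decoded = file_content.dup.force_encoding(encoding)
        if decoded.valid_encoding?
          return decoded.encode("UTF-8")
        end
      rescue Encoding::UndefinedConversionError
        # For example, bytes that Windows-1252 leaves unassigned
        next
      end
    end

    # Fallback: treat as UTF-8 and replace invalid bytes
    file_content.dup.force_encoding("UTF-8")
                .encode("UTF-8", invalid: :replace, undef: :replace)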
  end
end

# Middleware for encoding normalization
class EncodingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Ensure request parameters are properly encoded
    if env["REQUEST_METHOD"] == "POST"
      normalize_params(env["rack.request.form_hash"])
    end

    @app.call(env)
  end

  private

  def normalize_params(params)
    return unless params

    params.each do |key, value|
      if value.is_a?(String)
        params[key] = TextProcessor.prepare_for_web(value)
      elsif value.is_a?(Hash)
        normalize_params(value)
      end
    end
  end
end

Database Integration

Database encoding requires coordination between Ruby and storage:

class DatabaseTextHandler
  # Prepare text for database storage
  def self.prepare_for_db(text, column_encoding = "utf8mb4")
    return nil if text.nil?

    # Normalize to UTF-8 first
    normalized = text.to_s.encode("UTF-8",
      invalid: :replace,
      undef: :replace)

    case column_encoding.downcase
    when "utf8mb4"
      # Full Unicode support
      normalized
    when "utf8"
      # Basic Multilingual Plane only (no 4-byte chars)
      normalized.gsub(/[\u{10000}-\u{10FFFF}]/, "")
    when "latin1"
      # Convert to Latin-1, replacing unsupported chars
      normalized.encode("ISO-8859-1",
        invalid: :replace,
        undef: :replace,
        replace: "?")
    else
      normalized
    end
  end

  # Handle text retrieved from database
  def self.process_from_db(text, source_encoding = "UTF-8")
    return nil if text.nil?

    # Ensure proper encoding tag (dup so the caller's string is untouched)
    if text.encoding != Encoding.find(source_encoding)
      text = text.dup.force_encoding(source_encoding)
    end

    # Validate and clean
    unless text.valid_encoding?
      text = text.encode("UTF-8",
        invalid: :replace,
        undef: :replace)
    end

    text
  end
end

# ActiveRecord integration example
class TextContent < ActiveRecord::Base
  before_save :normalize_encoding

  private

  def normalize_encoding
    self.content = DatabaseTextHandler.prepare_for_db(content, "utf8mb4")
  end
end

Logging and Monitoring

Track encoding issues in production environments:

class EncodingLogger
  def self.log_encoding_error(error, context = {})
    case error
    when Encoding::CompatibilityError
      Rails.logger.warn("[ENCODING] Compatibility error: #{error.message}")
      Rails.logger.warn("[ENCODING] Context: #{context}")
    when Encoding::InvalidByteSequenceError
      Rails.logger.error("[ENCODING] Invalid byte sequence")
      Rails.logger.error("[ENCODING] Error bytes: #{error.error_bytes.inspect}")
      Rails.logger.error("[ENCODING] Context: #{context}")
    when Encoding::UndefinedConversionError
      Rails.logger.warn("[ENCODING] Undefined conversion")
      Rails.logger.warn("[ENCODING] #{error.source_encoding} -> #{error.destination_encoding}")
      Rails.logger.warn("[ENCODING] Character: #{error.error_char.inspect}")
    end
  end

  def self.monitor_encoding_health
    # Check default encodings
    issues = []

    if Encoding.default_external != Encoding::UTF_8
      issues << "External encoding not UTF-8: #{Encoding.default_external}"
    end

    if Encoding.default_internal && Encoding.default_internal != Encoding::UTF_8
      issues << "Internal encoding not UTF-8: #{Encoding.default_internal}"
    end

    issues.each { |issue| Rails.logger.warn("[ENCODING] #{issue}") }
    issues.empty?
  end
end

Common Pitfalls

Binary Data Confusion

The most frequent encoding error involves treating binary data as text:

# WRONG: Treating binary as text
binary_file = File.read("image.jpg")  # Binary data read in text mode
binary_file.encoding  # => default external encoding (usually UTF-8), not ASCII-8BIT

# The bytes are not valid UTF-8, so text operations fail or corrupt data
text_operations = binary_file.upcase.gsub(/pattern/, "replacement")

# CORRECT: Explicitly handle binary data
binary_file = File.binread("image.jpg")  # Explicitly binary
binary_file.encoding  # => ASCII-8BIT

# Keep binary data separate from text operations
text_content = File.read("text.txt", encoding: "UTF-8")

Concatenating strings with different encodings raises a compatibility error when both contain non-ASCII bytes. An ASCII-only string combines with any ASCII-compatible string without error:

# WRONG: Mixing encodings
utf8_string = "Héllo"
binary_data = "\xFF\xFE".b  # ASCII-8BIT

# This raises Encoding::CompatibilityError
combined = utf8_string + binary_data

# CORRECT: Convert to compatible encoding first
# (each byte with no UTF-8 mapping becomes U+FFFD)
safe_binary = binary_data.encode("UTF-8",
  invalid: :replace,
  undef: :replace)
combined = utf8_string + safe_binary
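Encoding.compatible? reports in advance whether two strings can be combined, and which encoding the result would carry. A short sketch with illustrative variable names:

```ruby
ascii_text = "Hello"          # UTF-8, but ASCII-only content
accented   = "Héllo"          # UTF-8 with a non-ASCII character
binary     = "\xFF\xFE".b     # ASCII-8BIT with high bytes

# An ASCII-only string adopts the other operand's encoding.
Encoding.compatible?(ascii_text, binary) == Encoding::BINARY   # => true
(ascii_text + binary).encoding == Encoding::BINARY             # => true

# Two non-ASCII strings in different encodings cannot be combined.
Encoding.compatible?(accented, binary)                         # => nil

# The same holds in the other direction when the binary side is ASCII-only.
Encoding.compatible?(accented, "abc".b)                        # => #<Encoding:UTF-8>
```

Checking compatibility first lets a caller choose between transcoding, scrubbing, or rejecting the input instead of rescuing the exception afterwards.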

Magic Comment Placement

Magic comments must appear in specific locations to take effect:

# WRONG: Magic comment too late
require 'some_gem'
# -*- coding: utf-8 -*-  # Too late, ignored

class MyClass
  # The directive above is ignored, so this literal is read using the
  # default source encoding (UTF-8 since Ruby 2.0)
  GREETING = "Héllo"
end

# CORRECT: Magic comment at top
# -*- coding: utf-8 -*-
require 'some_gem'

class MyClass
  GREETING = "Héllo"  # Properly interpreted as UTF-8
end

Force Encoding Misuse

force_encoding changes interpretation without validation:

# WRONG: Forcing incompatible encoding
invalid_utf8 = "\xFF\xFE\x00\x48"  # Not valid UTF-8
text = invalid_utf8.force_encoding("UTF-8")
text.valid_encoding?  # => false

# Operations on invalid encoding can fail unpredictably
text.upcase  # May raise exception

# CORRECT: Validate after forcing or use encode
safe_text = invalid_utf8.force_encoding("UTF-8")
if safe_text.valid_encoding?
  safe_text.upcase
else
  # Handle invalid encoding appropriately
  safe_text.encode("UTF-8", invalid: :replace, undef: :replace)
end

Regular Expression Encoding

A regular expression with a fixed encoding must match the string's encoding:

# FINE: an ASCII-only pattern adapts to any ASCII-compatible string
ascii_pattern = /hello/
utf8_string = "hello wörld"
result = utf8_string.gsub(ascii_pattern, "hi")  # => "hi wörld"

# WRONG: UTF-8 pattern against a non-ASCII string in another encoding
pattern_source = "café"
latin1_string = "I love caf\xE9".force_encoding("ISO-8859-1")

# This will raise Encoding::CompatibilityError
regex = Regexp.new(pattern_source)
latin1_string.gsub(regex, "tea")

# CORRECT: Ensure compatible encodings
pattern = /café/u  # Unicode pattern
utf8_string = "I love café"

# Both are UTF-8 compatible
result = utf8_string.gsub(pattern, "tea")

File I/O Encoding Defaults

File operations use system defaults that may not match expectations:

# WRONG: Assuming encoding
content = File.read("data.txt")  # Uses default encoding
content.encoding  # May not be what you expect

# Content might be misinterpreted
processed = content.upcase.gsub(/é/, "e")

# CORRECT: Explicitly specify encoding
content = File.read("data.txt", encoding: "UTF-8")
# or
content = File.read("data.txt", external_encoding: "ISO-8859-1",
                                internal_encoding: "UTF-8")

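# Handle unknown encoding safely. File.read does not raise on invalid
# bytes; it only tags the string, so check validity explicitly.
content = File.read("data.txt", encoding: "UTF-8")
unless content.valid_encoding?
  # Fallback: reinterpret the same bytes as Latin-1 and transcode
  content = content.force_encoding("ISO-8859-1").encode("UTF-8")
end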
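A related surprise is the byte order mark that some editors prepend to UTF-8 files. The sketch below, which assumes a writable temporary directory and uses illustrative variable names, compares a plain UTF-8 read with the BOM-aware mode listed in the reference section.

```ruby
require "tmpdir"

plain, stripped = Dir.mktmpdir do |dir|
  path = File.join(dir, "bom.txt")
  File.binwrite(path, "\xEF\xBB\xBFhello")   # UTF-8 BOM followed by text

  [File.read(path, encoding: "UTF-8"),
   File.read(path, mode: "r:BOM|UTF-8")]
end

plain.length      # => 6, the BOM survives as the character U+FEFF
stripped          # => "hello"
```

With the BOM-aware mode, a file without a BOM is simply read as UTF-8, so the mode is safe to use for both kinds of file.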

Reference

Core Encoding Methods

Method | Parameters | Returns | Description
String#encoding | None | Encoding | Current encoding of the string
String#encode(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Convert to specified encoding
String#encode!(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Convert in place
String#force_encoding(encoding) | encoding (String/Encoding) | String | Change encoding interpretation
String#valid_encoding? | None | Boolean | Check if bytes are valid for encoding
String#ascii_only? | None | Boolean | Check if string contains only ASCII characters

Encoding Class Methods

Method | Parameters | Returns | Description
Encoding.list | None | Array<Encoding> | All available encodings
Encoding.find(name) | name (String) | Encoding | Find encoding by name
Encoding.compatible?(str1, str2) | Two objects | Encoding or nil | Compatible encoding or nil
Encoding.default_external | None | Encoding | Default external encoding
Encoding.default_internal | None | Encoding or nil | Default internal encoding (nil when unset)

Encode Options

Option | Type | Description
:invalid | Symbol | How to handle invalid byte sequences (:replace; raises when omitted)
:undef | Symbol | How to handle undefined conversions (:replace; raises when omitted)
:replace | String | Replacement string for invalid/undefined
:fallback | Hash/Proc/Method | Custom character mappings
:xml | Symbol | XML entity encoding (:text, :attr)
:cr_newline | Boolean | Convert LF to CR
:crlf_newline | Boolean | Convert LF to CRLF
:universal_newline | Boolean | Convert CRLF/CR to LF

Common Encodings

Encoding | Description | Use Case
UTF-8 | Variable-length Unicode | Modern applications, web
ASCII-8BIT | Binary data | File I/O, network protocols
ISO-8859-1 | Latin-1 | Legacy European text
Windows-1252 | Windows Latin | Windows legacy files
UTF-16LE/BE | 16-bit Unicode | Windows, Java strings
Shift_JIS | Japanese encoding | Japanese legacy systems

Magic Comment Formats

# Emacs style
# -*- coding: utf-8 -*-
# -*- coding: iso-8859-1 -*-

# Simple format
# coding: utf-8
# encoding: utf-8

# Vim style
# vim: set fileencoding=utf-8 :

# Not recognized: directives inside =begin/=end block comments are ignored

Exception Hierarchy

EncodingError
├── Encoding::CompatibilityError
├── Encoding::InvalidByteSequenceError
├── Encoding::UndefinedConversionError
└── Encoding::ConverterNotFoundError
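Because all four classes descend from EncodingError, a single rescue clause can cover them. The following sketch uses a hypothetical helper, encoding_failure, that returns the class of the error raised by a conversion, or nil on success.

```ruby
# Hypothetical helper: attempt a conversion and report which error class
# (if any) it raised.
def encoding_failure(str, target)
  str.encode(target)
  nil
rescue EncodingError => e
  e.class
end

encoding_failure("café", "US-ASCII")          # => Encoding::UndefinedConversionError
encoding_failure("\xFF", "UTF-16LE")          # => Encoding::InvalidByteSequenceError
encoding_failure("café", "no-such-encoding")  # => Encoding::ConverterNotFoundError
encoding_failure("abc", "US-ASCII")           # => nil
```

Rescuing the base class suits boundary code that only needs to know a conversion failed; rescue the specific subclasses when the recovery strategy differs.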

File I/O Encoding Options

Format | External | Internal | Description
"r:utf-8" | UTF-8 | None | Read as UTF-8
"r:iso-8859-1:utf-8" | ISO-8859-1 | UTF-8 | Transcode to UTF-8
"r:bom|utf-8" | UTF-8 | None | Auto-detect BOM
"w:utf-8" | UTF-8 | None | Write as UTF-8