CrackedRuby - Data Cleansing

Overview

Data cleansing removes errors, inconsistencies, and inaccuracies from datasets to improve data quality and reliability. The process addresses issues including duplicate records, missing values, format inconsistencies, invalid entries, and data type mismatches. Data cleansing operations occur at multiple stages: during data ingestion, before processing, after transformation, and as part of regular maintenance routines.

The quality of data directly impacts application reliability, business decisions, and system performance. Unclean data causes application errors, produces incorrect analytics, degrades user experience, and increases storage costs. A single malformed email address can crash an email service that does not guard against it; duplicate customer records create billing errors; inconsistent date formats break time-series analysis.

Data cleansing differs from data validation in scope and timing. Validation checks whether data meets predefined rules and rejects invalid input. Cleansing accepts flawed data and transforms it into acceptable formats. Validation acts as a gatekeeper; cleansing acts as a repair mechanism.
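The gatekeeper and repair roles can be contrasted in a short sketch. The helper names `valid_age?` and `cleanse_age` are hypothetical and exist only for this illustration: the validator rejects anything that is not already well formed, while the cleanser repairs what it can and returns a usable value.

```ruby
# Hypothetical helpers contrasting the two roles; the names and the
# three-digit limit are assumptions made for this example.

# Validation: accept or reject the input exactly as it arrived.
def valid_age?(input)
  input.is_a?(String) && input.match?(/\A\d{1,3}\z/)
end

# Cleansing: trim the input, then convert it if it can be repaired.
def cleanse_age(input)
  digits = input.to_s.strip[/\A\d{1,3}\z/]
  digits&.to_i
end

valid_age?(" 25 ")   # => false (surrounding whitespace fails the rule)
cleanse_age(" 25 ")  # => 25    (whitespace repaired, type coerced)
cleanse_age("abc")   # => nil   (nothing to repair)
```

In practice the two often run together: cleanse first, then validate the cleansed value.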

# Raw data before cleansing
raw_customer = {
  email: "  JOHN@EXAMPLE.COM  ",
  phone: "(555) 123-4567",
  name: "john   doe",
  age: "25",
  signup_date: "03/15/2023"
}

# After cleansing operations
clean_customer = {
  email: "john@example.com",
  phone: "5551234567",
  name: "John Doe",
  age: 25,
  signup_date: Date.parse("2023-03-15")
}

Data cleansing operations fall into distinct categories: structural cleansing standardizes formats and types; content cleansing corrects values and removes noise; deduplication eliminates redundant records; enrichment fills missing information. Each category requires different techniques and validation strategies.

Key Principles

Data cleansing follows a deterministic process where each operation produces predictable, repeatable results. The same input data subjected to identical cleansing operations yields identical output regardless of execution time or environment. This determinism enables testing, debugging, and audit trails.

Idempotency ensures cleansing operations can execute multiple times without changing results after the first application. Running a whitespace trimming operation twice produces the same result as running it once. Idempotent operations make data pipelines resilient to retries and partial failures.
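A quick way to check this property for a given operation is to apply it twice and compare. The sketch below uses a hypothetical helper named `idempotent?`; it tests one sample input, so it can expose a non-idempotent operation but cannot prove idempotency in general.

```ruby
# Minimal sketch: an operation is idempotent for an input when applying
# it twice gives the same result as applying it once.
# `idempotent?` is a hypothetical helper name.
def idempotent?(input, &operation)
  once = operation.call(input)
  twice = operation.call(once)
  once == twice
end

idempotent?("  hello  ") { |v| v.strip }      # => true
idempotent?("a    b") { |v| v.squeeze(' ') }  # => true
idempotent?("hello") { |v| "#{v}!" }          # => false (appends on every run)
```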

Data preservation maintains original values alongside cleansed versions when possible. Applications store raw data before transformation, enabling recovery from cleansing errors and supporting forensic analysis. Audit trails track what changed, when, and why.

class DataRecord
  attr_reader :raw_value, :cleansed_value, :transformations
  
  def initialize(value)
    @raw_value = value
    @cleansed_value = value
    @transformations = []
  end
  
  def apply_cleansing(operation, &block)
    previous = @cleansed_value
    @cleansed_value = block.call(@cleansed_value)
    @transformations << {
      operation: operation,
      before: previous,
      after: @cleansed_value,
      timestamp: Time.now
    }
  end
end

record = DataRecord.new("  HELLO  ")
record.apply_cleansing(:strip) { |v| v.strip }
record.apply_cleansing(:downcase) { |v| v.downcase }
record.raw_value       # => "  HELLO  "
record.cleansed_value  # => "hello"
record.transformations.map { |t| t[:operation] }  # => [:strip, :downcase]

Error handling strategies determine how cleansing responds to problematic data. Strategies include rejecting invalid records, setting default values, inferring corrections, or flagging for manual review. The chosen strategy depends on data criticality and error tolerance.
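The four strategies can be sketched against a single field. The method name `clean_quantity`, the `strategy:` values, and the status symbols are assumptions made for this example, not an established API.

```ruby
# Sketch of reject / default / infer / flag applied to a quantity field.
# All names here are hypothetical.
def clean_quantity(raw, strategy:, default: nil)
  text = raw.to_s.strip
  return { value: text.to_i, status: :ok } if text.match?(/\A\d+\z/)

  case strategy
  when :reject
    { value: nil, status: :rejected }
  when :default
    { value: default, status: :defaulted }
  when :infer
    # Infer a correction by keeping only the digits, e.g. "12 units" -> 12
    digits = text.gsub(/\D/, '')
    digits.empty? ? { value: nil, status: :rejected } : { value: digits.to_i, status: :inferred }
  when :flag
    # Keep the raw value so a person can inspect it later
    { value: raw, status: :needs_review }
  else
    raise ArgumentError, "Unknown strategy: #{strategy}"
  end
end

clean_quantity("12 units", strategy: :reject)               # => { value: nil, status: :rejected }
clean_quantity("12 units", strategy: :default, default: 0)  # => { value: 0, status: :defaulted }
clean_quantity("12 units", strategy: :infer)                # => { value: 12, status: :inferred }
clean_quantity("12 units", strategy: :flag)                 # => { value: "12 units", status: :needs_review }
```

Inference is the riskiest of the four: stripping non-digits also turns "-5" into 5, so it suits fields where a wrong guess is cheap.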

Validation boundaries define acceptable ranges, formats, and values. Boundaries enforce business rules: ages between 0 and 150, email addresses matching RFC specifications, postal codes matching geographic formats. Clear boundaries enable automated cleansing decisions.
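One way to keep boundaries explicit is to declare them as data that cleansing code consults. The sketch below is illustrative: the field names, the country list, and the `within_bounds?` helper are assumptions, while the age range follows the example in the paragraph above.

```ruby
# Boundaries declared in one place. Field names and limits are
# illustrative assumptions.
BOUNDARIES = {
  age: ->(v) { v.is_a?(Integer) && (0..150).cover?(v) },
  zip: ->(v) { v.is_a?(String) && v.match?(/\A\d{5}(-\d{4})?\z/) },
  country: ->(v) { %w[US CA MX].include?(v) }
}.freeze

# Returns true when the value satisfies the boundary for the field.
def within_bounds?(field, value)
  rule = BOUNDARIES.fetch(field) { raise ArgumentError, "No boundary for #{field}" }
  rule.call(value)
end

within_bounds?(:age, 25)             # => true
within_bounds?(:age, 151)            # => false
within_bounds?(:zip, "12345-6789")   # => true
within_bounds?(:zip, "1234")         # => false
```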

Type coercion converts data between representations while preserving semantic meaning. Converting the string "123" to integer 123 maintains the numeric concept. Type coercion requires understanding source and target formats to avoid data loss or misinterpretation.

# Type coercion with semantic preservation
def coerce_to_integer(value)
  case value
  when Integer
    value
  when Float
    value.round
  when String
    # Match against the stripped text; \A and \z anchor the whole string
    stripped = value.strip
    stripped.delete(',').to_i if stripped.match?(/\A\d[\d,]*\z/)
  when NilClass
    nil
  else
    raise TypeError, "Cannot coerce #{value.class} to Integer"
  end
end

coerce_to_integer("1,234")  # => 1234
coerce_to_integer(1234.7)   # => 1235
coerce_to_integer("abc")    # => nil

Normalization transforms data into consistent formats. Text normalization converts to uniform case, removes accents, standardizes whitespace. Numeric normalization scales values to common ranges. Date normalization converts to ISO 8601 format. Normalization eliminates representation differences that complicate comparisons and aggregations.
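The three kinds of normalization can be sketched with standard-library calls only. The helper names below are hypothetical, and min-max scaling is one possible choice for the "common range" step rather than the only one.

```ruby
require 'date'

# Text: uniform case, accents removed, whitespace collapsed.
# Decomposing to NFD splits an accented letter into a base letter plus
# a combining mark, which the \p{Mn} (nonspacing mark) class removes.
def normalize_text(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase.strip.gsub(/\s+/, ' ')
end

# Numeric: min-max scaling to the 0.0..1.0 range.
def normalize_numbers(values)
  min, max = values.minmax
  return values.map { 0.0 } if min == max

  values.map { |v| (v - min).to_f / (max - min) }
end

# Date: ISO 8601 string from a known input format.
def normalize_date(text, format: '%m/%d/%Y')
  Date.strptime(text, format).iso8601
end

normalize_text("  Café   CRÈME ")  # => "cafe creme"
normalize_numbers([10, 20, 30])    # => [0.0, 0.5, 1.0]
normalize_date("03/15/2023")       # => "2023-03-15"
```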

Contextual awareness applies different cleansing rules based on data origin, purpose, and domain. Email addresses for marketing require validation; internal system identifiers do not. Financial data requires precision; display text tolerates approximation. Context determines appropriate cleansing strategies.
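A small lookup table is one way to express context-dependent rules. In the sketch below, the context names and the `clean_for` helper are assumptions for illustration, and the email check is a deliberately loose shape test.

```ruby
# Sketch: cleansing rules selected by context. The context names
# (:marketing_email, :internal_id, :display_text) are hypothetical.
CONTEXT_RULES = {
  # Marketing email: normalize, then require a plausible address shape
  marketing_email: ->(value) {
    email = value.strip.downcase
    email.match?(/\A[^@\s]+@[^@\s]+\.[^@\s]+\z/) ? email : nil
  },
  # Internal identifier: trusted source, passed through untouched
  internal_id: ->(value) { value },
  # Display text: cosmetic cleanup only
  display_text: ->(value) { value.strip.squeeze(' ') }
}.freeze

def clean_for(context, value)
  CONTEXT_RULES.fetch(context).call(value)
end

clean_for(:marketing_email, "  A@B.com ")   # => "a@b.com"
clean_for(:marketing_email, "not-an-email") # => nil
clean_for(:internal_id, " ID-001 ")         # => " ID-001 "
clean_for(:display_text, "  hi   there ")   # => "hi there"
```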

Ruby Implementation

Ruby's String class provides extensive methods for text cleansing. The strip, lstrip, and rstrip methods remove leading and trailing whitespace. The squeeze method collapses repeated characters. The tr and delete methods remove or replace character sets.

class StringCleaner
  def self.clean_whitespace(text)
    return nil if text.nil?
    text.strip.squeeze(' ')
  end
  
  def self.remove_non_alphanumeric(text)
    return nil if text.nil?
    text.gsub(/[^a-zA-Z0-9\s]/, '')
  end
  
  def self.normalize_case(text, style: :titlecase)
    return nil if text.nil?
    
    case style
    when :titlecase
      text.split.map(&:capitalize).join(' ')
    when :sentence
      text.capitalize
    when :upper
      text.upcase
    when :lower
      text.downcase
    else
      text
    end
  end
end

StringCleaner.clean_whitespace("  hello    world  ")  # => "hello world"
StringCleaner.remove_non_alphanumeric("hello@world!")  # => "helloworld"
StringCleaner.normalize_case("john doe", style: :titlecase)  # => "John Doe"

Ruby's Regexp class enables pattern-based cleansing through matching and substitution. Regular expressions identify invalid characters, extract structured data, and validate formats. Capture groups, numbered or named, organize extracted components for reconstruction.

class PhoneCleaner
  PHONE_PATTERN = /\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})/
  
  def self.normalize(phone)
    return nil if phone.nil? || phone.empty?
    
    match = phone.match(PHONE_PATTERN)
    return nil unless match
    
    "#{match[1]}#{match[2]}#{match[3]}"
  end
  
  def self.format(phone, style: :plain)
    digits = normalize(phone)
    return nil unless digits
    
    case style
    when :dotted
      "#{digits[0..2]}.#{digits[3..5]}.#{digits[6..9]}"
    when :dashed
      "#{digits[0..2]}-#{digits[3..5]}-#{digits[6..9]}"
    when :parens
      "(#{digits[0..2]}) #{digits[3..5]}-#{digits[6..9]}"
    else
      digits
    end
  end
end

PhoneCleaner.normalize("(555) 123-4567")  # => "5551234567"
PhoneCleaner.format("5551234567", style: :parens)  # => "(555) 123-4567"
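PhoneCleaner uses numbered groups. A variant with named captures might look like the sketch below; the group names (`area`, `exchange`, `line`) and the `phone_parts` helper are chosen for this example. The pattern is also anchored with \A and \z, so input with extra digits is rejected rather than partially matched.

```ruby
# Variant of the phone pattern using named captures.
# Group names and the helper name are hypothetical.
NAMED_PHONE_PATTERN = /\A\(?(?<area>\d{3})\)?[-.\s]?(?<exchange>\d{3})[-.\s]?(?<line>\d{4})\z/

def phone_parts(phone)
  match = NAMED_PHONE_PATTERN.match(phone.to_s.strip)
  return nil unless match

  # named_captures returns string keys; convert them to symbols
  match.named_captures.transform_keys(&:to_sym)
end

parts = phone_parts("(555) 123-4567")
# => { area: "555", exchange: "123", line: "4567" }
"(#{parts[:area]}) #{parts[:exchange]}-#{parts[:line]}"  # => "(555) 123-4567"
phone_parts("5551234567890")  # => nil (too many digits for the anchored pattern)
```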

Ruby's Array class and Enumerable module provide methods for collection-level cleansing. The compact method removes nil values. The uniq method eliminates duplicates. The select and reject methods filter based on predicates.

class CollectionCleaner
  def self.remove_nils(collection)
    collection.compact
  end
  
  def self.remove_duplicates(collection, key: nil)
    if key
      collection.uniq { |item| item.send(key) }
    else
      collection.uniq
    end
  end
  
  def self.remove_outliers(numbers, std_devs: 3)
    return [] if numbers.empty?
    
    mean = numbers.sum.to_f / numbers.size
    variance = numbers.sum { |n| (n - mean) ** 2 } / numbers.size
    std_dev = Math.sqrt(variance)
    
    threshold = std_devs * std_dev
    numbers.select { |n| (n - mean).abs <= threshold }
  end
end

data = [1, nil, 2, nil, 3, 2]
CollectionCleaner.remove_nils(data)  # => [1, 2, 3, 2]
CollectionCleaner.remove_duplicates(data.compact)  # => [1, 2, 3]

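values = [10, 12, 11, 9, 100, 13, 8]
# With only 7 values, no point can lie 3 standard deviations from the mean,
# so this small sample needs a tighter threshold
CollectionCleaner.remove_outliers(values, std_devs: 2)  # => [10, 12, 11, 9, 13, 8]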

The encode method handles character encoding issues. Data from external sources often arrives in mixed encodings. The encode method converts between encodings and replaces invalid or undefined bytes; the unicode_normalize method converts text to a consistent Unicode normalization form.

class EncodingCleaner
  def self.clean_encoding(text, target: Encoding::UTF_8)
    return nil if text.nil?
    
    text.encode(target,
      invalid: :replace,
      undef: :replace,
      replace: ''
    )
  end
  
  def self.normalize_unicode(text)
    return nil if text.nil?
    
    # Convert to composed form (NFC)
    text.unicode_normalize(:nfc)
  end
  
  def self.remove_bom(text)
    return nil if text.nil?
    
    text.sub(/\A\uFEFF/, '')
  end
end

# Handle mixed encoding
dirty_text = "Hello\xE9World".force_encoding('ASCII-8BIT')
EncodingCleaner.clean_encoding(dirty_text)  # => "HelloWorld"

# Normalize Unicode (é can be e + combining accent)
EncodingCleaner.normalize_unicode("café")  # => Composed form

Practical Examples

Email address cleansing standardizes formats for validation and deduplication. The process converts to lowercase, removes whitespace, validates structure, and extracts components for analysis.

class EmailCleaner
  EMAIL_PATTERN = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
  
  def self.clean(email)
    return nil if email.nil? || email.empty?
    
    # Remove whitespace and convert to lowercase
    cleaned = email.strip.downcase
    
    # Validate basic structure
    return nil unless cleaned.match?(EMAIL_PATTERN)
    
    # Defensive check: require exactly one local part and one domain
    parts = cleaned.split('@')
    return nil if parts.length != 2
    
    local, domain = parts
    
    # Remove dots from Gmail addresses (ignored by Gmail)
    if domain == 'gmail.com'
      local = local.delete('.')
    end
    
    # Remove plus addressing
    local = local.split('+').first
    
    "#{local}@#{domain}"
  end
  
  def self.extract_domain(email)
    cleaned = clean(email)
    return nil unless cleaned
    
    cleaned.split('@').last
  end
end

EmailCleaner.clean("  John.Doe+spam@Gmail.com  ")  # => "johndoe@gmail.com"
EmailCleaner.extract_domain("user@example.com")  # => "example.com"

Address cleansing normalizes postal addresses for geocoding and matching. The process standardizes abbreviations, corrects common misspellings, validates components, and formats consistently.

class AddressCleaner
  STREET_TYPES = {
    'street' => 'St',
    'str' => 'St',
    'avenue' => 'Ave',
    'av' => 'Ave',
    'road' => 'Rd',
    'boulevard' => 'Blvd',
    'blvd' => 'Blvd',
    'drive' => 'Dr',
    'lane' => 'Ln',
    'court' => 'Ct'
  }
  
  DIRECTIONS = {
    'north' => 'N',
    'south' => 'S',
    'east' => 'E',
    'west' => 'W',
    'northeast' => 'NE',
    'northwest' => 'NW',
    'southeast' => 'SE',
    'southwest' => 'SW'
  }
  
  def self.normalize_street(street)
    return nil if street.nil? || street.empty?
    
    # Capitalize each word first so the abbreviations substituted below
    # keep their casing (capitalize would turn "NE" into "Ne")
    normalized = street.strip.split.map(&:capitalize).join(' ')
    
    # Normalize street types
    STREET_TYPES.each do |variant, standard|
      normalized.gsub!(/\b#{variant}\b\.?/i, standard)
    end
    
    # Normalize directions
    DIRECTIONS.each do |full, abbrev|
      normalized.gsub!(/\b#{full}\b/i, abbrev)
    end
    
    normalized
  end
  
  def self.clean_zipcode(zipcode)
    return nil if zipcode.nil?
    
    # Remove all non-digits
    digits = zipcode.to_s.gsub(/\D/, '')
    
    # Validate length (5 or 9 digits)
    return nil unless [5, 9].include?(digits.length)
    
    # Format with hyphen for ZIP+4
    digits.length == 9 ? "#{digits[0..4]}-#{digits[5..8]}" : digits
  end
end

AddressCleaner.normalize_street("123 north main street")  # => "123 N Main St"
AddressCleaner.clean_zipcode("12345-6789")  # => "12345-6789"
AddressCleaner.clean_zipcode("12345 6789")  # => "12345-6789"

Financial data cleansing handles currency values, removes formatting characters, validates precision, and converts to standard numeric types. The process must preserve decimal accuracy and handle multiple currency formats.

require 'bigdecimal'

class CurrencyCleaner
  def self.clean(value, currency: 'USD')
    return nil if value.nil?
    
    # Convert to string for processing
    str_value = value.to_s.strip
    
    # Remove currency symbols and common separators
    cleaned = str_value.gsub(/[$,£€¥\s]/, '')
    
    # Handle parentheses for negative values (accounting format)
    if cleaned.match?(/^\(.*\)$/)
      cleaned = "-#{cleaned.tr('()', '')}"
    end
    
    # Convert to BigDecimal for precision
    begin
      BigDecimal(cleaned)
    rescue ArgumentError
      nil
    end
  end
  
  def self.format(value, currency: 'USD', precision: 2)
    return nil if value.nil?
    
    decimal_value = value.is_a?(BigDecimal) ? value : BigDecimal(value.to_s)
    
    formatted = decimal_value.round(precision).to_s('F')
    
    # Add thousands separators
    parts = formatted.split('.')
    parts[0].gsub!(/(\d)(?=(\d{3})+(?!\d))/, '\1,')
    
    # Add currency symbol
    symbol = case currency
    when 'USD' then '$'
    when 'EUR' then '€'
    when 'GBP' then '£'
    else currency
    end
    
    "#{symbol}#{parts.join('.')}"
  end
end

CurrencyCleaner.clean("$1,234.56")  # => BigDecimal("1234.56")
CurrencyCleaner.clean("($500.00)")  # => BigDecimal("-500.00")
CurrencyCleaner.format(1234.567, precision: 2)  # => "$1,234.57"

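Date and time cleansing converts between formats, handles timezone differences, validates ranges, and standardizes to ISO 8601. Multiple input formats require parsing heuristics and format detection. Ambiguous inputs resolve to whichever format is tried first: in the example below, 05/03/2023 is read as May 3 because the month/day format precedes the day/month format.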

require 'date'

class DateCleaner
  COMMON_FORMATS = [
    '%Y-%m-%d',           # 2023-03-15
    '%m/%d/%Y',           # 03/15/2023
    '%d/%m/%Y',           # 15/03/2023
    '%Y/%m/%d',           # 2023/03/15
    '%b %d, %Y',          # Mar 15, 2023
    '%B %d, %Y',          # March 15, 2023
    '%d-%b-%Y',           # 15-Mar-2023
  ]
  
  def self.parse(date_string)
    return nil if date_string.nil? || date_string.empty?
    
    cleaned = date_string.strip
    
    # Try each format
    COMMON_FORMATS.each do |format|
      begin
        return Date.strptime(cleaned, format)
      rescue ArgumentError
        next
      end
    end
    
    # Try natural language parsing
    begin
      Date.parse(cleaned)
    rescue ArgumentError
      nil
    end
  end
  
  def self.validate_range(date, min: nil, max: nil)
    return false if date.nil?
    
    return false if min && date < min
    return false if max && date > max
    
    true
  end
  
  def self.normalize(date)
    return nil if date.nil?
    
    date.is_a?(Date) ? date.iso8601 : parse(date)&.iso8601
  end
end

DateCleaner.parse("03/15/2023")  # => Date object
DateCleaner.parse("March 15, 2023")  # => Date object
DateCleaner.normalize("03/15/2023")  # => "2023-03-15"

date = Date.parse("2023-03-15")
DateCleaner.validate_range(date, min: Date.parse("2020-01-01"))  # => true

Common Patterns

The pipeline pattern chains multiple cleansing operations in sequence. Each operation transforms data and passes results to the next stage. Pipelines enable modular, testable cleansing workflows.

class CleansingPipeline
  def initialize
    @operations = []
  end
  
  def add(name, &block)
    @operations << { name: name, operation: block }
    self
  end
  
  def execute(data)
    result = data
    @operations.each do |op|
      result = op[:operation].call(result)
      break if result.nil?
    end
    result
  end
  
  def self.for_email
    new
      .add(:trim) { |v| v&.strip }
      .add(:lowercase) { |v| v&.downcase }
      .add(:validate) { |v| v&.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i) ? v : nil }
  end
  
  def self.for_phone
    new
      .add(:extract_digits) { |v| v&.gsub(/\D/, '') }
      .add(:validate_length) { |v| v&.length == 10 ? v : nil }
  end
end

email_pipeline = CleansingPipeline.for_email
email_pipeline.execute("  JOHN@EXAMPLE.COM  ")  # => "john@example.com"
email_pipeline.execute("invalid")  # => nil

phone_pipeline = CleansingPipeline.for_phone
phone_pipeline.execute("(555) 123-4567")  # => "5551234567"

The strategy pattern selects cleansing approaches based on data characteristics. Different strategies handle different data types, quality levels, or business requirements. Strategies encapsulate domain-specific logic.

class CleansingStrategy
  def clean(value)
    raise NotImplementedError
  end
end

class ConservativeStrategy < CleansingStrategy
  def clean(value)
    # Only remove obvious errors, preserve ambiguous data
    return nil if value.nil?
    value.strip
  end
end

class AggressiveStrategy < CleansingStrategy
  def clean(value)
    # Apply extensive transformations
    return nil if value.nil? || value.strip.empty?
    value.strip.squeeze(' ').downcase
  end
end

class DataCleaner
  def initialize(strategy)
    @strategy = strategy
  end
  
  def clean(value)
    @strategy.clean(value)
  end
end

conservative = DataCleaner.new(ConservativeStrategy.new)
conservative.clean("  Hello  World  ")  # => "Hello  World"

aggressive = DataCleaner.new(AggressiveStrategy.new)
aggressive.clean("  Hello  World  ")  # => "hello world"

The decorator pattern adds cleansing capabilities incrementally. Each decorator wraps another cleaner and adds specific functionality. Decorators compose complex cleansing from simple components.

class BaseCleaner
  def clean(value)
    value
  end
end

class TrimDecorator
  def initialize(cleaner)
    @cleaner = cleaner
  end
  
  def clean(value)
    result = @cleaner.clean(value)
    result&.strip
  end
end

class LowercaseDecorator
  def initialize(cleaner)
    @cleaner = cleaner
  end
  
  def clean(value)
    result = @cleaner.clean(value)
    result&.downcase
  end
end

class RemoveSpecialCharsDecorator
  def initialize(cleaner)
    @cleaner = cleaner
  end
  
  def clean(value)
    result = @cleaner.clean(value)
    result&.gsub(/[^a-zA-Z0-9\s]/, '')
  end
end

# Compose cleaners
cleaner = BaseCleaner.new
cleaner = TrimDecorator.new(cleaner)
cleaner = LowercaseDecorator.new(cleaner)
cleaner = RemoveSpecialCharsDecorator.new(cleaner)

cleaner.clean("  Hello, World!  ")  # => "hello world"

The batch processing pattern processes large datasets efficiently. Batch operations load data in chunks, apply cleansing transformations, and write results incrementally. This pattern manages memory usage and enables progress tracking.

class BatchCleaner
  def initialize(batch_size: 1000)
    @batch_size = batch_size
  end
  
  def process_file(input_path, output_path, &cleaner)
    File.open(output_path, 'w') do |output|
      File.foreach(input_path).each_slice(@batch_size) do |batch|
        cleaned_batch = batch.map { |line| cleaner.call(line) }.compact
        cleaned_batch.each { |line| output.puts(line) }
      end
    end
  end
  
  def process_array(data, &cleaner)
    data.each_slice(@batch_size).flat_map do |batch|
      batch.map { |item| cleaner.call(item) }.compact
    end
  end
end

cleaner = BatchCleaner.new(batch_size: 100)

# Process large array
data = (1..10000).map { |i| "  item #{i}  " }
cleaned = cleaner.process_array(data) { |item| item.strip.downcase }

Error Handling & Edge Cases

Null and empty value handling requires distinguishing between missing data, empty strings, and whitespace-only content. Each case demands different treatment based on business requirements.

class NullSafeCleanser
  def self.clean(value, allow_empty: false, allow_whitespace: false)
    # Handle nil
    return nil if value.nil?
    
    # Convert to string
    str_value = value.to_s
    
    # Handle empty after conversion
    return nil if str_value.empty? && !allow_empty
    
    # Handle whitespace-only
    stripped = str_value.strip
    return nil if stripped.empty? && !allow_whitespace && !allow_empty
    
    stripped
  end
  
  def self.with_default(value, default:, &cleaner)
    result = cleaner.call(value)
    result.nil? || result.empty? ? default : result
  end
end

NullSafeCleanser.clean(nil)  # => nil
NullSafeCleanser.clean("")  # => nil
NullSafeCleanser.clean("", allow_empty: true)  # => ""
NullSafeCleanser.clean("   ")  # => nil
NullSafeCleanser.clean("   ", allow_whitespace: true)  # => ""

NullSafeCleanser.with_default("", default: "N/A") { |v| v&.strip }  # => "N/A"

Encoding errors occur when data contains invalid byte sequences or mixed character encodings. Handling requires detection, conversion, and fallback strategies.

class EncodingErrorHandler
  def self.clean_with_fallback(text, target: Encoding::UTF_8)
    return nil if text.nil?
    
    # Attempt direct encoding
    begin
      return text.encode(target)
    rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
      # Try with replacement characters
      begin
        return text.encode(target,
          invalid: :replace,
          undef: :replace,
          replace: '?'
        )
      rescue StandardError
        # Last resort: force encoding on a copy and scrub
        return text.dup.force_encoding(target).scrub('?')
      end
    end
  end
  
  def self.detect_and_convert(text)
    return nil if text.nil?
    
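    # Try common encodings. Windows-1252 comes before ISO-8859-1 because
    # every byte sequence is valid ISO-8859-1, so that attempt always succeeds
    [Encoding::UTF_8, Encoding::WINDOWS_1252, Encoding::ISO_8859_1].each do |encoding|
      begin
        # dup so the caller's string keeps its original encoding
        converted = text.dup.force_encoding(encoding).encode(Encoding::UTF_8)
        return converted if converted.valid_encoding?
      rescue StandardError
        next
      end
    end
    
    # Fallback: scrub invalid sequences on a copy
    text.dup.force_encoding(Encoding::UTF_8).scrub('?')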
  end
end

dirty = "Hello\xE9".force_encoding('ASCII-8BIT')
EncodingErrorHandler.clean_with_fallback(dirty)  # => "Hello?"

Type coercion errors happen when converting between incompatible types or when source data violates target constraints. Safe coercion validates before conversion and provides meaningful error messages.

class SafeCoercer
  class CoercionError < StandardError; end
  
  def self.to_integer(value, strict: false)
    case value
    when Integer
      value
    when Float
      strict ? raise(CoercionError, "Lossy conversion") : value.to_i
    when String
      cleaned = value.strip
      return nil if cleaned.empty?
      
      # Remove common formatting
      numeric_part = cleaned.gsub(/[,\s]/, '')
      
      if numeric_part.match?(/\A-?\d+\z/)
        numeric_part.to_i
      elsif strict
        raise CoercionError, "Invalid integer format: #{value}"
      else
        nil
      end
    when NilClass
      nil
    else
      strict ? raise(CoercionError, "Cannot coerce #{value.class}") : nil
    end
  end
  
  def self.to_date(value, strict: false)
    return value if value.is_a?(Date)
    return nil if value.nil?
    
    begin
      Date.parse(value.to_s)
    rescue ArgumentError
      strict ? raise(CoercionError, "Invalid date: #{value}") : nil
    end
  end
end

SafeCoercer.to_integer("1,234")  # => 1234
SafeCoercer.to_integer("abc")  # => nil
SafeCoercer.to_integer("abc", strict: true)  # => Raises CoercionError

SafeCoercer.to_date("2023-03-15")  # => Date object
SafeCoercer.to_date("invalid")  # => nil

Validation failures require clear error reporting and recovery strategies. Validation errors should indicate which rule failed and why, enabling corrective action.

class ValidationResult
  attr_reader :valid, :errors, :value
  
  def initialize(valid:, value: nil, errors: [])
    @valid = valid
    @value = value
    @errors = errors
  end
  
  def valid?
    @valid
  end
  
  def invalid?
    !@valid
  end
end

class ValidatingCleaner
  def initialize
    @rules = []
  end
  
  def add_rule(name, &block)
    @rules << { name: name, validator: block }
    self
  end
  
  def clean(value)
    errors = []
    cleaned_value = value
    
    @rules.each do |rule|
      result = rule[:validator].call(cleaned_value)
      
      if result.is_a?(Hash)
        if result[:valid]
          cleaned_value = result[:value]
        else
          errors << { rule: rule[:name], message: result[:error] }
          break # later rules assume earlier ones passed
        end
      elsif result == false
        errors << { rule: rule[:name], message: "Validation failed" }
        break
      else
        cleaned_value = result
      end
    end
    
    ValidationResult.new(
      valid: errors.empty?,
      value: cleaned_value,
      errors: errors
    )
  end
end

email_cleaner = ValidatingCleaner.new
email_cleaner.add_rule(:not_empty) do |v|
  v && !v.strip.empty? ? { valid: true, value: v.strip } : { valid: false, error: "Cannot be empty" }
end
email_cleaner.add_rule(:format) do |v|
  v.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i) ? 
    { valid: true, value: v.downcase } : 
    { valid: false, error: "Invalid email format" }
end

result = email_cleaner.clean("  JOHN@EXAMPLE.COM  ")
result.valid?  # => true
result.value  # => "john@example.com"

result = email_cleaner.clean("")
result.valid?  # => false
result.errors  # => [{ rule: :not_empty, message: "Cannot be empty" }]

Tools & Ecosystem

The Chronic gem parses natural language dates. The gem interprets phrases like "next Tuesday," "three days ago," and "last month" into Time objects. The context option controls whether an ambiguous date resolves to the future (the default) or the past.

require 'chronic'

Chronic.parse('tomorrow')
Chronic.parse('last monday')
Chronic.parse('3 months ago')
Chronic.parse('next friday at 4pm')

# Configure parsing context
Chronic.parse('may', context: :past)  # Previous May
Chronic.parse('may', context: :future)  # Upcoming May

The Faker gem generates realistic test data for cleansing validation. The gem produces names, addresses, emails, phone numbers, and domain-specific data. Generated data tests cleansing pipelines without exposing real information.

require 'faker'

# Generate test data for cleansing validation
test_emails = 10.times.map { Faker::Internet.email }
test_phones = 10.times.map { Faker::PhoneNumber.phone_number }
test_addresses = 10.times.map { Faker::Address.full_address }

# Add common data quality issues
dirty_emails = test_emails.map { |e| "  #{e.upcase}  " }
dirty_phones = test_phones.map { |p| "(#{p})" }

The Sanitize gem removes HTML tags and dangerous content from user input. The gem provides multiple configurations for different cleansing levels: restricted removes most formatting, basic allows simple formatting, relaxed permits more tags.

require 'sanitize'

html_input = "<p>Hello <script>alert('xss')</script>World</p>"

Sanitize.fragment(html_input, Sanitize::Config::RESTRICTED)
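# => "Hello World" (tags and script contents removed; Sanitize may leave
#    padding spaces where the <p> element was stripped)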

Sanitize.fragment(html_input, Sanitize::Config::BASIC)
# => "<p>Hello World</p>"

The Daru gem provides DataFrame operations for statistical data cleansing. The gem handles missing values, outlier detection, type conversion, and aggregations across tabular data.

require 'daru'

# Create DataFrame with dirty data
df = Daru::DataFrame.new({
  name: ['John', nil, 'Jane', 'Bob'],
  age: [25, 30, nil, 40],
  salary: [50000, 60000, 55000, 70000]
})

# Handle missing values
# replace_nils returns a new vector; replace_nils! is the destructive form
df[:name].replace_nils!('Unknown')
df[:age].replace_nils!(df[:age].mean)

# Remove outliers
mean = df[:salary].mean
std = df[:salary].sd
df = df.filter(:row) { |row| (row[:salary] - mean).abs <= 2 * std }

The ActiveSupport String extensions add cleansing methods. The squish method normalizes whitespace. The truncate method shortens strings. The parameterize method creates URL-safe slugs.

require 'active_support/core_ext/string'

"  hello    world  ".squish  # => "hello world"
"long text here".truncate(10)  # => "long te..."
"Hello World!".parameterize  # => "hello-world"
"Hello World!".parameterize(separator: '_')  # => "hello_world"

Reference

Common Cleansing Operations

Operation	Purpose	Ruby Method
Remove whitespace	Trim leading/trailing spaces	strip
Normalize whitespace	Collapse multiple spaces	squeeze
Change case	Convert to lowercase	downcase
Remove characters	Delete specific characters	delete, gsub
Extract digits	Keep only numeric characters	gsub(/\D/, '')
Remove duplicates	Eliminate duplicate elements	uniq
Remove nils	Filter out nil values	compact
Validate format	Check against pattern	match?
Convert encoding	Change character encoding	encode
Parse dates	Convert string to Date	Date.parse

String Cleansing Methods

Method	Description	Example
strip	Remove leading and trailing whitespace	" hello ".strip
lstrip	Remove leading whitespace	" hello".lstrip
rstrip	Remove trailing whitespace	"hello ".rstrip
squeeze	Collapse consecutive characters	"hello    world".squeeze(' ')
tr	Translate characters	"hello".tr('el', 'ip')
delete	Remove characters	"hello".delete('l')
gsub	Replace patterns	"hello".gsub(/l/, 'r')
downcase	Convert to lowercase	"HELLO".downcase
upcase	Convert to uppercase	"hello".upcase
capitalize	Capitalize first letter	"hello".capitalize
titleize	Capitalize each word (ActiveSupport; plain Ruby equivalent shown)	"hello world".split.map(&:capitalize).join(' ')

Validation Patterns

Data Type	Pattern	Example
Email	/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i	user@example.com
Phone (US)	/\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})/	(555) 123-4567
ZIP Code	/^\d{5}(-\d{4})?$/	12345-6789
URL	%r{\Ahttps?://\S+\z}	https://example.com
IPv4	/^(\d{1,3}\.){3}\d{1,3}$/	192.168.1.1
Credit Card	/^\d{13,19}$/	4111111111111111
Date ISO	/^\d{4}-\d{2}-\d{2}$/	2023-03-15
SSN	/^\d{3}-\d{2}-\d{4}$/	123-45-6789
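These patterns check shape, not meaning. The IPv4 pattern, for instance, accepts 999.1.1.1. A possible follow-up check, sketched here with the IPAddr class from Ruby's standard library, validates the values as well; the helper name `valid_ipv4?` is hypothetical.

```ruby
require 'ipaddr'

# Shape check from the table, anchored to the whole string
IPV4_SHAPE = /\A(\d{1,3}\.){3}\d{1,3}\z/

# Shape check first, then let IPAddr verify each octet's value.
def valid_ipv4?(text)
  return false unless text.is_a?(String) && text.match?(IPV4_SHAPE)

  IPAddr.new(text).ipv4?
rescue IPAddr::InvalidAddressError
  false
end

valid_ipv4?("192.168.1.1")  # => true
valid_ipv4?("999.1.1.1")    # => false (right shape, invalid octet)
valid_ipv4?("not an ip")    # => false
```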

Type Coercion Methods

Source Type	Target Type	Method	Notes
String	Integer	to_i	Returns 0 for invalid strings
String	Float	to_f	Returns 0.0 for invalid strings
String	Date	Date.parse	Raises ArgumentError if invalid
Integer	String	to_s	Always succeeds
Float	Integer	to_i	Truncates decimal
Float	Integer	round	Rounds to nearest integer
Array	String	join	Concatenates elements
String	Array	split	Splits on delimiter
String	Boolean	Check against values	Custom logic required
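The String to Boolean row has no built-in method, so the custom logic might look like the following sketch. The accepted token lists and the `to_boolean` name are assumptions; adjust them to the data source.

```ruby
# Sketch of the "custom logic" for String -> Boolean.
# The token lists are illustrative assumptions.
TRUE_TOKENS = %w[true t yes y 1 on].freeze
FALSE_TOKENS = %w[false f no n 0 off].freeze

def to_boolean(value)
  return value if value == true || value == false

  token = value.to_s.strip.downcase
  return true if TRUE_TOKENS.include?(token)
  return false if FALSE_TOKENS.include?(token)

  nil # unknown input stays unknown rather than defaulting to false
end

to_boolean(" Yes ")  # => true
to_boolean("0")      # => false
to_boolean("maybe")  # => nil
```

Returning nil for unrecognized input keeps "unknown" distinct from "false", which matters when the result feeds a reject or flag strategy.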

Encoding Operations

Method	Purpose	Example
encode	Convert between encodings	str.encode('UTF-8')
force_encoding	Change encoding without conversion	str.force_encoding('UTF-8')
scrub	Replace invalid bytes	str.scrub('?')
valid_encoding?	Check encoding validity	str.valid_encoding?
unicode_normalize	Normalize Unicode form	str.unicode_normalize(:nfc)

Collection Cleansing Methods

Method	Purpose	Example
compact	Remove nil values	[1, nil, 2].compact
uniq	Remove duplicates	[1, 2, 2, 3].uniq
select	Keep matching elements	[1, 2, 3].select(&:even?)
reject	Remove matching elements	[1, 2, 3].reject(&:even?)
map	Transform each element	[1, 2, 3].map { |n| n + 1 }
filter_map	Transform and filter	[1, nil, 2].filter_map { |n| n * 2 if n }

Error Handling Strategies

Strategy	When to Use	Implementation
Reject	Data must be valid	Return nil or raise error
Default	Fallback acceptable	Return default value
Flag	Review required	Mark for manual inspection
Transform	Correction possible	Apply transformation
Log	Audit needed	Record error and continue
Partial	Some data valid	Extract valid portions

Performance Considerations

Operation	Complexity	Optimization
Regex matching	O(n)	Compile regex once
String concatenation with +=	O(n²)	Use << or Array and join
Collection iteration	O(n)	Use lazy enumerators
Encoding conversion	O(n)	Batch operations
Duplicate removal	O(n)	Use Set for large collections
Pattern validation	O(n)	Short-circuit on failure
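Three of these optimizations can be sketched briefly. The helper names and the word list below are hypothetical. A regexp literal with interpolation is rebuilt each time it is evaluated, which is why the pattern is built once and stored in a constant.

```ruby
require 'set'

# Build an interpolated pattern once and keep it in a constant,
# instead of rebuilding it inside a loop.
STREET_WORDS = %w[street avenue road].freeze
STREET_PATTERN = /\b(?:#{Regexp.union(STREET_WORDS).source})\b/i

def street_word?(text)
  text.match?(STREET_PATTERN)
end

# Lazy enumeration: stop as soon as enough valid items are found,
# without cleaning the whole input first.
def first_valid_ids(lines, limit)
  lines.lazy.map(&:strip).select { |line| line.match?(/\A\d+\z/) }.first(limit)
end

# Set-based deduplication in one pass, keeping first occurrences.
# Set#add? returns nil when the key was already present.
def dedupe_by_normalized_key(items)
  seen = Set.new
  items.select { |item| seen.add?(item.strip.downcase) }
end

street_word?("12 Main STREET")                         # => true
first_valid_ids([" 12 ", "x", "34", "56"], 2)          # => ["12", "34"]
dedupe_by_normalized_key(["A", " a ", "b", "B", "c"])  # => ["A", "b", "c"]
```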
