Overview
Data cleansing removes errors, inconsistencies, and inaccuracies from datasets to improve data quality and reliability. The process addresses issues including duplicate records, missing values, format inconsistencies, invalid entries, and data type mismatches. Data cleansing operations occur at multiple stages: during data ingestion, before processing, after transformation, and as part of regular maintenance routines.
The quality of data directly impacts application reliability, business decisions, and system performance. Uncleaned data causes application errors, produces incorrect analytics, degrades user experience, and increases storage costs. A single malformed email address can crash an email service; duplicate customer records create billing errors; inconsistent date formats break time-series analysis.
Data cleansing differs from data validation in scope and timing. Validation checks whether data meets predefined rules and rejects invalid input. Cleansing accepts flawed data and transforms it into acceptable formats. Validation acts as a gatekeeper; cleansing acts as a repair mechanism.
# Raw data before cleansing
raw_customer = {
email: " JOHN@EXAMPLE.COM ",
phone: "(555) 123-4567",
name: "john doe",
age: "25",
signup_date: "03/15/2023"
}
# After cleansing operations
clean_customer = {
email: "john@example.com",
phone: "5551234567",
name: "John Doe",
age: 25,
signup_date: Date.parse("2023-03-15")
}
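As a rough sketch of how the raw record above could be turned into the cleansed one, the helper below applies one small transformation per field. The name `cleanse_customer` is hypothetical, and the code assumes `signup_date` always arrives as MM/DD/YYYY; a real pipeline would need to handle other layouts and bad input.

```ruby
require 'date'

# Hypothetical helper: applies one cleansing step per field.
# Assumes signup_date is always MM/DD/YYYY (an illustrative assumption).
def cleanse_customer(raw)
  {
    email: raw[:email].strip.downcase,
    phone: raw[:phone].gsub(/\D/, ''),                  # keep digits only
    name: raw[:name].split.map(&:capitalize).join(' '),
    age: Integer(raw[:age], 10),                        # raises on non-numeric input
    signup_date: Date.strptime(raw[:signup_date], '%m/%d/%Y')
  }
end

raw_customer = {
  email: " JOHN@EXAMPLE.COM ",
  phone: "(555) 123-4567",
  name: "john doe",
  age: "25",
  signup_date: "03/15/2023"
}

clean_customer = cleanse_customer(raw_customer)
clean_customer[:email]               # => "john@example.com"
clean_customer[:signup_date].iso8601 # => "2023-03-15"
```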
Data cleansing operations fall into distinct categories: structural cleansing standardizes formats and types; content cleansing corrects values and removes noise; deduplication eliminates redundant records; enrichment fills missing information. Each category requires different techniques and validation strategies.
Key Principles
Data cleansing follows a deterministic process where each operation produces predictable, repeatable results. The same input data subjected to identical cleansing operations yields identical output regardless of execution time or environment. This determinism enables testing, debugging, and audit trails.
Idempotency ensures cleansing operations can execute multiple times without changing results after the first application. Running a whitespace trimming operation twice produces the same result as running it once. Idempotent operations make data pipelines resilient to retries and partial failures.
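Idempotency can be spot-checked by applying an operation twice and comparing the results. The helper name `idempotent?` is hypothetical, and a passing check on one input is evidence rather than proof.

```ruby
# Hypothetical helper: true when applying the operation a second time
# leaves the first result unchanged.
def idempotent?(input, &operation)
  once = operation.call(input)
  twice = operation.call(once)
  once == twice
end

trim = ->(s) { s.strip }
pad  = ->(s) { " #{s} " } # not idempotent: every run adds more padding

idempotent?("  hello  ", &trim) # => true
idempotent?("hello", &pad)      # => false
```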
Data preservation maintains original values alongside cleansed versions when possible. Applications store raw data before transformation, enabling recovery from cleansing errors and supporting forensic analysis. Audit trails track what changed, when, and why.
class DataRecord
attr_reader :raw_value, :cleansed_value, :transformations
def initialize(value)
@raw_value = value
@cleansed_value = value
@transformations = []
end
def apply_cleansing(operation, &block)
previous = @cleansed_value
@cleansed_value = block.call(@cleansed_value)
@transformations << {
operation: operation,
before: previous,
after: @cleansed_value,
timestamp: Time.now
}
end
end
record = DataRecord.new(" HELLO ")
record.apply_cleansing(:strip) { |v| v.strip }
record.apply_cleansing(:downcase) { |v| v.downcase }
# => Preserves raw value and transformation history
Error handling strategies determine how cleansing responds to problematic data. Strategies include rejecting invalid records, setting default values, inferring corrections, or flagging for manual review. The chosen strategy depends on data criticality and error tolerance.
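The sketch below shows three of these strategies (reject, default, flag) applied to a single field. `CleansingOutcome`, `cleanse_age`, and the `on_error:` option are hypothetical names chosen for illustration, and the 0 to 150 age range is an assumed business rule.

```ruby
# Hypothetical result type: the cleansed value plus a manual-review flag.
CleansingOutcome = Struct.new(:value, :flagged, keyword_init: true)

def cleanse_age(raw, on_error: :reject, default: nil)
  # Integer(..., exception: false) returns nil instead of raising (Ruby 2.6+)
  age = Integer(raw.to_s.strip, 10, exception: false)
  return CleansingOutcome.new(value: age, flagged: false) if age && (0..150).cover?(age)

  case on_error
  when :reject  then raise ArgumentError, "invalid age: #{raw.inspect}"
  when :default then CleansingOutcome.new(value: default, flagged: false)
  when :flag    then CleansingOutcome.new(value: raw, flagged: true) # keep raw for review
  else raise ArgumentError, "unknown strategy: #{on_error.inspect}"
  end
end

cleanse_age(" 42 ").value                                 # => 42
cleanse_age("abc", on_error: :default, default: 0).value  # => 0
cleanse_age("999", on_error: :flag).flagged               # => true
```

Keeping the strategy as a parameter lets the same coercion logic serve both a strict import job and a lenient reporting job.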
Validation boundaries define acceptable ranges, formats, and values. Boundaries enforce business rules: ages between 0 and 150, email addresses matching RFC specifications, postal codes matching geographic formats. Clear boundaries enable automated cleansing decisions.
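Boundaries can be kept in one declarative table so that cleansing code consults the same rules everywhere. This is a minimal sketch; `BOUNDARIES` and `within_bounds?` are hypothetical names, and the two rules shown are examples rather than a complete rule set.

```ruby
# Hypothetical boundary table: field name => rule (Range or Regexp).
BOUNDARIES = {
  age: 0..150,
  zip: /\A\d{5}(-\d{4})?\z/
}.freeze

# Range#=== tests inclusion and Regexp#=== tests for a match, so a single
# expression covers both kinds of rule.
def within_bounds?(field, value)
  rule = BOUNDARIES.fetch(field) # raises KeyError for unknown fields
  rule === value
end

within_bounds?(:age, 25)           # => true
within_bounds?(:age, 151)          # => false
within_bounds?(:zip, "12345-6789") # => true
within_bounds?(:zip, "1234")       # => false
```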
Type coercion converts data between representations while preserving semantic meaning. Converting the string "123" to integer 123 maintains the numeric concept. Type coercion requires understanding source and target formats to avoid data loss or misinterpretation.
# Type coercion with semantic preservation
def coerce_to_integer(value)
case value
when Integer
value
when Float
value.round
when String
value.strip.delete(',').to_i if value.match?(/^\d+([,\d]+)?$/)
when NilClass
nil
else
raise TypeError, "Cannot coerce #{value.class} to Integer"
end
end
coerce_to_integer("1,234") # => 1234
coerce_to_integer(1234.7) # => 1235
coerce_to_integer("abc") # => nil
Normalization transforms data into consistent formats. Text normalization converts to uniform case, removes accents, standardizes whitespace. Numeric normalization scales values to common ranges. Date normalization converts to ISO 8601 format. Normalization eliminates representation differences that complicate comparisons and aggregations.
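The three kinds of normalization can each be sketched in a few lines. The method names below are hypothetical, accent removal relies on NFKD decomposition followed by stripping combining marks, and the date helper assumes the input format is already known.

```ruby
require 'date'

# Text: strip accents, lowercase, collapse whitespace.
# NFKD splits "é" into "e" plus a combining mark; \p{Mn} removes the mark.
def normalize_text(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mn}/, '').downcase.strip.squeeze(' ')
end

# Numbers: min-max scaling into the range 0.0..1.0.
def min_max_scale(values)
  min, max = values.minmax
  return values.map { 0.0 } if min == max # avoid dividing by zero
  values.map { |v| (v - min).to_f / (max - min) }
end

# Dates: parse a known input format and emit ISO 8601.
def normalize_date(text, format: '%m/%d/%Y')
  Date.strptime(text, format).iso8601
end

normalize_text("  Café   CRÈME ") # => "cafe creme"
min_max_scale([10, 20, 30])        # => [0.0, 0.5, 1.0]
normalize_date("03/15/2023")       # => "2023-03-15"
```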
Contextual awareness applies different cleansing rules based on data origin, purpose, and domain. Email addresses for marketing require validation; internal system identifiers do not. Financial data requires precision; display text tolerates approximation. Context determines appropriate cleansing strategies.
Ruby Implementation
Ruby's String class provides extensive methods for text cleansing. The strip, lstrip, and rstrip methods remove leading and trailing whitespace. The squeeze method collapses repeated characters. The tr and delete methods remove or replace character sets.
class StringCleaner
def self.clean_whitespace(text)
return nil if text.nil?
text.strip.squeeze(' ')
end
def self.remove_non_alphanumeric(text)
return nil if text.nil?
text.gsub(/[^a-zA-Z0-9\s]/, '')
end
def self.normalize_case(text, style: :titlecase)
return nil if text.nil?
case style
when :titlecase
text.split.map(&:capitalize).join(' ')
when :sentence
text.capitalize
when :upper
text.upcase
when :lower
text.downcase
else
text
end
end
end
StringCleaner.clean_whitespace(" hello world ") # => "hello world"
StringCleaner.remove_non_alphanumeric("hello@world!") # => "helloworld"
StringCleaner.normalize_case("john doe", style: :titlecase) # => "John Doe"
Ruby's Regexp class enables pattern-based cleansing through matching and substitution. Regular expressions identify invalid characters, extract structured data, and validate formats. Named captures organize extracted components for reconstruction.
class PhoneCleaner
PHONE_PATTERN = /\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})/
def self.normalize(phone)
return nil if phone.nil? || phone.empty?
match = phone.match(PHONE_PATTERN)
return nil unless match
"#{match[1]}#{match[2]}#{match[3]}"
end
def self.format(phone, style: :plain)
digits = normalize(phone)
return nil unless digits
case style
when :dotted
"#{digits[0..2]}.#{digits[3..5]}.#{digits[6..9]}"
when :dashed
"#{digits[0..2]}-#{digits[3..5]}-#{digits[6..9]}"
when :parens
"(#{digits[0..2]}) #{digits[3..5]}-#{digits[6..9]}"
else
digits
end
end
end
PhoneCleaner.normalize("(555) 123-4567") # => "5551234567"
PhoneCleaner.format("5551234567", style: :parens) # => "(555) 123-4567"
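The class above uses numbered groups. For comparison, here is a sketch of the same pattern with named captures, which make the extracted components self-describing. `NAMED_PHONE` and `phone_parts` are hypothetical names, and unlike the original pattern this one is anchored with `\A` and `\z`, so extra digits cause the match to fail.

```ruby
# Hypothetical variant of the phone pattern using named groups.
NAMED_PHONE = /\A\(?(?<area>\d{3})\)?[-.\s]?(?<exchange>\d{3})[-.\s]?(?<line>\d{4})\z/

def phone_parts(phone)
  match = NAMED_PHONE.match(phone.to_s.strip)
  return nil unless match
  # named_captures returns a Hash keyed by group name (as strings)
  match.named_captures.transform_keys(&:to_sym)
end

parts = phone_parts("(555) 123-4567")
# => { area: "555", exchange: "123", line: "4567" }
"(#{parts[:area]}) #{parts[:exchange]}-#{parts[:line]}" # => "(555) 123-4567"
phone_parts("555-123-45678") # => nil
```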
Ruby's Array class and Enumerable module provide methods for collection-level cleansing. The compact method removes nil values. The uniq method eliminates duplicates. The select and reject methods filter based on predicates.
class CollectionCleaner
def self.remove_nils(collection)
collection.compact
end
def self.remove_duplicates(collection, key: nil)
if key
collection.uniq { |item| item.send(key) }
else
collection.uniq
end
end
def self.remove_outliers(numbers, std_devs: 3)
return [] if numbers.empty?
mean = numbers.sum.to_f / numbers.size
variance = numbers.sum { |n| (n - mean) ** 2 } / numbers.size
std_dev = Math.sqrt(variance)
threshold = std_devs * std_dev
numbers.select { |n| (n - mean).abs <= threshold }
end
end
data = [1, nil, 2, nil, 3, 2]
CollectionCleaner.remove_nils(data) # => [1, 2, 3, 2]
CollectionCleaner.remove_duplicates(data.compact) # => [1, 2, 3]
values = [10, 12, 11, 9, 100, 13, 8]
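# 100 lies about 2.4 standard deviations from the mean, so the default
# threshold of 3 would keep it; a threshold of 2 removes it
CollectionCleaner.remove_outliers(values, std_devs: 2) # => [10, 12, 11, 9, 13, 8]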
Character encoding issues call for several String methods. Data from external sources often arrives in mixed encodings. The encode method converts between encodings and can replace invalid or undefined bytes; the unicode_normalize method converts text to a single Unicode normalization form.
class EncodingCleaner
def self.clean_encoding(text, target: Encoding::UTF_8)
return nil if text.nil?
text.encode(target,
invalid: :replace,
undef: :replace,
replace: ''
)
end
def self.normalize_unicode(text)
return nil if text.nil?
# Convert to composed form (NFC)
text.unicode_normalize(:nfc)
end
def self.remove_bom(text)
return nil if text.nil?
text.sub(/\A\uFEFF/, '')
end
end
# Handle mixed encoding
dirty_text = "Hello\xE9World".force_encoding('ASCII-8BIT')
EncodingCleaner.clean_encoding(dirty_text) # => "HelloWorld"
# Normalize Unicode (é can be e + combining accent)
EncodingCleaner.normalize_unicode("café") # => Composed form
Practical Examples
Email address cleansing standardizes formats for validation and deduplication. The process converts to lowercase, removes whitespace, validates structure, and extracts components for analysis.
class EmailCleaner
EMAIL_PATTERN = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
def self.clean(email)
return nil if email.nil? || email.empty?
# Remove whitespace and convert to lowercase
cleaned = email.strip.downcase
# Validate basic structure
return nil unless cleaned.match?(EMAIL_PATTERN)
# Reject anything that does not split into exactly one local part and one domain
parts = cleaned.split('@')
return nil if parts.length != 2
local, domain = parts
# Remove dots from Gmail addresses (ignored by Gmail)
if domain == 'gmail.com'
local = local.delete('.')
end
# Remove plus addressing
local = local.split('+').first
"#{local}@#{domain}"
end
def self.extract_domain(email)
cleaned = clean(email)
return nil unless cleaned
cleaned.split('@').last
end
end
EmailCleaner.clean(" John.Doe+spam@Gmail.com ") # => "johndoe@gmail.com"
EmailCleaner.extract_domain("user@example.com") # => "example.com"
Address cleansing normalizes postal addresses for geocoding and matching. The process standardizes abbreviations, corrects common misspellings, validates components, and formats consistently.
class AddressCleaner
STREET_TYPES = {
'street' => 'St',
'str' => 'St',
'avenue' => 'Ave',
'av' => 'Ave',
'road' => 'Rd',
'boulevard' => 'Blvd',
'blvd' => 'Blvd',
'drive' => 'Dr',
'lane' => 'Ln',
'court' => 'Ct'
}
DIRECTIONS = {
'north' => 'N',
'south' => 'S',
'east' => 'E',
'west' => 'W',
'northeast' => 'NE',
'northwest' => 'NW',
'southeast' => 'SE',
'southwest' => 'SW'
}
def self.normalize_street(street)
return nil if street.nil? || street.empty?
# Replace each word with its standard abbreviation, or capitalize it.
# Working word by word keeps two-letter directions such as "NE" uppercase.
street.strip.downcase.split.map do |word|
key = word.delete_suffix('.')
STREET_TYPES[key] || DIRECTIONS[key] || word.capitalize
end.join(' ')
end
def self.clean_zipcode(zipcode)
return nil if zipcode.nil?
# Remove all non-digits
digits = zipcode.to_s.gsub(/\D/, '')
# Validate length (5 or 9 digits)
return nil unless [5, 9].include?(digits.length)
# Format with hyphen for ZIP+4
digits.length == 9 ? "#{digits[0..4]}-#{digits[5..8]}" : digits
end
end
AddressCleaner.normalize_street("123 north main street") # => "123 N Main St"
AddressCleaner.clean_zipcode("12345-6789") # => "12345-6789"
AddressCleaner.clean_zipcode("12345 6789") # => "12345-6789"
Financial data cleansing handles currency values, removes formatting characters, validates precision, and converts to standard numeric types. The process must preserve decimal accuracy and handle multiple currency formats.
require 'bigdecimal'
class CurrencyCleaner
def self.clean(value, currency: 'USD')
return nil if value.nil?
# Convert to string for processing
str_value = value.to_s.strip
# Remove currency symbols and common separators
cleaned = str_value.gsub(/[$,£€¥\s]/, '')
# Handle parentheses for negative values (accounting format)
if cleaned.match?(/^\(.*\)$/)
cleaned = "-#{cleaned.tr('()', '')}"
end
# Convert to BigDecimal for precision
begin
BigDecimal(cleaned)
rescue ArgumentError
nil
end
end
def self.format(value, currency: 'USD', precision: 2)
return nil if value.nil?
decimal_value = value.is_a?(BigDecimal) ? value : BigDecimal(value.to_s)
formatted = decimal_value.round(precision).to_s('F')
# Add thousands separators
parts = formatted.split('.')
parts[0].gsub!(/(\d)(?=(\d{3})+(?!\d))/, '\1,')
# Add currency symbol
symbol = case currency
when 'USD' then '$'
when 'EUR' then '€'
when 'GBP' then '£'
else currency
end
"#{symbol}#{parts.join('.')}"
end
end
CurrencyCleaner.clean("$1,234.56") # => BigDecimal("1234.56")
CurrencyCleaner.clean("($500.00)") # => BigDecimal("-500.00")
CurrencyCleaner.format(1234.567, precision: 2) # => "$1,234.57"
Date and time cleansing converts between formats, handles timezone differences, validates ranges, and standardizes to ISO 8601. Multiple input formats require parsing heuristics and format detection.
require 'date'
class DateCleaner
COMMON_FORMATS = [
'%Y-%m-%d', # 2023-03-15
'%m/%d/%Y', # 03/15/2023
'%d/%m/%Y', # 15/03/2023
'%Y/%m/%d', # 2023/03/15
'%b %d, %Y', # Mar 15, 2023
'%B %d, %Y', # March 15, 2023
'%d-%b-%Y', # 15-Mar-2023
]
def self.parse(date_string)
return nil if date_string.nil? || date_string.empty?
cleaned = date_string.strip
# Try each format
COMMON_FORMATS.each do |format|
begin
return Date.strptime(cleaned, format)
rescue ArgumentError
next
end
end
# Try natural language parsing
begin
Date.parse(cleaned)
rescue ArgumentError
nil
end
end
def self.validate_range(date, min: nil, max: nil)
return false if date.nil?
return false if min && date < min
return false if max && date > max
true
end
def self.normalize(date)
return nil if date.nil?
date.is_a?(Date) ? date.iso8601 : parse(date)&.iso8601
end
end
DateCleaner.parse("03/15/2023") # => Date object
DateCleaner.parse("March 15, 2023") # => Date object
DateCleaner.normalize("03/15/2023") # => "2023-03-15"
date = Date.parse("2023-03-15")
DateCleaner.validate_range(date, min: Date.parse("2020-01-01")) # => true
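Because the format list is tried in order, a string such as "03/04/2023" is always read month first even though it could equally be a day-first date. A hedged sketch of detecting that ambiguity, so such values can be flagged rather than guessed, follows; `ambiguous_date?` is a hypothetical helper and only considers the two slash-separated formats.

```ruby
require 'date'

# Hypothetical helper: true when a slash-separated date parses under both
# the month-first and day-first conventions and yields different dates.
def ambiguous_date?(text)
  candidates = ['%m/%d/%Y', '%d/%m/%Y'].filter_map do |format|
    Date.strptime(text, format)
  rescue ArgumentError
    nil # this format does not apply; filter_map drops the nil
  end
  candidates.uniq.size > 1
end

ambiguous_date?("03/04/2023") # => true  (March 4 or April 3)
ambiguous_date?("03/15/2023") # => false (15 cannot be a month)
ambiguous_date?("05/05/2023") # => false (both readings agree)
```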
Common Patterns
The pipeline pattern chains multiple cleansing operations in sequence. Each operation transforms data and passes results to the next stage. Pipelines enable modular, testable cleansing workflows.
class CleansingPipeline
def initialize
@operations = []
end
def add(name, &block)
@operations << { name: name, operation: block }
self
end
def execute(data)
result = data
@operations.each do |op|
result = op[:operation].call(result)
break if result.nil?
end
result
end
def self.for_email
new
.add(:trim) { |v| v&.strip }
.add(:lowercase) { |v| v&.downcase }
.add(:validate) { |v| v&.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i) ? v : nil }
end
def self.for_phone
new
.add(:extract_digits) { |v| v&.gsub(/\D/, '') }
.add(:validate_length) { |v| v&.length == 10 ? v : nil }
end
end
email_pipeline = CleansingPipeline.for_email
email_pipeline.execute(" JOHN@EXAMPLE.COM ") # => "john@example.com"
email_pipeline.execute("invalid") # => nil
phone_pipeline = CleansingPipeline.for_phone
phone_pipeline.execute("(555) 123-4567") # => "5551234567"
The strategy pattern selects cleansing approaches based on data characteristics. Different strategies handle different data types, quality levels, or business requirements. Strategies encapsulate domain-specific logic.
class CleansingStrategy
def clean(value)
raise NotImplementedError
end
end
class ConservativeStrategy < CleansingStrategy
def clean(value)
# Only remove obvious errors, preserve ambiguous data
return nil if value.nil?
value.strip
end
end
class AggressiveStrategy < CleansingStrategy
def clean(value)
# Apply extensive transformations
return nil if value.nil? || value.strip.empty?
value.strip.squeeze(' ').downcase
end
end
class DataCleaner
def initialize(strategy)
@strategy = strategy
end
def clean(value)
@strategy.clean(value)
end
end
conservative = DataCleaner.new(ConservativeStrategy.new)
conservative.clean(" Hello World ") # => "Hello World"
aggressive = DataCleaner.new(AggressiveStrategy.new)
aggressive.clean(" Hello World ") # => "hello world"
The decorator pattern adds cleansing capabilities incrementally. Each decorator wraps another cleaner and adds specific functionality. Decorators compose complex cleansing from simple components.
class BaseCleaner
def clean(value)
value
end
end
class TrimDecorator
def initialize(cleaner)
@cleaner = cleaner
end
def clean(value)
result = @cleaner.clean(value)
result&.strip
end
end
class LowercaseDecorator
def initialize(cleaner)
@cleaner = cleaner
end
def clean(value)
result = @cleaner.clean(value)
result&.downcase
end
end
class RemoveSpecialCharsDecorator
def initialize(cleaner)
@cleaner = cleaner
end
def clean(value)
result = @cleaner.clean(value)
result&.gsub(/[^a-zA-Z0-9\s]/, '')
end
end
# Compose cleaners
cleaner = BaseCleaner.new
cleaner = TrimDecorator.new(cleaner)
cleaner = LowercaseDecorator.new(cleaner)
cleaner = RemoveSpecialCharsDecorator.new(cleaner)
cleaner.clean(" Hello, World! ") # => "hello world"
The batch processing pattern processes large datasets efficiently. Batch operations load data in chunks, apply cleansing transformations, and write results incrementally. This pattern manages memory usage and enables progress tracking.
class BatchCleaner
def initialize(batch_size: 1000)
@batch_size = batch_size
end
def process_file(input_path, output_path, &cleaner)
File.open(output_path, 'w') do |output|
File.foreach(input_path).each_slice(@batch_size) do |batch|
cleaned_batch = batch.map { |line| cleaner.call(line) }.compact
cleaned_batch.each { |line| output.puts(line) }
end
end
end
def process_array(data, &cleaner)
data.each_slice(@batch_size).flat_map do |batch|
batch.map { |item| cleaner.call(item) }.compact
end
end
end
cleaner = BatchCleaner.new(batch_size: 100)
# Process large array
data = (1..10000).map { |i| " item #{i} " }
cleaned = cleaner.process_array(data) { |item| item.strip.downcase }
Error Handling & Edge Cases
Null and empty value handling requires distinguishing between missing data, empty strings, and whitespace-only content. Each case demands different treatment based on business requirements.
class NullSafeCleanser
def self.clean(value, allow_empty: false, allow_whitespace: false)
# Handle nil
return nil if value.nil?
# Convert to string
str_value = value.to_s
# Handle empty after conversion
return nil if str_value.empty? && !allow_empty
# Handle whitespace-only
stripped = str_value.strip
return nil if stripped.empty? && !allow_whitespace && !allow_empty
stripped
end
def self.with_default(value, default:, &cleaner)
result = cleaner.call(value)
result.nil? || result.empty? ? default : result
end
end
NullSafeCleanser.clean(nil) # => nil
NullSafeCleanser.clean("") # => nil
NullSafeCleanser.clean("", allow_empty: true) # => ""
NullSafeCleanser.clean(" ") # => nil
NullSafeCleanser.clean(" ", allow_whitespace: true) # => ""
NullSafeCleanser.with_default("", default: "N/A") { |v| v&.strip } # => "N/A"
Encoding errors occur when data contains invalid byte sequences or mixed character encodings. Handling requires detection, conversion, and fallback strategies.
class EncodingErrorHandler
def self.clean_with_fallback(text, target: Encoding::UTF_8)
return nil if text.nil?
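# Encoding to the same encoding is a no-op, so scrub invalid bytes directly
return text.scrub('?') if text.encoding == target
# Attempt direct encoding
begin
return text.encode(target)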
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
# Try with replacement characters
begin
return text.encode(target,
invalid: :replace,
undef: :replace,
replace: '?'
)
rescue StandardError
# Last resort: re-tag a copy (leaving the caller's string untouched) and scrub
return text.dup.force_encoding(target).scrub('?')
end
end
end
def self.detect_and_convert(text)
return nil if text.nil?
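# Try common encodings on a copy so the caller's string is not re-tagged.
# ISO-8859-1 maps every byte, so it goes last; listed earlier it would
# always succeed and Windows-1252 would never be tried.
[Encoding::UTF_8, Encoding::WINDOWS_1252, Encoding::ISO_8859_1].each do |encoding|
begin
converted = text.dup.force_encoding(encoding).encode(Encoding::UTF_8)
return converted if converted.valid_encoding?
rescue StandardError
next
end
end
# Fallback: scrub invalid sequences
text.dup.force_encoding(Encoding::UTF_8).scrub('?')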
end
end
dirty = "Hello\xE9".force_encoding('ASCII-8BIT')
EncodingErrorHandler.clean_with_fallback(dirty) # => "Hello?"
Type coercion errors happen when converting between incompatible types or when source data violates target constraints. Safe coercion validates before conversion and provides meaningful error messages.
require 'date'
class SafeCoercer
class CoercionError < StandardError; end
def self.to_integer(value, strict: false)
case value
when Integer
value
when Float
strict ? raise(CoercionError, "Lossy conversion") : value.to_i
when String
cleaned = value.strip
return nil if cleaned.empty?
# Remove common formatting
numeric_part = cleaned.gsub(/[,\s]/, '')
if numeric_part.match?(/^-?\d+$/)
numeric_part.to_i
elsif strict
raise CoercionError, "Invalid integer format: #{value}"
else
nil
end
when NilClass
nil
else
strict ? raise(CoercionError, "Cannot coerce #{value.class}") : nil
end
end
def self.to_date(value, strict: false)
return value if value.is_a?(Date)
return nil if value.nil?
begin
Date.parse(value.to_s)
rescue ArgumentError
strict ? raise(CoercionError, "Invalid date: #{value}") : nil
end
end
end
SafeCoercer.to_integer("1,234") # => 1234
SafeCoercer.to_integer("abc") # => nil
SafeCoercer.to_integer("abc", strict: true) # => Raises CoercionError
SafeCoercer.to_date("2023-03-15") # => Date object
SafeCoercer.to_date("invalid") # => nil
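Ruby's built-in conversion functions already distinguish lenient from strict coercion, which is worth knowing before writing a custom coercer. The snippet below only uses core methods; the `exception: false` keyword requires Ruby 2.6 or later.

```ruby
# String#to_i never fails; it silently keeps whatever leading digits it finds.
"12abc".to_i                       # => 12
"abc".to_i                         # => 0

# Kernel#Integer is strict. With exception: false it returns nil
# instead of raising ArgumentError.
Integer("12abc", exception: false) # => nil
Integer("42", exception: false)    # => 42

# Without an explicit base, a leading zero means octal.
Integer("010")                     # => 8
Integer("010", 10)                 # => 10

# Kernel#Float follows the same convention.
Float("3.14", exception: false)    # => 3.14
Float("3.14kg", exception: false)  # => nil
```

Passing the base explicitly matters for identifiers such as ZIP codes or account numbers, where a leading zero is data rather than a radix prefix.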
Validation failures require clear error reporting and recovery strategies. Validation errors should indicate which rule failed and why, enabling corrective action.
class ValidationResult
attr_reader :valid, :errors, :value
def initialize(valid:, value: nil, errors: [])
@valid = valid
@value = value
@errors = errors
end
def valid?
@valid
end
def invalid?
!@valid
end
end
class ValidatingCleaner
def initialize
@rules = []
end
def add_rule(name, &block)
@rules << { name: name, validator: block }
self
end
def clean(value)
errors = []
cleaned_value = value
@rules.each do |rule|
result = rule[:validator].call(cleaned_value)
if result.is_a?(Hash)
if result[:valid]
cleaned_value = result[:value]
else
errors << { rule: rule[:name], message: result[:error] }
break # later rules assume the earlier ones passed
end
elsif result == false
errors << { rule: rule[:name], message: "Validation failed" }
break
else
cleaned_value = result
end
end
ValidationResult.new(
valid: errors.empty?,
value: cleaned_value,
errors: errors
)
end
end
email_cleaner = ValidatingCleaner.new
email_cleaner.add_rule(:not_empty) do |v|
v && !v.strip.empty? ? { valid: true, value: v.strip } : { valid: false, error: "Cannot be empty" }
end
email_cleaner.add_rule(:format) do |v|
v.match?(/\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i) ?
{ valid: true, value: v.downcase } :
{ valid: false, error: "Invalid email format" }
end
result = email_cleaner.clean(" JOHN@EXAMPLE.COM ")
result.valid? # => true
result.value # => "john@example.com"
result = email_cleaner.clean("")
result.valid? # => false
result.errors # => [{ rule: :not_empty, message: "Cannot be empty" }]
Tools & Ecosystem
The Chronic gem parses natural language dates. The gem interprets phrases like "next Tuesday," "three days ago," and "last month" into Time objects. Chronic resolves ambiguous phrases through its context option, which defaults to :future and can be set to :past.
require 'chronic'
Chronic.parse('tomorrow')
Chronic.parse('last monday')
Chronic.parse('3 months ago')
Chronic.parse('next friday at 4pm')
# Configure parsing context
Chronic.parse('may', context: :past) # Previous May
Chronic.parse('may', context: :future) # Upcoming May
The Faker gem generates realistic test data for cleansing validation. The gem produces names, addresses, emails, phone numbers, and domain-specific data. Generated data tests cleansing pipelines without exposing real information.
require 'faker'
# Generate test data for cleansing validation
test_emails = 10.times.map { Faker::Internet.email }
test_phones = 10.times.map { Faker::PhoneNumber.phone_number }
test_addresses = 10.times.map { Faker::Address.full_address }
# Add common data quality issues
dirty_emails = test_emails.map { |e| " #{e.upcase} " }
dirty_phones = test_phones.map { |p| "(#{p})" }
The Sanitize gem removes HTML tags and dangerous content from user input. The gem provides multiple configurations for different cleansing levels: restricted removes most formatting, basic allows simple formatting, relaxed permits more tags.
require 'sanitize'
html_input = "<p>Hello <script>alert('xss')</script>World</p>"
Sanitize.fragment(html_input, Sanitize::Config::RESTRICTED)
# => "Hello World", possibly padded with whitespace where the <p> tags were removed
Sanitize.fragment(html_input, Sanitize::Config::BASIC)
# => "<p>Hello World</p>"
The Daru gem provides DataFrame operations for statistical data cleansing. The gem handles missing values, outlier detection, type conversion, and aggregations across tabular data.
require 'daru'
# Create DataFrame with dirty data
df = Daru::DataFrame.new({
name: ['John', nil, 'Jane', 'Bob'],
age: [25, 30, nil, 40],
salary: [50000, 60000, 55000, 70000]
})
# Handle missing values
# replace_nils returns a new vector, so assign the result back
df[:name] = df[:name].replace_nils('Unknown')
df[:age] = df[:age].replace_nils(df[:age].mean)
# Remove outliers
mean = df[:salary].mean
std = df[:salary].sd
df = df.filter(:row) { |row| (row[:salary] - mean).abs <= 2 * std }
The ActiveSupport String extensions add cleansing methods. The squish method normalizes whitespace. The truncate method shortens strings. The parameterize method creates URL-safe slugs.
require 'active_support/core_ext/string'
" hello world ".squish # => "hello world"
"long text here".truncate(10) # => "long te..."
"Hello World!".parameterize # => "hello-world"
"Hello World!".parameterize(separator: '_') # => "hello_world"
Reference
Common Cleansing Operations
| Operation | Purpose | Ruby Method |
|---|---|---|
| Remove whitespace | Trim leading/trailing spaces | strip |
| Normalize whitespace | Collapse multiple spaces | squeeze(' ') |
| Change case | Convert to lowercase | downcase |
| Remove characters | Delete specific characters | delete, gsub |
| Extract digits | Keep only numeric characters | gsub(/\D/, '') |
| Remove duplicates | Eliminate duplicate elements | uniq |
| Remove nils | Filter out nil values | compact |
| Validate format | Check against pattern | match? |
| Convert encoding | Change character encoding | encode |
| Parse dates | Convert string to Date | Date.parse |
String Cleansing Methods
| Method | Description | Example |
|---|---|---|
| strip | Remove leading and trailing whitespace | " hello ".strip |
| lstrip | Remove leading whitespace | " hello".lstrip |
| rstrip | Remove trailing whitespace | "hello ".rstrip |
| squeeze | Collapse consecutive characters | "hello world".squeeze(' ') |
| tr | Translate characters | "hello".tr('el', 'ip') |
| delete | Remove characters | "hello".delete('l') |
| gsub | Replace patterns | "hello".gsub(/l/, 'r') |
| downcase | Convert to lowercase | "HELLO".downcase |
| upcase | Convert to uppercase | "hello".upcase |
| capitalize | Capitalize first letter | "hello".capitalize |
| titleize | Capitalize each word (ActiveSupport; plain Ruby equivalent shown) | "hello world".split.map(&:capitalize).join(' ') |
Validation Patterns
| Data Type | Pattern | Example |
|---|---|---|
| Email | /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i | user@example.com |
| Phone (US) | /\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})/ | (555) 123-4567 |
| ZIP Code | /\A\d{5}(-\d{4})?\z/ | 12345-6789 |
| URL | /\Ahttps?:\/\/\S+\z/ | https://example.com |
| IPv4 | /\A(\d{1,3}\.){3}\d{1,3}\z/ | 192.168.1.1 |
| Credit Card | /\A\d{13,19}\z/ | 4111111111111111 |
| Date ISO | /\A\d{4}-\d{2}-\d{2}\z/ | 2023-03-15 |
| SSN | /\A\d{3}-\d{2}-\d{4}\z/ | 123-45-6789 |
Type Coercion Methods
| Source Type | Target Type | Method | Notes |
|---|---|---|---|
| String | Integer | to_i | Returns 0 for invalid strings |
| String | Float | to_f | Returns 0.0 for invalid strings |
| String | Date | Date.parse | Raises ArgumentError if invalid |
| Integer | String | to_s | Always succeeds |
| Float | Integer | to_i | Truncates decimal |
| Float | Integer | round | Rounds to nearest integer |
| Array | String | join | Concatenates elements |
| String | Array | split | Splits on delimiter |
| String | Boolean | Check against values | Custom logic required |
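
The String to Boolean row has no built-in method, so the "custom logic" usually amounts to a lookup against accepted spellings. The sketch below is one possible version; `to_boolean` and the two word lists are hypothetical and should be adjusted to the values a given data source actually uses.

```ruby
TRUE_VALUES  = %w[true t yes y 1 on].freeze
FALSE_VALUES = %w[false f no n 0 off].freeze

# Hypothetical helper: maps common textual flags to true/false and returns
# nil for anything unrecognized, so callers can choose how to handle it.
def to_boolean(value)
  return value if value == true || value == false
  normalized = value.to_s.strip.downcase
  return true if TRUE_VALUES.include?(normalized)
  return false if FALSE_VALUES.include?(normalized)
  nil
end

to_boolean(" Yes ") # => true
to_boolean("0")     # => false
to_boolean("maybe") # => nil
```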
Encoding Operations
| Method | Purpose | Example |
|---|---|---|
| encode | Convert between encodings | str.encode('UTF-8') |
| force_encoding | Change encoding without conversion | str.force_encoding('UTF-8') |
| scrub | Replace invalid bytes | str.scrub('?') |
| valid_encoding? | Check encoding validity | str.valid_encoding? |
| unicode_normalize | Normalize Unicode form | str.unicode_normalize(:nfc) |
Collection Cleansing Methods
| Method | Purpose | Example |
|---|---|---|
| compact | Remove nil values | [1, nil, 2].compact |
| uniq | Remove duplicates | [1, 2, 2, 3].uniq |
| select | Keep matching elements | [1, 2, 3].select(&:even?) |
| reject | Remove matching elements | [1, 2, 3].reject(&:even?) |
| map | Transform each element | [1, 2, 3].map(&:succ) |
| filter_map | Transform and filter | [1, nil, 2].filter_map { _1 && _1 * 2 } |
Error Handling Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| Reject | Data must be valid | Return nil or raise error |
| Default | Fallback acceptable | Return default value |
| Flag | Review required | Mark for manual inspection |
| Transform | Correction possible | Apply transformation |
| Log | Audit needed | Record error and continue |
| Partial | Some data valid | Extract valid portions |
Performance Considerations
| Operation | Complexity | Optimization |
|---|---|---|
| Regex matching | O(n) | Compile regex once |
| String concatenation with += in a loop | O(n²) | Use << or Array and join |
| Collection iteration | O(n) | Use lazy enumerators |
| Encoding conversion | O(n) | Batch operations |
| Duplicate removal | O(n) | Use Set for large collections |
| Pattern validation | O(n) | Short-circuit on failure |
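
Several of these optimizations combine naturally. The sketch below uses a regex compiled once as a constant, a lazy enumerator, and a Set for duplicate detection; `EMAIL_RE` and `unique_emails` are hypothetical names, and the input is assumed to be any enumerable of lines (an Array here, but `File.foreach` would work the same way).

```ruby
require 'set'

EMAIL_RE = /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i # built once, reused

# Hypothetical streaming cleaner: normalizes, validates, and de-duplicates
# lazily, so only the set of already-seen values is held in memory.
def unique_emails(lines)
  seen = Set.new
  lines.lazy
       .map { |line| line.strip.downcase }
       .select { |email| email.match?(EMAIL_RE) }
       .select { |email| seen.add?(email) } # add? returns nil for repeats
end

input = [" A@example.com\n", "a@example.com\n", "bad\n", "b@example.com\n"]
unique_emails(input).first(2) # => ["a@example.com", "b@example.com"]
```

Because the enumerator is lazy, `first(2)` stops reading input as soon as two unique addresses have been found.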