CrackedRuby

Overview

Data Quality Management encompasses the processes, techniques, and tools used to maintain high-quality data throughout its lifecycle. Data quality directly affects decision-making accuracy, system reliability, and business outcomes. Industry estimates commonly put the cost of poor data quality at 15-25% of revenue, lost to operational inefficiencies, incorrect decisions, and system failures.

Data quality issues manifest in multiple forms: missing values, inconsistent formats, duplicate records, constraint violations, referential integrity breaks, and stale data. Each category requires different detection and remediation strategies. A single corrupted field can cascade through dependent systems, causing failures in analytics, reporting, and automated processes.

The practice originated in database management but expanded with distributed systems, big data pipelines, and microservices architectures. Modern data quality management addresses real-time validation, cross-system consistency, and data governance compliance.

# Data quality issue example
user_records = [
  { id: 1, email: "user@example.com", age: 25 },
  { id: 2, email: "invalid-email", age: -5 },      # Invalid format and value
  { id: 3, email: "user@example.com", age: 30 },   # Duplicate email
  { id: 4, email: nil, age: 35 }                    # Missing required field
]

# Without quality checks, these records would corrupt analytics
average_age = user_records.map { |r| r[:age] }.compact.sum / user_records.size
# => 21: the negative age skews the result, integer division truncates,
#    and dividing by the full record count would be wrong if any age were nil

Data quality management operates at multiple levels: field validation, record validation, dataset validation, and cross-system validation. Field validation checks individual values against constraints. Record validation ensures internal consistency across fields. Dataset validation identifies patterns and anomalies across records. Cross-system validation maintains referential integrity between services.

Key Principles

Data quality assessment follows six primary dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Each dimension measures a specific aspect of data fitness for intended use.

Accuracy measures how well data reflects reality. An address field containing "123 Main St" has high accuracy if that address exists and corresponds to the entity. Accuracy degrades through transcription errors, system bugs, and stale data. Measuring accuracy requires authoritative reference sources or manual verification.
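
Measuring accuracy against a reference source can be sketched as follows; the `accuracy_score` helper and the reference data are hypothetical, assuming an authoritative source keyed by record id:

```ruby
# Hypothetical sketch: accuracy is the fraction of checkable records whose
# field value matches the authoritative reference source.
def accuracy_score(records, reference, field)
  checkable = records.select { |r| reference.key?(r[:id]) }
  return 0.0 if checkable.empty?  # no checkable records

  matches = checkable.count { |r| r[field] == reference[r[:id]][field] }
  matches.to_f / checkable.size
end

reference = {
  1 => { address: "123 Main St" },
  2 => { address: "456 Oak Ave" }
}
records = [
  { id: 1, address: "123 Main St" },   # matches reference
  { id: 2, address: "456 Oak Avenue" } # transcription variant
]
accuracy_score(records, reference, :address)
# => 0.5
```

Only records present in the reference source are checkable; the accuracy of the remainder is unknown rather than wrong.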

Completeness tracks the presence of required data. A customer record missing the email field when email is mandatory fails completeness checks. Completeness operates at field level (is the value present?), record level (are all required fields present?), and dataset level (are all expected records present?).

class DataCompletenessChecker
  def initialize(required_fields)
    @required_fields = required_fields
  end
  
  def check_record(record)
    missing = @required_fields.select { |field| record[field].nil? || record[field].to_s.strip.empty? }
    {
      complete: missing.empty?,
      missing_fields: missing,
      completeness_score: 1.0 - (missing.size.to_f / @required_fields.size)
    }
  end
  
  def check_dataset(records)
    results = records.map { |r| check_record(r) }
    {
      total_records: records.size,
      complete_records: results.count { |r| r[:complete] },
      average_completeness: results.sum { |r| r[:completeness_score] } / records.size,
      field_completeness: field_level_completeness(records)
    }
  end
  
  private
  
  def field_level_completeness(records)
    @required_fields.map do |field|
      present = records.count { |r| !r[field].nil? && !r[field].to_s.strip.empty? }
      [field, present.to_f / records.size]
    end.to_h
  end
end

checker = DataCompletenessChecker.new([:email, :name, :age])
result = checker.check_dataset([
  { email: "a@example.com", name: "Alice", age: 25 },
  { email: "b@example.com", name: nil, age: 30 },
  { email: nil, name: "Charlie", age: 35 }
])
# => { total_records: 3, complete_records: 1, average_completeness: 0.777,
#      field_completeness: { email: 0.666, name: 0.666, age: 1.0 } }

Consistency ensures data conforms to defined formats, follows business rules, and maintains logical coherence. Consistency violations include format mismatches (phone numbers stored as "555-1234" and "5551234"), unit inconsistencies (mixing meters and feet), and logical contradictions (end_date before start_date). Consistency checks require rule definitions and pattern matching.
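
Consistency checks reduce to a set of named rules, each a predicate over a record. A minimal sketch (the rule names, fields, and phone format are illustrative):

```ruby
require 'date'

# Each consistency rule pairs a name with a predicate; nil fields are skipped
# so consistency checks stay independent of completeness checks.
CONSISTENCY_RULES = {
  phone_format: ->(r) { r[:phone].nil? || r[:phone].match?(/\A\d{3}-\d{4}\z/) },
  date_order:   ->(r) { r[:start_date].nil? || r[:end_date].nil? || r[:end_date] >= r[:start_date] }
}

def consistency_violations(record)
  CONSISTENCY_RULES.reject { |_, check| check.call(record) }.keys
end

record = { phone: "5551234", start_date: Date.new(2025, 5, 1), end_date: Date.new(2025, 4, 1) }
consistency_violations(record)
# => [:phone_format, :date_order]
```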

Timeliness measures whether data is current enough for its intended use. Stock prices from yesterday may be timely for trend analysis but not for trading. Timeliness requirements vary by use case. Real-time systems require second-level timeliness, while analytical reports may accept day-old data.

Validity checks conformance to defined formats, ranges, and constraints. Email addresses must match email format. Ages must be positive integers within reasonable bounds. Foreign keys must reference existing records. Validity rules encode domain knowledge and system constraints.

Uniqueness identifies duplicate records and ensures proper entity identification. Duplicates arise from multiple data entry, system integration, and merge operations. Detecting duplicates requires fuzzy matching algorithms when exact key matches fail.
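
When exact key matching fails, edit distance catches near-duplicates. A sketch using a hand-rolled Levenshtein distance (the threshold of 2 is an arbitrary choice for illustration):

```ruby
# Classic dynamic-programming Levenshtein distance: minimum number of
# insertions, deletions, and substitutions to turn string a into string b.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Flag value pairs within a small edit distance as duplicate candidates.
def fuzzy_duplicates(values, max_distance: 2)
  values.combination(2).select { |a, b| levenshtein(a, b) <= max_distance }
end

fuzzy_duplicates(["Jon Smith", "John Smith", "Alice Jones"])
# => [["Jon Smith", "John Smith"]]
```

Pairwise comparison is O(n²), so production systems typically block candidates first (by zip code, initials, or phonetic key) before computing distances.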

Data quality rules fall into three categories: structural rules define format and type constraints, semantic rules encode business logic, and cross-field rules validate relationships between attributes. Rule precedence matters when multiple rules conflict.
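
The three categories can be applied in precedence order, so later categories never run against data that failed an earlier one. A sketch with illustrative rules:

```ruby
# Structural rules run first so semantic and cross-field rules can assume
# well-typed input; evaluation stops at the first failing category.
STRUCTURAL  = { age_is_integer: ->(r) { r[:age].is_a?(Integer) } }
SEMANTIC    = { age_in_range:   ->(r) { (0..150).cover?(r[:age]) } }
CROSS_FIELD = { dates_ordered:  ->(r) { r[:end_date] >= r[:start_date] } }

def check(record)
  [STRUCTURAL, SEMANTIC, CROSS_FIELD].each do |category|
    failed = category.reject { |_, rule| rule.call(record) }.keys
    return failed unless failed.empty?  # stop at the first failing category
  end
  []
end

check(age: "25", start_date: 1, end_date: 2)
# => [:age_is_integer]
```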

Ruby Implementation

Ruby provides multiple approaches for implementing data quality checks through validation libraries, custom validators, and data processing frameworks. The standard library includes basic validation capabilities, while gems offer specialized functionality.

# Basic validation framework
class DataValidator
  attr_reader :errors
  
  def initialize
    @errors = []
    @rules = []
  end
  
  def add_rule(name, &block)
    @rules << { name: name, check: block }
    self
  end
  
  def validate(data)
    @errors = []
    @rules.each do |rule|
      begin
        result = rule[:check].call(data)
        @errors << { rule: rule[:name], message: "Validation failed" } unless result
      rescue StandardError => e
        @errors << { rule: rule[:name], message: e.message }
      end
    end
    valid?
  end
  
  def valid?
    @errors.empty?
  end
end

# Usage with multiple validation rules
validator = DataValidator.new
validator.add_rule(:email_present) { |d| !d[:email].nil? && !d[:email].empty? }
validator.add_rule(:email_format) { |d| d[:email] =~ /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i }
validator.add_rule(:age_valid) { |d| d[:age].is_a?(Integer) && d[:age] > 0 && d[:age] < 150 }
validator.add_rule(:age_adult) { |d| d[:age] >= 18 }

data = { email: "user@example.com", age: 25 }
validator.validate(data)
# => true

ActiveModel validations provide declarative validation syntax integrated with Ruby on Rails, but work independently in plain Ruby applications:

require 'active_model'

class UserDataQuality
  include ActiveModel::Validations
  
  attr_accessor :email, :age, :name, :phone
  
  validates :email, presence: true, format: { with: URI::MailTo::EMAIL_REGEXP }
  validates :age, numericality: { only_integer: true, greater_than: 0, less_than: 150 }
  validates :name, presence: true, length: { minimum: 2, maximum: 100 }
  validates :phone, format: { with: /\A\d{3}-\d{3}-\d{4}\z/ }, allow_blank: true
  
  validate :age_matches_email_domain
  
  def initialize(attributes = {})
    attributes.each { |key, value| send("#{key}=", value) }
  end
  
  private
  
  def age_matches_email_domain
    return unless email && age
    
    # Business rule: company emails require age >= 18
    if email.end_with?('@company.com') && age < 18
      errors.add(:age, "must be at least 18 for company email")
    end
  end
end

user = UserDataQuality.new(email: "user@company.com", age: 16, name: "John", phone: "555-555-5555")
user.valid?
# => false
user.errors.full_messages
# => ["Age must be at least 18 for company email"]

Dry-validation provides a powerful schema-based validation framework with composable rules:

require 'dry-validation'

class UserContract < Dry::Validation::Contract
  params do
    required(:email).filled(:string)
    required(:age).filled(:integer)
    required(:name).filled(:string)
    optional(:address).hash do
      required(:street).filled(:string)
      required(:city).filled(:string)
      required(:zip).filled(:string)
    end
  end
  
  rule(:email) do
    unless /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i.match?(value)
      key.failure('must be valid email format')
    end
  end
  
  rule(:age) do
    key.failure('must be positive') if value <= 0
    key.failure('must be realistic') if value > 150
  end
  
  rule(:age, :email) do
    if values[:email]&.end_with?('@company.com') && values[:age] < 18
      key(:age).failure('must be 18+ for company email')
    end
  end
end

contract = UserContract.new
result = contract.call(email: "user@company.com", age: 16, name: "John")
result.success?
# => false
result.errors.to_h
# => { age: ["must be 18+ for company email"] }

Custom validators handle complex domain-specific logic:

require 'set'

class DataQualityValidator
  def self.validate_referential_integrity(records, foreign_key, reference_table)
    valid_ids = reference_table.map { |r| r[:id] }.to_set
    invalid_records = records.reject { |r| valid_ids.include?(r[foreign_key]) }
    
    {
      valid: invalid_records.empty?,
      invalid_count: invalid_records.size,
      invalid_records: invalid_records.map { |r| r[:id] }
    }
  end
  
  def self.detect_duplicates(records, key_fields)
    grouped = records.group_by { |r| key_fields.map { |k| r[k] } }
    duplicates = grouped.select { |_, group| group.size > 1 }
    
    {
      has_duplicates: !duplicates.empty?,
      duplicate_groups: duplicates.size,
      duplicate_records: duplicates.values.flatten.map { |r| r[:id] }
    }
  end
  
  def self.check_data_freshness(records, timestamp_field, max_age_seconds)
    now = Time.now
    stale_records = records.select do |r|
      timestamp = r[timestamp_field]
      timestamp.nil? || (now - timestamp) > max_age_seconds
    end
    
    {
      fresh: stale_records.empty?,
      stale_count: stale_records.size,
      stale_records: stale_records.map { |r| r[:id] }
    }
  end
end

# Check referential integrity
users = [{ id: 1, order_id: 100 }, { id: 2, order_id: 999 }]
orders = [{ id: 100 }, { id: 101 }]
DataQualityValidator.validate_referential_integrity(users, :order_id, orders)
# => { valid: false, invalid_count: 1, invalid_records: [2] }

Implementation Approaches

Data quality management strategies differ based on when validation occurs, where validation logic resides, and how failures are handled. The four primary approaches are: inline validation, batch validation, streaming validation, and external validation services.

Inline validation executes quality checks during data write operations. The application validates data before persisting to storage. This approach provides immediate feedback and prevents invalid data from entering the system. Inline validation adds latency to write operations and requires validation logic in each service that writes data.

class UserService
  def create_user(data)
    validator = UserDataQuality.new(data)
    
    unless validator.valid?
      return {
        success: false,
        errors: validator.errors.full_messages
      }
    end
    
    # Proceed with user creation
    user = User.create(data)
    { success: true, user: user }
  end
end

Batch validation processes data quality checks asynchronously on datasets. A scheduled job or triggered process scans existing data, identifies issues, and generates reports. This approach handles large volumes without impacting write performance. Batch validation detects issues after data entry, allowing invalid data to temporarily exist in the system.

class DataQualityBatchJob
  def perform
    start_time = Time.now
    issues = []
    
    User.find_in_batches(batch_size: 1000) do |batch|
      batch.each do |user|
        validator = UserDataQuality.new(user.attributes)
        next if validator.valid?
        
        issues << {
          record_id: user.id,
          record_type: 'User',
          errors: validator.errors.full_messages,
          detected_at: Time.now
        }
      end
    end
    
    DataQualityReport.create(
      run_at: start_time,
      records_checked: User.count,
      issues_found: issues.size,
      issues: issues
    )
  end
end

Streaming validation applies quality checks to data flowing through processing pipelines. Events or messages undergo validation as they move between services. This approach balances real-time feedback with decoupled architecture. Streaming validation requires message queue infrastructure and handling of failed messages.

class DataQualityStream
  def initialize(input_topic, output_topic, error_topic)
    @input = input_topic
    @output = output_topic
    @error = error_topic
    @validator = UserDataQuality
  end
  
  def process_message(message)
    data = JSON.parse(message.payload)
    validator = @validator.new(data)
    
    if validator.valid?
      @output.publish(message)
    else
      error_message = {
        original_data: data,
        errors: validator.errors.full_messages,
        timestamp: Time.now.iso8601
      }
      @error.publish(error_message.to_json)
    end
  rescue StandardError => e
    @error.publish({
      original_message: message.payload,
      error: e.message,
      timestamp: Time.now.iso8601
    }.to_json)
  end
end

External validation services centralize data quality logic in dedicated services. Multiple applications send data to the validation service via API calls. This approach ensures consistent validation rules across services and separates quality concerns from business logic. External services introduce network latency and require availability management.
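
A client for such a service might look like the following sketch; the `/validate` endpoint, payload shape, and response format are assumptions, not a real API:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical client for a centralized validation service. The endpoint,
# request payload, and response keys are illustrative assumptions.
class ValidationServiceClient
  def initialize(base_url)
    @uri = URI.join(base_url, '/validate')
  end

  def validate(record_type, data)
    response = Net::HTTP.post(
      @uri,
      { type: record_type, data: data }.to_json,
      'Content-Type' => 'application/json'
    )
    JSON.parse(response.body, symbolize_names: true)
  rescue StandardError => e
    # Availability management: pick a policy up front, fail open (accept)
    # or fail closed (reject). This sketch fails closed.
    { valid: false, errors: ["validation service unavailable: #{e.message}"] }
  end
end
```

The rescue clause is where the availability trade-off mentioned above becomes concrete: every caller inherits whichever failure policy the client encodes.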

The selection criteria include: write latency requirements, data volume, consistency needs, and architectural constraints. High-throughput systems prefer batch or streaming validation. Systems requiring strong consistency use inline validation. Microservices architectures benefit from external validation services for cross-service consistency.

Common Patterns

Data quality management follows established patterns that address specific validation scenarios and quality dimensions. These patterns combine to form comprehensive quality frameworks.

Schema Validation Pattern enforces structural correctness through schema definitions. The schema specifies required fields, data types, format constraints, and value ranges. Applications validate data against schemas before processing.

class SchemaValidator
  def initialize(schema)
    @schema = schema
  end
  
  def validate(data)
    errors = []
    
    @schema[:required]&.each do |field|
      errors << "Missing required field: #{field}" unless data.key?(field)
    end
    
    @schema[:fields]&.each do |field, constraints|
      next unless data.key?(field)
      
      value = data[field]
      
      if constraints[:type]
        expected_type = constraints[:type]
        unless value.is_a?(expected_type)
          errors << "Field #{field} must be #{expected_type}, got #{value.class}"
        end
      end
      
      if constraints[:pattern]
        unless constraints[:pattern].match?(value.to_s)
          errors << "Field #{field} does not match required pattern"
        end
      end
      
      # For strings, min/max constrain length; for numeric values, magnitude
      comparable = value.is_a?(String) ? value.length : value
      
      if constraints[:min] && comparable < constraints[:min]
        errors << "Field #{field} below minimum value #{constraints[:min]}"
      end
      
      if constraints[:max] && comparable > constraints[:max]
        errors << "Field #{field} exceeds maximum value #{constraints[:max]}"
      end
    end
    
    { valid: errors.empty?, errors: errors }
  end
end

user_schema = {
  required: [:email, :name, :age],
  fields: {
    email: { type: String, pattern: /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i },
    name: { type: String, min: 2, max: 100 },
    age: { type: Integer, min: 0, max: 150 }
  }
}

validator = SchemaValidator.new(user_schema)
result = validator.validate({ email: "user@example.com", name: "A", age: 25 })
# => { valid: false, errors: ["Field name below minimum value 2"] }

Data Cleansing Pattern automatically corrects common data quality issues. The pattern applies transformations to standardize formats, trim whitespace, normalize cases, and fill defaults. Cleansing occurs before validation.

class DataCleanser
  def self.cleanse_user_data(data)
    cleansed = data.dup
    
    # Trim whitespace
    cleansed.transform_values! { |v| v.is_a?(String) ? v.strip : v }
    
    # Normalize email
    if cleansed[:email]
      cleansed[:email] = cleansed[:email].downcase
    end
    
    # Normalize phone format
    if cleansed[:phone]
      digits = cleansed[:phone].gsub(/\D/, '')
      cleansed[:phone] = "#{digits[0..2]}-#{digits[3..5]}-#{digits[6..9]}" if digits.length == 10
    end
    
    # Convert string numbers to integers
    if cleansed[:age].is_a?(String) && cleansed[:age] =~ /\A\d+\z/
      cleansed[:age] = cleansed[:age].to_i
    end
    
    # Fill defaults
    cleansed[:created_at] ||= Time.now
    cleansed[:active] = true if cleansed[:active].nil?
    
    cleansed
  end
end

raw_data = { email: "  USER@EXAMPLE.COM  ", phone: "(555) 555-5555", age: "25" }
clean_data = DataCleanser.cleanse_user_data(raw_data)
# => { email: "user@example.com", phone: "555-555-5555", age: 25, created_at: 2025-10-11..., active: true }

Quality Metrics Pattern quantifies data quality through calculated measurements. Metrics track quality trends over time and identify degradation. Common metrics include completeness ratio, validity ratio, duplicate count, and staleness percentage.

class DataQualityMetrics
  def initialize(records, schema)
    @records = records
    @schema = schema
  end
  
  def calculate_metrics
    {
      total_records: @records.size,
      completeness: calculate_completeness,
      validity: calculate_validity,
      uniqueness: calculate_uniqueness,
      timeliness: calculate_timeliness,
      overall_score: calculate_overall_score
    }
  end
  
  private
  
  def calculate_completeness
    return 1.0 if @records.empty?
    
    field_scores = @schema[:required].map do |field|
      complete = @records.count { |r| !r[field].nil? && !r[field].to_s.empty? }
      complete.to_f / @records.size
    end
    
    field_scores.sum / field_scores.size
  end
  
  def calculate_validity
    return 1.0 if @records.empty?
    
    validator = SchemaValidator.new(@schema)
    valid_count = @records.count { |r| validator.validate(r)[:valid] }
    valid_count.to_f / @records.size
  end
  
  def calculate_uniqueness
    return 1.0 if @records.empty? || !@schema[:unique_keys]
    
    unique_keys = @schema[:unique_keys]
    key_combinations = @records.map { |r| unique_keys.map { |k| r[k] } }
    unique_count = key_combinations.uniq.size
    unique_count.to_f / @records.size
  end
  
  def calculate_timeliness
    return 1.0 if @records.empty? || !@schema[:timestamp_field]
    
    max_age = @schema[:max_age_seconds] || 86400
    now = Time.now
    fresh_count = @records.count do |r|
      timestamp = r[@schema[:timestamp_field]]
      timestamp && (now - timestamp) <= max_age
    end
    fresh_count.to_f / @records.size
  end
  
  def calculate_overall_score
    scores = [
      calculate_completeness,
      calculate_validity,
      calculate_uniqueness,
      calculate_timeliness
    ]
    scores.sum / scores.size
  end
end

Quarantine Pattern isolates invalid data for investigation and correction. Records failing quality checks move to a quarantine area rather than being rejected. This pattern preserves data for analysis while preventing corruption of clean datasets.

class DataQuarantine
  def initialize
    @quarantine = []
  end
  
  def process_record(record, validator)
    result = validator.validate(record)
    
    if result[:valid]
      store_clean_record(record)
    else
      quarantine_record(record, result[:errors])
    end
  end
  
  def quarantine_record(record, errors)
    @quarantine << {
      data: record,
      errors: errors,
      quarantined_at: Time.now,
      status: 'pending_review'
    }
  end
  
  def review_quarantined_record(index, action, corrected_data = nil)
    entry = @quarantine[index]
    return unless entry
    
    case action
    when :fix
      if corrected_data
        entry[:status] = 'corrected'
        entry[:corrected_data] = corrected_data
        entry[:corrected_at] = Time.now
        store_clean_record(corrected_data)
      end
    when :reject
      entry[:status] = 'rejected'
      entry[:rejected_at] = Time.now
    when :override
      entry[:status] = 'approved_with_override'
      entry[:approved_at] = Time.now
      store_clean_record(entry[:data])
    end
  end
  
  def quarantine_report
    {
      total_quarantined: @quarantine.size,
      pending: @quarantine.count { |e| e[:status] == 'pending_review' },
      corrected: @quarantine.count { |e| e[:status] == 'corrected' },
      rejected: @quarantine.count { |e| e[:status] == 'rejected' }
    }
  end
  
  private
  
  def store_clean_record(record)
    # Store in main database or clean data store
  end
end

Tools & Ecosystem

Ruby's data quality ecosystem includes validation libraries, data processing gems, and integration tools. Selecting appropriate tools depends on validation complexity, performance requirements, and architectural constraints.

ActiveModel::Validations integrates with Rails but works in standalone Ruby applications. It provides declarative validation syntax and custom validator support. ActiveModel handles model-level validation with built-in validators for presence, format, numericality, length, and inclusion.

Dry-validation offers advanced validation with schema definitions, type coercion, and composable rules. The gem separates input validation from business logic validation. Dry-validation excels at complex nested validations and external dependencies.

JSON Schema validates JSON data against schema specifications. The json-schema gem implements JSON Schema validation for API payloads and configuration files. JSON Schema supports OpenAPI specification integration for API validation.

require 'json-schema'

schema = {
  "type" => "object",
  "required" => ["email", "age"],
  "properties" => {
    "email" => {
      "type" => "string",
      "format" => "email"
    },
    "age" => {
      "type" => "integer",
      "minimum" => 0,
      "maximum" => 150
    },
    "address" => {
      "type" => "object",
      "properties" => {
        "street" => { "type" => "string" },
        "city" => { "type" => "string" }
      }
    }
  }
}

data = { "email" => "user@example.com", "age" => 25 }
JSON::Validator.validate!(schema, data)
# => true

invalid_data = { "email" => "invalid", "age" => -5 }
JSON::Validator.validate(schema, invalid_data)
# => false

Daru provides DataFrame operations for data analysis and quality assessment. The gem supports missing value detection, duplicate identification, and statistical analysis. Daru works with CSV, Excel, and database sources.

PaperTrail tracks data changes for audit trails and quality monitoring. The gem records who changed data, when changes occurred, and what values changed. PaperTrail supports quality investigations by providing historical data context.

Scientist enables safe refactoring of data quality rules through experimentation. The gem compares new validation logic against existing logic without affecting production behavior. Scientist measures quality rule performance and accuracy.

Great Expectations (via Python integration) provides comprehensive data quality testing. While not Ruby-native, Ruby applications integrate with Great Expectations through system calls or microservice APIs. Great Expectations offers expectation-based validation, automated documentation, and quality dashboards.

Selection criteria for tools:

  • Complexity: Simple validations use ActiveModel; complex schema validation requires Dry-validation or JSON Schema
  • Performance: High-volume validation needs optimized libraries like Dry-validation
  • Integration: Rails applications prefer ActiveModel; API services use JSON Schema
  • Reporting: Quality metrics and trends require custom implementations or Great Expectations integration

Performance Considerations

Data quality checks impact system performance through validation overhead, query costs, and processing delays. Optimizing quality checks requires balancing thoroughness with speed.

Validation execution time grows with rule complexity and data volume. Simple presence checks complete in microseconds while complex pattern matching or database lookups take milliseconds. Multiplying validation time by record count determines total overhead.

require 'benchmark'

def measure_validation_performance(validator_class, record_count)
  records = record_count.times.map do |i|
    { id: i, email: "user#{i}@example.com", age: 20 + (i % 50) }
  end
  
  time = Benchmark.measure do
    records.each { |r| validator_class.new(r).valid? }
  end
  
  {
    records: record_count,
    total_time: time.real,
    per_record: time.real / record_count,
    throughput: record_count / time.real
  }
end

result = measure_validation_performance(UserDataQuality, 10000)
# => e.g. { records: 10000, total_time: 2.3, per_record: 0.00023, throughput: 4347.8 }
#    (illustrative numbers; actual timings depend on hardware and rule complexity)

Caching validation results avoids redundant checks on unchanged data. Cache keys combine record identifier and content hash. This optimization works for batch validation but not real-time validation of new data.

require 'digest'
require 'json'

class CachedValidator
  def initialize(validator_class)
    @validator_class = validator_class
    @cache = {}
  end
  
  def validate(record)
    cache_key = generate_cache_key(record)
    
    @cache[cache_key] ||= begin
      validator = @validator_class.new(record)
      {
        valid: validator.valid?,
        errors: validator.errors.full_messages
      }
    end
  end
  
  private
  
  def generate_cache_key(record)
    content = record.to_json
    Digest::SHA256.hexdigest(content)
  end
end

Parallel validation distributes quality checks across multiple threads or processes. Ruby's GIL limits threading benefits for CPU-bound validation, but parallel processing reduces batch validation time through process pools.

require 'parallel'

class ParallelValidator
  def self.validate_batch(records, validator_class, process_count: 4)
    results = Parallel.map(records, in_processes: process_count) do |record|
      validator = validator_class.new(record)
      {
        record_id: record[:id],
        valid: validator.valid?,
        errors: validator.errors.full_messages
      }
    end
    
    {
      total: results.size,
      valid: results.count { |r| r[:valid] },
      invalid: results.reject { |r| r[:valid] }
    }
  end
end

# Validate 100k records using 8 processes
results = ParallelValidator.validate_batch(large_dataset, UserDataQuality, process_count: 8)

Incremental validation checks only changed data rather than full datasets. Track data modifications through timestamps or version numbers, then validate modified records. This reduces batch validation time from hours to minutes.
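
A minimal sketch of this approach, assuming each record carries an `updated_at` timestamp and the validator is any callable:

```ruby
# Incremental validation sketch: only records modified since the last run are
# re-validated, tracked with a high-water-mark timestamp.
class IncrementalValidator
  def initialize(validator)
    @validator = validator
    @last_run_at = Time.at(0)  # first run validates everything
  end

  def run(records)
    changed = records.select { |r| r[:updated_at] > @last_run_at }
    @last_run_at = Time.now
    changed.map { |r| { id: r[:id], valid: @validator.call(r) } }
  end
end

validator = IncrementalValidator.new(->(r) { !r[:email].nil? })
records = [
  { id: 1, email: "a@example.com", updated_at: Time.now },
  { id: 2, email: nil, updated_at: Time.now }
]
validator.run(records)  # first run validates both records
validator.run(records)  # nothing changed since, so nothing is re-validated
# => []
```

In a database-backed system the same idea becomes a `WHERE updated_at > ?` clause, with the high-water mark persisted between job runs.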

Database-level constraints perform validation in the database rather than application layer. Foreign key constraints, unique indexes, and check constraints enforce quality rules at write time. Database constraints execute faster than application validation but offer less flexibility and worse error messages.

Sampling strategies validate subsets of large datasets for quality monitoring. Statistical sampling provides quality estimates without full dataset scans. Sample size affects accuracy; larger samples increase confidence but cost more time.
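
A sampling check might look like this sketch, which estimates the validity ratio from a random subset rather than the full dataset:

```ruby
# Estimate the validity ratio from a random sample; the estimate's accuracy
# grows with sample size per standard sampling statistics.
def sampled_validity(records, sample_size:, &valid_check)
  sample = records.sample([sample_size, records.size].min)
  return 1.0 if sample.empty?

  sample.count(&valid_check).to_f / sample.size
end

# Dataset where roughly 1% of records carry an invalid negative age
dataset = 10_000.times.map { |i| { id: i, age: i % 100 == 0 ? -1 : 30 } }
estimate = sampled_validity(dataset, sample_size: 500) { |r| r[:age] >= 0 }
# Roughly 0.99, without scanning all 10,000 records
```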

Performance optimization trade-offs include:

  • Thoroughness vs Speed: Comprehensive validation catches more issues but increases latency
  • Real-time vs Batch: Real-time validation prevents bad data entry but adds write latency
  • Caching vs Freshness: Cached results improve speed but may miss recent data changes
  • Parallel vs Sequential: Parallel processing reduces total time but increases resource usage

Reference

Data Quality Dimensions

Dimension      Definition                        Measurement Method
Accuracy       Correctness of data values        Comparison to authoritative source
Completeness   Presence of required data         Missing field count / total fields
Consistency    Conformance to format and rules   Rule violation count / total records
Timeliness     Currency of data                  Age of data vs maximum allowed age
Validity       Conformance to constraints        Invalid records / total records
Uniqueness     Absence of duplicates             Duplicate records / total records

Validation Approaches Comparison

Approach           When to Use                      Latency Impact       Data Coverage
Inline             Strong consistency required      High write latency   100% at write time
Batch              Large volumes, flexible timing   No write latency     Delayed detection
Streaming          Event-driven architecture        Medium per-event     Real-time in pipeline
External Service   Microservices, shared rules      Network latency      Depends on integration

Ruby Validation Libraries

Library                    Best For                              Learning Curve   Performance
ActiveModel::Validations   Rails integration, model validation   Low              Medium
Dry-validation             Complex schemas, type coercion        Medium           High
JSON Schema                API payloads, OpenAPI                 Low              Medium
Custom Validators          Domain-specific rules                 High             Variable

Common Validation Rules

Rule Type        Example                                 Implementation Complexity
Presence         Email required                          Low
Format           Email matches regex pattern             Low
Range            Age between 0 and 150                   Low
Referential      Foreign key exists in reference table   Medium
Cross-field      End date after start date               Medium
Business Logic   Discount valid for customer tier        High
External API     Address verified by geocoding service   High

Quality Metrics Formulas

Metric                  Formula                             Interpretation
Completeness Ratio      Complete fields / Required fields   1.0 = perfect, 0.0 = all missing
Validity Ratio          Valid records / Total records       1.0 = all valid, 0.0 = all invalid
Duplicate Rate          Duplicate records / Total records   0.0 = no duplicates, 1.0 = all duplicates
Freshness Score         Fresh records / Total records       1.0 = all current, 0.0 = all stale
Overall Quality Score   Average of dimension scores         Composite quality indicator

Performance Optimization Techniques

Technique                Use Case                    Complexity   Speed Improvement
Caching                  Unchanged data validation   Low          10-100x
Parallel Processing      Large batch validation      Medium       2-8x per core
Incremental Validation   Changed data only           Medium       10-1000x
Database Constraints     Write-time enforcement      Low          100x
Sampling                 Quality monitoring          Medium       Proportional to sample size

Validation Execution Order

Stage                    Purpose                        Example Checks
Type Validation          Ensure correct data types      Is age an integer
Format Validation        Check patterns and structure   Email matches format
Range Validation         Verify values within bounds    Age between 0-150
Presence Validation      Required fields exist          Email not nil
Business Rules           Domain logic validation        Customer tier allows discount
Cross-field Validation   Field relationships            End date after start date
Referential Integrity    Foreign key validity           User ID exists in users table
Uniqueness Validation    No duplicates                  Email unique across records

Error Handling Strategies

Strategy        When to Use              Data Handling
Reject          Critical quality rules   Block invalid data
Quarantine      Investigation needed     Store separately for review
Warn            Non-critical issues      Allow with warning flag
Auto-correct    Fixable issues           Apply transformation rules
Manual Review   Complex cases            Queue for human review