Overview
Data Quality Management encompasses the processes, techniques, and tools used to maintain high-quality data throughout its lifecycle. Data quality directly affects decision-making accuracy, system reliability, and business outcomes. Industry estimates commonly put the cost of poor data quality at 15-25% of revenue, lost through operational inefficiencies, incorrect decisions, and system failures.
Data quality issues manifest in multiple forms: missing values, inconsistent formats, duplicate records, constraint violations, referential integrity breaks, and stale data. Each category requires different detection and remediation strategies. A single corrupted field can cascade through dependent systems, causing failures in analytics, reporting, and automated processes.
The practice originated in database management but expanded with distributed systems, big data pipelines, and microservices architectures. Modern data quality management addresses real-time validation, cross-system consistency, and data governance compliance.
# Data quality issue example
user_records = [
{ id: 1, email: "user@example.com", age: 25 },
{ id: 2, email: "invalid-email", age: -5 }, # Invalid format and value
{ id: 3, email: "user@example.com", age: 30 }, # Duplicate email
{ id: 4, email: nil, age: 35 } # Missing required field
]
# Without quality checks, these records would corrupt analytics
average_age = user_records.map { |r| r[:age] }.compact.sum / user_records.size
# => 21 -- the negative age drags the average down, integer division truncates, and a nil age would shrink the numerator while the denominator still counts every record
Data quality management operates at multiple levels: field validation, record validation, dataset validation, and cross-system validation. Field validation checks individual values against constraints. Record validation ensures internal consistency across fields. Dataset validation identifies patterns and anomalies across records. Cross-system validation maintains referential integrity between services.
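As a minimal sketch (with hypothetical rules), the first three levels can be layered so each scope is checked independently:

```ruby
# Layered validation sketch: field-level, record-level, and
# dataset-level checks, each returning the names of failed rules.
FIELD_RULES = {
  email: ->(v) { v.to_s.include?("@") },
  age:   ->(v) { v.is_a?(Integer) && v.positive? }
}.freeze

def field_errors(record)
  # Field level: each value checked in isolation
  FIELD_RULES.reject { |field, rule| rule.call(record[field]) }.keys
end

def record_errors(record)
  # Record level: internal consistency across fields (illustrative rule)
  record[:age] && record[:age] < 13 && record[:email] ? [:minor_with_email] : []
end

def dataset_errors(records)
  # Dataset level: patterns across records, e.g. duplicate emails
  dupes = records.group_by { |r| r[:email] }.select { |_, g| g.size > 1 }
  dupes.empty? ? [] : [:duplicate_emails]
end

records = [
  { email: "a@example.com", age: 25 },
  { email: "a@example.com", age: 30 }
]
field_errors(records.first)  # => []
dataset_errors(records)      # => [:duplicate_emails]
```

Cross-system validation extends the same idea across service boundaries, as covered under referential integrity later in this section.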
Key Principles
Data quality assessment follows six primary dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Each dimension measures a specific aspect of data fitness for intended use.
Accuracy measures how well data reflects reality. An address field containing "123 Main St" has high accuracy if that address exists and corresponds to the entity. Accuracy degrades through transcription errors, system bugs, and stale data. Measuring accuracy requires authoritative reference sources or manual verification.
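Where an authoritative source is available, accuracy can be estimated mechanically; the sketch below compares records field by field against a reference dataset keyed by `:id` (the data shapes are assumptions):

```ruby
# Estimate accuracy as the fraction of compared fields whose values
# agree with an authoritative reference, matched by :id.
def accuracy_score(records, reference, fields)
  ref_by_id = reference.to_h { |r| [r[:id], r] }
  compared = 0
  matched = 0
  records.each do |rec|
    ref = ref_by_id[rec[:id]]
    next unless ref # unmatched records need manual verification instead
    fields.each do |f|
      compared += 1
      matched += 1 if rec[f] == ref[f]
    end
  end
  compared.zero? ? 0.0 : matched.to_f / compared
end

reference = [{ id: 1, city: "Boston" }, { id: 2, city: "Austin" }]
records   = [{ id: 1, city: "Boston" }, { id: 2, city: "Dallas" }]
accuracy_score(records, reference, [:city])  # => 0.5
```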
Completeness tracks the presence of required data. A customer record missing the email field when email is mandatory fails completeness checks. Completeness operates at field level (is the value present?), record level (are all required fields present?), and dataset level (are all expected records present?).
class DataCompletenessChecker
def initialize(required_fields)
@required_fields = required_fields
end
def check_record(record)
missing = @required_fields.select { |field| record[field].nil? || record[field].to_s.strip.empty? }
{
complete: missing.empty?,
missing_fields: missing,
completeness_score: 1.0 - (missing.size.to_f / @required_fields.size)
}
end
def check_dataset(records)
results = records.map { |r| check_record(r) }
{
total_records: records.size,
complete_records: results.count { |r| r[:complete] },
average_completeness: results.sum { |r| r[:completeness_score] } / records.size,
field_completeness: field_level_completeness(records)
}
end
private
def field_level_completeness(records)
@required_fields.map do |field|
present = records.count { |r| !r[field].nil? && !r[field].to_s.strip.empty? }
[field, present.to_f / records.size]
end.to_h
end
end
checker = DataCompletenessChecker.new([:email, :name, :age])
result = checker.check_dataset([
{ email: "a@example.com", name: "Alice", age: 25 },
{ email: "b@example.com", name: nil, age: 30 },
{ email: nil, name: "Charlie", age: 35 }
])
# => { total_records: 3, complete_records: 1, average_completeness: 0.777...,
#      field_completeness: { email: 0.666..., name: 0.666..., age: 1.0 } }
Consistency ensures data conforms to defined formats, follows business rules, and maintains logical coherence. Consistency violations include format mismatches (phone numbers stored as "555-1234" and "5551234"), unit inconsistencies (mixing meters and feet), and logical contradictions (end_date before start_date). Consistency checks require rule definitions and pattern matching.
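A minimal consistency checker can encode such rules as named predicates; the two rules below (a phone format rule and a date-order rule) are illustrative:

```ruby
require 'date'

# Named consistency rules: each predicate returns true when the record
# is consistent. Nil fields are treated as "no opinion" so that
# presence is left to completeness checks.
CONSISTENCY_RULES = {
  phone_format: ->(r) { r[:phone].nil? || r[:phone].match?(/\A\d{3}-\d{4}\z/) },
  date_order:   ->(r) { r[:start_date].nil? || r[:end_date].nil? || r[:end_date] >= r[:start_date] }
}.freeze

def consistency_violations(record)
  CONSISTENCY_RULES.reject { |_, rule| rule.call(record) }.keys
end

record = { phone: "5551234", start_date: Date.new(2025, 3, 1), end_date: Date.new(2025, 1, 1) }
consistency_violations(record)  # => [:phone_format, :date_order]
```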
Timeliness measures whether data is current enough for its intended use. Stock prices from yesterday may be timely for trend analysis but not for trading. Timeliness requirements vary by use case. Real-time systems require second-level timeliness, while analytical reports may accept day-old data.
Validity checks conformance to defined formats, ranges, and constraints. Email addresses must match email format. Ages must be positive integers within reasonable bounds. Foreign keys must reference existing records. Validity rules encode domain knowledge and system constraints.
Uniqueness identifies duplicate records and ensures proper entity identification. Duplicates arise from multiple data entry, system integration, and merge operations. Detecting duplicates requires fuzzy matching algorithms when exact key matches fail.
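A lightweight first step before full fuzzy matching is to normalize the match key so that records differing only in case, punctuation, or spacing group together; a sketch:

```ruby
# Near-duplicate detection by key normalization: downcase and strip
# non-alphanumerics before grouping. Edit-distance or phonetic
# comparison would be layered on top for genuinely fuzzy matches.
def near_duplicates(records, key)
  records
    .group_by { |r| r[key].to_s.downcase.gsub(/[^a-z0-9]/, '') }
    .select { |_, group| group.size > 1 }
    .values
end

records = [
  { id: 1, name: "Acme Corp." },
  { id: 2, name: "ACME CORP" },
  { id: 3, name: "Globex" }
]
near_duplicates(records, :name).map { |g| g.map { |r| r[:id] } }
# => [[1, 2]]
```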
Data quality rules fall into three categories: structural rules define format and type constraints, semantic rules encode business logic, and cross-field rules validate relationships between attributes. Rule precedence matters when multiple rules conflict.
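One way to honor that precedence is to keep the categories in separate rule sets and stop at the first category that fails, since semantic and cross-field rules usually assume structurally valid input; the rules here are hypothetical:

```ruby
# Rule sets by category, evaluated in precedence order. A structural
# failure short-circuits semantic checks that assume correct types.
RULESETS = {
  structural:  { age_is_integer: ->(r) { r[:age].is_a?(Integer) } },
  semantic:    { age_in_range:   ->(r) { r[:age].between?(0, 150) } },
  cross_field: { adult_has_email: ->(r) { r[:age] < 18 || !r[:email].nil? } }
}.freeze

def evaluate(record)
  RULESETS.each do |category, rules|
    failed = rules.reject { |_, rule| rule.call(record) }.keys
    return { category: category, failed: failed } unless failed.empty?
  end
  { category: nil, failed: [] }
end

evaluate(age: "25", email: nil)
# => { category: :structural, failed: [:age_is_integer] }
evaluate(age: 25, email: nil)
# => { category: :cross_field, failed: [:adult_has_email] }
```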
Ruby Implementation
Ruby provides multiple approaches for implementing data quality checks through validation libraries, custom validators, and data processing frameworks. The standard library includes basic validation capabilities, while gems offer specialized functionality.
# Basic validation framework
class DataValidator
attr_reader :errors
def initialize
@errors = []
@rules = []
end
def add_rule(name, &block)
@rules << { name: name, check: block }
self
end
def validate(data)
@errors = []
@rules.each do |rule|
begin
result = rule[:check].call(data)
@errors << { rule: rule[:name], message: "Validation failed" } unless result
rescue StandardError => e
@errors << { rule: rule[:name], message: e.message }
end
end
valid?
end
def valid?
@errors.empty?
end
end
# Usage with multiple validation rules
validator = DataValidator.new
validator.add_rule(:email_present) { |d| !d[:email].nil? && !d[:email].empty? }
validator.add_rule(:email_format) { |d| d[:email] =~ /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i }
validator.add_rule(:age_valid) { |d| d[:age].is_a?(Integer) && d[:age] > 0 && d[:age] < 150 }
validator.add_rule(:age_adult) { |d| d[:age] >= 18 }
data = { email: "user@example.com", age: 25 }
validator.validate(data)
# => true
ActiveModel validations provide declarative validation syntax. The library ships with Ruby on Rails but works equally well in plain Ruby applications:
require 'active_model'
class UserDataQuality
include ActiveModel::Validations
attr_accessor :email, :age, :name, :phone
validates :email, presence: true, format: { with: URI::MailTo::EMAIL_REGEXP }
validates :age, numericality: { only_integer: true, greater_than: 0, less_than: 150 }
validates :name, presence: true, length: { minimum: 2, maximum: 100 }
validates :phone, format: { with: /\A\d{3}-\d{3}-\d{4}\z/ }, allow_blank: true
validate :age_matches_email_domain
def initialize(attributes = {})
attributes.each { |key, value| send("#{key}=", value) }
end
private
def age_matches_email_domain
return unless email && age
# Business rule: company emails require age >= 18
if email.end_with?('@company.com') && age < 18
errors.add(:age, "must be at least 18 for company email")
end
end
end
user = UserDataQuality.new(email: "user@company.com", age: 16, name: "John", phone: "555-555-5555")
user.valid?
# => false
user.errors.full_messages
# => ["Age must be at least 18 for company email"]
Dry-validation provides a powerful schema-based validation framework with composable rules:
require 'dry-validation'
class UserContract < Dry::Validation::Contract
params do
required(:email).filled(:string)
required(:age).filled(:integer)
required(:name).filled(:string)
optional(:address).hash do
required(:street).filled(:string)
required(:city).filled(:string)
required(:zip).filled(:string)
end
end
rule(:email) do
unless /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i.match?(value)
key.failure('must be valid email format')
end
end
rule(:age) do
key.failure('must be positive') if value <= 0
key.failure('must be realistic') if value > 150
end
rule(:age, :email) do
if values[:email]&.end_with?('@company.com') && values[:age] < 18
key(:age).failure('must be 18+ for company email')
end
end
end
contract = UserContract.new
result = contract.call(email: "user@company.com", age: 16, name: "John")
result.success?
# => false
result.errors.to_h
# => { age: ["must be 18+ for company email"] }
Custom validators handle complex domain-specific logic:
require 'set' # Set needs an explicit require before Ruby 3.2
class DataQualityValidator
def self.validate_referential_integrity(records, foreign_key, reference_table)
valid_ids = reference_table.map { |r| r[:id] }.to_set
invalid_records = records.reject { |r| valid_ids.include?(r[foreign_key]) }
{
valid: invalid_records.empty?,
invalid_count: invalid_records.size,
invalid_records: invalid_records.map { |r| r[:id] }
}
end
def self.detect_duplicates(records, key_fields)
grouped = records.group_by { |r| key_fields.map { |k| r[k] } }
duplicates = grouped.select { |_, group| group.size > 1 }
{
has_duplicates: !duplicates.empty?,
duplicate_groups: duplicates.size,
duplicate_records: duplicates.values.flatten.map { |r| r[:id] }
}
end
def self.check_data_freshness(records, timestamp_field, max_age_seconds)
now = Time.now
stale_records = records.select do |r|
timestamp = r[timestamp_field]
timestamp.nil? || (now - timestamp) > max_age_seconds
end
{
fresh: stale_records.empty?,
stale_count: stale_records.size,
stale_records: stale_records.map { |r| r[:id] }
}
end
end
# Check referential integrity
users = [{ id: 1, order_id: 100 }, { id: 2, order_id: 999 }]
orders = [{ id: 100 }, { id: 101 }]
DataQualityValidator.validate_referential_integrity(users, :order_id, orders)
# => { valid: false, invalid_count: 1, invalid_records: [2] }
Implementation Approaches
Data quality management strategies differ based on when validation occurs, where validation logic resides, and how failures are handled. The four primary approaches are: inline validation, batch validation, streaming validation, and external validation services.
Inline validation executes quality checks during data write operations. The application validates data before persisting to storage. This approach provides immediate feedback and prevents invalid data from entering the system. Inline validation adds latency to write operations and requires validation logic in each service that writes data.
class UserService
def create_user(data)
validator = UserDataQuality.new(data)
unless validator.valid?
return {
success: false,
errors: validator.errors.full_messages
}
end
# Proceed with user creation
user = User.create(data)
{ success: true, user: user }
end
end
Batch validation processes data quality checks asynchronously on datasets. A scheduled job or triggered process scans existing data, identifies issues, and generates reports. This approach handles large volumes without impacting write performance. Batch validation detects issues after data entry, allowing invalid data to temporarily exist in the system.
class DataQualityBatchJob
def perform
start_time = Time.now
issues = []
User.find_in_batches(batch_size: 1000) do |batch|
batch.each do |user|
validator = UserDataQuality.new(user.attributes)
next if validator.valid?
issues << {
record_id: user.id,
record_type: 'User',
errors: validator.errors.full_messages,
detected_at: Time.now
}
end
end
DataQualityReport.create(
run_at: start_time,
records_checked: User.count,
issues_found: issues.size,
issues: issues
)
end
end
Streaming validation applies quality checks to data flowing through processing pipelines. Events or messages undergo validation as they move between services. This approach balances real-time feedback with decoupled architecture. Streaming validation requires message queue infrastructure and handling of failed messages.
require 'json'
class DataQualityStream
def initialize(input_topic, output_topic, error_topic)
@input = input_topic
@output = output_topic
@error = error_topic
@validator = UserDataQuality
end
def process_message(message)
data = JSON.parse(message.payload)
validator = @validator.new(data)
if validator.valid?
@output.publish(message)
else
error_message = {
original_data: data,
errors: validator.errors.full_messages,
timestamp: Time.now.iso8601
}
@error.publish(error_message.to_json)
end
rescue StandardError => e
@error.publish({
original_message: message.payload,
error: e.message,
timestamp: Time.now.iso8601
}.to_json)
end
end
External validation services centralize data quality logic in dedicated services. Multiple applications send data to the validation service via API calls. This approach ensures consistent validation rules across services and separates quality concerns from business logic. External services introduce network latency and require availability management.
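A client for such a service might look like the sketch below. The endpoint, payload shape, and response contract are all assumptions, not a real API:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical client for a centralized validation service.
class ValidationServiceClient
  def initialize(base_url)
    @uri = URI.join(base_url, '/validate')
  end

  def validate(record_type, data)
    response = Net::HTTP.post(
      @uri,
      { type: record_type, data: data }.to_json,
      'Content-Type' => 'application/json'
    )
    JSON.parse(response.body, symbolize_names: true)
  rescue StandardError => e
    # Pick a failure policy up front: fail open (accept the data) or
    # fail closed (reject it) when the service is unreachable.
    { valid: false, errors: ["validation service unavailable: #{e.message}"] }
  end
end

client = ValidationServiceClient.new('https://data-quality.internal.example')
client.validate('user', email: 'user@example.com', age: 25)
# Falls back to the error form when the service cannot be reached.
```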
The selection criteria include: write latency requirements, data volume, consistency needs, and architectural constraints. High-throughput systems prefer batch or streaming validation. Systems requiring strong consistency use inline validation. Microservices architectures benefit from external validation services for cross-service consistency.
Common Patterns
Data quality management follows established patterns that address specific validation scenarios and quality dimensions. These patterns combine to form comprehensive quality frameworks.
Schema Validation Pattern enforces structural correctness through schema definitions. The schema specifies required fields, data types, format constraints, and value ranges. Applications validate data against schemas before processing.
class SchemaValidator
def initialize(schema)
@schema = schema
end
def validate(data)
errors = []
@schema[:required]&.each do |field|
errors << "Missing required field: #{field}" unless data.key?(field)
end
@schema[:fields]&.each do |field, constraints|
next unless data.key?(field)
value = data[field]
if constraints[:type]
expected_type = constraints[:type]
unless value.is_a?(expected_type)
errors << "Field #{field} must be #{expected_type}, got #{value.class}"
end
end
if constraints[:pattern]
unless constraints[:pattern].match?(value.to_s)
errors << "Field #{field} does not match required pattern"
end
end
# For strings, treat min/max as length bounds; for numerics, as value bounds
comparable = value.is_a?(String) ? value.length : value
if constraints[:min] && comparable < constraints[:min]
errors << "Field #{field} below minimum value #{constraints[:min]}"
end
if constraints[:max] && comparable > constraints[:max]
errors << "Field #{field} exceeds maximum value #{constraints[:max]}"
end
end
{ valid: errors.empty?, errors: errors }
end
end
user_schema = {
required: [:email, :name, :age],
fields: {
email: { type: String, pattern: /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i },
name: { type: String, min: 2, max: 100 },
age: { type: Integer, min: 0, max: 150 }
}
}
validator = SchemaValidator.new(user_schema)
result = validator.validate({ email: "user@example.com", name: "A", age: 25 })
# => { valid: false, errors: ["Field name below minimum value 2"] }
Data Cleansing Pattern automatically corrects common data quality issues. The pattern applies transformations to standardize formats, trim whitespace, normalize cases, and fill defaults. Cleansing occurs before validation.
class DataCleanser
def self.cleanse_user_data(data)
cleansed = data.dup
# Trim whitespace
cleansed.transform_values! { |v| v.is_a?(String) ? v.strip : v }
# Normalize email
if cleansed[:email]
cleansed[:email] = cleansed[:email].downcase
end
# Normalize phone format
if cleansed[:phone]
digits = cleansed[:phone].gsub(/\D/, '')
cleansed[:phone] = "#{digits[0..2]}-#{digits[3..5]}-#{digits[6..9]}" if digits.length == 10
end
# Convert string numbers to integers
if cleansed[:age].is_a?(String) && cleansed[:age] =~ /^\d+$/
cleansed[:age] = cleansed[:age].to_i
end
# Fill defaults
cleansed[:created_at] ||= Time.now
cleansed[:active] = true if cleansed[:active].nil?
cleansed
end
end
raw_data = { email: " USER@EXAMPLE.COM ", phone: "(555) 555-5555", age: "25" }
clean_data = DataCleanser.cleanse_user_data(raw_data)
# => { email: "user@example.com", phone: "555-555-5555", age: 25, created_at: <current time>, active: true }
Quality Metrics Pattern quantifies data quality through calculated measurements. Metrics track quality trends over time and identify degradation. Common metrics include completeness ratio, validity ratio, duplicate count, and staleness percentage.
class DataQualityMetrics
def initialize(records, schema)
@records = records
@schema = schema
end
def calculate_metrics
{
total_records: @records.size,
completeness: calculate_completeness,
validity: calculate_validity,
uniqueness: calculate_uniqueness,
timeliness: calculate_timeliness,
overall_score: calculate_overall_score
}
end
private
def calculate_completeness
return 1.0 if @records.empty?
field_scores = @schema[:required].map do |field|
complete = @records.count { |r| !r[field].nil? && !r[field].to_s.empty? }
complete.to_f / @records.size
end
field_scores.sum / field_scores.size
end
def calculate_validity
return 1.0 if @records.empty?
validator = SchemaValidator.new(@schema)
valid_count = @records.count { |r| validator.validate(r)[:valid] }
valid_count.to_f / @records.size
end
def calculate_uniqueness
return 1.0 if @records.empty? || !@schema[:unique_keys]
unique_keys = @schema[:unique_keys]
key_combinations = @records.map { |r| unique_keys.map { |k| r[k] } }
unique_count = key_combinations.uniq.size
unique_count.to_f / @records.size
end
def calculate_timeliness
return 1.0 if @records.empty? || !@schema[:timestamp_field]
max_age = @schema[:max_age_seconds] || 86400
now = Time.now
fresh_count = @records.count do |r|
timestamp = r[@schema[:timestamp_field]]
timestamp && (now - timestamp) <= max_age
end
fresh_count.to_f / @records.size
end
def calculate_overall_score
scores = [
calculate_completeness,
calculate_validity,
calculate_uniqueness,
calculate_timeliness
]
scores.sum / scores.size
end
end
Quarantine Pattern isolates invalid data for investigation and correction. Records failing quality checks move to a quarantine area rather than being rejected. This pattern preserves data for analysis while preventing corruption of clean datasets.
class DataQuarantine
def initialize
@quarantine = []
end
def process_record(record, validator)
result = validator.validate(record)
if result[:valid]
store_clean_record(record)
else
quarantine_record(record, result[:errors])
end
end
def quarantine_record(record, errors)
@quarantine << {
data: record,
errors: errors,
quarantined_at: Time.now,
status: 'pending_review'
}
end
def review_quarantined_record(index, action, corrected_data = nil)
entry = @quarantine[index]
return unless entry
case action
when :fix
if corrected_data
entry[:status] = 'corrected'
entry[:corrected_data] = corrected_data
entry[:corrected_at] = Time.now
store_clean_record(corrected_data)
end
when :reject
entry[:status] = 'rejected'
entry[:rejected_at] = Time.now
when :override
entry[:status] = 'approved_with_override'
entry[:approved_at] = Time.now
store_clean_record(entry[:data])
end
end
def quarantine_report
{
total_quarantined: @quarantine.size,
pending: @quarantine.count { |e| e[:status] == 'pending_review' },
corrected: @quarantine.count { |e| e[:status] == 'corrected' },
rejected: @quarantine.count { |e| e[:status] == 'rejected' }
}
end
private
def store_clean_record(record)
# Store in main database or clean data store
end
end
Tools & Ecosystem
Ruby's data quality ecosystem includes validation libraries, data processing gems, and integration tools. Selecting appropriate tools depends on validation complexity, performance requirements, and architectural constraints.
ActiveModel::Validations integrates with Rails but works in standalone Ruby applications. It provides declarative validation syntax and custom validator support. ActiveModel handles model-level validation with built-in validators for presence, format, numericality, length, and inclusion.
Dry-validation offers advanced validation with schema definitions, type coercion, and composable rules. The gem separates input validation from business logic validation. Dry-validation excels at complex nested validations and external dependencies.
JSON Schema validates JSON data against schema specifications. The json-schema gem implements JSON Schema validation for API payloads and configuration files. JSON Schema supports OpenAPI specification integration for API validation.
require 'json-schema'
schema = {
"type" => "object",
"required" => ["email", "age"],
"properties" => {
"email" => {
"type" => "string",
"format" => "email"
},
"age" => {
"type" => "integer",
"minimum" => 0,
"maximum" => 150
},
"address" => {
"type" => "object",
"properties" => {
"street" => { "type" => "string" },
"city" => { "type" => "string" }
}
}
}
}
data = { "email" => "user@example.com", "age" => 25 }
JSON::Validator.validate!(schema, data)
# => true
invalid_data = { "email" => "invalid", "age" => -5 }
JSON::Validator.validate(schema, invalid_data)
# => false
Daru provides DataFrame operations for data analysis and quality assessment. The gem supports missing value detection, duplicate identification, and statistical analysis. Daru works with CSV, Excel, and database sources.
PaperTrail tracks data changes for audit trails and quality monitoring. The gem records who changed data, when changes occurred, and what values changed. PaperTrail supports quality investigations by providing historical data context.
Scientist enables safe refactoring of data quality rules through experimentation. The gem compares new validation logic against existing logic without affecting production behavior. Scientist measures quality rule performance and accuracy.
Great Expectations (via Python integration) provides comprehensive data quality testing. While not Ruby-native, Ruby applications integrate with Great Expectations through system calls or microservice APIs. Great Expectations offers expectation-based validation, automated documentation, and quality dashboards.
Selection criteria for tools:
- Complexity: Simple validations use ActiveModel; complex schema validation requires Dry-validation or JSON Schema
- Performance: High-volume validation needs optimized libraries like Dry-validation
- Integration: Rails applications prefer ActiveModel; API services use JSON Schema
- Reporting: Quality metrics and trends require custom implementations or Great Expectations integration
Performance Considerations
Data quality checks impact system performance through validation overhead, query costs, and processing delays. Optimizing quality checks requires balancing thoroughness with speed.
Validation execution time grows with rule complexity and data volume. Simple presence checks complete in microseconds while complex pattern matching or database lookups take milliseconds. Multiplying validation time by record count determines total overhead.
require 'benchmark'
def measure_validation_performance(validator_class, record_count)
records = record_count.times.map do |i|
{ id: i, email: "user#{i}@example.com", age: 20 + (i % 50) }
end
time = Benchmark.measure do
records.each { |r| validator_class.new(r).valid? }
end
{
records: record_count,
total_time: time.real,
per_record: time.real / record_count,
throughput: record_count / time.real
}
end
result = measure_validation_performance(UserDataQuality, 10000)
# => { records: 10000, total_time: 2.3, per_record: 0.00023, throughput: 4347.8 } (illustrative; actual figures vary by hardware and rule complexity)
Caching validation results avoids redundant checks on unchanged data. Cache keys combine record identifier and content hash. This optimization works for batch validation but not real-time validation of new data.
require 'digest'
require 'json'
class CachedValidator
def initialize(validator_class)
@validator_class = validator_class
@cache = {}
end
def validate(record)
cache_key = generate_cache_key(record)
@cache[cache_key] ||= begin
validator = @validator_class.new(record)
{
valid: validator.valid?,
errors: validator.errors.full_messages
}
end
end
private
def generate_cache_key(record)
content = record.to_json
Digest::SHA256.hexdigest(content)
end
end
Parallel validation distributes quality checks across multiple threads or processes. Ruby's GIL limits threading benefits for CPU-bound validation, but parallel processing reduces batch validation time through process pools.
require 'parallel'
class ParallelValidator
def self.validate_batch(records, validator_class, process_count: 4)
results = Parallel.map(records, in_processes: process_count) do |record|
validator = validator_class.new(record)
{
record_id: record[:id],
valid: validator.valid?,
errors: validator.errors.full_messages
}
end
{
total: results.size,
valid: results.count { |r| r[:valid] },
invalid: results.reject { |r| r[:valid] }
}
end
end
# Validate 100k records using 8 processes
results = ParallelValidator.validate_batch(large_dataset, UserDataQuality, process_count: 8)
Incremental validation checks only changed data rather than full datasets. Track data modifications through timestamps or version numbers, then validate modified records. This reduces batch validation time from hours to minutes.
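A sketch of the idea, using an `updated_at` watermark (the field names and the validator interface are assumptions):

```ruby
# Incremental validation: only records modified since the last run are
# re-validated; the watermark advances to the newest timestamp seen.
# (A production version would persist the watermark and use >= with a
# tiebreaker to avoid skipping records updated at the exact boundary.)
class IncrementalValidator
  def initialize(validator)
    @validator = validator # responds to #call(record) => array of error strings
    @watermark = Time.at(0)
  end

  def run(records)
    changed = records.select { |r| r[:updated_at] > @watermark }
    results = changed.map { |r| { id: r[:id], errors: @validator.call(r) } }
    @watermark = records.map { |r| r[:updated_at] }.max || @watermark
    { checked: changed.size, skipped: records.size - changed.size, results: results }
  end
end

check_age = ->(r) { r[:age].is_a?(Integer) && r[:age] >= 0 ? [] : ["invalid age"] }
iv = IncrementalValidator.new(check_age)
records = [{ id: 1, age: 25, updated_at: Time.now }]
iv.run(records)[:checked]  # => 1
iv.run(records)[:checked]  # => 0 -- unchanged records are skipped
```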
Database-level constraints perform validation in the database rather than application layer. Foreign key constraints, unique indexes, and check constraints enforce quality rules at write time. Database constraints execute faster than application validation but offer less flexibility and worse error messages.
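In a Rails application, several of these rules can be pushed into the database via a migration; a sketch (table, column, and constraint names are hypothetical):

```ruby
# Hypothetical migration enforcing quality rules at the database layer.
class AddDataQualityConstraints < ActiveRecord::Migration[7.0]
  def change
    # Uniqueness enforced regardless of application code paths
    add_index :users, :email, unique: true
    # Referential integrity between tables
    add_foreign_key :orders, :users
    # Range rule as a check constraint (Rails 6.1+)
    add_check_constraint :users, 'age > 0 AND age < 150', name: 'users_age_range'
    # Presence rule at the column level
    change_column_null :users, :email, false
  end
end
```

Constraint violations surface as database errors (e.g. `ActiveRecord::RecordNotUnique`) rather than validation messages, which is the error-reporting trade-off noted above.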
Sampling strategies validate subsets of large datasets for quality monitoring. Statistical sampling provides quality estimates without full dataset scans. Sample size affects accuracy; larger samples increase confidence but cost more time.
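A sampling-based monitor might look like the following sketch; the margin of error uses the standard normal approximation for a proportion at 95% confidence:

```ruby
# Estimate dataset validity from a random sample rather than a full
# scan. margin_95 is the 95% confidence half-width for the proportion.
def sampled_validity(records, sample_size, &valid_check)
  sample = records.sample(sample_size, random: Random.new(42)) # seeded for repeatability
  p_hat = sample.count(&valid_check).to_f / sample.size
  margin = 1.96 * Math.sqrt(p_hat * (1 - p_hat) / sample.size)
  { estimate: p_hat, margin_95: margin.round(4), sampled: sample.size }
end

# 1% of these records are invalid (negative age)
records = (1..10_000).map { |i| { id: i, age: i % 100 == 0 ? -1 : 30 } }
sampled_validity(records, 500) { |r| r[:age] > 0 }
# Estimate lands near 0.99 with a margin around +/- 0.01
```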
Performance optimization trade-offs include:
- Thoroughness vs Speed: Comprehensive validation catches more issues but increases latency
- Real-time vs Batch: Real-time validation prevents bad data entry but adds write latency
- Caching vs Freshness: Cached results improve speed but may miss recent data changes
- Parallel vs Sequential: Parallel processing reduces total time but increases resource usage
Reference
Data Quality Dimensions
| Dimension | Definition | Measurement Method |
|---|---|---|
| Accuracy | Correctness of data values | Comparison to authoritative source |
| Completeness | Presence of required data | Missing field count / total fields |
| Consistency | Conformance to format and rules | Rule violation count / total records |
| Timeliness | Currency of data | Age of data vs maximum allowed age |
| Validity | Conformance to constraints | Invalid records / total records |
| Uniqueness | Absence of duplicates | Duplicate records / total records |
Validation Approaches Comparison
| Approach | When to Use | Latency Impact | Data Coverage |
|---|---|---|---|
| Inline | Strong consistency required | High write latency | 100% at write time |
| Batch | Large volumes, flexible timing | No write latency | Delayed detection |
| Streaming | Event-driven architecture | Medium per-event | Real-time in pipeline |
| External Service | Microservices, shared rules | Network latency | Depends on integration |
Ruby Validation Libraries
| Library | Best For | Learning Curve | Performance |
|---|---|---|---|
| ActiveModel::Validations | Rails integration, model validation | Low | Medium |
| Dry-validation | Complex schemas, type coercion | Medium | High |
| JSON Schema | API payloads, OpenAPI | Low | Medium |
| Custom Validators | Domain-specific rules | High | Variable |
Common Validation Rules
| Rule Type | Example | Implementation Complexity |
|---|---|---|
| Presence | Email required | Low |
| Format | Email matches regex pattern | Low |
| Range | Age between 0 and 150 | Low |
| Referential | Foreign key exists in reference table | Medium |
| Cross-field | End date after start date | Medium |
| Business Logic | Discount valid for customer tier | High |
| External API | Address verified by geocoding service | High |
Quality Metrics Formulas
| Metric | Formula | Interpretation |
|---|---|---|
| Completeness Ratio | Complete fields / Required fields | 1.0 = perfect, 0.0 = all missing |
| Validity Ratio | Valid records / Total records | 1.0 = all valid, 0.0 = all invalid |
| Duplicate Rate | Duplicate records / Total records | 0.0 = no duplicates, 1.0 = all duplicates |
| Freshness Score | Fresh records / Total records | 1.0 = all current, 0.0 = all stale |
| Overall Quality Score | Average of dimension scores | Composite quality indicator |
Performance Optimization Techniques
| Technique | Use Case | Complexity | Speed Improvement |
|---|---|---|---|
| Caching | Unchanged data validation | Low | 10-100x |
| Parallel Processing | Large batch validation | Medium | 2-8x per core |
| Incremental Validation | Changed data only | Medium | 10-1000x |
| Database Constraints | Write-time enforcement | Low | 100x |
| Sampling | Quality monitoring | Medium | Proportional to sample size |
Validation Execution Order
| Stage | Purpose | Example Checks |
|---|---|---|
| Type Validation | Ensure correct data types | Is age an integer |
| Format Validation | Check patterns and structure | Email matches format |
| Range Validation | Verify values within bounds | Age between 0-150 |
| Presence Validation | Required fields exist | Email not nil |
| Business Rules | Domain logic validation | Customer tier allows discount |
| Cross-field Validation | Field relationships | End date after start date |
| Referential Integrity | Foreign key validity | User ID exists in users table |
| Uniqueness Validation | No duplicates | Email unique across records |
Error Handling Strategies
| Strategy | When to Use | Data Handling |
|---|---|---|
| Reject | Critical quality rules | Block invalid data |
| Quarantine | Investigation needed | Store separately for review |
| Warn | Non-critical issues | Allow with warning flag |
| Auto-correct | Fixable issues | Apply transformation rules |
| Manual Review | Complex cases | Queue for human review |