Challenge

Problem

Web forms produce inconsistent data, and importing customer records from external systems requires normalization. This drill teaches you to clean messy CSV data: strip whitespace, normalize phone and email formats, validate required fields, remove duplicates, and handle malformed rows. You'll practice data-validation patterns that catch bad records before they cause SQL errors, corrupt data, or waste expensive processing downstream.

Difficulty: Intermediate

Instructions

  1. Read user-submitted CSV file with headers (Name, Email, Phone, Company)
  2. Clean each field:
    • Strip leading/trailing whitespace
    • Normalize emails: lowercase and trim
    • Normalize phone: remove formatting, keep digits only, format as XXX-XXX-XXXX
    • Capitalize company names properly
  3. Validate required fields (Name and Email cannot be empty)
  4. Validate email format (contains @ and .)
  5. Remove duplicate rows (by email)
  6. Track errors and write error report
  7. Output cleaned CSV and error summary
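One way the steps above might fit together, sketched with Ruby's standard CSV library. The output file names, line-numbering scheme, and return value are assumptions for illustration, not part of the official solution:

```ruby
require 'csv'

# Sketch of the full pipeline. Assumes headers Name, Email, Phone, Company.
def clean_csv(path)
  cleaned, errors = [], []
  seen_emails = {}

  # with_index(2): data starts on line 2, after the header row
  CSV.foreach(path, headers: true).with_index(2) do |row, line|
    name    = row['Name'].to_s.strip
    email   = row['Email'].to_s.strip.downcase
    digits  = row['Phone'].to_s.gsub(/\D/, '')
    phone   = digits.length == 10 ? "#{digits[0, 3]}-#{digits[3, 3]}-#{digits[6, 4]}" : digits
    company = row['Company'].to_s.strip.split.map(&:capitalize).join(' ')

    if name.empty? || email.empty?
      errors << "Line #{line}: missing required field"
      next
    end
    unless email.include?('@') && email.include?('.')
      errors << "Line #{line}: invalid email #{email}"
      next
    end
    next if seen_emails[email]   # duplicate by email: first occurrence wins
    seen_emails[email] = true

    cleaned << [name, email, phone, company]
  end

  [cleaned, errors]
end
```

From here, writing the cleaned rows and the error report is a matter of `CSV.open` and `File.write`.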

Files

Editable
Read-only

Hints

Hint 1

String#strip removes leading/trailing whitespace

Hint 2

String#downcase converts to lowercase

Hint 3

String#gsub(/\D/, '') removes all non-digit characters
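Stripping to digits is half the job; step 2 also asks for XXX-XXX-XXXX output. A small sketch combining both (assumes 10-digit US numbers):

```ruby
raw    = '(555) 123-4567'
digits = raw.gsub(/\D/, '')   # keep digits only: "5551234567"
# Slice the digit string into 3-3-4 groups and rejoin with dashes
phone  = "#{digits[0, 3]}-#{digits[3, 3]}-#{digits[6, 4]}"
# phone => "555-123-4567"
```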

Hint 4

Use hash to track seen emails: seen_emails[email] = true
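For example, with the first occurrence of each email winning (the row data here is made up):

```ruby
rows = [['Alice', 'a@x.com'], ['Bob', 'b@x.com'], ['Al', 'a@x.com']]
seen_emails = {}
unique = rows.reject do |_, email|
  dup = seen_emails[email]    # nil the first time we see this email
  seen_emails[email] = true
  dup
end
# unique => [['Alice', 'a@x.com'], ['Bob', 'b@x.com']]
```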

Hint 5

Validate email with email.include?('@') && email.include?('.')
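Wrapped in a helper for reuse. Note this is the loose check the drill specifies; a production validator would use a regex or a mail library:

```ruby
# Loose validity check: just requires an '@' and a '.' somewhere.
def valid_email?(email)
  email.include?('@') && email.include?('.')
end

valid_email?('user@example.com')  # => true
valid_email?('not-an-email')      # => false
```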

Hint 6

Split company name, capitalize each word, rejoin: .split.map(&:capitalize).join(' ')
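In action, with a made-up company name. Be aware that `capitalize` also downcases the rest of each word, so acronyms like "LLC" would become "Llc":

```ruby
'aCME data SERVICES'.split.map(&:capitalize).join(' ')
# => "Acme Data Services"
```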

Hint 7

Track errors in an array and write the report to a file at the end
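A minimal sketch of that pattern; the message format and the `error_report.txt` file name are illustrative:

```ruby
# Collect error messages while processing rows...
errors = []
errors << 'Line 3: missing Name'
errors << 'Line 5: invalid email'

# ...then write the report in one go, only if anything went wrong.
File.write('error_report.txt', errors.join("\n") + "\n") unless errors.empty?
```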

Test Cases

1. Basic cleaning - whitespace and formatting

Input:
clean_csv('dirty_data.csv')
Expected Output:
Processed 2 rows: 2 cleaned, 0 errors

2. Validation errors - missing required fields

Input:
clean_csv('dirty_data.csv')
Expected Output:
Processed 3 rows: 1 cleaned, 2 errors
Errors found - see error_report.txt

3. Duplicate removal by email

Input:
clean_csv('dirty_data.csv')
Expected Output:
Processed 3 rows: 2 cleaned, 0 errors
1 duplicate removed
+ 2 hidden test cases