Challenge

Problem

A 1TB drive might contain 50-100GB of duplicates from repeated downloads, backup copies, or scattered project versions. This drill teaches you to identify duplicate files by content using MD5 hashing. You'll learn recursive directory traversal, efficient algorithms (grouping by size first), and hash data structures for deduplication, all essential for storage management and backup verification.

Difficulty: Intermediate

Instructions

  1. Recursively scan the specified directory for all files
  2. Group files by size first (optimization - files with different sizes can't be duplicates)
  3. For files with matching sizes, calculate MD5 hash of content
  4. Group files by hash - files with same hash are duplicates
  5. Output each duplicate group in the format (paths indented two spaces, as in the sample outputs):
    'Duplicate set 1 (2 files, 1.5 MB each):'
    '  /path/to/file1.txt'
    '  /path/to/file2.txt'
  6. Sort output by size (largest first), then by filename within groups
  7. Output a summary line: 'Found X duplicate sets, Y total files, Z.ZZ MB wasted' (the sample outputs use the singular 'duplicate set' when X is 1)
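The steps above can be combined into a single pipeline. Below is a minimal sketch in Ruby; the structure and helper logic are one possible approach, not the required implementation, and exact pluralization of the summary line is left to the reader:

```ruby
require 'find'
require 'digest'

# Sketch of the drill's pipeline: scan, group by size, hash, report.
def find_duplicates(dir)
  # Step 1: collect every regular file under dir.
  files = []
  Find.find(dir) { |path| files << path if File.file?(path) }

  # Steps 2-4: bucket by size, then by MD5 hash within each size bucket.
  dup_sets = []
  files.group_by { |path| File.size(path) }.each do |size, paths|
    next if paths.length < 2
    paths.group_by { |p| Digest::MD5.hexdigest(File.binread(p)) }
         .each_value { |group| dup_sets << [size, group.sort] if group.length > 1 }
  end

  # Step 6: largest files first; paths already sorted within each set.
  dup_sets.sort_by! { |size, _| -size }

  # Steps 5 and 7: print each set, then the summary.
  wasted_mb = 0.0
  total_files = 0
  dup_sets.each_with_index do |(size, group), i|
    mb = size / (1024.0 * 1024.0)
    puts format('Duplicate set %d (%d files, %.2f MB each):', i + 1, group.length, mb)
    group.each { |p| puts "  #{p}" }
    puts
    wasted_mb += mb * (group.length - 1)
    total_files += group.length
  end
  puts format('Found %d duplicate sets, %d total files, %.2f MB wasted',
              dup_sets.length, total_files, wasted_mb)
end
```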

Hints

Hint 1

Find.find recursively traverses directories - use File.file? to filter only files
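For example, a small helper built on this hint (the method name is illustrative):

```ruby
require 'find'

# Collect every regular file under a directory, recursing into
# subdirectories; Find.find yields directories too, so filter them out.
def all_files(dir)
  paths = []
  Find.find(dir) do |path|
    paths << path if File.file?(path)
  end
  paths
end
```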

Hint 2

Group by size first: Hash.new { |h, k| h[k] = [] } creates auto-initializing arrays
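A sketch of this grouping (helper name is illustrative):

```ruby
# Bucket file paths by size; the Hash.new block creates an empty
# array the first time each size key is looked up.
def group_by_size(paths)
  by_size = Hash.new { |h, k| h[k] = [] }
  paths.each { |path| by_size[File.size(path)] << path }
  by_size
end

# Only buckets holding two or more files can contain duplicates:
#   candidates = group_by_size(paths).values.select { |g| g.length > 1 }
```

Ruby's `Enumerable#group_by` achieves the same result in one call; the explicit `Hash.new` form makes the auto-initialization pattern visible.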

Hint 3

Digest::MD5.hexdigest(File.read(path)) calculates the file's hash
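Wrapped as a helper (name is illustrative). `File.binread` is used here instead of `File.read` so the raw bytes are hashed without any text-mode translation; for very large files, `Digest::MD5.file(path).hexdigest` streams the content rather than loading it all into memory:

```ruby
require 'digest'

# MD5 of a file's raw bytes.
def content_hash(path)
  Digest::MD5.hexdigest(File.binread(path))
end
```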

Hint 4

Only check hashes for files with matching sizes (huge performance gain)

Hint 5

Wasted space = file_size * (number_of_duplicates - 1)
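For example, three 1 MB copies waste 2 MB. As small helpers (names are illustrative):

```ruby
# All copies beyond the first are redundant storage.
def wasted_bytes(file_size, copy_count)
  file_size * (copy_count - 1)
end

# Convert bytes to MB for the summary line.
def bytes_to_mb(bytes)
  bytes / (1024.0 * 1024.0)
end
```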

Hint 6

Sort duplicate sets by size descending to show biggest space wasters first
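One way to express both sort orders together, assuming each set is held as a `[size_in_bytes, paths]` pair (a representation chosen for this sketch, not mandated by the drill):

```ruby
# Largest size first across sets; paths alphabetical within each set.
def order_sets(sets)
  sets.map { |size, paths| [size, paths.sort] }
      .sort_by { |size, _| -size }
end
```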

Test Cases (Read-only)

1. Simple duplicate set - 2 files

Input:
find_duplicates('files')
Expected Output:
Duplicate set 1 (2 files, 0.00 MB each):
  files/file1.txt
  files/file2.txt

Found 1 duplicate set, 2 total files, 0.00 MB wasted

2. Multiple duplicate sets

Input:
find_duplicates('files')
Expected Output:
Duplicate set 1 (2 files, 0.00 MB each):
  files/a1.txt
  files/a2.txt

Duplicate set 2 (2 files, 0.00 MB each):
  files/b1.txt
  files/b2.txt

Found 2 duplicate sets, 4 total files, 0.00 MB wasted

3. Three copies of same file

Input:
find_duplicates('files')
Expected Output:
Duplicate set 1 (3 files, 0.00 MB each):
  files/copy1.txt
  files/copy2.txt
  files/original.txt

Found 1 duplicate set, 3 total files, 0.00 MB wasted
+ 2 hidden test cases