Challenge

Problem

A 1TB drive might contain 50-100GB of duplicates from repeated downloads, backup copies, or scattered project versions. This drill teaches you to identify duplicate files by content using MD5 hashing. You'll learn recursive directory traversal, efficient algorithms (grouping by size first), and hash data structures for deduplication, all essential for storage management and backup verification.

Difficulty: Intermediate

Instructions

  1. Recursively scan the specified directory for all files
  2. Group files by size first (optimization - files with different sizes can't be duplicates)
  3. For files with matching sizes, calculate MD5 hash of content
  4. Group files by hash - files with same hash are duplicates
  5. Output duplicate groups with format:
    'Duplicate set 1 (2 files, 1.5 MB each):'
    '  /path/to/file1.txt'
    '  /path/to/file2.txt'
  6. Sort output by size (largest first), then by filename within groups
  7. Output summary: 'Found X duplicate sets, Y total files, Z.ZZ MB wasted' (see the sketch below)
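
A minimal end-to-end sketch of these steps, assuming the entry point is named find_duplicates as in the test cases below; all other names are illustrative, and the exact blank-line and pluralization behavior is inferred from the sample output:

  require 'find'
  require 'digest'

  def find_duplicates(dir)
    # Step 1: collect every regular file under dir.
    paths = []
    Find.find(dir) { |p| paths << p if File.file?(p) }

    # Step 2: group by size; only same-size files can match.
    by_size = paths.group_by { |p| File.size(p) }

    # Steps 3-4: hash only within size groups of two or more.
    dup_sets = []
    by_size.each do |size, group|
      next if group.size < 2
      group.group_by { |p| Digest::MD5.file(p).hexdigest }
           .each_value { |files| dup_sets << [size, files.sort] if files.size > 1 }
    end

    # Step 6: largest sets first (filenames already sorted per set).
    dup_sets.sort_by! { |size, _| -size }

    # Steps 5 and 7: print each set, then the summary.
    wasted_mb = 0.0
    total = 0
    dup_sets.each_with_index do |(size, files), i|
      mb = size / (1024.0 * 1024.0)
      puts format('Duplicate set %d (%d files, %.2f MB each):', i + 1, files.size, mb)
      files.each { |f| puts "  #{f}" }
      puts
      wasted_mb += mb * (files.size - 1)
      total += files.size
    end
    plural = dup_sets.size == 1 ? '' : 's'
    puts format('Found %d duplicate set%s, %d total files, %.2f MB wasted',
                dup_sets.size, plural, total, wasted_mb)
  end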

Hints

Hint 1

Find.find recursively traverses directories - use File.file? to filter only files
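
A minimal sketch (the 'files' directory name is just a placeholder):

  require 'find'

  paths = []
  Find.find('files') do |path|
    paths << path if File.file?(path)  # skip directories, keep only regular files
  end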

Hint 2

Group by size first: Hash.new { |h, k| h[k] = [] } creates auto-initializing arrays
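
Both forms below build the same size-to-paths mapping; paths is assumed to be an array of file paths:

  by_size = Hash.new { |h, k| h[k] = [] }       # missing keys start as empty arrays
  paths.each { |p| by_size[File.size(p)] << p }

  # Enumerable#group_by builds the same structure in one call:
  by_size = paths.group_by { |p| File.size(p) }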

Hint 3

Digest::MD5.hexdigest(File.read(path)) calculates the file's hash
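
Note that File.read pulls the whole file into memory; Digest::MD5.file streams it instead. The path below is a placeholder:

  require 'digest'

  path = 'files/document.txt'                     # hypothetical path
  digest = Digest::MD5.hexdigest(File.read(path)) # whole file loaded into memory
  digest = Digest::MD5.file(path).hexdigest       # streams the file; same result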

Hint 4

Only check hashes for files with matching sizes (huge performance gain)
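
Continuing from Hint 2's by_size mapping:

  # Only size groups with two or more members can contain duplicates;
  # every other file is skipped without ever being read.
  candidates = by_size.select { |_size, files| files.size > 1 }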

Hint 5

Wasted space = file_size * (number_of_duplicates - 1)
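
A worked example with assumed numbers:

  file_size = 1_572_864              # 1.5 MB in bytes
  copies = 3                         # three identical files
  wasted = file_size * (copies - 1)  # keep one copy; the rest is waste
  wasted / (1024.0 * 1024.0)         # => 3.0 MB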

Hint 6

Sort duplicate sets by size descending to show biggest space wasters first
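
Assuming dup_sets holds [size, paths] pairs as in the sketch above:

  dup_sets.sort_by! { |size, _paths| -size }    # largest duplicates first
  dup_sets.each { |_size, paths| paths.sort! }  # filenames sorted within each set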

Language: Ruby 3.4

Test Cases

1. Finds duplicate files with same content

Input:
find_duplicates('files')
Expected Output:
Duplicate set 1 (3 files, 0.00 MB each):
  files/backup/document.txt
  files/copy_of_document.txt
  files/document.txt

Found 1 duplicate set, 3 total files, 0.00 MB wasted

2. Unique files are not reported

Input:
find_duplicates('files')
puts '---'
puts 'photo.jpg and unique.pdf should not appear above'
Expected Output:
Duplicate set 1 (3 files, 0.00 MB each):
  files/backup/document.txt
  files/copy_of_document.txt
  files/document.txt

Found 1 duplicate set, 3 total files, 0.00 MB wasted
---
photo.jpg and unique.pdf should not appear above
+ 2 hidden test cases