A 1TB drive might contain 50-100GB of duplicates from repeated downloads, backup copies, or scattered project versions. This drill teaches you to identify duplicate files by content using MD5 hashing. You'll learn recursive directory traversal, efficient algorithms (grouping by size first), and hash data structures for deduplication, essential for storage management and backup verification.
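The key idea is that identity is determined by content, not by name: two files with identical bytes produce the same MD5 digest regardless of filename. A quick illustration (the filenames here are hypothetical):

```ruby
require 'digest'
require 'tmpdir'

Dir.mktmpdir do |dir|
  a = File.join(dir, 'report_copy.txt')   # hypothetical example files
  b = File.join(dir, 'report_final.txt')
  File.write(a, 'same bytes')
  File.write(b, 'same bytes')

  # Identical content => identical digest, despite different names.
  puts Digest::MD5.hexdigest(File.read(a)) == Digest::MD5.hexdigest(File.read(b))  # prints "true"
end
```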
Find.find recursively traverses directories - use File.file? to filter only files
Group by size first: Hash.new { |h, k| h[k] = [] } creates auto-initializing arrays
Digest::MD5.hexdigest(File.read(path)) calculates the file's hash
Only compute hashes for files with matching sizes (huge performance gain)
Wasted space = file_size * (number_of_duplicates - 1)
Sort duplicate sets by size descending to show biggest space wasters first
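Putting the hints together, here is a minimal sketch of `find_duplicates`. The exact output format is an assumption based on the sample runs below; the size-first grouping means most files are never hashed at all.

```ruby
require 'find'
require 'digest'

def find_duplicates(dir)
  # Step 1: group paths by file size -- only same-size files can be identical.
  by_size = Hash.new { |h, k| h[k] = [] }
  Find.find(dir) do |path|
    by_size[File.size(path)] << path if File.file?(path)
  end

  # Step 2: within each size group of 2+ files, hash contents to confirm.
  duplicate_sets = []
  by_size.each do |size, paths|
    next if paths.size < 2
    by_hash = Hash.new { |h, k| h[k] = [] }
    paths.each { |p| by_hash[Digest::MD5.hexdigest(File.read(p))] << p }
    by_hash.each_value { |set| duplicate_sets << [size, set.sort] if set.size > 1 }
  end

  # Step 3: biggest space wasters first.
  duplicate_sets.sort_by! { |size, _set| -size }

  total_files = 0
  wasted = 0
  duplicate_sets.each_with_index do |(size, set), i|
    puts "Duplicate set #{i + 1} (#{set.size} files, #{'%.2f' % (size / 1048576.0)} MB each):"
    set.each { |p| puts "  #{p}" }
    total_files += set.size
    wasted += size * (set.size - 1)   # wasted space keeps one copy
  end
  label = duplicate_sets.size == 1 ? 'set' : 'sets'
  puts "Found #{duplicate_sets.size} duplicate #{label}, #{total_files} total files, " \
       "#{'%.2f' % (wasted / 1048576.0)} MB wasted"
  duplicate_sets
end
```

Reading each file fully into memory is fine for a drill; for very large files you would hash in chunks with `Digest::MD5#update` instead.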
find_duplicates('files')
Duplicate set 1 (2 files, 0.00 MB each):
  files/file1.txt
  files/file2.txt
Found 1 duplicate set, 2 total files, 0.00 MB wasted
find_duplicates('files')
Duplicate set 1 (2 files, 0.00 MB each):
  files/a1.txt
  files/a2.txt
Duplicate set 2 (2 files, 0.00 MB each):
  files/b1.txt
  files/b2.txt
Found 2 duplicate sets, 4 total files, 0.00 MB wasted
find_duplicates('files')
Duplicate set 1 (3 files, 0.00 MB each):
  files/copy1.txt
  files/copy2.txt
  files/original.txt
Found 1 duplicate set, 3 total files, 0.00 MB wasted