A 1TB drive might contain 50-100GB of duplicates from repeated downloads, backup copies, or scattered project versions. This drill teaches you to identify duplicate files by content using MD5 hashing. You'll learn recursive directory traversal, efficient algorithms (grouping by size first), and hash data structures for deduplication, essential for storage management and backup verification.
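The key idea is that identity is determined by content, not by name: two files with identical bytes produce the same MD5 digest regardless of filename. A quick illustration (the filenames here are hypothetical):

```ruby
require 'digest'
require 'tmpdir'

Dir.mktmpdir do |dir|
  a = File.join(dir, 'report_copy.txt')   # hypothetical example files
  b = File.join(dir, 'report_final.txt')
  File.write(a, 'same bytes')
  File.write(b, 'same bytes')

  # Identical content => identical digest, despite different names.
  puts Digest::MD5.hexdigest(File.read(a)) == Digest::MD5.hexdigest(File.read(b))  # prints "true"
end
```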
Find.find recursively traverses directories - use File.file? to filter only files
Group by size first: Hash.new { |h, k| h[k] = [] } creates auto-initializing arrays
Digest::MD5.hexdigest(File.read(path)) calculates the file's hash
Only compute hashes for files with matching sizes (huge performance gain)
Wasted space = file_size * (number_of_duplicates - 1)
Sort duplicate sets by size descending to show biggest space wasters first
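Putting the hints together, here is a minimal sketch of `find_duplicates`. The exact output format is an assumption based on the sample runs below; the size-first grouping means most files are never hashed at all.

```ruby
require 'find'
require 'digest'

def find_duplicates(dir)
  # Step 1: group paths by file size -- only same-size files can be identical.
  by_size = Hash.new { |h, k| h[k] = [] }
  Find.find(dir) do |path|
    by_size[File.size(path)] << path if File.file?(path)
  end

  # Step 2: within each size group of 2+ files, hash contents to confirm.
  duplicate_sets = []
  by_size.each do |size, paths|
    next if paths.size < 2
    by_hash = Hash.new { |h, k| h[k] = [] }
    paths.each { |p| by_hash[Digest::MD5.hexdigest(File.read(p))] << p }
    by_hash.each_value { |set| duplicate_sets << [size, set.sort] if set.size > 1 }
  end

  # Step 3: biggest space wasters first.
  duplicate_sets.sort_by! { |size, _set| -size }

  total_files = 0
  wasted = 0
  duplicate_sets.each_with_index do |(size, set), i|
    puts "Duplicate set #{i + 1} (#{set.size} files, #{'%.2f' % (size / 1048576.0)} MB each):"
    set.each { |p| puts "  #{p}" }
    total_files += set.size
    wasted += size * (set.size - 1)   # wasted space keeps one copy
  end
  label = duplicate_sets.size == 1 ? 'set' : 'sets'
  puts "Found #{duplicate_sets.size} duplicate #{label}, #{total_files} total files, " \
       "#{'%.2f' % (wasted / 1048576.0)} MB wasted"
  duplicate_sets
end
```

Reading each file fully into memory is fine for a drill; for very large files you would hash in chunks with `Digest::MD5#update` instead.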
find_duplicates('files')
Duplicate set 1 (2 files, 0.00 MB each):
  files/file1.txt
  files/file2.txt
Found 1 duplicate set, 2 total files, 0.00 MB wasted
find_duplicates('files')
Duplicate set 1 (2 files, 0.00 MB each):
  files/a1.txt
  files/a2.txt
Duplicate set 2 (2 files, 0.00 MB each):
  files/b1.txt
  files/b2.txt
Found 2 duplicate sets, 4 total files, 0.00 MB wasted
find_duplicates('files')
Duplicate set 1 (3 files, 0.00 MB each):
  files/copy1.txt
  files/copy2.txt
  files/original.txt
Found 1 duplicate set, 3 total files, 0.00 MB wasted