dedup challenge

interesting problem…?


i have a hard drive that i know has duplicate files. you know, when you make a backup of photos from laptop to desktop PC, then you get a big disk and back up /both/ the laptop and desktop, and you end up with two copies of the photos on the big disk.


challenge: find the duplicate files


i started by running a find where my backups are stored - here’s my recipe to get all the filenames and details into q:




$ sudo find /media/jack/ -exec ls -ld --full-time {} \; >list


q)flip (9#"S";" ")0:`:list

drwxr-x---+ 7  root   root  4096  2013-03-29 15:29:43.876494607 +1100 /media/jack/
drwxr-xr-x  33 root   root  4096  2012-11-09 21:07:37.364233083 +1100 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0
drwxr-xr-x  16 root   root  4096  2011-07-03 17:17:55.750995002 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var
drwxrwxrwt  2  root   root  6     2010-04-23 20:23:47.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/lock
drwxr-xr-x  7  root   root  120   2010-08-16 20:10:54.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run
drwxr-xr-x  2  root   root  6     2010-08-16 20:07:14.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/dbus
drwxr-xr-x  2  saned  root  6     2010-08-16 20:10:54.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/hplip
drwxr-xr-x  3  root   root  21    2010-08-16 20:10:05.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/samba
drwxr-xr-x  2  root   root  21    2010-08-16 20:10:05.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/samba/upgrades
-rw-r--r--  1  root   root  12416 2010-08-16 20:10:05.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/samba/upgrades/smb.conf
drwxrwxr-x  2  root   utmp  6     2010-08-16 20:08:56.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/screen
drwxr-xr-x  2  usbmux audio 6     2010-08-16 20:09:27.000000000 +1000 /media/jack/831b08d9-942d-41ef-acd6-5ad675e233c0/var/run/speech-dispatcher
..
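
for anyone who wants to poke at the same kind of listing outside q, here is a rough Python sketch of one possible next step: split each line into its nine fields and keep only the sizes shared by more than one regular file. the function names are made up for illustration, and it assumes no filename contains a newline and that every line has the `ls -ld --full-time` layout shown above. equal size only makes two files candidates, not confirmed duplicates.

```python
from collections import defaultdict


def parse_listing_line(line):
    """Split one `ls -ld --full-time` line into a small record, or None."""
    # maxsplit=8 keeps a path containing spaces together in the last field.
    parts = line.rstrip("\n").split(None, 8)
    if len(parts) < 9:
        return None
    perms, _links, _owner, _group, size, date, time, tz, path = parts
    if not size.isdigit():
        return None  # e.g. device nodes, whose size column is "major, minor"
    return {"perms": perms, "size": int(size),
            "mtime": f"{date} {time} {tz}", "path": path}


def same_size_candidates(lines):
    """Map size -> paths for non-empty regular files that share a size."""
    by_size = defaultdict(list)
    for line in lines:
        rec = parse_listing_line(line)
        # A leading "-" in the permission string marks a regular file.
        if rec and rec["perms"].startswith("-") and rec["size"] > 0:
            by_size[rec["size"]].append(rec["path"])
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

usage would be something like `same_size_candidates(open("list"))`; the surviving groups still need a content comparison (checksum or byte-by-byte) before anything is deleted.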

ta, jack

Check out ‘fdupes’ or the one-liner described here: http://ajayfromiiit.wordpress.com/2009/10/16/one-liner-to-find-and-remove-duplicate-files-in-linux/
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

This might give you hints on the approach.
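
If the pipeline is hard to read, the same two-stage idea (group files by size first, then checksum only the files whose size is shared) can be sketched in Python. This is an illustration under my own assumptions, not what the linked post or fdupes actually does: the function names are made up, MD5 is used only to mirror `md5sum`, and symlinks, empty files and unreadable files are skipped.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root that share both size and MD5."""
    # Stage 1: group non-empty regular files by size (cheap, no file reads).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # vanished or unreadable: skip it
            if size > 0:
                by_size[size].append(path)

    # Stage 2: checksum only files whose size is shared with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_digest[(size, file_digest(path))].append(path)
            except OSError:
                continue
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates("/media/jack")` would return one list per group of identical-looking files. A matching checksum is strong evidence rather than proof, so a byte-by-byte compare before deleting anything is a sensible extra step.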

Br,
B