
BLU Discuss list archive



How to remove duplicate files [SOLVED]



How to remove duplicate files. Or: "how to clean up after your
photo-collecting self after the Holidays"

Digital photography is great -- until the day you realize that you
have over 5,428 photos and have stashed bits and pieces of your
collection here or there (with multiple redundant copies).

Bottom line: get fslint to find and eliminate duplicate files on your system.

The techniques and solutions I used are listed in my blog post.
Comments and improvements please.

http://freephile.com/cms/content/remove-duplicate-files

### Full text reproduced below ###

Remove duplicate files

Mon, 2009/01/05 - 10:04pm, by greg

Or, how to clean up after your photo-collecting self. Digital
photography is great -- until the day you realize that you have over
5,428 photos and have stashed bits and pieces of your collection here
or there (with multiple redundant copies).

Bottom line: get fslint to find and eliminate duplicate files on your system.

Like most people, I've collected a large number of photos and other
images over the years -- all digital since about 2001, and stored in a
variety of applications to make displaying them that much easier.

I've stored them in folders, sometimes taking the time to organize
them and sometimes not -- leaving just a pile of picture files. I've
written Perl, JavaScript, and PHP scripts to organize and display my
images. As I increased my skill at making my computer bend to my
wishes, I found that other people had written better programs to
organize photos, so I used Gallery. Finally, I came to realize that
digiKam is the coolest thing since digital photography. digiKam can
export to Gallery for sharing photos online, with the added benefit of
completely organizing them on your local machine. digiKam can export
to Flickr or Picasa if you don't need the privacy and control of your
own photo website. digiKam does so much more. digiKam is what I use to
manage all my photos now.

The problem is that I want to put all my images into digiKam. Although
digiKam has a nice import function, it only detects file-name
duplicates in the album you're importing into, rather than detecting
duplicates across your entire image collection. So, as I am in the
middle of re-organizing my photo collection, I have the tedious chore
of comparing the locations where I originally stored images against
various other locations, and of figuring out whether I might have
already imported them into digiKam.
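
As a quick, name-only first pass (a minimal sketch; the img/ path is
just an example), repeated base filenames across the whole tree can be
spotted with standard GNU find and coreutils:

# Print base filenames that occur more than once anywhere under img/.
# This is a name-only check; it says nothing about whether the contents match.
find img/ -type f -printf '%f\n' | sort | uniq -d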

Was that Christmas photo from 2002 saved in a folder labeled with the
date the picture was taken, or the date it was uploaded to the system?
Did you accidentally experiment with a file-renaming scheme, or keep
the original names from the camera? Did you convert the image from jpg
to png? If you delete this photo, are you deleting minor but important
modifications to the original, like red-eye reduction? These kinds of
questions make re-organizing your photos far from trivial, and
ultimately a manually intensive effort no matter how many excellent
tools you have.

After some really good finger exercise using rsync

# Option notes:
#   -n -v -i -a -r -z        dry-run, verbose, itemize, archive, recursive, compress
#   --ignore-existing        use this option to quickly see which files are
#                            completely outside your target
#   --progress --stats --checksum
#                            use these options to do a thorough analysis of
#                            which files might have changed
#   --exclude .directory     add excludes as needed to ignore file-system
#                            cruft, thumbnails, etc.
#   --compare-dest ../2002/  use this option when files may be organized into
#                            more than one target at the boundary of a year
rsync -nviarz \
  --ignore-existing \
  --progress --stats --checksum \
  --exclude .directory \
  --compare-dest ../2002/ \
  img/photos/2003_01_21/ img/library/albums/family/2003/
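
(The -n flag makes this a dry run, so nothing is actually copied;
once the itemized output looks right, re-running without -n would
actually transfer the missing files.)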

and bash

#!/bin/bash
# For each top-level file in SUSPECTS matching FILTER, report where a file
# with the same name already exists under img/library. A suspect that
# prints nothing has no name match in the library.

SUSPECTS=/home/greg/img/photos/
FILTER='IMG*'

find "$SUSPECTS" -maxdepth 1 -type f -name "$FILTER" -printf '%f\n' |
while read -r file; do
  find img/library -name "$file"
done

#!/bin/bash
# Compare a source album (A) against the digiKam target (B) and report which
# JPGs still need to be copied. Run it from the directory that contains img/.

# A=img/photos/2006-10/
# A=img/photos/2006-08/
A=img/photos/
B=img/library/albums/family/2006/   # note the trailing slash

BASE=$(pwd)

SOURCEDIR=$BASE/$A
TARGETDIR=$BASE/$B

pushd "$A" || exit 1

shopt -s nullglob   # skip the loop entirely if there are no .JPG files

for FILE in *.JPG; do
  # set a couple of flags we can use to determine if the file has been processed
  FOUNDA=false
  FOUNDB=false
  # echo "checking for $FILE in $TARGETDIR"

  if [ -f "$TARGETDIR$FILE" ]; then
    echo "the original $FILE exists in target"
    FOUNDA=true
  fi
  # the same image may have been imported after conversion to png
  COUSIN="${FILE%.JPG}.png"
  # echo "considering $COUSIN too"
  if [ -f "$TARGETDIR$COUSIN" ]; then
    echo "the original $FILE has already been converted to png"
    FOUNDB=true
  fi

  if [[ $FOUNDA == false && $FOUNDB == false ]]; then
    echo "$FILE not found: please copy"
  fi
done

popd

I started looking for scripts that other people might have used to
find duplicate files, and that's when I found fslint. I should have
checked earlier.
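
For the curious, here is a minimal sketch of what content-based
duplicate detection boils down to (the img/ path is just an example;
fslint's findup script, typically installed under
/usr/share/fslint/fslint/ on Debian-based systems, does this more
robustly and adds safe clean-up options):

# Checksum every file, then print only groups of files whose MD5 sums match.
# uniq -w32 compares just the 32-character hash at the start of each line;
# --all-repeated=separate prints each duplicate group separated by a blank line.
find img/ -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate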

Still, the need to do visual comparisons with tools such as GQview,
and to check and correct EXIF metadata in digiKam, means this is not
a task that is easily solved with the push of a button. fslint just
gives you a big time saver over writing and tweaking your own scripts.

Note: digiKam has a built-in tool to find duplicate images in its
database, but it is resource intensive because it is content-based
(it builds and compares digital fingerprints of your files). In my
testing, it produced false positives. Even if it worked flawlessly, it
would be wasted effort to import thousands of potential duplicates
just to find and eliminate them afterwards.

-- 
Greg Rundlett
Web Developer - Initiative in Innovative Computing
http://iic.harvard.edu
m. 978-764-4424
o. 978-225-8302
skype/aim/irc/twitter freephile
http://profiles.aim.com/freephile






