hi,
I’m working with a huge number of files and need to check them for duplicates. So I thought of keeping a delimited text file covering the whole bunch of files, for another script to read and find any matches…
The ‘database’ will be a list of the md5 signatures of all the files. I want to grab a file and compare its md5 signature with every signature in the ‘database’, so that the script can report whether the file is a duplicate.
on run
try
tell application "Finder" to set inputFile to (choose file) as alias
set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
set database to ((((path to desktop folder) as text) & "signature log.txt"))
set countDB to number of items in (read file database using delimiter "*")
set c to 0
repeat with x from 1 to countDB
set k to (item x of (read file database using delimiter "*"))
if (k = signature) then
set c to 1
end if
display dialog (k & return & signature)
end repeat
if (c = 1) then
display dialog "Duplicate found!!!"
else
display dialog "Duplicate NOT found"
end if
on error err
display dialog err
end try
end run
Note: the list is a known md5 list, made with another script of mine, and the test file IS in that list.
The script doesn’t find the duplicate… I’ve even set everything to display dialogs and seen the two identical strings… but it still doesn’t find the duplicate.
I don’t know if you have written your database file as shown, with each item on a new line or not, but that is what messed it up for me. If I save the file like this:
set aa to "j48dn39dng7wn29dxk73jn8djdifnew9"
set a to choose file
set b to open for access a
set c to read b using delimiter "*"
close access b
c contains aa
-->true
When I wrote the file as you had listed in your post, the same code always evaluated to false, because I believe the newline characters were being evaluated as part of each string.
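To illustrate the newline point: when a log is split on "*" but each entry was also followed by a line break, every item after the first carries that line break with it, so an exact comparison fails. Below is a minimal shell sketch, assuming a hypothetical log layout of `signature*` followed by a line ending. The helper names `normalize_signatures` and `signature_in_log` are mine, not from the thread. It turns such a log into one clean signature per line before comparing:

```shell
#!/bin/sh
# normalize_signatures FILE
# Print one signature per line, whatever mix of "*", CR and LF
# separated them in FILE (assumed layout, see the lead-in).
normalize_signatures() {
    # Turn "*" and carriage returns into newlines, then drop blank lines.
    tr '*\r' '\n\n' < "$1" | grep -v '^[[:space:]]*$'
}

# signature_in_log SIGNATURE FILE
# Succeeds only if SIGNATURE equals a whole entry of the normalized log.
signature_in_log() {
    normalize_signatures "$2" | grep -Fxq -- "$1"
}
```

Called as `signature_in_log "$sig" "signature log.txt"`, this returns success only on an exact match, so a stray carriage return or newline glued to an entry no longer hides a duplicate.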
No, I didn’t store it like that… that’s a handwritten thing just to illustrate… because when I tried to copy and paste, the whole thing messed up my post, deleting it.
I believe the problem could be in one of two places…
The first is that somehow, when the signature is written to the log and then read back and stored in a variable by the script, it changes somewhere… This occurs to me because when I tried to paste a part of that so-called ‘database’ into a post, it deleted the whole post from there to the end… that’s odd!
The second is that I’m storing the ‘database’ in some form that I shouldn’t, and that’s why I can’t read it properly.
Anyway, I did some testing: if I store the md5 signature of the file I want to check, then read that signature back from the txt into a variable and compare it with the signatures in the ‘database’, it works…
So the mess is somewhere in the writing or reading of the file…
Any ideas?
tell application "Finder" to set inputFile to (choose file) as alias
set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
set database to ((((path to desktop folder) as text) & "signature log.txt"))
set hash_log to paragraphs of (read file database)
if hash_log contains signature then display dialog "Duplicate Found"
And, just to join the fray - I wrote my file with this:
tell application "Finder" to set inputFolder to files of entire contents of (choose folder) as alias list
--this will store the md5 signature of every file in 'signature'
set signature to {}
repeat with inputFile in inputFolder
set signature's end to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
end repeat
set sigs to ""
repeat with S in signature
set sigs to sigs & S & return
end repeat
set sl to open for access ((path to desktop folder as text) & "Signature.log") with write permission
set eof of sl to 0
write sigs to sl
close access sl
and checked it with this (successfully finding the duplicate):
tell application "Finder" to set inputFile to (choose file) as alias
set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
set DB to ((((path to desktop folder) as text) & "signature.log")) as alias
set tData to paragraphs of (read DB)
set countDB to count tData
set c to 0
repeat with k from 1 to countDB
if item k of tData = signature then set c to c + 1
end repeat
display dialog "There were " & c & " duplicate(s)"
Well…
Clearly the problem was in my ‘database’ maker script, because I tried every script posted here, with the proper modification for the way it wrote the signatures, and nothing happened.
Until Mr. Adams came along and I tried his two scripts.
Nothing more to say but thanks a lot.
Just curious, why didn’t these two work?
‘Database’ maker (this one uses the Progress Bar by Bruce Phillips):
on run
try
set Progress to load script alias (((path to scripts folder) as text) & "Progress.scpt")
tell application "Finder" to set inputFolder to files of entire contents of (choose folder) as alias list
tell Progress
initialize()
changeIcon to POSIX path of ("Marto:Users:Marto:Desktop:AppleScript:Process imagebank:addons:Resources:m.jpg") --icon
setTitle to "MD5 DataBase"
barberPole(false)
setMax to number of items in inputFolder
end tell
set c to 1
repeat with inputFile in inputFolder
tell Progress
setStatusTop to ("Obtaining MD5 of file: " & (name of (info for inputFile)))
setStatusBottom to ("Files left: " & (number of items in inputFolder) - c)
end tell
set signature to last word of (do shell script "md5 " & quoted form of POSIX path of inputFile)
my writeSignatures(signature)
set c to (c + 1)
tell Progress to increase by 1
end repeat
tell Progress to quit
on error err
display dialog err
end try
end run
on writeSignatures(signature)
set the sigLog to ((path to desktop folder as text) & "signature log.txt")
try
open for access file the sigLog with write permission
write (signature & return) to file the sigLog starting at eof
close access file the sigLog
on error
try
close access file the sigLog
end try
end try
end writeSignatures
and the duplicate finder:
on run
try
tell application "Finder" to set inputFile to (choose file) as alias
set signature to (do shell script "md5 -q " & quoted form of POSIX path of inputFile)
set database to (((path to desktop folder) as text) & "signature log.txt")
set hashLog to paragraphs of (read file database)
if (hashLog contains signature) then
display dialog "Duplicate Found"
else
display dialog "Duplicate NOT Found"
end if
on error err
display dialog err
end try
end run
If you’re already using a shell, then why not save yourself some work?
choose folder with prompt "Generate list of MD5 checksums for these files:"
set sourceFolder to POSIX path of result
do shell script "cd " & quoted form of sourceFolder & "; /sbin/md5 -q * > " & quoted form of POSIX path of ((path to desktop as Unicode text) & "Signature Log.txt")
choose file with prompt "Check signature log for duplicate MD5 checksum of this file:" without invisibles
try
do shell script "/sbin/md5 -q " & quoted form of POSIX path of result & " | /usr/bin/grep -f - " & quoted form of POSIX path of ((path to desktop as Unicode text) & "Signature Log.txt")
display dialog "Duplicate found"
on error
display dialog "Duplicate not found"
end try
I really like the -f option. I had read about it before but didn’t think of it here. Very neat. And I could have saved myself a “last word of…” if I’d known about -q (or reread man md5). Thanks for both, Bruce.
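One detail worth hedging about the grep approach: by default grep treats the signature as a pattern and accepts a match anywhere in a line. A sketch of a stricter check follows, assuming the log holds one signature per newline-terminated line (as the `md5 -q * >` command produces). `md5_of` and `check_duplicate` are names I made up for illustration, and the `md5sum` fallback is only there so the sketch also runs on systems without `md5`:

```shell
#!/bin/sh
# md5_of FILE: print only the hex digest of FILE.
md5_of() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                      # macOS / BSD
    else
        md5sum "$1" | cut -d ' ' -f 1    # fallback, e.g. GNU coreutils
    fi
}

# check_duplicate FILE LOG
# Succeeds if FILE's digest is exactly one of the lines in LOG.
# -F: fixed string, -x: whole line must match, -q: no output.
check_duplicate() {
    grep -Fxq -- "$(md5_of "$1")" "$2"
}
```

From AppleScript this could still be driven through `do shell script` inside a `try` block, since a non-zero exit status from grep raises an error just as in Bruce's version. A log written from AppleScript with `return` as the separator uses carriage returns, which grep does not treat as line breaks, so that kind of log would need converting first.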
This sticks them together:
-- get info
set ckFile to choose file with prompt "Check signature log for duplicate MD5 checksum of this file:" without invisibles
set tFiles to choose folder with prompt "Generate list of MD5 checksums for these files:" without invisibles
set whereFile to choose folder with prompt "Choose the container of the signature file:"
set nameFile to text returned of (display dialog "Please name the output file:" default answer "")
set sigFilePath to (whereFile as text) & nameFile
-- do it
getMDs(tFiles, sigFilePath)
checkDuplicates(ckFile, sigFilePath)
-- handlers
to checkDuplicates(aFile, sigFilePath)
try
do shell script "/sbin/md5 -q " & quoted form of POSIX path of aFile & " | /usr/bin/grep -f - " & quoted form of POSIX path of sigFilePath
display dialog "Duplicate found"
on error
display dialog "Duplicate not found"
end try
end checkDuplicates
to getMDs(aFolder, sigFilePath)
do shell script "cd " & quoted form of POSIX path of aFolder & "; /sbin/md5 -q * > " & quoted form of POSIX path of sigFilePath
end getMDs
set my_head to POSIX path of this user's neck
set sharpshooters to "AB, JN & BP"
set bullet to item posts of sharpshooters
do fire_at_will(sharpshooters, bullet, my_head)
tell "Surgeon" to reconstruct
--> ouch!
Well, yes… it’s a lot faster, but it isn’t recursive: it doesn’t descend into the subdirectories of the chosen folder…
If there is a way to search for every folder named X in the chosen folder and repeat the shell script for each one, then it would fit my task (because every file I have ultimately sits in a folder named ‘pwg_high’).
But… the command overwrites everything in ‘signature.txt’, so I would have to store the result in a variable and then write it at eof… or is there a way to write it at eof directly from the shell?
I’ll get back to you in a few moments with the proper syntax for finding a folder with X in it, but in the meantime let’s take care of your appending issue with the sig file (>> appends to the file, whereas > overwrites it)… Change the line to read like this:
do shell script "cd " & quoted form of POSIX path of aFolder & "; /sbin/md5 -q * >> " & quoted form of POSIX path of sigFilePath
And, as promised, here is the code to generate the md5s into the signature file (appending, not overwriting) for all files whose names match *.jpg (change the variable to whatever works for you). It starts at the chosen folder as before, but digs through all the folders inside that one as well.
--Variables--
set fileName to "*.jpg"
--/Variables--
set tFiles to choose folder with prompt "Generate list of MD5 checksums for these files:" without invisibles
set whereFile to choose folder with prompt "Choose the container of the signature file:"
set nameFile to text returned of (display dialog "Please name the output file:" default answer "")
set sigFilePath to (whereFile as text) & nameFile
getMDs(tFiles, sigFilePath, fileName)
to getMDs(aFolder, sigFilePath, fileName)
do shell script "find " & quoted form of POSIX path of aFolder & " -name " & quoted form of fileName & " -exec md5 -q {} >> " & quoted form of POSIX path of sigFilePath & " \\;"
end getMDs
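The handler above selects by file name. For the other half of the question (every folder called ‘pwg_high’ somewhere under the chosen folder), here is a rough sketch of the same idea keyed on the folder name instead. `hash_named_folders` and `md5_of` are hypothetical helpers of mine; the sketch assumes paths contain no newline characters, and the `*` glob skips hidden files:

```shell
#!/bin/sh
# md5_of FILE: print only the hex digest of FILE.
md5_of() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                      # macOS / BSD
    else
        md5sum "$1" | cut -d ' ' -f 1    # fallback, e.g. GNU coreutils
    fi
}

# hash_named_folders ROOT NAME LOG
# For every directory called NAME anywhere under ROOT, append the
# digest of each regular file directly inside it to LOG.
hash_named_folders() {
    find "$1" -type d -name "$2" | while IFS= read -r dir; do
        for f in "$dir"/*; do
            if [ -f "$f" ]; then
                md5_of "$f" >> "$3"      # >> appends, never truncates
            fi
        done
    done
}
```

Something like `hash_named_folders "$chosenFolder" pwg_high "$sigFile"` would then be the shell side of a `do shell script` call. Running it twice appends the same signatures twice, so the log should be emptied first when a fresh list is wanted.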
Back in our first thread on this problem, I think I pointed out that this was a tough problem because of the number of files involved. Like most problems of this sort, a solution evolves through discussion and suggestion. Bruce’s solution is not recursive. ls -Rf is recursive and doesn’t sort the files, but it gives only their names, not their paths. What we need is a shell way to list the path to every file in a directory recursively, and there probably is one. We want to keep that in order (unsorted) because eventually you’ll need the path to a duplicate so you can remove it; simply knowing that two md5 signatures match doesn’t eliminate or identify the duplicate. Doing that in the shell is beyond my capabilities, but it’s clear from Bruce’s example that this is a case where the shell is much faster than AppleScript.
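For what it’s worth, `find` does list full paths recursively, so the path can be kept next to each signature. Here is a sketch under some assumptions: the helper names (`md5_of`, `list_hashes`, `report_duplicates`) are invented for illustration, and paths are assumed to contain no tab or newline characters. It prints, for each file whose signature has already been seen, its path together with the path of the first file that had the same signature:

```shell
#!/bin/sh
# md5_of FILE: print only the hex digest of FILE.
md5_of() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                      # macOS / BSD
    else
        md5sum "$1" | cut -d ' ' -f 1    # fallback, e.g. GNU coreutils
    fi
}

# list_hashes DIR
# Print "digest<TAB>path" for every regular file under DIR, in the
# order find reports them (not sorted).
list_hashes() {
    find "$1" -type f | while IFS= read -r f; do
        printf '%s\t%s\n' "$(md5_of "$f")" "$f"
    done
}

# report_duplicates
# Read "digest<TAB>path" lines on stdin; for every digest seen before,
# print "duplicate path<TAB>first path with that digest".
report_duplicates() {
    awk -F '\t' '
        $1 in first { print $2 "\t" first[$1]; next }
        { first[$1] = $2 }
    '
}
```

Used as `list_hashes "$folder" | report_duplicates`, this only reports candidates. Which copy to remove is left to the caller, and nothing is deleted.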
Another way to go would be to use spotlight data: mdfind -onlyin (chosen directory) for the file type because it does return a path. What is common about the images you’re looking at?