Flagging files of identical size but different names in a folder

I have some folders with thousands of images, a majority of which have their original digicam names (“DSCxxxx.jpg”), and a sizeable minority of which have more descriptive names (“Family2010.jpg”). In some cases, the latter are simply renamed versions of the original camera names. In such cases, both files will have identical physical sizes, but different names.

I’d like a script to go through the selected folder, with its files sorted by size, flagging (by setting color of label in Finder) any whose size is identical to that of the preceding one in the sorted list. I can then look through the folder and view these flagged files to see if they are in fact identical to their immediate predecessor.

I’ve prepared a script that uses the Finder sort, and builds two lists, one of filenames and one of physical sizes. It then repeats a comparison of the size of file i with that of file i+1, setting the color of the second file’s label if the sizes are identical.

One problem is that this script is extremely slow on a folder with so many images, in every statement from making the master list, sorting the list, building lists of names and sizes from the sorted list, and comparing each sequential set of sizes. Even timeouts of 1000 seconds were insufficient for some steps on a subset of only 2500 files.

Perhaps AppleScript is simply underpowered for this project, or perhaps some optimizations (e.g., a Quicksort for the sort rather than a Finder sort, a faster way of building lists than using “every file of folder theFolder”, etc.) might do the trick. I suspect a shell script might be better, but I am unfamiliar with doing these sorts of operations in that environment.

Any suggestions?

Model: Mac Mini
AppleScript: 2.1.2
Operating System: Mac OS X (10.6)

Could you post your code so we can see what kind of optimizations could be done? Maybe you don’t have to tell the Finder to sort the list, for example. It all depends on what your script looks like.

Operating System: Mac OS X (10.6)

Here’s what I have so far. Seems to work on folders with few files.


set theFolder to choose folder

set nameList to {}
set sizeList to {}

tell application "Finder"
	with timeout of 2000 seconds		
		set theFiles to every file of folder theFolder
		set theFiles to (sort theFiles by physical size)
	end timeout
	
	with timeout of 2000 seconds
		repeat with thisFile in theFiles
			set end of nameList to name of thisFile
			set end of sizeList to size of thisFile
		end repeat
	end timeout
	
	repeat with i from 1 to ((count of nameList) - 1)
		set firstFile to (item i of nameList)
		set secondFile to (item (i + 1) of nameList)
		if item i of sizeList is equal to item (i + 1) of sizeList then
			tell application "Finder"
				set label index of file secondFile of folder theFolder to 7
			end tell
		end if
	end repeat
end tell

Ok, now I see. The first thing I would do is remove the enclosing Finder tell block and address the Finder only where it’s necessary. Most of what you are doing is standard AppleScript, so there is no need to ask the Finder to execute it, and that is probably why the script is so slow. For example, you had a tell Finder block inside another tell Finder block. I would also build a single database (a list of records) of the files and work with that instead of separate lists. Something like this:

set theFolder to choose folder

set theFilesDatabase to {}
set theFoundFiles to ""
set foundFilesTotal to 0

set startTime to current date
with timeout of 2000 seconds
	--Get the files of the selected folder already sorted in one step. They will be sorted from smallest to biggest.
	tell application "Finder" to set theFilesSorted to (sort (every file of folder theFolder) by physical size)
	
	--Cycle through the files and create a database that will be used to compare sizes later.
	repeat with thisFile in theFilesSorted
		set theFilesDatabase to theFilesDatabase & {{|fileName|:name of thisFile, |fileSize|:size of thisFile, |filePath|:(thisFile as string)}} as list
	end repeat
	
	--Check whether there are any items in the database; do nothing if it's empty.
	if (count of items in theFilesDatabase) > 0 then
		--Cycle through the database and compare the file size of each item to the next one; if they are equal, change the label to another color.
		repeat with currentItem from 1 to ((count of items in theFilesDatabase) - 1)
			if (|fileSize| of item currentItem of theFilesDatabase) = (|fileSize| of item (currentItem + 1) of theFilesDatabase) then
				--the file sizes are identical, change the label color of the second file
				tell application "Finder" to set label index of file (|filePath| of item (currentItem + 1) of theFilesDatabase) to 7
				set theFoundFiles to theFoundFiles & ((|fileName| of item (currentItem + 1) of theFilesDatabase) & return) as text
				set foundFilesTotal to foundFilesTotal + 1
			end if
		end repeat
		if foundFilesTotal = 0 then
			display dialog "No files found." buttons {"OK"} default button 1
		else if foundFilesTotal > 0 then
			set totalTime to ((current date) - startTime) as integer
			display dialog (foundFilesTotal & " file(s) have been found." & return & return & "Took " & totalTime & " second(s) to complete.") as string buttons {"OK"} default button 1
		end if
	else
		display dialog "No files found in the chosen folder." buttons {"OK"} default button 1
	end if
end timeout

The timeout block only needs to be set once for the entire script. I also added a counter and a list of file names, which you could choose to display if you add the code. And you can tell an application to do more than one thing at once, so this:

tell application "Finder"
   set theFiles to every file of folder theFolder
   set theFiles to (sort theFiles by physical size)
end

Can become this:

tell application "Finder" to set theFilesSorted to (sort (every file of folder theFolder) by physical size)

I don’t know if it’s faster, but it seems more efficient and saves 3 lines of code. :wink:

It should be faster, but remember that AppleScript is not a fast beast, so this is supposed to take time. What I tried to do is remove bottlenecks as much as possible. Try this and let me know how much time it takes. :slight_smile:

Model: MacBookPro8,2
Operating System: Mac OS X (10.7)

Hi.

It’s not so much AppleScript that’s slow, but the Finder itself and the communications between it and the script. The trick is to have the script issue as few commands to the Finder as possible and to make those commands as effective as possible.

The script below issues one command to the Finder to return the names and sizes of all the files in the folder, which the Finder can do quite quickly. (A shell script is probably even quicker.) The script itself, using vanilla AppleScript, sorts the size and name lists in parallel and searches for runs of equal sizes. Only with runs of two or more is the Finder then scripted at the level of individual files to set their label indices.

-- CustomQsort by Arthur Knapp and Nigel Garvey.
on CustomQsort(theList, l, r, customiser)
	script o
		property cutoff : 10
		property comparer : me
		property slave : me
		
		property lst : theList
		
		on qsrt(l, r)
			set pivot to my lst's item ((l + r) div 2)
			
			set i to l
			set j to r
			repeat until (i > j)
				set u to my lst's item i
				repeat while (comparer's isGreater(pivot, u))
					set i to i + 1
					set u to my lst's item i
				end repeat
				
				set w to my lst's item j
				repeat while (comparer's isGreater(w, pivot))
					set j to j - 1
					set w to my lst's item j
				end repeat
				
				if (i > j) then
				else
					set my lst's item i to w
					set my lst's item j to u
					
					slave's swap(i, j)
					
					set i to i + 1
					set j to j - 1
				end if
			end repeat
			
			if (j - l < cutoff) then
			else
				qsrt(l, j)
			end if
			if (r - i < cutoff) then
			else
				qsrt(i, r)
			end if
		end qsrt
		
		on isrt(l, r)
			set u to my lst's item l
			repeat with j from (l + 1) to r
				set v to my lst's item j
				if (comparer's isGreater(u, v)) then
					set here to l
					set my lst's item j to u
					repeat with i from (j - 2) to l by -1
						tell my lst's item i
							if (comparer's isGreater(it, v)) then
								set my lst's item (i + 1) to it
							else
								set here to i + 1
								exit repeat
							end if
						end tell
					end repeat
					set my lst's item here to v
					
					slave's shift(here, j)
				else
					set u to v
				end if
			end repeat
		end isrt
		
		on isGreater(a, b)
			(a > b)
		end isGreater
		
		on swap(a, b)
		end swap
		
		on shift(a, b)
		end shift
	end script
	
	set listLen to (count theList)
	if (listLen > 1) then
		if (l < 0) then set l to listLen + l + 1
		if (r < 0) then set r to listLen + r + 1
		if (l > r) then set {l, r} to {r, l}
		
		if (customiser's class is script) then
			try
				customiser's isGreater
				set o's comparer to customiser
			end try
			try
				customiser's swap
				customiser's shift
				set o's slave to customiser
			end try
		else if (customiser's class is record) then
			set {comparer:o's comparer, slave:o's slave} to customiser & {comparer:o, slave:o}
		end if
		
		if (r - l + 1 > o's cutoff) then o's qsrt(l, r)
		if (r > l) then o's isrt(l, r)
	end if
	
	return
end CustomQsort

on main()
	-- Script object with "slave" code for CustomQsort. Sort another list in parallel with the one being sorted.
	script parallel
		property lst : missing value
		
		on swap(a, b)
			tell item a of my lst
				set item a of my lst to item b of my lst
				set item b of my lst to it
			end tell
		end swap
		
		on shift(a, b)
			tell item b of my lst
				repeat with i from b - 1 to a by -1
					set item (i + 1) of my lst to item i of my lst
				end repeat
				set item a of my lst to it
			end tell
		end shift
	end script
	
	-- Another script object for speed of access to these lists' items.
	script o
		property nameList : missing value
		property sizeList : missing value
	end script
	
	tell application "Finder"
		activate
		tell front Finder window
			set current view to list view
			set sort column of its list view options to size column
			set theFolder to its target as text
		end tell
		-- Get all the names and sizes with one command.
		set {o's nameList, o's sizeList} to {name, size} of files of folder theFolder
	end tell
	
	-- Sort the size list and the name list in parallel with it.
	set parallel's lst to o's nameList
	CustomQsort(o's sizeList, 1, -1, {slave:parallel})
	
	-- Search for runs of equal values in the size list.
	set i to 1
	set currentSize to beginning of o's sizeList
	repeat with j from 2 to (count o's sizeList)
		set thisSize to item j of o's sizeList
		if (thisSize is not currentSize) then
			-- The current run has ended and a new one started. If there's more than one value in the completed run, get the equivalent names from the name list and set the label index of the files with those names.
			if (j - i > 1) then setLabels(o, i, j, theFolder)
			-- Reset the run-start markers for the new run.
			set i to j
			set currentSize to thisSize
		end if
	end repeat
	-- Deal with any run in progress at the end of the repeat. After the loop, j still holds the index of the last item, so this run is items i thru j.
	if (j > i) then setLabels(o, i, j + 1, theFolder)
end main

on setLabels(o, i, j, theFolder)
	repeat with k from i to j - 1
		set thisFileName to item k of o's nameList
		tell application "Finder" to set label index of file (theFolder & thisFileName) to 1
	end repeat
end setLabels

main()

More likely the Finder is your bottleneck.
There are decently fast quicksort implementations for AppleScript on this site: http://macscripter.net/viewtopic.php?id=17340

When dealing with more than a few hundred files, it’s usually worth the effort speedwise to go with a shell script call.
That way the Finder is out of the loop as much as possible, and can catch up with your actions at its own stately pace.
Consider using a checksum rather than file size. It’ll be slower, but it will throw up fewer false positives. Something like this works for me:

set temstr to do shell script "cksum " & quoted form of POSIXsrcpath -- returns "checksum, filesize, and path" in temstr (space separated)

You’ve got a good opportunity to learn AppleScript shell scripting here. I think you’ll be well pleased if you take it.
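
As a rough illustration of the checksum idea, here is a minimal shell sketch for confirming a suspected pair once two files have been flagged as the same size. The function name same_checksum is made up for this example and does not come from any script in this thread; the only external command it relies on is the standard cksum utility.

```shell
#!/bin/sh
# same_checksum is a hypothetical helper, not part of any script in this thread.
# It succeeds (exit status 0) when both files produce the same cksum output.
same_checksum() {
    # Reading from stdin makes cksum print only "checksum size", with no path,
    # so the two results can be compared as plain strings.
    first=$(cksum < "$1") || return 2
    second=$(cksum < "$2") || return 2
    [ "$first" = "$second" ]
}
```

It could be used as `same_checksum DSC0001.jpg Family2010.jpg && echo "probably identical"`. A matching CRC is strong evidence rather than proof; `cmp -s` on the same two files would give an exact byte-for-byte answer if that matters.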

Why use the Finder? With stat, for instance, you get a more accurate file size comparison. For 500+ files it still takes less than 2 seconds on my machine, so a thousand files should take only a few seconds.

set theFiles to every paragraph of (do shell script "stat -f %Z%t%SN /Users/admin/Pictures/* | sort -n")

set previousSize to -1

repeat with thisLine in theFiles
	set AppleScript's text item delimiters to tab
	set {__size, __path} to {text item 1 of thisLine as integer, text items 2 thru -1 of thisLine as string}
	set AppleScript's text item delimiters to ""
	if __size = previousSize then tell application "Finder" to set label index of item (POSIX file __path as string) to 1
	set previousSize to __size
end repeat

Comparing DJ’s script with mine on my own machine, his obviously has less source code, although mine would be a lot shorter than posted if the sort handler and parallel slave object were loaded as libraries instead of being included in the source code.

DJ’s script is sometimes faster than mine, sometimes slower, depending on the folder to which they’re applied.

His script always labels one fewer than the number of files of the same size, which is almost what the OP wanted, but not as helpful as marking all of them. It also sometimes unhelpfully labels files which are not the same size as any others in the folder. This appears to be because the shell script returns figures equivalent to the files’ EOFs, which I take to be the length of their data forks rather than their actual sizes including any resource forks.

True, stat returns the EOF position of a file. Since the OP wants to mark pairs of identical images (by their size), I thought it would be better to ignore all metadata and other linked file information.

The speed difference is because my script has a lot of overhead when there are only a few files, whereas with large folders AppleScript gets slower and slower. Another reason is that the Finder caches a lot, which also makes it inaccurate.

Labelling one fewer than the number of same-size files can be solved easily. I thought that was what the OP wanted, my mistake.
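
For what it’s worth, marking every file in a run of equal sizes (rather than all but the first) could be sketched in plain shell along these lines. This is only an illustration: the function name list_same_size is invented here, it measures the data length with `wc -c` instead of stat (whose options differ between BSD and GNU systems), and it assumes file names contain no newlines. It prints the paths and leaves the Finder labelling to whatever calls it.

```shell
#!/bin/sh
# list_same_size is a hypothetical helper, not taken from any post above.
# It prints the path of every regular file in directory $1 whose byte count
# equals that of at least one other file in the same directory.
list_same_size() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Emit "size path"; tr strips the padding some wc versions add.
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done |
    sort -n |
    awk '
        {
            path = $0
            sub(/^[0-9]+ /, "", path)   # drop the size, keep spaces in names
            if (NR > 1 && $1 == prevsize) {
                # Print the first member of the run once, then each later one.
                if (!printed) print prevpath
                print path
                printed = 1
            } else {
                printed = 0
            }
            prevsize = $1
            prevpath = path
        }'
}
```

Calling `list_same_size /path/to/folder` would then give one path per line, which an AppleScript could read with `do shell script` and pass to the Finder for labelling.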

Actually, I’ve just realised that my script should check for a run in progress at the end of the main repeat. It also seems to run slightly faster if I store the folder as text and use slightly different kinds of Finder reference! I’ve edited the script in my post above (#5).