Delete other files with the exact same size.

Hi all,
I could use some help. I have one folder with a bunch of files in it, and any file that is exactly the same size as another needs to be deleted, even if it is not the same file. I just need to keep one file of each size at any given time. I am just looking for a ruthless, down-and-dirty script to help with this.

Any help would be great. Thanks.

I believe the following will do what you want. You absolutely have to try this on a test folder before using it elsewhere. All deleted files are moved to the Trash.


set sourceFolder to choose folder

set fileSizes to {}
tell application "Finder"
	set theFiles to every file in sourceFolder as alias list
	repeat with aFile in theFiles
		set the end of fileSizes to size of aFile
	end repeat
end tell

set previousFileSizes to {item 1 of fileSizes} # as suggested by Yvan

set deleteList to {}
repeat with i from 2 to (count fileSizes)
	set anItem to item i of fileSizes
	if (previousFileSizes contains anItem) then set end of deleteList to item i of theFiles
	set end of previousFileSizes to anItem
end repeat

set AppleScript's text item delimiters to ":"
repeat with anItem in deleteList
	display dialog "Delete " & quote & text item -1 of (anItem as text) & quote default button 2
end repeat
set AppleScript's text item delimiters to ""

--An alternative dialog that prompts only once.
--display dialog "Delete " & (count deleteList) & " duplicate files." buttons {"Cancel", "OK"} default button 1

tell application "Finder" to delete deleteList

The procedure I used to find duplicates is adapted from a post by Nigel at:

https://macscripter.net/viewtopic.php?id=37449
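The duplicate test in the script above depends only on the list of sizes, so it can be tried without touching any files. The following is a minimal sketch of that step as a stand-alone handler. The handler name positionsOfRepeatedSizes and the sample sizes are made up for illustration and are not part of the posted script. Unlike the script, it starts with an empty list of seen sizes, so an empty list simply returns {}.

```applescript
-- Hypothetical helper (not in the posted script): given a list of sizes,
-- return the positions of every item whose size has already been seen.
on positionsOfRepeatedSizes(sizeList)
	set seenSizes to {}
	set repeatedPositions to {}
	repeat with i from 1 to (count sizeList)
		set thisSize to item i of sizeList
		if seenSizes contains thisSize then
			-- An earlier item has this size, so this one is a "duplicate".
			set end of repeatedPositions to i
		else
			set end of seenSizes to thisSize
		end if
	end repeat
	return repeatedPositions
end positionsOfRepeatedSizes

-- Made-up sizes: items 3, 5 and 6 repeat a size seen earlier.
positionsOfRepeatedSizes({120, 45, 120, 300, 45, 120}) --> {3, 5, 6}
```

The positions returned correspond to the items of theFiles that the script would add to deleteList.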

Great, thank you! This is exactly what I needed.

If you are sure that permanently destroying the unwanted files is not a problem, you may replace the instruction

tell application "Finder" to delete deleteList

which moves the files to the Trash, with:

tell application "System Events"
	repeat with anItem in deleteList
		delete anItem
	end repeat
end tell

which really destroys them.

As I am really lazy, I would also replace

set previousFileSizes to {}
set beginning of previousFileSizes to item 1 of fileSizes

with

set previousFileSizes to {item 1 of fileSizes}

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Sunday 1 September 2019 20:10:56

Here, there is no need to coerce the list of file references to an alias list. This is enough:

set theFiles to every file in sourceFolder

There is no need for text item delimiters either. And the default button should always be the one with the less dangerous consequences:


repeat with anItem in deleteList
	tell application "Finder" to display dialog "Delete the following file?" & ¬
		return & return & name of anItem default button "Cancel"
end repeat

Otherwise, the script is good and simple enough to remove duplicates without searching in subfolders.


set sourceFolder to choose folder

set fileSizes to {}
tell application "Finder"
	set theFiles to every file in sourceFolder
	repeat with aFile in theFiles
		set the end of fileSizes to size of aFile
	end repeat
end tell

set previousFileSizes to {item 1 of fileSizes} # as suggested by Yvan
set deleteList to {}
repeat with i from 2 to (count fileSizes)
	set anItem to item i of fileSizes
	if (previousFileSizes contains anItem) then set end of deleteList to item i of theFiles
	set end of previousFileSizes to anItem
end repeat

repeat with anItem in deleteList
	tell application "Finder"
		try
			display dialog "Delete the following file?" & ¬
				return & return & name of anItem default button "Cancel"
			if button returned of result is "OK" then delete anItem
		end try
	end tell
end repeat

NOTE: The try block is needed so that the “User cancelled” error does not interrupt the process.
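A bare try block also hides any error raised by the delete command itself, not just the one raised when Cancel is clicked. As a hedged sketch of a narrower approach, the handlers below treat only error number -128 (the number display dialog raises on Cancel) as "skip this file" and pass every other error on. The handler names confirmDeletion and cancelledOrRethrow are hypothetical and do not appear in the scripts in this thread.

```applescript
-- Hypothetical handler: ask before deleting; true means "go ahead".
on confirmDeletion(fileName)
	try
		display dialog ("Delete the following file?" & return & return & fileName) ¬
			buttons {"Cancel", "OK"} default button "Cancel"
		return true -- only reached when OK was clicked
	on error errMsg number errNum
		return my cancelledOrRethrow(errMsg, errNum)
	end try
end confirmDeletion

-- Hypothetical handler: returns false for a user cancel (-128)
-- and re-raises every other error unchanged.
on cancelledOrRethrow(errMsg, errNum)
	if errNum is -128 then return false
	error errMsg number errNum
end cancelledOrRethrow
```

Inside a Finder tell block the call would be written as my confirmDeletion(name of anItem), with the delete command left outside any try block so that its errors stay visible.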

This offers no warnings and may not be as efficient, but it’s a little simpler:

set thePath to "Macintosh HD:Users:shane:Desktop:Sizes"
tell application id "com.apple.finder" -- Finder
	set theFiles to sort files of folder thePath by size
	set theSize to size of item 1 of theFiles
	repeat with aFile in rest of theFiles
		set thisSize to size of aFile
		if thisSize = theSize then
			delete aFile
		else
			set theSize to thisSize
		end if
	end repeat
end tell

Yes, this is much simpler. And as far as I can see, it should be much more efficient too. As for warnings, it is easy to add a similar display dialog before the delete aFile line. :)

Several knowledgeable forum members have reported that it takes less time for the Finder to create a list of files as an alias list than it does for the Finder to create a list of files using Finder’s own syntax. For example, see Marc Anthony’s post number 21 in the following thread.

https://macscripter.net/viewtopic.php?pid=196460

Nigel appears to be reporting something similar when he wrote:

“One of the things which takes so long with the Finder is that it has to put together a long list of its own specifiers. As with System Events, if you can get it to return the results in some other form instead, it’s often quicker (that is, quicker than it otherwise would be)… The as alias list is a Finder speciality which works with the preceding specifier rather than coercing a returned list after the fact.”

https://macscripter.net/viewtopic.php?pid=195806

Finally, I often find it necessary to process files outside a Finder tell statement and an alias list makes this simpler and often quicker. So, I would disagree with your statement.

Peavine.

You are right. With as alias list, the Finder returns results much faster.
I tested this with 45459 files in my home folder. Without alias list, the time was 920 seconds; with alias list, it was 385 seconds. The recursive test script I used was this:

set logTime to {}
tell application "Finder"
	set theFolder to (path to home folder) -- the folder to scan recursively
	set startTime to my (current date)'s time -- "my" is used to avoid a Standard Additions error inside the Finder tell block
	set allFiles to {}
	my getAllFiles(theFolder, allFiles)
	set logTime's end to (my ((current date)'s time)) - startTime
	delay 0.5
	set startTime to my (current date)'s time
	set allFiles to {}
	my getAllFiles2(theFolder, allFiles)
	set logTime's end to (my ((current date)'s time)) - startTime
end tell
logTime

on getAllFiles(theFolder, allFiles)
	tell application "Finder"
		set fileList to files of theFolder
		repeat with i from 1 to (count fileList)
			set end of allFiles to item i of fileList
		end repeat
		set subFolders to folders of theFolder
		repeat with subFolderRef in subFolders
			my getAllFiles(subFolderRef, allFiles)
		end repeat
	end tell
end getAllFiles

on getAllFiles2(theFolder, allFiles)
	tell application "Finder"
		set fileList to files of theFolder as alias list
		repeat with i from 1 to (count fileList)
			set end of allFiles to item i of fileList
		end repeat
		set subFolders to folders of theFolder as alias list
		repeat with subFolderRef in subFolders
			my getAllFiles2(subFolderRef, allFiles)
		end repeat
	end tell
end getAllFiles2

I created a test folder of 300 text files, all of which were the same size, and then timed the scripts in this thread. I first modified each script to exclude the dialog prompt (except Shane’s, which doesn’t have one). The results were:

peavine - 1 to 2 seconds

Shane - 8 to 10 seconds

KniazidisR - 12 to 14 seconds

The difference would appear to be that my script bulk deletes the duplicate files, while the other scripts don’t. Also, Shane clearly stated that his script was written for simplicity, not speed, and my test is worst-case in that it deletes 299 of 300 files.

BTW, I modified Shane’s script to bulk delete the duplicate files by moving the delete command out of the repeat loop, and the timing was 2 seconds. This may be the best script: simple and fast.
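The thread does not show the modified script, so the following is only a guess at what such a change might look like, not the version that was timed. It keeps Shane’s sort by size, collects the files whose size equals that of the file just before them, and sends a single delete command at the end. The handler names positionsOfAdjacentRepeats and trashSameSizeFiles are hypothetical.

```applescript
-- Hypothetical helper: in a list of sizes sorted by size, return the
-- positions whose size equals the size immediately before them.
on positionsOfAdjacentRepeats(sortedSizes)
	set repeatedPositions to {}
	repeat with i from 2 to (count sortedSizes)
		if item i of sortedSizes = item (i - 1) of sortedSizes then set end of repeatedPositions to i
	end repeat
	return repeatedPositions
end positionsOfAdjacentRepeats

-- Hypothetical handler. It is defined but not called here, so nothing is
-- moved to the Trash unless you call it yourself with a folder path.
on trashSameSizeFiles(folderPath)
	tell application id "com.apple.finder" -- Finder
		set theFiles to sort files of folder folderPath by size
		set theSizes to {}
		repeat with aFile in theFiles
			set end of theSizes to size of aFile
		end repeat
		set deleteList to {}
		repeat with i in my positionsOfAdjacentRepeats(theSizes)
			set end of deleteList to item (contents of i) of theFiles
		end repeat
		-- One delete command for the whole list instead of one per file.
		if deleteList is not {} then delete deleteList
	end tell
end trashSameSizeFiles

-- Made-up sizes: positions 2, 5 and 6 repeat the size before them.
positionsOfAdjacentRepeats({10, 10, 20, 30, 30, 30}) --> {2, 5, 6}
```

As with every script in this thread, try it on a test folder first.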

I don’t know how you edited Shane’s script.

It seems that, according to the original post, the simpler code would be:

set thePath to (path to desktop as text) & "Sizes"
tell application id "com.apple.finder" -- Finder
	set theFiles to sort files of folder thePath by size
	delete (rest of theFiles)
end tell

While testing that, I discovered something.
When it is applied to the Desktop, the instruction set theFiles to sort files of folder thePath by size returns a list of aliases.
For any other folder, it returns a list of document files.

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Tuesday 3 September 2019 17:31:34

I couldn’t resist finding a role for my Custom Iterative Ternary Merge Sort. :) This is a bit faster than Peavine’s posted script with my 6592-file test folder, but presumably the OP’s folder won’t have that many files!

use sorter : script "Custom Iterative Ternary Merge Sort" -- <https://macscripter.net/viewtopic.php?pid=194430#p194430>
use scripting additions

on main()
	script o
		property filePaths : missing value
		property fileSizes : missing value
	end script
	
	set thePath to (choose folder) as text
	-- Get corresponding lists of the folder's files' paths and sizes. System Events is faster than the Finder for this.
	tell application "System Events" to set {o's filePaths, o's fileSizes} to {path, size} of files of folder thePath
	
	-- Sort both lists on the file's paths (can be omitted), then (stably) on the sizes.
	tell sorter to sort(o's filePaths, 1, -1, {slave:{o's fileSizes}}) -- If desired.
	tell sorter to sort(o's fileSizes, 1, -1, {slave:{o's filePaths}})
	
	-- Initialise a "current size" variable to some figure below the lowest size.
	set currentSize to (beginning of o's fileSizes) - 1024
	-- Work through the sizes. At each increase, replace the corresponding path with missing value.
	repeat with i from 1 to (count o's fileSizes)
		set thisSize to item i of o's fileSizes
		if (thisSize > currentSize) then
			set item i of o's filePaths to missing value
			set currentSize to thisSize
		end if
	end repeat
	
	-- Get a list containing only the unreplaced paths and bulk-delete those files.
	-- The Finder accepts list of paths for this. System Events's dictionary says it does too, but it doesn't accept lists of anything on my machine.
	set filesToDelete to o's filePaths's text
	tell application "Finder" to delete filesToDelete
	
	return
end main

main()

Hi Yvan.

I think the OP wants to keep one instance of each size rather than just the smallest file in the folder.

Here’s an option using ASObjC. The differences are (a) it ignores packages (and invisible files), which is probably reasonable in this situation, and (b) if the sizes of two files match, it also checks that their contents match.

use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions

-- constants and enums used
property NSDirectoryEnumerationSkipsHiddenFiles : a reference to 4
property NSURLFileSizeKey : a reference to current application's NSURLFileSizeKey

set thePath to "/Users/shane/Desktop/Size test" --POSIX path of (choose folder with prompt "choose the folder")
set theFolder to current application's NSURL's fileURLWithPath:thePath
set fileManager to current application's NSFileManager's |defaultManager|()
set {theFiles, theError} to fileManager's contentsOfDirectoryAtURL:theFolder includingPropertiesForKeys:{NSURLFileSizeKey} options:NSDirectoryEnumerationSkipsHiddenFiles |error|:(reference)
set sizeInfo to current application's NSMutableDictionary's dictionary() -- keys will be size, objects will be array of URLs
repeat with aFile in theFiles
	set {theResult, theSize} to (aFile's getResourceValue:(reference) forKey:NSURLFileSizeKey |error|:(missing value))
	if theSize is not missing value then -- skip packages and folders
		if (sizeInfo's allKeys()'s containsObject:theSize) as boolean then -- check if same size already found
			set matchingFiles to (sizeInfo's objectForKey:theSize) -- get files that had same size
			set matchFlag to false
			repeat with aMatch in matchingFiles -- compare contents, delete if the same
				if (fileManager's contentsEqualAtPath:(aMatch's |path|()) andPath:(aFile's |path|())) as boolean then
					(fileManager's trashItemAtURL:aFile resultingItemURL:(missing value) |error|:(missing value))
					set matchFlag to true
					exit repeat
				end if
			end repeat
			if not matchFlag then
				(matchingFiles's addObject:aFile)
			end if
		else
			(sizeInfo's setObject:(current application's NSMutableArray's arrayWithObject:aFile) forKey:theSize)
		end if
	end if
end repeat

Hi, Shane.

I don’t know why, but your pure AppleScript variant runs faster: 32 milliseconds on my machine (after compiling). The AppleScriptObjC variant takes 602 milliseconds (after compiling).

Peavine’s script runs in 17 milliseconds on my machine (after compiling). I removed the warning dialog from his script for the test.

It’s because the first script just compares file sizes. The second one first compares sizes, and if they match it compares the entire contents of the files to see if they match exactly. Not what was asked for, but safer.

Oh, got it. Thanks for clarifying. By the way, is this a byte comparison or some other algorithm? With hash, for example.

A byte comparison, I believe. The documentation says: “For files, this method checks to see if they’re the same file, then compares their size, and finally compares their contents.”

Thanks for the new version. I believe there is no need for matchFlag:

repeat with aMatch in matchingFiles -- compare contents, delete if the same
	if not ((fileManager's contentsEqualAtPath:(aMatch's |path|()) andPath:(aFile's |path|())) as boolean) then
		(matchingFiles's addObject:aFile)
	else
		(fileManager's trashItemAtURL:aFile resultingItemURL:(missing value) |error|:(missing value))
		exit repeat
	end if
end repeat

Also, I don’t know whether this comparison method stops the byte comparison at the first mismatched byte. If not, then perhaps the method has some additional options for faster behavior.

From the documentation I read: “For files, this method checks to see if they’re the same file, then compares their size, and finally compares their contents. This method does not traverse symbolic links, but compares the links themselves.”

This means that the method has its own comparison order. So checking files for equal size before calling it does the same job twice.

No, you’re modifying the number of items in matchingFiles while you loop through it. You’re also risking adding the same file multiple times, and therefore ending up with the array containing items that have been trashed. In any event, it would be the tiniest of optimizations.

Yes, but the difference is that we’re storing the size for re-use, rather than having to read it from two files for each comparison.

Look at it this way. Suppose there are 10 files, all different. My script gets their size once each, and that’s all. If we just used contentsEqualAtPath:andPath:, you’d need to call it 9 times with the first item, 8 times with the second item, and so on.
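The arithmetic behind that example can be checked with a few lines of plain AppleScript. This is only an illustration of the count described above, assuming the naive approach compares each file with every later file. The handler name pairwiseComparisons is hypothetical.

```applescript
-- Number of pairwise comparisons needed when every file is compared with
-- every later file and all files are different:
-- (n - 1) + (n - 2) + ... + 1 = n * (n - 1) / 2
on pairwiseComparisons(fileCount)
	set total to 0
	repeat with i from 1 to (fileCount - 1)
		-- File i is compared with each of the (fileCount - i) files after it.
		set total to total + (fileCount - i)
	end repeat
	return total
end pairwiseComparisons

pairwiseComparisons(10) --> 45
```

So for 10 different files, the size-first approach reads 10 sizes and makes no content comparisons at all, while the pairwise approach would make 45 calls.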