Delete Duplicate Emails in Entourage 2008

Guys,
I have many group and committee memberships and related rules in Entourage to move new emails to their respective Entourage folders. However, due to many reasons I receive multiple copies of emails (announcements, etc.) that eventually take up space on the HD and may slow down Entourage. I have written a script based on one I found somewhere on the web (that I regret not recalling the original creator to attribute them, but this has been over a few years) to archive each folder on a “recently changed folder per half hour” basis so that, in the event of a crash or corruption of the Entourage database, I have a fairly up to date back up. Then daily, when I may be away from the machine, a full archive is done. To speed up the archival process and once again save on HD space, I wanted to eliminate duplicates. That’s where this script comes in. (I will share the archival scripts once I have cleaned them up a little and commented the code more). The code for the delete duplicates script was initially based on code I picked up somewhere on the web, possibly from the Mactopia Entourage site, a couple of years back. My regrets that I cannot attribute the originator for their work here.

This script runs through approximately 22,000 emails over 80+ folders in about 25 minutes (~15 emails per second…). Not as fast as I would like, but a lot better than my first implementation that took 2 hours. The script does depend on the “intersection of” code based on the Santimage osax. I tried to get around that using a short routine with “is contained by” but the results were slower.

The premise of the script is that it starts at the top folders (INBOX being #1) and runs through the folders in order of Entourage standard folders, then user added folders in alpha order. The script skips the deleted, drafts, outbox, and junk mail folders as these are in constant flux and duplicates in them may cause a problem later, as you’ll see. The first instance of an email (based on MIME Message-ID and then at least three matches of sender, date, subject, and Entourage ID) is left alone while any other iteration encountered is deleted. The matches must be the MIME ID plus at least three of the other criteria as one of the elements may be changed for some reason. I have tested this theory over hundreds of thousands of emails so far and it seems to be bullet-proof. Feel free to show me where there is an issue and I’ll adjust. The MIME ID match is really a must as it is unlikely to be repeated, but may be, so the rest of the elements being checked are to ensure the MIME ID was not a duplicate.

Since the script starts at the upper folders that come with Entourage any copies that have an instance in one of these folders is retained while later discovered copies are discarded. So keeping your filing away of emails current helps ensure you do not have to refile an email you once filed.

I first wrote the script using database elements and database events. Was far, far too slow. I converted it to using lists and it “flew” in comparison to database records (a 5 to 8 times speed up). If a message arrives at Entourage while the program is running there may be an issue of showing a negative number for duplicate files deleted. I will fix this by setting Entourage to work offline, however this ‘bug’ does not bother me so I am going to work on it as a low priority.

There are debug and record keeping hooks in the script. Statements on progress are output to a text file (great to view with TextWrangler [free] as the real time updating of the file while the script runs is reflected in the TextWrangler view of the file. There is the option to speak progress that aggravates my wife when she hears it so I leave it off. There is also the ability to output log information for troubleshooting.

The file must be saved as a standard script, and not in Script Debugger’s Debug mode as Entourage will not run it properly and give erroneous error messages. Save this in the “Entourage Script Menu Items” folder in the “Microsoft User Data” folder, it will then appear under the Entourage scripts menu item.

That’s all, except for the usual disclaimer that use this at your own peril, I give no warranties of any kind as to wiping out emails that were considered works of art, etc., etc.

Edit 29 Sept: spelled Satimage correctly in comment, stressed that their osax is needed for this to work, and added a link to their site where the osax can be downloaded from (http://www.satimage.fr/software/en/downloads/downloads_companion_osaxen.html)


--a derivative work by John Egan based on an original script I have lost and wish to attribute
-- the attribution will be determined and added as a later revision comment
-- version 1.0.0

-- Scans all Entourage 2008 folders for duplicate messages
property pDebugModeOn : false --send messages to Event Log in Activity window
property pLogEventsOn : true -- generate a log file
property pSpeakMsg : false
property pBaseFolder : "" --the desktop, or where the log file should be
property pEntourageResultsFolder : "Entourage Maintenance Log Messages"
property pLogFolder : "" -- filled in at run time with the desktop folder path
property pLogFileName : "Duplicate Email Message Removal.log"
property pLineEndingMac : ASCII number return -- carriage return (13)
property pLineEndingUnix : ASCII character 10 -- line feed character
property pLinefeed : pLineEndingUnix -- use this line feed character in logs

--lists used in place of database, used as properties to access anywhere
global pTotalDelCount
global pMsgHeaderSubject
global pMsgHeaderSender
global pMsgHeaderDate
global pMsgHeaderEntID
global pMimeID --keep the IDs in the list for faster search
global pRemovedMessages
global pCountDone
global ptimeStart
-------------------------------------------------------
set pRemovedMessages to 0
set pCountDone to 0
set ptimeStart to current date
-------------------------------------------------------
-- set up log file on desktop
set pBaseFolder to (the path to the desktop) as text
set pLogFolder to pBaseFolder & pEntourageResultsFolder & ":"
if pLogEventsOn is true then
	my LogEntry(return & return & "Starting up...", 0)
	my LogEntry("Using Lists instead of a database, timing results", 0)
	-- create the Log folder on disk if not created before
	if not my FileExists(pLogFolder) then
		my LogEntry("Creating folder: \"" & pLogFolder & "\".", 0)
		ignoring application responses
			tell application "Finder"
				make new folder at (pBaseFolder as alias) with properties {name:pEntourageResultsFolder}
			end tell
		end ignoring
	end if
end if
---==============================================
--main entry and exit to the program
set pTotalDelCount to 0 --init to zero
set pMsgHeaderSubject to {}
set pMsgHeaderSender to {}
set pMsgHeaderDate to {}
set pMsgHeaderEntID to {}
set pMimeID to {}
set folderCount to ProcessFolder(application "Microsoft Entourage") --{msgCount, 
display dialog "Deleted " & pTotalDelCount & " messages from " & folderCount & " folders" buttons {"Ok"} giving up after 20
---==============================================
on ProcessFolder(theFolder) --recursive starting at top Entourage folders
	local theFolder, folderItem
	set folderList to {}
	set RunningTotal to 0 --  set deleted message count to 0
	set NameList to {"Deleted Items", "Drafts", "Outbox", "Junk E-mail"} --	skips useless to check folders
	tell application "Microsoft Entourage"
		set folderList to the folders of theFolder
	end tell
	set folderCount to count the items of folderList
	if folderCount ≠ 0 then
		repeat with folderItem in folderList
			set theContents to the folderItem
			set theName to name of folderItem as text
			if theName is not in NameList then
				--log the folder and its messages
				if pLogEventsOn is true then
					if the name of theFolder ≠ "Microsoft Entourage" then
						my LogEntry("*************** Processing folder " & theName & ", a sub-folder of " & the name of theFolder, 1)
					else
						my LogEntry("********** Processing folder " & theName & ", a top folder for your Email", 1)
					end if
					tell application "Microsoft Entourage"
						my LogEntry("******* Processing " & (count messages of folderItem) & " messages", 0)
					end tell
				end if
				-------------------------
				set DelCnt to my delDupes(folderItem)
				set RunningTotal to RunningTotal + DelCnt
				if pLogEventsOn is true then
					my LogEntry("******* Messages deleted from this folder: " & DelCnt, 2)
				end if
				--process if this folder has sub-folders
				set FldrCnt to my ProcessFolder(folderItem)
				set folderCount to folderCount + FldrCnt
			end if
		end repeat
	end if
	set pTotalDelCount to pTotalDelCount + RunningTotal
	return folderCount
end ProcessFolder
---==============================================
on delDupes(theFolder)
	set DelCnt to 0
	set eStr to ""
	set eNum to 0
	set rList to ""
	set badObj to ""
	set expectedType to ""
	set theMailList to {}
	set theResult to 0
	set pCountDone to pCountDone + 1
	tell application "Microsoft Entourage"
		set FolderMsgCnt to the count of messages of theFolder
		set theMailList to every message of theFolder -- speeds up processing
	end tell
	if FolderMsgCnt ≠ 0 then
		if pSpeakMsg is true then
			set SpokenMsg to (the name of theFolder & " is folder " & pCountDone) as string
			say SpokenMsg
			say "Messages to check. " & FolderMsgCnt
		end if
		script theRefScript
			property theMailListRef : theMailList
		end script
		repeat with FolderMsgPointer from FolderMsgCnt to 1 by -1 --get all of the messages in the folder
			try
				-- get message header info
				set msgContents to the item FolderMsgPointer of theRefScript's theMailListRef
				tell application "Microsoft Entourage"
					set msgSubject to (get subject of msgContents) --as text
					set msgSenderAddress to (get sender's address of msgContents) --as text
					set msgSentDate to (get time sent of msgContents) -- as text
					set msgID to (get ID of msgContents) --as text
					set theMsgContents to msgContents's (headers)
				end tell
				set uniqueID to my getmessageID(theMsgContents)
				----------------------
				--  sort list and delete any duplicates here
				if uniqueID is in pMimeID then --if uniqueID in List then there is a possible match from email to database
					set msgDeleted to my getRecMatch(uniqueID, msgID, msgSentDate, msgSubject, msgSenderAddress, FolderMsgPointer, theFolder)
				else
					set msgDeleted to false
				end if
				--add next item to the existing list if it was not deleted
				if msgDeleted is false then
					if pMimeID is {} then
						set pMimeID to {uniqueID}
						set pMsgHeaderSubject to {msgSubject}
						set pMsgHeaderSender to {msgSenderAddress}
						set pMsgHeaderDate to {msgSentDate}
						set pMsgHeaderEntID to {msgID}
					else
						set end of pMimeID to uniqueID
						set end of pMsgHeaderSubject to msgSubject
						set end of pMsgHeaderSender to msgSenderAddress
						set end of pMsgHeaderDate to msgSentDate
						set end of pMsgHeaderEntID to msgID
					end if
				end if
			on error eStr number eNum partial result rList from badObj to expectedType
				if pLogEventsOn is true then
					my LogEntry("Trouble in the delDupes handler............................................", 0)
					my LogEntry("Folder message  number: " & FolderMsgPointer, 0)
					my LogEntry("Message Title: " & msgSubject, 0)
					my LogEntry("Sender address: " & msgSenderAddress, 0)
					my LogEntry("Message Date: " & msgSentDate, 0)
					my LogEntry(" Message ID: " & uniqueID, 0)
				end if
				if pDebugModeOn is true then
					log (current date)
					log "Trouble reported, see log file. Reference folder: " & name of the theFolder & " and its message number: " & FolderMsgPointer
					log eStr & return & eNum & return & rList
				end if
			end try
		end repeat
		tell application "Microsoft Entourage"
			set removedCount to FolderMsgCnt - (count messages of theFolder)
		end tell
		set theResult to removedCount
		if pSpeakMsg is true then
			if removedCount is 0 then
				say "Finished. No duplicates detected"
			else
				say "Finished. " & removedCount & " duplicates removed"
			end if
		end if
	end if
	return theResult
end delDupes
---==============================================
on LogEntry(someText, timingAct)
	if not pLogEventsOn then return
	--if log folder does not exist then create it
	if not my FileExists(pLogFolder) then
		tell application "Finder"
			make new folder at (pBaseFolder as alias) with properties {name:pEntourageResultsFolder}
		end tell
		my LogEntry("Creating folder: \"" & pLogFolder & "\".", 0)
	end if
	--create text entry to log file, append to end of file to maintain long record
	set logFile to (pLogFolder as text) & pLogFileName
	set logRef to 0
	ignoring application responses
		try
			set logRef to open for access file logFile with write permission
		end try
		if logRef ≠ 0 then
			write FormatDateTime(current date) & ": " & someText & pLinefeed starting at eof to logRef
			if timingAct is 1 then
				set ptimeStart to current date
			end if
			if timingAct is 2 then
				write "Time to complete: " & tab & ((current date) - ptimeStart) & pLinefeed starting at eof to logRef
				write "Email entries  running total (pMimeID): " & tab & (count items in pMimeID) & pLinefeed starting at eof to logRef
				write "Email entries deleted: " & tab & pTotalDelCount & pLinefeed starting at eof to logRef
			end if
			close access logRef
		end if
	end ignoring
end LogEntry
---==============================================
on FormatDateTime(theDate)
	set theDate to theDate as date
	set dd to text -2 thru -1 of ("0" & theDate's day)
	copy theDate to tempDate
	set the month of tempDate to January
	set mm to text -2 thru -1 of ¬
		("0" & 1 + (theDate - tempDate + 1314864) div 2629728)
	set yy to text -1 thru -4 of ((year of theDate) as text)
	set hh to time string of theDate
	return (yy & "/" & mm & "/" & dd & " " & hh as text)
end FormatDateTime
---==============================================
on FileExists(someFile)
	try
		set anAlias to (someFile as alias)
		return true
	on error
		return false
	end try
end FileExists
---==============================================
--searches lists for a match, runs through the number of matches made
-- returns the list item if there is any match of uniqueID, subject, date, and any other field
--deletes the record if the number of header items matches is greater than 50% along with the Mime Message ID being the same
on getRecMatch(uniqueID, msgID, msgSentDate, msgSubject, msgSenderAddress, FolderMsgPointer, theFolder)
	set DelCnt to 0
	set eStr to ""
	set eNum to 0
	set rList to ""
	set badObj to ""
	set expectedType to ""
	set recid to {}
	set recDate to {}
	set recSubj to {}
	set recAddr to {}
	set allrecMatches to {}
	try
		set recKey to my FindAll(pMimeID, uniqueID) -- id of them all, but only use the last one, as that is what counts
		set LastRecKey to (last item of recKey) as list
		if pDebugModeOn is true then
			log "UniqueID match to existing record(s): " & recKey
			log "Will ony use: " & LastRecKey & ", as it is the one that counts."
		end if
		-----
		set recid to my FindAll(pMsgHeaderEntID, msgID) -- id of them all, REQUIRES Satimage's osax for 'intersection of' (http://www.satimage.fr/software/en/downloads/downloads_companion_osaxen.html)
		if recid is not false then
			-- this code is slower (a hundredth of a second) versus 'intersection of' on test runs, thanks to McUsr's millisec routines used to check it. I left it in case others have ideas...
			(*set recKey2recID to {}
			if (count of LastRecKey) < (count of recid) then
				repeat with theItem in LastRecKey
					if theItem is in recid then set end of recKey2recID to theItem
				end repeat
			else
				repeat with theItem in recid
					if theItem is in LastRecKey then set end of recKey2recID to theItem
				end repeat
			end if*)
			set recKey2recID to intersection of LastRecKey and recid with removing duplicates -- what items from the 2 lists match
		end if
		if pDebugModeOn is true then
			log "Entourage ID match to existing record(s): " & recid
			if recKey2recID ≠ {} then log "Match found to RecKey item identity"
		end if
		-----
		set recDate to my FindAll(pMsgHeaderDate, msgSentDate) -- id of them all
		if recDate is not false then
			set recKey2recDate to intersection of LastRecKey and recDate with removing duplicates -- what items from the 2 lists match
		end if
		if pDebugModeOn is true then
			log "Email date & time match to existing record(s): " & recDate
			if recKey2recDate ≠ {} then log "Match found to RecKey item identity"
		end if
		-----
		set recSubj to my FindAll(pMsgHeaderSubject, msgSubject) -- id of them all
		if recSubj is not false then
			set recKey2recSubj to intersection of LastRecKey and recSubj with removing duplicates -- what items from the 2 lists match
		end if
		if pDebugModeOn is true then
			log "Email subject match to existing record(s): " & recSubj
			if recKey2recSubj ≠ {} then log "Match found to RecKey item identity"
		end if
		-----
		set recAddr to my FindAll(pMsgHeaderSender, msgSenderAddress) -- id of them all
		if recAddr is not false then
			set recKey2recAddr to intersection of LastRecKey and recAddr with removing duplicates -- what items from the 2 lists match
		end if
		if pDebugModeOn is true then
			log "Email sender match to existing record(s): " & recAddr
			if recKey2recAddr ≠ {} then log "Match found to RecKey item identity"
		end if
		-----
		set allrecMatches to recKey2recID & recKey2recDate & recKey2recSubj & recKey2recAddr
		set recKeyMatchesCnt to count items of allrecMatches
		if recKeyMatchesCnt < 3 then
			set msgDeleted to false
			return msgDeleted
		end if
		if recKeyMatchesCnt > 2 then
			tell application "Microsoft Entourage"
				delete message FolderMsgPointer of theFolder
			end tell
			if pLogEventsOn is true then
				my LogEntry("Deleting a duplicate email...", 0)
				my LogEntry("This folder's message # " & FolderMsgPointer & " deleted...", 0)
			end if
			if pDebugModeOn is true then
				log "Folder message # " & FolderMsgPointer & " has been deleted"
				log "from folder: " & name of theFolder
				log (current date)
				log return
			end if
			set msgDeleted to true
		end if
	on error eStr number eNum partial result rList from badObj to expectedType
		--means there is no match and a problem was found
		if pDebugModeOn is true then
			log "Error in searching the lists. The reported recKey is: " & recKey
			set recid to {}
			log "  Trouble...   "
			log uniqueID
			log "Message is number: " & FolderMsgPointer & " of folder: " & name of theFolder
			log (current date)
		end if
		if pLogEventsOn is true then
			my LogEntry(eStr, 0) --log eStr
			my LogEntry(eNum, 0) --log eNum
			my LogEntry(rList, 0) --log rList
			my LogEntry(recKey2recID, 0)
			my LogEntry(recKey2recDate, 0)
			my LogEntry(recKey2recSubj, 0)
			my LogEntry(recKey2recAddr, 0)
		end if
	end try
	set msgDeleted to false
	return msgDeleted
end getRecMatch
---==============================================
on getmessageID(msgHeaders)
	local msgHeadersStart, msgHeadersEnd
	-- get the headers
	script theRefScript
		property msgHeadersRef : msgHeaders -- the trick to have a reference for a list, this gets fastest results
	end script
	if theRefScript's msgHeadersRef contains "Message-ID:" then
		set saveDelims to AppleScript's text item delimiters
		set AppleScript's text item delimiters to return -- split full header text into paragraphs
		set headerList to the text items of theRefScript's msgHeadersRef
		set AppleScript's text item delimiters to saveDelims
		set theRefScript's msgHeadersRef to the paragraphs of theRefScript's msgHeadersRef
		set msgHeadersStart to items 1 thru ((length of theRefScript's msgHeadersRef) div 2) of theRefScript's msgHeadersRef
		set msgHeadersEnd to items ((length of theRefScript's msgHeadersRef) div 2) thru -1 of theRefScript's msgHeadersRef
		if (msgHeadersStart contains "Message-ID:") or ((msgHeadersStart as text) contains "Message-ID: <") then
			if item -1 of msgHeadersStart does not start with "Message-ID:" then
				set theRefScript's msgHeadersRef to msgHeadersStart
			end if
		else
			if (msgHeadersEnd contains "Message-ID:") or ((msgHeadersEnd as text) contains "Message-ID: <") then
				set theRefScript's msgHeadersRef to msgHeadersEnd
			end if
		end if
		-- find the Message-ID header
		set WatchIt to false --this tracks to see if Message ID is in a line by itself and the actual ID is on the next line
		repeat with nextHeader in theRefScript's msgHeadersRef
			if WatchIt is false then
				if nextHeader starts with "Message-ID:" then
					if length of nextHeader > length of "Message ID:" then
						set nextHeader to (characters 12 thru -1 of nextHeader) as string --eliminates Message-ID:--msgIDHeader
						exit repeat
					else
						set WatchIt to true
					end if
				end if
			else
				exit repeat
			end if
		end repeat
		if pDebugModeOn is true then
			log nextHeader -- splitID
		end if
		-- remove whitespace characters from the front of the string
		set done to false
		set whiteChars to (ASCII character 10) & space & tab & return
		repeat until done
			if whiteChars contains the first character of nextHeader then
				set nextHeader to (characters 2 through -1 of nextHeader) as text
			else
				set done to true
			end if
		end repeat
		-- remove surrounding < > chars if neeeded
		if the first character of nextHeader is "<" then set nextHeader to (characters 2 through -1 of nextHeader) as text
		if the last character of nextHeader is ">" then set nextHeader to (characters 1 through -2 of nextHeader) as text
	else
		-- this message has no Message-ID: header - this is abnormal
		if pLogEventsOn is true then
			my LogEntry("Message-ID headers are missing, means header is screwed up...", 0)
			my LogEntry("This one will be stored as a bogus one", 0)
			my LogEntry(nextHeader, 0)
		end if
		set nextHeader to "Message-ID header missing!"
	end if
	return nextHeader --as text)
end getmessageID
---==============================================
-- a routine to find where, if anywhere, in a list, a value is
on FindAll(theList, theValue)
	try
		if theList does not contain theValue then return {}
		set theResult to {}
		script theRefScript
			property ListRef : theList -- the trick to have a reference for a list, this gets fastest results
		end script
		repeat with i from 1 to count of theRefScript's ListRef -- number of items in theList
			if theRefScript's ListRef's item i is theValue then set theResult's end to i
		end repeat
		return theResult
	on error eMsg number eNum
		if pDebugModeOn is true then
			log "Can't findAll: " & eMsg & " error number: " & eNum
		end if
		return {}
	end try
end FindAll


Model: MacBook (early 2009)
Browser: Safari 533.16
Operating System: Mac OS X (10.6)

Should this compile in Script Editor? Because my customer gets errors when he tries to compile.

It barfs on “with” and says expecting end of line.

			end if*)
			set recKey2recID to intersection of LastRecKey and recid with removing duplicates -- what items from the 2 lists match
		end if