Script Efficiency / Run Time...

I am writing a script to help me identify all duplicate records in my contacts list which contains 1,500+ contacts.

I have written a portion of the script [as can be seen below] but note that it takes a very long time run on my mid 2012 MBA [i.e. 2 GHz i7 with 8MB of RAM].

Would appreciate code / ideas /suggestions as to how to speed things up…the script is:


tell application "Contacts"
	
	
	-- Set variables
	set myPeople to {} -- List of all contacts details from Contacts
	set myPeopleCount to 0 -- Number / count of all contacts from Contacts
	set myPeopleList to {} -- List of all contact names and organizations from Contacts 
	set myPeopleDuplicates to {} -- List of all contact duplicate names and organizations 
	
	
	-- Set / reference every contact
	set myPeople to every person
	set myPeopleCount to count of myPeople
	
	
	-- Set / extract individual contact names
	repeat with i from 1 to myPeopleCount
		set myPeopleIndividualNameCompany to {last name of person i & " " & first name of person i as string, organization of person i}
		set end of myPeopleList to myPeopleIndividualNameCompany
	end repeat
	

	-- Sort myPeopleIndividualNameCompany to facilitate duplicate comparison
	set callSortLocation to "Macintosh HD:Users:JoelC:Documents:Apple:Scripts:Utilities:20141125_script to sort a list of items_specific_complete.scpt" as alias -- Set callSort to load a compiled sorting script
	set callSort to (load script callSortLocation)
	tell callSort -- Tell callSort to sort myPeopleList
		set myPeopleList to simple_sort(myPeopleList)
	end tell	
	
end tell

while the sorting script – courtesy of Nigel – which is called is as follows:


-- Handler to sort a list or a list of lists based on the first item in the sublist
on simple_sort(my_list)
	set the index_list to {}
	set the sorted_list to {}
	repeat (the number of items in my_list) times
		set the low_item to ""
		repeat with i from 1 to (number of items in my_list)
			if i is not in the index_list then
				set this_item to item i of my_list
				if the low_item is "" then
					set the low_item to this_item
					set the low_item_index to i
				else if item 1 of this_item comes before item 1 of the low_item then
					set the low_item to this_item
					set the low_item_index to i
				end if
			end if
		end repeat
		set the end of sorted_list to the low_item
		set the end of the index_list to the low_item_index
	end repeat
	return the sorted_list
end simple_sort

Thanks in advance for the assistance.

Joel

  1. If it’s a one-time task, who cares if it’s slow? Run it overnight, job done.

  2. The code for fetching address book info is somewhat inefficient. If it proves to be a bottleneck, all data can be retrieved in just three commands:


tell app "Contacts"
   set {lastNames, firstName, orgNames} to {last name, first name, organization} of every person
end tell

Obviously getting three separate lists which need to be rearranged is a bit less convenient than a single list, but it’s an option.

  1. There is no reason to sort the list to eliminate duplicates. Even if there was, unless you are certain the list is already mostly sorted then the last thing on earth you want to use is a bubble sort[1], which is hideously inefficient[2]. (Especially in AppleScript, which is already hideously inefficient at accessing list items[3].) Simply appending each item to a new list if it’s not already there will do:

set uniqueItems to {}
set duplicateItems to {}
repeat with i from 1 to length of lastNames
   set anEntry to {item i of lastNames & " " & item i of firstNames, item i of orgNames}
   if {anEntry} is in uniqueItems then
      set end of uniqueItems to anEntry
   else
      if {anEntry} is not in duplicateItems then set end of duplicateItems to anEntry
   end if
end repeat
return duplicateItems

This is still inefficient (due to AppleScript being crap), but it’s probably good enough for your purposes. If not, there are kludges for improving list efficiency; or you could even resort to more appropriate Cocoa-supplied data structures via AppleScriptObjC, although I doubt that’s worth the effort for such a modest amount of data.

BTW, there’s an introductory-level chapter in the dread book (link in sig.) that covers issues like performance and efficiency. It certainly won’t make you a computer scientist, but it might give you a bit better appreciation of issues involved.

[1] That particular handler isn’t even a standard bubble sort algorithm, but rather some bizarro mismash. Honestly, there should be a surgeon general’s warning attached to that code.

[2] It is an utter disgrace that 20-odd years after AppleScript’s introduction and 2 OS releases since AppleScript finally introduced a basic library loader, AppleScript still doesn’t ship with even the most rudimentary standard library. (I don’t count stuff like Standard Additions and System Events, because aside from being a joke they’re unsuitable for a lot of basic data handling/processing tasks because they don’t operate within AppleScript itself.) It’s one of the reasons doing anything other than app automation in AppleScript is so excruciating for end-users and professional programmers alike, and this ceremonial hand-to-mouth passing about of garbage like that appalling sort code does neither AppleScript nor users any favors at all.

[3] In principle, accessing any item in an AppleScript list should take a constant amount of time (1). In practice, the time taken increases proportionally to the number of items in the list (n). Thus, the time it takes to repeat over the entire list increases quadratically (nn), and since that code nests two such loops it’s going to take nnnn time to chew through your list. Computer geeks have a handy scheme for describing code’s efficiency - i.e. how well performance holds up as the amount of data increases - called “Big-O notation”. The book briefly introduces this, or just google the term and see how that goes. But an O(n^4) algorithm is going to go put performance in the sewer on even relatively modest-sized lists.

The hog is mostly the sort. If I remember correctly the one you use is mostly didatic, there should be more efficient on the same page. But yes, it’s hideously slow.

try Satimage.osax: http://satimage.fr/software/en/downloads/index.html

sortlist
sortlist (verb)[synonyms: sortarray]sort a list of numbers (or an array of real) or a list of strings or a list of dates. Missing values of the input list are always returned at the end of the resulting list. Can also sort a list of lists: lists are sorted either asynchronously or synchronously if the ‘with respect to’ parameter is specified. Sortlist is stable. (from the Array and List Utilities suite, defined in Satimage.osax)

Agreed, it is likely a once a month task but agreed nonetheless…

The bottleneck is not so much the retrieval of the data but the sort…the other point – as you note – is that it easier far easier to work with one list than three lists…

While I thought of the approach that you coded above I thought that it would be less efficient because when one gets towards the end of the list there are a lot of comparisons to perform to determine whether an item is or is not in the list.

With the coding not being overly complex I will code both approaches and settle / use the one that is faster.

While I am very interested in these other methods I need to walk before I run as I am learning on my own and still have a ways to go to get comfortable with AppelScript before adding taking on a complex programming language but, rest assured, I will get there…

I will look into it!

***

Thanks for the response, truly appreciated…

Greatly appreciate the suggestion and will look into it…

Thx,

Joel

Just to clarify that assertion: I provided the modification to the else if line which allows it to compare the first items of lists. I did not write or provide the sort itself! And in fact I advised against hard-wiring sorts this way: http://macscripter.net/viewtopic.php?pid=177146#p177146. :wink:

I agree with hhas about the code for fetching the data from Contacts. On my machine, the following gets the data, collates them, and sorts the results in a fraction of the time yours takes just to get the data into myPeopleList:

-- Custom-sort customiser for sorting a list of lists on their first items, with subsorts on their second and third items.
script sortOnItems1_2_3
	on isGreater(a, b)
		set {a1, a2, a3} to a
		set {b1, b2, b3} to b
		
		if (a1 = b1) then
			if (a2 = b2) then
				return (a3 > b3)
			else
				return (a2 > b2)
			end if
		else
			return (a1 > b1)
		end if
	end isGreater
end script

-- Get the required information from every person in Contacts.
tell application "Contacts"
	set {lastNames, firstNames, orgNames, ids} to {last name, first name, organization, id} of every person
end tell

-- Collate the data just obtained into myPeopleList, replacing any missing values with dashes.
set myPeopleList to {}
repeat with i from 1 to (count lastNames)
	set {a, b, c, d} to {item i of lastNames, item i of firstNames, item i of orgNames, item i of ids}
	if (a is missing value) then set a to "-"
	if (b is missing value) then set b to "-"
	if (c is missing value) then set c to "-"
	set end of myPeopleList to {a, b, c, d}
end repeat

-- Sort items 1 thru -1 of myPeopleList using sortOnItems1_2_3 to compare them.
customShellSort(myPeopleList, 1, -1, {comparer:sortOnItems1_2_3})

return myPeopleList

-- Customisable Shell sort. Algorithm by Donald Shell, 1959. AppleScript implementation by Nigel Garvey, 2010.
on customShellSort(theList, l, r, customiser)
	script o
		property comparer : me
		property slave : me
		property lst : theList
		
		on shsrt(l, r)
			set inc to (r - l + 1) div 2
			repeat while (inc > 0)
				slave's setInc(inc)
				repeat with j from (l + inc) to r
					set v to item j of o's lst
					repeat with i from (j - inc) to l by -inc
						tell item i of o's lst
							if (comparer's isGreater(it, v)) then
								set item (i + inc) of o's lst to it
							else
								set i to i + inc
								exit repeat
							end if
						end tell
					end repeat
					set item i of o's lst to v
					slave's shift(i, j)
				end repeat
				set inc to (inc / 2.2) as integer
			end repeat
		end shsrt
		
		-- Default comparison and slave handlers for an ordinary sort.
		on isGreater(a, b)
			(a > b)
		end isGreater
		
		on shift(a, b)
		end shift
		
		on setInc(a)
		end setInc
	end script
	
	-- Process the input parameters.
	set listLen to (count theList)
	if (listLen > 1) then
		-- Negative and/or transposed range indices.
		if (l < 0) then set l to listLen + l + 1
		if (r < 0) then set r to listLen + r + 1
		if (l > r) then set {l, r} to {r, l}
		
		-- Supplied or default customisation scripts.
		if (customiser's class is record) then set {comparer:o's comparer, slave:o's slave} to (customiser & {comparer:o, slave:o})
		
		-- Do the sort.
		o's shsrt(l, r)
	end if
	
	return -- nothing.
end customShellSort

But while we can advise you that this bit or that bit of the code is slow, we can’t advise on the best approach overall because your script’s incomplete and you haven’t said what you ultimately want it to do. List duplicates, delete them, or what? The best route depends on the destination.

1500 entries should be negligible for comparison purposes, especially for a one-off script. If performance starts to be a regular issue, you can speed up access to (pre-populated) lists by a factor of 10 or more by simply using my. Using hhas’ code as an example, a couple lines would change to:

if {anEntry} is in my uniqueItems then

if {anEntry} is not in my duplicateItems then set end of duplicateItems to anEntry

Actually, using a reference such as my listVariable instead of just the variable only speeds up explicit access to the list’s items and properties (if the list’s quite long). With commands aimed at the list itself, as in these two examples, omitting the my appears to have a slight advantage. The second example would be faster as:

if {anEntry} is not in duplicateItems then set end of my duplicateItems to anEntry

This is true in that the time it takes AS to do a ‘ is in ’ increases linearly with length in the worst case. i.e. O(n) in Big-O notation. The flipside is that an O(n) algorithm written in C is still going to be more efficient than an O(n*n) algorithm written half in C and half in AppleScript, which is what your unoptimized inner ‘repeat’ loop represents. Like I say, there’s no point worrying about what code’s efficiency might be until 1. you actually write it, run it, and measure its performance over various sample sizes, and 2. you have some basic knowledge of algorithms, understand the difference between speed and efficiency, and are familiar with both general optimization techniques and the nasty but sometimes unavoidable hacks needed to get acceptable performance out of AS. Otherwise you’re just flailing blind.

“Premature optimization is the root of all evil”, as Donald Knuth says.

Although be fair, you are right in your assessment that a list is not the ideal data structure to use for this particular job. What you really want is to use a set, not a list. Containment tests on sets are far more efficient than containment tests on lists; being roughly O(n) or, at worst, O(log n), due to the different way in which they organize and search their contents. The only real caveat is that sets really don’t like values such as lists and records whose contents may change at any time, as that will corrupt their internal organization (plus they tend not to support things like comparison operators which are needed to organize those values in the first place). But for immutable values like strings and numbers they work great.

Unfortunately, as you’ll have noticed, AppleScript is very bare-bones when it comes to built-in collections: you get a badly designed list (array) type, an inflexible record (struct) type, and that’s it. You could implement your own set structures in AppleScript if you really wanted (e.g. do a web search for a balanced B-tree algorithm and convert that to AppleScript code), but honestly I wouldn’t bother. If you want to learn about algorithms you’d be better using a less eccentric language than AS, and if you just want to get the job done (as ASers do) then you’re best using a pre-built NSSet object which Cocoa (via AppleScript-ObjC) gives you for free.

However, I wouldn’t worry about that too much in this case, since 1500 items isn’t really that much, and you’ll get the biggest boosts by eliminating AS repeat loops where possible and kludge-optimizing them with the script object property trick where not.

The nice thing about Cocoa is it includes a bunch of well designed, highly optimized data structures: NSSet, NSCountedSet, NSDictionary, etc. You don’t need to know ObjC, or even much Cocoa, to use these from AppleScript; while integration was poor in previous OS versions, as of 10.10 you can use Cocoa classes directly in any AppleScript. Learning the interface to something like NSSet, shouldn’t be any harder than learning the interface to, say, Finder or iTunes. No, it’s still not ideal compared to having such features within AppleScript itself (there’s a reason 99% of my own work is done in other languages these days), but that’s the cost of doing business with AppleScript: excellent for application automation; dire for anything else.

Anyway, if/when you do feel like getting acquainted with ASOC and Cocoa, Shane Stanley’s written a couple of books on the subject: Everyday AppleScriptObjC, which covers using basic Cocoa classes from AppleScript, and AppleScriptObjC Explored which goes the whole hog and explains how to write Cocoa-based GUI apps in AppleScript.

Oops. :rolleyes: I actually knew that, as I previously helped Regulus with that very issue; I just wasn’t paying close enough attention. Speaking of not paying attention, I just noticed that hhas’ original solution is actually backwards, as it returns the unique items.


tell application "Contacts" to set {lastNames, firstNames, orgNames} to {last name, first name, organization} of people


set uniqueItems to {}
set duplicateItems to {}

repeat with i from 1 to length of lastNames
	set anEntry to {my item i of lastNames & space & my item i of firstNames, my item i of orgNames}
	if anEntry is not in uniqueItems then
		set end of uniqueItems to anEntry
	else
		set end of duplicateItems to anEntry
	end if
end repeat

return duplicateItems

@ Nigel, appreciate your clarification / correction regarding the coding or use of sorts…I have taken “the easy way out” for now as I wanted to focus on other aspects…with respect to the objective it is simple, list all duplicate contacts…

@ Nigel, March and hhas, appreciate the comments, I will take a close look at them but it will be a day or two as I) I am slammed with work at the moment and ii) I need to resolve the issue of recently discovered phantom records [i.e. “persons” contains more contacts than are listed in my contacts directory].

Thanks again.

If all you want to do is identify duplicate entries, you can build the list a bit like Nigel said, and get the duplicates like this:

use scripting additions
use framework "Foundation"

tell application "Contacts"
	set {lastNames, firstNames, orgNames} to {last name, first name, organization} of every person
end tell

-- Collate the data just obtained into myPeopleList
set myPeopleList to {}
repeat with i from 1 to (count lastNames)
	set end of myPeopleList to {item i of lastNames, item i of firstNames, item i of orgNames}
end repeat

-- a counted set tracks how many times each item is in the set
set anNSCountedSet to current application's NSCountedSet's setWithArray:myPeopleList
-- an ordinary set just includes each item once
set anNSSet to current application's NSSet's setWithSet:anNSCountedSet
-- minusing the second set leaves only values with a count > 1
anNSCountedSet's minusSet:anNSSet
-- get the values back as a list
set dupeEntries to anNSCountedSet's allObjects() as list

A brief update on today’s event to keep this thread moving forward.

  1. As noted above, I greatly appreciate the input of hhas, Nigel and Mark and commit to taking a look at their code and coming back with my comments / findings tomorrow.

  2. In terms of today’s activities they have been somewhat limited because of work commitments but I did manage to resolve / tackle one issue that anyone working with OS X Contacts (“Contacts”) and Microsoft Outlook (“Outlook”) may be interested in.

In testing the portion of the code I had written I noticed that Outlook and Contacts had different record counts which should not be case given that i) all my contacts are stored on Exchange Server [i.e. I store no contacts anywhere else] ii) Contacts and Outlook both act as an Exchange Server client [i.e. so the record count should match].

The Contact record count breaks down as follows:

¢ Total records 1,041
¢ Less Phantom Single records ( 12) *
¢ Less Phantom Out records ( 9) **
¢ Less Phantom Multi records ( 9) ***
¢ Less Apple records u[/u] ****
¢ Total 1009

  • Records which i) contain the only an e-mail address for an Exchange Server contact and ii) do not match the Exchange server contact record because the Exchange Server contact record is populated. I HAVE NO EXPLANATION AS TO HOW OR WHY THESE RECORDS OCCUR.

** Records which contain an e-mail address for a non Exchange Server contact. I HAVE NO EXPLANATION AS TO HOW OR WHY THESE RECORDS OCCUR.

*** Records which i) contain one of multiple e-mail addresses for an Exchange Server contact and ii) do not match the Exchange server contact record because the Exchange Server contact record is populated. I HAVE NO EXPLANATION AS TO HOW OR WHY THESE RECORDS OCCUR.

**** Records added by OS X [i.e. Apple corporate and user].

The Outlook record count breaks down as follows:

¢ Total records 1,020
¢ Less Multiple Category records ( 10)
¢ Less Contact Group records u[/u]
¢ Total 1,009

With the above noted I can safely conclude that both Contact and Outlook contain the same number of contacts though I must admit that confirming / reconciling this discrepancy was neither a fast nor trivial exercise.

  1. What are your comments / thoughts about deleting Contacts’ Phantom records noting that I leaning on i) backing up the Exchange Server contacts ii) deleting the Phantom contacts and iii) testing that their deletion had no impact on the Exchange Server listing given that none of these records have matching Exchange Server records.

I would be interested in your thoughts on this.

Appreciate the above code…though I understand how most of it works I do need to take a closer…with that, I do hope to have some time tomorrow to work on the code and will - to the extent I understand all of the code – insert it into my script!

List duplicates…I have form others different approaches but should you have another that is even more elegant pleae let me know, I am here to learn!

Thanks for all your input.

Love the quote!

Noted, I will look into this over the next few weeks as I hope to have some added time once my vacation starts.

Will be ordering those books as starting points / references…thanks for pointing me in the right direction!

Shane:

Appreciate the code though I am going to have to go through a few pages of your book – see my immediately preceding post – before I understand how / why the code works noting that I am committed to understanding what / why code works because I want to learn!!

FWIW, message #12 above is an example of that very thing.

Shane:

I understand that as well as see the elegance / efficiency in the code that you wrote and would love to be at a point that I could say that I understand what you coded and why it works but I am note there YET…that said, keep the code coming as I am BOTH very curious and determined to pick this stuff up…thanks again!

Here’s an explanation. A set is like a list, except any item can only appear in it once. An NSCountedSet is a special kind of set that also keeps track of how many times an object has been added to it. and if an item has been added more than once, trying to remove it just decrements the object’s count.

So we start off making a counted set from our list, using the set method setWithArray:

set anNSCountedSet to current application's NSCountedSet's setWithArray:myPeopleList

An ordinary set doesn’t keep count – it just stores each item just once. So we make a simple set from the counted set:

set anNSSet to current application's NSSet's setWithSet:anNSCountedSet

Sets have a method for subtracting one set from another. So if we subtract the simple set from the counted set, any items with a count of 1 are removed, and any with a higher count just have their count decremented:

anNSCountedSet's minusSet:anNSSet

So now the counted set has only values that were in the list more than once, and we get that as an AppleScript list by asking for an array (allObjects()) and then coercing it to a list:

set dupeEntries to anNSCountedSet's allObjects() as list

Appreciate the explanation which is crystal clear…now I need to learn how to use the language so I can code the above on my own…there are however a few possible pitfalls, see my next post…