Search long list optimizations..

Hi everyone,

I’m working on something that requires long lists… I got the script below from this post.

That’s a great piece of code but I’m into optimization for my aims. I also have a list of about 9000 items and I’d like to check if some text items are included in the list. I don’t need to know the value of it in the list, I only need to know if it’s in it or not…

Now I have this:

repeat with i in words of theWords
	if LongList contains i then set end of theList to i
end repeat

But I’d like to use this, because it’s much faster.

on BinarySearch(theList, value)
	script o
		property l : theList
	end script
	
	set low to 1
	set high to (count theList)
	repeat until (low = high)
		set mid to (low + high) div 2
		if ({value} is in items low thru mid of o's l) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
end BinarySearch
  1. This works great for me (I know I have to adjust it like the first script in this topic but it’s just a copy paste from the original URL) but I’m wondering if there are any optimizations, I don’t need to know the value of the string in the list, does removing that will give me faster results? Are there any other things I can get rid of ?

  2. Does it matter how my list is sorted ? (alphabetically, short strings to long strings, etc.)

  3. Which place in the list is likely to be found the first? (beginning, middle or end?) Because when I know that the words I’m looking for are in the list, I’d like to move them to that place in the list so they’re found faster than the others when I use it again.

Thanks in advance

Hi,

You don’t need the value, but the algorithm does :wink:

For a alphabetically sorted list use the script in post #9, it’s still faster. The length of the strings doesn’t matter.

Using a binary search it doesn’t matter where the item is located.
http://en.wikipedia.org/wiki/Binary_search_algorithm

Thanks Stefan! I’ll check the script in post 9 out but first I’ll sort my list alphabetically before I can test it. Or am I misunderstanding you, For a alphabetically sorted list use the script in post #9 do you mean if my list happens to be alphabetically sorted I’ll be best off with #9 or do you mean that for best results I should list it alphabetically and use the script in post #9?

I already went to http://en.wikipedia.org/wiki/Binary_search_algorithm
but in http://en.wikipedia.org/wiki/Search_algorithm I read this :

I went to the articles to see if there were any faster ways of searching thru lists…

Thanks :slight_smile:

the script in post #9 works only with alphabetically sorted lists.
It’s faster, because it can check greater than or less than, which saves some cycles of code.
The unsorted binary search can check only contains or does not contain

Anayway, the binary sort algorithm is the fastest algorithm at all (without using helper tools like hash tables)

Ok, thanks alot :slight_smile:

The two scripts do different things. The first appends something to a list if it corresponds to a word of theWords (which is presumably text) that appears in LongList (which is presumably an AppleScript list). The second finds the index in a list of the first appearance of a given value. If you want to achieve the first end, the second script isn’t the tool for the job.

The repeat in the first script is very slow because the value of i isn’t a word but a reference to an ‘item of every word of theWords’. This reference has to be resolved for each check against LongList ” ie. get every word of theWords, compare item so-and-so of the result with the contents of LongList. The repeat can be speeded up enormously by resolving ‘words of theWords’ first (ie. once only) and making i simply a reference to an item in the resulting list.

repeat with i in (get words of theWords)
	if LongList contains i then set end of TheList to i
end repeat

However, in both cases, it’s a reference that’s going into TheList, not text. The end of TheList has to be set to ‘contents of i’.

Slightly faster still, using the script object approach, would be something like this:

on getCommonWords(theWords, LongList)
	script o
		property wrds : words of theWords
		property TheList : {}
	end script
	
	repeat with i from 1 to (count o's wrds)
		set thisWord to item i of o's wrds
		if thisWord is in LongList then set end of o's TheList to thisWord
	end repeat
	
	return o's TheList
end getCommonWords

set TheList to getCommonWords(theWords, LongList)

I assume you mean post #9 in the other thread to which James refers. The handler there is the same as the second one above, which works equally well with sorted lists, unsorted lists, and mixed-class lists (except that it’s not suitable here). :slight_smile:

:slight_smile:

I was rather working on something like this :

set otherList to {}
repeat with i in words of theWords
	if BinarySearch(LongList, i as text) ≠ false then set end of otherList to i as text
end repeat

on BinarySearch(TheList, value)
	script o
		property l : TheList
	end script
	
	set low to 1
	set high to (count TheList)
	repeat until (low = high)
		set mid to (low + high) div 2
		if ({value} is in items low thru mid of o's l) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
end BinarySearch

So you think that if I use the script you mention, that it has the same results in speed, alphabetically or not ? Because I can just sort my list alphabetically before using my script and then use the fastest one, not that my script always has to sort the list over and over again.

Thanks

The BinarySearch() handler you were considering was designed to find the exact position of a value in a long, random list as quickly (on average) as possible. But the use to which you want to put it is simply to find out if the value’s in the list at all. This is more quickly achieved with a single ‘contains’ or ‘is in’ command than with the several that are performed by the handler.

If your list is alphabetically sorted, a better binary handler for your purpose would be the following, which tests relative values (as Stefan was suggesting) rather than existence in a subset of the list:

-- Return whether or not (true or false) an ordered list contains a given value.
on binaryIsIn(thelist, value)
	script o
		property l : thelist
	end script
	
	set low to 1
	set high to (count thelist)
	repeat until (low = high)
		set mid to (low + high) div 2
		set v to item mid of o's l
		if (v is value) then
			return true -- Value found.
		else if (v > value) then
			set high to mid - 1
		else
			set low to mid + 1
		end if
	end repeat
	
	return (item low of o's l is value) -- Value found/not found.
end binaryIsIn

This gives better speeds than ‘contains’ for items toward the end of a very long list or which aren’t in the list at all, but is still slower than it for items toward the beginning. Speeds for individual values are still unpredictable, but it may turn out to be statistically faster for your purposes.

What I said about the reference in your repeat loop still applies.

Thanks Nigel ! :slight_smile:
My list of 10070 items (It will be around 11000 items when done) processes it in less than a second :slight_smile: I already sorted the list alphabetically but I haven’t excluded the duplicates…

Anyway, I also used

set otherList to {}
repeat with i in words of theString
	if binaryIsIn(get paragraphs of (read file (POSIX file "/path/to/text.txt")), i as text) then set end of otherList to i as text
end repeat

:slight_smile:

-- Resolve these before the repeat so that they're only executed once.
set LongList to paragraphs of (read (POSIX file "/path/to/text.txt"))
set theWords to words of theString

set otherList to {}
repeat with i in theWords
	set thisWord to i's contents
	if binaryIsIn(LongList, thisWord) then set end of otherList to thisWord
end repeat

Oops :rolleyes:
My fault :slight_smile:
Thanks ! I’ll test it out! It’s getting faster and faster every time (probably because my list has 5000 duplicate items less but anyway, every millisecond counts :D)