search a very long list for an item

Hi,

Is there a fast way to return the item number of a certain item in a very long list?
For short list it is possible to loop through the code but for a list of 1000 items this is not very fast.

many thanks,

gecko

Hi gecko,

unfortunately ApplScript can’t search a list by its index number without parsing the whole list,
a binary search algorithm are a few lines more but should be much faster.

I’m not a mathematician at all and this is only a quick amateurish attempt of such an algorithm,
but it works :slight_smile:

set value to "g"
set theList to {"a", "b", "c", "d", "e", "f", "g", "h"}
display dialog ((BinarySearch(theList, value)) as string)

on BinarySearch(theList, value)
	set {low, high} to {1, count theList}
	if {value} is not in theList then return false
	copy high to lasth
	set high to low + ((high - low) div 2)
	repeat until high < low
		if {value} is in items low thru high of theList then
			if high = low then exit repeat
			copy high to lasth
			set high to low + ((high - low) div 2)
		else
			set low to high + 1
			set high to lasth
		end if
	end repeat
	return low
end BinarySearch

Hi gecko,

For quick access to list items, store your list in a script object. e.g.


-- just getting a big list
script S
	property l : {}
	-- just getting a list of words from dictionary
	set l to paragraphs of (do shell script "look a")
end script

run S
set search_string to "anteater"
--set t1 to the ticks
set c to count l of S
set item_index to 0
repeat with i from 1 to c
	set this_item to item i of l of S
	if this_item is search_string then
		set item_index to i
		exit repeat
	end if
end repeat
--set t2 to the ticks
--display dialog (t2 - t1)
{item_index, item item_index of l of S}

With this method, search_string can be any type. You can speed it up even faster if the list contains only strings of certain length, I think.

gl,

wow! very fast!! :smiley:

thanks for the help,

gecko

Hi kel,

with Nigel Garvey’s “Lotsa” routine the binary search is approx. 8 times faster than parsing the list
(“Lotsa” requires GetMilliSec)

timeLotsa(20) -- Nigel Garvey's "Lotsa" routine for timing.

-- The Lotsa handler for comparison

on timeLotsa(lotsa)
	script S
		property l : {}
		-- just getting a list of words from dictionary
		set l to paragraphs of (do shell script "look a")
	end script
	
	run S
	set search_string to "anteater"
	
	repeat lotsa times
	end repeat
	
	-- Test 1.
	set t to GetMilliSec
	repeat lotsa times
		set {low, high} to {1, count S's l}
		copy high to lasth
		set high to low + ((high - low) div 2)
		repeat until high < low
			if {search_string} is in items low thru high of S's l then
				if high = low then exit repeat
				copy high to lasth
				set high to low + ((high - low) div 2)
			else
				set low to high + 1
				set high to lasth
			end if
		end repeat
		low
	end repeat
	set t1 to ((GetMilliSec) - t) / 1000
	
	-- Test 2.
	set t to GetMilliSec
	repeat lotsa times
		repeat with i from 1 to count S's l
			if search_string is item i of S's l then exit repeat
		end repeat
		i
	end repeat
	set t2 to ((GetMilliSec) - t) / 1000
	
	-- Timings.
	return {t1, t2, t1 / t2, t2 / t1} -- both ratios if you don't know which is faster
end timeLotsa

Hi Stefan,

Isn’t a lot of the speed because the example uses a sorted list of strings? I don’t know a lot about sorting, but think I read that insertion sort is very fast on a sorted list. Wonder if this is true. Personally, with big list I think I would keep a lookup table for getting index to list items, but big lists are dangerous to work with anyway. I try looking at the binary sort.

Thanks,

No, a binary search routine in a sorted list is still faster because you compare only with “<” and “>” operators
for example this is a common binary search routine if the list is sorted

on BinarySearch(theList, value)
	set {low, high} to {1, count theList}
	set p to low + ((high - low) div 2)
	repeat while low is less than or equal to high
		if item p of theList > value then
			set high to p - 1
		else if item p of theList < value then
			set low to p + 1
		else
			return p
		end if
		set p to low + ((high - low) div 2)
	end repeat
	return false
end BinarySearch

Hi Stefan,

Yeah I got search and sort mixed up. I remember binary search now. It’s simplicity made it easy to remember.

Thanks,

Hi, both.

The binary search is only about three times as fast on my machine, but that’s still pretty good! Thanks, Stefan. :slight_smile:

Here’s a slightly optimised version. Besides the obvious script object, the ‘high’ and ‘low’ variables are handled more economically. Also, on the theory that the item’s more likely to be in the list than not, that test’s left to the end, when there’s only one item to test:

BinarySearch(get paragraphs of (do shell script "look a"), "anteater" as Unicode text)

on BinarySearch(TheList, value)
	script o
		property l : TheList
	end script
	
	set low to 1
	set high to (count TheList)
	repeat until (low = high)
		set mid to (low + high) div 2
		if ({value} is in items low thru mid of o's l) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
end BinarySearch

Very neat, Nigel.

I think, you can often save an extra comparison loop, if you decrement the high value like incrementing the low one.
For this change the repeat condition has also to be adapted

on BinarySearch(theList, value)
	local low, mid, high
	script o
		property l : theList
	end script
	
	set low to 1
	set high to (count theList)
	repeat while (low ≤ high)
		set mid to (low + high) div 2
		if ({value} is in items low thru mid of o's l) then
			set high to mid - 1
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
	
end BinarySearch

Hi, Stefan.

Just looking at this quickly, I suspect that it might not find the item if the item’s exactly at a ‘mid’ position. I have to go out now, but I hope to look at this subject again tonight. As with a linear seach, the time taken by a binary search depends on whereabouts in the list the item actually is. For instance, although the binary search is much faster at finding “anteater” in Kel’s list, the linear search seems to have a slight advantage when looking for “aardvark”.

This is obvious, as long as the position of the item to find is less than the number of loops plus the code overhead of the binary search.
By contrast the binary search is incredibly faster if the position is near to the end of the list.
On average a binary search is always faster than a linear search

Hi, Stefan.

I think I agree with that. And when the linear search is faster, it’s not by very much, so the binary search is generally a safe bet.

Contrary to my intial guess, your effort to save the extra comparison loop does find items in ‘mid’ positions. But it doesn’t, on average, in my tests, save any time. Sometimes it does fewer loops than my optimisation, sometimes more, sometimes the same number. The totals for the two methods over the first (list length - 1) items seem to be exactly the same, but finding the very last item in the list always takes one more loop with your optimisation than with mine.

Test method: log {low, mid, high} on each loop and call the handler with a logged loop that finds every item in the list. (The items must be unique, of course.) Use similar scripts for the two optimisations, vary the length of the input list, compare the event logs.

on BinarySearch(theList, value)
	script o
		property l : theList
	end script
	
	set low to 1
	set high to (count theList)
	repeat until (low = high)
		set mid to (low + high) div 2
		log {low, mid, high}
		if ({value} is in items low thru mid of o's l) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	
	if (item low of o's l is value) then return low
	return false
end BinarySearch

set theList to {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "m", "o", "p"}
repeat with i from 1 to (count theList)
	log BinarySearch(theList, item i of theList)
end repeat

You’re right, Nigel, your script is a bit faster.

Comparing directly both scripts with Lotsa (1000 loops and a list of 1000 items)
your version takes 49.066 seconds and mine 49.566 seconds

For multiple occurences in the list this modification should work:

set theList to {"a", "b", "c", "d", "e", "f", "a", "g", "h", "i", "j", "k", "a", "l", "m", "m", "o", "p"}
MultiBinarySearch(theList, "a", 0)

on MultiBinarySearch(theList, value, start)
	script theScript
		property refList : theList
	end script
	if {value} is not in theScript's refList then return false
	set indexes to {}
	set low to 1
	set high to (count theScript's refList)
	repeat until (low = high)
		set mid to (low + high) div 2
		--log {low, mid, high}
		if ({value} is in items low thru mid of theScript's refList) and ({value} is in items (mid + 1) thru high of theScript's refList) then
			set indexes to indexes & MultiBinarySearch(items (mid + 1) thru high of theScript's refList, value, (mid + 1))
			set high to mid
		else if ({value} is in items low thru mid of theScript's refList) then
			set high to mid
		else
			set low to mid + 1
		end if
	end repeat
	if (item low of theScript's refList is value) then
		set beginning of indexes to (low + start)
		return indexes
	end if
end MultiBinarySearch
--result: {1, 14, 8}

(Results are not in chronicle order, though.
Some routine like QuickSort would have to do the rest:
http://bbs.applescript.net/viewtopic.php?id=17340)

To search for multiple occurrences of one value in a list, use this one, rather:


set theValue to "a"
set theList to {"a", "b", "c", "d", "e", "f", "a", "g", "h", "i", "j", "k", "a", "l", "m", "m", "o", "p"}
Qsearch(theList, 1, count theList, theValue)
to Qsearch(array, leftEnd, rightEnd, theValue)
	script A
		property l : array
	end script
	set theIndexes to {}
	set {i, j} to {leftEnd, rightEnd}
	set v to ((leftEnd + rightEnd) div 2) -- pivot in the middle
	repeat while (j > i)
		repeat while (i is not greater than v)
			if item i of A's l = theValue then
				set theIndexes to theIndexes & {i}
				exit repeat
			else
				set i to i + 1
			end if
		end repeat
		repeat while (j is greater than v)
			if item j of A's l = theValue then
				set theIndexes to theIndexes & {j}
				exit repeat
			else
				set j to j - 1
			end if
		end repeat
		if (i is not greater than j) then
			set {i, j} to {i + 1, j - 1}
		end if
	end repeat
	if (leftEnd < j) then Qsearch(A's l, leftEnd, j, theValue)
	if (rightEnd > i) then Qsearch(A's l, i, rightEnd, theValue)
	theIndexes
end Qsearch

-- The Hoare's QuickSort algorithm will put the result in sequential order.

– Result: {1, 13, 7}

Thanks to Vincent and mghilissen for the idea! This version returns the indices in the right order:

on MultiBinarySearch(theList, value, l, r)
	script o
		property lst : theList
		property indices : {}
		
		on mbsrch(l, r)
			if (item l of my lst is value) then set end of my indices to l
			
			set l to l + 1
			set m to (l + r) div 2
			if (m ≥ l) and ({value} is in items l thru m of my lst) then mbsrch(l, m)
			
			set m to m + 1
			if (r ≥ m) and ({value} is in items m thru r of my lst) then mbsrch(m, r)
		end mbsrch
	end script
	
	o's mbsrch(l, r)
	
	return o's indices
end MultiBinarySearch

set theList to {"a", "b", "c", "d", "e", "f", "a", "g", "h", "i", "j", "k", "a", "l", "m", "m", "o", "p", "a", "e"}
MultiBinarySearch(theList, "a", 1, (count theList))
--> {1, 7, 13, 19}

[i]Edit: Simplified the code to clarify its working and to remove an “optimisation” that was actually having slightly the opposite effect!

I’ve had a chance to study Vincent’s and mghilissen’s scripts this morning. I hope they won’t mind if I comment here.

Vincent’s script returns the right number of indices, but those returned from lower recursion levels aren’t correct. The value passed to ‘start’ in the recursive calls should be ‘(mid + start)’, not ‘(mid + 1)’.

mghilissen’s script is based on the Quicksort algorithm and is actually a series of converging linear searches that find the first instance of the value, then the last, then the second, then the second-to-last, and so on. For this reason, it’s slightly slower than a single linear search straight through the list. The two recursive calls at the end contribute nothing to the final result but add a terrific amount to the running time.

Binary searches are only effective “ and are only necessary “ because AppleScript’s a “high level” language. The ‘is in’ command actually carries out a linear search itself, but at a lower coding level. This is much faster than using high-level AppleScript commands to carry out the individual details of a linear search. (In much the same way, a person will walk across the room much faster if you tell them to walk across the room than if you give them individual instructions for lifting each leg, shifting their weight forward, and falling in a controlled manner back onto the foot of the current leg. There’s probably a similar allegory to do with Government interference and the National Health Service, but I won’t go into that here. :wink: )[/i]

I am absolutely delighted to take lessons from you, Nigel. Only perfect practice makes perfect. Thanks for your insight.

Michael

This is awesome, Nigel :slight_smile:

Ditto!:D:cool: