Surprising result randomizing long lists...

Adam_Bell · December 14, 2006, 1:21am

I originally needed to randomize a list that might have contained duplicate items, and the handler called Rand(tList) near the bottom of the script below was how I did it (knowing that ‘some item’ was faster than getting random numbers and checking for duplicates, but needing a list without duplicates for testing the “somes items” for uniqueness). When I was certain that the list did not contain any duplicates, I presumed that dropping the index list from Rand(tList) would speed things up, and Shuffle(tList) was intended to be faster. As it happens, Shuffle takes more than 40 times as long to run 20 trials of randomizing a list (the same list) of 260 unique text items than Rand does (On mm, Rand completes the task in 1.6 seconds, Shuffle takes about 70.

I’m assuming, but don’t know for sure, that the phenomenally longer time for the simpler procedure is a result of comparing text as opposed to comparing integers, or that I made a mistake. The script below shows the two forms as the handlers following the “Garvey timeLotsa” comparison handler. Barring a missed glitch, I propose that [b]Rand/b is the best way of producing a randomized list of any I’ve encountered so far.


-- Make a big list -- not timed

property tList : {} -- 260 unique alphameric items
set AZ to {}
repeat with j from 97 to 122
	set AZ's end to ASCII character j
end repeat
repeat with Ch in AZ
	repeat with i from 0 to 9
		set end of tList to Ch & i
	end repeat
end repeat
tList --> {"a0", "a1", "a2", "a3", ... , "z7", "z8", "z9"}

timeLotsa(20) -- Nigel Garvey's "Lotsa" routine for timing.

-- The Lotsa handler for comparison

on timeLotsa(lotsa)
	-- Any other preliminary values here.
	
	-- Dummy loop to absorb a small observed
	-- time handicap in the first repeat.
	repeat lotsa times
	end repeat
	
	-- Test 1.
	set t to GetMilliSec
	repeat lotsa times
		-- First test code or handler call here.
		set R1 to my Rand(tList)
	end repeat
	set t1 to ((GetMilliSec) - t) / 1000
	
	-- Test 2.
	set t to GetMilliSec
	repeat lotsa times
		-- Second test code or handler call here.
		set R2 to my shuffle(tList)
	end repeat
	set t2 to ((GetMilliSec) - t) / 1000
	
	-- More test loops here if required.
	
	-- Timings.
	return {t1, t2, t1 / t2, t2 / t1} -- both ratios if you don't know which is faster
end timeLotsa

---- The tested handlers, Rand & Shuffle:

to Rand(OL)
	-- Method for randomizing a long list that may contain duplicate entries.
	script L -- force all lists into memory for easy access
		property O : OL
		property idx : {} -- list of unique integers
		property k : {} -- the built index list for comparison
		property r : {} -- the randomized original list
	end script
	set c to count L's O -- so as not to calculate it in every repeat
	-- make index list of unique integers for "contents" comparison
	repeat with j from 1 to count L's O
		set end of L's idx to j
	end repeat
	-- randomize both L's k and L's r
	repeat until (count L's k) = c
		tell some item of L's idx to if it is not in L's k then
			set end of L's k to it
			set end of L's r to (L's O)'s item it
		end if
	end repeat
	return L's r
end Rand

to shuffle(OL) -- shuffle a list of unique items without index list.
	script M -- force both lists into memory for easy access
		property Lst : OL -- the given list
		property nL : {} -- the randomized list
	end script
	set c to count M's Lst
	repeat until (count M's nL) = c
		tell (some item of M's Lst) to if it is not in M's nL then set (M's nL)'s end to it
	end repeat
	return M's nL
end shuffle

Nigel_Garvey · December 14, 2006, 9:17am

Hi, Adam.

I think you’re probably right. In Rand, you’re checking for an integer in a list of integers and in shuffle it’s strings. You can speed up shuffle considerably by putting its internal repeat in a ‘considering case’ block, but the actual time taken depends on how many random choices the handlers have to make before every item in the input list has been chosen at least once. Both handlers continue until the output list is the same length as the input list, which is presumably because you know there are no duplicates to weed out. (Rand only checks that the indices aren’t used more than once, not that there are no duplicate items in the input list.)

I can’t offer any alternatives at the moment as I have to rush out. A school teacher friend of mine rang in a panic last night, needing someone to play in a carol service tonight. There’s a rehearsal this morning…

Nigel_Garvey · December 14, 2006, 2:50pm

Hi, Adam.

Looking at this again, it occurs to me that you may not actually have wanted to weed out duplicates, but to find a randomising method that wasn’t put off by the presence of duplicates in the list. Your Rand handler’s great for that, of course, though it still suffers from the problem of having to keep randomly choosing until it’s randomly chosen every index in the index list.

A known fast method for randomising a list is to work through it from left to right, exchanging each item encountered with another item chosen at random. This way, only one random choice needs to be made with each item in the list. Combining it with your ‘some item’/random index idea gives this:

to Rand2(OL)
	-- Method for randomizing a long list that may contain duplicate entries.
	script L -- force all lists into memory for easy access
		property O : OL
		property idx : {} -- list of unique integers
		property r : O's items -- A duplicate of the original list. (No need to do a "deep" copy with 'copy'.)
	end script
	
	set c to count OL -- so as not to calculate it in every repeat
	-- make index list of unique integers.
	repeat with j from 1 to c
		set end of L's idx to j
	end repeat
	-- exchange each item in the duplicate list with another whose index is chosen at random.
	repeat with i from 1 to c
		set x to some item of L's idx
		tell item i of L's r
			set item i of L's r to item x of L's r
			set item x of L's r to it
		end tell
	end repeat
	
	return L's r
end Rand2

Nigel_Garvey · December 15, 2006, 11:04am

Another variation, nearly as fast as the “random exhange” one, is to make use of the fact that the randomly chosen indices can be used on the index list as well. Each index can be excluded from further draws as it’s used.

to Rand1point9(OL)
	-- Method for randomizing a long list that may contain duplicate entries.
	script L -- force all lists into memory for easy access
		property O : OL
		property idx : {} -- list of unique integers
		property r : {} -- The randomised list.
	end script
	
	set c to count OL -- so as not to calculate it in every repeat
	-- make index list of unique integers.
	repeat with j from 1 to c
		set end of L's idx to j
	end repeat
	-- append an item from the input list, whose index is chosen at random
	-- from the index list, to the randomised list. Then neutralise the index.
	set mv to missing value
	repeat c times
		set x to some integer of L's idx -- NB. 'some integer'.
		set end of L's r to item x of L's O
		set item x of L's idx to mv -- replace the index with 'missing value'
	end repeat
	
	return L's r
end Rand1point9

Obviously, if there are more than 536870911 items in the list, ‘some integer’ will need to be changed to ‘some number’.

Adam_Bell · December 15, 2006, 3:57pm

:lol: Rand2 is 18.5 times faster than my Rand, Nigel. Very Neat. Why would one want to use the 1.9 version? Has it advantages beyond speed?

A related problem: At post #6 et seqq. of this thread, also in Code Exchange, I proposed to Regulus that my Rand was 7.5 times faster than his adaptation of Emmanual Levy’s algorithm for randomizing a list that might contain duplicates. I get that result consistently.

When he runs the test on his machine, he gets a much smaller ratio - on his machine EL’s version runs 4 times faster than on mine. Can you think of a reason for that? (or given time and interest, will you try it yourself and propose your Rand2 there?).

Addendum: I’ve found the answer. Script Editor delivers his results for me too. On Rand2 tested in SE, I get a ratio of 16.8. Apparently, the “environment” of SD4 interferes with the test. Damn! :mad:

Nigel_Garvey · November 30, 2008, 1:26pm

I’ve just come across this thread again this morning.

It was pointed out to me earlier this year in another forum that the method I called Rand2 above, although an entirely random process, produces biased results. Allowing items to be selected possibly more than once means that, for a list containing n items, there are n ^ n different ways the process can play out, one of which will be realised. However, there are only n! possible end results. Since n ^ n can never be an exact multiple of n! (according to those who know these things), it follows that some end results have more ways leading to them than others and are therefore more likely to be produced.

Rand1point9, however, removes items from the process as they’re selected, which removes the bias and makes every possible end result equally likely.