Efficient Vanilla AppleScript string replicator

Replicating a character or a string a given number of times is a task that comes up repeatedly in scripting. The handler presented in this post is a Vanilla AppleScript solution that achieves logarithmic execution efficiency through a combination of the progressive binary expansion of the input string into a list of intermediate strings and a memoization-like selective incorporation of the saved intermediate strings into the final replicated string. It can handle large replication numbers efficiently, for example, replicating an input string 100,000 times in < 0.002 seconds on my machine.

Handler:


on replicateString(theString, nReps)
	-- Validate the input arguments
	if theString's class ≠ text then error "The first input argument is not a text string."
	tell nReps to if ({its class} is not in {integer, real}) or ((it mod 1) ≠ 0) then error "The second input argument is not an integer."
	-- Handle the special cases where the input string = "" or the number of replications is ≤ 0
	if (theString's length = 0) or (nReps ≤ 0) then return ""
	-- Repeatedly double the input string until the next doubling would exceed the target number of replications, and save the intermediate replication counts and their associated strings
	set {nRepsEntries, replicatedStringEntries} to {{1}, {theString}}
	repeat
		tell 2 * (nRepsEntries's last item)
			if it > nReps then exit repeat
			set {end of nRepsEntries, end of replicatedStringEntries} to {it, replicatedStringEntries's last item & replicatedStringEntries's last item}
		end tell
	end repeat
	-- Loop through the saved intermediate replication counts and their associated strings in reverse order beginning with the second-to-last list item, add to the last list item those intermediate values that do not cause the target number of replications to be exceeded, and quit when the target number of replications has been reached
	if nRepsEntries's last item < nReps then
		repeat with i from ((nRepsEntries's length) - 1) to 1 by -1
			tell ((nRepsEntries's last item) + (nRepsEntries's item i)) to if it ≤ nReps then set {nRepsEntries's last item, replicatedStringEntries's last item} to {it, (replicatedStringEntries's last item) & (replicatedStringEntries's item i)}
			if nRepsEntries's last item = nReps then exit repeat
		end repeat
	end if
	-- Return the replicated string, which is the last item in the list of intermediate replicated strings
	return replicatedStringEntries's last item as text
end replicateString

replicateString("x", 10) --> "xxxxxxxxxx"

replicateString("123", 5) --> "123123123123123"


Execution speed tests were performed comparing the current handler with the following ASObjC string replication command:


(((current application's NSString)'s stringWithString:"")'s stringByPaddingToLength:(nReps * (theString's length)) withString:(theString) startingAtIndex:(0)) as text

Each solution was stored as a handler in ~/Library/Script Libraries and executed by calling the handler from the test script. The mean of 1000 trials for each data point was calculated, and the median of three such trials was taken as the final data point. The following are the execution times for theString = “x” and varying nReps (# replications of “x”):

nReps = 1:
current handler = 0.000015 secs, ASObjC = 0.00018 seconds
→ current handler ~12x faster
nReps = 10:
current handler = 0.000047 secs, ASObjC = 0.00018 seconds
→ current handler ~3.8x faster
nReps = 100:
current handler = 0.000074 secs, ASObjC = 0.00017 seconds
→ current handler ~2.3x faster
nReps = 1000:
current handler = 0.00012 secs, ASObjC = 0.00019 seconds
→ current handler ~1.6x faster
nReps = 10,000:
current handler = 0.00030 secs, ASObjC = 0.00029 seconds
→ current handler ≈ ASObjC
nReps = 100,000:
current handler = 0.0019 secs, ASObjC = 0.0019 seconds
→ current handler ≈ ASObjC
nReps = 1,000,000:
current handler = 0.032 secs, ASObjC = 0.019 seconds
→ ASObjC ~1.7x faster

As is seen, the current handler outperforms the ASObjC solution in execution time for # replications in the range of 1 to 1000, which likely comprises most scripting situations where string replication is required. For the unusual situations where greater replication values are needed, the current handler matches the ASObjC solution for # replications in the range of 10,000 to 100,000 and is outperformed by ASObjC by a modest amount for # replications in the range of 1,000,000.

Some observations:

  • AS’s string length is not necessarily the same as an NSString’s length. Your code should probably be more like:
	set theString to current application's NSString's stringWithString:theString
	(theString's stringByPaddingToLength:(nReps * (theString's |length|())) withString:theString startingAtIndex:0) as text

That’s going to be a fraction slower.

  • Particularly in the AS case, the length of the string can have a significant impact on the time taken.

  • If you run the ASObjC version with repetitions of 1 and 10000, the time taken doesn’t even double. This suggests that what you’re mostly timing in this case is the overhead of passing and converting back values.

If you have AST installed:

AST copy string "hello world!" pad to length 100000 pad string "x"

Here’s a slightly more efficient implementation of bmose’s already very efficient ( :cool: ) method. On my machine, it just manages to outrun AST copy string with short strings replicated an indecent number of times, but takes about the same time when the source strings get up to about ten characters in length:

on replicateString(theString, nReps)
	-- Validate the input arguments
	if theString's class ≠ text then error "The first input argument is not a text string."
	tell nReps to if ({its class} is not in {integer, real}) or ((it mod 1) ≠ 0) then error "The second input argument is not an integer."
	-- Handle the special cases where the input string = "" or the number of replications is ≤ 0
	if ((count theString) * nReps ≤ 0) then return ""
	-- Repeatedly double the input string until the next doubling would exceed the target number of replications, and save the intermediate replication counts and their associated strings
	script o
		property nRepsEntries : {1}
		property replicatedStringEntries : {theString}
	end script
	set repsTest to 2
	repeat until (repsTest > nReps)
		set beginning of o's nRepsEntries to repsTest
		set theString to theString & theString
		set beginning of o's replicatedStringEntries to theString
		set repsTest to repsTest * 2
	end repeat
	-- Use the highest replication count obtained to begin a subtotal for the number of reps to achieve.
	set repsSubtotal to beginning of o's nRepsEntries -- or: set repsSubtotal to repsTest div 2
	if (repsSubtotal < nReps) then
		-- The highest replication count is less than the target. Loop through and sum the remaining saved intermediate replication counts. Where a subtotal would exceed the target, zap the string corresponding to the count entry, otherwise accept the subtotal.
		repeat with i from 2 to (o's nRepsEntries's length)
			set subtotalTest to repsSubtotal + (item i of o's nRepsEntries)
			if (subtotalTest > nReps) then
				set item i of o's replicatedStringEntries to missing value
			else
				set repsSubtotal to subtotalTest
			end if
		end repeat
		-- Coerce the surviving strings to a single text.
		set astid to AppleScript's text item delimiters
		set AppleScript's text item delimiters to {""}
		set replicatedString to o's replicatedStringEntries's text as text
		set AppleScript's text item delimiters to astid
	else
		-- The highest replication count is equal to the target, so the corresponding string is the required result.
		set replicatedString to beginning of o's replicatedStringEntries -- or: set replicatedString to theString
	end if
	
	return replicatedString
end replicateString

replicateString("123", 1000000) -- AST copy string "123" pad to length 3000000 pad string "123"
replicateString("1234567890", 1000000) -- AST copy string "1234567890" pad to length 10000000 pad string "1234567890"

Thank you for pointing that out. I wasn’t aware of the difference. Can you please show an example so that I can understand the difference better?

For # replications up to 1000, the ASObjC solution’s execution time doesn’t change at all. That strongly supports your point about the small overhead cost of the Cocoa-Applescript bridge.

[Important edit note: The overhead and execution times that were originally posted below were obtained from a test script in which both ASObjC and AST were tested along side the current handler in the same script. For an unknown reason, both ASObjC and AST demonstrated dramatically slow execution times under those circumstances (gotta love AppleScript!) When the comparison tests were separated out into two different scripts (current handler vs ASObjC in one script, current handler vs AST in a second script), more believable and clearly far more accurate execution times were observed, and they are posted below along with appropriate modifications to the discussion. One consistent finding was that ASObjC and AST demonstrated nearly identical overhead and execution times in all cases tested.]

That is a nice, compact solution. I did notice in timing tests that there seems to be a fixed overhead cost of about 0.00025 seconds on my machine for an AST call even for a string of length 1 replicated one time, very similar to the overhead of an ASObjC call. On the other hand, the AST command execution time does not exceed that overhead time for # replications up to 10,000 or more, suggesting that the underlying replication algorithm is very efficient.

In the following timing tests, input string length seemed to affect ASObjC (and nearly equivalently AST) more than the current Vanilla AppleScript solution for post-replication string lengths in the commonly used range (≤10,000 characters in the example below; ASObjC and AST did execute more quickly than the current handler for post-replication string lengths in the 100,000 to 10,000,000 range). The tests consisted of replicating an input string of 100 random characters the indicated number of times. AST times are not shown but were quite similar to those shown for ASObjC below:

nReps = 1, final string length = 100:
current handler = 0.000035 secs, ASObjC = 0.00020 secs
→ current handler 5.7x faster than ASObjC
nReps = 10, final string length = 1000:
current handler = 0.000078 secs, ASObjC = 0.00020 secs
→ current handler 2.6x faster than ASObjC
nReps = 100, final string length = 10,000:
current handler = 0.00017 secs, ASObjC = 0.00021 secs
→ current handler 1.2x faster than ASObjC
nReps = 1000, final string length = 100,000:
current handler = 0.00063 secs, ASObjC = 0.00036 secs
→ ASObjC 1.8x faster than current handler
nReps = 10,000, final string length = 1,000,000:
current handler = 0.017 secs, ASObjC = 0.0035 secs
→ ASObjC 4.9x faster than current handler
nReps = 100,000, final string length = 10,000,000:
current handler = 0.21 secs, ASObjC = 0.050 secs
→ ASObjC 4.2x faster than current handler
nReps = 1,000,000, final string length = 100,000,000:
as they say in New Jersey, fuggedaboutit

True, the overhead of any remote AppleEvent is quite expensive when it comes to execution time. 0.025 sec is pretty high though, even for an old machine compared to < 0.001 sec on my machine. I guess the ASObjC and Vanilla solutions has to run much faster on my machine as well. As with almost every remote AppleEvent 99,99% of the total execution time is actually “user” time and 0.01% or less is machine time.

I can explain AST’s behavior. On top of the constant overhead of an AppleEvent, the size of the parameters affects performance as well. Sending 10 million characters is a 20MB AppleEvent descriptor that has to be handled and validated (read:strings are stored as UTF-16). That’s in AE terms quite a lot of data imaging that AE was written at a time when computers didn’t had 20MB RAM.

DJ, my apologies for the previously posted execution times for AST. Something truly bizarre happened in my initial tests in which the current handler and the ASObjC and AST solutions were all tested in the same timing test script. Both ASObjC and AST were found to slow down dramatically when tested together. I have no idea why that should have happened. But when I separated out the tests into two test scripts, current handler vs ASObjC in one script, current handler vs AST in a second script, far more believable execution times were observed. AST times were found to match those of ASObjC in virtually all cases tested. I edited my previous post, replacing the bizarre numbers with the no doubt far more accurate execution times, along with comments about those execution times. In short, (1) AST was found to very nearly equal ASObjC in both overhead (~0.00025, not 0.025, seconds!) and execution times, and (2) the current handler held its own vs ASObjC and AST even with a larger input string 100 characters in length up to a final replicated string length of about 10,000. Both ASObjC and AST did outperform the current handler for final replicated strings length in the 100,000 to 10,000,000 range.

Essentially NSString stores values in 16-bit units, and length is calculated accordingly. AppleScript, on the other hand, returns counts in what we loosely call characters, but are actually grapheme clusters.

Differences occur when you have Unicode characters outside the Basic Multilingual Plane (rare), or where combining characters are used (some accented combinations, some emoji, some special language features). In these cases, what AS calls a character occupies several 16-bit units.

That’s why the NSString APIs mostly refer to strings, rather than characters.

bmose, where are you doing the timings?

FWIW, I use a rule of thumb that if anything takes less than about half a millisecond, there’s unlikely to be any benefit speed-wise from trying ASObjC. Of course when you get down to those times, anything that doesn’t get called repeatedly is unlikely to make much impact on overall speed anyway.

BTW, thanks for posting your code – it never occurred to me to use the padding method to do this. The only time I build long strings like this is when testing other code, and in this case convenience will definitely win out.

Thank you for the explanation. I’m sure there are good reasons that Cocoa deals primarily in lower level representations of “characters”. My lazy mind prefers thinking of one emoji or one accented combination as one “character” and am glad that AppleScript has abstracted the lower levels details into grapheme clusters. (I know this is almost sacrilegious to say, but I also prefer the 1-based indexing of AppleScript and Matlab, among other languages, over the 0-based indexing of 99% of the computer world. Did 0-based indexing originate because that’s how computers, not humans, count?)

Thank you for your kind words. The idea came to me after recently listening to an MIT OpenCourseWare lecture on programming efficiency in which the lecturer demonstrated how memoization and dynamic programming could be used to dramatically increase the speed of computing Fibonacci numbers. The current solution is an attempt to extend those ideas to the problem of replicating a string.

The tools I used in the past for measuring script execution time (the Extra Suites ES milliseconds osax command in the old days, more recently the LapTime start timer and stop timer osax commands; also python and perl solutions) themselves take a few milliseconds to execute. getMicroseconds, which I have not used, is reported to take about 70 milliseconds to execute. Thus, it would seem that these tools may not be ideal for measuring execution times on the order of 10^-4 or 10^-5 seconds. On the other hand, the built-in Core Foundation function CFAbsoluteTimeGetCurrent(), which I learned about from a MacScripter discussion, takes only about 2.5*10^-5 seconds to execute on my machine and thus seems to be a more suitable tool for measuring execution times in the 10^-4 or 10^-5 seconds range.

For what it’s worth, here is one of the timing scripts I used for comparing the current handler to ASObjC. I find that results are usually not affected whether the script is run directly from Script Debugger, as a script via LaunchBar, or as an applet via the Finder (I prefer the latter approach to avoid any possible confounding effects of caching or other optimizations that might happen in Script Debugger; I also execute each solution in a set command before entering the repeat loops for the same reason, rightly or wrongly, i.e., to remove any confounding caching or other optimization effects before execution times are measured):


use framework "Foundation"
use scripting additions
	
set nTrials to 1000

set theString to "x"

repeat with i from 0 to 6
	
	set nReps to 10 ^ i
	
	set x1 to my replicateStringAS(theString, nReps)
	set x2 to my replicateStringASObjC(theString, nReps)
	
	set t1 to current application's CFAbsoluteTimeGetCurrent()
	repeat nTrials times
		set x1 to my replicateStringAS(theString, nReps)
	end repeat
	set t1 to ((current application's CFAbsoluteTimeGetCurrent()) - t1) / nTrials
	
	set t2 to current application's CFAbsoluteTimeGetCurrent()
	repeat nTrials times
		set x2 to my replicateStringASObjC(theString, nReps)
	end repeat
	set t2 to ((current application's CFAbsoluteTimeGetCurrent()) - t2) / nTrials
	
	display dialog "# Replications:" & return & return & (nReps as integer) & return & return & "Avg Execution Time Per Trial (Secs):" & return & return & "t1:" & tab & tab & tab & (round (t1 * 1000000) rounding as taught in school) / 1000000 & return & "t2:" & tab & tab & tab & (round (t2 * 1000000) rounding as taught in school) / 1000000 & return & return & "Execution Time Ratios:" & return & return & "t2 / t1:" & tab & tab & (round ((t2 / t1) * 100) rounding as taught in school) / 100 & return & return & "Value Equality Tests:" & return & return & "x1 = x2:" & tab & (x1 = x2)
	
end repeat

on replicateStringAS(theString, nReps)
	if theString's class ≠ text then error "The first input argument is not a text string."
	tell nReps to if ({its class} is not in {integer, real}) or ((it mod 1) ≠ 0) then error "The second input argument is not an integer."
	if (theString's length = 0) or (nReps ≤ 0) then return ""
	set {nRepsEntries, replicatedStringEntries} to {{1}, {theString}}
	repeat
		tell 2 * (nRepsEntries's last item)
			if it > nReps then exit repeat
			set {end of nRepsEntries, end of replicatedStringEntries} to {it, replicatedStringEntries's last item & replicatedStringEntries's last item}
		end tell
	end repeat
	if nRepsEntries's last item < nReps then
		repeat with i from ((nRepsEntries's length) - 1) to 1 by -1
			tell ((nRepsEntries's last item) + (nRepsEntries's item i)) to if it ≤ nReps then set {nRepsEntries's last item, replicatedStringEntries's last item} to {it, (replicatedStringEntries's last item) & (replicatedStringEntries's item i)}
			if nRepsEntries's last item = nReps then exit repeat
		end repeat
	end if
	return replicatedStringEntries's last item as text
end replicateStringAS

on replicateStringASObjC(theString, nReps)
	set theString to current application's NSString's stringWithString:theString
	return (theString's stringByPaddingToLength:(nReps * (theString's |length|())) withString:theString startingAtIndex:0) as text
end replicateStringASObjC

Nigel, I could not have figured out a way to eke more speed out of my handler, but yours does indeed! In timing tests, I found that yours executed consistently between 1.4 and 1.5 times faster than mine in tests replicating a single character between 1 and 1,000,000 times and a 100-character string between 1 and 100,000 times. Very impressive!

P.S. Your method of negatively selecting unwanted list items (by setting them to the missing value) and generating the replicated string by coercion of the resulting list to a text string is significantly more efficient than my method of positively selecting list items and appending them to the replicated string…and quite creative!

Thank you and no harm feelings at all :). Timing remote AppleEvents is quite tricky because the total time can be influenced at so many levels. Thanks for the corrections.

The following version of the handler is a hybrid between my original handler and Nigel Garvey’s. It implements Nigel’s efficient technique of zapping unwanted strings within the list of intermediate replicated strings and coercing the surviving strings to the final replicated string. This approach avoids the less efficient technique of building the replicated string through a series of string concatenations that I originally used. It achieves about a 5% execution speed bump over Nigel’s handler simply by exiting the “zap” loop once the target number of replications has been achieved and by applying the string coercion to the resulting subset of list items rather than the entire list. I retained my original ordering of list items from small-to-large replication size, which I found to execute more quickly for some reason than when I tried the large-to-small ordering of Nigel’s handler.


on replicateString(theString, nReps)
	script o
		property nRepsEntries : {1}
		property replicatedStringEntries : {theString}
	end script
	-- Repeatedly double the input string until the next doubling would exceed the target number of replications, and save the intermediate replication counts and their associated strings
	set newNReps to 2
	repeat until (newNReps > nReps)
		set end of o's nRepsEntries to newNReps
		set theString to theString & theString
		set end of o's replicatedStringEntries to theString
		set newNReps to newNReps * 2
	end repeat
	-- Loop through the saved intermediate replication counts and their associated strings in reverse order beginning with the second-to-last list item, zap (by setting to a non-text value, in this case the missing value) those intermediate values that would cause the target number of replications to be exceeded, or add to the cumulative replication count the current item's replication count if it does not cause the target number of replications to be exceeded; when the target number of replications has been reached, set the beginning index of the list of intermediate values to be coerced to the replicated string to the current index, and exit the repeat loop
	tell o's nRepsEntries to set {currNReps, iStart} to {last item, length}
	if currNReps < nReps then
		repeat with i from (iStart - 1) to 1 by -1
			set newNReps to currNReps + (o's nRepsEntries's item i)
			if newNReps > nReps then
				set o's replicatedStringEntries's item i to missing value
			else
				set currNReps to newNReps
				if currNReps = nReps then
					set iStart to i
					exit repeat
				end if
			end if
		end repeat
	end if
	-- Get the target replicated string by applying string coercion only to those non-zapped strings that are present in the previously determined subset of intermediate replicated strings
	set tid to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to ""
		set replicatedString to (o's replicatedStringEntries's items iStart thru -1)'s text as text
	end try
	set AppleScript's text item delimiters to tid
	-- Return the result
	return replicatedString
end replicateString

Hi bmose.

Still experimenting and reinventing. Great. :slight_smile:

But I’m not actually seeing any difference in speed between your version and mine. It no doubt depends on how far the second repeat gets through o’s nRepsEntries before obtaining the exact figure. (Mind you, even when replicating a string a million items, the list only contains twenty items, so there’s not much opportunity for any speed improvements to bite.) And there’ll be times when your version’s marginally slower than mine, because it contains an additional check for the exit point in the second repeat and it creates two additional lists instead of one before the coercion to text at the end: ie. items iStart thru -1 of o’s replicatedStringEntries and then the list of text values extracted from that. But in fact it is possible with AppleScript’s range references to extract the list of texts directly from o’s replicatedStringEntries by replacing this line:

set replicatedString to (o's replicatedStringEntries's items iStart thru -1)'s text as text

… with this unusual construct:

set replicatedString to (o's replicatedStringEntries's text from item iStart to -1) as text

… or:

set replicatedString to (o's replicatedStringEntries's text (item iStart) thru -1) as text

Both these range references mean “all text values which occur in the range of the list from item iStart to the last text.”

Thank you for the insights and for the lesson in nice ways of subselecting list items by class that I was unaware of.

Regarding execution speed, in the course of creating my new script, I noticed that I had inadvertently left out the validity and special case tests at the start of the script! When I put those tests back into the script, the results came out exactly as you say. Your script executes a bit faster than my new script, by about 2 or 3%. Your version of the script is now the official replicateString handler in my utility library! :slight_smile:

I must apologize for a second time in this thread for posting incorrect information. I hope this doesn’t become a habit! Arghhhhh…

I’ve improved the wording this morning to “all text values which occur in the range of the list from item iStart to the last text.”

Range references can be fun. They can be used to extract elements of a particular class or type from a list or text. Each of their two boundary specifiers can either be just an index — in which case it refers to the ith element of the requested type — or it can include a specifier for the same or some other type. Or it can specify the beginning or end of the list itself.

myList's text from item n to end

And the boundary specifiers don’t need to be given in the same order as the boundaries in the list:

myList's text end thru item i -- Same as myList's text (item i) thru end

It may be a little confusing that a range of ‘text’ or ‘strings’ from a text is a continous substring while a range of ‘text’ or ‘strings’ from a list is a list of items whose class is text. :slight_smile:

That all makes great sense, including your explanation about range from text → continuous substring, and range of “text” from list → list of text strings.

I’ve mentioned this before, but I find myself stumbling over the use of the keywords beginning and end to get the first and last items of a list, simply because the same keywords are used to insert items before the currently first item and after the currently last item of a list, not to replace the first and last items, respectively. Thus:


get beginning of {1, 2, 3} --> 1
get first item (or item 1) of {1, 2, 3} --> 1, same behavior as "beginning of"

get end of {1, 2, 3} --> 3
get last item (or item -1) of {1, 2, 3} --> 3, same behavior as "end of"

but

set beginning of {1, 2, 3} to 4 --> {4, 1, 2, 3}
set first item (or item 1) of {1, 2, 3} to 4 --> {4, 2, 3}, different behavior from  "beginning of"

set end of {1, 2, 3} to 4 --> {1, 2, 3, 4}
set last item (or item -1) of {1, 2, 3} to 4 --> {1, 2, 4}, different behavior from  "end of"

I understand that the different behaviors are the result of different roles for the keywords beginning and end in the two situations. But strictly as a matter of personal preference to avoid any confusion, I tend to use the terms first item and last item (or one of their cardinal or ordinal equivalents) when getting an item by index, as in the first examples above.

I can’t explain it definitively myself. ‘beginning’ and ‘end’ are specifically properties of a list, not elements contained in it. They represent the insertion points at either end and I think how they work is related to this, but it’s difficult to get a handle on it.

It’s easier to understand what happens with ‘eof’ in the File Read/Write commands. ‘eof’ represents the trailing edge a file — the insertion point after the last byte. The ‘get eof’ command returns the distance in bytes between the beginning of the first byte and the insertion point after the last. This is of course the same as the number of bytes in the file and thus the same as the 1-based index of the last byte. But it’s actually the 0-based index of the insertion point after the last byte. So while this code …

write "Hello" as «class utf8» to fileHandle starting at eof

… appends text to the end of a file, this …

set fileLen to (get eof fileHandle)
write "Hello" as «class utf8» to fileHandle starting at fileLen

… overwrites the last byte. It’s analogous to what happened in 2000 when much of the world population began its “Third Millennium” with the last year of the Second. But I digress. :wink: