string cleanup speed challenge

Hi everyone,

I made the following function to replace all non-alphanumeric characters with a space, but when I started using it on somewhat bigger text files (like 16K) it became obvious that it is very very slow really.

I just looks at each character and then adds it to a new string when it is ok, otherwise it replaces it with a space.

So, am I doing something completely wrong? It 's a straightforward operation, but processing times seem to grow exponentially longer with longer texts. 2k text => 15 seconds, 16k text => let’s just say I stopped waiting after 20 minutes (not exaggerating), go figure :frowning: When I do this in php for instance it is done in under 1 second.


-- replace non-alphanumeric characters in source_text with a space and removes duplicate spaces
-- note: returns an empty string when the entire input is reduced to 1 space
on fn_normalize(source_text)
	
	-- check input
	try
		set source_text to source_text as string
	on error
		return false
	end try
	
	if length of source_text < 1 then return ""
	
	
	-- replace all characters by a space that are not plain alphanumeric
	set allowed_characters to "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
	
	-- built result text with allowed characters
	set result_string to ""
	set last_insert_is_space to true
	
	-- step trhough input characters
	repeat with single_character in the characters of source_text
		
		if ((offset of single_character in allowed_characters) > 0) then
			set result_string to result_string & single_character
			set last_insert_is_space to false
		else
			
			if last_insert_is_space is false then
				set result_string to result_string & space
				set last_insert_is_space to true
			end if
			
		end if
		
	end repeat
	
	
	-- trim end of string
	-- if the string is only one space (all characters were illegal) it will be truncated to an empty string
	if (character -1 of result_string as string) is space then
		set result_string to (characters 1 thru ((length of result_string) - 1) of result_string) as string
	end if
	
	
	return result_string
	
	
end fn_normalize

I would try it like this using a script object and unicode character id’s…

--character id 48 to 57 {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}
-- character id 65 to 90 {"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"}
-- character id 97 to 122 {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"}

set theText to "ha#nk"
fn_normalize(theText)

on fn_normalize(source_text)
	script o
		property chars : missing value
		property charID : missing value
	end script
	
	set o's chars to characters of source_text
	
	repeat with i from 1 to count of o's chars
		set o's charID to id of (item i of o's chars)
		if not ((o's charID > 47 and o's charID < 58) or (o's charID > 64 and o's charID < 91) or (o's charID > 96 and o's charID < 123)) then
			set item i of o's chars to space
		end if
	end repeat
	set finalText to o's chars as text
	set {o's chars, o's charID} to {missing value, missing value}
	return finalText
end fn_normalize

Hello.

I would believe that something could be done considerably faster with the tr command, nowadays it even support character classes. “man tr” in a terminal window should give you some pointers about what to experiment with.
(Via a do shell script command of course) P.s the do shell script command can’t handle more than ca. 210.000 bytes at a time. An alphanumeric takes 1 byte, I believe most of the non “alpha nums” uses two bytes.

Here’s a faster version of Hank’s handler. The main speed improvement is to use just one ‘id’ command and one ‘character id’. It also reverses the order of the test groups to reduce the maximum possible number of tests each time from 6 to 4 and tests positively followed by ‘else’ rather than using ‘not’.

on fn_normalize(source_text)
	script o
		property chrIDs : source_text's id
	end script
	
	repeat with i from 1 to (count source_text)
		set charID to item i of o's chrIDs
		if ((charID > 96 and charID < 123) or (charID > 64 and charID < 91) or (charID > 47 and charID < 58)) then
		else
			set item i of o's chrIDs to 32
		end if
	end repeat
	
	return character id (o's chrIDs)
end fn_normalize

A couple things…

  1. oldmanegan, your script doesn’t work for me. This is what I ran and it doesn’t remove the # character. All I did was take the timing code out of your script.
set theText to "ha#nk"

set theList to {}
repeat with counter1 from 1 to 2048
	if (counter1 < 48 or counter1 > 57) and (counter1 < 65 or counter1 > 90) and (counter1 < 97 or counter1 > 122) then
		set the end of theList to character id counter1
	end if
end repeat
set theListRef to a reference to theList
set text item delimiters to theListRef
set theText to text items of theText
set text item delimiters to " "
set theText to text items of theText as string
set text item delimiters to ""
return theText
  1. As a real test I found a large text file on the internet. It’s “A Christmas Carol”. It’s a file that’s 178KB in size. You can get that file here. The file you get is called “46.txt”. So I ran my code using that as the text.
set inPath to (path to desktop folder as text) & "46.txt"
set inDate to current date
fn_normalize(read file inPath)
set outDate to (current date) - inDate

First I ran it using my original script. It took 52 seconds. Then I ran it on Nigel’s improvement to my script and it took 28 seconds. Great improvement Nigel!

Finally, I ran it using the tr command that McUsr suggested. Here’s that script…

set inPath to (path to desktop folder as text) & "46.txt"

set inDate to current date
do shell script "tr -c '[:alnum:] \\n' ' ' < " & quoted form of POSIX path of inPath
set outDate to (current date) - inDate

Note the “-c” flag which means “don’t replace” the characters and you’ll see that I even added “\n” as the characters to leave in place… which I didn’t do in my and Nigel’s handlers. Guess how much time this took… 0 seconds. It was less than 1 second so “current date” was not precise enough to measure it.

So the easy winner is tr! :o

Finally, using tr you can have it write directly to an output file so you never have to bring the text into applescript. This will do that…

set inPath to (path to desktop folder as text) & "46.txt"
set outPath to (path to desktop folder as text) & "Christams Carol.txt"

do shell script "tr -c '[:alnum:] \\n' ' ' < " & quoted form of POSIX path of inPath & " > " & quoted form of POSIX path of outPath

Hello.

The timing function is intended to be such that the time it uses itself is taken into account, an annulated.
It should have done that by the function you have commented out in the timetools script.

The timing code should work as precisely as it can be measured, down at the millisecond level, because that it will all depend on what the process dispatcher, the memory management unit and such is having cued up before this function can be executed. Therefore I recommend several runs.
The first call will calculate the overhead and subtract that overhead, such that the following call takes the time for what ever happened after the time the first call to getMillisec was called. And a couple of milliseconds to or fro is totally uninteresting because of the aforementioned processes in and around the CPU. This function should be fairly accurate down to 2 hundreds of a second as it was originally.

Yup, a dumb eror on my part. I had the script working without the “reference to” and when I aded it, I checked times, not output.

Hello.

Timing results with getMilliSec Untinkered version.


property timer : (load script file "Macintosh HD:Users:McUsr:Library:Scripts:Modules:timerTools.scpt")
property parent : timer -- DONT use this if script is included as a whole.
set n to 180
tell timer to run
set t to getMillisec()

repeat n times
	delay 1 -- Code to be timed! {1.005 is median, now getting results from 0.989 to 1.008 occasionally}
end repeat
set res to (((getMillisec()) - t) / 1000) / n
” log res -- n = 1  ->  (*1.0*)
-- log res -- n = 10 -> (*1.0002*)
-- log res -- n = 60 -> (*1.000766666667*)
” log res -- n = 180 ->(*1.000266666667*)

I am the first to admit that there may be wrong things about my logic from time to time, but not with this timing logic. I challenge you to use the same untinkered code, and show me results that shows a difference of more than 20 milliseconds with the test written above. :slight_smile:

Hello. I just wanted to show what happens when the getMilliSec routine are used wrongly in a loop

They are really meant to be used for one start time, and one end time, if you need middle times, then
you should declare a second handler.

As you can see: the discrepancies will increase with the number of calls. I see no other solution than declaring a second timer in order to get correct result. The example below are totally superficial. One shold really have coded
the timing in those cases as it were done in the post above.


property timer : (load script file "Macintosh HD:Users:McUsr:Library:Scripts:Modules:timerTools.scpt")
property parent : timer -- DONT use this if script is included as a whole.
set n to 180
set v to 0
tell timer to run
set t to getMillisec()

repeat n times
	delay 1 -- Code to be timed! {1.005 is median, now getting results from 0.989 to 1.008 occasionally}
	set v to (getMillisec())
end repeat
set res to ((v - t) / 1000) / n
--  log res -- n = 10 (*1.0339*)
--  log res -- n = 60 (*1.036633333333*)
-- log res -- n = 180 	(*1.054094444444*)

Hello.

Here is the correct way to get an intermediary result: by declaring a second timer as shown below


property timer : (load script file "Macintosh HD:Users:McUsr:Library:Scripts:Modules:timerTools.scpt")
copy timer to timer2
set n to 60
tell timer to run
tell timer2 to run
set t to timer's getMillisec()

repeat n times
	delay 1 -- Code to be timed! {1.005 is median, now getting results from 0.989 to 1.008 occasionally}
end repeat

set u to timer2's getMillisec()
repeat n times
	delay 1 -- Code to be timed! {1.005 is median, now getting results from 0.989 to 1.008 occasionally}
end repeat
set mainres to (((timer's getMillisec()) - t) / 1000) / (n * 2)
set tempres to (((timer2's getMillisec()) - u) / 1000) / n

” log mainres ”   n =   10 ”> (*1.00665*)
” log tempres ”  n =   10  ”>(*1.0049*)
” log mainres ”   n =   60  ”>(*1.001708333333*)
” log tempres ”  n =   60  ”>(*1.001666666667*)
” log mainres ”   n = 180 ”>(*1.00145*)
” log tempres ”  n = 180 ”>(*1.001366666667*)

Quite every time I enter a thread dedicated to text crunching, it seems that text is restricted to the ASCII set of chars .
Do you know that there are most human beeings using non ASCII characters than beeings using this poor old set ?

Apple took care of the all of us but here, quite all of you continue to code for the typewriter era.

For you, it’s time to wake up .

For me, it’s time to sleep :wink:

Yvan KOENIG (VALLAURIS, France) samedi 31 juillet 2010 00:47:04

Thanks for all the great input!

For what it’s worth: I managed to speed up things a lot in my original function by converting the source string to a list of its characters, and then processing those with comparisons in optimal chunks of 160 characters per list… :slight_smile: Where applescript completely chokes on long unbroken string variables, the approach of feeding it little bits worked pretty fast.

Needless to say, I’ll be using the command line option from now! thanks McUsr

@ Yvan Koenig: ahhhhh… the typewriter era… I hope it never goes away :slight_smile: