offsetMulti handler to find multiple instances of substring in string

I made a handler to find multiple instances of a substring in a string.
The last parameter, ‘itemCount’, is optional. Without it (or with 0), the handler returns a list of the indexes of every match. If you pass a positive integer, it returns the index of that occurrence, or 0 if there is no such occurrence. If the substring does not occur at all, it returns 0 in either case.

offsetMulti of "Rob" out of "My Rob is a cool Robert! His name is Robert..." given itemCount:0
-- {4, 18, 38}

to offsetMulti of findText out of textString given itemCount:ic : 0
	local indexList, tid, c, tc
	set tid to text item delimiters
	set c to length of findText
	set text item delimiters to findText
	considering case
		set textString to text items of textString
	end considering
	if (count textString) = 1 then
		set text item delimiters to tid -- restore the delimiters before the early return
		return 0
	end if
	set tc to 1
	set indexList to {}
	repeat with i from 1 to (count textString) - 1
		set tc to tc + (length of item i of textString)
		set end of indexList to tc
		if i = ic then exit repeat
		set tc to tc + c
	end repeat
	set text item delimiters to tid
	if ic = 0 then
		return indexList
	else if ic < (count textString) then
		return item ic of indexList
	end if
	return 0
end offsetMulti
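
To make the approach in the handler above concrete: splitting the text on the search string leaves the pieces between the matches, and each match offset is a running total of the piece lengths plus the lengths of the matches already passed. Here is a minimal sketch of just the splitting step (the names savedTID and pieces are illustrative, not from the original handler):

```applescript
-- Split the sample text on the search string, case-sensitively.
set savedTID to text item delimiters
set text item delimiters to "Rob"
considering case
	set pieces to text items of "My Rob is a cool Robert! His name is Robert..."
end considering
-- Always put the delimiters back once the split is done.
set text item delimiters to savedTID
pieces
--> {"My ", " is a cool ", "ert! His name is ", "ert..."}
```

The first match starts at 1 + 3 = 4, the second at 4 + 3 + 11 = 18, and the third at 18 + 3 + 17 = 38, which agrees with the {4, 18, 38} result shown earlier.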

Here is one more solution, a recursive ‘offset’ handler (from user @ComplexPoint):

my offsetsForSubstring:"Rob" inText:"My Rob is a cool Robert! His name is Robert..."

on offsetsForSubstring:subString inText:theText
	set rn to (reverse of characters of subString) as text
	script go
		on |λ|(temp)
			set i to offset of rn in temp
			if i = 0 then return {}
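			-- Map the reversed offset back to a 1-based offset in the original text.
			-- (The original "- i - 1" was only correct for 3-character substrings.)
			return |λ|(text (1 + i) thru -1 of temp) & ((length of temp) - i - (length of rn) + 2)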
		end |λ|
	end script
	return go's |λ|((reverse of characters of theText) as text)
end offsetsForSubstring:inText:

NOTE: As far as I can see, the solution from @robertfern works faster.

FWIW, an ASObjC suggestion.

use framework "Foundation"
use scripting additions

set theString to "My Rob is a cool robert! His name is Robert..."
set thePattern to "(?i)rob" -- remove (?i) to make case sensitive
set theOffsets to getOffsets(theString, thePattern) --> {4, 18, 38}

on getOffsets(theString, thePattern)
	set theString to current application's NSString's stringWithString:theString
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set matchOffsets to {}
	repeat with aMatch in regexResults
		set end of matchOffsets to ((aMatch's range()'s location()) + 1)
	end repeat
	return matchOffsets
end getOffsets

Mine works way faster.

Also, I’m not a big fan of recursion. Too much overhead from stack manipulation.
I always set mine up to be iterative.

Here is an iterative version of the recursive one from KniazidisR
(it’s 12% faster, and won’t hit a recursion stack limit)

offsetMulti of "Rob" out of "My Rob is a cool Robert! His name is Robert..." given itemCount:0
-- {4, 18, 38}

to offsetMulti of findText out of textString given itemCount:ic : 0
	local indexList, c, n, tc
	set c to (length of findText)
	set indexList to {}
	set tc to 0
	repeat (item (((ic = 0) as integer) + 1) of {ic, 500000}) times
		considering case
			set n to offset of findText in textString
		end considering
		if n = 0 then exit repeat
		set tc to tc + n
		set end of indexList to tc
		set tc to tc + c - 1
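		if (n + c) > (length of textString) then exit repeat -- the match ends the text; nothing left to search
		set textString to text (n + c) thru -1 of textString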
	end repeat
	if (count indexList) > 0 then
		if ic = 0 then
			return indexList
		else if ic = (count indexList) then
			return item ic of indexList
		end if
	end if
	return 0
end offsetMulti

I ran some timing tests. The test string contained 4096 instances of the original string, and the results were:

THE SCRIPT IN - TIMING RESULTS
Post 1 - 3.967 seconds
Post 2 - returned stack overflow
Post 3 - 262 milliseconds
Post 4 - 7.350 seconds

I also tested the Post 1 and 3 scripts with a test string that contained 32 instances of the original string and the results were 2 and 6 milliseconds, respectively.

The following is the timing script with my suggestion:

use framework "Foundation"
use scripting additions

-- untimed code
set theString to "My Rob is a cool Robert! His name is Robert... "
repeat 12 times
	set theString to theString & theString
end repeat

-- start time
set startTime to current application's CACurrentMediaTime()

-- timed code
set thePattern to "Rob"
set theOffsets to getOffsets(theString, thePattern)
on getOffsets(theString, thePattern)
	set theString to current application's NSString's stringWithString:theString
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
	set matchOffsets to {}
	repeat with aMatch in regexResults
		set end of matchOffsets to ((aMatch's range()'s location()) + 1)
	end repeat
	return matchOffsets
end getOffsets

-- elapsed time
set elapsedTime to (current application's CACurrentMediaTime()) - startTime
set numberFormatter to current application's NSNumberFormatter's new()
if elapsedTime > 1 then
	numberFormatter's setFormat:"0.000"
	set elapsedTime to ((numberFormatter's stringFromNumber:elapsedTime) as text) & " seconds"
else
	(numberFormatter's setFormat:"0")
	set elapsedTime to ((numberFormatter's stringFromNumber:(elapsedTime * 1000)) as text) & " milliseconds"
end if

-- result
elapsedTime --> 262 milliseconds
# count theOffsets --> 12288

That’s a very large string. I can speed mine up drastically by using script objects. I’ll do it when I get a minute.

Can I get a copy of the test string you used?

Here it is…

to offsetMulti of findText out of textString given itemCount:ic : 0 -- way Faster
	local tid, c, tc
	script L
		property indexList : {}
		property foundStrings : missing value
	end script
	set tid to text item delimiters
	set c to length of findText
	set text item delimiters to findText
	considering case
		set L's foundStrings to text items of textString
	end considering
	if (count L's foundStrings) = 1 then
		set text item delimiters to tid -- restore the delimiters before the early return
		return 0
	end if
	set tc to 1
	repeat with i from 1 to (count L's foundStrings) - 1
		set tc to tc + (length of item i of L's foundStrings)
		set end of L's indexList to tc
		if i = ic then exit repeat
		set tc to tc + c
	end repeat
	set text item delimiters to tid
	if ic = 0 then
		return L's indexList
	else if ic < (count L's foundStrings) then
		return item ic of L's indexList
	end if
	return 0
end offsetMulti

EDIT - modified one line to shorten an if statement
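
For anyone wondering where the speedup comes from: the commonly cited explanation is that getting or setting items of a large list through a script object property (or a reference) is much cheaper than going through a plain variable. Here is a minimal sketch of the idiom on its own, assuming nothing beyond plain AppleScript (the script name Store and the property numberList are made up for illustration):

```applescript
-- Hypothetical demo: build and read a large list through a script object property.
script Store
	property numberList : {}
end script

-- Reset in case the property kept its value from a previous run.
set Store's numberList to {}

repeat with i from 1 to 10000
	set end of Store's numberList to i
end repeat

set total to 0
repeat with i from 1 to (count Store's numberList)
	set total to total + (item i of Store's numberList)
end repeat
total
--> 50005000
```

The handler above does the same thing with L's indexList and L's foundStrings, so every item access inside the repeat loop goes through the script object.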


Robert. The test string is created by the script under the comment “untimed code” (see my script in post 5 above). The result with your new script was 116 milliseconds.