Basic Thompson NFA Regex Algorithm in AS, (without character classes)

McUsrII · June 8, 2013, 5:35pm

I have implemented a very primitive regexp algorithm based on this blog entry byBrian W. Kernighan, the code by Robert Pike. it is basically the NFA algorithm.

The algorithm seem to be implemented properly, and can at least return a match on a line at a time!
(or whatever character chunk below some 2^14 characters you choose.)

I did it mainly to stop the whiners about not having regexp in Applescript, secondly for the hell of it!

It is kindof primitive, no posix character classes, regular character classes, nor word-boundaries, but it has ^ $ for anchors, and supports the ‘.’ and ‘*’ characters.

It was fun to implement it, I hope it can be useful for something, because, if your needs are miniscule, this should be fast enough, and it is easy to fill in the rest of the details for making o a regexp object.

You’ll see by the example, that you’ll call it with characters of your text, as this was the easiest way to overcome the lack of a pointer to char. (char *)

# but refer to the link to this post. The excempt is Jon Pugh and his smart strings library
# Or HHAS and AppleMods, any AppleScript Library delivered free of charge, as a matter of fact.
set RE to "r$*"
set thisText to " TIf you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years. The word \"startup\" dates from the 1960s, but what happens in one is very similar to the venture-backed trading voyages of the Middle Ages."
script o
	property aline : {}
	property lastMatch : false
end script
set o's aline to characters of thisText
set res to match(characters of RE, thisText)
on match(regexp, theText)
	if item 1 of regexp = "^" then
		return matchhere(items 2 thru -1 of regexp, theText)
	end if
	repeat with i from 1 to (count theText)
		if matchhere(regexp, items i thru -1 of theText) then return true
	end repeat
	return false
end match


on matchhere(regexp, theText)
	if o's lastMatch and length of regexp = 1 then return true
	if length of regexp â‰¥ 3 and item 2 of regexp = "*" then
		return matchstar(item 1 of regexp, items 3 thru -1 of regexp, theText)
	end if
	if item 1 of regexp = "$" and length of regexp = 1 then
		return ((length of theText) = 1)
	end if
	if length of theText > 0 and (item 1 of regexp = "." or item 1 of regexp = "*" or item 1 of regexp = item 1 of theText) then
		if length of theText > 1 then
			set o's aline to items 2 thru -1 of theText
		else
			set o's aline to item -1 of theText
		end if
		if length of regexp > 1 then
			return (matchhere(items 2 thru -1 of regexp, o's aline))
		else
			set o's lastMatch to true
			return (matchhere(item -1 of regexp, o's aline))
		end if
	end if
	return false
end matchhere


on matchstar(c, regexp, theText)
	repeat with i from 1 to (count theText)
		set o's aline to items i thru -1 of theText
		if matchhere(regexp, o's aline) then return true
		if item i of theText â‰  c and c â‰  "." then exit repeat
	end repeat
	return false
end matchstar

McUsrII · June 11, 2013, 5:49pm

Hello.

I have removed another bug in the implementation of NFA regex above, the bug was regex’s that ends with a star.
That should now have been solved.

Right now, this is a fancy toy, but with some optimization, it should really be usable, and competetive with regards to speed,for finding pattern in line-input and such. That is ; when only finding patterns in say 80 characters of text.