Regular Expression pattern question

The following is written to return {“11”, “22”} but it only returns {“11”}. I tested this on the regex101 test site and the pattern including the positive lookbehind seems OK. Thanks.

use framework "Foundation"
use scripting additions

set theString to "a 11 b $22 c"
set theString to current application's NSString's stringWithString:theString
set theArray to (theString's componentsSeparatedByString:space)
set thePredicate to current application's NSPredicate's predicateWithFormat:"self MATCHES '[0-9]++|(?<=[$])[0-9]++'"
set theList to (theArray's filteredArrayUsingPredicate:thePredicate) as list --> {"11"}

regex101 will be applying the pattern over the entire string and returning matching substrings, namely "11" and "22".

Your script splits the string into an array. filteredArrayUsingPredicate: then returns those elements of the array that satisfy the predicate in their entirety. It’s a true-or-false determination: an element is either included in the filtered array or it isn’t. Thus, from the outset, we can see it will never be able to return {"11", "22"}; at best, it could return {"11", "$22"}. The reason it doesn’t return the latter is that the look-behind doesn’t form part of the match, while MATCHES requires the pattern to match the whole of SELF. For the element "$22", SELF equals "$22", the pattern can account for at most "22", and so the predicate is not satisfied.
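For comparison, here is a minimal sketch of applying the same pattern over the whole string, the way regex101 does, instead of testing array elements one by one. It assumes NSRegularExpression's matchesInString:options:range: is an acceptable substitute for the predicate approach; the variable names are only illustrative.

```applescript
use framework "Foundation"
use scripting additions

set theString to current application's NSString's stringWithString:"a 11 b $22 c"
-- same pattern as in the question, but searched across the entire string
set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"[0-9]++|(?<=[$])[0-9]++" options:0 |error|:(missing value)
set theMatches to theRegEx's matchesInString:theString options:0 range:{0, theString's |length|()}
set theList to {}
repeat with aMatch in theMatches
	-- range 0 is the whole match; the look-behind text is not part of it
	set end of theList to (theString's substringWithRange:(aMatch's rangeAtIndex:0)) as text
end repeat
return theList --> {"11", "22"}
```

Because the search runs over the original string, the "$" is still in front of "22" when the engine gets there, and only the digits end up in the match.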


CJK. Thanks for the reply and helpful explanation.

I understand what you say about the regex101 test. So, just for the record, I isolated the test strings and got the expected results.

Regular Expression: [0-9]++|(?<=[$])[0-9]++
Test String: 11
Match: 11

Regular Expression: [0-9]++|(?<=[$])[0-9]++
Test String: $22
Match: 22

It took a bit of thought but I now understand why my script did not return $22. Fortunately this is easily fixed with a few additional lines of code.

BTW, if anyone wants to look at the regex101 test site, it can be found here

In your RegEx matching pattern, even though you are inside the literal [$], I would escape the $ as it’s one of the main RegEx special characters.

You should do trials with NSRegularExpression’s escapedPatternForString

I’ve been using this method lately as I’ve found inconsistencies with the “literal” interpretation of characters in […]

[\Q**\E]

Anything between the \Q…\E is treated as literal. No question about whether it needs escaping or not. I’ve found this more consistent, especially when using punctuation &±/[]{}<>^(): all the stuff that are RegEx tokens.
Especially - and ^, which have meaning inside the […] block.

Another thing you could do is put word boundary restriction on it:

\b(\d+)\b

Replace: $1

11

22

if and only if the g flag is set (which seems to be the default). OTOH, g is not the default in the regex engines (neither in NSPredicate nor in JavaScript). And I was not able to find a “global” flag for NSRegularExpression either. Maybe it’s implied, maybe Apple simply forgot to implement it.
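As far as I can tell, with NSRegularExpression the "global or not" choice is made by picking a method rather than by setting a flag. The sketch below is only an illustration under that assumption: firstMatchInString:options:range: stops at the first match, while numberOfMatchesInString:options:range: (like matchesInString:options:range:) walks the whole string. The pattern is the word-boundary one from this thread.

```applescript
use framework "Foundation"
use scripting additions

set theString to current application's NSString's stringWithString:"a 11 b $22 c"
set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"\\b\\d+\\b" options:0 |error|:(missing value)
set fullRange to {0, theString's |length|()}

-- first match only, comparable to running without the g flag
set firstFind to theRegEx's firstMatchInString:theString options:0 range:fullRange
set firstText to (theString's substringWithRange:(firstFind's range())) as text

-- all matches, comparable to running with the g flag
set matchCount to (theRegEx's numberOfMatchesInString:theString options:0 range:fullRange) as integer

return {firstText, matchCount} --> {"11", 2}
```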

That’s not necessary (cf Regular Expressions :: Eloquent JavaScript and Sets and ranges [...]): […] is a character (!) class/range. So, you don’t need to escape {, }, $, ?, ., nor + in it. Even - doesn’t need to be escaped if it’s the first or last element in the character class: [-0-9] matches “all digits or a minus sign”.

As to a $ in brackets: outside brackets, the $ is an anchor (end of string/line) and does not match any character, but inside brackets it is just a literal dollar sign. The same holds for ^. However, ^ as the first character in a bracketed character class is meant to invert the class, so if you need the literal caret, put it at the end of the bracket expression. ([^0-9] matches anything but digits, whereas [0-9^] matches digits and the caret.)

Thanks technomorph and chrillek for the additional information. I see there are many ways to accomplish what I want.

Since the positive lookbehind won’t work, I’ve included a new solution below. It strips out leading dollar signs before creating the array and is easily edited to remove other leading characters.

use framework "Foundation"
use scripting additions

set theString to "a 11 b $22 c"
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to "\\h[$]" -- remove dollar signs
(theString's replaceOccurrencesOfString:thePattern withString:space options:1024 range:{0, theString's |length|()})
set theArray to (theString's componentsSeparatedByString:space)
set thePredicate to current application's NSPredicate's predicateWithFormat:"self MATCHES '[0-9]++'"
set filteredArray to (theArray's filteredArrayUsingPredicate:thePredicate) as list --> {"11", "22"}

If the aim is to isolate numbers in a string like in this thread, the answer can be:

use framework "Foundation"
use scripting additions

## if your system settings are: grouping = "," and decimal = "." (like in the USA)
set theString to "One 12.95€ two 9,876.54 three 1234. four $3 five 890, six -56.3 seven" & return & "eight 989775mm nine. 4.5%"

## if your system settings are: grouping = space and decimal = "," (similar to French)
-- set theString to "One 12,95€ two 9 876,54 three 1234. four $3 five 890, six -56,3 seven" & return & "eight 989775mm nine. 4,5%"

## if your system settings are: grouping = "." and decimal = "," (similar to German)
--set theString to "One 12,95€ two 9.876,54 three 1234. four $3 five 890, six -56,3 seven" & return & "eight 989775mm nine. 4,5%"

-- get the localization setting
set theLocale to current application's NSLocale's currentLocale()
set groupSep to theLocale's groupingSeparator()
set deciSep to theLocale's decimalSeparator()

-- delete thousand separators
set theString to current application's NSString's stringWithString:theString
set theString to (theString's stringByReplacingOccurrencesOfString:("(?<=\\d)\\" & groupSep & "(?=\\d)") withString:"" options:1024 range:{0, theString's |length|()}) -- NSRegularExpressionSearch

-- the regex pattern searches for negative or positive numbers, returning a real if it contains a decimal separator, or an integer if not
set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:("-?\\d+(?:\\" & deciSep & "\\d+)?") options:0 |error|:(missing value)
set theFinds to theRegEx's matchesInString:theString options:0 range:{0, theString's |length|()}
set theRanges to (theFinds's valueForKey:"range")
set theResult to {}
repeat with aRange in theRanges
	set aFound to (theString's substringWithRange:(aRange))
	if (aFound's containsString:deciSep) then
		set aFound to aFound as text as real
	else
		set aFound to aFound as text as integer
	end if
	set end of theResult to aFound
end repeat
return theResult

Running that script here (i.e. in a German locale, where deciSep is “,” and groupSep is “.”), I get this output:
1295, 98654, 14, 3, 80, -563, 95, 45

Doesn’t look quite right to me, but maybe I’m missing something. Perhaps because stringByReplacingOccurrencesOfString doesn’t work with regular expressions but with strings?

stringByReplacingOccurrencesOfString with the option 1024 is a regex search (NSRegularExpressionSearch).

I forgot to escape the grouping character.
And for security, I also escaped the decimal character.

See the amended script in my previous post.
Is it working for you?

With the new version, I get this:
12.95, 9876.54, 1234, 3, 890, -56.3, 989775, 4.5

after having selected the “German” string. That is ok, except for the fact that AS apparently does not use the locale setting to convert a float to string. Well.

Apart from that, I’d suggest using stricter matching rules for the grouping, something like (?<=\d{1,3}),(?:\d{3})?(?:\.\d+)? (with a comma as group separator)
(not certain where/how you need capturing/non-capturing groups here, though).
Also, instead of -?, I’d use [+-]? in the regularExpressionWithPattern call, since sometimes a number might start with a plus sign. And the leading digit(s) before the decimal separator are, I think, optional: .12345 is often considered a valid float.

This is useless because the grouping separators are deleted first.

Not sure what you mean.

Your RE matches 12,34 in a US locale, and then deletes the grouping separator. But 12,34 is not a correctly formed number, I think – it should be 1,234 or 1234.

In a German locale, PI would be written as 3,141. The script (or rather AppleScript) apparently outputs it as “3.141” – it uses the dot as a decimal separator, though that’s not the one defined in my locale.

You mean you’ve changed the locale “manually” in System preferences?

The script results are not strings but reals or integers.
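If a localized string is what you want in the end, one option is to format the number explicitly. This is only a sketch, assuming NSNumberFormatter's localizedStringFromNumber:numberStyle: suits the purpose; aNumber is an illustrative value, and the actual text depends on the locale the script runs under.

```applescript
use framework "Foundation"
use scripting additions

set aNumber to 3.141 -- an AppleScript real, as returned by the script above
-- format the number using the current locale's decimal and grouping separators
set localizedText to (current application's NSNumberFormatter's localizedStringFromNumber:aNumber numberStyle:(current application's NSNumberFormatterDecimalStyle)) as text
return localizedText
```

Under a German locale this would be expected to show a decimal comma, whereas the AppleScript real itself carries no separator at all.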

Try with the pattern “.*?\b(\d+)\b”
This will skip any non-digits before the number and stop at a word boundary.

This might work too without word boundaries

“.*?(\d+)”

You could also use the NSScanner technique here:

If you know you only have integers, you can use scanInteger: instead.
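A rough sketch of the scanner idea, assuming integers only: skip ahead to the next digit, then let scanInteger: read the number. The variable names are made up for illustration, and the (reference) placeholder is the usual AppleScriptObjC way of receiving an out-parameter.

```applescript
use framework "Foundation"
use scripting additions

set theScanner to current application's NSScanner's scannerWithString:"a 11 b $22 c"
set digitSet to current application's NSCharacterSet's decimalDigitCharacterSet()
set theList to {}
repeat until (theScanner's isAtEnd()) as boolean
	-- skip everything up to the next digit and discard it
	theScanner's scanUpToCharactersFromSet:digitSet intoString:(missing value)
	-- try to read an integer at the current position
	set {didScan, theValue} to theScanner's scanInteger:(reference)
	if didScan as boolean then set end of theList to theValue as integer
end repeat
return theList
```

Note that this collects integers rather than strings, and that a leading minus sign is skipped along with the other non-digits.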

Also if you wanna split and clean your string at once, use the inverted character set from that post and then use NSString’s
componentsSeparatedByCharactersInSet:

Then the predicate RegEx can be simpler.
Or might even not be needed.

Oh, just read that the array created will have empty strings if adjacent separators occur.
But the predicate will take care of that.
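Something along these lines, as a sketch only: it assumes the separator set is simply "everything that is not a decimal digit", and it uses a plain length test instead of a RegEx predicate to drop the empty strings.

```applescript
use framework "Foundation"
use scripting additions

set theString to current application's NSString's stringWithString:"a 11 b $22 c"
-- every character that is not a decimal digit acts as a separator
set nonDigits to current application's NSCharacterSet's decimalDigitCharacterSet()'s invertedSet()
set theParts to theString's componentsSeparatedByCharactersInSet:nonDigits
-- adjacent separators leave empty strings behind, so filter those out
set thePredicate to current application's NSPredicate's predicateWithFormat:"length > 0"
set theList to (theParts's filteredArrayUsingPredicate:thePredicate) as list
return theList --> {"11", "22"}
```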

Scanner is probably the best bet.