text manipulation

gigagigosu · November 12, 2018, 6:46pm

Hi there,

I have a string containing a mixture of letters and numbers, let’s say “Q1842GHJK003ABCD”.
I’d like to extract the substring that follows the last occurrence of three consecutive digits.

In this case, I need to extract ABCD. Is it possible?

t.spoon · November 12, 2018, 7:17pm


set testString to "Q1842GHJK003ABCD"
set charCount to count of characters of testString
-- Slide a three-character window backwards from the end of the string.
repeat with i from 0 to (charCount - 3)
	set testBit to text (charCount - (i + 2)) through (charCount - i) of testString
	try
		-- The coercion errors unless the window can be read as a number, which skips the return.
		set testBit to testBit as number
		return text (charCount - (i - 1)) through charCount of testString
	end try
end repeat

Nigel_Garvey · November 12, 2018, 7:37pm

Hi.

It’s fairly easy with a regex:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set sourceString to "Q1842GHJK003ABCD"

set sourceString to current application's class "NSString"'s stringWithString:(sourceString)
-- The regex here gets the text after the last specifically three-digit group in the input.
-- The result itself can contain one- or two-digit groups if required.
set matchRange to sourceString's rangeOfString:("(?<=\\D\\d{3})\\D(?:\\D|\\d{1,2}(?!\\d))*+$") options:(current application's NSRegularExpressionSearch)
if (matchRange's |length| > 0) then
	set matchText to (sourceString's substringWithRange:(matchRange)) as text -- The matched text.
else
	set matchText to missing value -- No match to the regex.
end if

StefanK · November 12, 2018, 8:15pm

Based on Nigel’s script, this is a version that uses NSRegularExpression directly.

The benefit is that the pattern can be reduced to three digits (“\d{3}”), with the script simply taking the last match.


use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

property |⌘| : a reference to current application

set sourceString to "Q1842GHJK003ABCD"
set pattern to "\\d{3}"

set sourceString to |⌘|'s NSString's stringWithString:sourceString
set regex to |⌘|'s NSRegularExpression's regularExpressionWithPattern:pattern options:0 |error|:(missing value)
set matches to regex's matchesInString:sourceString options:0 range:{location:0, |length|:sourceString's |length|()}
set lastMatch to matches's lastObject()
if lastMatch is not missing value then
	set matchRange to lastMatch's range()
	set upperBound to (matchRange's location) + (matchRange's |length|)
	set matchText to (sourceString's substringFromIndex:upperBound) as text -- The matched text.
else
	set matchText to missing value
end if

Nigel_Garvey · November 12, 2018, 9:28pm

Hi Stefan.

It depends what your assumptions are about the input string. If you know that the last group of three or more digits will be exactly three digits long, and if that’s the required marker, your regex is fine. If the last group of digits is, say, six digits long, “\d{3}” will match the last three of those and your script will return what comes after that group. Mine would return ‘missing value’ for a non-match. You can specify a group of exactly three digits with this: “(?<!\d)\d{3}(?!\d)”. (Three consecutive digits neither preceded by a digit nor followed by one.) Your script with this regex could be the best solution.
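A minimal sketch of that combination, assuming Stefan's NSRegularExpression approach with the exact-three-digits pattern swapped in. The handler name and the longer test string are made up for illustration and are not from the posts above:

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical handler: returns the text after the last group of exactly three digits,
-- or missing value if the input contains no such group.
on textAfterLastExactThreeDigitGroup(sourceText)
	set sourceString to current application's NSString's stringWithString:sourceText
	-- Three digits neither preceded nor followed by another digit.
	set pattern to "(?<!\\d)\\d{3}(?!\\d)"
	set regex to current application's NSRegularExpression's regularExpressionWithPattern:pattern options:0 |error|:(missing value)
	set matches to regex's matchesInString:sourceString options:0 range:{location:0, |length|:sourceString's |length|()}
	set lastMatch to matches's lastObject()
	if lastMatch is missing value then return missing value
	set matchRange to lastMatch's range()
	set upperBound to (matchRange's location) + (matchRange's |length|)
	return (sourceString's substringFromIndex:upperBound) as text
end textAfterLastExactThreeDigitGroup

-- "1842" and "123456" are skipped because they are longer than three digits.
textAfterLastExactThreeDigitGroup("Q1842GHJK003ABCD123456XYZ")
```

With this input the only exactly-three-digit group is "003", so the handler should return "ABCD123456XYZ", whereas the plain “\d{3}” pattern would stop at "456" and return "XYZ".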

Marc_Anthony · November 14, 2018, 2:18am

t.spoon’s vanilla AS method is probably going to be the best solution by far for the OP—given its potential for legibility by a scripting novice—however, I personally like a lookbehind in this situation, as the code is significantly more compact.

do shell script "ruby -e 'puts " & quote & "Q1842GHJK003ABCD" & quote & ".scan /(?<=\\d{3})[A-Za-z]+$/'"

Nigel_Garvey · November 14, 2018, 10:36am

Well it’s vanilla code, that’s true, although it too assumes that the last three-digit group in the string isn’t part of a larger group. If I thought the OP might be interested in how it worked, I’d use less convoluted code and include a few comments:


set testString to "Q1842GHJK003ABCD"
set charCount to (count testString)

-- Search backwards through the string for a three-character group that can be coerced to integer.
repeat with i from (charCount - 3) to 1 by -1
	-- Extract a three-character group from the string.
	set testBit to text i through (i + 2) of testString
	try
		-- Try coercing it to integer. The rest of this 'try' statement will be skipped if it can't be done.
		testBit as integer
		-- If the test hasn't errored, return the text which follows the group in the main string.
		return text (i + 3) through charCount of testString
	end try
end repeat
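
If the three-digit group does need to stand alone, one possible vanilla variation is to test the characters individually and also check the neighbours on either side. This is only a sketch under that assumption; the handler names are hypothetical, and it requires at least one character after the group:

```applescript
-- Hypothetical helper: true if the single character c is an ASCII digit.
on isDigit(c)
	return c is in "0123456789"
end isDigit

-- Hypothetical handler: returns the text after the last three-digit group that is
-- neither preceded nor followed by another digit, or missing value if there is none.
on textAfterLastIsolatedTriple(testString)
	set charCount to (count testString)
	-- Search backwards; i is the position of the first digit of a candidate group.
	repeat with i from (charCount - 3) to 1 by -1
		if isDigit(character i of testString) and isDigit(character (i + 1) of testString) and isDigit(character (i + 2) of testString) then
			-- Reject the group if a digit sits immediately before or after it.
			set hasDigitNeighbour to isDigit(character (i + 3) of testString)
			if i > 1 then
				if isDigit(character (i - 1) of testString) then set hasDigitNeighbour to true
			end if
			if not hasDigitNeighbour then return text (i + 3) through charCount of testString
		end if
	end repeat
	return missing value
end textAfterLastIsolatedTriple

textAfterLastIsolatedTriple("Q1842GHJK003ABCD123456XYZ")
```

For this input the six-digit run is skipped and the handler should return "ABCD123456XYZ".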

Here’s another vanilla method which (as far as I can see!) only assumes that there’ll be text of some description after the last three-digit group in the string:


set testString to "Q1842GHJK003ABCD1234"

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to characters of "0123456789"
set textItems to testString's text items
set requiredText to missing value
repeat with i from (count textItems) - 2 to 1 by -1
	if ((item i of textItems is "") and (item (i + 1) of textItems is "")) then
		if (i is 1) then
			if (item (i + 2) of textItems is "") then set requiredText to text from text item (i + 3) to -1 of testString
		else if not ((item (i + 2) of textItems is "") or (item (i - 1) of textItems is "")) then
			set requiredText to text from text item (i + 2) to -1 of testString
			exit repeat
		end if
	end if
end repeat
set AppleScript's text item delimiters to astid
return requiredText

Yvan_Koenig · November 14, 2018, 10:40am

@Marc Anthony

I was interested in your compact proposal, but I discovered a problem seemingly related to Ruby itself.

If the source string contains (after the three digits) one or several non-ASCII characters (I tested with é, è, ù), Nigel’s proposal returns the correct value while your code fails with:
→ error “-e:1: invalid multibyte char (US-ASCII)
-e:1: invalid multibyte char (US-ASCII)
-e:1: invalid multibyte char (US-ASCII)
-e:1: invalid multibyte char (US-ASCII)” number 1

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Wednesday, 14 November 2018 11:39:45

Nigel_Garvey · November 14, 2018, 1:18pm

Hi Yvan.

I don’t know how to fix that in Ruby, but the equivalent with sed would be:

-- The "LANG" environment variable can refer to any recognised locale language and forces sed to treat its input as UTF-8. However, it's not actually needed with this regex!
-- The regex finds the _last_ three-digit group in the string because the initial ".+" makes it go right to the end before backtracking to see if it can match the rest.
do shell script "echo " & quoted form of ("Q184GHJK003éèú" & character id 127930) & " | LANG='fr_FR' sed -E 's/^.+[^0-9][0-9]{3}([^0-9].+)$/\\1/'"

Yvan_Koenig · November 14, 2018, 2:02pm

Thank you Nigel.

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Wednesday, 14 November 2018 15:02:36

Marc_Anthony · November 14, 2018, 11:42pm

Hi, Yvan. The encoding option wasn’t enabled in Ruby because I didn’t anticipate encountering diacriticals. The regex below is adjusted for that eventuality.

do shell script "ruby -Kue 'puts " & quote & "Q184GHJK003éèú" & quote & ".match /(?<=\\d{3})[[:alpha:]]+$/'"

Using the K option with the u code (UTF-8) works in this situation, but it’s probably more consistently reliable (best practice?) to set the LANG environment variable, as Nigel has illustrated.

do shell script "LANG='fr_FR.UTF-8' ruby -e 'puts " & quote & "Q184GHJK003éèú" & quote & ".match /(?<=\\d{3})[[:alpha:]]+$/'"

–edited for typo, secondary option, clarification