Extract Integers from the Text

Assuming the source text contains only strings and positive or negative integers:
 

extractIntegersFromText("One two three -1234 four five 890 six 56 seven eight 989775nine.")

on extractIntegersFromText(theText)
	set integersList to paragraphs of ¬
		(do shell script "echo " & quoted form of theText & " | grep -o -E '[+-]?[0-9]+'")
	set ATID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ","
	set integersList to run script ("{" & integersList & "}")
	set AppleScript's text item delimiters to ATID
	return integersList
end extractIntegersFromText

 
NOTE:
A negative integer is detected only when the minus sign (-) is immediately followed by digits, with no spaces or other non-digit characters in between.

Compare with the JavaScript variant:

const text = "One two three 1234 four five 890 six 56 seven eight 989775nine.";
const numbers = [...text.matchAll(/\d+/g)];

I would just like to add that the idea of writing this plain AppleScript solution came to me after reading the following solution suggested by user @AdmiralNovia: Getting a list of numbers out of text file?

His solution is quite interesting and works, but has a few drawbacks:

  1. Cumbersome code.
  2. The text must not contain ASCII character 0.
  3. Instead of a list of integers, it returns a list of “integer” strings.
  4. The larger the source text, the slower the script will be, due to the use of the 2nd repeat loop.

How do you return the result? In my tests, the result of your JXA script is “undefined”. Or is it some Java code?

In general, I don’t: there is no need to “return” a result when I’m working in JavaScript (and this is just pure JavaScript, no JXA involved).

I guess that you mean: If I want to call this from an AppleScript script, how do I get the result into my script?

Like so:

set scriptCode to "[...'One two three 1234 four five 890 six 56 seven eight 989775nine.'.matchAll(/\\d+/g)].map(x => +x)"
set result to run script scriptCode in "JavaScript"

or so

set txt to "One two three 1234 four five 890 six 56 seven eight 989775nine."
set scriptCode to "[...'" & txt & "'.matchAll(/\\d+/g)].map(x => +x)"
set result to run script scriptCode in "JavaScript"

Explanation:

  • matchAll with a regular expression (\d+ here) and the global flag set returns an “iterator” of arrays: each array contains one element for the whole match plus one for each of the submatches. In this case, each inner array contains simply the matched string: [“1234”], [“890”], [“56”], [“989775”]. Kind of a list of lists in AppleScript, but not quite.
  • We want that as a real array, which is achieved by [... ]. That gives us [“1234”, “890”, “56”, “989775”], an array of strings.
  • But we want an array of integers, so map builds a new array out of the old one by converting each element to a number, which is achieved by prepending a ‘+’ (that’s an idiom; one could also use a function call such as Number(x) here).

Note that one has to use \\d+ here instead of \d+ as is customary in JavaScript. That’s because it is part of an AppleScript string, which needs an escaped backslash to preserve it. There’s no need to explicitly return anything, since the result of the JavaScript is the result of the last expression. And JavaScript arrays are automagically converted to AppleScript lists, so that the result in AS is {1234, 890, 56, 989775}.

The difference between the two scripts is only in where the text is built: the first script makes it part of the JavaScript string, the second one sets it in AppleScript and then concatenates it into the JavaScript code. In the second case, the single quotes must be added before and after the txt parameter.

JavaScript knows of three ways to quote strings: Single quote, double quote and back quote (for interpolated strings). That makes it fairly easy to build JS code in AS: Include all in double quotes and use single or back quotes in the JavaScript code. Also, quoted form of is not needed here.
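The three steps explained above (matchAll, spreading into an array, converting with a unary plus) can be sketched one at a time in plain JavaScript. This is only an illustration using the sample text from this thread; the intermediate variable names (iterator, matches, strings, integers) are made up for the example.

```javascript
// Same sample text as used earlier in the thread.
const text = "One two three 1234 four five 890 six 56 seven eight 989775nine.";

// Step 1: matchAll with a global regex returns an iterator of match arrays.
const iterator = text.matchAll(/\d+/g);

// Step 2: spreading the iterator gives a real array. Each element is a
// match array whose first entry is the matched string.
const matches = [...iterator];
const strings = matches.map(m => m[0]);

// Step 3: a unary plus converts each one-element match array to a number.
const integers = matches.map(m => +m);

console.log(strings);
console.log(integers);
```

The unary plus works on the match array itself because a one-element array is first converted to its string form and then to a number.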

This is similar to KniazidisR’s script at the top, but uses sed instead of grep + AS.

extractIntegersFromText("One 1.2 three 1234 four,
 five 890 six 56 seven 1.23E+4 eight 989775nine.
 Another line.")

on extractIntegersFromText(theText)
	set integersList to ¬
		(do shell script "echo " & quoted form of theText & " | sed -En '
		s/[0-9]+[.,][0-9]+([Ee][+-]?[0-9]+)?|[^0-9]+/,/g ; # Replace reals and non-digit runs with commas.
		H ; # Append this line to the hold space.
		$ { # If this is the last line:
			g ; # Retrieve the text from the hold space.
			s/\\n//g ; # Zap the linefeed(s).
			s/,,*/,/g ; # Replace any runs of commas with single instances.
			s/^,?/{/; # Lose any spare comma at the beginning and prepend {.
			s/,?$/}/p ; # Ditto at the end, append }, and print.
		} ;'")
	return (run script integersList)
end extractIntegersFromText

An ASObjC solution:

use framework "Foundation"
use scripting additions

set theString to "One two three 1234 four five 890 six 56 seven eight 989775nine."
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to "\\D+" -- non-digit characters
(theString's replaceOccurrencesOfString:thePattern withString:" " options:1024 range:{0, theString's |length|()})
set thePattern to "^\\s+|\\s+$" -- white space characters
(theString's replaceOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()})
return ((theString's componentsSeparatedByString:space)'s valueForKey:"integerValue") as list
--> {1234, 890, 56, 989775}

Why do you run replace twice on the string instead of getting the matches only?

It’s a bit slower, although in this case the difference is less than a millisecond (0.4 versus 0.7 milliseconds with the Foundation framework in memory):

use framework "Foundation"
use scripting additions

set theString to "One two three 1234 four five 890 six 56 seven eight 989775nine."
set theString to current application's NSString's stringWithString:theString
set thePattern to "\\d+"
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
set theRanges to (regexResults's valueForKey:"range")
set theMatches to current application's NSMutableArray's new()
repeat with aRange in theRanges
	(theMatches's addObject:(theString's substringWithRange:aRange))
end repeat
return (theMatches's valueForKey:"integerValue") as list

I like KniazidisR’s OP code refactored


	set {ptids, integersList, text item delimiters} to {text item delimiters, paragraphs of (do shell script "echo " & quoted form of theText & " | grep -o -E '[0-9]+'"), ","}
	set {integersList, text item delimiters} to {run script ("{" & integersList & "}"), ptids}

And here’s another AS way to skin this cat.


on extractIntegersFromText2(theText)
	set text item delimiters to characters of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	set {theText, text item delimiters} to {text items of theText, ""}
	set {theText, text item delimiters} to {theText as text, ","}
	run script ("{" & (words of (theText as text)) as text) & "}"
end extractIntegersFromText2

Can any of the methods posted to date handle negative integers?

By modifying the regular expression like so -?\d+ that should be easy: an optional minus sign followed by any number of digits. Or even [-+]?\d+ to accommodate an optional plus sign as well.
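As a rough sketch of that change in plain JavaScript (the function name extractSignedIntegers is invented for this example):

```javascript
// Extract integers with an optional leading sign.
// The sign only counts when it is immediately followed by a digit.
function extractSignedIntegers(text) {
  // match returns null when nothing matches, so fall back to an empty array.
  const matches = text.match(/[-+]?\d+/g) || [];
  return matches.map(Number);
}

console.log(extractSignedIntegers(
  "One two three -1234 four five 890 six 56 seven eight 989775nine."
));
```

Note that with this pattern a sign separated from the digits by a space is ignored, and an expression like 3-4 is read as 3 and -4, which may or may not be what is wanted.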

How does that work with a string like “123 446 $22”? (The code gives an error here, probably because of a non-matching parenthesis on the run script line).

Also, your first variant puts out a trailing comma.

I see (not that I’d care about the performance here).

The logic imposed by NSRegularExpression is really a bit contrived – the right thing would be to return the matched string in the matchesInString result as well. Instead of doing that, they force the programmer to extract the matches manually.

I wondered how the scripts might do with a larger string and ran some tests:

KniazidisR - 72 milliseconds
Nigel - 71 milliseconds
Peavine - 14 milliseconds

I tried to increase the size of the test string, but both KniazidisR’s and Nigel’s scripts returned a stack overflow error. I don’t know the reason for this, but it might have something to do with the test script.

The test script with Nigel’s suggestion:

use framework "Foundation"
use scripting additions

-- untimed code
set theString to "One two three 1234 four five 890 six 56 seven eight 989775nine." & linefeed
repeat 10 times
	set theString to theString & theString
end repeat

-- start time
set startTime to current application's CACurrentMediaTime()

-- timed code
set theIntegers to extractIntegersFromText(theString)
on extractIntegersFromText(theText)
	set integersList to ¬
		(do shell script "echo " & quoted form of theText & " | sed -En '
		s/[0-9]+\\.[0-9]+([Ee][+-]?[0-9]+)?|[^0-9]+/,/g ; # Replace reals and non-digit runs with commas.
		H ; # Append this line to the hold space.
		$ { # If this is the last line:
			g ; # Retrieve the text from the hold space.
			s/\\n//g ; # Zap the linefeed(s).
			s/,,*/,/g ; # Replace any runs of commas with single instances.
			s/^,?/{/; # Lose any spare comma at the beginning and prepend {.
			s/,?$/}/p ; # Ditto at the end, append }, and print.
		} ;'")
	return (run script integersList)
end extractIntegersFromText

-- elapsed time
set elapsedTime to (current application's CACurrentMediaTime()) - startTime
set numberFormatter to current application's NSNumberFormatter's new()
if elapsedTime > 1 then
	numberFormatter's setFormat:"0.000"
	set elapsedTime to ((numberFormatter's stringFromNumber:elapsedTime) as text) & " seconds"
else
	(numberFormatter's setFormat:"0")
	set elapsedTime to ((numberFormatter's stringFromNumber:(elapsedTime * 1000)) as text) & " milliseconds"
end if

-- result
elapsedTime --> 71 milliseconds
# count paragraphs of theString --> 1025
# count theIntegers --> 4096

BTW, I didn’t test the JavaScript suggestions because they won’t run in the test script. I’m sure they would be as fast or faster than the ASObjC solution.
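For anyone who wants to try the JavaScript approach on the same input, here is a minimal sketch that builds the test string the same way (ten doublings) and times the extraction with Date.now(). It runs as standalone JavaScript, not inside the ASObjC test script, so any timing it prints is not directly comparable with the figures above; the variable names are chosen for this example only.

```javascript
// Build the same test string as the AppleScript test: one line, doubled 10 times.
const line = "One two three 1234 four five 890 six 56 seven eight 989775nine.\n";
let testString = line;
for (let i = 0; i < 10; i++) {
  testString += testString;
}

// Time only the extraction itself.
const start = Date.now();
const theIntegers = [...testString.matchAll(/\d+/g)].map(m => +m);
const elapsedMs = Date.now() - start;

console.log(`${theIntegers.length} integers in about ${elapsedMs} ms`);
```

Date.now() only has millisecond resolution, so a single run of a task this small may well report 0 ms.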

“Words of” disregards most punctuation but not $. Add any desired non-numeric characters to the tids definition. I have a version that discovers all non-numeric input characters and adds those to the tids, but I didn’t post that version as I was just showing another approach.


on extractIntegersFromText(theText)
	set theAlphabet to characters of " ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	set text item delimiters to {"1", "2", "3", "4", "5", "6", "7", "8", "9", "0"}
	set {textCopy, text item delimiters} to {(text items of theText), ""}
	set {textCopy, text item delimiters} to {textCopy as text, theAlphabet}
	set {textCopy, text item delimiters} to {text items of textCopy, ""}
	set text item delimiters to theAlphabet & (textCopy as text)
	set {theText, text item delimiters} to {text items of theText, ","}
	set theText to theText as text
	run script "{" & ((words of (theText as text)) as text) & "}"
end extractIntegersFromText
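The same idea (find every non-digit character that occurs in the input, then treat all of them as delimiters) can be sketched in JavaScript for comparison. This is not a translation of the handler above, just an illustration of the technique; the function name extractByDelimiters is invented here.

```javascript
// Split the text on every non-digit character that actually occurs in it,
// then convert the remaining non-empty pieces to numbers.
function extractByDelimiters(text) {
  // Collect the distinct non-digit characters found in the input.
  const delimiters = new Set([...text].filter(ch => ch < "0" || ch > "9"));

  // Split on each delimiter in turn, much like changing text item delimiters.
  let pieces = [text];
  for (const d of delimiters) {
    pieces = pieces.flatMap(p => p.split(d));
  }

  // Drop the empty strings left between adjacent delimiters.
  return pieces.filter(p => p !== "").map(Number);
}

console.log(extractByDelimiters("123 446 $22"));
```

Because the delimiters are discovered from the input, characters such as $ need no special handling.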

I don’t see a trailing comma generated from the two-line code. What input are you using that does?

I think there’s a limit to how much text can be included in the text of a ‘do shell script’ command, but I don’t remember how much that is now. In both KniazidisR’s script and mine the text to be parsed is included in the command text, so if it’s one of your extreme length efforts, that might be the reason. If the text is saved to a file and read from there by the running shell script, it can be much longer.

This is a minor variation on @peavine’s first ASObjC script above. The differences are that the first regex pattern, like my sed script, also weeds out any “reals” represented in the text, which would otherwise be identified as two separate integers, and the second pattern contains an additional search term to catch any extra internal spaces caused by the first. As in the other scripts, it’s assumed that none of the numbers contain grouping separators.

use framework "Foundation"
use scripting additions

set theString to "One 1.2 three 1234 four five 890 six 56 seven eight 989775nine."
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to "\\D++|\\d++[.,]\\d++(?:[Ee][+-]?\\d++)?" -- non-digit characters and reals
(theString's replaceOccurrencesOfString:thePattern withString:" " options:1024 range:{0, theString's |length|()})
set thePattern to "(?<=^| ) ++| ++$" -- spaces
(theString's replaceOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()})
return ((theString's componentsSeparatedByString:space)'s valueForKey:"integerValue") as list
--> {1234, 890, 56, 989775}
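The “weed out reals first” idea carries over to the JavaScript variant as well. The sketch below reuses the reals pattern from the scripts above (digits, a dot or comma, digits, optional exponent); the function name extractIntegersSkippingReals is made up for this example, and as with the other scripts it assumes no grouping separators in the numbers.

```javascript
// Remove anything that looks like a real number, then collect the integers.
function extractIntegersSkippingReals(text) {
  // Replace reals such as 1.2 or 1.23E+4 with a space so their parts
  // are not picked up as two separate integers.
  const withoutReals = text.replace(/\d+[.,]\d+(?:[Ee][+-]?\d+)?/g, " ");
  return (withoutReals.match(/\d+/g) || []).map(Number);
}

console.log(extractIntegersSkippingReals(
  "One 1.2 three 1234 four five 890 six 56 seven 1.23E+4 eight 989775nine."
));
```

A number written with a grouping separator, such as 1,234, would be treated as a real and dropped, which is why the assumption matters.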

I ran KniazidisR’s script with a 111 kb file (using ‘cat’ rather than ‘echo’ to ingest it) and it resulted in a stack overflow error (-2706) at this point:

run script ("{" & integersList & "}")

In the log history, the long list that is generated includes the entirety of the text’s numerals.

When I run the script with a smaller text of 83 kb, the error does not occur. FWIW, both texts are csvs from sports stats sites… a full season’s NFL schedule and a full season of hitting stats for baseball, so lots of numbers to work with. When saving the long list, the resulting text files are 77 kb (from 111) and 13 kb (from 83).

Ah! So there’s a size limitation on the text passed to run script.

Is that related to argmax?

On my system, this is set to 262144 bytes but the output text is only 76717 bytes.

$ sysctl kern.argmax

or…

$ getconf ARG_MAX

I know argmax affects approximately how many bytes (not necessarily how many characters) can be used in the shell script text in a ‘do shell script’ command, but I’ve no idea if it applies to ‘run script’ as well.
