Extract Integers from the Text

I ran KniazidisR’s script with a 111 kb file (using ‘cat’ rather than ‘echo’ to ingest it) and it resulted in a stack overflow error (-2706) at this point:

run script ("{" & integersList & "}")

In the log history, the long list that is generated includes the entirety of the text’s numerals.

When I run the script with a smaller text of 83 kb, the error does not occur. FWIW, both texts are CSVs from sports stats sites: a full season’s NFL schedule and a full season of hitting stats for baseball, so lots of numbers to work with. When saving the long list, the resulting text files are 77 kb (from 111) and 13 kb (from 83).

Ah! So there’s a size limitation on the text passed to run script.

Is that related to argmax?

On my system, this is set to 262144 bytes but the output text is only 76717 bytes.

$ sysctl kern.argmax

or…

$ getconf ARG_MAX

I know argmax affects approximately how many bytes (not necessarily how many characters) can be used in the shell script text in a ‘do shell script’ command, but I’ve no idea if it applies to ‘run script’ as well. :frowning:

I doubt that. run script shouldn’t be executed by a shell process, but by some of the Apple frameworks.

Nevertheless there might be a limit as to the length of the script code that can be passed, perhaps due to stack size limits or so.
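If the limit really is on the amount of source text handed to run script, one way to sidestep it might be to skip run script altogether and coerce each extracted string to a number in a loop. This is only a minimal sketch, assuming each string holds just the number and that the decimal separator matches the system's. numbersFromStrings is a hypothetical handler name, not part of KniazidisR's script:

```applescript
-- Hypothetical handler: turns a list of numeric strings into numbers
-- without compiling one long source text through run script.
on numbersFromStrings(stringList)
	set numbersList to {}
	repeat with aString in stringList
		try
			-- Coerce one string at a time; strings that are not numbers are skipped.
			set end of numbersList to (contents of aString) as number
		end try
	end repeat
	return numbersList
end numbersFromStrings

numbersFromStrings({"12", "-7", "3.5"}) --> {12, -7, 3.5}
```

The loop is likely slower than a single run script call on small inputs, but its cost per item does not depend on the total length of the text.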

Another interesting approach! :slight_smile:

filterUsingPredicate: is an NSMutableArray method. Is it really safe to assume that the array returned by componentsSeparatedByCharactersInSet: will be mutable? :thinking:

Hi Fredrik71.

Yes. This is what I was querying. I understood how your handler works and enjoyed the outside-the-box thinking behind it. :slight_smile:

The documentation I have declares componentsSeparatedByCharactersInSet: thus:

And the Return Value is given as:

It doesn’t say NSMutableArray anywhere on that page, although clearly the array does happen to be mutable on your machine and mine. I would think it safer not to assume mutability but to use the NSArray method, which works with both classes in the group:

set theArray to theArray's filteredArrayUsingPredicate:thePredicate
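If in-place filtering is preferred, another option might be to make the mutability explicit with mutableCopy() before calling filterUsingPredicate:. A small sketch under that assumption; the sample string and the "length > 0" predicate are illustrative and not taken from Fredrik71's handler:

```applescript
use framework "Foundation"
use scripting additions

-- Illustrative input: digit runs separated by non-digit characters.
set theString to current application's NSString's stringWithString:"a1b22c333"
set nonDigits to current application's NSCharacterSet's decimalDigitCharacterSet()'s invertedSet()
set theArray to theString's componentsSeparatedByCharactersInSet:nonDigits

-- Take an explicitly mutable copy rather than relying on the returned array being mutable.
set mutableArray to theArray's mutableCopy()
set thePredicate to current application's NSPredicate's predicateWithFormat:"length > 0"
mutableArray's filterUsingPredicate:thePredicate -- drops the empty strings

set theIntegers to (mutableArray's valueForKey:"integerValue") as list --> {1, 22, 333}
```

Either way, the code no longer depends on what class componentsSeparatedByCharactersInSet: happens to return.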

This version uses a more tightly defined regex: there is no whitespace cleaning, it captures only the digits (with optional dots and commas), and it makes sure each match starts at a word boundary.
valueForKey: is then used to return integers, doubles, and floats.

use framework "Foundation"
use scripting additions

property theNumberArray : missing value
property theIntegersList : {}
property theDoublesList : {}
property theFloatsList : {}


set theString to "One 1.2 three 1234 four five 890 six 56 seven eight 989775nine."
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to ".+?(\\b[0-9][0-9,.]+)(\\s)?|.+" -- $1 is numbers opt . , $2 space following optional
(theString's replaceOccurrencesOfString:thePattern withString:"$1$2" options:1024 range:{0, theString's |length|()})
set theNumberArray to (theString's componentsSeparatedByString:space)
set theIntegersList to (theNumberArray's valueForKey:"integerValue") as list 
set theDoublesList to (theNumberArray's valueForKey:"doubleValue") as list 
set theFloatsList to (theNumberArray's valueForKey:"floatValue") as list 

--> theIntegersList {1, 1234, 890, 56, 989775}
--> theDoublesList {1.2, 1234.0, 890.0, 56.0, 9.89775E+5}
--> theFloatsList {1.200000047684, 1234.0, 890.0, 56.0, 9.89775E+5}

That does not match numbers starting with a dot (.005) nor those in exponential notation (1.5E2, 3.7E-3).

Nigel. Thanks for the suggestion, which works great.

I edited my second ASObjC script (in post 10) to avoid reals (except those with exponential notation). I had some trouble with this, but I think negative lookahead and negative lookbehind were the answer. Number separators break the script, though.

use framework "Foundation"
use scripting additions

set theString to "1 a 2.3 a 4,5 a 89 a 789 a23 45b"
set theString to current application's NSString's stringWithString:theString
set thePattern to "(?<![.,])(\\d+)(?![.,])" -- digit not preceded or followed by a dot or comma
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
set theRanges to (regexResults's valueForKey:"range")
set theMatches to current application's NSMutableArray's new()
repeat with aRange in theRanges
	(theMatches's addObject:(theString's substringWithRange:aRange))
end repeat
return (theMatches's valueForKey:"integerValue") as list --> {1, 89, 789, 23, 45}

BTW, I ran the timing test with my second ASObjC script and the result was 155 milliseconds (as compared with 14 milliseconds with my first ASObjC script). However, with 129 lines in the string, the timing result was a usable 35 milliseconds.

I updated my original script (post #1) to extract negative integers as well.

Perhaps the simplest solution to this is to create a script that fits the standards in a particular area. For example, the standard in the United States is to use a dot for a decimal separator and a comma for digit group separators. So, a solution might be:

use framework "Foundation"
use scripting additions

set theString to "1 a 2,000 a 3.4 a,b a23 45,000b"
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to "," -- a comma
(theString's replaceOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()})
set thePattern to "(?<!\\.)(\\d+)(?!\\.)" -- digit not preceded or followed by a dot
set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
set regexResults to theRegex's matchesInString:theString options:0 range:{location:0, |length|:theString's |length|()}
set theRanges to (regexResults's valueForKey:"range")
set theMatches to current application's NSMutableArray's new()
repeat with aRange in theRanges
	(theMatches's addObject:(theString's substringWithRange:aRange))
end repeat
return (theMatches's valueForKey:"integerValue") as list --> {1, 2000, 23, 45000}

I decided to improve further the capabilities of my original script, once again. Now, the source text can contain not only negative values, but also real ones.

The following (improved) script will handle this more versatile task:
 

extractNumbersFromText("One two three -1234 four five 890 six 56 seven eight -9.89775nine.")
return item 1 of result -- I return here only found integers

on extractNumbersFromText(theText)
	set numbersList to paragraphs of ¬
		(do shell script "echo " & quoted form of theText & " | grep -o -E ' [+-]?[0-9]+([.][0-9]+)? '")
	set ATID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ","
	set numbersList to run script ("{" & numbersList & "}")
	set AppleScript's text item delimiters to ATID
	set realsList to every real of numbersList
	set integersList to every integer of numbersList
	return {integersList, realsList}
end extractNumbersFromText

 
NOTES:

Negative integers are detected only when a minus sign (-) is followed immediately by digits (no spaces or other non-digit characters in between).

The decimal separator is assumed to be a dot (.). If the source text uses a comma (,) as the decimal separator instead, the dot in the grep regular expression should be replaced with a comma.

I don’t think any of the currently posted code extracts integers, and only integers, from text containing integer and non-integer numbers. I know not everyone is referring to extracting integers at this point and the code and concepts are all very interesting to me, but I can’t help thinking that this would be better framed as “extract uninterrupted numeric runs from text”.

If I have data like…


"3^2 $3.50 √2 3(2*2) 4.3333... 0.00000000000001 2.2E2 One 1.2 three 1234 four five -890 six 56 seven eight $989775 nine. Johny5" 

…and I need to get the integers from the text, I am probably looking for something like { 1234, -890, 56, 989775 } or maybe even just { 1234, -890, 56 } as a result. In what situation would I want the result to be {3, 2, 3, 50, 2, 3, 2, 2, 4, 3333, 0, 1, 2, 2, 2, 1, 2, 1234, -890, 56, 989775, 5}?

While I can’t think of a reason I would want the integer components of non-integer values from a string, if I did, what numbers (not strings) would I want returned for the data “0.00000000000001”? {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1}? or omg {0,1}!!?! I feel like the result of extractIntegersFromText(“0.00000000000001”) should be {}

I added two required spaces to the regular expression. Post #34 (updated)

Paul. My suggestion:

use framework "Foundation"
use scripting additions

set theString to "3^2 $3.50 √2 3(2*2) 4.3333... 0.00000000000001 2.2E2 One 1.2 three 1234 four five -890 six 56 seven eight $989775 nine. Johny5 22, 33."
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to "\\h[$]" -- space followed by dollar sign
(theString's replaceOccurrencesOfString:thePattern withString:space options:1024 range:{0, theString's |length|()})
set theArray to (theString's componentsSeparatedByString:space)
set thePredicate to current application's NSPredicate's predicateWithFormat:"self MATCHES '[-]{0,1}[0-9]++[,.]{0,1}'"
set theNewArray to (theArray's filteredArrayUsingPredicate:thePredicate)
return (theNewArray's valueForKey:"integerValue") as list --> {1234, -890, 56, 989775, 22, 33}

Trailing periods and commas are ignored, and other trailing punctuation can be ignored if desired.

Are those integers???

New pattern:

use framework "Foundation"
use scripting additions

property theNumbersArray : missing value
property theNumbersList : {}
property theIntegersList : {}
property theDoublesList : {}
property theFloatsList : {}

set theString to "One 1.2 three 1234 four five 890 six 56 seven eight 989775nine. .005  1.5E2 3.7E-3"
set theString to current application's NSMutableString's stringWithString:theString
set thePattern to ".*?([.]?[0-9][0-9,.E-]+)\\s?|.+"
(theString's replaceOccurrencesOfString:thePattern withString:"$1 " options:1024 range:{0, theString's |length|()})
set aCharSet to current application's NSCharacterSet's whitespaceCharacterSet()
set theString to (theString's stringByTrimmingCharactersInSet:aCharSet)
set theNumbersArray to (theString's componentsSeparatedByString:space)
set theNumbersList to theNumbersArray as list
set theIntegersList to (theNumbersArray's valueForKey:"integerValue") as list
set theDoublesList to (theNumbersArray's valueForKey:"doubleValue") as list
set theFloatsList to (theNumbersArray's valueForKey:"floatValue") as list

--> theNumbersList {"1.2", "1234", "890", "56", "989775", ".005", "1.5E2", "3.7E-3"}
--> theIntegersList {1, 1234, 890, 56, 989775, 0, 1, 3}
--> theDoublesList {1.2, 1234.0, 890.0, 56.0, 9.89775E+5, 0.005, 150.0, 0.0037}
--> theFloatsList {1.200000047684, 1234.0, 890.0, 56.0, 9.89775E+5, 0.004999999888, 150.0, 0.003700000001}


:wink: that matches 1E, too. The problem is a bit finicky.

1.5E2 is an integer in the sense that it has no decimals. 1.5123E2 is not. 150E-1 is, 151E-1 is not.
These cases can’t be caught by regular expressions or string parsing.
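One possible way to test for "no decimals" is to evaluate each candidate numerically first and only then compare the value with its whole part. A rough sketch, assuming the candidate strings have already been isolated and that AppleScript's text-to-real coercion accepts the exponent forms in question; integerValuedNumbers is a hypothetical handler name:

```applescript
-- Hypothetical handler: keeps only strings whose numeric value has no fractional part.
on integerValuedNumbers(stringList)
	set resultList to {}
	repeat with aString in stringList
		try
			set aNumber to (contents of aString) as real
			-- div 1 discards the fraction, so the two sides are equal only for whole values.
			if aNumber = (aNumber div 1) then set end of resultList to aNumber as integer
		end try
	end repeat
	return resultList
end integerValuedNumbers

integerValuedNumbers({"1.5E2", "1.5123E2", "150E-1", "151E-1"})
```

Strings that fail the coercion, or values too large for an AppleScript integer, are silently skipped by the try block.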

Using NSScanner:

use framework "Foundation"
use scripting additions

property theDoublesList : {}
property theIntegersList : {}

set theString to "One 1.2 three 1234 four five -890 six 56 seven eight 989775nine .005  1.5E2 3.7E-3"
set theString to current application's NSString's stringWithString:theString
set theScanner to current application's NSScanner's localizedScannerWithString:theString
set aNonDecimalSet to current application's NSCharacterSet's decimalDigitCharacterSet()'s invertedSet()
theScanner's setCharactersToBeSkipped:aNonDecimalSet
set theDoublesList to {}
set theIntegersList to {}
set scannedAll to false
repeat while (not scannedAll)
	set {matchFound, aDouble} to theScanner's scanDouble:(reference)
	if (matchFound) then
		set the end of theDoublesList to aDouble
		set the end of theIntegersList to aDouble as integer
	end if
	set scannedAll to (theScanner's atEnd) as boolean
end repeat

--> theDoublesList {1.2, 1234.0, 890.0, 56.0, 9.89775E+5, 5.0, 150.0, 0.0037}
--> theIntegersList {1, 1234, 890, 56, 989775, 5, 150, 0}

This is good! Works even with my extremely stupid test data values!