I happened to notice that “%” appears to be a word-break character when parsing a string. This is illustrated by:
set aString to words of "This%is%a%test" --> {"This", "is", "a", "test"}
The ASLG defines word under the heading “Elements of Text Objects” as follows:
I haven’t been able to locate any preference pane with word-break rules and wondered if anyone knew where I might find that (or what these rules are exactly).
BTW, after some testing, I’ve pretty much concluded that it’s faster and more reliable just to use text item delimiters, but perhaps each has its uses.
I think what it’s trying to say is that what constitutes a word break depends on the language you choose in System Preferences. I suspect the rules used are those defined by either the Unicode specification or the ICU.
Thanks Shane. I’ll do some research on the Unicode specification and ICU, which I know nothing about. In the meantime, I’ll use text item delimiters instead of word if the string contains anything that might without my knowledge be interpreted as a word-break character.
CAUTION, text item delimiters may be inaccurate.
How many words do you expect to get from “This%is|a%test”?
I don’t know about the USA, but here, with French in use, the string contains 5 words because the pipe is treated as a word of its own.
set aString to "This%is|a%test"
set theWords to words of aString --> {"This", "is", "|", "a", "test"}
set withDelims to my decoupe(aString, {"%", "|"})
{theWords, withDelims} --> {{"This", "is", "|", "a", "test"}, {"This", "is", "a", "test"}}
#=====
on decoupe(t, d)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe
#=====
How would you get the five words using delimiters?
Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Thursday 26 December 2019 15:10:37
Yvan. The string as shown in my first post does not contain a pipe and is instead:
“This%is%a%test”
The actual string I was trying to parse was:
Final file size is 52836 bytes, 49.33% of original.
I wanted to get 49.33 (not 49.33%), and this actually worked using word to parse because % is apparently an English-language word-break character.
Anyway, using text item delimiters:
set theString to "Final file size is 52836 bytes, 49.33% of original."
set text item delimiters to {" ", "%"}
set theNumber to text item -4 of theString
set text item delimiters to {""}
theNumber --> 49.33
Or perhaps:
set theString to "Final file size is 52836 bytes, 49.33% of original."
set text item delimiters to {" "}
set theNumber to (text 1 thru -2 of (text item -3)) of theString
set text item delimiters to {""}
theNumber --> 49.33
As I’m not a soothsayer, I was unable to guess what you really wanted to achieve.
I just wanted to point out that delimiters aren’t an accurate way to split a text into words.
set theString to "Final file size is 52836 bytes, 49.33% of original."
set theNumber to word 7 of theString --> "49.33"
would do the job.
Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Thursday 26 December 2019 15:33:36
Thanks Yvan. I misunderstood the point you were making.
The script in your post 6 does work, but it raises my original point. How can I use word breaks to accurately parse a string when I don’t know what the word-break characters are in my (or any other) locale? The ASLG acknowledges this issue in the definition and discussion of “word”:
Even without locale issues, parsing for words is fraught. Is out-law one word or two?
FWIW, here are a couple of ways to parse your string using an NSScanner:
use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions
set theNumbers to {}
set theScanner to current application's NSScanner's scannerWithString:"Final file size is 52836 bytes, 49.33% of original."
repeat
	theScanner's scanUpToCharactersFromSet:(current application's NSCharacterSet's decimalDigitCharacterSet) intoString:(missing value)
	set {theResult, theNum} to theScanner's scanDouble:(reference)
	if not theResult then exit repeat
	set end of theNumbers to theNum
end repeat
return theNumbers
Or more specifically for your case:
use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions
set theNumbers to {}
set theScanner to current application's NSScanner's scannerWithString:"Final file size is 52836 bytes, 49.33% of original."
theScanner's scanUpToString:"," intoString:(missing value)
theScanner's scanString:"," intoString:(missing value)
set {theResult, theNum} to theScanner's scanDouble:(reference)
return theNum
Thanks Shane. It’s interesting that NSScanner has both scannerWithString: and localizedScannerWithString:. The idea that a command may be localized (which apparently applies when parsing with words) is not something I’ve much considered. Always something new to learn.
Lots of string handling is very locale-dependent. The localizedScannerWithString: method is just a shortcut for using scannerWithString: and then using setLocale: to set the scanner’s locale.
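To make the locale point concrete, here is a minimal sketch of setting a scanner's locale explicitly. The handler name scanLocalizedNumber and the French sample string are my own inventions for illustration; the assumption is that the scanner takes its decimal separator from the locale it is given, so a comma-separated value should scan as a single number under "fr_FR".

```applescript
use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper (not from the posts above): scan a number at the start of
-- theText, using the decimal separator of the locale named by localeID.
on scanLocalizedNumber(theText, localeID)
	set theScanner to current application's NSScanner's scannerWithString:theText
	-- Explicit equivalent of localizedScannerWithString:, but with a chosen locale
	-- rather than the user's current one.
	theScanner's setLocale:(current application's NSLocale's localeWithLocaleIdentifier:localeID)
	set {theResult, theNum} to theScanner's scanDouble:(reference)
	if not theResult then return missing value
	return theNum
end scanLocalizedNumber

scanLocalizedNumber("49,33 % de l'original", "fr_FR")
```

With "en_US" instead, I would expect the scan to stop at the comma, which is the kind of difference to watch for when the text comes from a command-line tool rather than from the user.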
Another way to extract a numeric text that’s followed in the original text by a percent sign is to use a regex pattern. There shouldn’t be any localisation issues with this:
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions
set theString to current application's NSString's stringWithString:"Final file size is 52836 bytes, 49.33% of original."
set matchRange to theString's rangeOfString:("(?<!\\S)(?:[0-9][0-9., ]*)?[0-9](?=%)") options:(current application's NSRegularExpressionSearch) range:({0, theString's |length|()})
if (matchRange's |length|() > 0) then
	set theNumber to (theString's substringWithRange:(matchRange)) as text
else
	set theNumber to missing value
end if
In retrospect, perhaps I made a mistake by trying to parse information from the string, which was output by a command-line utility. I say that because it’s an easy matter to get the sizes of the input and output files and to do the calculations in the script.
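For what it's worth, a rough sketch of doing the calculation in the script might look like this. Both handler names are hypothetical, the System Events call is just one of several ways to get a file's size, and 107107 is an illustrative original size chosen to be consistent with the figures in the sample string, not a value from the utility.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

-- Hypothetical helper: size in bytes of the file at a POSIX path, via System Events.
on fileSizeInBytes(posixPath)
	tell application "System Events" to return size of file posixPath
end fileSizeInBytes

-- Hypothetical helper: final size as a percentage of the original size,
-- rounded to two decimal places.
on percentOfOriginal(originalBytes, finalBytes)
	if originalBytes = 0 then error "Original size is zero."
	return (round (finalBytes / originalBytes * 10000)) / 100
end percentOfOriginal

percentOfOriginal(107107, 52836) --> 49.33
```

In real use the two arguments would come from fileSizeInBytes called on the input and output paths, so nothing depends on how the utility formats its message.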
Sorry for the confusion. My comment was an attempt at humour, based on the different ways Shane’s script and mine identify which of the two numbers is the one required. Mine gets the first one which has a “%” character after it; Shane’s gets the one which immediately follows the first comma, disregarding the space. Either would be fine for peavine’s purposes.
The regex pattern in my script intentionally matches any sequence of digits, spaces, periods, and/or commas which both starts and ends with a digit (including the possibility of there being only one digit), is not immediately preceded by a non-space character, and is immediately followed by “%”. It would be fairly simple to modify the pattern specifically to meet Shane’s objections, but perhaps not so simple to read it!
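The behaviour described above can be checked against a few sample strings with something like the following sketch. It uses the same pattern, but goes through NSRegularExpression instead of rangeOfString: so that "no match" comes back as missing value; the handler name firstPercentValue and all the sample strings except the first are made up for the purpose.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: return the first number-like run that is immediately
-- followed by "%", or missing value if there is none.
on firstPercentValue(theText)
	set theString to current application's NSString's stringWithString:theText
	set thePattern to "(?<!\\S)(?:[0-9][0-9., ]*)?[0-9](?=%)"
	set theRegex to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
	set theMatch to theRegex's firstMatchInString:theString options:0 range:{0, theString's |length|()}
	if theMatch is missing value then return missing value
	return (theString's substringWithRange:(theMatch's range())) as text
end firstPercentValue

firstPercentValue("Final file size is 52836 bytes, 49.33% of original.") --> "49.33"
firstPercentValue("Saved 1,234.5% more") --> "1,234.5" (commas and periods inside the run are allowed)
firstPercentValue("7%") --> "7" (a single digit at the start of the text)
firstPercentValue("abc50% done") --> missing value (the digits are preceded by a non-space character)
```

Note that because spaces are allowed inside the run, a string such as "10 20%" would come back as "10 20", which is the intended behaviour of the pattern rather than a bug in the sketch.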
Indeed – but my original scripts made no claim to localization. Whether localization is needed is obviously going to depend on the source of the string, and my guess was that we were dealing with the output of a command-line utility, and therefore localization was possibly going to be counter-productive.
But where localization is not an issue, regex is often going to be a better (i.e., faster, simpler) method.