AppleScript Word-Break Rules

I happened to notice that “%” appears to be a word-break character when parsing a string. This is illustrated by:

set aString to words of "This%is%a%test" --> {"This", "is", "a", "test"}

The ASLG defines word under the heading “Elements of Text Objects” as follows:

I haven’t been able to locate any preference pane with word-break rules and wondered if anyone knew where I might find that (or what these rules are exactly).

BTW, after some testing, I’ve pretty much concluded that it’s faster and more reliable just to use text item delimiters, but perhaps each has its use.

I think what it’s trying to say is that what constitutes a word break depends on the language you choose in System Preferences. I suspect the rules used are those defined by either the Unicode specification or the ICU.

Thanks Shane. I’ll do some research on the Unicode specification and ICU, which I know nothing about. In the meantime, I’ll use text item delimiters instead of word if the string contains anything that might without my knowledge be interpreted as a word-break character.

CAUTION: text item delimiters may be inaccurate.
How many words do you expect to get from “This%is|a%test”?
I don’t know about the USA, but here, with French in use, the string contains 5 words because the pipe is treated as a word of its own.

set aString to "This%is|a%test"
set theWords to words of aString --> {"This", "is", "|", "a", "test"}
set withDelims to my decoupe(aString, {"%", "|"})

{theWords, withDelims} --> {{"This", "is", "|", "a", "test"}, {"This", "is", "a", "test"}}

#=====

on decoupe(t, d)
	local oTIDs, l
	set {oTIDs, AppleScript's text item delimiters} to {AppleScript's text item delimiters, d}
	set l to text items of t
	set AppleScript's text item delimiters to oTIDs
	return l
end decoupe

#=====

How would you get the five words using delimiters?

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Thursday 26 December 2019 15:10:37

Yvan. The string as shown in my first post does not contain a pipe and is instead:

“This%is%a%test”

The actual string I was trying to parse was:

Final file size is 52836 bytes, 49.33% of original.

I wanted to get 49.33 (not 49.33%), and this actually worked using word to parse because % is apparently an English-language word-break character.

Anyways, using text item delimiters:

set theString to "Final file size is 52836 bytes, 49.33% of original."
set text item delimiters to {" ", "%"}
set theNumber to text item -4 of theString
set text item delimiters to {""}

theNumber --> 49.33

Or perhaps:

set theString to "Final file size is 52836 bytes, 49.33% of original."
set text item delimiters to {" "}
set theNumber to (text 1 thru -2 of (text item -3)) of theString
set text item delimiters to {""}

theNumber --> 49.33

As I’m not a soothsayer, I was unable to guess what you really wanted to achieve.
I just wanted to point out that delimiters aren’t an accurate way to split a text into words.

set theString to "Final file size is 52836 bytes, 49.33% of original."
set theNumber to word 7 of theString --> "49.33"

would do the job.

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Thursday 26 December 2019 15:33:36

Thanks Yvan. I misunderstood the point you were making.

The script in your post 6 does work, but it raises my original point: how can I use word breaks to accurately parse a string when I don’t know what the word-break characters are in my (or any other) locale? The ASLG acknowledges this issue in the definition and discussion of “word”:

https://developer.apple.com/library/archive/documentation/AppleScript/Conceptual/AppleScriptLangGuide/reference/ASLR_classes.html#//apple_ref/doc/uid/TP40000983-CH1g-DontLinkElementID_618

Even without locale issues, parsing for words is fraught. Is out-law one word or two?

FWIW, here are a couple of ways to parse your string using an NSScanner:

use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions

set theNumbers to {}
set theScanner to current application's NSScanner's scannerWithString:"Final file size is 52836 bytes, 49.33% of original."
repeat
	theScanner's scanUpToCharactersFromSet:(current application's NSCharacterSet's decimalDigitCharacterSet) intoString:(missing value)
	set {theResult, theNum} to theScanner's scanDouble:(reference)
	if not theResult then exit repeat
	set end of theNumbers to theNum
end repeat
return theNumbers

Or more specifically for your case:

use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions

set theScanner to current application's NSScanner's scannerWithString:"Final file size is 52836 bytes, 49.33% of original."
theScanner's scanUpToString:"," intoString:(missing value)
theScanner's scanString:"," intoString:(missing value)
set {theResult, theNum} to theScanner's scanDouble:(reference)
return theNum

And in case the string is localized, you can replace scannerWithString: with localizedScannerWithString:.

Thanks Shane. It’s interesting that NSScanner has scannerWithString: and localizedScannerWithString:. The idea that a command may be localized (which apparently applies when parsing with words) is not something I’ve much considered. Always something new to learn. :)

Lots of string handling is very locale-dependent. The localizedScannerWithString: method is just a shortcut for using scannerWithString: and then using setLocale: to set the scanner’s locale.
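As a rough illustration of that equivalence, here is a minimal sketch that sets the locale explicitly instead of using the shortcut. The "fr_FR" identifier and the sample string with a decimal comma are assumptions chosen for the example; they are not from the posts above.

```applescript
use AppleScript version "2.5" -- macOS 10.11 or later
use framework "Foundation"
use scripting additions

-- Assumed sample: the same report line, but written with a decimal comma.
set theScanner to current application's NSScanner's scannerWithString:"49,33% of original."

-- Set the locale by hand. Passing NSLocale's currentLocale() here instead
-- should match what localizedScannerWithString: does.
theScanner's setLocale:(current application's NSLocale's localeWithLocaleIdentifier:"fr_FR")

-- With a French locale the scanner is expected to treat the comma as the decimal separator.
set {theResult, theNum} to theScanner's scanDouble:(reference)
return theNum
```

Pinning a specific locale like this may be preferable when the text comes from a known source, since the result then does not depend on the settings of the machine running the script.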

Another way to extract a numeric text that’s followed in the original text by a percent sign is to use a regex pattern. There shouldn’t be any localisation issues with this:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theString to current application's NSString's stringWithString:"Final file size is 52836 bytes, 49.33% of original."
set matchRange to theString's rangeOfString:("(?<!\\S)(?:[0-9][0-9., ]*)?[0-9](?=%)") options:(current application's NSRegularExpressionSearch) range:({0, theString's |length|()})
if (matchRange's |length|() > 0) then
	set theNumber to (theString's substringWithRange:(matchRange)) as text
else
	set theNumber to missing value
end if

You wish ;). Some languages include a space before the % sign, and some even put the sign first.

I expect that some sentences don’t have a comma before the particular number required either. ;)

Hello Nigel.
I’m not sure, at least if I understand correctly what you wrote, that this comma is a problem.

I tested with “Final file size is 52836 bytes 49.33% of original.” and with “Final file size is 52836 bytes 49,33% of original.”

The comma you wrote about was not there, and the result was the correct one:
“49.33” in the first case, “49,33” in the second. ;)

Shane’s code returns not the string but the numerical value, which in both cases is displayed as 49.33.

Yvan KOENIG running High Sierra 10.13.6 in French (VALLAURIS, France) Sunday 29 December 2019 14:50:37

In retrospect, perhaps I made a mistake by trying to parse information from the string, which was output by a command-line utility. I say that because it’s an easy matter to get the sizes of the input and output files and to do the calculations in the script.
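For what it’s worth, the calculation itself might look something like the sketch below. The handler name percentOfOriginal is hypothetical, and the original size of 107107 bytes is an invented placeholder (only 52836 appears in the utility’s output). In a real script the two sizes could come from, for example, the size property of the files in System Events.

```applescript
use scripting additions

-- Hypothetical helper: final size as a percentage of the original, to two decimal places.
on percentOfOriginal(finalSize, originalSize)
	return (round (finalSize / originalSize * 10000)) / 100
end percentOfOriginal

-- Placeholder byte counts: 52836 is from the thread, 107107 is made up for the example.
set thePercent to percentOfOriginal(52836, 107107) --> 49.33
```

This avoids parsing text altogether, so neither word-break rules nor decimal separators come into it.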

Among the first 800 Unicode characters, the following are both word characters and word-boundary characters:[format]$ + < = > ^ ` | ~ ¢ £ ¤ ¥ ¦ ¨ © ª ® ¯ ° ± ² ³ ´ µ ¸ ¹ º ¼ ½ ¾ × ÷ ˘ ˙ ˚ ˛ ˜ ˝ ¬ ˥ ˦ ˧ ˨ ˩ ˪ ˫[/format]Because they delimit word boundaries but also identify as word characters, they remain as single-character items when present in a string that undergoes AppleScript’s word-splitting.

The following characters usually only act as word-boundary characters, and typically get obliterated during word-splitting:[format]! " # % & ’ ( ) * , - . / : ; ? @ [ \ ] _ { } ¡ § « ¶ · » ¿[/format]
Out of that list, the following characters are occasionally spared from obliteration if they occur within a run of word characters that are not also word-splitters:[format]’ . _ ·[/format]
Those that remain are word characters. Any of them occurring adjacent to another member of the same set is immune to AppleScript’s word-splitting. The following characters form this set of word characters:[format] 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z a b c d e f g h i j k l m n
o p q r s t u v w x y z ­À Á Ã Ä Å Æ Ç È É Ê
Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä
å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ
ÿ Ā ā Ă ă Ą ą Ć ć Ĉ ĉ Ċ ċ Č č Ď ď Đ đ Ē ē Ĕ ĕ Ė ė
Ę ę Ě ě Ĝ ĝ Ğ ğ Ġ ġ Ģ ģ Ĥ ĥ Ħ ħ Ĩ ĩ Ī ī Ĭ ĭ Į į İ
ı IJ ij Ĵ ĵ Ķ ķ ĸ Ĺ ĺ Ļ ļ Ľ ľ Ŀ ŀ Ł ł Ń ń Ņ ņ Ň ň ʼn
Ŋ ŋ Ō ō Ŏ ŏ Ő ő Œ œ Ŕ ŕ Ŗ ŗ Ř ř Ś ś Ŝ ŝ Ş ş Š š Ţ
ţ Ť ť Ŧ ŧ Ũ ũ Ū ū Ŭ ŭ Ů ů Ű ű Ų ų Ŵ ŵ Ŷ ŷ Ÿ Ź ź Ż
ż Ž ž ſ ƀ Ɓ Ƃ ƃ Ƅ ƅ Ɔ Ƈ ƈ Ɖ Ɗ Ƌ ƌ ƍ Ǝ Ə Ɛ Ƒ ƒ Ɠ Ɣ
ƕ Ɩ Ɨ Ƙ ƙ ƚ ƛ Ɯ Ɲ ƞ Ɵ Ơ ơ Ƣ ƣ Ƥ ƥ Ʀ Ƨ ƨ Ʃ ƪ ƫ Ƭ ƭ
Ʈ Ư ư Ʊ Ʋ Ƴ ƴ Ƶ ƶ Ʒ Ƹ ƹ ƺ ƻ Ƽ ƽ ƾ ƿ ǀ ǁ ǂ ǃ DŽ Dž dž
LJ Lj lj NJ Nj nj Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ ǝ Ǟ ǟ
Ǡ ǡ Ǣ ǣ Ǥ ǥ Ǧ ǧ Ǩ ǩ Ǫ ǫ Ǭ ǭ Ǯ ǯ ǰ DZ Dz dz Ǵ ǵ Ƕ Ƿ Ǹ
ǹ Ǻ ǻ Ǽ ǽ Ǿ ǿ Ȁ ȁ Ȃ ȃ Ȅ ȅ Ȇ ȇ Ȉ ȉ Ȋ ȋ Ȍ ȍ Ȏ ȏ Ȑ ȑ
Ȓ ȓ Ȕ ȕ Ȗ ȗ Ș ș Ț ț Ȝ ȝ Ȟ ȟ Ƞ ȡ Ȣ ȣ Ȥ ȥ Ȧ ȧ Ȩ ȩ Ȫ
ȫ Ȭ ȭ Ȯ ȯ Ȱ ȱ Ȳ ȳ ȴ ȵ ȶ ȷ ȸ ȹ Ⱥ Ȼ ȼ Ƚ Ⱦ ȿ ɀ Ɂ ɂ Ƀ
Ʉ Ʌ Ɇ ɇ Ɉ ɉ Ɋ ɋ Ɍ ɍ Ɏ ɏ ɐ ɑ ɒ ɓ ɔ ɕ ɖ ɗ ɘ ə ɚ ɛ ɜ
ɝ ɞ ɟ ɠ ɡ ɢ ɣ ɤ ɥ ɦ ɧ ɨ ɩ ɪ ɫ ɬ ɭ ɮ ɯ ɰ ɱ ɲ ɳ ɴ ɵ
ɶ ɷ ɸ ɹ ɺ ɻ ɼ ɽ ɾ ɿ ʀ ʁ ʂ ʃ ʄ ʅ ʆ ʇ ʈ ʉ ʊ ʋ ʌ ʍ ʎ
ʏ ʐ ʑ ʒ ʓ ʔ ʕ ʖ ʗ ʘ ʙ ʚ ʛ ʜ ʝ ʞ ʟ ʠ ʡ ʢ ʣ ʤ ʥ ʦ ʧ
ʨ ʩ ʪ ʫ ʬ ʭ ʮ ʯ ʰ ʱ ʲ ʳ ʴ ʵ ʶ ʷ ʸ ʹ ʺ ʻ ʼ ʽ ʾ ʿ ˀ
ˁ ˂ ˃ ˄ ˅ ˆ ˇ ˈ ˉ ˊ ˋ ˌ ˍ ˎ ˏ ː ˑ ˒ ˓ ˔ ˕ ˖ ˗ ˞ ˟
ˠ ˡ ˢ ˣ ˤ ˬ ˭ ˮ ˯ ˰ ˱ ˲ ˳ ˴ ˵ ˶ ˷ ˸ ˹ ˺ ˻ ˼ ˽ ˾ ˿̛̖̗̘̙̜̝̞̟̠̀́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̕̚

[/format]

NOTE: These character sets were derived within an en_GB locale environment.

Hi Yvan.

Sorry for the confusion. My comment was an attempt at humour, based on the different ways Shane’s script and mine identify which of the two numbers is the one required. Mine gets the first one which has a “%” character after it; Shane’s gets the one which immediately follows the first comma, disregarding the space. Either would be fine for peavine’s purposes.

The regex pattern in my script intentionally matches any sequence of digits, spaces, periods, and/or commas which both starts and ends with a digit (including the possibility of there being only one digit), is not immediately preceded by a non-space character, and is immediately followed by “%”. It would be fairly simple to modify the pattern specifically to meet Shane’s objections, but perhaps not so simple to read it!


Thanks CK. That’s both interesting and useful.

Initially I did not understand how a character could be both a word- and a boundary-character, but the following made things clear.


set aString to "aa$bb+cc%dd&ee"

set theWords to words of aString --> {"aa", "$", "bb", "+", "cc", "dd", "ee"}

Indeed – but my original scripts made no claim to localization. Whether localization is needed is obviously going to depend on the source of the string, and my guess was that we were dealing with the output of a command-line utility, and therefore localization was possibly going to be counter-productive.

But where localization is not an issue, regex is often going to be a better (i.e., faster, simpler) method.