Determine case of a letter within a string

Scott_Champagne · March 16, 2018, 12:41pm

Good morning,

In an AppleScript, I’m trying to determine if a string contains a capital letter.

I see many references to changing an entire string from lower to upper and vice versa. However, I do not see any references on how to determine if a string contains a particular case.

I’m looking to determine if a string “sTring” contains any capitalization.

Anyone have a solution?

Yvan_Koenig · March 16, 2018, 1:30pm

You may try:

use AppleScript version "2.3.1"
use scripting additions
use framework "Foundation"

my isItAllLower:"string" --> true
my isItAllLower:"sTring" --> false

on isItAllLower:aString
	set allLower to (current application's NSString's stringWithString:aString)'s lowercaseString() as text
	considering case
		aString = allLower
	end considering
	return result
end isItAllLower:

Yvan KOENIG running High Sierra 10.13.3 in French (VALLAURIS, France) Friday, March 16, 2018, 14:29:53

Shane_Stanley · March 16, 2018, 1:37pm

Two options:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theString to "sTring"
set theString to current application's NSString's stringWithString:theString
return not (theString's isEqualToString:(theString's lowercaseString())) as boolean

Or:

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set theString to "sTring"
set theString to current application's NSString's stringWithString:theString
set theRange to theString's rangeOfCharacterFromSet:(current application's NSCharacterSet's uppercaseLetterCharacterSet())
return (|length| of theRange = 1)

bmose · March 22, 2018, 7:46pm

Here is a sed shell script solution that returns a bit more case information. Not as slick or fast as ASObjC, but it works:


on caseInfo(theString)
	-- Returns 1 if lowercase only, 2 if uppercase only, 3 if both lowercase and uppercase, and 0 if neither (i.e., no alphabetic characters)
	return (do shell script "echo $(( $(sed -E 'h ; s/[^a-z]//g ; s/.+/+1/ ; x ; s/[^A-Z]//g ; s/.+/+2/ ; G ; ' <<<" & theString's quoted form & ") ))") as integer
end caseInfo

caseInfo("string") --> 1
caseInfo("STRING") --> 2
caseInfo("sTring") --> 3
caseInfo("123456") --> 0
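For readers less familiar with sed's hold space, the one-liner above can be unpacked outside AppleScript. What follows is only a sketch in POSIX shell, not the handler itself. The function name case_info is made up for illustration, printf stands in for the here-string so that it also runs under plain sh, and LC_ALL=C is an added assumption that pins the [a-z] and [A-Z] ranges to ASCII.

```shell
#!/bin/sh
# Hypothetical standalone version of the sed pipeline inside caseInfo.
# Prints 1 (lowercase only), 2 (uppercase only), 3 (both) or 0 (neither).
case_info() {
    terms=$(printf '%s\n' "$1" | LC_ALL=C sed -E '
        # h: copy the input line to the hold space
        h
        # delete everything that is not a-z; if anything is left, replace it with +1
        s/[^a-z]//g
        s/.+/+1/
        # x: swap, so the untouched copy is back in the pattern space
        x
        # same again for A-Z, this time yielding +2
        s/[^A-Z]//g
        s/.+/+2/
        # G: append the held +1 (or empty string) after a newline
        G
    ')
    # The leading 0 keeps the expression valid when sed produced no terms at all
    echo $(( 0 $terms ))
}

case_info string   # 1
case_info STRING   # 2
case_info sTring   # 3
case_info 123456   # 0
```

The sed script only ever emits the text +1 and/or +2; the actual sum is done by the shell's arithmetic expansion, which is why the handler wraps the whole thing in $(( )).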

Shane_Stanley · March 22, 2018, 11:16pm

But it fails with non-ASCII characters:

caseInfo("strÏng") --> 1

(I think Nigel had a workaround for this issue.)
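The miss can be reproduced in isolation. This is a minimal sketch, assuming a POSIX shell and forcing LC_ALL=C so the result is the same everywhere; the helper name ascii_uppercase_only is hypothetical. It runs just the uppercase-filtering step of the sed command.

```shell
#!/bin/sh
# Hypothetical helper: keep only what the ASCII range [A-Z] recognises as uppercase.
ascii_uppercase_only() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^A-Z]//g'
}

ascii_uppercase_only 'sTring'   # prints T
ascii_uppercase_only 'strÏng'   # prints an empty line: Ï is not in A-Z
```

Because the Ï is thrown away along with the lowercase letters, nothing is left to turn into +2, and the handler reports lowercase only.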

bmose · March 23, 2018, 1:36am

Point well taken.

Indeed he does!

Multi-byte characters and ‘sed’

Applying Nigel’s technique of prefixing sed with LC_ALL='en_US' (or LC_ALL='en_GB', if you prefer) and using the POSIX [:lower:] and [:upper:] character classes to match lowercase and uppercase characters robustly, the modified sed solution now handles both ASCII and non-ASCII text:


on caseInfo(theString)
	-- Returns 1 if lowercase only, 2 if uppercase only, 3 if both lowercase and uppercase, and 0 if neither (i.e., no alphabetic characters)
	return (do shell script "echo $(( $(LC_ALL='en_US' sed -E 'h ; s/[^[:lower:]]//g ; s/.+/+1/ ; x ; s/[^[:upper:]]//g ; s/.+/+2/ ; G ; ' <<<" & theString's quoted form & ") ))") as integer
end caseInfo

caseInfo("string") --> 1
caseInfo("STRING") --> 2
caseInfo("sTring") --> 3
caseInfo("123456") --> 0

caseInfo("šţŕĭńġ") --> 1
caseInfo("ŚŢŘĨŅĜ") --> 2
caseInfo("šŢŕĭńĜ") --> 3
caseInfo("¡¢¤¥¦§©«®°±¶»¼¿") --> 0

bmose · March 23, 2018, 6:14am

Two further refinements were made to the sed command:

It now handles multiline strings properly by loading all lines before performing case testing.
It now performs less text substitution and eliminates one unnecessary hold space read and is thus a bit more efficient (although this gain in execution speed would be small in comparison with the fixed 0.02 seconds or so of overhead of executing the do shell script command).


on caseInfo(theString)
	-- Returns 1 if lowercase only, 2 if uppercase only, 3 if both lowercase and uppercase, and 0 if neither (i.e., no alphabetic characters)
	return (do shell script "echo $(( $(LC_ALL='en_US' sed -En '1h ; 1!H ; $!d ; g ; /[[:lower:]]/s/.+/+1/p ; g ; /[[:upper:]]/s/.+/+2/p' <<<" & theString's quoted form & ") ))") as integer
end caseInfo

caseInfo("string" & return & "string" & return & "string") --> 1
caseInfo("STRING" & linefeed & "STRING" & linefeed & "STRING") --> 2
caseInfo("sTring" & return & "StRING" & linefeed & "STRing") --> 3
caseInfo("123456" & linefeed & "7890" & return & "#$%&+!=") --> 0

caseInfo("šţŕĭńġ" & linefeed & "šţŕĭńġ" & linefeed & "šţŕĭńġ") --> 1
caseInfo("ŚŢŘĨŅĜ" & return & "ŚŢŘĨŅĜ" & return & "ŚŢŘĨŅĜ") --> 2
caseInfo("šŢŕĭńĜ" & return & "ŚŢŘĨŅĜ" & linefeed & "šţŕĭŅĜ") --> 3
caseInfo("¡¢¤¥¦§©«®°±¶»¼¿" & linefeed & "※‼‽⁂⁅⁆†‡" & return & "∅∬≺⊡⋈") --> 0
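The line-gathering part of the refined command (1h ; 1!H ; $!d ; g) is a common sed idiom, and it may be easier to follow spread over several lines. The following is a sketch under assumptions rather than the handler itself: case_info_multiline is a hypothetical name, printf replaces the here-string, and LC_ALL=C is used instead of en_US so the example behaves identically on any machine, which limits it to ASCII letters.

```shell
#!/bin/sh
# Hypothetical standalone version of the refined, multiline-aware sed command.
case_info_multiline() {
    terms=$(printf '%s\n' "$1" | LC_ALL=C sed -En '
        # first line: start the hold space with it
        1h
        # every later line: append it to the hold space
        1!H
        # not the last line yet: discard and read the next one
        $!d
        # last line: fetch the whole accumulated text and test it once
        g
        /[[:lower:]]/s/.+/+1/p
        # fetch the text again for the uppercase test
        g
        /[[:upper:]]/s/.+/+2/p
    ')
    echo $(( 0 $terms ))
}

case_info_multiline 'abc
abc'    # 1
case_info_multiline 'abc
DEF'    # 3
```

As far as I can tell, the reason this matters is that the earlier version ran its script once per line, so two all-lowercase lines would each have contributed a +1 and summed to 2.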

Edit note: A typo regarding the overhead time of executing the do shell script command was corrected.

bmose · March 24, 2018, 3:17am

Although the previously posted sed solution works, the following ASObjC solution, adapted from Shane Stanley’s handler, returns the same case information as the sed handler but 80 to 90 times faster on my machine, no doubt because of the overhead of the do shell script command. (When will I ever learn?)


use framework "Foundation"
use scripting additions

on caseInfo(theString)
	-- Returns 1 if lowercase only, 2 if uppercase only, 3 if both lowercase and uppercase, and 0 if neither (i.e., no alphabetic characters)
	tell ((||'s NSString)'s stringWithString:theString)
		set hasLowercase to not ((its isEqualToString:(its uppercaseString())) as boolean)
		set hasUppercase to not ((its isEqualToString:(its lowercaseString())) as boolean)
	end tell
	return (hasLowercase as integer) + 2 * (hasUppercase as integer)
end caseInfo

The one scenario where the sed approach might make sense is when case information is needed in the midst of a larger shell script that one doesn’t want to break up. Otherwise, the ASObjC approach presented here is the preferred of the two methods.

Shane_Stanley · March 24, 2018, 4:01am

FWIW, do shell script shouldn’t shoulder all the blame. If I run your code here it takes about 0.26 seconds. Subtract 0.02 * 8 for the do shell script overhead (and I think 0.02 might be on the high side outside an editor) and you still get 0.1. The ASObjC code takes less than 0.002, so there’s still a factor of about 50 times. (Timings done in Script Geek.app.)
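Spelled out, the subtraction looks like this. It is only a sketch of the arithmetic using awk from a POSIX shell; the helper name net_time is hypothetical, and the 8 is presumably the eight caseInfo calls in the test list.

```shell
#!/bin/sh
# Hypothetical helper: net_time TOTAL OVERHEAD_PER_CALL CALLS BASELINE
# Prints the time left once the per-call overhead is removed,
# followed by how many times larger that is than BASELINE.
net_time() {
    awk -v total="$1" -v overhead="$2" -v calls="$3" -v base="$4" 'BEGIN {
        net = total - overhead * calls
        printf "%.2f %.0f\n", net, net / base
    }'
}

net_time 0.26 0.02 8 0.002   # prints: 0.10 50
```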

bmose · March 24, 2018, 2:59pm

I used gdate, the GNU version of the date command, to measure actual sed command execution time within the do shell script command with an accuracy in the range of about a millisecond. (Accuracy beyond that is limited by the time it takes to execute the gdate command itself.) I also measured total do shell script command execution time with the LapTime osax with an accuracy in the range of about a tenth of a millisecond. In this case, do shell script contained only the sed command without the time-testing commands so that it could be compared to ASObjC equivalently. Finally, I measured ASObjC command execution time with the LapTime osax.
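For anyone wanting to try something similar, here is a rough sketch of timing a shell command over repeated runs. It is not the exact script used for the figures below: time_loop is a hypothetical helper, it assumes a GNU date that understands %N (gdate on a Mac with GNU coreutils installed, plain date on most Linux systems), and it includes the cost of launching each process.

```shell
#!/bin/sh
# Sketch: accumulated wall-clock time, in milliseconds, for N runs of a command.
# Assumes GNU date (%N = nanoseconds); gdate is preferred when it is installed.
if command -v gdate >/dev/null 2>&1; then DATE_CMD=gdate; else DATE_CMD=date; fi

# Hypothetical helper: time_loop REPETITIONS COMMAND [ARGS...]
time_loop() {
    reps=$1
    shift
    start_ns=$("$DATE_CMD" +%s%N)
    i=0
    while [ "$i" -lt "$reps" ]; do
        "$@" >/dev/null 2>&1
        i=$(($i + 1))
    done
    end_ns=$("$DATE_CMD" +%s%N)
    # Convert the nanosecond difference to whole milliseconds
    echo $(( ($end_ns - $start_ns) / 1000000 ))
}

# Example: 100 runs of a small sed pipeline
time_loop 100 sh -c 'printf "%s\n" "sTring" | sed -n "/[[:upper:]]/p"'
```

The printed number will vary from machine to machine and from run to run, so it is best read as a ballpark figure.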

Here are the accumulated times to perform 100 repetitions of the sed vs ASObjC algorithms I posted earlier:

Input string:

"string" & return & "string" & return & "string"
"STRING" & linefeed & "STRING" & linefeed & "STRING"
"sTring" & return & "StRING" & linefeed & "STRing"

Accumulated time for 100 repetitions of the sed handler containing the do shell script command and its sed command:

3.682 seconds
3.747 seconds
3.718 seconds

Accumulated time for 100 repetitions of the sed command itself:

0.371 seconds
0.410 seconds
0.411 seconds

Accumulated time for 100 repetitions of the ASObjC handler containing the ASObjC commands:

0.023 seconds
0.022 seconds
0.023 seconds

Ratio of do shell script / ASObjC:

160
170
162

Ratio of sed command alone / ASObjC:

16
19
18
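The ratios are simply each accumulated time divided by the corresponding ASObjC time, rounded to the nearest whole number. A small sketch that recomputes them from the figures above (ratio is a hypothetical helper; awk is assumed to be available):

```shell
#!/bin/sh
# Hypothetical helper: ratio of two timings, rounded to the nearest integer.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", a / b }'
}

# do shell script handler vs ASObjC handler
ratio 3.682 0.023   # 160
ratio 3.747 0.022   # 170
ratio 3.718 0.023   # 162

# sed command alone vs ASObjC handler
ratio 0.371 0.023   # 16
ratio 0.410 0.022   # 19
ratio 0.411 0.023   # 18
```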

I’m not sure why my do shell script command execution times are about double what they were when I measured them previously, but the results are telling nonetheless. Just as you point out, sed is much slower than ASObjC, about 16 to 19 times slower in the current tests. But even that slowness is exacerbated another 10-fold or so by the overhead of do shell script, a veritable double whammy.

Bottom line: Outside of a larger shell script where the sed solution might be convenient, ASObjC is the way to go.

P.S. What I have been calling sed is actually a combination of a sed command, shell arithmetic expansion (the addition), and an echo command. It’s hard to imagine a shell solution that would be dramatically more efficient. Even if such a solution were available, do shell script imposes such a time burden (in this case, about 90% of the burden) that it wouldn’t have a chance in a speed test against ASObjC.

Shane_Stanley · March 25, 2018, 12:55am

It might be slower using sed, but your approach of using regex might still be the better one. For example, this is nearly 20% faster than the previous ASObjC method:

use framework "Foundation"
use scripting additions

on caseInfo(theString)
	-- Returns 1 if lowercase only, 2 if uppercase only, 3 if both lowercase and uppercase, and 0 if neither (i.e., no alphabetic characters)
	set theString to current application's NSString's stringWithString:theString
	set hasLowercase to ((|length| of (theString's rangeOfString:"\\p{ Ll}" options:(current application's NSRegularExpressionSearch))) div 1)
	set hasUppercase to ((|length| of (theString's rangeOfString:"\\p{ Lu}" options:(current application's NSRegularExpressionSearch))) div 1)
	return hasLowercase + 2 * hasUppercase
end caseInfo

caseInfo("string" & return & "string" & return & "string") --> 1
caseInfo("STRING" & linefeed & "STRING" & linefeed & "STRING") --> 2
caseInfo("sTring" & return & "StRING" & linefeed & "STRing") --> 3
caseInfo("123456" & linefeed & "7890" & return & "#$%&+!=") --> 0
--
caseInfo("šţŕĭńġ" & linefeed & "šţŕĭńġ" & linefeed & "šţŕĭńġ") --> 1
caseInfo("ŚŢŘĨŅĜ" & return & "ŚŢŘĨŅĜ" & return & "ŚŢŘĨŅĜ") --> 2
caseInfo("šŢŕĭńĜ" & return & "ŚŢŘĨŅĜ" & linefeed & "šţŕĭŅĜ") --> 3
caseInfo("¡¢¤¥¦§©«®°±¶»¼¿" & linefeed & "※‼‽⁂⁅⁆†‡" & return & "∅∬≺⊡⋈") --> 0

And it may well get faster as the string gets longer, because it doesn’t have to process every character in most cases.

OTOH, the rangeOfCharacterFromSet: method I posted above should have the same advantage, but was slower to begin with.

As in all things AppleScript, the fastest code is… the fastest code.

bmose · March 25, 2018, 2:45am

That forced me to read about Unicode Property Names and the “\p{ Lu}” and “\p{ Ll}” search expressions. It turned into a great learning exercise! What a great idea. The search will stop as soon as it encounters the first character of the matching case. That is very efficient.

Can you please answer one silly question: Why do you place a space character before Lu and Ll inside the curly braces?

Shane_Stanley · March 25, 2018, 6:54am

It was a mistake.

bmose · March 25, 2018, 10:12am

OK, thanks.

Having grown comfortable over the years with the ability to express powerful regular expressions in tersely coded grep and sed commands, I at first balked at the verbosity and clunkiness of NSRegularExpression, NSRegularExpressionSearch, and related items. But, my goodness, what powerful animals they are! Not only do they offer Perl-like regex features such as lookahead and lookbehind searching and so much more, but they also make full use of the Unicode standard, a nice example being your use of Unicode Property Names to find case-specific information efficiently to solve the current problem. I suspect that through repetition, NSRegularExpression will become just as comfortable to use over time, and the effort will be well rewarded.

Scott_Champagne · March 28, 2018, 7:49pm

This dialog was very educational & helpful.

Thank you, everyone!