Upper & Lower Case Sed Regex Problems

Can someone explain to me why it is so hard to use AppleScript and regex to simply capitalise a letter after a period and space?

Here is my code:
-- Capitalize the first letter after a period and lowercase the rest
set selectedText to do shell script "echo " & quoted form of selectedText & " | sed -E 's/\b([a-z])|\.\s+(.)/\U\1\L\2/g'"

I can’t figure out why sed ignores the uppercase and lowercase escapes and inserts the literal characters instead of performing the transformation.

Hi,

I’m not particularly conversant with sed, but my understanding is that it has no capacity for lookbehinds or case sensitivity – so it’s not suited to the task.

Marc’s right about what sed lacks. You’d also need to escape the backslashes in the AppleScript text of the shell command.

sed can achieve what you want, but it’s a bit of a production number, can only be done once per line or paragraph in the input, and you have to code for every character whose case you’re likely to want to change:

set selectedText to "This is soMe text. this seNTence follows a full stop."

do shell script "echo " & quoted form of selectedText & " | sed -E '
# Where an input line contains a period, some space, and a lower-case letter…
/(.*\\.[[:space:]]+)([[:lower:]])(.*)/ {
	# Move the letter to the front and mark where it was with a linefeed.
	s//\\2\\1\\'$'\\n''\\3/
	# Store a copy of the edited line.
	h
	# Delete all but the letter from the original in the pattern space.
	s/^(.).*/\\1/
	# Replace it with an upper-case version.
	y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
	# Swap the pattern and hold space contents.
	x
	# Lower-case the latter now in the pattern space.
	y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
	# Append a linefeed and the upper-cased letter to the lower-cased line.
	G
	# Move the letter to the original position, losing the original version
	# and the added linefeeds, and output the line.
	s/^.(.*)\\n(.*)\\n(.)$/\\1\\3\\2/
}'"
--> "this is some text. This sentence follows a full stop."

If you search this site for capitalizedString, you should find a way to do what you want using AppleScriptObjC and NSString.

Christ that’s so complicated for such an easy concept. Thank you for this, I’ll have to take a moment this morning and try it out.

You might be better off using Perl which has these operators for regexes.

That’s the way I went on this, far less aggravation in the end.

The OP has decided on a solution and doesn’t need any more suggestions. However, I wanted to write a solution using basic AppleScript and decided to post it here FWIW. With test strings containing 33 and 1025 paragraphs, the timing results were 20 and 332 milliseconds, although the latter result could probably be reduced 90 percent or more by using script objects.

set theString to "this is a sentence. this is a sentence. this is a sentence.
this is a sentence. this is a sentence. this is a sentence.
this is a sentence. this is a sentence. this is a sentence."

set capitalizedString to getCapitalizedString(theString)

on getCapitalizedString(theString)
	set theParagraphs to paragraphs of theString
	set {TID, text item delimiters} to {text item delimiters, {". "}}
	repeat with aParagraph in theParagraphs
		set theSentences to text items of aParagraph
		repeat with aSentence in theSentences
			try
				set theOffset to offset of (character 1 of aSentence) in "abcdefghijklmnopqrstuvwxyz"
				set theCapitalizedCharacter to character theOffset of "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
				set contents of aSentence to theCapitalizedCharacter & text 2 thru -1 of aSentence
			end try
		end repeat
		set contents of aParagraph to theSentences as text
	end repeat
	set text item delimiters to linefeed
	set capitalizedString to theParagraphs as text
	set text item delimiters to TID
	return capitalizedString as text
end getCapitalizedString

Modern sed can be case-insensitive.

GNU sed, or sed on macOS after Catalina (I think), supports the case-insensitive switch.

(I’m stuck on Mojave and cannot verify that sed was updated, but I’ve been so informed.)

#!/usr/bin/env bash

STR='What a wonderful world it would be...'
gsed -E 's!WONDERFUL!WONDERFUL!I' <<< "$STR"

Of course this doesn’t help with the OP’s uppercase/lowercase issues.

I would stick with Perl for this job.

FWIW, I rewrote my basic AppleScript with ASObjC. With 50 or fewer paragraphs, the timing result was under 20 milliseconds, but with 1025 paragraphs the result was 440 milliseconds.

use framework "Foundation"
use scripting additions

set theString to "this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. this is a sentence."

set capitalizedString to getCapitalizedString(theString)

on getCapitalizedString(theString)
	set theString to current application's NSString's stringWithString:theString
	set theParagraphs to (theString's componentsSeparatedByString:linefeed)
	repeat with aParagraph in theParagraphs
		set theSentences to (aParagraph's componentsSeparatedByString:". ")
		repeat with aSentence in theSentences
			set theWords to (aSentence's componentsSeparatedByString:" ")'s mutableCopy()
			set firstWord to (theWords's objectAtIndex:0)'s capitalizedString()
			(theWords's replaceObjectAtIndex:0 withObject:firstWord)
			set contents of aSentence to (theWords's componentsJoinedByString:(" "))
		end repeat
		set contents of aParagraph to (theSentences's componentsJoinedByString:". ")
	end repeat
	return ((theParagraphs's componentsJoinedByString:linefeed) as text)
end getCapitalizedString

To lowercase the entire string except the first letter of the first word of each sentence, delete existing line 1 below and replace it with new line 2 below:

set theString to current application's NSString's stringWithString:theString
set theString to (current application's NSString's stringWithString:theString)'s lowercaseString()

I am stunned and humbled by your solutions. I truly have soooo much to learn. Your examples are really helpful. Thanks

This topic reminded me of a script I wrote some time ago which allows case-change codes to be used in replacement templates with ASObjC’s ICU regex. I’ve tidied it up and have just posted it in MacScripter’s Code Exchange forum.

Thanks for that I’ll have to sit down tonight and take a look at it.

Thanks Jeff

Nigel’s script provides a comprehensive NSRegularExpression solution for making case changes. The following is an NSRegularExpression solution specific to the problem posed in this thread, using a slightly different approach from peavine’s. matchObjs is coded as a property to speed up execution of the repeat loop.

use framework "Foundation"
use scripting additions

property matchObjs : missing value

tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(\\.\\s+[[:lower:]])" options:0 |error|:(missing value))
set my matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in my matchObjs
	set currRange to currMatchObj's range()
	set currSubstringObj to (strObj's substringWithRange:currRange)
	(strObj's replaceCharactersInRange:currRange withString:(currSubstringObj's uppercaseString()))
end repeat

set selectedText to strObj as text

I’d suggest using [:lower:] instead of [a-z] because the former works also for accented characters. The latter will only match ASCII lowercase characters.
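
For anyone who wants to see the difference in isolation, here is a rough JavaScript sketch. JavaScript regexes have no POSIX bracket classes, so `\p{Ll}` (Unicode lowercase letter, which needs the `u` flag) stands in for `[[:lower:]]`; that substitution and the helper name `capitalizeAfterPeriod` are my own assumptions, not part of the scripts above.

```javascript
// Sketch only: compares an ASCII range with a Unicode-aware lowercase class.
// \p{Ll} is a stand-in for ICU's [[:lower:]] (an assumption; JavaScript has
// no POSIX bracket classes).
function capitalizeAfterPeriod(text, lowerClass) {
  // Build a pattern such as (\.\s+)([a-z]) from the supplied character class.
  const re = new RegExp("(\\.\\s+)(" + lowerClass + ")", "gu");
  return text.replace(re, (match, gap, letter) => gap + letter.toUpperCase());
}

// "\u00e9" is a precomposed e-acute, so the sample reads "One. two. été is ..."
const sample = "One. two. \u00e9t\u00e9 is French for summer.";

const asciiOnly = capitalizeAfterPeriod(sample, "[a-z]");
const unicodeAware = capitalizeAfterPeriod(sample, "\\p{Ll}");

console.log(asciiOnly);    // the accented letter after the second period is left alone
console.log(unicodeAware); // the accented letter is upper-cased as well
```

The ASCII range capitalises "two" but skips the accented word, while the Unicode-aware class handles both.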

Thank you for that helpful suggestion, chrillek. I made the change.

I was made aware of a mistake in my previous NSRegularExpression solution, namely that while it capitalizes the first letter following a period and space characters, it doesn’t make the rest of the sentence lowercase, as the poster requested. The following modified NSRegularExpression solution corrects that problem.

The key to its functionality is the regular expression pattern

(?:^\s*|\.\s+)([^.])([^.]*)

As a whole, the pattern matches a single sentence, beginning with the period preceding the sentence (or the start of the input string) and extending to but not including the period at the end of the sentence. The regular expression pattern’s first component

(?:^\s*|\.\s+)

is a non-capturing group that matches either zero or more whitespace characters at the start of the input string, or a period followed by one or more whitespace characters. The second component

([^.])

is a capturing group that matches the first non-period character, i.e., the first character of the sentence. It corresponds to its match object’s rangeAtIndex:1. The third component

([^.]*)

is another capturing group that matches any subsequent number of non-period characters, i.e., the remaining characters of the sentence. It corresponds to its match object’s rangeAtIndex:2.

tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set my matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in my matchObjs
	set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
	set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
	(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
	(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:(sentenceRemainingChars's lowercaseString()))
end repeat
set selectedText to strObj as text

Using a modified version of peavine’s example,

" this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. t.
u. this is a Sentence. this is a sentence."

becomes

" This is a sentence. This is a sentence. This is a sentence.
This is a sentence. This is a sentence. T.
U. This is a sentence. This is a sentence."
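
As a cross-check of the pattern outside ASObjC, here is a minimal JavaScript sketch. It assumes JavaScript's regex engine treats this particular pattern the same way ICU does, and it turns the leading non-capturing group into a capturing one so the callback can put the separator back. The function name `sentenceCase` is made up for the example.

```javascript
// Sketch only: the same three-part pattern, applied with a replacement callback.
// Group 1: start-of-string whitespace, or a period plus whitespace (kept as is).
// Group 2: first character of the sentence (upper-cased).
// Group 3: the rest of the sentence up to the next period (lower-cased).
function sentenceCase(text) {
  const re = /(^\s*|\.\s+)([^.])([^.]*)/g;
  return text.replace(re, (match, lead, first, rest) =>
    lead + first.toUpperCase() + rest.toLowerCase());
}

const input =
  " this is a sentence. this is a Sentence. this is a sentence.\n" +
  "this is a sentence. this is a Sentence. t.\n" +
  "u. this is a Sentence. this is a sentence.";

const output = sentenceCase(input);
console.log(output);
```

Run on the test string above, this should give the same text as the result shown after "becomes".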

This will inevitably lower case proper names like Thomas or Susan and acronyms like UK and CPU, not to mention what it does to AppleScript, etc. It might therefore be better to limit the code to only upper case the first letter.

chrillek, I agree completely. I posted the second version only to offer a possible solution to the question as it was asked.

My take on it in pure JavaScript (no JXA):

const testStr = ` this is a sentence. this is a Sentence. this is a sentence.
this is a sentence. this is a Sentence. t.
u. this is a Sentence. this is a sentence.`;
(() => {
  const RE = new RegExp("(^\\s*|[.!?]\\s+)([^.]{2,})","gms")
  const result = testStr.replaceAll(RE, upperCase);
  console.log(result);
})()

function upperCase(match, p1,p2) {
	const uc = p2.substring(0,1).toLocaleUpperCase() + p2.substring(1);
	return `${p1}${uc}`
}

Output

 This is a sentence. This is a Sentence. This is a sentence.
This is a sentence. This is a Sentence. t.
u. This is a Sentence. This is a sentence.

That’s only an example, not necessarily the best one, to show how one can use a function in a call to replace/replaceAll to perform additional stuff. Here, it is uppercasing the first letter of the 2nd capturing group.

Differing from @bmose, I have to use a capturing group for the start of the string so I can output it again. I didn’t bother to down case uppercased words in the sentence, because that would potentially lead to too many mistakes, as mentioned here: Upper & Lower Case Sed Regex Problems - #18 by chrillek

Thank you for the JavaScript demo. I don’t know JavaScript but get the overall gist of what the script is doing.

Incidentally, when modifying text via NSRegularExpression as in my two examples above, I generally process the match objects returned by NSRegularExpression’s matchesInString:options:range: in reverse order so that any interim changes made to the NSMutableString object don’t break the ranges of match objects that are yet to be processed. I didn’t do so above because I didn’t expect that problem to arise in the specific examples shown, but generally I would recommend doing so. Thus, instead of:

set matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list
repeat with currMatchObj in matchObjs
	-- [make changes to the mutable string object (strObj) using the current match object (currMatchObj)]
end repeat

I would recommend the following:

set matchObjsReversed to ((regexObj's matchesInString:strObj options:0 range:strRange) as list)'s reverse
repeat with currMatchObj in matchObjsReversed
	-- [make changes to the mutable string object (strObj) using the current match object (currMatchObj)]
end repeat
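
To show why the order matters, here is a small JavaScript sketch of the same idea. It is only an illustration under my own assumptions: JavaScript strings are immutable, so the hypothetical helper `replaceRange` rebuilds the string instead of mutating it, and the replacement is deliberately one character longer than the match so that offsets shift.

```javascript
// Hypothetical helper: replace `length` characters of `text` starting at `start`.
function replaceRange(text, start, length, replacement) {
  return text.slice(0, start) + replacement + text.slice(start + length);
}

// Apply the same replacement at every recorded range, in the order given.
function applyEdits(text, ranges, replacement) {
  let result = text;
  for (const { start, length } of ranges) {
    result = replaceRange(result, start, length, replacement);
  }
  return result;
}

const text = "a. b. c";

// Ranges are recorded once, up front, much like the match objects above.
const ranges = [...text.matchAll(/\. /g)].map(
  (m) => ({ start: m.index, length: m[0].length })
);

// Replace each ". " with ".  " (one extra space), so later offsets move.
const forward = applyEdits(text, ranges, ".  ");
const reversed = applyEdits(text, [...ranges].reverse(), ".  ");

console.log(JSON.stringify(forward));  // second edit lands in the wrong place
console.log(JSON.stringify(reversed)); // both edits land where intended
```

Working from the last range back to the first means each edit only disturbs text that has already been dealt with, so the earlier ranges stay valid.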