Upper & Lower Case Sed Regex Problems

chrillek · April 2, 2023, 8:10am

Out of curiosity: are these match objects references/pointers into the original string? In Perl/JavaScript (and, I suppose, other languages with RE support as well) they are simply independent copies, so you can do whatever you want with them without harming other matches (or the original string).

bmose · April 2, 2023, 9:17am

Match objects are NSTextCheckingResult objects, which contain only range information and a result type (an enum whose value, 1024, is NSTextCheckingTypeRegularExpression, signifying a regular expression match). The range information consists of the locations and lengths of the matching substrings within the string, and those are the positions at which you can make changes. The ranges are computed once, against the string as it was when the search ran. If you make a change that alters the length of a matching substring, and the next match’s ranges lie to the right of that change, those ranges will no longer point to the correct locations in the string. Processing the matches in reverse order avoids this: each change can only shift text to its right, and the matches still to be processed all point to locations to the left of it.
The following example is similar to the previous ones, except that it capitalizes the first character of each sentence and deletes all the characters after it. Matches are processed in reverse order:

use framework "Foundation"
use scripting additions

set selectedText to "this is A SENtence. THIS is A sentencE.
this is A SENtence. THIS is A sentencE."
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set matchObjs to ((regexObj's matchesInString:strObj options:0 range:strRange) as list)'s reverse -- REVERSE ORDER
repeat with currMatchObj in matchObjs
	set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
	set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
	(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
	(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:"") -- DELETES ALL BUT THE FIRST CHARACTER IN THE CURRENT SENTENCE
end repeat
set selectedText to strObj as text
-->
"T. T.
T. T."

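If an attempt is made to process matches in forward order, a range out-of-bounds error occurs, because each deletion shortens the string while the ranges of the later matches still refer to positions in the original, longer string: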

use framework "Foundation"
use scripting additions

set selectedText to "this is A SENtence. THIS is A sentencE.
this is A SENtence. THIS is A sentencE."
tell (current application's NSMutableString's stringWithString:selectedText) to set {strObj, strRange} to {it, current application's NSMakeRange(0, its |length|())}
set regexObj to (current application's NSRegularExpression's regularExpressionWithPattern:"(?:^\\s*|\\.\\s+)([^.])([^.]*)" options:0 |error|:(missing value))
set matchObjs to (regexObj's matchesInString:strObj options:0 range:strRange) as list -- FORWARD ORDER
repeat with currMatchObj in matchObjs
	set {sentenceFirstCharRange, sentenceRemainingCharsRange} to {currMatchObj's rangeAtIndex:1, currMatchObj's rangeAtIndex:2}
	set {sentenceFirstChar, sentenceRemainingChars} to {strObj's substringWithRange:sentenceFirstCharRange, strObj's substringWithRange:sentenceRemainingCharsRange}
	(strObj's replaceCharactersInRange:sentenceFirstCharRange withString:(sentenceFirstChar's uppercaseString()))
	(strObj's replaceCharactersInRange:sentenceRemainingCharsRange withString:"") -- DELETES ALL BUT THE FIRST CHARACTER IN THE CURRENT SENTENCE
end repeat
set selectedText to strObj as text
-->
-- ERROR:  -[__NSCFString substringWithRange:]: Range {41, 17} out of bounds; string length 45
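
The offset drift behind this error is not specific to Foundation. As a rough illustration only, the sketch below hard-codes three ranges against a short sample string and deletes them in either order using awk's `substr`. The function name `delete_ranges`, the sample string, and the ranges are all made up for the demo. Note that awk silently clamps an out-of-range `substr` instead of raising an error the way `substringWithRange:` does, so forward order here yields wrong text rather than an exception.

```shell
#!/bin/sh
# Illustration only: ranges are recorded once, then deleted in the given order.
# delete_ranges is a hypothetical helper; $1 is "reverse" or "forward".
delete_ranges() {
    awk -v order="$1" 'BEGIN {
        s = "abcd efgh ijkl"
        # 1-based (start, length) ranges covering all but the first letter
        # of each word, computed against the ORIGINAL string.
        n = 3
        start[1] = 2;  len[1] = 3
        start[2] = 7;  len[2] = 3
        start[3] = 12; len[3] = 3
        for (k = 1; k <= n; k++) {
            # Pick the next range: last-to-first for "reverse", first-to-last otherwise.
            i = (order == "reverse") ? n - k + 1 : k
            # Remove len[i] characters beginning at start[i].
            s = substr(s, 1, start[i] - 1) substr(s, start[i] + len[i])
        }
        print s
    }'
}

delete_ranges reverse
delete_ranges forward
```

Reverse order prints `a e i`. Forward order prints `a efghkl`, because after the first deletion the second and third ranges no longer line up with the words they were computed for.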

chrillek · April 2, 2023, 10:19am

Thanks for the explanation. That’s indeed a very different concept from the ones I’m used to, where the original string is not modified and the start/end values of the matches thus don’t change.

Hallenstal · April 4, 2023, 11:01am

Why not awk?

set teststr to "5 is a number. this is testing the code. it shoould 
capitalize both \".\" & newline and  \".\" and space. not to worry. we are testing.
a is character.  
\"a\" is lowercase:-)
. starting line with \".\""

log capitalize(teststr)

on capitalize(s as string)
	set awkprog to "'
BEGIN{
    p=1
}
{
    if(p) $0=toupper(substr($0,1,1)) substr($0,2)
    s=\"\"
    while(off=index($0, \". \")){
        s=s substr($0,1, off+1) toupper(substr($0,off+2,1))
        $0=substr($0,off+3)    
    }
    print s $0
    if(substr($NF,length($NF),1)==\".\") p=1
    else p=0
}'"
	return (do shell script "echo  " & quoted form of s & "| awk " & awkprog)
end capitalize
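
For trying the awk logic on its own, without the AppleScript string escaping, here is a minimal sketch as a plain shell function. It assumes a POSIX shell with `awk` on the PATH. The function name `capitalize_sentences` and the sample input are hypothetical, and the program is a lightly commented transcription of the one above, not a different algorithm.

```shell
#!/bin/sh
# Hypothetical helper: capitalize the first letter of each sentence read from stdin.
capitalize_sentences() {
    awk '
    BEGIN { p = 1 }   # p == 1: the next line starts a new sentence
    {
        # Capitalize the first character when the previous line ended a sentence.
        if (p) $0 = toupper(substr($0, 1, 1)) substr($0, 2)
        s = ""
        # Consume the line up to each ". " and capitalize the character after it.
        while ((off = index($0, ". ")) > 0) {
            s = s substr($0, 1, off + 1) toupper(substr($0, off + 2, 1))
            $0 = substr($0, off + 3)
        }
        print s $0
        # If the last field of what remains ends in ".", the next line starts a sentence.
        p = (substr($NF, length($NF), 1) == ".")
    }'
}

printf '%s\n' 'this is one. this is two.' 'and three. four' 'five' | capitalize_sentences
```

With this input the first two lines come back with every sentence capitalized, while `five` is left alone because the line before it did not end in a period.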