Regex solution to remove duplicate unsorted lines in a string

peavine · July 24, 2025, 2:20pm

Mostly for learning purposes, I want to write a regex pattern that 1) removes duplicate non-consecutive lines from a string and 2) keeps the first instance of each duplicated line. If the lines happen to be in sorted order, the following does what I want:

use framework "Foundation"
use scripting additions

set theString to "aa aa
aa aa
bb bb
bb bb
bb bb"

set theString to current application's NSString's stringWithString:theString
set thePattern to "(?m)^(.*)(?:\\n\\1)+$"
return (theString's stringByReplacingOccurrencesOfString:thePattern withString:"$1" options:1024 range:({0, theString's |length|()})) as text

The following works with an unsorted string but does not meet requirement 2. It keeps the last instance of each line rather than the first, so “aa aa” is returned before “bb bb” when “bb bb” should come first:

use framework "Foundation"
use scripting additions

set theString to "bb bb
aa aa
aa aa
bb bb
bb bb"

set theString to current application's NSString's stringWithString:theString
set thePattern to "(?sm)(^[^\\n]*)\\n(?=.*^\\1$)"
return (theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:1024 range:({0, theString's |length|()})) as text

Does anyone know a regex pattern that will meet both requirements? I spent a fair amount of time with Google but couldn’t find a solution. BTW, I am aware that this can be done with other approaches (such as NSSet), but I’m only interested in a regex solution.

Thanks.

peavine · July 24, 2025, 2:50pm

It just occurred to me that reversing the string might accomplish most of what I want. I’ll have to give this some additional thought.

use framework "Foundation"
use scripting additions

set theString to "bb bb
aa aa
aa aa
bb bb
bb bb"

set theString to current application's NSString's stringWithString:theString
set theArray to (theString's componentsSeparatedByString:linefeed)'s reverseObjectEnumerator()'s allObjects()
set theString to theArray's componentsJoinedByString:linefeed
set thePattern to "(?sm)(^[^\\n]*)\\n(?=.*^\\1$)"
set theString to (theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:1024 range:({0, theString's |length|()})) -- option 1024 is NSRegularExpressionSearch
set theArray to (theString's componentsSeparatedByString:linefeed)'s reverseObjectEnumerator()'s allObjects()
set theString to (theArray's componentsJoinedByString:linefeed) as text

Nigel_Garvey · July 26, 2025, 11:58am

Hi @peavine.

Clever! The regex replacement process deletes lines that have at least one further instance later in the text, leaving behind just the last instance of each line. Doing this while the lines are reversed preserves the first instances from the original text.

To delete lines that have instances before them would require a look-behind assertion, and so far I haven’t been able to make this work, even when avoiding “*” and “+” repeats. My current guess is that look-behinds and back-references don’t get along.
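
One way to investigate a guess like this is to ask NSRegularExpression directly whether it will compile a candidate pattern. The following is only a sketch: the handler name patternError and the candidate pattern are hypothetical, invented for illustration, and no particular outcome is claimed for that pattern.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical helper: returns missing value if the pattern compiles,
-- otherwise the error description reported by NSRegularExpression.
on patternError(thePattern)
	set {theRegex, theError} to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(reference)
	if theRegex is missing value then return theError's localizedDescription() as text
	return missing value
end patternError

-- Illustrative candidate only: match a line (with its preceding linefeed)
-- when a bounded look-behind containing a back-reference finds an earlier copy.
patternError("(?sm)\\n(^[^\\n]*$)(?<=^\\1\\n.{0,200})")
```

If the handler returns an error description, the pattern was rejected at compile time, which distinguishes "the engine does not allow this" from "the pattern compiles but does not match as intended".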

Another version of your pattern, which I think is essentially the same, would be “(?m)(^.*$)\\n(?s)(?=.*^\\1$)”. In both versions, if my understanding of regex’s machinations is correct, the repeat in the look-ahead obliges the regex engine to work through to the very end of the text and then back-track, attempting to match the back-reference, until it gets back to the current input position. If the repeat were to be made “lazy”, I think the check for the back-reference match might happen during the forward movement, after each character matched or not in the repeat. But I’m not sure of this or whether it would necessarily be more efficient. The repeat in the capture group can be made more efficient by making the matches possessive, but it’s very unlikely anyone would notice the difference.

“(?m)(^.*+$)\\n(?s)(?=.*?^\\1$)”
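
For reference, here is a sketch of how this pattern could be dropped into the reverse, replace, reverse approach from earlier in the thread. The handler names reverseLines and keepFirstInstances are hypothetical; everything else uses the same Foundation calls as the scripts above.

```applescript
use framework "Foundation"
use scripting additions

-- Hypothetical helper: reverses the order of the lines in an NSString.
on reverseLines(theString)
	set theArray to (theString's componentsSeparatedByString:linefeed)'s reverseObjectEnumerator()'s allObjects()
	return theArray's componentsJoinedByString:linefeed
end reverseLines

-- Hypothetical helper: keeps only the first instance of each line.
on keepFirstInstances(theText)
	set theString to current application's NSString's stringWithString:theText
	-- Reverse so that "last instance" in the reversed text means "first instance" in the original.
	set theString to my reverseLines(theString)
	-- Delete every line that has an identical line somewhere after it.
	set thePattern to "(?m)(^.*+$)\\n(?s)(?=.*?^\\1$)"
	set theString to theString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:1024 range:{0, theString's |length|()} -- 1024 is NSRegularExpressionSearch
	-- Reverse again to restore the original order.
	return (my reverseLines(theString)) as text
end keepFirstInstances

keepFirstInstances("bb bb" & linefeed & "aa aa" & linefeed & "aa aa" & linefeed & "bb bb" & linefeed & "bb bb")
```

With the sample text from this thread, the result should be “bb bb” followed by “aa aa”. Note that the pattern only deletes a line together with its trailing linefeed, so the final line of the reversed text is never removed, which is what the approach relies on.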

peavine · July 26, 2025, 2:05pm

Nigel. Thanks for the response. It’s always helpful to have confirmation that something can’t be done exactly as I want.

I tested your regex pattern, which works great, and I’ll study it to understand its operation. I normally use the default greedy quantifiers and make them lazy only when greedy doesn’t work. I also don’t normally consider possessive quantifiers, although the ICU documentation recommends their use. I’ll spend some time with this.

Just as an aside, I tested the operation of my script with both of our patterns after expanding the string to contain 100 lines. In both cases, the timing result was less than a millisecond, which doesn’t allow one to differentiate between possible efficiencies of the patterns. An NSOrderedSet solution was faster, but only by a few tenths of a millisecond, which of course is not significant.

use framework "Foundation"
use scripting additions

set theString to "bb bb
aa aa
aa aa
bb bb
bb bb"

set theString to current application's NSString's stringWithString:theString
set theArray to (theString's componentsSeparatedByString:linefeed)
set newArray to (current application's NSOrderedSet's orderedSetWithArray:theArray)'s array()
set newString to (newArray's componentsJoinedByString:linefeed) as text
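
A timing test along the lines described above might be set up as in the following sketch. It is an assumption about method, not the script actually used for the figures quoted: it builds a 100-line string by repeating the five sample lines and measures elapsed time with NSDate, which is coarse for sub-millisecond work, so averaging over many runs would be advisable.

```applescript
use framework "Foundation"
use scripting additions

-- Build a 100-line test string by repeating the five sample lines 20 times.
set theLines to {}
repeat 20 times
	set theLines to theLines & {"bb bb", "aa aa", "aa aa", "bb bb", "bb bb"}
end repeat
set theString to (current application's NSArray's arrayWithArray:theLines)'s componentsJoinedByString:linefeed

-- Time the NSOrderedSet approach.
set startDate to current application's NSDate's |date|()
set theArray to theString's componentsSeparatedByString:linefeed
set newArray to (current application's NSOrderedSet's orderedSetWithArray:theArray)'s array()
set newString to (newArray's componentsJoinedByString:linefeed) as text
-- timeIntervalSinceNow is negative for a date in the past, so negate it.
set elapsedSeconds to -((startDate's timeIntervalSinceNow()) as real)

return {newString, elapsedSeconds}
```

The same start and stop lines can be wrapped around the regex version to compare the two approaches on identical input.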