Removing duplicate characters from a string

First Code Exchange post in a while. Just to share an efficient method of removing duplicate characters from a string, which is a task that arises on occasion. It leverages AppleScript’s text item delimiters property to split the string into individual characters, and ASObjC’s NSOrderedSet class to remove duplicate characters while maintaining character order. The example demonstrates that it works for UTF-8 characters outside the ASCII character set and in a case-sensitive manner:

use framework "Cocoa"
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
set {sampleStringWithDuplicatesRemoved, AppleScript's text item delimiters} to {((current application's NSOrderedSet's orderedSetWithArray:(sampleString's text items))'s array()) as list as text, tid}
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Hi bmose.

It can also be done without AppleScript’s text item delimiters:

use framework "Foundation"
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set sampleStringWithDuplicatesRemoved to ((current application's NSOrderedSet's orderedSetWithArray:(sampleString's characters))'s array()'s componentsJoinedByString:("")) as text
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Hi, Nigel. That’s even more terse. Thanks!

Just curious about a fundamental AppleScript matter. Is there any scenario where a string split into text items with an empty string delimiter will differ from the characters returned by the string’s characters property? Or are they processed under the hood in an identical or functionally identical manner?

Here is a native AppleScript version…

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set tid to text item delimiters
set text item delimiters to ""
set i to 1
considering case
	repeat while i < length of sampleString
		set c to text i of sampleString
		set text item delimiters to c
		set sampleString to (text items of sampleString)
		tell sampleString to set item 1 to item 1 & c
		set text item delimiters to ""
		set sampleString to sampleString as text
		set i to i + 1
	end repeat
end considering
set text item delimiters to tid
---------------------------------------------------------------
sampleString --> "3Xw✔✓°¦WΦ"

or a version without text item delimiters…

use AppleScript version "2.4" -- Yosemite (10.10) or later
use scripting additions

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set sampleStringWithDuplicatesRemoved to ""
considering case
	repeat with i from 1 to length of sampleString
		set c to text i of sampleString
		if c is not in sampleStringWithDuplicatesRemoved then set sampleStringWithDuplicatesRemoved to sampleStringWithDuplicatesRemoved & c
	end repeat
end considering
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Hi bmose. I can’t think of any offhand. :thinking:

I’ve never seen a difference in the characters returned either.

This is based on Robert’s delimiter script, but keeps the unique characters aside while winding down the sample string, which may be a tad more efficient. Not that the difference will be noticeable in practice! :wink:

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
---------------------------------------------------------------
set uniques to {}
set tid to text item delimiters
considering case
	repeat while ((count sampleString) > 1)
		set text item delimiters to character 1 of sampleString
		set end of uniques to text item delimiters
		set sampleString to (text items of sampleString)
		set text item delimiters to ""
		set sampleString to sampleString as text
	end repeat
end considering
set end of uniques to sampleString
set sampleStringWithDuplicatesRemoved to uniques as text
set text item delimiters to tid
---------------------------------------------------------------
sampleStringWithDuplicatesRemoved --> "3Xw✔✓°¦WΦ"

Does anyone know of a regex pattern that will work in the following script to remove both consecutive and non-consecutive duplicate characters. As written, only consecutive duplicate characters are removed. Thanks!

use framework "Foundation"
use scripting additions

set theString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
set theString to current application's NSString's stringWithString:theString
set regexPattern to "(.)\\1+"
set cleanedString to (theString's stringByReplacingOccurrencesOfString:regexPattern withString:"$1" options:1024 range:{0, theString's |length|()}) as text -->"3Xw3✔✓°¦✓wWΦ3X"

Here is a regular expression solution, condensed into a single line:

tell (current application's NSString's stringWithString:(sampleString's characters's reverse as text)) to set sampleStringWithDuplicatesRemoved to ((its stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, its |length|()}) as text)'s characters's reverse as text

It works by first reversing the characters in the string:

sampleString’s characters’s reverse as text

so that a look-ahead assertion, which allows unbounded string matches, rather than a look-behind assertion, which does not, can be used. The regex pattern looks through each character in the reversed string, marking it as capture group #1:

(.)

It deletes that character via:

withString:“”

if the look-ahead pattern:

(?=.*?\1)

asserts that the character is followed by zero or more characters other than itself (via lazy matching):

.*?

followed by the character itself:

\1

Finally, it reverses the characters in the resulting string:

…as text)'s characters’s reverse as text

The result is the original string with duplicate characters removed.

I don’t know if the same result can be achieved using a look-behind assertion.

2 Likes

Incidentally, I tested all five solutions – my original ASObjC solution, Nigel’s more condensed version of the ASObjC solution, robertfern’s two AppleScript solutions, and the regex solution. For the simple input string “3Xwww3✔✓°¦¦✓wWWΦ3X”, they all execute very rapidly – on the order of 0.0001 seconds on my machine – and I could not detect a difference among them. I suspect that differences might emerge with very long input strings.

1 Like

Thanks bmose. That works great.

Just to insure I understood the operation of your script, I rewrote it in a slightly different format. All seems to work correctly.

use framework "Foundation"
use scripting additions

set theString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
set reversedString to (reverse of (characters of theString)) as text
set reversedString to current application's NSString's stringWithString:reversedString
set cleanedString to (reversedString's stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:1024 range:{0, reversedString's |length|()}) as text --option 1024 is regex
set cleanedString to (reverse of (characters of cleanedString)) as text -->"3Xw✔✓°¦WΦ"

:sunglasses: !!!

But don’t forget to set the TIDs for the list-to-text coercions. :wink:

As Nigel points out, the regex solution, specifically the two as text coercions which transform lists of characters into strings, will work properly only if text item delimiters are set to the empty string. This is usually the case, but better not assume that it is so:

set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
tell (current application's NSString's stringWithString:(sampleString's characters's reverse as text)) to set sampleStringWithDuplicatesRemoved to ((its stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, its |length|()}) as text)'s characters's reverse as text
set AppleScript's text item delimiters to tid

I’ve not been able to make a forward string and look-behind work here. The character-match range range can be made indefinite rather than infinite by using "{1, " & length of sampleString & “}” instead of “*” or “+”, but I can’t get a back reference to be recognised in a look-behind.

If you’d set your heart on a one-liner, there’s this:

use framework "Foundation"

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
tell ((current application's NSArray's arrayWithArray:(sampleString's characters's reverse))'s componentsJoinedByString:("")) to tell (((its stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, its |length|()}) as text)'s characters's reverse) to set sampleStringWithDuplicatesRemoved to ((current application's NSArray's arrayWithArray:(it))'s componentsJoinedByString:("")) as text

Or, for the ASObjC diehards out there: :wink:

use framework "Foundation"

set sampleString to "3Xwww3✔✓°¦¦✓wWWΦ3X"
tell ((current application's NSArray's arrayWithArray:(sampleString's characters))'s reverseObjectEnumerator()'s allObjects()'s componentsJoinedByString:("")) to tell (((its stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, its |length|()}) as text)'s characters) to set sampleStringWithDuplicatesRemoved to ((current application's NSArray's arrayWithArray:(it))'s reverseObjectEnumerator()'s allObjects()'s componentsJoinedByString:("")) as text

Wow, with Nigel’s two new one-liners, we now have 8 solutions. In speed tests, the first two NSOrderedSet solutions and the regex lookahead solutions execute most quickly, taking less than 0.001 seconds to remove duplicates from a 500-character string with many duplicates (created by replicating the sample string used in the above examples.) For terseness and execution speed, I find that my original NSOrderedSet solution and Nigel’s variant are my preferred solutions:

set {tid, AppleScript's text item delimiters} to {AppleScript's text item delimiters, ""}
set sampleStringWithDuplicatesRemoved to ((current application's NSOrderedSet's orderedSetWithArray:(sampleString's text items))'s array()) as list as text
set AppleScript's text item delimiters to tid

--OR--

set sampleStringWithDuplicatesRemoved to ((current application's NSOrderedSet's orderedSetWithArray:(sampleString's characters))'s array()'s componentsJoinedByString:("")) as text

I ran some timing tests with three paragraphs of text with the three basic approaches, and there was not a significant difference. All were commendably fast, taking 5 milliseconds or less.

I did notice one possible issue. If the input string includes line endings, and if the goal is to eliminate duplicates from the entire string, a dotall flag (i.e. ($s)) should be added to the beginning of the regex pattern. I can’t think of a situation where this might arise, but it’s worth noting FWIW.

This is the result without the dotall flag.

Nigel’s variant of my original solution is not dependent on the value of the text item delimiters, since the value converted by the final as text expression is coercing an NSString object, not an array. (I corrected the error in my post discussing preferred approaches.) Regarding peavine’s last comment, both my NSOrderedSet solution and Nigel’s variant seem to handle line ending characters properly. Therefore, I’m going to change my personal preference to Nigel’s variant. For me, it wins the terseness/execution speed prize:

set sampleStringWithDuplicatesRemoved to ((current application's NSOrderedSet's orderedSetWithArray:(sampleString's characters))'s array()'s componentsJoinedByString:("")) as text

One way to handle line ending characters is to use NSRegularExpression rather than NSString to perform the search-and-replace actions. NSRegularExpression includes an option for the dot character to match line ending characters:

set theString to "3" & linefeed & "X" & return & "www3✔✓°¦¦✓wWWΦ" & linefeed & linefeed & "3" & return & return & "X"
---------------------------------------------------------------
set reversedString to (reverse of (characters of theString)) as text
set reversedString to current application's NSString's stringWithString:reversedString
set regex to (current application's NSRegularExpression's regularExpressionWithPattern:"(.)(?=.*?\\1)" options:(current application's NSRegularExpressionDotMatchesLineSeparators) |error|:(missing value))
set cleanedString to (regex's stringByReplacingMatchesInString:reversedString options:0 range:{0, reversedString's |length|()} withTemplate:"") as text
set cleanedString to (reverse of (characters of cleanedString)) as text --> "3"& linefeed & "X"& return & "w✔✓°¦WΦ"

There is one caveat. If the original string has consecutive characters ...linefeed & return..., when the string is reversed, ...linefeed & return... becomes ...return & linefeed..., which will be interpreted as a single DOS-style return character that will be treated differently from isolated linefeed and return characters. I can think of ways around that unusual circumstance, but then we would be drifting away from the terseness that was one of the objectives of the original post. :blush:

ADDENDUM:

That having been said, here is a way around the caveat described in the last paragraph above. It does the following:
(1) Reverses the original string, and creates an NSMutableString object from the reversed string
(2) Replaces each linefeed and return character with a character obscure enough that it shouldn’t appear in the original string (Runic “N” for linefeed, Runic “R” for return, but any replacement characters will do as long as they don’t appear in the original string)
(3) Removes all duplicate characters via a regular expression search-and-replace action
(4) Restores the single remaining Runic “N” and/or Runic “R” (if present at all) with a single linefeed and/or return character
(5) Reverses the reversed, de-duplicated string, and returns the result as an AppleScript text string

set theString to "3" & linefeed & "X" & return & "www3✔✓°¦¦✓wWWΦ" & linefeed & linefeed & "3" & linefeed & return & linefeed & return & "X"
---------------------------------------------------------------
set regex to (current application's NSRegularExpression's regularExpressionWithPattern:"(.)(?=.*?\\1)" options:(current application's NSRegularExpressionDotMatchesLineSeparators) |error|:(missing value))
set reversedString to (reverse of (characters of theString)) as text
tell (current application's NSMutableString's stringWithString:reversedString)
	set strRange to {0, its |length|()}
	(its replaceOccurrencesOfString:linefeed withString:"ᚾ" options:0 range:strRange)
	(its replaceOccurrencesOfString:return withString:"ᚱ" options:0 range:strRange)
	(regex's replaceMatchesInString:it options:0 range:strRange withTemplate:"")
	set strRange to {0, its |length|()}
	(its replaceOccurrencesOfString:"ᚾ" withString:linefeed options:0 range:strRange)
	(its replaceOccurrencesOfString:"ᚱ" withString:return options:0 range:strRange)
	set cleanedString to (reverse of (characters of (it as text))) as text
end tell
---------------------------------------------------------------
cleanedString --> "3" & linefeed & "X" & return & "w✔✓°¦WΦ"

It works, and it executes quickly, only about 50% slower than the very fast NSOrderedSet solution, even for a long string of 500 characters with many linefeed, return, and other duplicated characters. As far as terseness goes, that is for the scripter to decide.

Bmose. Thanks for the comments and additional information. I enjoy and always learn a lot from discussions such as this.

I tested your new test string with: 1) your original script in post one; 2) your regex script in the post immediately above; and 3) my regex script but with the dotall flag enabled. Except for a trailing carriage return, they all return the exact same results.

Ignoring the above, I have to wonder what the use-scenario is when a string contains multiple paragraphs, and it’s difficult for me to envision that any of the above scripts returns a useful result. For example, the following is your original script but with a “2” added to the end of the test string. In this example, the “2” is added to the paragraph located six line ending’s earlier in the string.

I would be inclined to keep each paragraph of the string separate, which is what my script without the dotall flag does. Depending on the use-scenario, I would also remove blank lines and lines containing whitespace only, and I would replace carriage returns with line feeds. So, FWIW, my favorite script is:

use framework "Foundation"
use scripting additions

set theString to "3a3" & linefeed & " " & linefeed & "X" & return & "www3✔✓°¦¦✓wWWΦ" & linefeed & linefeed & "3x33" & linefeed & return & linefeed & return & "x2x"
set reversedString to (reverse of (characters of theString)) as text
set reversedString to current application's NSString's stringWithString:reversedString
set cleanedString to (reversedString's stringByReplacingOccurrencesOfString:"(.)(?=.*?\\1)" withString:"" options:1024 range:{0, reversedString's |length|()}) --option 1024 is regex
set cleanedString to (cleanedString's stringByReplacingOccurrencesOfString:"(\\R)(?:\\h*+\\R)*" withString:linefeed options:1024 range:({0, cleanedString's |length|()})) as text
set cleanedString to (reverse of (characters of cleanedString)) as text

The timing result of this script as written is less than one millisecond.