Prepare a string to use in an e-mail address (remove invalid characters etc.)?

I am generating internal e-mail addresses from text files with strings.

I would like to “clean” these strings of invalid characters, e.g., spaces or 8-bit characters (such as é). For example, Mac Scriptér should become MacScripter or MacScriptr.

It doesn’t have to be super stable or handle every weird corner case one could imagine. Any ideas on how to implement this, or is there something existing I could use as is?

I don’t imagine that this would be exhaustive but it should deal with any diacriticals you may encounter. It (as far as I can tell) will replace any accented character with its unaccented counterpart.

For whitespace, you would have to do something additional but you didn’t say how these characters should be treated. You might want to look up the standard for email addresses which will provide a complete list of usable characters.
Update: I guess you did mention that you wanted to delete spaces so I added deletion of spaces and tabs.

set goodList to "abcdefghijklmnopqrstuvwxyz0123456789@._-"

set test to "maç scrïptér 2023@macscripter.com"

set AppleScript's text item delimiters to {space, tab}
set testList to text items of test
set AppleScript's text item delimiters to ""
set test to testList as text

set charList to characters of test as list
--> {"m", "a", "ç", "s", "c", "r", "ï", "p", "t", "é", "r", "2", "0", "2", "3", "@", "m", "a", "c", "s", "c", "r", "i", "p", "t", "e", "r", ".", "c", "o", "m"}

repeat with listpos from 1 to count of test
	
	considering diacriticals
		set eachChar to contents of item listpos of test
		if eachChar is not in goodList then
			ignoring diacriticals
				set gc to character (offset of eachChar in goodList) of goodList
				set item listpos of charList to gc
			end ignoring
		end if
		
	end considering
end repeat

charList as text
--> "macscripter2023@macscripter.com"

You could use the tr shell utility. It has many options, and the following is only a simple example. Depending on your requirements, you can either include specified characters, exclude specified characters, or both.

set theString to "Mac Scriptér 1@gmail.com"
set includeCharacters to "A-Za-z0-9@."
set cleanedText to do shell script "echo " & quoted form of theString & " | tr -cd " & quoted form of includeCharacters
return cleanedText --> "MacScriptr1@gmail.com"

set theString to "Mac Scriptér 1@gmail.com"
set excludeCharacters to " é"
set cleanedText to do shell script "echo " & quoted form of theString & " | tr -d " & quoted form of excludeCharacters
return cleanedText --> "MacScriptr1@gmail.com"

Thank you!

Do you think you could explain two lines that I am not sure I understand in your example:

considering diacriticals

set gc to character (offset of eachChar in goodList) of goodList

Thank you

There may be a nice, neat AppleScriptObjC function for this…

@Shane_Stanley?

But JavaScript has a function for that, and you can run it from AppleScriptObjC.

--------------------------------------------------------
# Auth: Christopher Stone <scriptmeister@thestoneforge.com>
#     : building upon work by @ComplexPoint
# dCre: 2023/03/27 20:03
# dMod: 2023/03/27 20:03 
# Appl: AppleScriptObjC & JavaScript
# Task: Normalize Diacritical Strings and Remove Spaces.
# Libs: None
# Osax: None
# URLs: https://forum.keyboardmaestro.com/t/using-javascript-for-automation-from-applescript-and-vice-versa/4054?u=ccstone
# Tags: @Applescript, @Script, @ASObjC, @Normalize, @Diacriticals, @Remove, @Spaces
--------------------------------------------------------
use AppleScript version "2.4" --» Yosemite or later
use framework "Foundation"
use framework "OSAKit"
use scripting additions
--------------------------------------------------------

set dataStr to "àáâãäå ÈÉÊË ÀÁÂÃÄÅ"

set jsCmdStr to "
   (() => {
   
      let inputStr = '" & dataStr & "';
      let outPutStr = inputStr.normalize('NFKD').replace(/[^\\w]/g, '');
      return outPutStr;
   
   })();
"
set strEncoded to evalOSA("JavaScript", jsCmdStr)

--------------------------------------------------------
--» HANDLERS
--------------------------------------------------------
# evalOSA :: ("JavaScript" | "AppleScript") -> String -> String
--------------------------------------------------------
on evalOSA(strLang, strCode)
   set ca to current application
   set oScript to ca's OSAScript's alloc's initWithSource:strCode ¬
      language:(ca's OSALanguage's languageForName:(strLang))
   set {blnCompiled, oError} to oScript's compileAndReturnError:(reference)
   if blnCompiled then
      set {oDesc, oError} to oScript's executeAndReturnError:(reference)
      if (oError is missing value) then return oDesc's stringValue as text
   end if
   return oError's NSLocalizedDescription as text
end evalOSA
--------------------------------------------------------

Alt-Title == Run JavaScript Directly from AppleScript



When comparing texts, you can have the script take into account (or ignore) certain types of text, including diacriticals (and also case, hyphens, numeric strings, punctuation and white space).

So, for example, the script asks whether é = e; while considering diacriticals, they are not equal. The if…then block then finds the offset of the character in the string of clean characters while ignoring diacriticals, and uses the character at that offset as the replacement. You can get more details in the Language Guide. If the first block used ignoring instead, the script would skip over the é because it would consider it equal to the e.

What the offset line does is find the appropriate clean character (in this case, an ‘e’) and determine its offset (5) in the string of good characters and then get the letter at that offset (e).

Here is a more focused example:

set initialStr to "béd"
set goodStr to "abcde"

considering diacriticals
	-- is é in 'abcde'
	character 2 of initialStr is in goodStr
end considering
--> false

-- if false then…
ignoring diacriticals
	set x to character 2 of initialStr
	--> offset of 'e' in 'abcde'
	character (offset of x in goodStr) in goodStr
end ignoring
--> character at offset 5 in goodStr
--> e

So, the script substitutes an unadorned ‘e’ for any accented ‘e’, including ‘éëèê’. The letter ‘a’ takes all of the same accents plus a tilde (‘ã’), giving ‘áäàâã’, but you don’t need to know which accents exist because they should all be swapped.

So in the full script, the following test string should return this:

set test to "including -éëèê- and -áäàâã-"
--> "including-eeee-and-aaaaa-"

There are a couple:

-- requires macOS 10.11 or later
use framework "Foundation"
use scripting additions

set dataStr to "àáâãäå ÈÉÊË ÀÁÂÃÄÅ"

set theString to current application's NSString's stringWithString:dataStr
set theString to theString's stringByApplyingTransform:(current application's NSStringTransformStripDiacritics) |reverse|:false
return theString as text

Or:

use framework "Foundation"
use scripting additions

set dataStr to "àáâãäå ÈÉÊË ÀÁÂÃÄÅ"

set theString to current application's NSString's stringWithString:dataStr
set theString to theString's stringByFoldingWithOptions:(current application's NSDiacriticInsensitiveSearch) locale:(current application's NSLocale's currentLocale())
return theString as text
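
Neither snippet removes spaces or other characters that are not wanted in an address. As a rough sketch, the folding call can be followed by a regular-expression replacement. The handler name cleanForEmail and the allowed-character pattern are assumptions made for illustration, so adjust them to your own rules.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- cleanForEmail is a hypothetical helper name, not an existing API.
on cleanForEmail(sourceText)
	set theString to current application's NSString's stringWithString:sourceText
	-- Fold accented characters to their base letters (é becomes e)
	set theString to theString's stringByFoldingWithOptions:(current application's NSDiacriticInsensitiveSearch) locale:(current application's NSLocale's currentLocale())
	-- Delete every character outside the assumed allowed set
	set theString to theString's stringByReplacingOccurrencesOfString:"[^A-Za-z0-9@._\\-]" withString:"" options:(current application's NSRegularExpressionSearch) range:{0, theString's |length|()}
	return theString as text
end cleanForEmail

cleanForEmail("Mac Scriptér 1@gmail.com")
```

Letters that have no accented form to fold from, such as ð or æ, are simply deleted by the pattern, so replace them beforehand if you would rather keep a substitute.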

I used your example, slightly modified, like this:

set theResponse to display dialog "Name?" default answer name with icon note buttons {"Cancel", "Continue"} default button "Continue"

(input data)

set theResponse to fixString(theResponse)

(call to the method)

on fixString(mystring)
	
	set goodList to "abcdefghijklmnopqrstuvwxyz0123456789@._-"
	
	set test to "maç scrïptér 2023@macscripter.com"
	
	set AppleScript's text item delimiters to {space, tab}
	set mystrings to text items of mystring
	set AppleScript's text item delimiters to ""
	set mystring to mystrings as text
	
	set charList to characters of mystring as list
	
	repeat with listpos from 1 to count of mystring
		
		considering diacriticals
			set eachChar to contents of item listpos of mystring
			if eachChar is not in goodList then
				ignoring diacriticals
					set gc to character (offset of eachChar in goodList) of goodList
					set item listpos of charList to gc
				end ignoring
			end if
			
		end considering
	end repeat
	
	return charList as text
	
end fixString

When I later on did this

set arrName to (text returned of theResponse)

The result was really weird. If I enter alice as input arrName is Continuealice and I get an error:

error “Can’t get text returned of "Continuealice".” number -1728 from text returned of “Continuealice”.

If I skip the call to fixString everything works as expected, that is, arrName is a string set to alice.


Solved it myself:

set theResponse to fixString(text returned of theResponse)

Pretty obvious.

Hi

Your solution has worked very well for several months, but recently I bumped into two strings that it couldn’t handle. Not sure why.

Verkís Verkfræðistofa

and

™️

I don’t understand the problem with the first one; it is just quite regular characters. The second one is obviously a “special character”. Any solution works for that: either delete it or convert it to tm.

You can also use Mockman’s routine near the end of this post below…

https://www.macscripter.net/t/strip-diacritcals/17369/6


Hmm… those don’t actually qualify as diacriticals, so yeah, they won’t be handled by my script.

ð is actually an unadorned letter (eth), albeit one of narrow usage. Apparently, it drifted out of the English language over a thousand years ago. Its upper case version is Ð. You’d have to decide what you would like to replace it with, as it’s not a member of what Apple considers to be the diacritical class, at least so far as I can tell.

As for the trademark symbol, it can be replaced with any variation of the standard replace methods using text item delimiters (see below), but you should decide whether the replacement should include a space or not. Many uses of it have it flush against the preceding text, so it would be odd to just append ‘tm’ to a word.
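
If simply dropping such characters is acceptable, the loop can check the offset before using it. The offset command returns 0 when it finds no match, and asking for character 0 of a string is an error, which is presumably where the original script stops. The following is only a sketch of that guard. The handler name fixStringSafely is hypothetical, and the character list is the same one used earlier.

```applescript
-- fixStringSafely is a hypothetical name for a guarded variant of the earlier loop.
on fixStringSafely(sourceText)
	set goodList to "abcdefghijklmnopqrstuvwxyz0123456789@._-"
	set cleanedChars to {}
	repeat with charRef in (characters of sourceText)
		set thisChar to contents of charRef
		considering diacriticals
			if thisChar is in goodList then
				-- Already a permitted character: keep it unchanged
				set end of cleanedChars to thisChar
			else
				ignoring diacriticals
					set charOffset to offset of thisChar in goodList
				end ignoring
				-- offset is 0 when nothing matches, so such characters are skipped
				if charOffset > 0 then set end of cleanedChars to character charOffset of goodList
			end if
		end considering
	end repeat
	set savedDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ""
	set cleanedText to cleanedChars as text
	set AppleScript's text item delimiters to savedDelimiters
	return cleanedText
end fixStringSafely

fixStringSafely("maç scrïptér 2023@macscripter.com")
```

Because a space is not in the list either, spaces and tabs are dropped by the same guard, so the separate delimiter step is no longer needed in this variant.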

I’d probably put this inside a handler but this is the basic idea.

	set bigText to "and ™️"
	set AppleScript's text item delimiters to "™️"
	set ti to text items of bigText
	set AppleScript's text item delimiters to "tm"
	set xt to ti as text
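
Wrapped in a handler, the same idea might look like the sketch below. The handler name replaceText and the replacement spellings ("ae" and "d") are illustrative assumptions only; pick whatever substitutes suit your addresses. The handler also restores the previous delimiters, which the short version above does not do.

```applescript
-- replaceText is a hypothetical helper: swaps every findText in sourceText for newText.
on replaceText(findText, newText, sourceText)
	set savedDelimiters to AppleScript's text item delimiters
	-- Split the text at each occurrence of findText...
	set AppleScript's text item delimiters to findText
	set textPieces to text items of sourceText
	-- ...and join the pieces again with newText in between
	set AppleScript's text item delimiters to newText
	set resultText to textPieces as text
	set AppleScript's text item delimiters to savedDelimiters
	return resultText
end replaceText

-- The substitutes chosen here are examples, not a standard transliteration.
set sampleText to "Verkfræðistofa"
set sampleText to replaceText("æ", "ae", sampleText)
set sampleText to replaceText("ð", "d", sampleText)
```

Running such replacements before the diacritical clean-up means the letters never reach the part of the script that cannot match them.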

The second string is Icelandic. Though I think it first choked on the æ which is also not a diacritic character, is it?

No, in English at least, it is a ligature, although technically, the ligature is where the letters are joined. On a Mac, you can type it with option-'.

From a quick lookup on Wikipedia, in Icelandic (and a few other languages), it is apparently its own letter. It seems that on an Icelandic keyboard, it has its own key… the semi-colon key on an English keyboard (key code 41).

What would you want to do with it?

Also, in addition to Icelandic, the ð is also Old English. There seems to be a fair amount of overlap between Scandinavian languages and Old English. Lots of boots walking around England back in the day.