Refining script to replace the "@" in email addresses

Dougal_Watson · January 12, 2006, 8:16pm

Hi All,

In different areas of my website I either remove email addresses completely (for privacy reasons) or cloak the “@” symbol (in an attempt to frustrate robots trying to harvest addresses for spammers). I’ve traditionally done this, server-side, using PHP but am now shifting a lot of this functionality to my desktop Mac and AppleScript.

After some excellent assistance here (especially kai, his contented crocodile, and Nigel Garvey) I’ve sorted out some linebreak troubles during the transfer of text from Mail.app and FileMaker, and I’ve refined my script to entirely replace the email addresses. Now I’ve turned my efforts to the problem of replacing the @ in email addresses without, hopefully, messing-up other uses of @ within a body of text.

The script, so far, is included below and it does the job pretty well, but I have a few questions and would appreciate your thoughts.
1. Is there a better way to winnow-out the email addresses? I’ve simply set the script to consider any text item that contains at least one “@” and at least one “.” and is seven or more characters in length.
2. My first “set text item delimiters to …” works fine if I use “space & return” but not simply “return” (as I’d expected and started out). If I use “return” the script misses some of the addresses. I don’t understand why this is the case … but using “space & return” seems to fix it.
3. On my website I use a small graphic “@” (at.gif) to replace the text “@”. This helps maintain the ‘look’ of the address but offers a degree of spammer harvest protection. I’ve tried setting ReplacementString to “<img src="images/at.gif" width="15" align="absmiddle">” but it doesn’t work properly. If I don’t use the backslash escape characters it doesn’t work at all, but if I do then the final ModifiedText also shows the backslashes. How do I need to initially set ReplacementString so the final text that replaces @ is <img src="images/at.gif" width="15" align="absmiddle">?
4. Are there tidier or smarter ways of doing this simple task?

Cheers
Dougal


set StartText to "
From:   noobie@paradise.net.nz 
Subject: Reply to topic: 'Changing the @ within email addresses'
Date: 13 January 2006 1:58:51 AM
To:   	some_dude@macscripter.net
Return-Path: <www@paradise.net.nz >
Delivered-To: noobie@paradise.net.nz
X-Envelope-To: noobie@paradise.net.nz
Received: (qmail 5997 invoked from network); 12 Jan 2006 12:58:55 -0000
Message-Id: 20000111125851.5909B378D9F@paradise.net.nz 
Hi Everyone,
I want to try and cloak the @ symbol within the email addresses contained within a body of text, but I don't want to mess-up adjacent punctuation or angle-brackets:
	bloggs@hotmail.com;
	noobie@paradise.net.nz;
	bloggs@hotmail.com:, bloggs@hotmail.com;, and bloggs@hotmail.com.
	<bloggs@hotmail.com>
	@, @@, @@@, and @@@@ 
	The c@ s@ on the m@ and looked @ the f@ b@.
Cheers
Dougal"

set ReplacementString to "[at]"

set ModifiedText to StartText

-- make sure the text uses ASCII_13 returns. Mail.app returns ASCII_10s.
-- Thanks to kai and his contented crocodile
set text item delimiters to space & return
tell ModifiedText's paragraphs to set ModifiedText to beginning & ({space} & rest)
set text item delimiters to space

set AddressTest to ModifiedText contains "@"

if AddressTest then
	
	set text item delimiters to space
	set TextParts to text items of ModifiedText
	set WordCount to count TextParts
	
	considering case
		-- Thanks Nigel ... I'm afraid I still don't understand why it's quicker to consider case but I'm sure I will one day
		
		repeat with n from 1 to WordCount
			
			-- find the words that might be addresses
			-- and test to see if they're likely to be an email address
			set PossibleAddress to item n of TextParts
			
			if PossibleAddress contains "@" and PossibleAddress contains "." and length of PossibleAddress > 6 then
				
				-- it looks and smells a little like an email address so get to work on it
				-- first change the @ within PossibleAddress
				set text item delimiters to "@"
				set InterimPossibleAddress to text items of PossibleAddress
				set text item delimiters to ReplacementString
				set ModifiedPossibleAddress to InterimPossibleAddress as text
				set text item delimiters to space
				
				-- then put the modified email address back into the main text
				set text item delimiters to PossibleAddress
				set InterimText to text items of ModifiedText
				set text item delimiters to ModifiedPossibleAddress
				set ModifiedText to InterimText as text
				set text item delimiters to space
				
			end if
		end repeat
	end considering
end if

-- have a look at the results
return ModifiedText
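
On question 3, a minimal sketch of the quoting, assuming the goal is the literal img tag from the post. Inside an AppleScript string literal each inner double quote is written with a backslash. The backslashes are source notation only: Script Editor's Result pane displays strings in source form, which is why they seem to reappear there, but the stored text does not contain them.

```applescript
-- Each inner double quote is escaped with a backslash in the source.
set ReplacementString to "<img src=\"images/at.gif\" width=\"15\" align=\"absmiddle\">"

-- Check the stored text itself rather than the Result pane's display.
-- "\\" is the source notation for a single backslash character.
set HasBackslash to ReplacementString contains "\\"
return HasBackslash --> false
```

Writing ReplacementString to a file, or showing it with display dialog, should therefore give the tag with plain quotes.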

Adam_Bell · January 12, 2006, 9:19pm

This borrowed Regular Expression validates a proper email address:

^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$

I’ve been trying to approach this from that point of view: properly identifying the parts before and after the “@” and replacing them with part 1 {at} part 2. Not there yet - my regex experience is minimal and I’ll have to find some examples for how to use sed or awk to do this with your list.

Adam_Bell · January 13, 2006, 12:01am

To stick to the original all AppleScript version, by the way, just requires an intermediate replacement:

-- Identify and remove the ampersand "@" from email addresses from big strings likely to contain addresses

set StartText to "I want to remove email addresses such as bloggs@hotmail.com and Bill.jones@paradise.net.nz, but I don't want to remove isolated occurrences of the @ character or non-address constructs such as c@."

set ReplacementString to "[NoSpam]"
set ModifiedText to StartText
set AddressTest to ModifiedText contains "@"

if AddressTest then
	
	set OldDelim to text item delimiters
	set text item delimiters to space
	set TextParts to text items of StartText
	set WordCount to count TextParts
	
	considering case
		repeat with n from 1 to WordCount
			
			-- find the words that might be addresses
			-- and test to see if they're likely to be an email address
			set PossibleAddress to item n of TextParts
			if PossibleAddress contains "@" and PossibleAddress contains "." and not (PossibleAddress begins with "." or PossibleAddress ends with "." or length of PossibleAddress < 7) then
				
				-- it looks and smells a little like an email address so replace it
				set text item delimiters to PossibleAddress
				set InterimText to text items of ModifiedText -- split the original text into two parts at the PossibleAddress
				set text item delimiters to "@"
				set InterimAddress to text items of PossibleAddress -- head and tail of address
				set text item delimiters to ReplacementString -- to go between the parts
				set PossibleAddress to InterimAddress as string -- put new part in between
				set text item delimiters to PossibleAddress -- 
				set ModifiedText to InterimText as text -- reassemble the original string
				set text item delimiters to space
				
			end if
			
		end repeat
	end considering
	
	set text item delimiters to OldDelim
	
	-- have a look at the results
	return StartText & return & return & ModifiedText

--> I want to remove email addresses such as bloggs[NoSpam]hotmail.com and Bill.jones[NoSpam]paradise.net.nz, but I don't want to remove isolated occurrences of the @ character or non-address constructs such as c@."
	
end if
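
Both scripts above repeat the same split-and-join steps with text item delimiters. As a sketch only, those steps could be gathered into one handler; the name replaceText and its parameter names are made up for illustration and are not part of either script.

```applescript
-- Hypothetical helper: replace every occurrence of SearchString in SourceText.
on replaceText(SourceText, SearchString, NewString)
	-- Remember the current delimiters so the caller's setting survives.
	set OldDelims to AppleScript's text item delimiters
	-- Split on the search string, then rejoin with the replacement.
	set AppleScript's text item delimiters to SearchString
	set Parts to text items of SourceText
	set AppleScript's text item delimiters to NewString
	set Joined to Parts as text
	set AppleScript's text item delimiters to OldDelims
	return Joined
end replaceText

replaceText("bloggs@hotmail.com", "@", "[at]") --> "bloggs[at]hotmail.com"
```

With a helper like this, the body of the if block reduces to two calls: one to cloak the address, one to put the cloaked address back into ModifiedText.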

Dougal_Watson · January 13, 2006, 12:33am

Using that REGEXP how impractical would it be to call awk or sed or grep (??) to test whether PossibleAddress meets those criteria, and then returning a true/false back to the AppleScript? PossibleAddress is already filtered to contain an “@” and a “.” or two so it wouldn’t be a case of every individual text item being piped to the shell.

I’ve not checked but suspect that PossibleAddress may contain non-address characters at the start and end. If so it’d need to be tidied-up a bit before sending it to the shell.

Looks like this might be the project for my first attempt at invoking a shell process from an AppleScript … Gulp!

Cheers
Dougal

Dougal_Watson · January 13, 2006, 12:54am

Totally off-topic but isn’t the ampersand the & symbol? @ seems to have lots of names with the technical English seeming to favour “commercial at” or simply “at” (both boring). Of the various offerings I especially like:
- “ampersat” … an adaptation from “ampersand”;
- “kroellalfa”, Norwegian for “curled a”;
- “chiocciolina”, Italian for “small snail”;
- “apestaart”, Dutch for “monkey’s tail”;
- The Israeli reference to its looking like the cross-section of a cut strudel.

Anyway, for all you wanted to know about the etymology of @ (and more) … http://www.guardian.co.uk/notesandqueries/query/0,5753,-1773,00.html

Cheers
Dougal

Adam_Bell · January 13, 2006, 1:20am

Great link.

Dougal_Watson · January 13, 2006, 10:32am

Thanks Adam,

I’m surprised (and pleased), my first attempt at getting AppleScript to throw stuff to the shell wasn’t that tough … so far.

I added AppleScript code to tidy-up the various PossibleAddress variables by removing punctuation characters (, ; & and angle-brackets) and then I’ve thrown the variable to egrep for a regexp check. My attempts to pipe PossibleAddress to the shell failed if the variable had leading or trailing punctuation or angle-brackets. I can’t (yet) get the regexp you provided to work … but I’ll use it as a template and work a new one up from scratch (possibly differences with the extended regexp that egrep uses).

I won’t show my cleanup sub-script because it’s rather embarrassing in its cumbersome-ness at the moment but once I’ve tidied PossibleAddress up the following works nicely (Note: This actual regexp is set only to recognise whether there’s an @ in PossibleAddress … which there always is at the moment).


set check_for_email to (do shell script "echo " & PossibleAddress & " | egrep -c [@]") as integer
if check_for_email = 1 then
	-- do the AppleScript moves to replace the @ within PossibleAddress
end if

Now all I need to do is write an email-address-recognising extended regexp, to replace the [@], and that little AppleScript will have a much more selective email address recognising function than currently.
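
A hedged sketch of the same egrep check with shell quoting added. The handler name looksLikeAddress is hypothetical. Two points it illustrates: quoted form of lets punctuation and angle-brackets through to the shell untouched, and egrep exits with a non-zero status when nothing matches, which do shell script reports as an error unless the status is masked.

```applescript
-- Hypothetical handler: true when CandidateText contains an "@".
on looksLikeAddress(CandidateText)
	-- quoted form of wraps the text in single quotes, so < > ; & reach egrep as plain text.
	-- "|| true" masks egrep's exit status 1 (no match), which would otherwise raise an error.
	set ShellCommand to "echo " & quoted form of CandidateText & " | egrep -c '@' || true"
	set MatchCount to (do shell script ShellCommand) as integer
	return MatchCount > 0
end looksLikeAddress

looksLikeAddress("<bloggs@hotmail.com>;") --> true
```

Swapping the '@' pattern for a fuller extended regexp would only change the string between the single quotes.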

Having recognised the addresses this way would there be any benefits in passing them elsewhere, such as awk, to make the @ substitution or would the AppleScript approach above be adequate for that?

The whole script, at the moment, is quite slow processing that long pseudo-email I used in my first posting … 1 - 2 seconds. I’m not sure how much is my messy tidying-up of PossibleAddress and how much is the multiple calls on the shell script … or perhaps some other interesting permutation I’ve inadvertently added.

Cheers
Dougal
Off to bed then off camping for weekend

Adam_Bell · January 13, 2006, 2:22pm

As I understand it (and I am no expert in this regard), awk and sed both work through their input line by line, as grep does; sed applies a regex to find and a replacement to substitute on each line.
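
For what it's worth, here is a rough sketch of handing the whole substitution to sed in a single shell call. The sample text and the pattern are assumptions for illustration: the pattern is a loose "looks like an address" match, nowhere near a full validator, and the -E flag assumes a sed that accepts extended regular expressions (the BSD sed on Mac OS X does).

```applescript
set SampleText to "Write to bloggs@hotmail.com, not to c@ or @@."

-- Assumed pattern: a word-like local part, "@", then a dotted domain ending in a letter.
-- Doubled backslashes are AppleScript source notation for single ones.
set SedProgram to "s/([A-Za-z0-9._-]+)@([A-Za-z0-9-]+\\.[A-Za-z0-9.-]*[A-Za-z])/\\1[at]\\2/g"

-- quoted form of protects both the text and the sed program from the shell.
set Cloaked to do shell script "printf '%s\\n' " & quoted form of SampleText & " | sed -E " & quoted form of SedProgram
return Cloaked --> "Write to bloggs[at]hotmail.com, not to c@ or @@."
```

Because the text goes to the shell once rather than once per candidate word, this may also help with the speed concern, though that is untested here.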

hhas · January 13, 2006, 4:19pm

FYI/PSA time…

Folk might want to note that accurately determining if an email address is syntactically correct is actually heinously complex (see RFC822; some folks just shouldn’t be allowed to design ‘standards’ imo). Unfortunately, there’s scads of well-meaning but thoroughly-broken ‘solutions’ all over the web so knowing when you’ve found one that’s actually correct is not easy; though as a rough rule of thumb, the shorter the ‘solution’ then the less likely it is to be correct.

For a script that only needs to blank out anything that looks roughly like an email address for web display purposes only, it’s probably not so important if the address matching part has less than 100% accuracy. Still, you might want to keep these issues in mind, especially if you need to deal with email addresses in situations where 100% accuracy is essential.

e.g. Here’s a couple of examples for validating addresses that should give an idea of just how deep the rabbit hole really goes:

a typically mad regex-based, but comprehensive, solution from the Perl crowd:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

a simpler Python-based parsing solution; not completely comprehensive, but the docs openly state it ignores defunct formats so there’s a fair chance it was written by someone who knows what they’re doing:

http://www.secureprogramming.com/?action=view&feature=recipes&recipeid=1

HTH

has

Adam_Bell · January 13, 2006, 5:16pm

That’s clearly understood, hhas. I’m not sure how serious Dougal_Watson is, but my interest is really to learn to use sed. I don’t have a real need to remove “@” from suspected email addresses except in web docs, and there I use some of the available JavaScript tools for obfuscating an email address.

Clearly understood. Thanks for the reminder that email, being an old technology, has very loosey-goosey standards that permit some pretty strange configurations to pass for a valid email address. (I can remember the hassle of sending a series of emails to the University of Tokyo from MIT in the US in 1978 or 79 - the header was longer than the message, and had to be set up by hand.)

The Perl regex looks, for all the world, like something out of Jeffrey Friedl’s “Mastering Regular Expressions”. He develops one for email that has 4,724 bytes (in Perl, of course).

And if I could “read” Python, the Python solution would no doubt be translatable to AppleScript.

Dougal_Watson · January 15, 2006, 10:57pm

Thanks Adam, Thanks hhas,

Yes, “heinously complex” is probably an apt description. Unfortunately RFC822, the basis of that ex-parrot (is that a dead parrot?) page, is obsolete … and the situation is possibly even more complex than that (given UTF-7 and IDN*). I think the most current related RFCs are 2821 (2001), 2822 (2001), and 1642 (1994, re UTF-7) … and transposing all of their logic into a code algorithm is, as hhas suggests, a potentially heinous task.

All the same I am trying (as much for my own further education as for filtering my website material) to write algorithms to help validate email addresses. I don’t know that I’ll be able to build something that perfectly reflects all of the relevant standards but so far I have a smallish regexp working, via egrep, as well as an all-AppleScript algorithm, that have correctly identified all email addresses in the (limited) test material I’ve thrown at them.

My scripts are both slow and the all-AppleScript algorithm is quite a monster (and growing every time I look at it) but once I’ve pushed them to the limit of my capabilities (which won’t be too far), and tested them on more material, I’ll throw them up here for further suggestions.

I’m an “egg” coder, unlike you two, so I am as busy grappling with some very basic coding fundamentals as I am with the higher level concepts and difficulties you guys mention.

If I may … another very simple (I suspect) “egg” question.

How, in AppleScript, can I escape a segment of code other than using nested “if” checks? Is there a way of getting a script to ‘ignore all of the tests below’ and go to the ‘wrap-up’ section … alas I can find no “go to” command.
At the moment I am using a heap of nested if/elseif/then conditionals for testing the components of my PossibleAddress strings … they work but it’s all starting to look very messy. How, upon finding something that constitutes a “fail” criterion (e.g. an ASCII control char within one of the labels from PossibleAddress), am I best able to skip all the other possible checks and just return a ‘fail’ verdict?
I’ve tried experimenting using a handler but can’t seem to (yet) get it to do what I want.

Suggestions???
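
One possible shape for this, offered as a sketch rather than the answer: put the checks in a handler and let each failed check return straight away. A return statement leaves the handler immediately, so nothing below it runs and no nesting is needed. The handler name isPlausibleAddress and the particular checks are illustrative assumptions, not a real validator.

```applescript
-- Hypothetical validator: every failed check returns at once.
on isPlausibleAddress(CandidateText)
	if CandidateText does not contain "@" then return false
	if length of CandidateText < 7 then return false
	
	-- Split at "@" and restore the delimiters before any further test.
	set OldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "@"
	set Parts to text items of CandidateText
	set AppleScript's text item delimiters to OldDelims
	
	if (count Parts) is not 2 then return false
	set {LocalPart, DomainPart} to Parts
	if LocalPart is "" then return false
	if DomainPart does not contain "." then return false
	if DomainPart begins with "." or DomainPart ends with "." then return false
	
	-- Nothing failed, so give the 'pass' verdict.
	return true
end isPlausibleAddress

set Verdicts to {}
repeat with Candidate in {"bloggs@hotmail.com", "c@", "@@@", "noobie@paradise.net.nz"}
	set end of Verdicts to isPlausibleAddress(contents of Candidate)
end repeat
return Verdicts --> {true, false, false, true}
```

The wrap-up section then lives in the calling code, which only needs to look at the true/false verdict.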

Cheers
Dougal

* e.g. ASCII v. Unicode - http://www.職業.jp/ & http://www.bücher.ch… although I’m not yet sure whether and / or how this protocol applies to email addresses.