I want to present text to web browsers without any actual email addresses showing. I can do this on the server-side using PHP and REGEXPs to scrub the text on its way from MySQL to the browser BUT I’d prefer to do it at my desktop prior to popping the data into MySQL.
I’ve been trying to use AppleScript to achieve this and have come up with the following (which mostly works) … but it seems to be an awfully cumbersome way to do what should be a pretty easy find-replace. Is there a simple one-line command I am missing that will do this? How else could / should I approach this task?
Cheers
Dougal
-- Identify and remove email addresses from big strings likely to contain addresses
set StartText to "I want to remove email addresses such as bloggs@hotmail.com and Bill.jones@paradise.net.nz, but I don't want to remove isolated occurrences of the @ character or non-address constructs such as c@."
set ReplacementString to "[snip]"
set ModifiedText to StartText
set AddressTest to ModifiedText contains "@"
if AddressTest then
	set TextParts to text items of ModifiedText
	set WordCount to count TextParts
	set n to 1
	repeat while n ≤ WordCount
		-- find the words that might be addresses
		if text item n of ModifiedText contains "@" then
			set PossibleAddress to text item n of ModifiedText
			-- test to see if they're likely to be an email address
			if PossibleAddress contains "." then
				if first character of PossibleAddress is not "." then
					if last character of PossibleAddress is not "." then
						if length of PossibleAddress > 6 then
							-- it looks and smells a little like an email address so replace it
							set OldDelim to text item delimiters
							set text item delimiters to PossibleAddress
							set InterimText to text items of ModifiedText
							set text item delimiters to ReplacementString
							set ModifiedText to InterimText as text
							set text item delimiters to OldDelim
						end if
					end if
				end if
			end if
		end if
		set n to n + 1
	end repeat
	-- have a look at the results
	return StartText & return & return & ModifiedText
end if
I don’t know if there’s a cleverer way to do what you want, but with regard to the small details of your script (which seems to work very well):
Your nested ‘if’ blocks could be made less “cumbersome” by linking all the test conditions into one line with ‘ands’. (For the three negative conditions, you could use one ‘not’ and a couple of ‘ors’ instead.) The logic executed is exactly the same.
As posted, your script doesn’t contain the first setting of the text item delimiters, presumably to a space.
Since the value of ‘ModifiedText’ is continually being modified, looping through it is a bad idea. It would be safer to use ‘text item n of StartText’. However, since you’ve already extracted the text-item list ‘TextParts’, ‘item n of TextParts’ would be more efficient.
A ‘repeat with n from 1 to WordCount’ repeat would be neater for what you’re doing. (Or ‘repeat with PossibleAddress in TextParts’.)
‘begins with’ and ‘ends with’ are slightly more efficient than extracting the first and last characters of a string and testing those.
Since you’re testing exclusively for the single-case characters “@” and “.”, putting the repeat in a ‘considering case’ block would make it faster.
-- Identify and remove email addresses from big strings likely to contain addresses
set StartText to "I want to remove email addresses such as bloggs@hotmail.com and Bill.jones@paradise.net.nz, but I don't want to remove isolated occurrences of the @ character or non-address constructs such as c@."
set ReplacementString to "[snip]"
set ModifiedText to StartText
set AddressTest to ModifiedText contains "@"
if AddressTest then
	set OldDelim to text item delimiters
	set text item delimiters to space
	set TextParts to text items of StartText
	set WordCount to count TextParts
	considering case
		repeat with n from 1 to WordCount
			-- find the words that might be addresses
			-- and test to see if they're likely to be an email address
			set PossibleAddress to item n of TextParts
			if PossibleAddress contains "@" and PossibleAddress contains "." and not (PossibleAddress begins with "." or PossibleAddress ends with "." or length of PossibleAddress < 7) then
				-- it looks and smells a little like an email address so replace it
				set text item delimiters to PossibleAddress
				set InterimText to text items of ModifiedText
				set text item delimiters to ReplacementString
				set ModifiedText to InterimText as text
				set text item delimiters to space
			end if
		end repeat
	end considering
	set text item delimiters to OldDelim
	-- have a look at the results
	return StartText & return & return & ModifiedText
end if
Thanks for your clear explanation, it all seems to make sense. I will give the changes a try tonight.
I’d not thought of issues such as the difference between “begins with” and “first character” in this context. Probably no real difference for a tiddly little script such as this but a great difference for my learning curve … still very very steep.
I’d already left myself a note for tonight to look up the syntax for ands and ors within the ‘if’ test. I will add “look up considering” to that list.
It strikes me that, with only a small fiddle of either script, every email address found, e.g. whoami@thisplace.sct, could be recast in the form: whoami {noSpam} thisplace.sct to leave them visible, but “cleansed”. I haven’t seen an AppleScript to do that.
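For what it's worth, here is a minimal sketch of that recasting, using the same text item delimiters trick as the scripts above. The handler name cloakAddress and the " {noSpam} " marker are only illustrative choices, not anything from the earlier posts, and the handler assumes it is handed a single word that has already passed the "looks like an address" test.

```applescript
-- Hypothetical handler: recast "user@host" as "user {noSpam} host"
on cloakAddress(PossibleAddress)
	-- remember the current delimiters so they can be restored afterwards
	set OldDelim to AppleScript's text item delimiters
	-- split the word at the "@" character
	set AppleScript's text item delimiters to "@"
	set AddressParts to text items of PossibleAddress
	-- rejoin the pieces with the marker in place of "@"
	set AppleScript's text item delimiters to " {noSpam} "
	set CloakedAddress to AddressParts as text
	set AppleScript's text item delimiters to OldDelim
	return CloakedAddress
end cloakAddress

cloakAddress("whoami@thisplace.sct")
-- result: "whoami {noSpam} thisplace.sct"
```

To use it in either script, the cloaked form would presumably be worked out before the delimiters are changed for the replacement step, and then used in place of ReplacementString.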
Absolutely Kai. I’d actually printed-out that section last night but not read it as I was grappling with how to make the script work … as usual leaving “read the manual” as the last option
Adam, as for cloaking the email addresses to make them less harvestable for the robots … I agree. I already do this on my website using PHP, but decided to remove the addresses altogether (using AppleScript) rather than insert an “[at]” instead of the “@”. That way the archive I’m building can be made public without exposing contributors’ email addresses to any potential misuse … much as happens on bulletin boards such as this great forum.
I’ll try, tonight, to adapt this little script to a robot-frustrater and will pop it up here for future reference.
OK, I’ve done some of my homework for tonight … but I don’t understand how the considering case control helps.
If these are single-case characters wouldn’t it be quicker to ignore case? Obviously I’m missing something but I’d have thought that ignoring case would reduce the number of comparisons that need to be made in response to a statement.
Thanks Nigel for the rewrite and the explanation. Thanks Adam and kai for your thoughts. Being one-week into learning AppleScript I obviously have a long way to go … but mentorship and support such as yours is really helpful and very important. I know I’m only barely scratching the surface of this language, I know I do not have a computing background, and I know that I am asking simple (and doubtlessly sometimes stoopid) questions … but the responses I’ve received have been supportive, courteous, and very generous in time and energy. Thanks all.
On that note I’m gonna print out kai’s explanation on linefeed replacement, shut down for the night, and go off somewhere quiet to read it.
Cheers
Dougal
p.s. Nigel, that rewrite works beautifully … I will tinker with it some more and see where it takes me.
The core AppleScript language ‘ignores’ case by default. Counterintuitively, ‘ignoring’ involves more work than ‘considering’ in string comparisons, because different characters have to be recognised as equivalent under the aspect being ignored, or not, as the case may be. [Sorry. ;)] “A” can be equivalent to either “A” or “a”, but not to anything else. The commands don’t know in advance what characters will be in the strings they’re fed, so they have to do almost as much analysis with single-case characters as with dual-case ones. When ‘considering’ everything, though, it doesn’t matter what the characters are. The two strings are either exactly the same, or they’re not.
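A small sketch of the difference in outcome (it shows the results, not the speed): the same comparison is made once with the default settings and once inside a ‘considering case’ block. The variable names are just illustrative.

```applescript
-- Default behaviour: case is ignored, so "A" and "a" count as equal
set DefaultResult to ("A" = "a")

-- Inside 'considering case' the strings must match exactly
considering case
	set StrictResult to ("A" = "a")
end considering

return {DefaultResult, StrictResult}
-- result: {true, false}
```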
Other things that can be considered or ignored by AppleScript string commands are diacriticals, expansion (“æ” = “ae”), hyphens, punctuation, and white space. These are all considered by default, which means strings have to be exactly the same in these respects to be considered equal.
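As a rough illustration of two of those attributes, the sketch below compares strings that differ only by a hyphen or a space, first with the defaults and then inside ‘ignoring’ blocks. The sample strings and variable names are made up for the example.

```applescript
-- By default hyphens and white space are considered, so these differ
set HyphenDefault to ("anti-spam" = "antispam")
set SpaceDefault to ("e mail" = "email")

-- Ignoring hyphens makes the first pair equal
ignoring hyphens
	set HyphenIgnored to ("anti-spam" = "antispam")
end ignoring

-- Ignoring white space makes the second pair equal
ignoring white space
	set SpaceIgnored to ("e mail" = "email")
end ignoring

return {HyphenDefault, SpaceDefault, HyphenIgnored, SpaceIgnored}
-- result: {false, false, true, true}
```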