Extract e-mail address from text file

This would seem to be a fairly basic question, but I can’t find an answer anywhere.

I have a big mailing list with a lot of bad e-mail addresses. When e-mail notices come back from “Daemon”, I want to save them all in one folder as text files, extract the original e-mail addresses, and compile them into a list that I can then compare with the email addresses in the FileMaker Pro database.

My problem is that I can’t figure out how to make an AppleScript that will identify and extract (get) an email address from a text document, such as one open in TextEdit or MS Word.

Once I have solved that basic problem, I will have to figure out how to tell the script which e-mail address to extract. I can do this in FileMaker, but it seems that it would be easier to do it in AppleScript with a droplet.

[Note: the way these Daemon notices are set up, the heading with "From: ", "To: ", and "Subject: " is repeated, first with the recipient’s information and then with the information from your original email. Thus you only have to find the email address that occurs after the second iteration of the string "To: ". So, the first iteration would be "To: myE-mail@myDomain.com" and the second would be "To: Barney.Google@lalaland.com".]

Jake:

I am pretty sure it depends on your desired workflow. For instance, if you use the returned daemon email to get the bad email address, that address could easily be saved in a list for later reference. Once you had a list you wanted to use to clean out a text file, you could search the text for each address, extract the paragraph(s) associated with it, and re-save the new text file. The same list should also be useful for the FileMaker data file. It would be simpler in a text file than in MS Word via AppleScript.

Is that pretty close to what you are talking about? Which app are you using for email?

I dealt with this recently and used text item delimiters and a set of if-then rules to validate whether a string actually was an email address.
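For anyone wondering what the delimiter-plus-rules approach might look like, here is a rough sketch. The handler name looksLikeEmail and the particular rules are my own assumptions, not the poster's actual script, and this is only a plausibility check, nowhere near a full address validator.

```applescript
-- Hypothetical helper: a rough plausibility check, not a full validator.
on looksLikeEmail(candidate)
	-- Reject empty strings and anything containing whitespace.
	if candidate is "" then return false
	if candidate contains space then return false
	if candidate contains tab then return false
	if candidate contains return then return false
	if candidate contains (ASCII character 10) then return false
	
	set savedTIDs to AppleScript's text item delimiters
	try
		-- Exactly one "@" yields exactly two text items: local part and domain.
		set AppleScript's text item delimiters to "@"
		set parts to text items of candidate
		if (count of parts) is not 2 then error "wrong number of @ signs"
		if item 1 of parts is "" then error "empty local part"
		
		-- The domain needs at least one dot and no empty labels.
		set AppleScript's text item delimiters to "."
		set labels to text items of (item 2 of parts)
		if (count of labels) < 2 then error "domain has no dot"
		repeat with aLabel in labels
			if (contents of aLabel) is "" then error "empty domain label"
		end repeat
		set isValid to true
	on error
		-- Any rule that failed above lands here.
		set isValid to false
	end try
	-- Restore the delimiters whether or not the checks passed.
	set AppleScript's text item delimiters to savedTIDs
	return isValid
end looksLikeEmail

looksLikeEmail("Edit.Piaf@ChanteParis.net") --> true
looksLikeEmail("Permanent Failure: 550") --> false
```

A check like this could be run on whatever the extraction step returns, so that a stray line from an oddly formatted notice does not end up in the list of bad addresses.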

:rolleyes: Thanks for the responses; however I think I must not have been clear about the nature of the problem. Here is what I need:

I want to extract text from between a beginning point and an ending point. In other words, all the text between the second iteration of "To: " and the first Return (¶) after the second iteration of "To: ".

Below is the actual text of a “mailer-daemon” bad email response. The middle part is prone to change depending on who sends it, but all the mailer-daemon messages seem to include the header info from the original email, so it is from there that I want to extract the bad email address.

By the way, I am using the Mail email client that comes with Mac OS X.

Thanks again

Jake Sterling

COPY OF A MAILER-DAEMON RESPONSE:

From: Mail Delivery Subsystem mailer-daemon@comcast.net
Date: Tue Feb 28, 2006 3:52:59 AM US/Eastern
To: myAddress@me.com
Subject: Returned mail: delivery problems encountered

A message (from myAddress@me.com) was received at 28 Feb 2006 8:52:55 +0000.

The following addresses had delivery problems:

Edit.Piaf@ChanteParis.net
Permanent Failure: 550_5.1.1_Edit.Piaf@ChanteParis.net…_User_unknown
Delivery last attempted at Tue, 28 Feb 2006 08:52:58 -0000
Reporting-MTA: dns; comcast.net
Arrival-Date: 28 Feb 2006 8:52:55 +0000

Final-Recipient: rfc822; Edit.Piaf@ChanteParis.net
Action: failed
Status: 5.0.0 550_5.1.1_Edit.Piaf@ChanteParis.net…_User_unknown
Diagnostic-Code: smtp; Permanent Failure: Other undefined Status
Last-Attempt-Date: Tue, 28 Feb 2006 08:52:58 -0000

From: Jake Sterling myAddress@me.com
Date: Tue Feb 28, 2006 3:53:11 AM US/Eastern
To: Edit.Piaf@ChanteParis.net
Subject: EmailTest2

If the daemon response always looks like the one you’ve posted (except that I had to restore the angle brackets around the email addresses), then this works for me:

set DResp to "From: Mail Delivery Subsystem <mailer-daemon@comcast.net>
Date: Tue Feb 28, 2006  3:52:59  AM US/Eastern
To: <myAddress@me.com>
Subject: Returned mail: delivery problems encountered

A message (from <myAddress@me.com>) was received at 28 Feb 2006  8:52:55 +0000.

The following addresses had delivery problems:

<Edit.Piaf@ChanteParis.net>
    Permanent Failure: 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
    Delivery last attempted at Tue, 28 Feb 2006 08:52:58 -0000
Reporting-MTA: dns; comcast.net
Arrival-Date: 28 Feb 2006  8:52:55 +0000

Final-Recipient: rfc822; <Edit.Piaf@ChanteParis.net>
Action: failed
Status: 5.0.0 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
Diagnostic-Code: smtp; Permanent Failure: Other undefined Status
Last-Attempt-Date: Tue, 28 Feb 2006 08:52:58 -0000

From: Jake Sterling <myAddress@me.com>
Date: Tue Feb 28, 2006  3:53:11  AM US/Eastern
To: Edit.Piaf@ChanteParis.net
Subject: EmailTest2"

set {tid, text item delimiters} to {text item delimiters, "From: Jake Sterling <myAddress@me.com>"}
set badAddr to paragraph 3 of last text item of DResp
set text item delimiters to "To: "
set badAddr to last text item of badAddr
set text item delimiters to tid -- always set them back!!!
badAddr --> "Edit.Piaf@ChanteParis.net"

Jake:

As long as the format being returned every time is stable, this will work:

set bad_mail to "COPY OF A MAILER-DAEMON RESPONSE:

From: Mail Delivery Subsystem <mailer-daemon@comcast.net>
Date: Tue Feb 28, 2006  3:52:59  AM US/Eastern
To: <myAddress@me.com>
Subject: Returned mail: delivery problems encountered

A message (from <myAddress@me.com>) was received at 28 Feb 2006  8:52:55 +0000.

The following addresses had delivery problems:

<Edit.Piaf@ChanteParis.net>
    Permanent Failure: 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
    Delivery last attempted at Tue, 28 Feb 2006 08:52:58 -0000
Reporting-MTA: dns; comcast.net
Arrival-Date: 28 Feb 2006  8:52:55 +0000

Final-Recipient: rfc822; <Edit.Piaf@ChanteParis.net>
Action: failed
Status: 5.0.0 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
Diagnostic-Code: smtp; Permanent Failure: Other undefined Status
Last-Attempt-Date: Tue, 28 Feb 2006 08:52:58 -0000

From: Jake Sterling <myAddress@me.com>
Date: Tue Feb 28, 2006  3:53:11  AM US/Eastern
To: Edit.Piaf@ChanteParis.net
Subject: EmailTest2"
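-- "text 5 thru -1" returns a string directly, so the result does not
-- depend on the current text item delimiters
set b to text 5 thru -1 of paragraph -2 of bad_mail --> "Edit.Piaf@ChanteParis.net"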

All you have to do is pull the second to last paragraph and get rid of the first 4 characters.

Craig’s is neater if the string is always found in the second-to-last paragraph. Mine depends on there being nothing between From: and To: except the Date: line, and I was concerned that there might be a cc: after To: and before Subject:.

Hi guys - fancy meeting you here. :wink:

The following minor variation focuses purely on the two occurrences of "To: " that Jake mentioned. (As an additional precaution, they could be prefixed with a return character or ASCII character 10 - whichever is used as a line-end in the source text.)

set bad_mail to "From: Mail Delivery Subsystem <mailer-daemon@comcast.net>
Date: Tue Feb 28, 2006 3:52:59 AM US/Eastern
To: <myAddress@me.com>
Subject: Returned mail: delivery problems encountered

A message (from <myAddress@me.com>) was received at 28 Feb 2006 8:52:55 +0000.

The following addresses had delivery problems:

<Edit.Piaf@ChanteParis.net>
Permanent Failure: 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
Delivery last attempted at Tue, 28 Feb 2006 08:52:58 -0000
Reporting-MTA: dns; comcast.net
Arrival-Date: 28 Feb 2006 8:52:55 +0000

Final-Recipient: rfc822; <Edit.Piaf@ChanteParis.net>
Action: failed
Status: 5.0.0 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
Diagnostic-Code: smtp; Permanent Failure: Other undefined Status
Last-Attempt-Date: Tue, 28 Feb 2006 08:52:58 -0000

From: Jake Sterling <myAddress@me.com>
Date: Tue Feb 28, 2006 3:53:11 AM US/Eastern
To: Edit.Piaf@ChanteParis.net
Subject: EmailTest2"

set tid to text item delimiters
set text item delimiters to "To: "
set badAddr to paragraph 1 of text item -1 of bad_mail
set text item delimiters to tid
badAddr --> "Edit.Piaf@ChanteParis.net"
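The precaution mentioned before the script, prefixing "To: " with the line ending, might be sketched like this. The handler name lastRecipientOf and the sample text are made up for illustration, and the caller has to know which line ending the source text uses.

```applescript
-- Hypothetical handler; lineEnd is whatever line ending the source text uses.
on lastRecipientOf(msgText, lineEnd)
	set tid to AppleScript's text item delimiters
	-- Match "To: " only at the start of a line, so that "Reply-To: " or a
	-- "To: " in the middle of a sentence is not treated as a header.
	set AppleScript's text item delimiters to lineEnd & "To: "
	set pieces to text items of msgText
	set AppleScript's text item delimiters to tid
	-- Only one piece means no line started with "To: ".
	if (count of pieces) < 2 then return missing value
	return paragraph 1 of item -1 of pieces
end lastRecipientOf

set sample to "From: daemon" & return & "To: me@example.com" & return & return & "Reply-To: nobody@example.com" & return & "To: Edit.Piaf@ChanteParis.net" & return & "Subject: EmailTest2"
lastRecipientOf(sample, return) --> "Edit.Piaf@ChanteParis.net"
```

One limitation of this sketch: a "To: " on the very first line of the text has no line ending in front of it and is not matched. That does not matter here, since only the last occurrence is used.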

Alternatively:

set bad_mail to "From: Mail Delivery Subsystem <mailer-daemon@comcast.net>
Date: Tue Feb 28, 2006 3:52:59 AM US/Eastern
To: <myAddress@me.com>
Subject: Returned mail: delivery problems encountered

A message (from <myAddress@me.com>) was received at 28 Feb 2006 8:52:55 +0000.

The following addresses had delivery problems:

<Edit.Piaf@ChanteParis.net>
Permanent Failure: 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
Delivery last attempted at Tue, 28 Feb 2006 08:52:58 -0000
Reporting-MTA: dns; comcast.net
Arrival-Date: 28 Feb 2006 8:52:55 +0000

Final-Recipient: rfc822; <Edit.Piaf@ChanteParis.net>
Action: failed
Status: 5.0.0 550_5.1.1_<Edit.Piaf@ChanteParis.net>..._User_unknown
Diagnostic-Code: smtp; Permanent Failure: Other undefined Status
Last-Attempt-Date: Tue, 28 Feb 2006 08:52:58 -0000

From: Jake Sterling <myAddress@me.com>
Date: Tue Feb 28, 2006 3:53:11 AM US/Eastern
To: Edit.Piaf@ChanteParis.net
Subject: EmailTest2"

repeat 2 times
	text ((offset of "To: " in result) + 4) thru end of result
end repeat
paragraph 1 of result --> "Edit.Piaf@ChanteParis.net"

Thanks everyone!

I will have to go over these and see what’s what. I can already tell that the first solutions won’t work because, alas, not all the returned emails follow the same format – so you can’t just identify “paragraph 3”. That’s why I wanted to find the second iteration of "To: "

But it seems that the last two may address this problem.

–Jake

:smiley: Thanks all for your help. Here is the solution I have worked out from your suggestions. I should preface this by saying that I start by opening all the bad email “mailer-daemon” responses in TextEdit before running the script.

(Now all I have to do is figure out how to get AppleScript to actually open the files! Well, that’s for another day.)

Cheers

Jake Sterling

P.S. I understand the importance of resetting the “text item delimiters”, but what do you reset them to?


--set a variable as a counter.
set docNum to 1
set badAddrList to "Bad Addresses"
set text item delimiters to "To: "

tell application "TextEdit"
	activate
	
	--Set a variable to count the number of open TextEdit documents
	set docCount to count documents
	
	repeat docCount times
		
		--get the text of the next open bad email document
		set badE to text of document docNum
		
		
		
		--this extracts the text of the first paragraph of the third text item, 
		--(the text item after the second iteration of "To: ")
		--which should be the bad email address
		
		set badAddr to paragraph 1 of text item 3 of badE
		set badAddrList to badAddrList & return & badAddr
		
		
		--reset docNum to the next higher number
		set docNum to docNum + 1
		
	end repeat
	
end tell

set text item delimiters to ":"
--I end up with a return delimited list of email addresses that has "Bad Addresses" as the first item
--I can import this list into FileMaker Pro as the basis for eliminating bad email addresses.
badAddrList
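On the open question of getting AppleScript to open the files itself: one possibility is a droplet that reads each dropped file directly, with no TextEdit involved. This is only a sketch under assumptions. The handler name secondToAddress is made up, the files are assumed to be plain text (other encodings may need an "as" clause on the read command), and putting the result on the clipboard is just one convenient choice.

```applescript
-- Hypothetical droplet: save as an application, then drop the saved
-- mailer-daemon text files onto its icon.
on open droppedFiles
	set badAddrList to {}
	repeat with aFile in droppedFiles
		-- Read the file directly instead of opening it in TextEdit.
		set msgText to read (contents of aFile)
		set addr to secondToAddress(msgText)
		-- Skip messages that do not contain a second "To: ".
		if addr is not missing value then set end of badAddrList to addr
	end repeat
	
	-- Join the addresses one per line and put them on the clipboard.
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to return
	set listText to badAddrList as text
	set AppleScript's text item delimiters to tid
	set the clipboard to listText
end open

-- Returns the first line after the second "To: ", or missing value
-- if the message has fewer than two occurrences.
on secondToAddress(msgText)
	set tid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "To: "
	set pieces to text items of msgText
	set AppleScript's text item delimiters to tid
	if (count of pieces) < 3 then return missing value
	return paragraph 1 of item 3 of pieces
end secondToAddress
```

The count check matters because asking for text item 3 of a message with fewer than two "To: " strings raises an error, which would otherwise stop the whole batch at the first odd notice.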

You should save the current delimiters first and restore them when you’re done; that way, you don’t have to worry about what to reset them to.


set badAddrList to "Bad Addresses"
set ASTID to AppleScript's text item delimiters
set text item delimiters to {"To: "}
-- The rest of your script.
set AppleScript's text item delimiters to ASTID

But FYI, AppleScript’s text item delimiters default to "" (the empty string), so if you don’t set them, the text items are the individual characters.

set AppleScript's text item delimiters to ""
set someText to "Twas brillig and the slithy toves
did gyre and gimbel in the wabe" -- has a return in it.
set firstItems to text items of someText
set text item delimiters to space -- text items are now the chunks between spaces
set secondItems to text items of someText
set text item delimiters to "" -- back to normal.
firstItems & return & return & secondItems
(*
{"T", "w", "a", "s", " ", "b", "r", "i", "l", "l", "i", "g", " ", "a", "n", "d", " ", "t", "h", "e", " ", "s", "l", "i", "t", "h", "y", " ", "t", "o", "v", "e", "s", "
", "d", "i", "d", " ", "g", "y", "r", "e", " ", "a", "n", "d", " ", "g", "i", "m", "b", "e", "l", " ", "i", "n", " ", "t", "h", "e", " ", "w", "a", "b", "e", "
", "
", "Twas", "brillig", "and", "the", "slithy", "toves
did", "gyre", "and", "gimbel", "in", "the", "wabe"}
*)
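The save-and-restore pattern can also be wrapped in a small handler so the delimiters get put back even if something goes wrong in between. The name splitText is my own invention here, not a built-in command.

```applescript
-- Hypothetical helper: split someText on delim without leaving
-- AppleScript's text item delimiters changed afterwards.
on splitText(someText, delim)
	set ASTID to AppleScript's text item delimiters
	try
		set AppleScript's text item delimiters to delim
		set theItems to text items of someText
	on error errMsg number errNum
		-- Restore the saved delimiters before passing the error along.
		set AppleScript's text item delimiters to ASTID
		error errMsg number errNum
	end try
	set AppleScript's text item delimiters to ASTID
	return theItems
end splitText

splitText("Twas brillig and the slithy toves", space)
--> {"Twas", "brillig", "and", "the", "slithy", "toves"}
```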