Script to edit text files

Hello,

I’ll admit upfront that I am a newbie to AppleScript, so this may seem basic to some of you. My issue is this: I am in charge of an account which sends me orders for business cards on a daily basis. These orders come to me via the Mail program. I have attached a sample of one of these emails. I would like a script that strips out the extra information, such as DB_First_Name:, and leaves me with just the basic information (name, tel., etc.) saved as a text file, which could then be flowed into a template set up in InDesign CS3. Right now I can save the email to a folder on my desktop as rich text; at that point I would like to run the script to edit the text file. I had a script which did this in OS 9 using Outlook Express, but since upgrading to OS X and using Mail I have not been able to revise the script to make it work. Any assistance would be much appreciated. Let me know if more information is needed. Thanks for reading.

Emery


FONT_SIZE: REGULAR
LOGO_SELECTED: Prov Phys Division
DB_First_Name: Kiley
DB_Middle_Name:
DB_Last_Name: Hoffman
DB_Title1: Training Coordinator
DB_Title2:
DB_Title3:
DB_PHONE1: Tel
DB_Phone1: 503.574.9777
DB_Ext1:
DB_PHONE2: Fax
DB_Phone2: 503.574.9860
DB_Ext2:
DB_PHONE3: None
DB_Phone3:
DB_Ext3:
DB_PHONE4: None
DB_Phone4:
DB_Ext4:
No_Email: Email
DB_Email: kiley.hoffman@providence.org
DB_Address1: 3601 SW Murray Blvd Ste. 45
DB_Address2:
DB_Address3:
City: Beaverton
State: OR
Zip_Code: 97005
R1: Special_Backer
D1: 23
NUMBER_OF_CARDS: 250 $23.67
PROOF_CONTACT_NAME: andrea zottola
PROOF_CONTACT_PHONE: 574.9836
PROOF_CONTACT_EMAIL: andrea.zottola@providence.org
ShiptoAddress1: 3601 sw murray blvd ste 45
ShiptoAddress2:
ShiptoAddress3:
Shipto_City: beaverton
Shipto_State: OR
Shipto_Zip_Postal_Code: 97005
Account_Unit: 61087000

Special_Instructions:

Model: 2x266 Ghz Intel
Browser: Firefox 2.0.0.3
Operating System: Mac OS X (10.4)

Given that you’ve saved as RTF rather than plain text, TextEdit is required to read the file (otherwise we could read it directly into an AppleScript).

What you need is something like this to get you started:

tell application "TextEdit"
	set F to open alias "ACB-G5_1:Users:bellac:Desktop:DBStuff.rtf"
	set P to paragraphs of document 1
end tell
considering case
	repeat with aP in P
		if aP begins with "DB_Last_Name:" then
			set N to rest of words of contents of aP
		else if aP begins with "DB_Phone1:" then
			set T to rest of words of contents of aP
		end if
	end repeat
end considering
set Out to N & return & T as Unicode text
set NF to open for access ((path to desktop as text) & "DBOut.txt") with write permission
try
	set eof of NF to 0
	write Out to NF as Unicode text
	close access NF
on error
	close access NF
end try

Thanks for the response! When I try to run this script, I get an error saying the document doesn’t exist. I have modified the script you sent to reflect the path to the folder on my desktop, but it doesn’t see the file. Any ideas?

Emery

I can save the message as plain text instead of rich text if that will simplify things.

Emery

Is it really an RTF file? Can you open it with TextEdit, for example?

And yes, if you don’t need any text formatting then it’s easier. For example:

set F to read (choose file)
set P to paragraphs of F
considering case
	repeat with aP in P
		if aP begins with "DB_Last_Name: " then
			set N to word -1 of contents of aP
		else if aP begins with "DB_Phone1: " then
			set T to word -1 of contents of aP
		else if aP begins with "Account_Unit: " then
			set Acct to word -1 of contents of aP
		end if
	end repeat
end considering
set Out to N & return & T & return & Acct as Unicode text
set NF to open for access ((path to desktop as text) & "DBOut.txt") with write permission
try
	set eof of NF to 0
	write Out to NF as Unicode text
	close access NF
on error
	close access NF
end try

Hi.

I’ve interpreted the problem as being that you want to lose the labels and the separating white space and just keep the values, or empty lines where there are no values.

This works for me with both plain text and RTF source files:

set sourceFile to (choose file)
set sourceFileName to name of (info for sourceFile)

if (sourceFileName ends with ".txt") then
	set theParas to paragraphs of (read sourceFile)
else if (sourceFileName ends with ".rtf") then
	tell application "System Events" to set TEWasOpen to (application process "TextEdit" exists)
	
	tell application "TextEdit"
		open sourceFile
		set theParas to paragraphs of (text of front document as string)
		if (TEWasOpen) then
			close front document
		else
			quit
		end if
	end tell
else
	error "The file must have a suitable \".txt\" or \".rtf\" name extension."
end if

set editedParas to {}
set whitespace to space & tab & (ASCII character 202)
repeat with thisPara in theParas
	set paraLen to thisPara's length
	if (paraLen is 0) then
		set end of editedParas to ""
	else
		set afterWhitespace to (character 1 of thisPara is in whitespace)
		repeat with i from 2 to paraLen
			if (character i of thisPara is in whitespace) then
				set afterWhitespace to true
			else if (afterWhitespace) then
				set end of editedParas to (text i thru paraLen of thisPara)
				exit repeat
			end if
		end repeat
		if ((i is paraLen) and (character i of thisPara is in whitespace)) or (not afterWhitespace) then set end of editedParas to ""
	end if
end repeat

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to ASCII character 10
set editedText to editedParas as string
set AppleScript's text item delimiters to astid

set destinationPath to sourceFile as Unicode text
set destinationPath to text 1 thru -5 of destinationPath & " (edited).txt"

set fref to (open for access file destinationPath with write permission)
try
	set eof fref to 0
	write editedText as string to fref
end try
close access fref

Nice way to do it. If the OP then wants only some of those paragraphs, it’s easy to set up an output by selecting from the list by number. Excellent.
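
Selecting from the list by number might look like the sketch below. The sample list stands in for the first few items of editedParas, and the positions in wantedItems are hypothetical: they assume the file starts at the FONT_SIZE line, so check them against a real order before relying on them.

```applescript
-- Stand-in for the first few items of the editedParas list built by the script above
set editedParas to {"REGULAR", "Prov Phys Division", "Kiley", "", "Hoffman", "Training Coordinator"}

-- Hypothetical positions of the wanted values: first name, last name, title
set wantedItems to {3, 5, 6}

set selectedParas to {}
repeat with n in wantedItems
	-- n is a reference to a list item, so take its contents to get the number
	set end of selectedParas to item (contents of n) of editedParas
end repeat

-- Join the selected values with linefeeds, as the main script does
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to ASCII character 10
set selectedText to selectedParas as string
set AppleScript's text item delimiters to astid

selectedText
```

The resulting selectedText can be written to the destination file in place of editedText.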

Really appreciate the effort here folks. The error I get now when I try to run this script is Can’t get name of “Document 1.rtf”. I believe I have the path named correctly, what am I doing wrong? Here’s the script as I have modified it for my machine:

Thanks again!

Emery

set sourceFile to "Mac HD:Desktop:Daily Prov:Document 1.rtf"
set sourceFileName to name of "Document 1.rtf"

if (sourceFileName ends with ".txt") then
	set theParas to paragraphs of (read sourceFile)
else if (sourceFileName ends with ".rtf") then
	tell application "System Events" to set TEWasOpen to (application process "TextEdit" exists)
	
	tell application "TextEdit"
		open sourceFile
		set theParas to paragraphs of (text of front document as string)
		if (TEWasOpen) then
			close front document
		else
			quit
		end if
	end tell
else
	error "The file must have a suitable \".txt\" or \".rtf\" name extension."
end if

set editedParas to {}
set whitespace to space & tab & (ASCII character 202)
repeat with thisPara in theParas
	set paraLen to thisPara's length
	if (paraLen is 0) then
		set end of editedParas to ""
	else
		set afterWhitespace to (character 1 of thisPara is in whitespace)
		repeat with i from 2 to paraLen
			if (character i of thisPara is in whitespace) then
				set afterWhitespace to true
			else if (afterWhitespace) then
				set end of editedParas to (text i thru paraLen of thisPara)
				exit repeat
			end if
		end repeat
		if ((i is paraLen) and (character i of thisPara is in whitespace)) or (not afterWhitespace) then set end of editedParas to ""
	end if
end repeat

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to ASCII character 10
set editedText to editedParas as string
set AppleScript's text item delimiters to astid

set destinationPath to sourceFile as Unicode text
set destinationPath to text 1 thru -5 of destinationPath & " (edited).txt"

set fref to (open for access file destinationPath with write permission)
try
	set eof fref to 0
	write editedText as string to fref
end try
close access fref

Hi,

no, the path isn’t correct.
Either


set sourceFileName to "Document 1.rtf"
set sourceFile to "Mac HD:Users:myUser:Desktop:Daily Prov:" & sourceFileName

or


set sourceFileName to "Document 1.rtf"
set sourceFile to (path to desktop as Unicode text) & "Daily Prov:" & sourceFileName
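
The “Can’t get name of” error comes from asking a plain string for a name, which strings don’t have. If you would rather derive the file name from the full path than type it twice, a minimal sketch using text item delimiters follows. The handler name fileNameFromPath is made up for this example, and the path is the placeholder from above.

```applescript
-- fileNameFromPath is a hypothetical helper: it returns the part of an
-- HFS path string after the last colon. It works on text only and does
-- not check that the file exists.
on fileNameFromPath(hfsPath)
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to ":"
	set theName to last text item of hfsPath
	set AppleScript's text item delimiters to astid
	return theName
end fileNameFromPath

-- Placeholder path; substitute your own disk and user names
set sourceFile to "Mac HD:Users:myUser:Desktop:Daily Prov:Document 1.rtf"
set sourceFileName to fileNameFromPath(sourceFile)
```

Note that a folder path ending in a colon would give an empty name with this approach.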

That works! Awesome, thanks!

Now, for the next challenge: since I receive multiple orders, I can’t save them all as Document 1, because each new file overwrites the previous one. Can the script be set up so that the document name is a variable, such as Document X, and each file is processed one at a time? Then, since the emails come with a first name/last name designation, could the text file be saved under that name?

I can’t tell you how much I appreciate the help you’ve given so far, thanks again!

Emery

Here’s the progress I’ve made. I have found a script which exports the individual emails from a folder in Mail and saves them as separate files with different names in a folder on my desktop. From there, I use the basic script for shortening file names, so the document name is now “PHSbcform1.htm.rtf”. However, since there is more than one document in this folder, the export script has added “ (1)”, “ (2)”, etc., after “htm”. The original script in this thread which strips the extra info out of the text file works great for the file named “PHSbcform1.htm.rtf”. What I need to do now is alter this script so that it can process files with a variable in the name. Is this possible? Thanks again for the help, much appreciated.

Emery
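
One way to handle variable file names is to loop over every RTF file in the folder rather than naming a single file. This is a rough sketch, assuming the files live in the “Daily Prov” folder on the desktop mentioned earlier; the variable names are only illustrative, and the editing code itself is left out.

```applescript
-- Assumed folder: "Daily Prov" on the desktop, as used earlier in the thread
set sourceFolder to (path to desktop as Unicode text) & "Daily Prov:"

-- Ask System Events for the names of all files with an "rtf" extension
tell application "System Events"
	set rtfNames to name of every file of folder sourceFolder whose name extension is "rtf"
end tell

repeat with thisName in rtfNames
	-- Build the full path for this file, whatever its name is
	set sourceFile to sourceFolder & (contents of thisName)
	-- Run the editing code here, using sourceFile in place of the hard-coded path
	log sourceFile
end repeat
```

Because the editing script builds its destination path from the source path, each source file should end up with its own “(edited)” text file.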

Hi Emery,

the whole procedure to extract the emails seems to be a bit complicated.
I would read the data directly from the mails, then you could even attach the script to a mail rule,
then everything works automatically.

Edit:
Here is a different approach that extracts the values and strips off the whitespace with a shell command.
Select one or more mails in Mail.app and run the script. The text files will be created on the desktop, named with the sender and subject of the mail.

property CR : ASCII character 13

tell application "Mail" to set sel to selection
repeat with oneMail in sel
	tell application "Mail" to tell oneMail to set {theContent, theSubject, theSender} to {paragraphs of content, subject, extract name from sender}
	set theLines to {}
	set {TID, text item delimiters} to {text item delimiters, ":"}
	repeat with i in theContent
		if i contains "Special_Instructions" then exit repeat
		try
			set str to do shell script "echo " & quoted form of text item 2 of i & " | strings"
			if str begins with space then set str to text 2 thru -1 of str
			if str contains CR then
				set offs to offset of CR in str
				set str to text 1 thru (offs - 1) of str & tab & text (offs + 1) thru -1 of str
			end if
			set end of theLines to str
		end try
	end repeat
	set text item delimiters to ASCII character 10
	set editedText to theLines as string
	set text item delimiters to TID
	set destinationPath to ((path to desktop as Unicode text) & theSender & "_" & theSubject & ".txt")
	set fref to (open for access file destinationPath with write permission)
	try
		set eof fref to 0
		write editedText as string to fref
	end try
	close access fref
end repeat

Hi, Stefan.

Besides taking nearly three times as long as the vanilla version and deliberately not handling the “Special_Instructions:” line, your shell script method leaves out the “D1:” and “Shipto_State:” results when I try it. (The shell script returns “” for those lines.)

(Tested by replacing the paragraph-editing process in my script with that from yours and matching the variable names. Both versions tested on the same file, derived from Emery’s example in post #1.)

I guessed it would be quite slow, but I didn’t realize that the shell command works so unreliably.

They’re not unreliable, Stefan, but on my machine, for example (see sig), starting a new thread for a shell call takes nearly 50 ms. I try to avoid using them inside a loop for that reason because each cycle will bear that overhead. While not a rigorous test, this is what I used to determine that:

set ProcTime to "perl -e 'use Time::HiRes qw(time); print time'"
set rep to 100
repeat 10 times -- get the pumps primed
	do shell script ProcTime
end repeat

set proct to 0
set t1 to GetMilliSec -- GetMilliSec comes from a third-party scripting addition, not from AppleScript itself
repeat rep times
	set strt to do shell script ProcTime
	do shell script "echo ''"
	set proct to proct + (do shell script ProcTime) - strt
end repeat
set tot to ((GetMilliSec) - t1) / 1000
set shellCost to (tot - proct) / rep
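
For comparison, the label can be stripped from each line without any shell call, which avoids that per-cycle overhead. The following is a minimal sketch in plain AppleScript, assuming every line has the form “Label: value”. The handler name valueAfterLabel is made up for this example and is not part of any script in this thread.

```applescript
-- valueAfterLabel is a hypothetical helper: it returns the text after the
-- first colon of a line, with leading spaces and tabs removed.
on valueAfterLabel(aLine)
	set colonPos to offset of ":" in aLine
	-- No colon, or nothing after it: treat the value as empty
	if colonPos is 0 or colonPos is (length of aLine) then return ""
	set theValue to text (colonPos + 1) thru -1 of aLine
	-- Drop leading spaces and tabs one character at a time
	repeat while (theValue is not "") and (character 1 of theValue is in {space, tab})
		if (length of theValue) is 1 then
			set theValue to ""
		else
			set theValue to text 2 thru -1 of theValue
		end if
	end repeat
	return theValue
end valueAfterLabel

valueAfterLabel("DB_Phone1: 503.574.9777") -- "503.574.9777"
```

When calling it from a repeat loop, pass the contents of the loop variable, as the scripts above do. Short values such as “23” and “OR” are kept, because nothing here depends on the length of the value.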

Looking at that on my Jaguar machine this evening, I see that ‘strings’ only returns strings that have four or more printable characters, unless a lower number is specified as an option:

set str to do shell script "echo " & quoted form of text item 2 of i & " | strings -2" -- or: " | strings -1"

If it’s also true in Tiger, that could be why the values “23” and “OR” were omitted when I tried your method this morning! The other “OR” could have survived because it’s bounded in my file by a couple of spaces, but I can’t check that till I get back to my other machine.

In the Jaguar implementation of ‘strings’, spaces count towards strings, but tabs don’t. The use of the command in this context relies on the white space after the labels either not containing spaces (option “-1”) or not containing consecutive spaces (option “-2”).

That is the common default behavior for strings. (FYI, Tiger does have the same behavior.)
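
The effect of the minimum length can be checked with a small experiment like the one below. It is only a sketch: it assumes a ‘strings’ command is installed and reads standard input when no file is given, and the three sample values are taken from the order form in the first post.

```applescript
-- Shell command that prints three sample values, one per line
set sampleCommand to "printf '%s\\n' 23 OR Beaverton"

-- Default behavior: runs shorter than four printable characters are dropped
set defaultResult to do shell script sampleCommand & " | strings"

-- With a minimum length of 2, the two-character values should survive as well
set twoCharResult to do shell script sampleCommand & " | strings -2"

{defaultResult, twoCharResult}
```

Going by the behavior described above, the first result should lack “23” and “OR”, while the second should include all three values.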