I’m in the middle of a project that involves searching for a string in a large text file; depending on what comes next, the script either manipulates the text or leaves it alone. I’ve made several attempts, and this is the fastest I can come up with, yet it’s still dog slow. Caveats: no extra scripting additions; TextEdit is the text processor; I’m still on Tiger.
Any thoughts?
(The text file is a GEDCOM, an ASCII representation of genealogical data. The script searches for the tag "1 NAME " and determines whether there’s a middle name. If so, it deletes the middle name from the NAME line and creates a new SPFX line for the middle name.)
Questions:
Is there an alternative way to go about this?
Is there no way to access TextEdit’s Find command through AppleScript?
What are the memory limitations in working with big text variables and TextEdit? How would I test available memory to prevent an error? GEDCOMs can easily grow to multimegabytes.
HyperTalk used to have an “exit repeat” statement so you could gracefully exit from a repeat loop. Nothing like that exists in AppleScript, correct?
set theStartTime to the current date
tell application "TextEdit"
set InputBlob to the text of document 1
end tell
set outputBlob to ""
set n to 1
repeat until n = 100 -- 100 is for testing. On my Mac, about 1 record per second
set theStart to the offset of "1 NAME " in InputBlob
if theStart = 0 then -- exit strategy
tell application "TextEdit"
make new document with properties {text:outputBlob & return}
set last paragraph of document 1 to InputBlob & return
set theEndTime to the current date
set last paragraph of document 1 to "Time: " & ((theEndTime - theStartTime) as string)
set n to 100
end tell
else
set outputBlob to outputBlob & (text 1 thru (theStart - 1)) of InputBlob
-- output gets bigger
set InputBlob to text (theStart) thru -1 of InputBlob
-- input gets smaller
set theReturn to the offset of return in InputBlob
-- break off 1 NAME line
set nameGraf to text 1 thru theReturn of InputBlob
set findSlash to (offset of "/" in nameGraf) -- last names are set off by slashes
if findSlash = 0 then -- there's no last name
set outputBlob to outputBlob & nameGraf
else -- findSlash <> 0, therefore there's a last name.
set nameString to (text 8 thru (findSlash - 1) of nameGraf)
set nameLen to length of nameString
-- now, is there a middle name?
set findSpace to (offset of " " in nameString)
-- finds first space, everything between here and end is "middle name"
if findSpace > 0 then
if findSpace < nameLen then
set lastName to (text findSlash thru -1 of nameGraf)
set middleName to (text (findSpace + 1) thru nameLen of nameString)
set firstName to (text 1 thru findSpace of nameString) -- includes the trailing space
set outputBlob to outputBlob & "1 NAME " & firstName & lastName & "2 SPFX " & middleName & return -- lastName already ends with a return
else
-- there's a space at the end of the string, but no middle name
set outputBlob to outputBlob & nameGraf
end if
else
-- no middle name
set outputBlob to outputBlob & nameGraf
end if
end if
set InputBlob to text (theReturn + 1) thru -1 of InputBlob
set n to n + 1
end if
end repeat
An alternative is to use AppleScript’s text item delimiters.
No, but there are more powerful free text editors, such as TextWrangler, which is also very scriptable.
I don’t know the limits, but you can also read a plain text file sequentially (without TextEdit) and write the result back to disk.
Yes, exit repeat also exists in AppleScript.
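As a minimal sketch of the last point, a loop can be left early with exit repeat instead of forcing the counter to its end value. The sample text and the variable names here are made up for illustration and are not taken from the scripts in this thread.

```applescript
-- Minimal sketch: leave a loop early with "exit repeat".
-- sampleText is hypothetical data, used only for illustration.
set sampleText to "0 HEAD" & return & "1 NAME John Quincy /Adams/" & return
set nameCount to 0
repeat
	set theStart to offset of "1 NAME " in sampleText
	if theStart = 0 then exit repeat -- no more tags, so stop looping
	set nameCount to nameCount + 1
	-- drop everything up to and including the tag just found (the tag is 7 characters long)
	set sampleText to text (theStart + 7) thru -1 of sampleText
end repeat
nameCount -- number of "1 NAME " tags that were found
```

The sketch assumes some text always follows each tag; a tag at the very end of the text would need an extra length check before the text range is taken.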
Here is a different approach with TextEdit using text item delimiters. I took the example from http://en.wikipedia.org/wiki/GEDCOM.
It should be much faster.
tell application "TextEdit" to set theText to the text of document 1
set {TID, text item delimiters} to {text item delimiters, "1 NAME "}
set nameText to text items of theText
repeat with oneName from 2 to count nameText
set text item delimiters to "/"
set {a1, a2} to {text item 1 of item oneName of nameText, text items 2 thru -1 of item oneName of nameText as text}
set text item delimiters to space
set N to text items of a1
if (count N) > 2 then
set {p1, p2} to {paragraph 1 of a2, paragraphs 2 thru -1 of a2}
set x to {item 1 of N, items 3 thru -1 of N as text}
set a1 to x as text
set text item delimiters to "/"
set x to {a1, p1}
set p1 to (x as text) & return & "2 SPFX " & item 2 of N
set text item delimiters to return
set x to {p1, p2}
set A to x as text
set item oneName of nameText to A
end if
end repeat
set text item delimiters to "1 NAME "
set nameText to nameText as text
set text item delimiters to TID
tell application "TextEdit"
make new document with properties {text:nameText & return}
end tell
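For readers unfamiliar with the technique, the script above relies on two operations: text items of splits a string at the current delimiter, and coercing a list back to text joins it with the current delimiter. A small sketch on one made-up name line (the variable names are hypothetical):

```applescript
-- Sketch: split on a delimiter, then join again with the same delimiter.
set sampleLine to "John Quincy /Adams/" -- hypothetical example data
set savedTID to text item delimiters -- remember the current setting
set text item delimiters to "/"
set parts to text items of sampleLine -- {"John Quincy ", "Adams", ""}
set rejoined to parts as text -- joined with "/" again: "John Quincy /Adams/"
set text item delimiters to savedTID -- restore the previous delimiters
{parts, rejoined}
```

Changing the delimiter between the split and the join is what makes this useful for replacing or inserting text, and because the work happens on whole lists it avoids repeatedly copying a large string inside a loop.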
I wrote an example that reads a huge text file in small portions, processes the text, and writes the result into a new file.
Memory limitations don’t matter at all and no text editor is required.
The script expects a plain text file (.txt) named GEDCOM.txt on the desktop.
property amount : 10000 -- bytes per loop to read
set GEDCOMfile to ((path to desktop as Unicode text) & "GEDCOM.txt")
set newGEDCOMfile to ((path to desktop as Unicode text) & "newGEDCOM.txt")
set t_in to open for access file GEDCOMfile
set t_out to open for access file newGEDCOMfile with write permission
tell (get eof t_in) to set {div_in, mod_in} to {it div amount, it mod amount}
set borrow to ""
set {TID, text item delimiters} to {text item delimiters, return}
repeat div_in times
tell (read t_in for amount) to set {inputText, borrow} to {borrow & (paragraphs 1 thru -2 as text), paragraph -1}
write processText(inputText) to t_out starting at eof
end repeat
if mod_in is not 0 then write processText(borrow & (read t_in)) to t_out starting at eof
set text item delimiters to TID
close access t_in
close access t_out
on processText(theText)
set {TID, text item delimiters} to {text item delimiters, "1 NAME "}
set nameText to text items of theText
repeat with oneName from 2 to count nameText
set text item delimiters to "/"
set {a1, a2} to {text item 1 of item oneName of nameText, text items 2 thru -1 of item oneName of nameText as text}
set text item delimiters to space
set N to text items of a1
if (count N) > 2 then
try
set {p1, p2} to {paragraph 1 of a2, paragraphs 2 thru -1 of a2}
on error
set {p1, p2} to {paragraph 1 of a2, ""}
end try
set x to {item 1 of N, items 3 thru -1 of N as text}
set a1 to x as text
set text item delimiters to "/"
set x to {a1, p1}
set p1 to (x as text) & return & "2 SPFX " & item 2 of N
set text item delimiters to return
set x to {p1, p2}
set A to x as text
set item oneName of nameText to A
end if
end repeat
set text item delimiters to "1 NAME "
set nameText to nameText as text
set text item delimiters to TID
if nameText ends with return then
return nameText
else
return nameText & return
end if
end processText
WHOA! I can’t believe how much faster this is. My original script went for 45 minutes and it was barely half done. This completed an 850K GEDCOM in 67 seconds. Quite impressive. Thank you very much.
I’m not entirely sure what some of your coding did. It burped on the one entry that didn’t have a surname. Before I realized exactly why the code burped, I had rewritten much of the middle of your example, keeping the basic idea of using delimiters. Here’s my finished product:
set theStartTime to the current date
set outputBlob to ""
tell application "TextEdit" to set AllofIt to the text of document 1
set theStart to (offset of "1 NAME " in AllofIt)
set theFirstBlob to (text 1 thru (theStart - 1) of AllofIt) & "1 NAME "
set theText to text (theStart - 1) thru -1 of AllofIt
set {TID, text item delimiters} to {text item delimiters, "1 NAME "}
set nameText to text items of theText
repeat with oneName from 2 to count nameText
set text item delimiters to "/"
if (count of text items of (item oneName of nameText)) > 1 then
set {a1, a2} to {text item 1 of item oneName of nameText, text items 2 thru -1 of item oneName of nameText as text}
set text item delimiters to space
set N to text items of a1
if ((count N) ≥ 2) and (length of item 2 of N) > 0 then
set p1 to paragraph 1 of a2
set lenp1 to length of p1
set p2 to text (lenp1 + 2) thru -1 of a2
set outputBlob to (item 1 of N) & " /" & p1 & return & "2 SPFX " & (items 2 thru -1 of N) & return & p2
else
set outputBlob to (item oneName of nameText)
end if
else
set outputBlob to (item oneName of nameText)
end if
set theFirstBlob to theFirstBlob & outputBlob & "1 NAME "
end repeat
set theFirstBlob to text 1 thru -8 of theFirstBlob -- drop the extra "1 NAME " appended after the last record
set text item delimiters to TID
tell application "TextEdit"
make new document with properties {text:theFirstBlob & return}
set theEndTime to the current date
set the last paragraph of document 1 to "Time: " & (theEndTime - theStartTime) as string
end tell