Working with large text files

dougtallman · May 31, 2008, 4:11pm

I’m in the middle of a project that involves searching for a string in a large text file; depending on what comes next, the script either manipulates the text or leaves it alone. I’ve made several attempts, and this is the fastest I can come up with, but it’s still dog slow. Caveats: no extra scripting additions; TextEdit is the text processor; I’m still on Tiger.

Any thoughts?

(The text file is a GEDCOM, an ASCII representation of genealogical data. The script searches for the tag "1 NAME " and determines whether there’s a middle name. If so, it deletes the middle name from the NAME line and creates a new SPFX line for the middle name.)

Questions:

1. Is there an alternative way to go about this?
2. Is there no way to access TextEdit’s Find command through AppleScript?
3. What are the memory limitations in working with big text variables and TextEdit? How would I test available memory to prevent an error? GEDCOMs can easily grow to several megabytes.
4. HyperTalk used to have an “exit repeat” statement so you could gracefully exit from a repeat loop. Nothing like that exists in AppleScript, correct?

set theStartTime to the current date

tell application "TextEdit"
	set InputBlob to the text of document 1
end tell

set outputBlob to ""

set n to 1

repeat until n = 100 -- 100 is for testing. On my Mac, about 1 record per second
	
	set theStart to the offset of "1 NAME " in InputBlob
	
	if theStart = 0 then -- exit strategy
		tell application "TextEdit"
			make new document with properties {text:outputBlob & return}
			set last paragraph of document 1 to InputBlob & return
			set theEndTime to the current date
			set last paragraph of document 1 to "Time: " & ((theEndTime - theStartTime) as string)
			set n to 100
		end tell
	else
		
		set outputBlob to outputBlob & (text 1 thru (theStart - 1)) of InputBlob
		-- output gets bigger
		
		set InputBlob to text (theStart) thru -1 of InputBlob
		-- input gets smaller
		
		set theReturn to the offset of return in InputBlob
		-- break off 1 NAME line
		
		set nameGraf to text 1 thru theReturn of InputBlob
		
		set findSlash to (offset of "/" in nameGraf) -- last names are set off by slashes
		
		if findSlash = 0 then -- there's no last name
			
			set outputBlob to outputBlob & nameGraf
			
		else 	-- findSlash <> 0, therefore there's a last name.
			
			set nameString to (text 8 thru (findSlash - 1) of nameGraf)
			set nameLen to length of nameString
			
			-- now, is there a middle name?
			
			set findSpace to (offset of " " in nameString)
			-- finds first space, everything between here and end is "middle name"
			
			if findSpace > 0 then
				if findSpace < nameLen then
					set lastName to (text findSlash thru -1 of nameGraf)
					set middleName to (text (findSpace + 1) thru nameLen of nameString)
					set firstName to (text 1 thru (findSpace - 1) of nameString) -- stop before the space so the output has a single space
					set outputBlob to outputBlob & "1 NAME " & firstName & " " & lastName & "2 SPFX " & middleName & return
					
				else
					-- there's a space at the end of the string, but no middle name
					set outputBlob to outputBlob & nameGraf		
				end if
				
			else
				-- no middle name
				set outputBlob to outputBlob & nameGraf
			end if
		end if
		
		set InputBlob to text (theReturn + 1) thru -1 of InputBlob
		
		set n to n + 1
	end if
	
end repeat

StefanK · May 31, 2008, 6:08pm

Hi dougtallman,

Answers:

1. An alternative is to use AppleScript’s text item delimiters.
2. No, but there are more powerful free text editors, such as TextWrangler, that are also very scriptable.
3. I don’t know the memory limits, but you can also read a plain text file sequentially (without TextEdit) and write the result back to disk.
4. exit repeat also exists in AppleScript.
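
To illustrate the last answer, here is a minimal sketch of exit repeat. The sample lines and the variable names (theLines, foundLine) are made up for illustration and are not part of either script in this thread:

```applescript
-- Hypothetical sample data: three GEDCOM-style lines.
set theLines to {"0 @I1@ INDI", "1 NAME John /Smith/", "1 SEX M"}
set foundLine to ""

repeat with i from 1 to count theLines
	set oneLine to item i of theLines
	if oneLine begins with "1 NAME " then
		set foundLine to oneLine
		exit repeat -- leave the loop immediately; "1 SEX M" is never examined
	end if
end repeat

foundLine -- the first NAME line found, or "" if there was none
```

With exit repeat available, a loop does not need a counter that is forced to its end value just to stop.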

Here is a different approach with TextEdit using text item delimiters. I took the example from http://en.wikipedia.org/wiki/GEDCOM.
It should be much faster.


tell application "TextEdit" to set theText to the text of document 1

set {TID, text item delimiters} to {text item delimiters, "1 NAME "}
set nameText to text items of theText

repeat with oneName from 2 to count nameText
	set text item delimiters to "/"
	set {a1, a2} to {text item 1 of item oneName of nameText, text items 2 thru -1 of item oneName of nameText as text}
	set text item delimiters to space
	set N to text items of a1
	if (count N) > 2 then
		set {p1, p2} to {paragraph 1 of a2, paragraphs 2 thru -1 of a2}
		set x to {item 1 of N, items 3 thru -1 of N as text}
		set a1 to x as text
		set text item delimiters to "/"
		set x to {a1, p1}
		set p1 to (x as text) & return & "2 SPFX " & item 2 of N
		set text item delimiters to return
		set x to {p1, p2}
		set A to x as text
		set item oneName of nameText to A
	end if
end repeat
set text item delimiters to "1 NAME "
set nameText to nameText as text
set text item delimiters to TID

tell application "TextEdit"
	make new document with properties {text:nameText & return}
end tell
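
The (count N) > 2 test in the script above depends on how text item delimiters split the given-name part of a record. A minimal sketch with a made-up record follows; the names oneRecord, givenNames, nameParts and savedDelimiters are illustrative and do not appear in the script:

```applescript
-- Hypothetical record: the text that follows "1 NAME " on a NAME line.
set oneRecord to "John Quincy /Adams/"

set savedDelimiters to text item delimiters

-- Everything before the first slash is the given-name part.
set text item delimiters to "/"
set givenNames to text item 1 of oneRecord -- "John Quincy " (note the trailing space)

-- Splitting on spaces leaves an empty last item because of that trailing space.
set text item delimiters to space
set nameParts to text items of givenNames -- {"John", "Quincy", ""}

set text item delimiters to savedDelimiters

count nameParts -- 3 here; "John /Adams/" would give 2, so no middle name is assumed
```

Because the trailing space always contributes one empty item, a count greater than 2 is what signals at least one middle name.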

StefanK · May 31, 2008, 8:17pm

PS:

I wrote an example that reads a huge text file in small portions, processes the text and writes the result into a new file.
Memory limitations don’t matter at all and no text editor is required.
The script expects a plain text file (.txt) named GEDCOM.txt on the desktop.


property amount : 10000 -- bytes per loop to read

set GEDCOMfile to ((path to desktop as Unicode text) & "GEDCOM.txt")
set newGEDCOMfile to ((path to desktop as Unicode text) & "newGEDCOM.txt")

set t_in to open for access file GEDCOMfile
set t_out to open for access file newGEDCOMfile with write permission

tell (get eof t_in) to set {div_in, mod_in} to {it div amount, it mod amount}
set borrow to ""
set {TID, text item delimiters} to {text item delimiters, return}
repeat div_in times
	tell (read t_in for amount) to set {inputText, borrow} to {borrow & (paragraphs 1 thru -2 as text), paragraph -1}
	write processText(inputText) to t_out starting at eof
end repeat
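-- Read whatever remains after the full-size chunks, then flush the carried-over text.
-- This also writes the last partial line when the file size is an exact multiple of amount.
if mod_in is not 0 then set borrow to borrow & (read t_in)
if borrow is not "" then write processText(borrow) to t_out starting at eof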
set text item delimiters to TID
close access t_in
close access t_out

on processText(theText)
	set {TID, text item delimiters} to {text item delimiters, "1 NAME "}
	set nameText to text items of theText
	
	repeat with oneName from 2 to count nameText
		set text item delimiters to "/"
		set {a1, a2} to {text item 1 of item oneName of nameText, text items 2 thru -1 of item oneName of nameText as text}
		set text item delimiters to space
		set N to text items of a1
		if (count N) > 2 then
			try
				set {p1, p2} to {paragraph 1 of a2, paragraphs 2 thru -1 of a2}
			on error
				set {p1, p2} to {paragraph 1 of a2, ""}
			end try
			set x to {item 1 of N, items 3 thru -1 of N as text}
			set a1 to x as text
			set text item delimiters to "/"
			set x to {a1, p1}
			set p1 to (x as text) & return & "2 SPFX " & item 2 of N
			set text item delimiters to return
			set x to {p1, p2}
			set A to x as text
			set item oneName of nameText to A
		end if
	end repeat
	set text item delimiters to "1 NAME "
	set nameText to nameText as text
	set text item delimiters to TID
	if nameText ends with return then
		return nameText
	else
		return nameText & return
	end if
end processText
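
The read loop above avoids cutting a line in half by holding back the last, possibly incomplete, paragraph of each chunk and prepending it to the next one. A minimal sketch of that step on an in-memory string follows; the sample chunk and the names chunk, completeLines and savedDelimiters are made up for illustration:

```applescript
-- Hypothetical chunk that ends in the middle of a line ("1 BIR" is incomplete).
set chunk to "1 NAME John /Smith/" & return & "1 SEX M" & return & "1 BIR"

set savedDelimiters to text item delimiters
set text item delimiters to return

-- All complete lines, rejoined with returns.
set completeLines to (paragraphs 1 thru -2 of chunk) as text
-- The unfinished tail, carried over to the next chunk.
set borrow to paragraph -1 of chunk

set text item delimiters to savedDelimiters

{completeLines, borrow}
```

The rejoined text has no trailing return, which is why processText appends one when its result does not already end with return.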

dougtallman · May 31, 2008, 8:44pm

WHOA! I can’t believe how much faster this is. My original script went for 45 minutes and it was barely half done. This completed an 850K GEDCOM in 67 seconds. Quite impressive. Thank you very much.

I’m not entirely sure what some of your coding did. It burped on the one entry that didn’t have a surname. Before I realized exactly why the code burped, I had rewritten much of the middle of your example, keeping the basic idea of using delimiters. Here’s my finished product:

set theStartTime to the current date

set outputBlob to ""
tell application "TextEdit" to set AllofIt to the text of document 1


set theStart to (offset of "1 NAME " in AllofIt)

set theFirstBlob to (text 1 thru (theStart - 1) of AllofIt) & "1 NAME "

set theText to text (theStart - 1) thru -1 of AllofIt

set {TID, text item delimiters} to {text item delimiters, "1 NAME "}
set nameText to text items of theText

repeat with oneName from 2 to count nameText
	set text item delimiters to "/"
	
	if (count of text items of (item oneName of nameText)) > 1 then
		
		set {a1, a2} to {text item 1 of item oneName of nameText, text items 2 thru -1 of item oneName of nameText as text}
		set text item delimiters to space
		set N to text items of a1
		if ((count N) ≥ 2) and (length of item 2 of N) > 0 then
			set p1 to paragraph 1 of a2
			
			set lenp1 to length of p1
			set p2 to text (lenp1 + 2) thru -1 of a2
			
			set outputBlob to (item 1 of N) & " /" & p1 & return & "2 SPFX " & (items 2 thru -1 of N) & return & p2
		else
			set outputBlob to (item oneName of nameText)
		end if
	else
		set outputBlob to (item oneName of nameText)
	end if
	set theFirstBlob to theFirstBlob & outputBlob & "1 NAME "
end repeat
set text item delimiters to "1 NAME "
set text item delimiters to TID

tell application "TextEdit"
	make new document with properties {text:theFirstBlob & return}
	set theEndTime to the current date
	set the last paragraph of document 1 to "Time: " & (theEndTime - theStartTime) as string
end tell

StefanK · May 31, 2008, 9:46pm

On a dual 2.5 GHz G5, my second script took 33 seconds for a 2 MB file.