Urgent help needed in optimizing AS file/text work

wsterdan · June 11, 2006, 5:22pm

I’m posting this here, but if it would be better posted in the AppleScript Studio forum, please advise.

For those joining the program in progress:

I’ve written an AppleScript program that parses FrameMaker MIF files; it worked perfectly (albeit slowly) as long as the MIFs were relatively small (e.g. ~100,000 lines of text).

Larger MIFs have started arriving (e.g. ~1,000,000 lines of text, roughly 20+ MB) and because my program worked by loading the whole file into memory before parsing the strings, AppleScript crashes with an out-of-memory error.

I wrote this small line-by-line copy program with an eye to rewriting and optimizing my marker-moving algorithms to work on small chunks of the MIF at a time:

-- A Quick (?) line-by-line text file copy program

-- Get and set the filenames
set CurrentDoc to (choose file)
set DestinationFolder to (choose folder with prompt "Choose destination folder for cleaned files:")
set finalFile to (DestinationFolder & "Duped_File" as string)

-- open 'em
open for access CurrentDoc
-- open for access finalFile with write permission
set the finalFile to the finalFile as text
set the open_target_file to open for access file finalFile with write permission

-- preserve TIDs
set tid to text item delimiters

-- zero the tempstring
set tempstring to ""

-- loop through line-by-line, reading once and writing twice
-- using a Mac carriage return as the delimiter
try
	repeat
		set tempstring to read CurrentDoc until return
		write tempstring to the open_target_file starting at eof
	end repeat
	close access CurrentDoc
	close access open_target_file
on error
	-- reading past the end of the file raises an error, which is what ends the loop
	close access CurrentDoc
	close access open_target_file
end try
set text item delimiters to tid

It works, but it takes roughly 30 minutes to do the 1,000,000-line file, and this is without actually doing any of the string-parsing.

I rewrote just this simple program (hard-coding the filenames) in C:

#include <stdio.h>

int main(void)
{
	FILE *f, *g;
	char s[80];

	f = fopen("infile", "r");
	g = fopen("outfile", "w");
	if (!f || !g)
		return 1;
	/* fgets reads up to the next newline or 79 characters, whichever comes first */
	while (fgets(s, 80, f) != NULL)
	{
		fputs(s, g);
		fputs(s, g);
	}
	fclose(f);
	fclose(g);
	return 0;
}

and in FreePascal:

program CopyTextFile(oldfile, newfile);
var
	oldfile, newfile : text;

	procedure CopyOneLine;
	var
		tempString : String;
	begin
		readln(oldfile, tempString);
		writeln(newfile, tempString);
	end; {CopyOneLine}

begin {CopyTextFile}
	Assign(oldfile, 'sm05.mif');
	Assign(newfile, 'sm05Dupe.mif');
	reset(oldfile);
	rewrite(newfile);
	while not eof(oldfile) do
	begin
		CopyOneLine;
	end;
	close(oldfile);
	close(newfile); {flushes any buffered output}
end. {CopyTextFile}

and ran both from inside Xcode.

The C and the Pascal file both took less than 10 seconds.

10 SECONDS, versus more than 30 MINUTES.

I know that I can rewrite the AS routine to read in larger chunks at a time, but I’m limited to how much I can read in at a time and still keep my parsing routines effective.

If someone could point out a way to speed it up so that it’s more in line with the C or Pascal, I’d be very appreciative.

I know I can rewrite my simple string-parsing algorithms in either C or Pascal fairly easily (he said, without ever having written a real C program, or touched Pascal in a decade-and-a-half), but I’ve never done a Cocoa interface, and I don’t know how hard it would be to duplicate AppleScript’s drag-and-droplets, or the Get/Put/Choose dialogs… and while I want to learn Cocoa anyway, right now’s not a good time, not when I need to get this thing finished and working efficiently ASAP.

My options seem to be staying 100% in AppleScript and building a program that will take hours to run instead of minutes, dumping AppleScript and learning enough Cocoa to use with either Objective-C or FreePascal, or writing the actual data-handling, string-parsing parts in C or Pascal and calling them from the AppleScript droplet (which, after the marker-moving routines, calls TextWrangler to do a mess of search-and-replaces).

How hard would it be to convert just the simple C or Pascal programs above to a compiled routine that could be called from AppleScript? Ideally, AppleScript would take the dropped file and pass the input/output file name as parameters to the compiled routines.

If someone could step me through converting one of 'em (the FreePascal would be my first choice), I’m confident I could add in the parsing routines without too much problem. I’d be very grateful for the help.

Thanks,

Walt Sterdan
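
As a rough sketch of what "a compiled routine that takes the input/output file names as parameters" could look like on the C side: the function names copy_lines and run_tool and the 4096-byte buffer are assumptions for illustration, not anything from the original program.

```c
#include <stdio.h>

/* Copy src to dst line by line; returns 0 on success, 1 on failure.
   The 4096-byte buffer is an arbitrary choice: fgets hands back a longer
   line in several pieces, so no text is lost. */
int copy_lines(const char *src, const char *dst)
{
    FILE *in = fopen(src, "r");
    if (!in)
        return 1;
    FILE *out = fopen(dst, "w");
    if (!out) {
        fclose(in);
        return 1;
    }
    char line[4096];
    int status = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        /* The parsing step would go here, before the line is written. */
        if (fputs(line, out) == EOF) {
            status = 1;
            break;
        }
    }
    if (ferror(in))
        status = 1;
    fclose(in);
    if (fclose(out) != 0)
        status = 1;
    return status;
}

/* Command-line entry logic: expects exactly two arguments,
   the input path and the output path. */
int run_tool(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s infile outfile\n", argv[0]);
        return 2;
    }
    return copy_lines(argv[1], argv[2]);
}
```

Built as a command-line tool whose main simply returns run_tool(argc, argv), this could be run from the droplet with AppleScript's 'do shell script' command, passing the 'quoted form of' the 'POSIX path of' each file. The tool's name and location would be whatever you choose.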

Nigel_Garvey · June 11, 2006, 11:19pm

Hi, Walt.

You can normally expect C to be a bit faster than AppleScript. My knowledge of C is a little rusty and my knowledge of Pascal non-existent, but as far as I can make out, the C routine reads 80 bytes at a time without bothering to look at them. The AppleScript code is set up to read in bytes, examining them as they come in to check for returns. That’s a slower process. Also, if there are fewer than 80 characters per “line”, that would increase the number of disk read/writes required. A million disk reads followed by a million disk writes is going to take a while!

Your AppleScript ‘read’ command doesn’t make use of the access reference returned by the ‘open for access’ command. Using that would save the filing system having to check a million times if the file’s open. You can omit the ‘starting at eof’ parameter in the ‘write’ command, as writing continues from where it left off anyway, provided the file hasn’t been closed in the meantime. It’s a good idea to set the file’s eof to zero before the first write, to “empty” it.

It’s difficult to advise, not knowing what your parsing routines do or how they’re written. If possible, I’d be looking to read the approximate byte equivalent of 3900 lines at a time and doing the rest of the work in memory. That would reduce the number of disk accesses by a factor of 7800, which should make a considerable dent in a 30-minute running time. (3900 is a safe number of elements, e.g. paragraphs, to extract from text in one go.) Your read/write code would need to be modified something along the following lines. (I’ve assumed a line length of approximately 80 characters, to match your C routine.)

-- Set the maximum number of bytes to read in at a time. (Adjust to taste.)
set blockLength to 312000 -- 3900 * 80 bytes, assuming 80-character lines

-- Get and set the filenames
set CurrentDoc to (choose file)
set DestinationFolder to (choose folder with prompt "Choose destination folder for cleaned files:")
set finalFile to (DestinationFolder & "Duped_File" as string)

-- open 'em
set open_source_file to (open for access CurrentDoc)
-- open for access finalFile with write permission
set the finalFile to the finalFile as text
set the open_target_file to (open for access file finalFile with write permission)

-- preserve TIDs
set tid to text item delimiters

-- zero the tempstring
set tempstring to ""

-- Loop through 'blockLength' bytes at a time.
try
	-- Find out how many complete "blocks" to read and how many odd bytes.
	set sourceLength to (get eof open_source_file)
	set completeBlockCount to sourceLength div blockLength
	set oddByteCount to sourceLength mod blockLength
	-- Empty the target file for good luck.
	set eof open_target_file to 0

	-- Process 'blockLength' bytes at a time.
	repeat completeBlockCount times
		set tempstring to (read open_source_file for blockLength)
		-- set tempstring to myParsingHandler(tempstring)
		write tempstring to the open_target_file
	end repeat

	-- Process any bytes left over at the end of the file.
	if (oddByteCount > 0) then
		set tempstring to (read open_source_file) -- ie. from current position to eof.
		-- set tempstring to myParsingHandler(tempstring)
		write tempstring to the open_target_file
	end if

	close access open_source_file
	close access open_target_file
on error
	close access open_source_file
	close access open_target_file
end try
set text item delimiters to tid

wsterdan · June 12, 2006, 1:00am

Nigel, you’re a lifesaver! Many thanks.

That’s close, but not quite accurate. To the best of my knowledge, the 80 bytes allocated in the C program is simply a buffer, as the memory has to be allocated prior to using the temporary string. The read routine actually stops when it hits a “\n” character (Unix line ending), just as the Pascal readln stops when it hits a carriage return or other end-of-line character… so the C and Pascal routines actually are reading and checking character-by-character.
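
A small sketch of that fgets behaviour, for illustration only: the helper name count_fgets_pieces is made up here, and the 80-byte buffer is the one from the C program earlier in the thread.

```c
#include <stdio.h>
#include <string.h>

/* Read a stream with the same 80-byte buffer as the C program above.
   Returns the number of fgets calls that returned data, and stores in
   *lines how many of those pieces ended with '\n' (complete lines). */
int count_fgets_pieces(FILE *f, int *lines)
{
    char s[80];
    int pieces = 0;
    *lines = 0;
    while (fgets(s, sizeof s, f) != NULL) {
        size_t len = strlen(s);
        pieces++;
        /* fgets stops right after a newline, or after 79 characters if
           no newline has turned up yet (one byte is kept for '\0'). */
        if (len > 0 && s[len - 1] == '\n')
            (*lines)++;
    }
    return pieces;
}
```

So a short line comes back in one call, well short of 80 bytes, while a line longer than 79 characters comes back in several pieces. The buffer size only caps how much a single call can return.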

Bingo! I think this is where the astronomical amounts of time are wasted. I used your routine and the program ran in roughly the same amount of time as the C and Pascal programs.

Nigel Garvey wrote:

It’s difficult to advise, not knowing what your parsing routines do or how they’re written.

Basically, the MIF files have markers set in them that break up the blocks of text (paragraphs in the FrameMaker format), which makes them easy to edit while in FrameMaker, but difficult to deal with when using language translation software.

For example, if we wanted to translate a very simple phrase, say, “The red ball” from English into French, it would translate as, roughly, “La boule rouge.” As long as the translation software can read the whole phrase in its entirety, no problem. However, if someone puts a cross-reference marker between “red” and “ball”, the translation now sees this as two phrases, “The red” and “ball” and translates them as such, giving us “La rouge” and “boule.”

Translation inaccuracy skyrockets as the phrases become lengthier and more technically-oriented (I work for a company that specializes in technical manual translation).

My program parses the MIF by moving all of the markers to either the beginning or the end of the paragraphs they belong to, depending on what type of markers they are.

I’ll have to rewrite the routines slightly to account for the temporary buffer ending in the middle of paragraphs or paralines, but that’s trivial compared to what I would have had to both learn and implement otherwise.
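
One common way to cope with a block ending mid-line is to hold back the unfinished tail and prepend it to the next block. The sketch below shows that idea in C under stated assumptions: lines end in '\n' (a classic Mac file would need '\r' instead), and process_in_blocks and handle_chunk are hypothetical names, with handle_chunk standing in for the real marker-moving code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real parsing step (hypothetical): it just writes the
   text through. 'text' ends on a line boundary, except possibly for the
   very last chunk of the file or a single line longer than the buffer. */
static void handle_chunk(const char *text, size_t len, FILE *out)
{
    fwrite(text, 1, len, out);
}

/* Read 'in' in blocks of 'block' bytes, handing only whole lines to
   handle_chunk. Returns the number of chunks handed over, or -1 on error. */
int process_in_blocks(FILE *in, FILE *out, size_t block)
{
    if (block == 0)
        return -1;
    char *buf = malloc(block);
    if (buf == NULL)
        return -1;

    size_t held = 0;   /* bytes of an unfinished line carried over */
    size_t got;
    int chunks = 0;

    while ((got = fread(buf + held, 1, block - held, in)) > 0) {
        size_t total = held + got;
        size_t cut = total;
        /* Walk back to just past the last line ending in the buffer. */
        while (cut > 0 && buf[cut - 1] != '\n')
            cut--;
        if (cut == 0) {
            if (total < block) {   /* no line ending yet: read some more */
                held = total;
                continue;
            }
            cut = total;           /* one line fills the buffer: pass it on */
        }
        handle_chunk(buf, cut, out);
        chunks++;
        held = total - cut;
        memmove(buf, buf + cut, held);   /* keep the unfinished tail */
    }
    if (held > 0) {                      /* final line without a newline */
        handle_chunk(buf, held, out);
        chunks++;
    }
    free(buf);
    return chunks;
}
```

The same carry-over works in AppleScript by keeping the text after the last line ending of one block and concatenating it onto the front of the next read. Splitting on paragraph boundaries rather than line boundaries would only change the test used to find the cut point.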

Again, my thanks! I’ve been bouncing back and forth between languages all weekend, trying to find some way to tie the various pieces together and failing miserably due to lack of Cocoa experience. Your help has saved my bacon, and I’ll be sleeping much better tonight because of your help.

Again, my hearty thanks!

– Walt Sterdan

Nigel_Garvey · June 12, 2006, 2:23am

Hi, Walt.

Glad that’s been some help. Er... What I said about disk accesses being reduced by a factor of 7800 is, on reflection, rubbish. The factor’s “only” 3900. Sorry about that.
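
The arithmetic behind that factor can be checked with a few lines of C. This is only an illustration: the helper name reads_needed is made up, and the figures assume the 80-byte line length and 312000-byte block size used in the script above.

```c
/* Number of read calls needed to cover total_bytes in blocks of
   block_bytes, counting a final partial block as one extra read. */
long reads_needed(long total_bytes, long block_bytes)
{
    return total_bytes / block_bytes + (total_bytes % block_bytes != 0);
}
```

For a million 80-byte lines (80,000,000 bytes), reads_needed(80000000L, 312000L) gives 257 block reads against 1,000,000 line-by-line reads, roughly one read per 3900 lines. The writes shrink by the same factor, so the ratio of total disk accesses is still about 3900, not 7800.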