Working with huge file lists (1'000+ items) and improving speed!

Hi!

I have a text file, created via the command line, containing a huge list of about 18’000 file paths.

All of these files need to be edited in an AppleScript batch.

Loading all 18’000 paths into AppleScript as a list of POSIX paths is impossible (AppleScript either freezes or crashes).

What would you do in this case?

One possible approach is to read the list in chunks, something like this:


set the_posix_file to POSIX path of (choose file) -- the file that contains the huge file path list!
set package_length to 100
set line_count to (do shell script "grep -c ^ " & quoted form of the_posix_file) as integer
repeat with i from 1 to line_count by package_length
	set current_range to {i, i + package_length - 1}
	repeat with thefile in (paragraphs of readLines_fromRange(current_range, the_posix_file))
		-- do something with thefile; thefile is a POSIX path!
	end repeat
end repeat

on readLines_fromRange(therange, the_posix_file)
	-- example of therange: {12, 34}
	-- build the sed address list "12p;13p;...;34p" so only those lines are printed
	set thelines to {}
	repeat with i from (first item of therange) to (last item of therange)
		set end of thelines to i
	end repeat
	set oldtid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "p;"
	set thelines to thelines as text
	set AppleScript's text item delimiters to oldtid
	do shell script "sed -n '" & thelines & "p' " & quoted form of the_posix_file
end readLines_fromRange

(this explicit example is not tested)
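A simpler variant of the handler (equally untested) would let sed do the range addressing itself via its "start,end" form; appending a "q" command also makes sed quit as soon as the block has been printed, so the rest of the huge file is never scanned:

on readLines_fromRange(therange, the_posix_file)
	set startLine to first item of therange
	set endLine to last item of therange
	-- "start,end p" prints the requested block; "end q" quits sed right after it
	do shell script "sed -n '" & startLine & "," & endLine & "p;" & endLine & "q' " & quoted form of the_posix_file
end readLines_fromRange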

Are there any other methods you’re using and would recommend?

Operating System: Mac OS X (10.4)

I would:

set fileContents to read file "path:to:hugePathList" --> AS should be able to read (eg) a 5MB file

--> if it's typical shell output, you can use the following:
repeat with i from 1 to 99999999
	try
		doSomethingWithThisPOSIXPath(fileContents's paragraph i)
	on error --> no more paragraphs
		exit repeat
	end try
end repeat

But when you let your AppleScript read the entire file, you get a variable taking up as much as 5 MB of memory.

If you just let “sed” read a hundred lines of the huge list file, you only ever have 100 paragraphs in memory, not 99’999’999 paragraphs (as in your code snippet).

Nope. I only use 99999999 to keep “i” growing. If there are only 20’000 paragraphs, it will stop processing after 20’000 iterations (on error => exit repeat).

As you asked about speed, I think this method should be much faster (and I don’t think 5 MB is an overload if you process the paths as quickly as possible, then exit)…
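And if the arbitrary 99999999 bound bothers you, here is a variant of the same loop (a quick sketch, untested) that counts the paragraphs once up front instead of waiting for the error:

set paraCount to count paragraphs of fileContents
repeat with i from 1 to paraCount
	doSomethingWithThisPOSIXPath(paragraph i of fileContents)
end repeat

Same memory profile as the try/exit version; it just doesn’t rely on an error to end the loop.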