Error -116 when reading big file. Workaround?

scriptim · August 30, 2013, 3:11am

Hey Guys,

I’m working with some big files 500MB-many GB. These are ascii text files, and I don’t have enough memory to read the whole file into a variable so that I can parse it and put it somewhere else. I could resort to GUI scripting in text edit but I’d rather not.

Even if not applescript, any good workaround for editing a super large file in pieces maybe? I only need to make changes in the first ~100 lines and in the last ~1000 lines.

Thanks,

Tim

kel1 · August 30, 2013, 3:28am

Hi scriptim,

from the Standard Additions dictionary:

So, something like this:
set the_start to read file_reference for 100
set the_end to read file_ref from 1000

Edited: sorry, I read your post wrongly the first time. To read the last 1000 lines, you need to count lines, do some math and you’ll know where to read from.

gl,
kel

scriptim · August 30, 2013, 3:32am

I can see how the ‘for’ could be useful (It asks for a number of bytes, so I need to count the number of characters I’d like to read in?), but you misunderstood the last part.

I need to modify the last ~1000 lines of a file that could be millions of lines long. I don’t know the length of the file.

Thanks for pointing me to those commands though.

kel1 · August 30, 2013, 3:36am

Hi scriptim,

Sorry was watching football while I read and wrote this. Disregard, I was thinking lines instead of characters. Be back later if nobody else posts.

gl,
kel

Shane_Stanley · August 30, 2013, 5:47am

You could always ask, via the Finder/shell/System Events/whatever. Not that you have to:

read someFile from -100

will read the last 100 bytes of a file. So come up with a guess of roughly how many characters 1000 lines will be, and you’re set.

Nigel_Garvey · August 30, 2013, 5:55am

Hi.

¢ If you don’t have enough memory to read a file into a variable, you’re not likely to have enough memory to open it in TextEdit either. In fact it’s probably too big full stop.
¢ If it’s ASCII text, reading from the beginning or the end with the File Read/Write commands, would be relatively easy, although it would involve making a guesstimate of the number of bytes (= characters) in those lines and having a few goes by trial and error until you’d got the relevant data in memory and knew the exact number of bytes involved.
¢ Alternatively, if you were reading from the beginning of the file and knew the line-break character (assuming there was just one), you could read ‘until’ that character the required number of times. This could be quite slow though.
¢ Any changes near the beginning which resulted in a different number of bytes would require the entire file to be rewritten. Changes near the end would be easier to accomodate.
¢ It’s possible that one of the text editing languages available though ‘do shell script’ would be better for the job, since they (or sed at least) can take care of the file work themselves. But I’ve never looked into that aspect of them myself.

mhfadams · August 30, 2013, 6:53am

If you really meant you have text files over 500 MB !?! Wow.

Thats a problem. The only files I have ever seen that big are A/V files and disc images, and occasionally PDFs (up to 50 some MB). Those are all handled in streaming style, never all loaded into memory at once.

If you have control over the project design, I would suggest splitting your project into much smaller files. (OSX: bundles, bundles everywhere !).

If you don’t… here’s the way I see it.

In order to modify any text file you have read the whole file into memory, make the changes, then write it back.
Otherwise you are stuck with either replacing strings with exactly equal length strings, or playing hopscotch with remaining byes (some pretty low-level programming).

It looks like your only option is to read the source file in chunks, then put it all back together.

Maybe this will work:
(1) split your source file into chunks with (you guessed it) ‘split’ unix utility; Of course you’re going to need a lot of scratch pad space.
(2) perform your transformations (in AS if you want)
(3) knit the whole file back together with ‘cat’ utility.
In all you’ll need at least three times the disk space and a lot of patience.

( You could easily read the last 1000 (or whatever you like) with ‘tail’ but then you don’t know the location in the source file to start writing the new text at. )

That whole process is a chore – I sure hope you can first segment the project somehow.

Model: PPC G4 Dual-800Mhz
AppleScript: 1.10.7
Browser: Safari 533.19.4
Operating System: Mac OS X (10.4)

scriptim · August 30, 2013, 8:43am

Awesome suggestions.

Here’s some background to reward your help and maybe spark some more ideas:

The original file is an output file from a computational chemistry simulator. The file contains wavefunctions that can be used to calculate the *exact electron charge density at any point (in the universe) for a given system. I’m working on a project to visualize a newly discovered metric in theoretical chemistry, and eventually I’ll add the ability to simply use the wavefunctions in the program to calculate values on the fly, resulting in *perfect data. Until then, however, and for the sake of being able to use the program on experimental data, the program runs off a 3d grid of discreet values of electron charge density at its nodes. Therefore, the finer the grid, the better the results (less noise, etc…).

To get these big ascii files, I first generate a binary file with the grid information, then from that file an ascii text (human readable) file is made. I tried one today with a billion nodes (1000^3), and the binary file came out at ~16GB, the ascii text file ~25GB. Needless to say, these files are long. I don’t know how many lines.

I imagine it may be a while before I can really deal with files that big (my research group is considering a fancy new machine with four 16-core amd opteron cpus and 128GB memory, so maybe with that I could use these), but even smaller files ~1GB will open in textedit but fail when read into AS.

The layout of the file is fairly simple. The first sample below is the beginning of the file, and contains some information about the dimensions of the system that need to be parsed. Then there’s the bulk of the file (second sample); a 1-d array with values of the charge density at each point in double precision, and finally the end of the file (third sample) is simply unnecessary and needs to be truncated.

This is really interesting research, and gives me chances to learn new AS tricks. In 10 years every high school and university is going to be teaching chemistry differently because of this new discovery.

Anyways, thanks for the suggestions!

Tim

Sample 1:

Sample 2:

0.321021033643625839E-005 0.226727032754137001E-005 0.152518368021359874E-005
0.983637467426201418E-006 0.611768054654276372E-006 0.925411495681161817E-006
0.152518368021359790E-005 0.242689819522677139E-005 0.370279454256807476E-005
0.537340259958462117E-005 0.735505725445796980E-005 0.942934103053651030E-005
0.112805655484882561E-004 0.126146796028112145E-004 0.132982567887010991E-004
0.134154911371915439E-004 0.132005582787081483E-004 0.129082280333296329E-004
0.127197211648993276E-004 0.127197211648993259E-004 0.129082280333296465E-004
0.132005582787081500E-004 0.134154911371915405E-004 0.132982567887011008E-004
0.126146796028112111E-004 0.112805655484882747E-004 0.942934103053653063E-005
0.735505725445799267E-005 0.537340259958462202E-005 0.370279454256808450E-005
0.242689819522677308E-005 0.152518368021359874E-005 0.925411495681162770E-006
0.134236572567030812E-005 0.226727032754137128E-005 0.370279454256807391E-005
0.580093338797392770E-005 0.863377804384899330E-005 0.120820043136584313E-004
0.157527723661940604E-004 0.190348700782330278E-004 0.213399749885319334E-004
0.224039131891167273E-004 0.224113709233581187E-004 0.218430900482721166E-004
0.211964927300604662E-004 0.208019153952220182E-004 0.208019153952220419E-004
0.211964927300604798E-004 0.218430900482721267E-004 0.224113709233581187E-004
0.224039131891167205E-004 0.213399749885319504E-004 0.190348700782330244E-004

Sample 3:

Geometry
xyznuc
48 48 2
0.174735994142117756E+001 0.174735994142117756E+001 0.174735994142117756E+001
0.174735994142117756E+001 0.174735994142117756E+001 -0.174735994142117756E+001
-0.174735994142117756E+001 0.174735994142117756E+001 -0.174735994142117756E+001
-0.174735994142117756E+001 0.174735994142117756E+001 0.174735994142117756E+001
0.174735994142117756E+001 -0.174735994142117756E+001 0.174735994142117756E+001
0.174735994142117756E+001 -0.174735994142117756E+001 -0.174735994142117756E+001
-0.174735994142117756E+001 -0.174735994142117756E+001 -0.174735994142117756E+001
-0.174735994142117756E+001 -0.174735994142117756E+001 0.174735994142117756E+001
0.292567653901141700E+001 0.292567653901141700E+001 0.292567653901141700E+001
0.292567653901141700E+001 0.292567653901141700E+001 -0.292567653901141700E+001
-0.292567653901141700E+001 0.292567653901141700E+001 -0.292567653901141700E+001
-0.292567653901141700E+001 0.292567653901141700E+001 0.292567653901141700E+001
0.292567653901141700E+001 -0.292567653901141700E+001 0.292567653901141700E+001
0.292567653901141700E+001 -0.292567653901141700E+001 -0.292567653901141700E+001
-0.292567653901141700E+001 -0.292567653901141700E+001 -0.292567653901141700E+001
-0.292567653901141700E+001 -0.292567653901141700E+001 0.292567653901141700E+001
Geometry
qtch
16 16 2
0.600000000000000000E+001 0.600000000000000000E+001 0.600000000000000000E+001
0.600000000000000000E+001 0.600000000000000000E+001 0.600000000000000000E+001
0.600000000000000000E+001 0.600000000000000000E+001 0.100000000000000000E+001
0.100000000000000000E+001 0.100000000000000000E+001 0.100000000000000000E+001
0.100000000000000000E+001 0.100000000000000000E+001 0.100000000000000000E+001
0.100000000000000000E+001
Geometry
unit of length
1 1 2
0.100000000000000000E+001

kel1 · August 30, 2013, 9:59am

Hi scriptim,

I found something like this on the internet to get the last 1000 lines:

set f to choose file
set pp to quoted form of POSIX path of f
set cmd to quoted form of ":loop
$ q
N
1001,$ D
b loop"
do shell script "sed -e " & cmd & space & pp

I don’t know if it can handle 1000 lines, but you can try it.

Edited: you can also try using unix 'tail". Something like:

tail ‘-n 1000’ ‘/path/to/file’

You need to put that in a ‘do shell script’ command.

gl,
kel

Nigel_Garvey · August 30, 2013, 10:40am

Ah. So it’s not computer memory that’s the problem but script memory. I don’t have a big enough text file to check that out, but I may create one later on and take a look.

If you only need to parse the first 100 lines or so of the file, you can just read off enough bytes to cover comfortably what you need. The path and variable names in this example are just descriptive:

-- Opening for access isn't strictly necessary, but it ensures that you use your own 'file mark' (index into the file).
set accessRef to (open for access file "path:to:file")
try
	set enoughText to (read accessRef for enoughBytes as string)
end try
close access accessRef

-- Parse 'enoughText'

The file handling for edits to the last part of the file would go something like this:

set accessRef to (open for access file "path:to:file" with write permission)
try
	-- Note the negative index to read the last 'enoughBytes' bytes of the file)
	set enoughText to (read accessRef from -enoughBytes as string)
	
	-- Edit out the bits you want to cut from the text, but leave everything else.
	
	-- Truncate the file to the insertion point from which the read started.
	set eof accessRef to (get eof accessRef) - enoughBytes
	-- Append the edited data to the stub.
	write editedText to accessRef starting at eof
end try
close access accessRef

DJ_Bazzie_Wazzie · August 30, 2013, 11:00am

if you only need to read the last 1000 lines you need to do your reading also backwards. Here an example code that reads the last 1000 lines, if there if any, otherwise the contents of the file:

set theFile to (choose file) as string

set fileSize to (do shell script "ls -nl " & quoted form of POSIX path of theFile & " | awk '{print $5}'") as integer


set bufferSize to 24000
set theBuffer to ""

try
	set fd to open for access file theFile
	repeat until (number of paragraphs of theBuffer) > 1000
		if bufferSize â‰¥ fileSize then
			set theBuffer to read fd from 1 to fileSize
			exit repeat
		end if
		set theBuffer to read fd from (bufferSize * -1)
		set bufferSize to bufferSize + 24000
	end repeat
	close access fd
on error
	close access file theFile
end try

return theBuffer

The variable theBuffer contains the last 1000 lines and some more data, you process, modify this data and then you put the data back into your file. This means we’re going to remove the size of the buffer of the file and paste our modified data into it.

try
	set fd to open for access file theFile with write permission
	set eof of fd to fileSize - (count theBuffer)
	write newData to fd starting at eof
	close access fd
on error
	close access file theFile
end try

So a complete script to uppercase only the last 3 lines of a very large file can look like:


set theFile to (choose file) as string

set fileSize to (do shell script "ls -nl " & quoted form of POSIX path of theFile & " | awk '{print $5}'") as integer


set bufferSize to 24000
set theBuffer to ""

try
	set fd to open for access file theFile
	repeat until (number of paragraphs of theBuffer) > 1000
		if bufferSize â‰¥ fileSize then
			set theBuffer to read fd from 1 to fileSize
			exit repeat
		end if
		set theBuffer to read fd from (bufferSize * -1)
		set bufferSize to bufferSize + 24000
	end repeat
	close access fd
on error
	close access file theFile
end try

--do your modifications here; only on the last 3 lines
set AppleScript's text item delimiters to linefeed
set lastLines to text items -3 thru -1 of theBuffer as string
set AppleScript's text item delimiters to ""

set lastLines to do shell script "tr '[:lower:]' '[:upper:]' <<<" & quoted form of lastLines
set newData to (text 1 thru ((count theBuffer) - (count lastLines)) of theBuffer) & lastLines

try
	set fd to open for access file theFile with write permission
	set eof of fd to fileSize - (count theBuffer)
	write newData to fd starting at eof
	close access fd
on error
	close access file theFile
end try

I’ve tested it with an 60MB ASCII file containing 300,000 lines in it and it is still fast.

Nigel_Garvey · August 30, 2013, 11:22am

DJ Bazzie Wazzie:

if you only need to read the last 1000 lines you need to do your reading also backwards. Here an example code that reads the last 1000 lines, if there if any, otherwise the contents of the file:

set theFile to (choose file) as string

set fileSize to (do shell script "ls -nl " & quoted form of POSIX path of theFile & " | awk '{print $5}'") as integer


set bufferSize to 24000
set theBuffer to ""

try
	set fd to open for access file theFile
	repeat until (number of paragraphs of theBuffer) > 999
		if bufferSize â‰¥ fileSize then
			set theBuffer to read fd from 1 to fileSize
			exit repeat
		end if
		set theBuffer to read fd from (bufferSize * -1)
		set bufferSize to bufferSize + 24000
	end repeat
	close access fd
on error
	close access file theFile
end try

return theBuffer

Hi DJ.

I think it should repeat until the number of paragraphs is > 1000, to ensure that the extract contains all of paragraph -1000.

DJ_Bazzie_Wazzie · August 30, 2013, 11:27am

Good point Nigel! Thanks

Nigel_Garvey · August 30, 2013, 1:41pm

DJ Bazzie Wazzie:

if you only need to read the last 1000 lines you need to do your reading also backwards. Here an example code that reads the last 1000 lines, if there if any, otherwise the contents of the file:

set theFile to (choose file) as string

set fileSize to (do shell script "ls -nl " & quoted form of POSIX path of theFile & " | awk '{print $5}'") as integer


set bufferSize to 24000
set theBuffer to ""

try
	set fd to open for access file theFile
	repeat until (number of paragraphs of theBuffer) > 1000
		if bufferSize â‰¥ fileSize then
			set theBuffer to read fd from 1 to fileSize
			exit repeat
		end if
		set theBuffer to read fd from (bufferSize * -1)
		set bufferSize to bufferSize + 24000
	end repeat
	close access fd
on error
	close access file theFile
end try

return theBuffer

Hi.

You can shave some time off this (ie. make it three times as fast on my machine with my test file) by using ‘get eof’ instead of the shell script, using ‘count’ instead of ‘number of’, and only reading what’s not already been read:

set theFile to (choose file) as string

set readSize to 24000
set bufferSize to readSize
set theBuffer to ""

set fd to (open for access file theFile)
try
	set fileSize to (get eof fd)
	repeat until ((count paragraphs of theBuffer) > 1000)
		if (bufferSize â‰¥ fileSize) then
			set theBuffer to (read fd from 1 for fileSize - (bufferSize - readSize)) & theBuffer
			exit repeat
		end if
		set theBuffer to (read fd from -bufferSize for readSize) & theBuffer
		set bufferSize to bufferSize + readSize
	end repeat
end try
close access fd

return theBuffer

Edit: Corrected a couple of oversights concerning what to do when bufferSize â‰¥ fileSize!
Edit: Replaced ‘to’ with ‘for’ in the ‘read’ commands for aesthetics and reduced maths. Eliminated special-casing for when the file length’s less than the initial buffer size.

DJ_Bazzie_Wazzie · August 30, 2013, 1:58pm

Great solution . It should be faster indeed only on my machine there is no speed difference, maybe because I have full 3.2 GHz data storage.

kel1 · August 30, 2013, 6:44pm

Hi,

On the side, I was trying to do a Sed script to return the first 10 and last 10 lines in uppercase. But I’m getting the hardest time trying to combine the two commands into one Sed call in one ‘do shell script’. I guess if you don’t use it (especially with Sed) you lose it.

set f to choose file
set pp to quoted form of POSIX path of f
set cmd1 to quoted form of "1,10 y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
10 q"
set cmd2 to quoted form of ":loop1
$ b loop2
N
11,$ D
b loop1
:loop2
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/"
set first10 to do shell script "sed -e " & cmd1 & space & pp
set last10 to do shell script "sed -e " & cmd2 & space & pp
first10 & linefeed & "<tag>" & linefeed & last10

(* Some text for file
Start
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
End
*)

Afterwards, I was thinking about using the return and write to file the number of bytes these lines take.

Thanks,
kel

kel1 · August 30, 2013, 8:02pm

Think I almost have it, but it prints every line in capitols:

set f to choose file
set pp to quoted form of POSIX path of f
set cmd1 to quoted form of "1,10 y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
11,$ {
:loop1
$ b loop2
N
11,$ D
b loop1
}
:loop2
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/"
set first10 to do shell script "sed -e " & cmd1 & space & pp

Edited: finally got lucky I think.

set f to choose file
set pp to quoted form of POSIX path of f
set cmd1 to quoted form of "1,10 {
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
p
}
11,$ {
:loop1
$b loop2
N
21,$ D
b loop1
:loop2
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
p
}"
set r to do shell script "sed -n " & cmd1 & space & pp

One thing I don’t understand here, is why you don’t need the -e flag in sed?

DJ_Bazzie_Wazzie · August 31, 2013, 12:50am

I know that a lot of AppleScripters are very fond of sed but with such a large amount of data, any stream editor is a no-go. Especially when there is only a small part of the file you want to modify. My code (optimized by Nigel) is starting to read from a certain point, stream editors like sed reads from 0 and checks every line if it meets a condition. For the TS the 16 GB file means that there is probably around 15.999GB of unnecessary processing when using a stream editor, while the read command only processes somewhere around 1MB of actual data. read command is somewhere around 10,000% faster than using sed and grows in advantage when the file is growing in size.

I don’t want to discourage you from using sed at all, but sometimes you have to choose the right tool for the right job. When working with extremely large files, as in this situation, it’s better to work in chunks or pick the right data from it to save yourself from a lot of unnecessary processing.

mhfadams · August 31, 2013, 8:19pm

Thanks for the new overview of your situation - scriptim.
It’s a pretty imposing project.

I think I should mention that a lot of people in your position (scientific calculation and data processing) usually do their code in C or Python.
Python is very popular for such things because of its speed, very large code libraries already made, and ease to write.
It also has good string processing built-in.
As fond as I am of applescript, sometimes it’s worth the while to learn a new tool (I switched an app from AS to ObjC and it made a world of difference).
As can be seen from the many responses to your question, AS has serious limitations for this kind of thing, that people are trying to overcome through unix utilities via the shell.
You might benefit from going directly to the work-horses - either writing a shell script, Python, Awk, or C.
If your project is serious enough to warrant a 64-core machine, it deserves a comparable language.

DJ_Bazzie_Wazzie · September 1, 2013, 12:45pm

As you can see that Nigel and I don’t use do shell scripts, but plain AppleScript. This problem isn’t been solved by stepping towards C (I’m a C coder myself). The problem you want to avoid is reading a 16GB file into the memory of the computer while there is only a fraction of the file you want to read. We’re using the benefits from the kernel here that you’re able to read a specific part from the file. So every solution to process the whole file, no matter which programming language, is to me a no-go.

For me there is no limitation in AppleScript in this situation, I would use the same technique in C, it seems complex but keeps it your code fast and keeps memory consumption to a limit; something that should be done in every programming language.