Read contents of txt file. Set 1st line & 2nd lines as different variables

Jono · April 24, 2015, 9:46am

I have a simple script to read the contents of a txt file and set it as a variable


	tell application "Finder" to set theFile to item 1 of (get selection)
	set documentContents to (read (theFile as alias))

What I’d like to do if possible is to set the first line of the text file to one variable, then set the second line (or just rest of that text in the document) to another variable.

Is this possible? Any help would be greatly appreciated:-)

DJ_Bazzie_Wazzie · April 24, 2015, 9:57am

When you open a file with open for access, you can read it in parts. The command keeps a file pointer, so each subsequent read command continues from the position where the previous read stopped.


set theFile to "/etc/hosts"
try
	set fd to open for access theFile
	set var1 to read fd until linefeed
	set var2 to read fd until linefeed
	close access fd
on error
	close access theFile
end try

return {var1, var2}

Jono · April 24, 2015, 10:07am

That’s great, thanks a lot!

TecNik · April 24, 2015, 11:51am

Nice one DJ !

Here was my attempt:


tell application "Finder" to set theFile to item 1 of (get selection)
set documentContents to (read (theFile as alias))

set x to count of paragraphs of documentContents

set var1 to paragraph 1 of documentContents
set var2 to paragraphs 2 thru x of documentContents
set var2 to my joinAList(var2, return)

on joinAList(theList, delim)
	set newString to ""
	set oldDelims to AppleScript's text item delimiters
	set AppleScript's text item delimiters to delim
	set newString to theList as string
	set AppleScript's text item delimiters to oldDelims
	return newString
end joinAList

DJ, is there a way of amending your version so var2 goes to the end of the doc instead of the next linefeed?

Thanks.

Nigel_Garvey · April 24, 2015, 11:58am

Hi.

Just to note:

‘until linefeed’ includes the linefeed (if there is one) in the result, so this has to be edited from the end of var1 (and possibly var2) if you don’t want it. You could use ‘before linefeed’, but then you’d either have to edit the linefeed from the beginning of var2 instead or insert a line to read the file for 1 byte before reading into var2. It may be quicker and simpler just to read in the entire text and set var1 and var2 to ‘paragraph 1’ and ‘paragraph 2’ of it respectively.
The values of the ‘read’ command’s ‘until’, ‘before’, and ‘using delimiter’ parameters are taken as single bytes, so unless the text in the file consists entirely of single-byte characters, these parameters may not work as expected:


set testPath to (path to desktop as text) & "Test.txt"
set txt to "áâãäåæçèéêëì"

set fref to (open for access file testPath with write permission)
try
	set eof fref to 0
	write txt to fref as «class utf8» -- Write two-byte characters to the file.
end try
close access fref

read file testPath -- Read them back as single bytes just to see what the single-byte characters are.
--> "√°√¢√£√§√•√¶√ß√®√©√™√´√¨"

-- OK. Now try using 'until' in a couple of ways:
read file testPath as «class utf8» until "ã" -- With a two-byte character value.
--> "áâãäåæçèéêëì" -- Doesn't work. The whole text's returned.

read file testPath as «class utf8» until "£" -- With a one-byte character value.
--> "áâã" -- Stops on the second byte of the "ã" (made up of the single-byte values for "√" and "£").

read file testPath as «class utf8» before "£" -- Stopping BEFORE the one-byte character value.
--> error "Can't make some data into the expected type." number -1700 (because not enough bytes are read to make up the last UTF-8 character).

A linefeed is a single byte in both ASCII and UTF-8 text, and is the second of two bytes in text saved as Unicode text by the ‘write’ command, so you may be able to get away with using ‘until’ here. But you should be aware of what these parameters actually do.
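A quick way to check the byte values involved is to dump them in a shell. This is only an illustrative sketch: the variable names are hypothetical, the octal escapes spell out the UTF-8 encoding of "ã" by hand, and it assumes `od` and `tr` are available, as they are on a stock OS X or Linux system.

```shell
# \303\243 is the UTF-8 encoding of "ã" (0xC3 0xA3).
# 0xA3 on its own is "£" in MacRoman, which is why 'until "£"' stops mid-character.
utf8_bytes=$(printf '\303\243' | od -An -tx1 | tr -d ' \n')

# A linefeed is a single byte below 0x80, so it never appears inside a multi-byte UTF-8 character.
lf_bytes=$(printf '\n' | od -An -tx1 | tr -d ' \n')

echo "a-tilde in UTF-8: $utf8_bytes"
echo "linefeed:         $lf_bytes"
```

The first value comes out as two bytes and the second as one, which matches the behaviour of ‘until’ described above.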

DJ_Bazzie_Wazzie · April 24, 2015, 12:01pm

Sure, remove the until parameter and the read command continues from the last position of the previous read till the end of the file:


set theFile to "/etc/hosts"
try
	set fd to open for access theFile
	set header to read fd until linefeed
	set content to read fd
	close access fd
on error
	close access theFile
end try

return {header, content}

DJ_Bazzie_Wazzie · April 24, 2015, 12:22pm

Simply put, every ASCII/UTF-8 byte value below 128 (0x80) is safe to use as a delimiter; higher values can give wrong results.

TecNik · April 24, 2015, 12:23pm

Thanks DJ.

StefanK · April 24, 2015, 12:25pm

Hi,

alternative solution


set documentContents to (read (choose file of type "txt"))
set variable1 to paragraph 1 of documentContents
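set paragraph2Offset to (length of variable1) + 2 -- first character after a one-character line ending
if (id of character (paragraph2Offset - 1) of documentContents) is 13 and (id of character paragraph2Offset of documentContents) is 10 then -- CRLF: skip the linefeed too
	set paragraph2Offset to paragraph2Offset + 1
end if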
set variable2 to text paragraph2Offset thru -1 of documentContents

McUsrII · April 24, 2015, 1:11pm

Hello.

This is interesting (kudos to Nigel for pointing out the problems), since it can remove the need to read in a big file while still finding where the paragraph ends, even when the “paragraph” is a rather long one.

I don’t know of any way to convert the clipboard to UTF-8, since the text has already been split into characters, and there is no easy fix using iconv either. So I figured that once I had the paragraph, I’d write it to a file and then read it back as UTF-8.

I reuse the file and the file handle here, as this is just an experiment.

I have “reused” Nigel’s code.


set testPath to (path to desktop as text) & "Test.txt"
set txt to "áâãäåæçèéêëì"

set fRef to (open for access file testPath with write permission)
try
	set eof fRef to 0
	write txt to fRef as «class utf8» -- Write two-byte characters to the file.
end try
close access fRef

set badText to read file testPath until linefeed -- Read them back as single bytes just to see what the single-byte characters are.
--> "√°√¢√£√§√•√¶√ß√®√©√™√´√¨"


set fRef to (open for access file testPath with write permission)
try
	set eof fRef to 0
	write badText to fRef -- Write the bad single bytes back.
end try
close access fRef
set goodText to read file testPath as «class utf8» -- Fix the problem by reading the bytes back as two-byte characters.
--> áâãäåæçèéêëì

Edit
The easy way to get the correct encoding and everything, when we know that it is the linefeed, is of course something like:

set res to do shell script "head -1 ~/Desktop/Test.txt"

StefanK’s way is probably faster, as long as the file is a short one.
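Extending the head idea to the original question (first line in one variable, the rest in another), a minimal shell sketch might look like the following. The temporary sample file and the variable names are hypothetical stand-ins for ~/Desktop/Test.txt; from AppleScript, each command would be wrapped in its own do shell script call.

```shell
# Hypothetical sample file standing in for ~/Desktop/Test.txt.
sample=$(mktemp)
printf 'first line\nsecond line\nthird line\n' > "$sample"

first=$(head -n 1 "$sample")   # line 1, without its trailing linefeed
rest=$(tail -n +2 "$sample")   # line 2 through the end of the file

printf '%s\n' "$first"
printf '%s\n' "$rest"

rm -f "$sample"
```

One thing to keep in mind: by default, do shell script changes linefeeds in the result to returns, so add "without altering line endings" if the original line endings matter.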

Nigel_Garvey · April 24, 2015, 2:02pm

Hi McUsrII.

It’s a swings-and-roundabouts situation. It may only have to read to the end of the paragraph, but each byte has to be checked as it comes off the disk to find out when the read should stop, so the read process itself is slower. From an efficiency point-of-view, it probably works best with a short paragraph in a very long file. Otherwise, as I said earlier, it’s faster and simpler to just read the whole file without the ‘until’ filter and let AppleScript sort out the paragraph(s) in memory:

tell (read (choose file)) to set {var1, var2} to {paragraph 1, text from paragraph 2 to -1}

McUsrII · April 24, 2015, 2:47pm

Hello Nigel.

I didn’t take into account the much slower way of reading the file, when testing for a value. But I saw the paragraphs version of it all.

For flexible solutions, where the paragraphs to be read in are not the first ones, a combination of head and tail in a do shell script might do the trick. (This preserves the encoding for those of us who use characters outside the ASCII character set.)

To get the paragraphs 7 thru 11, one might do something like
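A minimal sketch of that head/tail combination, with a sed equivalent for comparison. The sample file and variable names are hypothetical; in AppleScript the pipeline would be passed to do shell script with the real file path.

```shell
# Hypothetical 12-line sample file.
sample=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

# Keep the first 11 lines, then the last 5 of those: lines 7 through 11.
by_head_tail=$(head -n 11 "$sample" | tail -n 5)

# The same range with sed: print only lines 7 to 11.
by_sed=$(sed -n '7,11p' "$sample")

printf '%s\n' "$by_head_tail"

rm -f "$sample"
```

Both commands operate on bytes and split only on linefeeds, so multi-byte UTF-8 characters pass through untouched.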

Maybe sed is better than reading in the whole file if the sentinel character is something different than linefeed, but then again, for all I know the whole file is read into a buffer by the OS behind the scenes anyway.

Jono · April 24, 2015, 3:15pm

Thanks again everyone, there are some great solutions here.

DJ_Bazzie_Wazzie · April 24, 2015, 3:59pm

That would hardly be noticeable, because underneath it’s written in C (read: it’s a scripting addition). When reading a file you read character by character, and every character is checked for its value and pushed into a buffer before the next character is read. Higher-level programming languages hide this from the developer and read entire lines or even entire files directly into a buffer, but underneath, the same protocol has to be followed as in C. Reading until a character therefore takes just as long as reading to a position and makes no difference in performance whatsoever in C. It’s one of the things where C really differs from AppleScript.

McUsrII · April 24, 2015, 4:32pm

Hello DJ.

For all I know, the Standard Additions use buffered I/O in regular cases, and unbuffered I/O only when trying to read until something. Maybe you know better. By the way, the until preposition was a great spot!

DJ_Bazzie_Wazzie · April 24, 2015, 9:43pm

Thanks McUsr,

You’re spot on! The read command uses Carbon’s FSReadFork(), which eventually uses the pread() system call, and that is indeed buffered. At least those are the results I get when running AppleScript 2.3.2 (Mavericks) against Xcode’s debugger. But it doesn’t change the fact that until doesn’t affect disk I/O performance, and because the returned AppleEvent is smaller, it will be faster as well.

As always, interesting stuff

McUsrII · April 24, 2015, 9:49pm

Hello DJ-

It is layer upon layer here. I am not sure if I have dreamt it, or if I indeed read it, but I seem to remember that even if you use unbuffered IO, then the OS actually may create buffers for you, so your read operations are buffered anyway. But I don’t bet on this one, and I am deep into something at the moment, but I’ll eventually look it up, and come back to this.

DJ_Bazzie_Wazzie · April 25, 2015, 1:40am

True, when you read from disk there is no way you can physically read one byte. The smallest unit that can physically be read from disk is the block size of the device. But as you mentioned, there are layers upon layers. The read done by the kernel on behalf of a process doesn’t know any of this and is clearly separated from it. For that reason we have UFS and we should consider this as the lowest and “physical” level.

McUsrII · April 25, 2015, 2:47am

Hello DJ.

True, we probably read one byte at a time towards that buffer, and may regard that reading as the lowest one. -I wasn’t actually thinking of that buffer, but you are totally right.

ccstone · April 26, 2015, 4:48am

Here’s how I’d do it on my system using the Satimage.osax.


# Requires the Satimage.osax AppleScript Extension { http://tinyurl.com/dc3soh }.

set _file to alias ((path to home folder as text) & "test_directory:test")
set AppleScript's text item delimiters to "¶¶¶¶"
set {var1, var2} to text items of (find text "(?m)\\A([^\\n\\r]+)[\\n\\r](.+)\\Z" in _file using "\\1¶¶¶¶\\2" with regexp and string result)