Unicode in AppleScript

I’ve read that Unicode is only partially supported, and I’m experiencing this problem quite painfully. I’ve been developing a widget that gets some Unicode data from an application and displays it. I’m using AppleScript for the data acquisition. I can get the data fine, and I’ve checked that it is intact Unicode when I get it. However, I’ve had trouble sending it back to the widget. Since terminal commands are limited in size and don’t support Unicode, I’ve decided to write the data to a file which the widget will then read. I’ve been using jonn8’s code to write to a file:

on write_to_file(the_file, the_data, with_appending)
	set the_file to the_file as string
	try
		-- open the file for writing, creating it if it doesn't already exist
		set f to open for access file the_file with write permission
		-- unless appending, truncate any existing contents first
		if with_appending = false then set eof of f to 0
		write the_data to f starting at eof as (class of the_data)
		close access f
		return true
	on error the_error
		try
			-- don't leave the file open if the write fails
			close access file the_file
		end try
		return the_error
	end try
end write_to_file

The problem is that the Unicode data gets written to the file as ASCII gibberish. This is my testing code:

display dialog thedata
my write_to_file(thenewfile, thedata, false)

The dialog displays thedata correctly as Unicode, but the file that gets written doesn’t. I’ve tried calling write_to_file(thenewfile, thedata as Unicode text, false), but that doesn’t change anything.

Model: G4 MDD 2x1.25
AppleScript: 1.10
Browser: Safari 412
Operating System: Mac OS X (10.4)

It sounds like you’re having some confusion over text encodings (unicode data always looks like ‘ASCII gibberish’ when you read it with the wrong encoding). A ‘write txt to fileref as Unicode text’ command will produce a UTF16-encoded text file. If you want to write to file as UTF8, use ‘write txt to fileref as «class utf8»’ instead. If you want the file to indicate the type of encoding used you’ll need to include a Byte-Order Mark (BOM) yourself; the ‘write’ command won’t do it for you. And, of course, when you read it make sure you use the right encoding (if the file contains a BOM then smarter systems will determine the encoding themselves).
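
For example, assuming fileref is a file already opened with write permission and txt is your Unicode string, the two forms would look like this:

write txt to fileref as Unicode text -- produces a UTF16-encoded file (no BOM)
write txt to fileref as «class utf8» -- produces a UTF8-encoded file (no BOM)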

BTW, your post isn’t clear but it sounds like you’re executing an AppleScript script via osascript. If your script returns a Unicode text value, osascript will encode this value as UTF8 when it writes to stdout. And there’s no size limit on data passed via stdout [1]. If this is the case, you don’t need to muck around with temp files at all; just send the AppleScript’s result to stdout and be sure to handle any subsequent encoding/conversion issues correctly.
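
A minimal sketch to illustrate (the script name here is hypothetical; you’d run it with something like ‘osascript get_data.scpt’):

on run
	-- osascript encodes this result as UTF8 when it writes it to stdout
	return (the clipboard as Unicode text)
end run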

HTH

has

[1] BTW, there’s no size limit on data passed via stdin either, but both osascript and ‘do shell script’ perversely refuse to support passing data via stdin. Which is why you end up having to pass input data via a shell command (which [i]does[/i] have a size limit) or a temp file.

Thanks for your reply.

I tried using the command “write thedata to thenewfile as Unicode text”, but it doesn’t seem to do anything. I’ve tried doing searches for the filename I give it, but it doesn’t seem to exist. For clarity, here is my full code:

on run input
	tell application "Finder"
		set theoldfilepath to path to me as string
		set thefilepath to text items 1 thru -25 of theoldfilepath as string
		set thefilename to input as Unicode text
		set thenewfile to (thefilepath & thefilename)
		set thedata to the clipboard
		display dialog thenewfile
		display dialog thedata
		write thedata to thenewfile as Unicode text
	end tell
end run

The whole theoldfilepath business is there to extract the parent directory of the script (the script’s filename is 25 characters, which is why I delete 25 characters from the end). I then set my input (passed in via osascript, as you said) to the actual filename. To test the Unicode, I put some Unicode text on the clipboard. But with this method it doesn’t seem like anything happens at all after I display the dialogs.

To address your size limit point: you are correct, input/output normally have no size limits. The thing is, I’m using this in a widget via the widget.system command, which can only return up to 4K of information. That’s why I’m exporting to a file that the widget then reads directly.

You need to open a file with write permission before you can write to it. See Standard Additions’ ‘open for access’ and ‘close access’ commands.

Be aware that ‘path to me’ returns an alias to the current process, not your actual script. (Unlike Unix scripts, AppleScripts have no inherent knowledge of their position in the filesystem.) ‘path to me’ effectively acts as a synonym for ‘path to current application’; it’s actually a bug, but such a commonly used one it’s virtually become a ‘feature’. It’s often used in script applets, where script and applet are the same file. In fact, it’s such a common trick that some applications (e.g. the Smile editor) deliberately catch this event and actually return a path to the script being run instead of themselves. I’m not sure that osascript does this, however.

I’d expect osascript to raise an error on the ‘write’ command for the reason mentioned earlier.

As for writing the temp file, I’m not sure why you’d want to create it relative to your AppleScript as opposed to using something like ‘/tmp’. (In fact, if your AppleScript is part of your widget bundle you definitely shouldn’t be trying to write to files within that bundle.) The easiest thing would be for your widget to pass a full POSIX path to your AppleScript, which can then coerce it to a POSIX file and pass it to ‘open for access’ when writing the tempfile.
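
Something along these lines, as a rough sketch (the argument handling, path and clipboard use here are illustrative, not tested):

on run argv
	set tempPath to item 1 of argv -- full POSIX path passed in by the widget, e.g. "/tmp/mywidget_data.txt"
	set theData to the clipboard as Unicode text
	set fileRef to open for access (POSIX file tempPath) with write permission
	set eof of fileRef to 0 -- start with an empty file
	write theData to fileRef as Unicode text -- UTF16, no BOM; see the note about BOMs above
	close access fileRef
end run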

HTH

has

Ah, got it. For future reference, people always refer to “Standard Additions”. Excuse the newbie question, but where do I look at the Standard Additions?

Back to my original problem. Since I already had code for writing to a file, I went back to using that (the AppleScript write_to_file code in my first post). I did change the actual write command to: “write the_data to f starting at eof as Unicode text”. The resulting file is still in the wrong encoding. The English parts of the Unicode text appear fine, but all Unicode-specific characters turn into that gibberish I spoke of. I have tried writing it as UTF8 encoding too, but I still get gibberish (different gibberish, obviously, but still gibberish).

When I do path to me in the osascript, it gives me the path to the script file itself. I was using this to extract the path to the widget, as I’d rather any temp files be hidden from the user. Putting them in /tmp leaves them in pretty plain sight, even if the file is deleted once the widget gets the data. Regardless, for testing purposes it is currently being put in my root-level /Test/ folder, so permissions on bundles aren’t an issue.

I considered that it might be a problem with the source of the Unicode, but I just tested it with some Unicode I got straight from TextEdit and the Special Characters palette. The source is not the issue. What is odd is that the old code wrote to the file as “class of the_data”, so writing as Unicode text or as «class utf8» shouldn’t have mattered. It should have been written with the correct encoding regardless.

See Script Editor’s ‘Open Dictionary…’ command.

It’s only gibberish if you read it with the wrong encoding. Unlike ASCII or MacRoman or Latin1, which use a single byte to define each character, unicode uses several: two bytes per character in UTF16’s case, one or more in UTF8’s depending on the character. UTF8 is designed to be backwards-compatible with ASCII, btw, so UTF8 characters 0-127 are encoded with a single byte same as ASCII (i.e. high bit = 0); all other characters are encoded using 2 or more.

If you write a string to file as UTF8, you need to read it in as UTF8 as well, unless it contains only characters 0-127, in which case you can read it as ASCII or any other ASCII-compatible encoding. If you read in UTF8 data using, say, MacRoman encoding, any characters not in the 0-127 range will instead appear as two or more consecutive MacRoman characters in the 128-255 range. Which does look like gibberish, but very characteristic gibberish that should set alarm bells ringing, as it strongly suggests that a file is being read with the wrong encoding. e.g. If you read a UTF16-encoded file using MacRoman or another single-byte encoding, you’ll find each ASCII character mysteriously separated by an invisible ASCII 0 character as you right-cursor over it on your keypad.

Sounds like osascript’s been updated in 10.4.

Keeping temp files out of sight is what /tmp’s for. Self-modifying application bundles are very bad form on OS X.

You mean you saved a TextEdit file containing non-ASCII characters as UTF8/UTF16 and your widget was able to read it correctly? If that’s the case, it sounds like the file reading function your widget’s using automatically determines encoding based on the presence of a BOM at the start of the text file; TextEdit will include a BOM automatically, Standard Additions’ write command will not.

That will always write text to file encoded as UTF16. To include the appropriate BOM, use:

set fileRef to open for access (tempFilePath as POSIX file) with write permission
write (ASCII character 254 & ASCII character 255) to fileRef -- add 0xFEFF BOM
write theText to fileRef as Unicode text
close access fileRef
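
If you want a UTF8-encoded file instead, the equivalent would be something like this (untested sketch; the UTF8 BOM is the three bytes 0xEF 0xBB 0xBF):

set fileRef to open for access (tempFilePath as POSIX file) with write permission
write (ASCII character 239 & ASCII character 187 & ASCII character 191) to fileRef -- add UTF-8 BOM (0xEF 0xBB 0xBF)
write theText to fileRef as «class utf8»
close access fileRef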

HTH

has

Great! That seems to have solved it. Once I added that BOM identifier, the file was read correctly as Unicode. In actuality, I had the Unicode coming from the clipboard each time, not being read from a text file. To simplify things, I was leaving the widget out of the picture till I got the Unicode working. Anyway, thanks so much for your help, hhas.