Byte Order Mark of UTF-16 text encoding

StefanK · February 3, 2007, 4:48pm

Hi,

I’m just testing a few things with different text encodings
in conjunction with AppleScript’s read/write file capabilities.

I found this routine by jj, which also handles the Byte Order Mark of Unicode text.


-- Convert a plain text file to a utf-16 (aka "Unicode text") file

textfile2utf16(choose file, "BE")

to textfile2utf16(theFile, BOM)
	set oldContents to (read theFile)
	set f to (open for access theFile with write permission)
	set eof of f to 0
	if BOM is not "" then write (ASCII number 254) & (ASCII number 255) to f
	write oldContents to f as Unicode text starting at eof
	close access f
end textfile2utf16

But instead of FE FF it writes this byte sequence (hexadecimal view):

6C 69 73 74 00 00 00 02 6C 6F 6E 67 00 00 00 04
00 00 00 32 6C 6F 6E 67 00 00 00 04 00 00 00 32

which is displayed as: listlong2long2

Does anybody know what this means or where it comes from?

Adam_Bell · February 3, 2007, 7:17pm

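I think it’s how write works. If you write something as a record, for example, there’s also a preamble indicating that the following data is a record. I think listlong2long2 refers to a list of two 32-bit integers: in the dump, each “long” is followed by a 4-byte length (00 00 00 04) and a 4-byte value (00 00 00 32).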

This:

set f to open for access ((path to desktop as text) & "record") with write permission
set eof of f to 0
write {name:"Adam C Bell"} as record to f
close access f

Produces this text: “reco[unprintable char]pnamTEXT[unprintable char] Adam C Bell”

StefanK · February 3, 2007, 7:34pm

Exactly!

I copied jj’s script without checking it, but there is an error in it:

instead of (ASCII number 254) & (ASCII number 255)
you should use (ASCII character 254) & (ASCII character 255)

:D:D

Problem solved, thanks Adam
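
For anyone who lands here later, a minimal sketch of why the two spellings behave so differently. It assumes, going by the hex dump above, that ASCII number turns the integer 254 into the text "254" and returns the code of its first character, "2", which is 50 (hex 32). Joining two integers with & then gives a list, and write stores that list with its list/long preamble. The variable names are only for illustration.

```applescript
-- ASCII number expects a character and returns its code as an integer.
-- Given the integer 254 it appears to look at the text "254" instead,
-- so each call returns an integer (50 for "2", judging by the dump above).
set wrongBOM to (ASCII number 254) & (ASCII number 255)
set wrongClass to class of wrongBOM -- integer & integer concatenates to a list

-- ASCII character takes a code and returns the one-character string for it.
set rightBOM to (ASCII character 254) & (ASCII character 255)
set rightClass to class of rightBOM -- string & string stays a string

-- Result of the script: the two classes side by side.
{wrongClass, rightClass}
```

Writing rightBOM to the file puts the two raw bytes FE FF there, which is what the routine intended.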

Adam_Bell · February 3, 2007, 11:48pm

In Nigel Garvey’s paper in unScripted, he showed that write has an as parameter similar to an AppleScript coercion: it causes the data to be written to the file as some type other than what it already is. Without it, the item is written in a form that represents whatever it already is, which isn’t necessarily the same as its AppleScript format. For example, you can write plain text “as Unicode text”, but you will then have to read it that way too.

As another example, he wrote that two of the thirty-two bits in an AppleScript integer are devoted to the code that identifies it as being an AppleScript integer. (That’s why AppleScript integers only have 30-bit signed values.) But an integer value written to file will be a full 32 bits wide.

With this parameter, write can mimic some of the coercions the AppleScript language can do, and can do a few that the language can’t. For instance, not only can reals be written to file as integer, or integers as real; but either of these (or their text equivalents) can be written as double integer (eight bytes), as extended real (ten bytes), as short (two bytes), or as small real (four bytes), none of which exist in the AppleScript language itself. (as short can also be rendered as short integer or as small integer.) If a number’s written as a type that’s too small to hold it, information will be lost, typically the high-order bytes of an integer or the precision of a real. These non-AppleScript number classes are really for specialist use.

When numbers are written to a file as string or as Unicode text, the text number produced has greater precision, and absorbs more digits before being rendered as “scientific notation”, than the result of the equivalent AppleScript coercion.
The AppleScript values true and false can’t be written to file as themselves unless they’re in a list or a record, but they can be written discretely as boolean (!), which in this case is a single-byte value of 1 or 0.

His article is here: http://macscripter.net/articles/437_0_10_0_C/
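
A minimal sketch of the sizes quoted above, assuming they hold on your system. It writes the same number as a plain integer and as a small integer, then a boolean, and asks for the resulting file length. The file name "WriteAsDemo" and the variable names are made up for the example.

```applescript
-- Hypothetical scratch file on the desktop; the name is only for this example.
set demoPath to (path to desktop as text) & "WriteAsDemo"
set f to open for access file demoPath with write permission
try
	set eof of f to 0 -- start from an empty file
	write 50 to f -- plain integer: four bytes, per the article
	write 50 as small integer to f -- two bytes, per the article
	write true as boolean to f -- one byte, per the article
	set byteCount to (get eof f) -- total number of bytes written so far
	close access f
on error errMsg
	close access f -- release the file even after a failure
	error errMsg
end try
byteCount
```

If the sizes in the article hold, byteCount should come out as 7 (4 + 2 + 1). A different result would mean the sizes differ on that system.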

StefanK · February 4, 2007, 10:16am

Awesome article !!

During my tests I discovered that the write command doesn’t automatically write the Byte Order Mark at the
beginning of the file if the text class is Unicode text.

TextEdit, for example, doesn’t recognize a Unicode plain text file properly without the BOM.
This demonstrates it:


write_Unicode(((path to desktop) as string) & "WriteToFileBOMfalse", false)
write_Unicode(((path to desktop) as string) & "WriteToFileBOMtrue", true)

on write_Unicode(f, BOM)
	try
		set datastream to open for access file f with write permission
		if BOM then write (ASCII character 254) & (ASCII character 255) to datastream
		write "Smørebrød ≥ Déja Vù ≈ π" & return to datastream as Unicode text
	end try
	close access datastream
	
	tell application "TextEdit" to open f
end write_Unicode

Adam_Bell · February 4, 2007, 1:30pm

Slick, Stefan. BBEdit doesn’t either - in fact no word processor I tried does. Nice catch.

StefanK · February 4, 2007, 3:27pm

Therefore it’s reasonable to add the BOM when writing Unicode data.
I’ll post a universal write_to_file routine in Code Exchange.

Adam_Bell · February 4, 2007, 3:36pm

Excellent! Thanks.