I’m making a program that will read a certain kind of file. These files must be read byte by byte, so I use a Java utility to do it. The problem is the higher ASCII characters. For example, when Java reads the ö character, it reads it from byte F6 (Unicode encoding?). However, when I write the file again with AS, I am forced to write as a string to avoid all those extra null characters, and it writes that character as byte 9A because of the Mac OS Roman encoding that my system uses. Is there any way to save the text in the same encoding it started with? Or maybe some way to just get rid of all the null characters that writing as Unicode puts in there?
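For context, the byte-level reading is essentially this (a simplified sketch; the class name and the hex dump are just for illustration, not my actual reader):

import java.io.FileInputStream;
import java.io.IOException;

public class ByteDump {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream(args[0]);
        int b;
        while ((b = in.read()) != -1) {
            // Each byte prints as two hex digits; an ö is F6 in the source
            // file but comes back out as 9A (Mac OS Roman) when AS writes it.
            System.out.printf("%02X ", b);
        }
        in.close();
    }
}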
Those null characters are supposed to be there; they are part of UTF-16 (Unicode text in AS). A likely problem is that the written output doesn’t have a BOM to indicate that it is UTF-16. You might also consider writing as UTF-8 («class utf8» in AS).
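If it helps, you can see exactly what each encoding produces for ö with a quick throwaway Java check (not part of your program, just a demo):

import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        String s = "\u00F6"; // ö
        for (String name : new String[] {"UTF-16", "UTF-16BE", "UTF-8", "ISO-8859-1"}) {
            byte[] bytes = s.getBytes(Charset.forName(name));
            System.out.print(name + ":");
            for (byte b : bytes) {
                System.out.printf(" %02X", b & 0xFF);
            }
            System.out.println();
        }
    }
}

// Output:
// UTF-16:     FE FF 00 F6   (BOM, then the character)
// UTF-16BE:   00 F6
// UTF-8:      C3 B6
// ISO-8859-1: F6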
UTF-8 writes ö as C3 B6, which is one byte too many, and Unicode writes it as 00 F6, which is fine except for the null at the front. I know the null characters are supposed to be there in text files, but the files I’m writing aren’t just text; they are tree structures with a very rigidly defined byte layout. Adding null characters between every byte would destroy the structure of the file.
My goal is to open a file into a readable format with my program, save it without making any changes, and have the result be identical byte for byte.
What encoding is your source file?
The file that the Java program reads is created by another Java program using FileOutputStream.write. The Java program that reads the tree format looks at each group of a few bytes and prints the result to standard out using a PrintStream initialized to UTF-8. But I get the text into the AS app with
do shell script "java reader " & (quoted form of inpFile)
Once inside the application, the text is split on text item delimiters and stored in a table view so it can be edited. It is then written to the output file as a string.
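The reader end of that pipeline is wired roughly like this (just a sketch; the class name matches the shell command above, but the real program decodes tree nodes instead of echoing raw hex):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;

public class reader {
    public static void main(String[] args) throws IOException {
        // Wrap standard out in a PrintStream that always writes UTF-8,
        // so the text that `do shell script` captures has a known encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        FileInputStream in = new FileInputStream(args[0]);
        int b;
        while ((b = in.read()) != -1) {
            // The real program interprets each group of a few bytes as a tree
            // node and prints something readable; here each byte is just echoed.
            out.printf("%02X%n", b);
        }
        in.close();
    }
}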
Yeah, this program is complicated :D.
If there’s a better way to get the text from the Java program, I would be glad to hear that too. But I posted a thread about it a few days ago in the Xcode forum and nobody replied, so I assume there isn’t one.
Problem solved. I wrote another Java utility that takes the string and casts each character to a byte individually. The normal String.getBytes() method didn’t work because it uses my computer’s default encoding (Mac OS Roman). I suppose the cast works because Java stores characters internally as Unicode values, so the low byte of each char is exactly the byte I want.
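The utility boils down to something like this (simplified; the way the string actually gets passed in is omitted, and the class and argument names are just placeholders):

import java.io.FileOutputStream;
import java.io.IOException;

public class CharsToBytes {
    public static void main(String[] args) throws IOException {
        String text = args[0];
        byte[] bytes = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            // Cast each char to a byte individually: U+00F6 (ö) becomes F6,
            // independent of the platform's default encoding.
            bytes[i] = (byte) text.charAt(i);
        }
        FileOutputStream out = new FileOutputStream(args[1]);
        out.write(bytes);
        out.close();
    }
}

Note that this only works because every character in these files is below U+0100; anything higher would be truncated by the cast.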
If there were an operation like "ASCII number of" for Unicode text, this whole thing could have been avoided. Sigh. . . .
Would either of these be helpful (altering them if needed)?
http://bbs.applescript.net/viewtopic.php?pid=55869#p55869
http://mjtsai.com/blog/2003/10/04/unicode_applescript_strin/