I have a text that contains this string: a “Switch” in Time
After a few rounds of reading from/writing to a file, the same text becomes something like:
The PDF matched the journal with the title: A ‚Äö
And it keeps “growing” if I don’t stop it.
The problem is the curly double quotes (“ ”), and I realised that this can be solved if I use «class utf8» for reading/writing to a file.
My question is: what are the advantages/disadvantages of always using «class utf8» for reading/writing to a file?
Thanks!
For historical reasons, AppleScript’s ‘read’ and ‘write’ commands default to reading files as Mac Roman-encoded text and writing text to them in that form too. But most text files nowadays, on the Mac at least, are encoded as UTF-8 Unicode text. So it’s a good idea to use as «class utf8» routinely where the data are text. That way, you’re always covered for non-ASCII characters. Strangely, Apple has never got round to implementing a keyword to go with the «class utf8» token.
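For example, here’s a minimal sketch of round-tripping some text through a file with «class utf8» (the desktop file name is just for illustration):

set theFile to (path to desktop as text) & "sample.txt"
set fileRef to open for access file theFile with write permission
try
	set eof of fileRef to 0 -- empty the file first
	write "a “Switch” in Time" to fileRef as «class utf8»
	close access fileRef
on error errMsg
	close access fileRef
	error errMsg
end try
set theText to read file theFile as «class utf8» -- the curly quotes come back intact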
The other caveat is that, because UTF-8 uses multi-byte characters, it should only be used to directly read whole files. If you try something like read from or read to, you run the risk of splitting a character and getting an error.
There’s no such limitation in writing to a file, though.
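To illustrate the read caveat, reusing theFile from the sketch above (the 2-to-4 byte range is arbitrary, and for that sample text it happens to land inside the multi-byte curly quote, so the partial read may fail):

set wholeText to read file theFile as «class utf8» -- safe: the whole file is decoded at once
try
	-- a byte-based partial read can split a multi-byte character
	set partialText to read file theFile from 2 to 4 as «class utf8»
on error errMsg
	log "partial UTF-8 read failed: " & errMsg
end try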
To determine the encoding of a file, you can use a try block. If the file is in some other encoding and you try to read it as «class utf8», an error should occur. Catch this error to fix the problem.
You can convert the file’s encoding with the shell command iconv.
A little example (assuming the fallback encoding is CP1252):
set theFile to choose file -- a file that may be CP1252-encoded rather than UTF-8
try
	set theText to read theFile as «class utf8»
on error
	-- reading as UTF-8 failed, so convert the file from CP1252 to UTF-8 with iconv
	set theCommand to "iconv -f CP1252 -t UTF-8 "
	set theText to do shell script theCommand & quoted form of POSIX path of theFile
end try
UTF-8. In your sample, they’re exactly the same, but add something like ‘ or ” in there, and you can see for yourself.
One of the things about UTF-8 is that the way it encodes characters outside the ASCII range is very specific, so if you try to read a file as UTF-8 and it actually uses some other encoding, you will get an error, as KniazidisR says. The exception, of course, is files that consist solely of the lower (ASCII) characters common to most encodings, where the choice of encoding effectively becomes irrelevant.
So if you’re in doubt, the normal approach is to try UTF-8 first, and if you get an error, fall back to some other encoding.
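For example, a minimal sketch of that fallback (assuming the plain default read, i.e. Mac Roman, is an acceptable second guess for the file in hand):

set theFile to choose file -- any text file whose encoding is uncertain
try
	set theText to read theFile as «class utf8»
on error
	-- not valid UTF-8, so fall back to the default read (Mac Roman)
	set theText to read theFile
end try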