I have a text that contains this string: a “Switch” in Time
After a few rounds of reading from/writing to a file, the same text becomes something like:
The PDF matched the journal with the title: A ‚Äö
And it keeps “growing” if I don’t stop it.
The problem is the curly double quotes (“ ”), and I realised that this can be solved if I use «class utf8» for reading/writing to a file.
My question is: what are the advantages/disadvantages of always using «class utf8» for reading/writing to a file?
Thanks!
For historical reasons, AppleScript’s ‘read’ and ‘write’ commands default to reading files as Mac Roman-encoded text and writing text to them in that form too. But most text files nowadays, on the Mac at least, are encoded as UTF-8 Unicode text. So it’s a good idea to use as «class utf8» routinely where the data are text. That way, you’re always covered for non-ASCII characters. Strangely, Apple has never got round to implementing a keyword to go with the «class utf8» token.
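For example, here’s a minimal sketch of round-tripping some text through a file with «class utf8» (the desktop file name is just for illustration):

set theFile to (path to desktop as text) & "sample.txt"
set fileRef to open for access file theFile with write permission
try
	set eof of fileRef to 0 -- empty the file first
	write "a “Switch” in Time" to fileRef as «class utf8»
	close access fileRef
on error errMsg
	close access fileRef
	error errMsg
end try
set theText to read file theFile as «class utf8» -- the curly quotes come back intact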
The other caveat is that, because UTF-8 uses multi-byte characters, it should only be used to directly read whole files. If you try something like read from or read to, you run the risk of splitting a character and getting an error.
There’s no such limitation in writing to a file, though.
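To illustrate the read caveat, reusing theFile from the sketch above (the 2-to-4 byte range is arbitrary, and for that sample text it happens to land inside the multi-byte curly quote, so the partial read may fail):

set wholeText to read file theFile as «class utf8» -- safe: the whole file is decoded at once
try
	-- a byte-based partial read can split a multi-byte character
	set partialText to read file theFile from 2 to 4 as «class utf8»
on error errMsg
	log "partial UTF-8 read failed: " & errMsg
end try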
To determine the encoding of a file, you can use a try block. If the file is in some other encoding and you try to read it as «class utf8», an error should occur. Catch this error to fix the problem.
You can convert the file’s encoding with the shell command iconv.
A little example (assuming the fallback encoding is CP1252):
set theFile to choose file -- a file that may be CP1252-encoded rather than UTF-8
try
	set theText to read theFile as «class utf8»
on error
	-- reading as UTF-8 failed, so convert the file from CP1252 to UTF-8 with iconv
	set theCommand to "iconv -f CP1252 -t UTF-8 "
	set theText to do shell script theCommand & quoted form of POSIX path of theFile
end try
UTF-8. In your sample, they’re exactly the same, but add something like ‘ or ” in there, and you can see for yourself.
One of the things about UTF-8 is that the way it encodes characters outside the ASCII range is very specific, so if you try to read a file as UTF-8 and it actually uses some other encoding, you will get an error, as KniazidisR says. The exception, of course, is files that consist solely of the lower (ASCII) characters common to most encodings, where the choice of encoding effectively becomes irrelevant.
So if you’re in doubt, the normal approach is to try UTF-8 first, and if you get an error, fall back to some other encoding.
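For example, a minimal sketch of that fallback (assuming the plain default read, i.e. Mac Roman, is an acceptable second guess for the file in hand):

set theFile to choose file -- any text file whose encoding is uncertain
try
	set theText to read theFile as «class utf8»
on error
	-- not valid UTF-8, so fall back to the default read (Mac Roman)
	set theText to read theFile
end try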