I am trying to do the following: I have a text in ISO-8859-1, and I want to convert it to Unicode so it is readable in AppleScript. I use Perl to convert encoded characters (e.g. &#225;) into hexadecimal encoding. Then I tried the following, as found at http://bbs.macscripter.net/viewtopic.php?id=16638:
set cherokeeNA to «data utxt13BE» as Unicode text
display dialog "CHEROKEE LETTER NA: " & cherokeeNA
This works. However, if I replace 13BE with a variable as shown below, it no longer works. Instead, it displays some kind of Chinese character.
set nHex to "13BE" as Unicode text
set cherokeeNA to run script "«data utxt" & nHex & "»" --as Unicode text
display dialog "CHEROKEE LETTER NA: " & cherokeeNA
Text encoding in AppleScript is quite a delicate thing.
It’s not clear where your text comes from (a webpage, a text file, the clipboard, etc.),
so there is no common solution.
A good way to convert various text formats is the shell command textutil.
The text comes from a website. But the funny thing is that the code I showed does not work on my machine: it returns some weird Chinese characters. Do the Language and International settings in System Preferences have an effect on this?
“Weird Chinese characters” appear when an 8-bit encoding like MacRoman is read as 16-bit code (UTF-16).
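To make that effect visible outside AppleScript, here is a minimal Python 3 sketch. Python and the sample string "hi" are my own choices for illustration and are not part of the original scripts. It encodes two characters as 8-bit MacRoman and then deliberately decodes the same bytes as big-endian UTF-16, so each pair of bytes is fused into a single 16-bit code unit.

```python
# Two characters of 8-bit text: the bytes 0x68 0x69.
eight_bit = "hi".encode("mac_roman")

# Misreading the same bytes as big-endian UTF-16 fuses the byte pair
# into one 16-bit code unit: 0x68 0x69 becomes U+6869.
misread = eight_bit.decode("utf-16-be")

print(len(misread))        # one character instead of two
print(hex(ord(misread)))   # 0x6869, inside the CJK range U+4E00..U+9FFF
print(misread)
```

Because most pairs of ASCII bytes land between U+4E00 and U+9FFF, misread 8-bit text tends to show up as Chinese-looking characters.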
You have to determine the text encoding of the website.
Here is an example that reads the HTML code of a website and converts it from ISO 8859-1 to UTF-16 and from HTML to plain text:
do shell script "/usr/bin/curl http://epguides.com/ATeam/ | /usr/bin/textutil -stdin -stdout -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16"
Note: the -stdin flag works only in Leopard; in Tiger you have to write to a temporary text file first.
You can also consider the “iconv” command-line tool, which is in my opinion the ideal tool for character conversion.
do shell script "/usr/bin/curl http://epguides.com/ATeam/ | /usr/bin/iconv -f iso-8859-1 -t UTF-16 >" & space & quoted form of POSIX path of (((path to desktop folder) as Unicode text) & "myPage.htm")
To get the (huge) list of all supported encodings, type “iconv -l” (lowercase L) in a terminal.
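As a rough sketch of what a charset conversion such as `iconv -f iso-8859-1 -t UTF-16` does to the data, the following Python 3 snippet decodes one ISO-8859-1 byte and re-encodes it. Python and the sample byte are assumptions for illustration. I use the explicit `utf-16-be` codec so the output has a fixed byte order and no byte order mark; Python's plain `utf-16` codec would prepend a BOM.

```python
# One ISO-8859-1 byte: 0xF4 is "o with circumflex".
latin1_bytes = b"\xf4"

# Decode with the source encoding, then re-encode in the target encoding.
text = latin1_bytes.decode("iso-8859-1")
utf16_bytes = text.encode("utf-16-be")  # explicit byte order, no BOM

print(text)                # ô
print(utf16_bytes.hex())   # 00f4
```

Note that this only remaps bytes to characters. Markup inside the text, such as HTML entities, passes through unchanged.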
Thanks for the tips, but when I try your example with iconv, I see that the line with copyright info at the bottom of the HTML page still contains an unconverted (c) symbol. Is this not exactly what the conversion should fix?
Update:
If I understand correctly, all I want to do is convert from HTML to text: replacing the ampersand-encoded entities with their original characters. What is the easiest way to do this? Also textutil? But my solution has to work on both Leopard and Tiger.
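To show that decoding entities is a separate step from converting the character encoding, here is a minimal Python 3 sketch using the standard library function `html.unescape`. The sample string is made up, and whether Python 3 is available on a given Leopard or Tiger machine is an assumption, so treat this as an illustration of the step rather than a drop-in solution.

```python
import html

# Made-up sample containing a numeric entity, a named entity and an
# escaped ampersand.
sample = "caf&#233; &copy; 2008 &amp; more"

# html.unescape replaces numeric and named character references
# with the characters they stand for.
decoded = html.unescape(sample)

print(decoded)  # café © 2008 & more
```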
Also, the source of my text is not important here, I guess? The question is: why do the two code fragments in my original post produce different results? As soon as I do
set cherokeeNA to run script "«data utxt13BE»"
I get a Chinese character. Apparently this is not normal and it really should return the Cherokee letter (as described at http://bbs.macscripter.net/viewtopic.php?id=16638). That is why I am thinking this is influenced by the International settings in System Preferences. But that is not set to Chinese.
Thanks, this explains it! Now I have this code: the first lines work in Leopard, the ones in the error block in Tiger.
Will the complete block work correctly on Tiger? Probably not, as it will never reach the error block but continue with a wrong character?
try
	set nDec to 244 -- which should be an o with a hat on top of it (ô)
	set uChar to string id nDec
on error
	set nHex to do shell script "perl -e 'printf(\"%04X\", " & nDec & ")'" -- convert decimal to hex
	-- -CO makes perl write UTF-8; without it, code points below 256 are printed as a single raw byte
	set uChar to do shell script "perl -CO -e 'print \"\\x{" & nHex & "}\"'"
end try
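For comparison, the two steps of the fallback (decimal to hex, then code point to character) can be sketched in Python 3 as follows. The variable names mirror the AppleScript above. Python is my choice for illustration, and the final UTF-8 step reflects my assumption that the consumer of the output expects UTF-8 bytes.

```python
n_dec = 244                   # decimal code point, "o with circumflex"

n_hex = "%04X" % n_dec        # decimal to 4-digit hex
u_char = chr(n_dec)           # code point to character
utf8 = u_char.encode("utf-8") # bytes to emit for a UTF-8 consumer

print(n_hex)       # 00F4
print(u_char)      # ô
print(utf8.hex())  # c3b4
```

The character takes two bytes in UTF-8 (C3 B4), so writing the single byte F4 would not be valid UTF-8 on its own.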