converting text with encodings

hi,

I am trying to do the following: I have text in ISO-8859-1, and I want to convert it to Unicode so that it is readable in AppleScript. I use Perl to convert encoded characters (e.g. &#225;) into hexadecimal notation. Then I tried the following, as found at http://bbs.macscripter.net/viewtopic.php?id=16638


set cherokeeNA to «data utxt13BE» as Unicode text
display dialog "CHEROKEE LETTER NA: " & cherokeeNA

This works. However, if I replace 13BE with a variable as shown below, it no longer works. Instead, it returns some kind of Chinese character.


set nHex to "13BE" as Unicode text
set cherokeeNA to run script "«data utxt" & nHex & "»" --as Unicode text
display dialog "CHEROKEE LETTER NA: " & cherokeeNA

The script at the bottom of http://bbs.macscripter.net/viewtopic.php?id=16638 does something similar and it also does not work on my machine (Leopard 10.5.2).

Any suggestions?

thanks

Model: iMac
AppleScript: 2.0
Browser: Safari 525.13
Operating System: Mac OS X (10.5)

Hi Tom,

Text encoding in AppleScript is quite a delicate thing.
It’s not clear where your text comes from (a webpage, a text file, the clipboard, etc.),
so there is no general solution.
A good way to convert between various text formats is the shell command textutil:

http://www.hmug.org/man/1/textutil.php

hi,

The text comes from a website. But the funny thing is that the code I showed does not work on my machine: it returns some weird Chinese characters. Do the Language and International settings in System Preferences have an effect on this?

thanks!

“Weird Chinese characters” appear when an 8-bit encoding such as MacRoman is read as 16-bit code (UTF-16).
You have to determine the text encoding of the website.
Here is an example that reads the HTML code of a website and converts it from ISO-8859-1 to UTF-16 and from HTML to plain text:


do shell script "/usr/bin/curl http://epguides.com/ATeam/ | /usr/bin/textutil -stdin -stdout -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-16"

Note: the -stdin flag works only in Leopard; in Tiger you have to write to a temporary text file.
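
To illustrate the point above about 8-bit text being read as 16-bit code, here is a minimal sketch. It assumes an iconv that knows the name UTF-16BE; misread_as_utf16 is a hypothetical helper name, not an existing tool. It only demonstrates this kind of misreading and is not a diagnosis of the run script case.

```shell
#!/bin/sh
# Hypothetical helper: treat the raw bytes of the argument as big-endian
# UTF-16 and print the resulting characters as UTF-8.
misread_as_utf16() {
    printf '%s' "$1" | iconv -f UTF-16BE -t UTF-8
}

# "hi" is the two bytes 0x68 0x69. Read as one 16-bit unit they become
# U+6869, which lies in the CJK Unified Ideographs block.
misread_as_utf16 "hi"
echo
```

Two harmless Latin letters collapse into a single CJK ideograph, which is the typical symptom.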

hello,

You can also consider the “iconv” command-line tool, which is in my opinion the ideal tool for character conversion.

do shell script "/usr/bin/curl http://epguides.com/ATeam/ | /usr/bin/iconv -f iso-8859-1 -t UTF-16 >" & space & quoted form of POSIX path of (((path to desktop folder) as Unicode text) & "myPage.htm")

To get the (huge) list of all supported encodings, type “iconv -l” (lowercase L) in a terminal.

http://www.hmug.org/man/1/iconv.php

hth
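
A small sketch of what the conversion does at the byte level, in case it helps. It targets UTF-8 instead of UTF-16 because do shell script reads command output as UTF-8; latin1_to_utf8 is a hypothetical wrapper name, and the sample bytes are made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: convert ISO-8859-1 bytes on stdin to UTF-8 on stdout.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# \351 is the octal escape for byte 0xE9, which is "é" in ISO-8859-1.
# After conversion each "é" becomes the two UTF-8 bytes 0xC3 0xA9.
printf '\351t\351\n' | latin1_to_utf8
```

Note that iconv only re-encodes bytes; it does not touch HTML entities such as &#233; in the page source.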

hi,

Thanks for the tips, but when I try your example with iconv, I see that the line with copyright info at the bottom of the HTML page still has an unconverted (c) symbol. Isn’t this exactly what the conversion should fix?

update:
If I understand correctly, all I want to do is convert from HTML to text, replacing the ampersand-encoded entities with their original characters. What is the easiest way to do this? Also textutil? But my solution has to work on both Leopard and Tiger.

Also, I guess the source of my text is not important here. The question is: why do the two code fragments in my original post produce different results? As soon as I do

set cherokeeNA to run script "«data utxt13BE»"

I get a Chinese character. Apparently this is not normal, and it really should return the Cherokee letter (as described at http://bbs.macscripter.net/viewtopic.php?id=16638). That is why I suspect this is influenced by the International settings in System Preferences. But they are not set to Chinese :(

Since Leopard, the native text class of AppleScript is Unicode text,
so these weird second-level evaluations are history now.


set nDez to 5054 -- decimal for hex 13BE
display dialog "CHEROKEE LETTER NA: " & string id nDez
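
Since string id takes a decimal code point, a value that is only available in hex has to be converted first. A minimal sketch using the shell's printf; hex_to_dec is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the decimal value of a hexadecimal code point
# given without the 0x prefix.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

hex_to_dec 13BE   # prints 5054
```

The result could be fetched with do shell script and coerced with "as integer" before being handed to string id.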

hi,

Thanks, this explains it! Now I have this code: the first lines work in Leopard, the ones in the error block in Tiger.
Will the complete block work correctly on Tiger? Probably not, as it may never reach the error block and instead continue with a wrong character?


try
	set nDec to 244 -- which should be an o with a hat on top of it
	set uChar to string id nDec
on error
	set nHex to do shell script "perl -e 'printf(\"%04X\", " & nDec & ")'" -- convert decimal to hex
	set uChar to do shell script "perl -e 'print \"\\x{" & nHex & "}\"'"
end try

The dec-hex conversion is not the crucial point.

Without second-level evaluation, this should work in both Tiger and Leopard:


set d to «data utxt00F4» as Unicode text
display dialog d

And even this script works on my Tiger machine:


set vHex to "00F4"
set o to run script "«data utxt" & vHex & "»"
display dialog o