Semi-Noob's Stupid Unicode Question

Hi,

Would somebody be so kind as to tell me how to return the Unicode value of a character, for example “00DC” for “Ü”?
I’ve been trying to find out how to do it for more than three hours now, and the more I try, the more confused I get.
I’m pretty sure I figured out how to do it by myself before, but I just can’t remember.

Thanks in advance.

Using TextCommands:

  • Unicode hex number(s):

    pack (convert from unicode unicodeText to “utf16”) using hex encoding
    → hex string (including BOM)

  • Unicode decimal number(s):

    unicode numbers unicodeText
    → list of integer

  • Unicode hex character(s):

    convert to unicode (unpack hexString using hex encoding) from “utf16”
    → Unicode text

  • Unicode decimal character(s):

    unicode characters integerList
    → Unicode text

HTH
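For anyone who wants to cross-check results outside AppleScript, the four conversions listed above can be mirrored roughly as follows in Python 3. This is only a sketch, not TextCommands itself: the function names are made up for illustration, the hex output has no BOM (unlike the "utf16" pack result noted above), and the decimal versions work on code points, which match UTF-16 code units only for characters in the Basic Multilingual Plane.

```python
def text_to_hex(text: str) -> str:
    """Unicode hex number(s): UTF-16BE code units as uppercase hex, no BOM."""
    return text.encode("utf-16-be").hex().upper()


def text_to_numbers(text: str) -> list:
    """Unicode decimal number(s): one integer code point per character."""
    return [ord(ch) for ch in text]


def hex_to_text(hex_string: str) -> str:
    """Unicode hex character(s): decode a UTF-16BE hex string (no BOM)."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


def numbers_to_text(numbers: list) -> str:
    """Unicode decimal character(s): build text from integer code points."""
    return "".join(chr(n) for n in numbers)
```

With these definitions, text_to_hex("Ü") gives "00DC" and text_to_numbers("Ü") gives [220].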

This is a quick hack:

UnicodeCharToHex("Ü" as Unicode text) --> "00DC"
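to UnicodeCharToHex(u)
	try
		-- Coerce a list of record to string (the buggy coercion), keep the last
		-- two bytes (the character's UTF-16 data), then force an error
		-- by multiplying the C string by a number.
		((text -2 thru -1 of ({{a:u}} as string)) as C string) * 5
	on error msg
		-- The error message shows the data as hex after "cstr"; cut it out.
		text ((offset of "cstr" in msg) + 4) thru ((offset of "00»" in msg) - 1) of msg
	end try
end UnicodeCharToHex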

Note that it relies on the “as C string” coercion, and C string is now a deprecated data type (but it still works in Tiger, and I think it will keep working as long as AS itself isn’t redesigned from scratch, since we still have lots of things in the dictionary from ancient times…).
For more powerful conversions (long text), you may use something like:

UnicodeTextToHex("Üchá" as Unicode text) --> "00DC0063006800E1"

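to UnicodeTextToHex(u)
	-- Write the Unicode text to a scratch file, emptying the file first.
	set q to (open for access ("/tmp/u.txt" as POSIX file) with write permission)
	set eof of q to 0
	write u to q
	close access q
	-- Read the bytes back as raw «class paca» data.
	read ("/tmp/u.txt" as POSIX file) as «class paca»
	try
		-- Multiplying the data by a number fails on purpose.
		result * 5
	on error msg
		-- The error message shows the data as hex after "paca"; cut it out.
		text ((offset of "paca" in msg) + 4) thru ((offset of "»" in msg) - 1) of msg
	end try
end UnicodeTextToHex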

This may work fine for, e.g., 100 KB of Unicode text. If you need more power (speed), you may use specialized tools (such as TextCommands, which is not as portable as a handler but beats this routine in speed).

And, be warned, the quick hack is also an exploit of a naughty AppleScript bug (the list-of-record to string coercion).

If you need a vanilla solution, the easiest thing (as usual) is to use a shell script. e.g.:

on unicodeToHex(txt)
	return do shell script (("python -c \"import sys; print unicode(sys.argv[1], 'utf8').encode('UTF-16BE').encode('hex')\" " as Unicode text) & quoted form of (txt as Unicode text))
end unicodeToHex
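The one-liner above uses Python 2 syntax (the print statement, unicode() and .encode('hex')), none of which exist in Python 3. A minimal Python 3 sketch of the same conversion might look like this. The function name is made up, and the output is uppercased to match the other handlers in this thread (the original one-liner prints lowercase hex).

```python
import sys


def unicode_to_hex(text: str) -> str:
    """Return the UTF-16BE code units of text as an uppercase hex string."""
    return text.encode("utf-16-be").hex().upper()


if __name__ == "__main__" and len(sys.argv) > 1:
    # The text arrives as the first command-line argument,
    # as in the do shell script call above.
    print(unicode_to_hex(sys.argv[1]))
```

Assuming a python3 binary is installed, the same body could be passed via python3 -c in the handler, or saved to a script file and called with the quoted text as its argument.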

You’ll be limited in the amount of data you can convert, of course, unless you want to muck about with temp files instead of passing it on the command line. Curse Apple’s wretched ‘do shell script’ command for its continuing lack of stdin support, and go file a feature request on it.
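If you do go the temp-file route, the Python side could be sketched roughly like this (Python 3). It assumes the AppleScript side has already written the text to a file as UTF-8; the function name and the idea of passing the file path as the only argument are illustrative choices, not anything built into do shell script.

```python
import io
import sys


def hex_from_utf8_stream(stream: io.BufferedIOBase) -> str:
    """Read UTF-8 bytes from a binary stream and return UTF-16BE hex."""
    raw = stream.read()
    return raw.decode("utf-8").encode("utf-16-be").hex().upper()


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the path of the temp file written by the caller.
    with open(sys.argv[1], "rb") as f:
        print(hex_from_utf8_stream(f))
```

Because only the file path goes on the command line, the length of the text is no longer bounded by the argument-size limit.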

Folks, I don’t know what to say… You’re great. Thank you so much. Looking at some of your code makes me doubt I ever did it all by myself.
At first I had a little trouble using your code, but then I realized that “Tex-Edit Plus” (which I like scripting for its nice search/replace functions) is not good at handling Japanese/Chinese characters. I’m using BBEdit now and everything is working A-OK. Thanks a lot!

Lars