Unicode - hex/dec value

Hello All,

Is there a way to get the Unicode hex or Unicode decimal value of a given Unicode character?

ex: “€” → U+20AC or 8364.

Or do you know of a shell tool that's able to convert 0xE282AC (UTF-8 hex) into U+20AC (Unicode hex)?

Thank you for your assistance.


Model: 1.67GHz PPC G4
AppleScript: 1.10.7
Browser: Firefox 2.0.0.12
Operating System: Mac OS X (10.4)

Hi,

In Leopard it's very easy, because it has full Unicode support:

id of "€" --> 8364

In Tiger, AppleScript works internally only with MacRoman, so I guess you need a tool like Unicode Checker (which is scriptable).

Thank you, Stefan, for your fast input, but unfortunately my script has to run on both systems without relying on any external applications.

What I've tried/found so far is:

do shell script ""echo \"€\" | hexdump 

-->  result: 0000000 e282 ac0a 

but I don't know how to convert this result to 20AC or 8364.

and also:


do shell script "export LC_ALL=fr_FR.UTF-8; printf \"%x\\n\" \"'€\""

--> result: ffffffffffffffe2 

but I don't understand this result. :(

I am fairly sure that tools such as Perl or Python could provide a solution, but unfortunately I don't know them.
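
For reference, that conversion is only a couple of lines in a language like Python. A minimal sketch (Python 3 syntax, purely illustrative; not tested against the Python that ships with Tiger):

# Convert UTF-8 hex to a Unicode code point.
utf8_hex = "E282AC"  # the UTF-8 bytes of the Euro sign
ch = bytes.fromhex(utf8_hex).decode("utf-8")
print("U+%04X" % ord(ch))  # --> U+20AC
print(ord(ch))             # --> 8364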

This works on my 10.4.11 (Tiger) system:

set euro to "€"
set firstCodePoint to («data utxt0000» as Unicode text) -- U+0000; Represented in UTF-8 as 0x00. Cannot be sent via the command line because it uses C-style strings which are NULL (0x00) terminated.
set secondCodePoint to («data utxt0001» as Unicode text) -- U+0001
set lastCodePoint to («data utxtDBFFDFFF» as Unicode text) -- U+10FFFF
length of lastCodePoint --> 1 (* just shows that not all code points can be represented in 16 bits *)

set someChars to euro & secondCodePoint & euro & lastCodePoint & euro

getHexStrings(someChars) --> {"20AC", "0001", "20AC", "10FFFF", "20AC"} (* Prefix "U+" to make it look "Unicodey" *)
getDecimalStrings(someChars) --> {"8364", "1", "8364", "1114111", "8364"}
getDecimalValues(someChars) --> {8364, 1, 8364, 1114111, 8364}

to getHexStrings(t)
	dumpCodePoints(t, "%04X")
end getHexStrings

to getDecimalValues(t)
	set l to getDecimalStrings(t)
	repeat with i in l
		set contents of i to i as integer
	end repeat
	l
end getDecimalValues

to getDecimalStrings(t)
	dumpCodePoints(t, "%u")
end getDecimalStrings

to dumpCodePoints(t, f)
	if t contains (ASCII character 0) then error "Unable to process strings containing U+0000 at this time."
	if t ends with "\\c" then error "Unable to process strings ending with \"\\c\" at this time."
	(*
	 possible workaround: Write the data out to a temporary file and have iconv read and convert the contents of the file. Be sure to delete the file afterwards.
	 If we do not issue an error for U+0000, anything after the first U+0000 would be ignored (by the echo command).
	 If we do not issue an error for ending with "\c", the data for the trailing "\c" would be silently omitted from the output.
	*)
	set dumpformat to "\"\" /4 \"" & f & " \" \"\""
	do shell script "/bin/echo -n " & quoted form of t & " | iconv -f UTF-8 -t UTF-32BE | hexdump -e " & quoted form of dumpformat
	words of result
end dumpCodePoints

It uses /bin/echo, hexdump, and iconv. They are external to AppleScript, but I am pretty sure they are standard on a Mac OS X 10.4 system.

I assume that you wanted the actual Unicode code points, not the UTF-16 code units (I threw in U+10FFFF to demonstrate the difference). I did something similar (iconv/hexdump) in a shell script that takes UTF-8 data and generates «data utxt…» segments for inclusion in osascript-based AppleScript shell scripts.
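
To make the code point/code unit distinction concrete, here is a quick check in Python (a Python 3 sketch, just for illustration):

# U+10FFFF is a single code point...
ch = "\U0010FFFF"
print(len(ch))                       # --> 1
# ...but it takes two UTF-16 code units (the surrogate pair DBFF DFFF)...
print(ch.encode("utf-16-be").hex())  # --> dbffdfff
# ...and one 4-byte UTF-32 unit:
print(ch.encode("utf-32-be").hex())  # --> 0010ffff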

FYI:

The leading zero number (0000000) is the offset in the data stream of the data that follows it. E282AC is the 3-byte UTF-8 representation of U+20AC. The final 0A is the UTF-8 representation of U+000A (the newline character, aka linefeed); the newline is added by the echo shell command. Conversion from UTF-8 to code points could be done in AppleScript, but it would likely be error prone. I prefer to let iconv handle it.
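
If you are curious what the by-hand conversion involves, here is the three-byte case in Python (a Python 3 sketch, for illustration only; a real decoder must handle one- to four-byte sequences plus invalid input, which is why I prefer iconv):

# Decode the 3-byte UTF-8 sequence E2 82 AC by hand.
# The 3-byte pattern is 1110xxxx 10xxxxxx 10xxxxxx; keep the x bits and concatenate.
b1, b2, b3 = 0xE2, 0x82, 0xAC
cp = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
print("U+%04X = %d" % (cp, cp))  # --> U+20AC = 8364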


do shell script "export LC_ALL=fr_FR.UTF-8; printf \"%x\\n\" \"'€\""

--> result: ffffffffffffffe2 

I have never seen the single-quote modifier used like this before, but it seems to work by changing the way a string argument from the command line is converted into a number for integer format specifiers. Normally, arguments corresponding to integer format specifiers are converted to numbers by parsing them as decimal representations. The single-quote pseudo-modifier seems to change that to use the raw value of the first byte of the string. If it used the value of the first character of the string, it might have worked a bit better, but the days of one-character-one-byte are long past. The Euro sign takes at least two bytes (UTF-16) and sometimes three (UTF-8) or four (UTF-32). All the extra F's probably come from sign extension of the first byte of the UTF-8 representation of the Euro sign (E2); the sixteen hex digits suggest extension to a 64-bit integer.
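
The sign-extension guess is easy to reproduce (a Python 3 sketch, purely illustrative):

# 0xE2 has its high bit set, so as a signed char it is negative:
b = 0xE2
signed = b - 0x100  # --> -30
# Reinterpreting -30 as an unsigned 64-bit two's-complement value:
print("%x" % (signed & 0xFFFFFFFFFFFFFFFF))  # --> ffffffffffffffe2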

Model: iBook G4 933
AppleScript: 1.10.7
Browser: Safari 3.0.4 (523.12)
Operating System: Mac OS X (10.4)

This returns a list of decimal Unicode numbers for a string of characters:

on unicodeNumbers(u)
	set fref to (open for access file ((path to temporary items as Unicode text) & "utxt scratch.txt") with write permission)
	try
		set eof fref to 0
		write u as Unicode text to fref -- Always big-endian.
		set l to (read fref as short from 1) as list
	end try
	close access fref
	
	set len to (count l)
	repeat with i from 1 to len
		set n to item i of l
		if (n is not missing value) then
			set n to (65536 + n) mod 65536 -- signed short -> unsigned (0-65535)
			if (n div 1024 is 54) and (i < len) then -- high surrogate (0xD800-0xDBFF)
				set n2 to (65536 + (item (i + 1) of l)) -- low surrogates are negative as signed shorts, so no mod 65536 needed here
				if (n2 div 1024 is 55) then -- low surrogate (0xDC00-0xDFFF)
					set n to n mod 1024 * 1024 + 65536 + n2 mod 1024 -- combine the pair into a single code point
					set item (i + 1) of l to missing value -- mark the low surrogate as consumed
				end if
			end if
			set item i of l to n
		end if
	end repeat
	
	return l's integers
end unicodeNumbers

set euro to "€"
unicodeNumbers(euro)
--> {8364}
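
In case the surrogate arithmetic in the handler looks opaque: 54 and 55 are 0xD800 div 1024 and 0xDC00 div 1024, i.e. the two tests recognize high and low surrogates. The same combining step in Python (a Python 3 sketch, for illustration):

def combine_surrogates(high, low):
    # Mirrors: n mod 1024 * 1024 + 65536 + n2 mod 1024
    return (high % 1024) * 1024 + 65536 + (low % 1024)

# U+10FFFF is stored in UTF-16 as the pair 0xDBFF, 0xDFFF:
assert 0xDBFF // 1024 == 54 and 0xDFFF // 1024 == 55
print(combine_surrogates(0xDBFF, 0xDFFF))  # --> 1114111, i.e. U+10FFFF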

Sorry for this long delay, but I was out of town.

Thank you very much to all three of you for your invaluable help! You saved my week!!

Thanks again.