Using Unicode-only characters in script strings

kai · March 26, 2006, 7:39pm

For most scripting needs, the range of characters from the system’s primary text encoding (determined by the first language listed in the International preference pane) should be adequate. However, for the occasional string value, we may want to use a few characters beyond that range.

By defining a unique number for every character, whatever the platform, environment or language, Unicode offers a way around the restrictions of a limited character palette. This perhaps explains why the Unicode standard has been adopted and implemented in so many recent technologies.

* The Unicode standard’s Basic Multilingual Plane (BMP) extends our character palette from 256 (in the so-called ‘extended ASCII’ range) to a potential 65,536 - generally considered enough to cover the commonly used characters of the world’s languages. And if the 4-digit hexadecimal codepoints within the BMP aren’t enough, there are also 5- and 6-digit hexadecimal codes beyond it (for a total of well over a million codepoints)…

Great. So if we need a few extra Unicode characters in our script, we can just copy, paste and run - right?

Well… not quite. Certainly, we can copy and paste Unicode text in a number of applications - including Script Editor. But AppleScript compiles scripts using the system’s primary encoding. So while freshly pasted text will normally be displayed accurately in a Script Editor window, any Unicode-only characters will become mangled as soon as the script is compiled/run. However, once Unicode text is stored in an AppleScript variable, it can be used like any other variable - and the Unicode encoding should, for the most part, be preserved.

* Even if a Unicode character is available on a system, it may not necessarily display correctly in Script Editor’s result pane. However, it should do so in a dialog or appropriately configured application.

So how do we get special characters into a variable in the first place - in a way that side-steps any transmutation at compile time? First, we need to identify the Unicode codepoints for the particular characters that we wish to include (we’ll come to that in a moment). Then a method is required to convert those codepoints into Unicode text characters…

Let’s say we want to display the Cherokee letter NA (supported by the font Plantagenet Cherokee).
One way to convert its Unicode codepoint (U+13BE) might be to use a shell script - something like this:

set cherokeeNA to do shell script "perl -e 'print \"\\x{13BE}\"'"
display dialog "CHEROKEE LETTER NA: " & cherokeeNA

A faster alternative is to use AppleScript’s data class:

set cherokeeNA to run script "«data utxt13BE»"
display dialog "CHEROKEE LETTER NA: " & cherokeeNA

Or, simpler and faster still:

set cherokeeNA to «data utxt13BE» as Unicode text
display dialog "CHEROKEE LETTER NA: " & cherokeeNA
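
If the codepoint is only known at run time (read from a file or typed in by the user, say), the last two methods can be combined. What follows is only a sketch: the handler name characterFromHex is made up for illustration, and it assumes the caller passes 4 hexadecimal digits per character, just as they would appear inside the data literal.

```applescript
-- Hypothetical helper, not part of the methods above: it assembles the
-- «data utxt…» literal as source text and compiles it with "run script".
on characterFromHex(hexDigits)
	-- hexDigits is assumed to hold 4 hex digits per character, e.g. "13BE"
	return run script "«data utxt" & hexDigits & "» as Unicode text"
end characterFromHex

set cherokeeNA to characterFromHex("13BE")
display dialog "CHEROKEE LETTER NA: " & cherokeeNA
```

Since this goes through run script, expect it to be slower than a literal that is compiled into the script, so it is probably best kept for cases where the codepoints really can't be hard-coded.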

By way of demonstration, this last method was used in a recent discussion about card games to create a ‘virtual’ deck of cards. Here’s the relevant part of the code:

set d to {}
repeat with s in «data utxt2661266326622660» as Unicode text
	repeat with c in {"2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"}
		set d's end to c & s
	end repeat
end repeat
d

As shown above, data can contain several characters. The example actually consists of 4 Unicode characters, each representing a card suit: white heart (U+2661), black club (U+2663), white diamond (U+2662) and black spade (U+2660). But there’s a limit to how much data can be packed into a single data value. So, to process a relatively lengthy text, divide it into chunks (of fewer than 64 Unicode characters each) and then concatenate the results. Otherwise, be prepared to encounter my all-time favourite error message - which beats the hell out of that “fatal error” and “illegal operation” nonsense one sees elsewhere:

set t to "0"
repeat 8 times
	set t to t & t
end repeat
run script "«data utxt" & t & "»"
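
To stay clear of that error, the chunk-and-concatenate approach mentioned above might look something like this. It's a minimal sketch using the same four suit characters, split in two purely for illustration (the variable names are arbitrary, and a real case would have far longer chunks):

```applescript
-- Each data literal stays well under the 64-character limit mentioned above
set redSuits to «data utxt26612662» as Unicode text -- white heart, white diamond
set blackSuits to «data utxt26632660» as Unicode text -- black club, black spade

-- Once coerced, the pieces are ordinary Unicode text and can simply be joined
set allSuits to redSuits & blackSuits
display dialog "Card suits: " & allSuits
```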

All we need now is some method of identifying the Unicode codepoints of the required characters (assuming, of course, that they can be copied from some source). One way is to open Character Palette and carry out a search:

launch application "CharPaletteServer"

Simply paste the copied character (or enter its Unicode name, if known) into Character Palette’s Spotlight field - and then hold the mouse pointer over the selected result.

However, since that could be quite a painstaking process (especially if several characters are required), the following script might help to encode text of varying lengths:

---------------
-- convertor --
---------------

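-- Returns AppleScript source text for one or more «data utxt…» literals that encode t,
-- split into chunks of up to 63 characters and joined with "&".
-- With coercing, an "as unicode text" coercion is included in the result.
on utxt_data from t given coercing:c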
	set q to ""
	set l to count t
	set r to "«data utxt"
	if c then
		set q to "» as unicode text"
	else
		set q to "»"
	end if
	if l is 0 then error number -128
	if l > 63 then
		set l to 63
		if c then
			set r to "(" & r
			set q to q & ")"
		end if
		set q to q & " & " & my (utxt_data from t's text 64 thru -1 without coercing)
		set t to t's text 1 thru l
	end if
	set h to "0123456789ABCDEF"
	repeat with i in ({{text:t as Unicode text}} as text)'s text (l * -2) thru -1
		tell (ASCII number i) to set r to r & h's item (it div 16 + 1) & h's item (it mod 16 + 1)
	end repeat
	r & q
end utxt_data

-----------
-- demo --
-----------

set utxt to utxt_data from text returned of (display dialog ¬
	"Please enter or paste the text to convert:" default answer return) with coercing
tell application "Script Editor" to execute (make new document with properties {text:"display dialog " & utxt})