Current extended character set

kel1 · October 18, 2014, 7:42pm

Hi,

Does anyone know where there is the current extended character set (128-255)? I can’t find any on the internet that matches the ‘id’ with “character id” exactly. Maybe there’s one on the computer? I could write a repeat loop and get a list, but it is hard to see which characters are visible.

Thanks,
kel

StefanK · October 18, 2014, 8:07pm

Hi,

character id is Unicode based; ids 128 - 255 correspond to the Latin-1 Supplement block.
To view the whole set, open Character Viewer and select Unicode.
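
If a plain list is easier than the Character Viewer, the range can also be generated. Below is a minimal sketch in Python (used here only for illustration, it is not from this thread) that walks code points 128 - 255 and flags which ones are visible, using the standard unicodedata module. The function name latin1_supplement_table and the "<control>" placeholder are made up for this example, and "visible" is assumed to mean anything that is not a control, format, or space character.

```python
import unicodedata


def latin1_supplement_table():
    """Return (code_point, glyph, name, visible) for code points 128-255."""
    rows = []
    for code_point in range(128, 256):
        ch = chr(code_point)
        category = unicodedata.category(ch)
        # Assumption for this sketch: categories starting with "C"
        # (control/format) or "Z" (separators) count as not visible.
        visible = category[0] not in ("C", "Z")
        # Control characters have no Unicode name, so use a placeholder.
        name = unicodedata.name(ch, "<control>")
        rows.append((code_point, ch if visible else "", name, visible))
    return rows


if __name__ == "__main__":
    for code_point, glyph, name, visible in latin1_supplement_table():
        print(f"{code_point:3d}  U+{code_point:04X}  {glyph:1}  {name}")
```

Ids 128 - 159 come out as controls with no glyph, which is probably why a simple repeat loop looks half empty.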

kel1 · October 18, 2014, 8:22pm

Hi Stefan,

I see it now I think. Is it ISO-8859-1?

I don’t see unicode. Thought I did see it in Mountain Lion.

Thanks,
kel

StefanK · October 18, 2014, 8:33pm

No, it’s Unicode. ISO-8859-x is limited to 256 characters; you can even use character id 8226, which is the bigger bullet sign (•).

In Finder, select Edit > Special Characters from the menu.
In the left sidebar, select Unicode.
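
To illustrate the 256-character limit, here is a small sketch in Python (not AppleScript; the helper name fits_in_latin1 is hypothetical). It checks whether a code point can be written as a single ISO-8859-1 byte. Id 233 can, id 8226 cannot, although both are perfectly valid Unicode code points.

```python
def fits_in_latin1(code_point):
    """Return True if the code point has a single-byte ISO-8859-1 form."""
    try:
        chr(code_point).encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        # Anything above 255 has no slot in an 8-bit table.
        return False


if __name__ == "__main__":
    for code_point in (233, 8226):
        print(code_point, chr(code_point), fits_in_latin1(code_point))
```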

kel1 · October 18, 2014, 9:32pm

I see now.

Thanks a lot,
kel

kel1 · October 18, 2014, 9:40pm

Hi Stefan,

Found it in Character viewer. I’d swear it was there by default in Mountain Lion.

Thanks a lot,
kel

DJ_Bazzie_Wazzie · October 19, 2014, 12:04am

More about Unicode:

Unicode is a two-layer character encoding, while 7-bit and 8-bit character encodings are not. Traditional character encodings translate byte values directly into characters using a single encoding table. Unicode instead uses a character set that assigns each character an integer rather than a byte value; these integers are called Unicode code points (they run from 0 to 0x10FFFF, so they fit in a 32-bit integer). The conversion between code points and the bytes actually stored is done by an encoding such as UTF-8 or UTF-16. There is also a UTF-32 encoding, but its values are the same as the code points, so it effectively stores code points directly as data. The encodings differ in how they pack code points into bytes. UTF-8 is backward compatible with the 7-bit ASCII table, but because each character takes a variable number of bytes (one to four) it is the least convenient for text processing. Thanks to that backward compatibility it is the most used encoding for storing files. Where processing efficiency matters more, as with file names on the Mac and also on Windows, text is stored as UTF-16.
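
To make the two layers concrete, here is a rough sketch in Python (chosen only for illustration; the function name encoded_forms is made up). It takes one character, reports its code point, and shows the bytes that UTF-8, UTF-16 and UTF-32 produce for it. The big-endian variants are used so that no byte order mark is added.

```python
def encoded_forms(ch):
    """Return the code point of ch and its bytes (as hex) in three encodings."""
    return {
        "code_point": ord(ch),
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16-be": ch.encode("utf-16-be").hex(" "),
        "utf-32-be": ch.encode("utf-32-be").hex(" "),
    }


if __name__ == "__main__":
    for ch in ("A", "é", "•"):
        print(ch, encoded_forms(ch))
    # "A" -> code point 65:   41       / 00 41 / 00 00 00 41
    # "é" -> code point 233:  c3 a9    / 00 e9 / 00 00 00 e9
    # "•" -> code point 8226: e2 80 a2 / 20 22 / 00 00 20 22
```

The code point stays the same in every row; only the stored bytes change, and only UTF-32 holds the code point value unchanged.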

In the upper layer, these code points are translated into characters. The code point is the actual number you see when you use character id in AppleScript. The most common misconception arises because the first 256 code points of Unicode are identical to ISO-8859-1: the Latin-1 Supplement block has the same code point values in Unicode as the byte values in ISO-8859-1 itself. The results of character id in AppleScript can therefore be misleading and make you think you are dealing with ISO-8859-1, if you expect it to return a byte value like the ASCII number command from the Standard Additions does. However, once the text is written to a file, let’s say using the UTF-8 encoding, the characters in the Unicode range 80 to FF (hexadecimal) are stored as byte sequences that don’t match their ISO-8859-1 encoding.

So a character like “é” is byte value 233 in ISO-8859-1, and it is also code point 233 in Unicode (the value character id will return). But once stored in a certain encoding, like in a website using UTF-8 encoding, the data will be different: in UTF-8 it becomes the two bytes C3 A9 instead of the single byte E9.
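
A short sketch of that difference, again in Python and purely as an illustration (the function name compare_bytes is hypothetical). It encodes the same character both ways and then shows what happens when the UTF-8 bytes are misread as ISO-8859-1, which is how text ends up looking garbled on a page served with the wrong encoding.

```python
def compare_bytes(ch):
    """Return the ISO-8859-1 bytes, the UTF-8 bytes, and the UTF-8 bytes misread as ISO-8859-1."""
    latin1_bytes = ch.encode("iso-8859-1")
    utf8_bytes = ch.encode("utf-8")
    # Every byte value is valid ISO-8859-1, so this decode never fails;
    # it just yields the wrong characters.
    misread = utf8_bytes.decode("iso-8859-1")
    return latin1_bytes, utf8_bytes, misread


if __name__ == "__main__":
    latin1_bytes, utf8_bytes, misread = compare_bytes("é")
    print(ord("é"))            # 233
    print(list(latin1_bytes))  # [233]
    print(list(utf8_bytes))    # [195, 169]
    print(misread)             # Ã©
```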

kel1 · October 19, 2014, 12:58am

Hi DJ,

That’s interesting! I guess it’s complicated how the values change when going from utf16 to utf8 for the internet. There are so many encodings! Maybe that’s why many of the tables I’ve been looking at are different. There’s also the old mac extended character set. Unicode seems to be consistent on the same computer.

Thanks for the info,
kel