Relating a language, or a character set, to the UTF encoding blocks

McUsr · June 12, 2010, 11:14am

Hello

I want to create routines that handle case and capitalization for languages that use Latin characters.
Those routines will be excessively large and slow if the encoding blocks
• Latin-1
• Latin Extended-A
• Latin Extended-B
• IPA Extensions
must each be included as a whole to cover a language. (Only the blocks a language needs, of course.)

The character encodings are listed here

The idea is to create a library for those functions, which a scripter could easily extend to suit his own needs, preferably by language, or by supplying the hexadecimal values for the lowercase characters in a list, which would then be added to the UTF Basic Latin characters.

The library should provide the functionality to build the tables needed by some standard text-processing functions: identifying the case of a character, and converting between cases.

What I need to know

What I really need to know is which language is covered by which encoding block. Or, even better, whether there exist tables that list a language’s accented characters, preferably with the hexadecimal values of their UTF representation.
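To make the question concrete, here is a minimal sketch (in Python, purely for illustration) of looking up which blocks a language's extra letters fall into. The block ranges are the standard Unicode ones; the `EXTRA_LOWERCASE` letter lists are hypothetical sample data typed in by hand, not taken from any authoritative per-language table, and the function names are made up for this example.

```python
# Standard Unicode block ranges for the blocks named in this post.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
]

# ASSUMPTION: hand-entered sample alphabets, for illustration only.
EXTRA_LOWERCASE = {
    "Norwegian": "\u00e6\u00f8\u00e5",          # ae-ligature, o-slash, a-ring
    "German": "\u00e4\u00f6\u00fc\u00df",        # a/o/u-umlaut, sharp s
    "Polish": "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c",
}


def block_of(ch):
    """Return the block name for one character, or None if outside BLOCKS."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def blocks_for(letters):
    """Return the block names needed for `letters`, in code point order."""
    needed = {block_of(ch) for ch in letters}
    return [name for _, _, name in BLOCKS if name in needed]


if __name__ == "__main__":
    for language, letters in EXTRA_LOWERCASE.items():
        print(language, "->", ", ".join(blocks_for(letters)))
```

With these sample lists, Norwegian and German need only Latin-1 Supplement on top of Basic Latin, while Polish needs Latin-1 Supplement and Latin Extended-A.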

Best Regards

McUsr

chrys · June 12, 2010, 1:35pm

That is a list of code points, not character encodings. Encoding relates to the representation of a code point (loosely, a character) as a sequence of bytes. For example, U+00E9 is represented in the UTF-8 encoding as two bytes: C3 A9. In UTF-16BE it is also two bytes, but they are different: 00 E9. In ISO-8859-1, it is a single byte: E9. In MacRoman it is a different single byte: 8E. The point is that code points (e.g. U+00E9) are just abstract symbols (numbers in the Unicode code space). An encoding is what defines how to represent (a subset of) the code points as a byte sequence.
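The byte sequences above can be checked with a few lines of Python (used here only because its standard codecs cover all four encodings; `encoded_hex` is a helper name made up for this sketch):

```python
# One code point, four encodings: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
CODE_POINT = "\u00e9"


def encoded_hex(text, encoding):
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


for name in ("utf-8", "utf-16-be", "iso-8859-1", "mac_roman"):
    print(f"{name:>10}: {encoded_hex(CODE_POINT, name)}")
```

The code point stays the same in every line of output; only the bytes chosen to represent it change.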

I am not sure how just the code point block list or a list of accented characters would really help. The problem is more general than either of those categories.

The concept of case manipulation is not as simple as you might hope. If you want to base your mapping on the Unicode standard (a reasonable idea), then the source for case mapping information is the Unicode Character Database. Section 5.4 of the UCD documentation mentions the relevant UCD files and bits of the Unicode standard itself.

Many programming languages implement more complete Unicode functionality by either including tables derived from the UCD or generating code based on the UCD.
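As one illustration of a language that ships UCD-derived tables, here is a small Python sketch showing why case mapping is not a simple one-to-one swap. `describe_case` is a hypothetical helper written for this example; `unicodedata.name` and the `str` case methods are standard library.

```python
import unicodedata


def describe_case(ch):
    """Return the Unicode name and the full case mappings of one character."""
    return {
        "name": unicodedata.name(ch),
        "upper": ch.upper(),
        "lower": ch.lower(),
        "title": ch.title(),
    }


SAMPLES = [
    "\u00e9",  # e with acute: a plain one-to-one pair
    "\u00df",  # sharp s: uppercases to two letters
    "\ufb00",  # ff ligature: title case and upper case differ
    "\u01c6",  # dz-with-caron digraph: has a separate titlecase letter
]

if __name__ == "__main__":
    for ch in SAMPLES:
        info = describe_case(ch)
        print(f"U+{ord(ch):04X} {info['name']}: "
              f"upper={info['upper']!r} title={info['title']!r}")
```

Characters such as the sharp s and the ligatures are the reason a table of simple lowercase/uppercase pairs cannot cover everything.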

% : Show that Perl can apply proper title casing to LATIN SMALL LIGATURE FF
% printf '\357\254\200' | /usr/bin/perl -CS -nle 'print ucfirst'
Ff

It really is a shame that AppleScript does not provide (relatively) fast access to some of these tricky bits of Unicode functionality. Implementing this stuff in AppleScript itself is possible, but it will certainly be slower than if it were done in C or Objective-C as a part of basic AppleScript (even something in StandardAdditions would be nice, though the RPC overhead would be egregious for tight AppleScript loops of such manipulations).

McUsr · June 12, 2010, 6:20pm

Hello.

I’m currently making a kind of catch-all solution, which means that I basically pair up small and large letters into some pretty long strings.
I note the last position of the lowercase/uppercase pairs and of the ternary characters (the ones consisting of two characters, where only the first is capitalized under capitalization). After those follow the characters that exist in only one case.

Those strings (or lists of character ids) may not lead to a particularly elegant solution for any one language, but I figure it will work well as a catch-all solution if my assumption holds that uppercase and lowercase characters come in pairs across the different languages.

It should be possible to tell whether a character is uppercase, lowercase, or neither. It should also be possible to make character strings lowercase, uppercase, or capitalized where possible.
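The paired-table idea can be sketched roughly as follows. This is Python rather than AppleScript, and it generates the pairs from Python's built-in Unicode tables instead of typing them by hand, so it shows the shape of the approach rather than the actual library; all names and the chosen range (Basic Latin through IPA Extensions) are assumptions for this example.

```python
FIRST, LAST = 0x0000, 0x02AF  # Basic Latin through IPA Extensions


def build_pairs(first=FIRST, last=LAST):
    """Collect (lower, upper) pairs that map one-to-one in both directions."""
    pairs = []
    for cp in range(first, last + 1):
        lower = chr(cp)
        upper = lower.upper()
        if upper != lower and len(upper) == 1 and upper.lower() == lower:
            pairs.append((lower, upper))
    return pairs


PAIRS = build_pairs()
TO_UPPER = {lower: upper for lower, upper in PAIRS}
TO_LOWER = {upper: lower for lower, upper in PAIRS}

# Characters in the range that have a case but no clean partner,
# e.g. the sharp s, kept apart from the pairs.
PAIRED = set(TO_UPPER) | set(TO_LOWER)
UNPAIRED = [chr(cp) for cp in range(FIRST, LAST + 1) if chr(cp) not in PAIRED]
ONLY_LOWER = {ch for ch in UNPAIRED if ch.islower()}
ONLY_UPPER = {ch for ch in UNPAIRED if ch.isupper()}


def case_of(ch):
    """Return 'lower', 'upper' or 'none' for a single character."""
    if ch in TO_UPPER or ch in ONLY_LOWER:
        return "lower"
    if ch in TO_LOWER or ch in ONLY_UPPER:
        return "upper"
    return "none"


def to_upper(text):
    """Uppercase via the pair table; unpaired characters pass through."""
    return "".join(TO_UPPER.get(ch, ch) for ch in text)


def to_lower(text):
    """Lowercase via the pair table; unpaired characters pass through."""
    return "".join(TO_LOWER.get(ch, ch) for ch in text)


def capitalize(text):
    """Uppercase the first character and lowercase the rest."""
    return to_upper(text[:1]) + to_lower(text[1:])
```

Note that characters without a one-to-one partner are classified but left unchanged by the conversions, which is the main limitation of a pure pair table.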

I’ll post the whole thing as soon as it is finished, though it may still take some time to verify the solution and such.
(This is really boooring to write!)

I guess this will be better than no solution, and I will leave it to those who want a faster solution to pick out the characters they need for their own strings or lists.

Best Regards

McUsr