1 character uses space of 2 characters

cirno · December 8, 2012, 3:22pm

Thanks.

Is there an easy way to get a list of every Unicode character?

Yvan_Koenig · December 8, 2012, 3:38pm

http://www.unicode.org/charts/charindex.html

is your friend.

Puzzling behaviour :
Run this script :


set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
tell application "Finder"
	repeat with i from 4 to 5
		set strASCII to strASCII & i
		try
			set folderASCII to make new folder at p2d with properties {name:strASCII}
			name of folderASCII
			set countASCII to count result
			delete folderASCII
		on error
			set countASCII to 0
		end try
		set strHigh to "औ" & text 2 thru -1 of strASCII
		try
			set folderHigh to make new folder at p2d with properties {name:strHigh}
			name of folderHigh
			set countHigh to count result
			delete folderHigh
		on error
			set countHigh to 0
		end try
		display dialog ("" & countASCII & return & countHigh)
	end repeat
end tell

On the first pass, two folders will be created on the desktop.
The name of the first one is made of 255 digits (plain ASCII).
The name of the second one is made by replacing the first character of the first one with the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and whose UTF-8 value is $E0 A4 94.

Both names are 255 characters long.
Both folders are created and a dialog displays the two name lengths.
On the second pass, one character (5) is added at the end of the two names.
The Finder refuses to create both folders.

If I replace the (DEVANAGARI LETTER AU) with the character ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and whose UTF-8 value is $C3 BC, only the first folder, named with strASCII, is created.

I really don’t understand why (DEVANAGARI LETTER AU), which requires more bytes than (LATIN SMALL LETTER U WITH DIAERESIS), is accepted.

Yvan KOENIG (VALLAURIS, France) Saturday, 8 December 2012 16:38:44

McUsrII · December 8, 2012, 4:05pm

That seems like a better idea than:

repeat with i from 0 to 65535
	try
		set a to character id i
		log a
	end try
end repeat

StefanK · December 8, 2012, 4:11pm

Finder > Edit menu > Special Characters > Unicode Table

Yvan_Koenig · December 8, 2012, 4:39pm

But using this scheme, we don’t get the names of the characters

Yvan KOENIG (VALLAURIS, France) Saturday, 8 December 2012 17:39:35
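For readers who want the names along with the characters, here is a minimal sketch in Python rather than AppleScript, assuming only the standard `unicodedata` module. The helper name `named_characters` is made up for this illustration, and the names returned depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

def named_characters(first, last):
    """Yield (code point, character, name) for every named character in a range."""
    for cp in range(first, last + 1):
        ch = chr(cp)
        # The second argument is returned for code points that have no name
        # (unassigned code points, control characters, surrogates).
        name = unicodedata.name(ch, None)
        if name is not None:
            yield cp, ch, name

# Print a few entries from the Devanagari block, ending at U+0914.
for cp, ch, name in named_characters(0x0910, 0x0914):
    print(f"U+{cp:04X}  {ch}  {name}")
```

Calling `named_characters(0, 0xFFFF)` would cover the same range as the `repeat` loop above, skipping the code points that have no name.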

DJ_Bazzie_Wazzie · December 8, 2012, 4:48pm

I think you can look here

McUsrII · December 8, 2012, 6:04pm

True!

I actually think that the best solution is either to look at your link or DJ Bazzie Wazzie’s, or to follow StefanK’s tip.

StefanK’s solution works when you are offline.

Yvan_Koenig · December 8, 2012, 9:02pm

Yuck! My name is not Koening, it’s Koenig.

The characters palette is fine because we may use it to insert characters or to extract the character name with a simple copy/paste.

Did you look at the script embedded in my message from 09:38:51 am?

Yvan KOENIG (VALLAURIS, France) Saturday, 8 December 2012 21:59:41

McUsrII · December 9, 2012, 8:23am

I corrected your name, Yvan. I am looking at your script right now, and I really have no idea why the first (3-byte) character works when the second (2-byte) character doesn’t. Maybe there is some byte-packing scheme at work here? It may also be that the $00 byte in the diaeresis character does (or doesn’t do) the trick?

Nigel_Garvey · December 9, 2012, 10:02am

There’s an error when the script tries to delete the second folder on my system. This appears to be something to do with the fact that the first folder’s already in the Trash. (Maybe there’s not enough overhead to edit the second folder’s name?) The effect is that, because the second folder’s not deleted the first time, there’s another error when trying to create it the second time. Putting an ‘empty’ line after each ‘delete’ cures this on my Snow Leopard system.

I don’t know the answer either. But then I don’t know how many bytes DEVANAGARI LETTER AU occupies in the normalized version of UTF-16 used by the HFS+ system. (See the link in my post (#7) above.)

Edit: Perhaps more relevantly, the article to which I linked says that HFS+ names can have a maximum of 255 UTF-16 code points. If DEVANAGARI LETTER AU is represented as just one code point (itself), and LATIN SMALL LETTER U WITH DIAERESIS is almost certainly “normalized” as two code points (the small letter u and the diaeresis), then that would explain the apparent anomaly. The number of bytes doesn’t come into it.
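The difference between byte counts and stored units can be checked outside the Finder. The following is a rough sketch in Python, not AppleScript, and it rests on one assumption: standard canonical decomposition (NFD) stands in for the HFS+ rules, which as far as I know are a fixed variant of it, so the two may differ for some characters. The function name `storage_sizes` is invented for the example.

```python
import unicodedata

def storage_sizes(text):
    """Return (UTF-8 bytes as typed, UTF-16 units after canonical decomposition)."""
    decomposed = unicodedata.normalize("NFD", text)  # stand-in for the HFS+ rules
    utf8_bytes = len(text.encode("utf-8"))
    # Each UTF-16 unit is two bytes in the little-endian encoding without a BOM.
    utf16_units = len(decomposed.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units

for ch in ("\u00fc", "\u0914"):  # u with diaeresis, Devanagari letter AU
    print(unicodedata.name(ch), storage_sizes(ch))
```

Under that assumption the character that needs more UTF-8 bytes is the one that takes fewer units once decomposed, which is consistent with what the script on the desktop showed.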

Yvan_Koenig · December 9, 2012, 11:04am

Hello Nigel.

I assume that the error issued when the Finder tries to delete the folder is due to your old system.
Under Lion or Mountain Lion it behaves flawlessly.
This version uses System Events to delete the folders, so nothing is moved to the Trash.


set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop

repeat with i from 4 to 5
	set strASCII to strASCII & i
	try
		tell application "Finder"
			set folderASCII to make new folder at p2d with properties {name:strASCII}
			name of folderASCII
			set countASCII to count result
		end tell
		tell application "System Events" to delete folderASCII
	on error
		set countASCII to 0
	end try
	set strHigh to "औ" & text 2 thru -1 of strASCII
	try
		tell application "Finder"
			set folderHigh to make new folder at p2d with properties {name:strHigh}
			name of folderHigh
			set countHigh to count result
		end tell
		tell application "System Events" to delete folderHigh
	on error
		set countHigh to 0
	end try
	display dialog ("" & countASCII & return & countHigh)
end repeat

I find it difficult to imagine that the (DEVANAGARI LETTER AU) character is stored as a single byte like ASCII characters.

Yvan KOENIG (VALLAURIS, France) Sunday, 9 December 2012 12:03:54

Nigel_Garvey · December 9, 2012, 11:12am

Hi Yvan.

I was adding an extra paragraph to my post as you were posting this comment. The relevant units to consider are UTF-16 code points, not bytes.

set uUmlaut1 to (character id 252) -- One code point ("ü").
set uUmlaut2 to (character id 117) & (character id 776) -- Two code points as in HFS+ ("u" & combining diaeresis).

{uUmlaut1, uUmlaut2, uUmlaut1 = uUmlaut2}

PS. Even System Events’s ‘delete’ command simply moves things to the trash on my machine.

McUsrII · December 9, 2012, 12:33pm

An enlightening day, as enlightening here, as the snow outside! No pun intended.

Yvan_Koenig · December 9, 2012, 8:26pm

In the given example, the uUmlaut character is:
ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and whose UTF-8 value is $C3 BC.

When I insert it in a file/folder name, it reduces the number of characters allowed by one, so it clearly occupies two positions.

On the other hand, the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and whose UTF-8 value is $E0 A4 94, may be inserted in a file/folder name and is counted as a single position.
That is what puzzled me.
Bingo, I got it. Thanks to your explanations and this table:
http://developer.apple.com/legacy/mac/library/#technotes/tn/tn1150table.html

Before that, I didn’t understand that “points” are in fact two-byte units.
औ (DEVANAGARI LETTER AU) is a two-byte Unicode value ($0914), but it’s stored as a single point.
ü (LATIN SMALL LETTER U WITH DIAERESIS) is a one-byte Unicode value ($00FC), but it’s stored as two points.
Ḕ (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE) is Unicode $1E14, but it’s stored as three points.

So if we replace an ASCII character with औ, there is no impact on the number of characters allowed (255).
If we replace an ASCII character with ü, the number of characters allowed is reduced by 1 (254).
If we replace an ASCII character with Ḕ, the number of characters allowed is reduced by 2 (253).
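The arithmetic above can be replayed without creating any folders. This is a sketch in Python under two assumptions: the 255-unit limit is the one quoted earlier in the thread, and standard NFD decomposition approximates what HFS+ stores. The names `hfs_units`, `fits` and `HFS_PLUS_MAX_UNITS` are hypothetical helpers for this example only.

```python
import unicodedata

HFS_PLUS_MAX_UNITS = 255  # limit quoted in the thread (assumption, not measured here)

def hfs_units(name):
    """Count UTF-16 units of a name after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", name)
    return len(decomposed.encode("utf-16-le")) // 2

def fits(name):
    """True when the decomposed name stays within the assumed limit."""
    return hfs_units(name) <= HFS_PLUS_MAX_UNITS

digits = "0123456789" * 25 + "01234"  # 255 ASCII digits, like the first pass

# Replace the first digit with each test character in turn.
for first in ("0", "\u0914", "\u00fc", "\u1e14"):
    candidate = first + digits[1:]
    print(f"U+{ord(first):04X}: {len(candidate)} typed, "
          f"{hfs_units(candidate)} stored, fits: {fits(candidate)}")
```

With these assumptions, a 255-character name survives the swap to DEVANAGARI LETTER AU but not the swap to either accented Latin letter, matching the 255, 254 and 253 figures.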

Your PS about System Events really puzzles me.
If I remember correctly, you are running 10.5.8, and as far as I recall the System Events delivered with that system deleted files on the spot, as it does in 10.8, without putting them in the Trash.
Now I think I understand why you used rm in some of the scripts you sent me.

Nigel_Garvey · December 9, 2012, 10:37pm

AU is a single character and can only be represented by a single code point (or Unicode number), however many bytes that may be. (A multiple of 2, since it’s UTF-16.) Characters with diacritics, however, can be represented in Unicode either as characters in themselves (one code point) or as combinations of base characters and “Combining Diacritical Marks” (two code points). The HFS+ system insists on the latter, for some reason.

Interestingly, the ‘id’ function can tell how such characters are formed:

set nameIn to "üऔḔ" -- (LATIN SMALL LETTER U WITH DIAERESIS) & (DEVANAGARI LETTER AU) & (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE)
set lengthIn to (count nameIn)
set idIn to id of nameIn

tell application "Finder" to set nameOut to name of (make new folder at desktop with properties {name:nameIn})

set lengthOut to (count nameOut)
set idOut to id of nameOut

{{nameAsSet:nameIn, length:lengthIn, id:idIn}, return, {nameAsReturned:nameOut, length:lengthOut, id:idOut}}
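The same before-and-after comparison of ids can be approximated without touching the file system. The sketch below is Python, not AppleScript, and it assumes that standard NFD normalization gives the same result as the name the Finder hands back for these three characters. The variable names mirror the script above but are otherwise arbitrary.

```python
import unicodedata

# The three precomposed characters, written as escapes to avoid encoding damage.
name_in = "\u00fc\u0914\u1e14"
name_out = unicodedata.normalize("NFD", name_in)  # stand-in for the returned name

ids_in = [ord(c) for c in name_in]    # code points as typed
ids_out = [ord(c) for c in name_out]  # code points after decomposition

print(len(name_in), ids_in)
print(len(name_out), ids_out)
```

Note that Python's `len` counts code points, which is not necessarily what AppleScript's `count` reports for the same text.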

You’re right! System Events does delete items on the spot. I hadn’t noticed that in your script, the ‘delete’ command is applied to a Finder reference and so it’s actually the Finder doing the deleting.

Shane_Stanley · December 9, 2012, 11:16pm

Not so much interesting as strange, your script compiles here with the E appearing to have just a macron. It behaves as it should, and if I copy and paste from ASE into TextEdit, it looks as it should, but for some reason it appears incorrectly in ASE (and Script Debugger). The same thing happens in the result pane in ASE: the character looks different in nameAsSet than in nameAsReturned. Very strange…

DJ_Bazzie_Wazzie · December 9, 2012, 11:57pm

The use of decomposed Unicode characters (it’s not a UTF-16 thing) is normally for fast text processing. In AS you have considering/ignoring diacriticals, which can be processed faster in the background when the Unicode string is in decomposed form. HFS+ is case-insensitive, and case-insensitive string comparison is faster with decomposed than with composed Unicode strings. Other operations, such as counting diacriticals, checking whether the string contains any, or sorting, can also be performed faster.
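To illustrate why a decomposed form makes diacritic-insensitive handling simple, here is a small Python sketch (an analogy to AppleScript's ignoring diacriticals, not a description of how AppleScript or HFS+ is implemented). Once the text is decomposed, the marks are separate code points that can be filtered out. The function names are made up for the example.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose the text, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a non-zero
    # combining class for marks such as the combining diaeresis.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def equal_ignoring_case_and_diacritics(a, b):
    """Compare two strings while ignoring case and diacritics."""
    return strip_diacritics(a).casefold() == strip_diacritics(b).casefold()

print(strip_diacritics("\u00fc"))                               # u with diaeresis
print(equal_ignoring_case_and_diacritics("\u00dcber", "uber"))  # capital U with diaeresis
```

If the text is already stored decomposed, the normalization step is a no-op and only the filtering remains, which is the saving being described.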

Nigel_Garvey · December 10, 2012, 12:15am

It’s too small to tell on my MacBook Pro. If I blow the text up several times in ASE with Command-+ keystrokes, a small grave does become visible above the right end of the macron.

Shane_Stanley · December 10, 2012, 12:22am

It looks like it’s an artefact – the top part of the character is being trimmed. If I add a return before it, it looks fine.

Nigel_Garvey · December 10, 2012, 12:27am

I was just going to add that I see what you mean about the two versions of the character looking different. When compared side-by-side, there are small differences. Maybe the degree of difference depends on the font in use.