1 character uses space of 2 characters

Hello Nigel.

I assume that the error raised when the Finder tries to delete the folder is due to your old system.
Under Lion or Mountain Lion it behaves flawlessly.
This version uses System Events to delete the folders, so nothing is moved to the trash.


set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop

repeat with i from 4 to 5
	set strASCII to strASCII & i
	try
		tell application "Finder"
			set folderASCII to make new folder at p2d with properties {name:strASCII}
			name of folderASCII
			set countASCII to count result
		end tell
		tell application "System Events" to delete folderASCII
	on error
		set countASCII to 0
	end try
	set strHigh to "औ" & text 2 thru -1 of strASCII
	try
		tell application "Finder"
			set folderHigh to make new folder at p2d with properties {name:strHigh}
			name of folderHigh
			set countHigh to count result
		end tell
		tell application "System Events" to delete folderHigh
	on error
		set countHigh to 0
	end try
	display dialog ("" & countASCII & return & countHigh)
end repeat

I find it hard to imagine that the औ (DEVANAGARI LETTER AU) character is stored as a single byte like ASCII characters.

Yvan KOENIG (VALLAURIS, France) Sunday 9 December 2012 12:03:54

Hi Yvan.

I was adding an extra paragraph to my post as you were posting this comment. The relevant units to consider are UTF-16 code units (16-bit values, one per code point for the characters discussed here), not bytes.

set uUmlaut1 to (character id 252) -- One code point ("ü").
set uUmlaut2 to (character id 117) & (character id 776) -- Two code points as in HFS+ ("u" & combining diaeresis).

{uUmlaut1, uUmlaut2, uUmlaut1 = uUmlaut2}

PS. Even System Events’s ‘delete’ command simply moves things to the trash on my machine.

An enlightening day, as enlightening here, as the snow outside! No pun intended. :smiley:

In the given example, the uUmlaut character is :
ü (LATIN SMALL LETTER U WITH DIAERESIS) whose Unicode value is $00FC and UTF-8 value is $C3 BC

When I insert it in a file/folder name, it reduces the number of characters allowed by one. So it’s clearly taking up two positions.

On the other hand, the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and UTF-8 value is $E0 A4 94, can be inserted in a file/folder name and is counted as a single position.
That’s what puzzles me.
Bingo, I got it. Thanks to your explanations and table :
http://developer.apple.com/legacy/mac/library/#technotes/tn/tn1150table.html

Before that, I didn’t understand that “points” are in fact two bytes each.
औ (DEVANAGARI LETTER AU) is a two-byte Unicode value, but it’s stored as a single point.
ü (LATIN SMALL LETTER U WITH DIAERESIS) is a one-byte Unicode value, but it’s stored as two points.
Ḕ (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE) is Unicode $1E14, but it’s stored as three points.

So, if we replace an ASCII character with औ, there is no impact on the number of characters allowed. (255)
So, if we replace an ASCII character with ü, the number of characters allowed is reduced by 1. (254)
So, if we replace an ASCII character with Ḕ, the number of characters allowed is reduced by 2. (253)
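
For anyone who wants to check this arithmetic away from the Finder, here is a rough sketch in Python (used instead of AppleScript only because it runs anywhere). It assumes that standard Unicode NFD decomposition is close enough to the HFS+ decomposition table for these three characters. The real table differs in places, so treat it as an approximation. The names `utf16_units` and `stored_units` are made up for this example.

```python
import unicodedata

HFS_PLUS_NAME_LIMIT = 255  # limit discussed in this thread, in 16-bit units


def utf16_units(text: str) -> int:
    """Count the 16-bit UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-be")) // 2


def stored_units(name: str) -> int:
    """Approximate cost of a name once decomposed.

    Assumption: standard NFD stands in for the HFS+ decomposition table.
    """
    return utf16_units(unicodedata.normalize("NFD", name))


if __name__ == "__main__":
    # ASCII letter, DEVANAGARI LETTER AU, u with diaeresis, E with macron and grave
    for ch in ("a", "\u0914", "\u00fc", "\u1e14"):
        cost = stored_units(ch)
        print(f"U+{ord(ch):04X} costs {cost} unit(s), "
              f"leaving {HFS_PLUS_NAME_LIMIT - cost} for the rest of the name")
```

Under that assumption the four characters cost 1, 1, 2 and 3 units respectively, which matches the 255 / 255 / 254 / 253 figures above.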

This one really puzzles me.
If I remember correctly, you are running 10.5.8, and as I recall, the System Events delivered with that system deleted files as it does in 10.8, without putting them in the trash.
Now I can understand why you used rm in some of the scripts you sent me.

Yvan KOENIG (VALLAURIS, France) Sunday 9 December 2012 21:26:39

AU is a single character and can only be represented by a single code point (or Unicode number), however many bytes that may be. (A multiple of 2, since it’s UTF-16.) Characters with diacritics, however, can be represented in Unicode either as characters in themselves (one code point) or as combinations of base characters and “Combining Diacritical Marks” (two code points). The HFS+ system insists on the latter, for some reason.
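
The two representations can be compared outside AppleScript too. The following is only a sketch in Python, using standard Unicode normalization (NFC for the precomposed form, NFD for the decomposed form). It is not the HFS+ routine itself, and the helper name `code_points` is invented for the example.

```python
import unicodedata

composed = "\u00fc"      # "ü" as one code point
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS


def code_points(text: str) -> list[str]:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]


if __name__ == "__main__":
    print(code_points(composed), code_points(decomposed))
    # A plain comparison sees two different strings...
    print(composed == decomposed)
    # ...but normalizing either way maps one form onto the other.
    print(unicodedata.normalize("NFD", composed) == decomposed)
    print(unicodedata.normalize("NFC", decomposed) == composed)
```

The design point is that both strings name the same character, so a comparison is only meaningful after both sides have been brought to the same normalization form.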

Interestingly, the ‘id’ function can tell how such characters are formed:

set nameIn to "üऔḔ" -- (LATIN SMALL LETTER U WITH DIAERESIS) & (DEVANAGARI LETTER AU) & (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE)
set lengthIn to (count nameIn)
set idIn to id of nameIn

tell application "Finder" to set nameOut to name of (make new folder at desktop with properties {name:nameIn})

set lengthOut to (count nameOut)
set idOut to id of nameOut

{{nameAsSet:nameIn, length:lengthIn, id:idIn}, return, {nameAsReturned:nameOut, length:lengthOut, id:idOut}}

You’re right! System Events does delete items on the spot. I hadn’t noticed that in your script, the ‘delete’ command is applied to a Finder reference and so it’s actually the Finder doing the deleting. :slight_smile:

Not so much interesting as strange, your script compiles here with the E appearing to have just a macron. It behaves as it should, and if I copy and paste from ASE into TextEdit, it looks as it should, but for some reason it appears incorrectly in ASE (and Script Debugger). The same thing happens in the result pane in ASE: the character looks different in nameAsSet than in nameAsReturned. Very strange…

The use of decomposed Unicode characters (it’s not a UTF-16 thing) is normally for fast text processing. In AS you have considering/ignoring diacriticals, which can be processed faster in the background when the Unicode string is in decomposed form. HFS+ is case insensitive, and case-insensitive string comparison is faster with decomposed than with composed Unicode strings. Other operations, such as counting, checking whether the string contains diacriticals, or sorting, can also be performed faster.
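
To illustrate why the decomposed form is convenient for this kind of comparison, here is a small Python sketch. It says nothing about speed. It only shows that once a string is decomposed, ignoring diacriticals amounts to skipping the combining marks. It assumes standard NFD, and the function names are hypothetical. It is not how AppleScript or HFS+ actually implement it.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose text (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters and a
    # non-zero combining class for marks such as U+0308.
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


def equal_ignoring_diacriticals(a: str, b: str) -> bool:
    """Compare two strings while ignoring diacritical marks."""
    return strip_diacritics(a) == strip_diacritics(b)


if __name__ == "__main__":
    print(equal_ignoring_diacriticals("\u00fc", "u"))   # ü against u
    print(equal_ignoring_diacriticals("\u1e14", "E"))   # Ḕ against E
    print(equal_ignoring_diacriticals("a", "b"))
```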

It’s too small to tell on my MacBook Pro. :slight_smile: If I blow the text up several times in ASE with Command-+ keystrokes, a small grave does become visible above the right end of the macron.

It looks like it’s an artefact – the top part of the character is being trimmed. If I add a return before it, it looks fine.

I was just going to add that I see what you mean about the two versions of the character looking different. When compared side-by-side, there are small differences. Maybe the degree of difference depends on the font in use.