set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
tell application "Finder"
    repeat with i from 4 to 5
        set strASCII to strASCII & i
        try
            set folderASCII to make new folder at p2d with properties {name:strASCII}
            name of folderASCII
            set countASCII to count result
            delete folderASCII
        on error
            set countASCII to 0
        end try
        set strHigh to "औ" & text 2 thru -1 of strASCII
        try
            set folderHigh to make new folder at p2d with properties {name:strHigh}
            name of folderHigh
            set countHigh to count result
            delete folderHigh
        on error
            set countHigh to 0
        end try
        display dialog ("" & countASCII & return & countHigh)
    end repeat
end tell
On the first pass, two folders are created on the desktop.
The name of the first one is made of 255 digits (plain ASCII).
The name of the second one is made by replacing the first character of the first name with the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and whose UTF-8 encoding is $E0 A4 94.
Both names are 255 characters long.
Both folders are created, and a dialog displays the two name lengths.
On the second pass, one character (5) is added at the end of the two names.
The Finder refuses to create either folder.
If I replace DEVANAGARI LETTER AU with the character ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and whose UTF-8 encoding is $C3 BC, only the first folder, the one named with strASCII, is created.
I really don’t understand why DEVANAGARI LETTER AU, which requires more bytes than LATIN SMALL LETTER U WITH DIAERESIS, is accepted.
I corrected your name, Yvan. I am looking at your script right now, and I really have no idea why the first, 3-byte character works while the second, 2-byte character doesn’t. Maybe there is some byte packing scheme here? It may also be that the $00 byte in the diaeresis character does (or doesn’t do) the trick?
There’s an error when the script tries to delete the second folder on my system. This appears to be something to do with the fact that the first folder’s already in the Trash. (Maybe there’s not enough overhead to edit the second folder’s name?) The effect is that, because the second folder’s not deleted the first time, there’s another error when trying to create it the second time. Putting an ‘empty’ line after each ‘delete’ cures this on my Snow Leopard system.
I don’t know the answer either. But then I don’t know how many bytes DEVANAGARI LETTER AU occupies in the normalized version of UTF-16 used by the HFS+ system. (See the link in my post (#7) above.)
Edit: Perhaps more relevantly, the article to which I linked says that HFS+ names can have a maximum of 255 UTF-16 code points. If DEVANAGARI LETTER AU is represented as just one code point (itself), and LATIN SMALL LETTER U WITH DIAERESIS is almost certainly “normalized” as two code points (the small letter u and the diaeresis), then that would explain the apparent anomaly. The number of bytes doesn’t come into it.
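A rough way to check this explanation outside the Finder is the Python sketch below (Python rather than AppleScript, since its standard library exposes Unicode normalization directly). It assumes that standard NFD is a close enough stand-in for the decomposition HFS+ applies, which should hold for these two characters but is not guaranteed for every character. The function name is made up for illustration.

```python
import unicodedata

def utf16_units_after_nfd(name):
    """Count UTF-16 code units in the decomposed (NFD) form of name.

    Assumption: standard NFD approximates the HFS+ decomposition rules.
    """
    decomposed = unicodedata.normalize("NFD", name)
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte-order mark.
    return len(decomposed.encode("utf-16-le")) // 2

for ch in ("\u0914", "\u00fc"):  # DEVANAGARI LETTER AU, u with diaeresis
    utf8_bytes = len(ch.encode("utf-8"))
    print(unicodedata.name(ch), utf8_bytes, utf16_units_after_nfd(ch))
```

DEVANAGARI LETTER AU needs three bytes in UTF-8 but stays a single unit after decomposition, while LATIN SMALL LETTER U WITH DIAERESIS needs only two bytes in UTF-8 but becomes two units.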
I assume that the error issued when the Finder tries to delete the folder is due to your older system.
Under Lion or Mountain Lion it behaves flawlessly.
This version uses System Events to delete the folders, so nothing is moved to the Trash.
set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
repeat with i from 4 to 5
    set strASCII to strASCII & i
    try
        tell application "Finder"
            set folderASCII to make new folder at p2d with properties {name:strASCII}
            name of folderASCII
            set countASCII to count result
        end tell
        tell application "System Events" to delete folderASCII
    on error
        set countASCII to 0
    end try
    set strHigh to "औ" & text 2 thru -1 of strASCII
    try
        tell application "Finder"
            set folderHigh to make new folder at p2d with properties {name:strHigh}
            name of folderHigh
            set countHigh to count result
        end tell
        tell application "System Events" to delete folderHigh
    on error
        set countHigh to 0
    end try
    display dialog ("" & countASCII & return & countHigh)
end repeat
I find it hard to imagine that the DEVANAGARI LETTER AU character is stored as a single byte like ASCII characters.
I was adding an extra paragraph to my post as you were posting this comment. The relevant units to consider are UTF-16 code points, not bytes.
set uUmlaut1 to (character id 252) -- One code point ("ü").
set uUmlaut2 to (character id 117) & (character id 776) -- Two code points as in HFS+ ("u" & combining diaeresis).
{uUmlaut1, uUmlaut2, uUmlaut1 = uUmlaut2}
PS. Even System Events’s ‘delete’ command simply moves things to the trash on my machine.
In the given example, the uUmlaut character is:
ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and whose UTF-8 encoding is $C3 BC.
When I insert it in a file or folder name, it reduces the number of characters allowed by one. So it’s clearly using two positions.
On the other hand, the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and whose UTF-8 encoding is $E0 A4 94, may be inserted in a file or folder name and is counted as a single position.
That’s what puzzles me.
Bingo, I got it, thanks to your explanations and this table: http://developer.apple.com/legacy/mac/library/#technotes/tn/tn1150table.html
Before that, I didn’t understand that “points” are in fact two-byte units.
औ (DEVANAGARI LETTER AU) is a two-byte Unicode value ($0914), and it’s stored as a single point.
ü (LATIN SMALL LETTER U WITH DIAERESIS) is a one-byte Unicode value ($FC), but it’s stored as two points.
Ḕ (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE) is Unicode $1E14, but it’s stored as three points.
So if we replace an ASCII character with औ, there is no impact on the number of characters allowed (255).
If we replace an ASCII character with ü, the number of characters allowed is reduced by 1 (254).
If we replace an ASCII character with Ḕ, the number of characters allowed is reduced by 2 (253).
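The arithmetic above can be reproduced with a short Python sketch (Python rather than AppleScript, because normalization is available in its standard library). It is only an approximation of what the Finder does: it assumes standard NFD matches the HFS+ decomposition for these three characters, and it takes the 255-unit limit from the article quoted in this thread. The names `HFS_PLUS_LIMIT`, `units`, and `fits` are made up for illustration.

```python
import unicodedata

HFS_PLUS_LIMIT = 255  # assumed limit, in UTF-16 code units, as quoted in the thread

def units(text):
    """UTF-16 code units in the NFD form of text (assumed close to HFS+ storage)."""
    decomposed = unicodedata.normalize("NFD", text)
    return len(decomposed.encode("utf-16-le")) // 2

def fits(name):
    """True if the decomposed name stays within the assumed limit."""
    return units(name) <= HFS_PLUS_LIMIT

digits = "0123456789" * 26  # more ASCII filler than any name below needs

for lead in ("\u0914", "\u00fc", "\u1e14"):
    # Longest name, counted in visible characters, that starts with `lead`.
    max_chars = 1 + HFS_PLUS_LIMIT - units(lead)
    name = lead + digits[: max_chars - 1]
    # The name itself should fit; one more digit should not.
    print(unicodedata.name(lead), max_chars, fits(name), fits(name + "0"))
```

Under these assumptions the three lead characters allow 255, 254, and 253 visible characters respectively, matching the results observed with the Finder.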
This one really puzzles me.
If I remember correctly, you are running 10.5.8, and as I recall the System Events delivered with that system deleted files outright, as it does in 10.8, without putting them in the Trash.
Now I think I understand why you used rm in some of the scripts you sent me.
AU is a single character and can only be represented by a single code point (or Unicode number), however many bytes that may be. (A multiple of 2, since it’s UTF-16.) Characters with diacritics, however, can be represented in Unicode either as characters in themselves (one code point) or as combinations of base characters and “Combining Diacritical Marks” (two code points). The HFS+ system insists on the latter, for some reason.
Interestingly, the ‘id’ function can tell how such characters are formed:
set nameIn to "üऔḔ" -- (LATIN SMALL LETTER U WITH DIAERESIS) & (DEVANAGARI LETTER AU) & (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE)
set lengthIn to (count nameIn)
set idIn to id of nameIn
tell application "Finder" to set nameOut to name of (make new folder at desktop with properties {name:nameIn})
set lengthOut to (count nameOut)
set idOut to id of nameOut
{{nameAsSet:nameIn, length:lengthIn, id:idIn}, return, {nameAsReturned:nameOut, length:lengthOut, id:idOut}}
You’re right! System Events does delete items on the spot. I hadn’t noticed that in your script, the ‘delete’ command is applied to a Finder reference and so it’s actually the Finder doing the deleting.
Not so much interesting as strange, your script compiles here with the E appearing to have just a macron. It behaves as it should, and if I copy and paste from ASE into TextEdit, it looks as it should, but for some reason it appears incorrectly in ASE (and Script Debugger). The same thing happens in the result pane in ASE: the character looks different in nameAsSet than in nameAsReturned. Very strange…
The use of decomposed Unicode characters (it’s not a UTF-16 thing) is normally for fast text processing. In AS you have considering/ignoring diacriticals, which can be processed faster in the background when the Unicode string is in decomposed form. HFS+ is case insensitive, and case-insensitive string comparison is faster with decomposed than with composed Unicode strings. Other operations, such as counting, sorting, or checking whether a string contains diacriticals, can also be performed faster.
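To illustrate why decomposed text makes these operations simple, here is a minimal Python sketch (Python rather than AppleScript, since it has normalization built in). It does not claim to be how HFS+ or AppleScript implement their comparisons, and it says nothing about speed. It only shows that once a string is decomposed, diacritics are separate combining marks that can be detected or skipped one character at a time. All function names are made up for illustration.

```python
import unicodedata

def has_diacritics(text):
    """True if the decomposed text contains any combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters, non-zero for combining marks.
    return any(unicodedata.combining(c) for c in decomposed)

def strip_diacritics(text):
    """Drop combining marks from the decomposed text, keeping base characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def equal_ignoring_case_and_diacritics(a, b):
    """Compare two strings after removing diacritics and folding case."""
    return strip_diacritics(a).casefold() == strip_diacritics(b).casefold()

print(has_diacritics("\u00fc"), has_diacritics("u"))
print(strip_diacritics("\u1e14"))
print(equal_ignoring_case_and_diacritics("\u00dcber", "uber"))
```

With a precomposed string, the same checks would need a lookup table for every accented character. With a decomposed one, a single test per character is enough.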
It’s too small to tell on my MacBook Pro. If I blow the text up several times in ASE with Command-+ keystrokes, a small grave does become visible above the right end of the macron.
I was just going to add that I see what you mean about the two versions of the character looking different. When compared side-by-side, there are small differences. Maybe the degree of difference depends on the font in use.