I just found out something.
If I have this 255-character filename:
12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
9012345678901234567890123456789012345678901234567890123456789012345678901.txt
it's fine in Finder, but if I change one character to ü, Finder cuts one character from the end:
12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
901234567890123456789012345678901234567890123456789012345678901234567890ü.tx
A single ü character takes the space of 2 characters.
Which other characters use 2 spaces?
I couldn't find a handler in this forum that replaces a leading dot and any : with other characters, cuts the string to 255 characters, and works for both folders and files. For files it should keep the file extension, and it needs to handle this 2-character problem too.
Last edited by cirno (2012-12-07 12:50:53 pm)
Offline
As far as I know, the 255 limit applies not to characters but to the bytes used to encode the name.
Your original filename is made of one-byte characters, so you can have 255 of them.
The character ü is encoded with two bytes, which is why one character is dropped.
If you built a filename from two-byte characters, you would be restricted to 125 of them plus x.txt
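A quick way to check this arithmetic, as a sketch in Python and assuming the limit is counted in UTF-8 bytes as this post suggests (`utf8_len` is a made-up helper name, not anything from the system):

```python
def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One ASCII character is one byte, ü (U+00FC) is two bytes.
print(utf8_len("a"))
print(utf8_len("\u00fc"))

# 125 two-byte characters plus "x.txt" is 130 characters but 255 bytes.
name = "\u00fc" * 125 + "x.txt"
print(len(name), utf8_len(name))
```

This only measures the precomposed form of ü; a decomposed form would give a different count.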
Yvan KOENIG (VALLAURIS, France) Friday 7 December 2012 21:43:52
Offline
cirno wrote:
Single ü character takes space of 2 characters...What other characters uses 2 spaces?
The file names are UTF-8 encoded, so look at how many bytes are needed to represent the character. The character ü is in the Latin-1 Supplement block (whose printable characters, U+00A0 to U+00FF, sit at the same positions in CP1252). In UTF-8, every character outside the US-ASCII range (the 7-bit character set) uses 2, 3 or even 4 bytes. A lead byte of 0xC3 means the character lies in the upper part of the Latin-1 Supplement block (U+00C0 to U+00FF), and the following byte 0xBC then selects ü (U+00FC). So ü is stored as the bytes 0xC3 and 0xBC.
So in a worst-case scenario, using only 4-byte characters, your name is limited to about 63 characters (255 / 4). Still enough in my opinion.
The size limit of the file name is 256 bytes. File names are terminated with a zero byte, and that terminator is included in the 256 bytes, so 255 bytes are left for you.
Funny that the file names groesse and größe use the same number of bytes. With wc you can check the size in bytes; as long as it's less than 256, you can use the name.
Applescript:
do shell script "/bin/echo -n größe | wc -c"
do shell script "/bin/echo -n groesse | wc -c"
Note: if you prefer <<< (here-string redirection) or the built-in echo, be aware that both add a newline, so your count will be 1 too high.
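For anyone without a shell at hand, the same byte count can be sketched in Python. `byte_count` is an illustrative name; the second column assumes the name is stored in decomposed (NFD) form, which is an assumption about the file system and not something wc shows:

```python
import unicodedata


def byte_count(name, form="NFC"):
    """UTF-8 byte length of name after normalizing it to the given form."""
    return len(unicodedata.normalize(form, name).encode("utf-8"))


# "gr\u00f6\u00dfe" is größe written with escapes so the source stays unambiguous.
for word in ("gr\u00f6\u00dfe", "groesse"):
    print(word, byte_count(word, "NFC"), byte_count(word, "NFD"))
```

In precomposed form both words need 7 bytes; in decomposed form größe grows by one byte because ö becomes o plus a two-byte combining diaeresis.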
Last edited by DJ Bazzie Wazzie (2012-12-07 06:47:00 pm)
Offline
DJ Bazzie Wazzie wrote:
The file names are UTF-8 encoded
Are you sure about that? I thought they were UTF16. Hmmm.... A search on hfs_format.h says:
/* Unicode strings are used for HFS Plus file and folder names */
struct HFSUniStr255 {
    u_int16_t length; /* number of unicode characters */
    u_int16_t unicode[255]; /* unicode characters */
};
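To make the layout of that struct concrete, here is a rough Python sketch that packs a name into the same shape: a 16-bit length followed by 255 16-bit units. It assumes big-endian byte order and that the name has already been decomposed; `pack_hfs_unistr255` is a hypothetical helper, not a system call.

```python
import struct

MAX_UNITS = 255  # size of the unicode[] array in the struct


def pack_hfs_unistr255(name):
    """Pack name as a 16-bit length followed by 255 UTF-16 units (zero padded)."""
    units = name.encode("utf-16-be")
    count = len(units) // 2  # number of 16-bit units, not bytes
    if count > MAX_UNITS:
        raise ValueError("name needs %d units, the struct holds %d" % (count, MAX_UNITS))
    padded = units.ljust(MAX_UNITS * 2, b"\x00")
    return struct.pack(">H", count) + padded


packed = pack_hfs_unistr255("abc")
print(len(packed), packed[:8])
```

The point is that the length field counts 16-bit units, so the limit is on units and not on UTF-8 bytes.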
Offline
Shane Stanley wrote:
Are you sure about that? I thought they were UTF16. Hmmm.... A search on hfs_format.h says:
/* Unicode strings are used for HFS Plus file and folder names */
struct HFSUniStr255 {
    u_int16_t length; /* number of unicode characters */
    u_int16_t unicode[255]; /* unicode characters */
};
Correct, Shane, but that's at a low level. Every OS puts a virtual file system on top of it, so there is one general file system in your OS. That way the software on top (fopen, fgets, fputs etc.) doesn't need to know anything about the underlying file systems. To access files in C you never use hfs_format.h directly; you use dirent.h. When you open dirent.h you'll notice that file names are just char arrays holding multibyte (UTF-8) text. Because (almost) no software uses the actual file system directly, I consider file names in Mac OS X to be UTF-8 encoded and not UTF-16.
Rectification: Shane is right; the limitations of UFS (BSD) don't apply to the Cocoa applications I was referring to in my posts above. However, file names that need more than 255 bytes in decomposed UTF-8 form can give you problems with some BSD utilities (saving them may shorten the file name or result in an error). Many file-handling utilities, mv for instance, use standard system calls that can handle 'too long' file names in BSD. ls, however, will not print them out completely, nor will dirent give you the complete name when you're coding in C.
Last edited by DJ Bazzie Wazzie (2012-12-09 06:09:07 pm)
Offline
Makes sense -- thanks.
Offline
Near the top of Wikipedia's article on HFS Plus, it says that it uses a "normalized" form of UTF-16, where "precomposed characters like å are decomposed in the HFS+ filename and therefore count as two characters."
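What "decomposed" means can be seen with Python's standard NFD normalization. This is only an approximation: HFS+ uses its own decomposition rules, which may differ from NFD for some characters. `describe_decomposition` is an illustrative name.

```python
import unicodedata


def describe_decomposition(char):
    """List the code points (with names) that char breaks into under NFD."""
    return ["U+%04X %s" % (ord(c), unicodedata.name(c))
            for c in unicodedata.normalize("NFD", char)]


# å (U+00E5) splits into a base letter and a combining mark.
for line in describe_decomposition("\u00e5"):
    print(line)
```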
Offline
The really funny part is when we have gone from the physical file system, through the virtual file system, and ended up in AppleScript.
Script Debugger tells me that the name of a file is encoded as UTF-16.
So, when it comes to Finder and such, the names should be encoded as such. I guess the only effect of this is that some characters that can't be represented in UTF-8 have gotten a 3-byte encoding in Finder.
Edit
Finder's dictionary says Unicode text, which is a synonym for utf8.
Last edited by McUsrII (2012-12-08 04:48:30 am)
Offline
McUsrII wrote:
Finders Dictionary says Unicode text, which is a synonym for utf8.
1. What it says in a scripting dictionary is what passes between the application and a script.
2. AppleScript's Unicode text (and now text) is a form of UTF-16.
Offline
Hello.
Nigel Garvey wrote:
1. What it says in a scripting dictionary is what passes between the application and a script.
That is very good to know, as I have so far taken it to be the type of the actual value, not the data type of what is passed between an application and the script.
Nigel Garvey wrote:
2. AppleScript's Unicode text (and now text) is a form of UTF-16.
Finder's Scripting Dictionary wrote:
unicode text (type) [synonyms: text, utf8]: A Unicode string value.
That is what it says there. It kind of startled me, as I thought text and string were UTF-16.
Offline
Thanks.
Is there an easy way to get a list of every Unicode character?
Offline
cirno wrote:
Thanks.
Is there any easy way to get list of every Unicode character.
http://www.unicode.org/charts/charindex.html
is your friend.
Puzzling behaviour :
Run this script :
Applescript:
set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
tell application "Finder"
    repeat with i from 4 to 5
        set strASCII to strASCII & i
        try
            set folderASCII to make new folder at p2d with properties {name:strASCII}
            name of folderASCII
            set countASCII to count result
            delete folderASCII
        on error
            set countASCII to 0
        end try
        set strHigh to "औ" & text 2 thru -1 of strASCII
        try
            set folderHigh to make new folder at p2d with properties {name:strHigh}
            name of folderHigh
            set countHigh to count result
            delete folderHigh
        on error
            set countHigh to 0
        end try
        display dialog ("" & countASCII & return & countHigh)
    end repeat
end tell
On the first pass, two folders are created on the desktop.
The name of the first one is made of 255 digits (plain ASCII).
The name of the second one is made by replacing the first character of the first one with the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and UTF-8 value is $E0A494.
Both names are 255 characters long.
Both folders are created and a dialog displays the two name lengths.
On the second pass, one character (5) is added at the end of the two names.
The Finder refuses to create either folder.
If I replace (DEVANAGARI LETTER AU) with the character ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and UTF-8 value is $C3 BC, only the first folder, named with strASCII, is created.
I really don't understand why (DEVANAGARI LETTER AU), which requires more bytes than (LATIN SMALL LETTER U WITH DIAERESIS), is accepted.
Yvan KOENIG (VALLAURIS, France) Saturday 8 December 2012 16:38:44
Last edited by Yvan Koenig (2012-12-08 10:38:24 am)
Offline
That seems like a better idea than:
Applescript:
repeat with i from 0 to 65535
    try
        set a to character id i
        log a
    end try
end repeat
Offline
cirno wrote:
Is there any easy way to get list of every Unicode character.
Finder > Menu Edit > Special Characters… > Unicode Table
Last edited by StefanK (2012-12-08 10:12:32 am)
Online
McUsrII wrote:
That seems like a better idea than:
Applescript:
repeat with i from 0 to 65535
    try
        set a to character id i
        log a
    end try
end repeat
But using this scheme, we don't get the names of the characters.
Yvan KOENIG (VALLAURIS, France) Saturday 8 December 2012 17:39:35
Offline
cirno wrote:
Thanks.
Is there any easy way to get list of every Unicode character.
I think you can look here
Offline
Yvan Koenig wrote:
But using this scheme, we don't get the names of the characters
True!
I actually think that the best solution is to look at either your or DJ Bazzie Wazzie's links, or StefanK's tip.
StefanK's solution works when you are offline.
Last edited by McUsrII (2012-12-09 02:21:35 am)
Offline
McUsrII wrote:
Yvan Koening wrote:
But using this scheme, we don't get the names of the characters
True!
I actually think that the best solution is either look at your or DJ Bazzie Wazzie's links, or StefanK's tip.
StefanK's solution work when you are offline.
Yuck! My name is not Koening, it's Koenig.
The characters palette is fine because we can use it to insert characters or to extract a character's name with a simple copy/paste.
Did you look at the script embedded in my message from 09:38:51 am?
Yvan KOENIG (VALLAURIS, France) Saturday 8 December 2012 21:59:41
Offline
I corrected your name Yvan.
And I am looking at your script right now, and I really have no idea why the first, 3-byte character works while the second, 2-byte character doesn't. Maybe there is some byte-packing scheme here? It may also be that the $00 byte in the diaeresis character does (or doesn't do) the trick?
Last edited by McUsrII (2012-12-09 02:30:39 am)
Offline
Yvan Koenig wrote:
Puzzling behaviour :
Run this script :
Applescript:
set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
tell application "Finder"
    repeat with i from 4 to 5
        set strASCII to strASCII & i
        try
            set folderASCII to make new folder at p2d with properties {name:strASCII}
            name of folderASCII
            set countASCII to count result
            delete folderASCII
        on error
            set countASCII to 0
        end try
        set strHigh to "औ" & text 2 thru -1 of strASCII
        try
            set folderHigh to make new folder at p2d with properties {name:strHigh}
            name of folderHigh
            set countHigh to count result
            delete folderHigh
        on error
            set countHigh to 0
        end try
        display dialog ("" & countASCII & return & countHigh)
    end repeat
end tell
On the first pass, two folders are created on the desktop.
The name of the first one is made of 255 digits (plain ASCII).
The name of the second one is made by replacing the first character of the first one with the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and UTF-8 value is $E0A494.
Both names are 255 characters long.
Both folders are created and a dialog displays the two name lengths.
On the second pass, one character (5) is added at the end of the two names.
The Finder refuses to create either folder.
If I replace (DEVANAGARI LETTER AU) with the character ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and UTF-8 value is $C3 BC, only the first folder, named with strASCII, is created.
There's an error when the script tries to delete the second folder on my system. This appears to be something to do with the fact that the first folder's already in the Trash. (Maybe there's not enough overhead to edit the second folder's name?) The effect is that, because the second folder's not deleted the first time, there's another error when trying to create it the second time. Putting an 'empty' line after each 'delete' cures this on my Snow Leopard system.
I really don't understand why (DEVANAGARI LETTER AU) which requires more bytes than (LATIN SMALL LETTER U WITH DIAERESIS) is accepted.
I don't know the answer either. But then I don't know how many bytes DEVANAGARI LETTER AU occupies in the normalized version of UTF-16 used by the HFS+ system. (See the link in my post (#7) above.)
Edit: Perhaps more relevantly, the article to which I linked says that HFS+ names can have a maximum of 255 UTF-16 code points. If DEVANAGARI LETTER AU is represented as just one code point (itself) — and LATIN SMALL LETTER U WITH DIAERESIS is almost certainly "normalized" as two code points (the small letter u and the diaeresis) — then that would explain the apparent anomaly. The number of bytes doesn't come into it.
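That explanation can be tried out with a small Python sketch. It assumes the limit is 255 16-bit UTF-16 units counted after decomposition, and it uses standard NFD as a stand-in for the HFS+ decomposition rules, which may not match exactly. `hfs_units` and `fits` are made-up names.

```python
import unicodedata

HFS_MAX_UNITS = 255  # limit as described in the linked article


def hfs_units(name):
    """UTF-16 units needed by the decomposed (NFD) form of name."""
    decomposed = unicodedata.normalize("NFD", name)
    return len(decomposed.encode("utf-16-le")) // 2


def fits(name):
    """True if the decomposed name stays within the 255-unit limit."""
    return hfs_units(name) <= HFS_MAX_UNITS


ascii_name = "0" * 255                    # 255 plain ASCII characters
au_name = "\u0914" + ascii_name[1:]       # first character replaced by DEVANAGARI LETTER AU
umlaut_name = "\u00fc" + ascii_name[1:]   # first character replaced by ü

for label, name in (("ASCII", ascii_name), ("AU", au_name), ("umlaut", umlaut_name)):
    print(label, len(name), hfs_units(name), fits(name))
```

All three names are 255 characters long, but only the ü variant needs 256 units, which matches what the test script showed in the Finder.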
Last edited by Nigel Garvey (2012-12-09 05:08:01 am)
Offline
Hello Nigel.
I assume that the error raised when the Finder tries to delete the folder is due to your old system.
Under Lion or Mountain Lion it behaves flawlessly.
This version uses System Events to delete the folders so nothing is moved to the trash.
Applescript:
set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
repeat with i from 4 to 5
    set strASCII to strASCII & i
    try
        tell application "Finder"
            set folderASCII to make new folder at p2d with properties {name:strASCII}
            name of folderASCII
            set countASCII to count result
        end tell
        tell application "System Events" to delete folderASCII
    on error
        set countASCII to 0
    end try
    set strHigh to "औ" & text 2 thru -1 of strASCII
    try
        tell application "Finder"
            set folderHigh to make new folder at p2d with properties {name:strHigh}
            name of folderHigh
            set countHigh to count result
        end tell
        tell application "System Events" to delete folderHigh
    on error
        set countHigh to 0
    end try
    display dialog ("" & countASCII & return & countHigh)
end repeat
I have some difficulty imagining that the (DEVANAGARI LETTER AU) character is stored as a single byte like ASCII characters.
Yvan KOENIG (VALLAURIS, France) Sunday 9 December 2012 12:03:54
Offline
Yvan Koenig wrote:
I have some difficulties to imagine that the (DEVANAGARI LETTER AU) character is stored as a single byte like ASCII characters
Hi Yvan.
I was adding an extra paragraph to my post as you were posting this comment. The relevant units to consider are UTF-16 code points, not bytes.
Applescript:
set uUmlaut1 to (character id 252) -- One code point ("ü").
set uUmlaut2 to (character id 117) & (character id 776) -- Two code points as in HFS+ ("u" & combining diaeresis).
{uUmlaut1, uUmlaut2, uUmlaut1 = uUmlaut2}
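For comparison, here is the same pair of strings in Python, which compares text code point by code point. This is a sketch outside AppleScript and says nothing about how AppleScript's own = operator treats them; `same_text` is an illustrative helper.

```python
import unicodedata

composed = "\u00fc"      # one code point: ü
decomposed = "u\u0308"   # two code points: u + combining diaeresis


def same_text(a, b):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))
print(composed == decomposed)           # raw comparison of code points
print(same_text(composed, decomposed))  # comparison after normalization
```

The raw comparison sees different code points, while the normalized comparison treats the two spellings as the same text.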
PS. Even System Events's 'delete' command simply moves things to the trash on my machine.
Last edited by Nigel Garvey (2012-12-09 06:19:05 am)
Offline
An enlightening day, as enlightening here as the snow outside! No pun intended.
Offline
Nigel Garvey wrote:
Yvan Koenig wrote:
I have some difficulties to imagine that the (DEVANAGARI LETTER AU) character is stored as a single byte like ASCII characters
Hi Yvan.
I was adding an extra paragraph to my post as you were posting this comment. The relevant units to consider are UTF-16 code points, not bytes.
Applescript:
set uUmlaut1 to (character id 252) -- One code point ("ü").
set uUmlaut2 to (character id 117) & (character id 776) -- Two code points as in HFS+ ("u" & combining diaeresis).
{uUmlaut1, uUmlaut2, uUmlaut1 = uUmlaut2}
In the given example, the u-umlaut character is:
ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and UTF-8 value is $C3 BC.
When I insert it in a file/folder name, it reduces the number of characters allowed by one. So it clearly counts as two units.
On the other hand, the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and UTF-8 value is $E0A494, may be inserted in a file/folder name and is counted as a single unit.
That's what puzzled me.
Bingo, I got it. Thanks to your explanations and this table:
http://developer.apple.com/legacy/mac/l … table.html
Before that, I didn't understand that "points" are in fact two bytes each.
औ (DEVANAGARI LETTER AU) has a two-byte Unicode value but is stored as a single point.
ü (LATIN SMALL LETTER U WITH DIAERESIS) has a one-byte Unicode value but is stored as a two-point object.
Ḕ (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE) is Unicode $1E14 but is stored as a three-point object.
So, we may replace an ASCII character with औ with no impact on the number of characters allowed. (255)
So, if we replace an ASCII character with ü, the number of characters allowed is reduced by 1. (254)
So, if we replace an ASCII character with Ḕ, the number of characters allowed is reduced by 2. (253)
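With these costs in hand, the handler asked for at the top of the thread can be sketched. This is written in Python and not AppleScript, and it rests on assumptions: standard NFD stands in for the HFS+ decomposition, the limit is 255 UTF-16 units, and the replacement characters "-" and "_" are arbitrary picks. All function names are made up for illustration.

```python
import os
import unicodedata

HFS_MAX_UNITS = 255  # assumed limit, counted in 16-bit UTF-16 units


def utf16_units(text):
    """UTF-16 units used by the decomposed (NFD) form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return len(decomposed.encode("utf-16-le")) // 2


def clusters(text):
    """Yield each base character together with the combining marks after it."""
    cluster = ""
    for ch in unicodedata.normalize("NFD", text):
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


def clean_name(name):
    """Replace ':' and a leading '.' (replacement characters are arbitrary)."""
    name = name.replace(":", "-")
    if name.startswith("."):
        name = "_" + name[1:]
    return name


def truncate_name(name, limit=HFS_MAX_UNITS):
    """Shorten the stem so that stem + extension fits within limit units."""
    stem, ext = os.path.splitext(clean_name(name))
    budget = limit - utf16_units(ext)
    kept = []
    for cluster in clusters(stem):
        cost = utf16_units(cluster)
        if cost > budget:
            break  # never split a base character from its marks
        kept.append(cluster)
        budget -= cost
    return unicodedata.normalize("NFC", "".join(kept)) + ext


long_name = "\u00fc" + "1" * 250 + ".txt"  # 255 characters, 256 units
short_name = truncate_name(long_name)
print(len(short_name), utf16_units(short_name), short_name[-4:])
```

For folders there is no real extension, so a name containing a dot would have its last dotted part preserved as if it were one; a caller would need to handle that case separately.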
Nigel Garvey wrote:
PS. Even System Events's 'delete' command simply moves things to the trash on my machine.
This one really puzzles me.
If I remember correctly you are running 10.5.8, and as I recall the System Events delivered with that system deleted files without putting them in the trash, as it does in 10.8.
Now I understand why you used rm in some scripts which you sent to me.
Yvan KOENIG (VALLAURIS, France) Sunday 9 December 2012 21:26:39
Last edited by Yvan Koenig (2012-12-09 02:52:20 pm)
Offline
Yvan Koenig wrote:
Before that, I didn't understand that "points" are in fact two bytes.
औ (DEVANAGARI LETTER AU) has a two-byte Unicode value but is stored as a single point.
ü (LATIN SMALL LETTER U WITH DIAERESIS) has a one-byte Unicode value but is stored as a two-point object.
Ḕ (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE) is Unicode $1E14 but is stored as a three-point object.
AU is a single character and can only be represented by a single code point (or Unicode number), however many bytes that may be. (A multiple of 2, since it's UTF-16.) Characters with diacritics, however, can be represented in Unicode either as characters in themselves (one code point) or as combinations of base characters and "Combining Diacritical Marks" (two code points). The HFS+ system insists on the latter, for some reason.
Interestingly, the 'id' function can tell how such characters are formed:
Applescript:
set nameIn to "üऔḔ" -- (LATIN SMALL LETTER U WITH DIAERESIS) & (DEVANAGARI LETTER AU) & (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE)
set lengthIn to (count nameIn)
set idIn to id of nameIn
tell application "Finder" to set nameOut to name of (make new folder at desktop with properties {name:nameIn})
set lengthOut to (count nameOut)
set idOut to id of nameOut
{{nameAsSet:nameIn, length:lengthIn, id:idIn}, return, {nameAsReturned:nameOut, length:lengthOut, id:idOut}}
Nigel Garvey wrote:
PS. Even System Events's 'delete' command simply moves things to the trash on my machine.
This one is really puzzling me.
If I remember well you are running 10.5.8 and my memory said that the System Events delivered with this system deleted files as it did in 10.8 without putting them in the trash.
You're right! System Events does delete items on the spot. I hadn't noticed that in your script, the 'delete' command is applied to a Finder reference and so it's actually the Finder doing the deleting.
Offline