1 character uses space of 2 characters

cirno · December 7, 2012, 6:39pm

I just found out something.

If I have this 255-character-long filename:

12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
9012345678901234567890123456789012345678901234567890123456789012345678901.txt

it’s okay in Finder, but if I try to change 1 character to ü, then Finder cuts 1 character from the end:

12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
901234567890123456789012345678901234567890123456789012345678901234567890ü.tx

A single ü character takes the space of 2 characters.

What other characters use 2 spaces?

I couldn’t find a handler on this forum that replaces a leading dot with some other character, replaces : with another character, and cuts the string to 255 characters. It should work for both folders and files, keep the file extension for files, and handle this 2-character problem too.
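A rough Python sketch of such a handler follows. It assumes, as later replies in this thread suggest, that the limit is 255 UTF-16 code units of the decomposed name, and it uses Python's standard NFD normalization as a stand-in for the HFS+ decomposition, which may differ for some characters. The names `sanitize_name` and `units` and the `_` replacement character are made up for illustration.

```python
import unicodedata

LIMIT = 255  # assumed limit: UTF-16 code units of the decomposed name


def units(text: str) -> int:
    """Count UTF-16 code units of the NFD (decomposed) form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return len(decomposed.encode("utf-16-le")) // 2


def sanitize_name(name: str, is_folder: bool = False,
                  replacement: str = "_", limit: int = LIMIT) -> str:
    """Replace ':' and a leading dot, then shorten the name to fit the limit."""
    name = name.replace(":", replacement)
    if name.startswith("."):
        name = replacement + name[1:]

    # For files, keep the extension and only shorten the part before it.
    stem, ext = name, ""
    if not is_folder:
        dot = name.rfind(".")
        # Treat the suffix as an extension only if it leaves room for a stem.
        if dot > 0 and units(name[dot:]) < limit:
            stem, ext = name[:dot], name[dot:]

    # Drop characters from the end of the stem until the whole name fits.
    while stem and units(stem + ext) > limit:
        stem = stem[:-1]
    return stem + ext


if __name__ == "__main__":
    too_long = "\u00fc" + "1" * 250 + ".txt"  # 255 characters, one of them ü
    fixed = sanitize_name(too_long)
    print(units(too_long), units(fixed), fixed[-8:])
```

Unlike the Finder behaviour described above, this sketch shortens the part before the extension, so `.txt` survives.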

Yvan_Koenig · December 7, 2012, 8:44pm

As far as I know, the 255 limit doesn’t apply to characters but to the bytes used to describe the name.
Your original filename is made of characters described by one byte each, so you may have 255 of them.
The character ü is described by two bytes, which is why one character is dropped.
If you were building a filename made of two-byte characters, you would be restricted to 125 of them plus x.txt

Yvan KOENIG (VALLAURIS, France) Friday, 7 December 2012 21:43:52

DJ_Bazzie_Wazzie · December 8, 2012, 12:24am

The file names are UTF-8 encoded, so look at how many bytes are needed to represent the character. The character ü (U+00FC) is in the Latin-1 Supplement block, where it has the same value as in ISO 8859-1 and CP1252. In UTF-8 every character outside the US-ASCII range (the 7-bit character set) uses 2, 3 or even 4 bytes. For ü, the lead byte 0xC3 announces a two-byte sequence and the continuation byte is 0xBC, so ü is stored as the bytes 0xC3 and 0xBC.

So in a worst-case scenario, using only 4-byte characters, your name is limited to about 63 characters. Still enough in my opinion.
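To see the byte sequences for yourself, here is a small Python sketch. The sample characters and the helper name `utf8_hex` are only illustrative choices.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated hex values."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


if __name__ == "__main__":
    # a, ü, DEVANAGARI LETTER AU, and an emoji outside the BMP
    samples = ["a", "\u00fc", "\u0914", "\U0001F600"]
    for ch in samples:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {utf8_hex(ch)} ({len(encoded)} bytes)")
```

The four samples need 1, 2, 3 and 4 bytes respectively.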

The byte size limit of the file name is 256 bytes. File names are terminated with a zero byte so there are 255 bytes left for you because the string terminator is included in those 256 available bytes.

Funny that the number of bytes used for the file names groesse and größe is equal. With wc you can check the size in bytes, and as long as it’s less than 256 you can use it.

do shell script "/bin/echo -n größe | wc -c"
do shell script "/bin/echo -n groesse | wc -c"

Note: if you prefer <<< (here-string redirection) or the built-in echo, you should know that both add a newline, so your count has 1 extra character.
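The same check can be sketched in Python, assuming the limit is counted in UTF-8 bytes as described in this post. The helper name `utf8_len` is hypothetical.

```python
def utf8_len(name: str) -> int:
    """Number of bytes the name needs when encoded as UTF-8."""
    return len(name.encode("utf-8"))


if __name__ == "__main__":
    print(utf8_len("gr\u00f6\u00dfe"))  # größe, precomposed: 7
    print(utf8_len("groesse"))          # 7
    print(utf8_len("groesse\n"))        # a trailing newline adds one byte: 8
```

Note that this counts the precomposed form of größe; a decomposed ö (o plus a combining diaeresis) would need one byte more.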

Shane_Stanley · December 8, 2012, 1:18am

Are you sure about that? I thought they were UTF16. Hmmm… A search on hfs_format.h says:

/* Unicode strings are used for HFS Plus file and folder names */
struct HFSUniStr255 {
	u_int16_t length; /* number of unicode characters */
	u_int16_t unicode[255]; /* unicode characters */
};

DJ_Bazzie_Wazzie · December 8, 2012, 10:20am

Correct, Shane, but that’s at a low level. Every OS puts a virtual file system on top of it, so there is one general file system in your OS. That way the software on top, using calls like fopen, fgets, fputs etc., doesn’t need to know anything about the underlying file systems. To access files in C you never use hfs_format.h directly; you use dirent.h. When you open dirent.h you’ll notice that file names are just char arrays holding multibyte (UTF-8) strings. Because (almost) no software uses the actual file system directly, I consider file names in Mac OS X to be UTF-8 encoded and not UTF-16.

Rectification: Shane is right, the UFS (BSD) limitations I was referring to in my posts above don’t apply to Cocoa applications. However, file names that use more than 255 bytes in UTF-8 decomposed form can give you problems with some BSD utilities (saving them may shorten the file name or result in an error). Many file-handling utilities, mv for instance, use standard system calls which can handle ‘too’ long file names in BSD. ls, however, will not print them out completely, nor will dirent give you the complete name when you’re coding in C.

Shane_Stanley · December 8, 2012, 10:38am

Makes sense – thanks.

Nigel_Garvey · December 8, 2012, 10:40am

Near the top of Wikipedia’s article on HFS Plus, it says that it uses a “normalized” form of UTF-16, where “precomposed characters like å are decomposed in the HFS+ filename and therefore count as two characters.”
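The decomposition can be illustrated with Python's standard `unicodedata` module. This uses plain Unicode NFD, which is only an approximation of the HFS+ variant, and the helper name `decompose` is made up for the example.

```python
import unicodedata


def decompose(text: str) -> list:
    """List the code points of the NFD (decomposed) form of text with their names."""
    decomposed = unicodedata.normalize("NFD", text)
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in decomposed]


if __name__ == "__main__":
    # å becomes the letter a followed by a combining ring above
    print(decompose("\u00e5"))
```

One visible character therefore occupies two slots once it is decomposed.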

McUsrII · December 8, 2012, 10:42am

The really funny part is when we have gone from the physical file system, through the virtual file system, and ended up in AppleScript.

Script Debugger tells me that the name of a file is encoded as UTF-16.

So, when it comes to Finder and such, the names should be encoded as such. I guess the only effect of this is that some characters that can’t be represented in UTF-8 have gotten a 3-byte encoding in Finder.

Edit

Finder’s dictionary says Unicode text, which I take to be a synonym for UTF-8.

Nigel_Garvey · December 8, 2012, 11:08am

What it says in a scripting dictionary is what passes between the application and a script.
AppleScript’s Unicode text (and now text) is a form of UTF-16.

McUsrII · December 8, 2012, 11:31am

Hello.

That is very good to know, as I have so far taken it to be the type of the actual value, not the data type of what is passed between an application and the script.

That is what it says there. It kind of startled me, as I thought that text and string were UTF-16.

cirno · December 8, 2012, 3:22pm

Thanks.

Is there any easy way to get a list of every Unicode character?

Yvan_Koenig · December 8, 2012, 3:38pm

http://www.unicode.org/charts/charindex.html

is your friend.

Puzzling behaviour:
Run this script:

set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
tell application "Finder"
	repeat with i from 4 to 5
		set strASCII to strASCII & i
		try
			set folderASCII to make new folder at p2d with properties {name:strASCII}
			name of folderASCII
			set countASCII to count result
			delete folderASCII
		on error
			set countASCII to 0
		end try
		set strHigh to "औ" & text 2 thru -1 of strASCII
		try
			set folderHigh to make new folder at p2d with properties {name:strHigh}
			name of folderHigh
			set countHigh to count result
			delete folderHigh
		on error
			set countHigh to 0
		end try
		display dialog ("" & countASCII & return & countHigh)
	end repeat
end tell

On the first pass, two folders are created on the desktop.
The name of the first one is made of 255 digits (plain ASCII).
The name of the second one is made by replacing the first character of the first name with the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and whose UTF-8 value is $E0 A4 94.

Both names are 255 characters long.
Both folders are created and a dialog displays the two name lengths.
On the second pass, one character (5) is added at the end of the two names.
The Finder refuses to create either folder.

If I replace DEVANAGARI LETTER AU with the character ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and whose UTF-8 value is $C3 BC, only the first folder, named with strASCII, is created.

I really don’t understand why (DEVANAGARI LETTER AU) which requires more bytes than (LATIN SMALL LETTER U WITH DIAERESIS) is accepted.

Yvan KOENIG (VALLAURIS, France) Saturday, 8 December 2012 16:38:44

McUsrII · December 8, 2012, 4:05pm

That seems like a better idea than:

repeat with i from 0 to 65535
	try
		set a to character id i
		log a
	end try
end repeat

StefanK · December 8, 2012, 4:11pm

Finder > Menu Edit > Special Characters > Unicode Table

Yvan_Koenig · December 8, 2012, 4:39pm

But using this scheme, we don’t get the names of the characters
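If the names are what you are after, one option outside AppleScript is Python's standard `unicodedata` module. The sketch below lists code points together with their names; the function name `named_characters` and the range shown are arbitrary choices, and which characters have names depends on the Unicode version bundled with your Python.

```python
import unicodedata


def named_characters(start: int, stop: int):
    """Yield (code point, character, name) for every named character in the range."""
    for cp in range(start, stop):
        ch = chr(cp)
        # Control characters, surrogates and unassigned code points have no name.
        name = unicodedata.name(ch, None)
        if name is not None:
            yield cp, ch, name


if __name__ == "__main__":
    for cp, ch, name in named_characters(0x00F8, 0x0100):
        print(f"U+{cp:04X}\t{ch}\t{name}")
```

Pass `0, 0x110000` as the range to walk the whole Unicode code space.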

DJ_Bazzie_Wazzie · December 8, 2012, 4:48pm

I think you can look here

McUsrII · December 8, 2012, 6:04pm

True!

I actually think that the best solution is either to look at your links or DJ Bazzie Wazzie’s, or to use StefanK’s tip.

StefanK’s solution works when you are offline.

Yvan_Koenig · December 8, 2012, 9:02pm

Yuck! My name is not Koening, it’s Koenig.

The characters palette is fine because we may use it to insert characters or to extract the character name with a simple copy/paste.

Did you look at the script embedded in my message from 09:38:51 am?

McUsrII · December 9, 2012, 8:23am

I corrected your name, Yvan. I am looking at your script right now, and I really have no idea why the first, 3-byte character works when the second, 2-byte character doesn’t. Maybe there is some byte packing scheme here? It may also be that the $00 byte in the diaeresis character does (or doesn’t do) the trick?

Nigel_Garvey · December 9, 2012, 10:02am

There’s an error when the script tries to delete the second folder on my system. This appears to be something to do with the fact that the first folder’s already in the Trash. (Maybe there’s not enough overhead to edit the second folder’s name?) The effect is that, because the second folder’s not deleted the first time, there’s another error when trying to create it the second time. Putting an ‘empty’ line after each ‘delete’ cures this on my Snow Leopard system.

I don’t know the answer either. But then I don’t know how many bytes DEVANAGARI LETTER AU occupies in the normalized version of UTF-16 used by the HFS+ system. (See the link in my post (#7) above.)

Edit: Perhaps more relevantly, the article to which I linked says that HFS+ names can have a maximum of 255 UTF-16 code points. If DEVANAGARI LETTER AU is represented as just one code point (itself), and LATIN SMALL LETTER U WITH DIAERESIS is almost certainly “normalized” as two code points (the small letter u and the diaeresis), then that would explain the apparent anomaly. The number of bytes doesn’t come into it.
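This explanation can be checked with a short Python sketch. It assumes that standard Unicode NFD is close enough to the HFS+ normalization for these two characters, and that the limit is 255 UTF-16 code units (which equal code points for these characters). The helper names are hypothetical.

```python
import unicodedata


def utf16_units(text: str) -> int:
    """Number of 16-bit units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def decomposed_units(name: str) -> int:
    """UTF-16 units of the NFD (decomposed) form of name."""
    return utf16_units(unicodedata.normalize("NFD", name))


if __name__ == "__main__":
    ascii_name = ("0123456789" * 26)[:255]   # 255 ASCII digits
    with_au = "\u0914" + ascii_name[1:]      # first character replaced by DEVANAGARI LETTER AU
    with_u = "\u00fc" + ascii_name[1:]       # first character replaced by ü

    for label, name in (("ASCII", ascii_name), ("AU", with_au), ("u diaeresis", with_u)):
        n = decomposed_units(name)
        print(label, n, "fits" if n <= 255 else "too long")
```

Under these assumptions the ASCII and DEVANAGARI names both come to 255 units, while the ü name comes to 256, which matches what the script in this thread reported.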