Sunday, September 21, 2014

#1 2012-12-07 12:39:50 pm

cirno
Member
Registered: 2005-05-30
Posts: 448

1 character uses space of 2 characters

I just found out something.

If I have this 255-character filename:

12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
9012345678901234567890123456789012345678901234567890123456789012345678901.txt

it's fine in Finder, but if I change one character to ü, Finder cuts one character from the end:

12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678
901234567890123456789012345678901234567890123456789012345678901234567890ü.tx

A single ü takes the space of two characters.

Which other characters take two spaces?

I couldn't find a handler in this forum that replaces a leading dot and any : with other characters, cuts the string to 255 characters, and works for both folders and files. For files it should keep the file extension, and it needs to handle this two-character problem too.

Last edited by cirno (2012-12-07 12:50:53 pm)

Offline

 

#2 2012-12-07 02:44:00 pm

Yvan Koenig
Member
Registered: 2006-09-14
Posts: 1477

Re: 1 character uses space of 2 characters

As far as I know, the 255 limit doesn't apply to characters but to the bytes used to encode the name.
Your original filename consists of one-byte characters, so you can have 255 of them.
The character ü is encoded in two bytes, which is why one character is dropped.
If you built a filename entirely from two-byte characters, you would be restricted to 125 of them plus x.txt

Yvan KOENIG (VALLAURIS, France) Friday 7 December 2012 21:43:52

Online

 

#3 2012-12-07 06:24:13 pm

DJ Bazzie Wazzie
Member
From: the Netherlands
Registered: 2004-10-20
Posts: 1908

Re: 1 character uses space of 2 characters

cirno wrote:

Single ü character takes space of 2 characters...What other characters uses 2 spaces?

The file names are UTF-8 encoded, so look at how many bytes are needed to represent the character. Character ü is in the Latin-1 Supplement block (which largely matches the CP1252 extended characters). In UTF-8, every character outside the US-ASCII (7-bit) range uses 2, 3 or even 4 bytes. A lead byte of 0xC3 signals a two-byte sequence for a character in the upper half of the Latin-1 Supplement block, and for ü the second byte is 0xBC. So character ü uses the bytes 0xC3 and 0xBC.

So in a worst-case scenario, using only 4-byte characters, your name is limited to at most 63 characters. Still enough, in my opinion.

The size limit of the file name is 256 bytes. File names are terminated with a zero byte, so 255 bytes are left for you, because the string terminator is included in those 256 available bytes.

Funny that the file names groesse and größe use the same number of bytes. With wc you can check the size in bytes, and as long as it's less than 256 you can use it.

Applescript:

do shell script "/bin/echo -n größe | wc -c"
do shell script "/bin/echo -n groesse | wc -c"

Note: if you prefer <<< (here-string redirection) or the built-in echo, be aware that both add a newline, so your count will be 1 too high.
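
The same byte counts can be cross-checked without the shell. What follows is only a minimal Python sketch (Python and the helper names `utf8_len` and `fits_in_255_bytes` are assumptions for illustration, not something used in this thread). It measures the UTF-8 encoding of the text exactly as given, with no normalization, so it says nothing about how a particular file system stores the name.

```python
def utf8_len(text):
    """Number of bytes needed to encode text as UTF-8."""
    return len(text.encode("utf-8"))


def fits_in_255_bytes(name):
    """True if the UTF-8 form of name is at most 255 bytes long."""
    return utf8_len(name) <= 255


# Escapes keep the literals unambiguous (precomposed characters).
samples = {
    "ascii a": "a",
    "u with diaeresis": "\u00fc",
    "devanagari au": "\u0914",
    "outside the BMP": "\U0001F600",
}
for label, ch in samples.items():
    print(label, utf8_len(ch), ch.encode("utf-8").hex())

# How many characters fit in 255 bytes if every character needs n bytes.
for n in range(1, 5):
    print(n, "byte(s) per character:", 255 // n)

# The two spellings compared with wc in this post.
for name in ("gr\u00f6\u00dfe", "groesse"):
    print(name, utf8_len(name), fits_in_255_bytes(name))
```

Unlike `echo` without `-n`, nothing here appends a newline, so the counts need no correction.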

Last edited by DJ Bazzie Wazzie (2012-12-07 06:47:00 pm)


Kind regards

Offline

 

#4 2012-12-07 07:18:11 pm

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 3628

Re: 1 character uses space of 2 characters

DJ Bazzie Wazzie wrote:

The file names are UTF-8 encoded

Are you sure about that? I thought they were UTF16. Hmmm.... A search on hfs_format.h says:

/* Unicode strings are used for HFS Plus file and folder names */
struct HFSUniStr255 {
    u_int16_t    length;        /* number of unicode characters */
    u_int16_t    unicode[255];    /* unicode characters */
};
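
To make that layout concrete, here is a rough Python sketch that packs a name the way the struct is declared: a 16-bit length followed by room for 255 16-bit units. The function name `pack_hfs_unistr255` is invented for illustration, big-endian byte order is assumed, and no Unicode normalization is applied, so this is not a faithful model of what the file system actually writes.

```python
import struct

MAX_UNITS = 255  # size of the unicode[] array in the struct


def pack_hfs_unistr255(name):
    """Pack name as a 16-bit length plus 255 zero-padded UTF-16 units."""
    units = name.encode("utf-16-be")
    length = len(units) // 2  # number of 16-bit units, not bytes
    if length > MAX_UNITS:
        raise ValueError("name needs %d units, only %d fit" % (length, MAX_UNITS))
    return struct.pack(">H", length) + units.ljust(MAX_UNITS * 2, b"\x00")


record = pack_hfs_unistr255("abc")
print(len(record))                       # total size of the record in bytes
print(int.from_bytes(record[:2], "big"))  # value of the length field
```

The point of the sketch is that the limit in this struct is expressed in 16-bit units, not in bytes of UTF-8.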


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/

Offline

 

#5 2012-12-08 04:20:34 am

DJ Bazzie Wazzie
Member
From: the Netherlands
Registered: 2004-10-20
Posts: 1908

Re: 1 character uses space of 2 characters

Shane Stanley wrote:

Are you sure about that? I thought they were UTF16. Hmmm.... A search on hfs_format.h says:

/* Unicode strings are used for HFS Plus file and folder names */
struct HFSUniStr255 {
    u_int16_t    length;        /* number of unicode characters */
    u_int16_t    unicode[255];    /* unicode characters */
};

Correct, Shane, but that's at a low level. Every OS puts a virtual file system on top of it, so there is one general file system in your OS. Thanks to this, software on top, such as fopen, fgets and fputs, doesn't need to know anything about the underlying file systems. To access files (in C) you never use hfs_format.h directly; you use dirent.h. When you open dirent.h you'll notice that file names are just char arrays holding multibyte (UTF-8) strings. Because (almost) no software uses the actual file system directly, I consider file names in Mac OS X to be UTF-8 encoded and not UTF-16.

Rectification: Shane is right; the limitations of UFS (BSD) don't apply to Cocoa applications, which is what I was referring to in my posts above. However, file names that use more than 255 bytes in UTF-8 decomposed form can give you problems with some BSD utilities (for example, saving them may shorten the file name or result in an error). Many file-handling utilities, mv for instance, use standard system calls that can handle 'too' long file names in BSD. ls, however, will not print them out completely, nor will dirent give you the complete name when you're coding in C.

Last edited by DJ Bazzie Wazzie (2012-12-09 06:09:07 pm)


Kind regards

Offline

 

#6 2012-12-08 04:38:32 am

Shane Stanley
Member
From: Australia
Registered: 2002-12-07
Posts: 3628

Re: 1 character uses space of 2 characters

Makes sense -- thanks.


Shane Stanley <sstanley@myriad-com.com.au>
www.macosxautomation.com/applescript/apps/

Offline

 

#7 2012-12-08 04:40:05 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-19
Posts: 3514

Re: 1 character uses space of 2 characters

Near the top of Wikipedia's article on HFS Plus, it says that it uses a "normalized" form of UTF-16, where "precomposed characters like å are decomposed in the HFS+ filename and therefore count as two characters."


NG

Offline

 

#8 2012-12-08 04:42:23 am

McUsrII
Member
Registered: 2012-11-20
Posts: 2290
Website

Re: 1 character uses space of 2 characters

The really funny part is when we have gone from the physical file system, through the virtual file system, and ended up in AppleScript.

Script Debugger tells me that the name of a file is encoded as UTF-16. :)

So, when it comes to Finder and such, the names should be encoded as such. I guess the only effect of this is that some characters that can't be represented in UTF-8 have gotten a 3-byte encoding in Finder.

Edit

Finder's dictionary says Unicode text, which is a synonym for utf8.

Last edited by McUsrII (2012-12-08 04:48:30 am)


Filed under: Hfs

Offline

 

#9 2012-12-08 05:08:41 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-19
Posts: 3514

Re: 1 character uses space of 2 characters

McUsrII wrote:

Finders Dictionary says Unicode text, which is a synonym for utf8.

1. What it says in a scripting dictionary is what passes between the application and a script.
2. AppleScript's Unicode text (and now text) is a form of UTF-16. :)


NG

Offline

 

#10 2012-12-08 05:31:13 am

McUsrII
Member
Registered: 2012-11-20
Posts: 2290
Website

Re: 1 character uses space of 2 characters

Hello. :)

Nigel Garvey wrote:

1. What it says in a scripting dictionary is what passes between the application and a script.

That is very good to know, as I have so far taken it to be the type of the actual value, not the data type of what is passed between an application and the script.

Nigel Garvey wrote:

2. AppleScript's Unicode text (and now text) is a form of UTF-16.

Finder's scripting dictionary wrote:

unicode text (type) [synonyms: text, utf8]: A Unicode string value.

That is what it says there. It rather startled me, as I thought that text and string were UTF-16.


Filed under: dictionary, Finder, utf16, uf8

Offline

 

#11 2012-12-08 09:22:46 am

cirno
Member
Registered: 2005-05-30
Posts: 448

Re: 1 character uses space of 2 characters

Thanks.

Is there an easy way to get a list of every Unicode character?

Offline

 

#12 2012-12-08 09:38:51 am

Yvan Koenig
Member
Registered: 2006-09-14
Posts: 1477

Re: 1 character uses space of 2 characters

cirno wrote:

Thanks.

Is there any easy way to get list of every Unicode character.

http://www.unicode.org/charts/charindex.html

is your friend.


Puzzling behaviour :
Run this script :

Applescript:


set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
tell application "Finder"
   repeat with i from 4 to 5
       set strASCII to strASCII & i
       try
           set folderASCII to make new folder at p2d with properties {name:strASCII}
           name of folderASCII
           set countASCII to count result
           delete folderASCII
       on error
           set countASCII to 0
       end try
       set strHigh to "औ" & text 2 thru -1 of strASCII
       try
           set folderHigh to make new folder at p2d with properties {name:strHigh}
           name of folderHigh
           set countHigh to count result
           delete folderHigh
       on error
           set countHigh to 0
       end try
       display dialog ("" & countASCII & return & countHigh)
   end repeat
end tell

On the first pass, two folders are created on the desktop.
The name of the first one consists of 255 digits (plain ASCII).
The name of the second one is made by replacing the first character of the first name with the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and whose UTF-8 value is $E0 A4 94.

Both names are 255 characters long.
Both folders are created and a dialog displays the two name lengths.
On the second pass, one character (5) is added at the end of both names.
The Finder refuses to create either folder.


If I replace DEVANAGARI LETTER AU with the character ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and whose UTF-8 value is $C3 BC, only the first folder, named with strASCII, is created.

I really don't understand why DEVANAGARI LETTER AU, which requires more bytes than LATIN SMALL LETTER U WITH DIAERESIS, is accepted.

Yvan KOENIG (VALLAURIS, France) Saturday 8 December 2012 16:38:44

Last edited by Yvan Koenig (2012-12-08 10:38:24 am)

Online

 

#13 2012-12-08 10:05:47 am

McUsrII
Member
Registered: 2012-11-20
Posts: 2290
Website

Re: 1 character uses space of 2 characters

That seems like a better idea than:

Applescript:

repeat with i from 0 to 65535
   try
       set a to character id i
       log a
   end try
end repeat

:)


Filed under: unicode, utf-16

Offline

 

#14 2012-12-08 10:11:51 am

StefanK
Member
From: St. Gallen, Switzerland
Registered: 2006-10-21
Posts: 10516
Website

Re: 1 character uses space of 2 characters

cirno wrote:

Is there any easy way to get list of every Unicode character.

Finder > Menu Edit > Special Characters… > Unicode Table

Last edited by StefanK (2012-12-08 10:12:32 am)


regards

Stefan

Offline

 

#15 2012-12-08 10:39:41 am

Yvan Koenig
Member
Registered: 2006-09-14
Posts: 1477

Re: 1 character uses space of 2 characters

McUsrII wrote:

That seems like a better idea than:

Applescript:

repeat with i from 0 to 65535
   try
       set a to character id i
       log a
   end try
end repeat

smile

But using this scheme, we don't get the names of the characters ;)
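
If the names matter, a small Python sketch can print them, since its standard `unicodedata` module carries the character names. This is only an illustration outside AppleScript; `named_characters` is a hypothetical helper, and the names come from whatever Unicode version the Python build ships with.

```python
import unicodedata


def named_characters(start, stop):
    """Yield (code point, character, name) for named characters in [start, stop)."""
    for cp in range(start, stop):
        ch = chr(cp)
        # name() returns the default for code points without a name
        # (control characters, surrogates, unassigned values).
        name = unicodedata.name(ch, None)
        if name is not None:
            yield cp, ch, name


# Print a small slice rather than all 65536 values.
for cp, ch, name in named_characters(0x00F8, 0x0100):
    print("U+%04X %s %s" % (cp, ch, name))
```

Widening the range to `named_characters(0, 0x10000)` would cover the same values as the AppleScript loop, skipping those that have no name.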

Yvan KOENIG (VALLAURIS, France) Saturday 8 December 2012 17:39:35

Online

 

#16 2012-12-08 10:48:11 am

DJ Bazzie Wazzie
Member
From: the Netherlands
Registered: 2004-10-20
Posts: 1908

Re: 1 character uses space of 2 characters

cirno wrote:

Thanks.

Is there any easy way to get list of every Unicode character.

I think you can look here


Kind regards

Offline

 

#17 2012-12-08 12:04:51 pm

McUsrII
Member
Registered: 2012-11-20
Posts: 2290
Website

Re: 1 character uses space of 2 characters

Yvan Koenig wrote:

But using this scheme, we don't get the names of the characters wink

True! :)

I actually think the best solution is to look at either your links or DJ Bazzie Wazzie's, or to use StefanK's tip.

StefanK's solution works when you are offline.

Last edited by McUsrII (2012-12-09 02:21:35 am)


Filed under: utf-16

Offline

 

#18 2012-12-08 03:02:13 pm

Yvan Koenig
Member
Registered: 2006-09-14
Posts: 1477

Re: 1 character uses space of 2 characters

McUsrII wrote:

Yvan Koening wrote:

But using this scheme, we don't get the names of the characters wink

True! smile

I actually think that the best solution is either look at your or DJ Bazzie Wazzie's links, or StefanK's tip.

StefanK's solution work when you are offline.

Yuck! My name is not Koening, it's Koenig.

The characters palette is fine because we can use it to insert characters or to extract a character's name with a simple copy/paste.

Did you look at the script embedded in my message from 09:38:51 am?

Yvan KOENIG (VALLAURIS, France) Saturday 8 December 2012 21:59:41

Online

 

#19 2012-12-09 02:23:45 am

McUsrII
Member
Registered: 2012-11-20
Posts: 2290
Website

Re: 1 character uses space of 2 characters

I corrected your name, Yvan. :) And I am looking at your script right now, and I really have no idea why the first, 3-byte character works while the second, 2-byte character doesn't. Maybe there is some byte packing scheme here? It may also be that the $00 byte in "Diaeresis" does (or doesn't do) the trick?

Last edited by McUsrII (2012-12-09 02:30:39 am)


Filed under: Unicode.

Offline

 

#20 2012-12-09 04:02:08 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-19
Posts: 3514

Re: 1 character uses space of 2 characters

Yvan Koenig wrote:

Puzzling behaviour :
Run this script :

Applescript:


set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop
tell application "Finder"
   repeat with i from 4 to 5
       set strASCII to strASCII & i
       try
           set folderASCII to make new folder at p2d with properties {name:strASCII}
           name of folderASCII
           set countASCII to count result
           delete folderASCII
       on error
           set countASCII to 0
       end try
       set strHigh to "औ" & text 2 thru -1 of strASCII
       try
           set folderHigh to make new folder at p2d with properties {name:strHigh}
           name of folderHigh
           set countHigh to count result
           delete folderHigh
       on error
           set countHigh to 0
       end try
       display dialog ("" & countASCII & return & countHigh)
   end repeat
end tell

On first pass, two folders will be created on the desktop.
the name of the 1st one is made of 255 digits (plain ASCII)
the name of the second one is made by replacing the 1st character of the firts one by the character औ (DEVANAGARI LETTER AU) whose Unicode value is $0914 and UTF 8 value is $E0A494

Both names are 255 characters long.
Both folders are created and a dialog display the two name lengths .
On second pass, One character (5) is added at the end of the two names.
The Finder refuse to create both folders.


If I replace the (DEVANAGARI LETTER AU) by the character ü (LATIN SMALL LETTER U WITH DIAERESIS) whose Unicode value is $00FC and UTF-8 value is $C3 BC, only the first folder named with strASCII is created.

There's an error when the script tries to delete the second folder on my system. This appears to be something to do with the fact that the first folder's already in the Trash. (Maybe there's not enough overhead to edit the second folder's name?) The effect is that, because the second folder's not deleted the first time, there's another error when trying to create it the second time. Putting an 'empty' line after each 'delete' cures this on my Snow Leopard system.

I really don't understand why (DEVANAGARI LETTER AU) which requires more bytes than (LATIN SMALL LETTER U WITH DIAERESIS) is accepted.

I don't know the answer either. But then I don't know how many bytes DEVANAGARI LETTER AU occupies in the normalized version of UTF-16 used by the HFS+ system. (See the link in my post (#7) above.)

Edit: Perhaps more relevantly, the article to which I linked says that HFS+ names can have a maximum of 255 UTF-16 code points. If DEVANAGARI LETTER AU is represented as just one code point (itself) — and LATIN SMALL LETTER U WITH DIAERESIS is almost certainly "normalized" as two code points (the small letter u and the diaeresis) — then that would explain the apparent anomaly. The number of bytes doesn't come into it.
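
That reading can be checked with a rough Python sketch. It assumes that standard NFD decomposition is a close enough stand-in for the HFS+ normalization (the file system uses its own variant, so edge cases may differ), and `hfs_units` is a made-up name.

```python
import unicodedata


def hfs_units(name):
    """Approximate count of 16-bit units after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", name)
    return len(decomposed.encode("utf-16-be")) // 2


print(hfs_units("\u00fc"))  # LATIN SMALL LETTER U WITH DIAERESIS
print(hfs_units("\u0914"))  # DEVANAGARI LETTER AU

# Rebuild the two 255-character test names from the script quoted above.
ascii_name = ("1234567890" * 26)[:255]
print(hfs_units(ascii_name))
print(hfs_units("\u0914" + ascii_name[1:]))
print(hfs_units("\u00fc" + ascii_name[1:]))
```

Under that assumption the name starting with ü comes out at 256 units, one over the limit, while the name starting with औ stays at 255.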

Last edited by Nigel Garvey (2012-12-09 05:08:01 am)


NG

Offline

 

#21 2012-12-09 05:04:12 am

Yvan Koenig
Member
Registered: 2006-09-14
Posts: 1477

Re: 1 character uses space of 2 characters

Hello Nigel.

I assume that the error issued when the Finder tries to delete the folder is due to your old system.
Under Lion or Mountain Lion it behaves flawlessly.
This version uses System Events to delete the folders so nothing is moved to the trash.

Applescript:


set strASCII to "01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123"
set p2d to path to desktop

repeat with i from 4 to 5
   set strASCII to strASCII & i
   try
       tell application "Finder"
           set folderASCII to make new folder at p2d with properties {name:strASCII}
           name of folderASCII
           set countASCII to count result
       end tell
       tell application "System Events" to delete folderASCII
   on error
       set countASCII to 0
   end try
   set strHigh to "औ" & text 2 thru -1 of strASCII
   try
       tell application "Finder"
           set folderHigh to make new folder at p2d with properties {name:strHigh}
           name of folderHigh
           set countHigh to count result
       end tell
       tell application "System Events" to delete folderHigh
   on error
       set countHigh to 0
   end try
   display dialog ("" & countASCII & return & countHigh)
end repeat

I have some difficulty imagining that the DEVANAGARI LETTER AU character is stored as a single byte like ASCII characters.

Yvan KOENIG (VALLAURIS, France) Sunday 9 December 2012 12:03:54

Online

 

#22 2012-12-09 05:12:48 am

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-19
Posts: 3514

Re: 1 character uses space of 2 characters

Yvan Koenig wrote:

I have some difficulties to imagine that the (DEVANAGARI LETTER AU) character is stored as a single byte like ASCII characters

Hi Yvan.

I was adding an extra paragraph to my post as you were posting this comment. The relevant units to consider are UTF-16 code points, not bytes.

Applescript:

set uUmlaut1 to (character id 252) -- One code point ("ü").
set uUmlaut2 to (character id 117) & (character id 776) -- Two code points as in HFS+ ("u" & combining diaeresis).

{uUmlaut1, uUmlaut2, uUmlaut1 = uUmlaut2}

PS. Even System Events's 'delete' command simply moves things to the trash on my machine.

Last edited by Nigel Garvey (2012-12-09 06:19:05 am)


NG

Offline

 

#23 2012-12-09 06:33:42 am

McUsrII
Member
Registered: 2012-11-20
Posts: 2290
Website

Re: 1 character uses space of 2 characters

An enlightening day, as enlightening here as the snow outside! No pun intended. :D


Filed under: UTF, code points

Offline

 

#24 2012-12-09 02:26:46 pm

Yvan Koenig
Member
Registered: 2006-09-14
Posts: 1477

Re: 1 character uses space of 2 characters

Nigel Garvey wrote:

Yvan Koenig wrote:

I have some difficulties to imagine that the (DEVANAGARI LETTER AU) character is stored as a single byte like ASCII characters

Hi Yvan.

I was adding an extra paragraph to my post as you were posting this comment. The relevant units to consider are UTF-16 code points, not bytes.

Applescript:

set uUmlaut1 to (character id 252) -- One code point ("ü").
set uUmlaut2 to (character id 117) & (character id 776) -- Two code points as in HFS+ ("u" & combining diaeresis).

{uUmlaut1, uUmlaut2, uUmlaut1 = uUmlaut2}

In the given example, the uUmlaut character is:
ü (LATIN SMALL LETTER U WITH DIAERESIS), whose Unicode value is $00FC and whose UTF-8 value is $C3 BC.

When I insert it in a file/folder name, it reduces the number of characters allowed by one. So it's clearly using two units.

On the other hand, the character औ (DEVANAGARI LETTER AU), whose Unicode value is $0914 and whose UTF-8 value is $E0 A4 94, may be inserted in a file/folder name and is counted as a single unit.
That's what puzzled me.
Bingo, I got it. Thanks to your explanations and this table:
http://developer.apple.com/legacy/mac/l … table.html

Before that, I didn't understand that "points" are in fact two bytes each.
औ (DEVANAGARI LETTER AU) is a two-byte Unicode value, but it's stored as a single point.
ü (LATIN SMALL LETTER U WITH DIAERESIS) fits in one byte as a Unicode value, but it's stored as two points.
Ḕ (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE) is Unicode $1E14, but it's stored as three points.

So, we may replace an ASCII character with औ with no impact on the number of characters allowed (255).
If we replace an ASCII character with ü, the number of characters allowed is reduced by 1 (254).
If we replace an ASCII character with Ḕ, the number of characters allowed is reduced by 2 (253).
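
With that rule in hand, the handler asked for at the top of the thread could be sketched along these lines. It is written in Python rather than AppleScript purely for illustration. The names `hfs_units` and `safe_name`, the replacement characters, and the use of standard NFD as a stand-in for the HFS+ decomposition are all assumptions.

```python
import unicodedata

HFS_MAX_UNITS = 255


def hfs_units(name):
    """Approximate count of 16-bit units after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", name)
    return len(decomposed.encode("utf-16-be")) // 2


def safe_name(name, keep_extension=True, dot_replacement="_", colon_replacement="-"):
    """Replace a leading dot and any colons, then shorten to 255 units."""
    if name.startswith("."):
        name = dot_replacement + name[1:]
    name = name.replace(":", colon_replacement)

    stem, ext = name, ""
    if keep_extension:  # pass False for folders
        dot = name.rfind(".")
        if dot > 0:
            stem, ext = name[:dot], name[dot:]

    # Work on the composed form so a letter and its accent are dropped together
    # where a precomposed character exists.
    stem = unicodedata.normalize("NFC", stem)
    # An extension longer than the limit is not handled: the stem just empties.
    while stem and hfs_units(stem + ext) > HFS_MAX_UNITS:
        stem = stem[:-1]
    return stem + ext


long_name = "\u00fc" + "1" * 250 + ".txt"  # 255 characters, 256 units
shortened = safe_name(long_name)
print(len(shortened), hfs_units(shortened), shortened[-4:])
```

Cutting from the end of the stem rather than from the end of the whole name is what keeps the extension intact, unlike the truncated `.tx` in the first post.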

Nigel Garvey wrote:

PS. Even System Events's 'delete' command simply moves things to the trash on my machine.

This one really puzzles me.
If I remember correctly, you are running 10.5.8, and as I recall the System Events delivered with that system deleted files as it does in 10.8, without putting them in the trash.
Now I understand why you used rm in some scripts which you sent to me.

Yvan KOENIG (VALLAURIS, France) Sunday 9 December 2012 21:26:39

Last edited by Yvan Koenig (2012-12-09 02:52:20 pm)

Online

 

#25 2012-12-09 04:37:16 pm

Nigel Garvey
Moderator
From: Warwickshire, England
Registered: 2002-11-19
Posts: 3514

Re: 1 character uses space of 2 characters

Yvan Koenig wrote:

Before that, I didn't understand that "points" are in fact two bytes.
औ (DEVANAGARI LETTER AU) is a 2 bytes Unicode value but it's stored as a single point.
ü (LATIN SMALL LETTER U WITH DIAERESIS) is a one byte Unicode value but it's stored as a two points object.
Ḕ (LATIN CAPITAL LETTER E WITH MACRON AND) is Unicode $1E14 but it's stored as a three points object.

AU is a single character and can only be represented by a single code point (or Unicode number), however many bytes that may be. (A multiple of 2, since it's UTF-16.) Characters with diacritics, however, can be represented in Unicode either as characters in themselves (one code point) or as combinations of base characters and "Combining Diacritical Marks" (two code points). The HFS+ system insists on the latter, for some reason.

Interestingly, the 'id' function can tell how such characters are formed:

Applescript:

set nameIn to "üऔḔ" -- (LATIN SMALL LETTER U WITH DIAERESIS) & (DEVANAGARI LETTER AU) & (LATIN CAPITAL LETTER E WITH MACRON AND GRAVE)
set lengthIn to (count nameIn)
set idIn to id of nameIn

tell application "Finder" to set nameOut to name of (make new folder at desktop with properties {name:nameIn})

set lengthOut to (count nameOut)
set idOut to id of nameOut

{{nameAsSet:nameIn, length:lengthIn, id:idIn}, return, {nameAsReturned:nameOut, length:lengthOut, id:idOut}}
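
For comparison, roughly the same inspection can be done without touching the file system. This Python sketch assumes that standard NFD decomposition approximates what the Finder hands back, and `code_points` is a hypothetical helper.

```python
import unicodedata


def code_points(text):
    """List the Unicode code points of text as integers."""
    return [ord(ch) for ch in text]


# u with diaeresis, Devanagari AU, E with macron and grave (precomposed).
name_in = "\u00fc\u0914\u1e14"
name_out = unicodedata.normalize("NFD", name_in)

print(len(name_in), code_points(name_in))
print(len(name_out), code_points(name_out))
```

The decomposed list shows the base letters followed by their combining marks, while the Devanagari letter passes through unchanged.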

Nigel Garvey wrote:

PS. Even System Events's 'delete' command simply moves things to the trash on my machine.

This one is really puzzling me.
If I remember well you are running 10.5.8 and my memory said that the System Events delivered with this system deleted files as it did in 10.8 without putting them in the trash.

You're right! System Events does delete items on the spot. I hadn't noticed that in your script, the 'delete' command is applied to a Finder reference and so it's actually the Finder doing the deleting. :)


NG

Offline

 
