Would you solve backward encoding compatibility with Tiger by using UTF-8?

McUsr · June 28, 2010, 10:24am

Hello, I’m trying to create a listPreserver that will save lists from, or interact with, choose from list dialogs.

I want it to be as backward compatible as possible, and I want to be able to read and write the lists from the terminal, should I ever wish to do so. I also want a seamless solution, so I’ll have detection and conversion built in. That way I can use the same list on different machines with different OS versions, without any distortion of characters.

I’m not sure how to handle this. Because of the strong grip of MacRoman in Tiger, I have to apply hacks to make the terminal use UTF-8. But I feel that UTF-8 is really the lowest common denominator.

The alternative would be to encode files according to the Operating System Version.

I guess this leaves the following questions.

Would you try to use a single encoding for backward compatibility? If so, which would you choose?

Would you prefer to use one encoding scheme per platform, and convert files to the correct format when needed?

If you answered yes to the previous question, would you then use UTF-16 after Tiger and MacRoman before?

If you have other alternatives that would let me accomplish this, don’t hesitate to suggest them. I’ll take any measure I must in order to reach the perfect solution, as long as it is possible and practical.

Best Regards

McUsr

StefanK · June 28, 2010, 10:33am

Not for backward compatibility, but for platform compatibility I would use UTF-8 with a BOM.

No

McUsr · July 1, 2010, 11:01am

Hello.

These are some second thoughts on this issue. So far I have found using UTF-8 to be a bad idea, because on Leopard and later AppleScript uses UTF-16 (Unicode text) internally, while on Tiger and earlier it uses MacRoman.

So UTF-8 is incompatible with both platforms!

On Leopard and later, UTF-16 is the preferred format in the terminal, or the Unix side of it, so using UTF-16 there would not cause any problems. And on Tiger it was MacRoman (iso-8859-1), last time I checked.

My goal was to be able to feed lists into the choose from list dialog that could be retained and edited or used from everywhere, while staying readable. That implies that every accented character shows up correctly and nothing shows up in Chinese or similar.

A second attempt at this could be to keep everything UTF-16 on Snow Leopard, and everything MacRoman on Tiger.

This is a shame really, because UTF-8 would have been the preferred format. But right now it seems to be a bit too much work: first make sure that a file is UTF-8 and convert it if necessary, then convert the file once more to UTF-16 or MacRoman before reading the list in, and then perform the corresponding conversion back.

What I’d like to know is whether there is some nifty way to convert UTF-8 text directly to UTF-16 and MacRoman, internally in AppleScript, so that I don’t have to perform that much I/O for something that should have been a little bit simpler.
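The "detect, then convert" round trip described above can be illustrated outside AppleScript. The following is only a Python sketch of the logic, not an AppleScript solution: the function names `decode_list_file` and `encode_for_platform` are hypothetical, and the detection order (UTF-16 BOM, then UTF-8, then MacRoman as the fallback) is an assumption rather than something stated in this thread.

```python
import codecs


def decode_list_file(data: bytes) -> str:
    """Guess the encoding of a list file and return its text.

    Assumed order: UTF-16 if a UTF-16 BOM is present, otherwise UTF-8
    (with or without BOM), otherwise MacRoman as the legacy fallback.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return data.decode("utf-16")
    try:
        # "utf-8-sig" strips a leading BOM if there is one.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat the bytes as MacRoman.
        return data.decode("mac_roman")


def encode_for_platform(text: str, legacy: bool) -> bytes:
    """Encode text as MacRoman for legacy systems, else UTF-16 with BOM."""
    return text.encode("mac_roman" if legacy else "utf-16")


if __name__ == "__main__":
    sample = "caf\u00e9"
    for raw in (sample.encode("utf-8"),
                sample.encode("mac_roman"),
                sample.encode("utf-16")):
        print(decode_list_file(raw) == sample)
```

Note that the MacRoman fallback cannot fail, so a file in some other 8-bit encoding would be decoded silently into the wrong characters, and encoding to MacRoman raises an error for characters that MacRoman does not contain.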

Best Regards

McUsr

Shane_Stanley · July 1, 2010, 11:12am

You’re confusing the issue with UTF8 and UTF16. In terms of code points, they are identical, so the fact that AS uses UTF16 internally is irrelevant. In fact, what AS uses internally is irrelevant – it also stores MacRoman text internally as UTF16.
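To make the point about code points concrete, here is a small Python sketch (Python is used only because it makes the byte level easy to inspect; the sample string is arbitrary). The same text encoded as UTF-8 and as UTF-16 yields different bytes, but decoding either one gives back identical code points.

```python
text = "na\u00efve caf\u00e9"  # arbitrary sample with two accented characters

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no BOM

# Decode each byte sequence and list the code points it represents.
from_utf8 = [ord(ch) for ch in utf8_bytes.decode("utf-8")]
from_utf16 = [ord(ch) for ch in utf16_bytes.decode("utf-16-be")]

print(from_utf8 == from_utf16)            # same code points either way
print(len(utf8_bytes), len(utf16_bytes))  # but different byte counts
```

The encodings differ only in how the code points are serialized, which is why the internal representation does not matter as long as the file is decoded with the encoding it was written in.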

McUsr · July 1, 2010, 11:35am

Hello

The problem appears when I have characters like this in a UTF-8 file.

Both BBEdit and the file command tell me it is UTF-8, and it is, because I made it with iconv.
I have read in the file with:


set theContents to read fp using delimiter {theLineFeed} as «class utf8»

Then the result shows up in a choose from list dialog like this:

That is or was my problem.
I have recently discovered that Satimage.osax does the proper conversion for me with the readtext command and its encoding parameter, and it skips the BOM too :).

Still I’d like a way to convert utf-8 to utf-16. I saw what I saw and I saw what I didn’t (to quote a line of Tolkien’s Lord of the Rings)

Edit
In the line above I told read to deliver the data both as utf8 and as string: the using delimiter clause makes read return string, and that is a MacRoman string.

It worked OK when I separated the two operations by removing the using delimiter clause.

It’s a pity, but it doesn’t make the original idea any better. Whenever I need UTF-8, I’ll convert it; that is a far more practical approach.
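The garbling described here is what you get whenever UTF-8 bytes are interpreted as MacRoman. This Python sketch reproduces the effect on an arbitrary sample word; it illustrates the byte-level mix-up only and makes no claim about how read is implemented.

```python
original = "caf\u00e9"                  # "café"
utf8_bytes = original.encode("utf-8")   # é becomes the two bytes C3 A9

# Wrong: each byte is looked up in the MacRoman table on its own.
misread = utf8_bytes.decode("mac_roman")

# Right: the two bytes are combined into one code point.
correct = utf8_bytes.decode("utf-8")

# As long as nothing was lost, the misread text can be repaired by
# reversing the wrong step and decoding properly.
repaired = misread.encode("mac_roman").decode("utf-8")

print(misread)   # the é shows up as two unrelated symbols
print(correct == original, repaired == original)
```

One accented character turning into two symbols is a typical sign that UTF-8 data went through a MacRoman decode somewhere along the way.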

Best Regards

McUsr

McUsr · July 1, 2010, 2:52pm

Hello

I have uploaded the working handler here.

Best Regards

McUsr

StefanK · July 1, 2010, 3:01pm

This is more reliable:


set theContents to paragraphs of (read fp as «class utf8»)

It treats LF, CR and CRLF as line delimiters.
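The difference between splitting on a single delimiter and splitting on all three line endings can be sketched in Python. The helper name `split_paragraphs` is hypothetical, and this only mimics the LF, CR and CRLF behaviour mentioned above, not everything AppleScript's paragraphs may do.

```python
import re


def split_paragraphs(text: str) -> list:
    """Split on CRLF, CR or LF; CRLF is tried first so it counts once."""
    return re.split(r"\r\n|\r|\n", text)


mixed = "one\ntwo\rthree\r\nfour"

# Splitting on LF alone leaves the CR characters inside the items.
lf_only = mixed.split("\n")

# Splitting on all three line endings gives clean items.
clean = split_paragraphs(mixed)

print(lf_only)
print(clean)
```

A list file that has been edited on several systems can easily end up with mixed line endings, which is where the single-delimiter approach breaks down.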

McUsr · July 1, 2010, 3:08pm

Hello.

Thanks, I circumvented the problem, but your solution is excellent as always. I didn’t realize that read returns a MacRoman string when called with the using delimiter parameter.

Best Regards

McUsr