Unicode characters don't display correctly in table

Hey all,

I’m working on an HTML editor and have created a table that has all the HTML character entities (&nbsp; and so forth). The problem is that the table is being filled from a text file that I found, which is in the format:
¢ & #162; ¢ ¢ cent sign

and during startup I load the table from the text file, reordering items to make it easier to read and dropping the duplicate glyph. So far so good, table loads just fine, no problem there.

The problem is that I can’t seem to find a character encoding for the text file that works. If I save it as Unicode (UTF-8 or UTF-16, doesn’t matter) then all the character glyphs get the ¬ character in front of them and some of them are the wrong character. So the line above comes out:
¢ & #162; ¬¢ ¬¢ cent sign

and (for example) the currency symbol
¤ & #164; ¤ ¤ currency sign

comes out like this:
¤ & #164; ¬¢ ¬¢ currency sign

And if I try to resave the text file in Mac OS Roman, TextWrangler and Xcode inform me that the chosen encoding can’t handle all the characters.

Telling AS to read the file as Unicode text doesn’t help, and neither does international text or just plain text. Someone please tell me that there is a way around this!

Edit: Oops, this is in the AppleScript Studio forum, so maybe read f as «class utf8» is not available in that environment. If so, please disregard, or let me know in a PM, and I’ll delete this post. Either way, the ¬ characters (the same glyph AppleScript uses for line continuation) seem to indicate UTF-8 encoded text being read as Mac Roman. You could also check the é (eacute) entry against my result for further evidence one way or the other.

I saved the cent, currency, and a new lowercase e with acute (é) examples in files with TextEdit using Mac Roman, UTF-8, UTF-16, and Latin-1 encodings (Mac Roman could not save the currency character, so it only has the cent and e-with-acute examples).

Then I ran this script:

on run
    set readAsEncoding to {text, «class utf8», Unicode text} -- text uses the system's default encoding, which is probably Mac Roman; Is there an explicit encoding specifier for Mac Roman or Latin-1?
    set encodingExt to {"macroman", "utf8", "utf16", "latin1"}
    set txt to "" as Unicode text
    repeat with ext in encodingExt
        set falias to alias ((path to desktop as Unicode text) & "test text." & ext & ".txt")
        repeat with enc in readAsEncoding
            try
                read falias as enc
            on error e
                "<error: " & e & ">"
            end try
            set t to ext & " as " & enc & return & result & return & return
            log t
            set txt to txt & t
        end repeat
    end repeat
    return txt
end run

My result for “utf8 as text” (which is UTF-8 as Mac Roman on my system) is

¢ & #162; ¬¢ ¬¢ cent sign
¤ & #164; ¬§ ¬§ currency sign
é & #233; √© √© lowercase e with acute
At least for the cent sign, this matches your problem description. For the currency sign we have a discrepancy: you get “¬¢” and I get “¬§”.
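
For anyone who wants to check the byte-level reason for that output outside AppleScript, here is a minimal Python 3 sketch. It assumes Python's built-in mac_roman codec behaves like the system's Mac Roman, and the names samples, misread_utf8_as_macroman and garbled are made up for illustration. It encodes each sample glyph as UTF-8 and then deliberately decodes those bytes as Mac Roman, which is what I believe a read with the default encoding is doing:

```python
# Sample glyphs from the test files, written as escapes so the script
# itself does not depend on the encoding of the source file.
samples = {
    "cent sign": "\u00a2",
    "currency sign": "\u00a4",
    "lowercase e with acute": "\u00e9",
}


def misread_utf8_as_macroman(text):
    """Encode text as UTF-8, then (wrongly) decode the bytes as Mac Roman."""
    return text.encode("utf-8").decode("mac_roman")


# Each glyph becomes two UTF-8 bytes, and each byte is then shown as its
# own Mac Roman character.
garbled = {name: misread_utf8_as_macroman(ch) for name, ch in samples.items()}

for name, text in garbled.items():
    print(name, "->", text)
```

The lead byte 0xC2 shared by the cent and currency signs is ¬ in Mac Roman, which would explain why that character shows up in front of the glyphs.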

I would guess that your source file is encoded in UTF-8. Have you tried reading it via read f as «class utf8»?

AS Studio doesn’t “take away” any of AS’s basic functionality, it just adds Cocoa objects via the ASkit dictionary, so yes, I can still try read f as «class utf8». That’s a good thought, I assumed (probably wrongly) that as unicode text would handle that. I’ll give that a try!

Thanks for the info about AppleScript Studio. I have a vague idea about what it does, but I have never had cause to need it yet, so I have not read any proper documentation about it. I had read one or two posts here that said something like “Oh, that does not work in AppleScript Studio, only in vanilla AppleScript”, so I was concerned that read from StandardAdditions might have been one of those things. My impression was that AppleScript Studio replaces some small fraction of the base “vanilla AppleScript” dictionary with slightly incompatible stuff.

It is my understanding (and the full result of the experiment in my previous post bears it out) that Unicode text means UTF-16, and there is no auto-detection for the case that the data is actually UTF-8.
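
To illustrate why the reader has to be told which encoding to expect, here is a small Python 3 sketch (not AppleScript, and only an illustration of the byte layouts, not of what read does internally). The variable names are hypothetical. The same line produces different bytes in UTF-8 and UTF-16, and decoding UTF-8 bytes as UTF-16 neither fails loudly nor gives the original text back:

```python
line = "\u00a2 cent sign"  # cent sign followed by its description

utf8_bytes = line.encode("utf-8")       # cent sign takes 2 bytes, ASCII 1 each
utf16_bytes = line.encode("utf-16-be")  # every character here takes 2 bytes

print(len(utf8_bytes), utf8_bytes)
print(len(utf16_bytes), utf16_bytes)

# Decoding with the matching codec round-trips cleanly.
assert utf8_bytes.decode("utf-8") == line
assert utf16_bytes.decode("utf-16-be") == line

# Decoding UTF-8 bytes as UTF-16 pairs up unrelated bytes into code units,
# so the result is different (and shorter) text rather than an error.
wrong = utf8_bytes.decode("utf-16-be", errors="replace")
print(repr(wrong))
```

So unless something inspects the bytes first, the encoding has to be stated explicitly at read time.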