Detecting Unicode Text

Depending on the application that saved the text file, files with return-delimited or linefeed-delimited paragraphs can have either type ident "com.apple.traditional-mac-plain-text" or type ident "public.plain-text", so the only way I can find to tell is to test the delimiter character to see if it’s ASCII 10 or ASCII 13. That’s OK, but my question is: Is there a simple way to determine whether text is Unicode or plain ASCII?

Not really; the only way to be completely sure is to know in advance. Given a file that may or may not be Unicode, the usual thing is to see if the file starts with a byte-order mark (BOM); if it does, it’s a fair bet (though not 100% guaranteed) that it’s Unicode. If it doesn’t, then it may use some other encoding, or it may be that the BOM was omitted. More intelligent encoding sniffers will analyse the data and provide a best guess as to what encoding it is, e.g. here’s one for Python I ran across the other day:

http://chardet.feedparser.org/
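The BOM check described above might look something like this in Python. This is only a rough sketch using the standard library; the function name encoding_from_bom is made up for illustration, and a missing BOM tells you nothing either way.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the UTF-32 marks have to be tested first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def encoding_from_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None.

    (Hypothetical helper.) The returned codecs consume the BOM when decoding.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of None only means no BOM was found; the data could still be UTF-8 or UTF-16 saved without one.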

You could easily use it via ‘do shell script’; it wouldn’t be hard to knock up a Python script that reads the raw file data, sniffs its encoding and then, if sufficiently confident, converts it from that encoding to UTF-8 and returns that. Or you could wrap it up in a scriptable faceless background application if you find that more convenient (I plan on rolling it into TextCommands at some point, but you could always do it yourself).
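As a very rough idea of what such a script's core could look like, here is a sketch that uses only the standard library rather than chardet: it swaps statistical sniffing for a much cruder approach (BOM check, then a strict UTF-8 trial decode, then a fallback). The function name to_utf8 and the Latin-1 fallback are assumptions for illustration, not anything chardet does.

```python
import codecs

def to_utf8(data: bytes, fallback: str = "latin-1") -> bytes:
    """Guess the encoding of data and return the text re-encoded as UTF-8.

    (Hypothetical helper.) Only UTF-16 with a BOM, UTF-8, and one fallback
    encoding are considered; UTF-32 and BOM-less UTF-16 are not handled.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        text = data.decode("utf-16")
    else:
        try:
            # Strict decode: plain ASCII passes too, since it is a subset
            # of UTF-8. "utf-8-sig" also drops a leading UTF-8 BOM.
            text = data.decode("utf-8-sig")
        except UnicodeDecodeError:
            # Assumed fallback: Latin-1 accepts every byte, so this never
            # fails, but the result may be wrong for other legacy encodings.
            text = data.decode(fallback)
    return text.encode("utf-8")
```

The fallback is the weak point: any single-byte legacy encoding will "succeed", so passing a different codec name (for example "mac_roman") may suit older Mac files better.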

Perhaps the command-line tool “file” (/usr/bin/file) would work for you.

file works quite nicely if the object is a file (gives a ton of info), but I was wondering about text read into a variable.

I’m not competent to add a feedparser script to TextTools, I don’t think. I will look at it.

So

returns the following:

How do you now get the encoding?

(If there was a BOM you could easily identify the encoding)

file -i. For a typical BBEdit file, for example, it returns: text/plain; charset=iso-8859-1

It’s not an option, it’s part of the normal behavior. Without any other options specified, the output (for a typical ASCII text file) would be something like this:

When using this tool inside AppleScript, you might find the options ‘-b’ and ‘-i’ helpful. (Example follows.)

How are you reading this text?

Thanx! That’s great! I needed that for too long!

From the clipboard. In code exchange today, Kai posted this:

set the clipboard to (the clipboard as record)'s string

which at least assures you that whether it was Unicode or formatted text, it is turned into ASCII text by the coercions.