Detecting Unicode Text

Adam_Bell · April 3, 2006, 2:44pm

Depending on the application that saved the text file, files with return-delimited or linefeed-delimited paragraphs can have either type ident “com.apple.traditional-mac-plain-text” or type ident “public.plain-text”, so the only way I can find to tell them apart is to test the delimiter character to see whether it’s ASCII 10 or ASCII 13. That’s OK, but my question is: is there a simple way to determine whether text is Unicode or plain ASCII?

hhas · April 3, 2006, 4:32pm

Not really; only way to be completely sure is to know in advance. Given a file that may or may not be Unicode, the usual thing is to see if the file starts with a byte-order mark (BOM); if it does, it’s a fair bet (though not 100% guaranteed) that it’s Unicode. If it doesn’t, then it may use some other encoding, or it may be that the BOM was omitted. More intelligent encoding sniffers will analyse the data and provide a best guess as to what encoding it is, e.g. here’s one for Python I ran across the other day:

http://chardet.feedparser.org/

You could easily use it via ‘do shell script’ - wouldn’t be hard to knock up a Python script that reads the raw file data, sniffs its encoding and then, if sufficiently confident, converts it from that encoding to UTF-8 and returns that. Or you could wrap it up in a scriptable faceless background application if you find that more convenient (I plan on rolling it into TextCommands at some point, but you could always do it yourself).
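
A minimal Python sketch of the BOM check and UTF-8 conversion described above, using only the standard library. The names sniff_bom and to_utf8 and the mac_roman fallback are assumptions for illustration, not part of any tool mentioned in this thread; an encoding sniffer such as the one linked above would take the place of the fallback guess.

```python
import codecs

# Longest BOMs first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec
    return None

def to_utf8(data, fallback="mac_roman"):
    """Decode raw bytes on a best guess and re-encode them as UTF-8."""
    codec = sniff_bom(data)
    if codec is None:
        try:
            data.decode("utf-8")  # pure ASCII also passes this check
            codec = "utf-8"
        except UnicodeDecodeError:
            codec = fallback      # assumption: legacy Mac text
    # The utf-16, utf-32 and utf-8-sig codecs strip the BOM while decoding.
    return data.decode(codec).encode("utf-8")
```

A missing BOM proves nothing, so the fallback branch is only a guess; the result should be treated with the same caution as the BOM test itself.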

Bruce_Phillips · April 3, 2006, 4:50pm

Perhaps the command-line tool “file” (/usr/bin/file) would work for you.

Adam_Bell · April 3, 2006, 5:52pm

file works quite nicely if the object is a file (gives a ton of info), but I was wondering about text read into a variable.

I’m not competent to add a feedparser script to TextTools, I don’t think. I will look at it.
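
For text that is already in a variable, a rough Python sketch of the two checks discussed so far: whether every character fits in 7-bit ASCII, and which paragraph delimiter (ASCII 13 or ASCII 10) the text uses. The function names are hypothetical. Note that a Python string is always Unicode internally, so this only tests whether the content could be stored as plain ASCII.

```python
def is_plain_ascii(text):
    """True if every character in the string is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)

def line_ending(text):
    """Name the paragraph delimiter found in text, or None if there is none."""
    if "\r\n" in text:
        return "CRLF"  # ASCII 13 followed by ASCII 10
    if "\r" in text:
        return "CR"    # ASCII 13, return delimited
    if "\n" in text:
        return "LF"    # ASCII 10, linefeed delimited
    return None
```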

Vincent · April 3, 2006, 6:26pm

So

returns the following:

Determine file type of FILEs.

  -m, --magic-file LIST      use LIST as a colon-separated list of magic
                             number files
  -z, --uncompress           try to look inside compressed files
  -b, --brief                do not prepend filenames to output lines
  -c, --checking-printout    print the parsed form of the magic file, use in
                             conjunction with -m to debug a new magic file
                             before installing it
  -f, --files-from FILE      read the filenames to be examined from FILE
  -F, --separator string     use string as separator instead of `:'
  -i, --mime                 output mime type strings
  -k, --keep-going           don't stop at the first match
  -L, --dereference          causes symlinks to be followed
  -n, --no-buffer            do not buffer output
  -N, --no-pad               do not pad output
  -p, --preserve-date        preserve access times on files
  -r, --raw                  don't translate unprintable chars to \ooo
  -s, --special-files        treat special (block/char devices) files as
                             ordinary ones
      --help                 display this help and exit
      --version              output version information and exit

How do you now get the encoding?

(If there was a BOM you could easily identify the encoding)

Adam_Bell · April 3, 2006, 6:44pm

file -i. For a typical BBEdit file, for example, it returns: text/plain; charset=iso-8859-1

Bruce_Phillips · April 3, 2006, 6:45pm

It’s not an option, it’s part of the normal behavior. Without any other options specified, the output (for a typical ASCII text file) would be something like this:

When using this tool inside AppleScript, you might find the options ‘-b’ and ‘-i’ helpful. (Example follows.)
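
As a rough illustration of calling the tool from a script (a Python sketch, not the AppleScript from this post), the code below runs file with -b and -i and splits output of the form quoted earlier, “text/plain; charset=iso-8859-1”. The helper names parse_mime and file_mime are hypothetical, and the flag spelling is taken from this thread; other builds of file may spell the MIME option differently, so check the man page.

```python
import subprocess

def parse_mime(output):
    """Split 'text/plain; charset=iso-8859-1' into (mime_type, charset)."""
    mime_type, _, rest = output.strip().partition(";")
    charset = None
    for param in rest.split(";"):
        key, _, value = param.strip().partition("=")
        if key.lower() == "charset" and value:
            charset = value
    return mime_type.strip(), charset

def file_mime(path):
    """Run `file -b -i` on path and return (mime_type, charset)."""
    # Assumption: this build of file spells the MIME flag -i, as in the
    # thread; check `man file` if the output looks different.
    result = subprocess.run(
        ["file", "-b", "-i", path],
        capture_output=True, text=True, check=True,
    )
    return parse_mime(result.stdout)
```

Keeping the parsing separate from the subprocess call means the same function can be applied to output returned by ‘do shell script’ or any other wrapper.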

How are you reading this text?

Vincent · April 3, 2006, 7:20pm

Thanx! That’s great! I needed that for too long!

Adam_Bell · April 3, 2006, 11:13pm

From the clipboard. In code exchange today, Kai posted this:

set the clipboard to (the clipboard as record)'s string

which at least assures you that whether it was Unicode or formatted text, it is turned into ASCII text by the coercions.