Detect utf8 script

Hi everybody,

I’m thinking about writing a utf8 detection script. The algorithm is something like this:

check for BOM
-- look for sequences of bytes
get next byte
if this byte < 128 then -- 0xxxxxxx
	get next byte
else if this byte < 192 then -- 10xxxxxx, a continuation byte cannot start a sequence
	quit -- not utf8
else if this byte < 224 then -- 110xxxxx
	-- check next byte: it must be 10xxxxxx, i.e. 128 <= byte < 192
	if next byte < 128 or next byte >= 192 then
		quit -- not utf8
	else -- 10xxxxxx
		get next byte
	end if
else if this byte < 240 then -- 1110xxxx
	-- check next 2 bytes for 10xxxxxx
	etc.
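
The byte walk above can be sketched in Python (not AppleScript, just something quick to run). This is only a structural check under my own assumptions: a continuation byte is any value from 128 to 191, lead bytes announce 1 to 3 continuation bytes, and overlong forms and surrogates are not rejected. The name looks_like_utf8 is made up for the example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data is structurally valid UTF-8."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:        # 0xxxxxxx: plain ASCII
            extra = 0
        elif b < 0xC0:      # 10xxxxxx: continuation byte with no lead byte
            return False
        elif b < 0xE0:      # 110xxxxx: one continuation byte follows
            extra = 1
        elif b < 0xF0:      # 1110xxxx: two continuation bytes follow
            extra = 2
        elif b < 0xF8:      # 11110xxx: three continuation bytes follow
            extra = 3
        else:               # 11111xxx never appears in UTF-8
            return False
        # Every continuation byte must exist and match 10xxxxxx.
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any((c & 0xC0) != 0x80 for c in tail):
            return False
        i += 1 + extra
    return True
```

It makes a single pass over the data, so even a big file costs one read plus a linear scan.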

Anybody know if I’m on the right track or is there an easier way? This would take a long time in a big file. Should the script stop after finding a certain number of correct sequences?

Thanks,

Hi kel,

if the text is a file on disk, you can detect utf8 with

read file "path:to:utf_File" as «class utf8»

if the text is utf16, or MacRoman containing characters > hex 7F, an error occurs

Hi Stefan,

If reading the file as utf8 errors, then you know it’s not utf8. If reading the file doesn’t error, then you still don’t know what it is. It still could be mac roman or maybe even utf16. Am I right in this?

So, how would a script know the type of encoding without the human looking at it?

First check for the BOM FE FF (or FF FE for little-endian). If it’s there, then utf16. Otherwise …
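
A minimal sketch of the BOM check in Python (not AppleScript), using the constants from the standard codecs module. The UTF-16 BOM is FE FF (big-endian) or FF FE (little-endian), while EF BB BF is the UTF-8 one. The function name bom_encoding is made up, and UTF-32 is assumed not to occur.

```python
import codecs

def bom_encoding(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE (assumes no UTF-32)
        return "utf-16-le"
    return None
```

A None result only means there is no BOM; the file can still be any of the three encodings.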

Maybe here we can use that. Try to open as utf8. If that errors, then we know it’s mac roman. If it doesn’t error, then we have to check for sequences of bytes. Say one byte is not less than 128. Here we know that it will match a utf8 sequence, but it still could be mac roman. What to do?

Maybe use probability or something? If you find say 2 sequences of bytes with length greater than 1, then probability is that it is utf8. I just made that up, but maybe something like that. Am I wrong in this?

Thanks,

the logic is (we assume only the 3 kinds: MacRoman, utf8 and utf16):
if an error occurs, then the text is either utf16 or MacRoman with characters > 127
if no error occurs, then the text is utf8 or contains only characters < 128

There is no difference at all between MacRoman and utf8 if the text contains only ASCII characters (< 128)
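
The same three-way logic as a small Python sketch (not AppleScript). Python's strict UTF-8 decode stands in for reading as «class utf8» here; I am assuming the two fail on the same input, which is not verified. The name classify is made up.

```python
def classify(data: bytes) -> str:
    """Apply the logic above, assuming only MacRoman, UTF-8 and UTF-16 occur."""
    if data.isascii():
        # Bytes below 128 mean the same thing in MacRoman and UTF-8.
        return "ASCII only: identical in MacRoman and UTF-8"
    try:
        data.decode("utf-8")  # strict by default, raises on bad sequences
    except UnicodeDecodeError:
        return "UTF-16 or MacRoman with high characters"
    return "decodes as UTF-8"
```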

Hi Stefan,

I think the logic is wrong. Try this:


set t to (ASCII character 195) & (ASCII character 185)
set file_spec to ((path to desktop as string) & "pisquared.txt") as file specification
set ref_num to open for access file_spec with write permission
set eof ref_num to 0
write t to ref_num
close access ref_num
set utf8_check to read file_spec as «class utf8»
set mac_check to read file_spec
{utf8_check, mac_check}
--> {"ù", "√π"}

I tried to find something that someone might type in mac roman. Maybe there is something better, but there could be all kinds of possibilities I think.
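
For anyone without a Mac handy, the same two bytes can be checked in Python (not AppleScript). This only uses the standard utf-8 and mac_roman codecs; the variable names are just for the example.

```python
data = bytes([195, 185])                 # 0xC3 0xB9, the two bytes written above
as_utf8 = data.decode("utf-8")           # one two-byte sequence
as_macroman = data.decode("mac_roman")   # two single-byte characters
print(as_utf8, as_macroman)
```

Both decodes succeed, so a decode error alone cannot tell the two readings apart.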

Thanks,

Why not just use the Unix “file” command?

on getEncoding(POSIXpath)
	if POSIXpath starts with "~" then
		set POSIXpath to (((POSIX path of (path to home folder)) & (text 3 thru -1 of POSIXpath)) as Unicode text)
	end if
	set Foo to do shell script "file -ib " & (quoted form of POSIXpath)
	if Foo starts with "text/" then
		return text ((offset of "=" in Foo) + 1) thru -1 of Foo
	else if Foo contains "iso" then
		return Foo
	else
		return false
	end if
end getEncoding

--Some sample results;
--"us-ascii", "utf-16", "utf-8","iso-8859-1"

I’m not sure if this is the kind of answer you were looking for:

choose file with prompt "Check encoding of this text file:" without invisibles
do shell script "/usr/bin/file -bi " & quoted form of POSIX path of result

(* --> various results:

UTF-8, with or without BOM:
text/plain; charset=utf-8

UTF-16, with BOM:
text/plain; charset=utf-16

UTF-16, without BOM:
application/octet-stream

MacRoman (depending on contents):
text/plain; charset=iso-8859-1
text/plain; charset=us-ascii

*)

I was changing the encoding with TextWrangler. My test document (originally UTF-8) contained ellipsis characters. With these, the MacRoman file came up as iso-8859-1, but without them it came up as us-ascii.
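
If the output of file is going to be consumed by a script, pulling the charset out of the MIME line is simple. Here is a hedged Python sketch (not AppleScript) that only parses strings shaped like the results listed above; it does not call file itself, and the function name charset_from_mime is made up.

```python
def charset_from_mime(line: str):
    """Extract the charset from output such as 'text/plain; charset=utf-8'."""
    # Skip the media type itself and look through the parameters after it.
    for part in line.strip().split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            return value.lower()
    return None  # e.g. 'application/octet-stream' carries no charset
```

Keep in mind that iso-8859-1 for a MacRoman file is only the tool's guess, so the returned value is a hint rather than a fact.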

Why is the logic wrong? The result is correct.
You wrote two MacRoman bytes and read them as utf8.
I wrote: if no error occurs, then the text is utf8 or contains only characters < 128
and the utf8 condition is true :)

@Bruce and Vincent,
using the UNIX file command is great, but it doesn’t work in all cases.
E.g. files created with AppleScript’s read/write file commands sometimes have no file metadata,
regardless of which text encoding is used.

Why not use an existing encoding sniffer? Way easier than creating your own from scratch. e.g. man file, man Encode::Guess

HTH

has

Character sniffing is one of the tricks that ‘file’ uses when it doesn’t find a better indicator. It’s true that it won’t always report the right encoding, but determining text encodings in the absence of an unambiguous encoding indicator is inevitably an act of educated guesswork anyway. More advanced sniffers will give you additional info such as a percentage confidence in their guess; a quick Google will no doubt turn up useful info.

If you only need to identify UTF8-encoded text files, trying to read a file as «class utf8» and trapping any errors is a reasonable alternative to sniffing them first; I don’t see it being any more accurate though.

HTH

has

Using your own write handler, I actually get the correct results. However, one result was not expected. I received the correct response for MacRoman and UTF-16 but not for UTF-8, which returned "us-ascii". When I checked with TextWrangler, though, the “UTF-8” file was actually encoded as MacRoman.

Edit: This is kind of similar to the oddity I mentioned earlier; when I include something besides basic characters, or include a BOM, the file shows as being encoded as UTF-8. I guess it goes back to what you said:

Of course, now I want to know what happens for users who don’t have MacRoman as their primary system encoding.

‘Text’ data is just a bunch of bytes, and these bytes don’t mean anything by themselves; to translate them into human-readable characters you need to interpret those bytes first. In the absence of an explicit encoding declaration that tells you how those bytes should be interpreted, you just have to guess at the encoding to use. The key phrase to remember is “educated guesswork”. “Educated”, because it’s based on a deep understanding of the various available encodings out there and how they get used. And “guesswork” because there’s always some degree of guessing involved, with levels of confidence ranging from “pretty strong evidence” to “absolutely no clue”.

I really recommend reading up on the subject of text encodings if you’ve not already done so. Here’s a good place to start:

http://www.joelonsoftware.com/articles/Unicode.html

I’m sure if you Google around you’ll be able to find discussions of the sorts of heuristics used by encoding sniffers in making their best guesses. And don’t forget your applications’ preferences will also influence the encodings they report; e.g. TW likely assumes ‘MacRoman’ because that’s what you set it to use as its default encoding, and it knows the given data can be safely interpreted using that encoding (though whether or not it makes any sense is your problem).

HTH

has

I have read that before, and it was a big help. (I actually wanted to read that again; Thanks for the link. :))

Minus one point for me for not thinking of that. That was what happened in the case above. (I usually use a different editor, but just brought up TextWrangler because it’s easier for changing encodings and line endings.)

Hi kel,

just for fun a plain vanilla AppleScript solution for your algorithm,

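-- Read with the default encoding; assumes a single-byte primary encoding
-- such as MacRoman, so that each character corresponds to one byte
set a to read file ((choose file) as Unicode text)
set c to count characters of a
-- A UTF-16 BOM is FE FF or FF FE; either way the two bytes sum to 509
if c > 1 and ((ASCII number (character 1 of a)) + (ASCII number (character 2 of a)) = 509) then
	display dialog "UTF16"
	return
end if
set {utf8_Flag, roman_Flag, x} to {true, true, 1}
repeat until x > c or utf8_Flag is false
	set s to (ASCII number of (character x of a))
	if (s div 128) mod 2 = 1 then
		-- count the 1 bits after the top bit: z = number of continuation bytes
		-- (z stays 0 for a stray 10xxxxxx byte, which is treated as MacRoman)
		set {y, z} to {64, 0}
		repeat
			-- the y < 1 test avoids dividing by zero when s is 255
			if y < 1 or (s div y) mod 2 = 0 then exit repeat
			set z to z + 1
			set y to y div 2
		end repeat
		repeat with i from 1 to z
			-- the sequence must not run past the end of the text
			if x + i > c then
				set utf8_Flag to false
				exit repeat
			end if
			-- a continuation byte must be 10xxxxxx, i.e. 128 to 191
			if ((ASCII number of (character (x + i) of a)) div 64) is not 2 then
				set utf8_Flag to false
				exit repeat
			end if
			set roman_Flag to false
		end repeat
		set x to x + z + 1
	else
		set x to x + 1
	end if
end repeat
if utf8_Flag then
	if roman_Flag then
		display dialog "MacRoman"
	else
		display dialog "UTF8"
	end if
else
	display dialog "neither UTF8 nor MacRoman"
end if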

Hi everybody,

The file containing the two bytes 195 (0xC3) and 185 (0xB9) contains characters with ascii number > 127 and doesn’t error with:

read f as «class utf8»

So, the script thinks the file is encoded with utf8, and it takes that text as “ù” when it’s supposed to be “√π”. The script did not do its job. However, ‘read f as «class utf8»’ does reduce the script’s remaining work to just counting the number of utf8 character sequences, because the script already knows that all sequences are legal utf8 byte sequences.

I tried unix ‘file’ and my computer doesn’t have the options -ib.

SYNOPSIS
file [ -vczL ] [ -f namefile ] [ -m magicfiles ] file …

When I run it without the -ib options on the “√π” file, file returns:

“/Users/kel/Desktop/pisquared.txt: International language text”

which I take to mean unicode text.

Now looking for Encode::Guess. I don’t have man pages for these so searching the Internet or Terminal ‘Encode --help’.

I’ve learned a lot.

Thanks a lot,

A text file containing only two bytes, which can be read either as utf8 (ù) or as MacRoman (√π), is a border case.
Choosing the right encoding is the responsibility of the mature user ;)

What version of Mac OS X are you on, and what version of file do you have?

do shell script "/usr/bin/file --version"

Hi Stefan,

Here’s a less abstract example:


set f to choose file
set utf8_check to read f as «class utf8»
set mac_check to read f
{utf8_check, mac_check}
--> {"What is ù? How do you calculate it?", "What is √π? How do you calculate it?"}

Here there are many bytes. What I’m saying is that you might count the number of occurrences. Maybe use a ratio to the number of characters with ASCII numbers < 128. Maybe add the language. Maybe make a list of common pairs or triplets that people might type in Mac Roman. Here’s another one I thought of. Take every combination of possible high ASCII characters that can be decoded into utf8, and find which ones are printable. This would give a high probability. For instance, the bytes for “√π” would have a high probability of being utf8 because they are printable in utf8, but a low probability if the language is English (i.e. “ù” is not used, not that I know what it is used in). I just thought of something: the word calculate is math-like, so you could look at the context.
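
One way to try the "list of things people might type" idea is to decode the bytes under each candidate encoding and score the result. The Python sketch below (not AppleScript) is only an illustration: the EXPECTED set is a tiny made-up table for English text with a bit of maths, and the names plausibility and rank_encodings are hypothetical.

```python
# Hypothetical table: non-ASCII characters we expect in English text with a
# little maths in it. A real tool would need a far larger, per-language list.
EXPECTED = set("√π…’“”é")

def plausibility(text: str) -> float:
    """Fraction of the non-ASCII characters in text that are in EXPECTED."""
    high = [ch for ch in text if ord(ch) > 127]
    if not high:
        return 1.0  # pure ASCII: nothing to judge
    return sum(ch in EXPECTED for ch in high) / len(high)

def rank_encodings(data: bytes, candidates=("utf-8", "mac_roman")):
    """Return (score, encoding, text) for each candidate that decodes, best first."""
    results = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes at all
        results.append((plausibility(text), enc, text))
    return sorted(results, key=lambda r: r[0], reverse=True)
```

With the sentence from the example above saved as MacRoman, the MacRoman reading ranks first because “√” and “π” are in the table and “ù” is not. The outcome depends entirely on that table, so it is still educated guesswork.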

Anyway, I found out that Encode::Guess has something to do with perl.

Edited: running Jaguar. Do you get macroman when running file -ib on the pisquared.txt file?

gl,