Trying to convert a bunch of files to UTF-8

I am trying to convert a bunch of files that have different file encodings to UTF-8.

I have this script:


set listOfFiles to choose file with prompt "Choose files" with multiple selections allowed
set myDesktop to (path to the desktop folder) as string

repeat with theItem in listOfFiles
	
	open for access theItem
	
	set fileContents to (read theItem)
	close access theItem
	
	set fileInfo to info for theItem
	set filename to name of fileInfo
	set arquivoFinal to myDesktop & filename
	
	display dialog arquivoFinal
	
	set outputFile to open for access (arquivoFinal) with write permission
	write fileContents to outputFile as «class utf8»
	
	close access outputFile
	
end repeat

I get an error on this line:

	set fileContents to (read theItem)

with the message error “End of file error.” number -39 from file

What is the problem, and how do I fix it?

Hi,

the error message says that the file is empty.
By the way, open/close for access is not necessary when simply reading a file.

The file is not empty.

Nigel wrote a comprehensive article on MacScripter, The Ins & Outs of File Read/Write in AppleScript. Maybe you can find an answer there.

Hi.

Another possibility, since your script doesn’t use the reference number returned by ‘open for access’ and you haven’t taken any precautions against errors occurring while the file’s open, is that a previous run of the script has read the file right through to the end and then errored before it could close it again.

If you’re running the script in Script Editor, the access(es) will belong to Script Editor itself. Try quitting Script Editor to close all its allocated file accesses and then open it again.

As Stefan says, you don’t actually need ‘open for access’ if you’re simply reading a file once. If you do use it, you should employ the access number it returns and protect the proceedings with a ‘try’ block:

set fRef to (open for access theItem) -- keep the access number it returns
try
	set fileContents to (read fRef)
end try
close access fRef -- closed even if the read errors

And of course a similar arrangement when writing to a file.
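For instance, something along these lines for the writing side (just a sketch, using the variables from your script):

set fRef to (open for access file arquivoFinal with write permission)
try
	set eof fRef to 0 -- empty the file first, in case it already exists
	write fileContents to fRef as «class utf8»
end try
close access fRef -- closed even if the write errors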

That’s probably better achieved with a shell script (I can’t think of the appropriate command at the moment). The ‘read’ command can’t identify what kind of encoding a file has.

Hello.

The appropriate shell command for converting files from one text encoding to another is iconv. You’ll find the complete manual if you type “man iconv” in a Terminal window. Sometimes, however, when a file is UTF-8 but doesn’t contain any special characters, utilities like Quick Look or FileMerge won’t interpret it correctly. Then you may set an extended attribute (xattr) that identifies the file as UTF-8. You’ll find more on that here.
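For example, a minimal invocation; the source encoding (ISO-8859-1 here) is an assumption you have to supply yourself:

iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt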

Thanks, McUsrII. iconv does seem to be it. But as far as I can see, it too has to be told what the file’s current encoding is. :(

Hello.

Yes, the usual way, at least the way I do it, is to try different encodings until I find the one that gives the result I want, when I don’t know it up front. Hopefully the files are arranged by origin before you have to start converting. There are a whole lot of different encodings out there, and guessing every encoding correctly is probably a very complex task to get right.

There is a list option for iconv (iconv -l | more) that lists all the encodings it supports, so that you can find your best vantage point for the guesswork.
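A sketch of the trial loop; the candidate names here are just examples, so check iconv -l for the spellings your iconv accepts:

set thePath to POSIX path of (choose file)
repeat with enc in {"UTF-16", "MACINTOSH", "ISO-8859-1", "CP1252"}
	set encName to contents of enc
	try
		-- iconv errors out if the bytes are invalid in the source encoding;
		-- note that single-byte encodings accept any byte, so a clean run
		-- only means "possible" -- you still have to eyeball the output
		do shell script "iconv -f " & encName & " -t UTF-8 " & quoted form of thePath & " > /dev/null"
		log encName & ": bytes are valid -- inspect the converted text"
	on error
		log encName & ": not valid in this encoding"
	end try
end repeat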

Edit

The best place to start guessing the encoding is “reopening” a file with the encodings listed in TextWrangler and BBEdit, and only when those don’t work should you start trying the more exotic ones. For changing the encoding of a whole bunch of files, though, I think it is more rational to use iconv than TextWrangler or BBEdit (see the sketch below).
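A sketch of a batch version along the lines of the original script, assuming all the chosen files share one known source encoding (MACINTOSH here, i.e. MacRoman):

set listOfFiles to choose file with prompt "Choose files" with multiple selections allowed
set desktopPath to POSIX path of (path to desktop folder)
repeat with theItem in listOfFiles
	set fileName to name of (info for (contents of theItem))
	do shell script "iconv -f MACINTOSH -t UTF-8 " & quoted form of (POSIX path of theItem) & ¬
		" > " & quoted form of (desktopPath & fileName)
end repeat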

That’s generally the problem. Traditionally developers have had to play trial and error with several encodings, which is pretty messy in AS.

As of Yosemite, there’s a new API where you give the OS some hints about what it might be, and it does the work. The hints include the likely language, and you can also supply a list of encodings in preference order. Here’s a sample with just a language specified:

use AppleScript version "2.4" -- requires Yosemite
use scripting additions
use framework "Foundation"

set aPOSIXpath to POSIX path of (choose file)
-- read as raw data
set anNSData to current application's NSData's dataWithContentsOfFile:aPOSIXpath
-- set conversion options
set theOptions to current application's NSDictionary's dictionaryWithObjects:{false, "en"} forKeys:{(current application's NSStringEncodingDetectionAllowLossyKey), (current application's NSStringEncodingDetectionLikelyLanguageKey)}
-- convert to string; theEncoding will be a number that refers to the encoding used
set {theEncoding, theString, wasLossy} to current application's NSString's stringEncodingForData:anNSData encodingOptions:theOptions convertedString:(reference) usedLossyConversion:(reference)
if theEncoding is 0 then
	-- it could not be converted; handle error
	return
end if
-- write as UTF8
theString's writeToFile:(aPOSIXpath & "-conv") atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value)

It appears to me that file is a good place to start when guessing the encoding of a file. 8-)

[code]file -I ~/Desktop/"Shopping List.txt"[/code]

Results from successive runs with changed encodings:

[code]/Users/chris/Desktop/Shopping List.txt: text/plain; charset=us-ascii
/Users/chris/Desktop/Shopping List.txt: text/plain; charset=utf-8
/Users/chris/Desktop/Shopping List.txt: text/plain; charset=utf-8 (note: this run used UTF-8 with a BOM, which file does not indicate)
/Users/chris/Desktop/Shopping List.txt: text/plain; charset=utf-16be
/Users/chris/Desktop/Shopping List.txt: text/plain; charset=utf-16le[/code]
(Changes to encoding made with BBEdit.)

My first test was with a plain-text file set to utf-8 in BBEdit.

I got text/plain; charset=us-ascii, apparently because there was only ASCII content in the file.

Adding two bullets immediately changed it to text/plain; charset=utf-8

It looks like the Perl module Encode::Detect might be reasonably thorough, but I can’t install it to test at the moment.

The only problem with file is that it doesn’t consider extended attributes. They’re not guaranteed to be present, but where the encoding has been added, it’s a quick and very reliable method. Perhaps xattr followed by file might be a good way to go shell-wise.

I should also mention that one of the other optional hints is whether the file originated on Windows (NSStringEncodingDetectionFromWindowsKey).
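Used with the script above, the options dictionary might then be built something like this (the pairing with the language hint is just my example):

set theOptions to current application's NSDictionary's dictionaryWithObjects:{true, "en"} forKeys:{(current application's NSStringEncodingDetectionFromWindowsKey), (current application's NSStringEncodingDetectionLikelyLanguageKey)}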

Hey Shane,

Fair point.

xattr -p com.apple.TextEncoding <file_path>
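So a quick sketch of that xattr-then-file order of attack from AppleScript might be:

set thePath to POSIX path of (choose file)
try
	-- trust the extended attribute when it is present
	set theEncoding to do shell script "xattr -p com.apple.TextEncoding " & quoted form of thePath
on error
	-- no com.apple.TextEncoding attribute; fall back to file's content sniffing
	set theEncoding to do shell script "file -b -I " & quoted form of thePath
end try
theEncoding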

Extended attributes can be inaccurate. When you edit or create a file, changing its encoding, with an application or command-line utility that doesn’t support extended attributes, the extended attribute can indicate a completely different encoding than the file actually has. Also, clean files (from the internet, a network share, or mail attachments) don’t have extended attributes at all. The chance that an extended attribute is actually useful is very small.

file (I would use the -k option as well) gives a good indication for distinguishing between 7-bit ASCII, 8-bit extended, UTF-8, UTF-16BE and UTF-16LE. If you have saved a file as UTF-8 or MacRoman, for instance, and file reports it as 7-bit ASCII, you know that no UTF-8 or 8-bit extended characters are actually in use: the file doesn’t contain any encoding-specific characters, its effective encoding is not the one you saved it with, and xattr would again lie about its encoding. To distinguish between the different 8-bit extended encodings you can use an existing encoding predictor or write your own. It’s just based on byte sequences: each of these encodings tends to use byte values that the others rarely or never do (a minimal sketch of one such heuristic follows).
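A minimal sketch, assuming only this single rule: bytes 0x80-0x9F are unused C1 control codes in ISO-8859-1, but ordinary printable characters in MacRoman and Windows-1252, so their presence argues against ISO-8859-1. A real predictor would of course score many such patterns statistically:

set thePath to POSIX path of (choose file)
set verdict to do shell script "LC_ALL=C perl -ne 'exit 1 if /[\\x80-\\x9F]/' " & ¬
	quoted form of thePath & " && echo 'ISO-8859-1 plausible' || echo 'probably MacRoman or Windows-1252'"
display dialog verdict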

Another thing I personally dislike about the xattr approach is that when the attribute is “damaged” there is no way to open the file with the right encoding in TextEdit. TextEdit will ignore its own preferences when an extended attribute for the file encoding is set. You need to remove the extended attribute first before you can open the file with your preferred encoding.
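Removing the attribute first is at least a one-liner:

xattr -d com.apple.TextEncoding <file_path>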

Hello.

The text encoding recognition in Cocoa just parses the first line of text when determining what kind of encoding the file has. This means that if there are only characters within the US-ASCII set on the first line, the file will be interpreted as ISO-8859-1, since this is much faster to parse. If there are characters outside the ASCII range later in the file, those characters will most probably be misinterpreted. Therefore we most often use xattr to “force” the correct interpretation of a file, so that it renders correctly in TextEdit, FileMerge and Quick Look plugins; you’d expect to see the attribute in use on only a very few files. Editors like TextWrangler and BBEdit give you the chance to choose an encoding to open the file with, so using xattr to set the encoding is only a necessity when dealing with native/primitive applications. And it doesn’t do anything to the bytes themselves; it just tells how they are to be interpreted.
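For example, to mark a file as UTF-8 (the numeric half is, as far as I know, the CFStringEncoding value that accompanies utf-8 in this attribute):

xattr -w com.apple.TextEncoding "utf-8;134217984" <file_path>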

Any app that does that is buggy; apps should either write extended attributes or remove them, which is what atomic saves do. And if they’re not doing atomic saves, you have other problems. You’re right about command-line utils, but I think it’s easy to forget that most people never go near them.

Right. Which is why I said xattr first, then file.

I disagree.

That’s certainly not true with the API I used above. Which API are you talking about?

OK, maybe I should have written “the start of the file” instead of “the first line”, and I am not sure that this is how it is implemented, but it is how it appears to work. Surely, if the whole file were read in before the encoding was determined, the file would rightfully have been parsed as UTF-8 (when there is no BOM).

I haven’t dug into which API actually does this; it is happening inside NSDocument.

Edit

For the system to work as described, you have to set “automatic” (the default) as the encoding option for opening files in TextEdit. (Those TextEdit preferences appear to govern both Quick Look and FileMerge, so I guess you may view them as preferences for the NSDocument class. :) )

TextEdit doesn’t equal Cocoa, and it’s unrelated to NSDocument; you can download the code and see for yourself. TextEdit actually uses one of the APIs designed to read styled text (-readFromURL:options:documentAttributes:error:), and I suspect that getting the text encoding right for non-styled text is a secondary consideration: its main function is to handle RTF and convert things like HTML into attributed strings.

The API I referred to above is specifically designed for text files, and as I said, was only introduced in 10.10.

Hello Shane.

Btw, I guess you can just download and compile your own version of the ICU library, or iconv for that matter, if you need encoding detection without being on 10.10.