Converting text files from UTF-16 to UTF-8 encoding

Riverside · April 17, 2013, 9:04am

I have a series of .html files that have been generated and exported from FileMaker. Because of the constraints of the Export Field Contents command, however, they are generated with UTF-16 encoding. I need to switch them to UTF-8 encoding.

The following has been cobbled together from various forums.

Identify folder
Change all files to .txt
Run a shell script using textutil
Change all files back to .html


set inputfolder to (choose folder)

tell application "Finder" to set name extension of (files of inputfolder) to "txt"

delay 2

tell application "Finder" to set thefiles to files of inputfolder

delay 2

repeat with ttt in thefiles
	do shell script "textutil -convert txt -encoding UTF-8 " & quoted form of POSIX path of (ttt as text)
end repeat

delay 2

tell application "Finder" to set name extension of (files of inputfolder) to "html"

delay 1

beep

It works - but only just. Without the delay steps it gets confused and fully processes some files, but only half processes others.

I can’t believe this is the most elegant or optimal way to do this. Can anyone rationalise / improve my script, or offer a better alternative?

[You’ll see from the script that I’m not that experienced in these matters]

Thank you

Operating System: Mac OS X (10.6)

McUsrII · April 17, 2013, 9:30am

Hello and welcome to MacScripter!

This should work somewhat better.


set inputfolder to (choose folder)

tell application "Finder"
	activate
	set thefiles to every file of inputfolder whose name extension is "html"
	repeat with ttt in thefiles
		do shell script "textutil -convert txt -encoding UTF-8 " & quoted form of POSIX path of (ttt as text)
	end repeat
end tell

Riverside · April 17, 2013, 11:33am

Thanks for the welcome / script.

With your version I end up with both the original .html file and a new .txt file.

How might I take it a stage further and convert the .txt file into a .html file (thus updating / replacing the original files provided)?

Thanks

McUsrII · April 17, 2013, 12:00pm

Hello.

Try this:


set inputfolder to (choose folder)
set tids to AppleScript's text item delimiters
tell application "Finder"
	activate
	set thefiles to every file of inputfolder whose name extension is "html"
	repeat with ttt in thefiles
		do shell script "textutil -convert txt -encoding UTF-8 " & quoted form of POSIX path of (ttt as text)
		set AppleScript's text item delimiters to ".html"
		set tmpname to text items of (POSIX path of (ttt as text))
		set AppleScript's text item delimiters to ".txt"
		set txtname to tmpname as text
		do shell script "mv " & quoted form of txtname & " " & quoted form of POSIX path of (ttt as text)
	end repeat
end tell
set AppleScript's text item delimiters to tids

I am not sure this is really what you want, as textutil -convert txt strips out all the HTML formatting. Maybe you should use iconv instead? If you want to retain the HTML, have a look at man iconv, or run iconv -l from a Terminal window to list the supported encodings.
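A minimal sketch of the iconv route, assuming the exported files really are UTF-16 with a byte order mark. The helper name to_utf8 and the sample file are hypothetical, made up for illustration. Unlike textutil -convert txt, iconv only re-encodes the bytes, so the tags should come through untouched.

```shell
#!/bin/sh
# to_utf8 SRC DEST: re-encode SRC from UTF-16 to UTF-8 and write it to DEST.
# (Hypothetical helper name; iconv changes the encoding, not the markup.)
to_utf8() {
    iconv -f UTF-16 -t UTF-8 "$1" > "$2"
}

# Demo on a throwaway file so nothing real is touched.
workdir=$(mktemp -d)
# Build a small UTF-16 sample from UTF-8 input (\303\251 is an e-acute).
printf '<p>caf\303\251</p>\n' | iconv -f UTF-8 -t UTF-16 > "$workdir/sample.html"
to_utf8 "$workdir/sample.html" "$workdir/sample-utf8.html"
cat "$workdir/sample-utf8.html"
rm -rf "$workdir"
```

The output goes to a separate file here, so the original stays available until the result has been checked.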

DJ_Bazzie_Wazzie · April 17, 2013, 1:26pm

Similar code, but using the famous iconv library. Note that iconv writes to standard output, so the result has to be redirected into the new file:

do shell script "iconv -f UTF-16 -t UTF-8 /path/to/file.txt > /path/to/newfile.txt"

Using it atomically, like McUsr did:

set theFile to choose file
-- Convert into a temporary file first, then replace the original only if iconv succeeded.
do shell script "sourcefile=" & quoted form of POSIX path of theFile & "
filename=$(basename \"$sourcefile\")
iconv -f UTF-16 -t UTF-8 \"$sourcefile\" > \"$TMPDIR$filename\" && mv \"$TMPDIR$filename\" \"$sourcefile\""
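For a whole folder, the same convert-then-replace idea could be sketched in plain shell roughly as follows. This is only an illustration under assumptions: the function name convert_folder and the temporary file suffix are invented here, and every .html file in the folder is assumed to be UTF-16.

```shell
#!/bin/sh
# convert_folder DIR: re-encode every .html file in DIR from UTF-16 to UTF-8.
# Each file is first written to a temporary file next to the original and is
# only moved over the original if iconv reported success.
convert_folder() {
    for f in "$1"/*.html; do
        [ -f "$f" ] || continue          # no matches: the glob stays literal
        tmp="$f.utf8-tmp.$$"             # hypothetical temp name, same folder
        if iconv -f UTF-16 -t UTF-8 "$f" > "$tmp"; then
            mv "$tmp" "$f"
        else
            rm -f "$tmp"
            echo "skipped (conversion failed): $f" >&2
        fi
    done
}

# Demo on a throwaway folder.
demo=$(mktemp -d)
printf '<p>one</p>\n' | iconv -f UTF-8 -t UTF-16 > "$demo/a.html"
convert_folder "$demo"
cat "$demo/a.html"
rm -rf "$demo"
```

Running it a second time on the same folder would treat the already converted files as UTF-16 again and garble them, so it is meant to be run once per export. Keeping the temporary file in the same folder as the original keeps both on the same volume.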

McUsrII · April 17, 2013, 1:39pm

Hello.

I wonder if this would have worked:

do shell script "pbcopy < " & quoted form of "/path/to/UTF16" & " ; pbpaste > " & quoted form of "/path/to/utf8"

The idea is to leverage the conversion to UTF-8 that the do shell script command performs.

Edit

I don’t think it will work.

Well, an interesting tidbit about the iconv library is that it uses ICU. At least it seemed that way when I read some posts about it on the developer lists.

DJ_Bazzie_Wazzie · April 17, 2013, 1:44pm

It would have worked, but the pboard server is very buggy, especially on Lion. Writing atomically is normally the way to go when you convert files.

McUsrII · April 17, 2013, 1:50pm

Yes

I agree about atomically, and controlled.

It would still be nice to have a faster way to convert to UTF-8, if for no other reason than the mere hell of it.
I don’t think the above would have worked, as it all happens internally in the do shell script. On the other hand, I think it could work by getting the data out of one do shell script and writing it back with another.

That is: first get the value out of a do shell script with, say, a cat command, which should turn the contents of the file into UTF-8 (but I’m not totally sure). Then cat the contents of the variable back into a file with the second shell script.

iconv is a stand alone entity, I checked the header file.

DJ_Bazzie_Wazzie · April 17, 2013, 2:05pm

It feels like a lot of overhead, but in fact it isn’t. If you read everything into memory, you have more overhead in your application. This way it’s a stream and a lightweight process. The file move just moves the file’s inode number from one directory file to another (as long as both are on the same volume), which is a minimal task; no content has to be rewritten.
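The claim about mv can be checked with a small experiment. This sketch assumes both directories are on the same filesystem; the helper name inode_of is hypothetical.

```shell
#!/bin/sh
# inode_of FILE: print the inode number of FILE (first field of `ls -i`).
inode_of() {
    ls -i "$1" | awk '{print $1}'
}

# Two directories inside one temporary folder, so both are on the same filesystem.
dir=$(mktemp -d)
mkdir "$dir/a" "$dir/b"
echo "hello" > "$dir/a/file.txt"

before=$(inode_of "$dir/a/file.txt")
mv "$dir/a/file.txt" "$dir/b/file.txt"
after=$(inode_of "$dir/b/file.txt")

# On the same filesystem the two numbers should match: only the directory
# entry moved, the file contents were not copied.
echo "before: $before, after: $after"
rm -rf "$dir"
```

If the temporary folder and the destination are on different volumes, mv has to copy the data instead, and the inode number will differ.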

What do you mean by iconv being “a stand alone entity”?

McUsrII · April 17, 2013, 2:25pm

Hello.

I meant that iconv doesn’t seem to rely on any other libraries, except perhaps the standard C library.

And I wasn’t really trying to find a better way, as the second way involves two shell scripts and is far more costly than doing it atomically. (Truth be told, I haven’t bothered to find a UTF-16 encoded file to try it on yet.) But it might be of interest in some situation where such a scheme fits with the rest.

DJ_Bazzie_Wazzie · April 17, 2013, 2:52pm

Find the source file again and you’ll see it uses the libiconv library.

My source is opensource.apple.com -> 10.8.2 -> libiconv-34 -> libiconv -> src -> iconv.c. According to Apple, the source of the iconv command-line utility uses ‘The GNU LIBICONV Library’.

McUsrII · April 17, 2013, 3:09pm

Hello.

I actually meant that libiconv was a stand-alone library. Then I realized that the library may have been linked with other libraries, and that if it is linked into something dynamically, it may request other dynamic libraries to be linked. I am not going to find that out, but to me it seems like libiconv is a stand-alone library, independent of other libraries residing on your machine.

I try to avoid using it directly as much as I can, and use CoreFoundation stuff instead, guessing that this framework uses libiconv and ICU internally, one way or the other.

I should have prefixed iconv with lib earlier, but that is a prefix you are really supposed to leave out when you are dealing with libraries, since in order to use libiconv you’d link it with -liconv using gcc from the command line. Sorry for confusing you; I didn’t really think of the ambiguity, since the program is also named iconv.
I’ll try to avoid my internal “jargon” in the future. But there aren’t really many programs that are named the same as a library; iconv was, so to speak, a most unfortunate example.

kerflooey · April 17, 2013, 5:25pm

You could also use AppleScript to write the HTML text out of FileMaker fields directly to disk using “as «class utf8»”, instead of using FM’s Export script step. Search this site for some examples.

DJ_Bazzie_Wazzie · April 17, 2013, 8:11pm

At least it is dependent on the malloc lib.

I understand what you were saying. I thought you meant a standalone lib, but it’s loaded by the kernel and used by many processes, such as Perl and PHP too. When a library is required by the system, I no longer consider it a standalone library, more like a system library.

McUsrII · April 18, 2013, 12:50pm

Hello.

libmalloc is almost part of the C library. If it isn’t part of the standard, then it is pretty close, since everybody uses it, so stand-alone code may use libmalloc! (Then you can use leaks, and set environment variables, in order to debug memory leaks and free’d null pointer issues. It is stone-age DTrace.)

Everybody should use libmalloc, and on OS X I think you get it for free, as it is incorporated in the standard C library; at least it works without me having to do anything for it.

By a standalone library, I meant one that doesn’t need other resources; that is, you don’t need to use other resources in order to make the code it provides work.

If the library is used by the system, then it is a system library, I agree on that.