Converting text files from UTF-16 to UTF-8 encoding

I have a series of .html files that have been generated and exported from FileMaker. Because of the constraints of the Export Field Contents command, however, they are generated with UTF-16 encoding. I need to switch them to UTF-8 encoding.

The following has been cobbled together from various forums.

  • Identify folder
  • Change all files to .txt
  • Run a shell script using textutil
  • Change all files back to .html

set inputfolder to (choose folder)

tell application "Finder" to set name extension of (files of inputfolder) to "txt"

delay 2

tell application "Finder" to set thefiles to files of inputfolder

delay 2

repeat with ttt in thefiles
	do shell script "textutil -convert txt -encoding UTF-8 " & quoted form of POSIX path of (ttt as text)
end repeat

delay 2

tell application "Finder" to set name extension of (files of inputfolder) to "html"

delay 1

beep

It works - but only just. Without the delay steps it gets confused and fully processes some files, but only half processes others.

I can’t believe this is the most elegant or optimal way to do this. Can anyone rationalise / improve my script, or offer a better alternative?

[You’ll see from the script that I’m not that experienced in these matters]

Thank you

Browser: Firefox 20.0
Operating System: Mac OS X (10.6)

Hello and welcome to MacScripter! :slight_smile:

This should work somewhat better.


set inputfolder to (choose folder)

tell application "Finder"
	activate
	set thefiles to every file of inputfolder whose name extension is "html"
	repeat with ttt in thefiles
		do shell script "textutil -convert txt -encoding UTF-8 " & quoted form of POSIX path of (ttt as text)
	end repeat
end tell

Thanks for the welcome / script.

With your version I end up with both the original .html file and a new .txt file.

How might I take it a stage further and convert the .txt file into a .html file (thus updating / replacing the original files provided)?

Thanks

Hello.

Try this:


set inputfolder to (choose folder)
set tids to AppleScript's text item delimiters
tell application "Finder"
	activate
	set thefiles to every file of inputfolder whose name extension is "html"
	repeat with ttt in thefiles
		do shell script "textutil -convert txt -encoding UTF-8 " & quoted form of POSIX path of (ttt as text)
		set AppleScript's text item delimiters to ".html"
		set tmpname to text items of (POSIX path of (ttt as text))
		set AppleScript's text item delimiters to ".txt"
		set txtname to tmpname as text
		do shell script "mv " & quoted form of txtname & " " & quoted form of POSIX path of (ttt as text)
	end repeat
end tell
set AppleScript's text item delimiters to tids

I am not sure this is really what you want, as this strips out all the HTML formatting. Maybe you should use iconv instead? If you want to retain the HTML, have a look at man iconv, or run iconv -l from a Terminal window to list the supported encodings.
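For what it's worth, here is a rough sketch of the iconv route for a single file, as plain shell. The function name utf16_to_utf8 and the demo file names are made up for illustration, and it assumes the source file starts with a byte-order mark so that iconv can work out the byte order itself.

```shell
# Hypothetical helper (name invented for this sketch):
# convert one UTF-16 file to UTF-8, writing the result to a new file.
utf16_to_utf8() {
    # $1 = source file (UTF-16, assumed to carry a BOM)
    # $2 = destination file (UTF-8)
    iconv -f UTF-16 -t UTF-8 "$1" > "$2"
}

# Demo on a throwaway file: build a small UTF-16 HTML snippet, then convert it.
demo_dir=$(mktemp -d)
printf '<p>caf\303\251</p>\n' | iconv -f UTF-8 -t UTF-16 > "$demo_dir/in.html"
utf16_to_utf8 "$demo_dir/in.html" "$demo_dir/out.html"

# The markup should come through untouched, only the encoding changes.
cat "$demo_dir/out.html"
```

Unlike the textutil conversion to txt, this only re-encodes the bytes, so the tags stay in place.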

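Here is similar code, but using the well-known iconv library. Note that iconv writes to standard output, so the result has to be redirected into the new file:

do shell script "iconv -f UTF-16 -t UTF-8 /path/to/file.txt > /path/to/newfile.txt"

Using it atomically, like McUsr did: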

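set theFile to choose file
-- Convert into a temp file first, then move it over the original.
-- The shell variables are quoted so paths containing spaces survive,
-- and && stops the mv if iconv fails.
do shell script "sourcefile=" & quoted form of POSIX path of theFile & "
filename=$(basename \"$sourcefile\")
iconv -f UTF-16 -t UTF-8 \"$sourcefile\" > \"$TMPDIR$filename\" &&
mv \"$TMPDIR$filename\" \"$sourcefile\""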
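Since the original question was about a whole folder, the same convert-then-move idea could be looped over every .html file. This is only a sketch under a few assumptions: convert_folder is a made-up name, every .html file in the folder really is UTF-16 with a BOM, and the temp file is created next to the original (rather than in $TMPDIR) so that the final mv stays on the same volume.

```shell
# Hypothetical helper (name invented for this sketch):
# convert every .html file in a folder from UTF-16 to UTF-8, in place.
convert_folder() {
    dir=$1
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue                 # nothing matched, or not a regular file
        tmp=$(mktemp "$f.XXXXXX") || return 1   # temp file beside the original
        if iconv -f UTF-16 -t UTF-8 "$f" > "$tmp"; then
            mv "$tmp" "$f"                      # replace the original only on success
        else
            rm -f "$tmp"                        # leave the original untouched
            echo "skipped: $f" >&2
        fi
    done
}

# Usage (the path is a placeholder):
# convert_folder "/path/to/folder"
```

From AppleScript it could be driven with a single do shell script call, passing the quoted form of the folder's POSIX path. Two caveats: iconv may not reject a file that is already UTF-8, so running this twice over the same folder can mangle the files, and mktemp creates its file with owner-only permissions, so a chmod afterwards may be needed.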

Hello.

I wonder if this would have worked:

do shell script "pbcopy < " & quoted form of "/path/to/UTF16" & " ; pbpaste > " & quoted form of "/path/to/utf8"

The idea is to leverage the conversion to UTF-8 that the do shell script command performs.

Edit

I don’t think it will work.

Well, an interesting tidbit about the iconv library is that it uses ICU. :slight_smile: At least it seemed that way when I read some posts about it on the developer lists.

It would have worked, but the pboard server is very buggy, especially on Lion. Writing atomically is normally the way to go when you convert files.

Yes

I agree about atomically, and controlled.

It would still be nice to have a faster way to convert to UTF-8, if for no other reason than the sheer hell of it.
I don’t think the above would have worked, as it all happens internally in the do shell script. On the other hand, I think it could work by getting the data out of one do shell script and back in through another, so that the output gets converted on the way.

That is: first get the value out of a do shell script with, say, a cat command, which should turn the contents of the file into UTF-8 (though I am not totally sure). Then cat the contents of the variable back into a file with the second shell script.

iconv is a stand alone entity, I checked the header file. :slight_smile:

It feels like a lot of overhead, but in fact it isn’t. If you read everything into memory, you have more overhead in your application. This way it’s a stream and a lightweight process. The file move just moves the file’s inode number from one directory file to another, which is a minimal task; no content has to be rewritten.
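The inode point is easy to check from a shell. A small sketch (inode_of is a made-up helper name): on the same filesystem the inode number should be identical before and after the mv, which is what makes the move cheap. If source and destination sit on different volumes, mv has to fall back to copying the data and deleting the original, so this only holds within one filesystem.

```shell
# Hypothetical helper (name invented for this sketch):
# print the inode number of a file.
inode_of() {
    ls -i "$1" | awk '{print $1}'
}

# Demo: move a file between two directories on the same filesystem.
work=$(mktemp -d)
mkdir "$work/a" "$work/b"
echo "hello" > "$work/a/file.txt"

before=$(inode_of "$work/a/file.txt")
mv "$work/a/file.txt" "$work/b/file.txt"
after=$(inode_of "$work/b/file.txt")

# Both numbers should match: only the directory entry changed.
echo "before=$before after=$after"
```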

What do you mean?

Hello.

I meant that iconv doesn’t seem to rely on any other libraries, other than perhaps the standard C library.

And I wasn’t really after finding a better way, as the second way involves two shell scripts and is far more costly than doing it atomically. (Truth be told, I haven’t bothered to find a UTF-16 encoded file to try it on yet.) But it might be of interest in some situation where such a scheme fits with the rest.

Find the source file again and you’ll see it uses the libiconv library :smiley:

My source is: opensource.apple.com->10.8.2->libiconv-34->libiconv->src->iconv.c. According to Apple, the source of the iconv command-line utility uses ‘The GNU LIBICONV Library’.

Hello.

I actually meant that libiconv was a stand-alone library. * Then I realized that the library may have been linked with other libraries, and if the library is linked into something dynamically, then it may request other dynamic libraries to be linked. I am not going to find that out, but to me it seems like libiconv is a stand-alone library, independent of other libraries residing on your machine.

I try to avoid using it directly as much as I can, and use CoreFoundation stuff instead, guessing that this framework uses libiconv and ICU internally, one way or the other.

* I should have written libiconv rather than iconv earlier, but lib is a prefix you are really supposed to drop when you are dealing with libraries, since in order to use libiconv you’d link it with -liconv using gcc from the command line. Sorry for confusing you; I didn’t really think of the ambiguity, since the program is also named iconv.
  I’ll try to avoid my internal “jargon” in the future. But there really aren’t many programs that are named the same as a library; iconv was, so to speak, a most unfortunate example. :slight_smile:

You could also use AppleScript to write the HTML text out of FileMaker fields directly to disk using “as «class utf8»”, instead of using FM’s Export script step. Search this site for some examples.

At least it depends on the malloc lib :smiley:

I understand what you were saying. I thought you meant a standalone lib, but it’s loaded by the kernel and used by many processes, such as perl and php too. When a library is required by the system, I no longer consider it a standalone library; it is more like a system library.

Hello.

libmalloc is almost in the C library; if it isn’t part of the standard, then it is pretty close, since everybody uses it, so stand-alone code may use libmalloc! :slight_smile: (Then you can use leaks, and set environment variables, in order to debug memory leaks and free’d null pointer issues. It is stone-age DTrace.)

Everybody should use libmalloc, and on OS X I think you get it for free, in that it is incorporated in the standard C library; at least it works without me having to do anything for it.

By a standalone library, I meant that it doesn’t need other resources, and that you don’t need to use other resources in order to make the code it provides work.

If the library is used by the system, then it is a system library, I agree on that. :slight_smile: