Converting UTF-16 text files to UTF-8.

I’m having trouble understanding how to get AppleScript to process files held in subfolders.

Basically I have a lot of files that have come from FileMaker Pro via the export field contents routine. Unfortunately these files are UTF-16 and I need to convert them all back to UTF-8. I’ve looked at all sorts of ways to do this, and the closest I’ve come is the following script, which was posted by McUsrII at http://macscripter.net/viewtopic.php?id=40729.


set inputfolder to (choose folder)

tell application "Finder"
	activate
	set thefiles to every file of inputfolder whose name extension is "html"
	repeat with ttt in thefiles
		do shell script "textutil -convert txt -encoding UTF-8 " & quoted form of POSIX path of (ttt as text)
	end repeat
end tell


This script works perfectly, except that it only processes the files held in the top-level folder. Basically all I need is some way to have the script work its way through the files held in the subfolders of the top-level folder.

I only have a very basic understanding of AppleScript, so any guidance on this would be greatly appreciated.

Have you taken a look at my post in the same forum? :) In your case, though, I would use the find command, so you can convert the folder and its subfolders with a single command.

set theFolder to text 1 thru -2 of POSIX path of (choose folder)
do shell script "find " & quoted form of theFolder & " -iname '*.html' -exec sh -c 'filePath={};iconv -f UTF-16 -t UTF-8 $filePath > ${filePath%.*}_utf8.html' \\;"

Edit: maybe some explanation is required:

find’s first argument is the location you want to search. find prefixes every path it finds with that location plus a “/”, so the given location should not have a trailing slash (that is what text 1 thru -2 is for: it drops the trailing slash from the POSIX path of the chosen folder).

The second argument, -iname, tells find to match file names case-insensitively, here all files ending in “.html”.

The third and last argument is -exec. Terminated by an escaped semicolon, it runs a command for each file that is found. However, -exec is somewhat primitive: it runs the command directly rather than through a shell, so it doesn’t allow any piping or redirection. That is why I start a new subshell using sh.

The sh (sub)command first sets the variable filePath to the found file (find replaces the {} symbols with the path of the found file).

The second command inside sh is iconv, which converts from one encoding (-f) to another (-t).

iconv prints to standard output, so I redirect standard output to a file using the greater-than symbol > followed by the new file path. ${filePath%.*} removes the extension from the file path, and we append _utf8.html to get the new file name.
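
To see what the ${filePath%.*} part does on its own, here is a small shell sketch. The function name utf8_copy_name and the example path are made up for illustration; only the parameter expansion itself comes from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original command): build the output
# file name the same way the find command above does.
utf8_copy_name() {
    filePath=$1
    # ${filePath%.*} removes the shortest trailing match of ".*",
    # which is the file extension; then the new suffix is appended.
    printf '%s\n' "${filePath%.*}_utf8.html"
}

# Example path, purely illustrative
utf8_copy_name "/Users/me/export/page.html"
# prints /Users/me/export/page_utf8.html
```

One caveat: the command above pastes {} straight into the sh -c string and leaves $filePath unquoted, so a path containing spaces will most likely break it. The later version in this thread passes the path as an argument and quotes it.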

Thanks for the quick reply. Much appreciated. I did read your post in the original forum, but my current knowledge prevents me from getting the most out of it.

I’m not sure where the code should go that you have just supplied. I’ve tried incorporating the contents without success.

Here’s my attempt:

set theFolder to text 1 thru -2 of POSIX path of (choose folder)
set tids to AppleScript's text item delimiters
tell application "Finder"
	activate
	set thefiles to every file of inputfolder whose name extension is "html"
	repeat with ttt in thefiles
		do shell script "find " & quoted form of theFolder & " -iname '*.html' -exec sh -c 'filePath={};iconv -t UTF-16 $filePath > ${filePath%.*}_utf8.html' \\;"
		set tmpname to text items of (POSIX path of (ttt as text))
		set AppleScript's text item delimiters to ".txt"
		set txtname to tmpname as text
		do shell script "mv " & quoted form of txtname & " " & quoted form of POSIX path of (ttt as text)
	end repeat
end tell
set AppleScript's text item delimiters to tids

I’m sure something is in the wrong place so any assistance would be great.

You don’t need to insert the code anywhere; the example code is a complete working implementation and you can forget all the other code. Just click “open scriptlet in your editor” and run it :)

Thanks for the detailed explanation of what each part of the script does. This has been extremely helpful in understanding what is going on as the script is executed.

The current script works like a charm.

The only refinement now to make it 100% would be to have it overwrite the original file instead of saving the converted file under a different name. I did try changing “_utf8.html” to just “.html”, but when the script is run the resulting file is empty. The only other thing I could think of would be to write the new files into a new folder, although I think that would make things too complicated, as the original folder has nested folders within.

I’ll keep experimenting but any guidance would still be great.

The problem is that iconv works as a stream filter: it converts the encoding while it reads the data and prints the result as it goes. When the output is redirected to the same file, the shell opens and empties that file (the part after the > sign) before iconv starts, so iconv has nothing to read. Because iconv doesn’t hold the whole file in a buffer its memory use is almost zero, but as a result it cannot overwrite the file it is reading.
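
The empty result can be reproduced without iconv. A minimal sketch, assuming a POSIX-style sh with mktemp available; tr stands in for iconv here, and the function name clobber_demo and the temporary file exist only for this demo.

```shell
#!/bin/sh
# Hypothetical demo: read from and redirect to the same file.
clobber_demo() {
    demo=$(mktemp)
    printf 'some text\n' > "$demo"

    # The shell truncates "$demo" for the > redirection before tr starts,
    # so tr reads an empty file and writes nothing back.
    tr 'a-z' 'A-Z' < "$demo" > "$demo"

    # -s is true only when the file exists and is not empty
    if [ -s "$demo" ]; then
        result="file still has content"
    else
        result="file is now empty"
    fi
    rm -f "$demo"
    echo "$result"
}

clobber_demo
# prints: file is now empty
```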

The easiest way to overwrite the file is to rename the new file to the old name after iconv is done, for instance by using the mv command as the third command in sh. The rename replaces the file atomically, so the original is never left half-written. If you also want to be sure of keeping the original when iconv fails for some reason, run mv only when iconv succeeds.
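
The convert-then-rename idea can be sketched in plain shell like this. The function name convert_in_place and the .tmp suffix are my own choices for illustration, and the if/else is an assumption on my part so that mv only runs when iconv exits successfully.

```shell
#!/bin/sh
# Hypothetical helper: convert one UTF-16 file to UTF-8 in place.
convert_in_place() {
    filePath=$1
    tmpFilePath="${filePath}.tmp"   # assumed temp name, next to the original

    # Write the converted text to the temp file first. Replace the original
    # only if iconv succeeded; otherwise drop the partial temp file.
    if iconv -f UTF-16 -t UTF-8 "$filePath" > "$tmpFilePath"; then
        mv -f "$tmpFilePath" "$filePath"
    else
        rm -f "$tmpFilePath"
        return 1
    fi
}

# Usage (illustrative path):
#   convert_in_place "/path/to/page.html"
```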

If you want to write to another folder you could, in the script of course, first duplicate the given folder to another location and then convert the files in the duplicate.
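
A rough shell sketch of that duplicate-first approach, in case it helps. The function name convert_copy is made up, and it assumes the destination folder does not exist yet and the source path has no trailing slash (cp -R behaves differently otherwise).

```shell
#!/bin/sh
# Hypothetical helper: copy a folder tree, then convert the .html files
# in the copy so the originals stay untouched.
convert_copy() {
    src=$1
    dst=$2
    # Assumes "$dst" does not exist yet, so it becomes a copy of "$src"
    cp -R "$src" "$dst" || return 1

    # The found path is passed to sh as "$1"; the literal "sh" fills $0.
    find "$dst" -type f -iname '*.html' -exec sh -c '
        iconv -f UTF-16 -t UTF-8 "$1" > "$1.tmp" && mv -f "$1.tmp" "$1"
    ' sh {} \;
}

# Usage (illustrative paths):
#   convert_copy "/Users/me/export" "/Users/me/export_utf8"
```

Because the whole tree is copied first, the nested folders come along for free and nothing has to be recreated by hand.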

Thanks again for the additional information.

I’m not sure I understand where the mv command should be placed. While I understand what you say about iconv not using a buffer, I would assume that once the new file has been created there would still be two files, so if I could rename the new file back to the old file name, wouldn’t the newly created file still exist?

I’ll admit this is still over my head in complexity but I feel like some progress has been made.

What I had in mind was something like below.


set shellScript to "#!/bin/sh
# Create variable to the original file found by the find command
filePath=\"$0\"

# Create variable pointing to a temporary file
tmpFilePath=\"${filePath%.*}_temp.html\"

# Convert original file into UTF-8 using iconv and store it in the temporary file
iconv -f UTF-16 -t UTF-8 \"$filePath\" > \"$tmpFilePath\"

# Rename the new temporary file as the original file name to overwrite the old file. 
mv -f \"$tmpFilePath\" \"$filePath\""

set theFolder to text 1 thru -2 of POSIX path of (choose folder)
do shell script "find " & quoted form of theFolder & " -iname '*.html' -exec sh -c " & quoted form of shellScript & " {} \\;"

The script may look completely different to you, but it’s the same script with the mv command added. The difference is that I wrote it in a more verbose way, with comments for better readability, and stored the shell command in a separate variable so the distinction between the find command and the sh command executed by find is clearer. The script is almost the same as the first one, except that it renames the new file to the original name and so replaces the original.

The terse version, which is the same code as above, would look like this:

set theFolder to text 1 thru -2 of POSIX path of (choose folder)
do shell script "find " & quoted form of theFolder & " -iname '*.html' -exec sh -c 'filePath=\"$0\";tmpFilePath=\"${filePath%.*}_temp.html\";iconv -f UTF-16 -t UTF-8 \"$filePath\" > \"$tmpFilePath\";mv -f \"$tmpFilePath\" \"$filePath\"' {} \\;"

P.S. For other readers: the reason I pass the found path {} as an argument to the script, rather than embedding it in the script, is that the contents of the variable shellScript can then be stored in a shell script file and run manually, or with a single call where the first argument is the original file.
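
One detail worth checking if you do move the script into its own file: with sh -c '...' {} the found path arrives as $0, but when a script file is run as sh script.sh /path/to/file the path is $1 and $0 holds the script's name. A small sketch of the difference; the echo commands and the example path are only for illustration.

```shell
#!/bin/sh
# With sh -c, the first argument after the command string becomes $0.
sh -c 'echo "arg0=$0"' "/tmp/page.html"
# prints arg0=/tmp/page.html

# Adding a placeholder word for $0 makes the path arrive as $1, which is
# also where a standalone script file would receive it.
sh -c 'echo "arg1=$1"' sh "/tmp/page.html"
# prints arg1=/tmp/page.html
```

So a script saved to a file (say a hypothetical convert.sh) would probably want filePath="$1" instead of filePath="$0", and find could then call it with -exec sh convert.sh {} \;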

FWIW, here’s an alternative that doesn’t use streaming. It requires Yosemite or later, or being in a script library for Mavericks, and it’s probably going to be a little slower than DJ’s version.

use AppleScript version "2.3.1"
use scripting additions
use framework "Foundation"
set theFolder to POSIX path of (choose folder)
its convertFilesIn:theFolder

on convertFilesIn:posixPathOfFolder
	-- make NSURL of folder
	set anNSURL to current application's class "NSURL"'s fileURLWithPath:posixPathOfFolder
	-- get file manager
	set theNSFileManager to current application's NSFileManager's defaultManager()
	-- set options used when enumerating folder
	set theOptions to (current application's NSDirectoryEnumerationSkipsPackageDescendants) + (current application's NSDirectoryEnumerationSkipsHiddenFiles as integer)
	-- get the folder's contents
	set theURLs to (theNSFileManager's enumeratorAtURL:anNSURL includingPropertiesForKeys:{} options:theOptions errorHandler:(missing value))'s allObjects()
	-- filter out all except .html files
	set thePred to current application's NSPredicate's predicateWithFormat:"pathExtension == 'html'"
	set theURLs to theURLs's filteredArrayUsingPredicate:thePred
	-- loop through them
	repeat with i from 1 to theURLs's |count|()
		set thisURL to (theURLs's itemAtIndex:(i - 1))
		-- read as UTF16
		set theText to (current application's NSString's stringWithContentsOfURL:thisURL encoding:(current application's NSUTF16StringEncoding) |error|:(missing value))
		if theText is not missing value then
			-- write back as UTF8
			(theText's writeToURL:thisURL atomically:true encoding:(current application's NSUTF8StringEncoding) |error|:(missing value))
		end if
	end repeat
end convertFilesIn:

DJ, your code works like a charm. I couldn’t ask for more. The additional comments you’ve added during this exercise are also of great value. While I’m still learning, your notes will provide enough detail to allow some experimentation when needed. Your creation looks very complex so the explanation does make a huge difference for the novice.

I’d also like to thank Shane for his contribution. Unfortunately I’m still running an older operating system so for now I can’t put it into use.

I’m going to spend a lot more time browsing these forums as there is much to learn.

Again, the outstanding support has been much appreciated. :)