Spaces Outputting as Ê ?

mlcrutchfield · February 14, 2009, 12:04am

Hello All,

I’m working on a script to grab the contents of a specified message and output it in a separate file. It does grab the info and save it off, but the spaces seem to be replaced with an Ê character. Is there a way around it?

The content of the email is XML. Like this:

<?xml version="1.0" encoding="ISO-8859-1"?> Francisco Olivera .etc.

The outputted file is like this:

<?xml version="1.0" encoding="ISO-8859-1"?> ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ ÃŠÃŠÃŠ Francisco Olivera .etc.

Here’s the script thus far:

tell application "Finder" to set ptd to "Macintosh HD:Users:melanie:Desktop:" as string
tell application "Mail"
	set theMessages to selection
	repeat with theMessage in theMessages
		set theText to content of theMessage as string
		set theFile to ptd & (theMessage's id as string)
		set theFileID to open for access file theFile with write permission
		write theText to theFileID
		close access theFileID
	end repeat
end tell

Thanks!

AppleScript: 2.0.1
Browser: Firefox 3.0.6
Operating System: Mac OS X (10.5)

chrys · February 14, 2009, 2:34am

The text encodings are being mishandled in one or more places. If you take a non-breaking space and write it in MacRoman encoding and read it back as Latin-1 (ISO-8859-1), you get that mangled character.

Whatever the encoding is in the actual email message, it looks like AppleScript is getting ahold of it with proper non-breaking spaces (proper for AppleScript’s internal encoding). But when you use write to write it out, it is written out in MacRoman (the “primary encoding” on English systems).

When this happens, you have an XML file that is actually encoded as MacRoman, but declares that it uses ISO-8859-1 (as per your quoted examples). This is a broken file. It is only barely viable because MacRoman and ISO-8859-1 use the same encoding values for the basic characters.
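
The byte-level effect can be reproduced outside AppleScript. Here is a minimal sketch in Python, used purely for illustration (it is not part of the workflow in this thread). It assumes Python's `mac_roman` and `latin-1` codecs correspond to the MacRoman and ISO-8859-1 encodings discussed above.

```python
# Illustration only: reproduce the mangling of a non-breaking space.
nbsp = "\u00a0"  # NO-BREAK SPACE

# Step 1: the character is written to disk as MacRoman (a single byte).
mac_bytes = nbsp.encode("mac_roman")

# Step 2: a reader trusts the XML declaration and decodes that byte as ISO-8859-1.
mangled = mac_bytes.decode("latin-1")

print(mac_bytes.hex())            # the byte MacRoman uses for the non-breaking space
print(f"U+{ord(mangled):04X}")    # the code point ISO-8859-1 assigns to that same byte

# Plain ASCII is encoded identically in both encodings,
# which is why the rest of the XML survives the mix-up.
ascii_sample = '<?xml version="1.0"?>'
same_for_ascii = ascii_sample.encode("mac_roman") == ascii_sample.encode("latin-1")
print(same_for_ascii)
```

The second code point printed is the one for Ê, the character that shows up in place of each non-breaking space.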

It seems to me that the simplest way to prevent this problem would be to use email attachments so that AppleScript does not have to deal with characters themselves. If you can not change to attachments, you could try writing it out as a superset (e.g. UTF-16) and then using another tool to convert it back to the proper encoding.

Here is some code that demonstrates the encoding mangling problem and the write-as-UTF-16-convert-to-Latin-1 workaround (I wrote and tested this under Tiger, but I think it should behave the same under Leopard):

(* Any Unicode text in a 'do shell script' command is converted to UTF-8 by the time the shell-invoked processes see it. This includes arguments to 'echo'.
 * Likewise any text output from the shell-invoked processes are converted from UTF-8 before they get back to AppleScript.
 ** Usually this means that the AppleScript result of a 'do shell script' is 'Unicode text', which is UTF-16 internally. But it was UTF-8 when the shell-invoked processes generated it!
 ** To provide non-UTF-8 input to a process, we can use 'iconv' to convert from the UTF-8 that AppleScript provides to the shell to whatever we need.
 ** To receive non-UTF-8 output from a process, we can use 'iconv' to convert it to UTF-8.
 *)

set threeNBSP to «data utxt00A000A000A0» as Unicode text -- this is UTF-16BE, but 'do shell script' will have converted it to UTF-8 by the time it gets to the command-line programs
set macRomanAsLatin1 to do shell script "echo -n " & quoted form of threeNBSP & " | iconv -f UTF-8 -t MACROMAN | iconv -f ISO-8859-1 -t UTF-8" without altering line endings

set tempBase to POSIX path of (path to temporary items folder from user domain)
set macRomanPath to tempBase & "3nbsp.macroman.txt"
try
	set fr to open for access POSIX file macRomanPath with write permission
	write threeNBSP to fr as string
	close access fr
on error m number n
	close access POSIX file macRomanPath
	error m number n
end try
set writeThenIconv to do shell script "iconv -f ISO-8859-1 -t UTF-8 < " & quoted form of macRomanPath without altering line endings

(* This next block is the part you need:
 * Write it out as UTF-16 (by specifying 'as Unicode text' in the 'write' command).
 * Use the shell tool 'iconv' to convert it back to ISO-8859-1.
 *)
set tempUTF16Path to tempBase & "3nbsp.txt.utf16be"
set latin1Path to tempBase & "3nbsp.latin1.txt"
try
	set fr to open for access POSIX file tempUTF16Path with write permission
	write threeNBSP to fr as Unicode text
	close access fr
on error m number n
	close access POSIX file tempUTF16Path
	error m number n
end try
do shell script "iconv -f UTF-16 -t ISO-8859-1 < " & quoted form of tempUTF16Path & " > " & quoted form of latin1Path -- the input/output are from/to files, so we do not have to mention UTF-8 since we are not providing input or collecting output

set latin1ThroughUTF16 to do shell script "iconv -f ISO-8859-1 -t UTF-8 < " & quoted form of latin1Path without altering line endings -- verify that it was properly converted to Latin-1

display dialog "Here is the mangled text:" with title "MacRoman as Latin-1 Results" default answer "Original: " & threeNBSP & return & "Directly Mangled (iconv|iconv): " & macRomanAsLatin1 & return & "Indirectly Mangled (AS write|iconv): " & writeThenIconv & return & "OK (written as UTF-16, converted to Latin-1): " & latin1ThroughUTF16

If not all of your XML emails specify ISO-8859-1, you will have to parse the XML declaration for the proper final encoding (and use that in the iconv call).
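
As a rough sketch of that parsing step, here it is in Python (for illustration only; in the script itself the same idea would be done with AppleScript text handling or a shell tool). The helper name `declared_encoding` and the fallback to UTF-8 when no declaration is present are assumptions of this sketch, not something from the thread.

```python
import re

# Hypothetical helper: pull the encoding name out of an XML declaration.
_DECL = re.compile(r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def declared_encoding(xml_text, default="UTF-8"):
    """Return the declared encoding, or the assumed default if none is declared."""
    match = _DECL.match(xml_text)
    return match.group(1) if match else default

sample = '<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'
target = declared_encoding(sample)
print(target)  # this name is what would be passed to iconv's -t option
```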

Of course, if you are willing to parse the XML declaration, there is another alternative. You could rewrite the encoding declaration to be whatever encoding you are actually writing (change to “UTF-16” and write … as Unicode text; change to “UTF-8” and write … as «class utf8», etc.; maybe even “mac” and write … as string).
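
A minimal sketch of that alternative, again in Python for illustration only. The helper name `redeclare` and the sample document are hypothetical; the point is simply that the declaration is changed to match the bytes that actually get written.

```python
import re

# Hypothetical helper: make the declaration agree with the encoding actually written.
def redeclare(xml_text, new_encoding):
    pattern = r'^(<\?xml[^>]*?encoding\s*=\s*["\'])[^"\']*(["\'])'
    return re.sub(pattern,
                  lambda m: m.group(1) + new_encoding + m.group(2),
                  xml_text, count=1)

original = '<?xml version="1.0" encoding="ISO-8859-1"?>\u00a0<doc/>'
rewritten = redeclare(original, "UTF-8")

# Encode with the same encoding that is now declared, so the file is self-consistent.
data = rewritten.encode("utf-8")
print(rewritten.split("?>")[0] + "?>")
```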

mlcrutchfield · February 18, 2009, 9:59pm

Hi Chrys,

Thanks so much for your very thorough reply! I have to admit that I had to read through it 20 times or so and it was still way over my head. The bits about the difference between MacRoman, UTF-16, UTF-8 and ISO-8859-1 were really helpful and gave me a bunch of information I didn’t know before. I tried doing my best to plug in the right details but what I ended up getting to work looks a good deal different than what you showed me.

Now, it works (as in I eventually get a file to look the way I want it) BUT, it saves that file off in an unexpected location. Here’s my script:

tell application "Finder" to set ptd to "Macintosh HD:Users:myUser:Desktop:" as string
tell application "Mail"
	set theMessages to selection
	repeat with theMessage in theMessages
		set theText to content of theMessage as string
		set theFile to ptd & (theMessage's id as string)
		set theFileID to open for access file theFile with write permission
		write theText to theFileID
		close access theFileID
	end repeat
end tell
set getFile to "/Users/myUser/Desktop/" & (theMessage's id as string)
set newText to open for access file getFile with write permission
set oldText to do shell script "iconv -f ISO-8859-1 -t UTF-8 < '" & getFile & "'" as string without altering line endings
set AppleScript's text item delimiters to "√ä"
set textAdjust to every text item in oldText
set AppleScript's text item delimiters to " "
set doneText to every text item in textAdjust as string
write doneText to newText
close access newText

Now, it seems to me that it should be rewriting the new information over the old. Instead, it’s saving a new document in the root level of my hard drive. Can you see anything that would make it do that?

If I use just the second half of the script and point it to a specific file, it overwrites the information just fine. Like so:

set getFile to "/Users/myUser/Desktop/12203r"
set newText to open for access file "Macintosh HD:Users:myUser:Desktop:12203r" with write permission
set oldText to do shell script "iconv -f ISO-8859-1 -t UTF-8 < '" & getFile & "'" as string without altering line endings
set AppleScript's text item delimiters to "√ä"
set textAdjust to every text item in oldText
set AppleScript's text item delimiters to " "
set doneText to every text item in textAdjust as string
write doneText to newText
close access newText

Thanks a bunch for your help. You’re clearly MUCH more experienced than I am.

chrys · February 19, 2009, 6:53am

mlcrutchfield:

set getFile to "/Users/myUser/Desktop/" & (theMessage's id as string)
set newText to open for access file getFile with write permission
Now, it seems to me that it should be rewriting the new information over the old. Instead, it’s saving a new document in the root level of my hard drive. Can you see anything that would make it do that?

If I use just the second half of the script and point it to a specific file, it overwrites the information just fine. Like so:
set newText to open for access file "Macintosh HD:Users:myUser:Desktop:12203r" with write permission

When reduced to the essentials, the difference between the two should be more apparent. The first uses a POSIX path (slash delimited) with the file object specifier. This is almost never what you really want to do. The second uses the normal pairing of HFS path (colon delimited) with file. If you want to use a POSIX path, use it with the POSIX file object specifier:

set getFile to "/Users/myUser/Desktop/" & (theMessage's id as string)
set newText to open for access POSIX file getFile with write permission

That said, here is a version of your script with inline comments explaining the things I would change:

(*
tell application "Finder" to set ptd to "Macintosh HD:Users:myUser:Desktop:" as string
 -- Finder is not required here. This is simply a string operation. You start with the string literal and coerce it to a string. Such a coercion does not change the data at all. It would be better as just
set ptd to "Macintosh HD:Users:myUser:Desktop:"
 -- But using a string literal that contains the boot volume name and the user's short name means that you will have to edit the script if either of these are ever different. Use this instead for better portability:
set ptd to path to desktop folder as Unicode text
 -- "path to" is a command from StandardAdditions and it can give you the path to many common places either as an alias or as text (an HFS/Mac path (colon-delimited))
 *)
set ptd to path to desktop folder as Unicode text
tell application "Mail"
	set theMessages to selection
	repeat with theMessage in theMessages
		(*
		set theText to content of theMessage as string
		set theFile to ptd & (theMessage's id as string)
		-- In the dictionary to my Mail.app (Tiger), both the id and content properties of a message are already strings. There is no need to coerce them into strings. If the same holds in Leopard, use this instead:
		set theText to content of theMessage
		set theFile to ptd & (theMessage's id)
		*)
		set theText to content of theMessage
		set theFile to ptd & (theMessage's id)
		set theFileID to open for access file theFile with write permission
		(*
		write theText to theFileID
		-- The key idea with "write" is to always supply an "as" parameter. This is to make sure you know what text encoding is used when text data is written to disk. Without the "as" parameter, Leopard will always write it out in the system's primary encoding (usually MacRoman). (On Tiger, without an "as" parameter, "Unicode text" objects come out as UTF-16BE, and other types of text come out in the primary encoding, usually MacRoman.)
		-- Use this to make sure the file is written out as UTF-16:
		write theText to theFileID as Unicode text
		*)
		write theText to theFileID as Unicode text
		close access theFileID
	end repeat
end tell
(***
 *** All the code below should be inside the repeat loop, after the "close access".
 *** If it is left out here only the file from the last message will ever be "fixed up".
 ***)
(*
set getFile to "/Users/myUser/Desktop/" & (theMessage's id as string)
-- This suffers the same user name portability problem as the previous literal path string. A similar use of "path to" can also be used to get the POSIX path (slash delimited): 
set getFile to (POSIX path of (path to desktop folder)) & (theMessage's id)
*)
set getFile to (POSIX path of (path to desktop folder)) & (theMessage's id)
(*
set newText to open for access file getFile with write permission
set oldText to do shell script "iconv -f ISO-8859-1 -t UTF-8 < '" & getFile & "'" as string without altering line endings
-- Several things are muddled here.
-- First is the attempt to use "open for access" with a POSIX path, but using "file" instead of "POSIX file". This is why the file ended up in the root of your boot volume. It ended up with a name, in Finder, like "/Users/myUser/Desktop/<msgID>", right? The part that was supposed to be the path actually ended up in the filename.
-- Second, even if the proper file had been opened for writing, the program still needs to read from the file (which happens in the "do shell script" line that comes next). While it may work to open a file for writing, then read from it before rewriting it, it is not a very good practice. A better idea is to read from the file first, then open it for (re)writing.
-- Third, the manual single quotes are a potential problem. What if the filename was actually supposed to have one or more single quotes in it? A better way of doing it is to use "quoted form of", which will handle such problems automatically.
-- Last, there is an extra, unnecessary "as string" coercion.
set oldText to do shell script "iconv -f ISO-8859-1 -t UTF-8 < " & quoted form of getFile without altering line endings
set newText to open for access POSIX file getFile with write permission
-- Another logic problem is that the original "write" was writing the text out as MacRoman, but the above iconv invocation means "convert ISO-8859-1 to UTF-8". So iconv is reading MacRoman data but interpreting it as ISO-8859-1. As far as the computer can tell, this will always work (they are both 8-bit encodings), but it is obviously not the right thing to do. The "to UTF-8" part is fine since it is the correct way to provide output that "do shell script" will accept and convert to AppleScript's internal format.
-- But, because I changed the above "write" call to use UTF-16, the previous invocation of iconv is no longer appropriate. Use this instead to convert the UTF-16 to ISO-8859-1 without further involving AppleScript's internal text handling:
-- This could be done on one line, but it is expanded and broken up a bit here to try to make its operation more transparent.
set tmpFile to getFile & ".tmp"
set convertCmd to "iconv -f UTF-16 -t ISO-8859-1"
set inputRedirection to " < " & quoted form of getFile
set outputRedirection to " > " & quoted form of tmpFile
set renameCmd to "mv -f " & quoted form of tmpFile & " " & quoted form of getFile
-- convert UTF-16 getFile to ISO-8859-1 tmpFile then overwrite getFile with tmpFile
do shell script convertCmd & inputRedirection & outputRedirection & " && " & renameCmd
*)
set tmpFile to getFile & ".tmp"
set convertCmd to "iconv -f UTF-16 -t ISO-8859-1"
set inputRedirection to " < " & quoted form of getFile
set outputRedirection to " > " & quoted form of tmpFile
set renameCmd to "mv -f " & quoted form of tmpFile & " " & quoted form of getFile
do shell script convertCmd & inputRedirection & outputRedirection & " && " & renameCmd
(*
set AppleScript's text item delimiters to "√ä"
set textAdjust to every text item in oldText
set AppleScript's text item delimiters to " "
set doneText to every text item in textAdjust as string
write doneText to newText
close access newText
-- The funky characters make me think that this is just a different attempt to correct the "weird-space" issue instead of solving the more general encoding issue. If that is all you really want to do, you can probably do that by using a similar find-and-replace on the original data before writing it out as MacRoman. Just replace the non-breaking space (U+00A0; Option-Space on my system) with an ordinary space (U+0020; Space on my system); they look alike but are different characters. But beware that if the XML actually uses any other characters that have different encodings in MacRoman and ISO-8859-1, they will still show up as "weird". Anyway, none of this will be necessary if you use the above iconv+rename.
*)
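
The quoting concern from the comments above applies to any shell command assembled from strings. As a rough analogue of AppleScript's "quoted form of", here is a sketch using Python's `shlex.quote` (illustration only; the file name is hypothetical, and the two tools do not necessarily produce byte-identical quoting, only equivalent results for a POSIX shell).

```python
import shlex

# Hypothetical file name containing a single quote and a space.
path = "/tmp/it's a test.txt"

# Manual quoting: the embedded single quote would end the quoted string early.
naive = "iconv -f UTF-16 -t ISO-8859-1 < '" + path + "'"

# Library quoting: the embedded quote is escaped for the shell.
safe = "iconv -f UTF-16 -t ISO-8859-1 < " + shlex.quote(path)

# Splitting the safe command with POSIX shell rules recovers the original path intact.
recovered = shlex.split(safe)[-1]
print(recovered == path)
```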

mlcrutchfield · February 19, 2009, 5:14pm

Hi!

Lots of great notes. Thanks!

I ran your script and got an error on the last line. It says “iconv: (stdin):1:43: cannot convert” when it gets to this line:

do shell script convertCmd & inputRedirection & outputRedirection & " && " & renameCmd

It creates a file named “122035” (the message ID of the message I had selected) with “†” characters where the “Ê” were. It also creates a 122035.tmp file as indicated in the script.

chrys · February 19, 2009, 10:24pm

The “:1:43:” in the error message means “:line:column:”, so iconv found something it could not convert to ISO-8859-1 on the first line, after character 43. In your example the first line is “<?xml version="1.0" encoding="ISO-8859-1"?>”, which has exactly 43 characters. Is there something extra at the end of the line? Something like “xxd -l 100 ~/Desktop/” in Terminal would provide a hex dump of the first 100 bytes of the file (about 50 characters, since most characters take two bytes in UTF-16). This hex dump would be useful for debugging the error. Do you see anything between the 003f 003e (“?>”) and the 000a (linefeed)?
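
As a complement to the hex dump, here is a small Python sketch (illustration only) that locates the first character a target encoding cannot represent. The function name `first_unconvertible` and the sample text are hypothetical; the U+FFFC in the sample is just a stand-in for whatever the real file contains.

```python
def first_unconvertible(text, target="iso-8859-1"):
    """Return (line, offset, char) for the first character target cannot encode, else None."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for offset, ch in enumerate(line):
            try:
                ch.encode(target)
            except UnicodeEncodeError:
                return line_no, offset, ch
    return None

declaration = '<?xml version="1.0" encoding="ISO-8859-1"?>'
sample = declaration + "\ufffc\n<doc/>\n"  # hypothetical content with a stray character

hit = first_unconvertible(sample)
if hit is not None:
    line_no, offset, ch = hit
    print(f"line {line_no}, after {offset} characters: U+{ord(ch):04X}")
```

Here the offset counts the characters that precede the offending one on its line, matching the "after character 43" reading above.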

In some tests with my version of Mail (Tiger), I see fffc (U+FFFC; object replacement character) in the texts of some content of msg (usually where an image would be). You can skip things like this by adding a “-c” to the iconv command line. This causes iconv to silently ignore characters that it can not convert. But this has the potential to misrepresent your data!
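
To make the data-loss risk concrete, here is a Python sketch (illustration only) in which `errors="ignore"` plays a role comparable to iconv's “-c”. The sample text is made up for the demonstration.

```python
# Hypothetical text containing an object replacement character (U+FFFC).
text = "price\ufffc list\u00a0ok"

# A strict conversion fails, much like iconv without "-c".
strict_failed = False
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError:
    strict_failed = True

# A lossy conversion silently drops what it cannot represent.
lossy = text.encode("iso-8859-1", errors="ignore")

print(strict_failed)
print(len(text), len(lossy))  # the lossy result is one character shorter
```

The non-breaking space survives because ISO-8859-1 can represent it; only the U+FFFC disappears, and nothing in the output marks where it was.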

If you are seeing U+FFFC characters from Mail, I really would recommend changing to an attachment-based workflow so that Mail does not have to do its “content of” conversion. Also, getting the file directly from an attachment would save you from having to mess with the encoding at all (as long as the sender always sends files where the encoding declaration matches the actual encoding).

Here is the replacement code to skip unconvertible characters:

set tmpFile to getFile & ".tmp"
-- "-c" below causes unconvertible characters to be skipped, THIS MIGHT MISREPRESENT DATA!
set convertCmd to "iconv -c -f UTF-16 -t ISO-8859-1"
set inputRedirection to " < " & quoted form of getFile
set outputRedirection to " > " & quoted form of tmpFile
set renameCmd to "mv -f " & quoted form of tmpFile & " " & quoted form of getFile
-- unconditionally rename, since even though "-c" above makes iconv skip unconvertible characters, it still returns a non-zero exit code; if you end up with a zero byte file it was because iconv could not even start to decode the data (the "-f" encoding was incorrect)
do shell script convertCmd & inputRedirection & outputRedirection & " ; " & renameCmd

If the .tmp file exists, then the “bare” file should still be in UTF-16 encoding. If the conversion had been successful, you would have been left with only the “122035” file in the ISO-8859-1 encoding. Using the above replacement code, you should never end up with a “.tmp” file (though the resulting file may be zero length or truncated if the conversion failed to read some of the data, and it will be missing any unconvertible characters).

When dealing with these encoding issues, you must be very careful to consider which encoding a viewing/editing program is using. Seeing “†” instead of “Ê” likely implies that the file was correctly written out as UTF-16, but it is being read as MacRoman. In TextEdit the Open… dialog has a drop-down list to select the encoding when reading the file. While “Automatic” is the default, it is not usually the best mode of operation while trying to debug encoding issues.

If you are seeing “†”, then there is also probably a hidden, zero-width character before each displayed character. This is because when UTF-16 encoded text is interpreted as an 8-bit encoding there will often be extra NUL characters added to the stream (non-breaking space is 0x00 0xA0 in UTF-16; 0x00 0xA0 in MacRoman is two characters: NUL then “†”; 0x00 0xA0 in ISO-8859-1 is a different pair of two characters: NUL then a non-breaking space).

You can “feel” these NUL characters in TextEdit by using the arrow keys to move through the text. When the NUL characters are present, you will find that the cursor does not move after every other consecutive right arrow key press. The cursor does move in TextEdit’s internal representation, but because the NUL character is not displayed, the displayed cursor effectively does not move. You can still delete, select, cut, copy, and paste these NUL characters, but the interaction looks odd because nothing is ever displayed for the NUL characters.
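
The NUL interleaving can be checked with a few lines of Python (illustration only, assuming Python's `utf-16-be`, `mac_roman`, and `latin-1` codecs match the encodings discussed here):

```python
# A non-breaking space as big-endian UTF-16: two bytes, 0x00 then 0xA0.
nbsp_utf16 = "\u00a0".encode("utf-16-be")

# The same two bytes misread as single-byte encodings become two characters each.
as_macroman = nbsp_utf16.decode("mac_roman")
as_latin1 = nbsp_utf16.decode("latin-1")

print(nbsp_utf16.hex())
print([f"U+{ord(c):04X}" for c in as_macroman])  # NUL followed by the dagger
print([f"U+{ord(c):04X}" for c in as_latin1])    # NUL followed by the non-breaking space
```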