XML parsing error: not well-formed at character 26 of line 33

I’m picking up where I left off on an AppleScript project that I received great help with on this forum: http://bbs.applescript.net/viewtopic.php?id=19411

I have a test case where when I get to this line in my script…

set theXML to parse XML theContent with including empty elements and allowing leading whitespace

…Script Debugger stops with an AppleScript Runtime Error:

XML parsing error: not well-formed at character 26 of line 33

Line 33 is as follows…

<ADDRESS1>Espigolera nº1, Pol Ind nº1, 08960, Sant Just Desvern</ADDRESS1>

I take it that the XML parser is barfing on the extended character. If I edit the XML file and remove the fancy letter after each "n", my script can proceed past the XML parse.

What do I need to do to get past this problem (editing of the XML before it is processed is not possible in the real world)?

Hi,

I guess this is a question of text encoding.
The 26th character is the º sign, which is encoded differently in ISO-8859, MacRoman, UTF-8 and UTF-16.

Yes, the moment after I posted my question I changed the encoding of the XML file in TextWrangler from MacRoman to UTF-8 and then Script Debugger stopped complaining.

The script line that writes the XML file to begin with is…

to writeFile(theContent)
	tell application "Finder"
		set theFile to (folderpath & "Order_[" & ORDER_NUMBER & "]_backup.xml")
		set f to (open for access theFile with write permission)
		set eof of f to 0
		try
			write theContent to f as «class utf8»
			close access f
		on error errMsg
			try
				close access f
			end try
			error errMsg
		end try
	end tell
end writeFile

How would I be sure that the XML file’s text encoding was UTF-8?

Isn’t the text encoding declared in the first line of the XML file?
Or you can check with this subroutine:

on check_utf8(theFile) -- theFile is a path string
	try
		read file theFile as «class utf8»
		return true
	on error
		return false
	end try
end check_utf8
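
As a cross-check on the subroutine above, here is a minimal sketch that asks the macOS command-line tool file to guess the encoding. The handler name reportEncoding is hypothetical, and it assumes thePath is an HFS path string as in check_utf8. The tool's answer is a heuristic based on the bytes it finds, so treat it as a hint rather than proof.

```applescript
on reportEncoding(thePath) -- hypothetical handler; thePath is an HFS path string
	-- "file -I" prints the MIME type and the charset it detects for the file
	return do shell script "/usr/bin/file -I " & quoted form of POSIX path of thePath
end reportEncoding
```

A file that really is UTF-8 and contains non-ASCII characters should be reported with charset=utf-8. Pure ASCII files are reported as us-ascii, which is also valid UTF-8.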

In this XML, no, the encoding is not declared in the first line of the file.

But even if I change the first line of the XML file to say…

<?xml version="1.0" encoding="UTF-8"?>

…it doesn’t matter, the file itself seems to be Mac Roman.

When I open an XML file written by the snippet I quoted before in TextWrangler, the bottom of the window reports the text encoding as Mac Roman…
http://www.automaticduck.com/screenshots/AppleScriptMacRoman.jpg

…if I change the encoding to UTF-8 and save the file, then I can parse the XML just fine…
http://www.automaticduck.com/screenshots/AppleScriptUTF-8.jpg

I tried your code to read the file as UTF-8, with the XML file still showing as Mac Roman in TextWrangler, and Script Debugger died on the read…
Can’t make result into the expected type.

It seems my problem is that when the XML file is written the file isn’t written as UTF-8.

This is odd. Normally, writing as «class utf8» produces real UTF-8 encoded files.
You can try adding the UTF-8 BOM at the beginning of the file:

to writeFile(theContent)
	tell application "Finder"
		set theFile to (folderpath & "Order_[" & ORDER_NUMBER & "]_backup.xml")
		set f to (open for access theFile with write permission)
		set eof of f to 0
		try
			write «data rdatEFBBBF» to f
			write theContent to f as «class utf8» starting at eof
			close access f
		on error errMsg
			try
				close access f
			end try
			error errMsg
		end try
	end tell
end writeFile
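
A related variation, offered only as a sketch: open for access, write, set eof and close access are Standard Additions commands rather than Finder commands, so the handler does not need a Finder tell block at all. The handler name writeFileUTF8 and the thePath parameter are hypothetical. Whether the Finder block has anything to do with the Mac Roman result is not established in this thread.

```applescript
to writeFileUTF8(theContent, thePath) -- hypothetical handler; thePath is an HFS path string
	-- Standard Additions file commands, called outside any tell block
	set f to open for access file thePath with write permission
	try
		set eof f to 0 -- truncate any existing content
		write theContent to f as «class utf8»
		close access f
	on error errMsg number errNum
		-- make sure the file is closed before passing the error along
		try
			close access f
		end try
		error errMsg number errNum
	end try
end writeFileUTF8
```

It would be called as writeFileUTF8(theContent, folderpath & "Order_[" & ORDER_NUMBER & "]_backup.xml"), reusing the variables from the handler above.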

Ok, I think I found my way around this.

After I read the XML file, I coerce the variable holding the data to UTF-8…

set f to alias (xmlfolderpath & theFile)
set s to read f
set theContent to s as «class utf8»

Seems to work. Any problems doing it this way?

I tried this and now the file appears in TextWrangler as UTF-8 and my script also reads it!

Hooray! Thank you!
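
On the read-then-coerce workaround asked about earlier: read without an as parameter interprets the file's bytes in the system's primary encoding (Mac Roman on Western systems), so the later coercion works on text that has already been decoded. Once the file on disk is really UTF-8, a more direct sketch is to decode it during the read. This reuses xmlfolderpath and theFile from the snippet above and assumes the file is valid UTF-8. It will fail with an error, as seen earlier in the thread, if the file is still Mac Roman.

```applescript
set f to alias (xmlfolderpath & theFile)
-- decode the bytes as UTF-8 while reading, instead of coercing afterwards
set theContent to read f as «class utf8»
```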