Script to convert XML data into tab-separated text?

Hello,

One of the applications I use generates XML files that have eleven fields per record. An example of the contents of such a file is pasted below. I am in search of a script that will parse these XML files into flat text files such that the eleven fields of each record are separated by tabs on single row. (I think will copy that info into Excel).

I searched but couldn’t find any existing solutions. Can any of you point me in the right direction or help me with this? I would really appreciate it.

Thank you,
Erik

Ideal layout of a single record in the text file OUTPUT:

Package ID (tab) Status (tab) PIC (tab) FinalPostage (tab) Balance (tab) TransactionID (tab) TransactionDateTime (tab) PostmarkDate (tab) ReferenceID (tab) Description (tab) WeightOz

Layout of XML file INPUT (shown here with three records):

Success PICCODE 4.05 41.95 903 20070423155017 20070423 FI-XXXXX-MT-HE-IN-070423-TS SoundVision Eye Protection - Clear (1) to FIRST LAST 16 Success PICCODE2 4.00 47.95 905 20070424133045 20070424 FI-XXXXX-OK-OK-IN-070424-TS SoundVision Eye Protection - Amber (2) to FIRST LAST 25 Success PICCODE3 5.00 42.95 906 20070425125032 20070424 FC-XXXXX-MT-HE-SH-070422-TS SoundVision Eye Protection - Clear (12) to COMPANY 150

Example:

I need what pretty much what is described in the link below, except for Mac and with tab-separations instead of semicolons:

http://www.codeproject.com/vbscript/xml2csv.asp

You might do this with an osax or with System Events if you knew how. Here’s an example with the “xml tool.osax” scripting addition.

http://www.latenightsw.com/freeware/XMLTools2/
set f to choose file – the xml file
set xml_text to read f
parse XML xml_text

This returns the xml text into an AppleScript record which you parse with AppleScript.

gl,

This is accurate and reasonably fast, but I’m not sure if it’s kosher:

set XMLfile to (choose file) as Unicode text
tell application "System Events" to set XMLrecords to XML elements of XML element 1 of XML file XMLfile

set dbRecords to {}
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to tab as Unicode text

repeat with thisXMLrecord in XMLrecords
	-- Read the ID, field names, and field values for this "Package" field.
	tell application "System Events" to set {packageID, {labels, vals}} to {value of XML attribute "ID", {name, value} of XML elements} of thisXMLrecord
	
	-- Collate the field names and respective values into a single list.
	set labelValuePairs to {}
	repeat with i from 1 to (count labels)
		set end of labelValuePairs to item i of labels
		set end of labelValuePairs to item i of vals
	end repeat
	-- Use nefarious means to turn this list into an AppleScript record.
	set rec to «class seld» of (record {«class usrf»:labelValuePairs} as record)
	-- Join the package ID and the ordered values from the AS record into
	-- a tab delimited line and add it to the DB record list.
	set end of dbRecords to {packageID, get rec's {|Status|, |PIC|, |FinalPostage|, |Balance|, |TransactionID|, |TransactionDateTime|, |PostmarkDate|, |ReferenceID|, |Description|, |WeightOz|}} as Unicode text
end repeat

-- Join the lines into a single text using returns.
set AppleScript's text item delimiters to return as Unicode text
set dbText to dbRecords as Unicode text
set AppleScript's text item delimiters to astid

dbText

If the elements of the XML file are guaranteed to be in the required order, there’s no need to go through AppleScript record process. The text value ‘PackageID’ and the list ‘vals’ can be joined together like this to make each DB record line:

set end of dbRecords to {packageID, vals} as Unicode text

The method for turning a list of paired values into a record “

set rec to «class seld» of (record {«class usrf»:labelValuePairs} as record)

“ works in Tiger and is an OS X variation of a method that used to work in OS 8/9:

set labelValuePairs to {"a", 7, "name", "Aardvark", "modification date", current date}
set rec to «class seld» of (record {«class usrf»:labelValuePairs} as record)
--> {a:7, |name|:"Aardvark", |modification date|:date "Thursday 26 April 2007 15:03:18"}

Notice that the labels produced are user labels, not reserved ones. It doesn’t work in Jaguar (I don’t know about Panther) but there’s an alternative that does, using File Read/Write commands:

set labelValuePairs to {"a", 7, "name", "Aardvark", "modification date", current date}
set fRef to (open for access file ((path to trash as Unicode text) & "Test.dat") with write permission)
try
	set eof fRef to 0
	write {«class usrf»:labelValuePairs} to fRef
	set rec to (read fRef from 1 as record)
end try
close access fRef

rec
--> {a:7, |name|:"Aardvark", |modification date|:date "Thursday 26 April 2007 15:03:18"}

The inverse of this, which only works with user labels, appears to be:

set rec to {a:7, |name|:"Aardvark", |modification date|:date "Thursday 26 April 2007 15:03:18"}

set fRef to (open for access file ((path to trash as Unicode text) & "Test.dat") with write permission)
try
	set eof fRef to 0
	write rec to fRef
	set labelValuePairs to (read fRef from 13 as list)
end try
close access fRef

labelValuePairs
--> {"a", 7, "name", "Aardvark", "modification date", date "Thursday 26 April 2007 15:03:18"}

Came up with this which should work… Worked fine in my testing and unless some records have less or mor fields it should stay working =)

set thefile to choose file with prompt "Please Select Your XML Input File"
set theXML to paragraphs 2 through -2 of (read thefile)

set deskPath to path to desktop as Unicode text
set newFilePath to deskPath & "Parsed Output.txt"
set fileRef to (open for access file newFilePath with write permission)

set i to 1
repeat ((count of theXML) / 12) times
	set newLine to text 14 through -3 of item i of theXML
	repeat 10 times
		set i to i + 1
		set o to offset of ">" in (item i of theXML)
		set newLine to newLine & tab & text (o + 1) through -(o + 2) of item i of theXML
	end repeat
	write newLine & return to fileRef
	set i to i + 2
end repeat

close access fileRef

Also if the desired end result is to import this file into Excel you can just populate a worksheet in the first place rather then writing out to a file.

Thank you all for your help - much appreciated! The first script worked well for me but I got this error when I ran the second: “Can’t make 5.5 into type integer.” Please don’t spend any more time on my behalf on it, though, as the first does what I need.
Thank you again,
Erik

Hey Erik

Well the problem is happening on this step with my script.

repeat ((count of theXML) / 12) times

This means that the XML your feeding the file is different (somewhere) than the sample you provided.
Either the structure for some of the packages is different (more or less items) or there is additional
header or footer information aside from and .

I know you said don’t spend anymore time, but whats a bored SysAdmin to do then? LOL If you could
would it be possible to post the entire data set you tried to run through my version? I would like to see
what was killing it =)

Hi James,
You’re right - I took a look at the data and see that there is a blank line separating each record.
A (sanitize) version of the file I tried to process is below.
Thank you again for your help!
Erik

By the way, I just made a small donation to this site because I’m so impressed with the help that members offer.
All the best,
Erik

Modified version Erik that strips out the extra line between the Packages then does its thing.

set thefile to choose file with prompt "Please Select Your XML Input File"
set theXML to (read thefile)
set findText to "

"
set theXML to paragraphs 2 through -2 of SaR(theXML, findText, return)
set deskPath to path to desktop as Unicode text
set newFilePath to deskPath & "Parsed Output.txt"
set fileRef to (open for access file newFilePath with write permission)

set i to 1
repeat ((count of theXML) / 12) times
	set newLine to text 14 through -3 of item i of theXML
	repeat 10 times
		set i to i + 1
		set o to offset of ">" in (item i of theXML)
		set newLine to newLine & tab & text (o + 1) through -(o + 2) of item i of theXML
	end repeat
	write newLine & return to fileRef
	set i to i + 2
end repeat

close access fileRef

to SaR(sourceText, findText, replaceText)
	set TID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set tempText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	set sourceText to tempText as string
	set AppleScript's text item delimiters to TID
	return sourceText
end SaR

Thank you James.

I ran the script but got this error: “File PB-MAIN-150:Users:emikysa:Desktop:Parsed Output.txt is already open.”

I had deleted the previous version of that file off of my desktop, though, and I didn’t see it open anywhere.

Perhaps ideally the script would save the new text file with the same name (and to the same folder) as the input xml file except with “parsed.txt” appended to it?

If you’d prefer to forget about this script, though, I won’t be offended!

All the best,
Erik

Give this a try Erik

set thefile to choose file with prompt "Please Select Your XML Input File"
tell application "Finder" to set {savePath, theName} to {container, name} of thefile
set theXML to (read thefile)
set findText to "

"
set theXML to paragraphs 2 through -2 of SaR(theXML, findText, return)

set newFilePath to (savePath as string) & theName & "_parsed.txt"
set fileRef to (open for access file newFilePath with write permission)

set i to 1
repeat ((count of theXML) / 12) times
	set newLine to text 14 through -3 of item i of theXML
	repeat 10 times
		set i to i + 1
		set o to offset of ">" in (item i of theXML)
		set newLine to newLine & tab & text (o + 1) through -(o + 2) of item i of theXML
	end repeat
	write newLine & return to fileRef
	set i to i + 2
end repeat

close access fileRef

to SaR(sourceText, findText, replaceText)
	set TID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set tempText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	set sourceText to tempText as string
	set AppleScript's text item delimiters to TID
	return sourceText
end SaR

Hi, Erik.

James’s script has no provision for recovering from errors while the file’s open. When you got your “Can’t make 5.5 into type integer.” error earlier on, it stopped the script, leaving the line that closed the access to the file unexecuted. Even though you then moved the file to the trash, the write-permission access to it is still registered to Script Editor, or whatever you’re using to run the script. (I don’t know the exact mechanics here. When I duplicate the situation, a replacement file can’t be created on the desktop, but yet the old file can’t be cleared from the trash either.)

The immediate cure is to quit Script Editor, which closes all its file accesses, and then relaunch it.

The cure to stop the situation occurring in the first place is to put everything that happens while the file’s open in a ‘try’ block. That way, the script will keep going long enough to close the access:

set thefile to choose file with prompt "Please Select Your XML Input File"
tell application "Finder" to set {savePath, theName} to {container, name} of thefile
set theXML to (read thefile)
set findText to "

"
set theXML to paragraphs 2 through -2 of SaR(theXML, findText, return)

set newFilePath to (savePath as string) & theName & "_parsed.txt"
set fileRef to (open for access file newFilePath with write permission)

try
	set i to 1
	repeat ((count of theXML) / 12) times
		set newLine to text 14 through -3 of item i of theXML
		repeat 10 times
			set i to i + 1
			set o to offset of ">" in (item i of theXML)
			set newLine to newLine & tab & text (o + 1) through -(o + 2) of item i of theXML
		end repeat
		write newLine & return to fileRef
		set i to i + 2
	end repeat
	close access fileRef
on error msg
	close access fileRef
	display dialog msg buttons {"OK"} default button 1
end try

to SaR(sourceText, findText, replaceText)
	set TID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set tempText to text items of sourceText
	set AppleScript's text item delimiters to replaceText
	set sourceText to tempText as string
	set AppleScript's text item delimiters to TID
	return sourceText
end SaR

Thank you James and Nigel!

It turns out that the xml files have two blank lines under the line. If I delete those out then the script runs fine, but I get the error if I don’t.

Just out of curiosity, how hard would it be to make it so that a file would be processed if it was selected in Finder when the script was run?

I really appreciate all your help!

All the best,
Erik

tell application "Finder" to set tXML to selection as alias

Then press on with your script

Well, there’s always my script, of course… :wink: It’s not quite as fast as James’s, but it doesn’t make as many assumptions about the layout of the XML. (It does however assume that all the required fields are present and that they’re contained in ID’d “Packages”, which are XML elements of the top level element.) If you want it to save a plain text file in the same manner as James’s script, you can replace the last line (‘dbText’) with:

set fileRef to (open for access file (XMLfile & "_parsed.txt") with write permission)
try
	set eof fref to 0
	write dbText as string to fileRef
end try
close access fileRef

heheI I prolly would have gone your method Nigel if I was familiar with using system events to parse XML. I didn’t originally think that there would be differences amongst the structure either as it seemed to be generated not hand created

That said change this line to the following:

set theXML to paragraphs 2 through -3 of SaR(theXML, findText, return)

You may have to go -4 rather than -3 though depnding how those line returns are.

Just to demonstrate impartiality (;)), here’s a (hopefully) more flexible version of James’s script, which I hope he won’t mind me posting. It’s very fast! :slight_smile: (But I’m taking a bit of a risk indexing by ‘word’. It works on English-language systems…)

-- set thefile to (choose file with prompt "Please Select Your XML Input File")
-- Or:
tell application "Finder" to set thefile to selection as alias

set theXML to paragraphs of (read thefile)

set newFilePath to (thefile as Unicode text) & "_parsed.txt"
set fileRef to (open for access file newFilePath with write permission)

try
	set eof fileRef to 0
	considering case but ignoring white space
		set package to false
		repeat with i from 1 to (count theXML)
			set thisLine to item i of theXML
			if (thisLine begins with "<Package ID") then
				set package to true
				set newLine to word 5 of thisLine
			else if (package) then
				if (thisLine begins with "</Package>") then
					write newLine & return to fileRef
					set package to false
				else if (thisLine begins with "<") then
					set newLine to newLine & tab & text from word 4 to word -4 of thisLine
				end if
			end if
		end repeat
	end considering
	close access fileRef
on error msg
	close access fileRef
	display dialog msg buttons {"OK"} default button 1
end try

Nice Nigel! I would have never thought of using word they way you did =)

Thanks, James. But I’m still not sure about it. The idea of ignoring white space and indexing by word was to avoid the assumption that the “<” was the first character in the line. (There could conceivably be tab or space indents, though Erik hasn’t indicated that there are.) But now there’s an assumption that ‘words’ will be interpreted the same way on any machine running the script. :confused: Maybe text items would be safer:

--set thefile to (choose file with prompt "Please Select Your XML Input File")
-- Or:
tell application "Finder" to set thefile to selection as alias

set theXML to paragraphs of (read thefile)
set astid to AppleScript's text item delimiters

set newFilePath to (thefile as Unicode text) & "_parsed.txt"
set fileRef to (open for access file newFilePath with write permission)

try
	set eof fileRef to 0
	considering case but ignoring white space
		set package to false
		repeat with i from 1 to (count theXML)
			set thisLine to item i of theXML
			if (thisLine begins with "<Package ID") then
				set package to true
				set AppleScript's text item delimiters to "\""
				set newLine to text item 2 of thisLine
				-- Or, to preserve the quotes:
				-- set newLine to "\"" & text item 2 of thisLine & "\""
			else if (package) then
				if (thisLine begins with "</Package>") then
					write newLine & return to fileRef
					set package to false
				else if (thisLine begins with "<") then
					set AppleScript's text item delimiters to ">"
					set b to (count text item 1 of thisLine) + 2
					set AppleScript's text item delimiters to "<"
					set newLine to newLine & tab & text b thru text item -2 of thisLine
				end if
			end if
		end repeat
	end considering
	close access fileRef
	set AppleScript's text item delimiters to astid
on error msg
	close access fileRef
	set AppleScript's text item delimiters to astid
	display dialog msg buttons {"OK"} default button 1
end try

It still assumes that every “package” is internally arranged in the right order, but that’s possibly OK…