Parsing text when there are repeating fields

I am parsing text from an email into a tab-delimited file. Pretty easy to do.

If I had something like this:

[b]Product: Pro Import FCP
Version: 1.03
Platform: Macintosh (PPC)
Operating System: 10.4.8


[/b]

I parse this in AppleScript like this…

set AppleScript's text item delimiters to {"Product: "}
set theContent to text item 2 of theContent
set AppleScript's text item delimiters to {"Version: "}
set ProductName to text item 1 of theContent -- result: Pro Import FCP
set theContent to text item 2 of theContent
set AppleScript's text item delimiters to {"Platform: "}
set ProductVersion to text item 1 of theContent -- result: 1.03
set theContent to text item 2 of theContent
set AppleScript's text item delimiters to {"Operating System: "}
set ProductPlatform to text item 1 of theContent -- result: Macintosh (PPC)
set theContent to text item 2 of theContent
set AppleScript's text item delimiters to {"***"}
set OSVersion to text item 1 of theContent -- result: 10.4.8
set theContent to text item 2 of theContent

My problem is that sometimes the text I get to parse may be “repeating”, for example:

[b]Product: Pro Import FCP
Version: 1.03
Platform: Macintosh (PPC)
Operating System: 10.4.8


[/b]
[b]Product: Pro Export FCP
Version: 3.02
Platform: Macintosh (PPC)
Operating System: 10.3.9


[/b]

My code will get the first section of text, but it loses the second section. I can grab the second section like this replacement for the beginning of my script:

set AppleScript's text item delimiters to {"Product: "}
set theContent2 to text item 3 of theContent -- >  this new line saves the second group
set theContent to text item 2 of theContent

But this will fail if there ISN’T a second group, and of course there could be a third or fourth group!

What would be a good solution? Thank you.

Hi wplate,

if each “record” has the same format
then you can do something like this:

-- one list per field; each record found appends one value to each list
set {ProductName, ProductVersion, ProductPlatform, OSVersion} to {{}, {}, {}, {}}
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to {": "}
repeat with i in paragraphs of theContent
	try
		-- split "Label: value"; a line without ": " raises an error and is skipped
		set {a, b} to text items of i
		if a = "Product" then copy b to end of ProductName
		if a = "Version" then copy b to end of ProductVersion
		if a = "Platform" then copy b to end of ProductPlatform
		if a = "Operating System" then copy b to end of OSVersion
	end try
end repeat
set AppleScript's text item delimiters to astid
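
Since the goal is a tab-delimited file, the four lists still need to be turned into one line per record. A minimal sketch of that step, assuming every record supplied all four fields so the lists have equal length (the sample lists and the name outputLines are stand-ins, not part of the script above):

```applescript
-- Sample lists standing in for the results of the loop above (assumed data)
set ProductName to {"Pro Import FCP", "Pro Export FCP"}
set ProductVersion to {"1.03", "3.02"}
set ProductPlatform to {"Macintosh (PPC)", "Macintosh (PPC)"}
set OSVersion to {"10.4.8", "10.3.9"}

set outputLines to {}
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to {tab}
repeat with n from 1 to (count ProductName)
	-- coercing a list to text joins its items with the current delimiter
	set end of outputLines to ({item n of ProductName, item n of ProductVersion, item n of ProductPlatform, item n of OSVersion} as text)
end repeat
set AppleScript's text item delimiters to astid
outputLines
```

Each item of outputLines can then be written to the file followed by a return. If a record might be missing a field, the lists would go out of step, so that case would need its own check.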

Thank you, that looks helpful!

Hi,

Another way might be to split the text into records on the two returns that separate them.


set t to "Product: Pro Import FCP
Version: 1.03
Platform: Macintosh (PPC)
Operating System: 10.4.8
***

Product: Pro Export FCP
Version: 3.02
Platform: Macintosh (PPC)
Operating System: 10.3.9
***"
set user_tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to {return & return}
set _temp to text items of t
set AppleScript's text item delimiters to user_tid
_temp

Now, you can deal with each record with a repeat loop.
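
As a rough sketch of what that repeat loop might look like: each record is split into lines, the value after ": " is kept, and the values are joined with tabs. The sample text is built with explicit return characters here, because which line-ending character ends up inside a multi-line string literal can depend on the editor and system version. The names rec1, rec2, theValues and outputLines are made up for the illustration.

```applescript
-- Sample text built with explicit returns (assumed data, same shape as above)
set rec1 to "Product: Pro Import FCP" & return & "Version: 1.03" & return & "Platform: Macintosh (PPC)" & return & "Operating System: 10.4.8" & return & "***"
set rec2 to "Product: Pro Export FCP" & return & "Version: 3.02" & return & "Platform: Macintosh (PPC)" & return & "Operating System: 10.3.9" & return & "***"
set t to rec1 & return & return & rec2

set user_tid to AppleScript's text item delimiters
set AppleScript's text item delimiters to {return & return}
set _temp to text items of t

set outputLines to {}
repeat with aRecord in _temp
	set theValues to {}
	set AppleScript's text item delimiters to {": "}
	repeat with aLine in paragraphs of (contents of aRecord)
		-- keep only "Label: value" lines; "***" and blank lines yield one text item
		set theParts to text items of (contents of aLine)
		if (count theParts) is 2 then set end of theValues to item 2 of theParts
	end repeat
	-- join this record's values with tabs
	set AppleScript's text item delimiters to {tab}
	set end of outputLines to (theValues as text)
end repeat
set AppleScript's text item delimiters to user_tid
outputLines
```

This works for any number of records, including just one, which is the case the "text item 3" approach could not handle.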

gl,

Perhaps my original post placed too much emphasis on the example of the text I’m parsing.

Here is an example closer to what I’m actually doing…

Pro Import FCP 2.02 Macintosh (Intel) 10.4.8 Pro Import AE 3.04 Macintosh (PPC) 10.4.6

I need to get the values out of the “XML” and I’m writing them out to a tab-delimited text file.

I want to grab the values out of the tags in the first group and write them to the file, then come back and read the values of the second group, write those to the file, etc.

So in the end I’ll have something like…

Pro Import FCP\t2.02\tMacintosh (Intel)\t10.4.8
Pro Import AE\t3.04\tMacintosh (PPC)\t10.4.6

I originally thought the post about going through each paragraph would help me, but when I sat down to do something with it I couldn’t apply it, since my actual formatting is different from my first data example (and because I’m not as smart as the rest of you). :)

Thank you for continued help.

Hi wplate,

Try downloading XML Tools osax.

http://www.latenightsw.com/freeware/XMLTools2/

If you need more help write back.

gl,

Hi wplate,

here is an example that parses an XML file with System Events,
the result is in the format you want.

set XMLfile to "path:to:my.xml"
tell application "System Events"
    set b to ""
    tell XML file XMLfile
        tell contents
            set a to value of XML elements of XML elements of XML element 1 of XML element 1
            set oldDelims to AppleScript's text item delimiters
            set AppleScript's text item delimiters to {tab}
            repeat with i in a
                set b to b & text items of i & return
            end repeat
            set AppleScript's text item delimiters to oldDelims
        end tell
    end tell
end tell

You will probably need to adjust the following line, because the depth of the nested XML elements is not clear from your example:

set a to value of XML elements of XML elements of XML element 1 of XML element 1

I saw the link to this yesterday when I purchased Script Debugger (cool program!), but my level of “programming ability” is so low that it doesn’t really qualify as “ability”. I’m mostly a copy-and-paste programmer, so the XML OSAX stuff just scared me.

I see how this will make all of my other code simpler, so I’m right now trying to figure out how to use this XML stuff, I’m not even down to where I have multiple sets of tags.

Can you tell me why this code results in an empty result (handlers not included in post)…?
If you look at this screenshot it seems the XML parse completed properly, but I can’t get the value of the element.

set theContent to "<?xml version=\"1.0\"?>
<OrderNoticeDS>

  <OrderInfo>
    <ORDER_NUMBER>1081</ORDER_NUMBER>
   </OrderInfo>

</OrderNoticeDS>"

	set theXML to parse XML theContent
	set ORDER_NUMBER to getElementValue(getElements(theXML, "ORDER_NUMBER"))

I sure do like Script Debugger, it is really helpful.

I stepped through each line of my example, including the handler and it is clear that I’m not getting deep enough.

With this code I can get my order number…

set theContent to "<?xml version=\"1.0\"?>
<OrderNoticeDS>

  <OrderInfo>
    <ORDER_NUMBER>1081</ORDER_NUMBER>
   </OrderInfo>

</OrderNoticeDS>"

set theXML to parse XML theContent

set ORDER_NUMBER to getElementValue(getElementFromPath(theXML, {"OrderInfo", "ORDER_NUMBER"}))

Once I get to the part of my script where I need to loop through a section of XML, how is that going to work?

I am assuming a structure like

repeat
get value for element product
get value for element version
get value for element platform
get value for element os
write variables to file
end repeat

I just tested it quickly and I only get the value of the first occurrence of the element. Help appreciated.

Hi wplate,

The parsed XML is returned as records and lists. You get the fields of a record by using the field’s label, and the items of a list with ‘item’. I think System Events, as in Stefan’s post, uses the same method, but here’s how I did a step-by-step of your text:

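set theContent to "<?xml version=\"1.0\"?>
<OrderNoticeDS>
<OrderInfo>
<ORDER_NUMBER>1081</ORDER_NUMBER>
</OrderInfo>
<OrderInfo>
<ORDER_NUMBER>2081</ORDER_NUMBER>
</OrderInfo>
</OrderNoticeDS>"

set theXML to parse XML theContent
-- the children of the root element: one entry per OrderInfo
set OrderNotices to XML contents of theXML
set orderNumbers to {}
repeat with thisNotice in OrderNotices
	-- the children of this OrderInfo; the first one is ORDER_NUMBER
	set orderInfo to XML contents of thisNotice
	set numberInfo to item 1 of orderInfo
	-- the contents of ORDER_NUMBER is a list whose first item is the text
	set orderNumber to XML contents of numberInfo
	set end of orderNumbers to item 1 of orderNumber
end repeat
orderNumbers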

It’s a bit confusing, so looking at the result of each line makes it easier.

gl,

Applying Kel’s explanation to your data:


set XMLTxt to "<?xml version=\"1.0\"?>
<OrderInfo>
	<record>
    		<product>Pro Import FCP</product>
    		<version>2.02</version>
    		<platform>Macintosh (Intel)</platform>
    		<os>10.4.8</os>
	</record>
	<record>
    		<product>Pro Import AE</product>
    		<version>3.04</version>
    		<platform>Macintosh (PPC)</platform>
    		<os>10.4.6</os>
	</record>
</OrderInfo>"

set Prods to {}
set RecData to {}
set ProdData to {}
set theRecs to XML contents of (parse XML of XMLTxt)
repeat with aRec in theRecs
	set end of RecData to XML contents of aRec
end repeat
repeat with anItem in RecData
	set end of ProdData to items 1 thru 4 of anItem
end repeat
repeat with aProd in ProdData
	repeat with k from 1 to 4
		set end of Prods to (XML contents of item k of aProd) as text
	end repeat
end repeat
Prods --> {"Pro Import FCP", "2.02", "Macintosh (Intel)", "10.4.8", "Pro Import AE", "3.04", "Macintosh (PPC)", "10.4.6"}

It’s probably longer than it has to be, but this is my first try too. You can get what you want out of the Prods list in groups of four.
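
One way to pull the groups of four out of Prods and turn each into a tab-delimited line, as a sketch only. It assumes the list length is an exact multiple of four; fieldsPerRecord and outputLines are names invented for the example.

```applescript
-- The flat list produced above (copied here so the sketch runs on its own)
set Prods to {"Pro Import FCP", "2.02", "Macintosh (Intel)", "10.4.8", "Pro Import AE", "3.04", "Macintosh (PPC)", "10.4.6"}
set fieldsPerRecord to 4 -- product, version, platform, os

set outputLines to {}
set oldDelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to {tab}
-- step through the list four items at a time
repeat with k from 1 to (count Prods) by fieldsPerRecord
	-- coercing the four-item slice to text joins it with tabs
	set end of outputLines to ((items k thru (k + fieldsPerRecord - 1) of Prods) as text)
end repeat
set AppleScript's text item delimiters to oldDelims
outputLines
```

Each item of outputLines is then one line for the text file.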

Getting there, thank you! Kel, your post was very useful and I have been able to make good progress.

I’m now looping through my repeating line items and writing the data out to disk, though I’m getting wacky STXT stuff in my text file. See here: http://www.automaticduck.com/screenshots/Order.txt I’ve never seen this before and I’m searching Google for good ways to rid myself of it.

Here’s a snippet of my current setup (this is the script that generated the above text file):

set theContent to "
<?xml version=\"1.0\"?>
<OrderNoticeDS>
  <OrderInfo>
    <ORDER_NUMBER>1081</ORDER_NUMBER>
   </OrderInfo>
  <LineItem>
    <SKU_ID>FULLSuite</SKU_ID>
    <SKU_TITLE/>
    <SHORT_DESCRIPTION/>
    <QUANTITY>1</QUANTITY>
    <M_UNIT_PRICE>10</M_UNIT_PRICE>
  </LineItem>
  <LineItem>
    <SKU_ID>FCPSuite</SKU_ID>
    <SKU_TITLE/>
    <SHORT_DESCRIPTION/>
    <QUANTITY>1</QUANTITY>
    <M_UNIT_PRICE>7</M_UNIT_PRICE>
  </LineItem>
</OrderNoticeDS>"

set oldDelims to AppleScript's text item delimiters
set ptd to path to desktop as text
set headerLine to "SKU_ID	SKU_TITLE	SHORT_DESCRIPTION	QUANTITY	UNIT_PRICE"
set fileSpec to (ptd & "Order.txt")
set f to open for access file fileSpec with write permission
set eof f to 0
write (headerLine & return) to f

set AppleScript's text item delimiters to {tab}

set theXML to parse XML theContent with allowing leading whitespace and including empty elements
set OrderNotices to XML contents of theXML

repeat with thisNotice in OrderNotices
	-- check to see what XML section we're in
	set sectionXML to XML tag of thisNotice
	-- if we're in the Line Item section of the XML, cool, we don't want address stuff here.
	if sectionXML ≠ "OrderInfo" then
		set orderInfo to XML contents of thisNotice
		set numberInfo to item 1 of orderInfo
		set orderNumber to XML contents of numberInfo
		set SKU_ID to item 1 of orderNumber
		set numberInfo to item 2 of orderInfo
		set orderNumber to XML contents of numberInfo
		set SKU_TITLE to item 1 of orderNumber
		set numberInfo to item 3 of orderInfo
		set orderNumber to XML contents of numberInfo
		set SHORT_DESCRIPTION to item 1 of orderNumber
		set numberInfo to item 4 of orderInfo
		set orderNumber to XML contents of numberInfo
		set QUANTITY to item 1 of orderNumber
		set numberInfo to item 5 of orderInfo
		set orderNumber to XML contents of numberInfo
		set UNIT_PRICE to item 1 of orderNumber
		set theRecord to {SKU_ID, SKU_TITLE, SHORT_DESCRIPTION, QUANTITY, UNIT_PRICE}
		write (theRecord & return) to f
	end if
end repeat
close access f

set AppleScript's text item delimiters to oldDelims


Hi wplate,

The reason you’re getting the funny text is that you’re writing a list to the file. AppleScript’s read/write commands can write lists and records as well as text. You need to change your list into text delimited with tabs (or whatever), as you did in the header.

set theRecord to {SKU_ID, SKU_TITLE, SHORT_DESCRIPTION, QUANTITY, UNIT_PRICE} -- change this to tab delimited text

And see Adam’s transform on how to use repeat loops. You catch on quickly.

gl,

You did a good job.


set theContent to "
<?xml version=\"1.0\"?>
<OrderNoticeDS>
  <OrderInfo>
    <ORDER_NUMBER>1081</ORDER_NUMBER>
   </OrderInfo>
  <LineItem>
    <SKU_ID>FULLSuite</SKU_ID>
    <SKU_TITLE/>
    <SHORT_DESCRIPTION/>
    <QUANTITY>1</QUANTITY>
    <M_UNIT_PRICE>10</M_UNIT_PRICE>
  </LineItem>
  <LineItem>
    <SKU_ID>FCPSuite</SKU_ID>
    <SKU_TITLE/>
    <SHORT_DESCRIPTION/>
    <QUANTITY>1</QUANTITY>
    <M_UNIT_PRICE>7</M_UNIT_PRICE>
  </LineItem>
</OrderNoticeDS>"

set ptd to path to desktop as text
set headerLine to "SKU_ID	SKU_TITLE	SHORT_DESCRIPTION	QUANTITY	UNIT_PRICE"
set fileSpec to (ptd & "Order.txt")

set f to open for access file fileSpec with write permission
set eof f to 0
write (headerLine & return) to f

set theXML to parse XML theContent with allowing leading whitespace and including empty elements
set OrderNotices to XML contents of theXML
repeat with thisNotice in OrderNotices
	-- check to see what XML section we're in
	set sectionXML to XML tag of thisNotice
	-- if we're in the Line Item section of the XML, cool, we don't want address stuff here.
	if sectionXML ≠ "OrderInfo" then
		set orderInfo to XML contents of thisNotice
		set numberInfo to item 1 of orderInfo
		set orderNumber to XML contents of numberInfo
		set SKU_ID to item 1 of orderNumber
		set numberInfo to item 2 of orderInfo
		set orderNumber to XML contents of numberInfo
		set SKU_TITLE to item 1 of orderNumber
		set numberInfo to item 3 of orderInfo
		set orderNumber to XML contents of numberInfo
		set SHORT_DESCRIPTION to item 1 of orderNumber
		set numberInfo to item 4 of orderInfo
		set orderNumber to XML contents of numberInfo
		set QUANTITY to item 1 of orderNumber
		set numberInfo to item 5 of orderInfo
		set orderNumber to XML contents of numberInfo
		set UNIT_PRICE to item 1 of orderNumber
		set theRecord to {SKU_ID, SKU_TITLE, SHORT_DESCRIPTION, QUANTITY, UNIT_PRICE}
		
		-- keep the commands together if possible
		set oldDelims to AppleScript's text item delimiters
		set AppleScript's text item delimiters to {tab}
		set theRecord to theRecord as string
		set AppleScript's text item delimiters to oldDelims
		
		write (theRecord & return) to f
	end if
end repeat

close access f

Just changed some stuff. Now, you can shorten it up a bit, instead of setting all those variables, if you want to.

gl,

At some point this afternoon I decided to see if the fake testing environment I had set up was contributing at all to my frustrations with the styled text/Unicode/whatever.

So I moved my test stuff over to use Entourage as the source of the XML instead of keeping the XML stored in the test script and BOOM! Every time I parsed the XML passed from Entourage it (Entourage) would shut down. After editing my script down to its bare minimum and still having Entourage crash I decided to try another approach.

I set up a script in Entourage to write the XML to a file, then I wrote another script to read the file and do all the fancy work you nice people have been helping me with. It seems to be working great!

Thank you all for your help, this forum and its members are a wonderful resource.

Hi wplate,

Yeah, that’s what the STXT was. It’s nice when people do a lot of testing on their own.

gl,