Please Help Me with HTML Parsing

Hi,

I'm new to Macs, AppleScript and programming, and I am trying to build an app that loads an HTML file and extracts information from it so that I can recompile it into iCal events.

This is a sample of the really poorly produced HTML file:

Jan04
Fri
Jan03
Thu
Jan02
Wed
Jan01
Tue
AS
145
384

Ideally, I would like to split the dates into strings and subsequent information into separate strings so I can call up Item 1 of List A and collate it with Item 1 of List B and add them to an event in iCal for January 1st, for example.

I have this so far, but I can't get past an error message which states: "Can't make {"Jan01", "Jan02", "Jan03"…} into type string."


property serialBeginning : "<DIV STYLE=\"top:93px;"
property serialEnd : ""

-- Read a chosen file and prepare to search for serialBeginning

set theContents to read (choose file with prompt "Please choose your latest AIMS Roster")
set originalDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to "width:88px\" Class=\"S5\">"

-- Split the file into a list of strings that start with serialBeginning
-- Ignore the first item, which is just the text before the occurrence

set theItems to text items 2 thru 32 of theContents

-- Iterate through the items to look for serial day termination strings

set serialArrayDate to {} -- this will store the data

set AppleScript's text item delimiters to {serialEnd}

repeat with nextItem in theItems

-- Add text before serialEnd to serialArray

set serialArrayDate to serialArrayDate & first text item of nextItem

end repeat

set AppleScript's text item delimiters to originalDelimiters


I’d be very grateful if any of you are able to point me in the right direction! Maybe a bit ambitious for a newb!!!

There are probably a number of ways to do it. Here is a quick (commented) sample of one of them:

set MonthList to {{"Jan", "January"}, {"Feb", "February"}}
set myHTML to "<DIV STYLE=\"top:93px; left:145px; width:43px; height:34px\" Class=\"S7\"></DIV>
<DIV STYLE=\"top:93px; left:148px; width:88px\" Class=\"S5\">Jan04</DIV>
<DIV STYLE=\"top:107px; left:156px; width:52px\" Class=\"S5\">Fri</DIV>
<DIV STYLE=\"top:93px; left:102px; width:43px; height:34px\" Class=\"S7\"></DIV>
<DIV STYLE=\"top:93px; left:106px; width:88px\" Class=\"S5\">Jan03</DIV>
<DIV STYLE=\"top:107px; left:112px; width:52px\" Class=\"S5\">Thu</DIV>
<DIV STYLE=\"top:93px; left:59px; width:43px; height:34px\" Class=\"S7\"></DIV>
<DIV STYLE=\"top:93px; left:62px; width:88px\" Class=\"S5\">Jan02</DIV>
<DIV STYLE=\"top:107px; left:70px; width:52px\" Class=\"S5\">Wed</DIV>
<DIV STYLE=\"top:93px; left:16px; width:43px; height:34px\" Class=\"S7\"></DIV>
<DIV STYLE=\"top:93px; left:20px; width:88px\" Class=\"S5\">Jan01</DIV>
<DIV STYLE=\"top:107px; left:26px; width:52px\" Class=\"S5\">Tue</DIV>
<DIV STYLE=\"top:127px; left:15px; width:1108px; height:14px\" Class=\"S1\"></DIV>
<DIV STYLE=\"top:127px; left:16px; width:43px; height:15px\" Class=\"S6\"></DIV>
<DIV STYLE=\"top:127px; left:30px; width:35px\" Class=\"S8\">AS</DIV>
<DIV STYLE=\"top:127px; left:59px; width:43px; height:15px\" Class=\"S6\"></DIV>
<DIV STYLE=\"top:127px; left:70px; width:52px\" Class=\"S8\">145</DIV>
<DIV STYLE=\"top:127px; left:102px; width:43px; height:15px\" Class=\"S6\"></DIV>
<DIV STYLE=\"top:127px; left:112px; width:52px\" Class=\"S8\">384</DIV>
<DIV STYLE=\"top:127px; left:145px; width:43px; height:15px\" Class=\"S6\"></DIV>"
set OldDelimiter to AppleScript's text item delimiters
set AppleScript's text item delimiters to " Class=\"S5\">"
set ParsedText to every text item of myHTML --converts the string into a list of text, split at each occurrence of Class="S5">
set ParsedText to items 2 through -1 of ParsedText --the first text item will be everything before the first Class="S5">, so we remove it
set AppleScript's text item delimiters to "</DIV>"

repeat with i from 1 to count of ParsedText
	--here we eliminate everything after the first </DIV>, which should leave us with a list of dates and days {"Jan04", "Fri", "Jan03", "Thu", "Jan02", "Wed", "Jan01", "Tue"}
	set item i of ParsedText to first text item of item i of ParsedText
end repeat

set DateList to {}
repeat with i from 1 to count of ParsedText by 2 --stepping by 2 because every second entry is the day of the week
	--split the date into a list of the month and day, e.g. {"Jan", "01"}
	set TheDate to {text 1 thru 3 of item i of ParsedText, text 4 thru 5 of item i of ParsedText}
	set TheDay to item (i + 1) of ParsedText --if you need to do something with the day of the week use this variable
	--use MonthList to swap the abbreviation for the full month name (j, not i, so the outer counter is not reused)
	repeat with j from 1 to count of MonthList
		if item 1 of TheDate is item 1 of item j of MonthList then set item 1 of TheDate to item 2 of item j of MonthList
	end repeat
	--remove the leading "0" from days less than 10
	if item 2 of TheDate starts with "0" then set item 2 of TheDate to character 2 of item 2 of TheDate
	--combine the two list items into one string and add it once per date to DateList; the " " delimiter puts a space between them
	set AppleScript's text item delimiters to " "
	copy (TheDate as string) to end of DateList
end repeat

set AppleScript's text item delimiters to OldDelimiter --reset the delimiter, a good practice to get into.
return DateList
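
The same split-and-trim approach should also work for the second row of the roster (the Class="S8" entries), which gives you your "List B". Here is a rough sketch that wraps the two delimiter steps in a handler so it can be reused for either class. The handler name extractValues, its parameters and the variable names are just ones I made up for illustration, and the sample string is a cut-down piece of the HTML above:

```applescript
-- Hypothetical helper: returns the text inside every DIV that follows the given class marker
on extractValues(theHTML, classMarker)
	set oldDelimiters to AppleScript's text item delimiters
	-- split the HTML at every occurrence of the marker
	set AppleScript's text item delimiters to classMarker
	set theParts to text items of theHTML
	set theValues to {}
	-- within each piece, keep only the text before the first </DIV>
	set AppleScript's text item delimiters to "</DIV>"
	-- start at 2 because item 1 is everything before the first marker
	repeat with k from 2 to count of theParts
		set end of theValues to first text item of (item k of theParts)
	end repeat
	set AppleScript's text item delimiters to oldDelimiters
	return theValues
end extractValues

set sampleHTML to "<DIV STYLE=\"top:93px; left:20px; width:88px\" Class=\"S5\">Jan01</DIV>
<DIV STYLE=\"top:107px; left:26px; width:52px\" Class=\"S5\">Tue</DIV>
<DIV STYLE=\"top:127px; left:30px; width:35px\" Class=\"S8\">AS</DIV>"

set dayInfo to extractValues(sampleHTML, " Class=\"S5\">") -- dates and weekdays
set dutyInfo to extractValues(sampleHTML, " Class=\"S8\">") -- the roster entries
```

One thing to watch for: in your sample the dates run from right to left (Jan04 comes first in the file) while the S8 entries run from left to right (AS at left:30px comes first), so you would probably need to flip one of the lists, for example with "reverse of", before matching item 1 with item 1. I am only going by the left: values in the sample here, so check it against a full roster.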

Hope that this helps.