parsing HTML

OK, I have been reading through the shell command posts trying to figure this out. I have a web page with some information on it that I want to pull out and place into an Excel document. I can get the source easily enough with Safari, though curl is giving me errors (possibly due to the dynamic page, or my ignorance of how to use it). I think the best way to get this information would be to use shell scripts rather than parsing it with AppleScript, but I don’t know how to do it, and most of the examples I have found don’t explain what the shell scripts are doing, so I can’t adapt them. So if I have the following HTML:

What I want to do is get the “Boy (8-9) lying on grass next to lamp” text into a variable so that I can put it into Excel. Any ideas?

I kind of gave up after too many varied responses and came up with the following…

This gets a page from Yahoo! Yellow Pages and parses some data out of it for making a mailing list. I have to manually flip to each page, but once it’s up on screen, this can parse it down to something usable. I took advantage of BBEdit’s GREP-based searching. Not super-elegant, but functional, and I was in a hurry. Never got any better feedback either.

Problem is, it’s not just HTML I’m weeding out, but all kinds of other code elements, since I knew exactly what on the page I wanted and it was repeatable. For your purposes, pay special attention to the GREP string I used in BBEdit to strip the markup: <[^<>]*>. It seemed to get rid of most/all of the XHTML, but it still leaves quite a mess behind depending on the web page.

tell application "Safari"
	set pageHTML to source of document 1
end tell

tell application "BBEdit"
	activate
	make new text window with properties {contents:pageHTML}
	
	--strip HTML Markup
	replace "<[^<>]*>" using "" searching in text 1 of window 1 options {search mode:grep, starting at top:true}
	
	--fix carriage returns
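	--(assumption: the search and replace strings were \r and \n before the forum rendered them as literal line breaks)
	replace "\\r" using "\\n" searching in text 1 of window 1 options {search mode:grep, starting at top:true}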
	
	--fix special characters (assumes the search string was the &nbsp; entity before the forum rendered it as a space)
	replace "&nbsp;" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	
	--remove non-data top and bottom stuff
	find "Miles**" searching in text 1 of window 1 options {search mode:literal, starting at top:true} with selecting match
	set selectionOffset to (characterOffset of selection) + 7
	select characters 1 thru selectionOffset of text window 1
	delete selection
	
	find "** Distances" searching in text 1 of window 1 options {search mode:literal, starting at top:true} with selecting match
	set selectionOffset to (characterOffset of selection) - 12
	select characters selectionOffset thru -1 of text window 1
	delete selection
	
	--strip superfluous space, tabs, and other artifacts and useless data to standardize data "records"
	replace "Map" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace "See reviews on Local
" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace "Web Site" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace "  " using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace "	" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace " 
" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace "^\\(.*
" using "" searching in text 1 of window 1 options {search mode:grep, starting at top:true}
	replace "




" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace ", CA" using "" searching in text 1 of window 1 options {search mode:literal, starting at top:true}
	replace "

[0-9]*\\.[0-9]*

" using "" searching in text 1 of window 1 options {search mode:grep, starting at top:true}
end tell

tell application "BBEdit"
	activate
	
	--get addresses from file
	repeat with addressNumber from 1 to 20
		tell text window 1
			set addressBlockLineStart to ((addressNumber - 1) * 7) - (addressNumber - 2)
			
			select insertion point before line addressBlockLineStart
			set lineBeginOffset to characterOffset of selection
			select insertion point after line addressBlockLineStart
			set lineEndOffset to (characterOffset of selection) - 1
			set companyName to (characters lineBeginOffset thru lineEndOffset) as text
			
			select insertion point before line (addressBlockLineStart + 1)
			set lineBeginOffset to characterOffset of selection
			select insertion point after line (addressBlockLineStart + 1)
			set lineEndOffset to (characterOffset of selection) - 1
			set companyAddress to (characters lineBeginOffset thru lineEndOffset) as text
			
			select insertion point before line (addressBlockLineStart + 2)
			set lineBeginOffset to characterOffset of selection
			select insertion point after line (addressBlockLineStart + 2)
			set lineEndOffset to (characterOffset of selection) - 1
			set companyCity to (characters lineBeginOffset thru lineEndOffset) as text
		end tell
		
		tell application "FileMaker Pro 8"
			activate
			open file "OSXT:Users:kquosig:Desktop:Print Shops, 070601.fp7"
			
			--check for duplicate address
			if companyAddress is not "" and companyAddress is not "
" then
				set cell "g_Address_Search" of current record to "\"" & companyAddress & "\""
				do script "Duplicate Pre-Check"
				
				if cell "g_IsDupe" of current record is "no" then
					set newRecord to create new record
					
					set cell "Company" of newRecord to companyName
					set cell "Street Address" of newRecord to companyAddress
					set cell "City" of newRecord to companyCity
				end if
			end if
		end tell
		
	end repeat
end tell

tell application "BBEdit"
	close window 1 without saving
end tell

tell application "FileMaker Pro 8"
	close document 1
end tell
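
For anyone without BBEdit, the same tag-stripping pattern can be tried from plain AppleScript through do shell script. This is only a rough sketch, not part of the script above: the handler name stripTags and the sample string are made up for illustration, and it assumes the stock sed that ships with Mac OS X. Since sed works line by line, a tag that wraps across two lines will be missed, and a very large page may be too long to pass on the command line.

```applescript
-- Hypothetical helper: remove anything that looks like <...> from a string,
-- using the same pattern as the BBEdit GREP search (<[^<>]*>).
on stripTags(htmlText)
	-- "quoted form of" protects the HTML from being interpreted by the shell
	set shellCommand to "printf '%s' " & quoted form of htmlText & " | sed -E 's/<[^<>]*>//g'"
	return do shell script shellCommand
end stripTags

-- Sample input invented for illustration
set plainText to stripTags("<p>Boy (8-9) lying on grass next to <b>lamp</b></p>")
```

As with the BBEdit version, this only removes the tags themselves; entities and leftover whitespace still need their own cleanup passes.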

Thanks for the reply, Kevin. I played around with GREP, found some JavaScript that sort of worked, and spent too much time looking for solutions. Finally I found that the best way to get the information out of the web sites was using AppleScript’s text item delimiters (a bit convoluted to get just what I want without the extra HTML, but it works). I was actually surprised at how fast this worked. I might be able to do the same thing with GREP or other UNIX commands, and it might be faster, but this works and is a lot faster than cutting and pasting the text out of Safari.
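
For anyone landing here later, the text item delimiters approach looks roughly like the following. It is a sketch only: the handler name textBetween and the sample markup in the usage lines are invented for illustration, since the real page source isn't shown in this thread.

```applescript
-- Hypothetical helper: return the text between the first startMarker
-- and the next endMarker, or "" if the start marker is not found.
on textBetween(sourceText, startMarker, endMarker)
	set savedDelimiters to AppleScript's text item delimiters
	set foundText to ""
	try
		-- split on the start marker; everything after the first piece follows it
		set AppleScript's text item delimiters to startMarker
		set pieces to text items of sourceText
		if (count of pieces) > 1 then
			-- rejoin the remaining pieces (the join reuses startMarker as the delimiter)
			set remainder to (items 2 thru -1 of pieces) as text
			-- split on the end marker and keep what comes before it
			set AppleScript's text item delimiters to endMarker
			set foundText to text item 1 of remainder
		end if
	end try
	-- always restore the delimiters so later code is not affected
	set AppleScript's text item delimiters to savedDelimiters
	return foundText
end textBetween

-- Sample markup invented for illustration
set sampleHTML to "<td class=\"caption\">Boy (8-9) lying on grass next to lamp</td>"
set captionText to textBetween(sampleHTML, "<td class=\"caption\">", "</td>")
```

In practice sourceText would be the source of document 1 from Safari, and the markers would be whatever fixed HTML sits immediately before and after the text you want.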