Reading Data from a Web Page

I’m looking to write a utility that:

1. Reads the Window Title of the currently displayed web page
2. Reads the HTML code for a string (“analytics”)
3. Can read a W3C HTML validation results page for data (the error count total)
4. Can read a W3C CSS validation results page for data (the error count total)

Ideally I want to do this without seeing the W3C pages, i.e. a background query of some sort where AppleScript provides the URL, the validator replies to the script (not on-screen), and I parse the results for the info I need. But if that’s not possible, then I assume whatever technique I use for 1 and 2 can be used for 3 and 4 (parse the page for data once it’s displayed).

I’m going to guess I will be using “curl”, but I need a basic tutorial on its usage (how to query, how to parse results) as used from AppleScript, unless there are better ways to do this. I’m not expecting finished code, just some examples.

THANKS!

This is a simple example of getting web page source and extracting lines. It uses Safari to get the URL of the current page, and the name. But there are other ways to do that; this is the simplest.

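tell application "Safari"
	tell document 1
		set loc_raw to URL -- address of the front page
		set win to name -- window title
	end tell
end tell

-- quoted form protects characters such as ? and & from the shell
set loc to quoted form of loc_raw
set analytics_txt to quoted form of (do shell script "curl " & loc & " | grep 'analytics'")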

The reason I used quoted form of the result is so that you can use it in further do shell script parsing. Alternatively, you could keep adding pipes to the curl line. The quoting of 'analytics' is there for a similar reason: it is not needed in this case, but would be if there were a space in what you searched for. The same goes for the URL: on this page, and many others, there is a question mark in the URL, which would cause an error if it were passed to the shell unquoted.
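
To make the piping idea concrete, here is a minimal shell sketch of the kind of command line that could be handed to do shell script. The function name lines_matching and the sample HTML are made up for illustration; grep's -i (ignore case) and curl's -s (silent) are standard options. One detail worth knowing: grep exits with status 1 when nothing matches, and do shell script treats any non-zero exit status as an error, so the sketch swallows that case.

```shell
# Hypothetical helper: print the lines of stdin that contain a phrase,
# ignoring case. The phrase is passed as one quoted argument, so spaces are safe.
lines_matching() {
    # grep exits non-zero when nothing matches; "|| true" turns that into
    # success so a caller such as do shell script does not see a failure.
    grep -i -- "$1" || true
}

# Made-up sample standing in for a downloaded page.
sample='<html><head><title>Demo</title>
<script src="https://example.com/Analytics.js"></script>
</head></html>'

printf '%s\n' "$sample" | lines_matching 'analytics'

# Against a live page the same pipeline would look like this
# (single quotes keep the shell from interpreting ? and & in the URL):
#   curl -s 'http://example.com/page?id=1&x=2' | lines_matching 'analytics'
```

An empty result then simply means the phrase was not found on the page.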

Here’s one I use:

-- First, grab the html from the site chosen:
set HfxWeather to "http://www.findlocalweather.com/forecast/shearwater_ns_ca.html"
set ForCst to (do shell script "curl " & HfxWeather)

set TID to AppleScript's text item delimiters -- save previous value
-- Next, begin parsing out stuff from before and after each data point:

set text item delimiters to "width=\"55\" height=\"58\"><br><span class=\"copy\">"

-- Note that width="55" has become width=\"55\". That is required because in AppleScript double quotes have a meaning of their own: they delimit text. So we have to "escape" them, i.e. make the AppleScript compiler treat them as part of the text being assigned to the text item delimiters rather than as the end of it. We do this with backslashes, which tell the compiler to treat the following character as part of the text (if we actually wanted a backslash in a script we'd use \\, the first telling the compiler to take the second literally). We've done the same for every quote in every text item delimiter except the first and last, which are, of course, part of the AppleScript instruction itself.

set W1 to text item 2 of ForCst -- grab everything after the delimiter
set text item delimiters to "</span><br><span" -- set the other end of what we want.
set W2 to text item 1 of W1 -- Conditions; the first part of the last part.

-- Now Temperature:
set text item delimiters to "class=\"astro\">Temp:   "
-- only two quotes to escape
set W3 to text item 2 of ForCst -- everything after our delimiter
set text item delimiters to "</span></td>
               </tr>
               <tr>" -- the ending, including some returns
set W4 to text item 1 of W3 -- Temperature; the first of the last again. More about this later.

-- Next, Humidity
set text item delimiters to "Humidity:</span></td>
               <td align=\"right\"><span class=\"astroo1\">"
-- Again, our leading delimiter has a return in it. Leave it there. Copy it just as you find it, spaces and all.               
set W5 to text item 2 of ForCst -- get everything after it
set text item delimiters to "</span></td>" -- find the other bound
set W6 to text item 1 of W5 -- Humidity; take the first of the last

-- Then Wind Speed, which also has a return, some spaces and some escaping to do.
set text item delimiters to "Wind Speed:</span></td>
               <td align=\"right\"><span class=\"astroo1\">"
set W7 to text item 2 of ForCst
set text item delimiters to "</span></td>"
set W8 to text item 1 of W7 -- Wind Speed -- we've got them all now as W2, W4, W6, and W8.

-- when the Temperature is available, W4 looks like this: "22&deg;C", where the "&deg;"
-- is html speak for the "°" symbol which the web can't deal with directly, but AppleScript can.
-- It might be tempting to include that in the end delimiter for temperature, but if we do, 
-- then when the temperature is "N/A" without it, we won't have the correct delimiter. Guess
-- how I discovered that. We have to fix it when it occurs:

if W4 contains "&deg;C" then
	set AppleScript's text item delimiters to "&deg;C"
	set W9 to first text item of W4 -- the number
	set W4 to W9 & "°C" -- add our own degrees C
end if
set AppleScript's text item delimiters to TID -- back as they were at the beginning

-- Finally, we build our string for the dialog:
set tWeather to "Conditions at Shearwater:" & return & W2 & return & return & "Current Temperature:" & return & W4 & return & return & "Relative Humidity:" & return & W6 & return & return & "Wind Velocity:" & return & W8
-- and do it:
display dialog tWeather buttons {"OK"} default button 1
-- Done. If you run this, you'll see the conditions here where I'm living.
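
For comparison, the same "everything after the start marker, then everything before the end marker" idea can be sketched in plain shell with parameter expansion. This is only an illustration: the function name between and the HTML fragment are made up and are not the real markup of the weather page. Unlike the script above, the sketch fails outright when a marker is missing, which is one way to deal with a page whose layout has changed.

```shell
# Hypothetical helper: print the text found between the first occurrence of
# a start marker and the next occurrence of an end marker.
# Usage: between "$text" "$start" "$end"
between() {
    text=$1 start=$2 end=$3
    case $text in
        *"$start"*) ;;        # start marker present, carry on
        *) return 1 ;;        # no start marker: fail instead of guessing
    esac
    rest=${text#*"$start"}    # drop everything up to and including the start marker
    case $rest in
        *"$end"*) ;;          # end marker present after the start marker
        *) return 1 ;;
    esac
    value=${rest%%"$end"*}    # drop the end marker and everything after it
    printf '%s\n' "$value"
}

# Made-up fragment, not the real page markup.
html='<td><span class="astro">Temp: </span><span class="val">22&deg;C</span></td>'
between "$html" '<span class="val">' '</span>'    # prints 22&deg;C
```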

For me, the do shell script line that pipes curl into grep hangs the script. Remove the grep and it works fine.

I use Firefox as my master QA environment because of a number of plugins I have there, so gathering the current URL is more annoying (UI scripting). I chose to use Adam’s tip to get the window title as a test.

--initialize script
set apple_TID to AppleScript's text item delimiters -- save previous value

--get URL from Firefox
tell application "Firefox"
	activate
	tell application "System Events"
		tell process "Firefox"
			click menu item 3 of menu 1 of menu bar item 3 of menu bar 1 --Open Location (get to URL bar)
			delay 0.1
			click menu item 5 of menu 1 of menu bar item 4 of menu bar 1 --copy to clipboard
			delay 0.1
		end tell
	end tell
end tell

--store URL
set location_raw to the clipboard
set location_quoted to quoted form of location_raw

--get web code
set raw_html to quoted form of (do shell script "curl " & location_quoted)

--get window title
set text item delimiters to "<TITLE>"
set check_raw to text item 2 of raw_html
set text item delimiters to "</TITLE>"
set window_title to text item 1 of check_raw

--get Google Analytics


--get HTML validation

--get CSS validation

--gracefully close script
set AppleScript's text item delimiters to apple_TID
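
The title lookup could also be done on the shell side, before the HTML ever reaches AppleScript. A minimal sketch, assuming the page has a single title element; page_title is a made-up name and the sample page is invented.

```shell
# Hypothetical helper: read HTML on stdin and print the contents of <title>.
page_title() {
    # Join the page into one line so a title that wraps across lines still
    # matches, then capture the text between the opening and closing tags.
    # The bracket pairs match either letter case without relying on
    # sed extensions.
    tr '\n' ' ' |
        sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee][^>]*>\([^<]*\)<\/[Tt][Ii][Tt][Ll][Ee]>.*/\1/p'
}

# Invented sample page with an upper-case tag and a wrapped title.
printf '<html>\n<head><TITLE>Reading Data\nfrom a Web Page</TITLE></head>\n</html>\n' | page_title

# For a live page: curl -s "$url" | page_title
```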

Running into a number of problems:

–Adam’s trick only seems to work if the HTML code is rigid/repeatable and there are clear and unambiguous markers on either side of the data you want. Unfortunately, with W3C validation results (HTML and CSS) there is great variance in the code, and I have to use too long a code chunk to get something unique.

–The search for “analytics” suffers the same problem, because where in the code the keyphrase resides varies.

So what I really need is some way to GREP the code, which Fenton was shooting for, or to grep inside the variable storing the code. But then there is another problem:

–Is there a way to GREP for a phrase, then say “skip over x characters and then grab the next x characters as the result” (where “x characters” could also be a grep expression to capture variances in the result output)?

–Alternately, does anyone know if W3C validators (HTML and the CSS “Jigsaw”) have any sort of API to get results back as a list rather than displayed?
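
On the API question, one possibility worth checking: as far as I recall from the W3C validators' API documentation, both services can return their verdict in HTTP response headers (X-W3C-Validator-Status, X-W3C-Validator-Errors, X-W3C-Validator-Warnings). Treat the header names and the endpoint in the comment as assumptions to verify against the current documentation. The sketch below only shows how such a header could be pulled out of a response; validator_error_count is a made-up name and the sample response is invented.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the value
# of the X-W3C-Validator-Errors header. The header name is an assumption taken
# from memory of the W3C validator API docs; verify before relying on it.
validator_error_count() {
    tr -d '\r' |    # HTTP header lines end in CR LF; drop the CR
        awk -F': *' 'tolower($1) == "x-w3c-validator-errors" { print $2; exit }'
}

# Invented response headers standing in for a real reply.
printf 'HTTP/1.1 200 OK\r\nX-W3C-Validator-Status: Invalid\r\nX-W3C-Validator-Errors: 3\r\nX-W3C-Validator-Warnings: 1\r\n\r\n' |
    validator_error_count    # prints 3

# A live query might look like this (endpoint and parameter name are assumptions):
#   curl -s -G -o /dev/null -D - --data-urlencode "uri=$page" http://validator.w3.org/check |
#       validator_error_count
```

If the headers are there, nothing has to be displayed or scraped: curl's -D - option writes the response headers to standard output and -o /dev/null discards the page body.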

In case you’re wondering what I’m up to: I’m trying to write a script that displays a dialog with a bunch of common things I have to check on web sites (Quality Assurance). Rather than going through multiple, tedious steps, I want a script that shows everything in one dialog at the push of a button (mapping the script to my X-keys keypad).

Just a bump since I didn’t see any replies.

So no ideas on technique?

I’m left with only one: dump the curl results to a text file and then use TextEdit and GREP on that, as a sort of temp file. It just seems kinda kludgy.
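
A temp file should not be necessary: text already held in a variable can be piped straight into grep or sed. The sketch below also shows one way to do "find a phrase, skip a variable amount of text, then grab what follows", using a sed capture group. The sample text and the phrase "Errors found" are invented stand-ins, not the actual wording of a W3C results page, and number_after is a made-up name.

```shell
# Hypothetical helper: read text on stdin, find the first line containing a
# phrase, skip any non-digits after it, and print the number that follows.
# The phrase is used as a basic regular expression and must not contain "/".
number_after() {
    sed -n "s/.*$1[^0-9]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Invented sample; a real results page will be worded differently.
page='<p>Result: failed</p>
<p>Errors found: 12, warnings: 4</p>'

# Pipe the variable in directly; no file on disk is involved.
printf '%s\n' "$page" | number_after 'Errors found'    # prints 12
```

From AppleScript, the stored HTML can be spliced into such a command line with quoted form of, the same way the URL is passed to curl.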

Thanks so much for this, guys!
You helped me a lot to parse a simple number from a webpage: a consignment number from a shipping company.

tell application "Safari"
	activate
	set WebPage to the source of document 1
	set AppleScript's text item delimiters to "<b>Consignment Number</b></td><td bgcolor=\"#ffffff\">"
	set W1 to text item 2 of WebPage -- grab everything after the delimiter
	set AppleScript's text item delimiters to "</td></tr><tr><td nowrap valign=\"top\">" -- set the other end of what we want.
	set ConsignmentNumber to text item 1 of W1 -- select the first part of the delimited string
	display dialog ConsignmentNumber
end tell

Cheers, Nik