Parsing an HTML page

After I grab an HTML page:

set sourceText to do shell script "curl http://www.cnn.com"

I would like to get only the HTML between two given tags and store it in a variable.

Then I would like to get each URL and TITLE from this variable and save them to a file.


Example: every time there is a link like this:

<a href="http://cnn.com/test/testing/index.html">this is a news title</a>

I would like to create a text file like this:

this is a news title

cnn.com/test/testing/index.html

One possible way (without any HTML or XML tools) is to GREP through the text.

Here is a very basic example that returns a list of link contents. The GREP pattern could be better :wink:

set sourceText to do shell script "curl 'http://edition.cnn.com/'"

grep_find_all(sourceText, "<a href=\"[^>]+>([^<]+)</a")
on grep_find_all(strg, pattern)
	(*
strg is the string to search in, pattern is the search pattern.
A list is returned, or a list of lists if the pattern
contains more than one parenthesized group.
*)
	set python_script to quoted form of ¬
		"import re, sys, plistlib
(s, p, c) = [x.decode('utf_8') for x in sys.argv[1:4]]
options = re.UNICODE | re.MULTILINE | re.IGNORECASE * int( c )
print plistlib.writePlistToString(re.findall ( p, s, options ))"
	set pattern to quoted form of pattern
	set strg to quoted form of strg
	set ignore_case to ("a" = "A") as integer -- 1 when AppleScript is currently ignoring case
	set cmd to "python -c " & python_script & space & strg & space & pattern & space & ignore_case
	try
		set pl_strg to do shell script cmd
		-- let System Events parse the plist XML into an AppleScript list
		tell application "System Events"
			set pl to (make new property list item with properties {text:pl_strg})
			set r to value of pl
		end tell
	on error the_error number the_number
		-- remove following comment marks if errors are needed
		--error the_error number the_number
		set r to {}
	end try
	return r
end grep_find_all
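
To also get the URLs, the same handler can be called with a pattern that has two capture groups; each match then comes back as a sublist of {href, anchor text}. A rough sketch that writes title/URL pairs to a text file (the output path is just an example):

set sourceText to do shell script "curl 'http://edition.cnn.com/'"
-- two capture groups: item 1 = the href, item 2 = the anchor text
set link_pairs to grep_find_all(sourceText, "<a href=\"([^\"]+)\"[^>]*>([^<]+)</a")
set out_path to (POSIX path of (path to desktop)) & "cnn_links.txt" -- example path
set fd to open for access POSIX file out_path with write permission
set eof fd to 0 -- start with an empty file
repeat with pair in link_pairs
	write (item 2 of pair) & linefeed & (item 1 of pair) & linefeed & linefeed to fd as «class utf8»
end repeat
close access fd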

Jürgen

Thank you!

How do I extract from the HTML page the string contained between

<ul class="cnn_bulletbin"> and </ul> ?
The pattern

<ul class="cnn_bulletbin">(.+?)</ul>

will find everything between the list tags.

Something like

set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
set pattern to "<a href=\".+?>(.+?)</a>"

repeat with J from 1 to count list_items
	set curr to item J of list_items
	set temp to grep_find_all(curr, pattern)
	set item J of list_items to temp
end repeat

creates a list of lists with all the anchor texts of the links, one sublist per bullet list. (I used non-greedy patterns here as an alternative.) In some cases the anchors on the CNN page contain span tags, which might require some more postprocessing.
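
Stripping those inner tags afterwards can be done with one more shell call; a rough sketch using sed (assuming simple, non-nested tags; the span class is just an invented example):

-- remove remaining inline tags (e.g. <span ...>) from one anchor text
set anchor_text to "<span class=\"example_class\">this is a news title</span>"
set clean_text to do shell script "echo " & quoted form of anchor_text & " | sed -E 's/<[^>]+>//g'"
-- clean_text is now "this is a news title"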

Documentation for the GREP syntax used here is at

http://docs.python.org/2/howto/regex.html#regex-howto

Good luck.

Jürgen


set sourceText to do shell script "curl 'http://www.cnn.com/'"

set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
set pattern to "<a href=\".+?>(.+?)</a>"

repeat with J from 1 to count list_items
	set curr to item J of list_items
	set temp to grep_find_all(curr, pattern)
	set item J of list_items to temp
end repeat


When I run it, I get:

error “«script» doesn’t understand the grep_find_all message.” number -1708 from «script»

I wanted to keep this page short.

Of course, you need to copy/paste the handler "grep_find_all" from my first reply into the script.
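
Roughly, the combined script looks like this (a structural sketch; it only runs once the full handler body is pasted in):

set sourceText to do shell script "curl 'http://www.cnn.com/'"
set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
-- ... the rest of the main code from above ...

on grep_find_all(strg, pattern)
	-- paste the full handler body from my first reply here
end grep_find_all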

Regards, Jürgen