Parsing an HTML page

After I grab an HTML page:

set sourceText to do shell script "curl http://www.cnn.com"

I would like to get only the HTML between two given tags and store it in a variable.

Then I would like to get each URL and TITLE from this variable and save them to a file.


Example: every time there is a link like this:

<a href="http://cnn.com/test/testing/index.html">this is a news title</a>

I would like to create a text file like this:

this is a news title

cnn.com/test/testing/index.html

One possible way (without any HTML or XML tools) is to GREP through the text.

Here is a very basic example that returns a list of link contents. The GREP pattern could be better :wink:

set sourceText to do shell script "curl 'http://edition.cnn.com/'"

grep_find_all(sourceText, "<a href=\"[^>]+>([^<]+)</a")
on grep_find_all(strg, pattern)
	(*
strg is the string to search in, pattern is the search pattern.
A list is returned, or a list of lists if the pattern
contains more than one parenthesized group.
*)
	set python_script to quoted form of ¬
		"import re, sys, plistlib
(s, p, c) = [x.decode('utf_8') for x in sys.argv[1:4]]
options = re.UNICODE | re.MULTILINE | re.IGNORECASE * int( c )
print plistlib.writePlistToString(re.findall ( p, s, options ))"
	set pattern to quoted form of pattern
	set strg to quoted form of strg
	set ignore_case to ("a" = "A") as integer -- 1 when AppleScript is currently ignoring case
	set cmd to "python -c " & python_script & space & strg & space & pattern & space & ignore_case
	try
		set pl_strg to do shell script cmd
		-- let System Events parse the plist XML into an AppleScript list
		tell application "System Events"
			set pl to (make new property list item with properties {text:pl_strg})
			set r to value of pl
		end tell
	on error the_error number the_number
		-- remove following comment marks if errors are needed
		--error the_error number the_number
		set r to {}
	end try
	return r
end grep_find_all
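
To also get the URLs, the same handler can be called with a pattern that has two capture groups; each match then comes back as a sublist of {href, anchor text}. A rough sketch that writes title/URL pairs to a text file (the output path is just an example):

set sourceText to do shell script "curl 'http://edition.cnn.com/'"
-- two capture groups: item 1 = the href, item 2 = the anchor text
set link_pairs to grep_find_all(sourceText, "<a href=\"([^\"]+)\"[^>]*>([^<]+)</a")
set out_path to (POSIX path of (path to desktop)) & "cnn_links.txt" -- example path
set fd to open for access POSIX file out_path with write permission
set eof fd to 0 -- start with an empty file
repeat with pair in link_pairs
	write (item 2 of pair) & linefeed & (item 1 of pair) & linefeed & linefeed to fd as «class utf8»
end repeat
close access fd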

Jürgen

Thank you!

How do I extract from the HTML page the string contained between

<ul class="cnn_bulletbin"> and </ul> ?
The pattern

<ul class="cnn_bulletbin">(.+?)</ul>

will find everything between the list tags.

Something like

set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
set pattern to "<a href=\".+?>(.+?)</a>"

repeat with J from 1 to count list_items
	set curr to item J of list_items
	set temp to grep_find_all(curr, pattern)
	set item J of list_items to temp
end repeat

creates a list of lists with all the anchor texts of the links, one sublist per bullet list. (I used non-greedy patterns here as an alternative.) In some cases the anchors on the CNN page contain span tags, which might require some more postprocessing.
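
Stripping those inner tags afterwards can be done with one more shell call; a rough sketch using sed (assuming simple, non-nested tags; the span class is just an invented example):

-- remove remaining inline tags (e.g. <span ...>) from one anchor text
set anchor_text to "<span class=\"example_class\">this is a news title</span>"
set clean_text to do shell script "echo " & quoted form of anchor_text & " | sed -E 's/<[^>]+>//g'"
-- clean_text is now "this is a news title"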

Documentation for the GREP syntax used here is at

http://docs.python.org/2/howto/regex.html#regex-howto

Good luck.

Jürgen


set sourceText to do shell script "curl 'http://www.cnn.com/'"

set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
set pattern to "<a href=\".+?>(.+?)</a>"

repeat with J from 1 to count list_items
	set curr to item J of list_items
	set temp to grep_find_all(curr, pattern)
	set item J of list_items to temp
end repeat


When I run it, I get:

error “«script» doesn’t understand the grep_find_all message.” number -1708 from «script»

I wanted to keep this page short.

Of course, you need to copy/paste the handler "grep_find_all" from my first reply into the script.
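
Roughly, the combined script looks like this (a structural sketch; it only runs once the full handler body is pasted in):

set sourceText to do shell script "curl 'http://www.cnn.com/'"
set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
-- ... the rest of the main code from above ...

on grep_find_all(strg, pattern)
	-- paste the full handler body from my first reply here
end grep_find_all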

Regards, Jürgen