One possible way (without any HTML or XML tools): GREP through the text.
Here is a very basic example that returns a list of link contents. The GREP pattern could be better.
set sourceText to do shell script "curl 'http://edition.cnn.com/'"
grep_find_all(sourceText, "<a href=\"[^>]+>([^<]+)</a")
on grep_find_all(strg, pattern)
(*
strg is the string to search in, pattern is the search pattern.
A list or a list of lists is returned, depending
on parenthesis pairs in pattern
*)
set python_script to quoted form of ¬
"import re, sys, plistlib
(s, p, c) = [x.decode('utf_8') for x in sys.argv[1:4]]
options = re.UNICODE | re.MULTILINE | re.IGNORECASE * int( c )
print plistlib.writePlistToString(re.findall ( p, s, options ))"
set pattern to quoted form of pattern
set strg to quoted form of strg
set ignore_case to ("a" = "A") as integer
set cmd to "python -c " & python_script & space & strg & space & pattern & space & ignore_case
try
set pl_strg to do shell script cmd
tell application "System Events"
set pl to (make new property list item with properties {text:pl_strg})
set r to value of pl
end tell
on error the_error number the_number
-- uncomment the following line to pass errors on instead of returning an empty list
--error the_error number the_number
set r to {}
end try
return r
end grep_find_all
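To illustrate the "list or list of lists" remark in the handler's comment, here is a minimal Python 3 sketch of how re.findall behaves on its own, outside AppleScript. The sample HTML string and the variable names are made up for illustration; only the first pattern is taken from the example above.

```python
import re

# Hypothetical sample input, not taken from the CNN page.
html = '<a href="/a">First</a> <a href="/b">Second</a>'

# No parenthesis pair: findall returns the whole matches as a flat list.
no_group = re.findall(r'<a [^>]+>', html)

# One parenthesis pair: a flat list with the contents of that group.
one_group = re.findall(r'<a href="[^>]+>([^<]+)</a', html)

# Two parenthesis pairs: a list of tuples, one tuple per match.
two_groups = re.findall(r'<a href="([^"]+)"[^>]*>([^<]+)</a', html)

print(no_group)
print(one_group)
print(two_groups)
```

In the handler, plistlib serializes the tuples as arrays, so on the AppleScript side they should arrive as a list of lists.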
The pattern
“<ul class="cnn_bulletbin">(.+?)</ul>”
will find everything between the list tags.
Something like
set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
set pattern to "<a href=\".+?>(.+?)</a>"
repeat with J from 1 to count list_items
set curr to item J of list_items
set temp to grep_find_all(curr, pattern)
set item J of list_items to temp
end repeat
creates a list of lists with all the text anchors of the links, one sublist per list. (I used non-greedy patterns here as an alternative.) In some cases, the anchors on the CNN page contain span tags, which might require some more postprocessing.
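The same two-step extraction, including the span postprocessing, could be sketched in plain Python 3 roughly like this. The function name anchors_per_list and the sample markup are hypothetical. Note one difference from the handler: this sketch passes re.DOTALL so that "." also matches newlines (re.MULTILINE alone does not do that), on the assumption that a list may span several lines in the page source.

```python
import re

def anchors_per_list(html, list_class="cnn_bulletbin"):
    """Return one list of anchor texts per <ul class=list_class> block."""
    # Step 1: capture the body of each list (non-greedy, across newlines).
    list_pattern = re.compile(
        r'<ul class="%s">(.+?)</ul>' % re.escape(list_class),
        re.DOTALL | re.IGNORECASE)
    # Step 2: capture the content of each anchor inside a list body.
    anchor_pattern = re.compile(r'<a href=".+?>(.+?)</a>',
                                re.DOTALL | re.IGNORECASE)
    # Postprocessing: drop any leftover tags such as <span> inside an anchor.
    tag_pattern = re.compile(r'<[^>]+>')

    result = []
    for body in list_pattern.findall(html):
        texts = [tag_pattern.sub('', a).strip()
                 for a in anchor_pattern.findall(body)]
        result.append(texts)
    return result

# Made-up sample markup for illustration only.
sample = ('<ul class="cnn_bulletbin">\n'
          '<li><a href="/x"><span>One</span></a></li>\n'
          '<li><a href="/y">Two</a></li>\n'
          '</ul>'
          '<ul class="cnn_bulletbin"><li><a href="/z">Three</a></li></ul>')
print(anchors_per_list(sample))
```

Stripping tags with a regex is crude and will not cope with every page; it is only meant to show the kind of cleanup the span tags call for.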
set sourceText to do shell script "curl 'http://www.cnn.com/'"
set pattern to "<ul class=\"cnn_bulletbin\">(.+?)</ul>"
set list_items to grep_find_all(sourceText, pattern)
set pattern to "<a href=\".+?>(.+?)</a>"
repeat with J from 1 to count list_items
set curr to item J of list_items
set temp to grep_find_all(curr, pattern)
set item J of list_items to temp
end repeat
When I run it, I get:
error “«script» doesn’t understand the grep_find_all message.” number -1708 from «script»