Extracting links

How can I extract the hyperlinks and their text from this HTML, and then add http://www.cnn.com in front of each link?

Example of the result in the file:

4 Alabama football players arrested
http://www.cnn.com/2013/02/13/justice/alabama-football-player-arrests/index.html?hpt=hp_t2

Kids among dead in NATO airstrike
http://www.cnn.com/2013/02/13/world/asia/afghanistan-air-strike/index.html?hpt=hp_t2

HMS Bounty’s last moments revealed
http://www.cnn.com/2013/02/12/us/bounty-captain-widow/index.html?hpt=hp_t2

See the dog named Best in Show
http://www.cnn.com/video/?hpt=hp_t2#/video/us/2013/02/13/point-banana-joe-westminster.cnn

The exact script will depend on how you’re getting the HTML, but for what you’ve given:

set input to "<a href=\"/2013/02/13/justice/alabama-football-player-arrests/index.html?hpt=hp_t2\">4 Alabama football players arrested </a> <a href=\"/2013/02/13/world/asia/afghanistan-air-strike/index.html?hpt=hp_t2\">Kids among dead in NATO airstrike</a>  <a href=\"/2013/02/12/us/bounty-captain-widow/index.html?hpt=hp_t2\">HMS Bounty's last moments revealed</a> <a href=\"/video/?hpt=hp_t2#/video/us/2013/02/13/point-banana-joe-westminster.cnn\">See the dog named Best in Show</a>"

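-- grep isolates each <a href="/...">text fragment, one per line.
-- sed then prints the link text, the path prefixed with http://www.cnn.com, and a blank line.
do shell script "<<<" & quoted form of input & " grep -Eo '<a href=\"/[^>]+>[^<]+' | sed -E 's|<a href=\"([^\"]+)\">(.+)|\\2\\'$'\\n''http://www.cnn.com\\1\\'$'\\n''|'"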
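
If the HTML is in a file rather than an AppleScript string, the same grep/sed pipeline can be run directly in the shell. Here is a minimal sketch, assuming the page has been saved as page.html and the result should go to links.txt. Both file names and the function name extract_links are placeholders, not something from the original post. This version also trims trailing spaces from the link text.

```shell
#!/bin/sh
# extract_links is a hypothetical helper name, not part of the original script.
# It reads HTML on stdin and, for every <a href="/...">text</a> link, prints
# the link text, the path prefixed with http://www.cnn.com, and a blank line.
#
# grep -o emits one '<a href="/path">text' fragment per line. sed then swaps
# the two parts; a backslash followed by a newline inserts a line break in the
# replacement, which works in both BSD (macOS) and GNU sed.
extract_links() {
    grep -Eo '<a href="/[^"]*">[^<]+' |
    sed -E 's|^<a href="([^"]*)">(.*[^ ]) *$|\2\
http://www.cnn.com\1\
|'
}

# Example call (file names are assumptions):
# extract_links < page.html > links.txt
```

From AppleScript, the same pipeline could be passed to do shell script with the input file's quoted POSIX path in place of the here-string.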

Great, it worked perfectly, thanks!