Please help: extracting URLs and links from HTML and/or text files

Moosh · January 30, 2011, 9:07am

Hi all

I am data mining some websites and need some help. I’m a noob at Automator and AppleScript.

I ripped about 40,000 pages via cURL into multiple HTML files (averaging about 50 MB each).

First, I am extracting email addresses, which is easy enough in Automator using this guide:
http://bit.ly/4UatTu

Next, I want to extract all the links: not just the ones that start with http://, but also those that start with www., and so on.
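
To make the goal concrete, here is a rough sketch of this kind of extraction in Python. It is not a tested solution for these particular files. The pattern (anything starting with http://, https://, or www. up to the next whitespace, quote, or angle bracket) and the function names `extract_links` and `extract_links_from_file` are assumptions made for illustration, and a regular expression like this will miss relative links and can pick up malformed ones.

```python
import re

# Assumed pattern: a link starts with http://, https://, or www. and runs
# until the next whitespace, quote, or angle bracket.
URL_RE = re.compile(r"""(?:https?://|www\.)[^\s"'<>]+""", re.IGNORECASE)


def extract_links(text):
    """Return the unique links found in text, in order of first appearance."""
    seen = set()
    links = []
    for match in URL_RE.finditer(text):
        # Drop punctuation that commonly trails a link in running text.
        url = match.group(0).rstrip(".,;:)")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def extract_links_from_file(path):
    """Scan a file line by line so the whole file is never held as one string."""
    seen = {}  # dict keeps insertion order, so it doubles as an ordered set
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            for url in extract_links(line):
                seen.setdefault(url, None)
    return list(seen)
```

Calling `extract_links_from_file` on one of the saved HTML files would return a list of links that could then be written out one per line. Reading line by line is a deliberate choice here, given that the files average about 50 MB each.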

I’ve tried all kinds of things, searched the web high and low, and can’t really get anything to work, including AppleScript, perl, sed, grep, etc. (none of which I know anything about; it's just trial and error at this point).

My latest attempt was a simple Automator workflow that passes the HTML file to Get URLs from Web Page, but what comes back is only the file:// URL and none of the other links, like the http:// ones.

Can anyone please help this noob?

I literally just started scripting and using Automator last week, and have learned a ton. This is easily the greatest thing since sliced bread.

Thanks!

Manish