I am data mining some websites and need some help – I’m a noob at Automator and AppleScript.
I ripped about 40,000 pages via cURL into multiple HTML files (they average about 50 MB each).
First, I am extracting emails, which is easy enough using this guide with Automator:
Next, I want to extract all the links – not just the http:// ones, but everything, including bare www. links and so on.
I’ve tried all kinds of things – AppleScript, Perl, sed, grep, etc. (none of which I really know; it’s just trial and error at this point) – and searched the web high and low, but I can’t get anything to work.
My latest attempt was a simple Automator workflow that passes the HTML file to Get URLs from Web Page, but all that comes back are file:// links – none of the others, like http://.
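For reference, here’s the kind of grep one-liner I’ve been fumbling with on the command line (the pattern, the *.html glob, and the links.txt name are just my guesses, not something that’s actually working for me):

```shell
# Pull every http://, https://, and bare www. link out of the HTML files
# and de-duplicate the results. *.html and links.txt are placeholder names.
grep -Eoh "(https?://|www\.)[^\"'<> ]+" *.html | sort -u > links.txt
```

As far as I understand it, -o prints only the matched URL instead of the whole line, -h drops the filename prefix, and sort -u removes duplicates – but I’m not sure the character class catches every link correctly.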
Can anyone please help this noob?
I literally just started scripting and using Automator last week, and I’ve learned a ton. This is easily the greatest thing since sliced bread.