Need a text scraper workflow, any advice? Thanks

Ray_Barber · July 6, 2005, 8:38pm

hello,
Can you help me??

I need to extract text from this website: www.paginegialle.it
(italian yellow pages)

Specifically, I need to turn the data (name, address, telephone number)
for each business into a tab-delimited file. I then want to use an
application (AB transfer 1.6.0) to map that data into my address book.

What I want to do is first refine my searches (keyword AND local area code) and then, once the results are displayed, avoid the tedious copy-and-paste procedure to update my address book. If a search returns more than one page of results, it would be great to extend the automated procedure to the other pages too, without having to click the “next page” button.

My machine is a 12" iBook running Tiger.

Thanks a lot in advance for your advice and help.

hhas · July 6, 2005, 9:31pm

I don’t suppose they provide an XML-RPC interface? That’d be much easier to deal with. Writing a reliable HTML scraping script is frequently a non-trivial task and such scripts are always brittle and tend to break whenever the HTML changes or doesn’t appear exactly as you expect it. If you’ve never done it before, you might want to start by Googling for “HTML scraping” and do some background reading before you dive in.

If you have to scrape the HTML yourself, you could use either regular expressions or an HTML parser. The latter is generally more robust but the former may do for a quick-n-dirty solution.

For REs, you could use TextCommands; it also provides commands for other stuff like decoding HTML entities and trimming whitespace, which you’ll also need. The HTML on that site doesn’t look very clean though, so producing a reliable RE-based solution may not be easy.
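To give a feel for the RE route, here is a rough sketch in Python rather than AppleScript/TextCommands. The markup in SAMPLE and the class names ("name", "addr", "tel") are invented for illustration; the real site's HTML will certainly differ, so the pattern would need rewriting against the actual page source. It also shows the entity decoding and whitespace trimming mentioned above.

```python
import html
import re

# Hypothetical markup for illustration only; not the real site's HTML.
SAMPLE = """
<div class="listing">
  <span class="name"> Bar &amp; Caff&egrave; Roma </span>
  <span class="addr">Via Garibaldi 12, Milano</span>
  <span class="tel">02 1234567</span>
</div>
<div class="listing">
  <span class="name">Pizzeria Da Mario</span>
  <span class="addr">Corso Italia 3,   Torino</span>
  <span class="tel">011 7654321</span>
</div>
"""

# One match per listing: three consecutive spans, captured non-greedily.
LISTING_RE = re.compile(
    r'<span class="name">(?P<name>.*?)</span>\s*'
    r'<span class="addr">(?P<addr>.*?)</span>\s*'
    r'<span class="tel">(?P<tel>.*?)</span>',
    re.DOTALL,
)


def clean(text):
    """Decode HTML entities, then trim and collapse whitespace."""
    return " ".join(html.unescape(text).split())


def extract(page):
    """Return a list of (name, address, telephone) tuples found in page."""
    return [
        tuple(clean(m.group(key)) for key in ("name", "addr", "tel"))
        for m in LISTING_RE.finditer(page)
    ]


rows = extract(SAMPLE)
for row in rows:
    print("\t".join(row))
```

Note how tightly the pattern is tied to the exact tag order and attribute quoting; any change in the page layout would make it silently match nothing, which is the brittleness described above.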

For a full-blown HTML parser, particularly one that can deal with the sort of tag soup that site generates, you should really look to other languages, as there’s next to nothing available for AppleScript. For example, see the BeautifulSoup parser for Python, which is well suited to HTML scraping.
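As a rough idea of what the parser route looks like, here is a minimal sketch. To keep it dependency-free it uses the html.parser module from Python's standard library as a stand-in for BeautifulSoup, which is a separate install and will likely cope better with badly broken markup. The sample HTML and the class names in FIELDS are assumptions made up for the example, not the site's real structure. The output is written as tab-delimited text, which is the format asked for.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical class names; the real page will use different markup.
FIELDS = ("name", "addr", "tel")

SAMPLE = """
<div class=listing>
  <SPAN class="name">Bar &amp; Roma</SPAN>
  <span class=addr>Via Garibaldi 12,
      Milano</span>
  <span class="tel">02 1234567</span>
</div>
"""


class ListingParser(HTMLParser):
    """Collect the text of the spans named in FIELDS, one row per listing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities decoded for us
        self.rows = []
        self._current = {}
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in FIELDS:
            self._field = cls
            self._current[cls] = ""

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            self._field = None
            # Emit a row once all three fields have been seen.
            if all(f in self._current for f in FIELDS):
                self.rows.append(
                    [" ".join(self._current[f].split()) for f in FIELDS]
                )
                self._current = {}


def to_tsv(rows):
    """Serialise rows as tab-delimited text, one listing per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


parser = ListingParser()
parser.feed(SAMPLE)
parser.close()
tsv = to_tsv(parser.rows)
print(tsv, end="")
```

Because the parser works on tags and attributes rather than raw text, it is indifferent to tag case, attribute quoting and line breaks inside a field, all of which would trip up a naive regular expression.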

HTH