Crawl-n-Diagram a Web Site Given a URL

aka “Get Out of the Digital Dark Ages”

Alright, ever since I got roped into doing QA for the web team, I have been beating my head against a conundrum that is apparently common to QA folks, but for which no app or utility exists (at least no good one). In fact, only Visio Pro even attempts it, and does it awfully:

Given a URL to an existing web site, crawl the site and draw a diagram.

Now here’s a gotcha that Visio falls down on: it’s a bit “one-way, dead-end” about its diagrams. It does not show link-backs, or recursions, or whatever you want to call them. For example: the site has a footer with a set of links that is on most pages. I want the diagram to show EVERY page that links to EVERY other page, in which case there should be lots of diagram lines to the pages the footer links to. This is incredibly important when doing QA. Software like Integrity will check that links WORK, but not whether they go to the right place. Diagrams with these details also give a better idea of the structural “web” of a site.

The other gotcha: Visio is too dumb to tell an image-replacement link from a text one (which means on CSS-centric sites with lots of image-replacement links, you get everything twice).
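To make the "everything twice" problem concrete, here is a minimal sketch in Python (not AppleScript, purely for illustration) of collapsing links that point at the same page. The function name `unique_targets` and the `example.test` URLs are made up for the example; the idea is simply to resolve every href to an absolute URL, drop the `#fragment`, and keep each destination once per page.

```python
from urllib.parse import urljoin, urldefrag


def unique_targets(page_url, hrefs):
    """Resolve each href against page_url and keep one entry per destination."""
    seen = []
    for href in hrefs:
        # urljoin makes relative links absolute; urldefrag strips "#anchor" parts.
        absolute = urldefrag(urljoin(page_url, href)).url
        if absolute not in seen:
            seen.append(absolute)
    return seen


# An image-replacement link and a text link to the same page count once.
targets = unique_targets(
    "http://example.test/",
    ["/about", "about", "/about#team", "/contact"],
)
```

With this approach it no longer matters whether a link is an image or text, because only the destination is compared.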

Anyway, I’m getting so frustrated that even Visio can’t seem to give a fair representation of a web site (and worse, builds its diagrams in such a way as to make complex sites hard to print out) that I’m exploring writing a script that, along with OmniGraffle, would do the trick.

Maybe it’s impossible or WAY out of my league, but my frustration is getting the better of me.

So anybody want to brainstorm the solution and how AppleScript might accomplish this? For example, I know I’ll need to store pages, all the links on each page, and all the pages those links lead to (parsed via Curl, I imagine). But then how to manage it when those pages are also shared by other pages, and thus track common “nodes” (pages that are common destinations of other pages), makes my head hurt. I get all fuzzy thinking about how a parent page can be the child of another parent and its children can be parents of other parents’ children’s parents. Ouch.
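One way to untangle the parent/child knot is to stop thinking of a tree at all and treat the site as a graph: each page is stored once, together with the set of pages it links to. Below is a minimal sketch in Python, assuming the crawl stays on one host and that a `fetch` function (hypothetical, supplied by the caller) returns the HTML for a URL. The names `LinkCollector` and `crawl` are made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch):
    """Return {page_url: set of same-site page URLs that page links to}."""
    site = urlparse(start_url).netloc
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:
            # Already crawled: this check is what stops link-backs looping forever.
            continue
        parser = LinkCollector()
        parser.feed(fetch(url))
        targets = set()
        for href in parser.hrefs:
            absolute = urldefrag(urljoin(url, href)).url
            if urlparse(absolute).netloc == site:  # ignore off-site links
                targets.add(absolute)
        graph[url] = targets
        queue.extend(t for t in targets if t not in graph)
    return graph
```

A page that is "the child of many parents" simply shows up in many of the sets while being stored as a key only once. For a real site, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`, or a wrapper around the curl command line.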

Then of course, unravel it all in such a way as to get OmniGraffle to draw it (but hey, one step at a time).
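For the drawing step, one option that avoids scripting every shape is to write the links out as a plain edge list and let the drawing tool lay it out. The sketch below (Python, with a made-up function name `to_dot`) emits Graphviz DOT text from a page-to-links mapping. Whether OmniGraffle imports DOT files in your version is an assumption to verify, not something this example guarantees.

```python
def to_dot(graph):
    """Turn {page: set of linked pages} into Graphviz DOT text, one line per link."""

    def quote(name):
        # DOT identifiers are double-quoted here; escape any embedded quotes.
        return '"' + name.replace('"', '\\"') + '"'

    lines = ["digraph site {"]
    for source in sorted(graph):
        for target in sorted(graph[source]):
            lines.append(f"    {quote(source)} -> {quote(target)};")
    lines.append("}")
    return "\n".join(lines)


# Two pages that link to each other produce two arrows, so link-backs are kept.
dot_text = to_dot({"/": {"/about"}, "/about": {"/"}})
```

Because every link becomes its own line, footer links that appear on most pages end up as many arrows into the same few nodes, which is exactly the detail the diagram is supposed to show.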

What I need the most help on short-term:
I can muddle through parsing the web site code (like via Curl), and I can muddle through the OmniGraffle part, but the middle, the storage and management of the interlinking page information, is what makes my head hurt. For example, am I going to need to use a database to do this effectively (does MacOS have something like SQL built-in?) or will simple variable/array storage suffice?
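On the database question: as far as I know, Mac OS X has shipped with SQLite since 10.4, and a single two-column table of links is about all the "relational" structure this needs. Here is a minimal sketch in Python using its built-in `sqlite3` module. The table name `links`, the helper names, and the sample paths are all made up for illustration. For a small site, a plain in-memory list or record of links would likely do just as well.

```python
import sqlite3

# ":memory:" keeps everything in RAM; pass a file path to keep results between runs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE links (source TEXT, target TEXT, UNIQUE (source, target))")


def add_link(source, target):
    """Record that `source` links to `target`; repeated links are ignored."""
    db.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (source, target))


def inbound_counts():
    """Return (page, number of pages linking to it), most-linked first."""
    rows = db.execute(
        "SELECT target, COUNT(*) FROM links "
        "GROUP BY target ORDER BY COUNT(*) DESC, target"
    )
    return rows.fetchall()


sample = [
    ("/", "/about"),
    ("/", "/contact"),
    ("/about", "/contact"),
    ("/about", "/contact"),  # duplicate, e.g. an image link plus a text link
    ("/contact", "/"),
]
for source, target in sample:
    add_link(source, target)

counts = inbound_counts()
```

The "common nodes" are then just the rows with the highest counts, and no joins across multiple tables are needed.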

My relational database knowledge is middle-range: not a novice, but beyond relating a handful of tables at once I get mired.

Anyone care to walk me through ideas, or talk me out of my harebrained scheme? :smiley:

Maybe there is a reason no one has been foolish enough to do this? :stuck_out_tongue:

Can I assume from the crickets-n-silence that this is bigger than even I give it credit for? :rolleyes:

(Given that usually responses flow in after this much time.)

Not looking for finished solutions, which I know you folks love to give, just higher-level “you probably need to look at using X, Y, and Z methods and A, B, and C tools.”

Just looking to pick people’s brains for advice, not necessarily finished code. :stuck_out_tongue: