How do I retrieve URLs (links) from a web page in Safari with AppleScript?
When I worked for my former employer, we needed to download thousands of patent documents into a database. As part of that process, we had to extract the URLs from web pages. We used a little Python script for this task, which you can download here. Place it on your desktop to test the AppleScript code below, which calls this Python helper script:
tell application "Safari"
	set docURL to URL of document 1
end tell

-- modify the path if you did not place the Python script on your desktop
set pyscriptpath to quoted form of POSIX path of (((path to desktop) as Unicode text) & "listurls.py")
set command to "python " & pyscriptpath & space & quoted form of docURL
set contURLS to paragraphs of (do shell script command)
Hope this helps.
Have a nice Sunday!
You’re welcome, and have a nice Sunday, too.
It doesn’t work as-is in Safari 3.1.2 on Leopard 10.5.5, but I haven’t explored why.
This part looks suspicious. It is indexing over the number of frames, but it is indexing into the list of links. Since there were some other parts that looked odd to me, I rewrote it (tested on my 10.4.11 system):
Model: iBook G4 933
Browser: Safari Version 3.1.2 (4525.22)
Operating System: Mac OS X (10.4)
Thanks for catching this. It’s undoubtedly meant to be:
for (q = 0 ; q < thisFrame.document.links.length ; q++) processEntry(thisFrame.document.links[q].href) ;
I’ve now edited it in my post above. None of the Web pages on which I originally tested can have had multiple frames!
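For anyone following along, the corrected loop fits into a frame walk roughly like the sketch below. This is not the script from the post above: ‘collectLinks’ and ‘win’ are names made up for illustration, ‘processEntry’ here simply pushes onto an array, and only one level of frames is visited.

```javascript
// Sketch: gather every link href from a window-like object and its frames.
// `collectLinks` and `win` are illustrative names, not from the original script.
function collectLinks(win) {
  var found = [];

  // Stand-in for the script's processEntry(); this version only records the href.
  function processEntry(href) {
    found.push(href);
  }

  function scan(links) {
    // Index over the links themselves, not over the number of frames.
    for (var q = 0; q < links.length; q++) processEntry(links[q].href);
  }

  scan(win.document.links);
  for (var f = 0; f < win.frames.length; f++) {
    var thisFrame = win.frames[f];
    try {
      scan(thisFrame.document.links);
    } catch (e) {
      // A frame loaded from another site may refuse access to its document; skip it.
    }
  }
  return found;
}
```

In a browser this would be called as collectLinks(window). Nested frames would need the function to recurse, which is left out here.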
My empty result’s due to an error generated in ‘collectIfNew()’ by the attempt to apply ‘indexOf()’ to the array. With Safaris 1.0.3 and 2.0.4, ‘indexOf()’ can only be used to find substrings in a text.
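One way round this, sketched under the assumption that the script only needs to know whether a link has already been collected, is a plain loop instead of the array’s ‘indexOf()’. The name ‘indexInArray’ is made up, and the two-parameter form of ‘collectIfNew’ below is a guess at the shape of the real function rather than a copy of it.

```javascript
// Sketch: membership test that does not rely on Array.prototype.indexOf.
// `indexInArray` is an illustrative name; the signature of `collectIfNew`
// is an assumption, not taken from the original script.
function indexInArray(list, item) {
  for (var i = 0; i < list.length; i++) {
    if (list[i] === item) return i; // position of the first match
  }
  return -1; // same "not found" convention as indexOf()
}

function collectIfNew(list, href) {
  // Append the href only if it has not been seen before.
  if (indexInArray(list, href) === -1) list.push(href);
  return list;
}
```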
Cool! The script now works with Safari 1.0.3, if a trifle less quickly than my effort. (About a tenth of a second longer to return the 110 discrete links on this page (before I posted this) using a 400MHz G3 machine.) Presumably it’s good with all three of the Safari versions under discussion.
Much of that tenth of a second can be regained by omitting ‘from in this &&’, which I suspect is superfluous. Its purpose appears (from empirical tests) to be to check that the value of ‘from’ is within the number of items in the array, but this is governed by the repeat parameters anyway.
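For reference, here is a small sketch of what an ‘index in array’ test does in JavaScript generally, as far as I know: it reports whether the array actually has an element at that index, which is false only for a hole in a sparse array or an index past the end. ‘firstIndexOf’ and its ‘skipHoles’ flag are made-up names for illustration, not the code from the script under discussion.

```javascript
// Sketch: a linear search with and without the "index in array" check.
// `firstIndexOf` and `skipHoles` are illustrative names.
function firstIndexOf(list, item, skipHoles) {
  for (var from = 0; from < list.length; from++) {
    // With skipHoles set, indexes that hold no element are ignored.
    if (skipHoles && !(from in list)) continue;
    if (list[from] === item) return from;
  }
  return -1;
}

var dense = ["a", "b", "c"];  // every index below length holds an element
var sparse = ["a", , "c"];    // index 1 is a hole

var denseHasOne = (1 in dense);   // true
var sparseHasOne = (1 in sparse); // false
```

For an array filled one item at a time, as a list of collected links normally is, there are no holes, so the search returns the same index with or without the check.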
Thanks for this, Chris. If you’re happy with it, maybe you should post it in Code Exchange.
This works wonderfully. What if I need to grab the anchor text rather than the URL? Can this script be modified to do that? An example of anchor text is this, where “this” is the anchor text.
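A possible starting point, assuming the JavaScript side of the script walks ‘document.links’ as in the loop quoted earlier: read each link’s text instead of its ‘href’. ‘collectAnchorText’ and ‘doc’ are made-up names, and this sketch has not been tried in the old Safari versions discussed above.

```javascript
// Sketch: collect the visible text of each link instead of its href.
// `collectAnchorText` and `doc` are illustrative names.
function collectAnchorText(doc) {
  var texts = [];
  for (var q = 0; q < doc.links.length; q++) {
    var link = doc.links[q];
    // textContent is the standard property; innerText is a fallback for older browsers.
    var text = link.textContent || link.innerText || "";
    // Trim leading and trailing whitespace without relying on String.prototype.trim.
    texts.push(text.replace(/^\s+|\s+$/g, ""));
  }
  return texts;
}
```

In the page this would be called as collectAnchorText(document). Image-only links and image-map areas have no text, so they come back as empty strings.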
Sorry for reviving an old thread.
I have just started with scripts. I found this and it works. It retrieves all URLs from a specific page. What do I have to do to filter those results, e.g. to keep only the URLs that contain neither “xxx” nor “yyy”?
I really appreciate that.
After it is filtered, what do I need to do so that it opens all the remaining URLs in tabs?
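On the filtering part, one option is to do it in JavaScript before the list is handed back. The sketch below assumes the URLs are already in an array of strings; ‘filterUrls’ is a made-up name, and “xxx” and “yyy” are just the placeholders from the question.

```javascript
// Sketch: keep only the URLs that contain none of the unwanted substrings.
// `filterUrls` is an illustrative name.
function filterUrls(urls, unwanted) {
  var kept = [];
  for (var i = 0; i < urls.length; i++) {
    var keep = true;
    for (var j = 0; j < unwanted.length; j++) {
      // indexOf() on a string finds a substring; -1 means "not present".
      if (urls[i].indexOf(unwanted[j]) !== -1) {
        keep = false;
        break;
      }
    }
    if (keep) kept.push(urls[i]);
  }
  return kept;
}

var remaining = filterUrls(
  ["http://example.com/a", "http://example.com/xxx/b", "http://example.com/c?yyy"],
  ["xxx", "yyy"]
);
// remaining is ["http://example.com/a"]
```

Opening the remaining URLs in tabs would be a job for the AppleScript side rather than for JavaScript.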
I wanted to thank Stefan for the example above. It worked like a charm and saved me hours! Greetings from rural Japan.