Retrieve HTML links from a Safari page

Hello everybody,
How do I retrieve URLs (links) from a web page in Safari with AppleScript?

When I worked for my former employer, we needed to download thousands of patent documents into a database, and that meant finding URLs in web pages. We used a little Python script for this task, which you can download here. Place it on your desktop to test the AppleScript code below, which calls this Python helper script:


tell application "Safari"
	set docURL to URL of document 1
end tell
-- modify the path if you did not place the Python script on your desktop
set pyscriptpath to quoted form of POSIX path of (((path to desktop) as Unicode text) & "listurls.py")
set command to "python " & pyscriptpath & space & quoted form of docURL
set contURLS to paragraphs of (do shell script command)

Hope this helps.

Alternatively, here is an AppleScript solution using Safari's do JavaScript command:


set site_url to "http://bbs.applescript.net"
tell application "Safari"
	activate
	open location site_url
end tell
-- wait until page loaded
if page_loaded(20) is false then return
-- get number of links
set theLinks to {}
tell application "Safari" to set num_links to (do JavaScript "document.links.length" in document 1)
set linkCounter to num_links - 1
-- retrieve the links
repeat with i from 0 to linkCounter
	tell application "Safari" to set end of theLinks to do JavaScript "document.links[" & i & "].href" in document 1
end repeat
theLinks

on page_loaded(timeout_value)
	delay 2
	repeat with i from 1 to the timeout_value
		tell application "Safari"
			if (do JavaScript "document.readyState" in document 1) is "complete" then
				return true
			else if i is the timeout_value then
				return false
			else
				delay 1
			end if
		end tell
	end repeat
	return false
end page_loaded

Oh, I love this place. Every day I learn something new. Your «do javascript» solution can be of real help in one of my projects where I am not supposed to use external scripts. Thanks a lot!

Have a nice Sunday!

You’re welcome and have a nice Sunday, too

Two or three years ago (in Code Exchange, I think), jj showed me how to get all the links from all the frames in a Web page, omitting any duplicates, with a single JavaScript. I used to know how this works: :rolleyes:

set JS to "// All variables global!
URLs = 'Links:' ; // (Jaguar-compatible) anchor line. Simplifies subsequent AppleScript code.
for (i = 0; i < top.document.links.length; i++) processEntry(top.document.links[i].href);
for (i = 0; i < top.frames.length; i++) mf(top.frames[i]); 
return URLs.split('%%%').join('') ;

function mf(thisFrame) {
	if (thisFrame.frames.length == 0) { // extract links from this page
		try {
			for (q = 0 ; q < thisFrame.document.links.length ; q++) processEntry(thisFrame.document.links[q].href) ;
		} catch (e) {
		}
	} else { // rotate again
		for (q = 0 ; q < thisFrame.frames.length ; q++) mf(thisFrame.frames[q]) ;
	}
}

function processEntry(thisURL) {
	entry = '\\r' + thisURL + '%%%' ;
	if (URLs.indexOf(entry) == -1) URLs += entry ;
}"

tell application "Safari" to set linkURLs to rest of (do JavaScript JS in front document)'s paragraphs

Edit: JavaScript code edited (hopefully corrected) following Chrys’s comment below.

Doesn’t work as is in Safari 3.1.2, Leopard 10.5.5 but I haven’t explored why.

This part looks suspicious. It is indexing over the number of frames, but it is indexing into the list of links. Since there were some other parts that looked odd to me, I rewrote it (tested on my 10.4.11 system):

set JS to "var URLs = [];
function collectIfNew(url) {
    if( URLs.indexOf(url) == -1 ) {
        URLs.push(url);
    }
}
function processDoc(doc) {
    var l = undefined;
    try {
        // If the document is from a different protocol+domain+port than the main page (an iframe to a foreign location), this may throw a security exception. If so, note it and move on.
        l = doc.links;
    } catch(e) { console.warn(e) }
    if( l !== undefined ) {
        for( var i = 0; i < l.length; i++ ) { collectIfNew(l[i].href) }
    }
}
function processFrameset(f) {
    for( var i = 0; i < f.length; i++ ) { process(f[i]) }
}
function process(o) {
    if( o.frames !== undefined && o.frames.length != 0 ) {
        // It is a frameset
        processDoc(o.document); // Process its links. Normal framesets probably have no links, but iframe-based framesets probably have many links.
        processFrameset(o.frames);
    } else {
        // It is a document
        processDoc(o.document);
    }
}
process(window);
URLs;
"

tell application "Safari" to set linkURLs to do JavaScript JS in front document

It may not work on Jaguar, but it seems a bit cleaner to me: less code duplication, no string manipulation in either JavaScript or AppleScript (directly returns a JavaScript Array which comes out as an AppleScript list), and a few more descriptive comments. It is still a less-than-optimal O(N^2), but performance probably will not hurt unless there are many thousands of links.
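If the O(N^2) duplicate check ever did become a problem, one common alternative is to remember the URLs already seen in a plain object used as a lookup table. The following is only a sketch under that assumption: uniqueUrls is a hypothetical helper, not part of the script above, and I have not tried it on the older Safari versions.

```javascript
// Sketch: remove duplicates in a single pass, using a plain object as a lookup table.
// uniqueUrls is a hypothetical helper name, not part of the script above.
function uniqueUrls(urls) {
    var seen = {};   // one key per URL already collected
    var result = [];
    for (var i = 0; i < urls.length; i++) {
        // The 'u:' prefix keeps keys from clashing with inherited names such as 'toString'
        var key = 'u:' + urls[i];
        if (!seen[key]) {
            seen[key] = true;
            result.push(urls[i]);
        }
    }
    return result;
}
```

The first occurrence of each URL is kept and the original order is preserved, so the result should match what collectIfNew builds up.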

Model: iBook G4 933
AppleScript: 1.10.7
Browser: Safari Version 3.1.2 (4525.22)
Operating System: Mac OS X (10.4)

Hi, Chris.

Thanks for catching this. It’s undoubtedly meant to be:

for (q = 0 ; q < thisFrame.document.links.length ; q++) processEntry(thisFrame.document.links[q].href) ;

I’ve now edited it in my post above. None of the Web pages on which I originally tested can have had multiple frames!

Having now found the original Code Exchange article, I see I owe jj an apology too. It’s not his code but an attempt by me to be clever with something he actually did post. (My knowledge of JavaScript is sadly minimal.)

I’ll study your code. It only returns an empty Unicode text on my Jaguar and Tiger machines, but that’s probably more to do with the Safari versions than with the JavaScript.

Hi, Chris.

My empty result’s due to an error generated in ‘collectIfNew()’ by the attempt to apply ‘indexOf()’ to the array. With Safaris 1.0.3 and 2.0.4, ‘indexOf()’ can only be used to find substrings in a text.

If the ‘indexOf()’ condition’s removed, the array duly fills with URLs without being screened for duplicates. However, the end result’s still an empty text because the array can’t be passed back to AppleScript. It can be returned as a single, comma-delimited text by concatenating it to an empty string within the JavaScript code; but then you’d need AppleScript’s TIDs to separate the URLs again, which would defeat your reason for using an array in the first place.

It is really too bad that the older versions of Safari do not support returning JavaScript Arrays to AppleScript lists. I find it much nicer than having to do the text manipulation.

If we make the assumption that all of the href string values will be properly encoded URLs, then we know that they will have no bare linefeeds or returns. Then, using a linefeed or return (like the original code) would allow the use of paragraphs instead of text item delimiters, which would streamline the AppleScript side. This approach is not appropriate for strings that might contain linefeeds or returns, but it should be safe here (though ‘\0’ in JavaScript and text item delimiters set to {ASCII character 0} worked in my brief testing, so there is hope even for returning lists of more general strings; as long as they do not contain nulls themselves).
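As a rough illustration of the return-delimited idea, here is a small sketch. packForAppleScript is a hypothetical name, and the percent-encoding step is an extra precaution of mine in case an href does contain a bare return or linefeed; it is not something the assumption above requires.

```javascript
// Sketch: pack an array of URL strings into one return-delimited string,
// so that AppleScript can split it again with 'paragraphs of'.
// packForAppleScript is a hypothetical helper name.
function packForAppleScript(urls) {
    var safe = [];
    for (var i = 0; i < urls.length; i++) {
        // Percent-encode stray returns and linefeeds so they cannot be mistaken for separators
        safe.push(String(urls[i]).replace(/\r/g, '%0D').replace(/\n/g, '%0A'));
    }
    return safe.join('\r');
}
```

When code like this is embedded in an AppleScript string literal, each backslash has to be doubled ('\\r' rather than '\r'), as in the scripts in this thread.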

The lack of indexOf can be overcome by implementing it by hand similar to the compatibility code shown in the Mozilla JavaScript reference for indexOf.

Thus:

set JS to "var URLs = [];
if( !URLs.indexOf ) {
	URLs.indexOf = function(elt) {
		// Based on code in https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Array/indexOf
		// This is a simplified version. It does not support the optional 'from' second argument.
		var len = this.length;
		var from = 0;
		for ( ; from < len; from++) {
		if (from in this &&
			this[from] === elt)
			return from;
		}
		return -1;
	}
}
function collectIfNew(url) {
	if( URLs.indexOf(url) == -1 ) {
		URLs.push(url);
	}
}
function processDoc(doc) {
	var l = undefined;
	try {
		// If the document is from a different protocol+domain+port than the main page (an iframe to a foreign location), this may throw a security exception. If so, note it and move on.
		l = doc.links;
	} catch(e) { console.warn(e) }
	if( l !== undefined ) {
		for( var i = 0; i < l.length; i++ ) { collectIfNew(l[i].href) }
	}
}
function processFrameset(f) {
	for( var i = 0; i < f.length; i++ ) { process(f[i]) }
}
function process(o) {
	if( o.frames !== undefined && o.frames.length != 0 ) {
		// It is a frameset
		processDoc(o.document); // Process its links. Normal framesets probably have no links, but iframe-based framesets probably have many links.
		processFrameset(o.frames);
	} else {
		// It is a document
		processDoc(o.document);
	}
}
process(window);
URLs.join('\\r');
"

tell application "Safari" to set linkURLs to paragraphs of (do JavaScript JS in front document)

The changes in the JavaScript code should consist of just the new block at the top (defining an indexOf, if it does not already exist), and the last line (uses join to convert the array of strings to a single string with embedded returns). The change to the AppleScript is to use paragraphs to reconstitute the string into a list.

Cool! :cool: The script now works with Safari 1.0.3, if a trifle less quickly than my effort. (About a tenth of a second longer to return the 110 discrete links on this page (before I posted this) using a 400MHz G3 machine.) Presumably it’s good with all three of the Safari versions under discussion.

Much of that tenth of a second can be regained by omitting ‘from in this &&’, which I suspect is superfluous. Its purpose appears (from empirical tests) to be to check that the value of ‘from’ is within the number of items in the array, but this is governed by the repeat parameters anyway.

Thanks for this, Chris. If you’re happy with it, maybe you should post it in Code Exchange.
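A note on ‘from in this &&’: as far as I know, the ‘in’ operator tests whether an index actually exists as a property of the array, which matters only for sparse arrays (arrays with holes). The sketch below, using two hypothetical helper functions, shows the one case where the result differs. Since URLs is filled only with push(), it never has holes, so dropping the check should be harmless here.

```javascript
// A sparse array: index 1 has never been assigned
var sparse = ['a', , 'c'];
var hasIndex0 = 0 in sparse;   // true
var hasIndex1 = 1 in sparse;   // false: index 1 is a hole

// Simplified search without the 'in' check (hypothetical helper)
function indexOfNoCheck(arr, elt) {
    for (var from = 0; from < arr.length; from++) {
        if (arr[from] === elt) return from;
    }
    return -1;
}

// The same search with the 'in' check (hypothetical helper)
function indexOfWithCheck(arr, elt) {
    for (var from = 0; from < arr.length; from++) {
        if (from in arr && arr[from] === elt) return from;
    }
    return -1;
}

// Reading a hole yields undefined, so only the unchecked version "finds" undefined
var withoutCheck = indexOfNoCheck(sparse, undefined);   // 1
var withCheck = indexOfWithCheck(sparse, undefined);    // -1
```

For ordinary values such as 'c', both versions return the same index.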

This works wonderfully. What if I need to grab the anchor text rather than the URL? Can this script be modified to do that? An example of anchor text is this, where “this” is the anchor text.
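One possible approach, offered as an untested-in-Safari sketch: read each link's text instead of its href. collectAnchorTexts is a hypothetical helper; it assumes the links expose textContent (the standard DOM property) or, failing that, innerText.

```javascript
// Sketch: collect the visible text of every link in a document-like object.
// collectAnchorTexts is a hypothetical helper name; 'doc' only needs a 'links' collection.
function collectAnchorTexts(doc) {
    var texts = [];
    var links = doc.links;
    for (var i = 0; i < links.length; i++) {
        // Prefer textContent; fall back to innerText, then to an empty string
        var text = links[i].textContent || links[i].innerText || '';
        // Collapse runs of whitespace and trim the ends
        texts.push(text.replace(/\s+/g, ' ').replace(/^ | $/g, ''));
    }
    return texts;
}
```

Inside Safari the call would presumably be collectAnchorTexts(document).join('\r') as the last statement (with the backslash doubled inside the AppleScript string), and the AppleScript side would again use paragraphs of the result. Links that wrap only an image would come back as empty strings.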

Sorry for reviving an old thread.

I have just started with scripts. I found this and it works: it retrieves all URLs from a specific page. What do I have to do to filter those results, i.e. keep only the URLs that do not contain “xxx” and do not contain “yyy”…?

I really appreciate that.

After it is filtered, what do I need to do so that it opens all the remaining urls in a tab?

Thanks!
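For the filtering part, a minimal sketch on the JavaScript side might look like this. filterUrls is a hypothetical helper, and “xxx” and “yyy” stand for whatever substrings should be excluded. Opening the remaining URLs in tabs is a job for the AppleScript side and is not covered here.

```javascript
// Sketch: keep only the URLs that contain none of the excluded substrings.
// filterUrls is a hypothetical helper name.
function filterUrls(urls, excluded) {
    var kept = [];
    for (var i = 0; i < urls.length; i++) {
        var keep = true;
        for (var j = 0; j < excluded.length; j++) {
            // String indexOf returns -1 when the substring is absent
            if (urls[i].indexOf(excluded[j]) !== -1) {
                keep = false;
                break;
            }
        }
        if (keep) kept.push(urls[i]);
    }
    return kept;
}
```

In the scripts above, the final statement could then become something like filterUrls(URLs, ['xxx', 'yyy']).join('\r'), again with the backslash doubled inside the AppleScript string. The match is case-sensitive.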

Wanted to thank Stefan for your example above, worked like a charm and saved me hours! Greetings from rural Japan.