Script to download urls and filter

Hi guys,

I found a script that downloads all of the URLs from a page, and it works perfectly. I need help with the next step: filtering those results to exclude “x”, “y”, and “z”. What would the code for that be?

set site_url to "http://www.apple.com"
tell application "Safari"
	activate
	open location site_url
end tell
-- wait until page loaded
if page_loaded(20) is false then return
-- get number of links
set theLinks to {}
tell application "Safari" to set num_links to (do JavaScript "document.links.length" in document 1)
set linkCounter to num_links - 1
-- retrieve the links
repeat with i from 0 to linkCounter
	tell application "Safari" to set end of theLinks to do JavaScript "document.links[" & i & "].href" in document 1
end repeat
theLinks

on page_loaded(timeout_value)
	delay 2
	repeat with i from 1 to the timeout_value
		tell application "Safari"
			if (do JavaScript "document.readyState" in document 1) is "complete" then
				
				return true
			else if i is the timeout_value then
				return false
			else
				delay 1
			end if
		end tell
	end repeat
	return false
end page_loaded


The next step would be to open the remaining links in new Safari or Firefox tabs. Is this possible?

I really appreciate all of your help!

Thanks!

Hi, welcome to the forum.

Try this:

set site_url to "http://www.apple.com"
tell application "Safari"
	if not (exists document 1) then reopen
	set URL of document 1 to site_url
	
	-- wait until page loaded
	if my page_loaded(20) is false then return
	-- get number of links
	
	set myLinks to do JavaScript "var linkList = [];
for (var i = 0; i < document.links.length; i++)
{
  linkList.push(document.links[i].href);
}
linkList;" in document 1
	
	repeat with alink in myLinks
		if alink contains "itunes" then
			set URL of document 1 to alink
			--insert your code
		end if
	end repeat
end tell

on page_loaded(timeout_value)
	delay 2
	repeat with i from 1 to the timeout_value
		tell application "Safari"
			if (do JavaScript "document.readyState" in document 1) is "complete" then
				
				return true
			else if i is the timeout_value then
				return false
			else
				delay 1
			end if
		end tell
	end repeat
	return false
end page_loaded

Thanks!

How do I limit the list to contain only “xxxx.com/store”? After that, how do I filter those results so that the list includes only xxxx.com/store/zzz but not xxxx.com/store/zzz=ref or xxxx.com/store/zzz?update?

Thanks so much!
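
One way to approach the question above is to do the filtering in JavaScript before the list ever comes back to AppleScript. The sketch below is only an illustration: `filterLinks` is a hypothetical helper name, and the sample URLs are placeholders taken from the question, not real addresses. It keeps a URL only if it contains every required substring and none of the excluded ones.

```javascript
// Hypothetical helper (not part of any script in this thread):
// keep URLs that contain every "required" substring
// and none of the "excluded" substrings.
function filterLinks(links, required, excluded) {
  return links.filter(function (url) {
    var hasAllRequired = required.every(function (part) {
      return url.indexOf(part) !== -1;
    });
    var hasNoExcluded = excluded.every(function (part) {
      return url.indexOf(part) === -1;
    });
    return hasAllRequired && hasNoExcluded;
  });
}

// Placeholder data mirroring the question; these are not real URLs.
var sampleLinks = [
  "http://xxxx.com/store/zzz",
  "http://xxxx.com/store/zzz=ref",
  "http://xxxx.com/store/zzz?update",
  "http://xxxx.com/about"
];

// Only the plain /store/zzz link passes both checks.
var kept = filterLinks(sampleLinks, ["xxxx.com/store/zzz"], ["zzz=ref", "zzz?update"]);
```

Inside Safari the input array would presumably be built from `document.links`, as in the script earlier in the thread, with the function definition included in the same `do JavaScript` string.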

I actually figured out how to do that :slight_smile: yay!!

The next part is, how do I open the remaining links in document 1 as multiple tabs in safari (or firefox)?

Thanks!

 if alink contains "xxxx.com/store" then
if alink contains "xxxx.com/store/zzz" and alink does not contain "xxxx.com/store/zzz=ref" and alink does not contain "xxxx.com/store/zzz?update" then beep
	repeat with alink in myLinks
		if alink does not contain "ipad" then
			tell window 1 to set newTab to make new tab
			set URL of newTab to alink
			--insert your code
		end if
	end repeat
end tell

For some reason that isn’t working.

What do I need to do to get Firefox to open all of the links as separate tabs?

set site_url to "https://www.etsy.com/your/orders/sold?ref=si_ys_dd_sold_orders"
tell application "Safari"
	activate
	open location site_url
end tell
-- wait until page loaded
if page_loaded(30) is false then return
-- get number of links
set theLinks to {}
tell application "Safari" to set num_links to (do JavaScript "document.links.length" in document 1)
set linkCounter to num_links - 1
-- retrieve the links
repeat with i from 0 to linkCounter
	tell application "Safari" to set end of theLinks to do JavaScript "document.links[" & i & "].href" in document 1
end repeat
theLinks
set nonExcludedURLs to {}


repeat with i from 1 to length of theLinks
	set thisLink to item i of theLinks
	-- keep only order links, dropping navigation, filter, and action URLs
	if thisLink contains "https://www.etsy.com/your/orders/" ¬
		and thisLink does not contain "home" and thisLink does not contain "sell" ¬
		and thisLink does not contain "sold_orders" and thisLink does not contain "open" ¬
		and thisLink does not contain "completed" and thisLink does not contain "all" ¬
		and thisLink does not contain "canceled" and thisLink does not contain "unpaid" ¬
		and thisLink does not contain "unshipped" and thisLink does not contain "from_user_id" ¬
		and thisLink does not contain "update" and thisLink does not contain "your-notes" then
		set end of nonExcludedURLs to thisLink
	end if
end repeat

nonExcludedURLs



on page_loaded(timeout_value)
	delay 2
	repeat with i from 1 to the timeout_value
		tell application "Safari"
			if (do JavaScript "document.readyState" in document 1) is "complete" then
				return true
			else if i is the timeout_value then
				return false
			else
				delay 1
			end if
			
		end tell
	end repeat
	return false
end page_loaded


Firefox is not scriptable. However, if it is your default browser you can open new tabs with:


open location "http://www.apple.com/itunes/"

Okay… since Firefox is not scriptable, how do I open multiple tabs with Safari?

To open multiple tabs in Safari read post #6
To open multiple tabs in Firefox read post #8

Hello.

Firefox is so barely scriptable that I guess its use cannot be endorsed by governments that have laws for letting disabled people work with computers. In that respect Firefox doesn’t support UI scripting at all! :mad:

Here is a handler that opens a new tab in Safari’s frontmost window with the specified URL.


loadUrlInNewSafariTab for "http://www.macscripter.net"
to loadUrlInNewSafariTab for anUrl
	tell application "Safari"
		tell its first window
			set current tab to tab (index of (make new tab))
			set URL of current tab to anUrl
		end tell
	end tell
end loadUrlInNewSafariTab

You’ll need to activate Safari when you are done. :slight_smile:

Hi,

Thanks. That worked. But how do I get it to open the links that are in document 1, as opposed to the hard-coded macscripter link?

Hello.

I surmise the theLinks list you have in your post #1 just contains the URLs. Then something like this would do the trick:

tell application "Safari"
	-- your stuff for finding the links here
	
	repeat with aLink in theLinks
		loadUrlInNewSafariTab of me for aLink
	end repeat
	activate
end tell

-- the handler here..

I tried to rewrite the code with aLink but it was not working, so I went back to my original code. I am able to list all of the URLs and then remove the ones that I don’t want. It won’t work for you because you won’t have access to the page, but where would I put the tab code, and what would it need to be to work with what I have? I really appreciate your help :slight_smile:

set site_url to "https://www.etsy.com/your/orders/sold?ref=si_ys_dd_sold_orders"
tell application "Safari"
	activate
	open location site_url
end tell
-- wait until page loaded
if page_loaded(30) is false then return
-- get number of links
set theLinks to {}
tell application "Safari" to set num_links to (do JavaScript "document.links.length" in document 1)
set linkCounter to num_links - 1
-- retrieve the links
repeat with i from 0 to linkCounter
	tell application "Safari" to set end of theLinks to do JavaScript "document.links[" & i & "].href" in document 1
end repeat
theLinks
set nonExcludedURLs to {}


repeat with i from 1 to length of theLinks
	set thisLink to item i of theLinks
	-- keep only order links, dropping navigation, filter, and action URLs
	if thisLink contains "https://www.etsy.com/your/orders/" ¬
		and thisLink does not contain "home" and thisLink does not contain "sell" ¬
		and thisLink does not contain "sold_orders" and thisLink does not contain "open" ¬
		and thisLink does not contain "completed" and thisLink does not contain "all" ¬
		and thisLink does not contain "canceled" and thisLink does not contain "unpaid" ¬
		and thisLink does not contain "unshipped" and thisLink does not contain "from_user_id" ¬
		and thisLink does not contain "update" and thisLink does not contain "your-notes" then
		set end of nonExcludedURLs to thisLink
	end if
end repeat
nonExcludedURLs


on page_loaded(timeout_value)
	delay 2
	repeat with i from 1 to the timeout_value
		tell application "Safari"
			if (do JavaScript "document.readyState" in document 1) is "complete" then
				return true
			else if i is the timeout_value then
				return false
			else
				delay 1
			end if
			
		end tell
	end repeat
	return false
end page_loaded




Hello and well done! :slight_smile:

Just insert it where you return the value of nonExcludedURLs, that is, the line containing just nonExcludedURLs.

-- nonExcludedURLs
tell application "Safari"
	repeat with aLink in nonExcludedURLs
		loadUrlInNewSafariTab of me for aLink
	end repeat
	activate
end tell

-- the handlers here..

When I do that I get the error: «script» doesn’t understand the loadUrlInNewSafariTab message.

Hello, you’ll have to add this handler to your script and try again. :slight_smile:

to loadUrlInNewSafariTab for anUrl
   tell application "Safari"
       tell its first window
           set current tab to tab (index of (make new tab))
           set URL of current tab to anUrl
       end tell
   end tell
end loadUrlInNewSafariTab

PERFECT!! Thanks so much!!

Now my last question. Is there a way to print all open tabs? I have an extension in Firefox that does that, but since I can’t script Firefox, is there a way to do that in Safari?

Thanks!

There are tons of scripts here that use UI Scripting to print a web page.

You’ll need something to iterate over the tabs in the front window, which is what I am providing below.

tell application "Safari"
	tell window 1
		set tc to count its tabs
		repeat with tbn from 1 to tc
			
			try
				set current tab to tab tbn
				-- code for printing the current tab here...
			on error
				beep
			end try
		end repeat
	end tell
	
end tell

thanks!

Last questions (promise)? How do I exclude the first tab? The first tab is the page where all of the URLs are grabbed from and I don’t need to print that page. EDIT: I think I figured it out… change 1 to 2???

Also, how do I make this section of the script wait, say, 20 seconds before it runs? That would give all of the tabs time to load, so there won’t be a problem with the print…

Thanks!

tell application "Safari"
	tell window 1
		set tc to count its tabs
		repeat with tbn from 1 to tc
			
			try
				set current tab to tab tbn
				tell application "Safari"
					print document 1
				end tell
				
				
			on error
				beep
			end try
		end repeat
	end tell
	
end tell