Extract URL using RegEx?

I just got an iPod and I’m trying to write AppleScripts for Audio Hijack to extract the Real Audio feeds of some of my favorite shows, so I can record them to iTunes at night and listen to them on my iPod. I’ve done fairly well with some scripts I found for NPR on the Audio Hijack forums:

http://www.rogueamoeba.com/forum/ubb/Forum1/HTML/000232.html

However, I am now looking at a show whose URL includes the name of the show, not just the date. This makes it difficult, because the sample script I’m using works by the date.

The show is KCRW’s “Sounds Eclectic” - a weekly roundup of music that had been played during the week on “Morning Becomes Eclectic”:

http://soundseclectic.com/

The URL for the RA file looks like this:

http://kcrw.com/cgi-bin/ram_wrap.cgi?/sc/sc050717Deadman

I’m wondering if there is some way I can use Regular Expressions to find all URLs on the main page which look like:

http://kcrw.com/cgi-bin/ram_wrap.cgi?/sc/*

And then open that URL.
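(Not part of the thread’s eventual solution, but for anyone curious, the “find all URLs that look like a pattern” idea can be sketched with an actual regular expression in Python. The page source is assumed to already be in a string here; the URL is the example one from above.)

```python
import re

# Sample page source; in practice this would come from fetching the page.
# Shown inline so the sketch is self-contained.
html = '<a href="http://kcrw.com/cgi-bin/ram_wrap.cgi?/sc/sc050717Deadman">listen</a>'

# Match any fully qualified ram_wrap.cgi?/sc/ URL, stopping at the closing quote.
pattern = r'http://kcrw\.com/cgi-bin/ram_wrap\.cgi\?/sc/[^"]+'
urls = re.findall(pattern, html)
print(urls)  # ['http://kcrw.com/cgi-bin/ram_wrap.cgi?/sc/sc050717Deadman']
```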

Oh, and I believe the whole thing is made more difficult by the fact that the page has frames.

Such a script would be very helpful, as it could probably be adapted for a wide range of shows.

Thanks for your help!

What browser are you using?
SC

My personal browser is Firefox, but the scripts I’ve been using all use Safari. It doesn’t really matter to me, since it is all run at night.

The links on Sounds Eclectic do not point directly to the RAM files; they link to KCRW’s main page, which then redirects to an archive page that holds a neatly compiled list of the files you seek :wink: The URL of the archive page is “http://www.kcrw.com/archive.html”. The search string in the URLs is “/ram_wrap.cgi?/sc”, as you said, but there are href blocks containing the same string, so I added “if ThisBlock contains "/ram_wrap.cgi?/sc" and ThisBlock contains "http:"” so it returns only fully qualified links. I also included a PriorDownloads list to store previously downloaded URLs and avoid repeat downloads. The links are snagged off a Safari page; the rest could be done more efficiently with URL Scripting. I would have written the rest with URL Scripting, but when it gets to that part of the script my Crapintosh tries to open the Classic version of URL Scripting and I can’t get it to use the OS X version :confused: <- Anyone know how I can get that back?


set PriorDownloads to {}

tell application "Safari"
	launch
	make new document at the beginning of documents
	set the URL of the front document to "http://www.kcrw.com/archive.html"
	delay 3
end tell

tell application "Safari"
	set HTMLCode to source of document 1
	close window 1
end tell

set AppleScript's text item delimiters to ASCII character 34
set HTMLBlocks to text items of HTMLCode
set AppleScript's text item delimiters to ""

repeat with ThisBlock in HTMLBlocks
	
	if ThisBlock contains "http:" and ThisBlock contains "/ram_wrap.cgi?/sc" then
		if ThisBlock is not in PriorDownloads then
			
			tell application "Safari"
				make new document at the beginning of documents
				set the URL of the front document to ThisBlock
				delay 2
				close window 1
			end tell
			set PriorDownloads to PriorDownloads & ThisBlock
		end if
	end if
end repeat

SC

Thanks!

Actually, there is a link on the main page - just below the one you clicked to go to the archives… where it says “click here to listen.”

I’ve run into a problem with the script. The first time I ran it, I got an error about HTMLCode, or something of the sort. I can’t reproduce it because I only got the error that first time; the second time I tried (to copy the error message) it actually worked fine. Not sure why that might be the case.

Thanks for your help. This will be great for a whole bunch of scripting purposes.

I will probably remove the “prior downloads” checking feature, since I’ll be running this from Audio Hijack which has its own scheduling features I can use to make sure the script is only run once a week.

Let me know if you figure out why the HTMLCode error pops up the first time the script is run.

Cheers!

I think I solved the problem with a longer delay. It may be that the page takes longer to load when there is no cache yet for that page.

Here is what I ended up with:

--Sounds Eclectic feed.
--Adapted by Luhmann from a script by sitcom
--http://bbs.applescript.net/message_send.php?id=6437&tid=13496

tell application "Safari"
	activate
	make new document at the beginning of documents
	set the URL of the front document to "http://www.kcrw.com/archive.html"
	delay 10
	set HTMLCode to source of document 1
end tell

set AppleScript's text item delimiters to ASCII character 34
set HTMLBlocks to text items of HTMLCode
set AppleScript's text item delimiters to ""

repeat with ThisBlock in HTMLBlocks
	if ThisBlock contains "http:" and ThisBlock contains "/ram_wrap.cgi?/sc" then
		tell application "Safari"
			open location ThisBlock
		end tell
	end if
end repeat

tell application "Audio Hijack"
	activate
end tell

Thanks again for your help!

I see now why you had the check for previous downloads - because it is an archive page, and there seems to be more than one entry for the same show.

I’ve solved that by replacing the URL in the script with the link to the individual frame of the original page which has the links:

http://www.soundseclectic.com/cgi-bin/db/kcrw.pl?tmplt_type=se_home

So the script now looks like this:

--Sounds Eclectic feed.
--Adapted by Luhmann from a script by sitcom
--http://bbs.applescript.net/message_send.php?id=6437&tid=13496

tell application "Safari"
	activate
	make new document at the beginning of documents
	set the URL of the front document to "http://www.soundseclectic.com/cgi-bin/db/kcrw.pl?tmplt_type=se_home"
	delay 10
	set HTMLCode to source of document 1
end tell

set AppleScript's text item delimiters to ASCII character 34
set HTMLBlocks to text items of HTMLCode
set AppleScript's text item delimiters to ""

repeat with ThisBlock in HTMLBlocks
	if ThisBlock contains "http:" and ThisBlock contains "/ram_wrap.cgi?/sc" then
		tell application "Safari"
			open location ThisBlock
		end tell
	end if
end repeat

tell application "Audio Hijack"
	activate
end tell

Very nice. I wish I had found the original page, because it contains all the HTML code. And yes, the delay allows the page to load before reading the HTML source. The reason you didn’t get the error the second time is that the page was already in your cache. If you have a slow connection you will need to set the delay accordingly, as I see you have.
I don’t know if it’s true in all cases, but I have had great success in parsing HTML for links with the “text item delimiters” block shown above. ASCII character 34 is the double-quote character ("), so splitting on it breaks the source into blocks at each quoted attribute. Then each block can be tested for conditions such as “if ThisBlock contains "http:"”.

Internet Explorer has a cool scripting feature, “ParseAnchor”. If the HTML code only contains the href data (“…/post.php?tid=13496”) and no fully qualified links, this command will combine the href with the server name “http://bbs.applescript.net/” to produce a fully qualified URL, “http://bbs.applescript.net/post.php?tid=13496”.
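(For anyone not using IE: the same relative-to-absolute resolution is available in Python’s standard library via `urljoin` - shown here as a sketch of the idea, using the thread’s own example URL.)

```python
from urllib.parse import urljoin

# ParseAnchor-style resolution: combine a relative href with the page's
# base URL to get a fully qualified link.
base = "http://bbs.applescript.net/"
href = "post.php?tid=13496"
print(urljoin(base, href))  # http://bbs.applescript.net/post.php?tid=13496
```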
SC

Thanks.

First time I’ve heard anyone say anything good about IE in a long time :slight_smile:

First, use the JavaScript “document.readyState” property to check whether the page has actually loaded, rather than relying on an arbitrary delay:

set the_URL to "http://www.soundseclectic.com/cgi-bin/db/kcrw.pl?tmplt_type=se_home"
tell application "Safari"
	make new document at beginning of documents
	tell document 1
		set URL to the_URL
	if not my wait_to_finish() then return display dialog "\"" & the_URL & "\" failed to load." buttons {"OK"} default button 1 with icon 2 giving up after 10
		set the_HTML to source
	end tell
end tell

on wait_to_finish()
	delay 2
	tell application "Safari"
		repeat with i from 1 to (2 * minutes)
			if (do JavaScript "document.readyState" in document 1) = "complete" then
				return true
			else
				delay 1
			end if
		end repeat
	end tell
	return false
end wait_to_finish

Second, why use Safari at all?

set the_URL to "http://www.soundseclectic.com/cgi-bin/db/kcrw.pl?tmplt_type=se_home"
set the_HTML to (do shell script "curl " & quoted form of the_URL)

The whole script can be simplified to:

set the_URL to "http://www.soundseclectic.com/cgi-bin/db/kcrw.pl?tmplt_type=se_home"
set the_URL to (do shell script "curl " & quoted form of the_URL & " | grep '/ram_wrap.cgi?/sc' | awk -F '\"' '{print $2}' | sed -e 's/ //g'")
do shell script "curl " & quoted form of the_URL & " >> /tmp/temp.ram; open /tmp/temp.ram"
tell application "Audio Hijack" to activate

Jon

What? You couldn’t get it down to 1 line of code? :slight_smile:

Just joking. I am impressed, and it seems more efficient/robust. Also, I learned a few new tricks from this. Thanks!

Actually, there is an error in what I posted. There should only be a single angle bracket to pipe the output to the file, not two angle brackets. Two would append the data to the file (if it exists) instead of overwriting it (as you want). So, it should be:

set the_URL to "http://www.soundseclectic.com/cgi-bin/db/kcrw.pl?tmplt_type=se_home"
set the_URL to (do shell script "curl " & quoted form of the_URL & " | grep '/ram_wrap.cgi?/sc' | awk -F '\"' '{print $2}' | sed -e 's/ //g'")
do shell script "curl " & quoted form of the_URL & " > /tmp/temp.ram; open /tmp/temp.ram"
tell application "Audio Hijack" to activate

And, assuming that “Audio Hijack” is in your Applications folder, here’s the (very long and bloated) one liner:

do shell script "URL=`curl http://www.soundseclectic.com/cgi-bin/db/kcrw.pl?tmplt_type=se_home | grep '/ram_wrap.cgi?/sc' | awk -F '\"' '{print $2}' | sed -e 's/ //g'`; curl $URL > /tmp/temp.ram; open /tmp/temp.ram; open '/Applications/Audio Hijack.app'"

Jon

Thanks for the update!

If it isn’t too much trouble, could you help me parse the code you wrote? I’m not very good with RegEx, so I don’t quite understand the awk and sed part.

The awk command parses the result using the double quote character as the delimiter (") and then extracts the second text item. So, it parses this:

<a href="http://someurl">click me!</a>

to this:

http://someurl

If the link was correctly coded like this example, then that’s all you would need. However, when testing this script on the actual content, the HTML coding had an error. Instead of the code being:

<a href="http://someurl">click me!</a>

it had an extraneous space at the beginning of the URL:

<a href=" http://someurl">click me!</a>

The sed command simply removes any spaces from the found text. Nothing too complicated.
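(The same two steps can be mimicked in Python, for anyone who finds that clearer than awk/sed: split on the double-quote character, take the second field, then strip spaces - mirroring `awk -F '"' '{print $2}'` and `sed -e 's/ //g'`.)

```python
# The buggy line from the actual page, with the extraneous leading space.
line = '<a href=" http://someurl">click me!</a>'

second_field = line.split('"')[1]    # ' http://someurl' (awk's $2)
url = second_field.replace(' ', '')  # strip spaces, like sed 's/ //g'
print(url)  # http://someurl
```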

Jon

Thanks again!