Parse URL help

K3FKA · November 22, 2011, 10:24am

Fetch www.megarelease.net/category/720p with cURL.
Each URL of interest begins with “<a href="http://megarelease.net” and ends with “">”.
Retrieve only the URLs that contain my favourite TV shows: “The Walking Dead”, “Dexter” (basically any keywords).
Run the same process, only this time the links should start with “http://megaupload.com” and end with “.mkv” or “.avi” (to ensure it's a quality file).
Desired URLs can be queued for download in Speed Download (URLs separated by commas when added to the queue allow multiple downloads), or maybe a Growl notification could be sent to await confirmation… not sure yet.
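
The second pass and the comma-separated queue described above could be sketched in plain shell, which is what do shell script runs anyway. This is only a sketch under assumptions: the sample markup, the link IDs and the names sample_page, extract_video_links and video_links are all invented for illustration, and the real pages may be laid out differently.

```shell
# Invented sample of a post page; the real markup may differ.
sample_page='<a href="http://megaupload.com/?d=ABC123/Dexter.S06E08.720p.mkv">Download</a>
<a href="http://megaupload.com/?d=XYZ789/readme.txt">Notes</a>
<a href="http://megaupload.com/?d=DEF456/The.Walking.Dead.S02E06.avi">Download</a>'

# Print every quoted megaupload.com URL that ends in .mkv or .avi,
# then strip the surrounding double quotes.
extract_video_links() {
    grep -oE '"http://megaupload\.com[^"]*\.(mkv|avi)"' | tr -d '"'
}

# Join the matches into a single comma-separated line for a download queue.
video_links=$(printf '%s\n' "$sample_page" | extract_video_links | paste -sd, -)
echo "$video_links"
```

In the real script the printf would be replaced by a curl of each matching page, and the resulting string could be handed back to AppleScript as the result of do shell script.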

I would love some help cleaning up what I already have, which is obviously not even close. I'm stuck on identifying the URLs by keyword; I imagine I need to set up variables? I'm stuck and will probably remain that way even if I solve this step tonight, lol. Any advice is welcome.

set theMegaRelease to do shell script "curl http://megarelease.net/category/720p/" & " | grep '<h1><a href=' | cut -c 20-1000"
### This returns multiple links ending with "\">description of file"
### Could I use an offset of "\">" in ??? command instead of the commands that follow?
set theCleanURLs to extractBetween(theMegaRelease, "http://", "\">") as list
to extractBetween(SearchText, startText, endText)
	set tid to AppleScript's text item delimiters -- save them for later.
	set AppleScript's text item delimiters to startText -- find the first one.
	set theList to paragraphs of SearchText
	set AppleScript's text item delimiters to endText -- find the end one.
	set extracts to {}
	repeat with subText in theList
		if subText contains endText then
			copy text item 1 of subText to end of extracts
		end if
	end repeat
	set AppleScript's text item delimiters to tid -- back to original values.
	return extracts
end extractBetween
return theCleanURLs
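
On the question in the comments above, one possible alternative is to trim each line at the "> marker inside the shell pipeline itself, so no AppleScript handler is needed. A minimal sketch, assuming each line left by the cut step starts at the URL; the sample lines and the names raw_lines and clean_urls are invented for illustration.

```shell
# Invented sample of what the cut -c 20-1000 step might leave behind.
raw_lines='http://megarelease.net/dexter-s06e08-720p/">Dexter S06E08 720p</a></h1>
http://megarelease.net/the-walking-dead-s02e06/">The Walking Dead S02E06</a></h1>'

# Delete everything from the first "> to the end of each line.
clean_urls=$(printf '%s\n' "$raw_lines" | sed 's/">.*//')
echo "$clean_urls"
```

Appending the same sed command to the existing pipeline would return one clean URL per line, which paragraphs of can then split into a list.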

K3FKA · November 29, 2011, 9:57am

Bump… in case no one noticed, or am I being too vague? If so, I'm basically wanting to parse a webpage for any lines that begin with {http://www.websiteofchoice} and end with {">}, filtering the list for the lines that contain any keywords I define.

Thank you

StefanK · November 29, 2011, 10:18am

Hi,

To filter the URLs, try this:


set theMegaRelease to paragraphs of (do shell script "curl http://megarelease.net/category/720p/" & " | grep '<h1><a href=' | grep 'Dexter\\|The Walking Dead' | cut -d \\\" -f 2")

It returns a list of the URLs matching the keywords given in the second grep command.
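
For anyone who wants to try the filter without hitting the site, the same pipeline can be run against a saved snippet, with the keywords held in a variable. This is a sketch, not the exact command above: the sample markup and the names sample_html, keywords, filter_links and matches are invented, and grep -E is swapped in so the alternation can be written as a plain | instead of \|.

```shell
# Invented sample of the category page; the real markup may differ.
sample_html='<h1><a href="http://megarelease.net/dexter-s06e08-720p/">Dexter S06E08 720p</a></h1>
<h1><a href="http://megarelease.net/some-other-show-720p/">Some Other Show 720p</a></h1>
<h1><a href="http://megarelease.net/the-walking-dead-s02e06/">The Walking Dead S02E06</a></h1>
<p><a href="http://megarelease.net/about/">About Dexter fans</a></p>'

# Keywords as an extended-regex alternation; add another show with another |.
keywords='Dexter|The Walking Dead'

# Keep headline links, keep those matching the keywords,
# then take the text between the first pair of double quotes.
filter_links() {
    grep '<h1><a href=' | grep -E "$1" | cut -d '"' -f 2
}

matches=$(printf '%s\n' "$sample_html" | filter_links "$keywords")
echo "$matches"
```

The match is case-sensitive and runs against the whole line, so a keyword can hit either the link text or the URL itself; grep -i would relax the case if that turns out to matter.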
