Extracting a URL from a specific section of a site

Help!

I am a total noob at this. I am trying to extract the link to the record label that’s in the middle of the page of the below URL (right under the google box).

http://www.allrecordlabels.com/db/8/19868.html

I can run a script to extract all of the URLs, but somehow would like to figure out how to grab just this link – because there are 24,000 pages of data I want to extract this information from.

Since I posted this I’ve been trying to figure out the find command in AppleScript – thinking that a duct tape approach is to first find the words “label:” because that precedes the URL link I am hoping to extract out of the HTML. So maybe if I find the word “label” on the page, then I can set it up to grab the HTML right after?

I don’t know – any help will be greatly appreciated.

Thank you!

Best,

Manish

Hi,

try this


set sourceText to do shell script "curl http://www.allrecordlabels.com/db/8/19868.html"
set {TID, text item delimiters} to {text item delimiters, "SiteSearch Google"}
set sourceText to text item 3 of sourceText
set text item delimiters to "a href=\""
set sourceText to text item 2 of sourceText
set text item delimiters to "\">"
set theLink to text item 1 of sourceText
set text item delimiters to TID
display dialog theLink
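
As a side note, the same delimiter steps can be wrapped in a handler so that a page without the marker gives back missing value instead of throwing an error. This is only a sketch: the handler name linkAfterMarker, its parameters and the sample HTML are made up for illustration, and it cuts the link at the closing quote rather than at "> as the script above does.

```applescript
-- Hypothetical helper: returns the first href value that follows the
-- given occurrence of marker, or missing value if the page does not match.
on linkAfterMarker(sourceText, marker, occurrence)
	set TID to AppleScript's text item delimiters
	try
		-- Keep only the text after the requested occurrence of the marker
		set AppleScript's text item delimiters to marker
		set afterMarker to text item (occurrence + 1) of sourceText
		-- Keep only the text after the first link opening
		set AppleScript's text item delimiters to "a href=\""
		set afterHref to text item 2 of afterMarker
		-- The link ends at the next double quote
		set AppleScript's text item delimiters to "\""
		set theLink to text item 1 of afterHref
	on error
		set theLink to missing value
	end try
	-- Always restore the delimiters, even after a failed lookup
	set AppleScript's text item delimiters to TID
	return theLink
end linkAfterMarker

-- Made-up sample; the real page is assumed to contain the marker twice before the link
set sampleHTML to "x SiteSearch Google y SiteSearch Google <b>label:</b> <a href=\"http://example.com/\">Example</a>"
linkAfterMarker(sampleHTML, "SiteSearch Google", 2)
```

With the page from the question, the call would be linkAfterMarker(sourceText, "SiteSearch Google", 2), which corresponds to text item 3 in the script above.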

Thanks so much - this is excellent. It certainly solves a huge part of the difficulty.

I don’t mean to trouble you - however, I have a follow-up question:

Is it possible to dynamically populate the curl statement from a list within a text file? Given that I have about 24,000 URLs that I need to run this through, copying and pasting into the statement would be impossible.

And finally, how would I get this to write the output to a separate text file, i.e., append it?

Thank you.

Best,

Moosh

UPDATE

I’ve figured out how to write the output to a text file.

The last part of it is setting up the input to the script as being a text file or list and enabling the curl target site to be swapped out with the subsequent item on the list (the next row), and so on, until it completes the list.

Thanks for your help!

This script assumes a plain text file as input with one URL per line.
The output is written to ~/Desktop/output.txt.


set counter to 0
set theURLs to paragraphs of (read (choose file))
set outputFilePath to ((path to desktop as text) & "output.txt")
try
	set ff to open for access file outputFilePath with write permission
	repeat with anURL in theURLs
		set sourceText to do shell script "curl " & quoted form of anURL
		set {TID, text item delimiters} to {text item delimiters, "SiteSearch Google"}
		set sourceText to text item 3 of sourceText
		set text item delimiters to "a href=\""
		set sourceText to text item 2 of sourceText
		set text item delimiters to "\">"
		set theLink to text item 1 of sourceText
		set text item delimiters to TID
		write theLink & return to ff starting at eof
		set counter to counter + 1
	end repeat
	close access ff
	display dialog (counter as text) & " links written to disk"
on error e
	set text item delimiters to {""}
	display dialog "An error occurred while parsing URL " & anURL & " (" & counter & ")" & return & e
	try
		close access file outputFilePath
	end try
end try


I am getting the following:

error “The variable anURL is not defined.” number -2753 from “anURL”

I changed the script to predefine the index variable.
Actually the script should work. I guess the plain text file has a different text encoding (the script expects MacRoman).


set counter to 0
set currentURL to ""
set theURLs to paragraphs of (read (choose file))
set outputFilePath to ((path to desktop as text) & "output.txt")
try
	set ff to open for access file outputFilePath with write permission
	repeat with anURL in theURLs
		set currentURL to contents of anURL
		set sourceText to do shell script "curl " & quoted form of currentURL
		set {TID, text item delimiters} to {text item delimiters, "SiteSearch Google"}
		set sourceText to text item 3 of sourceText
		set text item delimiters to "a href=\""
		set sourceText to text item 2 of sourceText
		set text item delimiters to "\">"
		set theLink to text item 1 of sourceText
		set text item delimiters to TID
		write theLink & return to ff starting at eof
		set counter to counter + 1
	end repeat
	close access ff
	display dialog (counter as text) & " links written to disk"
on error e
	set text item delimiters to {""}
	display dialog "An error occurred while parsing URL " & currentURL & " (" & counter & ")" & return & e
	try
		close access file outputFilePath
	end try
end try
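
If the encoding really is the problem, one option is to read the file explicitly as UTF-8 and to drop empty lines before looping, since a blank last line would otherwise be passed to curl as an empty URL. This is a sketch under the assumption that the list is saved as UTF-8; the handler name nonEmptyLines is made up for illustration.

```applescript
-- Hypothetical helper: splits text into lines and skips the empty ones
on nonEmptyLines(fileText)
	set cleaned to {}
	repeat with aLine in paragraphs of fileText
		set lineText to contents of aLine
		if lineText is not "" then set end of cleaned to lineText
	end repeat
	return cleaned
end nonEmptyLines

-- Usage in the script above (assumes the file is UTF-8 encoded):
-- set theURLs to nonEmptyLines(read (choose file) as «class utf8»)
```

paragraphs splits on CR, LF and CRLF alike, so the line endings of the file should not matter here.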


WOW

Thanks, that’s amazing. Totally does the job.

Thank you very much. I’m going to study this further, and this certainly is motivating to learn more about AppleScript.

Manish