I am a total noob at this. I am trying to extract the link to the record label that’s in the middle of the page at the URL below (right under the Google box).
I can run a script to extract all of the URLs, but I would like to figure out how to grab just this link – because there are 24,000 pages from which I want to extract this information.
Since I posted this I’ve been trying to figure out the find command in AppleScript – thinking that a duct tape approach is to first find the word “label:”, because it precedes the URL I am hoping to extract from the HTML. So maybe if I find the word “label” on the page, then I can set it up to grab the HTML right after?
I don’t know – any help will be greatly appreciated.
set sourceText to do shell script "curl http://www.allrecordlabels.com/db/8/19868.html"
set {TID, text item delimiters} to {text item delimiters, "SiteSearch Google"}
set sourceText to text item 3 of sourceText
set text item delimiters to "a href=\""
set sourceText to text item 2 of sourceText
set text item delimiters to "\">"
set theLink to text item 1 of sourceText
set text item delimiters to TID
display dialog theLink
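The “label:” idea from the question can be sketched with the same text item delimiters technique. This is only a rough sketch: the sample HTML below is made up for illustration (the real page markup may differ), and it assumes the first href after “label:” is the link you want. In a real run, sourceText would come from the curl call instead.

```applescript
-- Hypothetical sample markup; replace with the text returned by curl.
set sourceText to "<b>label:</b> <a href=\"http://www.example.com/\">Example Records</a>"

set TID to text item delimiters
try
	-- Keep everything after the first "label:" marker.
	set text item delimiters to "label:"
	if (count of text items of sourceText) < 2 then error "marker \"label:\" not found"
	set afterLabel to text item 2 of sourceText
	-- Keep everything after the next href attribute.
	set text item delimiters to "href=\""
	set afterHref to text item 2 of afterLabel
	-- Cut at the closing quote of the attribute value.
	set text item delimiters to "\""
	set theLink to text item 1 of afterHref
	set text item delimiters to TID
on error e
	-- Always restore the delimiters before passing the error on.
	set text item delimiters to TID
	error e
end try
display dialog theLink
```

Text item delimiters ignore case by default, so “Label:” would match as well. If “label:” also appears earlier on the page, the wrong link would be picked up, which is why anchoring on a more distinctive string can be safer.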
Thanks so much - this is excellent. It certainly solves a huge part of the difficulty.
I don’t mean to trouble you - however, I have a follow-up question:
Is it possible to dynamically populate the curl statement from a list within a text file? Given that I have about 24,000 URLs to run through this, copying and pasting each one into the statement would be impossible.
And finally, how would I get this to write the output to a separate text file, i.e., append it?
I’ve figured out how to write the output to a text file.
The last part of it is setting up the input to the script as being a text file or list and enabling the curl target site to be swapped out with the subsequent item on the list (the next row), and so on, until it completes the list.
This script assumes a plain text file as input, with one URL per line.
The output is written to ~/Desktop/output.txt.
set counter to 0
set theURLs to paragraphs of (read (choose file))
set outputFilePath to ((path to desktop as text) & "output.txt")
try
	set ff to open for access file outputFilePath with write permission
	repeat with anURL in theURLs
		-- fetch the page source
		set sourceText to do shell script "curl " & quoted form of anURL
		-- cut the source down to the link that follows the Google box
		set {TID, text item delimiters} to {text item delimiters, "SiteSearch Google"}
		set sourceText to text item 3 of sourceText
		set text item delimiters to "a href=\""
		set sourceText to text item 2 of sourceText
		set text item delimiters to "\">"
		set theLink to text item 1 of sourceText
		set text item delimiters to TID
		-- append the link to the output file
		write theLink & return to ff starting at eof
		set counter to counter + 1
	end repeat
	close access ff
	display dialog (counter as text) & " links written to disk"
on error e
	set text item delimiters to {""}
	display dialog "An error occurred while parsing URL " & anURL & " (" & counter & ")" & return & e
	try
		close access file outputFilePath
	end try
end try
I changed the script to predefine the index variable.
Actually, the script should work. I guess the plain text file has a different text encoding (the script expects MacRoman).
set counter to 0
set currentURL to ""
set theURLs to paragraphs of (read (choose file))
set outputFilePath to ((path to desktop as text) & "output.txt")
try
	set ff to open for access file outputFilePath with write permission
	repeat with anURL in theURLs
		-- dereference the loop variable so it can be reported on error
		set currentURL to contents of anURL
		-- fetch the page source
		set sourceText to do shell script "curl " & quoted form of currentURL
		-- cut the source down to the link that follows the Google box
		set {TID, text item delimiters} to {text item delimiters, "SiteSearch Google"}
		set sourceText to text item 3 of sourceText
		set text item delimiters to "a href=\""
		set sourceText to text item 2 of sourceText
		set text item delimiters to "\">"
		set theLink to text item 1 of sourceText
		set text item delimiters to TID
		-- append the link to the output file
		write theLink & return to ff starting at eof
		set counter to counter + 1
	end repeat
	close access ff
	display dialog (counter as text) & " links written to disk"
on error e
	set text item delimiters to {""}
	display dialog "An error occurred while parsing URL " & currentURL & " (" & counter & ")" & return & e
	try
		close access file outputFilePath
	end try
end try
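If the encoding guess is right, one option is to tell read which encoding to use. The sketch below assumes the URL list was saved as UTF-8 (an assumption; check how your editor saved the file). It also skips blank lines, such as one left by a trailing newline, since an empty URL would likely make curl fail. The name cleanURLs is made up for illustration.

```applescript
-- Read the URL list as UTF-8 instead of the default encoding.
set theURLs to paragraphs of (read (choose file) as «class utf8»)

-- Collect only the non-empty lines.
set cleanURLs to {}
repeat with anURL in theURLs
	set currentURL to contents of anURL
	if currentURL is not "" then set end of cleanURLs to currentURL
end repeat

display dialog ((count of cleanURLs) as text) & " URLs read"
```

To use it, replace the line that sets theURLs in the script above with these lines and loop over cleanURLs instead.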