do shell script and urls

BHD · September 14, 2004, 9:56am

I need a simple script to take an article from a newsreader, and download it with wget (or curl). I know pretty much nothing about AS, but here’s where I’m at:


-- get url for selected article
tell application "NetNewsWire"
	set article_url to (URL of selectedHeadline)
end tell

-- pass the url to wget for download
tell application "Terminal"
	do shell script "cd ~/Documents/News/downloads; /sw/bin/wget '" & article_url & "'"
end tell

The problem is that in some important cases the URL from the article has a bunch of cruft added to the end that doesn’t allow me to get the specific (printer-friendly) page I want. How can I take the URL as a variable, but remove everything after the “.html” in AppleScript?

That way I could just concatenate the last bit I need myself. And while I’m at it, how do I only concatenate this if the URL contains a certain domain?

Finally, if I use curl instead, the commandline syntax places a command within quotes. How would I handle that with do shell script?

Rob · September 15, 2004, 12:37am

NetNewsWire has a decent AppleScript interface which might allow you to extract the desired data without downloading it again. Can you explain why you want to do what you have proposed?

– Rob

BHD · September 15, 2004, 2:04am

I don’t want to extract information; I want the complete file. I then convert to XHTML using Tidy, and subsequently process to get clean semantic code. It is easiest to do this with the cleanest source HTML, which is generally the “printer” version of stories.

I got some help elsewhere on this, but there’s something wrong with the code below, as curl is complaining about a missing URL.

The other problem is I don’t know how to handle what I’m trying to do with the article_title variable, which is to take the first three space-separated tokens and either remove the spaces, or replace them with underscores. Then, I need to use that strippedTitle variable to create the curl command -o=“strippedTitle.html”.


-- get url for selected article
tell application "NetNewsWire"
	set article_url to (URL of selectedHeadline)
	set article_title to (title of selectedHeadline)
end tell

-- save original text item delims:
set oldDelims to AppleScript's text item delimiters

-- set text item delims for url:
set AppleScript's text item delimiters to {".html"}

-- extract the sub-string you're interested in:
set strippedURL to item 1 of (every text item of article_url)

-- restore original text item delims:
set AppleScript's text item delimiters to oldDelims

-- set text item delims for titles:
set AppleScript's text item delimiters to {" "}

-- extract the sub-string you're interested in:
set strippedTitle to item 1 of (every text item of article_title)

-- restore original text item delims:
set AppleScript's text item delimiters to oldDelims

if strippedURL contains "nytimes" then
	set strippedURL to strippedURL & "?hp=&pagewanted=print"
end if

-- pass the url to wget for download
tell application "Terminal"
	do shell script "cd ~/Documents/News/downloads; /usr/bin/curl -e=\"http://www.google.com\"'" & strippedURL & "'"
end tell
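
One likely cause of the "missing URL" complaint: in the command above there is no space between the referer value and the URL, so curl would see both as a single argument to -e (curl's short options also normally take their value after a space rather than "="). A minimal sketch of one way to build the command, letting `quoted form of` handle the shell quoting. The sample URL and output file name are made-up placeholders, and the download directory is assumed to exist:

```applescript
-- Hypothetical sample values standing in for the NetNewsWire results
set strippedURL to "http://www.example.com/story.html?pagewanted=print"
set outFile to "story.html"

-- quoted form wraps each value in single quotes for the shell,
-- so "?" and "&" in the URL are passed through literally
set cmd to "cd ~/Documents/News/downloads && /usr/bin/curl"
set cmd to cmd & " -e " & (quoted form of "http://www.google.com")
set cmd to cmd & " -o " & (quoted form of outFile)
set cmd to cmd & " " & (quoted form of strippedURL)

-- do shell script belongs to Standard Additions, so no Terminal tell block is needed
do shell script cmd
```

Building the string in steps keeps each space between arguments visible, which is where the original command went wrong.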

jobu · September 15, 2004, 4:30am

Assuming the URL is split with a “?” after the page (e.g. http://www.site.com/page.html?x=1&y=234), the following should work for you. I’ve got it configured to save to my desktop, so make sure to change the path in the ‘theShellScript’ variable to wherever you want the page saved. You weren’t very clear on what you WANTED, as there’s talk of so many possibilities, so I worked it all in. This script makes sure there are enough words in the title before trying to make a page name out of it. If there are at least three words (two spaces), it builds the name from the first three, joined with underscores. Otherwise, it just uses the name off the server. If the page’s name on the server is rarely the same, you might want to use the server name all the time, to avoid any unforeseen problems with making a name dynamically out of the page’s title. I don’t have the ‘NetNewsWire’ app, so I can’t test this, but it worked with a static address and title.

tell application "NetNewsWire"
	set article_url to (URL of selectedHeadline)
	set article_title to (title of selectedHeadline)
end tell 

set delim to AppleScript's text item delimiters

--> Get the cleaned URL
set AppleScript's text item delimiters to "?"
set theUrl to text item 1 of article_url

--> Get the saved page name from title or server page name
set AppleScript's text item delimiters to " "
if (count text items of article_title) > 2 then
	set thePage to text items 1 through 3 of article_title
	set AppleScript's text item delimiters to "_"
	set thePage to ((thePage as string) & ".html")
else
	set AppleScript's text item delimiters to "/"
	set thePage to text item -1 of theUrl
end if

set AppleScript's text item delimiters to delim

if theUrl contains "nytimes" then
	set theUrl to (theUrl & "?hp=&pagewanted=print") as string
end if 

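--> quoted form protects spaces in the path and the "?" and "&" in the URL from the shell
set theShellScript to "curl -o " & (quoted form of ("/Users/jed/Desktop/" & thePage)) & " " & (quoted form of theUrl)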
do shell script theShellScript

j
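
The original question asked about cutting the URL right after “.html” rather than at the “?”. A minimal sketch of that variant, assuming the standard `offset` command from Standard Additions. The handler name trimAfterHTML and the sample URL are made up for illustration. Unlike taking text item 1 with “.html” as the delimiter, this keeps the “.html” itself in the result:

```applescript
-- Return everything up to and including the first ".html" in theURL;
-- if ".html" does not occur, return theURL unchanged.
on trimAfterHTML(theURL)
	set htmlExt to ".html"
	set pos to offset of htmlExt in theURL
	if pos is 0 then return theURL
	return text 1 thru (pos + (length of htmlExt) - 1) of theURL
end trimAfterHTML

-- Hypothetical sample URL with trailing cruft
set sampleURL to "http://www.example.com/2004/09/14/story.html?ex=123&en=abc"
set cleanURL to trimAfterHTML(sampleURL)

-- Append the print suffix only when the URL is from the domain of interest
if cleanURL contains "nytimes" then
	set cleanURL to cleanURL & "?hp=&pagewanted=print"
end if

cleanURL
```

With the sample URL above, the domain check does not match, so the result is just the address ending in “story.html”.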