Saving source in Safari/WebKit

Hi folks,

Long-time developer (20+ years), but unfamiliar with AppleScript.

My ultimate goal is to scrape the source from a web page and save it to an HTML file. The reason I'm using AppleScript is that the file needs to be processed by a JavaScript engine. Here's my initial attempt:


tell application "WebKit"
	activate
	open location "http://google.com/"
	set page_source to source of front document as string
	delay 5
end tell

tell application "BBEdit"
	activate
	set E to make new document
	tell E
		set its text to (page_source)
		save E to "/Users/mrl/temp/123.html"
		close E
	end tell
	delay 5
end tell

I’m using BBEdit because it is supposed to have decent AppleScript support, but any mechanism to save the file to disk is acceptable. The essential flow is:

  1. open web page in a browser
  2. ‘View source’
  3. save source to disk
  4. Lather, rinse, repeat

One issue I'm having is that, seemingly at random, the BBEdit buffer is blank, so an empty HTML file is generated.
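
My guess is that I'm reading the source before the page has finished loading (in the attempt above, the delay 5 doesn't run until after the source is read, so it does nothing useful). Here's a rough sketch of a variant I've been experimenting with: it polls Safari until the source is non-empty and writes straight to disk, skipping BBEdit entirely. The 30-second cap and the output path are just placeholders:


tell application "Safari"
	activate
	open location "http://google.com/"
	set page_source to ""
	-- poll up to ~30 seconds for the page source to appear
	repeat 60 times
		delay 0.5
		try
			set page_source to source of front document
			if page_source is not missing value and page_source is not "" then exit repeat
		end try
	end repeat
end tell

-- write the captured source straight to disk, no editor needed
set f to open for access POSIX file "/Users/mrl/temp/123.html" with write permission
set eof f to 0 -- truncate any previous contents
write page_source to f as «class utf8»
close access f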

Any thoughts, suggestions, tips, fixes, refactoring, etc., are much appreciated. Bonus points for getting the URL and filename.html from a plain text input file.

Hi,

this does the same thing without any browser or text editor:


do shell script "/usr/bin/curl -L -o '/Users/mrl/temp/123.html' 'http://google.com'"

A batch-processing version, reading the URLs from a plain text file, one URL per line:


set inputFile to choose file
set destinationFolder to POSIX path of (path to home folder) & "temp/"
set theInput to paragraphs of (read inputFile)
-- split on "/" so that the last text item of each URL is its last path component
set {TID, text item delimiters} to {text item delimiters, "/"}
repeat with _source in theInput
	set _destination to destinationFolder & last text item of _source & ".html"
	do shell script "/usr/bin/curl -L -o " & quoted form of _destination & space & quoted form of _source
end repeat
set text item delimiters to TID -- restore the saved delimiters
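
For reference, the file name is derived from the last path component of each URL, so an input file like this (made-up example)


http://google.com
http://www.example.com/products


would produce google.com.html and products.html in ~/temp. Note that a URL with a trailing slash would yield an empty file name, so the list shouldn't contain any.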

Stefan, thanks for slapping me upside my head with a clue stick! :)

I had used curl initially, but I hadn't put quotes around the URLs. There were ampersands in the URLs, so I was curl'ing incomplete ones and hadn't noticed. Sigh, some days it just doesn't pay to get out of bed.
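
For anyone else who trips over this: an unquoted ampersand tells the shell to run everything before it in the background, so curl only ever sees the URL up to the first &. A contrived example:


-- broken: the shell splits the command at '&', so curl fetches only .../q?a=1
do shell script "/usr/bin/curl -L -o /tmp/q.html http://example.com/q?a=1&b=2"

-- quoted: curl receives the full URL
do shell script "/usr/bin/curl -L -o /tmp/q.html 'http://example.com/q?a=1&b=2'"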

Thanks again Stefan.