Get Text from Webpage - AppleScript

What is the AppleScript equivalent of the Automator action “Get Text from Webpage”?

I was using curl, but it has recently started returning a “This Document has Moved to here” with a link to the original page. Maybe someone doesn’t like me requesting their HTML source?

Thank you.

Hi,

Try curl with the -L switch, which follows redirects.
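
A minimal sketch of that call from AppleScript, assuming the address is held in a variable named site (the example.com URL is only a placeholder, not the page from the question):

```applescript
-- Placeholder URL, for illustration only
set site to "http://example.com/"

-- -L follows redirects, -s suppresses the progress meter;
-- quoted form keeps characters such as ? and & away from the shell
set pageSource to do shell script "curl -sL " & quoted form of site
```

pageSource then holds the HTML of the page the redirect finally points to, rather than the “Document has Moved” stub.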

Thanks StefanK! That totally worked for getting around the redirect.

I think I’m still looking for a text-based “print preview” version of the webpage, because I have a lot of predefined tags I’m using to parse data. With curl I end up having to strip a lot of HTML.

Is there really no AppleScript equivalent of “Get Text from Webpage”? Is it easier to convert the HTML to text after I’ve done the curl?

Here is what I use:

do shell script "curl " & quoted form of site & " | textutil -stdin -stdout -format html -convert txt -encoding UTF-8"
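
The same pipeline can be wrapped in a handler so it is reusable. This is only a sketch: the handler name textFromWebpage and the example.com URL are assumptions for illustration, and the -L switch is added to cope with the redirect mentioned earlier in the thread.

```applescript
on textFromWebpage(site)
	-- curl fetches the page (following redirects, no progress meter);
	-- textutil reads the HTML on stdin and writes plain UTF-8 text to stdout
	set cmd to "curl -sL " & quoted form of site & " | textutil -stdin -stdout -format html -convert txt -encoding UTF-8"
	-- By default do shell script turns linefeeds into returns;
	-- "without altering line endings" keeps the output as textutil wrote it
	return do shell script cmd without altering line endings
end textFromWebpage

-- Placeholder URL, for illustration only
set pageText to textFromWebpage("http://example.com/")
```

Whether to keep the original line endings is a matter of taste. If the result is split with "paragraphs of", either form works.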

Hello

You can get the source of a Safari webpage, but the result would be the same as the result of the curl command.


property newline : character id 10 -- tinkered to use id instead of deprecated ASCII character

tell application "Safari"
	-- The following line collects info used by the Terminal editors
	set myURL to the URL of document 1 -- "as string" removed: URL is already text as of AS 2.0
	
	-- Retrieve the source from the browser
	set mySource to the source of document 1 -- "as text" (was "as string") removed: unnecessary coercion as of AS 2.0
end tell

This is shamelessly stolen from Daniel S. Rubins get browser source scripts (dan@webgraph.com)
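
As far as I know, Safari’s dictionary also lists a text property on documents, next to source, which returns the rendered text of the page without the markup. A minimal sketch, assuming the page is already loaded in the front document (the variable name pageText is just an illustration):

```applescript
tell application "Safari"
	-- "text" is the page's visible text, "source" is its raw HTML
	set pageText to the text of document 1
end tell
```

That may be closer to what “Get Text from Webpage” gives you than anything built on the source, but it does require the page to be open in Safari.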

I’m sure there is somebody out there who has some great scripts for stripping away the header and such until you get to the body, which is what you are really interested in.

I believe that you would really need a “possessive” version of the extractBetween script from this post, one that takes nested tags between the start and end tags into account, so that you could parse the source hierarchically. By possessive in this context I mean that it would return all the text between the start tag and its matching end tag, excluding the searched-for tags themselves. It should also take “malformed” pages into consideration, for instance that several tags can appear on the same line.
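
A rough sketch of what such a handler could look like. The name extractNested and its parameters are my own assumptions, not the extractBetween script from that post. It only counts nesting depth and does no real HTML parsing, so "<div" would also match "<divider", and comments or script blocks are not handled.

```applescript
on extractNested(src, openTag, closeTag)
	-- openTag is given without its ">", e.g. "<div", so attributes are allowed
	set firstOpen to offset of openTag in src
	if firstOpen is 0 then return ""
	-- The content starts after the ">" that ends the opening tag
	set fromOpen to text firstOpen thru -1 of src
	set gtPos to offset of ">" in fromOpen
	if gtPos is 0 or gtPos = (length of fromOpen) then return ""
	set inner to text (gtPos + 1) thru -1 of fromOpen
	set depth to 1 -- we are inside one open tag
	set pos to 1 -- scan position within inner
	repeat while pos ≤ (length of inner)
		set remainder to text pos thru -1 of inner
		set nextClose to offset of closeTag in remainder
		if nextClose is 0 then return "" -- unbalanced markup
		set nextOpen to offset of openTag in remainder
		if nextOpen is not 0 and nextOpen < nextClose then
			-- A nested start tag comes first: go one level deeper
			set depth to depth + 1
			set pos to pos + nextOpen - 1 + (length of openTag)
		else
			set depth to depth - 1
			if depth is 0 then
				-- This end tag matches the first start tag
				set lastChar to pos + nextClose - 2
				if lastChar < 1 then return ""
				return text 1 thru lastChar of inner
			end if
			set pos to pos + nextClose - 1 + (length of closeTag)
		end if
	end repeat
	return ""
end extractNested

extractNested("<div>a<div>b</div>c</div>d", "<div", "</div>")
-- returns "a<div>b</div>c"
```

Because it scans the text as one string, it does not matter how many tags appear on the same line.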

This is a topic I really know far too little about, so for the fun of it I tried apropos html in a Terminal window,
and voilà: I found a command named htmlparse. But that package is for people using Tcl, and I’m not one of those. Maybe someone else can help you with this issue.

I’m sure there are some handlers for parsing HTML with AppleScript here that could suit you; try this query if you don’t know exactly what you are looking for.

Best Regards

McUsr

two unnecessary coercions

From the dictionary:
URL get/set unicode text The current URL of the document.
source get unicode text The HTML source of the web page currently loaded in the document.

By the way: Since AppleScript 2 there is a constant linefeed for character id 10

Thanks Stefan

I’ll correct it in the post above, that is, omit the coercions in both cases.

It is so useful when you remind me of this; it makes the matter stick.
Hopefully you won’t have to do this all of the time. But please don’t stop.

Best Regards

McUsr