extracting text from a web page into a .txt file

Hi,
I’m new to AppleScript, but having read some posts on this forum, I’m sure I can use it for my task…

I’d like to be able to select text from a web page (loaded in Safari if needed), and save that text as a .txt file on my desktop. I don’t need any info such as timestamps, dates, URLs etc. in the .txt file, just the words that are on the web page.

Ideally, I’d like the script to run automatically, say every 10 minutes (assuming that I’ll be taking text from a blog or something that is being constantly updated), overwriting the .txt file each time. Basically, I’m setting up a multimedia application that reads a .txt file and uses its contents to create moving image sequences etc., so if possible I need the file name to be fixed, something like ‘blog.txt’, rather than the script deciding what to call it.

A huge plus would be to somehow control what text the script grabs, i.e. a specific frame of the web page, or maybe just the first 10 lines or something.

Any advice would be great,
Many thanks
James

Hi James,

This is certainly possible, but the way to do it depends on the source of the web page.
If the text can be fetched with the curl shell command, a web browser is not needed.
If the page builds its content in the browser (a lot of JavaScript, for instance), you must read the text from the open browser window.

Do you have an example?

Great, thanks.

I haven’t got an example of the web page yet, as I’m planning on setting up the web page once I know how simple the script needs it to be, if that makes sense!

I’m planning on customising a Blogger page (so people can post text on the site, which can eventually be fed into the multimedia application). With a bit of code changing, I should be able to strip the Blogger page down to just text (get rid of all the content that I don’t want in the multimedia application etc.)

I ‘think’ Blogger uses PHP, JavaScript etc., so I guess I need to have a browser window open. Even if I end up using a different web page, there’s a fair chance there’ll be PHP etc., so maybe it’s best to go down this route.

Would that mean that somewhere in the script it would need to tell Safari to refresh the page, to ensure the most current content is captured?

Thanks for your time
James

How about first setting up the site and then asking again for a way to grab the text? :wink:
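
That said, in case the browser route turns out to be necessary, here is a rough sketch of it, including the reload you asked about. It assumes Safari is already running with the page in its front document, and the 5 second wait is an arbitrary guess rather than a real "page finished loading" check.

```applescript
-- Sketch only: assumes Safari is running and the blog page is its front document.
tell application "Safari"
	set pageURL to URL of document 1
	set URL of document 1 to pageURL -- assigning the URL again makes Safari load the page afresh
	delay 5 -- crude wait for the page to load (arbitrary value, adjust as needed)
	set pageText to text of document 1 -- the visible text of the page, without HTML markup
end tell
pageText
```

The result is one long string, so it would still need to be trimmed down to the part you want before writing it to a file.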

OK! As an example though, how about this site:

http://3l-project.blogspot.com/2007/01/conversation-with-places-northampton.html

It’s pretty typical of the site that I’m planning on ending up with (although the web address probably won’t be as long!)

The text I’d like to capture, ideally, would be the left-hand column, the blog post.

thanks again

Thanks, this can be read (quite) easily with curl. The script then uses another shell command, textutil, to convert the HTML to text.


set temp to ((path to temporary items as Unicode text) & "htmlTemp.txt") -- define tempfile
set Ptemp to quoted form of POSIX path of temp
do shell script "curl http://3l-project.blogspot.com/2007/01/conversation-with-places-northampton.html -o " & Ptemp
do shell script "textutil -format html -inputencoding UTF-8 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile
set newText to {}
repeat with i in theText
	tell contents of i
		if it contains "COMMENTS" then exit repeat
		if it is not "" then set end of newText to it
	end tell
end repeat
set {TID, text item delimiters} to {text item delimiters, return}
set newText to newText as Unicode text
set text item delimiters to TID
display dialog newText
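
If you would rather keep just the first few lines (you mentioned the first 10) instead of everything up to "COMMENTS", something along these lines might do. The handler name firstLines and its parameters are made up for this example; they are not part of the script above.

```applescript
-- Hypothetical helper (not part of the script above):
-- returns at most maxCount non-empty entries from a list of paragraphs.
on firstLines(theParagraphs, maxCount)
	set kept to {}
	repeat with p in theParagraphs
		if (count of kept) >= maxCount then exit repeat
		set thisLine to contents of p -- dereference the loop variable
		if thisLine is not "" then set end of kept to thisLine
	end repeat
	return kept
end firstLines

-- Small self-contained demo: four paragraphs, one of them empty.
set sampleText to "one" & return & "" & return & "two" & return & "three"
set demoLines to firstLines(paragraphs of sampleText, 2)
```

In the script above you would call it as firstLines(theText, 10) in place of the repeat loop; because it stops at the end of the list, a page with fewer than 10 lines is not a problem.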

Great, thanks, I’ll give that a run tonight and let you know how it goes (I’m away from my Mac at the minute).

Is there a line of code I could put in that runs the function automatically, say every 10 minutes, and overwrites the contents of the file with the new text grabbed from the HTML doc?

The line of code: if it contains “COMMENTS” then exit repeat: is this telling the script to stop capturing the text at the end of the Blogger post (last word being ‘comments’)? If so, very cool! This will help a lot…

thanks again for your time and help

Exactly.

Here is a version with error handling, an internet connection check and a write-to-text-file routine.
Save the script as a stay-open application; its idle handler will then run again every 10 minutes.

on idle
	if chkUP("http://www.apple.com") or chkUP("http://www.google.com") then
		try
			set textFile to ((path to desktop as Unicode text) & "blog.txt")
			set temp to ((path to temporary items as Unicode text) & "htmlTemp.txt") -- define tempfile
			set Ptemp to quoted form of POSIX path of temp
			do shell script "curl http://3l-project.blogspot.com/2007/01/conversation-with-places-northampton.html -o " & Ptemp
			do shell script "textutil -format html -inputencoding UTF-8 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
			set theText to paragraphs of (read file temp as Unicode text)
			do shell script "rm " & Ptemp -- delete tempfile
			set newText to {}
			repeat with i in theText
				tell contents of i
					if it contains "COMMENTS" then exit repeat
					if it is not "" then set end of newText to it
				end tell
			end repeat
			set {TID, text item delimiters} to {text item delimiters, return}
			set newText to newText as text
			set text item delimiters to TID
			write_to_disk from newText into textFile
		end try
	end if
	return 10 * minutes
end idle

on chkUP(theURL)
	return (count (get ((theURL as URL)'s host & {dotted decimal form:""})'s dotted decimal form)) > 0
end chkUP

on write_to_disk from theData into target
	try
		set ff to open for access file target with write permission
		set eof of ff to 0
		write theData to ff
		close access ff
	on error
		try
			close access file target
		end try
	end try
end write_to_disk

Hi Stefan, that’s really cool how you used textutil to convert the text. That gives me a whole new method to think about when parsing web pages or other kinds of documents. A question: where did you learn about the different text encodings, i.e. UTF-8 vs UTF-16 etc.? Is there any place I can learn about them, like a beginner’s guide or something? I noticed from the HTML code of that web page that it was encoded using UTF-8; is that standard for all web pages? I also noticed by playing around with textutil that the default output encoding is UTF-8, but AppleScript didn’t seem to like that. Does AppleScript have something against UTF-8?

I see one spelling mistake in your script. In your on idle handler you’ll want to correct the spelling of minutes.

google :wink:

There is no standard, but you could read the header and find out the text encoding.
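
As a rough illustration of reading the header: curl can fetch just the response headers, and the charset (if the server declares one) sits in the Content-Type line. The handler name charsetFromHeader is invented for this sketch, and it assumes the usual "charset=..." form of that header.

```applescript
-- Hypothetical helper: pull the charset out of a Content-Type header line,
-- e.g. "Content-Type: text/html; charset=UTF-8".
-- Returns missing value if no charset is declared.
on charsetFromHeader(headerLine)
	set TID to text item delimiters
	set text item delimiters to "charset="
	set parts to text items of headerLine
	set text item delimiters to TID
	if (count of parts) < 2 then return missing value
	set tail to item 2 of parts
	if tail is "" then return missing value
	-- drop any further parameters that follow the charset value
	set text item delimiters to ";"
	set charsetName to text item 1 of tail
	set text item delimiters to TID
	return charsetName
end charsetFromHeader

-- Fetch only the headers (-I), silently (-s), for the page used in this thread.
set theURL to "http://3l-project.blogspot.com/2007/01/conversation-with-places-northampton.html"
set pageCharset to missing value
try
	set headerLines to paragraphs of (do shell script "curl -sI " & quoted form of theURL)
	repeat with h in headerLines
		set thisLine to contents of h
		if thisLine starts with "Content-Type:" then set pageCharset to charsetFromHeader(thisLine)
	end repeat
end try
pageCharset
```

If the server does not declare a charset, pageCharset stays missing value, and the next place to look would be a meta tag in the HTML itself.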

Yes and no. The “native” text encoding of AppleScript is MacRoman, but with some tricks you can also handle UTF-8 and UTF-16.
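
One of those tricks, as a small sketch: the read and write commands accept "as «class utf8»" to force UTF-8 instead of the default encoding. The file name utf8Demo.txt is just a throwaway name for this example.

```applescript
-- Round trip through a temporary file using UTF-8 explicitly.
-- "utf8Demo.txt" is an arbitrary demo file name.
set demoPath to (path to temporary items as text) & "utf8Demo.txt"
set ff to open for access file demoPath with write permission
try
	set eof of ff to 0 -- empty the file before writing
	write "café" to ff as «class utf8»
	close access ff
on error errMsg
	close access ff
	error errMsg
end try
set roundTrip to read file demoPath as «class utf8»
do shell script "rm " & quoted form of POSIX path of demoPath -- delete the demo file
roundTrip
```

So if textutil is left at its default UTF-8 output, reading the result "as «class utf8»" should work in place of the UTF-16 / Unicode text combination used in the script.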

:lol:, thanks, I fixed that

Amazing, it does exactly what I hoped it would!
Thanks so much for your time on this one.
It’s definitely got me interested in learning more about AppleScript. I’ve been looking into a lot of ways of achieving this, and this is way better than I could’ve aimed for.
Cool
Thanks again