Hi,
I’m new to AppleScript, but having read some posts on this forum, I’m sure I can use it for my task…
I’d like to be able to select text from a web page (loaded in Safari if needed), and save that text as a .txt file on my desktop. I don’t need any info such as timestamps, dates, URLs etc. in the .txt file, just the words that are on the web page.
Ideally, I’d like the script to run automatically, say every 10 minutes (assuming that I’ll be taking text from a blog or something that is being constantly updated), overwriting the .txt file each time. Basically, I’m setting up a multimedia application that reads a .txt file and uses its contents to create moving image sequences etc., so if possible I need the file name to be something fixed like ‘blog.txt’, rather than the script deciding what to call it.
A huge plus would be to somehow control what text the script grabs, i.e. a specified frame of the webpage, or maybe just the first 10 lines or something.
This is certainly possible, but the way to do it depends on the source of the webpage.
If the text can be fetched with the curl shell command, a web browser is not needed.
If the content is generated dynamically (a lot of JavaScript, for example; PHP that runs on the server is normally not a problem for curl), you must read the text from the open browser window.
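For the browser route, a rough sketch might look like the following. It assumes Safari is already running with the page loaded in its front document, and the 5-second delay is only a guess at the load time, not a real "page has finished loading" check.

```applescript
-- Sketch only: assumes Safari is running and document 1 shows the page you want
tell application "Safari"
    -- setting the URL to itself asks Safari to load the page again
    set pageURL to URL of document 1
    set URL of document 1 to pageURL
    delay 5 -- crude wait; assumed to be long enough for the page to load
    set pageText to text of document 1 -- the page's text, without the HTML tags
end tell
pageText
```

Reading the text property of the document returns what the browser has rendered, so it also covers pages whose content is filled in by scripts after loading.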
I haven’t got an example of the web page yet, as I’m planning on setting up the web page once I know how simple the script needs it to be, if that makes sense!
I’m planning on customising a Blogger page (so people can post text on the site, which can eventually be fed into the multimedia application). With a bit of code changing, I should be able to strip the Blogger page down to just text (get rid of all the content that I don’t want in the multimedia application etc.).
I ‘think’ Blogger uses PHP, JavaScript etc., so I guess I need to have a browser window open. Although if I end up using a different webpage, there’s a fair chance there’ll be PHP etc., so maybe it’s best to go down this route.
Would that mean that somewhere in the script it would need to tell Safari to refresh the page, to ensure the most current content is captured?
Thanks. This can be read (quite) easily with curl; the script then uses another shell command, textutil, to convert the HTML to text.
set temp to ((path to temporary items as Unicode text) & "htmlTemp.txt") -- define tempfile
set Ptemp to quoted form of POSIX path of temp
-- download the page source into the tempfile
do shell script "curl http://3l-project.blogspot.com/2007/01/conversation-with-places-northampton.html -o " & Ptemp
do shell script "textutil -format html -inputencoding UTF-8 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
set theText to paragraphs of (read file temp as Unicode text)
do shell script "rm " & Ptemp -- delete tempfile

-- keep the non-empty paragraphs up to the first one containing "COMMENTS"
set newText to {}
repeat with i in theText
    tell contents of i
        if it contains "COMMENTS" then exit repeat
        if it is not "" then set end of newText to it
    end tell
end repeat

-- join the list into one text, one paragraph per line
set {TID, text item delimiters} to {text item delimiters, return}
set newText to newText as Unicode text
set text item delimiters to TID
display dialog newText
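If only the first few lines are wanted (the "first 10 lines" idea from the original question), the list could be trimmed before it is joined. This is a minimal sketch: the sample list and the maxLines variable are made up for illustration, and in the real script newText would be the list built by the repeat loop.

```applescript
-- hypothetical sample data standing in for the filtered paragraphs
set newText to {"line 1", "line 2", "line 3", "line 4"}
set maxLines to 3 -- assumed limit; use 10 for "the first 10 lines"

-- only trim when the list is longer than the limit, otherwise "items 1 thru" would error
if (count of newText) > maxLines then
    set newText to items 1 thru maxLines of newText
end if

-- join the remaining lines exactly as in the script above
set {TID, text item delimiters} to {text item delimiters, return}
set newText to newText as Unicode text
set text item delimiters to TID
newText
```

The count check matters because asking for items 1 thru 10 of a shorter list raises an error rather than returning what is there.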
Great, thanks, I’ll give that a run tonight and let you know how it goes (I’m away from my Mac at the minute).
Is there a line of code I could put in that runs the function automatically, say every 10 minutes, and overwrites the contents of the file with the new text grabbed from the HTML doc?
The line of code: if it contains “COMMENTS” then exit repeat. Is this telling the script to stop capturing the text at the end of the Blogger post (the last word being ‘comments’)? If so, very cool! This will help a lot…
Here is a version with error handling, an internet connection check and a write-to-textfile routine.
Save the script as a stay-open application; its idle handler will then run every 10 minutes.
on idle
    -- only fetch when at least one of the test hosts can be resolved
    if chkUP("http://www.apple.com") or chkUP("http://www.google.com") then
        try
            set textFile to ((path to desktop as Unicode text) & "blog.txt")
            set temp to ((path to temporary items as Unicode text) & "htmlTemp.txt") -- define tempfile
            set Ptemp to quoted form of POSIX path of temp
            do shell script "curl http://3l-project.blogspot.com/2007/01/conversation-with-places-northampton.html -o " & Ptemp
            do shell script "textutil -format html -inputencoding UTF-8 -convert txt -encoding UTF-16 " & Ptemp -- convert html to txt
            set theText to paragraphs of (read file temp as Unicode text)
            do shell script "rm " & Ptemp -- delete tempfile
            -- keep the non-empty paragraphs up to the first one containing "COMMENTS"
            set newText to {}
            repeat with i in theText
                tell contents of i
                    if it contains "COMMENTS" then exit repeat
                    if it is not "" then set end of newText to it
                end tell
            end repeat
            set {TID, text item delimiters} to {text item delimiters, return}
            set newText to newText as text
            set text item delimiters to TID
            write_to_disk from newText into textFile
        end try
    end if
    return 10 * minutes -- seconds until the idle handler is called again
end idle

on chkUP(theURL)
    -- true if the host of theURL resolves to an IP address
    return (count (get ((theURL as URL)'s host & {dotted decimal form:""})'s dotted decimal form)) > 0
end chkUP

on write_to_disk from theData into target
    try
        set ff to open for access file target with write permission
        set eof of ff to 0 -- empty the file so old text is overwritten
        write theData to ff
        close access ff
    on error
        -- make sure the file is not left open if something went wrong
        try
            close access file target
        end try
    end try
end write_to_disk
Hi Stefan, that’s really cool how you used textutil to convert the text. That gives me a whole new method to think about when parsing web pages or other kinds of documents. A question: where did you learn about the different text encodings, i.e. UTF-8 vs UTF-16 etc.? Is there any place I can learn about them, like a beginner’s guide or something? I noticed from the HTML code of that web page that it was encoded using UTF-8; is that standard for all web pages? I also noticed by playing around with textutil that the default output encoding is UTF-8, but AppleScript didn’t seem to like that. Does AppleScript have something against UTF-8?
I see one spelling mistake in your script. In your on idle handler you’ll want to correct the spelling of minutes.
Amazing, it does exactly what I hoped it would!
Thanks so much for your time on this one.
It’s definitely got me interested in learning more about AppleScript. I’ve been looking into a lot of ways of achieving this, and this is way better than I could’ve aimed for.
cool
thanks again