I wonder whether it is possible to do the following with AppleScript:
1st Open a web page in Safari from a text file listing the various URLs (easy)
2nd Open the “source” of the web page in Safari and get rid of all the HTML code (can’t work it out)
However, I’d like to keep the references with their URLs if they refer to images, i.e. those that follow this exact format and file extension: http://thisurl.here/0001.jpg
The page might contain more than one .jpg reference, and I’d like to keep all the URL references of that kind using a prefix such as
3rd Paste the text content into a FileMaker Pro text field
4th Copy the image links into another text field
As my students’ pictures are never more than 10, I can create ten fields in the database and populate them with the script, or move to the next record
5th Once the record is filled, move to the next URL in the text file and repeat the process until the end
I’ve sort of done this already in TextWrangler using “grep” find and replace, but it is very time-consuming
This snippet saves all images of the front document window whose URL starts with the imagePrefix in line 1 to a folder “WebImages” on the Desktop.
property imagePrefix : "http://whatever.com/0"
set destFolder to POSIX path of (path to desktop) & "WebImages/"
do shell script "/bin/mkdir -p " & quoted form of destFolder
tell application "Safari" to set numberOfPictures to do JavaScript "document.images.length" in document 1
set {TID, text item delimiters} to {text item delimiters, "/"}
repeat with i from 1 to numberOfPictures
	-- JavaScript arrays are zero-based, hence i - 1
	tell application "Safari" to set picURL to do JavaScript "document.images[" & ((i - 1) as string) & "].src" in document 1
	set fName to last text item of picURL
	if picURL starts with imagePrefix and picURL ends with "jpg" then
		-- quote the URL as well, so characters such as & or ? are not interpreted by the shell
		do shell script "/usr/bin/curl -o " & quoted form of (destFolder & fName) & space & quoted form of picURL
	end if
end repeat
set text item delimiters to TID
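If the goal is to collect the image links as text (for step 4, pasting them into a FileMaker field) rather than download the files, the same filter can be applied to a plain list of URLs. This is only a minimal sketch: the handler names matchingImageURLs and joinWithReturns and the sample URLs are made up for illustration, and the list is assumed to have been gathered from Safari the same way as in the snippet above.

```applescript
-- Hypothetical helper: keep only the URLs that start with thePrefix and end with "jpg"
on matchingImageURLs(urlList, thePrefix)
	set keptURLs to {}
	repeat with oneURL in urlList
		set u to contents of oneURL
		if u starts with thePrefix and u ends with "jpg" then
			set end of keptURLs to u
		end if
	end repeat
	return keptURLs
end matchingImageURLs

-- Hypothetical helper: join a list of strings into one string, one item per line
on joinWithReturns(theList)
	set {TID, text item delimiters} to {text item delimiters, return}
	set joinedText to theList as text
	set text item delimiters to TID
	return joinedText
end joinWithReturns

-- Sample data, for illustration only
set sampleURLs to {"http://whatever.com/0001.jpg", "http://whatever.com/logo.gif", "http://other.com/0002.jpg"}
set imageLinks to joinWithReturns(matchingImageURLs(sampleURLs, "http://whatever.com/0"))
```

With the sample list, only the first URL passes both checks, so imageLinks holds that single link.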
You can retrieve the raw text in Safari without the HTML tags by using text instead of source.
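For example, a minimal sketch, assuming the page is already loaded in Safari's front document (the variable names pageSource and pageText are only illustrative):

```applescript
tell application "Safari"
	-- "source" returns the HTML markup of the page
	set pageSource to source of document 1
	-- "text" returns only the visible text, without the tags
	set pageText to text of document 1
end tell
```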
Is there a way to script Safari so I can collect all the text I need?
I can create a text file like this
url1 …
url2 …
and so on
so that the script starts from the first line and collects the text in a file for me?
I don’t understand exactly what you mean when you say: use “text” and not “source”. I don’t know how to script Safari; perhaps you are referring to an AppleScript dictionary command specific to Safari.
Maybe I posed the question in the wrong way. I meant to ask whether it was possible to strip all the HTML code from the “SOURCE WEB PAGE”, and I thought opening the source window in Safari was the way to do it.
However, there might be an easier way: telling AppleScript to collect the text in the window, following the sequence of URLs I give in the text file.
Thanks again for any additional suggestions.
As for the previous scriptlet you suggested, is there a way to change the property imagePrefix : “http://whatever.com/0”
I haven’t tested this, but the script should ask for a text file containing URLs,
then load each URL and download the images.
Replace the line -- parse the text with the code to extract the text parts you want.
property imagePrefix : "http://whatever.com/0"
set allURLs to paragraphs of (read (choose file))
repeat with oneURL in allURLs
	try
		if (contents of oneURL) is not "" then -- skip empty lines in the text file
			tell application "Safari" to set URL of document 1 to (contents of oneURL)
			if page_loaded(30) then -- wait up to 30 seconds for the page to load
				tell application "Safari"
					set theText to text of document 1
					set theName to name of document 1
				end tell
				-- parse the text
				set destFolder to POSIX path of (path to desktop) & "WebImages/" & theName & "/"
				do shell script "/bin/mkdir -p " & quoted form of destFolder
				tell application "Safari" to set numberOfPictures to do JavaScript "document.images.length" in document 1
				set {TID, text item delimiters} to {text item delimiters, "/"}
				repeat with i from 1 to numberOfPictures
					tell application "Safari" to set picURL to do JavaScript "document.images[" & ((i - 1) as string) & "].src" in document 1
					set fName to last text item of picURL
					if picURL starts with imagePrefix and picURL ends with "jpg" then
						-- quote the URL as well, so the shell does not interpret characters such as & or ?
						do shell script "/usr/bin/curl -o " & quoted form of (destFolder & fName) & space & quoted form of picURL
					end if
				end repeat
				set text item delimiters to TID
			end if
		end if
	on error e number n
		-- restore the delimiters in case the error happened before they were reset
		set text item delimiters to {""}
		display dialog "error " & e & " (" & n & ") occurred"
	end try
end repeat
on page_loaded(timeout_value)
	delay 2
	repeat with i from 1 to the timeout_value
		tell application "Safari"
			if (do JavaScript "document.readyState" in document 1) is "complete" then
				return true
			else if i is the timeout_value then
				return false
			else
				delay 1
			end if
		end tell
	end repeat
	return false
end page_loaded
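One thing to watch in the script above: the page title is used as a folder name, and a title containing "/" would be read by mkdir as extra nested folders. A minimal sketch of a workaround follows; safeFolderName is a hypothetical helper, not part of the original script, and replacing the slash with a hyphen is just one possible choice.

```applescript
-- Hypothetical helper: replace every "/" in a page title so it can be used as a single folder name
on safeFolderName(theTitle)
	set {TID, text item delimiters} to {text item delimiters, "/"}
	set titleParts to text items of theTitle
	set text item delimiters to "-"
	set cleanTitle to titleParts as text
	set text item delimiters to TID
	return cleanTitle
end safeFolderName

-- Sample title, for illustration only
set folderName to safeFolderName("Class 3/4 photos")
```

In the script, theName would be passed through this handler before destFolder is built.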