This might be an interesting challenge for some ...

danwan · January 5, 2011, 3:39pm

I wonder if it's possible to do the following through AppleScript:

1st: open a web page in Safari from a text file listing the various URLs (easy)

2nd: open the “source” of the web page in Safari and get rid of all the HTML code (can't work it out)

However, I’d like to keep the references, with their URLs, when they point to images. These all follow the same format and file extension: http://thisurl.here/0001.jpg

A page might contain more than one .jpg reference, and I’d like to keep all the URLs of that kind, each with a prefix such as:

Image 01 http://thisurl.here/0001.jpg
Image 02 http://thisurl.here/0002.jpg

and so on

3rd: paste the text content into a FileMaker Pro text field

4th: copy the image links into another text field.
As my students' pictures never number more than 10, I can create ten fields in the database and populate them with the script, or move to the next record

5th: once the record is filled, move to the next URL in the text file and repeat the process until the end

I’ve sort of done this already in TextWrangler using “grep” find and replace, but it is very time-consuming

Any ideas?

Thanks, and Happy New Year

StefanK · January 5, 2011, 3:53pm

Hi,

This snippet saves all images in the front Safari document whose URL starts with the imagePrefix set in line 1 and ends with "jpg" to a folder “WebImages” on the Desktop.


property imagePrefix : "http://whatever.com/0"

set destFolder to POSIX path of (path to desktop) & "WebImages/"
do shell script "/bin/mkdir -p " & quoted form of destFolder
tell application "Safari" to set numberOfPictures to do JavaScript "document.images.length" in document 1
set {TID, text item delimiters} to {text item delimiters, "/"}
repeat with i from 1 to numberOfPictures
	tell application "Safari" to set picURL to do JavaScript "document.images[" & ((i - 1) as string) & "].src" in document 1
	set fName to last text item of picURL
	if picURL starts with imagePrefix and picURL ends with "jpg" then
		do shell script "/usr/bin/curl -o " & quoted form of (destFolder & fName) & space & quoted form of picURL
	end if
end repeat
set text item delimiters to TID

You can retrieve the plain text of the page in Safari, without the HTML tags, by asking the document for its text instead of its source.
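A minimal sketch of the difference, assuming a page is already loaded in the front Safari document. The variable names pageText and pageSource are only illustrative; text and source are both properties of a Safari document.

```applescript
tell application "Safari"
	-- rendered page text, without the HTML tags
	set pageText to text of document 1
	-- raw HTML markup of the same page
	set pageSource to source of document 1
end tell
```

Either value comes back as an ordinary AppleScript string, so it can be written to a file or handed to another application.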

danwan · January 5, 2011, 6:32pm

Thanks as always Stefan

however

Is there a way to script Safari so that it collects all the text I need?

I can create a text file like this

url1 …
url2 …

and so on

so that the script starts from the first line and collects the text into a file for me?

I don’t understand exactly what you mean when you say to use “text” and not “source”. I don’t know how to script Safari, and perhaps you are referring to a command in Safari's AppleScript dictionary.

Maybe I posed the question the wrong way. I meant to ask whether it's possible to strip all the HTML code from the page source, and I thought opening the source window in Safari was the way to do it.

However, there might be an easier way: telling AppleScript to collect the text in the window, following the sequence of URLs I give in the text file.

Thanks again if you can suggest other ways to do this.

As for the scriptlet you suggested, is there a way to change the property imagePrefix : "http://whatever.com/0" by reading it from a list I supply beforehand?

Thanks

Danwan

StefanK · January 5, 2011, 6:53pm

I haven’t tested this, but the script should ask for a text file containing URLs (one per line),
then load each URL in Safari and download the images.
Replace the line "-- parse the text" with the code to extract the text parts you want.


property imagePrefix : "http://whatever.com/0"

set allURLs to paragraphs of (read (choose file))
repeat with oneURL in allURLs
	try
		tell application "Safari" to set URL of document 1 to (contents of oneURL)
		if page_loaded(30) then -- wait up to roughly 30 seconds for the page to finish loading
			tell application "Safari"
				set theText to text of document 1
				set theName to name of document 1
			end tell
			-- parse the text
			
			set destFolder to POSIX path of (path to desktop) & "WebImages/" & theName & "/"
			do shell script "/bin/mkdir -p " & quoted form of destFolder
			tell application "Safari" to set numberOfPictures to do JavaScript "document.images.length" in document 1
			set {TID, text item delimiters} to {text item delimiters, "/"}
			repeat with i from 1 to numberOfPictures
				tell application "Safari" to set picURL to do JavaScript "document.images[" & ((i - 1) as string) & "].src" in document 1
				set fName to last text item of picURL
				if picURL starts with imagePrefix and picURL ends with "jpg" then
					do shell script "/usr/bin/curl -o " & quoted form of (destFolder & fName) & space & quoted form of picURL
				end if
			end repeat
			set text item delimiters to TID
		end if
	on error e number n
		set text item delimiters to {""}
		display dialog "error " & e & " (" & n & ") occurred"
	end try
end repeat

on page_loaded(timeout_value)
	delay 2
	repeat with i from 1 to the timeout_value
		tell application "Safari"
			if (do JavaScript "document.readyState" in document 1) is "complete" then
				return true
			else if i is the timeout_value then
				return false
			else
				delay 1
			end if
		end tell
	end repeat
	return false
end page_loaded
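
For the "Image 01 http://..." list asked for in the first post, here is a minimal sketch that is not part of the script above. The handler name labelImageURLs and its parameter urlList are hypothetical, and it assumes the matching picURL values have already been collected into a list (at most 99 entries, since the counter is padded to two digits).

```applescript
-- Hypothetical helper: turns a list of URLs into lines like "Image 01 <url>"
on labelImageURLs(urlList)
	set outLines to {}
	repeat with i from 1 to count of urlList
		-- pad the counter to two digits: 1 -> "01", 10 -> "10"
		set n to text -2 thru -1 of ("0" & i)
		set end of outLines to "Image " & n & space & (item i of urlList)
	end repeat
	-- join the lines with linefeeds, then restore the previous delimiters
	set {TID, text item delimiters} to {text item delimiters, linefeed}
	set outText to outLines as text
	set text item delimiters to TID
	return outText
end labelImageURLs

-- example call with the sample URLs from the first post
labelImageURLs({"http://thisurl.here/0001.jpg", "http://thisurl.here/0002.jpg"})
```

One way to use it would be to add each matching picURL to a list inside the inner repeat loop and pass that list to the handler once the loop has finished. The resulting text could then go into a file or a database field.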