Scraping text/info from Web source code; using offset in a repeat loop

I’m trying to pull multiple pieces of data from a web page with AppleScript. The script below opens the page, then searches its source code for specific text. It retrieves the first set of data without trouble, but I can’t get it to go back into the source code and collect the remaining ~80 items. A second script, further down, tries to collect all of the data but isn’t working yet. Any suggestions for getting this working would be great. I apologize in advance for my newbie ignorance…

This script opens two pages; the second page is the source for the scraping. Getting it to loop is what’s giving me trouble…


set oldDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to ""
tell application "Safari"
	activate
	open location "http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=DBMaint&actionHandle=default&nextPage=jsp/chemidlite/ResultScreen.jsp&TXTSUPERLISTID=000050000"
	delay 3
	open location "http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=Search&actionHandle=getAll3DMViewFiles&nextPage=jsp%2Fcommon%2FChemFull.jsp%3FcalledFrom%3Dlite&chemid=000050000&formatType=_3D"
	delay 3
	set FullRecordHTML to the source of the document 1 as text
end tell

set StartPoint1 to ((text offset of "onclick=\"javascript:popUpInfoWin('<H2>Data Source Information</H2><br><b>List Acronyms</b><br>" in FullRecordHTML) + 94)
set EndPoint1 to ((text offset of "<br>');\"><img src=\"images/chemidlite/infosmall.gif\" width=\"12\" height=\"12\" border=\"0\"></a>" in FullRecordHTML) - 1)
set Target1 to (characters StartPoint1 thru EndPoint1) of FullRecordHTML as string

set Startpoint2 to EndPoint1 + 92
set EndPoint2 to ((text offset of "<!-- In case all the names should be broken, uncomment the line below and comment the line above -->" in FullRecordHTML) - 1)
set NewStartPoint to EndPoint2
set Target2 to (characters Startpoint2 thru EndPoint2) of FullRecordHTML as string

set Target2NumberChar to (number of characters in Target2)
-- Stop two characters early so that UpCount + 2 never runs past the end of Target2
repeat with UpCount from 1 to (Target2NumberChar - 2)
	set {test1, test2, test3} to {character UpCount of Target2, (character (UpCount + 1) of Target2), (character (UpCount + 2) of Target2)}
	if test1 is equal to test2 and test2 is not equal to test3 and test3 is not equal to " " then
		set Target2 to (characters (UpCount + 2) thru (number of characters in Target2)) of Target2 as string
		exit repeat
	end if
end repeat

set Target2NumberChar to (number of characters in Target2)
-- Stop two characters early so that UpCount + 2 never runs past the end of Target2
repeat with UpCount from 1 to (Target2NumberChar - 2)
	set {test1, test2, test3} to {character UpCount of Target2, (character (UpCount + 1) of Target2), (character (UpCount + 2) of Target2)}
	if test1 is equal to test2 and test2 is equal to test3 and test3 is equal to " " then
		beep
		set Target2 to (characters 1 thru (UpCount - 4)) of Target2 as string
		exit repeat
	end if
end repeat
set CombinedInfo to (Target1 & " : " & Target2)
display dialog CombinedInfo

set AppleScript's text item delimiters to oldDelimiters
return CombinedInfo

This script splits the source code into a list of paragraphs, but I can’t get it to loop successfully…


set oldDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to ""
tell application "Safari"
	activate
	open location "http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=DBMaint&actionHandle=default&nextPage=jsp/chemidlite/ResultScreen.jsp&TXTSUPERLISTID=000050000"
	delay 3
	open location "http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=Search&actionHandle=getAll3DMViewFiles&nextPage=jsp%2Fcommon%2FChemFull.jsp%3FcalledFrom%3Dlite&chemid=000050000&formatType=_3D"
	delay 3
	set FullRecordHTML to the source of the document 1 as string
end tell

set StartString to "onclick=\"javascript:popUpInfoWin('<H2>Data Source Information</H2><br><b>List Acronyms</b><br>"
set MidString to "<br>');\"><img src=\"images/chemidlite/infosmall.gif\" width=\"12\" height=\"12\" border=\"0\"></a>"
set EndString to "<!-- In case all the names should be broken, uncomment the line below and comment the line above -->"
set HTMLparagraphs to paragraphs of FullRecordHTML
-- Loop by index and stop two short of the end, since each pass joins three consecutive paragraphs
repeat with i from 1 to ((count of HTMLparagraphs) - 2)
	-- Rejoin with linefeeds so the three paragraphs stay separated as they were in the source
	set ThreeParagraphs to (item i of HTMLparagraphs) & linefeed & (item (i + 1) of HTMLparagraphs) & linefeed & (item (i + 2) of HTMLparagraphs)
	-- "contains" has to be repeated for each string being tested
	if ThreeParagraphs contains StartString and ThreeParagraphs contains MidString and ThreeParagraphs contains EndString then
		RetreiveInfo(ThreeParagraphs)
	end if
end repeat

on RetreiveInfo(EachParagraph)
	beep
	delay 1
	set StartPoint1 to ((text offset of "onclick=\"javascript:popUpInfoWin('<H2>Data Source Information</H2><br><b>List Acronyms</b><br>" in EachParagraph) + 94)
	set EndPoint1 to ((text offset of "<br>');\"><img src=\"images/chemidlite/infosmall.gif\" width=\"12\" height=\"12\" border=\"0\"></a>" in EachParagraph) - 1)
	set Target1 to (characters StartPoint1 thru EndPoint1) of EachParagraph as string
	set Startpoint2 to EndPoint1 + 92
	set EndPoint2 to ((text offset of "<!-- In case all the names should be broken, uncomment the line below and comment the line above -->" in EachParagraph) - 1)
	set NewStartPoint to EndPoint2
	set Target2 to (characters Startpoint2 thru EndPoint2) of EachParagraph as string
	
	set Target2NumberChar to (number of characters in Target2)
	-- Stop two characters early so that UpCount + 2 never runs past the end of Target2
	repeat with UpCount from 1 to (Target2NumberChar - 2)
		set test1 to character UpCount of Target2
		set test2 to character (UpCount + 1) of Target2
		set test3 to character (UpCount + 2) of Target2
		if test1 is equal to test2 and test2 is not equal to test3 and test3 is not equal to " " then
			set Target2 to (characters (UpCount + 2) thru (number of characters in Target2)) of Target2 as string
			exit repeat
		end if
	end repeat
	
	set Target2NumberChar to (number of characters in Target2)
	-- Stop two characters early so that UpCount + 2 never runs past the end of Target2
	repeat with UpCount from 1 to (Target2NumberChar - 2)
		set test1 to character UpCount of Target2
		set test2 to character (UpCount + 1) of Target2
		set test3 to character (UpCount + 2) of Target2
		if test1 is equal to test2 and test2 is equal to test3 and test3 is equal to " " then
			beep
			set Target2 to (characters 1 thru (UpCount - 4)) of Target2 as string
			exit repeat
		end if
	end repeat
	display dialog Target1
end RetreiveInfo

Model: PowerBook G4 (old but still running strong)
AppleScript: 2.0.1
Browser: Safari 525.20.1
Operating System: Mac OS X (10.5)
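
One way to collect every record rather than only the first is to let text item delimiters do the looping: split the source on the start marker, then cut each resulting chunk at the end marker. What follows is a minimal sketch of that idea, not a drop-in fix for the scripts above. The handler name extractAllBetween and the sample text are made up for illustration; in practice the markers would be the StartString and MidString values used earlier.

```applescript
-- Hypothetical helper: return every piece of theText that lies between
-- startMarker and endMarker, in order of appearance.
on extractAllBetween(theText, startMarker, endMarker)
	set oldDelimiters to AppleScript's text item delimiters
	set foundItems to {}
	try
		-- Splitting on the start marker leaves one candidate at the head
		-- of every chunk after the first.
		set AppleScript's text item delimiters to startMarker
		set chunks to text items of theText
		-- Now cut each candidate at its end marker.
		set AppleScript's text item delimiters to endMarker
		repeat with i from 2 to (count of chunks)
			set thisChunk to item i of chunks
			-- Skip chunks that have no end marker at all.
			if thisChunk contains endMarker then
				set end of foundItems to text item 1 of thisChunk
			end if
		end repeat
	on error errMsg number errNum
		-- Always restore the delimiters before passing the error on.
		set AppleScript's text item delimiters to oldDelimiters
		error errMsg number errNum
	end try
	set AppleScript's text item delimiters to oldDelimiters
	return foundItems
end extractAllBetween

-- Made-up sample text standing in for the page source
set sampleSource to "x<b>alpha</b>y<b>beta</b>z"
extractAllBetween(sampleSource, "<b>", "</b>")
--> {"alpha", "beta"}
```

Because the split happens once, there is no need to track a moving start offset by hand; each list item can then be cleaned up individually.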

Hi,

Maybe this is a shorter way to gather the data. It hands the page source to textutil, which strips the HTML and returns plain text:


tell application "Safari"
	activate
	open location "http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=DBMaint&actionHandle=default&nextPage=jsp/chemidlite/ResultScreen.jsp&TXTSUPERLISTID=000050000"
	delay 3
	open location "http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=Search&actionHandle=getAll3DMViewFiles&nextPage=jsp%2Fcommon%2FChemFull.jsp%3FcalledFrom=lite&chemid=000050000&formatType=_3D"
	delay 3
	set theSource to source of the document 1
end tell
set FullRecord to do shell script "/bin/echo " & quoted form of theSource & " | /usr/bin/textutil -stdin -stdout -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-8"

Note: the -stdin parameter of textutil is only available in Leopard and later. In Tiger you have to work around it with a temporary file.
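
The temporary-file workaround could look roughly like the sketch below. It is untested on Tiger and rests on a few assumptions: the two /tmp paths are hypothetical, the sample HTML stands in for the real page source, and textutil's -output option is used for the result because it is not clear from this thread whether -stdout is available there either. The input encoding is given as UTF-8 on the assumption that do shell script hands text to the shell in that encoding.

```applescript
-- Placeholder for the page source obtained from Safari
set theSource to "<html><body><p>Hello</p></body></html>"

-- Hypothetical temporary file locations
set inPath to "/tmp/pagesource.html"
set outPath to "/tmp/pagesource.txt"

-- Save the source to a file instead of piping it into textutil
do shell script "/bin/echo " & quoted form of theSource & " > " & quoted form of inPath

-- Convert the saved HTML file to a plain text file
do shell script "/usr/bin/textutil -convert txt -format html -inputencoding UTF-8 -encoding UTF-8 -output " & quoted form of outPath & " " & quoted form of inPath

-- Read the converted text back into AppleScript
set FullRecord to do shell script "/bin/cat " & quoted form of outPath

-- Remove the temporary files
do shell script "/bin/rm -f " & quoted form of inPath & " " & quoted form of outPath

return FullRecord
```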

Hi StefanK

This solution works perfectly. I think that this is the second time you’ve saved my day. Here I am building paper airplanes while you guys are flying around on Jumbo jets. So goes my slow crawl up the AppleScript learning curve.

Thanks again,
-DW

This looks to be the solution I have been looking for, but I need some help with the next steps.

I need to log in daily to a website to get the daily password for another site. The password changes daily and is only available for a short period of time, which, in my time zone, is the middle of the night. I have copied your code from above:

tell application "Safari"
	activate
	open location "http://www.firstwebsite.com/secure/starthere.cfm"
	delay 3
	
	set theSource to source of the document 1
end tell
set FullRecord to do shell script "/bin/echo " & quoted form of theSource & " | /usr/bin/textutil -stdin -stdout -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-8"

and then I see, in the results pane, the entire source of the page. Near the beginning of the source is the text:

I need to extract just the password (elephant, with no quotes), open the second web site (whose URL will occasionally change), and enter the password into the one text input field on the second site.

I only occasionally stumble through sed, awk, or even AppleScript, so it is a struggle for me to learn enough about them to decide which tool would be best, let alone how to use it.

Any help with the next steps would be very much appreciated! (And would save me from having to get up in the middle of the night just to get the password of the day, and log into the second site!)

Thanks for any help!

Griffin

Model: iMac
AppleScript: 2.1.2
Browser: Safari 533.18.5
Operating System: Mac OS X (10.6)
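
For the extraction step described above, a plain AppleScript approach is to find the text that sits immediately before the password and the text that sits immediately after it, then take what lies between. The sketch below is only an illustration: the handler name textBetween, the sample page text, both marker strings, the second URL, and the CSS selector are all hypothetical, since the actual wording around the password is not shown in this thread. It also assumes the password contains no quote characters, which would break the JavaScript string.

```applescript
-- Hypothetical helper: return the text between the first startMarker
-- and the next endMarker, or "" if either marker is missing.
on textBetween(theText, startMarker, endMarker)
	set startPos to offset of startMarker in theText
	if startPos is 0 then return ""
	set restStart to startPos + (length of startMarker)
	if restStart > (length of theText) then return ""
	set theRest to text restStart thru -1 of theText
	set endPos to offset of endMarker in theRest
	if endPos < 2 then return ""
	return text 1 thru (endPos - 1) of theRest
end textBetween

-- Made-up sample standing in for the FullRecord text
set samplePage to "Welcome. Today's password is: elephant. Have a nice day."
set thePassword to textBetween(samplePage, "password is: ", ".")
--> "elephant"

tell application "Safari"
	-- Hypothetical address of the second site
	open location "http://www.secondwebsite.com/"
	delay 3
	-- Assumes the page's one text or password input matches this selector
	do JavaScript "document.querySelector('input[type=text], input[type=password]').value = '" & thePassword & "';" in document 1
end tell
```

Since the second site's URL changes occasionally, keeping it in a single variable or property at the top of the script would make it easy to update.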

Bull’s eye, Stefan!
You just gave me the start of a project I’ve been trying to get off the ground for some time.

Now I’ll go write that AppleScript version of AppFresh.

Big thanks!
(but don’t wait up :P)