Help with extracting text from web page

I am trying to extract the name and player ID from the LPGA web site. I got as far as extracting the first name, but using AppleScript's text item delimiters to move through the text is hard for me to follow (I am still new at this). My current code finds the first person three times.

The web page has code that looks like:

Here is my script. I first find “PlayerListingControl_AlphaImage”, which is close to the text I need. Then I split on “player_results.aspx?id=” to get the PlayerID, followed by the player name. I included the whole script in case that helps; I apologize if I should have included only the main loop.


set astid to AppleScript's text item delimiters -- for later reference by variable. It is always good practice to store these and finally set them back. AppleScript's text item delimiters are a global property of AppleScript so if you leave them in a strange setting, other scripts open at the same time will respond to them.

-- set cond_start to text in the web page that appears just before the data we want

set cond_start to "PlayerListingControl_AlphaImage"

set next_cond to "player_results.aspx?id="

set downloadURL to "http://www.lpga.com/player_results.aspx?alpha="

set AllRank to "PlayerID,Last Name,First Name" & return
set alpha to "a"

-- Curl command downloads the web page into variable T
set curlCode to "curl '" & downloadURL & alpha & "'"

set T to (do shell script curlCode)

set AppleScript's text item delimiters to cond_start --find PlayerListingControl_AlphaImage
set lastPart to text item 2 of T -- keep the last part, where the data is located

repeat with m from 1 to 3
	set AppleScript's text item delimiters to next_cond -- player_results.aspx?id=
	set lastPart to text item 2 of T
	
	set AppleScript's text item delimiters to "'>"
	
	--Getting Player ID
	set PlayerID to text item 1 of lastPart
	set lastPart to text item 2 of lastPart
	
	---Getting Player's Name 
	set AppleScript's text item delimiters to "</a>"
	set Mystring to text item 1 of lastPart -- Mystring now has player name with tabs and return
	ParsePlayername(Mystring)
	set Playername to result
	
	set AllRank to AllRank & PlayerID & "," & Playername & return
end repeat


tell application "TextEdit"
	activate
	make new document at beginning of documents
	set text of document 1 to AllRank & return
end tell


--Resetting the text delimiter for Applescript to the default.
set AppleScript's text item delimiters to astid

--Function to clear off tabs and returns
on ParsePlayername(Mystring)
	set AppleScript's text item delimiters to "	"
	set thenewlist to every text item in Mystring
	set AppleScript's text item delimiters to ""
	set notablist to every item in thenewlist as string --notablist has player name with no tabs
	set AppleScript's text item delimiters to return
	set thesecondlist to every text item in notablist
	set AppleScript's text item delimiters to ""
	set cleanName to every item in thesecondlist as string --Playername should have no tabs or return
	return cleanName
end ParsePlayername


You can see I have been reading some of your tutorials and borrowing code liberally. I have just begun to use AppleScript. I have done some Excel macro programming, and many decades ago I could program in Fortran, but I am not very experienced.

I don’t think you want a repeat loop the way you’ve done it. Since your “next_cond” variable is generic to the results you want, it separates all the results for you and you then iterate on the parts of that separation. I’ve done enough here to get you a list of First Last names. You can arrange that as you like. If you want to get fancy, you can alphabetize and sort on last names, but that would be a much longer script.


set astid to AppleScript's text item delimiters -- for later reference by variable. It is always good practice to store these and finally set them back. AppleScript's text item delimiters are a global property of AppleScript so if you leave them in a strange setting, other scripts open at the same time will respond to them.

-- set cond_start to text in the web page that appears just before the data we want

set cond_start to "PlayerListingControl_AlphaImage"

set next_cond to "player_results.aspx?id="

set downloadURL to "http://www.lpga.com/player_results.aspx?alpha="

set AllRank to "PlayerID,Last Name,First Name" & return
set alpha to "a"

-- Curl command downloads the web page into variable T
set curlCode to "curl '" & downloadURL & alpha & "'"

set T to (do shell script curlCode)

set AppleScript's text item delimiters to cond_start --find PlayerListingControl_AlphaImage
set lastPart to text item 2 of T -- keep the last part, where the data is located
set AppleScript's text item delimiters to next_cond -- player_results.aspx?id=
set tParts to text items 2 thru -1 of lastPart -- this contains all the names.

set tName to {}

repeat with onePart in tParts -- taking one item at a time.
	set AppleScript's text item delimiters to ">" & return
	set temp1 to text item 2 of onePart
	set AppleScript's text item delimiters to ","
	set tLastName to cleanTabs(text item 1 of temp1)
	set AppleScript's text item delimiters to tLastName & "," & return
	set temp2 to text item 2 of onePart
	set AppleScript's text item delimiters to return
	set tFirstName to cleanTabs(text item 1 of temp2)
	set end of tName to tFirstName & space & tLastName
end repeat

tName --> {"Anna Acker-Macosko", "Lynn Adams", "Shi Hyun Ahn", "Kristi Albers", "Amy Alcott", "Loretta  Alderete", "Helen Alfredsson", "Pam Allen", "Danielle Ammaccapane", "Dina Ammaccapane", "Janet  Anderson", "Donna Andrews", "Jody  Anschutz", "Cynthia Sullivan Anzolut", "Debbie  Austin"}

to cleanTabs(aWord)
	set cleaned to ""
	set Char to characters of aWord
	repeat with oneChar in Char
		if contents of oneChar is not tab then set cleaned to cleaned & contents of oneChar
	end repeat
	return cleaned
end cleanTabs

Thanks a lot - your code is so elegant.

I am studying now to understand it better.

One piece of script I don’t understand is :


set tParts to text items 2 thru -1 of lastPart -- this contains all the names.

What does “text items 2 thru -1” mean? I only barely understand the use of text item 1 (before the text delimiter) and text item 2 (after the text delimiter).

Also, I was actually looking for the PlayerID number along with the name. I have another script that goes to each player’s stats page, which is called up by the player ID number. Unfortunately, these numbers are not in any sensible order. They range from 10 to several thousand, even though there are only 200 players or so, so some player ID pages are invalid. Instead of just looping through so many curl calls, I thought I could use this script to obtain a list of valid player ID numbers (with the names).

Eventually I wanted a CSV file with player name, ID and then the stats from the stats page that I could read into Excel.
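For the CSV idea, here is a minimal sketch of how one line of such a file could be built. The handler name csvLine and the sample values are assumptions made up for illustration, and it does no quoting, so a field that itself contains a comma would need extra handling.

```applescript
-- Hypothetical helper (not from the thread): joins a list of fields into one CSV line.
on csvLine(fieldList)
	set astid to AppleScript's text item delimiters -- save the current delimiters
	set AppleScript's text item delimiters to ","
	set joined to fieldList as text -- list items are joined with the delimiter
	set AppleScript's text item delimiters to astid -- put the delimiters back
	return joined
end csvLine

-- Sample values only; real data would come from the parsed web page.
csvLine({"28", "Acker-Macosko", "Anna"}) --> "28,Acker-Macosko,Anna"
```

Appending return after each such line gives text that can be saved with a .csv extension and opened in Excel.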

If it is not too much of a bother could you add a pick up for the player ID as well? I may give it a try, hoping to learn more about scripts. But I am sure I will probably take forever and really hack it up.

Again, thanks. Someday I hope to be good enough to help answer questions on the board.

Cam

I did work on this a bit and figured out how to get the player ID

repeat with onePart in tParts -- taking one item at a time.
	
	set AppleScript's text item delimiters to ">" & return
	set PlayerID to text item 1 of onePart ---- Here is my new line, the playerID was just ahead of the >
	set temp1 to text item 2 of onePart
	set AppleScript's text item delimiters to ","
	set tLastName to cleanTabs(text item 1 of temp1)
	set AppleScript's text item delimiters to tLastName & "," & return
	set temp2 to text item 2 of onePart
	set AppleScript's text item delimiters to return
	set tFirstName to cleanTabs(text item 1 of temp2)
	set end of tName to tFirstName & space & tLastName & "," & PlayerID
end repeat

Now I get the following in tName. You can see that the ID numbers follow no logic (okay, the older players are in the 100s and the newer players are in the 5000 range).

I may be back when I stumble going forward. Thanks so far.

Found a minor bug: I was getting an extra ' with my Player IDs (the " next to it in the result box made it hard to see). So I had to modify your first text delimiter to include the ' (making the final delimiter “'>” followed by a return).


repeat with onePart in tParts -- taking one item at a time.
	
	set AppleScript's text item delimiters to "'>" & return
	set PlayerID to text item 1 of onePart
	set temp1 to text item 2 of onePart
	set AppleScript's text item delimiters to ","
	set tLastName to cleanTabs(text item 1 of temp1)
	set AppleScript's text item delimiters to tLastName & "," & return
	set temp2 to text item 2 of onePart
	set AppleScript's text item delimiters to return
	set tFirstName to cleanTabs(text item 1 of temp2)
	set end of tName to tFirstName & space & tLastName
	set lineofdata to lineofdata & PlayerID & "," & tLastName & "," & tFirstName
	set Stats to GetStats(PlayerID)
	set lineofdata to lineofdata & Stats & return
	
end repeat
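
The effect of that one extra character in the delimiter can be seen in a small sketch. The chunk below is a shortened, hand-built stand-in for one item of tParts, not text downloaded from the site.

```applescript
-- Hand-built sample resembling the start of one chunk in tParts (assumption for illustration).
set chunk to "28'>" & return & tab & "Acker-Macosko,"

set astid to AppleScript's text item delimiters -- save the current delimiters

set AppleScript's text item delimiters to ">" & return
set looseID to text item 1 of chunk --> "28'" (the stray quote is still attached)

set AppleScript's text item delimiters to "'>" & return
set cleanID to text item 1 of chunk --> "28"

set AppleScript's text item delimiters to astid -- put the delimiters back
{looseID, cleanID}
```

The delimiter itself is never part of any text item, so whatever is included in the delimiter is what gets discarded.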

I did manage to incorporate my other script with this one, so thanks. It is now extracting quite a bit of data from LPGA.com web site. I have been and will be spiffing this up.

AppleScript is not the best-suited tool for scraping a web page. Nokogiri does an excellent job, though.
This example uses some regular expressions to tidy up the results and pull out the id.

I had never heard of nokogiri.

It does seem quite efficient, and cryptic. But like all languages, once you have some experience…

I take it at least one person on the forum knows about this language.

Thanks for the thought.

Nokogiri is a Ruby gem. It is actually Ruby code you see there.

I had heard of Ruby, and did see that you have to have it installed to run Nokogiri.

I have solved my current problem but will keep this in mind if I go through something like this again.

Cam

This is the end of everything [in AppleScript] with a few touches:


set astid to AppleScript's text item delimiters -- for later reference by variable. It is always good practice to store these and finally set them back. AppleScript's text item delimiters are a global property of AppleScript so if you leave them in a strange setting, other scripts open at the same time will respond to them.

-- set cond_start to text in the web page that appears just before the data we want

set cond_start to "PlayerListingControl_AlphaImage"

set next_cond to "player_results.aspx?id="

set downloadURL to "http://www.lpga.com/player_results.aspx?alpha="

set AllRank to "PlayerID,Last Name,First Name" & return
set alpha to "a"

-- Curl command downloads the web page into variable T
set curlCode to "curl '" & downloadURL & alpha & "'"

set T to (do shell script curlCode)

set AppleScript's text item delimiters to cond_start --find PlayerListingControl_AlphaImage
set lastPart to text item 2 of T -- keep the last part, where the data is located
set AppleScript's text item delimiters to next_cond -- player_results.aspx?id=
set tParts to text items 2 thru -1 of lastPart -- this contains all the names.

set tName to {}

repeat with onePart in tParts -- taking one item at a time.
	set AppleScript's text item delimiters to "'>" & return
	set PlayerID to text item 1 of onePart ---- Here is my new line, the playerID was just ahead of the >
	set temp1 to text item 2 of onePart
	set AppleScript's text item delimiters to ","
	set tLastName to cleanTabs(text item 1 of temp1)
	set AppleScript's text item delimiters to tLastName & "," & return
	set temp2 to text item 2 of onePart
	set AppleScript's text item delimiters to return
	set tFirstName to cleanTabs(text item 1 of temp2)
	set end of tName to tFirstName & space & tLastName & ", " & PlayerID & return
end repeat

set AppleScript's text item delimiters to astid -- restore the saved delimiters so the join below adds nothing between items
set tName to tName as string --> one "First Last, PlayerID" line per player
tell application "TextEdit"
	activate
	set hello to make new document
	set text of hello to "Name, Player-ID" & return & return & tName
end tell
to cleanTabs(aWord)
	set cleaned to ""
	set Char to characters of aWord
	repeat with oneChar in Char
		if contents of oneChar is not tab then set cleaned to cleaned & contents of oneChar
	end repeat
	return cleaned
end cleanTabs

I added the TextEdit document and the newer repeat.

Thanks to all for the help.

I am happily extracting data from the site.

Cam

Sorry to be so long getting back…

I didn’t see an answer to the question you asked:

Text item 2 avoids the first line that will show up – it doesn’t contain data.

Text item -1 is the last item picked up (similarly text item -2 is the second last)

text item 3 thru -3 picks up the third from the beginning through the third one back from the end inclusive.
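
A minimal sketch of those index forms, using a made-up string with "|" standing in for the real delimiter (both are assumptions for illustration):

```applescript
-- Made-up sample; "|" plays the role of next_cond.
set sample to "junk|one|two|three|four|tail"

set astid to AppleScript's text item delimiters -- save the current delimiters
set AppleScript's text item delimiters to "|"

set fromSecond to text items 2 thru -1 of sample --> {"one", "two", "three", "four", "tail"}
set lastOne to text item -1 of sample --> "tail"
set secondLast to text item -2 of sample --> "four"
set middle to text items 3 thru -3 of sample --> {"two", "three"}

set AppleScript's text item delimiters to astid -- put the delimiters back
{fromSecond, lastOne, secondLast, middle}
```

Negative numbers count back from the end, so “2 thru -1” means “everything except the first item”, however many items there turn out to be.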

Could you walk me through the script a bit more so I can learn some more?

When we get to “set AppleScript's text item delimiters to next_cond -- player_results.aspx?id=” I know we are just at the beginning of the names. Presumably, in lastPart, ahead of this delimiter there is some code we don’t want (what I think is called text item 1).

So how does your step for tParts work on lastPart? Also tParts seems to be a number as it is part of the repeat statement.

Also, how does the repeat know when to end and not run past the end of the page?

set AppleScript's text item delimiters to cond_start --find PlayerListingControl_AlphaImage
set lastPart to text item 2 of T -- keep the last part, where the data is located
set AppleScript's text item delimiters to next_cond -- player_results.aspx?id=
set tParts to text items 2 thru -1 of lastPart -- this contains all the names.

set tName to {}

repeat with onePart in tParts -- taking one item at a time.
   set AppleScript's text item delimiters to "'>" & return
   set PlayerID to text item 1 of onePart ---- Here is my new line, the playerID was just ahead of the >
   set temp1 to text item 2 of onePart
   set AppleScript's text item delimiters to ","
   set tLastName to cleanTabs(text item 1 of temp1)
   set AppleScript's text item delimiters to tLastName & "," & return
   set temp2 to text item 2 of onePart
   set AppleScript's text item delimiters to return
   set tFirstName to cleanTabs(text item 1 of temp2)
   set end of tName to tFirstName & space & tLastName & ", " & PlayerID & return
end repeat

Again thanks, and if this is too much bother feel free to not reply.

set AppleScript's text item delimiters to cond_start --find PlayerListingControl_AlphaImage
set lastPart to text item 2 of T -- keep the last part, where the data is located
(*At this point we have all the names but none of the starting bumpf in the web page*)

(*A characteristic of every name set is that the line before it contains next_cond and a number, so we find ALL of those by setting the text item delimiters to that phrase.*)
set AppleScript's text item delimiters to next_cond -- player_results.aspx?id=
(*Now, when we split lastPart on this delimiter and look at the text items, we see that the first one has src= in it, not what we want, but each of the others holds an ID and a name. Understand that tParts is a list of EVERY chunk of lastPart that is preceded by next_cond, not just the first encounter. For the script I submitted above, tParts is this list:
{"28'>
				Acker-Macosko,
				Anna
			</a>
		
			<br />
		
			<a href='", "5025'>
				Adams,
				Lynn
			</a>
		
			<br />
		
			<a href='", "216'>
				Ahn,
				Shi Hyun
			</a>
		
			<br />
		
			<a href='", "394'>
				Albers,
				Kristi
			</a>
		
			<br />
		
			<a href='", "395'>
				Alcott,
				Amy
			</a>
		
			<br />
		
			<a href='", "5028'>
				Alderete,
				Loretta 
			</a>
		
			<br />
		
			<a href='", "393'>
				Alfredsson,
				Helen
			</a>
		
			<br />
		
			<a href='", "5029'>
				Allen,
				Pam
			</a>
		
			<br />
		
			<a href='", "396'>
				Ammaccapane,
				Danielle
			</a>
		
			<br />
		
			<a href='", "397'>
				Ammaccapane,
				Dina
			</a>
		
			<br />
		
			<a href='", "5034'>
				Anderson,
				Janet 
			</a>
		
			<br />
		
			<a href='", "398'>
				Andrews,
				Donna
			</a>
		
			<br />
		
			<a href='", "5043'>
				Anschutz,
				Jody 
			</a>
		
			<br />
		
			<a href='", "5044'>
				Anzolut,
				Cynthia Sullivan
			</a>
		
			<br />
		
			<a href='", "5045'>
				Austin,
				Debbie 
			</a> 
			...."}
		
followed by a ton of bumpf we don't want from the end of the lastPart*)

set tParts to text items 2 thru -1 of lastPart -- this contains all the names 
-- drops the src= line but still includes the end of lastPart.

set tName to {} -- where we'll collect them.

repeat with onePart in tParts -- taking one item at a time from the list tParts.
	set AppleScript's text item delimiters to "'>" & return
	-- this delimiter follows the number after <a href=
	set PlayerID to text item 1 of onePart ---- Here is my new line, the playerID was just ahead of the >
	set temp1 to text item 2 of onePart -- the last name up to the end of this part.
	set AppleScript's text item delimiters to "," -- following the last name
	set tLastName to cleanTabs(text item 1 of temp1) -- pull out the tabs
	set AppleScript's text item delimiters to tLastName & "," & return -- move up to get the first name
	set temp2 to text item 2 of onePart -- this starts with the first name
	set AppleScript's text item delimiters to return -- this ends the first name
	set tFirstName to cleanTabs(text item 1 of temp2)
	set end of tName to tFirstName & space & tLastName & ", " & PlayerID & return
end repeat

There are several forms of “repeat … end repeat”. The one I’m using here is particularly useful for lists: onePart is a reference to an item in the list. Each pass through the repeat moves to the next item in the list, and when the list has been traversed, the repeat ends, so it cannot run past the end of the page. Sometimes a reference to a list item is not automatically resolved to the item itself (in a comparison, for example); when that happens, we use “contents of onePart”. That is not a problem here.
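
A small sketch of that reference behaviour, using a short made-up list in place of the real tParts (the values and the name collected are assumptions for illustration):

```applescript
-- Made-up stand-in for tParts; the real list holds chunks of the web page.
set tParts to {"28", "5025", "216"}
set collected to {}

repeat with onePart in tParts -- ends by itself after the last item
	-- onePart is a reference such as "item 2 of {...}", so a direct comparison
	-- like (onePart is "5025") is false; compare the contents instead.
	if contents of onePart is "5025" then set end of collected to contents of onePart
end repeat

collected --> {"5025"}
```

In the name-parsing loop, asking for a text item of onePart resolves the reference automatically, which is why “contents of” is not needed there.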

Thanks - that was great. I created a text document with your details and my comments so I can use this code and understand it in the future.

I like the onePart of tParts trick with the repeat. Clever.

Happy holidays.

Cam