Extract Contacts from Website

I am trying to extract contacts off of a website. I am able to get the emails into a .csv file but I don’t know how to attach names, addresses, etc to them as well.

Looking through the HTML, the only constant I can see near the other contact details is the mailto: link. Is there a way to use text item delimiters to look at the text before a delimiter rather than after it?

Any ideas?

Here is the code I have so far (cobbled together from various sources/mostly macscripter.net):

set pageHTML to (do shell script "curl " & quoted form of ("http://www.eatwild.com/products/illinois.html"))

set TID to AppleScript's text item delimiters -- save previous value
set theList to {}

set text item delimiters to "mailto:"
set xxx to text items 2 thru -1 of pageHTML

set text item delimiters to "\""
repeat with i from 1 to count of xxx
	set end of theList to text item 1 of (item i of xxx)
end repeat

set csvList to {}
set text item delimiters to "," -- TID already holds the original delimiters, so don't overwrite it here
set end of csvList to theList as text
set text item delimiters to return
set csvText to csvList as text
set text item delimiters to TID

set theFilePath to (path to desktop as string) & "Extractions.csv" as string
set theFileReference to open for access theFilePath with write permission
set eof of theFileReference to 0
write csvText to theFileReference starting at eof
close access theFileReference

Model: iMac 27 Late 2010
Browser: Safari 534.56.5
Operating System: Mac OS X (10.7)

Is there a way, using AppleScript, to get the text between two delimiters only if it contains mailto:?

Or do you think that you would need to utilize Perl/sed/awk?
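
For what it's worth, here is a minimal sketch of that kind of filter in JavaScript (picked only because it is easy to try in a browser console). It assumes each contact sits between '<p class="bodyMargin">' and '</p>'; those delimiters and the function name blocksWithMailto are assumptions for illustration, not something confirmed about the page.

```javascript
// Return the text of every delimited block that contains "mailto:".
// Assumption: blocks open with <p class="bodyMargin"> and close with </p>.
function blocksWithMailto(html) {
	var open = '<p class="bodyMargin">';
	var close = '</p>';
	// Everything before the first opening delimiter is not a block, so drop it.
	var chunks = html.split(open).slice(1);
	var found = [];
	for (var i = 0; i < chunks.length; i++) {
		var end = chunks[i].indexOf(close);
		// Keep only the part up to the closing delimiter (or all of it if none).
		var block = (end === -1) ? chunks[i] : chunks[i].slice(0, end);
		if (block.indexOf('mailto:') !== -1) {
			found.push(block);
		}
	}
	return found;
}
```

The same split-then-test idea carries over to text item delimiters: split on the opening delimiter, cut each item at the closing one, and keep the item only if it contains "mailto:".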

Quick and dirty:

set pageHTML to (do shell script "curl " & quoted form of ("http://www.eatwild.com/products/illinois.html"))

set TID to AppleScript's text item delimiters -- save previous value
set theList to {}

set text item delimiters to "mailto:"
set xxx to text items 1 thru -2 of pageHTML -- the last item follows the final mailto:, so it has no e-mail of its own

set text item delimiters to "<p class=\"bodyMargin\">"
repeat with i from 1 to count of xxx
	set yyy to last text item of item i of xxx
	set _offset to offset of "<" in yyy
	set end of theList to text 1 thru (_offset - 1) of yyy
end repeat

set csvList to {}
set text item delimiters to "," -- TID already holds the original delimiters, so don't overwrite it here
set end of csvList to theList as text
set text item delimiters to return
set csvText to csvList as text
set text item delimiters to TID

set theFilePath to (path to desktop as string) & "Extractions.csv" as string
set theFileReference to open for access theFilePath with write permission
set eof of theFileReference to 0
write csvText to theFileReference starting at eof
close access theFileReference


But please note that the script gathers only the name, not the address, and that the trailing comma on each name may cause problems when the names are then joined with the comma delimiter.
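
One common way around stray commas is the usual CSV convention: wrap any field that contains a comma, a double quote or a line break in double quotes, and double any quotes inside it. A minimal sketch in JavaScript follows; the names csvField and csvRow are made up for illustration.

```javascript
// Quote a single value for CSV only when it needs quoting.
function csvField(value) {
	var text = String(value);
	if (/[",\r\n]/.test(text)) {
		// Double any embedded quotes, then wrap the whole field in quotes.
		return '"' + text.replace(/"/g, '""') + '"';
	}
	return text;
}

// Join an array of values into one CSV line.
function csvRow(fields) {
	var out = [];
	for (var i = 0; i < fields.length; i++) {
		out.push(csvField(fields[i]));
	}
	return out.join(',');
}
```

With this, a name that ends in a comma stays inside its own quoted field instead of creating an extra empty column.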

This is very buggy but should get the job done.

set {cName, cAddress, cPhone, cEmail} to {{}, {}, {}, {}}
tell application "Safari"
	activate
	tell document 1
		set URL to "http://www.eatwild.com/products/illinois.html"
		delay 3
		
		set theCount to do JavaScript "document.getElementsByClassName('bodyMargin').length"
		
		repeat with i from 0 to (theCount - 1) -- JavaScript indices run from 0 to length - 1
			set xxx to do JavaScript "document.getElementsByClassName('bodyMargin')[" & i & "].childNodes[5].innerText"
			if xxx contains "@" then
				--email
				set end of cEmail to xxx
				
				-- address part 1
				set theAddress to do JavaScript "document.getElementsByClassName('bodyMargin')[" & i & "].childNodes[1].innerText"
				
				-- This needs to be split into phone and address
				set splitMe to do JavaScript "document.getElementsByClassName('bodyMargin')[" & i & "].childNodes[2].nodeValue"
				--set xxx to do shell script "echo " & quoted form of thePhone & " | sed 's/$.*\\(//'"
				set end of cPhone to (do shell script "echo " & quoted form of splitMe & " | sed 's/.*(/(/'")
				
				-- address
				set theAddress to theAddress & (do shell script "echo " & quoted form of splitMe & " | sed 's/(.*//'")
				set end of cAddress to theAddress
				
				-- name
				set end of cName to do JavaScript "document.getElementsByClassName('bodyMargin')[" & i & "].childNodes[0].nodeValue"
			end if
		end repeat
	end tell
end tell

return {cName, cAddress, cPhone, cEmail}

That do JavaScript is sure spiffy.

I am going to play with it to see if I can get it to output a CSV in which each contact's details stay together.

The problem with that Web site is that the “address” portions of the lines are sometimes fully contained in the link text and sometimes only partially, with the ends in the “name” and/or “phone number” portions. If the parts of each line are concatenated together, it’s quite easy then to police them up and separate the phone numbers from the addresses, but not so easy for a script to tell where the names end and the addresses begin. There don’t appear to be any overlap problems with the e-mail addresses.
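
As a rough illustration of the easy half (phone numbers versus addresses), here is a small JavaScript sketch. It assumes the phone part always begins at the first '(' in the concatenated line, which is only an assumption about how the entries are written; the function name and the sample data in the usage note are made up.

```javascript
// Split one concatenated contact line into address and phone parts.
// Assumption: the phone number starts at the first "(" in the line.
function splitAddressAndPhone(line) {
	// Strip leading and trailing white space.
	function trim(s) {
		return s.replace(/^\s+|\s+$/g, '');
	}
	var at = line.indexOf('(');
	if (at === -1) {
		// No phone number found: treat the whole line as the address.
		return { address: trim(line), phone: '' };
	}
	return { address: trim(line.slice(0, at)), phone: trim(line.slice(at)) };
}
```

For example, splitAddressAndPhone('Example Farm, 1 Main St, Anytown IL 60000 (555) 010-0000 or (555) 010-0001') keeps both numbers together in the phone part, because only the first '(' is used as the split point.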

Here’s a development of adayzdone’s JavaScript idea which can no doubt be improved upon by those more conversant with JavaScript than I am! It returns a flat list of trimmed strings (‘theDetails’) in which each group of three strings is a name-and-address, a phone number, and an e-mail address. If the separation of the names from the address is required, human intervention will probably be necessary to decide where the names end and the addresses start, although of course all the work after that can be scripted.

set JS to "elementCount = document.getElementsByClassName('bodyMargin').length ;
cDetails = '' ;

// Test each 'bodyMargin' element. If its 'e-mail' node exists and contains an e-mail
// address, concatenate all the relevant details to cDetails, with a return before and
// after the e-mail address. Later, the tidyUp() function will interpolate another return
// between the postal address and the phone number.
for (i = 0 ; i < elementCount ; i++) {
	theseNodes = document.getElementsByClassName('bodyMargin')[i].childNodes ;
	if (theseNodes.length > 5) {
		email = theseNodes[5].innerText ;
		if (!((email == undefined) || (email.indexOf('@') == -1))) {
			cDetails += (theseNodes[0].nodeValue + theseNodes[1].innerText + theseNodes[2].nodeValue + '\\r' + email + '\\r') ;
		}
	} 
}

// Tidy up the gathered details before returning them as a list.
tidyUp(cDetails).split('\\r').slice(0,-1)

// Function to perform the tidying up.
function tidyUp(theText) {
	// Remove any linefeeds. (The returns are ours.)
	theText = theText.split('\\n').join('') ;
	// Reduce multiple spaces to single spaces.
	while (theText.indexOf('  ') > -1) {
			theText = theText.split('  ').join(' ') ;
	}
	// Insert returns between addresses and phone numbers.
	// (Split each line at '(', but then revert where this follows ' or '.)
	theText = theText.split(' (').join('\\r(').split(' or\\r(').join(' or (') ;
	// Remove any leading spaces in the lines (except the first).
	theText = theText.split('\\r ').join('\\r').split(' \\r').join('\\r') ;
	// Remove any leading space in the first line.
	if (theText.indexOf(' ') == 0) {
		theText = theText.split(' ').slice(1).join(' ') ;
	}
	// Remove any full stops which are followed by returns and return the end-result.
	return theText.split('.\\r').join('\\r');
}"

tell application "Safari"
	activate
	tell document 1
		set URL to "http://www.eatwild.com/products/illinois.html"
		delay 0.2
		repeat until ((do JavaScript "document.readyState" in it) is "complete")
			delay 0.2
		end repeat
		
		set theDetails to (do JavaScript JS in it)
	end tell
end tell
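
To get from a flat list like 'theDetails' to the CSV asked about at the start, the strings can be regrouped three at a time, one row per contact. The sketch below is one way to do that in JavaScript; detailsToCsv is a hypothetical name, and quoting every field is a deliberate simplification so that commas inside the name-and-address string cannot break the columns.

```javascript
// Turn a flat list [nameAndAddress, phone, email, nameAndAddress, ...]
// into CSV text with one contact per line.
function detailsToCsv(details) {
	var lines = [];
	// Step through the list three items at a time; an incomplete
	// trailing group is ignored.
	for (var i = 0; i + 2 < details.length; i += 3) {
		var row = [];
		for (var j = 0; j < 3; j++) {
			// Quote every field and double any embedded quotes.
			row.push('"' + String(details[i + j]).replace(/"/g, '""') + '"');
		}
		lines.push(row.join(','));
	}
	return lines.join('\n');
}
```

If something like this were pasted into the JS string in the script above, the double quotes and backslashes would need escaping for AppleScript first; alternatively the same regrouping could be done on the AppleScript side before writing the file.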