Help with Scraping Website

ihmunro · December 1, 2014, 12:42am

Evening

I have been looking at scraping some event websites for inclusion in my events app. There are so many event websites that copying and pasting from each one by hand would be impractical, so I thought I would try to automate it to some degree.

I found the following code on this site and tried it. It works, but it also pulls in everything else on the page.

Is there a better way?

tell application "Safari"
	activate
	open location "http://calgarydowntown.com/events/"
	delay 3
	
	set theSource to source of the document 1
	
end tell
set FullRecord to do shell script "/bin/echo " & quoted form of theSource & " | /usr/bin/textutil -stdin -stdout -format html -inputencoding iso-8859-1 -convert txt -encoding UTF-8"

Any help would be appreciated.

Iain

alastor933 · December 1, 2014, 10:45am

You can split text (including html) using text item delimiters. Read about them here.

Here’s a simple example:

set sample to "You can split text (including html) using [i]text item delimiters[/i]. Read about them [url=http://macscripter.net/viewtopic.php?id=24725]here[/url]."

set text item delimiters to "using"
set sample to sample's text items
set text item delimiters to ""
sample --> {"You can split text (including html) ", " [i]text item delimiters[/i]. Read about them [url=http://macscripter.net/viewtopic.php?id=24725]here[/url]."}

StefanK · December 1, 2014, 11:48am

Hi,

Here’s a draft as a starting point. The URL can be adjusted for a specific date.
All records can be separated by the string class="col
The script extracts the category, title, date and location.


set theURL to "http://calgarydowntown.com/events/?cal=" & "2014-12-02"
set FullRecord to do shell script "curl " & quoted form of theURL

set {TID, text item delimiters} to {text item delimiters, "class=\"col"}
set theEvents to text items of FullRecord
set eventList to {}
repeat with i from 2 to (count theEvents)
	set aPart to item i of theEvents
	set category to getStringValueFromTag(aPart, "h5")
	set title to getStringValueFromTag(aPart, "h4")
	set when to getStringValueFromTagWithClass(aPart, "p", "event_when")
	set location to getStringValueFromTagWithClass(aPart, "p", "event_where")
	set end of eventList to {category & tab & title & tab & when & tab & location}
end repeat
set text item delimiters to return
set eventList to eventList as text
set text item delimiters to TID
display dialog eventList

on getStringValueFromTag(sourceText, tag)
	set text item delimiters to "<" & tag & ">"
	set split to text item 2 of sourceText
	set text item delimiters to "</" & tag & ">"
	set split to text item 1 of split
	return split
end getStringValueFromTag

on getStringValueFromTagWithClass(sourceText, tag, _class)
	set text item delimiters to "<" & tag & " class=\"" & _class & "\">"
	set split to text item 2 of sourceText
	set text item delimiters to "</" & tag & ">"
	set split to text item 1 of split
	return split
end getStringValueFromTagWithClass
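
One thing to watch for when adapting these handlers to another site: text item 2 raises an error whenever the opening tag is not found in the part being examined. A guarded variant, sketched here under the hypothetical name safeStringValueFromTag (it is not part of the script above), returns an empty string instead and restores the delimiters before returning:

```applescript
-- Quick checks with made-up input strings
set found to safeStringValueFromTag("<h4>Market</h4>", "h4") -- tag present
set missing to safeStringValueFromTag("<p>no heading</p>", "h4") -- tag absent
{found, missing}

-- Hypothetical guarded variant of getStringValueFromTag:
-- returns "" when the tag is missing instead of raising an error
on safeStringValueFromTag(sourceText, tag)
	set TID to text item delimiters
	try
		set text item delimiters to "<" & tag & ">"
		set split to text item 2 of sourceText
		set text item delimiters to "</" & tag & ">"
		set split to text item 1 of split
	on error
		-- opening tag not found: fall back to an empty value
		set split to ""
	end try
	set text item delimiters to TID
	return split
end safeStringValueFromTag
```

With the two sample strings the result should be {"Market", ""}, so a record with a missing field shows up as an empty column rather than stopping the whole loop.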

McUsrII · December 1, 2014, 7:38pm

Hello.

You can also use Safari’s do JavaScript command (do JavaScript "javaScriptCommand();" in document 1). It may be harder to work out how to return the data, but once it works it should give you a greater level of control.
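
A minimal sketch of that approach, assuming the events page is already loaded in Safari's front document and that the event titles sit in h4 elements (as the handler calls earlier in the thread suggest; the selector is an assumption and will differ per site). Recent Safari versions may also require "Allow JavaScript from Apple Events" to be enabled in the Develop menu.

```applescript
tell application "Safari"
	-- Collect the trimmed text of every <h4> element and join with newlines.
	-- The 'h4' selector is an assumption about the page's markup.
	set theTitles to do JavaScript "Array.prototype.map.call(document.querySelectorAll('h4'), function (el) { return el.textContent.trim(); }).join('\\n');" in document 1
end tell

-- Split the returned string into an AppleScript list, one title per item
set titleList to paragraphs of theTitles
```

The JavaScript does the selecting inside the page, so only the wanted text comes back to AppleScript as a single string.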

Edit
Using awk from the command line is also an option for extracting data.

The heavy-duty solution is to convert the HTML to XML and then use XQuery, from Python say, to scrape the data.

ihmunro · December 2, 2014, 4:28am

Hi StefanK

This is perfect.

Although I just joined, I have seen your name around already on the forum - thanks for taking the time to do this.

You have set the script up for one site and one specific date. Is it possible to do this for all the entries on the site?

Iain

StefanK · December 2, 2014, 8:20am

Actually, the code retrieves all entries on the site.
I just flattened the list for the display dialog output.

ihmunro · December 2, 2014, 2:01pm

Hi Stefan

I started looking at your code plus the code on the site, so I can see what you are doing. Typically I learn this way, but need a couple of examples from which to gain the knowledge.

As I mentioned there are a few sites that I will have to scrape and then consolidate the data.

This is another one - of course it would have to be completely different in structure. I did start to play with the fields, but am getting errors or it just displays a blank box with the Cancel and Ok buttons.

Any guidance would be appreciated.

http://www.avenuecalgary.com/Calendar/

Iain

StefanK · December 2, 2014, 2:26pm

Here’s a general way to parse HTML.

Assuming you’re using Safari, enable the Develop menu in Preferences > Advanced.
Then open the site and press ⌥⌘U.
In the main view of the Inspector panel at the bottom of the window there is a popup menu displaying "Source Code".
Switch from Source Code to DOM Tree.
Hovering with the mouse over the elements highlights the equivalent section in the browser view.
Open the disclosure triangles until you reach the event listing section.
You will see elements with tag="article" and class="event-listing "
These are the main separators of the events.
Then you can look for other (unique) tags or attributes to extract the information you need.
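
As a minimal sketch of that splitting step, the separator can be tried on a small inline sample before pointing the script at the live page. The sampleHTML string below is made up for illustration and only mimics the article / event-listing structure described above; the real page's markup may differ.

```applescript
-- Made-up sample that mimics two event entries
set sampleHTML to "<article class=\"event-listing \"><p class=\"event-date\">Dec 2</p></article><article class=\"event-listing \"><p class=\"event-date\">Dec 3</p></article>"

-- Split on the separator found in the DOM tree, then restore the delimiters
set {TID, text item delimiters} to {text item delimiters, "class=\"event-listing"}
set theParts to text items of sampleHTML
set text item delimiters to TID

-- The first item is whatever precedes the first separator, so skip it
set eventCount to (count theParts) - 1 --> 2
```

Each remaining item of theParts holds the markup of one event, which is why the loops in the scripts start at item 2.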

ihmunro · December 3, 2014, 3:53am

Hi Stefan

Thanks for the tips.

Did some playing around - different results when I changed things around, but still an error.

How does this look ?

set theURL to "curl view-source:http://www.avenuecalgary.com/Calendar/?event_date=" & "2014-12-02"
set FullRecord to 98

set {TID, text item delimiters} to {text item delimiters, "class=\"col"}
set theEvents to text items of FullRecord
set text item delimiters to TID
set eventList to {}
repeat with i from 2 to (count theEvents)
	set aPart to item i of theEvents
	set title to getStringValueFromTagWithClass(aPart, "event_header")
	set when to getStringValueFromTagWithClass(aPart, "p", "event_date")
	set location to getStringValueFromTagWithClass(aPart, "p", "event_desc")
	set end of eventList to {title & tab & when & tab & location}
end repeat
set {TID, text item delimiters} to {text item delimiters, return}
set eventList to eventList as text
set text item delimiters to TID
display dialog eventList

on getStringValueFromTag(sourceText, tag)
	set {TID, text item delimiters} to {text item delimiters, "<" & tag & ">"}
	set split to text item 2 of sourceText
	set text item delimiters to "</" & tag & ">"
	set split to text item 1 of split
	set text item delimiters to TID
	return split
end getStringValueFromTag

on getStringValueFromTagWithClass(sourceText, tag, _class)
	set {TID, text item delimiters} to {text item delimiters, "<" & tag & " class=\"" & _class & "\">"}
	set split to text item 2 of sourceText
	set text item delimiters to "</" & tag & ">"
	set split to text item 1 of split
	set text item delimiters to TID
	return split
end getStringValueFromTagWithClass

Iain

StefanK · December 3, 2014, 8:19am

Your script cannot work. The correct syntax to get the source with curl is


set FullRecord to do shell script "curl " & quoted form of ("http://www.avenuecalgary.com/Calendar/?event_date=" & "2014-12-02")

The class names contain dashes (-), not underscores (_).

Try this. Every website has its own markup and must be handled individually.
The script uses a new handler getHTMLTreeFromClass(), which returns the HTML text after the specific class name,
and a handler getStringValueBetweenSource() to get the text between two arbitrary strings.


set FullRecord to do shell script "curl " & quoted form of ("http://www.avenuecalgary.com/Calendar/?event_date=" & "2014-12-02")

set {TID, text item delimiters} to {text item delimiters, "class=\"event-listing"}
set theEvents to text items of FullRecord
set eventList to {}
repeat with i from 2 to (count theEvents)
	set aPart to item i of theEvents
	set eventheader to getHTMLTreeFromClass(aPart, "event-header")
	set when to getStringValueFromTagWithClass(eventheader, "p", "event-date")
	set eventbody to getHTMLTreeFromClass(aPart, "event-body")
	set location to getStringValueBetweenSource(eventbody, "</div>", "<p class")
	set end of eventList to {when & tab & location}
end repeat
set text item delimiters to return
set eventList to eventList as text
set text item delimiters to TID
display dialog eventList

on getHTMLTreeFromClass(sourceText, _class)
	set text item delimiters to "class=\"" & _class & "\">"
	return text item 2 of sourceText
end getHTMLTreeFromClass

on getStringValueFromTag(sourceText, tag)
	set text item delimiters to "<" & tag & ">"
	set split to text item 2 of sourceText
	set text item delimiters to "</" & tag & ">"
	return text item 1 of split
end getStringValueFromTag

on getStringValueFromTagWithClass(sourceText, tag, _class)
	set text item delimiters to "<" & tag & " class=\"" & _class & "\">"
	set split to text item 2 of sourceText
	set text item delimiters to "</" & tag & ">"
	return text item 1 of split
end getStringValueFromTagWithClass

on getStringValueBetweenSource(sourceText, delim1, delim2)
	set text item delimiters to delim1
	set split to text item 2 of sourceText
	set text item delimiters to delim2
	return text item 1 of split
end getStringValueBetweenSource