IMMD Reader!

So I wrote this the other day, but it’s kinda inefficient.

What it does is read http://www.itmademyday.com/.

It takes about a minute and a half on average, which is pretty dang slow (not sure how to improve it…)

It is by no means perfect, as it manually converts HTML entities such as “&quot;”, so it may miss one every so often, but I don’t know a way around it.

Anyway, here it is for fans of the site! :smiley:

Just change the variable myPageNum to whatever page you want to read.


set startTime to time of (current date)
set MMDs to {}
set myPageNum to 1
log "Application initialized..."
set myURL to "http://itmademyday.com/page/" & myPageNum & "/"
--Uncomment the following to get faster, but exclusively recent entries:
--set myURL to "http://feeds.feedburner.com/IMMD"
set mySite to do shell script "curl " & quoted form of myURL
log "Downloaded page..."
repeat with myCount from 1 to (count paragraphs of mySite)
	if paragraph myCount of mySite contains "<blockquote>" then set MMDs to MMDs & (paragraph myCount of mySite)
end repeat
log "Captured entries..."
repeat with myCount from 1 to (count items of MMDs)
	set text item delimiters to "<blockquote>"
	set myLine to second text item of (item myCount of MMDs)
	set text item delimiters to "</p></blockquote>"
	set myLine to first text item of myLine
	set (item myCount of MMDs) to myLine
end repeat
log "Formatted entries..."
set text item delimiters to {""}
repeat with myCount from 1 to (count items of MMDs)
	-- Entity codes as assumed here (&quot;, &#8217;, &#8230;); check them against the page source.
	if item myCount of MMDs contains "&quot;" then set item myCount of MMDs to (replaceText("&quot;", "\"", (item myCount of MMDs)))
	if item myCount of MMDs contains "&#8217;" then set item myCount of MMDs to (replaceText("&#8217;", "'", (item myCount of MMDs)))
	if item myCount of MMDs contains "&#8230;" then set item myCount of MMDs to (replaceText("&#8230;", "...", (item myCount of MMDs)))
end repeat
log "Made humanly readable..."
tell application "TextEdit"
	repeat with myCount from 1 to (count items of MMDs)
		set text of document 1 to (text of document 1) & (item myCount of MMDs) & return & return
	end repeat
end tell
log "Opened in TextEdit..."
set endTime to time of (current date)
log "Done!"
log "Took " & (endTime - startTime) & " seconds."

-- Thanks to Bruce Phillips of MacScripter for this:
on replaceText(find, replace, someText)
	set prevTIDs to text item delimiters of AppleScript
	set text item delimiters of AppleScript to find
	set someText to text items of someText
	set text item delimiters of AppleScript to replace
	set someText to "" & someText
	set text item delimiters of AppleScript to prevTIDs
	return someText
end replaceText

-SuperScripter

Hi,

Your script is quite slow because it runs expensive code inside the repeat loops.
Splitting the text with text item delimiters would be much faster than checking each paragraph one by one.

With the help of a few shell commands, this is shorter and a bit faster:


set myPageNum to 1
set myURL to "http://itmademyday.com/page/" & myPageNum & "/"
set MMDs to paragraphs of (do shell script "curl " & quoted form of myURL & " | textutil -stdin -stdout -format html -convert txt -encoding UTF-8 | awk /IMMD/")
set {TID, text item delimiters} to {text item delimiters, return & return}
set MMDs to MMDs as text
set text item delimiters to TID
tell application "TextEdit"
	activate
	if (count documents) = 0 then make new document
	set text of document 1 to MMDs
end tell

…Wow. That’s amazing! :o

But I am a bit curious as to what all of the handlers of your shell script do:

So it takes the source of the URL, and tells textutil to do -stdin and -stdout. What do those two handlers do? Then it tells it to take the HTML and convert it to UTF-8 (presumably to transform “&quot;” and all those). So, exactly what does the awk command do? Somehow it separates the items?

Can you please explain all of these for me? Thanks.

-SuperScripter

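textutil reads from standard input (-stdin) and writes to standard output (-stdout). It treats the input as HTML (-format html), converts it to plain text (-convert txt), and writes the result in UTF-8 (-encoding UTF-8). The conversion to plain text is what turns the HTML entities into normal characters.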
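
To try the textutil step on its own, something like the following sketch could be run in Terminal. The helper name html_to_text and the sample HTML are made up for illustration; only the textutil options come from the script above. Since textutil is a macOS tool, the sketch checks for it first.

```shell
# html_to_text is a hypothetical helper name; the options are the ones
# used in the pipeline above: read HTML on stdin, write plain text on stdout.
html_to_text() {
    textutil -stdin -stdout -format html -convert txt -encoding UTF-8
}

if command -v textutil >/dev/null 2>&1; then
    # Made-up sample: the entities should come out as ordinary characters.
    printf '%s\n' '<p>&quot;Hi&quot; &amp; bye</p>' | html_to_text
else
    echo "textutil not found (it is a macOS tool)" >&2
fi
```

If the quotes and the ampersand come out as plain characters, the manual replaceText calls from the first script are no longer needed.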

awk keeps only the lines of the converted text that contain the keyword IMMD and drops everything else.
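
The awk step can be seen in isolation with a small sketch like this one. The sample lines and the function name filter_immd are invented for illustration and merely stand in for the plain text that textutil produces.

```shell
# Hypothetical sample lines standing in for the converted page text.
sample='Site navigation
My bus driver waited for me. IMMD
About this site
Found a dollar in my coat. IMMD'

# Same filter as in the pipeline: print only the lines matching IMMD.
filter_immd() {
    awk '/IMMD/'
}

printf '%s\n' "$sample" | filter_immd
```

Only the two lines containing IMMD are printed; the other two are dropped. In the AppleScript, "paragraphs of" then turns each remaining line into one list item.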

Awesome, thanks!

I’ll be sure to use those commands next time I’m parsing HTML - very useful!

-SuperScripter