getTextBetween with Amazon.com page source to get shipping weight

earthsaver · October 9, 2007, 3:43am

I’m seeking to script the retrieval of the shipping weight of various products listed on Amazon.com. I’ve started the process and gotten some advance help from Dan Shockley; however, I’m stuck on the “list or record” error 1024. The getTextBetween handler appears below the script in my file; here’s what I have so far.

set these_codes to paragraphs of (do shell script "cat '/Users/earthsaver/Desktop/file.txt'")

tell application "Safari" to activate
repeat with this_code in these_codes
	set theAddress to "http://www.amazon.com/dp/" & this_code
	set theSource to (do shell script "curl " & theAddress)
	set beforeText to "Shipping Weight: "
	set afterText to "(View"
	set foundWeight to my getTextBetween(theSource, beforeText, afterText)
	set the clipboard to foundWeight
	
	tell application "TextEdit" to activate
	tell application "System Events" to tell process "TextEdit"
		search for this_code
		keystroke right & tab & "v" using command down
	end tell
end repeat

file.txt is for now just a test file containing three random product codes of books on Amazon.com (10-digit ISBN numbers). Amazon makes it easy to view these product pages using the URL format set as theAddress. Also, I’m happy to learn how to properly paste from the clipboard without using keystrokes. The goal is a tab-delimited text file containing product codes and shipping weights. (Know of a good way to script the conversion of ounces to pounds, only if necessary?)

Thanks!

StefanK · October 9, 2007, 8:44am

Hi and welcome.

This script parses the title and weight from each page and appends one line per product, in the format ISBN, title, weight (tab-separated), to a text file.
It works completely without involving any application.

set these_codes to paragraphs of (do shell script "cat '/Users/earthsaver/Desktop/file.txt'")
set textFile to ((path to desktop as Unicode text) & "textFile.txt")

repeat with this_code in these_codes
	set {theTitle, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(Shipping Weight)/'")
	set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
	set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
	if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
	write_to_disk from this_code & tab & theTitle & tab & theWeight & return into textFile with append
end repeat

on parse_text(t, del1, del2)
	set {TID, text item delimiters} to {text item delimiters, del1}
	set t to text item 2 of t
	set text item delimiters to del2
	set t to text item 1 of t
	set text item delimiters to TID
	return t
end parse_text

on write_to_disk from theData into targetFile given append:append
	try
		set ff to open for access file targetFile with write permission
		if append is false then set eof of ff to 0
		write theData to ff starting at eof
		close access ff
		return true
	on error
		try
			close access file targetFile
		end try
		return false
	end try
end write_to_disk
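
For illustration, here is a minimal sketch of what parse_text returns for a single line. The sample string is made up to resemble the markup the script expects; the real page source may differ, so treat it as an assumption.

```applescript
-- Hypothetical sample line; real Amazon markup may differ.
set sampleLine to "<li><b>Shipping Weight:</b> 1.2 pounds (View shipping rates)"
set theWeight to parse_text(sampleLine, "<li><b>Shipping Weight:</b> ", " (")
return theWeight -- the text between the two delimiters

-- Same handler as in the script above: keep what follows del1, then cut at del2.
on parse_text(t, del1, del2)
	set {TID, text item delimiters} to {text item delimiters, del1}
	set t to text item 2 of t
	set text item delimiters to del2
	set t to text item 1 of t
	set text item delimiters to TID
	return t
end parse_text
```

If del1 does not occur in the line, there is no text item 2 and the handler raises an error, so the delimiters have to match the page source exactly.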

earthsaver · October 9, 2007, 9:55am

Wow! Thanks for offering such simplicity and effectiveness. Good thinking on fetching the title, too, to confirm that Amazon lists the right item. I’ll try it as soon as I can get into the Bookstore.

One thing comes to mind that I forgot to mention: There are likely to be a handful of items that Amazon doesn’t list at all. How would I proactively handle these errors, wherein the page curled doesn’t contain “Shipping Weight”?

StefanK · October 9, 2007, 10:13am

The easiest way is to use a try block.
If an error occurs inside the block, the script continues after the end try statement.


.
repeat with this_code in these_codes
	try
		set {theTitle, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(Shipping Weight)/'")
		set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
		set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
		if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
		write_to_disk from this_code & tab & theTitle & tab & theWeight & return into textFile with append
	end try
.
end repeat
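
A small sketch of why the try block catches unlisted items: when awk finds no "Shipping Weight" line, the list has fewer items than the variables on the left, and the assignment itself fails. The product code and the one-line result below are placeholders, not real data.

```applescript
set this_code to "0000000000" -- placeholder, not a real ISBN
-- Hypothetical awk output for a page without a "Shipping Weight" line: one match only
set foundLines to {"<title>Amazon.com: Page Not Found</title>"}

try
	-- Two variables, one list item: AppleScript raises an error here
	set {theTitle, theWeight} to foundLines
	set resultLine to this_code & tab & theTitle & tab & theWeight
on error
	-- Record the code alone so the missing item is still visible in the output
	set resultLine to this_code & tab & "not found"
end try
return resultLine
```

Without an on error clause the failed item is skipped silently; adding one, as sketched here, leaves a trace in the output.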

earthsaver · October 9, 2007, 10:19am

Wait! append is creating an invisible character between every character in textFile. These characters aren’t visible spaces in TextEdit or TextWrangler, but they do appear in Numbers. They aren’t actual spaces either, but appear to be treated as individual words. I’ll be checking the results in Excel for importing to LightSpeed. Seems Excel is too old or unsophisticated to read them; perhaps they are unicode invisible characters. So, it looks like I can safely import to Excel and resave the file to rid it of the undesired. However, any way to avoid this altogether?

StefanK · October 9, 2007, 10:31am

Ok,

if you prefer Unicode text, change this line

if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as Unicode text) & " pounds"

and the write_to_disk handler

on write_to_disk from theData into targetFile given append:append
	try
		set ff to open for access file targetFile with write permission
		if append is false then set eof of ff to 0
		if (get eof of ff) = 0 then write (ASCII character 254) & (ASCII character 255) to ff -- write BOM 
		write theData to ff as Unicode text starting at eof
		close access ff
		return true
	on error
		try
			close access file targetFile
		end try
		return false
	end try
end write_to_disk

If you want MacRoman, change only this line:

write_to_disk from (this_code & tab & theTitle & tab & theWeight & return) as string into textFile with append

and leave the handler untouched
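
The invisible characters are consistent with the text being written as two-byte Unicode. A third option, sketched here as an assumption rather than a tested drop-in for the handler above, is to write UTF-8, which stays one byte per character for plain ASCII. The file name is hypothetical.

```applescript
-- Hypothetical test file in the temporary items folder
set targetFile to (path to temporary items as Unicode text) & "utf8_test.txt"
set theData to "0123456789" & tab & "1.2 pounds" & return -- placeholder data

set ff to open for access file targetFile with write permission
try
	set eof of ff to 0 -- start with an empty file
	write theData to ff as «class utf8» -- UTF-8 instead of two-byte Unicode text
	close access ff
on error
	close access ff
end try
```

Whether the importing application expects UTF-8, MacRoman, or Unicode text with a BOM still has to be checked on the receiving side.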

earthsaver · October 9, 2007, 11:59am

Thanks for all these answers, Stefan. I really appreciate it. I just remembered that the original reason I was checking the results was to see if and how weight values would be truncated, rounded, or otherwise. I see they’re not. Since ounces to pounds is not a complex conversion, no result will extend more than five places behind the decimal, and I can use a spreadsheet to round as I please.

Looks great! I’ll be testing on a Mac Pro as I think our 1.42 GHz PowerPC LightSpeed server will be far too slow for the task.

earthsaver · September 21, 2008, 3:24am

Are you still around, Stefan? I’d appreciate your help with one more addition to the Amazon parsing script. I want to add Product Dimensions, too. Based on your script template, I added the bits about dimensions to the script, as shown. However, when run, it leaves me with only the original list of ISBNs rather than appending the appropriate information as requested. Where did I screw up?

repeat with this_code in these_codes
	try
		set {theTitle, theDimensions, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(Product Dimensions)|(Shipping Weight)/'")
		set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
		set theDimensions to parse_text(theDimensions, "<li><b>Product Dimensions:</b> ", " inches")
		set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
		if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
		write_to_disk from (this_code & tab & theTitle & tab & theDimensions & tab & theWeight & return) as string into textFile with append
	on error
		write_to_disk from (this_code & return) as string into textFile with append
	end try
end repeat

StefanK · September 21, 2008, 8:39am

If you take a look at the source text of any Amazon page (in Safari ⌥⌘U), there is a line break after “Product Dimensions”.
So the proper way is to filter the line containing “inches” and cut the word “inches” at the end.


.
try
	set {theTitle, theDimensions, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(inches)|(Shipping Weight)/'")
	set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
	set theDimensions to text 1 thru -7 of theDimensions
	set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
	if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
	write_to_disk from (this_code & tab & theTitle & tab & theDimensions & tab & theWeight & return) as string into textFile with append
on error
	write_to_disk from (this_code & return) as string into textFile with append
end try
.
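
To make the cut explicit, here is a sketch of what text 1 thru -7 does, assuming the matched line ends exactly with " inches". The sample value is invented.

```applescript
set theDimensions to "8.4 x 5.5 x 1.2 inches" -- hypothetical sample line
-- "inches" is 6 characters, so character -7 is the space in front of it
set trimmed to text 1 thru -7 of theDimensions
-- trimmed still ends with that space; use -8 to drop it as well
set trimmedNoSpace to text 1 thru -8 of theDimensions
return {trimmed, trimmedNoSpace}
```

If the real line carries leading whitespace or markup, those remain in the result and need their own cleanup.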

earthsaver · September 21, 2008, 12:23pm

Yep, I did notice those breaks but I didn’t know how to treat them. Thanks again for all your help!

earthsaver · November 5, 2008, 1:20am

So, is there a way to include line breaks in a search? Suppose I now want to capture a Product Description, which in the HTML has line breaks before and after the description. I think I either need to note the breaks in the search query or delete them and the attached text after acquiring them in the result. Here’s an example page. Also, do I need to list theDescription before theWeight and theDimensions because it appears before these latter two on the page, or doesn’t it matter? I got this far:

repeat with this_code in these_codes
	try
		set {theDimensions, theWeight, theDescription} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(inches)|(Shipping Weight)|(Product Description)/'")
		set theDimensions to text 1 thru -7 of theDimensions
		set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
		if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
		set theDescription to parse_text(theDescription, "<b>Product Description</b><br />", "</div></div>")
		write_to_disk from (this_code & tab & theDimensions & tab & theWeight & tab & theDescription & return) as string into textFile with append
	on error
		write_to_disk from (this_code & return) as string into textFile with append
	end try
end repeat
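
One possible approach to the line-break question, sketched under assumptions: awk prints matching lines in the order they occur in the page, so the variables on the left must follow page order, and awk's getline can step to the line after a match. The sample source below is invented and assumes the description sits on the single line right after the heading.

```applescript
-- Hypothetical three-line stand-in for the page source
set sampleSource to "<b>Product Description</b><br />" & linefeed & "A fine book about whales." & linefeed & "</div></div>"

-- On the heading line, read the next line with getline and print only that one
set theDescription to do shell script "printf '%s\\n' " & quoted form of sampleSource & " | awk '/Product Description/ {getline; print}'"
return theDescription
```

A description that spans several lines would need a different awk program, for example one that prints until the closing tags appear.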