I’m trying to script the retrieval of the shipping weight of various products listed on Amazon.com. I’ve started the process and gotten some advance help from Dan Shockley, but I’m stuck on the “list or record” error 1024. The getTextBetween handler appears below the script; here’s what I have so far.
set these_codes to paragraphs of (do shell script "cat '/Users/earthsaver/Desktop/file.txt'")
tell application "Safari" to activate
repeat with this_code in these_codes
set theAddress to "http://www.amazon.com/dp/" & this_code
set theSource to (do shell script "curl " & theAddress)
set beforeText to "Shipping Weight: "
set afterText to "(View"
set foundWeight to my getTextBetween(theSource, beforeText, afterText)
set the clipboard to foundWeight
tell application "TextEdit" to activate
tell application "System Events" to tell process "TextEdit"
search for this_code
keystroke right & tab & "v" using command down
end tell
end repeat
file.txt is for now just a test file containing three random product codes of books on Amazon.com (10-digit ISBN numbers). Amazon makes it easy to view these product pages using the URL format set as theAddress. Also, I’m happy to learn how to properly paste from the clipboard without using keystrokes. The goal is a tab-delimited text file containing product codes and shipping weights. (Know of a good way to script the conversion of ounces to pounds, only if necessary?)
This script parses the title and weight from each page and appends one line per product to a text file, in the tab-separated format ISBN, title, weight.
It works completely without involving any application.
set these_codes to paragraphs of (do shell script "cat '/Users/earthsaver/Desktop/file.txt'")
set textFile to ((path to desktop as Unicode text) & "textFile.txt")
repeat with this_code in these_codes
set {theTitle, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(Shipping Weight)/'")
set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
write_to_disk from this_code & tab & theTitle & tab & theWeight & return into textFile with append
end repeat
on parse_text(t, del1, del2)
set {TID, text item delimiters} to {text item delimiters, del1}
set t to text item 2 of t
set text item delimiters to del2
set t to text item 1 of t
set text item delimiters to TID
return t
end parse_text
on write_to_disk from theData into targetFile given append:append
try
set ff to open for access file targetFile with write permission
if append is false then set eof of ff to 0
write theData to ff starting at eof
close access ff
return true
on error
try
close access file targetFile
end try
return false
end try
end write_to_disk
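For anyone who wants to try the filtering step on its own, here is a minimal shell sketch of what the `curl ... | awk` pipeline and `parse_text` do together. The sample page and the function names `filter_title_and_weight` and `text_between` are made up for illustration; they are not part of the script above, and real Amazon markup may well differ. The delimiters are the ones passed to `parse_text`.

```shell
# Keep only the lines that contain "<title>" or "Shipping Weight".
# Reads HTML on stdin; in the script above that input comes from curl.
filter_title_and_weight() {
    awk '/(<title>)|(Shipping Weight)/'
}

# Print the text between the first occurrence of $1 and the next
# occurrence of $2 on each input line. Lines lacking either are skipped.
text_between() {
    awk -v before="$1" -v after="$2" '{
        start = index($0, before)
        if (start == 0) next
        rest = substr($0, start + length(before))
        stop = index(rest, after)
        if (stop == 0) next
        print substr(rest, 1, stop - 1)
    }'
}

# Hypothetical stand-in for a downloaded product page (an assumption).
sample_page='<html><head>
<title>Amazon.com: Example Book: Books</title>
</head><body>
<li><b>Shipping Weight:</b> 12.8 ounces (View shipping rates)</li>
</body></html>'

matches=$(printf '%s\n' "$sample_page" | filter_title_and_weight)
printf '%s\n' "$matches" | text_between '<title>Amazon.com: ' ': Books'
printf '%s\n' "$matches" | text_between '<li><b>Shipping Weight:</b> ' ' ('
```

Unlike `parse_text`, this sketch skips a line when a delimiter is missing instead of raising an error, so it only roughly mirrors the handler.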
Wow! Thanks for offering such simplicity and effectiveness. Good thinking on fetching the title, too, to confirm that Amazon lists the right item. I’ll try it as soon as I can get into the Bookstore.
One thing comes to mind that I forgot to mention: There are likely to be a handful of items that Amazon doesn’t list at all. How would I proactively handle these errors, wherein the page curled doesn’t contain “Shipping Weight”?
The easiest way is to use a try block.
If an error occurs, the script continues after the end try statement
.
repeat with this_code in these_codes
try
set {theTitle, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(Shipping Weight)/'")
set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
write_to_disk from this_code & tab & theTitle & tab & theWeight & return into textFile with append
end try
.
end repeat
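The same idea can also be checked before any parsing happens. The following is only a sketch in plain shell, with a hypothetical function name `weight_row` and made-up sample input: it writes the code together with the weight when a "Shipping Weight" line exists, and the code alone when it does not, so unlisted items stay visible in the output.

```shell
# Print "code<TAB>weight" when the page text in $2 has a Shipping Weight
# line, otherwise print just the code so the gap is easy to spot later.
weight_row() {
    code=$1
    page=$2
    # awk exits with status 0 even when nothing matches.
    weight_line=$(printf '%s\n' "$page" | awk '/Shipping Weight/')
    if [ -n "$weight_line" ]; then
        weight=$(printf '%s\n' "$weight_line" |
            sed -e 's/.*Shipping Weight:<\/b> //' -e 's/ (.*//')
        printf '%s\t%s\n' "$code" "$weight"
    else
        printf '%s\n' "$code"
    fi
}

# Made-up inputs: one page with a weight line, one without.
weight_row 0123456789 '<li><b>Shipping Weight:</b> 12.8 ounces (View shipping rates)</li>'
weight_row 9876543210 '<html><body>Looking for something?</body></html>'
```

With a bare `try` block the failed code is silently skipped; writing the code alone keeps a record of which items need a manual look.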
Wait! append is creating an invisible character between every character in textFile. These characters aren’t visible as spaces in TextEdit or TextWrangler, but they do appear in Numbers. They aren’t actual spaces either, but appear to be treated as individual words. I’ll be checking the results in Excel for importing to LightSpeed. It seems Excel is too old or unsophisticated to read them; perhaps they are invisible Unicode characters. So it looks like I can safely import to Excel and resave the file to get rid of them. Is there any way to avoid this altogether, though?
if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as Unicode text) & " pounds"
and the write_to_disk handler
on write_to_disk from theData into targetFile given append:append
try
set ff to open for access file targetFile with write permission
if append is false then set eof of ff to 0
if (get eof of ff) = 0 then write (ASCII character 254) & (ASCII character 255) to ff -- write BOM
write theData to ff as Unicode text starting at eof
close access ff
return true
on error
try
close access file targetFile
end try
return false
end try
end write_to_disk
If you want MacRoman, change only this line:
write_to_disk from (this_code & tab & theTitle & tab & theWeight & return) as string into textFile with append
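To see where the "invisible characters" come from, it can help to look at the raw bytes. This is a small sketch, assuming the Unicode text is written as big-endian UTF-16, which is what the BOM bytes 254 and 255 in the handler suggest; `hex_bytes` is a made-up helper name.

```shell
# Print the bytes read on stdin as hex pairs separated by single spaces.
hex_bytes() {
    od -An -tx1 | awk '
        { for (i = 1; i <= NF; i++) out = out (out == "" ? "" : " ") $i }
        END { print out }
    '
}

# "AB" in a one-byte encoding such as MacRoman or ASCII: two bytes.
printf 'AB' | hex_bytes

# "AB" as big-endian UTF-16 with the BOM (254, 255 = fe ff) in front:
# every ASCII character is preceded by a zero byte.
printf '\376\377\000A\000B' | hex_bytes
```

An application that does not recognize the file as UTF-16 shows those zero bytes as odd or invisible characters. An existing file can usually be converted afterwards with `iconv -f UTF-16 -t UTF-8`.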
Thanks for all these answers, Stefan. I really appreciate it. I just remembered that the original reason I was checking the results was to see if and how weight values would be truncated, rounded, or otherwise. I see they’re not. Since ounces to pounds is not a complex conversion, no result will extend more than five places behind the decimal, and I can use a spreadsheet to round as I please.
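If the rounding should happen before the spreadsheet, the division by 16 (ounces per pound) can be done with a fixed number of decimals. This is just a sketch in shell with a hypothetical function name `to_pounds`; it uses awk arithmetic in place of AppleScript's unit coercion, and two decimals is an arbitrary choice.

```shell
# Convert a weight string such as "12.8 ounces" to pounds, rounded to
# two decimals. Anything not measured in ounces is passed through as is.
to_pounds() {
    printf '%s\n' "$1" | awk '
        $2 == "ounces" { printf "%.2f pounds\n", $1 / 16; next }
        { print }
    '
}

to_pounds '12.8 ounces'
to_pounds '1.2 pounds'
```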
Looks great! I’ll be testing on a Mac Pro as I think our 1.42 GHz PowerPC LightSpeed server will be far too slow for the task.
Are you still around, Stefan? I’d appreciate your help with one more addition to the Amazon parsing script. I want to add Product Dimensions, too. Based on your script template, I added the bits about dimensions to the script, as shown. However, when run, it leaves me with only the original list of ISBNs rather than appending the appropriate information as requested. Where did I screw up?
repeat with this_code in these_codes
try
set {theTitle, theDimensions, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(Product Dimensions)|(Shipping Weight)/'")
set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
set theDimensions to parse_text(theDimensions, "<li><b>Product Dimensions:</b> ", " inches")
set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
write_to_disk from (this_code & tab & theTitle & tab & theDimensions & tab & theWeight & return) as string into textFile with append
on error
write_to_disk from (this_code & return) as string into textFile with append
end try
end repeat
If you take a look at the source text of any Amazon page (in Safari ⌥⌘U), there is a line break after “Product Dimensions”, so the value is not on the same line as the label.
So the proper way is to filter the line containing “inches” and cut the word “inches” off the end
.
try
set {theTitle, theDimensions, theWeight} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(<title>)|(inches)|(Shipping Weight)/'")
set theTitle to parse_text(theTitle, "<title>Amazon.com: ", ": Books")
set theDimensions to text 1 thru -7 of theDimensions
set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
write_to_disk from (this_code & tab & theTitle & tab & theDimensions & tab & theWeight & return) as string into textFile with append
on error
write_to_disk from (this_code & return) as string into textFile with append
end try
.
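A different way to cope with the line break, sketched here in plain shell and not taken from the script above, is to find the label and print the line that follows it. The function name `line_after` and the sample markup are assumptions; the only thing known from the page source is that a line break follows the label.

```shell
# Print the line that directly follows each line containing the label $1.
line_after() {
    awk -v label="$1" '
        index($0, label) { if ((getline next_line) > 0) print next_line }
    '
}

# Made-up fragment: the value sits on the line after the label.
sample='<li><b>Product Dimensions:</b>
8.2 x 5.5 x 1 inches
</li>'

# Strip the trailing " inches", like "text 1 thru -7" does in AppleScript.
printf '%s\n' "$sample" | line_after 'Product Dimensions' | sed 's/ inches$//'
```

Anchoring on the label may be a little safer than filtering on “inches”, since that word could also turn up elsewhere on a page and shift the three values.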
So, is there a way to include line breaks in a search? Suppose I now want to capture a Product Description, which in the HTML has line breaks before and after the description. I think I either need to account for the breaks in the search query or strip them, along with the attached text, after capturing the result. Here’s an example page. Also, do I need to list theDescription before theWeight and theDimensions because it appears before them on the page, or doesn’t it matter? I got this far:
repeat with this_code in these_codes
try
set {theDimensions, theWeight, theDescription} to paragraphs of (do shell script "curl -L http://www.amazon.com/dp/" & this_code & " | awk '/(inches)|(Shipping Weight)|(Product Description)/'")
set theDimensions to text 1 thru -7 of theDimensions
set theWeight to parse_text(theWeight, "<li><b>Shipping Weight:</b> ", " (")
if theWeight contains "ounces" then set theWeight to (word 1 of theWeight as ounces as pounds as string) & " pounds"
set theDescription to parse_text(theDescription, "<b>Product Description</b><br />", "</div></div>")
write_to_disk from (this_code & tab & theDimensions & tab & theWeight & tab & theDescription & return) as string into textFile with append
on error
write_to_disk from (this_code & return) as string into textFile with append
end try
end repeat
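One possible approach to both questions, sketched in plain shell under some assumptions: awk can collect every line from a start marker to an end marker and join them into one line, and it always prints matches in the order they occur in the page, whatever the order of the patterns. The function name `block_between` and the sample markup are made up; the two markers are the delimiters used in the attempt above.

```shell
# Join all lines from the first one containing $1 through the next one
# containing $2 into a single line, separated by spaces.
block_between() {
    awk -v start="$1" -v stop="$2" '
        !inside && index($0, start) { inside = 1 }
        inside {
            gsub(/\r/, "")    # drop carriage returns from CRLF pages
            out = out (out == "" ? "" : " ") $0
            if (index($0, stop)) { print out; out = ""; inside = 0 }
        }
    '
}

# Made-up fragment with line breaks around and inside the description.
sample='<div class="content">
<b>Product Description</b><br />
A short description
spread over two lines.
</div></div>'

printf '%s\n' "$sample" |
    block_between 'Product Description' '</div></div>' |
    sed -e 's/.*<b>Product Description<\/b><br \/> *//' -e 's/ *<\/div><\/div>.*//'

# Matches come out in page order, not in pattern order:
printf 'b line\na line\n' | awk '/(a line)|(b line)/'
```

If that holds for the real pages, the variables in `set {…} to paragraphs of` would need to follow the order in which the lines appear on the page, so a description that comes first on the page should be listed first.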