Get info from browser and store in an Excel spreadsheet

Hi,

I need help creating a script to handle the following:

  1. Read a list of ISBNs from a text file, each on a separate line:

1233455
9809808
0840809
  2. For each ISBN, look up book information. I have tested curl "my_link", and it displayed the data nicely as XML elements.
    The URL needed to grab the information from the site and display it as XML is:
    http://userinfo_etc=1&query=(bn=#{isbn})&startRecord=1&endRecord=1&sortField=DE

When I run this, I just substitute the actual ISBN value in place of #{isbn}.
Sample output when I run the single line for the ISBN 123456 in Terminal using curl:

123456
book
520
  3. I want to grab all the elements generated in step 2 and store them in a spreadsheet (one field per element).

Please let me know of any suggestions.

thanks.

I think your URL is broken there… what's a correct one, so I can take a look at exactly what data we are working with?

This will get you started with steps 1 and 2. I don’t know how to write the results to a spreadsheet…

-- get the isbn file
set fileName to "isbn.txt"
set deskFolder to (path to desktop folder as string)
set the_file to deskFolder & fileName

-- read the file's data
set the_data to read file the_file as text

-- make the data into a list
set file_data to paragraphs of the_data

-- get the web info
set my_link1 to "http://userinfo_etc=1&query=(bn=#"
set my_link2 to ")&startRecord=1&endRecord=1&sortField=DE"

repeat with i from 1 to (count of file_data)
	do shell script "curl " & my_link1 & (item i of file_data) & my_link2
end repeat

Thank you regulus6633 for your help.

I ran the script with the correct web link, but for some reason the web link is not working and I keep getting an error. Please let me know if I can e-mail you the web link since I don’t want to make it public information.

Also, if the script runs fine, can you please tell me where the output is stored? Maybe I should know this, but does it create an output file somewhere that I can access, and maybe find a way to create an Excel sheet from the results?

Thanks,

If you want to email me the link I would be happy to take a look at it as well. You can email me through my profile.

Sure, email it to me. Just click on the E-mail link under my name.

Sorry, but can we please try something else before the e-mail? My manager wasn't too happy about sending username info to someone outside the company! Sorry, but I will use it as a last option.

I have noticed that when running the whole link after the curl command in the do shell script, it works fine if I use single quotes; otherwise I get the same error I got before. Not to mention the script runs slow. Is that because of Terminal, or is the database search for each entry slow?

I am thinking we could find a way to define my_link1 and my_link2 in one line and only replace the variable #{isbn} with some variable we define while reading each ISBN from the list supplied.
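
Something like this is what I have in mind (just a sketch, and the handler name is made up): keep the whole link in one template string and swap out "#{isbn}" for each number with text item delimiters.

-- one template string holding the whole link, with #{isbn} as the placeholder
set link_template to "http://userinfo_etc=1&query=(bn=#{isbn})&startRecord=1&endRecord=1&sortField=DE"

-- replace every occurrence of placeholder in templateText with replacementText
on fillTemplate(templateText, placeholder, replacementText)
	set AppleScript's text item delimiters to placeholder
	set theParts to text items of templateText
	set AppleScript's text item delimiters to replacementText
	set filledText to theParts as text
	set AppleScript's text item delimiters to ""
	return filledText
end fillTemplate

-- for example, fillTemplate(link_template, "#{isbn}", "1233455") would give the full link for that ISBN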

Also, I have noticed the results are displayed on screen in Script Editor. I know in Terminal, if I use > output.xml, the results go there. How can I direct the output from the script to some output file that I can look at, save, etc.?

Finally, here is sample output for each isbn:

320731711100724156659780072415667bWorld Music: Traditions and TransformationsBakan, Michael B.Author156474963.75012007McGraw-Hill Humanities/Social Sciences/Langua

Thanks again.

About the speed problems you mention, there's nothing that can be done about that. Curl is as fast as you can get, and there's really no processing being done on your computer's end that would slow things down… so it must be something on the server end. I suspect it's the searching that you mentioned. The way we're defining myLink1 and 2 would not affect the speed at all; it's not a value that's changing. The only variable is the ISBN number we get from the file, and since it's just a text file and we're not actually searching that file, that shouldn't cause any speed problems either. I can't see anything on your computer causing speed issues, so it must be on the server's end… which you can't control.

I incorporated the changes you wanted, meaning I took care of the quotes and directed the output file to the desktop. As far as getting the output files into Excel, I would suggest that after you get this downloading stuff working, you start a new topic asking just that specific question. Anyway, here's the revised script. I hope it helps. Of course I can't test it on my end, so I'm not really sure how well it works.

Note: I edited this to make the output format better, so this script has been updated.

(*For this to work your input isbn file should be on your desktop. Your output file will be written to your desktop as well. The output file name will be in the form "ISBN (isbn number).xml" *)

-- you can change the name of your input isbn file to whatever it is here
set inputFileName to "isbn.txt"

set desktopFolder to (path to desktop folder as string)
set inputFile to desktopFolder & inputFileName
set outputPosixPath to POSIX path of desktopFolder
set outName to "ISBN "
set outExt to ".xml"

-- read the file's data
set the_data to read file inputFile as text

-- make the data into a list
set file_data to paragraphs of the_data

-- get the web info
set my_link1 to "http://userinfo_etc=1&query=(bn=#"
set my_link2 to ")&startRecord=1&endRecord=1&sortField=DE"

-- cycle through the isbn numbers and create a new file for each named for the isbn number
repeat with i from 1 to (count of file_data)
	do shell script "curl " & quoted form of (my_link1 & (item i of file_data) & my_link2) & " -o " & quoted form of (outputPosixPath & outName & (item i of file_data) & outExt)
end repeat

regulus6633,

Please check your e-mail, I just e-mailed you the script with some changes.

Thanks.

I haven’t gotten your email. I sent you a private message, check it out.

Just sent you a message. Hope this one works.

Yep, got that message. It works great. In testing 7 isbn numbers, it downloaded the results in about a second or two. I sent you a message about it. Here’s one more upgrade for your script. This saves the xml files into a folder on your desktop rather than having a bunch of loose files.

set inputFileName to "isbn.txt"
set outputFolderName to "isbn xmls"

set desktopFolder to (path to desktop folder as string)
set inputFile to desktopFolder & inputFileName
set outputPath to desktopFolder & outputFolderName
set outputPosixPath to POSIX path of desktopFolder & outputFolderName & "/"
set outName to "ISBN "
set outExt to ".xml"

-- create the output folder if needed
tell application "Finder"
	if not (exists folder outputPath) then make new folder at folder desktopFolder with properties {name:outputFolderName}
end tell

-- read the input file's data
set the_data to read file inputFile as text

-- make the data into a list
set file_data to paragraphs of the_data

-- the web link info
set my_link1 to "http://userinfo_etc=1&query=(bn=#" --> insert your real code here!
set my_link2 to ")&startRecord=1&endRecord=1&sortField=DE"

-- cycle through the isbn numbers and create a new file for each, named for the isbn number
repeat with i from 1 to (count of file_data)
	do shell script "curl " & quoted form of (my_link1 & (item i of file_data) & my_link2) & " -o " & quoted form of (outputPosixPath & outName & (item i of file_data) & outExt)
end repeat

Thanks, regulus6633. You are right, it's a lot faster than before. A lot. Placing all the XML files in a folder was neat thinking.

I have noticed that some of the XML files show either "bad request" or "incorrect query type". What do you think might be causing that?

Also, I usually open the files in BBEdit and format each one by selecting Markup > Utilities > Format > Hierarchical. I think I can batch this somehow in BBEdit, but if you know of any other ways, please let me know.
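
I also wonder whether the stock xmllint tool that comes with OS X could do the formatting straight from the script. This is just a guess on my part, but something like the following should write pretty-printed copies into a "formatted" subfolder of the output folder:

-- untested idea: pretty-print every downloaded xml file with xmllint
set outputPosixPath to POSIX path of (path to desktop folder) & "isbn xmls/"
do shell script "cd " & quoted form of outputPosixPath & " && mkdir -p formatted && for f in *.xml; do /usr/bin/xmllint --format \"$f\" > \"formatted/$f\"; done"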

Do you suggest starting another post about generating an Excel sheet from the results based on the elements? For example, I want to display:

ISBN Author Title Pages


Thanks.
Bill

Send me an XML file from the output of the previous script, or the script itself, and I can help you parse the data out and into Excel.

Regarding the “bad requests”: assuming the ISBN numbers are correct, my only thought is that maybe the script is too fast for the curl command and the web requests can't keep up with the script. If that's the case, then maybe we need to slow the script down some. You can try changing the loop like this and see if it helps.

This will put a 1-second delay after each curl request. You can try changing the value to find what works best… if that's the problem.

repeat with i from 1 to (count of file_data)
	do shell script "curl " & quoted form of (my_link1 & (item i of file_data) & my_link2) & " -o " & quoted form of (outputPosixPath & outName & (item i of file_data) & outExt)
	delay 1
end repeat

Regarding how to get it into Excel, I know that Excel can import tab-delimited or comma-delimited text files, so if you can parse the XML files into one of these, then you can just import the parsed text file into Excel. I envision a format such as this for the parsed file:

tag header, tag value
tag header, tag value
etc…
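
As a rough, untested sketch of that idea: I'm guessing at the element names ("isbn", "title", "author", "pages"), so swap in whatever the real tags in your XML are. This reads each downloaded "ISBN (number).xml" file from the "isbn xmls" folder and writes one tab-delimited row per book to "books.txt" on the desktop, which Excel can import.

-- made-up element names, change these to the real ones from your XML
set wantedTags to {"isbn", "title", "author", "pages"}

set desktopFolder to (path to desktop folder as string)
set xmlFolder to desktopFolder & "isbn xmls:"
set inputFile to desktopFolder & "isbn.txt"
set outFile to desktopFolder & "books.txt"

-- get the isbn numbers again so we know which xml files to read
set file_data to paragraphs of (read file inputFile as text)

-- header row (tabs avoid trouble with commas inside the values)
set AppleScript's text item delimiters to tab
set outText to (wantedTags as text) & return
set AppleScript's text item delimiters to ""

repeat with i from 1 to (count of file_data)
	set isbn to item i of file_data
	if isbn is not "" then
		set xmlFile to xmlFolder & "ISBN " & isbn & ".xml"
		set xmlText to read file xmlFile as text
		set rowItems to {}
		repeat with aTag in wantedTags
			set end of rowItems to my tagValue(xmlText, contents of aTag)
		end repeat
		set AppleScript's text item delimiters to tab
		set outText to outText & (rowItems as text) & return
		set AppleScript's text item delimiters to ""
	end if
end repeat

-- write the tab-delimited file, overwriting any old copy
set fileRef to open for access file outFile with write permission
set eof of fileRef to 0
write outText to fileRef
close access fileRef

-- return the text between <tagName> and </tagName>, or "" if the tag isn't there
on tagValue(xmlText, tagName)
	set AppleScript's text item delimiters to "<" & tagName & ">"
	set theParts to text items of xmlText
	set AppleScript's text item delimiters to ""
	if (count of theParts) < 2 then return ""
	set AppleScript's text item delimiters to "</" & tagName & ">"
	set theValue to text item 1 of (item 2 of theParts)
	set AppleScript's text item delimiters to ""
	return theValue
end tagValue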

This is interesting. I ran the script with a 4-second delay; I have fewer bad requests but can't eliminate all of them!
Also, if I run the script more than once with the same delay, I get different results.

What do you think is going wrong?

Thanks.

James,

I have sent you the working script. Thanks.

James,
did you get my e-mail?

I see maybe two problems with the “bad XML files”…

  1. Make sure you don't have any spaces at the end of the ISBN numbers in the isbn.txt file. When I initially copied and pasted your ISBN numbers from your message, I noticed a few of the numbers had trailing spaces before the end of the line. I fixed these before running the script.

  2. I just tried a delay of 10 and it worked for 5 of the 7 files. The 2 bad ones had an error stating that "The process cannot access the file 'D:\wwwroot\something something something' because it is being used by another process."

I assume that error is because you or James was accessing the database at the same time I was and therefore it wouldn’t allow me access for those 2 file requests.

So my suggestions are: 1) get rid of the trailing spaces, 2) use a longer delay, and 3) have only one person access the database at a time! :cool:
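
If it helps with #1, the trimming could also be done right in the script so that stray spaces (and blank lines) in isbn.txt don't matter. Here's a sketch of the download loop from the script above with a small trim handler added (untested on my end, same as before):

-- cycle through the isbn numbers, trimming stray spaces and skipping blank lines
repeat with i from 1 to (count of file_data)
	set isbn to my trimTrailing(item i of file_data)
	if isbn is not "" then
		do shell script "curl " & quoted form of (my_link1 & isbn & my_link2) & " -o " & quoted form of (outputPosixPath & outName & isbn & outExt)
		delay 5 -- adjust the delay to whatever works best for you
	end if
end repeat

-- strip spaces and tabs off the end of a line
on trimTrailing(t)
	repeat while t is not "" and last character of t is in {space, tab}
		if (length of t) is 1 then
			set t to ""
		else
			set t to text 1 thru -2 of t
		end if
	end repeat
	return t
end trimTrailing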

I think you are right. Getting rid of the spaces makes a big difference in eliminating the bad XML file errors.

I am hoping James is not ignoring me… because the Excel step is the main step :stuck_out_tongue: