Get info from browser and store in an Excel spreadsheet

I’ve put something together for this too, based on my previous idea; it’ll be done shortly.

Try this to convert the xml files. I’m sure there must be a better way, but this is all I can do since I don’t have experience with this. Hopefully James will have a better idea.

(* This will convert all the xml files in the "isbn xmls" folder into tab delimited text files. They will be written to a folder on your desktop called "isbn txt". You should be able to import these into excel as tab delimited text files. *)

set inputFolderName to "isbn xmls"
set outputFolderName to "isbn txt"
set outExt to ".txt"

set desktopFolder to (path to desktop folder as string)
set inputPath to desktopFolder & inputFolderName
set outputPath to desktopFolder & outputFolderName

-- create the output folder if needed
tell application "Finder"
	if not (exists folder outputPath) then make new folder at folder desktopFolder with properties {name:outputFolderName}
	
	-- get the files in the input folder
	set inputFiles to files of folder inputPath
end tell

repeat with i from 1 to (count of inputFiles)
	-- read in the data for an xml file
	set this_inputFile to (item i of inputFiles) as string
	set the_data to read file this_inputFile as text
	
	-- find all the xml tags in the file
	set all_tags to {}
	set text item delimiters to "" -- so "text items" splits into individual characters
	set data_tags to text items of the_data
	repeat with k from 1 to (count of data_tags)
		set chr to item k of data_tags
		if chr is "<" then
			repeat with j from (k + 1) to (count of data_tags)
				set chr to item j of data_tags
				if chr is ">" then
					set this_tag to (items k thru j of data_tags) as string
					set end of all_tags to this_tag
					exit repeat
				end if
			end repeat
		end if
	end repeat
	
	-- strip out the xml tags: replace each opening tag with its name plus a tab, and each closing tag with a return
	repeat with aTag in all_tags
		set text item delimiters to ""
		set this_textItems to text items of aTag
		set text item delimiters to aTag
		set a to text items of the_data
		if (item 2 of this_textItems) is not "/" then -- an opening tag
			set text item delimiters to ((items 2 thru -2 of this_textItems) as string) & tab
			set the_data to a as string
		else -- a closing tag
			set text item delimiters to "" & return
			set the_data to a as string
		end if
	end repeat
	set text item delimiters to ""
	
	-- write the new text file to the output folder
	set fPath to (outputPath & ":" & (name of (info for file this_inputFile) & outExt)) as string
	my writeTo(the_data, fPath, false, string)
end repeat



(*============== SUBROUTINES ===============*)
-- general-purpose file writer: opens the target file, optionally erases it, then writes the data in the given mode
on writeTo(this_data, target_file, append_data, mode)
	try
		set target_file to target_file as Unicode text
		if target_file does not contain ":" then set target_file to POSIX file target_file as Unicode text
		set the open_target_file to open for access file target_file with write permission
		if append_data is false then set eof of the open_target_file to 0
		write this_data to the open_target_file starting at eof as mode
		close access the open_target_file
		return true
	on error
		try
			close access file target_file
		end try
		return false
	end try
end writeTo

Wow, this script works really well. The tab separates one column from the next.
The only issue with the script is cleaning up the XML files before creating the delimited text files. What I mean is, I am only interested in certain elements (title, author, pages, etc.).
Another issue I find with your script: I have to import each text file into the Excel document separately. Can we create one text file for all of them instead?

I’m glad it worked the way I thought it would. I knew you would want just one file to import, but that script gets you close. If I think of a way to make it just one file I’ll let you know; off-hand I’m not sure how to do it.

Also, if you post just the headings you’re interested in, it shouldn’t be hard for me to grab only those. So post the headings you want and, if I get the chance, I’ll work on that.

Maybe James will have some ideas too and he’ll also want to know the headings you’re interested in.

you got mail b_bobb =)

Thanks, James. I have a few comments on your mail. Can you please advise on how to put everything together?

Well, after thinking about your request (i.e. one final text file, with a known list of xml tags that you want), I thought of a way to do it. It was actually much easier having a list of xml tags than grabbing all of them like I did previously. And it’s really fast too. Anyway, here’s the result. The directions are in the script. Let me know how it works for you.

(* This will convert all the xml files in the "isbn xmls" folder into one tab delimited text file. The file, called "isbn txt.txt", will be written to your desktop (no need for a folder with just one file!). You should be able to import it into excel as a tab delimited text file. *)

-- Note the format for the "good_tags"
-- This is a list of the tags you want to get the values of
-- Make sure you put each tag in the "good_tags" variable as "<tagName>", i.e. with "<" and ">" around the tag name
-- You can add, remove, or rearrange the tags in my example list as you wish, it doesn't matter what you do to them

-- input variables you can change
set good_tags to {"<ISBN>", "<Title>", "<Name>", "<Role>", "<PubListPrice>", "<PubDate>", "<Publisher>", "<Description>", "<Pages>"}
set inputFolderName to "isbn xmls"
set outputFileName to "isbn txt.txt"
set desktopFolder to (path to desktop folder as string)

-- calculated variables
set inputPath to desktopFolder & inputFolderName
set outputPath to desktopFolder & outputFileName

-- get the files in the input folder
tell application "Finder" to set inputFiles to files of folder inputPath

-- read in the data from the xml files, one at a time
set dataList to {}
repeat with i from 1 to (count of inputFiles)
	set this_inputFile to (item i of inputFiles) as string
	set the_data to read file this_inputFile as text
	
	-- gather the data from only the "good tags"
	set temp_data to {}
	repeat with j from 1 to (count of good_tags)
		if (item j of good_tags) is in the_data then
			-- split on the opening tag, then on the matching closing tag;
			-- the text between the two is this tag's value
			set text item delimiters to (item j of good_tags)
			set a to text items of the_data
			set text item delimiters to "</" & (text 2 thru -1 of (item j of good_tags))
			set b to text items of (item 2 of a)
			set the_value to item 1 of b
			set text item delimiters to ""
		else
			set the_value to "missing value"
		end if
		set end of temp_data to the_value
	end repeat
	set end of dataList to temp_data
end repeat
set text item delimiters to ""

-- prepare the data for writing
-- get the tags into a list
set tagData to {}
repeat with a_good_tag in good_tags
	set this_tag to text 2 thru -2 of a_good_tag
	set end of tagData to this_tag
end repeat

-- add the tags and the dataList together
set writeData to {}
set end of writeData to tagData
repeat with aData in dataList
	set end of writeData to aData as list
end repeat

-- put the compiled data into string format for writing to a tab delimited file
set finalString to ""
repeat with a_writeData in writeData
	repeat with i from 1 to (count of a_writeData)
		set finalString to finalString & (item i of a_writeData) & tab
	end repeat
	set finalString to finalString & return
end repeat

-- write the text file to the desktop
my writeTo(finalString, outputPath, false, string)


(*============== SUBROUTINES ===============*)
-- general-purpose file writer: opens the target file, optionally erases it, then writes the data in the given mode
on writeTo(this_data, target_file, append_data, mode)
	try
		set target_file to target_file as Unicode text
		if target_file does not contain ":" then set target_file to POSIX file target_file as Unicode text
		set the open_target_file to open for access file target_file with write permission
		if append_data is false then set eof of the open_target_file to 0
		write this_data to the open_target_file starting at eof as mode
		close access the open_target_file
		return true
	on error
		try
			close access file target_file
		end try
		return false
	end try
end writeTo

This really looks a lot better, thanks.
What’s interesting, though, is the “missing value” displayed in the Excel file. I saw how you used it in the script, but I think it would make more sense if we had some sort of check to clean up the XML files before creating the tabbed text file.
I’m not sure about this one, but I would suggest moving the XML files with missing values into a separate folder (at least then I’d know which ones to run the script on again to try to get their metadata), so the "isbn xmls" folder only contains XML files with metadata in them. What do you think?
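Something like this, dropped at the end of the per-file repeat loop in your script (right after the "set end of dataList to temp_data" line), is the kind of thing I’m imagining. This is only a rough sketch, and the "isbn xmls missing" folder name is just an example:

-- sketch: move any xml that produced a "missing value" into a separate folder for re-querying
set missingFolderName to "isbn xmls missing" -- example name, change as you like
set missingPath to desktopFolder & missingFolderName
tell application "Finder"
	if not (exists folder missingPath) then make new folder at folder desktopFolder with properties {name:missingFolderName}
	if temp_data contains "missing value" then move file this_inputFile to folder missingPath
end tell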

I was thinking about curl as well, and was trying to find other ways of grabbing the information; that’s why I’m not sleeping now :) I have looked at some other posts and none helped, or at least none had the same issues.
Could there be any curl options that would help? I noticed you used -o. Would it make a difference if it were placed at the beginning?
I’m hoping you’ll excuse my ignorance; I’m just trying to get this script to work with all ISBNs.

Thanks again.

<aol voice>You’ve got mail</aol voice>

Re: the error check
I’ve been thinking about it. There’s a tag in the xml file (the one meaning “matching records”) that could be checked. I can think of 4 possibilities for it.

  1. the tag doesn’t exist in the xml file, which means there was a problem downloading the file, i.e. a curl error
  2. the value of the tag is 0, which means curl executed properly but no matching records were found for that isbn
  3. the value is 1, which means success for that isbn query
  4. the value is greater than 1, which means some other problem; I’m not sure this will occur

The problem is that some of the isbn queries return a matching-records value of 1, but some of the other fields still do not exist, so those fields get filled with “missing value”. I think we have to consider that isbn query “good” (meaning curl executed properly). I don’t think re-querying the database for that isbn will turn up different results; those fields are just missing, and re-querying won’t change that. So I think the only errors are: the tag doesn’t exist, its value is 0, or its value is greater than 1.

Let me know what you think about that error assessment. Also, if you agree with my assessment, let me know what action you want for each error case.
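To make that concrete, here is roughly how the check could look. This is only a sketch: I’m writing <MatchingRecords> as a stand-in for whatever the real tag is called, and the_data is the xml text read in the same way as in the scripts above.

set match_tag to "<MatchingRecords>" -- stand-in for the real tag name
if match_tag is not in the_data then
	set verdict to "curl error" -- case 1: the tag doesn't exist at all
else
	set text item delimiters to match_tag
	set a to text items of the_data
	set text item delimiters to "</" & (text 2 thru -1 of match_tag)
	set match_count to (item 1 of (text items of (item 2 of a))) as integer
	set text item delimiters to ""
	if match_count is 0 then
		set verdict to "no matching records" -- case 2
	else if match_count is 1 then
		set verdict to "good" -- case 3
	else
		set verdict to "multiple matches" -- case 4, probably won't occur
	end if
end if

The verdict string could then drive whatever action you pick for each case.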

Re: curl
I don’t know how to improve that command. The “-o” option is the “output file” option; that’s how curl knows to write the returned data to a file, otherwise it prints the data to the terminal screen. The position shouldn’t matter, since curl accepts its options before or after the URL.
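In script form, that’s the difference between these two calls (example.com is just a placeholder URL):

-- without -o, the downloaded data comes back as the result of do shell script
set xmlText to do shell script "curl " & quoted form of "http://example.com/query"
-- with -o, curl writes that same data to the named file instead
do shell script "curl " & quoted form of "http://example.com/query" & " -o " & quoted form of "/tmp/result.xml"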

Let me know. Also let me know if James’ efforts look promising; then we won’t have to worry about fixing up my script.

I’m hoping James is looking at this post. We exchanged a few e-mails trying to put the script together all in one.

This is really interesting. What would be great is if we could move the XML files into separate folders based on that value and generate separate Excel files accordingly. However, I don’t think we would ever get a value larger than 1 for that tag.
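Building on the verdict sketch above (again just a rough idea, with the verdict string doubling as a folder name), something like:

-- rough idea: file each xml into a desktop folder named for its verdict
tell application "Finder"
	if not (exists folder (desktopFolder & verdict)) then make new folder at folder desktopFolder with properties {name:verdict}
	move file this_inputFile to folder (desktopFolder & verdict)
end tell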

For one of the titles (please don’t list titles here) that gave us an error message, I ran the curl command for that title alone in Terminal, and it worked fine. If you open the XML file when you get the error, it does mention that the title is being used by another process.

Another question I had in mind: why do we create ISBN.XML? What is supposed to go in there?

Thanks.

Putting the script all in one is easy… that can be done at the end. Right now I find it advantageous to have it separate, because I don’t have to re-download the xml files every time I want to check the second script. It makes it easier to troubleshoot, too. I haven’t seen James’ approach. It doesn’t make sense for two people to be generating ideas separately; we should be sharing ideas and code so that work isn’t duplicated.


That’s easy enough to implement, and I agree about values larger than 1, so we can ignore that case.

Wouldn’t our 3-step check for the tag catch this properly? Does that file have the tag? Are you saying it was a “good” curl download or a bad one?

I don’t see anything in my scripts about that. What do you mean?

hey regulus, shoot me an email through here so I can reply and send you what I’ve cooked/cooking up


What I meant here are the XML files with the one tag but not the other. If you run those ISBNs in Terminal, they get the information without a problem.

Please check your e-mail/messages.

I was reading through the curl man page and saw the bit about URL globbing: you can specify multiple URLs in one go by writing part sets within braces, as in http://site.{one,two,three}.com.

So I decided to try this technique: let curl handle the multiple downloads itself, rather than the repeat loop with the delay that we implemented before. I figured out how to get the URL into the proper form and it seems to work. See if this corrects some of those nagging errors with the downloaded xml files. Remember to fix the “download_link1” variable with the proper link.

set inputFileName to "isbn.txt"
set outputFolderName to "isbn xmls"

set desktopFolder to (path to desktop folder as string)
set inputFile to desktopFolder & inputFileName
set outputPath to desktopFolder & outputFolderName
set outputPosixPath to POSIX path of outputPath & "/"
set outName to "ISBN"
set outExt to ".xml"


-- create the output folder if needed
tell application "Finder"
	if not (exists folder outputPath) then make new folder at folder desktopFolder with properties {name:outputFolderName}
end tell

-- read the input file's data
set the_data to read file inputFile as string

-- make the data into a list
set file_data_list to paragraphs of the_data

-- make file_data_list into a comma separated string
set text item delimiters to ","
set file_data_string to file_data_list as string
set text item delimiters to ""

-- the web link info
set download_link1 to "http://some_user_info"
set download_link2 to ")&startRecord=1&endRecord=1&sortField=DE"
set download_link to download_link1 & "{" & file_data_string & "}" & download_link2
set saved_file_names to outputPosixPath & outName & " #1" & outExt

-- let curl cycle through the isbn numbers itself (URL globbing); it substitutes each isbn for the "#1" in the output file name, so each isbn gets its own file
do shell script "curl " & quoted form of download_link & " -o " & quoted form of saved_file_names
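If you want to see exactly what is being handed to the shell, you can log the assembled command before running it (watch Script Editor’s event log). With three made-up isbn numbers in isbn.txt, it comes out looking roughly like the comment below; the desktop path is just an example.

-- inspect the assembled command before running it
log "curl " & quoted form of download_link & " -o " & quoted form of saved_file_names
-- roughly: curl 'http://...{1111111111,2222222222,3333333333}...' -o '/Users/you/Desktop/isbn xmls/ISBN #1.xml'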

you BOTH have mail =)

Sorry for the long delay; I just posted responses by e-mail.

Thanks, guys, for all the help. You were wonderful and very helpful. Thanks.