Searching a large text file

delpucci · July 22, 2010, 11:42pm

I would be thrilled to get some advice about searching for specified text items in the paragraphs of a text file that will continue to grow over time.

Specifically:

The text file “Details.txt” is stored in folder “Document Index” on the desktop

When a document is saved a paragraph of the various details of the document is added to the end of the text file in the following manner.

The last item in the paragraph is the document’s path.

The aim of any subsequent search of this text file will be to obtain documents’ paths that meet the search criteria in order to do further work with these files.

When searching on 3 items, the following script seems adequate up to about 1,000 paragraphs, after which it slows down, which is perhaps not unexpected (about 12 s for 10,000 paragraphs).

This script is an example and will only run if the text file has already been set up.

set asitem to "Joseph"
set bsitem to "Niji"
set tsitem to "2010072200000"
set theFile to "Details.txt"
set theFilePath to ((path to desktop as text) & "Document Index:") & theFile
set theData to every paragraph of (read file theFilePath)
set File_Values to {}
set {tids, text item delimiters} to {text item delimiters, "|zyxw|"}
repeat with i from 1 to ((count theData) - 1)
	set A to item i of theData
	if text item 2 of A contains asitem and text item 3 of A contains bsitem and text item 16 of A < tsitem then
		set end of File_Values to text item 19 of A
	end if
end repeat
set text item delimiters to tids
File_Values ------ a list of files meeting the criteria

Is there a faster way to do this, or even a better way to do this kind of project?

Any thoughts appreciated.

Val

Model: iMac intel
AppleScript: Version 2.3 (118)
Browser: Safari 533.16
Operating System: Mac OS X (10.6)

McUsr · July 23, 2010, 12:07am

Hello Val.

How about converting it to a CSV file and reading it into a spreadsheet? There you can search and sort on each and every field, search for specific contents of a cell, and set up pre-made search filters. Excel or Numbers would do all the work, and you would end up with a very flexible solution for manipulating the data.

If you do convert it to a CSV file, every text item should be enclosed in quotation marks.

I would at least try this as a first solution. I have no experience with sorting and searching databases this large in Excel, but I would give it a shot.

I recommend reading up on data filtering and tables in Excel or Numbers, if you have one of those.
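As a rough sketch of that conversion, assuming the "|zyxw|" delimiter from the original script: split a paragraph into its text items, double any embedded quotation marks (the usual CSV convention), wrap each item in quotes, and join with commas. The handler names csvLine and replaceText are made up for this illustration.

```applescript
-- Hypothetical helper: replace every occurrence of findText in sourceText.
on replaceText(sourceText, findText, newText)
	set tids to AppleScript's text item delimiters
	set AppleScript's text item delimiters to findText
	set parts to text items of sourceText
	set AppleScript's text item delimiters to newText
	set resultText to parts as text
	set AppleScript's text item delimiters to tids
	return resultText
end replaceText

-- Hypothetical helper: turn one "|zyxw|"-delimited paragraph into a quoted CSV line.
on csvLine(aParagraph)
	set tids to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "|zyxw|"
	set fields to text items of aParagraph
	set AppleScript's text item delimiters to tids
	set quotedFields to {}
	repeat with aField in fields
		-- Double any embedded quotation marks, then wrap the field in quotes.
		set end of quotedFields to "\"" & replaceText(contents of aField, "\"", "\"\"") & "\""
	end repeat
	set AppleScript's text item delimiters to ","
	set csvText to quotedFields as text
	set AppleScript's text item delimiters to tids
	return csvText
end csvLine

-- Example call: each field comes back wrapped in quotation marks.
csvLine("Joseph|zyxw|Niji")
```

Each converted line could then be written to a new file with a .csv extension and opened in Excel or Numbers.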

regulus6633 · July 23, 2010, 5:59am

A few things…

In your if statement, you get “text items of A” 3 times. Why not get the text items once and pull the values from that instead?
I’m not sure if this is true for you, but often, once the if statement finds a result, you don’t have to keep searching because you know only 1 result should be found. If that’s the case here, you can exit the repeat loop at that point to stop the search.

So with these 2 changes your repeat loop would look like this.

repeat with i from 1 to ((count theData) - 1)
	set A to text items of (item i of theData)
	if item 2 of A contains asitem and item 3 of A contains bsitem and item 16 of A < tsitem then
		set end of File_Values to item 19 of A
		exit repeat
	end if
end repeat

When executing a repeat loop, it’s often faster to use references to values rather than actual values.
When searching large lists, you can often get a huge speed boost by using script objects to hold the data.
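To show the script object idea in isolation, here is a minimal sketch with made-up data. The names s, bigList and total are only for this illustration, and how much speed it gains will depend on the machine and the list size.

```applescript
-- Hypothetical demo: sum 5000 integers held in a script object's property.
script s
	property bigList : missing value
end script

set s's bigList to {}
repeat with n from 1 to 5000
	set end of s's bigList to n
end repeat

set total to 0
repeat with i from 1 to (count s's bigList)
	-- The list is reached through the script object ("s's bigList")
	-- instead of through a plain variable holding the list.
	set total to total + (item i of s's bigList)
end repeat

set s's bigList to missing value -- clear the large property when finished
total
```

Swapping "s's bigList" for an ordinary list variable and timing both versions is an easy way to check whether the difference matters for a given list.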

So combining everything I mentioned, here’s how I would write your code. You’ll need to do a little testing to see which suggestions work and which don’t.

script f
	property fileItems : missing value
	property textItems : missing value
	property fileValues : missing value
end script

set asitem to "Joseph"
set bsitem to "Niji"
set tsitem to "2010072200000"
set theFile to "Details.txt"
set theFilePath to ((path to desktop as text) & "Document Index:") & theFile
set f's fileItems to every paragraph of (read file theFilePath)
set f's fileValues to {}
set {tids, text item delimiters} to {text item delimiters, "|zyxw|"}
repeat with anItem in f's fileItems
	set f's textItems to text items of anItem
	if item 2 of f's textItems contains asitem and item 3 of f's textItems contains bsitem and item 16 of f's textItems < tsitem then
		set end of f's fileValues to item 19 of f's textItems
		exit repeat
	end if
end repeat
set text item delimiters to tids
set f's fileItems to missing value -- clear out the variable
return f's fileValues ------ a list of files meeting the criteria

Nigel_Garvey · July 23, 2010, 9:14am

If there are always 19 text items per paragraph, a further development of Hank’s suggestion would be to dispense with the ‘paragraphs’ stage. Instead of getting the paragraphs and extracting the text items of each one, simply get the text items of the entire text and loop through them in groups of 19. (In fact, though, a blank text item would only appear at the very beginning of the list, so the actual process would be to start at 2 and loop in groups of 18.) The return would have to be stripped from any path returned.

local f

script
	property theData : missing value
	property File_Values : missing value
end script
set f to result

set asitem to "Joseph"
set bsitem to "Niji"
set tsitem to "2010072200000"
set theFile to "Details.txt"
set theFilePath to ((path to desktop as text) & "Document Index:") & theFile
set tids to AppleScript's text item delimiters
set AppleScript's text item delimiters to "|zyxw|"
set f's theData to text items of (read file theFilePath as string)
set AppleScript's text item delimiters to tids
set f's File_Values to {}
repeat with i from 2 to (count f's theData) by 18
	if (item i of f's theData contains asitem) and (item (i + 1) of f's theData contains bsitem) and (item (i + 14) of f's theData < tsitem) then
		set end of f's File_Values to text 1 thru -2 of item (i + 17) of f's theData
	end if
end repeat
set text item delimiters to tids
f's File_Values ------ a list of files meeting the criteria

By the way, the ‘<’ operator in the ‘if’ line doesn’t match the sample paragraph. I had to change it to ‘>’ to test the script. I don’t know if that’s what you meant or not.

delpucci · July 23, 2010, 11:51am

Many thanks for all of the advice.

McUsr: my immediate concern was speed. I should have asked for “a better way to do this kind of project faster”. But your idea would certainly be useful for other projects.

Nigel: Your iteration does not appear to be any faster than my original. It does the job. The “tsitem” is a timestamp value. When I ran your script (as on post) it did pick out the files with a value lower than “2010072200000” which is what I wanted.

regulus6633: Your iteration, adapted to cope with the trailing linefeed, appears to be considerably faster than the original. I also removed the exit repeat as there may be many “hits”. See script below (tested on a 6.7 MB text file with 11002 paragraphs).

Again, I’d be thrilled to get any further improvement, and thanks again.

script f
	property fileItems : missing value
	property xfileItems : missing value
	property textItems : missing value
	property fileValues : missing value
end script

set asitem to "Joseph"
set bsitem to "Niji"
set tsitem to "2010072200000"
set theFile to "Details.txt"
set theFilePath to ((path to desktop as text) & "Document Index:") & theFile
set f's fileItems to every paragraph of (read file theFilePath)
set n to count of items in f's fileItems
set f's xfileItems to items 1 thru (n - 1) in f's fileItems-------this seems to work but is it really correct ?
set f's fileValues to {}
set {tids, text item delimiters} to {text item delimiters, "|zyxw|"}
repeat with anItem in f's xfileItems
	set f's textItems to text items of anItem
	if item 2 of f's textItems contains asitem and item 3 of f's textItems contains bsitem and item 16 of f's textItems < tsitem then
		set end of f's fileValues to item 19 of f's textItems
	end if
end repeat
set text item delimiters to tids
set f's xfileItems to missing value -- clear out the variable
return f's fileValues ------ a list of files meeting the criteria

Val

regulus6633 · July 23, 2010, 12:12pm

Script objects are the biggest reason for the speed gain. It’s really amazing how much faster they are for large lists. Looking at your code, you can optimize it further if you’re certain you don’t want to iterate over the last item in the list. Replace these 3 lines (and remove the xfileItems property from f) with the single line after my comments.

set f's fileItems to every paragraph of (read file theFilePath)
set n to count of items in f's fileItems
set f's xfileItems to items 1 thru (n - 1) in f's fileItems -------this seems to work but is it really correct ?


-- you can do these 3 steps in 1 step
-- plus you eliminate one very large variable... xfileItems
-- basically you can reference the last item in a list by "-1" thus the second to last item is "-2"
set f's fileItems to items 1 thru -2 of (paragraphs of (read file theFilePath))

Finally, remember that “properties” are persistent, meaning that they are saved with the script when it quits and restored when it is next launched. So you don’t want unnecessary variables to be saved/restored if you don’t need them. Therefore, at the end of your script, make sure to reset f’s properties back to missing value.

delpucci · July 23, 2010, 12:40pm

regulus6633 said: “It’s really amazing how much faster they are for large lists.”

It is. On this trial it appears to be faster by a factor of about 7, i.e. 7 times faster.
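For anyone wanting to repeat the comparison, a rough way to time a script is to subtract two "current date" values. This is only a sketch: the loop is a stand-in workload, the variable names are made up, and "current date" resolves to whole seconds, so it only suits runs lasting several seconds.

```applescript
-- Hypothetical timing wrapper; "current date" only resolves to whole seconds.
set startTime to current date

-- Stand-in workload; replace this loop with the search being measured.
set total to 0
repeat with n from 1 to 100000
	set total to total + n
end repeat

-- Subtracting two dates gives the difference in seconds.
set elapsedSeconds to (current date) - startTime
elapsedSeconds
```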

Latest version of script

script f
	property fileItems : missing value
	property textItems : missing value
	property fileValues : missing value
end script

set asitem to "Joseph"
set bsitem to "Niji"
set tsitem to "2010072200000"
set theFile to "Details.txt"
set theFilePath to ((path to desktop as text) & "Document Index:") & theFile
set f's fileItems to items 1 thru -2 of (paragraphs of (read file theFilePath))
set f's fileValues to {}
set {tids, text item delimiters} to {text item delimiters, "|zyxw|"}
repeat with anItem in f's fileItems
	set f's textItems to text items of anItem
	if item 2 of f's textItems contains asitem and item 3 of f's textItems contains bsitem and item 16 of f's textItems < tsitem then
		set end of f's fileValues to item 19 of f's textItems
	end if
end repeat
set text item delimiters to tids
set f's fileItems to missing value -- clear out the variable
return f's fileValues ------ a list of files meeting the criteria

Thanks again,

Val

regulus6633 · July 23, 2010, 12:58pm

Last comment about cleaning up the persistent variables. It’s not a big deal since we already cleaned up the very large variable but you may as well clean them all up just to be efficient.

The reason it’s important is that the properties get saved to the script file so they can be restored at next launch. You can do a test: check the file size of the script in the Finder using “Get Info”. You’ll see a difference if you clean up versus not. In your case you have a very large variable, so it would make a difference in the file size and in the speed of quitting/launching the script, because the variables have to be written to and read from disk.

-- clear out the variables
set theFileValues to f's fileValues
set f's fileItems to missing value
set f's textItems to missing value
set f's fileValues to missing value

return theFileValues
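An alternative to resetting each property by hand, in the same spirit as Nigel's "local f" approach, is to create the script object inside a handler. As I understand it, a script object created there exists only for that call and is held in a local variable, so its properties should not be saved with the script file. This is a sketch under that assumption: the handler name findPaths and its parameters are made up, while the delimiter and item positions come from the scripts above.

```applescript
on findPaths(theFilePath, asitem, bsitem, tsitem)
	-- The script object is created afresh on each call and is local to the handler.
	script f
		property fileItems : missing value
		property fileValues : missing value
	end script
	
	-- Drop the empty last paragraph left by the file's trailing line ending.
	set f's fileItems to items 1 thru -2 of (paragraphs of (read file theFilePath))
	set f's fileValues to {}
	set tids to AppleScript's text item delimiters
	set AppleScript's text item delimiters to "|zyxw|"
	repeat with anItem in f's fileItems
		set textItems to text items of anItem
		if item 2 of textItems contains asitem and item 3 of textItems contains bsitem and item 16 of textItems < tsitem then
			set end of f's fileValues to item 19 of textItems
		end if
	end repeat
	set AppleScript's text item delimiters to tids
	return f's fileValues
end findPaths

-- Example call (assumes Details.txt has already been set up):
-- set theFilePath to ((path to desktop as text) & "Document Index:") & "Details.txt"
-- findPaths(theFilePath, "Joseph", "Niji", "2010072200000")
```

Comparing the saved script's size in the Finder before and after a run is a simple way to confirm that nothing large is being stored.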

delpucci · July 23, 2010, 1:06pm

Point taken.

I was ignorant of this, and the editor had trouble saving after running the script.

Many thanks. I have learnt a lot here.

Val