Get line of a text file containing a specific word

Baschi · July 28, 2005, 8:39am

Hello

Please forgive me if this is too easy for you, but I can’t find a solution to this problem.

With this part of a script I open a file containing tab-separated information, the result of a database export.

set vmnum to "6221393"


set adbaselist to choose file with prompt "Please select file"
set adbaselist to adbaselist as text

set AppleScript's text item delimiters to ""
set every_row to paragraphs of (read file adbaselist)

This file contains about 100’000 paragraphs (records).

Now I would like to get the entire record containing a specific piece of information. Is this possible with AppleScript?

Thanks in advance

Baschi

julifos · July 28, 2005, 11:42am

Too many records for AS, I think. I’d rather use “grep”:

set searchTerm to "blah"
set inputFile to alias "path:to:file.txt"

set matchingRecords to (do shell script "grep -e " & quoted form of searchTerm & space & quoted form of POSIX path of inputFile)

julifos · July 28, 2005, 11:43am

It could also be done in plain AS (and very quickly) using offsets, but I’m too lazy to do it when it can be done with “grep” :rolleyes:

Baschi · July 28, 2005, 12:08pm

Hi jj

Thanks a lot - works fantastically

Baschi

kel · July 28, 2005, 12:09pm

Hi,

Also, if you want to do it with offset, then something like this:

set vmnum to "6221393"

set adbaselist to choose file with prompt "Please select file"
set db_text to (read adbaselist)
set the_offset to (offset of vmnum in db_text)
set the_record to (read adbaselist from the_offset before (ASCII character 10) using delimiter tab)
-- ascii character 10 is linefeed
-- or use return (ascii character 13)
-- I wrote my test text file with TextEdit

6221393 needs to be a unique field value. If not, then you can do other things depending on how the record fields are ordered etc. Here I assume that the vmnum is in the first field of the record.

gl,

Baschi · July 29, 2005, 8:39am

hi kel

Thank you for the “offset” solution. Works fine too and it doesn’t matter how the source text file is formatted. Working with the grep-version requires a source file with UNIX linebreaks.

Regards, Baschi
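
For files with Mac (CR) or DOS (CR+LF) line breaks, one possible workaround for the grep version is to convert the line breaks inside the shell pipeline first. This is only a sketch: the handler name grepRecords is made up for illustration, it assumes the standard tr and grep tools, and it treats the search term as a fixed string (-F) rather than a pattern. Note that it swallows every shell error, so "no match" and "file not found" both come back as an empty string.

```applescript
-- Hypothetical helper (not from the posts above): returns all lines of the file
-- at posixPath that contain searchTerm, whatever line breaks the file uses.
on grepRecords(searchTerm, posixPath)
	-- tr turns every CR into LF so that grep sees one record per line
	set cmd to "tr '\\r' '\\n' < " & quoted form of posixPath & " | grep -F -e " & quoted form of searchTerm
	try
		return do shell script cmd
	on error
		-- grep exits with a non-zero status when nothing matches
		return ""
	end try
end grepRecords
```

It could be called like this: grepRecords("6221393", POSIX path of (choose file)). If several records match, they come back as one string, one record per paragraph.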

kai · July 29, 2005, 10:05pm

On very large files, offset may be relatively slow - while grep, though speedy, will always carry the overhead of making an external call to the shell. Another option, which should be faster and equally effective, is to use AppleScript’s text item delimiters - perhaps something like this (to return the first paragraph containing a given search string):

to searchText of currText for searchString
	set tid to text item delimiters
	set text item delimiters to searchString
	considering case
		tell currText to if (count text items) is 1 then
			set searchResult to "ERROR: no match found for \"" & ({""} & "\".")
		else
			set searchResult to text item 1's paragraph -1 & ({""} & text item 2's paragraph 1)
		end if
	end considering
	set text item delimiters to tid
	searchResult
end searchText

set vmnum to "6221393"
searchText of (read (choose file)) for vmnum

kel · July 30, 2005, 3:22pm

Hi Kai,

Actually I think the offset method might be a lot faster even with a completed script. Here’s a test using Jon’s Commands:

set f to choose file
set t to read f

set vmnum to "6221393"

set t1 to the ticks
set b to offset of vmnum in t
set r1 to read f from b until return
set t2 to the ticks
set d1 to t2 - t1

set t1 to the ticks
set r2 to searchText of (read (choose file)) for vmnum -- double read
set t2 to the ticks
set d2 to t2 - t1

display dialog "Offset method: " & d1 & " ticks" & return & "TID method: " & d2 & " ticks"
r1 = r2

to searchText of currText for searchString
	set tid to text item delimiters
	set text item delimiters to searchString
	considering case
		tell currText to if (count text items) is 1 then
			set searchResult to "ERROR: no match found for \"" & ({""} & "\".")
		else
			set searchResult to text item 1's paragraph -1 & ({""} & text item 2's paragraph 1)
		end if
	end considering
	set text item delimiters to tid
	searchResult
end searchText

Baschi, note that you need to error check your script. One issue that might arise is there might be numbers “6221393” and “62213934”. Stuff like that might cause problems, but if the ad number has just seven digits, then it should be ok.

EDIT: wait I made a mistake and there’s a double read. If you replace:

(read (choose file))

with the variable t then there’s almost no difference. Disregard.

gl,

kel · July 30, 2005, 3:27pm

I forgot to mention that I tested it on a file with 100001 paragraphs (records). Here’s the script to write the file:

set desk_path to (path to desktop) as string
set file_spec to (desk_path & "ads") as file specification
set ref_num to (open for access file_spec with write permission)
try
	repeat with i from 1 to 100000
		set t to ((i as string) & tab & "hello" & return)
		write t to ref_num
	end repeat
	set t to ("6221393" & tab & "bye")
	write t to ref_num
	close access ref_num
on error
	close access ref_num
	beep 2
	return
end try

kai · July 31, 2005, 7:55pm

Never mind the double read, kel - that ‘choose file’ was a sure-fire way of stopping the tids method in its tracks!

Out of curiosity, I’ve just carried out a few tests myself on some files containing between 10 and 100,000 paragraphs (each consisting of just over 10 items/words).

Offset certainly seems faster than when I last compared it. (I’d swear they’ve tweaked it!) However, using a finer timing method, tids still appear to have a slight edge. As I suggested earlier, the shell overhead puts grep at somewhat of a disadvantage when parsing smaller files - but it’s certainly worth considering for large files, as you’ll see from these results (to which the usual qualifications about timings apply):

file size       all times in milliseconds
(paragraphs)    offset    tids    grep
10                 5.1     2.6      94
100                5.6     3.8      92
1,000               20      11      95
10,000             146     101     111
100,000           1283     906     200

Since performance is only one aspect of such comparisons, it might be worth reiterating one or two other points to consider generally. Baschi has already observed that grep requires a file to be LF (ASCII character 10) delimited and, as you pointed out, the offset method may need to be adjusted according to a file’s paragraph delimiters. The tid-based approach should handle UNIX, DOS or Mac line-breaks without modification.

It also appears that, to return a complete record/paragraph, the offset suggestion relies on the record starting with the search string. (No biggie in this case, since it apparently works for Baschi - and that’s the main thing!)

Good point, kel.

I’m not aware of the precise structure of the file in question, but one way around this might be to “top’n’tail” the search string with the appropriate separators (“\r”, “\n”, “\t”, space, whatever). That way, part of a longer string shouldn’t be confused with a shorter one.
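
A rough sketch of the “top’n’tail” idea, under assumptions that may not hold for the real file: the search value sits in the first field of each record, fields are tab-separated, and the caller knows which line break the file uses. The handler name findRecord and its parameter names are made up for illustration. It uses the wrapped string as the text item delimiter, so "6221393" cannot match inside "62213934".

```applescript
-- Hypothetical helper: returns the record whose FIRST field is exactly keyText,
-- or missing value if there is none. lineBreak is the file's line break,
-- e.g. return or (ASCII character 10).
on findRecord(fileText, keyText, lineBreak)
	set tid to AppleScript's text item delimiters
	-- "top'n'tail": a line break in front of the key, a tab behind it
	set AppleScript's text item delimiters to lineBreak & keyText & tab
	-- a line break is prepended so that the very first record can match too
	set parts to text items of (lineBreak & fileText)
	set AppleScript's text item delimiters to tid
	if (count parts) is 1 then return missing value
	set restText to item 2 of parts
	-- the key was the last thing in the file, nothing follows the tab
	if restText is "" then return keyText & tab
	return keyText & tab & paragraph 1 of restText
end findRecord
```

For example, findRecord(read (choose file), "6221393", return) for a file written with Mac line breaks. If the value can appear in other fields, the wrapping separators would have to be changed accordingly.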

wsterdan · March 29, 2006, 8:58pm

kai:

to searchText of currText for searchString
	set tid to text item delimiters
	set text item delimiters to searchString
	considering case
		tell currText to if (count text items) is 1 then
			set searchResult to "ERROR: no match found for \"" & ({""} & "\".")
		else
			set searchResult to text item 1's paragraph -1 & ({""} & text item 2's paragraph 1)
		end if
	end considering
	set text item delimiters to tid
	searchResult
end searchText

set vmnum to "6221393"
searchText of (read (choose file)) for vmnum

I’ve used this as the basis for a MIF-file parsing algorithm and it works great; my problem (as posted elsewhere) is that I now need to convert it to a droplet, which will allow for multiple files to be processed unattended.

When I add a simple droplet shell and drop two files onto it:

on open filelist
	repeat with CurrentDocument in filelist
		
		set delimiterString to "<ParaLine"
		searchText of CurrentDocument for delimiterString
		
	end repeat
end open

to searchText of currText for searchString
	set tid to text item delimiters
	set text item delimiters to searchString
	considering case
		tell currText to if (count text items) is 1 then
			set searchResult to "ERROR: no match found for \"" & ({""} & "\".")
		else
			set searchResult to text item 2's paragraph 3 & ({""} & text item 2's paragraph 1)
		end if
	end considering
	set text item delimiters to tid
	searchResult
end searchText

I get an error that “item 1 of (alias “Macintosh HD:filename1.mif”, alias “Macintosh HD:filename .mif”) doesn’t understand the count message.”

Any chance it’s a simple fix to get the droplet working?

– Walt Sterdan
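
A possible fix, offered as a sketch rather than a tested answer: the error message suggests that the handler is being given an alias where it expects text, so the droplet probably needs to read each file first, as the original script did with (read (choose file)). The variable names fileText and foundText are made up for illustration, and what should happen to each result is left open. The searchText handler is the one from the post above, unchanged.

```applescript
on open filelist
	repeat with CurrentDocument in filelist
		set delimiterString to "<ParaLine"
		-- read the file's contents; searchText works on text, not on an alias
		set fileText to read (contents of CurrentDocument)
		set foundText to searchText of fileText for delimiterString
		-- foundText now holds the result for this file; use it here
	end repeat
end open

to searchText of currText for searchString
	set tid to text item delimiters
	set text item delimiters to searchString
	considering case
		tell currText to if (count text items) is 1 then
			set searchResult to "ERROR: no match found for \"" & ({""} & "\".")
		else
			set searchResult to text item 2's paragraph 3 & ({""} & text item 2's paragraph 1)
		end if
	end considering
	set text item delimiters to tid
	searchResult
end searchText
```

As posted, the droplet also discards the handler's result on every pass through the loop, which is why the sketch stores it in a variable.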
