Finding matching strings in a text file and summing their numbers

Hi. I’m a novice at AppleScript and have been trying to sort out this problem with little success. I’m hoping someone here can offer me some direction.

I often get comma-delimited text files where a string value will appear several times throughout the file but with different accompanying numeric values. For example:

TS10031804,359
GS16485441,1
TS10031804,5
GS16485441,1
GS16485441,1
TS10031804,6
GS16485441,1

What I need to do is find all the matching strings from the first field of each line and sum their respective numbers from the second field to end up with a composited file that looks like this:

TS10031804,370
GS16485441,4

Ideally this would write out the summed values to a separate file. Typically these files each contain thousands of lines but I only need to process one file at a time and I don’t need to sum values from multiple files at once.

I appreciate any guidance you may be able to provide.

Thanks,
Tom

Model: Macbook Air
AppleScript: 2.2.4
Browser: Safari 536.29.13
Operating System: Mac OS X (10.8)

Hello.
I thought awk and associative arrays would do the trick.

set theData to quoted form of "TS10031804,359
GS16485441,1
TS10031804,5
GS16485441,1
GS16485441,1
TS10031804,6
GS16485441,1"
set theSum to (do shell script " awk 'BEGIN {FS=\",\"}{line[$1]+=$2 }END { for ( item in line ) printf(\"%s %d\\n\", item, line[item]) }' <<<" & theData)
--> "TS10031804 370
--> GS16485441 4"

Thanks for the quick response. That did exactly what I needed but my data sets are typically 10,000 lines or more. Is there a way to apply that expression to the contents of an entire file?

Hello.

Actually, I think it will perform as good as anything not written specifically in C, or any other compiled language.

I modified it, so that you select a file in Finder, before running the script below.


tell application "Finder"
	if selection = {} then
		beep
		return
	end if
	set n to (count selection as list)
	if n > 1 then
		beep
		return
	else
		set a to POSIX path of (selection as alias)
	end if
end tell
-- code for timing on line  below, just remove it.
set s to (current date)

set theSum to (do shell script "cat " & quoted form of a & "| awk 'BEGIN {FS=\",\"}{line[$1]+=$2 }END { for ( item in line ) printf(\"%s %d\\n\", item, line[item]) }'")
-- code for timing on line  below, just remove it.
set t to (current date) - s
tell application "TextEdit"
	make new document at front
	set text of document 1 to theSum
	activate
end tell
-- code for timing in the block below, just remove it.
set l to do shell script "wc -l  < " & quoted form of a
tell application "TextEdit"
	display dialog l & " lines took " & t & " seconds."
end tell

It would be interesting to know the count of records (lines) and the time it took to execute.
Edit
I have added code for timing the results, that you can remove later, it would be interesting to know about the
performance for 10000 lines. :slight_smile:

Hi McUsrII,

Outstanding! That worked perfectly and performance is certainly not an issue as it processed 11,540 lines almost instantly (your timer dialog said 0 seconds). I learned several things from your AppleScript example that I’ll be able to build on in the future.

Thanks for your quick and very helpful response. :smiley:

Hello.

Awk was built for report writing, and data extraction like this, so you got the optimum bang for the buck, as you’d have to implement a hashtable/dictionary, to achieve the same speed, in lookup/storage.

Given the size of your data set, I was abit worried about the hashkeys, that are calculated for finding matching elements. It seems it was a good match in your case.

I just improved i/o with some 50%, but I guess that isn’t necessary now.

set theSum to (do shell script "awk 'BEGIN {FS=\",\"}{line[$1]+=$2 }END { for ( item in line ) printf(\"%s %d\\n\", item, line[item]) }' <" & quoted form of a)

Since I’ll always be starting from one of these documents in BBEdit I altered the process a bit so that it grabs the file name of the front document and then dumps the results into a new BBEdit document:


tell application "BBEdit"
	set a to POSIX path of (get file of front document)
end tell

set theSum to (do shell script "awk 'BEGIN {FS=\",\"}{line[$1]+=$2 }END { for ( item in line ) printf(\"%s,%d\\n\", item, line[item]) }' <" & quoted form of a)

tell application "BBEdit"
	make new document at front
	set text of document 1 to theSum
	activate
end tell

I’m sure there is a way to process the actual contents of the front-most document rather than having to use the documents file name but I’ll leave that to another time. I should also add some error handling in case the front file is not correct.

One thing I noticed is that the resulting output always includes the first line of the file being processed followed by the summed version of that line. For example, given the following at the top of the file:

GS16484766,1
GS16484766,1
GS16484766,1096
GS16484766,11

The result is always:

GS16484766,1
GS16484766,1108

The actual sum is correct on the second line, and this is the only duplicate left after processing, but it always just seems to hang on to the first line.

Thanks again for your help. Running this from BBEdit is blindingly fast and every time I have to do this I will be saving 10-15 minutes off my old process. :smiley:

Hello.

I have no idea why that is happening, if the sum in line 2 had been incorrect by one, then I’d have guessed that it was an invisible character being part of the first field. At least I can’t see anything with the Awk command that should make that happen. You can see invisibles in BBEdit.

I’d also go back to the original version, and see if that happens when you run it with output to TextEdit.

By the way, you can easily pass the contents of the front document of BBEdit to the shell script, the you’ll use the input string construct ( “<<<” & quoted form of bbeditTExt) for the do shell script. You’ll find the correct syntax in BBEdit’s scripting dictionary.

Hello.

Try to zap gremlins. What I guess has happened is that a ctrl-A or something is in front of the first value. I guess that you searched for an item, when you counted up the fields manually, then the offending item wouldn’t have shown up, and your count would seem correct, but is really off by one.

It does happen in the original version but I was wrong about the final summed value. It is actually off by the value of the first field. So that first line isn’t getting counted at all but is getting included in the final output. Again, that only seems to affect the first line of the file.

I’ll spend some time with the BBEdit scripting dictionary and sort out the rest.

Thanks.

As I stated in my post above, you should really try to zap gremlins from within BBEdit, and the re-run the script, then you should get only unique lines. (Please tell me if you don’t. ) :slight_smile:

Nope, I zapped gremlins, saved the file, and got the same results. I’m curious if you get the same results on a file containing just the following lines:

GS16484766,1
GS16484766,1
GS16484766,1
GS16484766,1
GS16484766,1
GS16484766,1096
GS16484766,11
GS16484766,18
GS16484766,2
GS16484766,4
GS16484766,5
GS16484766,566
GS16484766,6
GS16484766,6
GS16484766,6
GS16484766,7
GS16485279,1
GS16485279,1
GS16485279,1
GS16485279,2
GS16485279,2
GS16485279,2
GS16485279,26

For me, the expression returns:

GS16484766,1
GS16484766,1731
GS16485279,35

Nope, I got this as result:

GS16484766 1732 GS16485279 35
But now I do believe that you have saved the document with BOM :slight_smile:

If you have saved the document as UTF”XX with BOM, the please save it with NO BOM in BBEdit, and reopen it.
and retry,

If there were a bom, (that we can’t see, and that doesn’t really turn up in the BBEdit document, then that would make perfect sense, as it would be mistaken as part of the first item of Awk, and you still won’t see the items, as they are reparsed as a BOM in the output. :D.

That was it. I didn’t even think to look at the encoding for whether I was saving with BOM. Thanks for indulging me.

Ahhh. I really didn’t think of mentioning it, as I haven’t had that kind of problems as of lately.

It was a fun masqueradeing problem! :smiley:

Well, you really don’t have to consider it, once you use the

contents of front window of application "BBEdit"

as your input source.

I’ve been experimenting with reading the contents directly out of the front document in BBEdit:


tell application "BBEdit"
	set a to contents of front document
end tell

set theSum to (do shell script "awk 'BEGIN {FS=\",\"}{line[$1]+=$2 }END { for ( item in line ) printf(\"%s,%d\\n\", item, line[item]) }' <<<" & quoted form of a)

tell application "BBEdit"
	make new document at front
	set text of document 1 to theSum
	activate
end tell

It runs without errors but the end result is always a file containing only the first line (no summing) from the file it read. I assume I’m doing something wrong with the contents that are passed to the awk expression. I’m still investigating.

Try to set newlines to line endings of the BBEdit document.

Hello.

This was probably to finicky as I haven’t scripted BBEdit for a while, but it should work.

tell application "BBEdit"
	set a to contents of every line of window 1
	set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, "
"}
	set a to a as text
	set AppleScript's text item delimiters to tids
	set theSum to (do shell script "awk 'BEGIN {FS=\",\"}{line[$1]+=$2 }END { for ( item in line ) printf(\"%s %d\\n\", item, line[item]) }' <<<" & quoted form of a without altering line endings)
	make new document at front
	set text of document 1 to theSum
	activate
end tell

Indeed it did. I will deconstruct it all further tomorrow to better understand the details. This has been a very educational day. :slight_smile:

Again.my thanks.

Well, we solved it! :slight_smile:

The last problem is that I haven’t got the right incantation, to get BBEdit to deliver text separated by linefeeds, so I had to massage the lines into linefeeed separated lines, as awk doesn’t reckognize return as an end of line marker, therefore awk just saw one line.

Educational indeed. :slight_smile: