Applescript to add up all numbers in a Text File?

pjdube · October 18, 2012, 9:01pm

Hello!

I am new to applescript and was wondering if someone can assist me with an applescript.

I have a large text file and in this text file is a bunch of texts, but also number characters. All I want to do is have an applescript literally go through the text file and add up all the numbers and give me a total.

Is that possible?

It would save me a lot of time.

Thanks a lot!

Joy · October 18, 2012, 9:12pm

Hi,
and welcome!
Well, not everything is as easy to script as one might think. Before somebody can help you, please be more precise. For example:

- Which application are you scripting? (Consider using TextWrangler instead of TextEdit if you work on large text files.)
- What is your goal in the end? Counting the pages of your document?

pjdube · October 18, 2012, 9:18pm

Joy,

Wow, that was fast. Thanks.

Yes, I have text wrangler.

The purpose of this is that I receive several txt files that have a lot of text in them and numbers. All I want to do is have an applescript to add up (i.e. a total) the numbers that are located in the entire text. I get these txt files every week and I have to sift through them manually and count up the figures that are in there.

If you can help me then that will be great. If there is no way that a script can do this, then I guess I will have to keep doing this every week. :o

Phil

Nigel_Garvey · October 18, 2012, 11:03pm

Hi Phil. Welcome to MacScripter.

If by “text file” you mean “8-bit Mac Roman”, this may serve your need. If the text is Unicode, the script will need a small adjustment. It reads the text directly from the file you choose.

do shell script ("osascript -e 'read file \"" & (choose file) & "\" as string' | sed -En '/[^0-9]*([0-9][0-9.,]*)[^0-9]*/ { s//\\1+/g ; H ; } ; $ { g ; s/[+]$// ; s/\\n//g ; p ; } ;' | bc")
display dialog result

Otherwise, if you want something that works on the text of the front document in TextWrangler, this version does that:

tell application "TextWrangler" to set txt to text of front text window

do shell script ("sed -En '/[^0-9]*([0-9][0-9.,]*)[^0-9]*/ { s//\\1+/g ; H ; } ; $ { g ; s/[+]$// ; s/\\n//g ; p ; } ;' <<<" & quoted form of txt & " | bc")
display dialog result

If you’re interested in how they work, just ask.

Edit: I was forgetting you said large text file. The second script may not work if the text is too large as there’s a limit to how much you can put in a ‘do shell script’ string. Here’s another version in which the shell script gets the text for itself and thus keeps the source code short:

do shell script ("osascript -e 'tell application \"TextWrangler\" to get text of front text window' | sed -En '/[^0-9]*([0-9][0-9.,]*)[^0-9]*/ { s//\\1+/g ; H ; } ; $ { g ; s/[+]$// ; s/\\n//g ; p ; } ;' | bc")
display dialog result

adayzdone · October 18, 2012, 11:46pm

sed -En '/[^0-9]*([0-9][0-9.,]*)[^0-9]*/ { s//\\1+/g ; H ; } ; $ { g ; s/[+]$// ; s/\\n//g ; p ; } ;' | bc"

I am interested !

Shane_Stanley · October 19, 2012, 12:37am

Nigel,

Why “[^0-9]*([0-9][0-9.,]*)[^0-9]*” and not just “[0-9][0-9.,]*”?

pjdube · October 19, 2012, 12:50am

Hey guys, thanks for all that!

My oh my! Yes, I have no clue what all this means. Anybody care to walk me through??

Made me want to give up! Just kidding!

I tried each version and all of them ran in the AppleScript Editor fine, but the dialog did not show any result. The Result pane below just kept on saying “running”. I left it for over 30 minutes and it still said “running”.

Not sure where I went wrong on this. I did make sure that TextWrangler was open with the document opened as well, and it still gave me that result.

Any suggestions?

adayzdone · October 19, 2012, 12:57am

Why don’t you have to escape the parentheses ? Is it because you are using -n instead of s/ ?

Nigel_Garvey · October 19, 2012, 2:06am

OK.

sed -En : Use “sed” with extended regex and only print lines where specifically told to do so.
/[^0-9]*([0-9][0-9.,]*)[^0-9]*/ : This is a sed “address” specifying that the following instruction (or instruction group) should only be performed on lines matching this regex. As Shane’s pointed out, it’s a little overspecified for the purpose, but I’ve made it as full as it is so that it can double as the search term in the following ‘s’ command without the need to write that out again. The regex means “zero or more non-digit characters, followed by a digit, followed by zero or more characters which can be digits, full stops, or commas, followed by zero or more non-digits. Remember what matches the parenthesised sequence (which is hopefully a number).” Not knowing how Phil’s texts are arranged, I’ve guessed this to be the best way to identify the numbers in them.
{ s//\1+/g ; H ; } : The instruction group to be executed on lines matched above. “Replace every match to the above sequence with the remembered number bit and a plus sign, and append the result to sed’s ‘hold space’ with a linefeed.”
$ { g ; s/[+]$// ; s/\n//g ; p ; } : On the last line, after carrying out the preceding instruction group (if relevant), get the hold space contents back to the “pattern space”, zap the unwanted “+” at the end, zap all the line feeds, and output what’s left. This is hopefully a string of numbers with plus signs between them.
| bc : Pass the result to “bc” to calculate.
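
For anyone who wants to watch the steps above happen outside AppleScript, here is a minimal sketch that runs the same sed program in a terminal. The function name build_expr and the sample text are made up for illustration. It stops before bc so the intermediate expression is visible.

```shell
# Hypothetical helper: read text on stdin, print the "+"-joined expression
# that the script would hand to bc.
build_expr() {
  sed -En '
    /[^0-9]*([0-9][0-9.,]*)[^0-9]*/ {
      s//\1+/g
      H
    }
    $ {
      g
      s/[+]$//
      s/\n//g
      p
    }
  '
}

# Made-up sample in the spirit of the texts discussed in this thread.
sample='xxx xxx xxxxx 25 xxx
xxxx xxxx
xxxx 45 xxx xxxxx
18 xxx'

printf '%s\n' "$sample" | build_expr
# prints: 25+45+18
```

Piping that output into bc, as the full script does, should then give 88 for this sample.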

Phil, the script never returning a result is possibly because I’ve guessed wrongly about what’s in the text (or sed’s behaving differently on our two machines). Could you post a short example?

adayzdone: No. It’s the -E, actually. With -E, characters such as parentheses are special by default, so you have to escape them or put them in a character class to match them literally in the text.

pjdube · October 19, 2012, 3:06am

Hey thanks for all that. My head is swimming a little but I’ll check on-line and learn about it. I am not much of a programmer and definitely want to be.

Ok so I have a little snippet of the text document I am trying to run the applescript on.

First x = words (not numbers) that I don’t want shown. As you can see it is not all consistent and I just want to grab the numbers. The cool thing is that in the entire text document these are the numbers that i need to add up, and no other numbers exist (i.e. there are no dates in these text documents I get, so it won’t add them up ). The other thing is that the numbers are no more than two digits, and will never be more than that. I hope this helps!!

This is a small snippet of the whole text file:

xxxx xxxx
xxxx xxxx
xxxx xxxx
xxxx xxxx
xxx xxx xxx xxxxx Name Name emailaddress@gmail.com
xxx xxx xxxxx 25 xxx

xxxx xxxx
xxxx xxxx
xxxx xxxx
xxxxx Name Name emailaddress@gmail.com xxxx
xxxx 45 xxx xxxxx x xxxxx xxxxxxxxx xxx

xxxx xxxx
xxxx xxxx
xxxxx xx x Name Name emailaddress@gmail.com xxxx
xxxxxxx xxx 58 xxx
xxxx xxxx
xxxx xxxx (xxxx-xx)
xxxxx Name Name emailaddress@gmail.com xxxx
xxx xxxxxxxxxxx xxx x xxxxxxx 6 xxx

Name Name emailaddress@gmail.com xxxx
xxx x x xxxx xxxxx xxxx 35 xxx
xxxx xxxx xxxxxxx xxxx
xxxxx Name Name emailaddress@gmail.com xxxx
18 xxx
xxxxx Name Name emailaddress@gmail.com xxxx
xxx x xxxx 5 xxx

Model: MacBook Pro
Browser: Safari 536.26.14
Operating System: Mac OS X (10.7)

McUsr · October 19, 2012, 7:28am

Hello!

Your sed skills are very impressive Nigel!

It is of course much easier to come up with a solution now that the example input is well defined and on the table.

An AS solution may not be that fast, but it is not that slow either, and it scores on readability. It is also quick to script when you don’t know sed that well and have no intention of learning it.

The pass-by-reference construct (a script object property) and text item delimiters make it fairly fast to sift out numbers with AppleScript.

My solution works this way:

First the text is split into paragraphs. If the input consists of more than 16384 paragraphs the solution is not going to work, but this is unlikely.
Then I take a chunk of paragraphs, split it into a list of words with text item delimiters, and add up the items that are numbers.

The handler sums those numbers and returns the partial sum, to be aggregated in the main handler.

The main handler then gets the next chunk of paragraphs and continues to add to the running total until the whole text is processed.

Math behind the solution:

An AS list can contain 16384 = 2^14 items.

A conservative approach would say that no paragraph would contain more than 400 words.

That gives us 16384/400 ≈ 40 paragraphs to process at a time.

As for file size, any figure is just a guess, since the number of words per paragraph varies so much. I think the solution will hold for files up to 1.5 MB under all circumstances, given that they contain “normal” text. This guess starts from the maximum estimates above, which predict a maximum file size of 12.5 MB. I then reduced the number of words per paragraph to an average of 10, with an average word length of 5 characters, and did the math:

16383 paragraphs × 10 words × 5 characters × 2 bytes / (1024 × 1024) ≈ 1.5 MB
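
To double-check the two estimates, here is a quick sketch that redoes the arithmetic with awk. It takes the post's figures at face value (the 16384-item list limit, 400 words per paragraph, and so on) rather than verifying them, and the function name estimate is made up.

```shell
# Hypothetical helper: recompute the chunk size and file-size estimate
# from the figures quoted in the post.
estimate() {
  awk 'BEGIN {
    printf "paragraphs per chunk: %.2f\n", 16384 / 400
    printf "estimated size in MB: %.2f\n", 16383 * 10 * 5 * 2 / (1024 * 1024)
  }'
}

estimate
# prints:
# paragraphs per chunk: 40.96
# estimated size in MB: 1.56
```

So rounding down to 40 paragraphs per chunk and quoting roughly 1.5 MB both agree with the numbers in the post.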

Speed:

Programmer’s speed is of the greatest concern!

As for speed, it may have taken Nigel something like 20 minutes to come up with that sed script, whereas I could easily spend 5 hours getting it right.

On the other hand, this solution took me about 20 minutes, maybe an hour, as I was distracted. So much for efficiency.


property chunkSize : 40

to max(a, b)
	if a > b then return a
	return b
end max

to min(a, b)
	if a < b then return a
	return b
end min

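to sumPars(L)
	script o
		property pars : L
	end script
	
	local tids
	set tids to AppleScript's text item delimiters
	# Join the paragraphs with spaces, so that a number ending one paragraph
	# does not fuse with the text starting the next one
	set AppleScript's text item delimiters to space
	set o's pars to o's pars as text
	# Split the joined text into "words" at tabs, spaces and returns
	set AppleScript's text item delimiters to {tab, space, return}
	set o's pars to text items of o's pars
	set AppleScript's text item delimiters to tids
	local i, tSum
	set {i, tSum} to {1, 0}
	repeat (length of o's pars) times
		try
			# Items that cannot be coerced to a number raise an error and are skipped
			set tSum to tSum + (item i of o's pars as number)
		end try
		set i to i + 1
	end repeat
	return tSum
end sumPars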

to summarizeText(theText)
	
	set pList to paragraphs of theText
	# Turns the text into a list of paragraphs
	set pCount to length of pList
	# Number of paragraphs to process
	set pRemains to max((pCount - chunkSize), 0)
	# Remaining paragraphs to process after the first chunk
	set {pStart, pEnd, iSum, chunkCount} to {1, min(chunkSize, pCount), 0, ((pCount + chunkSize - 1) div chunkSize)}
	# Initial range of paragraphs to process, intermediate sum, and count of chunks
	# (rounded up, so no empty chunk is requested when pCount is an exact multiple of chunkSize)
	
	repeat chunkCount times
		
		set iSum to iSum + sumPars(items pStart thru pEnd of pList)
		# Sums the numbers in this chunk and adds them to the running total
		set {pStart, pEnd, pRemains} to {(pStart + chunkSize), (pEnd + min(chunkSize, pRemains)), max((pRemains - chunkSize), 0)}
	end repeat
	return iSum
end summarizeText


tell application "TextWrangler" to set theTxt to text of front text window

set theSum to summarizeText(theTxt)
log theSum

Shane_Stanley · October 19, 2012, 10:52am

On that basis, and with the provided sample in mind, here’s a solution using ASObjC Runner:

on addUpNumbers(theString)
	tell application "ASObjC Runner"
		set theString to look for "<[^>]+>" in theString replacing with "" -- remove <bill334@me.com>, &c
		set numberStrings to look for "[0-9]+" in theString -- find number strings
		set theNumbers to value for key path "doubleValue" of item numberStrings -- strings to numbers
		return item 1 of (summarize numbers theNumbers) -- add 'em up
	end tell
end addUpNumbers
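
For readers without ASObjC Runner, the same three steps (strip the bracketed addresses, pick out the digit runs, add them up) can be sketched as a shell pipeline. This is not Runner's implementation, just an analogous sketch. The function name add_up_numbers and the sample lines are made up, and the final step uses awk instead of Runner's “summarize”.

```shell
# Hypothetical helper mirroring the handler above:
#   1. sed removes anything in angle brackets, e.g. <bill334@me.com>
#   2. grep -o emits each run of digits on its own line
#   3. awk adds the lines up (printing 0 if there were none)
add_up_numbers() {
  sed -E 's/<[^>]+>//g' |
    grep -oE '[0-9]+' |
    awk '{ total += $1 } END { print total + 0 }'
}

printf '%s\n' 'xxxxx Name Name <bill334@me.com> xxxx' \
              'xxxxxxx xxx 58 xxx' \
              'xxx x xxxx 6 xxx' | add_up_numbers
# prints: 64
```

Note that this only protects digits inside angle brackets. An address written without brackets would still contribute its digits to the total.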

McUsr · October 19, 2012, 11:17am

That is so nice.

By the way, I figured that the way I use text item delimiters would keep nighstalker555@hotmail.com out of consideration. That is also why I didn’t try o’s pars’ text’s numbers. (I don’t know if that would have worked anyway.)

But I must say that ASObjC Runner makes for very readable code! Quite impressive. I still believe Nigel’s solution to be unbeatable, until you rewrite it in C, C++ (with the Boost library for regexp) or Objective-C.

It is nice to see different approaches!

Nigel_Garvey · October 19, 2012, 11:19am

Hi.

Thanks for posting the example. I’m flummoxed because when I paste it into TextWrangler and run my second or third script, the dialog correctly shows “192”, so I don’t know why it’s not working for you. I’ve tried changing the line endings before passing the text to the shell script, but it always works.

Knowing that each number only consists of one or two digits allows the regex to be tightened up a little. I’ve also added some preprocessing to remove any digits which (despite what you’ve said) may appear in the e-mail addresses:

do shell script ("osascript -e 'tell application \"TextWrangler\" to get text of front text window' | sed -En 's/[[:alpha:][:punct:]][0-9]+//g ; /[^0-9]*([0-9]{1,2})[^0-9]*/ { s//\\1+/g ; H ; } ; $ { g ; s/[+]$// ; s/\\n//g ; p ; } ;' | bc")
display dialog result
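
To see what the added preprocessing step does in isolation, here is a small sketch that runs only that substitution. The function name strip_attached_digits and the sample lines are made up. One side effect worth knowing about, assuming your texts could ever contain it: a number written directly after punctuation, such as “(12)”, is removed as well.

```shell
# Hypothetical helper: run only the preprocessing substitution, which deletes
# any letter or punctuation character together with the digits that follow it.
strip_attached_digits() {
  sed -E 's/[[:alpha:][:punct:]][0-9]+//g'
}

printf '%s\n' 'Name emailad567ss@gmail.com 25 xxx' | strip_attached_digits
# prints: Name emailass@gmail.com 25 xxx

printf '%s\n' 'total (12) and 7' | strip_attached_digits
# prints: total ) and 7
```

Numbers that stand on their own, with a space or the start of the line before them, are left for the rest of the script to pick up.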

Shane_Stanley · October 19, 2012, 11:58am

You mean in terms of speed? Maybe my regex is too simple, but the Runner method shades the last sed offering, if only by a whisker.

McUsr · October 19, 2012, 12:24pm

Sed is blazingly fast, though nobody will know for sure until things are tested under equal conditions. My uneducated guess is that your solution is just that far from it: a whisker. I guess ASObjC Runner is almost as fast as Objective-C. There are other factors here, though, such as whether the code is already in the run-time cache and whether there are one or more files to process.

If there is just a single file, then I am fairly, though not overly, confident that Nigel’s solution will be the fastest. If there is more than one file and the cache comes into play, then I am unsure.

I am also curious how fast both of your solutions are compared to my AS solution, but I have not measured it, and will not try to at this time. I think my solution is still acceptable, as I believe I get a lot for free by using the reference hack and text item delimiters.

By the way, the readability of your code is great, and I guess the only thing that limited how quickly you came up with the solution was your typing speed.

All in all, your solution was something of an epiphany to me. I had not considered ASObjC Runner’s regexp capabilities before, nor commands such as “summarize”.

pjdube · October 19, 2012, 2:54pm

Nigel, hey man, thanks!!! And thanks to the rest of you! It works. I tried your script again and it worked! For some reason it wasn’t working before, but now it does. Thanks a lot! This saves me a lot of time!

Here is the one that I tried (which worked).


do shell script ("osascript -e 'tell application \"TextWrangler\" to get text of front text window' | sed -En 's/[[:alpha:][:punct:]][0-9]+//g ; /[^0-9]*([0-9]{1,2})[^0-9]*/ { s//\\1+/g ; H ; } ; $ { g ; s/[+]$// ; s/\\n//g ; p ; } ;' | bc")
display dialog result

Yvan_Koenig · October 19, 2012, 2:57pm

Hello

I’m trying to understand the way sed behaves.

I tried :


set txt to "
xxxx xxxx
xxxx xxxx
xxxxx xx x Name Name <emailad567ss@gmail.com> xxxx
xxxabcx xxx 58 xxx
xxxx xxxx
xxxx xxxx (xxxx-xx)
xxxxx Name Name <emailaddress@gmail.com> xxxx
xxx xxxxxxxxxxx xxx  x  xxxxxxx 6 xxx
"
do shell script "sed -En '/abc/s//y/' <<< " & quoted form of txt

assuming that it will replace the string “abc” by “y”

Clearly I assumed wrongly because it returns “”.
What’s wrong ?

Yvan KOENIG (VALLAURIS, France) vendredi 19 octobre 2012 16:57:31
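
A likely explanation, sketched here in the shell rather than inside do shell script: the -n option tells sed to print nothing unless a command asks for it, and the substitution above has no “p” flag. The sample text is a made-up two-line excerpt.

```shell
txt='xxxabcx xxx 58 xxx
xxxx xxxx'

# With -n and no p flag, the substitution happens but nothing is printed.
printf '%s\n' "$txt" | sed -En '/abc/s//y/'

# Adding the p flag prints just the line that was changed.
printf '%s\n' "$txt" | sed -En '/abc/s//y/p'
# prints: xxxyx xxx 58 xxx

# Dropping -n prints every line, changed or not.
printf '%s\n' "$txt" | sed -E '/abc/s//y/'
```

Which of the last two forms to use depends on whether you want the whole text back or only the lines that matched.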

Nigel_Garvey · October 19, 2012, 3:12pm

It depends on whether Runner has to launch first and whether or not it wants to tell you about a new update. But otherwise the Runner code is faster than my shell script, even when including the TextWrangler stuff in the Runner timing. Faster still is this Satimage/Vanilla hybrid:

tell application "TextWrangler" to set txt to text of front text window

set txt to (change "<[^>]+>" into "" in txt with regexp) -- Zap e-mail addresses. (Uses Satimage OSAX.)
set txt to (change "[^0-9 ]" into "" in txt with regexp) -- Zap anything else which isn't a digit or a space or a line ending. (Ditto.)
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "+"
set txt to txt's words as text
set AppleScript's text item delimiters to astid
run script txt
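
The idea behind that hybrid (reduce the text to digits and spaces, join the remaining “words” with plus signs, then evaluate the result as an expression) can also be sketched without Satimage. The version below uses plain shell tools. The function name sum_by_expression and the sample lines are made up, and shell arithmetic stands in for “run script”. One assumption to be aware of: shells treat numbers with a leading zero, such as 08, as octal, so this sketch expects numbers without leading zeros.

```shell
# Hypothetical helper. Steps, in pipeline order:
#   1. drop anything in angle brackets (e-mail addresses)
#   2. delete every character that is not a digit, a space or a newline
#   3. put each remaining "word" on its own line, squeezing blank lines
#   4. keep only lines that contain a digit
#   5. join those lines with "+"
# The resulting expression is then evaluated by the shell.
sum_by_expression() {
  expr_text=$(
    sed -E 's/<[^>]+>//g' |
      tr -cd '0-9 \n' |
      tr -s ' ' '\n' |
      grep -E '[0-9]' |
      paste -sd+ -
  )
  echo "$(( ${expr_text:-0} ))"
}

printf '%s\n' 'xxxxx Name Name <emailad567ss@gmail.com> xxxx' \
              'xxxabcx xxx 58 xxx' \
              'xxxx xxxx (xxxx-xx)' \
              'xxx xxxxxxxxxxx xxx  x  xxxxxxx 6 xxx' | sum_by_expression
# prints: 64
```

As with “run script”, evaluating generated text is only reasonable here because everything except digits and plus signs has been stripped first.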