Applescript to add up all numbers in a Text File?

Nigel_Garvey · October 19, 2012, 3:20pm

Yvan Koenig:


set txt to "
xxxx xxxx
xxxx xxxx
xxxxx xx x Name Name <emailad567ss@gmail.com> xxxx
xxxabcx xxx 58 xxx
xxxx xxxx
xxxx xxxx (xxxx-xx)
xxxxx Name Name <emailaddress@gmail.com> xxxx
xxx xxxxxxxxxxx xxx  x  xxxxxxx 6 xxx
"
do shell script "sed -En '/abc/s//y/' <<< " & quoted form of txt

assuming that it will replace the string “abc” by “y”

Clearly I assumed wrongly because it returns “”.

Hi Yvan.

The -n option is stopping any of the lines from being sent to the output. When you use it, you have to specify what lines to print. For instance:


set txt to "
xxxx xxxx
xxxx xxxx
xxxxx xx x Name Name <emailad567ss@gmail.com> xxxx
xxxabcx xxx 58 xxx
xxxx xxxx
xxxx xxxx (xxxx-xx)
xxxxx Name Name <emailaddress@gmail.com> xxxx
xxx xxxxxxxxxxx xxx  x  xxxxxxx 6 xxx
"
do shell script "sed -En '/abc/s//y/p' <<< " & quoted form of txt
-- Or:
do shell script "sed -En '/abc/{ s//y/ ; p ; }' <<< " & quoted form of txt
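
For anyone who wants to see the effect of -n directly in a shell, here is a minimal sketch. The three-line sample and the variable names are made up for illustration and are not taken from the posts above.

```shell
#!/bin/sh
# Hypothetical three-line sample; only the middle line contains "abc".
txt='first line
xxxabcx 58
last line'

# No -n: every line is printed, whether or not it was changed.
all_lines=$(printf '%s\n' "$txt" | sed 's/abc/y/')

# -n plus the p flag: only lines where the substitution succeeded are printed.
changed_only=$(printf '%s\n' "$txt" | sed -n 's/abc/y/p')

# -n with no p anywhere: nothing is printed at all.
nothing=$(printf '%s\n' "$txt" | sed -n 's/abc/y/')

printf 'without -n:\n%s\n' "$all_lines"
printf 'with -n and p:\n%s\n' "$changed_only"
```

The third case is the one that produced the empty result in the quoted script.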

Yvan_Koenig · October 19, 2012, 3:37pm

Thanks Nigel but it appears that I was not clear enough.
I wanted to get the entire text with every occurrence of abc replaced by y.

In fact, my true goal was to build a more intricate request removing every string enclosed in angle brackets,
like Shane’s instruction:

set theString to look for "<[^>]+>" in theString replacing with "" -- remove <bill334@me.com>, &c

I’m accustomed to using ASObjC Runner but wish to understand a bit about other tools.

Yvan KOENIG (VALLAURIS, France) vendredi 19 octobre 2012 17:37:34

McUsr · October 19, 2012, 4:09pm

@ Nigel: I was totally unaware that it is faster, which is quite astonishing, as is the way you can use Satimage.Osax. Maybe I had an old version, but the last time I remember using it, I had to kind of create a regexp object, (almost) compile it, and definitely drag the matches out of it… This I have to look into!

@ Yvan:
This is how I do it, assuming input is piped into sed.

E is not needed since you don’t use extended regular expressions. Neither is n, since you are going to make the replacements inline, not suppressing any output.

You want to substitute every occurrence of abc with y, so an s command goes in front of the search pattern, and a g flag (for global) is appended after the replacement.

sed 's/abc/y/g'
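
A small sketch of what the g flag changes, and of the same idea applied to the angle-bracketed strings mentioned earlier in the thread. The sample line and variable names are invented for the example. The bracket pattern is written as a basic regular expression so that it needs no -E.

```shell
#!/bin/sh
# Hypothetical sample line with two "abc" and one <...> string.
line='xxabcxx abc Name <emailad567ss@gmail.com> 58'

# Without g, only the first "abc" on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/abc/y/')

# With g, every "abc" on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/abc/y/g')

# Same s///g form, this time deleting anything between < and >.
no_brackets=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf '%s\n' "$first_only" "$every" "$no_brackets"
```

Because no -n is given, lines that contain no match are passed through unchanged, so the whole text comes back.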

You may find some interesting files if you google for 'macmahon sed' and 'sed tutorial'.

Sed also closely resembles the venerable ed. If you know how to use ed, then you already know a lot about how sed works.

Nigel_Garvey · October 19, 2012, 4:56pm

I was about to reply to Yvan’s question, but I see you’ve already done it.

With regard to the speeds of the various methods, I’m only saying that the particular combination of Satimage and vanilla I posted is faster than Shane’s handler and my shell script at achieving the required result with the given sample text. It doesn’t necessarily mean that Satimage is always faster than ASObjC Runner or a shell script.

McUsr · October 19, 2012, 5:49pm

To be honest, I am not really after execution speed until it actually matters. I am after programmer speed, and I dare say that ASObjC Runner seems to deliver the goods I want.

But it is always interesting to have a diversity of tools available, so one can pick the one that fits a particular bill. The way you use Satimage.Osax was totally new to me. Thanks for the insight!

Nigel_Garvey · October 19, 2012, 6:12pm

I think it’s possible to combine the two ‘change’ commands:

tell application "TextWrangler" to set txt to text of front text window

set txt to (change "<[^>]+>|[^0-9 ]" into "" in txt with regexp) -- Zap e-mail addresses and anything else which isn't a digit or a space. (Uses Satimage OSAX.)
set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to "+"
set txt to txt's words as text
set AppleScript's text item delimiters to astid
run script txt
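
For comparison, the same strategy (strip the angle-bracketed strings, keep only the numbers, join them with "+", evaluate the result) can be sketched in a plain shell pipeline. This is only an illustration under assumptions: the sample text is a shortened, made-up variant of the one used earlier in the thread, and the variable names are hypothetical. Numbers with leading zeros would be read as octal by the shell arithmetic, so the sketch assumes there are none.

```shell
#!/bin/sh
# Hypothetical sample text, modelled on the example earlier in the thread.
txt='xxx Name Name <emailad567ss@gmail.com> xxxx
xxxabcx xxx 58 xxx
xxx xxxxxxxxxxx xxx  x  xxxxxxx 6 xxx'

# 1. sed   : delete <...> strings, so digits inside addresses are not counted.
# 2. tr    : turn every run of non-digit characters into a single newline.
# 3. grep  : drop the empty line that tr leaves at the start.
# 4. paste : join the remaining numbers with "+".
sum_expr=$(printf '%s\n' "$txt" |
    sed 's/<[^>]*>//g' |
    tr -cs '0-9' '\n' |
    grep . |
    paste -sd+ -)

total=$(($sum_expr))   # let the shell evaluate the "a+b+c" expression

echo "$sum_expr"
echo "$total"
```

With this sample the expression is 58+6 and the total is 64. The 567 inside the address is ignored because the sed step runs first.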

Shane_Stanley · October 19, 2012, 10:38pm

Who goes to the trouble of quitting FBAs?

And it occurs to me that the sample is highly artificial – at least, mine was – so even there the rankings might best be described as provisional.

Shane_Stanley · October 19, 2012, 10:51pm

There are roughly three stages: send stuff from AS to the app or scripting addition, process it, and send it back. Depending on the nature of the data, the sending back and forth takes a lot of the total time. And that has to be done similarly with an app like Runner or a scripting addition like Satimage or do shell script. The differences in performance of the regex libraries each case calls are probably trivial in such a scheme.

Adam_Bell · October 20, 2012, 12:26am

What a great thread – a major education.

Yvan_Koenig · October 20, 2012, 7:54pm

Hello

As I scratched my bald head for a while before discovering this, I’m passing on a basic piece of information.

When we call sed to treat a multiline file, it seems useful to double-check that the line endings are linefeeds, not returns.

I found a tutorial on the net but its examples were designed for the Terminal.
So, I spent time trying to edit them for do shell script instructions.
Alas, the ones dedicated to multiline documents failed.
Replacing the returns by linefeeds solved the problem.

Yvan KOENIG (VALLAURIS, France) samedi 20 octobre 2012 21:54:41

Oops, the returns were not introduced by the script editor but by Pages in which I stored the tutorial borrowed from the net.

Nigel_Garvey · October 20, 2012, 10:16pm

Hi Yvan.

If you suspect your text has returns instead of linefeeds, you can (if you don’t want to use AppleScript’s text item delimiters) insert an additional “sed” call to replace them before doing the line-by-line editing:

Or, as I prefer for surety:

“sed” has a peculiarity whereby when a linefeed is a replacement character, it must be rendered as a backslash (escaped with another backslash in the AppleScript shell script text) followed by an actual linefeed character. As a character to be replaced, it would be rendered as an (escaped) backslash followed by the letter ‘n’!
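
The two points above can be tried in a shell with the following sketch. It is not the code from the post, only an illustration with invented sample strings and variable names. Note that in a plain shell script the backslashes are single; the doubling is only needed inside an AppleScript string.

```shell
#!/bin/sh
# Returns to linefeeds with tr, before any line-by-line editing.
lf_text=$(printf 'line 1\rline 2\r' | tr '\r' '\n')

# Linefeed on the replacement side: a backslash followed by a real newline.
split=$(printf 'a,b,c\n' | sed 's/,/\
/g')

# Linefeed on the pattern side: backslash-n. N first appends the next line,
# so that there is a linefeed inside the pattern space to match.
joined=$(printf 'a\nb\n' | sed 'N;s/\n/ /')

printf '%s\n' "$lf_text" "$split" "$joined"
```

The command substitutions strip trailing linefeeds, which is why lf_text ends up as two lines with nothing after the second one.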

McUsr · October 21, 2012, 7:11am

When it comes to linefeeds and sed, I have found that it is much easier to use tr for things such as deleting linefeeds.

But then, I am not Nigel Garvey!

And when working with those tools, it is far better to experiment with them in a terminal window!

It is useful to have od -cb at hand for analyzing both the input and the output of commands like sed when you need to figure out what is happening.

od stands for octal dump. (man od).
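
As a rough illustration of that kind of check, the sketch below dumps a short made-up string that mixes a return and a linefeed. Only -c is used here, to keep the output short; the exact column layout may differ between systems, so none is shown.

```shell
#!/bin/sh
# Hypothetical input: "ligne 1", a return, "ligne 2", a linefeed.
dump=$(printf 'ligne 1\rligne 2\n' | od -c)

# In the dump the return shows up as \r and the linefeed as \n,
# which makes it easy to see which line ending a command received or produced.
printf '%s\n' "$dump"
```

Piping the output of a sed or tr command into od -c in the same way shows whether the substitution really took place.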

Yvan_Koenig · October 21, 2012, 9:51am

Thanks both of you.

I made some experiments and got odd results.
It seems that I missed something.


--read file ((path to desktop as text) & "index.xml")
set forTests to "ligne 1
ligne 2
Pénélope Kœnig
"

if forTests contains linefeed then
	log ">>> contained linefeeds"
	set oTids to AppleScript's text item delimiters
	set AppleScript's text item delimiters to return
	set forTests to paragraphs of forTests as text
	set AppleScript's text item delimiters to oTids
end if

set quotedSource to quoted form of forTests
log quotedSource contains return
(*true*)
log quotedSource contains linefeed
(*false*)
# Here, I'm sure that quotedSource contains returns, not linefeeds
(*
I wrote this one to check that I passed correctly the text to treat.
OK, it was correctly passed, but although man claims:
(This should be preferred over the traditional UNIX idiom of ``tr a-z A-Z'',
     since it works correctly in all locales.)
it doesn't!
*)
set theText to do shell script "tr '[:lower:]' '[:upper:]' <<<" & quotedSource
(*
"LIGNE 1
LIGNE 2
PéNéLOPE KœNIG
"  (was supposed to be PÉNÉLOPE KŒNIG ) *)
# Nigel's code (at least if I reproduced it correctly)
set theText to do shell script "sed 's/\\r/\\'$'\\n''/g' <<<" & quotedSource
(*"ligne 1
ligne 2
Pénélope Kœnig
"*)
log theText contains return
(*true*)
log theText contains linefeed
(*false*)
# Clearly, the substitution failed !

# McUsr's proposal (also used by Nigel elsewhere)
set theText to do shell script "tr '\\r' '\\n' <<<" & quotedSource
(*"ligne 1
ligne 2
Pénélope Kœnig
"*)
log theText contains return
(*true*)
log theText contains linefeed
(*false*)
# Clearly, the substitution failed !

I guess that I need to drop this area for a while.

Yvan KOENIG (VALLAURIS, France) dimanche 21 octobre 2012 11:50:49

Nigel_Garvey · October 21, 2012, 10:07am

Hi Yvan.

The linefeeds are turned into returns when ‘do shell script’ returns (!) unless you use the ‘altering line endings’ parameter:

set theText to do shell script "tr '\\r' '\\n' <<<" & quotedSource without altering line endings

With this version, any linefeeds will be preserved on the return to AppleScript, as will any trailing line endings.

Edit: As for the upper-casing of diacritical characters with “tr”, I’ve no idea why it doesn’t work with ‘do shell script’, but it does work in the Terminal, both when the instruction’s typed in directly and when it’s scripted:

set forTests to "ligne 1
ligne 2
Pénélope Kœnig
"

if forTests contains linefeed then
	set oTids to AppleScript's text item delimiters
	set AppleScript's text item delimiters to return
	set forTests to paragraphs of forTests as text
	set AppleScript's text item delimiters to oTids
end if

set quotedSource to quoted form of forTests
tell application "Terminal" to do script "tr '[:lower:]' '[:upper:]' <<<" & quotedSource in window 1

Unfortunately, the result returned to AppleScript is just a reference to the Terminal tab in which the script was run, so you’d have to parse the tab’s ‘contents’ or ‘history’ to get the transformation.

Yvan_Koenig · October 21, 2012, 7:06pm

Thanks Nigel

Now I know why you used the instruction “without altering line endings”

My old brain assumed that it was useful when the shell script was receiving the source text.

What’s funny is that I decided to try to use sed to solve a problem which doesn’t match the tool’s requirements.

I was trying to enhance a script applying to the index.xml file describing the contents of iWork documents.
I just realised, two hours ago, that:

(a) if the file is organized as numerous paragraphs given my customized setting, it’s made of two paragraphs for current users.
One has about 50 characters and the other is a huge one which may embed thousands of characters, so that it can’t be treated by shell scripts or even by text item delimiters.
So I had to read it using:
read file chemin_IndexXml using delimiter ">"
Given that, I get a huge list, and extracting the wanted data is really simple.
I’m just forced to use ASObjC Runner to unescape the returned values.

But it’s not a problem.
I will continue to try to learn sed.

Yvan KOENIG (VALLAURIS, France) dimanche 21 octobre 2012 21:06:06

JohnDelacour · October 22, 2012, 8:43am

In BBEdit or TextWrangler you can write “UNIX” text filters/scripts and run them from the Text Filters palette. For example, if you save this script in ~/Library/Application Support/TextWrangler and open the palette from the Window menu, it will appear there and can be run on the front document either by double-clicking or with a key command that you set. This script will print the sum of all numbers in the document at the end of the document.

#! /usr/bin/perl
use strict;
my $sum;
while (<>) { #read the doc line by line
print; # print the line as it is
# add to $sum every instance of a number found in the line
$sum += $1 while s/^\d[^\d]//;
}
print "\n\n$sum"; #print the sum at the end of the doc.

#JD

JohnDelacour · October 22, 2012, 8:45am

That should be ~/Library/Application Support/TextWrangler/Text Filters/

Nigel_Garvey · October 22, 2012, 9:49am

Hi John.

That’s handy! But it prints 174 instead of 192 with the sample text in post #11. The problem seems to be the 18 at the beginning of a line.

PS. The other later solutions have also allowed for the possibility of digits in e-mail addresses.