Delimiter rant

Xpresso · May 28, 2013, 7:37pm

DJ Bazzie Wazzie:

kel1:

Still under a second. I think DJ Bazzie Wazzie is right though that tids is faster.

It’s because of the cost of an Apple Event, as in Chris’s (ccstone’s) code. Both offset and the Satimage command Chris uses need to send an Apple Event to the Apple Event Manager and back into the process. As some of you know from older machines (the Mac OS era), Apple Events are expensive and, as Shane mentioned this week, the limit used to be around 60 per minute. Today it’s normal to do a thousand events per second, but they are still overhead compared to a non-Apple-Event solution. Therefore using text item delimiters is the way to go.

@Xpresso: I understand that the code needs to be as efficient as possible. But did you actually time my code? It can do 20,000 dd’s each second on my old i7 MBP. Like I said, Apple Events are your enemy, so avoid them.
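For readers without AppleScript at hand, the delimiter idea can be sketched in Python. This is only an analogy, not the code that was timed: `find_value`, its `sep` parameter and the sample text are made up for illustration. The point it shows is that the lookup is plain in-process text splitting, with no round trip to another process.

```python
def find_value(text, name, sep=" : "):
    """Return the value of the line that starts with `name + sep`, or None."""
    # Prefix the text with a newline so the first line is handled like the others.
    marker = "\n" + name + sep
    parts = ("\n" + text).split(marker, 1)
    if len(parts) < 2:
        return None
    # The value runs up to the end of that line.
    return parts[1].split("\n", 1)[0]


sample = (
    "Var0 : Added this with a return.\n"
    "Var4 : Some Var5 text.\n"
    "Var5 : More variables."
)
print(find_value(sample, "Var5"))
```

Anchoring the marker to the start of a line keeps the "Var5" inside the Var4 line from being picked up.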

No, it’s not that. When I saw kel1’s timing script I was just curious and tried it with the TID version.
Ftr: It has to be efficient, but the efficiency it has now should be way more than enough.

@kel1, i made an app out of your timing code, you can see it here http://macscripter.net/viewtopic.php?id=40958

Shane_Stanley · May 29, 2013, 12:01am

It can’t. The process we call compiling is actually two processes: the code we write is first compiled into AS bytecode, and then that code is decompiled to styled AS text. So the process that produces the output doesn’t actually see what was input; there are no linefeeds or returns in the bytecode, except in string literals that are basically untouched (as long as they don’t contain continuation characters, which add another wrinkle).

Given that editors are now based on text views that default to linefeeds, there’s a good argument that the compiled code should also use linefeeds, and then the only returns would be those entered via the return keyword – but who knows what such a change would break elsewhere. (I tried out doing the switch in AppleScriptObjC Explorer, but I decided it was just another point of confusion.)

Edit: Actually there may be returns in the bytecode, but there won’t be instances of equivalent linefeeds and returns.

kel1 · May 29, 2013, 4:14am

Apple should just use linefeeds, since that’s what Unix uses. Then we wouldn’t have to guess what we’re using, or need to do something to find out what the EOL is.
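Until that happens, "doing something to find out what the EOL is" can be fairly small. A minimal sketch in Python, assuming the text is already in memory as a string; `detect_eol` and `normalize_eol` are hypothetical helper names, not part of any library:

```python
def detect_eol(text):
    """Guess the dominant line ending: 'crlf', 'cr', 'lf', or None if there is none."""
    crlf = text.count("\r\n")
    cr = text.count("\r") - crlf   # bare returns (classic Mac style)
    lf = text.count("\n") - crlf   # bare linefeeds (Unix style)
    counts = {"crlf": crlf, "cr": cr, "lf": lf}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def normalize_eol(text):
    """Convert every line ending to a linefeed."""
    # Handle the two-character ending first so it does not become two linefeeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(detect_eol("line one\rline two\rline three"))
print(repr(normalize_eol("a\r\nb\rc\nd")))
```

Normalizing once at the start of a script is usually simpler than guessing at every split.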

kel1 · May 29, 2013, 4:17am

I just thought of this; a lot of scripts would break.

ccstone · May 29, 2013, 6:19am

Okay, since we’re doing timing tests…

Requires: Satimage.osax, LapTime.osax


# Deps: Satimage.osax, LapTime.osax
set theText to "Var0 : Added this with a return.
Var1 : Lorem ipsum dolor sit amet, consectetur adipisicing elit
Var2 : And it was 11 o'clock, which was time for a little something
Var3 : It's over 9000!
Var4 : Some Var5 text.
Var5 : More variables."

# Note: each pass rebuilds srcData from theText, so only the "Var100" line ends up appended.
repeat with i from 6 to 100
	set srcData to theText & (linefeed & "Var" & i & " : abc123.")
end repeat

set tmr1 to START_TIMER() of me
repeat 1000 times
	fnd("Var0", srcData, false, false, true) of me
	fnd("Var1", srcData, false, false, true) of me
	fnd("Var2", srcData, false, false, true) of me
	fnd("Var3", srcData, false, false, true) of me
	fnd("Var4", srcData, false, false, true) of me
	fnd("Var5", srcData, false, false, true) of me
	fnd("Var100", srcData, false, false, true) of me
end repeat
set tmr1Stop to STOP_TIMER(tmr1) of me

-------------------------------------------------------------------------------------------
on fnd(_find, _data, _case, _all, strRslt) # Last 3 are all bool
	try
		find text _find in _data case sensitive _case all occurrences _all string result strRslt with regexp
	on error
		return false
	end try
end fnd
-------------------------------------------------------------------------------------------
on START_TIMER()
	return start timer
end START_TIMER
-------------------------------------------------------------------------------------------
on STOP_TIMER(tmrNumber)
	return format ((stop timer tmrNumber) / 1000) into "##.####"
end STOP_TIMER
-------------------------------------------------------------------------------------------

0.543 seconds in the AppleScript Editor using the LapTime osax for the timer.
0.517 seconds in Smile using its built-in chrono function for the timer.
0.410 seconds from FastScripts using the LapTime osax for the timer.

In a separate and very simple find/replace of 15 characters in a 365-character text sample, run from the AppleScript Editor:

0.00025 seconds - TIDs
0.00030 seconds - Satimage.osax

In a 10,000-line test file with 700,000 characters where 30,000 characters were replaced, again run from the AppleScript Editor:

0.58 seconds - TIDs
0.17 seconds - Satimage.osax

The Satimage.osax ain’t no slouch. If it was, I wouldn’t use it hundreds of times per day in dozens of scripts for real work.
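The same kind of comparison can be tried outside AppleScript. A rough sketch in Python, pitting a split/join replace (the same idea as the TIDs idiom) against a regex replace; the function names and the generated sample text are assumptions for illustration, and the numbers it prints say nothing about TIDs or the Satimage.osax themselves:

```python
import re
import timeit


def replace_with_delimiters(text, old, new):
    # Split on the search string and join with the replacement,
    # which is the same idea as the TIDs find/replace idiom.
    return new.join(text.split(old))


def replace_with_regex(text, old, new):
    # re.escape makes the search string literal; the lambda keeps `new` literal too.
    return re.sub(re.escape(old), lambda match: new, text)


sample = "\n".join("Var%d : abc123." % i for i in range(10000))

for func in (replace_with_delimiters, replace_with_regex):
    seconds = timeit.timeit(lambda: func(sample, "abc", "xyz"), number=20)
    print("%s: %.4f s for 20 runs" % (func.__name__, seconds))
```

As with the figures above, which approach wins can flip with the size of the text, so it is worth timing both on realistic data.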

McUsrII · May 29, 2013, 6:55am

Satimage.osax is fast, no doubt, but there is no way it can compete with standard unix tools like awk, sed and grep, when we start talking about quantities of data.

I can tell you that a week or two ago I found the unique lines in an 11,000-line file with awk in less than a second.
(I am not so concerned about time, as long as it is fast enough, given that it doesn’t run all the time.) The exception is what happens in the GUI: everything in the GUI should IMHO have happened yesterday.
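The awk command used isn’t shown here; one common way to do it in awk is the `!seen[$0]++` idiom. The same idea, sketched in Python with a hypothetical `unique_lines` helper:

```python
def unique_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


print(list(unique_lines(["a", "b", "a", "c", "b"])))
```

Because each line costs one set lookup, a file of 11,000 lines is a small job for this approach.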

ccstone · May 29, 2013, 8:43am

It eats those utilities for lunch on certain jobs.

For others the shell is the way to go.

A few months ago I parsed a quarter of a million lines of text with Perl in about 3 seconds.

You pick the tool for the job.

McUsrII · May 29, 2013, 9:52am

Hello.

Funny thing about sed is that it uses plain old regexps, with no backtracking like PCRE regexps do. I read a paper recently stating that the regexp engine in sed, awk, and familiar old tools can be over a million times faster than the one in languages like PHP, Java, Ruby, Perl and so on (given the right regexp, of course).

What was said made a lot of sense: because the algorithms are so primitive compared with the newer PCRE, the paper stated that they outperformed anything else, in any other language, within their problem-solving domain.

So, basically, if you need fast regexps you’d settle for sed, and not Perl or any other scripting language, unless you need the features that PCRE gives, like lookahead, lookbehind, and so on.

I’m not discussing this, just for your information. And, I found the link

(Thompson NFA would be the name of the algorithm that is used in sed, awk and grep.)
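To give a feel for why tracking a set of NFA states avoids the blow-up, here is a small sketch in Python. It is hand-built for one pattern family, `(a?){n}a{n}`, which is hard on backtracking matchers; it is not how sed, awk or grep are actually implemented, and the function names are made up for illustration.

```python
def epsilon_closure(states, n):
    """Add every state reachable without consuming input (skipping an optional 'a')."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        # States 0..n-1 belong to the optional part, so they may be skipped.
        if state < n and state + 1 not in closure:
            closure.add(state + 1)
            stack.append(state + 1)
    return closure


def matches(text, n):
    """True if text matches (a?){n}a{n} as a whole, tracking a set of NFA states."""
    current = epsilon_closure({0}, n)
    for ch in text:
        # Every state below 2n has an 'a' edge to the next state.
        moved = {s + 1 for s in current if ch == "a" and s < 2 * n}
        current = epsilon_closure(moved, n)
        if not current:
            return False
    return 2 * n in current


print(matches("a" * 30, 30))
```

Each input character is handled once against at most 2n + 1 states, so the work grows with the text length times the pattern size rather than exponentially.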

In all fairness, you have like 5 different problem domains, ending with natural grammar. Sed, awk and grep are capable of solving problems within the first one, which if memory serves me right is regular grammars; PCRE is capable of solving problems in context-free grammars, but from there on, it is an NP problem.

You can read about it here

The conclusion must be that sed, awk and the like are best in their little problem domain. From there onwards, it is better to use tools like Cocoa, PHP, Perl, Java, and so on that implement PCRE.

Having said all this, let me finish by saying that I am very happy that I am scripting in AppleScript, where I have an abundance of options for getting a good solution! And I see no general “right” solution really. I am not religious; I think most solutions are right given a fitting context.

ccstone · May 29, 2013, 10:35am

Thanks. I’ve had that article for some time. Awk could have done that big job (and perhaps faster than Perl), but my awk-fu wasn’t up to the task. Nevertheless I’m not going to sneeze at 3 seconds. An AppleScript solution took about 89 seconds IIRC.

I’m not going to stop continuing to learn Perl and perhaps Python or Ruby, but I will also continue to learn and use both sed and awk. I use them frequently (and grep), and I continue to improve my knowledge of the shell.

It’s good to have a broad tool-set.

McUsrII · May 29, 2013, 11:01am

It is good to use a language, that allows you to have and use a broad toolset.

As for Perl, I have written some stuff in Perl, but I can only guarantee it to work on my machine. I think Perl is pure hell when things like encoding and such don’t add up. And after having let the code lie for a while, I find it as readable as sed. By the way, if you have XQuartz installed, there is a debugger named ddd, which can debug Perl graphically. I haven’t used it yet, but I sure will the next time I sit down with Perl.

But for more complex regexp problems, I’ll use that or whatever else it takes to use PCRE.

By the way, awk works like sed: one line at a time gets sifted through patterns/nested patterns, which trigger actions the same way. Awk’s main benefits, IMHO, are that you can call other shell tools from within, that it has associative arrays, and that it has BEGIN and END blocks for preamble and postscript.

It also has command words that are more than one letter long!
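The processing model described above can be sketched in a few lines of Python. This is a loose illustration only: `run_rules` and `count_first_word` are hypothetical names, and awk features such as `next`, `getline` and field splitting are left out.

```python
import re


def run_rules(lines, rules, begin=None, end=None):
    """Apply (pattern, action) rules to each line, awk-style, sharing one state dict."""
    state = {}
    if begin:
        begin(state)                      # like a BEGIN block
    for line in lines:
        for pattern, action in rules:     # every matching rule fires, in order
            if re.search(pattern, line):
                action(state, line)
    if end:
        end(state)                        # like an END block
    return state


def count_first_word(state, line):
    # A dict plays the role of an awk associative array.
    word = line.split()[0]
    state["counts"][word] = state["counts"].get(word, 0) + 1


result = run_rules(
    ["apple 1", "pear 2", "apple 3", ""],
    rules=[(r"\S", count_first_word)],
    begin=lambda state: state.update(counts={}),
)
print(result["counts"])
```

The pattern `\S` plays the part of a rule that skips blank lines, so the empty last line never reaches the action.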

DJ_Bazzie_Wazzie · May 29, 2013, 12:25pm

Besides its better syntax, it has more built-in functions than sed, some versions of AWK are up to 4 times faster (GNU awk is one of the slowest awk versions), and you can use all kinds of expressions on a line to be executed, and they are allowed to overlap; sed works more like a pipeline.

McUsrII · May 29, 2013, 12:56pm

Hello.

I am sure there is a sed version out there with a byte-code runtime as well (LLVM) :D.

Ok, so you have syntax that is C-like, but not fully, and such. But the idea here is that awk parses input exactly the same way as sed does. You have the same pipeline, but you can manipulate it more with getline and such, and pull in lines from other files as well.

Awk is also a filter, at least in its middle clause! In the sense that the middle clause starts by getting input, on a line basis, and stops when there are no more lines to process. Aside from that, you can structure the script by regexps, using about the same range specifiers as you do in sed.

I just post this, to prove a point:

[code]#!/usr/bin/awk -f
# © McUsr 2012 And put into public domain under the license GPL 1.0
# I interpret GPL 1.0 as that you may not post it, have it in a script collection for public consumption, or a repository for a
# closed circle, or charge something by using this, or a development of this work into something of your own.
# Nor charge for a compilation of scripts or tips where this script is included. Sharing between friends are fine!
# Please refer to this link otherwise: http://macscripter.net/viewtopic.php?pid=159019#p159019
# Release 1.0.0
BEGIN { curLevel=0 ; haderror = 0 ; initialized=0; }
function perror() {
	# will work in filename.
	printf("preProcess.awk: Missing levels …\nBailing out!\n")
	haderror=1
	exit 1
}
function initSeq() {
	# the linelength, set large, to avoid indenting.
	print ".ll 30i" ;
	# organizing of "Paper size and such"
	print ".PGNH"
	# no headers
	print ".nr Ls " linefeedsForLevels
	# embed newlines around list-elms all 6 levels.
	# change this to contract newlines between blocks with a level.
	print ".nr Li " indentSpaces
	# indent a deeper level with 3 spaces.
	print ".nr PO 0i"
}
function startList( listLevel ) {
	printf(".AL %s\n", substr(outlineNumbering,listLevel,1))
}
$0 ~ /^#\ / {
	if (curLevel == 0) {
		if ( initialized == 0 ) {
			initSeq() ; initialized=1 ;
		}
		curLevel=1 ; startList(curLevel) ;
	} else if ( curLevel > 1 ) {
		while ( --curLevel > 1)
			print ".LE" ;
		print ".LE" ;
	}
}
$0 ~ /^##\ / {
	if (curLevel == 1) {
		curLevel=2 ; startList(curLevel) ;
	} else if ( curLevel > 2 ) {
		while ( --curLevel > 2 )
			print ".LE" ;
		print ".LE" ;
	} else if ( curLevel < 1 )
		perror() ;
}
$0 ~ /^###\ / {
	if (curLevel == 2 ) {
		curLevel=3 ; startList(curLevel) ;
	} else if ( curLevel > 3 ) {
		while ( --curLevel > 3 )
			print ".LE" ;
		print ".LE" ;
	} else if ( curLevel < 2 )
		perror() ;
}
$0 ~ /^####\ / {
	if (curLevel == 3) {
		curLevel=4 ; startList(curLevel) ;
	} else if ( curLevel > 4 ) {
		while ( --curLevel > 4 )
			print ".LE" ;
		print ".LE" ;
	} else if ( curLevel < 3 )
		perror() ;
}
$0 ~ /^#####\ / {
	if (curLevel == 4) {
		curLevel=5 ; startList(curLevel) ;
	} else if ( curLevel > 5 ) {
		print ".LE" ; curLevel=5 ;
	} else if ( curLevel < 4 )
		perror() ;
}
$0 ~ /^######\ / {
	if (curLevel == 5) {
		curLevel=6 ; startList(curLevel) ;
	} else if ( curLevel != 6 )
		perror() ;
}
$0 !~ /^##/ {
	if ( $0 !~ /^[ \t]$/ ) {
		if ( level == 0 && initialized == 0 ) {
			initSeq() ; initialized=1 ; curLevel=1 ; startList(curLevel)
		}
	}
}
{
	if ( $0 !~ /^[ \t]$/ ) { print ".LI" ; sub("^##\ ","") ; sub("^[ \t][ \t]*","") ; print $0 ; next }
}
END {
	if ( haderror == 0 ) {
		if ( curLevel > 0 ) {
			while ( --curLevel > 0 )
				print ".LE" ;
		} else { print "Nothing to do." }
	}
}[/code]
It is from A no frills text outliner for TextEdit in Code Exchange

Hindsightly

Maybe I should have stated that “Awk works like sed in its intrinsic behaviour”. But that intrinsic behaviour is what you’ll structure your script around, if you want it to be efficient with awk.

As a curiosity.

W. Richard Stevens, who wrote “Advanced Programming in the UNIX Environment”, recommended the version of awk that ships with Mac OS X. I think he did that not out of a quest for speed, but for robustness and compatibility with other awks; gawk (GNU Awk) was second on his list, and nawk at the bottom.

DJ_Bazzie_Wazzie · May 29, 2013, 1:53pm

[strikethrough]Are you sure it was nawk? Because nawk is the version that ships with Mac OS X, which means it’s some sort of paradox.[/strikethrough] McUsr has alerted me that there are two implementations of awk with the name nawk, but they aren’t the same. The book he read meant another nawk, so we can use the other name of the nawk in Mac OS X: BWK (Brian W. Kernighan) awk.

The input is not the same: the input of awk is based on a logical expression, whereas sed is based on a regular expression.

edit: Thanks for the headsup McUsr

McUsrII · May 29, 2013, 2:33pm

Hello.

This was written after OS X 10.3 was released, and at least then, the version of awk that shipped with OS X was the “new one” from Bell Labs (later Lucent Technologies).

So I wonder when they changed. I have never found any extras, nor any misbehaviour in awk, giving me any reason to suspect that they had shipped something else. Furthermore, I thought nawk to be network-awk, and I have seen no signs of sockets, or anything else that would help.

I think one reason for the confusion is that I didn’t know that nawk was the same as that version.
But see for yourself; below is a word-for-word passage from the book:

Here is a quote from the book:

This was awk-ward! But I am not at fault here. (For once!)

kel1 · May 29, 2013, 10:54pm

Hi Chris,

I’m quite sure I’ve seen chrono in Smile, but can’t find it now in the dictionary. Strange. I used to use it a lot before too. Used to like the UI creator thing, but now there’s Xcode. Anyway, I retimed the script with ‘offset’ and the time is much lower now for some reason. Ran it with ‘run script’ from the Script Editor:

→ {0.52999997139, {"Var0 : Added this with a return.", "Var1 : Lorem ipsum dolor sit amet, consectetur adipisicing elit", "Var2 : And it was 11 o'clock, which was time for a little something", "Var3 : It's over 9000!", "Var4 : Some Var5 text.", "Var5 : More variables.", "Var100 : abc123."}}

Strange huh.

The reason I used the Python timer was that it can be modified (calibrated) to account for the delay from ‘do shell script’. Also, with Python’s ‘popen’, it can time shell scripts as well as AppleScripts with ‘osascript’. There are other Unix utilities that can do this, but I like the precision (can’t think of the word now for decimal places).
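The timer script itself isn’t shown here, so the following is only a guess at the general shape of such a calibrated timer, using Python’s `subprocess` module. The names `time_command` and `calibrated_time`, and the do-nothing baseline command, are assumptions for illustration.

```python
import subprocess
import sys
import time


def time_command(args, repeats=5):
    """Return the best wall-clock time in seconds for running `args`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(args, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def calibrated_time(args, baseline_args, repeats=5):
    """Time `args` and subtract the cost of launching a do-nothing baseline."""
    overhead = time_command(baseline_args, repeats)
    return max(0.0, time_command(args, repeats) - overhead)


# Hypothetical usage: the baseline starts Python and exits at once.
baseline = [sys.executable, "-c", "pass"]
job = [sys.executable, "-c", "sum(range(10**6))"]
print("%.4f s" % calibrated_time(job, baseline))
```

The baseline should launch the same kind of process as the job, so that what gets subtracted is launch overhead and not real work.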

It’s fun experimenting with the timings, huh.

gl,
kel

ccstone · May 29, 2013, 11:35pm

It’s in the Smile application class. Search works in the dictionary, and of course you can use the Applescript Editor’s dictionary viewer. (I use Script Debugger me’self.)

I rather like Python for its very clean coding style. At the moment I’m learning Perl, but I’ll tackle Python or Ruby next.

Yes!

Shane_Stanley · May 29, 2013, 11:40pm

It is – as long as you realise that most of those digits are just eye candy for geeks. AppleScript has too many variables to make the figures usable for anything other than gross comparisons.

kel1 · May 30, 2013, 3:32am

Hi Shane,

Yes, makes me feel more like a scientist! And, love the name Python also! Python, python, python …

gl,
kel

McUsrII · May 30, 2013, 5:25am

Hello.

I use timings to figure out proportions between different solutions, and as such timings are great. But you must realize that you can’t run a nuclear facility from AppleScript. First of all you may have congestion in System Events or Finder, then there are scheduled system tasks, then there is the varying load on the system, and then there are apps that block a lot of the system queues.

For those wanting to work with Perl, I urge them to set up XQuartz with ddd and the Perl debugger. There is also Affrus, from Late Night Software, for debugging Perl, which is probably much better.
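One way to keep such noise from skewing a comparison is to repeat the measurement and look at the minimum or the median instead of a single run. A small sketch in Python; `sample_timings` and `summarize` are hypothetical helpers, and the sorted-range workload is just a stand-in for the code being timed:

```python
import statistics
import time


def sample_timings(func, runs=15):
    """Call func `runs` times and return the individual wall-clock times."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Report figures that are less sensitive to one-off system hiccups."""
    return {
        "min": min(times),
        "median": statistics.median(times),
        "max": max(times),
    }


summary = summarize(sample_timings(lambda: sorted(range(100000), reverse=True)))
print({key: round(value, 4) for key, value in summary.items()})
```

A large gap between the minimum and the maximum is itself a hint that something else was competing for the machine during the run.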

A debugger is always handy, for inspecting things, and key to saving time. And it is most useful, at the start of learning something new, that is my experience. At least hindsightly.

You can always insert log statements and print out stuff, but you only log where in your code you know or anticipate the error to be. Sometimes the error isn’t there, and then you spend hours trying to figure it out, whereas stepping through your code could have revealed the malfunctioning piece in just a few minutes.

kel1 · May 30, 2013, 8:12am

Hi,

Just found out that Apple has already implemented automatic return to linefeed in strings. However, you can still concatenate ‘return’ or embed “\r”. Good to know. All this time, I’ve been using Shift + Return for linefeeds.

gl,
kel