Grep and double Backslash in do shell script

Hi,

I’m no expert in grep. I found a small grep pattern that can extract all email addresses from a text file.
To use it in a do shell script I need to escape the backslashes, so the short script should be:

set myText to "hello john@yahoo.com, steve@apple.com - "
set shellResult to "echo " & quoted form of myText & " | grep -EiEio '\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b'"

But it returns no data.

If I copy and paste it into Terminal, changing \\b into \b and \\. into \., it all works, returning two email addresses.

How can I solve this?

Hagi

Hello.

It works if you put in a do shell script command before echo, like this.

set myText to "hello john@yahoo.com, steve@apple.com - "
set shellResult to do shell script "echo " & quoted form of myText & " | grep -EiEio '\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b'"

Hi,

My previous post was not exact. The correct version is this:

set myText to "hello john@yahoo.com, steve@apple.com - "
set shellResult to do shell script "echo " & quoted form of myText & " | grep -EiEio '\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b'"

Like you suggested. Anyway, it returns no data.
The trouble seems to be the double backslash required to compile the script.

The same string (without the double backslashes) executed in Terminal returns two email addresses.

Hagi
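
A note on the doubled backslashes: the AppleScript string literal is one escaping layer, so \\b in the source reaches the shell as \b, and the shell's single quotes then pass it on to grep untouched. The sketch below shows the analogous effect inside the shell itself, using double quotes as a stand-in for the AppleScript layer. That is only an analogy assumed for illustration, and the variable names are made up.

```shell
# Single quotes: the shell keeps every backslash exactly as typed.
single_quoted='\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

# Double quotes: \\ collapses to one backslash, much like "\\b" in an
# AppleScript string literal is \b by the time the shell sees it.
double_quoted="\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b"

# printf '%s\n' is used instead of echo, because some echo
# implementations interpret backslash sequences themselves.
printf '%s\n' "$single_quoted"
printf '%s\n' "$double_quoted"

if [ "$single_quoted" = "$double_quoted" ]; then
    layers_agree=yes
else
    layers_agree=no
fi
printf 'same pattern after unescaping: %s\n' "$layers_agree"
```

If both lines print identically, the doubled backslashes are not what changes the pattern; each layer simply removes one level of escaping.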

Hello Hagi.

Well, I guess you are compiling the regexp yourself then, or escaping it programmatically, because the script (your latest version) works fine as it is for me. I am getting this result:

"john@yahoo.com steve@apple.com"
Please try a

set mygrep to do shell script "which grep"

from AS, and which grep from the Terminal, to ensure that you invoke the same grep.
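
To make that comparison concrete, the same few lines can be run once in Terminal and once through do shell script, and the two outputs compared. This is only a sketch; the variable names are illustrative, and the wording of the version line differs between GNU and BSD grep.

```shell
# Which grep does this shell resolve, and with which PATH?
grep_path=$(command -v grep)
printf 'PATH: %s\n' "$PATH"
printf 'grep resolves to: %s\n' "$grep_path"

# First line of the version banner; stays empty if this grep
# does not understand --version.
grep_version=$(grep --version 2>/dev/null | head -n 1 || true)
printf 'version: %s\n' "${grep_version:-unknown}"
```

If the path or the version line differs between the two environments, the two runs are not using the same grep.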

Hi McUsr,

which grep in Terminal returns /usr/bin/grep

So I modified the script, adding the full path of the Unix commands:

set myText to "hello john@yahoo.com, steve@apple.com - "
set shellResult to do shell script "/bin/echo " & quoted form of myText & " | /usr/bin/grep -EiEio '\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b'"

But again it returns no result.

Hagi

It doesn’t work on my machine either. It’s because the character ranges are all uppercase, so using the following code should work.


set myText to "hello john@yahoo.com, steve@apple.com - "
set shellResult to do shell script "echo " & quoted form of myText & " | grep -Eo '\\b[a-zA-Z0-9]+@[a-zA-Z0-9.-]+[a-zA-Z]{2,4}\\b'"

edit: Type ‘grep --version’ in the Terminal… if it’s 2.5.1, it’s a known bug.

Hello. I also use /usr/bin/grep, and it works?

When I do grep --version I get 2.5.1 … GNU; that may be why it works for me.

And Hagi does have an -i switch in there to ignore case. I can make it work with just grep -Eio, by the way.

If the command works in Terminal but not in do shell script, then I think it is something else!

Try restarting your editor, and System Events, just to be sure. :slight_smile:

Like I said, it’s a bug in grep 2.5.1 (which is of course also GNU grep). The location doesn’t really matter; it’s the version that causes the bug. There is a patch file, and maybe McUsr has applied that patch, installed the GNU essentials package, or installed software like MacPorts or Fink, which fixes the bug as well. The bug is that the ‘ignore case’ option loses its meaning when the -o option is used. In other words, this is a known bug in Snow Leopard and early versions of Lion.

edit
After some searching, it seems the problem is that matching the input is case insensitive while producing the output (option -o) is case sensitive. That is why you get no results and still no error; normally grep returns an error when there are no results. It means that grep finds the matches but doesn’t print them.

I recommend you use the code I’ve posted, because then you’re sure it works from Tiger to Mountain Lion.
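
One way to check whether a given grep is affected is to run the pattern both ways and compare the output. A minimal sketch, using the sample text from this thread; the variable names are illustrative. On an unaffected grep both variants should print the same two addresses.

```shell
text='hello john@yahoo.com, steve@apple.com - '

# Variant 1: uppercase-only classes, relying on -i together with -o.
with_i=$(printf '%s\n' "$text" |
    grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b')

# Variant 2: no -i; both cases are spelled out in the classes.
without_i=$(printf '%s\n' "$text" |
    grep -Eo '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b')

if [ "$with_i" = "$without_i" ]; then
    verdict=same
else
    verdict=different
fi
printf 'with -i:\n%s\nwithout -i:\n%s\nverdict: %s\n' \
    "$with_i" "$without_i" "$verdict"
```

A verdict of "different" with an empty first result would be consistent with the -i/-o problem described above.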

I have indeed installed software from MacPorts, so I think I may have the patched version.

(I wasn’t aware that they change utilities residing in /usr/bin.)

But it seems like a solution is nearby then: if this is the cause, install something from MacPorts.org that upgrades your grep.

But the OP made this work from the command line, and that I can’t fathom: why it works from the command line and not from do shell script.

Another solution would be to use sed, and you can really do without extended regexps here. \b can be written as [[:<:]] and [[:>:]] respectively (though they are harder to type).

The plus has a counterpart in \{1,\}, and ? has a counterpart in \{0,1\}.
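
The equivalence between the extended quantifiers and the basic interval forms can be checked directly with grep, which uses basic regexps without -E. A small sketch with a made-up sample string; the word-boundary forms [[:<:]] and [[:>:]] are left out here because they are not available in every regex library.

```shell
sample='color colour colouur'

# ERE (grep -E): + and ? are available directly.
ere_plus=$(printf '%s\n' "$sample" | grep -Eo 'colou+r')
ere_opt=$(printf '%s\n' "$sample" | grep -Eo 'colou?r')

# BRE (plain grep, and sed without -E): the same quantifiers as intervals.
bre_plus=$(printf '%s\n' "$sample" | grep -o 'colou\{1,\}r')
bre_opt=$(printf '%s\n' "$sample" | grep -o 'colou\{0,1\}r')

printf 'one or more u:\n%s\n' "$bre_plus"
printf 'zero or one u:\n%s\n' "$bre_opt"
```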

I think calling extended regexps “readable and groupable” instead of “extended” would clarify their purpose better. :smiley:

So far it seems like an ugly thing to write in sed. :slight_smile:

How about trying with egrep? Or some other grep found with which -a grep?

Hello.

Not a beauty, nor an efficiency contest contender, but it should work


set mres to do shell script "echo \"hello john@yahoo.com, steve@apple.com - \" |sed 's/\\([ ,]\\)/\\
/g' | sed -n 's/^[[:<:]][a-zA-Z0-9._%+-]\\{1,\\}@[a-zA-Z0-9.-]\\{1,\\}\\.[a-zA-Z]\\{2,4\\}[[:>:]]/&/p'"
--> john@yahoo.com
--> steve@apple.com


Edit

This would be more correct if I first took a copy of sed’s pattern space and put it into the hold buffer, with a label in front; then found the pattern, printed it, and branched to a label further down; got the hold buffer back, deleted the pattern I had found, copied the rest back into the hold buffer, and branched back to the start; and, if I didn’t find anything, fetched the next line of input into the pattern space and branched to the start. Then I could do it all with one sed command, but I think what is below is more appropriate. :slight_smile:

set mailAddreses to do shell script "echo \"hello john@yahoo.com, steve@apple.com - \" |tr ' ,' '\\
' | sed -n 's/^[a-zA-Z0-9._%+-]\\{1,\\}@[a-zA-Z0-9.-]\\{1,\\}\\.[a-zA-Z]\\{2,4\\}/&/p'"
--> john@yahoo.com
--> steve@apple.com
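
The same split-then-filter idea can also be tried directly in the shell. The sketch below swaps in grep -x (whole-line match) after tr, which is not what the script above uses; it is shown only because it avoids -o altogether. The function name extract_addresses is made up for illustration.

```shell
# Turn every space or comma into a newline (squeezing repeats) so each
# candidate token sits on a line of its own, then keep only the lines
# that are an address from start to end.
extract_addresses() {
    tr -s ' ,' '\n\n' |
        grep -Ex '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}'
}

printf '%s\n' 'hello john@yahoo.com, steve@apple.com - ' | extract_addresses
```

Because each token is matched as a whole line, no word-boundary anchors are needed, though tokens wrapped in other punctuation (angle brackets, for instance) would need extra delimiters in the tr set.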

Hi,

Thanks to all for investigating the problem.

For DJ:

if I use your version:

set myText to "hello john@yahoo, steve@apple.com - "
set shellResult to do shell script "echo " & quoted form of myText & " | grep -Eo '\\b[a-zA-Z0-9]+@[a-zA-Z0-9.-]+[a-zA-Z]{2,4}\\b'"

When the string contains an @ but is not a valid email address (for example john@yahoo,), it is still extracted (it should be skipped).
Can the grep be adjusted to extract only valid email addresses?

Hagi

I changed my version (the last one in the post above) to use sed, so as to get rid of any trailing dot.

And thanks for making me aware of the usefulness of grep’s -o option! :slight_smile:

I fixed DJ’s version for you, Hagi; it seems the dot within the brackets isn’t interpreted as it should be.


set myText to "hello john@yahoo, steve@apple.com - "
set shellResult to do shell script "echo " & quoted form of myText & " | grep -Eo '\\b[a-zA-Z0-9]+@[a-zA-Z0-9-]+\\.[a-zA-Z]{2,4}\\b'"
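
The difference between the two patterns can be seen side by side in the shell. A minimal sketch using the sample text from the posts above; the variable names loose and strict are made up for illustration.

```shell
text='hello john@yahoo, steve@apple.com - '

# No literal dot is required: letters alone can satisfy the tail
# of the pattern, so john@yahoo slips through.
loose=$(printf '%s\n' "$text" |
    grep -Eo '\b[a-zA-Z0-9]+@[a-zA-Z0-9.-]+[a-zA-Z]{2,4}\b')

# A mandatory \. before the final letters rejects john@yahoo.
strict=$(printf '%s\n' "$text" |
    grep -Eo '\b[a-zA-Z0-9]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}\b')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```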

Not quite. :slight_smile:

set myText to "Fred.Jones@fibble.co.uk
applescript-users@lists.apple.com
rhubarb
hello <john@yahoo.com>, steve@apple - "

set shellResult to do shell script "echo " & quoted form of myText & " | grep -Eo '\\b[a-zA-Z0-9]+@[a-zA-Z0-9-]+\\.[a-zA-Z]{2,4}\\b'"

-->
(*"Jones@fibble.co
john@yahoo.com"*)

Should be (according to my favourite regex tutorial):

set shellResult to do shell script "echo " & quoted form of myText & " | grep -Eo '\\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,4}\\b'"

Or:

set shellResult to do shell script "echo " & quoted form of myText & " | grep -Eo '\\b[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,4}\\b'"
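
For anyone who wants to try both patterns outside AppleScript, here is a small shell sketch on the same four-line sample. The variable names are illustrative; both variants should give the same three addresses.

```shell
text='Fred.Jones@fibble.co.uk
applescript-users@lists.apple.com
rhubarb
hello <john@yahoo.com>, steve@apple - '

# Explicit ranges.
explicit=$(printf '%s\n' "$text" |
    grep -Eo '\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b')

# POSIX character classes.
posix=$(printf '%s\n' "$text" |
    grep -Eo '\b[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,4}\b')

printf '%s\n' "$explicit"
```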

I was aware that I didn’t make a general solution; a general regexp for parsing mail addresses can be found here. :slight_smile:

Then the inevitable question!

What is your favourite regexp tutorial, Nigel? I have not found any single one that is my favourite, and I am interested in having a look at something new.

I have picked up a couple of tricks though, and I know you have done the same. The greatest of them is the anti-pattern trick to stop “greediness”: for instance, searching for ([^ ]+)[ ] ensures that you don’t get the space into the current search group, as the problem is often to limit what you get.
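
The effect of the negated class is easy to see with sed -E on a made-up line. A small sketch; the variable names are illustrative.

```shell
line='alpha beta gamma'

# Greedy: (.+) swallows as much as it can before the last usable space.
greedy=$(printf '%s\n' "$line" | sed -E 's/^(.+)[ ].*/\1/')

# Bounded: ([^ ]+) cannot cross a space, so the group stops at the first word.
bounded=$(printf '%s\n' "$line" | sed -E 's/^([^ ]+)[ ].*/\1/')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```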

The other trick is to write down the regexps once they work, and describe how they work and what they do. :slight_smile:

Edit

It seems to me that the dot (.) must be treated outside of any character class to make it work with grep, at least under those circumstances.


set myText to "Fred.Jones@fibble.co.uk
applescript-users@lists.apple.com
rhubarb
hello <john@yahoo.com>, steve@apple - "

set shellResult to do shell script "echo " & quoted form of myText & " | grep -Eo '\\b([a-zA-Z0-9-]|\\.)+\\b@([a-zA-Z0-9]|\\.)+\\b\\.[a-zA-Z]{2,4}\\b'"

-->
(* "Jones@fibble.co.uk
applescript-users@lists.apple.com
john@yahoo.com"*)

I think you may know it: http://www.regular-expressions.info/

Not necessarily. The two lines I posted above give the correct result.

Thank you for the link, and I didn’t.

I have, on the other hand, RegExpEditor, which is nowhere near RegExpBuddy for the Windows platform, and even less capable than the app you can buy for two dollars in the App Store. Someone able and agile could really rake in some cash by making something like RegExpBuddy for the Mac!

RegExpEditor is free and I believe downloadable from either MacPorts or freedesktop.org, but you have to have XQuartz installed.

Maybe I should have stated “in brackets”, because the dots are outside those in your regexps as well.

Edit

I got it! When you need a mandatory element, then that element can’t be put inside a character class. Silly of me. :slight_smile:

It can be if it’s the only member of the class. :slight_smile:
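
A quick shell check of that point, on a made-up sample: a class with the dot as its only member behaves like an escaped dot, while a bare dot matches any character.

```shell
sample='a.b axb'

any_char=$(printf '%s\n' "$sample" | grep -Eo 'a.b')      # bare dot: any character
one_member=$(printf '%s\n' "$sample" | grep -Eo 'a[.]b')  # class with a single member
escaped=$(printf '%s\n' "$sample" | grep -Eo 'a\.b')      # escaped dot

printf 'bare dot:\n%s\n' "$any_char"
printf 'single-member class:\n%s\n' "$one_member"
```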

Nigel can read this better than Neo can… :wink:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

True, but the character classes in question had more than one member, so a match of any particular character is “optional”. When a character class has just one member and no “modifier” (like * or ?), a match of that character is mandatory.

Well, I’m off for the rest of the evening, seeing if I can make a sed script, that first extracts like the grep -o, so I have a usable loop construct for inline editing of several matches. :slight_smile:

And I think the regexp above this post is computer generated.

By the way: a good way to figure out how to program with sed is to look at it as a repeat loop iterating over every line of input. You then have two variables, the hold space and the pattern space; the input, which is the lines that are fed in; the output, which is what is printed to stdout; and the command list, which is run for each line of input. So there is implicitly one iteration per input line, unless you have branches (loops) in your command list. You can picture the input, output, hold space and pattern space as shoe boxes or whatever, if it helps you.

Then there are the commands, that should be fairly easy to grasp, with the mental tool just provided, because sed’s vocabulary is so blissfully small! :slight_smile:
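
As a small illustration of the loop model described above, here is the classic line-reversal one-liner, where the hold space carries state from one iteration to the next. It is a sketch chosen for brevity, not something from the scripts in this thread, and the function name reverse_lines is made up.

```shell
reverse_lines() {
    # -n  : print nothing unless told to
    # 1!G : on every line but the first, append the hold space
    #       to the pattern space
    # h   : copy the pattern space into the hold space
    # $p  : on the last line, print the pattern space
    sed -n '1!G;h;$p'
}

printf 'one\ntwo\nthree\n' | reverse_lines
```

Each input line is one pass through the command list: the pattern space holds the current line, and the hold space accumulates everything seen so far in reverse order.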