Using grep & regex for pattern matching in CSV lines

RAE3000 · September 12, 2016, 7:35pm

Hello fellow AppleScripter and RegexPros,

I am just starting to learn more about Regex and I am a little stuck with matching a pattern.
I have more than 5000 lines of a csv. All columns are delimited by tab.

example row

A a year (4 digits long)
B a number or number & characters including whitespace various length
C a number or number & characters including whitespace various length
D a number or number & characters including whitespace various length

How can I just find/match C (basically everything between 2nd tab & 3rd tab)?

I want to get the linenumber of a matching row and the contents of the row itself using grep by matching Column C by a Pattern.
The searchstring will be dynamic and changed. Since the seachstring can also appear in Column D i have to be sure it gets searched only in column C

Here is what I came up with so far:


set searchstring to "Pk 155"
set pattern to  ""  -here i need the RegEx for matching
set matching_line to do shell script "grep" ___________  

--grep -n pattern file.txt | cut -d : -f 1
set linenumber to do shell script "grep -n"  & space & pattern &  space & quoted form of myFile & space & "| cut -d : -f 1"

I think I have to use parentheses somewhere but wherever I put them I dont get the right stuff(in the examplerow just “Pk 155”)

Hope I explained it so you can follow me.

DJ_Bazzie_Wazzie · September 12, 2016, 11:49pm

Assuming only the pattern is important here an example code using AppleScript Toolbox.osax. The pattern can be used on the command line. Keep in mind the pattern makes use of extended regex which means you have to use the command egrep or use the option -e on the command line.


set fileContents to "A	B	C	D	E	F	G
2016	5671	Pk 155	5897 A	This is 1 Example	1,3	16,5:18
2016	pk 155	Pk 176	5897 A	This is 1 Example	1,3	16,5:18
2016	8001	Pk 155	5897 A	This is 1 Example	1,3	16,5:18
2016	5671	Pk 180	pk 155	This is 1 Example	1,3	16,5:18"

set searchString to "Pk 155"
set pattern to "^(?:[^	]+)	(?:[^	]+)	(" & searchString & ")	.*$"
set matches to AST find regex pattern in string fileContents with newline sensitivity

As you can see I have added multiple rows with the search string on different rows in different fields to show it is working correctly.

p.s. When using the command line, always quote every argument unless static. Therefore always quote patterns as well.

Nigel_Garvey · September 13, 2016, 9:11am

Hi.

The ‘cut’ command by itself is actually probably the simplest to code:

set sourceText to "A	B	C	D	E	F	G
2016	5671	Pk 155	5897 A	This is 1 Example	1,3	16,5:18"

do shell script "cut -f 3 <<<" & quoted form of sourceText -- 3rd column in each line. (Tab is 'cut''s default delimiter.)

But if you’re interested in regex, there are various ways. The ‘grep’ command isn’t particularly suited to what you want to do (it’s defeated me, anyway ;)) but ‘sed’ is usable:

set sourceText to "A	B	C	D	E	F	G
2016	5671	Pk 155	5897 A	This is 1 Example	1,3	16,5:18"

do shell script "sed -E 's/^([^[:cntrl:]]*[[:cntrl:]]){2}([^[:cntrl:]]+).+/\\2/' <<<" & quoted form of sourceText

-- Or with literal tabs instead of [:cntrl:] :
-- do shell script "sed -E 's/^([^	]*[	]){2}([^	]+).+/\\2/' <<<" & quoted form of sourceText

If you’re not familiar with ‘sed’, it edits linefeed-delimited text line by line. Its ‘s’ command performs substitutions in each line like this:

The -E option is for “extended Regex”. It allows the ‘{2}’ and ‘+’ operators to be recognised and is ‘sed’'s idea of advanced regex.

The search regex itself breaks down as:

^ Starting at the beginning of each line .
b{2}[/b] . match two sequences of (zero or more non-control characters followed by a control character) .
b[/b] . then match and memorise the run of zero or more non-control characters after that (ie. the contents of the third column) .
.+ . and the rest of the line.

The \2 replaces the entire match with the particular match of the regex in the second pair of parentheses. This happens in every line except where the regex can’t be matched.

Requested explanation: The “<<<” operator is a facility offered by the bash shell for feeding a literal string to the standard input of the command with which it’s used. bash’s man file calls it a “Here String”. In the scripts in this post, it’s used to inject the value of sourceText. A Here String can be used on either side of the command. I think it makes more sense on the left, since it’s for input, but I’ve put it on the right here as I’ve been taken to task in the past by purists who insist that’s where it should go.

There’s a related “<” operator for “Redirecting Input”. The parameter for this is a POSIX path and the contents of the file to which the path refers are what’s fed to the standard input. And there’s also “>” for redirecting output to a file. Neither are used here.

RAE3000 · September 13, 2016, 11:36am

Hi guys,

thanks for the replies.

@ Dj Bazzie Wazzie

The Applescipt Toolbox looks interesting, but requires 10.9. I am still on 10.6.8 and could not find if there is a toolbox supported for this OS.

If I test the pattern without and write

set match_line to do shell script "grep -e ^(?:[^    ]+)    (?:[^    ]+)    (" & searchString & ")    .*$"

I get an error
error “sh: -c: line 0: syntax error near unexpected token (' sh: -c: line 0: grep -e ^(?:[^ ]+) (?:[^ ]+) (154x) .*$'” number 2

@ Nigel Garvey

I used some basic sed in one script. But I don’t want to edit the file I just need the contents of the matching line and the linenumber of the matching line.
In your cut command how and where would I specify the searchterm? eg PK 156 or 134 A

If I do

set a to do shell script "sed -E 's/^([^[:cntrl:]]*[[:cntrl:]]){2}([^[:cntrl:]]+).+/\\2/' <<<" & quoted form of sourceText

I get the error “The command completed with a non-zero result.” number 255

DJ_Bazzie_Wazzie · September 13, 2016, 2:21pm

I wouldn’t recommend to use AppleScript Toolbox on systems lower than 10.8. The site does say 10.9 because that is the OS version I wrote AST in. There are CF functions used that are not supported in 10.8 or lower so a good choice not using it.

In the footnote of my previous post I pointed out that you need to quote arguments on the command line, so the pattern needs to be quoted.

set fileContents to "A	B	C	D	E	F	G
2016	5671	Pk 155	5897 A	This is 1 Example	1,3	16,5:18
2016	pk 155	Pk 176	5897 A	This is 1 Example	1,3	16,5:18
2016	8001	Pk 155	5897 A	This is 1 Example	1,3	16,5:18
2016	5671	Pk 180	pk 155	This is 1 Example	1,3	16,5:18"

set searchString to "Pk 155"
set pattern to "^(?:[^	]+)	(?:[^	]+)	(" & searchString & ")	.*$"
set matches to do shell script "egrep " & quoted form of pattern & " <<<" & quoted form of fileContents

Nigel_Garvey · September 13, 2016, 3:32pm

The scripts I posted both edit text, not a file.

But I was answering the question “How can I just find/match C (basically everything between 2nd tab & 3rd tab)?” I didn’t notice the contradictory bit about wanting the entire line.

If you’re just after the line containing “Pk 155” and its number, this would do it:

set sourceText to "A	B	C	D	E	F	G
2016	5671	Pk 155	5897 A	This is 1 Example	1,3	16,5:18"

set searchString to "Pk 155"

set searchResult to (do shell script "grep -n " & quoted form of searchString & "  <<<" & quoted form of sourceText & " | sed 's/:/\\'$'\\n''/'")
if ((count searchResult) > 0) then
	set {lineNumber, wholeLine} to paragraphs of searchResult
else
	set {lineNumber, wholeLine} to {"", ""}
end if

If it’s important that the search term should only be matched in a particular column:

set sourceText to "A	B	C	D	E	F	G
2016	5671	Pk 155	5897 A	This is 1 Example	1,3	16,5:18"

set searchString to "Pk 155"

set searchResult to (do shell script "egrep -n " & quoted form of ("^([^	]*	){2}" & searchString & tab) & "  <<<" & quoted form of sourceText & " | sed 's/:/\\'$'\\n''/'")
if ((count searchResult) > 0) then
	set {lineNumber, wholeLine} to paragraphs of searchResult
else
	set {lineNumber, wholeLine} to {"", ""}
end if

The shell scripts should return all results if there are more than one, but the AppleScript ‘if’ statement I’ve put after them is only set up to catch the first.

RAE3000 · September 13, 2016, 4:08pm

Hi Guys,

thanks for all your support I really appreciate it and learn new things! Command line and Regex is powerfull but sometimes hard to grasp.

@DJ
If I use the copy paste from your Script unfortunately I only get

error “The command completed with a non-zero result.” number 1

no match for “Pk 155” which is definetly in the sourcetext.

@Nigel

I tried both your examples. It is important that the match is only searched for in column C, since it can also appear in column D which would be a false positive.
Somehow the 2nd script is not finding anything although the string is in Column C. Is there a problem wih the pattern?
A tab B tab C tab D C should match

If have searchstrings like:

Br 13
Br 152X
115 ii
Br 123 A

but since I use the exact same string match it against the pattern there should not matter, or ?

Nigel_Garvey · September 13, 2016, 5:08pm

It works as posted. Are you clicking the “Open this Scriplet in your Editor:” link to open it in Script Editor, or are you copy/pasting from the Web page? It works the first way, but the tabs are turned to various numbers of spaces the second.

DJ_Bazzie_Wazzie · September 13, 2016, 5:34pm

For the record: It is because HTML doesn’t support tabs so they are spaces in the presentation. Just click the “open this scriptlet in your editor” link so tabs are preserved.

ccstone · September 14, 2016, 1:26am

Hey Folks,

The following awk script will return the found line preceded by the found line number and a tab:

[format]5 2016 5671 Pk 180 pk 155 This is 1 Example 1,3 16,5:18
[/format]

Do NOT use copy to capture the scripts, because literal tabs will NOT be copied “ use the “Open this Scriplet in your Editor:” link.

Using DJ’s 5 line test text in a file to test with.

(Change the file path to work properly with your test file.)

Basic shell script:


FILE=~/Downloads/test.txt
awk '/^[^	]+	[^	]+	+Pk 180/{ print NR "	" $0}' "$FILE"

NOTE: The findPattern variable is really the literal text to be found in column 3.

Shell script with find-pattern and file separated out:


findPattern='Pk 180';
FILE=~/Downloads/test.txt;
shCMD="awk '/^[^	]+	[^	]+	+""$findPattern""/{ print NR \"	\" "'$0'"}' \"$FILE\""
eval "$shCMD"

Shell script with find-pattern and file separated out converted to AppleScript:


set shCMD to "
findPattern='Pk 180';
FILE=~/Downloads/test.txt;
shCMD=\"awk '/^[^	]+	[^	]+	+\"\"$findPattern\"\"/{ print NR \\\"	\\\" \"'$0'\"}' \\\"$FILE\\\"\"
eval \"$shCMD\"
"
set theResult to do shell script shCMD

–
Chris

{ MacBookPro6,1 · 2.66 GHz Intel Core i7 · 8GB RAM · OSX 10.11.6 }
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

RAE3000 · September 14, 2016, 10:21am

I had some problems with opening the Scriplets form Firefox. Switching to Safari fixed this issue.

Now I could check all the codes. Thanks a lot they all work great.

But it seems if I read in my file the way I do the tabs go missing (?, i think) and nothing is found anymore.

My code for reading in the file


set theCSV to "/Users/RAE/Desktop/TEST.csv"

set f to read theCSV 
--skip first line with headers
set all_records to text (paragraph 2) thru -1 of f

In the script from ccstone it works fine with the file, but my way of reading the contents of the file for using DJs & NGs code with my CSV file seems not to work.

Nigel_Garvey · September 14, 2016, 11:23am

It’s hard to tell without seeing your actual file. With the ‘read’ command, you have to know the format of the text in the file ” eg.ASCII, UTF-8, UTF-16 ” in order for it to be interpreted correctly. Some shell programs only recognise linefeeds as line separators.

Given what you seem to want to do, it may be better to have the file read directly by the shell script. For instance, modifying my script:

set sourceFile to POSIX path of (choose file)
set searchString to "Pk 155"

-- The first 'sed' reads the file and cuts the first line (provided that the line delimiters are linefeeds). The 'grep' locates the required line(s) and prepends the line numbers. The second 'sed' splits each numbered line into number and line.
set searchResult to (do shell script "sed -n '2,$ p' " & quoted form of sourceFile & " | grep -n " & quoted form of searchString & " | sed 's/:/\\'$'\\n''/'")
if ((count searchResult) > 0) then
	set {lineNumber, wholeLine} to paragraphs of searchResult
else
	set {lineNumber, wholeLine} to {"", ""}
end if

RAE3000 · September 14, 2016, 12:19pm

The file was exported from a OpenOffice Calc Document.
Encoded utf8.

@NG
Just tried your script with the file. Works like a charm.

DJ, Nigel, ccstone thank you all so very much. This solved a big problem I had with the script.
And I am sure I will use your snippets it in more scripts to come.

Nigel_Garvey · September 14, 2016, 1:57pm

Hey RAE3000.

For some reason, I adapted the general-search script. Here’s the third-column-only version:

set sourceFile to POSIX path of (choose file)
set searchString to "Pk 155"

set searchResult to (do shell script "sed -n '2,$ p' " & quoted form of sourceFile & " | egrep -n " & quoted form of ("^([^	]*	){2}" & searchString & tab) & " | sed 's/:/\\'$'\\n''/'")
if ((count searchResult) > 0) then
	set {lineNumber, wholeLine} to paragraphs of searchResult
else
	set {lineNumber, wholeLine} to {"", ""}
end if

Many apologies.

RAE3000 · September 14, 2016, 3:07pm

@NG
ah thats why I had false positives. Thanks for fixing that.

My script just got way faster and also more flexible.

Nigel_Garvey · September 14, 2016, 4:13pm

Done.

ccstone · September 17, 2016, 9:56pm

A simplified version of the AppleScript


set findStr to "Pk 180"

set filePath to "~/Downloads/test.csv"
tell application "System Events" to set filePath to POSIX path of disk item filePath

# Read the data.
set fileContent to quoted form of (read filePath as «class utf8»)

# Now removes header line.
set theResult to do shell script "<<<" & fileContent & " awk 'NR == 1 {next} {print}' | awk '/^[^	]+	[^	]+	+" & findStr & "/{ print NR \"\\n\" $0}'"

if theResult â‰  "" then
	set {lineNum, lineContent} to paragraphs of theResult
else
	set {lineNum, lineContent} to {"", ""}
end if

RAE3000 · September 18, 2016, 7:10pm

Thank you ccstone for the update.
Also learned a new way of reading in a file.

DJ_Bazzie_Wazzie · September 19, 2016, 12:20pm

Here an more efficient way


set searchString to "pk 155"
set myFile to quoted form of POSIX path of (path to desktop folder) & "test.csv"
set pattern to quoted form of ("^(?:[^	]+)	(?:[^	]+)	(" & searchString & ")	.*$")
set matches to do shell script "tail -n +2 " & myFile & " | egrep -in " & pattern & " || true"

The script assumes that the file is located on your desktop with the name “test.csv”
The ‘tail -n +2’ will print every line while reading the file starting from the second row
In egrep we use the option -i to make the match case insensitive
At the end we append an or operator (||) so when egrep throws an error the true command is executed instead; when there is no match an empty string will be returned and no longer an error.

Advantage over my previous post is that the data is read an send to egrep right away. This way we don’t need a large buffer and send it to bash, also the tail command itself uses no buffers and is very fast. I prefer grep and egrep over other regex commands because grep (search only) needs no buffer and is up to 10 times faster than the other tools (search and replace). Performance difference depend heavily on the expression itself of course but grep and egrep can process characters 5 times faster. Between sed and awk (and python) there is no real big difference in performance. It should be personal preference because the awk language is based on the C syntax while sed has more like his own code based on ed. I don’t know how important performance is and I don’t know how many more lines than 5000 the file actually contains but the solution above is built for speed.

ccstone · September 19, 2016, 9:02pm

Hey DJ,

Yes, but you’re failing to get the line number as the OP wanted.

You’re also forgetting the OP complained that my original script didn’t let him read his data the way he wanted to. (I read the file directly into awk.)

My script should be one awk script instead of two, but my awk-fu isn’t quite there yet.

-Chris