Deleting lines in a text file containing specific strings

McUsr · August 5, 2012, 4:26pm

Hello!

Seeing Nigel’s elegant solution, I couldn’t help myself. Nigel’s solution simulates grep -v, so I made a grep, which I called keepLines. I guess it has been done before, but it has irritated me that I couldn’t figure out how to write those handlers, and now I had a head start!


set fileText to "This is a line of text is also the  third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the  third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set keepPhrase to "third line"

set keptLines to keepLinesFromText(fileText, keepPhrase)

log keptLines

(*
"This is a line of text is also the  third line.
third line is a popular line
This is a third line of text.
This is a line of text is also the  third line.
third line is a line of text.
This is a third line of text."
*)

on keepLinesFromText(theText, keepPhrase)
		-- http://macscripter.net/viewtopic.php?id=37830 © McUsr 
		-- Concept by NG
	set newText to {}
	
	-- Get the text items of the text using the keepPhrase as a delimiter.
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to keepPhrase
	set textItems to theText's text items
	
	set textItemCount to (count textItems)
	if (textItemCount > 1) then
		-- The phrase was in the text. Collect the text items except for the bits of the paragraphs it was in.
		tell beginning of textItems to set end of newText to paragraph -1
		
		repeat with i from 2 to textItemCount - 1
			tell item i of textItems to if ((count each paragraph) > 1) then
				set end of newText to paragraph 1 & return & paragraph -1
			end if
			
		end repeat
		tell end of textItems to set end of newText to paragraph 1
		set newText to newText as text
	else
		-- The phrase is not in the text.
		set newText to theText
	end if
	set AppleScript's text item delimiters to astid
	
	return newText
end keepLinesFromText

gaseous1 · September 24, 2012, 6:16am

Matt Neuberg’s filter routine works well, filtering out any line containing the text supplied.

However, your final example which imitates “grep” chokes on any line which contains more than one instance of the desired text, by only returning a portion of the line. To see this in action substitute “line” for “third line” as the value of keepPhrase. Unfortunately, Nigel’s final example (partway through this thread) suffers a similar outcome if the same substitution is made for the value of deletePhrase in his routine.

This is solved by the following, which takes each line in order, essentially identical to the method Hank originally offered:

set fileText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set keepPhrase to "line"

set keptLines to keepLinesFromText(fileText, keepPhrase)

log keptLines


on keepLinesFromText(theText, keepPhrase)
	-- http://macscripter.net/viewtopic.php?id=37830 © McUsr 
	-- Concept by NG
	set newText to {}
	
	set workingText to the paragraphs of theText

	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to keepPhrase
	
	repeat with workingItem in workingText
		set textItems to workingItem's text items
		set textItemCount to (count textItems)
		if (textItemCount > 1) then
			set end of newText to workingItem as text
		end if
	end repeat
	
	set AppleScript's text item delimiters to return
	set newText to newText as text
	set AppleScript's text item delimiters to astid

	return newText
end keepLinesFromText

This example doesn’t error check. It also can only handle straight text, not regular expressions. For those interested in an example using regular expressions, see Yvan Koenig’s last post in this thread.

If you were to want to delete lines containing the text, then the if/then condition would need to be textItemCount = 1.
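For comparison, the same keep/delete choice can be sketched on the shell side with fixed-string grep. This is only an illustration and not part of the handlers above: the sample text and variable names are assumptions, and unlike AppleScript's default text comparison, grep is case-sensitive unless -i is given.

```shell
# Hypothetical sample text and phrase, for illustration only.
text='This is a third line of text.
Here is some text in a file
line one, line two'
phrase='line'

# Keep lines containing the phrase (-F: fixed string, not a regex).
kept=$(printf '%s\n' "$text" | grep -F -- "$phrase")

# Delete lines containing the phrase (-v inverts the match).
deleted=$(printf '%s\n' "$text" | grep -v -F -- "$phrase")

printf 'kept:\n%s\n\ndeleted:\n%s\n' "$kept" "$deleted"
```

Because grep works line by line, a line with several instances of the phrase is still returned whole.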

Eric

Nigel_Garvey · September 24, 2012, 1:12pm

Eric’s purpose here was to return the line numbers of lines not containing a particular phrase. I’ve been learning more about “sed” in the meantime and it seems that the “grep” and “sed” portions above can be covered by one “sed” command:

The -n option turns off the automatic appending of text lines to the output. Each line of the input is examined to see if it contains a match to the regular expression between the “/” delimiters. If not (“!”), a line containing the line’s number is appended to the output (“=”).

set theText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set deletePhrase to "line of text"

paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -n '/" & deletePhrase & "/ ! ='"))
--> {"2", "3", "6"} -- The numbers of lines not containing "line of text".

cirno · September 25, 2012, 6:00am

I learned a lot from this thread. Thanks

I’ve been looking for something similar for a long time.

Is there a way to find the row containing xyz and then change that text? And is there a way to add a new row of text after the row containing xyz? Preferably using grep, without a repeat loop.

gaseous1 · September 25, 2012, 6:12am

Sweet solution! I really like the condensed version. You aren’t changing over to the dark side perchance (ie, Unix), Nigel?

The OP was “delete lines containing.” This morphed into delete lines containing: text only or regular expressions?

AppleScript is wonderful for identifying lines by searching with text only. AppleScript and Unix are wonderful for identifying lines by using regular expressions, which are considerably more powerful than text only. Additionally, the indices of these lines can be returned. AppleScript can do this also, but it requires further coding.

Yvan created a nice routine (several posts back) to process a variety of choices: lines matching, lines not matching, indices of lines matching and indices of lines not matching.

In that spirit, I have compiled the following:


set theText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set thePhrase to "line of text"

-- index or lines NOT matching
set t1 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -n '/" & thePhrase & "/ !='"))
set t3 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -n '/" & thePhrase & "/ !p'"))

-- index or lines matching
set t2 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -n '/" & thePhrase & "/ ='"))
set t4 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -n '/" & thePhrase & "/ p'"))

Nigel covered sed nicely:

And … the line will be appended to the output (“p”).
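Stripped of the AppleScript wrapper, the four variants can be tried directly in a shell. This is a minimal sketch with a shortened, made-up sample text; the variable names are assumptions.

```shell
# Hypothetical sample text, shortened from the thread's example.
text='This is a line of text.
Third line is a popular line
It contains a few lines of text.'
phrase='line of text'

# -n suppresses automatic printing; "=" prints the line number, "p" the line.
# "!" negates the address, so the command runs on the non-matching lines.
# The phrase is spliced in between single-quoted pieces of the sed script.
nomatch_idx=$(printf '%s\n' "$text" | sed -n '/'"$phrase"'/!=')
nomatch_lines=$(printf '%s\n' "$text" | sed -n '/'"$phrase"'/!p')
match_idx=$(printf '%s\n' "$text" | sed -n '/'"$phrase"'/=')
match_lines=$(printf '%s\n' "$text" | sed -n '/'"$phrase"'/p')

printf '%s\n' "$nomatch_idx"
```

Note that the phrase is treated as a regular expression here, so characters such as "/" or "." in it would need escaping.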

Eric

gaseous1 · September 25, 2012, 6:33am

cirno asked:

Both grep and sed can be used. Since we just used sed, I’ll provide that solution:


set theText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set thePhrase to "line of text"
set theNewText to "LINE OF TEXT"

set t1 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed 's/" & thePhrase & "/" & theNewText & "/'"))

Sed’s “s/searchText/replaceText/” substitution construct is very straightforward and can use regular expressions also.
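One detail worth knowing: without a flag, the substitution only changes the first match on each line, and a trailing "g" changes them all. A small sketch, with a made-up sample line:

```shell
# Hypothetical sample line containing the phrase twice.
line='a line of text and another line of text'

# Without a flag, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/line of text/LINE OF TEXT/')

# With the g flag, every match on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/line of text/LINE OF TEXT/g')

printf '%s\n%s\n' "$first_only" "$every"
```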

The answer to your second question depends on where you wish to place your boilerplate text. For example

will append " xyz" to every line. The regular expression “$” is the placeholder for the end of line.
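A command along the following lines behaves as described. It is a sketch rather than the poster's own code, and the sample text and the "marker" address are assumptions.

```shell
# Hypothetical sample text.
text='first row
second row with xyz marker'

# "$" matches the end of each line, so this appends " xyz" to every line.
all=$(printf '%s\n' "$text" | sed 's/$/ xyz/')

# Prefixing an address restricts the substitution to matching lines.
only=$(printf '%s\n' "$text" | sed '/marker/s/$/ xyz/')

printf '%s\n---\n%s\n' "$all" "$only"
```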

Note that an AppleScript solution will be very fast and is equally simple, depending on where in the line you want to append the boilerplate. The primary advantage to the UNIX call is in the use of regular expressions, which AppleScript doesn’t support out-of-the-box. There is overhead in the do shell script call which is not insignificant.

Eric

DJ_Bazzie_Wazzie · September 25, 2012, 8:09am

The shell’s built-in string substitution, in the spirit of g/re/p (the ed command that grep is named after), is very fast and doesn’t need to launch an extra process beyond the shell itself. The code would look like:


set theText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set thePhrase to "line of text"
set theNewText to "LINE OF TEXT"

set t1 to do shell script "IFS='" & return & "';mystr=" & quoted form of theText & ";echo ${mystr//" & quoted form of thePhrase & "/" & quoted form of theNewText & "}"

Nigel_Garvey · September 25, 2012, 10:28am

I don’t think there is with “grep” (if you mean the Unix utility of that name), but I’ll do some research.

gaseous1 has shown the “sed” solution for a substitution. I’m not sure if you want the new text row as a separate thing or in combination with the substitution. Possible solutions to both are in this script:


set theText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set thePhrase to "line of text"
set theNewText to "LINE OF TEXT"
set theNewLine to "NEW LINE"

-- Append a new line after each line containing thePhrase.
set t1 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -E '/" & thePhrase & "/ s/$/\\" & linefeed & theNewLine & "/'"))

-- Change thePhrase to theNewText AND append a new line.
set t1 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -E 's/^(.*)" & thePhrase & "(.*)$/\\1" & theNewText & "\\2\\" & linefeed & theNewLine & "/'"))

There may be better solutions within “sed”, but I don’t know them yet.

Edit: Instead of being concatenated into the shell script string from AppleScript, the linefeed could be written in a Unix-y way as $'\n', which in an AppleScript string would be "$'\\n'". The “sed” command is already in single quotes, which therefore have to be interrupted for the dollar sign:

-- Append a new line after each line containing thePhrase.
set t1 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -E '/" & thePhrase & "/ s/$/\\'$'\\n" & theNewLine & "/'"))

-- Change thePhrase to theNewText AND append a new line.
set t1 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -E 's/^(.*)" & thePhrase & "(.*)$/\\1" & theNewText & "\\2\\'$'\\n" & theNewLine & "/'"))

Edit 2: There is a “sed” command which specifically appends text to the output immediately after a line of interest, but you still have to provide the linefeeds. I don’t think it’s necessarily any better here:

-- Append a new line after each line containing thePhrase.
set t1 to paragraphs of (do shell script ("echo " & quoted form of theText & " | sed -E ' /" & thePhrase & "/ a\\'$'\\n" & theNewLine & "\\n'"))
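In a plain shell script, where a literal line break can sit inside the single-quoted sed script, both approaches can be sketched as follows. The sample text is made up, and this form is assumed to be acceptable to both the BSD sed on the Mac and GNU sed.

```shell
# Hypothetical sample text.
text='This is a third line of text.
Here is some text in a file'

# "a\" appends the text on the following line after each matching line.
appended=$(printf '%s\n' "$text" | sed '/line of text/a\
NEW LINE')

# Same result via substitution: the end of the line is replaced by
# an escaped literal newline followed by the new text.
substituted=$(printf '%s\n' "$text" | sed '/line of text/s/$/\
NEW LINE/')

printf '%s\n' "$appended"
```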

Nigel_Garvey · September 25, 2012, 10:31am

That’s a nice one to know. Thanks!

gaseous1 · September 25, 2012, 11:54am

cirno,

Sorry, but I misspoke earlier. GREP cannot provide substitution. However, sed, awk, tr and perl all can. Thanks to DJ Bazzie Wazzie for pointing out the Internal Field Separator and its use – it’s akin to the AppleScript text item delimiter. Overall the Unix tools are all quite speedy.

The do shell script call still has a significant latency, however, which makes pure AppleScript a viable alternative for programming “smaller” text processing jobs. I like using the Unix commands for handling large data files and for their regex capacities.

Eric

Nigel_Garvey · September 25, 2012, 3:43pm

I’m a bit confused by the “IFS=” business in DJ Bazzie Wazzie’s script.

• When the IFS value’s a return, the script works whether the line separator in the text is a linefeed or a return. Besides concatenation from AppleScript, a return value can be specified by writing "IFS=$'\r';"
• The script also works properly when the IFS value doesn’t contain any characters which appear in the text after the substitutions are made, in which case an empty value of '' (two single quotes) would be less confusing than a return.
• Characters in the IFS value which do appear in the text after the substitutions are made are replaced with spaces.
• Return characters in the main text are replaced with linefeeds.
• Empirical testing suggests that the substitution order is: 1) Returns in the main text are replaced with linefeeds; 2) Instances of the search string in the edited main text are replaced with instances of the replacement string; 3) Characters now in both the main text and in the IFS string are replaced with spaces.

Weird.

DJ_Bazzie_Wazzie · September 25, 2012, 4:25pm

It’s (partly) code I copied from myself.

A better line would be:

set t1 to do shell script "mystr=" & quoted form of theText & ";echo \"${mystr//" & thePhrase & "/" & theNewText & "}\""

My apologies for the confusion. I didn’t post the code to show IFS, just to show the internal string function.

edit:
Better readable code:

set theText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set thePhrase to "Here's"
set theNewText to "\""

set t1 to do shell script "mystr=" & quoted form of theText & "
substr=" & quoted form of thePhrase & "
repstr=" & quoted form of theNewText & "
echo \"${mystr//$substr/$repstr}\""
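The same substitution in a standalone bash script might look like the sketch below. The sample string is made up, and quoting the pattern is an addition of this sketch: bash then matches it literally, whereas an unquoted pattern is treated as a glob, so characters like "*" or "?" would act as wildcards.

```shell
# Requires bash; the sample values are hypothetical.
mystr='2 * 3 = 6
4 * 5 = 20'
substr='*'
repstr='x'

# ${var//pattern/replacement} replaces every match in the string.
# The inner quotes around $substr make the "*" a literal character.
result="${mystr//"$substr"/$repstr}"

printf '%s\n' "$result"
```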

McUsr · September 25, 2012, 4:38pm

Hello!

Some of the confusion may stem from the fact that awk has both input and output field and record separators.

FS and OFS, RS and ORS to give maximum flexibility.

They are described in man -s1 awk
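A tiny sketch of the field separators at work, with made-up input: FS controls how the input line is split, and OFS is used when awk rebuilds the line.

```shell
# Split on commas (FS) and rejoin with semicolons (OFS).
# Assigning $1 to itself forces awk to rebuild the record with OFS.
out=$(printf 'a,b,c\n' | awk 'BEGIN { FS = ","; OFS = ";" } { $1 = $1; print }')

printf '%s\n' "$out"
```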

DJ_Bazzie_Wazzie · September 25, 2012, 4:44pm

IFS is a separator for bash (like delimiters in AS); AWK has nothing to do with this. When I send an array to echo, it will be sent as one or as multiple arguments, depending on the command-line field separator I have set.

small example:

do shell script "echo hello                            world!"

As you can see in the result, it is sent to the echo command as two arguments, ‘hello’ and ‘world!’. This behavior changes when you change the IFS variable. But it was not my intention to discuss IFS.

Yvan_Koenig · September 25, 2012, 5:09pm

Hello

To replace thePhrase by theNewText I would use Shane Stanley’s ASObjC Runner


set theText to "This is a line of text is also the third line.
Third line is a popular line
Here's some text in a file
This is a third line of text.
This is a line of text is also the third line.
It contains a few lines of text.
third line is a line of text.
This is a third line of text.
And finally a fourth line of text."

set thePhrase to "line of text"
set theNewText to "LINE OF TEXT"

tell application "ASObjC Runner"
	replace string thePhrase in item theText replacing with theNewText
end tell

Since this gem is launched during the boot process on my machine, there is almost no added overhead.

Yvan KOENIG (VALLAURIS, France) Tuesday, September 25, 2012 19:08:47

McUsr · September 25, 2012, 6:15pm

Hello!

Bash Manual:

Expands to the positional parameters, starting from one. When the expansion occurs within double quotes, it expands to a single word with the value of each parameter separated by the first character of the IFS special variable. That is, "$*" is equivalent to "$1c$2c...", where c is the first character of the value of the IFS variable. If IFS is unset, the parameters are separated by spaces. If IFS is null, the parameters are joined without intervening separators.
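The three cases in that passage can be checked with a short sketch. The parameter values are made up, and each case runs in a subshell so the change to IFS doesn't leak out.

```shell
# Hypothetical positional parameters.
set -- alpha beta gamma

# IFS set: "$*" joins with the first character of IFS.
joined_comma=$(IFS=,; printf '%s' "$*")

# IFS null (empty): joined without any separator.
joined_null=$(IFS=; printf '%s' "$*")

# IFS unset: joined with spaces.
joined_unset=$(unset IFS; printf '%s' "$*")

printf '%s\n%s\n%s\n' "$joined_comma" "$joined_null" "$joined_unset"
```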

I find the observations as weird as Nigel Garvey, with regards to the result of the substitution/expansion.

I have looked through the manual, and found no explanation for it.

It still looks a bit like an IFS/OFS thing, though not documented, that a space is a fallback OFS, when bash perceives conflicts. Just a guess more than a suggestion.

DJ_Bazzie_Wazzie · September 25, 2012, 6:58pm

The observation is quite straightforward, though. I changed IFS to a character that isn’t used in the text; the text contains linefeeds, not returns. The result is that the text is no longer interpreted as a list but as one single string. Now the parameter substitution sends the text as a single argument to echo (similar to quoting; see my corrected post). When IFS isn’t changed (by default it is a space followed by a tab and a linefeed), every word is sent as a parameter to echo and therefore everything is separated by spaces; no returns are sent to echo. When you set IFS to linefeed, every line is sent as an argument to echo, which explains the spaces as well.

It may seem so, but my code sends the whole string to echo as a single text and not line by line. If you do send it line by line, echo will also do its job because of the trailing newline on each iteration. I think the misinterpretation here is that the text doesn’t contain returns but linefeeds. When you change the linefeeds in the text to returns, you get the same ‘weird’ results and you don’t need to set IFS at all.
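The splitting behavior described here can be reproduced with a short sketch. The sample text is made up; the empty-IFS case runs in a subshell so the default is left untouched.

```shell
# Hypothetical sample text containing extra spaces and a linefeed.
text='hello     world
second line'

# Unquoted: the shell splits on IFS (space, tab, linefeed by default)
# and echo joins the resulting arguments with single spaces.
split=$(echo $text)

# Quoted: the text reaches echo as one argument, linefeed intact.
preserved=$(echo "$text")

# Empty IFS: no word splitting happens, even without quotes.
nosplit=$(IFS=; echo $text)

printf '%s\n' "$split"
```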

Nigel_Garvey · September 25, 2012, 8:12pm

I’m not that weird! :lol:

McUsr · September 26, 2012, 11:52am

:lol:

Something did get lost in translation; the correct phrasing would be:

I find the observations as weird as Nigel Garvey did.

And I buy DJ Bazzie Wazzie’s explanation. I must say that I hadn’t thought of exploiting delimiters in shell scripts.

There is one command that hasn’t been mentioned so far, and it is one I hold dear and use in here documents: good old ed, which edits files “in place”. Here is a command I have snatched from Stephen R. Bourne. It performs search and replace within a file.

edg

#! /bin/bash
ed - $3 <<%
g/$1/s//$2/g
w
%