Deleting lines in a text file containing specific strings

I need to delete lines containing specific strings in an XML file that is generated every 10 minutes. While I could process it through TextWrangler, I feel this shouldn’t be necessary. I’m already using a Search-and-Replace subroutine and was hoping to find something comparable to delete these unneeded lines. Any ideas?

Thanks in advance.
Brad

I wrote you a handler called deleteLinesFromText that will do this. Just feed it the text and the phrase you want, and all lines containing that phrase will be deleted. So you should read in the XML file, send it through the handler, then write the results back to the XML file. An XML file is just a text file, so you can read and write it from AppleScript as normal text. You can find information on this website about reading and writing text files.

set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."

-- this is the phrase we will check against in the fileText
set deletePhrase to "third line"

deleteLinesFromText(fileText, deletePhrase)


on deleteLinesFromText(theText, deletePhrase)
	set newText to ""
	try
		-- here's how you can delete all lines of theText that contain the deletePhrase.
		-- first turn the text into a list so you can repeat over each line of text
		set textList to paragraphs of theText
		
		-- now repeat over the list and ignore lines that have the deletePhrase
		repeat with i from 1 to count of textList
			set thisLine to item i of textList
			if thisLine does not contain deletePhrase then
				set newText to newText & thisLine & return
			end if
		end repeat
		if newText is not "" then set newText to text 1 thru -2 of newText
	on error
		set newText to theText
	end try
	return newText
end deleteLinesFromText
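
To connect the handler above to a file as described, a minimal sketch might look like the following. The file name feed.xml on the Desktop is a hypothetical placeholder, UTF-8 encoding and linefeed line endings are assumptions about the generated file, and the handler is a condensed copy included only so the sketch runs on its own.

```applescript
-- Hypothetical path: replace with the location of the generated XML file.
set xmlFile to (path to desktop as text) & "feed.xml"
set deletePhrase to "third line"

-- Read the whole file as UTF-8 text (assumed encoding).
set fileText to read file xmlFile as «class utf8»

set newText to deleteLinesFromText(fileText, deletePhrase)

-- Overwrite the file with the filtered text.
set fileRef to open for access file xmlFile with write permission
try
	set eof of fileRef to 0 -- empty the file before writing
	write newText to fileRef as «class utf8»
	close access fileRef
on error errMsg number errNum
	close access fileRef -- do not leave the file open if the write fails
	error errMsg number errNum
end try

-- Condensed copy of the handler above.
on deleteLinesFromText(theText, deletePhrase)
	set keptLines to {}
	set textList to paragraphs of theText
	repeat with i from 1 to (count textList)
		set thisLine to item i of textList
		if thisLine does not contain deletePhrase then set end of keptLines to thisLine
	end repeat
	-- Join with linefeeds (assumed to be the file's line ending).
	set oldTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to linefeed
	set newText to keptLines as text
	set AppleScript's text item delimiters to oldTIDs
	return newText
end deleteLinesFromText
```

Since the file is regenerated every 10 minutes, it may be worth writing to a copy first until you are sure the filter removes only what you intend.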

Hank’s script is excellent for removing specific text.

However, depending on your needs (e.g., text in a specific location within a line, or text associated with other text on the same line) you may need to use regular expressions to more flexibly extract or remove matching text lines.

To search with regular expressions, I would recommend the use of the free TextWrangler (or its commercial cousin, BBEdit), the free Satimage osaxen, the commercial TextSoap, or the UNIX tools (grep, sed and awk) contained within OSX. All of these tools have their relative strengths, but one thing they have in common is they are all accessible from AppleScript. They all also work with Lion.

Regular expressions are incredibly flexible and powerful at identifying matching patterns of text. They can match different criteria in the same text line (akin to a logical OR), as an example. They are also very fast! A good primer on “regex’s” is contained within TextWrangler’s/BBEdit’s documentation.
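
As a small illustration of the "logical OR" idea, here is a sketch that calls the grep shipped with OS X from AppleScript. The sample text and the pattern are made up for the example; -E selects extended regular expressions (where | means "or") and -v drops the matching lines instead of keeping them.

```applescript
-- Sample text, built with linefeeds because the shell expects them.
set fileText to "Here's some text in a file" & linefeed & ¬
	"# a comment line!" & linefeed & ¬
	"This is the third line of text." & linefeed & ¬
	"And finally a last line of text."

-- Matches lines that contain "third" OR that begin with "#".
set thePattern to "third|^#"

set theCmd to "echo " & quoted form of fileText & " | grep -Ev " & quoted form of thePattern
set remainingLines to do shell script theCmd
-- remainingLines should now hold only the first and the last sample lines.
```

Note that do shell script returns its output with return characters between lines unless told otherwise.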

Eric

Hello

Here is an edited version of Hank’s script.
It doesn’t create a new list but works with a single one.
I don’t know if it’s faster but I guess that it’s more efficient in terms of memory use.


set fileText to "Here's some text in a file
It contains a few lines of text.
Hello happy tax payers.
and here is a fake fourth line
Hello angry tax payers too
And finally a sixth line of text."

-- this is the phrase we will check against in the fileText
set deletePhrase to "tax payer"

deleteLinesFromText(fileText, deletePhrase)


on deleteLinesFromText(theText, deletePhrase)
	try
		-- here's how you can delete all lines of theText that contain the deletePhrase.
		-- first turn the text into a list so you can repeat over each line of text
		set textList to paragraphs of theText
		
		-- now repeat over the list and ignore lines that have the deletePhrase
		set j to 0
		repeat with i from 1 to count of textList
			set thisLine to item i of textList
			if thisLine does not contain deletePhrase then
				set j to j + 1
				set item j of textList to thisLine
			end if
		end repeat
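		if j = 0 then
			-- every line contained the deletePhrase, so nothing is left
			set theText to ""
		else if j < i then
			set theText to my recolle(items 1 thru j of textList, return)
		end if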
		
	end try
	return theText
end deleteLinesFromText

--=====

on recolle(l, d)
	local oTIDs, t
	set oTIDs to AppleScript's text item delimiters
	set AppleScript's text item delimiters to d
	set t to l as text
	set AppleScript's text item delimiters to oTIDs
	return t
end recolle

--=====

Yvan KOENIG (VALLAURIS, France) Saturday, 7 January 2012 17:24:37

I was going to weigh in some suggestions this morning, but changed my mind because Hank had produced a simple script which does exactly what the original poster wanted and in a way the OP could probably understand.

With a short text, no efficiency measures will make any noticeable difference. With a substantial text, there are one or two things you could do. For instance, if (as seems likely) there are only a few instances of the offending phrase in the text, it would be better in the repeat to act only when the line does contain the phrase rather than when it doesn’t. The concatenations could also be saved for a single mass list-to-text coercion at the end:

set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."

-- this is the phrase we'll check against in the fileText
set deletePhrase to "third line"

deleteLinesFromText(fileText, deletePhrase)


on deleteLinesFromText(theText, deletePhrase)
	-- here's how you can delete all lines of theText that contain the deletePhrase.
	-- first turn the text into a list so you can repeat over each line of text
	set textList to paragraphs of theText
	
	-- now repeat over the list and replace lines that have the deletePhrase with 'missing values'.
	repeat with i from 1 to (count textList)
		if (item i of textList contains deletePhrase) then set item i of textList to missing value
	end repeat
	
	-- Coerce the paragraphs which are left to a single text using return delimiters.
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to return
	set newText to textList's text as text
	set AppleScript's text item delimiters to astid
	
	return newText
end deleteLinesFromText

Other approaches include reducing the number of concatenations by concatenating entire clear sections rather than individual paragraphs:

set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."

-- this is the phrase we'll check against in the fileText
set deletePhrase to "third line"

deleteLinesFromText(fileText, deletePhrase)


on deleteLinesFromText(theText, deletePhrase)
	set newText to ""
	-- here's how you can delete all lines of theText that contain the deletePhrase.
	-- first turn the text into a list so you can repeat over each line of text
	set textList to paragraphs of theText
	
	-- now repeat over the list and concatenate the chunks between the lines containing the deletePhrase.
	set i to 1
	repeat with j from 1 to (count textList)
		if (item j of textList contains deletePhrase) then
			if (j > i) then set newText to newText & text from paragraph i to paragraph (j - 1) of theText & return
			set i to j + 1
		end if
	end repeat
	if (j ≥ i) then
		set newText to newText & text from paragraph i to paragraph j of theText
	else if (newText ends with return) then
		set newText to text 1 thru -2 of newText
	end if
	
	return newText
end deleteLinesFromText

Or you could give the repeat fewer iterations by splitting the text at the phrase instances rather than at the line ends:

set fileText to "Here's some text in a file
It contains a few lines of text.
This is a third line of text.
And finally a fourth line of text."

-- this is the phrase we'll check against in the fileText
set deletePhrase to "third line"

deleteLinesFromText(fileText, deletePhrase)


on deleteLinesFromText(theText, deletePhrase)
	set newText to {}
	
	-- Get the text items of the text using the deletePhrase as a delimiter.
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to deletePhrase
	set textItems to theText's text items
	
	set textItemCount to (count textItems)
	if (textItemCount > 1) then
		-- The phrase was in the text. Collect the text items except for the bits of the paragraphs it was in.
		tell beginning of textItems to if ((count each paragraph) > 1) then set end of newText to text 1 thru paragraph -2
		repeat with i from 2 to textItemCount - 1
			tell item i of textItems to if ((count each paragraph) > 1) then set end of newText to text from paragraph 2 to paragraph -2
		end repeat
		tell end of textItems to if ((count each paragraph) > 1) then set end of newText to text from paragraph 2 to -1
		set AppleScript's text item delimiters to return
		set newText to newText as text
	else
		-- The phrase is not in the text.
		set newText to theText
	end if
	set AppleScript's text item delimiters to astid
	
	return newText
end deleteLinesFromText

Here is Hank’s original script, modified to use grep (a built-in Unix tool), which would allow the use of regular expressions to identify lines to delete.

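Note that the script sets the PATH environment variable explicitly before calling grep. The shell which is invoked by the do shell script doesn’t inherit the environment variables which populate the shell used by the Terminal program; it starts with a short default PATH (/usr/bin:/bin:/usr/sbin:/sbin). That default already covers grep, so the export matters mostly when you call tools installed elsewhere, such as in /usr/local/bin.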

Also, the UNIX environment does better with linefeeds in its data; therefore, the subroutine changes return characters to linefeeds, prior to processing the do shell script command.

The strategy behind the shell command is to echo the text and pipe the output to the grep command. This should work for any variable containing text. The command needs to be altered to work with files, but that is fairly trivial.
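
For the file case mentioned above, one possible sketch is to hand grep the file path instead of echoing the text. The path is a hypothetical placeholder; the error branch relies on grep's documented behaviour of exiting with status 1 when it selects no lines, which do shell script reports as an error.

```applescript
-- Hypothetical path: replace with the location of the generated XML file.
set xmlFile to POSIX path of ((path to desktop as text) & "feed.xml")
set deletePhrase to "third"

-- -e marks the next argument as the pattern, even if it starts with a minus sign.
set theCmd to "grep -Ev -e " & quoted form of deletePhrase & " " & quoted form of xmlFile

try
	set newText to do shell script theCmd
on error errMsg number errNum
	if errNum is 1 then
		-- grep selected no lines, i.e. every line contained the phrase.
		set newText to ""
	else
		-- Anything else (missing file, bad pattern) is a real error.
		error errMsg number errNum
	end if
end try
```

Because grep reads the file itself, no conversion of return characters is needed on the AppleScript side.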

Consider changing the “third” in the following line of the script:

 set deletePhrase to "third"

to any of the following:

“a.*?line” - to remove lines containing “a … line”
“^#” - to remove lines beginning with a #

Other regex’s (regular expressions) should also work.

set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
A line with the # (pound) symbol.
And finally a last line of text."

-- this is the phrase we will check against in the fileText
set deletePhrase to "third"

set theAns to deleteLinesFromText(fileText, deletePhrase)


on deleteLinesFromText(theText, deletePhrase)
	set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:"
	set theText to switchText from theText to linefeed instead of return

	set newText to ""
	try
		set theCmd to "echo " & quoted form of theText & " | grep -Ev " & quoted form of deletePhrase & ""
		set newText to do shell script theEnv & "; " & theCmd
	on error
		set newText to "ERROR: " & theText
	end try
	return newText
end deleteLinesFromText

to switchText from t to r instead of s
	set d to text item delimiters
	set text item delimiters to s
	set t to t's text items
	set text item delimiters to r
	tell t to set t to item 1 & ({""} & rest)
	set text item delimiters to d
	t
end switchText

Nigel’s post is wonderfully thorough and logical. Moreover, it stays within AppleScript and doesn’t use any other software. I provide my solution only for those individuals looking to use regular expressions in their search criteria. It uses the grep available as part of OSX.

Eric

Excellent guys. I enjoyed looking at all the solutions. It’s funny how some of the simplest tasks are the most fun to think about. :lol:

Thanks for this powerful script but I am like an umbrella in front of a sewing machine.
I really don’t understand what the piece of code theEnv is doing, or why it is triggered twice.

Yvan KOENIG (VALLAURIS, France) Saturday, 7 January 2012 21:06:08

Sorry Yvan!

That was my mistake, which I’ve edited in the post above. Similarly, I’ve moved the conversion of returns to linefeeds into the subroutine.

theEnv is used to set the environment of the “do shell script” call. In OSX, the sh shell is called by default, when the do shell script is used in AppleScript. However, do shell script doesn’t use the environment variables of the shell used by the Terminal program. By prefixing the command in theEnv to any other commands used during the do shell script call, the environment variables of that particular call are set.

Note that theEnv contains the paths to the folders which contain the system UNIX executables (change this to fit your system’s paths). PATH is the environment variable of the sh shell which is launched by the do shell script command.
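
One way to see the effect for yourself is to ask the shell for its PATH with and without the prefix. This is only a sketch; the directories in theEnv are the ones used in the handler, trimmed for brevity, and the values returned will depend on your system.

```applescript
-- The PATH that do shell script starts with.
set defaultPath to do shell script "echo \"$PATH\""

-- The same call with the export prefixed, as the handler does with theEnv.
set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"
set extendedPath to do shell script theEnv & "; echo \"$PATH\""

-- Compare the two values in the Result pane.
{defaultPath, extendedPath}
```

The export only lasts for that single do shell script call, which is why the handler prefixes it to every command it runs.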

I hope that makes sense.

Eric

Thanks.

Other question.

Why is the esoteric tell t to set instruction needed in your handler?


to switchText from t to r instead of s
   set d to text item delimiters
   set text item delimiters to s
   set t to t's text items
   set text item delimiters to r
   tell t to set t to item 1 & ({""} & rest)
   set text item delimiters to d
   t
end switchText

Doesn’t it do the same as:


to switchText from t to r instead of s
   set d to text item delimiters
   set text item delimiters to s
   set t to t's text items
   set text item delimiters to r
   set t to t as text
   set text item delimiters to d
   t
end switchText

Yvan KOENIG (VALLAURIS, France) Saturday, 7 January 2012 21:51:48

That subroutine belongs to Nigel and kai (http://macscripter.net/viewtopic.php?id=13008) who had written it during the days of ASCII and Unicode text usage (about 2005) in AppleScript. It preserved either type of encoding.

I would imagine that since AppleScript text is now Unicode, the distinction is no longer an issue, and your revision is appropriate. I wonder what Nigel has to say about this?

Eric

As I see Eric has just said, it’s a technique invented by Kai Edwards a few years ago when AppleScript differentiated between ‘string’ and ‘Unicode text’. If you used ‘as text’ or ‘as Unicode text’, the class of the result would be whatever was specified by the coercion. But if the rest of the list was concatenated to the first item, the result would be the same class as the original text. The list with the empty string mimicked an empty text item, so that an instance of the delimiter would be inserted there during the implicit coercion caused by the concatenation to item 1.
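
A small sketch of the two joins side by side may make the mechanism easier to follow. The list contents and variable names are made up for the example; in current AppleScript, where all text is Unicode, both forms should give the same result.

```applescript
set parts to {"alpha", "beta", "gamma"}

set oldTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to ", "

-- Explicit coercion: the class of the result is whatever "as" asks for.
set viaCoercion to parts as text

-- Concatenation: item 1 is the left operand, so the list on the right is
-- coerced to match it. The {""} acts as an empty leading item, which makes
-- the coercion place a delimiter between item 1 and the rest.
tell parts to set viaConcatenation to item 1 & ({""} & rest)

set AppleScript's text item delimiters to oldTIDs

{viaCoercion, viaConcatenation}
-- both should be "alpha, beta, gamma"
```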

Thanks to both of you.
If I remember correctly, I used set t to "" & t

It’s the fact that you put the empty string after item 1 which was puzzling me.

Yvan KOENIG (VALLAURIS, France) Sunday, 8 January 2012 10:16:00

I found the grep man page and started to study it.

It raised some other questions:

(1) What is the & "" at the end of this instruction for?
set theCmd to "echo " & quoted form of theText & " | grep -Ev " & quoted form of deletePhrase & ""

(2) What is the E option in the same instruction for?
I understood that adding the e option would be useful when the key string contains the - (minus) character,
but during my tests the code behaved the same with or without the E.

I made tests with :
(a) exclude matching lines:
" | grep -Eve "
" | grep -ve "
" | grep -Evn "
" | grep -vn "
" | grep -Evne "
" | grep -vne "
(b) keep only matching lines :
" | grep -E "
" | grep - "
" | grep -Ee "
" | grep -e "
" | grep -En "
" | grep -n "
" | grep -Ene "
" | grep -ne "

(3) I’m wondering if there is a way to filter the output so that it returns only the index (option n) of the lines.

Yvan KOENIG (VALLAURIS, France) Sunday, 8 January 2012 16:53:52

  1. The & "" at the end of the command is merely a placeholder, and unnecessary (I use this code as a template for other subroutines).

  2. Not all grep programs are the same. Some only offer “basic” functionality. The “-E” option offers extended functionality, which may not be available on all platforms or versions of grep. (It is unrelated to the lowercase “-e” option, which marks the next argument as the pattern and so protects a pattern that starts with a minus sign.)

Extended grep may also result in simpler and clearer writing of regular expressions. To wit: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep

I have not tested this option on OSX, but if you find it is not necessary for the execution of your code, then don’t use it. Its presence (or absence) should have very little effect on the speed of the subroutine.

  3. An AppleScript solution to your question about extracting indices follows:
set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
Here is a # (pound) symbol.
And finally a fourth line of text."

-- this is the phrase we will check against in the fileText
set deletePhrase to "^#"

set theAns to deleteLinesFromText(fileText, deletePhrase)

set theAnsList to the paragraphs of theAns
set theListCount to the count of theAnsList

repeat with iCtr from 1 to theListCount
	set item iCtr of theAnsList to item 1 of stringToList(item iCtr of theAnsList, ":")
end repeat


on deleteLinesFromText(theText, deletePhrase)
	set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:"
	set theText to switchText from theText to linefeed instead of return
	set newText to ""
	try
		set theCmd to "echo " & quoted form of theText & " | grep -Evn " & quoted form of deletePhrase
		set newText to do shell script theEnv & "; " & theCmd
	on error
		set newText to "ERROR in deleteLinesFromText: " & return & theText
	end try
	return newText
end deleteLinesFromText

to switchText from t to r instead of s
	-- keeps encoding of individual list items intact by using {""}
	set d to text item delimiters
	set text item delimiters to s
	set t to t's text items
	set text item delimiters to r
	tell t to set t to beginning & ({""} & rest)
	set text item delimiters to d
	t
end switchText

on stringToList(theString, theDelimiter)
	set oldTID to AppleScript's text item delimiters
	set AppleScript's text item delimiters to theDelimiter
	set resultList to text items of theString
	set AppleScript's text item delimiters to oldTID
	return resultList
end stringToList

You can trade this approach for one which will be considerably faster on very long lists. It uses more memory by creating a second list, but is faster because it eliminates the subroutine call which splits each list item into the index and the remainder of the item’s data. I list only the main routine code here, as the subroutines are unchanged from the code above:

set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
Here is a # (pound) symbol.
And finally a fourth line of text."

-- this is the phrase we will check against in the fileText
set deletePhrase to "^#"

set theAns to deleteLinesFromText(fileText, deletePhrase)

set theAnsList to stringToList(theAns, {":", return})
set theListCount to the count of theAnsList
set the newList to {}

repeat with iCtr from 1 to theListCount by 2
	set the end of my newList to item iCtr of theAnsList
end repeat

The trick in this approach is that a string can be transformed to a list by using a list of multiple delimiters, which allows isolation of the indices based on the location of the index between the line return and the colon. Unfortunately, a list item cannot be effectively deleted, but this is handled by looping through the list and taking every other item (these happen to be the indices) and placing them in a new list. If the list is very long, then the “my” construct, used with newList, speeds up the process considerably.
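
To see the multiple-delimiter split in isolation, here is a short sketch on a hand-made sample of what grep -n returns through do shell script. The sample string and variable names are invented for the example.

```applescript
-- Two numbered lines, separated by a return as do shell script delivers them.
set grepOutput to "1:Here's some text in a file" & return & "3:This is the third line of text."

-- Split on both the colon and the return in one pass.
set oldTIDs to AppleScript's text item delimiters
set AppleScript's text item delimiters to {":", return}
set pieces to text items of grepOutput
set AppleScript's text item delimiters to oldTIDs

-- pieces alternates: index, line text, index, line text ...
set lineNumbers to {}
repeat with i from 1 to (count pieces) by 2
	set end of lineNumbers to item i of pieces
end repeat

lineNumbers
-- {"1", "3"}
```

The alternation only holds while the line text itself contains no colon; a line such as an XML namespace declaration would add extra items and shift the indices out of step.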

Finally, if you were to use UNIX, then piping the grep output to “sed” would work also.

Eric

Thanks

Splitting the lines returned by the call to deleteLinesFromText was what I used.
I was just wondering if there was a way to do that thru a parameter which I missed in the main call.

I learnt something.

I thought that my only sped up the handling of lists when they were defined as properties or globals.

Yvan KOENIG (VALLAURIS, France) Monday, 9 January 2012 16:21:24

Changing the appropriate line in the deleteLinesFromText handler would allow you to accomplish this in UNIX. This would be the replacement line:

This pipes the grep output to sed which looks for any characters after the colon and replaces them with an empty string. You can change the handler itself to create a parameter which would allow you to add the sed string programmatically, which may make the handler more generically useful.
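
The replacement line itself did not survive in this copy of the thread. Going by the description above and by the sed call in the synthesis further down, it was probably close to the following. Treat it as a reconstruction under those assumptions, not as the exact original; the sample text is made up so the sketch runs on its own.

```applescript
set theText to "Here's some text in a file" & linefeed & ¬
	"# a comment line!" & linefeed & ¬
	"And finally a last line of text."
set deletePhrase to "^#"

-- grep -n prefixes each surviving line with "index:"; sed then deletes
-- everything from the first colon onward, leaving only the index.
set theCmd to "echo " & quoted form of theText & " | grep -Evn " & quoted form of deletePhrase & " | sed 's/:.*//'"

set lineNumbers to paragraphs of (do shell script theCmd)
-- {"1", "3"}
```

The sed expression is wrapped in single quotes here so that the shell cannot try to expand the asterisk as a file name pattern.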

Eric

Thanks a lot.

Here is my synthesis:


set fileText to "Here's some text in a file
It contains few lines of text.
This is the -third line of text.
# a comment line!
A line with the # (pound) symbol.
And finally a last line of text."

-- this is the phrase we will check against in the fileText
set keyString to "-third"
(*
The last parameter may be :
"00" : Extract every line which doesn't contain keyString 
"01" : idem + put the line index (in the source) at front of the line 
"02" : return index of every line which doesn't contain keyString
"10" : Extract every line which contain keyString 
"11" : idem + put the line index (in the source) at front of the line 
"12" : return index of every line which contains keyString
*)
set theAns to deleteLinesFromText(fileText, keyString, "02")


on deleteLinesFromText(theText, keyString, flag)
	set theEnv to "export PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:"
	set theText to switchText from theText to linefeed instead of return
	
	set newText to ""
	try
		if flag = "00" then
			(* Extract every line which doesn't contain keyString *)
			set theCmd to "echo " & quoted form of theText & " | grep -Eve " & quoted form of keyString
		else if flag = "01" then
			(* Extract every line which doesn't contain keyString
and put the line index (in the source) at front of the line *)
			set theCmd to "echo " & quoted form of theText & " | grep -Evne " & quoted form of keyString
		else if flag = "02" then
			(* Return only the index of every line which doesn't contain keyString *)
			set theCmd to "echo " & quoted form of theText & " | grep -Evne " & quoted form of keyString & " | sed 's/:.*//'"
		else if flag = "10" then
			(* Extract every line which contains keyString *)
			set theCmd to "echo " & quoted form of theText & " | grep -Ee " & quoted form of keyString
		else if flag = "11" then
			(* Extract every line which contains keyString
and put the line index (in the source) at front of the line *)
			set theCmd to "echo " & quoted form of theText & " | grep -Ene " & quoted form of keyString
		else if flag = "12" then
			(* Return only the index of every line which contains keyString *)
			set theCmd to "echo " & quoted form of theText & " | grep -Ene " & quoted form of keyString & " | sed 's/:.*//'"
		end if
		set newText to do shell script theEnv & "; " & theCmd
	on error
		set newText to "ERROR: " & theText
	end try
	return newText
end deleteLinesFromText

to switchText from t to r instead of s
	set d to text item delimiters
	set text item delimiters to s
	set t to t's text items
	set text item delimiters to r
	tell t to set t to item 1 & ({""} & rest)
	set text item delimiters to d
	t
end switchText

Yvan KOENIG (VALLAURIS, France) Monday, 9 January 2012 18:40:03

Hello!

last script in post #5:

tell beginning of textItems to if ((count each paragraph) > 1) then set end of newText to text 1 thru paragraph -2

Now, this is neat, if not dark arts! Surely totally mind-boggling! :wink:

Hello!

Here is another way to do it, with a general handler: Matt Neuburg’s filter handler. Nigel’s is my favorite, though. It may not be the most readable, but its usefulness by far outweighs any drawbacks! :slight_smile:


on run
	global searchterm
	
	set fileText to "Here's some text in a file
It contains few lines of text.
This is the third line of text.
# a comment line!
Here is a # (pound) symbol.
And finally a fourth line of text."
	
	set searchterm to "third line"
	set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, linefeed}
	
	set thetext to text items of fileText
	set newtext to _filter(thetext, notContains)
	log newtext
	set newtext to text items of newtext as text
	set AppleScript's text item delimiters to tids
	log newtext
end run

on _filter(L, crit)
	script filterer
		property criterion : crit
		on _filter(L)
			if L = {} then return L
			if criterion(item 1 of L) then
				return {item 1 of L} & (_filter(rest of L))
			else
				return _filter(rest of L)
			end if
		end _filter
	end script
	return filterer's _filter(L)
end _filter

on notContains(x)
	global searchterm
	if x does not contain searchterm then return true
	return false
end notContains