Shell script to remove blank lines in text file?

emendelson · October 21, 2021, 12:05pm

I’ve been searching for an answer to this, but nothing that I’ve found seems to work correctly.

Is there a shell script that I can run from an AppleScript to remove blank lines in a text file? I’ve tried various solutions with sed and grep, without success. Probably I’m using the wrong syntax. Is there any straightforward way to do this? I’ve tried replacing \n\n with \n but even if I escape the backslash, I can’t get the script to work.

Thanks for any help, and apologies if I’ve missed something obvious.

Mockman · October 21, 2021, 4:10pm

First, you should confirm that the blank lines are actually blank, i.e. do not contain any invisible characters such as tab or space. Most decent text editors offer a ‘show invisibles’ feature but alas, TextEdit does not. SubEthaEdit is a good and free example, and BBEdit offers a free version which is very good.

You can check some example text in Script Editor though. Select the last character of the preceding paragraph through to the first character of the subsequent paragraph and copy to the clipboard. In a new script, set that text to a variable. Then, get the ‘id’ for the selected characters. For example, the text below consists of the following characters: period, linefeed, space, space, tab, linefeed, letter A.

set ab to ".
  
A"

id of ab

[format]--> {46, 10, 32, 32, 9, 10, 65}[/format]

You can see that the ‘32’ characters are the spaces, the ‘9’ marks the tab. So if you used sed to delete blank lines, it would not consider this to be blank and would skip it.

Furthermore, if you add the line ‘characters of ab’ then it will return the characters as a list, like so.

[format]--> {".", "
", " ", " ", "	", "
", "A"}[/format]

So it would be worth confirming whether that is the case for your text file.
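The same check can be run on the file itself from the shell by making the invisible characters visible. This is only a sketch: `show_invisibles` is a made-up helper name, and the sample input is the period / space / space / tab / letter A text from above.

```shell
#!/bin/sh
# show_invisibles: hypothetical helper name, not a standard tool.
# cat -e marks the end of each line with '$'; cat -t shows tabs as '^I'.
show_invisibles() {
    cat -et "$@"
}

# Same sample as above: period, LF, space, space, tab, LF, letter A.
printf '.\n  \t\nA\n' | show_invisibles
```

A line that shows nothing but `$` is truly empty. A line that shows spaces or `^I` before the `$` only looks empty, and a plain `/^$/` pattern will skip it.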

Now, as to deleting blank lines, unfortunately, you don’t include your sed command so who knows what you were doing, but in general, this command would delete any blank lines in a given file. What it does is find lines that have a beginning (^) and an end ($) but nothing in between and then deletes each one. Obviously, in the example above, it would not detect any.

sed -E '/^$/ d' < file

You can have it include tabs and spaces however, like this. The ‘[[:blank:]]’ is a character class that contains the tab and space characters, so this looks for lines that have either nothing or only tabs and spaces and then deletes them.

sed -E '/^[[:blank:]]*$/ d' <file

And to follow through for our above example,

echo ".                          

A" | sed -E '/^[[:blank:]]*$/ d'

results in:

.
A

So, to wrap that up in an applescript,

-- sed -E '/^[[:blank:]]*$/ d'
set xy to do shell script "echo \"" & ab & "\"" & " | sed -E '/^[[:blank:]]*$/ d' "

… will run these shell commands:
[format]"echo ".

A" | sed -E '/^[[:blank:]]*$/ d' "[/format]
… and result in this:

[format]".
A"[/format]

t.spoon · October 21, 2021, 4:31pm

Great answer from Mockman.

I just wanted to add that, if in prior attempts you were only looking for “\n\n”, that can fail not only if there are blank characters between the newlines, as Mockman pointed out, but also if the document is using different characters for a newline. There’s “linefeed”, which is “\n”, but there’s also “carriage return”, which is “\r”, and to make it extra confusing, sometimes text editors use the two in conjunction, “\r\n”, to specify a single new line.
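One way to sidestep the line-ending question in the shell is to turn every carriage return into a linefeed before deleting blank lines. A minimal sketch, assuming it is acceptable for the output to use linefeeds only; `strip_blank_any_eol` is a hypothetical helper name.

```shell
#!/bin/sh
# strip_blank_any_eol: hypothetical helper name.
# Every CR becomes an LF. A CRLF pair therefore turns into two LFs,
# i.e. one extra empty line, but the sed step deletes empty lines anyway.
strip_blank_any_eol() {
    tr '\r' '\n' | sed -E '/^[[:blank:]]*$/d'
}

# Mixed input: CRLF endings, an empty CRLF line, a lone CR, and a plain LF.
printf 'one\r\n\r\ntwo\rthree\n' | strip_blank_any_eol
```

With the sample input this leaves the three lines one, two and three, each terminated by a linefeed.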

EDIT: that’s interesting that something in Mockman’s post appears to change the default font on the rest of this page on macscripter…

wch1zpink · October 21, 2021, 4:39pm

This seems to work for me in Terminal.

[format]cat path/to/your/file.txt | grep '.'[/format]

activate
set theFile to POSIX path of ((choose file) as text)

set theResults to paragraphs of (do shell script "cat " & quoted form of theFile & " | grep '.'")

emendelson · October 21, 2021, 5:19pm

Many thanks for all these extremely helpful answers.

wch1zpink, your terminal script works perfectly in my application. Thank you again! I was hoping for a script that would change the existing file, but it’s trivial to send the output to a temporary file and then replace the original file with the edited one.

This forum is a superb resource, as always. Thanks again to everyone.

wch1zpink · October 21, 2021, 6:02pm

If you want to edit and overwrite the existing file, this following shell command should accomplish that for you.

[format]pbcopy < path/to/your/file.txt ; pbpaste | grep '.' > path/to/your/file.txt[/format]
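If overwriting the clipboard is a concern, the temporary-file route mentioned earlier in the thread is an alternative. The following is a sketch rather than a tested drop-in: `remove_blank_lines_in_place` and the `noblank.XXXXXX` template are made-up names.

```shell
#!/bin/sh
# remove_blank_lines_in_place: hypothetical helper name.
# Filters the file into a temporary file, then copies the result back.
remove_blank_lines_in_place() {
    file=$1
    tmp=$(mktemp "${TMPDIR:-/tmp}/noblank.XXXXXX") || return 1
    grep '.' "$file" > "$tmp"
    status=$?
    # grep exits with 1 when no line matched (the file held only empty
    # lines); that is a valid result. Anything above 1 is a real error.
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp"
        return "$status"
    fi
    # Writing back with cat keeps the original file's permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage: remove_blank_lines_in_place path/to/your/file.txt
```

From AppleScript, the function body could be passed to `do shell script` with the path supplied through `quoted form of`.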

KniazidisR · October 21, 2021, 6:22pm

It seems to me, it isn’t so simple when the original text includes characters from set {character id 11, character id 12, character id 133, character id 8232, character id 8233}


set theText to "Hello," & character id 8232 & character id 8232 & "world!"
return paragraphs of theText

--> "Hello,
-->
--> world!"

--> INSTEAD OF CORRECT:
--> "Hello,"
--> "world!"

Mockman · October 21, 2021, 6:54pm

I noticed this too. Weirdly, it took place for me while editing. At one point, I was logged out, presumably due to time. I copied the partial answer I’d written, reloaded the page, logged back in and pasted the text back in. I think (but can’t be certain) that after previewing the answer again, the font went to monospace for some parts.

Right now, the font change begins with the line that opens with ‘You can have it include….’ The code above it doesn’t seem likely to cause such a change. All very strange but it seems fine now.

Update: Seems I spoke prematurely. While editing, the posts below mine were normal but after submitting, they went back to monospace. If anyone notices any unclosed quotes or what have you in any code on this page, let me know and I’ll try to fix it.

KniazidisR · October 21, 2021, 7:00pm

@Mockman, most likely in post #2 you have formatted something using [format] format tags of the site[/format]. Open your post for editing to see this formatting. Remove the format tags, and use AppleScript tags instead. Or, simply make your text bold.

Mockman · October 21, 2021, 7:25pm

Well, technically, that was it. As far as I could tell, I had a closing tag for each open tag. I replaced the instance immediately before the font change and the regular text after returned to normal. However, from the next ‘format’ tag onward, the monospace persisted. So I replaced a few more and now it seems fine. I didn’t replace them all… indeed, there were two sets of format tags above the changed fonts which had no negative effect (and they’re still there, along with two at the tail end of the post). Anyway, glad it is resolved.

Thanks t.spoon for pointing it out and the overly kind words, and thanks KniazidisR for suggesting a fix.

Mockman · October 21, 2021, 7:48pm

This is true, but it has always been the case that you can get odd characters. If you look carefully with tools such as ‘character id’ and ‘id’, you can identify anomalies and then work towards a solution. On the positive side, these characters are relatively uncommon.

As for a fix, the first three are members of the [[:cntrl:]] class, and as such it is simple enough to include them in the sed search pattern.
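A minimal sketch of such a pattern, with `strip_blank_cntrl` as a made-up helper name. The sample only exercises the two ASCII characters (ids 11 and 12); whether id 133 is matched by [[:cntrl:]] depends on the locale and the sed build, so that part is an assumption to verify on your own system.

```shell
#!/bin/sh
# strip_blank_cntrl: hypothetical helper name.
# Deletes lines that are empty or hold only blanks and control characters.
strip_blank_cntrl() {
    sed -E '/^[[:blank:][:cntrl:]]*$/d' "$@"
}

# Sample: a line holding only a vertical tab (id 11, octal 013) and a
# form feed (id 12, octal 014) sits between "a" and "b".
printf 'a\n\013\014\nb\n' | strip_blank_cntrl
```

Only the lines a and b remain.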

The latter two are quite rare and unfortunately, don’t seem to belong to any class. However, if they are a practical issue for someone, it’s fairly straightforward to replace them as literals.
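As a sketch of the literal replacement, the two characters can be written as their UTF-8 byte sequences and swapped for linefeeds before the usual sed step. This assumes the file is UTF-8 encoded; `unicode_breaks_to_lf` is a hypothetical helper name.

```shell
#!/bin/sh
# UTF-8 byte sequences, written as octal escapes so the script stays ASCII:
#   character id 8232 (U+2028 LINE SEPARATOR)      = E2 80 A8
#   character id 8233 (U+2029 PARAGRAPH SEPARATOR) = E2 80 A9
LS=$(printf '\342\200\250')
PS=$(printf '\342\200\251')

# unicode_breaks_to_lf: hypothetical helper name.
# Replaces each literal separator with a linefeed.
unicode_breaks_to_lf() {
    awk -v ls="$LS" -v ps="$PS" '{ gsub(ls, "\n"); gsub(ps, "\n"); print }'
}

# "Hello," + two line separators + "world!"
printf 'Hello,\342\200\250\342\200\250world!\n' |
    unicode_breaks_to_lf |
    sed -E '/^[[:blank:]]*$/d'
```

With the sample input, the result is the two lines Hello, and world! with nothing in between.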

Marc_Anthony · October 21, 2021, 10:19pm

Another straight-from-file option would be to use awk for the read and tr to condense duplicates.

do shell script "awk 'BEGIN{ RS = \"\\r\" } {print $0}' " & (choose file)'s POSIX path's quoted form & " | tr -s " & (linefeed's quoted form)

Edit:
After consideration, my initial effort is in error, as it inefficiently involves two programs—awk and tr; with a slight modification, either can condense back-to-back line returns—i.e. “empty lines”—on their own. There are certainly benefits from other posters’ approaches, if you might have other gremlins in there that give a blank appearance.

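-- print only lines whose length is greater than zero
do shell script "awk 'length($0) > 0' " & (choose file)'s POSIX path's quoted form

-- squeeze runs of linefeeds only; '[:space:]' would also collapse repeated spaces and tabs inside lines
do shell script "tr -s '\\n' < " & (choose file)'s POSIX path's quoted form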
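If the "gremlins" are just spaces and tabs, awk can handle them on its own as well. With the default field separator, a line made up only of spaces and tabs has zero fields, so the bare pattern `NF` is false for it. A small sketch, with `drop_blank_awk` as a made-up helper name:

```shell
#!/bin/sh
# drop_blank_awk: hypothetical helper name.
# NF is the number of fields on the line; 0 counts as false, so empty and
# whitespace-only lines are not printed.
drop_blank_awk() {
    awk 'NF' "$@"
}

# Sample: an empty line and a space+tab line sit between "a" and "b".
printf 'a\n\n \t\nb\n' | drop_blank_awk
```

Only the lines a and b are printed.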

KniazidisR · October 22, 2021, 4:16am

No, I still won’t calm down. I would rather not search manually for the characters in the set listed above, so I wrote a script that removes blank lines more reliably than the 3 solutions above. Also, the echo command has an unwanted text length limitation, so I only recommend the 2nd and 3rd approaches from @wch1zpink and @Marc Anthony in the following hybrid script:


-- script: Remove empty lines

set lineChangingChars to ¬
	{character id 11, character id 12, character id 133, character id 8232, character id 8233}

set aFile to choose file
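-- Read as UTF-8 so that multi-byte characters such as id 8232 are decoded correctly
set originalText to read aFile as «class utf8»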

set ATID to AppleScript's text item delimiters
set AppleScript's text item delimiters to lineChangingChars
set originalText to text items of originalText
set AppleScript's text item delimiters to linefeed
set originalText to originalText as text
set AppleScript's text item delimiters to ATID

my write_UTF8_file(aFile, originalText)
do shell script "awk 'BEGIN{ RS = \"\\r\" } {print $0}' " & ¬
	aFile's POSIX path's quoted form & " | tr -s " & (linefeed's quoted form)

to write_UTF8_file(the_file, the_text)
	set file_ID to open for access the_file with write permission
	-- Clear any existing content if the file already exists:
	set eof file_ID to 0
	-- Add the UTF-8 encoded text:
	write the_text to file_ID as «class utf8»
	close access file_ID
end write_UTF8_file

Nigel_Garvey · October 22, 2021, 11:03am

Hi.

As has been noted further up-topic, one has to know exactly what’s meant by “blank lines” before one can write a script to fix them. Are they just lines with nothing in them? Might they contain invisible characters? Is vertical formatting to be treated the same as line endings in the text? It’s also not clear why a shell script’s been particularly requested.

In the spirit of KniazidisR’s zap-all-possibilities solution, here’s some ASObjC code which does the same thing. In ICU regex, the metacharacter \R matches LF, CR, CRLF, and any of the characters in KniazidisR’s lineChangingChars list. The code here replaces any run consisting of one of these characters followed (possibly repeatedly) by optional horizontal white space and another vertical character with the first character in that run.

use AppleScript version "2.4" -- OS X 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

set filePath to POSIX path of (choose file of type {"txt"})

set {theText, encoding} to current application's class "NSMutableString"'s stringWithContentsOfFile:(filePath) ¬
	usedEncoding:(reference) |error|:(missing value)

tell theText to replaceOccurrencesOfString:("(\\R)(?:\\h*+\\R)++") withString:("$1") ¬
	options:(current application's NSRegularExpressionSearch) range:({0, theText's |length|()})

set newFilePath to filePath -- Edit if and as required.
tell theText to writeToFile:(newFilePath) atomically:(true) encoding:(encoding) |error|:(missing value)

KniazidisR · October 22, 2021, 1:37pm

This solution is better than mine, as it only writes to the file once. In my solution, writing to the file must be done twice to finally replace the content.

emendelson · October 23, 2021, 12:09am

Nigel Garvey, a thousand thanks. That’s a terrific script. I also admire Marc Anthony’s one-line shell script, and am grateful for all other contributions. Now to experiment with Nigel Garvey’s and Marc Anthony’s scripts, though I expect that both will work perfectly in my script.

For information’s sake: my script launches the SheepShaver emulator program and makes changes in its plain-text preferences file. Earlier versions of my scripts added blank lines to the preferences file because I was incompetent. I want the updated version of the script to clear out the mistakes made by earlier versions.

Thanks again to everyone.

peavine · April 13, 2022, 1:34pm

This thread contains many helpful scripts, but it doesn’t include one that uses basic AppleScript. So, I’ve included a suggestion below. It removes paragraphs that are blank and that contain white space characters only.

I ran some timing tests with a string that contained 5,185 paragraphs, and Nigel’s suggestion was fastest at 3 milliseconds (if the Foundation framework is in memory). The other suggestions including mine took from 9 to 15 milliseconds. All of these timing results include reading and processing but not writing the text.

set theFile to (choose file)
set theText to (read theFile as «class utf8»)
set trimmedText to getTrimmedText(theText)

on getTrimmedText(theText)
	script o
		property trimmedText : (paragraphs of theText)
	end script
	ignoring white space
		repeat with i from 1 to (length of o's trimmedText)
			if item i of o's trimmedText is equal to "" then set item i of o's trimmedText to missing value
		end repeat
	end ignoring
	set {TID, text item delimiters} to {text item delimiters, linefeed}
	set trimmedText to text of o's trimmedText as text
	set text item delimiters to TID
	return trimmedText
end getTrimmedText

emendelson · April 15, 2022, 9:38pm

@peavine, This is really excellent: straightforward and fast. Thank you! I’m using Nigel Garvey’s script in my application but I’m very glad to have this alternative for future use.