How can I alter this regex to also remove all carriage returns?

I’m using this regex to clean up some body text grabbed from some emails.

set clean_text to do shell script "echo " & quoted form of alertcon & "|sed \"s/[^[:alnum:][:space:]]//g\""

It does a pretty good job within my script, but there’s unwanted space left where an image is removed. How can I alter that regex part so it also removes all carriage returns?

For example, turn this:


Sample text here.

End of sample text.


To this:


Sample text here. End of sample text.


Thank you! :slight_smile:

Hello.

The easiest thing would be to append the tr command to your shell script like this.

set clean_text to do shell script "echo " & quoted form of alertcon & "|sed \"s/[^[:alnum:][:space:]]//g\" |tr -d '\\r'"

You said carriage returns and not newlines, so carriage returns are the only characters that get removed; any empty lines in the scraped text are left alone.
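To make the difference between the two characters concrete, here is a minimal sketch in plain shell, outside AppleScript. The sample text and the helper name strip_cr are made up for illustration, and the sample assumes CRLF line endings, which may not be what your mail contains.

```shell
# Hypothetical helper: delete carriage returns (CR, \r) and nothing else.
strip_cr() {
    tr -d '\r'
}

# Made-up sample with CRLF line endings and one blank line in the middle.
# printf is used because it expands \r and \n the same way in every shell.
printf 'Sample text here.\r\n\r\nEnd of sample text.\r\n' | strip_cr
```

The carriage returns are gone after this, but every newline (\n) is still there, so the blank line between the two sentences stays in the output.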

Hi.

Another way:

set clean_text to do shell script "echo " & quoted form of alertcon & "|sed \"s/\"$'\\r'\"/ /g ; s/[^[:alnum:][:space:][:punct:]]//g\""

Edit: Not what I originally posted. :rolleyes:

:confused: Whoops! Thank you for correcting me.

I didn’t know the proper way to describe the newline or empty lines. It’s not carriage returns and the empty space is still there after applying the new regex. I should have said I want to remove empty lines or newlines from the text.

How would I do that instead? That is, remove both carriage returns AND empty lines/newlines?

What’s the difference between a gap resulting from the removal of a picture (which you want to remove) and a gap between paragraphs (which presumably you want to keep)?

I changed the script to remove empty lines. Maybe that works for you, or you’ll reconsider after reading Nigel’s question. You can’t have both! :slight_smile:

Here is the changed version, with some demo text, before and after. I have newlines as line endings in Script Debugger; you may get a different result if you use AppleScript Editor with line endings set to carriage return, so please try it with some text from disk.

set alertcon to "This 

is some text 

that should be cleansed " & return & " and fine."
set clean_text to do shell script "echo " & quoted form of alertcon & "|sed -e 's/[^[:alnum:][:space:]]//g' -e '/^$/ d' |tr -d '\\r'"
# "This 
# is some text 
#  that should be cleansed  and fine"
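One thing to watch in this pipeline: sed runs before tr, so a line that holds nothing but a carriage return is not yet empty when /^$/ d looks at it. The sketch below compares the two possible orders in plain shell. The function names and the sample text are made up, and it assumes CRLF line endings, which may not match your mail.

```shell
# Order used above: sed first, then tr.
clean_sed_first() {
    sed -e 's/[^[:alnum:][:space:]]//g' -e '/^$/d' | tr -d '\r'
}

# Alternative order: strip carriage returns first, so a line holding only
# a CR is really empty by the time sed checks /^$/.
clean_tr_first() {
    tr -d '\r' | sed -e 's/[^[:alnum:][:space:]]//g' -e '/^$/d'
}

# Made-up sample with CRLF line endings and a blank line in the middle.
sample() {
    printf 'This\r\n\r\nis some text.\r\n'
}

sample | clean_sed_first | wc -l   # line count left by the order used above
sample | clean_tr_first | wc -l    # line count left by the alternative order
```

With this sample the first order keeps the blank line and the second removes it. If the text never contains a carriage return at the end of a line, both orders give the same result.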


That did it! Thank you! Now the space where the image was is gone. Perfect!

Thank you so much!

I’m not sure what was causing the gap where the picture was in the first place. What I did was write an AppleScript that runs when I get email (via an Apple Mail rule); it takes the sender, subject and part of the body and sends them via iChat to an old cell phone, to get around an issue with the cell phone service provider. I need the text body to be less than 180 characters, so I fixed that as well. Finally, I made an AppleScript that turns on the rule that runs the script when I’m away from the computer; the rule also uses “Stop evaluating rules” so everything temporarily goes to the inbox. Then there’s another script to turn off the rule when I get back to the computer and no longer need the script to run. Both are Actions in LaunchBar, so it’s trivial to do.

Anyway, everything was working perfectly until I got emails with inline images in them, which would break the script so that nothing happened. So I experimented with having the script look for attachments and with that regex I found elsewhere, and it solved the problem, but there was a lot of space left over in the body text where the image had been removed, which wasted space on my old cell phone with its tiny screen.

In case you’re curious, this is the final AppleScript I patched together, which works perfectly now:

tell application "iChat"
	activate
end tell
tell application "Mail"
	activate
	set theSelection to (get every message of inbox whose read status is false)
	set theMessage to item 1 of theSelection
	set alertsend to sender of theMessage
	set alertsub to subject of theMessage
	set alertcon to content of theMessage
	-- Keep the body under 180 characters; shorter text is left as it is.
	if length of alertcon ≥ 180 then set alertcon to (text 1 thru 179 of alertcon) as string
	set alert to alertsend & alertsub & alertcon	
	if (every mail attachment of theMessage) ≠ {} then
		set clean_text to do shell script "echo " & quoted form of alertcon & "|sed -e 's/[^[:alnum:][:space:]]//g' -e '/^$/ d' |tr -d '\\r'"		
		set alert to alertsend & alertsub & clean_text
	end if
	set read status of theMessage to true
	tell application "iChat"
		activate
		send alert to buddy "+18001234567" of service 1
	end tell
end tell

Granted, this is an absolute Frankenstein I patched together, so I’m positive there are many things wrong with the way I’ve done this, but it works perfectly now.

Thanks again for your response/help.

Ah. So you weren’t interested in keeping gaps between paragraphs anyway. And it links in with why you were happy for your regex to zap punctuation.

The code I originally posted in post #3 did away with the “tr” command by simply changing [:space:] to [:blank:], which would look like this applied to McUsrII’s second offering:

set clean_text to do shell script "echo " & quoted form of alertcon & "|sed -e 's/[^[:alnum:][:blank:]]//g' -e '/^$/ d'"

Or:

set clean_text to do shell script "echo " & quoted form of alertcon & "|sed 's/[^[:alnum:][:blank:]]//g ; /^$/ d'"
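The reason the tr step becomes unnecessary is that [:space:] includes the carriage return while [:blank:] covers only space and tab, so with [:blank:] the carriage return falls into the negated class and is deleted by the substitution itself. A minimal sketch in plain shell, with hypothetical function names and made-up sample text that assumes CRLF line endings:

```shell
# [:space:] includes the carriage return, so the substitution keeps it and
# a line holding only a CR is not empty when /^$/ is checked.
space_class() {
    sed 's/[^[:alnum:][:space:]]//g ; /^$/d'
}

# [:blank:] covers only space and tab, so the CR is deleted along with the
# punctuation and the now-empty line is dropped by /^$/d.
blank_class() {
    sed 's/[^[:alnum:][:blank:]]//g ; /^$/d'
}

# Made-up sample with CRLF line endings and a blank line in the middle.
sample() {
    printf 'Sample text here.\r\n\r\nEnd of sample text.\r\n'
}

sample | space_class | wc -l   # line count with [:space:]
sample | blank_class           # cleaned text with [:blank:]
```

With this sample the [:blank:] version leaves just the two lines of text, while the [:space:] version still carries the carriage returns and the line between them.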

Someone I know has to read the re_format man page more carefully, as [:blank:] apparently never sank in.

Thanks Nigel

Edit

I like the script too. It is one smart script that, if it doesn’t beat the “system”, at least circumvents it. :slight_smile:

Thank you Nigel & McUsrII for your help!