How to write an applescript to match strings in 2 TextEdit documents?

Hi,

Here is my scenario. I have two TextEdit documents which each contain a series of email addresses, with one email address per line. File A has a very long list of email addresses. File B has a shorter list. For each email address in File A I would like the script to check if the same email address exists in File B, and if it does, delete it from File A.

Would it be straightforward to do this with AppleScript, and if so could someone help get me started?

Thanks,

Nick

Try:

set longFile to POSIX path of ((path to desktop as text) & "long.txt")
set newLongFile to POSIX path of ((path to desktop as text) & "newLong.txt")
set shortFile to POSIX path of ((path to desktop as text) & "short.txt")

set longList to every paragraph of (do shell script "cat " & quoted form of longFile)
set shortList to every paragraph of (do shell script "cat " & quoted form of shortFile)

set newLong to {}
repeat with anEmail in longList
	set anEmail to contents of anEmail
	if anEmail is not in shortList then set end of newLong to anEmail & return
end repeat

do shell script "echo " & quoted form of (newLong as text) & " > " & quoted form of newLongFile
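
For comparison, here is a minimal shell sketch of the same filtering. It assumes plain-text files with linefeed line endings and one address per line; the function name prune_list and the temporary file names are made up for this illustration. grep is asked for the lines of the long file that do not exactly match any whole line of the short file, so the original order of the long file is kept.

```shell
# prune_list LONG SHORT
# Print the lines of LONG that are not whole-line matches of any line in SHORT.
#   -F  treat patterns as fixed strings, not regular expressions
#   -x  a pattern must match the whole line
#   -v  invert the match (keep the lines that do NOT match)
#   -f  read the patterns from a file
prune_list() {
    grep -v -x -F -f "$2" "$1"
}

# Demo with throwaway files (hypothetical names).
tmp=$(mktemp -d)
printf '%s\n' a@gmail.com b@gmail.com c@gmail.com d@gmail.com > "$tmp/long.txt"
printf '%s\n' d@gmail.com c@gmail.com > "$tmp/short.txt"
prune_list "$tmp/long.txt" "$tmp/short.txt" > "$tmp/newLong.txt"
cat "$tmp/newLong.txt"   # prints a@gmail.com and b@gmail.com
```

If this is called through do shell script, keep in mind that grep exits with status 1 when it prints nothing, and do shell script reports a non-zero exit status as an error.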

Thanks very much adayzdone. That is working fine.

Nick

Hello.

I just came up with this to provide some variation. :slight_smile: This solution leverages the diff command, which lists the entries that are unique to each of the two files. The entries unique to the long file are then filtered out with a sed command and written to a new file.


set longFile to POSIX path of ((path to desktop as text) & "Long.txt")
set newLongFile to POSIX path of ((path to desktop as text) & "NewLong.txt")
set shortFile to POSIX path of ((path to desktop as text) & "Short.txt")

set difflist to do shell script "diff " & quoted form of longFile & " " & quoted form of shortFile & "|sed -n 's/^< \\(.*\\)/\\1/p'" # without altering line endings

do shell script "cat >" & quoted form of newLongFile & " <<<" & quoted form of difflist

McUsr, your script will not work because diff compares the files line by line, in order:

Long.txt =
a@gmail.com
b@gmail.com
c@gmail.com
d@gmail.com

Short.txt =
d@gmail.com
c@gmail.com

McUsr’s
NewLong.txt =
a@gmail.com
b@gmail.com
d@gmail.com

NewLong.txt SHOULD =
a@gmail.com
b@gmail.com
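
The counter-example above comes down to ordering: diff and comm both expect their two inputs to be in the same order. A minimal shell sketch, assuming linefeed line endings and one address per line (the function name sorted_difference is made up for this illustration), sorts both files first and then lets comm print only the lines unique to the long file. Note that the output is sorted, not in the original order.

```shell
# sorted_difference LONG SHORT
# Print the lines that occur only in LONG. Both inputs are sorted first,
# because comm requires sorted input; the result is therefore sorted too.
sorted_difference() {
    work=$(mktemp -d)
    sort "$1" > "$work/long.sorted"
    sort "$2" > "$work/short.sorted"
    # -2 suppresses lines unique to SHORT, -3 suppresses lines common to both.
    comm -23 "$work/long.sorted" "$work/short.sorted"
    rm -rf "$work"
}

# Demo with the data from the counter-example (throwaway file names).
demo=$(mktemp -d)
printf '%s\n' a@gmail.com b@gmail.com c@gmail.com d@gmail.com > "$demo/Long.txt"
printf '%s\n' d@gmail.com c@gmail.com > "$demo/Short.txt"
sorted_difference "$demo/Long.txt" "$demo/Short.txt"   # prints a@gmail.com and b@gmail.com
```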

:slight_smile: I tested it on sorted items, what can I say?

I should have sorted the short file, I have done so now, and it should work properly now.

set longFile to POSIX path of ((path to desktop as text) & "Long.txt")
set newLongFile to POSIX path of ((path to desktop as text) & "NewLong.txt")
set newShortFile to POSIX path of ((path to desktop as text) & "NewShort.txt")
set shortFile to POSIX path of ((path to desktop as text) & "Short.txt")

set difflist to do shell script "sort -o " & quoted form of newShortFile & " " & quoted form of shortFile & " ; sort -o " & quoted form of newLongFile & " " & quoted form of longFile & " ; diff " & quoted form of newLongFile & " " & quoted form of newShortFile & "|sed -n 's/^< \\(.*\\)/\\1/p'" # without altering line endings

do shell script "rm " & quoted form of newShortFile & " ; cat >" & quoted form of newLongFile & " <<<" & quoted form of difflist

Edit

Now the list is sorted, so it may not be as good a solution.

This is another version, maybe even clumsier, because at times I have problems getting sed’s -i (in-place editing) option to work; hopefully it isn’t just me.

I sort lexically as before, then I use the comm utility to return the column of lines that are the same in both files. I feed that output to sed, which generates a sed script that I then apply to delete the non-unique mail addresses from the “long file”, and I write the pruned mailing list back to the “long file”.

It should be fairly fast and robust, even though there are now many files involved, though I am not sure of any benefits of this exercise besides my own amusement. :slight_smile:


# temporary files
set sedFile to (do shell script "mktemp /tmp/mailListSort.XXXXXX")
set newLongFile to (do shell script "mktemp /tmp/mailListSort.XXXXXX")
set newShortFile to (do shell script "mktemp /tmp/mailListSort.XXXXXX")

set longFile to quoted form of POSIX path of ((path to desktop as text) & "Long.txt")
set shortFile to quoted form of POSIX path of ((path to desktop as text) & "Short.txt")

set theRes to do shell script "sort -o " & newShortFile & " " & shortFile & " ; sort -o " & newLongFile & " " & longFile & " ; comm -i -1 -2  " & newLongFile & " " & newShortFile & " |sed -n 's_\\(.*\\)_/\\1\\/ d_p ' >" & sedFile & " ; sed  -f " & sedFile & " " & longFile without altering line endings

try
	do shell script "cat <<< " & quoted form of theRes & " >" & longFile
end try

Edit

This article describes the problem: “sed in-place editing (-i) difference on Mac OS X: ‘undefined label’ errors” (MPD). It provides no solution, however, for when you are using a -f sed script file.

Just deleting the lines directly is of course also an option. It will be better the smaller the long file is, but I guess the list has to reach some size before the two sorts and the invocation of the comm tool pay off. :slight_smile:


# temporary file
set sedFile to (do shell script "mktemp /tmp/mailListSort.XXXXXX")

set longFile to quoted form of POSIX path of ((path to desktop as text) & "Long.txt")
set shortFile to quoted form of POSIX path of ((path to desktop as text) & "Short.txt")

set theRes to do shell script "cat " & shortFile & "|sed -n 's_\\(.*\\)_/\\1\\/ d_p ' >" & sedFile & " ; sed  -f " & sedFile & " -i '' " & longFile without altering line endings
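
One thing to watch with a generated sed script of the form /address/ d: the address is used as an unanchored regular expression, so the dot matches any character and a@gmail.com would also delete aa@gmail.com. A hedged sketch of a stricter generator follows, assuming linefeed line endings and one address per line; the function names make_delete_script and prune_with_sed are made up for this illustration. It escapes the basic-regex metacharacters and anchors each pattern to the whole line.

```shell
# make_delete_script SHORT
# Emit one anchored sed delete command per line of SHORT.
make_delete_script() {
    # 1st expression: backslash-escape ] [ \ / . * ^ $ in the address.
    # 2nd expression: wrap the escaped address as /^address$/d.
    sed -e 's|[][\/.*^$]|\\&|g' -e 's|.*|/^&$/d|' "$1"
}

# prune_with_sed LONG SHORT
# Print LONG without the lines that exactly equal a line of SHORT.
prune_with_sed() {
    script=$(mktemp)
    make_delete_script "$2" > "$script"
    sed -f "$script" "$1"
    rm -f "$script"
}

# Demo with throwaway files: only the exact address is removed.
t=$(mktemp -d)
printf '%s\n' aa@gmail.com a@gmail.com a@gmailxcom > "$t/Long.txt"
printf '%s\n' a@gmail.com > "$t/Short.txt"
prune_with_sed "$t/Long.txt" "$t/Short.txt"   # prints aa@gmail.com and a@gmailxcom
```

This version prints the result instead of editing in place, which sidesteps the -i differences between sed implementations; redirect the output to a new file and move it over the original if in-place behaviour is wanted.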

McUsr is becoming a Sed junkie :slight_smile:

Actually not, but it is a practical tool. I thought later, it all depends on the size of the list, in this particular case.

And if I am, there are reasons for that. Sed is a tool that requires practice, and there is a why involved here: why would I take the pain of learning sed thoroughly?

The answer is not that I am a bit-fiddler by nature, but that sed is a very versatile tool that can do lots of amazing things with text files, things that are really a pain with AppleScript, lacking regular expressions and all.

Sed opens up a whole new world of programming against text files, and of organizing them privately, without the full-blown pain that Perl possesses. There are few commands and few side effects in sed. Perl has lots.

But sed does allow me to edit the big list from the little list, and sift out any duplicate items in the big list. Now, you could do that easily with text item delimiters, using each and every mail address in the small list as a text item delimiter for the big list to sift them out.

That is actually the ideal solution here, as long as the list is smaller than 16300 mail addresses, but I think it will become slower than the sed solution some time before that.

Another solution would be to use the bsearch handler of Nigel Garvey for unsorted lists, and delete items found in the big list.

This really taught me again that it is best to work from the small list towards the big list, in order to reduce the number of operations.

I would avoid telling the OP which solution is ideal. The best solution will be the one that accomplishes the task in a way that the user understands.

I really didn’t tell the OP which solution was ideal. I told you, from a data-processing perspective. Let me add that I followed this theme because it is an interesting general problem: we often want to keep two or more lists, implemented as text files, in sync with each other, either so that each contains only unique items or so that the same items exist in both.

For the OP, the best solution is of course the one he settles for, and as I remember, he had already settled for your solution before I followed up on it. :slight_smile:

Agreed. I also like DJ Bazzie Wazzie’s method of combining two lists:


--NOTE: Only works with lists containing strings and not containing linefeeds. 
set listA to {"A1", "A2", "A3"}
set listB to {"B1", "B2", "A3"}

set AppleScript's text item delimiters to linefeed
set newList to every paragraph of (do shell script "sort -fu <<< " & quoted form of ((listA as string) & linefeed & listB as string))
set AppleScript's text item delimiters to ""

return newList

I wish MacScripter had a chat feature or an IRC channel to discuss general interest topics so we didn’t pollute the threads. I guess that is what code exchange is for…

I disagree about the IRC channel, and about the pollution. I often learn things, not only from my mistakes :slight_smile: but also from the different solutions and such. I think you do too, and everybody else as well. Not every theme gets born in Code Exchange; that is some of the charm of this forum, if you ask me. :slight_smile:

The merging was nice, for its purposes. The new list is left sorted, that is, in a different state. Another way to do it would be to just insert the differing items at the top of one of the lists.

The main quality of your solution above, as I see it, was that it left the list in the same state as it was.

If you don’t mess up the order of a list in order to process it, then you leave the user free to organize the data as he or she sees fit.

The mail addresses could for instance have been organized in different sections, with say a setext-style heading between them telling the category. Like this:

Class of 98
≈≈≈≈≈≈≈≈≈
a@gmail.com
b@gmail.com
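
For the merging case discussed above, a small shell sketch of an order-preserving alternative to sort -fu, assuming linefeed line endings; the function name merge_keep_order is made up for this illustration. awk prints a line only the first time it sees it, so both lists keep their original order. Unlike sort -f, the comparison here is case-sensitive.

```shell
# merge_keep_order FILE...
# Concatenate the files and drop repeated lines, keeping first occurrences
# in their original order. seen[$0]++ is 0 (false) the first time a line
# appears, so the negation prints it exactly once.
merge_keep_order() {
    awk '!seen[$0]++' "$@"
}

# Demo with the two lists from the sort -fu example (throwaway file names).
m=$(mktemp -d)
printf '%s\n' A1 A2 A3 > "$m/listA.txt"
printf '%s\n' B1 B2 A3 > "$m/listB.txt"
merge_keep_order "$m/listA.txt" "$m/listB.txt"   # prints A1 A2 A3 B1 B2, one per line
```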

Don’t forget that when coercing a list to text, AppleScript’s text item delimiters should properly be set explicitly to a known value first:

-- Top 5 lines of script here.

set newLong to {}
repeat with anEmail in longList
	set anEmail to contents of anEmail
	if anEmail is not in shortList then set end of newLong to anEmail & return
end repeat

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to ""
set newLong to newLong as text
set AppleScript's text item delimiters to astid

do shell script "echo " & quoted form of newLong & " > " & quoted form of newLongFile

Or of course the ‘return’ concatenations could be omitted in the repeat and ‘return’ used as the delimiter instead.

An alternative to building the new list text-by-text in response to addresses not matching is to replace the minority of addresses which do match with items of a different class and then exclude them from the new list:

-- Top 5 lines of script here.

repeat with anEmail in longList
	if (anEmail is in shortList) then set anEmail's contents to missing value
end repeat
set newLong to longList's text

-- Then coerce newLong to text with a 'return' (or 'linefeed') delimiter.

Great point. You can’t assume that tid are set to {“”}.

This is an alternative to Nigel’s last script. It should be somewhat faster, but it is not fully implemented: it only covers the pruning of the long list, which in this example is presumed to be text all the way. (Only shortList is in list form.)

The eminent handler is made by Nigel Garvey. :slight_smile:


set theLonglistAsText to "the long list of email addresses, as text"

repeat with anEmail in shortList
	set theLonglistAsText to deleteLinesFromText(theLonglistAsText, contents of anEmail)
end repeat


on deleteLinesFromText(theText, deletePhrase)
	-- http://macscripter.net/viewtopic.php?id=37830 NG
	set newText to {}
	
	-- Get the text items of the text using the deletePhrase as a delimiter.
	set astid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to deletePhrase
	set textItems to theText's text items
	
	set textItemCount to (count textItems)
	if (textItemCount > 1) then
		-- The phrase was in the text. Collect the text items except for the bits of the paragraphs it was in.
		tell beginning of textItems to if ((count each paragraph) > 1) then set end of newText to text 1 thru paragraph -2
		repeat with i from 2 to textItemCount - 1
			tell item i of textItems to if ((count each paragraph) > 1) then set end of newText to text from paragraph 2 to paragraph -2
		end repeat
		tell end of textItems to if ((count each paragraph) > 1) then set end of newText to text from paragraph 2 to -1
		set AppleScript's text item delimiters to return
		set newText to newText as text
	else
		-- The phrase is not in the text. Return the text unchanged.
		set newText to theText
	end if
	set AppleScript's text item delimiters to astid
	
	return newText
end deleteLinesFromText

McUsrII? Have you supplanted yourself!? :wink:

If you’re going to use a TIDs-based method, why not this?

set longText to "c@gmail.com
b@gmail.com
d@gmail.com
a@gmail.com
h@gmail.com"
set shortText to "d@gmail.com
c@gmail.com
g@gmail.com"

set astid to AppleScript's text item delimiters
set AppleScript's text item delimiters to return & linefeed
set longText to linefeed & longText's paragraphs & return
set AppleScript's text item delimiters to return & character id 0 & linefeed
set shortText to linefeed & shortText's paragraphs & return
set AppleScript's text item delimiters to character id 0
set AppleScript's text item delimiters to shortText's text items
set newLong to longText's text items
set AppleScript's text item delimiters to ""
set newLong to newLong as text
set AppleScript's text item delimiters to return -- or linefeed if you prefer.
set newLong to text from word 1 to word -1 of (newLong's paragraphs as text)
set AppleScript's text item delimiters to astid
newLong

I wrote this login script for Macscripter, and lost the password, then it turns up that I hadn’t changed my email address. I accept it, if I have to get used to this. :slight_smile:

There is just one problem with your last, otherwise perfect, solution: what if there are leading or trailing blanks?
If there are leading or trailing blanks in shortText, then nothing will be sifted out for a mail address that in reality was in longText. It doesn’t matter whether there are leading or trailing blanks in longText, but if there are, you’ll be left with a paragraph containing blanks.

I guess you could have used every word of shortText to adjust for the spaces, but then there is the ‘@’! :smiley:

Well, filtering the short list for blanks would make for the ideal solution, with speed beyond any comparison. There aren’t many things in scripting that beat this in terms of speed. Thanks for showing it.
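
Filtering the lists for blanks before comparing could be sketched in shell like this, assuming linefeed line endings; the function names trim_blanks and prune_trimmed are made up for this illustration. Leading and trailing whitespace is stripped and empty lines are dropped from both lists, and the cleaned short list is then used for an exact whole-line comparison.

```shell
# trim_blanks FILE
# Strip leading and trailing whitespace from every line and drop empty lines.
trim_blanks() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" | grep -v '^$'
}

# prune_trimmed LONG SHORT
# Print the trimmed lines of LONG that do not equal any trimmed line of SHORT.
prune_trimmed() {
    clean=$(mktemp)
    trim_blanks "$2" > "$clean"
    trim_blanks "$1" | grep -v -x -F -f "$clean"
    rm -f "$clean"
}

# Demo with throwaway files containing stray spaces, a tab and an empty line.
b=$(mktemp -d)
printf 'c@gmail.com\n b@gmail.com\nd@gmail.com  \n' > "$b/Long.txt"
printf '  d@gmail.com \n\nc@gmail.com\t\n' > "$b/Short.txt"
prune_trimmed "$b/Long.txt" "$b/Short.txt"   # prints b@gmail.com
```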

This is raising the ante, as I am sure that the version that generates a sed script has the same flaw, whereas deleteLinesFromText should be able to cope very well with that.

I have another solution here, it is also yours. This one considers the spaces:)



set longText to "c@gmail.com
b@gmail.com
d@gmail.com
a@gmail.com
h@gmail.com"
set shortText to "d@gmail.com
c@gmail.com
g@gmail.com"

set shortList to paragraphs of shortText
set longList to paragraphs of longText
set AppleScript's text item delimiters to " " & tab
ignoring white space
	repeat with anEmail in shortList
		binaryPrune(longList, contents of anEmail)
	end repeat
end ignoring
set longList to longList's text
set {tids, AppleScript's text item delimiters} to {AppleScript's text item delimiters, linefeed}
set longText to longList as text
set AppleScript's text item delimiters to tids
longText

on binaryPrune(thelist, value)
	-- NG http://macscripter.net/viewtopic.php?id=17340
	script o
		property lst : thelist
	end script
	
	set valueAsList to {value}
	set L to 1
	set R to (count thelist)
	
	if (valueAsList is in thelist) then
		repeat until (value is item L of o's lst)
			set L to L + 1
			set m to (L + R) div 2
			if (valueAsList is in items L thru m of o's lst) then
				set R to m - 1
			else
				set L to m + 1
			end if
		end repeat
		set item L of o's lst to missing value
		return true
	else
		return false
	end if
end binaryPrune

I must say I liked Nigel’s last solution best,

The OP didn’t mention them and I haven’t catered for them.

I know, and I said I raised the ante. I also said that my last sed solution would break under the added criteria.

The OP said that the mail addresses were in two TextEdit documents, which made me think that maybe there should be some robustness regarding leading and trailing blanks. That is also baked in as a clause in the generality of this subject, if we step back: how to remove lines, or items, from a text file, based on another.

This is a recurring theme for me, I bet it is for others as well.

I do like your solution, and I will use it when I can, taking precautions when data are entered, or better: filter the text for leading and trailing blanks before creating the list. Then you have the freedom of editing and looking at your lists in TextEdit or similar, and just process them via AppleScript when you need to.

By the way:
I have stumbled over sed scripts that return the Nth paragraph, the Nth block, the first line of every paragraph and so on. This, in combination with setext headers, makes for very good-looking textual databases. I’ll just post the two very good links I have found here, not relevant to the theme, but to text processing in general. :slight_smile:

Sculpting text with regex, grep, sed and awk: I really like the explanation of regexp here. (But nobody seems to write that [^[:class:]] makes for the inverted one.)

SED Examples, Scripts, and Regular Expressions