Problem with extracting a list of words while ignoring case

BogdanOancea · June 18, 2011, 1:13pm

Hi there,

I want to extract from any given text a list of words that contain the letter “Ş” or “ş”, in order to replace these with “Ș” and “ș” respectively. (Explanation: MS shipped Windows XP with wrong diacriticals for the Romanian keyboard layout – to be more exact, they put the Turkish s with cedilla instead of s with comma. As a result the vast majority of Romanian text written today has the wrong diacriticals – and MS also gave us “t with cedilla” instead of “t with comma”)

I want to automate this replacement with an AppleScript, BUT the trouble is I want to avoid replacing the letter “ş” in any Turkish words that could happen along (not terribly probable, but still…)

So… I made a long list of Turkish words to ignore (about 1940), but I can’t get AppleScript to ignore case when I use the “if i is not in turkishWordList” test.

Here’s the script:

set turkishWordList to {"abaşo", "acarlaşma", "şehrazat"} -- <-- much shorter list of Turkish words

set txtBlock to "Abaşo abaşo aşa acarlaşma acarlaŞma şehrazat Şehrazat" -- the only Romanian word here is "așa"
set wlBrut to words of txtBlock
set wordList to {}


repeat with i in wlBrut
	ignoring case
		if (i as string) contains "ş" then
			if (i as string) is not in turkishWordList then
				set wordList to wordList & i
			end if
		end if
	end ignoring
end repeat

The variable wordList should contain only one word – “aşa”. Instead, the script considers that “acarlaŞma” is not in turkishWordList, although I placed the if statement inside an “ignoring case/end ignoring” block. The same goes for “Şehrazat”…

Curiously, the Turkish word “Abaşo” is not included. What am I doing wrong?

Yvan_Koenig · June 18, 2011, 1:59pm

It seems that this one does the trick.


set turkishWordList to {"abaşo", "acarlaşma", "şehrazat"} -- <-- much shorter list of Turkish words

set txtBlock to "Abaşo abaşo aşa acarlaşma acarlaŞma şehrazat Şehrazat" -- the only Romanian word here is "așa"
set wlBrut to words of txtBlock
set wordList to {}

repeat with i in wlBrut
	ignoring case
		set flag to (i as string) contains "ş"
	end ignoring
	if flag then
		if (i as string) is not in turkishWordList then
			set wordList to wordList & i
		end if
	end if	
end repeat
wordList

Yvan KOENIG (VALLAURIS, France) Saturday, 18 June 2011, 15:59:25

Nigel_Garvey · June 18, 2011, 3:56pm

The original works OK for me as posted. I’m using Mac OS 10.6.7, preferred language English.

Not connected with your problem: AppleScript ‘ignores’ case unless told specifically to consider it, so there’s no need to specify that. The script could also be made a little simpler and more efficient:

set turkishWordList to {"abaşo", "acarlaşma", "şehrazat"} -- <-- much shorter list of Turkish words

set txtBlock to "Abaşo abaşo aşa acarlaşma acarlaŞma şehrazat Şehrazat" -- the only Romanian word here is "așa"
set wlBrut to words of txtBlock
set wordList to {}


repeat with i in wlBrut
	if (i contains "ş") and (i is not in turkishWordList) then set end of wordList to i's contents
end repeat

wordList --> {"aşa"}
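
To illustrate the point about default case handling, here is a minimal sketch comparing the same membership test with and without a “considering case” block. The names sampleWord, sampleList, foundByDefault and foundWhenConsideringCase are made up for this example and do not come from the script above. No expected result is shown, because the treatment of non-ASCII letters such as “Ş” may differ between system versions.

```applescript
-- Hypothetical test values, chosen only for illustration.
set sampleWord to "ABAŞO"
set sampleList to {"abaşo"}

-- Default behaviour: case is ignored, so no "ignoring case" block is needed.
set foundByDefault to (sampleWord is in sampleList)

-- Case is only taken into account inside a "considering case" block.
considering case
	set foundWhenConsideringCase to (sampleWord is in sampleList)
end considering

-- Return both results so they can be compared in the Result pane.
{foundByDefault, foundWhenConsideringCase}
```

Running this on each system in question should show whether the default comparison treats the upper- and lower-case forms of the letter as equal there.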

BogdanOancea · June 18, 2011, 4:22pm

Well… I admit I was logged into the Leopard partition, and now, logged into SL, it works perfectly. Hmm… I was under the impression that there was something wrong with where I placed the “ignoring case/end ignoring” lines.

And excuse me for not trying the script on both systems before asking.

Thank you both. You saved me from abandoning the idea of making this search & replace script.