Approximate string matching: The Levenshtein distance

Just for the record, that behaviour depends on the compiler (settings). Statics can be placed in the data segment (initialized), the bss segment (uninitialized), or the code segment of the executable. Also, it would make no difference whether or not you use statics in this situation, unless you make the handler recursive. So my reference to stack memory was meant more generally: a fixed memory space that is not heap memory and is bound to the same limitations.

I don’t worry about bandwidth at all :), I think you misunderstood me. My point was that with Nigel’s approach you would use less memory even without gaining any performance. The reason for saying that explicitly was that the thread reads (see the timings) as if the best implementation is simply the fastest one, without regard to the quality of the code itself. So without timing any of the solutions, I liked Nigel’s script the most, even if it had been slower (which it isn’t).

Hello.

I agree that Nigel’s second version is also the most readable, so that is a good reason for me to like it too. After all, performance is just a currency that we use to “buy” security and functionality, so it is important, but not really the most important property of a routine, of course, unless for a particular routine it is. :slight_smile:

A couple of further possibilities:

  1. Make ‘charList1’ and ‘charList2’ the ‘id’ of the respective strings rather than the ‘characters’. This will make the character comparisons sensitive to everything and speed them up because they’ll be integer rather than text comparisons.

  2. Test the strings (or the ‘id’ lists) for equality before proceeding with the rest of the process. This will greatly speed up the return of 0 for equal strings without noticeably slowing the treatment of unequal strings.

set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with some typos (deletion, insertion, alteration and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
	set xLen to (count s1)
	set yLen to (count s2)
	
	script o
		property charList1 : id of s1 -- For everything sensitivity .
		property charList2 : id of s2 -- . and speed.
		property previousRow : missing value
		property currentRow : missing value
	end script
	
	-- Return 0 straight away if the two strings are equal.
	if (o's charList1 = o's charList2) then return 0
	
	-- Otherwise initialise two "row" lists as the first row of a notional matrix.
	set o's previousRow to {0} & o's charList1
	repeat with x from 1 to xLen
		set item (x + 1) of o's previousRow to x
	end repeat
	set o's currentRow to o's previousRow's items
	-- Handle the remaining rows in a rolling manner, the two lists alternating as previous and current rows.
	repeat with y from 1 to yLen
		set item 1 of o's currentRow to y
		repeat with x from 1 to xLen
			set deletion to (item x of o's currentRow) + 1
			if (item x of o's charList1 is item y of o's charList2) then
				set alternate to (item x of o's previousRow)
			else
				set alternate to (item x of o's previousRow) + 1
			end if
			set min to (item (x + 1) of o's previousRow) + 1
			if (deletion < min) then set min to deletion
			if (alternate < min) then set min to alternate
			
			set item (x + 1) of o's currentRow to min
		end repeat
		
		tell o's previousRow
			set o's previousRow to o's currentRow
			set o's currentRow to it
		end tell
	end repeat
	
	return end of o's previousRow
end levenshteinDistance
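For readers outside AppleScript, the same rolling two-row scheme can be sketched in Python (my own illustrative port, not part of Nigel’s handler; it includes the early equality exit from point 2 above):

```python
def levenshtein_distance(s1, s2):
    # Return 0 straight away if the two strings are equal.
    if s1 == s2:
        return 0
    # First row of the notional matrix: 0 .. len(s1).
    previous_row = list(range(len(s1) + 1))
    current_row = [0] * (len(s1) + 1)
    # Handle the remaining rows in a rolling manner,
    # the two lists alternating as previous and current rows.
    for y, c2 in enumerate(s2, start=1):
        current_row[0] = y
        for x, c1 in enumerate(s1, start=1):
            deletion = current_row[x - 1] + 1
            insertion = previous_row[x] + 1
            substitution = previous_row[x - 1] + (0 if c1 == c2 else 1)
            current_row[x] = min(deletion, insertion, substitution)
        # Swap the rows instead of allocating new lists each pass.
        previous_row, current_row = current_row, previous_row
    return previous_row[-1]
```

Like the AppleScript, it keeps only two rows alive at a time, so memory stays proportional to the shorter dimension rather than the full matrix.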

Hello Nigel.

Great ideas for optimizing the handler. I at least can see no reason why anybody would want such a handler to be case insensitive. I haven’t timed it, but I wager it is a lot faster. :slight_smile:

Spelling correction, where this is most used, is a mix of case-sensitive and case-insensitive matching. My finished project also had to be case insensitive. However, if you still want it case insensitive, you can convert both strings to the same case (lowercase) before calling the handler.

Hello.

Here is another idea, regarding spelling distances of filenames. The idea is that a user types in a filename, or something else that is word-sized, and you iterate over the contents of the container of the items and return the one with the smallest distance.

on spdist for s1 against s2
	-- returns coarsely the differences between two strings
	-- 0 if strings are identical
	-- 1 if two chars are transposed, or one is different
	-- 2 if one char added or deleted
	-- 3 otherwise.
	
	-- it is really meant to solve problems with filenames that are
	-- misspelled by one character 
	-- Ported from "The Unix Programming Environment" p.211 
	-- by B.W Kernighan and R. Pike.
	
	script o
		property l : s1's id
		property m : s2's id
	end script
	
	set l1 to length of s1
	set l2 to length of s2
	
	if l1 = l2 then
		set diffs to 0
		set ante to 0
		repeat with i from 1 to l1
			if item i of o's l ≠ item i of o's m then
				if diffs = 0 then
					if ante = 0 then -- first error 
						set ante to 1 -- 1 spelling error.
					else if ante = 1 then
						if item i of o's l = item (i - 1) of o's m and item (i - 1) of o's l = item i of o's m then
							set diffs to 1 -- one full transposition.
						else
							set diffs to 3 -- two spelling errors.
						end if
						set ante to -1
					end if
				else
					set diffs to 3
				end if
			end if
		end repeat
		if diffs = 0 and ante = 1 then set diffs to 1
	else
		if l1 > l2 then
			if (l1 - l2) < 2 then
				set diffs to 2
			else
				set diffs to 3
			end if
		else
			if (l2 - l1) < 2 then
				set diffs to 2
			else
				set diffs to 3
			end if
		end if
	end if
	return diffs
end spdist
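For comparison, here is a rough Python rendering of the same coarse distance (an illustrative port; the AppleScript above is the working version):

```python
def spdist(s1, s2):
    """Coarse spelling distance, after Kernighan & Pike:
    0 if identical, 1 for a transposition or one wrong character,
    2 if one character was added or deleted, 3 otherwise."""
    if len(s1) != len(s2):
        # Lengths differing by one mean a single insertion/deletion.
        return 2 if abs(len(s1) - len(s2)) < 2 else 3
    diffs = 0
    ante = 0
    for i, (a, b) in enumerate(zip(s1, s2)):
        if a != b:
            if diffs == 0:
                if ante == 0:
                    ante = 1  # first mismatch seen
                elif ante == 1:
                    # Second mismatch: is it a true transposition?
                    if a == s2[i - 1] and s1[i - 1] == b:
                        diffs = 1  # one full transposition
                    else:
                        diffs = 3  # two independent spelling errors
                    ante = -1
            else:
                diffs = 3
    if diffs == 0 and ante == 1:
        diffs = 1  # exactly one character differed
    return diffs
```

As in the original, a single wrong character and a transposition both score 1, which is deliberately coarse: the routine only has to rank near-miss filenames, not compute exact edit distances.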

Edit

So I mulled it over a little and changed the code to work almost as described: what is counted as a transposition may not be an actual transposition, and a single misspelled character also counts as one. So in the worst case, two independently misspelled characters count as one transposition.

Edit++
It has been changed slightly, so it now distinguishes between a “true” transposition and different characters at two different places.

Hello.

I just found out that Levenshtein distances may also be used for computing queries. I’ll just provide a link for those interested: Levenshtein Automata (2010) | Hacker News. You reach the original article by clicking the link at the top of the page. :slight_smile:

Yes, search engines use them too, but in a whole different way.

Well, it is interesting; I have only skimmed that article so far. Here is another one covering edit distances and an optimal algorithm, which I have no time to read at the moment: Computer scientists prove that a 40-year-old algorithm is optimal | Hacker News

Just mentioning it: I have benefited from the fact that something like this has been used by Google, many times, when I wonder what an English word means but haven’t got the spelling right. The last time was earlier today: I wondered what “itinary” meant, so I tried Spotlight first, but the spelling was wrong. Using Google, however, gave me back results for “itinerary” immediately. :slight_smile:

Edit

Building a spell checker must be a very interesting project with regard to the trade-offs of the data structures involved. I see that nearly every one of them uses the Levenshtein distance.

They are, but they differ a lot in how they are implemented. Almost no spelling checker actually processes entire words; the Levenshtein computation itself is one of the last steps of the check, and it is applied only to the pieces of the words that differ. For instance, when matching the typo “backkup” against “backup”, no Levenshtein match is required when the spelling dictionary is a syllable-based B-tree. It is when a typo like “backop” is made that Levenshtein is used to match “op” against “up” (and all other possible remaining syllables).
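To illustrate that narrowing idea with a toy sketch (all names here are my own inventions; a real checker would use a syllable index or B-tree rather than a linear scan, but the principle of running the distance only on the differing tails is the same):

```python
def _lev(a, b):
    # Plain row-by-row Levenshtein for the small remainders.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(cur[j - 1] + 1,          # insertion
                           prev[j] + 1,             # deletion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(typo, dictionary, max_distance=1):
    """Hypothetical helper: return dictionary words within
    max_distance of the typo, stripping the common prefix first
    so the expensive distance runs on the short tails only.
    (Stripping a shared prefix does not change the distance.)"""
    suggestions = []
    for word in dictionary:
        p = 0
        while p < min(len(typo), len(word)) and typo[p] == word[p]:
            p += 1
        if _lev(typo[p:], word[p:]) <= max_distance:
            suggestions.append(word)
    return suggestions
```

For “backop” against a word list, only the tails (“op” vs “up”, etc.) reach the distance routine, which is the essence of the preprocessing described above.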

While the algorithm is quite heavy to process, implementations arrange things so that Levenshtein is actually applied to a small piece of the word against a dictionary that has already been narrowed down during preprocessing. This makes it possible to apply approximate matching and give suggestions on the fly.

Since you mentioned Google: it uses a different algorithm than a spelling checker. Because Google uses statistics, the Levenshtein results are indexed, so they don’t need to be recomputed each time you make a typo. Today’s suggestion can be something else tomorrow for the same search string.

Yes, I am aware of Google’s weighting of search terms; I recently saw a layout of the search engine, by the way, and it seemed pretty much like a monster to me. :slight_smile: But I guess they did use something like Levenshtein back in the day, when they didn’t have statistics for everything.

Spell checking is an interesting subject; one theme is autocompletion, which interests me. There are several open-source projects that can be scavenged for that purpose, such as aspell, ispell, and xcalibur for spell checking of documents.

I’m pretty sure your implementation will be an efficient one, and good luck! :slight_smile: