# Approximate string matching: The Levenshtein distance

Fuzzy searching is a pretty popular term nowadays and when you have a typo in your favourite search engine it will ask you if you didn’t meant something else. Or the spell checker in Mac OS X or FF underlines a wrongly typed word with a red dashed line and you can choose the proper word from a contextual menu when clicking on it. Vladimir Levenshtein has thought of an algorithm that will indicate how much edits are minimum required to turn one string into the other. The Levenshtein distance will only use insertion, deletion and alternation as 3 different types of required edits.

Note: AppleScript is a slow language and therefore longer strings will take up time exponentially. The handler can be improved but wrote it in an verbose way so it’s clear how it works.

``````set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

set percentage to ((100 - 1 / (count string1) * 100 * edits) * 100 as integer) / 100
set msg to "Strings matches for " & percentage & "%, it requires " & edits & " single-character edits to turn one string into the other."

display dialog msg

on levenshteinDistance(s1, s2)
set matrix to {}
set xLen to count s1
set yLen to count s2

script o
property matrixRef : matrix -- to improve list access
end script

-- first create a matrix
repeat with y from 0 to yLen
set theRow to {}
repeat with x from 0 to xLen
if x = 0 then
set end of theRow to y
else if y = 0 then
set end of theRow to x
else
set end of theRow to 0
end if
end repeat
set end of o's matrixRef to theRow
end repeat

-- fill in the matrix
repeat with y from 2 to yLen + 1
repeat with x from 2 to xLen + 1
set charX to character (x - 1) of s1
set charY to character (y - 1) of s2

if charX = charY then
set edit to 0
else
set edit to 1
end if

set deletion to (item (x - 1) of item y of o's matrixRef) + 1
set insertion to (item x of item (y - 1) of o's matrixRef) + 1
set alternate to (item (x - 1) of item (y - 1) of o's matrixRef) + edit

set minimum to insertion

if deletion < minimum then set minimum to deletion
if alternate < minimum then set minimum to alternate

set item x of item y of o's matrixRef to minimum
end repeat
end repeat

-- return results
return item (xLen + 1) of item (yLen + 1) of o's matrixRef
end levenshteinDistance
``````

Extension: The Damerau-Levenshtein is an popular extension which has a transposition as the 4th edit, right letters in the wrong order. Damerau stated that more than 80% of all spelling errors are 4 or these errors including transposition. So a 2 step edit which is a transposition doesn’t weigh as 2 edits.

Here an more optimized version where creating and filling the matrix is done in one go:

``````on levenshteinDistance(s1, s2)
set matrix to {}
set xLen to (count s1) + 1
set yLen to (count s2) + 1

script o
property matrixRef : matrix -- to improve list access
property charList1 : characters of s1
property charList2 : characters of s2
end script

-- create and fill the matrix
repeat with y from 1 to yLen
set end of o's matrixRef to {}
repeat with x from 1 to xLen
if x = 1 then
set end of item y of o's matrixRef to (y - 1)
else if y = 1 then
set end of item y of o's matrixRef to (x - 1)
else
set deletion to (item (x - 1) of item y of o's matrixRef) + 1
set alternate to (item (x - 1) of item (y - 1) of o's matrixRef) + ((item (x - 1) of o's charList1 is not item (y - 1) of o's charList2) as integer)
set minimum to (item x of item (y - 1) of o's matrixRef) + 1
if deletion < minimum then set minimum to deletion
if alternate < minimum then set minimum to alternate

set end of item y of o's matrixRef to minimum
end if
end repeat
end repeat
return item (xLen) of item (yLen) of o's matrixRef
end levenshteinDistance
``````

Hello.

I don’t know if you have seen the version of the levensthein distance I posted a while ago over at rosettacode.org. I did write it and posted it because someone needed it. My version is based on the fast matrix version, but I only use the Col column of the matrix, I actually did port it from the C version I found at rosettacode.org directly.

It is interesting to see other takes on the same algorithm.

Personally I prefer a handler named mindist for practical purposes, (and not implemented in AppleScript) that just checks for one edit operation, like change character, delete or insert, (a spelling distance of 1).

``````# Implementation of the "fast" C-version.
set dist to findLevenshteinDistance for "sunday" against "saturday"
to findLevenshteinDistance for s1 against s2
script o
property l : s1
property m : s2
end script
if s1 = s2 then return 0
set ll to length of s1
set lm to length of s2
if ll = 0 then return lm
if lm = 0 then return ll

set v0 to {}

repeat with i from 1 to (lm + 1)
set end of v0 to (i - 1)
end repeat
set item -1 of v0 to 0
copy v0 to v1

repeat with i from 1 to ll
-- calculate v1 (current row distances) from the previous row v0

-- first element of v1 is A[i+1][0]
--   edit distance is delete (i+1) chars from s to match empty t
set item 1 of v1 to i
--  use formula to fill in the rest of the row
repeat with j from 1 to lm
if item i of o's l = item j of o's m then
set cost to 0
else
set cost to 1
end if
set item (j + 1) of v1 to min3 for ((item j of v1) + 1) against ((item (j + 1) of v0) + 1) by ((item j of v0) + cost)
end repeat
copy v1 to v0
end repeat
return item (lm + 1) of v1
end findLevenshteinDistance

to min3 for anInt against anOther by theThird
if anInt < anOther then
if theThird < anInt then
return theThird
else
return anInt
end if
else
if theThird < anOther then
return theThird
else
return anOther
end if
end if
end min3
``````

I have had seen the code but I didn’t knew it was yours. Out of curiosity I re-wrote it in AppleScript to see how much performance difference there was with my own C implementation. Because it’s doing fine for single words, I decided to post it here also because it was missing from MS.

Today I submitted another code to rosettacode.org where my approach was quite different compared to the other examples. Sites like rosettacode.org unintentionally invites code golfing which leads to very interesting and sometimes odd at first solutions.

Hello.

Now I’m curious: how my AppleScript version came out, if you timed the three versions, and which algorithm you posted at rosetta code.

Edit

I did some timing on my own, and your second version seems to be around 4 times faster than mine. That is before I inline the min3 handler, but, I have decided to let it be like it is, so it is comparable to the C version, at rosetta code.

Hello.

It just interested me, to see if I could match yours by applying regular optimization. Now it performs considerably better. I learned something as well.

``````
to findLevenshteinDistance for s1 against s2
script o
property l : characters of s1
property m : characters of s2
property v0 : {}
property v1 : missing value
end script
if s1 = s2 then return 0
set ll to length of s1
set lm to length of s2
if ll = 0 then return lm
if lm = 0 then return ll

repeat with i from 1 to (lm + 1)
set end of o's v0 to (i - 1)
end repeat
set item -1 of o's v0 to 0
set o's v1 to items of o's v0

repeat with i from 1 to ll
-- calculate v1 (current row distances) from the previous row v0

-- first element of v1 is A[i+1][0]
-- edit distance is delete (i+1) chars from s to match empty t
set item 1 of o's v1 to i
-- use formula to fill in the rest of the row
repeat with j from 1 to lm
if item i of o's l = item j of o's m then
set cost to 0
else
set cost to 1
end if
if ((item j of o's v1) + 1) < ((item (j + 1) of o's v0) + 1) then
if ((item j of o's v0) + cost) < ((item j of o's v1) + 1) then
set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
else
set item (j + 1) of o's v1 to ((item j of o's v1) + 1)
end if
else
if ((item j of o's v0) + cost) < ((item (j + 1) of o's v0) + 1) then
set item (j + 1) of o's v1 to ((item j of o's v0) + cost)
else
set item (j + 1) of o's v1 to ((item (j + 1) of o's v0) + 1)
end if
end if
end repeat
set o's v0 to items of o's v1
end repeat
return item (lm + 1) of o's v1
end findLevenshteinDistance

``````

Hi.

I’ve had a go at further optimising DJ’s optimised handler.

The first effort reduces the number of minus-ones needed by using lower loop indices, eliminates the ‘if.else if.else’ tests by having a separate loop for the first row, uses ‘end of’ instead of ‘item y of’, and makes references to the current row more direct by giving it its own script-object property and adding it to the matrix only after it’s filled. I’ve modified a couple of DJ’s variable names because they clash with something on my machine (probably the Satimage OSAX).

``````on levenshteinDistance(s1, s2)
set xLen to (count s1)
set yLen to (count s2)

script o
property |matrix| : {} -- to improve list access
property charList1 : characters of s1
property charList2 : characters of s2
property currentRow : {}
end script

-- Create and fill the matrix. Firstly the simple first row.
set o's currentRow to {}
repeat with x from 0 to xLen
set end of o's currentRow to x
end repeat
set end of o's |matrix| to o's currentRow
-- Then the more complex remaining rows.
repeat with y from 1 to yLen
set o's currentRow to {y}
repeat with x from 1 to xLen
set deletion to (item x of o's currentRow) + 1
if (item x of o's charList1 is item y of o's charList2) then
set alternate to (item x of end of o's |matrix|)
else
set alternate to (item x of end of o's |matrix|) + 1
end if
set min to (item (x + 1) of end of o's |matrix|) + 1
if deletion < min then set min to deletion
if alternate < min then set min to alternate

set end of o's currentRow to min
end repeat

set end of o's |matrix| to o's currentRow
end repeat

return end of end of o's |matrix|
end levenshteinDistance
``````

Only the previous row of the matrix is used above during the creation of the current one and only the last item of the last row is returned. This means the matrix isn’t really needed at all, just a rolling system of current and previous rows:

``````on levenshteinDistance(s1, s2)
set xLen to (count s1)
set yLen to (count s2)

script o
property charList1 : characters of s1
property charList2 : characters of s2
property previousRow : {}
property currentRow : missing value
end script

-- Set up the first row.
repeat with x from 0 to xLen
set end of o's previousRow to x
end repeat
-- Handle the remaining rows in a rolling manner.
repeat with y from 1 to yLen
set o's currentRow to {y}
repeat with x from 1 to xLen
set deletion to (item x of o's currentRow) + 1
if (item x of o's charList1 is item y of o's charList2) then
set alternate to (item x of o's previousRow)
else
set alternate to (item x of o's previousRow) + 1
end if
set min to (item (x + 1) of o's previousRow) + 1
if deletion < min then set min to deletion
if alternate < min then set min to alternate

set end of o's currentRow to min
end repeat

set o's previousRow to o's currentRow
end repeat

return end of o's previousRow
end levenshteinDistance
``````

Edit: minor error corrected in the initialisation of the first row in both scripts. :rolleyes:

Hello Nigel.

Your last (brilliant) one beats mine with 3 hundreths of a second, which is reason for doing the optimization, in my eyes.
Now, you have adopted the exactly same scheme as my handler is built upon, I have to do a major rework of it to get any better results, so I’ll leave the matters be, as I expect to end up with a blue print of your version anyways.

Thank you Nigel

Hi McUsrII.

Yes. Now I’ve had a chance to study yours too, I can see that’s the case.

By the way, I’ve just corrected a minor error in my handlers which was causing the first row in each case to be one item too long. :rolleyes:

Hello Nigel.

It is very hard to beat your version, with respect to time now. And it is an interesting scheme -just needing the contributions of one row/col of a matrix multiplication, because that saves a lot of multiplications and additions, it is not easy to number it, because the strings may be of varying length, but we end up with just a fraction of the number of operations in the order of the square root, so, that is an optimization I’ll look for in other procedures, related to linear algebra and vectors (ColB), maybe graph algorithms as well, but there (at least for transitive closure) you need to consider the whole matrix.

I also took away the “characters of” optimization in the script object, that was a significant optimization in my handler, for speeding up the access of the individual items (characers).

It was all in all very interesting to code up the algorithm, given it works the way it does.

And one may want to consider using a ‘considering case’ statement, either round the main repeat in the handler or round the call to the handler itself.

Absolutely.

I haven’t timed the implementations but I think Nigel’s version on efficiency is the best version of all. In C for example, Nigel’s implementation would not be faster (marginally slower due to excessive allocations and freeing of memory) but when used with +1k data his memory usage would be much lower while the bandwidth remained the same due to using a much smaller variable buffer size.

It’s possible to modify the rolling two-row version to reuse the same two “row” lists throughout the process:

``````set string1 to "In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences."
--same as string 1 but with a typos (deletion, insertion, altering and transposition)
set string2 to "In informatoin theory and computer science, thee Levenshtein distance is a string metric far measuring the diffrence between two sequences."

set edits to levenshteinDistance(string1, string2)

on levenshteinDistance(s1, s2)
set xLen to (count s1)
set yLen to (count s2)

script o
property charList1 : characters of s1
property charList2 : characters of s2
property previousRow : missing value
property currentRow : missing value
end script

-- Intitialise both "row" lists as the first row.
set o's previousRow to {0} & o's charList1
repeat with x from 1 to xLen
set item (x + 1) of o's previousRow to x
end repeat
set o's currentRow to o's previousRow's items
-- Handle the remaining rows in a rolling manner, the two lists alternating as the previous and current rows.
repeat with y from 1 to yLen
set item 1 of o's currentRow to y
repeat with x from 1 to xLen
set deletion to (item x of o's currentRow) + 1
if (item x of o's charList1 is item y of o's charList2) then
set alternate to (item x of o's previousRow)
else
set alternate to (item x of o's previousRow) + 1
end if
set min to (item (x + 1) of o's previousRow) + 1
if deletion < min then set min to deletion
if alternate < min then set min to alternate

set item (x + 1) of o's currentRow to min
end repeat

tell o's previousRow
set o's previousRow to o's currentRow
set o's currentRow to it
end tell
end repeat

return end of o's previousRow
end levenshteinDistance
``````

First, I clocked my version to use 1.26 seconds, while your second version used 0.47 seconds, then I performed the same optimizations as you had done, and I ended up at 0.3 seconds for it on my machine, then Nigel comes a long with his version at a whopping 0.27 seconds; all timings based on 100 iterations over the strings you provided.

In C, I think it is possible, to optimize Nigels version, by treating the input in chunks for instance, and just accumulate the results, thereby using chunk-size arrays, thus minimizing the allocations/freeing of memory by using static storage.

What I have written was using a single memory allocation and using a 1D array. For lists in general, as for AppleScript, you can gain performance by using 2D arrays or small lists. Unlike lists, for arrays it doesn’t matter if you want to access the first item or the 1,000,000th item, so avoiding 2D arrays has performance gains within the language. Keeping the arrays small, even without requiring any extra steps (which is not possible), has no performance gains. But as I said I like Nigel’s code because it’s efficiency in used resources.

Thanks for the times

I read a whole lot about memory allocation, my point is, that instead of allocating memory again and again, and thereby using bandwidth over the bus, you could really just preallocate some static arrays, which you would align with the size of text you read in for each call, then you’d just reinitialize the arrays for each call. The beauty lies in the iterations, here you could just flip two pointers, so that they point to the other array, and access v1 and v0 through those pointers, which is a quite fast alternative to any copying. Maybe we could have utilized that trick in Applescript too, but I doubt it would have a large positive impact.

I know, I was maybe ahead of you and I meant that there would no performance gain for the “v1 and v0” solution either. Even swapping pointers, how small it is in execution time, no swapping is faster. When using static arrays, using stack memory instead of heap, does have a lot of performance gains but as always, using stack memory limits your buffers. It’s an unwritten rule not to use stack memory for these kind of data.

Maybe you were ahead of me, I don’t know, I just had the idea of using static storage, whether by declaring reasonably sized arrays up front, or allocating memory from the heap at program startup. And just for the record, Nigel seem to use the “pointer-flipping” optimization already, and I just named them v0 and v1, because I had forgotten what Nigel called them. I mentioned it, because you worried over bandwidth and memory allocations. My point is that those two problems can be almost optimized away with respect to variables that the function uses, save during program load and exit. I had no intention, what so ever of allocating the arrays on the stack.

Edit

I too are aware of the fact that all static variables are created in the data segment? (some segment, anyway), so, no matter how big static variables I declare in a C-function, they will not be allocated on the program stack at run time, like automatic variables.