Friday, December 15, 2017

#26 2015-06-04 09:12:04 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

Here is another idea, regarding spelling distances of filenames. The idea, is that a user types in a filename, or something else that is word sized, and you can iterate of the contents of the container of the items, and return the one with the least distance.

Applescript:

on spdist for s1 against s2
   -- returns coarsely the differences between two strings
   -- 0 if strings are identical
   -- 1 if two chars are transposed, or one is different
   -- 2 if one char added or deleted
   -- 3 otherwise.
   
   -- it is really meant solve problems with filenames that are
   -- misspelled by one character
   -- Ported from "The Unix Programming Environment" p.211
   -- by B.W Kernighan and R. Pike.
   
   script o
       property l : s1's id
       property m : s2's id
   end script
   
   set l1 to length of s1
   set l2 to length of s2
   
   if l1 = l2 then
       set diffs to 0
       set ante to 0
       repeat with i from 1 to l1
           if item i of o's l ≠ item i of o's m then
               if diffs = 0 then
                   if ante = 0 then -- first error
                       set ante to 1 -- 1 spelling error.
                   else if ante = 1 then
                       if item i of o's l = item (i - 1) of o's m and item (i - 1) of o's l = item i of o's m then
                           set diffs to 1 -- one full transpostion.
                       else
                           set diffs to 3 -- two spelling errors.
                       end if
                       set ante to -1
                   end if
               else
                   set diffs to 3
               end if
           end if
       end repeat
       if diffs = 0 and ante = 1 then set diffs to 1
   else
       if l1 > l2 then
           if (l1 - l2) < 2 then
               set diffs to 2
           else
               set diffs to 3
           end if
       else
           if (l2 - l1) < 2 then
               set diffs to 2
           else
               set diffs to 3
           end if
       end if
   end if
   return diffs
end spdist

Edit

So I mulled over it a little, and I changed the code to work similiar to the description, almost: A transpositon, may not be an actual transposition, and, a single misspelled character also counts as a transposition. So in the worst case, two independently misspelled characters counts as one transposition.

Edit++
It is changed slightly so it now differs between a  "true" transposition, and different characters at two different places.

Last edited by McUsrII (2015-06-04 10:44:54 am)

Offline

 

#27 2015-06-11 06:14:06 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Hello.

I just found out that Levenshtein distances may be used for computing queries. I just provide a link for those interested. Levenshtein Automata (2010) | Hacker News. -You reach the original link by clicking on it at the top of the page. smile

Last edited by McUsrII (2015-06-11 06:15:25 am)

Offline

 

#28 2015-06-11 06:52:06 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2727
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I just found out that Levenshtein distances may be used for computing queries.


Yes, search engines uses them too but in a whole different way.

Offline

 

#29 2015-06-11 07:54:40 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

DJ Bazzie Wazzie wrote:
McUsrII wrote:

I just found out that Levenshtein distances may be used for computing queries.


Yes, search engines uses them too but in a whole different way.


Well, it is interesting, I haven't but skimmed that article yet. Here is another one covering editing distances, and an optimal algorithm, which I have no time to read at the moment. Computer scientists prove that a 40-year-old algorithm is optimal | Hacker News


Just mentioning it, I use the fact that something like this have been used by Google, many times, when I wonder what an english word means, but haven't got the spelling right. Last one was earlier today, I wondered what itinary meant, so I tried Spotlight first, but the spelling was wrong. Using Google however, gave me back results for itinerary immediately. smile

Edit

Building a spell checker, must be a very interesting project, with regards to the trade-offs of the datastructures involved. I see that nearly every one of them uses the Levenshtein distance.

Last edited by McUsrII (2015-06-11 08:27:48 am)

Offline

 

#30 2015-06-11 09:38:26 am

DJ Bazzie Wazzie
Member
From:: the Netherlands
Registered: 2004-10-20
Posts: 2727
Website

Re: Approximate string matching: The Levenshtein distance

McUsrII wrote:

I see that nearly every one of them uses the Levenshtein distance.


They are, but they differ a lot on how they are implemented. Almost no spellings checker is actually processing entire words, the levenshtein itself is done in one of the last steps of the spelling check. Only pieces of words who differs is the levenshtein applied to. For instance when matching the typo "backkup" against "backup", there is no levenshtein match required when the spelling dictionary is syllable based b-tree. It is when a typo like "backop" is made, then the levenshtein is used to match "op" against "up" (and all other possible remaining syllables).

While the algorithm is quite heavy to process, the implementations are done in such a way that the levenshtein is actually applied to a smaller piece of the word against an already narrowed down dictionary during preprocessing. This makes it possible to apply approximate matching and give suggestions on the fly.

Since you mentioned Google. It is using another algorithm than a spelling checker. Because Google is using statistics, the levenshtein will be indexed and therefore it doesn't need to be processed each time you write a typo. Today's suggestion can be somethings else tomorrow when using the same search string.

Last edited by DJ Bazzie Wazzie (2015-06-11 09:41:52 am)

Offline

 

#31 2015-06-11 09:53:01 am

McUsrII
Member
Registered: 2012-11-21
Posts: 3046
Website

Re: Approximate string matching: The Levenshtein distance

Yes, I am aware of the weighting of searchwords by  google, I recently saw a layout of the search engine by the way, and it seemed pretty much like a monster to me. smile But I guess they did use something like Levenshtein, in the days, when they didn't have statistics for everything.

Spell checking is an interesting subject, one theme is autocompletion, which interests me. There are several open source projects that can be scavenged for that purpose, as is aspell, ispell, and xcalibur for spellchecking of documents.

I'm pretty sure your implementation will be an efficient one, and good luck! smile

Last edited by McUsrII (2015-06-11 09:53:59 am)

Offline

 

Board footer

Powered by FluxBB

RSS (new topics) RSS (active topics)