My *nix world

How to measure the similarity between two words

One problem that I have encountered these days is related to "how to measure the similarity between two words".

The problem is not new (in fact its solution is older than any person I know alive) and one solution could be a hash function that will calculate a score for each word then you compare the both scores and if they are "close enough" then you can decide whether or not they are alike. It sounds like soundex function, which in fact is a hashing system for English words.

There are also other alternatives which seems to be even better. One is Levenshtein edit distance, a method that measure how many edit operations (such as insert/delete/substitute of a single char) should be done to transform one word into another. Another is the Needleman-Wunsch algorithm which tries to find the optimal score when aligning two sequences (words). For two words with the same length one could use even the Hamming distance (which counts how many chars are different, so it measures only the necessary substitutions that are required to change one word to another).

Even if the Needleman-Wunsch algorithm is helping us because it provides the optimal alignment of the two sequences, for my problem the Levenshtein distance seems to be 1% better.

Levenshtein distance

measure the similarity between two words

source: http://www-igm.univ-mlv.fr/~lecroq/seqcomp/node2.html

In the example above we have tried to find the similarity between the word "LEVENSHTEIN" and "LEVENSHMIT". So using the Levenshtein disntace algorithm you will get a number (bottom-right in matrix) that will show you how much it will cost you if you want to change first word into the second one:

  • if we choose the first path/way then you have to delete the "T" letter then to substitute the "E" letter with a new one "M" (which does not exists in our source word) then to come up with another one "N" (which replaces the last "T"); so 3 edit operations
  • if we choose the second path/way then you have to substitute the "T" with "M", delete the "E" and insert the "T" at the end of the word; also 3 edit operations

Needleman-Wunsch algorithm

Needleman-Wunsch algorithm helps us to align two different words
in order to find the optimal alignment between them.

Would you like to see what am I talking about? [Y/N]y
Enter the 1st word : LEVENSHTEIN
Enter the 2nd word : LEVENSHMIT
We assume the following params:
  - insert score = 0
  - delete score = 0
  - substitute score = 0
  - match score = 1
  - similarity factor = 0.5 (i.e the words should be >= 50% alike)

Testing Needleman-Wunsch algorithm
  - word 1 = LEVENSHTEIN
  - word 2 = LEVENSHMIT

      | L | E | V | E | N | S | H | T | E | I | N |
  +++++++++++++++++++++++++++++++++++++++++++++++++
  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
L | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
E | 0 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
V | 0 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
E | 0 | 1 | 2 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
N | 0 | 1 | 2 | 3 | 4 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
S | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 6 | 6 | 6 | 6 | 6 |
H | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 7 | 7 |
M | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 7 | 7 |
I | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 | 7 | 8 | 8 |
T | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 8 | 8 | 8 |
   +++++++++++++++++++++++++++++++++++++++++++++++++
   dist(Needleman-Wunsch)[10][11]=8

   The strings's alignment is:

     LEVENSH+MIT
     |||||||   | 
     LEVENSHTEIN
     ----------
     mmmmmmmisms
+########################################################+

The code above is an capture from the stdout of my Java implementation of the Levenshtein and Needleman-Wunsch algorithms. I made it available for the public and can be downloaded from:

https://docs.google.com/folder/d/0B95k2kr1bG9fRHN6VGZPakc2SjQ/edit

Besides Wikipedia I've found other useful information about this subject at:

  • http://www-igm.univ-mlv.fr/~lecroq/seqcomp/index.html
  • http://www.let.rug.nl/kleiweg/lev/
  • the back-trace algorithm: http://de.wikipedia.org/wiki/Levenshtein-Distanz
  • http://www.avatar.se/molbioinfo2001/multali.html

Now, if you think that this article was interesting don't forget to rate it. It shows me that you care and thus I will continue write about these things.

 
The following two tabs change content below.
How to measure the similarity between two words

Eugen Mihailescu

Founder/programmer/one-man-show at Cubique Software
Always looking to learn more about *nix world, about the fundamental concepts of math, physics, electronics. I am also passionate about programming, database and systems administration. 16+ yrs experience in software development, designing enterprise systems, IT support and troubleshooting.

Leave a Reply

Your email address will not be published. Required fields are marked *