Levenshtein Word Distance

A while ago I wrote an implementation of the Soundex Algorithm which attempts to assign the same encoding to words which are pronounced the same but spelled differently. In this post I'll cover the Levenshtein Word Distance algorithm which is a related concept measuring the "cost" of transforming one word into another by totalling the number of letters which need to be inserted, deleted or substituted.

The Levenshtein Word Distance has a fairly obvious use in helping spell checkers decided which words to suggest as alternatives to mis-spelled words: if the distance is low between a mis-spelled word and an actual word then it is likely that word is what the user intended to type. However, it can be used in any situation where strings of characters need to be compared, such as DNA matching.

The Levenshtein Word Distance

I usually leave the program's output until near the end of each post but this time I'll put it up front as it is the easiest way of showing how the algorithm works. I'll actually show two program outputs, the first being after the required data structure has been initialized but before the algorithm has been run, and the second after the algorithm has been run.

Program Output - Data structure initialized

-----------------------------
| code-in-python.com        |
| Levenshtein Word Distance |
-----------------------------

            b    a    n    a    n    a

       0    1    2    3    4    5    6

  b    1    0    0    0    0    0    0

  a    2    0    0    0    0    0    0

  n    3    0    0    0    0    0    0

  a    4    0    0    0    0    0    0

  m    5    0    0    0    0    0    0

  a    6    0    0    0    0    0    0

Here we have a hypothetical situation where somebody has typed "banama" and we want to see how close it is to "banana". The source word "banama" is shown down the left and the target word "banana" is along the top. We also have a grid of integers with one more row than the number of letters in the source, and one more column than the number of letters in the target.

Most values are initially set to 0, but the first row of numbers represent the cumulative number of letters which need to be inserted to form the target, and the first column shows the cumulative number of deletions to remove the source. In other words, to transform "banama" into "banana" you could delete six letters and insert six more. Obviously that's not the most efficient way though as only one operation needs to be done, the substitution of the "m" with an "n". Let's now look at the program output after the algorithm has been run.

Program Output - Algorithm run

-----------------------------
| code-in-python.com        |
| Levenshtein Word Distance |
-----------------------------

            b    a    n    a    n    a

       0    1    2    3    4    5    6

  b    1    0    1    2    3    4    5

  a    2    1    0    1    2    3    4

  n    3    2    1    0    1    2    3

  a    4    3    2    1    0    1    2

  m    5    4    3    2    1    1    2

  a    6    5    4    3    2    2    1

Minimum cost of transforming "banama" to "banana" = 1

All the other numbers have been filled in, and the number in the bottom right is the total cost of transforming "banama" to "banana" using the minimum number of operations, in this case 1.

The values for each of the 36 numbers initialized to 0 are calculated as follows:

  • Find the value diagonally top-left and if the corresponding letters are different add 1 (substitution cost)
  • Find the value above and add 1 (deletion cost)
  • Find the value to the left and add 1 (insertion cost)
  • Set the current value to the lowest of the above three values
For this project I will use 1 as the "cost" of all three possible operations: substitution, insertion and deletion. However, this is not mandatory and in some circumstances different costs might be considered appropriate. For example in a spell checker you might feel someone is more likely to type the wrong letter than to miss out a letter or type an extra letter. Setting the insert and delete cost to > 1 will therefore make words with the same number of letters more likely to be suggested as alternatives than words where inserts or deletes are required.

The overall plan of the implemention of Levenshtein's Word Distance should now be clear - given two words we just need to create a suitably sized 2D list, initialize the numbers and then iterate the rows and columns, setting the elements to their final values.

Coding

Create a folder and within it create the following empty files. You can download the source code from the Downloads page or clone/download from Github if you prefer.

  • levenshtein.py
  • main.py

Open levenshtein.py in your favourite editor and enter or paste this.

levenshtein.py

import math

# ANSI terminal codes for colours
BLUE = "\x1B[94m"
RESET = "\x1B[0m"

class Levenshtein(object):
    """
    Implements the Levenshtein Word Distance Algorithm for calculating
    the minimum number of changes needed to transform one word into another.
    """


    def __init__(self):
        """
        Just creates costs, and other attributes with default values.
        """


        self.insert_cost = 1
        self.delete_cost = 1
        self.substitute_cost = 1

        self.grid = []

        self.source_word = ""
        self.target_word = ""

        self.minimum_cost = -1

    def calculate(self):
        """
        Creates a grid for the given words and iterates rows and columns,
        calculating missing values.
        """


        self.__init_grid()

        for sourceletter in range(0, len(self.source_word)):

            for targetletter in range(0, len(self.target_word)):

                if self.target_word[targetletter] != self.source_word[sourceletter]:
                    total_substitution_cost = self.grid[sourceletter][targetletter] + self.substitute_cost
                else:
                    total_substitution_cost = self.grid[sourceletter][targetletter]

                total_deletion_cost = self.grid[sourceletter][targetletter+1] + self.delete_cost

                total_insertion_cost = self.grid[sourceletter+1][targetletter] + self.insert_cost

                self.grid[sourceletter+1][targetletter+1] = min(total_substitution_cost, total_deletion_cost, total_insertion_cost)

        self.minimum_cost = self.grid[len(self.source_word)][len(self.target_word)]

    def print_grid(self):
        """
        Prints the target and source words and all transformation costs in a grid
        """


        print("        ", end="")

        for t in self.target_word:

            print(BLUE + "%5c" % t + RESET, end="")

        print("\n")

        for row in range(0, len(self.grid)):

            if row > 0:
                print(BLUE + "%3c" % self.source_word[row - 1] + RESET, end="")
            else:
                print("   ", end="")

            for column in range(0, len(self.grid[row])):

                print("%5d" % self.grid[row][column], end="")

            print("\n")

    def print_cost(self):
        """
        This is a separate function to allow printing just the cost if you are not
        interested in seeing the grid
        """


        if self.minimum_cost >= 0:
            print("Minimum cost of transforming \"%s\" to \"%s\" = %d" % (self.source_word, self.target_word, self.minimum_cost))
        else:
            print("Costs not yet calculated")

    def __init_grid(self):
        """
        Sets values of first row and first column to 1, 2, 3 etc.
        Other values initialized to 0
        """


        del self.grid[:]

        # Don't forget we need one more row than the letter count of the source
        # and one more column than the target word letter count
        for row in range(0, (len(self.source_word) + 1)):

            self.grid.append([0] * (len(self.target_word) + 1))

            self.grid[row][0] = row

            if row == 0:

                for column in range(0, (len(self.target_word) + 1)):

                    self.grid[row][column] = column

        self.minimum_cost = -1

The init method simply creates a few attributes in our class with default values.

The calculate method is central to this whole project, and after calling __init_grid it iterates the source word letters and then the target word letters in a nested loop. Within the inner loop it calculates the substitution, deletion and insertion cost according to the rules described above, finally setting the grid value to the minimum of these. After the loop we set the minimum cost to the bottom right value.

The print_grid method is necessarily rather fiddly, and prints the words and grid in the format shown above. The separate print_cost method is much simpler and just prints the cost with a suitable message, or a warning if it has not yet been calculated.

The __init_grid function called in calculate firstly empties the grid for those circumstances where we are reusing the object. It then loops the rows and columns, setting the first values to 1, 2, 3... and the rest to 0. Finally minimum_cost is set to -1 as an indicator that the costs have not yet been calculated, and is used in print_cost.

Let's now move on to main.py.

main.py

import levenshtein

def main():
    """
    Simple demo of the Levenshtein class.
    """


    print("-----------------------------\n| code-in-python.com        |\n| Levenshtein Word Distance |\n-----------------------------\n")

    source_words = ["banama", "banama", "levinstein"]
    target_words = ["banana", "elephant", "levenshtein"]

    lp = levenshtein.Levenshtein()

    for i in range(0, len(source_words)):
        lp.source_word = source_words[i]
        lp.target_word = target_words[i]

        lp.calculate()

        lp.print_grid()
        lp.print_cost()

        print("")

#-----------------------------------------------------------------------
main()

Firstly we create two lists of word pairs to run the algorithm on, and then create a Levenshtein object. Then we iterate the lists, setting the words and calling the methods.

Run the code with this command.

Running the program

python3.6 main.py

I have already included the banama/banana output above so won't repeat it here. The word distance there was 1, so banana easily qualifies as a sensible suggestion if somebody mis-types banama. However, if you typed banama you wouldn't expect your word processor to suggest elephant, so let's try those two words.

Program Output = "banama" and "elephant"

-----------------------------
| code-in-python.com        |
| Levenshtein Word Distance |
-----------------------------

            e    l    e    p    h    a    n    t

       0    1    2    3    4    5    6    7    8

  b    1    1    2    3    4    5    6    7    8

  a    2    2    2    3    4    5    5    6    7

  n    3    3    3    3    4    5    6    5    6

  a    4    4    4    4    4    5    5    6    6

  m    5    5    5    5    5    5    6    6    7

  a    6    6    6    6    6    6    5    6    7

Minimum cost of transforming "banama" to "elephant" = 7

The word distance here is 7 as banama needs to be almost completely re-typed to get to elephant. Clearly not a valid alternative.

Program Output = "levinstein" and "levenshtein"

-----------------------------
| code-in-python.com        |
| Levenshtein Word Distance |
-----------------------------

            l    e    v    e    n    s    h    t    e    i    n

       0    1    2    3    4    5    6    7    8    9   10   11

  l    1    0    1    2    3    4    5    6    7    8    9   10

  e    2    1    0    1    2    3    4    5    6    7    8    9

  v    3    2    1    0    1    2    3    4    5    6    7    8

  i    4    3    2    1    1    2    3    4    5    6    6    7

  n    5    4    3    2    2    1    2    3    4    5    6    6

  s    6    5    4    3    3    2    1    2    3    4    5    6

  t    7    6    5    4    4    3    2    2    2    3    4    5

  e    8    7    6    5    4    4    3    3    3    2    3    4

  i    9    8    7    6    5    5    4    4    4    3    2    3

  n   10    9    8    7    6    5    5    5    5    4    3    2

Minimum cost of transforming "levinstein" to "levenshtein" = 2

Finally levinstein/levenshtein. One substitution and one insertion give us a word distance of 2, also a sensible suggestion. You would probably consider word distances of 2 or perhaps 3 to be reasonable for alternative suggestions, but no more.

Please follow Code in Python on Twitter to keep up to date with new posts.