• Home
  • About
  • Archives
  • Book
  • Contact me
  • Photos
  • Projects
  • Talks
Subscribe: Posts | Comments | E-mail
  • ArticlesArticles which I authored
  • GSOCGoogle Summer of code archives
  • HacksExperiments
  • LifeIn and around life
  • Open SourceFree and Open Source Software
  • PardusContributions with Pardus Project

Sarath Lakshman

Posts Tagged ‘algorithm’


Posted on March 5, 2011 - by Sarath

Implementing a spell checker

Following my previous post on autocomplete, here comes a spell checker program. First of all I would like to warn you that this is the worst and inefficient spell checker you have ever seen. The most slowest and CPU expensive :) . I wrote this program long ago for fun. I thought of sharing the code I wrote.

The purpose of this post is to show how to write a spell checker in few lines by using the Levenshtein distance algorithm. This is not a context sensitive spell checker. The following spell checker program checks each word, and if it has spell errors it will list out some word suggestions.

Levenshtein algorithm is used to calculate distance between two words Levenshtein distance. It is defined as minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. See wikipedia for more details on the algorithm.

This spell checker program uses a dictionary file. Every Unix like OS comes with a default dictionary file (in the directory /usr/share/dict/). Make use of that. To spell check a word, following program calculate Levenshtein distance by comparing all words in the dictionary. If the Levenshtein distance equal to one, those words will be listed as word suggestions. If the distance is zero, that means the word is a valid dictionary word. Checking across all the words in the dictionary is very expensive. Hence in order to reduce the number of comparisons needed, we use a subset of words in the dictionary having word length equal to, plus one and minus one with respect to the word we are checking for spell error. For implementing that, we have built a hash called wordsmap (python dictionary). The key used for the hash is the word length and the value assigned is a list of words of words having the word length as the key.

#!/usr/bin/env python

import sys

def build_matrix(n,m):

    '''Build a nxm matrix with first row = [0,1,2..m-1] and first col = [0,1,2..n-1]'''

    matrix = [[i for i in range(m)]]
    row = [0 for i in range(m)]
    for i in range(1,n):
        matrix.append(row[:])
        matrix[i][0] = i

    return matrix

def print_matrix(mat):
    '''Print matrix in formatted rows'''
    for col in mat:
        print col

def distance(src,dest):
    '''Calculate Levenshtein distance'''
    n = len(src)
    m = len(dest)

    matrix = build_matrix(n+1,m+1)
    for i in range(1,n+1):
        for j in range(1,m+1):
            cost = int( (ord(src[i-1]) - ord(dest[j-1]))!= 0)
            v1 = matrix[i-1][j] + 1
            v2 = matrix[i][j-1] + 1
            v3 = matrix[i-1][j-1] + cost

            matrix[i][j] = min(v1,v2,v3)
    return matrix[n][m]



def dictparse(filename):
    '''Parse words from dictionary and build a hash for reduce word comparisons'''
    f = open(filename)
    wordsmap = {}
   
    for word in f:
       
        word = word.strip('\r\n')
        length = len(word)
        if not wordsmap.has_key(length):
            wordsmap[length] = []
   
        wordsmap[length].append(word)

    return wordsmap


def spellcheck(word,wordsmap):
    '''Perform spellcheck and provide word suggestions'''
    minword = ''
    errors = []
    length = len(word)
    if length == 1:
        return None
    selected_set = wordsmap[length] + wordsmap[length+1]
    dist = 1000
    if word in wordsmap[length]:
        return None

    for w in selected_set:
        calcdist = distance(word,w)

        if calcdist < 2:
            errors.append(w)

    return errors

if __name__ == '__main__':

    wordsmap = dictparse('/usr/share/dict/words')

    line = raw_input("Input Line: ")
    for w in line.split(' '):
        result = spellcheck(w,wordsmap)

        if result != None:
            print w,"(",",".join(result),")",
        else:
            print w,

Let us do a test run with the program:

$ python spellchecker.py
Input Line: prackers break and stel code
prackers ( crackers ) break and stel ( seel,skel,steg,stem,sten,step,stet,stew,stey,steal,steel,stela,stele,stell ) code

Write your comments here.


Posted on March 3, 2011 - by Sarath

Implementing autocomplete with trie data structure

We have seen autocomplete functionality in many applications and devices. Have you ever thought of implementing it with an efficient data structure? If not let us learn the data structure trie and implement a basic auto complete using python.

According to wikipedia:
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

Let us go through requirements for an autocompletion functionality:
* Search should be fast
* Memory should be efficiently used
* Easy to build the data structure

We use a dictionary as data store for storing the words. When we type few letters of a word, we can do a regular expression match each time through the dictionary text file and print the words that starting the letters. Using the command grep we can print the words starting with letters as follows.

$ grep "^an" dictionary.txt

But it is a quick and dirty hack. But it is very inefficient if we need to use it in a large scale because, it needs to read the entire words in the dictionary and compare character by character each time. It is costly operation.

Trie data structure is a tree with each node consisting of one letter as data and pointer to the next node. It uses Finite Deterministic automation. That means, it uses state to state transition.

This is a trie for keys “A”, “to”, “tea”, “ted”, “ten”, “i”, “in”, and “inn”. When we need to do auto complete for the starting characters, “te”, we need to get output tea, ted and ten. Instead of checking regular expression match for all the words in the database, it will make use of transitions. First character is t. Then in the root element, it will make transition for ‘t’ so that it will reach the node with data ‘t’, then at node ‘t’, it will make transition for next node ‘e’. At that point, we need to follow all paths from node ‘e’ to leaf nodes so that we can get the paths t->e->a, t->e->d and t->e->n. This is the basic algorithm behind implementing an efficient auto complete.

Implementation of auto complete was a question asked for programming round of Taggle.com recruitment process held at our campus. I implemented it in python with Trie and I got placed :) .

Here is the python program I wrote:

#!/usr/bin/env python
import sys

class Node:
    def __init__(self):
        self.next = {}  #Initialize an empty hash (python dictionary)
        self.word_marker = False
        # There can be words, Hot and Hottest. When search is performed, usually state transition upto leaf node is peformed and characters are printed.
        # Then in this case, only Hottest will be printed. Hot is intermediate state. Inorder to mark t as a state where word is to be print, a word_marker is used


    def add_item(self, string):
        ''' Method to add a string the Trie data structure'''
       
        if len(string) == 0:
            self.word_marker = True
            return
       
        key = string[0] #Extract first character
        string = string[1:] #Create a string by removing first character

        # If the key character exists in the hash, call next pointing node's add_item() with remaining string as argument
        if self.next.has_key(key):
            self.next[key].add_item(string)
        # Else create an empty node. Insert the key character to hash and point it to newly created node. Call add_item() in new node with remaining string.
        else:
            node = Node()
            self.next[key] = node
            node.add_item(string)


    def dfs(self, sofar=None):
        '''Perform Depth First Search Traversal'''
       
        # When hash of the current node is empty, that means it is a leaf node.
        # Hence print sofar (sofar is a string containing the path as character sequences through which state transition occured)
        if self.next.keys() == []:
            print "Match:",sofar
            return
           
        if self.word_marker == True:
            print "Match:",sofar

        # Recursively call dfs for all the nodes pointed by keys in the hash
        for key in self.next.keys():
            self.next[key].dfs(sofar+key)

    def search(self, string, sofar=""):
        '''Perform auto completion search and print the autocomplete results'''
        # Make state transition based on the input characters.
        # When the input characters becomes exhaused, perform dfs() so that the trie gets traversed upto leaves and print the state characters
        if len(string) > 0:
            key = string[0]
            string = string[1:]
            if self.next.has_key(key):
                sofar = sofar + key
                self.next[key].search(string,sofar)
               
            else:
                print "No match"
        else:
            if self.word_marker == True:
                print "Match:",sofar

            for key in self.next.keys():
                self.next[key].dfs(sofar+key)


def fileparse(filename):
    '''Parse the input dictionary file and build the trie data structure'''
    fd = open(filename)

    root = Node()  
    line = fd.readline().strip('\r\n') # Remove newline characters \r\n

    while line !='':
        root.add_item(line)
        line = fd.readline().strip('\r\n')

    return root



if __name__ == '__main__':

    if len(sys.argv) != 2:
        print "Usage: ", sys.argv[0], "dictionary_file.txt"
        sys.exit(2)
       
    root  = fileparse(sys.argv[1])

    print "Input:",
    input=raw_input()
    root.search(input)

You can run it as:

$ python autocomplete.py dictionary.txt
Input: an
Match: and
Match: angry
Match: angle
Match: animal
Match: answer
Match: ant
Match: any

You can download the source files from here: autocomplete.tar.gz

Note: The image of trie used in this post is taken from wikipedia.



  • About

    Sarath Lakshman is a Hactivist of Free and Open Source Software from Kerala.
    Read more about him.
  • My Book

    Solve real-world shell scripting problems with over 110 simple but incredibly effective recipes.



  • Follow

  • Random Photos

    Location: Near Model Engineering College Canteen
    Shreyas Srinivasan - Hacking wordpress 
Hackfest @ MEC
    Atul Chitnis - Talking 'FOSS in the real world'
  • Tweets

    • Preparing for your first-job interviews:
      http://t.co/SBdRl4At
      2011/11/30 23:31
    • is down. Having some issues with hosting account. I will update when it is back.
      http://t.co/Hj3u1qm1
      2011/11/29 11:59
    • Blog post: Preparing for your first-job interviews:
      http://t.co/SBdRl4At
      2011/11/29 09:47
    • Packt Publishers interviewed me.
      http://t.co/CMrvOPh
      2011/09/08 00:21
  • Calendar

    May 2012
    M T W T F S S
    « Nov    
     123456
    78910111213
    14151617181920
    21222324252627
    28293031  
  • Archives

  • Blogroll

    • FOSS.IN
    • GNU Vision Blog
    • Hiran Effects
    • J5′s blog
    • Pardus planet
    • Praveen Arimbrathodiyil’s blog
    • Santhosh Thottingal
    • SLYNUX GNU Operating System
    • St Josephs HSS, Thalassery – Alumni
    • Swaroop CH
    • TT’s Jottings-Blog of VU2SWX
  • Tags

    algorithm automation bangalore bash bash scripting bug code college contribution define development facebook fedora foss fossmeet freedom free sms freesoftware Friends fun gnome gnu google google summer of code hack hacking hacks internet interview joy kde 4.1.2 kochi Life linux mec microsoft new year night nitc pardus pitivi python script summer of code video editor
Copyright © 2005 - 2010 Sarath Lakshman
Powered by Wordpress 3.04