Posts Tagged ‘algorithm’
Posted on March 5, 2011 - by Sarath
Implementing a spell checker
Following my previous post on autocomplete, here comes a spell checker program. First of all I would like to warn you that this is the worst and inefficient spell checker you have ever seen. The most slowest and CPU expensive
. I wrote this program long ago for fun. I thought of sharing the code I wrote.
The purpose of this post is to show how to write a spell checker in few lines by using the Levenshtein distance algorithm. This is not a context sensitive spell checker. The following spell checker program checks each word, and if it has spell errors it will list out some word suggestions.
Levenshtein algorithm is used to calculate distance between two words Levenshtein distance. It is defined as minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. See wikipedia for more details on the algorithm.
This spell checker program uses a dictionary file. Every Unix like OS comes with a default dictionary file (in the directory /usr/share/dict/). Make use of that. To spell check a word, following program calculate Levenshtein distance by comparing all words in the dictionary. If the Levenshtein distance equal to one, those words will be listed as word suggestions. If the distance is zero, that means the word is a valid dictionary word. Checking across all the words in the dictionary is very expensive. Hence in order to reduce the number of comparisons needed, we use a subset of words in the dictionary having word length equal to, plus one and minus one with respect to the word we are checking for spell error. For implementing that, we have built a hash called wordsmap (python dictionary). The key used for the hash is the word length and the value assigned is a list of words of words having the word length as the key.
import sys
def build_matrix(n,m):
'''Build a nxm matrix with first row = [0,1,2..m-1] and first col = [0,1,2..n-1]'''
matrix = [[i for i in range(m)]]
row = [0 for i in range(m)]
for i in range(1,n):
matrix.append(row[:])
matrix[i][0] = i
return matrix
def print_matrix(mat):
'''Print matrix in formatted rows'''
for col in mat:
print col
def distance(src,dest):
'''Calculate Levenshtein distance'''
n = len(src)
m = len(dest)
matrix = build_matrix(n+1,m+1)
for i in range(1,n+1):
for j in range(1,m+1):
cost = int( (ord(src[i-1]) - ord(dest[j-1]))!= 0)
v1 = matrix[i-1][j] + 1
v2 = matrix[i][j-1] + 1
v3 = matrix[i-1][j-1] + cost
matrix[i][j] = min(v1,v2,v3)
return matrix[n][m]
def dictparse(filename):
'''Parse words from dictionary and build a hash for reduce word comparisons'''
f = open(filename)
wordsmap = {}
for word in f:
word = word.strip('\r\n')
length = len(word)
if not wordsmap.has_key(length):
wordsmap[length] = []
wordsmap[length].append(word)
return wordsmap
def spellcheck(word,wordsmap):
'''Perform spellcheck and provide word suggestions'''
minword = ''
errors = []
length = len(word)
if length == 1:
return None
selected_set = wordsmap[length] + wordsmap[length+1]
dist = 1000
if word in wordsmap[length]:
return None
for w in selected_set:
calcdist = distance(word,w)
if calcdist < 2:
errors.append(w)
return errors
if __name__ == '__main__':
wordsmap = dictparse('/usr/share/dict/words')
line = raw_input("Input Line: ")
for w in line.split(' '):
result = spellcheck(w,wordsmap)
if result != None:
print w,"(",",".join(result),")",
else:
print w,
Let us do a test run with the program:
Input Line: prackers break and stel code
prackers ( crackers ) break and stel ( seel,skel,steg,stem,sten,step,stet,stew,stey,steal,steel,stela,stele,stell ) code
Posted on March 3, 2011 - by Sarath
Implementing autocomplete with trie data structure
We have seen autocomplete functionality in many applications and devices. Have you ever thought of implementing it with an efficient data structure? If not let us learn the data structure trie and implement a basic auto complete using python.
According to wikipedia:
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
Let us go through requirements for an autocompletion functionality:
* Search should be fast
* Memory should be efficiently used
* Easy to build the data structure
We use a dictionary as data store for storing the words. When we type few letters of a word, we can do a regular expression match each time through the dictionary text file and print the words that starting the letters. Using the command grep we can print the words starting with letters as follows.
But it is a quick and dirty hack. But it is very inefficient if we need to use it in a large scale because, it needs to read the entire words in the dictionary and compare character by character each time. It is costly operation.
Trie data structure is a tree with each node consisting of one letter as data and pointer to the next node. It uses Finite Deterministic automation. That means, it uses state to state transition.
This is a trie for keys “A”, “to”, “tea”, “ted”, “ten”, “i”, “in”, and “inn”. When we need to do auto complete for the starting characters, “te”, we need to get output tea, ted and ten. Instead of checking regular expression match for all the words in the database, it will make use of transitions. First character is t. Then in the root element, it will make transition for ‘t’ so that it will reach the node with data ‘t’, then at node ‘t’, it will make transition for next node ‘e’. At that point, we need to follow all paths from node ‘e’ to leaf nodes so that we can get the paths t->e->a, t->e->d and t->e->n. This is the basic algorithm behind implementing an efficient auto complete.
Implementation of auto complete was a question asked for programming round of Taggle.com recruitment process held at our campus. I implemented it in python with Trie and I got placed
.
Here is the python program I wrote:
import sys
class Node:
def __init__(self):
self.next = {} #Initialize an empty hash (python dictionary)
self.word_marker = False
# There can be words, Hot and Hottest. When search is performed, usually state transition upto leaf node is peformed and characters are printed.
# Then in this case, only Hottest will be printed. Hot is intermediate state. Inorder to mark t as a state where word is to be print, a word_marker is used
def add_item(self, string):
''' Method to add a string the Trie data structure'''
if len(string) == 0:
self.word_marker = True
return
key = string[0] #Extract first character
string = string[1:] #Create a string by removing first character
# If the key character exists in the hash, call next pointing node's add_item() with remaining string as argument
if self.next.has_key(key):
self.next[key].add_item(string)
# Else create an empty node. Insert the key character to hash and point it to newly created node. Call add_item() in new node with remaining string.
else:
node = Node()
self.next[key] = node
node.add_item(string)
def dfs(self, sofar=None):
'''Perform Depth First Search Traversal'''
# When hash of the current node is empty, that means it is a leaf node.
# Hence print sofar (sofar is a string containing the path as character sequences through which state transition occured)
if self.next.keys() == []:
print "Match:",sofar
return
if self.word_marker == True:
print "Match:",sofar
# Recursively call dfs for all the nodes pointed by keys in the hash
for key in self.next.keys():
self.next[key].dfs(sofar+key)
def search(self, string, sofar=""):
'''Perform auto completion search and print the autocomplete results'''
# Make state transition based on the input characters.
# When the input characters becomes exhaused, perform dfs() so that the trie gets traversed upto leaves and print the state characters
if len(string) > 0:
key = string[0]
string = string[1:]
if self.next.has_key(key):
sofar = sofar + key
self.next[key].search(string,sofar)
else:
print "No match"
else:
if self.word_marker == True:
print "Match:",sofar
for key in self.next.keys():
self.next[key].dfs(sofar+key)
def fileparse(filename):
'''Parse the input dictionary file and build the trie data structure'''
fd = open(filename)
root = Node()
line = fd.readline().strip('\r\n') # Remove newline characters \r\n
while line !='':
root.add_item(line)
line = fd.readline().strip('\r\n')
return root
if __name__ == '__main__':
if len(sys.argv) != 2:
print "Usage: ", sys.argv[0], "dictionary_file.txt"
sys.exit(2)
root = fileparse(sys.argv[1])
print "Input:",
input=raw_input()
root.search(input)
You can run it as:
Input: an
Match: and
Match: angry
Match: angle
Match: animal
Match: answer
Match: ant
Match: any
You can download the source files from here: autocomplete.tar.gz
Note: The image of trie used in this post is taken from wikipedia.

Solve real-world shell scripting problems with over 110 simple but incredibly effective recipes.

