Implementing a spell checker

Following my previous post on autocomplete, here comes a spell checker program. First of all I would like to warn you that this is the worst and inefficient spell checker you have ever seen. The most slowest and CPU expensive :). I wrote this program long ago for fun. I thought of sharing the code I wrote.

The purpose of this post is to show how to write a spell checker in few lines by using the Levenshtein distance algorithm. This is not a context sensitive spell checker. The following spell checker program checks each word, and if it has spell errors it will list out some word suggestions.

Levenshtein algorithm is used to calculate distance between two words Levenshtein distance. It is defined as minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. See wikipedia for more details on the algorithm.

This spell checker program uses a dictionary file. Every Unix like OS comes with a default dictionary file (in the directory /usr/share/dict/). Make use of that. To spell check a word, following program calculate Levenshtein distance by comparing all words in the dictionary. If the Levenshtein distance equal to one, those words will be listed as word suggestions. If the distance is zero, that means the word is a valid dictionary word. Checking across all the words in the dictionary is very expensive. Hence in order to reduce the number of comparisons needed, we use a subset of words in the dictionary having word length equal to, plus one and minus one with respect to the word we are checking for spell error. For implementing that, we have built a hash called wordsmap (python dictionary). The key used for the hash is the word length and the value assigned is a list of words of words having the word length as the key.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
#!/usr/bin/env python

import sys

def build_matrix(n,m):

	'''Build a nxm matrix with first row = [0,1,2..m-1] and first col = [0,1,2..n-1]'''

	matrix = [[i for i in range(m)]]
	row = [0 for i in range(m)]
	for i in range(1,n):
		matrix.append(row[:])
		matrix[i][0] = i

	return matrix

def print_matrix(mat):
	'''Print matrix in formatted rows'''
	for col in mat:
		print col

def distance(src,dest):
	'''Calculate Levenshtein distance'''
	n = len(src)
	m = len(dest)

	matrix = build_matrix(n+1,m+1)
	for i in range(1,n+1):
		for j in range(1,m+1):
			cost = int( (ord(src[i-1]) - ord(dest[j-1]))!= 0)
			v1 = matrix[i-1][j] + 1 
			v2 = matrix[i][j-1] + 1
			v3 = matrix[i-1][j-1] + cost

			matrix[i][j] = min(v1,v2,v3)
	return matrix[n][m]



def dictparse(filename):
	'''Parse words from dictionary and build a hash for reduce word comparisons'''
	f = open(filename)
	wordsmap = {}
	
	for word in f:
		
		word = word.strip('\r\n')
		length = len(word)
		if not wordsmap.has_key(length):
			wordsmap[length] = []
	
		wordsmap[length].append(word)

	return wordsmap


def spellcheck(word,wordsmap):
	'''Perform spellcheck and provide word suggestions'''
	minword = ''
	errors = []
	length = len(word)
	if length == 1:
		return None
	selected_set = wordsmap[length] + wordsmap[length+1]
	dist = 1000
	if word in wordsmap[length]:
		return None

	for w in selected_set:
		calcdist = distance(word,w)

		if calcdist < 2:
			errors.append(w)

	return errors

if __name__ == '__main__':

	wordsmap = dictparse('/usr/share/dict/words')

	line = raw_input("Input Line: ") 
	for w in line.split(' '):
		result = spellcheck(w,wordsmap)

		if result != None:
			print w,"(",",".join(result),")",
		else:
			print w,

Let us do a test run with the program:

1
2
3
$ python spellchecker.py
Input Line: prackers break and stel code
prackers ( crackers ) break and stel ( seel,skel,steg,stem,sten,step,stet,stew,stey,steal,steel,stela,stele,stell ) code