March 03, 2011

Implementing autocomplete with trie data structure

We have seen autocomplete functionality in many applications and devices. Have you ever thought of implementing it with an efficient data structure? If not let us learn the data structure trie and implement a basic auto complete using python.

According to wikipedia: In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

Let us go through requirements for an autocompletion functionality: * Search should be fast * Memory should be efficiently used * Easy to build the data structure

We use a dictionary as data store for storing the words. When we type few letters of a word, we can do a regular expression match each time through the dictionary text file and print the words that starting the letters. Using the command grep we can print the words starting with letters as follows.

1 $ grep "^an" dictionary.txt

But it is a quick and dirty hack. But it is very inefficient if we need to use it in a large scale because, it needs to read the entire words in the dictionary and compare character by character each time. It is costly operation.

Trie data structure is a tree with each node consisting of one letter as data and pointer to the next node. It uses Finite Deterministic automation. That means, it uses state to state transition.

This is a trie for keys “A”, “to”, “tea”, “ted”, “ten”, “i”, “in”, and “inn”. When we need to do auto complete for the starting characters, “te”, we need to get output tea, ted and ten. Instead of checking regular expression match for all the words in the database, it will make use of transitions. First character is t. Then in the root element, it will make transition for ‘t’ so that it will reach the node with data ‘t’, then at node ‘t’, it will make transition for next node ‘e’. At that point, we need to follow all paths from node ‘e’ to leaf nodes so that we can get the paths t->e->a, t->e->d and t->e->n. This is the basic algorithm behind implementing an efficient auto complete.

Implementation of auto complete was a question asked for programming round of Taggle.com recruitment process held at our campus. I implemented it in python with Trie and I got placed :).

Here is the python program I wrote:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94 #!/usr/bin/env python
import sys

class Node:
	def __init__(self):
		self.next = {}	#Initialize an empty hash (python dictionary)

		self.word_marker = False 
		# There can be words, Hot and Hottest. When search is performed, usually state transition upto leaf node is peformed and characters are printed. 

		# Then in this case, only Hottest will be printed. Hot is intermediate state. Inorder to mark t as a state where word is to be print, a word_marker is used



	def add_item(self, string):
		''' Method to add a string the Trie data structure'''
		
		if len(string) == 0:
			self.word_marker = True 
			return 
		
		key = string[0] #Extract first character

		string = string[1:] #Create a string by removing first character


		# If the key character exists in the hash, call next pointing node's add_item() with remaining string as argument

		if self.next.has_key(key):
			self.next[key].add_item(string)
		# Else create an empty node. Insert the key character to hash and point it to newly created node. Call add_item() in new node with remaining string.

		else:
			node = Node()
			self.next[key] = node
			node.add_item(string)


	def dfs(self, sofar=None):
		'''Perform Depth First Search Traversal'''
		
		# When hash of the current node is empty, that means it is a leaf node. 

		# Hence print sofar (sofar is a string containing the path as character sequences through which state transition occured)

		if self.next.keys() == []:
			print "Match:",sofar
			return
			
		if self.word_marker == True:
			print "Match:",sofar

		# Recursively call dfs for all the nodes pointed by keys in the hash

		for key in self.next.keys():
			self.next[key].dfs(sofar+key)

	def search(self, string, sofar=""):
		'''Perform auto completion search and print the autocomplete results'''
		# Make state transition based on the input characters. 

		# When the input characters becomes exhaused, perform dfs() so that the trie gets traversed upto leaves and print the state characters

		if len(string) > 0:
			key = string[0]
			string = string[1:]
			if self.next.has_key(key):
				sofar = sofar + key
				self.next[key].search(string,sofar)
				
			else:
				print "No match"
		else:
			if self.word_marker == True:
				print "Match:",sofar

			for key in self.next.keys():
				self.next[key].dfs(sofar+key)


def fileparse(filename):
	'''Parse the input dictionary file and build the trie data structure'''
	fd = open(filename)

	root = Node()	
	line = fd.readline().strip('\r\n') # Remove newline characters \r\n


	while line !='':
		root.add_item(line)
		line = fd.readline().strip('\r\n')

	return root



if __name__ == '__main__':

	if len(sys.argv) != 2:
		print "Usage: ", sys.argv[0], "dictionary_file.txt"
		sys.exit(2)
		
	root  = fileparse(sys.argv[1])

	print "Input:",
	input=raw_input()
	root.search(input)

You can run it as:

1
2
3
4
5
6
7
8
9 $ python autocomplete.py dictionary.txt
Input: an
Match: and
Match: angry
Match: angle
Match: animal
Match: answer
Match: ant
Match: any

You can download the source files from here: autocomplete.tar.gz

Note: The image of trie used in this post is taken from wikipedia.