Implementing autocomplete with trie data structure

We have seen autocomplete functionality in many applications and devices. Have you ever thought of implementing it with an efficient data structure? If not let us learn the data structure trie and implement a basic auto complete using python.

According to wikipedia: In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

Let us go through requirements for an autocompletion functionality: * Search should be fast * Memory should be efficiently used * Easy to build the data structure

We use a dictionary as data store for storing the words. When we type few letters of a word, we can do a regular expression match each time through the dictionary text file and print the words that starting the letters. Using the command grep we can print the words starting with letters as follows.

1 $ grep "^an" dictionary.txt

But it is a quick and dirty hack. But it is very inefficient if we need to use it in a large scale because, it needs to read the entire words in the dictionary and compare character by character each time. It is costly operation.

Trie data structure is a tree with each node consisting of one letter as data and pointer to the next node. It uses Finite Deterministic automation. That means, it uses state to state transition.

This is a trie for keys “A”, “to”, “tea”, “ted”, “ten”, “i”, “in”, and “inn”. When we need to do auto complete for the starting characters, “te”, we need to get output tea, ted and ten. Instead of checking regular expression match for all the words in the database, it will make use of transitions. First character is t. Then in the root element, it will make transition for ‘t’ so that it will reach the node with data ‘t’, then at node ‘t’, it will make transition for next node ‘e’. At that point, we need to follow all paths from node ‘e’ to leaf nodes so that we can get the paths t->e->a, t->e->d and t->e->n. This is the basic algorithm behind implementing an efficient auto complete.

Implementation of auto complete was a question asked for programming round of recruitment process held at our campus. I implemented it in python with Trie and I got placed :).

Here is the python program I wrote:

 1 #!/usr/bin/env python
 2 import sys
 4 class Node:
 5 	def __init__(self):
 6 = {}	#Initialize an empty hash (python dictionary)
 7 		self.word_marker = False 
 8 		# There can be words, Hot and Hottest. When search is performed, usually state transition upto leaf node is peformed and characters are printed. 
 9 		# Then in this case, only Hottest will be printed. Hot is intermediate state. Inorder to mark t as a state where word is to be print, a word_marker is used
12 	def add_item(self, string):
13 		''' Method to add a string the Trie data structure'''
15 		if len(string) == 0:
16 			self.word_marker = True 
17 			return 
19 		key = string[0] #Extract first character
20 		string = string[1:] #Create a string by removing first character
22 		# If the key character exists in the hash, call next pointing node's add_item() with remaining string as argument
23 		if
25 		# Else create an empty node. Insert the key character to hash and point it to newly created node. Call add_item() in new node with remaining string.
26 		else:
27 			node = Node()
28[key] = node
29 			node.add_item(string)
32 	def dfs(self, sofar=None):
33 		'''Perform Depth First Search Traversal'''
35 		# When hash of the current node is empty, that means it is a leaf node. 
36 		# Hence print sofar (sofar is a string containing the path as character sequences through which state transition occured)
37 		if == []:
38 			print "Match:",sofar
39 			return
41 		if self.word_marker == True:
42 			print "Match:",sofar
44 		# Recursively call dfs for all the nodes pointed by keys in the hash
45 		for key in
48 	def search(self, string, sofar=""):
49 		'''Perform auto completion search and print the autocomplete results'''
50 		# Make state transition based on the input characters. 
51 		# When the input characters becomes exhaused, perform dfs() so that the trie gets traversed upto leaves and print the state characters
52 		if len(string) > 0:
53 			key = string[0]
54 			string = string[1:]
55 			if
56 				sofar = sofar + key
59 			else:
60 				print "No match"
61 		else:
62 			if self.word_marker == True:
63 				print "Match:",sofar
65 			for key in
69 def fileparse(filename):
70 	'''Parse the input dictionary file and build the trie data structure'''
71 	fd = open(filename)
73 	root = Node()	
74 	line = fd.readline().strip('\r\n') # Remove newline characters \r\n
76 	while line !='':
77 		root.add_item(line)
78 		line = fd.readline().strip('\r\n')
80 	return root
84 if __name__ == '__main__':
86 	if len(sys.argv) != 2:
87 		print "Usage: ", sys.argv[0], "dictionary_file.txt"
88 		sys.exit(2)
90 	root  = fileparse(sys.argv[1])
92 	print "Input:",
93 	input=raw_input()

You can run it as:

1 $ python dictionary.txt
2 Input: an
3 Match: and
4 Match: angry
5 Match: angle
6 Match: animal
7 Match: answer
8 Match: ant
9 Match: any

You can download the source files from here: autocomplete.tar.gz

Note: The image of trie used in this post is taken from wikipedia.