• Home
  • About
  • Archives
  • Book
  • Contact me
  • Photos
  • Projects
  • Talks
Subscribe: Posts | Comments | E-mail
  • ArticlesArticles which I authored
  • GSOCGoogle Summer of code archives
  • HacksExperiments
  • LifeIn and around life
  • Open SourceFree and Open Source Software
  • PardusContributions with Pardus Project

Sarath Lakshman

Posted on March 3, 2011 - by Sarath

Implementing autocomplete with trie data structure

Articles Interview
FacebookGoogle BuzzIdenti.caShare


We have seen autocomplete functionality in many applications and devices. Have you ever thought of implementing it with an efficient data structure? If not let us learn the data structure trie and implement a basic auto complete using python.

According to wikipedia:
In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

Let us go through requirements for an autocompletion functionality:
* Search should be fast
* Memory should be efficiently used
* Easy to build the data structure

We use a dictionary as data store for storing the words. When we type few letters of a word, we can do a regular expression match each time through the dictionary text file and print the words that starting the letters. Using the command grep we can print the words starting with letters as follows.

$ grep "^an" dictionary.txt

But it is a quick and dirty hack. But it is very inefficient if we need to use it in a large scale because, it needs to read the entire words in the dictionary and compare character by character each time. It is costly operation.

Trie data structure is a tree with each node consisting of one letter as data and pointer to the next node. It uses Finite Deterministic automation. That means, it uses state to state transition.

This is a trie for keys “A”, “to”, “tea”, “ted”, “ten”, “i”, “in”, and “inn”. When we need to do auto complete for the starting characters, “te”, we need to get output tea, ted and ten. Instead of checking regular expression match for all the words in the database, it will make use of transitions. First character is t. Then in the root element, it will make transition for ‘t’ so that it will reach the node with data ‘t’, then at node ‘t’, it will make transition for next node ‘e’. At that point, we need to follow all paths from node ‘e’ to leaf nodes so that we can get the paths t->e->a, t->e->d and t->e->n. This is the basic algorithm behind implementing an efficient auto complete.

Implementation of auto complete was a question asked for programming round of Taggle.com recruitment process held at our campus. I implemented it in python with Trie and I got placed :) .

Here is the python program I wrote:

#!/usr/bin/env python
import sys

class Node:
    def __init__(self):
        self.next = {}  #Initialize an empty hash (python dictionary)
        self.word_marker = False
        # There can be words, Hot and Hottest. When search is performed, usually state transition upto leaf node is peformed and characters are printed.
        # Then in this case, only Hottest will be printed. Hot is intermediate state. Inorder to mark t as a state where word is to be print, a word_marker is used


    def add_item(self, string):
        ''' Method to add a string the Trie data structure'''
       
        if len(string) == 0:
            self.word_marker = True
            return
       
        key = string[0] #Extract first character
        string = string[1:] #Create a string by removing first character

        # If the key character exists in the hash, call next pointing node's add_item() with remaining string as argument
        if self.next.has_key(key):
            self.next[key].add_item(string)
        # Else create an empty node. Insert the key character to hash and point it to newly created node. Call add_item() in new node with remaining string.
        else:
            node = Node()
            self.next[key] = node
            node.add_item(string)


    def dfs(self, sofar=None):
        '''Perform Depth First Search Traversal'''
       
        # When hash of the current node is empty, that means it is a leaf node.
        # Hence print sofar (sofar is a string containing the path as character sequences through which state transition occured)
        if self.next.keys() == []:
            print "Match:",sofar
            return
           
        if self.word_marker == True:
            print "Match:",sofar

        # Recursively call dfs for all the nodes pointed by keys in the hash
        for key in self.next.keys():
            self.next[key].dfs(sofar+key)

    def search(self, string, sofar=""):
        '''Perform auto completion search and print the autocomplete results'''
        # Make state transition based on the input characters.
        # When the input characters becomes exhaused, perform dfs() so that the trie gets traversed upto leaves and print the state characters
        if len(string) > 0:
            key = string[0]
            string = string[1:]
            if self.next.has_key(key):
                sofar = sofar + key
                self.next[key].search(string,sofar)
               
            else:
                print "No match"
        else:
            if self.word_marker == True:
                print "Match:",sofar

            for key in self.next.keys():
                self.next[key].dfs(sofar+key)


def fileparse(filename):
    '''Parse the input dictionary file and build the trie data structure'''
    fd = open(filename)

    root = Node()  
    line = fd.readline().strip('\r\n') # Remove newline characters \r\n

    while line !='':
        root.add_item(line)
        line = fd.readline().strip('\r\n')

    return root



if __name__ == '__main__':

    if len(sys.argv) != 2:
        print "Usage: ", sys.argv[0], "dictionary_file.txt"
        sys.exit(2)
       
    root  = fileparse(sys.argv[1])

    print "Input:",
    input=raw_input()
    root.search(input)

You can run it as:

$ python autocomplete.py dictionary.txt
Input: an
Match: and
Match: angry
Match: angle
Match: animal
Match: answer
Match: ant
Match: any

You can download the source files from here: autocomplete.tar.gz

Note: The image of trie used in this post is taken from wikipedia.

This entry was posted on Thursday, March 3rd, 2011 at 1:58 pm and is filed under Articles, Interview. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

6 Comments

We'd love to hear yours!



  1. Visit My Website

    March 3, 2011

    Permalink

    Robin Marx said:

    Nice article!

    As you say, this is a very basic way of implementing autocomplete, one that is efficient in terms of speed, but maybe slightly inefficient in terms of memory usage (extra pointers to store).

    When willing to use autocomplete for a large database though (say for example youtube), you will need many more provisions to get a good system. I’m thinking here about caching results, substring autocomplete (“red” will give “reduction” but also “bored”), have nodes that stand for more than one letter, etc.

    I would love to hear your thoughts on these kinds of optimizations and extensions, as I feel they are a little more difficult to integrate into this kind of data structure.

    Thanks for sharing



  2. Visit My Website

    March 5, 2011

    Permalink

    Sarath said:

    The implementation will get more complex with caching systems and datastore. We have to digg more on it to come up with a production level autocomplete implementation. Though we talk lot about normalizing the database tables for efficiency, I have heard that in large systems, de-normalization is more useful to avoid a sequence of queries and make faster.



  3. Visit My Website

    March 13, 2011

    Permalink

    Jithu Sunny said:

    Nice article, your blog is simply addictive…:)



  4. Visit My Website

    November 28, 2011

    Permalink

    Preparing for your first-job interviews | Sarath Lakshman said:

    [...] Trie data structure Trie is an interesting data structure that can be used to implement autocomplete feature. You can read more about trie from my older blog post. (Implementing autocomplete with trie data structure) [...]



  5. Visit My Website

    February 4, 2012

    Permalink

    Jenny said:

    Such a great article it was which computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. Thanks for sharing this article.



  6. Visit My Website

    May 22, 2012

    Permalink

    Nicolai Diethelm said:

    Thank you for your article about implementing autocomplete with a trie data structure!
    If the set of autocomplete suggestions is rank-ordered, one should consider using a SuggestTree, a special variant of a trie. For any given prefix, it provides fast access to the top k suggestions that start with that prefix.
    http://suggesttree.sourceforge.net/



Leave a Comment

Here's your chance to speak.

  1. Name (required)

    Mail (required)

    Website

    Message

  • About

    Sarath Lakshman is a Hactivist of Free and Open Source Software from Kerala.
    Read more about him.
  • My Book

    Solve real-world shell scripting problems with over 110 simple but incredibly effective recipes.



  • Follow

  • Random Photos

    Atul Chitnis - Talking 'FOSS in the real world'
  • Tweets

    • Implementation overview of redirection and pipe operators in shell:
      http://t.co/z58r55oX
      2012/09/24 23:53
    • Nexus 7 looks great!
      2012/06/29 09:02
    • Preparing for your first-job interviews:
      http://t.co/SBdRl4At
      2011/11/30 23:31
    • is down. Having some issues with hosting account. I will update when it is back.
      http://t.co/Hj3u1qm1
      2011/11/29 11:59
  • Calendar

    March 2011
    M T W T F S S
    « Feb   Apr »
     123456
    78910111213
    14151617181920
    21222324252627
    28293031  
  • Archives

  • Blogroll

    • FOSS.IN
    • GNU Vision Blog
    • Hiran Effects
    • J5′s blog
    • Pardus planet
    • Praveen Arimbrathodiyil’s blog
    • Santhosh Thottingal
    • SLYNUX GNU Operating System
    • St Josephs HSS, Thalassery – Alumni
    • Swaroop CH
    • TT’s Jottings-Blog of VU2SWX
  • Tags

    algorithm automation bangalore bash bash scripting bug code college contribution define development facebook fedora foss fossmeet freedom free sms freesoftware Friends fun gnome gnu google google summer of code hack hacking internet interview joy kde 4.1.2 kochi Life linux mec microsoft new year night nitc pardus pitivi python script summer of code unix video editor
Copyright © 2005 - 2010 Sarath Lakshman
Powered by Wordpress 3.04