More Tries in Python

Blog

Code, Python, Tech

by Maria on 13 Sep 2013 - 06:40  

In the previous blog post we tried to figure out what the python code I found on Wikipedia that was supposedly returning an iterator over the items of the Trie was actually doing, and then we sort of ran with that to gain some understanding of Tries, generators, and recursion in Python. And that was great, but in the end we didn't really have anything useful. Our method returned all of the letters in our Trie, one at a time. It seems it would be more useful if it returned all of the words in our tree, so let's try doing that. Since we have a goal, let's create a test so we can tell if we have reached our goal. Here is our test code:

import unittest
from trie import Trie

class TestWords(unittest.TestCase):
    """Tests for function words"""

    def setUp(self):
        unittest.TestCase.setUp(self)

        self.mytrie = Trie()
        self.mytrie.add('ant',1)
        self.mytrie.add('ante',2)
        self.mytrie.add('antic',3)
        self.mytrie.add('antsy',4)
        self.mytrie.add('antse',5)
        self.mytrie.add('ban',6)
        self.mytrie.add('banana',7)

    def test_default_case(self):
        """Test words retrieves all words properly from Trie."""
        expected = ['ant','ante','antic','antsy','antse','ban','banana']
        actual = []
        for words in self.mytrie.words():
            actual.append(words)
        print 'actual', actual                                                                                          
        print 'expected', expected                                                                                        
        self.assertTrue(sorted(actual)==sorted(expected))

if __name__ == '__main__':
    unittest.main(exit=False)

I've made a fairly complicated, but relatively small, Trie, to really give our code a run for our money. And yes, I even made up a word. And, of course, our method fails miserably. So, let's see about getting some words instead of letters.

We know that many of our words have prefixes in common, that is the point of creating a tree like this. So, the general idea is going to be to collect letters and re-use them. To begin, let's just start collecting letters, and spitting out what we have when we reach the end of a word. We can tell if we are at the end of a word by whether there is a node.value.

    def words(self, prefix = []):
        """Return an iterator over the items (words) of the 'Trie'."""
        word = False
        for char, node in self.root.iteritems():
            prefix.append(char)
            if node.value:
                yield ''.join(prefix)
            for i in node.words():
                yield i
 

And, our test fails!

actual ['ant', 'antic', 'anticsy', 'anticsye', 'anticsyee', 'anticsyeeban', 'anticsyeebanana']
expected ['ante', 'antic', 'ant', 'antsy', 'antse', 'banana', 'ban']
F
======================================================================
FAIL: test_default_case (__main__.TestWords)
Test words retrieves all words properly from Trie.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_trie.py", line 27, in test_default_case
    self.assertTrue(sorted(actual)==sorted(expected))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (failures=1)

I have printed out the expected and actual list of words, and we can see immediately what the problem is. We need to figure out how to delete the letters we no longer need. How do we decide when we should delete a letter? Let's take a look at our Trie:

It looks like we will need to delete letters when we come to the end of a word, but only if that letter is at the end of a branch. How will we know? And how many letters will we delete? When we get to the end of the word antsy, we will need to delete just the y, but when we get to the end of the word antse, we will need to delete both the e and the s. (We actually have no way of knowing which branch will be traversed first; since the tree is based on a dictionary, which node it traverses first is random.)

What would be really useful at this point is if we knew at each branch how many words share that prefix. We could certainly figure this out by traversing the tree once, and then using the information when we traverse the tree again to get the words. But, it seems like this might actually be useful information, in and of itself. So, maybe this is something that really should already be available. This is the advantage with playing around with your proposed data structure before production use. We can still go back and add functionality. So, let's put some new code in our add function to keep track of the number of words using a given prefix. The first two methods of our class now look like this:

class Trie:
    def __init__(self):
        self.root = defaultdict(Trie)
        self.value = None
        self.count = 0

    def add(self, s, value):
        """Add the string `s` to the                                                                          
        `Trie` and map it to the given value. Additionally, keep track of
        how many words each node is shared with.
        """


        head, tail = s[0], s[1:]
        cur_node = self.root[head]
        cur_node.count += 1
        if not tail:
            cur_node.value = value
            return  # No further recursion                                                                    
        self.root[head].add(tail, value)

Very simple addition, which is going to make our life much easier. We may also want to create a method that returns the number of words in our tree that share a particular prefix we may be interested in, but for now, let's finish our iterator.

For every letter that we find, we now know how many words use that letter, so if we keep track of how many words we find that are using each letter, we can compare the two. When we reach the end of a word that is also the end of a limb, we need to delete at least to the last branching point, but possibly farther. We traverse A-N-T-E and we have hit an endpoint. The E only has one word associated with it, so we can delete it, but the T has 5 words associated with it, so we need to check if we have finished 5 words yet. We can't just keep one running tally of words finished, because sometimes we will be keeping track of multiple branching points. For example, we need 5 words for the T, but only 2 for the S, but can be working on completing both of those branches at the same time. Clearly, we need to keep a list of words finished for each letter. We start at the A, we have finished no words, [0], continue to the N, [0, 0], and now to the T, [0, 0, 0], but now we have finished a word, so we go back and add 1 to everything [1, 1, 1]. Let's go to the E, we have [1, 1, 1, 0], and once again we have finished a word, so we now have [2, 2, 2, 1]. We are now at the end of a limb, so we delete letters.

But, wait, how did we know we are at the end of a limb? Well, we know we are at the end of a word, since we can check to see if there is a value associated with it. And if we check the node.count at this point, we can see that this letter is associated with just one word, so we must be at the end of a limb. Great, we can delete letters. How many? We have to check. We delete the E, then check the T. The T has 5 words associated with it, but we have only 2 recorded for the T, so we can't delete it yet. We are done deleting, as no earlier letters can possibly be ready for deleting yet. Now we go to the I, but wait, our letters-visited-list is [2, 2, 2, 1]. Hmm, looks like we need to remember to delete the 1 at the same time that we delete the letter. Makes sense, as this list should correspond to our prefix list. We are at [2, 2, 2, 0] now. We go to the C, and we are at the end of another word, so the array becomes [3, 3, 3, 1, 1]. We check our node.count, and it is 1, so we are at the end of a limb. Delete letters at the end of our array until we get back to the T, which is still a 5, and we are only at 3. Great, seem to have the hang of this. But, we have now checked the T twice, and you may remember that we are not actually going to have access to that node when we backing out. You can tell this is true, because when we were getting the letters, and not deleting any of them, we did not see multiple T's. The final word was 'anticsyeebanana', so the algorithm was just tacking on more endings, and not re-visiting previous letters.

What would make more sense anyway is to keep an array of the nodes we have visited, and what their corresponding word counts should be. So, that list would now look like [5, 5, 5, 1, 1]. Now we just compare the arrays and when numbers are equal, get rid of those letters. Well, we still must only check when we are at the end of a limb. But, this seems a little over-complicated too. What if we just collected the word count array from the nodes, and subtracted 1 from the whole array whenever we hit the end of a word? Any letters currently in the array would be ones that this word would be a part of, since it is just a representation of the letters in the prefix array. Any zeros would represent the letters (and positions in the word count array) we need to delete. Let's try it.

    def words(self, path = [], prefix = []):
        """Return an iterator over the words of the 'Trie'."""
        for char, node in self.root.iteritems():
            prefix.append(char)
            path.append(node.count)

            # if there is a node.value, then we are at the end of a word and
            # we should yield the word and subtract 1 (word!) from the
            #  entire node path
            if node.value:
                yield ''.join(prefix)        
                for x,y in enumerate(path):
                        path[x] = y - 1

            # if we are at the end of a word and the count is 1,
            # we are at the end of a branch, and it is time to
            # delete letters back to the last branch we haven't
            # gone down yet.
            if node.value and node.count == 1:
                    for j in range(path.count(0)):
                        del path[-1]
                        del prefix[-1]

            for i in node.words():
                yield i

And, let's run our test.

actual ['ant', 'antic', 'antsy', 'antse', 'ante', 'ban', 'banana']
expected ['ante', 'antic', 'ant', 'antsy', 'antse', 'banana', 'ban']
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Left the print statements so you could see the words. So, at this point, we should create a bunch more tests to check more edge cases.

Do you suppose this is what the author was trying to do with the code in the Wikipedia article? Is there any way to do this without including the node.count or checking the tree twice?

You can find my code on GitHub.

Finally, here is a nice post about how Tries are useful for word storage for computer versions of Boggle, without ever mentioning the word Boggle.

Bookmark and Share


Comments: 0

Contact me if you want to comment:

Subject: Subject:

Name:
Email:
Comments:

Enter code:

  LinkedIn