Python: Recursion, Generators and Tries

Blog

Code, Python, Tech

by Maria on 05 Sep 2013 - 04:25  

I was learning about Tries, and writing some code to do fun stuff with them, and along the way I ran into some code on Wikipedia that seems faulty, and also came up with a silly way to explain recursion. So, I thought I'd share.

First the silly explanation of recursion:

Let's say you have this wooden sign, but somehow the letters were put on the sign backwards, so your 'welcome' sign says 'emoclew'. So, you put it in your magic box. The box has boxes inside boxes, each with one door leading to the next inner box, kind of like a russian doll with doors. The magic part is that the inner boxes and doors only appear when you feed in a piece of wood with at least one letter on it. The sign goes through the first door, and the first letter (e) is cut off and set aside, and the rest of the sign continues to go through doors. Each time a letter is cut off, set by the door, and another door magically appears, until there is just one letter on your piece of wood. At this point no more doors appear, and the wood is sent back through the door(s) it came in. Each time it goes out of a door, the letter from that box is put back on the piece of wood, but now to the other side of the piece of wood. So as it goes back through the doors it looks like this: 'w' -> 'we' -> 'wel' -> 'welc', etc., until it pops out of the last door, and you have your sign the way you want it. Or, you could just repaint he sign. ;-)

This is pretty much how recursion works. The important bit is that the data is processed both on the way in and on the way out, and each step does exactly the same thing. This is how that code looks:

def reverse(s):
    if s == "":
        return s
    else:
        return reverse(s[1:]) + s[0]

reverse('emoclew')

The first time it goes in, there is a string there, so it goes to the else statement and runs reverse('moclew') + 'e'. The 'e' just hangs around by the door waiting for the return to be completed, while the 'moclew' goes through the next door. Same thing with the next run, the 'm' hangs out by the door, while the 'oclew' goes through the next door. Once we get to the last letter, no new door appears, instead our 'w' gets kicked out. This is the 'return s' line. But, we still have to go back out the doors we went through, so at the next door, we hit our return reverse(s[1:]) + s[0]. s[1:] at this point we have 'w' ('reverse(s[1:]) from the last door plus the 'e' (s[0]) we left hanging out, so we add them together. Going back through the next door, the reverse(s[1:]) is now 'we' and we add the 'l', continuing until we once more have our welcome sign, letters correctly written.

Now things get weird (and very useful) when we get multiple inner magic boxes. Let's say instead of our single word, we have a word tree. This time we aren't reversing the letters, just collecting them to find out what all words we have in our tree.

   B
   A
  C R
 K   E

Wouldn't it be great, if we could find all the words at once? This is basically what recursion can do. We send our tree in to our magic box. It collects the first letter, sends the tree through the next door, gets the second letter, sends the rest of the tree through the next door, but now finds two letters where it normally finds one. So, open two doors, and send each remaining tree through its own door, simultaneously! You can see how this should be faster than a for loop that would go through each branch one at a time. Recursion is most useful when you have data in tree format, such as directory structures, classification hierarchies, organization charts, etc. Or, if you are trying to do a flood fill. For an awesome explanation of the flood fill algorithm (with cats and zombies!) see the invent with python blog. There are, of course, times when recursion is not the best solution, such as calculating the nth Fibonacci number.

Great, now we are ready to look at Tries and the actual code to do this, and along the way introduce generators.

I was trying to understand Tries (aka Prefix Trees), and started playing around with the Python code on the Wikipedia entry for Tries. (You should probably go ahead and look at the whole article if you aren't sure what Tries are.) One of the last methods for Trie class in that code is items, which claims it returns an iterator.

def items(self):
    """Return an iterator over the items of the `Trie`."""
    for char, node in self.root.iteritems():
        if node.value is None:
            yield node.items
        else:
            yield node

So first let's see if we can figure out what this code is doing. Notice the yield statements. The yield statements indicate this is a generator. A yield statement is similar to a return statement, in that it sends the contents of whatever it is yielding to the calling function, but there is an important difference. As the name implies, it is only temporarily handing control over to the calling function. The generator is an object capable of creating and returning its members one at a time, because it is able to remember where it is when control is returned to it. It's purpose in life is iteration. If you have no idea what an iterable is, there is a great post on iterables here. This all may seem a bit fuzzy, but I hope by diving into this code some more, it will become clear.

So, this code seems to be creating a generator that spits out the node items of our tree. That sounds right. We know from looking at the other methods in our class Trie, that node.value is None until we come to an end of a word. So, maybe it spits out letters until it gets to the end of the word, and then moves to the next node? How would that happen? Hmm, this isn't sounding quite right...

And, when I started playing around with this method, I found it did not return what I expected or wanted. I wrote a little code snippet to test the items method. To start with I just added 2 entirely different words to my Trie. Here is a graphical representation of my tree:

And here is my little script:

$ from trie import Trie
$ mytrie = Trie()
$ mytrie.add('terror',1)
$ mytrie.add('plant',2)
$ for letters in mytrie.items():
$    print 'letters', letters 

I thought this would give me back all of the letters in my tree, but instead I got the following:

letters <bound method Trie.items of <wiki_trie.Trie instance at 0x1101921b8>>
letters <bound method Trie.items of <wiki_trie.Trie instance at 0x11018eef0>>

Certainly not the letters I expected. So, what is going on here? Well, the first problem is that we seem to be yielding the wrong thing, if we want letters. It seems to be yielding the method itself. Well, yes, that is exactly what it is doing. What do you even do with that? Some sort of twisted recursion? Hmm, well no surprise that the char in the iterator is the letter, so let's try yielding that instead. But it turns out, that if we yield char, we only get a letter 't' and letter 'p'. Why aren't we getting all of our tree?

So, our tree is made of up to 26 nodes at each level (assuming we are just allowing the lower case alphabet). In our example, we are using 2 nodes at the top level, 't' for 'terror' and 'p' for 'plant'. So, it looks like our code is iterating through those two nodes and stopping, and now we have to figure out how to get it to iterate through each of the child nodes as well. So, how do we do that? Recursion! Remember how I mentioned that recursion is useful for tree traversal? So, what happens if we iterate over node.items()*? Let's change our method, and for now, let's just print what we get:

* Note the parentheses that were missing in the original code on Wikipedia - we are now calling the function, not yielding the name. And no, yielding the call would not have been any more useful than yielding the name, or at least I could not find a use for it...

1    def items(self):
2        """Return an iterator over the items of the `Trie`."""
3        for char, node in self.root.iteritems():
4            # node.value is none when not at end of word
5            if node.value is None:
6                yield char
7                for i in node.items():
8                    print 'i',i
9            else:
10                yield char

And, rerunning my little test script, I get:

$ python test_trie.py
letters p
i l
i a
i n
i t
letters t
i e
i r
i r
i o
i r

Aha! So, now we yield this, and we are in business. A quick note about yield vs. return. With yield, you are not waiting until the function exits to return your variable. Using the box analogy, you are not waiting for the block of wood to go through all its boxes and come back again. You are yeilding your variable as soon as you are done with initial computations and hit yield, but remembering where you are, because you do expect to see the block of wood again. This means that, just like above where the i prints right away, you will get your letters one at a time as they are found.

We change our code again to this:

1 def items(self):
2      """Return an iterator over the items of the `Trie`."""
3      for char, node in self.root.iteritems():
4          # node.value is none when not at end of word
5          if node.value is None:
6              yield char
7               for i in node.items():
8                   yield i
9          else:                                                                              
10                yield char

And we get our letters!

letters p
letters l
letters a
letters n
letters t
letters t
letters e
letters r
letters r
letters o
letters r

Note that Python 3.3 added the syntax yield from X, as proposed in PEP 380, which simplifies our inner loop. With it you can do this instead:

yield from node.items()

But, what is going on in this code? It was straight-forward when we had the print statement, but now we have recursion in a generator. We had to make it a yield statement, because we can't have a return statement in a generator, but that does make things weird. So, now what is actually happening here?

Let's step through our loop. In our test code, we enter our loop and call our method to get the first letter. Node.value is none, so it yields char, and yields control back to our test loop. Our test loop prints 'letters' and the letter yielded. Now we go to the next 'letters in mytrie.items()' in the test loop again. This is the point where if we hadn't have added that inner loop, we would just continue to loop through that top h-node. Instead, something happens that may be a little unexpected. When we return to our method, we return to the same place we left; we do not go back to the beginning again, and this time this has consequences. This is something that is often glossed over when generators and yield statements are introduced. It is not just the variables that are preserved, but the position in the function as well. In the box analogy, perhaps when you return to the previous room (function call), instead of magically appearing at the exit door, you have to continue down a hallway to get to the exit. And maybe you have to pull a lever or something else before you leave.

When we left our items method, we had just yielded char on line 6, so now ask if there is a node.value (yes), and we enter our 'for i in node.items()' loop (I'll just call it the i-node loop). But, wait, node.items() is calling our method again (we hit recursion!), so now we do go back to the beginning of our method, and we continue until we hit our next yield statement, yield char again. Now yield char is an 'l', since we have iterated forward one, while still in our method. But, this yield is to the recursive call, so now control is returned to our i-node loop. Our i-node loop just has another yield, so now we yield again to the test loop, and print our 'l'. Okay, let's do another loop. We make our call from the test loop, and go back to our function, but where? The last yield statement was in the i-loop, so that is where we return, but that loop is now completed, so we go back to the next to last place we yielded, right after the yield char. That is followed by the i-node loop, where we finally get to create a new generator, and advance to the next letter. Every time through, we will create a new generator, and have to back out of that many more yields. While this is happening, Python is returning to the stacks it created in previous calls to create the current stack. This may sound inefficient, but that is because we are only going down one branch at the moment. As soon as we have to back partway out, and go down a new branch, it becomes more obvious why it is good to keep the old stacks around, backing out of yield calls to figure out which stack is relevant.

It is worth noting, that if this were ordinary recursion, it would not have worked. The first loop in our function is yielding char. If it were returning char, the inner loop could not advance, because it needs to know both what the last i was, and what node.items is in order to advance. In a normal recursive call, it would not have access to this information, since these variables are not being returned.

In our example of recursion with the magic boxes, we left a variable by the door, and picked it up on our way back out. The only reason we were able to do this is because that variable was in the return statement. In a normal function, once you leave the function (even to go into the next call of the same function), you lose any other information from that function, except what is in the return statement. But, in a yield statement, you return to the same place in the function, with all of the information you had when you were in the function before. And, if there is code after the yield statement, it will run that code, as if it had never left the function.

To see that the code continues after a yield, you could put the if statement:

 if node.value:
    print "end of word", node.value

after the i-node loop. It will print the end of word, as if it is finding the end of the word as it exits that node on its way back up the tree, as shown here:

maria@lamia:~/python/test$ python test_trie.py
('letters', 'a')
('letters', 'n')
('letters', 't')
('letters', 'i')
('letters', 'c')
end of word 2
end of word 1

for words ant (1) and antic (2)

Switch the order, and now suddenly the world makes sense...

$ python test_trie.py
('letters', 'a')
('letters', 'n')
('letters', 't')
end of word 1
('letters', 'i')
('letters', 'c')
end of word 2

Let's think about our data a minute. I think the graphic on the Wikipedia site is a bit misleading. I think there should be lines drawn horizontally between nodes in the same branch, because it is possible, and necessary, to travel between nodes at the same level in the same branch. Like so:

When we ask for the "next item" using the items method in the case of no recursion, it gives us the next node at that level. But if we are inside our method, and ask for the next node (recursion), we are essentially in a node and asking for the next node of that particular branch. So, in the drawing above, if you are in the "A" node, and ask for the next node, then you get "N" as that is the next node under "A". But if you are not in the "A" node, then the next node is "B". When you get down to the "T" node, you will get an "E", "I" or "S" when you use recursion. But, if you are yielded the "I" and ask for the next letter without recursion, you will get the "S". This will happen when you are moving from one branch to the next, because the method will back out of all of the recursive calls, and then go to the next item, which will be the head of the next branch. There it will hit recursion again, and descend down the next branch. Note that our code is not going down branches simultaneously at this point. This is because we do not have a call to split the recursion to go down the separate branches at once. When it hits a branch, it just goes down the first one it finds. We can consider this in a future post.

Okay, that's quite a bit for one post. Next post, let's try to expand our items function to be more useful.

Many thanks to the Seattle Python Interest Group for helping me sort this all out!

Bookmark and Share


Comments: 0

Contact me if you want to comment:

Subject: Subject:

Name:
Email:
Comments:

Enter code:

  LinkedIn