Tries (Prefix Trees)
Given a set of \(k\) strings of total length \(m\) and a prefix \(p\) of length \(n\), we want to find all the strings that start with \(p\). The simplest solution would be to match each of the \(k\) strings with the prefix. This will take \(O(m)\) time. However, suppose that we now we have \(r\) patterns, then this approach will take \(O(mr)\) time which is really slow. Similar to the RMQ notation, we will write \(\langle O(1), O(mr)\rangle\) where \(O(1)\) is the preprocessing time and \(O(mr)\) is the query time. How can we make it faster?
The Trie Data Structure
The trie (pronounced “try”) data structure is an efficient data structure that can help us achieve a faster running time when searching for prefixes. In the picture above forexample, we can find the words that start with the prefix “ca” by starting at the root and traversing down to find the words (words end with colored nodes): cake, cakes and cat.
Here are some important properties of tries
- Each node indicates whether it is the end of a valid word or not. In the above example, colored nodes represent the end of valid words. We can simply use a bool value (yes/no) in each node to do this.
- Each node will have \(R\) children/edges where \(R\) is the alphabet size.
- Edges represent letters. For example, the root node has one edge (\(c\)) to the nodes below.
- Each node also has only one parent.
Let’s look at what a trie really looks like:
You can see above, the root has only one edge of value \(c\). From \(c\), we can traverse to either \(a\) or \(i\). From \(a\) we can traverse to \(k\) or \(t\). From \(i\) we can traverse to \(d\) or \(v\) and so on.
Node and Edges Representation
This can be done in many ways. The following is just one way that uses a struct to represent a node. This struct will include a bool value to indicate whether we have arrived at a valid word or not. Each node/struct will also have an array to store all possible edges. Each array position holds a pointer to a node struct, intially all set to NULL. Here, we will consider the English alphabet which consists of 26 letters. We will use position 0 to represent ‘a’, 1 to represent ‘b’ and so. We will take advantage of the fact that the ASCII value for the letter a is 97 and z is 122 and so to store ‘c’, we simply find the right position in the array by subtracting ‘a’ (97) from ‘c’ (99) to get 2 which is the correct position for ‘c’.
Prefix Search
Given a prefix \(p\) of length \(n\), we want to output all the possible words that start with \(p\). To do, we start with the root of the trie and the first letter of \(p\). We then check if the root contains a non-null node at the first letter of \(p\). If there is, then we move to this new node and advance to the next letter of \(p\). We repeat the same process until we either exit (return NULL) or reach the last letter of \(p\) and return that node.
Once we find the prefix (node returned above is not null), we know that all the words in the current subtree are words that start with prefix. To collect all these words, we pass a queue or any container of our choice to collect below. collect is just a depth first search that recursively traverses the subtree rooted at current.
Finally, we utilize find_prefix to find the prefix and then, pass a queue to collect which will fill the queue with all matches.
Insert
Suppose we want to insert a new word of length \(n\). We follow a similar process. We start at the root and we also start at the beginning of the word to be inserted. If the letter we’re on has already a link in the root then we follow it a long and move to the next letter as well. If the link is null then we create a new node for that letter.
insert takes \(O(n)\) which is the length of word.
Delete
To be added.
Longest Prefix
To be added
Running Time
We mentioned above that insert and search both take \(O(n)\) time where \(n\) is the length of the pattern. How long does it take to build a trie to represent all words of total length \(m\)? We need to make \(k\) insertions each of which will take \(O(\)size of each word\()\). Since the total length of all words is \(m\), then the total time is \(O(m)\) which happens only once initially. Therefore, in the RMQ notation we will have \(\langle O(m), O(n) \rangle\) which is a lot better than the naive solution.
Full Implementation
https://github.com/strncat/algorithms-and-data-structures/tree/master/strings/trie.cpp
References
- Algorithms by Robert Sedgwick and Kevin Wayne
- http://web.stanford.edu/class/cs166/