B-Trees
B-trees are balanced search trees. While red-black trees are also balanced search trees, their branching factor is only 2. B-trees differ in that their branching factor can be really large. The number of keys in a B-tree node determines the number of its children. A node with \(n\) keys have \(n+1\) children. This way, each child handles a specific range of keys in its parent.
Motivation
But why do we need a large branching factor? The motivation behind this idea has to do with minimizing disk I/O operations. Data on disk can be read in pages and so if each node’s size is a whole page in which we pack a lot of keys in, we can then minimize the number of read operations.
Specifications
-
Let \(x\) be an internal node and let \(n\) be the number of keys stored at \(x\). \(x\) must have \(n+1\) children. Furthermore, the keys are stored in non-decreasing order. Leaf nodes on the other hand have no children. No restrictions on the number of children of the root.
-
The minimum degree of a b-tree is an integer, \(t\) such that \(t \geq 2\). Every node besides the root must have between \(t-1\) and \(2t-1\) keys. This also implies that it must have at least \(t\) children and at most \(2t\) children. If the tree is not empty, the root must be at least have 1 key.
-
The keys in each node are stored in non-decreasing order. The keys also define the range of keys stored in the children subtrees below. In the above figure, We see the range [0,23] defines the range for the child keys below [11,19]. Similarly [23,53] defines the range for the next child’s keys [29,43].
Why are B-trees balanced?
To see why B-trees are balanced, we will prove that the maximum height of a b-tree of degree \(t\) with \(n\) keys is \(O(\log_t(n))\).
Proof: At level 0, the root must at least have 1 key. At level 1 we must have two children, each of which must have at least \(t-1\) keys, so the total number of keys is \(2(t-1)\). Each of these children will have at least \(t\) children and so at level 2, we should have \(2t(t-1)\) keys. Therefore, the total number of keys is at least
Search
Similar to searching in a binary tree, we descend down the tree and select the right branch based on the key value. In \(B\)-trees, each node has has \(n\) keys and so this means that we have \(n+1\) children or branches. We need to compare our key with these \(n\) keys to find the right child/branch to descend into.
Each node stores an array of its keys. In the figure above, suppose we’re currently at node $x$ during the search and suppose we’re searching for \(k = 84\). We first want to find the smallest index such that \(k \leq keys[i]\). For \(k=84\), the smallest index is \(i = 3\). 101 is the smallest key that is larger than 84. We then have several cases:
- \(k = keys[i]\). This is great, we just return \(x\) with the index of the key \(i\).
- \(x\) is a leaf node so we return NULL.
- \(k \neq keys[i]\). \(k\) falls somewhere between the key at index \(i-1\) and the key at index \(i\). This range corresponds to the child at index \(i\). We call search again on the \(i\)th child.
How long does Search take?
Note that in the above method, we were using a linear search to find the right key. For a tree of degree \(t\), each node can have at most \(2t-1\) keys. So this search takes \(O(t)\) time. Since the tree height is \(O(log_t(n))\) then the total cost is \(O(t\log_t(n))\).
Suppose we instead use binary search to find the key. This means that each search will now take \(O(\log(t))\) time and so the total cost becomes \(O(\log(t)\log_t(n))\). Re-writing \(\log_t(n)\) as \(\frac{\log(n)}{\log(t)}\), we see that the total time is \(O(\log(n))\).
Insert
TODO
Delete
TODO
References
- CLRS
- Stanford CS166