Binary search trees (BST)

25 Jun 2020
In this post I will try to explain what binary search trees are, how they work, and why they are useful for many problems. In future posts I plan to analyze a couple of more advanced BSTs.
But first, let's start by describing a plain binary tree.

What is a binary tree?

A binary tree is a data structure where we store our items hierarchically, with each parent node having at most two children.
Two examples of binary trees
We can store anything in a binary tree, but in this simple form they are not very useful.
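To make this concrete, here is a minimal sketch of what a binary tree node could look like in Python. The class and field names are just illustrative choices, not part of any standard API.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value    # the item stored in this node
        self.left = left      # left child, or None
        self.right = right    # right child, or None

# Any kind of item can be stored, for example strings:
root = Node("a", left=Node("b"), right=Node("c", left=Node("d")))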

What is a binary search tree (BST)?

A binary search tree is a binary tree where we add an additional invariant:
Given a node n with value v, all values in the left subtree are smaller than v, and all values in the right subtree are greater than v.
Two examples of binary search trees
If you analyze any node above, you can see that they all follow the rule. For example, node 12 in the right tree has 11 on its left, which is smaller, and 17 and 14 on its right, which are greater. Additionally, note that 14 is the left child of 17, because it is smaller.
If we traverse the tree in-order, we get the items ordered from smallest to largest.
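As a small sketch of that idea, assuming the Node class from the earlier example, an in-order traversal visits the left subtree, then the node itself, then the right subtree:

def in_order(node):
    # Yields the values of the tree rooted at node; on a BST this is ascending order.
    if node is None:
        return
    yield from in_order(node.left)    # every value smaller than node.value
    yield node.value
    yield from in_order(node.right)   # every value greater than node.value

For any BST, list(in_order(root)) returns its values from smallest to largest.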
This invariant has two other consequences:
A BST can only store items we can compare and order (numbers, names, etc.).
All values will be unique, since no value is greater or smaller than itself.

Ok, but what are they used for?

The clue is in the name: binary search trees. They allow us to search for elements in the tree using binary search. Binary search is fast because at each step half of the remaining elements are discarded. For more details about this operation, please keep reading.
They are also a great way to implement sets of items, since their properties fit that use case well:
Holds only unique items.
Fast lookup to test if a value is a member of the set.
Fast insertion of a new value.
Fast deletion of an existing value.
With minimal changes they can be used to implement dictionaries: key-value data structures where you can retrieve a data element by its key. For example, we could store a user record with name, email, telephone, etc., using the user's unique id as the key. We would then use this key in the tree operations to decide where to store the data.
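As a rough sketch of that idea (the class and field names here are hypothetical), each node stores a key together with its record, and all comparisons in the tree operations are done on the key alone:

class DictNode:
    def __init__(self, key, value, left=None, right=None):
        self.key = key        # used to order the tree and to look records up
        self.value = value    # the associated record
        self.left = left
        self.right = right

# A user record keyed by a unique user id:
root = DictNode(42, {"name": "Ada", "email": "ada@example.com", "telephone": "555-0100"})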

How do they work?

Now that we know the invariants we need to maintain we can analyze how the operations work. We will cover the three basic operations: lookup, insertion and deletion.

Lookup

The way the values are arranged makes lookup a very simple operation: we can use binary search directly.
Since we know that for a given node, all values to the left are smaller, and all values to the right are greater, we can eliminate one of the subtrees on each step.
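A minimal recursive sketch of lookup, assuming nodes shaped like the Node class from earlier:

def contains(node, target):
    # Returns True if target is stored in the tree rooted at node.
    if node is None:
        return False                          # fell off the tree: target is not present
    if target == node.value:
        return True                           # found it
    if target < node.value:
        return contains(node.left, target)    # target can only be in the left subtree
    return contains(node.right, target)       # target can only be in the right subtree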
Let's look at an example, lookup of 14 in the following tree.
Lookup 14: 14 > 10 so we go right, 14 < 15 so we go left, 14 > 13 so we go right, found 14
Now let's look up 8, which is not in the tree.
Lookup 8: 8 < 10 so we go left, 8 > 5 so we go right, 8 > 7 so we go right, 9 is a leaf, 8 is not in the tree
From these examples we can clearly see that we don't have to go through all elements of the tree one by one; by exploiting the properties of the BST we get a result quickly.

Insertion

To insert an item, we build on top of lookup, making sure we respect the invariant. There are a couple of different cases.
If the item is already in the tree, we do nothing.
Insert 3: 3 < 5 so we go left, 3 > 2 so we go right, we found 3, so we do nothing
If we reach a node where the lookup algorithm tells us to pick a direction and the node doesn't have a child in that direction, we create a new node with the value we want to insert and attach it in that direction.
Insert 6: 6 > 5 so we go right, 6 < 7 and 7 has no left subtree, we insert 6 on the left
Insert 4: 4 < 5 so we go left, 4 > 2 so we go right, 4 > 3 and 3 has no right subtree, so we insert 4 as its right child
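Putting both cases together, here is a rough recursive sketch, again assuming the Node class from the beginning of the post:

def insert(node, value):
    # Inserts value into the tree rooted at node and returns the root of the updated subtree.
    if node is None:
        return Node(value)                       # empty spot in the chosen direction: create the node here
    if value < node.value:
        node.left = insert(node.left, value)     # the value belongs in the left subtree
    elif value > node.value:
        node.right = insert(node.right, value)   # the value belongs in the right subtree
    # value == node.value: the value is already in the tree, nothing to do
    return node

Building a tree is then just repeated insertion: start with root = None and call root = insert(root, x) for each item x.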

Deletion

Deletion also builds on lookup, but it's probably the trickiest operation. We can remove a node on any position of the tree, but to maintain the invariant we may need to rearrange the tree.
Let's go case by case, starting with the easiest, when the value is not in the tree.

When the element is not contained in the tree

We would apply the same lookup algorithm and find out that the element is not in the tree. Nothing else to do.

When the element to delete is a leaf

If we apply the lookup algorithm and find that the node is a leaf (it doesn't have any children), we can just delete the node.
Delete 3: 3 < 5 so we go left, 3 > 2 so we go right, found 3, is a leaf so we delete it

When the element to delete has only one child

If the node has only one child, we can delete the node and replace it with its child. Let's look at some examples.
Delete 7: 7 > 5 so we go right, found 7, has only one child so delete 7 and replace it with its child
A node may have only one child, but that child can be the root of an arbitrarily large subtree. The rule is still the same: replace the node with its child.
Delete 4: 4 < 5 so we go left, 4 has only one child, so we delete it and replace with its child

When the element to delete has both children

In the case where the element has both children, it's a bit more complex.
We will begin with the simplest case, where both children are leaves. In this case we could replace the node we are deleting with either of its children. We chose to replace it with the right child, but that is an arbitrary decision.
Delete 2: 2 < 5 so we go left, 2 has both children, so we delete it and replace with its right child, 3
In the general case, where one or both children are arbitrary subtrees, the rule is as follows: we need to find the successor (the next element if we follow the order) of the element we want to delete. In a binary search tree this means finding the leftmost element of the right subtree; the right subtree contains all elements greater than the element we want to delete, and the leftmost is the smallest among them. Let's call it m.
Once we find it, there are two different cases:
Case 1: m is a leaf
When m is a leaf, we just need to replace the node we want to delete with m.
Delete 10: we find the leftmost node of the right subtree, in this case 11, and put it in place of 10
Case 2: m has a right child
By definition, m can only have a right child, because if it had a left child it wouldn't be the leftmost node. In this case we need to do two things:
As in the previous case, replace the element to delete with m.
'Promote' the right subtree of m, putting it where m was.
Delete 5: we find the leftmost node of the right subtree (6) and replace 5 with it, the right subtree of 6 goes one level up where 6 was
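All of the cases above can be folded into a single recursive sketch. This is just one common way to write it, reusing the Node class assumed earlier; the leaf and one-child cases fall out of the same two checks:

def delete(node, value):
    # Deletes value from the tree rooted at node and returns the root of the updated subtree.
    if node is None:
        return None                               # value not in the tree: nothing to do
    if value < node.value:
        node.left = delete(node.left, value)
    elif value > node.value:
        node.right = delete(node.right, value)
    else:
        # This is the node to delete.
        if node.left is None:
            return node.right                     # covers both the leaf case and the only-right-child case
        if node.right is None:
            return node.left                      # only a left child: promote it
        # Both children exist: find m, the leftmost node of the right subtree.
        m = node.right
        while m.left is not None:
            m = m.left
        node.value = m.value                      # put the successor's value where the deleted value was
        node.right = delete(node.right, m.value)  # remove m; its right subtree is promoted by the checks above
    return node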

Balance and performance

Our binary search tree performs really well, allowing us to use binary search, which means all operations are O(log n). The execution time of each operation is proportional to the height of the tree.
But there is a catch: for this to be true, the tree needs to be balanced. A balanced tree is one where the left and right subtrees of every node differ in height by at most 1.
Two balanced trees
Any binary search tree that does not meet that requirement is considered unbalanced. To see clearly how this affects performance, let's look at the next example:
A degenerate tree behaves like a linked list
The degenerate tree behaves like a linked list, meaning all operations are O(n), since in the worst case all elements will be visited.
You can easily run into these cases: the tree in the example is the result of inserting the elements from 1 to 7, in ascending order, using the insertion procedure explained before.
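A quick way to see this, reusing the insert sketch from the insertion section: after inserting 1 through 7 in ascending order, the height of the tree equals the number of elements.

def height(node):
    # Number of nodes on the longest path from this node down to a leaf.
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

root = None
for value in range(1, 8):     # insert 1, 2, ..., 7 in ascending order
    root = insert(root, value)

print(height(root))           # prints 7: every node has only a right child, just like a linked list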
It is so important for a binary search tree to be balanced that there are several enhanced tree data structures called self-balancing binary search trees.
AVL trees and red-black trees are two classic examples, each using a different strategy to keep the tree balanced.
In the next post, we'll concentrate on AVL trees.