Module 4 - Trees - BD Unit 7 Sprint 26

Module Overview

Learn about tree data structures and their implementations in Java, including binary trees, binary search trees, and tree traversal algorithms.

Learning Objectives

Assemble a tree from provided hierarchical data
Analyze whether a tree models a provided business problem
Perform a manual breadth-first traversal of a tree to find a node closest to the root that satisfies a condition
Perform a manual depth-first traversal of a tree to perform an operation on each node
Recall that a binary search tree keeps its keys in sorted order to allow for faster searching
Explain why checking for the existence of an item in a TreeMap or TreeSet runs in logarithmic O(log n) time
Design and implement a class that uses a TreeMap to maintain the ordering of key:value pairs
Explain what a tree is and outline its characteristics

Introduction to Trees

Trees are hierarchical data structures that represent parent-child relationships. They're essential in many areas of programming and computer science.

Key characteristics of trees:

Trees consist of nodes connected by edges
Each node can have zero or more child nodes
There is exactly one path from the root to any node

// Basic tree node implementation
public class TreeNode<T> {
    private T data;
    private List<TreeNode<T>> children;
    
    public TreeNode(T data) {
        this.data = data;
        this.children = new ArrayList<>();
    }
    
    public void addChild(TreeNode<T> child) {
        children.add(child);
    }
    
    public T getData() { return data; }
    public List<TreeNode<T>> getChildren() { return children; }
}

Overview

This lesson introduces the Tree data structure as a way of representing hierarchical data.

For this reading, we'll introduce tree terminology and describe examples of data often organized as a tree. We'll discuss implementation details and compare tree nodes to linked list nodes. Finally, we'll explore general search strategies.

In the next reading, we'll focus on binary search trees, their implementation, and the time complexity of basic operations. We'll also describe Java's TreeSet and TreeMap classes, which are implemented as self-balancing binary search trees.

What is a tree?

Like arrays and linked lists, trees are a way of organizing and accessing data. Unlike the linear structure of arrays and linked lists, trees are represent a hierarchy. Much of our tree terminology matches from the familiar concept of a family tree. Consider a small segment of the British royal family, shown in Figure 1.

Figure 1: Family tree showing a small segment of the British Royal Family. The tree is labeled with terminology for root, parent, child, sibling, and leaf nodes and depth of nodes. In this tree, the root node at depth 0 is Prince Charles. Two child nodes of the root, Prince William and Prince Harry, are sibling nodes at depth 1. At depth 2, as child nodes of Prince William, are leaf nodes Prince George, Princess Charlotte, and Prince Louis. Also at depth 2, as a child node of Prince Harry, is a leaf node Archie.

In this tree, the node containing Prince Charles is the root of the tree and is the parent of the nodes containing Prince William and Prince Harry, which are child nodes. Similarly, the node containing Prince William is the parent node to the child nodes containing Prince George, Princess Charlotte, and Prince Louis. The nodes containing Prince William and Prince Harry are siblings, which means they are child nodes of the same parent node. The nodes at the bottom of the tree that have no child nodes are called leaf nodes (because they are like the leaves on the ends of tree branches). There is only one root node in a tree, but there can be many parent, child, and leaf nodes.

The depth of a node is the number of levels from that node to the root. The depth of the root is zero. In this example, the four leaf nodes are each at depth 2. The height of a node is the number of levels from that node to the deepest leaf node below it. The height of the root node is the depth of the deepest leaf in the tree, in this case 2. The height of a leaf node is zero.

Another tree you see in your day-to-day is a computer's file system. Figure 2 shows one directory's sub-tree removed from the larger file system. The Pictures folder is the root node, with child nodes Parties, Staff, and Vacations.

Figure 2: A selection of folders in a file system. The main folder, Pictures, contains folders Parties, Staff, and Vacations. Folder Parties contains sub-folders Halloween and New Years Eve. Folder Vacations contains sub-folder Australia 2018, Europe 2019, and Staycation 2020. Folder Europe 2019 contains sub-folders Italy and Spain.

Under the Parties node are child nodes for Halloween and New Year's Eve. The Staff node has no child nodes while the Vacations node has three, Australia 2018, Europe 2019, and Staycation 2020. The Europe 2019 node has additional child nodes for Italy and Spain.

Notice that all leaf nodes don't have to be at the same depth. In this tree, the Staff leaf node is at depth 1, while the Italy and Spain leaf nodes are at depth 3. The remaining leaf nodes are at depth 2. This is normal and is determined by the data represented in the tree.

The height of the entire tree is 3 because there are 3 levels between the root Pictures node and the Italy and Spain nodes, which are the deepest leaf nodes. The height of the Parties node is 1 because the leaf nodes under it in the tree are only one branch away. The height of the Staff node is 0 because it is a leaf. The Parties and Staff nodes are siblings, so they have the same depth, but they don't have the same height.

Visualizing data as a tree

In the previous examples, the hierarchies might be difficult to think of in any format other than a tree. There are times when data is available in a non-hierarchical format, such as a database, but it may be useful to think of the data as a tree. Doing so can help you visualize the structure of the hierarchy and find useful parent, child, or sibling relationships. It can also help us organize other data that's dependent on the hierarchy.

The menu structure used to navigate product categories on the Amazon.com website can be viewed as a tree. The root node would be the parent for all product categories, like the Shop by Category menu. The child nodes classify products into categories such as Automotive, Home, and Sports (to name just a few.) The Automotive node may have child nodes for Accessories, Parts, and Tools, while the Home node could have child nodes for Indoor and Outdoor.

Since products come and go, these categories and their organization are likely to change from time to time. Thus, it's conceivable they would be stored in a database such as the following:

Category Name	Category ID	Parent Category ID
Automotive	Auto_00	Menu_00
Home	Home_00	Menu_00
Sports	Sports_00	Menu_00
Accessories	Auto_01	Auto_00
Parts	Auto_02	Auto_00
Tools	Auto_03	Auto_00
Indoor	In_00	Home_00
Outdoor	Out_00	Home_00
Furniture	In_01	In_00
Kitchen	In_02	In_00
Flooring	In_03	In_00
Furniture	Out_01	Out_00
Gardening	Out_02	Out_00
Grills	Out_03	Out_00
Clothing	Sports_01	Sports_00
Equipment	Sports_02	Sports_00

The data in the above table would be represented by the tree shown in Figure 3.

Figure 3: Tree representing the product category data. Product Menu is the root node with children Automotive, Home, and Sports. Automotive has children Accessories, Parts, and Tools. Home has children Indoor and Outdoor. Sports has children Clothing and Equipment. Indoor has children Furniture, Kitchen, and Flooring. Outdoor has children Furniture, Gardening, and Grills.

The top-level categories in the database do not have a parent category, so an artificial root node (Product Menu) for the entire tree must be created. This data and the associated tree could then be used to build the website's menu structure. A user would interact directly with the tree structure as they select categories and sub-categories while shopping.

Tree implementation

A Tree is implemented much like a LinkedList in that it's a collection of nodes, each with an associated data element. Instead of a single pointer to the "next" node in the list, a tree node has multiple pointers to child nodes. Some or all of these child pointers may be null, just like the next pointer in the last node of a linked list. If all the child pointers in a node are null, that node is a leaf.

These child pointers may be stored within a tree node as individual member variables or as a list, depending on what is known about the structure of the tree. If every tree node has the same number of child nodes, individual member variables could be a good choice. If, as in the examples presented so far, the number of child nodes for each tree node is not known in advance, a more dynamic structure like a list is preferable.

Searching a tree

As with any data structure, a tree stores data, so it can be searched or traversed. For example, searching the file structure tree to find a file with a specific name. Traversing a tree is done when it is necessary to perform an operation on every item in the tree. For example, traversing every node in the file structure tree to determine the total amount of space being used. The algorithms used for searching and traversing a tree are essentially the same, the difference being that a search stops when it finds something (or knows it can't), while a traversal continues through the entire tree. We'll discuss those algorithms in the following sections with a focus on searching a tree.

There are two primary strategies for searching a tree, breadth-first search and depth-first search. Both start with the root node. The difference lies in how the algorithms choose between child nodes and sibling nodes as the search progresses.

Breadth-first search

A breadth-first search (BFS) examines all nodes at the same depth before moving to the next depth. In other words, it starts with the root at depth zero, then examines all nodes at depth 1, then all nodes at depth 2, etc.

Figure 4 illustrates the progression of the search, starting with the green root node, then blue nodes at depth 1, then yellow nodes at depth 2, and finally red nodes at depth 3. The nodes at each depth are numbered and are traditionally thought of as being searched left-to-right when viewing a diagram of a tree. However, in actual implementations of the algorithm, the choice of nodes at the same depth may not be specified.

Figure 4: Color coded tree illustrating progression of breadth-first search. The root node is green with value 1. Level 1 has blue nodes with values 2, 3, and 4 that are all children of the root. Level 2 has yellow nodes, with nodes 5 and 6 children of the blue 2, and nodes 7, 8, and 9 children of the blue 4. Level 3 has red nodes, with nodes 10 and 11 children of the yellow 5, and nodes 12 and 13 children of the yellow 7.

The breadth-first search pseudocode follows:

> Put the root node in a queue of nodes waiting to be searched
> While queue of nodes is not empty
>   Remove the next node from queue
>   If this node is the target, success
>   Else, put this node's child nodes in the queue
> Queue becomes empty, failure

The following lines show the queue of nodes waiting to be searched each time the loop condition is tested, before the loop body executes. In order to follow the search through the entire tree, assume we're searching for the target value 42, which is not in the tree. The first time the loop condition is tested, the root is the only node in the queue:

> Iteration 0 -- Queue: 1

The next node, 1, is removed from the queue, it's not the target, so its child nodes are added to the queue:

> Iteration 1 -- Queue: 2 3 4

Notice when the 2 is removed from the queue, its child nodes 5 and 6 are placed at the end of the queue:

> Iteration 2 -- Queue: 3 4 5 6

As this process continues, the queue grows when a node has child nodes and shrinks when a node is a leaf:

> Iteration  3 -- Queue: 4 5 6             // Node 3 is a leaf node (has no chilren)
> Iteration  4 -- Queue: 5 6 7 8 9         // Node 4 has children 7, 8, and 9
> Iteration  5 -- Queue: 6 7 8 9 10 11     // Node 5 has children 10 and 11
> Iteration  6 -- Queue: 7 8 9 10 11       // Node 6 is a leaf node
> Iteration  7 -- Queue: 8 9 10 11 12 13   // Node 7 has children 12 and 13
> Iteration  8 -- Queue: 9 10 11 12 13     // All remaining nodes are leaves
> Iteration  9 -- Queue: 10 11 12 13
> Iteration 10 -- Queue: 11 12 13
> Iteration 11 -- Queue: 12 13
> Iteration 12 -- Queue: 13
> Iteration 13 -- Queue:

The loop exits when the queue becomes empty after failing to find the target value 42.

Depth-first search

A depth-first search (DFS) proceeds as far as possible down one branch until reaching a leaf. Once a leaf is reached, the depth-first search backtracks until it can choose a sibling node. It then explores that node and its descendants in a similar manner.

Figure 5: Color coded tree illustrating progression of depth-first search. The root node 1 is green as are the leftmost nodes 2, 3, and 4 in the tree along a path directly from the root to a leaf. The next nodes encountered after backtracking are a blue 5 that is a child of 3, a yellow 6 that is a child of 2, a red 7 that is a child of 1, an orange 8, it's child an orange 9, it's child an orange 10, a purple 11 that is a child of 8, a pink 12 that is a child of 8, and a teal 13 that is another child of 8.

Pseudocode for the depth-first search of a tree is as follows:

> Put the root node on a stack of nodes waiting to be searched
> While stack of nodes is not empty
>   Remove the next node from stack
>   If this node is the target, success
>   Else, put this node's child nodes on the stack
> Stack becomes empty, failure

If this looks familiar, it should -- the only difference between this algorithm and the one used for Breadth-First Search is the use of a stack instead of a queue to store the nodes waiting to be searched. Since these nodes come out of the stack in Last-In, First-Out (LIFO) order, the search proceeds as deep as it can before backtracking, while a First-In, First-Out (FIFO) queue processes nodes for a BFS in the order they are added.

The following lines show the stack of nodes waiting to be searched each time the loop condition is tested, before the loop body executes. In order to follow the search through the entire tree, assume we're searching for the target value 42, which is not in the tree. The first time the loop condition is tested, the root is the only node in the stack:

> Iteration 0 -- Stack: 1

The next node is removed from the stack, it's not the target, so its child nodes are added to the stack:

> Iteration 1 -- Stack: 2 7 8

Notice when the 2 is removed from the stack, its child nodes 3 and 6 are placed at the front (top) of the stack:

> Iteration 2 -- Stack: 3 6 7 8

As with breadth-first search, the stack grows when a node has child nodes and shrinks when a node is a leaf:

> Iteration 3 -- Stack: 4 5 6 7 8     // Node 3 has children 4 and 5
> Iteration 4 -- Stack: 5 6 7 8       // Node 4 is a leaf node (has no children)
> Iteration 5 -- Stack: 6 7 8         // Node 5 is a lead node
> Iteration 6 -- Stack: 7 8           // Node 6 is a leaf node
> Iteration 7 -- Stack: 8             // Node 3 is a leaf node
> Iteration 8 -- Stack: 9 12 13       // Node 8 has children 9, 12, and 13
> Iteration 9 -- Stack: 10 11 12 13   // Node 9 has children 10 and 11
> Iteration 10 -- Stack: 11 12 13     // All remaining nodes are leaves
> Iteration 11 -- Stack: 12 13
> Iteration 12 -- Stack: 13
> Iteration 13 -- Stack:

The loop exits when the stack becomes empty after failing to find the target value 42.

Comparing BFS and DFS

Looking at how the breadth-first and depth-first searches work, we can see that they're similar concepts. They just move along the tree in different paths. What may not be clear is how their differences affect their performance. Let's take a closer look at the advantages and disadvantages of these two search algorithms and what situations each is best for.

The primary advantage of a breadth-first search becomes apparent in a tree where multiple nodes satisfy the search condition. The breadth-first search strategy will find the one closest to the root node. However, the size of the queue storing nodes remaining to be searched can become prohibitively large as the search moves deeper in the tree. Consider a tree where each node has, on average, three child nodes. When the search is at depth 1, there would be three nodes in the queue. At depth 2, there would be nine nodes in the queue. At depths 3, 4, and 5, there would be 27, 81, and 243 nodes in the queue, respectively. At depth 10, there would be almost 60,000 nodes in the queue!

The primary advantage of a depth-first search over a breadth-first search is the relatively modest size of the stack compared to the size of the queue. At depth 1, there would be three nodes on the stack. At depth 2, there would be 6 nodes on the stack. Each successive depth would only add, on average, three nodes to the stack. At depth 10, there would be 30 or so nodes on the stack. The memory savings compared to nearly 60,000 nodes in the queue is significant. There are disadvantages to depth-first search as well. The main disadvantage is the possibility of exploring a long way down a branch that does not contain the target. However, in a situation where all nodes in a tree are to be traversed (for example, counting the number of files in a file structure), the depth-first search would be preferred due to the smaller stack's memory savings.

Summary

This section introduced the tree data structure as a hierarchical arrangement of data, in contrast to the linear structure of an array or linked list. We also walked through examples of breadth-first and depth-first searching a tree.

Up next

The next reading introduces the binary search tree. This is a tree data structure with a few straightforward restrictions on how it is constructed that result in significant advantages when searching a tree.

Tree Traversal Techniques

Trees can be traversed in different ways depending on your needs. The two primary traversal methods are depth-first and breadth-first.

Key traversal approaches:

Depth-first traversal explores as far as possible along each branch
Breadth-first traversal explores all nodes at the current depth before moving to nodes at the next depth
Depth-first traversal can be further divided into pre-order, in-order, and post-order

// Depth-first traversal (recursive)
public void depthFirstTraversal(TreeNode<T> node) {
    if (node == null) return;
    
    // Process current node
    System.out.println(node.getData());
    
    // Recur for all children
    for (TreeNode<T> child : node.getChildren()) {
        depthFirstTraversal(child);
    }
}

// Breadth-first traversal (using queue)
public void breadthFirstTraversal(TreeNode<T> root) {
    if (root == null) return;
    
    Queue<TreeNode<T>> queue = new LinkedList<>();
    queue.add(root);
    
    while (!queue.isEmpty()) {
        TreeNode<T> current = queue.poll();
        
        // Process current node
        System.out.println(current.getData());
        
        // Enqueue all children
        queue.addAll(current.getChildren());
    }
}

Binary Search Trees

Binary Search Trees (BSTs) are special binary trees where for each node, all elements in the left subtree are less than the node, and all elements in the right subtree are greater.

Properties of BSTs:

Efficient search, insert, and delete operations
Average time complexity of O(log n) for these operations
In-order traversal yields elements in sorted order

// Binary Search Tree node
public class BSTNode<T extends Comparable<T>> {
    private T data;
    private BSTNode<T> left;
    private BSTNode<T> right;
    
    public BSTNode(T data) {
        this.data = data;
    }
    
    // BST insert operation
    public void insert(T value) {
        if (value.compareTo(data) < 0) {
            if (left == null) {
                left = new BSTNode<>(value);
            } else {
                left.insert(value);
            }
        } else if (value.compareTo(data) > 0) {
            if (right == null) {
                right = new BSTNode<>(value);
            } else {
                right.insert(value);
            }
        }
        // Equal values not inserted (no duplicates)
    }
    
    // BST search operation
    public boolean contains(T value) {
        if (value.compareTo(data) == 0) {
            return true;
        } else if (value.compareTo(data) < 0) {
            return left != null && left.contains(value);
        } else {
            return right != null && right.contains(value);
        }
    }
    
    // In-order traversal (produces sorted output)
    public void inOrderTraversal() {
        if (left != null) {
            left.inOrderTraversal();
        }
        System.out.println(data);
        if (right != null) {
            right.inOrderTraversal();
        }
    }
}

Overview

This lesson introduces the tree data structure as a way of representing hierarchical data.

In the previous reading, we introduced tree terminology and described examples of data often organized as a tree. We discussed implementation details and compared tree nodes to linked list nodes. Finally, we explored general search strategies when using breadth or depth-first search.

For this reading, we'll focus on binary search trees, their implementation, and the time complexity of basic operations. We'll also describe how to use Java's TreeSet and TreeMap classes, which are implemented as self-balancing binary search trees. We'll go through some code samples so you can see trees in action and build your confidence in using different data searches.

Review of logarithmic time complexity

The breadth-first and depth-first search algorithms are sometimes called "uninformed" searches. Though the two strategies progress through tree nodes differently, both require O(n) time complexity as they must examine every node. binary search trees allow a more "informed" search strategy that results in O(log n) performance. Following is a brief review of O(log n) time complexity to help motivate the study of binary search trees.

Figure 1 shows a graph of time complexities for various algorithms, including O(log n) for the binary search algorithm. In previous lessons, the binary search was performed on a sorted array list. The concepts and time complexity of that algorithm are worth reviewing.

Figure 1: Big-O time complexity comparison of algorithms for selection sort, O(n2), merge sort, O(n log n), linear search, O(n), binary search, O(log n), and array list element access, O(1).

Imagine Jane and Mateo are playing a number guessing game called High-Low. To play, Jane picks a secret number between 1 and 100 and Mateo tries to guess it. Each time Mateo guesses, Jane reveals whether the guess is correct, too high, or too low. For example:

Jane chooses the secret number 37.
Mateo sensibly guesses 50 and Jane says, "Too high."
Mateo wisely guesses 25 and Jane says, "Too low."
Mateo expertly guesses 37 and wins the game.

Mateo's guesses demonstrate an understanding of logarithmic time complexity. The first guess, and Jane's response, eliminates half of the possibilities for the secret number, leaving 1 through 49. The second guess eliminates half of the remaining possibilities, leaving 26 through 49. The third guess is again intended to eliminate half of the remaining possibilities, though Mateo is a bit lucky to be correct.

Consider the game if Jane had chosen 1 as the secret number. Mateo's guesses would be 50, 25, 12, 6, 3, 2, and 1. This is a worst-case scenario for Mateo, but he still wins the game in just seven guesses. This result demonstrates logarithmic time complexity. Another way to think of O(log n) is, "How many times does n have to be cut in half before the result is 1?" When n is 100, O(log n) is 7, as shown by Mateo's guesses. If n were 1,000, Mateo's guesses to reach 1 would be 500, 250, 125, 62, 31, 15, 7, 4, 2, and 1. Thus, when n is 1,000, O(log n) would be 10.

An increase in possibilities for Jane's secret number from 100 to 1,000 only increased Mateo's guesses from seven to 10. This is really good! Take a moment and use a piece of scratch paper to calculate how many times one million would need to be cut in half before reaching 1. Did you end up with 20? You can see this is a really powerful algorithm. It can operate on 1,000 elements in 10 steps and 1,000,000 elements in 20 steps. If only there were a data structure that took advantage of this magnificent performance!

What is a binary search tree?

Fortunately, there is just such a data structure! A binary search tree (BST) is a tree with two constraints placed on the structure and content of the tree:

Each tree node has at most two children, called left and right
The value of the data in the left child must be less than or equal the value of the data in the parent and the value of the data in the right child must be greater than the value of the data in the parent

These constraints allow a BST to be searched in logarithmic time! As with each of Mateo's guesses in the High-Low game, each choice of left or right child eliminates half of the remaining tree. However, there's one important caveat. The tree must remain balanced! We'll discuss this caveat in more detail later in this lesson. By then your understanding of the search and insertion operations will put a balanced tree into better context.

Consider the BST in Figure 2. Other than the leaf nodes, each node has a left and right child. The value in each node is greater than the value in its left child and less than the value in its right child. If Jane and Mateo were playing the High-Low number guessing game with the range of values being 1 to 15, this tree might represent Mateo's guesses. His first guess would be 8. His second guess would be 4 or 12, etc. No matter what secret number Jane picks, Mateo only needs four or fewer guesses to find it.

Figure 2: A binary search tree with 15 nodes. The value 8 is the root node with values 4 and 12 in left and right child nodes, respectively. The values 2 and 6 are in child nodes left and right of the value 4 and the values 10 and 14 are in child nodes left and right of the value 12. The value pairs 1 and 3, 5 and 7, 9 and 11, and 13 and 15 are each left and right child nodes of the values 2, 6, 10, and 14, respectively. These last eight nodes are all leaf nodes.

Searching a binary search tree

The constraints placed on a BST allow it to perform a more "informed" search than the earlier "uninformed" breadth-first and depth-first searches. Specifically, comparing the target value to the value in a given node lets the search decide whether to move to the left or right child node to continue the search. It knows the target will not be in the other child node. As with Mateo's guesses in the High-Low game, each time the target is compared to a node, half of the remaining nodes are eliminated from the search.

The following pseudocode for searching a BST shows the if-else statement where the informed decision is made to proceed to the left or right child of the current node:

> Set a pointer, p, to the root node
> While p is not null and p's value is not the target value
> If the target value is less than p's value, set p = p's left child
> Else set p = p's right child
> If p is null, failure
> Else, success

If the target value is 3, the first comparison of 3 < 8 would move the pointer left. It would go to the node containing 4 as shown in Figure 2. The next comparison would be 3 < 4, again moving the pointer left, to the node containing 2. Comparing 3 < 2 is false, so the pointer moves right, to the node containing 3. The target value is found!

What happens if the target value isn't in the tree? If we search for the target value 55 in the tree shown in Figure 3, it results in comparing 55 to the nodes 47, 73, 67, and 59. After comparing 55 to 59, the pointer would be set to the left child of the node containing 59, which is null. At this point, the search terminates with the null pointer indicating failure.

Figure 3: A binary search tree with 15 nodes. The value 47 is the root node with values 23 and 73 in left and right child nodes, respectively. The values 17 and 37 are in child nodes left and right of the value 23 and the values 67 and 83 are in child nodes left and right of the value 73. The value pairs 11 and 19, 29 and 43, 59 and 71, and 79 and 97 are each left and right child nodes of the values 17, 37, 67, and 83, respectively. These last eight nodes are all leaf nodes.

What if the target value 55 somehow ended up in the node where 43 is currently shown? Looking at each individual node in the tree, the two binary search tree constraints are still satisfied. In particular, 47 is greater the value in its left child, 23, which is less than the value in its right child, 37, which is less than the value in its right child, 55 (again, assuming we replaced the value 43 currently in that node in Figure 3 with value 55.) When searching, the first comparison of 55 and 47 causes the search to move right, to the node containing the value 73. From 73, the search could never find the value 55 on the other side of the tree. Fortunately, this situation does not arise due to the method used to insert values into a BST.

Inserting into a binary search tree

Starting with an empty BST, inserting the value 47 results in a tree with a single node, which is the root as you can see in Figure 4.

Figure 4: A binary search tree with the value 47 in the root node and no other nodes in the tree.

Inserting follows much the same algorithm as searching. Comparisons between the value being inserted and values already in the tree determine whether the new value is inserted to the left or right of an existing node. Inserting the values 23 and 73 results in the tree shown in Figure 5.

Figure 5: A binary search tree with 47 at the root, a left child of 23, and a right child of 73.

With a few nodes already in the BST, inserting new values is a bit more interesting. Figure 6 shows the tree after the values 37 and 67 are inserted. Since 37 is less than 47 it must go left of 47 and then since it is greater than 23 it must go right of 23. Similarly, since 67 is greater than 47 it must go right of 47 and then since it is less than 73 it must go left of 73. The comparisons made when determining where to put a new value are the same as those made when searching for a value.

Figure 6: The previous BST with 37 added as the right child of 23 and 67 added as the left child of 73.

A couple more insertions with the values 17 and 83 results in the tree shown in Figure 7.

Figure 7: The Previous BST with 17 added as the left child of 23 and 83 added as the right child of 73.

At this point, consider the scenario previously discussed where the value 55 is the right child of the node containing the value 37. The first comparison done when inserting a value is with the root node, which contains the value 47. The new value 55 must be inserted to the right of 47, so it's not possible for the 55 to end up on the wrong side of the tree. Instead, inserting the value 55 at this point results in it being the left child of the node containing the value 67.

Inserting into a BST is O(log n) for the same reason searching a BST is O(log n). Finding the location where the new value is to be inserted follows the same O(log n) process as searching. The actual insertion of the new tree node is O(1) in the same way adding a new node to a linked list is O(1). However, the same caveat applies -- the BST must be balanced.

Balanced binary search trees

It's important to understand new values are always inserted into a BST as leaf nodes. Reaching a leaf node by always making the correct choice between left and right child is what ensures the BST constraints remain satisfied. Unfortunately, this can also lead to an unbalanced tree which in turn leads to less than O(log n) performance.

Consider the values in the BST in Figure 7. These values were inserted in the order 47, 23, 73, 37, 67, 17, and 83. It's this insertion order that resulted in the nicely balanced tree. If the values were inserted in ascending order, the resulting tree would be completely lopsided. This worst-case scenario is shown in Figure 8.

Figure 8: An unbalanced BST with 17 as the root and the values 23, 37, 47, 67, 73, and 83 in successive right child nodes of the preceding node.

When the value 17 is the first to be inserted, it becomes the root of the tree. Each subsequent insertion is done as a leaf node attached to the tree as the right child of the previous node. The result looks more like a linked list than a tree. This is still a valid BST, but the search performance will be linear instead of logarithmic.

A BST is considered balanced if its left and right sub-trees are balanced and their heights differ by at most 1. There are several versions of binary search trees that are able to maintain their balance regardless of insertion order. The details are beyond the scope of this lesson, but fear not! Java's TreeSet and TreeMap classes are implemented as self-balancing binary search trees. Using these classes provides the logarithmic performance of search, insertion, and deletion, without all the fuss.

Speaking of deletion, it's a bit more complex than insertion, but not significantly. The details are again beyond the scope of this lesson but suffice it to say the time complexity is also O(log n) for reasons very similar to those for insertion.

Java's TreeSet and TreeMap

Previous lessons introduced the Set and Map interfaces as well as the HashSet and HashMap implementations. These have excellent performance and should be the preferred choice when the sorting of elements isn't required. When sorting of elements is required, the TreeSet and TreeMap classes are a good choice as they're implemented as self-balancing binary search trees. As BSTs, they have O(log n) performance for search, insertion, and deletion.

TreeSet items are sorted

Again, Java's HashSet has excellent performance, but it makes no guarantees as to the iteration order of the set. Further, it doesn't guarantee the order will remain constant over time. This means two different runs of the same code using a HashSet could produce different orderings of the results. In fact, if an application does a lot of adding and removing of items from a HashSet, the ordering could change during the same run! Thus, in an application requiring items to be in a predictable and consistent order, a TreeSet is preferred over a HashSet.

The following example demonstrates elements in a TreeSet are stored in sorted order while those in a HashSet are not:

import java.util.HashSet;
import java.util.TreeSet;

public class TreeSetDemo {
    public static void main(String[] args) {
        HashSet<String> hash = new HashSet<>();
        TreeSet<String> tree = new TreeSet<>();
        String[] fruits = { "Bananas", "Grapes", "Figs", "Dates",
        "Elderberries", "Apples", "Cantaloupes" };

        for(String fruit : fruits) {
            hash.add(fruit);
            tree.add(fruit);
        }

        System.out.println(hash);
        System.out.println(tree);
    }

}

The second line of the output produced by the above code shows the contents of the TreeSet in sorted order:

[Elderberries, Apples, Cantaloupes, Bananas, Grapes, Dates, Figs]
[Apples, Bananas, Cantaloupes, Dates, Elderberries, Figs, Grapes]

TreeMap keys are sorted

As a brief refresher, a Map data structure allows a program to make an association between a key and a value. In a sense, this is the same thing an ArrayList does with the keys being integer indices into the array and the values being the objects stored in the array. A Map is more generic in that the keys are not limited to integers.

A previous example worth revisiting here is an application that stores information about ATA participants and allows looking up a participant's information given their username. To add a bit of functionality appropriate for the current topic, let's add the requirement that the application be able to print a roster of participants sorted by their usernames. The fact that a TreeMap stores its data in sorted order will be especially handy!

First, a class, Participant, is used to represent ATA participants. This class stores a participant's ID number, username, and full name. (In reality it would likely contain a lot more useful information such as address, phone number, favorite Star Wars character, etc.) The following code listing shows an implementation of such a Participant class:

/**
* Participant class representing a single ATA participant.
*/
public class Participant {
    // Used to generate a unique ID number for each participant.
    private static int nextId = 0;

    /** Unique ID number for each participant. */
    public int idNumber;

    /** Username for each participant; should be unique. */
    public String username;

    /** Full name for each participant. */
    public String fullname;

    /**
    * Constructs a Participant with the given username
    * and fullname, creating a unique ID number.
    *
    * @param username This Participant's username.
    * @param fullname This Participant's fullname.
    */
    public Participant(String username, String fullname) {
        this.idNumber = ++Participant.nextId;
        this.username = username;
        this.fullname = fullname;
    }

    /**
    * Builds and returns a String representation of this Participant.
    *
    * @return A String representation of this Participant.
    */
    @Override
    public String toString() {
        return String.format("ID: %d, username: %s, name: %s",
        this.idNumber, this.username, this.fullname);
    }
}

Notice how the static class variable nextId in the Participant class is used to assign each new Participant object a unique ID number.

Next, the ParticipantDirectory class creates a TreeMap which is used to map String usernames to their associated Participant objects. This class also includes methods to implement the desired functionality of looking up a participant given their username and printing a participant roster sorted by usernames.

Notice that while we instantiate a TreeMap, we store it as a SortedMap. This is similar to how we have previously used List, Map, or Set interfaces to limit our access to only a desired set of behaviors. SortedMap gives us access to methods related to sorting and retrieving sorted values from our TreeMap.

Below is the full listing of the ParticipantDirectory class with discussion of individual sections to follow later in the reading:

import java.util.Map;
import java.util.TreeMap;
import java.util.SortedMap;

/**
 * Application to demonstrate a TreeMap storing key:value
 * pairs in sorted order based on the keys.
 */
public class ParticipantDirectory {
    // A bit of test data.
    private static final Participant[] testData = {
            new Participant("adesai", "Arnav Desai"),
            new Participant("mrivera", "Martha Rivera"),
            new Participant("ssarkar", "Saanvi Sarkar"),
            new Participant( "jstiles", "John Stiles"),
    };

    // TreeMap storing participants with username as key.
    private SortedMap<String, Participant> participants;

    /**
     * ParticipantDirectory constructor creates the TreeMap
     * and loads the test data.
     */
    public ParticipantDirectory() {
        // Use a TreeMap so data is stored in sorted order.
        this.participants = new TreeMap<>();
        // Load test data using p.username as the key.
        for(Participant p : testData) {
            this.participants.put(p.username, p);
        }
    }

    /**
     * Looks up a Participant given a username. This operation is
     * O(log n) when a TreeMap is used to store participants.
     *
     * @param username The username to be looked up.
     * @return Participant object mapped to the given username;
     *                 null if username not in map.
     */
    public Participant getParticipantByUsername(String username) {
        return this.participants.get(username);
    }

    /**
     * Prints a participant roster sorted by username.
     */
    public void printRosterByUsername() {
        // The collection of Participant objects returned by
        // values() is sorted based on the username keys.
        for(Participant p : this.participants.values()) {
            System.out.println(p);
        }
    }

    /**
     * Main method to demonstrate the ParticipantDirectory.
     *
     * @param args Command line arguments; ignored in this application.
     */
    public static void main(String[] args) {
        ParticipantDirectory dir = new ParticipantDirectory();

        System.out.println("Participant 'adesai': " +
                dir.getParticipantByUsername("adesai"));

        System.out.println("Participant Roster:");
        dir.printRosterByUsername();
    }
}

Stripping away the test data and a bit of other overhead, the beginning of the ParticipantDirectory is shown below:

public class ParticipantDirectory {
    private SortedMap<String, Participant> participants;

    public ParticipantDirectory() {
        this.participants = new TreeMap<>();
    }

    public Participant getParticipantByUsername(String username) {
        return this.participants.get(username);
    }
}

The above code declares the instance variable participants to be type Map where the keys are type String and the values are type Participant. This map is used to associate String usernames with Participant objects. The map is then used by the getParticipantByUsername() method to look up and return a Participant object given a username.

The ParticipantDirectory constructor assigns an instance of the TreeMap class to the participants variable. The participants variable was declared as a Map interface, so any of Java's concrete map classes could be used. The TreeMap stores the key:value pairs as a balanced binary search tree. The choice of a TreeMap allows us to implement the added functionality of printing a roster of participants sorted by their usernames. (In the earlier lesson's use of this example, the map used was a HashMap, which provides faster lookup, but doesn't sort its data.)

The TreeMap data structure stores values in sorted order based on their associated keys. In this example, each participant's username is used as the key value to store the Participant objects in the map. Thus, the sorted order in the TreeMap is based on usernames, which is exactly what we want when printing a roster. This makes implementing the printRosterByUsername() method rather straightforward, as shown below:

public void printRosterByUsername() {
    for(Participant p : this.participants.values()) {
        System.out.println(p);
    }
}

Because the TreeMap stores values in order based on their associated keys, the printRosterByUsername() method can directly use the values() method of the TreeMap class to iterate over the Participant objects, knowing they're already in sorted order.

Finally, the application includes a short main program to create a ParticipantDirectory and demonstrate the methods to look up a user and print a roster.

public static void main(String[] args) {
    ParticipantDirectory dir = new ParticipantDirectory();
    System.out.println("Participant 'adesai': " + dir.getParticipantByUsername("adesai"));
    System.out.println("Participant Roster:");
    dir.printRosterByUsername();
}

On the surface, the output of the above main method, shown below, is nothing flashy:

Participant 'adesai': ID: 3, username: adesai, name: Arnav Desai
Participant Roster:
ID: 1, username: adesai, name: Arnav Desai
ID: 4, username: jstiles, name: John Stiles
ID: 2, username: mrivera, name: Martha Rivera
ID: 3, username: ssarkar, name: Saanvi Sarkar

The result of looking up username "adesai" appears exactly as expected, showing the unique user ID number, username, and full name for the Participant object in the TreeMap associated with the username "adesai".

At first glance, the participant roster may appear to be a bit unusual. The user ID numbers 1, 4, 2, 3 at the beginning of each line are out of order. However, remember the user ID numbers are nothing more than a unique ID number assigned to each new Participant object in the class constructor. More importantly, the TreeMap storing the Participant objects is sorted by keys. In this case, we used their usernames. The printed roster is therefore sorted by usernames, as desired, all courtesy of the TreeMap with no additional effort required.

Conclusion

The structure of binary search trees allows search, insertion, and deletion operations to take place in logarithmic time, O(log n). However, this requires an important caveat -- the tree must be balanced. Java's TreeSet and TreeMap classes, which are implemented as self-balancing binary search trees, give very good logarithmic performance for search, insertion, and deletion operations. They also have the added benefit of storing objects in sorted order. A TreeSet sorts its entries based on their natural ordering, and a TreeMap sorts its key-value pairs based on the natural ordering of the keys.

Guided Project

Phone Contacts App

Practice Exercises

Phase 1: :hacker_voice: "I'm in." A Filesystem as a Tree

Consider the following filesystem, represented as the output from the command tree /home

(Note: tree's output can be somewhat ambiguous. In the snippet below, lines without file extensions represent directories)

/home
├── Documents
│   ├── report.txt
│   ├── source.txt
│   └── lab3.txt
├── Desktop
│   ├── midterm.doc
│   ├── contacts
│   └── model4
└── Pictures
    ├── 332459.png
    ├── profile.jpg
    └── vacation2012

If you had a shell running in the /home directory, how would you navigate from home to model4?

Are there any files which cannot be reached from the /home directory?

Based solely on the structure above, is the /home directory located inside of any other directories?

Each line of the output above can be seen as a node in a tree. Are there any nodes in the tree that do not have a parent?

Which node is the root of the filesystem structure?

Are there any nodes that have more than one parent?

Are there any nodes that don't have any children? (These are leaf nodes.)

Phase 2: Breadth- and Depth-First Search

What is the first node inspected by a depth-first search for the value 18, which does not exist in this tree?

What is the first node inspected by a breadth-first search for the value 18?

What is the last node inspected by a depth-first search for the value 18?

What is the last node inspected by a breadth-first search for the value 18?

Which color is the node found by a depth-first search for the value 73? Which nodes, in order, did you visit to find 73?

Which color is the node found by a breadth-first search for the value 73? Which nodes, in order, did you visit to find 73?

Phase 3: Binary Search Trees

The tree above is a binary search tree (BST). Looking at Figure 2, what do you think are the properties of a binary search tree?

What is the height of the tree?

If you were searching for the value 13 (which doesn't exist in this tree), how many nodes would you have to check before knowing the tree did not contain the value 13?

How is the height of a BST related to the search time of a BST?

Phase 4: Problem Spaces and Trees

Can the class hierarchy in Java be represented with a tree? Why or why not?

A company's organization includes a CEO, several VPs that report to the CEO, managers that each report to a VP, and individual contributors that report to managers. Some of the individual contributors can work on two different teams at the same time, so they report to two managers. Can this organization be represented by a tree? Why or why not?

Module 4 - Trees

Module Overview

Learning Objectives

Introduction to Trees

Overview

What is a tree?

Visualizing data as a tree

Tree implementation

Searching a tree

Breadth-first search

Depth-first search

Comparing BFS and DFS

Summary

Up next

Tree Traversal Techniques

Binary Search Trees

Overview

Review of logarithmic time complexity

What is a binary search tree?

Searching a binary search tree

Inserting into a binary search tree

Balanced binary search trees

Java's TreeSet and TreeMap

TreeSet items are sorted

TreeMap keys are sorted

Conclusion

Guided Project

Practice Exercises

Phase 1: :hacker_voice: "I'm in." A Filesystem as a Tree

Phase 2: Breadth- and Depth-First Search

Phase 3: Binary Search Trees

Phase 4: Problem Spaces and Trees

Resources