# Edward Valentini

## Suffix Trees and Ukkonen’s Algorithm in Swift Part 1 Suffix Trees are vastly important in computer science. However, I felt there is a gap in the literature about them, so with this article I hope to demystify the suffix tree and will show how one can be constructed using Ukkonen’s algorithm. Ukkonnen developed an `O(n)` time algorithm for constructing suffix trees. One excellent reference I found on suffix trees and string processing in general is Dan Gusfield’s Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology which I have found to be an invaluable reference.

### Background

Before talking about suffix trees, let’s briefly talk about suffix tries as a motivation for learning about the suffix tree.  It is somewhat unfortunate that there is a split of authority as to the correct pronunciation of trie.  Some pronounce it like tree and some pronounce it like try. I generally try to pronounce it like try 🤪. A trie is a data structure used to represent a string of characters where each edge represents a single character and a particular node can have only uniquely labelled edges emanating from it. Let’s consider all the suffixes of the word `banana\$` (The `\$` at the end of the word will be explained shortly) `banana\$`, `anana\$`, `nana\$`, `ana\$`, `na\$`, `a\$` and `\$`. If we want to express these suffixes in a trie, then we have the following: Notice here that every edge consists of a single character. a `Suffix Tree` on the other hand is a related data structure, which you can think of as a compressed suffix trie, in which new nodes are only created for unique extensions of suffixes. A suffix tree for the same string `banana\$` is below. ### Suffix Tree

A Suffix Tree for a string `S` of length `n` will have `n` leaves. To aid in the contruction it is necessary for the string to have a terminal character which occurs nowhere in the interior of the string. By convention the character `\$` is used. In a suffix tree each internal node (except the root) has at least 2 children. Each edge is labelled with a substring of S. Each node will not have more than one edge emanating from it which starts with the same character. Note: in the above suffix tree that the concatenation of edge labels from root to leaf spell out the suffix which starts at the index contained in the leaf node. The red arrows are suffix links which will be introduced later.

#### Naïve algorithm to construct a Suffix Tree

Let’s take a string `S` of length `n` where the last character of `S` is already unique. When we add an edge from the root for the substring `S[0..<n]` (using Swift notation for ranges) this is denoted as `T0`. Similarly, let `Tj` be the tree after `j` steps, that is after adding `S[0..<n]`, `S[1..<n]`,…,`S[j..<n]`

1. Start at the root and add an edge for `S[0..<n]`. This is `T0`.
2. To create `Tj+1` from `Tj`, start at the root of `Tj` and find the longest ath which matches a prefix of `S[(j+1)..<n]`. You will either not find any such path, end at a node or end in the middle of an edge.
3. If you do not find such a path, then you add an edge from root, for the label `S[(j+1)..<n]`.
4. If you end at a node and and the length of the edge label is `a` then you add a new edge from the node with the label `S[(j+1+a)..<n]`.
5. If you end in the middle of an edge `b` characters into the edge where the entire edge represents string `P` of length `c`, then you break the edge into 2 pieces, `P[0..<b]` and `P[b..<c]`. You attach a new node at the end of the `P[0..<b]` edge. From this node, you attach 2 edges: one which is the `P[b..<c]` edge which is the remainder of the edge you broke, the other with label `S[(j+1+b)..<n]`.

This naive algorithm is `O(n2)`. Let’s go through this naive algorithm step by step for the string: `banana\$`.

After Step 1 we have the tree for `T0`: After Step 2 & 3 in creating `T1` we have: After Step 2 & 3 in creating `T2` we have: Now we need to match the string `ana\$`. According to Step 2, we find the longest path `p` which matches a prefix of `ana\$`. This is the middle of the middle edge in the graph above. Substring `ana` of our string `S` has matched `ana` of our string `p`. However, `\$` from `S` does not match `n` of `p`, thus we need to break the `anana\$` edge into 2 pieces, namely the `ana` portion and the `na\$` portion. Then we need to create a new node and insert that at the end of the `ana` edge. To that new node we add the `na\$` edge and we also create a new node labeled with `\$` (the leftover character from `S` which didn’t match) and make that the label of the new edge.

We then end up with: Now we need to match `na\$`. The procedure is similar to that of the previous step. We find that `na\$` matches along the `nana\$` edge. After breaking up that edge, and adding the requisite nodes and edges we end up with: Then we add `a\$` similarly and end up with: and then we add the last character and end up with: This was the naïve algorithm. Now lets look at Ukkonen’s algorithm.

# Ukkonen’s Algorithm

In constructing a suffix tree with Ukkonen’s algorithm, one needs to be familiar with the notion of the implicit suffix tree.

## Implicit Suffix Tree

An implicit suffix tree for a string `S` of length `n` can be constructed from the suffix tree for string `S` by performing the following actions:
* deleting all the unique terminal characters from the edge labels
* removing all edges without labels
* removing all nodes that have strictly fewer than 2 children

So the implicit suffix tree for `banana\$` would be: The implicit suffix tree will be important in the linear time construction of the suffix tree using Ukkonen’s algorithm.

Let’s denote `Ii` to be the implicit suffix tree generated by the prefix of S up through the ith character. So `Ii``= S[0..<i]`. In this notation `I1` denotes the implicit suffix tree for the first character.

## High level overview

1. Generate `I1`
2. Keep generating `Ii+1` from `Ii`
3. At the last stage when the terminal symbol `\$` is added, the implicit suffix tree will automatically be converted to a true suffix tree.

in pseudo code this looks like:

for `i` in `0..<(n-1)`:
– for `j` in `0..<(i+1)`:
– from root attempt to locate the end of the path `S[j..<(i+1)]`. If no such path exists return root.
– perform a suffix extension on the result of the previous step

but I have not defined what a suffix extension is.

## Suffix Extension

A suffix extension is performed by evaluating the following rules.

### Rule 1

If you can find a path `S[j..<(i+1)]` from root to leaf then append `S[i+1]` to the leaf edge.

### Rule 2

I find it helpful to break down `Rule 2` into a few different cases since it is quite complex.

Case 1: There is no path `p = S[j..<(i+1)]` from root.

If, however, you are able to find a path `p` where `p = S[j..<(i+1)]`, then you will find that `p` ends either in the middle of an edge or at an internal node. Path `p` cannot end at a leaf node because then that would be a `Rule 1` extension.

Case 2: Path `p` ends in the middle of an edge AND the next character on the path `p`, namely `p[i+1]` IS NOT equal to `S[i+1]`. This means the end of the matching portion lies in the middle of an edge and there is a nonempty string in the unmatched portion left over on the edge.

Case 3: Path `p` ends at an internal node AND the next character on the path `p`, namely `p[i+1]` IS NOT equal to `S[i+1]`. This means the end of the matching portion lies at the end of an edge at a node and and there is a nonempty string of the unmatched portion left over in another edge emanating from that node.

These are the three cases of `Rule 2` visualized graphically: If you encounter Case 1 of Rule 2, i.e. there is no path `p = S[j..<(i+1)]` from root, then create a path starting from root starting with character `S[j]` and label it `S[j..<(i+1)]`. If you encounter Case 2 of Rule 2, i.e. `p = S[j..<(i+1)]` ends at the middle of an edge `E` AND the next character on the path `p`, namely `p[i+1]` IS NOT equal to `S[i+1]`, and let’s assume that edge `E` ends at node `R`. Then a new internal node `N` must be created after the last character where `p` and `S` match on `E`. Then one edge emanating from `N` is the remainder of the `E` edge that we just broke into two, namely the portion of the edge with label `p[i+1]` and the other ege emanating from `N` is a newly created edge with leaf node `Q` where this edge is labelled starting with character `S[i+1]`. Finally if you encounter Case 3 of Rule 2, i.e. `p = S[j..<(i+1)]` ends at a node `R` AND the next character on the path `p`, namely `p[i+1]` IS NOT equal to `S[i+1]`, then a new edge ending in a leaf node `Q` must be created from `R` which starts with character `S[i+1]`. ### Rule 3

If, however, you are able to find a path `p = S[j..<(i+1)]` ending either in the middle of an edge or at an internal node AND the next character on the path `p`, namely `p[i+1]` IS EQUAL to `S[i+1]`, then you do nothing. Later we will do something. But for now we do nothing.

## Example of Construction

Let’s walk through an example to see how this works.

In constructing the string `banana\$`,

We enter the loop with `i=0`:
* For `j=0` and `p=S[0..<1]=b` we perform a `Rule 2` extension `Case 1` to add edge labelled `b` to the root. Now `i=1`
* For `j=0` and `p=S[0..<2]=ba` we perform a `Rule 1` extension on the edge labelled `b` by adding `a` to the label.
* For `j=1` and `p=S[1..<2]=a` we perform a`Rule 2` extension `Case 1` by adding the edge `a` to the root. Now `i=2`:
* For `j=0` and `p=S[0..<3]=ban` we perform a `Rule 1` extension on the edge labelled `ba` by adding an `n` to the label.
* For `j=1` and `p=S[1..<3]=an` we perform a `Rule 1` extension on the edge labelled `a` by adding an `n` to that label.
* For `j=2` and `p=S[2..<3]=n` we perform a `Rule 2` extension `Case 1` by adding the edge `n` to the root. Now `i=3`:
* For `j=0` and `p=S[0..<4]=bana` we perform a `Rule 1` extension on the edge labelled `ban` by adding an `a` to it.
* For `j=1` and `p=S[1..<4]=ana` we perform a `Rule 1` extension on the edge labelled `an` by adding an `a` to it.
* For `j=2` and `p=S[2..<4]=na` we perform a `Rule 1` extension on the edge labelled `n` by adding an a to it.
* For `j=3` and `p=S[3..<4]=a` we perform a `Rule 3` extension on the edge now labelled `ana` and do nothing. Now `i=4`:
* For `j=0` and `p=S[0..<5]=banan` we perform a `Rule 1` extension on the edge labelled `bana` by adding an `n` to it.
* For `j=1` and `p=S[0..<5]=anan` we perform a `Rule 1` extension on the edge labelled `ana` by adding an `n` to it.
* For `j=2` and `p=S[0..<5]=nan` we perform a `Rule 1` extension on the edge labelled `na` by adding an `n` to it.
* For `j=3` and `p=S[0..<5]=an` we perform a `Rule 3` extension on the edge now labelled `anan` and do nothing.
* For `j=4` and `p=S[0..<5]=n` we perform a `Rule 3` extension on the edge now labelled `nan` and do nothing. Now `i=5`:
* For `j=0` and `p=S[0..<6]=banana` we perform a `Rule 1` extension on the edge labelled `banan` by adding an `a` to it.
* For `j=1` and `p=S[1..<6]=anana` we perform a `Rule 1` extension on the edge labelled `anan` by adding an `a` to it.
* For `j=2` and `p=S[2..<6]=nana` we perform a `Rule 1` extension on the edge labelled `nan` by adding an `a` to it.
* For `j=3` and `p=S[3..<6]=ana` we perform a `Rule 3` extension on edge now labelled `anana` and do nothing
* For `j=4` and `p=S[4..<6]=na` we perform a `Rule 3` extension on the edge now labelled `nana` and do nothing.
* For `j=5` and `p=S[5..<6]=a` we perform a `Rule 3` extension on edge now labelled `anana` and do nothing For the last step I will show each transition since most of the work happens in this step.
Now `i=6`:
* For `j=0` and `p=S[0..<7]=banana\$` we perform a `Rule 1` extension on the edge labelled `banana` by adding an `\$` to it. • For `j=1` and `p=S[1..<7]=anana\$` we perform a `Rule 1` extension on the edge labelled `anana` by adding an `\$` to it. • For `j=2` and `p=S[2..<7]=nana\$` we perform a `Rule 1` extension on the edge labelled `nana` by adding an `\$` to it. • For `j=3` and `p=S[3..<7]=ana\$` we perform a `Rule 2` extension `Case 2` on edge now labelled `anana\$`. We split that edge after the matching portion `ana` and add there a new internal node which has an edge emanating from it containing the unmatched portion of `p` namely `na\$` and another edge labelled with the unmatched portion of `S` namely `S[i+1]` or `S` or `\$` pointing to a leaf node. • For `j=4` and `p=S[4..<7]=na\$` we perform a `Rule 2` extension `Case 2` on the edge now labelled `nana\$`. We split that edge after the matching portion `na` and add there a new internal node which has an edge emanating from it containing the unmatched portion of `p` namely `na\$` and another edge labelled with the unmatched portion of `S` namely `S[i+1]` or `S` or `\$` pointing to a leaf node. • For `j=5` and `p=S[5..<7]=a\$` we perform a `Rule 2` extension `Case 2` on edge now labelled `ana`. We split that edge after the matching portion `a` and add there a new internal node which has an edge emanating from it containing the unmatched portion of `p` namely `na` and another edge labelled with the unmatched portion of `S` namely `S[i+1]` or `S` or `\$` pointing to a leaf node. • For `j=6` and `p=S[6..<7]=\$` we perform a `Rule 2` extension `Case 1` to add edge labelled `\$` to the root.

And we end up with the completed suffix tree. This is the basic flow of Ukkkonen’s algorithm. But wait! Something doesn’t seem right. Using the method outlined above, we have for a string `S` of length `n` exactly `n` phases. Then for each extension we have `O(n)` steps. Then finally to perform each extension, we will need to find the path `S[j..<(i+1)]` in the tree. This will require us to traverse the tree which in the worst case is an `O(n)` operation. This gives rise to a run-time complexity of `O(n3)`. But I promised you an `O(n)` algorithm so what gives? Furthermore, this is worse than the naïve algorithm which was `O(n2)`.

It turns out that we will take this `O(n3)` algorithm and apply to it a series of optimizations and tricks to eventually get to the `O(n)` I promised.