Suffix Trees and Ukkonen’s Algorithm in Swift Part 1

Suffix Trees are vastly important in computer science. However, I felt there is a gap in the literature about them, so with this article I hope to demystify the suffix tree and will show how one can be constructed using Ukkonen’s algorithm. Ukkonnen developed an O(n) time algorithm for constructing suffix trees. One excellent reference I found on suffix trees and string processing in general is Dan Gusfield’s Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology which I have found to be an invaluable reference.

Background

Before talking about suffix trees, let’s briefly talk about suffix tries as a motivation for learning about the suffix tree.  It is somewhat unfortunate that there is a split of authority as to the correct pronunciation of trie.  Some pronounce it like tree and some pronounce it like try. I generally try to pronounce it like try 🤪. A trie is a data structure used to represent a string of characters where each edge represents a single character and a particular node can have only uniquely labelled edges emanating from it. Let’s consider all the suffixes of the word banana$ (The $ at the end of the word will be explained shortly) banana$, anana$, nana$, ana$, na$, a$ and $. If we want to express these suffixes in a trie, then we have the following:

Notice here that every edge consists of a single character. a Suffix Tree on the other hand is a related data structure, which you can think of as a compressed suffix trie, in which new nodes are only created for unique extensions of suffixes. A suffix tree for the same string banana$ is below.

Suffix Tree

A Suffix Tree for a string S of length n will have n leaves. To aid in the contruction it is necessary for the string to have a terminal character which occurs nowhere in the interior of the string. By convention the character $ is used. In a suffix tree each internal node (except the root) has at least 2 children. Each edge is labelled with a substring of S. Each node will not have more than one edge emanating from it which starts with the same character. Note: in the above suffix tree that the concatenation of edge labels from root to leaf spell out the suffix which starts at the index contained in the leaf node. The red arrows are suffix links which will be introduced later.

Naïve algorithm to construct a Suffix Tree

Let’s take a string S of length n where the last character of S is already unique. When we add an edge from the root for the substring S[0..<n] (using Swift notation for ranges) this is denoted as T0. Similarly, let Tj be the tree after j steps, that is after adding S[0..<n], S[1..<n],…,S[j..<n]

  1. Start at the root and add an edge for S[0..<n]. This is T0.
  2. To create Tj+1 from Tj, start at the root of Tj and find the longest ath which matches a prefix of S[(j+1)..<n]. You will either not find any such path, end at a node or end in the middle of an edge.
  3. If you do not find such a path, then you add an edge from root, for the label S[(j+1)..<n].
  4. If you end at a node and and the length of the edge label is a then you add a new edge from the node with the label S[(j+1+a)..<n].
  5. If you end in the middle of an edge b characters into the edge where the entire edge represents string P of length c, then you break the edge into 2 pieces, P[0..<b] and P[b..<c]. You attach a new node at the end of the P[0..<b] edge. From this node, you attach 2 edges: one which is the P[b..<c] edge which is the remainder of the edge you broke, the other with label S[(j+1+b)..<n].

This naive algorithm is O(n2). Let’s go through this naive algorithm step by step for the string: banana$.

After Step 1 we have the tree for T0:

After Step 2 & 3 in creating T1 we have:

After Step 2 & 3 in creating T2 we have:

Now we need to match the string ana$. According to Step 2, we find the longest path p which matches a prefix of ana$. This is the middle of the middle edge in the graph above. Substring ana of our string S has matched ana of our string p. However, $ from S does not match n of p, thus we need to break the anana$ edge into 2 pieces, namely the ana portion and the na$ portion. Then we need to create a new node and insert that at the end of the ana edge. To that new node we add the na$ edge and we also create a new node labeled with $ (the leftover character from S which didn’t match) and make that the label of the new edge.

We then end up with:

Now we need to match na$. The procedure is similar to that of the previous step. We find that na$ matches along the nana$ edge. After breaking up that edge, and adding the requisite nodes and edges we end up with:

Then we add a$ similarly and end up with:

and then we add the last character and end up with:

This was the naïve algorithm. Now lets look at Ukkonen’s algorithm.

Ukkonen’s Algorithm

In constructing a suffix tree with Ukkonen’s algorithm, one needs to be familiar with the notion of the implicit suffix tree.

Implicit Suffix Tree

An implicit suffix tree for a string S of length n can be constructed from the suffix tree for string S by performing the following actions:
* deleting all the unique terminal characters from the edge labels
* removing all edges without labels
* removing all nodes that have strictly fewer than 2 children

So the implicit suffix tree for banana$ would be:

The implicit suffix tree will be important in the linear time construction of the suffix tree using Ukkonen’s algorithm.

Let’s denote Ii to be the implicit suffix tree generated by the prefix of S up through the ith character. So Ii= S[0..<i]. In this notation I1 denotes the implicit suffix tree for the first character.

High level overview

  1. Generate I1
  2. Keep generating Ii+1 from Ii
  3. At the last stage when the terminal symbol $ is added, the implicit suffix tree will automatically be converted to a true suffix tree.

in pseudo code this looks like:

for i in 0..<(n-1):
– for j in 0..<(i+1):
– from root attempt to locate the end of the path S[j..<(i+1)]. If no such path exists return root.
– perform a suffix extension on the result of the previous step

but I have not defined what a suffix extension is.

Suffix Extension

A suffix extension is performed by evaluating the following rules.

Rule 1

If you can find a path S[j..<(i+1)] from root to leaf then append S[i+1] to the leaf edge.

Rule 2

I find it helpful to break down Rule 2 into a few different cases since it is quite complex.

Case 1: There is no path p = S[j..<(i+1)] from root.

If, however, you are able to find a path p where p = S[j..<(i+1)], then you will find that p ends either in the middle of an edge or at an internal node. Path p cannot end at a leaf node because then that would be a Rule 1 extension.

Case 2: Path p ends in the middle of an edge AND the next character on the path p, namely p[i+1] IS NOT equal to S[i+1]. This means the end of the matching portion lies in the middle of an edge and there is a nonempty string in the unmatched portion left over on the edge.

Case 3: Path p ends at an internal node AND the next character on the path p, namely p[i+1] IS NOT equal to S[i+1]. This means the end of the matching portion lies at the end of an edge at a node and and there is a nonempty string of the unmatched portion left over in another edge emanating from that node.

These are the three cases of Rule 2 visualized graphically:

If you encounter Case 1 of Rule 2, i.e. there is no path p = S[j..<(i+1)] from root, then create a path starting from root starting with character S[j] and label it S[j..<(i+1)].

If you encounter Case 2 of Rule 2, i.e. p = S[j..<(i+1)] ends at the middle of an edge E AND the next character on the path p, namely p[i+1] IS NOT equal to S[i+1], and let’s assume that edge E ends at node R. Then a new internal node N must be created after the last character where p and S match on E. Then one edge emanating from N is the remainder of the E edge that we just broke into two, namely the portion of the edge with label p[i+1] and the other ege emanating from N is a newly created edge with leaf node Q where this edge is labelled starting with character S[i+1].

Finally if you encounter Case 3 of Rule 2, i.e. p = S[j..<(i+1)] ends at a node R AND the next character on the path p, namely p[i+1] IS NOT equal to S[i+1], then a new edge ending in a leaf node Q must be created from R which starts with character S[i+1].

Rule 3

If, however, you are able to find a path p = S[j..<(i+1)] ending either in the middle of an edge or at an internal node AND the next character on the path p, namely p[i+1] IS EQUAL to S[i+1], then you do nothing. Later we will do something. But for now we do nothing.

Example of Construction

Let’s walk through an example to see how this works.

In constructing the string banana$,

We enter the loop with i=0:
* For j=0 and p=S[0..<1]=b we perform a Rule 2 extension Case 1 to add edge labelled b to the root.

Now i=1
* For j=0 and p=S[0..<2]=ba we perform a Rule 1 extension on the edge labelled b by adding a to the label.
* For j=1 and p=S[1..<2]=a we perform aRule 2 extension Case 1 by adding the edge a to the root.

Now i=2:
* For j=0 and p=S[0..<3]=ban we perform a Rule 1 extension on the edge labelled ba by adding an n to the label.
* For j=1 and p=S[1..<3]=an we perform a Rule 1 extension on the edge labelled a by adding an n to that label.
* For j=2 and p=S[2..<3]=n we perform a Rule 2 extension Case 1 by adding the edge n to the root.

Now i=3:
* For j=0 and p=S[0..<4]=bana we perform a Rule 1 extension on the edge labelled ban by adding an a to it.
* For j=1 and p=S[1..<4]=ana we perform a Rule 1 extension on the edge labelled an by adding an a to it.
* For j=2 and p=S[2..<4]=na we perform a Rule 1 extension on the edge labelled n by adding an a to it.
* For j=3 and p=S[3..<4]=a we perform a Rule 3 extension on the edge now labelled ana and do nothing.

Now i=4:
* For j=0 and p=S[0..<5]=banan we perform a Rule 1 extension on the edge labelled bana by adding an n to it.
* For j=1 and p=S[0..<5]=anan we perform a Rule 1 extension on the edge labelled ana by adding an n to it.
* For j=2 and p=S[0..<5]=nan we perform a Rule 1 extension on the edge labelled na by adding an n to it.
* For j=3 and p=S[0..<5]=an we perform a Rule 3 extension on the edge now labelled anan and do nothing.
* For j=4 and p=S[0..<5]=n we perform a Rule 3 extension on the edge now labelled nan and do nothing.

Now i=5:
* For j=0 and p=S[0..<6]=banana we perform a Rule 1 extension on the edge labelled banan by adding an a to it.
* For j=1 and p=S[1..<6]=anana we perform a Rule 1 extension on the edge labelled anan by adding an a to it.
* For j=2 and p=S[2..<6]=nana we perform a Rule 1 extension on the edge labelled nan by adding an a to it.
* For j=3 and p=S[3..<6]=ana we perform a Rule 3 extension on edge now labelled anana and do nothing
* For j=4 and p=S[4..<6]=na we perform a Rule 3 extension on the edge now labelled nana and do nothing.
* For j=5 and p=S[5..<6]=a we perform a Rule 3 extension on edge now labelled anana and do nothing

For the last step I will show each transition since most of the work happens in this step.
Now i=6:
* For j=0 and p=S[0..<7]=banana$ we perform a Rule 1 extension on the edge labelled banana by adding an $ to it.

  • For j=1 and p=S[1..<7]=anana$ we perform a Rule 1 extension on the edge labelled anana by adding an $ to it.
  • For j=2 and p=S[2..<7]=nana$ we perform a Rule 1 extension on the edge labelled nana by adding an $ to it.
  • For j=3 and p=S[3..<7]=ana$ we perform a Rule 2 extension Case 2 on edge now labelled anana$. We split that edge after the matching portion ana and add there a new internal node which has an edge emanating from it containing the unmatched portion of p namely na$ and another edge labelled with the unmatched portion of S namely S[i+1] or S[7] or $ pointing to a leaf node.

  • For j=4 and p=S[4..<7]=na$ we perform a Rule 2 extension Case 2 on the edge now labelled nana$. We split that edge after the matching portion na and add there a new internal node which has an edge emanating from it containing the unmatched portion of p namely na$ and another edge labelled with the unmatched portion of S namely S[i+1] or S[7] or $ pointing to a leaf node.
  • For j=5 and p=S[5..<7]=a$ we perform a Rule 2 extension Case 2 on edge now labelled ana. We split that edge after the matching portion a and add there a new internal node which has an edge emanating from it containing the unmatched portion of p namely na and another edge labelled with the unmatched portion of S namely S[i+1] or S[7] or $ pointing to a leaf node.
  • For j=6 and p=S[6..<7]=$ we perform a Rule 2 extension Case 1 to add edge labelled $ to the root.

And we end up with the completed suffix tree.

This is the basic flow of Ukkkonen’s algorithm. But wait! Something doesn’t seem right. Using the method outlined above, we have for a string S of length n exactly n phases. Then for each extension we have O(n) steps. Then finally to perform each extension, we will need to find the path S[j..<(i+1)] in the tree. This will require us to traverse the tree which in the worst case is an O(n) operation. This gives rise to a run-time complexity of O(n3). But I promised you an O(n) algorithm so what gives? Furthermore, this is worse than the naïve algorithm which was O(n2).

It turns out that we will take this O(n3) algorithm and apply to it a series of optimizations and tricks to eventually get to the O(n) I promised.

You can read about those details in Part 2 of this article: Suffix Trees and Ukkonen’s Algorithm in Swift Part 2