This article is a continuation of Suffix Trees and Ukkonen’s Algorithm in Swift Part 1.

I mentioned at the end of part 1, that we would apply a series of optimizations and tricks to get to an `O(n)`

algorithm.

## Optimizations

So in this article we will talk about various optimizations. As is the practice in an introduction, let me tell you what I am going to tell you, then let me tell you, then let me tell you what I told you. So without further ado, we will talk about suffix links, various tips and speed implementation tricks, active points, we will do a walkthrough of a comprehensive example, and finally we will look at some implementation details in Swift. Then of course I will tell you again what I just told you.

### Suffix Links

What is a suffix link? If you look up the definition you will see the following:

*For an internal node A with path label xα where x denotes a single character and α denotes a *

**possibly empty**substring, if there is another node B with path label α, then a pointer from A to B is called a suffix link.

Wow, isn’t that a mouthful? So what does this mean? What this means is that if you have a node `A`

which has path-label `ana`

and another node `B`

that has path label `na`

then a pointer exists from `A`

to `B`

and we call that pointer a suffix link.

In the tree below, there is a suffix link from the node labeled `A`

to the node labeled `B`

since `A`

is an internal node and `A`

has path label `ana`

and `B`

has path label `na`

. See the turquoise arrow below:

But remember that pesky part of the definition *possibly empty* ? So what that means is that if you have an internal node `A`

with a single character path label, such as the label `a`

in the tree below, then it points to a node whose path is empty? What node has an empty path? ROOT. See the turquoise arrow below:

#### Use of Suffix Links to gain speed

Now that we know exactly what a suffix link is, we need to understand how they help us gain speed.

Let’s consider that we are in phase `i+1`

step `j`

. The algorithm dictates that we need to find the end of the path `S[j..<(i+1)]`

in the tree. If I am trying to locate node `B`

then, let’s say I had kept a record of the end of path `S[j-1..<(i+1)]`

namely `A`

, in that case all I would need to do is walk up one node, follow the suffix link and then walk down… This is going to give us a better run time than traversing from root.

What this means is that if we create an internal node `B`

in some extension `j`

of some phase `i`

and that is not the first internal node created in that extension (i.e. we have previously created an internal node `A`

in a previous extension `j-k`

of the same phase `i`

where `0 < k <= j`

), then we can create a suffix link from `A`

to `B`

. In code what we will typically do is as soon as we create an internal node we will set its suffix link to point to root. Then later if we end up creating another internal node in another extension of the same phase, we will change the suffix link to point to that new internal node rather than root.

This is the basic idea. I realize there is a little bit of hand waving here, but this will become alot clearer shortly.

### Skip / Count Trick

Another optimization we can do is the following. After we traverse a suffix link from `C`

to `D`

in the above tree we know that if there exists a path with label `def`

from `C`

to leaf, then there must also exist a path with label `def`

from `D`

to leaf since `D`

is the suffix link of `C`

. Using this knowledge we can quickly skip over edges if the length of the edge label is less than the number of characters we are trying to match. Now it would be great if we could obtain the length of an edge in constant time. Well, yes we can.

### Edge Label Compression

You may have noticed that I have been including some strange numbers underneath the edge labels. For example under `na`

above I have `[2,3]`

and under `a`

I have `[1,1]`

. Rather than copy substrings all over the place we have a global reference to a `String`

and all edge labels refer to the substrings which they define by using indices. So `[2,3]`

means `string[2...3]`

in Swift. As an aside, you can’t actually do that in Swift. You cannot create a substring using integer subscripts. In fact, I wrote an article just on that topic. So if you would like to see how you can make strings integer subscriptable in Swift, I encourage you to read my article Making Strings Integer-Subscriptable.

You may be wondering why there are several labels listed as `[0,#]`

for example with a `#`

instead of a number. **Patience you must have** *my young padawan*.

### Rule 3 is a show stopper

If we are in phase `i+1`

step `j`

, that means we are trying to locate the end of path `S[j..<(i+1)]`

in the tree. Now if a `Rule 3`

extension applies it means that the path `p = S[j..<(i+1)]`

continues with `S[i+1]`

. It stands to reason that if path `p = S[j..<(i+1)]`

continues with `S[i+1]`

, then so too will `q = S[j+1..<(i+1)]`

continue with `S[i+1]`

and so too will path `r = S[j+2..<(i+1)]`

continue with `S[i+1]`

, and so on. Therefore, during a given phase of `i`

, if a `Rule 3`

extension is encountered then we can short circuit and get out of the loop. This seems like a good place to use a `break`

as we will see later. This trick will also afford us some speed increases.

### Once a leaf, always a leaf

If you look at the definition of the algorithm I gave in Part 1, you will see that `Rule 1`

and `Rule 3`

never create any nodes. `Rule 2`

will either add a new node to root (thereby adding a leaf node), will add a new leaf node to some other node, or will insert a new internal node and add a leaf node. At no point does a leaf node ever become an internal node. We can exploit this to quickly handle all of the `Rule 1`

extensions for a given phase `i`

in constant time. Recall that we are now storing indices into the string `S`

in lieu of edge labels. For edges which terminate at a leaf node, rather than storing a scalar integer as the end index, we can store a pointer to the global end. This is what I referred to as # in the diagrams (a global end variable). Then for any phase `i`

we need only increment the global end by `1`

and we have instantly performed all the `Rule 1`

extensions for phase `i`

. Now in `c`

or `c++`

this is relatively straightforward, you just make sure you pass the address of the global end `&end`

and you are good to go. But this requires a little extra work in Swift. We could always just declare the global `end`

an `inout`

type and pass it around by pointer, but this definitely doesn’t feel swifty. Since `Int`

is a struct in swift, a value-type allocated on the stack, we can construct a `class`

based wrapper around `Int`

. Since the swift `class`

is a reference type allocated on the heap, we are always referring to it by pointer anyway. In Swift, this looks something like this.

1 2 3 4 5 6 7 |
class ReferenceInt { let value:Int init(_ value: Int) { self.value = value } } |

These tricks plus the definitions from Part 1 comprise Ukkonen’s algorithm and will allow you to construct a suffix tree in linear time.

Now I will talk about implementation.

## Implementation

At the end of the day, what I would like to have is a `SuffixTree`

data type that would allow me to construct a tree from a string or series of strings and then perform queries on it so it makes sense that `SuffixTree`

will be the exposed datatype. Internally, I am dealing with edges and nodes. Some implementations only model nodes. I prefer to model both edges and nodes.

### Active Point

Recall the example of Ukkonen’s algorithm that was given at the end of the last post, and that there are `n`

phases with each phase having at most `i`

steps. At most, because if we incorporate the tricks introduced here, that in a Rule 3 step the entire phase will short circuit and exit early. Also notice that phase 1 starts with `Rule 2`

and all of the remaining steps start with `Rule 1`

. Notice further that all of the `Rule 1`

extensions will be performed, followed by zero or more `Rule 2`

extensions followed by zero or more `Rule 3`

extensions.

Also using the trick above, all of the `Rule 1`

extensions can be handled in constant time using a pointer to the global end variable. For `Rules 2 & 3`

, we need to traverse down the tree to determine if the path exists. Assume we have the tree below, and we are in some phase `i`

where we have completed all of the `Rule 1`

extensions for some string `S = ...abcdef...`

.

So now at the point where all the `Rule 1`

extensions have been performed, we need to check if `a`

exists in the tree. Sure enough we see that it does exist. And according to `Rule 3`

, the algorithm states that we do nothing. And we also know that now the phase will short circuit and the next phase will start. In the next phase we will again perform all the `Rule 1`

extensions and we will get to the point where we need to check if `ab`

exists in the tree. Again we start from root and traverse. Eventually you can see where this is going. Because a check in some phase is going to build upon the location of the previous phase, we can keep track of the `active node`

from where we can begin our checks. Let’s define an `Active Point`

which is comprised of an `Active Node`

, an `Active Edge`

and an `Active Length`

. This lets us know from which node to commence our traversal, so we don’t always have to start traversal from root. The `Active Edge`

lets us know on which edge to walk down, and the `Active Length`

lets us know how many characters along that edge were previously considered.

#### Rules governing alteration of the Active Point

Under certain conditions we alter the active point so that subsequent traversals are as efficient as possible. There are certain rules for alteration of the active point based on `Rule 2`

or `Rule 3`

extensions. However, before we introduce those rules, we need to discuss one other variable.

Since we can handle all of the `Rule 1`

extensions in one fell swoop by incrementing the global `end`

variable by one, it makes sense to keep track of the remaining “real” suffixes that need to be inserted into the tree. We will call this variable `remaining`

.

`Rule 2`

Active Point Alterations

When a `Rule 2`

extension is applied we consider two cases:

if `Active Node == root`

**AND** `Active Length > 0`

then we:

1. Decrement `Active Length`

by `1`

2. Set `Active Edge`

to `S[i - remaining + 1]`

if `Active Node != root`

1. Follow the suffix link from the `Active Node`

and that becomes the new `Active Node`

2. No change to `Active Edge`

or `Active Length`

`Rule 3`

Active Point Alterations

When a `Rule 3`

extension is applied, `Active Node`

remains the same, `Active Edge`

remains the same, and `Active Length`

is incremented by `1`

.

##### Walkdown Rules for Active Point Alteration

If at any point when walking down an edge, another internal node is encountered, then that node will become the `Active Node`

, and `Active Edge`

and `Active Length`

will change make sense with the new active node.

In that case the algorithm from the last post can be further summarized in a more optimized way as such in the general form:

- Initialize
`remaining`

to`0`

. - Initialize
`end`

to`-1`

.- Read in next character of string
`S`

- Increment the global
`end`

variable by one. - Increment
`remaining`

by one. - Enter into a loop
`remaining`

times.- if the active length is zero then set the active edge to the current character
- check if there is an edge emanating from the active node for the active edge.
- If no, then invoke
`Rule 2`

and add a leaf node and possibly an internal node if needed. Then decrement`remaining`

since a real suffix was added. If, during the furtherance of invoking`Rule 2`

an internal node is created, we set that node’s suffix link to root and keep track of that node. If this is not the first internal node, then create a suffix link from the previous internal node to the current one. We also check to make sure we alter the`Active Point`

if needed according to the`Rule 2`

alteration rules above. - If yes, then we check to see if the current
`Active Length`

is less than the current edge length. If so we can compare the character at the active point and the current character in our string. If they are equal, then we can invoke`Rule 3`

, otherwise we must create it using`Rule 2`

. If, however, we found that the current active length was greater than the edge length we need to perform a walkdown (changing active node, active edge, active length as needed) and we need to keep performing the walk down until such time as`Active Length`

<`Edge Length`

, then we compare the character at the new active point to the current character. If they are equal, then we can invoke`Rule 3`

, otherwise we must create it using`Rule 2`

. If we invoked`Rule 2`

we must remember to decrement`remaining`

since a suffix was added. Then whether we invoked`Rule 2`

or`Rule 3`

we need to perform a suffix link check and an active point alteration check and set those accordingly. In either case once`Rule 3`

is encountered we can break out of the loop and advance to the next phase.

- If no, then invoke

- Read in next character of string

This is the general idea of the `O(n)`

implementation of Ukkonen’s algorithm. Let’s do a walk through.

### Walk Through of Suffix Tree Creation for `mississippi`

To really understand the `Active Point`

and all of the implementation tricks, we will do a complete walk through of the string `mississippi$`

for all 11 phases. Also, convention dictates that the `Active Point`

is referred to as a tuple of `(Active Node, Active Edge, Active Length)`

Let’s initialize some variables.

`Active Node`

is set to `root`

.

`Active Edge`

is set to `nil`

.

`Active Length`

is set to `0`

.

In this case this means the `Active Point`

is `(root,nil,0)`

`end`

is set to `-1`

.

`remaining`

is set to `0`

.

We start with `i=0`

. We read in the char at `S[i] = S[0] = m`

:

Increment `end`

to `0`

(This takes care of all the `Rule 1`

extensions in one fell swoop, of course in this case there are none yet).

Increment `remaining`

to `1`

We enter a loop `remaining = 1`

time:

- Since
`Active Length == 0`

we set`Active Edge`

to be equal to the current character which is`m`

. So now the`Active Point`

is`(root,m,0)`

. - We check if there is an edge emanating from the
`Active Node`

for`Active Edge`

. If not, we must create it in accordance with`Rule 2`

. - Once we create the node and edge, since we performed a
`Rule 2`

extension we decrement`remaining`

to`0`

.

Now `i=1`

. We read in the char at `S[i] = S[1] = i`

:

Increment `end`

to `1`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `1`

We enter a loop `remaining = 1`

time:

- Since
`Active Length == 0`

we set`Active Edge`

to be equal to the current character which is`i`

. So now the`Active Point`

is`(root,i,0)`

. - We check if there is an edge emanating from the
`Active Node`

for`Active Edge`

. If not, we must create it in accordance with`Rule 2`

. - Once we create the node and edge, since we performed a
`Rule 2`

extension we decrement`remaining`

to`0`

.

Now `i=2`

. We read in the char at `S[i] = S[2] = s`

:

Increment `end`

to `2`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `1`

We enter a loop `remaining = 1`

time:

- Since
`Active Length == 0`

we set`Active Edge`

to be equal to the current character which is`s`

. So now the`Active Point`

is`(root,s,0)`

. - We check if there is an edge emanating from the
`Active Node`

for`Active Edge`

. If not, we must create it in accordance with`Rule 2`

. - Once we create the node and edge, since we performed a
`Rule 2`

extension we decrement`remaining`

to`0`

.

Now `i=3`

. We read in the char at `S[i] = S[3] = s`

:

Increment `end`

to `3`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `1`

We enter a loop `remaining = 1`

time:

- Since
`Active Length == 0`

we set`Active Edge`

to be equal to the current character which is`s`

. So now the`Active Point`

is`(root,s,0)`

. - We check if there is an edge emanating from the
`Active Node`

for`Active Edge`

. There is. So we compare the`Active Length`

of`0`

with`length(Active Edge)`

of`2`

. Since active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is`s`

and the current character is also`s`

so we have a`Rule 3`

extension. We can immediately break out of the loop and advance to the next phase. However before doing that we need to do two things, a suffix link check and an active point check. No internal node has been created any time during this phase so there is nothing to do on the suffix link check. For the active point check, we know that after a`Rule 3`

extension we increment the`Active Length`

by`1`

so we do that. Now the`Active Point`

is`(root,s,1)`

.

Now `i=4`

. We read in the char at `S[i] = S[4] = i`

:

Increment `end`

to `4`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `2`

We enter a loop `remaining = 2`

times:

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(s). There is. So we compare the`Active Length`

of`1`

with`length(Active Edge)`

of`3`

. Since active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is`s`

and the current character is`i`

. Since the characters are not equal, we need to perform a`Rule 2`

extension. - We break the
`ssi`

edge after the first`s`

and add an internal node there. We will label this internal node`A`

. We will then add the`si`

edge to`A`

and also create a new edge for the current character`i`

also emanating from`A`

. - Now that we performed a
`Rule 2`

extension we need to decrement`remaining`

to 1. - We perform the Suffix link check. Since an internal node was created we can direct its suffix link to point to root.
- We also need to perform the Active point alteration check. We have the
`Active Node`

set to root and the`Active Length`

of 1`> 0`

so we can follow the`Rule 2 Active Point Alteration`

rule and decrement`Active Length`

by`1`

and set`Active Edge`

to`S[i - remaining + 1]`

which is`S[4-1+1]`

or`S[4]`

or`i`

. So the`Active Point`

is now`(root, i, 0)`

.

Now we enter the 2nd iteration of the loop.

- Since
`Active Length == 0`

we set the`Active Edge`

to the current character which is`i`

. - We check if there is an edge emanating from the
`Active Node`

for the`Active Edge`

. Yes there is an`i`

edge. - So we compare lengths to see if a walkdown is necessary. The
`Active Length`

is`0`

and the`length(Active Edge)`

is`4`

. Thus no walkdown is needed. - We check if the active point
`i`

is equal to the current character`i`

. Yes it is. So we have a`Rule 3`

extension. We will be able to break out of the loop, but before doing so, some housekeeping. - We perform a suffixlink check. Since we did not create any internal node in this extension step, the internal node from the previous extension, node
`A`

continues having its suffix link pointed at root. - We perform an
`Active Point Alteration for Rule 3`

check which means that we need to increment the`Active Length`

by`1`

. Now the`Active Point`

is`(root,i,1)`

.

Now `i=5`

. We read in the char at `S[i] = S[5] = s`

:

Increment `end`

to `5`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `2`

We enter a loop `remaining = 2`

times:

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(i). There is. - So we compare the
`Active Length`

of`1`

with`length(Active Edge)`

of`5`

. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is`s`

and the current character is`s`

. Since the characters are equal, we need to perform a`Rule 3`

extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node at all in this phase so there is nothing to do there.
- For our
`Active Point Alteration`

check we increment`Active Length`

to`2`

. So now`Active Point`

is now`(root, i, 2)`

.

Now `i=6`

. We read in the char at `S[i] = S[6] = s`

:

Increment `end`

to `6`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `3`

We enter a loop `remaining = 3`

times:

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(i). There is. - So we compare the
`Active Length`

of`2`

with`length(Active Edge)`

of`6`

. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is`s`

and the current character is`s`

. Since the characters are equal, we need to perform a`Rule 3`

extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node at all in this phase so there is nothing to do there.
- For our
`Active Point Alteration`

check we increment`Active Length`

to`3`

. So now`Active Point`

is now`(root, i, 3)`

.

Now `i=7`

. We read in the char at `S[i] = S[7] = i`

:

Increment `end`

to `7`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `4`

We enter a loop `remaining = 4`

times:

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(i). There is. - So we compare the
`Active Length`

of`3`

with`length(Active Edge)`

of`7`

. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is`i`

and the current character is`i`

. Since the characters are equal, we need to perform a`Rule 3`

extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node at all in this phase so there is nothing to do there.
- For our
`Active Point Alteration`

check we increment`Active Length`

to`4`

. So now`Active Point`

is now`(root, i, 4)`

.

Now `i=8`

. We read in the char at `S[i] = S[8] = p`

:

Increment `end`

to `5`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `5`

We enter a loop `remaining = 5`

times:

First iteration of the loop. (for extension `issip`

)

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(s). There is. - We compare the
`Active Length`

of`4`

with`length(Active Edge)`

of`8`

. Since active point falls on the middle of the edge, no walkdown is needed. - We compare the active point with the current character. The active point is
`s`

and the current character is`p`

. Since the characters are not equal, we need to perform a`Rule 2`

extension. - We break the
`ississip`

edge after`issi`

and add an internal node there. We will label this internal node`B`

. We will then add the`ssip`

edge to`B`

and also create a new edge for the current character`p`

also emanating from`B`

. - Now that we performed a
`Rule 2`

extension we need to decrement`remaining`

to 4. - We perform the Suffix link check. Since an internal node
`B`

was created we can direct its suffix link to point to root. We will also keep track of this node in case we end up making more internal nodes in this phase. - We also need to perform the
`Active Point Alteration for Rule 2`

check. We have the`Active Node`

set to root and the`Active Length`

of`4 > 0`

so we can follow the`Rule 2 Active Point Alteration`

rule and decrement`Active Length`

by`1`

and set it to`3`

and set`Active Edge`

to`S[i - remaining + 1]`

which is`S[8-4+1]`

or`S[5]`

or`s`

. So the`Active Point`

is now`(root, s, 3)`

. - Because this is a complicated phase involving 5 extensions, I will break from tradition and show a tree diagram during each extension of this phase. But don’t get too used to it.

Now we enter the 2nd iteration of the loop. (for extension `ssip`

)

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

of`root`

for the`Active Edge`

of`s`

. Yes there is an`s`

edge. - So we compare lengths to see if a walkdown is necessary. The
`Active Length`

is`3`

and the`length(Active Edge)`

is`1`

. Thus we need to do a walkdown. - Since the current extension is
`ssip`

and the`Active Length`

is`3`

we start doing a walkdown. We consume the first`s`

from`ssip`

and encounter node`A`

. Remember when performing a walkdown, if another internal node is encountered, that node must become the`Active Node`

and the`Active Edge`

and`Active Length`

must change to be in accordance therewith. So we set the`Active Node`

to`A`

and reduce the`Active Length`

by`1`

to`2`

. Since we still need to consume`si`

from the current extension we must embark on the`sissip`

edge. So we can set the`Active Edge`

to be`s`

. Since the length of the`sissip`

edge of`6`

is greater than the`Active Length`

of`2`

, there will not be any need for a further walkdown. We can complete our walkdown with`Active Point`

now being set to`(A,s,2)`

. - Now that the walkdown portion of the festivities are over, we compare current active point the 2nd
`s`

on the`sissip`

edge with the current character`p`

. There are not equal, so we will need to do a`Rule 2`

extension. - We break the
`sissip`

edge after`si`

and create a new internal node`C`

which will have as an edge the remainder of the broken edge`ssip`

and also add a leaf node for`p`

. - Since we performed a
`Rule 2`

extension we decrement`remaining`

from`4`

to`3`

. - We perform our suffix link check. Since we just created node
`C`

and a previous extension in the current phase created node`B`

. We point`B`

‘s suffix link to`C`

. We also point`C`

‘s suffix link to root, and keep track of it as the last internal node created in this phase. - We also perform our
`Active Point Alteration`

check. Since the`Active Node`

is not root, but rather is node`A`

, we follow the suffix link from`A`

and make that the new active node.`A`

‘s suffix link is root. So the new`Active Point`

is now`(root,s,2)`

. - A diagram ensues:

Now we enter the 3rd iteration of the loop. (for extension `sip`

)

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

of`root`

for the`Active Edge`

of`s`

. Yes there is an`s`

edge. - So we compare lengths to see if a walkdown is necessary. The
`Active Length`

is`2`

and the`length(Active Edge)`

is`1`

. Thus we need to do a walkdown. - Since the current extension is
`sip`

and the`Active Length`

is`2`

we start doing a walkdown. We consume the first`s`

from`sip`

and encounter node`A`

. Remember when performing a walkdown, if another internal node is encountered, that node must become the`Active Node`

and the`Active Edge`

and`Active Length`

must change to be in accordance therewith. So we set the`Active Node`

to`A`

and reduce the`Active Length`

by`1`

to`1`

. Since we still need to consume`i`

from the current extension we must embark on the`issip`

edge. So we can set the`Active Edge`

to be`i`

. Since the length of the`issip`

edge of`5`

is greater than the`Active Length`

of`1`

, there will not be any need for a further walkdown. We can complete our walkdown with`Active Point`

now being set to`(A,i,1)`

. - Now that the walkdown portion of the festivities are over, we compare the current active point on the
`i`

edge which is the 1st`s`

with the current character`p`

. There are not equal, so we will need to do a`Rule 2`

extension. - We break the
`issip`

edge after`i`

and create a new internal node`D`

which will have as an edge the remainder of the broken edge`ssip`

and also add a leaf node for`p`

. - Since we performed a
`Rule 2`

extension we decrement`remaining`

from`3`

to`2`

. - We perform our suffix link check. Since we just created node
`D`

and a previous extension in the current phase created node`C`

. We point`C`

‘s suffix link to`D`

. We also point`D`

‘s suffix link to root, and keep track of it as the last internal node created in this phase. - We also perform our
`Active Point Alteration`

check. Since the`Active Node`

is not root, but rather is node`A`

, we follow the suffix link from`A`

and make that the new active node.`A`

‘s suffix link is root. So the new`Active Point`

is now`(root,i,1)`

. - A diagram ensues:

Now we enter the 4th iteration of the loop. (for extension `ip`

)

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

of`root`

for the`Active Edge`

of`i`

. Yes there is an`i`

edge. - So we compare lengths to see if a walkdown is necessary. The
`Active Length`

is`1`

and the`length(Active Edge)`

is`4`

. Thus no walkdown is necessary. Since active point falls on the middle of the edge, no walkdown is needed. - We compare the active point with the current character. The active point is
`s`

and the current character is`p`

. Since the characters are not equal, we need to perform a`Rule 2`

extension. - We break the
`issi`

edge after`i`

and create a new internal node`E`

which will have as an edge the remainder of the broken edge`issi`

, namely`ssi`

and also add a leaf node for`p`

. - Since we performed a
`Rule 2`

extension we decrement`remaining`

from`2`

to`1`

. - We perform our suffix link check. Since we just created node
`E`

and a previous extension in the current phase created node`D`

. We point`D`

‘s suffix link to`E`

. We also point`E`

‘s suffix link to root, and keep track of it as the last internal node created in this phase.- We also perform our
`Active Point Alteration`

check. Since the`Active Node`

is root, and`Active Length`

of`1 > 0`

, we decrement`Active Length`

from`1`

to`0`

and we set the`Active Edge`

to`S[i - remaining + 1]`

which is`S[8-1+1]`

or`S[8]`

or`p`

. So the new`Active Point`

is now`(root,p,0)`

.

- We also perform our
- A diagram ensues:

Now we enter the 5th iteration of the loop. (for extension `p`

)

- Since
`Active Length == 0`

we change the`Active Edge`

to the current character which is`p`

. - We check if there is an edge emanating from the
`Active Node`

of`root`

for the`Active Edge`

of`p`

. There is no such edge. So we must create it in accordance with`Rule 2`

- We add a leaf node with edge label
`p`

emanating from root. - Since we performed a
`Rule 2`

extension we decrement`remaining`

from`1`

to`0`

. - We perform our suffix link check. Since we didn’t create any new internal node in this step, the previous internal node
`E`

remains with its suffix link pointed at root.- We also perform our
`Active Point Alteration`

check. Since the`Active Node`

is root, but`Active Length`

of`0 == 0`

, so there is no change here. So the`Active Point`

is still`(root,p,0)`

.

- We also perform our
- A diagram ensues:

And we are finally done with that phase… onwards to the next phase.

Now `i=9`

. We read in the char at `S[i] = S[9] = p`

:

Increment `end`

to `9`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `1`

We enter a loop `remaining = 1`

time:

- Since
`Active Length == 0`

we change the`Active Edge`

to the current character which is`p`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(p). There is. - So we compare the
`Active Length`

of`0`

with`length(Active Edge)`

of`2`

. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is`p`

and the current character is`p`

. Since the characters are equal, we need to perform a`Rule 3`

extension. We can break out of the loop, but as usual our normal housekeeping. - For our
`Active Point Alteration`

check we increment`Active Length`

to`1`

. So now`Active Point`

is now`(root, p, 1)`

. - A diagram ensues:

Now `i=10`

. We read in the char at `S[i] = S[10] = i`

:

Increment `end`

to `10`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `2`

We enter a loop `remaining = 2`

times:

Now we enter the 1st iteration of the loop. (for extension `pi`

)

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(p). There is. - So we compare the
`Active Length`

of`1`

with`length(Active Edge)`

of`3`

. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is points to the 2nd`p`

and the current character is`i`

. Since the characters are not equal, we need to perform a`Rule 2`

extension. - We break the
`ppi`

edge after the first`p`

and create a new internal node`F`

which will have as an edge the remainder of the broken edge`ppi`

, namely`pi`

and also add a leaf node for`i`

. - Since we performed a
`Rule 2`

extension we decrement`remaining`

from`2`

to`1`

. - We perform our suffix link check. Since we just created node
`F`

, we point`F`

‘s suffix link to root, and keep track of it as the last internal node created in this phase.- We also perform our
`Active Point Alteration`

check. Since the`Active Node`

is root, and`Active Length`

of`1 > 0`

, we decrement`Active Length`

from`1`

to`0`

and we set the`Active Edge`

to`S[i - remaining + 1]`

which is`S[10-1+1]`

or`S[10]`

or`i`

. So the new`Active Point`

is now`(root,i,0)`

.

- We also perform our

Now we enter the 2nd iteration of the loop. (for extension `i`

)

- Since
`Active Length == 0`

we change the`Active Edge`

to the current character of`i`

- We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(i). There is. - So we compare the
`Active Length`

of`0`

with`length(Active Edge)`

of`1`

. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is points to the`i`

and the current character is`i`

. Since the characters are equal equal, we need to perform a`Rule 3`

extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node so the previous node
`F`

remains with suffix link pointed to root. - For our
`Active Point Alteration`

check we increment`Active Length`

to`1`

. So now`Active Point`

is now`(root, i, 1)`

.- A diagram ensues:

Now `i=11`

. We read in the char at `S[i] = S[11] = $`

:

Increment `end`

to `11`

. (This takes care of all the `Rule 1`

extensions in one fell swoop).

Increment `remaining`

to `2`

We enter a loop `remaining = 2`

times:

Now we enter the 1st iteration of the loop. (for extension `i$`

)

- Since
`Active Length > 0`

we don’t need to change the`Active Edge`

. - We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

(i). There is. - So we compare the
`Active Length`

of`1`

with`length(Active Edge)`

of`1`

. Since they are equal a walkdown is needed. We Consume`1`

character and end up at node`E`

. So the`Active Node`

now becomes`E`

and the`Active Length`

is decremented from`1`

to`0`

accordingly. We now check if the next character on the “edge” (i.e. where the active point would be) is equal to the current character of`$`

. It is not, so we need to effect a`Rule 2`

extension. Since we are at an actual node, no internal node needs to be created. We can simply create a leaf node from node`E`

with edge label`$`

. - Since we performed a
`Rule 2`

extension we decrement`remaining`

from`2`

to`1`

. - We perform our suffix link check. No internal node has been previously saved in this phase, and we did not create one now. so there is nothing to do here.
- We also perform our
`Active Point Alteration`

check. Since the`Active Node`

is not root, we follow the suffix link from the active node or`E`

and make that the new`Active Node`

.`E`

‘s suffix link points to root, so root becomes the`Active Node`

now. So the new`Active Point`

is now`(root,i,0)`

.

- We also perform our

Now we enter the 2nd iteration of the loop. (for extension `$`

)

- Since
`Active Length == 0`

we change the`Active Edge`

to the current character of`$`

- We check if there is an edge emanating from the
`Active Node`

(root) for`Active Edge`

($). There is not. - So we must create the leaf node and edge label
`$`

in accordance with`Rule 2`

from root. - Since we performed a
`Rule 2`

extension we decrement`remaining`

from`1`

to`0`

. - We perform our suffix link check. No internal node has been previously saved in this phase, and we did not create one now. so there is nothing to do here.
- We also perform our
`Active Point Alteration`

check. Since the`Active Node`

is root, and`Active Length`

of`0 == 0`

, we do nothing here. So the`Active Point`

is still`(root,$,0)`

. - A diagram ensues:

- We also perform our

## The Code

Finally, we can see how this would be implemented in Swift.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 |
class Node { enum NodeType { case root case `internal` case leaf } var indexValue:Int { if nodeType == .root { return -1 } guard let parentEdge = self.parent else { return -2 } if nodeType == .internal { return parentEdge.length + parentEdge.parent.indexValue } return suffixTree.corpus.value.count - ( parentEdge.length + parentEdge.parent.indexValue ) - 1 } weak var parent: Edge? = nil var children:[Character:Edge] = [:] var suffixLink:Node? = nil var nodeType:NodeType let uuid:UUID var suffixTree:SuffixTree! static var emittedNodes:[Node:Bool]=[:] var nodeIdentifier:String { return "node" + String(describing:self.uuid.hashValue) } init(suffixTree:SuffixTree!, nodeType:NodeType = .leaf) { self.nodeType = nodeType self.suffixTree = suffixTree self.uuid = UUID() } func addEdgeWithNewNode(character:Character,start:Int,end:IntReference) { let leafNode = Node(suffixTree: suffixTree, nodeType: .leaf) let edge = Edge(suffixTree: suffixTree, start: start, end: end, child: leafNode, parent: self) self.children[character] = edge } func addEdgeWithExistingNode(character:Character,start:Int,end:IntReference, existing node: Node) { let edge = Edge(suffixTree: suffixTree, start: start, end: end, child: node, parent: self) self.children[character] = edge } } extension Node : Hashable { var hashValue: Int { return self.uuid.hashValue } static func ==(lhs: Node, rhs: Node) -> Bool { return lhs.hashValue == rhs.hashValue } } class Edge { var suffixTree:SuffixTree var start: Int var end: IntReference var child: Node unowned var parent: Node var length:Int { return end.value - start + 1 } var diff:Int { return end.value - start } var label:String { return suffixTree.corpus.value[start...end.value] } init(suffixTree:SuffixTree, start:Int, end:IntReference, child: Node, parent: Node) { self.start = start self.end = end self.child = child self.parent = parent self.suffixTree = suffixTree self.child.parent = self } } class ActivePoint { var node:Node var suffixTree:SuffixTree! var edge:Edge? { guard let c = self.char else { return nil } guard let e = self.node.children[c] else { return nil } return e } var char:Character? { guard self.index >= 0 else { return nil } return suffixTree.corpus[index] } var index:Int = -1 var length:Int init(suffixTree: SuffixTree!, node:Node,char:Character?,length:Int) { self.node = node self.length = length } func nextCharDownActiveEdge(from char: Character) -> Character? { guard let edge = self.edge else { fatalError("Experienced a nil active Edge") } let corpus = suffixTree.corpus if edge.diff >= length { return corpus[edge.start + length] } else if edge.diff + 1 == length { if let _ = edge.child.children[char] { return char } return nil } else { node = edge.child length = length - ( edge.diff + 1 ) self.index = self.index + ( edge.diff + 1 ) return nextCharDownActiveEdge(from: char) } } func walkDownActiveEdge(from char: Character) -> Bool { guard let edge = self.edge else { fatalError("Experienced a nil active Edge") } if edge.diff < length { node = edge.child length = length - edge.diff if let selectEdge = node.children[char] { self.index = selectEdge.start } return true } else { length += 1 } return false } } class SuffixTree { let corpus: StringReference let active: ActivePoint let root: Node var remainder: Int let globalEnd: IntReference var lastInternalNode: Node? = nil init(with corpus:String) { self.corpus = StringReference(corpus) let rootNode = Node(suffixTree: nil, nodeType: .root) self.root = rootNode let activePoint = ActivePoint(suffixTree: nil, node: rootNode, char: nil, length: 0) self.active = activePoint self.remainder = 0 self.globalEnd = IntReference(-1) rootNode.suffixTree = self activePoint.suffixTree = self buildTree() } private func buildTree() { for (i,char) in self.corpus.value.enumerated() { performPhase(i: i, char: char) } } func suffixLinkCheck() { if active.node != root { if let sufLink = active.node.suffixLink { active.node = sufLink } else { active.node = root } } else { active.index += 1 active.length -= 1 } remainder -= 1 } private func performPhase(i:Int,char:Character) { var lastInternalNode: Node? = nil globalEnd.value += 1 remainder += 1 while remainder > 0 { if active.length == 0 { if let existingEdge = root.children[char] { active.index = existingEdge.start active.length += 1 break } else { root.addEdgeWithNewNode(character: char, start: i, end: globalEnd) remainder -= 1 } } else { if let nextChar = active.nextCharDownActiveEdge(from: char) { if nextChar == char { let walkedDown = active.walkDownActiveEdge(from: char) if walkedDown { lastInternalNode?.suffixLink = active.edge?.child } break } else { guard let activeEdge = active.edge else { fatalError("Experienced a nil active Edge") } let oldEnd = activeEdge.end activeEdge.end = IntReference(activeEdge.start + active.length - 1) let newInternalNode = Node(suffixTree: self, nodeType: .internal) newInternalNode.addEdgeWithExistingNode(character: corpus[activeEdge.start + active.length], start: activeEdge.start + active.length, end: oldEnd, existing: activeEdge.child) newInternalNode.addEdgeWithNewNode(character: corpus[i], start: i, end: globalEnd) activeEdge.child = newInternalNode newInternalNode.parent = activeEdge newInternalNode.suffixLink = root lastInternalNode?.suffixLink = newInternalNode lastInternalNode = newInternalNode suffixLinkCheck() } } else { if let activeEdge = active.edge { activeEdge.child.addEdgeWithNewNode(character: char, start: i, end: globalEnd) suffixLinkCheck() } } } } } } |