This article is a continuation of Suffix Trees and Ukkonen’s Algorithm in Swift Part 1.
I mentioned at the end of part 1, that we would apply a series of optimizations and tricks to get to an O(n)
algorithm.
Optimizations
So in this article we will talk about various optimizations. As is the practice in an introduction, let me tell you what I am going to tell you, then let me tell you, then let me tell you what I told you. So without further ado, we will talk about suffix links, various tips and speed implementation tricks, active points, we will do a walkthrough of a comprehensive example, and finally we will look at some implementation details in Swift. Then of course I will tell you again what I just told you.
Suffix Links
What is a suffix link? If you look up the definition you will see the following:
For an internal node A
with path label xα
where x
denotes a single character and α
denotes a possibly empty substring, if there is another node B
with path label α
, then a pointer from A
to B
is called a suffix link.
Wow, isn’t that a mouthful? So what does this mean? What this means is that if you have a node A
which has path-label ana
and another node B
that has path label na
then a pointer exists from A
to B
and we call that pointer a suffix link.
In the tree below, there is a suffix link from the node labeled A
to the node labeled B
since A
is an internal node and A
has path label ana
and B
has path label na
. See the turquoise arrow below:

But remember that pesky part of the definition possibly empty ? So what that means is that if you have an internal node A
with a single character path label, such as the label a
in the tree below, then it points to a node whose path is empty? What node has an empty path? ROOT. See the turquoise arrow below:

Use of Suffix Links to gain speed
Now that we know exactly what a suffix link is, we need to understand how they help us gain speed.
Let’s consider that we are in phase i+1
step j
. The algorithm dictates that we need to find the end of the path S[j..<(i+1)]
in the tree. If I am trying to locate node B
then, let’s say I had kept a record of the end of path S[j-1..<(i+1)]
namely A
, in that case all I would need to do is walk up one node, follow the suffix link and then walk down… This is going to give us a better run time than traversing from root.
What this means is that if we create an internal node B
in some extension j
of some phase i
and that is not the first internal node created in that extension (i.e. we have previously created an internal node A
in a previous extension j-k
of the same phase i
where 0 < k <= j
), then we can create a suffix link from A
to B
. In code what we will typically do is as soon as we create an internal node we will set its suffix link to point to root. Then later if we end up creating another internal node in another extension of the same phase, we will change the suffix link to point to that new internal node rather than root.

This is the basic idea. I realize there is a little bit of hand waving here, but this will become alot clearer shortly.
Skip / Count Trick
Another optimization we can do is the following. After we traverse a suffix link from C
to D
in the above tree we know that if there exists a path with label def
from C
to leaf, then there must also exist a path with label def
from D
to leaf since D
is the suffix link of C
. Using this knowledge we can quickly skip over edges if the length of the edge label is less than the number of characters we are trying to match. Now it would be great if we could obtain the length of an edge in constant time. Well, yes we can.
Edge Label Compression
You may have noticed that I have been including some strange numbers underneath the edge labels. For example under na
above I have [2,3]
and under a
I have [1,1]
. Rather than copy substrings all over the place we have a global reference to a String
and all edge labels refer to the substrings which they define by using indices. So [2,3]
means string[2...3]
in Swift. As an aside, you can’t actually do that in Swift. You cannot create a substring using integer subscripts. In fact, I wrote an article just on that topic. So if you would like to see how you can make strings integer subscriptable in Swift, I encourage you to read my article Making Strings Integer-Subscriptable.
You may be wondering why there are several labels listed as [0,#]
for example with a #
instead of a number. Patience you must have my young padawan.
Rule 3 is a show stopper
If we are in phase i+1
step j
, that means we are trying to locate the end of path S[j..<(i+1)]
in the tree. Now if a Rule 3
extension applies it means that the path p = S[j..<(i+1)]
continues with S[i+1]
. It stands to reason that if path p = S[j..<(i+1)]
continues with S[i+1]
, then so too will q = S[j+1..<(i+1)]
continue with S[i+1]
and so too will path r = S[j+2..<(i+1)]
continue with S[i+1]
, and so on. Therefore, during a given phase of i
, if a Rule 3
extension is encountered then we can short circuit and get out of the loop. This seems like a good place to use a break
as we will see later. This trick will also afford us some speed increases.
Once a leaf, always a leaf
If you look at the definition of the algorithm I gave in Part 1, you will see that Rule 1
and Rule 3
never create any nodes. Rule 2
will either add a new node to root (thereby adding a leaf node), will add a new leaf node to some other node, or will insert a new internal node and add a leaf node. At no point does a leaf node ever become an internal node. We can exploit this to quickly handle all of the Rule 1
extensions for a given phase i
in constant time. Recall that we are now storing indices into the string S
in lieu of edge labels. For edges which terminate at a leaf node, rather than storing a scalar integer as the end index, we can store a pointer to the global end. This is what I referred to as # in the diagrams (a global end variable). Then for any phase i
we need only increment the global end by 1
and we have instantly performed all the Rule 1
extensions for phase i
. Now in c
or c++
this is relatively straightforward, you just make sure you pass the address of the global end &end
and you are good to go. But this requires a little extra work in Swift. We could always just declare the global end
an inout
type and pass it around by pointer, but this definitely doesn’t feel swifty. Since Int
is a struct in swift, a value-type allocated on the stack, we can construct a class
based wrapper around Int
. Since the swift class
is a reference type allocated on the heap, we are always referring to it by pointer anyway. In Swift, this looks something like this.
1 2 3 4 5 6 7 |
class ReferenceInt { let value:Int init(_ value: Int) { self.value = value } } |
These tricks plus the definitions from Part 1 comprise Ukkonen’s algorithm and will allow you to construct a suffix tree in linear time.
Now I will talk about implementation.
Implementation
At the end of the day, what I would like to have is a SuffixTree
data type that would allow me to construct a tree from a string or series of strings and then perform queries on it so it makes sense that SuffixTree
will be the exposed datatype. Internally, I am dealing with edges and nodes. Some implementations only model nodes. I prefer to model both edges and nodes.
Active Point
Recall the example of Ukkonen’s algorithm that was given at the end of the last post, and that there are n
phases with each phase having at most i
steps. At most, because if we incorporate the tricks introduced here, that in a Rule 3 step the entire phase will short circuit and exit early. Also notice that phase 1 starts with Rule 2
and all of the remaining steps start with Rule 1
. Notice further that all of the Rule 1
extensions will be performed, followed by zero or more Rule 2
extensions followed by zero or more Rule 3
extensions.
Also using the trick above, all of the Rule 1
extensions can be handled in constant time using a pointer to the global end variable. For Rules 2 & 3
, we need to traverse down the tree to determine if the path exists. Assume we have the tree below, and we are in some phase i
where we have completed all of the Rule 1
extensions for some string S = ...abcdef...
.

So now at the point where all the Rule 1
extensions have been performed, we need to check if a
exists in the tree. Sure enough we see that it does exist. And according to Rule 3
, the algorithm states that we do nothing. And we also know that now the phase will short circuit and the next phase will start. In the next phase we will again perform all the Rule 1
extensions and we will get to the point where we need to check if ab
exists in the tree. Again we start from root and traverse. Eventually you can see where this is going. Because a check in some phase is going to build upon the location of the previous phase, we can keep track of the active node
from where we can begin our checks. Let’s define an Active Point
which is comprised of an Active Node
, an Active Edge
and an Active Length
. This lets us know from which node to commence our traversal, so we don’t always have to start traversal from root. The Active Edge
lets us know on which edge to walk down, and the Active Length
lets us know how many characters along that edge were previously considered.
Rules governing alteration of the Active Point
Under certain conditions we alter the active point so that subsequent traversals are as efficient as possible. There are certain rules for alteration of the active point based on Rule 2
or Rule 3
extensions. However, before we introduce those rules, we need to discuss one other variable.
Since we can handle all of the Rule 1
extensions in one fell swoop by incrementing the global end
variable by one, it makes sense to keep track of the remaining “real” suffixes that need to be inserted into the tree. We will call this variable remaining
.
Rule 2
Active Point Alterations
When a Rule 2
extension is applied we consider two cases:
if Active Node == root
AND Active Length > 0
then we:
1. Decrement Active Length
by 1
2. Set Active Edge
to S[i - remaining + 1]
if Active Node != root
1. Follow the suffix link from the Active Node
and that becomes the new Active Node
2. No change to Active Edge
or Active Length
Rule 3
Active Point Alterations
When a Rule 3
extension is applied, Active Node
remains the same, Active Edge
remains the same, and Active Length
is incremented by 1
.
Walkdown Rules for Active Point Alteration
If at any point when walking down an edge, another internal node is encountered, then that node will become the Active Node
, and Active Edge
and Active Length
will change make sense with the new active node.
In that case the algorithm from the last post can be further summarized in a more optimized way as such in the general form:
- Initialize
remaining
to0
. - Initialize
end
to-1
.- Read in next character of string
S
- Increment the global
end
variable by one. - Increment
remaining
by one. - Enter into a loop
remaining
times.- if the active length is zero then set the active edge to the current character
- check if there is an edge emanating from the active node for the active edge.
- If no, then invoke
Rule 2
and add a leaf node and possibly an internal node if needed. Then decrementremaining
since a real suffix was added. If, during the furtherance of invokingRule 2
an internal node is created, we set that node’s suffix link to root and keep track of that node. If this is not the first internal node, then create a suffix link from the previous internal node to the current one. We also check to make sure we alter theActive Point
if needed according to theRule 2
alteration rules above. - If yes, then we check to see if the current
Active Length
is less than the current edge length. If so we can compare the character at the active point and the current character in our string. If they are equal, then we can invokeRule 3
, otherwise we must create it usingRule 2
. If, however, we found that the current active length was greater than the edge length we need to perform a walkdown (changing active node, active edge, active length as needed) and we need to keep performing the walk down until such time asActive Length
<Edge Length
, then we compare the character at the new active point to the current character. If they are equal, then we can invokeRule 3
, otherwise we must create it usingRule 2
. If we invokedRule 2
we must remember to decrementremaining
since a suffix was added. Then whether we invokedRule 2
orRule 3
we need to perform a suffix link check and an active point alteration check and set those accordingly. In either case onceRule 3
is encountered we can break out of the loop and advance to the next phase.
- If no, then invoke
- Read in next character of string
This is the general idea of the O(n)
implementation of Ukkonen’s algorithm. Let’s do a walk through.
Walk Through of Suffix Tree Creation for mississippi
To really understand the Active Point
and all of the implementation tricks, we will do a complete walk through of the string mississippi$
for all 11 phases. Also, convention dictates that the Active Point
is referred to as a tuple of (Active Node, Active Edge, Active Length)
Let’s initialize some variables.
Active Node
is set to root
.
Active Edge
is set to nil
.
Active Length
is set to 0
.
In this case this means the Active Point
is (root,nil,0)
end
is set to -1
.
remaining
is set to 0
.
We start with i=0
. We read in the char at S[i] = S[0] = m
:
Increment end
to 0
(This takes care of all the Rule 1
extensions in one fell swoop, of course in this case there are none yet).
Increment remaining
to 1
We enter a loop remaining = 1
time:
- Since
Active Length == 0
we setActive Edge
to be equal to the current character which ism
. So now theActive Point
is(root,m,0)
. - We check if there is an edge emanating from the
Active Node
forActive Edge
. If not, we must create it in accordance withRule 2
. - Once we create the node and edge, since we performed a
Rule 2
extension we decrementremaining
to0
.

Now i=1
. We read in the char at S[i] = S[1] = i
:
Increment end
to 1
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 1
We enter a loop remaining = 1
time:
- Since
Active Length == 0
we setActive Edge
to be equal to the current character which isi
. So now theActive Point
is(root,i,0)
. - We check if there is an edge emanating from the
Active Node
forActive Edge
. If not, we must create it in accordance withRule 2
. - Once we create the node and edge, since we performed a
Rule 2
extension we decrementremaining
to0
.

Now i=2
. We read in the char at S[i] = S[2] = s
:
Increment end
to 2
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 1
We enter a loop remaining = 1
time:
- Since
Active Length == 0
we setActive Edge
to be equal to the current character which iss
. So now theActive Point
is(root,s,0)
. - We check if there is an edge emanating from the
Active Node
forActive Edge
. If not, we must create it in accordance withRule 2
. - Once we create the node and edge, since we performed a
Rule 2
extension we decrementremaining
to0
.

Now i=3
. We read in the char at S[i] = S[3] = s
:
Increment end
to 3
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 1
We enter a loop remaining = 1
time:
- Since
Active Length == 0
we setActive Edge
to be equal to the current character which iss
. So now theActive Point
is(root,s,0)
. - We check if there is an edge emanating from the
Active Node
forActive Edge
. There is. So we compare theActive Length
of0
withlength(Active Edge)
of2
. Since active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point iss
and the current character is alsos
so we have aRule 3
extension. We can immediately break out of the loop and advance to the next phase. However before doing that we need to do two things, a suffix link check and an active point check. No internal node has been created any time during this phase so there is nothing to do on the suffix link check. For the active point check, we know that after aRule 3
extension we increment theActive Length
by1
so we do that. Now theActive Point
is(root,s,1)
.

Now i=4
. We read in the char at S[i] = S[4] = i
:
Increment end
to 4
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 2
We enter a loop remaining = 2
times:
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(s). There is. So we compare theActive Length
of1
withlength(Active Edge)
of3
. Since active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point iss
and the current character isi
. Since the characters are not equal, we need to perform aRule 2
extension. - We break the
ssi
edge after the firsts
and add an internal node there. We will label this internal nodeA
. We will then add thesi
edge toA
and also create a new edge for the current characteri
also emanating fromA
. - Now that we performed a
Rule 2
extension we need to decrementremaining
to 1. - We perform the Suffix link check. Since an internal node was created we can direct its suffix link to point to root.
- We also need to perform the Active point alteration check. We have the
Active Node
set to root and theActive Length
of 1> 0
so we can follow theRule 2 Active Point Alteration
rule and decrementActive Length
by1
and setActive Edge
toS[i - remaining + 1]
which isS[4-1+1]
orS[4]
ori
. So theActive Point
is now(root, i, 0)
.
Now we enter the 2nd iteration of the loop.
- Since
Active Length == 0
we set theActive Edge
to the current character which isi
. - We check if there is an edge emanating from the
Active Node
for theActive Edge
. Yes there is ani
edge. - So we compare lengths to see if a walkdown is necessary. The
Active Length
is0
and thelength(Active Edge)
is4
. Thus no walkdown is needed. - We check if the active point
i
is equal to the current characteri
. Yes it is. So we have aRule 3
extension. We will be able to break out of the loop, but before doing so, some housekeeping. - We perform a suffixlink check. Since we did not create any internal node in this extension step, the internal node from the previous extension, node
A
continues having its suffix link pointed at root. - We perform an
Active Point Alteration for Rule 3
check which means that we need to increment theActive Length
by1
. Now theActive Point
is(root,i,1)
.

Now i=5
. We read in the char at S[i] = S[5] = s
:
Increment end
to 5
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 2
We enter a loop remaining = 2
times:
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(i). There is. - So we compare the
Active Length
of1
withlength(Active Edge)
of5
. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point iss
and the current character iss
. Since the characters are equal, we need to perform aRule 3
extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node at all in this phase so there is nothing to do there.
- For our
Active Point Alteration
check we incrementActive Length
to2
. So nowActive Point
is now(root, i, 2)
.

Now i=6
. We read in the char at S[i] = S[6] = s
:
Increment end
to 6
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 3
We enter a loop remaining = 3
times:
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(i). There is. - So we compare the
Active Length
of2
withlength(Active Edge)
of6
. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point iss
and the current character iss
. Since the characters are equal, we need to perform aRule 3
extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node at all in this phase so there is nothing to do there.
- For our
Active Point Alteration
check we incrementActive Length
to3
. So nowActive Point
is now(root, i, 3)
.

Now i=7
. We read in the char at S[i] = S[7] = i
:
Increment end
to 7
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 4
We enter a loop remaining = 4
times:
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(i). There is. - So we compare the
Active Length
of3
withlength(Active Edge)
of7
. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point isi
and the current character isi
. Since the characters are equal, we need to perform aRule 3
extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node at all in this phase so there is nothing to do there.
- For our
Active Point Alteration
check we incrementActive Length
to4
. So nowActive Point
is now(root, i, 4)
.

Now i=8
. We read in the char at S[i] = S[8] = p
:
Increment end
to 5
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 5
We enter a loop remaining = 5
times:
First iteration of the loop. (for extension issip
)
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(s). There is. - We compare the
Active Length
of4
withlength(Active Edge)
of8
. Since active point falls on the middle of the edge, no walkdown is needed. - We compare the active point with the current character. The active point is
s
and the current character isp
. Since the characters are not equal, we need to perform aRule 2
extension. - We break the
ississip
edge afterissi
and add an internal node there. We will label this internal nodeB
. We will then add thessip
edge toB
and also create a new edge for the current characterp
also emanating fromB
. - Now that we performed a
Rule 2
extension we need to decrementremaining
to 4. - We perform the Suffix link check. Since an internal node
B
was created we can direct its suffix link to point to root. We will also keep track of this node in case we end up making more internal nodes in this phase. - We also need to perform the
Active Point Alteration for Rule 2
check. We have theActive Node
set to root and theActive Length
of4 > 0
so we can follow theRule 2 Active Point Alteration
rule and decrementActive Length
by1
and set it to3
and setActive Edge
toS[i - remaining + 1]
which isS[8-4+1]
orS[5]
ors
. So theActive Point
is now(root, s, 3)
. - Because this is a complicated phase involving 5 extensions, I will break from tradition and show a tree diagram during each extension of this phase. But don’t get too used to it.

Now we enter the 2nd iteration of the loop. (for extension ssip
)
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
ofroot
for theActive Edge
ofs
. Yes there is ans
edge. - So we compare lengths to see if a walkdown is necessary. The
Active Length
is3
and thelength(Active Edge)
is1
. Thus we need to do a walkdown. - Since the current extension is
ssip
and theActive Length
is3
we start doing a walkdown. We consume the firsts
fromssip
and encounter nodeA
. Remember when performing a walkdown, if another internal node is encountered, that node must become theActive Node
and theActive Edge
andActive Length
must change to be in accordance therewith. So we set theActive Node
toA
and reduce theActive Length
by1
to2
. Since we still need to consumesi
from the current extension we must embark on thesissip
edge. So we can set theActive Edge
to bes
. Since the length of thesissip
edge of6
is greater than theActive Length
of2
, there will not be any need for a further walkdown. We can complete our walkdown withActive Point
now being set to(A,s,2)
. - Now that the walkdown portion of the festivities are over, we compare current active point the 2nd
s
on thesissip
edge with the current characterp
. There are not equal, so we will need to do aRule 2
extension. - We break the
sissip
edge aftersi
and create a new internal nodeC
which will have as an edge the remainder of the broken edgessip
and also add a leaf node forp
. - Since we performed a
Rule 2
extension we decrementremaining
from4
to3
. - We perform our suffix link check. Since we just created node
C
and a previous extension in the current phase created nodeB
. We pointB
‘s suffix link toC
. We also pointC
‘s suffix link to root, and keep track of it as the last internal node created in this phase. - We also perform our
Active Point Alteration
check. Since theActive Node
is not root, but rather is nodeA
, we follow the suffix link fromA
and make that the new active node.A
‘s suffix link is root. So the newActive Point
is now(root,s,2)
. - A diagram ensues:

Now we enter the 3rd iteration of the loop. (for extension sip
)
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
ofroot
for theActive Edge
ofs
. Yes there is ans
edge. - So we compare lengths to see if a walkdown is necessary. The
Active Length
is2
and thelength(Active Edge)
is1
. Thus we need to do a walkdown. - Since the current extension is
sip
and theActive Length
is2
we start doing a walkdown. We consume the firsts
fromsip
and encounter nodeA
. Remember when performing a walkdown, if another internal node is encountered, that node must become theActive Node
and theActive Edge
andActive Length
must change to be in accordance therewith. So we set theActive Node
toA
and reduce theActive Length
by1
to1
. Since we still need to consumei
from the current extension we must embark on theissip
edge. So we can set theActive Edge
to bei
. Since the length of theissip
edge of5
is greater than theActive Length
of1
, there will not be any need for a further walkdown. We can complete our walkdown withActive Point
now being set to(A,i,1)
. - Now that the walkdown portion of the festivities are over, we compare the current active point on the
i
edge which is the 1sts
with the current characterp
. There are not equal, so we will need to do aRule 2
extension. - We break the
issip
edge afteri
and create a new internal nodeD
which will have as an edge the remainder of the broken edgessip
and also add a leaf node forp
. - Since we performed a
Rule 2
extension we decrementremaining
from3
to2
. - We perform our suffix link check. Since we just created node
D
and a previous extension in the current phase created nodeC
. We pointC
‘s suffix link toD
. We also pointD
‘s suffix link to root, and keep track of it as the last internal node created in this phase. - We also perform our
Active Point Alteration
check. Since theActive Node
is not root, but rather is nodeA
, we follow the suffix link fromA
and make that the new active node.A
‘s suffix link is root. So the newActive Point
is now(root,i,1)
. - A diagram ensues:

Now we enter the 4th iteration of the loop. (for extension ip
)
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
ofroot
for theActive Edge
ofi
. Yes there is ani
edge. - So we compare lengths to see if a walkdown is necessary. The
Active Length
is1
and thelength(Active Edge)
is4
. Thus no walkdown is necessary. Since active point falls on the middle of the edge, no walkdown is needed. - We compare the active point with the current character. The active point is
s
and the current character isp
. Since the characters are not equal, we need to perform aRule 2
extension. - We break the
issi
edge afteri
and create a new internal nodeE
which will have as an edge the remainder of the broken edgeissi
, namelyssi
and also add a leaf node forp
. - Since we performed a
Rule 2
extension we decrementremaining
from2
to1
. - We perform our suffix link check. Since we just created node
E
and a previous extension in the current phase created nodeD
. We pointD
‘s suffix link toE
. We also pointE
‘s suffix link to root, and keep track of it as the last internal node created in this phase.- We also perform our
Active Point Alteration
check. Since theActive Node
is root, andActive Length
of1 > 0
, we decrementActive Length
from1
to0
and we set theActive Edge
toS[i - remaining + 1]
which isS[8-1+1]
orS[8]
orp
. So the newActive Point
is now(root,p,0)
.
- We also perform our
- A diagram ensues:

Now we enter the 5th iteration of the loop. (for extension p
)
- Since
Active Length == 0
we change theActive Edge
to the current character which isp
. - We check if there is an edge emanating from the
Active Node
ofroot
for theActive Edge
ofp
. There is no such edge. So we must create it in accordance withRule 2
- We add a leaf node with edge label
p
emanating from root. - Since we performed a
Rule 2
extension we decrementremaining
from1
to0
. - We perform our suffix link check. Since we didn’t create any new internal node in this step, the previous internal node
E
remains with its suffix link pointed at root.- We also perform our
Active Point Alteration
check. Since theActive Node
is root, butActive Length
of0 == 0
, so there is no change here. So theActive Point
is still(root,p,0)
.
- We also perform our
- A diagram ensues:

And we are finally done with that phase… onwards to the next phase.
Now i=9
. We read in the char at S[i] = S[9] = p
:
Increment end
to 9
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 1
We enter a loop remaining = 1
time:
- Since
Active Length == 0
we change theActive Edge
to the current character which isp
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(p). There is. - So we compare the
Active Length
of0
withlength(Active Edge)
of2
. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point isp
and the current character isp
. Since the characters are equal, we need to perform aRule 3
extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node at all in this phase so there is nothing to do there.
- For our
Active Point Alteration
check we incrementActive Length
to1
. So nowActive Point
is now(root, p, 1)
. - A diagram ensues:

Now i=10
. We read in the char at S[i] = S[10] = i
:
Increment end
to 10
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 2
We enter a loop remaining = 2
times:
Now we enter the 1st iteration of the loop. (for extension pi
)
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(p). There is. - So we compare the
Active Length
of1
withlength(Active Edge)
of3
. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is points to the 2ndp
and the current character isi
. Since the characters are not equal, we need to perform aRule 2
extension. - We break the
ppi
edge after the firstp
and create a new internal nodeF
which will have as an edge the remainder of the broken edgeppi
, namelypi
and also add a leaf node fori
. - Since we performed a
Rule 2
extension we decrementremaining
from2
to1
. - We perform our suffix link check. Since we just created node
F
, we pointF
‘s suffix link to root, and keep track of it as the last internal node created in this phase.- We also perform our
Active Point Alteration
check. Since theActive Node
is root, andActive Length
of1 > 0
, we decrementActive Length
from1
to0
and we set theActive Edge
toS[i - remaining + 1]
which isS[10-1+1]
orS[10]
ori
. So the newActive Point
is now(root,i,0)
.
- We also perform our
Now we enter the 2nd iteration of the loop. (for extension i
)
- Since
Active Length == 0
we change theActive Edge
to the current character ofi
- We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(i). There is. - So we compare the
Active Length
of0
withlength(Active Edge)
of1
. Since the active point falls on the middle of the edge, no walkdown is needed and this we compare the active point with the current character. The active point is points to thei
and the current character isi
. Since the characters are equal equal, we need to perform aRule 3
extension. We can break out of the loop, but as usual our normal housekeeping. - For our suffixlink check, we have not created any internal node so the previous node
F
remains with suffix link pointed to root. - For our
Active Point Alteration
check we incrementActive Length
to1
. So nowActive Point
is now(root, i, 1)
.- A diagram ensues:

Now i=11
. We read in the char at S[i] = S[11] = $
:
Increment end
to 11
. (This takes care of all the Rule 1
extensions in one fell swoop).
Increment remaining
to 2
We enter a loop remaining = 2
times:
Now we enter the 1st iteration of the loop. (for extension i$
)
- Since
Active Length > 0
we don’t need to change theActive Edge
. - We check if there is an edge emanating from the
Active Node
(root) forActive Edge
(i). There is. - So we compare the
Active Length
of1
withlength(Active Edge)
of1
. Since they are equal a walkdown is needed. We Consume1
character and end up at nodeE
. So theActive Node
now becomesE
and theActive Length
is decremented from1
to0
accordingly. We now check if the next character on the “edge” (i.e. where the active point would be) is equal to the current character of$
. It is not, so we need to effect aRule 2
extension. Since we are at an actual node, no internal node needs to be created. We can simply create a leaf node from nodeE
with edge label$
. - Since we performed a
Rule 2
extension we decrementremaining
from2
to1
. - We perform our suffix link check. No internal node has been previously saved in this phase, and we did not create one now. so there is nothing to do here.
- We also perform our
Active Point Alteration
check. Since theActive Node
is not root, we follow the suffix link from the active node orE
and make that the newActive Node
.E
‘s suffix link points to root, so root becomes theActive Node
now. So the newActive Point
is now(root,i,0)
.
- We also perform our
Now we enter the 2nd iteration of the loop. (for extension $
)
- Since
Active Length == 0
we change theActive Edge
to the current character of$
- We check if there is an edge emanating from the
Active Node
(root) forActive Edge
($). There is not. - So we must create the leaf node and edge label
$
in accordance withRule 2
from root. - Since we performed a
Rule 2
extension we decrementremaining
from1
to0
. - We perform our suffix link check. No internal node has been previously saved in this phase, and we did not create one now. so there is nothing to do here.
- We also perform our
Active Point Alteration
check. Since theActive Node
is root, andActive Length
of0 == 0
, we do nothing here. So theActive Point
is still(root,$,0)
. - A diagram ensues:
- We also perform our

The Code
Finally, we can see how this would be implemented in Swift.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 |
class Node { enum NodeType { case root case `internal` case leaf } var indexValue:Int { if nodeType == .root { return -1 } guard let parentEdge = self.parent else { return -2 } if nodeType == .internal { return parentEdge.length + parentEdge.parent.indexValue } return suffixTree.corpus.value.count - ( parentEdge.length + parentEdge.parent.indexValue ) - 1 } weak var parent: Edge? = nil var children:[Character:Edge] = [:] var suffixLink:Node? = nil var nodeType:NodeType let uuid:UUID var suffixTree:SuffixTree! static var emittedNodes:[Node:Bool]=[:] var nodeIdentifier:String { return "node" + String(describing:self.uuid.hashValue) } init(suffixTree:SuffixTree!, nodeType:NodeType = .leaf) { self.nodeType = nodeType self.suffixTree = suffixTree self.uuid = UUID() } func addEdgeWithNewNode(character:Character,start:Int,end:IntReference) { let leafNode = Node(suffixTree: suffixTree, nodeType: .leaf) let edge = Edge(suffixTree: suffixTree, start: start, end: end, child: leafNode, parent: self) self.children[character] = edge } func addEdgeWithExistingNode(character:Character,start:Int,end:IntReference, existing node: Node) { let edge = Edge(suffixTree: suffixTree, start: start, end: end, child: node, parent: self) self.children[character] = edge } } extension Node : Hashable { var hashValue: Int { return self.uuid.hashValue } static func ==(lhs: Node, rhs: Node) -> Bool { return lhs.hashValue == rhs.hashValue } } class Edge { var suffixTree:SuffixTree var start: Int var end: IntReference var child: Node unowned var parent: Node var length:Int { return end.value - start + 1 } var diff:Int { return end.value - start } var label:String { return suffixTree.corpus.value[start...end.value] } init(suffixTree:SuffixTree, start:Int, end:IntReference, child: Node, parent: Node) { self.start = start self.end = end self.child = child self.parent = parent self.suffixTree = suffixTree self.child.parent = self } } class ActivePoint { var node:Node var suffixTree:SuffixTree! var edge:Edge? { guard let c = self.char else { return nil } guard let e = self.node.children[c] else { return nil } return e } var char:Character? { guard self.index >= 0 else { return nil } return suffixTree.corpus[index] } var index:Int = -1 var length:Int init(suffixTree: SuffixTree!, node:Node,char:Character?,length:Int) { self.node = node self.length = length } func nextCharDownActiveEdge(from char: Character) -> Character? { guard let edge = self.edge else { fatalError("Experienced a nil active Edge") } let corpus = suffixTree.corpus if edge.diff >= length { return corpus[edge.start + length] } else if edge.diff + 1 == length { if let _ = edge.child.children[char] { return char } return nil } else { node = edge.child length = length - ( edge.diff + 1 ) self.index = self.index + ( edge.diff + 1 ) return nextCharDownActiveEdge(from: char) } } func walkDownActiveEdge(from char: Character) -> Bool { guard let edge = self.edge else { fatalError("Experienced a nil active Edge") } if edge.diff < length { node = edge.child length = length - edge.diff if let selectEdge = node.children[char] { self.index = selectEdge.start } return true } else { length += 1 } return false } } class SuffixTree { let corpus: StringReference let active: ActivePoint let root: Node var remainder: Int let globalEnd: IntReference var lastInternalNode: Node? = nil init(with corpus:String) { self.corpus = StringReference(corpus) let rootNode = Node(suffixTree: nil, nodeType: .root) self.root = rootNode let activePoint = ActivePoint(suffixTree: nil, node: rootNode, char: nil, length: 0) self.active = activePoint self.remainder = 0 self.globalEnd = IntReference(-1) rootNode.suffixTree = self activePoint.suffixTree = self buildTree() } private func buildTree() { for (i,char) in self.corpus.value.enumerated() { performPhase(i: i, char: char) } } func suffixLinkCheck() { if active.node != root { if let sufLink = active.node.suffixLink { active.node = sufLink } else { active.node = root } } else { active.index += 1 active.length -= 1 } remainder -= 1 } private func performPhase(i:Int,char:Character) { var lastInternalNode: Node? = nil globalEnd.value += 1 remainder += 1 while remainder > 0 { if active.length == 0 { if let existingEdge = root.children[char] { active.index = existingEdge.start active.length += 1 break } else { root.addEdgeWithNewNode(character: char, start: i, end: globalEnd) remainder -= 1 } } else { if let nextChar = active.nextCharDownActiveEdge(from: char) { if nextChar == char { let walkedDown = active.walkDownActiveEdge(from: char) if walkedDown { lastInternalNode?.suffixLink = active.edge?.child } break } else { guard let activeEdge = active.edge else { fatalError("Experienced a nil active Edge") } let oldEnd = activeEdge.end activeEdge.end = IntReference(activeEdge.start + active.length - 1) let newInternalNode = Node(suffixTree: self, nodeType: .internal) newInternalNode.addEdgeWithExistingNode(character: corpus[activeEdge.start + active.length], start: activeEdge.start + active.length, end: oldEnd, existing: activeEdge.child) newInternalNode.addEdgeWithNewNode(character: corpus[i], start: i, end: globalEnd) activeEdge.child = newInternalNode newInternalNode.parent = activeEdge newInternalNode.suffixLink = root lastInternalNode?.suffixLink = newInternalNode lastInternalNode = newInternalNode suffixLinkCheck() } } else { if let activeEdge = active.edge { activeEdge.child.addEdgeWithNewNode(character: char, start: i, end: globalEnd) suffixLinkCheck() } } } } } } |