Suffix tree
Encyclopedia
|
| Tutorials | Encyclopedia | Dictionary | Directory |
|
Suffix tree
Suffix tree for the string BANANA padded with $. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the boxes give the start position of the corresponding suffix. Suffix links drawn dashed.The suffix tree for a string S is a tree whose edges are labeled with strings, and such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree (more specifically, a Patricia trie) for the suffixes of S. Constructing such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself.
HistoryThe concept was first introduced as a position tree by Weiner in 1973[1] in a paper which Donald Knuth subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight in 1976 [2] , and also by Ukkonen in 1995[3][4]. Ukkonen provided the first linear-time online-construction of suffix trees, now known as Ukkonen's algorithm. DefinitionThe suffix tree for the string S of length n is defined as a tree such that ([5] page 90):
Since such a tree does not exist for all strings, S is padded with a terminal symbol not seen in the string (usually denoted Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root nodes have a suffix link to another internal node. If the path from the root to a node spells the string \chi\alpha, where \chi is a single character and \alpha is a string (possibly empty), it has a suffix link to the internal node representing \alpha. See for example the suffix link from the node for FunctionalityA suffix tree for a string S of length n can be built in \Theta(n) time, if the alphabet is constant or integer [6]. Otherwise, the construction time depends on the implementation. The costs below are given under the assumption that the alphabet is constant. If it is not, the cost depends on the implementation (see below). Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built for the set of strings D=\{S_1,S_2,...,S_K\} of total length n=|n_1|+|n_2|+...+|n_K|. You can:
).
The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in \Theta(n) time ([5] chapter 8). You can then also:
UsesSuffix trees are often used in bioinformatics applications, where they are used for searching for patterns in DNA or protein sequences, which can be viewed as long strings of characters. The ability to search efficiently with mismatches might be the suffix tree's greatest strength. It is also used in data compression, where on the one hand it is used to find repeated data and on the other hand it can be used for the sorting stage of the Burrows-Wheeler transform. Variants of the LZW compression schemes use it (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some search engines (first introduced in [8]). ImplementationIf each node and edge can be represented in \Theta(1) space, the entire tree can be represented in \Theta(n) space. The total length of the edges in the tree is O(n^2), but each edge can be stored as the position and length of a substring of S, giving a total space usage of \Theta(n) computer words. The worst-case space usage of a suffix tree is seen with a fibonacci string, giving the full 2n nodes. An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The most common is using linked lists called sibling lists. Each node has pointer to its first child, and to the next node in the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may also be used, giving different running time properties. We are interested in:
Let \sigma be the size of the alphabet. Then you have the following costs:
Note that the insertion cost is amortised, and that the costs for hashing are given perfect hashing. The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten to twenty times the memory size of the source text in good implementations. The suffix array reduces this requirement to a factor of four, and researchers have continued to find smaller indexing structures. See alsoReferencesExternal links
de:Suffixbaum fr:Arbre des suffixes he:?? ????? id:Pohon sufiks it:Albero dei suffissi ja:???? lt:Priesag? medis pl:Drzewo sufiksowe ru:?????????? ?????? zh:??? Source: Wikipedia | The above article is available under the GNU FDL. | Edit this article
|
|
top
©2008-2009 TutorGig.com. All Rights Reserved. Privacy Statement