Overview
Tries (prefix trees) and suffix trees are specialized tree structures optimized for string operations. A trie stores a set of strings by decomposing each string into individual characters, with each path from the root to a specially marked node spelling out one complete string. Each edge corresponds to a character, so strings with a common prefix share the same initial path from the root. Suffix trees extend this concept by storing all suffixes of a given string or set of strings, enabling linear-time pattern matching and substring operations.
These structures solve problems where hash tables and balanced search trees prove inefficient. Hash tables provide O(1) average-case lookup but cannot efficiently answer prefix queries like "find all strings starting with 'pre'". Balanced search trees require O(m log n) time for string operations, where m represents string length and n represents the number of stored strings. Tries reduce this to O(m), independent of the number of stored strings.
The trie structure appears in diverse applications. Autocomplete systems use tries to find all completions for a partial input. IP routing tables use tries (specifically, compressed variants called Patricia tries) to match network addresses. Spell checkers use tries to verify word existence and suggest corrections. Search engines use suffix trees to identify document patterns and build inverted indexes.
Suffix trees compress all suffixes of a string into a compact tree structure. For a string of length n, the naive approach would store n suffixes requiring O(n²) space. Suffix trees compress this representation to O(n) space while maintaining the ability to search for any pattern in O(m) time, where m represents the pattern length. This property makes suffix trees foundational for string matching algorithms, bioinformatics applications, and text compression.
Key Principles
A trie node contains three components: a character value (optional at root), a hash or array mapping characters to child nodes, and a boolean flag indicating whether the node represents a complete string. The root node represents the empty string. Each path from root to a marked node spells out one stored string.
Insertion traverses the trie following the characters of the input string. For each character, if a child node exists for that character, traversal continues to that child. If no child exists, insertion creates a new node and adds it as a child. After processing all characters, insertion marks the final node as a word ending. This process ensures all strings sharing prefixes share the same nodes for those prefixes.
# Basic trie structure
class TrieNode
  attr_accessor :children, :is_end_of_word

  def initialize
    @children = {}
    @is_end_of_word = false
  end
end
Search operations follow the same traversal pattern. Starting at the root, search follows child pointers corresponding to each character in the target string. If at any point no child exists for the next character, the string does not exist in the trie. If traversal completes and reaches a node marked as a word ending, the string exists. This approach requires exactly m node traversals for a string of length m.
Prefix search extends basic search by collecting all complete words reachable from the node where the prefix ends. After traversing to the prefix end, a depth-first traversal collects all marked nodes in the subtree. This enables autocomplete functionality: given a prefix, find all stored strings with that prefix.
Suffix trees store all suffixes of a string by inserting each suffix into a trie-like structure. For the string "banana", suffixes include "banana", "anana", "nana", "ana", "na", and "a". The naive suffix trie inserts each suffix separately. Path compression combines chains of single-child nodes into edges labeled with strings rather than single characters, reducing space consumption.
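The suffix set is easy to generate directly; a quick illustrative sketch (variable names are mine):

```ruby
text = "banana"
# One suffix per starting position, longest first
suffixes = (0...text.length).map { |i| text[i..] }
# => ["banana", "anana", "nana", "ana", "na", "a"]
```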
Edge labels in compressed suffix trees store substring ranges rather than copying characters. Instead of storing the substring "ana", an edge stores references to the original string plus start and end indices. This representation achieves O(n) space for n suffixes because each character in the original string appears in at most one edge label.
Suffix links connect nodes representing suffixes that differ by one character. If a node represents the suffix "banana", a suffix link points to the node representing "anana". Suffix links enable linear-time construction algorithms. Without suffix links, construction requires O(n²) time because inserting each suffix requires separate traversal. With suffix links, construction can jump to the appropriate insertion point after processing each character.
Ukkonen's algorithm constructs suffix trees in O(n) time using suffix links and incremental construction. The algorithm processes the string left to right, extending the tree by one character at each step. Active points track where to begin extensions, and suffix links enable rapid movement between related insertion positions. This algorithm proves complex to implement but provides theoretical and practical efficiency guarantees.
Ruby Implementation
Ruby lacks built-in trie implementations, requiring manual construction. The implementation uses a hash to map characters to child nodes, providing efficient lookups and flexible character sets beyond just lowercase letters.
class Trie
  def initialize
    @root = TrieNode.new
  end

  def insert(word)
    node = @root
    word.each_char do |char|
      node.children[char] ||= TrieNode.new
      node = node.children[char]
    end
    node.is_end_of_word = true
  end

  def search(word)
    node = find_node(word)
    node && node.is_end_of_word
  end

  def starts_with(prefix)
    !find_node(prefix).nil?
  end

  private

  def find_node(prefix)
    node = @root
    prefix.each_char do |char|
      return nil unless node.children[char]
      node = node.children[char]
    end
    node
  end
end
The implementation separates the node-finding logic into a private method used by both search and prefix checking. The search method verifies that the found node marks a word ending, while starts_with only verifies the prefix path exists. This distinction matters: a trie containing "apple" will return true for starts_with("app") but false for search("app").
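A compact check of this distinction, reimplementing the same interface with nested hashes (my own sketch, not the Trie class above; the :end key stands in for is_end_of_word):

```ruby
# Nested-hash trie sketch: each node is a hash from character to child,
# with :end marking a complete word.
def trie_insert(root, word)
  node = word.each_char.reduce(root) { |n, c| n[c] ||= {} }
  node[:end] = true
end

def trie_find(root, prefix)
  prefix.each_char.reduce(root) { |n, c| n && n[c] }
end

def trie_search(root, word)
  node = trie_find(root, word)
  !node.nil? && node.fetch(:end, false)
end

root = {}
trie_insert(root, "apple")
starts_with_app = !trie_find(root, "app").nil?  # => true: the path exists
search_app = trie_search(root, "app")           # => false: "app" is not a word
search_apple = trie_search(root, "apple")       # => true
```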
Autocomplete requires collecting all words with a given prefix. This involves traversing to the prefix node, then performing depth-first traversal to collect all complete words in that subtree:
def autocomplete(prefix)
  node = find_node(prefix)
  return [] unless node
  results = []
  collect_words(node, prefix, results)
  results
end

private

def collect_words(node, current_word, results)
  results << current_word if node.is_end_of_word
  node.children.each do |char, child_node|
    collect_words(child_node, current_word + char, results)
  end
end
The recursive collection builds up the current word by concatenating characters as it traverses. String concatenation creates new string objects, but Ruby's string implementation handles this efficiently for reasonable word lengths. For very large tries, using an array of characters and joining at collection points provides better performance.
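A sketch of that optimization: the traversal mutates one shared character array and joins it only at word endings, so it creates one string per emitted word rather than one per visited node (the setup code and word list are illustrative):

```ruby
class TrieNode
  attr_accessor :children, :is_end_of_word

  def initialize
    @children = {}
    @is_end_of_word = false
  end
end

def collect_words(node, chars, results)
  results << chars.join if node.is_end_of_word
  node.children.each do |char, child|
    chars.push(char)   # extend the current path in place
    collect_words(child, chars, results)
    chars.pop          # backtrack without building a new string
  end
end

root = TrieNode.new
%w[cat car card].each do |w|
  node = w.each_char.reduce(root) { |n, c| n.children[c] ||= TrieNode.new }
  node.is_end_of_word = true
end

out = []
collect_words(root, [], out)
out.sort  # => ["car", "card", "cat"]
```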
Deletion requires more careful handling than insertion. Simply removing a word marking leaves the path intact for other words sharing that prefix. Deletion must only remove nodes that have no other purpose:
def delete(word)
  delete_helper(@root, word, 0)
end

private

def delete_helper(node, word, index)
  return false if node.nil?
  if index == word.length
    return false unless node.is_end_of_word
    node.is_end_of_word = false
    return node.children.empty?
  end
  char = word[index]
  child = node.children[char]
  should_delete_child = delete_helper(child, word, index + 1)
  if should_delete_child
    node.children.delete(char)
    return node.children.empty? && !node.is_end_of_word
  end
  false
end
The recursive deletion returns true when a node can be safely deleted, which occurs when the node has no children and does not mark another word's ending. This approach cleans up the trie structure while preserving all remaining words.
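The same bottom-up rule can be exercised with a nested-hash sketch (an illustrative reimplementation, not the class above; here the :end key doubles as the word marker, so a single empty? test covers both "no children" and "not a word ending"):

```ruby
def insert(root, word)
  node = word.each_char.reduce(root) { |n, c| n[c] ||= {} }
  node[:end] = true
end

def delete(node, word, index = 0)
  return false if node.nil?
  if index == word.length
    return false unless node[:end]
    node.delete(:end)
    return node.empty?      # safe to prune only if no children remain
  end
  char = word[index]
  if delete(node[char], word, index + 1)
    node.delete(char)
    return node.empty?      # :end would keep the hash non-empty
  end
  false
end

root = {}
insert(root, "apple")
insert(root, "apples")
delete(root, "apples")
node = root.dig("a", "p", "p", "l", "e")
node[:end]      # => true: the shared path still marks "apple"
node.key?("s")  # => false: the extra node for "apples" was pruned
```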
Suffix tree implementation in Ruby proves more complex due to the need for edge compression and suffix links. A simplified suffix trie without compression demonstrates the core concepts:
class SuffixTrie
  def initialize(text)
    @root = TrieNode.new
    @text = text
    build_suffix_trie
  end

  def search(pattern)
    node = @root
    pattern.each_char do |char|
      return false unless node.children[char]
      node = node.children[char]
    end
    true
  end

  private

  def build_suffix_trie
    @text.length.times do |i|
      insert_suffix(@text[i..])
    end
  end

  def insert_suffix(suffix)
    node = @root
    suffix.each_char do |char|
      node.children[char] ||= TrieNode.new
      node = node.children[char]
    end
    node.is_end_of_word = true
  end
end
This implementation stores all suffixes explicitly, resulting in O(n²) space complexity. Each suffix inserts its full character sequence. For the string "banana", this creates separate paths for "banana", "anana", "nana", "ana", "na", and "a". Searching for any substring requires following the corresponding path from the root.
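The sharing is measurable. A nested-hash sketch (illustrative, not the class above) counts one node per distinct substring plus the root, against one node per character of every suffix if nothing were shared:

```ruby
def suffix_trie(text)
  root = {}
  (0...text.length).each do |i|
    node = root
    text[i..].each_char { |c| node = (node[c] ||= {}) }
  end
  root
end

def count_nodes(node)
  1 + node.values.sum { |child| count_nodes(child) }
end

shared = count_nodes(suffix_trie("banana"))  # 16: root + 15 distinct substrings
unshared = 1 + (1..6).sum                    # 22: root + every character of every suffix
```

Even on a six-character string the trie shares roughly a quarter of the nodes; the gap widens rapidly with repetitive text.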
A more practical Ruby implementation uses edge compression with range references:
class CompressedSuffixNode
  attr_accessor :children, :start, :end_ref

  def initialize(start = nil, end_ref = nil)
    @children = {}
    @start = start
    @end_ref = end_ref
  end

  def edge_length
    return 0 if @start.nil?
    (@end_ref.respond_to?(:call) ? @end_ref.call : @end_ref) - @start + 1
  end
end
The end_ref can be a lambda or integer, allowing edges to extend dynamically as more characters are processed. This technique reduces space by avoiding character duplication. However, implementing a full compressed suffix tree with linear-time construction requires significant additional complexity beyond basic trie operations.
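A small demonstration of the open-ended leaf trick, using a trimmed copy of the node class (phase_end is an illustrative stand-in for the global end pointer Ukkonen's algorithm maintains):

```ruby
class CompressedSuffixNode
  attr_accessor :children, :start, :end_ref

  def initialize(start = nil, end_ref = nil)
    @children = {}
    @start = start
    @end_ref = end_ref
  end

  def edge_length
    return 0 if @start.nil?
    (@end_ref.respond_to?(:call) ? @end_ref.call : @end_ref) - @start + 1
  end
end

phase_end = 2
# The lambda closes over phase_end, so the edge tracks it automatically
leaf = CompressedSuffixNode.new(0, -> { phase_end })
before = leaf.edge_length  # 3: the edge covers indices 0..2
phase_end = 5              # three more characters processed
after = leaf.edge_length   # 6: the open edge grew without being touched
```

Every open leaf in the tree extends in O(1) total per phase this way, which is one of the ingredients of the linear-time bound.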
Performance Considerations
Trie operations demonstrate predictable time complexity independent of the number of stored strings. Insertion, search, and deletion each require O(m) time for a string of length m. This contrasts with hash tables where string hashing requires O(m) time plus potential collision resolution, and balanced search trees where string comparison during tree navigation requires O(m log n) time.
Space complexity depends on the strings' structure and sharing patterns. In the worst case with no shared prefixes, a trie storing n strings with average length m requires O(nm) space. Each character requires its own node. With significant prefix sharing, space consumption approaches O(m) for the total unique character count across all strings.
Hash-based child storage trades memory for flexibility. Each node maintains a hash mapping characters to child nodes. This approach supports arbitrary character sets including Unicode without pre-allocating arrays. The hash requires memory overhead for empty buckets and collision handling. For English lowercase letters only, a 26-element array provides more compact representation:
class ArrayTrieNode
  def initialize
    @children = Array.new(26)
    @is_end_of_word = false
  end

  def get_child(char)
    index = char.ord - 'a'.ord
    @children[index]
  end

  def set_child(char, node)
    index = char.ord - 'a'.ord
    @children[index] = node
  end
end
Array-based storage reduces per-node memory and eliminates hash computation, but restricts the character set and wastes space for sparse nodes with few children. For tries storing English words, the root and common prefix nodes have many children while deeper nodes have few children. This bimodal distribution means neither approach dominates uniformly.
Suffix tree construction algorithms vary in time complexity. Naive construction, inserting all n suffixes separately, requires O(n²) time: inserting a suffix of length i costs O(i), and summing over all n suffixes gives n + (n-1) + (n-2) + ... + 1 = O(n²) character operations. This approach proves practical for small strings but fails for large genomic sequences or document processing.
Ukkonen's algorithm achieves O(n) construction time through clever bookkeeping. The algorithm maintains active points and uses suffix links to avoid redundant traversals. Implementation complexity increases substantially, with explicit handling of edge cases and careful pointer management. For many applications, the simpler O(n²) construction suffices because construction occurs once while search occurs many times.
Compressed suffix trees reduce space from O(n²) to O(n) by storing edge labels as references rather than character sequences. Each edge stores start and end indices into the original string. Pattern matching still requires O(m) time to follow m characters along edges, but space consumption becomes linear. This compression proves essential for genomic applications where strings contain millions of characters.
Memory locality affects practical performance beyond asymptotic complexity. Tries with hash-based child storage exhibit poor cache performance because child lookups require following pointers to different memory locations. Array-based storage improves locality when children occupy consecutive memory. For critical performance requirements, custom memory allocators that group related nodes can improve cache hit rates substantially.
Patricia tries (Practical Algorithm to Retrieve Information Coded in Alphanumeric) compress paths with single children, reducing node count and improving space efficiency. Instead of creating a node for each character in a unique suffix, Patricia tries create single edges labeled with character sequences. This optimization reduces node count without sacrificing functionality:
class PatriciaNode
  attr_accessor :children, :is_end_of_word, :label

  def initialize(label = "")
    @children = {}
    @is_end_of_word = false
    @label = label
  end
end
Patricia tries particularly benefit string sets with long unique suffixes. IP address routing benefits significantly because addresses share common prefixes but diverge completely in suffixes. The compression reduces both space consumption and traversal overhead.
Practical Examples
Autocomplete systems represent the canonical trie application. Consider a search interface suggesting completions as users type. The system maintains a trie of valid search terms, each marked with frequency or relevance data:
class AutocompleteTrie
  def initialize
    @root = AutocompleteNode.new
  end

  def insert(word, frequency)
    node = @root
    word.each_char do |char|
      node.children[char] ||= AutocompleteNode.new
      node = node.children[char]
    end
    node.is_end_of_word = true
    node.frequency = frequency
  end

  def suggest(prefix, max_results = 10)
    node = find_node(prefix)
    return [] unless node
    results = []
    collect_suggestions(node, prefix, results)
    results.sort_by { |_word, freq| -freq }.take(max_results).map(&:first)
  end

  private

  def find_node(prefix)
    node = @root
    prefix.each_char do |char|
      return nil unless node.children[char]
      node = node.children[char]
    end
    node
  end

  def collect_suggestions(node, current_word, results)
    results << [current_word, node.frequency] if node.is_end_of_word
    node.children.each do |char, child|
      collect_suggestions(child, current_word + char, results)
    end
  end
end

class AutocompleteNode
  attr_accessor :children, :is_end_of_word, :frequency

  def initialize
    @children = {}
    @is_end_of_word = false
    @frequency = 0
  end
end
This implementation stores frequency data at word-ending nodes, enabling ranking by popularity. The collect_suggestions method gathers all completions then sorts by frequency. For very large tries, maintaining frequency data at internal nodes enables pruning low-frequency branches during traversal, reducing the search space.
IP routing demonstrates Patricia trie application for network address matching. Routers maintain routing tables mapping IP address prefixes to next-hop destinations. For an incoming packet, the router finds the longest matching prefix to determine routing:
class IPRouter
  def initialize
    @root = IPRouteNode.new
  end

  def add_route(ip_prefix, prefix_length, next_hop)
    node = @root
    ip_binary = ip_to_binary(ip_prefix)
    prefix_length.times do |i|
      bit = ip_binary[i]
      node.children[bit] ||= IPRouteNode.new
      node = node.children[bit]
    end
    node.next_hop = next_hop
  end

  def route_packet(ip_address)
    node = @root
    last_match = node.next_hop
    ip_binary = ip_to_binary(ip_address)
    ip_binary.each_char do |bit|
      break unless node.children[bit]
      node = node.children[bit]
      last_match = node.next_hop if node.next_hop
    end
    last_match
  end

  private

  def ip_to_binary(ip_string)
    ip_string.split('.').map { |octet| octet.to_i.to_s(2).rjust(8, '0') }.join
  end
end

class IPRouteNode
  attr_accessor :children, :next_hop

  def initialize
    @children = {}
    @next_hop = nil
  end
end
The binary trie matches IP addresses bit by bit, finding the longest prefix match. As traversal proceeds, route_packet remembers the last node with routing information. When no deeper match exists, the router uses the longest prefix found. This structure efficiently handles CIDR notation and variable-length subnet masks.
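The longest-prefix behavior can be verified with a condensed, self-contained version of the same bit trie (nested hashes in place of IPRouteNode; the route names are made up):

```ruby
def ip_bits(ip)
  ip.split('.').map { |o| o.to_i.to_s(2).rjust(8, '0') }.join
end

def add_route(root, prefix, len, hop)
  # Follow only the first `len` bits of the prefix
  node = ip_bits(prefix)[0, len].each_char
             .reduce(root) { |n, b| n[:children][b] ||= { children: {} } }
  node[:hop] = hop
end

def route(root, ip)
  hop = root[:hop]
  node = root
  ip_bits(ip).each_char do |b|
    node = node[:children][b] or break
    hop = node[:hop] if node[:hop]  # remember the deepest match so far
  end
  hop
end

table = { children: {} }
add_route(table, "10.0.0.0", 8, "gateway-a")
add_route(table, "10.1.0.0", 16, "gateway-b")
route(table, "10.1.2.3")  # matches both; the /16 route wins
route(table, "10.9.9.9")  # falls back to the /8 route
```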
Spell checking uses tries to verify word existence and generate correction suggestions. The trie stores a dictionary, and spell checking searches for the input word. When the word does not exist, the system generates candidates by applying single-character edits:
require 'set'

class SpellChecker
  def initialize(dictionary)
    @trie = Trie.new
    dictionary.each { |word| @trie.insert(word) }
  end

  def check(word)
    @trie.search(word)
  end

  def suggest(word, max_distance = 1)
    return [word] if check(word)
    candidates = generate_edits(word, max_distance)
    candidates.select { |candidate| check(candidate) }
  end

  private

  def generate_edits(word, distance)
    return [word] if distance == 0
    edits = Set.new
    # Deletions
    word.length.times do |i|
      edits.add(word[0...i] + word[i + 1..])
    end
    # Transpositions
    (word.length - 1).times do |i|
      edits.add(word[0...i] + word[i + 1] + word[i] + word[i + 2..])
    end
    # Replacements
    word.length.times do |i|
      ('a'..'z').each do |char|
        edits.add(word[0...i] + char + word[i + 1..]) if char != word[i]
      end
    end
    # Insertions
    (word.length + 1).times do |i|
      ('a'..'z').each do |char|
        edits.add(word[0...i] + char + word[i..])
      end
    end
    return edits.to_a if distance == 1
    # Generate second-level edits
    second_level = Set.new
    edits.each do |edit|
      generate_edits(edit, 1).each { |e| second_level.add(e) }
    end
    second_level.to_a
  end
end
This implementation generates all single-edit candidates (deletions, transpositions, replacements, insertions) and filters them against the dictionary trie. For edit distance 2, it generates second-level edits recursively. The trie enables efficient filtering because checking each of the thousands of candidates requires only O(m) time, rather than the O(m log n) of binary search over a sorted word list.
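A condensed, dictionary-filtered run of the single-edit generator (an illustrative rewrite; the three-word dictionary is made up):

```ruby
ALPHA = ('a'..'z').to_a

def edits1(word)
  n = word.length
  dels  = (0...n).map { |i| word[0...i] + word[i + 1..] }
  trans = (0...n - 1).map { |i| word[0...i] + word[i + 1] + word[i] + word[i + 2..] }
  reps  = (0...n).flat_map { |i| ALPHA.map { |c| word[0...i] + c + word[i + 1..] } }
  ins   = (0..n).flat_map { |i| ALPHA.map { |c| word[0...i] + c + word[i..] } }
  (dels + trans + reps + ins).uniq - [word]
end

dict = %w[cat cot coat]
found = edits1("cst").select { |w| dict.include?(w) }
# "cat" and "cot" are one replacement away; "coat" needs two edits
```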
Document pattern matching uses suffix trees to find all occurrences of a pattern. After constructing a suffix tree for the document, pattern matching traverses the tree following the pattern characters. All leaves reachable from the pattern node represent match positions:
# Trie node augmented with the starting position of the suffix it ends
class PositionNode
  attr_accessor :children, :is_end_of_word, :position

  def initialize
    @children = {}
    @is_end_of_word = false
    @position = nil
  end
end

class DocumentMatcher
  def initialize(text)
    @text = text
    @suffix_trie = build_suffix_trie(text)
  end

  def find_all(pattern)
    node = @suffix_trie
    pattern.each_char do |char|
      return [] unless node.children[char]
      node = node.children[char]
    end
    collect_positions(node)
  end

  private

  def build_suffix_trie(text)
    root = PositionNode.new
    text.length.times do |start|
      node = root
      (start...text.length).each do |i|
        char = text[i]
        node.children[char] ||= PositionNode.new
        node = node.children[char]
      end
      node.position = start
      node.is_end_of_word = true
    end
    root
  end

  def collect_positions(node)
    positions = []
    queue = [node]
    until queue.empty?
      current = queue.shift
      positions << current.position if current.is_end_of_word
      queue.concat(current.children.values)
    end
    positions.sort
  end
end
This example demonstrates suffix trie application for pattern matching. The collect_positions method performs breadth-first traversal to find all positions where matching suffixes begin. For the text "banana" and pattern "ana", the suffix trie contains "banana", "anana", and "nana" among other suffixes. Following "ana" from the root leads to a node from which both suffixes "anana" (position 1) and "ana" (position 3) are reachable.
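Those positions can be checked with a condensed nested-hash version of the matcher (illustrative; a :pos key records where each suffix starts):

```ruby
# Build a suffix trie of nested hashes, tagging each suffix's end node
# with its starting position in the text.
def build(text)
  root = {}
  (0...text.length).each do |start|
    node = text[start..].each_char.reduce(root) { |n, c| n[c] ||= {} }
    node[:pos] = start
  end
  root
end

# Gather every :pos tag at or below a node
def positions(node)
  node.flat_map { |k, v| k == :pos ? [v] : positions(v) }.sort
end

def find_all(root, pattern)
  node = pattern.each_char.reduce(root) { |n, c| n && n[c] }
  node ? positions(node) : []
end

hits = find_all(build("banana"), "ana")  # => [1, 3]
misses = find_all(build("banana"), "xyz")  # => []
```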
Common Patterns
The lazy deletion pattern avoids immediate node removal, instead marking nodes as deleted and performing cleanup periodically. This approach simplifies deletion logic and improves performance when deletions occur in bursts:
class LazyDeletionTrie
  def initialize
    @root = TrieNode.new
    @deleted_count = 0
  end

  def delete(word)
    node = find_node(word)
    if node && node.is_end_of_word
      node.is_end_of_word = false
      @deleted_count += 1
      cleanup if @deleted_count > threshold
    end
  end

  private

  def cleanup
    @root = rebuild_trie
    @deleted_count = 0
  end

  # find_node follows the same traversal as in the Trie class;
  # collect_active_words gathers every word still marked in the trie;
  # assumes Trie exposes its root (attr_reader :root)
  def rebuild_trie
    words = collect_active_words(@root, "")
    new_trie = Trie.new
    words.each { |word| new_trie.insert(word) }
    new_trie.root
  end

  def threshold
    100 # Rebuild after 100 deletions
  end
end
The threshold determines when cleanup occurs. Lower thresholds maintain tighter memory usage but increase cleanup frequency. Higher thresholds reduce cleanup overhead but allow more memory waste. The optimal threshold depends on deletion frequency and available memory.
The persistence pattern creates immutable trie versions, enabling multiple versions to coexist without duplication. When inserting into an immutable trie, nodes along the insertion path are copied while unchanged subtrees are shared:
class PersistentTrie
  attr_reader :root

  def initialize(root = TrieNode.new)
    @root = root
  end

  def insert(word)
    new_root = insert_helper(@root, word, 0)
    PersistentTrie.new(new_root)
  end

  private

  def insert_helper(node, word, index)
    new_node = node.dup
    new_node.children = node.children.dup
    if index == word.length
      new_node.is_end_of_word = true
      return new_node
    end
    char = word[index]
    child = node.children[char] || TrieNode.new
    new_node.children[char] = insert_helper(child, word, index + 1)
    new_node
  end
end
Each modification creates a new trie version with minimal copying. Unchanged branches remain shared between versions, providing O(m) time and space overhead per insertion. This pattern enables undo/redo functionality, concurrent access patterns, and snapshots for backup.
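A self-contained check of version isolation, restating the node and trie classes compactly and adding a small include? helper (the helper is not part of the original listing):

```ruby
class TrieNode
  attr_accessor :children, :is_end_of_word

  def initialize
    @children = {}
    @is_end_of_word = false
  end
end

class PersistentTrie
  attr_reader :root

  def initialize(root = TrieNode.new)
    @root = root
  end

  # Returns a NEW trie; the receiver is never mutated
  def insert(word)
    PersistentTrie.new(insert_helper(@root, word, 0))
  end

  def include?(word)
    node = @root
    word.each_char { |c| node = node.children[c] or return false }
    node.is_end_of_word
  end

  private

  def insert_helper(node, word, index)
    new_node = node.dup
    new_node.children = node.children.dup
    if index == word.length
      new_node.is_end_of_word = true
      return new_node
    end
    char = word[index]
    child = node.children[char] || TrieNode.new
    new_node.children[char] = insert_helper(child, word, index + 1)
    new_node
  end
end

v1 = PersistentTrie.new.insert("cat")
v2 = v1.insert("car")
v1.include?("car")                       # => false: v1 is unchanged
v2.include?("cat") && v2.include?("car") # => true: v2 sees both words
```

The "t" subtree of "cat" is physically shared between v1 and v2; only the nodes along the "car" path were copied.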
The compressed path pattern eliminates chains of nodes with single children. When a node has exactly one child and that child also has one child, these nodes can be compressed into a single edge labeled with multiple characters:
class CompressedTrie
  def compress
    compress_node(@root)
  end

  private

  def compress_node(node)
    # Snapshot the children: the hash is mutated during compression
    node.children.to_a.each do |char, child|
      label = char
      # Merge chains of single-child nodes that do not end a word
      while child.children.size == 1 && !child.is_end_of_word
        next_char, next_node = child.children.first
        label += next_char
        child = next_node
      end
      if label != char
        node.children.delete(char)
        node.children[label] = child
      end
      compress_node(child)
    end
  end
end
Compression improves space efficiency and reduces traversal overhead. Insertions into compressed tries require edge splitting when the new string diverges mid-edge. This added complexity trades off against space savings and improved lookup performance.
The frequency-augmented pattern annotates nodes with subtree word frequencies, enabling efficient ranking without traversing entire subtrees:
class FrequencyTrieNode
  attr_accessor :children, :is_end_of_word, :frequency, :max_subtree_frequency

  def initialize
    @children = {}
    @is_end_of_word = false
    @frequency = 0
    @max_subtree_frequency = 0
  end
end

class FrequencyTrie
  def initialize
    @root = FrequencyTrieNode.new
  end

  def insert(word, frequency)
    node = @root
    word.each_char do |char|
      node.children[char] ||= FrequencyTrieNode.new
      node = node.children[char]
      node.max_subtree_frequency = [node.max_subtree_frequency, frequency].max
    end
    node.is_end_of_word = true
    node.frequency = frequency
  end

  def top_k_with_prefix(prefix, k)
    node = find_node(prefix)
    return [] unless node
    collect_top_k(node, prefix, k)
  end

  private

  def find_node(prefix)
    node = @root
    prefix.each_char do |char|
      return nil unless node.children[char]
      node = node.children[char]
    end
    node
  end

  # Best-first search: always expand the entry with the highest priority.
  # A sorted array stands in for a max-heap; a nil node marks a finished
  # word whose priority is its own frequency, so words emerge in rank order.
  def collect_top_k(node, current_word, k)
    results = []
    queue = [[node.max_subtree_frequency, current_word, node]]
    while !queue.empty? && results.size < k
      queue.sort_by! { |priority, _, _| priority }
      priority, word, current = queue.pop
      if current.nil?
        results << [word, priority]
      else
        queue << [current.frequency, word, nil] if current.is_end_of_word
        current.children.each do |char, child|
          queue << [child.max_subtree_frequency, word + char, child]
        end
      end
    end
    results
  end
end
The max_subtree_frequency enables pruning: branches with lower maximum frequencies than the current k-th result cannot contain better results. This optimization dramatically improves top-k queries in large tries.
Reference
Core Operations
| Operation | Time Complexity | Space Complexity | Description |
|---|---|---|---|
| Insert | O(m) | O(m) | Insert string of length m |
| Search | O(m) | O(1) | Search for exact string match |
| Prefix Search | O(m) | O(1) | Check if prefix exists |
| Delete | O(m) | O(1) | Remove string and cleanup nodes |
| Autocomplete | O(p + n + k log k) | O(n) | Find k completions for prefix p from n results |
| Longest Prefix | O(m) | O(1) | Find longest stored prefix of string |
Trie Variants Comparison
| Variant | Space | Construction Time | Best Use Case |
|---|---|---|---|
| Standard Trie | O(ALPHABET_SIZE * N * M) | O(N * M) | Small character sets with sharing |
| Compressed Trie | O(N * M) | O(N * M) | Space-constrained applications |
| Patricia Trie | O(N) | O(N * M) | IP routing, sparse key sets |
| Suffix Trie | O(M²) | O(M²) | Pattern matching, small texts |
| Suffix Tree | O(M) | O(M) with Ukkonen | Genomics, large text search |
| Ternary Search Trie | O(N * M) | O(N * M) | Memory-efficient general purpose |
Suffix Tree Operations
| Operation | Time Complexity | Description |
|---|---|---|
| Build (naive) | O(n²) | Insert all n suffixes separately |
| Build (Ukkonen) | O(n) | Linear time with suffix links |
| Pattern Search | O(m) | Find pattern of length m |
| Longest Repeat | O(n) | Find longest repeated substring |
| Longest Common | O(n + m) | Longest common substring of two strings |
Node Implementation Strategies
| Strategy | Memory per Node | Lookup Time | Character Set Support |
|---|---|---|---|
| Hash Map | 32-48 bytes + entries | O(1) average | Arbitrary Unicode |
| Array (26) | 208 bytes fixed | O(1) | Lowercase letters only |
| Array (256) | 2048 bytes fixed | O(1) | Full ASCII |
| Linked List | 16 bytes + entries | O(k) for k children | Arbitrary |
| Binary Search | 16 bytes + entries | O(log k) | Arbitrary |
Construction Algorithms
| Algorithm | Time | Space | Implementation Difficulty |
|---|---|---|---|
| Naive Suffix Insertion | O(n²) | O(n²) | Low |
| Compressed Naive | O(n²) | O(n) | Medium |
| McCreight | O(n) | O(n) | High |
| Ukkonen | O(n) | O(n) | High |
| Weiner | O(n) | O(n) | Very High |
Ruby Implementation Patterns
| Pattern | Code Snippet | Use Case |
|---|---|---|
| Basic Node | children = {}, is_end_of_word = false | General purpose tries |
| Array Node | children = Array.new(26) | English text only |
| Labeled Edge | start_index, end_index | Compressed trees |
| Frequency Node | frequency, max_subtree_freq | Ranked autocomplete |
| Persistent Node | node.dup, children.dup | Immutable versions |
| Position Node | position, is_end_of_word | Suffix tree matches |
Common Applications
| Application | Trie Type | Key Operations | Complexity |
|---|---|---|---|
| Autocomplete | Standard | Insert, prefix search, collect | O(p + n + k log k) |
| Spell Check | Standard | Search, edit distance | O(m * alphabet_size) |
| IP Routing | Patricia | Longest prefix match | O(32) for IPv4 |
| DNA Matching | Suffix Tree | Pattern search | O(m) |
| LZ Compression | Suffix Tree | Longest match | O(m) |
| Word Break | Standard | Prefix search, DP | O(n²) |
Performance Tuning Guidelines
| Scenario | Optimization | Impact |
|---|---|---|
| Memory constrained | Use Patricia trie or compression | 50-90% space reduction |
| Frequent updates | Lazy deletion with periodic rebuild | Amortized O(1) deletes |
| Top-k queries | Frequency augmentation at nodes | Avoid full traversal |
| Large character sets | Use hash-based children | Flexible, moderate overhead |
| Small character sets | Use array-based children | Fast, fixed overhead |
| Version control | Persistent data structure | O(m) per version overhead |
Edge Cases and Considerations
| Edge Case | Handling Strategy |
|---|---|
| Empty string insertion | Mark root as word ending |
| Duplicate insertions | Update frequency or ignore |
| Deletion of non-existent | Return false, no state change |
| Unicode characters | Use hash-based children storage |
| Very long strings | Consider compressed representation |
| Memory exhaustion | Implement node pooling or limits |
| Concurrent access | Use persistent tries or locking |
| Case sensitivity | Normalize on insert or use separate tries |