[Data Structure/Algorithm] String
A string is a data structure composed of consecutive characters grouped together. Therefore, a string can be structured using various data structures of collections (an abstract data type that groups data), and problems such as searching and comparing substrings can be solved by using various data structure traversal algorithms.
Linear Collections
Graph
When a string is structured as a graph data structure to search and compare substrings, nodes generally refer to specific positions (indices) or states of the string, and edges indicate transitions between nodes (manipulating the string to make another string) or relationships (relations between two strings).
When structuring a single string as a graph to search and compare substrings, an example of graph structuring is as follows.
- Nodes: consider the indices of the string as nodes. Each node represents the string up to that index. For a string of length
N, there areN+1nodes from0toN. Node 0 is the empty string, and nodeNis the entire string. - Edges: consider substrings of the string as edges. For example, in the string
ABCD, the edge between node 0 and node 1 is the substringA. The edge between node 1 and node 3 is the substringBC.
When searching and comparing substrings of two strings to explore the relationship between the two strings, an example of graph structuring is as follows.
- Nodes: treat pairs of indices from the two strings as nodes. The node
(i, j)refers to the i-th index of the first string and the j-th index of the second string. - Edges: consider operations that compare the two strings and move to the next state as edges.
By structuring strings as graphs, problems can be solved using graph traversal algorithms such as breadth-first search, depth-first search, and shortest path search.
String Segmentation
The problem of determining whether a string can be segmented into a specific set of words can be solved using graph traversal algorithms.
The process of solving the problem using breadth-first search is as follows.
- Nodes: consider each index of the string as a node. Node 0 is the empty string, and node 1 refers to the first character.
- Edges: consider an edge if the substring resulting from segmentation exists in the word set. To create edges from node 1, check among the substrings from index 1 to index N whether they exist in the word set, and if they do, connect the corresponding index nodes as edges.
- Data structure preparation: use a set data structure to quickly check for word existence in the word set. Use a queue to store nodes to be explored. Use an array to manage already visited nodes.
- Graph traversal: add node 0 to the queue and then dequeue it. Mark node 0 as visited. Check among the substrings from index 0 to N whether they exist in the word set. If they exist, enqueue the corresponding index nodes. Dequeue nodes from the queue and repeat the above process until the queue is empty. If an index equals the length of the string, it means the end of the string has been reached, and the segmentation is considered successful; then terminate the loop.
The process of solving the problem using depth-first search is as follows.
- Nodes: consider each index of the string as a node. Node 0 is the empty string, and node 1 refers to the first character.
- Edges: consider an edge if the substring resulting from segmentation exists in the word set. To create edges from node 1, check among the substrings from index 1 to index N whether they exist in the word set, and if they do, connect the corresponding index nodes as edges.
- Data structure preparation: use a set data structure to quickly check for word existence in the word set. Use a stack to store nodes to be explored. Use an array to manage already visited nodes.
- Graph traversal: add node 0 to the stack and then pop it. Mark node 0 as visited. Check among the substrings from index 0 to N whether they exist in the word set. If they exist, push the corresponding index nodes onto the stack. Pop nodes from the stack and repeat the above process until the stack is empty. If an index equals the length of the string, it means the end of the string has been reached, and the segmentation is considered successful; then terminate the loop.
Trie Data Structure
A trie is a type of tree data structure for string search. After constructing multiple strings into a trie, a specific string can be searched at very high speed. Trie data structures are used in text autocompletion, next-word prediction, approximate string matching, spell checking, and so on. String search using a trie is more efficient than a binary search tree. Tries typically use associative arrays or fixed arrays sized by the alphabet to distribute strings and store one character per node.
In a trie, the root node is the empty string, and all child nodes share a common prefix with their parent node. Because string search compares common prefixes with the target string as it proceeds, a trie is also called a prefix tree.
When storing a string, from the root node descend to child nodes and store each character of the string as the node’s key from the front. A node’s children become part of the suffix, and a node’s parent becomes part of the prefix.
Unlike a binary search tree that stores the entire string as the key in each node, a trie stores a single character in each node or edge. The path from the root node to a particular node is a substring. Connections between nodes are determined by individual characters, not entire strings. The end of a key (the end of a string) is marked by a terminal flag on a node.
The differences between a binary search tree and a trie are as follows.
- Binary Search Tree: store the entire string as the key in each node. When searching for a string, compare the current node’s key with the target key. When inserting a node, compare the insertion key with node keys to determine the node’s position. In a binary search tree, each node represents a specific string.
- Trie: store a character as the key in each node. When searching a string, follow and compare the node keys with each character of the search key from the first to the last character. When inserting a node, compare the insertion string character by character with node keys to determine the node’s position. In a trie, each node does not itself represent a specific string; the path from the root node to a particular node represents a specific string.
Comments