PolarSPARC

Introduction to Trie Data Structure - Part 2


Bhaskar S 08/31/2019


👧 👦 Hey Charlie !!! We are really excited and can't wait to hear about the optimizations you alluded to during our previous ( Part-1 ) discussion. Would you have some time now to elaborate on it ???

👲 Hey guys !!! Sure, have some time now to discuss the optimized version of the Trie data structure. It is also known as a Compact Trie or a Patricia Trie, where PATRICIA stands for Practical Algorithm To Retrieve Information Coded In Alphanumeric.

👲 We will leverage the Node implementation as discussed previously.

👲 The following is the implementation of a Node of a Trie using Python for convenience:

TrieNode.py
#
# Trie Node
#
class TrieNode(object):
    def __init__(self):
        self._ancestor = None
        self._children = {}
        self._item = None
        self._leaf = False

    def __str__(self, *args, **kwargs):
        return "item => %s, children => %s, leaf => %s" % (self._item, list(self._children.keys()), self._leaf)

    @property
    def ancestor(self):
        return self._ancestor

    @ancestor.setter
    def ancestor(self, node):
        self._ancestor = node

    @property
    def item(self):
        return self._item

    @item.setter
    def item(self, item):
        self._item = item

    @property
    def children(self):
        return self._children

    @children.setter
    def children(self, children: dict):
        self._children.update(children)

    @property
    def leaf(self):
        return self._leaf

    @leaf.setter
    def leaf(self, leaf):
        self._leaf = leaf

    def clear(self):
        self._children.clear()

    def get_node(self, letter):
        return self._children.get(letter)

    def add_node(self, letter, node):
        self._children[letter] = node

    def remove_node(self, letter):
        del self._children[letter]

    def get_children_keys(self):
        return list(self._children.keys())

    def is_children_empty(self):
        if len(self._children) == 0:
            return True
        return False

👲 Let me dive right into explaining how one can insert a word into a Compact Trie data structure using an example. Assume the word is ant. Starting at the Root Node as the current Node and starting with the letter at index 0 (letter a), lookup the letter in the children of the current Node. It will not be found, so create a new Node and set the value of the item to be the entire word. Also, set the ancestor to be the current Node and add the newly created node as the reference to the letter in the children of the current Node. Finally, set the leaf to be True.

👲 The following illustration depicts the Compact Trie data structure:

Compact Trie for Ant
Figure-1

👲 Assume the next word is art. Starting at the Root Node as the current Node and starting with the letter at index 0 (letter a), lookup the letter in the children of the current Node. This time around it will be found and will point to the Node whose item value will be the word ant. Assign this as the current Node. Given that we found a Node, we have to find the largest common prefix between the given word art and the existing word ant. The common prefix between the two words is a. Since the length of the item value in the current Node is greater than the length of the common prefix, we create a new Node with the item value set to the suffix nt (from the existing word) and add it as child for the letter l under the current Node. Next, create another Node with the item value set to the suffix rt (for the given word) and add it as child for the letter r under the current Node. Finally, we change the item value of the current Node to the common prefix, which is a. The following illustration depicts the Compact Trie data structure with the two words:

Compact Trie for Ant and Art
Figure-2

👲 Does this make sense ???

👦 👧 Hmm !!! Not sure ... may be another example would help ???

👲 Ok, sure ... Let us consider the words hunter, hunt, hunted, and hung.

👲 We start with the word is hunter. Starting at the Root Node as the current Node and starting with the letter at index 0 (letter h), we lookup the letter in the children of the current Node. It will not be found, so we create a new Node and set the value of the item to be the entire word. Also, set the ancestor to be the current Node and add the newly created node as the child for the letter h under the current Node. Finally, set the leaf to be True. The following illustration depicts the Compact Trie data structure:

Compact Trie for Hunter
Figure-3

👲 The next word is hunt. Starting at the Root Node as the current Node and starting with the letter at index 0 (letter h), we lookup the letter in the children of the current Node. This time around it will be found and will point to the Node whose item value will be the word hunter. Assign this as the current Node. Given that we found a Node, we have to find the largest common prefix between the given word hunt and the existing word hunter. The common prefix between the two words is hunt. Since the length of the item value in the current Node is greater than the length of the common prefix, we create a new Node with the item value set to the suffix er (from the existing word) and add it as child for the letter e under the current Node. Finally, we change the item value of the current Node to the common prefix, which is hunt. The following illustration depicts the Compact Trie data structure:

Compact Trie for Hunter and Hunt
Figure-4

👲 So far we good ???

👧 👦 YES !!!

👲 The next word is hunted. Starting at the Root Node as the current Node and starting with the letter at index 0 (letter h), we lookup the letter in the children of the current Node. Again, this time it will be found and will point to the Node whose item value will be the word hunt. Assign this as the current Node. Given that we found a Node, we have to find the largest common prefix between the given word hunted and the existing word hunt. The common prefix between the two words is hunt. Since the length of the item value in the current Node is equal to the length of the common prefix and the length of the new word is greater, we move the index forward from index 0 plus the length of the prefix to the new index 4. Starting at the current Node and starting with the letter at index 4 (letter e), we lookup the letter in the children of the current Node. Once again, it will be found and will point to the Node whose item value will be the suffix er. Assign this as the current Node. Given that we found a Node, we have to find the largest common prefix between the suffix word ed and the existing suffix word er. The common prefix between the two suffix words is e. Since the length of the item value in the current Node is greater than the length of the common prefix, we create a new Node with the item value set to the suffix r (from the existing suffix word er) and add it as child for the letter r under the current Node. Next, create another Node with the item value set to the suffix d (for the given suffix word ed) and add it as child for the letter d under the current Node. Finally, we change the item value of the current Node to the common prefix, which is e. The following illustration depicts the Compact Trie data structure with the three words:

Compact Trie for Hunter, Hunt, and Hunted
Figure-5

👲 Now for the final word hung. Starting at the Root Node as the current Node and starting with the letter at index 0 (letter h), we lookup the letter in the children of the current Node. It will be found and will point to the Node whose item value will be the word hunt. Assign this as the current Node. Given that we found a Node, we have to find the largest common prefix between the given word hung and the existing word hunt. The common prefix between the two words is hun. Since the length of the item value in the current Node is greater than the length of the common prefix, we create a new Node with the item value set to the suffix t (from the existing word hunt) and add it as child for the letter t under the current Node. Next, create another Node with the item value set to the suffix g (for the given word hung) and add it as child for the letter g under the current Node. Finally, we change the item value of the current Node to the common prefix, which is hun. The following illustration depicts the Compact Trie data structure with the four words:

Compact Trie for Four Words
Figure-6

👧 👦 Ahh !!! Got it. This makes a lot of sense now, WOW !!! This is GREAT !!!

👲 Excellent !!! Glad the example made it clear. The key point to remember is that when add a new word, we need to check for the largest common prefix between the new word and the existing word(s) in the Compact Trie data structure.

👲 The following is the code implementation to insert a word into a Compact Trie data structure using Python:

insert_word()
def insert_word(self, word):
    if word and len(word) > 0:
        # print("insert_word: word => {}".format(word))
        node = self._root
        i = 0
        while True:
            # print("insert_word: i => {}".format(i))
            letter = word[i:i+1].lower()
            # print("insert_word: letter => {}".format(letter))
            next_node = node.get_node(letter)
            if not next_node:
                suffix_node = TrieNode()
                suffix_node.item = word[i:].lower()
                suffix_node.ancestor = node
                node.add_node(suffix_node.item[0], suffix_node)
                suffix_node.leaf = True
                # print("insert_word: suffix_node => {}".format(suffix_node))
                # print("insert_word: node => {}".format(node))
                break
            else:
                node = next_node

                pos = 0
                max_pos = min(len(node.item), len(word[i:]))
                # print("insert_word: node.item => {}, word[i:] => {} max_pos => {}".format(node.item,
                #                                                                           word[i:].lower,
                #                                                                           max_pos))

                # Move index forward as long as the letters match
                while pos < max_pos and node.item[pos] == word[i+pos].lower():
                    pos += 1
                # print("insert_word: node.item => {}, word => {}, pos => {}".format(node.item,
                #                                                                    word[i:].lower(),
                #                                                                    pos))

                if len(node.item) == pos:
                    if len(word[i:]) == pos:
                        node.leaf = True
                        # print("insert_word: node => {}".format(node))
                        break
                    else:
                        # Consider the words - hunt and hunter. After the word hunt is inserted, when we go
                        # insert the word hunter, the prefix hunt would be common and we drop here
                        i += pos
                elif len(node.item) > pos:
                    suffix_node = TrieNode()
                    suffix_node.item = node.item[pos:]
                    suffix_node.ancestor = node
                    if node.leaf:
                        suffix_node.leaf = True
                    if not node.is_children_empty():
                        suffix_node.children = node.children
                        node.clear()
                    node.add_node(suffix_node.item[0], suffix_node)
                    # print("insert_word: suffix_node => {}".format(suffix_node))
                    node.item = node.item[:pos]
                    node.leaf = False
                    if len(word[i:]) == pos:
                        node.leaf = True
                        # print("insert_word: node => {}".format(node))
                        break
                    else:
                        i += pos

        print("insert_word: word => {} inserted".format(word.lower()))

👲 Assume we inserted the follwoing 10 words into the Trie data structure:

👲 The following illustration depicts the Nodes in the example Compact Trie:

Trie for 10 Words
Figure-7

👲 See how compact this Trie data structure is compared to the basic Trie !!!

👦 👧 Oh Yeah !!! Looks very space efficient as fewer Nodes are created.

👲 Next, let me explain how to look-up a prefix of a word in a Compact Trie data structure. Assume the prefix is ben. The first letter in the prefix is b. Starting at the Root Node as the current Node and starting with the letter at index 0 (letter b), lookup the letter in the children of the current Node. It will be found and will point to the child Node whose item value will be the prefix word b. Since the length of the item value equals 1, move the index forward by 1 and assign the child as the current Node. Starting at the current Node and starting with the letter at index 1 (letter e), lookup the letter in the children of the current Node. Again, it will be found and will point to the child Node whose item value will be the suffix word end. Since the length of the item value is greater than 1, we compare one letter after the other between the item value and the given prefix starting at the current index. On each successful compare, we move forward the index by 1. If the entire length of the prefix is covered successfully, we have found the given prefix. Else, the prefix is not found and we are done.

👲 The following is the code implementation to search a prefix from a Compact Trie data structure using Python:

is_prefix()
def is_prefix(self, prefix) -> (bool, TrieNode):
    present = False
    node = None
    if prefix and len(prefix) > 0:
        present = True
        node = self._root
        i = 0
        while i < len(prefix):
            ch = prefix[i].lower()
            # print("is_prefix: ch => {}".format(ch))
            ch_node = node.get_node(ch)
            # print("is_prefix_present: node => {}, ch_node => {}".format(node, ch_node))
            if not ch_node:
                present = False
                break
            else:
                if len(ch_node.item) == 1:
                    i += 1
                elif len(ch_node.item) > 1:
                    # print("is_prefix: ch_node.item => {}".format(ch_node.item))
                    k = 0
                    while i < len(prefix) and k < len(ch_node.item):
                        if ch_node.item[k] == prefix[i].lower():
                            k += 1
                            i += 1
                        else:
                            present = False
                            break
                    if not present:
                        break
                node = ch_node
    return present, node

👧 👦 WoW !!! This is awesome.

👲 Next, let me explain how to look-up a word in a Compact Trie data structure. Assume the word is art. We use the same logic as looking up the prefix art. Except, we add one more check to see if the Node returned by the prefix check has the leaf flag set as True. That is it.

👲 The following is the code implementation to search a word from a Compact Trie data structure using Python:

is_word()
def is_word(self, word) -> (bool, TrieNode):
    (present, node) = self.is_prefix(word)
    if present and word.find(node.item) >= 0 and node.leaf:
        return True, node
    return False, None

👲 Finally, let me explain how to remove a word from a Compact Trie data structure. Assume the word is ball. For this, we use the same logic as looking up the word ball. If the word exists, we get a reference to the suffix Node (all). If there are no children in the returned Node, we remove the reference to this Node from its ancestor. The following illustration depicts the state of the Compact Trie data structure after the suffix all is removed:

Trie After Remove
Figure-8

👲 We are not done yet. If the ancestor of the returned Node has just one remaining child (which is the case with the suffix end), then we need to compress and merge the child Node with the ancestor as shown in the illustration below:

Trie Nodes to Merge
Figure-9

👲 Once the Nodes are compressed and merged, the following illustration depicts the state of the Compact Trie data structure:

Trie After Merge
Figure-10

👲 The following is the code implementation to remove a word from a Trie data structure using Python:

remove_word()
def remove_word(self, word):
    (present, node) = self.is_word(word)
    # print("remove_word: word => {}, present => {}, node => {}".format(word, present, node))
    if present:
        node.leaf = False
        if node.is_children_empty():
            node.ancestor.remove_node(node.item[0])
            ch_keys = node.ancestor.get_children_keys()
            if len(ch_keys) == 1:
                ch_node = node.ancestor.get_node(ch_keys[0])
                node.ancestor.remove_node(ch_keys[0])
                node.ancestor.item = node.ancestor.item + ch_node.item
                if ch_node.leaf:
                    node.ancestor.leaf = True
                if not ch_node.is_children_empty():
                    node.ancestor.children = ch_node.children
        print("remove_word: word => {} deleted".format(word))

👲 The following is the full source code for a compact Trie data structure using Python:

CompactTrie
#
# Name:   CompactTrie
# Author: Bhaskar S
# Site:   https://www.polarsparc.com
#

from Trie.TrieNode import TrieNode


#
# CompactTrie Data Structure
#
class CompactTrie(object):
    def __init__(self):
        self._root = TrieNode()

    def insert_word(self, word):
        if word and len(word) > 0:
            # print("insert_word: word => {}".format(word))
            node = self._root
            i = 0
            while True:
                # print("insert_word: i => {}".format(i))
                letter = word[i:i+1].lower()
                # print("insert_word: letter => {}".format(letter))
                next_node = node.get_node(letter)
                if not next_node:
                    suffix_node = TrieNode()
                    suffix_node.item = word[i:].lower()
                    suffix_node.ancestor = node
                    node.add_node(suffix_node.item[0], suffix_node)
                    suffix_node.leaf = True
                    # print("insert_word: suffix_node => {}".format(suffix_node))
                    # print("insert_word: node => {}".format(node))
                    break
                else:
                    node = next_node

                    pos = 0
                    max_pos = min(len(node.item), len(word[i:]))
                    # print("insert_word: node.item => {}, word[i:] => {} max_pos => {}".format(node.item,
                    #                                                                           word[i:].lower,
                    #                                                                           max_pos))

                    # Move index forward as long as the letters match
                    while pos < max_pos and node.item[pos] == word[i+pos].lower():
                        pos += 1
                    # print("insert_word: node.item => {}, word => {}, pos => {}".format(node.item,
                    #                                                                    word[i:].lower(),
                    #                                                                    pos))

                    if len(node.item) == pos:
                        if len(word[i:]) == pos:
                            node.leaf = True
                            # print("insert_word: node => {}".format(node))
                            break
                        else:
                            # Consider the words - hunt and hunter. After the word hunt is inserted, when we go
                            # insert the word hunter, the prefix hunt would be common and we drop here
                            i += pos
                    elif len(node.item) > pos:
                        suffix_node = TrieNode()
                        suffix_node.item = node.item[pos:]
                        suffix_node.ancestor = node
                        if node.leaf:
                            suffix_node.leaf = True
                        if not node.is_children_empty():
                            suffix_node.children = node.children
                            node.clear()
                        node.add_node(suffix_node.item[0], suffix_node)
                        # print("insert_word: suffix_node => {}".format(suffix_node))
                        node.item = node.item[:pos]
                        node.leaf = False
                        if len(word[i:]) == pos:
                            node.leaf = True
                            # print("insert_word: node => {}".format(node))
                            break
                        else:
                            i += pos

            print("insert_word: word => {} inserted".format(word.lower()))

    def is_prefix(self, prefix) -> (bool, TrieNode):
        present = False
        node = None
        if prefix and len(prefix) > 0:
            present = True
            node = self._root
            i = 0
            while i < len(prefix):
                ch = prefix[i].lower()
                # print("is_prefix: ch => {}".format(ch))
                ch_node = node.get_node(ch)
                # print("is_prefix_present: node => {}, ch_node => {}".format(node, ch_node))
                if not ch_node:
                    present = False
                    break
                else:
                    if len(ch_node.item) == 1:
                        i += 1
                    elif len(ch_node.item) > 1:
                        # print("is_prefix: ch_node.item => {}".format(ch_node.item))
                        k = 0
                        while i < len(prefix) and k < len(ch_node.item):
                            if ch_node.item[k] == prefix[i].lower():
                                k += 1
                                i += 1
                            else:
                                present = False
                                break
                        if not present:
                            break
                    node = ch_node
        return present, node

    def is_word(self, word) -> (bool, TrieNode):
        (present, node) = self.is_prefix(word)
        if present and word.find(node.item) >= 0 and node.leaf:
            return True, node
        return False, None

    def remove_word(self, word):
        (present, node) = self.is_word(word)
        # print("remove_word: word => {}, present => {}, node => {}".format(word, present, node))
        if present:
            node.leaf = False
            if node.is_children_empty():
                node.ancestor.remove_node(node.item[0])
                ch_keys = node.ancestor.get_children_keys()
                if len(ch_keys) == 1:
                    ch_node = node.ancestor.get_node(ch_keys[0])
                    node.ancestor.remove_node(ch_keys[0])
                    node.ancestor.item = node.ancestor.item + ch_node.item
                    if ch_node.leaf:
                        node.ancestor.leaf = True
                    if not ch_node.is_children_empty():
                        node.ancestor.children = ch_node.children
            print("remove_word: word => {} deleted".format(word))

    def get_words(self) -> list:
        words = []
        node_stack = []
        prefix_stack = []
        prefix = ""
        node_stack.append(self._root)
        prefix_stack.append(prefix)
        while len(node_stack) > 0:
            node = node_stack.pop()
            prefix = prefix_stack.pop()
            keys = node.get_children_keys()
            # print("get_words: node.item => {}, keys => {}, prefix => {}".format(node.item, keys, prefix))
            if len(keys) > 0:
                for i in range(len(keys) - 1, -1, -1):
                    kn = node.get_node(keys[i])
                    pf = node.item if node.item else ""
                    # print("get_words: node.item => {}, pf => {}".format(node.item, pf))
                    node_stack.append(kn)
                    prefix_stack.append(prefix+pf)
                    if node.leaf:
                        words.append(prefix+pf)
            else:
                pf = node.item if node.item else ""
                # print("get_words: node.item => {}, pf => {}".format(node.item, pf))
                if node.leaf:
                    words.append(prefix + pf)
        return words

    def print_details(self):
        node_stack = []
        node = self._root
        print("--------------------------------------------------")
        while True:
            keys = node.get_children_keys()
            print("print_details: {}".format(node))
            for i in range(len(keys) - 1, -1, -1):
                kn = node.get_node(keys[i])
                node_stack.append(kn)
            if len(node_stack) == 0:
                break
            node = node_stack.pop()
        print("--------------------------------------------------")


#
# Main function
#
def main():
    trie = CompactTrie()

    trie.insert_word("ant")
    trie.insert_word("art")
    trie.insert_word("ball")
    trie.insert_word("bend")
    trie.insert_word("cart")
    trie.insert_word("car")
    trie.insert_word("hunt")
    trie.insert_word("hunter")
    trie.insert_word("hunted")
    trie.insert_word("hung")

    (present, _) = trie.is_prefix("ben")
    print("Is prefix ben present => {}".format(present))
    (present, _) = trie.is_prefix("bet")
    print("Is prefix bet present => {}".format(present))
    (present, _) = trie.is_prefix("hunt")
    print("Is prefix hunt present => {}".format(present))

    (present, _) = trie.is_word("ball")
    print("Is word ball present => {}".format(present))
    (present, _) = trie.is_word("car")
    print("Is word car present => {}".format(present))
    (present, _) = trie.is_word("care")
    print("Is word care present => {}".format(present))
    (present, _) = trie.is_word("cart")
    print("Is word cart present => {}".format(present))
    (present, _) = trie.is_word("hunter")
    print("Is word hunter present => {}".format(present))

    trie.remove_word("ball")
    trie.remove_word("ben")
    trie.remove_word("hunter")

    (present, _) = trie.is_prefix("ba")
    print("Is prefix ba present => {}".format(present))
    (present, _) = trie.is_prefix("ben")
    print("Is prefix ben present => {}".format(present))

    (present, _) = trie.is_word("art")
    print("Is word art present => {}".format(present))
    (present, _) = trie.is_word("ball")
    print("Is word ball present => {}".format(present))
    (present, _) = trie.is_word("hunter")
    print("Is word hunter present => {}".format(present))

    all_words = trie.get_words()
    print("All words from the trie => {}".format(all_words))

    trie.print_details()


#
# Invoke main()
#
main()

👲 Executing the above Python code will generate the following output:

Output.1

insert_word: word => ant inserted
insert_word: word => art inserted
insert_word: word => ball inserted
insert_word: word => bend inserted
insert_word: word => cart inserted
insert_word: word => car inserted
insert_word: word => hunt inserted
insert_word: word => hunter inserted
insert_word: word => hunted inserted
insert_word: word => hung inserted
Is prefix ben present => True
Is prefix bet present => False
Is prefix hunt present => True
Is word ball present => True
Is word car present => True
Is word care present => False
Is word cart present => True
Is word hunter present => True
remove_word: word => ball deleted
remove_word: word => hunter deleted
Is prefix ba present => False
Is prefix ben present => True
Is word art present => True
Is word ball present => False
Is word hunter present => False
All words from the trie => ['ant', 'art', 'bend', 'car', 'cart', 'hunt', 'hunted', 'hung']
--------------------------------------------------
print_details: item => None, children => ['a', 'b', 'c', 'h'], leaf => False
print_details: item => a, children => ['n', 'r'], leaf => False
print_details: item => nt, children => [], leaf => True
print_details: item => rt, children => [], leaf => True
print_details: item => bend, children => [], leaf => True
print_details: item => car, children => ['t'], leaf => True
print_details: item => t, children => [], leaf => True
print_details: item => hun, children => ['t', 'g'], leaf => False
print_details: item => t, children => ['e'], leaf => True
print_details: item => ed, children => [], leaf => True
print_details: item => g, children => [], leaf => True
--------------------------------------------------

👧 👦 This is just AWESOME !!! Thanks for explaining the Compact Trie data structure Charlie.

👲 You welcome. Oh crap !!! Am late for my next meeting, have to run !!!


References

Introduction to Trie Data Structure - Part 1



© PolarSPARC