WO2015010508A1

WO2015010508A1 - One-dimensional linear space-based method for implementing trie tree dictionary storage and management

Info

Publication number: WO2015010508A1
Application number: PCT/CN2014/080176
Authority: WO
Inventors: 贾西贝; 王国印
Original assignee: 深圳市华傲数据技术有限公司
Priority date: 2013-07-03
Filing date: 2014-06-18
Publication date: 2015-01-29
Also published as: CN103365991A; CN103365991B

Abstract

A one-dimensional linear space-based method for implementing trie tree dictionary storage and management. The method comprises the following steps: acquiring complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and establishing a trie tree to implement storage of the dictionary data. The method employs a one-dimensional array instead of a double array (base[] and check[]), which makes serialization and deserialization of the trie tree more convenient and faster and makes loading and storing a dictionary more efficient, while also solving the problem of data movement or storage-space backtracking found in trie tree dictionary storage methods implemented with a double array.

Description

Dictionary storage management method implementing a Trie tree based on a one-dimensional linear space

The present invention relates to dictionary storage management methods, and more particularly to a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space.

Background art

In information retrieval and natural language processing, and especially in dictionary-based applications, dictionaries are generally very large, with thousands of records or many more; the inverted index of a search engine is a notable example. Such massive dictionaries are currently stored using index data structures. Commonly used index structures include linear index tables, inverted tables, hash tables, and search trees. Popular Trie tree implementations are generally based on a double array. The two arrays are named base[] and check[]. The subscript i of each element is equivalent to a node number of the Trie tree, or to a storage location in the double array, and is also known as the state number.

base[i]: stores the minimal collision-free offset from the current state i to all of its successor states;

check[i]: stores the direct predecessor of the current state i, that is, the state from which the current state was reached;

base and check come in pairs;

base[i] and check[i] represent attributes of the same state.

However, the double-array implementation of the Trie tree has a problem: the conflict caused by inserting a new state forces a large amount of conflicting dictionary data to be moved, which not only slows dictionary storage but can also lead to data movement or storage-space backtracking.

Summary of the invention

To remedy the defects of the foregoing technology, an embodiment of the present invention provides a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space. The invention uses a one-dimensional array instead of a double array (base[] and check[]). This makes Trie tree serialization and deserialization more convenient, which in turn makes loading and storing the dictionary more efficient. The invention also solves the problem of data movement or storage-space backtracking present in double-array Trie tree dictionary storage methods.

To this end, the embodiment of the invention discloses a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space. The method comprises the steps of: obtaining complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and creating a Trie tree to implement storage of the dictionary data.

In the embodiment of the present invention, a one-dimensional array is used instead of a double array (base[] and check[]). This makes Trie tree serialization and deserialization easier, which in turn makes loading the dictionary easier. Specifically, the base array can be placed in the even-numbered positions of the one-dimensional array and the check array in the odd-numbered positions. The correspondence is as follows:

base[i] -> array[2*i];

check[i] -> array[2*i + 1].
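The correspondence above can be illustrated with a minimal Python sketch. It assumes base[] and check[] are equal-length lists of integers; the helper names interleave and split and the sample values are hypothetical and do not come from the source.

```python
def interleave(base, check):
    """Pack base[] and check[] into one list: base at even indices, check at odd."""
    if len(base) != len(check):
        raise ValueError("base and check must have the same length")
    array = [0] * (2 * len(base))
    array[0::2] = base    # base[i]  -> array[2*i]
    array[1::2] = check   # check[i] -> array[2*i + 1]
    return array


def split(array):
    """Recover (base, check) from the interleaved one-dimensional array."""
    return array[0::2], array[1::2]


# Hypothetical sample values, for illustration only.
base = [1, -1, 3]
check = [0, 1, 1]
array = interleave(base, check)
print(array)  # [1, 0, -1, 1, 3, 1]
```

Because both attributes of state i sit next to each other, a single contiguous buffer describes the whole tree.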

In an embodiment of the present invention, generating the ordered dictionary data includes:

sorting all terms and their attribute information in the dictionary data in lexicographic order by key;

merging the values of entries that have the same key.
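The two operations above can be sketched in Python as follows. This is only an illustration under stated assumptions: records are (key, value) pairs, "lexicographic order" is taken to be Python's code-point string order, and the function name make_ordered_dictionary and the sample records are hypothetical.

```python
def make_ordered_dictionary(records):
    """Sort (key, value) records by key and merge the values of identical keys."""
    merged = {}
    for key, value in records:
        merged.setdefault(key, []).append(value)
    keys = sorted(merged)                 # lexicographic (code point) order
    values = [merged[k] for k in keys]    # values[m] belongs to keys[m]
    return keys, values


# Hypothetical records; "a" appears twice and its values are merged.
records = [("bc", "n"), ("a", "art"), ("ab", "n"), ("a", "prep")]
keys, values = make_ordered_dictionary(records)
print(keys)    # ['a', 'ab', 'bc']
print(values)  # [['art', 'prep'], ['n'], ['n']]
```

After this step every key is unique, and the position of a key in the sorted list can serve as its sequence number.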

In an embodiment of the present invention, the handling of keys includes:

adding a virtual character after each key to represent a leaf node (terminal node). The storage location of each leaf node's virtual character is determined by the base value of its direct predecessor.

In the embodiment of the present invention, each key is allowed to carry its own attribute values (part of speech or other annotation information). Many current Trie tree implementations can store only keys and cannot directly associate a key with its attributes and explanatory information (collectively referred to as values). The solution of this method is to add a virtual character "$" after each key to represent the leaf node (terminal node): the original terminal node becomes a non-terminal node, and a terminal node "$" is added as its direct successor. The base value of each leaf node "$" (terminal node) is set to -m, the negation of the key's lexicographic sequence number m among all terms; m directly determines the storage location of the value corresponding to the key. The storage location of each leaf node "$" is determined directly by the base value of its direct predecessor, and the direct predecessor of a leaf node is defined as the leaf node itself; that is, the check value of a leaf node equals its own state number, which is its logical storage location.

In an embodiment of the present invention, the lexicographic ordering further ensures that:

keys with a common prefix are adjacent.

In an embodiment of the present invention, the information about a state includes:

a data structure Node is used to store the information of each state;

the information of each state contains: the current input character, the depth of the state, the sequence number of the first key that has the current state, the sequence number of the last key that has the current state, and the number of keys that have the current state.

In the embodiment of the present invention, to avoid the data movement caused by conflicts when a new state is inserted during Trie tree creation, all entries are required to be ordered by key, so that the information of all direct successor states of the current state can be obtained (such as the currently input character c, the depth of the state in the Trie tree, and the range of term positions covered by each direct successor state). For convenience, the present invention defines a data structure Node that stores the information of each state.

In one embodiment of the invention, creating a Trie tree includes the following steps:

defining a starting state, numbered 0;

placing the starting state at position 0 of the double array;

taking the starting state as the current state;

obtaining the information of all direct successor states of the current state;

finding a suitable base value for the current node and inserting all of its direct successor nodes.

The dictionary storage management method implementing a Trie tree based on a one-dimensional linear space makes Trie tree serialization and deserialization more convenient and improves the efficiency of loading and storing dictionary data. The invention also overcomes the problem of data movement or storage-space backtracking present in double-array Trie tree dictionary storage methods.

It is to be understood that the foregoing general description

Drawings

FIG. 1 is a flowchart of a dictionary storage management method implementing a Trie tree based on a one-dimensional linear space according to an embodiment of the present invention.

FIG. 2 is a structural diagram of the forest formed by the prefix trees of the dictionary data in the embodiment of the present invention.

FIG. 3 is a schematic diagram of a flow of implementing dictionary storage of a Trie tree in an embodiment of the present invention.

FIG. 4 is a schematic diagram of a flow of inserting a node when a Trie tree is created in an embodiment of the present invention.

Detailed description

The present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

A dictionary storage management method implementing a Trie tree based on a one-dimensional linear space according to an embodiment of the present invention includes the following steps: acquiring complete dictionary data; generating ordered dictionary data and storing the data in a one-dimensional array; and creating a Trie tree to implement storage of the dictionary data.

FIG. 1 shows a flowchart of the dictionary storage management method implementing a Trie tree based on a one-dimensional linear space according to an embodiment of the present invention. The method includes the following steps:

Step S110, obtaining complete dictionary data. For example, suppose the dictionary is to store the keys listed in Table 1 below:

Table 1

Step S120, generating ordered dictionary data and storing the data in a one-dimensional array.

All terms and attribute information in the dictionary data are sorted in lexicographic order by key, and the values of entries with the same key are merged. The ordering also places keys with a common prefix next to each other. In the dictionary data obtained above, some words share common prefixes, and the resulting prefix trees together form a forest. As shown in FIG. 2, the nodes of each tree are as follows:

a dotted circle represents a terminal node of the tree;

a solid circle represents a non-terminal node of the tree;

the word formed from the root node of the tree to a terminal node is a complete entry in the dictionary;

the word formed from the root node of the tree to a non-terminal node is the common prefix of certain terms in the dictionary.

It can be seen that constructing the Trie tree requires storing the direct predecessor information of the current state. Popular Trie tree implementations are generally based on a double array whose two arrays are named base[] and check[]. The subscript i of each element is equivalent to a node number of the Trie tree, or to a storage location in the double array, and is also known as the state number.

base[i]: stores the minimal collision-free offset from the current state i to all of its successor states.

check[i]: stores the direct predecessor of the current state i, that is, the state from which the current state was reached.

base and check come in pairs; base[i] and check[i] represent attributes of the same state.

If the current state is s, the input character is c, and the next state is t, the query process must satisfy the following constraints:

check[base[s] + c] = s;

base[s] + c = t.
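The constraint can be exercised with a small Python sketch. The function name transition and the toy arrays are hypothetical; the sketch also assumes that -1 marks an unused check slot, because 0 is itself a valid state number under this form of the constraint.

```python
def transition(base, check, s, c):
    """Return the next state t for state s and input code c, or None if there is no such edge."""
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return t
    return None


# Hand-built toy arrays (hypothetical): state 0 has base 1 and two edges,
# on codes 1 and 3, leading to states 2 and 4. -1 marks an unused check slot.
base = [1, 0, 0, 0, 0]
check = [-1, -1, 0, -1, 0]

print(transition(base, check, 0, 1))  # 2
print(transition(base, check, 0, 3))  # 4
print(transition(base, check, 0, 2))  # None
```

A failed check means that the slot base[s] + c is either empty or owned by a different predecessor, so the input is rejected without any search.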

In one embodiment of the invention, a virtual character is added after each key to represent a leaf node (terminal node), and the storage location of each leaf node's virtual character is determined by the base value of its direct predecessor. Each key is allowed to carry its own attribute values (part of speech or other annotation information). Many current Trie tree implementations can store only keys and cannot directly associate a key with its attributes or explanatory information (collectively referred to as values). The solution of this method is to add a virtual character "$" after each key to represent the leaf node (terminal node): the original terminal node becomes a non-terminal node, and a terminal node "$" is added as its direct successor. The base value of each leaf node "$" (terminal node) is set to -m, the negation of the key's lexicographic sequence number m among all terms; m directly determines the storage location of the value corresponding to the key. The storage location of each leaf node "$" is determined directly by the base value of its direct predecessor, and the direct predecessor of a leaf node is defined as the leaf node itself; that is, the check value of a leaf node equals its own state number, which is its logical storage location.

A one-dimensional array is used instead of a double array (base[] and check[]). This makes Trie tree serialization and deserialization easier, which in turn makes loading the dictionary easier. Specifically, the base array can be placed in the even-numbered positions of the one-dimensional array and the check array in the odd-numbered positions. The correspondence is as follows:

base[i] -> array[2*i];

check[i] -> array[2*i + 1].

That is, the even-numbered positions of the one-dimensional array store the base values of the double array, and the odd-numbered positions store the check values of the double array.
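To illustrate why a single array is convenient to serialize, the following Python sketch writes the interleaved array to bytes and reads it back. The byte layout (an unsigned 32-bit element count followed by little-endian signed 32-bit integers) is an assumption made for this example; the source does not specify a storage format. The function names serialize and deserialize are likewise hypothetical.

```python
import struct


def serialize(array):
    """Encode the interleaved array as an element count followed by little-endian int32 values."""
    return struct.pack(f"<I{len(array)}i", len(array), *array)


def deserialize(buf):
    """Decode bytes produced by serialize() back into a list of integers."""
    (n,) = struct.unpack_from("<I", buf, 0)
    return list(struct.unpack_from(f"<{n}i", buf, 4))


# Hypothetical interleaved array: base values at even indices, check values at odd indices.
array = [1, 0, -1, 1, 3, 1]
blob = serialize(array)
print(len(blob))                   # 28 (4-byte count + 6 * 4 bytes)
print(deserialize(blob) == array)  # True
```

With two separate arrays, the same round trip would need two blocks and extra bookkeeping to keep them aligned; with one array, one block suffices.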

To avoid the data movement caused by conflicts when a new state is inserted during Trie tree creation, all entries are required to be ordered by key, so that the information of all direct successor states of the current state can be obtained (such as the currently input character c, the depth of the state in the Trie tree, and the range of term positions covered by each direct successor state). For convenience, embodiments of the present invention define a data structure Node that stores the information of each state; it is used to hold the information of newly inserted states while the Trie tree is being created. The main stored fields are:

code: stores the current input character c, which can be the Unicode value or the byte value of c. To avoid clashing with the virtual terminal node "$" (whose code is 0), this method defines each code as the Unicode value of the character c plus 1;

depth: the depth of the current state in the Trie tree plus 1, that is, the depth in the Trie tree at which its direct successors lie (the root node of the Trie tree is the initial node, defined as layer 0);

start: the sequence number of the first key that has the current state;

end: the sequence number following that of the last key that has the current state;

end - start: the number of keys that have the current state, that is, the keys that share this common prefix.
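The Node structure and the way direct successor states can be read off a sorted key list might be sketched in Python as below. This is an illustration, not the patented implementation: the function name successors and the sample keys are hypothetical, sequence numbers are 0-based list indices here, and depth is taken to be the number of characters consumed so far, so that the starting state matches the values [code = 0, depth = 0, start = 0, end = N] given in step S132.

```python
from dataclasses import dataclass


@dataclass
class Node:
    code: int    # character code + 1; 0 is reserved for the virtual terminal "$"
    depth: int   # number of characters consumed to reach this state
    start: int   # index of the first key that passes through this state
    end: int     # index one past the last such key


def successors(keys, parent):
    """List the direct successor states of `parent`; `keys` must be sorted and unique."""
    children = []
    for i in range(parent.start, parent.end):
        key = keys[i]
        if len(key) < parent.depth:
            continue  # parent is itself a terminal "$" node: no successors
        code = ord(key[parent.depth]) + 1 if len(key) > parent.depth else 0
        if children and children[-1].code == code:
            children[-1].end = i + 1      # same character: extend the sibling's key range
        else:
            children.append(Node(code, parent.depth + 1, i, i + 1))
    return children


keys = ["a", "ab", "abc", "b", "bc"]      # hypothetical sample keys
root = Node(code=0, depth=0, start=0, end=len(keys))
for child in successors(keys, root):
    print(child)
```

Because keys with a common prefix are adjacent, each successor covers one contiguous range [start, end) of the key list, and a single pass over the parent's range finds all of them.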

Step S130, creating a Trie tree to implement storage of dictionary data.

As shown in FIG. 3, the process of storing dictionary data in the Trie tree specifically includes the following steps:

Step S131, sorting all terms and attribute information in lexicographic order by key and merging the values of entries with the same key, so that no key is duplicated;

Step S132, defining a starting state, numbered 0, and the information value thereof is [code = 0, depth = 0, start = 0, end = N], where N is the size of the dictionary, that is, the number of keys;

Step S133, placing the starting state at position 0 of the double array, setting base[0]=1 (array[2*0]=array[0]=1) and marking the base value 1 as occupied (to ensure that the base values of all states are unique), and setting check[0]=0 (array[2*0+1]=array[1]=0);

Step S134, taking the initial state as the current state;

Step S135, obtaining the information of all direct successor states of the current state. If the list of direct successor nodes is empty, the current node is a terminal node "$", which indicates that the key formed from the starting node to the current node is exactly one complete entry in the dictionary; the base value of the current node (terminal node) is set to the negation of the current key's lexicographic sequence number, and the current recursion returns; otherwise, step S136 is performed;

Step S136, searching for a suitable base value for the current node, such that the base value is unique and none of the direct successor nodes collides with a node already stored in the Trie tree. The direct successor nodes of the current node are inserted into the Trie tree in order, with their check values set to the base value of the current node; each direct successor node is then taken in turn as the current node, and the process returns to step S135.

As shown in FIG. 4, new nodes are inserted into the Trie tree in the following order: all direct successor nodes of the current node are inserted first, and then each direct successor node is taken in turn as the current node and the insertion is performed recursively. When a node has no successor node, that is, when the current node is a terminal node "$", the current recursion returns. Once all nodes have been inserted, creation of the Trie tree is complete. If all nodes (including the terminal node "$") are numbered in the order of insertion.
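Steps S132 to S136 can be sketched end to end in Python as follows. This is a simplified illustration under several assumptions, not the patented implementation: all names are hypothetical; keys are non-empty in number, sorted, and unique; the "suitable" base value is found by a plain linear search from 1; the sequence number m is counted from 1 so that -m is always negative; and, following step S136, the check value of a successor holds the base value of its predecessor rather than the predecessor's state number.

```python
def successors(keys, depth, start, end):
    """Direct successors of a state as [code, start, end] ranges over sorted unique keys."""
    out = []
    for i in range(start, end):
        if len(keys[i]) < depth:
            continue                      # the state is a terminal "$": no successors
        code = ord(keys[i][depth]) + 1 if len(keys[i]) > depth else 0
        if out and out[-1][0] == code:
            out[-1][2] = i + 1
        else:
            out.append([code, i, i + 1])
    return out


def build(keys):
    """Build the interleaved array (base at 2*i, check at 2*i + 1) from sorted unique keys."""
    if not keys:
        raise ValueError("at least one key is required")
    array = [0, 0]                        # state 0: base[0] is set below, check[0] = 0
    used_bases = set()

    def is_free(pos):
        return 2 * pos + 1 >= len(array) or array[2 * pos + 1] == 0

    def insert(pos, depth, start, end):
        children = successors(keys, depth, start, end)
        if not children:                  # terminal "$": base = -m, with m counted from 1
            array[2 * pos] = -(start + 1)
            return
        b = 1                             # smallest unused base whose target slots are all free
        while b in used_bases or not all(is_free(b + c) for c, _, _ in children):
            b += 1
        used_bases.add(b)
        array[2 * pos] = b
        top = b + children[-1][0]         # codes are ascending, so the last one is the largest
        if len(array) < 2 * (top + 1):
            array.extend([0] * (2 * (top + 1) - len(array)))
        for c, _, _ in children:          # insert every direct successor first
            array[2 * (b + c) + 1] = b    # check = base value of the current node
        for c, s, e in children:          # then recurse into each successor in turn
            insert(b + c, depth + 1, s, e)

    insert(0, 0, 0, len(keys))
    return array


def lookup(array, key):
    """Return the sequence number m of `key` (counted from 1), or None if it is absent."""
    b = array[0]
    for code in [ord(ch) + 1 for ch in key] + [0]:   # the final 0 is the virtual "$"
        t = b + code
        if 2 * t + 1 >= len(array) or array[2 * t + 1] != b:
            return None
        b = array[2 * t]
    return -b


keys = ["a", "ab", "abc", "b", "bc"]      # hypothetical sample keys, already sorted
array = build(keys)
print(lookup(array, "ab"))   # 2
print(lookup(array, "c"))    # None
```

Since a base value is chosen only after all successors of a node are known, no previously placed node ever has to be relocated, which is the point the description makes about avoiding data movement.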

Embodiments of the present invention provide a dictionary storage management method implementing a Trie tree based on a one-dimensional linear space, the method comprising the steps of: acquiring complete dictionary data; generating ordered dictionary data and storing the data in a one-dimensional array; and creating a Trie tree to implement storage of the dictionary data. The invention uses a one-dimensional array instead of a double array (base[] and check[]). This makes Trie tree serialization and deserialization more convenient, which in turn makes loading and storing the dictionary more efficient, and the invention solves the problem of data movement or storage-space backtracking present in double-array Trie tree dictionary storage methods.

Claims

Claim

A dictionary storage management method for implementing a Trie tree based on a one-dimensional linear space, characterized in that the method comprises the following steps:

acquiring complete dictionary data;

generating ordered dictionary data and storing it in a one-dimensional array;

creating a Trie tree to implement storage of the dictionary data.

The method according to claim 1, wherein generating the ordered dictionary data comprises: sorting all terms and attribute information in the dictionary data in lexicographic order by key;

merging the values of entries that have the same key.

The method according to claim 1, wherein, in the one-dimensional array, the even-numbered positions store the base values of a double array and the odd-numbered positions store the check values of the double array.

The method according to claim 1, wherein creating the Trie tree comprises the steps of: defining a starting state, numbered 0;

placing the starting state at position 0 of the double array;

taking the starting state as the current state;

obtaining the information of all direct successor states of the current state;

The method according to claim 1, wherein the lexicographic ordering further comprises:

The keys with the common prefix are adjacent.

The method according to claim 1 or claim 4, wherein the information about a state comprises: storing the information of each state using a data structure Node;

The method according to claim 1 or claim 6, wherein the handling of keys comprises: adding a virtual character after each key to represent a leaf node (terminal node), the storage location of each leaf node's virtual character being determined by the base value of its direct predecessor.