WO2015010508A1 - One-dimensional linear space-based method for implementing trie tree dictionary storage and management - Google Patents

One-dimensional linear space-based method for implementing trie tree dictionary storage and management

Info

Publication number
WO2015010508A1
Authority
WO
WIPO (PCT)
Prior art keywords
trie tree
state
key
array
node
Prior art date
Application number
PCT/CN2014/080176
Other languages
French (fr)
Chinese (zh)
Inventor
贾西贝
王国印
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2015010508A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees

Definitions

  • The present invention relates to a dictionary storage management method, and more particularly to a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space.
  • Dictionaries are generally very large, holding anywhere from tens of thousands to hundreds of millions of records; the inverted-index vocabularies of search engines are the largest of all.
  • Massive dictionaries are currently usually stored using index data structures.
  • Commonly used index structures include linear index tables, inverted tables, hash tables, and search trees.
  • The Trie tree implementations currently popular on the Internet are generally based on a double array. The two arrays are named base[] and check[], and the subscript i of each element corresponds to a node number of the Trie tree, or equivalently its storage location in the double array, also called the state number.
  • base[i] stores the smallest collision-free offset from the current state i to all of its successor states.
  • check[i] stores the direct predecessor of the current state i, that is, the state from which the current state was reached.
  • base[i] and check[i] represent attributes of the same state.
  • The double-array Trie dictionary storage method has a problem: a conflict caused by inserting a new state requires moving a large amount of conflicting dictionary data, which not only slows down dictionary storage but also leads to data movement or backtracking over the storage space.
  • an embodiment of the present invention provides a dictionary storage management method for implementing a Trie tree based on a one-dimensional linear space, and the present invention uses a one-dimensional array instead of a double array (base[] and check[]).
  • the method makes the Trie tree serialization and deserialization more convenient, which makes the loading and storage of the dictionary more efficient.
  • the present invention can solve the problem of backtracking of data movement or storage space existing in the dictionary data storage method of the dual array implementing the Trie tree.
  • the embodiment of the invention discloses a dictionary storage management method for implementing a Trie tree based on a one-dimensional linear space.
  • The method comprises the steps of: obtaining complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and creating a Trie tree to store the dictionary data.
  • A one-dimensional array is used instead of a double array (base[] and check[]).
  • This makes Trie tree serialization and deserialization simpler, which makes the dictionary easier to load.
  • Specifically, the base array can be placed at the even positions of the one-dimensional array and the check array at the odd positions, with the following correspondence: base[i] → array[2*i]; check[i] → array[2*i + 1].
  • the generating the ordered dictionary data includes:
  • the key includes:
  • The storage location of the virtual character of each leaf node is determined by the base value of its direct predecessor.
  • Each key is allowed to carry its own attribute values (part of speech or other annotation information).
  • Many current Trie tree implementations can store only keys and cannot directly associate a key with its attributes or explanatory information (collectively referred to as values).
  • This method solves the problem by appending a virtual character "$" to each key to represent a leaf node (terminal node), so that the original terminal node becomes a non-terminal node with a new terminal node "$" as its direct successor.
  • The base value of each leaf node "$" (terminal node) is set to -m, the negative of the lexicographic sequence number of its key (m is the lexicographic sequence number of the current key among all entries, and m directly determines the storage location of the values corresponding to that key).
  • The storage location of each leaf node "$" is determined directly by the base value of its direct predecessor, and the direct predecessor of a leaf node is defined as itself; that is, the check value of a leaf node equals its state number, which is its logical storage location.
  • the lexicographic ordering further includes:
  • the keys with the common prefix are adjacent.
  • the information about the status includes:
  • The information for each state contains: the current input character, the depth of the state, the number of the first key that has the current state, the number following that of the last key that has the current state, and the number of keys that have the current state.
  • When creating the Trie tree, to avoid the data movement caused by conflicts when a new state is inserted, all information must be ordered by key; only then can the information about all direct successor states of the current state be obtained (such as the currently input character c, the depth of the state in the Trie tree, and the range of entry positions covered by the direct successor state).
  • The present invention defines a data structure Node that stores the information for each state.
  • creating a Trie tree includes the following steps:
  • The dictionary storage management method that implements a Trie tree based on a one-dimensional linear space makes Trie tree serialization and deserialization more convenient and improves the efficiency of loading and storing dictionary data, and it overcomes the data movement or storage space backtracking problem found in the double-array Trie dictionary data storage method.
  • FIG. 1 is a flowchart of a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram of the forest formed by the prefix trees of the dictionary data in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of the flow for implementing Trie tree dictionary storage in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of the order in which nodes are inserted when the Trie tree is created in an embodiment of the present invention.
  • A dictionary storage management method that implements a Trie tree based on a one-dimensional linear space includes the following steps: obtaining complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and creating a Trie tree to store the dictionary data.
  • FIG. 1 is a flowchart of a dictionary storage management method that implements a Trie tree in a one-dimensional linear space according to an embodiment of the present invention; the method includes the following steps:
  • Step S110 obtaining complete dictionary data.
  • the dictionary should store these key data in Table 1 below:
  • Step S120 generating ordered dictionary data and storing the data in a one-dimensional array.
  • A dashed circle represents a terminal node of a tree.
  • A solid circle represents a non-terminal node of a tree.
  • The word formed from the root node of a tree to a terminal node is a complete entry in the dictionary.
  • The word formed from the root node of a tree to a non-terminal node is a common prefix of certain entries in the dictionary.
  • When constructing a Trie tree, it is therefore necessary to store the direct predecessor of the current state.
  • The Trie tree implementations currently popular on the Internet are generally based on a double array.
  • The two arrays are named base[] and check[].
  • The subscript i of each element in the arrays corresponds to a node number of the Trie tree,
  • or equivalently its storage location in the double array, also called the state number.
  • base[i] stores the smallest collision-free offset from the current state i to all of its successor states.
  • check[i] stores the direct predecessor of the current state i, that is, the state from which the current state was reached.
  • base and check are paired; base[i] and check[i] represent attributes of the same state.
  • A virtual character is appended to each key to represent a leaf node (terminal node).
  • The storage location of the virtual character of each leaf node is determined by the base value of its direct predecessor. Each key is allowed to carry its own attribute values (part of speech or other annotation information).
  • Many current Trie tree implementations can store only keys and cannot directly associate a key with its attributes or explanatory information (collectively referred to as values).
  • This method solves the problem by appending a virtual character "$" to each key to represent a leaf node (terminal node), so that the original terminal node becomes a non-terminal node with a new terminal node "$" as its direct successor.
  • The base value of each leaf node "$" (terminal node) is set to -m, the negative of the lexicographic sequence number of its key (m is the lexicographic sequence number of the current key among all entries, and m directly determines the storage location of the values corresponding to that key).
  • The storage location of each leaf node "$" is determined directly by the base value of its direct predecessor, and the direct predecessor of a leaf node is defined as itself; that is, the check value of a leaf node equals its state number, which is its logical storage location.
  • the one-dimensional array includes: the even-numbered bits of the one-dimensional array store the base value of the double-array, and the odd-numbered bits of the one-dimensional array store the check value of the double-array.
  • embodiments of the present invention define a data structure Node that stores information for each state that is used to store newly inserted state information when the Trie tree is created.
  • the main stored information descriptions include:
  • code stores the current input character c, which may be the Unicode value or the byte value of c. To avoid clashing with the virtual terminal node "$" (whose code is 0), this method defines each code as the Unicode value of the character c plus 1.
  • depth is the depth of the current state in the Trie tree plus 1, that is, the depth of its direct successors in the Trie tree (the root node of the Trie tree, i.e. the initial node, is defined as level 0).
  • end - start is the number of keys that have the current state, that is, the keys that share a common prefix.
  • Step S130 creating a Trie tree to implement storage of dictionary data.
  • As shown in FIG. 3, the process of storing the Trie tree dictionary data specifically includes the following steps. Step S131: all entries and attribute information are sorted in lexicographic order by key, and the values that share the same key are merged so that no key is duplicated;
  • Step S134: take the initial state as the current state;
  • Step S135: obtain the information about all direct successor states of the current state. If the list of direct successor nodes is empty, that is, the current node is the terminal node "$", then the key formed from the start node to the current node is exactly a complete entry in the dictionary; the base value of the current node (terminal node) is set to the negative of the current key's lexicographic sequence number, and processing of this path ends. Otherwise, step S136 is performed;
  • Step S136: search for a suitable base value for the current node, such that the base value is unique and none of the direct successor nodes collide with nodes already stored in the Trie tree.
  • The direct successor nodes of the current node are inserted into the Trie tree in sequence and their check values are set to the base value of the current node; each direct successor node is then taken in turn as the current node, and the process returns to step S135.
  • When inserting new nodes into the Trie tree, all direct successor nodes of the current node are inserted first, and then each direct successor node is taken in turn as the current node and the insertion is performed recursively.
  • When a node has no successor nodes, that is, the current node is the terminal node "$", the current recursion returns; once all nodes have been inserted, the Trie tree creation is complete. If all nodes (including the terminal node "$") are numbered in the order of insertion.
  • Embodiments of the present invention provide a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space, the method comprising the steps of: obtaining complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and creating a Trie tree to store the dictionary data.
  • The invention uses a one-dimensional array instead of a double array (base[] and check[]). This makes Trie tree serialization and deserialization more convenient and makes loading and storing the dictionary more efficient, and the invention solves the data movement or storage space backtracking problem found in the double-array Trie dictionary data storage method.

Abstract

A one-dimensional linear space-based method for implementing trie tree dictionary storage and management. The method comprises the following steps: acquiring complete dictionary data; generating ordered dictionary data and storing in a one-dimensional array; and, establishing a trie tree to implement storage of the dictionary data. The method employs the one-dimensional array instead of a dual array (base [] and check []), thus allowing for serialization and deserialization of the trie tree to be of increased degree of convenience and speed, and allowing for increased efficiency in loading and storage of a dictionary, while at the same time solving the problem of data movement or storage space backtracking found in a trie tree dictionary data storage method implemented with a dual array.

Description

A dictionary storage management method for implementing a Trie tree based on a one-dimensional linear space
Technical Field
The present invention relates to a dictionary storage management method, and more particularly to a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space.
Background
In the fields of information retrieval and natural language processing, and especially in dictionary-based applications, dictionaries are generally very large, holding anywhere from tens of thousands to hundreds of millions of records; the inverted-index vocabularies of search engines are the largest of all. Massive dictionaries are currently usually stored using index data structures. Commonly used index structures include linear index tables, inverted tables, hash tables, and search trees. The Trie tree implementations currently popular on the Internet are generally based on a double array. The two arrays are named base[] and check[], and the subscript i of each element corresponds to a node number of the Trie tree, or equivalently its storage location in the double array, also called the state number.
base[i]: stores the smallest collision-free offset from the current state i to all of its successor states;
check[i]: stores the direct predecessor of the current state i, that is, the state from which the current state was reached;
base and check are paired;
base[i] and check[i] represent attributes of the same state.
However, this double-array Trie dictionary storage method has a problem: a conflict caused by inserting a new state requires moving a large amount of conflicting dictionary data, which not only slows down dictionary storage but also leads to data movement or backtracking over the storage space.
Summary of the Invention
To address one of the above deficiencies, an embodiment of the present invention provides a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space. The invention uses a one-dimensional array instead of a double array (base[] and check[]). This makes serialization and deserialization of the Trie tree more convenient and makes loading and storing the dictionary more efficient; the invention also solves the data movement or storage space backtracking problem found in the double-array Trie dictionary data storage method.
To this end, an embodiment of the invention discloses a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space. The method comprises the following steps: obtaining complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and creating a Trie tree to store the dictionary data.
In an embodiment of the invention, a one-dimensional array is used instead of a double array (base[] and check[]). This makes Trie tree serialization and deserialization simpler and makes the dictionary easier to load. Specifically, the base array can be placed at the even positions of the one-dimensional array and the check array at the odd positions, with the following correspondence:
base[i] → array[2*i];
check[i] → array[2*i + 1].
In an embodiment of the invention, generating the ordered dictionary data includes:
Sorting all entries and attribute information in the dictionary data in lexicographic order by key;
Merging the values that share the same key.
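The ordering step can be illustrated with a minimal Python sketch. The function name make_ordered and the (key, value) tuple input are assumptions made for illustration and do not appear in the original; the sketch only shows sorting by key and merging the values that share a key.

```python
from collections import defaultdict


def make_ordered(entries):
    """Sort entries by key and merge the values that share a key.

    `entries` is an iterable of (key, value) pairs (hypothetical input shape).
    Returns (keys, values): keys is sorted and free of duplicates, and
    values[i] is the list of values merged under keys[i].
    """
    merged = defaultdict(list)
    for key, value in entries:
        merged[key].append(value)
    # Plain string sorting is lexicographic, so keys that share a common
    # prefix end up next to each other.
    keys = sorted(merged)
    values = [merged[k] for k in keys]
    return keys, values


keys, values = make_ordered(
    [("cat", "noun"), ("car", "noun"), ("cat", "verb"), ("ca", "abbr")]
)
```

The position of a key in the returned list can then serve as its lexicographic sequence number.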
In an embodiment of the invention, the key is handled as follows:
A virtual character is appended to each key to represent a leaf node (terminal node), and the storage location of the virtual character of each leaf node is determined by the base value of its direct predecessor.
In embodiments of the invention, each key is allowed to carry its own attribute values (part of speech or other annotation information). Many current Trie tree implementations can store only keys and cannot directly associate a key with its attributes or explanatory information (collectively referred to as values). This method solves the problem by appending a virtual character "$" to each key to represent a leaf node (terminal node), so that the original terminal node becomes a non-terminal node with a new terminal node "$" as its direct successor. The base value of each leaf node "$" (terminal node) is set to -m, the negative of the lexicographic sequence number of its key (m is the lexicographic sequence number of the current key among all entries, and m directly determines the storage location of the values corresponding to that key). The storage location of each leaf node "$" is determined directly by the base value of its direct predecessor, and the direct predecessor of a leaf node is defined as itself; that is, the check value of a leaf node equals its state number, which is its logical storage location.
In an embodiment of the invention, the lexicographic ordering further includes:
Keys that share a common prefix are adjacent.
In an embodiment of the invention, the state information includes:
A data structure Node is used to store the information for each state;
The information for each state contains: the current input character, the depth of the state, the number of the first key that has the current state, the number following that of the last key that has the current state, and the number of keys that have the current state.
In embodiments of the invention, to avoid the data movement caused by conflicts when a new state is inserted during creation of the Trie tree, all information must be ordered by key; only then can the information about all direct successor states of the current state be obtained (such as the currently input character c, the depth of the state in the Trie tree, and the range of entry positions covered by the direct successor state). For convenience, the invention defines a data structure Node that stores the information for each state.
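A minimal Python sketch of such a Node record, and of how the direct successors of a state could be read off the sorted key list, is given here. The field names code, depth, start and end follow the field descriptions in the detailed description; the function name fetch_children and the use of Python strings as keys are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    code: int   # 0 for the virtual terminal "$", otherwise ord(c) + 1
    depth: int  # depth of this state's direct successors (the root has 0)
    start: int  # number of the first key that passes through this state
    end: int    # one past the number of the last such key


def fetch_children(keys: List[str], parent: Node) -> List[Node]:
    """Return the direct successors of `parent`; keys must be sorted and unique."""
    children: List[Node] = []
    prev = -1
    for i in range(parent.start, parent.end):
        key = keys[i]
        if len(key) < parent.depth:
            continue  # parent is a terminal "$" node and has no successors
        # A key that ends exactly here contributes the virtual terminal "$".
        code = 0 if len(key) == parent.depth else ord(key[parent.depth]) + 1
        if code != prev:
            if children:
                children[-1].end = i  # close the previous sibling's key range
            children.append(Node(code, parent.depth + 1, i, parent.end))
        prev = code
    return children


keys = ["a", "ab", "b"]
root = Node(code=0, depth=0, start=0, end=len(keys))
level1 = fetch_children(keys, root)
```

Because the keys are sorted, each successor covers one contiguous range of key numbers, and end - start is the number of keys that share the corresponding prefix.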
In an embodiment of the invention, creating the Trie tree includes the following steps:
Define the start state, numbered 0;
Place the start state at position 0 of the double array;
Take the start state as the current state;
Obtain the information about all direct successor states of the current state;
Find a suitable base value for the current node and insert all of its direct successor nodes.
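The creation steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the names build_trie and lookup are hypothetical, a check value of 0 is taken to mark a free slot, each successor's check value stores the base value of its parent, the base search is a plain linear scan, and m is taken to be 1-based so that -m is always negative.

```python
def build_trie(keys):
    """Build base/check lists from sorted, unique keys (at least one key)."""
    base, check = [0], [0]  # slot 0 holds the start state
    used_bases = set()      # base values must be unique

    def grow(n):
        while len(base) <= n:
            base.append(0)
            check.append(0)

    def children(lo, hi, depth):
        # Direct successors as [code, lo, hi) ranges over the sorted keys.
        out = []
        for i in range(lo, hi):
            k = keys[i]
            code = 0 if len(k) == depth else ord(k[depth]) + 1
            if not out or out[-1][0] != code:
                if out:
                    out[-1][2] = i
                out.append([code, i, hi])
        return out

    def find_base(codes):
        b = 1
        while True:
            grow(b + codes[-1])  # codes are ascending because keys are sorted
            if b not in used_bases and all(check[b + c] == 0 for c in codes):
                return b
            b += 1

    def insert(lo, hi, depth):
        kids = children(lo, hi, depth)
        b = find_base([k[0] for k in kids])
        used_bases.add(b)
        for code, _, _ in kids:
            check[b + code] = b  # reserve every successor slot first
        for code, klo, khi in kids:
            if code == 0:
                base[b] = -(klo + 1)  # leaf "$": base = -m, m is 1-based
            else:
                base[b + code] = insert(klo, khi, depth + 1)
        return b

    base[0] = insert(0, len(keys), 0)
    return base, check


def lookup(base, check, key):
    """Return the 0-based number of `key`, or None if it is not stored."""
    b = base[0]
    for ch in key:
        p = b + ord(ch) + 1
        if p >= len(check) or check[p] != b:
            return None
        b = base[p]
    if b < len(check) and check[b] == b and base[b] < 0:
        return -base[b] - 1
    return None


keys = ["ca", "car", "cat", "dog"]
base, check = build_trie(keys)
```

Because every successor slot of a node is reserved before recursing into any successor, a later insertion never has to relocate nodes that were already placed. In this sketch a leaf sits at the slot equal to its parent's base value, so its check value equals its own slot number.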
The dictionary storage management method provided by embodiments of the invention, which implements a Trie tree based on a one-dimensional linear space, makes Trie tree serialization and deserialization more convenient and improves the efficiency of loading and storing dictionary data; the invention also overcomes the data movement or storage space backtracking problem found in the double-array Trie dictionary data storage method.
It should be understood that both the foregoing general description and the following detailed description are illustrative and exemplary, and are intended to provide further explanation of the invention as claimed.
Brief Description of the Drawings
FIG. 1 is a flowchart of a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space according to an embodiment of the present invention.
FIG. 2 is a structural diagram of the forest formed by the prefix trees of the dictionary data in an embodiment of the present invention.
FIG. 3 is a schematic diagram of the flow for implementing Trie tree dictionary storage in an embodiment of the present invention.
FIG. 4 is a schematic diagram of the order in which nodes are inserted when the Trie tree is created in an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
An embodiment of the invention provides a dictionary storage management method that implements a Trie tree based on a one-dimensional linear space. The method includes the following steps: obtaining complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and creating a Trie tree to store the dictionary data.
FIG. 1 is a flowchart of a dictionary storage management method that implements a Trie tree in a one-dimensional linear space according to an embodiment of the invention; the method includes the following steps:
Step S110: obtain complete dictionary data. For example, the dictionary is to store the entry (key) data shown in Table 1 below:
[Table 1: entry (key) data; image not reproduced]
Step S120: generate ordered dictionary data and store it in a one-dimensional array.
All entries and attribute information in the dictionary data are sorted in lexicographic order by key, and the values that share the same key are merged. At the same time, keys with a common prefix are made adjacent. In the dictionary data obtained above, some of the words share common prefixes, and the prefix trees formed from them make up a forest, as shown in FIG. 2. The nodes of each tree are to be read as follows:
A dashed circle represents a terminal node of a tree;
A solid circle represents a non-terminal node of a tree;
The word formed from the root node of a tree to a terminal node is a complete entry in the dictionary;
The word formed from the root node of a tree to a non-terminal node is a common prefix of certain entries in the dictionary.
It follows that, when constructing a Trie tree, the direct predecessor of the current state must be stored. The Trie tree implementations currently popular on the Internet are generally based on a double array. The two arrays are named base[] and check[], and the subscript i of each element corresponds to a node number of the Trie tree, or equivalently its storage location in the double array, also called the state number.
base[i]: stores the smallest collision-free offset from the current state i to all of its successor states.
check[i]: stores the direct predecessor of the current state i, that is, the state from which the current state was reached.
base and check are paired; base[i] and check[i] represent attributes of the same state.
If the current state is s, the input character is c, and the next state is t, the constraints on the lookup process are:
check[base[s] + c] = s;
base[s] + c = t.
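The two lookup constraints can be illustrated with a short Python sketch. The function names step and walk, the use of -1 to mark unused check slots, and the hand-made arrays are assumptions chosen for illustration; characters are represented directly by small integer codes.

```python
def step(base, check, s, c):
    """Follow the transition from state s on character code c.

    Returns the next state t = base[s] + c if check[t] == s, otherwise None.
    """
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return t
    return None


def walk(base, check, codes, s=0):
    """Apply step() for each code in turn, starting from state s."""
    for c in codes:
        s = step(base, check, s, c)
        if s is None:
            return None
    return s


# Hand-made arrays for the strings "a", "b" and "ab" with codes a=1, b=2.
# State 0 is the start state; -1 marks an unused check slot.
base = [1, 0, 3, 0, 0, 0]
check = [-1, -1, 0, 0, -1, 2]
state_after_ab = walk(base, check, [1, 2])
```

In this example the start state reaches state 2 on "a" (base[0] + 1 = 2 and check[2] = 0), and state 2 reaches state 5 on "b" (base[2] + 2 = 5 and check[5] = 2).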
In an embodiment of the invention, a virtual character is appended to each key to represent a leaf node (terminal node), and the storage location of the virtual character of each leaf node is determined by the base value of its direct predecessor. Each key is allowed to carry its own attribute values (part of speech or other annotation information). Many current Trie tree implementations can store only keys and cannot directly associate a key with its attributes or explanatory information (collectively referred to as values). This method solves the problem by appending a virtual character "$" to each key to represent a leaf node (terminal node), so that the original terminal node becomes a non-terminal node with a new terminal node "$" as its direct successor. The base value of each leaf node "$" (terminal node) is set to -m, the negative of the lexicographic sequence number of its key (m is the lexicographic sequence number of the current key among all entries, and m directly determines the storage location of the values corresponding to that key). The storage location of each leaf node "$" is determined directly by the base value of its direct predecessor, and the direct predecessor of a leaf node is defined as itself; that is, the check value of a leaf node equals its state number, which is its logical storage location.
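The leaf encoding can be sketched in Python as follows. The function name value_number, the constant TERMINAL_CODE, and the sample arrays are assumptions made for illustration; m is assumed to be 1-based so that -m is always negative, and the code of "$" is taken to be 0 as stated in the Node field description.

```python
TERMINAL_CODE = 0  # code of the virtual character "$"


def value_number(base, check, s):
    """Return m for the key that ends at state s, or None if no key ends there.

    The leaf "$" is located at t = base[s] + TERMINAL_CODE. It is recognised
    by check[t] == t (a leaf is its own direct predecessor) and base[t] < 0,
    and it stores base[t] = -m.
    """
    t = base[s] + TERMINAL_CODE
    if 0 <= t < len(base) and check[t] == t and base[t] < 0:
        return -base[t]
    return None


# Hand-made arrays: state 2 ends the key with sequence number m = 3,
# while state 1 is an inner state at which no key ends.
base = [0, 5, 4, 0, -3, 7]
check = [-1, -1, -1, -1, 4, 1]
values = [["noun"], ["verb"], ["noun", "verb"]]  # values[m - 1] belongs to key m

m = value_number(base, check, 2)
attrs = values[m - 1]
```

Storing -m in the leaf means the values themselves can live in a separate list ordered like the keys, and no pointer to them needs to be kept in the Trie arrays.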
A one-dimensional array is used in place of the double array (base[] and check[]). The benefit of this approach is that serialization and deserialization of the Trie tree become simpler, which makes the dictionary easier to load. Specifically, the base array can be placed at the even positions of the one-dimensional array and the check array at the odd positions, with the following correspondence:
base[i] -> array[2*i];
check[i] -> array[2*i + 1].
The one-dimensional array thus comprises: even positions that store the base values of the double array, and odd positions that store the check values of the double array.
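The even/odd layout can be sketched in Python as follows. This is only an illustration: the helper names interleave, get_base and get_check and the sample numbers are hypothetical, and the use of the standard array module with 32-bit signed integers ("i") is an assumption, not something the original text specifies.

```python
from array import array


def interleave(base, check):
    """Pack base[] and check[] into one flat array: base at even, check at odd."""
    flat = array("i")
    for b, c in zip(base, check):
        flat.append(b)   # array[2*i]     holds base[i]
        flat.append(c)   # array[2*i + 1] holds check[i]
    return flat


def get_base(flat, i):
    return flat[2 * i]


def get_check(flat, i):
    return flat[2 * i + 1]


base = [1, -1, 5, 0]     # sample values only
check = [0, 1, 1, 0]
flat = interleave(base, check)

# Serialization is one contiguous dump of a single array
# (byte order is that of the machine writing the data).
blob = flat.tobytes()
restored = array("i")
restored.frombytes(blob)
```

With a single array there is only one buffer to write and read, which is the simplification the paragraph above refers to.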
When the Trie tree is created, in order to avoid the data movement caused by conflicts when new states are inserted, all information must be ordered by key; only then can the information of all direct successor states of the current state be obtained (such as the current input character c, the depth of the state in the Trie tree, and the range of entry positions covered by each direct successor state). For convenience, this embodiment defines a data structure Node that stores the information of each state. It is used to hold newly inserted state information while the Trie tree is being created, and its main fields are:
code: the current input character c, which may be the Unicode value or the byte value of c. To avoid colliding with the virtual terminal node "$" (the code of a terminal node is 0), the method defines each code as the Unicode value of the character c plus 1;
depth: the depth of the current state in the Trie tree plus 1, that is, the depth of its direct successors in the Trie tree (the root node of the Trie tree, i.e. the initial node, is defined as level 0);
start: the number of the first key that has the current state;
end: the number following that of the last key that has the current state;
end - start: the number of keys that have the current state, that is, the keys that share a common prefix.
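A minimal Python sketch of such a Node and of collecting the direct successors of a state from the sorted keys is given below. It is an illustration under assumptions: the function name direct_successors is hypothetical, keys are Python strings, and depth is used as the index of the key character that leads to the direct successors, with the start state at depth 0.

```python
from dataclasses import dataclass


@dataclass
class Node:
    code: int    # Unicode value of the input character + 1; 0 for the terminal "$"
    depth: int   # index of the next key character to read (assumption, see above)
    start: int   # number of the first key that has this state
    end: int     # number following that of the last key that has this state


def direct_successors(keys, parent):
    """Group keys[parent.start:parent.end] by their character at parent.depth."""
    successors = []
    prev = None
    for i in range(parent.start, parent.end):
        key = keys[i]
        if len(key) < parent.depth:
            continue                     # parent is a terminal node: no successors
        # A key that ends here contributes the virtual "$" with code 0.
        cur = 0 if len(key) == parent.depth else ord(key[parent.depth]) + 1
        if cur != prev:
            if successors:
                successors[-1].end = i   # close the range of the previous sibling
            successors.append(Node(cur, parent.depth + 1, i, parent.end))
            prev = cur
    return successors


keys = ["a", "ab", "abc", "b"]           # must already be sorted and unique
root = Node(code=0, depth=0, start=0, end=len(keys))
level1 = direct_successors(keys, root)
level2 = direct_successors(keys, level1[0])
```

Because the keys are sorted, keys sharing a prefix are adjacent, so each successor is described by a contiguous range [start, end).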
Step S130: create the Trie tree to store the dictionary data.
Figure 3 shows the process for storing the Trie tree dictionary data, which comprises the following steps:
Step S131: sort all entries and attribute information in lexicographic order by key, and merge the values that share the same key, so that no key is duplicated;
Step S132: define the start state, numbered 0, with the information [code = 0, depth = 0, start = 0, end = N], where N is the size of the dictionary, that is, the number of keys;
Step S133: place the start state at position 0 of the double array, set base[0] = 1 (array[2*0] = array[0] = 1) and mark the base value 1 as occupied (to guarantee that the base values of all states are unique), and set check[0] = 0 (array[2*0+1] = array[1] = 0);
Step S134: take the start state as the current state;
Step S135: obtain the information of all direct successor states of the current state. If the list of direct successor nodes is empty, the current node is a terminal node "$", which means that the key formed from the start node to the current node is exactly one complete entry in the dictionary; set the base value of the current node (terminal node) to the negative of the lexicographic sequence number of the current key, and processing of this path is complete. Otherwise, go to step S136;
Step S136: find a suitable base value for the current node, such that the base value is unique and none of the direct successor nodes conflicts with a node already stored in the Trie tree. Insert the direct successor nodes of the current node into the Trie tree one by one, setting the check value of each to the base value of the current node; then take each direct successor node in turn as the current node and jump to step S135.
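Steps S131 to S136 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment itself: values are omitted, the sequence number m is counted from 1, a slot is treated as free while its check value is 0, candidate base values are tried upward from 1, and the names successors, build and lookup are hypothetical. The lookup function is included only to check the result.

```python
from dataclasses import dataclass


@dataclass
class Node:
    code: int
    depth: int
    start: int
    end: int


def successors(keys, parent):
    out, prev = [], None
    for i in range(parent.start, parent.end):
        key = keys[i]
        if len(key) < parent.depth:          # only happens for a terminal node
            continue
        cur = 0 if len(key) == parent.depth else ord(key[parent.depth]) + 1
        if cur != prev:
            if out:
                out[-1].end = i
            out.append(Node(cur, parent.depth + 1, i, parent.end))
            prev = cur
    return out


def build(keys):
    keys = sorted(set(keys))                 # S131 (merging of values omitted)
    arr = [1, 0]                             # S133: base[0] = 1, check[0] = 0
    used_bases = set()                       # keeps every base value unique

    def check_at(t):
        return arr[2 * t + 1] if 2 * t + 1 < len(arr) else 0

    def put(t, base=None, check=None):
        while len(arr) < 2 * (t + 1):        # grow the flat array on demand
            arr.extend([0, 0])
        if base is not None:
            arr[2 * t] = base
        if check is not None:
            arr[2 * t + 1] = check

    def insert(state, children):             # S136
        b = 1
        while b in used_bases or any(check_at(b + c.code) != 0 for c in children):
            b += 1
        used_bases.add(b)
        put(state, base=b)
        for c in children:                   # insert all direct successors first
            put(b + c.code, check=b)
        for c in children:                   # then treat each one as current node
            t = b + c.code
            grandchildren = successors(keys, c)   # S135
            if not grandchildren:
                put(t, base=-(c.start + 1))  # leaf "$": base = -m (m 1-based)
            else:
                insert(t, grandchildren)

    root = Node(0, 0, 0, len(keys))          # S132
    children = successors(keys, root)
    if children:
        insert(0, children)                  # S134: start state is current state
    return keys, arr


def lookup(arr, key):
    b = arr[0]
    for ch in key:
        t = b + ord(ch) + 1
        if 2 * t + 1 >= len(arr) or arr[2 * t + 1] != b:
            return None
        b = arr[2 * t]
    # Follow the virtual "$" (code 0): slot b, whose check equals its own number.
    if 2 * b + 1 < len(arr) and arr[2 * b + 1] == b and arr[2 * b] < 0:
        return -arr[2 * b]                   # sequence number m of the key
    return None


keys, arr = build(["ab", "abc", "b", "a"])
print([lookup(arr, k) for k in keys], lookup(arr, "ac"))
```

Since a leaf is reached with code 0, it lands on slot base + 0 and its check value equals its own state number, which matches the definition of the leaf node given earlier.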
As shown in Figure 4, new nodes are inserted into the Trie tree in the following order: the direct successor nodes of the current node are inserted one by one, and then each direct successor node is taken in turn as the current node and the insertion is performed recursively. If the current node has no successor node, that is, the current node is a terminal node "$", the current recursion returns. Once all nodes have been inserted, the creation of the Trie tree is complete. All nodes (including the terminal nodes "$") can be numbered in the order in which they are inserted.
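The insertion order described above can be traced with a short Python sketch. It is an illustration only: each node is identified by the path of characters leading to it, the function name insertion_order is hypothetical, and keys are assumed not to contain the literal character "$". The start state itself is not listed.

```python
def insertion_order(keys):
    """Return node paths in the order in which the nodes would be inserted."""
    keys = sorted(set(keys))
    order = []

    def visit(prefix, group):
        # Group the keys sharing `prefix` by their next character;
        # a key that ends here contributes the terminal node "$".
        children = {}
        for key in group:
            label = "$" if len(key) == len(prefix) else key[len(prefix)]
            children.setdefault(label, []).append(key)
        labels = list(children)           # sorted input keeps "$" first
        for label in labels:              # insert all direct successors first
            order.append(prefix + label)
        for label in labels:              # then recurse into each successor
            if label != "$":              # a terminal node ends the recursion
                visit(prefix + label, children[label])

    visit("", keys)
    return order


for number, path in enumerate(insertion_order(["ab", "a", "b"]), start=1):
    print(number, path)
```

For the keys "a", "ab" and "b" this lists the two successors of the start state first, then the successors of "a", and so on down each path.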
The embodiments of the present invention provide a dictionary storage management method for implementing a Trie tree based on a one-dimensional linear space, the method comprising the following steps: obtaining complete dictionary data; generating ordered dictionary data and storing it in a one-dimensional array; and creating a Trie tree to store the dictionary data. The invention uses a one-dimensional array in place of the double array (base[] and check[]). This makes serialization and deserialization of the Trie tree more convenient and makes loading and storing the dictionary more efficient. The invention also addresses the data movement and storage-space backtracking problems that exist in double-array Trie tree methods for storing dictionary data.

Claims

1. A dictionary storage management method for implementing a Trie tree based on a one-dimensional linear space, characterized in that the method comprises the following steps:
obtaining complete dictionary data;
generating ordered dictionary data and storing it in a one-dimensional array;
creating a Trie tree to store the dictionary data.
2. The method according to claim 1, characterized in that generating the ordered dictionary data comprises:
sorting all entries and attribute information in the dictionary data in lexicographic order by key;
merging the values that have the same key.
3. The method according to claim 1, characterized in that the one-dimensional array comprises: even positions of the one-dimensional array that store the base values of the double array, and odd positions of the one-dimensional array that store the check values of the double array.
4. The method according to claim 1, characterized in that creating the Trie tree comprises the following steps:
defining a start state, numbered 0;
placing the start state at position 0 of the double array;
taking the start state as the current state;
obtaining the information of all direct successor states of the current state;
finding a suitable base value for the current node and inserting all of its direct successor nodes.
5. The method according to claim 1, characterized in that the lexicographic sorting further comprises:
keys having a common prefix are adjacent.
6. The method according to claim 1 or claim 4, characterized in that the information of the state comprises:
storing the information of each state in a data structure Node;
the information of each state comprising: the current input character, the depth of the state, the number of the first key that has the current state, the number following that of the last key that has the current state, and the number of keys that have the current state.
7. The method according to claim 1 or claim 6, characterized in that the key comprises:
a virtual character appended to each key to represent a leaf node (terminal node), the storage location of each leaf node's virtual character being determined by the base value of its direct predecessor.
PCT/CN2014/080176 2013-07-03 2014-06-18 One-dimensional linear space-based method for implementing trie tree dictionary storage and management WO2015010508A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310288785.2A CN103365991B (en) 2013-07-03 2013-07-03 A kind of dictionaries store management method realizing Trie tree based on one-dimensional linear space
CN201310288785.2 2013-07-03

Publications (1)

Publication Number Publication Date
WO2015010508A1 true WO2015010508A1 (en) 2015-01-29

Family

ID=49367332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/080176 WO2015010508A1 (en) 2013-07-03 2014-06-18 One-dimensional linear space-based method for implementing trie tree dictionary storage and management

Country Status (2)

Country Link
CN (1) CN103365991B (en)
WO (1) WO2015010508A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365991B (en) * 2013-07-03 2017-03-08 深圳市华傲数据技术有限公司 A kind of dictionaries store management method realizing Trie tree based on one-dimensional linear space
CN107153647B (en) * 2016-03-02 2021-12-07 北京字节跳动网络技术有限公司 Method, apparatus, system and computer program product for data compression
CN107239549A (en) * 2017-06-07 2017-10-10 传神语联网网络科技股份有限公司 Method, device and the terminal of database terminology retrieval
CN110019682B (en) * 2017-12-28 2022-12-27 北京京东尚科信息技术有限公司 System, method and apparatus for processing information
CN108153907B (en) * 2018-01-18 2021-01-22 中国计量大学 Dictionary storage management method for realizing space optimization through 16-bit Trie tree
CN108399152B (en) * 2018-02-06 2021-05-07 中国科学院信息工程研究所 Compression representation method, system, storage medium and rule matching device for digital search tree
CN109933656B (en) * 2019-03-15 2023-08-15 深圳市赛为智能股份有限公司 Public opinion polarity prediction method, public opinion polarity prediction device, computer equipment and storage medium
CN111680489B (en) * 2020-06-10 2021-11-19 腾讯科技(深圳)有限公司 Target text matching method and device, storage medium and electronic equipment
CN112612427B (en) * 2020-12-30 2022-11-25 北京优挂信息科技有限公司 Vehicle stop data processing method and device, storage medium and terminal

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786962A (en) * 2005-12-21 2006-06-14 中国科学院计算技术研究所 Method for managing and searching dictionary with perfect even numbers group TRIE Tree
CN101365131A (en) * 2008-08-19 2009-02-11 华亚微电子(上海)有限公司 Simplified code table for variable length decoding by AVS video decoder suitable for VLSI implementation and implementing method
CN101788990A (en) * 2009-01-23 2010-07-28 北京金远见电脑技术有限公司 Global optimization and construction method and system of TRIE double-array
US20110055233A1 (en) * 2009-08-25 2011-03-03 Lutz Weber Methods, Computer Systems, Software and Storage Media for Handling Many Data Elements for Search and Annotation
CN103365992A (en) * 2013-07-03 2013-10-23 深圳市华傲数据技术有限公司 Method for realizing dictionary search of Trie tree based on one-dimensional linear space
CN103365991A (en) * 2013-07-03 2013-10-23 深圳市华傲数据技术有限公司 Method for realizing dictionary memory management of Trie tree based on one-dimensional linear space

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7096235B2 (en) * 2003-03-27 2006-08-22 Sand Technology Systems International, Inc. Computer implemented compact 0-complete tree dynamic storage structure and method of processing stored data
WO2005024659A1 (en) * 2003-08-11 2005-03-17 France Telecom Trie memory device with a circular pipeline mechanism
CN101499094B (en) * 2009-03-10 2010-09-29 焦点科技股份有限公司 Data compression storing and retrieving method and system
JP5262864B2 (en) * 2009-03-10 2013-08-14 富士通株式会社 Storage medium, search method and search device
GB2494337A (en) * 2010-05-28 2013-03-06 Securitymetrics Inc Systems and methods for determining whether data includes strings that correspond to sensitive information
CN102651026B (en) * 2012-04-01 2015-02-18 百度在线网络技术(北京)有限公司 Method for optimizing word segmentation of search engine through precomputation and word segmenting device of search engine

Also Published As

Publication number Publication date
CN103365991A (en) 2013-10-23
CN103365991B (en) 2017-03-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14829424

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14829424

Country of ref document: EP

Kind code of ref document: A1