CN109684439A - Method and device for prefix indexing during word segmentation - Google Patents

Method and device for prefix indexing during word segmentation

Info

Publication number
CN109684439A
Authority
CN
China
Prior art keywords
word
retrieved
double
hash
prefix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811622746.0A
Other languages
Chinese (zh)
Other versions
CN109684439B (en
Inventor
谭峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Language Network (wuhan) Information Technology Co Ltd
Original Assignee
Language Network (wuhan) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Language Network (wuhan) Information Technology Co Ltd filed Critical Language Network (wuhan) Information Technology Co Ltd
Priority to CN201811622746.0A priority Critical patent/CN109684439B/en
Publication of CN109684439A publication Critical patent/CN109684439A/en
Application granted granted Critical
Publication of CN109684439B publication Critical patent/CN109684439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention provides a method and a device for prefix indexing during word segmentation. The method comprises: splitting dictionary data based on an improved hash algorithm SDBMHash and storing the split data in a plurality of double-array Trie trees; performing hash calculation on a word to be retrieved using the improved hash algorithm SDBMHash, and determining, according to the result of the hash calculation, the double-array Trie tree in which the word to be retrieved is located; and performing prefix indexing on the word to be retrieved in that double-array Trie tree. The embodiment keeps prefix indexing efficient during word segmentation even when the dictionary is split across multiple double-array Trie trees.

Description

Method and device for indexing prefix in word segmentation process
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to a method and a device for indexing a prefix in a word segmentation process.
Background
A double-array Trie tree (Double-Array Trie) is a Trie tree with low space complexity, mainly used in the field of information retrieval to build word segmentation dictionaries. It combines the fast access of arrays with the compact storage of linked structures. The double-array Trie tree supports prefix indexing, that is, the tree can be searched to determine whether it contains other words that have a given word as a prefix.
Word segmentation is the decomposition of a sentence into words. In this scenario the double-array Trie tree is used to decompose a sentence into words that exist in the tree. During word segmentation, prefix queries are performed in the double-array Trie tree for the words of the sentence in sequence, and for each word it is determined whether the tree contains other words that have this word as a prefix; the sentence is segmented on that basis.
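The prefix query described above can be illustrated with a minimal sketch. It does not implement a double-array Trie tree; a sorted Python list searched with `bisect` stands in for the tree, and the function name `has_longer_word_with_prefix` and the sample dictionary are hypothetical choices made only for this illustration.

```python
import bisect

def has_longer_word_with_prefix(sorted_words, prefix):
    """Return True if some word other than `prefix` itself starts with `prefix`.

    `sorted_words` must be sorted. Words sharing a prefix are contiguous in
    sorted order, so it is enough to inspect the first entry after `prefix`.
    """
    i = bisect.bisect_right(sorted_words, prefix)
    return i < len(sorted_words) and sorted_words[i].startswith(prefix)

# Hypothetical toy dictionary, used only for illustration.
words = sorted(["中", "中国", "中国人", "人民"])
for query in ["中国", "中国人", "人", "民"]:
    print(query, has_longer_word_with_prefix(words, query))
```

A segmenter would typically keep extending the current candidate word while this check succeeds and emit a word once it fails.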
The double-array Trie tree improves the retrieval efficiency of the dictionary, but because of the way the two arrays store data, all dictionary data forms a single whole. In practical applications the dictionary tends to be very large, which makes loading, persisting and storing it inefficient, time-consuming and difficult to manage. To solve the problem of an overly large dictionary, the dictionary data can be split and stored across a plurality of double-array Trie trees in a distributed manner. However, when prefix indexing is performed during word segmentation, each double-array Trie tree then has to be searched in turn for the word, and retrieval efficiency drops sharply.
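The cost of the naive approach can be sketched as follows. This is an illustration only: each tree is modeled as a plain Python set (an assumption, not a double-array Trie tree), and `naive_prefix_lookup` is a hypothetical name. When no match exists, the number of trees probed equals the number of trees k.

```python
def naive_prefix_lookup(shards, word):
    """Scan every shard in turn; return (found, number_of_shards_probed)."""
    probes = 0
    for shard in shards:
        probes += 1
        # "Other words with `word` as a prefix": exclude the word itself.
        if any(w != word and w.startswith(word) for w in shard):
            return True, probes
    return False, probes

# Hypothetical split of a toy dictionary into k = 3 shards.
shards = [{"apple"}, {"banana"}, {"cherry", "cherries"}]
print(naive_prefix_lookup(shards, "cher"))
print(naive_prefix_lookup(shards, "zzz"))
```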
Therefore, it is desirable to provide a method that can ensure efficient prefix indexing in word segmentation in an application scenario supporting splitting a dictionary into multiple double-array Trie trees.
Disclosure of Invention
Embodiments of the present invention provide a method and an apparatus for performing prefix indexing in a word segmentation process, which overcome the above problems or at least partially solve the above problems.
In a first aspect, an embodiment of the present invention provides a method for performing prefix indexing in a word segmentation process, including:
splitting dictionary data based on an improved hash algorithm SDBMHash and storing the split dictionary data into a plurality of double-array Trie trees;
performing hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash, and determining a double-array Trie tree where the word to be retrieved is located according to the result of the hash calculation;
and performing prefix index on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
In a second aspect, an embodiment of the present invention provides an apparatus for performing prefix indexing in a word segmentation process, including:
the grouping module is used for splitting the dictionary data based on an improved Hash algorithm SDBMHash and storing the split dictionary data into a plurality of double-array Trie trees;
the hash calculation module is used for carrying out hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash, and determining a double-array Trie tree where the word to be retrieved is located according to the result of the hash calculation;
and the prefix index module is used for carrying out prefix index on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for performing prefix indexing in a word segmentation process as provided in the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method for prefix indexing in a word segmentation process as provided in the first aspect.
The method and the device for prefix indexing in the word segmentation process provided by the embodiments of the invention keep prefix indexing efficient during word segmentation even in an application scenario where the dictionary is split into a plurality of double-array Trie trees.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for performing prefix indexing in a word segmentation process according to an embodiment of the present invention;
Fig. 2 is a flowchart of the improved hash algorithm SDBMHash provided by the embodiment of the invention;
Fig. 3 is a schematic flowchart of performing prefix indexing on the word to be retrieved according to the embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for performing prefix indexing in a word segmentation process according to an embodiment of the present invention;
Fig. 5 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for performing prefix indexing in a word segmentation process according to an embodiment of the present invention, as shown in the figure, the method includes:
step 100, splitting dictionary data based on an improved hash algorithm SDBMHash and storing the split dictionary data into a plurality of double-array Trie trees;
The invention provides an efficient word segmentation prefix indexing method based on distributed double-array Trie trees, which is independent of the logic of the word segmentation algorithm.
First, the dictionary data is split and stored in a plurality of double-array Trie trees. A conventional hash algorithm only determines the tree in which a single word is located; it cannot place all words sharing a prefix in the same tree. Because word segmentation requires prefix indexing, every tree would have to be searched during prefix indexing, which makes retrieval inefficient. To address this, the embodiment of the present invention provides an improved hash algorithm suited to this scenario, which splits the dictionary data so that all words with the same prefix are stored in as few trees as possible, and in trees that can be predicted, so that searching all trees during prefix indexing is avoided.
The embodiment of the invention uses the improved hash algorithm SDBMHash to split the dictionary data and store it in a plurality of double-array Trie trees, which ensures that words with the same prefix are stored in as few trees as possible. The time complexity of prefix indexing is thereby reduced from O(k), for searching all k trees, to O(1), for searching only the predictable trees.
The improved hash algorithm SDBMHash is shown in the flowchart of Fig. 2. SDBMHash processes each byte of the word to be retrieved in a loop; the improvement provided by the embodiment of the invention is that a maximum value N of the processed byte length is preset, that is, the maximum number of loop iterations over the bytes is limited. If the number of bytes in the word is larger than N, the loop runs only N times. Therefore, words whose byte length is greater than N and which share the same first N bytes obtain the same hash value, that is, words longer than N with the same prefix are stored in the same double-array Trie tree.
In one embodiment, N may take on a value between 10 and 30.
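The capped loop can be sketched as below. This is a minimal sketch under stated assumptions: the function name `sdbm_hash_capped` is hypothetical, and the 32-bit mask is an assumption, since the text does not state the width of the hash value. The update step is the standard SDBM recurrence.

```python
def sdbm_hash_capped(data: bytes, n: int) -> int:
    """SDBM hash over at most the first `n` bytes of `data`.

    The 32-bit truncation is an assumption made for this sketch.
    """
    h = 0
    for b in data[:n]:  # loops min(len(data), n) times
        h = (b + (h << 6) + (h << 16) - h) & 0xFFFFFFFF
    return h

# Words that share their first N bytes obtain the same hash value.
N = 4
print(sdbm_hash_capped(b"abcdefgh", N) == sdbm_hash_capped(b"abcdxyz", N))
```

Words shorter than N are hashed over all of their bytes, so the cap has no effect on them.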
Step 101, performing hash calculation on a word to be retrieved by using the improved hash algorithm SDBMHash, and determining a double-array Trie tree where the word to be retrieved is located according to a hash calculation result;
Hash calculation is performed on the word to be retrieved using the improved hash algorithm SDBMHash. First, the length of the word to be retrieved, that is, the number of bytes it contains, is obtained. Because the improved hash algorithm SDBMHash sets a maximum value N of the processed byte length, the length of the word is compared with N: if the length is greater than or equal to N, the first N bytes of the word are taken out in a loop for hash calculation; if the length is smaller than N, all bytes of the word are taken out in a loop for hash calculation. The hash value obtained in the last iteration is output.
The double-array Trie tree in which the word to be retrieved is stored is then determined from the obtained hash value. Once this tree has been determined, and because all words with the same prefix are stored in as few, predictable trees as possible, prefix indexing can start from the double-array Trie tree corresponding to the word to be retrieved.
Step 102, performing prefix indexing on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
Specifically, in the word segmentation process, the prefix index of a word is one of the bases for word segmentation, and the embodiment of the present invention needs to determine whether the word to be retrieved has the prefix index. Performing prefix indexing on the word to be retrieved in the double-array Trie tree in which the word to be retrieved is located refers to searching whether other words with the word to be retrieved as prefixes exist in the double-array Trie tree in which the word to be retrieved is located.
During word segmentation, when prefix indexing is performed on a word in the sentence to be segmented and the length of the word is greater than or equal to N, every longer word that has this word as a prefix is in the same tree as the word itself. Prefix indexing therefore only needs to be performed in the double-array Trie tree where the word to be retrieved is located, by searching that tree for other words that have the word to be retrieved as a prefix. If such words exist, the word to be retrieved is determined to have a prefix index; if not, the word to be retrieved is determined not to have a prefix index.
If the length of the word to be retrieved is smaller than N, the word to be retrieved is supplemented by utilizing bytes of words behind the word to be retrieved in the sentence to be segmented to generate a new word, the byte length of the new word is N, then the new word is subjected to hash calculation by utilizing an improved hash algorithm SDBMHash, so that a double-array Trie tree where the new word is located is determined, and prefix indexing is performed on the new word in the double-array Trie tree where the new word is located.
The length of the new word is N, so that prefix indexing can be completed on the new word, and if other words with the new word as a prefix exist in the double-array Trie tree where the new word is located, it can be determined that the word to be retrieved has the prefix index. If no other words using the new word as a prefix exist in the double-array Trie tree where the new word is located, the query needs to be performed in other double-array Trie trees. The method comprises the steps of sequentially and circularly removing the last byte of the new word, judging whether the new word subjected to byte deletion has a prefix index in the corresponding double-array Trie or not until the length of the new word subjected to byte deletion is equal to that of the word to be retrieved, and ending circulation if any new word subjected to byte deletion has the prefix index in the circulation process. And if the new word after the byte deletion processing does not have the prefix index until the circulation is finished, judging that the word to be retrieved does not have the prefix index.
Because the word to be retrieved is extended to the maximum value N of the preset processing byte length, the number of trees that must be queried for any word in the sentence being segmented never exceeds N, no matter how many trees the dictionary is split into; that is, O(1) time complexity with respect to the number of trees is guaranteed.
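The supplementing and byte-removal procedure above yields a fixed sequence of candidate words to look up. The sketch below only generates that sequence; the function name `padded_candidates` and its parameters are hypothetical, and the sentence is treated as a raw byte string.

```python
def padded_candidates(sentence: bytes, start: int, length: int, n: int):
    """List the byte strings that would be looked up for the word
    sentence[start:start+length], longest candidate first."""
    word = sentence[start:start + length]
    if length >= n:
        return [word]  # long words are looked up once, in their own tree
    # Supplement with following bytes up to n bytes (fewer at end of sentence).
    padded = sentence[start:start + n]
    # Remove the last byte step by step until the original word is reached.
    return [padded[:m] for m in range(len(padded), length - 1, -1)]

print(padded_candidates(b"abcdef", 0, 2, 4))
```

Each candidate is hashed separately, so at most N - length + 1 trees are visited for one word.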
The method for prefix indexing in the word segmentation process provided by the embodiment of the invention keeps prefix indexing efficient during word segmentation even in an application scenario where the dictionary is split into a plurality of double-array Trie trees.
Based on the content of the above embodiment, the step of performing hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash specifically includes:
judging whether the length of the word to be retrieved is greater than or equal to the maximum value N of the preset processing byte length;
if the length of the word to be retrieved is larger than or equal to N, sequentially and circularly taking out the bytes of the word to be retrieved from the first byte of the word to be retrieved for hash calculation until the cycle number is equal to N, and outputting a hash value obtained in the last cycle; or,
and if the length of the word to be retrieved is smaller than N, sequentially and circularly taking out the bytes of the word to be retrieved from the first byte of the word to be retrieved for hash calculation until the word to be retrieved is traversed, and outputting a hash value obtained in the last circulation.
Specifically, the length of the word to be retrieved is first obtained, and it is judged whether the length of the word to be retrieved is greater than or equal to the maximum value N of the preset processing byte length;
if the length of the word to be retrieved is larger than or equal to N, sequentially and circularly taking out the bytes of the word to be retrieved from the first byte of the word to be retrieved for hash calculation until the cycle number is equal to N, and outputting a hash value obtained in the last cycle;
that is, when the length of the word to be retrieved is greater than or equal to N, sequentially and circularly taking out the bytes of the word to be retrieved for hash calculation, wherein a specific formula for performing hash calculation is as follows:
hash = byte a + (hash << 6) + (hash << 16) - hash
In the above formula, byte a is the byte of the word to be retrieved taken out by the current loop.
And outputting the hash value obtained in the last cycle, dividing the obtained hash value by the total number of the double-array Trie trees and taking a remainder, wherein the remainder is the number of the double-array Trie tree where the word to be searched is located, and the double-array Trie tree where the word to be searched is located can be inquired according to the number.
And if the length of the word to be retrieved is smaller than N, sequentially and circularly taking out the bytes of the word to be retrieved from the first byte of the word to be retrieved for hash calculation until the word to be retrieved is traversed, and outputting a hash value obtained in the last circulation. And similarly, dividing the obtained hash value by the total number of the double-array Trie trees, taking a remainder, and determining the double-array Trie tree where the word to be retrieved is located according to the remainder.
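The remainder step, together with the dictionary split it implies, can be sketched as follows. The names `tree_index` and `split_dictionary` are hypothetical, each tree is modeled as a Python set rather than a double-array Trie tree, and the 32-bit mask in the hash is an assumption.

```python
def sdbm_hash_capped(data: bytes, n: int) -> int:
    """SDBM hash over at most the first n bytes (32-bit width is assumed)."""
    h = 0
    for b in data[:n]:
        h = (b + (h << 6) + (h << 16) - h) & 0xFFFFFFFF
    return h

def tree_index(word: bytes, n: int, num_trees: int) -> int:
    """Number of the tree for `word`: hash value modulo the tree count."""
    return sdbm_hash_capped(word, n) % num_trees

def split_dictionary(words, n: int, num_trees: int):
    """Distribute dictionary words over `num_trees` shards (sets stand in
    for double-array Trie trees in this sketch)."""
    shards = [set() for _ in range(num_trees)]
    for w in words:
        shards[tree_index(w, n, num_trees)].add(w)
    return shards

shards = split_dictionary([b"abc", b"abcdefgh", b"abcdzzzz", b"xyz"], 4, 3)
print([sorted(s) for s in shards])
```

The same `tree_index` function is used at build time and at query time, which is what makes the target tree predictable.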
Based on the content of the above embodiment, the step of performing prefix index on the word to be retrieved in the double-array Trie where the word to be retrieved is located specifically includes:
if the length of the word to be retrieved is greater than or equal to N, searching the double-array Trie tree where the word to be retrieved is located for other words that have the word to be retrieved as a prefix; or, if the length of the word to be retrieved is smaller than N, acquiring the words that follow the word to be retrieved in the sentence to be segmented, and supplementing the word to be retrieved with bytes of those following words so that the length of the new word obtained after the supplementing reaches N;
and determining the double-array Trie tree corresponding to the new word obtained after the supplementing, and performing prefix indexing on the new word in that double-array Trie tree.
Specifically, if the length of the word to be retrieved is greater than or equal to the maximum value N of the preset processing byte length, it is only necessary to search whether other words with the word to be retrieved as prefixes exist in the double-array Trie where the word to be retrieved is located, if so, it is determined that the word to be retrieved has a prefix index, and if not, it is determined that the word to be retrieved does not have a prefix index.
If the length of the word to be retrieved is smaller than the maximum value N of the preset processing byte length, the word to be retrieved is supplemented by the word behind the sentence where the word to be retrieved is located, and the length of a new word obtained after the supplement is equal to N.
The step of determining the double-array Trie tree corresponding to the new word obtained after the supplementing, and performing prefix indexing on the new word in that double-array Trie tree, specifically includes:
circularly executing the following steps until the length of the new word obtained after the supplement is equal to the length of the word to be retrieved:
performing hash calculation on the new word obtained after the supplementation by using the SDBMHash algorithm, and determining a double-array Trie tree corresponding to the new word obtained after the supplementation according to the result of the hash calculation;
if a word with the new word obtained after the supplement as a prefix exists in the double-array Trie tree corresponding to the new word obtained after the supplement, acquiring that the word to be retrieved has a prefix index, and exiting the circulation; or,
and if the words with the new words obtained after the supplementation as prefixes do not exist in the double-array Trie trees corresponding to the new words obtained after the supplementation, removing the last byte of the new words obtained after the supplementation.
Specifically, the improved hash algorithm SDBMHash is used to calculate the hash value of the new word, and the obtained hash value is divided by the total number of double-array Trie trees to obtain a remainder. The remainder gives the number of the double-array Trie tree where the new word is located. Prefix indexing is then performed on the new word in that tree, that is, it is determined whether the tree contains other words that have the new word as a prefix. If so, the word to be retrieved is known to have a prefix index; if not, other double-array Trie trees must be queried.
The query in other double-array Trie trees proceeds as follows: the last byte of the new word is removed in a loop. In each iteration, the improved hash algorithm SDBMHash is first used to determine the double-array Trie tree where the shortened new word is located, and it is then judged whether the shortened new word has a prefix index in that tree. If so, the word to be retrieved is judged to have a prefix index and the loop exits; if not, the next iteration starts, until the length of the new word equals that of the word to be retrieved. If no prefix index has been found when the loop ends, the word to be retrieved is judged not to have a prefix index.
The embodiment of the invention ensures that the number of trees that must be queried for any word in the sentence being segmented never exceeds N, no matter how many trees the dictionary is split into; that is, O(1) time complexity with respect to the number of trees is guaranteed, which keeps prefix indexing efficient during word segmentation.
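The bound stated above can be written out explicitly. As notation introduced only for this illustration, let |A| be the byte length of the word to be retrieved and probes(A) the number of trees queried for it. A word of length at least N is looked up once; a shorter word is supplemented to at most N bytes and then shortened one byte at a time down to |A| bytes:

```latex
\mathrm{probes}(A) \le
\begin{cases}
1, & |A| \ge N,\\[2pt]
N - |A| + 1, & 1 \le |A| < N,
\end{cases}
\qquad\text{hence}\qquad
\mathrm{probes}(A) \le N .
```

For example, with N = 20 and |A| = 6, at most 15 trees are queried, whether the dictionary is split into 16 trees or 1600.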
Referring to fig. 3, a schematic flow chart of performing prefix indexing on the word to be retrieved according to the embodiment of the present invention includes:
step 300, judging whether the length of a word A to be retrieved in the sentence to be segmented is greater than or equal to the maximum value N of the preset processing byte length;
step 301, if the length of the word to be retrieved is greater than or equal to N, calculating the hash value of the word with the hash function, and obtaining from the hash value the number of the double-array Trie tree where the word is located,
the hash function is specifically:
hash = current byte + (hash << 6) + (hash << 16) - hash;
step 302, performing prefix index on the word in the numbered Trie tree;
step 303, judging whether other words with the word as a prefix exist in the numbered Trie tree;
step 304, if yes, judging that the word A has a prefix index, and ending the prefix index;
otherwise, judging that the word has no prefix index;
step 305, if the length of the word to be retrieved is smaller than N, acquiring a word behind the word in the sentence to be segmented, and supplementing the word until the length is N or the tail of the sentence, wherein a new word is marked as a word B;
step 306, judging whether the length of the word B is smaller than that of the word A;
step 307, if not, calculating the Hash value of the word B through a Hash function to obtain the number of the Trie tree in which the word B is positioned;
step 308, performing prefix index on the word B in the Trie tree where the word B is located;
step 309, judging whether other words with the word B as a prefix exist in the Trie tree in which the word B is located;
step 310, if the word B does not exist, the last byte of the word B is removed, and the step 306 is returned to;
and if so, exiting the loop, and judging that the word A to be retrieved has the prefix index.
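The flow of Fig. 3 can be sketched end to end as below. This is a minimal sketch under stated assumptions, not the patented implementation: sets stand in for double-array Trie trees, the names `build_shards` and `has_prefix_index` are hypothetical, the hash is truncated to 32 bits by assumption, and a match is taken to mean a stored word, different from word A, that starts with the candidate.

```python
def sdbm_hash_capped(data: bytes, n: int) -> int:
    """SDBM hash over at most the first n bytes (32-bit width is assumed)."""
    h = 0
    for b in data[:n]:
        h = (b + (h << 6) + (h << 16) - h) & 0xFFFFFFFF
    return h

def build_shards(words, n: int, k: int):
    """Split the dictionary into k shards; sets stand in for Trie trees."""
    shards = [set() for _ in range(k)]
    for w in words:
        shards[sdbm_hash_capped(w, n) % k].add(w)
    return shards

def has_prefix_index(shards, sentence: bytes, start: int, length: int, n: int) -> bool:
    """Follow steps 300-310 for word A = sentence[start:start+length]."""
    if length < 1:
        raise ValueError("length must be at least 1")
    word_a = sentence[start:start + length]

    def probe(candidate: bytes) -> bool:
        # Steps 301/307: hash the candidate to find its tree number.
        shard = shards[sdbm_hash_capped(candidate, n) % len(shards)]
        # Steps 302-303 / 308-309: look for another word with this prefix.
        return any(w != word_a and w.startswith(candidate) for w in shard)

    if length >= n:                      # step 300
        return probe(word_a)             # steps 301-304
    word_b = sentence[start:start + n]   # step 305: supplement up to N bytes
    while len(word_b) >= length:         # step 306
        if probe(word_b):
            return True
        word_b = word_b[:-1]             # step 310: remove the last byte
    return False

shards = build_shards([b"abc", b"abcdefgh", b"xyz"], n=4, k=3)
print(has_prefix_index(shards, b"abcdzz", 0, 2, 4))
print(has_prefix_index(shards, b"qrst", 0, 1, 4))
```

Removing bytes one at a time can split a multi-byte character; the sketch works at the byte level, as the description does.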
As shown in fig. 4, a schematic structural diagram of an apparatus for performing prefix indexing in a word segmentation process according to an embodiment of the present invention includes: a grouping module 410, a hash calculation module 420, and a prefix index module 430, wherein,
a grouping module 410, configured to split dictionary data based on an improved hash algorithm SDBMHash and store the split dictionary data in a plurality of double-array Trie trees;
The grouping module 410 splits the dictionary data using the improved hash algorithm SDBMHash and stores it in a plurality of double-array Trie trees, which ensures that words with the same prefix are stored in as few trees as possible. The time complexity of prefix indexing is thereby reduced from O(k), for searching all k trees, to O(1), for searching only the predictable trees.
The improved hash algorithm SDBMHash is shown in the flowchart of Fig. 2. SDBMHash processes each byte of the word to be retrieved in a loop; the improvement provided by the embodiment of the invention is that a maximum value N of the processed byte length is preset, that is, the maximum number of loop iterations over the bytes is limited. If the number of bytes in the word is larger than N, the loop runs only N times. Therefore, words whose byte length is greater than N and which share the same first N bytes obtain the same hash value, that is, words longer than N with the same prefix are stored in the same double-array Trie tree.
In one embodiment, N may take on a value between 10 and 30.
A hash calculation module 420, configured to perform hash calculation on a word to be retrieved by using the modified hash algorithm SDBMHash, and determine a double-array Trie where the word to be retrieved is located according to a result of the hash calculation;
The hash calculation module 420 performs hash calculation on the word to be retrieved using the improved hash algorithm SDBMHash. It first obtains the length of the word to be retrieved, that is, the number of bytes it contains. Because the improved hash algorithm SDBMHash sets a maximum value N of the processed byte length, the length of the word is compared with N: if the length is greater than or equal to N, the first N bytes of the word are taken out in a loop for hash calculation; if the length is smaller than N, all bytes of the word are taken out in a loop for hash calculation. The hash value obtained in the last iteration is output.
The double-array Trie tree in which the word to be retrieved is stored is then determined from the obtained hash value. Once this tree has been determined, and because all words with the same prefix are stored in as few, predictable trees as possible, prefix indexing can start from the double-array Trie tree corresponding to the word to be retrieved.
A prefix indexing module 430, configured to perform prefix indexing on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
During word segmentation, when the prefix indexing module 430 performs prefix indexing on a word in the sentence to be segmented and the length of the word is greater than or equal to N, every longer word that has this word as a prefix is in the same tree as the word itself. Prefix indexing therefore only needs to be performed in the double-array Trie tree where the word to be retrieved is located, by searching that tree for other words that have the word to be retrieved as a prefix. If such words exist, the word to be retrieved is determined to have a prefix index; if not, the word to be retrieved is determined not to have a prefix index.
If the length of the word to be retrieved is smaller than N, the prefix indexing module 430 supplements the word to be retrieved by using bytes of words behind the word to be retrieved in the sentence to be segmented to generate a new word, the byte length of the new word is N, and then performs hash calculation on the new word by using an improved hash algorithm SDBMHash, so as to determine a double-array Trie tree where the new word is located, and performs prefix indexing on the new word in the double-array Trie tree where the new word is located.
The length of the new word is N, so that prefix indexing can be completed on the new word, and if other words with the new word as a prefix exist in the double-array Trie tree where the new word is located, it can be determined that the word to be retrieved has the prefix index. If no other word with the new word as the prefix exists in the double-array Trie where the new word is located, the prefix indexing module 430 needs to query in other double-array Trie. The method comprises the steps of sequentially and circularly removing the last byte of the new word, judging whether the new word subjected to byte deletion has a prefix index in the corresponding double-array Trie or not until the length of the new word subjected to byte deletion is equal to that of the word to be retrieved, and ending circulation if any new word subjected to byte deletion has the prefix index in the circulation process. And if the new word after the byte deletion processing does not have the prefix index until the circulation is finished, judging that the word to be retrieved does not have the prefix index.
Because the word to be retrieved is extended to the maximum value N of the preset processing byte length, the number of trees that must be queried for a word in the sentence to be segmented never exceeds N, no matter how many double-array Trie trees the dictionary is split into. In other words, the number of tree lookups has O(1) time complexity with respect to the number of trees.
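The lookup procedure described above can be sketched in Python as follows. This is a hedged illustration under several assumptions, not the embodiment's code: each tree is a sorted list of byte strings searched with `bisect`, standing in for a real double-array Trie tree; all function names are hypothetical; and the loop stops once the candidate shrinks to the length of the original word, which is one reading of claim 6. The function also returns the number of trees consulted so that the bound can be observed.

```python
from bisect import bisect_left


def sdbm_first_n(word: bytes, n: int) -> int:
    """SDBM hash over at most the first n bytes (32-bit mask is an assumption)."""
    h = 0
    for byte in word[:n]:
        h = (byte + (h << 6) + (h << 16) - h) & 0xFFFFFFFF
    return h


def build_trees(dictionary, n, num_trees):
    """Split the dictionary into num_trees sorted lists (stand-ins for tries)."""
    trees = [[] for _ in range(num_trees)]
    for word in dictionary:
        trees[sdbm_first_n(word, n) % num_trees].append(word)
    return [sorted(tree) for tree in trees]


def tree_has_prefix(tree, prefix):
    """True if some word in the sorted list starts with `prefix`."""
    i = bisect_left(tree, prefix)
    return i < len(tree) and tree[i].startswith(prefix)


def has_prefix_index(sentence, start, word_len, trees, n):
    """Return (found, lookups) for the word sentence[start:start + word_len]."""
    word = sentence[start:start + word_len]
    if word_len >= n:
        # Every word extending `word` shares its first n bytes: one tree suffices.
        tree = trees[sdbm_first_n(word, n) % len(trees)]
        return tree_has_prefix(tree, word), 1
    # Supplement the word with the bytes that follow it, up to n bytes.
    candidate = sentence[start:start + n]
    lookups = 0
    while len(candidate) > word_len:
        tree = trees[sdbm_first_n(candidate, n) % len(trees)]
        lookups += 1
        if tree_has_prefix(tree, candidate):
            return True, lookups
        candidate = candidate[:-1]  # drop the last byte and try again
    return False, lookups
```

In this sketch the number of trees consulted for one word is at most N minus the length of the word (or exactly one when the word has N or more bytes), regardless of how many trees the dictionary is split into.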
The device for prefix indexing in the word segmentation process provided by the embodiment of the invention keeps prefix indexing efficient during word segmentation even in application scenarios where the dictionary is split into a plurality of double-array Trie trees.
Fig. 5 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in Fig. 5, the electronic device may include a processor 510, a communications interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communications interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke a computer program stored in the memory 530 and executable on the processor 510 to perform the method for prefix indexing in a word segmentation process provided by the above embodiments, the method including, for example: splitting dictionary data based on an improved hash algorithm SDBMHash and storing the split dictionary data into a plurality of double-array Trie trees; performing hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash, and determining the double-array Trie tree where the word to be retrieved is located according to the result of the hash calculation; and performing prefix indexing on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solutions, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the method for prefix indexing in a word segmentation process provided by the foregoing embodiments, the method including, for example: splitting dictionary data based on an improved hash algorithm SDBMHash and storing the split dictionary data into a plurality of double-array Trie trees; performing hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash, and determining the double-array Trie tree where the word to be retrieved is located according to the result of the hash calculation; and performing prefix indexing on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
The apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for prefix indexing in a word segmentation process, characterized by comprising the following steps:
splitting dictionary data based on an improved hash algorithm SDBMHash and storing the split dictionary data into a plurality of double-array Trie trees;
performing hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash, and determining a double-array Trie tree where the word to be retrieved is located according to the result of the hash calculation;
and performing prefix indexing on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
2. The method according to claim 1, wherein the step of performing hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash specifically comprises:
determining whether the length of the word to be retrieved is greater than the maximum value N of the preset processing byte length;
if the length of the word to be retrieved is greater than or equal to N, taking out the bytes of the word to be retrieved one at a time in a loop, starting from its first byte, for hash calculation until the number of cycles equals N, and outputting the hash value obtained in the last cycle; or,
if the length of the word to be retrieved is smaller than N, taking out the bytes of the word to be retrieved one at a time in a loop, starting from its first byte, for hash calculation until the whole word to be retrieved has been traversed, and outputting the hash value obtained in the last cycle.
3. The method according to claim 2, wherein the step of determining the double-array Trie tree where the word to be retrieved is located according to the result of the hash calculation specifically comprises:
dividing the obtained hash value by the total number of double-array Trie trees, taking the remainder, and determining the double-array Trie tree where the word to be retrieved is located according to the remainder.
4. The method according to claim 2, wherein the formula for circularly taking out bytes of the word to be retrieved in sequence for hash calculation is as follows:
hash = byte A + (hash << 6) + (hash << 16) - hash
where byte A is the byte of the word to be retrieved that is taken out in the current cycle.
5. The method according to claim 2, wherein the step of performing prefix indexing on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located specifically comprises:
if the length of the word to be retrieved is greater than or equal to N, searching the double-array Trie tree where the word to be retrieved is located for other words that have the word to be retrieved as a prefix; or,
if the length of the word to be retrieved is smaller than N, acquiring the words that follow the word to be retrieved in the sentence to be segmented, and supplementing the word to be retrieved with bytes of those words so that the length of the new word obtained after the supplementing reaches N;
and determining the double-array Trie tree corresponding to the new word obtained after the supplementing, and performing prefix indexing on the new word obtained after the supplementing in that double-array Trie tree.
6. The method according to claim 5, wherein the step of determining the double-array Trie tree corresponding to the new word obtained after the supplementing, and performing prefix indexing on the new word obtained after the supplementing in that double-array Trie tree, specifically comprises:
executing the following steps in a loop until the length of the new word obtained after the supplementing is equal to the length of the word to be retrieved:
performing hash calculation on the new word obtained after the supplementing by using the improved hash algorithm SDBMHash, and determining the double-array Trie tree corresponding to the new word obtained after the supplementing according to the result of the hash calculation;
if a word with the new word obtained after the supplementing as a prefix exists in the double-array Trie tree corresponding to the new word, determining that the word to be retrieved has a prefix index, and exiting the loop; or,
if no word with the new word obtained after the supplementing as a prefix exists in the double-array Trie tree corresponding to the new word, removing the last byte of the new word obtained after the supplementing.
7. An apparatus for prefix indexing in a word segmentation process, comprising:
a grouping module, configured to split dictionary data based on an improved hash algorithm SDBMHash and store the split dictionary data into a plurality of double-array Trie trees;
a hash calculation module, configured to perform hash calculation on the word to be retrieved by using the improved hash algorithm SDBMHash, and determine the double-array Trie tree where the word to be retrieved is located according to the result of the hash calculation;
and a prefix indexing module, configured to perform prefix indexing on the word to be retrieved in the double-array Trie tree where the word to be retrieved is located.
8. The apparatus of claim 7, wherein the hash calculation module is specifically configured to:
determining whether the length of the word to be retrieved is greater than the maximum value N of the preset processing byte length;
if the length of the word to be retrieved is greater than or equal to N, taking out the bytes of the word to be retrieved one at a time in a loop, starting from its first byte, for hash calculation until the number of cycles equals N, and outputting the hash value obtained in the last cycle; or,
if the length of the word to be retrieved is smaller than N, taking out the bytes of the word to be retrieved one at a time in a loop, starting from its first byte, for hash calculation until the whole word to be retrieved has been traversed, and outputting the hash value obtained in the last cycle.
9. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 6.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 6.
CN201811622746.0A 2018-12-28 2018-12-28 Method and device for indexing prefix in word segmentation process Active CN109684439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811622746.0A CN109684439B (en) 2018-12-28 2018-12-28 Method and device for indexing prefix in word segmentation process

Publications (2)

Publication Number Publication Date
CN109684439A (en) 2019-04-26
CN109684439B (en) 2020-10-30

Family

ID=66190798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811622746.0A Active CN109684439B (en) 2018-12-28 2018-12-28 Method and device for indexing prefix in word segmentation process

Country Status (1)

Country Link
CN (1) CN109684439B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070083503A1 (en) * 2005-10-07 2007-04-12 Oracle International Corporation Generating a synonym dictionary representing a mapping of elements in different data models
CN101464899A (en) * 2009-01-13 2009-06-24 阿里巴巴集团控股有限公司 Commercial scale dictionary storage method and query method with low search error rate
CN101751430A (en) * 2008-12-12 2010-06-23 汉王科技股份有限公司 Electronic dictionary fuzzy searching method
CN102651026A (en) * 2012-04-01 2012-08-29 百度在线网络技术(北京)有限公司 Method for optimizing word segmentation of search engine through precomputation and word segmenting device of search engine
CN103365992A (en) * 2013-07-03 2013-10-23 深圳市华傲数据技术有限公司 Method for realizing dictionary search of Trie tree based on one-dimensional linear space
CN106126722A (en) * 2016-06-30 2016-11-16 中国科学院计算技术研究所 A kind of prefix compound tree based on checking and method for designing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王思力,张华平,王斌: "双数组Trie树算法优化及其应用研究", 《中文信息学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414648A (en) * 2020-03-04 2020-07-14 传神语联网网络科技股份有限公司 Corpus authentication method and apparatus
CN111414648B (en) * 2020-03-04 2023-05-12 传神语联网网络科技股份有限公司 Corpus authentication method and device

Also Published As

Publication number Publication date
CN109684439B (en) 2020-10-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant