CN108846016B

CN108846016B - Chinese word segmentation oriented search algorithm

Info

Publication number: CN108846016B
Application number: CN201810422499.3A
Authority: CN
Inventors: 金城; 陶仕谦; 唐士芳; 吴渊; 张玥杰; 冯瑞; 薛向阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2018-05-05
Filing date: 2018-05-05
Publication date: 2021-08-20
Anticipated expiration: 2038-05-05
Also published as: CN108846016A

Abstract

The invention belongs to the technical field of text search engines and particularly relates to a search algorithm oriented to Chinese word segmentation. The algorithm is divided into two stages: an offline index construction stage and an online search stage. In the offline index construction stage, the suffix string sets of all original strings are first extracted, and an improved suffix tree is then built from these suffix string sets. In the online search stage, the query results for a keyword are first obtained from the suffix-tree-based index model, the degree of matching between the keyword and each query result is then quantified, and finally the query results are returned sorted by matching degree from high to low. The improved suffix-tree-based index structure balances index construction time against occupied space, and its search efficiency is far higher than that of computing the matching degree of the result set by brute force and then sorting it.

Description

Chinese word segmentation oriented search algorithm

Technical Field

The invention belongs to the technical field of text search engines, and particularly relates to a Chinese word segmentation oriented search algorithm.

Background

A search engine is an online information retrieval tool that returns to the user a series of results matching the user's search keywords. In today's era of information explosion, quickly and accurately locating the information a user wants among countless sources is one of the most urgent needs, and information search technology has therefore developed rapidly and been widely applied.

The most common form of search is text search: whether the user's target resource is text, image, audio, or even video, a search falls within the scope of text search as long as its input is text. Besides the web-wide search provided by Google, Baidu, Yahoo and the like, the demand for search within specific fields is also growing. In a specific field (for example, one covering only television programs), the limited range of resource types makes the search conditions quite clear, and the size of the data set stays within an acceptable range, so a number of targeted optimizations can be made to the search engine on this premise.

The technologies currently used in Chinese search systems mainly include inverted indexes, forward indexes, signature files, suffix trees, and the like. The inverted index has the best overall performance and is the most commonly used, but in practice, applying the inverted index model to a large text collection places severe demands on CPU resources, memory space and I/O.

Disclosure of Invention

The invention aims to provide a search algorithm oriented to Chinese word segmentation for use in an intelligent Chinese search engine system, so that search results can be returned quickly for given keywords and displayed to the user sorted by matching degree from high to low.

The Chinese word segmentation oriented search algorithm provided by the invention is divided into two stages: an offline index construction stage and an online search stage. In the offline index construction stage, the suffix string sets of all original strings are first extracted, and an improved suffix tree is then built from these suffix string sets. In the online search stage, the query results for a keyword are first obtained from the suffix-tree-based index model, the degree of matching between the keyword and each query result is then quantified, and finally the query results are returned sorted by matching degree from high to low.

I. Offline index construction stage. The specific steps are as follows:

(1) Generating suffix string sets from an original data set

T(S) denotes the original data set, composed of strings S containing delimiters ($) and a terminator (#), where the index ID of the i-th string is i (1 ≤ i ≤ n). Let WBS denote a suffix string that starts at a delimiter boundary and NWBS a suffix string that does not. The suffix string sets T(WBS) and T(NWBS), each carrying index IDs, are generated from T(S) as follows:

The first step: traverse all strings in T(S) and extract all suffix strings of each string s_i, forming the sets T*(s₁), T*(s₂), …, T*(s_n) [1]. A suffix string is the substring of a string S that runs from position i to the end of S; that is, if S is written C₁C₂…C_n, then C_i C_(i+1) … C_n is called a suffix string of S (1 ≤ i ≤ n);

The second step: remove from T*(s₁), T*(s₂), …, T*(s_n) all suffix strings that begin with a delimiter ($) or a terminator (#);

The third step: traverse each T*(s_i). If the first character of a suffix string is the first character of the original string or the first character after a delimiter ($) in the original string, append the index ID to the end of the suffix string and add it to T(WBS); otherwise, append the index ID to the end of the suffix string and add it to T(NWBS).
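The three steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `generate_suffix_sets` and the sample string are hypothetical, index IDs are taken to be 1-based, and the third step is read positionally (a suffix belongs to T(WBS) when it starts at the beginning of the string or immediately after a `$`).

```python
def generate_suffix_sets(strings):
    """Split the suffix strings of each input string into T(WBS) and T(NWBS)."""
    t_wbs, t_nwbs = [], []
    for index_id, s in enumerate(strings, start=1):  # assumed 1-based index IDs
        for pos in range(len(s)):
            suffix = s[pos:]                         # first step: every suffix
            if suffix[0] in "$#":                    # second step: drop these
                continue
            # Third step (assumed positional reading of the text).
            at_boundary = pos == 0 or s[pos - 1] == "$"
            tagged = f"{suffix}{index_id}"           # index ID appended at the end
            (t_wbs if at_boundary else t_nwbs).append(tagged)
    return t_wbs, t_nwbs


wbs, nwbs = generate_suffix_sets(["ab$cd#"])
print(wbs)   # suffixes that start at a boundary
print(nwbs)  # all remaining suffixes
```

For the single input string `ab$cd#`, the suffixes `$cd#` and `#` are discarded by the second step, and the remaining four suffixes are divided between the two sets.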

(2) Establishing improved suffix trees for the suffix string sets T(WBS) and T(NWBS)

The improved suffix tree is based on the traditional suffix tree [1], except that the label of each edge is stored in the node. Each node thus serves as a storage unit, whose structure is shown in FIG. 1. A node stores a node identifier, a terminator child-node pointer, a delimiter child-node pointer, a set of general child-node pointers, and a matching index ID sequence, where the node identifier is a terminator, a delimiter, or a general character string.
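The node layout described above might be represented as in the following sketch. The class and field names are hypothetical, and keying the general child-node pointers by the first character of the child identifier is an assumption made for convenient lookup; the text only specifies a set of pointers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    identifier: str = ""                        # '#', '$', or a general string
    terminator_child: Optional["Node"] = None   # child whose identifier is '#'
    delimiter_child: Optional["Node"] = None    # child whose identifier is '$'
    # General children, keyed by the first character of their identifier
    # (an assumption; the text only requires a set of pointers).
    general_children: Dict[str, "Node"] = field(default_factory=dict)
    index_ids: List[int] = field(default_factory=list)  # matching index ID sequence


root = Node()  # a root node: every field is empty
```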

The specific steps for establishing the improved suffix tree for an arbitrary suffix string set T are as follows:

The first step: create an improved suffix tree containing only one node, whose node identifier, child-node pointers and matching index ID sequence are all empty, and mark this node as the root node of the improved suffix tree.

The second step: insert all elements of the suffix string set T into the improved suffix tree in sequence. The insertion of each suffix string starts from the root node and searches for the insertion position.

Taking the improved suffix tree in FIG. 2 as an example, three cases arise when inserting a suffix string:

Case one: if the suffix string to be inserted already appears in the current tree, its index number is added directly to the matching index ID sequence of the node. For example, when the suffix string to be inserted is "student#2", the "student" node is already present in the current tree, so the index number is added directly to the matching index ID sequence of the node; the result is shown in FIG. 3(a).

Case two: if a prefix of the suffix string to be inserted coincides with an existing node, the remaining nodes are added directly. For example, when the suffix string to be inserted is "home$student#3", the "home" node already exists, so the nodes "student" and "#" are added directly; the result is shown in FIG. 3(b).

Case three: if the suffix string to be inserted shares a prefix with the identifier of the current node, the current node is split first and the other nodes are then inserted. For example, when the suffix string to be inserted is "big$student#4", the prefix "big" of the suffix string is also a prefix of the current node "university", so the current node is split first and the other nodes are then inserted; the result is shown in FIG. 3(c).
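The three insertion cases can be illustrated with the following simplified sketch. It is not the structure of FIG. 1: a single children table replaces the three kinds of child pointers, the index ID is passed as a separate argument instead of being parsed from the end of the suffix string, and all names and sample strings are hypothetical.

```python
class Node:
    """Simplified node: one children table instead of three pointer kinds."""
    def __init__(self, identifier=""):
        self.identifier = identifier
        self.children = {}    # first character of a child's identifier -> child
        self.index_ids = []   # matching index ID sequence


def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def next_piece(rest):
    """Return a leading '$' or '#', or else the leading run of general characters."""
    if rest[0] in "$#":
        return rest[0]
    end = 0
    while end < len(rest) and rest[end] not in "$#":
        end += 1
    return rest[:end]


def insert(root, text, index_id):
    node, rest = root, text
    while rest:
        piece = next_piece(rest)
        child = node.children.get(piece[0])
        if child is None:
            # Case two: nothing more is shared, so the node is added directly.
            child = Node(piece)
            node.children[piece[0]] = child
            consumed = len(piece)
        else:
            consumed = common_prefix_len(child.identifier, piece)
            if consumed < len(child.identifier):
                # Case three: split the node at the end of the shared prefix.
                tail = Node(child.identifier[consumed:])
                tail.children, tail.index_ids = child.children, child.index_ids
                child.identifier = child.identifier[:consumed]
                child.children, child.index_ids = {tail.identifier[0]: tail}, []
        node, rest = child, rest[consumed:]
    # Case one reduces to this line: the whole path exists, only the ID is recorded.
    node.index_ids.append(index_id)


root = Node()
insert(root, "ab#", 1)
insert(root, "ab#", 2)      # case one: same suffix string, new index ID
insert(root, "ab$cd#", 3)   # case two: "ab" exists, the rest is added
insert(root, "a$cd#", 4)    # case three: "ab" is split into "a" and "b"
```

After the four insertions, the node formerly labeled "ab" carries the identifier "a" and has two children, "b" (holding the old subtree) and "$" (holding the newly inserted remainder).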

The third step: recursively construct the matching index ID sequence of each node. As noted above, the matching index ID sequences of the terminator nodes are already complete once all suffix strings have been inserted. Therefore, only the matching index ID sequences Q(N(s)) of the non-terminator nodes N(s) need to be constructed, as given by formula (1):

Q(N(s)) = Q(N(s#)) Q(N(s$)) Q(N(s*))    (1)

where N(s#), N(s$) and N(s*) denote the terminator child node, the delimiter child node and all general child nodes of the node N(s), respectively.
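A recursive construction along the lines of formula (1) might look like the sketch below. It assumes that the juxtaposition in formula (1) means concatenation in the order terminator child, delimiter child, general children, and that general children are visited in the order in which they are stored; neither point is stated explicitly in the text. The `Node` class here is a hypothetical minimal stand-in.

```python
class Node:
    def __init__(self, identifier, index_ids=None, children=None):
        self.identifier = identifier
        self.index_ids = list(index_ids or [])
        self.children = list(children or [])


def build_index_ids(node):
    """Fill node.index_ids bottom-up and return it."""
    if node.identifier == "#":
        return node.index_ids            # already filled during insertion
    terminator = [c for c in node.children if c.identifier == "#"]
    delimiter = [c for c in node.children if c.identifier == "$"]
    general = [c for c in node.children if c.identifier not in ("#", "$")]
    node.index_ids = []
    # Assumed reading of formula (1): concatenate in this fixed order.
    for child in terminator + delimiter + general:
        node.index_ids.extend(build_index_ids(child))
    return node.index_ids


b = Node("b", children=[
    Node("#", [1, 2]),
    Node("$", children=[Node("cd", children=[Node("#", [3])])]),
])
a = Node("a", children=[
    b,
    Node("$", children=[Node("cd", children=[Node("#", [4])])]),
])
build_index_ids(a)
print(a.index_ids, b.index_ids)
```

In this example node "a" has no terminator child, so its sequence starts with the IDs reached through its delimiter child, followed by those of its general child "b".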

II. Online search stage. The specific steps are as follows:

(1) Matching node query

For any node N(s), the process of querying, starting from N(s), the nodes that match the string c₁…c_n is given by formula (2):

where R(N(s)) denotes the query result, N(s) is the matching node, and s is the node identifier.

Given a query string c₁…c_n, first search the child nodes of the root node for the node N(s) whose identifier begins with c₁, then execute R(N(s), c₁…c_n) to find all matching nodes, finally obtaining the search result R(N(s)) = (s, Q(N(s))), where Q(N(s)) is the matching index ID sequence of N(s).
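As a rough illustration of such a lookup, the sketch below performs a generic prefix descent over a tree whose nodes carry string identifiers. It is not a rendering of formula (2), whose body does not appear in this text: how the recursion treats delimiter nodes and how it collects several matching nodes is not modeled, and the `Node` class, `find_match` and the sample data are hypothetical.

```python
class Node:
    def __init__(self, identifier="", index_ids=()):
        self.identifier = identifier
        self.index_ids = list(index_ids)   # Q(N(s)), assumed already built
        self.children = {}                 # first character -> child node

    def add(self, child):
        self.children[child.identifier[0]] = child
        return child


def find_match(root, query):
    """Return (s, Q(N(s))) for the node where the query is used up, else None."""
    node, rest = root, query
    while rest:
        child = node.children.get(rest[0])   # child whose identifier starts with c1
        if child is None:
            return None
        label = child.identifier
        overlap = min(len(label), len(rest))
        if label[:overlap] != rest[:overlap]:
            return None                      # mismatch inside the identifier
        node, rest = child, rest[overlap:]
    return node.identifier, node.index_ids


root = Node()
stu = root.add(Node("stu", [1, 2, 3]))
stu.add(Node("dent", [1, 2]))
print(find_match(root, "stud"))
print(find_match(root, "sx"))
```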

(2) Sorting the result set

Negative entropy is defined to measure the degree of matching between the query string c₁…c_n and a search result string s: the smaller the negative entropy, the lower the degree of matching; conversely, the larger the negative entropy, the higher the degree of matching.

The negative entropy is computed as follows (its initial value is 0):

(a) obtain the position i of c₁ in s;

(b) traverse s from i toward its end until a delimiter $, a terminator #, or the end of s is encountered, and let m be the number of characters traversed;

(c) if the end of s is reached, check whether the last character is a terminator #: if so, increase the negative entropy by m² and end the algorithm; otherwise, increase the negative entropy by m and end the algorithm;

(d) if a delimiter $ is encountered, increase the negative entropy by m² and proceed to step (e);

(e) update i to the position one character after the encountered delimiter $ and return to step (b).

The word-segmentation negative entropy of every s in the result set is computed by the above steps, and the result set is sorted by this value from largest to smallest.
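Steps (a) to (e) can be sketched as follows. The sketch rests on several assumptions that the text leaves open: m counts only the characters before the delimiter or terminator, the computation continues with step (e) after a delimiter is scored in step (d), the first occurrence of c₁ is used, and a result string in which c₁ does not occur scores 0. The function names and sample strings are hypothetical.

```python
def negative_entropy(query, s):
    i = s.find(query[0])              # step (a): position of c1 in s
    if i < 0:
        return 0                      # assumed: c1 absent, nothing to score
    score = 0                         # the initial value is 0
    while True:
        m = 0
        while i + m < len(s) and s[i + m] not in "$#":   # step (b)
            m += 1
        end = i + m
        if end == len(s):             # step (c): s ended without a terminator
            return score + m
        if s[end] == "#":             # step (c): segment closed by the terminator
            return score + m * m
        score += m * m                # step (d): segment closed by a delimiter
        i = end + 1                   # step (e): continue after the delimiter


def rank(query, results):
    """Sort result strings by negative entropy, largest first."""
    return sorted(results, key=lambda s: negative_entropy(query, s), reverse=True)


print(negative_entropy("b", "big$student#"))   # 3*3 + 7*7
print(rank("s", ["stu", "student#", "st$ab#"]))
```

Because each complete segment contributes the square of its length, a result that matches one long word scores higher than a result whose characters are spread over several short words.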

(3) Eliminating duplicate terms in a result set and generating a search result sequence

Take the Q(N(s)) of the sorted result set in order, perform the operation below on each, and place the outcome in the search result sequence, whose initial value is empty. Formula (3) gives the operation performed on Q(N(s_i)):

SR(i) = (D(Q(N(s_i))) - SR(i-1)) ⊕ SR(i-1),  1 ≤ i ≤ n    (3)

where SR(i) denotes the search result sequence after the matching index ID sequence Q(N(s_i)) of the i-th node has been merged in, and SR(1) and SR(n) are respectively the initial state and the final state of the search result sequence; D(Q(N(s_i))) denotes performing a deduplication operation on Q(N(s_i)); (D - SR) denotes removing from the deduplicated Q(N(s_i)) the index numbers that already appear in the search result sequence; and (D - SR) ⊕ SR denotes appending (D - SR) to the end of the current search result sequence SR.

After the above steps, the final search result sequence is SR(n).
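The merge defined by formula (3) might be sketched as follows. The function name `merge_results` and the sample data are hypothetical, and it is assumed that deduplication keeps the first occurrence of each index ID so that the order established by the sorting step is preserved.

```python
def merge_results(sorted_id_sequences):
    """Merge Q(N(s_1)), ..., Q(N(s_n)) into one duplicate-free sequence."""
    search_results = []                  # the initially empty result sequence
    seen = set()
    for ids in sorted_id_sequences:      # already sorted by negative entropy
        for index_id in ids:
            # Covers both D(...) and "- SR": skip IDs seen in this or any
            # earlier sequence.
            if index_id not in seen:
                seen.add(index_id)
                search_results.append(index_id)   # append to the end of SR
    return search_results


print(merge_results([[3, 1, 3], [1, 2], [4, 2]]))
```

Each index ID therefore appears once, at the rank of the best-matching result string that contains it.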

The suffix-tree-based index structure of the invention balances index construction time against occupied space well, and its search efficiency is far higher than that of computing the matching degree of the result set by brute force and then sorting it. Compared with fuzzy search implemented on other full-text index structures, the index structure of the invention achieves higher search efficiency at a lower cost in construction time and memory.

Drawings

FIG. 1: Structure of a node of the improved suffix tree.

FIG. 2: Example of an improved suffix tree.

FIG. 3: Comparison of the different cases that arise when inserting a suffix string.

Detailed Description

To study the search performance of the invention on data sets of different sizes, five data sets with data volumes of 10,000, 20,000, 50,000, 100,000 and 200,000 respectively were constructed, and several groups of comparison experiments against the inverted-index-based Lucene engine were run on each data set.

For each length from 2 to 4, 25 search strings were randomly generated, giving 75 search strings in total. Each search string was searched 100,000 times, and the time taken by each search was recorded, provided that the search result was correct.

So that Lucene performs the same task as the index of the invention, a space is inserted between every two characters of each original string when the index is built, so that each character is treated as a word; a space is likewise inserted between every two characters of the search string, giving the same search behavior as the invention.
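The preprocessing described above amounts to a one-line transformation, sketched here in Python. The function name `space_out` and the sample strings are hypothetical, no Lucene API is shown, and it is assumed that the engine is configured with a whitespace-based tokenizer so that each character becomes one token.

```python
def space_out(text):
    """Insert a space between every two characters of the text."""
    return " ".join(text)


indexed_form = space_out("大学生")   # applied to each string before indexing
query_form = space_out("学生")      # applied to each search string
print(indexed_form)
print(query_form)
```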

The experimental results are shown in Table 1:

Table 1. Search time of the index of the invention vs. the Lucene index

As the table shows, the algorithm of the invention searches more efficiently than Lucene on every data set, and the advantage is more pronounced on small data sets: when the data set is smaller than 50,000, the search efficiency of the algorithm of the invention reaches 7 to 10 times that of Lucene.

References:

[1] E. Ukkonen, On-Line Construction of Suffix Trees, Algorithmica, 14 (1995), 249-260.

Claims

1. A search algorithm oriented to Chinese word segmentation, characterized by comprising two stages: an offline index construction stage and an online search stage;

(I) the offline index construction stage, comprising the following specific steps:

(1) generating suffix string sets from an original data set

T(S) denotes the original data set, composed of strings S containing delimiters ($) and a terminator (#), where the index ID of the i-th string is i, 1 ≤ i ≤ n; WBS denotes a suffix string that starts at a delimiter boundary, and NWBS denotes a suffix string that does not; the suffix string sets T(WBS) and T(NWBS), each carrying index IDs, are generated from T(S) as follows:

the first step: traverse all strings in T(S) and extract all suffix strings of each string s_i, forming the sets T*(s₁), T*(s₂), …, T*(s_n), wherein a suffix string is the substring of a string S that runs from position i to the end of S, that is, if S is written C₁C₂…C_n, then C_i C_(i+1) … C_n is called a suffix string of S, 1 ≤ i ≤ n;

the second step: remove from the sets T*(s₁), T*(s₂), …, T*(s_n) all suffix strings that begin with a delimiter ($) or a terminator (#);

the third step: traverse each T*(s_i); if the first character of a suffix string is the first character of the original string or the first character after a delimiter ($) in the original string, append the index ID to the end of the suffix string and add it to T(WBS); otherwise, append the index ID to the end of the suffix string and add it to T(NWBS);

the improved suffix tree is based on the traditional suffix tree, with the label on each edge stored in the node, so that each node serves as a storage unit; a node stores a node identifier, a terminator child-node pointer, a delimiter child-node pointer, a set of general child-node pointers, and a matching index ID sequence, wherein the node identifier is a terminator, a delimiter, or a general character string;

the first step: create an improved suffix tree containing only one node, whose node identifier, child-node pointers and matching index ID sequence are all empty, and mark this node as the root node of the improved suffix tree;

the second step: insert all elements of the suffix string set T into the improved suffix tree in sequence; the insertion of each suffix string starts from the root node and searches for the insertion position;

the third step: recursively construct the matching index ID sequence of each node; the matching index ID sequences of the terminator nodes are already complete once all suffix strings have been inserted, so only the matching index ID sequences Q(N(s)) of the non-terminator nodes N(s) need to be constructed, according to formula (1):

Q(N(s)) = Q(N(s#)) Q(N(s$)) Q(N(s*))    (1)

wherein N(s#), N(s$) and N(s*) denote the terminator child node, the delimiter child node and all general child nodes of the node N(s), respectively;

(II) the online search stage, comprising the following specific steps:

(1) matching node query

for any node N(s), the nodes matching the string c₁…c_n are queried, starting from N(s), according to formula (2):

wherein R(N(s)) denotes the query result, N(s) is a matching node, and s is the node identifier;

given a query string c₁…c_n, first search the child nodes of the root node for the node N(s) whose identifier begins with c₁, then execute R(N(s), c₁…c_n) to find all matching nodes, finally obtaining the search result R(N(s)) = (s, Q(N(s))), wherein Q(N(s)) is the matching index ID sequence of N(s);

(2) sorting the result set

negative entropy is defined to measure the degree of matching between the query string c₁…c_n and a search result string s: the smaller the negative entropy, the lower the degree of matching; conversely, the larger the negative entropy, the higher the degree of matching;

the word-segmentation negative entropy of every s is calculated, and the result set is sorted by this value from largest to smallest;

the Q(N(s)) of the sorted result set are taken in order, the corresponding operation is performed on each, and the outcome is placed in the search result sequence, whose initial value is empty; the corresponding operation performed on Q(N(s_i)) is given by formula (3):

SR(i) = (D(Q(N(s_i))) - SR(i-1)) ⊕ SR(i-1),  1 ≤ i ≤ n    (3)

wherein SR(i) denotes the search result sequence after the matching index ID sequence Q(N(s_i)) of the i-th node has been merged in, and SR(1) and SR(n) are respectively the initial state and the final state of the search result sequence; D(Q(N(s_i))) denotes performing a deduplication operation on Q(N(s_i)); (D - SR) denotes removing from the deduplicated Q(N(s_i)) the index numbers that already appear in the search result sequence; and (D - SR) ⊕ SR denotes appending (D - SR) to the end of the current search result sequence SR;

the final search result sequence is SR(n).

2. The Chinese word segmentation oriented search algorithm of claim 1, wherein the insertion of each suffix string starts from the root node and searches for the insertion position, and falls into the following three cases:

case one: if the suffix string to be inserted already appears in the current tree, its index number is added directly to the matching index ID sequence of the node;

case two: if a prefix of the suffix string to be inserted coincides with an existing node, the remaining nodes are added directly;

case three: if the suffix string to be inserted shares a prefix with the identifier of the current node, the current node is split first and the other nodes are then inserted.

3. The Chinese word segmentation oriented search algorithm of claim 1, wherein the word-segmentation negative entropy of s is calculated as follows, with the initial value set to 0:

(a) obtain the position i of c₁ in s;

(e) update i to the position one character after the encountered delimiter $ and return to step (b).