CN108846016A

CN108846016A - A search algorithm oriented to Chinese word segmentation

Info

Publication number: CN108846016A
Application number: CN201810422499.3A
Authority: CN
Inventors: 金城; 陶仕谦; 唐士芳; 吴渊; 张玥杰; 冯瑞; 薛向阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2018-05-05
Filing date: 2018-05-05
Publication date: 2018-11-20
Anticipated expiration: 2038-05-05
Also published as: CN108846016B

Abstract

The invention belongs to the technical field of text search engines and specifically relates to a search algorithm oriented to Chinese word segmentation. The algorithm has two stages: an offline index-building stage and an online search stage. In the offline index-building stage, the suffix string sets of all original strings are first extracted, and improved suffix trees are then generated from the suffix string sets. In the online search stage, the query results for a keyword are first obtained from the suffix-tree-based index model, the matching degree between the keyword and each query result is then quantified, and finally the query results are sorted by matching degree from high to low and returned. With its improved suffix-tree-based index structure, the invention balances index construction time against space usage, and searching with this index structure is far more efficient than brute-force computation of matching degrees over the result set followed by sorting.

Description

A search algorithm oriented to Chinese word segmentation

Technical field

The invention belongs to the technical field of text search engines and specifically relates to a search algorithm oriented to Chinese word segmentation.

Background art

A search engine is an online information retrieval tool that returns to the user a series of search results matching the user's search keywords. Today's society is in an era of information explosion. Faced with vast amounts of information, quickly and accurately locating the information a user wants is one of the most urgent needs, and information search technology has therefore developed rapidly and been widely applied.

The most common form of search is text search: whether the user's target resource is text, images, audio, or even video, as long as the input is text, it falls within the scope of the search addressed by the invention. Besides the web-wide search offered by Google, Bing, Yahoo, and others, demand for search in specific domains is also growing. In a specific domain (for example, one covering only TV programmes), the types of resources are limited, so search conditions can generally be made very explicit, and the size of the data set also stays within an acceptable range. Under these premises, many targeted optimizations can be made to the search engine.

The technologies currently used in Chinese search systems mainly include inverted indexes, forward indexes, document signatures (DS), and suffix trees. Of these, the inverted index has the best overall performance and is the most commonly used, but in practice, processing large text collections with an inverted index model places severe demands on CPU resources, memory, and I/O.

Summary of the invention

The object of the invention is to propose a search algorithm oriented to Chinese word segmentation for use in intelligent Chinese search engine systems, so that search results can be returned quickly for a keyword and presented to the user sorted by matching degree from high to low.

The proposed search algorithm is divided into two stages: an offline index-building stage and an online search stage. In the offline index-building stage, the suffix string sets of all original strings are first extracted, and improved suffix trees are then generated from the suffix string sets. In the online search stage, the query results for a keyword are first obtained from the suffix-tree-based index model, the matching degree between the keyword and each query result is then quantified, and finally the query results are sorted by matching degree from high to low and returned.

Stage one: offline index building. The specific steps are:

(1) Generate the suffix string sets from the original data set

T(S) denotes the original data set, composed of strings S that contain separators ($) and an end marker (#), where the index ID of the i-th string is i (1 ≤ i ≤ n). Let WBS denote a suffix string that starts at a separator, and NWBS a suffix string that does not start at a separator. The suffix string sets T(WBS) and T(NWBS), which carry index IDs, are generated from T(S) as follows:

First step: Traverse all strings in T(S) and extract all suffix strings of each string s_i, forming the sets T^*(s₁), T^*(s₂), …, T^*(s_n) [1]. A suffix string is a substring of a string S running from position i to the end marker of S; that is, if S is written C₁C₂…C_n, then C_iC_{i+1}…C_n is called a suffix string of S (1 ≤ i ≤ n).

Second step: Remove from T^*(s₁), T^*(s₂), …, T^*(s_n) all suffix strings that begin with a separator ($) or the end marker (#).

Third step: Traverse all suffix strings in T^*(s_i). If the first character of a suffix string is the same as the first character of the original string, or the same as the first character following a separator ($) in the original string, append the index ID to the end of the suffix string and add it to T(WBS); otherwise, append the index ID to the end of the suffix string and add it to T(NWBS).
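The three steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the function name `split_suffixes` is hypothetical, each input string is assumed to be non-empty and to end with `#`, and the index ID is kept next to the suffix as a pair instead of being appended to the suffix text.

```python
SEP, END = "$", "#"


def split_suffixes(strings):
    """Return (T_WBS, T_NWBS) as lists of (suffix, index_id) pairs.

    Hypothetical helper: index IDs start at 1, following the text.
    """
    wbs, nwbs = [], []
    for index_id, s in enumerate(strings, start=1):
        # Characters that begin a word: the first character of s and
        # every character that directly follows a separator.
        word_initials = {s[0]}
        for k in range(len(s) - 1):
            if s[k] == SEP:
                word_initials.add(s[k + 1])
        for pos in range(len(s)):
            suffix = s[pos:]                  # first step: every suffix of s
            if suffix[0] in (SEP, END):       # second step: drop $- or #-led suffixes
                continue
            # Third step: compare the leading character with the word initials.
            target = wbs if suffix[0] in word_initials else nwbs
            target.append((suffix, index_id))
    return wbs, nwbs
```

For the single string `"ab$cd#"`, the suffixes `"ab$cd#"` and `"cd#"` go to T(WBS), while `"b$cd#"` and `"d#"` go to T(NWBS).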

(2) Build an improved suffix tree for each of the suffix string sets T(WBS) and T(NWBS)

The improved suffix tree builds on the traditional suffix tree [1] by storing the label of each edge in a node; that is, each node serves as a storage unit, with the structure shown in Fig. 1. The information stored in a node comprises the node identifier, an end-marker child node pointer, a separator child node pointer, a set of general child node pointers, and a match index ID sequence, where the node identifier is an end marker, a separator, or a general string.

The specific steps for building an improved suffix tree for any suffix string set T are as follows:

First step: Create an improved suffix tree containing only one node, whose node identifier, child node pointers, and match index ID sequence are all empty. This node is the root node of the improved suffix tree.

Second step: Insert all elements of the suffix string set T into the improved suffix tree in turn. The insertion of each suffix string starts from the root node to find the insertion position.

Taking the improved suffix tree in Fig. 2 as an example, inserting a suffix string falls into one of the following three cases:

Case 1: If the suffix string to be inserted already appears in the current tree, its index number is added directly to the match index ID sequence of the node. For example, if the suffix string to be inserted is "student #2", the "student" node already appears in the current tree, so the index number is added directly to the node's match index ID sequence; the result is shown in Fig. 3(a).

Case 2: If a prefix of the suffix string to be inserted is identical to an existing node, the remaining nodes are simply added. For example, if the suffix string to be inserted is "Fudan University $ student #3", the "Fudan University" node already exists, so the nodes "student" and "#" are added directly; the result is shown in Fig. 3(b).

Case 3: If the suffix string to be inserted shares a prefix with the current node, the current node is first split and the other nodes are then inserted. For example, if the suffix string to be inserted is "big student #4", its prefix "big" is the same as a prefix of the current node "university" (in the original Chinese the two begin with the same character), so the current node is first split and the other nodes are then inserted; the result is shown in Fig. 3(c).
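The node structure and the three insertion cases can be illustrated with the following Python sketch. It is a simplified reading of the description rather than the structure of Fig. 1 to Fig. 3, which are not reproduced here. The names `Node`, `insert`, `_segments`, and `_insert_text` are hypothetical, the index ID is passed as a separate argument instead of being appended to the suffix, and index IDs are assumed to be recorded only at end-marker nodes during insertion.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

SEP, END = "$", "#"


@dataclass
class Node:
    label: str = ""                               # node identifier
    end_child: Optional["Node"] = None            # end-marker child pointer
    sep_child: Optional["Node"] = None            # separator child pointer
    children: Dict[str, "Node"] = field(default_factory=dict)  # general children by first char
    ids: List[int] = field(default_factory=list)  # match index ID sequence


def _segments(suffix):
    """Split 'ab$cd#' into ['ab', '$', 'cd', '#']."""
    out, buf = [], ""
    for ch in suffix:
        if ch in (SEP, END):
            if buf:
                out.append(buf)
                buf = ""
            out.append(ch)
        else:
            buf += ch
    if buf:
        out.append(buf)
    return out


def _insert_text(node, text):
    """Insert a general string below node and return the node where it ends."""
    while text:
        child = node.children.get(text[0])
        if child is None:                         # case 2: add a new node
            child = Node(text)
            node.children[text[0]] = child
            return child
        k = 0
        while k < min(len(text), len(child.label)) and text[k] == child.label[k]:
            k += 1
        if k < len(child.label):                  # case 3: split the existing node
            tail = Node(child.label[k:], child.end_child, child.sep_child,
                        child.children, child.ids)
            child.label = child.label[:k]
            child.end_child = child.sep_child = None
            child.children = {tail.label[0]: tail}
            child.ids = []
        node, text = child, text[k:]
    return node


def insert(root, suffix, index_id):
    """Insert one suffix string, starting the search at the root."""
    node = root
    for seg in _segments(suffix):
        if seg == END:
            if node.end_child is None:
                node.end_child = Node(END)
            node = node.end_child
            node.ids.append(index_id)             # case 1 if the path already existed
        elif seg == SEP:
            if node.sep_child is None:
                node.sep_child = Node(SEP)
            node = node.sep_child
        else:
            node = _insert_text(node, seg)
```

Inserting `"cd#"` twice with IDs 2 and 3 only extends the ID sequence of the existing end-marker node (case 1), while inserting `"ac#"` into a tree that already holds the node `"ab"` splits that node into `"a"` followed by `"b"` before adding `"c"` (case 3).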

Third step: Recursively construct the match index ID sequence of each node. As described above, the match index ID sequences of the end-marker nodes are complete once all suffix strings have been inserted. Therefore, only the match index ID sequence Q(N(s)) of each non-end-marker node N(s) needs to be constructed, by combining the sequences of its child nodes as given in formula (1):

Q(N(s)) = Q(N(s#)) Q(N(s$)) Q(N(s*))    (1)

where N(s#), N(s$), and N(s*) denote the end-marker child node, the separator child node, and all general child nodes of node N(s), respectively.
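The recursion of formula (1) can be sketched as a bottom-up traversal. The sketch below assumes that the three child sequences are combined by concatenation in the order in which they are listed, and that duplicates are kept at this stage. The `Node` class and the `build_ids` function are hypothetical and reduced to the fields the recursion needs.

```python
class Node:
    """Hypothetical minimal node: identifier, ID sequence, and child pointers."""

    def __init__(self, label, ids=None, end=None, sep=None, children=()):
        self.label = label
        self.ids = list(ids or [])
        self.end_child = end
        self.sep_child = sep
        self.children = list(children)


def build_ids(node):
    """Fill node.ids for every non-end-marker node and return node.ids."""
    if node.label == "#":
        # End-marker nodes were already filled while the suffixes were inserted.
        return node.ids
    merged = []
    if node.end_child is not None:
        merged += build_ids(node.end_child)   # Q(N(s#))
    if node.sep_child is not None:
        merged += build_ids(node.sep_child)   # Q(N(s$))
    for child in node.children:
        merged += build_ids(child)            # Q(N(s*)), one general child at a time
    node.ids = merged
    return merged
```

Under these assumptions, a node whose end-marker child holds `[1]` and whose separator child leads to an end-marker node holding `[2, 1]` receives the sequence `[1, 2, 1]`.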

Stage two: online search. The specific steps are:

(1) Query the matching nodes

For any node N(s), the process of querying, starting from N(s), the nodes that match the query string c₁…c_n is given by formula (2):

where R(N(s)) denotes the query result, N(s) is the matched node, and s is the node identifier.

Given a query string c₁…c_n, first search all child nodes of the root node for the child node N(s) whose identifier begins with c₁, then execute R(N(s), c₁…c_n) to find all matching nodes, finally obtaining the search result R(N(s)) = (S, Q(N(s))), where Q(N(s)) is the match index ID sequence of N(s).

(2) Sort the result set

Negentropy is defined to measure the matching degree between the query string c₁…c_n and a search result string s: the smaller the negentropy, the lower the matching degree; conversely, the larger the negentropy, the higher the matching degree.

The negentropy value is computed as follows (the initial value is 0):

(a) Obtain the position i of c₁ in s;

(b) Traverse s from i toward its end until a separator $, the end marker #, or the end of s is encountered; let m be the number of characters traversed;

(c) If the end of s is encountered, check whether the last character is the end marker #. If it is, increase the negentropy value by m² and terminate; otherwise, increase the negentropy value by m and terminate;

(d) If a separator $ is encountered, increase the negentropy value by m²;

(e) Update i to the position of the character following the separator just encountered, and return to (b).

Compute the word-segmentation negentropy of every s in the result set using the steps above, and sort the result set in descending order of this value.
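Steps (a) to (e) can be sketched in Python as follows. The function name `negentropy` is hypothetical, and the sketch rests on three assumptions: the position i of c₁ is supplied by the caller, m counts the characters before the delimiter and not the delimiter itself, and step (d) adds m² and then continues with step (e), since step (e) could not be reached otherwise.

```python
SEP, END = "$", "#"


def negentropy(s, i):
    """Word-segmentation negentropy of result string s, starting at position i."""
    score = 0                                  # initial value is 0
    while True:
        m = 0
        # Step (b): walk toward the end of s until '$', '#', or the end of s.
        while i < len(s) and s[i] not in (SEP, END):
            i += 1
            m += 1
        if i >= len(s):                        # step (c): s ended without '#'
            return score + m
        if s[i] == END:                        # step (c): s ended on the end marker
            return score + m * m
        score += m * m                         # step (d): a separator was reached
        i += 1                                 # step (e): resume after the separator
```

With these assumptions, `negentropy("abcd#", 0)` is 16 and `negentropy("ab$cd#", 0)` is 8, so a result in which the match covers one long word ranks above a result in which the same characters are spread over two shorter words. Sorting a result set is then a call such as `results.sort(key=lambda r: negentropy(r[0], r[1]), reverse=True)` for hypothetical `(s, i)` pairs.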

(3) Eliminate duplicates in the result set and generate the search result sequence

Take the Q(N(s)) of the sorted result set in turn and place each into the search result sequence after applying the corresponding operation; the search result sequence is initially empty. Formula (3) gives the operation applied to Q(N(s_i)):

SR(i) = (D(Q(N(s_i))) - SR(i-1)) ∩ SR(i-1), 1 ≤ i ≤ n    (3)

where SR(i) denotes the search result sequence after the match index ID sequence Q(N(s_i)) of the i-th node has been merged in, and SR(1) and SR(n) are the initial and final states of the search result sequence, respectively; D(Q(N(s_i))) denotes deduplicating Q(N(s_i)); (D-SR) denotes removing from the deduplicated Q(N(s_i)) the index numbers that already appear in the search result sequence; and (D-SR) ∩ SR denotes appending (D-SR) to the end of the current search result sequence SR.

After the above steps, the final search result sequence is SR(n).

With its improved suffix-tree-based index structure, the invention strikes a good balance between index construction time and space usage. Searching with this index structure is far more efficient than brute-force computation of matching degrees over the result set followed by sorting, and, compared with fuzzy search implemented on other full-text index structures, the index structure of the invention achieves very high search efficiency while requiring less construction time and memory.
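The merge described by formula (3) amounts to an order-preserving deduplication across the sorted ID sequences. A minimal Python sketch, with the hypothetical function name `merge_results`:

```python
def merge_results(sorted_id_sequences):
    """Append each ID sequence in rank order, keeping only first occurrences."""
    result = []        # the search result sequence SR, initially empty
    seen = set()       # index numbers already present in SR
    for ids in sorted_id_sequences:
        for index_id in ids:
            # One membership test covers both D(...) and the "- SR" step:
            # an ID is skipped if it repeats within ids or is already in SR.
            if index_id not in seen:
                seen.add(index_id)
                result.append(index_id)
    return result
```

For example, merging `[3, 1, 3]`, `[1, 2]`, and `[2, 4]` in that order yields `[3, 1, 2, 4]`, so every index number appears once, at the rank of its best-matching result.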

Brief description of the drawings

Fig. 1: Structure of an improved suffix tree node.

Fig. 2: Example of an improved suffix tree.

Fig. 3: Comparison of the different cases when inserting a suffix string.

Specific embodiment

To study the search performance of the invention on data sets of different sizes, we constructed five data sets containing 10,000, 20,000, 50,000, 100,000, and 200,000 entries, and ran several groups of comparative experiments on each data set against the Lucene engine, which is based on inverted lists.

We randomly generated 25 search strings for each of the lengths 2, 3, and 4, giving 75 search strings in total. Each search string was searched 100,000 times and, provided the search results were correct, the time taken by each search was recorded.

So that Lucene could perform the same indexing task as the invention, a space was inserted between the characters of each original string when the initial index was built, so that each character is treated as a word; spaces were likewise inserted between the characters of each search string, giving the same search function as the invention.
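The preprocessing applied to the Lucene input is plain string manipulation and can be sketched as follows. The function name `space_out` is hypothetical, and the sketch covers only the insertion of spaces, not the Lucene indexing or query calls themselves.

```python
def space_out(text):
    """Insert a space between adjacent characters so each one is tokenized as a word."""
    return " ".join(text)
```

Applying the same function to both the indexed strings and the search strings keeps the two sides consistent, for example `space_out("abc")` gives `"a b c"`.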

The experimental results are shown in Table 1:

Table 1: Comparison of search time between the index of the invention and the Lucene index

As the table shows, the algorithm of the invention achieves better search efficiency than Lucene on every data set, and the advantage is more pronounced on small data sets: when the data set has fewer than 50,000 entries, the search efficiency of the algorithm can reach 7 to 10 times that of Lucene.

References:

[1] E. Ukkonen, On-Line Construction of Suffix Trees, Algorithmica, 14 (1995), 249-260.

Claims

1. A search algorithm oriented to Chinese word segmentation, characterized in that it is divided into two stages: an offline index-building stage and an online search stage;

(1) the offline index-building stage, whose specific steps are:

(1) generating the suffix string sets from the original data set

T(S) denotes the original data set, composed of strings S that contain separators ($) and an end marker (#), where the index ID of the i-th string is i, 1 ≤ i ≤ n; let WBS denote a suffix string that starts at a separator and NWBS a suffix string that does not start at a separator; the suffix string sets T(WBS) and T(NWBS), which carry index IDs, are generated from T(S) as follows:

First step: traverse all strings in T(S) and extract all suffix strings of each string s_i, forming the sets T^*(s₁), T^*(s₂), …, T^*(s_n), where a suffix string is a substring of a string S running from position i to the end marker of S, that is, if S is written C₁C₂…C_n, then C_iC_{i+1}…C_n is called a suffix string of S, 1 ≤ i ≤ n;

Second step: remove from the sets T^*(s₁), T^*(s₂), …, T^*(s_n) all suffix strings that begin with a separator ($) or the end marker (#);

Third step: traverse all suffix strings in T^*(s_i); if the first character of a suffix string is the same as the first character of the original string, or the same as the first character following a separator ($) in the original string, append the index ID to the end of the suffix string and add it to T(WBS); otherwise, append the index ID to the end of the suffix string and add it to T(NWBS);

The so-called improved suffix tree builds on the traditional suffix tree by storing the label of each edge in a node, that is, each node serves as a storage unit; the information stored in a node comprises the node identifier, an end-marker child node pointer, a separator child node pointer, a set of general child node pointers, and a match index ID sequence, where the node identifier is an end marker, a separator, or a general string;

First step: create an improved suffix tree containing only one node, whose node identifier, child node pointers, and match index ID sequence are all empty; this node is the root node of the improved suffix tree;

Second step: insert all elements of the suffix string set T into the improved suffix tree in turn; the insertion of each suffix string starts from the root node to find the insertion position;

Third step: recursively construct the match index ID sequence of each node; as described above, the match index ID sequences of the end-marker nodes are complete once all suffix strings have been inserted; only the match index ID sequence Q(N(s)) of each non-end-marker node N(s) needs to be constructed, according to formula (1):

Q(N(s)) = Q(N(s#)) Q(N(s$)) Q(N(s*))    (1)

where N(s#), N(s$), and N(s*) denote the end-marker child node, the separator child node, and all general child nodes of node N(s), respectively;

(2) the online search stage, whose specific steps are:

(1) querying the matching nodes

for any node N(s), starting from N(s), the nodes matching the query string c₁…c_n are queried according to formula (2):

where R(N(s)) denotes the query result, N(s) is the matched node, and s is the node identifier;

given a query string c₁…c_n, first search all child nodes of the root node for the child node N(s) whose identifier begins with c₁, then execute R(N(s), c₁…c_n) to find all matching nodes, finally obtaining the search result R(N(s)) = (S, Q(N(s))), where Q(N(s)) is the match index ID sequence of N(s);

(2) sorting the result set

negentropy is defined to measure the matching degree between the query string c₁…c_n and a search result string s: the smaller the negentropy, the lower the matching degree; conversely, the larger the negentropy, the higher the matching degree;

the word-segmentation negentropy of every s is computed, and the result set is sorted in descending order of this value;

the Q(N(s)) of the sorted result set are taken in turn and each is placed into the search result sequence after the corresponding operation is applied, the search result sequence being initially empty; the corresponding operation is the following operation applied to Q(N(s_i)) according to formula (3):

SR(i) = (D(Q(N(s_i))) - SR(i-1)) ∩ SR(i-1), 1 ≤ i ≤ n    (3)

where SR(i) denotes the search result sequence after the match index ID sequence Q(N(s_i)) of the i-th node has been merged in, and SR(1) and SR(n) are the initial and final states of the search result sequence, respectively; D(Q(N(s_i))) denotes deduplicating Q(N(s_i)); (D-SR) denotes removing from the deduplicated Q(N(s_i)) the index numbers that already appear in the search result sequence; (D-SR) ∩ SR denotes appending (D-SR) to the end of the current search result sequence SR;

the final search result sequence is SR(n).

2. The search algorithm oriented to Chinese word segmentation according to claim 1, characterized in that the insertion of each suffix string starts from the root node to find the insertion position and falls into one of the following three cases:

Case 1: if the suffix string to be inserted already appears in the current tree, its index number is added directly to the match index ID sequence of the node;

Case 2: if a prefix of the suffix string to be inserted is identical to an existing node, the remaining nodes are added directly;

Case 3: if the suffix string to be inserted shares a prefix with the current node, the current node is first split and the other nodes are then inserted.

3. The search algorithm oriented to Chinese word segmentation according to claim 1, characterized in that the word-segmentation negentropy of s is computed by the following steps, the initial value being 0:

(a) obtain the position i of c₁ in s;

(b) traverse s from i toward its end until a separator $, the end marker #, or the end of s is encountered; let m be the number of characters traversed;

(c) if the end of s is encountered, check whether the last character is the end marker #; if it is, increase the negentropy value by m² and terminate; otherwise, increase the negentropy value by m and terminate;

(e) update i to the position of the character following the separator just encountered, and return to step (b).