CN108846016B - Chinese word segmentation oriented search algorithm - Google Patents


Publication number
CN108846016B
Authority
CN
China
Prior art keywords
node
suffix
string
index
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810422499.3A
Other languages
Chinese (zh)
Other versions
CN108846016A (en)
Inventor
金城
陶仕谦
唐士芳
吴渊
张玥杰
冯瑞
薛向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN201810422499.3A
Publication of CN108846016A
Application granted
Publication of CN108846016B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of text search engines, and particularly relates to a Chinese word segmentation oriented search algorithm. The algorithm is divided into two stages: an offline index construction stage and an online search stage. In the offline index construction stage, the suffix string sets of all original character strings are first extracted, and an improved suffix tree is then generated from those sets. In the online search stage, the query result for the keyword is first obtained from the suffix-tree-based index model, the matching degree between the keyword and each query result is then quantified, and finally the query results are returned sorted from high to low by matching degree. The improved suffix-tree-based index structure balances index construction time against occupied space, and its search efficiency is far higher than that of computing the matching degree of the result set by brute force and then sorting it.

Description

Chinese word segmentation oriented search algorithm
Technical Field
The invention belongs to the technical field of text search engines, and particularly relates to a Chinese word segmentation oriented search algorithm.
Background
A search engine is an online information retrieval tool that returns to the user a series of search results matching the user's search keywords. Today's society is in an era of information explosion. Faced with vast amounts of information, quickly and accurately locating the information the user wants is one of the most urgent needs, and information search technology has therefore developed rapidly and been widely applied.
The most common form of search is text search. Whether the user's target resource is text, image, audio, or even video, the search falls within the scope of text search as long as the input is text. Besides the web-wide search provided by Google, Baidu, Yahoo and the like, demand for search within specific fields is also increasing. In a specific field (for example, one covering only television programs), the limited range of resource types makes the search conditions quite clear, and the size of the data set stays within an acceptable range, so a number of targeted optimizations can be made to the search engine.
The technologies currently used in Chinese search systems mainly include inverted indexes, forward indexes, signature files, suffix trees, and the like. The inverted index has the best overall performance and is the most commonly used, but in practice, applying the inverted index model to a large text collection places severe demands on CPU resources, memory space, and I/O.
Disclosure of Invention
The invention aims to provide a Chinese word segmentation oriented search algorithm for use in an intelligent Chinese search engine system, so that search results can be returned quickly for given keywords and displayed to the user sorted from high to low by matching degree.
The Chinese word segmentation oriented search algorithm provided by the invention is divided into two stages: an offline index construction stage and an online search stage. In the offline index construction stage, the suffix string sets of all original character strings are first extracted, and an improved suffix tree is then generated from those sets. In the online search stage, the query result for the keyword is first obtained from the suffix-tree-based index model, the matching degree between the keyword and each query result is then quantified, and finally the query results are returned sorted from high to low by matching degree.
I. Offline index construction stage, with the following specific steps:
(1) Generating suffix string sets from the original data set
T(S) denotes the original data set, composed of strings S containing delimiters ($) and a terminator (#), where the index ID of the i-th string is i (1 ≤ i ≤ n). Let WBS denote a suffix string that starts at a delimiter and NWBS denote a suffix string that does not start at a delimiter. The specific steps for generating the suffix string sets T(WBS) and T(NWBS), carrying index IDs, from T(S) are as follows:
Step 1: Traverse all strings in T(S) and extract all suffix strings of each string sᵢ to form the sets T*(s₁), T*(s₂), …, T*(sₙ) [1]. A suffix string is a substring of the string S that starts at position i and runs to the end of S; that is, if S is written C₁C₂…Cₙ, then CᵢCᵢ₊₁…Cₙ is called a suffix string of S (1 ≤ i ≤ n);
Step 2: Remove from T*(s₁), T*(s₂), …, T*(sₙ) all suffix strings that begin with a delimiter ($) or a terminator (#);
Step 3: Traverse T*(sᵢ). If the first character of a suffix string is the same as the first character of the original string or the first character after a delimiter ($) in the original string, append the index ID to the end of the suffix string and add it to T(WBS); otherwise, append the index ID to the end of the suffix string and add it to T(NWBS).
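The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `suffix_sets` and the sample string are hypothetical, and the test in step 3 is read positionally (a suffix counts as a WBS when it starts at position 0 or immediately after a `$`), which is an assumption about the intended meaning.

```python
DELIM, TERM = "$", "#"


def suffix_sets(strings):
    """Split the suffixes of every string into T(WBS) and T(NWBS).

    Index IDs start at 1 and are appended to the end of each suffix.
    """
    wbs, nwbs = [], []
    for index_id, s in enumerate(strings, start=1):
        for pos in range(len(s)):
            suffix = s[pos:]                       # step 1: every suffix of s
            if suffix[0] in (DELIM, TERM):         # step 2: drop "$..." and "#..."
                continue
            # step 3 (assumed positional reading): word-boundary suffixes start
            # the string or directly follow a delimiter
            at_boundary = pos == 0 or s[pos - 1] == DELIM
            tagged = f"{suffix}{index_id}"
            (wbs if at_boundary else nwbs).append(tagged)
    return wbs, nwbs


wbs, nwbs = suffix_sets(["ab$cd#"])
print(wbs)   # ['ab$cd#1', 'cd#1']
print(nwbs)  # ['b$cd#1', 'd#1']
```

For the single string "ab$cd#", the suffixes "$cd#" and "#" are discarded, the two suffixes that start a word go to T(WBS), and the remaining two go to T(NWBS).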
(2) Building improved suffix trees for the suffix string sets T(WBS) and T(NWBS)
The improved suffix tree builds on the traditional suffix tree [1] by storing the label of each edge in the node. That is, each node serves as a storage unit, whose structure is shown in FIG. 1. The information stored in a node comprises a node identifier, a terminator child node pointer, a delimiter child node pointer, a set of general child node pointers, and a matching index ID sequence, where the node identifier is a terminator, a delimiter, or a general character string.
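One possible in-memory layout for such a node is sketched below. The class and field names are hypothetical, and keying the general children by the first character of their identifier is an assumption made for fast lookup; the text only states which five pieces of information a node stores.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One storage unit of the improved suffix tree (illustrative layout)."""
    label: str = ""                          # node identifier: "#", "$", or a general string
    end_child: Optional["Node"] = None       # terminator child node pointer
    sep_child: Optional["Node"] = None       # delimiter child node pointer
    # general child node pointers, keyed by first character (assumption)
    children: Dict[str, "Node"] = field(default_factory=dict)
    ids: List[int] = field(default_factory=list)   # matching index ID sequence


root = Node()   # the empty root created before any suffix is inserted
```

Using `default_factory` gives every node its own dictionary and list, so adding a child or an ID to one node does not affect another.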
The specific steps for building the improved suffix tree for an arbitrary suffix string set T are as follows:
Step 1: Create an improved suffix tree containing a single node whose node identifier, child node pointers, and matching index ID sequence are all null, and mark this node as the root of the improved suffix tree.
Step 2: Insert all elements of the suffix string set T into the improved suffix tree in sequence. The insertion of each suffix string starts from the root node and searches for the insertion position.
Taking the improved suffix tree in FIG. 2 as an example, inserting a suffix string falls into one of three cases:
Case one: If the suffix string to be inserted already appears in the current tree, its index number is added directly to the matching index ID sequence of the corresponding node. For example, when the suffix string "student#2" is inserted, the "student" node already exists in the current tree, so the index number is added directly to that node's matching index ID sequence; the result is shown in FIG. 3(a).
Case two: If a prefix of the suffix string to be inserted matches an existing node, the remaining nodes are added directly. For example, when the suffix string "home$student#3" is inserted, the "home" node already exists, so the nodes "student" and "#" are added directly; the result is shown in FIG. 3(b).
Case three: If the suffix string to be inserted shares only a prefix with the current node, the current node is first split and the other nodes are then inserted. For example, when the suffix string "big$student#4" is inserted, its prefix "big" is also a prefix of the current node "university" (presumably 大 and 大学 in the original Chinese), so the current node is split first and the other nodes are then inserted; the result is shown in FIG. 3(c).
The third step: a matching index ID sequence for each node is recursively constructed. As can be seen earlier, the matching index ID sequence of the terminator node has been constructed when the full suffix string insertion is complete. Therefore, only the matching index ID sequences Q (n (s)) of all the non-end symbol nodes n(s) need to be constructed, and the specific method is as shown in formula (1):
Q(N(s))=Q(N(s#))Q(N(s$))Q(N(s*))# (1)
wherein, N (s #), N (s $) and N (s ×) respectively represent the end token sub-node, the separation token sub-node and all the general sub-nodes of the node N(s).
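The tree-building steps can be sketched as below. This is a simplified illustration under several assumptions, not the patented code: all names are hypothetical, the index ID is passed as a separate argument instead of being parsed off the end of the suffix string, general children are keyed by first character, and the juxtaposition in formula (1) is read as concatenation in the order terminator child, delimiter child, general children.

```python
class Node:
    def __init__(self, label=""):
        self.label = label       # "#", "$", or a general character string
        self.end_child = None    # terminator child
        self.sep_child = None    # delimiter child
        self.children = {}       # general children, keyed by first character
        self.ids = []            # matching index ID sequence


def plain_run(text):
    """Leading part of text that contains no '$' or '#'."""
    for i, ch in enumerate(text):
        if ch in "$#":
            return text[:i]
    return text


def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def split(node, k):
    """Case three: keep label[:k] in node, move the remainder into a new child."""
    tail = Node(node.label[k:])
    tail.end_child, tail.sep_child = node.end_child, node.sep_child
    tail.children, tail.ids = node.children, node.ids
    node.label = node.label[:k]
    node.end_child = node.sep_child = None
    node.children, node.ids = {tail.label[0]: tail}, []


def insert(root, suffix, index_id):
    node, rest = root, suffix
    while rest:
        ch = rest[0]
        if ch == "#":                        # terminator: record the index ID
            if node.end_child is None:
                node.end_child = Node("#")
            node.end_child.ids.append(index_id)   # case one when it already existed
            return
        if ch == "$":                        # delimiter: descend into the delimiter child
            if node.sep_child is None:
                node.sep_child = Node("$")
            node, rest = node.sep_child, rest[1:]
            continue
        child = node.children.get(ch)
        if child is None:                    # case two: add the missing node directly
            child = Node(plain_run(rest))
            node.children[ch] = child
            node, rest = child, rest[len(child.label):]
            continue
        k = common_prefix_len(child.label, rest)
        if k < len(child.label):             # case three: split, then continue below
            split(child, k)
        node, rest = child, rest[k:]


def build_ids(node):
    """Formula (1), read as concatenation of the children's ID sequences."""
    if node.label == "#":                    # terminator nodes are already complete
        return node.ids
    ids = []
    if node.end_child is not None:
        ids += build_ids(node.end_child)
    if node.sep_child is not None:
        ids += build_ids(node.sep_child)
    for child in node.children.values():
        ids += build_ids(child)
    node.ids = ids
    return ids


root = Node()
insert(root, "大学#", 1)
insert(root, "大$学生#", 4)    # splits the node "大学" into "大" and "学"
insert(root, "大学#", 2)
build_ids(root)
print(root.children["大"].ids)   # [4, 1, 2]
```

In this run the second insertion triggers the split of case three, the third insertion only appends an ID as in case one, and the node "大" ends up with the IDs of its delimiter subtree followed by those of its general child.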
II. Online search stage, with the following specific steps:
(1) Matching node query
For any node N(s), the process of querying the matching nodes of the string c₁…cₙ starting from N(s) is given by formula (2):
[Formula (2): image BDA0001651108910000031, not reproduced in this text]
where R(N(s)) denotes the query result, N(s) is a matching node, and s is the node identifier.
Given a query string c₁…cₙ, first search all child nodes of the root node for the nodes whose identifier begins with c₁, then execute R(N(s), c₁…cₙ) to find all matching nodes, finally obtaining the search result R(N(s)) = (s, Q(N(s))), where Q(N(s)) is the matching index ID sequence of N(s).
(2) Sorting the result set
A negative entropy is defined to measure the degree of matching between the query string c₁…cₙ and a search result string s: the smaller the value, the lower the degree of matching; conversely, the larger the value, the higher the degree of matching.
The negative entropy is computed as follows (the initial value is 0):
(a) obtain the position i of c₁ in s;
(b) traverse s from i toward the end of s until a delimiter $, a terminator #, or the end of s is encountered; let m be the number of characters traversed;
(c) if the end of s is reached, check whether the last character is the terminator #; if so, increase the negative entropy by m² and end the algorithm; otherwise, increase the negative entropy by m and end the algorithm;
(d) if a delimiter $ is encountered, increase the negative entropy by m²;
(e) update i to the position one character after the encountered delimiter $ and return to (b).
The word segmentation negative entropy of every s in the result set is calculated by the above steps, and the result set is sorted by this value from largest to smallest.
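Steps (a) to (e) can be sketched as the following function. The name `negative_entropy` is hypothetical, and three readings are assumptions: the count m excludes the delimiter or terminator that stops the traversal, the position in step (a) is the first occurrence of c₁ in s, and step (d) continues to step (e) rather than ending the algorithm, since step (e) would otherwise never be reached.

```python
def negative_entropy(query, s):
    """Word segmentation negative entropy of result string s for a query."""
    if not query:
        return 0
    i = s.find(query[0])               # (a) position of c1 in s
    if i < 0:
        return 0
    score = 0
    while True:
        m = 0
        while i < len(s) and s[i] not in "$#":   # (b) walk to '$', '#', or end of s
            m += 1
            i += 1
        if i == len(s):                # (c) s ended without a terminator
            return score + m
        if s[i] == "#":                # (c) s ended with the terminator
            return score + m * m
        score += m * m                 # (d) delimiter reached
        i += 1                         # (e) resume one character after '$'


results = ["abc", "ab#", "a$bc#"]
ranked = sorted(results, key=lambda s: negative_entropy("a", s), reverse=True)
print(ranked)   # ['a$bc#', 'ab#', 'abc']
```

Here "abc" scores 3 (three characters, no terminator), "ab#" scores 4 (2² for a complete word), and "a$bc#" scores 5 (1² for the first word plus 2² for the second), so the results that end on word boundaries rank ahead of the one cut off mid-word.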
(3) Eliminating duplicates in the result set and generating the search result sequence
The Q(N(s)) of the sorted result set are taken out in order, processed, and placed into the search result sequence, whose initial value is null. Formula (3) gives the operation performed on Q(N(sᵢ)):
SR(i) = (D(Q(N(sᵢ))) − SR(i−1)) ∩ SR(i−1), 1 ≤ i ≤ n    (3)
where SR(i) denotes the search result sequence after the matching index ID sequence Q(N(sᵢ)) of the i-th node has been merged in, and SR(1) and SR(n) are the initial state and the final state of the search result sequence, respectively; D(Q(N(sᵢ))) denotes de-duplicating Q(N(sᵢ)); (D − SR) denotes removing from the de-duplicated Q(N(sᵢ)) the index numbers that already appear in the search result sequence; and (D − SR) ∩ SR denotes appending (D − SR) to the end of the current search result sequence SR.
After these steps, the final search result sequence is SR(n).
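The merge described by formula (3) can be sketched as follows. The function name `merge_results` is hypothetical, and the sketch follows the textual description of the operation: de-duplicate each ID sequence, drop the IDs already in the search result sequence, and append what remains to its end.

```python
def merge_results(id_sequences):
    """Merge sorted ID sequences Q(N(s_1)), ..., Q(N(s_n)) into SR(n)."""
    sr, seen = [], set()                         # the search result sequence starts empty
    for q in id_sequences:
        d = list(dict.fromkeys(q))               # D(Q): de-duplicate, keeping order
        fresh = [x for x in d if x not in seen]  # D - SR(i-1)
        sr.extend(fresh)                         # append to the end of SR
        seen.update(fresh)
    return sr


print(merge_results([[3, 1, 3], [1, 2], [2, 4]]))   # [3, 1, 2, 4]
```

Because the input sequences arrive in descending order of matching degree, each index ID keeps the position of its best-matching occurrence.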
The suffix-tree-based index structure balances index construction time against occupied space well, and its search efficiency is far higher than that of computing matching degrees over the result set by brute force and then sorting. Compared with fuzzy search implemented on other full-text index structures, the index structure of the invention achieves higher search efficiency at a lower cost in construction time and memory.
Drawings
FIG. 1: the structure diagram of the suffix tree node is improved.
FIG. 2: example graphs of modified suffix trees.
FIG. 3: comparing the figures in different situations when inserting the suffix string.
Detailed Description
To study the search performance of the invention on data sets of different sizes, five data sets with 10,000, 20,000, 50,000, 100,000, and 200,000 entries were constructed, and several groups of comparison experiments against the inverted-index-based Lucene engine were run on each data set.
Twenty-five search strings were randomly generated for each length from 2 to 4, giving 75 search strings in total. For each search string, 100,000 searches were performed, and the time taken by each search was recorded, provided that the search result was correct.
To make Lucene perform the same task as the index of the invention, a space was inserted between the characters of each original string when the index was built, so that each character is treated as a word; a space was likewise inserted between the characters of each search string, giving the same search behavior as the invention.
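The preprocessing applied for the Lucene baseline amounts to the following one-line transformation. The helper name `space_out` and the sample string are hypothetical, and the Lucene indexing and query calls themselves are not shown.

```python
def space_out(text):
    """Insert a space between every pair of adjacent characters."""
    return " ".join(text)


print(space_out("复旦大学"))   # 复 旦 大 学
```

The same function would be applied both to the strings being indexed and to each search string, so that a whitespace-based tokenizer sees every character as a separate word.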
The results of the experiment are shown in table 1:
TABLE 1: Search time of the index of the invention vs. the Lucene index
[Table 1: images BDA0001651108910000041 and BDA0001651108910000051, not reproduced in this text]
As the table shows, the algorithm of the invention searches more efficiently than Lucene on every data set, and the advantage is more pronounced on small data sets: for data sets of fewer than 50,000 entries, its search efficiency reaches 7 to 10 times that of Lucene.
References:
[1] E. Ukkonen, On-Line Construction of Suffix Trees, Algorithmica, 14 (1995), 249-260.

Claims (3)

1. A Chinese word segmentation oriented search algorithm, characterized by comprising two stages: an offline index construction stage and an online search stage;
(I) the offline index construction stage, with the following specific steps:
(1) generating suffix string sets from the original data set
T(S) denotes the original data set, composed of strings S containing delimiters ($) and a terminator (#), where the index ID of the i-th string is i, 1 ≤ i ≤ n; WBS denotes a suffix string that starts at a delimiter and NWBS denotes a suffix string that does not start at a delimiter; the specific steps for generating the suffix string sets T(WBS) and T(NWBS), carrying index IDs, from T(S) are as follows:
step 1: traverse all strings in T(S) and extract all suffix strings of each string sᵢ to form the sets T*(s₁), T*(s₂), …, T*(sₙ), where a suffix string is a substring of the string S that starts at position i and runs to the end of S; that is, if S is written C₁C₂…Cₙ, then CᵢCᵢ₊₁…Cₙ is called a suffix string of S, 1 ≤ i ≤ n;
step 2: remove from the sets T*(s₁), T*(s₂), …, T*(sₙ) all suffix strings that begin with a delimiter ($) or a terminator (#);
step 3: traverse T*(sᵢ); if the first character of a suffix string is the same as the first character of the original string or the first character after a delimiter ($) in the original string, append the index ID to the end of the suffix string and add it to T(WBS); otherwise, append the index ID to the end of the suffix string and add it to T(NWBS);
(2) building improved suffix trees for the suffix string sets T(WBS) and T(NWBS)
The improved suffix tree builds on the traditional suffix tree by storing the label of each edge in the node, so that each node serves as a storage unit; the information stored in a node comprises a node identifier, a terminator child node pointer, a delimiter child node pointer, a set of general child node pointers, and a matching index ID sequence, where the node identifier is a terminator, a delimiter, or a general character string;
the specific steps for building the improved suffix tree for an arbitrary suffix string set T are as follows:
step 1: create an improved suffix tree containing a single node whose node identifier, child node pointers, and matching index ID sequence are all null, and mark this node as the root of the improved suffix tree;
step 2: insert all elements of the suffix string set T into the improved suffix tree in sequence; the insertion of each suffix string starts from the root node and searches for the insertion position;
step 3: recursively construct the matching index ID sequence of each node; the matching index ID sequences of the terminator nodes are already complete once all suffix strings have been inserted, so only the matching index ID sequences Q(N(s)) of the non-terminator nodes N(s) need to be constructed, according to formula (1):
Q(N(s)) = Q(N(s#)) Q(N(s$)) Q(N(s*))    (1)
where N(s#), N(s$), and N(s*) denote the terminator child node, the delimiter child node, and all general child nodes of node N(s), respectively;
(II) the online search stage, with the following specific steps:
(1) matching node query
For any node N(s), the matching nodes of the string c₁…cₙ are queried starting from N(s) according to formula (2):
[Formula (2): image FDA0001651108900000021, not reproduced in this text]
where R(N(s)) denotes the query result, N(s) is a matching node, and s is the node identifier;
given a query string c₁…cₙ, first search all child nodes of the root node for the nodes whose identifier begins with c₁, then execute R(N(s), c₁…cₙ) to find all matching nodes, finally obtaining the search result R(N(s)) = (s, Q(N(s))), where Q(N(s)) is the matching index ID sequence of N(s);
(2) sorting the result set
A negative entropy is defined to measure the degree of matching between the query string c₁…cₙ and a search result string s: the smaller the value, the lower the degree of matching; conversely, the larger the value, the higher the degree of matching;
the word segmentation negative entropy of every s is calculated, and the result set is sorted by this value from largest to smallest;
(3) eliminating duplicates in the result set and generating the search result sequence
The Q(N(s)) of the sorted result set are taken out in order, processed, and placed into the search result sequence, whose initial value is null; the operation performed on Q(N(sᵢ)) is given by formula (3):
SR(i) = (D(Q(N(sᵢ))) − SR(i−1)) ∩ SR(i−1), 1 ≤ i ≤ n    (3)
where SR(i) denotes the search result sequence after the matching index ID sequence Q(N(sᵢ)) of the i-th node has been merged in, and SR(1) and SR(n) are the initial state and the final state of the search result sequence, respectively; D(Q(N(sᵢ))) denotes de-duplicating Q(N(sᵢ)); (D − SR) denotes removing from the de-duplicated Q(N(sᵢ)) the index numbers that already appear in the search result sequence; and (D − SR) ∩ SR denotes appending (D − SR) to the end of the current search result sequence SR;
the final search result sequence is SR(n).
2. The Chinese word segmentation oriented search algorithm of claim 1, wherein the insertion of each suffix string starts from the root node and searches for the insertion position, and falls into one of the following three cases:
case one: if the suffix string to be inserted already appears in the current tree, its index number is added directly to the matching index ID sequence of the corresponding node;
case two: if a prefix of the suffix string to be inserted matches an existing node, the remaining nodes are added directly;
case three: if the suffix string to be inserted shares only a prefix with the current node, the current node is first split and the other nodes are then inserted.
3. The Chinese word segmentation oriented search algorithm of claim 1, wherein the word segmentation negative entropy of s is calculated as follows, with the initial value set to 0:
(a) obtain the position i of c₁ in s;
(b) traverse s from i toward the end of s until a delimiter $, a terminator #, or the end of s is encountered; let m be the number of characters traversed;
(c) if the end of s is reached, check whether the last character is the terminator #; if so, increase the negative entropy by m² and end the algorithm; otherwise, increase the negative entropy by m and end the algorithm;
(d) if a delimiter $ is encountered, increase the negative entropy by m²;
(e) update i to the position one character after the encountered delimiter $ and return to step (b).
CN201810422499.3A 2018-05-05 2018-05-05 Chinese word segmentation oriented search algorithm Active CN108846016B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810422499.3A CN108846016B (en) 2018-05-05 2018-05-05 Chinese word segmentation oriented search algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810422499.3A CN108846016B (en) 2018-05-05 2018-05-05 Chinese word segmentation oriented search algorithm

Publications (2)

Publication Number Publication Date
CN108846016A CN108846016A (en) 2018-11-20
CN108846016B true CN108846016B (en) 2021-08-20

Family

ID=64212741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810422499.3A Active CN108846016B (en) 2018-05-05 2018-05-05 Chinese word segmentation oriented search algorithm

Country Status (1)

Country Link
CN (1) CN108846016B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686413A (en) * 2018-12-24 2019-04-26 杭州费尔斯通科技有限公司 A kind of chemical molecular formula search method based on es inverted index
CN110597855B (en) * 2019-08-14 2022-03-29 中山大学 Data query method, terminal device and computer readable storage medium
CN111241398B (en) * 2020-01-10 2023-07-25 百度在线网络技术(北京)有限公司 Data prefetching method, device, electronic equipment and computer readable storage medium
WO2021236052A1 (en) 2020-05-18 2021-11-25 Google Llc Inference methods for word or wordpiece tokenization
CN112232903B (en) * 2020-09-27 2022-01-11 北京五八信息技术有限公司 Business object display method and device
CN112802553B (en) * 2020-12-29 2024-03-15 北京优迅医疗器械有限公司 Suffix tree algorithm-based genome sequencing sequence and reference genome comparison method
CN112966505B (en) * 2021-01-21 2021-10-15 哈尔滨工业大学 Method, device and storage medium for extracting persistent hot phrases from text corpus
CN113450028A (en) * 2021-08-31 2021-09-28 深圳格隆汇信息科技有限公司 Behavior fund analysis method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631929A (en) * 2013-12-09 2014-03-12 江苏金智教育信息技术有限公司 Intelligent prompt method, module and system for search
CN103838785A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in patent field
CN103838783A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Suffix tree clustering method suitable for Chinese web page documents
CN107844731A (en) * 2016-09-17 2018-03-27 复旦大学 Long-term sequence δ abnormal point detecting methods based on probabilistic suffix tree

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838785A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in patent field
CN103838783A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Suffix tree clustering method suitable for Chinese web page documents
CN103631929A (en) * 2013-12-09 2014-03-12 江苏金智教育信息技术有限公司 Intelligent prompt method, module and system for search
CN107844731A (en) * 2016-09-17 2018-03-27 复旦大学 Long-term sequence δ abnormal point detecting methods based on probabilistic suffix tree

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Domain-specific Chinese word segmentation using suffix tree and mutual information;Zeng D等;《Information Systems Frontiers》;20111231;全文 *
基于后缀树聚类的主题搜索引擎研究;韦美峰等;《情报理论与实践》;20171219;全文 *
改进后缀树的中文检索结果聚类研究;袁津生等;《计算机工程与应用》;20130418;全文 *

Also Published As

Publication number Publication date
CN108846016A (en) 2018-11-20

Similar Documents

Publication Publication Date Title
CN108846016B (en) Chinese word segmentation oriented search algorithm
Chakraborty et al. Analysis and study of incremental k-means clustering algorithm
Eskin et al. Finding composite regulatory patterns in DNA sequences
CN110569328B (en) Entity linking method, electronic device and computer equipment
CN106503223B (en) online house source searching method and device combining position and keyword information
Betzler et al. Parameterized algorithmics for finding connected motifs in biological networks
US9020951B2 (en) Methods for indexing and searching based on language locale
CN110704743A (en) Semantic search method and device based on knowledge graph
WO2016034052A1 (en) Device and method for error correction in data search
CN111868710A (en) Random extraction forest index structure for searching large-scale unstructured data
Clifford et al. Dictionary matching in a stream
CN109408681A (en) A kind of character string matching method, device, equipment and readable storage medium storing program for executing
WO2015010509A1 (en) One-dimensional liner space-based method for implementing trie tree dictionary search
KR102215299B1 (en) Error correction method and device and computer readable medium
CN111666468A (en) Method for searching personalized influence community in social network based on cluster attributes
CN116562297B (en) Chinese sensitive word deformation identification method and system based on HTRIE tree
CN103077216B (en) The method of subgraph match device and subgraph match
CN113742446A (en) Knowledge graph question-answering method and system based on path sorting
US20110113052A1 (en) Query result iteration for multiple queries
EP3955256A1 (en) Non-redundant gene clustering method and system, and electronic device
CN108241709B (en) Data integration method, device and system
CN117763077A (en) Data query method and device
CN109101595B (en) Information query method, device, equipment and computer readable storage medium
CN113468383B (en) Family relation map searching method and device, electronic equipment and storage medium
CN105426490A (en) Tree structure based indexing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant