CN106502980B

CN106502980B - Retrieval method and system based on text morpheme segmentation

Info

Publication number: CN106502980B
Application number: CN201610881111.7A
Authority: CN
Inventors: 白凡
Original assignee: Wuhan Douyu Network Technology Co Ltd
Current assignee: Shenzhen Origin Parameter Information Technology Co Ltd
Priority date: 2016-10-09
Filing date: 2016-10-09
Publication date: 2019-05-17
Anticipated expiration: 2036-10-09
Also published as: CN106502980A

Abstract

The invention discloses a retrieval method and system based on text morpheme segmentation, relating to the field of big data retrieval. The method comprises: establishing a user retrieval dictionary; judging whether the text to be segmented contains a retrieval phrase that has already appeared in the user retrieval dictionary and, if so, taking that retrieval phrase as the existing phrase; judging whether E(w) of the current existing phrase is greater than E(avg) and, if it is, judging whether the morpheme of the existing phrase is already present in the user retrieval dictionary and, if it is absent, storing the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase; taking out the corresponding existing phrase and performing fine-grained morpheme segmentation on the remainder of the text; and judging whether the existing phrase exceeds eight bytes and, if it does not, using the morpheme of the current existing phrase together with the morphemes obtained by fine-grained segmentation as the segmentation morphemes, which are then indexed. The invention can reduce the frequency of updating and maintenance and improve retrieval quality.

Description

Retrieval method and system based on text morpheme segmentation

Technical field

The present invention relates to the field of big data retrieval, and in particular to a retrieval method and system based on text morpheme segmentation.

Background art

With the rapid development of the internet industry, big data retrieval has become particularly important. An efficient retrieval system needs a good document parsing scheme for the text to be retrieved, and the most important step in parsing is morpheme segmentation of the document, that is, identifying the morphemes, words and phrases that make up the document content.

Document parsing is currently carried out mainly as follows: the document structure is identified, any alphanumeric sequence in the text terminated by a space or special symbol is recognized as a word, and upper-case characters are converted to lower case. For example, "I Love China！Yeah" is segmented into "i", "love", "china", "！" and "yeah".

Since Chinese documents are not usually delimited by spaces, a sentence such as "我爱中国！" ("I love China!") would be split by the existing document parsing method into "我爱中国" and "！". The words produced by this segmentation are difficult to match with the corresponding query words in the database. For Chinese documents, therefore, morpheme segmentation must be handled in some other way, so that queries and document terms can be matched to each other.
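The behaviour described in the two preceding paragraphs can be illustrated with a minimal Python sketch. The function name `tokenize_whitespace` and the regular expression are assumptions made for illustration and are not taken from the patent; the Chinese input 我爱中国！ ("I love China!") stands in for the Chinese example sentence.

```python
import re

def tokenize_whitespace(text: str) -> list[str]:
    """Split on spaces and special symbols, then lowercase every token."""
    # \w+ matches a run of word characters (letters, digits, CJK characters);
    # [^\w\s] matches a single punctuation mark or symbol.
    return [tok.lower() for tok in re.findall(r"\w+|[^\w\s]", text)]

# English text is separated cleanly by spaces.
print(tokenize_whitespace("I Love China！Yeah"))

# Chinese text has no spaces, so the whole clause stays in one token.
print(tokenize_whitespace("我爱中国！"))
```

The second call returns the whole clause as a single token, which is why a query for "中国" alone would not match the indexed term.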

Index schemes that are currently popular include: dictionary-based segmentation indexing, regular-expression-based segmentation indexing, segmentation indexing based on special characters such as spaces, and various custom segmentation indexes. Of these, dictionary-based segmentation indexing is the most widely used in current search engines and gives the best segmentation results; examples include Apache Lucene (an open-source full-text search project under Apache), Apache Solr (an open-source full-text search project under Apache) and ElasticSearch (a Lucene-based search server that provides a distributed, multi-user full-text search engine).

The existing fine-grained segmentation method cuts the retrieved document into its smallest units, for example cutting "我爱中国！" directly into "我", "爱", "中", "国" and "！". This not only puts enormous storage pressure on the storage module of the retrieval system, but also splits meaningful phrases such as "中国" ("China"), which makes retrieval more difficult.
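A minimal Python sketch of fine-grained segmentation follows. It assumes that the "smallest unit" is a single character, which is the usual reading for Chinese text; the function name `fine_grained_segment` is hypothetical.

```python
def fine_grained_segment(text: str) -> list[str]:
    """Cut text into its smallest units: one morpheme per non-space character."""
    return [ch for ch in text if not ch.isspace()]

morphemes = fine_grained_segment("我爱中国！")
print(morphemes)

# The meaningful phrase is no longer available as a single index term.
print("中国" in morphemes)
```

Every character becomes its own index entry, so the index grows with the character count of the corpus and a phrase query has to be reassembled from single-character postings.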

In conclusion current segmenting method not only needs powerful and sufficient dictionary to support, and retrieval quality is lower, but It is that dictionary needs real-time update and maintenance, needs to expend a large amount of manpower, higher cost.

Summary of the invention

In view of the deficiencies in the prior art, the purpose of the present invention is to provide a retrieval method and system based on text morpheme segmentation that can reduce the frequency of updating and maintenance and improve retrieval quality.

To achieve the above purpose, the technical solution adopted by the present invention is as follows:

A retrieval method based on text morpheme segmentation, wherein:

A user retrieval dictionary is established. The dictionary records and stores all retrieval phrases of the current user and the number of times n that each retrieval phrase appears. The total number of all retrieval phrases is m, and the retrieval frequency P of each retrieval phrase is n/m. The expected value of each retrieval phrase is E(w), where E(w) = P*n. The average expected value of all retrieval phrases is E(avg) = [E(w1) + E(w2) + ... + E(wn)]/m;
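As a worked illustration of these statistics, here is a minimal Python sketch. It assumes that m is the total number of recorded phrase occurrences (the text could also be read as the number of distinct phrases), and it uses a made-up query log; the function name `expectation_stats` is hypothetical.

```python
from collections import Counter

def expectation_stats(queries: list[str]) -> tuple[dict[str, float], float]:
    """Return (E(w) for every phrase, E(avg)) for a log of retrieval phrases."""
    counts = Counter(queries)      # n: number of times each phrase appears
    m = sum(counts.values())       # m: total number of retrieval phrases (assumed reading)
    # P = n / m and E(w) = P * n
    e = {w: (n / m) * n for w, n in counts.items()}
    # E(avg) divides the sum of all E(w) by m, as stated in the text
    e_avg = sum(e.values()) / m
    return e, e_avg

# Hypothetical log: "中国" searched three times, "北京" once.
e, e_avg = expectation_stats(["中国", "中国", "北京", "中国"])
print(e, e_avg)
```

With this log, n = 3 and m = 4 give E(中国) = 0.75 * 3 = 2.25, n = 1 gives E(北京) = 0.25, and E(avg) = (2.25 + 0.25) / 4 = 0.625, so only "中国" exceeds the average.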

The retrieval comprises the following steps:

S1. Judge whether the text to be segmented contains a retrieval phrase that has already appeared in the user retrieval dictionary; if so, take the current retrieval phrase as the existing phrase and go to step S2;

S2. Judge whether E(w) of the current existing phrase is greater than E(avg); if it is, judge whether the morpheme of the current existing phrase is present in the user retrieval dictionary and, if it is absent, store the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase; go to step S3;

S3. Take out the corresponding existing phrase and perform fine-grained morpheme segmentation on the remainder of the text; judge whether the existing phrase exceeds eight bytes and, if it does not, use the morpheme of the current existing phrase together with the morphemes obtained by fine-grained segmentation as the segmentation morphemes, and then index them.

On the basis of the above technical solution, in step S1, when the text to be segmented does not contain any phrase in the user retrieval dictionary, fine-grained morpheme segmentation is performed on the text to be segmented and the result is indexed.

On the basis of the above technical solution, in step S3, it is judged whether the existing phrase exceeds eight bytes; when it does, the existing phrase is taken as the text to be segmented and the method returns to step S1.
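The text does not say which character encoding the eight-byte limit refers to, and the answer changes how many Chinese characters fit. The following minimal Python sketch makes that dependence explicit; the function name `exceeds_eight_bytes` and the default of GBK (two bytes per Chinese character) are assumptions, not something the patent specifies.

```python
def exceeds_eight_bytes(phrase: str, encoding: str = "gbk") -> bool:
    """Return True when the encoded phrase is longer than eight bytes."""
    return len(phrase.encode(encoding)) > 8

# Under GBK a Chinese character takes 2 bytes; under UTF-8 it takes 3.
for phrase in ("中国", "中华人民", "中华人民共和国"):
    print(phrase,
          len(phrase.encode("gbk")), exceeds_eight_bytes(phrase, "gbk"),
          len(phrase.encode("utf-8")), exceeds_eight_bytes(phrase, "utf-8"))
```

Under the GBK assumption the limit corresponds to four Chinese characters, whereas under UTF-8 a three-character phrase already exceeds it.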

On the basis of the above technical solution, in step S1, when the text to be segmented does not contain an existing phrase, fine-grained morpheme segmentation is performed on the text to be segmented.

On the basis of the above technical solution, the following step is further included between steps S1 and S2: removing the stop words and special characters from the text to be segmented.

On the basis of the above technical solution, the stop words include English characters, numbers, mathematical characters, punctuation marks, modal particles, adverbs, prepositions and conjunctions.

On the basis of the above technical solution, the special characters are mathematical symbols, unit symbols and tab characters.
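The removal step can be sketched in Python as follows. This is only an approximation under stated assumptions: character classes are detected with Unicode general categories, `FUNCTION_WORDS` is a tiny hypothetical stand-in for a real list of particles, adverbs, prepositions and conjunctions, and only single-character function words are handled.

```python
import unicodedata

# Hypothetical, deliberately tiny list of single-character function words.
FUNCTION_WORDS = {"的", "了", "吗", "很", "在", "和"}

def strip_stop_words(text: str) -> str:
    """Drop English letters, digits, punctuation, symbols, tabs and function words."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch.isascii() and ch.isalpha():
            continue  # English characters
        if category[0] in {"N", "P", "S", "Z", "C"}:
            continue  # numbers, punctuation, symbols, spaces, control characters such as tab
        if ch in FUNCTION_WORDS:
            continue  # particles, adverbs, prepositions, conjunctions
        kept.append(ch)
    return "".join(kept)

print(strip_stop_words("我很爱中国！ABC 123\t+"))
```

A production system would need a proper stop-word lexicon and a way to protect dictionary phrases that happen to contain a function word; neither is described in the source.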

A retrieval system based on text morpheme segmentation, comprising a database module, an input module, a judgment and comparison module, a segmentation module and a retrieval module;

The database module is used to establish the user retrieval dictionary;

The input module is used to input the text to be segmented into the retrieval system;

The judgment and comparison module is used to judge whether the text to be segmented contains an existing phrase, to compare whether E(w) of the current existing phrase is greater than E(avg), and, when it is greater, to store the morpheme of the current existing phrase in the dictionary;

The segmentation module is used to perform fine-grained morpheme segmentation on the text to be segmented after the existing phrase has been removed;

The retrieval module is used to perform retrieval according to the morphemes obtained by segmentation.

On the basis of the above technical solution, the judgment and comparison module is also used to judge whether the current existing phrase exceeds eight bytes and, when it does not, to use the morpheme of the current existing phrase together with the morphemes obtained by fine-grained segmentation as the segmentation morphemes, which are then indexed.

On the basis of the above technical solution, the segmentation module is also used to perform fine-grained morpheme segmentation on text to be segmented that contains no existing phrase.

Compared with the prior art, the advantages of the present invention are as follows:

(1) In the retrieval method based on text morpheme segmentation of the present invention, the phrases a user commonly searches for are stored in the retrieval dictionary according to the user's retrieval habits, and the expected value of each retrieval phrase is recorded; whether the morpheme of a retrieval phrase is stored in the dictionary is decided by comparing its expected value with the average. The invention further optimizes the method by combining fine-grained morpheme segmentation with a check on the length of the retrieval phrase. Because the morphemes of the fields each user is interested in show a certain correlation and repetition, the method can improve retrieval quality and reduce the frequency of updating and maintenance.

Brief description of the drawings

Fig. 1 is a flow chart of the retrieval method based on text morpheme segmentation in an embodiment of the present invention;

Fig. 2 is a structural block diagram of the retrieval system based on text morpheme segmentation in an embodiment of the present invention.

Detailed description of the embodiments

The invention is described in further detail below with reference to the accompanying drawings and embodiments.

Referring to Fig. 1, an embodiment of the present invention provides a retrieval method based on text morpheme segmentation, comprising the following steps:

A user retrieval dictionary is established. The dictionary records and stores all retrieval phrases of the current user and the number of times n that each retrieval phrase appears. The total number of all retrieval phrases is m, and the retrieval frequency P of each retrieval phrase is n/m. The expected value of each retrieval phrase is E(w), where E(w) = P*n. The average expected value of all retrieval phrases is E(avg) = [E(w1) + E(w2) + ... + E(wn)]/m.

Judge whether the text to be segmented contains a retrieval phrase already present in the user retrieval dictionary; if not, perform fine-grained morpheme segmentation on the text to be segmented and index the result.

If so, take the current retrieval phrase as the existing phrase and remove the stop words and special characters from the text to be segmented. The stop words include English characters, numbers, mathematical characters, punctuation marks, modal particles, adverbs, prepositions and conjunctions; the special characters are mathematical symbols, unit symbols and tab characters. Judge whether E(w) of the current existing phrase is greater than E(avg); if it is, judge whether the morpheme corresponding to the existing phrase is present in the user retrieval dictionary and, if it is absent, store the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase.

Take out the corresponding existing phrase and perform fine-grained morpheme segmentation on the remainder of the text. Judge whether the existing phrase exceeds eight bytes: if it does, take the current existing phrase as the text to be segmented and process it again; if it does not, use the morpheme of the current existing phrase together with the morphemes obtained by fine-grained segmentation as the segmentation morphemes, and then index them.

The detailed steps of the method of the invention are:

S1. Input the text to be segmented.

S2. Judge whether the text to be segmented contains an existing phrase, that is, a phrase in the user retrieval dictionary: if so, go to step S3; otherwise, go to step S6.

S3. Judge whether E(w) of the current existing phrase is greater than E(avg): if so, go to step S4; otherwise, go to step S5.

S4. Judge whether the corresponding morpheme is present in the user retrieval dictionary and, if it is absent, store the morpheme of the current existing phrase in the dictionary; go to step S5.

S5. Remove the stop words and special characters from the text to be segmented; go to step S6.

S6. Take the corresponding existing phrase out of the text to be segmented and judge whether the existing phrase exceeds eight bytes: if so, take the existing phrase as the text to be segmented and go to step S2; otherwise, go to step S7.

S7. Perform fine-grained morpheme segmentation on the text to obtain the retrieval morphemes. The text here means both the text to be segmented after the existing phrase has been removed and text to be segmented that contains no existing phrase. For text that contains an existing phrase, the morphemes are the morpheme of the existing phrase plus the morphemes from fine-grained segmentation; for text that contains no existing phrase, the morphemes are those from fine-grained segmentation only. Go to step S8.

S8. Index using the segmentation morphemes.
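Steps S2 to S8 can be put together in a short Python sketch. Several points are assumptions rather than statements from the patent: the longest matching dictionary phrase is chosen, only one existing phrase is handled per pass, the eight-byte check uses GBK, an over-long phrase is re-segmented using only dictionary phrases other than itself (otherwise the loop back to S2 would never end), and stop-word removal (S5) is left out for brevity. All function and variable names are hypothetical.

```python
from collections import Counter
from typing import Optional

def fine_grained(text: str) -> list[str]:
    """Fine-grained segmentation: one morpheme per non-space character."""
    return [ch for ch in text if not ch.isspace()]

def segment(text: str, counts: Counter, stored: set[str],
            max_bytes: int = 8, encoding: str = "gbk",
            skip: Optional[str] = None) -> list[str]:
    """Sketch of steps S2 to S7, handling one existing phrase per pass."""
    m = sum(counts.values())
    e = {w: (n / m) * n for w, n in counts.items()}   # E(w) = P * n
    e_avg = sum(e.values()) / m if m else 0.0         # E(avg)

    # S2: look for a dictionary phrase in the text (assumption: longest match).
    found = [w for w in counts if w != skip and w in text]
    if not found:
        return fine_grained(text)                     # no existing phrase
    phrase = max(found, key=len)

    # S3/S4: store the phrase's morpheme only when E(w) > E(avg).
    if e[phrase] > e_avg:
        stored.add(phrase)                            # no-op if already stored

    # S6: take the phrase out; re-segment it when it exceeds max_bytes.
    before, _, after = text.partition(phrase)
    if len(phrase.encode(encoding)) > max_bytes:
        middle = segment(phrase, counts, stored, max_bytes, encoding, skip=phrase)
    else:
        middle = [phrase]

    # S7: phrase morpheme(s) plus fine-grained morphemes of the remainder.
    return fine_grained(before) + middle + fine_grained(after)

def index_document(doc_id: int, morphemes: list[str],
                   inverted: dict[str, set[int]]) -> None:
    """S8: add the document to a toy inverted index."""
    for morpheme in morphemes:
        inverted.setdefault(morpheme, set()).add(doc_id)

counts = Counter({"中国": 3, "北京": 1})   # hypothetical user retrieval dictionary
stored: set[str] = set()
inverted: dict[str, set[int]] = {}
morphemes = segment("我爱中国！", counts, stored)
index_document(1, morphemes, inverted)
print(morphemes, stored)
```

In this run "中国" is kept as a single morpheme while the rest of the sentence is cut character by character, so a query for "中国" hits one posting instead of two.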

The present invention also provides a retrieval system based on text morpheme segmentation, comprising a database module, an input module, a judgment and comparison module, a segmentation module and a retrieval module.

The database module is used to establish the user retrieval dictionary, and the input module is used to input the text to be segmented into the retrieval system.

The judgment and comparison module is used to judge whether the text to be segmented contains an existing phrase, to compare whether E(w) of the current existing phrase is greater than E(avg), and, when it is greater, to store the morpheme of the current existing phrase in the dictionary.

The judgment and comparison module is also used to judge whether the current existing phrase exceeds eight bytes and, when it does not, to use the morpheme of the current existing phrase together with the morphemes obtained by fine-grained segmentation as the segmentation morphemes, which are then indexed.

The segmentation module is used to perform fine-grained morpheme segmentation on text to be segmented from which the existing phrase has been removed, and also on text to be segmented that contains no existing phrase. The retrieval module is used to perform retrieval according to the morphemes obtained by segmentation.

The present invention is not limited to the above embodiments. Those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications are also regarded as falling within the scope of protection of the invention. Content not described in detail in this specification belongs to the prior art well known to those skilled in the art.

Claims

1. A retrieval method based on text morpheme segmentation, characterized in that:

A user retrieval dictionary is established; the dictionary records and stores all retrieval phrases of the current user and the number of times n that each retrieval phrase appears; the total number of all retrieval phrases is m; the retrieval frequency P of each retrieval phrase is n/m; the expected value of each retrieval phrase is E(w), where E(w) = P*n; and the average expected value of all retrieval phrases is E(avg) = [E(w1) + E(w2) + ... + E(wn)]/m;

The retrieval comprises the following steps:

S1. Judge whether the text to be segmented contains a retrieval phrase that has already appeared in the user retrieval dictionary; if so, take the current retrieval phrase as the existing phrase and go to step S2;

S2. Judge whether E(w) of the current existing phrase is greater than E(avg); if it is, judge whether the morpheme of the current existing phrase is present in the user retrieval dictionary and, if it is absent, store the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase; go to step S3;

S3. Take out the corresponding existing phrase and perform fine-grained morpheme segmentation on the remainder of the text; judge whether the existing phrase exceeds eight bytes and, if it does not, use the morpheme of the current existing phrase together with the morphemes obtained by fine-grained segmentation as the segmentation morphemes, and then index them.

2. The retrieval method based on text morpheme segmentation according to claim 1, characterized in that: in step S1, when the text to be segmented does not contain any phrase in the user retrieval dictionary, fine-grained morpheme segmentation is performed on the text to be segmented and the result is indexed.

3. The retrieval method and system based on text morpheme segmentation according to claim 1, characterized in that: in step S3, it is judged whether the existing phrase exceeds eight bytes; when it does, the existing phrase is taken as the text to be segmented and the method returns to step S1.

4. The retrieval method based on text morpheme segmentation according to any one of claims 1 to 3, characterized in that: in step S1, when the text to be segmented does not contain an existing phrase, fine-grained morpheme segmentation is performed on the text to be segmented.

5. The retrieval method based on text morpheme segmentation according to claim 4, characterized in that the following step is further included between steps S1 and S2: removing the stop words and special characters from the text to be segmented.

6. The retrieval method based on text morpheme segmentation according to claim 5, characterized in that: the stop words include English characters, numbers, mathematical characters, punctuation marks, modal particles, adverbs, prepositions and conjunctions.

7. The retrieval method based on text morpheme segmentation according to claim 5, characterized in that: the special characters are mathematical symbols, unit symbols and tab characters.

8. A retrieval system based on text morpheme segmentation for implementing the retrieval method of any one of claims 1 to 7, characterized in that it comprises a database module, an input module, a judgment and comparison module, a segmentation module and a retrieval module;

The database module is used to establish the user retrieval dictionary;

The input module is used to input the text to be segmented into the retrieval system;

The judgment and comparison module is used to judge whether the text to be segmented contains an existing phrase, to compare whether E(w) of the current existing phrase is greater than E(avg) and, when it is greater, to go on to judge whether the corresponding morpheme is present in the user retrieval dictionary and, if it is absent, to store the morpheme of the current existing phrase in the dictionary;

9. The retrieval system based on text morpheme segmentation according to claim 8, characterized in that: the judgment and comparison module is also used to judge whether the current existing phrase exceeds eight bytes and, when it does not, to use the morpheme of the current existing phrase together with the morphemes obtained by fine-grained segmentation as the segmentation morphemes, which are then indexed.

10. The retrieval system based on text morpheme segmentation according to claim 8, characterized in that: the segmentation module is also used to perform fine-grained morpheme segmentation on text to be segmented that contains no existing phrase.