CN106502980B - Retrieval method and system based on text morpheme segmentation

Publication number
CN106502980B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610881111.7A
Other languages
Chinese (zh)
Other versions
CN106502980A (en)
Inventor
白凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Origin Parameter Information Technology Co ltd
Original Assignee
Wuhan Douyu Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-10-09
Filing date
2016-10-09
Publication date
2019-05-17
Application filed by Wuhan Douyu Network Technology Co Ltd
Priority to CN201610881111.7A
Publication of CN106502980A
Application granted
Publication of CN106502980B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/243 Natural language query formulation

Abstract

The invention discloses a retrieval method and system based on text morpheme segmentation, relating to the field of big data retrieval. The method comprises: establishing a user search dictionary; judging whether the text to be segmented contains a search phrase that has already appeared in the user search dictionary and, if so, taking that search phrase as the matched phrase; judging whether E(w) of the current matched phrase is greater than E(avg) and, if it is, judging whether the user search dictionary already holds the morpheme of the matched phrase and, if it does not, storing the morpheme of the current matched phrase in the dictionary as the morpheme corresponding to the matched phrase; taking the matched phrase out of the text and performing fine-grained morpheme segmentation on the remainder of the text; and judging whether the matched phrase exceeds eight bytes and, if it does not, using the morpheme of the current matched phrase together with the fine-grained morphemes as the segmentation morphemes and then indexing them. The invention can reduce the frequency of updates and maintenance and improve retrieval quality.

Description

Retrieval method and system based on text morpheme segmentation
Technical field
The present invention relates to the field of big data retrieval, and in particular to a retrieval method and system based on text morpheme segmentation.
Background art
With the rapid development of the internet industry, big data retrieval has become particularly important. An efficient retrieval system needs a good document parsing scheme to parse the text to be retrieved, and the most important step in parsing is morpheme segmentation of the document, that is, identifying the morphemes, words, and phrases that make up the document content.
At present, document parsing mainly works as follows: the document structure is identified, any alphanumeric sequence in the text that is terminated by a space or a special symbol is recognized as a word, and uppercase characters are converted to lowercase. For example, "I Love China!Yeah" can be segmented into "i", "love", "china", "!", "yeah".
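The space-and-symbol parsing described above can be sketched in a few lines of Python. This is only an illustration of the general technique; the function name and the regular expression are assumptions and are not taken from the patent.

```python
import re

def split_on_spaces_and_symbols(text: str) -> list[str]:
    """Lowercase the text, then emit alphanumeric runs and single symbols."""
    # [a-z0-9]+ matches a run of letters or digits; the second alternative
    # matches any single character that is neither whitespace nor alphanumeric.
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

print(split_on_spaces_and_symbols("I Love China!Yeah"))
# ['i', 'love', 'china', '!', 'yeah']
```

Because the rule relies on spaces and symbols as word boundaries, it has nothing to split on inside a run of Chinese characters.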
Because Chinese text is not normally delimited by spaces, an existing document parsing method would split a sentence such as "我爱中国！" ("I love China!") into "我爱中国" and "！". Words segmented this way are hard to match against query words in the database. For Chinese documents, morpheme segmentation therefore has to be handled differently so that queries and document terms can be matched to each other.
Currently popular indexing schemes include dictionary-based word-segmentation indexing, regular-expression-based word-segmentation indexing, word-segmentation indexing based on special characters such as spaces, and various custom word-segmentation indexes. Of these, dictionary-based word-segmentation indexing is the most widely used in current search engines and gives the best indexing results, for example in Apache Lucene (an open-source full-text search project under Apache), Apache Solr (an open-source full-text search project under Apache), and ElasticSearch (a Lucene-based search server that provides a distributed, multi-user full-text search engine).
Existing fine-grained segmentation methods cut the document to be retrieved into minimum units, for example cutting "我爱中国！" directly into "我", "爱", "中", "国", "！". This not only puts enormous storage pressure on the storage module of the retrieval system, it also splits meaningful phrases such as "中国" ("China"), which makes retrieval harder.
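A minimal Python sketch of fine-grained segmentation follows, assuming the example sentence is the Chinese "我爱中国！" ("I love China!") and that one character is one minimum unit. The function name is hypothetical.

```python
def fine_grained_cut(text: str) -> list[str]:
    """Cut text into single-character units, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]

units = fine_grained_cut("我爱中国！")
print(units)            # ['我', '爱', '中', '国', '！']
print("中国" in units)  # False: the phrase "中国" no longer exists as a unit
```

The second line of output shows the problem the text describes: once the phrase is cut into characters, a query for the whole phrase has no single unit to match.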
In conclusion current segmenting method not only needs powerful and sufficient dictionary to support, and retrieval quality is lower, but It is that dictionary needs real-time update and maintenance, needs to expend a large amount of manpower, higher cost.
Summary of the invention
In view of the deficiencies in the prior art, the purpose of the present invention is to provide a kind of inspections based on text morpheme cutting Rope method and system, can reduce the frequency of update and maintenance, and improve retrieval quality.
To achieve the above objectives, the technical solution adopted by the present invention is that:
A kind of search method based on text morpheme cutting,
A user search dictionary is established. The dictionary records and stores all search phrases of the current user and the number of times n that each search phrase occurs. The total over all search phrases is m, the search frequency P of each search phrase is n/m, and the expected value of each search phrase is E(w), where E(w) = P*n. The average expected value of all search phrases is E(avg) = [E(w1) + E(w2) + … + E(wn)]/m;
The retrieval comprises the following steps:
S1. Judge whether the text to be segmented contains a search phrase that has already appeared in the user search dictionary; if it does, take the current search phrase as the matched phrase and go to step S2;
S2. Judge whether E(w) of the current matched phrase is greater than E(avg); if it is, judge whether the user search dictionary already holds the morpheme of the current matched phrase and, if it does not, store the morpheme of the current matched phrase in the dictionary as the morpheme corresponding to the matched phrase; go to step S3;
S3. Take out the matched phrase and perform fine-grained morpheme segmentation on the remainder of the text; judge whether the matched phrase exceeds eight bytes and, if it does not, use the morpheme of the current matched phrase together with the fine-grained morphemes as the segmentation morphemes and then index them.
On the basis of the above technical solution, in step S1, when the text to be segmented contains no phrase from the user search dictionary, fine-grained morpheme segmentation is performed on the text to be segmented and the result is indexed.
On the basis of the above technical solution, in step S3, it is judged whether the matched phrase exceeds eight bytes; if it does, the matched phrase is taken as the text to be segmented and the method returns to step S1.
On the basis of the above technical solution, in step S1, when the text to be segmented contains no matched phrase, fine-grained morpheme segmentation is performed on the text to be segmented.
On the basis of the above technical solution, the following step is further included between steps S1 and S2: removing the stop words and special characters from the text to be segmented.
On the basis of the above technical solution, the stop words include English characters, digits, mathematical characters, punctuation marks, modal particles, adverbs, prepositions, and conjunctions.
On the basis of the above technical solution, the special characters are mathematical symbols, unit symbols, and tab characters.
A retrieval system based on text morpheme segmentation, comprising a database establishing module, an input module, a judging and comparing module, a segmentation module, and a retrieval module;
The database establishing module is used to establish the user search dictionary;
The input module is used to input the text to be segmented into the retrieval system;
The judging and comparing module is used to judge whether the text to be segmented contains a matched phrase, to compare whether E(w) of the current matched phrase is greater than E(avg), and, if it is, to store the morpheme of the current matched phrase in the dictionary;
The segmentation module is used to perform fine-grained morpheme segmentation on the text to be segmented after the matched phrase has been removed;
The retrieval module is used to retrieve according to the segmented morphemes.
On the basis of the above technical solution, the judging and comparing module is also used to judge whether the current matched phrase exceeds eight bytes and, if it does not, to use the morpheme of the current matched phrase together with the fine-grained morphemes as the segmentation morphemes and then index them.
On the basis of the above technical solution, the segmentation module is also used to perform fine-grained morpheme segmentation on text to be segmented that contains no matched phrase.
Compared with the prior art, the advantages of the present invention are:
(1) The retrieval method based on text morpheme segmentation of the invention stores the phrases a user commonly searches for in the search dictionary according to that user's search habits, records the expected value of each search phrase, and decides from the expected value and the average value whether to store the morpheme of the corresponding search phrase in the dictionary. The invention further optimizes the method by combining fine-grained morpheme segmentation with a check on the length of the search phrase. Because the morphemes in the fields each user is interested in show a degree of correlation and repetition, the method can improve retrieval quality and reduce the frequency of updates and maintenance.
Brief description of the drawings
Fig. 1 is a flowchart of the retrieval method based on text morpheme segmentation in an embodiment of the present invention;
Fig. 2 is a structural block diagram of the retrieval system based on text morpheme segmentation in an embodiment of the present invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and embodiments.
Referring to Fig. 1, an embodiment of the present invention provides a retrieval method based on text morpheme segmentation, comprising the following steps:
A user search dictionary is established. The dictionary records and stores all search phrases of the current user and the number of times n that each search phrase occurs. The total over all search phrases is m, the search frequency P of each search phrase is n/m, and the expected value of each search phrase is E(w), where E(w) = P*n. The average expected value of all search phrases is E(avg) = [E(w1) + E(w2) + … + E(wn)]/m.
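The dictionary statistics can be computed as in the following Python sketch. The text does not say whether m counts distinct phrases or total occurrences; the sketch assumes m is the total number of occurrences (the sum of all n), which is what makes P = n/m a frequency. The function name and the sample search log are hypothetical.

```python
from collections import Counter

def phrase_statistics(search_log: list[str]) -> tuple[dict[str, float], float]:
    """Return ({phrase: E(w)}, E(avg)) for a user's search log."""
    counts = Counter(search_log)   # n for each search phrase
    m = sum(counts.values())       # assumption: m = total occurrences
    if m == 0:
        return {}, 0.0
    # E(w) = P * n with P = n / m
    expected = {w: (n / m) * n for w, n in counts.items()}
    # E(avg) = [E(w1) + E(w2) + ... ] / m, dividing by m as the text states
    e_avg = sum(expected.values()) / m
    return expected, e_avg

expected, e_avg = phrase_statistics(["中国", "中国", "中国", "直播"])
print(expected)  # {'中国': 2.25, '直播': 0.25}
print(e_avg)     # 0.625
```

In this sample "中国" has E(w) = 2.25, which is above E(avg) = 0.625, so its morpheme would qualify for storage, while "直播" with E(w) = 0.25 would not.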
It is judged whether the text to be segmented contains a search phrase that already exists in the user search dictionary. If it does not, fine-grained morpheme segmentation is performed on the text to be segmented and the result is indexed.
If it does, the current search phrase is taken as the matched phrase, and the stop words and special characters are removed from the text to be segmented. The stop words include English characters, digits, mathematical characters, punctuation marks, modal particles, adverbs, prepositions, and conjunctions; the special characters are mathematical symbols, unit symbols, and tab characters. It is then judged whether E(w) of the current matched phrase is greater than E(avg); if it is, it is judged whether the user search dictionary already holds the morpheme corresponding to the matched phrase and, if it does not, the morpheme of the current matched phrase is stored in the dictionary as the morpheme corresponding to the matched phrase.
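One way to approximate the stop-word and special-character removal is shown below in Python. The patent gives no concrete word list or character classes, so everything here is an assumption: Unicode general categories stand in for punctuation and symbols, and SAMPLE_FUNCTION_WORDS is a tiny hypothetical sample of particles and conjunctions.

```python
import unicodedata

# Hypothetical sample only; the source does not list concrete stop words.
SAMPLE_FUNCTION_WORDS = {"的", "了", "吗", "呢", "和"}

def strip_stop_and_special(text: str) -> str:
    """Drop characters that fall into the stop-word or special-character classes."""
    kept = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            continue  # English characters and digits
        if unicodedata.category(ch)[0] in ("P", "S"):
            continue  # punctuation; mathematical and other symbols
        if ch == "\t":
            continue  # tab characters
        if ch in SAMPLE_FUNCTION_WORDS:
            continue  # sample particles and conjunctions
        kept.append(ch)
    return "".join(kept)

print(strip_stop_and_special("我爱中国！abc123\t+"))  # 我爱中国
```

A real stop-word list for adverbs, prepositions, and conjunctions would contain multi-character words and would need word-level rather than character-level matching.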
The matched phrase is taken out and fine-grained morpheme segmentation is performed on the remainder of the text. It is judged whether the matched phrase exceeds eight bytes. If it does, the current matched phrase is taken as the text to be segmented and processed again; if it does not, the morpheme of the current matched phrase together with the fine-grained morphemes is used as the segmentation morphemes, which are then indexed.
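The eight-byte check depends on the character encoding, which the text does not specify. The Python sketch below takes the encoding as a parameter and defaults to UTF-8 purely as an assumption; the function name is hypothetical.

```python
def exceeds_eight_bytes(phrase: str, encoding: str = "utf-8") -> bool:
    """True when the encoded phrase is longer than eight bytes."""
    return len(phrase.encode(encoding)) > 8

print(exceeds_eight_bytes("中国"))             # False: 6 bytes in UTF-8
print(exceeds_eight_bytes("我爱中国"))         # True: 12 bytes in UTF-8
print(exceeds_eight_bytes("我爱中国", "gbk"))  # False: 8 bytes in GBK
```

Under a two-byte encoding such as GBK the limit corresponds to four Chinese characters, while under UTF-8 it corresponds to two, so the choice of encoding changes which phrases are sent back for further segmentation.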
The detailed steps of the method of the invention are:
S1. Input the text to be segmented.
S2. Judge whether the text to be segmented contains a phrase from the user search dictionary, that is, a matched phrase: if it does, go to step S3; otherwise, go to step S6.
S3. Judge whether E(w) of the current matched phrase is greater than E(avg): if it is, go to step S4; otherwise, go to step S5.
S4. Judge whether the user search dictionary already holds the corresponding morpheme and, if it does not, store the morpheme of the current matched phrase in the dictionary; go to step S5.
S5. Remove the stop words and special characters from the text to be segmented; go to step S6.
S6. Take the matched phrase out of the text to be segmented and judge whether the matched phrase exceeds eight bytes: if it does, take the matched phrase as the text to be segmented and go to step S2; otherwise, go to step S7.
S7. Perform fine-grained morpheme segmentation on the text to obtain the retrieval morphemes. The text here covers both the text to be segmented after the matched phrase has been removed and text to be segmented that contains no matched phrase. For text that contains a matched phrase, the morphemes are the morpheme of the matched phrase plus the fine-grained morphemes; for text that contains no matched phrase, the morphemes are the fine-grained morphemes alone. Go to step S8.
S8. Index using the segmentation morphemes.
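The flow of steps S1 to S7 can be sketched in Python as follows. This is a simplified reading, not the patented implementation, and it makes several assumptions that the text leaves open: m is the total number of occurrences, the longest dictionary phrase found in the text is taken as the matched phrase, byte length is measured in UTF-8, text with no matched phrase goes straight to fine-grained segmentation, and a phrase that is re-segmented because it exceeds eight bytes is not matched against itself (otherwise the loop back to S2 would never end). Stop-word removal (S5) and indexing (S8) are omitted. All names are hypothetical.

```python
from collections import Counter
from typing import Optional

def fine_grained(text: str) -> list[str]:
    """Fine-grained cut: one unit per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]

def segment(text: str, counts: Counter, morphemes: set[str],
            skip: Optional[str] = None, limit: int = 8,
            encoding: str = "utf-8") -> list[str]:
    """Return the segmentation morphemes for text; may add to morphemes."""
    m = sum(counts.values())
    expected = {w: (n / m) * n for w, n in counts.items()}  # E(w) = P * n
    e_avg = sum(expected.values()) / m if m else 0.0

    # S2: look for a dictionary phrase in the text (longest first, an assumption).
    candidates = [w for w in counts if w and w in text and w != skip]
    if not candidates:
        return fine_grained(text)  # S7 for text without a matched phrase
    phrase = max(candidates, key=len)

    # S3 and S4: store the morpheme when E(w) > E(avg); adding to a set
    # has no effect when the morpheme is already stored.
    if expected[phrase] > e_avg:
        morphemes.add(phrase)

    # S6: take the matched phrase out of the text.
    before, _, after = text.partition(phrase)
    if len(phrase.encode(encoding)) > limit:
        # Over eight bytes: the phrase itself becomes the text to be segmented.
        middle = segment(phrase, counts, morphemes, skip=phrase,
                         limit=limit, encoding=encoding)
    else:
        middle = [phrase]

    # S7: fine-grained segmentation of the remainder.
    return fine_grained(before) + middle + fine_grained(after)

counts = Counter({"中国": 3, "直播": 1})
stored: set[str] = set()
print(segment("我爱中国！", counts, stored))  # ['我', '爱', '中国', '！']
print(stored)                                 # {'中国'}
```

Compared with pure fine-grained segmentation, the phrase "中国" survives as a single morpheme, which is the effect the method aims for.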
The present invention also provides a retrieval system based on text morpheme segmentation, comprising a database establishing module, an input module, a judging and comparing module, a segmentation module, and a retrieval module.
The database establishing module is used to establish the user search dictionary, and the input module is used to input the text to be segmented into the retrieval system.
The judging and comparing module is used to judge whether the text to be segmented contains a matched phrase, to compare whether E(w) of the current matched phrase is greater than E(avg), and, if it is, to store the morpheme of the current matched phrase in the dictionary.
The judging and comparing module is also used to judge whether the current matched phrase exceeds eight bytes and, if it does not, to use the morpheme of the current matched phrase together with the fine-grained morphemes as the segmentation morphemes, which are then indexed.
The segmentation module is used to perform fine-grained morpheme segmentation both on text to be segmented that contains no matched phrase and on text to be segmented after the matched phrase has been removed. The retrieval module is used to retrieve according to the segmented morphemes.
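The text does not describe how the retrieval module stores or looks up the segmented morphemes. One common choice, shown here in Python purely as an assumption, is an inverted index that maps each morpheme to the set of documents containing it. The class name, method names, and document IDs are hypothetical.

```python
from collections import defaultdict

class MorphemeIndex:
    """Minimal inverted index: morpheme -> set of document IDs."""

    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, morphemes: list[str]) -> None:
        """Record every segmented morpheme of one document."""
        for morpheme in morphemes:
            self._postings[morpheme].add(doc_id)

    def search(self, query_morphemes: list[str]) -> set[int]:
        """Return the documents that contain every query morpheme."""
        if not query_morphemes:
            return set()
        result = None
        for morpheme in query_morphemes:
            docs = self._postings.get(morpheme, set())
            result = set(docs) if result is None else result & docs
        return result

index = MorphemeIndex()
index.add(1, ["我", "爱", "中国", "！"])
index.add(2, ["中国", "直播"])
print(index.search(["中国"]))          # {1, 2}
print(index.search(["中国", "直播"]))  # {2}
```

Because the query is segmented with the same dictionary as the documents, a phrase kept whole at indexing time is looked up as the same single morpheme at query time.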
The present invention is not limited to the above embodiments. Those of ordinary skill in the art may make improvements and refinements without departing from the principle of the invention, and such improvements and refinements are also regarded as falling within the scope of protection of the invention. Content not described in detail in this specification belongs to the prior art known to those skilled in the art.

Claims (10)

1. A retrieval method based on text morpheme segmentation, characterized in that:
A user search dictionary is established, the dictionary recording and storing all search phrases of the current user and the number of times n that each search phrase occurs; the total over all search phrases is m, the search frequency P of each search phrase is n/m, and the expected value of each search phrase is E(w), where E(w) = P*n; the average expected value of all search phrases is E(avg) = [E(w1) + E(w2) + … + E(wn)]/m;
The retrieval comprises the following steps:
S1. Judge whether the text to be segmented contains a search phrase that has already appeared in the user search dictionary; if it does, take the current search phrase as the matched phrase and go to step S2;
S2. Judge whether E(w) of the current matched phrase is greater than E(avg); if it is, judge whether the user search dictionary already holds the morpheme of the current matched phrase and, if it does not, store the morpheme of the current matched phrase in the dictionary as the morpheme corresponding to the matched phrase; go to step S3;
S3. Take out the matched phrase and perform fine-grained morpheme segmentation on the remainder of the text; judge whether the matched phrase exceeds eight bytes and, if it does not, use the morpheme of the current matched phrase together with the fine-grained morphemes as the segmentation morphemes and then index them.
2. The retrieval method based on text morpheme segmentation according to claim 1, characterized in that: in step S1, when the text to be segmented contains no phrase from the user search dictionary, fine-grained morpheme segmentation is performed on the text to be segmented and the result is indexed.
3. The retrieval method and system based on text morpheme segmentation according to claim 1, characterized in that: in step S3, it is judged whether the matched phrase exceeds eight bytes; if it does, the matched phrase is taken as the text to be segmented and the method returns to step S1.
4. The retrieval method based on text morpheme segmentation according to any one of claims 1 to 3, characterized in that: in step S1, when the text to be segmented contains no matched phrase, fine-grained morpheme segmentation is performed on the text to be segmented.
5. The retrieval method based on text morpheme segmentation according to claim 4, characterized in that the following step is further included between steps S1 and S2: removing the stop words and special characters from the text to be segmented.
6. The retrieval method based on text morpheme segmentation according to claim 5, characterized in that: the stop words include English characters, digits, mathematical characters, punctuation marks, modal particles, adverbs, prepositions, and conjunctions.
7. The retrieval method based on text morpheme segmentation according to claim 5, characterized in that: the special characters are mathematical symbols, unit symbols, and tab characters.
8. A retrieval system based on text morpheme segmentation for implementing the retrieval method of any one of claims 1 to 7, characterized in that it comprises a database establishing module, an input module, a judging and comparing module, a segmentation module, and a retrieval module;
The database establishing module is used to establish the user search dictionary;
The input module is used to input the text to be segmented into the retrieval system;
The judging and comparing module is used to judge whether the text to be segmented contains a matched phrase, to compare whether E(w) of the current matched phrase is greater than E(avg), and, if it is, to go on to judge whether the user search dictionary already holds the corresponding morpheme and, if it does not, to store the morpheme of the current matched phrase in the dictionary;
The segmentation module is used to perform fine-grained morpheme segmentation on the text to be segmented after the matched phrase has been removed;
The retrieval module is used to retrieve according to the segmented morphemes.
9. The retrieval system based on text morpheme segmentation according to claim 8, characterized in that: the judging and comparing module is also used to judge whether the current matched phrase exceeds eight bytes and, if it does not, to use the morpheme of the current matched phrase together with the fine-grained morphemes as the segmentation morphemes and then index them.
10. The retrieval system based on text morpheme segmentation according to claim 8, characterized in that: the segmentation module is also used to perform fine-grained morpheme segmentation on text to be segmented that contains no matched phrase.
CN201610881111.7A 2016-10-09 2016-10-09 Retrieval method and system based on text morpheme segmentation Active CN106502980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610881111.7A CN106502980B (en) 2016-10-09 2016-10-09 Retrieval method and system based on text morpheme segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610881111.7A CN106502980B (en) 2016-10-09 2016-10-09 Retrieval method and system based on text morpheme segmentation

Publications (2)

Publication Number Publication Date
CN106502980A CN106502980A (en) 2017-03-15
CN106502980B true CN106502980B (en) 2019-05-17

Family

ID=58294697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610881111.7A Active CN106502980B (en) 2016-10-09 2016-10-09 Retrieval method and system based on text morpheme segmentation

Country Status (1)

Country Link
CN (1) CN106502980B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415903B (en) * 2018-03-12 2021-09-07 武汉斗鱼网络科技有限公司 Evaluation method, storage medium, and apparatus for judging validity of search intention recognition
CN110688852B (en) * 2019-09-27 2023-04-07 西安赢瑞电子有限公司 Chinese character word frequency storage method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1629836A (en) * 2003-12-17 2005-06-22 北京大学 Method and apparatus for learning Chinese new words
CN103353894A (en) * 2013-07-19 2013-10-16 武汉睿数信息技术有限公司 Data searching method and system based on semantic analysis
CN103559313A (en) * 2013-11-20 2014-02-05 北京奇虎科技有限公司 Searching method and device
CN103678282A (en) * 2014-01-07 2014-03-26 苏州思必驰信息科技有限公司 Word segmentation method and device
CN104239321A (en) * 2013-06-14 2014-12-24 高德软件有限公司 Data processing method and device for search engine
CN105045875A (en) * 2015-07-17 2015-11-11 北京林业大学 Personalized information retrieval method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
搜索引擎中文分词技术研究;任丽芸;《中国优秀硕士学位论文全文数据库 信息科技辑》;20120415(第04期);I138-2477

Also Published As

Publication number Publication date
CN106502980A (en) 2017-03-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231030

Address after: 518000 International Chamber of Commerce Center 2103, No. 168 Fuhua Third Road, Fu'an Community, Futian Street, Futian District, Shenzhen, Guangdong Province

Patentee after: Shenzhen origin parameter information technology Co.,Ltd.

Address before: 430000 Wuhan Donghu Development Zone, Wuhan, Hubei Province, No. 1 Software Park East Road 4.1 Phase B1 Building 11 Building

Patentee before: WUHAN DOUYU NETWORK TECHNOLOGY Co.,Ltd.