CN106502980B - A kind of search method and system based on text morpheme cutting - Google Patents
- Publication number
- CN106502980B (application number CN201610881111.7A)
- Authority
- CN
- China
- Prior art keywords
- morpheme
- phrase
- cutting
- text
- retrieval
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/243—Natural language query formulation
Abstract
The invention discloses a search method and system based on text morpheme cutting, relating to the field of big data retrieval. The method includes: establishing a user search dictionary; judging whether the text to be cut contains a retrieval phrase that has already appeared in the user search dictionary and, if so, taking that retrieval phrase as the existing phrase; judging whether E(w) of the current existing phrase is greater than E(avg) and, if it is, judging whether the user search dictionary already contains the morpheme of the existing phrase and, if not, storing the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase; taking out the existing phrase and applying fine-grained morpheme cutting to the remainder of the text; and judging whether the existing phrase exceeds eight bytes and, if it does not, using the morpheme of the current existing phrase together with the morphemes from fine-grained cutting as the cut morphemes, which are then indexed. The invention can reduce the frequency of updates and maintenance and improve retrieval quality.
Description
Technical field
The present invention relates to the field of big data retrieval, and in particular to a search method and system based on text morpheme cutting.
Background technique
With the rapid development of the internet industry, big data retrieval has become particularly important. An efficient retrieval system needs a good document parsing scheme to parse the text to be retrieved. The most important step in parsing is morpheme cutting of the document, that is, identifying the morphemes, words and phrases that make up the document content.
At present, document parsing mainly works as follows: the document structure is identified, any alphanumeric sequence in the text that is terminated by a space or special symbol is identified as a word, and upper-case characters are converted to lower case. For example, "I Love China!Yeah" is cut into "i", "love", "china", "!" and "yeah".
Chinese documents are usually not delimited by spaces. With the existing document parsing method, a sentence such as "我爱中国！" ("I love China!") is split into "我爱中国" and "！". Words cut in this way rarely match a corresponding query word in the database. For Chinese documents, therefore, morpheme cutting has to be handled in some other way, so that queries and document terms can be matched with each other.
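The behaviour described above can be illustrated with a minimal sketch. It assumes a regular-expression tokenizer (the function name `tokenize_on_delimiters` is hypothetical and not part of the patent) that treats runs of word characters as words and each punctuation mark as its own token:

```python
import re


def tokenize_on_delimiters(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks.

    In Python 3, \\w matches Unicode word characters, so an unbroken run of
    Chinese characters is returned as one token.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Upper-case characters are converted to lower case.
    return [token.lower() for token in tokens]


print(tokenize_on_delimiters("I Love China!Yeah"))
print(tokenize_on_delimiters("我爱中国！"))
```

The English sentence is cut into individual words, whereas the Chinese sentence comes back as one long token plus the punctuation mark, which is why a query for "中国" would not match it.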
Currently popular indexing schemes include dictionary-based word segmentation indexes, regular-expression-based word segmentation indexes, word segmentation indexes based on special characters such as spaces, and various custom word segmentation indexes. Of these, dictionary-based word segmentation indexing is the most widely used in current search engines and gives the best indexing results. Examples include Apache Lucene (an open-source full-text search project under Apache), Apache Solr (an open-source full-text search project under Apache) and ElasticSearch (a search server based on Lucene that provides a distributed, multi-user full-text search engine).
Existing fine-grained cutting methods cut the document to be searched into minimal units, for example cutting "我爱中国！" ("I love China!") directly into "我", "爱", "中", "国" and "！". This not only puts heavy storage pressure on the storage module of the retrieval system but also splits meaningful phrases such as "中国" ("China"), which makes retrieval more difficult.
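A minimal sketch of fine-grained cutting, assuming that the "minimal unit" is a single character and that whitespace is discarded (both are assumptions; the function name `fine_grained_cut` is hypothetical):

```python
def fine_grained_cut(text: str) -> list[str]:
    """Cut text into single-character units, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


units = fine_grained_cut("我爱中国！")
print(units)
# The meaningful phrase "中国" no longer exists as a unit of its own.
print("中国" in units)
```

Every character becomes a separate index entry, so the phrase "中国" can only be recovered by combining the entries for "中" and "国" at query time.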
In conclusion current segmenting method not only needs powerful and sufficient dictionary to support, and retrieval quality is lower, but
It is that dictionary needs real-time update and maintenance, needs to expend a large amount of manpower, higher cost.
Summary of the invention
In view of the deficiencies in the prior art, the purpose of the present invention is to provide a search method and system based on text morpheme cutting that can reduce the frequency of updates and maintenance and improve retrieval quality.
To achieve the above objectives, the technical solution adopted by the present invention is that:
A search method based on text morpheme cutting, in which:
A user search dictionary is established. The dictionary records and stores all retrieval phrases of the current user and the number of times n that each retrieval phrase occurs. The total over all retrieval phrases is m, the retrieval frequency P of each retrieval phrase is n/m, and the expected value of each retrieval phrase is E(w), where E(w) = P*n. The average expected value of all retrieval phrases is E(avg) = [E(w1) + E(w2) + ... + E(wn)]/m;
The retrieval comprises the following steps:
S1. Judge whether the text to be cut contains a retrieval phrase that has already appeared in the user search dictionary; if so, take the current retrieval phrase as the existing phrase and go to step S2;
S2. Judge whether E(w) of the current existing phrase is greater than E(avg); if it is, judge whether the user search dictionary already contains the morpheme of the current existing phrase and, if not, store the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase; go to step S3;
S3. Take out the existing phrase and apply fine-grained morpheme cutting to the remainder of the text; judge whether the existing phrase exceeds eight bytes and, if it does not, use the morpheme of the current existing phrase together with the morphemes from fine-grained cutting as the cut morphemes, and then index them.
On the basis of the above technical solution, in step S1, when the text to be cut does not contain any phrase in the user search dictionary, fine-grained morpheme cutting is applied to the text to be cut and the result is indexed.
On the basis of the above technical solution, in step S3, it is judged whether the existing phrase exceeds eight bytes; when it does, the existing phrase is taken as the text to be cut and the method returns to step S1.
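The eight-byte test depends on the character encoding, which the text does not specify. The sketch below assumes GBK, where a Chinese character occupies two bytes, so the limit corresponds to four Chinese characters; the function name, the default encoding and the sample phrases are all illustrative assumptions:

```python
def exceeds_eight_bytes(phrase: str, encoding: str = "gbk", limit: int = 8) -> bool:
    """Return True when the encoded phrase is longer than `limit` bytes.

    Assumption: the encoding is not stated in the source, so GBK is used
    here purely as an example.
    """
    return len(phrase.encode(encoding)) > limit


print(exceeds_eight_bytes("中国"))                       # 4 bytes in GBK
print(exceeds_eight_bytes("中华人民共和国"))             # 14 bytes in GBK
print(exceeds_eight_bytes("中华人民", encoding="utf-8"))  # 12 bytes in UTF-8
```

Under UTF-8 the same limit would admit only two Chinese characters, so the choice of encoding changes which phrases are sent back to step S1 for further cutting.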
On the basis of the above technical solution, in step S1, when the text to be cut does not contain an existing phrase, fine-grained morpheme cutting is applied to the text to be cut.
On the basis of the above technical solution, the following step is further included between steps S1 and S2: removing the stop words and special characters from the text to be cut.
On the basis of the above technical solution, the stop words include English characters, digits, mathematical characters, punctuation marks, modal particles, adverbs, prepositions and conjunctions.
On the basis of the above technical solution, the special characters are mathematical symbols, unit symbols and tab characters.
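The removal of stop words and special characters might be sketched as follows. The stop-word set here is a tiny hypothetical sample of single-character particles, adverbs and conjunctions; a real list would be far larger and would contain multi-character words. Unicode general categories are used as an approximation of "punctuation" and "symbols", which is an assumption of this sketch rather than something the text prescribes:

```python
import unicodedata

# Hypothetical sample: modal particles, an adverb and a conjunction.
STOP_WORDS = {"的", "了", "吗", "很", "和"}


def strip_stop_words_and_special(text: str) -> str:
    """Drop stop words and special characters, one character at a time."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch in STOP_WORDS:
            continue  # particles, adverbs, prepositions, conjunctions
        if ch.isascii() and ch.isalpha():
            continue  # English characters
        if category == "Nd":
            continue  # decimal digits
        if category.startswith("P") or category.startswith("S"):
            continue  # punctuation, mathematical and unit symbols
        if ch.isspace():
            continue  # tabs and other whitespace
        kept.append(ch)
    return "".join(kept)


print(strip_stop_words_and_special("我很爱中国的Python3！\t"))
```

Working character by character keeps the sketch short; handling multi-character stop words would need a prior word-level match against the list.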
A search system based on text morpheme cutting, comprising a database module, an input module, a judging and comparing module, a cutting module and a retrieval module;
The database module is used to establish the user search dictionary;
The input module is used to input the text to be cut into the search system;
The judging and comparing module is used to judge whether the text to be cut contains an existing phrase, to compare whether E(w) of the current existing phrase is greater than E(avg), and, when it is greater, to store the morpheme of the current existing phrase in the dictionary;
The cutting module is used to apply fine-grained morpheme cutting to the text to be cut after the existing phrase has been removed;
The retrieval module is used to retrieve according to the cut morphemes.
On the basis of the above technical solution, the judging and comparing module is also used to judge whether the current existing phrase exceeds eight bytes and, when it does not, to use the morpheme of the current existing phrase together with the morphemes from fine-grained cutting as the cut morphemes, which are then indexed.
On the basis of the above technical solution, the cutting module is also used to apply fine-grained morpheme cutting to text to be cut that contains no existing phrase.
Compared with the prior art, the advantages of the present invention are as follows:
(1) The search method based on text morpheme cutting of the present invention follows the user's retrieval habits: retrieval phrases the user commonly searches for are stored in the search dictionary together with the expected value of each retrieval phrase, and the expected value is compared with the average value to decide whether the morpheme of the corresponding retrieval phrase should be stored in the dictionary. The invention further optimizes the method by combining fine-grained morpheme cutting with a check on the length of the retrieval phrase. Because the morphemes in the fields that interest a given user are correlated and tend to recur, the method can improve retrieval quality and reduce the frequency of updates and maintenance.
Description of the drawings
Fig. 1 is the flow chart of the search method based on text morpheme cutting in the embodiment of the present invention;
Fig. 2 is the structural block diagram of the searching system based on text morpheme cutting in the embodiment of the present invention.
Specific embodiment
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
Referring to Fig. 1, an embodiment of the present invention provides a search method based on text morpheme cutting, comprising the following steps:
Establish a user search dictionary. The dictionary records and stores all retrieval phrases of the current user and the number of times n that each retrieval phrase occurs. The total over all retrieval phrases is m, the retrieval frequency P of each retrieval phrase is n/m, and the expected value of each retrieval phrase is E(w), where E(w) = P*n. The average expected value of all retrieval phrases is E(avg) = [E(w1) + E(w2) + ... + E(wn)]/m.
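The dictionary statistics can be sketched as below. Two points are assumptions of this sketch: m is read as the total number of occurrences summed over all retrieval phrases, and E(avg) is divided by m exactly as the formula is written (not by the number of distinct phrases). The phrase counts are invented for illustration:

```python
def expected_values(counts: dict[str, int]) -> dict[str, float]:
    """E(w) = P * n with P = n / m, for every retrieval phrase w."""
    m = sum(counts.values())  # assumption: m is the total occurrence count
    return {w: (n / m) * n for w, n in counts.items()}


def average_expected_value(counts: dict[str, int]) -> float:
    """E(avg) = [E(w1) + ... + E(wn)] / m, dividing by m as the text states."""
    m = sum(counts.values())
    return sum(expected_values(counts).values()) / m


counts = {"中国": 6, "北京": 3, "天气": 1}  # hypothetical user query counts
e = expected_values(counts)
e_avg = average_expected_value(counts)
above_average = [w for w in counts if e[w] > e_avg]
print(above_average)
```

With these counts, m is 10, the expected values are roughly 3.6, 0.9 and 0.1, and E(avg) is roughly 0.46, so only the two most frequent phrases pass the E(w) > E(avg) test.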
Judge whether the text to be cut contains a retrieval phrase that already exists in the user search dictionary. If not, apply fine-grained morpheme cutting to the text to be cut and index the result.
If so, take the current retrieval phrase as the existing phrase and remove the stop words and special characters from the text to be cut. The stop words include English characters, digits, mathematical characters, punctuation marks, modal particles, adverbs, prepositions and conjunctions; the special characters are mathematical symbols, unit symbols and tab characters. Judge whether E(w) of the current existing phrase is greater than E(avg). If it is, judge whether the user search dictionary already contains the morpheme corresponding to the existing phrase and, if not, store the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase.
Take out the existing phrase and apply fine-grained morpheme cutting to the remainder of the text. Judge whether the existing phrase exceeds eight bytes. If it does, take the current existing phrase as the text to be cut and process it again; if it does not, use the morpheme of the current existing phrase together with the morphemes from fine-grained cutting as the cut morphemes, and then index them.
The detailed steps of the method of the invention are:
S1. Input the text to be cut.
S2. Judge whether the text to be cut contains an existing phrase, that is, a phrase in the user search dictionary. If so, go to step S3; otherwise, go to step S6.
S3. Judge whether E(w) of the current existing phrase is greater than E(avg). If so, go to step S4; otherwise, go to step S5.
S4. Judge whether the user search dictionary already contains the corresponding morpheme and, if not, store the morpheme of the current existing phrase in the dictionary; go to step S5.
S5. Remove the stop words and special characters from the text to be cut; go to step S6.
S6. Take the existing phrase out of the text to be cut and judge whether the existing phrase exceeds eight bytes. If it does, take the existing phrase as the text to be cut and go to step S2; otherwise, go to step S7.
S7. Apply fine-grained morpheme cutting to the text to obtain the retrieval morphemes. The text here covers both text to be cut from which the existing phrase has been removed and text to be cut that contains no existing phrase. For text that contains an existing phrase, the morphemes are the morpheme of the existing phrase plus the morphemes from fine-grained cutting; for text that contains no existing phrase, the morphemes are those from fine-grained cutting. Go to step S8.
S8. Index the cut morphemes.
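The steps above can be drawn together in a simplified sketch. Several choices here are assumptions rather than part of the described method: the longest dictionary phrase found in the text is taken as the existing phrase, only its first occurrence is handled, GBK is used for the byte count, step S5 (stop-word removal) is omitted, and a phrase that is sent back for re-cutting is excluded from matching itself so that the loop terminates. All function and variable names are hypothetical.

```python
def fine_grained_cut(text: str) -> list[str]:
    """Cut text into single-character units, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def cut_text(text, counts, morphemes, encoding="gbk", limit=8, exclude=None):
    """Return the cut morphemes for `text`.

    counts: retrieval phrase -> number of occurrences (user search dictionary)
    morphemes: set of stored morphemes, updated in place
    """
    # S2: look for a dictionary phrase contained in the text.
    candidates = [w for w in counts if w in text and w != exclude]
    if not candidates:
        return fine_grained_cut(text)  # S7 for text without an existing phrase
    phrase = max(candidates, key=len)  # assumption: prefer the longest match

    # S3: compare E(w) with E(avg); m is read as the total occurrence count.
    m = sum(counts.values())
    e = {w: (n / m) * n for w, n in counts.items()}
    e_avg = sum(e.values()) / m
    if e[phrase] > e_avg:
        morphemes.add(phrase)  # S4: a set stores the morpheme only once

    # S6: take the phrase out of the text and check its length.
    before, _, after = text.partition(phrase)
    if len(phrase.encode(encoding)) > limit:
        # The long phrase becomes the text to be cut (back to S2).
        middle = cut_text(phrase, counts, morphemes, encoding, limit, exclude=phrase)
    else:
        middle = [phrase]

    # S7: fine-grained cutting of the remainder, plus the phrase morpheme.
    return fine_grained_cut(before) + middle + fine_grained_cut(after)


counts = {"中国": 6, "北京": 3, "天气": 1}  # hypothetical dictionary
stored = set()
print(cut_text("我爱中国！", counts, stored))
print(stored)
```

In this run the phrase "中国" survives as a single morpheme while the rest of the sentence is cut character by character; the resulting list is what step S8 would index.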
The present invention also provides a search system based on text morpheme cutting, comprising a database module, an input module, a judging and comparing module, a cutting module and a retrieval module.
The database module is used to establish the user search dictionary, and the input module is used to input the text to be cut into the search system.
The judging and comparing module is used to judge whether the text to be cut contains an existing phrase, to compare whether E(w) of the current existing phrase is greater than E(avg), and, when it is greater, to store the morpheme of the current existing phrase in the dictionary.
The judging and comparing module is also used to judge whether the current existing phrase exceeds eight bytes and, when it does not, to use the morpheme of the current existing phrase together with the morphemes from fine-grained cutting as the cut morphemes, which are then indexed.
The cutting module is used to apply fine-grained morpheme cutting both to text to be cut that contains no existing phrase and to text to be cut from which the existing phrase has been removed.
The retrieval module is used to retrieve according to the cut morphemes.
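The text does not say how the retrieval module stores or looks up the cut morphemes. One common possibility is an inverted index; the sketch below assumes that structure, and the class name, method names and sample documents are all hypothetical:

```python
from collections import defaultdict


class RetrievalModule:
    """Toy inverted index: morpheme -> set of document ids (an assumption)."""

    def __init__(self) -> None:
        self.index: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, morphemes: list[str]) -> None:
        """Index one document under each of its cut morphemes."""
        for morpheme in morphemes:
            self.index[morpheme].add(doc_id)

    def search(self, morphemes: list[str]) -> set[int]:
        """Return ids of documents that contain every query morpheme."""
        postings = [self.index.get(morpheme, set()) for morpheme in morphemes]
        return set.intersection(*postings) if postings else set()


retrieval = RetrievalModule()
retrieval.add(1, ["我", "爱", "中国", "！"])
retrieval.add(2, ["中", "华", "人", "民", "共和国"])
print(retrieval.search(["中国"]))
print(retrieval.search(["中"]))
```

Because "中国" was kept whole when document 1 was cut, a query for "中国" matches only that document, while a query for the single character "中" matches only document 2.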
The present invention is not limited to the above embodiments. Those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also regarded as falling within the scope of protection of the invention. Content not described in detail in this specification belongs to the prior art known to those skilled in the art.
Claims (10)
1. A search method based on text morpheme cutting, characterized in that:
A user search dictionary is established; the dictionary records and stores all retrieval phrases of the current user and the number of times n that each retrieval phrase occurs; the total over all retrieval phrases is m; the retrieval frequency P of each retrieval phrase is n/m; the expected value of each retrieval phrase is E(w), where E(w) = P*n; and the average expected value of all retrieval phrases is E(avg) = [E(w1) + E(w2) + ... + E(wn)]/m;
The retrieval comprises the following steps:
S1. Judge whether the text to be cut contains a retrieval phrase that has already appeared in the user search dictionary; if so, take the current retrieval phrase as the existing phrase and go to step S2;
S2. Judge whether E(w) of the current existing phrase is greater than E(avg); if it is, judge whether the user search dictionary already contains the morpheme of the current existing phrase and, if not, store the morpheme of the current existing phrase in the dictionary as the morpheme corresponding to the existing phrase; go to step S3;
S3. Take out the existing phrase and apply fine-grained morpheme cutting to the remainder of the text; judge whether the existing phrase exceeds eight bytes and, if it does not, use the morpheme of the current existing phrase together with the morphemes from fine-grained cutting as the cut morphemes, and then index them.
2. The search method based on text morpheme cutting as claimed in claim 1, characterized in that: in step S1, when the text to be cut does not contain any phrase in the user search dictionary, fine-grained morpheme cutting is applied to the text to be cut and the result is indexed.
3. The search method and system based on text morpheme cutting as claimed in claim 1, characterized in that: in step S3, it is judged whether the existing phrase exceeds eight bytes; when it does, the existing phrase is taken as the text to be cut and the method returns to step S1.
4. The search method based on text morpheme cutting as claimed in any one of claims 1 to 3, characterized in that: in step S1, when the text to be cut does not contain an existing phrase, fine-grained morpheme cutting is applied to the text to be cut.
5. The search method based on text morpheme cutting as claimed in claim 4, characterized in that the following step is further included between steps S1 and S2: removing the stop words and special characters from the text to be cut.
6. The search method based on text morpheme cutting as claimed in claim 5, characterized in that: the stop words include English characters, digits, mathematical characters, punctuation marks, modal particles, adverbs, prepositions and conjunctions.
7. The search method based on text morpheme cutting as claimed in claim 5, characterized in that: the special characters are mathematical symbols, unit symbols and tab characters.
8. A search system based on text morpheme cutting for implementing the search method of any one of claims 1 to 7, characterized in that it comprises a database module, an input module, a judging and comparing module, a cutting module and a retrieval module;
The database module is used to establish the user search dictionary;
The input module is used to input the text to be cut into the search system;
The judging and comparing module is used to judge whether the text to be cut contains an existing phrase, to compare whether E(w) of the current existing phrase is greater than E(avg), and, when it is greater, to go on to judge whether the user search dictionary already contains the corresponding morpheme and, if not, to store the morpheme of the current existing phrase in the dictionary;
The cutting module is used to apply fine-grained morpheme cutting to the text to be cut after the existing phrase has been removed;
The retrieval module is used to retrieve according to the cut morphemes.
9. The search system based on text morpheme cutting as claimed in claim 8, characterized in that: the judging and comparing module is also used to judge whether the current existing phrase exceeds eight bytes and, when it does not, to use the morpheme of the current existing phrase together with the morphemes from fine-grained cutting as the cut morphemes, which are then indexed.
10. The search system based on text morpheme cutting as claimed in claim 8, characterized in that: the cutting module is also used to apply fine-grained morpheme cutting to text to be cut that contains no existing phrase.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610881111.7A CN106502980B (en) | 2016-10-09 | 2016-10-09 | A kind of search method and system based on text morpheme cutting |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106502980A CN106502980A (en) | 2017-03-15 |
CN106502980B true CN106502980B (en) | 2019-05-17 |
Family
ID=58294697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610881111.7A Active CN106502980B (en) | 2016-10-09 | 2016-10-09 | A kind of search method and system based on text morpheme cutting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106502980B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108415903B (en) * | 2018-03-12 | 2021-09-07 | 武汉斗鱼网络科技有限公司 | Evaluation method, storage medium, and apparatus for judging validity of search intention recognition |
CN110688852B (en) * | 2019-09-27 | 2023-04-07 | 西安赢瑞电子有限公司 | Chinese character word frequency storage method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1629836A (en) * | 2003-12-17 | 2005-06-22 | 北京大学 | Method and apparatus for learning Chinese new words |
CN103353894A (en) * | 2013-07-19 | 2013-10-16 | 武汉睿数信息技术有限公司 | Data searching method and system based on semantic analysis |
CN103559313A (en) * | 2013-11-20 | 2014-02-05 | 北京奇虎科技有限公司 | Searching method and device |
CN103678282A (en) * | 2014-01-07 | 2014-03-26 | 苏州思必驰信息科技有限公司 | Word segmentation method and device |
CN104239321A (en) * | 2013-06-14 | 2014-12-24 | 高德软件有限公司 | Data processing method and device for search engine |
CN105045875A (en) * | 2015-07-17 | 2015-11-11 | 北京林业大学 | Personalized information retrieval method and apparatus |
Non-Patent Citations (1)
Title |
---|
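搜索引擎中文分词技术研究 (Research on Chinese Word Segmentation Technology for Search Engines); 任丽芸 (Ren Liyun); 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 20120415 (No. 04); I138-2477 |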
Also Published As
Publication number | Publication date |
---|---|
CN106502980A (en) | 2017-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11675977B2 (en) | Intelligent system that dynamically improves its knowledge and code-base for natural language understanding | |
US8972413B2 (en) | System and method for matching comment data to text data | |
US10515125B1 (en) | Structured text segment indexing techniques | |
US8375033B2 (en) | Information retrieval through identification of prominent notions | |
Trabelsi et al. | Bridging folksonomies and domain ontologies: Getting out non-taxonomic relations | |
Saloot et al. | An architecture for Malay Tweet normalization | |
US10606903B2 (en) | Multi-dimensional query based extraction of polarity-aware content | |
CN104281702A (en) | Power keyword segmentation based data retrieval method and device | |
Al-Shammari et al. | Towards an error-free Arabic stemming | |
CN108319583B (en) | Method and system for extracting knowledge from Chinese language material library | |
Ahmed et al. | Revised n-gram based automatic spelling correction tool to improve retrieval effectiveness | |
CN102214189A (en) | Data mining-based word usage knowledge acquisition system and method | |
US20150242493A1 (en) | User-guided search query expansion | |
Wijaya et al. | Automatic mood classification of Indonesian tweets using linguistic approach | |
CN102117285B (en) | Search method based on semantic indexing | |
Jia et al. | A Chinese unknown word recognition method for micro-blog short text based on improved FP-growth | |
CN106502980B (en) | A kind of search method and system based on text morpheme cutting | |
Bölücü et al. | Unsupervised joint PoS tagging and stemming for agglutinative languages | |
Gollapalli et al. | Keyphrase extraction using sequential labeling | |
US8554769B1 (en) | Identifying gibberish content in resources | |
Joseph et al. | Citation analysis, centrality, and the ACL Anthology | |
CN111160007B (en) | Search method and device based on BERT language model, computer equipment and storage medium | |
Roy et al. | An unsupervised normalization algorithm for noisy text: a case study for information retrieval and stance detection | |
Pandey et al. | Evaluating effect of stemming and stop-word removal on Hindi text retrieval | |
CN113157857B (en) | Hot topic detection method, device and equipment for news |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20231030; Address after: 518000 International Chamber of Commerce Center 2103, No. 168 Fuhua Third Road, Fu'an Community, Futian Street, Futian District, Shenzhen, Guangdong Province; Patentee after: Shenzhen origin parameter information technology Co.,Ltd.; Address before: 430000 Wuhan Donghu Development Zone, Wuhan, Hubei Province, No. 1 Software Park East Road 4.1 Phase B1 Building 11 Building; Patentee before: WUHAN DOUYU NETWORK TECHNOLOGY Co.,Ltd. |