CN104765838A - Word segmenting method and device - Google Patents


Info

Publication number
CN104765838A
Authority
CN
China
Prior art keywords
matching result
phrase
numerical value
matching
dictionary storehouse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510179584.8A
Other languages
Chinese (zh)
Inventor
李成华
王峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Co Ltd
Original Assignee
Hisense Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Group Co Ltd
Publication of CN104765838A

Landscapes

  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word segmenting method which is used for increasing the word segmenting accuracy rate. The method comprises the steps that a character string to be processed is obtained; the character string to be processed and a universal dictionary library are matched based on a forward maximum matching method to obtain a first matching result, and the character string to be processed and the universal dictionary library are matched based on a backward maximum matching method to obtain a second matching result; whether the first matching result is consistent with the second matching result or not is judged; the first matching result or the second matching result is output as the word segmenting result when the first matching result is consistent with the second matching result. The invention further discloses a device for implementing the method.

Description

Word segmentation method and device
This application is a divisional application of Chinese invention patent application No. 201210407529.6, filed on October 23, 2012 and entitled "Word segmentation method and device".
Technical field
The present invention relates to the field of word segmentation, and in particular to a word segmentation method and device.
Background
With the spread of the Internet and the maturing of electronic technology, televisions are gradually becoming high-definition, networked, and intelligent.
Searching for video on demand over the Internet has become a major requirement and application of smart televisions. To find exactly the video a user wants among the vast amount of video on the Internet, text information must be extracted effectively, so how to extract text information effectively has become an important problem in information retrieval. Chinese word segmentation, a key technique in information processing and retrieval, has attracted wide attention. As applications in different fields place ever higher demands on it, the quality of word segmentation directly affects the results of information processing and retrieval.
The prior art offers several word segmentation methods, of which string-based methods are the most common because they are relatively simple.
Existing string-based word segmentation methods include the forward maximum matching method and the reverse maximum matching method. For example, one string-based method applies either forward or reverse maximum matching to mechanically segment the character string, and then performs place-name and street-name recognition on the single characters left unrecognized. Its purpose is to recognize place names, street names, and the like, and to expand a place-name dictionary.
In the course of implementing the technical solutions of the embodiments of this application, the inventors found at least the following technical problems in the prior art:
1. Existing word segmentation systems use only one segmentation method (forward maximum matching or reverse maximum matching). The segmentation process is relatively coarse, so the segmentation results are not accurate enough and segmentation accuracy suffers.
2. Existing word segmentation methods address only the place-name field; character strings from other fields still cannot be recognized effectively.
Summary of the invention
Embodiments of the present invention provide a word segmentation method and device that address the low segmentation accuracy of the prior art and achieve the technical effect of improving segmentation accuracy.
One aspect of the present invention provides a word segmentation method comprising the following steps:
Obtaining a character string to be processed;
Matching the character string to be processed against a general dictionary according to the forward maximum matching method to obtain a first matching result, and matching the character string to be processed against the general dictionary according to the reverse maximum matching method to obtain a second matching result, wherein the first matching result contains a number of first phrases equal to a first value, the second matching result contains a number of second phrases equal to a second value, the first value being determined from the first matching result and the second value being determined from the second matching result; and wherein the first matching result contains a number of single characters equal to a third value, the second matching result contains a number of single characters equal to a fourth value, the third value being determined from the first matching result and the fourth value being determined from the second matching result;
Judging whether the first value is equal to the second value;
When the first value is equal to the second value, judging whether the third value is greater than the fourth value;
When the third value is equal to the fourth value, outputting the first phrases, of which there are the first value.
Another aspect of the present invention provides a word segmentation device, comprising:
An obtaining module, configured to obtain a character string to be processed;
A matching module, configured to match the character string to be processed against a general dictionary according to the forward maximum matching method to obtain a first matching result, and to match the character string to be processed against the general dictionary according to the reverse maximum matching method to obtain a second matching result, wherein the first matching result contains a number of first phrases equal to a first value, the second matching result contains a number of second phrases equal to a second value, the first value being determined from the first matching result and the second value being determined from the second matching result; and wherein the first matching result contains a number of single characters equal to a third value, the second matching result contains a number of single characters equal to a fourth value, the third value being determined from the first matching result and the fourth value being determined from the second matching result;
A first judging module, configured to judge whether the first value is identical to the second value;
A second judging module, configured to judge, when the first value is identical to the second value, whether the third value is greater than the fourth value;
An output module, configured to output the first phrases, of which there are the first value, when the third value is equal to the fourth value.
The word segmentation method in the embodiments of the present invention comprises: obtaining a character string to be processed; matching the character string to be processed against a general dictionary according to the forward maximum matching method to obtain a first matching result, and matching it against the general dictionary according to the reverse maximum matching method to obtain a second matching result, wherein the first matching result contains a number of first phrases equal to a first value, the second matching result contains a number of second phrases equal to a second value, the first matching result contains a number of single characters equal to a third value, the second matching result contains a number of single characters equal to a fourth value, the first and third values being determined from the first matching result and the second and fourth values being determined from the second matching result; judging whether the first value is equal to the second value; when the first value is equal to the second value, judging whether the third value is greater than the fourth value; and, when the third value is equal to the fourth value, outputting the first phrases, of which there are the first value.
In the embodiments of the present invention, the forward maximum matching method and the reverse maximum matching method are each applied to the same character string to be processed. Once matching is complete, the result can be output directly if the two matching results are identical. Because two matching methods are used and their results are compared before output, segmentation accuracy is improved. Moreover, if the matching results differ, ambiguity elimination can be performed on them, so that the result obtained is as accurate as possible and segmentation accuracy is safeguarded in several respects.
Brief description of the drawings
Fig. 1 is the main flowchart of the word segmentation method in an embodiment of the present invention;
Fig. 2 is a detailed structural diagram of the word segmentation device in an embodiment of the present invention.
Detailed description of the embodiments
The word segmentation method in the embodiments of the present invention comprises: obtaining a character string to be processed; matching the character string to be processed against a general dictionary according to the forward maximum matching method to obtain a first matching result, and matching it against the general dictionary according to the reverse maximum matching method to obtain a second matching result; judging whether the first matching result is consistent with the second matching result; and, when they are consistent, outputting the first matching result or the second matching result as the word segmentation result.
Referring to Fig. 1, the word segmentation method in the embodiments of the present invention may comprise the following steps:
Step 101: obtain a character string to be processed.
In the embodiments of the present invention, a passage of text may first be obtained, after which a dictionary may be loaded. In the prior art the loaded dictionary may be an ordinary general dictionary. In the embodiments of the present invention, a domain-specific dictionary may additionally be built. It may cover any field, for example film and television, construction, or the electrical field; the embodiments are described using a film and television dictionary as the example. This dictionary may contain film- and television-related information such as actor names, director names, titles, genres, and languages. Searching and matching against it allows the word segmentation device to perform better in video search.
In the embodiments of the present invention, a stop-word extension dictionary may also be built. It contains multiple words, for example modal particles and conjunctions, none of which helps in understanding a whole sentence. Take the sentence "I am going to eat with you": the subject is "I" and "you", the predicate is "going", and the object is "eat". The word "with" is a conjunction that contributes nothing to understanding the sentence, so "with" can be included in the stop-word extension dictionary.
In the embodiments of the present invention, the domain-specific dictionary and the stop-word extension dictionary that have been built may be included in a general dictionary. The general dictionary described in the embodiments therefore differs from that of the prior art: it is a general dictionary that contains the domain-specific dictionary and the stop-word extension dictionary. For example, since the embodiments are described using a film and television dictionary, the general dictionary may be one that contains the film and television dictionary and the stop-word extension dictionary.
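The composition of the general dictionary described above can be sketched as a simple set union. This is only an illustration: the function name `build_general_dictionary` and all sample entries are assumptions, and the source does not say how the dictionaries are stored.

```python
def build_general_dictionary(base_words, domain_words, stop_words):
    """Merge three word lists into one lookup set (hypothetical helper)."""
    return set(base_words) | set(domain_words) | set(stop_words)


# Illustrative entries only; none of these lists comes from the source.
base = {"吃饭", "个人", "一个"}
film_tv = {"功夫熊猫", "喜剧"}   # assumed film and television entries
stop_words = {"和", "的"}        # assumed stop-word entries

general_dictionary = build_general_dictionary(base, film_tv, stop_words)
```

A set gives constant-time membership tests, which is the only operation the matching steps need.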
After the general dictionary containing the domain-specific dictionary and the stop-word extension dictionary has been loaded, the obtained passage may first be coarsely segmented according to punctuation and similar information, splitting it into multiple sentences. Each sentence may serve as a character string to be processed.
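A minimal sketch of this coarse segmentation step follows. The delimiter set and the name `coarse_split` are assumptions; the source only says that the passage is split according to punctuation and similar information.

```python
import re

# Assumed delimiter set: common Chinese and ASCII punctuation plus whitespace.
SENTENCE_DELIMITERS = r"[。！？；，、,.!?;\s]+"


def coarse_split(passage):
    """Split a passage into candidate strings at punctuation marks."""
    return [piece for piece in re.split(SENTENCE_DELIMITERS, passage) if piece]
```

For instance, `coarse_split("我去吃饭。你呢？")` yields two strings, each of which would then be segmented on its own.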
Step 102: match the character string to be processed against the general dictionary according to the forward maximum matching method to obtain a first matching result, and match it against the general dictionary according to the reverse maximum matching method to obtain a second matching result.
In the embodiments of the present invention, the character string to be processed may first be matched according to the forward maximum matching method to obtain the first matching result, which consists of first phrases whose number is the first value. The character string may then be matched according to the reverse maximum matching method to obtain the second matching result, which consists of second phrases whose number is the second value. The first value is thus determined from the first matching result and the second value from the second matching result. A phrase in the embodiments of the present invention may be either a multi-character phrase or a single character.
Alternatively, the character string to be processed may first be matched according to the reverse maximum matching method to obtain the second matching result, and then matched according to the forward maximum matching method to obtain the first matching result.
Alternatively, the character string to be processed may be matched according to the forward maximum matching method and the reverse maximum matching method at the same time, yielding the first matching result and the second matching result respectively. That is, in the embodiments of the present invention, the two methods may be applied to the character string in any order.
The forward maximum matching method (MM) may proceed as follows:
First, a maximum word length is set. It must not exceed the length of the character string to be processed and is preferably smaller than it; in general it can be set empirically. Suppose the maximum word length is n. Then n characters are taken from the left of the character string and matched against the general dictionary. If the general dictionary contains this entry, the match succeeds, the n characters are cut off from the character string, and the next n characters are taken from the left of the remaining string, until the whole string has been processed. If the match fails, the last of the n characters is removed and the remaining n-1 characters are matched against the general dictionary; if that also fails, the last character is removed again, and so on. If the length of the character string to be processed is m, then n should be a natural number greater than 1 and not greater than m.
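The procedure above can be sketched as follows. The function name, the default `max_len`, and the sample dictionary are assumptions made for illustration. One further assumption is labeled in the code: a single character is always accepted as a fallback, so the scan cannot stall on characters missing from the dictionary.

```python
def forward_max_match(text, dictionary, max_len=5):
    """Segment text left to right, preferring the longest dictionary entry."""
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then drop the last character.
        for size in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            # Assumption: a single character is accepted even if it is
            # not in the dictionary, so every pass makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                start += size
                break
    return tokens


# Illustrative dictionary and sentence, not taken from the source.
sample_dictionary = {"研究", "研究生", "生命", "起源"}
forward_tokens = forward_max_match("研究生命起源", sample_dictionary, max_len=3)
```

With this toy dictionary the greedy left-to-right scan takes "研究生" first, which leaves "命" stranded as a single character.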
The basic principle of the reverse maximum matching method (RMM) is the same as that of the forward maximum matching method, except that the segmentation direction is reversed. Scanning starts from the end of the character string to be processed, each time taking the last characters, up to the maximum word length, as the matching field. If the match fails, the first character of the matching field is removed and matching continues.
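A matching sketch of the reverse scan is given below, under the same assumptions as before: the names and the sample dictionary are illustrative, and a single character is always accepted as a fallback.

```python
def reverse_max_match(text, dictionary, max_len=5):
    """Segment text right to left, preferring the longest dictionary entry."""
    tokens = []
    end = len(text)
    while end > 0:
        # Take the longest field ending at `end`, then drop its first character.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Assumption: a single character is accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens


# Illustrative dictionary and sentence, not taken from the source.
sample_dictionary = {"研究", "研究生", "生命", "起源"}
reverse_tokens = reverse_max_match("研究生命起源", sample_dictionary, max_len=3)
```

On this sentence the reverse scan produces no stranded single character, which is the kind of difference the later comparison steps rely on.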
The forward matching method is illustrated below.
Suppose the character string to be processed is "我一个人吃饭" ("I eat alone").
Step 1: the maximum word length is set to 5. The first five characters, "我一个人吃", are matched against the general dictionary and no match is found, so the last character is removed, leaving "我一个人". These four characters cannot be matched either, so the last character is removed, leaving "我一个". These three characters cannot be matched, so the last character is removed, leaving "我一". These two characters cannot be matched, so the last character is removed, leaving "我". This single character is matched successfully.
Step 2: the remaining string, "一个人吃饭", is taken. These five characters cannot be matched, so the last character is removed, leaving "一个人吃". These four characters cannot be matched, so the last character is removed, leaving "一个人". These three characters cannot be matched, so the last character is removed, leaving "一个". These two characters are matched successfully.
Step 3: the remaining string, "人吃饭", is taken. These three characters cannot be matched, so the last character is removed, leaving "人吃". These two characters cannot be matched, so the last character is removed, leaving "人". This single character is matched successfully.
Step 4: the remaining string, "吃饭", is taken. These two characters are matched successfully.
Forward maximum matching therefore segments the sentence as 我/一个/人/吃饭, that is, four phrases, two of which are single characters.
Applying reverse maximum matching to the same sentence gives 我/一/个人/吃饭.
After the character string to be processed has been matched according to the forward maximum matching method, the first matching result is obtained, consisting of first phrases whose number is the first value; in the example above, the first value is 4. After the character string has been matched according to the reverse maximum matching method, the second matching result is obtained, consisting of second phrases whose number is the second value; in the example above, the second value is also 4.
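The counts discussed here can be computed directly from the two token lists. In this sketch the helper names are assumptions, and the two lists are a reconstruction of the worked example from its machine-translated form, so they should be read as illustrative rather than authoritative.

```python
def count_phrases(tokens):
    """Number of phrases in a matching result (single characters included)."""
    return len(tokens)


def count_single_chars(tokens):
    """Number of one-character phrases in a matching result."""
    return sum(1 for token in tokens if len(token) == 1)


# Assumed reconstruction of the example's two matching results.
first_result = ["我", "一个", "人", "吃饭"]    # forward maximum matching
second_result = ["我", "一", "个人", "吃饭"]   # reverse maximum matching

first_value = count_phrases(first_result)
second_value = count_phrases(second_result)
third_value = count_single_chars(first_result)
fourth_value = count_single_chars(second_result)
```

Both results contain four phrases and two single characters, so neither count alone distinguishes them.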
Step 103: judge whether the first matching result is consistent with the second matching result.
In the embodiments of the present invention, once the first matching result and the second matching result have been obtained, it can be judged whether they are consistent. Consistent here means not only that the numbers of phrases agree but also that the phrases themselves are exactly the same. For the sentence in the example above, forward maximum matching gives the first matching result 我/一个/人/吃饭 ("I / one / person / eat"), while reverse maximum matching gives the second matching result 我/一/个人/吃饭 ("I / one / individual / eat"). The first value is 4 and the second value is also 4, but although the two values are equal, the phrases obtained are not all the same, so the first matching result and the second matching result are judged to be inconsistent.
Specifically, judging whether the first matching result is consistent with the second matching result may proceed as follows:
Judge whether the first value is equal to the second value.
When the first value is not equal to the second value, there is ambiguity between the first matching result and the second matching result.
When the first value is equal to the second value, judge whether the first phrases are identical to the second phrases, that is, whether their contents agree completely. For example, if the first value is 4 and the first phrases are 我/一个/人/吃饭, while the second value is 4 and the second phrases are 我/一/个人/吃饭, the two values are equal but the contents of the phrases differ, so the first phrases and the second phrases are not identical. If instead the first value is 4 with first phrases 我/一个/人/吃饭 and the second value is 4 with second phrases 我/一个/人/吃饭, the first phrases and the second phrases are identical.
When the first phrases are identical to the second phrases, there is no ambiguity between the first matching result and the second matching result; when they are not identical, there is ambiguity between the two results.
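The two-stage check described above (compare the counts first, then the contents) might be sketched like this. The function name is an assumption, and the sample lists are a reconstruction of the worked example rather than data quoted from the source.

```python
def results_consistent(first_result, second_result):
    """Return True only if both results have the same phrases in the same order."""
    # Stage 1: different phrase counts already indicate ambiguity.
    if len(first_result) != len(second_result):
        return False
    # Stage 2: equal counts still require identical phrase contents.
    return all(a == b for a, b in zip(first_result, second_result))


# Assumed reconstruction of the example's matching results.
forward_result = ["我", "一个", "人", "吃饭"]
reverse_result = ["我", "一", "个人", "吃饭"]
is_consistent = results_consistent(forward_result, reverse_result)
```

In Python a plain `first_result == second_result` gives the same answer; the explicit form is kept only to mirror the two judgments in the text.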
Preferably, in the embodiments of the present invention, the general dictionary containing the domain-specific dictionary may be loaded before step 101, and the domain-specific dictionary may be classified before the general dictionary is loaded. Then, after it has been judged whether the first matching result is consistent with the second matching result, the phrases in the first matching result or the second matching result can each be matched, by category, against the phrases in the classified domain-specific dictionary. The matching result to be output is known once the consistency judgment has been made: if it is the first matching result, the phrases of the first matching result are matched by category against the classified domain-specific dictionary; if it is the second matching result, the phrases of the second matching result are matched in the same way.
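One way to picture matching by category is a mapping from category name to word set, as in the sketch below. The data layout, the category names, the sample entries, and the function name `tag_by_category` are all assumptions; the source does not describe how the classified dictionary is represented.

```python
def tag_by_category(tokens, classified_dictionary):
    """Pair each phrase with the first category whose word set contains it."""
    tagged = []
    for token in tokens:
        category = next(
            (name for name, words in classified_dictionary.items() if token in words),
            None,  # the phrase belongs to no category
        )
        tagged.append((token, category))
    return tagged


# Hypothetical classified film and television dictionary.
classified = {
    "actor": {"成龙"},
    "genre": {"喜剧"},
}
tagged_tokens = tag_by_category(["成龙", "的", "喜剧"], classified)
```

A search component could then treat a phrase tagged "actor" differently from one tagged "genre", which is presumably the benefit of classifying the dictionary beforehand.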
Step 104: when the results are consistent, output the first matching result or the second matching result as the word segmentation result.
If the first matching result is judged to be consistent with the second matching result, that is, the first value is equal to the second value and the contents of the first phrases and the second phrases are identical, the first matching result or the second matching result can be output as the word segmentation result.
In the embodiments of the present invention, if the first matching result is judged to be inconsistent with the second matching result, ambiguity elimination can be performed on the two results, and the first matching result or the second matching result that remains after ambiguity elimination is output as the word segmentation result.
In the embodiment of the present invention, the process that ambiguity is eliminated can be as follows:
First can judge described first numerical value and described second value whether unequal, if judge determine described first numerical value and described second value unequal, then can continue to judge whether described first numerical value is greater than described second value, if judge to determine that described first numerical value is greater than described second value, what then can determine needs output is a described second value phrase, namely according to the phrase that reverse maximum matching method obtains, if and judgement determines that described first numerical value is less than described second value, what then can determine needs output is a described first numerical value phrase, namely according to the phrase that Forward Maximum Method method obtains.
If it is determined that the first number is equal to the second number, further judgments may be made. For example, the first number of phrases may include a third number of single characters and the second number of phrases may include a fourth number of single characters, and it may be judged whether the third number and the fourth number are unequal. If they are unequal, it may be judged whether the third number is greater than the fourth number. If the third number is greater than the fourth number, it may be determined that the second number of phrases are to be output, that is, the phrases obtained by the backward maximum matching method are output. If the third number is less than the fourth number, it may be determined that the first number of phrases are to be output, that is, the phrases obtained by the forward maximum matching method are output. Here, the third number is the number of single characters included in the first matching result and is determined from the first matching result, and the fourth number is the number of single characters included in the second matching result and is determined from the second matching result.
If it is determined that the first number is equal to the second number and the third number is also equal to the fourth number, it may be determined that the first number of phrases are to be output, that is, the phrases obtained by the forward maximum matching method are output.
That is, in the embodiment of the present invention, if the first number corresponding to the first matching result differs from the second number corresponding to the second matching result, the result with fewer phrases is output; if the first number equals the second number but the third number differs from the fourth number, the result with fewer single characters is output. This processing is adopted in the embodiment of the present invention mainly to improve the accuracy of disambiguation.
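The disambiguation rules above can be sketched as follows. This is a minimal illustration under stated assumptions: matching results are lists of strings, a "single character" is taken to mean a phrase of length 1, and the names `count_single_characters` and `disambiguate` are hypothetical. The sample phrases are illustrative data and do not come from the embodiment.

```python
def count_single_characters(result):
    """Count the phrases in a result that consist of exactly one character."""
    return sum(1 for phrase in result if len(phrase) == 1)


def disambiguate(first_result, second_result):
    """Choose between the forward (first) and backward (second) results."""
    first_number, second_number = len(first_result), len(second_result)
    # Rule 1: when the phrase counts differ, output the result with fewer phrases.
    if first_number != second_number:
        return first_result if first_number < second_number else second_result
    # Rule 2: when the phrase counts are equal, compare single-character counts.
    third_number = count_single_characters(first_result)
    fourth_number = count_single_characters(second_result)
    if third_number > fourth_number:
        return second_result
    # Rule 3: fewer single characters, or a full tie, keeps the forward result.
    return first_result


forward = ["研究生", "命", "起源"]   # 3 phrases, 1 single character
backward = ["研究", "生命", "起源"]  # 3 phrases, 0 single characters
print(disambiguate(forward, backward))  # ['研究', '生命', '起源']
```

In this example both results contain three phrases, so the choice falls to the second rule, and the backward result is selected because it contains no single characters.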
In the embodiment of the present invention, disambiguation is performed on the first matching result and the second matching result, and the first matching result or the second matching result after disambiguation is output as the word segmenting result.
Preferably, in the embodiment of the present invention, the universal dictionary library, which includes the special dictionary library, may be loaded before step 101, and the special dictionary library may be classified before the universal dictionary library is loaded. In this way, after disambiguation is performed on the first matching result and the second matching result, the phrases included in the word segmenting result after disambiguation may each be matched, category by category, against the phrases in the classified special dictionary library. This is possible because the matching result to be output can be determined once disambiguation has been performed. If the matching result to be output is the first matching result after disambiguation, the phrases included in that result are matched against the classified special dictionary library by category; if it is the second matching result after disambiguation, the phrases included in that result are matched in the same way.
For example, if the special dictionary library of the film and television field is divided into 5 categories, namely actor names, director names, film and television titles, film and television types, and film and television terms, each phrase may be matched against each category in turn. Which category is matched first and which is matched later may be set as desired, or the order may be arbitrary.
For example, suppose the special dictionary library of the film and television field is divided into 5 categories, namely actor names, director names, film and television titles, film and television types, and film and television terms, and the matching order is set to: actor names - film and television titles - director names - film and television types - film and television terms. If one phrase included in the word segmenting result after disambiguation is "hiding", the phrase is first matched against the actor-name category, and no entry is found to match; the phrase is then matched against the film-and-television-title category, and the match succeeds. The matched word segmenting result may then be output, and the output may state explicitly that this phrase is a film and television title.
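The category-by-category matching in this example could be sketched as below. The function `match_category`, the dictionary layout (a mapping from category name to a set of entries), and every entry other than "hiding" are assumptions made purely for illustration.

```python
def match_category(phrase, classified_dictionary, order):
    """Return the first category, in the configured order, that contains the phrase."""
    for category in order:
        if phrase in classified_dictionary.get(category, set()):
            return category
    return None  # the phrase is not in any category of the special dictionary


# Hypothetical classified special dictionary for the film and television field.
classified_dictionary = {
    "actor name": {"example actor"},
    "director name": {"example director"},
    "film and television title": {"hiding"},
    "film and television type": {"spy drama"},
    "film and television term": {"trailer"},
}
# Matching order taken from the example in the text.
order = [
    "actor name",
    "film and television title",
    "director name",
    "film and television type",
    "film and television term",
]

print(match_category("hiding", classified_dictionary, order))
# film and television title
```

Returning the category name alongside the phrase is one way to let the output state that the phrase is a film and television title; the order list can be rearranged freely, as the text allows.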
In the embodiment of the present invention, before it is judged whether the first matching result is consistent with the second matching result, phrases of the first type may be deleted from both the first matching result and the second matching result according to the stop-word extension dictionary library. Because it cannot yet be determined, before the consistency judgment, whether the first matching result or the second matching result will be output, phrases of the first type are deleted from both results.
In the embodiment of the present invention, phrases of the first type may also be deleted, according to the stop-word extension dictionary library, from the matching result to be output after it is judged whether the first matching result is consistent with the second matching result, where the matching result to be output is the first matching result or the second matching result. After the consistency judgment, it can be determined which of the two results is to be output. If the matching result to be output is the first matching result, phrases of the first type are deleted from the first matching result according to the stop-word extension dictionary library, and the second matching result need not be processed; if it is the second matching result, phrases of the first type are deleted from the second matching result, and the first matching result need not be processed. This also saves steps.
In the embodiment of the present invention, a phrase of the first type may refer to a phrase that contributes nothing to understanding the meaning of the character string to be processed. For example, given a word segmenting result of the form "[modal particle]/I/do not know", the leading modal particle clearly contributes nothing to understanding the character string to be processed; when it is matched against the stop-word extension dictionary library, the match succeeds, and it can be deleted. Specifically, in the embodiment of the present invention, a phrase of the first type may be a function word, for example an auxiliary word, a conjunction, an adverb, a preposition, an interjection, an onomatopoeic word, and so on. Preferably, the types of phrases included in the stop-word extension dictionary library may vary with the field to which the character string to be processed belongs; which types are included may be determined according to actual needs, and the present invention does not limit this.
That is, in the embodiment of the present invention, each of the first number of first phrases obtained in the first matching result may be matched against the stop-word extension dictionary library, and any phrase that matches successfully is deleted; each of the second number of second phrases obtained in the second matching result may likewise be matched against the stop-word extension dictionary library, and any phrase that matches successfully is deleted.
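Deleting phrases of the first type can be sketched as a simple filter. The function name `remove_stop_words`, the set-based stop-word library, and the sample phrases are assumptions; in particular, "ah" is only a stand-in for a modal particle and is not taken from the embodiment.

```python
def remove_stop_words(result, stop_words):
    """Delete every phrase that matches an entry in the stop-word library."""
    return [phrase for phrase in result if phrase not in stop_words]


# Hypothetical stop-word extension dictionary library.
stop_words = {"ah", "and", "of"}

first_result = ["ah", "I", "do not know"]
print(remove_stop_words(first_result, stop_words))  # ['I', 'do not know']
```

The same function can be applied to both matching results before the consistency judgment, or only to the result selected for output after it.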
Referring to Fig. 2, the present invention also provides a word segmenting device. The device may include an acquisition module 201, a matching module 202, a judging module 203, and an output module 204. The device may further include a disambiguation module 205, a loading module 206, a classification module 207, and a processing module 208.
The acquisition module 201 may be configured to obtain the character string to be processed.
The matching module 202 may be configured to match the character string to be processed against the universal dictionary library according to the forward maximum matching method to obtain the first matching result, and to match the character string to be processed against the universal dictionary library according to the backward maximum matching method to obtain the second matching result.
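For readers unfamiliar with the two matching methods used by the matching module, a minimal sketch is given below. It assumes the dictionary is a set of strings and that a fixed maximum phrase length (`max_len`) bounds each lookup; both assumptions, the function names, and the sample dictionary are illustrative and are not specified by the embodiment.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, taking the longest dictionary entry at each position."""
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                result.append(piece)
                i += size
                break
    return result


def backward_max_match(text, dictionary, max_len=4):
    """Scan right to left, taking the longest dictionary entry ending at each position."""
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                result.append(piece)
                j -= size
                break
    result.reverse()  # phrases were collected from the end of the text
    return result


dictionary = {"研究", "研究生", "生命", "起源"}  # illustrative entries only
print(forward_max_match("研究生命起源", dictionary))   # ['研究生', '命', '起源']
print(backward_max_match("研究生命起源", dictionary))  # ['研究', '生命', '起源']
```

The two results differ for this input, which is exactly the situation in which the judging module reports an ambiguity.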
The matching module 202 may further be configured to match the phrases included in the first matching result or the second matching result, category by category, against the phrases in the classified special dictionary library.
The matching module 202 may further be configured to match the phrases included in the first matching result or the second matching result after disambiguation, category by category, against the phrases in the classified special dictionary library.
The judging module 203 may be configured to judge whether the first matching result is consistent with the second matching result.
The first matching result includes a first number of first phrases, and the second matching result includes a second number of second phrases; the first number is the number of first phrases included in the first matching result and is determined from the first matching result, and the second number is the number of second phrases included in the second matching result and is determined from the second matching result. The judging module 203 may specifically be configured to: judge whether the first number is equal to the second number; when the first number and the second number are unequal, indicate that an ambiguity exists between the first matching result and the second matching result; when the first number is equal to the second number, judge whether the first number of first phrases are identical to the second number of second phrases; when they are identical, indicate that no ambiguity exists between the first matching result and the second matching result; and when they are not completely identical, indicate that an ambiguity exists between the first matching result and the second matching result.
The output module 204 may be configured to output, when the two matching results are consistent, the first matching result or the second matching result as the word segmenting result.
The output module 204 may further be configured to output the first matching result or the second matching result after disambiguation as the word segmenting result.
The output module 204 may specifically be configured to: when the first number is greater than the second number, output the second number of phrases; and when the first number is less than the second number, output the first number of phrases.
The output module 204 may specifically be configured to: when the third number is greater than the fourth number, output the second number of phrases; when the third number is less than the fourth number, output the first number of phrases; and when the third number is equal to the fourth number, output the first number of phrases.
The disambiguation module 205 may be configured to perform disambiguation on the first matching result and the second matching result when the two are inconsistent, so that the first matching result or the second matching result after disambiguation is output as the word segmenting result.
The disambiguation module 205 may specifically be configured to judge, when the first number and the second number are unequal, whether the first number is greater than the second number.
The first matching result includes a third number of single characters, and the second matching result includes a fourth number of single characters; the third number is the number of single characters included in the first matching result and is determined from the first matching result, and the fourth number is the number of single characters included in the second matching result and is determined from the second matching result. The disambiguation module 205 may specifically be configured to: when the first number is equal to the second number, judge whether the third number is greater than the fourth number.
The loading module 206 may be configured to load the universal dictionary library, the universal dictionary library including the special dictionary library.
The loading module 206 may be configured to load the universal dictionary library, the universal dictionary library including the stop-word extension dictionary library.
The classification module 207 may be configured to classify the special dictionary library.
The processing module 208 may be configured to delete phrases of the first type from both the first matching result and the second matching result according to the stop-word extension dictionary library.
The processing module 208 may be configured to delete phrases of the first type from the matching result to be output according to the stop-word extension dictionary library, the matching result to be output being the first matching result or the second matching result.
In the embodiment of the present invention, a phrase of the first type may be a function word, for example an auxiliary word, a conjunction, an adverb, a preposition, an interjection, an onomatopoeic word, and so on. Preferably, the types of phrases included in the stop-word extension dictionary library may vary with the field to which the character string to be processed belongs, and which types are included may be determined according to actual needs.
The word segmenting method in the embodiment of the present invention comprises: obtaining a character string to be processed; matching the character string to be processed against a universal dictionary library according to the forward maximum matching method to obtain a first matching result, and matching the character string to be processed against the universal dictionary library according to the backward maximum matching method to obtain a second matching result; judging whether the first matching result is consistent with the second matching result; and, when they are consistent, outputting the first matching result or the second matching result as the word segmenting result.
In the embodiment of the present invention, the forward maximum matching method and the backward maximum matching method are each used to match the same character string to be processed. After matching is complete, if the two matching results are identical, the result can be output directly. Because two matching methods are applied first and the result is output only when the two matching results agree, the accuracy of word segmentation is improved. Moreover, in the embodiment of the present invention, if the matching results differ, disambiguation can be performed on them, so that the result obtained is as accurate as possible, which ensures the accuracy of word segmentation from several aspects.
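The overall flow summarized above can be sketched end to end as follows. This is an illustrative outline only: the function `segment` is hypothetical, the two matchers are passed in as callables rather than implemented here, stop words are deleted from both results before the comparison (one of the two orders the embodiment allows), and the canned matcher outputs are sample data.

```python
def count_single_characters(result):
    """Count the phrases in a result that consist of exactly one character."""
    return sum(1 for phrase in result if len(phrase) == 1)


def segment(text, forward_match, backward_match, stop_words=frozenset()):
    """Hypothetical end-to-end flow; the two matchers are supplied by the caller."""
    first = [p for p in forward_match(text) if p not in stop_words]
    second = [p for p in backward_match(text) if p not in stop_words]
    # Consistent results can be output directly.
    if first == second:
        return first
    # Otherwise prefer fewer phrases, then fewer single characters.
    # min() returns the first of equally ranked items, so a full tie
    # keeps the forward result.
    return min((first, second), key=lambda r: (len(r), count_single_characters(r)))


# Canned matcher outputs stand in for real dictionary matching.
forward = lambda text: ["研究生", "命", "起源"]
backward = lambda text: ["研究", "生命", "起源"]
print(segment("研究生命起源", forward, backward))  # ['研究', '生命', '起源']
```

Passing the matchers as parameters keeps the sketch independent of any particular dictionary structure; a real implementation would substitute forward and backward maximum matching against the loaded universal dictionary library.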
The embodiment of the present invention describes the disambiguation process in detail, so that those skilled in the art can readily implement the technical solution of the present invention from the described content; the disclosure is sufficient. Moreover, adopting the disambiguation method of the embodiment of the present invention can improve the accuracy of word segmentation.
The embodiment of the present invention specially constructs a special dictionary library, against which the word segmenting result can be matched, making the output word segmenting result more targeted. The special dictionary library may be a special dictionary library for any field, so that the word segmenting device in the embodiment of the present invention can better segment character strings to be processed from each field. For example, if the special dictionary library is that of the film and television field, the word segmenting device can be better applied in video search.
The embodiment of the present invention also specially constructs a stop-word extension dictionary library, so that meaningless phrases can be deleted before the matching result is output. This does not affect the output word segmenting result, and it reduces subsequent operations and saves steps.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (10)

1. A word segmenting method, characterized by comprising the following steps:
obtaining a character string to be processed;
matching the character string to be processed against a universal dictionary library according to a forward maximum matching method and a backward maximum matching method respectively, to obtain two matching results;
judging whether the numbers of phrases in the two matching results are equal;
if the numbers of phrases are equal, judging whether the numbers of single characters in the two matching results are equal;
if the numbers of single characters are equal, outputting a matching result of the character string.
2. The method of claim 1, characterized in that, after judging whether the numbers of phrases in the two matching results are equal, the method further comprises:
if the numbers of phrases are unequal, outputting the matching result with the smaller number of phrases.
3. The method of claim 1, characterized in that, after judging whether the numbers of single characters in the two matching results are equal, the method further comprises:
if the numbers of single characters are unequal, outputting the matching result with the smaller number of single characters.
4. The method of claim 1, characterized by further comprising, before obtaining the character string to be processed, the step of: loading the universal dictionary library, the universal dictionary library including a special dictionary library.
5. The method of claim 4, characterized by further comprising, before loading the universal dictionary library, the step of: classifying the special dictionary library.
6. The method of claim 5, characterized in that judging whether the numbers of phrases in the two matching results are equal further comprises the step of: if the numbers of phrases are equal, matching the phrases included in the two matching results, category by category, against the phrases in the classified special dictionary library.
7. The method of claim 5, characterized in that, when the numbers of phrases in the two matching results are judged to be unequal, the method further comprises the step of: matching the phrases included in the two matching results, category by category, against the phrases in the classified special dictionary library.
8. The method of claim 1, characterized by further comprising, before obtaining the character string to be processed, the step of: loading the universal dictionary library, the universal dictionary library including a stop-word extension dictionary library.
9. A word segmenting device, characterized by comprising:
an acquisition module, configured to obtain a character string to be processed;
a matching module, configured to match the character string to be processed against a universal dictionary library according to a forward maximum matching method and a backward maximum matching method respectively, to obtain two matching results;
a judging module, configured to judge whether the numbers of phrases in the two matching results are equal; if the numbers of phrases are equal, to judge whether the numbers of single characters in the two matching results are equal; and if the numbers of single characters are equal, to output a matching result of the character string.
10. The device of claim 9, characterized in that the judging module is further specifically configured to:
if the numbers of phrases are unequal, output the matching result with the smaller number of phrases.
CN201510179584.8A 2012-10-23 2012-10-23 Word segmenting method and device Pending CN104765838A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210407529.6A CN102915299B (en) 2012-10-23 2012-10-23 Word segmentation method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201210407529.6A Division CN102915299B (en) 2012-10-23 2012-10-23 Word segmentation method and device

Publications (1)

Publication Number Publication Date
CN104765838A true CN104765838A (en) 2015-07-08

Family

ID=47613670

Family Applications (3)

Application Number Title Priority Date Filing Date
CN201510179858.3A Pending CN104765724A (en) 2012-10-23 2012-10-23 Word segmenting method and device
CN201510179584.8A Pending CN104765838A (en) 2012-10-23 2012-10-23 Word segmenting method and device
CN201210407529.6A Active CN102915299B (en) 2012-10-23 2012-10-23 Word segmentation method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201510179858.3A Pending CN104765724A (en) 2012-10-23 2012-10-23 Word segmenting method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201210407529.6A Active CN102915299B (en) 2012-10-23 2012-10-23 Word segmentation method and device

Country Status (1)

Country Link
CN (3) CN104765724A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550170A (en) * 2015-12-14 2016-05-04 北京锐安科技有限公司 Chinese word segmentation method and apparatus
CN113221552A (en) * 2021-06-02 2021-08-06 浙江百应科技有限公司 Multi-model word segmentation method and device based on deep learning and electronic equipment
CN113342989A (en) * 2021-05-24 2021-09-03 北京航空航天大学 Knowledge graph construction method and device of patent data, storage medium and terminal

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544309B (en) * 2013-11-04 2017-03-15 北京中搜网络技术股份有限公司 A kind of retrieval string method for splitting of Chinese vertical search
CN103593338B (en) * 2013-11-15 2016-05-11 北京锐安科技有限公司 A kind of information processing method and device
CN104077275A (en) * 2014-06-27 2014-10-01 北京奇虎科技有限公司 Method and device for performing word segmentation based on context
CN105630807B (en) * 2014-10-31 2020-02-07 高德软件有限公司 Method and device for analyzing incidence relation between unknown road and known road
CN104461056B (en) * 2014-12-22 2018-06-01 联想(北京)有限公司 A kind of information processing method and electronic equipment
CN105138514B (en) * 2015-08-24 2018-11-09 昆明理工大学 It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN105243055B (en) * 2015-09-28 2018-07-31 北京橙鑫数据科技有限公司 Based on multilingual segmenting method and device
CN105335488A (en) * 2015-10-16 2016-02-17 中国南方电网有限责任公司电网技术研究中心 Knowledge base construction method
CN106649251B (en) * 2015-10-30 2019-07-09 北京国双科技有限公司 A kind of method and device of Chinese word segmentation
CN106202040A (en) * 2016-06-28 2016-12-07 邓力 A kind of Chinese word cutting method of PDA translation system
CN107622044A (en) * 2016-07-13 2018-01-23 阿里巴巴集团控股有限公司 Segmenting method, device and the equipment of character string
CN107092590A (en) * 2017-03-17 2017-08-25 贵州恒昊软件科技有限公司 A kind of sentence segmenting method and system
CN107680689A (en) * 2017-05-05 2018-02-09 平安科技(深圳)有限公司 Potential disease estimating method, system and the readable storage medium storing program for executing of medical text
CN107220300B (en) * 2017-05-05 2018-07-20 平安科技(深圳)有限公司 Information mining method, electronic device and readable storage medium storing program for executing
CN108009153A (en) * 2017-12-08 2018-05-08 北京明朝万达科技股份有限公司 A kind of searching method and system based on search statement cutting word result
CN110222335A (en) * 2019-05-20 2019-09-10 平安科技(深圳)有限公司 A kind of text segmenting method and device
CN112215010A (en) * 2019-07-10 2021-01-12 北京猎户星空科技有限公司 Semantic recognition method and equipment
CN113302683B (en) * 2019-12-24 2023-08-04 深圳市优必选科技股份有限公司 Multi-tone word prediction method, disambiguation method, device, apparatus, and computer-readable storage medium
CN112287108B (en) * 2020-10-29 2022-08-16 四川长虹电器股份有限公司 Intention recognition optimization method in field of Internet of things

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042692B (en) * 2006-03-24 2010-09-22 Fujitsu Ltd Translation obtaining method and apparatus based on semantic prediction
CN101122900A (en) * 2007-09-25 2008-02-13 ZTE Corp Word segmentation system and method
CN102394061B (en) * 2011-11-08 2013-01-02 China Agricultural University Text-to-speech method and system based on semantic retrieval

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
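Zhang Junlin: "A Practical Tutorial on Search Engine Design (1): Baidu as an Example", CSDN Blog *
Zhang Xu: "A Chinese Word Segmentation Algorithm Based on Dictionary and Statistics", China Master's Theses Full-text Database *
Zhu Ming: "Introduction to Data Mining", 31 January 2012 *
Luo Jie et al.: "A Fast Text Classification System Based on a New Keyword Extraction Method", Application Research of Computers *
Mai Fanjin et al.: "Research on Chinese Word Segmentation Technology Based on the Bidirectional Matching Method and a Feature Selection Algorithm", Journal of Kunming University of Science and Technology *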

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550170A (en) * 2015-12-14 2016-05-04 Beijing Ruian Technology Co Ltd Chinese word segmentation method and apparatus
CN105550170B (en) * 2015-12-14 2018-10-12 Beijing Ruian Technology Co Ltd Chinese word segmentation method and apparatus
CN113342989A (en) * 2021-05-24 2021-09-03 Beihang University Knowledge graph construction method and device of patent data, storage medium and terminal
CN113342989B (en) * 2021-05-24 2022-12-20 Beihang University Knowledge graph construction method and device of patent data, storage medium and terminal
CN113221552A (en) * 2021-06-02 2021-08-06 Zhejiang Baiying Technology Co Ltd Multi-model word segmentation method and device based on deep learning and electronic equipment

Also Published As

Publication number Publication date
CN102915299B (en) 2015-04-08
CN104765724A (en) 2015-07-08
CN102915299A (en) 2013-02-06

Similar Documents

Publication Publication Date Title
CN102915299B (en) Word segmentation method and device
US6826576B2 (en) Very-large-scale automatic categorizer for web content
KR100721406B1 (en) Product searching system and method using search logic according to each category
US10445359B2 (en) Method and system for classifying media content
US8126897B2 (en) Unified inverted index for video passage retrieval
AU2018349276A1 (en) Methods and system for semantic search in large databases
CN102339294B (en) Searching method and system for preprocessing keywords
CN107784110B (en) Index establishing method and device
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN106777261A (en) Data query method and device based on multi-source heterogeneous data set
CN105468584A (en) Filtering method and system for bad literal information in text
CN102789464A (en) Natural language processing method, device and system based on semanteme recognition
Ye et al. Unknown Chinese word extraction based on variety of overlapping strings
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
US20040122660A1 (en) Creating taxonomies and training data in multiple languages
JP7395377B2 (en) Content search methods, devices, equipment, and storage media
Celikyilmaz et al. Leveraging web query logs to learn user intent via bayesian latent variable model
CN102314464B (en) Lyrics searching method and lyrics searching engine
Watrin et al. An N-gram frequency database reference to handle MWE extraction in NLP applications
CN105574004A (en) Webpage deduplication method and device
JP3090233B2 (en) A method for identifying associations between complex information
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
CN112818645A (en) Chemical information extraction method, device, equipment and storage medium
CN107463549B (en) Method and equipment for extracting instance template
RU2266560C1 (en) Method utilized to search for information in poly-topic arrays of unorganized texts

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150708