CN102929870A - Method for establishing word segmentation model, word segmentation method and devices using methods - Google Patents


Info

Publication number
CN102929870A
CN102929870A (also published as CN102929870B; application numbers CN2011102238434A, CN201110223843A)
Authority
CN
China
Prior art keywords
entry
speech
dictionary
word
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102238434A
Other languages
Chinese (zh)
Other versions
CN102929870B (en)
Inventor
何径舟
吴中勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110223843.4A
Publication of CN102929870A
Application granted
Publication of CN102929870B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method for establishing a word segmentation model, a word segmentation method, a device for establishing the word segmentation model, and a word segmentation device. The method for establishing the word segmentation model comprises the following steps: A1, annotating each entry of a training corpus with its part of speech; B1, determining the word class of each entry under its corresponding part of speech; C1, using the annotated training corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes; and D1, obtaining a base dictionary from the generation probabilities of the entries under their word classes, obtaining a transfer dictionary from the transition probabilities between word classes, and adding the base dictionary and the transfer dictionary to the word segmentation model. When the model is used for word segmentation, segmentation accuracy is improved, and part-of-speech tagging is completed at the same time as segmentation.

Description

Method for establishing a word segmentation model, word segmentation method, and devices using the same
[technical field]
The present invention relates to the field of natural language processing, and in particular to a method for establishing a word segmentation model, a word segmentation method, and devices using the same.
[background technology]
With the widespread use of the Internet, more and more text and information are transmitted over it. To retrieve and mine valuable content from these texts and this information, natural language processing is indispensable, and word segmentation is a fundamental task in natural language processing.
In the prior art, word segmentation is mainly rule-based or statistics-based. Rule-based segmentation includes forward maximum matching, backward maximum matching, bidirectional maximum matching, shortest-path segmentation, segmentation based on rule sets, and so on. It is fast, but it handles ambiguity poorly, and segmentation and part-of-speech tagging can only be performed sequentially: first segment, then tag. Statistics-based segmentation uses the co-occurrence probabilities of words in a language model as the basis for segmentation. For example, "the living standards of the people" could be segmented as "people | life | level" or as "person | people's livelihood | running water | flat", but a language model shows that the co-occurrence probabilities of "people" with "life" and of "life" with "level" are far higher than those of "person" with "people's livelihood" or of "people's livelihood" with "running water", so "people | life | level" is finally taken as the correct segmentation. Because statistics-based segmentation usually adopts an n-gram language model, taking the probabilities of single words and of word co-occurrences in a large-scale corpus as the basis for segmentation, the amount of computation becomes very large when the dictionary is large. Moreover, under this approach, segmentation and part-of-speech tagging are still completed in two steps, even though the part of speech can in fact confirm the segmentation: under different parts of speech, different segmentation results may appear.
Therefore, among statistics-based methods there is an improved approach that reduces the co-occurrence probabilities of words to the co-occurrence probabilities of parts of speech. Since parts of speech have far lower dimensionality than words, the amount of computation is greatly reduced, and because parts of speech are taken into account, part-of-speech tagging is completed during segmentation. However, in this approach — taking the classification of Chinese words in the Peking University part-of-speech system as an example — there are only some 40 parts of speech, and reducing the relations among tens of thousands of words to the relations among some 40 parts of speech inevitably causes a significant loss of information, which affects segmentation accuracy.
[summary of the invention]
The technical problem to be solved by the present invention is to provide a method for establishing a word segmentation model, a word segmentation method, and devices using the same, so as to overcome the prior-art defect that replacing the relations between words with the relations between parts of speech causes a significant loss of information during segmentation and thus reduces segmentation accuracy.
The technical solution adopted by the present invention is a method for establishing a word segmentation model, comprising: A1, annotating each entry of a corpus with its part of speech; B1, determining the word class of each entry under its corresponding part of speech, where a word class is a finer-grained category obtained under a part of speech; C1, using the annotated corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes, where the generation probability of an entry under a word class is the probability that the entry occurs in the corpus with that word class, and the transition probability between two word classes is the probability that the latter class appears immediately after the former in the corpus; and D1, obtaining a base dictionary from the generation probabilities of the entries under their word classes, obtaining a transfer dictionary from the transition probabilities between word classes, and adding the base dictionary and the transfer dictionary to the word segmentation model.
According to a preferred embodiment of the present invention, the entries comprise basic entries or dictionary entries, where the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
According to a preferred embodiment of the present invention, step B1 comprises mode S1 below, or a combination of S1 and S2 in which S2 takes priority over S1: S1, clustering the entries that share a part of speech according to their clustering features, and taking the cluster to which each entry belongs as its word class under the corresponding part of speech; S2, counting the frequency of each entry under its corresponding part of speech in a large-scale corpus, and assigning each entry whose frequency exceeds a set threshold a class of its own as that entry's word class under the corresponding part of speech.
According to a preferred embodiment of the present invention, the clustering features comprise the contextual features of an entry in the large-scale corpus, the positional features of the entry, the gloss features of the entry, the synonym-relation features of the entry, or the structural-information features of the entry.
According to a preferred embodiment of the present invention, the method further comprises: D11, annotating, in the word segmentation model, the pronunciation of the entries in the base dictionary.
According to a preferred embodiment of the present invention, the method further comprises: D12, using the annotated corpus to count the generation probability of each word bit (character position) under the corresponding word class to obtain a word-bit dictionary, and adding the word-bit dictionary to the word segmentation model, where the generation probability of a word bit under a word class is the probability that a character appears at a given position within entries of that word class in the corpus.
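As an illustration of step D12, the sketch below estimates word-bit generation probabilities from a toy list of class-tagged entries. The B/M/E/S position scheme (begin/middle/end/single) and the normalization by the total character count of the class are assumptions for illustration; the patent only states that the probability concerns a position within entries of the word class.

```python
from collections import Counter

def word_bit_probs(tagged_entries):
    """tagged_entries: [(entry, word_class)] -> {(char, bit, word_class): prob}."""
    bit_count = Counter()    # occurrences of (character, position bit, class)
    class_count = Counter()  # total characters seen per class
    for entry, cls in tagged_entries:
        for i, ch in enumerate(entry):
            if len(entry) == 1:
                bit = "S"            # single-character entry
            elif i == 0:
                bit = "B"            # begin
            elif i == len(entry) - 1:
                bit = "E"            # end
            else:
                bit = "M"            # middle
            bit_count[(ch, bit, cls)] += 1
            class_count[cls] += 1
    return {k: v / class_count[k[2]] for k, v in bit_count.items()}

# Toy class-tagged entries (ASCII stand-ins for Chinese characters):
probs = word_bit_probs([("ab", "noun.1"), ("abc", "noun.1"), ("a", "noun.1")])
```

Here "a" is seen twice at the beginning of an entry and once as a single-character entry, out of six character occurrences under "noun.1".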
The present invention also provides a word segmentation method, comprising: A2, obtaining an input text; B2, generating various segmentation candidates for the text using a word segmentation model established by the method described above; C2, computing the score of each segmentation candidate using the model; and D2, selecting the candidate with the highest score as the segmentation result of the input text and outputting it.
According to a preferred embodiment of the present invention, step C2 comprises: C21, looking up, in the word segmentation model, the generation probabilities and transition probabilities of all nodes of a segmentation candidate; C22, multiplying together the generation probabilities and transition probabilities of all nodes of the candidate to obtain its score.
According to a preferred embodiment of the present invention, when the model comprises only a base dictionary and a transfer dictionary, step B2 uses the base dictionary of the model to generate the segmentation candidates; in step C21, the generation probabilities of the entries under their word classes are looked up in the base dictionary, and the transition probabilities between word classes are looked up in the transfer dictionary, to obtain the probabilities of all nodes of a candidate.
According to a preferred embodiment of the present invention, when the model comprises a base dictionary, a word-bit dictionary, and a transfer dictionary, step B2 uses the base dictionary and the word-bit dictionary together to generate the segmentation candidates; in step C21, the generation probabilities of the entry nodes are looked up in the base dictionary, the generation probabilities of the word-bit nodes are looked up in the word-bit dictionary, and the transition probabilities of all nodes are looked up in the transfer dictionary.
According to a preferred embodiment of the present invention, when the candidate with the highest score contains word-bit nodes, step D2 further comprises: using the word-bit information of those nodes to determine the division of the out-of-vocabulary words in the highest-scoring candidate, where an out-of-vocabulary word is a word that does not exist in the base dictionary of the model.
According to a preferred embodiment of the present invention, a word segmentation model is established from basic entries, and dictionary entries are segmented by the above method as input text, so as to obtain the internal division of those dictionary entries that can be further split; if a model established from dictionary entries whose internal divisions are known is used to segment input text, then when the segmentation result is output, the internal divisions of the further-splittable dictionary entries in the result are output as well.
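The scoring of steps B2–D2 can be sketched end to end: enumerate the segmentations the base dictionary can cover, score each as the product of node generation probabilities and the transition probabilities between adjacent word classes (step C22), and keep the best. The tiny dictionaries and probabilities below are invented for illustration, and the exhaustive enumeration stands in for the more efficient dynamic-programming search a real implementation would use.

```python
# Base dictionary: entry -> [(word_class, generation probability)]
BASE = {
    "ab": [("noun.1", 0.2)],
    "a":  [("pronoun.1", 0.1)],
    "b":  [("verb.1", 0.3)],
    "c":  [("noun.2", 0.4)],
}
# Transfer dictionary: (preceding class, following class) -> transition probability
TRANS = {("noun.1", "noun.2"): 0.5, ("pronoun.1", "verb.1"): 0.6,
         ("verb.1", "noun.2"): 0.7}

def candidates(text):
    """Yield every segmentation of text into dictionary entries with a class each."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        for cls, p in BASE.get(word, []):
            for rest in candidates(text[end:]):
                yield [(word, cls, p)] + rest

def score(path):
    """Product of generation probabilities and adjacent-class transition probabilities."""
    s, prev = 1.0, None
    for word, cls, p in path:
        s *= p
        if prev is not None:
            s *= TRANS.get((prev, cls), 0.0)
        prev = cls
    return s

best = max(candidates("abc"), key=score)
```

With these toy numbers, "ab | c" scores 0.2 × 0.5 × 0.4 = 0.04 and beats "a | b | c", so it is selected as the segmentation result.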
The present invention also provides a device for establishing a word segmentation model, comprising: an annotation unit for annotating each entry of a corpus with its part of speech; a word-class determining unit for determining the word class of each entry under its corresponding part of speech; a statistics unit for using the annotated corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes, where the generation probability of an entry under a word class is the probability that the entry occurs in the corpus with that word class, and the transition probability between two word classes is the probability that the latter class appears immediately after the former in the corpus; and a model generation unit for obtaining a base dictionary from the generation probabilities of the entries under their word classes, obtaining a transfer dictionary from the transition probabilities between word classes, and adding the base dictionary and the transfer dictionary to the word segmentation model.
According to a preferred embodiment of the present invention, the entries comprise basic entries or dictionary entries, where the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
According to a preferred embodiment of the present invention, the word-class determining unit comprises a clustering subunit, or a combination of the clustering subunit and a word-frequency statistics subunit in which the word-frequency statistics subunit takes priority over the clustering subunit; the clustering subunit clusters the entries that share a part of speech according to their clustering features and takes the cluster to which each entry belongs as its word class under the corresponding part of speech; the word-frequency statistics subunit counts the frequency of each entry under its corresponding part of speech in a large-scale corpus and assigns each entry whose frequency exceeds a set threshold a class of its own as that entry's word class under the corresponding part of speech.
According to a preferred embodiment of the present invention, the clustering features comprise the contextual features of an entry in the large-scale corpus, the positional features of the entry, the gloss features of the entry, the synonym-relation features of the entry, or the structural-information features of the entry.
According to a preferred embodiment of the present invention, the device further comprises a pronunciation annotation unit for annotating, in the word segmentation model, the pronunciation of the entries in the base dictionary.
According to a preferred embodiment of the present invention, the device further comprises a word-bit dictionary generation subunit for using the annotated corpus to count the generation probability of each word bit under the corresponding word class to obtain a word-bit dictionary, and for adding the word-bit dictionary to the word segmentation model, where the generation probability of a word bit under a word class is the probability that a character appears at a given position within entries of that word class in the corpus.
The present invention also provides a word segmentation device, comprising: a receiving unit for obtaining an input text; a segmentation-candidate generation unit for generating various segmentation candidates for the text using a word segmentation model established by the device described above; a computation unit for computing the score of each segmentation candidate using the model; and a result determining unit for selecting the candidate with the highest score as the segmentation result of the input text and outputting it.
According to a preferred embodiment of the present invention, the computation unit comprises: a lookup subunit for looking up, in the word segmentation model, the generation probabilities and transition probabilities of all nodes of a segmentation candidate; and a score computation subunit for multiplying together the generation probabilities and transition probabilities of all nodes of the candidate to obtain its score.
According to a preferred embodiment of the present invention, when the model comprises only a base dictionary and a transfer dictionary, the segmentation-candidate generation unit uses the base dictionary of the model to generate the candidates; the lookup subunit looks up the generation probabilities of the entries under their word classes in the base dictionary, and the transition probabilities between word classes in the transfer dictionary, to obtain the probabilities of all nodes of a candidate.
According to a preferred embodiment of the present invention, when the model comprises a base dictionary, a word-bit dictionary, and a transfer dictionary, the segmentation-candidate generation unit uses the base dictionary and the word-bit dictionary together to generate the candidates; the lookup subunit looks up the generation probabilities of the entry nodes in the base dictionary, the generation probabilities of the word-bit nodes in the word-bit dictionary, and the transition probabilities of all nodes in the transfer dictionary.
According to a preferred embodiment of the present invention, the device further comprises an out-of-vocabulary word determining unit for, when the candidate with the highest score contains word-bit nodes, using the word-bit information of those nodes to determine the division of the out-of-vocabulary words in that candidate, where an out-of-vocabulary word is a word that does not exist in the base dictionary of the model.
According to a preferred embodiment of the present invention, the device first segments dictionary entries as input text using a word segmentation model established from basic entries, thereby obtaining the internal division of those dictionary entries that can be further split; if the segmentation-candidate generation unit generates candidates using a model established from dictionary entries whose internal divisions are known, then when the result determining unit outputs the segmentation result, it also outputs the internal divisions of the further-splittable dictionary entries in the result.
As can be seen from the above technical solutions, obtaining word classes under the parts of speech greatly expands the dimensionality of the categories, so that enough information is retained when the relations between word classes replace the relations between words. This improves segmentation accuracy, while part-of-speech tagging is completed during segmentation.
[description of drawings]
Fig. 1 is a schematic flow chart of an embodiment of the method for establishing a word segmentation model in the present invention;
Fig. 2 is a schematic flow chart of an embodiment of the word segmentation method in the present invention;
Fig. 3 is a schematic diagram of embodiment one of the various segmentation candidates in the present invention;
Fig. 4 is a schematic diagram of embodiment two of the various segmentation candidates in the present invention;
Fig. 5 is a structural block diagram of an embodiment of the device for establishing a word segmentation model in the present invention;
Fig. 6 is a structural block diagram of an embodiment of the word segmentation device in the present invention.
[embodiment]
To make the object, technical solution, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Please refer to Fig. 1, which is a schematic flow chart of an embodiment of the method for establishing a word segmentation model in the present invention. As shown in Fig. 1, the method comprises:
Step S101: annotate each entry of a corpus with its part of speech.
Step S102: determine the word class of each entry under its corresponding part of speech.
Step S103: use the annotated corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes.
Step S104: build a base dictionary from the generation probabilities and a transfer dictionary from the transition probabilities, and add both to the word segmentation model.
Step S105: use the annotated corpus to count the generation probability of each word bit under the corresponding word class to obtain a word-bit dictionary, and add the word-bit dictionary to the word segmentation model.
These steps are described in detail below.
In step S101, the corpus refers to various texts, for example web pages, books, periodicals, or novels. In the present invention, the parts of speech further comprise proper-noun attributes in addition to the common verb, noun, pronoun, and so on. A proper noun is the name of a specific person, place, organization, etc., and its proper-noun attribute is the proper-noun category to which it belongs. Table 1 lists some proper nouns and their corresponding attributes:
Table 1
[Table 1 is rendered as an image in the original publication; it lists example proper nouns and their attributes.]
As can be seen from Table 1, a proper-noun attribute may consist of several levels. The first level — "person name", "organization name", "place name", and so on — is the maximum-granularity division of the proper-noun attribute; finer-grained distinctions can be made below the first level. For example, "person name" can be subdivided into "Chinese", "American", etc., and "place name" into "country", "area", etc., yielding the second level; levels of still finer granularity can be derived by analogy. Annotating the corpus with each entry and its part of speech simply means marking, in the continuous text, every entry obtained by segmentation together with its part of speech. For example, the text "I love Beijing Tiananmen" becomes, after annotation, "I<pronoun> / love<verb> / Beijing<place name.area> / Tiananmen<place name.place>". Since proper nouns belong to the nouns, an entry annotated with a proper-noun attribute is thereby annotated with the noun part of speech.
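One possible textual encoding of such an annotated sentence can be parsed back into (entry, part-of-speech) pairs. The "word<tag>" notation and the "/" separator below are assumptions for illustration; the patent describes the annotation only conceptually.

```python
import re

def parse_annotated(line):
    """Split an annotated line into (entry, part_of_speech) pairs."""
    pairs = []
    for chunk in line.split("/"):
        # Each chunk is expected to look like "entry<tag>"; tags may contain dots.
        m = re.fullmatch(r"\s*(.+?)<([^>]+)>\s*", chunk)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs

annotated = "I<pronoun>/love<verb>/Beijing<place name.area>/Tiananmen<place name.place>"
tokens = parse_annotated(annotated)
```

The resulting pair list is the form the counting of step S103 consumes.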
Step S102 may be implemented by mode 1 below, or by a combination of modes 1 and 2 in which mode 2 takes priority over mode 1:
1. According to the clustering features of the entries, cluster the entries that share a part of speech, and take the cluster to which each entry belongs as its word class under the corresponding part of speech.
The clustering features may be the contextual features of the entries in a large-scale corpus. The large-scale corpus is not limited to the annotated corpus mentioned above; it may also include broader unannotated data, for example texts from various sources.
Different entries with the same part of speech differ in meaning, and when such an entry occurs, words associated with its meaning appear in its context. For example, "Beijing" and "Haidian" are both "place name.area", but the former denotes an administrative city and the latter an administrative district. This shows in usage: the former co-occurs more often with words such as "city" and "mayor", while the latter co-occurs more often with words such as "district" and "district government". By counting the contextual features of the entries in the large-scale corpus and then clustering by the similarity between these features, the different entries under a part of speech can be grouped into several classes, which form the corresponding word classes. In this embodiment, a contextual feature consists of the words that frequently co-occur with an entry within a certain context window, together with their counts, as illustrated in Table 2:
Table 2
[Table 2 is rendered as an image in the original publication; it lists contextual features of the entry "Beijing".]
Here "<city, 18776>" indicates that in the large-scale corpus, the word "city" occurred 18,776 times in the context of the entry "Beijing". Note that contextual features are not limited to the embodiment "words frequently co-occurring within a certain context window and their counts"; any other feature that can capture contextual relations falls within the scope of the present invention.
Besides the contextual features of the entries in the large-scale corpus, clustering may also use other features: the positional features of the entries, e.g., entries appearing in the same position near a certain word are clustered together; the gloss features of the entries, e.g., entries with identical glosses can be clustered together; the synonym-relation features of the entries, e.g., entries sharing a synonym are clustered together; or the structural-information features of the entries, e.g., nouns whose last character is "car" — "train", "tram", "bicycle", and so on — can be clustered together. Since the features usable for clustering cannot be exhausted, anything that can serve as a clustering feature falls within the scope of the present invention.
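Mode 1 can be sketched with context-count features and cosine similarity. The feature values, the similarity threshold, and the greedy single-pass grouping below are illustrative assumptions, not the patent's exact clustering algorithm.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_entries(features, threshold=0.5):
    """Greedily put each entry into the first cluster whose representative is similar."""
    clusters = []  # list of (features of first member, [entries])
    for entry, feats in features.items():
        for rep, members in clusters:
            if cosine(feats, rep) >= threshold:
                members.append(entry)
                break
        else:
            clusters.append((feats, [entry]))
    return [members for _, members in clusters]

# Hypothetical context counts for four "place name.area" entries:
ctx = {
    "Beijing":  {"city": 18776, "mayor": 5000},
    "Shanghai": {"city": 15000, "mayor": 4200},
    "Haidian":  {"district": 9000, "district government": 2100},
    "Chaoyang": {"district": 8000, "district government": 1900},
}
classes = cluster_entries(ctx)
```

The city-like entries group together and the district-like entries group together, matching the "Beijing" versus "Haidian" contrast discussed above.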
2. Count the frequency of each entry under its corresponding part of speech in the large-scale corpus, and assign each entry whose frequency exceeds a set threshold a class of its own as that entry's word class under the corresponding part of speech.
Take Table 3 as an example:
Table 3
If the threshold is set to 10,000, then the entries "I" and "you" under the pronoun part of speech and the entries "say" and "walk" under the verb part of speech occur more than the threshold number of times in the large-scale corpus, so each of these entries can be assigned a class of its own as its word class under the corresponding part of speech. For example, the word class of the entry "I" is "pronoun.1" and that of "you" is "pronoun.2"; the class "pronoun.1" contains only the object "I", and the class "pronoun.2" contains only the object "you".
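Mode 2 can be sketched as follows. The counts, the threshold of 10,000, and the "pos.N" naming of the singleton classes follow the discussion of Table 3; the exact data structures are assumptions.

```python
def assign_singleton_classes(freqs, threshold=10000):
    """freqs: {part_of_speech: {entry: count}} -> {(pos, entry): word_class}.

    Each entry whose count exceeds the threshold gets a class of its own.
    """
    classes = {}
    for pos, entries in freqs.items():
        next_id = 1
        for entry, count in entries.items():
            if count > threshold:
                classes[(pos, entry)] = f"{pos}.{next_id}"
                next_id += 1
    return classes

# Hypothetical frequencies in a large-scale corpus:
freqs = {"pronoun": {"I": 52000, "you": 41000, "thou": 12},
         "verb": {"say": 30000, "walk": 15000}}
classes = assign_singleton_classes(freqs)
```

Rare entries such as "thou" stay below the threshold and are left to the clustering of mode 1.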
In step S103, the generation probability of an entry under a word class is the probability that the entry occurs with that word class in the corpus, and the transition probability between word classes is the probability that the latter class appears immediately after the former in the corpus. One way to count the generation probabilities of the entries under their word classes and the transition probabilities between word classes is based on a Markov chain, computing the probabilities directly by maximum-likelihood estimation:
Generation probability of an entry under a word class = (number of times the entry occurs in the corpus with that word class) / (total number of times the word class occurs in the corpus).
Transition probability between two word classes = (number of times the two classes occur adjacently in order in the corpus) / (total number of times the preceding class occurs in the corpus).
For example, if "swimming" occurs 30 times as class "noun.5", class "noun.5" occurs 400 times in total, and "noun.5" is immediately followed by "verb.1" 200 times, then the generation probability of "swimming" under "noun.5" is P(swimming | noun.5) = 30/400, and the transition probability from "noun.5" to "verb.1" is P(verb.1 | noun.5) = 200/400.
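The two maximum-likelihood estimates can be sketched directly as counts over a class-tagged corpus. The toy tagged sentences below are invented for illustration.

```python
from collections import Counter

def estimate(tagged_corpus):
    """tagged_corpus: [[(entry, word_class), ...], ...] -> (gen, trans) probability dicts."""
    class_count = Counter()
    gen_count = Counter()    # (entry, class) occurrences
    trans_count = Counter()  # (preceding class, following class) adjacencies
    for sentence in tagged_corpus:
        for i, (entry, cls) in enumerate(sentence):
            class_count[cls] += 1
            gen_count[(entry, cls)] += 1
            if i > 0:
                trans_count[(sentence[i - 1][1], cls)] += 1
    # Generation: count(entry as class) / count(class); transition: count(c1 c2) / count(c1)
    gen = {k: v / class_count[k[1]] for k, v in gen_count.items()}
    trans = {k: v / class_count[k[0]] for k, v in trans_count.items()}
    return gen, trans

corpus = [[("I", "pronoun.1"), ("love", "verb.1"), ("swimming", "noun.5")],
          [("you", "pronoun.2"), ("love", "verb.1"), ("running", "noun.5")]]
gen, trans = estimate(corpus)
```

Here "swimming" accounts for one of the two "noun.5" occurrences, so its generation probability is 0.5, while "verb.1" is always followed by "noun.5", giving a transition probability of 1.0.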
In addition, the generation probabilities of the entries under their word classes and the transition probabilities between word classes may also be trained with a machine-learning tool based on conditional random fields (CRF). For the specific method see: Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto (2004). Applying conditional random fields to Japanese morphological analysis. In Proc. of EMNLP 2004 (hereinafter referred to as reference 1).
In step S104, the basic dictionary can be obtained from the generation probabilities of entries under their corresponding classes, and the transition dictionary from the transition probabilities between classes. The basic dictionary contains each entry, the entry's class under its corresponding part of speech, and the entry's generation probability under that class; the structure of the basic dictionary can be as shown in Table 4:
Table 4
Entry | Class (part of speech.class ID) | Generation probability of the entry under the class
Beijing | place name.region.1 | 0.0098
…… …… ……
In Table 4, the entry "Beijing" and the proper-noun attribute in its part of speech "place name.region" are obtained by the annotation of step S101, the class ID "1" is obtained in step S102, and the generation probability of the entry under its class is obtained in step S103.
The transition dictionary contains the classes and the transition probabilities between them; the structure of the transition dictionary can be as shown in Table 5:
Table 5
Class (part of speech.class ID) | Class (part of speech.class ID) | Transition probability between the classes
place name.region.1 | organization name.brand.2 | 0.0017
…… | …… | ……
The transition probabilities between classes in Table 5 are obtained in step S103.
After step S104, the method of this embodiment may further annotate a pronunciation for each entry in the basic dictionary. The structure of the basic dictionary after pronunciation annotation can be as shown in Table 6:
Table 6
Entry | Class (part of speech.class ID) | Pronunciation | Generation probability of the entry under the class
Beijing | place name.region.1 | beijing | 0.0098
Sticking | verb.2 | zhan | 0.01
Sticking | adjective.3 | nian | 0.0095
…… …… …… ……
As can be seen from Table 6, because the class of an entry in the basic dictionary is a classification under the entry's part of speech, annotating pronunciations at this level captures the pronunciation the entry takes when used under each part of speech. Consequently, when this word segmentation model is later used for segmentation, it can also output, along with the segmentation result, the pronunciation that is correct for each entry in the context of its class.
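One possible realization of such a class-keyed dictionary, sketched under assumptions: the keys, rows, and numbers below only loosely mirror Tables 4 and 6 and are not the patent's actual data structure. Because each row is keyed by (entry, class), a polyphonic entry such as "Sticking" in Table 6 stores a different pronunciation per class.

```python
# Hypothetical basic-dictionary rows keyed by (entry, class); values are
# illustrative and taken loosely from Tables 4 and 6.
basic_dictionary = {
    ("Beijing", "place name.region.1"): {"pronunciation": "beijing", "prob": 0.0098},
    ("Sticking", "verb.2"): {"pronunciation": "zhan", "prob": 0.01},
    ("Sticking", "adjective.3"): {"pronunciation": "nian", "prob": 0.0095},
}

def pronunciation(entry, cls):
    # Look up the pronunciation the entry takes when used with this class.
    return basic_dictionary[(entry, cls)]["pronunciation"]
```

Keying on the pair rather than the entry alone is what lets the segmenter emit "zhan" when the word is used as verb.2 and "nian" when it is used as adjective.3.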
In step S105, the method of the present invention may further use the annotated corpus to compute the generation probability of each word bit under its corresponding class. A word here is an individual character that forms part of an entry in the corpus, and a word bit is the position at which such a character occurs within an entry. The positions are B (begin), M (middle), E (end), and S (single), indicating respectively that the character appears at the beginning of an entry, inside it, at its end, or forms an entry by itself; for example, "day-B" indicates that the character "day" occurs at the beginning of an entry.
The generation probability of a word bit under its class is the probability that, in the corpus, a character appears at a given position within an entry of that class. For example, an occurrence at the beginning of an entry of class "noun.1" is written B-noun.1, where B denotes the beginning position; if P(day | B-noun.1) is 0.4, this means that the probability of the character "day" appearing at the beginning of an entry of class "noun.1" is 0.4.
From the annotated corpus, the positions at which each character occurs within entries of each class can be counted, and a method similar to the one introduced in step S103 then yields the generation probability of each word bit under its class. For example, based on a Markov chain, the probability can be computed directly by maximum likelihood estimation:
Generation probability of a word bit under its class = (number of times the character appears at the given position in entries of that class in the corpus) / (total number of times that class occurs in the corpus).
For example, if "day" occurs 50 times at the beginning of entries of class "noun.1" and "noun.1" occurs 500 times, then P(day | B-noun.1) = 50/500.
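The same counting scheme can be sketched for word bits. The corpus, class labels, and single-letter stand-in characters below are invented (in the real setting the characters are Chinese); only the B/M/E/S assignment and the maximum-likelihood ratio follow the text.

```python
from collections import Counter

# Toy annotated corpus of (entry, class) pairs; the entries and class
# labels are placeholders for illustration only.
corpus = [
    [("t", "noun.1")],    # one-character entry: its character gets bit S
    [("tag", "noun.1")],  # three-character entry: bits B, M, E
]

bit_count = Counter()    # (character, bit, class) -> count
class_total = Counter()  # class -> count

for sentence in corpus:
    for entry, cls in sentence:
        class_total[cls] += 1
        if len(entry) == 1:
            bit_count[(entry, "S", cls)] += 1
        else:
            for i, ch in enumerate(entry):
                bit = "B" if i == 0 else ("E" if i == len(entry) - 1 else "M")
                bit_count[(ch, bit, cls)] += 1

def word_bit_prob(ch, bit, cls):
    # count(character at this position in entries of the class) / count(class)
    return bit_count[(ch, bit, cls)] / class_total[cls]
```

Here "t" occurs once alone (bit S) and once at the beginning of a longer entry (bit B), while class "noun.1" occurs twice, so each of those word bits has probability 0.5.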
Once the generation probabilities of word bits under their classes are obtained, the corresponding word-bit dictionary can be built. The word-bit dictionary contains each character, its position within an entry together with the corresponding class, and the generation probability of the word bit under that class. A schematic structure of the word-bit dictionary is shown in Table 7:
Table 7
Character | Position-class | Generation probability of the word bit under the class
day | B-noun.1 | 0.4
peace | M-noun.1 | 0.1
door | E-noun.1 | 0.25
…… …… ……
It should be noted that the position of step S105 in this embodiment is only illustrative; step S105 may in fact also be performed before step S103 or step S104. In other embodiments, the method of the present invention may comprise only steps S101 to S104, without step S105. In this embodiment, if the entries annotated in step S101 are basic entries, the resulting model is a word segmentation model based on basic entries; if the entries annotated in step S101 are dictionary entries, the resulting model is a word segmentation model based on dictionary entries. A so-called basic entry is an entry divided at the minimum granularity only. For example, entries such as "high-speed" and "road" cannot, as words, be divided further in meaning, and therefore have minimum granularity. A dictionary entry is an entry divided at any granularity: dictionary entries include not only entries divided at minimum granularity, such as the indivisible "high-speed" and "road", but also entries such as "expressway", which as a word can in meaning be further divided into "high-speed" and "road".
Please refer to Fig. 2, which is a schematic flowchart of an embodiment of the word segmentation method of the present invention. As shown in Fig. 2, the method comprises:
S201: obtain an input text.
S202: use a word segmentation model established with the model-building method described above to generate the various segmentation results of the text.
S203: use the word segmentation model to calculate the score of each segmentation result.
S204: select the segmentation result with the highest score as the word segmentation result of the input text, and output it.
These steps are described in detail below.
In step S201, obtaining the input text means obtaining the text to be segmented; this is the prerequisite of the subsequent steps.
In step S202, the so-called various segmentation results are the several paths formed by performing all possible segmentations of the input text; each path is one segmentation result in the embodiments of the present invention. When the word segmentation model described above contains only the basic dictionary and the transition dictionary, step S202 uses the basic dictionary of the model to generate the various segmentation results. Please refer to Fig. 3, which is a schematic diagram of embodiment one of the various segmentation results in the present invention. For the input text "two this liberal arts schools", the various segmentation results shown in Fig. 3, comprising path 1 and path 2, are built from the entries in the basic dictionary.
In step S203, the score of each segmentation result generated in step S202 is calculated. One embodiment comprises the following steps:
1. Look up, in the word segmentation model, the generation probability and the transition probability of every node of the segmentation result.
2. Multiply together the generation probabilities and transition probabilities of all nodes of the segmentation result to obtain the score of the segmentation result.
Taking the various segmentation results shown in Fig. 3 as an example, where each box is a node, the scores of path 1 and path 2 are calculated as follows:
P(path 1) = p(number | BOS) × p(two | number) × p(measure word | number) × p(this | measure word) × p(noun.7 | measure word) × p(liberal arts | noun.7) × p(noun.11 | noun.7) × p(school | noun.11) × p(EOS | noun.11)
P(path 2) = p(number | BOS) × p(two | number) × p(noun.24 | number) × p(this paper | noun.24) × p(noun.5 | noun.24) × p(science | noun.5) × p(noun.morpheme | noun.5) × p(school | noun.morpheme) × p(EOS | noun.morpheme)
Here BOS and EOS denote the beginning and the end of the text: p(number | BOS) and p(EOS | noun.11) are, respectively, the probability that the text begins with a word of class "number" and the probability that it ends with a word of class "noun.11"; p(two | number) is the probability that "two" occurs given that the class is "number"; and p(measure word | number) is the probability that, given that the previous word's class is "number", the next word's class is "measure word". The probabilities of the other nodes are interpreted similarly.
The generation probability of each node above can be obtained by looking up, in the basic dictionary of the model, the generation probability of the entry under its class, and the transition probability of each node can be obtained by looking up, in the transition dictionary of the model, the transition probability between classes; the scores of path 1 and path 2 can therefore both be calculated.
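The path-scoring scheme above can be sketched as follows. All probability values here are invented for illustration; only the alternation of transition and generation probabilities between BOS and EOS follows the formulas for P(path 1) and P(path 2).

```python
# Hypothetical dictionary lookups for the nodes of path 1 in Fig. 3.
generation = {
    ("two", "number"): 0.2,
    ("this", "measure word"): 0.1,
    ("liberal arts", "noun.7"): 0.05,
    ("school", "noun.11"): 0.08,
}
transition = {
    ("BOS", "number"): 0.3,
    ("number", "measure word"): 0.4,
    ("measure word", "noun.7"): 0.2,
    ("noun.7", "noun.11"): 0.25,
    ("noun.11", "EOS"): 0.1,
}

def path_score(path):
    # path: list of (entry, class) nodes in order; multiply the transition
    # probability into each class with the generation probability of the
    # entry under that class, bracketed by BOS and EOS transitions.
    score, prev = 1.0, "BOS"
    for entry, cls in path:
        score *= transition[(prev, cls)] * generation[(entry, cls)]
        prev = cls
    return score * transition[(prev, "EOS")]
```

With real dictionaries, each candidate path would be scored this way and the highest-scoring path selected in step S204.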
In step S204, the segmentation result with the highest score is selected as the word segmentation result of the input text. For the various segmentation results shown in Fig. 3, suppose the score of path 1 is higher than that of path 2; then for the input text "two this liberal arts schools", the final word segmentation result is "two (number) / this (measure word) / liberal arts (noun.7) / school (noun.11)". As can be seen, the final result contains not only each segmented word but also a corresponding class annotation for each word.
When the word segmentation model described above contains the basic dictionary, the word-bit dictionary, and the transition dictionary, step S202 uses the basic dictionary and the word-bit dictionary together to generate the various segmentation results. Please refer to Fig. 4, which is a schematic diagram of embodiment two of the various segmentation results in the present invention. As shown in Fig. 4, a segmentation result falls into one of three cases: a path containing only entry nodes, a path containing only word-bit nodes, or a path containing both entry nodes and word-bit nodes. When the score of each segmentation result is calculated in step S203, the generation probability of an entry node is obtained by looking up, in the basic dictionary of the model, the generation probability of the entry under its class; the generation probability of a word-bit node is obtained by looking up, in the word-bit dictionary of the model, the generation probability of the word bit under its class; and the transition probabilities of all nodes are obtained by looking up, in the transition dictionary of the model, the transition probabilities between classes. When the highest-scoring segmentation result contains word-bit nodes, step S204 further uses the word-bit information of those nodes to determine the division of the unregistered words in the highest-scoring result.
An unregistered word is an entry that cannot be found in the basic dictionary of the word segmentation model. For example, suppose the highest-scoring segmentation result in Fig. 4 is the path drawn in thick lines: "Li/B-name" - "Wen/M-name" - "Jie/E-name" - "going out/verb.3". The three consecutive characters "Li Wen Jie" then form an unregistered word (the only entries findable in the basic dictionary are "Li Wen" and "Jie"), and whether this should be divided as "Li / Wen Jie" or kept whole as "Li Wenjie" is determined from the word-bit information of the word-bit nodes. Analyzing the word-bit information of the three word-bit nodes of "Li Wenjie" in the highest-scoring path above shows that "Li Wenjie" is most reasonable as a single independent segment; if instead the word-bit information of the three nodes were "Li/S-name" - "Wen/B-name" - "Jie/E-name", two unregistered words, "Li" and "Wen Jie", would be cut out.
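A minimal sketch of this division step, using romanized stand-ins for the characters of the "Li Wenjie" example; the function below only encodes the B/M/E/S grouping rule described in the text.

```python
def words_from_bits(tagged_chars):
    """tagged_chars: list of (character, bit) pairs, bit in {B, M, E, S}.

    B starts a multi-character word, M continues it, E closes it,
    and S makes a one-character word on its own.
    """
    words, buffer = [], ""
    for ch, bit in tagged_chars:
        if bit == "S":
            words.append(ch)   # a single character forms a whole word
        elif bit == "B":
            buffer = ch        # start collecting a multi-character word
        else:                  # "M" or "E": continue the current word
            buffer += ch
            if bit == "E":
                words.append(buffer)
                buffer = ""
    return words
```

With the B-M-E tagging of the example, the three characters stay together as one unregistered word; with the alternative S-B-E tagging, two words are cut out instead.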
In this embodiment, if the word segmentation model is built from basic entries, applying the word segmentation method of the present invention with dictionary entries as the input text yields the internal divisions of the further-divisible entries among the dictionary entries. As described above, basic entries are entries divided only at minimum granularity, and dictionary entries are entries divided at any granularity. If a word segmentation model is then built from the dictionary entries whose internal divisions have become known in this way, using that model to segment an input text makes it possible, when outputting the segmentation result, to further output the internal divisions of the further-divisible dictionary entries in the result. For example, if the final segmentation result is "we (pronoun) / (verb) / go up (verb) / expressway (noun) / (adverbial word)", then because the dictionary entry "expressway" in the basic dictionary of the model has an internal division, the internal division of "expressway", namely "high-speed (adjective)" and "road (noun)", is also obtained and output along with the result.
Please refer to Fig. 5, which is a schematic structural block diagram of an embodiment of the apparatus for establishing a word segmentation model in the present invention. The apparatus comprises an annotation unit 301, a class determining unit 302, a statistics unit 303, a model generation unit 304, a pronunciation annotation unit 305, and a word-bit dictionary generation unit 306.
The annotation unit 301 is used to annotate the corpus with each entry and the part of speech of each entry. The corpus consists of various texts, for example web pages, books, periodicals, or novels. In the present invention, parts of speech include not only the common verb, noun, pronoun, and so on, but also proper-noun attributes. A proper noun is the name of a specific person, place, organization, etc., and a proper-noun attribute is the proper-noun category to which such a name belongs. Annotating the corpus with each entry and its part of speech means marking, in the continuous text of the corpus, each entry obtained by segmentation together with its part of speech. For example, the text "I love Tiananmen, Beijing" becomes, after annotation, "I <pronoun> / love <verb> / Beijing <place name.region> / Tiananmen <place name.place>". Since proper nouns are nouns, annotating an entry with a proper-noun attribute amounts to annotating it with a noun part of speech.
The class determining unit 302 is used to determine the class of each entry under its corresponding part of speech. The class determining unit comprises the following subunits: a clustering subunit 3021 and a word-frequency statistics subunit 3022, with the word-frequency statistics subunit 3022 taking processing priority over the clustering subunit 3021.
The clustering subunit 3021 is used to cluster the entries that share the same part of speech according to the clustering features of each entry, and to take the cluster of each entry as the entry's class under its corresponding part of speech.
The clustering features may be the contextual features of the entries in a large-scale corpus. The large-scale corpus is not limited to the annotated corpus described above; it may also include broader unannotated data, for example texts from various sources.
Different entries under the same part of speech differ in meaning, so when such an entry occurs, words associated with its meaning appear in its context. For example, "Beijing" and "Haidian" are both "place name.region", but the former denotes an administrative city while the latter denotes an administrative district; in actual usage this shows as "Beijing" co-occurring more often with words such as "city" and "mayor", and "Haidian" with words such as "district" and "district government". By computing the contextual features of entries in the large-scale corpus and then measuring the similarity between these features, the different entries under the same part of speech can be clustered into several classes, thereby forming the corresponding classes. In this embodiment, a contextual feature consists of the several words that frequently co-occur with an entry within a certain context window, together with their counts; however, contextual features are not limited to this embodiment, and any other feature that can capture contextual relations falls within the scope of the present invention.
Besides clustering by the contextual features of entries in a large-scale corpus, clustering may also use other features: the positional features of entries, e.g. grouping the entries that appear in the same position near a certain word into one class; the gloss features of entries, e.g. grouping entries with the same gloss into one class; the synonym features of entries, e.g. grouping entries that share a synonym into one class; or the structural features of entries, e.g. grouping the nouns whose last character is "car", including "train", "electric car", and "bicycle", into one class. Since the features usable for clustering cannot be exhausted, any feature usable for clustering falls within the scope of the present invention.
The word-frequency statistics subunit 3022 is used to count, in the large-scale corpus, the frequency of each entry under its part of speech, and to assign each entry whose frequency exceeds a set threshold a class of its own as that entry's class under its corresponding part of speech.
In other embodiments, the class determining unit 302 may include only the clustering subunit 3021 and not the word-frequency statistics subunit 3022.
The statistics unit 303 is used to compute, from the annotated corpus, the generation probability of each entry under its corresponding class and the transition probabilities between the classes, where the generation probability of an entry under its class is the probability that the entry occurs with that class in the corpus, and the transition probability between classes is the probability that one class appears immediately after another, given that the preceding class has occurred in the corpus.
One way to compute the generation probability of each entry under its class and the transition probabilities between classes is to use a Markov chain and estimate the probabilities directly by maximum likelihood, that is:
Generation probability of an entry under its class = (number of times the entry occurs with that class in the corpus) / (total number of times that class occurs in the corpus);
Transition probability between two classes = (number of times the two classes occur adjacently in the corpus) / (total number of times the preceding class occurs in the corpus).
For example, if "swimming" occurs 30 times as class "noun.5", class "noun.5" occurs 400 times in total, and "noun.5" is immediately followed by "verb.1" 200 times, then the generation probability of "swimming" under "noun.5" is P(swimming | noun.5) = 30/400, and the transition probability from "noun.5" to "verb.1" is P(verb.1 | noun.5) = 200/400.
Alternatively, the generation probabilities of entries under their classes and the transition probabilities between classes can be obtained by feature training with a machine-learning tool based on conditional random field models (CRF); for the specific method see reference 1.
The model generation unit 304 is used to build the basic dictionary from the generation probabilities of the entries under their corresponding classes, to build the transition dictionary from the transition probabilities between the classes, and to add the basic dictionary and the transition dictionary to the word segmentation model.
The basic dictionary, built from the generation probabilities of entries under their classes, contains each entry, the entry's class under its corresponding part of speech, and the entry's generation probability under that class. The transition dictionary, built from the transition probabilities between classes, contains the classes and the transition probabilities between them.
The pronunciation annotation unit 305 is used to annotate pronunciations for the entries in the basic dictionary of the word segmentation model.
It should be noted that in other embodiments the pronunciation annotation unit 305 is not an indispensable unit.
The word-bit dictionary generation unit 306 is used to build the word-bit dictionary by computing, from the annotated corpus, the generation probability of each word bit under its corresponding class, and to add the word-bit dictionary to the word segmentation model, where the generation probability of a word bit under its class is the probability that, in the corpus, a character appears at a given position within an entry of that class.
A word here is an individual character that forms part of an entry in the corpus, and a word bit is the position at which such a character occurs within an entry. The positions are B (begin), M (middle), E (end), and S (single), indicating respectively that the character appears at the beginning of an entry, inside it, at its end, or forms an entry by itself; for example, "day-B" indicates that the character "day" occurs at the beginning of an entry.
The generation probability of a word bit under its class is the probability that, in the corpus, a character appears at a given position within an entry of that class. For example, an occurrence at the beginning of an entry of class "noun.1" is written B-noun.1, where B denotes the beginning position; if P(day | B-noun.1) is 0.4, this means that the probability of the character "day" appearing at the beginning of an entry of class "noun.1" is 0.4. From the annotated corpus, the positions at which each character occurs within entries of each class can be counted, and a method similar to the one introduced for the statistics unit 303 then yields the generation probability of each word bit under its class. For example, based on a Markov chain, the probability can be computed directly by maximum likelihood estimation:
Generation probability of a word bit under its class = (number of times the character appears at the given position in entries of that class in the corpus) / (total number of times that class occurs in the corpus).
For example, if "day" occurs 50 times at the beginning of entries of class "noun.1" and "noun.1" occurs 500 times, then P(day | B-noun.1) = 50/500. Once the generation probabilities of word bits under their classes are obtained, the corresponding word-bit dictionary can be built; the word-bit dictionary contains each character, its position within an entry together with the corresponding class, and the generation probability of the word bit under that class.
It should be noted that in other embodiments the word-bit dictionary generation unit 306 is not an indispensable unit.
In this embodiment, if the entries annotated by the annotation unit 301 are basic entries, the resulting model is a word segmentation model based on basic entries; if the entries annotated by the annotation unit 301 are dictionary entries, the resulting model is a word segmentation model based on dictionary entries. A so-called basic entry is an entry divided at the minimum granularity only. For example, entries such as "high-speed" and "road" cannot, as words, be divided further in meaning, and therefore have minimum granularity. A dictionary entry is an entry divided at any granularity: dictionary entries include not only entries divided at minimum granularity, such as the indivisible "high-speed" and "road", but also entries such as "expressway", which as a word can in meaning be further divided into "high-speed" and "road".
Please refer to Fig. 6, which is a schematic structural block diagram of an embodiment of the word segmentation apparatus in the present invention. As shown in Fig. 6, the apparatus comprises: a receiving unit 401, a segmentation result generation unit 402, a calculation unit 403, a result determining unit 404, and an unregistered-word determining unit 405.
The receiving unit 401 is used to obtain the input text, that is, the text to be segmented.
The segmentation result generation unit 402 is used to generate the various segmentation results of the text by using a word segmentation model established with the model-building apparatus described above.
The so-called various segmentation results are the several paths formed by performing all possible segmentations of the input text; each path is one segmentation result in the embodiments of the present invention. When the word segmentation model described above contains only the basic dictionary and the transition dictionary, the segmentation result generation unit 402 uses the basic dictionary of the model to generate the various segmentation results. Please refer to Fig. 3, which is a schematic diagram of embodiment one of the various segmentation results in the present invention. For the input text "two this liberal arts schools", the various segmentation results shown in Fig. 3, comprising path 1 and path 2, are built from the entries in the basic dictionary.
The calculation unit 403 is used to calculate the score of each segmentation result by using the word segmentation model. The calculation unit 403 comprises a lookup subunit 4031 and a score calculation subunit 4032: the lookup subunit 4031 looks up, in the word segmentation model, the generation probability and the transition probability of every node of a segmentation result, and the score calculation subunit 4032 multiplies the generation probabilities and transition probabilities of all nodes together to obtain the score of the segmentation result.
Taking the various segmentation results shown in Fig. 3 as an example, where each box is a node, the scores of path 1 and path 2 are calculated as follows:
P(path 1) = p(number | BOS) × p(two | number) × p(measure word | number) × p(this | measure word) × p(noun.7 | measure word) × p(liberal arts | noun.7) × p(noun.11 | noun.7) × p(school | noun.11) × p(EOS | noun.11)
P(path 2) = p(number | BOS) × p(two | number) × p(noun.24 | number) × p(this paper | noun.24) × p(noun.5 | noun.24) × p(science | noun.5) × p(noun.morpheme | noun.5) × p(school | noun.morpheme) × p(EOS | noun.morpheme)
Here BOS and EOS denote the beginning and the end of the text: p(number | BOS) and p(EOS | noun.11) are, respectively, the probability that the text begins with a word of class "number" and the probability that it ends with a word of class "noun.11"; p(two | number) is the probability that "two" occurs given that the class is "number"; and p(measure word | number) is the probability that, given that the previous word's class is "number", the next word's class is "measure word". The probabilities of the other nodes are interpreted similarly.
The generation probability of each node above can be obtained by looking up, in the basic dictionary of the model, the generation probability of the entry under its class, and the transition probability of each node can be obtained by looking up, in the transition dictionary of the model, the transition probability between classes; the scores of path 1 and path 2 can therefore both be calculated.
The result determining unit 404 is used to select the segmentation result with the highest score as the word segmentation result of the input text and to output it. For the word graph shown in Fig. 3, suppose the score of path 1 is higher than that of path 2; then for the input text "two this liberal arts schools", the final word segmentation result is "two (number) / this (measure word) / liberal arts (noun.7) / school (noun.11)". As can be seen, the final result contains not only each segmented word but also a corresponding class annotation for each word.
When the word segmentation model described above contains the basic dictionary, the word-bit dictionary, and the transition dictionary, the segmentation result generation unit 402 uses the basic dictionary and the word-bit dictionary together to generate the various segmentation results. Please refer to Fig. 4, which is a schematic diagram of embodiment two of the various segmentation results in the present invention. As shown in Fig. 4, a segmentation result falls into one of three cases: a path containing only entry nodes, a path containing only word-bit nodes, or a path containing both entry nodes and word-bit nodes. The lookup subunit 4031 obtains the generation probability of an entry node by looking up, in the basic dictionary of the model, the generation probability of the entry under its class; the generation probability of a word-bit node by looking up, in the word-bit dictionary of the model, the generation probability of the word bit under its class; and the transition probabilities of all nodes by looking up, in the transition dictionary of the model, the transition probabilities between classes.
The unregistered-word determining unit 405 is used, when the highest-scoring segmentation result contains word-bit nodes, to determine the division of the unregistered words in that result from the word-bit information of the word-bit nodes.
An unregistered word is an entry that cannot be found in the basic dictionary of the word segmentation model. For example, suppose the highest-scoring segmentation result in Fig. 4 is the path drawn in thick lines: "Li/B-name" - "Wen/M-name" - "Jie/E-name" - "going out/verb.3". The three consecutive characters "Li Wen Jie" then form an unregistered word (the only entries findable in the basic dictionary are "Li Wen" and "Jie"), and whether this should be divided as "Li / Wen Jie" or kept whole as "Li Wenjie" is determined from the word-bit information of the word-bit nodes. Analyzing the word-bit information of the three word-bit nodes of "Li Wenjie" in the highest-scoring path above shows that "Li Wenjie" is most reasonable as a single independent segment; if instead the word-bit information of the three nodes were "Li/S-name" - "Wen/B-name" - "Jie/E-name", two unregistered words, "Li" and "Wen Jie", would be cut out.
In the present embodiment, if basic entries are used to establish a word segmentation model and dictionary entries are fed to the receiving unit 401 as the input text, the result determining unit 404 can obtain the internal divisions of those dictionary entries that can be further divided. As described above, basic entries are entries divided only at the minimum granularity, and dictionary entries are entries divided at any granularity. If a word segmentation model is then established from dictionary entries whose internal divisions are known, the segmentation result generation unit 402 uses this model to generate the segmentation results, and the result determining unit 404 can, when outputting the word segmentation result, further output the internal divisions of the dictionary entries in the result that can be further divided. For example, suppose the final word segmentation result is "we (pronoun) / (verb) / go up (verb) / expressway (noun) / (adverb)". Because the dictionary entry "expressway" in the basic dictionary of the model has an internal division, the internal division of "expressway", namely "high-speed (adjective)" and "highway (noun)", is obtained together with the word segmentation result and output with it.
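The two-stage use of internal divisions described above can be sketched as follows. The dictionary contents and function names are illustrative assumptions, not the patent's own data structures.

```python
# Sketch: a dictionary entry that the model keeps as one token may carry a
# known internal division, recorded in advance by segmenting the dictionary
# entry itself with a model built from basic (minimum-granularity) entries.
# All entries below are invented for illustration.

internal_divisions = {
    "expressway": [("high-speed", "adjective"), ("highway", "noun")],
}

def output_result(segmented):
    """segmented: list of (word, part_of_speech) pairs. Returns the word
    segmentation result, attaching the internal division of any entry that
    can be further divided (None for entries that cannot)."""
    return [(word, pos, internal_divisions.get(word)) for word, pos in segmented]

result = output_result([("we", "pronoun"), ("go up", "verb"),
                        ("expressway", "noun")])
for word, pos, inner in result:
    print(word, pos, inner if inner else "")
```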
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (24)

1. A method for establishing a word segmentation model, characterized in that the method comprises:
A1. labeling each entry of a training corpus and the part of speech of each entry;
B1. determining the category of each entry under its corresponding part of speech;
C1. using the labeled training corpus to count the generation probability of each entry under the corresponding part of speech and the transition probabilities between the parts of speech, wherein the generation probability of an entry under the corresponding part of speech is the probability that the entry occurs with the corresponding part of speech in the training corpus, and the transition probability between parts of speech is the probability that the following part of speech occurs adjacently under the condition that the preceding part of speech has occurred in the training corpus;
D1. using the generation probability of each entry under the corresponding part of speech to obtain a basic dictionary, using the transition probabilities between the parts of speech to obtain a transfer dictionary, and adding the basic dictionary and the transfer dictionary to the word segmentation model.
2. The method according to claim 1, characterized in that the entries comprise basic entries or dictionary entries, wherein the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
3. The method according to claim 1, characterized in that step B1 comprises mode S1 below, or a combination of S1 and S2 in which S2 has a higher execution priority than S1:
S1. clustering the entries having the same part of speech according to the cluster features of each entry, and taking the cluster class to which each entry belongs as the category of that entry under the corresponding part of speech;
S2. counting, in a large-scale corpus, the word frequency of each entry under the corresponding part of speech, and assigning a class of its own to each entry whose word frequency is greater than a set threshold as the category of that entry under the corresponding part of speech.
4. The method according to claim 3, characterized in that the cluster features comprise the contextual features of an entry in the large-scale corpus, the position features of the entry, the paraphrase features of the entry, the synonym relationship features of the entry, or the structured information features of the entry.
5. The method according to claim 1, characterized in that the method further comprises:
D11. labeling pronunciations for the entries of the basic dictionary in the word segmentation model.
6. The method according to claim 1, characterized in that the method further comprises:
D12. using the labeled training corpus to count the generation probability of each word bit under the corresponding part of speech to obtain a word-bit dictionary, and adding the word-bit dictionary to the word segmentation model, wherein the generation probability of a word bit under the corresponding part of speech is the probability that a character appears at a given position within an entry having the corresponding part of speech in the training corpus.
7. A word segmentation method, characterized in that the method comprises:
A2. obtaining an input text;
B2. generating multiple candidate segmentation results for the text by using a word segmentation model established by the method according to any one of claims 1 to 6;
C2. calculating the scores of the candidate segmentation results by using the word segmentation model;
D2. selecting the segmentation result with the highest score as the word segmentation result of the input text and outputting it.
8. The method according to claim 7, characterized in that step C2 comprises:
C21. looking up, in the word segmentation model, the generation probabilities and the transition probabilities of all nodes of a candidate segmentation result;
C22. multiplying together the generation probabilities and the transition probabilities of all nodes of the candidate segmentation result to obtain the score of that candidate segmentation result.
9. The method according to claim 8, characterized in that, when the word segmentation model is established by the method according to any one of claims 1 to 5, the basic dictionary of the word segmentation model is used in step B2 to generate the candidate segmentation results; and in step C21 the generation probability of each entry under the corresponding part of speech is looked up in the basic dictionary of the word segmentation model to obtain the generation probabilities of all nodes of the candidate segmentation result, and the transition probabilities between the parts of speech are looked up in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
10. The method according to claim 8, characterized in that, when the word segmentation model is established by the method according to claim 6, the basic dictionary and the word-bit dictionary of the word segmentation model are used together in step B2 to generate the candidate segmentation results; and in step C21 the generation probability of each entry under the corresponding part of speech is looked up in the basic dictionary of the word segmentation model to obtain the generation probabilities of the entry nodes of the candidate segmentation result, the generation probability of each word bit under the corresponding part of speech is looked up in the word-bit dictionary of the word segmentation model to obtain the generation probabilities of the word-bit nodes of the candidate segmentation result, and the transition probabilities between the parts of speech are looked up in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
11. The method according to claim 10, characterized in that, when the segmentation result with the highest score comprises word-bit nodes, step D2 further comprises: using the word-bit information of the word-bit nodes to determine the division of the unregistered words in the highest-scoring segmentation result, wherein an unregistered word is a word that does not exist in the basic dictionary of the word segmentation model.
12. The method according to claim 7, characterized in that a word segmentation model is established by using basic entries, and the word segmentation method is performed with dictionary entries as the input text to obtain the internal divisions of those dictionary entries that can be further divided;
if a word segmentation model established from dictionary entries whose internal divisions are known is used to segment an input text, then, when the word segmentation result is output, the internal divisions of the dictionary entries in the word segmentation result that can be further divided are further output.
13. A device for establishing a word segmentation model, characterized in that the device comprises:
a labeling unit, configured to label each entry of a training corpus and the part of speech of each entry;
a part-of-speech determining unit, configured to determine the category of each entry under its corresponding part of speech;
a statistics unit, configured to use the labeled training corpus to count the generation probability of each entry under the corresponding part of speech and the transition probabilities between the parts of speech, wherein the generation probability of an entry under the corresponding part of speech is the probability that the entry occurs with the corresponding part of speech in the training corpus, and the transition probability between parts of speech is the probability that the following part of speech occurs adjacently under the condition that the preceding part of speech has occurred in the training corpus;
a model generation unit, configured to use the generation probability of each entry under the corresponding part of speech to obtain a basic dictionary, use the transition probabilities between the parts of speech to obtain a transfer dictionary, and add the basic dictionary and the transfer dictionary to the word segmentation model.
14. The device according to claim 13, characterized in that the entries comprise basic entries or dictionary entries, wherein the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
15. The device according to claim 13, characterized in that the part-of-speech determining unit comprises a clustering subunit, or comprises a combination of the clustering subunit and a word frequency statistics subunit in which the word frequency statistics subunit has a higher processing priority than the clustering subunit;
wherein the clustering subunit is configured to cluster the entries having the same part of speech according to the cluster features of each entry, and to take the cluster class to which each entry belongs as the category of that entry under the corresponding part of speech;
and the word frequency statistics subunit is configured to count, in a large-scale corpus, the word frequency of each entry under the corresponding part of speech, and to assign a class of its own to each entry whose word frequency is greater than a set threshold as the category of that entry under the corresponding part of speech.
16. The device according to claim 15, characterized in that the cluster features comprise the contextual features of an entry in the large-scale corpus, the position features of the entry, the paraphrase features of the entry, the synonym relationship features of the entry, or the structured information features of the entry.
17. The device according to claim 13, characterized in that the device further comprises a pronunciation labeling unit, configured to label pronunciations for the entries of the basic dictionary in the word segmentation model.
18. The device according to claim 13, characterized in that the device further comprises:
a word-bit dictionary generation subunit, configured to use the labeled training corpus to count the generation probability of each word bit under the corresponding part of speech to obtain a word-bit dictionary, and to add the word-bit dictionary to the word segmentation model, wherein the generation probability of a word bit under the corresponding part of speech is the probability that a character appears at a given position within an entry having the corresponding part of speech in the training corpus.
19. A word segmentation device, characterized in that the device comprises:
a receiving unit, configured to obtain an input text;
a segmentation result generation unit, configured to generate multiple candidate segmentation results for the text by using a word segmentation model established by the device according to any one of claims 13 to 18;
a calculation unit, configured to calculate the scores of the candidate segmentation results by using the word segmentation model;
a result determining unit, configured to select the segmentation result with the highest score as the word segmentation result of the input text and to output it.
20. The device according to claim 19, characterized in that the calculation unit comprises:
a lookup subunit, configured to look up, in the word segmentation model, the generation probabilities and the transition probabilities of all nodes of a candidate segmentation result;
a score calculation subunit, configured to multiply together the generation probabilities and the transition probabilities of all nodes of the candidate segmentation result to obtain the score of that candidate segmentation result.
21. The device according to claim 20, characterized in that, when the word segmentation model is established by the device according to any one of claims 13 to 17, the segmentation result generation unit uses the basic dictionary of the word segmentation model to generate the candidate segmentation results; and the lookup subunit looks up the generation probability of each entry under the corresponding part of speech in the basic dictionary of the word segmentation model to obtain the generation probabilities of all nodes of the candidate segmentation result, and looks up the transition probabilities between the parts of speech in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
22. The device according to claim 20, characterized in that, when the word segmentation model is established by the device according to claim 18, the segmentation result generation unit uses the basic dictionary and the word-bit dictionary of the word segmentation model together to generate the candidate segmentation results; and the lookup subunit looks up the generation probability of each entry under the corresponding part of speech in the basic dictionary of the word segmentation model to obtain the generation probabilities of the entry nodes of the candidate segmentation result, looks up the generation probability of each word bit under the corresponding part of speech in the word-bit dictionary of the word segmentation model to obtain the generation probabilities of the word-bit nodes of the candidate segmentation result, and looks up the transition probabilities between the parts of speech in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
23. The device according to claim 22, characterized in that the device further comprises an unregistered word determining unit, configured to, when the segmentation result with the highest score comprises word-bit nodes, use the word-bit information of the word-bit nodes to determine the division of the unregistered words in the highest-scoring segmentation result, wherein an unregistered word is a word that does not exist in the basic dictionary of the word segmentation model.
24. The device according to claim 19, characterized in that the device takes dictionary entries as the input text in advance and uses a word segmentation model established from basic entries to obtain the internal divisions of those dictionary entries that can be further divided;
if the segmentation result generation unit uses a word segmentation model established from dictionary entries whose internal divisions are known to generate the segmentation results, the result determining unit further outputs, when outputting the word segmentation result, the internal divisions of the dictionary entries in the word segmentation result that can be further divided.
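Steps C1 and D1 of claim 1 (and the statistics unit of claim 13) amount to maximum-likelihood counting over the labeled corpus. A minimal sketch follows; the toy corpus and all names are invented for illustration and are not part of the claimed method.

```python
from collections import Counter

def train(labeled_corpus):
    """labeled_corpus: list of sentences, each a list of (entry, pos) pairs.
    Returns (basic_dict, transfer_dict) where
      basic_dict[(entry, pos)] = P(entry | pos)  -- generation probability,
      transfer_dict[(p1, p2)]  = P(p2 | p1)      -- transition probability."""
    emit, pos_count = Counter(), Counter()
    trans, prev_count = Counter(), Counter()
    for sentence in labeled_corpus:
        for i, (entry, pos) in enumerate(sentence):
            emit[(entry, pos)] += 1
            pos_count[pos] += 1
            if i > 0:
                prev = sentence[i - 1][1]
                trans[(prev, pos)] += 1
                prev_count[prev] += 1
    basic = {k: n / pos_count[k[1]] for k, n in emit.items()}
    transfer = {k: n / prev_count[k[0]] for k, n in trans.items()}
    return basic, transfer

corpus = [[("we", "pronoun"), ("love", "verb"), ("segmentation", "noun")],
          [("they", "pronoun"), ("love", "verb"), ("models", "noun")]]
basic, transfer = train(corpus)
print(basic[("love", "verb")])        # "love" is the only verb observed
print(transfer[("pronoun", "verb")])  # every pronoun is followed by a verb
```

The resulting `basic` and `transfer` tables play the roles of the basic dictionary and transfer dictionary added to the word segmentation model in step D1.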
CN201110223843.4A 2011-08-05 2011-08-05 Method for establishing a word segmentation model, word segmentation method and devices thereof Active CN102929870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110223843.4A CN102929870B (en) Method for establishing a word segmentation model, word segmentation method and devices thereof

Publications (2)

Publication Number Publication Date
CN102929870A true CN102929870A (en) 2013-02-13
CN102929870B CN102929870B (en) 2016-06-29

Family

ID=47644671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110223843.4A Active CN102929870B (en) Method for establishing a word segmentation model, word segmentation method and devices thereof

Country Status (1)

Country Link
CN (1) CN102929870B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075251A (en) * 2007-06-18 2007-11-21 中国电子科技集团公司第五十四研究所 Method for searching file based on data excavation
CN101154226A (en) * 2006-09-27 2008-04-02 腾讯科技(深圳)有限公司 Method for adding unlisted word to word stock of input method and its character input device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHOU QIANG: "A Chinese corpus combining word segmentation and part-of-speech tagging", Computational Linguistics Research and Applications *
ZHANG JINZHU: "Research and implementation of a character-position-based Chinese word segmentation method", Wanfang Data *
DUAN HUIMING: "Construction and use of a large-scale annotated Chinese corpus", Applied Linguistics *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239355A (en) * 2013-06-21 2014-12-24 高德软件有限公司 Search-engine-oriented data processing method and device
CN104156349A (en) * 2014-03-19 2014-11-19 邓柯 Unlisted word discovering and segmenting system and method based on statistical dictionary model
CN104156349B (en) * 2014-03-19 2017-08-15 邓柯 Unlisted-word discovery and word segmentation system and method based on a statistical dictionary model
CN105261358A (en) * 2014-07-17 2016-01-20 中国科学院声学研究所 N-gram grammar model constructing method for voice identification and voice identification system
WO2016045567A1 (en) * 2014-09-22 2016-03-31 北京国双科技有限公司 Webpage data analysis method and device
US10621245B2 (en) 2014-09-22 2020-04-14 Beijing Gridsum Technology Co., Ltd. Webpage data analysis method and device
CN108124477B (en) * 2015-02-02 2021-06-15 微软技术许可有限责任公司 Improving word segmenters to process natural language based on pseudo data
CN108124477A (en) * 2015-02-02 2018-06-05 微软技术授权有限责任公司 Segmenter is improved based on pseudo- data to handle natural language
CN106776544A (en) * 2016-11-24 2017-05-31 四川无声信息技术有限公司 Character relation recognition methods and device and segmenting method
CN107291692A (en) * 2017-06-14 2017-10-24 北京百度网讯科技有限公司 Method for customizing, device, equipment and the medium of participle model based on artificial intelligence
CN107291692B (en) * 2017-06-14 2020-12-18 北京百度网讯科技有限公司 Artificial intelligence-based word segmentation model customization method, device, equipment and medium
CN109145282B (en) * 2017-06-16 2023-11-07 贵州小爱机器人科技有限公司 Sentence-breaking model training method, sentence-breaking device and computer equipment
CN109145282A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Punctuate model training method, punctuate method, apparatus and computer equipment
CN109408794A (en) * 2017-08-17 2019-03-01 阿里巴巴集团控股有限公司 A kind of frequency dictionary method for building up, segmenting method, server and client side's equipment
CN107526724A (en) * 2017-08-22 2017-12-29 北京百度网讯科技有限公司 For marking the method and device of language material
CN107992509A (en) * 2017-10-12 2018-05-04 如是科技(大连)有限公司 method and device for generating job dictionary information
CN107992509B (en) * 2017-10-12 2022-05-13 如是人力科技集团股份有限公司 Method and device for generating job dictionary information
CN109683773A (en) * 2017-10-19 2019-04-26 北京国双科技有限公司 Corpus labeling method and device
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment
CN108038103B (en) * 2017-12-18 2021-08-10 沈阳智能大数据科技有限公司 Method and device for segmenting text sequence and electronic equipment
CN108052508B (en) * 2017-12-29 2021-11-09 北京嘉和海森健康科技有限公司 Information extraction method and device
CN108052508A (en) * 2017-12-29 2018-05-18 北京嘉和美康信息技术有限公司 A kind of information extraction method and device
CN109829162B (en) * 2019-01-30 2022-04-08 新华三大数据技术有限公司 Text word segmentation method and device
CN109829162A (en) * 2019-01-30 2019-05-31 新华三大数据技术有限公司 A kind of text segmenting method and device
WO2020215456A1 (en) * 2019-04-26 2020-10-29 网宿科技股份有限公司 Text labeling method and device based on teacher forcing
CN110175273B (en) * 2019-05-22 2021-09-07 腾讯科技(深圳)有限公司 Text processing method and device, computer readable storage medium and computer equipment
CN110175273A (en) * 2019-05-22 2019-08-27 腾讯科技(深圳)有限公司 Text handling method, device, computer readable storage medium and computer equipment
CN111062211A (en) * 2019-12-27 2020-04-24 中国联合网络通信集团有限公司 Information extraction method and device, electronic equipment and storage medium
CN111523302A (en) * 2020-07-06 2020-08-11 成都晓多科技有限公司 Syntax analysis method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN102929870B (en) 2016-06-29

Similar Documents

Publication Publication Date Title
CN102929870B (en) Method for establishing a word segmentation model, word segmentation method and devices thereof
CN105718586B Word segmentation method and device
CN101706777B (en) Method and system for extracting resequencing template in machine translation
CN102214166B (en) Machine translation system and machine translation method based on syntactic analysis and hierarchical model
CN104794169A (en) Subject term extraction method and system based on sequence labeling model
CN102693279B (en) Method, device and system for fast calculating comment similarity
CN103885938A (en) Industry spelling mistake checking method based on user feedback
CN103309926A (en) Chinese and English-named entity identification method and system based on conditional random field (CRF)
CN104756100A (en) Intent estimation device and intent estimation method
CN102103594A (en) Character data recognition and processing method and device
US20110040553A1 (en) Natural language processing
Richter et al. Korektor–a system for contextual spell-checking and diacritics completion
CN104484433A (en) Book body matching method based on machine learning
CN105630770A (en) Word segmentation phonetic transcription and ligature writing method and device based on SC grammar
CN102929864B (en) Tone-character conversion method and device
Mladenović et al. Using lexical resources for irony and sarcasm classification
CN103678288A (en) Automatic proper noun translation method
Attia et al. An automatically built named entity lexicon for Arabic
CN109918664B (en) Word segmentation method and device
CN101369285B (en) Spell emendation method for query word in Chinese search engine
Zhao et al. Learning Question Paraphrases for QA from Encarta Logs.
Geyken et al. On-the-fly Generation of Dictionary Articles for the DWDS Website
Gulnazir et al. Investigating Lexical Variation and Change in Malaysian Twitter: A Conceptual Paper.
CN110162791B (en) Text keyword extraction method and system for national defense science and technology field
Sang et al. Extraction of hypernymy information from text∗

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant