CN102929870A - Method for establishing word segmentation model, word segmentation method and devices using methods - Google Patents


Info

Publication number
CN102929870A
CN102929870A (also published as CN102929870B; application numbers CN2011102238434A, CN201110223843A)
Authority
CN
China
Prior art keywords
entry
speech
dictionary
word
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011102238434A
Other languages
Chinese (zh)
Other versions
CN102929870B (en)
Inventor
何径舟
吴中勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110223843.4A
Publication of CN102929870A
Application granted
Publication of CN102929870B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method for establishing a word segmentation model, a word segmentation method, a device for establishing the word segmentation model, and a word segmentation device. The method for establishing the word segmentation model comprises the following steps: A1, annotating each entry of a training corpus with its part of speech; B1, determining the word class of each entry under its corresponding part of speech; C1, using the annotated training corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes; and D1, obtaining a base dictionary from the generation probabilities of the entries under their word classes, obtaining a transfer dictionary from the transition probabilities between word classes, and adding the base dictionary and the transfer dictionary to the word segmentation model. When the model is used for word segmentation, segmentation accuracy is improved, and part-of-speech tagging is completed at the same time as segmentation.

Description

Method for establishing a word segmentation model, word segmentation method, and devices using the same
[technical field]
The present invention relates to the field of natural language processing, and in particular to a method for establishing a word segmentation model, a word segmentation method, and devices using the same.
[background technology]
With the widespread use of the Internet, more and more text and information are transmitted over it. To retrieve and mine valuable content from these texts and this information, natural language processing is indispensable, and word segmentation is a fundamental task in natural language processing.
In the prior art, word segmentation is mainly rule-based or statistics-based. Rule-based segmentation includes forward maximum matching, backward maximum matching, bidirectional maximum matching, shortest-path segmentation, segmentation based on rule sets, and so on. It is fast, but it handles ambiguity poorly, and segmentation and part-of-speech tagging can only be performed sequentially: first segment, then tag. Statistics-based segmentation uses the co-occurrence probabilities of words in a language model as the basis for segmentation. For example, "the living standards of the people" could be segmented as "people | life | level" or as "person | people's livelihood | running water | flat", but a language model shows that the co-occurrence probabilities of "people" with "life" and of "life" with "level" are far higher than those of "person" with "people's livelihood" or of "people's livelihood" with "running water", so "people | life | level" is finally taken as the correct segmentation. Because statistics-based segmentation usually adopts an n-gram language model, taking the probabilities of single words and of word co-occurrences in a large-scale corpus as the basis for segmentation, the amount of computation becomes very large when the dictionary is large. Moreover, under this approach, segmentation and part-of-speech tagging are still completed in two steps, even though the part of speech can in fact confirm the segmentation: under different parts of speech, different segmentation results may appear.
Therefore, among statistics-based methods there is an improved approach that reduces the co-occurrence probabilities of words to the co-occurrence probabilities of parts of speech. Since parts of speech have far lower dimensionality than words, the amount of computation is greatly reduced, and because parts of speech are taken into account, part-of-speech tagging is completed during segmentation. However, in this approach — taking the classification of Chinese words in the Peking University part-of-speech system as an example — there are only some 40 parts of speech, and reducing the relations among tens of thousands of words to the relations among some 40 parts of speech inevitably causes a significant loss of information, which affects segmentation accuracy.
[summary of the invention]
The technical problem to be solved by the present invention is to provide a method for establishing a word segmentation model, a word segmentation method, and devices using the same, so as to overcome the prior-art defect that replacing the relations between words with the relations between parts of speech causes a significant loss of information during segmentation and thus reduces segmentation accuracy.
The technical solution adopted by the present invention is a method for establishing a word segmentation model, comprising: A1, annotating each entry of a corpus with its part of speech; B1, determining the word class of each entry under its corresponding part of speech, where a word class is a finer-grained category obtained under a part of speech; C1, using the annotated corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes, where the generation probability of an entry under a word class is the probability that the entry occurs in the corpus with that word class, and the transition probability between two word classes is the probability that the latter class appears immediately after the former in the corpus; and D1, obtaining a base dictionary from the generation probabilities of the entries under their word classes, obtaining a transfer dictionary from the transition probabilities between word classes, and adding the base dictionary and the transfer dictionary to the word segmentation model.
According to a preferred embodiment of the present invention, the entries comprise basic entries or dictionary entries, where the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
According to a preferred embodiment of the present invention, step B1 comprises mode S1 below, or a combination of S1 and S2 in which S2 takes priority over S1: S1, clustering the entries that share a part of speech according to their clustering features, and taking the cluster to which each entry belongs as its word class under the corresponding part of speech; S2, counting the frequency of each entry under its corresponding part of speech in a large-scale corpus, and assigning each entry whose frequency exceeds a set threshold a class of its own as that entry's word class under the corresponding part of speech.
According to a preferred embodiment of the present invention, the clustering features comprise the contextual features of an entry in the large-scale corpus, the positional features of the entry, the gloss features of the entry, the synonym-relation features of the entry, or the structural-information features of the entry.
According to a preferred embodiment of the present invention, the method further comprises: D11, annotating, in the word segmentation model, the pronunciation of the entries in the base dictionary.
According to a preferred embodiment of the present invention, the method further comprises: D12, using the annotated corpus to count the generation probability of each word bit (character position) under the corresponding word class to obtain a word-bit dictionary, and adding the word-bit dictionary to the word segmentation model, where the generation probability of a word bit under a word class is the probability that a character appears at a given position within entries of that word class in the corpus.
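As an illustration of step D12, the sketch below estimates word-bit generation probabilities from a toy list of class-tagged entries. The B/M/E/S position scheme (begin/middle/end/single) and the normalization by the total character count of the class are assumptions for illustration; the patent only states that the probability concerns a position within entries of the word class.

```python
from collections import Counter

def word_bit_probs(tagged_entries):
    """tagged_entries: [(entry, word_class)] -> {(char, bit, word_class): prob}."""
    bit_count = Counter()    # occurrences of (character, position bit, class)
    class_count = Counter()  # total characters seen per class
    for entry, cls in tagged_entries:
        for i, ch in enumerate(entry):
            if len(entry) == 1:
                bit = "S"            # single-character entry
            elif i == 0:
                bit = "B"            # begin
            elif i == len(entry) - 1:
                bit = "E"            # end
            else:
                bit = "M"            # middle
            bit_count[(ch, bit, cls)] += 1
            class_count[cls] += 1
    return {k: v / class_count[k[2]] for k, v in bit_count.items()}

# Toy class-tagged entries (ASCII stand-ins for Chinese characters):
probs = word_bit_probs([("ab", "noun.1"), ("abc", "noun.1"), ("a", "noun.1")])
```

Here "a" is seen twice at the beginning of an entry and once as a single-character entry, out of six character occurrences under "noun.1".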
The present invention also provides a word segmentation method, comprising: A2, obtaining an input text; B2, generating various segmentation candidates for the text using a word segmentation model established by the method described above; C2, computing the score of each segmentation candidate using the model; and D2, selecting the candidate with the highest score as the segmentation result of the input text and outputting it.
According to a preferred embodiment of the present invention, step C2 comprises: C21, looking up, in the word segmentation model, the generation probabilities and transition probabilities of all nodes of a segmentation candidate; C22, multiplying together the generation probabilities and transition probabilities of all nodes of the candidate to obtain its score.
According to a preferred embodiment of the present invention, when the model comprises only a base dictionary and a transfer dictionary, step B2 uses the base dictionary of the model to generate the segmentation candidates; in step C21, the generation probabilities of the entries under their word classes are looked up in the base dictionary, and the transition probabilities between word classes are looked up in the transfer dictionary, to obtain the probabilities of all nodes of a candidate.
According to a preferred embodiment of the present invention, when the model comprises a base dictionary, a word-bit dictionary, and a transfer dictionary, step B2 uses the base dictionary and the word-bit dictionary together to generate the segmentation candidates; in step C21, the generation probabilities of the entry nodes are looked up in the base dictionary, the generation probabilities of the word-bit nodes are looked up in the word-bit dictionary, and the transition probabilities of all nodes are looked up in the transfer dictionary.
According to a preferred embodiment of the present invention, when the candidate with the highest score contains word-bit nodes, step D2 further comprises: using the word-bit information of those nodes to determine the division of the out-of-vocabulary words in the highest-scoring candidate, where an out-of-vocabulary word is a word that does not exist in the base dictionary of the model.
According to a preferred embodiment of the present invention, a word segmentation model is established from basic entries, and dictionary entries are segmented by the above method as input text, so as to obtain the internal division of those dictionary entries that can be further split; if a model established from dictionary entries whose internal divisions are known is used to segment input text, then when the segmentation result is output, the internal divisions of the further-splittable dictionary entries in the result are output as well.
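The scoring of steps B2–D2 can be sketched end to end: enumerate the segmentations the base dictionary can cover, score each as the product of node generation probabilities and the transition probabilities between adjacent word classes (step C22), and keep the best. The tiny dictionaries and probabilities below are invented for illustration, and the exhaustive enumeration stands in for the more efficient dynamic-programming search a real implementation would use.

```python
# Base dictionary: entry -> [(word_class, generation probability)]
BASE = {
    "ab": [("noun.1", 0.2)],
    "a":  [("pronoun.1", 0.1)],
    "b":  [("verb.1", 0.3)],
    "c":  [("noun.2", 0.4)],
}
# Transfer dictionary: (preceding class, following class) -> transition probability
TRANS = {("noun.1", "noun.2"): 0.5, ("pronoun.1", "verb.1"): 0.6,
         ("verb.1", "noun.2"): 0.7}

def candidates(text):
    """Yield every segmentation of text into dictionary entries with a class each."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        for cls, p in BASE.get(word, []):
            for rest in candidates(text[end:]):
                yield [(word, cls, p)] + rest

def score(path):
    """Product of generation probabilities and adjacent-class transition probabilities."""
    s, prev = 1.0, None
    for word, cls, p in path:
        s *= p
        if prev is not None:
            s *= TRANS.get((prev, cls), 0.0)
        prev = cls
    return s

best = max(candidates("abc"), key=score)
```

With these toy numbers, "ab | c" scores 0.2 × 0.5 × 0.4 = 0.04 and beats "a | b | c", so it is selected as the segmentation result.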
The present invention also provides a device for establishing a word segmentation model, comprising: an annotation unit for annotating each entry of a corpus with its part of speech; a word-class determining unit for determining the word class of each entry under its corresponding part of speech; a statistics unit for using the annotated corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes, where the generation probability of an entry under a word class is the probability that the entry occurs in the corpus with that word class, and the transition probability between two word classes is the probability that the latter class appears immediately after the former in the corpus; and a model generation unit for obtaining a base dictionary from the generation probabilities of the entries under their word classes, obtaining a transfer dictionary from the transition probabilities between word classes, and adding the base dictionary and the transfer dictionary to the word segmentation model.
According to a preferred embodiment of the present invention, the entries comprise basic entries or dictionary entries, where the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
According to a preferred embodiment of the present invention, the word-class determining unit comprises a clustering subunit, or a combination of the clustering subunit and a word-frequency statistics subunit in which the word-frequency statistics subunit takes priority over the clustering subunit; the clustering subunit clusters the entries that share a part of speech according to their clustering features and takes the cluster to which each entry belongs as its word class under the corresponding part of speech; the word-frequency statistics subunit counts the frequency of each entry under its corresponding part of speech in a large-scale corpus and assigns each entry whose frequency exceeds a set threshold a class of its own as that entry's word class under the corresponding part of speech.
According to a preferred embodiment of the present invention, the clustering features comprise the contextual features of an entry in the large-scale corpus, the positional features of the entry, the gloss features of the entry, the synonym-relation features of the entry, or the structural-information features of the entry.
According to a preferred embodiment of the present invention, the device further comprises a pronunciation annotation unit for annotating, in the word segmentation model, the pronunciation of the entries in the base dictionary.
According to a preferred embodiment of the present invention, the device further comprises a word-bit dictionary generation subunit for using the annotated corpus to count the generation probability of each word bit under the corresponding word class to obtain a word-bit dictionary, and for adding the word-bit dictionary to the word segmentation model, where the generation probability of a word bit under a word class is the probability that a character appears at a given position within entries of that word class in the corpus.
The present invention also provides a word segmentation device, comprising: a receiving unit for obtaining an input text; a segmentation-candidate generation unit for generating various segmentation candidates for the text using a word segmentation model established by the device described above; a computation unit for computing the score of each segmentation candidate using the model; and a result determining unit for selecting the candidate with the highest score as the segmentation result of the input text and outputting it.
According to a preferred embodiment of the present invention, the computation unit comprises: a lookup subunit for looking up, in the word segmentation model, the generation probabilities and transition probabilities of all nodes of a segmentation candidate; and a score computation subunit for multiplying together the generation probabilities and transition probabilities of all nodes of the candidate to obtain its score.
According to a preferred embodiment of the present invention, when the model comprises only a base dictionary and a transfer dictionary, the segmentation-candidate generation unit uses the base dictionary of the model to generate the candidates; the lookup subunit looks up the generation probabilities of the entries under their word classes in the base dictionary, and the transition probabilities between word classes in the transfer dictionary, to obtain the probabilities of all nodes of a candidate.
According to a preferred embodiment of the present invention, when the model comprises a base dictionary, a word-bit dictionary, and a transfer dictionary, the segmentation-candidate generation unit uses the base dictionary and the word-bit dictionary together to generate the candidates; the lookup subunit looks up the generation probabilities of the entry nodes in the base dictionary, the generation probabilities of the word-bit nodes in the word-bit dictionary, and the transition probabilities of all nodes in the transfer dictionary.
According to a preferred embodiment of the present invention, the device further comprises an out-of-vocabulary word determining unit for, when the candidate with the highest score contains word-bit nodes, using the word-bit information of those nodes to determine the division of the out-of-vocabulary words in that candidate, where an out-of-vocabulary word is a word that does not exist in the base dictionary of the model.
According to a preferred embodiment of the present invention, the device first segments dictionary entries as input text using a word segmentation model established from basic entries, thereby obtaining the internal division of those dictionary entries that can be further split; if the segmentation-candidate generation unit generates candidates using a model established from dictionary entries whose internal divisions are known, then when the result determining unit outputs the segmentation result, it also outputs the internal divisions of the further-splittable dictionary entries in the result.
As can be seen from the above technical solutions, obtaining word classes under the parts of speech greatly expands the dimensionality of the categories, so that enough information is retained when the relations between word classes replace the relations between words. This improves segmentation accuracy, while part-of-speech tagging is completed during segmentation.
[description of drawings]
Fig. 1 is a schematic flow chart of an embodiment of the method for establishing a word segmentation model in the present invention;
Fig. 2 is a schematic flow chart of an embodiment of the word segmentation method in the present invention;
Fig. 3 is a schematic diagram of embodiment one of the various segmentation candidates in the present invention;
Fig. 4 is a schematic diagram of embodiment two of the various segmentation candidates in the present invention;
Fig. 5 is a structural block diagram of an embodiment of the device for establishing a word segmentation model in the present invention;
Fig. 6 is a structural block diagram of an embodiment of the word segmentation device in the present invention.
[embodiment]
To make the object, technical solution, and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.
Please refer to Fig. 1, which is a schematic flow chart of an embodiment of the method for establishing a word segmentation model in the present invention. As shown in Fig. 1, the method comprises:
Step S101: annotate each entry of a corpus with its part of speech.
Step S102: determine the word class of each entry under its corresponding part of speech.
Step S103: use the annotated corpus to count the generation probability of each entry under its word class and the transition probabilities between word classes.
Step S104: build a base dictionary from the generation probabilities and a transfer dictionary from the transition probabilities, and add both to the word segmentation model.
Step S105: use the annotated corpus to count the generation probability of each word bit under the corresponding word class to obtain a word-bit dictionary, and add the word-bit dictionary to the word segmentation model.
These steps are described in detail below.
In step S101, the corpus refers to various texts, for example web pages, books, periodicals, or novels. In the present invention, the parts of speech further comprise proper-noun attributes in addition to the common verb, noun, pronoun, and so on. A proper noun is the name of a specific person, place, organization, etc., and its proper-noun attribute is the proper-noun category to which it belongs. Table 1 lists some proper nouns and their corresponding attributes:
Table 1
[Table 1 is rendered as an image in the original publication; it lists example proper nouns and their attributes.]
As can be seen from Table 1, a proper-noun attribute may consist of several levels. The first level — "person name", "organization name", "place name", and so on — is the maximum-granularity division of the proper-noun attribute; finer-grained distinctions can be made below the first level. For example, "person name" can be subdivided into "Chinese", "American", etc., and "place name" into "country", "area", etc., yielding the second level; levels of still finer granularity can be derived by analogy. Annotating the corpus with each entry and its part of speech simply means marking, in the continuous text, every entry obtained by segmentation together with its part of speech. For example, the text "I love Beijing Tiananmen" becomes, after annotation, "I<pronoun> / love<verb> / Beijing<place name.area> / Tiananmen<place name.place>". Since proper nouns belong to the nouns, an entry annotated with a proper-noun attribute is thereby annotated with the noun part of speech.
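One possible textual encoding of such an annotated sentence can be parsed back into (entry, part-of-speech) pairs. The "word<tag>" notation and the "/" separator below are assumptions for illustration; the patent describes the annotation only conceptually.

```python
import re

def parse_annotated(line):
    """Split an annotated line into (entry, part_of_speech) pairs."""
    pairs = []
    for chunk in line.split("/"):
        # Each chunk is expected to look like "entry<tag>"; tags may contain dots.
        m = re.fullmatch(r"\s*(.+?)<([^>]+)>\s*", chunk)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs

annotated = "I<pronoun>/love<verb>/Beijing<place name.area>/Tiananmen<place name.place>"
tokens = parse_annotated(annotated)
```

The resulting pair list is the form the counting of step S103 consumes.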
Step S102 may be implemented by mode 1 below, or by a combination of modes 1 and 2 in which mode 2 takes priority over mode 1:
1. According to the clustering features of the entries, cluster the entries that share a part of speech, and take the cluster to which each entry belongs as its word class under the corresponding part of speech.
The clustering features may be the contextual features of the entries in a large-scale corpus. The large-scale corpus is not limited to the annotated corpus mentioned above; it may also include broader unannotated data, for example texts from various sources.
Different entries with the same part of speech differ in meaning, and when such an entry occurs, words associated with its meaning appear in its context. For example, "Beijing" and "Haidian" are both "place name.area", but the former denotes an administrative city and the latter an administrative district. This shows in usage: the former co-occurs more often with words such as "city" and "mayor", while the latter co-occurs more often with words such as "district" and "district government". By counting the contextual features of the entries in the large-scale corpus and then clustering by the similarity between these features, the different entries under a part of speech can be grouped into several classes, which form the corresponding word classes. In this embodiment, a contextual feature consists of the words that frequently co-occur with an entry within a certain context window, together with their counts, as illustrated in Table 2:
Table 2
[Table 2 is rendered as an image in the original publication; it lists contextual features of the entry "Beijing".]
Here "<city, 18776>" indicates that in the large-scale corpus, the word "city" occurred 18,776 times in the context of the entry "Beijing". Note that contextual features are not limited to the embodiment "words frequently co-occurring within a certain context window and their counts"; any other feature that can capture contextual relations falls within the scope of the present invention.
Besides the contextual features of the entries in the large-scale corpus, clustering may also use other features: the positional features of the entries, e.g., entries appearing in the same position near a certain word are clustered together; the gloss features of the entries, e.g., entries with identical glosses can be clustered together; the synonym-relation features of the entries, e.g., entries sharing a synonym are clustered together; or the structural-information features of the entries, e.g., nouns whose last character is "car" — "train", "tram", "bicycle", and so on — can be clustered together. Since the features usable for clustering cannot be exhausted, anything that can serve as a clustering feature falls within the scope of the present invention.
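Mode 1 can be sketched with context-count features and cosine similarity. The feature values, the similarity threshold, and the greedy single-pass grouping below are illustrative assumptions, not the patent's exact clustering algorithm.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_entries(features, threshold=0.5):
    """Greedily put each entry into the first cluster whose representative is similar."""
    clusters = []  # list of (features of first member, [entries])
    for entry, feats in features.items():
        for rep, members in clusters:
            if cosine(feats, rep) >= threshold:
                members.append(entry)
                break
        else:
            clusters.append((feats, [entry]))
    return [members for _, members in clusters]

# Hypothetical context counts for four "place name.area" entries:
ctx = {
    "Beijing":  {"city": 18776, "mayor": 5000},
    "Shanghai": {"city": 15000, "mayor": 4200},
    "Haidian":  {"district": 9000, "district government": 2100},
    "Chaoyang": {"district": 8000, "district government": 1900},
}
classes = cluster_entries(ctx)
```

The city-like entries group together and the district-like entries group together, matching the "Beijing" versus "Haidian" contrast discussed above.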
2. Count the frequency of each entry under its corresponding part of speech in the large-scale corpus, and assign each entry whose frequency exceeds a set threshold a class of its own as that entry's word class under the corresponding part of speech.
Take Table 3 as an example:
Table 3
If the threshold is set to 10,000, then the entries "I" and "you" under the pronoun part of speech and the entries "say" and "walk" under the verb part of speech occur more than the threshold number of times in the large-scale corpus, so each of these entries can be assigned a class of its own as its word class under the corresponding part of speech. For example, the word class of the entry "I" is "pronoun.1" and that of "you" is "pronoun.2"; the class "pronoun.1" contains only the object "I", and the class "pronoun.2" contains only the object "you".
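Mode 2 can be sketched as follows. The counts, the threshold of 10,000, and the "pos.N" naming of the singleton classes follow the discussion of Table 3; the exact data structures are assumptions.

```python
def assign_singleton_classes(freqs, threshold=10000):
    """freqs: {part_of_speech: {entry: count}} -> {(pos, entry): word_class}.

    Each entry whose count exceeds the threshold gets a class of its own.
    """
    classes = {}
    for pos, entries in freqs.items():
        next_id = 1
        for entry, count in entries.items():
            if count > threshold:
                classes[(pos, entry)] = f"{pos}.{next_id}"
                next_id += 1
    return classes

# Hypothetical frequencies in a large-scale corpus:
freqs = {"pronoun": {"I": 52000, "you": 41000, "thou": 12},
         "verb": {"say": 30000, "walk": 15000}}
classes = assign_singleton_classes(freqs)
```

Rare entries such as "thou" stay below the threshold and are left to the clustering of mode 1.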
In step S103, the generation probability of an entry under a word class is the probability that the entry occurs with that word class in the corpus, and the transition probability between word classes is the probability that the latter class appears immediately after the former in the corpus. One way to count the generation probabilities of the entries under their word classes and the transition probabilities between word classes is based on a Markov chain, computing the probabilities directly by maximum-likelihood estimation:
Generation probability of an entry under a word class = (number of times the entry occurs in the corpus with that word class) / (total number of times the word class occurs in the corpus).
Transition probability between two word classes = (number of times the two classes occur adjacently in order in the corpus) / (total number of times the preceding class occurs in the corpus).
For example, if "swimming" occurs 30 times as class "noun.5", class "noun.5" occurs 400 times in total, and "noun.5" is immediately followed by "verb.1" 200 times, then the generation probability of "swimming" under "noun.5" is P(swimming | noun.5) = 30/400, and the transition probability from "noun.5" to "verb.1" is P(verb.1 | noun.5) = 200/400.
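The two maximum-likelihood estimates can be sketched directly as counts over a class-tagged corpus. The toy tagged sentences below are invented for illustration.

```python
from collections import Counter

def estimate(tagged_corpus):
    """tagged_corpus: [[(entry, word_class), ...], ...] -> (gen, trans) probability dicts."""
    class_count = Counter()
    gen_count = Counter()    # (entry, class) occurrences
    trans_count = Counter()  # (preceding class, following class) adjacencies
    for sentence in tagged_corpus:
        for i, (entry, cls) in enumerate(sentence):
            class_count[cls] += 1
            gen_count[(entry, cls)] += 1
            if i > 0:
                trans_count[(sentence[i - 1][1], cls)] += 1
    # Generation: count(entry as class) / count(class); transition: count(c1 c2) / count(c1)
    gen = {k: v / class_count[k[1]] for k, v in gen_count.items()}
    trans = {k: v / class_count[k[0]] for k, v in trans_count.items()}
    return gen, trans

corpus = [[("I", "pronoun.1"), ("love", "verb.1"), ("swimming", "noun.5")],
          [("you", "pronoun.2"), ("love", "verb.1"), ("running", "noun.5")]]
gen, trans = estimate(corpus)
```

Here "swimming" accounts for one of the two "noun.5" occurrences, so its generation probability is 0.5, while "verb.1" is always followed by "noun.5", giving a transition probability of 1.0.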
In addition, the generation probabilities of the entries under their word classes and the transition probabilities between word classes may also be trained with a machine-learning tool based on conditional random fields (CRF). For the specific method see: Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto (2004). Applying conditional random fields to Japanese morphological analysis. In Proc. of EMNLP 2004 (hereinafter referred to as reference 1).
In step S104, the basic dictionary can be obtained from the generation probabilities of entries under their corresponding classes, and the transition dictionary from the transition probabilities between classes. The basic dictionary contains each entry, the entry's class under its corresponding part of speech, and the entry's generation probability under that class; the structure of the basic dictionary can be as shown in Table 4:
Table 4
Entry | Class (part of speech.class ID) | Generation probability of the entry under the class
Beijing | place name.region.1 | 0.0098
…… …… ……
In Table 4, the entry "Beijing" and the proper-noun attribute in its part of speech "place name.region" are obtained by the annotation of step S101, the class ID "1" is obtained in step S102, and the generation probability of the entry under its class is obtained in step S103.
The transition dictionary contains the classes and the transition probabilities between them; the structure of the transition dictionary can be as shown in Table 5:
Table 5
Class (part of speech.class ID) | Class (part of speech.class ID) | Transition probability between the classes
place name.region.1 | organization name.brand.2 | 0.0017
…… | …… | ……
The transition probabilities between classes in Table 5 are obtained in step S103.
After step S104, the method of this embodiment may further annotate a pronunciation for each entry in the basic dictionary. The structure of the basic dictionary after pronunciation annotation can be as shown in Table 6:
Table 6
Entry | Class (part of speech.class ID) | Pronunciation | Generation probability of the entry under the class
Beijing | place name.region.1 | beijing | 0.0098
Sticking | verb.2 | zhan | 0.01
Sticking | adjective.3 | nian | 0.0095
…… …… …… ……
As can be seen from Table 6, because the class of an entry in the basic dictionary is a classification under the entry's part of speech, annotating pronunciations at this level captures the pronunciation the entry takes when used under each part of speech. Consequently, when this word segmentation model is later used for segmentation, it can also output, along with the segmentation result, the pronunciation that is correct for each entry in the context of its class.
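One possible realization of such a class-keyed dictionary, sketched under assumptions: the keys, rows, and numbers below only loosely mirror Tables 4 and 6 and are not the patent's actual data structure. Because each row is keyed by (entry, class), a polyphonic entry such as "Sticking" in Table 6 stores a different pronunciation per class.

```python
# Hypothetical basic-dictionary rows keyed by (entry, class); values are
# illustrative and taken loosely from Tables 4 and 6.
basic_dictionary = {
    ("Beijing", "place name.region.1"): {"pronunciation": "beijing", "prob": 0.0098},
    ("Sticking", "verb.2"): {"pronunciation": "zhan", "prob": 0.01},
    ("Sticking", "adjective.3"): {"pronunciation": "nian", "prob": 0.0095},
}

def pronunciation(entry, cls):
    # Look up the pronunciation the entry takes when used with this class.
    return basic_dictionary[(entry, cls)]["pronunciation"]
```

Keying on the pair rather than the entry alone is what lets the segmenter emit "zhan" when the word is used as verb.2 and "nian" when it is used as adjective.3.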
In step S105, the method of the present invention may further use the annotated corpus to compute the generation probability of each word bit under its corresponding class. A word here is an individual character that forms part of an entry in the corpus, and a word bit is the position at which such a character occurs within an entry. The positions are B (begin), M (middle), E (end), and S (single), indicating respectively that the character appears at the beginning of an entry, inside it, at its end, or forms an entry by itself; for example, "day-B" indicates that the character "day" occurs at the beginning of an entry.
The generation probability of a word bit under its class is the probability that, in the corpus, a character appears at a given position within an entry of that class. For example, an occurrence at the beginning of an entry of class "noun.1" is written B-noun.1, where B denotes the beginning position; if P(day | B-noun.1) is 0.4, this means that the probability of the character "day" appearing at the beginning of an entry of class "noun.1" is 0.4.
From the annotated corpus, the positions at which each character occurs within entries of each class can be counted, and a method similar to the one introduced in step S103 then yields the generation probability of each word bit under its class. For example, based on a Markov chain, the probability can be computed directly by maximum likelihood estimation:
Generation probability of a word bit under its class = (number of times the character appears at the given position in entries of that class in the corpus) / (total number of times that class occurs in the corpus).
For example, if "day" occurs 50 times at the beginning of entries of class "noun.1" and "noun.1" occurs 500 times, then P(day | B-noun.1) = 50/500.
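The same counting scheme can be sketched for word bits. The corpus, class labels, and single-letter stand-in characters below are invented (in the real setting the characters are Chinese); only the B/M/E/S assignment and the maximum-likelihood ratio follow the text.

```python
from collections import Counter

# Toy annotated corpus of (entry, class) pairs; the entries and class
# labels are placeholders for illustration only.
corpus = [
    [("t", "noun.1")],    # one-character entry: its character gets bit S
    [("tag", "noun.1")],  # three-character entry: bits B, M, E
]

bit_count = Counter()    # (character, bit, class) -> count
class_total = Counter()  # class -> count

for sentence in corpus:
    for entry, cls in sentence:
        class_total[cls] += 1
        if len(entry) == 1:
            bit_count[(entry, "S", cls)] += 1
        else:
            for i, ch in enumerate(entry):
                bit = "B" if i == 0 else ("E" if i == len(entry) - 1 else "M")
                bit_count[(ch, bit, cls)] += 1

def word_bit_prob(ch, bit, cls):
    # count(character at this position in entries of the class) / count(class)
    return bit_count[(ch, bit, cls)] / class_total[cls]
```

Here "t" occurs once alone (bit S) and once at the beginning of a longer entry (bit B), while class "noun.1" occurs twice, so each of those word bits has probability 0.5.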
Once the generation probabilities of word bits under their classes are obtained, the corresponding word-bit dictionary can be built. The word-bit dictionary contains each character, its position within an entry together with the corresponding class, and the generation probability of the word bit under that class. A schematic structure of the word-bit dictionary is shown in Table 7:
Table 7
Character | Position-class | Generation probability of the word bit under the class
day | B-noun.1 | 0.4
peace | M-noun.1 | 0.1
door | E-noun.1 | 0.25
…… …… ……
It should be noted that the position of step S105 in this embodiment is only illustrative; step S105 may in fact also be performed before step S103 or step S104. In other embodiments, the method of the present invention may comprise only steps S101 to S104, without step S105. In this embodiment, if the entries annotated in step S101 are basic entries, the resulting model is a word segmentation model based on basic entries; if the entries annotated in step S101 are dictionary entries, the resulting model is a word segmentation model based on dictionary entries. A so-called basic entry is an entry divided at the minimum granularity only. For example, entries such as "high-speed" and "road" cannot, as words, be divided further in meaning, and therefore have minimum granularity. A dictionary entry is an entry divided at any granularity: dictionary entries include not only entries divided at minimum granularity, such as the indivisible "high-speed" and "road", but also entries such as "expressway", which as a word can in meaning be further divided into "high-speed" and "road".
Please refer to Fig. 2, which is a schematic flowchart of an embodiment of the word segmentation method of the present invention. As shown in Fig. 2, the method comprises:
S201: obtain an input text.
S202: use a word segmentation model established with the model-building method described above to generate the various segmentation results of the text.
S203: use the word segmentation model to calculate the score of each segmentation result.
S204: select the segmentation result with the highest score as the word segmentation result of the input text, and output it.
These steps are described in detail below.
In step S201, obtaining the input text means obtaining the text to be segmented; this is the prerequisite of the subsequent steps.
In step S202, the so-called various segmentation results are the several paths formed by performing all possible segmentations of the input text; each path is one segmentation result in the embodiments of the present invention. When the word segmentation model described above contains only the basic dictionary and the transition dictionary, step S202 uses the basic dictionary of the model to generate the various segmentation results. Please refer to Fig. 3, which is a schematic diagram of embodiment one of the various segmentation results in the present invention. For the input text "two this liberal arts schools", the various segmentation results shown in Fig. 3, comprising path 1 and path 2, are built from the entries in the basic dictionary.
In step S203, the score of each segmentation result generated in step S202 is calculated. One embodiment comprises the following steps:
1. Look up, in the word segmentation model, the generation probability and the transition probability of every node of the segmentation result.
2. Multiply together the generation probabilities and transition probabilities of all nodes of the segmentation result to obtain the score of the segmentation result.
Taking the various segmentation results shown in Fig. 3 as an example, where each box is a node, the scores of path 1 and path 2 are calculated as follows:
P(path 1) = p(number | BOS) × p(two | number) × p(measure word | number) × p(this | measure word) × p(noun.7 | measure word) × p(liberal arts | noun.7) × p(noun.11 | noun.7) × p(school | noun.11) × p(EOS | noun.11)
P(path 2) = p(number | BOS) × p(two | number) × p(noun.24 | number) × p(this paper | noun.24) × p(noun.5 | noun.24) × p(science | noun.5) × p(noun.morpheme | noun.5) × p(school | noun.morpheme) × p(EOS | noun.morpheme)
Here BOS and EOS denote the beginning and the end of the text: p(number | BOS) and p(EOS | noun.11) are, respectively, the probability that the text begins with a word of class "number" and the probability that it ends with a word of class "noun.11"; p(two | number) is the probability that "two" occurs given that the class is "number"; and p(measure word | number) is the probability that, given that the previous word's class is "number", the next word's class is "measure word". The probabilities of the other nodes are interpreted similarly.
The generation probability of each node above can be obtained by looking up, in the basic dictionary of the model, the generation probability of the entry under its class, and the transition probability of each node can be obtained by looking up, in the transition dictionary of the model, the transition probability between classes; the scores of path 1 and path 2 can therefore both be calculated.
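The path-scoring scheme above can be sketched as follows. All probability values here are invented for illustration; only the alternation of transition and generation probabilities between BOS and EOS follows the formulas for P(path 1) and P(path 2).

```python
# Hypothetical dictionary lookups for the nodes of path 1 in Fig. 3.
generation = {
    ("two", "number"): 0.2,
    ("this", "measure word"): 0.1,
    ("liberal arts", "noun.7"): 0.05,
    ("school", "noun.11"): 0.08,
}
transition = {
    ("BOS", "number"): 0.3,
    ("number", "measure word"): 0.4,
    ("measure word", "noun.7"): 0.2,
    ("noun.7", "noun.11"): 0.25,
    ("noun.11", "EOS"): 0.1,
}

def path_score(path):
    # path: list of (entry, class) nodes in order; multiply the transition
    # probability into each class with the generation probability of the
    # entry under that class, bracketed by BOS and EOS transitions.
    score, prev = 1.0, "BOS"
    for entry, cls in path:
        score *= transition[(prev, cls)] * generation[(entry, cls)]
        prev = cls
    return score * transition[(prev, "EOS")]
```

With real dictionaries, each candidate path would be scored this way and the highest-scoring path selected in step S204.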
In step S204, the segmentation result with the highest score is selected as the word segmentation result of the input text. For the various segmentation results shown in Fig. 3, suppose the score of path 1 is higher than that of path 2; then for the input text "two this liberal arts schools", the final word segmentation result is "two (number) / this (measure word) / liberal arts (noun.7) / school (noun.11)". As can be seen, the final result contains not only each segmented word but also a corresponding class annotation for each word.
When the word segmentation model described above contains the basic dictionary, the word-bit dictionary, and the transition dictionary, step S202 uses the basic dictionary and the word-bit dictionary together to generate the various segmentation results. Please refer to Fig. 4, which is a schematic diagram of embodiment two of the various segmentation results in the present invention. As shown in Fig. 4, a segmentation result falls into one of three cases: a path containing only entry nodes, a path containing only word-bit nodes, or a path containing both entry nodes and word-bit nodes. When the score of each segmentation result is calculated in step S203, the generation probability of an entry node is obtained by looking up, in the basic dictionary of the model, the generation probability of the entry under its class; the generation probability of a word-bit node is obtained by looking up, in the word-bit dictionary of the model, the generation probability of the word bit under its class; and the transition probabilities of all nodes are obtained by looking up, in the transition dictionary of the model, the transition probabilities between classes. When the highest-scoring segmentation result contains word-bit nodes, step S204 further uses the word-bit information of those nodes to determine the division of the unregistered words in the highest-scoring result.
An unregistered word is an entry that cannot be found in the basic dictionary of the word segmentation model. For example, suppose the highest-scoring segmentation result in Fig. 4 is the path drawn in thick lines: "Li/B-name" - "Wen/M-name" - "Jie/E-name" - "going out/verb.3". The three consecutive characters "Li Wen Jie" then form an unregistered word (the only entries findable in the basic dictionary are "Li Wen" and "Jie"), and whether this should be divided as "Li / Wen Jie" or kept whole as "Li Wenjie" is determined from the word-bit information of the word-bit nodes. Analyzing the word-bit information of the three word-bit nodes of "Li Wenjie" in the highest-scoring path above shows that "Li Wenjie" is most reasonable as a single independent segment; if instead the word-bit information of the three nodes were "Li/S-name" - "Wen/B-name" - "Jie/E-name", two unregistered words, "Li" and "Wen Jie", would be cut out.
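A minimal sketch of this division step, using romanized stand-ins for the characters of the "Li Wenjie" example; the function below only encodes the B/M/E/S grouping rule described in the text.

```python
def words_from_bits(tagged_chars):
    """tagged_chars: list of (character, bit) pairs, bit in {B, M, E, S}.

    B starts a multi-character word, M continues it, E closes it,
    and S makes a one-character word on its own.
    """
    words, buffer = [], ""
    for ch, bit in tagged_chars:
        if bit == "S":
            words.append(ch)   # a single character forms a whole word
        elif bit == "B":
            buffer = ch        # start collecting a multi-character word
        else:                  # "M" or "E": continue the current word
            buffer += ch
            if bit == "E":
                words.append(buffer)
                buffer = ""
    return words
```

With the B-M-E tagging of the example, the three characters stay together as one unregistered word; with the alternative S-B-E tagging, two words are cut out instead.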
In this embodiment, if the word segmentation model is built from basic entries, applying the word segmentation method of the present invention with dictionary entries as the input text yields the internal divisions of the further-divisible entries among the dictionary entries. As described above, basic entries are entries divided only at minimum granularity, and dictionary entries are entries divided at any granularity. If a word segmentation model is then built from the dictionary entries whose internal divisions have become known in this way, using that model to segment an input text makes it possible, when outputting the segmentation result, to further output the internal divisions of the further-divisible dictionary entries in the result. For example, if the final segmentation result is "we (pronoun) / (verb) / go up (verb) / expressway (noun) / (adverbial word)", then because the dictionary entry "expressway" in the basic dictionary of the model has an internal division, the internal division of "expressway", namely "high-speed (adjective)" and "road (noun)", is also obtained and output along with the result.
Please refer to Fig. 5, which is a schematic structural block diagram of an embodiment of the apparatus for establishing a word segmentation model in the present invention. The apparatus comprises an annotation unit 301, a class determining unit 302, a statistics unit 303, a model generation unit 304, a pronunciation annotation unit 305, and a word-bit dictionary generation unit 306.
The annotation unit 301 is used to annotate the corpus with each entry and the part of speech of each entry. The corpus consists of various texts, for example web pages, books, periodicals, or novels. In the present invention, parts of speech include not only the common verb, noun, pronoun, and so on, but also proper-noun attributes. A proper noun is the name of a specific person, place, organization, etc., and a proper-noun attribute is the proper-noun category to which such a name belongs. Annotating the corpus with each entry and its part of speech means marking, in the continuous text of the corpus, each entry obtained by segmentation together with its part of speech. For example, the text "I love Tiananmen, Beijing" becomes, after annotation, "I <pronoun> / love <verb> / Beijing <place name.region> / Tiananmen <place name.place>". Since proper nouns are nouns, annotating an entry with a proper-noun attribute amounts to annotating it with a noun part of speech.
The class determining unit 302 is used to determine the class of each entry under its corresponding part of speech. The class determining unit comprises the following subunits: a clustering subunit 3021 and a word-frequency statistics subunit 3022, with the word-frequency statistics subunit 3022 taking processing priority over the clustering subunit 3021.
The clustering subunit 3021 is used to cluster the entries that share the same part of speech according to the clustering features of each entry, and to take the cluster of each entry as the entry's class under its corresponding part of speech.
The clustering features may be the contextual features of the entries in a large-scale corpus. The large-scale corpus is not limited to the annotated corpus described above; it may also include broader unannotated data, for example texts from various sources.
Different entries under the same part of speech differ in meaning, so when such an entry occurs, words associated with its meaning appear in its context. For example, "Beijing" and "Haidian" are both "place name.region", but the former denotes an administrative city while the latter denotes an administrative district; in actual usage this shows as "Beijing" co-occurring more often with words such as "city" and "mayor", and "Haidian" with words such as "district" and "district government". By computing the contextual features of entries in the large-scale corpus and then measuring the similarity between these features, the different entries under the same part of speech can be clustered into several classes, thereby forming the corresponding classes. In this embodiment, a contextual feature consists of the several words that frequently co-occur with an entry within a certain context window, together with their counts; however, contextual features are not limited to this embodiment, and any other feature that can capture contextual relations falls within the scope of the present invention.
Besides clustering by the contextual features of entries in a large-scale corpus, clustering may also use other features: the positional features of entries, e.g. grouping the entries that appear in the same position near a certain word into one class; the gloss features of entries, e.g. grouping entries with the same gloss into one class; the synonym features of entries, e.g. grouping entries that share a synonym into one class; or the structural features of entries, e.g. grouping the nouns whose last character is "car", including "train", "electric car", and "bicycle", into one class. Since the features usable for clustering cannot be exhausted, any feature usable for clustering falls within the scope of the present invention.
The word-frequency statistics subunit 3022 is used to count, in the large-scale corpus, the frequency of each entry under its part of speech, and to assign each entry whose frequency exceeds a set threshold a class of its own as that entry's class under its corresponding part of speech.
In other embodiments, the class determining unit 302 may include only the clustering subunit 3021 and not the word-frequency statistics subunit 3022.
The statistics unit 303 is used to compute, from the annotated corpus, the generation probability of each entry under its corresponding class and the transition probabilities between the classes, where the generation probability of an entry under its class is the probability that the entry occurs with that class in the corpus, and the transition probability between classes is the probability that one class appears immediately after another, given that the preceding class has occurred in the corpus.
One way to compute the generation probability of each entry under its class and the transition probabilities between classes is to use a Markov chain and estimate the probabilities directly by maximum likelihood, that is:
Generation probability of an entry under its class = (number of times the entry occurs with that class in the corpus) / (total number of times that class occurs in the corpus);
Transition probability between two classes = (number of times the two classes occur adjacently in the corpus) / (total number of times the preceding class occurs in the corpus).
For example, if "swimming" occurs 30 times as class "noun.5", class "noun.5" occurs 400 times in total, and "noun.5" is immediately followed by "verb.1" 200 times, then the generation probability of "swimming" under "noun.5" is P(swimming | noun.5) = 30/400, and the transition probability from "noun.5" to "verb.1" is P(verb.1 | noun.5) = 200/400.
Alternatively, the generation probabilities of entries under their classes and the transition probabilities between classes can be obtained by feature training with a machine-learning tool based on conditional random field models (CRF); for the specific method see reference 1.
The model generation unit 304 is used to build the basic dictionary from the generation probabilities of the entries under their corresponding classes, to build the transition dictionary from the transition probabilities between the classes, and to add the basic dictionary and the transition dictionary to the word segmentation model.
The basic dictionary, built from the generation probabilities of entries under their classes, contains each entry, the entry's class under its corresponding part of speech, and the entry's generation probability under that class. The transition dictionary, built from the transition probabilities between classes, contains the classes and the transition probabilities between them.
The pronunciation annotation unit 305 is used to annotate pronunciations for the entries in the basic dictionary of the word segmentation model.
It should be noted that in other embodiments the pronunciation annotation unit 305 is not an indispensable unit.
The word-bit dictionary generation unit 306 is used to build the word-bit dictionary by computing, from the annotated corpus, the generation probability of each word bit under its corresponding class, and to add the word-bit dictionary to the word segmentation model, where the generation probability of a word bit under its class is the probability that, in the corpus, a character appears at a given position within an entry of that class.
A word here is an individual character that forms part of an entry in the corpus, and a word bit is the position at which such a character occurs within an entry. The positions are B (begin), M (middle), E (end), and S (single), indicating respectively that the character appears at the beginning of an entry, inside it, at its end, or forms an entry by itself; for example, "day-B" indicates that the character "day" occurs at the beginning of an entry.
The generation probability of a word bit under its class is the probability that, in the corpus, a character appears at a given position within an entry of that class. For example, an occurrence at the beginning of an entry of class "noun.1" is written B-noun.1, where B denotes the beginning position; if P(day | B-noun.1) is 0.4, this means that the probability of the character "day" appearing at the beginning of an entry of class "noun.1" is 0.4. From the annotated corpus, the positions at which each character occurs within entries of each class can be counted, and a method similar to the one introduced for the statistics unit 303 then yields the generation probability of each word bit under its class. For example, based on a Markov chain, the probability can be computed directly by maximum likelihood estimation:
Generation probability of a word bit under its class = (number of times the character appears at the given position in entries of that class in the corpus) / (total number of times that class occurs in the corpus).
For example, if "day" occurs 50 times at the beginning of entries of class "noun.1" and "noun.1" occurs 500 times, then P(day | B-noun.1) = 50/500. Once the generation probabilities of word bits under their classes are obtained, the corresponding word-bit dictionary can be built; the word-bit dictionary contains each character, its position within an entry together with the corresponding class, and the generation probability of the word bit under that class.
It should be noted that in other embodiments the word-bit dictionary generation unit 306 is not an indispensable unit.
In this embodiment, if the entries annotated by the annotation unit 301 are basic entries, the resulting model is a word segmentation model based on basic entries; if the entries annotated by the annotation unit 301 are dictionary entries, the resulting model is a word segmentation model based on dictionary entries. A so-called basic entry is an entry divided at the minimum granularity only. For example, entries such as "high-speed" and "road" cannot, as words, be divided further in meaning, and therefore have minimum granularity. A dictionary entry is an entry divided at any granularity: dictionary entries include not only entries divided at minimum granularity, such as the indivisible "high-speed" and "road", but also entries such as "expressway", which as a word can in meaning be further divided into "high-speed" and "road".
Please refer to Fig. 6, which is a schematic structural block diagram of an embodiment of the word segmentation apparatus in the present invention. As shown in Fig. 6, the apparatus comprises: a receiving unit 401, a segmentation result generation unit 402, a calculation unit 403, a result determining unit 404, and an unregistered-word determining unit 405.
The receiving unit 401 is used to obtain the input text, that is, the text to be segmented.
The segmentation result generation unit 402 is used to generate the various segmentation results of the text by using a word segmentation model established with the model-building apparatus described above.
The so-called various segmentation results are the several paths formed by performing all possible segmentations of the input text; each path is one segmentation result in the embodiments of the present invention. When the word segmentation model described above contains only the basic dictionary and the transition dictionary, the segmentation result generation unit 402 uses the basic dictionary of the model to generate the various segmentation results. Please refer to Fig. 3, which is a schematic diagram of embodiment one of the various segmentation results in the present invention. For the input text "two this liberal arts schools", the various segmentation results shown in Fig. 3, comprising path 1 and path 2, are built from the entries in the basic dictionary.
The calculation unit 403 is used to calculate the score of each segmentation result by using the word segmentation model. The calculation unit 403 comprises a lookup subunit 4031 and a score calculation subunit 4032: the lookup subunit 4031 looks up, in the word segmentation model, the generation probability and the transition probability of every node of a segmentation result, and the score calculation subunit 4032 multiplies the generation probabilities and transition probabilities of all nodes together to obtain the score of the segmentation result.
Taking the various segmentation results shown in Fig. 3 as an example, where each box is a node, the scores of path 1 and path 2 are calculated as follows:
P(path 1) = p(number | BOS) × p(two | number) × p(measure word | number) × p(this | measure word) × p(noun.7 | measure word) × p(liberal arts | noun.7) × p(noun.11 | noun.7) × p(school | noun.11) × p(EOS | noun.11)
P(path 2) = p(number | BOS) × p(two | number) × p(noun.24 | number) × p(this paper | noun.24) × p(noun.5 | noun.24) × p(science | noun.5) × p(noun.morpheme | noun.5) × p(school | noun.morpheme) × p(EOS | noun.morpheme)
Here BOS and EOS denote the beginning and the end of the text: p(number | BOS) and p(EOS | noun.11) are, respectively, the probability that the text begins with a word of class "number" and the probability that it ends with a word of class "noun.11"; p(two | number) is the probability that "two" occurs given that the class is "number"; and p(measure word | number) is the probability that, given that the previous word's class is "number", the next word's class is "measure word". The probabilities of the other nodes are interpreted similarly.
The generation probability of each node above can be obtained by looking up, in the basic dictionary of the model, the generation probability of the entry under its class, and the transition probability of each node can be obtained by looking up, in the transition dictionary of the model, the transition probability between classes; the scores of path 1 and path 2 can therefore both be calculated.
The result determining unit 404 is used to select the segmentation result with the highest score as the word segmentation result of the input text and to output it. For the word graph shown in Fig. 3, suppose the score of path 1 is higher than that of path 2; then for the input text "two this liberal arts schools", the final word segmentation result is "two (number) / this (measure word) / liberal arts (noun.7) / school (noun.11)". As can be seen, the final result contains not only each segmented word but also a corresponding class annotation for each word.
When the word segmentation model described above contains the basic dictionary, the word-bit dictionary, and the transition dictionary, the segmentation result generation unit 402 uses the basic dictionary and the word-bit dictionary together to generate the various segmentation results. Please refer to Fig. 4, which is a schematic diagram of embodiment two of the various segmentation results in the present invention. As shown in Fig. 4, a segmentation result falls into one of three cases: a path containing only entry nodes, a path containing only word-bit nodes, or a path containing both entry nodes and word-bit nodes. The lookup subunit 4031 obtains the generation probability of an entry node by looking up, in the basic dictionary of the model, the generation probability of the entry under its class; the generation probability of a word-bit node by looking up, in the word-bit dictionary of the model, the generation probability of the word bit under its class; and the transition probabilities of all nodes by looking up, in the transition dictionary of the model, the transition probabilities between classes.
The unregistered-word determining unit 405 is used, when the highest-scoring segmentation result contains word-bit nodes, to determine the division of the unregistered words in that result from the word-bit information of the word-bit nodes.
An unregistered word is an entry that cannot be found in the basic dictionary of the word segmentation model. For example, suppose the highest-scoring segmentation result in Fig. 4 is the path drawn in thick lines: "Li/B-name" - "Wen/M-name" - "Jie/E-name" - "going out/verb.3". The three consecutive characters "Li Wen Jie" then form an unregistered word (the only entries findable in the basic dictionary are "Li Wen" and "Jie"), and whether this should be divided as "Li / Wen Jie" or kept whole as "Li Wenjie" is determined from the word-bit information of the word-bit nodes. Analyzing the word-bit information of the three word-bit nodes of "Li Wenjie" in the highest-scoring path above shows that "Li Wenjie" is most reasonable as a single independent segment; if instead the word-bit information of the three nodes were "Li/S-name" - "Wen/B-name" - "Jie/E-name", two unregistered words, "Li" and "Wen Jie", would be cut out.
In the present embodiment, if basic entries are used to establish a word segmentation model and dictionary entries are fed to the receiving unit 401 as the input text, the result determining unit 404 can obtain the internal divisions of those dictionary entries that can be further divided. As described above, basic entries are entries divided only at the minimum granularity, and dictionary entries are entries divided at any granularity. If a word segmentation model is then established from dictionary entries whose internal divisions are known, the segmentation result generation unit 402 uses this model to generate the segmentation results, and the result determining unit 404 can, when outputting the word segmentation result, further output the internal divisions of the dictionary entries in the result that can be further divided. For example, suppose the final word segmentation result is "we (pronoun) / (verb) / go up (verb) / expressway (noun) / (adverb)". Because the dictionary entry "expressway" in the basic dictionary of the model has an internal division, the internal division of "expressway", namely "high-speed (adjective)" and "highway (noun)", is obtained together with the word segmentation result and output with it.
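The two-stage use of internal divisions described above can be sketched as follows. The dictionary contents and function names are illustrative assumptions, not the patent's own data structures.

```python
# Sketch: a dictionary entry that the model keeps as one token may carry a
# known internal division, recorded in advance by segmenting the dictionary
# entry itself with a model built from basic (minimum-granularity) entries.
# All entries below are invented for illustration.

internal_divisions = {
    "expressway": [("high-speed", "adjective"), ("highway", "noun")],
}

def output_result(segmented):
    """segmented: list of (word, part_of_speech) pairs. Returns the word
    segmentation result, attaching the internal division of any entry that
    can be further divided (None for entries that cannot)."""
    return [(word, pos, internal_divisions.get(word)) for word, pos in segmented]

result = output_result([("we", "pronoun"), ("go up", "verb"),
                        ("expressway", "noun")])
for word, pos, inner in result:
    print(word, pos, inner if inner else "")
```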
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (24)

1. A method for establishing a word segmentation model, characterized in that the method comprises:
A1. labeling each entry of a training corpus and the part of speech of each entry;
B1. determining the category of each entry under its corresponding part of speech;
C1. using the labeled training corpus to count the generation probability of each entry under the corresponding part of speech and the transition probabilities between the parts of speech, wherein the generation probability of an entry under the corresponding part of speech is the probability that the entry occurs with the corresponding part of speech in the training corpus, and the transition probability between parts of speech is the probability that the following part of speech occurs adjacently under the condition that the preceding part of speech has occurred in the training corpus;
D1. using the generation probability of each entry under the corresponding part of speech to obtain a basic dictionary, using the transition probabilities between the parts of speech to obtain a transfer dictionary, and adding the basic dictionary and the transfer dictionary to the word segmentation model.
2. The method according to claim 1, characterized in that the entries comprise basic entries or dictionary entries, wherein the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
3. The method according to claim 1, characterized in that step B1 comprises mode S1 below, or a combination of S1 and S2 in which S2 has a higher execution priority than S1:
S1. clustering the entries having the same part of speech according to the cluster features of each entry, and taking the cluster class to which each entry belongs as the category of that entry under the corresponding part of speech;
S2. counting, in a large-scale corpus, the word frequency of each entry under the corresponding part of speech, and assigning a class of its own to each entry whose word frequency is greater than a set threshold as the category of that entry under the corresponding part of speech.
4. The method according to claim 3, characterized in that the cluster features comprise the contextual features of an entry in the large-scale corpus, the position features of the entry, the paraphrase features of the entry, the synonym relationship features of the entry, or the structured information features of the entry.
5. The method according to claim 1, characterized in that the method further comprises:
D11. labeling pronunciations for the entries of the basic dictionary in the word segmentation model.
6. The method according to claim 1, characterized in that the method further comprises:
D12. using the labeled training corpus to count the generation probability of each word bit under the corresponding part of speech to obtain a word-bit dictionary, and adding the word-bit dictionary to the word segmentation model, wherein the generation probability of a word bit under the corresponding part of speech is the probability that a character appears at a given position within an entry having the corresponding part of speech in the training corpus.
7. A word segmentation method, characterized in that the method comprises:
A2. obtaining an input text;
B2. generating multiple candidate segmentation results for the text by using a word segmentation model established by the method according to any one of claims 1 to 6;
C2. calculating the scores of the candidate segmentation results by using the word segmentation model;
D2. selecting the segmentation result with the highest score as the word segmentation result of the input text and outputting it.
8. The method according to claim 7, characterized in that step C2 comprises:
C21. looking up, in the word segmentation model, the generation probabilities and the transition probabilities of all nodes of a candidate segmentation result;
C22. multiplying together the generation probabilities and the transition probabilities of all nodes of the candidate segmentation result to obtain the score of that candidate segmentation result.
9. The method according to claim 8, characterized in that, when the word segmentation model is established by the method according to any one of claims 1 to 5, the basic dictionary of the word segmentation model is used in step B2 to generate the candidate segmentation results; and in step C21 the generation probability of each entry under the corresponding part of speech is looked up in the basic dictionary of the word segmentation model to obtain the generation probabilities of all nodes of the candidate segmentation result, and the transition probabilities between the parts of speech are looked up in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
10. The method according to claim 8, characterized in that, when the word segmentation model is established by the method according to claim 6, the basic dictionary and the word-bit dictionary of the word segmentation model are used together in step B2 to generate the candidate segmentation results; and in step C21 the generation probability of each entry under the corresponding part of speech is looked up in the basic dictionary of the word segmentation model to obtain the generation probabilities of the entry nodes of the candidate segmentation result, the generation probability of each word bit under the corresponding part of speech is looked up in the word-bit dictionary of the word segmentation model to obtain the generation probabilities of the word-bit nodes of the candidate segmentation result, and the transition probabilities between the parts of speech are looked up in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
11. The method according to claim 10, characterized in that, when the segmentation result with the highest score comprises word-bit nodes, step D2 further comprises: using the word-bit information of the word-bit nodes to determine the division of the unregistered words in the highest-scoring segmentation result, wherein an unregistered word is a word that does not exist in the basic dictionary of the word segmentation model.
12. The method according to claim 7, characterized in that a word segmentation model is established by using basic entries, and the word segmentation method is performed with dictionary entries as the input text to obtain the internal divisions of those dictionary entries that can be further divided;
if a word segmentation model established from dictionary entries whose internal divisions are known is used to segment an input text, then, when the word segmentation result is output, the internal divisions of the dictionary entries in the word segmentation result that can be further divided are further output.
13. A device for establishing a word segmentation model, characterized in that the device comprises:
a labeling unit, configured to label each entry of a training corpus and the part of speech of each entry;
a part-of-speech determining unit, configured to determine the category of each entry under its corresponding part of speech;
a statistics unit, configured to use the labeled training corpus to count the generation probability of each entry under the corresponding part of speech and the transition probabilities between the parts of speech, wherein the generation probability of an entry under the corresponding part of speech is the probability that the entry occurs with the corresponding part of speech in the training corpus, and the transition probability between parts of speech is the probability that the following part of speech occurs adjacently under the condition that the preceding part of speech has occurred in the training corpus;
a model generation unit, configured to use the generation probability of each entry under the corresponding part of speech to obtain a basic dictionary, use the transition probabilities between the parts of speech to obtain a transfer dictionary, and add the basic dictionary and the transfer dictionary to the word segmentation model.
14. The device according to claim 13, characterized in that the entries comprise basic entries or dictionary entries, wherein the basic entries comprise only entries divided at the minimum granularity, and the dictionary entries comprise entries divided at multiple granularities.
15. The device according to claim 13, characterized in that the part-of-speech determining unit comprises a clustering subunit, or comprises a combination of the clustering subunit and a word frequency statistics subunit in which the word frequency statistics subunit has a higher processing priority than the clustering subunit;
wherein the clustering subunit is configured to cluster the entries having the same part of speech according to the cluster features of each entry, and to take the cluster class to which each entry belongs as the category of that entry under the corresponding part of speech;
and the word frequency statistics subunit is configured to count, in a large-scale corpus, the word frequency of each entry under the corresponding part of speech, and to assign a class of its own to each entry whose word frequency is greater than a set threshold as the category of that entry under the corresponding part of speech.
16. The device according to claim 15, characterized in that the cluster features comprise the contextual features of an entry in the large-scale corpus, the position features of the entry, the paraphrase features of the entry, the synonym relationship features of the entry, or the structured information features of the entry.
17. The device according to claim 13, characterized in that the device further comprises a pronunciation labeling unit, configured to label pronunciations for the entries of the basic dictionary in the word segmentation model.
18. The device according to claim 13, characterized in that the device further comprises:
a word-bit dictionary generation subunit, configured to use the labeled training corpus to count the generation probability of each word bit under the corresponding part of speech to obtain a word-bit dictionary, and to add the word-bit dictionary to the word segmentation model, wherein the generation probability of a word bit under the corresponding part of speech is the probability that a character appears at a given position within an entry having the corresponding part of speech in the training corpus.
19. A word segmentation device, characterized in that the device comprises:
a receiving unit, configured to obtain an input text;
a segmentation result generation unit, configured to generate multiple candidate segmentation results for the text by using a word segmentation model established by the device according to any one of claims 13 to 18;
a calculation unit, configured to calculate the scores of the candidate segmentation results by using the word segmentation model;
a result determining unit, configured to select the segmentation result with the highest score as the word segmentation result of the input text and to output it.
20. The device according to claim 19, characterized in that the calculation unit comprises:
a lookup subunit, configured to look up, in the word segmentation model, the generation probabilities and the transition probabilities of all nodes of a candidate segmentation result;
a score calculation subunit, configured to multiply together the generation probabilities and the transition probabilities of all nodes of the candidate segmentation result to obtain the score of that candidate segmentation result.
21. The device according to claim 20, characterized in that, when the word segmentation model is established by the device according to any one of claims 13 to 17, the segmentation result generation unit uses the basic dictionary of the word segmentation model to generate the candidate segmentation results; and the lookup subunit looks up the generation probability of each entry under the corresponding part of speech in the basic dictionary of the word segmentation model to obtain the generation probabilities of all nodes of the candidate segmentation result, and looks up the transition probabilities between the parts of speech in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
22. The device according to claim 20, characterized in that, when the word segmentation model is established by the device according to claim 18, the segmentation result generation unit uses the basic dictionary and the word-bit dictionary of the word segmentation model together to generate the candidate segmentation results; and the lookup subunit looks up the generation probability of each entry under the corresponding part of speech in the basic dictionary of the word segmentation model to obtain the generation probabilities of the entry nodes of the candidate segmentation result, looks up the generation probability of each word bit under the corresponding part of speech in the word-bit dictionary of the word segmentation model to obtain the generation probabilities of the word-bit nodes of the candidate segmentation result, and looks up the transition probabilities between the parts of speech in the transfer dictionary of the word segmentation model to obtain the transition probabilities of all nodes of the candidate segmentation result.
23. The device according to claim 22, characterized in that the device further comprises an unregistered word determining unit, configured to, when the segmentation result with the highest score comprises word-bit nodes, use the word-bit information of the word-bit nodes to determine the division of the unregistered words in the highest-scoring segmentation result, wherein an unregistered word is a word that does not exist in the basic dictionary of the word segmentation model.
24. The device according to claim 19, characterized in that the device takes dictionary entries as the input text in advance and uses a word segmentation model established from basic entries to obtain the internal divisions of those dictionary entries that can be further divided;
if the segmentation result generation unit uses a word segmentation model established from dictionary entries whose internal divisions are known to generate the segmentation results, the result determining unit further outputs, when outputting the word segmentation result, the internal divisions of the dictionary entries in the word segmentation result that can be further divided.
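Steps C1 and D1 of claim 1 (and the statistics unit of claim 13) amount to maximum-likelihood counting over the labeled corpus. A minimal sketch follows; the toy corpus and all names are invented for illustration and are not part of the claimed method.

```python
from collections import Counter

def train(labeled_corpus):
    """labeled_corpus: list of sentences, each a list of (entry, pos) pairs.
    Returns (basic_dict, transfer_dict) where
      basic_dict[(entry, pos)] = P(entry | pos)  -- generation probability,
      transfer_dict[(p1, p2)]  = P(p2 | p1)      -- transition probability."""
    emit, pos_count = Counter(), Counter()
    trans, prev_count = Counter(), Counter()
    for sentence in labeled_corpus:
        for i, (entry, pos) in enumerate(sentence):
            emit[(entry, pos)] += 1
            pos_count[pos] += 1
            if i > 0:
                prev = sentence[i - 1][1]
                trans[(prev, pos)] += 1
                prev_count[prev] += 1
    basic = {k: n / pos_count[k[1]] for k, n in emit.items()}
    transfer = {k: n / prev_count[k[0]] for k, n in trans.items()}
    return basic, transfer

corpus = [[("we", "pronoun"), ("love", "verb"), ("segmentation", "noun")],
          [("they", "pronoun"), ("love", "verb"), ("models", "noun")]]
basic, transfer = train(corpus)
print(basic[("love", "verb")])        # "love" is the only verb observed
print(transfer[("pronoun", "verb")])  # every pronoun is followed by a verb
```

The resulting `basic` and `transfer` tables play the roles of the basic dictionary and transfer dictionary added to the word segmentation model in step D1.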
CN201110223843.4A 2011-08-05 2011-08-05 Method for establishing a word segmentation model, word segmentation method and devices thereof Active CN102929870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110223843.4A CN102929870B (en) Method for establishing a word segmentation model, word segmentation method and devices thereof

Publications (2)

Publication Number Publication Date
CN102929870A true CN102929870A (en) 2013-02-13
CN102929870B CN102929870B (en) 2016-06-29

Family

ID=47644671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110223843.4A Active CN102929870B (en) Method for establishing a word segmentation model, word segmentation method and devices thereof

Country Status (1)

Country Link
CN (1) CN102929870B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101075251A (en) * 2007-06-18 2007-11-21 中国电子科技集团公司第五十四研究所 Method for searching file based on data excavation
CN101154226A (en) * 2006-09-27 2008-04-02 腾讯科技(深圳)有限公司 Method for adding unlisted word to word stock of input method and its character input device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHOU QIANG: "A Chinese corpus combining word segmentation and part-of-speech tagging", Computational Linguistics Research and Applications *
ZHANG JINZHU: "Research and implementation of a character-position-based Chinese word segmentation method", Wanfang Data *
DUAN HUIMING: "Construction and use of a large-scale annotated Chinese corpus", Applied Linguistics *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239355A (en) * 2013-06-21 2014-12-24 高德软件有限公司 Search-engine-oriented data processing method and device
CN104156349A (en) * 2014-03-19 2014-11-19 邓柯 Unlisted word discovering and segmenting system and method based on statistical dictionary model
CN104156349B (en) * 2014-03-19 2017-08-15 邓柯 Unlisted-word discovery and word segmentation system and method based on a statistical dictionary model
CN105261358A (en) * 2014-07-17 2016-01-20 中国科学院声学研究所 N-gram grammar model constructing method for voice identification and voice identification system
WO2016045567A1 (en) * 2014-09-22 2016-03-31 北京国双科技有限公司 Webpage data analysis method and device
US10621245B2 (en) 2014-09-22 2020-04-14 Beijing Gridsum Technology Co., Ltd. Webpage data analysis method and device
CN108124477B (en) * 2015-02-02 2021-06-15 微软技术许可有限责任公司 Improving word segmenters to process natural language based on pseudo data
CN108124477A (en) * 2015-02-02 2018-06-05 微软技术授权有限责任公司 Segmenter is improved based on pseudo- data to handle natural language
CN106776544A (en) * 2016-11-24 2017-05-31 四川无声信息技术有限公司 Character relation recognition methods and device and segmenting method
CN107291692A (en) * 2017-06-14 2017-10-24 北京百度网讯科技有限公司 Method for customizing, device, equipment and the medium of participle model based on artificial intelligence
CN107291692B (en) * 2017-06-14 2020-12-18 北京百度网讯科技有限公司 Artificial intelligence-based word segmentation model customization method, device, equipment and medium
CN109145282B (en) * 2017-06-16 2023-11-07 贵州小爱机器人科技有限公司 Sentence-breaking model training method, sentence-breaking device and computer equipment
CN109145282A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Punctuate model training method, punctuate method, apparatus and computer equipment
CN109408794A (en) * 2017-08-17 2019-03-01 阿里巴巴集团控股有限公司 A kind of frequency dictionary method for building up, segmenting method, server and client side's equipment
CN107526724A (en) * 2017-08-22 2017-12-29 北京百度网讯科技有限公司 For marking the method and device of language material
CN107992509A (en) * 2017-10-12 2018-05-04 如是科技(大连)有限公司 method and device for generating job dictionary information
CN107992509B (en) * 2017-10-12 2022-05-13 如是人力科技集团股份有限公司 Method and device for generating job dictionary information
CN109683773A (en) * 2017-10-19 2019-04-26 北京国双科技有限公司 Corpus labeling method and device
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment
CN108038103B (en) * 2017-12-18 2021-08-10 沈阳智能大数据科技有限公司 Method and device for segmenting text sequence and electronic equipment
CN108052508B (en) * 2017-12-29 2021-11-09 北京嘉和海森健康科技有限公司 Information extraction method and device
CN108052508A (en) * 2017-12-29 2018-05-18 北京嘉和美康信息技术有限公司 A kind of information extraction method and device
CN109829162B (en) * 2019-01-30 2022-04-08 新华三大数据技术有限公司 Text word segmentation method and device
CN109829162A (en) * 2019-01-30 2019-05-31 新华三大数据技术有限公司 A kind of text segmenting method and device
WO2020215456A1 (en) * 2019-04-26 2020-10-29 网宿科技股份有限公司 Text labeling method and device based on teacher forcing
CN110175273B (en) * 2019-05-22 2021-09-07 腾讯科技(深圳)有限公司 Text processing method and device, computer readable storage medium and computer equipment
CN110175273A (en) * 2019-05-22 2019-08-27 腾讯科技(深圳)有限公司 Text handling method, device, computer readable storage medium and computer equipment
CN111062211A (en) * 2019-12-27 2020-04-24 中国联合网络通信集团有限公司 Information extraction method and device, electronic equipment and storage medium
CN111523302A (en) * 2020-07-06 2020-08-11 成都晓多科技有限公司 Syntax analysis method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN102929870B (en) 2016-06-29

Similar Documents

Publication Publication Date Title
CN102929870B (en) Method for establishing a word segmentation model, word segmentation method and devices thereof
CN105718586B Word segmentation method and device
CN101706777B (en) Method and system for extracting resequencing template in machine translation
CN102214166B (en) Machine translation system and machine translation method based on syntactic analysis and hierarchical model
CN104794169A (en) Subject term extraction method and system based on sequence labeling model
CN102693279B (en) Method, device and system for fast calculating comment similarity
CN103885938A (en) Industry spelling mistake checking method based on user feedback
CN103309926A (en) Chinese and English-named entity identification method and system based on conditional random field (CRF)
CN104756100A (en) Intent estimation device and intent estimation method
CN102103594A (en) Character data recognition and processing method and device
US20110040553A1 (en) Natural language processing
Richter et al. Korektor–a system for contextual spell-checking and diacritics completion
CN104484433A (en) Book body matching method based on machine learning
CN105630770A (en) Word segmentation phonetic transcription and ligature writing method and device based on SC grammar
CN102929864B (en) Tone-character conversion method and device
Mladenović et al. Using lexical resources for irony and sarcasm classification
CN103678288A (en) Automatic proper noun translation method
Attia et al. An automatically built named entity lexicon for Arabic
CN109918664B (en) Word segmentation method and device
CN101369285B (en) Spell emendation method for query word in Chinese search engine
Zhao et al. Learning Question Paraphrases for QA from Encarta Logs.
Geyken et al. On-the-fly Generation of Dictionary Articles for the DWDS Website
Gulnazir et al. Investigating Lexical Variation and Change in Malaysian Twitter: A Conceptual Paper.
CN110162791B (en) Text keyword extraction method and system for national defense science and technology field
Sang et al. Extraction of hypernymy information from text∗

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant