CN103020034A - Chinese words segmentation method and device - Google Patents

Chinese words segmentation method and device

Info

Publication number
CN103020034A
Authority
CN
China
Prior art keywords
word
word segmentation
corpus
character
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102877230A
Other languages
Chinese (zh)
Inventor
秦晓
万小军
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University Founder Group Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd
Priority to CN2011102877230A
Publication of CN103020034A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a Chinese word segmentation method and device. The method comprises the steps of: training a CRF (Conditional Random Field) model on a segmented corpus; segmenting an unsegmented corpus with the CRF model; judging whether the successfully segmented material satisfies a set condition and, if so, adding it to the segmented corpus; and executing the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model. The invention further provides a Chinese word segmentation device comprising a training module, a segmentation module, an adding module and a loop module. The training module is used to train the CRF model on the segmented corpus; the segmentation module is used to segment the unsegmented corpus with the CRF model; the adding module is used to judge whether the successfully segmented material satisfies the set condition and, if so, to add it to the segmented corpus; and the loop module is used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model. The Chinese word segmentation method and device improve segmentation efficiency and reduce segmentation ambiguity.

Description

Chinese word segmentation method and device
Technical field
The present invention relates to the field of Chinese language processing, and in particular to a Chinese word segmentation method and device.
Background technology
The related art provides a dictionary-based segmentation method, also called the mechanical segmentation method. The method requires a segmentation dictionary; its main characteristics are that it is fairly simple and easy to implement, but segmentation is slow and it easily produces ambiguity.
Summary of the invention
The present invention aims to provide a Chinese word segmentation method and device, in order to solve the problems of the related art that segmentation is slow and easily produces ambiguity.
In an embodiment of the present invention, a Chinese word segmentation method is provided, comprising: training a CRF model on a segmented corpus; segmenting an unsegmented corpus with the CRF model; judging whether the successfully segmented material satisfies a set condition and, if so, adding it to the segmented corpus; and executing the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
In an embodiment of the present invention, a Chinese word segmentation device is provided, comprising: a training module, used to train a CRF model on a segmented corpus; a segmentation module, used to segment an unsegmented corpus with the CRF model; an adding module, used to judge whether the successfully segmented material satisfies a set condition and, if so, to add it to the segmented corpus; and a loop module, used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
Because the Chinese word segmentation method and device of the above embodiments of the present invention adopt the CRF technique, they overcome the problems of the dictionary-based segmentation method, namely slow segmentation and easily produced ambiguity, and thereby achieve the effect of improving segmentation speed and reducing segmentation ambiguity.
Description of drawings
The accompanying drawings described herein are used to provide a further understanding of the present invention and constitute a part of the application; the illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the accompanying drawings:
Fig. 1 shows the flow chart of the Chinese word segmentation method according to an embodiment of the invention;
Fig. 2 is the diagram of cross-domain Chinese word segmentation;
Fig. 3 is the flow chart of the training and testing of the segmentation model;
Fig. 4 is the flow chart of the sentence screening;
Fig. 5 shows the schematic diagram of the Chinese word segmentation device according to an embodiment of the invention.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows the flow chart of the Chinese word segmentation method according to an embodiment of the invention, comprising:
Step S10: train a CRF model on the segmented corpus;
Step S20: segment the unsegmented corpus with the CRF model;
Step S30: judge whether the successfully segmented material satisfies the set condition and, if so, add it to the segmented corpus;
Step S40: execute the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
The dictionary-based segmentation method is slow and easily produces ambiguity, whereas the present embodiment adopts the CRF technique; it therefore overcomes the slowness and ambiguity of the dictionary-based method and achieves the effect of improving segmentation speed and reducing segmentation ambiguity.
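As a concrete illustration of steps S10 to S40, the following Python sketch outlines the self-training loop. The helpers train_crf and segment_with_model are hypothetical placeholders for the CRF toolkit calls described later (for example CRF++ training and testing); they are assumptions for illustration, not part of the patent.

def self_training(segmented, unsegmented, threshold=0.85):
    # segmented: list of already segmented sentences (the training corpus)
    # unsegmented: list of raw sentences still to be segmented
    # train_crf / segment_with_model are hypothetical wrappers around the CRF toolkit
    model = train_crf(segmented)                      # step S10: train the initial model
    while True:
        accepted, remaining = [], []
        for raw in unsegmented:                       # step S20: segment the unsegmented corpus
            words, prob = segment_with_model(model, raw)
            if prob > threshold:                      # step S30: sentence satisfies the set condition
                accepted.append(words)
            else:
                remaining.append(raw)
        if not accepted:                              # step S40: corpus no longer grows, stop
            return model                              # final CRF model
        segmented.extend(accepted)                    # add the accepted sentences to the corpus
        unsegmented = remaining
        model = train_crf(segmented)                  # retrain on the enlarged corpus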
The conditional random field model is a typical discriminative model proposed by Lafferty in 2001. It models the target sequence on the basis of the observation sequence and focuses on sequence labeling problems (in the present invention, the labeling task is word segmentation). The conditional random field model has the advantage of a discriminative model and, like a generative model, takes the transition probabilities between context labels into account, performing global parameter optimization and decoding over the whole sequence; it thereby solves the label bias problem that other discriminative models (such as the maximum entropy Markov model) find difficult to avoid.
CRF (Conditional Random Field) theory can be used for natural language processing tasks such as sequence labeling, data segmentation and chunk parsing. It has been applied with good results to Chinese natural language processing tasks such as Chinese word segmentation, Chinese person-name recognition and ambiguity resolution. The major CRF-based implementations at present include CRF, FlexCRF and CRF++. The conditional random field model is an undirected graphical model; under the condition of a given observation sequence to be labeled, it computes the joint probability distribution of the whole label sequence, rather than defining the distribution of the next state given the current state. That is, given an observation sequence O, it finds the optimal label sequence S.
Preferably, step S10 comprises: using effective character features to express the segmented corpus in feature-vector form, and training on it to obtain the CRF model.
Preferably, using the effective character features to express the segmented corpus in feature-vector form comprises:
judging whether a character in the segmented corpus is a digit and, if so, representing it with the tag "N" (Number);
judging whether the character is a letter and, if so, representing it with the tag "L" (Letter);
judging whether the character is a punctuation mark (including Chinese and Western punctuation) and, if so, representing it with the tag "P" (Punctuation);
judging whether the character is a time word and, if so, representing it with the tag "D" (Date);
if all of the above judgements are negative, marking it as "C" (Character), which represents all ordinary characters other than the above four types. A sketch of this feature assignment follows.
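A minimal Python sketch of this character-feature assignment is given below; the set of time-word characters and the punctuation inventory are illustrative assumptions, not fixed by the patent.

import string

TIME_CHARS = set("年月日时分秒")                       # assumed inventory of time-word characters
PUNCT_CHARS = set(string.punctuation) | set("，。、；：？！“”‘’（）《》…—")  # Chinese and Western punctuation (illustrative subset)

def char_feature(ch):
    # Map a character to its feature tag: N, L, P, D or C.
    if ch.isdigit():
        return "N"                                    # digit
    if ch.isascii() and ch.isalpha():
        return "L"                                    # Latin letter
    if ch in PUNCT_CHARS:
        return "P"                                    # punctuation mark
    if ch in TIME_CHARS:
        return "D"                                    # time word
    return "C"                                        # ordinary character

print([char_feature(c) for c in "1998年新年讲话"])
# ['N', 'N', 'N', 'N', 'D', 'C', 'D', 'C', 'C']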
Preferably, segmenting the unsegmented corpus with the CRF model comprises: using the effective character features to convert the unsegmented corpus into feature-vector form, and segmenting it with the CRF model.
Preferably, using the effective character features to convert the unsegmented corpus into feature-vector form comprises:
judging whether a character in the unsegmented corpus is a digit and, if so, representing it with the tag "N";
judging whether the character is a letter and, if so, representing it with the tag "L";
judging whether the character is a punctuation mark and, if so, representing it with the tag "P";
judging whether the character is a time word and, if so, representing it with the tag "D";
if all of the above judgements are negative, marking it as "C".
Preferably, a suitable feature template is designed and the CRF model is trained with it to obtain the initial statistical model. The preferred embodiment of the present invention uses a feature template over a 5-character window; the template is as follows:
#Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]
U10:%x[-1,1]
U11:%x[0,1]
U12:%x[1,1]
U13:%x[-1,1]/%x[0,1]
U14:%x[0,1]/%x[1,1]
U15:%x[-1,1]/%x[1,1]
#Bigram
B
This template only uses the Unigram (unary) part, where #Unigram indicates the unary feature relations; U00:%x[-1,0] denotes the template group with code U00, and %x[-1,0] refers to the character before the current character; U01:%x[0,0] denotes the template group U01, the current character; U02:%x[1,0] denotes the template group U02, the character after the current character; U03:%x[-1,0]/%x[0,0] denotes the template group U03, combining the previous character with the current character; U04:%x[0,0]/%x[1,0] denotes the template group U04, combining the current character with the following character; U05:%x[-1,0]/%x[1,0] denotes the template group U05, combining the previous character with the following character; U10:%x[-1,1] denotes the template group U10, the character feature of the previous character; U11:%x[0,1] denotes the template group U11, the character feature of the current character, i.e. the character-feature column is considered jointly with the character column; U12:%x[1,1] denotes the template group U12, the character feature of the following character; U13:%x[-1,1]/%x[0,1] denotes the template group U13, combining the character features of the previous character and the current character; U14:%x[0,1]/%x[1,1] denotes the template group U14, combining the character features of the current character and the following character; U15:%x[-1,1]/%x[1,1] denotes the template group U15, combining the character features of the previous character and the following character; #Bigram indicates the bigram relation, and B is the abbreviation of Bigram; the bigram relation is not used in the present embodiment.
A 5-character window here means that only the window of five characters consisting of the two characters before the current character, the current character and the two characters after it is considered; for example, "中共中央总" is such a 5-character window string. Taking "中共中央总书记、国家主席江泽民" ("General Secretary of the CPC Central Committee and President Jiang Zemin") as an example and supposing the current character is "记", the meaning of each template is as shown below:
(Figure BSA00000580902200071: a table listing, for each template, the feature it extracts for the current character "记".)
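To make the template instantiation concrete, the Python sketch below expands the unigram templates of this section at the position of "记" in the example sentence; the _B padding token and the simplified char_feature stand-in are assumptions for illustration, not CRF++ internals.

def char_feature(ch):                                  # minimal stand-in for the feature function sketched above
    return "P" if ch in "、，。？！；：" else "C"

sentence = "中共中央总书记、国家主席江泽民"
rows = [(ch, char_feature(ch)) for ch in sentence]     # one (character, feature) row per character

TEMPLATES = {                                          # the unigram templates of this embodiment
    "U00": [(-1, 0)], "U01": [(0, 0)], "U02": [(1, 0)],
    "U03": [(-1, 0), (0, 0)], "U04": [(0, 0), (1, 0)], "U05": [(-1, 0), (1, 0)],
    "U10": [(-1, 1)], "U11": [(0, 1)], "U12": [(1, 1)],
    "U13": [(-1, 1), (0, 1)], "U14": [(0, 1), (1, 1)], "U15": [(-1, 1), (1, 1)],
}

def expand(templates, rows, i):
    # Return the feature string produced by each template at position i.
    feats = {}
    for name, refs in templates.items():
        values = []
        for offset, col in refs:
            j = i + offset
            values.append(rows[j][col] if 0 <= j < len(rows) else "_B")   # assumed padding token
        feats[name] = "/".join(values)
    return feats

i = sentence.index("记")                               # current character: 记
for name, value in sorted(expand(TEMPLATES, rows, i).items()):
    print(name, value)                                 # U00 书, U01 记, U02 、, U03 书/记, U04 记/、, ...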
Preferably, if the output probability of a successfully segmented sentence is judged to be greater than a threshold, the successfully segmented sentence is added to the segmented corpus. For example, the parameter of the CRF toolkit is set at test time so that it displays the conditional probability of the output sentence and the marginal probability of each character's label. Using this conditional probability, sentences with a probability greater than 0.85 are selected. This value was chosen through experiment: it should be high enough that the segmentation of the selected sentences has a certain confidence, yet low enough that unsegmented sentences can still be selected, because the higher the value, the fewer sentences can be selected.
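The probabilities used for this screening can be produced with the CRF++ command-line tools; the sketch below shows one possible invocation, with the file names template, train.data, test.data and model as placeholders.

import subprocess

# crf_learn trains a model from the feature template and the feature-vector training file;
# crf_test -v1 labels the test file and prints the sentence-level conditional probability
# and the per-character marginal probabilities.
subprocess.run(["crf_learn", "template", "train.data", "model"], check=True)
result = subprocess.run(["crf_test", "-v1", "-m", "model", "test.data"],
                        check=True, capture_output=True, text=True)
print(result.stdout)      # one '#<probability>' line per sentence, then one row per character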
For example, the screening procedure is as follows:
Train the statistical model from the corpus; during testing of the model, adjust the relevant parameter so that the conditional probability of the output and the marginal probabilities of the labels are shown in the test result. In this test the parameter -v1 is added, which displays the conditional probability and the marginal probabilities at the same time. For example, the sentence "一天晚上，突然下起了雨。" ("One evening, it suddenly started to rain.") gives the following test result:
#0.906430
一 C S/0.960094
天 C S/0.959499
晚 C B/0.987164
上 C E/0.992140
， P S/0.999996
突 C B/0.999836
然 C E/0.998923
下 C S/0.961316
起 C S/0.961338
了 C S/0.988134
雨 C S/0.988271
。 P S/1.000000
In the output above, #0.906430 is the probability that the characters of the whole sentence are assigned the labels shown; the higher this value, the higher the confidence. "一 C S/0.960094" means that, among its possible labels (B, B2, B3, M, E, S), the character "一" is assigned the label "S" with probability 0.960094; the higher this probability, the more likely the "S" label is and the less likely the character is to be assigned any other label.
According to the conditional probability of each output sentence, sentences whose probability exceeds the selected threshold of 0.85 are converted into the standard segmented form according to the labels of the atomic tokens. In the example above the conditional probability is 0.906430, which clearly satisfies the condition, so the sentence is added and converted into the standard segmented form, namely "一 天 晚上 ， 突然 下 起 了 雨 。".
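A small Python sketch of this screening and conversion step follows; it assumes the output format shown above (a "#probability" line, then one "character feature label/marginal" line per character, with a blank line between sentences), which may differ slightly between toolkit versions.

def screen_sentences(output_lines, threshold=0.85):
    # Select sentences whose conditional probability exceeds the threshold and
    # rebuild their standard segmented form from the per-character labels.
    selected, prob, chars = [], None, []
    def flush():
        if prob is not None and prob > threshold and chars:
            words, current = [], ""
            for ch, label in chars:
                current += ch
                if label in ("E", "S"):               # a word ends on label E or S
                    words.append(current)
                    current = ""
            if current:
                words.append(current)
            selected.append(" ".join(words))
    for line in output_lines:
        line = line.strip()
        if line.startswith("#"):                      # sentence-level conditional probability
            prob = float(line.lstrip("#"))
        elif not line:                                # blank line ends the sentence
            flush()
            prob, chars = None, []
        else:
            ch, _feat, tag = line.split()[:3]
            chars.append((ch, tag.split("/")[0]))     # keep the label, drop its marginal
    flush()
    return selected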
Preferably, before training on the segmented corpus, at least one of the following steps is also included:
Sentence splitting: the annotated raw text is divided into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！". This mainly handles cases where the punctuation is more complicated, such as double quotation marks that contain a full stop, semicolon, exclamation mark or question mark. The input sentence sequence is scanned from left to right; a string enclosed by double quotation marks "“ ”" is treated as one unit, and full stops, exclamation marks and question marks inside the quotation marks do not trigger sentence splitting, while sentence separators outside the quotation marks do. For example, the original text "We are about to see off the Year of the Ox with the joy of a good harvest and welcome the Year of the Tiger with high-spirited fighting will. In the new year our great motherland will be full of vitality and full of hope." should be divided into two sentences, namely "We are about to see off the Year of the Ox with the joy of a good harvest and welcome the Year of the Tiger with high-spirited fighting will." and "In the new year our great motherland will be full of vitality and full of hope." A sketch of this splitting rule is given below.
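A minimal Python sketch of the splitting rule, assuming the separators 。；？！ and treating text enclosed in the double quotation marks “ ” as a single unit.

SEPARATORS = set("。；？！")

def split_sentences(text):
    # Split on Chinese sentence separators, but never inside “...” quotation marks.
    sentences, buf, in_quote = [], "", False
    for ch in text:
        buf += ch
        if ch == "“":
            in_quote = True
        elif ch == "”":
            in_quote = False
        elif ch in SEPARATORS and not in_quote:
            sentences.append(buf)
            buf = ""
    if buf:
        sentences.append(buf)
    return sentences

print(split_sentences("他说：“下雨了。”我们回家吧。今天不出门了。"))
# ['他说：“下雨了。”我们回家吧。', '今天不出门了。']  (the 。 inside the quotes does not split)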
Atomic segmentation: this operation mainly applies to sentences containing many non-Chinese characters such as digits and English characters. When characters are read, consecutive non-Chinese characters such as digits and English letters are treated as a single whole. An ordinary Chinese character is treated as one processing unit on its own, but digits and English letters appearing in Chinese text need special handling: a run of consecutive non-Chinese characters is treated as one processing unit, and a string connected by "." is likewise treated as a whole. For example, the atomic segmentation of "1998年新年讲话" ("the 1998 New Year address") is: 1998, 年, 新, 年, 讲, 话, rather than 1, 9, 9, 8, 年, 新, 年, 讲, 话. The advantage of this operation is that it is more conducive to learning by the CRF model: with purely character-based learning, the features of some non-Chinese characters are difficult to learn well. The operation helps improve the segmentation result and the segmentation accuracy. A sketch follows.
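A Python sketch of the atomic segmentation; the character classes merged into one unit (digits, Latin letters, "." and "%") are illustrative assumptions.

def is_atom_char(ch):
    # Characters that are merged into a single atomic unit.
    return ch.isdigit() or (ch.isascii() and ch.isalpha()) or ch in ".%"

def atomic_segment(sentence):
    # Runs of non-Chinese characters stay together; every other character is its own unit.
    units, buf = [], ""
    for ch in sentence:
        if is_atom_char(ch):
            buf += ch                                  # extend the current non-Chinese run
        else:
            if buf:
                units.append(buf)                      # close the run
                buf = ""
            units.append(ch)                           # an ordinary Chinese character is one unit
    if buf:
        units.append(buf)
    return units

print(atomic_segment("1998年新年讲话"))
# ['1998', '年', '新', '年', '讲', '话']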
Label adding: a 6-tag scheme is used here, namely B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word. For example, in the word "中华人民共和国" ("People's Republic of China"), "中" is labeled "B", "华" is labeled "B2", "人" is labeled "B3", "民" is labeled "M", "共" is labeled "M", "和" is labeled "M", and "国" is labeled "E".
For example, converting the segmented sentence "中共中央/总书记/、/国家/主席/江/泽民" ("General Secretary of the CPC Central Committee and President Jiang Zemin") into feature-vector form gives:
中 C B
共 C B2
中 C B3
央 C E
总 C B
书 C B2
记 C E
、 P S
国 C B
家 C E
主 C B
席 C E
江 C S
泽 C B
民 C E
In the example above, take the word "中共中央" as an illustration: its four characters are ordinary characters, i.e. non-digit, non-letter, non-punctuation characters, so their character feature is marked "C". Meanwhile "中" is the first character of the word and is therefore labeled "B", "共" is the second character and is labeled "B2", the second "中" is the third character and is labeled "B3", and "央" is the last character and is labeled "E". A sketch of this conversion follows.
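A Python sketch that converts a segmented sentence into the (character, feature, label) rows shown above with the 6-tag scheme; the char_feature stand-in is a simplified assumption.

def char_feature(ch):                                  # minimal stand-in for the feature function above
    return "P" if ch in "、，。？！；：" else "C"

def tag_word(word):
    # Return the 6-tag labels (B, B2, B3, M, E, S) for one word.
    if len(word) == 1:
        return ["S"]
    tags = ["B", "B2", "B3"][:len(word) - 1]           # first, second, third characters
    tags += ["M"] * (len(word) - 1 - len(tags))        # middle characters after the third
    tags.append("E")                                   # final character
    return tags

def to_feature_vectors(words):
    # Convert a list of words into (character, feature, label) rows.
    rows = []
    for word in words:
        for ch, tag in zip(word, tag_word(word)):
            rows.append((ch, char_feature(ch), tag))
    return rows

for ch, feat, tag in to_feature_vectors(["中共中央", "总书记", "、", "国家", "主席", "江", "泽民"]):
    print(ch, feat, tag)                               # 中 C B, 共 C B2, ..., 江 C S, 泽 C B, 民 C E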
Preferably, before the unsegmented corpus is segmented with the CRF model, at least one of the following steps is also included:
dividing the unsegmented corpus into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！";
treating consecutive non-Chinese characters in the unsegmented corpus as one processing unit;
labeling the unsegmented corpus with B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word.
Preferably, the method further comprises: a search engine receives content to be searched input by a user; the finally generated CRF model is used to segment the content to be searched.
The inventors have found that current Chinese word segmentation methods mainly target segmentation within a single domain, while research on cross-domain Chinese word segmentation is relatively scarce. In real life, when a search engine is used, the domain of the search content is not specified during the search, but users' demands are broad and touch every field of society, so the process in which a user enters a query and the search engine displays results is in essence a cross-domain search. Cross-domain segmentation is therefore significant, and its accuracy directly affects the accuracy of the search results.
Cross-domain segmentation faces two difficult problems, namely segmentation ambiguity and out-of-vocabulary word recognition. The vocabularies of different domains differ from one another, domain terms are highly specialized, and with the development and progress of society a large number of new words keep emerging. Statistics-based Chinese word segmentation methods, although they can improve out-of-vocabulary word recognition to a certain extent, cannot learn cross-domain knowledge well.
The embodiment of the invention can be applied well to cross-domain segmentation. In a preferred embodiment of the invention, the segmented corpus can be a commercially available corpus covering multiple domains; "cross-domain" means that the sources of the corpus content differ, for example the material in the corpus comes from different fields such as news, literature, computing and finance. This preferred embodiment can train on the cross-domain corpus and obtain a cross-domain CRF model, and can therefore perform cross-domain segmentation. It makes comprehensive use of the knowledge in the cross-domain corpus, fully learns the knowledge features in each domain, and obtains a statistical model with a certain domain adaptivity, which can effectively improve cross-domain segmentation.
Fig. 2 shows the cross-domain Chinese word segmentation process according to an embodiment of the invention, comprising:
Step S202: take the annotated People's Daily corpus released by Peking University as the original corpus;
Step S204: preprocess the annotated corpus: divide the annotated sentences into relatively independent sentences using the sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters such as digits and English letters as a single whole; and add labels to the corpus using the 6-tag scheme;
Step S206: train on the preprocessed corpus;
Step S208: generate the CRF statistical model;
Step S210: take unsegmented corpora from four other domains, namely literature, computing, medicine and finance, as the unannotated data;
Step S212: preprocess the unannotated data: divide it into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters in the unsegmented corpus as one processing unit; and label the unsegmented corpus with B, B2, B3, M, E, S;
Step S214: obtain the preprocessing result;
Step S216: use the CRF model obtained above to segment the preprocessed unannotated data;
Step S218: obtain the preliminary segmentation result;
Step S220: from the above segmentation result, select the sentences that satisfy a certain condition and add this material to the annotated corpus set;
Step S222: obtain the expanded annotated corpus set;
Step S224: train on the expanded annotated corpus to obtain a CRF model, and use this model to segment the unannotated data;
Step S226: revise the above segmentation result using some simple rules of Chinese word segmentation.
Fig. 3 shows the flow chart of the training and testing of the segmentation model according to an embodiment of the invention, comprising:
Step S302: read one line of text from the training corpus as input;
Step S304: perform sentence splitting on the input, dividing it into relatively independent sentences at the separators "。", "；", "？", "！";
Step S306: perform atomic segmentation on the sentences, treating consecutive non-Chinese characters as one processing unit;
Step S308: for the sentences, use the effective character features and the 6-tag scheme to convert them into feature-vector form;
Step S310: judge whether all material in the training set has been processed; if not, continue the above operations;
Step S312: train on the processed training corpus to obtain the CRF model;
Step S314: read one line of text from the test corpus as input;
Step S316: perform sentence splitting on the input, dividing it into relatively independent sentences at the separators "。", "；", "？", "！";
Step S318: for the sentences, add a sentence-split mark at each split position;
Step S320: perform atomic segmentation on the sentences, treating consecutive non-Chinese characters as one processing unit;
Step S322: for the sentences, use the effective character features and the 6-tag scheme to convert them into feature-vector form;
Step S324: judge whether all material in the test set has been processed; if not, continue the above operations;
Step S326: use the CRF model obtained above to segment the processed material to be annotated.
Fig. 4 shows the flow chart of the sentence screening according to an embodiment of the invention, comprising:
Step S402: take the People's Daily corpus released by Peking University as the original corpus;
Step S404: preprocess the annotated corpus: divide it into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters as one processing unit; and label the segmented corpus with B, B2, B3, M, E, S;
Step S406: train on the corpus set obtained above to obtain the CRF model;
Step S408: take unsegmented corpora from four other domains, namely literature, computing, medicine and finance, as the unannotated data;
Step S410: preprocess the unannotated data: divide it into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters in the unsegmented corpus as one processing unit; and label the unsegmented corpus with B, B2, B3, M, E, S;
Step S412: obtain the feature-vector form of the material to be annotated;
Step S414: use the CRF model obtained above to segment the preprocessed unannotated data; by setting the parameter of the CRF toolkit, the output probability of each sentence can be shown;
Step S416: judge whether there is any sentence whose output probability is greater than the threshold;
Step S418: add the sentences whose output probability is greater than the threshold to the annotated corpus;
Step S420: when no remaining sentence has an output probability greater than the threshold, obtain the final annotated corpus set.
Fig. 5 shows the schematic diagram of the Chinese word segmentation device according to an embodiment of the invention, comprising:
a training module 10, used to train a CRF model on the segmented corpus;
a segmentation module 20, used to segment the unsegmented corpus with the CRF model;
an adding module 30, used to judge whether the successfully segmented material satisfies the set condition and, if so, to add it to the segmented corpus;
a loop module 40, used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
As can be seen from the above description, the present invention improves segmentation speed and reduces segmentation ambiguity.
Obviously, those skilled in the art should understand that each of the above modules or steps of the present invention can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can each be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. Thus the present invention is not restricted to any specific combination of hardware and software.
The above are only the preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A Chinese word segmentation method, characterized by comprising:
training a CRF model on a segmented corpus;
segmenting an unsegmented corpus with the CRF model;
judging whether the successfully segmented material satisfies a set condition and, if so, adding it to the segmented corpus;
executing the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
2. The method according to claim 1, characterized in that training the CRF model on the segmented corpus comprises:
using effective character features to express the segmented corpus in feature-vector form, and training on it to obtain the CRF model.
3. The method according to claim 2, characterized in that segmenting the unsegmented corpus with the CRF model comprises:
using the effective character features to convert the unsegmented corpus into feature-vector form, and segmenting it with the CRF model.
4. The method according to claim 3, characterized in that using the effective character features to express the segmented corpus in feature-vector form comprises: judging whether a character in the segmented corpus is a digit and, if so, representing it with the tag "N"; judging whether the character is a letter and, if so, representing it with the tag "L"; judging whether the character is a punctuation mark and, if so, representing it with the tag "P"; judging whether the character is a time word and, if so, representing it with the tag "D"; and, if all of the above judgements are negative, marking it as "C";
and using the effective character features to convert the unsegmented corpus into feature-vector form comprises: judging whether a character in the unsegmented corpus is a digit and, if so, representing it with the tag "N"; judging whether the character is a letter and, if so, representing it with the tag "L"; judging whether the character is a punctuation mark and, if so, representing it with the tag "P"; judging whether the character is a time word and, if so, representing it with the tag "D"; and, if all of the above judgements are negative, marking it as "C".
5. The method according to claim 1, characterized in that the template used to train on the segmented corpus has the following form:
#Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]
U10:%x[-1,1]
U11:%x[0,1]
U12:%x[1,1]
U13:%x[-1,1]/%x[0,1]
U14:%x[0,1]/%x[1,1]
U15:%x[-1,1]/%x[1,1]
#Bigram
B
wherein #Unigram indicates the unary feature relations; U00:%x[-1,0] denotes the template group with code U00, and %x[-1,0] refers to the character before the current character; U01:%x[0,0] denotes the template group U01, the current character; U02:%x[1,0] denotes the template group U02, the character after the current character; U03:%x[-1,0]/%x[0,0] denotes the template group U03, combining the previous character with the current character; U04:%x[0,0]/%x[1,0] denotes the template group U04, combining the current character with the following character; U05:%x[-1,0]/%x[1,0] denotes the template group U05, combining the previous character with the following character; U10:%x[-1,1] denotes the template group U10, the character feature of the previous character; U11:%x[0,1] denotes the template group U11, the character feature of the current character; U12:%x[1,1] denotes the template group U12, the character feature of the following character; U13:%x[-1,1]/%x[0,1] denotes the template group U13, combining the character features of the previous character and the current character; U14:%x[0,1]/%x[1,1] denotes the template group U14, combining the character features of the current character and the following character; U15:%x[-1,1]/%x[1,1] denotes the template group U15, combining the character features of the previous character and the following character; #Bigram indicates the bigram relation, and B is the abbreviation of Bigram.
6. The method according to claim 1, characterized in that, if the output probability of the successfully segmented material is judged to be greater than a threshold, the successfully segmented material is added to the segmented corpus.
7. The method according to claim 1, characterized in that, before training on the segmented corpus, at least one of the following steps is also included:
dividing the segmented corpus into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！";
treating consecutive non-Chinese characters in the segmented corpus as one processing unit;
labeling the segmented corpus with B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word.
8. The method according to claim 1, characterized in that, before segmenting the unsegmented corpus with the CRF model, at least one of the following steps is also included:
dividing the unsegmented corpus into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！";
treating consecutive non-Chinese characters in the unsegmented corpus as one processing unit;
labeling the unsegmented corpus with B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word.
9. The method according to claim 1, characterized by further comprising:
a search engine receiving content to be searched input by a user;
segmenting the content to be searched with the finally generated CRF model.
10. A Chinese word segmentation device, characterized by comprising:
a training module, used to train a CRF model on a segmented corpus;
a segmentation module, used to segment an unsegmented corpus with the CRF model;
an adding module, used to judge whether the successfully segmented material satisfies a set condition and, if so, to add it to the segmented corpus;
a loop module, used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
CN2011102877230A 2011-09-26 2011-09-26 Chinese words segmentation method and device Pending CN103020034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102877230A CN103020034A (en) 2011-09-26 2011-09-26 Chinese words segmentation method and device


Publications (1)

Publication Number Publication Date
CN103020034A true CN103020034A (en) 2013-04-03

Family

ID=47968653


Country Status (1)

Country Link
CN (1) CN103020034A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101046809A (en) * 2006-03-28 2007-10-03 吴风勇 New word identification method based on association rule model
CN101201818A (en) * 2006-12-13 2008-06-18 李萍 Method for calculating language structure, executing participle, machine translation and speech recognition using HMM

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li Shuanglong et al., "A Chinese word segmentation system based on conditional random fields", Microcomputer Information (《微计算机信息》) *
Yang Ping et al., "Application of the support vector machine posterior probability method in multi-task brain-computer interfaces", Chinese Journal of Biomedical Engineering (《中国生物医学工程学报》) *
Shen Qinzhong et al., "A conditional random field Chinese word segmentation method based on character-position probability features", Journal of Soochow University (《苏州大学学报》) *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130403