CN103020034A - Chinese words segmentation method and device - Google Patents

Chinese words segmentation method and device

Info

Publication number
CN103020034A
Authority
CN
China
Prior art keywords
word
word segmentation
corpus
character
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102877230A
Other languages
Chinese (zh)
Inventor
秦晓
万小军
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University Founder Group Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd
Priority to CN2011102877230A
Publication of CN103020034A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a Chinese word segmentation method and device. The method comprises the steps of: training a CRF (Conditional Random Field) model on a segmented corpus; segmenting an unsegmented corpus with the CRF model; judging whether the successfully segmented material satisfies a set condition and, if so, adding it to the segmented corpus; and executing the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model. The invention further provides a Chinese word segmentation device comprising a training module, a segmentation module, an adding module and a loop module. The training module is used to train the CRF model on the segmented corpus; the segmentation module is used to segment the unsegmented corpus with the CRF model; the adding module is used to judge whether the successfully segmented material satisfies the set condition and, if so, to add it to the segmented corpus; and the loop module is used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model. The Chinese word segmentation method and device improve segmentation efficiency and reduce segmentation ambiguity.

Description

Chinese word segmentation method and device
Technical field
The present invention relates to the field of Chinese language processing, and in particular to a Chinese word segmentation method and device.
Background technology
The related art provides a dictionary-based segmentation method, also called the mechanical segmentation method. The method requires a segmentation dictionary; its main characteristics are that it is fairly simple and easy to implement, but segmentation is slow and it easily produces ambiguity.
Summary of the invention
The present invention aims to provide a Chinese word segmentation method and device, in order to solve the problems of the related art that segmentation is slow and easily produces ambiguity.
In an embodiment of the present invention, a Chinese word segmentation method is provided, comprising: training a CRF model on a segmented corpus; segmenting an unsegmented corpus with the CRF model; judging whether the successfully segmented material satisfies a set condition and, if so, adding it to the segmented corpus; and executing the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
In an embodiment of the present invention, a Chinese word segmentation device is provided, comprising: a training module, used to train a CRF model on a segmented corpus; a segmentation module, used to segment an unsegmented corpus with the CRF model; an adding module, used to judge whether the successfully segmented material satisfies a set condition and, if so, to add it to the segmented corpus; and a loop module, used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
Because the Chinese word segmentation method and device of the above embodiments of the present invention adopt the CRF technique, they overcome the problems of the dictionary-based segmentation method, namely slow segmentation and easily produced ambiguity, and thereby achieve the effect of improving segmentation speed and reducing segmentation ambiguity.
Description of drawings
The accompanying drawings described herein are used to provide a further understanding of the present invention and constitute a part of the application; the illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the accompanying drawings:
Fig. 1 shows the flow chart of the Chinese word segmentation method according to an embodiment of the invention;
Fig. 2 is the diagram of cross-domain Chinese word segmentation;
Fig. 3 is the flow chart of the training and testing of the segmentation model;
Fig. 4 is the flow chart of the sentence screening;
Fig. 5 shows the schematic diagram of the Chinese word segmentation device according to an embodiment of the invention.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows the flow chart of the Chinese word segmentation method according to an embodiment of the invention, comprising:
Step S10: train a CRF model on the segmented corpus;
Step S20: segment the unsegmented corpus with the CRF model;
Step S30: judge whether the successfully segmented material satisfies the set condition and, if so, add it to the segmented corpus;
Step S40: execute the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
The dictionary-based segmentation method is slow and easily produces ambiguity, whereas the present embodiment adopts the CRF technique; it therefore overcomes the slowness and ambiguity of the dictionary-based method and achieves the effect of improving segmentation speed and reducing segmentation ambiguity.
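As a concrete illustration of steps S10 to S40, the following Python sketch outlines the self-training loop. The helpers train_crf and segment_with_model are hypothetical placeholders for the CRF toolkit calls described later (for example CRF++ training and testing); they are assumptions for illustration, not part of the patent.

def self_training(segmented, unsegmented, threshold=0.85):
    # segmented: list of already segmented sentences (the training corpus)
    # unsegmented: list of raw sentences still to be segmented
    # train_crf / segment_with_model are hypothetical wrappers around the CRF toolkit
    model = train_crf(segmented)                      # step S10: train the initial model
    while True:
        accepted, remaining = [], []
        for raw in unsegmented:                       # step S20: segment the unsegmented corpus
            words, prob = segment_with_model(model, raw)
            if prob > threshold:                      # step S30: sentence satisfies the set condition
                accepted.append(words)
            else:
                remaining.append(raw)
        if not accepted:                              # step S40: corpus no longer grows, stop
            return model                              # final CRF model
        segmented.extend(accepted)                    # add the accepted sentences to the corpus
        unsegmented = remaining
        model = train_crf(segmented)                  # retrain on the enlarged corpus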
The conditional random field model is a typical discriminative model proposed by Lafferty in 2001. It models the target sequence on the basis of the observation sequence and focuses on sequence labeling problems (in the present invention, the labeling task is word segmentation). The conditional random field model has the advantage of a discriminative model and, like a generative model, takes the transition probabilities between context labels into account, performing global parameter optimization and decoding over the whole sequence; it thereby solves the label bias problem that other discriminative models (such as the maximum entropy Markov model) find difficult to avoid.
CRF (Conditional Random Field) theory can be used for natural language processing tasks such as sequence labeling, data segmentation and chunk parsing. It has been applied with good results to Chinese natural language processing tasks such as Chinese word segmentation, Chinese person-name recognition and ambiguity resolution. The major CRF-based implementations at present include CRF, FlexCRF and CRF++. The conditional random field model is an undirected graphical model; under the condition of a given observation sequence to be labeled, it computes the joint probability distribution of the whole label sequence, rather than defining the distribution of the next state given the current state. That is, given an observation sequence O, it finds the optimal label sequence S.
Preferably, step S10 comprises: using effective character features to express the segmented corpus in feature-vector form, and training on it to obtain the CRF model.
Preferably, using the effective character features to express the segmented corpus in feature-vector form comprises:
judging whether a character in the segmented corpus is a digit and, if so, representing it with the tag "N" (Number);
judging whether the character is a letter and, if so, representing it with the tag "L" (Letter);
judging whether the character is a punctuation mark (including Chinese and Western punctuation) and, if so, representing it with the tag "P" (Punctuation);
judging whether the character is a time word and, if so, representing it with the tag "D" (Date);
if all of the above judgements are negative, marking it as "C" (Character), which represents all ordinary characters other than the above four types. A sketch of this feature assignment follows.
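A minimal Python sketch of this character-feature assignment is given below; the set of time-word characters and the punctuation inventory are illustrative assumptions, not fixed by the patent.

import string

TIME_CHARS = set("年月日时分秒")                       # assumed inventory of time-word characters
PUNCT_CHARS = set(string.punctuation) | set("，。、；：？！“”‘’（）《》…—")  # Chinese and Western punctuation (illustrative subset)

def char_feature(ch):
    # Map a character to its feature tag: N, L, P, D or C.
    if ch.isdigit():
        return "N"                                    # digit
    if ch.isascii() and ch.isalpha():
        return "L"                                    # Latin letter
    if ch in PUNCT_CHARS:
        return "P"                                    # punctuation mark
    if ch in TIME_CHARS:
        return "D"                                    # time word
    return "C"                                        # ordinary character

print([char_feature(c) for c in "1998年新年讲话"])
# ['N', 'N', 'N', 'N', 'D', 'C', 'D', 'C', 'C']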
Preferably, segmenting the unsegmented corpus with the CRF model comprises: using the effective character features to convert the unsegmented corpus into feature-vector form, and segmenting it with the CRF model.
Preferably, using the effective character features to convert the unsegmented corpus into feature-vector form comprises:
judging whether a character in the unsegmented corpus is a digit and, if so, representing it with the tag "N";
judging whether the character is a letter and, if so, representing it with the tag "L";
judging whether the character is a punctuation mark and, if so, representing it with the tag "P";
judging whether the character is a time word and, if so, representing it with the tag "D";
if all of the above judgements are negative, marking it as "C".
Preferably, a suitable feature template is designed and the CRF model is trained with it to obtain the initial statistical model. The preferred embodiment of the present invention uses a feature template over a 5-character window; the template is as follows:
#Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]
U10:%x[-1,1]
U11:%x[0,1]
U12:%x[1,1]
U13:%x[-1,1]/%x[0,1]
U14:%x[0,1]/%x[1,1]
U15:%x[-1,1]/%x[1,1]
#Bigram
B
This template only uses the Unigram (unary) part, where #Unigram indicates the unary feature relations; U00:%x[-1,0] denotes the template group with code U00, and %x[-1,0] refers to the character before the current character; U01:%x[0,0] denotes the template group U01, the current character; U02:%x[1,0] denotes the template group U02, the character after the current character; U03:%x[-1,0]/%x[0,0] denotes the template group U03, combining the previous character with the current character; U04:%x[0,0]/%x[1,0] denotes the template group U04, combining the current character with the following character; U05:%x[-1,0]/%x[1,0] denotes the template group U05, combining the previous character with the following character; U10:%x[-1,1] denotes the template group U10, the character feature of the previous character; U11:%x[0,1] denotes the template group U11, the character feature of the current character, i.e. the character-feature column is considered jointly with the character column; U12:%x[1,1] denotes the template group U12, the character feature of the following character; U13:%x[-1,1]/%x[0,1] denotes the template group U13, combining the character features of the previous character and the current character; U14:%x[0,1]/%x[1,1] denotes the template group U14, combining the character features of the current character and the following character; U15:%x[-1,1]/%x[1,1] denotes the template group U15, combining the character features of the previous character and the following character; #Bigram indicates the bigram relation, and B is the abbreviation of Bigram; the bigram relation is not used in the present embodiment.
A 5-character window here means that only the window of five characters consisting of the two characters before the current character, the current character and the two characters after it is considered; for example, "中共中央总" is such a 5-character window string. Taking "中共中央总书记、国家主席江泽民" ("General Secretary of the CPC Central Committee and President Jiang Zemin") as an example and supposing the current character is "记", the meaning of each template is as shown below:
(Figure BSA00000580902200071: a table listing, for each template, the feature it extracts for the current character "记".)
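To make the template instantiation concrete, the Python sketch below expands the unigram templates of this section at the position of "记" in the example sentence; the _B padding token and the simplified char_feature stand-in are assumptions for illustration, not CRF++ internals.

def char_feature(ch):                                  # minimal stand-in for the feature function sketched above
    return "P" if ch in "、，。？！；：" else "C"

sentence = "中共中央总书记、国家主席江泽民"
rows = [(ch, char_feature(ch)) for ch in sentence]     # one (character, feature) row per character

TEMPLATES = {                                          # the unigram templates of this embodiment
    "U00": [(-1, 0)], "U01": [(0, 0)], "U02": [(1, 0)],
    "U03": [(-1, 0), (0, 0)], "U04": [(0, 0), (1, 0)], "U05": [(-1, 0), (1, 0)],
    "U10": [(-1, 1)], "U11": [(0, 1)], "U12": [(1, 1)],
    "U13": [(-1, 1), (0, 1)], "U14": [(0, 1), (1, 1)], "U15": [(-1, 1), (1, 1)],
}

def expand(templates, rows, i):
    # Return the feature string produced by each template at position i.
    feats = {}
    for name, refs in templates.items():
        values = []
        for offset, col in refs:
            j = i + offset
            values.append(rows[j][col] if 0 <= j < len(rows) else "_B")   # assumed padding token
        feats[name] = "/".join(values)
    return feats

i = sentence.index("记")                               # current character: 记
for name, value in sorted(expand(TEMPLATES, rows, i).items()):
    print(name, value)                                 # U00 书, U01 记, U02 、, U03 书/记, U04 记/、, ...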
Preferably, if the output probability of a successfully segmented sentence is judged to be greater than a threshold, the successfully segmented sentence is added to the segmented corpus. For example, the parameter of the CRF toolkit is set at test time so that it displays the conditional probability of the output sentence and the marginal probability of each character's label. Using this conditional probability, sentences with a probability greater than 0.85 are selected. This value was chosen through experiment: it should be high enough that the segmentation of the selected sentences has a certain confidence, yet low enough that unsegmented sentences can still be selected, because the higher the value, the fewer sentences can be selected.
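The probabilities used for this screening can be produced with the CRF++ command-line tools; the sketch below shows one possible invocation, with the file names template, train.data, test.data and model as placeholders.

import subprocess

# crf_learn trains a model from the feature template and the feature-vector training file;
# crf_test -v1 labels the test file and prints the sentence-level conditional probability
# and the per-character marginal probabilities.
subprocess.run(["crf_learn", "template", "train.data", "model"], check=True)
result = subprocess.run(["crf_test", "-v1", "-m", "model", "test.data"],
                        check=True, capture_output=True, text=True)
print(result.stdout)      # one '#<probability>' line per sentence, then one row per character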
For example, the screening procedure is as follows:
Train the statistical model from the corpus; during testing of the model, adjust the relevant parameter so that the conditional probability of the output and the marginal probabilities of the labels are shown in the test result. In this test the parameter -v1 is added, which displays the conditional probability and the marginal probabilities at the same time. For example, the sentence "一天晚上，突然下起了雨。" ("One evening, it suddenly started to rain.") gives the following test result:
#0.906430
一 C S/0.960094
天 C S/0.959499
晚 C B/0.987164
上 C E/0.992140
， P S/0.999996
突 C B/0.999836
然 C E/0.998923
下 C S/0.961316
起 C S/0.961338
了 C S/0.988134
雨 C S/0.988271
。 P S/1.000000
In the output above, #0.906430 is the probability that the characters of the whole sentence are assigned the labels shown; the higher this value, the higher the confidence. "一 C S/0.960094" means that, among its possible labels (B, B2, B3, M, E, S), the character "一" is assigned the label "S" with probability 0.960094; the higher this probability, the more likely the "S" label is and the less likely the character is to be assigned any other label.
According to the conditional probability of each output sentence, sentences whose probability exceeds the selected threshold of 0.85 are converted into the standard segmented form according to the labels of the atomic tokens. In the example above the conditional probability is 0.906430, which clearly satisfies the condition, so the sentence is added and converted into the standard segmented form, namely "一 天 晚上 ， 突然 下 起 了 雨 。".
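A small Python sketch of this screening and conversion step follows; it assumes the output format shown above (a "#probability" line, then one "character feature label/marginal" line per character, with a blank line between sentences), which may differ slightly between toolkit versions.

def screen_sentences(output_lines, threshold=0.85):
    # Select sentences whose conditional probability exceeds the threshold and
    # rebuild their standard segmented form from the per-character labels.
    selected, prob, chars = [], None, []
    def flush():
        if prob is not None and prob > threshold and chars:
            words, current = [], ""
            for ch, label in chars:
                current += ch
                if label in ("E", "S"):               # a word ends on label E or S
                    words.append(current)
                    current = ""
            if current:
                words.append(current)
            selected.append(" ".join(words))
    for line in output_lines:
        line = line.strip()
        if line.startswith("#"):                      # sentence-level conditional probability
            prob = float(line.lstrip("#"))
        elif not line:                                # blank line ends the sentence
            flush()
            prob, chars = None, []
        else:
            ch, _feat, tag = line.split()[:3]
            chars.append((ch, tag.split("/")[0]))     # keep the label, drop its marginal
    flush()
    return selected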
Preferably, before training on the segmented corpus, at least one of the following steps is also included:
Sentence splitting: the annotated raw text is divided into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！". This mainly handles cases where the punctuation is more complicated, such as double quotation marks that contain a full stop, semicolon, exclamation mark or question mark. The input sentence sequence is scanned from left to right; a string enclosed by double quotation marks "“ ”" is treated as one unit, and full stops, exclamation marks and question marks inside the quotation marks do not trigger sentence splitting, while sentence separators outside the quotation marks do. For example, the original text "We are about to see off the Year of the Ox with the joy of a good harvest and welcome the Year of the Tiger with high-spirited fighting will. In the new year our great motherland will be full of vitality and full of hope." should be divided into two sentences, namely "We are about to see off the Year of the Ox with the joy of a good harvest and welcome the Year of the Tiger with high-spirited fighting will." and "In the new year our great motherland will be full of vitality and full of hope." A sketch of this splitting rule is given below.
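A minimal Python sketch of the splitting rule, assuming the separators 。；？！ and treating text enclosed in the double quotation marks “ ” as a single unit.

SEPARATORS = set("。；？！")

def split_sentences(text):
    # Split on Chinese sentence separators, but never inside “...” quotation marks.
    sentences, buf, in_quote = [], "", False
    for ch in text:
        buf += ch
        if ch == "“":
            in_quote = True
        elif ch == "”":
            in_quote = False
        elif ch in SEPARATORS and not in_quote:
            sentences.append(buf)
            buf = ""
    if buf:
        sentences.append(buf)
    return sentences

print(split_sentences("他说：“下雨了。”我们回家吧。今天不出门了。"))
# ['他说：“下雨了。”我们回家吧。', '今天不出门了。']  (the 。 inside the quotes does not split)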
Atomic segmentation: this operation mainly applies to sentences containing many non-Chinese characters such as digits and English characters. When characters are read, consecutive non-Chinese characters such as digits and English letters are treated as a single whole. An ordinary Chinese character is treated as one processing unit on its own, but digits and English letters appearing in Chinese text need special handling: a run of consecutive non-Chinese characters is treated as one processing unit, and a string connected by "." is likewise treated as a whole. For example, the atomic segmentation of "1998年新年讲话" ("the 1998 New Year address") is: 1998, 年, 新, 年, 讲, 话, rather than 1, 9, 9, 8, 年, 新, 年, 讲, 话. The advantage of this operation is that it is more conducive to learning by the CRF model: with purely character-based learning, the features of some non-Chinese characters are difficult to learn well. The operation helps improve the segmentation result and the segmentation accuracy. A sketch follows.
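A Python sketch of the atomic segmentation; the character classes merged into one unit (digits, Latin letters, "." and "%") are illustrative assumptions.

def is_atom_char(ch):
    # Characters that are merged into a single atomic unit.
    return ch.isdigit() or (ch.isascii() and ch.isalpha()) or ch in ".%"

def atomic_segment(sentence):
    # Runs of non-Chinese characters stay together; every other character is its own unit.
    units, buf = [], ""
    for ch in sentence:
        if is_atom_char(ch):
            buf += ch                                  # extend the current non-Chinese run
        else:
            if buf:
                units.append(buf)                      # close the run
                buf = ""
            units.append(ch)                           # an ordinary Chinese character is one unit
    if buf:
        units.append(buf)
    return units

print(atomic_segment("1998年新年讲话"))
# ['1998', '年', '新', '年', '讲', '话']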
Label adding: a 6-tag scheme is used here, namely B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word. For example, in the word "中华人民共和国" ("People's Republic of China"), "中" is labeled "B", "华" is labeled "B2", "人" is labeled "B3", "民" is labeled "M", "共" is labeled "M", "和" is labeled "M", and "国" is labeled "E".
For example, converting the segmented sentence "中共中央/总书记/、/国家/主席/江/泽民" ("General Secretary of the CPC Central Committee and President Jiang Zemin") into feature-vector form gives:
中 C B
共 C B2
中 C B3
央 C E
总 C B
书 C B2
记 C E
、 P S
国 C B
家 C E
主 C B
席 C E
江 C S
泽 C B
民 C E
In the example above, take the word "中共中央" as an illustration: its four characters are ordinary characters, i.e. non-digit, non-letter, non-punctuation characters, so their character feature is marked "C". Meanwhile "中" is the first character of the word and is therefore labeled "B", "共" is the second character and is labeled "B2", the second "中" is the third character and is labeled "B3", and "央" is the last character and is labeled "E". A sketch of this conversion follows.
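A Python sketch that converts a segmented sentence into the (character, feature, label) rows shown above with the 6-tag scheme; the char_feature stand-in is a simplified assumption.

def char_feature(ch):                                  # minimal stand-in for the feature function above
    return "P" if ch in "、，。？！；：" else "C"

def tag_word(word):
    # Return the 6-tag labels (B, B2, B3, M, E, S) for one word.
    if len(word) == 1:
        return ["S"]
    tags = ["B", "B2", "B3"][:len(word) - 1]           # first, second, third characters
    tags += ["M"] * (len(word) - 1 - len(tags))        # middle characters after the third
    tags.append("E")                                   # final character
    return tags

def to_feature_vectors(words):
    # Convert a list of words into (character, feature, label) rows.
    rows = []
    for word in words:
        for ch, tag in zip(word, tag_word(word)):
            rows.append((ch, char_feature(ch), tag))
    return rows

for ch, feat, tag in to_feature_vectors(["中共中央", "总书记", "、", "国家", "主席", "江", "泽民"]):
    print(ch, feat, tag)                               # 中 C B, 共 C B2, ..., 江 C S, 泽 C B, 民 C E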
Preferably, before the unsegmented corpus is segmented with the CRF model, at least one of the following steps is also included:
dividing the unsegmented corpus into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！";
treating consecutive non-Chinese characters in the unsegmented corpus as one processing unit;
labeling the unsegmented corpus with B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word.
Preferably, the method further comprises: a search engine receives content to be searched input by a user; the finally generated CRF model is used to segment the content to be searched.
The inventors have found that current Chinese word segmentation methods mainly target segmentation within a single domain, while research on cross-domain Chinese word segmentation is relatively scarce. In real life, when a search engine is used, the domain of the search content is not specified during the search, but users' demands are broad and touch every field of society, so the process in which a user enters a query and the search engine displays results is in essence a cross-domain search. Cross-domain segmentation is therefore significant, and its accuracy directly affects the accuracy of the search results.
Cross-domain segmentation faces two difficult problems, namely segmentation ambiguity and out-of-vocabulary word recognition. The vocabularies of different domains differ from one another, domain terms are highly specialized, and with the development and progress of society a large number of new words keep emerging. Statistics-based Chinese word segmentation methods, although they can improve out-of-vocabulary word recognition to a certain extent, cannot learn cross-domain knowledge well.
The embodiment of the invention can be applied well to cross-domain segmentation. In a preferred embodiment of the invention, the segmented corpus can be a commercially available corpus covering multiple domains; "cross-domain" means that the sources of the corpus content differ, for example the material in the corpus comes from different fields such as news, literature, computing and finance. This preferred embodiment can train on the cross-domain corpus and obtain a cross-domain CRF model, and can therefore perform cross-domain segmentation. It makes comprehensive use of the knowledge in the cross-domain corpus, fully learns the knowledge features in each domain, and obtains a statistical model with a certain domain adaptivity, which can effectively improve cross-domain segmentation.
Fig. 2 shows the cross-domain Chinese word segmentation process according to an embodiment of the invention, comprising:
Step S202: take the annotated People's Daily corpus released by Peking University as the original corpus;
Step S204: preprocess the annotated corpus: divide the annotated sentences into relatively independent sentences using the sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters such as digits and English letters as a single whole; and add labels to the corpus using the 6-tag scheme;
Step S206: train on the preprocessed corpus;
Step S208: generate the CRF statistical model;
Step S210: take unsegmented corpora from four other domains, namely literature, computing, medicine and finance, as the unannotated data;
Step S212: preprocess the unannotated data: divide it into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters in the unsegmented corpus as one processing unit; and label the unsegmented corpus with B, B2, B3, M, E, S;
Step S214: obtain the preprocessing result;
Step S216: use the CRF model obtained above to segment the preprocessed unannotated data;
Step S218: obtain the preliminary segmentation result;
Step S220: from the above segmentation result, select the sentences that satisfy a certain condition and add this material to the annotated corpus set;
Step S222: obtain the expanded annotated corpus set;
Step S224: train on the expanded annotated corpus to obtain a CRF model, and use this model to segment the unannotated data;
Step S226: revise the above segmentation result using some simple rules of Chinese word segmentation.
Fig. 3 shows the flow chart of the training and testing of the segmentation model according to an embodiment of the invention, comprising:
Step S302: read one line of text from the training corpus as input;
Step S304: perform sentence splitting on the input, dividing it into relatively independent sentences at the separators "。", "；", "？", "！";
Step S306: perform atomic segmentation on the sentences, treating consecutive non-Chinese characters as one processing unit;
Step S308: for the sentences, use the effective character features and the 6-tag scheme to convert them into feature-vector form;
Step S310: judge whether all material in the training set has been processed; if not, continue the above operations;
Step S312: train on the processed training corpus to obtain the CRF model;
Step S314: read one line of text from the test corpus as input;
Step S316: perform sentence splitting on the input, dividing it into relatively independent sentences at the separators "。", "；", "？", "！";
Step S318: for the sentences, add a sentence-split mark at each split position;
Step S320: perform atomic segmentation on the sentences, treating consecutive non-Chinese characters as one processing unit;
Step S322: for the sentences, use the effective character features and the 6-tag scheme to convert them into feature-vector form;
Step S324: judge whether all material in the test set has been processed; if not, continue the above operations;
Step S326: use the CRF model obtained above to segment the processed material to be annotated.
Fig. 4 shows the flow chart of the sentence screening according to an embodiment of the invention, comprising:
Step S402: take the People's Daily corpus released by Peking University as the original corpus;
Step S404: preprocess the annotated corpus: divide it into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters as one processing unit; and label the segmented corpus with B, B2, B3, M, E, S;
Step S406: train on the corpus set obtained above to obtain the CRF model;
Step S408: take unsegmented corpora from four other domains, namely literature, computing, medicine and finance, as the unannotated data;
Step S410: preprocess the unannotated data: divide it into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！"; treat consecutive non-Chinese characters in the unsegmented corpus as one processing unit; and label the unsegmented corpus with B, B2, B3, M, E, S;
Step S412: obtain the feature-vector form of the material to be annotated;
Step S414: use the CRF model obtained above to segment the preprocessed unannotated data; by setting the parameter of the CRF toolkit, the output probability of each sentence can be shown;
Step S416: judge whether there is any sentence whose output probability is greater than the threshold;
Step S418: add the sentences whose output probability is greater than the threshold to the annotated corpus;
Step S420: when no remaining sentence has an output probability greater than the threshold, obtain the final annotated corpus set.
Fig. 5 shows the schematic diagram of the Chinese word segmentation device according to an embodiment of the invention, comprising:
a training module 10, used to train a CRF model on the segmented corpus;
a segmentation module 20, used to segment the unsegmented corpus with the CRF model;
an adding module 30, used to judge whether the successfully segmented material satisfies the set condition and, if so, to add it to the segmented corpus;
a loop module 40, used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
As can be seen from the above description, the present invention improves segmentation speed and reduces segmentation ambiguity.
Obviously, those skilled in the art should understand that each of the above modules or steps of the present invention can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they can each be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. Thus the present invention is not restricted to any specific combination of hardware and software.
The above are only the preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A Chinese word segmentation method, characterized by comprising:
training a CRF model on a segmented corpus;
segmenting an unsegmented corpus with the CRF model;
judging whether the successfully segmented material satisfies a set condition and, if so, adding it to the segmented corpus;
executing the above steps in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
2. The method according to claim 1, characterized in that training the CRF model on the segmented corpus comprises:
using effective character features to express the segmented corpus in feature-vector form, and training on it to obtain the CRF model.
3. The method according to claim 2, characterized in that segmenting the unsegmented corpus with the CRF model comprises:
using the effective character features to convert the unsegmented corpus into feature-vector form, and segmenting it with the CRF model.
4. The method according to claim 3, characterized in that using the effective character features to express the segmented corpus in feature-vector form comprises: judging whether a character in the segmented corpus is a digit and, if so, representing it with the tag "N"; judging whether the character is a letter and, if so, representing it with the tag "L"; judging whether the character is a punctuation mark and, if so, representing it with the tag "P"; judging whether the character is a time word and, if so, representing it with the tag "D"; and, if all of the above judgements are negative, marking it as "C";
and using the effective character features to convert the unsegmented corpus into feature-vector form comprises: judging whether a character in the unsegmented corpus is a digit and, if so, representing it with the tag "N"; judging whether the character is a letter and, if so, representing it with the tag "L"; judging whether the character is a punctuation mark and, if so, representing it with the tag "P"; judging whether the character is a time word and, if so, representing it with the tag "D"; and, if all of the above judgements are negative, marking it as "C".
5. The method according to claim 1, characterized in that the template used to train on the segmented corpus has the following form:
#Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]
U10:%x[-1,1]
U11:%x[0,1]
U12:%x[1,1]
U13:%x[-1,1]/%x[0,1]
U14:%x[0,1]/%x[1,1]
U15:%x[-1,1]/%x[1,1]
#Bigram
B
wherein #Unigram indicates the unary feature relations; U00:%x[-1,0] denotes the template group with code U00, and %x[-1,0] refers to the character before the current character; U01:%x[0,0] denotes the template group U01, the current character; U02:%x[1,0] denotes the template group U02, the character after the current character; U03:%x[-1,0]/%x[0,0] denotes the template group U03, combining the previous character with the current character; U04:%x[0,0]/%x[1,0] denotes the template group U04, combining the current character with the following character; U05:%x[-1,0]/%x[1,0] denotes the template group U05, combining the previous character with the following character; U10:%x[-1,1] denotes the template group U10, the character feature of the previous character; U11:%x[0,1] denotes the template group U11, the character feature of the current character; U12:%x[1,1] denotes the template group U12, the character feature of the following character; U13:%x[-1,1]/%x[0,1] denotes the template group U13, combining the character features of the previous character and the current character; U14:%x[0,1]/%x[1,1] denotes the template group U14, combining the character features of the current character and the following character; U15:%x[-1,1]/%x[1,1] denotes the template group U15, combining the character features of the previous character and the following character; #Bigram indicates the bigram relation, and B is the abbreviation of Bigram.
6. The method according to claim 1, characterized in that, if the output probability of the successfully segmented material is judged to be greater than a threshold, the successfully segmented material is added to the segmented corpus.
7. The method according to claim 1, characterized in that, before training on the segmented corpus, at least one of the following steps is also included:
dividing the segmented corpus into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！";
treating consecutive non-Chinese characters in the segmented corpus as one processing unit;
labeling the segmented corpus with B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word.
8. The method according to claim 1, characterized in that, before segmenting the unsegmented corpus with the CRF model, at least one of the following steps is also included:
dividing the unsegmented corpus into relatively independent sentences according to the Chinese sentence separators "。", "；", "？", "！";
treating consecutive non-Chinese characters in the unsegmented corpus as one processing unit;
labeling the unsegmented corpus with B, B2, B3, M, E, S, which denote respectively the first character of a word, the second character, the third character, a middle character after the third, the final character, and a single-character word.
9. The method according to claim 1, characterized by further comprising:
a search engine receiving content to be searched input by a user;
segmenting the content to be searched with the finally generated CRF model.
10. A Chinese word segmentation device, characterized by comprising:
a training module, used to train a CRF model on a segmented corpus;
a segmentation module, used to segment an unsegmented corpus with the CRF model;
an adding module, used to judge whether the successfully segmented material satisfies a set condition and, if so, to add it to the segmented corpus;
a loop module, used to call the training module, the segmentation module and the adding module in a loop until the segmented corpus no longer grows, obtaining the final CRF model.
CN2011102877230A 2011-09-26 2011-09-26 Chinese words segmentation method and device Pending CN103020034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102877230A CN103020034A (en) 2011-09-26 2011-09-26 Chinese words segmentation method and device


Publications (1)

Publication Number Publication Date
CN103020034A true CN103020034A (en) 2013-04-03

Family

ID=47968653


Country Status (1)

Country Link
CN (1) CN103020034A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101046809A (en) * 2006-03-28 2007-10-03 吴风勇 New word identification method based on association rule model
CN101201818A (en) * 2006-12-13 2008-06-18 李萍 Method for calculating language structure, executing participle, machine translation and speech recognition using HMM

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li Shuanglong et al., "A Chinese word segmentation system based on conditional random fields", Microcomputer Information (《微计算机信息》) *
Yang Ping et al., "Application of the support vector machine posterior probability method in multi-task brain-computer interfaces", Chinese Journal of Biomedical Engineering (《中国生物医学工程学报》) *
Shen Qinzhong et al., "A conditional random field Chinese word segmentation method based on character-position probability features", Journal of Soochow University (《苏州大学学报》) *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130403