CN103049501B

CN103049501B - Chinese domain term recognition method based on mutual information and conditional random field models

Info

Publication number: CN103049501B
Application number: CN201210528734.8A
Authority: CN
Inventors: 彭琳; 刘宗田; 杨林楠; 张立敏
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2012-12-11
Filing date: 2012-12-11
Publication date: 2016-08-03
Anticipated expiration: 2032-12-11
Also published as: CN103049501A

Abstract

The invention discloses a Chinese domain term recognition method based on mutual information and conditional random field models. Its steps are as follows: (1) collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters, and characters other than Chinese characters in the corpus; (2) set a character string W and calculate the mutual information value of W; (3) calculate the left and right information entropy of W; (4) define an evaluation function rank(W) for the character string W, set a threshold for rank(W), calculate the evaluation function value of each character string to determine whether the string W is a word, successively compare the evaluation function value for the preceding character x_n of W with that for the following character x_{n-1}, and segment the character strings W one by one; (5) train a domain term conditional random field model using conditional random fields, and use this model to recognize domain terms. When recognizing terms, the method not only overcomes the data sparsity of legitimate terms and reduces the computational load of the conditional random field algorithm, but also improves the recognition accuracy of Chinese domain terms.

Description

Chinese domain term recognition method based on mutual information and conditional random field models

Technical field

The present invention relates to a Chinese domain term recognition method based on mutual information and conditional random field models, and belongs to the field of information technology.

Background art

According to the national standard GB/T 15237.1-2000, "Terminology work - Vocabulary", a term is the verbal designation of a general concept in a specific professional field: a word or phrase used within a subject field to express a concept or relation in that field. Terms can be divided into general terms used in daily life and domain terms used in specific fields. General terms are mostly acquired by people through everyday life and work; they are not required to express concepts strictly and precisely, and their meanings are often vague. Domain terms are systematic, generalized descriptions of the concepts of a profession; ambiguity is not allowed, the concept expressed by each technical term must be precise, and it must not vary with the user.

Domain term recognition refers to extracting professional domain terms from a corpus of a specific scientific or technical field. As an important part of automatic information extraction, domain term recognition is widely applied in natural language processing and is of great significance for improving the processing accuracy of domain text indexing and retrieval, text mining, ontology construction, text classification and clustering, latent semantic analysis, and so on. Existing domain term recognition methods for Chinese text mainly include:

(1) Chinese domain term recognition methods based on statistics. The main idea is to extract domain terms by exploiting the high degree of association among the internal components of a domain term, together with the domain characteristics of terms. The general flow of a statistics-based method is: first, use methods from statistics or information theory to build up various statistics and, according to the statistical results, determine relatively accurate seed words; then extend these repeatedly to obtain the final domain terms. Term frequency, mean, and variance are commonly used statistics, and many researchers use hypothesis-testing methods, mainly the t-test, the chi-square test, the log-likelihood ratio, pointwise mutual information, and so on. Recognizing domain terms by statistical methods requires no syntactic or semantic information, is not restricted to a particular specialized field, and does not depend on any external resource, so it is highly general.

Among these, statistics-based mutual information algorithms are the most widely used. For example, the article "Chinese Term Extraction System Based on Mutual Information" (authors: Zhang Feng, Xu Yun, Hou Yan, and Fan Xiaozhong; published in Application Research of Computers, 2005, Vol. 22, No. 5, pp. 72-73 and 77) discloses an automatic extraction system for Chinese terms. The system first calculates the internal bonding strength of character strings based on mutual information to obtain a set of term candidates; it then removes basic words from the candidate set and filters further using common collocation prefix and suffix information; finally, it performs lexical analysis on the term candidates and applies the part-of-speech composition rules of terms to obtain the final term extraction result. Experimental results show that term extraction with the mutual information algorithm achieves a precision of 72.19%, a recall of 77.98%, and an F-measure of 74.97%. As another example, the article "Term Extraction Combining C-value and Mutual Information" (authors: Liang Yinghong, Zhang Wenjing, and Zhang Youcheng; published in Computer Applications and Software, 2010, Vol. 27, No. 4, pp. 108-110) discloses a term extraction method that combines C-value and mutual information, and argues that incorporating the C-value parameter is advantageous for extracting long terms. Experimental results show that the method achieves a precision of 75.7%, a recall of 68.4%, and an F-measure of 71.9% on long term extraction, higher than other methods on the same corpus. However, the performance of such algorithms depends directly on the size of the corpus and the frequency of the candidate domain terms, and the data sparsity problem of low-frequency candidates that may nevertheless be legitimate terms is hard to solve. Therefore, when a mutual information algorithm alone is used to recognize domain terms, precision, recall, and F-measure all struggle to exceed 80%, and a satisfactory recognition result is hard to obtain;

(2) Chinese domain term recognition methods based on machine learning. The main steps are: build a training corpus manually or semi-automatically; learn a model from the training corpus with a chosen machine learning algorithm; then use the model to run domain term extraction experiments on a test corpus in order to verify the effectiveness of the algorithm. The machine learning approaches that have been applied to Chinese domain term recognition so far mainly include decision trees, support vector machines, hidden Markov models, maximum entropy models, maximum entropy Markov models, and conditional random fields. Machine-learning-based term recognition requires neither expert domain knowledge nor linguistic knowledge, is highly feasible to implement, and can achieve good recognition or extraction results when multiple term features are taken into account.

At present, among machine-learning-based Chinese domain term recognition methods, the conditional random field model is the most widely used. For example, the article "An Automatic Extraction Method for Traditional Chinese Medicine Terms" (authors: Zhang Wubei, Bai Yu, Wang Peiyan, and Zhang Guiping; published in Journal of Shenyang Aerospace University, 2011, Vol. 28, No. 1, pp. 72-75) discloses a term extraction method based on conditional random fields for the field of traditional Chinese medicine. The method treats the extraction of traditional Chinese medicine terms as a sequence labeling problem, quantifies the distributional characteristics of these terms as training features, trains a domain term model with a CRF toolkit, and then uses this model to extract terms. With "Classified Medical Records of Famous Doctors" as the traditional Chinese medicine text, the term extraction experiment reached a precision of 83.11%, a recall of 81.04%, and an F-measure of 82.06%. Similarly, the article "Research on Automatic Extraction of Military Intelligence Terms Using CRF" (authors: Jia Meiying, Yang Bingru, Zheng Dequan, and Yang Jing; published in Computer Engineering and Applications, 2009, Vol. 45, No. 32, pp. 126-129) discloses a term extraction method based on conditional random fields for the field of military intelligence. The method treats domain term recognition as a sequence labeling problem, quantifies the distributional characteristics of domain terms as training features, trains a domain term feature template with a CRF toolkit, and then uses this template to extract domain terms. Experiments show that the method recognizes military intelligence terms well: precision reaches 73.24%, recall reaches 69.57%, and the F-measure reaches 71.36%.
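The F-measure figures quoted above can be cross-checked, assuming the cited papers use the usual balanced F-measure (the harmonic mean of precision P and recall R); that definition is an assumption here, since the text does not state it. For the traditional Chinese medicine experiment:

$$F = \frac{2PR}{P + R} = \frac{2 \times 83.11 \times 81.04}{83.11 + 81.04} \approx 82.06$$

The same calculation applied to the military intelligence experiment gives approximately 71.36 from P = 73.24 and R = 69.57.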

When conditional random field algorithms are used for domain term recognition, the training corpus is almost always annotated manually or semi-automatically. The degree of human involvement is high and the workload is large, so the amount of annotated material is generally small, which limits the recognition accuracy and the application of the algorithm. In addition, the corpus must first be segmented with a general-purpose word segmentation tool; conditional random field training and testing are then carried out on the segmented corpus, and only after that can terms be recognized. Using conditional random fields for domain term recognition therefore presupposes that an existing general-purpose segmentation tool can segment the vocabulary of the domain accurately, and that domain terms are coarser-grained than the words produced by the segmentation tool. However, because professional domain terms differ from everyday words, a general-purpose segmentation tool can hardly segment a professional domain corpus accurately. Consequently, current mutual information methods and conditional random field methods achieve a low degree of automation and limited accuracy when recognizing domain terms.

Summary of the invention

In view of the above problems in the prior art, the object of the present invention is to provide a Chinese domain term recognition method based on mutual information and conditional random field models. When recognizing terms, the method not only overcomes the data sparsity of legitimate terms and reduces the computational load of the conditional random field algorithm, but also improves the recognition accuracy of Chinese domain terms.

To achieve the above object, the present invention adopts the following technical solution:

The Chinese domain term recognition method based on mutual information and conditional random field models of the present invention comprises the following specific steps:

(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters, and characters other than Chinese characters in the corpus;

(2) Set a character string W and calculate the mutual information value of W;

(3) Calculate the left and right information entropy of the character string W;

(4) Define the evaluation function rank(W) of the character string W and set a threshold for it; calculate the evaluation function value of each character string to determine whether the string W is a word; successively compare the evaluation function value for the preceding character x_n of W with that for the following character x_{n-1} to obtain the corresponding ratio for each character string W; compare this ratio with the threshold of rank(W); and segment the character strings W one by one;

(5) Taking the word itself, its part of speech, and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model using the conditional random field method, and use this model to recognize domain terms.

In step (2) above, a character string W is set and its mutual information value is calculated as follows:

Assume that a domain term consists of n characters. If the character string W is a domain term, then W consists of the n characters x₁, x₂, x₃, …, x_n, and the mutual information value of W is calculated by the following formula:

$$\mathrm{MI}(W) = \frac{N(x_{1}, x_{2}, x_{3}, \ldots, x_{n})}{N(x_{1}) + N(x_{2}) + N(x_{3}) + \cdots + N(x_{n}) - N(x_{1}, x_{2}, x_{3}, \ldots, x_{n})} \tag{1}$$

where W denotes a character string consisting of n characters;

x_i denotes the i-th character of the character string W (i = 1, 2, 3, …, n);

N(x₁) denotes the frequency with which the character x₁ occurs in the corpus;

N(x₂) denotes the frequency with which the character x₂ occurs in the corpus;

N(x₃) denotes the frequency with which the character x₃ occurs in the corpus;

N(x_n) denotes the frequency with which the character x_n occurs in the corpus;

N(x₁, x₂, x₃, …, x_n) denotes the frequency with which the characters x₁, x₂, x₃, …, x_n occur together;

MI(W) denotes the mutual information among all the characters in the character string W.
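Formula (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names `ngram_counts` and `mutual_information` are hypothetical, N(·) is taken to be a raw substring count, and the "//" boundary markers of step (1) are ignored.

```python
from collections import Counter


def ngram_counts(text: str, max_n: int = 4) -> Counter:
    """Count every substring of length 1..max_n in the text (assumed meaning of N)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def mutual_information(w: str, counts: Counter) -> float:
    """Formula (1): N(x1..xn) / (N(x1) + ... + N(xn) - N(x1..xn))."""
    joint = counts[w]                          # frequency of the whole string
    singles = sum(counts[ch] for ch in w)      # sum of single-character frequencies
    denominator = singles - joint
    return joint / denominator if denominator > 0 else 0.0


counts = ngram_counts("流苏流苏状")
print(mutual_information("流苏", counts))   # both characters always occur together
print(mutual_information("苏状", counts))
```

In this toy corpus "流苏" occurs twice and each of its characters occurs twice, so the value is 2 / (2 + 2 - 2); a string whose characters also appear elsewhere scores lower.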

In step (3) above, the left and right information entropy is calculated by the following formulas:

The left information entropy is calculated as:

$$H_{L}(W) = -\sum_{i=1}^{V_{L}} p(x_{i} W) \times \log p(x_{i} W), \qquad \sum_{i=1}^{V_{L}} p(x_{i} W) = 1 \tag{2}$$

The right information entropy is calculated as:

$$H_{R}(W) = -\sum_{i=1}^{V_{R}} p(W x_{i}) \times \log p(W x_{i}), \qquad \sum_{i=1}^{V_{R}} p(W x_{i}) = 1 \tag{3}$$

where W denotes a given character string consisting of n characters;

p(x_i W) and p(W x_i) denote the conditional probabilities that x_i occurs immediately to the left and to the right of W, respectively;

V_L and V_R denote the sets of characters that occur to the left and to the right of W;

x_i denotes the i-th character, where i = 1, 2, 3, …, n.

In step (4) above, defining the evaluation function of the character string W and using it to segment the corpus means using the mutual information and the left and right information entropy values calculated in steps (2) and (3) to evaluate how credible it is that a character string W in the corpus is a word, and thereby to judge whether the string is a word. The evaluation function of W is calculated as follows:

$$\mathrm{rank}(W) = \partial \times \mathrm{MI}(W) + (1 - \partial) \times \frac{H_{L}(W) + H_{R}(W)}{2} \tag{4}$$

where W denotes a given character string consisting of n characters;

MI(W) denotes the mutual information value among the characters of the character string W;

H_L(W) denotes the left information entropy value of the character string W;

H_R(W) denotes the right information entropy value of the character string W;

∂ is the balance factor, used to adjust the weights of the information entropy and the mutual information value in the evaluation function of the character string W.

In step (5) above, the word itself, its part of speech, and its frequency of occurrence are taken as the training features of the conditional random field, a domain term conditional random field model is trained using the conditional random field method, and this model is used to recognize domain terms. The operating steps are as follows:

(51) Annotate the corpus with the word itself, its part of speech, and its frequency of occurrence;

(52) Train on the annotated feature sequences using the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;

(53) Use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences.

Compared with the prior art, the Chinese domain term recognition method based on mutual information and conditional random field models of the present invention has the following effects:

(1) By organically combining the two classes of term recognition methods, statistics-based and machine-learning-based, the method effectively solves the data sparsity problem that arises when a statistical method alone is used for term recognition;

(2) The method uses the mutual information algorithm to segment and annotate the corpus, thereby achieving automatic corpus annotation;

(3) The method uses only the 3 most common word features for conditional random field training, which gives it strong generality across domains, greatly reduces the computational load of the conditional random field, and shortens its training time.

Brief description of the drawings

Fig. 1 is a flow chart of the Chinese domain term recognition method based on mutual information and conditional random field models of the present invention;

Fig. 2 is a flow chart of step (4) in Fig. 1;

Fig. 3 is a flow chart of step (5) in Fig. 1.

Detailed description of the invention

The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.

This embodiment illustrates the invention using term recognition in the domain of bamboo plants as an example, but this does not limit the scope of the present invention.

With reference to Fig. 1, the Chinese domain term recognition method based on mutual information and conditional random field models of the present invention comprises the following steps:

(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters, and characters other than Chinese characters in the corpus.

For example, this embodiment selects the electronic manuscript of the "Flora of China" (中国植物志), Volume 9, Bambusoideae, as the domain text corpus.

First, the corpus is randomly divided in a ratio of 4:1 into two parts: a training corpus and a test corpus;

Then, all punctuation marks, spaces, digits, ASCII characters, and characters other than Chinese characters in the corpus are retrieved, and each is marked with the symbol "//" before and after it;

Finally, with reference to the Chinese part-of-speech table, all pronouns, interjections, auxiliary words, and function words, as well as words whose first character is one of "with, have,, by, from, be, then, every, this, be somebody's turn to do, to, institute, make, be, or not, very, should with,", are each marked with the symbol "//" before and after.
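The first two preprocessing operations of step (1) can be sketched as follows. This is only an illustration under stated assumptions: the names `mark_boundaries` and `split_corpus` are hypothetical, "Chinese character" is approximated by the Unicode range U+4E00 to U+9FFF, each run of non-Chinese characters is wrapped as a whole, and the part-of-speech-based marking of function words is not covered.

```python
import random
import re

# Assumption: anything outside the basic CJK Unified Ideographs block is "non-Chinese".
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def mark_boundaries(text: str) -> str:
    """Wrap each run of punctuation, spaces, digits, ASCII, etc. in '//' markers."""
    return NON_CHINESE.sub(lambda m: "//" + m.group(0) + "//", text)


def split_corpus(sentences, train_parts: int = 4, test_parts: int = 1, seed: int = 0):
    """Randomly split the corpus into training and test parts in a 4:1 ratio."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


print(mark_boundaries("边缘被流苏状毛，"))
train, test = split_corpus(["句子%d" % i for i in range(10)])
print(len(train), len(test))
```

Wrapping whole runs rather than single characters is a design choice made here for brevity; marking every character separately would stop candidate strings at the same places.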

(2) Set a character string W and calculate the mutual information value of W by the following formula:

$$\mathrm{MI}(W) = \frac{N(x_{1}, x_{2}, x_{3}, \ldots, x_{n})}{N(x_{1}) + N(x_{2}) + N(x_{3}) + \cdots + N(x_{n}) - N(x_{1}, x_{2}, x_{3}, \ldots, x_{n})} \tag{1}$$

where W denotes a character string consisting of n characters;

x_i denotes the i-th character of the character string W, where i = 1, 2, 3, …, n;

N(x₁) denotes the frequency with which the character x₁ occurs in the corpus;

N(x₂) denotes the frequency with which the character x₂ occurs in the corpus;

N(x₃) denotes the frequency with which the character x₃ occurs in the corpus;

N(x_n) denotes the frequency with which the character x_n occurs in the corpus;

The present invention assumes that a Chinese domain term is no longer than 4 characters, and that a Chinese domain term is unlikely to contain punctuation marks, spaces, digits, ASCII characters, or characters other than Chinese characters, nor interjections, function words, demonstrative pronouns, and the like. The invention therefore calculates the mutual information values of the 2-character, 3-character, and 4-character strings starting at each character of the corpus text, and stops the calculation on reaching the marker "//". The mutual information value is calculated using formula (1) of step (2) in the Summary of the invention above.

For example, for the corpus fragment "边缘被流苏状毛//，//" ("margin covered with fringe-like hairs"), the 2-character strings are "边缘", "缘被", "被流", "流苏", "苏状", and "状毛"; the 3-character strings are "边缘被", "缘被流", "被流苏", "流苏状", and "苏状毛"; the 4-character strings are "边缘被流", "缘被流苏", "被流苏状", and "流苏状毛"; and some of the mutual information results are: MI(边, 缘) = 0.82, MI(被, 流) = 0.37, MI(边, 缘, 被) = 0.23, MI(边, 缘, 被, 流) = 0.08, MI(被, 流, 苏, 状) = 0.41;
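The enumeration of candidate strings in the example can be reproduced with a short sketch, using the example fragment 边缘被流苏状毛 followed by a marked comma. The helper name `candidates` is hypothetical, and the code assumes that "//" markers always come in pairs, so that the text between an opening and a closing marker is excluded from the candidates.

```python
def candidates(marked: str, sizes=(2, 3, 4)) -> dict:
    """List every 2-, 3- and 4-character string, never crossing a '//' marker."""
    # Splitting on '//' leaves unmarked text at even positions and
    # marked (excluded) text at odd positions.
    pieces = marked.split("//")[::2]
    found = {n: [] for n in sizes}
    for piece in pieces:
        for n in sizes:
            found[n].extend(piece[i:i + n] for i in range(len(piece) - n + 1))
    return found


result = candidates("边缘被流苏状毛//，//")
for n, strings in result.items():
    print(n, strings)
```

For a seven-character fragment this yields six 2-character, five 3-character, and four 4-character strings, matching the counts listed in the example.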

(3) Calculate the left and right information entropy of the character string W by the following formulas:

The left information entropy is calculated as:

$$H_{L}(W) = -\sum_{i=1}^{V_{L}} p(x_{i} W) \times \log p(x_{i} W), \qquad \sum_{i=1}^{V_{L}} p(x_{i} W) = 1 \tag{2}$$

The right information entropy is calculated as:

$$H_{R}(W) = -\sum_{i=1}^{V_{R}} p(W x_{i}) \times \log p(W x_{i}), \qquad \sum_{i=1}^{V_{R}} p(W x_{i}) = 1 \tag{3}$$

where W denotes a given character string consisting of n characters;

To judge whether a character string is a word, one must consider how tightly the characters inside the string are bound to one another, that is, the size of the mutual information between the characters. At the same time, one must consider the freedom of the string's boundaries: the more kinds of adjacent characters occur at the boundaries of the string, the larger its left and right information entropy, and hence the greater the freedom of its boundaries. The left and right information entropy is calculated using formulas (2) and (3) of step (3) in the Summary of the invention above.

For example, for the corpus fragment "边缘被流苏状毛//，//", some of the left information entropy results are: H_L(边, 缘) = 0.71, H_L(被, 流) = 0.91, H_L(边, 缘, 被) = 0.34, H_L(被, 流, 苏) = 0.42, H_L(边, 缘, 被, 流) = 0.17, H_L(被, 流, 苏, 状) = 0.19; and some of the right information entropy results are: H_R(边, 缘) = 0.52, H_R(被, 流) = 0.93, H_R(边, 缘, 被) = 0.56, H_R(被, 流, 苏) = 0.31, H_R(边, 缘, 被, 流) = 0.14, H_R(被, 流, 苏, 状) = 0.29;
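Formulas (2) and (3) amount to the entropy of the distribution of characters seen immediately to the left or right of W. The sketch below illustrates this on a toy string; the function names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and "//" markers are not treated specially.

```python
import math
from collections import Counter


def neighbor_counts(text: str, w: str, side: str) -> Counter:
    """Count the characters adjacent to each occurrence of w on the given side."""
    counts = Counter()
    start = text.find(w)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(w)
        if 0 <= idx < len(text):               # occurrences at the text edge have no neighbor
            counts[text[idx]] += 1
        start = text.find(w, start + 1)
    return counts


def entropy(counts: Counter) -> float:
    """Entropy of the neighbor distribution: sum of p * log(1 / p)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c) for c in counts.values())


toy = "甲流苏乙流苏丙流苏乙"
h_left = entropy(neighbor_counts(toy, "流苏", "left"))
h_right = entropy(neighbor_counts(toy, "流苏", "right"))
print(h_left, h_right)
```

In the toy string "流苏" has three different left neighbors but only two different right neighbors, so its left entropy is the larger of the two, which is the "boundary freedom" the description refers to.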

(4) Define the evaluation function rank(W) of the character string W and set a threshold for it; calculate the evaluation function value of each character string to determine whether the string W is a word; successively compare the evaluation function value for the preceding character x_n of W with that for the following character x_{n-1} to obtain the corresponding ratio for each character string W; compare this ratio with the threshold of rank(W); and segment the character strings W one by one. The operating steps are as follows:

(41) Define the evaluation function of the character string W, calculated as:

$$\mathrm{rank}(W) = \partial \times \mathrm{MI}(W) + (1 - \partial) \times \frac{H_{L}(W) + H_{R}(W)}{2} \tag{4}$$

where W denotes a given character string consisting of n characters;

H_L(W) denotes the left information entropy value of the character string W;

H_R(W) denotes the right information entropy value of the character string W;

∂ is the balance factor, used to adjust the weights of the information entropy and the mutual information value in the evaluation function.

(42) Calculate the evaluation function values and determine whether the character string W is a word.

The evaluation function values of all character strings are calculated according to the evaluation function formula of step (4) in the Summary of the invention above, where ∂ is taken as 0.5, and a character string W is considered a word when rank(W) is greater than the threshold 0.8.

For example, for the corpus fragment "边缘被流苏状毛//，//", some of the evaluation function results are: rank(边, 缘) = 0.7175, rank(被, 流) = 0.645, rank(边, 缘, 被) = 0.34, rank(被, 流, 苏) = 0.4975, rank(边, 缘, 被, 流) = 0.1175, rank(被, 流, 苏, 状) = 0.325;
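The evaluation function of formula (4) is a weighted average of the mutual information and the mean of the two entropies. As a quick check, the sketch below recomputes several of the example values from the mutual information and entropy figures given in steps (2) and (3), with the balance factor set to 0.5; the function and parameter names are hypothetical.

```python
def rank(mi: float, h_left: float, h_right: float, balance: float = 0.5) -> float:
    """Formula (4): balance * MI(W) + (1 - balance) * (H_L(W) + H_R(W)) / 2."""
    return balance * mi + (1 - balance) * (h_left + h_right) / 2


# (MI, H_L, H_R) triples taken from the worked example in the description.
examples = {
    "边缘": (0.82, 0.71, 0.52),
    "被流": (0.37, 0.91, 0.93),
    "边缘被": (0.23, 0.34, 0.56),
    "边缘被流": (0.08, 0.17, 0.14),
    "被流苏状": (0.41, 0.19, 0.29),
}
for string, (mi, h_left, h_right) in examples.items():
    print(string, round(rank(mi, h_left, h_right), 4))
```

With these inputs the function gives 0.7175, 0.645, 0.34, 0.1175, and 0.325 respectively, in agreement with the evaluation function results listed for those strings.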

(43) Successively compare the evaluation function value for the preceding character x_n of the above character string W with that for the following character x_{n-1} to obtain the corresponding ratio for each character string W; compare this ratio with the threshold of rank(W); and segment the character strings W one by one.

For example, starting from the first character of the corpus, substrings of lengths 4, 3, 2, and 1 are chosen and denoted W(4-word), W(3-word), W(2-word), and W(1-word);

Then the evaluation function values of W(4-word) and W(3-word) are compared. If rank(W(4-word)) / rank(W(3-word)) ≥ 0.8, W(4-word) is considered a new word and is marked with the symbol "*" before and after it. Otherwise, W(4-word) is not considered a new word; its last character is discarded, and the evaluation function values of W(3-word) and W(2-word) are compared. If rank(W(3-word)) / rank(W(2-word)) ≥ 0.8, W(3-word) is considered a new word and is marked with the symbol "*" before and after it. Otherwise, W(3-word) is not considered a new word; its last character is discarded, and the evaluation function value of W(2-word) is judged: if rank(W(2-word)) ≥ 0.8, W(2-word) is considered a new word and is marked with the symbol "*" before and after it. Otherwise, W(1-word) is considered a new word and is marked with the symbol "*" before and after it. Whenever a new word has been marked, the process restarts from the first character after the new word: substrings of lengths 4, 3, 2, and 1 are again chosen and denoted W(4-word), W(3-word), W(2-word), and W(1-word), and the evaluation function values are compared again; any "//" symbol encountered is skipped. This is repeated until the whole corpus has been processed.

For example, for the corpus fragment "边缘被流苏状毛//，//", substrings of lengths 4, 3, 2, and 1 are first taken starting from the first character, namely "边缘被流", "边缘被", "边缘", and "边". It is first judged whether rank(边, 缘, 被, 流) / rank(边, 缘, 被) is greater than or equal to 0.8; according to the evaluation function results of step (41), it is less than 0.8, so the string "边缘被流" is not a new word. Next, it is judged whether rank(边, 缘, 被) / rank(边, 缘) is greater than or equal to 0.8; according to the evaluation function results of step (41), it is less than 0.8, so the string "边缘被" is not a new word. Then it is judged whether rank(边, 缘) is greater than or equal to 0.8; according to the evaluation function results of step (41), rank(边, 缘) = 0.82 is greater than 0.8, so the string "边缘" is a new word. Once a new word has been determined, substrings of lengths 4, 3, 2, and 1 are again chosen starting from the first character after the new word and serve as W(4-word), W(3-word), W(2-word), and W(1-word) for the new round, namely "被流苏状", "被流苏", "被流", and "被". The above steps are repeated, skipping any "//" symbol encountered, until the end. The final segmentation result for the corpus fragment "边缘被流苏状毛//，//" is "*边缘*被*流苏状*毛//，//";
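The greedy comparison in step (43) can be sketched as a loop over a table of evaluation function values. The sketch rests on several assumptions: the comparison is read as the ratio of the longer string's value to the shorter string's value against the threshold 0.8, the names `segment` and `_score` are hypothetical, strings missing from the table score 0, and the table mixes values quoted in the walk-through for the example fragment 边缘被流苏状毛 with invented ones (the three entries for strings starting with "流" are purely illustrative).

```python
def _score(piece: str, pos: int, n: int, rank: dict) -> float:
    """Evaluation value of the n-character string at pos, or 0 if it does not fit."""
    if pos + n > len(piece):
        return 0.0
    return rank.get(piece[pos:pos + n], 0.0)


def segment(piece: str, rank: dict, threshold: float = 0.8) -> list:
    """Greedy segmentation of one marker-free stretch of text."""
    words, pos = [], 0
    while pos < len(piece):
        r4, r3, r2 = (_score(piece, pos, n, rank) for n in (4, 3, 2))
        if r3 > 0 and r4 / r3 >= threshold:
            size = 4                            # 4-character string accepted
        elif r2 > 0 and r3 / r2 >= threshold:
            size = 3                            # 3-character string accepted
        elif r2 >= threshold:
            size = 2                            # 2-character string accepted
        else:
            size = 1                            # fall back to a single character
        words.append(piece[pos:pos + size])
        pos += size                             # restart after the new word
    return words


scores = {
    "边缘被流": 0.1175, "边缘被": 0.34, "边缘": 0.82,
    "被流苏状": 0.325, "被流苏": 0.4975, "被流": 0.645,
    "流苏状毛": 0.2, "流苏状": 0.75, "流苏": 0.85,   # illustrative values only
}
print(segment("边缘被流苏状毛", scores))
```

With this table the loop accepts "边缘" in the first round, falls back to the single character "被" in the second, and accepts "流苏状" in the third, giving the same four pieces as the segmentation result in the example.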

(5) Taking the word itself, its part of speech, and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model using conditional random fields, and use this model to recognize domain terms. The operating steps are as follows:

(51) Annotate the corpus with the word itself, its part of speech, and its frequency of occurrence, as follows:

Each word obtained by segmenting the character strings W is annotated in turn with a feature sequence consisting of: the current word itself; the part of speech of the current word; and the frequency of occurrence of the current word. Using the K-Means clustering method, the frequencies of occurrence of the current words are divided into 10 grades, each grade forming one class; the 10 classes are denoted A, B, C, D, E, F, G, H, I, J, K. The annotated feature sequences are divided into two parts: the annotated training feature sequences and the annotated test feature sequences;
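The frequency grading in step (51) can be illustrated with a small one-dimensional K-Means. This is a sketch under assumptions, not the embodiment's actual procedure: the names `kmeans_1d` and `grade` are hypothetical, the initial centers are spread evenly over the sorted data, and ten labels A to J are used for the ten classes (the text lists letters up to K). The usage example clusters into 3 grades only to keep the numbers readable.

```python
def kmeans_1d(values, k: int, iterations: int = 100) -> list:
    """Lloyd's algorithm on scalar values; returns the k cluster centers, sorted."""
    data = sorted(values)
    # Deterministic start: centers taken at evenly spaced positions in the sorted data.
    centers = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:                  # converged
            break
        centers = updated
    return sorted(centers)


def grade(value: float, centers: list, labels: str = "ABCDEFGHIJ") -> str:
    """Map a frequency to the label of its nearest cluster center."""
    nearest = min(range(len(centers)), key=lambda j: abs(value - centers[j]))
    return labels[nearest]


frequencies = [1, 1, 2, 2, 50, 51, 52, 100, 101, 102]
centers = kmeans_1d(frequencies, k=3)
print(centers, grade(2, centers), grade(60, centers), grade(99, centers))
```

Replacing the raw frequency with a grade letter turns a numeric value into a small categorical feature, which is the form a column-based CRF toolkit expects.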

(52) Train on the annotated feature sequences using the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;
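CRF++ reads training data as one token per line with whitespace-separated columns, the last column being the label, and a blank line between sentences. The sketch below writes the three features of step (51) in that layout. The function name, the part-of-speech tags, the grade letters, and the label names "T" (term) and "O" (other) are all hypothetical, since the text does not specify the tag set; the template string is likewise only an assumed minimal example of CRF++ unigram feature templates.

```python
# Assumed minimal CRF++ feature template: one unigram feature per input column.
TEMPLATE = "U00:%x[0,0]\nU01:%x[0,1]\nU02:%x[0,2]\n"


def to_crfpp(sentences) -> str:
    """Format (word, pos, frequency_grade, label) rows as CRF++ training text."""
    lines = []
    for sentence in sentences:
        for word, pos, freq_grade, label in sentence:
            lines.append("\t".join((word, pos, freq_grade, label)))
        lines.append("")                        # blank line ends the sentence
    return "\n".join(lines) + "\n"


# Hypothetical annotation of the segmented example fragment.
example = [[
    ("边缘", "n", "C", "T"),
    ("被", "p", "A", "O"),
    ("流苏状", "n", "B", "T"),
    ("毛", "n", "D", "O"),
]]
print(to_crfpp(example))
```

Files produced this way would typically be passed to the toolkit's command-line programs, for instance `crf_learn template train.data model` for training and `crf_test -m model test.data` for testing; the file names here are placeholders.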

(53) Use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences, as follows:

The annotated test feature sequences are input into the conditional random field model for domain term recognition obtained by the training in step (52). This model is used to calculate feature values and identify domain terms, and the output is the identified domain terms. For example, for the corpus fragment "边缘被流苏状毛//，//", "边缘" and "流苏状" are finally identified as domain terms.

The above is the preferred embodiment of the present invention. Based on the content disclosed herein, those of ordinary skill in the art can readily conceive of equivalent or alternative solutions, all of which fall within the scope of the technical innovation of the present invention.

Claims

1. A Chinese domain term recognition method based on mutual information and conditional random field models, comprising the following specific steps:

(2) Set a character string W and calculate the mutual information value of W by the following formula:

Assume that a domain term consists of n characters. If the character string W is a domain term, then W consists of the n characters x₁, x₂, x₃, …, x_n, and the mutual information value of W is calculated as follows:

$$\mathrm{MI}(W) = \frac{N(x_{1}, x_{2}, x_{3}, \ldots, x_{n})}{N(x_{1}) + N(x_{2}) + N(x_{3}) + \cdots + N(x_{n}) - N(x_{1}, x_{2}, x_{3}, \ldots, x_{n})} \tag{1}$$

where W denotes a character string consisting of n characters;

x_i denotes the i-th character of the character string W (i = 1, 2, 3, …, n);

N(x_i) denotes the frequency with which the character x_i occurs in the corpus;

N(x₁, x₂, x₃, …, x_n) denotes the frequency with which the characters x₁, x₂, x₃, …, x_n occur together;

MI(W) denotes the mutual information among all the characters in the character string W;

(3) Calculate the left and right information entropy of the character string W;

(4) Define the evaluation function rank(W) of the character string W and set a threshold for it; calculate the evaluation function value of each character string to determine whether the string W is a word; successively compare the evaluation function value for the preceding character x_n of W with that for the following character x_{n-1} to obtain the corresponding ratio for each character string W; compare this ratio with the threshold of rank(W); and segment the character strings W one by one;

2. The Chinese domain term recognition method based on mutual information and conditional random field models according to claim 1, characterized in that the left and right information entropy in step (3) above is calculated by the following formulas:

The left information entropy is calculated as:

$$H_{L}(W) = -\sum_{i=1}^{V_{L}} p(x_{i} W) \times \log p(x_{i} W), \qquad \sum_{i=1}^{V_{L}} p(x_{i} W) = 1 \tag{2}$$

The right information entropy is calculated as:

$$H_{R}(W) = -\sum_{i=1}^{V_{R}} p(W x_{i}) \times \log p(W x_{i}), \qquad \sum_{i=1}^{V_{R}} p(W x_{i}) = 1 \tag{3}$$

where W denotes a given character string consisting of n characters;

p(x_i W) and p(W x_i) denote the conditional probabilities that x_i occurs on the left side and on the right side of W, respectively;

V_L and V_R denote the sets of characters that occur to the left and to the right of W;

x_i denotes the i-th character, where i = 1, 2, 3, …, n.

3. The Chinese domain term recognition method based on mutual information and conditional random field models according to claim 1, characterized in that, in step (4) above, defining the evaluation function of the character string W and using it to segment the corpus means using the mutual information and the left and right information entropy values calculated in steps (2) and (3) to evaluate how credible it is that a character string W in the corpus is a word, and thereby to judge whether the string is a word, wherein the evaluation function of W is calculated as follows:

$$\mathrm{rank}(W) = \partial \times \mathrm{MI}(W) + (1 - \partial) \times \frac{H_{L}(W) + H_{R}(W)}{2} \tag{4}$$

where W denotes a given character string consisting of n characters;

MI(W) denotes the mutual information value among the characters of the character string W;

H_L(W) denotes the left information entropy value of the character string W;

H_R(W) denotes the right information entropy value of the character string W;

∂ is the balance factor, used to adjust the weights of the information entropy and the mutual information value in the evaluation function of the character string W.

4. The Chinese domain term recognition method based on mutual information and conditional random field models according to claim 1, characterized in that, in step (5) above, the word itself, its part of speech, and its frequency of occurrence are taken as the training features of the conditional random field, a domain term conditional random field model is trained using the conditional random field method, and this model is used to recognize domain terms, the operating steps being as follows: