CN103049501B - Chinese domain term recognition method based on mutual information and conditional random field models - Google Patents

Chinese domain term recognition method based on mutual information and conditional random field models

Info

Publication number
CN103049501B
CN103049501B CN201210528734.8A CN201210528734A CN103049501B CN 103049501 B CN103049501 B CN 103049501B CN 201210528734 A CN201210528734 A CN 201210528734A CN 103049501 B CN103049501 B CN 103049501B
Authority
CN
China
Prior art keywords
word
word string
term
random field
evaluation function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210528734.8A
Other languages
Chinese (zh)
Other versions
CN103049501A (en
Inventor
彭琳
刘宗田
杨林楠
张立敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201210528734.8A
Publication of CN103049501A
Application granted
Publication of CN103049501B
Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a Chinese domain term recognition method based on mutual information and conditional random field models, comprising the following steps: (1) collect a domain text corpus and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus; (2) define a character string W and calculate the mutual information value of W; (3) calculate the left and right information entropy of W; (4) define an evaluation function rank(W) for W, set a threshold for rank(W), calculate the evaluation function value of each string, determine whether W is a word by successively comparing the evaluation function value of the preceding string x_n in W with that of the following string x_{n-1}, and segment the strings into words one by one; (5) train a domain term conditional random field model with the conditional random field method and use the model to recognize domain terms. When recognizing terms, the method not only overcomes the data sparsity of valid terms and reduces the computational load of the conditional random field algorithm, but also improves the recognition accuracy of Chinese domain terms.

Description

Chinese domain term recognition method based on mutual information and conditional random field models
Technical field
The present invention relates to a Chinese domain term recognition method based on mutual information and conditional random field models, and belongs to the field of information technology.
Background art
According to the definition in national standard GB/T 15237.1-2000, "Terminology work - Vocabulary", a term is the verbal designation of a general concept in a specific professional field; it is a word or phrase used in a subject field to represent a concept or relation in that field. Terms can be divided into general terms used in daily life and domain terms used in specific fields. General terms are mostly acquired by people through life and work; they are not required to express concepts strictly and accurately, and their meaning is often vague. Domain terms are systematic, generalizing descriptions of the concepts of a profession; ambiguity is not allowed, and the concept expressed by each technical term must be precise and must not vary from user to user.
Domain term recognition refers to extracting professional domain terms from a corpus of a specific scientific or technical field. As an important part of information extraction, automatic domain term recognition is widely applied in natural language processing and is of great significance for improving the accuracy of domain text indexing and retrieval, text mining, ontology construction, text classification and clustering, latent semantic analysis and so on. The existing domain term recognition methods for Chinese text mainly include the following:
(1) Chinese domain term recognition methods based on statistics. The main idea is to extract domain terms by exploiting the high degree of association among the internal constituents of a domain term together with the domain characteristics of terms. The general procedure of a statistics-based method is as follows: first, statistics or information theory is used to build up various statistical measures, and relatively accurate seed words are determined from the statistical results; the seed words are then extended repeatedly to obtain the final domain terms. Term frequency, mean and variance are commonly used statistics, and many scholars use hypothesis testing, mainly the t-test, the chi-square test, the log-likelihood ratio, pointwise mutual information and the like. Recognizing domain terms statistically requires no syntactic or semantic information, is not restricted to a particular specialized field and does not depend on any external resource, so it is highly general.
Among these, the statistics-based mutual information algorithm is the most widely used. For example, the article "Chinese Term Extraction System Based on Mutual Information" (authors: Zhang Feng, Xu Yun, Hou Yan, Fan Xiaozhong; published in Application Research of Computers, 2005, vol. 22, no. 5, pp. 72-73, 77) discloses an automatic Chinese term extraction system. The system first calculates the internal bonding strength of character strings based on mutual information to obtain a set of term candidates; it then removes basic words from the candidate set and filters further using common collocation prefix and suffix information; finally it performs lexical analysis on the candidates and applies part-of-speech composition rules of terms to obtain the final extraction result. The experimental results show that term extraction with the mutual information algorithm achieves a precision of 72.19%, a recall of 77.98% and an F-measure of 74.97%. As another example, the article "Term Extraction Combining C-value and Mutual Information" (authors: Liang Yinghong, Zhang Wenjing, Zhang Youcheng; published in Computer Applications and Software, 2010, vol. 27, no. 4, pp. 108-110) discloses a term extraction method that combines C-value with mutual information; the method incorporates the C-value parameter, which has an advantage in extracting long terms. The experimental results show that for long terms the method achieves a precision of 75.7%, a recall of 68.4% and an F-measure of 71.9%, higher than other methods on the same corpus. However, the performance of this algorithm depends directly on the size of the corpus and on the frequency of the candidate domain terms. The data sparsity problem, in which low-frequency candidates may nevertheless be valid terms, is difficult to solve. Consequently, when the mutual information algorithm alone is used to recognize domain terms, the precision, recall and F-measure all struggle to exceed 80%, and a satisfactory recognition result is hard to obtain.
(2) Chinese domain term recognition methods based on machine learning. The main steps are as follows: a training corpus is built manually or semi-automatically; a model is learned from the training corpus with a chosen machine learning algorithm; the model is then used to run domain term extraction experiments on a test corpus in order to verify the effectiveness of the algorithm. The machine learning theories applied to Chinese domain term recognition so far mainly include decision trees, support vector machines, hidden Markov models, maximum entropy models, maximum entropy Markov models and conditional random fields. Machine-learning-based term recognition needs neither expert domain knowledge nor linguistic knowledge, is highly feasible to implement, and can achieve good recognition or extraction results when multiple term features are taken into account.
At present, among the machine-learning-based Chinese domain term recognition methods, the conditional random field model is the most widely used. For example, the article "An Automatic Extraction Method for Traditional Chinese Medicine Terms" (authors: Zhang Wubei, Bai Yu, Wang Peiyan, Zhang Guiping; published in Journal of Shenyang Aerospace University, 2011, vol. 28, no. 1, pp. 72-75) discloses a term extraction method based on conditional random fields for the field of traditional Chinese medicine. The method treats the extraction of traditional Chinese medicine terms as a sequence labeling problem, quantifies the distributional characteristics of these terms as training features, trains a domain term model with a CRF toolkit, and then uses the model to extract terms. With "Classified Medical Records of Famous Physicians" as the traditional Chinese medicine text, the term extraction experiment reached a precision of 83.11%, a recall of 81.04% and an F-measure of 82.06%. The article "Research on Automatic Extraction of Military Intelligence Terms Using CRF Technology" (authors: Jia Meiying, Yang Bingru, Zheng Dequan, Yang Jing; published in Computer Engineering and Applications, 2009, vol. 45, no. 32, pp. 126-129) discloses a term extraction method based on conditional random fields for the military intelligence field. The method likewise treats domain term recognition as a sequence labeling problem, quantifies the distributional characteristics of domain terms as training features, trains a domain term feature template with a CRF toolkit, and then uses the template to extract domain terms. The experiments show that the method recognizes military intelligence terms well: the precision reaches 73.24%, the recall 69.57% and the F-measure 71.36%.
When the conditional random field algorithm is used for domain term recognition, the training corpus is almost always annotated manually or semi-automatically. The degree of human involvement is high and the workload is large, so the amount of annotated data is generally small, which limits the recognition accuracy and the application of the algorithm. In addition, the corpus must first be segmented with a general-purpose word segmentation tool, and only then can the segmented corpus be used for conditional random field training and testing to recognize terms. The use of the conditional random field algorithm for domain term recognition therefore presupposes that an existing general-purpose segmentation tool can segment the vocabulary of the field accurately, and that domain terms have a coarser granularity than the words produced by the segmentation tool. However, because professional domain terms differ from common words, accurate segmentation of a professional domain corpus is difficult to achieve with a general-purpose segmentation tool. As a result, when current mutual information and conditional random field methods are used to recognize domain terms, the degree of automation is low and the recognition accuracy is not high.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a Chinese domain term recognition method based on mutual information and conditional random field models. When recognizing terms, the method not only overcomes the data sparsity of valid terms and reduces the computational load of the conditional random field algorithm, but also improves the recognition accuracy of Chinese domain terms.
To achieve the above object, the present invention adopts the following technical solution:
The Chinese domain term recognition method based on mutual information and conditional random field models of the present invention comprises the following steps:
(1) collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus;
(2) define a character string W and calculate the mutual information value of W;
(3) calculate the left and right information entropy of W;
(4) define the evaluation function rank(W) of W, set a threshold for rank(W), calculate the evaluation function value of each string, and determine whether W is a word: successively compare the evaluation function value of the preceding string x_n in W with that of the following string x_{n-1} (in the embodiment, the n-character string and the (n-1)-character string obtained by dropping its last character), obtain the corresponding ratio for each string W, compare the ratio with the threshold of rank(W), and segment the strings into words one by one;
(5) taking the word, its part of speech and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model with the conditional random field method, and use this model to recognize domain terms.
In step (2) above, a character string W is defined and its mutual information value is calculated as follows:
Assume that a domain term consists of n characters. If the string W is a domain term, then W consists of the n characters x_1, x_2, x_3, ..., x_n, and the mutual information value of W is calculated as:

$$MI(W) = \frac{N(x_1, x_2, x_3, \ldots, x_n)}{N(x_1) + N(x_2) + N(x_3) + \cdots + N(x_n) - N(x_1, x_2, x_3, \ldots, x_n)} \tag{1}$$

where W denotes a string consisting of n characters;
x_i denotes the i-th character of W (i = 1, 2, 3, ..., n);
N(x_1) denotes the frequency of character x_1 in the training corpus;
N(x_2) denotes the frequency of character x_2 in the training corpus;
N(x_3) denotes the frequency of character x_3 in the training corpus;
N(x_n) denotes the frequency of character x_n in the training corpus;
N(x_1, x_2, x_3, ..., x_n) denotes the frequency with which the characters x_1, x_2, x_3, ..., x_n occur together;
MI(W) denotes the mutual information among all the characters of W.
In step (3) above, the left and right information entropy is calculated as follows:
The left information entropy is calculated as:

$$H_L(W) = -\sum_{i=1}^{|V_L|} p(w_i W) \times \log p(w_i W), \qquad \sum_{i=1}^{|V_L|} p(w_i W) = 1 \tag{2}$$

The right information entropy is calculated as:

$$H_R(W) = -\sum_{i=1}^{|V_R|} p(W w_i) \times \log p(W w_i), \qquad \sum_{i=1}^{|V_R|} p(W w_i) = 1 \tag{3}$$

where W denotes a given string consisting of n characters;
p(w_i W) and p(W w_i) denote the conditional probabilities that w_i occurs immediately to the left and immediately to the right of W, respectively;
V_L and V_R denote the sets of characters that occur to the left and to the right of W;
x_i denotes the i-th character of W, where i = 1, 2, 3, ..., n.
In step (4) above, defining the evaluation function of W and using it to segment the corpus means using the mutual information and the left and right information entropy calculated in steps (2) and (3) to evaluate how credible it is that a string W in the corpus is a word, and thereby to judge whether the string is a word. The evaluation function of W is calculated as:

$$rank(W) = \partial \times MI(W) + (1 - \partial) \times \frac{H_L(W) + H_R(W)}{2} \tag{4}$$

where W denotes a given string consisting of n characters;
MI(W) denotes the mutual information value among the characters of W;
H_L(W) denotes the left information entropy of W;
H_R(W) denotes the right information entropy of W;
∂ is a balance factor that adjusts the weights of the information entropy and the mutual information value in the evaluation function of W.
In step (5) above, the word, its part of speech and its frequency of occurrence are taken as the training features of the conditional random field, a domain term conditional random field model is trained with the conditional random field method, and the model is used to recognize domain terms. The operating steps are as follows:
(51) annotate the corpus with the word itself, its part of speech and its frequency of occurrence;
(52) train on the annotated training feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;
(53) use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences.
Compared with the prior art, the Chinese domain term recognition method based on mutual information and conditional random field models of the present invention has the following effects:
(1) by organically combining the two classes of term recognition methods, statistics-based and machine-learning-based, the method effectively solves the data sparsity problem that arises when a statistical method alone is used for term recognition;
(2) the method uses the mutual information algorithm to segment and mark the corpus, thereby achieving automatic annotation of the corpus;
(3) the method uses only the 3 most common word features for conditional random field training, which gives the method strong generality across domains, greatly reduces the computational load of the conditional random field and shortens its training time.
Brief description of the drawings
Fig. 1 is a flow chart of the Chinese domain term recognition method based on mutual information and conditional random field models of the present invention;
Fig. 2 is a flow chart of step (4) in Fig. 1;
Fig. 3 is a flow chart of step (5) in Fig. 1.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
This embodiment illustrates the invention with the recognition of domain terms for bamboo plants as an example, but does not limit the scope of the invention.
With reference to Fig. 1, the Chinese domain term recognition method based on mutual information and conditional random field models of the present invention comprises the following steps:
(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus.
For example, this embodiment takes the electronic manuscript of "Flora of China", volume 9, Bambusoideae, as the domain text corpus.
First, the corpus is randomly divided in a ratio of 4:1 into two parts: a training corpus and a test corpus.
Then, all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus are located, and the symbol "//" is placed immediately before and after each such character.
Finally, with reference to a Chinese part-of-speech table, all pronouns, interjections, auxiliary words and function words, as well as words whose first character is one of "with", "have", "by", "from", "be", "then", "every", "this", "the said", "to", "institute", "make", "be", "or", "not", "very", "should", "with", are marked with the symbol "//" immediately before and after them.
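The character-marking part of step (1) can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `mark_boundaries` is hypothetical, "Chinese character" is approximated by the Unicode range U+4E00 to U+9FFF (the patent does not define the range), and the part-of-speech based marking of pronouns and function words is not covered.

```python
import re

# Assumption: a "Chinese character" is any code point in the CJK Unified
# Ideographs block U+4E00..U+9FFF. Everything else (punctuation, spaces,
# digits, ASCII letters, other scripts) is treated as a boundary character.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def mark_boundaries(text: str, marker: str = "//") -> str:
    """Place the marker immediately before and after each non-Chinese character."""
    return NON_CHINESE.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)


# Illustrative input: a Chinese phrase followed by a full-width comma.
print(mark_boundaries("边缘被流苏状毛，"))
```

Later steps can then split the marked text on "//" so that no candidate string spans a marked character.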
(2) Define a character string W and calculate the mutual information value of W as follows:
Assume that a domain term consists of n characters. If the string W is a domain term, then W consists of the n characters x_1, x_2, x_3, ..., x_n, and the mutual information value of W is calculated as:

$$MI(W) = \frac{N(x_1, x_2, x_3, \ldots, x_n)}{N(x_1) + N(x_2) + N(x_3) + \cdots + N(x_n) - N(x_1, x_2, x_3, \ldots, x_n)} \tag{1}$$

where W denotes a string consisting of n characters;
x_i denotes the i-th character of W, where i = 1, 2, 3, ..., n;
N(x_1) denotes the frequency of character x_1 in the training corpus;
N(x_2) denotes the frequency of character x_2 in the training corpus;
N(x_3) denotes the frequency of character x_3 in the training corpus;
N(x_n) denotes the frequency of character x_n in the training corpus;
N(x_1, x_2, x_3, ..., x_n) denotes the frequency with which the characters x_1, x_2, x_3, ..., x_n occur together;
MI(W) denotes the mutual information among all the characters of W.
The present invention assumes that a Chinese domain term is no longer than 4 characters, and that punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters are unlikely to appear inside a Chinese domain term, and likewise interjections, function words, demonstrative pronouns and similar words. The invention therefore calculates the mutual information values of all 2-character, 3-character and 4-character strings in the corpus text, and stops the calculation whenever the marker "//" is encountered. The mutual information value is calculated with formula (1) of step (2) in the Summary above.
For example, for the corpus text "边缘被流苏状毛//，//" (roughly, "margin covered with fringe-like hairs"), the 2-character strings are "边缘", "缘被", "被流", "流苏", "苏状" and "状毛"; the 3-character strings are "边缘被", "缘被流", "被流苏", "流苏状" and "苏状毛"; the 4-character strings are "边缘被流", "缘被流苏", "被流苏状" and "流苏状毛". Some of the mutual information results are: MI(边, 缘) = 0.82, MI(被, 流) = 0.37, MI(边, 缘, 被) = 0.23, MI(边, 缘, 被, 流) = 0.08, MI(被, 流, 苏, 状) = 0.41.
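A minimal sketch of the candidate enumeration and of formula (1) is given below. The names `ngrams`, `count_strings` and `mutual_information` are hypothetical, raw occurrence counts stand in for "frequency", and the two-segment corpus is invented for illustration only. The Chinese phrase "边缘被流苏状毛" is assumed to be the original text behind the translated example.

```python
from collections import Counter


def ngrams(segment: str, n: int) -> list[str]:
    """All contiguous n-character substrings of one marker-free segment."""
    return [segment[i:i + n] for i in range(len(segment) - n + 1)]


def count_strings(segments: list[str], max_len: int = 4) -> Counter:
    """Count every 1- to max_len-character string; segments never span '//'."""
    counts: Counter = Counter()
    for segment in segments:
        for n in range(1, max_len + 1):
            counts.update(ngrams(segment, n))
    return counts


def mutual_information(w: str, counts: Counter) -> float:
    """Formula (1): N(x1..xn) / (N(x1) + ... + N(xn) - N(x1..xn))."""
    joint = counts[w]
    denominator = sum(counts[ch] for ch in w) - joint
    return joint / denominator if denominator > 0 else 0.0


# Hypothetical two-segment corpus, not data from the patent.
counts = count_strings(["边缘被毛", "边缘无毛"])
print(ngrams("边缘被流苏状毛", 2))         # the six 2-character candidates
print(mutual_information("边缘", counts))  # 2 / (2 + 2 - 2)
print(mutual_information("被毛", counts))  # 1 / (1 + 2 - 1)
```

Single-character counts are collected as well because the denominator of formula (1) needs them, even though only strings of 2 to 4 characters are scored.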
(3) Calculate the left and right information entropy of W as follows:
The left information entropy is calculated as:

$$H_L(W) = -\sum_{i=1}^{|V_L|} p(w_i W) \times \log p(w_i W), \qquad \sum_{i=1}^{|V_L|} p(w_i W) = 1 \tag{2}$$

The right information entropy is calculated as:

$$H_R(W) = -\sum_{i=1}^{|V_R|} p(W w_i) \times \log p(W w_i), \qquad \sum_{i=1}^{|V_R|} p(W w_i) = 1 \tag{3}$$

where W denotes a given string consisting of n characters;
p(w_i W) and p(W w_i) denote the conditional probabilities that w_i occurs immediately to the left and immediately to the right of W, respectively;
V_L and V_R denote the sets of characters that occur to the left and to the right of W;
x_i denotes the i-th character of W, where i = 1, 2, 3, ..., n.
To judge whether a string is a word, the tightness of the bond between the characters inside the string must be considered, that is, the magnitude of the mutual information between the characters. At the same time, the freedom of the string's boundaries must also be considered: the more kinds of adjacent characters occur at the boundaries of the string, the larger its left and right information entropy, and hence the greater the freedom of its boundaries. The left and right information entropy is calculated with formulas (2) and (3) of step (3) in the Summary above.
For example, for the corpus text "边缘被流苏状毛//，//" (roughly, "margin covered with fringe-like hairs"), some of the left information entropy results are: H_L(边, 缘) = 0.71, H_L(被, 流) = 0.91, H_L(边, 缘, 被) = 0.34, H_L(被, 流, 苏) = 0.42, H_L(边, 缘, 被, 流) = 0.17, H_L(被, 流, 苏, 状) = 0.19; the right information entropy results are: H_R(边, 缘) = 0.52, H_R(被, 流) = 0.93, H_R(边, 缘, 被) = 0.56, H_R(被, 流, 苏) = 0.31, H_R(边, 缘, 被, 流) = 0.14, H_R(被, 流, 苏, 状) = 0.29.
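The left and right information entropy of formulas (2) and (3) can be sketched as below. Several points are assumptions rather than statements from the patent: the natural logarithm is used because the log base is not specified, an occurrence of W at the very start or end of a segment contributes no neighbour, and the function names and the sample segments are hypothetical.

```python
import math
from collections import Counter


def branching_entropy(neighbours: Counter) -> float:
    """Entropy of the empirical distribution of neighbouring characters."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbours.values())


def left_right_entropy(w: str, segments: list[str]) -> tuple[float, float]:
    """Return (H_L(W), H_R(W)) estimated from marker-free segments."""
    left: Counter = Counter()
    right: Counter = Counter()
    for segment in segments:
        start = segment.find(w)
        while start != -1:
            if start > 0:                      # a character exists to the left
                left[segment[start - 1]] += 1
            end = start + len(w)
            if end < len(segment):             # a character exists to the right
                right[segment[end]] += 1
            start = segment.find(w, start + 1)
    return branching_entropy(left), branching_entropy(right)


# Hypothetical segments: two different left neighbours, one right neighbour.
h_left, h_right = left_right_entropy("边缘", ["甲边缘乙", "丙边缘乙"])
print(h_left, h_right)
```

A string whose neighbours vary widely gets a high entropy on that side, which matches the boundary-freedom argument in the text.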
(4) Define the evaluation function of W, set a threshold for the evaluation function rank(W), calculate the evaluation function value of each string, and determine whether W is a word: successively compare the evaluation function value of the preceding string x_n in W with that of the following string x_{n-1}, obtain the corresponding ratio for each string W, compare the ratio with the threshold of rank(W), and segment the strings into words one by one. The operating steps are as follows:
(41) Define the evaluation function of W, which is calculated as:

$$rank(W) = \partial \times MI(W) + (1 - \partial) \times \frac{H_L(W) + H_R(W)}{2} \tag{4}$$

where W denotes a given string consisting of n characters;
MI(W) denotes the mutual information value among the characters of W;
H_L(W) denotes the left information entropy of W;
H_R(W) denotes the right information entropy of W;
∂ is a balance factor that adjusts the weights of the information entropy and the mutual information value in the evaluation function.
(42) Calculate the evaluation function values and determine whether W is a word.
The evaluation function values of all strings are calculated with the evaluation function formula of step (4) in the Summary above, where ∂ is set to 0.5, and a string W is considered a word when its evaluation function rank(W) is greater than the threshold 0.8.
For example, for the corpus text "边缘被流苏状毛//，//" (roughly, "margin covered with fringe-like hairs"), some of the evaluation function results are: rank(边, 缘) = 0.7175, rank(被, 流) = 0.645, rank(边, 缘, 被) = 0.34, rank(被, 流, 苏) = 0.4975, rank(边, 缘, 被, 流) = 0.1175, rank(被, 流, 苏, 状) = 0.325.
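As a quick check, the evaluation function of formula (4) can be written as a one-line function and applied to the mutual information and entropy values listed in the examples of steps (2) and (3). The function and parameter names are illustrative; `alpha` stands for the balance factor, set to 0.5 as in the embodiment.

```python
def rank(mi: float, h_left: float, h_right: float, alpha: float = 0.5) -> float:
    """Formula (4): weighted sum of mutual information and mean boundary entropy."""
    return alpha * mi + (1 - alpha) * (h_left + h_right) / 2


# (MI, H_L, H_R) triples taken from the examples of steps (2) and (3).
examples = {
    "first 2-character string": (0.82, 0.71, 0.52),
    "first 3-character string": (0.23, 0.34, 0.56),
    "first 4-character string": (0.08, 0.17, 0.14),
}
for name, (mi, hl, hr) in examples.items():
    print(name, round(rank(mi, hl, hr), 4))
```

With these inputs the function yields 0.7175, 0.34 and 0.1175, the values listed above for the corresponding strings.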
(43) Successively compare the evaluation function value of the preceding string x_n in the above string W with that of the following string x_{n-1}, obtain the corresponding ratio for each string W, compare the ratio with the threshold of the evaluation function rank(W), and segment the strings into words one by one.
For example, first, starting from the first character of the corpus, take the substrings of length 4, 3, 2 and 1, denoted W(4-word), W(3-word), W(2-word) and W(1-word).
Then compare the evaluation functions of W(4-word) and W(3-word). If rank(W(4-word)) / rank(W(3-word)) ≥ 0.8, W(4-word) is considered a new word, and the symbol "*" is placed before and after W(4-word). Otherwise W(4-word) is not a new word; its last character is dropped, and the evaluation functions of W(3-word) and W(2-word) are compared. If rank(W(3-word)) / rank(W(2-word)) ≥ 0.8, W(3-word) is considered a new word, and the symbol "*" is placed before and after W(3-word). Otherwise W(3-word) is not a new word; its last character is dropped, and the evaluation function of W(2-word) is judged. If rank(W(2-word)) ≥ 0.8, W(2-word) is considered a new word, and the symbol "*" is placed before and after W(2-word). Otherwise W(1-word) is considered a new word, and the symbol "*" is placed before and after W(1-word).
Whenever a new word has been marked, the substrings of length 4, 3, 2 and 1 are taken again starting from the first character after the new word, denoted W(4-word), W(3-word), W(2-word) and W(1-word), and the comparison of the evaluation functions is repeated; "//" symbols are skipped when encountered. This is repeated until the whole corpus has been processed.
For example, for the corpus text "边缘被流苏状毛//，//" (roughly, "margin covered with fringe-like hairs"), first take the substrings of length 4, 3, 2 and 1 starting from the first character, namely "边缘被流", "边缘被", "边缘" and "边". Then first judge whether rank(边, 缘, 被, 流) / rank(边, 缘, 被) is greater than or equal to 0.8; according to the evaluation function results of step (41), the ratio is less than 0.8, so the string "边缘被流" is not a new word. Next judge whether rank(边, 缘, 被) / rank(边, 缘) is greater than or equal to 0.8; according to the evaluation function results of step (41), the ratio is less than 0.8, so the string "边缘被" is not a new word. Then judge whether rank(边, 缘) is greater than or equal to 0.8; according to the evaluation function results of step (41), rank(边, 缘) = 0.82 is greater than 0.8, so the string "边缘" is a new word.
After a new word has been determined, the strings of length 4, 3, 2 and 1 are taken again starting from the first character after the new word as the new round of W(4-word), W(3-word), W(2-word) and W(1-word), namely "被流苏状", "被流苏", "被流" and "被", and the above comparison is repeated, skipping "//" symbols when encountered, until the end. The final segmentation result of the corpus text "边缘被流苏状毛//，//" is therefore "*边缘*被*流苏状*毛//，//".
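The segmentation loop of step (43) can be sketched as follows, under stated assumptions. The ratio test is taken to be rank(W(k-word)) / rank(W((k-1)-word)) ≥ threshold for k = 4 and 3, followed by rank(W(2-word)) ≥ threshold. A ratio with a zero denominator is treated as 0, and fewer than 4 remaining characters simply shorten the window; neither case is specified in the text. The function name `segment` is hypothetical, and so is the rank table: the values for "边缘" and for the strings starting with "被" follow the walkthrough and step (42), while the three values for strings starting with "流" are invented to complete the example.

```python
def segment(text: str, rank: dict[str, float],
            threshold: float = 0.8, max_len: int = 4) -> list[str]:
    """Greedy longest-first segmentation of one marker-free segment."""
    words, i = [], 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        chosen = 1                              # fall back to a single character
        for n in range(longest, 1, -1):
            current = rank.get(text[i:i + n], 0.0)
            if n > 2:                           # ratio test against the shorter string
                shorter = rank.get(text[i:i + n - 1], 0.0)
                score = current / shorter if shorter > 0 else 0.0
            else:                               # 2-character strings: direct threshold
                score = current
            if score >= threshold:
                chosen = n
                break
        words.append(text[i:i + chosen])
        i += chosen
    return words


# Hypothetical rank table (see the lead-in for which values are invented).
ranks = {
    "边缘被流": 0.1175, "边缘被": 0.34, "边缘": 0.82,
    "被流苏状": 0.325, "被流苏": 0.4975, "被流": 0.645,
    "流苏状毛": 0.30, "流苏状": 0.90, "流苏": 0.95,
}
print("*" + "*".join(segment("边缘被流苏状毛", ranks)))
```

Splitting the corpus on the "//" markers before calling the function is one way to realise the "skip the marker" rule.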
(5) Taking the word, its part of speech and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model with the conditional random field method, and use this model to recognize domain terms. The operating steps are as follows:
(51) Annotate the corpus with the word itself, its part of speech and its frequency of occurrence, specifically as follows:
Each word obtained from the segmentation of the strings W is annotated in turn with a feature sequence consisting of: the current word itself; the part of speech of the current word; and the frequency of occurrence of the current word. Using the K-Means clustering method, the frequencies of occurrence of the words are divided into 10 grades, each grade forming one class; the 10 classes are denoted A, B, C, D, E, F, G, H, I, J, K. The annotated feature sequences are divided into two parts: the annotated training feature sequences and the annotated test feature sequences.
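The frequency grading of step (51) can be illustrated with a small one-dimensional K-Means. This is a sketch only: the patent names K-Means but gives no initialisation or stopping rule, so the deterministic initialisation, the iteration cap and the function names `kmeans_1d` and `grade` are assumptions. The sample frequencies are invented, and 3 clusters are used instead of 10 to keep the example short.

```python
def kmeans_1d(values: list[float], k: int, iterations: int = 100) -> list[float]:
    """Lloyd's algorithm on scalars; returns the cluster centres in ascending order."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))
    # Deterministic start: spread the centres over the sorted distinct values.
    centres = [distinct[(2 * j + 1) * len(distinct) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:                  # converged
            break
        centres = updated
    return sorted(centres)


def grade(value: float, centres: list[float]) -> str:
    """Map a frequency to a letter: 'A' for the lowest-frequency cluster, and so on."""
    nearest = min(range(len(centres)), key=lambda j: abs(value - centres[j]))
    return chr(ord("A") + nearest)


# Hypothetical word frequencies, clustered into 3 grades for brevity.
frequencies = [1, 1, 2, 2, 50, 52, 100, 101]
centres = kmeans_1d(frequencies, k=3)
print(centres, [grade(f, centres) for f in frequencies])
```

Replacing each raw frequency with a letter grade turns a numeric quantity into a small categorical feature, which suits feature templates that match on symbols.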
(52) Train on the annotated training feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition.
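CRF++ reads training data as whitespace-separated columns, one token per line, with the answer tag in the last column and an empty line between sentences. The sketch below writes the three features of step (51) in that layout. The function name `to_crfpp_rows`, the part-of-speech tags and the term/non-term labels "T" and "O" are illustrative assumptions; the patent does not state which tag set it uses.

```python
def to_crfpp_rows(sentences: list[list[tuple[str, str, str, str]]]) -> str:
    """Format (word, part of speech, frequency grade, label) tuples for CRF++."""
    lines = []
    for sentence in sentences:
        for word, pos, freq_grade, label in sentence:
            lines.append("\t".join((word, pos, freq_grade, label)))
        lines.append("")                        # empty line marks a sentence boundary
    return "\n".join(lines) + "\n"


# Hypothetical annotation of the segmented example; tags and labels are invented.
sample = [[
    ("边缘", "n", "C", "T"),
    ("被", "p", "A", "O"),
    ("流苏状", "n", "B", "T"),
    ("毛", "n", "A", "O"),
]]
print(to_crfpp_rows(sample), end="")
```

A file written this way, together with a feature template file, is the input that the toolkit's `crf_learn` command expects.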
(53) Use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences, specifically as follows:
The annotated test feature sequences are input to the conditional random field model for domain term recognition obtained by the training in step (52); the model is used to calculate the feature values and recognize the domain terms, and the output is the recognized domain terms. For example, for the corpus text "边缘被流苏状毛//，//" (roughly, "margin covered with fringe-like hairs"), "边缘" and "流苏状" are finally recognized as domain terms.
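When a test file is run through CRF++ with `crf_test`, each output line repeats the input columns and appends the predicted tag as the last column. Collecting the recognized terms can then be sketched as below. The function name `extract_terms`, the label "T" for a term and the sample output are assumptions made for illustration, not results reported in the patent.

```python
def extract_terms(crf_test_output: str, term_label: str = "T") -> list[str]:
    """Return the words whose predicted tag (last column) equals term_label."""
    terms = []
    for line in crf_test_output.splitlines():
        columns = line.split()
        if columns and columns[-1] == term_label:
            terms.append(columns[0])            # first column holds the word itself
    return terms


# Hypothetical output: word, part of speech, grade, gold label, predicted label.
output = (
    "边缘\tn\tC\tT\tT\n"
    "被\tp\tA\tO\tO\n"
    "流苏状\tn\tB\tT\tT\n"
    "毛\tn\tA\tO\tO\n"
    "\n"
)
print(extract_terms(output))
```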
The above is the preferred embodiment of the present invention. Equivalent or alternative schemes that those skilled in the art can readily derive from the present disclosure shall all fall within the scope of the technical innovation of the present invention.

Claims (4)

1. A Chinese domain term recognition method based on mutual information and conditional random field models, comprising the following specific steps:
(1) Collect a domain text corpus and mark all punctuation marks, spaces, digits, ASCII characters and non-Chinese characters in the corpus;
(2) Set a word string $W$ and calculate the mutual information value of $W$, as follows:
Assume that a domain term consists of n characters. If the word string $W$ is a domain term, then $W$ is composed of the n characters $w_1, w_2, \ldots, w_n$, and the mutual information value of $W$ is calculated by the following formula:
(1)
where $W$ denotes a word string composed of n characters;
$w_i$ denotes the i-th character of the word string $W$ (i = 1, 2, 3, ..., n);
$f(w_1), f(w_2), \ldots, f(w_n)$ denote the frequencies with which the characters $w_1, w_2, \ldots, w_n$ respectively occur in the corpus;
$f(w_1, w_2, \ldots, w_n)$ denotes the frequency with which the characters $w_1, \ldots, w_n$ occur together;
$MI(W)$ denotes the mutual information among all the characters of the word string $W$;
(3) Calculate the left and right information entropy of the word string $W$;
(4) Define an evaluation function for the word string $W$ and set a threshold for the evaluation function; calculate the evaluation function value of each word string to determine whether the word string $W$ is a word; compare in turn the evaluation function value of the preceding word in the word string with that of the following word to obtain the corresponding ratio for each word string $W$; compare this ratio with the threshold of the evaluation function; and segment the word string $W$ into words one by one;
(5) Taking the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, train a domain-term conditional random field model using the conditional random field method, and use this model to recognize domain terms.
2. The Chinese domain term recognition method based on mutual information and conditional random field models according to claim 1, characterized in that the left and right information entropy described in step (3) is calculated by the following formulas:
Left information entropy: $E_L(W) = -\sum_{a \in A} P(aW \mid W)\,\log P(aW \mid W)$ (2)
Right information entropy: $E_R(W) = -\sum_{b \in B} P(Wb \mid W)\,\log P(Wb \mid W)$ (3)
where $W$ denotes a given word string composed of n characters;
$P(aW \mid W)$ and $P(Wb \mid W)$ denote the conditional probabilities that $a$ occurs to the left of $W$ and that $b$ occurs to the right of $W$, respectively;
$A$ and $B$ denote the sets of characters that occur to the left and to the right of $W$;
$w_i$ denotes the i-th character of the word string $W$, where i = 1, 2, 3, ..., n.
3. The Chinese domain term recognition method based on mutual information and conditional random field models according to claim 1, characterized in that defining the evaluation function of the word string $W$ in step (4) and using the evaluation function to segment the corpus means using the mutual information and the left and right information entropy values calculated in steps (2) and (3) to evaluate the credibility that a word string $W$ in the corpus is a word, and thereby to judge whether the word string is a word, wherein the evaluation function of the word string $W$ is calculated as follows:
(4)
where $W$ denotes a given word string composed of n characters;
$MI(W)$ denotes the mutual information value among the characters of the word string $W$;
$E_L(W)$ denotes the left information entropy value of the word string $W$;
$E_R(W)$ denotes the right information entropy value of the word string $W$;
the balance factor adjusts the weights of the information entropy and the mutual information value in the evaluation function of the word string $W$.
4. The Chinese domain term recognition method based on mutual information and conditional random field models according to claim 1, characterized in that, in step (5), the word itself, its part of speech and its frequency of occurrence are taken as the training features of the conditional random field, a domain-term conditional random field model is trained using the conditional random field method, and this model is used to recognize domain terms, the operating steps being as follows:
(51) Annotate the corpus with the word itself, its part of speech and its frequency of occurrence;
(52) Train on the annotated training feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters, these parameters constituting the conditional random field model for domain term recognition;
(53) Use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences.
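The left and right information entropy referred to in claim 2 can be illustrated with a short sketch. It uses the common definition, the entropy of the distribution of characters seen immediately to the left (or right) of a word string, estimated from raw counts. The function name, the use of the natural logarithm and the treatment of overlapping or corpus-edge occurrences are assumptions made for the example, not details taken from the claims.

```python
import math
from collections import Counter


def left_right_entropy(corpus, w):
    """Return (left entropy, right entropy) of string w within corpus."""
    left, right = Counter(), Counter()
    start = corpus.find(w)
    while start != -1:
        if start > 0:                          # a left neighbour exists
            left[corpus[start - 1]] += 1
        end = start + len(w)
        if end < len(corpus):                  # a right neighbour exists
            right[corpus[end]] += 1
        start = corpus.find(w, start + 1)      # overlapping matches allowed

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counts.values())

    return entropy(left), entropy(right)


# Toy corpus: "ab" always follows "x" but is followed by "y" or "z".
le, re_ = left_right_entropy("xabyxabz", "ab")
print(le == 0, abs(re_ - math.log(2)) < 1e-12)   # True True
```

A string whose neighbours vary widely on both sides gets high entropy on both sides, which is the usual signal that it forms a free-standing word.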

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210528734.8A CN103049501B (en) 2012-12-11 2012-12-11 Based on mutual information and the Chinese domain term recognition method of conditional random field models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210528734.8A CN103049501B (en) 2012-12-11 2012-12-11 Based on mutual information and the Chinese domain term recognition method of conditional random field models

Publications (2)

Publication Number Publication Date
CN103049501A CN103049501A (en) 2013-04-17
CN103049501B true CN103049501B (en) 2016-08-03

Family

ID=48062142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210528734.8A Expired - Fee Related CN103049501B (en) 2012-12-11 2012-12-11 Based on mutual information and the Chinese domain term recognition method of conditional random field models

Country Status (1)

Country Link
CN (1) CN103049501B (en)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593427A (en) * 2013-11-07 2014-02-19 清华大学 New word searching method and system
CN103778243B (en) * 2014-02-11 2017-02-08 北京信息科技大学 Domain term extraction method
CN103902673B (en) * 2014-03-19 2017-11-24 新浪网技术(中国)有限公司 Anti-spam filtering rule upgrade method and device
CN104572621B (en) * 2015-01-05 2018-01-26 语联网(武汉)信息技术有限公司 A kind of term decision method based on decision tree
CN104679885B (en) * 2015-03-17 2018-03-30 北京理工大学 A kind of user's search string organization names recognition method based on semantic feature model
CN104933023B (en) * 2015-05-12 2017-09-01 深圳市华傲数据技术有限公司 Chinese address participle mask method
CN108776709B (en) * 2015-10-27 2020-05-19 上海智臻智能网络科技股份有限公司 Computer-readable storage medium and dictionary updating method
CN108897842B (en) * 2015-10-27 2021-04-09 上海智臻智能网络科技股份有限公司 Computer readable storage medium and computer system
CN108875040B (en) * 2015-10-27 2020-08-18 上海智臻智能网络科技股份有限公司 Dictionary updating method and computer-readable storage medium
CN105260362B (en) * 2015-10-30 2019-02-12 小米科技有限责任公司 New words extraction method and apparatus
CN106021230B (en) * 2016-05-19 2018-11-23 无线生活(杭州)信息科技有限公司 A kind of segmenting method and device
CN107423278B (en) * 2016-05-23 2020-07-14 株式会社理光 Evaluation element identification method, device and system
CN106095753B (en) * 2016-06-07 2018-11-06 大连理工大学 A kind of financial field term recognition methods based on comentropy and term confidence level
CN106202056B (en) * 2016-07-26 2019-01-04 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106445921B (en) * 2016-09-29 2019-05-07 北京理工大学 Utilize the Chinese text terminology extraction method of quadratic mutual information
CN106649661A (en) * 2016-12-13 2017-05-10 税云网络科技服务有限公司 Method and device for establishing knowledge base
CN108268440A (en) * 2017-01-04 2018-07-10 普天信息技术有限公司 A kind of unknown word identification method
CN106991085B (en) * 2017-04-01 2020-08-04 中国工商银行股份有限公司 Entity abbreviation generation method and device
CN107291692B (en) * 2017-06-14 2020-12-18 北京百度网讯科技有限公司 Artificial intelligence-based word segmentation model customization method, device, equipment and medium
CN109145282B (en) * 2017-06-16 2023-11-07 贵州小爱机器人科技有限公司 Sentence-breaking model training method, sentence-breaking device and computer equipment
CN107391486B (en) * 2017-07-20 2020-10-27 南京云问网络技术有限公司 Method for identifying new words in field based on statistical information and sequence labels
CN108509425B (en) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN108804413B (en) * 2018-04-28 2022-03-22 百度在线网络技术(北京)有限公司 Text cheating identification method and device
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN109710947B (en) * 2019-01-22 2021-09-07 福建亿榕信息技术有限公司 Electric power professional word bank generation method and device
CN110175331B (en) * 2019-05-29 2021-05-11 腾讯科技(深圳)有限公司 Method and device for identifying professional terms, electronic equipment and readable storage medium
CN111090742A (en) * 2019-12-19 2020-05-01 东软集团股份有限公司 Question and answer pair evaluation method and device, storage medium and equipment
CN115495507B (en) * 2022-11-17 2023-03-24 江苏鸿程大数据技术与应用研究院有限公司 Engineering material information price matching method, system and storage medium
CN116702786B (en) * 2023-08-04 2023-11-17 山东大学 Chinese professional term extraction method and system integrating rules and statistical features

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
CN102314507A (en) * 2011-09-08 2012-01-11 北京航空航天大学 Recognition ambiguity resolution method of Chinese named entity

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100928382B1 (en) * 2006-10-17 2009-11-23 삼성에스디에스 주식회사 Migration apparatus which convert database of mainframe system into database of open system and method for thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
一种基于互信息的串扫描中文文本分词方法;赵秦怡 等;《情报杂志》;20100731;第29卷(第7期);152-172 *
一种面向术语抽取的短语过滤技术;周浪 等;《计算机工程与应用》;20091231(第19期);9-11 *
采用CRF技术的军事情报术语自动抽取研究;贾美英 等;《计算机工程与应用》;20091231(第32期);126-129 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202043A (en) * 2016-05-20 2016-12-07 北京理工大学 A kind of based on the new word identification immune genetic method becoming word rate fitness function
CN106202043B (en) * 2016-05-20 2019-04-12 北京理工大学 A kind of new word identification immune genetic method based at word rate fitness function

Also Published As

Publication number Publication date
CN103049501A (en) 2013-04-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160803

Termination date: 20181211