CN103049501A - Chinese domain term recognition method based on mutual information and conditional random field model - Google Patents


Publication number
CN103049501A
Authority
CN
China
Prior art keywords: word; word string; evaluation function; random field; field
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105287348A
Other languages
Chinese (zh)
Other versions
CN103049501B (en)
Inventor
彭琳
刘宗田
杨林楠
张立敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201210528734.8A priority Critical patent/CN103049501B/en
Publication of CN103049501A publication Critical patent/CN103049501A/en
Application granted granted Critical
Publication of CN103049501B publication Critical patent/CN103049501B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a Chinese domain term recognition method based on mutual information and a conditional random field model. The method includes the following steps: (1) collecting a domain text corpus and marking all punctuation marks, spaces, digits, ASCII (American Standard Code for Information Interchange) characters and other non-Chinese characters in the corpus; (2) setting character strings and computing the mutual information value of each character string; (3) computing the left information entropy and the right information entropy of each character string; (4) defining a character string evaluation function, setting an evaluation function threshold, computing the evaluation function value of each character string to determine whether the character string is a word, comparing in sequence the evaluation function value of the former character with that of the latter character in the character string, and segmenting the character strings one by one; (5) training a conditional random field model and recognizing domain terms with that model. When the method is used to recognize terms, the data sparsity of legitimate terms is overcome, the amount of computation required by the conditional random field is reduced, and the accuracy of Chinese domain term recognition is improved.

Description

Chinese domain term recognition method based on mutual information and a conditional random field model
Technical field
The present invention relates to a Chinese domain term recognition method based on mutual information and a conditional random field model, and belongs to the field of information technology.
Background art
According to the national standard GB/T 15237.1-2000, "Terminology work: Vocabulary", a term is the verbal designation of a general concept in a specific professional field: a word or phrase used within a subject field to express a concept or relation of that field. Terms can be divided into general terms used in daily life and domain terms used in specific fields. General terms are mostly formed by people's living and working habits; they are not required to express concepts strictly and accurately, and their meaning is often vague. A domain term is a systematic, generalized description of a professional concept; it must not be ambiguous, and the concept each technical term expresses must be precise and must not vary from one user to another.
Domain term recognition means extracting professional domain terms from the corpus of a specific scientific or technical field. As an important part of information extraction, automatic domain term recognition is widely applied in natural language processing and is significant for improving the accuracy of domain text indexing and retrieval, text mining, ontology construction, text classification and clustering, latent semantic analysis, and so on. Existing methods for recognizing domain terms in Chinese text mainly include:
(1) Chinese domain term recognition based on statistical methods. The main idea is to extract domain terms by exploiting the high degree of association among the internal components of a domain term, together with the domain feature information of the term. The general flow of a statistics-based method is: first, use methods from statistics or information theory to build various statistics and, from them, determine reasonably accurate seed words; then expand continually on that basis to obtain the final domain terms. Word frequency, mean and variance are commonly used statistics, and many researchers use hypothesis testing, mainly the t-test, the chi-square test, the log-likelihood ratio, pointwise mutual information, and so on. Recognizing domain terms statistically requires no syntactic or semantic information, is not restricted to a particular field, and does not depend on any external resource, so it is highly general.
Among these, the statistics-based mutual information algorithm is the most widely used. For example, the article "Chinese Term Extraction System Based on Mutual Information" (authors: Zhang Feng, Xu Yun, Hou Yan and Fan Xiaozhong; published in Application Research of Computers, 2005, vol. 22, no. 5, pp. 72-73, 77) discloses an automatic Chinese term extraction system. The system first computes the internal bonding strength of character strings based on mutual information to obtain a candidate term set; it then removes basic words from the candidate set and filters further using common collocation prefix and suffix information; finally, it performs lexical analysis on the candidates and applies part-of-speech composition rules for terms to obtain the final extraction result. The experimental results show that term extraction with the mutual information algorithm reaches a precision of 72.19%, a recall of 77.98% and an F-measure of 74.97%. As another example, the article "Term Extraction Combining C-value and Mutual Information" (authors: Liang Yinghong, Zhang Wenjing and Zhang Youcheng; published in Computer Applications and Software, 2010, vol. 27, no. 4, pp. 108-110) discloses a term extraction method that combines the C-value with mutual information, drawing on the advantage of the C-value parameter in extracting long terms. The experimental results show that, for long terms, the method reaches a precision of 75.7%, a recall of 68.4% and an F-measure of 71.9%, higher than other methods on the same corpus. However, the performance of this kind of algorithm depends directly on the size of the corpus and on the frequency of the candidate domain terms, and the data sparsity problem is hard to solve for low-frequency candidates that may nonetheless be legitimate terms. Consequently, when mutual information alone is used to recognize domain terms, precision, recall and F-measure all struggle to exceed 80%, and a satisfactory recognition result is hard to obtain;
(2) Chinese domain term recognition based on machine learning. The main steps are: build a training corpus manually or semi-automatically, learn a model from the training corpus with some machine learning algorithm, and then use the model to run domain term extraction experiments on a test corpus in order to verify the algorithm's complexity. The machine learning theories applied to Chinese domain term recognition so far mainly include decision trees, support vector machines, hidden Markov models, maximum entropy models, maximum entropy Markov models and conditional random fields. Machine-learning-based term recognition needs no expert domain knowledge or linguistic knowledge, is highly feasible to implement, and achieves good recognition or extraction results when multiple term features are taken into account.
At present, among machine-learning-based methods for Chinese domain term recognition, the conditional random field model is the most widely used. For example, the article "An Automatic Extraction Method for Traditional Chinese Medicine Terms" (authors: Zhang Wubei, Bai Yu, Wang Peiyan and Zhang Guiping; published in Journal of Shenyang Aerospace University, 2011, vol. 28, no. 1, pp. 72-75) discloses a term extraction method based on conditional random fields for the field of traditional Chinese medicine. The method treats the extraction of traditional Chinese medicine terms as a sequence labeling problem, quantifies the distributional characteristics of those terms as training features, trains a domain term model with a CRF toolkit, and then uses the model to extract terms. With "Classified Medical Records of Famous Doctors" as the traditional Chinese medicine text, the extraction experiment reached a precision of 83.11%, a recall of 81.04% and an F-measure of 82.06%. The article "Research on Automatic Extraction of Military Intelligence Terms Using CRF" (authors: Jia Meiying, Yang Bingru, Zheng Dequan and Yang Jing; published in Computer Engineering and Applications, 2009, vol. 45, no. 32, pp. 126-129) discloses a term extraction method based on conditional random fields for the field of military intelligence. The method likewise treats domain term recognition as a sequence labeling problem, quantifies the distributional characteristics of domain terms as training features, trains a domain term feature template with a CRF toolkit, and then uses the template to extract domain terms. The experiments show that the method recognizes military intelligence terms well: precision reaches 73.24%, recall reaches 69.57% and the F-measure reaches 71.36%.
When the conditional random field algorithm is used for domain term recognition, the training corpus is basically annotated manually or semi-automatically. The degree of human involvement is high and the workload is large, so the amount of data that can be recognized is generally small, which limits both the recognition accuracy and the application of the algorithm. In addition, the corpus must first be segmented with a general-purpose word segmentation tool; only then can conditional random field training and testing be carried out on the segmented corpus to finally recognize terms. The premise of recognizing domain terms with conditional random fields is therefore that an existing general-purpose segmentation tool can segment the vocabulary of the domain accurately, and that domain terms are of coarser granularity than the words produced by the segmentation tool. However, because professional domain terms differ from common words, accurate segmentation of a professional-domain corpus is hard to achieve with a general-purpose tool. Hence, at present, both the mutual information method and the conditional random field method offer a low degree of automation in domain term recognition, and their recognition accuracy is not high.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide a Chinese domain term recognition method based on mutual information and a conditional random field model. When recognizing terms, the method not only overcomes the data sparsity of legitimate terms and reduces the amount of computation required by the conditional random field algorithm, but also improves the accuracy of Chinese domain term recognition.
To achieve the above purpose, the present invention adopts the following technical solution:
The Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention comprises the following steps:
(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus;
(2) Set a character string $W$ and compute the mutual information value of $W$;
(3) Compute the left and right information entropy of the character string $W$;
(4) Define an evaluation function for the character string $W$ and set an evaluation function threshold; compute the evaluation function value of each character string to determine whether $W$ is a word; compare in turn the evaluation function value of the preceding character with that of the following character in $W$ to obtain the corresponding ratio for each character string; compare that ratio with the evaluation function threshold; and segment the character strings $W$ one by one;
(5) Using the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model with the conditional random field method, and use this model to recognize domain terms.
In step (2) above, the character string $W$ is set and its mutual information value is computed as follows:
Suppose a domain term consists of n characters. If the character string $W$ is a domain term, then $W$ is composed of the n characters $w_1, w_2, \ldots, w_n$, and the mutual information value of $W$ is computed by:
[Formula (1): equation image not recovered]
where:
$W$ denotes a character string composed of n characters;
$w_i$ denotes the i-th character of the character string $W$ (i = 1, 2, 3, ..., n);
the individual frequency terms denote, respectively, the frequency with which each of the characters $w_1$, $w_2$, ..., $w_n$ occurs in the corpus;
the joint frequency term denotes the frequency with which the characters $w_1, w_2, \ldots, w_n$ occur together;
the left-hand side of formula (1) denotes the mutual information between all the characters of the character string $W$.
In step (3) above, the left and right information entropy are computed as follows:
The left information entropy is computed by:
[Formula (2): equation image not recovered]
The right information entropy is computed by:
[Formula (3): equation image not recovered]
where:
$W$ denotes a given character string composed of n characters;
the two conditional probability terms denote, respectively, the conditional probability that a given character appears immediately to the left of $W$ and immediately to the right of $W$;
the two set symbols denote the sets of characters that occur immediately to the left and immediately to the right of $W$;
$w_i$ denotes the i-th character of the character string $W$, where i = 1, 2, 3, ..., n.
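For orientation, the definitions above match the usual branching-entropy form, which can be written as the sketch below. The symbols $E_L$, $E_R$, $A$, $B$, $a$ and $b$ and the base-2 logarithm are placeholder choices made here, because the original formula images are not available; this is the textbook form and is not claimed to reproduce formulas (2) and (3) exactly.

```latex
% A: set of characters seen immediately to the left of W
% B: set of characters seen immediately to the right of W
E_L(W) = -\sum_{a \in A} P(aW \mid W)\,\log_2 P(aW \mid W), \qquad
E_R(W) = -\sum_{b \in B} P(Wb \mid W)\,\log_2 P(Wb \mid W)
```

Under this form, a string whose neighbours vary widely has high entropy on that side, which is read as a freer, more word-like boundary.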
In step (4) above, defining the evaluation function of the character string $W$ and using it to segment the corpus means using the mutual information and the left and right information entropy computed in steps (2) and (3) to estimate the confidence that a character string $W$ in the corpus is a word, and thereby to judge whether the character string is a word. The evaluation function of $W$ is computed by:
[Formula (4): equation image not recovered]
where:
$W$ denotes a given character string composed of n characters;
the mutual information term denotes the mutual information value between the characters of $W$;
the left entropy term denotes the left information entropy of $W$;
the right entropy term denotes the right information entropy of $W$;
the balance factor adjusts the weights of the information entropy and the mutual information value in the evaluation function of $W$.
In step (5) above, the word itself, its part of speech and its frequency of occurrence are used as the training features of the conditional random field, a domain term conditional random field model is trained with the conditional random field method, and the model is used to recognize domain terms. The steps are as follows:
(51) Annotate the corpus with the word itself, its part of speech and its frequency of occurrence;
(52) Train on the annotated training feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters, which determine the conditional random field model used for domain term recognition;
(53) Use the domain term conditional random field model to perform domain term recognition on the annotated test feature sequences.
Compared with the prior art, the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention has the following effects:
(1) The method combines the two classes of term recognition methods, statistical and machine-learning-based, and effectively solves the data sparsity problem that arises when a statistical method alone is used for term recognition;
(2) The method uses the mutual information algorithm to segment and annotate the corpus, so the corpus is annotated automatically;
(3) The method uses only the 3 most common word features for conditional random field training, which gives it strong generality across domains, effectively reduces the amount of computation required by the conditional random field, and shortens its training time.
Description of drawings
Fig. 1 is a flowchart of the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention;
Fig. 2 is a flowchart of step (4) in Fig. 1;
Fig. 3 is a flowchart of step (5) in Fig. 1.
Embodiment
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
This embodiment takes domain term recognition for a plant, bamboo, as an example to describe the invention; it is not intended to limit the scope of the invention.
With reference to Fig. 1, the Chinese domain term recognition method based on mutual information and a conditional random field model of the present invention comprises the following steps:
(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus.
For example, this embodiment takes the electronic manuscript of volume 9 (Bambusoideae) of the Chinese-language "Flora of China" (Zhongguo Zhiwu Zhi) as the domain text corpus.
First, the corpus is randomly divided in a 4:1 ratio into two parts: a training corpus and a test corpus;
Then, all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus are located, and the symbol "//" is placed before and after each of them as a mark;
Finally, with reference to the Chinese part-of-speech table, all pronouns, interjections, auxiliary words and function words, as well as words whose first character is one of "with, have,,, will,, from,, be, then,, every, this, should, to, institute, make, be, or not,, very, should, and, get,", are marked with the symbol "//" before and after them.
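The character-level part of this marking step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `mark_boundaries` is hypothetical, "Chinese character" is approximated by the CJK Unified Ideographs block U+4E00 to U+9FFF, and each run of consecutive non-Chinese characters is wrapped once rather than character by character. Marking pronouns, interjections and function words needs a part-of-speech resource and is not covered here.

```python
import re

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
# Everything else (punctuation, spaces, digits, ASCII letters, other symbols)
# is treated as a boundary that candidate strings must not cross.
NON_CHINESE_RUN = re.compile(r"[^\u4e00-\u9fff]+")


def mark_boundaries(text: str, marker: str = "//") -> str:
    """Wrap every run of non-Chinese characters in the marker symbol."""
    return NON_CHINESE_RUN.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)
```

For instance, `mark_boundaries("边缘被流苏状毛,")` gives `边缘被流苏状毛//,//`, the shape of marked text that the later steps consume.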
(2) Set a character string $W$ and compute its mutual information value as follows:
Suppose a domain term consists of n characters. If the character string $W$ is a domain term, then $W$ is composed of the n characters $w_1, w_2, \ldots, w_n$, and the mutual information value of $W$ is computed by:
[Formula (1): equation image not recovered]
where:
$W$ denotes a character string composed of n characters;
$w_i$ denotes the i-th character of the character string $W$, where i = 1, 2, 3, ..., n;
the individual frequency terms denote, respectively, the frequency with which each of the characters $w_1$, $w_2$, ..., $w_n$ occurs in the corpus;
the joint frequency term denotes the frequency with which the characters $w_1, w_2, \ldots, w_n$ occur together;
the left-hand side of formula (1) denotes the mutual information between all the characters of the character string $W$.
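The following is a minimal sketch of one common way to compute such a value. It assumes the n-character generalization of pointwise mutual information, that is, the base-2 logarithm of the joint probability divided by the product of the individual character probabilities. This assumed form stands in for formula (1), whose image is not available, and the names `count_ngrams` and `mutual_information` and the probability estimate (count divided by total character count) are illustrative choices.

```python
import math
from collections import Counter


def count_ngrams(segments, max_n=4):
    """Count every substring of length 1..max_n inside each unmarked segment."""
    counts = Counter()
    for seg in segments:
        for n in range(1, max_n + 1):
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


def mutual_information(w, counts, total_chars):
    """Assumed form: log2( p(w_1..w_n) / (p(w_1) * ... * p(w_n)) )."""
    if counts[w] == 0:
        return float("-inf")  # unseen string: no evidence of cohesion
    p_joint = counts[w] / total_chars
    p_independent = math.prod(counts[c] / total_chars for c in w)
    return math.log2(p_joint / p_independent)
```

A higher value means the characters co-occur more often than independence would predict, which is the "internal bonding" signal the method relies on.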
The present invention assumes that a Chinese domain term is no longer than 4 characters, that no punctuation mark, space, digit, ASCII character or other non-Chinese character can appear inside a Chinese domain term, and that a term cannot contain interjections, function words, demonstrative pronouns or similar words. The present invention therefore computes, for all characters in the corpus text, the mutual information values of the 2-character, 3-character and 4-character strings, and stops computing when the marker "//" is encountered. The mutual information value is computed with formula (1) of step (2) in the summary of the invention above.
For example, for the corpus fragment "边缘被流苏状毛//,//" ("margin covered with fringe-like hairs"), the 2-character strings are "边缘", "缘被", "被流", "流苏", "苏状" and "状毛"; the 3-character strings are "边缘被", "缘被流", "被流苏", "流苏状" and "苏状毛"; and the 4-character strings are "边缘被流", "缘被流苏", "被流苏状" and "流苏状毛". Part of the mutual information results:
[Mutual information values: given as formula images, not recovered]
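The enumeration of 2-, 3- and 4-character candidates that stops at the "//" marker can be sketched as below. The function name `candidate_strings` is hypothetical, the sketch assumes markers always come in balanced pairs, and the example string 边缘被流苏状毛 is an assumed back-translation of the machine-translated fragment "edge by tasselled shape hair".

```python
def candidate_strings(marked_text, marker="//", min_n=2, max_n=4):
    """Return {n: [substrings of length n]} without crossing any marked span."""
    pieces = marked_text.split(marker)
    # With balanced markers, even-indexed pieces are unmarked Chinese runs and
    # odd-indexed pieces are the marked (non-Chinese or function-word) spans.
    runs = pieces[0::2]
    result = {n: [] for n in range(min_n, max_n + 1)}
    for run in runs:
        for n in range(min_n, max_n + 1):
            for i in range(len(run) - n + 1):
                result[n].append(run[i:i + n])
    return result
```

Calling `candidate_strings("边缘被流苏状毛//,//")` yields six 2-character strings, five 3-character strings and four 4-character strings, matching the counts in the example above.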
(3) Compute the left and right information entropy of the character string $W$ as follows:
The left information entropy is computed by:
[Formula (2): equation image not recovered]
The right information entropy is computed by:
[Formula (3): equation image not recovered]
where:
$W$ denotes a given character string composed of n characters;
the two conditional probability terms denote, respectively, the conditional probability that a given character appears immediately to the left of $W$ and immediately to the right of $W$;
the two set symbols denote the sets of characters that occur immediately to the left and immediately to the right of $W$;
$w_i$ denotes the i-th character of the character string $W$, where i = 1, 2, 3, ..., n.
To judge whether a character string is a word, it is not enough to consider how tightly the characters inside the string are bonded, that is, the size of the mutual information between them. The freedom of the string's boundaries must also be considered: the more kinds of adjacent characters occur at a boundary of the string, the larger the corresponding left or right information entropy, and the freer that boundary. The left and right information entropy are computed with formulas (2) and (3) of step (3) in the summary of the invention above.
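A minimal sketch of this boundary-freedom measure, assuming the standard Shannon entropy (base 2) over the characters observed immediately left and right of a string. The function name `branch_entropies` is hypothetical, and occurrences at the edge of a segment (next to a "//" marker) simply contribute no neighbour on that side, which is a simplification chosen here.

```python
import math
from collections import Counter


def branch_entropies(w, segments):
    """Return (left_entropy, right_entropy) of string w over unmarked segments."""
    left, right = Counter(), Counter()
    for seg in segments:
        start = seg.find(w)
        while start != -1:
            if start > 0:
                left[seg[start - 1]] += 1      # character just before w
            end = start + len(w)
            if end < len(seg):
                right[seg[end]] += 1           # character just after w
            start = seg.find(w, start + 1)     # also catch overlapping matches

    def entropy(neighbours):
        total = sum(neighbours.values())
        if total == 0:
            return 0.0
        return -sum((k / total) * math.log2(k / total)
                    for k in neighbours.values())

    return entropy(left), entropy(right)
```

A string that always has the same right neighbour gets a right entropy of 0, suggesting it is a fragment of a longer unit rather than a complete word.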
For example, for the corpus fragment "边缘被流苏状毛//,//" ("margin covered with fringe-like hairs"), part of the left information entropy results:
[Left information entropy values: given as formula images, not recovered]
Right information entropy results:
[Right information entropy values: given as formula images, not recovered]
(4) Define an evaluation function for the character string $W$ and set an evaluation function threshold; compute the evaluation function value of each character string to determine whether $W$ is a word; compare in turn the evaluation function value of the preceding character with that of the following character in $W$ to obtain the corresponding ratio for each character string; compare that ratio with the evaluation function threshold; and segment the character strings $W$ one by one. The steps are as follows:
(41) Define the evaluation function of the character string $W$, computed by:
[Formula (4): equation image not recovered]
where:
$W$ denotes a given character string composed of n characters;
the mutual information term denotes the mutual information value between the characters of $W$;
the left entropy term denotes the left information entropy of $W$;
the right entropy term denotes the right information entropy of $W$;
the balance factor adjusts the weights of the information entropy and the mutual information value in the evaluation function.
(42) Compute the evaluation function value of each character string and determine whether the character string $W$ is a word.
The evaluation function values of all character strings are computed with the evaluation function formula of step (4) in the summary of the invention above, with the balance factor set to 0.5. When the evaluation function value of $W$ is greater than the threshold 0.8, the character string $W$ is regarded as a word.
For example, for the corpus fragment "边缘被流苏状毛//,//" ("margin covered with fringe-like hairs"), part of the evaluation function results:
[Evaluation function values: given as formula images, not recovered]
(43) Compare in turn the evaluation function value of the preceding character with that of the following character in the above character string $W$ to obtain the corresponding ratio for each character string; compare that ratio with the evaluation function threshold; and segment the character strings $W$ one by one.
For example, at first from the first character of language material, choose respectively length and be 4,3,2,1 sub-word string, be denoted as
Figure 881622DEST_PATH_IMAGE054
,
Figure DEST_PATH_IMAGE055
,
Figure 575909DEST_PATH_IMAGE056
With
Figure DEST_PATH_IMAGE057
Then the evaluation function values of the length-4 substring and the length-3 substring are compared. If the comparison satisfies the threshold condition, the length-4 substring is judged to be a new word, and the symbol "*" is marked before and after it. Otherwise the length-4 substring is judged not to be a new word; its last character is discarded, and the evaluation function values of the length-3 substring and the length-2 substring are compared. If that comparison satisfies the threshold condition, the length-3 substring is judged to be a new word, and the symbol "*" is marked before and after it. Otherwise the length-3 substring is judged not to be a new word; its last character is discarded, and the evaluation function of the length-2 substring is judged. If it satisfies the threshold condition, the length-2 substring is judged to be a new word, and the symbol "*" is marked before and after it. Otherwise the length-1 substring is judged to be a new word, and the symbol "*" is marked before and after it.
Whenever a new word has been marked, substrings of length 4, 3, 2 and 1 are chosen again, starting from the first character after that new word, and the comparison of evaluation functions is carried out again. The symbol "//" is skipped when it is encountered. This is repeated until the whole corpus has been processed.
For example, take the corpus "edge by tasselled shape hair //, //". First, substrings of length 4, 3, 2 and 1 are intercepted starting from the first character, namely "edge is flowed", "edge by", "edge" and "limit". It is first judged whether the value obtained for "edge is flowed" is greater than or equal to 0.8; from the evaluation function results of step (41), this value is less than 0.8, so the word string "edge is flowed" is not a new word. It is then judged whether the value obtained for "edge by" is greater than or equal to 0.8; from the evaluation function results of step (41), this value is less than 0.8, so the word string "edge by" is not a new word either. It is then judged whether the value obtained for "edge" is greater than or equal to 0.8; from the evaluation function results of step (41), this value is greater than 0.8, so the word string "edge" is a new word. Once a new word has been found, substrings of length 4, 3, 2 and 1 are chosen again from the first character after it as the four substrings of a new round, namely "by the tasselled shape", "wave current Soviet Union", "by flowing" and "quilt", and the above steps are repeated, skipping the "//" symbol when it is encountered, until the end. The final segmentation result for the corpus "edge by tasselled shape hair //, //" is therefore "* edge * by * tasselled shape * hair //, //".
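The left-to-right scan of step (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the comparison conditions appear in the original only as formula images, so the acceptance test is passed in as a hypothetical callable `is_word`. The names `TOY_SCORES`, `THRESHOLD` and `toy_is_word`, the Latin toy strings, and the treatment of "//" as a standalone token that candidates may not cross are all assumptions made for the example; only the 4, 3, 2, 1 candidate lengths and the 0.8 threshold come from the text.

```python
from typing import Callable, List

SKIP = "//"  # marker that the scan skips over (assumed to be a standalone token)


def greedy_segment(text: str, is_word: Callable[[str], bool],
                   max_len: int = 4) -> List[str]:
    """Scan left to right. At each position try substrings of length
    max_len, ..., 2 and keep the first one that is_word() accepts.
    If none is accepted, the single character is kept as a word."""
    out: List[str] = []
    i = 0
    while i < len(text):
        if text.startswith(SKIP, i):
            out.append(SKIP)
            i += len(SKIP)
            continue
        # Candidates are not allowed to run across the next skip marker.
        nxt = text.find(SKIP, i)
        limit = len(text) if nxt == -1 else nxt
        chosen = text[i]
        for n in range(min(max_len, limit - i), 1, -1):
            cand = text[i:i + n]
            if is_word(cand):
                chosen = cand
                break
        out.append(chosen)
        i += len(chosen)  # restart right after the word just found
    return out


# Hypothetical scores standing in for the evaluation-function comparison.
TOY_SCORES = {"ab": 0.9, "cde": 0.85}
THRESHOLD = 0.8


def toy_is_word(s: str) -> bool:
    return TOY_SCORES.get(s, 0.0) >= THRESHOLD


print("*".join(greedy_segment("abxcde", toy_is_word)))  # ab*x*cde
```

Passing the acceptance test in as a callable keeps the scan independent of how the evaluation function values are actually compared.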
(5) Using the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, a domain-term conditional random field model is trained with the conditional random field method, and this model is used to recognize domain terms. The operating steps are as follows:
(51) The corpus is annotated with the word itself, its part of speech and its frequency of occurrence, specifically as follows:
Each word obtained by segmenting the word string W is annotated in turn with a feature sequence. The features annotated for a word are: the current word itself; the part of speech of the current word; and the frequency of occurrence of the current word. The K-Means clustering method is used to divide the frequencies of occurrence of the words into 10 grades, each grade being one class; the 10 classes are denoted A, B, C, D, E, F, G, H, I, J, K respectively. The annotated feature sequences are divided into two parts: annotated feature sequences for training and annotated feature sequences for testing;
(52) The CRF++ 0.53 toolkit is used to train on the annotated training feature sequences, yielding the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;
(53) The conditional random field model for domain term recognition is used to recognize domain terms in the annotated test feature sequences, specifically as follows:
The annotated test feature sequences are input to the conditional random field model obtained by the training in step (52). The model calculates the feature values and recognizes the domain terms, and the output is the recognized domain terms. For example, for the corpus "edge by tasselled shape hair //, //", "edge" and "tasselled shape" are finally recognized as domain terms.
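The annotation of step (51) can be sketched as follows, as an illustration under stated assumptions rather than the patented implementation. The sketch grades word frequencies with a plain one-dimensional k-means and writes rows in the whitespace-separated column layout that CRF++ reads: one token per line, the answer tag in the last column, and a blank line between sentences. The function names `kmeans_1d` and `crfpp_rows`, the evenly spaced initial centres, and the "B"/"O" tags are hypothetical; the text does not state which tag set is used.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(values: Sequence[float], k: int, iters: int = 50) -> List[int]:
    """Plain Lloyd's k-means on scalars. Returns one cluster index per
    value, renumbered so that 0 is the cluster with the lowest centre."""
    lo, hi = min(values), max(values)
    # Evenly spaced initial centres: a deterministic, assumed choice.
    centres = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: abs(v - centres[j]))
                  for v in values]
        new_centres = []
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            # An empty cluster keeps its previous centre.
            new_centres.append(sum(members) / len(members) if members
                               else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    order = sorted(range(k), key=lambda j: centres[j])
    rank = {j: r for r, j in enumerate(order)}
    return [rank[a] for a in assign]


def crfpp_rows(sentence: Sequence[Tuple[str, str, int, str]]) -> str:
    """Format (word, part of speech, frequency grade, tag) tuples as
    tab-separated rows, one token per line, ending with a blank line."""
    lines = [f"{w}\t{pos}\t{chr(ord('A') + grade)}\t{tag}"
             for w, pos, grade, tag in sentence]
    return "\n".join(lines) + "\n\n"


# Toy usage: two frequency grades instead of the ten used in the text.
freqs = [1, 2, 100, 101]
grades = kmeans_1d(freqs, k=2)
words = ["w1", "w2", "w3", "w4"]
tags = ["O", "O", "B", "B"]  # hypothetical tags
print(crfpp_rows([(w, "n", g, t) for w, g, t in zip(words, grades, tags)]))
```

A file built from such rows, together with a feature template file, is the kind of input the CRF++ command-line tools take (`crf_learn template_file train_file model_file`, then `crf_test -m model_file test_file`).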
The above is the preferred embodiment of the present invention. Based on the content disclosed herein, those skilled in the art can readily conceive of equivalent or alternative schemes, all of which shall fall within the scope of the technical innovation of the present invention.

Claims (5)

1. A Chinese domain term recognition method based on mutual information and a conditional random field model, comprising the following steps:
(1) collecting a domain text corpus, and marking all punctuation marks, spaces, digits, ASCII characters and characters other than Chinese characters in the corpus;
(2) defining a word string W and calculating the mutual information value of the word string W;
(3) calculating the left and right information entropy of the word string W;
(4) defining an evaluation function of the word string W and setting a threshold for the evaluation function; calculating the evaluation function value of each word string to determine whether the word string W is a word; comparing in turn the evaluation function value of the preceding word in the word string W with the evaluation function value of the following word to obtain the corresponding ratio for each word string W; comparing that ratio with the threshold of the evaluation function; and segmenting the word strings W one by one;
(5) using the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, training a domain-term conditional random field model with the conditional random field method, and using this model to recognize domain terms.
2. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that the word string W is defined and its mutual information value is calculated in step (2) as follows:
Suppose a domain term consists of n words. If the word string W is a domain term, then W is composed of n words: the first, second, third, ..., n-th word. The mutual information value of the word string W is calculated by formula (1):
[Formula image] (1)
where:
W denotes a word string composed of n words;
the words composing the word string W are indexed by i (i = 1, 2, 3, ..., n);
the per-word frequency terms denote the frequencies with which the first word, the second word, the third word and the n-th word each occur in the corpus;
the joint frequency term denotes the frequency with which the first, second, third, ..., n-th words occur together;
the mutual information term denotes the mutual information among all the words in the word string W.
3. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that the left and right information entropy in step (3) is calculated as follows:
The left information entropy is calculated by formula (2):
[Formula image] (2)
The right information entropy is calculated by formula (3):
[Formula image] (3)
where:
W denotes a given word string composed of n words;
the two conditional probability terms denote the conditional probabilities that a word appears on the left side and on the right side of W, respectively;
the two set terms denote the sets of words that occur on the left and on the right of W;
the words composing the word string W are indexed by i, where i = 1, 2, 3, ..., n.
4. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that defining the evaluation function of the word string W in step (4) and using the evaluation function to segment the corpus means using the mutual information and the left and right information entropy calculated in step (2) and step (3) to estimate the confidence that a word string W in the corpus is a word, and thereby judging whether the word string is a word, wherein the evaluation function of the word string W is calculated by formula (4):
[Formula image] (4)
where:
W denotes a given word string composed of n words;
the mutual information term denotes the mutual information value among the characters in the word string W;
the left entropy term denotes the left information entropy of the word string W;
the right entropy term denotes the right information entropy of the word string W;
the balance factor adjusts the weights of the information entropy and the mutual information value in the evaluation function of the word string W.
5. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that in step (5), in which the word itself, its part of speech and its frequency of occurrence are used as the training features of the conditional random field, a domain-term conditional random field model is trained with the conditional random field method, and the model is used to recognize domain terms, the operating steps are as follows:
(51) annotating the corpus with the word itself, its part of speech and its frequency of occurrence;
(52) using the CRF++ 0.53 toolkit to train on the annotated feature sequences, obtaining the conditional random field parameters, which constitute the conditional random field model for recognizing the domain terms;
(53) using the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences.
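The statistics named in claims 2 and 3 can be sketched as follows. Formulas (1) to (3) are available here only as images, so the forms below are assumptions: `pointwise_mi` uses a common textbook form, the base-2 logarithm of the joint probability divided by the product of the unit probabilities, and `neighbour_entropy` uses the usual Shannon entropy of the characters seen immediately to the left and to the right of W. Neither is claimed to match the patent's own formulas. Function names and the toy corpora are hypothetical, and units are single characters of a plain string.

```python
import math
from collections import Counter
from typing import Tuple


def _entropy(counts: Counter) -> float:
    """Shannon entropy (base 2) of a count distribution; 0.0 if empty."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def neighbour_entropy(corpus: str, w: str) -> Tuple[float, float]:
    """Entropy of the characters adjacent to each occurrence of w,
    returned as (left entropy, right entropy)."""
    left, right = Counter(), Counter()
    start = corpus.find(w)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(w)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(w, start + 1)
    return _entropy(left), _entropy(right)


def pointwise_mi(corpus: str, w: str) -> float:
    """log2(p(w) / product of p(unit)), with probabilities estimated as
    count / len(corpus). An assumed textbook form, not formula (1).
    str.count counts non-overlapping occurrences only."""
    n = len(corpus)
    p_joint = corpus.count(w) / n
    if p_joint == 0:
        return float("-inf")
    p_parts = 1.0
    for ch in w:
        p_parts *= corpus.count(ch) / n
    return math.log2(p_joint / p_parts)


le, re_ = neighbour_entropy("xabyzabyxabq", "ab")
print(round(le, 4), round(re_, 4))
print(pointwise_mi("abab", "ab"))
```

A string whose neighbours vary widely on both sides and whose units co-occur more often than chance would score highly on both statistics, which is the intuition behind combining them in an evaluation function.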
CN201210528734.8A 2012-12-11 2012-12-11 Chinese domain term recognition method based on mutual information and conditional random field model Expired - Fee Related CN103049501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210528734.8A CN103049501B (en) 2012-12-11 2012-12-11 Chinese domain term recognition method based on mutual information and conditional random field model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210528734.8A CN103049501B (en) 2012-12-11 2012-12-11 Chinese domain term recognition method based on mutual information and conditional random field model

Publications (2)

Publication Number Publication Date
CN103049501A true CN103049501A (en) 2013-04-17
CN103049501B CN103049501B (en) 2016-08-03

Family

ID=48062142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210528734.8A Expired - Fee Related CN103049501B (en) 2012-12-11 2012-12-11 Chinese domain term recognition method based on mutual information and conditional random field model

Country Status (1)

Country Link
CN (1) CN103049501B (en)

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593427A (en) * 2013-11-07 2014-02-19 清华大学 New word searching method and system
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method
CN103902673A (en) * 2014-03-19 2014-07-02 新浪网技术(中国)有限公司 Anti-garbage-filtering rule upgrading method and device
CN104572621A (en) * 2015-01-05 2015-04-29 语联网(武汉)信息技术有限公司 Decision tree based term judgment method
CN104679885A (en) * 2015-03-17 2015-06-03 北京理工大学 User search string organization name recognition method based on semantic feature model
CN105183923A (en) * 2015-10-27 2015-12-23 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN106021230A (en) * 2016-05-19 2016-10-12 无线生活(杭州)信息科技有限公司 Word segmentation method and word segmentation apparatus
CN106095753A (en) * 2016-06-07 2016-11-09 大连理工大学 A kind of financial field based on comentropy and term credibility term recognition methods
WO2016179988A1 (en) * 2015-05-12 2016-11-17 深圳市华傲数据技术有限公司 Chinese address parsing and annotation method
CN106202056A (en) * 2016-07-26 2016-12-07 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106445921A (en) * 2016-09-29 2017-02-22 北京理工大学 Chinese text term extracting method utilizing quadratic mutual information
CN106649661A (en) * 2016-12-13 2017-05-10 税云网络科技服务有限公司 Method and device for establishing knowledge base
CN106991085A (en) * 2017-04-01 2017-07-28 中国工商银行股份有限公司 The abbreviation generation method and device of a kind of entity
CN107291692A (en) * 2017-06-14 2017-10-24 北京百度网讯科技有限公司 Method for customizing, device, equipment and the medium of participle model based on artificial intelligence
CN107391486A (en) * 2017-07-20 2017-11-24 南京云问网络技术有限公司 A kind of field new word identification method based on statistical information and sequence labelling
CN107423278A (en) * 2016-05-23 2017-12-01 株式会社理光 The recognition methods of essential elements of evaluation, apparatus and system
CN108268440A (en) * 2017-01-04 2018-07-10 普天信息技术有限公司 A kind of unknown word identification method
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 A kind of Chinese new word discovery method based on novel degree
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN108804413A (en) * 2018-04-28 2018-11-13 百度在线网络技术(北京)有限公司 The recognition methods of text cheating and device
CN109145282A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Punctuate model training method, punctuate method, apparatus and computer equipment
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building
CN109710947A (en) * 2019-01-22 2019-05-03 福建亿榕信息技术有限公司 Power specialty word stock generating method and device
CN110175331A (en) * 2019-05-29 2019-08-27 三角兽(北京)科技有限公司 Recognition methods, device, electronic equipment and the readable storage medium storing program for executing of technical term
CN111090742A (en) * 2019-12-19 2020-05-01 东软集团股份有限公司 Question and answer pair evaluation method and device, storage medium and equipment
CN115495507A (en) * 2022-11-17 2022-12-20 江苏鸿程大数据技术与应用研究院有限公司 Engineering material information price matching method, system and storage medium
CN116702786A (en) * 2023-08-04 2023-09-05 山东大学 Chinese professional term extraction method and system integrating rules and statistical features

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202043B (en) * 2016-05-20 2019-04-12 北京理工大学 A kind of new word identification immune genetic method based at word rate fitness function

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
US20100088353A1 (en) * 2006-10-17 2010-04-08 Samsung Sds Co., Ltd. Migration Apparatus Which Convert Database of Mainframe System into Database of Open System and Method for Thereof
CN102314507A (en) * 2011-09-08 2012-01-11 北京航空航天大学 Recognition ambiguity resolution method of Chinese named entity

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100088353A1 (en) * 2006-10-17 2010-04-08 Samsung Sds Co., Ltd. Migration Apparatus Which Convert Database of Mainframe System into Database of Open System and Method for Thereof
CN101295294A (en) * 2008-06-12 2008-10-29 昆明理工大学 Improved Bayes acceptation disambiguation method based on information gain
CN102314507A (en) * 2011-09-08 2012-01-11 北京航空航天大学 Recognition ambiguity resolution method of Chinese named entity

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
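周浪 et al.: "一种面向术语抽取的短语过滤技术" [A phrase filtering technique for term extraction], 《计算机工程与应用》 [Computer Engineering and Applications], no. 19, 31 December 2009 (2009-12-31), pages 9 - 11 *
贾美英 et al.: "采用CRF技术的军事情报术语自动抽取研究" [Research on automatic extraction of military intelligence terms using CRF], 《计算机工程与应用》 [Computer Engineering and Applications], no. 32, 31 December 2009 (2009-12-31), pages 126 - 129 *
赵秦怡 et al.: "一种基于互信息的串扫描中文文本分词方法" [A string-scanning Chinese text word segmentation method based on mutual information], 《情报杂志》 [Journal of Intelligence], vol. 29, no. 7, 31 July 2010 (2010-07-31), pages 152 - 172 *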

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593427A (en) * 2013-11-07 2014-02-19 清华大学 New word searching method and system
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method
CN103778243B (en) * 2014-02-11 2017-02-08 北京信息科技大学 Domain term extraction method
CN103902673A (en) * 2014-03-19 2014-07-02 新浪网技术(中国)有限公司 Anti-garbage-filtering rule upgrading method and device
CN103902673B (en) * 2014-03-19 2017-11-24 新浪网技术(中国)有限公司 Anti-spam filtering rule upgrade method and device
CN104572621A (en) * 2015-01-05 2015-04-29 语联网(武汉)信息技术有限公司 Decision tree based term judgment method
CN104572621B (en) * 2015-01-05 2018-01-26 语联网(武汉)信息技术有限公司 A kind of term decision method based on decision tree
CN104679885A (en) * 2015-03-17 2015-06-03 北京理工大学 User search string organization name recognition method based on semantic feature model
WO2016179988A1 (en) * 2015-05-12 2016-11-17 深圳市华傲数据技术有限公司 Chinese address parsing and annotation method
CN105389349A (en) * 2015-10-27 2016-03-09 上海智臻智能网络科技股份有限公司 Dictionary updating method and apparatus
CN105389349B (en) * 2015-10-27 2018-07-27 上海智臻智能网络科技股份有限公司 Dictionary update method and device
CN108875040A (en) * 2015-10-27 2018-11-23 上海智臻智能网络科技股份有限公司 Dictionary update method and computer readable storage medium
CN105224682B (en) * 2015-10-27 2018-06-05 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN108897842A (en) * 2015-10-27 2018-11-27 上海智臻智能网络科技股份有限公司 Computer readable storage medium and computer system
CN105224682A (en) * 2015-10-27 2016-01-06 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN105183923A (en) * 2015-10-27 2015-12-23 上海智臻智能网络科技股份有限公司 New word discovery method and device
CN108897842B (en) * 2015-10-27 2021-04-09 上海智臻智能网络科技股份有限公司 Computer readable storage medium and computer system
CN108875040B (en) * 2015-10-27 2020-08-18 上海智臻智能网络科技股份有限公司 Dictionary updating method and computer-readable storage medium
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN106021230B (en) * 2016-05-19 2018-11-23 无线生活(杭州)信息科技有限公司 A kind of segmenting method and device
CN106021230A (en) * 2016-05-19 2016-10-12 无线生活(杭州)信息科技有限公司 Word segmentation method and word segmentation apparatus
CN107423278A (en) * 2016-05-23 2017-12-01 株式会社理光 The recognition methods of essential elements of evaluation, apparatus and system
CN107423278B (en) * 2016-05-23 2020-07-14 株式会社理光 Evaluation element identification method, device and system
CN106095753A (en) * 2016-06-07 2016-11-09 大连理工大学 A kind of financial field based on comentropy and term credibility term recognition methods
CN106095753B (en) * 2016-06-07 2018-11-06 大连理工大学 A kind of financial field term recognition methods based on comentropy and term confidence level
CN106202056A (en) * 2016-07-26 2016-12-07 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106202056B (en) * 2016-07-26 2019-01-04 北京智能管家科技有限公司 Chinese word segmentation scene library update method and system
CN106445921A (en) * 2016-09-29 2017-02-22 北京理工大学 Chinese text term extracting method utilizing quadratic mutual information
CN106445921B (en) * 2016-09-29 2019-05-07 北京理工大学 Utilize the Chinese text terminology extraction method of quadratic mutual information
CN106649661A (en) * 2016-12-13 2017-05-10 税云网络科技服务有限公司 Method and device for establishing knowledge base
CN108268440A (en) * 2017-01-04 2018-07-10 普天信息技术有限公司 A kind of unknown word identification method
CN106991085B (en) * 2017-04-01 2020-08-04 中国工商银行股份有限公司 Entity abbreviation generation method and device
CN106991085A (en) * 2017-04-01 2017-07-28 中国工商银行股份有限公司 The abbreviation generation method and device of a kind of entity
CN107291692B (en) * 2017-06-14 2020-12-18 北京百度网讯科技有限公司 Artificial intelligence-based word segmentation model customization method, device, equipment and medium
CN107291692A (en) * 2017-06-14 2017-10-24 北京百度网讯科技有限公司 Method for customizing, device, equipment and the medium of participle model based on artificial intelligence
CN109145282A (en) * 2017-06-16 2019-01-04 贵州小爱机器人科技有限公司 Punctuate model training method, punctuate method, apparatus and computer equipment
CN109145282B (en) * 2017-06-16 2023-11-07 贵州小爱机器人科技有限公司 Sentence-breaking model training method, sentence-breaking device and computer equipment
CN107391486A (en) * 2017-07-20 2017-11-24 南京云问网络技术有限公司 A kind of field new word identification method based on statistical information and sequence labelling
CN108509425B (en) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN108509425A (en) * 2018-04-10 2018-09-07 中国人民解放军陆军工程大学 A kind of Chinese new word discovery method based on novel degree
CN108804413A (en) * 2018-04-28 2018-11-13 百度在线网络技术(北京)有限公司 The recognition methods of text cheating and device
CN108776653A (en) * 2018-05-25 2018-11-09 南京大学 A kind of text segmenting method of the judgement document based on PageRank and comentropy
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building
CN109710947A (en) * 2019-01-22 2019-05-03 福建亿榕信息技术有限公司 Power specialty word stock generating method and device
CN109710947B (en) * 2019-01-22 2021-09-07 福建亿榕信息技术有限公司 Electric power professional word bank generation method and device
CN110175331A (en) * 2019-05-29 2019-08-27 三角兽(北京)科技有限公司 Recognition methods, device, electronic equipment and the readable storage medium storing program for executing of technical term
CN111090742A (en) * 2019-12-19 2020-05-01 东软集团股份有限公司 Question and answer pair evaluation method and device, storage medium and equipment
CN111090742B (en) * 2019-12-19 2024-05-17 东软集团股份有限公司 Question-answer pair evaluation method, question-answer pair evaluation device, storage medium and equipment
CN115495507B (en) * 2022-11-17 2023-03-24 江苏鸿程大数据技术与应用研究院有限公司 Engineering material information price matching method, system and storage medium
CN115495507A (en) * 2022-11-17 2022-12-20 江苏鸿程大数据技术与应用研究院有限公司 Engineering material information price matching method, system and storage medium
CN116702786A (en) * 2023-08-04 2023-09-05 山东大学 Chinese professional term extraction method and system integrating rules and statistical features
CN116702786B (en) * 2023-08-04 2023-11-17 山东大学 Chinese professional term extraction method and system integrating rules and statistical features

Also Published As

Publication number Publication date
CN103049501B (en) 2016-08-03

Similar Documents

Publication Publication Date Title
CN103049501A (en) Chinese domain term recognition method based on mutual information and conditional random field model
CN107451126B (en) Method and system for screening similar meaning words
CN106372061B (en) Short text similarity calculation method based on semantics
CN106445921B (en) Utilize the Chinese text terminology extraction method of quadratic mutual information
CN111241294A (en) Graph convolution network relation extraction method based on dependency analysis and key words
CN107169086B (en) Text classification method
CN103617157A (en) Text similarity calculation method based on semantics
CN103455562A (en) Text orientation analysis method and product review orientation discriminator on basis of same
CN108959258A (en) It is a kind of that entity link method is integrated based on the specific area for indicating to learn
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN101739430B (en) A kind of training method of the text emotion classifiers based on keyword and sorting technique
CN101882136B (en) Method for analyzing emotion tendentiousness of text
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN108038099B (en) Low-frequency keyword identification method based on word clustering
CN108376133A (en) The short text sensibility classification method expanded based on emotion word
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN108763348A (en) A kind of classification improved method of extension short text word feature vector
CN106933800A (en) A kind of event sentence abstracting method of financial field
CN113590810B (en) Abstract generation model training method, abstract generation device and electronic equipment
CN101770580A (en) Training method and classification method of cross-field text sentiment classifier
CN104881458A (en) Labeling method and device for web page topics
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN103020167A (en) Chinese text classification method for computer
CN109190099B (en) Sentence pattern extraction method and device
CN104008187A (en) Semi-structured text matching method based on the minimum edit distance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160803

Termination date: 20181211
