CN103049501A - Chinese domain term recognition method based on mutual information and conditional random field model - Google Patents
- Publication number
- CN103049501A (application CN201210528734.8)
- Authority
- CN
- China
- Prior art keywords
- word
- word string
- evaluation function
- random field
- field
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a Chinese domain term recognition method based on mutual information and a conditional random field model. The method includes the following steps: (1) collecting a domain text corpus and marking all punctuation marks, spaces, digits, ASCII (American Standard Code for Information Interchange) characters and other non-Chinese characters in the corpus; (2) setting character strings and computing the mutual information value of each character string; (3) computing the left information entropy and the right information entropy of each character string; (4) defining a character string evaluation function, setting an evaluation function threshold, computing the evaluation function value of each character string to determine whether it is a word, comparing in sequence the evaluation function value of the preceding word with that of the following word in the character string, and segmenting the character strings one by one; (5) training a conditional random field model and recognizing domain terms with that model. When the method is used to recognize terms, the data sparsity problem for legitimate terms is overcome, the amount of computation required by the conditional random field is reduced, and the accuracy of Chinese domain term recognition is improved.
Description
Technical field
The present invention relates to a Chinese domain term recognition method based on mutual information and a conditional random field model, and belongs to the field of information technology.
Background technology
According to the definition in national standard GB/T 15237.1-2000, "Terminology work: Vocabulary", a term is the verbal designation of a general concept in a specific professional field, that is, a word or phrase used in a subject field to represent a concept or relation in that field. Terms can be divided into general terms used in daily life and domain terms used in specific fields. General terms mostly arise from people's living and working habits; they are not required to express a concept precisely, and their meaning is often vague. A domain term is a systematic, summarizing description of a professional concept; ambiguity is not allowed, and the concept expressed by each technical term must be exact and must not vary from one user to another.
Domain term recognition means extracting professional domain terms from the corpus of a specific scientific or technical field. As an important part of automatic information extraction, domain term recognition is widely applied in natural language processing and is significant for improving the accuracy of domain text indexing and retrieval, text mining, ontology construction, text classification and clustering, latent semantic analysis, and so on. Existing methods for recognizing domain terms in Chinese text mainly include:
(1) Chinese domain term recognition based on statistical methods. The main idea is to extract domain terms by exploiting the strong association among the internal components of a domain term, together with the domain feature information of terms. The general flow is: first, use methods from statistics or information theory to build various statistics and, based on them, determine reasonably accurate seed words; then expand from these seeds to obtain the final domain terms. Word frequency, mean and variance are commonly used statistics, and many researchers use hypothesis tests, mainly the t-test, the chi-square test, the log-likelihood ratio, pointwise mutual information, and so on. Recognizing domain terms statistically requires no syntactic or semantic information, is not limited to a particular field, and depends on no external resources, so it is highly general.
Among these, the statistics-based mutual information algorithm is the most widely used. For example, the article "A Chinese term extraction system based on mutual information" (authors: Zhang Feng, Xu Yun, Hou Yan, Fan Xiaozhong; Application Research of Computers, 2005, vol. 22, no. 5, pp. 72-73, 77) discloses an automatic Chinese term extraction system. The system first computes the internal bonding strength of character strings based on mutual information to obtain a set of term candidates; it then removes basic words from the candidate set and filters further using common collocation prefix and suffix information; finally it performs lexical analysis on the candidates and discriminates them using the part-of-speech composition rules of terms to obtain the final extraction result. The experimental results show that term extraction with the mutual information algorithm reaches a precision of 72.19%, a recall of 77.98% and an F-measure of 74.97%. As another example, the article "Term extraction combining C-value and mutual information" (authors: Liang Yinghong, Zhang Wenjing, Zhang Youcheng; Computer Applications and Software, 2010, vol. 27, no. 4, pp. 108-110) discloses a term extraction method that combines C-value with mutual information. The method argues that the combined C-value parameter has an advantage in extracting long terms; the experimental results show a precision of 75.7%, a recall of 68.4% and an F-measure of 71.9% for long term extraction, higher than other methods on the same corpus. However, the performance of this algorithm depends directly on the size of the corpus and on the frequency of the candidate domain terms; for low-frequency candidates that may nevertheless be legitimate terms, the data sparsity problem is hard to solve. Consequently, when domain terms are recognized with a mutual information algorithm alone, precision, recall and F-measure all struggle to exceed 80%, and a satisfactory recognition effect is hard to obtain;
(2) Chinese domain term recognition based on machine learning. The main steps are: build a training corpus manually or semi-automatically, learn a model from the training corpus with some machine learning algorithm, and then use the model to run domain term extraction experiments on a test corpus in order to verify the algorithm. The machine learning theories that have been applied to Chinese domain term recognition so far mainly include decision trees, support vector machines, hidden Markov models, maximum entropy models, maximum entropy Markov models and conditional random fields. Machine-learning-based term recognition requires no expert domain knowledge or linguistic knowledge, is highly feasible to implement, and can achieve good recognition or extraction results when multiple term features are taken into account.
At present, among machine-learning-based Chinese domain term recognition methods, the conditional random field (CRF) model is the most widely used. For example, the article "An automatic extraction method for traditional Chinese medicine terms" (authors: Zhang Wubei, Bai Yu, Wang Peiyan, Zhang Guiping; Journal of Shenyang Aerospace University, 2011, vol. 28, no. 1, pp. 72-75) discloses a CRF-based term extraction method for the field of traditional Chinese medicine. The method treats the extraction of traditional Chinese medicine terms as a sequence labelling problem, quantifies the distributional characteristics of these terms as training features, trains a domain term model with a CRF toolkit, and then uses the model for term extraction. In a term extraction experiment on "Classified Medical Records of Famous Doctors" as the domain text, precision reached 83.11%, recall 81.04% and F-measure 82.06%. The article "Research on automatic term extraction for military information using CRF technology" (authors: Jia Meiying, Yang Bingru, Zheng Dequan, Yang Jing; Computer Engineering and Applications, 2009, vol. 45, no. 32, pp. 126-129) discloses a CRF-based term extraction method for the military information field. The method treats domain term recognition as a sequence labelling problem, quantifies the distributional characteristics of domain terms as training features, trains a domain term feature template with a CRF toolkit, and then uses this template for domain term extraction. Experiments show that the method recognizes military information terms well: precision reaches 73.24%, recall 69.57% and F-measure 71.36%.
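The F-measure figures quoted for the four cited studies can be checked against their precision and recall values. The following is a minimal sketch, assuming the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall); the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, F-measure % as quoted in the background section)
reported = [
    (72.19, 77.98, 74.97),  # mutual information system
    (75.7, 68.4, 71.9),     # C-value combined with mutual information
    (83.11, 81.04, 82.06),  # CRF, traditional Chinese medicine terms
    (73.24, 69.57, 71.36),  # CRF, military information terms
]

for p, r, quoted in reported:
    print(f"P={p} R={r} computed F1={f_measure(p, r):.2f} quoted F={quoted}")
```

Under this assumption the computed values agree with the quoted ones to the precision given in the text.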
When domain terms are recognized with the conditional random field algorithm, the training corpus is essentially annotated manually or semi-automatically; the degree of human involvement is high and the workload is large, so the amount of recognized data is generally small, which limits the recognition accuracy and the application of the algorithm. In addition, the corpus must first be segmented with a general-purpose word segmentation tool, and only then can the segmented corpus be used for conditional random field training and testing to finally recognize terms. Recognition with the conditional random field algorithm therefore presupposes that existing general-purpose segmentation tools can segment the vocabulary of the field accurately, and that domain terms are coarser-grained than the words produced by the segmentation tool. However, because professional domain terms differ from common words, accurate segmentation of a professional domain corpus is hard to achieve with a general-purpose segmentation tool. As a result, with current mutual information and conditional random field methods, the degree of automation in domain term recognition is low and the recognition accuracy is not high.
Summary of the invention
In view of the above problems in the prior art, the purpose of the invention is to provide a Chinese domain term recognition method based on mutual information and a conditional random field model. When recognizing terms, the method not only overcomes the data sparsity of legitimate terms and reduces the computational load of the conditional random field algorithm, but also improves the accuracy of Chinese domain term recognition.
To achieve the above purpose, the invention adopts the following technical solution:
The Chinese domain term recognition method based on mutual information and a conditional random field model of the invention comprises the following specific steps:
(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus;
(2) Set the character string W and calculate the mutual information value of the character string W;
(3) Calculate the left and right information entropy of the character string W;
(4) Define the evaluation function of the character string W, set the evaluation function threshold, and calculate the evaluation function value of each character string to determine whether the character string W is a word; compare, in turn, the evaluation function value of the preceding word in the character string with that of the following word to obtain the corresponding ratio for each character string W, compare this ratio with the evaluation function threshold, and segment the character strings W one by one;
(5) Take the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model with the conditional random field method, and use this model to recognize domain terms.
In step (2) above, the character string W is set and its mutual information value is calculated as follows:
Suppose a domain term consists of n characters. If the character string W is a domain term, then W is composed of n characters, and the mutual information value of W is calculated as follows:
[formulas not reproduced in the source text]
where the frequency term denotes the frequency with which a character occurs in the corpus;
In step (3) above, the left and right information entropy is calculated as follows:
[formulas not reproduced in the source text]
In step (4) above, defining the evaluation function of the character string W and segmenting the corpus with it means using the mutual information and the left and right information entropy calculated in steps (2) and (3) to estimate the confidence that a character string W in the corpus is a word, and thus to judge whether the character string is a word. The evaluation function of the character string W is calculated as follows:
[formula not reproduced in the source text]
where the balance factor adjusts the weights of the information entropy and the mutual information value in the evaluation function of the character string W.
In step (5) above, taking the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, training a domain term conditional random field model with the conditional random field method, and recognizing domain terms with this model comprises the following operation steps:
(51) Annotate the corpus with the word itself, its part of speech and its frequency of occurrence;
(52) Train on the annotated feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;
(53) Use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences.
Compared with the prior art, the Chinese domain term recognition method based on mutual information and a conditional random field model of the invention has the following effects:
(1) The method organically combines the two classes of term recognition methods, statistical and machine-learning-based, and effectively solves the data sparsity problem that arises when terms are recognized by statistical methods alone;
(2) The method uses the mutual information algorithm to segment and annotate the corpus, achieving automatic annotation of the corpus;
(3) The method uses only the 3 most common word features for training the conditional random field, which gives the method strong generality across domains, effectively reduces the computational load of the conditional random field, and shortens its training time.
Description of drawings
Fig. 1 is a flow chart of the Chinese domain term recognition method based on mutual information and a conditional random field model of the invention;
Fig. 2 is a flow chart of step (4) in Fig. 1;
Fig. 3 is a flow chart of step (5) in Fig. 1.
Embodiment
The invention is described in further detail below with reference to the drawings and a specific embodiment.
This embodiment illustrates the invention using domain term recognition for a plant, bamboo, as an example; it is not intended to limit the scope of the invention.
With reference to Fig. 1, the Chinese domain term recognition method based on mutual information and a conditional random field model of the invention comprises the following steps:
(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus.
For example, this embodiment uses the electronic manuscript of the Bambusoideae part of volume 9 of "Flora of China" (Zhongguo Zhiwu Zhi) as the domain text corpus.
First, the corpus is divided randomly in a ratio of 4:1 into two parts: a training corpus and a test corpus;
Then, all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus are located, and each of them is marked by placing the symbol "//" before and after it;
Finally, with reference to a Chinese part-of-speech table, all pronouns, interjections, auxiliary words and function words, as well as every word whose first character is one of "with, have,,, will,, from,, be, then,, every, this, should, to, institute, make, be, or not,, very, should, and, get," (single Chinese characters, given here as rendered in translation), are marked by placing the symbol "//" before and after the word.
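The corpus split and the marking of non-Chinese characters in step (1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the corpus is assumed to be a list of text lines, only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is treated as Chinese, and the names `mark_non_chinese` and `split_corpus` are hypothetical. Marking of pronouns, interjections and function words is left out because it needs a part-of-speech resource that the text does not specify.

```python
import random
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def mark_non_chinese(text: str) -> str:
    """Wrap every non-Chinese character in '//' markers, before and after."""
    return NON_CHINESE.sub(lambda m: f"//{m.group(0)}//", text)


def split_corpus(lines, ratio=4, seed=0):
    """Randomly split lines into training and test parts in a ratio:1 proportion."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = len(shuffled) * ratio // (ratio + 1)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # The Chinese phrase is an assumed back-translation of the patent's example.
    print(mark_non_chinese("边缘被流苏状毛，"))
    train, test = split_corpus([f"line {i}" for i in range(10)])
    print(len(train), len(test))
```

Splitting by line is only one possible unit; the text does not say whether the 4:1 split is made over sentences, paragraphs or documents.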
(2) Set the character string W and calculate the mutual information value of the character string W as follows:
Suppose a domain term consists of n characters. If the character string W is a domain term, then W is composed of n characters, and the mutual information value of W is calculated as follows:
[formulas not reproduced in the source text]
The invention assumes that a Chinese domain term is no more than 4 characters long, that punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters cannot occur inside a Chinese domain term, and that interjections, function words, demonstrative pronouns and similar words cannot occur there either. The invention therefore calculates, for every character in the corpus text, the mutual information values of the 2-character, 3-character and 4-character strings starting at that character, and stops when the marker "//" is reached. The mutual information formulas are formulas (1), (2) and (3) of step (2) in the Summary of the invention above.
For example, for the corpus fragment "edge by tasselled shape hair //, //", the 2-character strings are: "edge", "edge by", "by flowing", "tasselled", "Soviet Union's shape" and "shape hair"; the 3-character strings are: "edge quilt", "edge is flowed", "by tasselled", "tasselled shape" and "Su Zhuanmao"; the 4-character strings are: "edge is flowed", "edge is by tasselled", "by the tasselled shape" and "tasselled shape hair". Partial mutual information results are:
[values not reproduced in the source text]
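The enumeration of 2-, 3- and 4-character strings and a mutual information score can be sketched as follows. The patent's own formulas (1) to (3) are not available in this text, so the sketch substitutes the standard pointwise mutual information of a string against its individual characters; this is an assumption, not the patented formula. The Chinese phrase is an assumed back-translation of the example, and all function names are hypothetical.

```python
import math
from collections import Counter


def unmarked_runs(marked_text):
    """Return the character runs that lie outside '//' markers.

    Markers come in pairs around each excluded token, so after splitting on
    '//' the even-indexed pieces are the unmarked runs.
    """
    return [piece for piece in marked_text.split("//")[::2] if piece]


def ngrams(run, n):
    """All contiguous substrings of length n inside one run."""
    return [run[i:i + n] for i in range(len(run) - n + 1)]


def count_strings(runs, max_len=4):
    """Count every substring of length 1..max_len; counting never crosses a marker."""
    counts = Counter()
    for run in runs:
        for n in range(1, max_len + 1):
            counts.update(ngrams(run, n))
    return counts


def pmi(string, counts, total_chars):
    """Pointwise mutual information (base 2) of a string versus its characters.

    Simplification: joint and marginal probabilities are both estimated by
    dividing raw counts by the total number of characters.
    """
    if counts[string] == 0:
        return float("-inf")
    p_joint = counts[string] / total_chars
    p_independent = math.prod(counts[ch] / total_chars for ch in string)
    return math.log2(p_joint / p_independent)


if __name__ == "__main__":
    runs = unmarked_runs("边缘被流苏状毛//，//")
    for n in (2, 3, 4):
        print(n, ngrams(runs[0], n))
```

For the seven-character example this yields 6, 5 and 4 candidate strings of lengths 2, 3 and 4, matching the counts listed in the text.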
(3) Calculate the left and right information entropy of the character string W as follows:
The left information entropy is calculated as:
[formula not reproduced in the source text]
The right information entropy is calculated as:
[formula not reproduced in the source text]
To judge whether a character string is a word, one must consider not only how tightly the characters inside the string are bound together, that is, the size of the mutual information between them, but also the freedom of the string's boundaries: the more kinds of adjacent characters occur at a boundary of the string, the larger the string's left or right information entropy, and hence the freer that boundary. The formulas for the left and right information entropy are formulas (2) and (3) of step (3) in the Summary of the invention above.
For example, for the corpus fragment "edge by tasselled shape hair //, //", partial left information entropy results are:
[values not reproduced in the source text]
The right information entropy results are:
[values not reproduced in the source text]
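The boundary freedom described in step (3) can be sketched with the usual neighbour (branching) entropy. Because the patent's formulas are not available in this text, the sketch uses the standard Shannon entropy of the distribution of adjacent characters; this is an assumption. The function name `neighbor_entropy` and the toy runs are hypothetical.

```python
import math
from collections import Counter


def neighbor_entropy(runs, string, side):
    """Shannon entropy (base 2) of the characters adjacent to `string`.

    side is "left" or "right". Occurrences that touch a run boundary (that is,
    a '//' marker in the marked corpus) contribute no neighbour; that is a
    design choice of this sketch.
    """
    neighbors = Counter()
    for run in runs:
        start = run.find(string)
        while start != -1:
            idx = start - 1 if side == "left" else start + len(string)
            if 0 <= idx < len(run):
                neighbors[run[idx]] += 1
            start = run.find(string, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())


if __name__ == "__main__":
    # Toy runs: three distinct left neighbours, two distinct right neighbours.
    toy = ["甲流苏乙", "丙流苏乙", "丁流苏戊"]
    print(neighbor_entropy(toy, "流苏", "left"))
    print(neighbor_entropy(toy, "流苏", "right"))
```

A string that is always preceded by the same character gets a left entropy of zero, which signals that it is probably a fragment of a longer unit.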
(4) Define the evaluation function of the character string W, set the evaluation function threshold, and calculate the evaluation function value of each character string to determine whether the character string W is a word; compare, in turn, the evaluation function value of the preceding word in the character string with that of the following word to obtain the corresponding ratio for each character string W, compare this ratio with the evaluation function threshold, and segment the character strings W one by one. The operation steps are as follows:
The evaluation function contains a balance factor that adjusts the weights of the information entropy and the mutual information value in the evaluation function.
The evaluation function values of all character strings are calculated with the evaluation function formula of step (4) in the Summary of the invention above, with the balance factor set to 0.5; a character string W is considered to be a word when its evaluation function value is greater than the threshold 0.8.
For example, for the corpus fragment "edge by tasselled shape hair //, //", partial evaluation function results are:
[values not reproduced in the source text]
(43) Compare, in turn, the evaluation function value of the preceding word in the above character string W with that of the following word to obtain the corresponding ratio for each character string W, compare this ratio with the evaluation function threshold, and segment the character strings W one by one.
For example, starting from the first character of the corpus, take the substrings of length 4, 3, 2 and 1. First, the evaluation functions of the 4-character and 3-character substrings are compared: if the threshold condition is met, the 4-character substring is considered a new word and is marked with the symbol "*" before and after it. Otherwise the 4-character substring is not a new word; its last character is dropped and the evaluation functions of the 3-character and 2-character substrings are compared: if the threshold condition is met, the 3-character substring is considered a new word and is marked with the symbol "*" before and after it. Otherwise the 3-character substring is not a new word; its last character is dropped and the evaluation function of the 2-character substring is judged: if the threshold condition is met, the 2-character substring is considered a new word and is marked with the symbol "*" before and after it; otherwise the 1-character substring is considered a new word and is marked with the symbol "*" before and after it. Once a new word has been marked, substrings of length 4, 3, 2 and 1 are again taken starting from the first character after the new word, and the evaluation function comparison is repeated; "//" symbols are skipped when encountered. This is repeated until the whole corpus has been processed.
For example, for the corpus fragment "edge by tasselled shape hair //, //": first, substrings of length 4, 3, 2 and 1 are taken starting from the first character, namely "edge is flowed", "edge by", "edge" and "limit". Then, judge whether the value for "edge is flowed" is greater than or equal to 0.8; according to the evaluation function results of step (41), it is less than 0.8, so the character string "edge is flowed" is not a new word. Next, judge whether the value for "edge by" is greater than or equal to 0.8; according to the evaluation function results of step (41), it is less than 0.8, so the character string "edge by" is not a new word either. Next, judge whether the value for "edge" is greater than or equal to 0.8; according to the evaluation function results of step (41), it is greater than 0.8, so the character string "edge" is a new word. After a new word has been identified, substrings of length 4, 3, 2 and 1 are taken again starting from the first character after the new word as the candidates for the next round, namely "by the tasselled shape", "wave current Soviet Union", "by flowing" and "quilt", and the above steps are repeated; "//" symbols are skipped when encountered, until the end is reached. For the corpus fragment "edge by tasselled shape hair //, //", the final segmentation result is therefore "* edge * by * tasselled shape * hair //, //";
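The longest-first segmentation loop described above can be sketched as follows. The exact quantity that the patent compares with the threshold is not recoverable from this text, so the sketch assumes the test is simply "score of the candidate is greater than or equal to the threshold" and takes the scoring function as a parameter. The scores in the usage example are invented for illustration, the Chinese phrase is an assumed back-translation of the example, and `segment_run` is a hypothetical name.

```python
def segment_run(run, score, threshold=0.8, max_len=4):
    """Greedy longest-first segmentation of one run of Chinese characters.

    At each position the 4-, 3- and 2-character candidates are tried in that
    order; the first whose score reaches the threshold becomes a word. If none
    qualifies, the single character is emitted as a word.
    """
    words = []
    i = 0
    while i < len(run):
        longest = min(max_len, len(run) - i)
        for n in range(longest, 1, -1):
            if score(run[i:i + n]) >= threshold:
                break
        else:
            n = 1  # no multi-character candidate qualified
        words.append(run[i:i + n])
        i += n  # continue from the first character after the new word
    return words


if __name__ == "__main__":
    # Hypothetical evaluation-function values; unlisted strings score 0.0.
    toy_scores = {"边缘": 0.9, "流苏状": 0.85}
    words = segment_run("边缘被流苏状毛", lambda s: toy_scores.get(s, 0.0))
    print("*" + "*".join(words))
```

Runs are assumed to have been cut at the "//" markers beforehand, which has the same effect as skipping the markers during the scan.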
(5) Take the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model with the conditional random field method, and use this model to recognize domain terms. The operation steps are as follows:
(51) Annotate the corpus with the word itself, its part of speech and its frequency of occurrence, specifically as follows:
Annotate the segmented character strings W one by one with feature sequences. The annotated feature sequence of each word consists of: the current word itself; the part of speech of the current word; and the frequency of occurrence of the current word. The K-Means clustering method is used to divide the frequencies of occurrence of the current words into 10 levels, each level being one class; the 10 classes are denoted A, B, C, D, E, F, G, H, I, J, K. The annotated feature sequences are divided into two parts: annotated feature sequences for training and annotated feature sequences for testing;
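The feature sequences of step (51) can be sketched as rows in the column format that CRF++ reads: one token per line, whitespace-separated columns, with the label in the last column. The sketch below is illustrative only: it uses a plain one-dimensional Lloyd's algorithm in place of whatever K-Means implementation the authors used, it maps the levels to the first ten letters A to J (the text lists A to K for its 10 levels), and the part-of-speech tags, the B/O label scheme and all function names are assumptions.

```python
from collections import Counter

LEVELS = "ABCDEFGHIJ"  # assumption: one letter per frequency level, lowest first


def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's algorithm on scalars; returns a class index per value
    (0 = cluster with the lowest centre)."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))
    # Spread the initial centres evenly over the sorted distinct values.
    centres = [distinct[i * (len(distinct) - 1) // max(k - 1, 1)] for i in range(k)]

    def nearest(v):
        return min(range(len(centres)), key=lambda j: abs(v - centres[j]))

    for _ in range(iters):
        groups = [[] for _ in centres]
        for v in values:
            groups[nearest(v)].append(v)
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    order = sorted(range(len(centres)), key=lambda j: centres[j])
    rank = {j: r for r, j in enumerate(order)}
    return [rank[nearest(v)] for v in values]


def crf_rows(tokens, k=10):
    """Build 'word<TAB>pos<TAB>frequency level<TAB>label' rows from
    (word, pos, label) triples."""
    freq = Counter(word for word, _, _ in tokens)
    words = sorted(freq)
    level = dict(zip(words, kmeans_1d([freq[w] for w in words], k)))
    return [f"{w}\t{pos}\t{LEVELS[level[w]]}\t{tag}" for w, pos, tag in tokens]


if __name__ == "__main__":
    sample = [("边缘", "n", "B"), ("被", "p", "O"), ("流苏状", "n", "B"),
              ("毛", "n", "O"), ("边缘", "n", "B")]
    print("\n".join(crf_rows(sample)))
```

With rows like these saved to a file and a feature template prepared, the CRF++ command-line tools are typically run as `crf_learn template train.data model` and `crf_test -m model test.data`; the file names here are placeholders.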
(52) Train on the annotated feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;
(53) Use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences, specifically as follows:
The annotated test feature sequences are input to the conditional random field model for domain term recognition obtained by the training in step (52). The model computes the feature values and identifies the domain terms, and the output is the set of recognized domain terms. For example, for the corpus fragment "edge by tasselled shape hair //, //", "edge" and "tasselled shape" are finally recognized as domain terms.
The above is the preferred embodiment of the invention. Based on the content disclosed by the invention, those skilled in the art can readily conceive of equivalent or alternative solutions, all of which shall fall within the scope of the technical innovation of the invention.
Claims (5)
1. A Chinese domain term recognition method based on mutual information and a conditional random field model, comprising the following specific steps:
(1) Collect a domain text corpus, and mark all punctuation marks, spaces, digits, ASCII characters and other non-Chinese characters in the corpus;
(2) Set the character string W and calculate the mutual information value of the character string W;
(3) Calculate the left and right information entropy of the character string W;
(4) Define the evaluation function of the character string W, set the evaluation function threshold, and calculate the evaluation function value of each character string to determine whether the character string W is a word; compare, in turn, the evaluation function value of the preceding word in the character string with that of the following word to obtain the corresponding ratio for each character string W, compare this ratio with the evaluation function threshold, and segment the character strings W one by one;
(5) Take the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, train a domain term conditional random field model with the conditional random field method, and use this model to recognize domain terms.
2. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that, in step (2) above, the character string W is set and its mutual information value is calculated as follows:
Suppose a domain term consists of n characters. If the character string W is a domain term, then W is composed of n characters, and the mutual information value of W is calculated as follows:
[formulas not reproduced in the source text]
3. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that, in step (3) above, the left and right information entropy is calculated as follows:
[formulas not reproduced in the source text]
4. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that, in step (4) above, defining the evaluation function of the character string W and segmenting the corpus with it means using the mutual information and the left and right information entropy calculated in steps (2) and (3) to estimate the confidence that a character string W in the corpus is a word, and thus to judge whether the character string is a word, wherein the evaluation function of the character string W is calculated as follows:
[formula not reproduced in the source text]
where W denotes a given character string composed of n characters;
5. The Chinese domain term recognition method based on mutual information and a conditional random field model according to claim 1, characterized in that, in step (5) above, taking the word itself, its part of speech and its frequency of occurrence as the training features of the conditional random field, training a domain term conditional random field model with the conditional random field method, and recognizing domain terms with this model comprises the following operation steps:
(51) Annotate the corpus with the word itself, its part of speech and its frequency of occurrence;
(52) Train on the annotated feature sequences with the CRF++ 0.53 toolkit to obtain the conditional random field parameters; these parameters constitute the conditional random field model for domain term recognition;
(53) Use the conditional random field model for domain term recognition to recognize domain terms in the annotated test feature sequences.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210528734.8A CN103049501B (en) | 2012-12-11 | 2012-12-11 | Chinese domain term recognition method based on mutual information and conditional random field models |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210528734.8A CN103049501B (en) | 2012-12-11 | 2012-12-11 | Chinese domain term recognition method based on mutual information and conditional random field models |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103049501A (en) | 2013-04-17 |
CN103049501B (en) | 2016-08-03 |
Family
ID=48062142
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210528734.8A Expired - Fee Related CN103049501B (en) | Chinese domain term recognition method based on mutual information and conditional random field models | 2012-12-11 | 2012-12-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103049501B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202043B (en) * | 2016-05-20 | 2019-04-12 | 北京理工大学 | New word identification immune genetic method based on a word-formation-rate fitness function |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101295294A (en) * | 2008-06-12 | 2008-10-29 | 昆明理工大学 | Improved Bayesian word sense disambiguation method based on information gain |
US20100088353A1 (en) * | 2006-10-17 | 2010-04-08 | Samsung Sds Co., Ltd. | Migration Apparatus Which Convert Database of Mainframe System into Database of Open System and Method for Thereof |
CN102314507A (en) * | 2011-09-08 | 2012-01-11 | 北京航空航天大学 | Recognition ambiguity resolution method of Chinese named entity |
Non-Patent Citations (3)
Title |
---|
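Zhou Lang et al.: "A phrase filtering technique for term extraction", Computer Engineering and Applications (《计算机工程与应用》), no. 19, 31 December 2009 (2009-12-31), pages 9 - 11 * |
Jia Meiying et al.: "Research on automatic extraction of military intelligence terms using CRF", Computer Engineering and Applications (《计算机工程与应用》), no. 32, 31 December 2009 (2009-12-31), pages 126 - 129 * |
Zhao Qinyi et al.: "A string-scanning Chinese text word segmentation method based on mutual information", Journal of Intelligence (《情报杂志》), vol. 29, no. 7, 31 July 2010 (2010-07-31), pages 152 - 172 * |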
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103593427A (en) * | 2013-11-07 | 2014-02-19 | 清华大学 | New word searching method and system |
CN103778243A (en) * | 2014-02-11 | 2014-05-07 | 北京信息科技大学 | Domain term extraction method |
CN103778243B (en) * | 2014-02-11 | 2017-02-08 | 北京信息科技大学 | Domain term extraction method |
CN103902673A (en) * | 2014-03-19 | 2014-07-02 | 新浪网技术(中国)有限公司 | Anti-spam filtering rule upgrade method and device |
CN103902673B (en) * | 2014-03-19 | 2017-11-24 | 新浪网技术(中国)有限公司 | Anti-spam filtering rule upgrade method and device |
CN104572621A (en) * | 2015-01-05 | 2015-04-29 | 语联网(武汉)信息技术有限公司 | Decision tree based term judgment method |
CN104572621B (en) * | 2015-01-05 | 2018-01-26 | 语联网(武汉)信息技术有限公司 | Decision tree based term judgment method |
CN104679885A (en) * | 2015-03-17 | 2015-06-03 | 北京理工大学 | User search string organization name recognition method based on semantic feature model |
WO2016179988A1 (en) * | 2015-05-12 | 2016-11-17 | 深圳市华傲数据技术有限公司 | Chinese address parsing and annotation method |
CN105389349A (en) * | 2015-10-27 | 2016-03-09 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and apparatus |
CN105389349B (en) * | 2015-10-27 | 2018-07-27 | 上海智臻智能网络科技股份有限公司 | Dictionary update method and device |
CN108875040A (en) * | 2015-10-27 | 2018-11-23 | 上海智臻智能网络科技股份有限公司 | Dictionary update method and computer readable storage medium |
CN105224682B (en) * | 2015-10-27 | 2018-06-05 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN108897842A (en) * | 2015-10-27 | 2018-11-27 | 上海智臻智能网络科技股份有限公司 | Computer readable storage medium and computer system |
CN105224682A (en) * | 2015-10-27 | 2016-01-06 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN105183923A (en) * | 2015-10-27 | 2015-12-23 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
CN108897842B (en) * | 2015-10-27 | 2021-04-09 | 上海智臻智能网络科技股份有限公司 | Computer readable storage medium and computer system |
CN108875040B (en) * | 2015-10-27 | 2020-08-18 | 上海智臻智能网络科技股份有限公司 | Dictionary updating method and computer-readable storage medium |
CN105260362A (en) * | 2015-10-30 | 2016-01-20 | 小米科技有限责任公司 | New word extraction method and device |
CN106021230A (en) * | 2016-05-19 | 2016-10-12 | 无线生活(杭州)信息科技有限公司 | Word segmentation method and word segmentation apparatus |
CN106021230B (en) * | 2016-05-19 | 2018-11-23 | 无线生活(杭州)信息科技有限公司 | Word segmentation method and word segmentation apparatus |
CN107423278A (en) * | 2016-05-23 | 2017-12-01 | 株式会社理光 | Evaluation element identification method, device and system |
CN107423278B (en) * | 2016-05-23 | 2020-07-14 | 株式会社理光 | Evaluation element identification method, device and system |
CN106095753A (en) * | 2016-06-07 | 2016-11-09 | 大连理工大学 | Financial domain term recognition method based on information entropy and term confidence |
CN106095753B (en) * | 2016-06-07 | 2018-11-06 | 大连理工大学 | Financial domain term recognition method based on information entropy and term confidence |
CN106202056A (en) * | 2016-07-26 | 2016-12-07 | 北京智能管家科技有限公司 | Chinese word segmentation scene library update method and system |
CN106202056B (en) * | 2016-07-26 | 2019-01-04 | 北京智能管家科技有限公司 | Chinese word segmentation scene library update method and system |
CN106445921A (en) * | 2016-09-29 | 2017-02-22 | 北京理工大学 | Chinese text term extracting method utilizing quadratic mutual information |
CN106445921B (en) * | 2016-09-29 | 2019-05-07 | 北京理工大学 | Chinese text term extraction method utilizing quadratic mutual information |
CN106649661A (en) * | 2016-12-13 | 2017-05-10 | 税云网络科技服务有限公司 | Method and device for establishing knowledge base |
CN108268440A (en) * | 2017-01-04 | 2018-07-10 | 普天信息技术有限公司 | Unknown word identification method |
CN106991085B (en) * | 2017-04-01 | 2020-08-04 | 中国工商银行股份有限公司 | Entity abbreviation generation method and device |
CN106991085A (en) * | 2017-04-01 | 2017-07-28 | 中国工商银行股份有限公司 | Entity abbreviation generation method and device |
CN107291692B (en) * | 2017-06-14 | 2020-12-18 | 北京百度网讯科技有限公司 | Artificial intelligence-based word segmentation model customization method, device, equipment and medium |
CN107291692A (en) * | 2017-06-14 | 2017-10-24 | 北京百度网讯科技有限公司 | Artificial intelligence-based word segmentation model customization method, device, equipment and medium |
CN109145282A (en) * | 2017-06-16 | 2019-01-04 | 贵州小爱机器人科技有限公司 | Sentence-breaking model training method, sentence-breaking method, device and computer equipment |
CN109145282B (en) * | 2017-06-16 | 2023-11-07 | 贵州小爱机器人科技有限公司 | Sentence-breaking model training method, sentence-breaking device and computer equipment |
CN107391486A (en) * | 2017-07-20 | 2017-11-24 | 南京云问网络技术有限公司 | Domain new word identification method based on statistical information and sequence labeling |
CN108509425B (en) * | 2018-04-10 | 2021-08-24 | 中国人民解放军陆军工程大学 | Chinese new word discovery method based on novelty |
CN108509425A (en) * | 2018-04-10 | 2018-09-07 | 中国人民解放军陆军工程大学 | Chinese new word discovery method based on novelty |
CN108804413A (en) * | 2018-04-28 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | Text cheating recognition method and device |
CN108776653A (en) * | 2018-05-25 | 2018-11-09 | 南京大学 | Text segmentation method for judgment documents based on PageRank and information entropy |
CN109492224B (en) * | 2018-11-07 | 2024-05-03 | 北京金山数字娱乐科技有限公司 | Vocabulary construction method and device |
CN109492224A (en) * | 2018-11-07 | 2019-03-19 | 北京金山数字娱乐科技有限公司 | Vocabulary construction method and device |
CN109710947A (en) * | 2019-01-22 | 2019-05-03 | 福建亿榕信息技术有限公司 | Electric power professional word bank generation method and device |
CN109710947B (en) * | 2019-01-22 | 2021-09-07 | 福建亿榕信息技术有限公司 | Electric power professional word bank generation method and device |
CN110175331A (en) * | 2019-05-29 | 2019-08-27 | 三角兽(北京)科技有限公司 | Technical term recognition method, device, electronic equipment and readable storage medium |
CN111090742A (en) * | 2019-12-19 | 2020-05-01 | 东软集团股份有限公司 | Question and answer pair evaluation method and device, storage medium and equipment |
CN111090742B (en) * | 2019-12-19 | 2024-05-17 | 东软集团股份有限公司 | Question-answer pair evaluation method, question-answer pair evaluation device, storage medium and equipment |
CN115495507B (en) * | 2022-11-17 | 2023-03-24 | 江苏鸿程大数据技术与应用研究院有限公司 | Engineering material information price matching method, system and storage medium |
CN115495507A (en) * | 2022-11-17 | 2022-12-20 | 江苏鸿程大数据技术与应用研究院有限公司 | Engineering material information price matching method, system and storage medium |
CN116702786A (en) * | 2023-08-04 | 2023-09-05 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
CN116702786B (en) * | 2023-08-04 | 2023-11-17 | 山东大学 | Chinese professional term extraction method and system integrating rules and statistical features |
Also Published As
Publication number | Publication date |
---|---|
CN103049501B (en) | 2016-08-03 |
Similar Documents
Publication | Title |
---|---|
CN103049501A (en) | Chinese domain term recognition method based on mutual information and conditional random field model |
CN107451126B (en) | Method and system for screening similar meaning words |
CN105786991B (en) | Chinese emotion new word identification method and system combining user emotion expression patterns |
CN103970729B (en) | Multi-topic extraction method based on semantic category |
CN106372061B (en) | Short text similarity calculation method based on semantics |
CN111241294A (en) | Graph convolution network relation extraction method based on dependency analysis and keywords |
CN106445921B (en) | Chinese text term extraction method utilizing quadratic mutual information |
CN107169086B (en) | Text classification method |
CN103617157A (en) | Text similarity calculation method based on semantics |
CN103455562A (en) | Text orientation analysis method and product review orientation discriminator on basis of same |
CN103020167B (en) | Computer-based Chinese text classification method |
CN108959258A (en) | Domain-specific ensemble entity linking method based on representation learning |
CN110362678A (en) | Method and apparatus for automatically extracting Chinese text keywords |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm |
CN101739430B (en) | Training method and classification method for a keyword-based text sentiment classifier |
CN108038099B (en) | Low-frequency keyword identification method based on word clustering |
CN108376133A (en) | Short text sentiment classification method based on emotion word expansion |
CN103970730A (en) | Method for extracting multiple subject terms from single Chinese text |
CN108763348A (en) | Improved classification method that extends short text word feature vectors |
CN106933800A (en) | Event sentence extraction method for the financial field |
CN104899188A (en) | Problem similarity calculation method based on subjects and focuses of problems |
CN113590810B (en) | Abstract generation model training method, abstract generation device and electronic equipment |
CN101770580A (en) | Training method and classification method of cross-field text sentiment classifier |
CN104881458A (en) | Labeling method and device for web page topics |
CN104317965A (en) | Establishment method of emotion dictionary based on linguistic data |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20160803; Termination date: 20181211 |