CN104965818B - Project name entity recognition method and system based on self-learned rules - Google Patents

Project name entity recognition method and system based on self-learned rules

Info

Publication number
CN104965818B
Authority
CN
China
Prior art keywords
speech
words
text
feature
entry name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510271752.6A
Other languages
Chinese (zh)
Other versions
CN104965818A (en)
Inventor
柳厅文
时金桥
张洋
闫旸
郭莉
张浩亮
亚静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201510271752.6A priority Critical patent/CN104965818B/en
Publication of CN104965818A publication Critical patent/CN104965818A/en
Application granted granted Critical
Publication of CN104965818B publication Critical patent/CN104965818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a project name entity recognition method and system based on self-learned rules. A part-of-speech blacklist and a keyword whitelist serve as the rules, and both lists are constructed without any human participation: they are learned automatically from the training set. The invention can supplement traditional recognition methods and thereby improve precision and recall over the original baseline.

Description

Project name entity recognition method and system based on self-learned rules
Technical field
The present invention relates to fields such as text processing and natural language processing, and in particular to a project name entity recognition method and system based on self-learned rules.
Background technology
Named entity recognition is a basic problem of natural language processing. In natural language processing, named entities mainly include entity names, such as country names, organization names, place names, person names and abbreviations, as well as some numerical expressions, such as currency values, percentages and temporal expressions.
English named entity recognition only needs to consider the features of the words themselves and does not involve word segmentation, so it is comparatively easy to implement. According to the MUC and ACE evaluation results, the precision, recall and F1 values of English named entity recognition can now mostly reach about 90%. Chinese named entity recognition started later. In the early 1990s, some domestic researchers studied the recognition of Chinese named entities (such as place names, person names and organization names). For example, Sun Maosong et al. were among the first in China to work on Chinese person name recognition, mainly using statistical methods to compute the probabilities of surname and given-name characters; Zhang Xiaoheng et al. identified and analyzed Chinese organization names, mainly experimenting on university names with hand-written rules; Zhang et al. of the Intel China Research Center demonstrated at ACL 2000 an information extraction system they developed for extracting Chinese named entities and the relations between these entities. The system acquires rules with a memory-based learning (MBL) algorithm in order to extract named entities and the relations between them. Although the recognition of person names, place names and organization names now works well, research on recognizing named entities of particular kinds is still at a blank stage.
The classical algorithms for named entity recognition are statistical methods such as hidden Markov models, conditional random fields and maximum entropy models. Traditional statistical methods cannot guarantee that all named entities are retrieved and detected.
To extract information from science and technology texts, it is necessary to develop named entity recognition techniques with higher precision and recall.
Summary of the invention
The invention provides a project name entity recognition method and system based on self-learned rules. A part-of-speech blacklist and a keyword whitelist serve as the rules, and both lists are constructed without any human participation: they are learned automatically from the training set. The invention can supplement traditional recognition methods and thereby improve precision and recall over the original baseline.
To achieve these goals, the present invention adopts the following technical scheme:
A project name entity recognition method based on self-learned rules, comprising the following steps:
1) taking multiple project names as a training set to produce a part-of-speech blacklist and a feature-word whitelist;
2) cutting the text to be recognized based on context information;
3) truncating the text to be recognized after the cutting of step 2) based on the part-of-speech blacklist;
4) in the text to be recognized after the processing of step 3), confirming the project name based on the feature-word whitelist to obtain the final recognition result.
Further, the part-of-speech blacklist is obtained by removing, from the parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology (ICT), every part of speech that occurs in any project name.
Further, the feature-word whitelist is the minimal feature-word set, obtained when part-of-speech tagging the project-name set, such that every project name contains a feature word from the set.
Further, if a feature word is contained in a science and technology project name, the feature word is said to cover that project name; if the feature words in a feature-word set together cover all project names, the feature-word set is said to fully cover the project names.
Further, the minimal feature-word set is obtained by the following method:
The project names in the training set are segmented to obtain the set of all words, and a minimum cover of the project-name set is found among these words; this minimum cover is defined as the minimal feature-word set.
Further, in step 2), the context information of project names is detected in the form of regular expressions, and the sentences in the text to be recognized that a regular expression hits are cut.
A project name entity recognition system based on self-learned rules, comprising:
a corpus training module, for training on project names to obtain the part-of-speech blacklist and the feature-word whitelist;
a text input unit, for inputting the text to be recognized;
a text cutting unit, for cutting the text to be recognized according to context cue information;
a text truncation unit, for truncating, according to the part-of-speech blacklist, the text to be recognized after it has been cut by the text cutting unit;
a text confirmation unit, for confirming, according to the feature-word whitelist, the project names obtained by the text truncation unit, to obtain the final recognition result.
Further, the text cutting unit detects context information based on regular expressions and cuts the sentences that are hit.
The beneficial effects of the present invention are as follows:
The present invention uses a part-of-speech blacklist and a keyword whitelist as rules, and both lists are constructed without any human participation: they are learned automatically from the training set.
The present invention can supplement traditional recognition methods and thereby improve precision and recall over the original baseline. Using the method of the present invention, we obtained 94.78% precision, 89.19% recall and an F1 value of 91.9% on 1500 groups of test corpus.
Brief description of the drawings
Fig. 1 is the overall flow chart of the project name entity recognition method based on self-learned rules of the invention.
Fig. 2 is a schematic diagram of the frequency distribution of feature words.
Fig. 3 is a schematic diagram of the ⊙ operation of the invention.
Fig. 4 is a trend chart of the change in the gain of project names covered by feature words as the value of k increases.
Fig. 5 is the block diagram of the project name entity recognition system based on self-learned rules of the invention.
Detailed description of the embodiments
The present invention is explained in further detail below with reference to the accompanying drawings.
The overall flow of the project name entity recognition method based on self-learned rules is shown in Fig. 1. The key steps are described in detail as follows:
1. Cutting based on context information
The context gives cues to where a project name lies. We write common cue phrases as regular expressions and use them to detect the external cues around a project name; these serve as the context condition for project-name detection. When a regular expression hits, the sentence that was hit is cut. Taking the pattern "... wins ... award" as an example, typical cue sentences are:
The "XXX" project wins the first prize of the National Science and Technology Progress Award
The "YYY" project is awarded the second prize of the National Natural Science Award
The "ZZZ" project has reached the advanced international standard
Take the following passage as an example: "On January 18, 2013, the CPC Central Committee and the State Council solemnly held the National Science and Technology Award Conference in the Great Hall of the People in Beijing. Party and state leaders including Hu Jintao, Xi Jinping, Wen Jiabao, Li Keqiang and Liu Yunshan attended the conference and presented awards to the prize-winning representatives of 2012. The 2012 national science and technology awards went to 330 projects, among them 212 Science and Technology Progress Awards, including 3 special prizes, 22 first prizes and 187 second prizes. The project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》, recommended by the Chinese Anti-Cancer Association, won the first prize of the National Science and Technology Progress Award, and the project leader, Professor Bian Xiuwu, went on stage to receive the award from the central leaders." In this passage, the sentence "The project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》, recommended by the Chinese Anti-Cancer Association, won the first prize of the National Science and Technology Progress Award" first matches the regular expression rule "... wins ... award". The clause "won the first prize of the National Science and Technology Progress Award" is then deleted from the sentence, which removes part of the irrelevant information.
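The cutting step can be sketched as follows. This is a minimal illustration on English text, not the patented implementation: the cue patterns in `CUE_PATTERNS` and the function name `cut_by_context` are assumptions made for the example, since the text only names the "... wins ... award" pattern.

```python
import re

# Hypothetical cue patterns (assumption): the source only gives
# "... wins ... award" and similar sentences as examples.
CUE_PATTERNS = [
    re.compile(r"\s*(?:wins|obtains|is awarded)\b.*?(?:Award|Prize|prize)\b\.?"),
    re.compile(r"\s*has reached\b.*?\bstandard\b\.?"),
]


def cut_by_context(sentence):
    """Return (hit, remainder).

    hit is True when a cue pattern matches; remainder is the sentence
    with the matched cue span removed.
    """
    for pattern in CUE_PATTERNS:
        match = pattern.search(sentence)
        if match:
            remainder = sentence[:match.start()] + sentence[match.end():]
            return True, remainder.strip()
    return False, sentence


hit, rest = cut_by_context(
    'The "XXX" project wins first-class National Scientific and '
    "Technological Progress Award."
)
print(hit, rest)
```

Only sentences that hit a pattern are passed on to the later steps, so the patterns act as a cheap filter before any part-of-speech processing.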
2. Text cutting based on the part-of-speech blacklist
Some parts of speech never occur in science and technology project names. Of the 96 parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology (ICT), 35 never appear in the project-name training corpus. The part-of-speech blacklist is used to cut the corpus produced by the previous step, so that the cutting result is as close as possible to the real result.
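Learning the blacklist amounts to a set difference between the full tag set and the tags observed in the training names. The sketch below assumes the project names have already been segmented and tagged by some external tagger; the tag labels and the function name `learn_pos_blacklist` are illustrative assumptions, not the actual tag set.

```python
def learn_pos_blacklist(tagset, tagged_names):
    """Return the tags in tagset that never occur in any training name.

    tagged_names is a list of project names, each given as a list of
    (word, pos) pairs produced by an external tagger (assumption).
    """
    seen = {pos for name in tagged_names for _, pos in name}
    return set(tagset) - seen


# Toy tag set and training data, for illustration only.
toy_tagset = {"n", "v", "c", "p", "w", "y"}
toy_names = [
    [("tumor", "n"), ("mechanism", "n"), ("and", "c"), ("application", "n")],
    [("research", "n"), ("on", "p"), ("steel", "n")],
]
print(sorted(learn_pos_blacklist(toy_tagset, toy_names)))
```

With the figures given in the text, the same difference would be 96 defined tags minus 61 observed tags, leaving 35 blacklisted tags.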
For example, the output of the previous step is: "On January 18, 2013, the CPC Central Committee and the State Council solemnly held the National Science and Technology Award Conference in the Great Hall of the People in Beijing. Party and state leaders including Hu Jintao, Xi Jinping, Wen Jiabao, Li Keqiang and Liu Yunshan attended the conference and presented awards to the prize-winning representatives of 2012. The 2012 national science and technology awards went to 330 projects, among them 212 Science and Technology Progress Awards, including 3 special prizes, 22 first prizes and 187 second prizes. The project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》, recommended by the Chinese Anti-Cancer Association, and the project leader, Professor Bian Xiuwu, went on stage to receive the award from the central leaders." Cutting this text according to the part-of-speech blacklist yields the following substrings: "the CPC Central Committee and the State Council solemnly held in the Great Hall of the People in Beijing", "National Science and Technology Award Conference", "and other party and state leaders attended the conference and the prize-winning representatives of 2012", "presented awards", "the 2012 national science and technology awards went to 330 projects", "212 Science and Technology Progress Awards, including 3 special prizes, 22 first prizes", "187 second prizes", and "the project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》 recommended by the Chinese Anti-Cancer Association". Compared with the previous step, the target string we obtain, "Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy", is closer to the actual science and technology project title.
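The blacklist cut can be sketched as a single pass over tagged tokens that breaks the text wherever a blacklisted tag occurs. The token list, the tags and the function name `split_on_blacklist` are assumptions for illustration; for Chinese text the joiner would be the empty string.

```python
def split_on_blacklist(tagged_tokens, blacklist, joiner=" "):
    """Split a tagged token sequence at every token whose tag is blacklisted.

    Blacklisted tokens act as separators and are dropped; the remaining
    runs of tokens are returned as candidate strings.
    """
    segments, current = [], []
    for word, pos in tagged_tokens:
        if pos in blacklist:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(word)
    if current:
        segments.append(current)
    return [joiner.join(segment) for segment in segments]


tokens = [
    ("experts", "n"), ("recommended", "v"),
    ("tumor", "n"), ("angiogenesis", "n"), (",", "w"),
    ("research", "n"),
]
print(split_on_blacklist(tokens, {"v", "w"}))
```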
3. Confirmation based on the feature-word whitelist
Project names contain a large number of feature words, such as "application", "research", "engineering" and "development". Moreover, many feature words occur in pairs. If a string to be matched contains more than two feature words, it is identified here as a science and technology project name. Project names also contain a large number of long-tail feature words (see Fig. 2), and these long-tail words may cause project names to be misrecognized.
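The confirmation rule can be sketched as a count of whitelist hits in a candidate string. The function name `confirm_project_name` is hypothetical, and so are two choices made here: "more than two" is read literally as at least three, and distinct feature words are counted rather than occurrences. The threshold is exposed as `min_hits` so that a different reading can be tried.

```python
def confirm_project_name(candidate_words, whitelist, min_hits=3):
    """Accept a candidate when it contains at least min_hits distinct feature words.

    min_hits=3 is an assumed literal reading of "more than two".
    """
    hits = {word for word in candidate_words if word in whitelist}
    return len(hits) >= min_hits


whitelist = {"application", "research", "engineering", "development"}
candidate = ["research", "and", "application", "of", "engineering", "materials"]
print(confirm_project_name(candidate, whitelist))
print(confirm_project_name(["research", "report"], whitelist))
```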
Maintenance of the feature-word whitelist
The training set of the whitelist is a set of project names P = {p1, p2, ..., pn}. The n project names in the set are part-of-speech tagged; suppose this yields m words. The problem then becomes finding the smallest feature-word set such that every project name contains a feature word from the set. We define this as the WLAN problem (selecting Words with the LeAst Number).
"A feature word covers a project name" is defined as follows: if a feature word is contained in a science and technology project name, the feature word is said to cover that project name; if the feature words in a feature-word set together cover all project names, the set is said to fully cover the project names.
The WLAN problem is in essence a linear programming problem. The constraint is that the selected set fully covers the project names, and the optimization objective is to find a feature-word set that meets the constraint and is as small as possible. The problem is equivalent to the minimum set cover problem, which is NP-hard.
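Because minimum set cover is NP-hard, a common practical substitute is the standard greedy approximation, which repeatedly picks the word that covers the most still-uncovered project names. The sketch below shows that generic heuristic on toy data; it is not claimed to be the exact procedure of the invention, and the function name `greedy_cover` is hypothetical.

```python
def greedy_cover(names):
    """Greedy approximation of a minimum cover.

    names is a list of sets of words, one set per project name.
    Returns a list of words such that every non-empty name contains
    at least one of them. Ties are broken alphabetically.
    """
    uncovered = set(range(len(names)))
    chosen = []
    while uncovered:
        counts = {}
        for index in uncovered:
            for word in names[index]:
                counts[word] = counts.get(word, 0) + 1
        if not counts:
            break  # only empty names remain; they cannot be covered
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {i for i in uncovered if best not in names[i]}
    return chosen


toy_names = [
    {"research", "tumor"},
    {"research", "steel"},
    {"application", "steel"},
    {"application", "laser"},
]
print(greedy_cover(toy_names))
```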
Limiting only the size of the candidate feature-word set w does not guarantee that a smaller set extracts better. We therefore define the size bound of the optimal w as k, such that, provided the size of w is less than k, the set covers as many project names as possible and as many feature words as possible occur together, which ensures successful matching; in addition, the probability of mismatches for the chosen k should be as small as possible.
The k-WLAN problem is defined here as follows:
Find the smallest subset S of the candidate feature-word set w that meets the following conditions: 1) S covers all the subsets; 2) the number of elements of S is less than k; 3) the coverage value of S is as large as possible, that is, the selection covers as much as possible.
The present invention solves the k-WLAN problem with dynamic programming and a greedy algorithm.
We define here the ⊙ operation and the project-name 0-1 matrix M. The horizontal axis of M represents the set of words contained in the full project-name training set U, and the vertical axis represents the set of project names. If the value of Mij is 1, the word covers the project name; otherwise the value is 0. If a word wi has k non-zero components in the matrix M, the ⊙ operation is defined as deleting from the matrix the project names that correspond to the non-zero components of wi.
Fig. 3 shows an example of M ⊙ w2. Here w denotes the feature-word set of the minimum cover to be generated and p denotes each project name. The entry at the intersection of p and an element of w indicates whether that feature word occurs in the project name: 1 means it occurs, 0 means it does not.
The formal definition of the dynamic programming in Fig. 3 is:
$H(M, k) = \max\{\, A(w_i) + H(M \odot w_i,\ k-1) \mid 1 \le i \le m \,\}$
The formula is recursive; $A(w_i)$ denotes the number of project names covered by $w_i$.
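The recursion can be written down directly, with the matrix stored as one 0-1 row per project name and one column per word. This is a sketch under stated assumptions: the base case H(M, 0) = 0 and the empty-matrix case are not given in the text and are assumed here, and the function name `h_value` is hypothetical. The exhaustive recursion is exponential in the worst case, so it is only suitable for small examples.

```python
from functools import lru_cache


def h_value(matrix, k):
    """Evaluate H(M, k) = max_i { A(w_i) + H(M ⊙ w_i, k-1) } by memoized recursion.

    matrix: list of 0-1 rows, one per project name, one column per word.
    Assumed base cases: H is 0 when k == 0 or when no rows remain.
    """
    @lru_cache(maxsize=None)
    def solve(rows, budget):
        if budget == 0 or not rows:
            return 0
        best = 0
        for j in range(len(rows[0])):
            gain = sum(row[j] for row in rows)            # A(w_j)
            if gain == 0:
                continue
            rest = tuple(row for row in rows if row[j] == 0)  # M ⊙ w_j
            best = max(best, gain + solve(rest, budget - 1))
        return best

    return solve(tuple(tuple(row) for row in matrix), k)


M = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]
print(h_value(M, 1), h_value(M, 2))
```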
Because the distribution of project feature words follows a long-tail distribution, the number of project names covered by each newly added feature word gradually decreases as k increases; once k has grown to a certain extent, the gain approaches the marginal cost. We therefore take as the cutoff condition of the greedy algorithm that the difference H(M, k+1) - H(M, k) falls below some threshold.
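A greedy selection with this kind of cutoff can be sketched as follows: at each step the word with the largest marginal gain is taken, its covered rows are removed as in the ⊙ operation, and selection stops once the gain drops below the threshold. Using the greedy marginal gain as a stand-in for H(M, k+1) - H(M, k) is an assumption of this sketch, and the function name `greedy_k_wlan` is hypothetical.

```python
def greedy_k_wlan(matrix, words, threshold):
    """Greedily select feature words until the marginal gain falls below threshold.

    matrix: list of 0-1 rows (project names) over the columns in words.
    Returns the selected words in order of selection.
    """
    rows = [list(row) for row in matrix]
    selected = []
    while rows and words:
        gains = [sum(row[j] for row in rows) for j in range(len(words))]
        j = max(range(len(words)), key=gains.__getitem__)
        if gains[j] == 0 or gains[j] < threshold:
            break  # marginal gain too small: stop adding feature words
        selected.append(words[j])
        rows = [row for row in rows if row[j] == 0]  # apply the ⊙ operation
    return selected


words = ["research", "application", "laser"]
M = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(greedy_k_wlan(M, words, threshold=2))
print(greedy_k_wlan(M, words, threshold=1))
```

A higher threshold keeps only the head of the long-tail distribution, which trades coverage for a smaller whitelist and fewer accidental matches.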
Positive effects
Multiple project names are taken here as the training set to produce the part-of-speech blacklist and the feature-word whitelist. The training set comes from 1119 National Natural Science Foundation projects of 8 universities, and the test set uses the project-name corpus of the National Science and Technology Progress Award from 2005 to 2014. The part-of-speech blacklist is generated as follows: of the 96 parts of speech defined by the ICT Chinese part-of-speech tag set, all project names together contain 61, so the blacklist holds 35 parts of speech. After word segmentation and part-of-speech tagging, 3397 words are obtained; statistics show that the size of the minimum cover is greater than or equal to 72. Fig. 4 shows the trend of H(M, k+1) - H(M, k) as k grows. Based on this figure, we set the marginal-cost threshold to 30 and k to 102.
We define the recall and precision of recognizing a single project name as follows:
Let $T$ be the test corpus set, let $t_i$ be the true project name in the $i$-th group of $T$, let $\hat{t}_i$ be the project name extracted by the algorithm, and let $c_i$ be the common substring of the two. The recall of a single project name is $Re(t_i) = |c_i| / |t_i|$ and its precision is $Pr(t_i) = |c_i| / |\hat{t}_i|$. The recall over all projects is $Re = \frac{1}{|T|}\sum_{i} Re(t_i)$ and the precision is $Pr = \frac{1}{|T|}\sum_{i} Pr(t_i)$. For test corpus that contains no project name, the correct outcome is that $t_i$ and $\hat{t}_i$ are both empty; in that case we set $Re(t_i) = Pr(t_i) = 1$. If $\hat{t}_i$ is not empty, then $Re(t_i) = 1$ and $Pr(t_i) = 0$.
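The per-name measures can be sketched in code. Since the formulas themselves are not legible in this text, the sketch assumes a common reading: recall is the length of the common substring divided by the length of the true name, and precision is that length divided by the length of the extracted name. The special cases for an empty true name follow the text; the value returned when a true name exists but nothing is extracted is an additional assumption, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def common_length(truth, extracted):
    """Length of the longest common contiguous substring of the two strings."""
    matcher = SequenceMatcher(None, truth, extracted, autojunk=False)
    return matcher.find_longest_match(0, len(truth), 0, len(extracted)).size


def recall_precision(truth, extracted):
    """Return (recall, precision) for one test group."""
    if not truth:
        # No project name in this group: correct only if nothing was extracted.
        return (1.0, 1.0) if not extracted else (1.0, 0.0)
    if not extracted:
        return 0.0, 0.0  # assumption: a missed name scores zero on both
    shared = common_length(truth, extracted)
    return shared / len(truth), shared / len(extracted)


print(recall_precision("tumor angiogenesis", "the tumor angiogenesis project"))
```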
Under these experimental conditions, we ran recognition tests on 1119 groups of test corpus and obtained 90.97% precision, 77.9% recall and an F1 value of 83.93%. On 766 groups of test corpus that contain no project name, we obtained 98.43% precision, 100% recall and an F1 value of 99.21% (F1 is an evaluation measure from information retrieval that balances the weights of precision and recall).
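The reported F1 values can be checked against the standard harmonic-mean definition, F1 = 2PR / (P + R). The check below treats 90.97% and 77.9% as precision and recall for the first experiment, which is an assumption; it is the pairing that reproduces the reported 83.93%.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall) pairs taken from the figures reported in the text.
for precision, recall in [(0.9097, 0.779), (0.9843, 1.0), (0.9478, 0.8919)]:
    print(f"P={precision:.4f} R={recall:.4f} F1={f1_score(precision, recall):.4f}")
```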
In summary, using the method of the present invention, we obtained 94.78% precision, 89.19% recall and an F1 value of 91.9% on 1500 groups of test corpus.

Claims (6)

1. A project name entity recognition method based on self-learned rules, comprising the following steps:
1) taking multiple project names as a training set to produce a part-of-speech blacklist and a feature-word whitelist, wherein the part-of-speech blacklist is obtained by removing, from the parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology (ICT), every part of speech that occurs in any project name, and the feature-word whitelist is the minimal feature-word set, obtained when part-of-speech tagging the project-name set, such that every project name contains a feature word from the set;
2) cutting the text to be recognized based on context information;
3) truncating the text to be recognized after the cutting of step 2) based on the part-of-speech blacklist;
4) in the text to be recognized after the processing of step 3), confirming the project name based on the feature-word whitelist to obtain the final recognition result.
2. The project name entity recognition method based on self-learned rules according to claim 1, characterized in that if a feature word is contained in a science and technology project name, the feature word is said to cover that project name, and if the feature words in a feature-word set together cover all project names, the feature-word set is said to fully cover the project names.
3. The project name entity recognition method based on self-learned rules according to claim 2, characterized in that the minimal feature-word set is obtained by the following method:
the project names in the training set are segmented to obtain the set of all words, and a minimum cover of the project-name set is found among these words; this minimum cover is defined as the minimal feature-word set.
4. The project name entity recognition method based on self-learned rules according to claim 1, characterized in that in step 2) the context information of project names is detected in the form of regular expressions, and the sentences in the text to be recognized that a regular expression hits are cut.
5. A project name entity recognition system based on self-learned rules, comprising:
a corpus training module, for training on project names to obtain a part-of-speech blacklist and a feature-word whitelist, wherein the part-of-speech blacklist is obtained by removing, from the parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology (ICT), every part of speech that occurs in any project name, and the feature-word whitelist is the minimal feature-word set, obtained when part-of-speech tagging the project-name set, such that every project name contains a feature word from the set;
a text input unit, for inputting the text to be recognized;
a text cutting unit, for cutting the text to be recognized according to context cue information;
a text truncation unit, for truncating, according to the part-of-speech blacklist, the text to be recognized after it has been cut by the text cutting unit;
a text confirmation unit, for confirming, according to the feature-word whitelist, the project names obtained by the text truncation unit, to obtain the final recognition result.
6. The project name entity recognition system based on self-learned rules according to claim 5, characterized in that the text cutting unit detects context information based on regular expressions and cuts the sentences that are hit.
CN201510271752.6A 2015-05-25 2015-05-25 A kind of entry name entity recognition method and system based on self-learning-ruler Active CN104965818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510271752.6A CN104965818B (en) 2015-05-25 2015-05-25 A kind of entry name entity recognition method and system based on self-learning-ruler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510271752.6A CN104965818B (en) 2015-05-25 2015-05-25 A kind of entry name entity recognition method and system based on self-learning-ruler

Publications (2)

Publication Number Publication Date
CN104965818A CN104965818A (en) 2015-10-07
CN104965818B true CN104965818B (en) 2018-01-05

Family

ID=54219853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510271752.6A Active CN104965818B (en) 2015-05-25 2015-05-25 A kind of entry name entity recognition method and system based on self-learning-ruler

Country Status (1)

Country Link
CN (1) CN104965818B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569997B (en) * 2016-10-19 2019-12-10 中国科学院信息工程研究所 Science and technology compound phrase identification method based on hidden Markov model
CN108038106B (en) * 2017-12-22 2021-07-02 北京工业大学 Fine-grained domain term self-learning method based on context semantics
CN109543764B (en) * 2018-11-28 2023-06-16 安徽省公共气象服务中心 Early warning information validity detection method and detection system based on intelligent semantic perception

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101118538A (en) * 2007-09-17 2008-02-06 中国科学院计算技术研究所 Method and system for recognizing feature lexical item in Chinese naming entity

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479191B (en) * 2010-11-22 2014-03-26 阿里巴巴集团控股有限公司 Method and device for providing multi-granularity word segmentation result

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101118538A (en) * 2007-09-17 2008-02-06 中国科学院计算技术研究所 Method and system for recognizing feature lexical item in Chinese naming entity

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
信息抽取中关键技术的研究;张素香;《中国博士学位论文全文数据库信息科技辑》;20071115(第05期);第36页倒数第3段,第37页第2-4段,第38页第1段,图3-1 *
基于提及关系的微博用户知识发现初探;吴恺 等;《图书与情报》;20150420(第02期);第125页右栏第1段 *
结合类内集中度和最小集合覆盖的特征选择;张文鹏 等;《计算机工程与应用》;20110720;第47卷(第28期);第124页左栏第1段 *

Also Published As

Publication number Publication date
CN104965818A (en) 2015-10-07


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant