CN104965818B

CN104965818B - Project name entity recognition method and system based on self-learned rules

Info

Publication number: CN104965818B
Application number: CN201510271752.6A
Authority: CN
Inventors: 柳厅文; 时金桥; 张洋; 闫旸; 郭莉; 张浩亮; 亚静
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-05-25
Filing date: 2015-05-25
Publication date: 2018-01-05
Anticipated expiration: 2035-05-25
Also published as: CN104965818A

Abstract

The invention discloses a project name entity recognition method and system based on self-learned rules. A part-of-speech blacklist and a keyword whitelist serve as the rules, and both lists are constructed without any human involvement: they are learned automatically from a training set. The invention can supplement traditional recognition methods and thereby improve precision and recall over the existing baseline.

Description

Project name entity recognition method and system based on self-learned rules

Technical field

The invention relates to fields such as text processing and natural language processing, and in particular to a project name entity recognition method and system based on self-learned rules.

Background technology

Named entity recognition is a basic problem of natural language processing. In natural language processing, named entities mainly include entity names, such as country names, organization names, place names, person names, and abbreviations, as well as some numerical expressions, such as currency values, percentages, and time expressions.

Because English named entity recognition only needs to consider the features of the words themselves and does not involve word segmentation, it is relatively easy to implement. According to the MUC and ACE evaluation results, the precision, recall, and F1 values of English named entity recognition can now mostly reach about 90%. Chinese named entity recognition started later. In the early 1990s, some researchers in China began studying the recognition of Chinese named entities (such as place names, person names, and organization names). For example, Sun Maosong et al. were among the first in China to work on Chinese person name recognition, mainly using statistical methods to compute the probabilities of surname and given-name characters; Zhang Xiaoheng et al. recognized and analyzed Chinese organization names, mainly experimenting on university names with manually written rules; Zhang et al. of the Intel China Research Center demonstrated at ACL 2000 an information extraction system they developed for extracting Chinese named entities and the relations between them. The system uses a memory-based learning (MBL) algorithm to acquire rules for extracting named entities and their relations. Although recognition of person names, place names, and organization names now achieves good results, research on recognizing named entities of particular types is still blank.

Classical algorithms for named entity recognition include statistical methods such as hidden Markov models, conditional random fields, and maximum entropy models. Traditional statistical methods cannot guarantee that all named entities are detected.

To extract information from science and technology texts, it is necessary to develop named entity recognition techniques with higher precision and recall.

Summary of the invention

The invention provides a project name entity recognition method and system based on self-learned rules. A part-of-speech blacklist and a keyword whitelist serve as the rules, and both lists are constructed without any human involvement: they can be learned automatically from a training set. The invention can supplement traditional recognition methods and thereby improve precision and recall over the existing baseline.

To achieve the above objective, the invention adopts the following technical solution:

A project name entity recognition method based on self-learned rules comprises the following steps:

1) taking multiple project names as a training set to produce a part-of-speech blacklist and a feature word whitelist;

2) cutting the text to be recognized according to contextual cue information;

3) truncating the text to be recognized, after the cutting of step 2), according to the part-of-speech blacklist;

4) in the text to be recognized after the processing of step 3), confirming project names according to the feature word whitelist to obtain the final recognition result.

Further, the part-of-speech blacklist is obtained by removing, from the parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology, all parts of speech that occur in the project names.

Further, the feature word whitelist is the minimal feature word set, obtained when part-of-speech tagging the set of project names, such that every project name contains a feature word from the set.

Further, if a feature word is contained in a science and technology project name, the feature word is said to cover that project name; if the feature words in a feature word set together cover all project names, the set is said to fully cover the project names.

Further, the minimal feature word set is obtained as follows:

The project names in the training set are segmented to obtain the set of all words; a minimum covering set of the project name set is found among these words, and this minimum covering set is defined as the minimal feature word set.

Further, in step 2), the contextual cue information of project names is detected with regular expressions, and the sentences in the text to be recognized that a regular expression hits are cut.

A project name entity recognition system based on self-learned rules comprises:

a corpus training module, for training on project names to obtain the part-of-speech blacklist and the feature word whitelist;

a text input unit, for inputting the text to be recognized;

a text cutting unit, for cutting the text to be recognized according to contextual cue information;

a text truncation unit, for truncating, according to the part-of-speech blacklist, the text to be recognized after it has been cut by the text cutting unit;

a text confirmation unit, for confirming, according to the feature word whitelist, the project names obtained by the text truncation unit, to obtain the final recognition result.

Further, the text cutting unit detects contextual cue information with regular expressions and cuts the sentences that are hit.

The beneficial effects of the invention are as follows:

The invention uses a part-of-speech blacklist and a keyword whitelist as rules, and both lists are constructed without any human involvement: they can be learned automatically from a training set.

The invention can supplement traditional recognition methods and thereby improve precision and recall over the existing baseline. With the method of the invention, we obtained a precision of 94.78%, a recall of 89.19%, and an F1 value of 91.9% on 1500 groups of test corpora.

Brief description of the drawings

Fig. 1 is the overall flowchart of the project name entity recognition method based on self-learned rules according to the invention.

Fig. 2 is a schematic diagram of the frequency distribution of feature words.

Fig. 3 is a schematic diagram of the ⊙ operation of the invention.

Fig. 4 shows how the gain in project names covered by feature words changes as k increases.

Fig. 5 is the framework diagram of the project name entity recognition system based on self-learned rules according to the invention.

Embodiment

The invention is explained in further detail below with reference to the accompanying drawings.

The overall flow of the project name entity recognition method based on self-learned rules is shown in Fig. 1. Its key steps are described in detail as follows:

1. Cutting based on contextual cue information

Contextual cues hint at where a project name lies. We take common cue phrases, express them as regular expressions, and use them to detect the external cue information of a project name, which serves as the context condition for project name detection. When a regular expression hits, we cut the sentence that was hit. Take patterns of the form "... wins ... award" as an example:

The "XXX" project won the first prize of the National Science and Technology Progress Award

The "YYY" project was awarded the second prize of the National Natural Science Award

The "ZZZ" project has reached the advanced international standard

Take the following passage as an example: "On January 18, 2013, the CPC Central Committee and the State Council held the National Science and Technology Award Conference at the Great Hall of the People in Beijing. Party and state leaders including Hu Jintao, Xi Jinping, Wen Jiabao, Li Keqiang, and Liu Yunshan attended the conference and presented awards to the prize-winning representatives of 2012. The 2012 National Science and Technology Awards honored 330 projects, including 212 Science and Technology Progress Awards: 3 special prizes, 22 first prizes, and 187 second prizes. The project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》, recommended by the Chinese Anti-Cancer Association, won the first prize of the National Science and Technology Progress Award, and the project leader, Professor Bian Xiuwu, went on stage to receive the award from the central leaders." In this passage, the sentence "The project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》, recommended by the Chinese Anti-Cancer Association, won the first prize of the National Science and Technology Progress Award" first matches the regular expression rule "... wins ... award". The phrase "won the first prize of the National Science and Technology Progress Award" is then deleted from that sentence, which weeds out part of the irrelevant information.
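The cutting step can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the English cue patterns in `CUE_PATTERNS` and the helper `cut_by_context` are hypothetical, since the text only gives "... wins ... award" as an example cue and works on Chinese text.

```python
import re
from typing import Optional

# Hypothetical cue patterns (assumed for illustration); the text only names
# "... wins ... award" as an example of a contextual cue.
CUE_PATTERNS = [
    re.compile(r"\b(?:wins|won|obtains|is awarded)\b.*\b(?:award|prize)\b\.?", re.IGNORECASE),
    re.compile(r"\bhas reached\b.*\bstandard\b\.?", re.IGNORECASE),
]


def cut_by_context(sentence: str) -> Optional[str]:
    """Delete the first matching cue span; return None when no cue matches."""
    for pattern in CUE_PATTERNS:
        match = pattern.search(sentence)
        if match:
            remainder = sentence[:match.start()] + sentence[match.end():]
            return remainder.strip()
    return None


if __name__ == "__main__":
    hit = 'The "XXX" project won the first prize of the National Science and Technology Progress Award.'
    print(cut_by_context(hit))
    print(cut_by_context("The conference was held in Beijing."))
```

Sentences that match no cue return `None`, so a caller can decide whether to keep them unchanged or skip them; the source does not say which.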

2. Text truncation based on the part-of-speech blacklist

Some parts of speech never occur in science and technology project names. Of the 96 parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology, 35 never appear in the project name training corpus. The part-of-speech blacklist is used to further cut the output of the previous step, so that the result is as close as possible to the true project name.
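Building the blacklist amounts to a set difference between the full tag set and the tags seen in the training names. The sketch below assumes the names have already been segmented and tagged by some external tagger; the function name `build_pos_blacklist` and the toy tags are illustrative, not taken from the source.

```python
from typing import Iterable, List, Set, Tuple

TaggedName = List[Tuple[str, str]]  # (word, part-of-speech tag) pairs


def build_pos_blacklist(tag_set: Iterable[str],
                        tagged_names: Iterable[TaggedName]) -> Set[str]:
    """Return every tag in tag_set that never occurs in the training names."""
    observed = {pos for name in tagged_names for _, pos in name}
    return set(tag_set) - observed


if __name__ == "__main__":
    # Toy tag set and training names, assumed for illustration only.
    tags = {"n", "v", "c", "p", "w", "r"}
    names = [
        [("tumor", "n"), ("and", "c"), ("application", "n")],
        [("research", "v"), ("of", "p"), ("networks", "n")],
    ]
    print(sorted(build_pos_blacklist(tags, names)))
```

With the 96-tag set and the training corpus described in the text, the same difference would leave the 35 blacklisted tags.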

For example, take the output of the previous step: "On January 18, 2013, the CPC Central Committee and the State Council held the National Science and Technology Award Conference at the Great Hall of the People in Beijing. Party and state leaders including Hu Jintao, Xi Jinping, Wen Jiabao, Li Keqiang, and Liu Yunshan attended the conference and presented awards to the prize-winning representatives of 2012. The 2012 National Science and Technology Awards honored 330 projects, including 212 Science and Technology Progress Awards: 3 special prizes, 22 first prizes, and 187 second prizes. The project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》, recommended by the Chinese Anti-Cancer Association, and the project leader, Professor Bian Xiuwu, went on stage to receive the award from the central leaders." Truncating at blacklisted parts of speech yields the following substrings: "the CPC Central Committee and the State Council held at the Great Hall of the People in Beijing", "National Science and Technology Award Conference", "and other Party and state leaders attended the conference and the prize-winning representatives of 2012", "presented awards", "The 2012 National Science and Technology Awards honored 330 projects", "212 Science and Technology Progress Awards, including 3 special prizes, 22 first prizes", "187 second prizes", and "The project 《Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy》, recommended by the Chinese Anti-Cancer Association". Compared with the previous step, the target string we obtain, "Tumor Angiogenesis Mechanism and Its Application in Anti-angiogenic Therapy", is closer to the actual science and technology project name.
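The truncation can be sketched as splitting a tagged token sequence wherever a blacklisted tag appears. The sketch assumes the blacklisted token itself is dropped (the source does not say whether it is kept), and `split_on_blacklist` and the toy tags are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def split_on_blacklist(tagged_tokens: Iterable[Tuple[str, str]],
                       blacklist: Set[str]) -> List[List[str]]:
    """Split tokens into segments, breaking at every blacklisted tag."""
    segments: List[List[str]] = []
    current: List[str] = []
    for word, pos in tagged_tokens:
        if pos in blacklist:
            # A blacklisted tag ends the current segment; the token is dropped.
            if current:
                segments.append(current)
                current = []
        else:
            current.append(word)
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    tokens = [("Hu", "nr"), ("attended", "v"), (",", "w"),
              ("tumor", "n"), ("angiogenesis", "n"), ("mechanism", "n"), (".", "w")]
    print(split_on_blacklist(tokens, {"nr", "w"}))
```

Each resulting segment is a candidate string for the whitelist confirmation step.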

3. Confirmation based on the feature word whitelist

Project names contain many feature words, such as "application", "research", "engineering", and "development". Moreover, many feature words occur in pairs. If a candidate string contains two or more feature words, it is recognized as a science and technology project name. Project names also contain many long-tail words (see Fig. 2), and these long-tail words may cause project names to be misrecognized.
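A minimal sketch of the confirmation rule follows. It assumes the candidate is already tokenized and counts distinct feature words; the threshold is exposed as the hypothetical parameter `min_hits` because the translated text is ambiguous between "two or more" and "more than two".

```python
from typing import Iterable, Set


def confirm_candidate(tokens: Iterable[str], whitelist: Set[str], min_hits: int = 2) -> bool:
    """Accept a candidate when it contains at least min_hits distinct feature words."""
    hits = {token for token in tokens if token in whitelist}
    return len(hits) >= min_hits


if __name__ == "__main__":
    # Toy whitelist built from the feature words quoted in the text.
    whitelist = {"application", "research", "engineering", "development"}
    candidate = ["tumor", "angiogenesis", "mechanism", "research", "and", "application"]
    print(confirm_candidate(candidate, whitelist))
    print(confirm_candidate(["award", "conference"], whitelist))
```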

Maintenance of the feature word whitelist

The whitelist is trained on a set of project names P = {p₁, p₂, ..., p_n}. Part-of-speech tagging the n project names yields m words. The problem then becomes finding the smallest feature word set such that every project name contains a feature word from the set. We define this as the WLAN problem (selecting Words with the LeAst Number).

"A feature word covers a project name" is defined as follows: if a feature word is contained in a science and technology project name, the feature word is said to cover that project name; if the feature words in a feature word set together cover all project names, the set is said to fully cover the project names.

The WLAN problem is in essence a linear programming problem. The constraint is that the chosen set fully covers the project names, and the optimization objective is to make the set that satisfies this constraint as small as possible. The problem is equivalent to the minimum set cover problem, which is NP-hard.

Limiting only the size of the candidate feature word set w does not guarantee that a smaller set extracts better. We therefore define the optimal size of w as k: provided that w contains fewer than k words, the set should cover as many project names as possible, and as many feature words as possible should occur together so that matching succeeds. In addition, the probability of false matches under the chosen k should be as small as possible.

The k-WLAN problem is defined here as follows:

Find the smallest subset S of the candidate feature word set w that satisfies the following conditions: 1) S covers all of the subsets; 2) the number of elements of S is less than k; 3) the coverage value is as large as possible, that is, the selection covers as many words as possible.

The invention solves the k-WLAN problem with dynamic programming and a greedy algorithm.

Here we define the ⊙ operation and the 0-1 matrix M of project names and words. The horizontal axis of M represents the set of words contained in the complete project name training set U, and the vertical axis represents the set of project names. If the value of M_ij is 1, the word covers the corresponding project name; otherwise the value is 0. If a word w_i has k non-zero components in the matrix M, the ⊙ operation deletes from the matrix the project names corresponding to the non-zero components of w_i.

Fig. 3 shows an example of M ⊙ w₂. Here w denotes the feature words from which the minimum covering set is to be generated, p denotes the project names, and the entry at the intersection of each p and each w indicates whether the feature word occurs in the project name: 1 means it occurs, and 0 means it does not.
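The matrix and the ⊙ operation can be sketched with plain lists. In this illustration rows are project names and columns are words (one reading of the axis description above); `build_matrix` and `odot` are hypothetical helper names, and project names are represented as sets of words.

```python
from typing import List, Sequence, Set

Matrix = List[List[int]]


def build_matrix(names: Sequence[Set[str]], vocab: Sequence[str]) -> Matrix:
    """Entry [i][j] is 1 when project name i contains word j, else 0."""
    return [[1 if word in name else 0 for word in vocab] for name in names]


def odot(matrix: Matrix, j: int) -> Matrix:
    """M ⊙ w_j: drop every project-name row that word j covers."""
    return [row for row in matrix if row[j] == 0]


if __name__ == "__main__":
    vocab = ["application", "research", "network", "tumor"]
    names = [{"tumor", "application"}, {"research", "application"}, {"network", "research"}]
    m = build_matrix(names, vocab)
    print(m)
    print(odot(m, 0))  # rows not covered by "application"
```

Columns are left in place after ⊙, so a word that covers nothing in the reduced matrix simply has an all-zero column.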

The dynamic programming illustrated in Fig. 3 is formally defined as:

$H(M, k) = \max \{\, A(w_i) + H(M \odot w_i,\; k-1) \mid 1 \le i \le m \,\}$

The formula is recursive; A(w_i) denotes the number of project names covered by w_i.
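The recursion can be written down directly for small inputs. The sketch below is a brute-force, memoized evaluation under two assumptions not stated in the source: the base case H(M, 0) = 0, and A(w_i) counted on the rows that remain in the current matrix. The function name `coverage_value` is hypothetical, and the running time is exponential, so this is only meant for toy matrices.

```python
from functools import lru_cache
from typing import Sequence


def coverage_value(matrix: Sequence[Sequence[int]], k: int) -> int:
    """Evaluate H(M, k): the most project names that k words can cover."""
    m = len(matrix[0]) if matrix else 0

    @lru_cache(maxsize=None)
    def h(rows: tuple, budget: int) -> int:
        if budget == 0 or not rows:
            return 0  # assumed base case
        best = 0
        for j in range(m):
            covered = sum(row[j] for row in rows)  # A(w_j) on the remaining rows
            if covered == 0:
                continue
            rest = tuple(row for row in rows if row[j] == 0)  # M ⊙ w_j
            best = max(best, covered + h(rest, budget - 1))
        return best

    return h(tuple(tuple(row) for row in matrix), k)


if __name__ == "__main__":
    m = [[1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 0]]
    print([coverage_value(m, k) for k in range(4)])
```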

Because the feature words of project names follow a long-tail distribution, the number of project names covered by each newly added feature word gradually decreases as k increases. Once k has grown far enough, the gain approaches the marginal cost. We therefore take as the stopping condition of the greedy algorithm that the difference H(M, k+1) − H(M, k) falls below some threshold.
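A greedy variant with this stopping rule might look like the sketch below. It assumes the marginal gain of a step is the number of still-uncovered project names the chosen word covers, which stands in for H(M, k+1) − H(M, k); `greedy_feature_words` and its parameters are illustrative names, with rows as project names and columns as words.

```python
from typing import List, Sequence


def greedy_feature_words(matrix: Sequence[Sequence[int]],
                         vocab: Sequence[str],
                         threshold: int) -> List[str]:
    """Pick words by largest marginal coverage until the gain drops below threshold."""
    if not vocab:
        return []
    rows = [list(row) for row in matrix]
    chosen: List[str] = []
    while rows:
        # Marginal gain of each word on the project names not yet covered.
        gains = [sum(row[j] for row in rows) for j in range(len(vocab))]
        best = max(range(len(vocab)), key=gains.__getitem__)
        if gains[best] == 0 or gains[best] < threshold:
            break
        chosen.append(vocab[best])
        rows = [row for row in rows if row[best] == 0]  # apply ⊙ for the chosen word
    return chosen


if __name__ == "__main__":
    vocab = ["application", "research", "network", "tumor"]
    m = [[1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 0]]
    print(greedy_feature_words(m, vocab, threshold=1))
    print(greedy_feature_words(m, vocab, threshold=2))
```

A higher threshold yields a shorter whitelist that leaves long-tail project names uncovered, which is the trade-off the paragraph above describes.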

Positive effects

Here multiple project names are taken as the training set to produce the part-of-speech blacklist and the feature word whitelist. The training set comes from 1119 National Natural Science Foundation projects of 8 universities, and the test set uses the project name corpus of the National Science and Technology Progress Award from 2005 to 2014. The part-of-speech blacklist is generated as follows: of the 96 parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology, the project names together contain 61, so the blacklist contains 35 parts of speech. After part-of-speech tagging and word segmentation, 3397 words are obtained; statistics show that the size of the minimum covering set is at least 72. Fig. 4 shows how H(M, k+1) − H(M, k) changes as k grows. Based on this figure, we set the marginal cost threshold to 30 and the value of k to 102.

We define the recall and precision of recognizing a single project name as follows:

Let T be the set of test corpora and let t_i be the true project name in the i-th group of corpus in T. Let $\hat{t}_i$ denote the project name extracted by the algorithm, and let $c_i$ denote the common substring of the two. The recall of a single project name is $\mathrm{Re}(t_i) = |c_i| / |t_i|$, and the precision of a single project name is $\mathrm{Pr}(t_i) = |c_i| / |\hat{t}_i|$. The recall over all items is $\mathrm{Re} = \frac{1}{|T|} \sum_{i} \mathrm{Re}(t_i)$, and the precision is $\mathrm{Pr} = \frac{1}{|T|} \sum_{i} \mathrm{Pr}(t_i)$. For test corpora that contain no project name, the correct outcome is that t_i and $\hat{t}_i$ are both empty; in this case we set Re(t_i) = Pr(t_i) = 1. If $\hat{t}_i$ is not empty, then Re(t_i) = 1 and Pr(t_i) = 0.
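The per-item scores can be sketched as below, under stated assumptions: recall is taken as the length of the longest common substring divided by the length of the true name, precision as that length divided by the length of the extracted name, lengths are counted in characters, and corpus-level scores are plain averages over all groups. The case of a non-empty true name with an empty extraction is not defined in the text and is scored 0 here. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Tuple


def common_length(truth: str, predicted: str) -> int:
    """Length of the longest common substring of the two strings."""
    matcher = SequenceMatcher(None, truth, predicted, autojunk=False)
    return matcher.find_longest_match(0, len(truth), 0, len(predicted)).size


def item_scores(truth: str, predicted: str) -> Tuple[float, float]:
    """Return (recall, precision) for one test group."""
    if not truth:
        # No project name in the corpus: recall is 1; precision is 1 only if nothing was extracted.
        return 1.0, (1.0 if not predicted else 0.0)
    if not predicted:
        return 0.0, 0.0  # assumption: this case is not defined in the text
    common = common_length(truth, predicted)
    return common / len(truth), common / len(predicted)


def corpus_scores(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Average the per-item recall and precision over all (truth, predicted) pairs."""
    scores = [item_scores(truth, predicted) for truth, predicted in pairs]
    count = len(scores)
    return (sum(r for r, _ in scores) / count, sum(p for _, p in scores) / count)


if __name__ == "__main__":
    print(item_scores("abcdef", "cdefgh"))
    print(corpus_scores([("", "x"), ("ab", "ab")]))
```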

Under these experimental conditions, we ran recognition tests on 1119 groups of test corpora and obtained a precision of 90.97%, a recall of 77.9%, and an F1 value of 83.93%. On 766 groups of test corpora that contain no project name, we obtained a precision of 98.43%, a recall of 100%, and an F1 value of 99.21%. (F1 is an evaluation metric from information retrieval that combines precision and recall in a weighted relationship.)

In summary, with the method of the invention, we obtained a precision of 94.78%, a recall of 89.19%, and an F1 value of 91.9% on 1500 groups of test corpora.
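The reported figures can be cross-checked with a few lines of arithmetic. The sketch assumes F1 is the harmonic mean of precision and recall and that the overall scores are size-weighted averages of the two subsets. It further assumes that the subset with project names has 734 groups (1500 − 766); that count is an inference from the totals and is not stated in the text.

```python
from typing import Iterable, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def pooled(groups: Iterable[Tuple[int, float, float]]) -> Tuple[float, float]:
    """Size-weighted average of (count, precision, recall) triples."""
    groups = list(groups)
    total = sum(count for count, _, _ in groups)
    precision = sum(count * p for count, p, _ in groups) / total
    recall = sum(count * r for count, _, r in groups) / total
    return precision, recall


# 734 = 1500 - 766 is inferred, not stated in the text.
WITH_NAMES = (734, 0.9097, 0.779)
WITHOUT_NAMES = (766, 0.9843, 1.0)

if __name__ == "__main__":
    print(f"F1 with names:    {f1(0.9097, 0.779):.4f}")
    print(f"F1 without names: {f1(0.9843, 1.0):.4f}")
    p, r = pooled([WITH_NAMES, WITHOUT_NAMES])
    print(f"overall precision {p:.4f}, recall {r:.4f}, F1 {f1(p, r):.4f}")
```

Under these assumptions the two subset results are consistent with the overall 94.78%, 89.19%, and 91.9% figures.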

Claims

1. A project name entity recognition method based on self-learned rules, comprising the following steps:

1) taking multiple project names as a training set to produce a part-of-speech blacklist and a feature word whitelist, wherein the part-of-speech blacklist is obtained by removing, from the parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology, all parts of speech that occur in the project names, and the feature word whitelist is the minimal feature word set, obtained when part-of-speech tagging the set of project names, such that every project name contains a feature word from the set;

4) in the text to be recognized after the processing of step 3), confirming project names according to the feature word whitelist to obtain the final recognition result.

2. The project name entity recognition method based on self-learned rules according to claim 1, characterized in that if a feature word is contained in a science and technology project name, the feature word is said to cover that project name, and if the feature words in a feature word set together cover all project names, the set is said to fully cover the project names.

3. The project name entity recognition method based on self-learned rules according to claim 2, characterized in that the minimal feature word set is obtained as follows:

The project names in the training set are segmented to obtain the set of all words; a minimum covering set of the project name set is found among these words, and this minimum covering set is defined as the minimal feature word set.

4. The project name entity recognition method based on self-learned rules according to claim 1, characterized in that in step 2), the contextual cue information of project names is detected with regular expressions, and the sentences in the text to be recognized that a regular expression hits are cut.

5. A project name entity recognition system based on self-learned rules, comprising:

a corpus training module, for training on project names to obtain a part-of-speech blacklist and a feature word whitelist, wherein the part-of-speech blacklist is obtained by removing, from the parts of speech defined by the Chinese part-of-speech tag set of the Institute of Computing Technology, all parts of speech that occur in the project names, and the feature word whitelist is the minimal feature word set, obtained when part-of-speech tagging the set of project names, such that every project name contains a feature word from the set;

a text input unit, for inputting the text to be recognized;

a text truncation unit, for truncating, according to the part-of-speech blacklist, the text to be recognized after it has been cut by the text cutting unit;

6. The project name entity recognition system based on self-learned rules according to claim 5, characterized in that the text cutting unit detects contextual cue information with regular expressions and cuts the sentences that are hit.