CN106484676A

CN106484676A - Protein coreference resolution method for biological text based on syntax trees and domain features

Info

Publication number: CN106484676A
Application number: CN201610872780.8A
Authority: CN
Inventors: 李辰; 饶志强; 张向荣
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2016-09-30
Filing date: 2016-09-30
Publication date: 2017-03-08
Anticipated expiration: 2036-09-30
Also published as: CN106484676B

Abstract

The present invention relates to a protein coreference resolution method for biological text based on syntax trees and domain features, which addresses the low F value of existing rule-based methods. The steps are: preprocess the original text; search the syntax tree for each relative pronoun and the noun phrase nearest to it, and take that noun phrase as the antecedent of the relative pronoun; search the syntax tree for each personal pronoun, and search for its antecedent in the coordinate phrase structure of the syntax tree, in the clause subtree, or in the syntax tree of the previous sentence; obtain definite noun phrases and a candidate antecedent set from the syntax trees, and select the best candidate as the antecedent based on biological domain features such as singular/plural number, entity type, and entity count; filter out non-protein coreference results. The invention resolves protein coreference in biological text and achieves a higher F value.

Description

Protein coreference resolution method for biological text based on syntax trees and domain features

【Technical field】

The invention belongs to the field of text mining, and specifically relates to a protein coreference resolution method for biological text based on syntax trees and domain features.

【Background art】

With the rapid development of computer and Internet technology, a large amount of information and literature now exists in digital form. The literature in the biomedical field is massive and growing exponentially, and researchers who rely on traditional manual reading can hardly obtain valuable information from it efficiently, so automated text information extraction has become important work. As one task of biomedical text information extraction, protein coreference resolution finds the words and phrases in biological text that refer to the same protein entity. An anaphor is an expression that refers to something else, such as a common pronoun; an antecedent is the expression being referred to, which has actual content, such as a protein or another biological entity. As a supporting technique, protein coreference resolution plays an important role in many biological text mining tasks and can effectively improve the performance of biological information extraction systems.

The BioNLP Shared Task 2011 provides a standard biomedical protein coreference resolution corpus, Coreference. The corpus is drawn from MEDLINE abstracts and mainly targets coreference resolution of protein entities in biomedical text; the protein entities in the corpus are annotated in advance.

General-domain coreference resolution is more mature than biomedical coreference resolution, but because domain-specific corpora have their own characteristics, general-domain methods do not perform well when transplanted directly to the biomedical field, so methods designed for protein coreference resolution in biological text are needed. Current approaches fall into three groups: methods based on supervised learning, rule-based methods, and hybrid methods. Supervised learning methods extract features from a protein coreference training set to learn a model, and then use the model to resolve protein coreference relations in new data. For example, the paper "The Taming of Reconcile as a Biomedical Coreference Resolver", published by Youngjun Kim et al. in the BioNLP task in 2011, discloses a supervised learning method based on features and a classifier model; it extracts a series of features such as lexical and syntactic features and then uses a classifier to perform protein coreference resolution. Rule-based methods handle protein coreference in biological text with a set of manually written rules. For example, the paper "Boosting automatic event extraction from the literature using domain adaptation and coreference resolution", published by Makoto Miwa et al. in 2012 in the journal Bioinformatics, volume 28, issue 13, discloses a rule-matching method: an anaphor candidate detector extracts noun phrases, pronouns and the like as anaphor candidates; an antecedent candidate detector extracts noun phrases as antecedent candidates; and a coreference link detector then selects the best antecedent for each anaphor by applying rules in order, including complete structure matching, strict head-word matching, relaxed head-word matching, and number matching. Hybrid methods use both supervised learning and rules, handling different types of protein coreference with different techniques. For example, the paper "Anaphora Resolution in Biomedical Literature: A Hybrid Approach", published by Jennifer D'Souza and Vincent Ng at the ACM-BCB conference in 2012, discloses a method based on a classifier and rule matching: a learned classifier resolves pronoun anaphors, and rules resolve noun phrase anaphors.

The performance of a protein coreference resolution method is usually evaluated by recall, precision, and F value. Recall is the ratio of correctly extracted protein coreference relations to all protein coreference relations in the data; precision is the ratio of correctly extracted protein coreference relations to all extracted results; the F value is the harmonic mean of recall and precision and serves as the final overall index. Among methods for protein coreference resolution in biological text, supervised learning methods need a large manually annotated corpus to reach good results, but obtaining such a corpus costs substantial labor, resources, and time, and existing data do not meet this requirement. With limited training data the learned model is poor and its recall is very low, which leads to a very low F value. Rule-based methods generally only borrow rules from general-domain coreference resolution; these rules consider only the structure and attributes of phrases or words and ignore the distinctive syntactic properties and domain features of biological text, so they cannot effectively extract correct protein coreference relations, and their low recall leads to a low F value. Hybrid methods still suffer from the imperfect model that limited training data produce in the supervised component, and their recall on pronoun-type protein coreference is low; overall recall improves but remains insufficient, so the F value is still not ideal. Current results therefore need further improvement before protein coreference resolution can serve effectively as preprocessing for other biological text mining tasks, such as biological event extraction.
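The three evaluation measures defined above can be sketched in a few lines of Python. This is only an illustration of the definitions; the function name and the counts in the usage line are hypothetical and do not come from the experiments.

```python
def precision_recall_f(num_correct: int, num_extracted: int, num_gold: int):
    """Return (precision, recall, F value) for a set of extracted coreference relations.

    num_correct   -- extracted relations that are correct
    num_extracted -- all relations the system extracted
    num_gold      -- all relations annotated in the data
    """
    precision = num_correct / num_extracted if num_extracted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F value is the harmonic mean of precision and recall.
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(num_correct=30, num_extracted=40, num_gold=60)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method with low recall has a low F value even when its precision is acceptable, which is the weakness of the prior methods described above.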

【Summary of the invention】

The object of the invention is to overcome the defects of the prior art described above by providing a protein coreference resolution method for biological text based on syntax trees and domain features, which can extract correct protein coreference relations and thereby improve the overall index, the F value.

To achieve this object, the technical solution adopted by the invention is as follows.

The method comprises the following steps:

(1) Perform sentence splitting, tokenization, part-of-speech tagging, lemmatization, and syntactic parsing on the original text to obtain the syntax tree T_i of each sentence, i = 1, 2, ..., N; the syntax trees of all sentences form the syntax tree set T = {T_1, T_2, ..., T_N}, where i is the index of the sentence and N is the number of sentences;

(2) Search the syntax tree T_i for a relative pronoun node and the noun phrase node nearest to it, obtaining the relative pronoun anaphor Mr and its antecedent Ar;

(3) Search the syntax tree T_i for a personal pronoun node, obtaining the personal pronoun anaphor Mp; then search for the antecedent Ap of Mp in the coordinate phrase structure of the syntax tree containing the personal pronoun node, in the clause subtree, or in the syntax tree of the previous sentence;

(4) Search the syntax tree T_i for definite noun phrase nodes that contain a specific biological entity type keyword, obtaining the definite noun phrase anaphor Md; search a subset of the syntax tree set T, given by a sentence window, for all noun phrase nodes that contain a biological entity or a protein entity, obtaining the candidate antecedent set X; and select the antecedent Ad of Md from X based on biological domain features; here T_j is the syntax tree of the j-th sentence in T, and k is the size of the sentence window;

(5) From all coreference resolution results obtained in steps (2) to (4), filter out those whose antecedent does not contain a protein entity, completing the protein coreference resolution of biological text based on syntax trees and domain features.

Further, searching the syntax tree T_i for the relative pronoun node and the noun phrase node nearest to it in step (2) is implemented as follows:

201. Search T_i for nodes labeled "WDT" or "WP", obtaining the relative pronoun node Nr and the relative pronoun anaphor Mr; search T_i for all nodes labeled "NP", obtaining the candidate antecedent set Z; here WDT denotes a wh-determiner, WP denotes a wh-pronoun, and NP denotes a noun phrase;

202. Find in T_i the nodes of all candidate antecedents in Z, obtaining the candidate antecedent node set Nz;

203. Extract the syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr;

204. Select the shortest of the syntax tree paths obtained in step 203, and take the noun phrase node on that path as the nearest noun phrase node.

Further, the antecedent Ap in step (3) is obtained as follows:

301. Starting from the node of the personal pronoun anaphor Mp, traverse T_i bottom-up to find a node Nc that contains a coordinate phrase structure, and determine whether Nc exists. If it does, extract from T_i the syntax subtree STc rooted at Nc, find in STc the noun phrase node farthest from the node of Mp, and take it as the antecedent Ap of Mp; otherwise, go to step 302;

302. Starting from the node of Mp, traverse T_i bottom-up to find the clause node Ns; extract the syntax subtree STs rooted at Ns, and find in STs the noun phrase node farthest from the node of Mp. Determine whether this noun phrase node exists. If it does, take it as the antecedent Ap of Mp; otherwise, go to step 303;

303. Select the syntax tree T_{i-1} from the syntax tree set T; starting from its last leaf node, traverse T_{i-1} bottom-up to find the clause node Nt; extract the syntax subtree STt rooted at Nt, and find in STt all noun phrase nodes that agree in number (singular or plural) with Mp, obtaining the candidate antecedent set Y; from Y, select the candidate antecedent farthest from Mp as the antecedent Ap of Mp.

Further, the coordinate phrase structure in step (3) is a coordinate noun phrase, a coordinate verb phrase, or a coordinate clause structure.

Further, the antecedent Ad in step (4) is obtained as follows:

401. Determine whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes". If so, select from the candidate antecedent set X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying in order the criteria of head-word match and protein entity count greater than 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 402;

402. Determine whether Md is plural. If so, applying in order the criteria of head-word match, biological entity count greater than 1, and protein entity count greater than 1, select from X the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 403;

403. Determine whether the head word of Md is "protein" or "gene". If so, select from X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying in order the criteria of head-word match and protein entity count equal to 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 404;

404. Applying in order the criteria of head-word match, biological entity count equal to 1, and protein entity count equal to 1, select from X the candidate nearest to Md as the antecedent Ad of Md.

Further, the specific biological entity type keywords in step (4) include "protein", "gene", "factor", "element", "receptor", "complex", and "construct".
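As an illustration of how a definite noun phrase anaphor could be detected, the following Python sketch checks a noun phrase for a child labeled "DT" (the determiner test described in the embodiment) and for one of the keywords listed above. The nested-tuple tree representation, the function names, and the crude plural handling (accepting the keyword plus "s" or "es" in place of real lemmatization) are assumptions made for the example.

```python
KEYWORDS = ("protein", "gene", "factor", "element", "receptor", "complex", "construct")
# Assumed stand-in for lemmatization: accept keyword, keyword + "s", keyword + "es".
KEYWORD_FORMS = {k + suffix for k in KEYWORDS for suffix in ("", "s", "es")}


def leaf_words(tree):
    """Collect the words under a nested-tuple tree such as ("NP", ("DT", "the"), ...)."""
    words = []
    for child in tree[1:]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaf_words(child))
    return words


def is_definite_np_anaphor(tree):
    """True if tree is an NP with a direct DT child and a keyword among its words."""
    if tree[0] != "NP":
        return False
    has_determiner = any(isinstance(c, tuple) and c[0] == "DT" for c in tree[1:])
    has_keyword = any(w.lower() in KEYWORD_FORMS for w in leaf_words(tree))
    return has_determiner and has_keyword


# Hand-built example phrase: "these proteins".
example = ("NP", ("DT", "these"), ("NNS", "proteins"))
found = is_definite_np_anaphor(example)
```

A phrase such as "this study" has a determiner but no keyword, so it is not treated as a candidate anaphor.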

Further, a biological entity in step (4) is recognized as a token that: starts with a digit and contains a letter; starts with a lowercase letter and contains an uppercase letter, a digit, or a special symbol; starts with an uppercase letter and contains a digit or a special symbol; or starts with an uppercase letter, contains a lowercase letter, and contains an uppercase letter or a special symbol.
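The four recognition patterns can be sketched as a small Python predicate. The source does not define "special symbol"; the sketch assumes it means any character that is not an ASCII letter, digit, or whitespace, and it assumes that "contains" refers to the characters after the first one. The function name is hypothetical.

```python
import re

# Assumption: a "special symbol" is any character that is neither an ASCII
# letter, a digit, nor whitespace (for example "-" or "/").
SPECIAL = r"[^A-Za-z0-9\s]"


def looks_like_biological_entity(token: str) -> bool:
    """Apply the four surface patterns to a single token."""
    if not token:
        return False
    first, rest = token[0], token[1:]

    def rest_has(pattern: str) -> bool:
        return re.search(pattern, rest) is not None

    if first.isdigit():
        # Pattern 1: starts with a digit and contains a letter.
        return rest_has(r"[A-Za-z]")
    if "a" <= first <= "z":
        # Pattern 2: starts lowercase; contains uppercase, digit, or special symbol.
        return rest_has(r"[A-Z]") or rest_has(r"[0-9]") or rest_has(SPECIAL)
    if "A" <= first <= "Z":
        # Pattern 3: starts uppercase; contains digit or special symbol.
        # Pattern 4: starts uppercase; contains lowercase and another uppercase letter.
        return (rest_has(r"[0-9]") or rest_has(SPECIAL)
                or (rest_has(r"[a-z]") and rest_has(r"[A-Z]")))
    return False


tokens = ["p65", "NF-kappaB", "5-LOX", "IkappaB", "protein", "The"]
flags = [looks_like_biological_entity(t) for t in tokens]
```

Under these assumptions the first four tokens match a pattern, while ordinary words such as "protein" and "The" do not.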

Compared with the prior art, the invention has the following advantages:

After preprocessing the original text, the invention extracts relative pronouns, personal pronouns, and definite noun phrases from the syntax trees, determines the antecedents of relative pronoun and personal pronoun anaphors based on the syntax trees, determines the antecedents of definite noun phrase anaphors based on biological domain features, and finally filters out coreference results that involve no protein entity. Because the antecedents of relative pronoun and personal pronoun anaphors are extracted from the syntax tree and the antecedents of definite noun phrase anaphors from domain features, the method can exploit syntactic structure information that cannot be obtained from the structure and attributes of phrases or words alone, and thus extracts more correct protein coreference relations. By making full use of the syntactic properties and domain features of biomedical text, it obtains more correct protein coreference results than current rule-based methods while maintaining precision, which raises recall and yields a better overall F value; the simulation results confirm this. The invention can be used to resolve relative pronouns, personal pronouns, and definite noun phrases that refer to protein entities in biomedical text.

【Brief description of the drawings】

Fig. 1 is a flow chart of the implementation of the invention;

Fig. 2 is a flow chart of selecting, from the candidate antecedent set Z, the candidate antecedent node nearest to the node of the relative pronoun anaphor Mr.

【Detailed description of the embodiments】

The invention is described in further detail below with reference to the drawings.

Referring to Fig. 1, the invention comprises the following steps:

Step 1, preprocess the original text.

1a) Split the original text into sentences with the GENIA sentence splitter;

1b) Tokenize, part-of-speech tag, and lemmatize the text with the Stanford CoreNLP toolkit;

1c) Parse each sentence with the Enju parser and convert the result into PTB format, obtaining the syntax tree T_i of each sentence, i = 1, 2, ..., N; the syntax trees of all sentences form the syntax tree set T = {T_1, T_2, ..., T_N}, where i is the index of the sentence and N is the number of sentences.
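Once a sentence is available in PTB bracketed form, it can be loaded into a tree with parent links so that the later steps can walk it bottom-up. The Python sketch below shows one way to do this; it does not call GENIA, CoreNLP, or Enju, the Node class and parse_ptb function are hypothetical names, the example sentence is hand-written, and every bracket is assumed to carry a label.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    label: str                                   # PTB tag, or the word itself for a leaf
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def words(self) -> List[str]:
        """Return the words under this node in sentence order."""
        if not self.children:
            return [self.label]
        return [w for child in self.children for w in child.words()]


def parse_ptb(text: str) -> Node:
    """Parse one bracketed tree such as "(S (NP (NN p65)) ...)" into Node objects."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                             # skip "("
            node = Node(tokens[pos])             # the label follows the bracket
            pos += 1
            while tokens[pos] != ")":
                child = read()
                child.parent = node
                node.children.append(child)
            pos += 1                             # skip ")"
            return node
        node = Node(tokens[pos])                 # a bare token is a leaf (a word)
        pos += 1
        return node

    return read()


# The syntax tree set T is then simply a list with one tree per sentence.
T = [parse_ptb("(S (NP (NN p65)) (VP (VBZ binds) (NP (NN DNA))))")]
```

Keeping a parent reference on every node is a design choice for this sketch: steps 2 and 3 both need to move upward from a pronoun node, which is awkward with child-only trees.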

Step 2, find the antecedent Ar of the relative pronoun anaphor based on the syntax tree.

2a) Search the syntax tree T_i for nodes labeled "WDT" or "WP", obtaining the relative pronoun anaphor Mr; search T_i for all nodes labeled "NP", obtaining the candidate antecedent set Z; WDT and WP are standard tags, where WDT denotes a wh-determiner, WP denotes a wh-pronoun, and NP denotes a noun phrase;

2b) From the candidate antecedent set Z, select the candidate antecedent node nearest to the node of Mr, obtaining the antecedent Ar of Mr; the procedure, shown in Fig. 2, is as follows:

2011. Find in T_i the node of the relative pronoun anaphor Mr, obtaining the relative pronoun node Nr;

2012. Find in T_i the nodes of all candidate antecedents in Z, obtaining the candidate antecedent node set Nz;

2013. Extract the syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr;

2014. Select the shortest of the syntax tree paths obtained in step 2013, and take the candidate antecedent node on that path as the nearest candidate antecedent node, obtaining the antecedent Ar of Mr.
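Steps 2011 to 2014 amount to a shortest-path search in the tree. The Python sketch below measures a syntax tree path as the number of edges between two nodes through their lowest common ancestor, which is one reasonable reading of "path" here. The Node class, helper names, and the hand-built toy tree are assumptions; the source does not specify how NP nodes that dominate the pronoun itself are treated, and the sketch does not special-case them.

```python
class Node:
    def __init__(self, label, *children):
        self.label, self.children, self.parent = label, list(children), None
        for child in children:
            child.parent = self


def iter_nodes(root):
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def words(node):
    return [n.label for n in iter_nodes(node) if not n.children]


def chain_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_length(a, b):
    """Number of edges on the tree path between nodes a and b."""
    depth_from_a = {id(n): d for d, n in enumerate(chain_to_root(a))}
    for depth_from_b, n in enumerate(chain_to_root(b)):
        if id(n) in depth_from_a:                # lowest common ancestor
            return depth_from_a[id(n)] + depth_from_b
    raise ValueError("nodes are not in the same tree")


def resolve_relative_pronouns(root):
    """Return (Nr, nearest NP node) pairs for every WDT/WP node in the tree."""
    nodes = list(iter_nodes(root))
    candidates = [n for n in nodes if n.label == "NP"]           # node set Nz
    pairs = []
    for nr in (n for n in nodes if n.label in ("WDT", "WP")):
        if candidates:
            pairs.append((nr, min(candidates, key=lambda np: path_length(np, nr))))
    return pairs


# Toy tree for "p65 which binds DNA" (hand-built, simplified bracketing).
toy = Node("S",
           Node("NP", Node("NN", Node("p65"))),
           Node("SBAR",
                Node("WHNP", Node("WDT", Node("which"))),
                Node("S", Node("VP", Node("VBZ", Node("binds")),
                               Node("NP", Node("NN", Node("DNA")))))))
pairs = resolve_relative_pronouns(toy)
```

In the toy tree the path from "which" to the noun phrase "p65" has four edges and the path to "DNA" has five, so "p65" is selected as Ar.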

Step 3, find the antecedent Ap of the personal pronoun anaphor based on the syntax tree.

3a) Search the syntax tree T_i for personal pronoun nodes consisting of "they", "them", "themselves", "their", "its", or an "it" that has a referring function, obtaining the personal pronoun anaphor Mp;

3b) Starting from the node of Mp, traverse T_i bottom-up to find a node Nc that contains a coordinate noun phrase, coordinate verb phrase, or coordinate clause structure;

3c) Determine whether the search in step 3b) found a node Nc. If so, extract from T_i the syntax subtree STc rooted at Nc, find all nodes labeled "NP" in STc, and select from them the node farthest from the node of Mp, obtaining the antecedent Ap of Mp; otherwise, go to step 3d);

3d) Starting from the node of Mp, traverse T_i bottom-up to find the clause node Ns labeled "S", and extract the syntax subtree STs rooted at Ns; find all nodes labeled "NP" in STs, and select from them the noun phrase node farthest from the node of Mp. Determine whether this noun phrase node exists. If so, take it as the antecedent Ap of Mp; otherwise, go to step 3e);

3e) Select the syntax tree T_{i-1} from the syntax tree set T; starting from its last leaf node, traverse T_{i-1} bottom-up to find the clause node Nt labeled "S", and extract the syntax subtree STt rooted at Nt; find all nodes labeled "NP" in STt, and filter out the noun phrase nodes that do not agree in number (singular or plural) with Mp; the remaining noun phrase nodes form the candidate antecedent set Y; from Y, select the candidate antecedent farthest from Mp as the antecedent Ap of Mp;
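The cascade of steps 3b) to 3e) can be sketched in Python as follows. Several details are not fixed by the text and are filled in here as explicit assumptions: a coordinate structure is detected by a child labeled "CC"; distance is measured in tokens; NP nodes that dominate the pronoun are skipped; number is read from NNS/NNPS tags or from a plural pronoun form; and in step 3c) the sketch falls through to the next step if no noun phrase is found. All names are hypothetical.

```python
class Node:
    def __init__(self, label, *children):
        self.label, self.children, self.parent = label, list(children), None
        for child in children:
            child.parent = self


PLURAL_PRONOUNS = {"they", "them", "themselves", "their"}


def iter_nodes(root):
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def leaves(node):
    return [n for n in iter_nodes(node) if not n.children]


def ancestors(node):
    while node.parent is not None:
        node = node.parent
        yield node


def start(node):
    """Token index, within the sentence, of the first word under node."""
    root = node
    while root.parent is not None:
        root = root.parent
    first = leaves(node)[0]
    return next(i for i, leaf in enumerate(leaves(root)) if leaf is first)


def is_plural(node):
    """Assumed number test: an NNS/NNPS tag or a plural pronoun form."""
    tagged = any(n.children and n.label in ("NNS", "NNPS") for n in iter_nodes(node))
    pronoun = any(leaf.label.lower() in PLURAL_PRONOUNS for leaf in leaves(node))
    return bool(tagged or pronoun)


def farthest_np(subtree, pronoun):
    dominating = {id(a) for a in ancestors(pronoun)}
    nps = [n for n in iter_nodes(subtree)
           if n.label == "NP" and id(n) not in dominating]
    return max(nps, key=lambda n: abs(start(n) - start(pronoun)), default=None)


def resolve_personal_pronoun(pronoun, previous_root=None):
    # 3b)-3d): first the coordinate structure Nc, then the enclosing clause Ns.
    for test in (lambda n: any(c.label == "CC" for c in n.children),
                 lambda n: n.label == "S"):
        scope = next((a for a in ancestors(pronoun) if test(a)), None)
        if scope is not None:
            antecedent = farthest_np(scope, pronoun)
            if antecedent is not None:
                return antecedent
    # 3e): previous sentence, number-matching NPs, farthest means earliest.
    if previous_root is None:
        return None
    last_leaf = leaves(previous_root)[-1]
    nt = next((a for a in ancestors(last_leaf) if a.label == "S"), None)
    if nt is None:
        return None
    plural = is_plural(pronoun)
    candidates = [n for n in iter_nodes(nt)
                  if n.label == "NP" and is_plural(n) == plural]
    return min(candidates, key=start, default=None)


# Hand-built tree for "p65 binds DNA and it activates transcription".
it = Node("PRP", Node("it"))
sentence = Node("S",
                Node("S", Node("NP", Node("NN", Node("p65"))),
                     Node("VP", Node("VBZ", Node("binds")),
                          Node("NP", Node("NN", Node("DNA"))))),
                Node("CC", Node("and")),
                Node("S", Node("NP", it),
                     Node("VP", Node("VBZ", Node("activates")),
                          Node("NP", Node("NN", Node("transcription"))))))
same_sentence = resolve_personal_pronoun(it)
```

In this example the coordinate clause structure is found first, and the noun phrase farthest from "it" within it is "p65". Choosing the farthest noun phrase favors the subject of the first conjunct, which is often what a pronoun in the second conjunct refers to.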

Step 4, find the antecedent Ad of the definite noun phrase anaphor based on biological domain features.

4a) Search the syntax tree T_i for all noun phrase nodes that have a child node labeled "DT", and select from them the noun phrase nodes that contain one of the specific biological entity type keywords "protein", "gene", "factor", "element", "receptor", "complex", or "construct", obtaining the definite noun phrase anaphor Md;

4b) From the subset {T_{i-2}, T_{i-1}, T_i} of the syntax tree set T, find all nodes labeled "NP", and filter out the noun phrase nodes that contain neither a biological entity nor a protein entity; the remaining noun phrase nodes form the candidate antecedent set X, where T_j is the syntax tree of the j-th sentence in T and k is the size of the sentence window;

A biological entity is recognized as a token that: starts with a digit and contains a letter; starts with a lowercase letter and contains an uppercase letter, a digit, or a special symbol; starts with an uppercase letter and contains a digit or a special symbol; or starts with an uppercase letter, contains a lowercase letter, and contains an uppercase letter or a special symbol;

4c) Determine whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes". If so, select from the candidate antecedent set X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying in order the criteria of head-word match and protein entity count greater than 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 4d);

4d) Determine whether Md is plural. If so, applying in order the criteria of head-word match, biological entity count greater than 1, and protein entity count greater than 1, select from X the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 4e);

4e) Determine whether the head word of Md is "protein" or "gene". If so, select from X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying in order the criteria of head-word match and protein entity count equal to 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 4f);

4f) Applying in order the criteria of head-word match, biological entity count equal to 1, and protein entity count equal to 1, select from X the candidate nearest to Md as the antecedent Ad of Md;
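The rule cascade of steps 4c) to 4f) can be sketched in Python as below. The phrase "applying the criteria in order" is interpreted here as a fallback chain: the first criterion that matches at least one candidate decides the group, and the nearest candidate in that group is chosen. That interpretation, the Candidate record, the token-based distance, and the comparison of lemmatized head words are all assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    text: str
    head: str          # lemmatized head word of the candidate noun phrase
    n_bio: int         # number of biological entities it contains
    n_protein: int     # number of protein entities it contains
    distance: int      # distance to the anaphor Md, e.g. in tokens (assumed)


def pick(candidates: List[Candidate], anaphor_lemma: str,
         criteria: List[Callable[[Candidate], bool]]) -> Optional[Candidate]:
    """Try head-word match first, then each criterion; return the nearest match."""
    tests = [lambda c: c.head == anaphor_lemma] + list(criteria)
    for test in tests:
        matched = [c for c in candidates if test(c)]
        if matched:
            return min(matched, key=lambda c: c.distance)
    return None


def resolve_definite_np(head: str, lemma: str, plural: bool,
                        X: List[Candidate]) -> Optional[Candidate]:
    Xs = [c for c in X if c.n_protein > 0]       # candidates with a protein entity
    if head in ("proteins", "genes"):            # step 4c)
        return pick(Xs, lemma, [lambda c: c.n_protein > 1])
    if plural:                                   # step 4d)
        return pick(X, lemma, [lambda c: c.n_bio > 1, lambda c: c.n_protein > 1])
    if head in ("protein", "gene"):              # step 4e)
        return pick(Xs, lemma, [lambda c: c.n_protein == 1])
    return pick(X, lemma,                        # step 4f)
                [lambda c: c.n_bio == 1, lambda c: c.n_protein == 1])


# Hypothetical candidate set X for illustration.
X = [
    Candidate("p65 and p50", head="p50", n_bio=2, n_protein=2, distance=12),
    Candidate("the IL-2 gene", head="gene", n_bio=1, n_protein=1, distance=5),
    Candidate("NF-kappaB activation", head="activation", n_bio=1, n_protein=0, distance=3),
]
for_these_proteins = resolve_definite_np("proteins", "protein", True, X)
for_the_gene = resolve_definite_np("gene", "gene", False, X)
```

For the anaphor "these proteins" no candidate head matches, so the count criterion selects "p65 and p50"; for "the gene" the head-word match selects "the IL-2 gene" directly.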

Step 5, from all coreference resolution results, filter out those whose antecedent does not contain a protein entity, completing the protein coreference resolution of biological text based on syntax trees and domain features. The coreference resolution results are expressed as pairs, including: the relative pronoun anaphor Mr and its antecedent Ar, the personal pronoun anaphor Mp and its antecedent Ap, and the definite noun phrase anaphor Md and its antecedent Ad.
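The filtering in step 5 is a simple containment test. The Python sketch below assumes that anaphors, antecedents, and the pre-annotated protein entities are all given as character offsets (start, end) in the text; that representation and the names are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Span = Tuple[int, int]                 # (start, end) character offsets, end exclusive


class Link(NamedTuple):
    anaphor: Span
    antecedent: Span


def contains_protein(span: Span, proteins: List[Span]) -> bool:
    """True if at least one annotated protein lies entirely inside span."""
    start, end = span
    return any(start <= p_start and p_end <= end for p_start, p_end in proteins)


def filter_links(links: List[Link], proteins: List[Span]) -> List[Link]:
    """Keep only the pairs whose antecedent contains a protein entity."""
    return [link for link in links if contains_protein(link.antecedent, proteins)]


# Hypothetical offsets: one protein annotation and two resolved pairs.
proteins = [(0, 3)]
links = [Link(anaphor=(20, 24), antecedent=(0, 3)),
         Link(anaphor=(40, 45), antecedent=(10, 18))]
kept = filter_links(links, proteins)
```

Only the first pair survives, because the antecedent of the second pair covers no annotated protein.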

The technical effect of the invention is further illustrated by the following simulation experiment:

1. Simulation conditions:

The simulation experiment uses the Coreference data set of the biomedical natural language processing shared task BioNLP 2011, in which protein entities are annotated in advance.

The simulation was run in the Java programming language on a Windows 7 system with an Intel Core i7-4720HQ CPU at 2.60 GHz and 8 GB of memory.

2. Simulation content and analysis of results:

The invention and an existing rule-based method were each applied to protein coreference resolution of biological text on the Coreference data set; the experimental results are as follows:

Method	Recall (%)	Precision (%)	F value (%)
Rule-based method	50.4	62.7	55.9
The present invention	60.2	63.8	62.0

In summary, the invention extracts the antecedents of relative pronoun and personal pronoun anaphors based on the syntax tree and the antecedents of definite noun phrase anaphors based on domain features, so it can exploit syntactic structure information that cannot be obtained from the structure and attributes of phrases or words alone, and thus extracts more protein coreference relations. By making full use of the syntactic properties and domain features of biomedical text, it obtains more correct protein coreference results than the existing rule-based method while maintaining precision: recall rises from 50.4% to 60.2% and the F value from 55.9% to 62.0%, which gives it an advantage over existing methods.

The invention addresses the technical problem of the low F value of existing rule-based protein coreference resolution methods for biological text. It preprocesses the original text; searches the syntax tree for each relative pronoun and the noun phrase nearest to it, and takes that noun phrase as the antecedent of the relative pronoun; searches the syntax tree for each personal pronoun, and searches for its antecedent in the coordinate phrase structure of the syntax tree, in the clause subtree, or in the syntax tree of the previous sentence; obtains definite noun phrases and a candidate antecedent set from the syntax trees, and selects the best candidate as the antecedent based on biological domain features such as singular/plural number, entity type, and entity count; and filters out non-protein coreference results. Because protein coreference of relative pronouns and personal pronouns is handled with the syntax tree and protein coreference of definite noun phrases is handled with rules based on biological domain features, different anaphor types are resolved with different techniques, and the specific syntactic properties and domain features of biomedical text are fully used. The experimental results show that the invention effectively extracts protein coreference relations, resolves protein coreference in biological text, and achieves a higher overall F value.

Claims

1. A protein coreference resolution method for biological text based on syntax trees and domain features, characterized by comprising the following steps:

(1) performing sentence splitting, tokenization, part-of-speech tagging, lemmatization, and syntactic parsing on the original text to obtain the syntax tree T_i of each sentence, i = 1, 2, ..., N, the syntax trees of all sentences forming the syntax tree set T = {T_1, T_2, ..., T_N}, where i is the index of the sentence and N is the number of sentences;

(2) searching the syntax tree T_i for a relative pronoun node and the noun phrase node nearest to it, obtaining the relative pronoun anaphor Mr and its antecedent Ar;

(3) searching the syntax tree T_i for a personal pronoun node, obtaining the personal pronoun anaphor Mp; and searching for the antecedent Ap of Mp in the coordinate phrase structure of the syntax tree containing the personal pronoun node, in the clause subtree, or in the syntax tree of the previous sentence;

(4) searching the syntax tree T_i for definite noun phrase nodes that contain a specific biological entity type keyword, obtaining the definite noun phrase anaphor Md; searching a subset of the syntax tree set T, given by a sentence window, for all noun phrase nodes that contain a biological entity or a protein entity, obtaining the candidate antecedent set X; and selecting the antecedent Ad of Md from X based on biological domain features; where T_j is the syntax tree of the j-th sentence in T and k is the size of the sentence window;

(5) filtering out, from all coreference resolution results obtained in steps (2) to (4), those whose antecedent does not contain a protein entity, completing the protein coreference resolution of biological text based on syntax trees and domain features.

2. The biological text protein reference resolution method based on syntax trees and domain features according to claim 1, characterized in that: the searching of syntax tree T_i in step (2) for a relative pronoun node and for the noun phrase node nearest to this relative pronoun node is implemented by the following steps:

201. searching syntax tree T_i for a node labeled "WDT" or "WP" to obtain the relative pronoun node Nr and the relative pronoun anaphor Mr, and searching syntax tree T_i for all nodes labeled "NP" to obtain the candidate antecedent set Z; wherein WDT denotes a wh-determiner, WP denotes a wh-pronoun, and NP denotes a noun phrase;

202. locating in syntax tree T_i the node of every candidate antecedent in the candidate antecedent set Z, to obtain the candidate antecedent node set Nz;

203. extracting the syntax tree path between each candidate antecedent node in the candidate antecedent node set Nz and the relative pronoun node Nr;

204. selecting the shortest syntax tree path from all the syntax tree paths obtained in step 203, and taking the noun phrase node on which this shortest syntax tree path ends as the nearest noun phrase node.
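
Steps 201 to 204 can be sketched in Python as follows. The `Node` class and the function names are hypothetical helpers, not part of the claims. The sketch measures a syntax tree path as the number of edges between two nodes, and it additionally assumes that noun phrase nodes dominating the relative pronoun are skipped, which the claim does not state.

```python
class Node:
    """Minimal constituency-tree node (hypothetical helper)."""

    def __init__(self, label, *children, word=None):
        self.label, self.word, self.parent = label, word, None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def walk(self):
        """Yield this node and all of its descendants in pre-order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def text(self):
        return " ".join(n.word for n in self.walk() if n.word)


def ancestors(node):
    """Return [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_length(a, b):
    """Number of edges on the tree path between a and b (step 203)."""
    steps_from_b = {id(n): i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if id(n) in steps_from_b:  # lowest common ancestor reached
            return i + steps_from_b[id(n)]
    raise ValueError("nodes belong to different trees")


def resolve_relative_pronoun(tree):
    # Step 201: relative pronoun node Nr and the candidate NP nodes.
    nr = next((n for n in tree.walk() if n.label in ("WDT", "WP")), None)
    if nr is None:
        return None, None
    # Assumption: NP nodes that dominate the pronoun are not candidates.
    dominating = {id(n) for n in ancestors(nr)}
    nz = [n for n in tree.walk() if n.label == "NP" and id(n) not in dominating]
    # Steps 203-204: the candidate with the shortest path to Nr wins.
    return nr, min(nz, key=lambda n: path_length(n, nr), default=None)


# Hand-built tree for "the protein which binds DNA".
tree = Node(
    "NP",
    Node("NP", Node("DT", word="the"), Node("NN", word="protein")),
    Node(
        "SBAR",
        Node("WHNP", Node("WDT", word="which")),
        Node("S", Node("VP", Node("VBZ", word="binds"),
                       Node("NP", Node("NN", word="DNA")))),
    ),
)
pronoun, antecedent = resolve_relative_pronoun(tree)
print(pronoun.text(), "->", antecedent.text())  # which -> the protein
```

In this tree the path from "which" to the noun phrase "the protein" has 4 edges and the path to "DNA" has 5, so "the protein" is selected.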

3. The biological text protein reference resolution method based on syntax trees and domain features according to claim 1, characterized in that: in step (3), the antecedent Ap is obtained specifically as follows:

301. traversing syntax tree T_i bottom-up, starting from the node of the personal pronoun anaphor Mp, to search for a node Nc containing a coordinate phrase structure, and determining whether the node Nc exists; if so, extracting from syntax tree T_i the syntax subtree STc rooted at the node Nc, and searching the syntax subtree STc for the noun phrase node farthest from the node of the personal pronoun anaphor Mp, to obtain the antecedent Ap of the personal pronoun anaphor Mp; otherwise, executing step 302;

302. traversing syntax tree T_i bottom-up, starting from the node of the personal pronoun anaphor Mp, to find the clause node Ns; extracting the syntax subtree STs rooted at the clause node Ns, searching the syntax subtree STs for the noun phrase node farthest from the node of the personal pronoun anaphor Mp, and determining whether this noun phrase node exists; if so, obtaining the antecedent Ap of the personal pronoun anaphor Mp; otherwise, executing step 303;

303. selecting syntax tree T_{i-1} from the syntax tree set T, and traversing syntax tree T_{i-1} bottom-up, starting from its last leaf node, to find the clause node Nt; extracting the syntax subtree STt rooted at the clause node Nt, and searching the syntax subtree STt for all noun phrase nodes that agree in number (singular or plural) with the personal pronoun anaphor Mp, to obtain the candidate antecedent set Y; and selecting from the candidate antecedent set Y the candidate antecedent farthest from the personal pronoun anaphor Mp, to obtain the antecedent Ap of the personal pronoun anaphor Mp.
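
The cascade of steps 301 to 303 can be sketched in Python as follows. All names are hypothetical, and several points are assumptions of the sketch: a node with a "CC" child is treated as a coordinate structure, clause nodes are those labeled "S" or "SBAR", distance is measured in word positions, noun phrases dominating the pronoun are skipped, and number agreement is a crude tag test. The sketch also falls through to the next step when a scope contains no usable noun phrase.

```python
class Node:
    """Minimal constituency-tree node (hypothetical helper)."""

    def __init__(self, label, *children, word=None):
        self.label, self.word, self.parent = label, word, None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

    def text(self):
        return " ".join(n.word for n in self.walk() if n.word)


def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def leaf_position(node):
    """Index, within the whole sentence, of the first word under node."""
    leaves = [n for n in ancestors(node)[-1].walk() if n.word]
    first = next(n for n in node.walk() if n.word)
    return next(i for i, leaf in enumerate(leaves) if leaf is first)


def enclosing(node, test):
    """First proper ancestor of node that satisfies test, or None."""
    return next((n for n in ancestors(node)[1:] if test(n)), None)


def is_coordination(n):  # assumption: a CC child marks coordination
    return any(c.label == "CC" for c in n.children)


def is_clause(n):  # assumption: clauses are labeled S or SBAR
    return n.label in ("S", "SBAR")


def is_plural(np):  # assumption: plural tag or coordination means plural
    return any(n.label in ("NNS", "NNPS", "CC") for n in np.walk())


def resolve_personal_pronoun(mp, previous_tree=None):
    here = leaf_position(mp)
    dominating = {id(n) for n in ancestors(mp)}
    for test in (is_coordination, is_clause):  # steps 301 and 302
        scope = enclosing(mp, test)
        if scope is None:
            continue
        nps = [n for n in scope.walk()
               if n.label == "NP" and id(n) not in dominating]
        if nps:  # farthest noun phrase from the pronoun
            return max(nps, key=lambda n: abs(leaf_position(n) - here))
    if previous_tree is None:
        return None
    # Step 303: clause node Nt above the last leaf of the previous sentence.
    last_leaf = [n for n in previous_tree.walk() if n.word][-1]
    nt = enclosing(last_leaf, is_clause)
    if nt is None:
        return None
    want_plural = mp.word.lower() in {"they", "them"}
    y = [n for n in nt.walk()
         if n.label == "NP" and is_plural(n) == want_plural]
    # Every word of the previous sentence precedes Mp, so the farthest
    # candidate is the one that starts earliest.
    return min(y, key=leaf_position, default=None)


# "p65 binds DNA and it activates transcription"
t1 = Node("S",
          Node("S", Node("NP", Node("NN", word="p65")),
               Node("VP", Node("VBZ", word="binds"),
                    Node("NP", Node("NN", word="DNA")))),
          Node("CC", word="and"),
          Node("S", Node("NP", Node("PRP", word="it")),
               Node("VP", Node("VBZ", word="activates"),
                    Node("NP", Node("NN", word="transcription")))))
it = next(n for n in t1.walk() if n.label == "PRP")
same_sentence = resolve_personal_pronoun(it)

# "IL-2 and IL-4 activate STAT5." followed by "They dimerize."
prev = Node("S",
            Node("NP", Node("NN", word="IL-2"), Node("CC", word="and"),
                 Node("NN", word="IL-4")),
            Node("VP", Node("VBP", word="activate"),
                 Node("NP", Node("NN", word="STAT5"))))
t2 = Node("S", Node("NP", Node("PRP", word="They")),
          Node("VP", Node("VBP", word="dimerize")))
they = next(n for n in t2.walk() if n.label == "PRP")
cross_sentence = resolve_personal_pronoun(they, prev)
print(same_sentence.text(), "|", cross_sentence.text())
```

The first example is resolved inside the coordinate structure (step 301), while the second finds no noun phrase in its own clause and falls back to the previous sentence (step 303).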

4. The biological text protein reference resolution method based on syntax trees and domain features according to claim 1, characterized in that: in step (3), the coordinate phrase structure refers to a coordinate noun phrase, coordinate verb phrase or coordinate clause structure.

5. The biological text protein reference resolution method based on syntax trees and domain features according to claim 1, characterized in that: in step (4), the antecedent Ad is obtained specifically as follows:

401. determining whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes"; if so, selecting from the candidate antecedent set X all candidate antecedents containing a protein entity to obtain a new candidate antecedent set Xs, and selecting from this new candidate antecedent set Xs, in the order of head word match and number of contained protein entities greater than 1, the candidate antecedent nearest to the definite noun phrase anaphor Md, to obtain the antecedent Ad of the definite noun phrase anaphor Md; otherwise, executing step 402;

402. determining whether the definite noun phrase anaphor Md is in plural form; if so, selecting from the candidate antecedent set X, in the order of head word match, number of contained biological entities greater than 1, and number of contained protein entities greater than 1, the candidate antecedent nearest to the definite noun phrase anaphor Md, to obtain the antecedent Ad of the definite noun phrase anaphor Md; otherwise, executing step 403;

403. determining whether the head word of the definite noun phrase anaphor Md is "protein" or "gene"; if so, selecting from the candidate antecedent set X all candidate antecedents containing a protein entity to obtain a new candidate antecedent set Xs, and selecting from this new candidate antecedent set Xs, in the order of head word match and number of contained protein entities equal to 1, the candidate antecedent nearest to the definite noun phrase anaphor Md, to obtain the antecedent Ad of the definite noun phrase anaphor Md; otherwise, executing step 404;

404. selecting from the candidate antecedent set X, in the order of head word match, number of contained biological entities equal to 1, and number of contained protein entities equal to 1, the candidate antecedent nearest to the definite noun phrase anaphor Md, to obtain the antecedent Ad of the definite noun phrase anaphor Md.
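
One possible reading of steps 401 to 404 is sketched below in Python. The `Candidate` record, its fields and the function names are hypothetical. The sketch interprets "in the order of" as a priority ranking: candidates satisfying the earlier criteria are preferred, and the distance to Md breaks remaining ties. The claims do not spell out this ranking, so it is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate antecedent from set X (hypothetical record)."""
    text: str
    head: str        # head word of the noun phrase
    n_bio: int       # number of biological entities it contains
    n_protein: int   # number of protein entities it contains
    distance: int    # assumed measure: tokens between candidate and Md


def pick(candidates, head, tests):
    """Prefer head word match, then each test in order, then the nearest."""
    def rank(c):
        flags = [c.head == head] + [test(c) for test in tests]
        # False sorts before True, so a satisfied criterion ranks first.
        return tuple(not f for f in flags) + (c.distance,)
    return min(candidates, key=rank, default=None)


def select_antecedent(head, plural, x):
    """Return Ad for a definite noun phrase anaphor Md with this head word."""
    xs = [c for c in x if c.n_protein > 0]  # candidates with a protein entity
    if head in ("proteins", "genes"):                           # step 401
        return pick(xs, head, [lambda c: c.n_protein > 1])
    if plural:                                                  # step 402
        return pick(x, head, [lambda c: c.n_bio > 1,
                              lambda c: c.n_protein > 1])
    if head in ("protein", "gene"):                             # step 403
        return pick(xs, head, [lambda c: c.n_protein == 1])
    return pick(x, head, [lambda c: c.n_bio == 1,               # step 404
                          lambda c: c.n_protein == 1])


X = [
    Candidate("IL-2 and IL-4", "IL-4", n_bio=2, n_protein=2, distance=12),
    Candidate("the STAT5 gene", "gene", n_bio=1, n_protein=1, distance=5),
    Candidate("a T cell line", "line", n_bio=1, n_protein=0, distance=3),
]
print(select_antecedent("proteins", True, X).text)
print(select_antecedent("gene", False, X).text)
print(select_antecedent("factor", False, X).text)
```

For "these proteins" the sketch prefers the candidate holding more than one protein entity even though it is farther away, whereas for "the gene" the head word match decides.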

6. The biological text protein reference resolution method based on syntax trees and domain features according to claim 1, characterized in that: the specific biological entity type keywords in step (4) include "protein", "gene", "factor", "element", "receptor", "complex" and "construct".
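
A minimal Python sketch of how the keyword list might be used to detect definite noun phrase anaphors is given below. The keyword set is taken from the claim; the determiner list, the plural handling and the function names are assumptions of the sketch.

```python
KEYWORDS = {"protein", "gene", "factor", "element",
            "receptor", "complex", "construct"}

# Assumption: these determiners mark a noun phrase as definite.
DETERMINERS = {"the", "this", "that", "these", "those"}


def singular(word):
    """Map a plural keyword form to its singular form (assumed rule)."""
    w = word.lower()
    if w.endswith("es") and w[:-2] in KEYWORDS:
        return w[:-2]
    if w.endswith("s") and w[:-1] in KEYWORDS:
        return w[:-1]
    return w


def is_definite_np_anaphor(tokens):
    """True if the noun phrase is definite and contains a type keyword."""
    if not tokens or tokens[0].lower() not in DETERMINERS:
        return False
    return any(singular(t) in KEYWORDS for t in tokens[1:])


print(is_definite_np_anaphor(["these", "proteins"]))
print(is_definite_np_anaphor(["the", "experiment"]))
```

The first phrase qualifies as an anaphor Md, while the second contains no entity type keyword and is ignored.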

7. The biological text protein reference resolution method based on syntax trees and domain features according to claim 1, characterized in that: the biological entities of step (4) are recognized as tokens that: begin with a digit and contain a letter; begin with a lowercase letter and contain an uppercase letter, a digit or a special symbol; begin with an uppercase letter and contain a digit or a special symbol; or begin with an uppercase letter, contain a lowercase letter, and contain an uppercase letter or a special symbol.
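
The four orthographic rules can be sketched in Python as follows. The function name is hypothetical. The sketch assumes that a "special symbol" is any non-alphanumeric character and that the "contains" conditions refer to the characters after the first one.

```python
def looks_like_bio_entity(token):
    """Apply the four token-shape rules to a single token."""
    if not token:
        return False
    first, rest = token[0], token[1:]
    has_letter = any(c.isalpha() for c in rest)
    has_lower = any(c.islower() for c in rest)
    has_upper = any(c.isupper() for c in rest)
    has_digit = any(c.isdigit() for c in rest)
    # Assumption: a special symbol is any non-alphanumeric character.
    has_special = any(not c.isalnum() for c in rest)

    if first.isdigit():   # rule 1: digit first, contains a letter
        return has_letter
    if first.islower():   # rule 2: lowercase first
        return has_upper or has_digit or has_special
    if first.isupper():   # rules 3 and 4: uppercase first
        return has_digit or has_special or (has_lower and has_upper)
    return False


for tok in ["p65", "IL-2", "NF-kappaB", "4E-BP1", "TNFalpha",
            "protein", "Protein", "1995"]:
    print(tok, looks_like_bio_entity(tok))
```

Ordinary words such as "protein" or a sentence-initial "Protein" are rejected, while mixed-shape tokens such as "p65" or "NF-kappaB" are accepted.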