CN106484676A - Protein coreference resolution method for biological text based on syntax trees and domain features - Google Patents

Protein coreference resolution method for biological text based on syntax trees and domain features

Info

Publication number
CN106484676A
CN106484676A CN201610872780.8A CN201610872780A CN106484676A CN 106484676 A CN106484676 A CN 106484676A CN 201610872780 A CN201610872780 A CN 201610872780A CN 106484676 A CN106484676 A CN 106484676A
Authority
CN
China
Prior art keywords
syntax tree
node
anaphor
antecedent
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610872780.8A
Other languages
Chinese (zh)
Other versions
CN106484676B (en)
Inventor
李辰
饶志强
张向荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201610872780.8A
Publication of CN106484676A
Application granted
Publication of CN106484676B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a protein coreference resolution method for biological text based on syntax trees and domain features, which addresses the low F-score of existing rule-based methods. Its steps are: preprocess the original text; find each relative pronoun in the syntax tree and take the noun phrase nearest to it as its antecedent; find each personal pronoun in the syntax tree and search for its antecedent in the coordinate phrase structure of that tree, in the clause subtree, or in the syntax tree of the previous sentence; use the syntax tree to obtain definite noun phrases and a candidate antecedent set, and select the best antecedent from the candidates based on biological domain features such as singular/plural number, entity type, and entity count; and filter out non-protein coreference results. The invention achieves protein coreference resolution in biological text and obtains a higher F-score.

Description

Protein coreference resolution method for biological text based on syntax trees and domain features
【Technical field】
The invention belongs to the field of text mining, and in particular relates to a protein coreference resolution method for biological text based on syntax trees and domain features.
【Background】
With the rapid development of computer and Internet technology, large amounts of information and literature now exist in digital form. The biomedical literature is vast and growing exponentially, and researchers who rely on traditional manual reading struggle to extract valuable information from it efficiently, so automated text information extraction has become important work. Protein coreference resolution is one task in biomedical text information extraction: it finds the words and phrases in biological text that refer to the same protein entity. An anaphor is an expression that refers to something else, such as a common pronoun; an antecedent is the expression being referred to, which carries the actual content, such as a protein or another biological entity. As a supporting technique, protein coreference resolution underpins many biological text mining tasks and can effectively improve the performance of biological information extraction systems.
The biomedical natural language processing task BioNLP Shared Task 2011 provides a standard protein coreference corpus for the biomedical domain, named Coreference. The corpus is drawn from MEDLINE abstracts and targets coreference resolution of protein entities in biomedical text; the protein entities in the corpus are annotated in advance.
Coreference resolution in the general domain is more mature than in the biomedical domain, but because domain-specific corpora have their own characteristics, general-domain methods do not perform well when transplanted directly to biomedical text, so methods designed for protein coreference resolution in biological text are needed. Current methods fall into three groups: supervised learning methods, rule-based methods, and hybrid methods.

Supervised learning methods extract features from a protein coreference training set to learn a model, then apply the model to resolve protein coreference relations in new data. For example, the 2011 BioNLP task paper by Youngjun Kim et al., "The Taming of Reconcile as a Biomedical Coreference Resolver", discloses a supervised method based on features and a classifier model: it extracts lexical features, syntactic features, and other features, then performs protein coreference resolution with a classifier.

Rule-based methods handle protein coreference in biological text with a set of hand-written rules. For example, the 2012 paper by Makoto Miwa et al. in Bioinformatics, volume 28, issue 13, "Boosting automatic event extraction from the literature using domain adaptation and coreference resolution", discloses a method based on rule matching. It uses an anaphor candidate detector to extract noun phrases, pronouns, and similar expressions as anaphor candidates, then an antecedent candidate detector to extract noun phrases as antecedent candidates, and finally a coreference link detector that applies rules in order (exact match, structural match, strict head-word match, relaxed head-word match, number match, and so on) to select the best antecedent for each anaphor.

Hybrid methods use supervised learning and rules together, applying different techniques to different types of protein coreference. For example, the paper by Jennifer D'Souza and Vincent Ng at the 2012 ACM-BCB conference, "Anaphora Resolution in Biomedical Literature: A Hybrid Approach", discloses a method based on a classifier and rule matching: a learned classifier resolves pronoun anaphors, and rules resolve noun phrase anaphors.
The performance of a protein coreference resolution method is usually evaluated by recall, precision, and F-score. Recall is the ratio of correctly extracted protein coreference relations to all protein coreference relations in the data; precision is the ratio of correctly extracted protein coreference relations to all extracted results; the F-score is the harmonic mean of recall and precision and serves as the final overall metric.

Among methods for protein coreference resolution in biological text, supervised learning methods need a large manually annotated corpus to reach good results, but producing such a corpus costs a great deal of labor, resources, and time, and existing data do not meet this requirement. With limited training data the learned model is poor and its recall is very low, which makes the F-score very low. Rule-based methods generally borrow only some rules from general-domain coreference resolution. These rules consider only the structure and attributes of phrases or words and ignore the distinctive syntactic properties and domain features of biological text, so they cannot extract correct protein coreference relations effectively; their recall is low, which makes the F-score low. Hybrid methods still suffer from the insufficiently trained model of the supervised component and from low recall on pronoun-type protein coreference; overall recall improves but remains insufficient, so the F-score is still unsatisfactory. Current protein coreference results need further improvement so that they can serve more effectively as preprocessing for other biological text mining tasks, such as biological event extraction.
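The three metrics described above can be written compactly as follows. The symbols are our own shorthand and do not appear in the source: $C$ is the set of correctly extracted protein coreference relations, $G$ the set of all protein coreference relations in the data, and $E$ the set of all extracted results.

```latex
\[
R = \frac{|C|}{|G|}, \qquad
P = \frac{|C|}{|E|}, \qquad
F = \frac{2PR}{P + R}
\]
```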
【Summary of the invention】
The object of the invention is to overcome the above defects of the prior art by proposing a protein coreference resolution method for biological text based on syntax trees and domain features that extracts correct protein coreference relations and thereby improves the overall metric, the F-score.
To achieve this object, the technical solution adopted by the invention comprises the following steps:
(1) Perform sentence splitting, tokenization, part-of-speech tagging, lemmatization, and syntactic parsing on the original text to obtain the syntax tree $T_i$, $i = 1, 2, \ldots, N$, of each sentence; the syntax trees of all sentences form the syntax tree set $T = \{T_1, T_2, \ldots, T_N\}$, where $i$ is the sentence index and $N$ is the number of sentences;
(2) In syntax tree $T_i$, find each relative pronoun node and the noun phrase node nearest to it, obtaining the relative pronoun anaphor Mr and its antecedent Ar;
(3) In syntax tree $T_i$, find each personal pronoun node, obtaining the personal pronoun anaphor Mp; then search for the antecedent Ap of Mp in the coordinate phrase structure of the syntax tree containing that node, in the clause subtree, or in the syntax tree of the previous sentence;
(4) In syntax tree $T_i$, find the definite noun phrase nodes that contain a specific biological entity-type keyword, obtaining the definite noun phrase anaphor Md; in a subset $\{T_j\}$ of the syntax tree set $T$ that spans a sentence window of size $k$, find all noun phrase nodes that contain a biological entity or a protein entity, obtaining the candidate antecedent set X; select the antecedent Ad of Md from X based on biological domain features; here $T_j$ is the syntax tree of the $j$-th sentence in $T$ and $k$ is the size of the sentence window;
(5) From all coreference resolution results obtained in steps (2) to (4), filter out those whose antecedent contains no protein entity, completing the protein coreference resolution of biological text based on syntax trees and domain features.
Further, in step (2), finding the relative pronoun node and the noun phrase node nearest to it in syntax tree $T_i$ is implemented as follows:
201. In syntax tree $T_i$, find the nodes labelled "WDT" or "WP", obtaining the relative pronoun node Nr and the relative pronoun anaphor Mr; find all nodes labelled "NP", obtaining the candidate antecedent set Z. Here WDT denotes a wh-determiner, WP a wh-pronoun, and NP a noun phrase;
202. In syntax tree $T_i$, find the nodes of all candidate antecedents in the set Z, obtaining the candidate antecedent node set Nz;
203. Extract the syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr;
204. From all the syntax tree paths obtained in step 203, pick the shortest, and take the noun phrase node on that shortest path as the nearest noun phrase node.
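Steps 201 to 204 can be sketched as follows. This is a minimal illustration under our own assumptions, not the patent's implementation: the `Node` class, the helper names, and the use of edge count through the lowest common ancestor as the "syntax tree path" length are all hypothetical choices.

```python
class Node:
    """Minimal syntax tree node with a parent link (illustrative only)."""
    def __init__(self, label, children=(), word=None):
        self.label = label
        self.word = word
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """Return [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def path_length(a, b):
    """Count the edges on the tree path between a and b via their lowest common ancestor."""
    depth_from_a = {id(n): d for d, n in enumerate(ancestors(a))}
    for d, n in enumerate(ancestors(b)):
        if id(n) in depth_from_a:
            return d + depth_from_a[id(n)]
    raise ValueError("nodes belong to different trees")

def find_all(root, labels):
    """Collect, in pre-order, every node whose label is in `labels`."""
    found = [root] if root.label in labels else []
    for child in root.children:
        found.extend(find_all(child, labels))
    return found

def nearest_np(root, pronoun):
    """Steps 202-204: the NP node with the shortest tree path to the pronoun node."""
    candidates = find_all(root, {"NP"})
    return min(candidates, key=lambda np: path_length(pronoun, np), default=None)

# "IL-2, which binds the receptor": two NP candidates for the relative pronoun.
np_il2 = Node("NP", [Node("NN", word="IL-2")])
wdt = Node("WDT", word="which")
np_receptor = Node("NP", [Node("DT", word="the"), Node("NN", word="receptor")])
root = Node("S", [np_il2, Node("SBAR", [
    Node("WHNP", [wdt]),
    Node("S", [Node("VP", [Node("VBZ", word="binds"), np_receptor])]),
])])
pronoun = find_all(root, {"WDT", "WP"})[0]          # step 201
print(nearest_np(root, pronoun).children[0].word)   # prints IL-2
```

In this toy tree the path from "which" to the NP "IL-2" has 4 edges and the path to "the receptor" has 5, so "IL-2" is selected.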
Further, in step (3), the antecedent Ap is obtained as follows:
301. Starting from the node of the personal pronoun anaphor Mp, traverse syntax tree $T_i$ bottom-up to find a node Nc that contains a coordinate phrase structure. If Nc exists, extract from $T_i$ the syntax subtree STc rooted at Nc, find in STc the noun phrase node farthest from the node of Mp, and take it as the antecedent Ap of Mp; otherwise, go to step 302;
302. Starting from the node of the personal pronoun anaphor Mp, traverse syntax tree $T_i$ bottom-up to find the clause node Ns; extract the syntax subtree STs rooted at Ns and find in STs the noun phrase node farthest from the node of Mp. If such a noun phrase node exists, take it as the antecedent Ap of Mp; otherwise, go to step 303;
303. Select syntax tree $T_{i-1}$ from the syntax tree set $T$; starting from the last leaf node of $T_{i-1}$, traverse bottom-up to find the clause node Nt; extract the syntax subtree STt rooted at Nt, and find in STt all noun phrase nodes that agree in singular/plural number with the personal pronoun anaphor Mp, obtaining the candidate antecedent set Y; from Y, select the candidate farthest from Mp as the antecedent Ap of Mp.
Further, the coordinate phrase structure in step (3) is a coordinate noun phrase, a coordinate verb phrase, or a coordinate clause structure.
Further, in step (4), the antecedent Ad is obtained as follows:
401. Determine whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes". If so, select from the candidate antecedent set X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying the criteria in the order head-word match, protein entity count greater than 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 402;
402. Determine whether the definite noun phrase anaphor Md is plural. If so, applying the criteria in the order head-word match, biological entity count greater than 1, protein entity count greater than 1, select from X the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 403;
403. Determine whether the head word of the definite noun phrase anaphor Md is "protein" or "gene". If so, select from X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying the criteria in the order head-word match, protein entity count equal to 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 404;
404. Applying the criteria in the order head-word match, biological entity count equal to 1, protein entity count equal to 1, select from X the candidate nearest to Md as the antecedent Ad of Md.
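One way to read steps 401 to 404 is as a rule cascade: within each branch the criteria are tried in the stated order, and the first criterion that any candidate satisfies decides the match, with ties broken by distance. That reading is our assumption; the source only says the criteria are applied "in order". The `Candidate` fields and function names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    head: str        # head word of the candidate noun phrase
    n_bio: int       # number of biological entities it contains
    n_protein: int   # number of protein entities it contains
    distance: int    # distance from the anaphor (smaller is nearer)

def first_match(candidates, criteria):
    """Try each criterion in order; return the nearest candidate for the first one that matches."""
    for test in criteria:
        matching = [c for c in candidates if test(c)]
        if matching:
            return min(matching, key=lambda c: c.distance)
    return None

def resolve_definite_np(head, is_plural, candidates):
    """Assumed reading of steps 401-404 for an anaphor with the given head word."""
    same_head = lambda c: c.head == head
    with_protein = [c for c in candidates if c.n_protein > 0]   # the set Xs
    if head in ("proteins", "genes"):                            # step 401
        return first_match(with_protein, [same_head, lambda c: c.n_protein > 1])
    if is_plural:                                                # step 402
        return first_match(candidates, [same_head,
                                        lambda c: c.n_bio > 1,
                                        lambda c: c.n_protein > 1])
    if head in ("protein", "gene"):                              # step 403
        return first_match(with_protein, [same_head, lambda c: c.n_protein == 1])
    return first_match(candidates, [same_head,                   # step 404
                                    lambda c: c.n_bio == 1,
                                    lambda c: c.n_protein == 1])

candidates = [
    Candidate("IL-2 and IL-4", "IL-4", n_bio=2, n_protein=2, distance=3),
    Candidate("the receptor", "receptor", n_bio=0, n_protein=0, distance=1),
    Candidate("p53", "p53", n_bio=1, n_protein=1, distance=2),
]
print(resolve_definite_np("proteins", True, candidates).text)   # prints IL-2 and IL-4
print(resolve_definite_np("protein", False, candidates).text)   # prints p53
```

Keeping the criteria as an ordered list makes the priority between head-word matching and entity counts explicit and easy to change if the intended ordering differs from this reading.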
Further, the specific biological entity-type keywords in step (4) include "protein", "gene", "factor", "element", "receptor", "complex", and "construct".
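Detecting definite noun phrase anaphors with this keyword list might look like the sketch below. It assumes a tree encoded as nested `(label, children)` tuples with word strings as leaves, assumes the leaves are already lemmatized, and treats an NP as definite when it has a child labelled "DT", following the detailed description; the function names are ours.

```python
KEYWORDS = {"protein", "gene", "factor", "element", "receptor", "complex", "construct"}

def leaves(tree):
    """Return the words under a (label, children) tree; a bare string is a leaf."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]

def definite_nps(tree):
    """Collect NP nodes that have a DT child and contain an entity-type keyword."""
    if isinstance(tree, str):
        return []
    label, children = tree
    found = []
    has_dt = any(not isinstance(c, str) and c[0] == "DT" for c in children)
    words = {w.lower() for w in leaves(tree)}
    if label == "NP" and has_dt and KEYWORDS & words:
        found.append(tree)
    for child in children:
        found.extend(definite_nps(child))
    return found

tree = ("S", [
    ("NP", [("DT", ["the"]), ("NN", ["protein"])]),
    ("VP", [("VBZ", ["binds"]),
            ("NP", [("DT", ["a"]), ("NN", ["site"])])]),
])
print([" ".join(leaves(np)) for np in definite_nps(tree)])   # prints ['the protein']
```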
Further, a biological entity in step (4) is recognized by any of the following patterns: it begins with a digit and contains a letter; it begins with a lowercase letter and contains an uppercase letter, a digit, or a special symbol; it begins with an uppercase letter and contains a digit or a special symbol; or it begins with an uppercase letter, contains a lowercase letter, and also contains an uppercase letter or a special symbol.
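The four surface patterns can be checked with a few regular expressions, as in the sketch below. The source does not define "special symbol"; here it is assumed to mean any character that is neither alphanumeric nor whitespace, and the "contains" conditions are assumed to apply to the characters after the first one.

```python
import re

def looks_like_bio_entity(token):
    """Apply the four orthographic patterns to a single token."""
    if not token:
        return False
    first, rest = token[0], token[1:]
    has_upper = re.search(r"[A-Z]", rest) is not None
    has_lower = re.search(r"[a-z]", rest) is not None
    has_digit = re.search(r"[0-9]", rest) is not None
    # Assumption: "special symbol" = any non-alphanumeric, non-space character.
    has_special = re.search(r"[^A-Za-z0-9\s]", rest) is not None

    if first.isdigit():                       # pattern 1
        return has_upper or has_lower
    if "a" <= first <= "z":                   # pattern 2
        return has_upper or has_digit or has_special
    if "A" <= first <= "Z":                   # patterns 3 and 4
        return (has_digit or has_special) or (has_lower and has_upper)
    return False

for token in ["p53", "NF-kappaB", "IkappaB", "5-HT", "protein", "Protein"]:
    print(token, looks_like_bio_entity(token))
```

Under these assumptions an all-uppercase token with no digit or symbol, such as "IL", is not matched, and trailing punctuation on a token would count as a special symbol, so tokens should be cleaned first.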
Compared with the prior art, the invention has the following advantages:
After preprocessing the original text, the invention extracts relative pronouns, personal pronouns, and definite noun phrases from the syntax tree, determines the antecedents of relative pronoun and personal pronoun anaphors based on the syntax tree, determines the antecedents of definite noun phrase anaphors based on biological domain features, and finally filters out coreference results that involve no protein entity. Because antecedents of relative pronoun and personal pronoun anaphors are extracted from the syntax tree, and antecedents of definite noun phrase anaphors are extracted using domain features, the method can exploit syntactic structure information that is unavailable when only the structure and attributes of phrases or words are considered, and so extracts more correct protein coreference relations. By making full use of the syntactic properties and domain features of biomedical text, the method obtains more correct protein coreference results than current rule-based methods while maintaining precision; recall improves and a better overall F-score is obtained, as the experimental results show. The invention can be used for coreference resolution of relative pronouns, personal pronouns, and definite noun phrases that refer to protein entities in biomedical text.
【Brief description of the drawings】
Fig. 1 is a flow diagram of the implementation of the invention;
Fig. 2 is a flow diagram of selecting, from the candidate antecedent set Z, the candidate antecedent node nearest to the node of the relative pronoun anaphor Mr.
【Detailed description】
The invention is described in further detail below with reference to the drawings:
Referring to Fig. 1, the invention comprises the following steps:
Step 1: preprocess the original text.
1a) Split the original text into sentences with the GENIA sentence splitter;
1b) Tokenize, part-of-speech tag, and lemmatize the text with the Stanford CoreNLP tool;
1c) Parse each sentence with the Enju parser and convert the result to PTB format, obtaining the syntax tree $T_i$, $i = 1, 2, \ldots, N$, of each sentence; the syntax trees of all sentences form the syntax tree set $T = \{T_1, T_2, \ldots, T_N\}$, where $i$ is the sentence index and $N$ is the number of sentences.
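Once the parser output is in PTB bracketed format, it can be loaded into a tree structure for the later steps. The sketch below is a small stand-alone reader written for illustration; it does not call GENIA, CoreNLP, or Enju, and the nested `(label, children)` tuple encoding is our own choice.

```python
import re

def parse_ptb(text):
    """Parse one bracketed tree into (label, children) tuples; leaves are word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":              # a bare token is a leaf word
            pos += 1
            return tokens[pos - 1]
        pos += 1                            # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):   # PTB roots may have an empty label
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                            # consume ")"
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first tree")
    return tree

def leaves(tree):
    """Return the words of a parsed tree in sentence order."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]

tree = parse_ptb("(S (NP (PRP It)) (VP (VBZ binds) (NP (NN DNA))))")
print(tree[0], leaves(tree))   # prints S ['It', 'binds', 'DNA']
```

Unbalanced input raises an `IndexError` in this sketch; a production reader would report the error position instead.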
Step 2: find the antecedent Ar of the relative pronoun anaphor based on the syntax tree.
2a) In syntax tree $T_i$, find the nodes labelled "WDT" or "WP", obtaining the relative pronoun anaphor Mr; find all nodes labelled "NP", obtaining the candidate antecedent set Z. WDT and WP are standard tags: WDT denotes a wh-determiner, WP a wh-pronoun, and NP a noun phrase;
2b) From the candidate antecedent set Z, select the candidate antecedent node nearest to the node of the relative pronoun anaphor Mr, obtaining the antecedent Ar of Mr. The implementation steps are shown in Fig. 2:
2011. In syntax tree $T_i$, find the node of the relative pronoun anaphor Mr, obtaining the relative pronoun node Nr;
2012. In syntax tree $T_i$, find the nodes of all candidate antecedents in the set Z, obtaining the candidate antecedent node set Nz;
2013. Extract the syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr;
2014. From all the syntax tree paths obtained in step 2013, pick the shortest, and take the candidate antecedent node on that shortest path as the nearest candidate antecedent node, obtaining the antecedent Ar of the relative pronoun anaphor Mr.
Step 3: find the antecedent Ap of the personal pronoun anaphor based on the syntax tree.
3a) In syntax tree $T_i$, find the personal pronoun nodes consisting of "they", "them", "themselves", "their", "its", or an anaphoric "it", obtaining the personal pronoun anaphor Mp;
3b) Starting from the node of the personal pronoun anaphor Mp, traverse syntax tree $T_i$ bottom-up to find a node Nc that contains a coordinate noun phrase, coordinate verb phrase, or coordinate clause structure;
3c) Determine whether the search in step 3b) found a node Nc. If so, extract from $T_i$ the syntax subtree STc rooted at Nc, find all nodes labelled "NP" in STc, and select from them the node farthest from the node of Mp, obtaining the antecedent Ap of Mp; otherwise, go to step 3d);
3d) Starting from the node of the personal pronoun anaphor Mp, traverse syntax tree $T_i$ bottom-up to find the clause node Ns labelled "S", and extract the syntax subtree STs rooted at Ns; find all nodes labelled "NP" in STs and select from them the noun phrase node farthest from the node of Mp. If such a noun phrase node exists, take it as the antecedent Ap of Mp; otherwise, go to step 3e);
3e) Select syntax tree $T_{i-1}$ from the syntax tree set $T$; starting from the last leaf node of $T_{i-1}$, traverse bottom-up to find the clause node Nt labelled "S", and extract the syntax subtree STt rooted at Nt; find all nodes labelled "NP" in STt and filter out the noun phrase nodes that do not agree in singular/plural number with the personal pronoun anaphor Mp; the remaining noun phrase nodes form the candidate antecedent set Y; from Y, select the candidate farthest from Mp as the antecedent Ap of Mp.
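The fallback order of steps 3c) to 3e) can be sketched as below. To keep the example short it assumes the noun phrase candidates for each scope (coordinate subtree STc, clause subtree STs, previous-sentence subtree STt) have already been collected, each with a precomputed distance from the pronoun and a number value; the dictionary fields and function names are hypothetical.

```python
PLURAL_PRONOUNS = {"they", "them", "themselves", "their"}   # "it" and "its" are singular

def farthest(candidates):
    """Return the candidate with the largest distance from the pronoun, or None."""
    return max(candidates, key=lambda c: c["distance"], default=None)

def resolve_personal_pronoun(pronoun, coord_nps, clause_nps, prev_sentence_nps):
    """Try the coordinate subtree, then the clause subtree, then the previous sentence."""
    if coord_nps:                                   # step 3c: node Nc exists
        return farthest(coord_nps)
    if clause_nps:                                  # step 3d: an NP exists in STs
        return farthest(clause_nps)
    number = "plural" if pronoun.lower() in PLURAL_PRONOUNS else "singular"
    agreeing = [c for c in prev_sentence_nps if c["number"] == number]   # set Y
    return farthest(agreeing)                       # step 3e

previous = [
    {"text": "p53", "number": "singular", "distance": 4},
    {"text": "IL-2 and IL-4", "number": "plural", "distance": 3},
    {"text": "the receptors", "number": "plural", "distance": 6},
]
print(resolve_personal_pronoun("they", [], [], previous)["text"])   # prints the receptors
```

Only the last scope applies the number agreement filter, matching the text above, where the first two scopes select purely by distance.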
Step 4: find the antecedent Ad of the definite noun phrase anaphor based on biological domain features.
4a) In syntax tree $T_i$, find all noun phrase nodes that have a child node labelled "DT", and select from them the noun phrase nodes that contain one of the specific biological entity-type keywords "protein", "gene", "factor", "element", "receptor", "complex", or "construct", obtaining the definite noun phrase anaphor Md;
4b) In a subset of the syntax tree set $T$, such as $\{T_{i-2}, T_{i-1}, T_i\}$, find all nodes labelled "NP" and filter out the noun phrase nodes that contain neither a biological entity nor a protein entity; the remaining noun phrase nodes form the candidate antecedent set X, where $T_j$ is the syntax tree of the $j$-th sentence in $T$ and $k$ is the size of the sentence window;
A biological entity is recognized by any of the following patterns: it begins with a digit and contains a letter; it begins with a lowercase letter and contains an uppercase letter, a digit, or a special symbol; it begins with an uppercase letter and contains a digit or a special symbol; or it begins with an uppercase letter, contains a lowercase letter, and also contains an uppercase letter or a special symbol;
4c) Determine whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes". If so, select from the candidate antecedent set X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying the criteria in the order head-word match, protein entity count greater than 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 4d);
4d) Determine whether the definite noun phrase anaphor Md is plural. If so, applying the criteria in the order head-word match, biological entity count greater than 1, protein entity count greater than 1, select from X the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 4e);
4e) Determine whether the head word of the definite noun phrase anaphor Md is "protein" or "gene". If so, select from X all candidates that contain a protein entity, obtaining a new candidate antecedent set Xs; then, applying the criteria in the order head-word match, protein entity count equal to 1, select from Xs the candidate nearest to Md as the antecedent Ad of Md; otherwise, go to step 4f);
4f) Applying the criteria in the order head-word match, biological entity count equal to 1, protein entity count equal to 1, select from X the candidate nearest to Md as the antecedent Ad of Md.
Step 5: from all coreference resolution results, filter out those whose antecedent contains no protein entity, completing the protein coreference resolution of biological text based on syntax trees and domain features. Each coreference result is represented as a pair: the relative pronoun anaphor Mr and its antecedent Ar, the personal pronoun anaphor Mp and its antecedent Ap, or the definite noun phrase anaphor Md and its antecedent Ad.
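The filter in step 5 can be sketched as follows, assuming each mention is represented by a character span `(start, end)` and that the pre-annotated protein entities are given as spans as well; this representation and the function names are our own illustration, not taken from the source.

```python
def contains_protein(span, protein_spans):
    """True if at least one annotated protein span lies inside the given span."""
    start, end = span
    return any(start <= p_start and p_end <= end for p_start, p_end in protein_spans)

def filter_links(links, protein_spans):
    """Keep only (anaphor, antecedent) pairs whose antecedent contains a protein entity."""
    return [(anaphor, antecedent) for anaphor, antecedent in links
            if contains_protein(antecedent, protein_spans)]

links = [
    ((30, 35), (0, 12)),    # antecedent covers the protein annotated at (4, 8)
    ((60, 62), (40, 50)),   # antecedent contains no annotated protein
]
protein_spans = [(4, 8)]
print(filter_links(links, protein_spans))   # prints [((30, 35), (0, 12))]
```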
The technical effect of the invention is further illustrated by the following experiment:
1. Experimental conditions:
The experiment uses the Coreference dataset of the biomedical natural language processing shared task BioNLP 2011; protein entities are annotated in the dataset in advance.
The experiment was run in Java on a Windows 7 system with an Intel Core(TM) i7-4720HQ CPU at 2.60 GHz and 8 GB of memory.
2. Experimental content and analysis of results:
The invention and an existing rule-based method were each applied to protein coreference resolution in biological text on the Coreference dataset, with the following results:
Method               Recall (%)   Precision (%)   F-score (%)
Rule-based method    50.4         62.7            55.9
The invention        60.2         63.8            62.0
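As a quick consistency check, the F-scores in the table can be recomputed from the reported recall and precision using the harmonic mean. Because the inputs are already rounded to one decimal place, the recomputed values are only expected to agree with the table to within about 0.1; the function name is ours.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    return 2 * recall * precision / (recall + precision)

reported = {
    "Rule-based method": (50.4, 62.7, 55.9),
    "The invention": (60.2, 63.8, 62.0),
}
for method, (recall, precision, f_reported) in reported.items():
    f_computed = f_score(recall, precision)
    print(f"{method}: computed {f_computed:.2f}, reported {f_reported}")
```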
In summary, the invention extracts the antecedents of relative pronoun and personal pronoun anaphors based on the syntax tree and the antecedents of definite noun phrase anaphors based on domain features. It can therefore exploit syntactic structure information that is unavailable when only the structure and attributes of phrases or words are considered, and extracts more protein coreference relations. By making full use of the syntactic properties and domain features of biomedical text, it obtains more correct protein coreference results than current rule-based methods while maintaining precision, improves recall, and achieves a better overall F-score, giving it an advantage over existing methods.
The invention addresses the technical problem of low F-score in existing rule-based protein coreference resolution methods for biological text. It preprocesses the original text; finds each relative pronoun in the syntax tree and takes the noun phrase nearest to it as its antecedent; finds each personal pronoun in the syntax tree and searches for its antecedent in the coordinate phrase structure of that tree, in the clause subtree, or in the syntax tree of the previous sentence; uses the syntax tree to obtain definite noun phrases and a candidate antecedent set, and selects the best antecedent from the candidates based on biological domain features such as singular/plural number, entity type, and entity count; and filters out non-protein coreference results. Protein coreference for relative pronouns and personal pronouns is handled with the syntax tree, and protein coreference for definite noun phrases is handled with rules based on biological domain features, so different anaphor types are resolved with different procedures that make full use of the syntactic properties and domain features specific to biomedical text. The experimental results show that the invention extracts protein coreference relations effectively, achieves protein coreference resolution in biological text, and attains a higher overall F-score.

Claims (7)

1. A protein reference resolution method for biological text based on syntax trees and domain features, characterized in that it comprises the following steps:
(1) Sentence splitting, tokenization, part-of-speech tagging, lemmatization, and syntactic parsing are performed on the original text to obtain the syntax tree T_i of each sentence, i = 1, 2, ..., N; the syntax trees of all sentences form the syntax tree set T = {T_1, T_2, ..., T_N}, where i is the index of a sentence and N is the total number of sentences;
(2) Syntax tree T_i is searched for a relative pronoun node and the noun phrase node nearest to that relative pronoun node, obtaining the relative pronoun anaphor Mr and its antecedent Ar;
(3) Syntax tree T_i is searched for a personal pronoun node, obtaining the personal pronoun anaphor Mp; the antecedent Ap of Mp is then searched for in the syntax subtree of the coordinate phrase structure containing the personal pronoun node, in the clause syntax subtree, or in the syntax tree of the previous sentence;
(4) Syntax tree T_i is searched for definite noun phrase nodes that contain a keyword of a specific biological entity type, obtaining the definite noun phrase anaphor Md; a subset of the syntax tree set T, made up of the syntax trees T_j inside a sentence window of size k, is searched for all noun phrase nodes that contain a biological entity or a protein entity, obtaining the candidate antecedent set X; the antecedent Ad of Md is selected from X based on biological domain features; here T_j is the syntax tree of the j-th sentence in T and k is the size of the sentence window;
(5) From all reference resolution results obtained in steps (2) to (4), the resolutions whose antecedent contains no protein entity are filtered out, completing the protein reference resolution for biological text based on syntax trees and domain features.
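The filtering in step (5) can be illustrated with a minimal Python sketch. The `Link` record, the `filter_protein_links` function, and the token-level membership test are hypothetical names and assumptions made for illustration; the claim itself does not say how protein entities are recognized inside an antecedent.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Link:
    """One resolution result: an anaphor and its antecedent (hypothetical record)."""
    anaphor: str
    antecedent: str
    kind: str  # "relative", "personal", or "definite_np"


def filter_protein_links(links: List[Link], protein_names: Set[str]) -> List[Link]:
    """Keep only links whose antecedent contains a known protein entity.

    Assumption: protein entities are supplied as a set of token strings,
    and an antecedent "contains" one if any of its tokens is in that set.
    """
    def has_protein(text: str) -> bool:
        return any(token in protein_names for token in text.split())

    return [link for link in links if has_protein(link.antecedent)]


links = [
    Link("which", "the IL-2 receptor", "relative"),
    Link("it", "the cell", "personal"),
]
kept = filter_protein_links(links, {"IL-2"})
```

Here `kept` retains only the first link, because "the cell" contains no token from the supplied protein set.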
2. The protein reference resolution method for biological text based on syntax trees and domain features according to claim 1, characterized in that the search in step (2) of syntax tree T_i for a relative pronoun node and the noun phrase node nearest to that relative pronoun node is implemented by the following steps:
201. Syntax tree T_i is searched for a node labeled "WDT" or "WP", obtaining the relative pronoun node Nr and the relative pronoun anaphor Mr; syntax tree T_i is searched for all nodes labeled "NP", obtaining the candidate antecedent set Z; here WDT denotes a wh-determiner, WP denotes a wh-pronoun, and NP denotes a noun phrase;
202. Syntax tree T_i is searched for the nodes at which the candidate antecedents of set Z are located, obtaining the candidate antecedent node set Nz;
203. The syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr is extracted;
204. The shortest syntax tree path is selected from all the paths obtained in step 203, and the noun phrase node on which this shortest path ends is taken as the nearest noun phrase node.
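Steps 201 to 204 can be sketched in Python as a shortest-path search over a constituency tree. The `Node` class and all function names below are hypothetical helpers, and path length is assumed to be the number of edges through the lowest common ancestor. The `skip_dominating` option is also an assumption that is not stated in the claim: it excludes NP nodes that dominate the relative pronoun itself, since in a Penn Treebank style parse the NP enclosing the whole relative clause would otherwise always be closest.

```python
class Node:
    """Constituency-tree node with a parent link (hypothetical helper)."""

    def __init__(self, label, *children):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def walk(node):
    """Yield every node of the subtree in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_length(a, b):
    """Count the edges on the tree path between nodes a and b."""
    depth_from_a = {id(n): d for d, n in enumerate(ancestors(a))}
    for depth_from_b, n in enumerate(ancestors(b)):
        if id(n) in depth_from_a:  # n is the lowest common ancestor
            return depth_from_a[id(n)] + depth_from_b
    raise ValueError("nodes belong to different trees")


def nearest_np(root, skip_dominating=True):
    """Return (relative pronoun node, nearest NP node), or None if either is missing."""
    nodes = list(walk(root))
    rel = next((n for n in nodes if n.label in ("WDT", "WP")), None)  # step 201
    if rel is None:
        return None
    above = {id(n) for n in ancestors(rel)}
    candidates = [n for n in nodes if n.label == "NP"
                  and not (skip_dominating and id(n) in above)]  # step 202
    if not candidates:
        return None
    # Steps 203 and 204: measure every path and keep the shortest one.
    return rel, min(candidates, key=lambda n: path_length(n, rel))


# "the protein which binds DNA", words omitted, labels only.
np_head = Node("NP", Node("DT"), Node("NN"))
np_object = Node("NP", Node("NN"))
wdt = Node("WDT")
tree = Node("NP", np_head,
            Node("SBAR", Node("WHNP", wdt),
                 Node("S", Node("VP", Node("VBZ"), np_object))))
found_rel, found_np = nearest_np(tree)
```

In this example the path from `wdt` to `np_head` has 4 edges and the path to `np_object` has 5, so `np_head` ("the protein") is selected.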
3. The protein reference resolution method for biological text based on syntax trees and domain features according to claim 1, characterized in that in step (3) the antecedent Ap is obtained as follows:
301. Starting from the node of the personal pronoun anaphor Mp, syntax tree T_i is traversed bottom-up in search of a node Nc that contains a coordinate phrase structure. If node Nc exists, the syntax subtree STc rooted at Nc is extracted from T_i, and STc is searched for the noun phrase node farthest from the node of Mp, obtaining the antecedent Ap of Mp; otherwise, step 302 is executed;
302. Starting from the node of the personal pronoun anaphor Mp, syntax tree T_i is traversed bottom-up to find the clause node Ns; the syntax subtree STs rooted at Ns is extracted, and STs is searched for the noun phrase node farthest from the node of Mp. If such a noun phrase node exists, the antecedent Ap of Mp is obtained; otherwise, step 303 is executed;
303. Syntax tree T_{i-1} is selected from the syntax tree set T and traversed bottom-up starting from its last leaf node to find the clause node Nt; the syntax subtree STt rooted at Nt is extracted, and STt is searched for all noun phrase nodes that agree in number (singular or plural) with the personal pronoun anaphor Mp, obtaining the candidate antecedent set Y; the candidate in Y farthest from Mp is selected, obtaining the antecedent Ap of Mp.
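The order in which steps 301 to 303 fall back on one another can be sketched as follows. This sketch assumes the tree work has already been done upstream: each argument is a list of noun phrase candidates already extracted from the relevant subtree, with a precomputed distance to the pronoun. The `Candidate` record, the function names, and the distance metric are hypothetical; the claim does not say how distance is measured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A noun phrase candidate (hypothetical record)."""
    text: str
    distance: int  # distance from the pronoun node; the metric is an assumption
    plural: bool


def farthest(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate farthest from the pronoun, or None if the list is empty."""
    return max(candidates, key=lambda c: c.distance, default=None)


def resolve_personal_pronoun(
    pronoun_plural: bool,
    coordinate: Optional[List[Candidate]],  # NPs in STc, or None if no node Nc exists
    clause: List[Candidate],                # NPs in STs
    previous: List[Candidate],              # NPs in STt of the previous sentence
) -> Optional[Candidate]:
    # Step 301: a coordinate-structure node exists, so search its subtree.
    if coordinate is not None:
        return farthest(coordinate)
    # Step 302: otherwise search the enclosing clause.
    best = farthest(clause)
    if best is not None:
        return best
    # Step 303: otherwise use the previous sentence, keeping only
    # candidates that agree in number with the pronoun.
    agreeing = [c for c in previous if c.plural == pronoun_plural]
    return farthest(agreeing)


previous_sentence = [
    Candidate("TRAF2", 9, False),
    Candidate("both kinases", 6, True),
]
from_clause = resolve_personal_pronoun(
    False, None, [Candidate("the complex", 2, False), Candidate("p65", 5, False)], [])
from_previous = resolve_personal_pronoun(True, None, [], previous_sentence)
```

With no coordinate node and two clause candidates, the farther one ("p65") is returned; with an empty clause list and a plural pronoun, the only plural candidate of the previous sentence ("both kinases") is returned.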
4. The protein reference resolution method for biological text based on syntax trees and domain features according to claim 1, characterized in that in step (3) the coordinate phrase structure refers to a coordinate noun phrase, a coordinate verb phrase, or a coordinate clause structure.
5. The protein reference resolution method for biological text based on syntax trees and domain features according to claim 1, characterized in that in step (4) the antecedent Ad is obtained as follows:
401. It is judged whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes". If so, all candidates that contain a protein entity are selected from the candidate antecedent set X, giving a new candidate antecedent set Xs; from Xs, applying the criteria in the order head word match, then more than one protein entity contained, the candidate nearest to Md is selected, obtaining the antecedent Ad of Md; otherwise, step 402 is executed;
402. It is judged whether the definite noun phrase anaphor Md is plural. If so, from the candidate antecedent set X, applying the criteria in the order head word match, then more than one biological entity contained, then more than one protein entity contained, the candidate nearest to Md is selected, obtaining the antecedent Ad of Md; otherwise, step 403 is executed;
403. It is judged whether the head word of the definite noun phrase anaphor Md is "protein" or "gene". If so, all candidates that contain a protein entity are selected from the candidate antecedent set X, giving a new candidate antecedent set Xs; from Xs, applying the criteria in the order head word match, then exactly one protein entity contained, the candidate nearest to Md is selected, obtaining the antecedent Ad of Md; otherwise, step 404 is executed;
404. From the candidate antecedent set X, applying the criteria in the order head word match, then exactly one biological entity contained, then exactly one protein entity contained, the candidate nearest to Md is selected, obtaining the antecedent Ad of Md.
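The rule cascade of steps 401 to 404 can be sketched in Python under one reading of the claim: each step lists its criteria in priority order, the first criterion that matches any candidate decides, and among the matching candidates the nearest one wins. That reading, the `NounPhrase` record, case-insensitive string equality as the head word match, and every function name below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class NounPhrase:
    """A candidate antecedent from set X (hypothetical record)."""
    text: str
    head: str       # head word of the phrase
    distance: int   # distance to the anaphor Md; the metric is an assumption
    n_bio: int      # number of biological entities contained
    n_protein: int  # number of protein entities contained


Criterion = Callable[[NounPhrase], bool]


def pick(candidates: List[NounPhrase], criteria: List[Criterion]) -> Optional[NounPhrase]:
    """Apply criteria in priority order; return the nearest match of the first that hits."""
    for criterion in criteria:
        matched = [c for c in candidates if criterion(c)]
        if matched:
            return min(matched, key=lambda c: c.distance)
    return None


def resolve_definite_np(head: str, plural: bool,
                        candidates: List[NounPhrase]) -> Optional[NounPhrase]:
    head = head.lower()
    same_head: Criterion = lambda c: c.head.lower() == head
    with_protein = [c for c in candidates if c.n_protein > 0]  # set Xs
    if head in ("proteins", "genes"):   # step 401
        return pick(with_protein, [same_head, lambda c: c.n_protein > 1])
    if plural:                          # step 402
        return pick(candidates, [same_head, lambda c: c.n_bio > 1,
                                 lambda c: c.n_protein > 1])
    if head in ("protein", "gene"):     # step 403
        return pick(with_protein, [same_head, lambda c: c.n_protein == 1])
    return pick(candidates, [same_head, lambda c: c.n_bio == 1,   # step 404
                             lambda c: c.n_protein == 1])


window = [
    NounPhrase("IL-2 and IL-4", "IL-4", 7, 2, 2),
    NounPhrase("p53", "p53", 4, 1, 1),
    NounPhrase("the cell line", "line", 1, 0, 0),
]
plural_choice = resolve_definite_np("proteins", True, window)
singular_choice = resolve_definite_np("protein", False, window)
```

For the anaphor "these proteins" the sketch selects "IL-2 and IL-4" (step 401, more than one protein entity); for "the protein" it selects "p53" (step 403, exactly one protein entity).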
6. The protein reference resolution method for biological text based on syntax trees and domain features according to claim 1, characterized in that the keywords of specific biological entity types in step (4) include "protein", "gene", "factor", "element", "receptor", "complex", and "construct".
7. The protein reference resolution method for biological text based on syntax trees and domain features according to claim 1, characterized in that a biological entity in step (4) is recognized by any of the following patterns: it begins with a digit and contains a letter; it begins with a lowercase letter and contains an uppercase letter, a digit, or a special symbol; it begins with an uppercase letter and contains a digit or a special symbol; or it begins with an uppercase letter, contains a lowercase letter, and contains an uppercase letter or a special symbol.
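The four surface patterns of claim 7 can be expressed as a small Python predicate. The function name is hypothetical, and two points are assumptions: "contains" is taken to refer to the characters after the first one, and "special symbol" is taken to mean any character that is neither a letter nor a digit.

```python
def looks_like_bio_entity(token: str) -> bool:
    """Check a token against the four surface patterns of claim 7."""
    if not token:
        return False
    first, rest = token[0], token[1:]
    has_letter = any(ch.isalpha() for ch in rest)
    has_upper = any(ch.isupper() for ch in rest)
    has_lower = any(ch.islower() for ch in rest)
    has_digit = any(ch.isdigit() for ch in rest)
    has_special = any(not ch.isalnum() for ch in rest)  # assumed meaning of "special symbol"

    if first.isdigit():
        # Pattern 1: begins with a digit and contains a letter.
        return has_letter
    if first.islower():
        # Pattern 2: begins with a lowercase letter and contains
        # an uppercase letter, a digit, or a special symbol.
        return has_upper or has_digit or has_special
    if first.isupper():
        # Pattern 3: contains a digit or a special symbol.
        # Pattern 4: contains a lowercase letter plus a further uppercase
        # letter (the special-symbol case is already covered by pattern 3).
        return has_digit or has_special or (has_lower and has_upper)
    return False


examples = ["p53", "IL-2", "4E-BP1", "mRNA", "JunB", "protein", "Protein", "2016"]
flags = [looks_like_bio_entity(tok) for tok in examples]
```

Under these assumptions the first five example tokens match a pattern and the last three do not. An all-uppercase token with no digit or symbol, such as "DNA", also falls outside all four patterns.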
CN201610872780.8A 2016-09-30 2016-09-30 Biological Text protein reference resolution method based on syntax tree and domain features Active CN106484676B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610872780.8A CN106484676B (en) 2016-09-30 2016-09-30 Biological Text protein reference resolution method based on syntax tree and domain features

Publications (2)

Publication Number Publication Date
CN106484676A true CN106484676A (en) 2017-03-08
CN106484676B CN106484676B (en) 2019-04-12

Family

ID=58269087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610872780.8A Active CN106484676B (en) 2016-09-30 2016-09-30 Biological Text protein reference resolution method based on syntax tree and domain features

Country Status (1)

Country Link
CN (1) CN106484676B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007207113A (en) * 2006-02-03 2007-08-16 Hitachi Software Eng Co Ltd Genealogical tree display system
CN102339362A (en) * 2011-11-08 2012-02-01 苏州大学 Method for extracting protein interaction relationship
CN105138864A (en) * 2015-09-24 2015-12-09 大连理工大学 Protein interaction relationship data base construction method based on biomedical science literature

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
REBHOLZ-SCHUHMANN et al.: "Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources", 《JOURNAL OF BIOMEDICAL SEMANTICS》 *
TIKK, DOMONKOS et al.: "A Comprehensive Benchmark of Kernel Methods to Extract Protein-Protein Interactions from Literature", 《PLOS COMPUTATIONAL BIOLOGY》 *
Liu Nian (刘念) et al.: "Research on protein-protein interaction extraction based on tree kernels", 《Journal of Huazhong University of Science and Technology (Natural Science Edition)》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220300A (en) * 2017-05-05 2017-09-29 平安科技(深圳)有限公司 Information mining method, electronic installation and readable storage medium storing program for executing
CN108446268A (en) * 2018-02-11 2018-08-24 青海师范大学 Tibetan language personal pronoun reference resolution system
CN109885841A (en) * 2019-03-20 2019-06-14 苏州大学 Reference resolution method based on node representation
CN109885841B (en) * 2019-03-20 2023-07-11 苏州大学 Reference digestion method based on node representation method
CN110674630A (en) * 2019-09-24 2020-01-10 北京明略软件系统有限公司 Reference resolution method and device, electronic equipment and storage medium
CN110674630B (en) * 2019-09-24 2023-03-21 北京明略软件系统有限公司 Reference resolution method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106484676B (en) 2019-04-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant