CN106484676A - Protein coreference resolution method for biological text based on syntax trees and domain features - Google Patents
- Publication number
- CN106484676A (application CN201610872780.8A)
- Authority
- CN
- China
- Prior art keywords
- syntax tree
- node
- anaphor
- antecedent
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a protein coreference resolution method for biological text based on syntax trees and domain features, intended to solve the problem of low F-scores in existing rule-based methods. Its steps are: preprocessing the original text; searching the syntax tree for each relative pronoun and the noun phrase nearest to it, which is taken as the antecedent of that relative pronoun; searching the syntax tree for each personal pronoun, and searching for its antecedent in the coordinate phrase structure of the syntax tree, in the clause syntax tree, or in the syntax tree of the previous sentence; using the syntax trees to obtain definite noun phrases and a candidate antecedent set, and selecting the best antecedent from the candidate set based on biological domain features such as singular/plural number, entity type and entity count; and filtering out non-protein coreference results. The invention achieves protein coreference resolution in biological text and obtains a higher F-score.
Description
【Technical field】
The invention belongs to the field of text mining technology, and specifically relates to a protein coreference resolution method for biological text based on syntax trees and domain features.
【Background art】
With the rapid development of computer and Internet technology, large amounts of information and literature now exist in digital form. The biomedical literature is vast and growing exponentially, and researchers who rely on traditional manual reading find it hard to extract valuable information efficiently from so many documents, so automated text information extraction has become important work. As one task in biomedical text information extraction, protein coreference resolution finds the words and phrases in biological text that refer to the same protein entity. An anaphor is the referring expression, for example a common pronoun; an antecedent is the expression being referred to, which carries actual content, such as a protein or another biological entity. As a supporting technique, protein coreference resolution plays an important role in many biological text mining tasks and can effectively improve the performance of biological information extraction systems.
The biological natural language processing task BioNLP Shared Task 2011 provides a standard protein coreference corpus for the biomedical domain, named Coreference. The corpus is drawn from MEDLINE abstracts and mainly addresses coreference resolution of protein entities in biomedical text; the protein entities in the corpus are annotated in advance.
General-domain coreference resolution is more mature than biomedical coreference resolution, but because of the peculiarities of domain-specific corpora, transplanting general-domain methods directly into the biomedical domain does not give good results, so methods developed specifically for protein coreference resolution in biological text are needed. Current approaches to protein coreference resolution in biological text fall into three groups: supervised learning methods, rule-based methods and hybrid methods.
Supervised learning methods extract features from a protein coreference training set to learn a model, and then use that model to resolve protein coreference relations in new data. For example, the paper "The Taming of Reconcile as a Biomedical Coreference Resolver", published by Youngjun Kim et al. at the BioNLP task in 2011, discloses a supervised learning method based on features and a classifier model: it extracts lexical features, syntactic features and other features, and then resolves protein coreference with a classifier.
Rule-based methods resolve protein coreference in biological text with a series of manually written rules. For example, the paper "Boosting automatic event extraction from the literature using domain adaptation and coreference resolution", published by Makoto Miwa et al. in 2012 in the journal Bioinformatics, volume 28, issue 13, discloses a rule-matching protein coreference resolution method: an anaphor candidate detector extracts noun phrases, pronouns and so on as anaphor candidates; an antecedent candidate detector extracts noun phrases as antecedent candidates; and a coreference link detector applies rules in the order exact structure match, strict head-word match, relaxed head-word match, number match and so on, to select the highest-priority antecedent for each anaphor.
Hybrid methods use supervised learning and rules together, handling different types of protein coreference with different methods. For example, the paper "Anaphora Resolution in Biomedical Literature: A Hybrid Approach", published by Jennifer D'Souza and Vincent Ng at the ACM-BCB conference in 2012, discloses a protein coreference resolution method based on a classifier and rule matching: a learned classifier handles protein coreference for pronouns, and rules handle protein coreference for noun phrases.
The performance of a protein coreference resolution method is usually evaluated by recall, precision and F-score. Recall is the ratio of correctly extracted protein coreference relations to all protein coreference relations in the data; precision is the ratio of correctly extracted protein coreference relations to all extracted results; and the F-score is the harmonic mean of recall and precision, used as the final overall measure. Among the methods for protein coreference resolution in biological text, supervised learning methods must be trained on a large manually annotated corpus to give good results, but building such a corpus costs a great deal of labor, resources and time, and the data currently available cannot meet this requirement. With limited training data, the model learned by a supervised method is poor and its recall is very low, which makes the F-score very low. Rule-based methods generally borrow only some rules from general-domain coreference resolution; these rules consider only the structure and attributes of phrases or words and ignore the distinctive syntactic properties and domain features of biological text, so they cannot effectively extract correct protein coreference relations, and their low recall leads to a low F-score. Hybrid methods still suffer from the unsatisfactory models that limited training data produces in supervised learning, with low recall on pronoun-type protein coreference; although overall recall improves, it remains insufficient, so the F-score is still unsatisfactory. Current protein coreference resolution results therefore need further improvement so that they can serve more effectively as preprocessing for other biological text mining tasks, such as biological event extraction.
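The three measures defined above can be illustrated with a minimal Python sketch. The function name and the example counts are hypothetical and are not taken from the patent; only the definitions (ratios and harmonic mean) come from the text.

```python
def recall_precision_f(num_correct, num_gold, num_extracted):
    """Return (recall, precision, f_score) as fractions between 0 and 1."""
    # Recall: correct links divided by all links annotated in the data.
    recall = num_correct / num_gold if num_gold else 0.0
    # Precision: correct links divided by all links the system extracted.
    precision = num_correct / num_extracted if num_extracted else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    # F-score: harmonic mean of recall and precision.
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# Hypothetical counts: 60 correct links, 100 gold links, 80 extracted links.
recall, precision, f_score = recall_precision_f(60, 100, 80)
print(f"recall={recall:.3f} precision={precision:.3f} F={f_score:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a method with low recall has a low F-score even when its precision is acceptable, which is the weakness the text attributes to the earlier methods.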
【Summary of the invention】
The purpose of the invention is to overcome the shortcomings of the prior art described above by proposing a protein coreference resolution method for biological text based on syntax trees and domain features, which can extract correct protein coreference relations and thereby improve the overall F-score.
To achieve this purpose, the technical solution adopted by the invention comprises the following steps:
(1) Perform sentence splitting, tokenization, part-of-speech tagging, lemmatization and syntactic parsing on the original text to obtain the syntax tree T_i, i = 1, 2, ..., N, of each sentence; the syntax trees of all sentences constitute the syntax tree set T = {T_1, T_2, ..., T_N}, where i is the index of the sentence and N is the number of sentences;
(2) Search the syntax tree T_i for each relative pronoun node and the noun phrase node nearest to it, obtaining the relative pronoun anaphor Mr and its antecedent Ar;
(3) Search the syntax tree T_i for each personal pronoun node, obtaining the personal pronoun anaphor Mp; then search for the antecedent Ap of Mp in the coordinate phrase structure of the syntax tree containing that node, in its clause syntax tree, or in the syntax tree of the previous sentence;
(4) Search the syntax tree T_i for definite noun phrase nodes that contain a specific biological entity type keyword, obtaining the definite noun phrase anaphor Md; from a subset {T_j} of the syntax tree set T determined by a sentence window of size k, find all noun phrase nodes that contain a biological entity or a protein entity, obtaining the candidate antecedent set X; and select the antecedent Ad of Md from X based on biological domain features; here T_j is the syntax tree of the j-th sentence in T and k is the size of the sentence window;
(5) From all coreference resolution results obtained in steps (2) to (4), filter out those whose antecedent does not contain a protein entity, completing the protein coreference resolution for biological text based on syntax trees and domain features.
Further, searching the syntax tree T_i in step (2) for the relative pronoun node and the noun phrase node nearest to it is implemented as follows:
201. Search T_i for nodes labeled "WDT" or "WP" to obtain the relative pronoun node Nr and the relative pronoun anaphor Mr, and search T_i for all nodes labeled "NP" to obtain the candidate antecedent set Z; here WDT denotes a wh-determiner, WP denotes a wh-pronoun, and NP denotes a noun phrase;
202. Search T_i for the nodes of all candidate antecedents in Z to obtain the candidate antecedent node set Nz;
203. Extract the syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr;
204. Select the shortest of the syntax tree paths obtained in step 203, and take the noun phrase node on that shortest path as the nearest noun phrase node.
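Steps 201 to 204 can be sketched in Python as below. This is an illustration under stated assumptions, not the patent's implementation: the `Node` class, the helper names and the example tree are hypothetical; path length is measured as the number of edges between two nodes; and NP nodes that contain the relative pronoun itself are skipped, which the text does not state.

```python
class Node:
    """A syntax tree node; string arguments become the node's word."""

    def __init__(self, label, *children):
        self.label, self.word, self.parent, self.children = label, None, None, []
        for child in children:
            if isinstance(child, str):
                self.word = child
            else:
                child.parent = self
                self.children.append(child)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

    def text(self):
        return " ".join(n.word for n in self.walk() if n.word)


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_length(a, b):
    """Number of edges on the tree path between a and b (step 203)."""
    depth_from_a = {id(n): depth for depth, n in enumerate(ancestors(a))}
    for depth_from_b, n in enumerate(ancestors(b)):
        if id(n) in depth_from_a:
            return depth_from_a[id(n)] + depth_from_b
    raise ValueError("nodes belong to different trees")


def nearest_np(tree):
    """Return (Nr, nearest NP node) for the first WDT/WP node in the tree."""
    pronoun = next(n for n in tree.walk() if n.label in ("WDT", "WP"))
    dominating = ancestors(pronoun)
    # Assumption of this sketch: skip NPs that contain the pronoun itself.
    candidates = [n for n in tree.walk()
                  if n.label == "NP" and n not in dominating]
    # Step 204: the candidate with the shortest tree path wins.
    return pronoun, min(candidates, key=lambda n: path_length(n, pronoun))


# Hypothetical parse of "MDM2 which binds p53".
tree = Node("NP",
            Node("NP", Node("NNP", "MDM2")),
            Node("SBAR",
                 Node("WHNP", Node("WDT", "which")),
                 Node("S", Node("VP", Node("VBZ", "binds"),
                                Node("NP", Node("NNP", "p53"))))))
pronoun, antecedent = nearest_np(tree)
print(pronoun.text(), "->", antecedent.text())
```

In this example the NP "MDM2" lies four edges from the WDT node and the NP "p53" lies five edges away, so "MDM2" is selected.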
Further, the antecedent Ap in step (3) is obtained as follows:
301. Starting from the node of the personal pronoun anaphor Mp, traverse the syntax tree T_i bottom-up to look for a node Nc that contains a coordinate phrase structure, and judge whether Nc exists; if so, extract from T_i the syntax subtree STc rooted at Nc, and find in STc the noun phrase node farthest from the node of Mp, obtaining the antecedent Ap of Mp; otherwise, go to step 302;
302. Starting from the node of the personal pronoun anaphor Mp, traverse the syntax tree T_i bottom-up to find the clause node Ns; extract the syntax subtree STs rooted at Ns, look in STs for the noun phrase node farthest from the node of Mp, and judge whether such a noun phrase node exists; if so, the antecedent Ap of Mp is obtained; otherwise, go to step 303;
303. Select the syntax tree T_{i-1} from the syntax tree set T; starting from the last leaf node of T_{i-1}, traverse bottom-up to find the clause node Nt; extract the syntax subtree STt rooted at Nt, and find in STt all noun phrase nodes that agree in number (singular or plural) with Mp, obtaining the candidate antecedent set Y; from Y, select the candidate antecedent farthest from Mp, obtaining the antecedent Ap of Mp.
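The within-sentence part of this cascade (steps 301 and 302) can be sketched as follows. Several points are assumptions of the sketch rather than statements of the patent: a coordinate structure is detected as any node with a "CC" child, a clause is a node labeled "S", distance is the number of tree edges, and NP nodes that contain the pronoun itself are not treated as candidates. All class and function names are hypothetical.

```python
class Node:
    """A syntax tree node; string arguments become the node's word."""

    def __init__(self, label, *children):
        self.label, self.word, self.parent, self.children = label, None, None, []
        for child in children:
            if isinstance(child, str):
                self.word = child
            else:
                child.parent = self
                self.children.append(child)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()

    def text(self):
        return " ".join(n.word for n in self.walk() if n.word)


def ancestors(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def path_length(a, b):
    depth_from_a = {id(n): d for d, n in enumerate(ancestors(a))}
    for depth_from_b, n in enumerate(ancestors(b)):
        if id(n) in depth_from_a:
            return depth_from_a[id(n)] + depth_from_b
    raise ValueError("nodes belong to different trees")


def is_coordination(node):
    # Assumption: a coordinate structure is any node with a "CC" child.
    return any(child.label == "CC" for child in node.children)


def is_clause(node):
    return node.label == "S"


def resolve_in_sentence(pronoun):
    """Try the coordinate structure first (301), then the clause (302)."""
    above = ancestors(pronoun)[1:]  # bottom-up: parent, grandparent, ...
    for test in (is_coordination, is_clause):
        scope = next((n for n in above if test(n)), None)
        if scope is None:
            continue
        # Assumption: NPs that contain the pronoun itself are not candidates.
        nps = [n for n in scope.walk() if n.label == "NP" and n not in above]
        if nps:
            return max(nps, key=lambda n: path_length(n, pronoun))
    return None  # the caller would fall back to the previous sentence (303)


# Hypothetical parse of "p53 accumulates and it activates p21".
it = Node("PRP", "it")
tree = Node("S",
            Node("S", Node("NP", Node("NNP", "p53")),
                 Node("VP", Node("VBZ", "accumulates"))),
            Node("CC", "and"),
            Node("S", Node("NP", it),
                 Node("VP", Node("VBZ", "activates"),
                      Node("NP", Node("NNP", "p21")))))
antecedent = resolve_in_sentence(it)
print(antecedent.text())
```

Returning `None` when neither scope yields a candidate leaves room for the previous-sentence fallback of step 303, which this sketch does not implement.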
Further, the coordinate phrase structure in step (3) refers to a coordinate noun phrase, coordinate verb phrase or coordinate clause structure.
Further, the antecedent Ad in step (4) is obtained as follows:
401. Judge whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes"; if so, select from the candidate antecedent set X all candidate antecedents that contain a protein entity to obtain a new candidate antecedent set Xs, and from Xs, applying in order the rules head-word match and protein entity count greater than 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, go to step 402;
402. Judge whether the definite noun phrase anaphor Md is plural; if so, from the candidate antecedent set X, applying in order the rules head-word match, biological entity count greater than 1 and protein entity count greater than 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, go to step 403;
403. Judge whether the head word of the definite noun phrase anaphor Md is "protein" or "gene"; if so, select from the candidate antecedent set X all candidate antecedents that contain a protein entity to obtain a new candidate antecedent set Xs, and from Xs, applying in order the rules head-word match and protein entity count equal to 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, go to step 404;
404. From the candidate antecedent set X, applying in order the rules head-word match, biological entity count equal to 1 and protein entity count equal to 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md.
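One way to read steps 401 to 404 is as an ordered rule cascade: the rules are tried in the stated order, and the first rule that matches any candidate decides, with the nearest matching candidate chosen. The Python sketch below follows that reading, which is an interpretation rather than something the text spells out. The `Candidate` fields, the exact-string head-word comparison and the example data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    head: str        # head word of the noun phrase
    n_bio: int       # number of biological entities in the phrase
    n_protein: int   # number of protein entities in the phrase
    distance: int    # distance from the anaphor (smaller means nearer)


def pick(candidates, rules):
    """Apply rules in order; the first rule with matches picks the nearest one."""
    for rule in rules:
        matching = [c for c in candidates if rule(c)]
        if matching:
            return min(matching, key=lambda c: c.distance)
    return None


def resolve_definite_np(head, is_plural, candidates):
    def head_match(c):
        # Assumption: head words are compared as exact strings.
        return c.head == head

    with_protein = [c for c in candidates if c.n_protein > 0]  # the set Xs
    if head in ("proteins", "genes"):                          # step 401
        return pick(with_protein, [head_match, lambda c: c.n_protein > 1])
    if is_plural:                                              # step 402
        return pick(candidates, [head_match,
                                 lambda c: c.n_bio > 1,
                                 lambda c: c.n_protein > 1])
    if head in ("protein", "gene"):                            # step 403
        return pick(with_protein, [head_match, lambda c: c.n_protein == 1])
    return pick(candidates, [head_match,                       # step 404
                             lambda c: c.n_bio == 1,
                             lambda c: c.n_protein == 1])


# Hypothetical candidate antecedent set X.
candidates = [
    Candidate("IL-2 and IL-4", "IL-4", n_bio=2, n_protein=2, distance=12),
    Candidate("the kappa B site", "site", n_bio=1, n_protein=0, distance=5),
    Candidate("STAT3", "STAT3", n_bio=1, n_protein=1, distance=3),
]
print(resolve_definite_np("proteins", True, candidates).text)
print(resolve_definite_np("protein", False, candidates).text)
```

Keeping the rules as a list makes the priority order explicit and easy to change if the cascade is meant to be read differently.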
Further, the specific biological entity type keywords in step (4) include "protein", "gene", "factor", "element", "receptor", "complex" and "construct".
Further, a biological entity in step (4) is recognized as a token that: starts with a digit and contains a letter; or starts with a lowercase letter and contains an uppercase letter, a digit or a special symbol; or starts with an uppercase letter and contains a digit or a special symbol; or starts with an uppercase letter, contains a lowercase letter, and contains an uppercase letter or a special symbol.
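The four surface patterns above can be written as a small Python predicate. This is a sketch under assumptions: the function name is hypothetical, "contains" is taken to refer to the characters after the first one, and a "special symbol" is taken to be any non-alphanumeric character, since the text does not define it.

```python
def looks_like_bio_entity(token):
    """Return True if the token matches one of the four surface patterns."""
    if not token:
        return False
    first, rest = token[0], token[1:]
    has_letter = any(ch.isalpha() for ch in rest)
    has_upper = any(ch.isupper() for ch in rest)
    has_lower = any(ch.islower() for ch in rest)
    has_digit = any(ch.isdigit() for ch in rest)
    # Assumption: a "special symbol" is any non-alphanumeric character.
    has_special = any(not ch.isalnum() for ch in rest)

    if first.isdigit():
        return has_letter                                 # e.g. "4E-BP1"
    if first.islower():
        return has_upper or has_digit or has_special      # e.g. "p53", "mRNA"
    if first.isupper():
        # Patterns three and four both start with an uppercase letter.
        return has_digit or has_special or (has_lower and has_upper)
    return False


for token in ["p53", "IL-2", "CaM", "4E-BP1", "Protein", "binding", "2011"]:
    print(token, looks_like_bio_entity(token))
```

Under these rules an ordinary capitalized word such as "Protein" is rejected, while mixed-case or alphanumeric tokens such as "CaM" and "IL-2" are accepted.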
Compared with the prior art, the invention has the following advantages:
After preprocessing the original text, the invention extracts relative pronouns, personal pronouns and definite noun phrases from the syntax trees, determines the antecedents of relative pronoun and personal pronoun anaphors based on the syntax trees, determines the antecedents of definite noun phrase anaphors based on biological domain features, and finally filters out coreference results that involve no protein entity. Because the antecedents of relative pronoun and personal pronoun anaphors are extracted from the syntax tree, and those of definite noun phrase anaphors are extracted using domain features, the method can exploit syntactic structure information that is unavailable when only the structure and attributes of phrases or words are considered, and so effectively extracts more correct protein coreference relations. By making full use of the syntactic properties and domain features of biomedical text, the method obtains more correct protein coreference results than current rule-based methods while maintaining precision; recall improves and a better overall F-score is obtained, as the simulation results show. The invention can be used for coreference resolution of relative pronouns, personal pronouns and definite noun phrases that refer to protein entities in biomedical text.
【Brief description of the drawings】
Fig. 1 is a flow chart of the implementation of the invention;
Fig. 2 is a flow chart of selecting, from the candidate antecedent set Z, the candidate antecedent node nearest to the node of the relative pronoun anaphor Mr.
【Detailed description of the embodiments】
The invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, the invention comprises the following steps:
Step 1: preprocess the original text.
1a) Split the original text into sentences with the GENIA sentence splitter;
1b) Tokenize, part-of-speech tag and lemmatize the text with the Stanford CoreNLP toolkit;
1c) Parse each sentence with the Enju parser and convert the result to PTB format, obtaining the syntax tree T_i, i = 1, 2, ..., N, of each sentence; the syntax trees of all sentences constitute the syntax tree set T = {T_1, T_2, ..., T_N}, where i is the index of the sentence and N is the number of sentences.
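Once the parser output is in PTB bracketed format, each sentence can be loaded into a tree structure for the later steps. The Python sketch below shows one minimal way to do that; it does not call GENIA, CoreNLP or Enju, the function names are hypothetical, and it assumes every opening bracket is followed by a label (an unlabeled outer bracket is not handled).

```python
import re


def parse_ptb(text):
    """Parse a PTB bracketed string into nested (label, children) tuples.

    Leaves are plain strings; every other child is a (label, children) tuple.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse()


def find_label(tree, label):
    """Yield every subtree whose label equals `label`."""
    node_label, children = tree
    if node_label == label:
        yield tree
    for child in children:
        if isinstance(child, tuple):
            yield from find_label(child, label)


tree = parse_ptb("(S (NP (NNP p53)) (VP (VBZ binds) (NP (NNP MDM2))))")
noun_phrases = list(find_label(tree, "NP"))
print(len(noun_phrases))
```

A label lookup of this kind is the basic operation behind the later searches for "WDT", "WP", "NP", "S" and "DT" nodes.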
Step 2: find the antecedent Ar of each relative pronoun anaphor based on the syntax tree.
2a) Search the syntax tree T_i for nodes labeled "WDT" or "WP" to obtain the relative pronoun anaphor Mr, and search T_i for all nodes labeled "NP" to obtain the candidate antecedent set Z; WDT and WP are standard tags, where WDT denotes a wh-determiner, WP denotes a wh-pronoun, and NP denotes a noun phrase;
2b) From the candidate antecedent set Z, select the candidate antecedent node nearest to the node of the relative pronoun anaphor Mr, obtaining the antecedent Ar of Mr; the implementation steps are shown in Fig. 2:
2011. Search T_i for the node of the relative pronoun anaphor Mr to obtain the relative pronoun node Nr;
2012. Search T_i for the nodes of all candidate antecedents in Z to obtain the candidate antecedent node set Nz;
2013. Extract the syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr;
2014. Select the shortest of the syntax tree paths obtained in step 2013, and take the candidate antecedent node on that shortest path as the nearest candidate antecedent node, obtaining the antecedent Ar of the relative pronoun anaphor Mr.
Step 3: find the antecedent Ap of each personal pronoun anaphor based on the syntax tree.
3a) Search the syntax tree T_i for personal pronoun nodes consisting of "they", "them", "themselves", "their", "its", or an "it" that is used referentially, obtaining the personal pronoun anaphor Mp;
3b) Starting from the node of the personal pronoun anaphor Mp, traverse T_i bottom-up to look for a node Nc that contains a coordinate noun phrase, coordinate verb phrase or coordinate clause structure;
3c) Judge whether the search in step 3b) found a node Nc; if so, extract from T_i the syntax subtree STc rooted at Nc, find all nodes labeled "NP" in STc, and from these select the node farthest from the node of Mp, obtaining the antecedent Ap of Mp; otherwise, go to step 3d);
3d) Starting from the node of the personal pronoun anaphor Mp, traverse T_i bottom-up to find the clause node Ns labeled "S", and extract the syntax subtree STs rooted at Ns; find all nodes labeled "NP" in STs, select from them the noun phrase node farthest from the node of Mp, and judge whether such a noun phrase node exists; if so, the antecedent Ap of Mp is obtained; otherwise, go to step 3e);
3e) Select the syntax tree T_{i-1} from the syntax tree set T; starting from the last leaf node of T_{i-1}, traverse bottom-up to find the clause node Nt labeled "S", and extract the syntax subtree STt rooted at Nt; find all nodes labeled "NP" in STt, filter out the noun phrase nodes that do not agree in number (singular or plural) with Mp, and take the remaining noun phrase nodes as the candidate antecedent set Y; from Y, select the candidate antecedent farthest from Mp, obtaining the antecedent Ap of Mp;
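The number-agreement filter of step 3e) can be sketched in Python as follows. The text does not say how the number of a noun phrase is determined, so the test used here is an assumption: a phrase counts as plural if it contains a coordinating conjunction (tag "CC") or if its last noun carries a plural tag ("NNS" or "NNPS"). The pronoun lists follow step 3a); all function names and example phrases are hypothetical.

```python
PLURAL_PRONOUNS = {"they", "them", "themselves", "their"}
SINGULAR_PRONOUNS = {"it", "its"}


def pronoun_is_plural(pronoun):
    word = pronoun.lower()
    if word not in PLURAL_PRONOUNS | SINGULAR_PRONOUNS:
        raise ValueError(f"not a handled personal pronoun: {pronoun}")
    return word in PLURAL_PRONOUNS


def np_is_plural(tagged_np):
    """tagged_np is a list of (word, part-of-speech tag) pairs."""
    # Assumption: a coordinated phrase such as "A and B" counts as plural.
    if any(tag == "CC" for _, tag in tagged_np):
        return True
    # Assumption: otherwise the last noun decides the number.
    noun_tags = [tag for _, tag in tagged_np if tag.startswith("NN")]
    return bool(noun_tags) and noun_tags[-1] in ("NNS", "NNPS")


def agreeing_candidates(pronoun, candidates):
    """Keep only the noun phrases whose number matches the pronoun."""
    want_plural = pronoun_is_plural(pronoun)
    return [np for np in candidates if np_is_plural(np) == want_plural]


# Hypothetical noun phrases taken from the previous sentence.
candidates = [
    [("IL-2", "NN"), ("and", "CC"), ("IL-4", "NN")],
    [("the", "DT"), ("receptor", "NN")],
    [("these", "DT"), ("cytokines", "NNS")],
]
print(agreeing_candidates("they", candidates))
print(agreeing_candidates("it", candidates))
```

The surviving phrases correspond to the candidate antecedent set Y, from which the farthest candidate is then chosen.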
Step 4: find the antecedent Ad of each definite noun phrase anaphor based on biological domain features.
4a) Search the syntax tree T_i for all noun phrase nodes that have a child node labeled "DT", and from these select the noun phrase nodes that contain one of the specific biological entity type keywords "protein", "gene", "factor", "element", "receptor", "complex" or "construct", obtaining the definite noun phrase anaphor Md;
4b) From a subset of the syntax tree set T, such as {T_{i-2}, T_{i-1}, T_i}, find all nodes labeled "NP", filter out the noun phrase nodes that contain neither a biological entity nor a protein entity, and take the remaining noun phrase nodes as the candidate antecedent set X, where T_j is the syntax tree of the j-th sentence in T and k is the size of the sentence window;
A biological entity is recognized as a token that: starts with a digit and contains a letter; or starts with a lowercase letter and contains an uppercase letter, a digit or a special symbol; or starts with an uppercase letter and contains a digit or a special symbol; or starts with an uppercase letter, contains a lowercase letter, and contains an uppercase letter or a special symbol;
4c) Judge whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes"; if so, select from the candidate antecedent set X all candidate antecedents that contain a protein entity to obtain a new candidate antecedent set Xs, and from Xs, applying in order the rules head-word match and protein entity count greater than 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, go to step 4d);
4d) Judge whether the definite noun phrase anaphor Md is plural; if so, from the candidate antecedent set X, applying in order the rules head-word match, biological entity count greater than 1 and protein entity count greater than 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, go to step 4e);
4e) Judge whether the head word of the definite noun phrase anaphor Md is "protein" or "gene"; if so, select from the candidate antecedent set X all candidate antecedents that contain a protein entity to obtain a new candidate antecedent set Xs, and from Xs, applying in order the rules head-word match and protein entity count equal to 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, go to step 4f);
4f) From the candidate antecedent set X, applying in order the rules head-word match, biological entity count equal to 1 and protein entity count equal to 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md;
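The sentence window used to build the candidate antecedent set X in step 4b) can be sketched in Python as below. The exact definition of the window is not recoverable from the text, so the sketch assumes that the window holds the current sentence plus up to k preceding sentences, with k = 2 reproducing the example {T_{i-2}, T_{i-1}, T_i}. Indices are 0-based here, noun phrases are plain strings, and the entity test is passed in as a hypothetical predicate.

```python
def sentence_window(items, i, k=2):
    """Return the items for sentence i and up to k preceding sentences."""
    # Assumption: k counts the preceding sentences; i is 0-based.
    start = max(0, i - k)
    return items[start:i + 1]


def candidate_antecedents(noun_phrases, i, is_entity, k=2):
    """Collect noun phrases in the window that contain at least one entity."""
    window = sentence_window(noun_phrases, i, k)
    return [np for sentence in window for np in sentence
            if any(is_entity(token) for token in np.split())]


# Hypothetical noun phrases, one list per sentence.
noun_phrases = [
    ["IL-2", "the cells"],        # sentence 0
    ["the promoter", "STAT3"],    # sentence 1
    ["a response"],               # sentence 2
    ["the protein"],              # sentence 3, which contains the anaphor
]
known_entities = {"IL-2", "STAT3"}
X = candidate_antecedents(noun_phrases, 3, lambda token: token in known_entities)
print(X)
```

In this example "IL-2" is excluded because sentence 0 lies outside the window, and the entity-free phrases inside the window are filtered out.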
Step 5: from all coreference resolution results, filter out those whose antecedent does not contain a protein entity, completing the protein coreference resolution for biological text based on syntax trees and domain features. The coreference resolution results are expressed as pairs, comprising: the relative pronoun anaphor Mr and its antecedent Ar, the personal pronoun anaphor Mp and its antecedent Ap, and the definite noun phrase anaphor Md and its antecedent Ad.
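The final filter can be sketched in Python as follows. The representation is an assumption of the sketch: mentions and pre-annotated protein entities are given as (start, end) character offsets, and an antecedent "contains" a protein entity when a protein span lies entirely inside the antecedent span. The function names and the offsets are hypothetical.

```python
def contains_protein(span, protein_spans):
    """True if any annotated protein span lies inside the given span."""
    start, end = span
    return any(start <= p_start and p_end <= end
               for p_start, p_end in protein_spans)


def filter_protein_links(links, protein_spans):
    """Keep (anaphor, antecedent) pairs whose antecedent contains a protein."""
    return [(anaphor, antecedent) for anaphor, antecedent in links
            if contains_protein(antecedent, protein_spans)]


# Hypothetical offsets: one annotated protein at characters 0 to 5.
protein_spans = [(0, 5)]
links = [
    ((20, 25), (0, 5)),     # antecedent is the protein mention itself
    ((40, 44), (30, 38)),   # antecedent contains no protein entity
    ((60, 71), (0, 12)),    # antecedent is a longer phrase around the protein
]
kept = filter_protein_links(links, protein_spans)
print(kept)
```

Only the first and third pairs survive, since their antecedents cover the annotated protein span.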
The technical effect of the invention is further illustrated by the following simulation experiment:
1. Simulation conditions:
The experiment uses the Coreference dataset of the BioNLP 2011 biomedical natural language processing shared task; protein entities are annotated in the dataset in advance.
The experiment was implemented in the Java programming language and run on a Windows 7 system with an Intel Core(TM) i7-4720HQ CPU at 2.60 GHz and 8 GB of memory.
2. Simulation content and analysis of results:
The invention and an existing rule-based method were each used to perform protein coreference resolution for biological text on the Coreference dataset; the experimental results are as follows:
Method | Recall (%) | Precision (%) | F-score (%) |
Rule-based method | 50.4 | 62.7 | 55.9 |
The invention | 60.2 | 63.8 | 62.0 |
In summary, the invention extracts the antecedents of relative pronoun and personal pronoun anaphors based on the syntax tree and extracts the antecedents of definite noun phrase anaphors based on domain features, so it can exploit syntactic structure information that is unavailable when only the structure and attributes of phrases or words are considered, and thereby extracts more protein coreference relations. By making full use of the syntactic properties and domain features of biomedical text, it obtains more correct protein coreference results than current rule-based methods while maintaining precision; recall improves and a better overall F-score is obtained, giving it an advantage over existing methods.
The invention addresses the technical problem of low F-scores in existing rule-based protein coreference resolution methods for biological text. It preprocesses the original text; searches the syntax tree for each relative pronoun and the noun phrase nearest to it, which is taken as the antecedent of that relative pronoun; searches the syntax tree for each personal pronoun, and searches for its antecedent in the coordinate phrase structure of the syntax tree, in the clause syntax tree, or in the syntax tree of the previous sentence; uses the syntax trees to obtain definite noun phrases and a candidate antecedent set, and selects the best antecedent from the candidate set based on biological domain features such as singular/plural number, entity type and entity count; and filters out non-protein coreference results. Because protein coreference for relative pronouns and personal pronouns is handled with the syntax tree, and protein coreference for definite noun phrases is handled with rules based on biological domain features, different anaphor types are resolved by different methods, making full use of the specific syntactic properties and domain features of biomedical text. The experimental results show that the invention effectively extracts protein coreference relations, achieves protein coreference resolution in biological text, and attains a higher overall F-score.
Claims (7)
1. A biological text protein reference resolution method based on syntax tree and domain features, characterized by comprising the following steps:
(1) perform sentence splitting, tokenization, part-of-speech tagging, lemmatization and syntactic parsing on the original text to obtain the syntax tree T_i of each sentence, i = 1, 2, ..., N; the syntax trees of all sentences constitute the syntax tree set T = {T_1, T_2, ..., T_N}, where i is the index of the sentence and N is the number of sentences;
(2) in syntax tree T_i, find the relative pronoun node and the noun phrase node nearest to this relative pronoun node, obtaining the relative pronoun anaphor Mr and the antecedent (leading language) Ar of Mr;
(3) in syntax tree T_i, find the personal pronoun node, obtaining the personal pronoun anaphor Mp; then search for the antecedent Ap of Mp in the coordinate phrase structure of the syntax tree containing this personal pronoun node, in the clause syntax tree, or in the syntax tree of the previous sentence;
(4) in syntax tree T_i, find the definite noun phrase nodes that contain a specific biological entity type keyword, obtaining the definite noun phrase anaphor Md; in a subset of the syntax tree set T, find all noun phrase nodes that contain a biological entity or a protein entity, obtaining the candidate antecedent set X; select the antecedent Ad of Md from X based on biological domain feature properties; where T_j is the syntax tree of the j-th sentence in the syntax tree set T and k is the size of the sentence window;
(5) from all reference resolution results obtained in steps (2) to (4), filter out those whose antecedent contains no protein entity, completing the biological text protein reference resolution based on syntax tree and domain features.
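Step (5) amounts to a filter over the anaphor/antecedent pairs produced by steps (2) to (4). The following is a minimal Python sketch, not the patented implementation. It assumes the protein entity mentions of the document are already available as a set of strings (the claim does not say how they are obtained) and that an antecedent "contains" a protein entity when one of its whitespace-separated tokens is in that set. The names `Resolution` and `keep_protein_resolutions` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Set


class Resolution(NamedTuple):
    anaphor: str      # Mr, Mp or Md
    antecedent: str   # Ar, Ap or Ad
    kind: str         # "relative", "personal" or "definite"


def keep_protein_resolutions(resolutions: Iterable[Resolution],
                             protein_names: Set[str]) -> List[Resolution]:
    """Drop every resolution whose antecedent contains no known protein entity."""
    kept = []
    for res in resolutions:
        # Assumption: containment is tested token by token on whitespace.
        if any(token in protein_names for token in res.antecedent.split()):
            kept.append(res)
    return kept


results = [
    Resolution("which", "IL-2", "relative"),
    Resolution("it", "the assay", "personal"),
    Resolution("the protein", "STAT5", "definite"),
]
filtered = keep_protein_resolutions(results, {"IL-2", "STAT5"})
```

Here `filtered` keeps the first and third pairs and drops the pair whose antecedent, "the assay", contains no protein entity.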
2. The biological text protein reference resolution method based on syntax tree and domain features according to claim 1, characterized in that finding, in syntax tree T_i, the relative pronoun node and the noun phrase node nearest to this relative pronoun node, as described in step (2), is implemented as follows:
201. in syntax tree T_i, find the node labeled "WDT" or "WP", obtaining the relative pronoun node Nr and the relative pronoun anaphor Mr; in syntax tree T_i, find all nodes labeled "NP", obtaining the candidate antecedent set Z; where WDT denotes a wh-determiner, WP denotes a wh-pronoun, and NP denotes a noun phrase;
202. in syntax tree T_i, find the nodes of all candidate antecedents in the candidate antecedent set Z, obtaining the candidate antecedent node set Nz;
203. extract the syntax tree path between each candidate antecedent node in Nz and the relative pronoun node Nr;
204. from all syntax tree paths obtained in step 203, select the shortest one, and take the noun phrase node on this shortest syntax tree path as the nearest noun phrase node.
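Steps 201 to 204 can be illustrated with a small Python sketch. It is written under stated assumptions rather than taken from the patent: the parse tree is a hand-built `Node` structure (a hypothetical class, not a parser API), path length is counted in edges through the lowest common ancestor, and NP nodes that dominate the relative pronoun itself are skipped, which the claim does not state.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self


def iter_nodes(root: Node) -> Iterator[Node]:
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def ancestors(node: Node) -> List[Node]:
    """The node itself, then its parent, and so on up to the root."""
    chain = [node]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain


def path_length(a: Node, b: Node) -> int:
    """Number of edges on the tree path between a and b (step 203)."""
    depth_from_b = {id(n): d for d, n in enumerate(ancestors(b))}
    for depth_from_a, n in enumerate(ancestors(a)):
        if id(n) in depth_from_b:
            return depth_from_a + depth_from_b[id(n)]
    raise ValueError("nodes are not in the same tree")


def text(node: Node) -> str:
    if node.word is not None:
        return node.word
    return " ".join(text(child) for child in node.children)


def resolve_relative_pronouns(root: Node) -> List[Tuple[str, str]]:
    pairs = []
    nodes = list(iter_nodes(root))
    for nr in nodes:
        if nr.label not in ("WDT", "WP"):          # step 201
            continue
        above = {id(n) for n in ancestors(nr)}
        # Assumption: an NP that contains the pronoun is not a candidate.
        candidates = [n for n in nodes if n.label == "NP" and id(n) not in above]
        if candidates:                              # steps 202 to 204
            nearest = min(candidates, key=lambda n: path_length(n, nr))
            pairs.append((text(nr), text(nearest)))
    return pairs


# "IL-2 which activates STAT5 is essential"
tree = Node("S", [
    Node("NP", [
        Node("NP", [Node("NN", word="IL-2")]),
        Node("SBAR", [
            Node("WHNP", [Node("WDT", word="which")]),
            Node("S", [Node("VP", [
                Node("VBZ", word="activates"),
                Node("NP", [Node("NN", word="STAT5")]),
            ])]),
        ]),
    ]),
    Node("VP", [Node("VBZ", word="is"), Node("ADJP", [Node("JJ", word="essential")])]),
])
pairs = resolve_relative_pronouns(tree)
```

In this tree the path from "which" to the NP "IL-2" has 4 edges and the path to the NP "STAT5" has 5, so "IL-2" is selected as the antecedent.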
3. The biological text protein reference resolution method based on syntax tree and domain features according to claim 1, characterized in that, in step (3), the antecedent Ap is obtained as follows:
301. starting from the node of the personal pronoun anaphor Mp, traverse syntax tree T_i bottom-up to find a node Nc that contains a coordinate phrase structure; determine whether node Nc exists; if so, extract from syntax tree T_i the syntax subtree STc rooted at node Nc, and search STc for the noun phrase node farthest from the node of Mp, obtaining the antecedent Ap of Mp; otherwise, execute step 302;
302. starting from the node of the personal pronoun anaphor Mp, traverse syntax tree T_i bottom-up to find the clause node Ns; extract the syntax subtree STs rooted at clause node Ns, and search STs for the noun phrase node farthest from the node of Mp; determine whether this noun phrase node exists; if so, obtain the antecedent Ap of Mp; otherwise, execute step 303;
303. select syntax tree T_{i-1} from the syntax tree set T; starting from the last leaf node of T_{i-1}, traverse bottom-up to find the clause node Nt; extract the syntax subtree STt rooted at clause node Nt, and search STt for all noun phrase nodes that agree in number (singular/plural) with Mp, obtaining the candidate antecedent set Y; from Y, select the candidate antecedent farthest from Mp, obtaining the antecedent Ap of Mp.
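The fallback order of steps 301 to 303 can be sketched in Python as follows. This is an illustration under several assumptions that the claim leaves open: a coordinate structure is any node with a `CC` child, a clause node is labeled `S` or `SBAR`, "farthest" is read as the earliest-starting noun phrase that ends before the pronoun, number is guessed from the tag of the last word of the noun phrase, and the search falls through to the next step when a scope holds no candidate. `Node` and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass(eq=False)
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self


def leaves(node: Node) -> List[Node]:
    if node.word is not None:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def subtree(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from subtree(child)


def climb(node: Node, predicate: Callable[[Node], bool]) -> Optional[Node]:
    """First proper ancestor of `node` that satisfies `predicate`, else None."""
    current = node.parent
    while current is not None and not predicate(current):
        current = current.parent
    return current


def has_coordination(n: Node) -> bool:
    return any(child.label == "CC" for child in n.children)


def is_clause(n: Node) -> bool:
    return n.label in ("S", "SBAR")


def earliest_np(root: Node, before: Optional[Node] = None,
                plural: Optional[bool] = None) -> Optional[Node]:
    """Earliest-starting NP in `root`, optionally ending before leaf `before`
    and optionally restricted to a grammatical number."""
    order = {id(leaf): i for i, leaf in enumerate(leaves(root))}
    limit = order[id(before)] if before is not None else len(order)
    candidates = []
    for n in subtree(root):
        if n.label != "NP":
            continue
        span = leaves(n)
        if order[id(span[-1])] >= limit:
            continue
        if plural is not None and (span[-1].label in ("NNS", "NNPS")) != plural:
            continue
        candidates.append(n)
    return min(candidates, key=lambda n: order[id(leaves(n)[0])], default=None)


PLURAL_PRONOUNS = {"they", "them", "their"}  # assumed, not from the source


def resolve_personal_pronoun(pronoun: Node, prev_tree: Optional[Node] = None) -> Optional[str]:
    for predicate in (has_coordination, is_clause):   # steps 301 and 302
        scope = climb(pronoun, predicate)
        if scope is not None:
            found = earliest_np(scope, before=pronoun)
            if found is not None:
                return " ".join(leaf.word for leaf in leaves(found))
    if prev_tree is None:
        return None
    nt = climb(leaves(prev_tree)[-1], is_clause)      # step 303
    if nt is None:
        return None
    found = earliest_np(nt, plural=pronoun.word.lower() in PLURAL_PRONOUNS)
    return " ".join(leaf.word for leaf in leaves(found)) if found else None


# "IL-2 binds its receptor and it activates STAT5"
it = Node("PRP", word="it")
coordinated = Node("S", [
    Node("S", [Node("NP", [Node("NNP", word="IL-2")]),
               Node("VP", [Node("VBZ", word="binds"),
                           Node("NP", [Node("PRP$", word="its"), Node("NN", word="receptor")])])]),
    Node("CC", word="and"),
    Node("S", [Node("NP", [it]),
               Node("VP", [Node("VBZ", word="activates"), Node("NP", [Node("NNP", word="STAT5")])])]),
])

# "IL-2 binds receptors" followed by "They activate STAT5"
they = Node("PRP", word="They")
current = Node("S", [Node("NP", [they]),
                     Node("VP", [Node("VBP", word="activate"), Node("NP", [Node("NNP", word="STAT5")])])])
previous = Node("S", [Node("NP", [Node("NNP", word="IL-2")]),
                      Node("VP", [Node("VBZ", word="binds"), Node("NP", [Node("NNS", word="receptors")])])])
```

For the first sentence, `resolve_personal_pronoun(it)` finds the coordinated clause and selects "IL-2". For the second pair, neither step 301 nor step 302 yields a candidate, so `resolve_personal_pronoun(they, previous)` falls back to the previous sentence and selects the plural noun phrase "receptors".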
4. The biological text protein reference resolution method based on syntax tree and domain features according to claim 1, characterized in that, in step (3), the coordinate phrase structure refers to a coordinate noun phrase, coordinate verb phrase or coordinate clause structure.
5. The biological text protein reference resolution method based on syntax tree and domain features according to claim 1, characterized in that, in step (4), the antecedent Ad is obtained as follows:
401. determine whether the head word of the definite noun phrase anaphor Md is "proteins" or "genes"; if so, select from the candidate antecedent set X all candidate antecedents that contain a protein entity, obtaining a new candidate antecedent set Xs, and from Xs, in the order of head word match, then number of contained protein entities greater than 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, execute step 402;
402. determine whether the definite noun phrase anaphor Md is in plural form; if so, from the candidate antecedent set X, in the order of head word match, number of contained biological entities greater than 1, then number of contained protein entities greater than 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, execute step 403;
403. determine whether the head word of the definite noun phrase anaphor Md is "protein" or "gene"; if so, select from the candidate antecedent set X all candidate antecedents that contain a protein entity, obtaining a new candidate antecedent set Xs, and from Xs, in the order of head word match, then number of contained protein entities equal to 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md; otherwise, execute step 404;
404. from the candidate antecedent set X, in the order of head word match, number of contained biological entities equal to 1, then number of contained protein entities equal to 1, select the candidate antecedent nearest to Md, obtaining the antecedent Ad of Md.
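The rule cascade of steps 401 to 404 can be sketched in Python as below. The sketch reads "in the order of" as a priority list: the first criterion matched by at least one candidate decides, and the nearest matching candidate wins. That reading, the `Candidate` fields, exact string comparison for head word match, and a precomputed `distance` to the anaphor are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Candidate:
    text: str
    head: str        # head word of the candidate noun phrase
    n_bio: int       # number of biological entities it contains
    n_protein: int   # number of protein entities it contains
    distance: int    # distance to the anaphor Md (unit left open here)


Criterion = Callable[[Candidate], bool]


def pick(candidates: Sequence[Candidate], criteria: List[Criterion]) -> Optional[Candidate]:
    """Nearest candidate satisfying the first criterion that matches anything."""
    for criterion in criteria:
        matching = [c for c in candidates if criterion(c)]
        if matching:
            return min(matching, key=lambda c: c.distance)
    return None


def resolve_definite_np(head: str, is_plural: bool,
                        candidates: Sequence[Candidate]) -> Optional[Candidate]:
    def head_match(c: Candidate) -> bool:
        return c.head.lower() == head.lower()

    with_protein = [c for c in candidates if c.n_protein >= 1]   # set Xs
    if head in ("proteins", "genes"):                            # step 401
        return pick(with_protein, [head_match, lambda c: c.n_protein > 1])
    if is_plural:                                                # step 402
        return pick(candidates, [head_match,
                                 lambda c: c.n_bio > 1,
                                 lambda c: c.n_protein > 1])
    if head in ("protein", "gene"):                              # step 403
        return pick(with_protein, [head_match, lambda c: c.n_protein == 1])
    return pick(candidates, [head_match,                         # step 404
                             lambda c: c.n_bio == 1,
                             lambda c: c.n_protein == 1])


X = [
    Candidate("IL-2 and IL-4", head="IL-4", n_bio=2, n_protein=2, distance=12),
    Candidate("STAT5", head="STAT5", n_bio=1, n_protein=1, distance=8),
    Candidate("the receptor", head="receptor", n_bio=0, n_protein=0, distance=5),
]
plural_choice = resolve_definite_np("proteins", True, X)
singular_choice = resolve_definite_np("protein", False, X)
```

For "these proteins" no candidate head matches, so the cascade falls to the protein count and picks "IL-2 and IL-4"; for "the protein" it picks "STAT5", the only candidate with exactly one protein entity.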
6. The biological text protein reference resolution method based on syntax tree and domain features according to claim 1, characterized in that the specific biological entity type keywords in step (4) include "protein", "gene", "factor", "element", "receptor", "complex" and "construct".
7. The biological text protein reference resolution method based on syntax tree and domain features according to claim 1, characterized in that the recognition method for the biological entities of step (4) includes: starting with a digit and containing a letter; starting with a lowercase letter and containing an uppercase letter, a digit or a special symbol; starting with an uppercase letter and containing a digit or a special symbol; or starting with an uppercase letter, containing a lowercase letter, and containing an uppercase letter or a special symbol.
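The four surface patterns of claim 7 can be checked with a few lines of Python. This is a sketch under two assumptions: the test is applied to a single token, and "special symbol" means one of the characters in `SPECIAL`, a set chosen here for illustration because the claim does not enumerate the symbols.

```python
SPECIAL = set("-/_+()")  # assumed set of "special symbols"


def looks_like_bio_entity(token: str) -> bool:
    """Apply the four orthographic patterns of claim 7 to one token."""
    if not token:
        return False
    first, rest = token[0], token[1:]
    has_letter = any(ch.isalpha() for ch in rest)
    has_upper = any(ch.isupper() for ch in rest)
    has_lower = any(ch.islower() for ch in rest)
    has_digit = any(ch.isdigit() for ch in rest)
    has_special = any(ch in SPECIAL for ch in rest)

    if first.isdigit():
        # Pattern 1: starts with a digit and contains a letter.
        return has_letter
    if first.islower():
        # Pattern 2: starts lowercase, contains uppercase, digit or special symbol.
        return has_upper or has_digit or has_special
    if first.isupper():
        # Pattern 3: starts uppercase, contains a digit or special symbol.
        # Pattern 4: starts uppercase, contains lowercase plus uppercase or special.
        return has_digit or has_special or (has_lower and has_upper)
    return False


tokens = ["IL-2", "p53", "5HT", "NFkB", "Protein", "binds"]
flags = [looks_like_bio_entity(t) for t in tokens]
```

With these patterns, "IL-2", "p53", "5HT" and "NFkB" are flagged, while an ordinary capitalized word such as "Protein" and a lowercase word such as "binds" are not.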
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610872780.8A CN106484676B (en) | 2016-09-30 | 2016-09-30 | Biological Text protein reference resolution method based on syntax tree and domain features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610872780.8A CN106484676B (en) | 2016-09-30 | 2016-09-30 | Biological Text protein reference resolution method based on syntax tree and domain features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484676A (en) | 2017-03-08 |
CN106484676B (en) | 2019-04-12 |
Family ID=58269087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610872780.8A Active CN106484676B (en) | 2016-09-30 | 2016-09-30 | Biological Text protein reference resolution method based on syntax tree and domain features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484676B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220300A (en) * | 2017-05-05 | 2017-09-29 | 平安科技(深圳)有限公司 | Information mining method, electronic installation and readable storage medium storing program for executing |
CN108446268A (en) * | 2018-02-11 | 2018-08-24 | 青海师范大学 | Tibetan language personal pronoun reference resolution system |
CN109885841A (en) * | 2019-03-20 | 2019-06-14 | 苏州大学 | Reference resolution method based on node representation |
CN110674630A (en) * | 2019-09-24 | 2020-01-10 | 北京明略软件系统有限公司 | Reference resolution method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007207113A (en) * | 2006-02-03 | 2007-08-16 | Hitachi Software Eng Co Ltd | Genealogical tree display system |
CN102339362A (en) * | 2011-11-08 | 2012-02-01 | 苏州大学 | Method for extracting protein interaction relationship |
CN105138864A (en) * | 2015-09-24 | 2015-12-09 | 大连理工大学 | Protein interaction relationship data base construction method based on biomedical science literature |
Non-Patent Citations (3)
Title |
---|
REBHOLZ-SCHUHMANN et al.: "Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources", 《JOURNAL OF BIOMEDICAL SEMANTICS》 * |
TIKK, DOMONKOS et al.: "A Comprehensive Benchmark of Kernel Methods to Extract Protein-Protein Interactions from Literature", 《PLOS COMPUTATIONAL BIOLOGY》 * |
LIU Nian et al.: "Research on protein-protein interaction extraction based on tree kernels", 《Journal of Huazhong University of Science and Technology (Natural Science Edition)》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220300A (en) * | 2017-05-05 | 2017-09-29 | 平安科技(深圳)有限公司 | Information mining method, electronic installation and readable storage medium storing program for executing |
CN108446268A (en) * | 2018-02-11 | 2018-08-24 | 青海师范大学 | Tibetan language personal pronoun reference resolution system |
CN109885841A (en) * | 2019-03-20 | 2019-06-14 | 苏州大学 | Reference resolution method based on node representation |
CN109885841B (en) * | 2019-03-20 | 2023-07-11 | 苏州大学 | Reference digestion method based on node representation method |
CN110674630A (en) * | 2019-09-24 | 2020-01-10 | 北京明略软件系统有限公司 | Reference resolution method and device, electronic equipment and storage medium |
CN110674630B (en) * | 2019-09-24 | 2023-03-21 | 北京明略软件系统有限公司 | Reference resolution method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106484676B (en) | 2019-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103488724B (en) | A kind of reading domain knowledge map construction method towards books | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN104636466B (en) | Entity attribute extraction method and system for open webpage | |
CN111708874A (en) | Man-machine interaction question-answering method and system based on intelligent complex intention recognition | |
WO2018153215A1 (en) | Method for automatically generating sentence sample with similar semantics | |
CN112542223A (en) | Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN104298662B (en) | A kind of machine translation method and translation system based on nomenclature of organic compound entity | |
CN109271626A (en) | Text semantic analysis method | |
CN106202543A (en) | Ontology Matching method and system based on machine learning | |
CN108121829A (en) | The domain knowledge collection of illustrative plates automated construction method of software-oriented defect | |
CN113282689B (en) | Retrieval method and device based on domain knowledge graph | |
CN105138864B (en) | Protein interactive relation data base construction method based on Biomedical literature | |
CN110609983B (en) | Structured decomposition method for policy file | |
CN113743097B (en) | Emotion triplet extraction method based on span sharing and grammar dependency relationship enhancement | |
CN107004000A (en) | A kind of language material generating means and method | |
CN107247739B (en) | A kind of financial bulletin text knowledge extracting method based on factor graph | |
CN107145514B (en) | Chinese sentence pattern classification method based on decision tree and SVM mixed model | |
CN106484676B (en) | Biological Text protein reference resolution method based on syntax tree and domain features | |
CN107193798A (en) | A kind of examination question understanding method in rule-based examination question class automatically request-answering system | |
CN112035675A (en) | Medical text labeling method, device, equipment and storage medium | |
CN109918640A (en) | A kind of Chinese text proofreading method of knowledge based map | |
CN113157860B (en) | Electric power equipment maintenance knowledge graph construction method based on small-scale data | |
Wang et al. | Neural related work summarization with a joint context-driven attention mechanism | |
CN107818082A (en) | With reference to the semantic role recognition methods of phrase structure tree |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||