CN109446326A - Joint biomedical event extraction method based on a copy mechanism - Google Patents

Joint biomedical event extraction method based on a copy mechanism

Info

Publication number
CN109446326A
CN109446326A (application CN201811291947.7A)
Authority
CN
China
Prior art keywords
trigger word
word
sentence
input
trigger
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811291947.7A
Other languages
Chinese (zh)
Other versions
CN109446326B (en)
Inventor
李丽双
叶沛言
王子维
周安桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN201811291947.7A
Publication of CN109446326A
Application granted
Publication of CN109446326B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The present invention provides a biomedical event extraction method based on a copy mechanism, belonging to the field of natural language processing. The steps of the method are as follows: construct the model input vectors; construct an Encoder module using a bidirectional LSTM model; construct a Decoder module that identifies trigger words and arguments simultaneously, based on an attention mechanism and a copy mechanism. The present invention effectively avoids the cascading errors introduced by staged (pipelined) methods, the mutual independence of their subtasks, and the error propagation caused by joint models that rely solely on parameter sharing, thereby improving the performance of biomedical event extraction.

Description

Joint biomedical event extraction method based on a copy mechanism
Technical field
The invention belongs to the field of natural language processing and relates to a joint event extraction method for biomedical text. Specifically, it uses a copy mechanism and joint extraction to extract the trigger words and arguments of biological events simultaneously, forming candidate biological events. The candidates are then classified by a support vector machine (SVM) that has learned event structure features, and invalid combinations are removed, yielding complete biomedical events.
Background technique
There are currently two main classes of methods for biological event extraction. The first is the staged method (also called the pipelined method), which divides biological event extraction into two main steps, trigger word identification and argument identification, and then assembles complete biomedical events in a post-processing stage. The second is the joint extraction method, which uses a single technique to identify trigger words and arguments simultaneously and form biological events. The staged method is currently the mainstream approach to biological event extraction.
Staged biological event extraction is mainly carried out in the following ways: rule-based methods, methods based on statistical machine learning, and methods based on neural networks and word embeddings.
Trigger word identification extracts the trigger words present in biological text (the words or phrases that signal the occurrence of a biological event). For this task, rule-based methods use the features of the trigger word itself and its contextual information to establish rules manually and heuristically. If the current word matches a predefined rule, it is judged to be a trigger word; if no matching rule is found, it is judged to be a non-trigger word. Because rule-based methods are highly targeted, their extraction precision is high but their recall is low. For example, Cohen et al. (Cohen, K. Bretonnel, et al., "High-precision biological event extraction with a concept recognizer.", Association for Computational Linguistics, 2009) extracted biological events with a rule-based method and obtained the highest precision on the BioNLP'09 Shared Task dataset, but a relatively low F-score.
Trigger word identification methods based on statistical machine learning usually treat trigger word identification as a classification problem and classify with a statistical machine learning model. Common models include support vector machines (SVM), passive-aggressive online algorithms (PA), and conditional random fields (CRF). In addition, manually engineered features are often needed to improve the classification performance of the model. For example, Björne et al. (Björne, Jari, et al., "Extracting complex biological events with rich graph-based feature sets.", Association for Computational Linguistics, 2009) used an SVM as the classifier, with lexical features, sentence features, part-of-speech and stem features of trigger words, and information extracted from dependency chains, achieving the best result in the BioNLP'09 Shared Task.
Trigger word identification methods based on neural networks and word embeddings reduce the cost of engineering complex features while largely solving the lack of semantic information between words. Such methods learn abstract features automatically through the complex nonlinear structure of a neural network and capture the semantics between words. For example, Wang et al. (Wang, Jian, et al., "Biomedical event trigger detection by dependency-based word embedding.", BMC Medical Genomics, 2016) learned lexical and semantic features between words automatically with a neural network and then fed the generated feature vectors into a neural network for classification. Common deep neural network models include RNN, CNN, LSTM, and GRU. Experimental results show that methods based on neural networks and word embeddings mostly outperform rule-based and statistical machine learning methods on the trigger word identification task.
Argument identification extracts the relationship between a trigger word and a biological entity, or between two trigger words (nested events). Like trigger word identification, argument identification methods are usually divided into rule-based methods, statistical machine learning methods, and neural network methods. Rule-based argument identification usually designs rules from the dependency and syntactic information in the dependency parse tree and then identifies argument candidates by pattern matching. Because argument structures are complex and diverse, formulating rules generally requires making fuller use of contextual and corpus information and more expert knowledge. Although rule-based methods can improve the extraction performance of a system, they generalize poorly to unseen corpora.
Argument identification based on statistical machine learning is similar to trigger word identification: argument identification is treated as a relation classification task, and traditional machine learning models classify the arguments. Compared with rule-based methods, statistical machine learning methods are more stable and achieve better extraction performance. However, because argument identification is inherently complex, it usually requires extracting as many relevant features as possible from the parse graph. Extraction performance therefore depends to some extent on feature design, which reduces the generalization ability of the system.
Argument identification based on neural networks and word embeddings avoids complex feature design: the network automatically learns the relationships between trigger words, or between trigger words and entities. For example, Wang et al. used a convolutional neural network (CNN) to perform trigger word identification and argument identification on biomedical text (Wang, Anran, et al., "A multiple distributed representation method based on neural network for biomedical event extraction.", BMC Medical Informatics and Decision Making, 2017).
The staged method makes the biological event extraction task well organized, but it also has some problems. (1) Error propagation: errors in the trigger word identification stage run through the entire extraction task. Because argument identification relies on the trigger words predicted in the previous stage, poor trigger word identification propagates its errors into argument identification, producing cascading errors. (2) The relationship between the two subtasks is ignored: trigger word identification and argument identification are not completely independent. Argument identification results help identify trigger words, and trigger word identification in turn affects argument identification. (3) Redundant information is generated: because argument identification combines the predicted trigger words and the annotated biological entities pairwise as the input of the argument identification network, many unrelated trigger word-trigger word pairs and trigger word-entity pairs are generated, which raises the identification error rate.
To address the above shortcomings of staged biological event extraction, many scholars have begun to study joint extraction methods that extract trigger words and arguments simultaneously. In 2010, Poon and Vanderwende used a joint extraction method for the first time. They extracted biological events with a Markov logic network and obtained an F-score of 50% on the BioNLP'09 Shared Task test set, with event extraction precision even higher than that of UTurku, the best system of 2009 (Poon, Hoifung, Lucy Vanderwende, et al., "Joint inference for knowledge extraction from biomedical literature.", Association for Computational Linguistics, 2010). In 2011, Riedel et al. built a biological event extraction system, UMass, with a joint model; the system placed second in the BioNLP'11 Shared Task evaluation, and the first-place event extraction system was also a variant of the UMass joint model (Riedel, Sebastian, Andrew McCallum, et al., "Robust biomedical event extraction with dual decomposition and minimal domain adaptation.", Association for Computational Linguistics, 2011). These results show that joint extraction models outperform staged extraction models to a certain extent.
Previous joint models for event extraction mostly used parameter sharing to eliminate the mutual independence of the two subtasks in the staged method. However, this approach still has drawbacks. (1) When joint extraction is used for biological events, the extracted features depend on NLP preprocessing tools, which may introduce errors. (2) Joint extraction by parameter sharing extracts the two subtasks with separate simple models and feeds the extraction result of the first subtask into the second subtask. This reduces the independence between the two subtasks, but error propagation remains.
To address the problems of current staged and joint extraction methods, the present invention proposes a joint biomedical event extraction method that uses a copy mechanism. Specifically, the argument, trigger word 1, and trigger word 2 are identified in sequence, and the word vector of the argument or trigger word predicted in the previous step is used as the input of the current step, capturing the internal connection between them and the trigger word to be predicted. This realizes simultaneous extraction of trigger words and arguments, avoiding the cascading errors of the staged method and the mutual independence of its subtasks. In addition, the method predicts each trigger word-argument pair as a whole, which avoids the error propagation caused by joint models that rely solely on parameter sharing.
Summary of the invention
The present invention provides a joint biomedical event extraction method based on a copy mechanism. The method effectively avoids the cascading errors introduced by staged methods, the mutual independence of their subtasks, and the error propagation caused by joint models that rely solely on parameter sharing. Trigger words and arguments are identified as a whole: the word vector of the argument or trigger word predicted in the previous step is used as the input of the current step, capturing the internal connection between them and the trigger word to be predicted and realizing joint extraction.
The technical solution of the present invention:
The joint biomedical event extraction method based on a copy mechanism proceeds as follows:
(1) Construct the input vectors
Joint biological event extraction is performed on biomedical text;
First, the biomedical text must be preprocessed to form the input of the framework; preprocessing includes the following steps:
(1) The acquired corpus and a large-scale biomedical domain corpus are fed into word2vec together, and training produces the word vector of each word;
(2) The word vector of each word in the corpus is obtained by vocabulary lookup, forming the input of the model;
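The two preprocessing steps can be sketched as follows; the vocabulary, vector dimension, and the random table standing in for a trained word2vec model are all illustrative placeholders, not the patent's actual corpus or embeddings:

```python
import numpy as np

DIM = 50
vocab = {"<UNK>": 0, "granules": 1, "enhance": 2, "fibroblast": 3, "proliferation": 4}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), DIM))  # stand-in for trained word2vec output

def sentence_to_input(tokens):
    """Look up each token's word vector; unknown tokens map to <UNK>."""
    ids = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
    return embeddings[ids]                        # shape: (sentence length, DIM)

X = sentence_to_input(["granules", "enhance", "fibroblast", "proliferation"])
print(X.shape)  # (4, 50)
```

The resulting matrix, one row per word, is the sentence-level input fed to the Encoder module.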
(2) Overview of the framework
An Encoder-Decoder model based on the attention mechanism is used; the Encoder module encodes the input sentence, and the Decoder module processes the sentence's encoding vector and attention vector to generate the predicted trigger word-argument pairs;
(3) Encoder module
Because the bidirectional propagation mechanism of the BiLSTM model obtains the contextual information of each word in the sentence, yielding a more comprehensive and accurate semantic representation, the Encoder module of this model uses a BiLSTM to obtain the encoded representations of the words and of the sentence;
The specific formulas are as follows:
Input of the Encoder module:
X represents a sentence input to the model, x_t represents the t-th word in the sentence, and n represents the length of the sentence;
X = (x_1, x_2, …, x_n)    (1)
The output of the Encoder module at step t is h^E_t, obtained by concatenating the output →h_t of the forward LSTM at step t and the output ←h_t of the backward LSTM at step t:
h^E_t = [→h_t ; ←h_t]    (8)
→h_t represents the output of the forward LSTM at step t; W_O, W_C, W_i, W_f represent the corresponding weights; b_o, b_C, b_i, b_f represent the corresponding biases; →h_{t-1} is the hidden state of the forward LSTM at step t-1; →h_0 is a randomly initialized parameter; σ is the activation function;
The specific derivation:
f_t = σ(W_f·[→h_{t-1}; x_t] + b_f)    (2)
i_t = σ(W_i·[→h_{t-1}; x_t] + b_i)    (3)
C̃_t = tanh(W_C·[→h_{t-1}; x_t] + b_C)    (4)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (5)
o_t = σ(W_O·[→h_{t-1}; x_t] + b_o)    (6)
→h_t = o_t ⊙ tanh(C_t)    (7)
The derivation of ←h_t is identical to that of →h_t: the input X = (x_1, x_2, …, x_n) is reversed, i.e. X_1 = (x_n, x_{n-1}, …, x_1) is used as the input of the Encoder module, and applying formulas (2)-(7) yields ←h_t;
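A minimal numpy sketch of the Encoder module follows, implementing the forward and backward LSTM passes of formulas (2)-(7) and the concatenated bidirectional output; the dimensions and random weights are toy values standing in for the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(X, params, h0, c0):
    """One directional LSTM over a sentence, following formulas (2)-(7)."""
    Wf, Wi, Wc, Wo, bf, bi, bc, bo = params
    h, c, outputs = h0, c0, []
    for x_t in X:
        z = np.concatenate([h, x_t])          # [h_{t-1}; x_t]
        f = sigmoid(Wf @ z + bf)              # forget gate
        i = sigmoid(Wi @ z + bi)              # input gate
        c_tilde = np.tanh(Wc @ z + bc)        # candidate cell state
        c = f * c + i * c_tilde               # new cell state
        o = sigmoid(Wo @ z + bo)              # output gate
        h = o * np.tanh(c)                    # hidden state
        outputs.append(h)
    return np.stack(outputs)

def bilstm_encode(X, fwd_params, bwd_params, hidden):
    """h^E_t = [forward h_t ; backward h_t] for every word of the sentence."""
    h0, c0 = np.zeros(hidden), np.zeros(hidden)
    fwd = lstm_pass(X, fwd_params, h0, c0)
    bwd = lstm_pass(X[::-1], bwd_params, h0, c0)[::-1]  # reversed input, re-aligned
    return np.concatenate([fwd, bwd], axis=1)

def make_params(hidden, dim, rng):
    W = lambda: rng.normal(scale=0.1, size=(hidden, hidden + dim))
    b = lambda: np.zeros(hidden)
    return (W(), W(), W(), W(), b(), b(), b(), b())

rng = np.random.default_rng(0)
n, dim, hidden = 5, 8, 6
X = rng.normal(size=(n, dim))                 # toy word vectors for one sentence
H = bilstm_encode(X, make_params(hidden, dim, rng), make_params(hidden, dim, rng), hidden)
print(H.shape)  # (5, 12): one 2*hidden encoding per word
```

Each row of H corresponds to one h^E_t, the bidirectional encoding consumed by the Decoder module.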
(4) Decoder module
An argument refers to the relationship between two trigger words, or between a trigger word and an entity. Because joint extraction can extract trigger words and arguments simultaneously, separate trigger word identification is not needed here. The present invention therefore does not distinguish whether a predicted argument is a relationship between two trigger words or between a trigger word and an entity; all are uniformly treated as relationships between trigger words.
The inputs of the Decoder module are s, c_t, and v_t; s is the sentence encoding vector obtained by the Encoder module, c_t is the attention vector at step t, and v_t is the word vector of the argument or trigger word predicted at step t-1. →h_n and ←h_n are respectively the final hidden states of the forward and backward passes of the Encoder module, h^D_t is the hidden state of the Decoder module at step t, o^E_t is the output of the Encoder module at step t, and o^D_t is the output of the Decoder module at step t; f_a, f_h, f_k are activation functions, W_h, W_k, W_a, U_a, U_h are the corresponding weights, and b_a, b_h, b_k are the corresponding biases;
s is expressed as:
s = [→h_n ; ←h_n]    (9)
The derivation of c_t is as follows:
e_{ti} = f_a(W_a·h^D_{t-1} + U_a·o^E_i + b_a)    (10)
α_{ti} = exp(e_{ti}) / Σ_{j=1}^{n} exp(e_{tj})    (11)
c_t = Σ_{i=1}^{n} α_{ti}·o^E_i    (12)
The output of each step of the Decoder module:
h^D_t = f_h(W_h·[c_t ; v_t] + U_h·h^D_{t-1} + b_h)    (13)
u_t = [c_t ; v_t]    (14)
o^D_t = f_k(W_k·h^D_t + b_k)    (15)
For o^D_t: when t % 3 = 1 (t = 1, 4, 7, …), step t identifies an argument; when t % 3 = 2 (t = 2, 5, 8, …), it identifies the first trigger word; when t % 3 = 0 (t = 3, 6, 9, …), it identifies the second trigger word. That is, every 3 steps identify one trigger word-argument pair, expressed as:
<argument, trigger word 1, trigger word 2>.
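The 3-step decoding cycle described above can be sketched directly; `group_triples` is a hypothetical helper (not named in the patent) for collecting the flat prediction sequence into triples:

```python
def decode_roles(num_steps):
    """Role of Decoder step t (1-based) under the 3-step cycle."""
    role = {1: "argument", 2: "trigger word 1", 0: "trigger word 2"}
    return [role[t % 3] for t in range(1, num_steps + 1)]

def group_triples(predictions):
    """Group a flat prediction sequence into <argument, trigger1, trigger2> triples."""
    usable = len(predictions) - len(predictions) % 3
    return [tuple(predictions[i:i + 3]) for i in range(0, usable, 3)]

print(decode_roles(3))   # ['argument', 'trigger word 1', 'trigger word 2']
print(group_triples(["Theme", "enhance", "proliferation"]))
```

Every three consecutive Decoder outputs therefore yield exactly one trigger word-argument pair.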
(1) Identifying the argument:
Because the number of arguments contained in a sentence cannot be predicted in advance, an end mark must be set when identifying arguments; once the end mark is identified during argument identification, identification for the current sentence ends. W_qr and U_qNA are weights, b_qr and b_qNA are biases, and f_qr and f_qNA are activation functions;
The transformations (16)-(19) are applied to o^D_t:
q_r = f_qr(W_qr·o^D_t + b_qr)    (16)
q_NA = f_qNA(U_qNA·o^D_t + b_qNA)    (17)
q = [q_r ; q_NA]    (18)
q_a = softmax(q)    (19)
The category corresponding to the dimension of maximum probability in q_a is the argument category of this trigger word-argument pair. The dimension of q is the number of argument categories in the corpus plus 1; q_NA is the stop mark of relation discrimination, and once q_NA is identified, no further trigger word-argument pairs are identified in the current sentence.
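As an illustration of the (16)-(19) transformations, the sketch below scores each argument category plus the NA/stop mark from the decoder output and takes a softmax; the category list, dimensions, and random weights are hypothetical toy values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_argument(o_dec, W_qr, b_qr, u_qna, b_qna, labels):
    """One argument-identification step: category scores plus a NA/stop score,
    softmaxed together; the argmax dimension gives the predicted category."""
    q_r = np.tanh(W_qr @ o_dec + b_qr)            # one score per argument category
    q_na = np.tanh(float(u_qna @ o_dec) + b_qna)  # score of the NA (stop) mark
    q_a = softmax(np.append(q_r, q_na))           # dimension: #categories + 1
    best = int(np.argmax(q_a))
    label = labels[best] if best < len(labels) else "NA"
    return label, q_a

rng = np.random.default_rng(1)
labels = ["Theme", "Cause", "Regulation"]         # hypothetical category set
d = 8
o_dec = rng.normal(size=d)                        # toy decoder output
label, q_a = classify_argument(o_dec, rng.normal(size=(3, d)), np.zeros(3),
                               rng.normal(size=d), 0.0, labels)
print(len(q_a))  # 4 = number of categories + 1
```

Predicting "NA" here plays the role of the stop mark that ends decoding for the sentence.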
(2) Identifying trigger word 1
A candidate trigger word is selected from the n words of the sentence as trigger word 1. Because trigger word 1 is related to the input word vectors, the semantic information of the input must be added. f_p and f_pNA are activation functions; W_p and U_pNA are weights; b_p and b_pNA are biases.
o^E_i and o^D_t are processed as follows:
p_e^i = f_p(W_p·[o^E_i ; o^D_t] + b_p), i = 1, …, n    (20)
p_e = [p_e^1, p_e^2, …, p_e^n]    (21)
p_NA = f_pNA(U_pNA·o^D_t + b_pNA)    (22)
p = [p_e ; p_NA]    (23)
p_a = softmax(p)    (24)
The dimension of p_a is n + 1, where the first n dimensions represent the n words in the sentence and p_NA is the stop symbol for trigger word identification. p_a is obtained by normalizing p, and the word corresponding to the dimension of maximum probability in p_a is selected as trigger word 1.
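The copy step for trigger word 1 can be sketched as a pointer-style softmax over the n source words plus a stop symbol: each word is scored from its encoder state combined with the current decoder output. The toy encoder states and random weights below stand in for the trained model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def copy_trigger1(enc_states, o_dec, w_p, b_p, u_pna, b_pna, tokens):
    """Score every source word, append the NA/stop score, softmax, and copy
    the word at the argmax position as trigger word 1."""
    scores = [np.tanh(float(w_p @ np.concatenate([h, o_dec])) + b_p)
              for h in enc_states]                 # n word scores
    p_na = np.tanh(float(u_pna @ o_dec) + b_pna)   # stop-symbol score
    p_a = softmax(np.array(scores + [p_na]))       # dimension n + 1
    idx = int(np.argmax(p_a))
    word = tokens[idx] if idx < len(tokens) else "<STOP>"
    return word, idx, p_a

rng = np.random.default_rng(0)
tokens = ["granules", "enhance", "fibroblast", "proliferation"]
enc_states = rng.normal(size=(4, 6))               # toy encoder outputs o^E_i
o_dec = rng.normal(size=6)                         # toy decoder output o^D_t
word, idx, p_a = copy_trigger1(enc_states, o_dec,
                               rng.normal(size=12), 0.0,
                               rng.normal(size=6), 0.0, tokens)
print(len(p_a))  # 5: four words plus the stop symbol
```

The returned index is what the trigger word 2 step later masks out so the two trigger words differ.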
(3) Identifying trigger word 2
When choosing a word in the sentence as candidate trigger word 2, the identified trigger word 1 must be excluded, because an argument connects two different trigger words. For this purpose, an array records the position i of the trigger word identified in the previous step. After completing steps (20)-(24) of (2), the result p is normalized to obtain p_a, and the value at the i-th position of p_a is set to 0. The word corresponding to the dimension of maximum probability in p_a is then selected as trigger word 2.
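The exclusion of trigger word 1 amounts to masking one position of the normalized distribution before taking the argmax; a small sketch with toy numbers:

```python
import numpy as np

def pick_trigger2(p_a, trigger1_pos, tokens):
    """Zero the probability of the already-copied trigger word 1 position,
    then take the argmax of what remains."""
    masked = p_a.copy()
    masked[trigger1_pos] = 0.0        # trigger word 2 must differ from trigger word 1
    idx = int(np.argmax(masked))
    return tokens[idx] if idx < len(tokens) else "<STOP>"

# Toy normalized distribution over 4 words + stop symbol; position 1 was trigger word 1.
p_a = np.array([0.10, 0.50, 0.30, 0.05, 0.05])
tokens = ["granules", "enhance", "fibroblast", "proliferation"]
print(pick_trigger2(p_a, 1, tokens))  # fibroblast
```

With position 1 masked, the next-highest probability (position 2) is copied as trigger word 2.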
Beneficial effects of the present invention: the method effectively avoids the cascading errors introduced by staged methods, the mutual independence of their subtasks, and the error propagation caused by joint models that rely solely on parameter sharing. Trigger words and arguments are identified as a whole; the word vector of the argument or trigger word predicted in the previous step is used as the input of the current step, capturing the internal connection between them and the trigger word to be predicted and realizing joint extraction.
Detailed description of the invention
Fig. 1 shows the Seq2Seq model with the attention mechanism.
Fig. 2 shows the bidirectional LSTM model.
Fig. 3 shows the Decoder framework.
Specific embodiment
The specific embodiments of the invention are further described below in conjunction with the drawings and the technical solution.
The model of the invention first encodes the biomedical text, converting the text into a sequence of vectors containing semantic information. Trigger word and argument identification is then performed on each sentence by the joint model, and finally the predicted trigger word-argument pairs are used as the input of the SVM layer. Using the learned structure features of biological events, the SVM layer classifies the trigger word-argument pairs and removes invalid combinations, finally producing the biological event output. The biological event extraction model is divided into an embedding layer, a Seq2Seq layer, an SVM layer, and an output layer. The model structure is shown in Table 1.
Table 1: biological event extraction model
1. Embedding layer
After the user inputs biomedical text, the system first splits the text into sentences and tokenizes it, then obtains the corresponding word vectors by vocabulary lookup.
2. Seq2Seq layer
During training, one sentence of the text at a time is used as the sequence input of the Seq2Seq model. Each time step of the model takes one word of the sentence as input, and the output of the Encoder module and the attention vector of the word are calculated according to the formulas listed above; both are then fed into the Decoder module. Finally the output of the Decoder module is transformed to obtain, in the predicted order, the argument, trigger word 1, and trigger word 2, forming the predicted trigger word-argument pair, i.e. <argument, trigger word 1, trigger word 2>.
3. SVM layer
The system learns the legal structure of every event type from the original corpus, including the trigger words corresponding to each event type and the number and types of arguments allowed. Using the learned features, the trigger word-argument pairs predicted by the Seq2Seq layer are classified and invalid combinations are removed.
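The SVM filtering step can be sketched with scikit-learn: a linear SVM trained on structural features of candidate pairs keeps only combinations it classifies as valid. The features, labels, and training rows below are purely illustrative stand-ins for what the system learns from the corpus (in this toy data, validity is decided by the third feature, so the data is linearly separable):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical structural encoding of predicted <argument, trigger1, trigger2>
# pairs: [argument-type id, trigger event-type id, structure-matches-type flag].
X_train = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 1],
                    [1, 0, 0], [2, 2, 1], [2, 0, 0]])
y_train = np.array([1, 0, 1, 0, 1, 0])     # 1 = valid combination, 0 = invalid

clf = SVC(kernel="linear", C=100.0)        # large C: effectively hard-margin here
clf.fit(X_train, y_train)

candidates = np.array([[1, 2, 1], [0, 2, 0]])
keep = clf.predict(candidates)             # pairs predicted 0 are discarded
print(keep.tolist())  # [1, 0]
```

Only the pairs the classifier accepts are assembled into the final biological event output.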
4. Output layer
The system outputs all biomedical event information contained in the biomedical text input by the user, including the event types and the trigger words and arguments of each event. For example, if the user inputs the sentence "This cellular interaction was tumor-specific, although isolated granules could enhance fibroblast proliferation.", the system should identify the events:
Event E1 (Type: Cell_proliferation, Trigger: proliferation, Theme: fibroblast);
Event E2 (Type: Positive_regulation, Trigger: enhance, Theme: E1).

Claims (1)

1. A joint biomedical event extraction method based on a copy mechanism, characterized in that the steps are as follows:
(1) Construct the input vectors
Joint biological event extraction is performed on biomedical text;
First, the biomedical text must be preprocessed to form the input of the framework; preprocessing includes the following steps:
(1) The acquired corpus and a large-scale biomedical domain corpus are fed into word2vec together, and training produces the word vector of each word;
(2) The word vector of each word in the corpus is obtained by vocabulary lookup, forming the input of the model;
(2) Overview of the framework
An Encoder-Decoder model based on the attention mechanism is used; the Encoder module encodes the input sentence, and the Decoder module processes the sentence's encoding vector and attention vector to generate the predicted trigger word-argument pairs;
(3) Encoder module
The bidirectional propagation mechanism of the BiLSTM model obtains the contextual information of each word in the sentence, yielding a more comprehensive and accurate semantic representation; the Encoder module of this model uses a BiLSTM to obtain the encoded representations of the words and of the sentence;
The specific formulas are as follows:
Input of the Encoder module:
X represents a sentence input to the model, x_t represents the t-th word in the sentence, and n represents the length of the sentence;
X = (x_1, x_2, …, x_n)    (1)
The output of the Encoder module at step t is h^E_t, obtained by concatenating the output →h_t of the forward LSTM at step t and the output ←h_t of the backward LSTM at step t:
h^E_t = [→h_t ; ←h_t]    (8)
→h_t represents the output of the forward LSTM at step t; W_O, W_C, W_i, W_f represent the corresponding weights;
b_o, b_C, b_i, b_f represent the corresponding biases; →h_{t-1} is the hidden state of the forward LSTM at step t-1; →h_0 is a randomly initialized parameter; σ is the activation function;
The specific derivation:
f_t = σ(W_f·[→h_{t-1}; x_t] + b_f)    (2)
i_t = σ(W_i·[→h_{t-1}; x_t] + b_i)    (3)
C̃_t = tanh(W_C·[→h_{t-1}; x_t] + b_C)    (4)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t    (5)
o_t = σ(W_O·[→h_{t-1}; x_t] + b_o)    (6)
→h_t = o_t ⊙ tanh(C_t)    (7)
The derivation of ←h_t is identical to that of →h_t: the input X = (x_1, x_2, …, x_n) is reversed, i.e. X_1 = (x_n, x_{n-1}, …, x_1) is used as the input of the Encoder module, and applying formulas (2)-(7) yields ←h_t;
(4) Decoder module
The method does not distinguish whether a predicted argument is a relationship between two trigger words or a relationship between a trigger word and an entity; all are uniformly treated as relationships between trigger words;
The inputs of the Decoder module are s, c_t, and v_t; s is the sentence encoding vector obtained by the Encoder module, c_t is the attention vector at step t, and v_t is the word vector of the argument or trigger word predicted at step t-1; →h_n and ←h_n are respectively the final hidden states of the forward and backward passes of the Encoder module, h^D_t is the hidden state of the Decoder module at step t, o^E_t is the output of the Encoder module at step t, and o^D_t is the output of the Decoder module at step t; f_a, f_h, f_k are activation functions, W_h, W_k, W_a, U_a, U_h are the corresponding weights, and b_a, b_h, b_k are the corresponding biases;
s is expressed as:
s = [→h_n ; ←h_n]    (9)
The derivation of c_t is as follows:
e_{ti} = f_a(W_a·h^D_{t-1} + U_a·o^E_i + b_a)    (10)
α_{ti} = exp(e_{ti}) / Σ_{j=1}^{n} exp(e_{tj})    (11)
c_t = Σ_{i=1}^{n} α_{ti}·o^E_i    (12)
The output of each step of the Decoder module:
h^D_t = f_h(W_h·[c_t ; v_t] + U_h·h^D_{t-1} + b_h)    (13)
u_t = [c_t ; v_t]    (14)
o^D_t = f_k(W_k·h^D_t + b_k)    (15)
For o^D_t: when t % 3 = 1 (t = 1, 4, 7, …), step t identifies an argument; when t % 3 = 2 (t = 2, 5, 8, …), it identifies the first trigger word; when t % 3 = 0 (t = 3, 6, 9, …), it identifies the second trigger word; that is, every 3 steps identify one trigger word-argument pair, expressed as:
<argument, trigger word 1, trigger word 2>;
(1) Identifying the argument:
An end mark is set; once the end mark is identified during argument identification, identification for the current sentence ends; W_qr and U_qNA are weights, b_qr and b_qNA are biases, and f_qr and f_qNA are activation functions;
The transformations (16)-(19) are applied to o^D_t:
q_r = f_qr(W_qr·o^D_t + b_qr)    (16)
q_NA = f_qNA(U_qNA·o^D_t + b_qNA)    (17)
q = [q_r ; q_NA]    (18)
q_a = softmax(q)    (19)
The category corresponding to the dimension of maximum probability in q_a is the argument category of this trigger word-argument pair; the dimension of q is the number of argument categories in the corpus plus 1; q_NA is the stop mark of relation discrimination, and once q_NA is identified, no further trigger word-argument pairs are identified in the current sentence;
(2) Identifying trigger word 1
A candidate trigger word is selected from the n words of the sentence as trigger word 1; because trigger word 1 is related to the input word vectors, the semantic information of the input must be added; f_p and f_pNA are activation functions; W_p and U_pNA are weights; b_p and b_pNA are biases;
o^E_i and o^D_t are processed as follows:
p_e^i = f_p(W_p·[o^E_i ; o^D_t] + b_p), i = 1, …, n    (20)
p_e = [p_e^1, p_e^2, …, p_e^n]    (21)
p_NA = f_pNA(U_pNA·o^D_t + b_pNA)    (22)
p = [p_e ; p_NA]    (23)
p_a = softmax(p)    (24)
The dimension of p_a is n + 1, where the first n dimensions represent the n words in the sentence and p_NA is the stop symbol for trigger word identification; p_a is obtained by normalizing p, and the word corresponding to the dimension of maximum probability in p_a is selected as trigger word 1;
(3) Identifying trigger word 2
When choosing a word in the sentence as candidate trigger word 2, the identified trigger word 1 must be excluded, because an argument connects two different trigger words; for this purpose, an array is set to record the position i of the trigger word identified in the previous step; after completing steps (20)-(24) of step (2), the result p is normalized to obtain p_a, and the value at the i-th position of p_a is set to 0; the word corresponding to the dimension of maximum probability in p_a is then selected as trigger word 2.
CN201811291947.7A 2018-11-01 2018-11-01 Biomedical event combined extraction method based on replication mechanism Active CN109446326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811291947.7A CN109446326B (en) 2018-11-01 2018-11-01 Biomedical event combined extraction method based on replication mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811291947.7A CN109446326B (en) 2018-11-01 2018-11-01 Biomedical event combined extraction method based on replication mechanism

Publications (2)

Publication Number Publication Date
CN109446326A true CN109446326A (en) 2019-03-08
CN109446326B CN109446326B (en) 2021-04-20

Family

ID=65550533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811291947.7A Active CN109446326B (en) 2018-11-01 2018-11-01 Biomedical event combined extraction method based on replication mechanism

Country Status (1)

Country Link
CN (1) CN109446326B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239445A * 2017-05-27 2017-10-10 中国矿业大学 Method and system for news event extraction based on neural network
US20180144248A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. SENTINEL LONG SHORT-TERM MEMORY (Sn-LSTM)
CN108628970A * 2018-04-17 2018-10-09 大连理工大学 Biomedical event joint extraction method based on a novel tagging scheme
CN108628828A * 2018-04-18 2018-10-09 国家计算机网络与信息安全管理中心 Joint extraction method for opinions and their holders based on self-attention


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEGEN HUANG et al.: "Drug–drug interaction extraction from biomedical literature using support vector machine and long short term memory networks", Information Sciences *
WANG Hong et al.: "Semantic relation extraction with attention-mechanism-based LSTM", Application Research of Computers *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377753A (en) * 2019-07-01 2019-10-25 Jilin University Relation extraction method and device based on relation trigger words and GRU model
CN110377753B (en) * 2019-07-01 2022-10-21 Jilin University Relation extraction method and device based on relation trigger words and GRU model
WO2021142630A1 (en) * 2020-01-14 2021-07-22 Siemens Ltd., China Method and apparatus for nlp based diagnostics
CN111859935A (en) * 2020-07-03 2020-10-30 Dalian University of Technology Method for constructing literature-based database of cancer-related biomedical events
CN113704481A (en) * 2021-03-11 2021-11-26 Tencent Technology (Shenzhen) Co., Ltd. Text processing method, device, equipment and storage medium
CN113704481B (en) * 2021-03-11 2024-05-17 Tencent Technology (Shenzhen) Co., Ltd. Text processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN109446326B (en) 2021-04-20

Similar Documents

Publication Publication Date Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
Kim et al. Two-stage multi-intent detection for spoken language understanding
Malte et al. Evolution of transfer learning in natural language processing
Dauphin et al. Zero-shot learning for semantic utterance classification
Ahmadvand et al. Contextual dialogue act classification for open-domain conversational agents
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN114169330A (en) Chinese named entity identification method fusing time sequence convolution and Transformer encoder
CN106649561A (en) Intelligent question-answering system for tax consultation service
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN112800776A (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN110263325A (en) Chinese automatic word-cut
CN111753058B (en) Text viewpoint mining method and system
CN109446326A (en) Biomedical event based on replicanism combines abstracting method
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN111222329B (en) Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system
CN114756675A (en) Text classification method, related equipment and readable storage medium
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
Xu et al. Convolutional neural network using a threshold predictor for multi-label speech act classification
CN112818698B (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN117033423A (en) SQL generating method for injecting optimal mode item and historical interaction information
Tang et al. Convolutional lstm network with hierarchical attention for relation classification in clinical texts
Pattanayak et al. Natural language processing using recurrent neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant