CN111859935B - Method for constructing cancer-related biomedical event database based on literature - Google Patents
- Publication number
- CN111859935B (application number CN202010629395.7A)
- Authority
- CN
- China
- Prior art keywords
- entity
- biomedical
- vector
- word
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention belongs to the technical field of natural language processing and provides a method for constructing a cancer-related biomedical event database from literature. The method comprises three parts: 1. joint extraction of biomedical entities and entity relations based on an entity relationship graph; 2. biomedical event extraction based on a hierarchical distillation network; 3. construction of a database of cancer-related biomedical events. Building on traditional methods, the method fully considers the characteristics of entities and context in biomedical text, solves the problems of multi-type and incomplete entity recognition in biomedical event extraction, and obtains deeper syntactic information from a hierarchical distillation network for biomedical event extraction, improving the precision of complex event extraction. It can help researchers in the biomedical field analyze texts automatically, provides search over known biomedical named entities and biomedical events, and assists researchers in studying and analyzing relevant biomedical literature.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and relates to a method for carrying out high-quality biomedical named Entity identification, Entity relationship extraction and biomedical event extraction on biomedical related documents, in particular to biomedical entities and Entity relationship combined extraction based on an Entity-relationship Graph (ERG) and biomedical event extraction based on a Hierarchical Distillation Network (HDN).
Background
A biomedical event database stores complete biomedical events, each consisting of a trigger word, typically a verb or a nominalized verb, and elements, which are typically biomedical entities or the trigger words of other events. Constructing the biomedical event database involves three steps: Word Representation, Biomedical Named Entity Recognition and Entity Relationship Extraction, and Biomedical Event Extraction.
Word representation means representing words as vectors of a specific dimension, so that the semantic information of a word is reflected in the vector space and rich semantic features can be obtained from large amounts of unlabeled corpora. To obtain more valuable information from biomedical texts, Lee et al. (BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR abs/1901.08746 (2019)) initialized the BioBERT model with a BERT model pre-trained on general corpora (English Wikipedia and BooksCorpus) and fine-tuned it on the biomedical corpora PubMed and PMC, obtaining a word representation model suited to the biomedical domain; the model achieves an entity recognition F-score of 89.71% on the NCBI disease corpus.
Existing biomedical named entity recognition and entity relation extraction models mainly comprise staged extraction models and joint extraction models. The staged approach ignores the correlation between entity recognition and the relation extraction task and suffers from error propagation. The classic joint extraction approach is to build an end-to-end model, such as the end-to-end joint model proposed by Miwa and Bansal (End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures [C]. Association for Computational Linguistics, 2016) for entity recognition and entity relation extraction, which incorporates dependency-path information into the model and achieves entity recognition and relation extraction F-scores of 83.4% and 55.6%, respectively, on ACE05. However, such models generally establish the connection between the two subtasks by parameter sharing; the connection is shallow and generates a large amount of redundant information. How to fully model the dependency between entity recognition and entity relation extraction, and how to build a deeply connected joint model of entities and entity relations, therefore still needs to be studied.
Biomedical event extraction refers to the process of automatically detecting, from the literature, descriptions of fine-grained interactions between biomolecules such as genes and proteins and biological organs and tissues, with the purpose of extracting structured information about predefined event types from unstructured text. Current biomedical event extraction models fall into two types, based respectively on a staged strategy and a joint strategy. The staged strategy identifies the trigger words and the elements of an event separately and then obtains the complete event representation through post-processing. For example, Miwa et al. (Boosting automatic event extraction from the literature using domain adaptation and coreference resolution [J]. Bioinformatics, 2012) improved extraction by using rich feature sets and fusing coreference resolution with a domain adaptation strategy; their EventMine system reached an F-score of 57.98% on the BioNLP'11 corpus. The staged strategy is easy to understand and implement, but it has obvious defects such as cascading errors. The joint strategy performs subtasks such as trigger-word and element recognition jointly, which avoids error propagation between subtasks to some extent and reduces cascading errors. For example, in early work based on shallow machine learning, Poon et al. (Joint inference for knowledge extraction from biomedical literature [C]. Association for Computational Linguistics, 2010) adopted a Markov logic network with hand-crafted first-order logic formulas to jointly extract trigger words and elements, obtaining an F-score of 50.0% on the BioNLP'09 test set. Although Markov logic networks can avoid cascading errors, they do not make good use of large numbers of features. Li et al. (Extracting biomedical events with dual decomposition integrating word embeddings [J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015) added word-vector features containing more syntactic and semantic information, reaching an F-score of 53.19% on the BioNLP'13 test set. However, whether based on the staged or the joint strategy, the above biomedical event extraction models mostly require complicated feature engineering and depend on additional expert knowledge. In recent years, owing to the advantages deep learning has shown in capturing implicit textual information, several deep-learning-based biomedical event extraction models have been proposed. For example, Yu et al. (LSTM-Based End-to-End Framework for Biomedical Event Extraction [J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019) represent the raw corpus with word embeddings under a joint strategy, feed it to a Bi-LSTM for trigger recognition, and simultaneously perform trigger-entity detection with a Tree-LSTM. The F-scores of this model on the BioNLP'13, BioNLP'11, and BioNLP'09 test sets reached 57.39%, 58.23%, and 59.68%, respectively.
From the above analysis, although biomedical event extraction research has produced abundant results, problems such as complex event extraction still require systematic and in-depth study. Although many biomedical ontologies and databases exist, such as the Gene Ontology and pathway database systems, and related work has established binary interaction networks such as protein-protein interaction networks, how to construct finer-grained databases for cancer remains a research topic of great concern to many researchers.
Disclosure of the invention
The invention provides a biomedical entity-relation joint extraction model based on an entity relationship graph and a biomedical event extraction system based on a hierarchical distillation network. It solves the problems that prior research cannot fully capture long-range contextual semantic information or fully express syntactic features, and improves the accuracy of event extraction, thereby enabling the construction of a fine-grained cancer-related biomedical event database.
The invention mainly comprises three parts: 1. joint extraction of biomedical entities and entity relations based on an Entity Relationship Graph (ERG); 2. biomedical event extraction based on a Hierarchical Distillation Network (HDN); 3. construction of a cancer-related biomedical event database. Biomedical named entity recognition and entity relation extraction are necessary prerequisites for biomedical event extraction and for constructing the biomedical event interaction network; the results are finally presented in the form of a cancer-related biomedical event database.
The technical scheme adopted by the invention is as follows:
the method for constructing the cancer-related biomedical event database based on literature comprises the following steps:
(I) obtaining and preprocessing model training test corpus
(1) 261-; (2) segment the biomedical documents into words and sentences, adding [CLS] and [SEP] tokens as the beginning and end markers of each sentence, respectively; (3) tokenize the text obtained in step (2) using special symbols; (4) train with the pre-trained biomedical word representation (BioBERT) model to obtain word vectors. In the word-vector training process, first obtain the public BioBERT code, then fine-tune it on the biomedical corpora PubMed and PMC and save the model parameters, and finally process the MLEE corpus with the fine-tuned model to obtain word vectors;
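The segmentation and marker-insertion steps above can be sketched as follows; the regex-based sentence splitter and tokenizer are simplifications (a real pipeline would use the BioBERT WordPiece tokenizer):

```python
import re

def preprocess(document: str) -> list[list[str]]:
    """Split a document into sentences, wrap each in [CLS]/[SEP],
    and tokenize on word/punctuation boundaries (steps (2)-(3))."""
    # Naive sentence splitter: break on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    tokenized = []
    for sent in sentences:
        if not sent:
            continue
        # Naive word segmentation; BioBERT's WordPiece tokenizer
        # would be used here in the actual system.
        tokens = re.findall(r"\w+|[^\w\s]", sent)
        tokenized.append(["[CLS]"] + tokens + ["[SEP]"])
    return tokenized

doc = "TIMP-3 inhibits angiogenesis. It is down-regulated in melanoma."
for sent in preprocess(doc):
    print(sent)
```

Each sentence comes back as a token list bracketed by the `[CLS]` and `[SEP]` markers described in step (2).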
(II) biomedical entities and entity relationship joint extraction based on Entity Relationship Graph (ERG)
In order to realize deep combination of biomedical entity identification and entity relationship extraction tasks, the invention provides a biomedical entity and entity relationship combination extraction model based on an Entity Relationship Graph (ERG). In the model, firstly, an entity recognition module recognizes entities in an input text for a specific entity type, then an entity relation directed graph of the specific relation type is constructed, and all related entity pairs in the graph are output; the model is divided into three modules:
(1) input module
Encode the input source sentence using the pre-trained biomedical word representation (BioBERT) model described above. The input is S = {[CLS], w_1, w_2, ..., w_n, [SEP]}, where [CLS] and [SEP] are the beginning and end markers of the sentence. Extract the hidden-layer features H_E and the sentence vector X:

H_E, X = Encoder(S)

where h_i^E ∈ R^{d_E} is the hidden-layer output of the i-th word and X ∈ R^{d_E} is the vector representation of the entire sentence (in the model implementation, the [CLS] vector is used instead); d_E is the number of hidden-layer units of the Encoder;
(2) entity identification module
In biomedical texts, an entity may consist of multiple words and may belong to multiple types, but most traditional models, such as copy-mechanism models, can only recognize the last word of an entity, so the recognized entity lacks integrity and can belong to only one type. This module models the entity type as a mapping from the entity's beginning word to its ending word, rather than classifying the entity after it is identified; that is, it learns a type-specific tagger f_etype(w_start) → w_end, which solves the problems of multi-type and incomplete entity recognition;
firstly, generating a hidden layer characteristic H according to the previous step E Marking out which words in the sentence are most likely to be the beginning of an entity, and then constructing a marker for each entity type to mark out which words in the sentence are most likely to be the end of a specific type of entity; the specific operation is as follows:
wherein,et k as entity type etype k The embedding of (a) into (b),representing the likelihood that the ith word is identified as the entity start word,representing that the ith word is recognized as an type k The possibility of the end word of the type entity is classified into two categories, and the category is defined as 1 when the value of the category exceeds a certain threshold, otherwise, the category is 0; then, the entity in the sentence is identified according to the minimum span principleSince entity identification is performed for a specific entity type, the entity representation already includes its type information, and k represents that the type of this entity is type k ;
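The start/end tagging scheme can be illustrated with a small decoding sketch; the threshold value and the toy scores below are illustrative, not the patent's:

```python
THRESHOLD = 0.5  # illustrative; the patent only says "a certain threshold"

def decode_entities(start_probs, end_probs_by_type, threshold=THRESHOLD):
    """Decode entity spans from per-token start scores and per-type end
    scores, pairing each start with the nearest following end of the
    same type (the minimum-span principle)."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    entities = []
    for etype, end_probs in end_probs_by_type.items():
        ends = [j for j, p in enumerate(end_probs) if p > threshold]
        for i in starts:
            # Minimum span: nearest end position at or after the start.
            candidates = [j for j in ends if j >= i]
            if candidates:
                entities.append((i, min(candidates), etype))
    return entities

# Toy scores for a 6-token sentence: one Protein entity spanning tokens 1-2.
start = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1]
ends = {"Protein": [0.0, 0.3, 0.8, 0.1, 0.0, 0.0]}
print(decode_entities(start, ends))  # [(1, 2, 'Protein')]
```

Because each end-tagger is type-specific, the decoded span carries its type directly, matching the mapping-based formulation above.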
(3) Entity Relationship Graph (ERG)
Take the entities output in the previous step as nodes of the entity relationship graph and initialize the graph. Assuming there are R relation types among the entities, the graph is represented by R+1 asymmetric adjacency matrices A^(0), A^(1), ..., A^(R); an element a_ij^(s) = 1 in the adjacency matrix means that head entity e_i and tail entity e_j have relation r_s between them, and 0 means no relation; A^(0) is a zero matrix. Then, for each entity relation type r_s, a scorer is trained that scores every entity pair (e_i, e_j), where W is a trainable parameter and f is a feature-fusion function. When the computed score exceeds a threshold λ, the corresponding element a_ij^(s) in the adjacency matrix A^(s) is set to 1, and to 0 otherwise; after all entity pairs have been predicted, the entity relationship graph corresponding to relation type r_s is obtained. The model trains R scorers simultaneously, generating R entity relationship graphs corresponding to the R different relation types; a_ij^(s) = 1 means there is a relation triplet <e_i, e_j, r_s>. Finally, the R entity relationship graphs are integrated and all related entity pairs are output, i.e., the two entities <e_i, e_j> corresponding to every nonzero element;
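A minimal sketch of the adjacency-matrix bookkeeping, assuming scores have already been produced by the per-relation scorers (the scorer itself is omitted):

```python
import numpy as np

def build_relation_graphs(num_entities, scored_pairs, num_rel, lam=0.5):
    """Build R adjacency matrices A^(1..R) from scored entity pairs and
    read out all relation triplets <e_i, e_j, r_s>.  scored_pairs maps
    (i, j, s) -> score from the relation-s scorer; lam is the threshold."""
    A = np.zeros((num_rel + 1, num_entities, num_entities))  # A^(0) stays zero
    for (i, j, s), score in scored_pairs.items():
        if score > lam:
            A[s, i, j] = 1  # directed edge: head e_i -> tail e_j under r_s
    triplets = [(int(i), int(j), s)
                for s in range(1, num_rel + 1)
                for i, j in zip(*np.nonzero(A[s]))]
    return A, triplets

# Toy scores: entity 0 -> 1 under relation 1, entity 2 -> 0 under relation 2.
scores = {(0, 1, 1): 0.9, (1, 0, 1): 0.2, (2, 0, 2): 0.7}
A, triplets = build_relation_graphs(3, scores, num_rel=2)
print(triplets)  # [(0, 1, 1), (2, 0, 2)]
```

The asymmetry of each A^(s) is what preserves the head/tail direction of a relation.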
When the model finishes, all entity pairs <entity 1, entity 2> with a relation in the text have been extracted. The entity type is treated as a mapping from an entity start word to an entity end word in the entity recognition module; when the entity relationship graph is constructed, the directionality of the entity relation is considered and a relation-specific scorer is built on entity representations fused with entity type information, realizing deep joint modeling of entities and entity relations, as shown in FIG. 1;
(III) layered distillation network (HDN) based biomedical event extraction
The invention provides a biomedical event extraction model based on a Hierarchical Distillation Network (HDN), building on existing deep-learning-based biomedical event extraction models. The model acquires syntactic information at different levels through a bidirectional GRU and a multi-layer Gated Graph Network (GGN), emphasizes the differences between sentence representations of different levels through a distillation module, and introduces residual connections to integrate the distilled sentence representations of all levels, so that the resulting sentence representation expresses richer syntactic features and captures longer-range contextual information. Finally, the trigger-word recognition task and the element recognition task are treated as a sequence labeling task and a multi-classification task, respectively, and post-processing generates the final biomedical events. The specific flow of the model is as follows:
(1) input module
The model first embeds the input sentence W = {w_1, w_2, ..., w_L} to obtain the sentence vector X = {x_1, x_2, ..., x_L}, which includes the following parts:
① Word vector: encode the input original sentence with the pre-trained biomedical word representation (BioBERT) model introduced above to obtain the contextual semantics of each word;
② Entity type vector: the entity representations obtained by the joint biomedical entity and entity-relation extraction model introduced above already contain entity type information, so they serve as the entity type vectors of this module;
③ Relative position vector: the relative position vector is used only in the element recognition task. The relative position of a candidate trigger word and a candidate element is computed from q and m, the starting index and length of the candidate trigger word or element, and L, the length of the sentence; the relative position vector is then obtained by looking up a randomly initialized position-embedding table;
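The exact relative-position formula is not reproduced in the text; the following is one common formulation consistent with the description (signed distance to the candidate span, zero inside it, normalized by the sentence length L). It is an assumption, not the patent's formula:

```python
def relative_position(i, q, m, L):
    """Signed distance from token i to the candidate span [q, q+m),
    clipped to 0 inside the span and normalized by sentence length L.
    A common formulation; the patent's exact formula is not shown."""
    if i < q:
        d = i - q            # token precedes the candidate
    elif i < q + m:
        d = 0                # token inside the candidate span
    else:
        d = i - (q + m - 1)  # token follows the candidate
    return d / L

# Candidate trigger at tokens 2-3 (q=2, m=2) in a 6-token sentence.
print([relative_position(i, 2, 2, 6) for i in range(6)])
```

The resulting integer offsets (before normalization) are what would index the randomly initialized position-embedding table.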
(2) bidirectional GRU module
The sentence vector X = {x_1, x_2, ..., x_L} obtained by the input module is the input of this module. The bidirectional GRU consists of two recurrent computation units: a forward and a backward GRU unit, updated continuously following the order of the sentence. The forward and backward GRUs produce the forward hidden vector and the backward hidden vector, respectively, which are concatenated to obtain the sentence sequence vector H = (h_1, h_2, ..., h_L). The specific computation of the module is:

z_t = σ(W_z · [h_{t-1}, x_t]),  r_t = σ(W_r · [h_{t-1}, x_t])   (3.2)
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])   (3.3)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t   (3.4)

where z_t and r_t are the update and reset gates of the GRU unit, W_(·) are trainable parameters, σ(·) and tanh(·) are activation functions, and h̃_t is the hidden vector after passing through the reset gate;
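The GRU update described here can be sketched in NumPy; the weight shapes and random toy inputs are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, Wz, Wr, Wh):
    """One GRU update: update gate z_t, reset gate r_t, candidate
    state h~_t, then the gated blend of old and candidate states."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                  # z_t = sigma(W_z . [h_{t-1}, x_t])
    r = sigmoid(Wr @ hx)                  # r_t = sigma(W_r . [h_{t-1}, x_t])
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde # h_t

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
Wz, Wr, Wh = (rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):       # run over a 5-step toy sequence
    h = gru_step(h, x, Wz, Wr, Wh)
print(h.shape)  # (4,)
```

A backward GRU runs the same step over the reversed sequence; concatenating both hidden states per token gives the h_t used downstream.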
(3) gated Graph Network (GGN) module
Defining graph G ═ (V, E) as a dependency syntax tree for sentences, where V ═ V (V, E) 1 ,v 2 ,...,v L ),E=(e 1 ,e 2 ,...,e M ) Respectively, node set and edge set of the graph G, and the module obtains the node feature vector H in the last step (l) And the adjacency matrix A of the graph is used as input, and a new node characteristic vector H is output (l+1) In the transmission process, the GGN adopts a door mechanism, and the calculation process is as follows:
U=σ(W U ·[Y,H (l) ])R=σ(W R ·[Y,H (l) ]) (3.5)
wherein,is used for carrying out symmetrical normalization on the characteristic vector, Y is the normalized vector, W (·) For trainable parameters, σ (-) and relu (-) are activation functions, U and R are update and reset gates, respectively, of the GGN,is the node feature vector after the GGN reset gate. In the GGN, the edge type between nodes is a non-directional edge. Along with the superposition of GGNs, a node can exchange information with other nodes which are farther away, so that the characteristic vector of the node contains more syntactic information, for example, 1 layer of GGNs can induce information interaction between adjacent nodes in a graph, and according to the superposition of another layer of GGNs on the previous layer of GGNs, the node can carry out information interaction with nodes which are equal to or less than 2 in distance, so that the model uses a first-order syntactic vector to represent the output of a first-layer GGN, uses a second-order syntactic vector to represent the output of a second-layer GGN, and so on;
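The growing receptive field from stacking graph layers can be illustrated with adjacency-matrix powers over a toy dependency tree (the gating itself is omitted; this only shows which nodes can exchange information after k layers):

```python
import numpy as np

# Dependency tree of a 5-token sentence as an undirected adjacency matrix
# (edges: 0-1, 1-2, 2-3, 3-4), with self-loops added.
A = np.eye(5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1

# After stacking k propagation layers, node i can see node j iff
# (A^k)[i, j] > 0 -- the "order-k syntactic information" described above.
reach = A.copy()
for k in range(1, 4):
    print(f"layer {k}: token 0 sees tokens {np.nonzero(reach[0])[0].tolist()}")
    reach = reach @ A
```

Each added layer widens token 0's neighborhood by one hop along the dependency tree, which is why deeper GGN outputs are called higher-order syntactic vectors.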
(4) distillation module
Because GGN modules are stacked, the syntactic vectors of different orders inevitably overlap in feature space and cause redundancy, so a distillation module is designed to reconstruct the order-wise syntactic vectors obtained above. The module emphasizes the differences between the syntactic vectors of different orders of a sentence. Taking the first-order and second-order syntactic vectors as an example, the module comprises three core operations:
① Alignment: the first-order and second-order vectors are aligned against each other, where W_E is a trainable parameter, e_{i,j} is the fused vector, and the two attention weights represent the weights of the first-order and second-order vectors during alignment;
② Comparison: by comparing the l-th-order syntactic vector with its aligned counterpart, the distillation ratio of the l-th-order syntactic vector is obtained;
③ Distillation: the distilled l-th-order syntactic vector is obtained by the distillation operation.
Finally, the distilled syntactic vectors of all orders are integrated through residual connections to obtain the final sentence representation R = (r_1, r_2, ..., r_L), which expresses richer syntactic features and captures longer-range contextual information;
(5) Trigger word recognition and element recognition
① Trigger word recognition
In the model, trigger-word recognition is treated as a sequence labeling task using the BILOU (Begin, Inside, Last, Outside, Unit) labeling scheme. At the i-th time step, the representation r_i of the i-th word is fed to a Softmax classifier to predict its trigger label:

p(y_i | x) = softmax(W_te · r_i + b_te)   (3.13)

where W_te and b_te are trainable parameters;
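Decoding the per-token BILOU labels into trigger spans might look like this; the toy tags are illustrative:

```python
def decode_bilou(tokens, tags):
    """Turn per-token BILOU tags (argmax of the Softmax classifier)
    into trigger spans.  B=Begin, I=Inside, L=Last, O=Outside, U=Unit."""
    triggers, start = [], None
    for i, tag in enumerate(tags):
        if tag == "U":            # single-token trigger
            triggers.append((i, i))
            start = None
        elif tag == "B":          # span opens
            start = i
        elif tag == "L" and start is not None:
            triggers.append((start, i))   # span closes
            start = None
        elif tag == "O":          # outside any trigger
            start = None
    return [" ".join(tokens[s:e + 1]) for s, e in triggers]

tokens = "TIMP-3 negatively regulates tumor angiogenesis".split()
tags = ["O", "B", "L", "O", "U"]
print(decode_bilou(tokens, tags))  # ['negatively regulates', 'angiogenesis']
```

The recovered trigger strings are then paired with entities (or other triggers) for the element recognition step.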
② element identification
In this model, element recognition is treated as a multi-classification task: each trigger-entity pair or trigger-trigger pair is classified with a Softmax classifier:

p(y | x) = softmax(W_ae · [t, a] + b_ae)   (3.14)

where t is the trigger vector, a is the vector of the other entity or trigger, and W_ae and b_ae are trainable parameters;
(6) Post-processing
The post-processing step combines the trigger words and elements obtained in the previous step into complete events, i.e., event detection. In the model, an SVM classifier automatically learns the legal event structure of each event type from extracted features, then forms candidate events, and finally determines the event type of each candidate event.
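A simplified sketch of the post-processing step: candidate events are kept only if their argument structure is legal for the predicted event type. The hard-coded legal structures here are an illustrative stand-in for what the SVM classifier learns from data:

```python
# Legal argument structures per event type (illustrative stand-in for
# the structures the SVM classifier learns in the post-processing step).
LEGAL_STRUCTURES = {
    "Negative_regulation": {("Theme",), ("Cause", "Theme")},
    "Blood_vessel_development": {(), ("Theme",)},
}

def assemble_events(triggers, arguments):
    """Combine recognized triggers with their classified arguments and
    keep only candidates whose argument-role multiset is legal for the
    predicted event type."""
    events = []
    for trig_id, (text, etype) in triggers.items():
        roles = tuple(sorted(role for t, role, _ in arguments if t == trig_id))
        legal = {tuple(sorted(s)) for s in LEGAL_STRUCTURES.get(etype, set())}
        if roles in legal:
            events.append((etype, text,
                           [a for a in arguments if a[0] == trig_id]))
    return events

triggers = {0: ("inhibits", "Negative_regulation")}
arguments = [(0, "Theme", "angiogenesis")]
print(assemble_events(triggers, arguments))
```

Candidates whose role combination is not licensed by the event type are simply dropped, which is the filtering role post-processing plays here.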
The biomedical event extraction model based on the Hierarchical Distillation Network (HDN) can capture richer syntactic features and longer-range contextual information; it then treats trigger-word recognition and element recognition as sequence labeling and multi-classification tasks, respectively, and post-processing generates the final biomedical events. The model achieved an F-score of 62.74% on MLEE, 3.13 percentage points higher than the current best model, as shown in FIG. 2;
(IV) cancer-related biomedical event database construction
The biomedical entity and biomedical event information required to construct the cancer-related biomedical event database are obtained by the above method and stored in the database. The constructed cancer-related biomedical event database comprises a biomedical entity table, a biomedical event table, and others, as shown in the following table:
Table 1. Database tables
The invention builds a biomedical event extraction platform based on joint entity and entity-relation extraction and event extraction in the biomedical field, and provides a query service for researchers. Biomedical event extraction analyzes biomedical documents and reveals potential fine-grained, complex event relationships across different system levels, including the molecular, cell, and tissue levels. Literature on cancer genetics is currently growing massively; for such a volume of documents, automatically obtaining valuable information with information extraction technology has positive significance for studying the formation of cancers, discovering drugs to treat cancer, diagnosing cancer early, improving the survival rate of cancer patients, and reducing the medical cost of cancer.
The beneficial effects of the invention are as follows: building on traditional methods, the method fully considers the characteristics of entities and context in biomedical text, solves the problems of multi-type and incomplete entity recognition in biomedical event extraction, obtains deeper syntactic information from a hierarchical distillation network (HDN) for biomedical event extraction, and improves the precision of complex event extraction. It can help researchers in the biomedical field analyze texts automatically, provides search over known biomedical named entities and biomedical events, and assists researchers in studying and analyzing relevant biomedical literature.
Drawings
FIG. 1 is a schematic diagram of biomedical entity and entity relationship joint extraction based on an Entity Relationship Graph (ERG).
Fig. 2 is a diagram of the hierarchical distillation network (HDN) based biomedical event extraction model.
FIG. 3 is a database E-R diagram.
Detailed Description
The system of the invention can automatically perform word representation, biomedical entity and entity-relationship extraction, and biomedical event extraction on a given text, which greatly helps researchers analyze the biomedical events described in a large body of literature. The system adopts a B/S (Browser/Server) architecture, is built with the Django framework, and is implemented mainly with technologies such as HTML (hypertext markup language) and CSS (cascading style sheets). It is divided into a view layer, a logic layer and a data layer, as shown in Table 2:
Table 2 Database system architecture
1. The user inputs the text to be parsed
The text input supports two modes: keyboard input and local file upload. The view layer receives the text to be analyzed input by the user, submits it to the logic layer, and stores it in the data layer. Suppose the text to be analyzed is "We have used retrovirus-mediated gene delivery to effect sustained expression of TIMP-3 in human neuroblastoma and murine melanoma cells in order to test further the capacity of TIMPs to inhibit angiogenesis in vivo". The user can either 1. input the text directly through the page text box; or 2. save the text in a format such as txt or doc and upload it as a file. The first approach is suitable for short texts, the second for long texts.
2. The system analyzes the text to be analyzed
Realizing this function requires the logic layer and the data layer of the system to work in coordination. The specific steps are as follows:
(1) After the logic layer preprocesses the text to be analyzed (sentence splitting, word segmentation and the like), the text is used as the input of the fine-tuned pre-trained biomedical word representation (BioBERT) model to obtain the word vectors and sentence vector of the text.
(2) The output of step (1) is used as the input of the joint entity and entity-relationship extraction model. The entity recognition module first obtains the biomedical entities of each specific entity type, as shown in the following table:
Table 3 Biomedical entity extraction results
E1 | E2 | E3 | E4 | E5 |
TIMP-3 | TIMPs | neuroblastoma | melanoma tumor cells | murine |
Because this module treats each entity type as a mapping from the entity start word to the entity end word, the obtained entity representations contain the entities' type information. Entity relationship graphs specific to each entity relation type are then constructed and integrated, and all related entity pairs are output: <TIMP-3, TIMPs>; <TIMPs, TIMP-3>; <neuroblastoma, melanoma tumor cells>; <melanoma tumor cells, neuroblastoma>.
(3) Biomedical event extraction is then performed. The word vectors and entity type vectors (i.e., the entity representations from step (2)) obtained above are input into the biomedical event extraction model based on the hierarchical distillation network (HDN) for trigger word recognition; element detection is performed on each trigger word-entity and trigger word-trigger word pair, and the biomedical events are then obtained through a post-processing step, as shown in Table 4:
Table 4 Biomedical event extraction results
T1 | Blood_vessel_development:angiogenesis |
T2 | Gene_expression:expression Theme:TIMP-3 |
T3 | Positive_regulation:effect Theme:expression |
T4 | Negative_regulation:inhibit Theme:angiogenesis Cause:TIMPs |
The trigger word of T1 is angiogenesis and it has no corresponding element; that is, the trigger word by itself constitutes a simple event.
(4) The output results of steps (1) to (3) are delivered to the data layer for storage; at the same time, the view layer feeds the visualized results back to the user, and the cancer-related biomedical event database is constructed.
3. User retrieval of biomedical events
When the system has completed the extraction of biomedical events from the input text, the whole event set can be presented as an interactive event graph. For example, after the user inputs the text, the biomedical event extraction interface shows a graph G containing m + n nodes; G represents the complete set of biomedical events in which m entities and n trigger words participate. Each node is colored according to its identity (trigger word or entity), and the edge types between nodes represent element roles or entity relations, distinguished by edge color. If the trigger word of an event A in the graph is an element of an event B, then event B is a complex event. In addition, the user may search for an entity in the interface to view the biomedical events related to that entity.
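The event graph described above can be sketched as typed nodes and typed edges. The dict-based graph below is a stand-in for the actual visualization layer (an illustrative assumption); the node and edge data are taken from Tables 3 and 4.

```python
# Entities and trigger words become nodes, element roles become typed edges.
nodes = {
    "angiogenesis": "trigger", "expression": "trigger",
    "effect": "trigger", "inhibit": "trigger",
    "TIMP-3": "entity", "TIMPs": "entity",
}
edges = [                                    # (source trigger, role, target)
    ("expression", "Theme", "TIMP-3"),       # T2 Gene_expression
    ("effect", "Theme", "expression"),       # T3 Positive_regulation
    ("inhibit", "Theme", "angiogenesis"),    # T4 Negative_regulation
    ("inhibit", "Cause", "TIMPs"),
]

def is_complex(trigger):
    # An event is complex if one of its elements is itself a trigger word.
    return any(src == trigger and nodes.get(dst) == "trigger"
               for src, _, dst in edges)

m = sum(1 for v in nodes.values() if v == "entity")     # entity nodes
n = sum(1 for v in nodes.values() if v == "trigger")    # trigger nodes
```

Here `is_complex("effect")` holds because the element of T3 is the trigger word of T2, matching the complex-event criterion in the text.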
Claims (1)
1. A method for constructing a cancer-related biomedical event database based on literature is characterized by comprising the following steps:
(I) Obtaining and preprocessing the model training and test corpus
(1) 261-; (2) segmenting the biomedical documents into sentences, and adding the [CLS] and [SEP] characters as the beginning and end marks of each sentence respectively; (3) cutting the text obtained in step (2) into word pieces using special symbols; (4) training with the pre-trained biomedical word representation model to obtain word vectors; in the process of training the word vectors, the code of the public BioBERT is first obtained, fine-tuning is then completed using the PubMed and PMC corpora of the biomedical field and the model parameters are saved, and finally the MLEE corpus is processed with the fine-tuned model to obtain the word vectors;
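Steps (2) and (3) above can be sketched as follows. The naive sentence splitter, the toy vocabulary, and the "##" continuation marker are illustrative assumptions; the patent only says "special symbols" are used to cut words.

```python
import re

def split_sentences(document: str) -> list[str]:
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]

def mark_sentence(tokens: list[str]) -> list[str]:
    # [CLS] and [SEP] mark the beginning and end of every sentence.
    return ["[CLS]"] + tokens + ["[SEP]"]

def wordpiece(word: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match sub-word segmentation; "##" marks continuations.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:            # no matching piece: unknown word
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"angio", "##genesis", "TIMP", "##-3", "inhibits"}
sents = split_sentences("TIMP-3 inhibits angiogenesis. It is a TIMP.")
tokens = mark_sentence(
    [p for w in sents[0].rstrip(".").split() for p in wordpiece(w, vocab)]
)
```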
(II) Joint extraction of biomedical entities and entity relationships based on the entity relationship graph
In order to deeply combine the biomedical entity recognition and entity relationship extraction tasks, a joint biomedical entity and entity-relationship extraction model based on an entity relationship graph is provided. In this model, the entity recognition module first recognizes the entities of each specific entity type in the input text; entity relationship directed graphs for each specific relation type are then constructed, and all related entity pairs in the graphs are output. The model is divided into three modules:
(1) input module
The input source sentence is encoded with the pre-trained biomedical word representation model. The input is S = {[CLS], w_1, w_2, ..., w_n, [SEP]}, where [CLS] and [SEP] are the beginning and end marks of the sentence respectively. The hidden-layer features H^E and the sentence vector X are extracted: H^E, X = Encoder(S);
where h_i^E ∈ R^{d_E} is the hidden-layer output of the i-th word and X ∈ R^{d_E} is the vector representation of the entire sentence (in the model implementation, the vector of [CLS] is used instead); d_E is the number of hidden-layer units of the Encoder;
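The input module's shape contract can be sketched as follows. The stand-in random encoder, the toy hidden size d_E = 8 and the example sentence are illustrative assumptions; in the patent the Encoder is the fine-tuned BioBERT model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_E = 8                         # number of hidden-layer units of the Encoder (toy size)

def encoder(tokens):
    # Stand-in for the fine-tuned BioBERT: one d_E-dimensional hidden
    # vector h_i^E per input token.
    H_E = rng.standard_normal((len(tokens), d_E))
    X = H_E[0]                  # the [CLS] vector is used as the sentence vector
    return H_E, X

S = ["[CLS]", "TIMP-3", "inhibits", "angiogenesis", "[SEP]"]
H_E, X = encoder(S)             # H_E: hidden-layer features, X: sentence vector
```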
(2) entity identification module
This module models each entity type as a mapping from the entity's beginning word to its ending word, rather than classifying entities after they are identified; that is, it learns a marker f_etype(w_start) → w_end specific to each entity class, which solves the problems of multi-type entity recognition and incomplete entity recognition;
First, according to the hidden-layer features H^E generated in the previous step, the words in the sentence most likely to be the beginning of an entity are marked; then a marker is constructed for each entity type, marking the words in the sentence most likely to be the end of an entity of that specific type. The specific operation is as follows:
where et_k is the embedding of entity type etype_k, p_i^start represents the likelihood that the i-th word is identified as an entity start word, and p_i^{end_k} represents the likelihood that the i-th word is identified as the end word of an etype_k-type entity. Both are binary classifications: the label is defined as 1 when the value exceeds a certain threshold and 0 otherwise. The entities in the sentence are then identified according to the minimum-span principle. Since entity identification is performed for each specific entity type, the entity representation already includes the type information, and k indicates that the entity's type is etype_k;
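The start/end tagging and minimum-span pairing can be sketched as below. The sigmoid scorers, the additive fusion of h_i with et_k, the threshold of 0.5 and the toy entity types are illustrative assumptions; the patent's exact scoring formulas are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
H = rng.standard_normal((6, d))              # hidden vectors h_i from the encoder
entity_types = ["Gene_or_gene_product", "Cell"]
et = {k: rng.standard_normal(d) for k in entity_types}   # type embeddings et_k

w_start = rng.standard_normal(d)
w_end = rng.standard_normal(d)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
threshold = 0.5

# p_i^start: likelihood that word i starts an entity.
p_start = sigmoid(H @ w_start)
starts = [i for i, p in enumerate(p_start) if p > threshold]

def ends_for_type(k):
    # p_i^{end_k}: likelihood that word i ends an entity of type k
    # (additive fusion of h_i and et_k is an assumption for illustration).
    return [i for i, p in enumerate(sigmoid((H + et[k]) @ w_end)) if p > threshold]

def min_span_entities(starts, ends):
    # Minimum-span principle: pair each start with the nearest end >= start.
    spans = []
    for s in starts:
        cands = [e for e in ends if e >= s]
        if cands:
            spans.append((s, min(cands)))
    return spans

# Entities are produced per type, so each span already carries its type k.
entities = {k: min_span_entities(starts, ends_for_type(k)) for k in entity_types}
```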
(3) Entity relationship graph
The entities output in the previous step are taken as the nodes of the entity relationship graph, and the graph is initialized. Assuming there are R relation types among the entities in total, R + 1 asymmetric adjacency matrices A^(s), s = 0, 1, ..., R, represent this graph; an element a_ij^(s) = 1 in an adjacency matrix means that head entity e_i and tail entity e_j have relation r_s between them, and 0 means no relation, with A^(0) a zero matrix. Then, for each entity relationship type r_s, a scorer is trained to score every entity pair:
where W are trainable parameters and f is a feature fusion function. When the calculated score exceeds a threshold λ, the corresponding element a_ij^(s) in the adjacency matrix A^(s) is set to 1, otherwise to 0; after all entity pairs have been predicted, the entity relationship graph corresponding to relation type r_s is obtained. The model trains R scorers simultaneously and generates R entity relationship graphs corresponding to the R different relation types; a_ij^(s) = 1 means that the relation triple <e_i, e_j, r_s> exists. The R entity relationship graphs are integrated and all related entity pairs are output, i.e., the two entities <e_i, e_j> corresponding to every non-zero element in A;
When the model finishes, all entity pairs <entity 1, entity 2> between which a relation exists in the text have been extracted. Because the entity recognition module treats each entity type as a mapping from the entity start word to the entity end word, and because the direction of each entity relation is considered when the entity relationship graph is constructed, with relation-specific markers built from entity representations fused with entity type information, the deep combination of entity and entity-relationship extraction is realized;
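The adjacency-matrix construction above can be sketched as follows. The bilinear scorer stands in for the trainable scorer f of the patent, and the relation names, sizes and threshold λ = 0.5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 8, 0.5
entities = [rng.standard_normal(d) for _ in range(3)]    # entity representations
relations = ["Regulation", "Part_of"]                    # r_1 .. r_R (assumption)
R, n = len(relations), len(entities)

# R + 1 asymmetric adjacency matrices; A[0] stays a zero matrix.
A = np.zeros((R + 1, n, n), dtype=int)
W = {s: rng.standard_normal((d, d)) for s in range(1, R + 1)}   # scorer params
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for s in range(1, R + 1):
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            score = sigmoid(entities[i] @ W[s] @ entities[j])   # scorer for r_s
            if score > lam:
                A[s, i, j] = 1       # relation triple <e_i, e_j, r_s> exists

# Integrate the R graphs: every non-zero cell yields a related entity pair.
pairs = {(i, j) for s in range(1, R + 1)
         for i in range(n) for j in range(n) if A[s, i, j]}
```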
(III) Biomedical event extraction based on the hierarchical distillation network
On the basis of existing deep-learning biomedical event extraction models, a biomedical event extraction model based on a hierarchical distillation network is provided. The model acquires syntactic information of different levels through a bidirectional GRU and a multilayer gated graph network, emphasizes the differences between sentence representations of different levels through a distillation module, and then introduces a residual connection to integrate the distilled sentence representations of all levels; the resulting sentence representation expresses richer syntactic features and captures longer-distance context information. Finally, the trigger word recognition task and the element recognition task are treated as a sequence labeling task and a multi-classification task respectively, and post-processing generates the final biomedical events. The specific flow of the model is as follows:
(1) input module
The model first embeds the input sentence W = {w_1, w_2, ..., w_L} to obtain the sentence vector X = {x_1, x_2, ..., x_L}; the sentence vector includes the following parts:
① word vector: the input original sentence is encoded with the pre-trained biomedical word representation model introduced above to obtain the contextual semantics of the words;
② entity type vector: the entity representations obtained by the joint biomedical entity and entity-relationship extraction model introduced above already contain the entity type information, so they are used as the entity type vectors of this module;
③ relative position vector: the relative position vector is used only for the element recognition task; the relative position of the candidate trigger word and the candidate element is calculated as follows:
where q and m are the starting index and the length of the candidate trigger word or element respectively, and L is the length of the sentence; a randomly initialized position embedding table is then looked up to obtain the relative position vector;
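The relative-position formula itself did not survive extraction, so the sketch below uses a common formulation as an assumption: a token gets offset 0 inside the candidate span and the signed distance to the nearest span boundary outside it, then indexes a randomly initialized embedding table.

```python
import numpy as np

def relative_position(i: int, q: int, m: int) -> int:
    # Common formulation (an assumption; the patent's exact formula is
    # not reproduced here): q is the span start index, m its length.
    if i < q:
        return i - q                # before the span: negative offset
    if i < q + m:
        return 0                    # inside the span
    return i - (q + m - 1)          # after the span: positive offset

L = 10                              # sentence length
rng = np.random.default_rng(3)
pos_table = rng.standard_normal((2 * L + 1, 4))   # randomly initialized table

# Relative positions of every token w.r.t. a candidate trigger at q=4, m=2.
rel = [relative_position(i, 4, 2) for i in range(L)]
vectors = pos_table[[r + L for r in rel]]          # shift offsets to valid indices
```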
(2) bidirectional GRU module
The sentence vector X = {x_1, x_2, ..., x_L} obtained by the input module is the input of this module. The bidirectional GRU consists of two recurrent computation units, a forward GRU unit and a backward GRU unit, which are updated continuously along the order of the sentence. The forward GRU and backward GRU obtain the forward hidden vector and the backward hidden vector respectively; concatenating them gives the sentence sequence vector H = (h_1, h_2, ..., h_L), where each h_t is the concatenation of the two. The specific calculation process of this module is as follows:
z_t = σ(W_z·[h_{t-1}, x_t]),  r_t = σ(W_r·[h_{t-1}, x_t])  (3.2)
where z_t and r_t are the update gate and reset gate of the GRU unit respectively, W_(·) are trainable parameters, σ(·) and tanh(·) are activation functions, and h̃_t is the hidden vector after passing through the reset gate;
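Eq. (3.2) and the bidirectional pass can be sketched as below. Only the gates of (3.2) are given in the text, so the candidate state and final update follow the standard GRU equations as an assumption; sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_h, L = 4, 3, 5
sigma, tanh = lambda z: 1 / (1 + np.exp(-z)), np.tanh

class GRUCell:
    def __init__(self):
        self.Wz = rng.standard_normal((d_h, d_h + d_x))
        self.Wr = rng.standard_normal((d_h, d_h + d_x))
        self.Wh = rng.standard_normal((d_h, d_h + d_x))

    def step(self, h_prev, x_t):
        z_t = sigma(self.Wz @ np.concatenate([h_prev, x_t]))  # update gate (3.2)
        r_t = sigma(self.Wr @ np.concatenate([h_prev, x_t]))  # reset gate  (3.2)
        # Standard GRU candidate state and update (assumed form):
        h_tilde = tanh(self.Wh @ np.concatenate([r_t * h_prev, x_t]))
        return (1 - z_t) * h_prev + z_t * h_tilde

X = rng.standard_normal((L, d_x))
fwd, bwd = GRUCell(), GRUCell()

h = np.zeros(d_h); H_fwd = []
for t in range(L):                      # forward pass over the sentence
    h = fwd.step(h, X[t]); H_fwd.append(h)
h = np.zeros(d_h); H_bwd = [None] * L
for t in reversed(range(L)):            # backward pass over the sentence
    h = bwd.step(h, X[t]); H_bwd[t] = h

# h_t = [forward; backward] for every position.
H = np.stack([np.concatenate([H_fwd[t], H_bwd[t]]) for t in range(L)])
```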
(3) gate control graph network module
The graph G = (V, E) is defined as the dependency syntax tree of the sentence, where V = (v_1, v_2, ..., v_L) and E = (e_1, e_2, ..., e_M) are the node set and edge set of graph G respectively. This module takes the node feature vectors H^(l) obtained in the previous step and the adjacency matrix A of the graph as input, and outputs the new node feature vectors H^(l+1). In the propagation process, the GGN adopts a gating mechanism; the calculation process is as follows:
U = σ(W_U·[Y, H^(l)]),  R = σ(W_R·[Y, H^(l)])  (3.5)
where the feature vectors are symmetrically normalized and Y is the normalized vector, W_(·) are trainable parameters, σ(·) and relu(·) are activation functions, U and R are the update gate and reset gate of the GGN respectively, and H̃ is the node feature vector after the GGN reset gate. In the GGN, the edges between nodes are undirected. As GGN layers are stacked, nodes exchange information with nodes farther and farther away, so their feature vectors contain more syntactic information: one GGN layer induces information interaction between adjacent nodes, and stacking another GGN layer on top of it lets each node interact with the nodes at distance 2 or less. Accordingly, the output of the first GGN layer is called the first-order syntactic vector, the output of the second layer the second-order syntactic vector, and so on;
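One GGN layer with the gates of (3.5) can be sketched as follows. The self-loops added before normalization, the relu candidate state and the toy dependency tree are illustrative assumptions; the gates match the update/reset structure described above.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 4, 3
sigma, relu = lambda z: 1 / (1 + np.exp(-z)), lambda z: np.maximum(z, 0)

# Undirected dependency edges of a 4-token sentence (toy tree: 0-1, 1-2, 1-3).
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1
A_hat = A + np.eye(n)                           # self-loops (an assumption)
D_inv_sqrt = np.diag(1 / np.sqrt(A_hat.sum(1)))

WU = rng.standard_normal((d, 2 * d))
WR = rng.standard_normal((d, 2 * d))
WH = rng.standard_normal((d, 2 * d))

def ggn_layer(H):
    Y = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H     # symmetric normalization
    U = sigma(np.concatenate([Y, H], 1) @ WU.T) # update gate (3.5)
    R = sigma(np.concatenate([Y, H], 1) @ WR.T) # reset gate  (3.5)
    H_tilde = relu(np.concatenate([Y, R * H], 1) @ WH.T)  # after the reset gate
    return (1 - U) * H + U * H_tilde            # gated node update

H1 = ggn_layer(rng.standard_normal((n, d)))     # first-order syntactic vectors
H2 = ggn_layer(H1)                              # second-order syntactic vectors
```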
(4) distillation module
Due to the stacking of the GGN modules, the syntactic vectors of different orders inevitably overlap in the feature space, causing a redundancy problem, so a distillation module is designed to reconstruct the syntactic vectors of each order obtained in the previous process. This module mainly emphasizes the heterogeneity between syntactic vectors of different orders, for example between the first-order syntactic vector and the second-order syntactic vector. The module includes three core operations:
① alignment
where W_E are trainable parameters, e_{i,j} is the fused vector, and the two attention weights represent the weights of the first-order vector and the second-order vector during alignment respectively;
② comparison
By comparing S_l with the aligned vector, the distillation ratio of the l-th order syntactic vector is obtained; the calculation formula is as follows:
③ distillation
The distilled l-th order syntactic vector is obtained by the distillation operation, where:
Finally, the distilled syntactic vectors of all orders are integrated through a residual connection to obtain the final sentence representation R = (r_1, r_2, ..., r_L), which expresses richer syntactic features and captures longer-distance context information;
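The align/compare/distill pipeline with the residual integration can be sketched as below. The patent's formulas for these operations did not survive extraction, so the concrete forms of the alignment, the gate and the damping are assumptions; only the three-step structure and the final residual sum follow the text.

```python
import numpy as np

rng = np.random.default_rng(6)
L_len, d = 5, 3
sigma = lambda z: 1 / (1 + np.exp(-z))

# Syntactic vectors of three orders from the stacked GGN layers.
orders = [rng.standard_normal((L_len, d)) for _ in range(3)]
W_E = rng.standard_normal((d, 2 * d))      # trainable fusion parameters

def distill(h_l, h_next):
    # 1) align: fuse each token's l-th and (l+1)-th order vectors (assumed form).
    e = np.concatenate([h_l, h_next], 1) @ W_E.T
    # 2) compare: a gate estimating how much of the l-th order to keep.
    ratio = sigma((h_l * e).sum(1, keepdims=True))
    # 3) distill: damp what the next order already expresses.
    return ratio * h_l

distilled = [distill(orders[l], orders[l + 1]) for l in range(2)]
distilled.append(orders[-1])               # highest order passes through

# Residual connection integrates all distilled orders into R = (r_1..r_L).
R = sum(distilled)
```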
(5) Trigger word recognition and element recognition
① trigger word recognition
In this model, trigger word recognition is treated as a sequence labeling task with the BILOU labeling scheme. At the i-th time step, the i-th word r_i is labeled, and a Softmax classifier predicts the trigger word:
p(y_i|x) = softmax(W_te·r_i + b_te)  (3.13)
where W_te and b_te are trainable parameters;
② element identification
In this model, element recognition is treated as a multi-classification task, and each trigger word-entity pair or trigger word-trigger word pair is classified with the Softmax classifier:
p(y|x) = softmax(W_ae·[t, a] + b_ae)  (3.14)
where t is the trigger word vector, a is the vector of the other entity or trigger word, and W_ae and b_ae are trainable parameters;
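Both classifiers of (3.13) and (3.14) can be sketched as below. The BILOU tag set and the role labels are illustrative assumptions; the classifier forms follow the two equations.

```python
import numpy as np

rng = np.random.default_rng(7)
L_len, d = 4, 3
R_sent = rng.standard_normal((L_len, d))   # final sentence representation r_1..r_L

softmax = lambda z: np.exp(z - z.max(-1, keepdims=True)) / \
                    np.exp(z - z.max(-1, keepdims=True)).sum(-1, keepdims=True)

# (3.13) trigger recognition: per-token softmax over BILOU tags (toy label set).
trigger_labels = ["B-Gene_expression", "I-Gene_expression", "L-Gene_expression",
                  "O", "U-Negative_regulation"]
W_te = rng.standard_normal((len(trigger_labels), d))
b_te = rng.standard_normal(len(trigger_labels))
p_trigger = softmax(R_sent @ W_te.T + b_te)     # one distribution per token
tags = [trigger_labels[i] for i in p_trigger.argmax(1)]

# (3.14) element recognition: classify one trigger-argument pair [t; a].
roles = ["Theme", "Cause", "None"]
W_ae = rng.standard_normal((len(roles), 2 * d))
b_ae = rng.standard_normal(len(roles))
t, a = R_sent[1], R_sent[3]                     # candidate trigger and argument
p_role = softmax(np.concatenate([t, a]) @ W_ae.T + b_ae)
role = roles[p_role.argmax()]
```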
(6) post-processing
The post-processing step combines the trigger words and elements obtained in the previous step into complete events, i.e., performs event detection. In this model, an SVM classifier automatically learns the legal event structure of each event type from extracted features; candidate events are then formed, and finally the event type of each candidate event is determined;
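The structure check at the heart of event detection can be sketched as below. A rule-based lookup stands in for the SVM classifier described above, and the legal-structure table is an illustrative assumption; only the idea that a candidate survives if its role set matches a legal structure for its event type follows the text.

```python
# Legal event structures per event type (toy table, an assumption).
legal_structures = {
    "Gene_expression": [{"Theme"}],
    "Blood_vessel_development": [set()],           # may be an element-free event
    "Negative_regulation": [{"Theme"}, {"Theme", "Cause"}],
}

def form_event(event_type, roles):
    # A candidate event survives post-processing only if its role set
    # matches a legal structure for its event type.
    return set(roles) in legal_structures.get(event_type, [])

# Candidate events modeled on Table 4:
t1 = form_event("Blood_vessel_development", [])            # simple event, legal
t4 = form_event("Negative_regulation", ["Theme", "Cause"]) # legal
bad = form_event("Gene_expression", ["Cause"])             # illegal structure
```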
(IV) cancer-related biomedical event database construction
The biomedical entity and biomedical event information required for constructing the cancer-related biomedical event database is obtained by the above method and stored in the database; the constructed cancer-related biomedical event database comprises a biomedical entity table and a biomedical event table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629395.7A CN111859935B (en) | 2020-07-03 | 2020-07-03 | Method for constructing cancer-related biomedical event database based on literature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111859935A CN111859935A (en) | 2020-10-30 |
CN111859935B true CN111859935B (en) | 2022-09-20 |
Family
ID=73151955
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||