CN111859935B - Method for constructing cancer-related biomedical event database based on literature - Google Patents
- Publication number
- CN111859935B (application number CN202010629395.7A)
- Authority
- CN
- China
- Prior art keywords
- entity
- biomedical
- vector
- word
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
The invention belongs to the technical field of natural language processing and provides a method for constructing a cancer-related biomedical event database from literature. The method comprises three parts: 1. joint extraction of biomedical entities and entity relations based on an entity relationship graph; 2. biomedical event extraction based on a hierarchical distillation network; 3. construction of a database of cancer-related biomedical events. Building on traditional methods, the method fully considers the characteristics of entities and context in biomedical text, solves the problems of multi-type and incomplete entity recognition in biomedical event extraction, and obtains deeper syntactic information from a hierarchical distillation network for biomedical event extraction, improving the precision of complex event extraction. It can help researchers in the biomedical field analyze texts automatically, provides search over known biomedical named entities and biomedical events, and assists researchers in studying and analyzing relevant biomedical literature.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and relates to a method for carrying out high-quality biomedical named Entity identification, Entity relationship extraction and biomedical event extraction on biomedical related documents, in particular to biomedical entities and Entity relationship combined extraction based on an Entity-relationship Graph (ERG) and biomedical event extraction based on a Hierarchical Distillation Network (HDN).
Background
A biomedical event database stores complete biomedical events, each consisting of a trigger word, typically a verb or a nominalized verb, and elements, which are typically biomedical entities or the trigger words of other events. Constructing the biomedical event database involves three steps: Word Representation, Biomedical Named Entity Recognition and Entity Relationship Extraction, and Biomedical Event Extraction.
Word representation means representing words as vectors of a specific dimension, so that the semantic information of a word is reflected in the vector space and rich semantic features can be obtained from large amounts of unlabeled corpora. To obtain more valuable information from biomedical texts, Lee et al. (BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR abs/1901.08746 (2019)) initialized the BioBERT model with a BERT model pre-trained on general corpora (English Wikipedia and BooksCorpus) and fine-tuned it on the biomedical corpora PubMed and PMC, obtaining a word representation model suited to the biomedical domain; the model achieves an entity recognition F-score of 89.71% on the NCBI disease corpus.
Existing biomedical named entity recognition and entity relation extraction models mainly comprise staged extraction models and joint extraction models. The staged approach ignores the correlation between entity recognition and the relation extraction task and suffers from error propagation. The classic joint extraction approach is to build an end-to-end model, such as the end-to-end joint model proposed by Miwa and Bansal (End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures [C]. Association for Computational Linguistics, 2016) for entity recognition and entity relation extraction, which incorporates dependency-path information into the model and achieves entity recognition and relation extraction F-scores of 83.4% and 55.6%, respectively, on ACE05. However, such models generally establish the connection between the two subtasks by parameter sharing; the connection is shallow and generates a large amount of redundant information. How to fully model the dependency between entity recognition and entity relation extraction, and how to build a deeply connected joint model of entities and entity relations, therefore still needs to be studied.
Biomedical event extraction refers to the process of automatically detecting, from the literature, descriptions of fine-grained interactions between biomolecules such as genes and proteins and biological organs and tissues, with the purpose of extracting structured information about predefined event types from unstructured text. Current biomedical event extraction models fall into two types, based respectively on a staged strategy and a joint strategy. The staged strategy identifies the trigger words and the elements of an event separately and then obtains the complete event representation through post-processing. For example, Miwa et al. (Boosting automatic event extraction from the literature using domain adaptation and coreference resolution [J]. Bioinformatics, 2012) improved extraction by using rich feature sets and fusing coreference resolution with a domain adaptation strategy; their EventMine system reached an F-score of 57.98% on the BioNLP'11 corpus. The staged strategy is easy to understand and implement, but it has obvious defects such as cascading errors. The joint strategy performs subtasks such as trigger-word and element recognition jointly, which avoids error propagation between subtasks to some extent and reduces cascading errors. For example, in early work based on shallow machine learning, Poon et al. (Joint inference for knowledge extraction from biomedical literature [C]. Association for Computational Linguistics, 2010) adopted a Markov logic network with hand-crafted first-order logic formulas to jointly extract trigger words and elements, obtaining an F-score of 50.0% on the BioNLP'09 test set. Although Markov logic networks can avoid cascading errors, they do not make good use of large numbers of features. Li et al. (Extracting biomedical events with dual decomposition integrating word embeddings [J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015) added word-vector features containing more syntactic and semantic information, reaching an F-score of 53.19% on the BioNLP'13 test set. However, whether based on the staged or the joint strategy, the above biomedical event extraction models mostly require complicated feature engineering and depend on additional expert knowledge. In recent years, owing to the advantages deep learning has shown in capturing implicit textual information, several deep-learning-based biomedical event extraction models have been proposed. For example, Yu et al. (LSTM-Based End-to-End Framework for Biomedical Event Extraction [J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019) represent the raw corpus with word embeddings under a joint strategy, feed it to a Bi-LSTM for trigger recognition, and simultaneously perform trigger-entity detection with a Tree-LSTM. The F-scores of this model on the BioNLP'13, BioNLP'11, and BioNLP'09 test sets reached 57.39%, 58.23%, and 59.68%, respectively.
From the above analysis, although biomedical event extraction research has produced abundant results, problems such as complex event extraction still require systematic and in-depth study. Although many biomedical ontologies and databases exist, such as the Gene Ontology and pathway database systems, and related work has established binary interaction networks such as protein-protein interaction networks, how to construct finer-grained databases for cancer remains a research topic of great concern to many researchers.
Disclosure of the invention
The invention provides a biomedical entity-relation joint extraction model based on an entity relationship graph and a biomedical event extraction system based on a hierarchical distillation network. It solves the problems that prior research cannot fully capture long-range contextual semantic information or fully express syntactic features, and improves the accuracy of event extraction, thereby enabling the construction of a fine-grained cancer-related biomedical event database.
The invention mainly comprises three parts: 1. joint extraction of biomedical entities and entity relations based on an Entity Relationship Graph (ERG); 2. biomedical event extraction based on a Hierarchical Distillation Network (HDN); 3. construction of a cancer-related biomedical event database. Biomedical named entity recognition and entity relation extraction are necessary prerequisites for biomedical event extraction and for constructing the biomedical event interaction network; the results are finally presented in the form of a cancer-related biomedical event database.
The technical scheme adopted by the invention is as follows:
the method for constructing the cancer-related biomedical event database based on literature comprises the following steps:
(I) obtaining and preprocessing model training test corpus
(1) 261-; (2) segment the biomedical documents into words and sentences, adding [CLS] and [SEP] tokens as the beginning and end markers of each sentence, respectively; (3) tokenize the text obtained in step (2) using special symbols; (4) train with the pre-trained biomedical word representation (BioBERT) model to obtain word vectors. In the word-vector training process, first obtain the public BioBERT code, then fine-tune it on the biomedical corpora PubMed and PMC and save the model parameters, and finally process the MLEE corpus with the fine-tuned model to obtain word vectors;
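The segmentation and marker-insertion steps above can be sketched as follows; the regex-based sentence splitter and tokenizer are simplifications (a real pipeline would use the BioBERT WordPiece tokenizer):

```python
import re

def preprocess(document: str) -> list[list[str]]:
    """Split a document into sentences, wrap each in [CLS]/[SEP],
    and tokenize on word/punctuation boundaries (steps (2)-(3))."""
    # Naive sentence splitter: break on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    tokenized = []
    for sent in sentences:
        if not sent:
            continue
        # Naive word segmentation; BioBERT's WordPiece tokenizer
        # would be used here in the actual system.
        tokens = re.findall(r"\w+|[^\w\s]", sent)
        tokenized.append(["[CLS]"] + tokens + ["[SEP]"])
    return tokenized

doc = "TIMP-3 inhibits angiogenesis. It is down-regulated in melanoma."
for sent in preprocess(doc):
    print(sent)
```

Each sentence comes back as a token list bracketed by the `[CLS]` and `[SEP]` markers described in step (2).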
(II) biomedical entities and entity relationship joint extraction based on Entity Relationship Graph (ERG)
In order to realize deep combination of biomedical entity identification and entity relationship extraction tasks, the invention provides a biomedical entity and entity relationship combination extraction model based on an Entity Relationship Graph (ERG). In the model, firstly, an entity recognition module recognizes entities in an input text for a specific entity type, then an entity relation directed graph of the specific relation type is constructed, and all related entity pairs in the graph are output; the model is divided into three modules:
(1) input module
Encode the input source sentence using the pre-trained biomedical word representation (BioBERT) model described above. The input is S = {[CLS], w_1, w_2, ..., w_n, [SEP]}, where [CLS] and [SEP] are the beginning and end markers of the sentence. Extract the hidden-layer features H_E and the sentence vector X:

H_E, X = Encoder(S)

where h_i^E ∈ R^{d_E} is the hidden-layer output of the i-th word and X ∈ R^{d_E} is the vector representation of the entire sentence (in the model implementation, the [CLS] vector is used instead); d_E is the number of hidden-layer units of the Encoder;
(2) entity identification module
In biomedical texts, an entity may consist of multiple words and may belong to multiple types, but most traditional models, such as copy-mechanism models, can only recognize the last word of an entity, so the recognized entity lacks integrity and can belong to only one type. This module models the entity type as a mapping from the entity's beginning word to its ending word, rather than classifying the entity after it is identified; that is, it learns a type-specific tagger f_etype(w_start) → w_end, which solves the problems of multi-type and incomplete entity recognition;
firstly, generating a hidden layer characteristic H according to the previous step E Marking out which words in the sentence are most likely to be the beginning of an entity, and then constructing a marker for each entity type to mark out which words in the sentence are most likely to be the end of a specific type of entity; the specific operation is as follows:
wherein,et k as entity type etype k The embedding of (a) into (b),representing the likelihood that the ith word is identified as the entity start word,representing that the ith word is recognized as an type k The possibility of the end word of the type entity is classified into two categories, and the category is defined as 1 when the value of the category exceeds a certain threshold, otherwise, the category is 0; then, the entity in the sentence is identified according to the minimum span principleSince entity identification is performed for a specific entity type, the entity representation already includes its type information, and k represents that the type of this entity is type k ;
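The start/end tagging scheme can be illustrated with a small decoding sketch; the threshold value and the toy scores below are illustrative, not the patent's:

```python
THRESHOLD = 0.5  # illustrative; the patent only says "a certain threshold"

def decode_entities(start_probs, end_probs_by_type, threshold=THRESHOLD):
    """Decode entity spans from per-token start scores and per-type end
    scores, pairing each start with the nearest following end of the
    same type (the minimum-span principle)."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    entities = []
    for etype, end_probs in end_probs_by_type.items():
        ends = [j for j, p in enumerate(end_probs) if p > threshold]
        for i in starts:
            # Minimum span: nearest end position at or after the start.
            candidates = [j for j in ends if j >= i]
            if candidates:
                entities.append((i, min(candidates), etype))
    return entities

# Toy scores for a 6-token sentence: one Protein entity spanning tokens 1-2.
start = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1]
ends = {"Protein": [0.0, 0.3, 0.8, 0.1, 0.0, 0.0]}
print(decode_entities(start, ends))  # [(1, 2, 'Protein')]
```

Because each end-tagger is type-specific, the decoded span carries its type directly, matching the mapping-based formulation above.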
(3) Entity Relationship Graph (ERG)
Take the entities output in the previous step as nodes of the entity relationship graph and initialize the graph. Assuming there are R relation types among the entities, the graph is represented by R+1 asymmetric adjacency matrices A^(0), A^(1), ..., A^(R); an element a_ij^(s) = 1 in the adjacency matrix means that head entity e_i and tail entity e_j have relation r_s between them, and 0 means no relation; A^(0) is a zero matrix. Then, for each entity relation type r_s, a scorer is trained that scores every entity pair (e_i, e_j), where W is a trainable parameter and f is a feature-fusion function. When the computed score exceeds a threshold λ, the corresponding element a_ij^(s) in the adjacency matrix A^(s) is set to 1, and to 0 otherwise; after all entity pairs have been predicted, the entity relationship graph corresponding to relation type r_s is obtained. The model trains R scorers simultaneously, generating R entity relationship graphs corresponding to the R different relation types; a_ij^(s) = 1 means there is a relation triplet <e_i, e_j, r_s>. Finally, the R entity relationship graphs are integrated and all related entity pairs are output, i.e., the two entities <e_i, e_j> corresponding to every nonzero element;
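A minimal sketch of the adjacency-matrix bookkeeping, assuming scores have already been produced by the per-relation scorers (the scorer itself is omitted):

```python
import numpy as np

def build_relation_graphs(num_entities, scored_pairs, num_rel, lam=0.5):
    """Build R adjacency matrices A^(1..R) from scored entity pairs and
    read out all relation triplets <e_i, e_j, r_s>.  scored_pairs maps
    (i, j, s) -> score from the relation-s scorer; lam is the threshold."""
    A = np.zeros((num_rel + 1, num_entities, num_entities))  # A^(0) stays zero
    for (i, j, s), score in scored_pairs.items():
        if score > lam:
            A[s, i, j] = 1  # directed edge: head e_i -> tail e_j under r_s
    triplets = [(int(i), int(j), s)
                for s in range(1, num_rel + 1)
                for i, j in zip(*np.nonzero(A[s]))]
    return A, triplets

# Toy scores: entity 0 -> 1 under relation 1, entity 2 -> 0 under relation 2.
scores = {(0, 1, 1): 0.9, (1, 0, 1): 0.2, (2, 0, 2): 0.7}
A, triplets = build_relation_graphs(3, scores, num_rel=2)
print(triplets)  # [(0, 1, 1), (2, 0, 2)]
```

The asymmetry of each A^(s) is what preserves the head/tail direction of a relation.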
When the model finishes, all entity pairs <entity 1, entity 2> with a relation in the text have been extracted. The entity type is treated as a mapping from an entity start word to an entity end word in the entity recognition module; when the entity relationship graph is constructed, the directionality of the entity relation is considered and a relation-specific scorer is built on entity representations fused with entity type information, realizing deep joint modeling of entities and entity relations, as shown in FIG. 1;
(III) layered distillation network (HDN) based biomedical event extraction
The invention provides a biomedical event extraction model based on a Hierarchical Distillation Network (HDN), building on existing deep-learning-based biomedical event extraction models. The model acquires syntactic information at different levels through a bidirectional GRU and a multi-layer Gated Graph Network (GGN), emphasizes the differences between sentence representations of different levels through a distillation module, and introduces residual connections to integrate the distilled sentence representations of all levels, so that the resulting sentence representation expresses richer syntactic features and captures longer-range contextual information. Finally, the trigger-word recognition task and the element recognition task are treated as a sequence labeling task and a multi-classification task, respectively, and post-processing generates the final biomedical events. The specific flow of the model is as follows:
(1) input module
The model first embeds the input sentence W = {w_1, w_2, ..., w_L} to obtain the sentence vector X = {x_1, x_2, ..., x_L}, which includes the following parts:
① Word vector: encode the input original sentence with the pre-trained biomedical word representation (BioBERT) model introduced above to obtain the contextual semantics of each word;
② Entity type vector: the entity representations obtained by the joint biomedical entity and entity-relation extraction model introduced above already contain entity type information, so they serve as the entity type vectors of this module;
③ Relative position vector: the relative position vector is used only in the element recognition task. The relative position of a candidate trigger word and a candidate element is computed from q and m, the starting index and length of the candidate trigger word or element, and L, the length of the sentence; the relative position vector is then obtained by looking up a randomly initialized position-embedding table;
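The exact relative-position formula is not reproduced in the text; the following is one common formulation consistent with the description (signed distance to the candidate span, zero inside it, normalized by the sentence length L). It is an assumption, not the patent's formula:

```python
def relative_position(i, q, m, L):
    """Signed distance from token i to the candidate span [q, q+m),
    clipped to 0 inside the span and normalized by sentence length L.
    A common formulation; the patent's exact formula is not shown."""
    if i < q:
        d = i - q            # token precedes the candidate
    elif i < q + m:
        d = 0                # token inside the candidate span
    else:
        d = i - (q + m - 1)  # token follows the candidate
    return d / L

# Candidate trigger at tokens 2-3 (q=2, m=2) in a 6-token sentence.
print([relative_position(i, 2, 2, 6) for i in range(6)])
```

The resulting integer offsets (before normalization) are what would index the randomly initialized position-embedding table.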
(2) bidirectional GRU module
The sentence vector X = {x_1, x_2, ..., x_L} obtained by the input module is the input of this module. The bidirectional GRU consists of two recurrent computation units: a forward and a backward GRU unit, updated continuously following the order of the sentence. The forward and backward GRUs produce the forward hidden vector and the backward hidden vector, respectively, which are concatenated to obtain the sentence sequence vector H = (h_1, h_2, ..., h_L). The specific computation of the module is:

z_t = σ(W_z · [h_{t-1}, x_t]),  r_t = σ(W_r · [h_{t-1}, x_t])   (3.2)
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])   (3.3)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t   (3.4)

where z_t and r_t are the update and reset gates of the GRU unit, W_(·) are trainable parameters, σ(·) and tanh(·) are activation functions, and h̃_t is the hidden vector after passing through the reset gate;
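The GRU update described here can be sketched in NumPy; the weight shapes and random toy inputs are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, Wz, Wr, Wh):
    """One GRU update: update gate z_t, reset gate r_t, candidate
    state h~_t, then the gated blend of old and candidate states."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                  # z_t = sigma(W_z . [h_{t-1}, x_t])
    r = sigmoid(Wr @ hx)                  # r_t = sigma(W_r . [h_{t-1}, x_t])
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde # h_t

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
Wz, Wr, Wh = (rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):       # run over a 5-step toy sequence
    h = gru_step(h, x, Wz, Wr, Wh)
print(h.shape)  # (4,)
```

A backward GRU runs the same step over the reversed sequence; concatenating both hidden states per token gives the h_t used downstream.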
(3) gated Graph Network (GGN) module
Defining graph G ═ (V, E) as a dependency syntax tree for sentences, where V ═ V (V, E) 1 ,v 2 ,...,v L ),E=(e 1 ,e 2 ,...,e M ) Respectively, node set and edge set of the graph G, and the module obtains the node feature vector H in the last step (l) And the adjacency matrix A of the graph is used as input, and a new node characteristic vector H is output (l+1) In the transmission process, the GGN adopts a door mechanism, and the calculation process is as follows:
U=σ(W U ·[Y,H (l) ])R=σ(W R ·[Y,H (l) ]) (3.5)
wherein,is used for carrying out symmetrical normalization on the characteristic vector, Y is the normalized vector, W (·) For trainable parameters, σ (-) and relu (-) are activation functions, U and R are update and reset gates, respectively, of the GGN,is the node feature vector after the GGN reset gate. In the GGN, the edge type between nodes is a non-directional edge. Along with the superposition of GGNs, a node can exchange information with other nodes which are farther away, so that the characteristic vector of the node contains more syntactic information, for example, 1 layer of GGNs can induce information interaction between adjacent nodes in a graph, and according to the superposition of another layer of GGNs on the previous layer of GGNs, the node can carry out information interaction with nodes which are equal to or less than 2 in distance, so that the model uses a first-order syntactic vector to represent the output of a first-layer GGN, uses a second-order syntactic vector to represent the output of a second-layer GGN, and so on;
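The growing receptive field from stacking graph layers can be illustrated with adjacency-matrix powers over a toy dependency tree (the gating itself is omitted; this only shows which nodes can exchange information after k layers):

```python
import numpy as np

# Dependency tree of a 5-token sentence as an undirected adjacency matrix
# (edges: 0-1, 1-2, 2-3, 3-4), with self-loops added.
A = np.eye(5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1

# After stacking k propagation layers, node i can see node j iff
# (A^k)[i, j] > 0 -- the "order-k syntactic information" described above.
reach = A.copy()
for k in range(1, 4):
    print(f"layer {k}: token 0 sees tokens {np.nonzero(reach[0])[0].tolist()}")
    reach = reach @ A
```

Each added layer widens token 0's neighborhood by one hop along the dependency tree, which is why deeper GGN outputs are called higher-order syntactic vectors.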
(4) distillation module
Because GGN modules are stacked, the syntactic vectors of different orders inevitably overlap in feature space and cause redundancy, so a distillation module is designed to reconstruct the order-wise syntactic vectors obtained above. The module emphasizes the differences between the syntactic vectors of different orders of a sentence. Taking the first-order and second-order syntactic vectors as an example, the module comprises three core operations:
① Alignment: the first-order and second-order vectors are aligned against each other, where W_E is a trainable parameter, e_{i,j} is the fused vector, and the two attention weights represent the weights of the first-order and second-order vectors during alignment;
② Comparison: by comparing the l-th-order syntactic vector with its aligned counterpart, the distillation ratio of the l-th-order syntactic vector is obtained;
③ Distillation: the distilled l-th-order syntactic vector is obtained by the distillation operation.
Finally, the distilled syntactic vectors of all orders are integrated through residual connections to obtain the final sentence representation R = (r_1, r_2, ..., r_L), which expresses richer syntactic features and captures longer-range contextual information;
(5) Trigger word recognition and element recognition
① Trigger word recognition
In the model, trigger-word recognition is treated as a sequence labeling task using the BILOU (Begin, Inside, Last, Outside, Unit) labeling scheme. At the i-th time step, the representation r_i of the i-th word is fed to a Softmax classifier to predict its trigger label:

p(y_i | x) = softmax(W_te · r_i + b_te)   (3.13)

where W_te and b_te are trainable parameters;
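Decoding the per-token BILOU labels into trigger spans might look like this; the toy tags are illustrative:

```python
def decode_bilou(tokens, tags):
    """Turn per-token BILOU tags (argmax of the Softmax classifier)
    into trigger spans.  B=Begin, I=Inside, L=Last, O=Outside, U=Unit."""
    triggers, start = [], None
    for i, tag in enumerate(tags):
        if tag == "U":            # single-token trigger
            triggers.append((i, i))
            start = None
        elif tag == "B":          # span opens
            start = i
        elif tag == "L" and start is not None:
            triggers.append((start, i))   # span closes
            start = None
        elif tag == "O":          # outside any trigger
            start = None
    return [" ".join(tokens[s:e + 1]) for s, e in triggers]

tokens = "TIMP-3 negatively regulates tumor angiogenesis".split()
tags = ["O", "B", "L", "O", "U"]
print(decode_bilou(tokens, tags))  # ['negatively regulates', 'angiogenesis']
```

The recovered trigger strings are then paired with entities (or other triggers) for the element recognition step.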
② element identification
In this model, element recognition is treated as a multi-classification task: each trigger-entity pair or trigger-trigger pair is classified with a Softmax classifier:

p(y | x) = softmax(W_ae · [t, a] + b_ae)   (3.14)

where t is the trigger vector, a is the vector of the other entity or trigger, and W_ae and b_ae are trainable parameters;
(6) Post-processing
The post-processing step combines the trigger words and elements obtained in the previous step into complete events, i.e., event detection. In the model, an SVM classifier automatically learns the legal event structure of each event type from extracted features, then forms candidate events, and finally determines the event type of each candidate event.
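A simplified sketch of the post-processing step: candidate events are kept only if their argument structure is legal for the predicted event type. The hard-coded legal structures here are an illustrative stand-in for what the SVM classifier learns from data:

```python
# Legal argument structures per event type (illustrative stand-in for
# the structures the SVM classifier learns in the post-processing step).
LEGAL_STRUCTURES = {
    "Negative_regulation": {("Theme",), ("Cause", "Theme")},
    "Blood_vessel_development": {(), ("Theme",)},
}

def assemble_events(triggers, arguments):
    """Combine recognized triggers with their classified arguments and
    keep only candidates whose argument-role multiset is legal for the
    predicted event type."""
    events = []
    for trig_id, (text, etype) in triggers.items():
        roles = tuple(sorted(role for t, role, _ in arguments if t == trig_id))
        legal = {tuple(sorted(s)) for s in LEGAL_STRUCTURES.get(etype, set())}
        if roles in legal:
            events.append((etype, text,
                           [a for a in arguments if a[0] == trig_id]))
    return events

triggers = {0: ("inhibits", "Negative_regulation")}
arguments = [(0, "Theme", "angiogenesis")]
print(assemble_events(triggers, arguments))
```

Candidates whose role combination is not licensed by the event type are simply dropped, which is the filtering role post-processing plays here.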
The biomedical event extraction model based on the Hierarchical Distillation Network (HDN) can capture richer syntactic features and longer-range contextual information; it then treats trigger-word recognition and element recognition as sequence labeling and multi-classification tasks, respectively, and post-processing generates the final biomedical events. The model achieved an F-score of 62.74% on MLEE, 3.13 percentage points higher than the current best model, as shown in FIG. 2;
(IV) cancer-related biomedical event database construction
The biomedical entity and biomedical event information required to construct the cancer-related biomedical event database are obtained by the above method and stored in the database. The constructed cancer-related biomedical event database comprises a biomedical entity table, a biomedical event table, and others, as shown in the following table:
Table 1. Database tables
The invention builds a biomedical event extraction platform based on joint entity and entity-relation extraction and event extraction in the biomedical field, and provides a query service for researchers. Biomedical event extraction analyzes biomedical documents and reveals potential fine-grained, complex event relationships across different system levels, including the molecular, cell, and tissue levels. Literature on cancer genetics is currently growing massively; for such a volume of documents, automatically obtaining valuable information with information extraction technology has positive significance for studying the formation of cancers, discovering drugs to treat cancer, diagnosing cancer early, improving the survival rate of cancer patients, and reducing the medical cost of cancer.
The beneficial effects of the invention are as follows: building on traditional methods, the method fully considers the characteristics of entities and context in biomedical text, solves the problems of multi-type and incomplete entity recognition in biomedical event extraction, obtains deeper syntactic information from a hierarchical distillation network (HDN) for biomedical event extraction, and improves the precision of complex event extraction. It can help researchers in the biomedical field analyze texts automatically, provides search over known biomedical named entities and biomedical events, and assists researchers in studying and analyzing relevant biomedical literature.
Drawings
FIG. 1 is a schematic diagram of biomedical entity and entity relationship joint extraction based on an Entity Relationship Graph (ERG).
Fig. 2 is a diagram of the hierarchical distillation network (HDN) based biomedical event extraction model.
FIG. 3 is a database E-R diagram.
Detailed Description
The system of the invention can automatically perform word representation, biomedical entity and entity-relationship extraction, and biomedical event extraction on a given text, which greatly helps researchers analyze the biomedical events described in a large body of literature. The system adopts a B/S (Browser/Server) architecture, is built with the Django framework, and is implemented mainly with technologies such as HTML (hypertext markup language) and CSS (cascading style sheets). It is divided into a view layer, a logic layer and a data layer, as shown in Table 2:
Table 2 Database system architecture
1. The user inputs the text to be parsed
The text input supports two modes: keyboard input and local file upload. The view layer receives the text to be analyzed input by the user, submits it to the logic layer, and stores it in the data layer. Suppose the text to be analyzed is "We have used retrovirus-mediated gene delivery to effect sustained expression of TIMP-3 in human neuroblastoma and murine melanoma cells in order to test further the capacity of TIMPs to inhibit angiogenesis in vivo". The user can either 1. input the text directly through the page text box; or 2. save the text in a format such as txt or doc and upload it as a file. The first approach is suitable for short texts, the second for long texts.
2. The system analyzes the text to be analyzed
Realizing this function requires the logic layer and the data layer of the system to work in coordination. The specific steps are as follows:
(1) After the logic layer preprocesses the text to be analyzed (sentence splitting, word segmentation and the like), the text is used as the input of the fine-tuned pre-trained biomedical word representation (BioBERT) model to obtain the word vectors and sentence vector of the text.
(2) The output of step (1) is used as the input of the joint entity and entity-relationship extraction model. The entity recognition module first obtains the biomedical entities of each specific entity type, as shown in the following table:
Table 3 Biomedical entity extraction results
E1 | E2 | E3 | E4 | E5 |
TIMP-3 | TIMPs | neuroblastoma | melanoma tumor cells | murine |
Because this module treats each entity type as a mapping from the entity start word to the entity end word, the obtained entity representations contain the entities' type information. Entity relationship graphs specific to each entity relation type are then constructed and integrated, and all related entity pairs are output: <TIMP-3, TIMPs>; <TIMPs, TIMP-3>; <neuroblastoma, melanoma tumor cells>; <melanoma tumor cells, neuroblastoma>.
(3) Biomedical event extraction is then performed. The word vectors and entity type vectors (i.e., the entity representations from step (2)) obtained above are input into the biomedical event extraction model based on the hierarchical distillation network (HDN) for trigger word recognition; element detection is performed on each trigger word-entity and trigger word-trigger word pair, and the biomedical events are then obtained through a post-processing step, as shown in Table 4:
Table 4 Biomedical event extraction results
T1 | Blood_vessel_development:angiogenesis |
T2 | Gene_expression:expression Theme:TIMP-3 |
T3 | Positive_regulation:effect Theme:expression |
T4 | Negative_regulation:inhibit Theme:angiogenesis Cause:TIMPs |
The trigger word of T1 is angiogenesis and it has no corresponding element; that is, the trigger word by itself constitutes a simple event.
(4) The output results of steps (1) to (3) are delivered to the data layer for storage; at the same time, the view layer feeds the visualized results back to the user, and the cancer-related biomedical event database is constructed.
3. User retrieval of biomedical events
When the system has completed the extraction of biomedical events from the input text, the whole event set can be presented as an interactive event graph. For example, after the user inputs the text, the biomedical event extraction interface shows a graph G containing m + n nodes; G represents the complete set of biomedical events in which m entities and n trigger words participate. Each node is colored according to its identity (trigger word or entity), and the edge types between nodes represent element roles or entity relations, distinguished by edge color. If the trigger word of an event A in the graph is an element of an event B, then event B is a complex event. In addition, the user may search for an entity in the interface to view the biomedical events related to that entity.
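The event graph described above can be sketched as typed nodes and typed edges. The dict-based graph below is a stand-in for the actual visualization layer (an illustrative assumption); the node and edge data are taken from Tables 3 and 4.

```python
# Entities and trigger words become nodes, element roles become typed edges.
nodes = {
    "angiogenesis": "trigger", "expression": "trigger",
    "effect": "trigger", "inhibit": "trigger",
    "TIMP-3": "entity", "TIMPs": "entity",
}
edges = [                                    # (source trigger, role, target)
    ("expression", "Theme", "TIMP-3"),       # T2 Gene_expression
    ("effect", "Theme", "expression"),       # T3 Positive_regulation
    ("inhibit", "Theme", "angiogenesis"),    # T4 Negative_regulation
    ("inhibit", "Cause", "TIMPs"),
]

def is_complex(trigger):
    # An event is complex if one of its elements is itself a trigger word.
    return any(src == trigger and nodes.get(dst) == "trigger"
               for src, _, dst in edges)

m = sum(1 for v in nodes.values() if v == "entity")     # entity nodes
n = sum(1 for v in nodes.values() if v == "trigger")    # trigger nodes
```

Here `is_complex("effect")` holds because the element of T3 is the trigger word of T2, matching the complex-event criterion in the text.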
Claims (1)
1. A method for constructing a cancer-related biomedical event database based on literature is characterized by comprising the following steps:
(I) Obtaining and preprocessing the model training and test corpus
(1) 261-; (2) segmenting the biomedical documents into sentences, and adding the [CLS] and [SEP] characters as the beginning and end marks of each sentence respectively; (3) cutting the text obtained in step (2) into word pieces using special symbols; (4) training with the pre-trained biomedical word representation model to obtain word vectors; in the process of training the word vectors, the code of the public BioBERT is first obtained, fine-tuning is then completed using the PubMed and PMC corpora of the biomedical field and the model parameters are saved, and finally the MLEE corpus is processed with the fine-tuned model to obtain the word vectors;
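Steps (2) and (3) above can be sketched as follows. The naive sentence splitter, the toy vocabulary, and the "##" continuation marker are illustrative assumptions; the patent only says "special symbols" are used to cut words.

```python
import re

def split_sentences(document: str) -> list[str]:
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]

def mark_sentence(tokens: list[str]) -> list[str]:
    # [CLS] and [SEP] mark the beginning and end of every sentence.
    return ["[CLS]"] + tokens + ["[SEP]"]

def wordpiece(word: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match sub-word segmentation; "##" marks continuations.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:            # no matching piece: unknown word
            return ["[UNK]"]
        start = end
    return pieces

vocab = {"angio", "##genesis", "TIMP", "##-3", "inhibits"}
sents = split_sentences("TIMP-3 inhibits angiogenesis. It is a TIMP.")
tokens = mark_sentence(
    [p for w in sents[0].rstrip(".").split() for p in wordpiece(w, vocab)]
)
```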
(II) Joint extraction of biomedical entities and entity relationships based on the entity relationship graph
In order to deeply combine the biomedical entity recognition and entity relationship extraction tasks, a joint biomedical entity and entity-relationship extraction model based on an entity relationship graph is provided. In this model, the entity recognition module first recognizes the entities of each specific entity type in the input text; entity relationship directed graphs for each specific relation type are then constructed, and all related entity pairs in the graphs are output. The model is divided into three modules:
(1) input module
The input source sentence is encoded with the pre-trained biomedical word representation model. The input is S = {[CLS], w_1, w_2, ..., w_n, [SEP]}, where [CLS] and [SEP] are the beginning and end marks of the sentence respectively. The hidden-layer features H^E and the sentence vector X are extracted: H^E, X = Encoder(S);
where h_i^E ∈ R^{d_E} is the hidden-layer output of the i-th word and X ∈ R^{d_E} is the vector representation of the entire sentence (in the model implementation, the vector of [CLS] is used instead); d_E is the number of hidden-layer units of the Encoder;
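The input module's shape contract can be sketched as follows. The stand-in random encoder, the toy hidden size d_E = 8 and the example sentence are illustrative assumptions; in the patent the Encoder is the fine-tuned BioBERT model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_E = 8                         # number of hidden-layer units of the Encoder (toy size)

def encoder(tokens):
    # Stand-in for the fine-tuned BioBERT: one d_E-dimensional hidden
    # vector h_i^E per input token.
    H_E = rng.standard_normal((len(tokens), d_E))
    X = H_E[0]                  # the [CLS] vector is used as the sentence vector
    return H_E, X

S = ["[CLS]", "TIMP-3", "inhibits", "angiogenesis", "[SEP]"]
H_E, X = encoder(S)             # H_E: hidden-layer features, X: sentence vector
```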
(2) entity identification module
This module models each entity type as a mapping from the entity's beginning word to its ending word, rather than classifying entities after they are identified; that is, it learns a marker f_etype(w_start) → w_end specific to each entity class, which solves the problems of multi-type entity recognition and incomplete entity recognition;
First, according to the hidden-layer features H^E generated in the previous step, the words in the sentence most likely to be the beginning of an entity are marked; then a marker is constructed for each entity type, marking the words in the sentence most likely to be the end of an entity of that specific type. The specific operation is as follows:
where et_k is the embedding of entity type etype_k, p_i^start represents the likelihood that the i-th word is identified as an entity start word, and p_i^{end_k} represents the likelihood that the i-th word is identified as the end word of an etype_k-type entity. Both are binary classifications: the label is defined as 1 when the value exceeds a certain threshold and 0 otherwise. The entities in the sentence are then identified according to the minimum-span principle. Since entity identification is performed for each specific entity type, the entity representation already includes the type information, and k indicates that the entity's type is etype_k;
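The start/end tagging and minimum-span pairing can be sketched as below. The sigmoid scorers, the additive fusion of h_i with et_k, the threshold of 0.5 and the toy entity types are illustrative assumptions; the patent's exact scoring formulas are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
H = rng.standard_normal((6, d))              # hidden vectors h_i from the encoder
entity_types = ["Gene_or_gene_product", "Cell"]
et = {k: rng.standard_normal(d) for k in entity_types}   # type embeddings et_k

w_start = rng.standard_normal(d)
w_end = rng.standard_normal(d)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
threshold = 0.5

# p_i^start: likelihood that word i starts an entity.
p_start = sigmoid(H @ w_start)
starts = [i for i, p in enumerate(p_start) if p > threshold]

def ends_for_type(k):
    # p_i^{end_k}: likelihood that word i ends an entity of type k
    # (additive fusion of h_i and et_k is an assumption for illustration).
    return [i for i, p in enumerate(sigmoid((H + et[k]) @ w_end)) if p > threshold]

def min_span_entities(starts, ends):
    # Minimum-span principle: pair each start with the nearest end >= start.
    spans = []
    for s in starts:
        cands = [e for e in ends if e >= s]
        if cands:
            spans.append((s, min(cands)))
    return spans

# Entities are produced per type, so each span already carries its type k.
entities = {k: min_span_entities(starts, ends_for_type(k)) for k in entity_types}
```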
(3) Entity relationship graph
The entities output in the previous step are taken as the nodes of the entity relationship graph, and the graph is initialized. Assuming there are R relation types among the entities in total, R + 1 asymmetric adjacency matrices A^(s), s = 0, 1, ..., R, represent this graph; an element a_ij^(s) = 1 in an adjacency matrix means that head entity e_i and tail entity e_j have relation r_s between them, and 0 means no relation, with A^(0) a zero matrix. Then, for each entity relationship type r_s, a scorer is trained to score every entity pair:
where W are trainable parameters and f is a feature fusion function. When the calculated score exceeds a threshold λ, the corresponding element a_ij^(s) in the adjacency matrix A^(s) is set to 1, otherwise to 0; after all entity pairs have been predicted, the entity relationship graph corresponding to relation type r_s is obtained. The model trains R scorers simultaneously and generates R entity relationship graphs corresponding to the R different relation types; a_ij^(s) = 1 means that the relation triple <e_i, e_j, r_s> exists. The R entity relationship graphs are integrated and all related entity pairs are output, i.e., the two entities <e_i, e_j> corresponding to every non-zero element in A;
When the model finishes, all entity pairs <entity 1, entity 2> between which a relation exists in the text have been extracted. Because the entity recognition module treats each entity type as a mapping from the entity start word to the entity end word, and because the direction of each entity relation is considered when the entity relationship graph is constructed, with relation-specific markers built from entity representations fused with entity type information, the deep combination of entity and entity-relationship extraction is realized;
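The adjacency-matrix construction above can be sketched as follows. The bilinear scorer stands in for the trainable scorer f of the patent, and the relation names, sizes and threshold λ = 0.5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 8, 0.5
entities = [rng.standard_normal(d) for _ in range(3)]    # entity representations
relations = ["Regulation", "Part_of"]                    # r_1 .. r_R (assumption)
R, n = len(relations), len(entities)

# R + 1 asymmetric adjacency matrices; A[0] stays a zero matrix.
A = np.zeros((R + 1, n, n), dtype=int)
W = {s: rng.standard_normal((d, d)) for s in range(1, R + 1)}   # scorer params
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for s in range(1, R + 1):
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            score = sigmoid(entities[i] @ W[s] @ entities[j])   # scorer for r_s
            if score > lam:
                A[s, i, j] = 1       # relation triple <e_i, e_j, r_s> exists

# Integrate the R graphs: every non-zero cell yields a related entity pair.
pairs = {(i, j) for s in range(1, R + 1)
         for i in range(n) for j in range(n) if A[s, i, j]}
```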
(III) Biomedical event extraction based on the hierarchical distillation network
On the basis of existing deep-learning biomedical event extraction models, a biomedical event extraction model based on a hierarchical distillation network is provided. The model acquires syntactic information of different levels through a bidirectional GRU and a multilayer gated graph network, emphasizes the differences between sentence representations of different levels through a distillation module, and then introduces a residual connection to integrate the distilled sentence representations of all levels; the resulting sentence representation expresses richer syntactic features and captures longer-distance context information. Finally, the trigger word recognition task and the element recognition task are treated as a sequence labeling task and a multi-classification task respectively, and post-processing generates the final biomedical events. The specific flow of the model is as follows:
(1) input module
The model first embeds the input sentence W = {w_1, w_2, ..., w_L} to obtain the sentence vector X = {x_1, x_2, ..., x_L}; the sentence vector includes the following parts:
① word vector: the input original sentence is encoded with the pre-trained biomedical word representation model introduced above to obtain the contextual semantics of the words;
② entity type vector: the entity representations obtained by the joint biomedical entity and entity-relationship extraction model introduced above already contain the entity type information, so they are used as the entity type vectors of this module;
③ relative position vector: the relative position vector is used only for the element recognition task; the relative position of the candidate trigger word and the candidate element is calculated as follows:
where q and m are the starting index and the length of the candidate trigger word or element respectively, and L is the length of the sentence; a randomly initialized position embedding table is then looked up to obtain the relative position vector;
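The relative-position formula itself did not survive extraction, so the sketch below uses a common formulation as an assumption: a token gets offset 0 inside the candidate span and the signed distance to the nearest span boundary outside it, then indexes a randomly initialized embedding table.

```python
import numpy as np

def relative_position(i: int, q: int, m: int) -> int:
    # Common formulation (an assumption; the patent's exact formula is
    # not reproduced here): q is the span start index, m its length.
    if i < q:
        return i - q                # before the span: negative offset
    if i < q + m:
        return 0                    # inside the span
    return i - (q + m - 1)          # after the span: positive offset

L = 10                              # sentence length
rng = np.random.default_rng(3)
pos_table = rng.standard_normal((2 * L + 1, 4))   # randomly initialized table

# Relative positions of every token w.r.t. a candidate trigger at q=4, m=2.
rel = [relative_position(i, 4, 2) for i in range(L)]
vectors = pos_table[[r + L for r in rel]]          # shift offsets to valid indices
```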
(2) bidirectional GRU module
The sentence vector X = {x_1, x_2, ..., x_L} obtained by the input module is the input of this module. The bidirectional GRU consists of two recurrent computation units, a forward GRU unit and a backward GRU unit, which are updated continuously along the order of the sentence. The forward GRU and backward GRU obtain the forward hidden vector and the backward hidden vector respectively; concatenating them gives the sentence sequence vector H = (h_1, h_2, ..., h_L), where each h_t is the concatenation of the two. The specific calculation process of this module is as follows:
z_t = σ(W_z·[h_{t-1}, x_t]),  r_t = σ(W_r·[h_{t-1}, x_t])  (3.2)
where z_t and r_t are the update gate and reset gate of the GRU unit respectively, W_(·) are trainable parameters, σ(·) and tanh(·) are activation functions, and h̃_t is the hidden vector after passing through the reset gate;
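Eq. (3.2) and the bidirectional pass can be sketched as below. Only the gates of (3.2) are given in the text, so the candidate state and final update follow the standard GRU equations as an assumption; sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, d_h, L = 4, 3, 5
sigma, tanh = lambda z: 1 / (1 + np.exp(-z)), np.tanh

class GRUCell:
    def __init__(self):
        self.Wz = rng.standard_normal((d_h, d_h + d_x))
        self.Wr = rng.standard_normal((d_h, d_h + d_x))
        self.Wh = rng.standard_normal((d_h, d_h + d_x))

    def step(self, h_prev, x_t):
        z_t = sigma(self.Wz @ np.concatenate([h_prev, x_t]))  # update gate (3.2)
        r_t = sigma(self.Wr @ np.concatenate([h_prev, x_t]))  # reset gate  (3.2)
        # Standard GRU candidate state and update (assumed form):
        h_tilde = tanh(self.Wh @ np.concatenate([r_t * h_prev, x_t]))
        return (1 - z_t) * h_prev + z_t * h_tilde

X = rng.standard_normal((L, d_x))
fwd, bwd = GRUCell(), GRUCell()

h = np.zeros(d_h); H_fwd = []
for t in range(L):                      # forward pass over the sentence
    h = fwd.step(h, X[t]); H_fwd.append(h)
h = np.zeros(d_h); H_bwd = [None] * L
for t in reversed(range(L)):            # backward pass over the sentence
    h = bwd.step(h, X[t]); H_bwd[t] = h

# h_t = [forward; backward] for every position.
H = np.stack([np.concatenate([H_fwd[t], H_bwd[t]]) for t in range(L)])
```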
(3) gate control graph network module
The graph G = (V, E) is defined as the dependency syntax tree of the sentence, where V = (v_1, v_2, ..., v_L) and E = (e_1, e_2, ..., e_M) are the node set and edge set of graph G respectively. This module takes the node feature vectors H^(l) obtained in the previous step and the adjacency matrix A of the graph as input, and outputs the new node feature vectors H^(l+1). In the propagation process, the GGN adopts a gating mechanism; the calculation process is as follows:
U = σ(W_U·[Y, H^(l)]),  R = σ(W_R·[Y, H^(l)])  (3.5)
where the feature vectors are symmetrically normalized and Y is the normalized vector, W_(·) are trainable parameters, σ(·) and relu(·) are activation functions, U and R are the update gate and reset gate of the GGN respectively, and H̃ is the node feature vector after the GGN reset gate. In the GGN, the edges between nodes are undirected. As GGN layers are stacked, nodes exchange information with nodes farther and farther away, so their feature vectors contain more syntactic information: one GGN layer induces information interaction between adjacent nodes, and stacking another GGN layer on top of it lets each node interact with the nodes at distance 2 or less. Accordingly, the output of the first GGN layer is called the first-order syntactic vector, the output of the second layer the second-order syntactic vector, and so on;
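One GGN layer with the gates of (3.5) can be sketched as follows. The self-loops added before normalization, the relu candidate state and the toy dependency tree are illustrative assumptions; the gates match the update/reset structure described above.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 4, 3
sigma, relu = lambda z: 1 / (1 + np.exp(-z)), lambda z: np.maximum(z, 0)

# Undirected dependency edges of a 4-token sentence (toy tree: 0-1, 1-2, 1-3).
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (1, 3)]:
    A[i, j] = A[j, i] = 1
A_hat = A + np.eye(n)                           # self-loops (an assumption)
D_inv_sqrt = np.diag(1 / np.sqrt(A_hat.sum(1)))

WU = rng.standard_normal((d, 2 * d))
WR = rng.standard_normal((d, 2 * d))
WH = rng.standard_normal((d, 2 * d))

def ggn_layer(H):
    Y = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H     # symmetric normalization
    U = sigma(np.concatenate([Y, H], 1) @ WU.T) # update gate (3.5)
    R = sigma(np.concatenate([Y, H], 1) @ WR.T) # reset gate  (3.5)
    H_tilde = relu(np.concatenate([Y, R * H], 1) @ WH.T)  # after the reset gate
    return (1 - U) * H + U * H_tilde            # gated node update

H1 = ggn_layer(rng.standard_normal((n, d)))     # first-order syntactic vectors
H2 = ggn_layer(H1)                              # second-order syntactic vectors
```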
(4) distillation module
Due to the stacking of the GGN modules, the syntactic vectors of different orders inevitably overlap in the feature space, causing a redundancy problem, so a distillation module is designed to reconstruct the syntactic vectors of each order obtained in the previous process. This module mainly emphasizes the heterogeneity between syntactic vectors of different orders, for example between the first-order syntactic vector and the second-order syntactic vector. The module includes three core operations:
① alignment
where W_E are trainable parameters, e_{i,j} is the fused vector, and the two attention weights represent the weights of the first-order vector and the second-order vector during alignment respectively;
② comparison
By comparing S_l with the aligned vector, the distillation ratio of the l-th order syntactic vector is obtained; the calculation formula is as follows:
③ distillation
The distilled l-th order syntactic vector is obtained by the distillation operation, where:
Finally, the distilled syntactic vectors of all orders are integrated through a residual connection to obtain the final sentence representation R = (r_1, r_2, ..., r_L), which expresses richer syntactic features and captures longer-distance context information;
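The align/compare/distill pipeline with the residual integration can be sketched as below. The patent's formulas for these operations did not survive extraction, so the concrete forms of the alignment, the gate and the damping are assumptions; only the three-step structure and the final residual sum follow the text.

```python
import numpy as np

rng = np.random.default_rng(6)
L_len, d = 5, 3
sigma = lambda z: 1 / (1 + np.exp(-z))

# Syntactic vectors of three orders from the stacked GGN layers.
orders = [rng.standard_normal((L_len, d)) for _ in range(3)]
W_E = rng.standard_normal((d, 2 * d))      # trainable fusion parameters

def distill(h_l, h_next):
    # 1) align: fuse each token's l-th and (l+1)-th order vectors (assumed form).
    e = np.concatenate([h_l, h_next], 1) @ W_E.T
    # 2) compare: a gate estimating how much of the l-th order to keep.
    ratio = sigma((h_l * e).sum(1, keepdims=True))
    # 3) distill: damp what the next order already expresses.
    return ratio * h_l

distilled = [distill(orders[l], orders[l + 1]) for l in range(2)]
distilled.append(orders[-1])               # highest order passes through

# Residual connection integrates all distilled orders into R = (r_1..r_L).
R = sum(distilled)
```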
(5) Trigger word recognition and element recognition
① trigger word recognition
In this model, trigger word recognition is treated as a sequence labeling task with the BILOU labeling scheme. At the i-th time step, the i-th word r_i is labeled, and a Softmax classifier predicts the trigger word:
p(y_i|x) = softmax(W_te·r_i + b_te)  (3.13)
where W_te and b_te are trainable parameters;
② element identification
In this model, element recognition is treated as a multi-classification task, and each trigger word-entity pair or trigger word-trigger word pair is classified with the Softmax classifier:
p(y|x) = softmax(W_ae·[t, a] + b_ae)  (3.14)
where t is the trigger word vector, a is the vector of the other entity or trigger word, and W_ae and b_ae are trainable parameters;
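Both classifiers of (3.13) and (3.14) can be sketched as below. The BILOU tag set and the role labels are illustrative assumptions; the classifier forms follow the two equations.

```python
import numpy as np

rng = np.random.default_rng(7)
L_len, d = 4, 3
R_sent = rng.standard_normal((L_len, d))   # final sentence representation r_1..r_L

softmax = lambda z: np.exp(z - z.max(-1, keepdims=True)) / \
                    np.exp(z - z.max(-1, keepdims=True)).sum(-1, keepdims=True)

# (3.13) trigger recognition: per-token softmax over BILOU tags (toy label set).
trigger_labels = ["B-Gene_expression", "I-Gene_expression", "L-Gene_expression",
                  "O", "U-Negative_regulation"]
W_te = rng.standard_normal((len(trigger_labels), d))
b_te = rng.standard_normal(len(trigger_labels))
p_trigger = softmax(R_sent @ W_te.T + b_te)     # one distribution per token
tags = [trigger_labels[i] for i in p_trigger.argmax(1)]

# (3.14) element recognition: classify one trigger-argument pair [t; a].
roles = ["Theme", "Cause", "None"]
W_ae = rng.standard_normal((len(roles), 2 * d))
b_ae = rng.standard_normal(len(roles))
t, a = R_sent[1], R_sent[3]                     # candidate trigger and argument
p_role = softmax(np.concatenate([t, a]) @ W_ae.T + b_ae)
role = roles[p_role.argmax()]
```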
(6) post-processing
The post-processing step combines the trigger words and elements obtained in the previous step into complete events, i.e., performs event detection. In this model, an SVM classifier automatically learns the legal event structure of each event type from extracted features; candidate events are then formed, and finally the event type of each candidate event is determined;
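The structure check at the heart of event detection can be sketched as below. A rule-based lookup stands in for the SVM classifier described above, and the legal-structure table is an illustrative assumption; only the idea that a candidate survives if its role set matches a legal structure for its event type follows the text.

```python
# Legal event structures per event type (toy table, an assumption).
legal_structures = {
    "Gene_expression": [{"Theme"}],
    "Blood_vessel_development": [set()],           # may be an element-free event
    "Negative_regulation": [{"Theme"}, {"Theme", "Cause"}],
}

def form_event(event_type, roles):
    # A candidate event survives post-processing only if its role set
    # matches a legal structure for its event type.
    return set(roles) in legal_structures.get(event_type, [])

# Candidate events modeled on Table 4:
t1 = form_event("Blood_vessel_development", [])            # simple event, legal
t4 = form_event("Negative_regulation", ["Theme", "Cause"]) # legal
bad = form_event("Gene_expression", ["Cause"])             # illegal structure
```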
(IV) cancer-related biomedical event database construction
The biomedical entity and biomedical event information required for constructing the cancer-related biomedical event database is obtained by the above method and stored in the database; the constructed cancer-related biomedical event database comprises a biomedical entity table and a biomedical event table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010629395.7A CN111859935B (en) | 2020-07-03 | 2020-07-03 | Method for constructing cancer-related biomedical event database based on literature |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111859935A CN111859935A (en) | 2020-10-30 |
CN111859935B true CN111859935B (en) | 2022-09-20 |
Family
ID=73151955
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||