CN111859935B - Method for constructing cancer-related biomedical event database based on literature - Google Patents


Publication number
CN111859935B
Authority
CN
China
Prior art keywords
entity
biomedical
vector
word
model
Prior art date
Legal status
Active
Application number
CN202010629395.7A
Other languages
Chinese (zh)
Other versions
CN111859935A (en)
Inventor
李丽双
连瑞源
黄梦佐
王泽昊
袁光辉
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202010629395.7A
Publication of CN111859935A
Application granted
Publication of CN111859935B
Legal status: Active

Classifications

    • G06F 40/279 (Physics > Computing; calculating or counting > Electric digital data processing > Handling natural language data > Natural language analysis > Recognition of textual entities)
    • G06F 16/31 (Physics > Computing; calculating or counting > Electric digital data processing > Information retrieval; database structures > Unstructured textual data > Indexing; data structures; storage structures)
    • G06N 20/00 (Physics > Computing; calculating or counting > Computing arrangements based on specific computational models > Machine learning)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of natural language processing and provides a method for constructing a cancer-related biomedical event database from the literature, comprising three parts: 1. joint extraction of biomedical entities and entity relations based on an entity relationship graph; 2. biomedical event extraction based on a hierarchical distillation network; 3. construction of a cancer-related biomedical event database. Building on traditional methods, the method fully considers the characteristics of entities and context in biomedical text and solves the problems of multi-type entity recognition and incomplete entity recognition in biomedical event extraction. It obtains deeper syntactic information through a hierarchical distillation network to extract biomedical events, improving the precision of complex event extraction. The method can help researchers in the biomedical field automatically analyze text, provides search over known biomedical named entities and biomedical events, and helps researchers study and analyze relevant biomedical literature.

Description

Method for constructing cancer-related biomedical event database based on literature
Technical Field
The invention belongs to the technical field of natural language processing and relates to a method for high-quality biomedical named entity recognition, entity relationship extraction, and biomedical event extraction from biomedical literature, in particular to joint extraction of biomedical entities and entity relationships based on an Entity Relationship Graph (ERG) and biomedical event extraction based on a Hierarchical Distillation Network (HDN).
Background
The biomedical event database stores complete biomedical events, each consisting of a trigger word, typically a verb or a nominalized verb, and elements, typically biomedical entities or the trigger words of other events. Constructing the biomedical event database involves three steps: word representation, biomedical named entity recognition and entity relationship extraction, and biomedical event extraction.
Word representation means representing words as vectors of a specific dimension, so that the semantic information of words is reflected in a vector space and rich semantic feature information can be obtained from large amounts of unlabeled corpora. To obtain more valuable information from biomedical texts, Lee et al. (BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR abs/1901.08746, 2019) initialized the BioBERT model with a BERT model pre-trained on general corpora (English Wikipedia and BooksCorpus) and fine-tuned it on the biomedical corpora PubMed and PMC, obtaining a word representation model suited to the biomedical field; its F-value on the NCBI Disease entity recognition task is 89.71%.
Existing biomedical named entity recognition and entity relationship extraction models are mainly staged extraction models and joint extraction models. The staged approach ignores the correlation between entity recognition and relation extraction and suffers from error propagation. The classic joint-extraction approach is to build an end-to-end model, such as the end-to-end joint model proposed by Miwa and Bansal (End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures [C]. Association for Computational Linguistics, 2016) for entity recognition and entity relationship extraction, which incorporates dependency path information and reaches entity recognition and relation extraction F-values of 83.4% and 55.6%, respectively, on ACE05. However, such models generally connect the two subtasks only through parameter sharing; the connection is shallow and generates a large amount of redundant information. How to fully model the dependency between entity recognition and entity relationship extraction, and how to build a deeply connected entity and entity-relationship extraction model, therefore still requires research.
Biomedical event extraction is the process of automatically detecting, from the literature, descriptions of fine-grained interactions between biomolecules (such as genes and proteins) and biological organs and tissues, with the purpose of extracting structured information about predefined event types from unstructured text. Current biomedical event extraction models fall into two types, based respectively on a staged strategy and a joint strategy. The staged strategy separately identifies the trigger words and elements of an event and then obtains the complete event representation through post-processing. For example, Miwa et al. (Boosting automatic event extraction from the literature using domain adaptation and coreference resolution [J]. Bioinformatics, 2012) improved results by using rich feature sets and fusing coreference resolution with a domain adaptation strategy; their EventMine system reached an F-value of 57.98% on the BioNLP'11 corpus. The staged strategy is easy to understand and implement, but has obvious defects such as cascading errors. The joint strategy performs subtasks such as trigger word and element recognition jointly, which avoids error propagation among subtasks to a certain extent and reduces cascading errors. For example, in early work based on shallow machine learning, Poon et al. (Joint inference for knowledge extraction from biomedical literature [C]. Association for Computational Linguistics, 2010) adopted a Markov logic network and manually summarized joint predicate-logic statements to extract trigger words and elements, obtaining an F-value of 50.0% on the BioNLP'09 test set. Although Markov logic networks can avoid cascading errors, they do not make good use of large numbers of features. Li et al. (Extracting biomedical events with dual decomposition integrating word embeddings [J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015) added word vector features containing more syntactic and semantic information, with an F-value of 53.19% on the BioNLP'13 test set. However, biomedical event extraction models based on either the staged or the joint strategy mostly require complicated feature engineering and depend on additional expert knowledge. In recent years, owing to the advantages deep learning has shown in capturing implicit textual information, some deep-learning-based biomedical event extraction models have been proposed. For example, Yu et al. (LSTM-Based End-to-End Framework for Biomedical Event Extraction [J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019) represented the raw corpus with word embeddings under the joint strategy, fed it to a Bi-LSTM for trigger recognition, and simultaneously performed trigger-entity detection with a Tree-LSTM. Their F-values on the BioNLP'13, BioNLP'11, and BioNLP'09 test sets reached 57.39%, 58.23%, and 59.68%, respectively.
From the above analysis, although biomedical event extraction research has achieved rich results, problems such as complex event extraction still await systematic and deep study. Although many biomedical ontologies and databases exist, such as Gene Ontology and pathway database systems, and related research has established binary interaction networks such as protein-protein interaction networks, how to construct finer-grained databases for cancer remains a research topic of great concern to many researchers.
Disclosure of the invention
The invention provides a joint biomedical entity and entity-relationship extraction model based on an entity relationship graph and a biomedical event extraction system based on a hierarchical distillation network. It solves the problems that prior research could not fully capture long-distance contextual semantic information or fully express syntactic features, and improves the accuracy of event extraction, thereby constructing a fine-grained cancer-related biomedical event database.
The invention mainly comprises three parts: 1. joint extraction of biomedical entities and entity relations based on an Entity Relationship Graph (ERG); 2. biomedical event extraction based on a Hierarchical Distillation Network (HDN); 3. construction of a cancer-related biomedical event database. Biomedical named entity recognition and entity relationship extraction are necessary prerequisites for biomedical event extraction and for building the biomedical event interaction network, which is finally presented in the form of a cancer-related biomedical event database.
The technical scheme adopted by the invention is as follows:
the method for constructing the cancer-related biomedical event database based on literature comprises the following steps:
(I) obtaining and preprocessing model training test corpus
(1) 261-; (2) split the biomedical documents into sentences and add [CLS] and [SEP] as the beginning and end markers of each sentence; (3) tokenize the text obtained in step (2) using special symbols; (4) train with the pre-trained biomedical word representation (BioBERT) model to obtain word vectors. In the word vector training process, the public BioBERT code is first obtained, fine-tuning is then completed with the biomedical corpora PubMed and PMC and the model parameters are saved, and finally the fine-tuned model is used to process the MLEE corpus to obtain word vectors;
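The sentence splitting and marker insertion of steps (2)-(4) can be sketched as below. This toy version uses naive punctuation-based sentence splitting and whitespace tokenization in place of the BioBERT WordPiece tokenizer; the `preprocess` function and its splitting rules are illustrative assumptions, not the patent's pipeline.

```python
import re

def preprocess(document):
    """Split a document into sentences and wrap each with BERT-style markers.

    Toy stand-in for the described preprocessing: a real run would use the
    BioBERT WordPiece tokenizer; here we split tokens on whitespace.
    """
    # naive sentence split on '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', document) if s.strip()]
    # each sentence becomes [CLS] w_1 ... w_n [SEP]
    return [["[CLS]"] + s.split() + ["[SEP]"] for s in sentences]

tokens = preprocess("TIMP-3 inhibits angiogenesis. TIMPs regulate expression.")
```

Each token list is then fed to the fine-tuned BioBERT encoder to obtain word vectors.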
(II) biomedical entities and entity relationship joint extraction based on Entity Relationship Graph (ERG)
In order to realize deep combination of the biomedical entity recognition and entity relationship extraction tasks, the invention provides a joint biomedical entity and entity-relationship extraction model based on an Entity Relationship Graph (ERG). In the model, an entity recognition module first recognizes entities in the input text for each specific entity type; an entity-relationship directed graph is then constructed for each specific relation type, and all related entity pairs in the graph are output. The model is divided into three modules:
(1) input module
The input source sentence is encoded with the pre-trained biomedical word representation (BioBERT) model described above. The input is S = {[CLS], w_1, w_2, ..., w_n, [SEP]}, where [CLS] and [SEP] are the beginning and end markers of the sentence. The hidden-layer features H_E and the sentence vector X are extracted as H_E, X = Encoder(S), where H_E = (h_1^E, h_2^E, ..., h_n^E) with h_i^E ∈ R^{d_E} the hidden-layer output of the i-th word, X ∈ R^{d_E} is the vector representation of the whole sentence (in the model implementation, the [CLS] vector is used instead), and d_E is the number of hidden units of the Encoder;
(2) entity identification module
In biomedical texts, an entity may consist of multiple words and may belong to multiple types, but most traditional models, such as copy mechanisms, can only recognize the last word of an entity, so the recognized entity lacks integrity and can belong to only one type. This module models the entity type as a mapping from the entity's start word to its end word, rather than classifying after the entity is identified, i.e., it learns a marker f_{etype}(w_start) → w_end specific to each entity class, which solves the problems of multi-type entity recognition and incomplete entity recognition;
firstly, generating a hidden layer characteristic H according to the previous step E Marking out which words in the sentence are most likely to be the beginning of an entity, and then constructing a marker for each entity type to mark out which words in the sentence are most likely to be the end of a specific type of entity; the specific operation is as follows:
Figure BDA0002567872710000033
Figure BDA0002567872710000034
wherein,
Figure BDA0002567872710000035
et k as entity type etype k The embedding of (a) into (b),
Figure BDA0002567872710000036
representing the likelihood that the ith word is identified as the entity start word,
Figure BDA0002567872710000041
representing that the ith word is recognized as an type k The possibility of the end word of the type entity is classified into two categories, and the category is defined as 1 when the value of the category exceeds a certain threshold, otherwise, the category is 0; then, the entity in the sentence is identified according to the minimum span principle
Figure BDA0002567872710000042
Since entity identification is performed for a specific entity type, the entity representation already includes its type information, and k represents that the type of this entity is type k
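The threshold-and-minimum-span decoding described above can be sketched as follows. The probabilities here are toy values; `extract_entities`, the 0.5 threshold, and the example scores are illustrative assumptions rather than the patent's trained taggers, whose scores would come from BioBERT hidden states.

```python
def extract_entities(start_probs, end_probs_by_type, threshold=0.5):
    """Decode typed entity spans from start/end probabilities.

    start_probs: per-token entity-start scores.
    end_probs_by_type: dict entity_type -> per-token entity-end scores.
    Following the minimum-span principle, each predicted start is paired
    with the nearest end position at or after it, per entity type.
    """
    entities = []
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    for etype, end_probs in end_probs_by_type.items():
        ends = [j for j, p in enumerate(end_probs) if p > threshold]
        for i in starts:
            later = [j for j in ends if j >= i]
            if later:
                entities.append((i, min(later), etype))
    return entities

# token positions 0..2 for a hypothetical sentence "TIMP-3 inhibits angiogenesis"
spans = extract_entities(
    [0.9, 0.1, 0.8],
    {"Gene": [0.95, 0.0, 0.1], "Process": [0.0, 0.0, 0.9]},
)
```

The decoded spans carry their type directly, matching the start-to-end-word mapping view of entity types.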
(3) Entity Relationship Graph (ERG)
Taking the entities output in the previous step as nodes of the entity relationship graph, the graph is initialized. Assuming there are R relation types among entities in total, R + 1 asymmetric adjacency matrices A^(0), A^(1), ..., A^(R) represent the graph. An element A_{ij}^(s) = 1 in the adjacency matrix indicates that head entity e_i and tail entity e_j have the relation r_s between them; otherwise there is no relation, and A^(0) is a zero matrix. Then, for each entity relationship type r_s, a scorer is trained to score each entity pair (e_i, e_j):

score_s(e_i, e_j) = σ(W_s · f(e_i, e_j) + b_s)

where W_s and b_s are trainable parameters and f is a feature fusion function. When the score exceeds a threshold λ, the corresponding element A_{ij}^(s) in the adjacency matrix A^(s) is set to 1, otherwise 0. After all entity pairs are predicted, the entity relationship graph for relation type r_s is obtained. The model trains R scorers simultaneously, generating R entity relationship graphs corresponding to the R different relation types; A_{ij}^(s) = 1 means the relation triple <e_i, e_j, r_s> exists. The R entity relationship graphs are integrated into A = Σ_s A^(s), and the entity pairs <e_i, e_j> corresponding to all nonzero elements of A are output;
When the model finishes, all entity pairs <entity 1, entity 2> with a relationship are extracted from the text. The entity type is treated as a mapping from entity start word to entity end word in the entity identification module; when constructing the entity relationship graph, the directionality of the entity relationship is considered, and a scorer for each specific relation is built from entity representations fused with entity type information, realizing deep joint modeling of entities and entity relationships, as shown in Figure 1;
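A minimal sketch of the adjacency-matrix construction: one scorer per relation type scores every ordered entity pair, entries above the threshold λ become 1, and triples <e_i, e_j, r_s> are read off. The dot-product scorer stands in for the trained per-relation scorers and is an assumption for illustration.

```python
import numpy as np

def build_relation_graphs(entity_vecs, scorers, lam=0.5):
    """Build one directed adjacency matrix per relation type.

    entity_vecs: (n, d) array of entity representations (type-fused).
    scorers: list of callables scoring an ordered (head, tail) pair;
    hypothetical stand-ins for the trained per-relation scorers.
    Returns the adjacency stack A of shape (R, n, n) and the triples
    (i, j, s) whose score exceeds the threshold lam.
    """
    n, R = len(entity_vecs), len(scorers)
    A = np.zeros((R, n, n), dtype=int)
    triples = []
    for s, score in enumerate(scorers):
        for i in range(n):
            for j in range(n):
                if i != j and score(entity_vecs[i], entity_vecs[j]) > lam:
                    A[s, i, j] = 1           # head e_i, tail e_j, relation r_s
                    triples.append((i, j, s))
    return A, triples

vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
dot = lambda h, t: float(h @ t)              # toy scorer: dot-product score
A, triples = build_relation_graphs(vecs, [dot], lam=0.5)
```

Because the matrices are asymmetric, directionality of each relation is preserved, as in the model.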
(III) layered distillation network (HDN) based biomedical event extraction
Building on existing deep-learning-based biomedical event extraction models, the invention provides a biomedical event extraction model based on a Hierarchical Distillation Network (HDN). The model obtains syntactic information at different levels through a bidirectional GRU and multiple layers of a Gated Graph Network (GGN), emphasizes the differences between sentence representations of different levels through a distillation module, and introduces residual connections to integrate the distilled sentence representations of all levels, so that the resulting sentence representations express richer syntactic features and capture more distant contextual information. Finally, the trigger word recognition task and the element recognition task are treated as a sequence labeling task and a multi-classification task, respectively, and post-processing generates the final biomedical events. The specific flow of the model is as follows:
(1) input module
The model first embeds the input sentence W = {w_1, w_2, ..., w_L} to obtain the sentence vector X = {x_1, x_2, ..., x_L}, which includes the following parts:
① Word vector: the input original sentence is encoded with the pre-trained biomedical word representation (BioBERT) model introduced above to obtain the contextual semantics of each word;
② Entity type vector: the entity representations obtained by the joint biomedical entity and entity-relationship extraction model introduced above already contain entity type information, so they are used as the entity type vectors of this module;
③ Relative position vector: the relative position vector is used only for the element recognition task. The relative position of the i-th word with respect to a candidate trigger word or element is computed as:

rp_i = i - q, if i < q;  0, if q ≤ i < q + m;  i - (q + m - 1), if i ≥ q + m   (3.1)

where q and m are the starting index and the length of the candidate trigger word or element, and L is the length of the sentence. The relative position vector is then obtained by looking up a randomly initialized position embedding table;
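The relative-position computation can be sketched as follows, assuming a common encoding in which tokens inside the candidate span get 0 and tokens outside get their signed distance to the nearest span boundary (the patent's formula survives only as an image, so this encoding is an assumption).

```python
def relative_position(i, q, m):
    """Relative position of token i w.r.t. a span starting at q of length m.

    Inside the span the position is 0; before it, the (negative) distance
    to the span start; after it, the distance past the span's last token.
    """
    if i < q:
        return i - q
    if i < q + m:
        return 0
    return i - (q + m - 1)

# positions of 6 tokens relative to a candidate trigger at q=2 of length m=2
positions = [relative_position(i, q=2, m=2) for i in range(6)]
```

Each integer is then mapped through a randomly initialized position embedding table, as described above.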
(2) bidirectional GRU module
The sentence vector X = {x_1, x_2, ..., x_L} obtained by the input module is the input of this module. The bidirectional GRU consists of two recurrent computation units, a forward and a backward GRU unit, which are continuously updated according to the order of the sentence. The forward and backward GRUs respectively yield the forward hidden vector h_t→ and the backward hidden vector h_t←, which are concatenated to obtain the sentence sequence vector H = (h_1, h_2, ..., h_L), where h_t = [h_t→; h_t←]. The specific calculation process of the module is:

z_t = σ(W_z · [h_{t-1}, x_t])    r_t = σ(W_r · [h_{t-1}, x_t])   (3.2)
h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t])    h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t   (3.3)

where z_t and r_t are the update and reset gates of the GRU unit, W_(·) are trainable parameters, σ(·) and tanh(·) are activation functions, and h̃_t is the hidden vector after passing through the reset gate;
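One GRU step following the update/reset-gate computation above can be sketched in numpy; the weight shapes and random inputs are illustrative, and the bidirectional version simply runs a second unit over the reversed sequence and concatenates the two hidden vectors.

```python
import numpy as np

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    """One GRU update: update gate z_t, reset gate r_t, candidate state,
    then the gated interpolation between h_prev and the candidate."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ concat)                       # update gate
    r = sigmoid(Wr @ concat)                       # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
d_x, d_h = 4, 3
Wz, Wr, Wh = (rng.standard_normal((d_h, d_h + d_x)) for _ in range(3))
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_x)):            # run a 5-step sequence
    h = gru_step(x, h, Wz, Wr, Wh)
```

Because each state is a convex combination of the previous state and a tanh candidate, the hidden values stay bounded in (-1, 1).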
(3) gated Graph Network (GGN) module
The dependency syntax tree of the sentence is defined as a graph G = (V, E), where V = (v_1, v_2, ..., v_L) and E = (e_1, e_2, ..., e_M) are the node set and edge set of G. The module takes the node feature vectors H^(l) obtained in the previous step and the adjacency matrix A of the graph as input, and outputs new node feature vectors H^(l+1). During propagation, the GGN adopts a gate mechanism; the calculation process is:

Y = D̃^{-1/2} Ã D̃^{-1/2} H^(l)   (3.4)
U = σ(W_U · [Y, H^(l)])    R = σ(W_R · [Y, H^(l)])   (3.5)
H̃ = relu(W_H · [Y, R ⊙ H^(l)])   (3.6)
H^(l+1) = (1 - U) ⊙ H^(l) + U ⊙ H̃   (3.7)

where D̃^{-1/2} Ã D̃^{-1/2} symmetrically normalizes the feature vectors, Y is the normalized vector, W_(·) are trainable parameters, σ(·) and relu(·) are activation functions, U and R are the update and reset gates of the GGN, and H̃ is the node feature vector after the GGN reset gate. In the GGN, the edges between nodes are undirected. As GGN layers are stacked, a node can exchange information with nodes farther away, so that its feature vector contains more syntactic information: one GGN layer induces information interaction between adjacent nodes in the graph, and stacking another GGN layer on top allows a node to interact with nodes at distance 2 or less. Accordingly, the model uses first-order syntax vectors to denote the output of the first GGN layer, second-order syntax vectors for the output of the second GGN layer, and so on;
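A numpy sketch of one GGN layer under stated assumptions: symmetric normalization with self-loops for message passing over the undirected dependency graph, followed by GRU-style update and reset gates. Stacking two calls yields the first- and second-order syntax vectors; weights and inputs here are random placeholders.

```python
import numpy as np

def ggn_layer(H, A, WU, WR, WH):
    """One Gated Graph Network layer.

    H: (n, d) node features; A: (n, n) undirected 0/1 adjacency.
    Message passing uses D^{-1/2}(A+I)D^{-1/2}, then gated updates.
    """
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    relu = lambda v: np.maximum(v, 0.0)
    A_hat = A + np.eye(len(A))                        # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    Y = (A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]) @ H
    U = sigmoid(np.concatenate([Y, H], axis=1) @ WU)  # update gate
    R = sigmoid(np.concatenate([Y, H], axis=1) @ WR)  # reset gate
    H_tilde = relu(np.concatenate([Y, R * H], axis=1) @ WH)
    return (1 - U) * H + U * H_tilde

rng = np.random.default_rng(1)
n, d = 4, 3
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
H = rng.standard_normal((n, d))                       # e.g. Bi-GRU outputs
WU, WR, WH = (rng.standard_normal((2 * d, d)) for _ in range(3))
H1 = ggn_layer(H, A, WU, WR, WH)   # first-order syntax vectors (1-hop)
H2 = ggn_layer(H1, A, WU, WR, WH)  # second-order syntax vectors (2-hop)
```

The chain-shaped adjacency matrix mimics a small dependency graph; after two layers, each node has mixed information from nodes up to two edges away.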
(4) distillation module
Due to the stacking of the GGN modules, syntax vectors of different orders inevitably overlap in feature space and cause redundancy, so a distillation module is designed to reconstruct the syntax vectors of each order obtained above. The module mainly emphasizes the heterogeneity among the syntax vectors of different orders of a sentence. Taking the first-order syntax vectors S^(1) = (s_1^(1), ..., s_L^(1)) and the second-order syntax vectors S^(2) = (s_1^(2), ..., s_L^(2)) as an example, the module includes three core operations:

① Alignment
This operation obtains the vector s̄_i^(1) of S^(2) aligned to s_i^(1). The alignment process is:

e_{i,j} = W_E · [s_i^(1); s_j^(2)]   (3.8)
α_{i,j} = exp(e_{i,j}) / Σ_k exp(e_{i,k})   (3.9)
s̄_i^(1) = Σ_j α_{i,j} · s_j^(2)   (3.10)

where W_E is a trainable parameter, e_{i,j} is the fused score of s_i^(1) and s_j^(2), and α_{i,j} are the weights of the first-order and second-order vectors during alignment;

② Comparison
By comparing s_i^(l) with its aligned vector s̄_i^(l), the distillation ratio g_i^(l) of the l-th-order syntax vector is obtained:

g_i^(l) = σ(W_C · [s_i^(l); s̄_i^(l)])   (3.11)

③ Distillation
The distilled l-th-order syntax vector is obtained by the distillation operation:

ŝ_i^(l) = g_i^(l) ⊙ s_i^(l)   (3.12)

Finally, the distilled syntax vectors of all orders are integrated through residual connections to obtain the final sentence representation R = (r_1, r_2, ..., r_L), which expresses richer syntactic features and captures more distant context information;
(5) Trigger word recognition and element recognition
Triggering word recognition
In the model, trigger word recognition is treated as a sequence labeling task using the BILOU (Begin, Inside, Last, Outside, Unit) tagging scheme. At the i-th time step, the label of the i-th word representation r_i is predicted with a Softmax classifier:

p(y_i | x) = softmax(W_te · r_i + b_te)   (3.13)

where W_te and b_te are trainable parameters;
② element identification
In the model, element recognition is treated as a multi-classification task; each trigger-entity pair or trigger-trigger pair is classified with a Softmax classifier:

p(y | x) = softmax(W_ae · [t; a] + b_ae)   (3.14)

where t is the trigger word vector, a is the vector of the other entity or trigger word, and W_ae and b_ae are trainable parameters;
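The two Softmax classifiers of eqs. (3.13) and (3.14) can be sketched as follows. The BILOU label set, role names, and random weights shown here are illustrative assumptions; in the model, r_i, t, and a come from the distilled sentence representation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def predict_trigger(r_i, W_te, b_te, labels):
    """Eq. (3.13): BILOU tag for the i-th token representation."""
    return labels[int(np.argmax(softmax(W_te @ r_i + b_te)))]

def predict_role(t, a, W_ae, b_ae, roles):
    """Eq. (3.14): role of a trigger-entity / trigger-trigger pair [t; a]."""
    return roles[int(np.argmax(softmax(W_ae @ np.concatenate([t, a]) + b_ae)))]

rng = np.random.default_rng(3)
d = 4
labels = ["B-Trig", "I-Trig", "L-Trig", "O", "U-Trig"]   # BILOU scheme
roles = ["None", "Theme", "Cause"]
W_te, b_te = rng.standard_normal((len(labels), d)), np.zeros(len(labels))
W_ae, b_ae = rng.standard_normal((len(roles), 2 * d)), np.zeros(len(roles))
tag = predict_trigger(rng.standard_normal(d), W_te, b_te, labels)
role = predict_role(rng.standard_normal(d), rng.standard_normal(d), W_ae, b_ae, roles)
```

Predicted tags and roles then feed the post-processing step that assembles complete events.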
(6) post-treatment
The post-processing step combines the trigger words and elements obtained in the previous step into complete events, i.e., event detection. In the model, an SVM classifier automatically learns the legal event structure of each event type from extracted features; candidate events are then formed, and finally the event type of each candidate event is determined.
The biomedical event extraction model based on the Hierarchical Distillation Network (HDN) can capture richer syntactic features and more distant contextual information. It then treats the trigger word recognition and element recognition tasks as sequence labeling and multi-classification tasks, respectively, and performs post-processing to generate the final biomedical events. The model achieved an F-value of 62.74% on MLEE, 3.13% higher than the current best model, as shown in FIG. 2;
(IV) cancer-related biomedical event database construction
The biomedical entity and biomedical event information required for constructing the cancer-related biomedical event database is obtained by the above method and stored in the database. The constructed cancer-related biomedical event database comprises a biomedical entity table, a biomedical event table, and others, as shown in the following table:
table 1 database table
(table content provided as an image in the original document)
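Since the concrete schema of Table 1 survives only as an image, the following sqlite3 sketch proposes plausible entity, event, and event-argument tables; every table and column name here is an assumption, not the patent's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    entity_id    INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    entity_type  TEXT NOT NULL,      -- e.g. Gene, Drug, Cell
    doc_id       TEXT                -- source document identifier
);
CREATE TABLE event (
    event_id     INTEGER PRIMARY KEY,
    event_type   TEXT NOT NULL,      -- e.g. Gene_expression
    trigger_word TEXT NOT NULL,
    doc_id       TEXT
);
CREATE TABLE event_argument (        -- Theme / Cause links
    event_id     INTEGER REFERENCES event(event_id),
    role         TEXT NOT NULL,
    entity_id    INTEGER REFERENCES entity(entity_id)
);
""")
conn.execute("INSERT INTO entity VALUES (1, 'TIMP-3', 'Gene', 'doc-demo')")
conn.execute("INSERT INTO event VALUES (1, 'Gene_expression', 'expression', 'doc-demo')")
conn.execute("INSERT INTO event_argument VALUES (1, 'Theme', 1)")
rows = conn.execute(
    "SELECT e.event_type, a.role, n.name FROM event e "
    "JOIN event_argument a ON e.event_id = a.event_id "
    "JOIN entity n ON a.entity_id = n.entity_id").fetchall()
```

Splitting arguments into a separate link table lets one event carry any number of Theme/Cause elements, including references to other events for complex events.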
The invention builds a biomedical event extraction platform based on joint entity and entity-relationship extraction and event extraction in the biomedical field, and provides query services for researchers. Biomedical event extraction analyzes biomedical literature and presents potential fine-grained complex event relationships across different system levels, including the molecular, cell, and tissue levels. Literature on cancer genetics is currently growing massively; for such a volume of documents, automatically obtaining valuable related information with information extraction technology has positive significance for studying how cancers form, discovering drugs to treat cancer, diagnosing cancer early, improving the survival rate of cancer patients, and reducing the medical costs of cancer.
The invention has the following beneficial effects. Building on traditional methods, it fully considers the characteristics of entities and context in biomedical text and solves the problems of multi-type entity recognition and incomplete entity recognition in biomedical event extraction. It obtains deeper syntactic information through a Hierarchical Distillation Network (HDN) to extract biomedical events, improving the precision of complex event extraction. It can help researchers in the biomedical field automatically analyze text, provides search over known biomedical named entities and biomedical events, and helps researchers study and analyze relevant biomedical literature.
Drawings
FIG. 1 is a schematic diagram of biomedical entity and entity relationship joint extraction based on an Entity Relationship Graph (ERG).
Fig. 2 is a diagram of a layered distillation network (HDN) based biomedical event extraction model.
FIG. 3 is a database E-R diagram.
Detailed Description
The system of the invention can automatically perform word representation, biomedical entity and entity-relationship extraction, and biomedical event extraction on a given text, which greatly helps researchers analyze the biomedical events described in large volumes of literature. The system adopts a B/S (Browser/Server) architecture, is built with the Django framework, and is mainly implemented with technologies such as HTML and CSS. It is divided into a view layer, a logic layer, and a data layer, as shown in Table 2:
table 2 database system architecture
(table content provided as an image in the original document)
1. User input text to be parsed
Text input supports two modes, keyboard entry and local file upload; the view layer receives the text to be retrieved from the user, submits it to the logic layer, and stores it in the data layer. Suppose the text to be analyzed is "We have used retrovirus-mediated gene delivery to effect sustained autocrine expression of TIMP-3 in human neuroblastoma and murine melanoma tumor cells in order to test further the capacity of TIMPs to inhibit angiogenesis in vivo". The user can either 1) input the text directly through the page text box, or 2) save the text in a format such as txt or doc and upload it as a file. The first approach suits short texts; the second suits long texts.
2. The system analyzes the text to be analyzed
Realizing this function requires the coordinated work of the system's logic layer and database layer, specifically:
(1) After preprocessing such as sentence segmentation and tokenization by the logic layer, the text to be analyzed is used as input to the fine-tuned pre-trained biomedical word representation (BioBERT) model to obtain the word vectors and sentence vectors of the text.
(2) Taking the output result obtained in the step (1) as the input of the entity and entity relationship combined extraction model, firstly obtaining the biomedical entities specific to the entity types through the entity identification module as shown in the following table:
TABLE 3 Biomedical entity extraction results
E1: TIMP-3
E2: TIMPs
E3: neuroblastoma
E4: melanoma tumor cells
E5: murine
Because the module treats the mapping from an entity's start word to its end word as specific to the entity type, the resulting entity representation contains the entity's type information. Entity relation graphs specific to each relation type are then constructed and integrated, and all related entity pairs are output: <TIMP-3, TIMPs>; <TIMPs, TIMP-3>; <neuroblastoma, melanoma tumor cells>; <melanoma tumor cells, neuroblastoma>.
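The integration step can be sketched as follows; the relation type names (`synonym_of`, `same_class_as`) and the per-type edge sets are hypothetical placeholders, not the model's learned output:

```python
# Integration of the per-type entity relation graphs: each relation type
# has its own directed edge set; their union yields all related pairs.
entities = ["TIMP-3", "TIMPs", "neuroblastoma", "melanoma tumor cells", "murine"]

# hypothetical per-type graphs as sets of (head_index, tail_index) edges
graphs = {
    "synonym_of": {(0, 1), (1, 0)},
    "same_class_as": {(2, 3), (3, 2)},
}

pairs = []
for rel, edges in graphs.items():
    for i, j in edges:
        pairs.append((entities[i], entities[j], rel))

for head, tail, rel in sorted(pairs):
    print(f"<{head}, {tail}>  ({rel})")
```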
(3) Biomedical event extraction is then performed. The word vectors and entity type vectors (i.e., the entity representations from step (2)) obtained above are input into a biomedical event extraction model based on a hierarchical distillation network (HDN) for trigger word recognition; element detection then considers each trigger word-entity and trigger word-trigger word pair, and a post-processing step produces the biomedical events shown in Table 4:
TABLE 4 Biomedical event extraction results
T1 Blood_vessel_development: angiogenesis
T2 Gene_expression: expression (Theme: TIMP-3)
T3 Positive_regulation: effect (Theme: expression)
T4 Negative_regulation: inhibit (Theme: angiogenesis; Cause: TIMPs)
The trigger word in T1 is angiogenesis, with no corresponding elements; that is, the trigger word by itself constitutes a simple event.
(4) The output results of steps (1)-(3) are handed to the data layer for storage, while the view layer feeds the visualized results back to the user, thereby building the cancer-related biomedical event database.
3. User retrieval of biomedical events
Once the system has finished extracting biomedical events from the input text, the events can be presented as an interactive event graph. For example, after entering the text, the user sees in the biomedical event extraction interface a graph G containing m + n nodes, representing the complete set of biomedical events in which m entities and n trigger words participate. Each node is colored according to its identity (trigger word or entity), and edge types between nodes represent element roles or entity relations, distinguished by edge color. If the trigger word of an event A in the graph is an element of an event B, then B is a complex event. In addition, the user can search for an entity in the interface to view the biomedical events it participates in.
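A minimal sketch of assembling the interactive event graph from the extraction results shown in Tables 3 and 4; node colors and rendering are omitted, and only the graph structure and the complex-event test are shown:

```python
# Nodes of the event graph: m entities and n trigger words (m + n total).
entity_nodes = ["TIMP-3", "TIMPs", "neuroblastoma", "melanoma tumor cells", "murine"]
trigger_nodes = ["angiogenesis", "expression", "effect", "inhibit"]

# (source trigger, element role, target) edges taken from events T1-T4
edges = [
    ("expression", "Theme", "TIMP-3"),
    ("effect", "Theme", "expression"),
    ("inhibit", "Theme", "angiogenesis"),
    ("inhibit", "Cause", "TIMPs"),
]

G = {n: [] for n in entity_nodes + trigger_nodes}
for src, role, dst in edges:
    G[src].append((role, dst))

# an event whose element is itself a trigger word marks a complex event
complex_events = [s for s, outs in G.items()
                  if any(d in trigger_nodes for _, d in outs)]
print(len(G), complex_events)
```

Here "effect" and "inhibit" come out as complex events because their Theme elements (expression, angiogenesis) are themselves trigger words.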

Claims (1)

1. A method for constructing a cancer-related biomedical event database based on literature is characterized by comprising the following steps:
(I) Obtaining and preprocessing the model training and test corpus
(1) 261-; (2) segment the biomedical documents into sentences, adding [CLS] and [SEP] tokens as the beginning and end markers of each sentence, respectively; (3) tokenize the text obtained in step (2) using special symbols; (4) train with the pre-trained biomedical word representation model to obtain word vectors; in the word-vector training process, the code of the public BioBERT is first obtained, fine-tuning is then completed using the biomedical-domain corpora PubMed and PMC and the model parameters are saved, and finally the fine-tuned model is used to process the MLEE corpus to obtain word vectors;
(II) Joint extraction of biomedical entities and entity relations based on an entity relation graph
To achieve deep integration of the biomedical entity recognition and entity-relation extraction tasks, a joint biomedical entity and entity-relation extraction model based on an entity relation graph is proposed. In this model, the entity recognition module first recognizes entities of specific types in the input text; a directed entity relation graph is then constructed for each relation type, and all related entity pairs in the graph are output. The model consists of three modules:
(1) input module
The input source sentence is encoded with the pre-trained biomedical word representation model. The input is S = {[CLS], w_1, w_2, ..., w_n, [SEP]}, where [CLS] and [SEP] are the beginning and end markers of the sentence. The hidden-layer features H^E and the sentence vector X are extracted as: H^E, X = Encoder(S);
where H^E ∈ R^(n×d_E), h_i^E is the hidden-layer output of the i-th word, X ∈ R^(d_E) is the vector representation of the entire sentence (in the model implementation the [CLS] vector is used instead), and d_E is the number of hidden-layer units of the Encoder;
(2) entity identification module
This module models the entity type as a mapping from the entity's start word to its end word, rather than classifying after the entity is identified; that is, it learns a tagger f_etype(w_start) → w_end specific to each entity class, which addresses the problems of multi-type entity recognition and incomplete entity recognition;
First, based on the hidden-layer features H^E generated in the previous step, the words in the sentence most likely to be the start of an entity are tagged; then a tagger is constructed for each entity type, tagging the words most likely to be the end of an entity of that specific type. Denote by et_k the embedding of entity type etype_k, by p_i^start the likelihood that the i-th word is recognized as an entity start word, and by p_i^end_k the likelihood that the i-th word is recognized as the end word of an etype_k-type entity; both are binary decisions, taking the value 1 when the likelihood exceeds a given threshold and 0 otherwise. The entities in the sentence are then identified according to the minimum-span principle as spans (w_start, ..., w_end)_k. Since entity identification is performed for a specific entity type, the entity representation already contains its type information, and the subscript k indicates that the entity's type is etype_k;
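The minimum-span decoding described above can be sketched as follows; the `max_len` cap and the example type names are illustrative assumptions, not part of the claimed method:

```python
def decode_entities(tokens, start_flags, end_flags_by_type, max_len=3):
    """Decode typed entities from start/end taggers using the minimum-span
    principle: each flagged start word is paired with the nearest end word
    flagged for a given entity type. max_len is an illustrative cap on
    span length, added here to keep the toy decoder well-behaved."""
    entities = []
    for i, is_start in enumerate(start_flags):
        if not is_start:
            continue
        for etype, end_flags in end_flags_by_type.items():
            for j in range(i, min(i + max_len, len(tokens))):
                if end_flags[j]:           # nearest end at or after the start
                    entities.append((etype, " ".join(tokens[i:j + 1])))
                    break
    return entities

tokens = ["TIMP-3", "inhibits", "melanoma", "tumor", "cells"]
start = [1, 0, 1, 0, 0]                     # binary start-word decisions
ends = {"Gene_or_gene_product": [1, 0, 0, 0, 0],   # per-type end-word decisions
        "Cell": [0, 0, 0, 0, 1]}
print(decode_entities(tokens, start, ends))
```

Because the end taggers are type-specific, each decoded span carries its type directly, matching the claim that the entity representation already contains type information.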
(3) Entity relationship graph
The entities output in the previous step are taken as nodes of the entity relation graph, which is then initialized. Suppose there are R relation types among the entities in total; R + 1 asymmetric adjacency matrices A^(s) ∈ {0,1}^(n×n), s = 0, 1, ..., R, represent this graph. An element a_ij^(s) = 1 in the adjacency matrix indicates that head entity e_i and tail entity e_j have the relation r_s between them, and 0 otherwise; A^(0) is a zero matrix. Then, for each entity relation type r_s, a scorer is trained to score every entity pair, yielding score_s(e_i, e_j), computed from trainable parameters W and a feature fusion function f applied to the pair. When the score score_s(e_i, e_j) exceeds a threshold λ, the corresponding element a_ij^(s) of the adjacency matrix A^(s) is set to 1, otherwise to 0; after all entity pairs have been predicted, the entity relation graph corresponding to relation type r_s is obtained. The model trains the R scorers simultaneously, generating R entity relation graphs corresponding to the R different relation types; a_ij^(s) = 1 means the relation triplet <e_i, e_j, r_s> holds. The R entity relation graphs are integrated as A = Σ_s A^(s), and the two entities <e_i, e_j> corresponding to every non-zero element of A are output;
When the model finishes, all entity pairs <entity 1, entity 2> between which a relation exists have been extracted from the text. Because the entity recognition module treats the entity type as a mapping from the entity start word to the entity end word, the directionality of entity relations is considered when building the entity relation graph, and relation-specific taggers are built from entity representations fused with entity type information, deep integration of entity and entity-relation extraction is achieved;
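A toy sketch of one relation scorer and its adjacency matrix A^(s); the fusion function f is assumed to be simple concatenation and the weights are hand-set rather than trained:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(e_i, e_j, w):
    """Toy scorer for one relation type: fuse the two entity vectors by
    concatenation (stand-in for f) and apply a linear layer + sigmoid."""
    feats = e_i + e_j                      # list concatenation = [e_i; e_j]
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))

entities = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]   # toy entity representations
w = [1.0, -1.0, 1.0, -1.0]                         # trainable in the real model
LAMBDA = 0.6                                       # decision threshold λ

n = len(entities)
A = [[0] * n for _ in range(n)]                    # adjacency matrix A^(s)
for i in range(n):
    for j in range(n):
        if i != j and score(entities[i], entities[j], w) > LAMBDA:
            A[i][j] = 1

triplets = [(i, j, "r_s") for i in range(n) for j in range(n) if A[i][j]]
print(triplets)
```

Each non-zero a_ij gives one <e_i, e_j, r_s> triplet; running R such scorers and summing their matrices gives the integrated output pairs.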
(III) Biomedical event extraction based on a hierarchical distillation network
Building on existing deep-learning-based biomedical event extraction models, a biomedical event extraction model based on a hierarchical distillation network is proposed. The model acquires syntactic information at different levels through a bidirectional GRU and a multi-layer gated graph network; a distillation module emphasizes the differences between sentence representations at different levels, and a residual connection then integrates the distilled sentence representations of all levels. The resulting sentence representation expresses richer syntactic features and captures longer-distance context information. Finally, trigger word recognition and element recognition are treated as a sequence labeling task and a multi-classification task respectively, and post-processing generates the final biomedical events. The model's workflow is as follows:
(1) input module
The model first embeds the input sentence W = {w_1, w_2, ..., w_L} to obtain the sentence vector X = {x_1, x_2, ..., x_L}, which comprises the following parts:
① Word vector: the input source sentence is encoded with the pre-trained biomedical word representation model introduced above to obtain the contextual semantics of each word;
② Entity type vector: the entity representations obtained by the joint biomedical entity and entity-relation extraction model introduced above already contain entity type information, so they serve as this module's entity type vectors;
③ Relative position vector: the relative position vector is used only for the element recognition task. The relative position of each word with respect to a candidate trigger word or candidate element is computed from q and m, the start index and length of the candidate trigger word or element, and the sentence length L; a randomly initialized position embedding table is then looked up to obtain the relative position vector;
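Since the relative-position formula itself is only given as a figure in the source, the following sketch uses a common variant (0 inside the span, signed distance to the nearest boundary outside, clipped by the sentence length) purely as an assumption:

```python
def relative_position(i, q, m, L):
    """Relative position of word i w.r.t. a candidate span [q, q + m).
    ASSUMED form (the original formula is not recoverable): 0 inside the
    span, signed distance to the nearest span boundary otherwise,
    clipped to [-L, L]."""
    if q <= i < q + m:
        return 0
    d = i - q if i < q else i - (q + m - 1)
    return max(-L, min(L, d))

# positions of each word in a 6-word sentence relative to the span [2, 4)
print([relative_position(i, q=2, m=2, L=6) for i in range(6)])
```

Each integer would then index the randomly initialized position embedding table.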
(2) bidirectional GRU module
The sentence vector X = {x_1, x_2, ..., x_L} obtained by the input module serves as this module's input. The bidirectional GRU consists of two recurrent computation units, a forward GRU unit and a backward GRU unit, which are updated continuously along the order of the sentence. The forward and backward GRUs produce the forward hidden vector →h_i and the backward hidden vector ←h_i respectively; concatenating them gives the sentence sequence vector H = (h_1, h_2, ..., h_L), where h_i = [→h_i; ←h_i].
The module's specific calculation process is:
z_t = σ(W_z · [h_(t-1), x_t]), r_t = σ(W_r · [h_(t-1), x_t]) (3.2)
h̃_t = tanh(W_h · [r_t ⊙ h_(t-1), x_t]), h_t = (1 - z_t) ⊙ h_(t-1) + z_t ⊙ h̃_t (3.3)
where z_t and r_t are respectively the update and reset gates of the GRU unit, W_(·) are trainable parameters, σ(·) and tanh(·) are activation functions, and h̃_t is the hidden vector after passing through the reset gate;
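Eqs. (3.2)-(3.3) describe a standard GRU step, which can be sketched in NumPy as follows (the gated combination of the previous state and the candidate state is assumed to take the standard GRU form, since part of the original equation is only given as a figure):

```python
import numpy as np

def gru_step(h_prev, x_t, Wz, Wr, Wh):
    """One GRU step following Eqs. (3.2)-(3.3): update gate z_t,
    reset gate r_t, candidate state h~_t, then the gated combination."""
    concat = np.concatenate([h_prev, x_t])
    z = 1 / (1 + np.exp(-Wz @ concat))                 # update gate z_t
    r = 1 / (1 + np.exp(-Wr @ concat))                 # reset gate r_t
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_tilde              # h_t

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
Wz, Wr, Wh = (rng.standard_normal((d_h, d_h + d_x)) for _ in range(3))
h = gru_step(np.zeros(d_h), rng.standard_normal(d_x), Wz, Wr, Wh)
print(h.shape)
```

Running this forward over the sentence and again in reverse, then concatenating the two hidden states per word, gives the bidirectional sequence vector H.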
(3) gate control graph network module
The graph G = (V, E) is defined as the dependency syntax tree of the sentence, where V = (v_1, v_2, ..., v_L) and E = (e_1, e_2, ..., e_M) are respectively the node set and edge set of G. The module takes the node feature vectors H^(l) obtained in the previous step and the adjacency matrix A of the graph as input, and outputs new node feature vectors H^(l+1). During propagation the GGN adopts a gate mechanism, computed as:
Y = D^(-1/2) A D^(-1/2) H^(l) (3.4)
U = σ(W_U · [Y, H^(l)]), R = σ(W_R · [Y, H^(l)]) (3.5)
H̃^(l+1) = relu(W_H · [Y, R ⊙ H^(l)]) (3.6)
H^(l+1) = (1 - U) ⊙ H^(l) + U ⊙ H̃^(l+1) (3.7)
where D^(-1/2) A D^(-1/2) symmetrically normalizes the feature vectors and Y is the normalized vector, W_(·) are trainable parameters, σ(·) and relu(·) are activation functions, U and R are respectively the update and reset gates of the GGN, and H̃^(l+1) is the node feature vector after the GGN reset gate. In the GGN, edges between nodes are undirected. As GGN layers are stacked, each node exchanges information with nodes farther away, so the node feature vectors contain more syntactic information: one GGN layer induces information interaction between adjacent nodes in the graph, stacking another GGN layer on top lets each node interact with nodes at distance 2 or less, and so on. The output of the first GGN layer is therefore called the first-order syntactic vector, the output of the second layer the second-order syntactic vector, and so forth;
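A sketch of one gated graph network layer as described above; since the exact gating equations are only given as figures, a GRU-style formulation with symmetric normalization (self-loops added, a common convention) is assumed:

```python
import numpy as np

def ggn_layer(H, A, WU, WR, W):
    """One gated graph network layer: symmetric normalization of the
    adjacency matrix (with self-loops added as an assumption), followed
    by GRU-style update/reset gating of the node features."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    D_inv_sqrt = np.diag(1 / np.sqrt(A_hat.sum(axis=1)))
    Y = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H      # normalized aggregation
    concat = np.concatenate([Y, H], axis=1)
    U = 1 / (1 + np.exp(-concat @ WU))           # update gate
    R = 1 / (1 + np.exp(-concat @ WR))           # reset gate
    H_tilde = np.maximum(0, np.concatenate([Y, R * H], axis=1) @ W)  # relu
    return (1 - U) * H + U * H_tilde

rng = np.random.default_rng(1)
L_, d = 5, 4
H = rng.standard_normal((L_, d))
# path-shaped dependency graph over 5 words (undirected edges)
A = np.array([[0, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]], dtype=float)
WU, WR, W = (rng.standard_normal((2 * d, d)) for _ in range(3))
H_new = ggn_layer(H, A, WU, WR, W)
print(H_new.shape)
```

Stacking this layer twice lets information travel two hops, matching the first-order/second-order syntactic vectors described in the text.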
(4) distillation module
Because of the stacking of GGN modules, the syntactic vectors of different orders inevitably overlap in feature space, causing a redundancy problem; a distillation module is therefore designed to reconstruct the syntactic vectors of each order obtained above. The module mainly emphasizes the heterogeneity between syntactic vectors of different orders, e.g. between the first-order syntactic vector S^(1) and the second-order syntactic vector S^(2).
This module includes three core operations:
① Alignment
This operation obtains the aligned vectors s̄_i^(1) and s̄_j^(2) of s_i^(1) and s_j^(2). The alignment process is:
e_(i,j) = W_E · [s_i^(1); s_j^(2)] (3.8)
α_(i,j)^(1) = exp(e_(i,j)) / Σ_k exp(e_(i,k)), α_(i,j)^(2) = exp(e_(i,j)) / Σ_k exp(e_(k,j)) (3.9)
s̄_i^(1) = Σ_j α_(i,j)^(1) · s_j^(2), s̄_j^(2) = Σ_i α_(i,j)^(2) · s_i^(1) (3.10)
where W_E is a trainable parameter, e_(i,j) is the vector obtained by fusing s_i^(1) and s_j^(2), and α^(1) and α^(2) respectively represent the weights of the first-order and second-order vectors during alignment;
② Comparison
Comparing S^(l) with its aligned counterpart S̄^(l) yields the distillation ratio g^(l) of the l-th-order syntactic vector:
g^(l) = σ(W_C · [S^(l); S̄^(l)]) (3.11)
③ Distillation
The distilled l-th-order syntactic vector S̃^(l) is obtained by the distillation operation:
S̃^(l) = g^(l) ⊙ S^(l) (3.12)
Finally, the distilled syntactic vectors of all orders are integrated through a residual connection to obtain the final sentence representation R = (r_1, r_2, ..., r_L), which expresses richer syntactic features and captures longer-distance context information;
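The distillation and residual integration can be sketched as follows; the sigmoid gating used for the distillation ratio is an assumption, since the original formulas are only given as figures:

```python
import numpy as np

def distill(S, S_aligned, Wc):
    """Distillation of one syntactic order: compare each vector with its
    aligned counterpart to get a gate g (the distillation ratio), then
    keep the gated part of the original vector. The sigmoid-gate form
    is an ASSUMPTION standing in for Eqs. (3.11)-(3.12)."""
    g = 1 / (1 + np.exp(-np.concatenate([S, S_aligned], axis=1) @ Wc))
    return g * S

rng = np.random.default_rng(2)
L_, d, n_orders = 6, 4, 3
orders = [rng.standard_normal((L_, d)) for _ in range(n_orders)]    # S^(l)
aligned = [rng.standard_normal((L_, d)) for _ in range(n_orders)]   # aligned S
Wc = rng.standard_normal((2 * d, d))

# residual integration of all distilled orders into the final representation R
R = sum(distill(S, Sa, Wc) for S, Sa in zip(orders, aligned)) + orders[0]
print(R.shape)
```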
(5) Trigger word recognition and element recognition
① Trigger word recognition
In this model, trigger word recognition is treated as a sequence labeling task using the BILOU tagging scheme. At the i-th time step, the i-th word representation r_i is fed to a Softmax classifier to predict the trigger label:
p(y_i | x) = softmax(W_te · r_i + b_te) (3.13)
where W_te and b_te are trainable parameters;
② Element recognition
In this model, element recognition is treated as a multi-classification task, and each trigger word-entity pair or trigger word-trigger word pair is classified with a Softmax classifier:
p(y | x) = softmax(W_ae · [t, a] + b_ae) (3.14)
where t is the trigger word vector, a is the vector of the other entity or trigger word, and W_ae and b_ae are trainable parameters;
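Eq. (3.14) amounts to a linear layer plus softmax over the concatenated pair of vectors, sketched here with illustrative role labels:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_role(t, a, W, b, roles):
    """Eq. (3.14): concatenate the trigger word vector t and the candidate
    element vector a, apply a linear layer, and softmax over role labels."""
    p = softmax(W @ np.concatenate([t, a]) + b)
    return roles[int(np.argmax(p))], p

roles = ["None", "Theme", "Cause"]          # illustrative label set
rng = np.random.default_rng(3)
d = 4
W = rng.standard_normal((len(roles), 2 * d))
b = rng.standard_normal(len(roles))
role, p = classify_role(rng.standard_normal(d), rng.standard_normal(d),
                        W, b, roles)
print(role)
```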
(6) Post-processing
The post-processing step combines the trigger words and elements obtained in the previous step into complete events, i.e., event detection. In this model, an SVM classifier automatically learns the legal event structure of each event type from extracted features; candidate events are then formed, and finally the event type of each candidate event is determined;
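A sketch of the post-processing step; the legal event structures below are illustrative stand-ins for what the SVM classifier would learn from data:

```python
# Legal element-role structures per event type (illustrative, not learned).
LEGAL = {
    "Gene_expression": [{"Theme"}],
    "Negative_regulation": [{"Theme"}, {"Theme", "Cause"}],
    "Blood_vessel_development": [set()],     # simple event, no elements
}

def build_events(candidates):
    """Keep only candidate (type, trigger, elements) tuples whose element
    role set matches a legal structure for that event type."""
    events = []
    for etype, trigger, args in candidates:
        if set(args) in LEGAL.get(etype, []):
            events.append((etype, trigger, dict(args)))
    return events

candidates = [
    ("Blood_vessel_development", "angiogenesis", {}),
    ("Gene_expression", "expression", {"Theme": "TIMP-3"}),
    ("Negative_regulation", "inhibit", {"Theme": "angiogenesis", "Cause": "TIMPs"}),
    ("Gene_expression", "expression", {"Cause": "TIMPs"}),   # illegal structure
]
events = build_events(candidates)
print(len(events))
```

The last candidate is discarded because a Gene_expression event with only a Cause element is not a legal structure here.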
(IV) Construction of the cancer-related biomedical event database
The biomedical entities and biomedical event information required for constructing the cancer-related biomedical event database are obtained by the above method and stored in the database; the constructed cancer-related biomedical event database comprises a biomedical entity table and a biomedical event table.
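A minimal sketch of the two tables using SQLite; all column names are illustrative assumptions, not the patent's actual schema:

```python
import sqlite3

# In-memory database with a biomedical entity table and a biomedical
# event table; the schema below is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    entity_id INTEGER PRIMARY KEY,
    text      TEXT NOT NULL,
    etype     TEXT NOT NULL
);
CREATE TABLE event (
    event_id     INTEGER PRIMARY KEY,
    event_type   TEXT NOT NULL,
    trigger_word TEXT NOT NULL,
    theme        TEXT,
    cause        TEXT
);
""")
conn.execute("INSERT INTO entity VALUES (1, 'TIMP-3', 'Gene_or_gene_product')")
conn.execute("INSERT INTO event VALUES "
             "(1, 'Gene_expression', 'expression', 'TIMP-3', NULL)")

# entity-based retrieval: find events whose Theme is a stored entity
rows = conn.execute(
    "SELECT event_type FROM event JOIN entity ON event.theme = entity.text"
).fetchall()
print(rows)
```

This join mirrors the retrieval described in the system section, where a user searches an entity to view the events it participates in.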
CN202010629395.7A 2020-07-03 2020-07-03 Method for constructing cancer-related biomedical event database based on literature Active CN111859935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010629395.7A CN111859935B (en) 2020-07-03 2020-07-03 Method for constructing cancer-related biomedical event database based on literature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010629395.7A CN111859935B (en) 2020-07-03 2020-07-03 Method for constructing cancer-related biomedical event database based on literature

Publications (2)

Publication Number Publication Date
CN111859935A CN111859935A (en) 2020-10-30
CN111859935B true CN111859935B (en) 2022-09-20

Family

ID=73151955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010629395.7A Active CN111859935B (en) 2020-07-03 2020-07-03 Method for constructing cancer-related biomedical event database based on literature

Country Status (1)

Country Link
CN (1) CN111859935B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632230B (en) * 2020-12-30 2021-10-15 中国科学院空天信息创新研究院 Event joint extraction method and device based on multi-level graph network
CN113051929A (en) * 2021-03-23 2021-06-29 电子科技大学 Entity relationship extraction method based on fine-grained semantic information enhancement
CN113239142B (en) * 2021-04-26 2022-09-23 昆明理工大学 Trigger-word-free event detection method fused with syntactic information
CN113255321B (en) * 2021-06-10 2021-10-29 之江实验室 Financial field chapter-level event extraction method based on article entity word dependency relationship
CN113420551A (en) * 2021-07-13 2021-09-21 华中师范大学 Biomedical entity relation extraction method for modeling entity similarity
CN113468333B (en) * 2021-09-02 2021-11-19 华东交通大学 Event detection method and system fusing hierarchical category information
CN114912456B (en) * 2022-07-19 2022-09-23 北京惠每云科技有限公司 Medical entity relationship identification method and device and storage medium
CN116049345B (en) * 2023-03-31 2023-10-10 江西财经大学 Document-level event joint extraction method and system based on bidirectional event complete graph
CN116501830B (en) * 2023-06-29 2023-09-05 中南大学 Method and related equipment for jointly extracting overlapping relation of biomedical texts
CN117131198B (en) * 2023-10-27 2024-01-16 中南大学 Knowledge enhancement entity relationship joint extraction method and device for medical teaching library

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818141B (en) * 2017-10-10 2020-07-14 大连理工大学 Biomedical event extraction method integrated with structured element recognition
CN108628970B (en) * 2018-04-17 2021-06-18 大连理工大学 Biomedical event combined extraction method based on new marker mode
CN109446338B (en) * 2018-09-20 2020-07-21 大连交通大学 Neural network-based drug disease relation classification method
CN109446326B (en) * 2018-11-01 2021-04-20 大连理工大学 Biomedical event combined extraction method based on replication mechanism
CN110083838B (en) * 2019-04-29 2021-01-19 西安交通大学 Biomedical semantic relation extraction method based on multilayer neural network and external knowledge base

Also Published As

Publication number Publication date
CN111859935A (en) 2020-10-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant