CN113420551A - Biomedical entity relation extraction method for modeling entity similarity - Google Patents

Biomedical entity relation extraction method for modeling entity similarity

Info

Publication number
CN113420551A
CN113420551A
Authority
CN
China
Prior art keywords
entity
node
biomedical
type
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110788351.3A
Other languages
Chinese (zh)
Inventor
Weizhong Zhao (赵卫中)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202110788351.3A priority Critical patent/CN113420551A/en
Publication of CN113420551A publication Critical patent/CN113420551A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a biomedical entity relation extraction method that models entity similarity, comprising the following steps: S1, an input module obtains an initial representation of the text; S2, the given entity-labeled biomedical text is assumed to be a sequence of sentences, where each sentence is represented as a sequence of word vector representations (S1, ..., Si, ..., SL); S3, considering the rich semantic information carried by each word, the vector representation of each word is composed of a word embedding, a position embedding and an entity type embedding; S4, on the basis of S3, similarity information between the biomedical entities in each document is modeled through a relational heterogeneous graph module. The invention has the advantage that it can learn richer entity representations: the relational heterogeneous graph module adopts an end-to-end neural network that can automatically learn meaningful features from large-scale biomedical text, avoiding the time-consuming, labor-intensive and extremely complex feature engineering of traditional methods.

Description

Biomedical entity relation extraction method for modeling entity similarity
Technical Field
The invention relates to the technical field of biomedicine, in particular to a biomedical entity relation extraction method for modeling entity similarity.
Background
Biomedical texts contain a large number and a wide variety of information entities, and these entities are linked by complex relationships. According to the entity types involved, four types of relationships are commonly distinguished: Protein-Protein Interactions (PPIs), Genotype-Phenotype Associations (GPAs), Drug-Drug Interactions (DDIs) and Chemical-Induced Disease (CID) relations. Many studies have addressed the extraction of these four types of biomedical entity relationships. Traditional rule-based methods use rule templates (usually in the form of regular expressions) generated by domain experts to extract matching relationships from biomedical text; statistical learning-based methods identify relationships between biomedical entities by means of co-occurrence probabilities; NLP-based methods parse sentences to decompose the text into grammatical structures from which relationships between entities can be extracted more easily.
However, research shows that existing biomedical entity relationship extraction methods do not model the similarity information between entities in biomedical texts, even though this information plays a key role in relationship extraction. Taking chemical-induced disease (CID) relations as an example, suppose three CID entity pairs have already been successfully predicted: <ethambutol, bilateral optic neuropathy>, <isoniazid, bilateral optic neuropathy> and <ethambutol, dark spot>. This information greatly helps in predicting whether the entity pair <isoniazid, dark spot> also expresses a CID relation: from <ethambutol, bilateral optic neuropathy> and <isoniazid, bilateral optic neuropathy> it can be inferred that the chemical entities "ethambutol" and "isoniazid" share a certain similarity, and since <ethambutol, dark spot> is known to be a CID pair, <isoniazid, dark spot> can likewise be judged to be a CID pair. Therefore, it is necessary to fully model the similarity information between entities in biomedical text and to obtain better entity representations for efficient relationship extraction.
Disclosure of Invention
The invention aims to provide a biomedical entity relation extraction method for modeling entity similarity, which aims to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: a biomedical entity relation extraction method for modeling entity similarity comprises the following steps:
S1, obtaining an initial representation of the text through the input module;
S2, assuming that the given entity-labeled biomedical text is a sequence of sentences, where each sentence is represented as a sequence of word vector representations (S1, ..., Si, ..., SL);
S3, considering the rich semantic information carried by each word, composing the vector representation of each word from a word embedding, a position embedding and an entity type embedding;
S4, on the basis of S3, modeling the similarity information between the biomedical entities in each document through a relational heterogeneous graph module, and learning richer entity representations;
S5, classifying the candidate relation nodes and storing the entity relations in a structured form.
As a preferred embodiment of the present invention: the word embedding adopts a pre-training model BioBERT as a word embedding model;
the position embedding adopts sine and cosine functions with different frequencies to model different position information in sentences;
the entity type in the entity type embedding comprises an O type, namely the entity type is not an entity, a vector is randomly initialized to represent information contained in the entity type, the entity type embedding vector is used as a parameter of a model, and the optimization is carried out through a training process, namely the Fine-tuning is obtained.
As a preferred embodiment of the present invention: the relationship heterogeneous graph module specifically comprises heterogeneous graph construction and a double-layer attention mechanism.
As a preferred embodiment of the present invention: the heteromorphic image construction specifically comprises the following steps:
a1: assuming that the heterogeneous graph is represented as HG ═ (HV, HE), where HV represents the set of nodes and HE is the set of edges;
a2: given a processed entityThe labeled biomedical text D, the node set of the heterogeneous graph constructed based on the text D is composed of a plurality of subsets: node set E of various biomedical entities1,E2,...,ENWhere N denotes the number of categories of biomedical entities in a given text, and a candidate relationship node R, consisting of pairs of biomedical entities, formalized as HV ═ (E)1,E2,...,EN)UR;
A3: initializing and expressing the biomedical entity nodes as vector expression obtained by an input module; in addition, for the candidate relation nodes, firstly, the vector representation of the corresponding biomedical entity is spliced, then, after a full connection layer is used for activation, the activated vector is finally used as the initialization representation of the candidate relation nodes in the heteromorphic graph.
As a preferred embodiment of the present invention: the construction of the edges between the nodes of the candidate relationship in the heteromorphic graph comprises the following two steps:
b1: each candidate relationship node is formed by pairing a chemical entity and a disease entity, so that an edge is constructed for each candidate relationship node and the corresponding biomedical entity;
b2: in order to take the similarity between the entities into consideration, similarity calculation is carried out between the entities, and if the similarity between two entities is large enough, an edge is constructed between the two entity nodes.
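The patent does not specify which similarity measure or threshold is used in B2; the sketch below assumes cosine similarity over the entity vectors and a hypothetical threshold of 0.8.

    import torch
    import torch.nn.functional as F

    def build_entity_similarity_edges(entity_vectors, threshold=0.8):
        """Sketch: add an edge between two entity nodes whose cosine similarity is high enough."""
        edges = []
        n = len(entity_vectors)
        for i in range(n):
            for j in range(i + 1, n):
                sim = F.cosine_similarity(entity_vectors[i], entity_vectors[j], dim=0)
                if sim.item() >= threshold:
                    edges.append((i, j))
        return edges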
As a preferred embodiment of the present invention: the double-layer attention mechanism specifically comprises the following steps:
c1: giving a certain candidate relation node, and firstly collecting 1-hop and 2-hop neighbor nodes of the node;
c2: then all neighbor nodes are divided into groups according to the node types: and updating the vector representation of the given candidate relation node by using two levels of attention mechanisms of various biomedical entity neighbor nodes and candidate relation neighbor nodes.
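A rough sketch of steps C1 and C2 using networkx, which is an assumption of convenience (the patent names no library); each node is assumed to carry a hypothetical "type" attribute.

    import networkx as nx
    from collections import defaultdict

    def collect_and_group_neighbors(graph: nx.Graph, rel_node):
        """Sketch: gather 1-hop/2-hop neighbors of a candidate relation node, grouped by type."""
        # Distances up to 2 hops; drop the node itself (distance 0).
        dist = nx.single_source_shortest_path_length(graph, rel_node, cutoff=2)
        groups = defaultdict(list)
        for node, d in dist.items():
            if d == 0:
                continue
            # Assumed node attribute, e.g. "chemical", "disease" or "relation".
            groups[graph.nodes[node]["type"]].append(node)
        return groups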
As a preferred embodiment of the present invention: the double-layer attention mechanism consists of two parts: the node-level attention mechanism aggregates information of neighbors of the same type and the type-level attention mechanism aggregates information of neighbor nodes of different types.
As a preferred embodiment of the present invention: the node-level attention mechanism is used for fully modeling the importance of different neighbor nodes having the same type, and specifically comprises the following steps:
d1: given a certain candidate relation node, all the v-class biomedical entity neighbor node sets are assumed to be represented as
Figure BDA0003160022270000041
Wherein any class v biomedical entity neighbors
Figure BDA0003160022270000042
For all the V-class biomedical entity neighbors, selective information aggregation is carried out through a node level attention mechanism to obtain a neighbor vector representation representing the V-class biomedical entity type
Figure BDA0003160022270000043
Figure BDA0003160022270000044
D2: neighbor vector representations for other types of entities and candidate relationship types through D1 node-level attention
Figure BDA0003160022270000045
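The patent's node-level attention formulas are published only as images, so their exact form cannot be reproduced here. The following is a rough, assumption-laden sketch of an attention-weighted aggregation over same-type neighbors in the spirit of the description; the scoring function, dimensions and class name are illustrative choices, not the patent's.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NodeLevelAttention(nn.Module):
        """Sketch: aggregate same-type neighbor vectors with learned attention weights."""
        def __init__(self, dim=768):
            super().__init__()
            self.query = nn.Linear(dim, dim)
            self.key = nn.Linear(dim, dim)

        def forward(self, rel_node_vec, neighbor_vecs):
            # rel_node_vec: (dim,) candidate relation node representation
            # neighbor_vecs: (num_neighbors, dim) neighbors of one biomedical entity type
            scores = self.key(neighbor_vecs) @ self.query(rel_node_vec)   # (num_neighbors,)
            alpha = F.softmax(scores, dim=0)                              # importance per neighbor
            return (alpha.unsqueeze(-1) * neighbor_vecs).sum(dim=0)       # type-specific neighbor vector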
As a preferred embodiment of the present invention: the type level attention specifically comprises the following steps:
e1: the type-level attention is based on the node-level attention, the type-level attention learns the weights of different types of neighbors of a given candidate relationship node, and through a process similar to the node-level attention, the type-level attention is formally expressed as:
Figure BDA0003160022270000046
wherein
Figure BDA0003160022270000047
Representing multiple types of neighbors;
e2: vector representations to be obtained taking into account different neighbor nodes and different types of neighbors in the same type
Figure BDA0003160022270000048
E3: updating the original candidate relation node by using a full-connection network, and carrying out relation reasoning:
Figure BDA0003160022270000049
where σ denotes a Sigmoid activation function, the output value of which is between 0 and 1, so that
Figure BDA00031600222700000410
And finally, storing the extracted entity relationship in a structured form.
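Again as an assumption-based sketch (the patent's type-level formulas are only available as images), type-level attention can be viewed as a second attention step over the per-type neighbor vectors produced by node-level attention, followed by a fully connected update and a Sigmoid that scores the candidate relation; all names and dimensions below are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TypeLevelAttentionAndScore(nn.Module):
        """Sketch: weigh per-type neighbor vectors, update the relation node, score with Sigmoid."""
        def __init__(self, dim=768):
            super().__init__()
            self.type_query = nn.Linear(dim, dim)
            self.type_key = nn.Linear(dim, dim)
            self.update = nn.Linear(2 * dim, dim)   # fully connected update of the relation node
            self.classifier = nn.Linear(dim, 1)

        def forward(self, rel_node_vec, per_type_vecs):
            # per_type_vecs: (num_types, dim) one aggregated vector per neighbor type
            scores = self.type_key(per_type_vecs) @ self.type_query(rel_node_vec)  # (num_types,)
            beta = F.softmax(scores, dim=0)                                         # weight per neighbor type
            context = (beta.unsqueeze(-1) * per_type_vecs).sum(dim=0)               # (dim,)
            updated = torch.relu(self.update(torch.cat([rel_node_vec, context], dim=-1)))
            return torch.sigmoid(self.classifier(updated))                          # score in (0, 1)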
Compared with the prior art, the invention has the following beneficial effects. The node-level attention mechanism aggregates the information of neighbors of the same type and fully models the importance of the different neighbor nodes within that type; the type-level attention mechanism aggregates the information of neighbor nodes of different types and, on top of node-level attention, learns the weights of the different neighbor types of a given candidate relation node, taking the importance of each type into account. Through the relational heterogeneous graph module based on this double-layer attention mechanism, the similarity information between entities in biomedical text can be fully modeled, yielding better candidate relation node representations for efficient relationship extraction. Moreover, the relational heterogeneous graph module adopts an end-to-end neural network that can automatically learn meaningful features from large-scale biomedical text, avoiding the time-consuming, labor-intensive and extremely complex feature engineering of traditional methods.
Drawings
Fig. 1 is an overall flow diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution: a biomedical entity relation extraction method for modeling entity similarity comprises the following steps:
S1, obtaining an initial representation of the text through the input module;
S2, assuming that the given entity-labeled biomedical text is a sequence of sentences, where each sentence is represented as a sequence of word vector representations (S1, ..., Si, ..., SL);
S3, considering the rich semantic information carried by each word, composing the vector representation of each word from a word embedding, a position embedding and an entity type embedding;
S4, on the basis of S3, modeling the similarity information between the biomedical entities in each document through a relational heterogeneous graph module, and learning richer entity representations;
S5, classifying the candidate relation nodes and storing the entity relations in a structured form.
The word embedding adopts the pre-trained model BioBERT as the word embedding model;
the position embedding adopts sine and cosine functions of different frequencies to model the different position information within a sentence;
the entity types in the entity type embedding include an O type, i.e., not an entity. A vector is randomly initialized to represent the information carried by each entity type; these entity type embedding vectors are treated as model parameters and are optimized during training, i.e., obtained by fine-tuning.
Through the input module, each sentence Si in a given text can be represented as a matrix Xi, where the j-th row of the matrix is the vector representation of the j-th word. The initial representation of the biomedical text obtained by the input module is then fed into the relational heterogeneous graph module to learn richer entity representations.
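For illustration, the word-embedding part of such a sentence matrix could be obtained with the Hugging Face transformers library roughly as follows; the specific BioBERT checkpoint name is an assumption and is not named in the patent.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumed BioBERT checkpoint; any BioBERT weights compatible with transformers would do.
    tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
    model = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

    sentence = "Ethambutol may cause bilateral optic neuropathy."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One row per (sub)word token: the word-embedding component of the sentence matrix X.
    X = outputs.last_hidden_state.squeeze(0)   # shape: (num_tokens, hidden_dim)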
The heterogeneous graph construction specifically comprises the following steps:
A1: assuming that the heterogeneous graph is represented as HG = (HV, HE), where HV denotes the set of nodes and HE the set of edges;
A2: given an entity-labeled biomedical text D, the node set of the heterogeneous graph constructed from D consists of several subsets: the node sets E1, E2, ..., EN of the various biomedical entity types, where N denotes the number of biomedical entity categories in the given text, and the candidate relation nodes R formed by pairs of biomedical entities, formalized as HV = (E1, E2, ..., EN) ∪ R;
A3: the biomedical entity nodes are initialized with the vector representations obtained from the input module; for the candidate relation nodes, the vector representations of the corresponding biomedical entities are first concatenated and then passed through a fully connected layer with an activation function, and the activated vector is finally used as the initialization of the candidate relation node in the heterogeneous graph.
The edges involving candidate relation nodes in the heterogeneous graph are constructed in the following two steps:
B1: each candidate relation node is formed by pairing a chemical entity and a disease entity, so an edge is constructed between each candidate relation node and its corresponding biomedical entity nodes;
B2: in order to take the similarity between entities into account, similarity is computed between entities, and if the similarity between two entities is large enough, an edge is constructed between the two entity nodes.
The double-layer attention mechanism specifically comprises the following steps:
C1: given a candidate relation node, first collect its 1-hop and 2-hop neighbor nodes;
C2: then divide all neighbor nodes into groups according to node type, namely the various biomedical entity neighbor nodes and the candidate relation neighbor nodes, and update the vector representation of the given candidate relation node using the two levels of attention mechanisms.
The double-layer attention mechanism consists of two parts: the node-level attention mechanism aggregates the information of neighbors of the same type, and the type-level attention mechanism aggregates the information of neighbor nodes of different types. In the constructed heterogeneous graph, relation reasoning is based mainly on the vector representations of the candidate relation nodes, so information related to the candidate relation nodes should be taken into account as much as possible to improve the performance of relation extraction. Since every candidate relation node has edges connecting it to its corresponding entity nodes, and since different neighbor nodes within the same type as well as different types of neighbors have different importance for the representation learning of a candidate relation node, the influence of the neighbor nodes on learning the candidate relation node's vector representation is considered at two levels.
The node-level attention mechanism is used to fully model the importance of the different neighbor nodes of the same type, and specifically comprises the following steps:
D1: given a candidate relation node, assume the set of all its class-v biomedical entity neighbor nodes is given (the corresponding notation and formulas appear only as images in the original publication); selective information aggregation over all class-v neighbors is then performed through the node-level attention mechanism to obtain a neighbor vector representation for the class-v biomedical entity type;
D2: through the node-level attention of D1, neighbor vector representations are likewise obtained for the other entity types and for the candidate relation type.
The type-level attention specifically comprises the following steps:
E1: type-level attention builds on node-level attention; it learns the weights of the different types of neighbors of a given candidate relation node and, through a process similar to node-level attention, aggregates the neighbor representations of the multiple types (the formal expression appears only as an image in the original publication);
E2: this yields a vector representation that takes into account both the different neighbor nodes within the same type and the different types of neighbors;
E3: the original candidate relation node is updated with a fully connected network and relation reasoning is carried out, where σ denotes the Sigmoid activation function, whose output value lies between 0 and 1 and is used as the relation prediction score (the corresponding formulas appear only as images in the original publication).
Finally, the extracted entity relations are stored in a structured form.
In particular, in use, an initial representation of the text is obtained by the input module. The given entity-labeled biomedical text is assumed to be a sequence of sentences, where each sentence is represented as a sequence of word vector representations (S1, ..., Si, ..., SL). Considering the rich semantic information carried by each word, the vector representation of each word is composed of a word embedding, a position embedding and an entity type embedding. The word embedding adopts the pre-trained model BioBERT as the word embedding model; the position embedding adopts sine and cosine functions of different frequencies to model the different position information within a sentence; the entity types in the entity type embedding include an O type, i.e., not an entity, a vector is randomly initialized to represent the information carried by each entity type, and these entity type embedding vectors are treated as model parameters and optimized during training, i.e., obtained by fine-tuning.

The similarity information between the biomedical entities in each document is then modeled through the relational heterogeneous graph module, which learns richer entity representations and specifically comprises heterogeneous graph construction and a double-layer attention mechanism. Assume the heterogeneous graph is represented as HG = (HV, HE), where HV denotes the set of nodes and HE the set of edges. Given an entity-labeled biomedical text D, the node set of the heterogeneous graph constructed from D consists of several subsets: the node sets E1, E2, ..., EN of the various biomedical entity types, where N denotes the number of biomedical entity categories in the given text, and the candidate relation nodes R formed by pairs of biomedical entities, formalized as HV = (E1, E2, ..., EN) ∪ R. The biomedical entity nodes are initialized with the vector representations obtained from the input module; for the candidate relation nodes, the vector representations of the corresponding biomedical entities are first concatenated and then passed through a fully connected layer with an activation function, and the activated vector is finally used as the initialization of the candidate relation node in the heterogeneous graph. The candidate relation nodes are classified and the entity relations are stored in a structured form.

Since each candidate relation node is formed by pairing a chemical entity and a disease entity, an edge is constructed between each candidate relation node and its corresponding biomedical entity nodes. To take the similarity between entities into account, similarity is computed between entities, and if the similarity between two entities is large enough, an edge is constructed between the two entity nodes.

For the double-layer attention mechanism, given a candidate relation node, its 1-hop and 2-hop neighbor nodes are first collected; all neighbor nodes are then divided into groups according to node type, namely the various biomedical entity neighbor nodes and the candidate relation neighbor nodes, and the vector representation of the given candidate relation node is updated using the two levels of attention mechanisms. The double-layer attention mechanism consists of two parts: the node-level attention mechanism aggregates the information of neighbors of the same type, and the type-level attention mechanism aggregates the information of neighbor nodes of different types.

The node-level attention mechanism is used to fully model the importance of the different neighbor nodes of the same type. Given a candidate relation node, the set of all its class-v biomedical entity neighbor nodes is assumed to be given; selective information aggregation over all class-v neighbors is performed through the node-level attention mechanism to obtain a neighbor vector representation for the class-v biomedical entity type, and neighbor vector representations are likewise obtained for the other entity types and for the candidate relation type (the corresponding formulas appear only as images in the original publication).

Type-level attention builds on node-level attention; it learns the weights of the different types of neighbors of a given candidate relation node and, through a process similar to node-level attention, aggregates the neighbor representations of the multiple types, yielding a vector representation that takes into account both the different neighbor nodes within the same type and the different types of neighbors. The original candidate relation node is then updated with a fully connected network and relation reasoning is carried out, where σ denotes the Sigmoid activation function, whose output value lies between 0 and 1 and is used as the relation prediction score (these formulas also appear only as images in the original publication). Finally, the extracted entity relations are stored in a structured form.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A biomedical entity relation extraction method for modeling entity similarity is characterized by comprising the following steps:
S1, obtaining an initial representation of the text through the input module;
S2, assuming that the given entity-labeled biomedical text is a sequence of sentences, where each sentence is represented as a sequence of word vector representations (S1, ..., Si, ..., SL);
S3, considering the rich semantic information carried by each word, composing the vector representation of each word from a word embedding, a position embedding and an entity type embedding;
S4, on the basis of S3, modeling the similarity information between the biomedical entities in each document through a relational heterogeneous graph module, and learning richer entity representations;
S5, classifying the candidate relation nodes and storing the entity relations in a structured form.
2. The method according to claim 1, characterized in that: the word embedding adopts the pre-trained model BioBERT as the word embedding model;
the position embedding adopts sine and cosine functions of different frequencies to model the different position information within a sentence;
the entity types in the entity type embedding include an O type, i.e., not an entity; a vector is randomly initialized to represent the information carried by each entity type, and these entity type embedding vectors are treated as model parameters and optimized during training, i.e., obtained by fine-tuning.
3. The method according to claim 1, characterized in that: the relational heterogeneous graph module specifically comprises heterogeneous graph construction and a double-layer attention mechanism.
4. The method according to claim 3, characterized in that: the heterogeneous graph construction specifically comprises the following steps:
A1: assuming that the heterogeneous graph is represented as HG = (HV, HE), where HV denotes the set of nodes and HE the set of edges;
A2: given an entity-labeled biomedical text D, the node set of the heterogeneous graph constructed from D consists of several subsets: the node sets E1, E2, ..., EN of the various biomedical entity types, where N denotes the number of biomedical entity categories in the given text, and the candidate relation nodes R formed by pairs of biomedical entities, formalized as HV = (E1, E2, ..., EN) ∪ R;
A3: the biomedical entity nodes are initialized with the vector representations obtained from the input module; for the candidate relation nodes, the vector representations of the corresponding biomedical entities are first concatenated and then passed through a fully connected layer with an activation function, and the activated vector is finally used as the initialization of the candidate relation node in the heterogeneous graph.
5. The method according to claim 4, characterized in that: the edges involving candidate relation nodes in the heterogeneous graph are constructed in the following two steps:
B1: each candidate relation node is formed by pairing a chemical entity and a disease entity, so an edge is constructed between each candidate relation node and its corresponding biomedical entity nodes;
B2: in order to take the similarity between entities into account, similarity is computed between entities, and if the similarity between two entities is large enough, an edge is constructed between the two entity nodes.
6. The method according to claim 3, characterized in that: the double-layer attention mechanism specifically comprises the following steps:
C1: given a candidate relation node, first collect its 1-hop and 2-hop neighbor nodes;
C2: then divide all neighbor nodes into groups according to node type, namely the various biomedical entity neighbor nodes and the candidate relation neighbor nodes, and update the vector representation of the given candidate relation node using the two levels of attention mechanisms.
7. The method according to claim 3, characterized in that: the double-layer attention mechanism consists of two parts: the node-level attention mechanism aggregates the information of neighbors of the same type, and the type-level attention mechanism aggregates the information of neighbor nodes of different types.
8. The method according to claim 7, characterized in that: the node-level attention mechanism is used to fully model the importance of the different neighbor nodes of the same type, and specifically comprises the following steps:
D1: given a candidate relation node, assume the set of all its class-v biomedical entity neighbor nodes is given (the corresponding notation and formulas appear only as images in the original publication); selective information aggregation over all class-v neighbors is then performed through the node-level attention mechanism to obtain a neighbor vector representation for the class-v biomedical entity type;
D2: through the node-level attention of D1, neighbor vector representations are likewise obtained for the other entity types and for the candidate relation type.
9. The method according to claim 8, characterized in that: the type-level attention specifically comprises the following steps:
E1: type-level attention builds on node-level attention; it learns the weights of the different types of neighbors of a given candidate relation node and, through a process similar to node-level attention, aggregates the neighbor representations of the multiple types (the formal expression appears only as an image in the original publication);
E2: this yields a vector representation that takes into account both the different neighbor nodes within the same type and the different types of neighbors;
E3: the original candidate relation node is updated with a fully connected network and relation reasoning is carried out, where σ denotes the Sigmoid activation function, whose output value lies between 0 and 1 and is used as the relation prediction score (the corresponding formulas appear only as images in the original publication);
and finally, the extracted entity relations are stored in a structured form.
CN202110788351.3A 2021-07-13 2021-07-13 Biomedical entity relation extraction method for modeling entity similarity Pending CN113420551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110788351.3A CN113420551A (en) 2021-07-13 2021-07-13 Biomedical entity relation extraction method for modeling entity similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110788351.3A CN113420551A (en) 2021-07-13 2021-07-13 Biomedical entity relation extraction method for modeling entity similarity

Publications (1)

Publication Number Publication Date
CN113420551A true CN113420551A (en) 2021-09-21

Family

ID=77720777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110788351.3A Pending CN113420551A (en) 2021-07-13 2021-07-13 Biomedical entity relation extraction method for modeling entity similarity

Country Status (1)

Country Link
CN (1) CN113420551A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
WO2020147594A1 (en) * 2019-01-16 2020-07-23 阿里巴巴集团控股有限公司 Method, system, and device for obtaining expression of relationship between entities, and advertisement retrieval system
CN110083838A (en) * 2019-04-29 2019-08-02 西安交通大学 Biomedical relation extraction method based on multilayer neural network Yu external knowledge library
CN111710428A (en) * 2020-06-19 2020-09-25 华中师范大学 Biomedical text representation method for modeling global and local context interaction
CN111859935A (en) * 2020-07-03 2020-10-30 大连理工大学 Method for constructing cancer-related biomedical event database based on literature
CN111881256A (en) * 2020-07-17 2020-11-03 中国人民解放军战略支援部队信息工程大学 Text entity relation extraction method and device and computer readable storage medium equipment
CN112200317A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-modal knowledge graph construction method
CN112271001A (en) * 2020-11-17 2021-01-26 中山大学 Medical consultation dialogue system and method applying heterogeneous graph neural network
CN112818113A (en) * 2021-01-26 2021-05-18 山西三友和智慧信息技术股份有限公司 Automatic text summarization method based on heteromorphic graph network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
WEIZHONG ZHAO: "Document-Level Chemical-Induced Disease Relation Extraction via Hierarchical Representation Learning", IEEE, pages 2782-2793 *
KONG DEQIANG (孔德强): "Entity Correlation Mining Based on Heterogeneous Graphs" (基于异构图的实体关联性挖掘), China Master's Theses Full-text Database, pages 138-1025 *
XIONG JIANG (熊江): "NoSQL Database Principles and Applications" (《NoSQL数据库原理与应用》), 31 January 2020, Zhejiang Science and Technology Press, page 15 *
WANG YIFAN; LI JIYUN (王亦凡; 李继云): "Similar Medical Record Recommendation Based on Heterogeneous Graph Embedding Learning" (基于异构图嵌入学习的相似病案推荐), Computer Systems & Applications (计算机系统应用), no. 10, pages 232-238 *

Similar Documents

Publication Publication Date Title
CN112241481B (en) Cross-modal news event classification method and system based on graph neural network
US8775341B1 (en) Intelligent control with hierarchical stacked neural networks
CN108197294A (en) A kind of text automatic generation method based on deep learning
CN111274800A (en) Inference type reading understanding method based on relational graph convolution network
CN111046907A (en) Semi-supervised convolutional network embedding method based on multi-head attention mechanism
Chen et al. Visual and textual sentiment analysis using deep fusion convolutional neural networks
CN110059181A (en) Short text stamp methods, system, device towards extensive classification system
CN113849653B (en) Text classification method and device
CN113779220A (en) Mongolian multi-hop question-answering method based on three-channel cognitive map and graph attention network
CN113378913A (en) Semi-supervised node classification method based on self-supervised learning
CN109960732B (en) Deep discrete hash cross-modal retrieval method and system based on robust supervision
WO2024036840A1 (en) Open-domain dialogue reply method and system based on topic enhancement
CN110781271A (en) Semi-supervised network representation learning model based on hierarchical attention mechanism
CN105975497A (en) Automatic microblog topic recommendation method and device
Aurangzeb et al. Aspect based multi-labeling using SVM based ensembler
CN112256870A (en) Attribute network representation learning method based on self-adaptive random walk
CN116403730A (en) Medicine interaction prediction method and system based on graph neural network
CN112685609A (en) Knowledge graph complementing method combining translation mechanism and convolutional neural network
CN114780691A (en) Model pre-training and natural language processing method, device, equipment and storage medium
CN114969367B (en) Cross-language entity alignment method based on multi-aspect subtask interaction
Biswas et al. Cat2type: Wikipedia category embeddings for entity typing in knowledge graphs
CN115496072A (en) Relation extraction method based on comparison learning
Chen et al. Sparse Boltzmann machines with structure learning as applied to text analysis
CN117131933A (en) Multi-mode knowledge graph establishing method and application
CN116522165B (en) Public opinion text matching system and method based on twin structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination