CN114722216A

CN114722216A - Entity alignment method based on Chinese electronic medical record knowledge graph

Info

Publication number: CN114722216A
Application number: CN202210413638.2A
Authority: CN
Inventors: 李丽双; 董姜媛
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2022-04-15
Filing date: 2022-04-15
Publication date: 2022-07-08

Abstract

The invention provides an entity alignment method based on a knowledge graph of Chinese electronic medical records. The method comprises the following steps: constructing a training set and a test set for entity alignment on heterogeneous medical knowledge graphs; reasoning over the medical knowledge graphs with rules to supplement missing relations and relieve the structural heterogeneity between the graphs; and constructing a dual-view neural network model based on a gating mechanism to align and fuse the heterogeneous medical knowledge graphs. Balancing accuracy against labor cost, the method reaches a Hits@5 of 85.4%, effectively integrates existing medical resources, and promotes the development of intelligent medical treatment.

Description

Entity alignment method based on Chinese electronic medical record knowledge graph

Technical Field

The invention belongs to the field of natural language processing and relates to an entity alignment method used in constructing a knowledge graph (KG) from Chinese electronic medical records (EMR), in particular to a medical knowledge graph alignment method based on a neural network and a gating mechanism.

Background

Chinese electronic medical records are a product of informatized medical and health services and contain a large number of medical facts. As large volumes of electronic medical records accumulate in China, automatically acquiring and integrating useful medical information from them with natural language processing technology is of great significance. Constructing a knowledge graph from electronic medical records is one of the most effective ways to present and use the medical information they contain. A knowledge graph is a form of knowledge representation that structures and normalizes information elements and presents them clearly as a graph. A knowledge graph based on electronic medical records is a vertical-domain knowledge graph and helps to establish a scientific, intelligent medical knowledge base and knowledge network. However, because knowledge sources and construction purposes differ, different knowledge graphs contain overlapping and complementary knowledge, and this is especially common among medical knowledge graphs. Building a large-scale medical knowledge graph through knowledge fusion helps to improve data quality and thereby promotes the development of intelligent medical treatment. The key technique in knowledge fusion is entity alignment, whose goal is to determine whether two entities in different knowledge graphs refer to the same real-world object.

In the general domain, early entity alignment used similarity measures, deciding whether two entities are aligned by calculating their string similarity and structural similarity. However, such methods typically align entities according to manually designed rules, which makes them complex and difficult to implement. Later, as knowledge graph representation learning developed, more and more researchers began to use it in place of symbolic methods to solve the problem. The best-known knowledge graph representation learning models are the translation-based models, the most classical of which is TransE. Although translation-based methods do not depend on manually constructed rules, they can only capture the first-order neighborhood information of an entity, so the learned entity embeddings represent only local features of the entity and cannot make full use of the structural information of the knowledge graph.

With the development of graph neural networks in recent years, some studies have used graph neural networks to model the structure of knowledge graphs and to enhance entity embeddings with neighbor entities, i.e., to learn the representation of a central entity by recursively aggregating the representations of its neighbors with graph convolution. As a result, the more similar the neighborhood structures of two entities are, the closer the representations that a graph neural network model learns for them. However, because knowledge sources and construction purposes differ, structural heterogeneity exists between knowledge graphs, which seriously degrades the performance of such models. Wang et al. (Cross-lingual knowledge graph alignment via graph convolutional networks. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 349-357 (2018)) first attempted to use graph convolutional networks for entity alignment. Ye et al. (A vectorized relational graph convolutional network for multi-relational network alignment. In: IJCAI, pp. 4135-4141 (2019)) introduced the translation property of the TransE model into the convolution process, combining each neighbor of an entity with the corresponding relation vector when aggregating its representation. However, neither work considers the impact of knowledge graph heterogeneity on entity alignment.

At present, entity alignment methods for the medical field are scarce, and no systematic, efficient solution is available. Entities in the medical field take complex and diverse forms; for example, the disease entity "chronic viral hepatitis B" may appear in a knowledge graph as "hepatitis B" or "HBV infection", so the similarity-based methods of the general domain are not suitable for the medical field. In addition, the structural differences between medical knowledge graphs are pronounced, while translation-based models fuse only the first-order neighborhood of an entity, cannot make full use of the structural information of the knowledge graph, and have limited ability to relieve knowledge graph heterogeneity, so they are not suitable for medical entity alignment either. Many researchers have tried to solve entity alignment in the general domain with graph neural networks, but because medical knowledge is complicated, the graph structures generated from different electronic medical records differ significantly, and structural heterogeneity is even more pronounced in medical knowledge graphs. Research on medical entity alignment using graph neural networks is currently almost nonexistent.

Disclosure of Invention

The invention provides a dual-view graph neural network entity alignment model for fusing medical knowledge graphs built from Chinese electronic medical records. It helps to improve the data quality of existing medical knowledge graphs and thereby promotes the development of intelligent medical treatment.

The invention mainly comprises three parts: 1. constructing a training set and a test set for entity alignment on heterogeneous medical knowledge graphs; 2. reasoning over the medical knowledge graphs with rules to supplement missing relations and relieve the structural heterogeneity between the graphs; 3. constructing a dual-view neural network model based on a gating mechanism to align and fuse the heterogeneous medical knowledge graphs.

The technical scheme adopted by the invention is as follows:

An entity alignment method based on a knowledge graph of Chinese electronic medical records: the entity sets of two heterogeneous medical knowledge graphs are obtained respectively, and seed pairs of entities are labeled; for example, the entities "AIDS" and "AIDS" taken from the knowledge graphs MeKG_1 and MeKG_2, respectively, form one seed pair. The successfully labeled entity seed pairs are used as the training set and the test set. The method further comprises the following steps:

Step 1, completing the knowledge graphs using rules;

The invention completes the medical knowledge graphs by filling in missing relations, which relieves the structural differences between heterogeneous medical knowledge graphs. The rules preset for each heterogeneous medical knowledge graph are combined to obtain a rule set K. The rule set K is applied to each heterogeneous medical knowledge graph: for a given rule k ∈ K, all premise triples satisfying the rule are searched for, a conclusion triple is inferred according to the rule, and the conclusion triple is added to the original knowledge graph if it is not already present, thereby completing the knowledge graph. This yields two completed heterogeneous medical knowledge graphs G = (E, R, T) and G′ = (E′, R′, T′), where E and E′ denote the first and second entity sets, R and R′ the first and second relation sets, and T and T′ the first and second triple sets; e ∈ E, r ∈ R and t ∈ T denote an arbitrary entity, relation and triple in G.
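The completion procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Rule` type, its field names and the function `complete_graph` are hypothetical, and the sketch assumes every rule has the two-premise shape "(e, r_ek, k) and (g, r_ge, e) imply (g, r_gk, k)". It makes a single pass and does not re-apply rules to newly inferred triples.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule shape: (e, r_ek, k) and (g, r_ge, e) imply (g, r_gk, k)."""
    r_ek: str  # relation of the premise triple (e, r_ek, k)
    r_ge: str  # relation of the premise triple (g, r_ge, e)
    r_gk: str  # relation of the conclusion triple (g, r_gk, k)


def complete_graph(triples, rules):
    """Return the conclusion triples the rules infer that are not yet in the graph."""
    existing = set(triples)
    inferred = set()
    for rule in rules:
        for (e, r1, k) in triples:
            if r1 != rule.r_ek:
                continue
            # Look for the second premise sharing the entity e as its tail.
            for (g, r2, e2) in triples:
                if r2 == rule.r_ge and e2 == e:
                    conclusion = (g, rule.r_gk, k)
                    if conclusion not in existing:
                        inferred.add(conclusion)
    return inferred
```

In the workflow of the invention the inferred triples would still be screened by experts before being added to the graph.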

Step 2, constructing a dual-view neural network model based on a gating mechanism;

The final purpose of entity alignment is to find all entity pairs across E and E′ that refer to the same thing. At present, most entity alignment models based on graph neural networks cannot adequately handle the problems caused by the heterogeneity of medical knowledge graphs. Building on existing research, the invention proposes a novel dual-view neural network model based on a gating mechanism, DvGNet, to solve the entity alignment problem; the model relieves the heterogeneity of medical knowledge graphs from two views. The entity representation of each view is obtained by aggregating multi-hop neighborhood information through a gating mechanism.

The model takes the two medical knowledge graphs G and G′ as input and outputs an embedded representation of each entity; whether two entities should be aligned is determined by measuring the distance between their embedded representations. The model considers the structural heterogeneity of the knowledge graphs from an entity interaction view and a relation interaction view. The details of the model are as follows:

Step 2.1, constructing the entity interaction view network;

Ideally, if the central entities to be aligned have the same neighborhood structure, a graph neural network approach can learn very similar representations for them. However, because construction purposes and sources differ, medical knowledge graphs often exhibit schema heterogeneity and incompleteness, so the neighborhoods of corresponding central entities differ. This is an important cause of structural heterogeneity and is referred to in the invention as the neighborhood heterogeneity of an entity.

The entity interaction view iteratively learns an accurate self-attention score for each neighbor of an entity through a self-attention mechanism, and important neighbors are given higher weights during training to relieve the neighborhood heterogeneity of the entity. For any entity e_i in a completed heterogeneous medical knowledge graph, the self-attention score is calculated from the embedded representations of the entity and of its neighbor entities, and the neighbor feature vectors of the entity are aggregated with the self-attention scores to obtain the representation of entity e_i at the l-th layer of the entity interaction view;

Step 2.2, constructing the relation interaction view network;

Most current research aims at addressing entity neighborhood heterogeneity. However, because the relations of different medical knowledge graphs are defined independently, relation heterogeneity also arises between knowledge graphs. For example, a relation present in knowledge graph G may not be present in knowledge graph G′, which is another important cause of the structural heterogeneity of medical knowledge graphs. The entity interaction view is limited to a single knowledge graph and ignores the relation heterogeneity between medical knowledge graphs, so the invention proposes relation interaction to solve the problem.

The relation feature matrices of the completed heterogeneous medical knowledge graphs G and G′ interact to obtain a relation similarity matrix, and a max pooling operation is then applied to obtain a relation matching vector; finally, information from the neighbors is aggregated using the cross-graph matching scores obtained from the relation matching vector, giving the representation of any entity e_i of a completed heterogeneous medical knowledge graph at the l-th layer of the relation interaction view;

Step 2.3, gated aggregation;

Existing research holds that the output layer representation of a multi-layer network aggregates the multi-hop neighborhood information of an entity, so the output layer representation is taken as the final embedding of the entity. However, as the number of network layers increases, the number of neighbors aggregated by the central entity grows exponentially, so the output layer representation introduces considerable noise into the central entity.

To obtain a more accurate entity representation, the invention proposes a new gating mechanism that aggregates the embeddings of the hidden layers and the output layer, and applies it to both views. The gating mechanism captures the multi-hop neighborhood information of an entity to enhance its embedding while removing the redundant noise of each layer, thereby solving the noise problem caused by multi-layer convolution.

Gated aggregation is performed separately on the entity interaction view network constructed in step 2.1 and on the relation interaction view network constructed in step 2.2, giving the output h_i,1 of the entity interaction view network and the output h_i,2 of the relation interaction view network, respectively;

The two gated view outputs are aggregated once more through the gating mechanism to obtain the final representation h_i of entity e_i;

Step 3, calculating an embedding distance;

Based on the final entity representations obtained in step 2.3, the embedding distance between entities of the two heterogeneous medical knowledge graphs is calculated with d(·), where a smaller distance indicates more similar entities and d(·) is the L2 norm. The model is trained with supervision on the seed pairs in the training set, so that the distance between aligned entities gradually decreases and the distance between non-aligned entities gradually increases.
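The distance computation and the training goal stated above can be illustrated as follows. The L2 distance comes directly from the text; the margin-based hinge loss is only an assumption made here for illustration, since the text states the training goal (aligned pairs closer, non-aligned pairs farther apart) without giving a loss function. The names `l2_distance`, `margin_loss` and the `margin` parameter are hypothetical.

```python
import math


def l2_distance(u, v):
    """d(u, v): Euclidean (L2) distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def margin_loss(aligned_pairs, negative_pairs, margin=1.0):
    """Assumed hinge loss: zero once every negative pair is at least `margin`
    farther apart than the aligned pair it is matched with."""
    total = 0.0
    for (a, b), (c, d) in zip(aligned_pairs, negative_pairs):
        total += max(0.0, l2_distance(a, b) - l2_distance(c, d) + margin)
    return total
```

Minimizing such a loss reduces the distance of seed pairs and increases the distance of sampled non-aligned pairs, which matches the behavior described for the supervised training.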

Further, the self-attention score in the entity interaction view network constructed in step 2.1 above is calculated by formula (1), wherein the self-attention coefficient represents the importance of entity e_j to entity e_i; the neighbor set of an entity includes the entity itself; "||" denotes vector concatenation; LeakyReLU(·) is the activation function; and W₁, W₂ and p are trainable parameters.
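The formula itself is not reproduced in this text. One form consistent with the symbols listed above, assuming the standard graph attention construction, is sketched below. The notation α_ij^(l) for the self-attention coefficient, c_ij^(l) for the unnormalized score, N_i for the neighbor set of e_i (including e_i) and h_i^(l) for the layer-l representation is introduced here for illustration and is not taken from the source.

```latex
c_{ij}^{(l)} = \mathrm{LeakyReLU}\!\left( p^{\top} \left[ W_1 h_i^{(l)} \,\|\, W_2 h_j^{(l)} \right] \right),
\qquad
\alpha_{ij}^{(l)} = \frac{\exp\!\left( c_{ij}^{(l)} \right)}{\sum_{k \in N_i} \exp\!\left( c_{ik}^{(l)} \right)}
```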

In a graph neural network, the representation of a node is learned by recursively aggregating the feature vectors of its neighbors. Different aggregation strategies result in different graph neural network variants. Graph attention networks are a very popular variant of graph neural networks.

Further, in step 2.1 above, the entity interaction view network uses the self-attention score given by formula (1) to aggregate the neighbor feature vectors of entity e_i and obtain the representation of e_i at the l-th layer of the entity interaction view. The calculation uses a weight parameter of the l-th layer of this view network and an activation function σ(·), chosen as ReLU(·).
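A minimal pure-Python sketch of one layer of the entity interaction view is given below. It assumes a standard graph-attention form (softmax over LeakyReLU scores of concatenated, linearly transformed embeddings, followed by a ReLU of the attention-weighted sum); the exact formulas of the invention are not reproduced in this text, so the function names, argument layout and the LeakyReLU slope of 0.2 are all assumptions.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_scores(h, neighborhood, i, W1, W2, p):
    """Softmax-normalized scores of entity i over its neighborhood (which includes i)."""
    hi = matvec(W1, h[i])
    logits = [
        leaky_relu(sum(a * b for a, b in zip(p, hi + matvec(W2, h[j]))))
        for j in neighborhood  # hi + ... concatenates the two transformed vectors
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(c - top) for c in logits]
    total = sum(exps)
    return [x / total for x in exps]


def aggregate(h, neighborhood, i, W1, W2, p, W_layer):
    """ReLU of the attention-weighted sum of transformed neighbor vectors."""
    alpha = attention_scores(h, neighborhood, i, W1, W2, p)
    out = [0.0] * len(W_layer)
    for a, j in zip(alpha, neighborhood):
        out = [o + a * m for o, m in zip(out, matvec(W_layer, h[j]))]
    return [max(0.0, o) for o in out]
```

Here `h` is a list of entity embeddings, `neighborhood` is the list of neighbor indices of entity `i` (with `i` itself included), and `p` must have twice the output dimension of `W1` and `W2`.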

Further, in step 2.2 above, the relation matching vector in the relation interaction view network is calculated as follows:

M = f_M(f_S(R, R′)) (4)

wherein f_S(·) denotes the relation similarity function, defined as f_S(R, R′) = R^T R′, and R and R′ denote the relation feature matrices of the two medical knowledge graphs to be aligned; f_M(·) denotes the max pooling operation. The cross-graph matching score is calculated from the relation matching vector M by formula (5), wherein M[·] denotes the operation of indexing the relation matching degree and T denotes the triple set of the knowledge graph;

The cross-graph matching score is then used to calculate the representation of entity e_i at the l-th layer of the relation interaction view.
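The relation matching step of formula (4) can be sketched as below. Relations are stored here as a list of vectors, which corresponds to the columns of the feature matrices R and R′, so the dot products reproduce the entries of R^T R′. Pooling along each row (one value per relation of G) is an assumption about the pooling direction, and the softmax normalization in `matching_scores` is likewise an assumed form of the cross-graph matching score, since formula (5) is not reproduced in this text. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relation_matching_vector(R, R_prime):
    """M[a] = max over b of <R[a], R_prime[b]>: max pooling of the similarity matrix."""
    return [max(dot(ra, rb) for rb in R_prime) for ra in R]


def matching_scores(M, neighbor_relations):
    """Index M by the relation on each edge to a neighbor, then normalize (assumed softmax)."""
    logits = [M[r] for r in neighbor_relations]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]
```

A relation of G that has a close counterpart in G′ receives a high entry in M, so neighbors reached through such relations contribute more when the neighborhood is aggregated.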

Further, in step 2.2 above, the initial representation of a relation r is obtained from the embedded representations of the head and tail entities of all triples whose relation is r, wherein T_r denotes the set of triples whose relation is r, and e_h and e_t denote the embeddings of the corresponding head and tail entities, respectively.

Further, in step 2.3 above, the gated aggregation of a ξ-layer network is given by formula (9), where ρ_ξ(α, β) = g_ξ·α + (1-g_ξ)·β, g_ξ is a trainable parameter controlling the output of each layer of the network, and τ is the view category, i.e. the relation interaction view or the entity interaction view;

Gated aggregation is performed with formula (9) on the entity interaction view network constructed in step 2.1 and on the relation interaction view network constructed in step 2.2, giving the output h_i,1 of the entity interaction view network and the output h_i,2 of the relation interaction view network, respectively;

The final representation of entity e_i is obtained by aggregating the two gated view outputs once more through the gating mechanism:

h_i = g₁·h_i,1 + (1-g₁)·h_i,2 (10)

wherein g₁ is a trainable parameter controlling the aggregation of the two views.
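The gate ρ(α, β) = g·α + (1-g)·β and the final combination of formula (10) can be written out as follows. The function `gate` and `final_embedding` follow the formulas stated in the text; `gated_view` folds the per-layer embeddings of one view with one gate per additional layer, and the folding order (earlier layers as the first argument) is an assumption, because formula (9) is not reproduced here. The function names are hypothetical.

```python
def gate(g, a, b):
    """rho(a, b) = g * a + (1 - g) * b, applied element-wise to two vectors."""
    return [g * x + (1.0 - g) * y for x, y in zip(a, b)]


def gated_view(layer_outputs, gates):
    """Fold the layer embeddings of one view into a single vector (assumed order)."""
    acc = layer_outputs[0]
    for g, h in zip(gates, layer_outputs[1:]):
        acc = gate(g, acc, h)
    return acc


def final_embedding(h_entity_view, h_relation_view, g1):
    """Formula (10): h_i = g1 * h_i,1 + (1 - g1) * h_i,2."""
    return gate(g1, h_entity_view, h_relation_view)
```

With two layers, `gated_view([hidden, output], [g0])` mixes the hidden layer and output layer embeddings with a single gate, and `final_embedding` then mixes the two views.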

The beneficial effects of the invention are as follows: the constructed training set is used to train the proposed dual-view neural network model based on a gating mechanism, and the trained model is used to mine unknown aligned entity pairs in the medical knowledge graphs, so that heterogeneous medical knowledge graphs can be fused. Balancing accuracy against labor cost, the method reaches a Hits@5 of 85.4%, effectively integrates existing medical resources, and promotes the development of intelligent medical treatment.

Drawings

FIG. 1 is a framework diagram of medical knowledge graph alignment.

FIG. 2 shows the dual-view neural network model based on a gating mechanism.

Detailed Description

The technical scheme of the invention is described in detail with reference to the accompanying drawings.

Fig. 1 is the framework diagram of medical knowledge graph alignment, and Table 1 gives detailed data statistics of the heterogeneous medical knowledge graphs.

1. Corpus pre-processing

The heterogeneous medical knowledge graphs MeKG_1 and MeKG_2, constructed from electronic medical records, are manually labeled, and the manually labeled seed pairs are used as the training set and the test set, which contain 227 and 136 seed pairs, respectively.

2. Completing the knowledge graphs using rules

First, experts derive rules from each medical knowledge graph, and the rules are then transferred between the medical knowledge graphs under the assumption of knowledge invariance to form a unified rule set. Rule matching is then performed in each medical knowledge graph. Finally, to ensure the quality of the triples obtained by rule inference, experts are asked to screen the inferred triples.

(1) Inference and transfer of rules

The expert designs the rules existing in each medical knowledge graph, and combines all the rules to obtain a rule set K. For example,

(2) Rule grounding

The rule set K obtained in step (1) is applied to each medical knowledge graph: for a given rule k ∈ K, all premise triples satisfying the rule are searched for, a conclusion triple is inferred according to the rule, and the conclusion triple is added to the original knowledge graph if it is not already present, thereby completing the knowledge graph. For example, according to the above rules:

① Suppose that in Fig. 1 entity e is "imbalance syndrome", entity g is "hemodialysis", and entity k is "coma". Then relation r5 is "disease-indicating symptom" and relation r6 is "treatment-causing disease", since "hemodialysis" treatment may cause the complication "imbalance syndrome"; from the relation between e and k, it can be inferred that the relation r7 from g to k is "treatment-causing symptom".

② Suppose that in Fig. 1 entity e is "nephrotic syndrome", entity g is "prednisone", and entity k is "proteinuria". Then relation r5 is "disease-indicating symptom" and relation r6 is "treatment-improving disease", since "prednisone" can treat "nephrotic syndrome"; from the relation between e and k, it can be inferred that the relation r7 between g and k is "symptomatic treatment".

(3) Expert screening

Experts screen the triples inferred in step (2) to ensure their quality. After expert screening, the numbers of inferred triples for the two medical knowledge graphs are 11668 and 11820, respectively; detailed data of the completed knowledge graphs are shown in Table 1.

Table 1: Statistics of the heterogeneous knowledge graphs

3. Model training

(1) Obtaining entity interaction perspective embedding

For any entity e_i in G, this view uses the attention score calculated by formula (1) to iteratively aggregate neighbor information into the entity interaction view embedding of the central entity e_i, and a gating mechanism controls how the outputs of the network layers are aggregated. The number of network layers is 2 in this embodiment, and the embedded representation of entity e_i in a view of category τ is given by formula (11), wherein g₀ is a trainable parameter controlling the aggregation of the hidden layer and output layer representations of the network; with the number of network layers set to 2, the two aggregated terms are the hidden layer embedding and the output layer embedding of the network. Finally, the embedding h_i,1 of entity e_i in the entity interaction view is obtained.

(2) Obtaining relational interactive perspective embedding

Similar to (1), for any entity e_i in G, this view uses the cross-graph matching score calculated by formula (5) to iteratively aggregate neighbor information into the relation interaction view embedding of the central entity e_i, and a gating mechanism controls how the outputs of the network layers are aggregated. The embedding h_i,2 of entity e_i in the relation interaction view is finally obtained according to formula (11).

(3) Gated aggregation

Formula (10) is used to aggregate the entity interaction view embedding h_i,1 and the relation interaction view embedding h_i,2 obtained in steps (1) and (2), giving the final embedded representation h_i of entity e_i.

Similarly, all entities in G' are also processed in the same manner as in steps (1), (2), and (3).

(4) Calculating an embedding distance

Based on the entity embedded representations obtained in step (3), the embedding distance between entities is calculated with d(·), where a smaller distance indicates more similar entities and d(·) is defined as the L2 norm. The model is trained with supervision on the seed pairs in the training set, so that the distance between aligned entities gradually decreases and the distance between non-aligned entities gradually increases.

4. Model quality assessment

In this embodiment, Hits@k and MRR are used as the evaluation metrics of the model. Hits@k is the percentage of test entities whose correct counterpart appears among the top k candidate entities, with k set to 1, 5, 10 and 50, and MRR is the mean of the reciprocal ranks of the correctly aligned entities. Higher values of Hits@k and MRR indicate better model performance. The results of the model on the test set are shown in Table 2.
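Both metrics follow directly from the rank at which the correct counterpart of each test entity appears. A minimal sketch, assuming 1-based ranks and the hypothetical function names `hits_at_k` and `mean_reciprocal_rank`:

```python
def hits_at_k(ranks, k):
    """Fraction of test entities whose correct counterpart is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1 / rank over all test entities (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For illustrative ranks `[1, 2, 6, 1]`, Hits@1 is 0.5 and Hits@5 is 0.75; these numbers are a toy example and not results of the invention.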

Table 2: Results of the DvGNet model

Model	Hits@1	Hits@5	Hits@10	Hits@50	MRR
DvGNet	75.9%	85.4%	86.9%	95.6%	80.6%

5. Mining unknown aligned entity pairs using DvGNet

Because only a small number of seed pairs are manually labeled as the training set and the test set, a large number of entity pairs in the knowledge graphs remain to be aligned, and the invention uses DvGNet to mine the remaining entity pairs to be aligned in the medical knowledge graphs MeKG_1 and MeKG_2. As the results in Table 2 show, Hits@5 reaches 85.4%. Balancing accuracy against labor cost, DvGNet recommends the 5 candidate entities with the highest similarity for each entity to be aligned, and experts label these candidates to obtain the final alignment result. As shown in Table 3, the aligned entity of "acute lymphoblastic leukemia" is "ALL", the aligned entity of "Alzheimer's disease" is "senile dementia", and the aligned entity of "Bartter syndrome" is "Bart syndrome". In total, 546 aligned entity pairs were identified with the model.

Table 3: Examples of entity alignment candidates

Entity to be aligned	Top1	Top2	Top3	Top4	Top5
Acute lymphoblastic leukemia	ALL	Leukemia	Acute leukemia	Central nervous system leukemia	CNSL
Alzheimer's disease	Senile dementia	Memory disorder	Dementia	ADS	Cognitive dysfunction
Bartter syndrome	Hypokalemic alkalosis	Bart syndrome	Hypokalemia	Gitelman syndrome	Juxtaglomerular apparatus hyperplasia
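The candidate recommendation step described above (the five closest entities for each entity to be aligned) amounts to a nearest-neighbor lookup under the L2 distance. A minimal sketch follows; the function name `top_k_candidates` and the dictionary layout of the candidate embeddings are assumptions made for illustration.

```python
import heapq
import math


def top_k_candidates(query_vec, candidates, k=5):
    """Return the k candidate names whose embeddings are closest to query_vec.

    candidates maps an entity name in the other graph to its embedding vector.
    """
    return heapq.nsmallest(
        k, candidates, key=lambda name: math.dist(query_vec, candidates[name])
    )
```

The returned list is ordered from the most to the least similar candidate, which corresponds to the Top1 to Top5 columns that experts review.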

Claims

1. An entity alignment method based on a knowledge graph of Chinese electronic medical records, characterized in that the entity sets of two heterogeneous medical knowledge graphs are obtained respectively, seed pairs of entities are labeled, and the successfully labeled entity seed pairs are used as a training set and a test set; the method further comprises the following steps:

Step 1, completing the knowledge graphs using rules;

the rules preset for each heterogeneous medical knowledge graph are combined to obtain a rule set K; the rule set K is applied to each heterogeneous medical knowledge graph: for a given rule k ∈ K, all premise triples satisfying the rule are searched for, a conclusion triple is inferred according to the rule, and the conclusion triple is added to the original knowledge graph if it is not already present, thereby completing the knowledge graph; this yields two completed heterogeneous medical knowledge graphs G = (E, R, T) and G′ = (E′, R′, T′), where E and E′ denote the first and second entity sets, R and R′ the first and second relation sets, and T and T′ the first and second triple sets, and e ∈ E, r ∈ R and t ∈ T denote an arbitrary entity, relation and triple in G;

Step 2.1, constructing the entity interaction view network;

for any entity e_i in a completed heterogeneous medical knowledge graph, the self-attention score is calculated from the embedded representations of the entity and of its neighbor entities, and the neighbor feature vectors of the entity are aggregated with the self-attention scores to obtain the representation of entity e_i at the l-th layer of the entity interaction view;

Step 2.2, constructing the relation interaction view network;

the relation feature matrices of the completed heterogeneous medical knowledge graphs G and G′ interact to obtain a relation similarity matrix, and a max pooling operation is then applied to obtain a relation matching vector; finally, information from the neighbors is aggregated using the cross-graph matching scores obtained from the relation matching vector, giving the representation of any entity e_i of a completed heterogeneous medical knowledge graph at the l-th layer of the relation interaction view;

Step 2.3, gated aggregation;

Step 3, calculating an embedding distance;

2. The entity alignment method based on the Chinese electronic medical record knowledge graph according to claim 1, characterized in that the self-attention score in the entity interaction view network constructed in step 2.1 is calculated by formula (1).

3. The method of claim 2, wherein the entity interaction view network constructed in step 2.1 uses the self-attention score of formula (1), and the calculation uses a weight parameter of the l-th layer of the view network and an activation function σ(·), chosen as ReLU(·).

4. The entity alignment method based on the knowledge graph of Chinese electronic medical records as claimed in claim 1, wherein in constructing the relation interaction view network in step 2.2, the relation matching vector is calculated as follows:

M = f_M(f_S(R, R′)) (4)

and the cross-graph matching score calculated from the relation matching vector M is used to calculate the representation of entity e_i at the l-th layer of the relation interaction view.

5. The entity alignment method based on the knowledge graph of Chinese electronic medical records of claim 1, wherein in constructing the relation interaction view network in step 2.2, the initial representation of a relation r is obtained from the embedded representations of the head and tail entities of all triples whose relation is r.

6. The entity alignment method based on the knowledge graph of Chinese electronic medical records of claim 1, wherein in the gated aggregation of step 2.3, the gated aggregation of a ξ-layer network is given by formula (9);

gated aggregation is performed with formula (9) on the entity interaction view network constructed in step 2.1 and on the relation interaction view network constructed in step 2.2, giving the output h_i,1 of the entity interaction view network and the output h_i,2 of the relation interaction view network, respectively; the final representation of entity e_i is then:

h_i = g₁·h_i,1 + (1-g₁)·h_i,2 (10)