CN110866124A - Medical knowledge graph fusion method and device based on multiple data sources

Info

Publication number: CN110866124A
Application number: CN201911077284.3A
Authority: CN
Inventors: 周永杰; 高飞; 王则远; 刘静
Original assignee: Beijing Promise Cognitive Medical Technology Co Ltd
Current assignee: Beijing Promise Cognitive Medical Technology Co Ltd
Priority date: 2019-11-06
Filing date: 2019-11-06
Publication date: 2020-03-06
Anticipated expiration: 2039-11-06
Also published as: CN110866124B

Abstract

The embodiment of the invention provides a medical knowledge graph fusion method and device based on multiple data sources. The method comprises the following steps: performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector; mapping each first initial vector and each second initial vector into a reference vector space based on a pre-obtained reference vector set to obtain each first mapping vector and each second mapping vector; and fusing knowledge in the first medical knowledge graph and the second medical knowledge graph according to the first mapping vectors and the second mapping vectors to obtain the fused medical knowledge graph. The medical knowledge graph fusion method and device based on multiple data sources provided by the embodiment of the invention are based on a knowledge representation learning method and, by utilizing the internal logicality of the knowledge graphs, can obtain a fused medical knowledge graph with stronger internal logicality and expression capability.

Description

Medical knowledge graph fusion method and device based on multiple data sources

Technical Field

The invention relates to the technical field of computers, in particular to a medical knowledge graph fusion method and device based on multiple data sources.

Background

With the development of the mobile internet, the data it generates is growing explosively, and more and more knowledge metadata is scattered across the internet. However, because content on the internet comes from multiple heterogeneous sources and is loosely organized, interconnecting knowledge in a big data environment is a great challenge. Therefore, a knowledge interconnection method that both keeps pace with the development and change of network information resources and meets the cognitive requirements of users needs to be explored from a new perspective, according to the principles of knowledge organization in a big data environment, so as to reveal the integrity and relevance of human cognition at a deeper level.

The knowledge graph emerged in this context. The concept of the knowledge graph was formally proposed by Google in 2012, with the aim of building a more intelligent search engine. Since then, the knowledge graph has gradually come to play an important role in applications such as intelligent question answering, intelligence analysis and fraud prevention, and is now widely applied in fields such as intelligent search, intelligent question answering and personalized recommendation. The knowledge graph is essentially a knowledge base in the form of a semantic network, and from the practical application point of view can be simply understood as a multi-relationship graph. A relationship graph is composed of nodes and edges, and a multi-relationship graph generally comprises multiple types of nodes and multiple types of edges. The knowledge graph is a new knowledge representation mode based on a graph data structure, composed of various nodes (entities) and various labeled edges (relationships among the entities). It describes the entities or concepts existing in the real world, embodies the relationships or associations among them, and integrates discrete data together, thereby providing more valuable decision support.

In the medical field, with the development of medical and health informatization and of medical information systems, a large amount of medical data has accumulated. How to extract information from these data, and how to manage and apply it, is a key issue in advancing medical intelligence. The construction of a medical knowledge graph mainly involves medical knowledge representation, medical knowledge extraction, medical knowledge reasoning and the like. The medical field is one of the vertical fields in which knowledge graphs are most widely applied, but because medical data is highly specialized and complex in structure, the construction of medical knowledge graphs suffers from problems such as low efficiency and poor extensibility. A single data source is not enough to cover complete medical domain knowledge; however, when knowledge is acquired from a wide range of sources, the data sources differ in their requirements and design concepts, so that the resulting medical knowledge graphs suffer from uneven knowledge quality, redundant knowledge, and unclear associations and boundaries between entity concepts. Therefore, how to fuse medical knowledge graphs from different data sources has become a problem to be solved in the field.

Disclosure of Invention

The embodiment of the invention provides a medical knowledge graph fusion method and device based on multiple data sources, which are used for solving or at least partially solving the defect that the fusion of medical knowledge graphs from different data sources is difficult to realize in the prior art.

In a first aspect, an embodiment of the present invention provides a medical knowledge graph fusion method based on multiple data sources, including:

performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector;

mapping each first initial vector and each second initial vector to a reference vector space based on a pre-obtained reference vector set to obtain each first mapping vector and each second mapping vector;

fusing knowledge in the first medical knowledge graph and the second medical knowledge graph according to the first mapping vectors and the second mapping vectors to obtain a fused medical knowledge graph;

wherein the first initial vector is a vector representation result of an entity or relationship in the first medical knowledge graph in a first vector space; the second initial vector is a vector representation result of an entity or relationship in the second medical knowledge graph in a second vector space; the reference vector is a vector representation result of an entity in the reference knowledge graph in the reference vector space; and the first medical knowledge graph, the second medical knowledge graph and the reference knowledge graph are constructed based on different medical data sources.

Preferably, the mapping each first initial vector and each second initial vector into a reference vector space based on a pre-obtained reference vector set, and the specific step of obtaining each first mapping vector and each second mapping vector includes:

for any first instance in the first medical knowledge graph and any second instance in the second medical knowledge graph, if it is determined that the tail entity in the first instance and the tail entity in the second instance are known and have a relationship in the reference knowledge graph, taking the tail entity in the first instance and the tail entity in the second instance as a pre-fusion entity pair;

and performing bidirectional supervision training on each first initial vector and each second initial vector according to the reference vector corresponding to each pre-fusion entity pair in the reference vector set to obtain each first mapping vector and each second mapping vector.

Preferably, the specific step of fusing knowledge in the first medical knowledge-graph and the second medical knowledge-graph according to the first mapping vectors and the second mapping vectors includes:

for each pre-fusion entity pair, acquiring a first distance between reference vectors corresponding to two tail entities in the pre-fusion entity pair, and a second distance between a first mapping vector corresponding to a head entity in a first instance corresponding to the pre-fusion entity pair and a second mapping vector corresponding to a head entity in a second instance corresponding to the pre-fusion entity pair;

and if the difference between the first distance and the second distance is smaller than a preset difference threshold value, determining that the relationship between two tail entities in the pre-fusion entity pair exists between the head entity in the first instance corresponding to the pre-fusion entity pair and the head entity in the second instance corresponding to the pre-fusion entity pair in the fused medical knowledge graph.

Preferably, the specific step of fusing knowledge in the first medical knowledge-graph and the second medical knowledge-graph according to the first mapping vectors and the second mapping vectors further includes:

for any entity in the first medical knowledge graph, obtaining the second mapping vector closest to the first mapping vector corresponding to that entity;

and if it is determined that the distance between the first mapping vector corresponding to that entity and the nearest second mapping vector is smaller than a preset distance threshold, fusing that entity and the entity in the second medical knowledge graph corresponding to the nearest second mapping vector into the same entity in the fused medical knowledge graph.

Preferably, after determining that the head entity in the first instance corresponding to the pre-fusion entity pair and the head entity in the second instance corresponding to the pre-fusion entity pair have the relationship between the two tail entities in the pre-fusion entity pair in the fused medical knowledge-graph, the method further includes:

determining a head entity in a first instance corresponding to the pre-fusion entity pair and a head entity in a second instance corresponding to the pre-fusion entity pair as a new pre-fusion entity pair;

and taking the first mapping vector and the second mapping vector respectively corresponding to the two head entities in the new pre-fusion entity pair as new reference vectors, and adding the new reference vectors into the reference vector set to obtain a new reference vector set.

Preferably, before performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector, the method further includes:

acquiring each instance in the first medical knowledge graph according to a first medical data source and a method combining a bidirectional long short-term memory network (BiLSTM) with a conditional random field (CRF), and constructing the first medical knowledge graph according to each instance in the first medical knowledge graph;

and acquiring each instance in the second medical knowledge graph according to a second medical data source and a method combining a bidirectional long short-term memory network with a conditional random field, and constructing the second medical knowledge graph according to each instance in the second medical knowledge graph.

In a second aspect, an embodiment of the present invention provides a medical knowledge-graph fusion apparatus based on multiple data sources, including:

the vector representation module is used for performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector;

the vector mapping module is used for mapping each first initial vector and each second initial vector into a reference vector space based on a pre-obtained reference vector set to obtain each first mapping vector and each second mapping vector;

the knowledge fusion module is used for fusing knowledge in the first medical knowledge graph and the second medical knowledge graph according to the first mapping vectors and the second mapping vectors to obtain a fused medical knowledge graph.

In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the program is executed, the steps of the medical knowledge-graph fusion method based on multiple data sources as provided in any one of the various possible implementations of the first aspect are implemented.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the medical knowledge graph fusion method based on multiple data sources as provided in any one of the various possible implementations of the first aspect.

The embodiment of the invention provides a medical knowledge graph fusion method and device based on multiple data sources, which is based on a knowledge representation learning method and is used for vectorizing a first medical knowledge graph and a second medical knowledge graph to be fused, mapping vectorized results into a reference vector space according to a reference vector set, performing knowledge fusion according to a first mapping vector and a second mapping vector in the reference vector space, obtaining a fused medical knowledge graph, and obtaining the fused medical knowledge graph with stronger internal logicality and expression capacity by utilizing the internal logicality of the knowledge graph.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart diagram of a medical knowledge-graph fusion method based on multiple data sources according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a medical knowledge graph fusion device based on multiple data sources according to an embodiment of the present invention;

FIG. 3 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to overcome the above problems in the prior art, embodiments of the present invention provide a medical knowledge graph fusion method and apparatus based on multiple data sources. The inventive concept is to fuse medical knowledge graphs from different data sources by means of knowledge representation learning and a reference knowledge graph, which not only implements the entity disambiguation and entity linking of general knowledge graph fusion, but also discovers relationships that do not exist in any single medical knowledge graph, thereby extending knowledge.

FIG. 1 is a schematic flow chart of a medical knowledge graph fusion method based on multiple data sources according to an embodiment of the present invention. As shown in FIG. 1, the method includes: Step S101, performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector.

The first initial vector is a vector representation result of an entity or relationship in the first medical knowledge graph in a first vector space; the second initial vector is a vector representation result of an entity or relationship in the second medical knowledge graph in a second vector space.

It should be noted that the input for constructing a knowledge graph is unstructured data, such as professional medical guidelines or a large number of drug package inserts, all written in natural language. The invention first extracts medical knowledge from the natural language and breaks it down into the raw materials of a knowledge graph, namely nodes and relationships. The nodes represent clinical medical entities (entities for short) such as diseases, symptoms and medicines, and the connecting lines represent the relationships among the entities, so that knowledge and data are interconnected in this form.

A knowledge graph may include multiple "entity-relationship-entity" RDF (Resource Description Framework) triples. Each "entity-relationship-entity" triple may be referred to as an instance.
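As a minimal sketch of the "entity-relationship-entity" instances described above, a knowledge graph can be held as a set of triples. The `Triple` type, the helper functions and the second instance are assumptions made for illustration and are not part of the disclosed method; only the "amoxicillin / drug type / antibiotic" names are taken from the description.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One "entity-relationship-entity" instance of a knowledge graph."""
    head: str
    relation: str
    tail: str


# Hypothetical instances; the second one is invented for illustration only.
kg1 = {
    Triple("amoxicillin", "drug type", "antibiotic"),
    Triple("amoxicillin", "treats", "bacterial infection"),
}


def entities(kg):
    """Collect every entity that appears as a head or as a tail."""
    return {t.head for t in kg} | {t.tail for t in kg}


def relations(kg):
    """Collect every relationship label used in the graph."""
    return {t.relation for t in kg}
```

Entities and relationships collected this way are the items that receive vectors in the knowledge representation learning step.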

In order to facilitate the fusion of the knowledge graph, the entity and the relationship in the knowledge graph can be vectorized by a knowledge representation learning method to obtain a vector representation result of the entity and the relationship in a vector space, namely, a vector corresponding to the entity and the relationship is obtained.

Common knowledge representation learning models include: a distance model, a single-layer neural network model, an energy model, a bilinear model, a tensor neural network model, a matrix decomposition model, a translation model and the like.

Translation models such as TransE, PTransE, TransH and the like can be adopted to vectorize each instance in the first medical knowledge graph and vectorize each instance in the second medical knowledge graph respectively to obtain each first initial vector representing the entity or the relation in the first medical knowledge graph in the first vector space and each second initial vector representing the entity or the relation in the second medical knowledge graph in the second vector space.

Preferably, the classical translation model TransE can be employed. TransE is a distributed vector representation of entities and relationships, which maps the entities and relationships into a low-dimensional vector space by exploiting the translation invariance of word vectors.

TransE regards the relationship in a triple instance "entity-relationship-entity" (head-relation-tail) as a translation vector from the head entity (head) to the tail entity (tail), and continuously adjusts h, r and t (the vectors corresponding to the head, the relation and the tail) so that (h + r) is as close to t as possible, namely h + r ≈ t.

It should be noted that, in the triple example, either one of the two entities may be used as a head entity, and the other may be used as a tail entity.

The distance function of TransE can be set as d(h + r, t), which measures the distance between (h + r) and t; the smaller the function value, the more reasonable the triple. TransE defines the following objective function using the maximum-margin method:

$$L = \sum_{(h, r, t) \in S} \sum_{(h', r, t') \in S'} \max\left(0,\ \gamma + d(h + r, t) - d(h' + r, t')\right)$$

where γ is the margin, S is the set of triples in the knowledge base, and S' is the set of artificially constructed negative-sample triples, in each of which h or t (but not both) is replaced by a random entity. For triples in S, the smaller d(h + r, t) the better; for triples in S', the larger d(h' + r, t') the better.
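The distance function and the margin-based objective described here can be sketched as follows. This is a minimal illustration that assumes a Euclidean distance and plain Python lists as vectors; the function names `distance`, `margin_loss` and `corrupt` are hypothetical and are not defined by the source.

```python
import math
import random


def distance(h, r, t):
    """d(h + r, t): Euclidean distance between the translated head and the tail."""
    translated = [hi + ri for hi, ri in zip(h, r)]
    return math.dist(translated, t)


def margin_loss(positive, negative, gamma=1.0):
    """Hinge term max(0, gamma + d(positive) - d(negative)) for one triple pair."""
    return max(0.0, gamma + distance(*positive) - distance(*negative))


def corrupt(triple, entity_vectors, rng, replace_head):
    """Build a negative sample by replacing the head or the tail, never both."""
    h, r, t = triple
    random_entity = rng.choice(entity_vectors)
    return (random_entity, r, t) if replace_head else (h, r, random_entity)


h = [0.0, 0.0]
r = [1.0, 0.0]
t = [1.0, 0.0]        # h + r == t, so this triple fits perfectly
bad_t = [4.0, 4.0]    # a corrupted tail far away from h + r
```

With these values, a well-fitting positive triple paired with a distant negative contributes no loss, while swapping the two roles yields a positive loss that training would push down.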

Step S102, mapping each first initial vector and each second initial vector to a reference vector space based on a pre-obtained reference vector set, and obtaining each first mapping vector and each second mapping vector.

The reference vector is a vector representation result of an entity in the reference knowledge graph in a reference vector space; the first medical knowledge-graph, the second medical knowledge-graph and the reference knowledge-graph are constructed based on different medical data sources.

It should be noted that the basic idea of entity fusion using knowledge representation is to use the reference vector set to constrain the training of the vector representations of the entities in the knowledge graphs KG1 and KG2 to be fused, so that entity pairs that have not yet been fused are trained into vectors that are close to each other. For example, if the equivalent entity "antibiotic" and the same relationship "drug type" are present in both KG1 and KG2, then during the training of KG1 and KG2, "amoxicillin" in KG1 and "amoxicillin" in KG2 should be represented as closely spaced vectors.

Before step S102, the entities and relationships in the reference knowledge graph may be vectorized by a knowledge representation learning method to obtain reference vectors in the reference vector space, where the reference vectors represent the entities or relationships in the reference knowledge graph.

The reference vectors obtained based on the reference knowledge-graph may be combined into an initial set of reference vectors.

In step S101, each knowledge graph is trained individually; that is, the first medical knowledge graph is trained according to a translation model to obtain the first initial vectors, and the second medical knowledge graph is trained according to a translation model to obtain the second initial vectors. In order to perform knowledge fusion, the instances in the two knowledge graphs to be fused (that is, the first medical knowledge graph and the second medical knowledge graph) need to be mapped into the same low-dimensional vector space.

The first initial vectors and the second initial vectors may be subjected to bidirectional supervised training according to a reference vector set, and the first initial vectors and the second initial vectors are mapped into a reference vector space, so as to obtain first mapping vectors corresponding to each first initial vector and second mapping vectors corresponding to each second initial vector.

Step S103, fusing knowledge in the first medical knowledge graph and the second medical knowledge graph according to the first mapping vectors and the second mapping vectors to obtain the fused medical knowledge graph.

Specifically, whether the meanings of the entities in the first medical knowledge graph corresponding to the first mapping vector and the meanings of the entities in the second medical knowledge graph corresponding to the second mapping vector are the same or whether an inclusion relationship exists may be determined based on the distance between the first mapping vector and the second mapping vector, and the entities in the first medical knowledge graph and the entities in the second medical knowledge graph having the same meanings are determined as the pair of entities to be fused.

And according to the determined entity pair to be fused, fusing the entity in the first medical knowledge graph and the entity in the second medical knowledge graph to obtain the fused medical knowledge graph.
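The distance-based judgment described here can be sketched as a nearest-neighbor search with a threshold. This is a minimal illustration under stated assumptions: the mapping vectors are plain Python lists already in the reference vector space, the distance is Euclidean, and the function name `match_entities`, the entity names and the threshold value are hypothetical.

```python
import math


def match_entities(first_vecs, second_vecs, distance_threshold):
    """Pair each first-graph entity with its nearest second-graph entity.

    first_vecs / second_vecs: dict mapping an entity name to its mapping vector.
    A pair is kept only if the nearest distance is below the threshold.
    """
    pairs = []
    for name, vec in first_vecs.items():
        nearest = min(second_vecs, key=lambda other: math.dist(vec, second_vecs[other]))
        if math.dist(vec, second_vecs[nearest]) < distance_threshold:
            pairs.append((name, nearest))
    return pairs


# Hypothetical mapping vectors for two entities in each graph.
first = {"A1": [0.0, 0.0], "A2": [5.0, 5.0]}
second = {"B1": [0.1, 0.0], "B2": [9.0, 9.0]}
pairs = match_entities(first, second, distance_threshold=0.5)
```

Here only A1 and B1 are close enough to be treated as an entity pair to be fused; A2 has a nearest neighbor, but it lies beyond the threshold.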

A general knowledge graph fusion method judges, based only on the semantics of the entities themselves, whether entities in different knowledge graphs have the same meaning or an inclusion relationship, finds equivalent instances and equivalent or inclusive concepts or attributes, and performs entity disambiguation and entity linking on that basis, without considering the internal logic of the knowledge graphs. By contrast, the embodiment of the invention is based on knowledge representation learning and uses the translation invariance of word vectors, so that the judgment of whether entities in different knowledge graphs have the same meaning or an inclusion relationship rests on both the semantics of each entity and its relationships with other entities. The internal logic of the knowledge graphs can therefore be exploited, and the internal logicality and expression capability of the resulting fused medical knowledge graph are enhanced.

The embodiment of the invention discloses a knowledge representation learning-based method, which comprises the steps of vectorizing a first medical knowledge graph and a second medical knowledge graph to be fused, mapping vectorized results into a reference vector space according to a reference vector set, carrying out knowledge fusion according to a first mapping vector and a second mapping vector in the reference vector space to obtain the fused medical knowledge graph, and obtaining the fused medical knowledge graph with stronger internal logic and expression capacity by utilizing the internal logic of the knowledge graph.

Based on the content of the foregoing embodiments, the specific step of mapping each first initial vector and each second initial vector into a reference vector space based on a pre-obtained reference vector set to obtain each first mapping vector and each second mapping vector includes: for any first instance in the first medical knowledge graph and any second instance in the second medical knowledge graph, if it is determined that the tail entity in the first instance and the tail entity in the second instance are known and have a relationship in the reference knowledge graph, taking the tail entity in the first instance and the tail entity in the second instance as a pre-fusion entity pair.

Specifically, when performing bidirectional supervised training, a pair of pre-fusion entities is determined first.

An instance in the first medical knowledge graph is a first instance, and an instance in the second medical knowledge graph is a second instance. For any first instance and any second instance, if the tail entities in the two instances have a relationship in the reference knowledge graph, the tail entities in the two instances can be taken as a pre-fusion entity pair.

It is understood that for any first instance and any second instance, if the head entities in the two instances have a relationship in the reference knowledge-graph, the head entities in the two instances can also be used as a pre-fused entity pair.

Presence of a relationship means that in the reference knowledge-graph, there is an instance that includes both entities in the pre-fused entity pair.
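The selection of pre-fusion entity pairs can be sketched as follows. This is a minimal illustration that assumes each instance is a `(head, relation, tail)` tuple of names; the function names and the reference triple "K2-X-L2" are hypothetical and chosen only to exercise the tail-entity case.

```python
def related_in_reference(entity_a, entity_b, reference_kg):
    """True if some reference instance includes both entities."""
    return any({entity_a, entity_b} <= {head, tail} for head, _, tail in reference_kg)


def pre_fusion_pairs(first_kg, second_kg, reference_kg):
    """Return (first instance, second instance) pairs whose tail entities
    have a relationship in the reference knowledge graph."""
    pairs = []
    for first in first_kg:
        for second in second_kg:
            if related_in_reference(first[2], second[2], reference_kg):
                pairs.append((first, second))
    return pairs


# Hypothetical instances for illustration.
first_kg = [("K1", "J1", "K2")]
second_kg = [("L1", "J2", "L2")]
reference_kg = [("K2", "X", "L2")]
tail_pairs = pre_fusion_pairs(first_kg, second_kg, reference_kg)
```

The head-entity variant mentioned above would compare `first[0]` and `second[0]` instead of the tails.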

Specifically, since the first medical knowledge graph, the second medical knowledge graph and the reference knowledge graph are constructed based on different medical data sources, the dimensions of the first vector space, the second vector space and the reference vector space are not necessarily the same, even if the same knowledge representation learning method is employed.

The purpose of the bidirectional supervised training is to map the first initial vectors and the second initial vectors into the same vector space.

The optimization method in the bidirectional supervised training comprises the following steps:

keeping unchanged the reference vectors corresponding to the two entities in each pre-fusion entity pair, and using them respectively as the mapping results of the first initial vector and the second initial vector corresponding to those two entities (namely, as the first mapping vector and the second mapping vector corresponding to the two entities respectively), thereby determining the mapping relationship between the first initial vectors and the reference vectors and between the second initial vectors and the reference vectors;

and, according to these mapping relationships, continuously optimizing the representations in the reference vector space of the first initial vectors and the second initial vectors corresponding to the other entities and relationships, such that the optimized results satisfy the applicable constraints, including the constraint among the three vectors corresponding to each triple, thereby obtaining the first mapping vector corresponding to each first initial vector and the second mapping vector corresponding to each second initial vector.

Through bidirectional supervised training, the first mapping vector and the second mapping vector corresponding to the equivalent entities in the first medical knowledge graph and the second medical knowledge graph can have smaller distance values in the reference vector space.

According to the embodiment of the invention, bidirectional supervised training maps each first initial vector and each second initial vector into the reference vector space to obtain each first mapping vector and each second mapping vector, so that the first mapping vector and the second mapping vector corresponding to equivalent entities have a small distance in the reference vector space. Knowledge fusion can therefore be carried out based on the first mapping vectors and the second mapping vectors, and a fused medical knowledge graph with stronger internal logicality and expression capability can be obtained.
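One way to picture the optimization is sketched below. The source does not give the training procedure in detail, so this is only an assumed simplification: anchor entities (those in pre-fusion entity pairs) keep their reference vectors fixed, and all other vectors are adjusted by plain gradient descent on the triple constraint h + r ≈ t. The function name, learning rate and epoch count are hypothetical, and all vectors are assumed to already have the dimension of the reference vector space.

```python
def train_with_fixed_anchors(triples, vectors, anchors, lr=0.1, epochs=200):
    """Gradient descent on sum ||h + r - t||^2, never updating anchor vectors.

    triples: list of (head, relation, tail) names.
    vectors: dict mapping each name to a list of floats.
    anchors: set of names whose vectors are reference vectors and stay fixed.
    """
    vectors = {name: list(vec) for name, vec in vectors.items()}  # work on a copy
    for _ in range(epochs):
        for head, rel, tail in triples:
            h, r, t = vectors[head], vectors[rel], vectors[tail]
            residual = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
            # d/dh and d/dr of the squared error are +2e; d/dt is -2e.
            for name, sign in ((head, 1.0), (rel, 1.0), (tail, -1.0)):
                if name in anchors:
                    continue
                vec = vectors[name]
                for i, e in enumerate(residual):
                    vec[i] -= lr * 2.0 * sign * e
    return vectors


# Hypothetical example: the tail K2 is an anchor with a known reference vector.
initial = {"K1": [0.0, 0.0], "J1": [0.0, 0.0], "K2": [1.0, 2.0]}
mapped = train_with_fixed_anchors([("K1", "J1", "K2")], initial, anchors={"K2"})
```

After training, the head and relation vectors satisfy the triple constraint relative to the unchanged anchor, which is the property the mapping step relies on. A full implementation would also include negative sampling and the margin term of the chosen translation model.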

Based on the content of the foregoing embodiments, the specific step of fusing the knowledge in the first medical knowledge graph and the second medical knowledge graph according to the first mapping vectors and the second mapping vectors includes: for each pre-fusion entity pair, obtaining a first distance between the reference vectors corresponding to the two tail entities in the pre-fusion entity pair, and a second distance between the first mapping vector corresponding to the head entity in the first instance corresponding to the pre-fusion entity pair and the second mapping vector corresponding to the head entity in the second instance corresponding to the pre-fusion entity pair.

Specifically, in the reference vector space, a distance between a first mapping vector and a second mapping vector corresponding to two tail entities in each pre-fusion entity pair is obtained as a first distance. The first distance is a modulus of a reference vector corresponding to a relationship between the two tail entities.

A distance is also acquired, as the second distance, between the first mapping vector corresponding to the head entity of the first instance (the instance in the first medical knowledge graph that contains one of the two tail entities) and the second mapping vector corresponding to the head entity of the second instance (the instance in the second medical knowledge graph that contains the other tail entity).

And if the difference between the first distance and the second distance is smaller than a preset difference threshold value, determining that the relationship between two tail entities in the pre-fusion entity pair exists between the head entity in the first example corresponding to the pre-fusion entity pair and the head entity in the second example corresponding to the pre-fusion entity pair in the fused medical knowledge graph.

Specifically, the difference between the first distance and the second distance obtained from the same pair of pre-fused entities is calculated.

It should be noted that the difference between the first distance and the second distance is obtained by subtracting the smaller of the two distances from the larger, that is, it is the absolute value of their difference.

It is then judged whether the difference between the first distance and the second distance is smaller than the preset difference threshold. If so, the head entity of the first instance and the head entity of the second instance corresponding to the same pre-fusion entity pair have, between them, the same relationship as the two tail entities in that pre-fusion entity pair.

It should be noted that, if two entities in the pre-fusion entity pair are head entities, the first distance is a distance between reference vectors corresponding to the two entities; the second distance is the distance between a first mapping vector corresponding to a tail entity in a first medical knowledge graph corresponding to the two head entities and a second mapping vector corresponding to a tail entity in a second medical knowledge graph corresponding to the two head entities; if the difference between the first distance and the second distance is smaller than a preset difference threshold, it is indicated that the tail entity in the first medical knowledge graph corresponding to the two head entities has a relationship with the tail entity in the second medical knowledge graph corresponding to the two head entities, and the relationship is the same as the relationship between the two head entities in the pre-fusion entity pair.

For example, suppose a triple in the first medical knowledge graph is "K1-J1-K2", a triple in the second medical knowledge graph is "L1-J2-L2", and a triple in the reference knowledge graph is "K1-X-L1". If the first distance, between the reference vectors corresponding to the entities K1 and L1, is equal to the second distance, between the first mapping vector corresponding to the entity K2 and the second mapping vector corresponding to the entity L2, this indicates that the relationship X also exists between the entities K2 and L2, and the triple "K2-X-L2" is taken as a triple in the fused medical knowledge graph.
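The distance comparison in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the Euclidean distance, the helper names `distance` and `infer_relation`, the vector values, and the threshold of 0.01 are all assumptions made for the example and are not specified by the embodiment.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def infer_relation(first_distance, second_distance, diff_threshold):
    """True when the larger distance minus the smaller is below the threshold."""
    return abs(first_distance - second_distance) < diff_threshold

# Hypothetical vectors in the reference vector space.
k1, l1 = [0.0, 0.0], [1.0, 0.0]    # pre-fusion pair linked by relationship X
k2, l2 = [0.0, 2.0], [1.0, 2.05]   # the other entities of the two triples

first = distance(k1, l1)    # distance for the pre-fusion entity pair
second = distance(k2, l2)   # distance for the counterpart entities

fused_triples = []
if infer_relation(first, second, diff_threshold=0.01):
    # Relationship X is carried over to the counterpart entities.
    fused_triples.append(("K2", "X", "L2"))
```

A looser threshold admits more inferred relationships at the cost of more false positives, so in practice it would have to be tuned on the data at hand.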

According to the embodiment of the invention, the first distance and the second distance are obtained, and whether the head entity in the first instance corresponding to the pre-fusion entity pair has a relationship with the head entity in the second instance corresponding to the pre-fusion entity pair is judged according to whether the difference between the first distance and the second distance is smaller than the preset difference threshold. In this way, relationships that do not exist in either the first medical knowledge graph or the second medical knowledge graph alone can be found, old knowledge can be updated or new knowledge supplemented, and the knowledge is expanded. The resulting fused medical knowledge graph is richer in knowledge, more comprehensive, and of greater practical application value.

Based on the content of the foregoing embodiments, the specific step of fusing the knowledge in the first medical knowledge graph and the second medical knowledge graph according to the first mapping vectors and the second mapping vectors further includes: a second mapping vector closest in distance to the first mapping vector corresponding to any entity in the first medical knowledge-graph is obtained.

Specifically, for any first mapping vector, the distance between the first mapping vector and each second mapping vector is obtained.

And if the distance between the first mapping vector corresponding to any entity and the nearest second mapping vector is judged to be smaller than a preset distance threshold, fusing any entity and the entity in the second medical knowledge graph corresponding to the nearest second mapping vector into the same entity in the fused medical knowledge graph.

Specifically, after the distance between the first mapping vector and each second mapping vector is obtained, it may be determined whether the minimum value is smaller than a preset distance threshold.

If so, the entity corresponding to the first mapping vector and the entity corresponding to the second mapping vector corresponding to the minimum value have higher similarity, and are considered to have the same meaning and to be equivalent entities. And fusing the equivalent entities into the same entity in the fused medical knowledge graph.
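The nearest-neighbor test described above can be illustrated with the following minimal sketch. The function name `align_entities`, the Euclidean metric, the entity names, the vectors, and the threshold value are all hypothetical and chosen only to make the example runnable.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def align_entities(first_vectors, second_vectors, dist_threshold):
    """Pair each first-graph entity with its nearest second-graph entity
    when the minimum distance is below the preset distance threshold."""
    equivalents = {}
    for name1, vec1 in first_vectors.items():
        nearest, best = None, math.inf
        for name2, vec2 in second_vectors.items():
            d = distance(vec1, vec2)
            if d < best:
                nearest, best = name2, d
        if nearest is not None and best < dist_threshold:
            equivalents[name1] = nearest
    return equivalents

# Hypothetical mapping vectors in the reference vector space.
first_vectors = {"dyspnea": [0.9, 0.1], "salbutamol": [0.1, 0.8]}
second_vectors = {"shortness of breath": [0.88, 0.12], "spirometry": [0.5, 0.5]}

pairs = align_entities(first_vectors, second_vectors, dist_threshold=0.1)
```

In this toy data only "dyspnea" has a neighbor within the threshold, so "salbutamol" stays unfused even though it still has a nearest second mapping vector.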

It should be noted that the knowledge in knowledge graphs for different diseases overlaps, so graph fusion can enlarge the scale of the graphs and broaden their applicability. Because the data sources differ, different terms may be used to express the same meaning, and graph fusion can eliminate this ambiguity. In addition, knowledge graph fusion can also infer relationships that do not exist in any of the individual graphs, so that the knowledge is expanded.

According to the embodiment of the invention, whether the entity corresponding to a first mapping vector and the entity corresponding to its nearest second mapping vector are equivalent entities is judged according to whether the minimum distance between them is smaller than the preset distance threshold. Equivalent entities can thus be found more accurately, a better fused medical knowledge graph is obtained, and the fused medical knowledge graph is more comprehensive and of greater practical application value.

Based on the content of the foregoing embodiments, after determining that, in the fused medical knowledge graph, the relationship between the two tail entities in the pre-fusion entity pair exists between the head entity in the first instance corresponding to the pre-fusion entity pair and the head entity in the second instance corresponding to the pre-fusion entity pair, the method further includes: determining the head entity in the first instance corresponding to the pre-fusion entity pair and the head entity in the second instance corresponding to the pre-fusion entity pair as a new pre-fusion entity pair.

Specifically, an iterative fusion method may be adopted, and an equivalent entity or two entities with a relationship found in each iteration is used as a new pre-fusion entity pair.

And adding a first mapping vector and a second mapping vector respectively corresponding to two head entities in the new pre-fusion entity pair into the reference vector set to obtain a new reference vector set.

Specifically, a first mapping vector and a second mapping vector respectively corresponding to two entities in each new pre-fusion entity pair are used as new reference vectors, and are added to a reference vector set to obtain a new reference vector set.

After obtaining the new reference vector set, the step S102 is executed again according to the new reference vector set.

The number of iterations may be set according to the actual situation, and iteration may continue until no further equivalent entities or pairs of related entities are found.
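The control flow of this iteration can be sketched as follows. The sketch tracks entity pairs rather than vectors for brevity, and `iterate_fusion`, `find_new_pairs`, `toy_finder`, and the cap of ten iterations are hypothetical names and values; `find_new_pairs` stands in for re-running the mapping and fusion steps with the enlarged reference set.

```python
def iterate_fusion(reference_pairs, find_new_pairs, max_iterations=10):
    """Grow the reference set until an iteration finds nothing new
    or the iteration cap is reached."""
    reference = set(reference_pairs)
    for _ in range(max_iterations):
        new_pairs = set(find_new_pairs(reference)) - reference
        if not new_pairs:
            break  # no further equivalent or related entity pairs found
        reference |= new_pairs
    return reference

def toy_finder(reference):
    """Hypothetical stand-in: each known pair (Ki, Li) reveals (Ki+1, Li+1)
    up to index 3."""
    found = set()
    for k, l in reference:
        i = int(k[1:])
        if i < 3:
            found.add((f"K{i + 1}", f"L{i + 1}"))
    return found

result = iterate_fusion({("K1", "L1")}, toy_finder)
```

With this toy finder the loop stops on its own after the third pass, because no pair beyond index 3 is ever produced.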

It should be noted that after each iteration, or after all iterations are complete, F-Measure may be used to evaluate the fusion result.

Here, F is the evaluation value of the fusion result; P is the precision of the fusion, i.e., the proportion of correctly predicted samples among all predicted samples (the found equivalent entities and pairs of related entities); and R is the recall, i.e., the proportion of all positive samples that are predicted as positive.

In general, precision and recall constrain each other, so the F-Measure is used to evaluate the model comprehensively; the higher the F-Measure, the better the performance.
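The F-Measure formula itself is not reproduced in this text. As an illustration only, the sketch below assumes the common balanced form, F = 2PR / (P + R); the function name `f_measure` and the sample pairs are hypothetical.

```python
def f_measure(predicted, gold):
    """Return (precision, recall, F) for predicted versus gold pairs,
    assuming the balanced F-Measure F = 2PR / (P + R)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical fusion output and hand-checked ground truth.
predicted = {("K1", "L1"), ("K2", "L2"), ("K3", "L9")}
gold = {("K1", "L1"), ("K2", "L2"), ("K4", "L4"), ("K5", "L5")}

p, r, f = f_measure(predicted, gold)
```

Two of the three predicted pairs are correct and two of the four gold pairs are recovered, so precision is 2/3, recall is 1/2, and the balanced F-Measure is 4/7.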

The embodiment of the invention fully utilizes the internal characteristics of the knowledge graph by iterative fusion, can improve the fusion efficiency, can obtain a better fused medical knowledge graph, and has more comprehensive and practical application value.

Based on the content of the foregoing embodiments, before performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector, the method further includes: acquiring each instance in the first medical knowledge graph from the first medical data source using a bidirectional long short-term memory network combined with a conditional random field, and constructing the first medical knowledge graph from those instances.

Specifically, in order to facilitate knowledge representation learning and knowledge fusion of the knowledge graph, each instance in the first medical knowledge graph can be acquired from the first medical data source using a bidirectional long short-term memory network combined with a conditional random field, and the first medical knowledge graph is constructed from those instances.

Entities and relationships can be jointly extracted from the first medical data source through a joint learning model that combines a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF). The task of joint entity and relationship extraction is to have the model automatically detect entities in unstructured text and simultaneously extract the semantic relationships among them, where the relationships are predefined entity relationship types. The extracted entities and relationships form knowledge triples, so the triples are modeled directly. The entity labels defined in the embodiment of the present invention include: numerical value, medical history, symptom, diagnosis, drug name, drug class, scale, regimen, and other. The relationship types include: COPD stratified treatment grouped into, and, including, to medication regimen, to drug, patient condition, inference condition, recommended basal medication, recommended escalation medication, and regimen including medication category.

First, the data in the first medical data source is labeled. According to the defined entity and entity relationship types, the label is divided into three parts. Entity words follow the BIOES labeling specification: B denotes the first character of an entity, I denotes a middle character, E denotes the last character, S denotes a single-character entity, and O denotes a character that does not belong to any entity. Entity relationship: encoded and labeled according to the entity relationship types defined in advance. Role information of the entity: divided into entity 1 and entity 2.
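The BIOES part of the labeling can be illustrated with the short sketch below. It covers only the entity-boundary tags, not the relationship code or the role information; the function name `bioes_tags`, the span format, and the placeholder sentence are assumptions made for the example.

```python
def bioes_tags(sentence, entities):
    """Tag each character of `sentence` with a BIOES label.

    `entities` is a list of (start, end_exclusive, label) character spans
    (a hypothetical input format)."""
    tags = ["O"] * len(sentence)
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = f"S-{label}"        # single-character entity
        else:
            tags[start] = f"B-{label}"        # first character
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # middle characters
            tags[end - 1] = f"E-{label}"      # last character
    return tags

# Placeholder characters: "ABC" stands for a symptom, "D" for a drug name.
sentence = "xxABCxD"
tags = bioes_tags(sentence, [(2, 5, "symptom"), (6, 7, "drug_name")])
```

Here the three-character span receives B, I and E tags, the one-character span receives an S tag, and all remaining characters are tagged O.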

After labeling, each sentence is taken as a unit and written as a sequence of characters: x = (x₁, x₂, …, x_n). The first layer of the BiLSTM-CRF model is an embedding (look-up) layer, which uses a pre-trained or randomly initialized embedding matrix to map each character x_i (1 ≤ i ≤ n) of the input sentence to a low-dimensional dense character vector that represents the semantic information of the character in different dimensions. Meanwhile, to alleviate overfitting, dropout is applied before the input to the next layer.

The second layer of the BiLSTM-CRF model is a bidirectional LSTM layer. The character vectors generated above are used as the inputs of the bidirectional LSTM at each time step, and the hidden-state sequence output by the forward LSTM and the hidden-state sequence output by the backward LSTM are concatenated position by position to obtain the complete hidden-state sequence. The output of the bidirectional LSTM layer at each time step is a vector of values over the labels, and the outputs for the whole sentence are denoted as the matrix P = (p₁, p₂, …, p_n) ∈ R^{n×k}. Each character could be classified by applying softmax to its row of P, but this uses only the information at the current position and ignores the labels assigned in its context, so a CRF layer is attached next.

The third layer of the BiLSTM-CRF model is a CRF layer. The parameter of the CRF layer is a (k+2)×(k+2) matrix A, whose element A_ij is the transition score from the i-th label to the j-th label, so that previously assigned labels can be used when labeling a position. The 2 is added because a start state is added at the beginning of the sentence and an end state is added at the end of the sentence. If a label sequence whose length equals the sentence length is written as y = (y₁, y₂, …, y_n), then the model's score for assigning the label sequence y to the sentence x is

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i}$$

where y₀ and y_{n+1} denote the start state and the end state.

The score of the entire sequence is equal to the sum of the scores of the positions, with the score for each position determined partly by the LSTM output and partly by the transition matrix A of the CRF. Further, the normalized probability can be obtained by using softmax:

$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$

The model is trained by maximizing the log-likelihood function. The log-likelihood of a training sample (x, y^x) is given by the following formula:

$$\log P(y^x \mid x) = \mathrm{score}(x, y^x) - \log\Big(\sum_{y'} \exp(\mathrm{score}(x, y'))\Big)$$

During prediction, the model uses the Viterbi algorithm, a dynamic-programming method, to solve for the optimal path:

$$y^* = \arg\max_{y'} \mathrm{score}(x, y')$$

Through the above steps, tuples in the form of "entity-relation-entity" can be extracted from the first medical data source as instances, and the first medical knowledge graph is formed from the extracted instances.
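The CRF path score and the Viterbi decoding described above can be sketched in plain Python. This is a minimal illustration rather than the embodiment's implementation: the score is taken to be the sum of the per-position LSTM outputs and the transition scores including the start and end states, which is the usual BiLSTM-CRF form and is assumed here. The names `path_score` and `viterbi`, the index convention (start state at index k, end state at index k+1), and all numbers are hypothetical.

```python
import itertools

def path_score(P, A, y):
    """Sum of per-position scores P[i][y_i] and transition scores A[a][b]
    along the path start -> y_1 -> ... -> y_n -> end."""
    k = len(P[0])
    start, end = k, k + 1
    path = [start] + list(y) + [end]
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    return emission + transition

def viterbi(P, A):
    """Highest-scoring label path, found by dynamic programming."""
    n, k = len(P), len(P[0])
    start, end = k, k + 1
    best = [A[start][j] + P[0][j] for j in range(k)]
    back = []
    for i in range(1, n):
        pointers, scores = [], []
        for j in range(k):
            prev = max(range(k), key=lambda t: best[t] + A[t][j])
            pointers.append(prev)
            scores.append(best[prev] + A[prev][j] + P[i][j])
        back.append(pointers)
        best = scores
    last = max(range(k), key=lambda t: best[t] + A[t][end])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy scores for n = 3 positions and k = 2 labels (hypothetical values).
P = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.1]]
A = [[0.5, -0.2, 0.0, 0.1],   # from label 0
     [-0.3, 0.4, 0.0, 0.0],   # from label 1
     [0.2, 0.0, 0.0, 0.0],    # from the start state
     [0.0, 0.0, 0.0, 0.0]]    # from the end state (unused)

decoded = viterbi(P, A)
# Exhaustive search over all k**n paths, used only as a sanity check.
brute_force = max(itertools.product(range(2), repeat=3),
                  key=lambda y: path_score(P, A, y))
```

The exhaustive search is feasible only for toy sizes; the dynamic program reaches the same path in time proportional to n·k² instead of kⁿ.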

According to the embodiment of the invention, the first medical knowledge graph is constructed using a bidirectional long short-term memory network combined with a conditional random field, so that a better first medical knowledge graph can be obtained, and a better fused medical knowledge graph can in turn be obtained by fusing the first medical knowledge graph and the second medical knowledge graph.

Based on the content of the foregoing embodiments, before performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector, the method further includes: acquiring each instance in the second medical knowledge graph from the second medical data source using a bidirectional long short-term memory network combined with a conditional random field, and constructing the second medical knowledge graph from those instances.

It should be noted that, in order to facilitate knowledge representation learning and knowledge fusion of the knowledge graph, each instance in the second medical knowledge graph may be acquired from the second medical data source using a bidirectional long short-term memory network combined with a conditional random field, and the second medical knowledge graph may be constructed from each instance in the second medical knowledge graph.

The steps of constructing the second medical knowledge graph from the second medical data source using the bidirectional long short-term memory network combined with the conditional random field are similar to the steps of constructing the first medical knowledge graph from the first medical data source. Reference may be made to the above embodiment of constructing the first medical knowledge graph, and the details are not repeated here.

According to the embodiment of the invention, the second medical knowledge graph is constructed using a bidirectional long short-term memory network combined with a conditional random field, so that a better second medical knowledge graph can be obtained, and a better fused medical knowledge graph can in turn be obtained by fusing the first medical knowledge graph and the second medical knowledge graph.

Fig. 2 is a schematic structural diagram of a medical knowledge-map fusion device based on multiple data sources according to an embodiment of the present invention. Based on the content of the above embodiments, as shown in fig. 2, the apparatus includes a vector representation module 201, a vector mapping module 202, and a knowledge fusion module 203, where:

a vector representation module 201, configured to perform knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph, respectively, to obtain first initial vectors and second initial vectors;

a vector mapping module 202, configured to map, based on a reference vector set obtained in advance, each first initial vector and each second initial vector into a reference vector space, so as to obtain each first mapping vector and each second mapping vector;

the knowledge fusion module 203 is configured to fuse knowledge in the first medical knowledge graph and the second medical knowledge graph according to the first mapping vectors and the second mapping vectors, and acquire a fused medical knowledge graph;

wherein the first initial vector is a vector representation result of an entity or a relation in the first medical knowledge map in a first vector space; a second initial vector representing a result of a vector representation in a second vector space of an entity or relationship in a second medical knowledge-graph; a reference vector, which is a vector representation result of an entity in the reference knowledge graph in a reference vector space; the first medical knowledge-graph, the second medical knowledge-graph and the reference knowledge-graph are constructed based on different medical data sources.

Specifically, the vector representation module 201 separately vectorizes each instance in the first medical knowledge graph and each instance in the second medical knowledge graph based on the knowledge representation learning method, to obtain each first initial vector representing an entity or a relationship in the first medical knowledge graph in the first vector space and each second initial vector representing an entity or a relationship in the second medical knowledge graph in the second vector space.

The vector mapping module 202 may perform bidirectional supervised training on each first initial vector and each second initial vector according to the reference vector set, map each first initial vector and each second initial vector into the reference vector space, and obtain a first mapping vector corresponding to each first initial vector and a second mapping vector corresponding to each second initial vector.

The knowledge fusion module 203 may determine, based on a distance between the first mapping vector and the second mapping vector, whether meanings of an entity in the first medical knowledge graph corresponding to the first mapping vector and an entity in the second medical knowledge graph corresponding to the second mapping vector are the same or whether an inclusion relationship exists, and determine the entity in the first medical knowledge graph and the entity in the second medical knowledge graph having the same meanings as an entity pair to be fused; and according to the determined entity pair to be fused, fusing the entity in the first medical knowledge graph and the entity in the second medical knowledge graph to obtain the fused medical knowledge graph.

The medical knowledge graph fusion device based on multiple data sources provided by the embodiment of the invention is used for executing the medical knowledge graph fusion method based on multiple data sources provided by each embodiment of the invention, and the specific method and process for realizing the corresponding functions of each module included in the medical knowledge graph fusion device based on multiple data sources are described in the embodiment of the medical knowledge graph fusion method based on multiple data sources, and are not described again here.

The medical knowledge-map fusion device based on multiple data sources is used for the medical knowledge-map fusion method based on multiple data sources of the previous embodiments. Therefore, the descriptions and definitions in the medical knowledge-graph fusion method based on multiple data sources in the foregoing embodiments can be used for understanding the execution modules in the embodiments of the present invention.

Fig. 3 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. Based on the content of the above embodiment, as shown in fig. 3, the electronic device may include: a processor (processor)301, a memory (memory)302, and a bus 303; wherein, the processor 301 and the memory 302 complete the communication with each other through the bus 303; the processor 301 is configured to invoke computer program instructions stored in the memory 302 and executable on the processor 301 to perform the multi-data source based medical knowledge-graph fusion method provided by the above-described method embodiments, including, for example: performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector; mapping each first initial vector and each second initial vector into a reference vector space based on a pre-obtained reference vector set to obtain each first mapping vector and each second mapping vector; fusing knowledge in the first medical knowledge map and the second medical knowledge map according to the first mapping vectors and the second mapping vectors to obtain a fused medical knowledge map; wherein the first initial vector is a vector representation result of an entity or a relation in the first medical knowledge map in a first vector space; a second initial vector representing a result of a vector representation in a second vector space of an entity or relationship in a second medical knowledge-graph; a reference vector, which is a vector representation result of an entity in the reference knowledge graph in a reference vector space; the first medical knowledge-graph, the second medical knowledge-graph and the reference knowledge-graph are constructed based on different medical data sources.

Another embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which when executed by a computer, enable the computer to perform a method for medical knowledge-graph fusion based on multiple data sources, as provided by the above-mentioned method embodiments, for example, the method comprising: performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector; mapping each first initial vector and each second initial vector into a reference vector space based on a pre-obtained reference vector set to obtain each first mapping vector and each second mapping vector; fusing knowledge in the first medical knowledge map and the second medical knowledge map according to the first mapping vectors and the second mapping vectors to obtain a fused medical knowledge map; wherein the first initial vector is a vector representation result of an entity or a relation in the first medical knowledge map in a first vector space; a second initial vector representing a result of a vector representation in a second vector space of an entity or relationship in a second medical knowledge-graph; a reference vector, which is a vector representation result of an entity in the reference knowledge graph in a reference vector space; the first medical knowledge-graph, the second medical knowledge-graph and the reference knowledge-graph are constructed based on different medical data sources.

Furthermore, the logic instructions in the memory 302 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or make a contribution to the prior art, or may be implemented in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Another embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method for multi-data source based medical knowledge-map fusion provided by the above method embodiments, for example, the method comprising: performing knowledge representation learning based on the first medical knowledge graph and the second medical knowledge graph respectively to obtain each first initial vector and each second initial vector; mapping each first initial vector and each second initial vector into a reference vector space based on a pre-obtained reference vector set to obtain each first mapping vector and each second mapping vector; fusing knowledge in the first medical knowledge map and the second medical knowledge map according to the first mapping vectors and the second mapping vectors to obtain a fused medical knowledge map; wherein the first initial vector is a vector representation result of an entity or a relation in the first medical knowledge map in a first vector space; a second initial vector representing a result of a vector representation in a second vector space of an entity or relationship in a second medical knowledge-graph; a reference vector, which is a vector representation result of an entity in the reference knowledge graph in a reference vector space; the first medical knowledge-graph, the second medical knowledge-graph and the reference knowledge-graph are constructed based on different medical data sources.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. It is understood that the above-described technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method of the above-described embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A medical knowledge graph fusion method based on multiple data sources is characterized by comprising the following steps:

2. The method for medical knowledge-graph fusion based on multiple data sources of claim 1, wherein said mapping each of said first initial vectors and each of said second initial vectors into a reference vector space based on a set of pre-obtained reference vectors, and wherein the specific steps of obtaining each of said first mapping vectors and each of said second mapping vectors comprise:

3. The method of claim 2, wherein the step of fusing knowledge in the first and second medical knowledge-maps according to the first and second mapping vectors comprises:

4. The multiple data source-based medical knowledge-graph fusion method of claim 3, wherein the specific step of fusing knowledge in the first medical knowledge-graph and the second medical knowledge-graph according to the first mapping vectors and the second mapping vectors further comprises:

5. The method of claim 3, wherein the determining the relationship between the head entity in the first instance corresponding to the pre-fused entity pair and the head entity in the second instance corresponding to the pre-fused entity pair in the fused medical knowledge-graph, and the two tail entities in the pre-fused entity pair further comprises:

6. The method for multiple data source-based medical knowledge-graph fusion as claimed in any one of claims 1 to 5, wherein the learning of knowledge representation based on the first medical knowledge-graph and the second medical knowledge-graph respectively, before obtaining each first initial vector and each second initial vector further comprises:

7. The method for multiple data source-based medical knowledge-graph fusion as claimed in any one of claims 1 to 5, wherein the learning of knowledge representation based on the first medical knowledge-graph and the second medical knowledge-graph respectively, before obtaining each first initial vector and each second initial vector further comprises:

8. A medical knowledge-graph fusion apparatus based on multiple data sources, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the multiple data source based medical knowledge-graph fusion method of any one of claims 1 to 7.

10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the multiple data source based medical knowledge-graph fusion method according to any one of claims 1 to 7.