CN117010494B

CN117010494B - Medical data generation method and system based on causal representation learning

Info

Publication number: CN117010494B
Application number: CN202311257598.8A
Authority: CN
Inventors: 池胜强; 李劲松; 朱伟伟; 王丰; 田雨
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-09-27
Filing date: 2023-09-27
Publication date: 2024-01-05
Anticipated expiration: 2043-09-27
Also published as: CN117010494A

Abstract

The invention discloses a medical data generation method and system based on causal representation learning. The method comprises the following steps: acquiring a medical knowledge graph; reconstructing patient medical data into a patient data map according to the medical knowledge graph; obtaining an adjacency matrix and a node initial embedded representation set from the patient data map, and obtaining a patient characterization through a graph encoder; decoupling the patient data map into a patient causal feature map and a patient confusion feature map by applying an attention mechanism to the patient characterization; generating a patient reconstructed causal feature map and a patient synthetic causal feature map from the patient causal feature map and the patient confusion feature map, and constructing and training a medical data generation model based on adversarial learning, so that the patient causal feature map and the patient synthetic causal feature map cannot be distinguished while the patient causal feature map and the patient reconstructed causal feature map are as similar as possible; and inputting a target patient data map into the trained medical data generation model to obtain generated medical data.

Description

Medical data generation method and system based on causal representation learning

Technical Field

The invention belongs to the technical field of medical health information, and particularly relates to a medical data generation method and system based on causal representation learning.

Background

Medical data generation refers to using technologies such as machine learning and artificial intelligence to generate new medical data with characteristics similar to those of existing medical data, thereby increasing the scale and diversity of a data set and improving the utilization value of medical data. Medical data generation technology can be widely applied in fields such as medical research, medical care management and medical decision support, and provides more data support for medical science and medical services.

Medical data generation models can be divided into two main categories: rule-based models and artificial intelligence-based models.

Rule-based models generate data from rules that experts build manually from domain knowledge. Although this approach can produce high-quality data, it relies heavily on domain experts, and the rules have limited expressive power and handle complex relationships poorly, which limits the practical application of rule-based models.

Artificial intelligence-based models mainly use algorithms such as deep learning to generate new data from existing medical data. Their advantage is that they automatically learn the patterns and regularities in existing data in order to generate medical data. Among them, methods based on generative adversarial networks (GANs) have been widely used in recent years. The main principle of a GAN is to build and train two networks: one generates medical data, and the other tries to distinguish real data from generated data. The two networks compete with each other, which improves the quality of the generated data. At present, GAN-based methods that perform well for medical data generation include medGAN, medBGAN, WGAN and EMR-WGAN. medGAN adds an autoencoder to the GAN model to handle discrete features, thereby improving the generation of discrete features in medical data. medBGAN replaces the discriminator loss function in medGAN with a boundary-seeking loss function to improve the quality of the generated discrete data. WGAN builds on medGAN by incorporating the Wasserstein distance into the training objective and imposing a Lipschitz constraint on the discriminator, so that the discriminator describes the distance between distributions more accurately. EMR-WGAN builds on WGAN by introducing layer normalization in the discriminator to improve learning performance, and combines a gradient penalty strategy to enforce the Lipschitz constraint, thereby reducing the effect of parameter clipping on how well the GAN fits the data distribution.

Although artificial intelligence-based models are continually improving their ability to generate high-quality data, the complexity and quality problems of medical data itself mean that the generated data still deviates from the real data in data distribution, attribute distribution, correlation and so on, which affects the reliability of the data. Furthermore, medical data involves relatively complex domain knowledge and terminology, and requires understanding and analysis across a number of different domains. Existing data generation models do not fully account for this complexity and make little use of knowledge in the medical field, so some of the medical data they generate does not conform to medical common sense.

Disclosure of Invention

In view of the above, the invention provides a medical data generation method and system based on causal representation learning.

In a first aspect, the present invention provides a method of medical data generation based on causal representation learning, the method comprising:

acquiring a medical knowledge graph;

extracting patient medical data, and reconstructing the patient medical data into a patient data map according to the medical knowledge graph;

obtaining an adjacency matrix and a node initial embedded representation set from the patient data map, and inputting the adjacency matrix and the node initial embedded representation set to a graph encoder to obtain a patient characterization; decoupling the patient data map into a patient causal feature map and a patient confusion feature map by applying an attention mechanism to the patient characterization;

constructing and training a medical data generation model, comprising: generating a patient reconstructed causal feature map and a patient synthetic causal feature map according to the patient causal feature map and the patient confusion feature map; configuring an embedded representation discriminator and a causal feature map discriminator, and alternately training the embedded representation discriminator and the causal feature map discriminator based on adversarial learning, such that the causal feature map discriminator cannot distinguish the patient causal feature map from the patient synthetic causal feature map, while the patient causal feature map and the patient reconstructed causal feature map are made as similar as possible;

and decoupling the target patient data map into a target patient causal feature map and a target patient confusion feature map, and inputting the target patient causal feature map and the target patient confusion feature map into a trained medical data generation model to obtain generated medical data.

In a second aspect, embodiments of the present invention also provide a medical data generation system based on causal representation learning, the system comprising:

the medical knowledge graph acquisition module is used for acquiring a medical knowledge graph;

the patient data map construction module is used for extracting patient medical data and reconstructing the patient medical data into a patient data map according to the medical knowledge graph;

the patient data map decoupling module is used for obtaining an adjacency matrix and a node initial embedded representation set from the patient data map, and inputting the adjacency matrix and the node initial embedded representation set to a graph encoder to obtain a patient characterization; and for decoupling the patient data map into a patient causal feature map and a patient confusion feature map by applying an attention mechanism to the patient characterization;

the medical data generation model training module is used for constructing and training a medical data generation model, comprising: generating a patient reconstructed causal feature map and a patient synthetic causal feature map according to the patient causal feature map and the patient confusion feature map; configuring an embedded representation discriminator and a causal feature map discriminator, and alternately training the embedded representation discriminator and the causal feature map discriminator based on adversarial learning, such that the causal feature map discriminator cannot distinguish the patient causal feature map from the patient synthetic causal feature map, while the patient causal feature map and the patient reconstructed causal feature map are made as similar as possible;

and the medical data generation module is used for decoupling the target patient data map into a target patient causal feature map and a target patient confusion feature map, and inputting the target patient causal feature map and the target patient confusion feature map into a trained medical data generation model to obtain generated medical data.

In a third aspect, embodiments of the present invention further provide an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to implement the above medical data generation method based on causal representation learning.

In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the causal representation learning based medical data generating method described above.

Compared with the prior art, the invention has the following beneficial effects: the invention provides a medical data generation method based on causal representation learning, in which a patient data map is decoupled into a patient causal feature map and a patient confusion feature map, a medical data generation model is constructed, and a trained medical data generation model is obtained through alternating training based on adversarial learning, so that the reconstructed causal feature map output by the medical data generation model is as similar as possible to the target patient causal feature map. The complex domain knowledge and terminology of medical data are thereby fully considered, the deviation between generated medical data and real medical data is reduced, and the quality of the generated medical data is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

FIG. 1 is a schematic diagram of a medical data generation method based on causal representation learning provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of causal representation learning provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a generative adversarial network provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a medical data generation system based on causal representation learning provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The features of the following examples and embodiments may be combined with each other without any conflict.

The invention discloses a medical data generation method based on causal representation learning. First, a medical knowledge graph is constructed using knowledge in the medical field. Next, the medical data of a patient is converted into a patient data map using the medical knowledge graph. A causal feature map and a confusion feature map are then extracted from the patient data map using a causal representation learning method based on a graph neural network. The causal feature map is used as the input for medical data generation, which introduces medical knowledge into the generator and constrains the generation of medical data. Meanwhile, the confusion feature map is used as the prior distribution for medical data generation, which avoids the mode collapse problem caused by using a fixed prior. In addition, the causal feature map is fed to the generative adversarial network as the real data, which reduces the influence of low-quality medical data on the reliability of the generated data. Finally, the structure of the generative adversarial network is optimized: a two-level adversarial structure is established and a data reconstruction loss is added to ensure the authenticity of the generated medical data. The method provided by the invention directly targets the generation of graph-structured medical data and uses medical knowledge as an internal constraint on data generation, which can increase the probability that the generated data conforms to clinical common sense.

As shown in fig. 1, the method comprises the steps of:

S1, constructing a medical knowledge graph.

Specifically, in this example, medical knowledge is obtained from SNOMED-CT, HPO and medical literature, and the entities in the medical knowledge, which include diseases, drugs, symptoms and the like, are identified by named entity recognition techniques. Relationships between entities, such as relationships between diseases and symptoms or between drugs and diseases, are extracted from the medical knowledge by relation extraction techniques. Entities and relationships are represented as computer-understandable triples <entity, relationship, entity>. Medical knowledge from different sources is fused, and duplicates and contradictions are eliminated, to obtain the medical knowledge graph. Further, the constructed medical knowledge graph is stored in a database for the subsequent construction of patient data maps.
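As a minimal sketch of the triple representation and fusion described above, the following Python fragment merges triples from two sources and indexes them by head entity. The entity names, relation names and helper functions are hypothetical and chosen only for illustration; the sketch removes exact duplicates but does not attempt the contradiction resolution mentioned in the text.

```python
from collections import defaultdict

# Hypothetical triples as they might come out of relation extraction;
# all entity and relation names are illustrative assumptions.
source_a = [
    ("anemia", "has_symptom", "fatigue"),
    ("anemia", "has_finding", "low hemoglobin"),
]
source_b = [
    ("anemia", "has_symptom", "fatigue"),   # duplicate of a triple in source_a
    ("iron supplement", "treats", "anemia"),
]


def fuse_triples(*sources):
    """Merge triples from several sources, dropping exact duplicates
    while keeping first-seen order."""
    seen, fused = set(), []
    for source in sources:
        for triple in source:
            if triple not in seen:
                seen.add(triple)
                fused.append(triple)
    return fused


def build_graph(triples):
    """Index triples by head entity: head -> list of (relation, tail)."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


knowledge_graph = build_graph(fuse_triples(source_a, source_b))
```

In practice the fused triples would be persisted in a graph or relational database, as the text notes, rather than held in memory.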

S2, extracting patient medical data, and reconstructing the patient medical data into a patient data map according to the medical knowledge graph.

Specifically, patient medical data including patient basic information, disease diagnosis, symptom signs, laboratory tests, surgical medications, follow-up, and the like is extracted from the hospital information system. Wherein, the patient basic information includes patient age, patient sex, etc.

Features in the patient medical data are normalized using a set of standard medical terms, so that they correspond to unified data encodings, medical terms and numerical units.

Based on the medical knowledge graph, the medical concepts and relationships implied by the features in the patient medical data are extracted, and the patient data map is constructed.

For example, the lower limit of the normal range for adult female hemoglobin concentration is 110 g/L, so a patient hemoglobin concentration of 93 g/L is below normal, and the medical concept "low hemoglobin" is therefore included in the patient data map. If the patient uses a steroid, the patient data map contains the medical concept "steroid"; if the patient does not use a steroid, the patient data map does not contain "steroid".

It should be noted that extracting the medical concepts and relationships implied by the features in the patient medical data includes: extracting, from the medical knowledge graph, the medical concepts corresponding to the features in the patient medical data, the medical concepts directly connected to them, and the relationships between each such pair of medical concepts, to obtain the patient data map. Reconstructing the patient data into a patient data map using the medical knowledge graph enriches the semantic information of high-dimensional, sparse medical data and supplies medical knowledge to the subsequent medical data generation process.
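The construction of a patient data map can be sketched as follows. This is an illustrative fragment only: the record fields, the knowledge-graph triples and the function names are assumptions, and only the hemoglobin threshold and the steroid rule are taken from the example in the text.

```python
# Hypothetical knowledge-graph triples (head, relation, tail).
KG_TRIPLES = [
    ("low hemoglobin", "finding_of", "anemia"),
    ("steroid", "may_cause", "hyperglycemia"),
    ("fatigue", "symptom_of", "anemia"),
]

# Lower limit of the normal range for adult females, as in the example above.
HEMOGLOBIN_LOWER_LIMIT_G_PER_L = 110


def concepts_from_record(record):
    """Map normalized patient features to medical concepts."""
    concepts = []
    hemoglobin = record.get("hemoglobin_g_per_l")
    if hemoglobin is not None and hemoglobin < HEMOGLOBIN_LOWER_LIMIT_G_PER_L:
        concepts.append("low hemoglobin")
    if record.get("uses_steroid"):
        concepts.append("steroid")
    return concepts


def patient_data_map(record, triples):
    """Keep the patient's concepts, their directly connected concepts,
    and the relationships between them."""
    seeds = set(concepts_from_record(record))
    edges = [t for t in triples if t[0] in seeds or t[2] in seeds]
    nodes = sorted(seeds | {t[0] for t in edges} | {t[2] for t in edges})
    return nodes, edges


nodes, edges = patient_data_map(
    {"hemoglobin_g_per_l": 93, "uses_steroid": False}, KG_TRIPLES
)
```

For this record the map contains "low hemoglobin" and its neighbor "anemia", and it omits "steroid" because the patient does not use one.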

S3, obtaining an adjacency matrix and a node initial embedded representation set from the patient data map, and inputting the adjacency matrix and the node initial embedded representation set to a graph encoder to obtain a patient characterization; decoupling the patient data map into a patient causal feature map and a patient confusion feature map by applying an attention mechanism to the patient characterization.

Step S3 decouples the patient data map into the patient causal feature map and the patient confusion feature map based on causal representation learning. The patient causal feature map is a map of the key feature information that reflects the patient's basic information, the clinical manifestations of disease, the occurrence and development of disease, treatment schemes and efficacy evaluation. The patient confusion feature map is a map of noise information that is irrelevant to the diagnosis and treatment of the patient's disease and may result from medical data quality issues. Eliminating the confounding information in real medical data reduces the influence of low-quality medical data on the reliability of the generated data.

Specifically, as shown in fig. 2, the step S3 specifically includes the following substeps:

S301, acquiring an adjacency matrix $A$ and a node initial embedded representation set $X'$ from the patient data map $G = \{A, X'\}$; inputting the adjacency matrix $A$ and the node initial embedded representation set $X'$ to a graph neural network-based encoder $f(\cdot)$ to obtain the node embedded representation set $X = f(A, X')$ of the patient data map; and taking the adjacency matrix and the node embedded representation set of the patient data map as the patient characterization.

Specifically, the initial embedded representation of each node in the patient data map is the one-hot encoded vector of its medical concept, and the adjacency matrix records the relationships between the nodes in the patient data map. The node initial embedded representation set $X'$ and the adjacency matrix $A$ are the inputs to the graph encoder $f(\cdot)$. The graph encoder is implemented based on a graph convolutional neural network, in which nodes integrate the information of their neighbor nodes by passing messages along the edges of the graph. The graph encoder may be composed of multiple graph convolution layers. The design of the graph convolution layers must take into account the multiple relationship types in the patient data map, which can be handled in two ways: selecting graph convolution layers, such as those of a multi-relational graph neural network, that can directly process a heterogeneous graph containing multiple relationships, or treating the heterogeneous graph as a homogeneous graph. In order to learn information between multi-hop nodes while avoiding the over-smoothing problem that arises in graph neural networks as the number of graph convolution layers increases, a skip connection may be added to the graph encoder, through which the node initial embedded representation set $X'$ of the patient data map is passed directly into the output of the current graph convolution layer. The output of the graph encoder is the node embedded representation set $X = f(A, X')$ of the patient data map, and the node embedded representation set together with the adjacency matrix of the patient data map forms the patient characterization.
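The following pure-Python sketch illustrates one possible graph encoder of this kind: a stack of graph convolution layers in which the initial one-hot embeddings are added to each layer's output as a skip connection. The row normalization with self-loops, the ReLU activation, the additive form of the skip connection and the fixed identity weights are all assumptions made for illustration; the source does not specify them, and in a trained model the weights would be learned.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Add self-loops and row-normalize, so that each node averages
    over itself and its neighbors (an assumed normalization)."""
    n = len(adj)
    with_loops = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    return [[v / sum(row) for v in row] for row in with_loops]


def gcn_layer(adj_norm, h, weight, x_init):
    """One graph convolution with ReLU, plus a skip connection that adds
    the initial embeddings X' to the layer output."""
    aggregated = matmul(matmul(adj_norm, h), weight)
    return [
        [max(0.0, v) + x0 for v, x0 in zip(row, row_init)]
        for row, row_init in zip(aggregated, x_init)
    ]


def graph_encoder(adj, x_init, weights):
    """f(A, X'): apply the layers in sequence and return X."""
    adj_norm = normalize_adjacency(adj)
    h = x_init
    for weight in weights:
        h = gcn_layer(adj_norm, h, weight, x_init)
    return h


# Three-node path graph 0 - 1 - 2 with one-hot initial embeddings.
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
X_init = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

X = graph_encoder(A, X_init, [identity])
```

With a single layer, node 0 mixes in information from node 1 only; stacking a second layer would let it also see node 2, which is the multi-hop behavior the text refers to.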

S302, acquiring a first multi-layer perceptron, and calculating the node-level attention scores and the edge-level attention scores in the causal feature map from the patient characterization using the first multi-layer perceptron, based on an attention mechanism; acquiring a second multi-layer perceptron, and calculating the node-level attention scores and the edge-level attention scores in the confusion feature map from the patient characterization using the second multi-layer perceptron, based on an attention mechanism.

Specifically, based on the patient characterization output by the graph neural network, an attention mechanism is applied in which two multi-layer perceptrons, $\mathrm{MLP}_{node}(\cdot)$ and $\mathrm{MLP}_{edge}(\cdot)$, evaluate the attention scores at the node level and the edge level. For the i-th node $v_i$, the j-th node $v_j$ and the edge $(v_i, v_j)$, the expressions are as follows:

$$\alpha_i^c,\ \alpha_i^n = \sigma\left(\mathrm{MLP}_{node}(h_i)\right)$$

$$\beta_{ij}^c,\ \beta_{ij}^n = \sigma\left(\mathrm{MLP}_{edge}(h_i \,\|\, h_j)\right)$$

where $h_i$ is the embedded representation vector of the i-th node $v_i$, $h_j$ is the embedded representation vector of the j-th node $v_j$, $\sigma$ is the softmax function, $\|$ is the vector concatenation operation, $\alpha_i^c$ is the node-level attention score of the i-th node $v_i$ in the causal feature map, $\alpha_i^n$ is the node-level attention score of the i-th node $v_i$ in the confusion feature map, $\beta_{ij}^c$ is the edge-level attention score of the edge $(v_i, v_j)$ in the causal feature map, $\beta_{ij}^n$ is the edge-level attention score of the edge $(v_i, v_j)$ in the confusion feature map, and $\alpha_i^c + \alpha_i^n = 1$, $\beta_{ij}^c + \beta_{ij}^n = 1$.
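As a rough illustration of how the node-level and edge-level attention scores could be computed, the sketch below replaces each multi-layer perceptron with a single linear layer that outputs two logits, and applies a softmax so that the causal and confusion scores of a node (or edge) sum to one. The weights, biases and function names are made-up assumptions, not values from the source.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vector, weight, bias):
    """A single linear layer standing in for an MLP: weight @ vector + bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


def node_attention(h_i, weight, bias):
    """Return (causal score, confusion score) for one node."""
    causal, confusion = softmax(linear(h_i, weight, bias))
    return causal, confusion


def edge_attention(h_i, h_j, weight, bias):
    """Return (causal score, confusion score) for one edge; the two node
    embeddings are concatenated before scoring."""
    causal, confusion = softmax(linear(h_i + h_j, weight, bias))
    return causal, confusion


# Illustrative embeddings and parameters (assumed, not learned).
h_i, h_j = [1.0, 0.0], [0.0, 1.0]
W_node, b_node = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W_edge, b_edge = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]], [0.0, 0.0]

alpha_c, alpha_n = node_attention(h_i, W_node, b_node)
beta_c, beta_n = edge_attention(h_i, h_j, W_edge, b_edge)
```

Because the two scores come from one softmax, a node that receives a high causal score necessarily receives a low confusion score, which is what makes the later soft masks complementary.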

S303, concatenating the node-level attention scores and the edge-level attention scores in the causal feature map to obtain the patient causal feature map node soft mask $M_x^c$ and the patient causal feature map edge soft mask $M_a^c$; concatenating the node-level attention scores and the edge-level attention scores in the confusion feature map to obtain the patient confusion feature map node soft mask $M_x^n$ and the patient confusion feature map edge soft mask $M_a^n$.

Further, the patient causal feature map node soft mask is $M_x^c = [\alpha_1^c, \alpha_2^c, \ldots, \alpha_U^c]$ and the patient confusion feature map node soft mask is $M_x^n = [\alpha_1^n, \alpha_2^n, \ldots, \alpha_U^n]$, where $U$ represents the number of nodes in the patient data map.

Patient causal feature map edge soft mask:

$$M_a^c = \begin{pmatrix} \beta_{11}^c & \cdots & \beta_{1U}^c \\ \vdots & \ddots & \vdots \\ \beta_{U1}^c & \cdots & \beta_{UU}^c \end{pmatrix}$$

Patient confusion feature map edge soft mask:

$$M_a^n = \begin{pmatrix} \beta_{11}^n & \cdots & \beta_{1U}^n \\ \vdots & \ddots & \vdots \\ \beta_{U1}^n & \cdots & \beta_{UU}^n \end{pmatrix}$$

S304, decoupling the patient data map into a patient causal feature map and a patient confusion feature map according to the patient causal feature map node soft mask, the patient causal feature map edge soft mask, the patient confusion feature map node soft mask and the patient confusion feature map edge soft mask.

Specifically, the adjacency matrix of the patient data map is multiplied element-wise by the patient causal feature map edge soft mask, and the result is recorded as a first matrix; the node embedded representation set of the patient data map is multiplied element-wise by the patient causal feature map node soft mask, and the result is recorded as a second matrix; the patient causal feature map is represented by the first matrix and the second matrix, expressed as: $G^c = \{A \odot M_a^c,\ X \odot M_x^c\}$.

The adjacency matrix of the patient data map is multiplied element-wise by the patient confusion feature map edge soft mask, and the result is recorded as a third matrix; the node embedded representation set of the patient data map is multiplied element-wise by the patient confusion feature map node soft mask, and the result is recorded as a fourth matrix; the patient confusion feature map is represented by the third matrix and the fourth matrix, expressed as: $G^n = \{A \odot M_a^n,\ X \odot M_x^n\}$.
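The soft-mask decoupling can be sketched in a few lines of Python. The sketch assumes that the confusion masks are the complements of the causal masks (which holds when each pair of scores comes from a two-way softmax) and that a node's mask value scales every component of that node's embedding; the mask values themselves are invented for illustration.

```python
def mask_edges(adj, edge_mask):
    """Element-wise product of the adjacency matrix and an edge soft mask."""
    return [[a * m for a, m in zip(row, mask_row)] for row, mask_row in zip(adj, edge_mask)]


def mask_nodes(x, node_mask):
    """Scale each node's embedding vector by that node's soft-mask value."""
    return [[value * m for value in row] for row, m in zip(x, node_mask)]


def decouple(adj, x, node_mask_causal, edge_mask_causal):
    """Split a patient data map into causal and confusion feature maps,
    assuming complementary masks (confusion = 1 - causal)."""
    node_mask_conf = [1.0 - m for m in node_mask_causal]
    edge_mask_conf = [[1.0 - m for m in row] for row in edge_mask_causal]
    causal = (mask_edges(adj, edge_mask_causal), mask_nodes(x, node_mask_causal))
    confusion = (mask_edges(adj, edge_mask_conf), mask_nodes(x, node_mask_conf))
    return causal, confusion


# Two-node example with assumed attention scores.
A = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0, 2.0], [3.0, 4.0]]
node_mask_c = [0.75, 0.25]
edge_mask_c = [[0.5, 0.75], [0.75, 0.5]]

(A_c, X_c), (A_n, X_n) = decouple(A, X, node_mask_c, edge_mask_c)
```

Under the complementarity assumption, adding the causal and confusion parts element-wise recovers the original adjacency matrix and embeddings, so no information is discarded by the split.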

S4, constructing and training a medical data generation model, comprising: generating a patient reconstructed causal feature map and a patient synthetic causal feature map according to the patient causal feature map and the patient confusion feature map; configuring an embedded representation discriminator and a causal feature map discriminator, and alternately training the embedded representation discriminator and the causal feature map discriminator based on adversarial learning, such that the causal feature map discriminator cannot distinguish the patient causal feature map from the patient synthetic causal feature map, while the patient causal feature map and the patient reconstructed causal feature map are made as similar as possible.

Specifically, as shown in fig. 3, the medical data generation model specifically includes:

An encoder, for graph-encoding the patient causal feature map $G^c$ to obtain the first node embedded representation set $Z$ of the patient causal feature map.

A generator, for obtaining the synthetic embedded representation vector set $\tilde{Z}^n$ of the nodes in the patient confusion feature map $G^n$, and superimposing on it the first node embedded representation set $Z$ of the patient causal feature map $G^c$ to obtain the second node embedded representation set $\tilde{Z}$.

Further, the synthetic embedded representation vector set $\tilde{Z}^n$ of the nodes in the patient confusion feature map is obtained by sampling, with the embedded representations of the nodes in the patient confusion feature map $G^n$ as the mean and 1 as the variance. The second node embedded representation set $\tilde{Z}$ is obtained by superimposing the first node embedded representation set $Z$ of the patient causal feature map $G^c$, expressed as: $\tilde{Z} = Z + \tilde{Z}^n$.

A decoder, for obtaining the reconstructed patient causal feature map $\hat{G}^c$ from the first node embedded representation set $Z$, and for obtaining the synthetic patient causal feature map $\tilde{G}^c$ from the second node embedded representation set $\tilde{Z}$.

An embedded representation discriminator, for making the first node embedded representation set $Z$ and the second node embedded representation set $\tilde{Z}$ closer to each other.

A causal feature map discriminator, for making the patient causal feature map $G^c$ and the reconstructed patient causal feature map $\hat{G}^c$ as similar as possible, and for making the patient causal feature map $G^c$ and the synthetic patient causal feature map $\tilde{G}^c$ indistinguishable.
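The sampling and superposition performed by the generator can be sketched as follows, using only the Python standard library. The function name and the example values are assumptions; the sketch draws each component independently from a normal distribution whose mean is the corresponding confusion-map embedding and whose variance is 1, then adds the causal embeddings.

```python
import random


def generate_synthetic_embeddings(z_causal, x_confusion, rng):
    """Sample around the confusion-map node embeddings (variance 1) and
    superimpose the causal-map node embeddings Z."""
    sampled = [[rng.gauss(mean, 1.0) for mean in row] for row in x_confusion]
    synthetic = [
        [z + s for z, s in zip(z_row, s_row)]
        for z_row, s_row in zip(z_causal, sampled)
    ]
    return sampled, synthetic


# Illustrative embeddings for a two-node graph (assumed values).
Z = [[0.5, 1.0], [1.5, 2.0]]
X_confusion = [[0.1, 0.0], [0.0, 0.2]]

rng = random.Random(0)  # seeded only to make the sketch repeatable
sampled, Z_synth = generate_synthetic_embeddings(Z, X_confusion, rng)
```

Calling the function again with the same inputs yields a different synthetic set, which is how repeated sampling produces multiple distinct synthetic patients from one real patient.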

Specifically, the process of training the medical data generation model includes:

initializing parameters of a medical data generation model;

fixing the encoder and the decoder, and training the embedded representation discriminator and the generator using the synthetic patient causal feature map;

fixing the generator and inversely optimizing the decoder, and training the causal feature map discriminator, the decoder and the encoder using the synthetic patient causal feature map, the reconstructed patient causal feature map and the patient causal feature map $G^c$;

training alternately based on adversarial learning, with parameters updated by the back-propagation algorithm, until a preset number of training rounds is reached or the loss function converges, so as to obtain the trained medical data generation model.

The total loss function $L$ used when training the medical data generation model is the sum of the reconstruction loss $L_{rec}$ of the patient causal feature map, the discrimination loss $L_d$ of the patient causal feature map, and the adversarial loss $L_z$ of the embedded representation discriminator:

$$L = L_{rec} + L_d + L_z$$

The reconstruction loss $L_{rec}$ of the patient causal feature map is the sum of the reconstruction error $L_x$ of the node embedded representation set and the reconstruction error $L_a$ of the adjacency matrix:

$$L_{rec} = L_x + L_a$$

The expression of the reconstruction error $L_x$ of the node embedded representation set is as follows:

[formula not reproduced in the source text]

The expression of the reconstruction error $L_a$ of the adjacency matrix is as follows:

[formula not reproduced in the source text]

where $N$ is the number of nodes in the causal feature map.

The expression of the discrimination loss $L_d$ of the patient causal feature map is as follows:

[formula not reproduced in the source text]

The discrimination loss $L_d$ of the patient causal feature map serves to improve both the quality of the synthetic causal feature map generated by the decoder and the ability of the causal feature map discriminator to distinguish real data from synthetic data.

The expression of the adversarial loss $L_z$ of the embedded representation discriminator is as follows:

[formula not reproduced in the source text]

The loss function $L_z$ serves to improve both the ability of the generator to produce the synthetic embedded representation set of the nodes in the causal feature map and the ability of the embedded representation discriminator to distinguish the embedded representations.
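Since the individual loss expressions are not legible in this text, the following sketch only illustrates how losses of this general shape are commonly computed and combined. Every formula in it is an assumption rather than the patent's own definition: mean squared error per node for the embedding reconstruction, binary cross-entropy for the adjacency reconstruction, and the standard GAN log-loss for both discriminators. Only the two summations (the reconstruction loss as the sum of the two reconstruction errors, and the total loss as the sum of three terms) come from the description.

```python
import math


def node_reconstruction_error(x, x_hat):
    """ASSUMED form: squared error summed over components, averaged over N nodes."""
    n = len(x)
    return sum(
        (a - b) ** 2 for row, row_hat in zip(x, x_hat) for a, b in zip(row, row_hat)
    ) / n


def adjacency_reconstruction_error(adj, adj_hat, eps=1e-12):
    """ASSUMED form: binary cross-entropy averaged over all N*N entries."""
    n = len(adj)
    total = 0.0
    for row, row_hat in zip(adj, adj_hat):
        for target, prob in zip(row, row_hat):
            prob = min(max(prob, eps), 1.0 - eps)  # clamp to keep log finite
            total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
    return total / (n * n)


def discriminator_loss(real_scores, fake_scores, eps=1e-12):
    """ASSUMED form: standard GAN loss on discriminator outputs in (0, 1)."""
    real = sum(math.log(max(s, eps)) for s in real_scores) / len(real_scores)
    fake = sum(math.log(max(1.0 - s, eps)) for s in fake_scores) / len(fake_scores)
    return -(real + fake)


def total_loss(l_x, l_a, l_d, l_z):
    """L = L_rec + L_d + L_z, with L_rec = L_x + L_a (as stated in the text)."""
    return (l_x + l_a) + l_d + l_z


# Tiny worked example with made-up values.
l_x = node_reconstruction_error([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 0.0]])
l_a = adjacency_reconstruction_error([[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]])
l_d = discriminator_loss([0.5], [0.5])
l_z = discriminator_loss([0.5], [0.5])
loss = total_loss(l_x, l_a, l_d, l_z)
```

A discriminator that outputs 0.5 for every input, as in the example, cannot tell real from synthetic data, which is the equilibrium the adversarial training aims for.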

S4, decoupling the target patient data map into a target patient causal feature map and a target patient confusion feature map, and inputting the target patient causal feature map and the target patient confusion feature map into the trained medical data generation model to obtain generated medical data.

Specifically, the generator and the decoder are used to generate new data samples. First, the generator takes as input the embedded representation set $Z$ of the nodes in the causal feature map $G^c$ and the confusion feature map $G^n$, and samples to obtain the synthetic embedded representation vector set of the nodes in the confusion feature map. The synthetic embedded representation set of the nodes in the causal feature map is then obtained by superimposing the embedded representation set of the nodes in the causal feature map. Next, the synthetic embedded representation set is input to the decoder to obtain a synthetic causal feature map. Finally, the synthetic causal feature map is converted into a patient data map, which is the generated medical data. If multiple samples need to be generated, the above steps may be repeated.

It should be noted that the generated medical data can be applied to the fields of medical research, medical care management, medical decision support and the like, and provides more data support for medical science and medical services.

The medical data generation method based on causal representation learning further comprises the following steps:

S5, performing quality evaluation on the generated medical data.

Specifically, first, the present invention performs multidimensional quality assessment on generated medical data using generated data quality assessment indicators. Wherein, generating the data quality evaluation index comprises feature distribution difference, feature correlation, generating data modeling prediction capability, attribute inference risk, member inference risk and the like. And then, scoring the importance of the different generated data quality assessment indexes according to the characteristics of the downstream application scene. And finally, carrying out weighted summation according to the scores of the different generated data quality evaluation indexes and the importance scores thereof to obtain the final generated data quality scores.

And if the final generated data quality score is lower than a preset score threshold, performing iterative optimization on the medical data generation model again.
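The weighted scoring and threshold check can be illustrated with the short sketch below. The indicator names follow the list in the text, but the scores, importance weights and threshold are invented for illustration, and the assumption that every indicator is already scaled so that higher means better is not stated in the source.

```python
def quality_score(indicator_scores, importance):
    """Weighted sum of indicator scores; both dicts share the same keys."""
    return sum(indicator_scores[name] * importance[name] for name in indicator_scores)


def needs_retraining(score, threshold):
    """True when the final score falls below the preset threshold."""
    return score < threshold


# Hypothetical indicator scores in [0, 1], higher meaning better.
scores = {
    "feature_distribution": 0.9,
    "feature_correlation": 0.8,
    "predictive_utility": 0.7,
    "attribute_inference_risk": 0.6,
    "membership_inference_risk": 0.5,
}
# Hypothetical importance weights for one downstream scenario (sum to 1).
weights = {
    "feature_distribution": 0.3,
    "feature_correlation": 0.2,
    "predictive_utility": 0.3,
    "attribute_inference_risk": 0.1,
    "membership_inference_risk": 0.1,
}

final_score = quality_score(scores, weights)
retrain = needs_retraining(final_score, threshold=0.8)
```

A privacy-sensitive application would shift weight toward the two inference-risk indicators, which could change whether the same generated data set passes the threshold.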

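An embodiment of the invention also provides a medical data generation system based on causal representation learning. As shown in FIG. 4, the system comprises the medical knowledge graph acquisition module, the patient data map construction module, the patient data map decoupling module, the medical data generation model training module and the medical data generation module described in the second aspect above.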

As shown in FIG. 5, an embodiment of the present application provides an electronic device, which includes a memory 101 for storing one or more programs, a processor 102, and a communication interface 103. The method of any of the first aspects described above is implemented when the one or more programs are executed by the processor 102.

The memory 101, the processor 102 and the communication interface 103 are electrically connected to each other, directly or indirectly, to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, which the processor 102 executes to perform various functional applications and data processing. The communication interface 103 may be used for communication of signaling or data with other node devices.

The memory 101 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), etc.

The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In the embodiments provided in the present application, it should be understood that the disclosed method and system may be implemented in other manners. The method and system embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, systems and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems which perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.

In another aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by the processor 102, implements a method as in any of the first aspects described above. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

The above embodiments merely illustrate the design concept and features of the present invention and are intended to enable those skilled in the art to understand and implement it. The scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas of the present invention fall within the scope of the present invention.

Claims

1. A method of generating medical data based on causal representation learning, the method comprising:

acquiring a medical knowledge graph;

obtaining an adjacency matrix and a node initial embedded representation set according to the patient data map, and inputting the adjacency matrix and the node initial embedded representation set to a graph encoder to obtain a patient representation; decoupling the patient data map into a patient causal feature map and a patient confusion feature map using an attention mechanism according to the patient representation; comprising the following steps:

acquiring a first multi-layer perceptron, and calculating node-level attention scores and edge-level attention scores in the causal feature map by using the first multi-layer perceptron according to the patient representation based on an attention mechanism;

acquiring a second multi-layer perceptron, and calculating node-level attention scores and edge-level attention scores in the confusion feature map by using the second multi-layer perceptron according to the patient representation based on an attention mechanism;

concatenating the node-level attention scores and the edge-level attention scores in the causal feature map to obtain a patient causal feature map node soft mask and a patient causal feature map edge soft mask;

concatenating the node-level attention scores and the edge-level attention scores in the confusion feature map to obtain a patient confusion feature map node soft mask and a patient confusion feature map edge soft mask;

decoupling the patient data map into a patient causal feature map and a patient confusion feature map according to the patient causal feature map node soft mask, the patient causal feature map edge soft mask, the patient confusion feature map node soft mask, and the patient confusion feature map edge soft mask;

constructing and training a medical data generation model; comprising the following steps: generating a patient reconstruction causal feature map and a patient synthesis causal feature map according to the patient causal feature map and the patient confusion feature map; configuring an encoder for graph encoding the patient causal feature map to obtain a first node embedded representation set in the patient causal feature map; configuring a generator for acquiring a synthetic embedded representation vector set of the nodes in the patient confusion feature map, and superposing the first node embedded representation set in the patient causal feature map to obtain a second node embedded representation set in the patient confusion feature map; configuring an embedded representation discriminator and a causal feature map discriminator, and alternately training the embedded representation discriminator and the causal feature map discriminator based on adversarial learning, such that the causal feature map discriminator cannot distinguish between the patient causal feature map and the patient synthesis causal feature map, while making the patient causal feature map and the patient reconstruction causal feature map as similar as possible; the embedded representation discriminator is used for making the first node embedded representation set more similar to the second node embedded representation set, and the adversarial loss of the embedded representation discriminator has the following expression:

wherein L_z represents the adversarial loss of the embedded representation discriminator and Z represents the first node embedded representation set; the remaining symbol in the expression represents the second node embedded representation set;

2. The causal representation learning based medical data generation method of claim 1, wherein obtaining the adjacency matrix and the node initial embedded representation set according to the patient data map, and inputting the adjacency matrix and the node initial embedded representation set to the graph encoder to obtain the patient representation comprises:

acquiring the adjacency matrix and the node initial embedded representation set according to the patient data map;

the graph encoder is composed of multiple graph convolutional layers and further comprises a skip connection, wherein the node initial embedded representation set is directly added to the output of each graph convolutional layer through the skip connection;

the adjacency matrix and the node initial embedded representation set are input to the graph encoder to obtain the adjacency matrix and the node embedded representation set of the patient data map, i.e. the patient representation.

3. The causal representation learning-based medical data generation method of claim 1, wherein decoupling the patient data map into the patient causal feature map and the patient confusion feature map using an attention mechanism according to the patient representation further comprises:

for the i-th node v_i, the j-th node v_j and the edge (v_i, v_j), the sum of the node-level attention score of the i-th node v_i in the causal feature map and the node-level attention score of the i-th node v_i in the confusion feature map is 1; the sum of the edge-level attention score of the edge (v_i, v_j) in the causal feature map and the edge-level attention score of the edge (v_i, v_j) in the confusion feature map is 1.

4. The causal representation learning-based medical data generation method of claim 1, wherein decoupling the patient data map into the patient causal feature map and the patient confusion feature map according to the patient causal feature map node soft mask, the patient causal feature map edge soft mask, the patient confusion feature map node soft mask, and the patient confusion feature map edge soft mask comprises:

multiplying the adjacency matrix in the patient data map by the patient causal feature map edge soft mask, the result being recorded as a first matrix; multiplying the node embedded representation set in the patient data map by the patient causal feature map node soft mask, the result being recorded as a second matrix; the patient causal feature map is represented by the first matrix and the second matrix;

multiplying the adjacency matrix in the patient data map by the patient confusion feature map edge soft mask, the result being recorded as a third matrix; multiplying the node embedded representation set in the patient data map by the patient confusion feature map node soft mask, the result being recorded as a fourth matrix; the patient confusion feature map is represented by the third matrix and the fourth matrix.

5. The causal representation learning-based medical data generation method of claim 1, wherein the medical data generation model comprises:

an encoder for graph encoding the patient causal feature map to obtain a first node embedded representation set in the patient causal feature map;

a generator for acquiring a synthetic embedded representation vector set of the nodes in the patient confusion feature map, and superposing the first node embedded representation set in the patient causal feature map to obtain a second node embedded representation set in the patient confusion feature map;

a decoder for obtaining a reconstructed patient causal feature map according to the first node embedded representation set, and for obtaining a synthetic patient causal feature map according to the second node embedded representation set;

an embedded representation discriminator which, with the encoder and the decoder fixed, is used for making the first node embedded representation set more similar to the second node embedded representation set;

a causal feature map discriminator which, with the generator fixed, performs reverse optimization of the decoder such that the causal feature map discriminator cannot distinguish between the patient causal feature map and the synthetic patient causal feature map, while making the patient causal feature map and the reconstructed patient causal feature map as similar as possible;

the embedded representation discriminator and the causal feature map discriminator are trained alternately based on adversarial learning, and parameter updates are performed based on a back propagation algorithm.

6. The causal representation learning-based medical data generation method of claim 1, wherein training the medical data generation model comprises:

setting a total loss function during training of a medical data generation model;

the total loss function is the sum of the reconstruction loss of the patient causal feature map, the discrimination loss of the patient causal feature map, and the adversarial loss of the embedded representation discriminator;

the reconstruction loss of the patient causal feature map is the sum of the reconstruction error of the node embedded representation set and the reconstruction error of the adjacency matrix.

7. A medical data generation system based on causal representation learning, the system comprising:

the patient data map decoupling module is used for obtaining an adjacency matrix and a node initial embedded representation set according to the patient data map, and inputting the adjacency matrix and the node initial embedded representation set to a graph encoder to obtain a patient representation; decoupling the patient data map into a patient causal feature map and a patient confusion feature map using an attention mechanism according to the patient representation; comprising the following steps:

the medical data generation model training module is used for constructing and training a medical data generation model; comprising the following steps: generating a patient reconstruction causal feature map and a patient synthesis causal feature map according to the patient causal feature map and the patient confusion feature map; configuring an encoder for graph encoding the patient causal feature map to obtain a first node embedded representation set in the patient causal feature map; configuring a generator for acquiring a synthetic embedded representation vector set of the nodes in the patient confusion feature map, and superposing the first node embedded representation set in the patient causal feature map to obtain a second node embedded representation set in the patient confusion feature map; configuring an embedded representation discriminator and a causal feature map discriminator, and alternately training the embedded representation discriminator and the causal feature map discriminator based on adversarial learning, such that the causal feature map discriminator cannot distinguish between the patient causal feature map and the patient synthesis causal feature map, while making the patient causal feature map and the patient reconstruction causal feature map as similar as possible; the embedded representation discriminator is used for making the first node embedded representation set more similar to the second node embedded representation set, and the adversarial loss of the embedded representation discriminator has the following expression:

8. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is for storing program data and the processor is for executing the program data to implement the causal representation learning based medical data generating method of any of the preceding claims 1-6.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the causal representation learning based medical data generating method according to any of claims 1-6.
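The soft-mask decoupling recited in claims 3 and 4 can be illustrated with a minimal sketch. It is not the claimed implementation: the raw scores that would come from the two multi-layer perceptrons are passed in as plain numbers, the two-way softmax used to make paired scores sum to 1 is one possible choice, and the multiplication is read as element-wise. All names are hypothetical.

```python
import math


def pair_softmax(score_causal, score_confusion):
    """Turn two raw scores into attention scores that sum to 1 (assumed softmax)."""
    m = max(score_causal, score_confusion)
    e_c = math.exp(score_causal - m)
    e_n = math.exp(score_confusion - m)
    return e_c / (e_c + e_n), e_n / (e_c + e_n)


def decouple(adj, h, node_raw_c, node_raw_n, edge_raw_c, edge_raw_n):
    """Split (adj, h) into causal and confusion parts with soft masks.

    adj: adjacency matrix; h: node embedded representation set (one row per node).
    *_raw_c / *_raw_n: raw scores standing in for the outputs of the first and
    second multi-layer perceptron.
    """
    n = len(adj)
    adj_c = [[0.0] * n for _ in range(n)]  # first matrix
    adj_n = [[0.0] * n for _ in range(n)]  # third matrix
    h_c, h_n = [], []                      # second and fourth matrices
    for i in range(n):
        a_c, a_n = pair_softmax(node_raw_c[i], node_raw_n[i])
        h_c.append([a_c * x for x in h[i]])
        h_n.append([a_n * x for x in h[i]])
        for j in range(n):
            b_c, b_n = pair_softmax(edge_raw_c[i][j], edge_raw_n[i][j])
            adj_c[i][j] = b_c * adj[i][j]
            adj_n[i][j] = b_n * adj[i][j]
    return (adj_c, h_c), (adj_n, h_n)


# Hypothetical two-node patient data map.
adj = [[0, 1], [1, 0]]
h = [[1.0, 2.0], [3.0, 4.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
(adj_c, h_c), (adj_n, h_n) = decouple(adj, h, [0.0, 2.0], [0.0, -2.0], zeros, zeros)
```

Because each pair of masks sums to 1, the causal and confusion parts add back up to the original adjacency matrix and node embedded representation set, so the split redistributes the information in the patient data map without discarding any of it.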