CN117034142B - Unbalanced medical data missing value filling method and system - Google Patents
- Publication number
- CN117034142B, CN202311283938.4A
- Authority
- CN
- China
- Prior art keywords
- data
- patient
- filling
- generator
- patient data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Abstract
The invention discloses a method and a system for filling missing values in unbalanced medical data. The losses of the generator and the discriminator are constructed using the earth mover's (Wasserstein) distance, which addresses the vanishing-gradient problem that the generator may encounter during training. The patient label is fed to the generator as a supervision signal, which increases the diversity of the patient data produced by the generator. An auxiliary classifier predicts the label of the patient data filled by the filling unit, and the prediction result is fed back to the generator to improve its generation quality. The missing part of the patient data is filled with random numbers and the filled patient data is used as the generator input; the generator learns the relationship between the missing values and the other data, which avoids the need to collect enough complete samples for training. The generator loss consists of three parts; constructing different losses makes the generator consider the filling quality from different angles, which improves the accuracy of the filling result.
Description
Technical Field
The invention belongs to the technical field of medical information, and particularly relates to an unbalanced medical data missing value filling method and system.
Background
The electronic health record (EHR) stores information related to patient visits, including the patient's basic information, diagnostic information, examination information, medication information, and the like. This information provides the basis for medical data mining. However, owing to factors such as acquisition device failure and unstable transmission, electronic health records contain a large amount of missing data. These missing data not only increase the complexity and difficulty of statistical analysis, but also lead to inaccurate analysis results. Solving the missing value filling problem in electronic health records is therefore of great significance for improving the quality of data mining.
The generative adversarial network (GAN) is a neural network that captures the distribution of its training data and creates new data from the learned distribution; it is currently used in fields such as image generation and text generation. In recent years, researchers have applied GAN methods to missing value filling. In practice, however, hospital electronic medical record data are often unbalanced: the numbers of patients with different types of disease differ greatly, and applying a GAN directly to missing value filling on such unbalanced medical data causes problems. On the one hand, the filling results lack diversity: on unbalanced samples, the generator deceives the discriminator by focusing only on the filling quality of the classes with many samples and disregarding the classes with few samples, so that the filled data end up resembling only certain disease types. On the other hand, when a GAN is trained on unbalanced data, the generator is more prone to the vanishing gradient problem. The Wasserstein GAN paper states that, under the optimal discriminator, minimizing the generator loss is equivalent to minimizing the Jensen-Shannon divergence (JSD) between the real distribution and the generated distribution. When the two distributions do not overlap, or their overlap is negligible, the JSD is the fixed constant log 2, so the gradient of the generator vanishes and network training becomes difficult.
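The last point can be illustrated numerically. The following Python sketch is not part of the invention; it assumes two point-mass distributions on a hypothetical grid of 10 integer positions, and the helper names `js_divergence`, `wasserstein_1d` and `point_mass` are chosen here for illustration only. It shows that the JS divergence stays at log 2 however far apart the two non-overlapping distributions are, whereas the one-dimensional Wasserstein distance grows with the separation.

```python
import math


def kl(p, q):
    """KL divergence between discrete distributions (terms with p_i = 0 are skipped)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """1-D Wasserstein-1 distance on the grid 0, 1, 2, ...: sum of |CDF_p - CDF_q|."""
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total


def point_mass(position, size=10):
    """Distribution that puts all probability on one grid position (illustrative)."""
    return [1.0 if i == position else 0.0 for i in range(size)]


real = point_mass(0)
for shift in (1, 4, 8):
    fake = point_mass(shift)
    # JSD is log 2 for every shift; the Wasserstein distance equals the shift.
    print(shift, js_divergence(real, fake), wasserstein_1d(real, fake))
```

Because the JS divergence is constant over all non-overlapping placements, it gives the generator no signal about which direction reduces the gap; the Wasserstein distance does.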
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method and a system, based on a generative adversarial network, for filling missing values in unbalanced medical data, thereby improving the filling quality of missing values in medical data.
The object of the invention is achieved by the following technical solution:
In a first aspect, the present invention provides a method for filling missing values in unbalanced medical data, the method comprising:
acquiring patient data by using the information system of a hospital;
filling the missing values in the patient data by using a data filling model;
the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator and an auxiliary classifier; the generator and the discriminator form a generative adversarial network;
the data processing unit records the positions of the missing values in the patient raw data by using a mask matrix, pre-fills the missing values in the patient raw data with 0, fills the missing values in the patient raw data with random numbers, and inputs the randomly filled patient data into the generator;
the generator is used for learning the distribution of the input patient data, generating new patient data and passing it to the filling unit; the input of the generator comprises the patient data and a patient label;
the filling unit is used for filling the missing values in the patient raw data by using the new patient data generated by the generator;
the input of the discriminator comprises the patient data filled by the filling unit and the patient data in which the missing values of the patient raw data are pre-filled with 0, and the discriminator outputs the probability that each piece of patient data is an observed value;
the auxiliary classifier is used for predicting the label of the patient data filled by the filling unit and feeding the prediction result back to the generator;
the training process comprises pre-training the auxiliary classifier and formally training the data filling model; in the pre-training process, the auxiliary classifier is trained using patient data without missing values and its network parameters are determined; in the formal training process, the network parameters of the auxiliary classifier are not updated; in the formal training process, the discriminator is trained first and then the generator, and the discriminator and the generator continue adversarial training until the data filling model converges;
the patient data whose missing values need to be filled and the patient labels are input into the trained data filling model, and the filled patient data is output after passing through the data processing unit, the generator and the filling unit.
Further, the acquired patient data is input into the data filling model after data preprocessing, specifically: one-hot encoding is applied to the discrete data, and min-max normalization is applied to the continuous data.
Further, the patient raw data is denoted $X=\{x_1, x_2, \ldots, x_n\}\in\mathbb{R}^{n\times k}$, where $x_i$ represents the raw data of the $i$-th patient, $n$ is the number of patients and $k$ is the number of features; the mask matrix is denoted $M=\{m_1, m_2, \ldots, m_n\}$, where $m_i$ marks the observed values and the missing values in the raw data of the $i$-th patient, with 1 for an observed value and 0 for a missing value; the missing values in the patient raw data are pre-filled with 0, and the resulting data matrix is denoted $\tilde{X}=\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n\}$, where $\tilde{x}_i$ represents the patient data after the raw data of the $i$-th patient is pre-filled with 0; a random matrix is created and denoted $Z=\{z_1, z_2, \ldots, z_n\}$, where $z_i$ is a randomly generated vector of random numbers following the standard normal distribution, used to fill the missing values in the raw data of the $i$-th patient; the missing values in the patient raw data are filled with the random numbers in the random matrix, and the resulting data matrix is denoted $\hat{X}=\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\}$, where $\hat{x}_i$ represents the patient data obtained after filling the missing values in the raw data of the $i$-th patient with random numbers, $\hat{x}_i = m_i\odot\tilde{x}_i + (1-m_i)\odot z_i$, and $\odot$ represents the Hadamard product.
Further, the loss function of the generator consists of three parts: the first part measures the difference between the observed values generated by the generator and the actual observed values, using the mean square error as the loss function; the second part is the generator loss of the generative adversarial network, using the Wasserstein distance as the loss function; the third part is the difference between the label that the auxiliary classifier predicts for the patient data filled by the filling unit and the patient's real label, using the cross-entropy function as the loss function.
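Further, the loss function of the generator is $L_G = L_1 + \alpha L_2 + \beta L_3$;
the first partial loss function is $L_1 = \frac{1}{n}\sum_{i=1}^{n}\left\|m_i\odot\tilde{x}_i - m_i\odot\bar{x}_i\right\|_2^2$;
the second partial loss function is $L_2 = -\frac{1}{n}\sum_{i=1}^{n} D(t_i)\cdot(1-m_i)$;
the third partial loss function is $L_3 = -\frac{1}{n}\sum_{i=1}^{n} y_i\cdot\log\hat{y}_i$;
where $\bar{x}_i$ represents the output value of the generator when the $i$-th patient data is input, $\bar{x}_i = G(\hat{x}_i, y_i)$, $G(\cdot)$ represents the patient data obtained after passing through the generator, $y_i$ represents the real label of the $i$-th patient, $D(\cdot)$ represents the result obtained after patient data passes through the discriminator, $t_i$ represents the patient data obtained after the raw data of the $i$-th patient is filled by the filling unit, $\hat{y}_i$ represents the label predicted by the auxiliary classifier for the $i$-th patient, $m_i$ is the mask vector of the $i$-th patient, $\tilde{x}_i$ is the patient data after the raw data of the $i$-th patient is pre-filled with 0, $\alpha$ and $\beta$ are hyperparameters, and $\cdot$ represents the vector inner product.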
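Further, in the filling unit, the patient data $\bar{X}$ generated by the generator is used to fill the missing values in the patient raw data $X$, and the filled data matrix is denoted $T=\{t_1, t_2, \ldots, t_n\}$, where $t_i$ represents the patient data obtained after the raw data of the $i$-th patient is filled by the filling unit, $t_i = m_i\odot\tilde{x}_i + (1-m_i)\odot\bar{x}_i$, $\bar{x}_i$ represents the output value of the generator when the $i$-th patient data is input, $m_i$ is the mask vector of the $i$-th patient, $\tilde{x}_i$ is the patient data after the raw data of the $i$-th patient is pre-filled with 0, and $\odot$ represents the Hadamard product.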
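Further, the loss function $L_D$ of the discriminator is calculated as follows:
$L_D = \frac{1}{n}\sum_{i=1}^{n}\left[D(t_i)\cdot(1-m_i) - D(\tilde{x}_i)\cdot m_i\right]$;
where $D(\cdot)$ represents the result obtained after patient data passes through the discriminator, $\tilde{x}_i$ represents the patient data after the missing values in the raw data of the $i$-th patient are pre-filled with 0, $t_i$ represents the patient data obtained after the raw data of the $i$-th patient is filled by the filling unit, $m_i$ is the mask vector of the $i$-th patient, and $\cdot$ represents the vector inner product.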
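A minimal Python sketch of a Wasserstein-style discriminator loss that uses only the quantities named above. The exact form is an assumption made for illustration: the discriminator scores at missing positions of the filled data $t_i$ are pushed down and the scores at observed positions of the 0-pre-filled data are pushed up, averaged over patients. The function name `discriminator_loss` and its argument names are hypothetical.

```python
def discriminator_loss(d_filled, d_zero_filled, mask):
    """Assumed loss: mean over patients of D(t_i).(1 - m_i) - D(x_tilde_i).m_i.

    d_filled      -- discriminator outputs for the data filled by the filling unit
    d_zero_filled -- discriminator outputs for the 0-pre-filled data
    mask          -- mask vectors (1 = observed value, 0 = missing value)
    """
    n = len(mask)
    total = 0.0
    for d_t, d_x, m in zip(d_filled, d_zero_filled, mask):
        # Inner product with (1 - m_i): scores the discriminator gives to filled-in values.
        filled_score = sum(d * (1.0 - mi) for d, mi in zip(d_t, m))
        # Inner product with m_i: scores the discriminator gives to observed values.
        observed_score = sum(d * mi for d, mi in zip(d_x, m))
        total += filled_score - observed_score
    return total / n


# One patient, three features, the second feature missing.
loss = discriminator_loss(
    d_filled=[[0.9, 0.2, 0.8]],
    d_zero_filled=[[0.9, 0.1, 0.7]],
    mask=[[1.0, 0.0, 1.0]],
)
print(loss)
```

Minimizing this quantity rewards the discriminator for scoring observed entries high and filled-in entries low, which is the opposite of what the second part of the generator loss rewards.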
Further, in the formal training of the data filling model, patient data containing missing values is input first; the discriminator loss is calculated and the discriminator network parameters are updated by gradient back propagation; then the generator loss is calculated and the generator network parameters are updated by gradient back propagation; the discriminator and the generator continue adversarial training until the data filling model converges.
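The alternating order of updates can be sketched as a plain Python loop. This is a schematic under stated assumptions, not the patented implementation: `discriminator_step` and `generator_step` are hypothetical callables that perform one parameter update and return the corresponding loss, and the convergence test (change in mean generator loss below `tolerance`) is an assumption, since the text does not specify how convergence is detected.

```python
def train(batches, discriminator_step, generator_step, max_epochs=100, tolerance=1e-4):
    """Alternate discriminator and generator updates until the generator loss settles."""
    history = []
    for _ in range(max_epochs):
        d_sum = g_sum = 0.0
        for batch in batches:
            d_sum += discriminator_step(batch)  # 1) update the discriminator first
            g_sum += generator_step(batch)      # 2) then update the generator
        history.append((d_sum / len(batches), g_sum / len(batches)))
        # Assumed stopping rule: mean generator loss changed by less than `tolerance`.
        if len(history) >= 2 and abs(history[-1][1] - history[-2][1]) < tolerance:
            break
    return history
```

The pre-trained auxiliary classifier does not appear in the loop because its parameters are frozen during formal training; it would only be evaluated inside `generator_step`.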
In a second aspect, the present invention provides a system for filling missing values in unbalanced medical data, the system comprising a data acquisition module, a data filling model construction module and a data filling module; the data acquisition module is used for acquiring patient data by using the information system of a hospital;
the data filling model construction module is used for constructing and training a data filling model; the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator and an auxiliary classifier, wherein the generator and the discriminator form a generative adversarial network;
the data processing unit records the positions of the missing values in the patient raw data by using a mask matrix, pre-fills the missing values in the patient raw data with 0, fills the missing values in the patient raw data with random numbers, and inputs the randomly filled patient data into the generator;
the generator is used for learning the distribution of the input patient data, generating new patient data and passing it to the filling unit; the input of the generator comprises the patient data and a patient label;
the filling unit is used for filling the missing values in the patient raw data by using the new patient data generated by the generator;
the input of the discriminator comprises the patient data filled by the filling unit and the patient data in which the missing values of the patient raw data are pre-filled with 0, and the discriminator outputs the probability that each piece of patient data is an observed value;
the auxiliary classifier is used for predicting the label of the patient data filled by the filling unit and feeding the prediction result back to the generator;
the training process comprises pre-training the auxiliary classifier and formally training the data filling model; in the pre-training process, the auxiliary classifier is trained using patient data without missing values and its network parameters are determined; in the formal training process, the network parameters of the auxiliary classifier are not updated; in the formal training process, the discriminator is trained first and then the generator, and the discriminator and the generator continue adversarial training until the data filling model converges;
the data filling module is used for inputting the patient data whose missing values need to be filled and the patient labels into the trained data filling model, and outputting the filled patient data after passing through the data processing unit, the generator and the filling unit.
In a third aspect, the present invention provides a device for filling missing values in unbalanced medical data, comprising a memory and one or more processors, wherein executable code is stored in the memory, and the processors implement the unbalanced medical data missing value filling method according to the first aspect when executing the executable code.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the unbalanced medical data missing value filling method according to the first aspect.
The beneficial effects of the invention are as follows:
1. The invention uses the earth mover's distance (Wasserstein distance) instead of the JS divergence to construct the losses of the generator and the discriminator. The Wasserstein distance is smoother than the JS divergence: even if two distributions do not overlap, it still reflects the distance between them, which addresses the vanishing-gradient problem that the generator may encounter during training.
2. The patient label is fed to the generator as a supervision signal, which helps the generator distinguish the different classes of patient data in unbalanced electronic medical records and increases the diversity of the patient data generated by the generator.
3. The invention adds an auxiliary classifier that performs classification prediction on the patient data filled by the filling unit and feeds the prediction result back to the generator, which improves the generation quality of the generator.
4. The invention fills the missing part of the patient data with random numbers and uses the filled patient data as the generator input; the generator learns the relationship between the missing values and the other data, which avoids the need to collect enough complete samples for training.
5. The generator loss provided by the invention consists of three parts: the loss between the patient observed values generated by the generator and the patient's actual observed values; the adversarial loss obtained from the discriminator's judgment of the patient missing values generated by the generator; and the loss between the label predicted by the auxiliary classifier for the patient data filled by the filling unit and the real label. By constructing different losses, the generator considers the filling quality from different angles, which improves the accuracy of the filling result.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of the unbalanced medical data missing value filling method provided by an exemplary embodiment;
FIG. 2 is the table format of the patient raw data provided by an exemplary embodiment;
FIG. 3 is a schematic diagram of the data filling model architecture provided by an exemplary embodiment;
FIG. 4 is a schematic diagram of the data processing unit provided by an exemplary embodiment;
FIG. 5 is a block diagram of the unbalanced medical data missing value filling system provided by an exemplary embodiment;
FIG. 6 is a block diagram of the unbalanced medical data missing value filling device provided by an exemplary embodiment.
Detailed Description
For a better understanding of the technical solutions of the present application, embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. All other embodiments, based on the embodiments herein, which would be apparent to one of ordinary skill in the art without making any inventive effort, are intended to be within the scope of the present application.
The terminology used in the embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The invention provides a method for filling missing values in unbalanced medical data. As shown in fig. 1, the method comprises two parts, data acquisition and data filling: the data acquisition part extracts structured patient data from the information system of a hospital, and the data filling part fills the missing values in the patient data by using a data filling model. The specific implementation of each part is described in detail below.
1. Data acquisition
Structured patient data, including the patient's basic information, diagnosis results, examination information, medication information and the like, is first extracted from the hospital information system.
The extracted patient data is then preprocessed, specifically:
one-hot encoding is applied to the extracted discrete data, which include features such as the patient's diagnosis and medication;
min-max normalization is applied to the extracted continuous data, which include features such as the patient's weight, age and blood pressure; the min-max normalization formula is $x'_{ij} = \frac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$, where $x_{ij}$ represents the value of the $j$-th feature of the $i$-th patient, $x_j^{\min}$ represents the minimum value of the $j$-th feature, and $x_j^{\max}$ represents the maximum value of the $j$-th feature.
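The two preprocessing operations can be sketched in plain Python as follows. The function names and the sample values are illustrative only. Two details are assumptions not stated in the text: missing values (represented here by `None`) are skipped when computing the minimum and maximum and stay missing, and a constant column is mapped to 0.

```python
def one_hot(value, categories):
    """One-hot encode a discrete value against a fixed, ordered category list."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max_normalize(column):
    """Scale the observed values of one continuous feature to [0, 1].

    `None` marks a missing value and is left untouched (assumption).
    """
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)
    lo, hi = min(observed), max(observed)
    span = hi - lo
    return [None if v is None else ((v - lo) / span if span else 0.0) for v in column]


# Hypothetical weight column (kg) with one missing entry.
print(min_max_normalize([50.0, None, 70.0, 90.0]))  # [0.0, None, 0.5, 1.0]
# Hypothetical diagnosis categories.
print(one_hot("asthma", ["flu", "asthma", "diabetes"]))  # [0.0, 1.0, 0.0]
```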
Finally, the patient diagnosis result is used as the patient label, denoted $Y=\{y_1, y_2, \ldots, y_n\}\in\mathbb{R}^{n\times u}$, where $n$ represents the number of patients, $y_i$ represents the label of the $i$-th patient, and $u$ is the number of disease types among the diagnosis results in the extracted patient data, which is 10 in this embodiment. The patient raw data is denoted $X=\{x_1, x_2, \ldots, x_n\}\in\mathbb{R}^{n\times k}$, where $x_i$ represents the raw data of the $i$-th patient and $k$ is the number of extracted patient features. As shown in FIG. 2, the patient raw data $X$ is organized in tabular form, where each row represents the data of one patient, $x_{ij}$ represents the value of the $j$-th feature of the $i$-th patient, and N indicates that the patient's value for that feature is missing.
2. Data filling
The missing values in the patient data are filled using the data filling model. As shown in fig. 3, the data filling model includes a data processing unit, a generator G, a filling unit, a discriminator D, and an auxiliary classifier C. The generator G and the discriminator D constitute a generative adversarial network.
Specifically, the data processing unit fills the missing values of the patient raw data with random numbers: because the patient raw data acquired by the data acquisition part contains missing values, normal numerical operations cannot be performed on it, so it must first be processed by the data processing unit. The generator learns the distribution of the input patient data and generates new patient data. The filling unit fills the missing values of the patient raw data using the new patient data generated by the generator. The discriminator examines each piece of input data and determines whether it is an observed (non-missing) value. The auxiliary classifier predicts the label of the filled patient data and feeds the prediction result back to the generator. The parts of the data filling model are described below.
2.1 Data processing unit
Because the patient raw data acquired by the data acquisition part contains missing values, normal numerical operations cannot be performed on it. To make such operations possible, the data processing unit processes the patient raw data in two parts.
The first part records the positions of the missing values in the patient raw data and pre-fills those missing values with 0. First, a mask matrix $M=\{m_1, m_2, \ldots, m_n\}$ records the positions of the missing values in the patient raw data; the mask vector $m_i$ marks which positions in the raw data of the $i$-th patient are observed values and which are missing values, with 1 representing an observed value and 0 representing a missing value. For example, $[1, 0, 1]$ indicates that the second feature of the patient is missing and the other features are observed. The missing values in the patient raw data are then filled with 0, and the data matrix $\tilde{X}=\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n\}$ represents the patient data after filling, where $\tilde{x}_i$ is the patient data after the missing values in the raw data of the $i$-th patient are pre-filled with 0.
The second part fills the missing values in the patient raw data with random numbers. First, a random matrix $Z=\{z_1, z_2, \ldots, z_n\}$ is created, where $z_i=\{z_{i1}, z_{i2}, \ldots, z_{ik}\}$ is a randomly generated vector of random numbers following the standard normal distribution, used to fill the missing values in the raw data of the $i$-th patient, and $z_{ik}$ is the random number used to fill the $k$-th feature of the raw data of the $i$-th patient. Then $\hat{x}_i$ denotes the patient data obtained after filling the missing values in the raw data of the $i$-th patient with random numbers; it is calculated as $\hat{x}_i = m_i\odot\tilde{x}_i + (1-m_i)\odot z_i$, where $\odot$ represents the Hadamard product. The data matrix $\hat{X}=\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\}$ represents the patient data obtained after filling the missing values in all patient raw data with random numbers.
Fig. 4 shows an example of the processing performed by the data processing unit: the patient raw data $X$ acquired by the data acquisition part contains missing values; the data matrix $\tilde{X}$ is the patient data after the missing values in $X$ are pre-filled with 0; the entries of the random matrix $Z$ are random numbers following the standard normal distribution; the mask matrix $M$ marks the positions of the observed values and the missing values in $X$.
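The two processing steps can be sketched in plain Python. This is an illustrative sketch, not the implementation of the embodiment: missing values are represented by `None`, the function name `process_raw_data` is hypothetical, and the random fill is computed from the 0-pre-filled data as (mask × 0-pre-filled value) + ((1 − mask) × random number), applied element-wise.

```python
import random


def process_raw_data(raw_rows, rng):
    """Return (mask M, 0-pre-filled data, random matrix Z, randomly filled data)."""
    mask, zero_filled, noise, random_filled = [], [], [], []
    for row in raw_rows:
        m = [0.0 if v is None else 1.0 for v in row]         # 1 = observed, 0 = missing
        x0 = [0.0 if v is None else float(v) for v in row]   # pre-fill missing with 0
        z = [rng.gauss(0.0, 1.0) for _ in row]               # standard normal noise
        # Element-wise combination: keep observed values, use noise where missing.
        xr = [mi * xi + (1.0 - mi) * zi for mi, xi, zi in zip(m, x0, z)]
        mask.append(m)
        zero_filled.append(x0)
        noise.append(z)
        random_filled.append(xr)
    return mask, zero_filled, noise, random_filled


rng = random.Random(0)  # fixed seed so the sketch is reproducible
M, X_zero, Z, X_rand = process_raw_data([[0.2, None, 0.7], [None, 0.5, 0.1]], rng)
print(M)       # [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
print(X_zero)  # [[0.2, 0.0, 0.7], [0.0, 0.5, 0.1]]
```

Observed entries pass through unchanged, so only the missing positions of the randomly filled data differ from the 0-pre-filled data.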
2.2 Generator
The generator learns the distribution of the input patient data and generates new patient data. Its input includes the patient data and the patient labels; the patient labels serve as supervision signals that let the generator learn the label of each patient. The generator consists of a three-layer fully connected network with k, k and k nodes in the respective layers, where k is the feature dimension of the patient raw data X. The first two layers are hidden layers and the last layer is the output layer; the activation function of the first two layers is ReLU and that of the last layer is Tanh. Root mean square propagation (RMSprop) is used as the optimizer of the generator. The data matrix $\bar{X}=\{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n\}$ represents the output of the generator, where $\bar{x}_i = G(\hat{x}_i, y_i)$ is the output value of the generator when the $i$-th patient data is input, $\hat{x}_i$ represents the patient data after the missing values in the raw data of the $i$-th patient are filled with random numbers, $y_i$ represents the label of the $i$-th patient, and $G(\cdot)$ represents the patient data obtained after passing through the generator G.
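A forward pass through a network of this shape can be sketched in plain Python. Several details are assumptions made for illustration: the patient data and the label are combined by concatenation (the text does not say how the two inputs are merged), the weights are drawn uniformly at random, and the helper names `init_layer`, `build_generator` and `generate` are hypothetical. Training with RMSprop is not shown.

```python
import math
import random


def relu(v):
    return max(0.0, v)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights in [-1/sqrt(n_in), 1/sqrt(n_in)] and zero biases (assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_generator(k, u, rng):
    """Three layers with k nodes each; the input is k features plus a u-dim label."""
    return [init_layer(k + u, k, rng), init_layer(k, k, rng), init_layer(k, k, rng)]


def generate(params, x_hat, y):
    """Forward pass: ReLU, ReLU, then Tanh on the output layer."""
    h = list(x_hat) + list(y)  # concatenate patient data and label (assumption)
    for (weights, biases), activation in zip(params, (relu, relu, math.tanh)):
        h = dense(h, weights, biases, activation)
    return h


rng = random.Random(0)
G = build_generator(k=4, u=3, rng=rng)
x_bar = generate(G, x_hat=[0.2, -0.5, 0.7, 0.1], y=[0.0, 1.0, 0.0])
print(len(x_bar))  # 4
```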
The loss of the whole generator consists of three parts:
The first partial loss $L_1$ measures the difference between the patient observed values generated by the generator and the patient's actual observed values, using the mean square error as the loss function. The smaller this difference, the closer the observed values generated by the generator are to the actual observed values.
The second partial loss $L_2$ is the generator loss of a conventional generative adversarial network, except that the Wasserstein distance is used instead of the cross entropy as the loss function. The Wasserstein distance can be approximated by $-\mathbb{E}\left[D(T)\odot(1-M)\right]$, where $D(T)$ represents the result obtained after the patient data $T$ filled by the filling unit passes through the discriminator D, $\odot$ represents the Hadamard product, $M$ represents the mask matrix marking the positions of the missing values in the patient raw data, and $\mathbb{E}[\cdot]$ represents the mathematical expectation. The closer the patient missing values generated by the generator are to the true values, the more easily the discriminator judges them to be observed values and the smaller $L_2$ becomes, and vice versa.
The third partial loss $L_3$ measures the difference between the label that the auxiliary classifier predicts for the patient data filled by the filling unit and the patient's real label, using the cross-entropy function as the loss function. The smaller this difference, the better the quality of the filled data.
The loss function L_G of the whole generator is as follows:
(formula not reproduced in this text)
The formula is built from the following quantities: the patient data obtained by pre-filling the missing values in the i-th patient's original data with 0; the output value of the generator when the i-th patient data is input; the mask matrix M and the mask vector m_i, which records which positions in the i-th patient's original data are observed values (non-missing values) and which are missing values; the mean square error between the observed values generated by the generator for the i-th patient and the actual observed values; the patient data obtained by filling the missing values in the i-th patient's original data with random numbers; G(), the patient data obtained after passing through the generator G; y_i, the real label of the i-th patient; the label predicted for the i-th patient by the auxiliary classifier; T, the patient data filled by the filling unit, and t_i, the filled patient data of the i-th patient; D(), the result obtained by passing patient data through the discriminator; the Hadamard product and the vector inner product; and two hyperparameters, set to 0.3 and 0.2 respectively in this embodiment.
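The formula for L_G does not survive in this text. The block below is one reading that agrees with the symbol definitions given here, offered as a sketch rather than as the patent's exact formula. The notation is assumed: \tilde{x}_i for the 0-prefilled data, \hat{x}_i for the random-filled data, \bar{x}_i for the generator output, \hat{y}_i for the predicted label, and \alpha, \beta for the two hyperparameters. The weighted-sum form, the assignment of the hyperparameters to L_2 and L_3, and the averaging over the n patients are likewise assumptions.

```latex
\begin{aligned}
\bar{x}_i &= G(\hat{x}_i, y_i), \qquad \hat{y}_i = C(t_i) \\
L_1 &= \frac{1}{n}\sum_{i=1}^{n} \left\lVert m_i \odot \tilde{x}_i - m_i \odot \bar{x}_i \right\rVert_2^2 \\
L_2 &= -\,\mathbb{E}\left[(1 - M) \odot D(T)\right] \\
L_3 &= -\frac{1}{n}\sum_{i=1}^{n} y_i \cdot \log \hat{y}_i \\
L_G &= L_1 + \alpha L_2 + \beta L_3
\end{aligned}
```

Under this reading, L_1 only compares positions where the mask is 1 (observed values), L_2 only scores positions where the mask is 0 (filled values) and decreases as the discriminator rates them higher, and L_3 is the cross entropy between the real and predicted labels.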
2.3 filling unit
The filling unit uses the patient data generated by the generator to fill the missing values in the patient original data X and outputs the filled patient data, denoted T, where t_i is the patient data obtained after the i-th patient's original data is filled by the filling unit. t_i is calculated from the output value of the generator when the i-th patient data is input, the patient data obtained by prefilling the missing values in the i-th patient's original data with 0, and the mask vector m_i of the i-th patient, combined using the Hadamard product (the formula is not reproduced in this text).
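The combination performed by the filling unit can be sketched for a single patient as follows. The exact formula is not available in this text, so the element-wise form below (keep the observed entries, take the generator output elsewhere) is an assumption based on the description; the function name `fill` and the sample values are hypothetical. The mask convention, 1 for an observed value and 0 for a missing value, follows claim 3.

```python
def fill(x_zero_filled, generated, mask):
    """Keep observed entries (mask 1) and take generated entries where mask is 0."""
    return [m * x + (1 - m) * g for x, g, m in zip(x_zero_filled, generated, mask)]

x0 = [0.7, 0.0, 0.3, 0.0]   # original row, missing values prefilled with 0
g = [0.6, 0.4, 0.2, 0.9]    # generator output for the same patient
m = [1, 0, 1, 0]            # 1 = observed, 0 = missing
t = fill(x0, g, m)          # -> [0.7, 0.4, 0.3, 0.9]
```

Observed values pass through unchanged, so the generator output only ever replaces entries that were missing.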
2.4 discriminator
The purpose of the discriminator is to determine whether each input data value is an observed value. Its input is the patient data T filled by the filling unit and the patient data obtained by prefilling the missing values in the patient original data with 0; its output is the probability that each data value in these two inputs is an observed value. The discriminator consists of a three-layer fully connected network with k, k and k nodes in the respective layers, where k is the feature dimension of the patient original data X. The first two layers are hidden layers with ReLU activation, the last layer is the output layer and has no activation function, and RMSprop is used as the optimizer of the discriminator. The loss of the discriminator measures the distribution difference between the patient original data and the patient data filled by the filling unit, using the Wasserstein distance instead of the Jensen-Shannon (JS) divergence. The loss function L_D of the discriminator is calculated as follows:
(formula not reproduced in this text)
where D() represents the result obtained after patient data passes through the discriminator; the formula also uses the patient data obtained by prefilling the missing values in the i-th patient's original data with 0, the mask vector m_i of the i-th patient, and t_i, the patient data obtained after the i-th patient's original data is filled by the filling unit. During training, a smaller L_D indicates a smaller Wasserstein distance between the true distribution and the generated distribution, and therefore a better-trained data filling model.
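The formula for L_D is also missing from this text. One plausible form, built only from the quantities named in the definitions (D(), the 0-prefilled data, the mask vector m_i, the filled data t_i and the vector inner product), is sketched below. The symbol \tilde{x}_i for the 0-prefilled data, the signs, and the averaging over n are assumptions and should not be read as the patent's exact formula.

```latex
L_D = \frac{1}{n}\sum_{i=1}^{n}\Big[(1 - m_i)\cdot D(t_i) \;-\; m_i \cdot D(\tilde{x}_i)\Big]
```

In this form, minimizing L_D raises the discriminator's scores at observed positions and lowers them at filled positions, which is the opposite of what the generator's adversarial term rewards.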
2.5 auxiliary classifier
The auxiliary classifier predicts labels for the patient data filled by the filling unit. In the pre-training process of the data filling model, the auxiliary classifier is trained on patient data without missing values to determine its network parameters; in the formal training process of the data filling model, the network parameters of the auxiliary classifier are not updated. The auxiliary classifier consists of a three-layer fully connected network, in which the first two layers are hidden layers and the last layer is the output layer. The numbers of nodes in the first two layers are set manually, to 128 and 64 in this embodiment; the number of nodes in the last layer is u, the number of disease types among the diagnosis results in the acquired patient data, which is 10 in this embodiment. The activation function of the first two layers is ReLU, and the activation function of the last layer is Softmax. The loss of the auxiliary classifier measures the difference between the label predicted by the auxiliary classifier and the patient's real label; the smaller the difference, the better the prediction of the auxiliary classifier. The cross entropy function is used as the loss function, and the loss function L_C of the auxiliary classifier is calculated as follows:
(formula not reproduced in this text)
where the label of the i-th patient is obtained from the diagnosis result in the patient data by one-hot encoding; u is the number of disease types among the diagnosis results in the acquired patient data; the prediction label of the auxiliary classifier for the i-th patient is a vector of length u, each value of which is the probability predicted by the auxiliary classifier that the patient has the corresponding disease, and is obtained by passing t_i through C(); t_i is the patient data obtained after the i-th patient's original data is filled by the filling unit; C() represents the result obtained after patient data passes through the auxiliary classifier; and · denotes the vector inner product.
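The cross entropy between one-hot labels and Softmax outputs can be sketched in a few lines. The inner product of the label with the logarithm of the prediction follows the definitions above; averaging over the patients, the small `eps` guard against log(0), and the sample values are assumptions of this sketch.

```python
import math

def cross_entropy(labels, predictions, eps=1e-12):
    """Mean over patients of -(y_i . log(y_hat_i)); the averaging is assumed."""
    total = 0.0
    for y, y_hat in zip(labels, predictions):
        # Inner product of the one-hot label with the log-probabilities.
        total -= sum(yj * math.log(max(pj, eps)) for yj, pj in zip(y, y_hat))
    return total / len(labels)

y = [[0, 1, 0], [1, 0, 0]]                       # one-hot labels, u = 3
y_hat = [[0.1, 0.8, 0.1], [0.5, 0.25, 0.25]]     # classifier outputs
loss = cross_entropy(y, y_hat)
```

Because the labels are one-hot, only the predicted probability of the true disease type contributes to each patient's term.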
The training of the whole data filling model is divided into the following two stages:
The first stage is the pre-training process, whose purpose is to train the auxiliary classifier and determine its network parameters. To train the auxiliary classifier, its network parameters are first initialized; patient data without missing values is then used as training data, the loss function of the auxiliary classifier is calculated, and the network parameters are updated by gradient back propagation until the auxiliary classifier converges. Once determined, the network parameters of the auxiliary classifier are not updated during the formal training of the data filling model in the second stage.
The second stage is the training of the data filling model, in which the strategy is to train the discriminator first and then the generator. Patient data containing missing values is input first; the discriminator calculates its loss and its network parameters are updated by gradient back propagation; the generator then calculates its loss and its network parameters are updated by gradient back propagation. The discriminator and the generator continue this adversarial training until the data filling model converges. In the initial stage of training, the discriminator can easily distinguish which data are observed values and which are filled values. As training proceeds, the generator learns the distribution of the patient data and the generated data become very close to the patient observed values, so that the discriminator can no longer tell observed values from filled values; the generator and the discriminator reach Nash equilibrium, and the trained data filling model has converged.
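The order of updates in the two stages can be summarized as a control-flow skeleton. The three step functions are hypothetical placeholders for the actual parameter updates, and fixed epoch counts stand in for the convergence checks described above; only the ordering is taken from the text.

```python
def train(pretrain_classifier_step, discriminator_step, generator_step,
          pretrain_epochs, adversarial_epochs):
    """Run stage 1 (classifier only), then stage 2 (discriminator, then generator)."""
    log = []
    for _ in range(pretrain_epochs):
        pretrain_classifier_step()   # stage 1: updates classifier parameters only
        log.append("C")
    # From here on the classifier is frozen: no classifier update is called again.
    for _ in range(adversarial_epochs):
        discriminator_step()         # update the discriminator on its loss
        log.append("D")
        generator_step()             # then update the generator on its loss
        log.append("G")
    return log

noop = lambda: None
order = train(noop, noop, noop, pretrain_epochs=2, adversarial_epochs=2)
```

The returned log makes the schedule explicit: all classifier updates come first, followed by alternating discriminator and generator updates.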
After the data filling model is trained, it can be used to fill missing values in patient data. In the filling process, the patient data to be filled is first preprocessed: one-hot encoding is applied to the discrete data and min-max normalization is applied to the continuous data. The patient diagnosis result is then selected as the patient label, the patient data containing missing values and the patient label are input to the data filling model, and the filled patient data is output after passing through the data processing unit, the generator and the filling unit.
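The two preprocessing operations can be illustrated on single columns. This is a minimal sketch that assumes the columns shown contain no missing entries; the column names, the sorted category order, and the handling of a constant column are illustrative choices rather than details from the patent.

```python
def one_hot(values):
    """Map each discrete value to a 0/1 vector over the sorted distinct categories."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def min_max(values):
    """Scale continuous values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

sex = one_hot(["F", "M", "F"])      # -> [[1, 0], [0, 1], [1, 0]]
age = min_max([20.0, 60.0, 40.0])   # -> [0.0, 1.0, 0.5]
```

After this step every discrete feature is a group of 0/1 columns and every continuous feature lies in [0, 1].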
In another aspect, the invention also provides an unbalanced medical data missing value filling system which, as shown in fig. 5, comprises a data acquisition module, a data filling model construction module and a data filling module; the data acquisition module is used for acquiring patient data from the information system of a hospital;
the data filling model construction module is used for constructing and training a data filling model; the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator and an auxiliary classifier, wherein the generator and the discriminator form a generative adversarial network;
the data processing unit records the positions of the missing values in the patient original data by using a mask matrix, pre-fills the missing values in the patient original data with 0, fills the missing values in the patient original data with random numbers, and inputs the result into the generator;
the generator is used for learning the distribution of the input patient data, generating new patient data and inputting it to the filling unit, and the input of the generator comprises the patient data and a patient label;
the filling unit is used for filling the missing values in the patient original data by using the new patient data generated by the generator;
the input of the discriminator comprises the patient data filled by the filling unit and the patient data in which the missing values in the patient original data are pre-filled with 0, and the discriminator outputs the probability that each data value is an observed value;
the auxiliary classifier is used for performing prediction on the patient data filled by the filling unit and feeding the prediction result back to the generator;
the training process comprises pre-training the auxiliary classifier and formally training the data filling model, wherein in the pre-training process the auxiliary classifier is trained by using patient data without missing values and the network parameters of the auxiliary classifier are determined, and in the formal training process the network parameters of the auxiliary classifier are not updated; in the formal training process the discriminator is trained first and then the generator, and the discriminator and the generator continue adversarial training until the data filling model converges;
the data filling module is used for inputting patient data requiring missing value filling and the patient label into the trained data filling model, and outputting the filled patient data after passing through the data processing unit, the generator and the filling unit.
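The three outputs of the data processing unit described above (mask matrix, 0-prefilled data and random-filled data) can be sketched as follows. Representing a missing value as `None` and the function name `process` are assumptions of this sketch; the mask convention (1 for observed, 0 for missing) and the standard normal random numbers follow claim 3.

```python
import random

def process(raw_rows, seed=0):
    """raw_rows: list of patient rows, with None marking a missing value."""
    rng = random.Random(seed)
    # Mask matrix: 1 where a value was observed, 0 where it is missing.
    mask = [[0 if v is None else 1 for v in row] for row in raw_rows]
    # Missing values prefilled with 0.
    zero_filled = [[0.0 if v is None else v for v in row] for row in raw_rows]
    # Missing values filled with standard normal random numbers.
    random_filled = [[rng.gauss(0.0, 1.0) if v is None else v for v in row]
                     for row in raw_rows]
    return mask, zero_filled, random_filled

X = [[0.7, None, 0.3], [None, 0.5, None]]
M, X_zero, X_rand = process(X)
```

The random-filled rows are what the generator receives, while the mask and the 0-prefilled rows are reused by the filling unit and the discriminator.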
Corresponding to the embodiment of the unbalanced medical data missing value filling method, the invention further provides an embodiment of the unbalanced medical data missing value filling device.
Referring to fig. 6, an unbalanced medical data missing value filling device provided by an embodiment of the present invention includes a memory and one or more processors, where the memory stores executable code and the processors, when executing the executable code, implement the unbalanced medical data missing value filling method of the above embodiment.
The embodiment of the unbalanced medical data missing value filling device provided by the invention can be applied to any equipment with data processing capability, such as a computer. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the device in a logical sense is formed by the processor of the equipment in which it is located reading the corresponding computer program instructions from the nonvolatile memory into memory. In terms of hardware, fig. 6 shows a hardware structure diagram of the equipment with data processing capability in which the unbalanced medical data missing value filling device provided by the invention is located. In addition to the processor, memory, network interface and nonvolatile memory shown in fig. 6, the equipment in which the device is located may generally include other hardware according to its actual function, which is not described here.
The implementation process of the functions and roles of each unit in the above-mentioned device is specifically detailed in the implementation process of the corresponding steps in the above-mentioned method, and will not be described herein again.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the present invention. Those of ordinary skill in the art can understand and implement the invention without creative effort.
The embodiment of the present invention also provides a computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the unbalanced medical data missing value filling method of the above embodiment.
The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any equipment with data processing capability described in any of the previous embodiments. The computer-readable storage medium may also be an external storage device of such equipment, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash memory card (Flash Card) provided on the equipment. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the equipment. The computer-readable storage medium is used for storing the computer program and other programs and data required by the equipment, and may also be used for temporarily storing data that has been output or is to be output.
The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made thereto are within the spirit of the invention and the scope of the appended claims.
Claims (10)
1. An unbalanced medical data missing value filling method, comprising:
acquiring patient data from the information system of a hospital;
filling the missing values in the patient data by using a data filling model;
the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator and an auxiliary classifier; the generator and the discriminator form a generative adversarial network;
the data processing unit records the positions of the missing values in the patient original data by using a mask matrix, pre-fills the missing values in the patient original data with 0, fills the missing values in the patient original data with random numbers, and inputs the result into the generator;
the generator is used for learning the distribution of the input patient data, generating new patient data and inputting it to the filling unit, and the input of the generator comprises the patient data and a patient label;
the filling unit is used for filling the missing values in the patient original data by using the new patient data generated by the generator;
the input of the discriminator comprises the patient data filled by the filling unit and the patient data in which the missing values in the patient original data are pre-filled with 0, and the discriminator outputs the probability that each data value is an observed value;
the auxiliary classifier is used for performing prediction on the patient data filled by the filling unit and feeding the prediction result back to the generator;
the training process comprises pre-training the auxiliary classifier and formally training the data filling model, wherein in the pre-training process the auxiliary classifier is trained by using patient data without missing values and the network parameters of the auxiliary classifier are determined, and in the formal training process the network parameters of the auxiliary classifier are not updated; in the formal training process the discriminator is trained first and then the generator, and the discriminator and the generator continue adversarial training until the data filling model converges;
patient data requiring missing value filling and the patient label are input into the trained data filling model, and the filled patient data is output after passing through the data processing unit, the generator and the filling unit.
2. The unbalanced medical data missing value filling method according to claim 1, wherein the acquired patient data is input into the data filling model after data preprocessing, specifically: one-hot encoding is performed on the discrete data, and min-max normalization is performed on the continuous data.
3. The unbalanced medical data missing value filling method according to claim 1, wherein the patient original data is recorded as a matrix X made up of per-patient vectors, the i-th of which is the original data of the i-th patient, n being the number of patients and k the number of features; the mask matrix is recorded as M, in which the mask vector m_i marks the observed values and the missing values in the i-th patient's original data, with 1 for an observed value and 0 for a missing value; the missing values in the patient original data are pre-filled with 0 to give a filled data matrix, in which the i-th vector is the patient data obtained after the i-th patient's original data is pre-filled with 0; a random matrix is created in which the i-th vector is a randomly generated random-number vector following the standard normal distribution, used to fill the missing values in the i-th patient's original data; the missing values in the patient original data are filled with the random numbers in the random matrix to give a second filled data matrix, in which the i-th vector is the patient data obtained after the missing values in the i-th patient's original data are filled with random numbers, computed from the original data, the mask vector and the random-number vector using the Hadamard product (the formula is not reproduced in this text).
4. The unbalanced medical data missing value filling method according to claim 3, wherein the loss function of the generator consists of three parts: the first part computes the difference between the observed values generated by the generator and the actual observed values, using the mean square error as the loss function; the second part is the generator loss of the generative adversarial network, using the Wasserstein distance as the loss function; the third part is the difference between the label that the auxiliary classifier predicts for the patient data filled by the filling unit and the patient's real label, using the cross entropy function as the loss function.
5. The unbalanced medical data missing value filling method according to claim 4, wherein the loss function of the generator, the first partial loss function, the second partial loss function and the third partial loss function are each given by a formula (the formulas are not reproduced in this text);
wherein the formulas use the following quantities: the output value of the generator when the i-th patient data is input; G(), the patient data obtained after passing through the generator; y_i, the real label of the i-th patient; D(), the result obtained after patient data passes through the discriminator; t_i, the patient data obtained after the i-th patient's original data is filled by the filling unit; the prediction label of the auxiliary classifier for the i-th patient; two hyperparameters; and the vector inner product, denoted ·.
6. The unbalanced medical data missing value filling method according to claim 3, wherein the filling unit uses the patient data generated by the generator to fill the missing values in the patient original data X, and the filled data matrix is recorded as T, wherein t_i represents the patient data obtained after the i-th patient's original data is filled by the filling unit and is calculated by a formula (not reproduced in this text) that uses the output value of the generator when the i-th patient data is input.
7. The unbalanced medical data missing value filling method according to claim 3, wherein the loss function L_D of the discriminator is calculated as follows:
(formula not reproduced in this text)
where D() represents the result obtained after patient data passes through the discriminator, t_i represents the patient data obtained after the i-th patient's original data is filled by the filling unit, the formula also uses the patient data obtained by prefilling the missing values in the i-th patient's original data with 0, and · denotes the vector inner product.
8. The unbalanced medical data missing value filling method according to claim 1, wherein the loss function L_C of the auxiliary classifier is calculated as follows:
(formula not reproduced in this text)
where the label of the i-th patient is obtained from the diagnosis result in the patient data by one-hot encoding; u is the number of disease types among the diagnosis results in the patient data; the prediction label of the auxiliary classifier for the i-th patient is a vector of length u, each value of which is the probability predicted by the auxiliary classifier that the patient has the corresponding disease, and is obtained by passing t_i through C(); n is the number of patients; t_i represents the patient data obtained after the i-th patient's original data is filled by the filling unit; C() represents the result obtained after patient data passes through the auxiliary classifier; and · denotes the vector inner product.
9. The unbalanced medical data missing value filling method according to any one of claims 1 to 8, wherein in the formal training of the data filling model, patient data containing missing values is input first, the discriminator calculates its loss, and the network parameters of the discriminator are updated by gradient back propagation; the generator then calculates its loss, and the network parameters of the generator are updated by gradient back propagation; the discriminator and the generator continue adversarial training until the data filling model converges.
10. An unbalanced medical data missing value filling system, characterized by comprising a data acquisition module, a data filling model construction module and a data filling module; the data acquisition module is used for acquiring patient data from the information system of a hospital;
the data filling model construction module is used for constructing and training a data filling model; the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator and an auxiliary classifier, wherein the generator and the discriminator form a generative adversarial network;
the data processing unit records the positions of the missing values in the patient original data by using a mask matrix, pre-fills the missing values in the patient original data with 0, fills the missing values in the patient original data with random numbers, and inputs the result into the generator;
the generator is used for learning the distribution of the input patient data, generating new patient data and inputting it to the filling unit, and the input of the generator comprises the patient data and a patient label;
the filling unit is used for filling the missing values in the patient original data by using the new patient data generated by the generator;
the input of the discriminator comprises the patient data filled by the filling unit and the patient data in which the missing values in the patient original data are pre-filled with 0, and the discriminator outputs the probability that each data value is an observed value;
the auxiliary classifier is used for performing prediction on the patient data filled by the filling unit and feeding the prediction result back to the generator;
the training process comprises pre-training the auxiliary classifier and formally training the data filling model, wherein in the pre-training process the auxiliary classifier is trained by using patient data without missing values and the network parameters of the auxiliary classifier are determined, and in the formal training process the network parameters of the auxiliary classifier are not updated; in the formal training process the discriminator is trained first and then the generator, and the discriminator and the generator continue adversarial training until the data filling model converges;
the data filling module is used for inputting patient data requiring missing value filling and the patient label into the trained data filling model, and outputting the filled patient data after passing through the data processing unit, the generator and the filling unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311283938.4A CN117034142B (en) | 2023-10-07 | 2023-10-07 | Unbalanced medical data missing value filling method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117034142A CN117034142A (en) | 2023-11-10 |
CN117034142B (en) | 2024-02-09 |
Family
ID=88630271
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311283938.4A Active CN117034142B (en) | 2023-10-07 | 2023-10-07 | Unbalanced medical data missing value filling method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117034142B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||