CN117034142B - Unbalanced medical data missing value filling method and system - Google Patents

Publication number
CN117034142B
Authority
CN
China
Prior art keywords
data
patient
filling
generator
patient data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311283938.4A
Other languages
Chinese (zh)
Other versions
CN117034142A (en)
Inventor
李劲松
朱伟伟
池胜强
田雨
周天舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311283938.4A
Publication of CN117034142A
Application granted
Publication of CN117034142B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Public Health (AREA)
  • Probability & Statistics with Applications (AREA)
  • Primary Health Care (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method and system for filling missing values in unbalanced medical data. The losses of the generator and the discriminator are constructed using the Earth Mover's (Wasserstein) distance, which addresses the vanishing-gradient problem that the generator may encounter during training. Patient labels are added to the generator as a supervision signal, increasing the diversity of the patient data it generates. An auxiliary classifier is added to make predictions on the patient data output by the filling unit, and its predictions are fed back to the generator to improve generation quality. The missing parts of the patient data are filled with random numbers and the result is used as the generator input, so that the generator learns the relationship between the missing values and the other data; this avoids the need to collect a sufficient number of complete samples for training. The generator loss consists of three parts; by constructing different losses, the generator considers the filling effect from different angles, which improves the accuracy of the filling results.

Description

A method and system for filling missing values in unbalanced medical data

Technical field

The invention belongs to the field of medical information technology, and in particular relates to a method and system for filling missing values in unbalanced medical data.

Background

Electronic health records (EHR) store information related to patient visits, including basic patient information, diagnoses, examinations, and medications. This information is the basis for medical data mining. However, factors such as faulty collection equipment and unstable transmission leave electronic health records with large amounts of missing data. Missing data not only increase the complexity and difficulty of statistical analysis but also make the analysis results inaccurate. Solving the missing-value filling problem in electronic health records is therefore of great significance for improving the quality of data mining.

A generative adversarial network (GAN) is a neural network that captures the distribution of the training data and creates new data from the learned distribution; it is commonly used in image generation, text generation, and other fields. In recent years, experts and scholars have also applied GAN methods to missing-value filling. In practice, however, hospital electronic medical record data are often unbalanced, with large differences in the number of patients across disease types, and applying a GAN directly to missing-value filling on unbalanced medical data raises problems. On the one hand, the filling results lack diversity: on unbalanced samples, the generator deceives the discriminator by attending only to the filling quality of the classes with many samples and ignoring that of the classes with few samples, so that the filled data end up resembling only one type of disease. On the other hand, when a GAN is trained on unbalanced data, the generator is more prone to vanishing gradients. The "Wasserstein GAN" article points out that, under the optimal discriminator, minimizing the generator loss is equivalent to minimizing the Jensen-Shannon divergence (JSD) between the real distribution and the generated distribution. When the two distributions do not overlap, or their overlap is negligible, the JS divergence is the fixed constant log 2; the generator's gradient then vanishes and the network is difficult to train.
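The constant-JS-divergence effect can be checked numerically. The sketch below is only an illustration on toy discrete distributions (point masses on a grid with unit spacing); the function names and the grid are assumptions made for this example and do not come from the patent. It computes the JS divergence and the one-dimensional Wasserstein-1 distance between a "real" point mass and a "generated" point mass placed at increasing distances.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        support = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[support] * np.log(a[support] / b[support])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q):
    """Wasserstein-1 distance on the grid 0, 1, 2, ... via the CDF difference."""
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

def point_mass(position, size=10):
    """Hypothetical helper: all probability mass on one grid point."""
    d = np.zeros(size)
    d[position] = 1.0
    return d

real = point_mass(0)
for shift in (1, 4, 8):
    fake = point_mass(shift)
    # The JS divergence stays at log 2 for every shift; the Wasserstein
    # distance equals the shift, so it still says how far apart they are.
    print(shift, js_divergence(real, fake), wasserstein_1d(real, fake))
```

Because the JS divergence does not change as the generated distribution moves closer to the real one, it gives the generator no useful gradient in this regime, whereas the Wasserstein distance decreases steadily.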

Summary of the invention

The purpose of the present invention is to address the shortcomings of the prior art by providing a method and system, based on a generative adversarial network, for filling missing values in unbalanced medical data, thereby improving the quality of missing-value filling in medical data.

The purpose of the present invention is achieved through the following technical solutions:

In a first aspect, the present invention provides a method for filling missing values in unbalanced medical data, the method comprising:

obtaining patient data from the hospital's information system;

filling the missing values in the patient data using a data filling model;

The data filling model includes a data processing unit, a generator, a filling unit, a discriminator, and an auxiliary classifier; the generator and the discriminator constitute a generative adversarial network;

In the data processing unit, a mask matrix records the positions of the missing values in the original patient data; the missing values in the original patient data are pre-filled with 0; and the missing values in the original patient data are filled with random numbers, the result being input to the generator;

The generator learns the distribution of the input patient data, generates new patient data, and passes it to the filling unit; the input of the generator includes the patient data and the patient labels;

The filling unit fills the missing values in the original patient data using the new patient data generated by the generator;

The discriminator examines each item of input patient data and judges whether it is an observed value. The input of the discriminator includes the patient data output by the filling unit and the patient data in which the missing values were pre-filled with 0; its output is the probability that each item of patient data is an observed value;

The auxiliary classifier makes predictions on the patient data output by the filling unit and feeds the prediction results back to the generator;

Training comprises pre-training the auxiliary classifier and formally training the data filling model. In pre-training, the auxiliary classifier is trained on patient data without missing values to determine its network parameters, which are not updated during formal training. In formal training, the discriminator is trained first and then the generator; the two are trained adversarially until the data filling model converges;

The patient data whose missing values are to be filled, together with the patient labels, are input into the trained data filling model; after passing through the data processing unit, the generator, and the filling unit, the filled patient data are output.

Further, the acquired patient data are preprocessed before being input into the data filling model: discrete data are one-hot encoded, and continuous data are min-max normalized.

Further, the original patient data are denoted $X=\{x_1, x_2, \dots, x_n\}$, where $x_i$ is the original data of the i-th patient, n is the number of patients, and k is the number of features. The mask matrix is denoted $M=\{m_1, m_2, \dots, m_n\}$, where $m_i$ marks the observed and missing values in the original data of the i-th patient, with 1 for an observed value and 0 for a missing value. The missing values in the original patient data are pre-filled with 0, and the resulting data matrix is denoted $\tilde{X}=\{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n\}$, where $\tilde{x}_i$ is the data of the i-th patient after its missing values have been pre-filled with 0. A random matrix is created and denoted $Z=\{z_1, z_2, \dots, z_n\}$, where $z_i$ is a randomly generated vector of standard normal random numbers used to fill the missing values in the original data of the i-th patient. The missing values in the original patient data are filled with the random numbers in the random matrix, and the resulting data matrix is denoted $\hat{X}=\{\hat{x}_1, \hat{x}_2, \dots, \hat{x}_n\}$, where $\hat{x}_i$ is the data of the i-th patient after its missing values have been filled with random numbers, $\hat{x}_i = m_i \odot \tilde{x}_i + (1 - m_i) \odot z_i$, and $\odot$ denotes the Hadamard product.

Further, the loss function of the generator consists of three parts. The first part measures the difference between the observed values generated by the generator and the actual observed values, using the mean squared error as the loss function. The second part is the generator loss of the generative adversarial network, using the Wasserstein distance as the loss function. The third part measures the difference between the labels that the auxiliary classifier predicts for the patient data output by the filling unit and the patients' true labels, using the cross-entropy function as the loss function.

Further, the loss function of the generator is: [equation not preserved in the extracted text];

the first part of the loss function is: [equation not preserved in the extracted text];

the second part of the loss function is: [equation not preserved in the extracted text];

the third part of the loss function is: [equation not preserved in the extracted text];

where $\bar{x}_i$ is the output of the generator when the data of the i-th patient is the input, G() denotes the patient data obtained from the generator, $y_i$ is the true label of the i-th patient, D() denotes the result of passing patient data through the discriminator, $t_i$ is the data of the i-th patient after the filling unit has filled its original data, $\hat{y}_i$ is the label that the auxiliary classifier predicts for the i-th patient, the two weighting coefficients are hyperparameters, and · denotes the vector inner product.
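Since the loss formulas themselves are not reproduced here, the block below gives only one plausible form that is consistent with the three-part description above; it should not be read as the patent's own equations. The symbols $m_i$ (mask vector), $\tilde{x}_i$ (zero-prefilled data), $\bar{x}_i$ (generator output), $\hat{y}_i$ (predicted label), the one-hot encoding of $y_i$, the names $\alpha$ and $\beta$ for the two hyperparameters, and the choice of which terms they weight are all assumptions.

```latex
% Assumed form only: a sketch consistent with the prose description,
% not the formulas of the patent.
\begin{aligned}
L_1 &= \frac{1}{n}\sum_{i=1}^{n} \left\lVert m_i \odot \left(\tilde{x}_i - \bar{x}_i\right) \right\rVert_2^2
  && \text{(mean squared error on observed entries)} \\
L_2 &= -\frac{1}{n}\sum_{i=1}^{n} (1 - m_i) \cdot D(t_i)
  && \text{(Wasserstein generator term on filled entries)} \\
L_3 &= -\frac{1}{n}\sum_{i=1}^{n} y_i \cdot \log \hat{y}_i
  && \text{(cross-entropy, } y_i \text{ assumed one-hot)} \\
L_G &= L_1 + \alpha L_2 + \beta L_3
  && \text{(assumed weighting)}
\end{aligned}
```

Under this reading, $L_1$ keeps the generator faithful to what was actually observed, $L_2$ rewards filled values that the discriminator scores as observed, and $L_3$ pushes the filled record toward the correct disease class.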

Further, in the filling unit, the patient data generated by the generator, $\bar{X}$, are used to fill the missing values in the original patient data X. The filled data matrix is denoted $T=\{t_1, t_2, \dots, t_n\}$, where $t_i$ is the data of the i-th patient after the filling unit has filled its original data, $t_i = m_i \odot \tilde{x}_i + (1 - m_i) \odot \bar{x}_i$, and $\bar{x}_i$ is the output of the generator when the data of the i-th patient is the input.
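The filling step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming missing values are encoded as NaN in the raw array; the function name `impute` and the `generated` array (a stand-in for the generator output) are hypothetical and chosen only for this example.

```python
import numpy as np

def impute(x_zero_filled, mask, generated):
    """Keep observed entries; take the generator's value only where data was missing."""
    return mask * x_zero_filled + (1 - mask) * generated

# Two patients, three features; NaN marks a missing value (assumed encoding).
x_raw = np.array([[0.2, np.nan, 0.7],
                  [np.nan, 0.5, 0.1]])
mask = (~np.isnan(x_raw)).astype(float)   # 1 = observed, 0 = missing
x_zero = np.nan_to_num(x_raw, nan=0.0)    # missing values pre-filled with 0

# Stand-in for the generator output; a trained network would produce this.
generated = np.array([[0.25, 0.40, 0.65],
                      [0.90, 0.55, 0.15]])

t = impute(x_zero, mask, generated)
print(t)
```

Only the two missing positions take values from `generated`; every observed value passes through unchanged, which is why the generator's output at observed positions affects the result only through the loss, not through the filled data.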

Further, the loss function $L_D$ of the discriminator is calculated as follows:

[equation not preserved in the extracted text];

where D() denotes the result of passing patient data through the discriminator, $\tilde{x}_i$ is the data of the i-th patient after its missing values have been pre-filled with 0, $t_i$ is the data of the i-th patient after the filling unit has filled its original data, and · denotes the vector inner product.

Further, during formal training of the data filling model, patient data containing missing values are input first; the discriminator computes its loss, and the gradient is back-propagated to update the discriminator's network parameters. The generator then computes its loss, and the gradient is back-propagated to update the generator's network parameters. The discriminator and the generator are trained adversarially in this way until the data filling model converges.

In a second aspect, the present invention provides a system for filling missing values in unbalanced medical data. The system includes a data acquisition module, a data filling model building module, and a data filling module. The data acquisition module obtains patient data from the hospital's information system;

The data filling model building module builds and trains the data filling model. The data filling model includes a data processing unit, a generator, a filling unit, a discriminator, and an auxiliary classifier; the generator and the discriminator constitute a generative adversarial network;

In the data processing unit, a mask matrix records the positions of the missing values in the original patient data; the missing values in the original patient data are pre-filled with 0; and the missing values in the original patient data are filled with random numbers, the result being input to the generator;

The generator learns the distribution of the input patient data, generates new patient data, and passes it to the filling unit; the input of the generator includes the patient data and the patient labels;

The filling unit fills the missing values in the original patient data using the new patient data generated by the generator;

The discriminator examines each item of input patient data and judges whether it is an observed value. The input of the discriminator includes the patient data output by the filling unit and the patient data in which the missing values were pre-filled with 0; its output is the probability that each item of patient data is an observed value;

The auxiliary classifier makes predictions on the patient data output by the filling unit and feeds the prediction results back to the generator;

Training comprises pre-training the auxiliary classifier and formally training the data filling model. In pre-training, the auxiliary classifier is trained on patient data without missing values to determine its network parameters, which are not updated during formal training. In formal training, the discriminator is trained first and then the generator; the two are trained adversarially until the data filling model converges;

The data filling module inputs the patient data whose missing values are to be filled, together with the patient labels, into the trained data filling model and, after the data processing unit, the generator, and the filling unit, outputs the filled patient data.

In a third aspect, the present invention provides a device for filling missing values in unbalanced medical data, including a memory and one or more processors. The memory stores executable code; when the processors execute the executable code, the method for filling missing values in unbalanced medical data described in the first aspect is implemented.

In a fourth aspect, the present invention provides a computer-readable storage medium on which a program is stored. When the program is executed by a processor, the method for filling missing values in unbalanced medical data described in the first aspect is implemented.

The beneficial effects of the present invention are:

1. The present invention uses the Earth Mover's distance (Wasserstein distance) instead of the JS divergence to construct the losses of the generator and the discriminator. The Wasserstein distance is smoother than the JS divergence: even when two distributions do not overlap, it still reflects how far apart they are, which addresses the vanishing-gradient problem that the generator may encounter during training.

2. The present invention adds patient labels to the generator as a supervision signal, helping the generator distinguish different patient data in unbalanced electronic medical records and increasing the diversity of the patient data it generates.

3. The present invention adds an auxiliary classifier that makes classification predictions on the patient data output by the filling unit and feeds the prediction results back to the generator, improving the generator's output.

4. The present invention fills the missing parts of the patient data with random numbers and uses the filled patient data as the generator input, so that the generator learns the relationship between the missing values and the other data; this avoids the need to collect a sufficient number of complete samples for training.

5. The generator loss proposed by the present invention consists of three parts: the loss between the patient observed values generated by the generator and the patients' actual observed values; the loss between the discriminator's prediction for the missing values generated by the generator and the true values; and the loss between the labels that the auxiliary classifier predicts for the patient data output by the filling unit and the patients' true labels. By constructing different losses, the generator considers the filling effect from different angles, which improves the accuracy of the filling results.

Description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a flow chart of a method for filling missing values in unbalanced medical data provided by an exemplary embodiment;

Figure 2 shows the original patient data in table form, provided by an exemplary embodiment;

Figure 3 is a schematic diagram of the data filling model architecture provided by an exemplary embodiment;

Figure 4 is a schematic diagram of the processing performed by the data processing unit, provided by an exemplary embodiment;

Figure 5 is a structural diagram of a system for filling missing values in unbalanced medical data provided by an exemplary embodiment;

Figure 6 is a structural diagram of a device for filling missing values in unbalanced medical data provided by an exemplary embodiment.

Detailed description

For a better understanding of the technical solution of the present application, the embodiments of the present application are described in detail below with reference to the accompanying drawings.

The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.

The terminology used in the embodiments of the present application is only for the purpose of describing specific embodiments and is not intended to limit the present application. As used in the embodiments and the appended claims, the singular forms "a", "said", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The present invention provides a method for filling missing values in unbalanced medical data. As shown in Figure 1, the method has two parts: data acquisition and data filling. The data acquisition part extracts structured patient data from the hospital's information system; the data filling part uses a data filling model to fill the missing values in the patient data. The implementation of each part is described in detail below.

1. Data acquisition

First, structured patient data are extracted from the hospital's information system. These data include basic patient information, diagnosis results, examination information, medication information, and so on.

The extracted patient data are then preprocessed, specifically:

The extracted discrete data are one-hot encoded; discrete data include features such as the patient's diagnoses and medications;

The extracted continuous data are min-max normalized; continuous data include features such as the patient's weight, age, and blood pressure. The min-max normalization formula is $x'_{ij} = \dfrac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$, where $x_{ij}$ is the value of the j-th feature of the i-th patient, $x_j^{\min}$ is the minimum value of the j-th feature, and $x_j^{\max}$ is the maximum value of the j-th feature.
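The two preprocessing operations can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes discrete features arrive as integer category codes with -1 for a missing entry, that continuous features use NaN for a missing entry, and that missing entries are left as NaN so a later step can record them in a mask. The function names are hypothetical.

```python
import numpy as np

def one_hot(codes, num_classes):
    """One-hot encode integer category codes; a code of -1 (assumed to mean
    'missing') yields an all-NaN row so the gap stays visible downstream."""
    codes = np.asarray(codes)
    out = np.full((codes.shape[0], num_classes), np.nan)
    observed = codes >= 0
    out[observed] = np.eye(num_classes)[codes[observed]]
    return out

def min_max_normalize(column):
    """Scale one continuous feature to [0, 1], ignoring NaN (missing) entries.
    Assumes the feature is not constant, otherwise the denominator is zero."""
    column = np.asarray(column, dtype=float)
    lo, hi = np.nanmin(column), np.nanmax(column)
    return (column - lo) / (hi - lo)

weight_kg = [50.0, np.nan, 70.0, 90.0]   # one missing weight
diagnosis = [0, 2, -1, 1]                # one missing diagnosis code

print(min_max_normalize(weight_kg))
print(one_hot(diagnosis, num_classes=3))
```

Computing the minimum and maximum with `np.nanmin` and `np.nanmax` keeps missing entries from distorting the scale, and the NaN entries survive normalization unchanged.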

Finally, the patient's diagnosis result is used as the patient label, denoted $Y=\{y_1, y_2, \dots, y_n\}$, where n is the number of patients, $y_i$ is the label of the i-th patient, and u is the number of disease types among the diagnosis results in the extracted patient data, which is 10 in this embodiment. The original patient data are denoted $X=\{x_1, x_2, \dots, x_n\}$, where $x_i$ is the original data of the i-th patient and k is the number of extracted patient features. Figure 2 shows the original patient data X arranged as a table, in which each row represents one patient's data, $x_{ij}$ is the value of the j-th feature of the i-th patient, and N indicates that the feature is missing for that patient.

2. Data filling

The data filling model is used to fill the missing values in the patient data. As shown in Figure 3, the data filling model includes a data processing unit, a generator G, a filling unit, a discriminator D, and an auxiliary classifier C. The generator G and the discriminator D constitute a generative adversarial network.

Specifically, the data processing unit fills the missing values of the original patient data with random numbers: because the original patient data obtained in the data acquisition part contain missing values, normal numerical operations cannot be performed on them, so the data processing unit must process them first. The generator learns the distribution of the input patient data and generates new patient data. The filling unit uses the new patient data generated by the generator to fill the missing values of the original patient data. The discriminator examines each item of input data and judges whether it is an observed value (a value that is not missing). The auxiliary classifier makes predictions on the filled patient data and feeds the prediction results back to the generator. Each part of the data filling model is introduced below.

2.1 Data processing unit

Because the raw patient data obtained in the data acquisition step contains missing values, ordinary numerical operations cannot be performed on it. The data processing unit therefore processes the raw patient data in two parts.

The first part records the positions of the missing values in the raw patient data and pre-fills those missing values with 0. First, a mask matrix $M = \{m_1, m_2, \dots, m_n\}$ records the positions of the missing values: the mask vector $m_i$ marks which positions in the raw data of the $i$-th patient are observed values and which are missing values, with 1 denoting an observed value and 0 a missing value. For example, [1,1,0,1] means that the patient's third feature is missing and all other features are observed. The missing values in the raw patient data are then filled with 0, and the data matrix $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n\}$ denotes the result, where $\tilde{x}_i$ is the data of the $i$-th patient after its missing values have been pre-filled with 0.

The second part fills the missing values in the raw patient data with random numbers. First, a random matrix $Z = \{z_1, z_2, \dots, z_n\}$ is created, where $z_i = (z_{i1}, \dots, z_{ik})$ is a randomly generated vector of standard normal random numbers used to fill the missing values in the raw data of the $i$-th patient, and $z_{ik}$ is the random number used to fill the $k$-th feature of the $i$-th patient. Then $\hat{x}_i$ denotes the patient data obtained after the missing values in the raw data of the $i$-th patient are filled with random numbers; it is computed as $\hat{x}_i = m_i \odot \tilde{x}_i + (1 - m_i) \odot z_i$, where $m_i$ is the mask vector of the $i$-th patient (1 for observed, 0 for missing), $\tilde{x}_i$ is the zero-pre-filled data of the $i$-th patient, and $\odot$ denotes the Hadamard product. The data matrix $\hat{X} = \{\hat{x}_1, \hat{x}_2, \dots, \hat{x}_n\}$ denotes the patient data after all missing values in the raw patient data have been filled with random numbers.

Figure 4 shows an example of the data processing unit at work. The raw patient data $X$ obtained in the data acquisition step contains missing values; the data matrix $\tilde{X}$ is the patient data after the missing values in $X$ have been pre-filled with 0; the entries of the random matrix $Z$ are random numbers drawn from the standard normal distribution; and the mask matrix $M$ marks the positions of the observed values and missing values in $X$.
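The two processing steps of the data processing unit can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the helper name `process_raw_data`, the use of `None` to stand for the missing marker N, and the fixed random seed are all choices made for the example.

```python
import random

def process_raw_data(X, seed=0):
    """Return (M, X_zero, Z, X_rand) for a table X in which None marks a missing value."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    # Mask matrix: 1 for an observed value, 0 for a missing value.
    M = [[0 if v is None else 1 for v in row] for row in X]
    # Part 1: pre-fill the missing values with 0.
    X_zero = [[0.0 if v is None else float(v) for v in row] for row in X]
    # Part 2: draw standard-normal noise and keep it only at missing positions,
    # i.e. m * x_zero + (1 - m) * z applied element-wise (Hadamard product).
    Z = [[rng.gauss(0.0, 1.0) for _ in row] for row in X]
    X_rand = [[m * x + (1 - m) * z for m, x, z in zip(mr, xr, zr)]
              for mr, xr, zr in zip(M, X_zero, Z)]
    return M, X_zero, Z, X_rand

# Two hypothetical patients with four features each.
X = [[0.2, 0.5, None, 0.9],
     [None, 0.1, 0.4, None]]
M, X_zero, Z, X_rand = process_raw_data(X)
print(M)  # [[1, 1, 0, 1], [0, 1, 1, 0]]
```

Observed entries pass through unchanged, while each missing entry of `X_rand` equals the corresponding noise value in `Z`.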

2.2 Generator

The generator learns the distribution of the input patient data and generates new patient data. Its inputs are the patient data and the patient labels; the labels serve as a supervision signal so that the generator knows the label of each patient. The generator is a three-layer fully connected network with $k$, $k$, and $k$ nodes per layer, where $k$ is the feature dimension of the raw patient data $X$. The first two layers are hidden layers with ReLU activation, the last layer is the output layer with Tanh activation, and root mean square propagation (RMSprop) is used as the generator's optimizer. The data matrix $\bar{X} = \{\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n\}$ denotes the output of the generator, where $\bar{x}_i = G(\hat{x}_i, y_i)$ is the generator's output when the data of the $i$-th patient is the input, $\hat{x}_i$ is the data of the $i$-th patient after its missing values have been filled with random numbers, $y_i$ is the label of the $i$-th patient, and $G(\cdot)$ denotes the patient data obtained after passing through the generator G.
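The forward pass of such a three-layer network can be sketched in plain Python. The text does not say how the patient data and the label are combined, so concatenating them into one input vector is an assumption here, as are the helper names (`make_generator`, `generator_forward`), the uniform weight initialisation, and the omission of any training logic.

```python
import math
import random

def dense(x, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

def init_layer(n_in, n_out, rng):
    # Small uniform weights and zero biases; the initialisation is an assumption.
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def make_generator(k, u, seed=0):
    """Three layers with k nodes each; the first layer sees k features plus u label entries."""
    rng = random.Random(seed)
    return [init_layer(k + u, k, rng), init_layer(k, k, rng), init_layer(k, k, rng)]

def generator_forward(params, x_rand, y):
    h = list(x_rand) + list(y)          # assumed: data and label are concatenated
    h = relu(dense(h, *params[0]))      # hidden layer 1, ReLU
    h = relu(dense(h, *params[1]))      # hidden layer 2, ReLU
    return tanh(dense(h, *params[2]))   # output layer, Tanh

params = make_generator(k=4, u=3)
out = generator_forward(params, [0.2, 0.5, -0.3, 0.9], [0, 1, 0])
print(len(out))  # 4
```

Because the output layer uses Tanh, every generated value lies in the interval from -1 to 1.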

The overall loss of the generator consists of three parts:

The first part, $L_1$, measures the gap between the observed values produced by the generator and the patient's actual observed values, using the mean squared error as the loss function. The smaller this gap, the closer the generator's output at the observed positions is to the actual observed values.

The second part, $L_2$, is the generator loss of a conventional generative adversarial network, except that the Wasserstein distance replaces cross entropy as the loss function. The Wasserstein distance can be approximated by $-\mathbb{E}[(1 - M) \odot D(T)]$, where $D(T)$ is the result of passing the patient data $T$ filled by the filling unit through the discriminator D, $\odot$ denotes the Hadamard product, $M$ is the mask matrix marking the positions of the missing values in the raw patient data, and $\mathbb{E}[\cdot]$ denotes the mathematical expectation. The closer the missing values generated by the generator are to the true values, the more readily the discriminator judges them to be observed values and the smaller $L_2$ becomes, and vice versa.

The third part, $L_3$, measures the gap between the auxiliary classifier's predicted labels for the patient data filled by the filling unit and the patients' true labels, using the cross-entropy function as the loss function. The smaller this gap, the better the quality of the filled data.

The loss function $L_G$ of the whole generator is as follows:

$$L_G = L_1 + \alpha L_2 + \beta L_3$$

$$L_1 = \frac{1}{n}\sum_{i=1}^{n} \lVert m_i \odot \tilde{x}_i - m_i \odot \bar{x}_i \rVert^2, \qquad L_2 = -\mathbb{E}\left[(1 - M) \odot D(T)\right], \qquad L_3 = -\frac{1}{n}\sum_{i=1}^{n} y_i \cdot \log \hat{y}_i$$

where $\tilde{x}_i$ is the data of the $i$-th patient after its missing values have been pre-filled with 0; $\bar{x}_i = G(\hat{x}_i, y_i)$ is the generator's output when the data of the $i$-th patient is the input; $\odot$ denotes the Hadamard product and $\cdot$ the vector inner product; $M$ is the mask matrix and $m_i$ is the mask vector recording which positions in the raw data of the $i$-th patient are observed (non-missing) values and which are missing values; $\lVert m_i \odot \tilde{x}_i - m_i \odot \bar{x}_i \rVert^2$ is the squared error between the observed values generated for the $i$-th patient and the actual observed values; $\hat{x}_i$ is the data of the $i$-th patient after its missing values have been filled with random numbers; $G(\cdot)$ denotes the patient data obtained after passing through the generator G; $y_i$ is the true label of the $i$-th patient; $\hat{y}_i$ is the auxiliary classifier's predicted label for the $i$-th patient; $T$ is the patient data filled by the filling unit and $t_i$ is the filled data of the $i$-th patient; $D(\cdot)$ denotes the result of passing patient data through the discriminator; and $\alpha$ and $\beta$ are hyperparameters, set to 0.3 and 0.2 respectively in this embodiment.
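As a rough numerical illustration of how the three parts combine, the sketch below evaluates a generator loss from precomputed discriminator and classifier outputs. Several things are assumptions rather than statements of the embodiment: the exact weighting form `L1 + alpha * L2 + beta * L3`, averaging over patients, summing the squared error over features, the small `eps` inside the logarithm, and the function name `generator_loss`.

```python
import math

def generator_loss(X_zero, M, X_gen, D_T, Y, Y_pred, alpha=0.3, beta=0.2, eps=1e-12):
    """Return (total, (l1, l2, l3)) for lists of per-patient vectors."""
    n = len(X_zero)
    # L1: squared error between generated and actual values at observed positions.
    l1 = sum(sum((m * (x - g)) ** 2 for m, x, g in zip(mr, xr, gr))
             for mr, xr, gr in zip(M, X_zero, X_gen)) / n
    # L2: negative discriminator score at missing positions (Wasserstein-style).
    l2 = -sum(sum((1 - m) * d for m, d in zip(mr, dr))
              for mr, dr in zip(M, D_T)) / n
    # L3: cross entropy between one-hot labels and predicted probabilities.
    l3 = -sum(sum(y * math.log(p + eps) for y, p in zip(yr, pr))
              for yr, pr in zip(Y, Y_pred)) / n
    return l1 + alpha * l2 + beta * l3, (l1, l2, l3)

# One hypothetical patient with two features (second one missing) and two classes.
total, (l1, l2, l3) = generator_loss(
    X_zero=[[1.0, 0.0]], M=[[1, 0]], X_gen=[[0.5, 0.3]],
    D_T=[[0.9, 0.4]], Y=[[1, 0]], Y_pred=[[0.5, 0.5]])
print(round(l1, 4), round(l2, 4), round(l3, 4))  # 0.25 -0.4 0.6931
```

Only the observed position contributes to the first term and only the missing position contributes to the second, which mirrors the roles of the mask described above.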

2.3 Filling unit

The filling unit uses the patient data $\bar{X}$ produced by the generator to fill the missing values in the raw patient data $X$ and outputs the filled patient data. $T = \{t_1, t_2, \dots, t_n\}$ denotes the patient data filled by the filling unit, where $t_i$ is the data of the $i$-th patient after filling, computed as $t_i = m_i \odot \tilde{x}_i + (1 - m_i) \odot \bar{x}_i$. Here $\bar{x}_i$ is the generator's output when the data of the $i$-th patient is the input, $\tilde{x}_i$ is the data of the $i$-th patient after its missing values have been pre-filled with 0, $m_i$ is the mask vector of the $i$-th patient, and $\odot$ denotes the Hadamard product.
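The filling step is an element-wise blend controlled by the mask, as the short sketch below shows. The function name `fill_missing` and the sample numbers are hypothetical.

```python
def fill_missing(x_zero, m, x_gen):
    """Keep observed values (m = 1) and take generator output where m = 0."""
    return [mi * xz + (1 - mi) * xg for mi, xz, xg in zip(m, x_zero, x_gen)]

# The second feature is missing, so only that position is taken from the generator.
t = fill_missing(x_zero=[0.2, 0.0, 0.7], m=[1, 0, 1], x_gen=[0.25, 0.6, 0.65])
print(t)  # [0.2, 0.6, 0.7]
```

Note that the generator's values at observed positions (0.25 and 0.65 here) are discarded; they matter only through the reconstruction part of the generator loss.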

2.4 Discriminator

The purpose of the discriminator is to judge whether each input value is an observed value. Its inputs are the patient data $T$ filled by the filling unit and the patient data $\tilde{X}$ in which the missing values of the raw patient data have been pre-filled with 0; its outputs are, for each value in $T$ and in $\tilde{X}$, the probability that the value is an observed value. The discriminator is a three-layer fully connected network with $k$, $k$, and $k$ nodes per layer, where $k$ is the feature dimension of the raw patient data $X$. The first two layers are hidden layers with ReLU activation; the last layer is the output layer and has no activation function. RMSprop is used as the discriminator's optimizer. The discriminator loss measures the difference between the distribution of the raw patient data and that of the patient data filled by the filling unit, using the Wasserstein distance instead of the JS divergence. The discriminator loss function $L_D$ is computed as follows:

$$L_D = \frac{1}{n}\sum_{i=1}^{n}\left[(1 - m_i) \cdot D(t_i) - m_i \cdot D(\tilde{x}_i)\right]$$

where $D(\cdot)$ denotes the result of passing patient data through the discriminator, $\tilde{x}_i$ is the data of the $i$-th patient after its missing values have been pre-filled with 0, $m_i$ is the mask vector of the $i$-th patient, $t_i$ is the data of the $i$-th patient after filling by the filling unit, and $\cdot$ denotes the vector inner product. During training, a smaller $L_D$ indicates a smaller Wasserstein distance between the true distribution and the generated distribution, and hence a better-trained data filling model.
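A Wasserstein-style discriminator loss with a mask can be sketched as follows. The precise form used here (score at filled positions minus score at observed positions, averaged over patients) is an assumption consistent with the description rather than a confirmed formula, and `discriminator_loss` is a hypothetical name. The discriminator outputs are passed in as plain lists.

```python
def discriminator_loss(M, D_T, D_X_zero):
    """M: masks; D_T: scores on filled data; D_X_zero: scores on zero-pre-filled data."""
    n = len(M)
    total = 0.0
    for m, d_t, d_x in zip(M, D_T, D_X_zero):
        filled_score = sum((1 - mi) * d for mi, d in zip(m, d_t))    # missing positions
        observed_score = sum(mi * d for mi, d in zip(m, d_x))        # observed positions
        total += filled_score - observed_score
    return total / n

# One hypothetical patient: first feature observed, second feature filled.
loss = discriminator_loss(M=[[1, 0]], D_T=[[0.8, 0.3]], D_X_zero=[[0.9, 0.1]])
print(round(loss, 4))  # -0.6
```

Under this form the loss falls when the discriminator scores observed values high and filled values low, which is the behaviour the generator's adversarial term pushes against.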

2.5 Auxiliary classifier

The auxiliary classifier makes predictions on the patient data filled by the filling unit. In the pre-training stage of the data filling model, the auxiliary classifier is first trained on non-missing patient data to determine its network parameters; during formal training of the data filling model, these parameters are not updated. The auxiliary classifier is a three-layer fully connected network: the first two layers are hidden layers and the last layer is the output layer. The node counts of the first two layers are set manually, to 128 and 64 in this embodiment; the last layer has $u$ nodes, where $u$ is the number of disease types among the diagnosis results in the extracted patient data, 10 in this embodiment. The first two layers use ReLU activation and the last layer uses Softmax. The auxiliary classifier's loss measures the gap between its predicted labels and the patients' true labels; the smaller the gap, the better the classifier's predictions. The cross-entropy function is used as the loss function, and the auxiliary classifier's loss function $L_C$ is computed as follows:

$$L_C = -\frac{1}{n}\sum_{i=1}^{n} y_i \cdot \log \hat{y}_i$$

where $y_i$ is the label of the $i$-th patient, obtained from the diagnosis result in the patient data by one-hot encoding; $u$ is the number of disease types among the diagnosis results in the obtained patient data; $\hat{y}_i$ is the auxiliary classifier's predicted label for the $i$-th patient, a vector of length $u$ in which each value is the predicted probability that the patient has the corresponding disease; $\hat{y}_i = C(t_i)$, where $t_i$ is the data of the $i$-th patient after filling by the filling unit and $C(\cdot)$ denotes the result of passing patient data through the auxiliary classifier; $n$ is the number of patients; and $\cdot$ denotes the vector inner product.
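The Softmax output layer and the cross-entropy loss can be illustrated with a few lines of plain Python. This is only a sketch: the function names are hypothetical, the small `eps` added inside the logarithm is a numerical-safety assumption, and the logits are made up rather than produced by a trained network.

```python
import math

def softmax(logits):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    mx = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - mx) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classifier_loss(Y, Y_pred, eps=1e-12):
    """Mean cross entropy between one-hot labels Y and predicted probabilities Y_pred."""
    n = len(Y)
    return -sum(sum(y * math.log(p + eps) for y, p in zip(yr, pr))
                for yr, pr in zip(Y, Y_pred)) / n

probs = softmax([0.0, 0.0])
print(probs)  # [0.5, 0.5]
loss = classifier_loss([[1, 0]], [probs])
print(round(loss, 4))  # 0.6931
```

With a one-hot label, the inner product picks out the log-probability of the true disease class, so the loss shrinks as that probability approaches 1.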

Training of the whole data filling model is divided into the following two stages:

The first stage is pre-training, whose purpose is to train the auxiliary classifier and determine its network parameters. To train the auxiliary classifier, its network parameters are first initialized; non-missing patient data is then used as training data, the auxiliary classifier's loss function is computed, and the network parameters are updated by gradient backpropagation until the auxiliary classifier converges. Once the auxiliary classifier's network parameters have been determined, they are not updated during the second stage, the formal training of the data filling model.

The second stage is the training of the data filling model, in which the strategy is to train the discriminator first and then the generator. Patient data containing missing values is input first; the discriminator computes its loss and its network parameters are updated by gradient backpropagation; the generator then computes its loss and its network parameters are updated by gradient backpropagation. The discriminator and the generator are trained adversarially in this alternating fashion until the data filling model converges. Early in training, the discriminator can easily tell which values are observed values and which are filled values. As training progresses, the generator learns the distribution of the patient data and the values it generates become very close to the patients' observed values, so that the discriminator can no longer tell observed values from filled values. At this point the generator and the discriminator have reached a Nash equilibrium and the data filling model has converged.
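The two-stage schedule can be outlined as a small driver loop. Everything concrete in this sketch is an assumption: the callables `pretrain_step`, `d_step`, and `g_step` are hypothetical stand-ins for real gradient updates, and the stopping rule (generator loss no longer changing within a tolerance) is just one simple way to decide that training has converged.

```python
import math

def train_filling_model(pretrain_step, d_step, g_step,
                        pretrain_epochs, max_iters, tol=1e-4):
    """Run the two-stage schedule and return the order in which networks were updated."""
    order = []
    # Stage 1: pre-train the auxiliary classifier C on non-missing patient data.
    for _ in range(pretrain_epochs):
        pretrain_step()
        order.append("C")
    # Stage 2: C stays frozen; alternate discriminator and generator updates.
    prev_g_loss = None
    for _ in range(max_iters):
        d_step()                  # discriminator is updated first
        order.append("D")
        g_loss = g_step()         # then the generator
        order.append("G")
        # Assumed stopping rule: the generator loss has stopped changing.
        if prev_g_loss is not None and math.isclose(g_loss, prev_g_loss, abs_tol=tol):
            break
        prev_g_loss = g_loss
    return order

# Dummy steps standing in for real gradient updates.
order = train_filling_model(lambda: None, lambda: 0.0, lambda: 1.0,
                            pretrain_epochs=2, max_iters=10)
print(order)  # ['C', 'C', 'D', 'G', 'D', 'G']
```

The classifier appears only in the first stage of the returned order, reflecting that its parameters are not updated during formal training.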

Once the data filling model has been trained, it can be used to fill missing values in patient data. During filling, the patient data to be filled is first preprocessed: one-hot encoding is applied to the discrete data and min-max normalization to the continuous data. The patient diagnosis results are then selected as patient labels, and the patient data containing missing values together with the patient labels is input to the data filling model. After passing through the data processing unit, the generator, and the filling unit, the filled patient data is output.
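The two preprocessing operations can be sketched as follows. The helper names, the use of `None` for a missing entry, and the choice to return 0.0 when all observed values are equal are assumptions made for this example; the category list and numbers are made up.

```python
def one_hot(value, categories):
    """Encode a discrete value as a 0/1 vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]

def min_max_normalize(values):
    """Scale observed values to the range 0..1; missing entries (None) stay missing."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    span = hi - lo
    # If all observed values are equal, map them to 0.0 (an assumed convention).
    return [None if v is None else ((v - lo) / span if span else 0.0) for v in values]

print(one_hot("B", ["A", "B", "C"]))          # [0, 1, 0]
print(min_max_normalize([10, None, 20, 15]))  # [0.0, None, 1.0, 0.5]
```

The minimum and maximum are computed from the observed values only, so the missing entries do not distort the scaling before they are filled.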

In another aspect, the present invention further provides a system for filling missing values in unbalanced medical data. As shown in Figure 5, the system comprises a data acquisition module, a data filling model construction module, and a data filling module. The data acquisition module is used to obtain patient data from the hospital's information system;

The data filling model construction module is used to build and train the data filling model; the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator, and an auxiliary classifier, and the generator and the discriminator form a generative adversarial network;

In the data processing unit, a mask matrix records the positions of the missing values in the raw patient data, the missing values in the raw patient data are pre-filled with 0, the missing values in the raw patient data are filled with random numbers, and the result is input to the generator;

The generator is used to learn the distribution of the input patient data, generate new patient data, and input it to the filling unit; the inputs of the generator comprise the patient data and the patient labels;

The filling unit is used to fill the missing values in the raw patient data with the new patient data generated by the generator;

The discriminator is used to examine each input patient data value and judge whether it is an observed value; the inputs of the discriminator comprise the patient data filled by the filling unit and the patient data in which the missing values of the raw patient data have been pre-filled with 0, and the output is the probability that each patient data value is an observed value;

The auxiliary classifier is used to make predictions on the patient data filled by the filling unit and to feed the prediction results back to the generator;

The training process comprises pre-training the auxiliary classifier and formally training the data filling model. In pre-training, the auxiliary classifier is trained on non-missing patient data to determine its network parameters, which are not updated during formal training. In formal training, the discriminator is trained first and then the generator, and the two are trained adversarially until the data filling model converges;

The data filling module is used to input the patient data whose missing values need to be filled, together with the patient labels, into the trained data filling model, and to output the filled patient data after it has passed through the data processing unit, the generator, and the filling unit.

Corresponding to the foregoing embodiment of a method for filling missing values in unbalanced medical data, the present invention also provides an embodiment of a device for filling missing values in unbalanced medical data.

Referring to Figure 6, a device for filling missing values in unbalanced medical data provided by an embodiment of the present invention comprises a memory and one or more processors. The memory stores executable code, and the processors, when executing the executable code, implement the method for filling missing values in unbalanced medical data of the above embodiment.

The embodiment of the device for filling missing values in unbalanced medical data provided by the present invention can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented in software, in hardware, or in a combination of the two. Taking a software implementation as an example, the device in the logical sense is formed when the processor of the host device reads the corresponding computer program instructions from non-volatile memory into main memory and runs them. At the hardware level, Figure 6 shows a hardware structure diagram of a device with data processing capability that hosts the missing value filling device provided by the present invention. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 6, the host device may include other hardware according to its actual function, which is not described further here.

For the implementation of the functions and roles of each unit in the above device, see the implementation of the corresponding steps in the above method; it is not repeated here.

Since the device embodiment essentially corresponds to the method embodiment, the relevant parts of the description of the method embodiment apply. The device embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention. Those of ordinary skill in the art can understand and implement it without creative effort.

An embodiment of the present invention also provides a computer-readable storage medium on which a program is stored. When the program is executed by a processor, it implements the method for filling missing values in unbalanced medical data of the above embodiment.

The computer-readable storage medium may be an internal storage unit, such as a hard disk or memory, of any device with data processing capability described in any of the foregoing embodiments. It may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card fitted to the device. Further, the computer-readable storage medium may comprise both an internal storage unit and an external storage device of such a device. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been output or is about to be output.

The above embodiments are intended to explain the present invention, not to limit it. Any modification or change made to the present invention within its spirit and the scope of protection of the claims falls within the scope of protection of the present invention.

Claims (10)

1. A method for filling missing values in unbalanced medical data, comprising:
acquiring patient data from the information system of a hospital;
filling the missing values in the patient data by using a data filling model;
wherein the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator, and an auxiliary classifier, and the generator and the discriminator form a generative adversarial network;
the data processing unit records the positions of the missing values in the raw patient data by using a mask matrix, pre-fills the missing values in the raw patient data with 0, fills the missing values in the raw patient data with random numbers, and inputs the result to the generator;
the generator is used for learning the distribution of the input patient data, generating new patient data, and inputting it to the filling unit, the inputs of the generator comprising the patient data and the patient labels;
the filling unit is used for filling the missing values in the raw patient data with the new patient data generated by the generator;
the inputs of the discriminator comprise the patient data filled by the filling unit and the patient data in which the missing values of the raw patient data have been pre-filled with 0, and the output is the probability that each patient data value is an observed value;
the auxiliary classifier is used for making predictions on the patient data filled by the filling unit and feeding the prediction results back to the generator;
the training process comprises pre-training the auxiliary classifier and formally training the data filling model, wherein in pre-training the auxiliary classifier is trained on non-missing patient data to determine its network parameters, and in formal training the network parameters of the auxiliary classifier are not updated; in formal training the discriminator is trained first and then the generator, and the discriminator and the generator are trained adversarially until the data filling model converges;
the patient data whose missing values need to be filled and the patient labels are input into the trained data filling model, and the filled patient data is output after passing through the data processing unit, the generator, and the filling unit.
2. The method for filling missing values in unbalanced medical data according to claim 1, wherein the acquired patient data is input to the data filling model after data preprocessing, specifically: one-hot encoding is applied to the discrete data and min-max normalization is applied to the continuous data.
3. The method of claim 1, wherein the raw patient data is denoted $X = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the raw data of the $i$-th patient, $n$ is the number of patients, and $k$ is the number of features; the mask matrix is denoted $M = \{m_1, m_2, \dots, m_n\}$, where $m_i$ marks the observed values and missing values in the raw data of the $i$-th patient, an observed value being 1 and a missing value being 0; the missing values in the raw patient data are pre-filled with 0 and the resulting data matrix is denoted $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \dots, \tilde{x}_n\}$, where $\tilde{x}_i$ is the patient data after the missing values in the raw data of the $i$-th patient have been pre-filled with 0; a random matrix is created and denoted $Z = \{z_1, z_2, \dots, z_n\}$, where $z_i$ is a randomly generated vector of standard normal random numbers used to fill the missing values in the raw data of the $i$-th patient; the missing values in the raw patient data are filled with the random numbers of the random matrix and the resulting data matrix is denoted $\hat{X} = \{\hat{x}_1, \hat{x}_2, \dots, \hat{x}_n\}$, where $\hat{x}_i$ is the patient data obtained after the missing values in the raw data of the $i$-th patient are filled with random numbers, $\hat{x}_i = m_i \odot \tilde{x}_i + (1 - m_i) \odot z_i$, and $\odot$ denotes the Hadamard product.
4. The method for filling missing values in unbalanced medical data according to claim 3, wherein the loss function of the generator consists of three parts: the first part computes the gap between the observed values generated by the generator and the actual observed values, using the mean squared error as the loss function; the second part is the generator loss of the generative adversarial network, using the Wasserstein distance as the loss function; the third part is the gap between the auxiliary classifier's predicted labels for the patient data filled by the filling unit and the patients' true labels, using the cross-entropy function as the loss function.
5. The method of claim 4, wherein the generator has the loss function $L_G = L_1 + \alpha L_2 + \beta L_3$;
the first partial loss function is $L_1 = \frac{1}{n}\sum_{i=1}^{n} \lVert m_i \odot \tilde{x}_i - m_i \odot \bar{x}_i \rVert^2$;
the second partial loss function is $L_2 = -\mathbb{E}\left[(1 - M) \odot D(T)\right]$;
the third partial loss function is $L_3 = -\frac{1}{n}\sum_{i=1}^{n} y_i \cdot \log \hat{y}_i$;
where $\bar{x}_i$ is the output of the generator when the data of the $i$-th patient is the input, $\bar{x}_i = G(\hat{x}_i, y_i)$, $G(\cdot)$ denotes the patient data obtained after passing through the generator, $y_i$ is the true label of the $i$-th patient, $D(\cdot)$ denotes the result of passing patient data through the discriminator, $T$ is the patient data filled by the filling unit, $t_i$ is the patient data obtained after the raw data of the $i$-th patient is filled by the filling unit, $\hat{y}_i$ is the auxiliary classifier's predicted label for the $i$-th patient, $\alpha$ and $\beta$ are hyperparameters, and $\cdot$ denotes the vector inner product.
6. The method for filling missing values in unbalanced medical data according to claim 3, wherein the filling unit uses the patient data $\bar{X}$ generated by the generator to fill the missing values in the raw patient data $X$, and the filled data matrix is denoted $T = \{t_1, t_2, \dots, t_n\}$, where $t_i$ is the patient data obtained after the raw data of the $i$-th patient is filled by the filling unit, $t_i = m_i \odot \tilde{x}_i + (1 - m_i) \odot \bar{x}_i$, and $\bar{x}_i$ is the output of the generator when the data of the $i$-th patient is the input.
7. The method of filling missing values in unbalanced medical data according to claim 3, wherein the loss function L_D of the discriminator is calculated as follows: [formula not reproduced]
where D() denotes the result obtained after patient data passes through the discriminator; [symbol not reproduced] denotes the patient data after the missing values in the i-th patient's raw data are pre-filled with 0; t_i denotes the patient data obtained after the i-th patient's raw data is filled by the filling unit; and [symbol not reproduced] denotes the vector inner product.
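Because the formula for L_D is not reproduced here, the following is only a generic Wasserstein-style critic loss, not the claimed formula. It assumes that the zero-prefilled data plays the role of the "real" sample and the data filled by the filling unit plays the role of the "generated" sample; the function name and the per-patient scalar scores are hypothetical simplifications.

```python
def critic_loss(scores_filled, scores_prefilled):
    """Generic Wasserstein-style critic loss (assumed form).

    scores_filled:    D(t_i) for data filled by the filling unit
    scores_prefilled: D(.) for data whose missing values were pre-filled with 0
    Minimising this pushes scores for pre-filled data up and scores for
    filled data down.
    """
    def mean(values):
        return sum(values) / len(values)

    return mean(scores_filled) - mean(scores_prefilled)


loss = critic_loss([0.2, 0.4], [0.9, 0.7])
```

A mask-weighted variant, in which only observed or only imputed positions contribute, would follow the same pattern with an element-wise product applied to the scores first.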
8. The method of filling missing values in unbalanced medical data according to claim 1, wherein the loss function L_C of the auxiliary classifier is calculated as follows: [formula not reproduced]
where [symbol not reproduced] is the label of the i-th patient, obtained from the diagnosis result in the patient data through one-hot encoding; u is the number of disease types among the diagnosis results in the patient data; [symbol not reproduced] is the auxiliary classifier's predicted label for the i-th patient, a vector of length u in which each value is the probability that the auxiliary classifier predicts the patient to have the corresponding disease; n is the number of patients; [formula not reproduced]; t_i denotes the patient data obtained after the i-th patient's raw data is filled by the filling unit; C() denotes the result obtained after patient data passes through the auxiliary classifier; and [symbol not reproduced] denotes the vector inner product.
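The one-hot labels and cross-entropy loss named in claim 8 can be sketched as follows. Averaging over the n patients and the small `eps` added for numerical safety are assumptions, as are the function names; the formula itself is not reproduced in this text.

```python
import math


def one_hot(index, u):
    """Encode a disease index as a length-u one-hot vector."""
    return [1.0 if k == index else 0.0 for k in range(u)]


def classifier_loss(labels, probs, eps=1e-12):
    """Mean cross entropy between one-hot labels and predicted probabilities.

    labels: one one-hot vector per patient
    probs:  one probability vector of length u per patient
    """
    total = 0.0
    for y, p in zip(labels, probs):
        total += -sum(yk * math.log(pk + eps) for yk, pk in zip(y, p))
    return total / len(labels)


# Two patients, u = 3 disease types, each true class predicted with p = 0.5.
labels = [one_hot(0, 3), one_hot(2, 3)]
probs = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
loss = classifier_loss(labels, probs)
```

Since each label is one-hot, the inner sum reduces to the negative log-probability assigned to the true disease class.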
9. The method of filling missing values in unbalanced medical data according to any one of claims 1 to 8, wherein, in formally training the data filling model, patient data containing missing values is input first; the discriminator calculates its loss, and the discriminator network parameters are updated by gradient backpropagation; the generator then calculates its loss, and the generator network parameters are updated by gradient backpropagation; the discriminator and the generator continue adversarial training until the data filling model converges.
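The update order in claim 9 (discriminator first, then generator, repeated until convergence) can be sketched as a loop over two update callbacks. The callbacks, the convergence test on the change in mean generator loss, and the parameters `max_epochs` and `tol` are hypothetical; the claim does not specify how convergence is detected.

```python
def train(batches, update_discriminator, update_generator,
          max_epochs=100, tol=1e-4):
    """Alternate discriminator and generator updates until the loss settles.

    update_discriminator(batch): computes the discriminator loss and applies
        one gradient step to the discriminator.
    update_generator(batch): computes the generator loss, applies one gradient
        step to the generator, and returns the loss value.
    Returns the number of epochs that were run.
    """
    previous = None
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for batch in batches:
            update_discriminator(batch)             # discriminator first
            epoch_loss += update_generator(batch)   # then generator
        epoch_loss /= len(batches)
        # Assumed stopping rule: mean generator loss stops changing.
        if previous is not None and abs(previous - epoch_loss) < tol:
            return epoch + 1
        previous = epoch_loss
    return max_epochs


# Stub callbacks that only record the call order.
calls = []
epochs_run = train(
    [0, 1],
    lambda batch: calls.append("D"),
    lambda batch: calls.append("G") or 1.0,
)
```

With the constant stub loss the loop stops after the second epoch, and the recorded order alternates D, G, D, G for each batch.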
10. A system for filling missing values in unbalanced medical data, characterized by comprising a data acquisition module, a data filling model construction module and a data filling module; the data acquisition module is used for acquiring patient data from the information system of a hospital;
the data filling model construction module is used for constructing and training a data filling model; the data filling model comprises a data processing unit, a generator, a filling unit, a discriminator and an auxiliary classifier, wherein the generator and the discriminator form a generative adversarial network;
the data processing unit records the positions of the missing values in the patient's original data using a mask matrix, pre-fills the missing values in the original data with 0, fills the missing values in the original data with random numbers, and inputs the result into the generator;
the generator is used for learning the distribution of the input patient data, generating new patient data and inputting it into the filling unit, the input of the generator comprising the patient data and a patient label;
the filling unit is used for filling the missing values in the original patient data using the new patient data generated by the generator;
the input of the discriminator comprises the patient data filled by the filling unit and the patient data in which the missing values of the original patient data are filled with 0, and the discriminator outputs the probability that each item of patient data is an observed value;
the auxiliary classifier is used for making predictions on the patient data filled by the filling unit and feeding the prediction results back to the generator;
the training process comprises pre-training the auxiliary classifier and formally training the data filling model, wherein in the pre-training process the auxiliary classifier is trained using patient data without missing values and its network parameters are determined, and in the formal training process the network parameters of the auxiliary classifier are not updated; in the formal training process the discriminator is trained first and the generator second, and the discriminator and the generator continue adversarial training until the data filling model converges;
the data filling module is used for inputting the patient data whose missing values need to be filled, together with the patient labels, into the trained data filling model, and outputting the filled patient data after it passes through the data processing unit, the generator and the filling unit.
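The data processing unit of the system claim above produces three things from a raw record: a mask, a zero-prefilled copy, and a randomly filled copy. The sketch below assumes that missing entries are represented as `None`, that the mask uses 1 for observed entries, and that the random numbers are drawn uniformly from [0, 1); none of these conventions is stated in the claim, and the function name `preprocess` is hypothetical.

```python
import random


def preprocess(row, rng):
    """Return (mask, zero_filled, noise_filled) for one patient record.

    row: list of floats, with None marking a missing value (assumed encoding)
    rng: a random.Random instance, passed in so results are reproducible
    """
    mask = [0 if value is None else 1 for value in row]
    zero_filled = [0.0 if value is None else value for value in row]
    noise_filled = [rng.random() if value is None else value for value in row]
    return mask, zero_filled, noise_filled


rng = random.Random(0)
mask, x_zero, x_noise = preprocess([5.1, None, 3.3], rng)
```

In this reading the randomly filled copy would go to the generator, while the zero-prefilled copy would be one of the discriminator's inputs.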
CN202311283938.4A 2023-10-07 2023-10-07 Unbalanced medical data missing value filling method and system Active CN117034142B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311283938.4A CN117034142B (en) 2023-10-07 2023-10-07 Unbalanced medical data missing value filling method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311283938.4A CN117034142B (en) 2023-10-07 2023-10-07 Unbalanced medical data missing value filling method and system

Publications (2)

Publication Number Publication Date
CN117034142A CN117034142A (en) 2023-11-10
CN117034142B true CN117034142B (en) 2024-02-09

Family

ID=88630271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311283938.4A Active CN117034142B (en) 2023-10-07 2023-10-07 Unbalanced medical data missing value filling method and system

Country Status (1)

Country Link
CN (1) CN117034142B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117524318B (en) * 2024-01-05 2024-03-22 深圳新合睿恩生物医疗科技有限公司 New antigen heterogeneous data integration method and device, equipment and storage medium
CN118262931B (en) * 2024-05-30 2024-09-24 Chinese PLA General Hospital A method and system for completing medical data features in emergency rescue scenarios
CN118690277B (en) * 2024-08-23 2024-10-25 Institute of Automation, Chinese Academy of Sciences Intelligent health assessment method based on sub-matrix integration and missing tolerance technology

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165664A (en) * 2018-07-04 2019-01-08 South China University of Technology Attribute-missing data set completion and prediction method based on a generative adversarial network
CN111833359A (en) * 2020-07-13 2020-10-27 Ocean University of China Brain tumor segmentation data enhancement method based on generative adversarial network
EP3792830A1 (en) * 2019-09-10 2021-03-17 Robert Bosch GmbH Training a class-conditional generative adverserial network
CN113591954A (en) * 2021-07-20 2021-11-02 Harbin Engineering University Filling method of missing time sequence data in industrial system
CN116364290A (en) * 2023-06-02 2023-06-30 Zhejiang Lab Hemodialysis characterization identification and complications risk prediction system based on multi-view alignment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165664A (en) * 2018-07-04 2019-01-08 South China University of Technology Attribute-missing data set completion and prediction method based on a generative adversarial network
EP3792830A1 (en) * 2019-09-10 2021-03-17 Robert Bosch GmbH Training a class-conditional generative adverserial network
CN111833359A (en) * 2020-07-13 2020-10-27 Ocean University of China Brain tumor segmentation data enhancement method based on generative adversarial network
CN113591954A (en) * 2021-07-20 2021-11-02 Harbin Engineering University Filling method of missing time sequence data in industrial system
CN116364290A (en) * 2023-06-02 2023-06-30 Zhejiang Lab Hemodialysis characterization identification and complications risk prediction system based on multi-view alignment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
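Miao X et al. Generative semi-supervised learning for multivariate time series imputation. Proceedings of the AAAI Conference on Artificial Intelligence. 2021, Vol. 35 (No. 10), pp. 8983-8991. *
Water conservancy data completion method based on an adversarial autoencoder network; Ji Linya; Lü Xin; Tao Feifei; Zeng Tao; Computer Engineering (04); full text *
Testing and repair for machine learning model security; Zhang Xiaoyu et al.; Acta Electronica Sinica; Vol. 50 (No. 12); pp. 2884-2918 *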

Also Published As

Publication number Publication date
CN117034142A (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN109920501B (en) Electronic medical record classification method and system based on convolutional neural network and active learning
CN117034142B (en) Unbalanced medical data missing value filling method and system
CN109036553B (en) Disease prediction method based on automatic extraction of medical expert knowledge
CN113421652B (en) Method for analyzing medical data, method for training model and analyzer
CN107273685A (en) A kind of data analysing method of multi-modal big data for clinical disease
CN116364299A (en) A method and system for clustering disease diagnosis and treatment paths based on heterogeneous information network
WO2020224433A1 (en) Target object attribute prediction method based on machine learning and related device
CN114864099B (en) A method and system for automatic generation of clinical data based on causal relationship mining
CN111612278A (en) Life state prediction method, device, electronic device and storage medium
CN110767279A (en) Method and system for missing data completion in electronic health record based on LSTM
CN112489740A (en) Medical record detection method, training method of related model, related equipment and device
CN117153393A (en) A cardiovascular disease risk prediction method based on multi-modal fusion
CN113707278B (en) A brain CT medical report generation method based on spatial coding
CN116598014A (en) Medical missing data complement method based on graph attention mechanism and language big model
CN115579141A (en) Interpretable disease risk prediction model construction method and disease risk prediction device
CN116759076A (en) Unsupervised disease diagnosis method and system based on medical image
CN116881336A (en) Efficient multi-mode contrast depth hash retrieval method for medical big data
Wang et al. A dense RNN for sequential four-chamber view left ventricle wall segmentation and cardiac state estimation
CN107491656B (en) Pregnancy outcome influence factor evaluation method based on relative risk decision tree model
Sudharson et al. Enhancing the efficiency of lung disease prediction using CatBoost and expectation maximization algorithms
CN113345564B (en) A method and device for early prediction of hospitalization length of patients based on graph neural network
CN109119155A (en) ICU mortality prediction assessment system based on deep learning
CN118608849A (en) A method for constructing a CT image classification model based on a bidirectional combination of GNN and CNN
CN117992913A (en) Multimode data classification method based on bimodal attention fusion network
CN117038096A (en) Chronic disease prediction method based on low-resource medical data and knowledge mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant