WO2024008043A1 - Automated clinical data generation method and system based on causal relationship mining - Google Patents

Automated clinical data generation method and system based on causal relationship mining

Info

Publication number
WO2024008043A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
virtual
cause
causal
result
Prior art date
Application number
PCT/CN2023/105558
Other languages
English (en)
French (fr)
Inventor
李劲松
田雨
周天舒
路子豪
Original Assignee
浙江大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学
Publication of WO2024008043A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Definitions

  • the present invention relates to the field of medical information technology, and in particular to an automated clinical data generation method and system based on causal relationship mining.
  • This scheme is verified on the MIMIC-III data set. Diseases are classified into categories by the International Classification of Diseases (ICD) according to certain characteristics of the disease, and the disease data and surgical data in MIMIC-III are coarsened by simple truncation. After a large number of small categories are removed, the remaining items are one-hot encoded.
  • The discrete data is converted into continuous data, the convolutional layers of the original generative adversarial network are replaced with fully connected layers, the distribution of the original clinical data is modeled, and virtual clinical data similar to real medical data is obtained through the generator.
  • Compared with medGAN, the most important improvement of medWGAN and medBGAN is to replace the original generative adversarial network in medGAN with a better-performing generative adversarial network variant, which speeds up training and mitigates problems that arise during the training of generative adversarial networks, such as mode collapse.
  • the purpose of the present invention is to propose an automatic generation method and system of clinical data based on causal relationship mining in view of the shortcomings of the existing technology.
  • Data set construction: construct a table with patients as rows and patient clinical information as columns to obtain the data set that needs to be virtually generated;
  • Cause data generation: divide the data nodes in the causal graph obtained in step (3) into two types, starting cause columns and subsequent result columns. For each starting cause column, calculate the group distance (class width) from the user-defined number of groups and the range of the column's data, draw the frequency distribution histogram, obtain the frequency distribution line chart, and approximate the overall density curve. Then calculate the distribution function of the probability density function, which is an increasing function with values in (0, 1), and take its inverse. Generate random numbers uniformly in [0, 1] and map them through the inverse function to obtain the virtual generation result of the starting cause column data;
  • Result data generation: for each result data in the subsequent result columns, random noise is first sampled from a normal distribution, and the random noise and the real cause data corresponding to the result data are input into the generator to construct virtual result data that is causally connected with the real cause data. The virtual result data, real cause data and real result data are then input into the discriminator for training, and the discriminator judges whether the virtual result data is real. After the generator and the discriminator reach a stable state following a certain number of training rounds, random noise and virtual cause data are input into the generator to obtain virtual result data.
  • the patient's clinical information is to select the patient's condition, examination, symptoms and drug-related clinical information from different departments of the hospital based on the patient's admission code.
  • the text information preprocessing specifically includes: removing useless characters, checking the integrity of the information in the table, checking for missing and incorrect information, and deleting or correcting the data accordingly. Regular expressions are then used to split long sentences, and the polarity of each resulting clause is judged. A unified medical expression is then used to map different wordings and characters with the same underlying meaning to the same form, so that the table uses one unified form of expression. Finally, the distinct text expressions in each column are encoded sequentially and converted into a numerical sequence.
  • step (2) the preprocessed text information and numerical information are combined to obtain real medical clinical table data composed of numbers;
  • In step (3), the causal connection between data columns is obtained as follows: for any two data nodes in the complete undirected graph, if the two nodes do not satisfy conditional independence with respect to the other nodes, it is determined that there is a causal connection between them. Assuming that all random nodes jointly follow a multivariate Gaussian distribution, whether two data nodes are conditionally independent is determined from the partial correlation coefficient.
  • For any two data nodes a and b, the set of the remaining data nodes is denoted H, and the s-order partial correlation coefficient $\rho_{a,b|H}$ of a and b is:
    $$\rho_{a,b|H} = \frac{\rho_{a,b|H\setminus s} - \rho_{a,s|H\setminus s}\,\rho_{b,s|H\setminus s}}{\sqrt{\left(1-\rho_{a,s|H\setminus s}^{2}\right)\left(1-\rho_{b,s|H\setminus s}^{2}\right)}}$$
  • $\rho_{a,b|H\setminus s}$ is the (s-1)-order partial correlation coefficient of data nodes a and b
  • $\rho_{a,s|H\setminus s}$ is the (s-1)-order partial correlation coefficient of data nodes a and s
  • $\rho_{b,s|H\setminus s}$ is the (s-1)-order partial correlation coefficient of data nodes b and s. The partial correlation coefficient is transformed into an approximately normally distributed quantity Z(a, b | H):
    $$Z(a,b|H) = \frac{1}{2}\ln\frac{1+\rho_{a,b|H}}{1-\rho_{a,b|H}}$$
  • $\rho_{a,b|H}$ is the s-order partial correlation coefficient of data nodes a and b.
  • $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function of the normal distribution N(0, 1). The statistic derived from Z(a, b | H) is compared with the corresponding quantile $\Phi^{-1}(\cdot)$; if the latter is larger, the partial correlation coefficient of data nodes a and b given the remaining data node set H is taken to be 0, that is, data nodes a and b are conditionally independent. The connection lines between data nodes, that is, the dependencies between data columns, are determined in this way.
  • In step (4), the distribution function of the probability density function is calculated as follows: according to the number of peaks on the overall density curve and the principle of minimizing the sum of squared errors, the probability density function p(g) is expressed as a combination of t normal distributions, in which:
  • $\mu_i$ is the mean of the i-th normal distribution
  • $\sigma_i$ is the standard deviation of the i-th normal distribution
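As an illustration, a combination of t normal distributions is commonly written as a weighted mixture. The sketch below assumes non-negative mixing weights $w_i$ that sum to 1; the weights are an assumption introduced here, since the text names only $\mu_i$ and $\sigma_i$.

```latex
p(g) = \sum_{i=1}^{t} w_i \,\frac{1}{\sqrt{2\pi}\,\sigma_i}
       \exp\!\left(-\frac{(g-\mu_i)^2}{2\sigma_i^2}\right),
\qquad
F(g) = \int_{-\infty}^{g} p(u)\,du
     = \sum_{i=1}^{t} w_i\,\Phi\!\left(\frac{g-\mu_i}{\sigma_i}\right)
```

Under this assumption F is strictly increasing with values in (0, 1), so its inverse exists, and $F^{-1}(u)$ for u drawn uniformly from [0, 1] yields one virtual value of the starting cause column.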
  • the generator's loss includes two parts: the true-and-false loss produced by the discriminator, and the causal loss produced by causality.
  • the formula of the true-and-false loss loss_a is as follows:
  • N is the number of patients in a batch during training, and the probability term is the probability that the i-th generated virtual result data is judged by the discriminator to be the result data in the subsequent result column corresponding to the input real cause data;
  • Causal loss: the causal loss ensures that the causality among the virtual samples generated by the generator is similar to that among the original real samples. The smaller the causal loss, the better the causality among the virtual samples meets the requirement. The causality between data columns is expressed as the correlation between their values. For the virtual result column and the real result column, the correlation coefficient with each corresponding cause column is calculated, and the difference between the correlation coefficients is backpropagated as feedback on the causality of the virtual result column.
  • the formula of the causal loss loss_b is as follows:
  • $c_j$ is the Pearson correlation coefficient between the real result column and the corresponding j-th cause column, and its virtual counterpart is the Pearson correlation coefficient between the virtual result column generated by the generator and the corresponding j-th cause column;
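The idea of comparing correlation coefficients can be sketched in plain Python. This is a minimal illustration, not the patented formula: the mean absolute difference used in `causal_loss` is an assumed form, and the function names and sample columns are hypothetical. In actual training the same quantity would be computed in a differentiable framework so that it can be backpropagated.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (sx * sy)


def causal_loss(cause_cols, real_result, virtual_result):
    """Mean absolute gap between real and virtual correlations (assumed form)."""
    gaps = []
    for cause in cause_cols:
        c_real = pearson(real_result, cause)        # c_j in the text
        c_virtual = pearson(virtual_result, cause)  # virtual counterpart of c_j
        gaps.append(abs(c_real - c_virtual))
    return sum(gaps) / len(gaps)


# Hypothetical columns: two cause columns and one result column.
causes = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]
real = [2.0, 4.0, 6.0, 8.0]
good_virtual = [2.1, 3.9, 6.2, 7.8]   # follows the real pattern closely
bad_virtual = [8.0, 6.0, 4.0, 2.0]    # reverses the real pattern

loss_good = causal_loss(causes, real, good_virtual)
loss_bad = causal_loss(causes, real, bad_virtual)
```

A virtual column that preserves the real correlations gives a loss near zero, while one that reverses them gives a large loss.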
  • the generator uses the sum of the above two loss functions as its loss function and is optimized with the Wasserstein GAN with gradient penalty (WGAN-GP) method, which is based on a gradient penalty and the earth mover's (bulldozer) distance, to generate virtual result data that is causally connected with the virtual cause data and similar to the real result data. During the training phase, real cause data is input, and the causal connection between the real cause data and the real result data is learned through the generator's loss function. After the network is stable, virtual cause data is input to obtain the corresponding virtual result data.
  • In step (5), the generated virtual result data is concatenated and input into the joint discriminator together with the real result data.
  • the joint discriminator judges the causal association of the virtual result data.
  • the training ratio of the generator is adjusted to optimize the generator's causal connection ability, and the training objective function Value(D, G) is set to:
  • G refers to the generator
  • D refers to the discriminator
  • q represents the real result data
  • z represents the random variable
  • $\mathbb{E}_{q \sim p(q)}$ denotes the expectation over q drawn from the distribution p(q)
  • $\mathbb{E}_{z \sim p_z(z)}$ denotes the expectation over z drawn from the distribution $p_z(z)$.
  • the error backpropagation algorithm propagates the similarity gap between the virtual result data and the real result data back to each generator, following the order in which the virtual result data was generated, thereby improving the causal association between the virtual result data and the virtual cause data.
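The symbols defined for Value(D, G) (generator G, discriminator D, real result data q, random variable z, and the two expectations) match those of the standard generative adversarial network minimax objective. As a sketch, assuming the objective takes that standard form, it would read:

```latex
\min_{G}\max_{D}\ \mathrm{Value}(D, G)
  = \mathbb{E}_{q \sim p(q)}\big[\log D(q)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

In this form the discriminator is trained to raise the value, and the generators are trained to lower it.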
  • the present invention also provides an automated clinical data generation system based on causality mining, which system includes a data set building module, a natural language processing module, a causal discovery module, a cause data generation module, and a result data generation module;
  • the data set building module is used to construct a table with patients as rows and patient clinical information as columns to obtain a data set that needs to be virtually generated;
  • the natural language processing module is used to preprocess the text information and numerical information in the data set obtained by the data set construction module: text information is converted into a unified expression form and sequentially encoded into a numerical sequence, and numerical information is given a unified numerical expression form;
  • the causal discovery module is used to take the data columns of the data set processed by the natural language processing module as data nodes and draw connecting lines between all data nodes to form a complete undirected graph. The dependence direction of the edges in the complete undirected graph is then determined based on the principle of d-separation, the graph is extended to a completed partially directed acyclic graph, and the causal connections between the data columns are obtained, yielding the causal graph;
  • the cause data generation module is used to divide the data nodes in the causal graph obtained by the causal discovery module into two types: starting cause columns and subsequent result columns. For each starting cause column, it calculates the group distance (class width) from the user-defined number of groups and the range of the column's data, draws the frequency distribution histogram, obtains the frequency distribution line chart, and approximates the overall density curve. It then calculates the distribution function of the probability density function, which is an increasing function with values in (0, 1), and takes its inverse. Random numbers are generated uniformly in [0, 1] and mapped through the inverse function to obtain the virtual generation result of the starting cause column data;
  • the result data generation module is used, for each result data in the subsequent result columns, to first sample random noise from a normal distribution and input the random noise and the real cause data corresponding to the result data into the generator, constructing virtual result data that is causally linked to the real cause data. The virtual result data, real cause data and real result data are then input into the discriminator for training.
  • the discriminator judges whether the virtual result data is real. After the generator and the discriminator reach a stable state following a certain number of training rounds, random noise and virtual cause data are input into the generator to obtain virtual result data.
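The group-distance and histogram step that the cause data generation module applies to a starting cause column can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `frequency_histogram`, the sample `ages` column, and the choice of placing the maximum value in the last group are illustrative, not taken from the patent.

```python
def frequency_histogram(values, num_groups):
    """Return (group_distance, midpoints, relative frequencies) for one column."""
    lo, hi = min(values), max(values)
    data_range = hi - lo
    if data_range == 0:
        raise ValueError("column needs at least two distinct values")
    group_distance = data_range / num_groups  # class width
    counts = [0] * num_groups
    for v in values:
        # Assumption: the maximum value is counted in the last group.
        idx = min(int((v - lo) / group_distance), num_groups - 1)
        counts[idx] += 1
    n = len(values)
    # Joining the (midpoint, frequency) pairs gives the frequency line chart.
    midpoints = [lo + (k + 0.5) * group_distance for k in range(num_groups)]
    freqs = [c / n for c in counts]
    return group_distance, midpoints, freqs


# Hypothetical starting cause column (patient ages) split into 4 groups.
ages = [23, 35, 41, 44, 52, 58, 60, 63, 67, 71, 75, 83]
width, mids, freqs = frequency_histogram(ages, num_groups=4)
```

The midpoints and relative frequencies are the points of the frequency distribution line chart, from which the overall density curve is approximated.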
  • the present invention achieves partial interpretability when generating medical clinical data. It is a method of achieving causal correlation through a generative adversarial network: based on the causal nature of medical data itself, virtual result data is generated by a generative adversarial network from the cause data in the medical data together with random variables, and multiple generators are used to generate causally related medical clinical data.
  • the present invention adopts the approach of generating separately first and then optimizing jointly: the patient's medical information is disassembled along the causal chain; the cause data is simulated through a combination of normal distributions to obtain virtual data; for the result data, the generators of the adversarial network produce virtual data that is causally associated with the cause data. After the virtual data has been generated separately in this way, the generators are optimized with the help of the discriminating ability of the joint discriminator to obtain realistic virtual clinical data.
  • the present invention transforms medical causality, which is difficult to express, into a simple form: when generating subsequent result data, the causal connection between the subsequent result data and the starting cause data is captured with the Pearson correlation coefficient, which converts the causality in the overall causal graph into numerical correlations between a small number of nodes. The difference in correlation coefficients between real data and virtual data is passed to the generator network through backpropagation as part of the generator's loss, which preserves the causal relationship between the virtual subsequent result data generated by the generator and the real cause data.
  • the present invention links the causal connections between different items of patient information through multiple generators, greatly strengthening the internal connections of the generated virtual patient clinical data. It not only generates virtual clinical data but also reduces the possibility of internal contradictions in the generated virtual clinical data, making the data more similar to real data and more suitable for use in real scenarios.
  • Figure 1 is a schematic flow chart of the automatic generation method of clinical data based on causal relationship mining according to the present invention.
  • Figure 2 is a schematic diagram of causal generative adversarial network training according to the present invention.
  • Figure 3 is an example of the present invention transforming a complete undirected graph into a completed partially directed acyclic graph.
  • Figure 4 is a structural diagram of the automatic clinical data generation system based on causal relationship mining according to the present invention.
  • the present invention provides an automatic generation method of clinical data based on causal relationship mining.
  • This invention starts from the source of medical data collection and classifies the information a patient leaves in the hospital according to the patient's admission and discharge process.
  • the records can be roughly divided into disease course records, examination results, doctor's orders, operation records, and nursing records.
  • the data types include images, text and even video, but text data accounts for the largest volume of stored data and has the widest range of applications, and it is usually recorded in patient tables in hospitals.
  • This invention ignores the redundant information generated during the patient's admission process, organizes that process into four items (condition, examination, disease, and medication), and integrates them into a unified data set. The data in the data set is first screened according to the user's personalized needs.
  • An algorithm then determines the dependencies between the data columns and their directions, and a completed partially directed acyclic graph of the selected data is drawn to discover the causal relationships in the data. Finally, the parts of the causal graph that the user is interested in are selected for causally linked virtual generation, which addresses the problem of loose connections within generated virtual data.
  • (1) Data set construction: according to specific needs, select patients admitted during a specific period or in a specific area and, based on each patient's admission code, select four kinds of information from different departments of the hospital: the patient's condition, examinations, symptoms, and medications. Then build a table with patients as rows and the different items of patient information as columns to form the data set to be virtually generated. The data in the data set can afterwards be filtered according to user needs.
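Building the patient-by-column table from per-department records keyed by admission code can be sketched as below. The record layout, the field name `admission_code`, and the department and column names are hypothetical placeholders chosen for illustration.

```python
def build_patient_table(department_records):
    """Merge per-department records into one row per admission code."""
    rows = {}      # admission code -> {column name: value}
    columns = []   # column names in first-seen order
    for records in department_records.values():
        for record in records:
            row = rows.setdefault(record["admission_code"], {})
            for name, value in record.items():
                if name == "admission_code":
                    continue
                row[name] = value
                if name not in columns:
                    columns.append(name)
    # Patients missing a column get None so that every row has the same width.
    table = {code: [row.get(name) for name in columns] for code, row in rows.items()}
    return columns, table


# Hypothetical records from three departments.
departments = {
    "endocrinology": [{"admission_code": "A001", "condition": "diabetes"}],
    "laboratory": [
        {"admission_code": "A001", "urine_sugar": "+"},
        {"admission_code": "A002", "urine_sugar": "-"},
    ],
    "pharmacy": [{"admission_code": "A001", "medication": "glimepiride tablets"}],
}
columns, table = build_patient_table(departments)
```

Missing entries are kept as `None` here so that the later preprocessing step can decide whether to delete or correct them.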
  • Based on the filtered data set obtained in step (1), preprocess the text information and numerical information in the data set.
  • the hospital uses both text and numerical forms to record the patient's condition, symptoms, medications, and so on.
  • For text, remove useless characters, check the integrity of the information in the table, check for missing or incorrect information, and delete or correct the data according to the specific situation. Regular expressions are then used to split long sentences, the polarity of each resulting clause is judged, and a unified medical expression is used to map different wordings and characters with the same underlying meaning to the same form, ensuring that a unified form is used in the table.
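  • For any two data nodes a and b, with the set of the remaining data nodes denoted H, the s-order partial correlation coefficient $\rho_{a,b|H}$ of a and b is:
    $$\rho_{a,b|H} = \frac{\rho_{a,b|H\setminus s} - \rho_{a,s|H\setminus s}\,\rho_{b,s|H\setminus s}}{\sqrt{\left(1-\rho_{a,s|H\setminus s}^{2}\right)\left(1-\rho_{b,s|H\setminus s}^{2}\right)}}$$
  • $\rho_{a,b|H\setminus s}$ is the (s-1)-order partial correlation coefficient of data nodes a and b
  • $\rho_{a,s|H\setminus s}$ is the (s-1)-order partial correlation coefficient of data nodes a and s
  • $\rho_{b,s|H\setminus s}$ is the (s-1)-order partial correlation coefficient of data nodes b and s. The partial correlation coefficient is transformed into an approximately normally distributed quantity Z(a, b | H):
    $$Z(a,b|H) = \frac{1}{2}\ln\frac{1+\rho_{a,b|H}}{1-\rho_{a,b|H}}$$
  • $\rho_{a,b|H}$ is the s-order partial correlation coefficient of data nodes a and b.
  • $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function of the normal distribution N(0, 1). The statistic derived from Z(a, b | H) is compared with the corresponding quantile $\Phi^{-1}(\cdot)$; if the latter is larger, the partial correlation coefficient of data nodes a and b given the remaining data node set H is taken to be 0, that is, data nodes a and b are conditionally independent. The connection lines between data nodes, that is, the dependencies between data columns, are determined in this way. Then, by determining the dependence direction of the edges in the undirected graph based on the principle of d-separation, the undirected graph is extended into a completed partially directed acyclic graph, and the causal connections between data columns are obtained.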
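The conditional independence check described above can be sketched with the recursive partial correlation and the Fisher z-transform. The decision rule in `conditionally_independent` (compare $\sqrt{n - |H| - 3}\,|Z|$ with $\Phi^{-1}(1 - \alpha/2)$) is the form commonly used in constraint-based causal discovery and is an assumption here, as are the sample size n, the significance level `alpha`, the function names, and the toy data.

```python
import math
from statistics import NormalDist


def pearson(x, y):
    """Zero-order (ordinary) correlation of two columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(data, a, b, cond):
    """Recursive partial correlation of columns a and b given the columns in cond."""
    if not cond:
        return pearson(data[a], data[b])
    s, rest = cond[-1], cond[:-1]  # remove one node s from the conditioning set
    r_ab = partial_corr(data, a, b, rest)
    r_as = partial_corr(data, a, s, rest)
    r_bs = partial_corr(data, b, s, rest)
    return (r_ab - r_as * r_bs) / math.sqrt((1 - r_as ** 2) * (1 - r_bs ** 2))


def conditionally_independent(data, a, b, cond, alpha=0.05):
    """Assumed decision rule: sqrt(n - |H| - 3) * |Z| <= Phi^{-1}(1 - alpha / 2)."""
    n = len(data[a])
    r = partial_corr(data, a, b, cond)
    r = max(min(r, 0.999999), -0.999999)      # keep the logarithm finite
    z = 0.5 * math.log((1 + r) / (1 - r))     # Fisher z-transform
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return stat <= NormalDist().inv_cdf(1 - alpha / 2)


# Toy data: a and b both depend on s, and are unrelated once s is known.
data = {
    "s": [1, 1, 1, 1, -1, -1, -1, -1],
    "a": [3, 3, 1, 1, -1, -1, -3, -3],
    "b": [3, 1, 3, 1, -1, -3, -1, -3],
}
edge_without_conditioning = not conditionally_independent(data, "a", "b", [])
edge_given_s = not conditionally_independent(data, "a", "b", ["s"])
```

In the toy data the edge between a and b survives the unconditional test but is removed once s is placed in the conditioning set, which is the behavior the edge-removal step relies on.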
  • Cause data generation: the causal graph found in step (3) is disassembled according to the user's needs and the parts of interest are taken out. The data nodes in the causal graph are divided, according to whether they have parent nodes, into two types: starting cause columns and subsequent result columns. For a starting cause column, min-max normalization is not used. Instead, the range of the data in the column is calculated, the group distance is calculated from the user-defined number of groups, and the frequency distribution histogram is drawn to obtain the frequency distribution line chart and an approximation of the overall density curve.
  • Then, based on the number of peaks on the overall density curve and the principle of minimizing the sum of squared errors, the probability density function p(g) is expressed as a combination of t normal distributions, in which:
  • $\mu_i$ is the mean of the i-th normal distribution
  • $\sigma_i$ is the standard deviation of the i-th normal distribution
  • the distribution function of the probability density function is then calculated
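Once the density has been fitted, sampling a starting cause column reduces to inverse transform sampling. The sketch below assumes the fitted mixture is already available as (weight, mean, standard deviation) triples; the mixing weights, the bisection-based inversion, and the example parameters are assumptions for illustration, and the least-squares fitting step itself is not shown.

```python
import random
from statistics import NormalDist


def mixture_cdf(g, components):
    """Distribution function F(g) of a mixture given as (weight, mean, std) triples."""
    return sum(w * NormalDist(mu, sd).cdf(g) for w, mu, sd in components)


def inverse_cdf(u, components, tol=1e-9):
    """Invert the increasing function F by bisection on a wide bracket."""
    lo = min(mu - 10 * sd for _, mu, sd in components)
    hi = max(mu + 10 * sd for _, mu, sd in components)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_cdf(mid, components) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_cause_column(components, n, seed=0):
    """Draw u uniformly from [0, 1) and map it through the inverse of F."""
    rng = random.Random(seed)
    return [inverse_cdf(rng.random(), components) for _ in range(n)]


# Hypothetical two-peak density: weights 0.6 and 0.4 are assumed values.
components = [(0.6, 5.0, 1.0), (0.4, 12.0, 2.0)]
virtual = sample_cause_column(components, n=1000)
```

Because the values are mapped back through the fitted distribution on the column's original scale, no min-max normalization is needed, which is consistent with the description above.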
  • In step (3), the causal graph of the selected data has been obtained; according to the causal correlation, all result data are affected by the corresponding cause data. In step (4), the cause data has been virtually generated.
  • The result data in the disassembled causal graph is then listed.
  • Each result data is generated by its own generator, so multiple generators need to be constructed to virtually generate the result data. For each result data, random noise is first sampled from a normal distribution; the random noise and the cause data corresponding to the result data are mapped through an embedding layer into the same hidden layer and then input into the generator.
  • the learning ability of the machine is used to construct virtual result data that is causally linked to the cause data. The virtual result data, real cause data, and real result data are then input into the discriminator, which judges whether the result data is real.
  • the loss of the generator includes two parts: the true-and-false loss produced by the discriminator, and the causal loss produced by causality. The smaller the true-and-false loss, the more similar the virtual data is to the real data.
  • the formula of the true-and-false loss is as follows:
  • N is the number of patients in a batch during training, and the probability term is the probability that the i-th generated virtual result data is judged by the discriminator to be the result data in the subsequent result column corresponding to the input real cause data;
  • Causal loss: the causal loss ensures that the causality among the virtual samples generated by the generator is similar to that among the original real samples. The smaller the causal loss, the better the causality among the virtual samples meets the requirement. The causality between data columns is expressed as the correlation between their values. For the virtual subsequent result column and the real subsequent result column, the correlation coefficient with each starting cause column is calculated, and the difference between the correlation coefficients is backpropagated as feedback on the causality of the virtual subsequent result column.
  • the formula of the causal loss is as follows:
  • M is the number of starting cause columns input to the generator
  • $c_j$ is the Pearson correlation coefficient between the real subsequent result column and the corresponding j-th starting cause column, and its virtual counterpart is the Pearson correlation coefficient between the virtual subsequent result column generated by the generator and the corresponding j-th starting cause column;
  • the generator uses the sum of the above two loss functions as its loss function and is optimized with the Wasserstein GAN with gradient penalty (WGAN-GP) method, which is based on a gradient penalty and the earth mover's (bulldozer) distance, to generate virtual result data that is causally connected with the virtual cause data and similar to the real result data. In the training phase, real cause data is input, and the causal connection between the real cause data and the real result data is learned through the generator's loss function. After the network is stable, random noise and virtual cause data are input into the generator to obtain the corresponding virtual result data;
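For reference, the Wasserstein GAN with gradient penalty is usually trained with a discriminator (critic) loss of the following form, in which the first two terms estimate the earth mover's distance and the last term is the gradient penalty. The symbols $\tilde{q}$ (a generated sample), $\hat{q}$ (a random interpolation between a real and a generated sample) and $\lambda$ (the penalty weight) are notation introduced here for illustration and do not come from the text:

```latex
L_D = \mathbb{E}_{\tilde{q}}\big[D(\tilde{q})\big]
    - \mathbb{E}_{q}\big[D(q)\big]
    + \lambda\,\mathbb{E}_{\hat{q}}\Big[\big(\lVert \nabla_{\hat{q}} D(\hat{q}) \rVert_2 - 1\big)^2\Big]
```

How this term is combined with the true-and-false loss and the causal loss described above is not specified here, so the exact weighting remains an open detail.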
  • Step (6) Joint training: after all the generators have been trained in step (5), random variables are used to generate virtual data in sequence, the generated virtual data is concatenated, and it is input into the joint discriminator together with the real data. The joint discriminator judges the causal association of the virtual result data;
  • the causal connection ability of the generator is optimized, and the training objective function Value (D, G) is set to:
  • G refers to the generator
  • D refers to the discriminator
  • q represents the real result data
  • z represents the random variable
  • $\mathbb{E}_{q \sim p(q)}$ denotes the expectation over q drawn from the distribution p(q)
  • $\mathbb{E}_{z \sim p_z(z)}$ denotes the expectation over z drawn from the distribution $p_z(z)$.
  • the error backpropagation algorithm propagates the similarity gap between the virtual result data and the real result data back to each generator, following the order in which the virtual result data was generated, thereby improving the causal association between the virtual result data and the virtual cause data.
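Generating virtual data "in sequence" requires an order in which every cause column is available before the result columns that depend on it. One way to obtain such an order is a topological sort of the causal graph, sketched below with Python's standard `graphlib` module (Python 3.9 or later). The edge set is hypothetical; only the node names a to e follow Figure 2.

```python
from graphlib import TopologicalSorter


def generation_order(parents):
    """Order columns so that every cause is generated before its results."""
    return list(TopologicalSorter(parents).static_order())


# Hypothetical causal graph: each key maps to its direct cause columns.
parents = {
    "a": [],
    "b": [],
    "c": ["a"],
    "d": ["a", "b"],
    "e": ["c", "d"],
}
order = generation_order(parents)
```

The same order, reversed, gives a natural sequence for propagating the joint discriminator's feedback back through the chain of generators.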
  • Embodiments of the present invention provide an automated clinical data generation method based on causal relationship mining, which is used to virtually generate diabetes-related clinical data; the details are as follows:
  • (1) Data set construction: first, according to the user's needs, the IDs of patients admitted from 2000 to 2020 are retrieved from the hospital system. The patients' chief complaints; basic physiological indicators such as height, weight, and BMI; conventional laboratory test indicators such as albumin, globulin, urea, and uric acid; and the patients' disease status and medication status are then selected from different departments. A table with patients as rows and the different items of patient information as columns is constructed to form the data set for this virtual generation, as shown in Table 1, with a focus on selecting data indicators related to diabetes.
  • Step (2) Preprocess the text information and numerical information in the data set obtained in step (1). For example, the medical-history fragment "no nausea and vomiting, no chest tightness and shortness of breath" can be split at the "," into 2 short sentences. The polarity of each short sentence is then judged from "no" to obtain the patient's specific information, which is unified with the information of other patients and converted into sequence form with the help of one-hot encoding.
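The splitting and polarity steps in this example can be sketched with regular expressions. This is a minimal sketch: the negation cues, the splitting of joined symptoms on "and", and the binary indicator encoding (with unmentioned symptoms treated as absent) are assumptions made for illustration, not the patent's rules.

```python
import re

NEGATION_CUES = ("no ", "without ")  # assumed cues; a real system would need more


def parse_history(text):
    """Split a history fragment into symptoms with a polarity flag (1 present, 0 absent)."""
    findings = {}
    for clause in re.split(r"[,;.]\s*", text.lower()):
        clause = clause.strip()
        if not clause:
            continue
        negated = False
        for cue in NEGATION_CUES:
            if clause.startswith(cue):
                negated = True
                clause = clause[len(cue):]
                break
        # A negation at the start of a clause applies to every joined symptom.
        for term in re.split(r"\s+and\s+", clause):
            findings[term.strip()] = 0 if negated else 1
    return findings


def encode(findings, vocabulary):
    """Binary indicator vector over a fixed vocabulary (unmentioned counts as absent)."""
    return [findings.get(term, 0) for term in vocabulary]


history = "no nausea and vomiting, no chest tightness and shortness of breath"
findings = parse_history(history)
vector = encode(parse_history("cough, no fever"), ["cough", "fever", "nausea"])
```

Using one shared vocabulary for all patients is what makes the resulting vectors comparable across rows of the table.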
  • Cause data generation: disassemble the causal graph relating urine sugar, diabetes, glimepiride tablets and other related drugs, and divide the data nodes in the causal graph, according to whether they have a parent node, into two types: starting cause columns and subsequent result columns. Urine sugar, age and other starting cause column data are then virtually generated.
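Dividing nodes by whether they have a parent can be sketched as follows. The node names come from the example above, but the directed edges are hypothetical and chosen only so that urine sugar and age have no parents.

```python
def split_columns(nodes, directed_edges):
    """Split nodes into starting cause columns (no parent) and subsequent result columns."""
    has_parent = {child for _, child in directed_edges}
    start_causes = [n for n in nodes if n not in has_parent]
    results = [n for n in nodes if n in has_parent]
    return start_causes, results


nodes = ["age", "urine sugar", "diabetes", "glimepiride tablets"]
# Hypothetical (cause, result) edges of the disassembled causal graph.
edges = [
    ("age", "diabetes"),
    ("urine sugar", "diabetes"),
    ("diabetes", "glimepiride tablets"),
]
start_causes, results = split_columns(nodes, edges)
```

Starting cause columns are then sampled from their fitted distributions, while each subsequent result column is assigned its own generator.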
  • Result data generation: as shown in (a) to (e) in Figure 2, a and b are the real starting cause data and c, d, and e are the real subsequent result data; a' and b' represent the virtual starting cause data obtained through step (4), and c', d', and e' represent the virtual result data generated through the generative adversarial network.
  • For each result data in the causal graph from step (3), the learning ability of the generator is used to construct virtual result data that is causally linked to the cause data.
  • the virtual result data and the real cause data are input into the discriminator, which judges both the causal connection between the data and whether the virtual result data is real.
  • Based on the trained generator, virtual result data that is causally connected with the cause data and similar to the real result data is then generated.
  • the generated virtual clinical data then needs to be tested.
  • the distribution of each column of virtual clinical data is displayed, and the proportions of the various categories are compared with those of the real clinical data to obtain the similarity of each single column. A logistic regression classifier is then used to judge the overall similarity of the real clinical data and the virtual clinical data.
  • some information in the virtual clinical data is masked and the remaining information is used to predict it, in order to assess how well the causal relationships were learned; the performance of the real clinical data and the virtual clinical data in a specific scenario is then checked.
  • the causal generative adversarial network model effectively protected the privacy of patients, and then the virtual data was officially put into use.
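The single-column comparison of category proportions can be sketched as below. The similarity measure (one minus the total variation distance between the two proportion tables) is an assumed choice, since the text does not name a specific metric, and the sample columns are hypothetical.

```python
from collections import Counter


def category_proportions(column):
    """Share of each category in one column."""
    counts = Counter(column)
    total = len(column)
    return {category: count / total for category, count in counts.items()}


def column_similarity(real_col, virtual_col):
    """1 minus the total variation distance between category proportions (assumed metric)."""
    p = category_proportions(real_col)
    q = category_proportions(virtual_col)
    categories = set(p) | set(q)
    tv = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
    return 1.0 - tv


# Hypothetical "diabetes" column in the real and virtual tables.
real_col = ["yes", "yes", "yes", "no"]
virtual_col = ["yes", "yes", "no", "no"]
similarity = column_similarity(real_col, virtual_col)
```

A value of 1 means the category proportions match exactly, and 0 means the two columns share no categories.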
  • Corresponding to the method embodiment, the present invention also provides an embodiment of the system for automated generation of clinical data based on causal relationship mining.
  • The system comprises a data set construction module, a natural language processing module, a causal discovery module, a cause data generation module, and a result data generation module;
  • The data set construction module is used to construct a table with patients as rows and patient clinical information as columns, obtaining the data set to be virtually generated;
  • The natural language processing module is used to preprocess the text and numerical information in the data set obtained by the data set construction module, converting the text information into a unified form of expression and ordinal-encoding it to obtain numeric sequences, and using a unified numerical representation for the numerical information;
  • The causal discovery module is used to take the data columns of the data set processed by the natural language processing module as data nodes and connect every pair of data nodes to form a complete undirected graph, then determine the dependency direction of the edges according to the principle of d-separation, extending the complete undirected graph into a completed partially directed acyclic graph, thereby obtaining the causal relationships between the data columns and yielding the causal graph;
  • The cause data generation module is used to divide the data nodes in the causal graph obtained by the causal discovery module into two types, starting cause columns and subsequent result columns. For a starting cause column, the class width is computed from a user-defined number of classes and the range of the column's data; the frequency distribution histogram is drawn, the frequency polygon is derived, and the population density curve is approximated; the distribution function of the probability density function is computed, giving an increasing function with values in (0,1), and its inverse is taken; uniform random numbers are generated in [0,1] and mapped through the inverse function to obtain the virtual data for the starting cause column;
  • The result data generation module is used, for each result data item in the subsequent result columns, to first sample random noise from a normal distribution and input this noise, together with the real cause data corresponding to that result data, into a generator, which constructs virtual result data causally linked to the real cause data. The virtual result data, the real cause data, and the real result data are then input into a discriminator for training.
  • The discriminator judges whether the virtual result data are realistic. After the generator and the discriminator have been trained for a number of rounds and reach a stable state, random noise and the virtual cause data are input into the generator to obtain the virtual result data.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Primary Health Care (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

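The present invention discloses a method and system for automated generation of clinical data based on causal relationship mining. Starting from the sources from which medical data are collected and following the patient's admission-to-discharge process, the information a patient leaves at the hospital is sorted into categories, and redundant information produced during admission is ignored. The admission process is organized into four items (condition, examinations, symptoms, and medications), which are integrated into a unified data set. The data in the data set are then screened a first time according to the user's individual needs. An algorithm then determines the dependency relationships and dependency directions among the data columns and draws the completed partially directed acyclic graph of the selected data, revealing the causal associations among them. Finally, the parts of the causal graph that interest the user are selected for causally linked virtual generation, which addresses the problem that generated virtual data are only loosely related to one another.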

Description

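Method and system for automated generation of clinical data based on causal relationship mining
Technical Field
The present invention relates to the field of medical information technology, and in particular to a method and system for automated generation of clinical data based on causal relationship mining.
Background
Integrating, analyzing, and mining the clinical data of individual patients makes it possible to build a sound health model for each patient and to offer precise, accurate plans for disease prevention and treatment. Once such data are collected and studied centrally by an institution, they can support the development of medical software, new drugs, and more, which greatly benefits the medical industry. However, clinical data often contain sensitive personal information, and a leak of this private information may harm patients' lives, so medical departments are cautious about using the data.
For clinical data to realize their full value, the privacy problem must be solved. One approach is to hide personally identifiable information through anonymization. However, an attacker can cross-reference the published table against other tables in their possession and recover the personally identifiable information. This approach is therefore unreliable: it is threatened by whatever data the attacker holds and cannot effectively protect patients' private data. Another approach is to generate virtual clinical data as a whole, isolating the individual real records; as long as the virtual and real data are similar in overall distribution, privacy leakage from the real data is no longer a concern. However, the clinical data produced by patients come in many types and formats, so it is difficult to generate all of them completely; the similarity between virtual and real data cannot be fully guaranteed; and the virtual data do not adequately learn the associations within real clinical data. The approach has therefore not yet reached practical use.
Virtual patients produced by the prior art are not sufficiently plausible, and their data may be self-contradictory. When generating several kinds of clinical data, the prior art mostly concatenates the data and relies on the adversarial self-learning of a generative adversarial network to capture the associations implicitly. This does not learn the true associations, and the learned associations carry some error, so the generated virtual data may conflict internally. In medicine there are many methods that protect patient privacy by generating virtual data, but the basic idea is the same: the data are generated one set at a time, by first learning the distribution and then generating virtual data according to the distribution of the real data. Because neural networks are black boxes, the generated virtual data cannot be properly explained, whereas medicine places more emphasis on the interpretability of how results arise than industrial and mechanical fields do. Such models therefore have low applicability and little value for general use.
Many studies have proposed solutions for the privacy of clinical data. The closest to the present invention are medWGAN and medBGAN, proposed by Baowaly et al. in 2018; both are optimizations of medGAN, proposed by Choi et al. in 2017. This scheme is described in detail below.
medGAN generates virtual versions of two important kinds of medical data: disease data and procedure data. The scheme is validated on the MIMIC_III data set. Using the International Classification of Diseases (ICD), which groups diseases by certain characteristics, the disease and procedure data in MIMIC_III are collected in a simple way by truncating codes, which removes a large number of fine-grained categories. The data are then one-hot encoded; an autoencoder is used to convert the discrete data into continuous data; and the convolutional layers of the original generative adversarial network are replaced with fully connected layers. The distribution of the original clinical data is thus modeled, and the generator produces virtual clinical data similar to the real medical data.
Compared with medGAN, the most important improvement of medWGAN and medBGAN is that the original generative adversarial network in medGAN is replaced with better-performing generative adversarial network models, which speeds up training and moderately alleviates the mode collapse that occurs during training.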
References
[1] Baowaly M K, Lin C C, Liu C L, et al. Synthesizing electronic health records using improved generative adversarial networks[J]. Journal of the American Medical Informatics Association, 2019, 26(3): 228-241.
[2] Choi E, Biswal S, Malin B, et al. Generating multi-label discrete patient records using generative adversarial networks[C]//Machine Learning for Healthcare Conference. PMLR, 2017: 286-305.
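Summary of the Invention
The purpose of the present invention is to address the deficiencies of the prior art by providing a method and system for automated generation of clinical data based on causal relationship mining.
This purpose is achieved through the following technical solution: a method for automated generation of clinical data based on causal relationship mining, with the following steps:
(1) Data set construction: construct a table with patients as rows and patient clinical information as columns to obtain the data set to be virtually generated;
(2) Natural language processing: preprocess the text and numerical information in the data set obtained in step (1); convert the text information into a unified form of expression and ordinal-encode it to obtain numeric sequences, and use a unified numerical representation for the numerical information;
(3) Causal discovery: take the data columns of the data set after natural language processing as data nodes and connect every pair of data nodes to form a complete undirected graph; then determine the dependency direction of the edges in the complete undirected graph according to the principle of d-separation, extending the complete undirected graph into a completed partially directed acyclic graph, thereby obtaining the causal relationships between the data columns and yielding the causal graph;
(4) Cause data generation: divide the data nodes in the causal graph obtained in step (3) into two types, starting cause columns and subsequent result columns. For a starting cause column, compute the class width from a user-defined number of classes and the range of the column's data, draw the frequency distribution histogram, derive the frequency polygon, and approximate the population density curve; compute the distribution function of the probability density function, which is an increasing function with values in (0,1), and take its inverse; generate uniform random numbers in [0,1] and map them through the inverse function to obtain the virtual data for the starting cause column;
(5) Result data generation: for each result data item in the subsequent result columns, first sample random noise from a normal distribution and input this noise, together with the real cause data corresponding to that result data, into a generator, which constructs virtual result data causally linked to the real cause data. The virtual result data, the real cause data, and the real result data are then input into a discriminator for training, and the discriminator judges whether the virtual result data are realistic. After the generator and the discriminator have been trained for a number of rounds and reach a stable state, random noise and the virtual cause data are input into the generator to obtain the virtual result data.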
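Further, in step (1), the patient clinical information is the clinical information on the patient's condition, examinations, symptoms, and medications, selected from the different hospital departments according to the patient's admission code.
Further, in step (2), the text information is preprocessed as follows: remove useless characters; check the completeness of the information in the table, look for missing or clearly erroneous information, and delete or correct the data; then split long sentences with regular expressions and judge the polarity of each resulting clause; then, using unified medical terminology, map differently worded text to a single form based on its shared underlying meaning, so that the table uses a unified form of expression; finally, ordinal-encode the distinct text expressions in each column to convert them into numeric sequences.
Further, in step (2), the preprocessed text and numerical information are combined into real medical clinical tabular data consisting of numbers. The table is denoted $(x, Y)$, where $x=[x_1,x_2,x_3,\ldots,x_n]$ is the set of patient admission codes, $n$ is the number of patients, $x_n$ is the code of the $n$-th patient, $Y=[y_1,y_2,y_3,\ldots,y_n]^T=[f_1,f_2,f_3,\ldots,f_m]\in\mathbb{R}^{n\times m}$ is the patient feature matrix, $m$ is the number of selected patient information indicators, $f_m$ is the data of the $m$-th indicator, and $y_n$ is the clinical data of the $n$-th patient.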
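Further, in step (3), the causal relationships between data columns are obtained as follows: for any two data nodes in the complete undirected graph, if they are not conditionally independent given any of the other nodes, the two nodes are judged to be causally related; assuming that all the random nodes jointly follow a multivariate Gaussian distribution, whether data nodes are conditionally independent is determined by the partial correlation coefficient formula.
Further, for a complete undirected graph with $r$ data nodes, take any two data nodes $a$ and $b$ and let $H$ be the set of the remaining data nodes; the $s$-th order partial correlation coefficient $\rho_{a,b|H}$ is:
$$\rho_{a,b|H}=\frac{\rho_{a,b|H\setminus s}-\rho_{a,s|H\setminus s}\,\rho_{b,s|H\setminus s}}{\sqrt{\left(1-\rho_{a,s|H\setminus s}^{2}\right)\left(1-\rho_{b,s|H\setminus s}^{2}\right)}}$$
where $\rho_{a,b|H\setminus s}$ is the $(s-1)$-th order partial correlation coefficient of data nodes $a$ and $b$, $\rho_{a,s|H\setminus s}$ is that of data nodes $a$ and $s$, and $\rho_{b,s|H\setminus s}$ is that of data nodes $b$ and $s$. The Fisher Z-transform converts it into the normally distributed quantity $Z(a,b|H)$:
$$Z(a,b|H)=\frac{1}{2}\ln\frac{1+\rho_{a,b|H}}{1-\rho_{a,b|H}}$$
where $\rho_{a,b|H}$ is the $s$-th order partial correlation coefficient of data nodes $a$ and $b$. Given a significance level $\alpha$, the magnitude of the test statistic formed from $Z(a,b|H)$ is compared with the critical value obtained from $\Phi^{-1}(\cdot)$, where $\Phi(\cdot)$ is the cumulative distribution function of the normal distribution $N(0,1)$ and $\Phi^{-1}(\cdot)$ is its inverse. If the latter is larger, the partial correlation coefficient of $a$ and $b$ given the remaining node set $H$ is taken to be 0, that is, $a$ and $b$ are conditionally independent. This determines which connecting edges between data nodes remain, that is, the dependency relationships between the data columns.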
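Further, in step (4), the distribution function of the probability density function is computed as follows: based on the number of peaks of the population density curve and the least-sum-of-squared-errors criterion, the probability density function is expressed as a combination $p(g)$ of $t$ normal distributions, namely:
where $g$ is the starting cause column data in step (4), $\xi_i$ is the mean of the $i$-th normal distribution, and $\sigma_i$ is the standard deviation of the $i$-th normal distribution; the inverse of the distribution function is obtained from the positional relationship between $g$ and $p(g)$.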
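Further, in step (5), the generator loss has two parts: the real/fake loss contributed by the discriminator, and the causal loss contributed by causality. The smaller the real/fake loss, the more similar the virtual data are to the real data. The real/fake loss $loss_a$ is given by:
where $N$ is the number of patients in one training batch, and the per-sample term is the probability, as assessed by the discriminator, that the $i$-th generated virtual result data item is the result data in the subsequent result column corresponding to the input real cause data;
The causal loss ensures that the causality of the virtual samples produced by the generator is similar to that of the original real samples; the smaller the causal loss, the better the causality among the virtual samples meets the requirement. Causality between data columns is expressed as correlation between values: for the virtual result column and the real result column, the correlation coefficient with each corresponding cause column is computed, and the difference between the correlation coefficients is backpropagated to give feedback on the causality of the virtual result column. The causal loss $loss_b$ is given by:
where $M$ is the number of starting cause columns input to the generator, $c_j$ is the Pearson correlation coefficient between the real result column and the corresponding $j$-th cause column, and its virtual counterpart is the Pearson correlation coefficient between the virtual result column produced by the generator and the corresponding $j$-th cause column;
The generator uses the sum of the two losses above as its loss function and is optimized with the Wasserstein GAN with gradient penalty (WGAN-GP), a generative adversarial network based on the earth mover's distance and a gradient penalty, so as to generate virtual result data that are causally linked to the virtual cause data and similar to the real result data. During training, real cause data are input and the generator's loss function learns the causal relationship between real cause data and real result data; once the network is stable, virtual cause data are input to obtain the corresponding virtual result data.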
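Further, in step (5), the generated virtual result data are concatenated and input, together with the real result data, into a joint discriminator, which judges the causal associations of the virtual result data. The causal-linking ability of the generators is optimized according to the training ratio of the generators to the joint discriminator, and the training objective function $Value(D,G)$ is set to:
$$\min_G \max_D Value(D,G)=E_{q\sim p(q)}[\log D(q)]+E_{z\sim p_z(z)}[\log(1-D(G(z)))]$$
where $G$ denotes the generator, $D$ the discriminator, $q$ the real result data, and $z$ the random variable; $E_{q\sim p(q)}$ denotes the expectation over $q$ distributed according to $p(q)$, and $E_{z\sim p_z(z)}$ denotes the expectation over $z$ distributed according to $p_z(z)$. Using error backpropagation, and following the order in which the virtual result data were generated, the similarity gap between the virtual and real result data is backpropagated to every generator, strengthening the causal association between the virtual result data and the virtual cause data.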
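The present invention also provides a system for automated generation of clinical data based on causal relationship mining, comprising a data set construction module, a natural language processing module, a causal discovery module, a cause data generation module, and a result data generation module;
The data set construction module is used to construct a table with patients as rows and patient clinical information as columns, obtaining the data set to be virtually generated;
The natural language processing module is used to preprocess the text and numerical information in the data set obtained by the data set construction module, converting the text information into a unified form of expression and ordinal-encoding it to obtain numeric sequences, and using a unified numerical representation for the numerical information;
The causal discovery module is used to take the data columns of the data set processed by the natural language processing module as data nodes and connect every pair of data nodes to form a complete undirected graph, then determine the dependency direction of the edges according to the principle of d-separation, extending the complete undirected graph into a completed partially directed acyclic graph, thereby obtaining the causal relationships between the data columns and yielding the causal graph;
The cause data generation module is used to divide the data nodes in the causal graph obtained by the causal discovery module into two types, starting cause columns and subsequent result columns. For a starting cause column, the class width is computed from a user-defined number of classes and the range of the column's data; the frequency distribution histogram is drawn, the frequency polygon is derived, and the population density curve is approximated; the distribution function of the probability density function is computed, giving an increasing function with values in (0,1), and its inverse is taken; uniform random numbers are generated in [0,1] and mapped through the inverse function to obtain the virtual data for the starting cause column;
The result data generation module is used, for each result data item in the subsequent result columns, to first sample random noise from a normal distribution and input this noise, together with the real cause data corresponding to that result data, into a generator, which constructs virtual result data causally linked to the real cause data. The virtual result data, the real cause data, and the real result data are then input into a discriminator for training, and the discriminator judges whether the virtual result data are realistic. After the generator and the discriminator have been trained for a number of rounds and reach a stable state, random noise and the virtual cause data are input into the generator to obtain the virtual result data.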
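Beneficial effects of the present invention:
1. When generating medical clinical data, the present invention achieves partial interpretability by realizing causal association through generative adversarial networks: based on the causal nature inherent in medical data, virtual result data are generated by generative adversarial networks from the cause data in the medical data together with random variables, and multiple generators are used to generate causally linked medical clinical data.
2. When generating medical clinical data, the present invention generates the parts separately first and then optimizes them jointly: the patient's medical information is split along the causal chain. For cause data, virtual data are obtained by simulation with a combination of normal distributions; for result data, virtual data are obtained by causally linking them to the cause data through the generator of a generative adversarial network. After this separate generation, the discriminating ability of a joint discriminator is used to optimize the generators, yielding realistic virtual clinical data.
3. When generating medical clinical data, the present invention converts medical causality, which is hard to express, into a simple form: when generating subsequent result data, the causal relationship between the subsequent result data and the starting cause data is handled by using the Pearson correlation coefficient to convert the causality in the overall causal graph into numerical correlation among a small number of nodes. The difference between the correlation coefficients of the real and virtual data is passed to the generator network by backpropagation as a generator loss, which ensures the causal association between the virtual subsequent result data produced by the generator and the real cause data.
4. The present invention connects the causal relationships among different items of patient information through multiple generators, which greatly strengthens the internal coherence of the generated virtual patient clinical data. It not only generates virtual clinical data but also reduces the chance of internal contradictions, making the data more similar to real data and better suited to use in real scenarios.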
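Brief Description of the Drawings
Figure 1 is a flow diagram of the method of the present invention for automated generation of clinical data based on causal relationship mining.
Figure 2 is a diagram of the training of the causal generative adversarial network of the present invention.
Figure 3 is an example of converting a complete undirected graph into a directed acyclic graph according to the present invention.
Figure 4 is a structural diagram of the system of the present invention for automated generation of clinical data based on causal relationship mining.
Detailed Description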
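The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Figure 1 shows the method for automated generation of clinical data based on causal relationship mining provided by the present invention.
Starting from the sources from which medical data are collected and following the patient's admission-to-discharge process, the present invention sorts the information a patient leaves at the hospital into categories. The records fall roughly into progress notes, examination and laboratory results, physician orders, surgical records, and nursing records, and the data types include images, text, and even medical imaging data. The richest and most widely used data, however, are textual, and in hospitals these are usually recorded in the patient's tables. The present invention ignores redundant information produced during admission and organizes the admission process into four items (condition, examinations, symptoms, and medications), which are integrated into a unified data set. The data in the data set are then screened a first time according to the user's individual needs. An algorithm then determines the dependency relationships and dependency directions among the data columns and draws the completed partially directed acyclic graph of the selected data, revealing the causal associations among them. Finally, the parts of the causal graph that interest the user are selected for causally linked virtual generation, which addresses the problem that generated virtual data are only loosely related to one another.
The specific steps of the method are as follows:
(1) Data set construction: according to the specific need, select patients admitted during a particular period or from a particular region. Using each patient's admission code, select information on the four aspects of condition, examinations, symptoms, and medications from the different hospital departments. Then construct a table with patients as rows and the different items of patient information as columns, which forms the data set to be virtually generated. The data in the data set can then be screened according to the user's needs.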
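(2) Natural language processing: based on the data set screened in step (1), preprocess its text and numerical information. Hospitals record a patient's condition, symptoms, and medications in both text and numerical form. For text: remove useless characters; check the completeness of the information in the table, look for missing or clearly erroneous information, and delete or correct the data as the case requires; then split long sentences with regular expressions and judge the polarity of each resulting clause; then, using unified medical terminology, map differently worded text to a single form based on its shared underlying meaning, so that the table uses a unified form of expression; finally, ordinal-encode the distinct text expressions in each column to convert them into numeric sequences. For numerical values, a unified numerical representation is used for each column. Combining the two gives real medical clinical tabular data consisting of numbers. The table is denoted $(x, Y)$, where $x=[x_1,x_2,x_3,\ldots,x_n]$ is the set of patient admission codes, $n$ is the number of patients, $x_n$ is the code of the $n$-th patient, $Y=[y_1,y_2,y_3,\ldots,y_n]^T=[f_1,f_2,f_3,\ldots,f_m]\in\mathbb{R}^{n\times m}$ is the patient feature matrix, $m$ is the number of selected patient information indicators, $f_m$ is the data of the $m$-th indicator, and $y_n$ is the clinical data of the $n$-th patient.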
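The ordinal encoding and the $(x, Y)$ table can be illustrated with a minimal sketch. The column names, the toy values, and the helper `ordinal_encode` are hypothetical and serve only as an illustration; the text does not specify how codes are assigned, so first-appearance order is assumed here.

```python
def ordinal_encode(column):
    """Map each distinct text value to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in column:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical toy table: admission codes x and two text indicator columns.
x = ["P001", "P002", "P003", "P004"]
diagnosis = ["type 2 diabetes", "hypertension", "type 2 diabetes", "none"]
drug = ["glimepiride", "none", "metformin", "none"]

f1, diagnosis_codes = ordinal_encode(diagnosis)
f2, drug_codes = ordinal_encode(drug)

# Y has one row per patient (y_i) and one column per indicator (f_j).
Y = [list(row) for row in zip(f1, f2)]
```

Each row of `Y` then plays the role of one $y_i$, and each encoded column plays the role of one $f_j$.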
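(3) Causal discovery: take the data columns after natural language processing as data nodes and connect every pair of data nodes to form a complete undirected graph. Then determine the causal relationships between data nodes: for any two data nodes in the graph, if they are not conditionally independent given any of the other nodes, the two nodes are judged to be causally related. Assuming that all the random nodes jointly follow a multivariate Gaussian distribution, the requirement that variables be conditionally independent becomes the condition that their partial correlation coefficient is 0. For a complete undirected graph with $r$ data nodes, take any two data nodes $a$ and $b$ and let $H$ be the set of the remaining data nodes; the $s$-th order partial correlation coefficient $\rho_{a,b|H}$ is:
$$\rho_{a,b|H}=\frac{\rho_{a,b|H\setminus s}-\rho_{a,s|H\setminus s}\,\rho_{b,s|H\setminus s}}{\sqrt{\left(1-\rho_{a,s|H\setminus s}^{2}\right)\left(1-\rho_{b,s|H\setminus s}^{2}\right)}}$$
where $\rho_{a,b|H\setminus s}$ is the $(s-1)$-th order partial correlation coefficient of data nodes $a$ and $b$, $\rho_{a,s|H\setminus s}$ is that of data nodes $a$ and $s$, and $\rho_{b,s|H\setminus s}$ is that of data nodes $b$ and $s$. The Fisher Z-transform converts it into the normally distributed quantity $Z(a,b|H)$:
$$Z(a,b|H)=\frac{1}{2}\ln\frac{1+\rho_{a,b|H}}{1-\rho_{a,b|H}}$$
where $\rho_{a,b|H}$ is the $s$-th order partial correlation coefficient of data nodes $a$ and $b$. Given a significance level $\alpha$, the magnitude of the test statistic formed from $Z(a,b|H)$ is compared with the critical value obtained from $\Phi^{-1}(\cdot)$, where $\Phi(\cdot)$ is the cumulative distribution function of the normal distribution $N(0,1)$ and $\Phi^{-1}(\cdot)$ is its inverse. If the latter is larger, the partial correlation coefficient of $a$ and $b$ given the remaining node set $H$ is taken to be 0, that is, $a$ and $b$ are conditionally independent. This determines which connecting edges between data nodes remain, that is, the dependency relationships between the data columns. The dependency direction of the edges in the undirected graph is then determined according to the principle of d-separation, which extends the undirected graph into a completed partially directed acyclic graph and yields the causal relationships between the data columns.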
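The conditional independence test described in step (3) can be sketched with the Python standard library alone. The exact quantities compared are not reproduced in the text, so the sketch assumes the form commonly used in the PC algorithm: the statistic $\sqrt{n-|H|-3}\,|Z(a,b|H)|$, with $n$ the number of samples, is compared with $\Phi^{-1}(1-\alpha/2)$. The function names and the toy chain a -> s -> b are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson(u, v):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    var_u = sum((p - mean_u) ** 2 for p in u)
    var_v = sum((q - mean_v) ** 2 for q in v)
    return cov / math.sqrt(var_u * var_v)


def partial_corr(data, a, b, cond):
    """Recursive partial correlation of columns a and b given the columns in cond."""
    if not cond:
        return pearson(data[a], data[b])
    s, rest = cond[0], cond[1:]
    r_ab = partial_corr(data, a, b, rest)
    r_as = partial_corr(data, a, s, rest)
    r_bs = partial_corr(data, b, s, rest)
    return (r_ab - r_as * r_bs) / math.sqrt((1 - r_as ** 2) * (1 - r_bs ** 2))


def conditionally_independent(data, a, b, cond, alpha=0.05):
    """Assumed PC-style test: True when the statistic does not exceed the critical value."""
    n = len(data[a])
    rho = partial_corr(data, a, b, cond)
    z = 0.5 * math.log((1 + rho) / (1 - rho))  # Fisher Z-transform
    statistic = math.sqrt(n - len(cond) - 3) * abs(z)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return statistic <= critical


# Hypothetical toy data forming a chain a -> s -> b.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(2000)]
s = [value + rng.gauss(0, 1) for value in a]
b = [value + rng.gauss(0, 1) for value in s]
data = {"a": a, "s": s, "b": b}
```

In this toy chain, `a` and `b` are strongly correlated marginally, while their partial correlation given `s` is close to zero, which is the situation in which the edge between them would be removed. The sketch does not guard against a partial correlation of exactly 1 or -1.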
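(4) Cause data generation: the causal graph found in step (3) is taken apart according to the user's needs and the part of interest is extracted. The data nodes in this causal graph are divided into two types, starting cause columns and subsequent result columns, according to whether a node has a parent. A starting cause column is not processed by min-max normalization. Instead, the range of the column's data is computed, the class width is computed from a user-defined number of classes, and the frequency distribution histogram is drawn; from it the frequency polygon is obtained, which approximates the population density curve. Then, based on the number of peaks of the population density curve and the least-sum-of-squared-errors criterion, the probability density function is expressed as a combination $p(g)$ of $t$ normal distributions, namely:
where $g$ is the starting cause column data in step (4), $\xi_i$ is the mean of the $i$-th normal distribution, and $\sigma_i$ is the standard deviation of the $i$-th normal distribution. The distribution function of this probability density function is then computed, giving an increasing function with values in (0,1), and its inverse is obtained from the positional relationship between $g$ and $p(g)$. Uniform random numbers are then generated in [0,1] and mapped through the inverse function; the resulting values are the virtual data for the starting cause column, which lays the foundation for generating the data of the subsequent result columns.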
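The sampling procedure in step (4) is inverse transform sampling, and a minimal sketch is given below. It assumes the fitted density is a weighted mixture of normal components; the weights, the example parameters, and the numerical inversion by bisection are all assumptions, since the text does not reproduce the mixture formula or say how the inverse is computed.

```python
import random
from statistics import NormalDist


def mixture_cdf(g, components):
    """Distribution function of a mixture; components are (weight, mean, std) tuples."""
    return sum(w * NormalDist(mu, sigma).cdf(g) for w, mu, sigma in components)


def mixture_inverse_cdf(u, components, tol=1e-9):
    """Invert the increasing mixture CDF numerically by bisection."""
    lo = min(mu - 10 * sigma for _, mu, sigma in components)
    hi = max(mu + 10 * sigma for _, mu, sigma in components)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_cdf(mid, components) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_cause_column(n, components, seed=0):
    """Draw uniform numbers in [0, 1) and map them through the inverse CDF."""
    rng = random.Random(seed)
    return [mixture_inverse_cdf(rng.random(), components) for _ in range(n)]


# Hypothetical two-peak fit for an age column: (weight, mean, std).
components = [(0.6, 45.0, 8.0), (0.4, 68.0, 6.0)]
virtual_age = sample_cause_column(1000, components)
```

Because the mixture CDF is strictly increasing, bisection always converges, and the sampled values follow the fitted two-peak density without reusing any real record.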
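(5) Result data generation: step (3) produced the causal graph of the selected data. By causal association, every result data item is influenced by its corresponding cause data, and step (4) has already generated the cause data virtually. The result data in the extracted causal graph are now enumerated; each result data item is generated by its own generator, so multiple generators are built for the result data. For each result data item, random noise is first sampled from a normal distribution; the noise and the cause data corresponding to that result data are passed through an embedding layer into the same hidden layer and then into the generator, whose learning ability is used to construct virtual result data causally linked to the cause data. The virtual result data, the real cause data, and the real result data are then input into the discriminator, which judges whether the result data are realistic. The generator loss has two parts: the real/fake loss contributed by the discriminator, and the causal loss contributed by causality. The smaller the real/fake loss, the more similar the virtual data are to the real data. The real/fake loss is given by:
where $N$ is the number of patients in one training batch, and the per-sample term is the probability, as assessed by the discriminator, that the $i$-th generated virtual result data item is the result data in the subsequent result column corresponding to the input real cause data;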
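The formula for the real/fake loss is not reproduced in the text. One form that would be consistent with the stated definitions, offered purely as an assumption, is the batch-averaged negative log-probability, where $\hat{p}_i$ is a hypothetical symbol for the discriminator's probability for the $i$-th virtual result:

```latex
loss_a = -\frac{1}{N}\sum_{i=1}^{N}\log \hat{p}_i
```

Under this assumed form the loss approaches 0 as every $\hat{p}_i$ approaches 1, which matches the statement that a smaller real/fake loss means the virtual data are more similar to the real data.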
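The causal loss ensures that the causality of the virtual samples produced by the generator is similar to that of the original real samples; the smaller the causal loss, the better the causality among the virtual samples meets the requirement. Causality between data columns is expressed as correlation between values: for the virtual subsequent result column and the real subsequent result column, the correlation coefficient with each starting cause column is computed, and the difference between the correlation coefficients is backpropagated to give feedback on the causality of the virtual subsequent result column. The causal loss is given by:
where $M$ is the number of starting cause columns input to the generator, $c_j$ is the Pearson correlation coefficient between the real subsequent result column and the corresponding $j$-th starting cause column, and its virtual counterpart is the Pearson correlation coefficient between the virtual subsequent result column produced by the generator and the corresponding $j$-th starting cause column;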
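The idea behind the causal loss can be sketched as follows. The exact formula is not reproduced in the text, so the mean absolute gap between the real and virtual Pearson coefficients is an assumed form, and the function names and toy columns are hypothetical. In actual training the correlations would have to be computed inside an automatic differentiation framework so that the gap can be backpropagated; plain Python is used here only to show the arithmetic.

```python
import math


def pearson(u, v):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    var_u = sum((p - mean_u) ** 2 for p in u)
    var_v = sum((q - mean_v) ** 2 for q in v)
    return cov / math.sqrt(var_u * var_v)


def causal_loss(cause_columns, real_result, virtual_result):
    """Assumed form: mean over the M cause columns of |c_j - virtual c_j|."""
    gaps = []
    for cause in cause_columns:
        c_real = pearson(real_result, cause)
        c_virtual = pearson(virtual_result, cause)
        gaps.append(abs(c_real - c_virtual))
    return sum(gaps) / len(gaps)


# Hypothetical toy columns: one cause column and three candidate result columns.
cause = [[1.0, 2.0, 3.0, 4.0, 5.0]]
real = [2.1, 3.9, 6.2, 8.0, 9.8]
faithful = [2.0, 4.1, 5.9, 8.2, 10.1]
scrambled = [9.0, 2.0, 7.0, 1.0, 5.0]

loss_faithful = causal_loss(cause, real, faithful)
loss_scrambled = causal_loss(cause, real, scrambled)
```

A virtual column that preserves the dependence on the cause column yields a much smaller loss than one whose values are scrambled, which is the signal the generator is meant to receive.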
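The generator uses the sum of the two losses above as its loss function and is optimized with the Wasserstein GAN with gradient penalty (WGAN-GP), a generative adversarial network based on the earth mover's distance and a gradient penalty, so as to generate virtual result data that are causally linked to the virtual cause data and similar to the real result data. During training, real cause data are input and the generator's loss function learns the causal relationship between real cause data and real result data; once the network is stable, random noise and virtual cause data are input into the generator to obtain the corresponding virtual result data;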
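For orientation, the discriminator (critic) objective of WGAN-GP as commonly published is shown below. The text does not state the variant or hyperparameters used here, so the symbols $P_r$ (real distribution), $P_g$ (generated distribution), the interpolate $\hat{x}$, and the penalty weight $\lambda$ are assumptions taken from the general WGAN-GP formulation, not from this document.

```latex
L_D = E_{\tilde{x}\sim P_g}\left[D(\tilde{x})\right]
    - E_{x\sim P_r}\left[D(x)\right]
    + \lambda\, E_{\hat{x}\sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right],
\qquad \hat{x} = \epsilon x + (1-\epsilon)\tilde{x},\quad \epsilon \sim U[0,1]
```

In this setting the discriminator would also receive the cause data as a conditioning input, and the generator would minimize the sum of the real/fake loss and the causal loss described above.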
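(6) Joint training: after all the generators in step (5) have been trained, virtual data are generated in sequence from random variables; the generated virtual data are concatenated and input, together with the real data, into the joint discriminator, which judges the causal associations of the virtual result data;
The causal-linking ability of the generators is optimized according to the training ratio of the generators to the joint discriminator, and the training objective function $Value(D,G)$ is set to:
$$\min_G \max_D Value(D,G)=E_{q\sim p(q)}[\log D(q)]+E_{z\sim p_z(z)}[\log(1-D(G(z)))]$$
where $G$ denotes the generator, $D$ the discriminator, $q$ the real result data, and $z$ the random variable; $E_{q\sim p(q)}$ denotes the expectation over $q$ distributed according to $p(q)$, and $E_{z\sim p_z(z)}$ denotes the expectation over $z$ distributed according to $p_z(z)$. Using error backpropagation, and following the order in which the virtual result data were generated, the similarity gap between the virtual and real result data is backpropagated to every generator, strengthening the causal association between the virtual result data and the virtual cause data.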
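To make the above objects, features, and advantages of the present invention clearer, specific embodiments are described in detail below with reference to the accompanying drawings.
Many specific details are set out in the following description to aid a full understanding of the present invention. The invention can, however, be implemented in ways other than those described here, and those skilled in the art can make similar extensions without departing from its substance; the invention is therefore not limited to the specific embodiments disclosed below.
An embodiment of the present invention provides a method for automated generation of clinical data based on causal relationship mining, used to virtually generate diabetes-related clinical data, as follows:
(1) Data set construction: first, according to the individual need, the IDs of patients admitted between 2000 and 2020 are located in the hospital system. Then, from the different departments, the following are selected: each patient's chief complaint; basic physiological indicators such as height, weight, and BMI; routine laboratory indicators such as albumin, globulin, urea, and uric acid; and the patient's diseases and medications. A table with patients as rows and the different items of patient information as columns is constructed to form the data set for this virtual generation, as shown in Table 1, from which the indicators related to diabetes are then selected.
Table 1
(2) Natural language processing: the text and numerical information in the data set obtained in step (1) are preprocessed. For example, the medical history fragment 无恶心呕吐,无胸闷气短 ("no nausea or vomiting, no chest tightness or shortness of breath") is split at the comma into two short clauses; polarity is then judged from the negation word 无 ("no"), giving the patient's specific information, which is analyzed together with the information of other patients and converted into sequence form by one-hot encoding.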
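The clause splitting and polarity judgment in this example can be sketched with the standard `re` module. The set of delimiters and the rule that a clause is negated when it starts with 无 are simplifying assumptions for illustration; the helper names are hypothetical.

```python
import re

NEGATION = "无"  # negation marker used in the example above


def split_clauses(text):
    """Split a long sentence on ASCII or full-width commas, semicolons, and full stops."""
    return [clause for clause in re.split(r"[,,;;。]", text) if clause]


def clause_polarity(clause):
    """Return (finding, present); present is False when the clause is negated."""
    if clause.startswith(NEGATION):
        return clause[len(NEGATION):], False
    return clause, True


history = "无恶心呕吐,无胸闷气短"
findings = [clause_polarity(clause) for clause in split_clauses(history)]
```

Each finding, paired with its polarity, can then be aligned with the same finding in other patients' records before encoding.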
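(3) Causal discovery: as shown in Figure 3, the selected diabetes-related data columns, such as age, blood glucose, and glimepiride tablets, are taken as data nodes, and every pair of data nodes is connected to form a complete undirected graph. The dependency direction of the edges in the undirected graph is then determined according to the principle of d-separation, extending the undirected graph into a completed partially directed acyclic graph, which is the causal graph.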
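The text does not spell out how d-separation is turned into edge directions. One standard way, used in the PC algorithm and assumed here, is to orient colliders: for nodes a and b that are not adjacent but share a neighbour c, the edges are directed a -> c <- b whenever c is not in the set that separated a and b. The skeleton, the separating sets, and the node `diet` below are hypothetical.

```python
from itertools import combinations


def orient_colliders(skeleton, sepset):
    """Orient unshielded triples a - c - b as a -> c <- b when c is not in sepset(a, b)."""
    directed = set()
    for c, neighbours in skeleton.items():
        for a, b in combinations(sorted(neighbours), 2):
            if b in skeleton[a]:
                continue  # a and b are adjacent, so the triple is shielded
            if c not in sepset[frozenset((a, b))]:
                directed.add((a, c))
                directed.add((b, c))
    return directed


# Hypothetical skeleton left after the conditional independence tests.
skeleton = {
    "age": {"blood_glucose"},
    "diet": {"blood_glucose"},
    "blood_glucose": {"age", "diet", "glimepiride"},
    "glimepiride": {"blood_glucose"},
}
# Hypothetical separating sets recorded when the edges were removed.
sepset = {
    frozenset(("age", "diet")): set(),
    frozenset(("age", "glimepiride")): {"blood_glucose"},
    frozenset(("diet", "glimepiride")): {"blood_glucose"},
}
edges = orient_colliders(skeleton, sepset)
```

Edges that remain undirected after this step are usually oriented by further propagation rules, which are omitted from the sketch.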
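(4) Cause data generation: the part of the causal graph that covers urine glucose, diabetes, and related drugs such as glimepiride tablets is extracted, and its data nodes are divided into two types, starting cause columns and subsequent result columns, according to whether a node has a parent. Virtual data such as urine glucose and age are then generated from the starting cause columns.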
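Dividing nodes by whether they have a parent, and deciding in which order the result columns should be generated, can be sketched as below. The edges in the toy graph are hypothetical and serve only as an illustration; the real graph comes from the causal discovery step.

```python
def split_columns(parents):
    """parents maps each node to the list of its parent nodes in the causal graph."""
    cause_columns = [node for node, ps in parents.items() if not ps]
    result_columns = [node for node, ps in parents.items() if ps]
    return cause_columns, result_columns


def generation_order(parents):
    """Order result columns so that every parent is generated before its children."""
    done = {node for node, ps in parents.items() if not ps}
    pending = [node for node, ps in parents.items() if ps]
    order = []
    while pending:
        ready = [node for node in pending if all(p in done for p in parents[node])]
        if not ready:
            raise ValueError("the graph contains a cycle")
        for node in ready:
            order.append(node)
            done.add(node)
        pending = [node for node in pending if node not in done]
    return order


# Hypothetical extracted sub-graph; the edges are illustrative only.
parents = {
    "age": [],
    "urine_glucose": [],
    "diabetes": ["age", "urine_glucose"],
    "glimepiride": ["diabetes"],
}
cause_columns, result_columns = split_columns(parents)
order = generation_order(parents)
```

Starting cause columns are sampled directly, and each result column is produced by its own generator once all of its parents are available.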
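(5) Result data generation: as shown in (a) to (e) of Figure 2, a and b are real starting cause data and c, d, and e are real subsequent result data; a' and b' denote the virtual starting cause data obtained in step (4), and c', d', and e' denote the virtual result data generated by the generative adversarial network. For each result data item in the causal graph of step (3), the learning ability of a generator is used to construct virtual result data causally linked to the cause data. The virtual result data and the real cause data are then input together into the discriminator, which judges both the causal connection between the data and the realism of the virtual result data. The trained generator then produces virtual result data that are causally linked to the cause data and similar to the real result data.
(6) Joint training: all generators are trained jointly, with the training ratio of the generators to the joint discriminator set to 3:1, to optimize the generators' causal-linking ability. Using error backpropagation, the similarity gap between the virtual and real data is backpropagated to every generator, further optimizing each generator's parameters, reducing the pipeline error that accumulates across multiple generators, and improving the realism and overall causal association of the virtual result data.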
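The generated virtual clinical data must then be tested. First, the distribution of each column of the virtual clinical data is displayed, and the proportions of its categories are compared with those of the real clinical data to obtain a single-column similarity. A logistic regression classifier is then used to judge the overall similarity of the real and virtual clinical data. Next, some information in the virtual clinical data is masked and predicted from the remaining information, which shows how well the causality was learned. The number of patients in the real and virtual clinical data under a specific condition is then checked. These checks showed that the causal generative adversarial network model effectively protected patient privacy, after which the virtual data were put into use.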
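The single-column comparison of category proportions can be sketched as follows. The text does not name a similarity measure, so one minus the total variation distance between the two sets of proportions is used here as an assumption; the function names and toy columns are hypothetical.

```python
from collections import Counter


def category_proportions(column):
    """Share of each category in a column."""
    counts = Counter(column)
    total = len(column)
    return {category: count / total for category, count in counts.items()}


def column_similarity(real_column, virtual_column):
    """Assumed metric: 1 minus the total variation distance between category proportions."""
    p = category_proportions(real_column)
    q = category_proportions(virtual_column)
    categories = set(p) | set(q)
    distance = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in categories)
    return 1.0 - distance


# Hypothetical toy columns for one categorical indicator.
real = ["yes", "no", "no", "no"]
virtual = ["yes", "yes", "no", "no"]
similarity = column_similarity(real, virtual)
```

A value of 1 means the category proportions match exactly, and lower values indicate a larger gap between the virtual and real column.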
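Corresponding to the embodiment of the method for automated generation of clinical data based on causal relationship mining, the present invention also provides an embodiment of the system for automated generation of clinical data based on causal relationship mining.
As shown in Figure 4, the system provided by the present invention for automated generation of clinical data based on causal relationship mining comprises a data set construction module, a natural language processing module, a causal discovery module, a cause data generation module, and a result data generation module;
The data set construction module is used to construct a table with patients as rows and patient clinical information as columns, obtaining the data set to be virtually generated;
The natural language processing module is used to preprocess the text and numerical information in the data set obtained by the data set construction module, converting the text information into a unified form of expression and ordinal-encoding it to obtain numeric sequences, and using a unified numerical representation for the numerical information;
The causal discovery module is used to take the data columns of the data set processed by the natural language processing module as data nodes and connect every pair of data nodes to form a complete undirected graph, then determine the dependency direction of the edges according to the principle of d-separation, extending the complete undirected graph into a completed partially directed acyclic graph, thereby obtaining the causal relationships between the data columns and yielding the causal graph;
The cause data generation module is used to divide the data nodes in the causal graph obtained by the causal discovery module into two types, starting cause columns and subsequent result columns. For a starting cause column, the class width is computed from a user-defined number of classes and the range of the column's data; the frequency distribution histogram is drawn, the frequency polygon is derived, and the population density curve is approximated; the distribution function of the probability density function is computed, giving an increasing function with values in (0,1), and its inverse is taken; uniform random numbers are generated in [0,1] and mapped through the inverse function to obtain the virtual data for the starting cause column;
The result data generation module is used, for each result data item in the subsequent result columns, to first sample random noise from a normal distribution and input this noise, together with the real cause data corresponding to that result data, into a generator, which constructs virtual result data causally linked to the real cause data. The virtual result data, the real cause data, and the real result data are then input into a discriminator for training, and the discriminator judges whether the virtual result data are realistic. After the generator and the discriminator have been trained for a number of rounds and reach a stable state, random noise and the virtual cause data are input into the generator to obtain the virtual result data.
For how the functions of each module in the above system are implemented, see the implementation of the corresponding steps in the above method; the details are not repeated here.
Since the system embodiment essentially corresponds to the method embodiment, the relevant points can be found in the description of the method embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The above embodiments are intended to explain the present invention, not to limit it. Any modification or change made to the present invention within its spirit and the scope of protection of the claims falls within the scope of protection of the present invention.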

Claims (10)

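  1. A method for automated generation of clinical data based on causal relationship mining, characterized by the following steps:
    (1) Data set construction: construct a table with patients as rows and patient clinical information as columns to obtain the data set to be virtually generated;
    (2) Natural language processing: preprocess the text and numerical information in the data set obtained in step (1); convert the text information into a unified form of expression and ordinal-encode it to obtain numeric sequences, and use a unified numerical representation for the numerical information;
    (3) Causal discovery: take the data columns of the data set after natural language processing as data nodes and connect every pair of data nodes to form a complete undirected graph; then determine the dependency direction of the edges in the complete undirected graph according to the principle of d-separation, extending the complete undirected graph into a completed partially directed acyclic graph, thereby obtaining the causal relationships between the data columns and yielding the causal graph;
    (4) Cause data generation: divide the data nodes in the causal graph obtained in step (3) into two types, starting cause columns and subsequent result columns; for a starting cause column, compute the class width from a user-defined number of classes and the range of the column's data, draw the frequency distribution histogram, derive the frequency polygon, and approximate the population density curve; compute the distribution function of the probability density function, which is an increasing function with values in (0,1), and take its inverse; generate uniform random numbers in [0,1] and map them through the inverse function to obtain the virtual data for the starting cause column;
    (5) Result data generation: for each result data item in the subsequent result columns, first sample random noise from a normal distribution and input this noise, together with the real cause data corresponding to that result data, into a generator, which constructs virtual result data causally linked to the real cause data; then input the virtual result data, the real cause data, and the real result data into a discriminator for training, the discriminator judging whether the virtual result data are realistic; after the generator and the discriminator have been trained for a number of rounds and reach a stable state, input random noise and the virtual cause data into the generator to obtain the virtual result data.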
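  2. The method for automated generation of clinical data based on causal relationship mining according to claim 1, characterized in that, in step (1), the patient clinical information is the clinical information on the patient's condition, examinations, symptoms, and medications, selected from the different hospital departments according to the patient's admission code.
  3. The method for automated generation of clinical data based on causal relationship mining according to claim 1, characterized in that, in step (2), the text information is preprocessed as follows: remove useless characters; check the completeness of the information in the table, look for missing or clearly erroneous information, and delete or correct the data; then split long sentences with regular expressions and judge the polarity of each resulting clause; then, using unified medical terminology, map differently worded text to a single form based on its shared underlying meaning, so that the table uses a unified form of expression; finally, ordinal-encode the distinct text expressions in each column to convert them into numeric sequences.
  4. The method for automated generation of clinical data based on causal relationship mining according to claim 1, characterized in that, in step (2), the preprocessed text and numerical information are combined into real medical clinical tabular data consisting of numbers; the table is denoted $(x, Y)$, where $x=[x_1,x_2,x_3,\ldots,x_n]$ is the set of patient admission codes, $n$ is the number of patients, $x_n$ is the code of the $n$-th patient, $Y=[y_1,y_2,y_3,\ldots,y_n]^T=[f_1,f_2,f_3,\ldots,f_m]\in\mathbb{R}^{n\times m}$ is the patient feature matrix, $m$ is the number of selected patient information indicators, $f_m$ is the data of the $m$-th indicator, and $y_n$ is the clinical data of the $n$-th patient.
  5. The method for automated generation of clinical data based on causal relationship mining according to claim 1, characterized in that, in step (3), the causal relationships between data columns are obtained as follows: for any two data nodes in the complete undirected graph, if they are not conditionally independent given any of the other nodes, the two nodes are judged to be causally related; assuming that all the random nodes jointly follow a multivariate Gaussian distribution, whether data nodes are conditionally independent is determined by the partial correlation coefficient formula.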
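  6. The method for automated generation of clinical data based on causal relationship mining according to claim 5, characterized in that, for a complete undirected graph with $r$ data nodes, any two data nodes $a$ and $b$ are taken and the set of the remaining data nodes is denoted $H$; the $s$-th order partial correlation coefficient $\rho_{a,b|H}$ is:
    $$\rho_{a,b|H}=\frac{\rho_{a,b|H\setminus s}-\rho_{a,s|H\setminus s}\,\rho_{b,s|H\setminus s}}{\sqrt{\left(1-\rho_{a,s|H\setminus s}^{2}\right)\left(1-\rho_{b,s|H\setminus s}^{2}\right)}}$$
    where $\rho_{a,b|H\setminus s}$ is the $(s-1)$-th order partial correlation coefficient of data nodes $a$ and $b$, $\rho_{a,s|H\setminus s}$ is that of data nodes $a$ and $s$, and $\rho_{b,s|H\setminus s}$ is that of data nodes $b$ and $s$; the Fisher Z-transform converts it into the normally distributed quantity $Z(a,b|H)$:
    $$Z(a,b|H)=\frac{1}{2}\ln\frac{1+\rho_{a,b|H}}{1-\rho_{a,b|H}}$$
    where $\rho_{a,b|H}$ is the $s$-th order partial correlation coefficient of data nodes $a$ and $b$; given a significance level $\alpha$, the magnitude of the test statistic formed from $Z(a,b|H)$ is compared with the critical value obtained from $\Phi^{-1}(\cdot)$, where $\Phi(\cdot)$ is the cumulative distribution function of the normal distribution $N(0,1)$ and $\Phi^{-1}(\cdot)$ is its inverse; if the latter is larger, the partial correlation coefficient of $a$ and $b$ given the remaining node set $H$ is taken to be 0, that is, $a$ and $b$ are conditionally independent, which determines the connecting edges between data nodes, that is, the dependency relationships between the data columns.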
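  7. The method for automated generation of clinical data based on causal relationship mining according to claim 1, characterized in that, in step (4), the distribution function of the probability density function is computed as follows: based on the number of peaks of the population density curve and the least-sum-of-squared-errors criterion, the probability density function is expressed as a combination $p(g)$ of $t$ normal distributions, namely:
    where $g$ is the starting cause column data in step (4), $\xi_i$ is the mean of the $i$-th normal distribution, and $\sigma_i$ is the standard deviation of the $i$-th normal distribution; the inverse of the distribution function is obtained from the positional relationship between $g$ and $p(g)$.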
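  8. The method for automated generation of clinical data based on causal relationship mining according to claim 1, characterized in that, in step (5), the generator loss has two parts: the real/fake loss contributed by the discriminator, and the causal loss contributed by causality; the smaller the real/fake loss, the more similar the virtual data are to the real data; the real/fake loss $loss_a$ is given by:
    where $N$ is the number of patients in one training batch, and the per-sample term is the probability, as assessed by the discriminator, that the $i$-th generated virtual result data item is the result data in the subsequent result column corresponding to the input real cause data;
    the causal loss ensures that the causality of the virtual samples produced by the generator is similar to that of the original real samples; the smaller the causal loss, the better the causality among the virtual samples meets the requirement; causality between data columns is expressed as correlation between values: for the virtual result column and the real result column, the correlation coefficient with each corresponding cause column is computed, and the difference between the correlation coefficients is backpropagated to give feedback on the causality of the virtual result column; the causal loss $loss_b$ is given by:
    where $M$ is the number of starting cause columns input to the generator, $c_j$ is the Pearson correlation coefficient between the real result column and the corresponding $j$-th cause column, and its virtual counterpart is the Pearson correlation coefficient between the virtual result column produced by the generator and the corresponding $j$-th cause column;
    the generator uses the sum of the two losses above as its loss function and is optimized with the Wasserstein GAN with gradient penalty (WGAN-GP), a generative adversarial network based on the earth mover's distance and a gradient penalty, so as to generate virtual result data that are causally linked to the virtual cause data and similar to the real result data; during training, real cause data are input and the generator's loss function learns the causal relationship between real cause data and real result data; once the network is stable, virtual cause data are input to obtain the corresponding virtual result data.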
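  9. The method for automated generation of clinical data based on causal relationship mining according to claim 1, characterized in that, in step (5), the generated virtual result data are concatenated and input, together with the real result data, into a joint discriminator, which judges the causal associations of the virtual result data; the causal-linking ability of the generators is optimized according to the training ratio of the generators to the joint discriminator, and the training objective function $Value(D,G)$ is set to:
    $$\min_G \max_D Value(D,G)=E_{q\sim p(q)}[\log D(q)]+E_{z\sim p_z(z)}[\log(1-D(G(z)))]$$
    where $G$ denotes the generator, $D$ the discriminator, $q$ the real result data, and $z$ the random variable; $E_{q\sim p(q)}$ denotes the expectation over $q$ distributed according to $p(q)$, and $E_{z\sim p_z(z)}$ denotes the expectation over $z$ distributed according to $p_z(z)$; using error backpropagation, and following the order in which the virtual result data were generated, the similarity gap between the virtual and real result data is backpropagated to every generator, strengthening the causal association between the virtual result data and the virtual cause data.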
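  10. A system for automated generation of clinical data based on causal relationship mining, characterized in that the system comprises a data set construction module, a natural language processing module, a causal discovery module, a cause data generation module, and a result data generation module;
    the data set construction module is used to construct a table with patients as rows and patient clinical information as columns, obtaining the data set to be virtually generated;
    the natural language processing module is used to preprocess the text and numerical information in the data set obtained by the data set construction module, converting the text information into a unified form of expression and ordinal-encoding it to obtain numeric sequences, and using a unified numerical representation for the numerical information;
    the causal discovery module is used to take the data columns of the data set processed by the natural language processing module as data nodes and connect every pair of data nodes to form a complete undirected graph, then determine the dependency direction of the edges according to the principle of d-separation, extending the complete undirected graph into a completed partially directed acyclic graph, thereby obtaining the causal relationships between the data columns and yielding the causal graph;
    the cause data generation module is used to divide the data nodes in the causal graph obtained by the causal discovery module into two types, starting cause columns and subsequent result columns; for a starting cause column, the class width is computed from a user-defined number of classes and the range of the column's data, the frequency distribution histogram is drawn, the frequency polygon is derived, and the population density curve is approximated; the distribution function of the probability density function is computed, giving an increasing function with values in (0,1), and its inverse is taken; uniform random numbers are generated in [0,1] and mapped through the inverse function to obtain the virtual data for the starting cause column;
    the result data generation module is used, for each result data item in the subsequent result columns, to first sample random noise from a normal distribution and input this noise, together with the real cause data corresponding to that result data, into a generator, which constructs virtual result data causally linked to the real cause data; the virtual result data, the real cause data, and the real result data are then input into a discriminator for training, the discriminator judging whether the virtual result data are realistic; after the generator and the discriminator have been trained for a number of rounds and reach a stable state, random noise and the virtual cause data are input into the generator to obtain the virtual result data.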
PCT/CN2023/105558 2022-07-05 2023-07-03 一种基于因果关系挖掘的临床数据自动化生成方法及系统 WO2024008043A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210782447.3A CN114864099B (zh) 2022-07-05 2022-07-05 一种基于因果关系挖掘的临床数据自动化生成方法及系统
CN202210782447.3 2022-07-05

Publications (1)

Publication Number Publication Date
WO2024008043A1 true WO2024008043A1 (zh) 2024-01-11

Family

ID=82625517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105558 WO2024008043A1 (zh) 2022-07-05 2023-07-03 一种基于因果关系挖掘的临床数据自动化生成方法及系统

Country Status (2)

Country Link
CN (1) CN114864099B (zh)
WO (1) WO2024008043A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114864099B (zh) * 2022-07-05 2022-11-01 浙江大学 一种基于因果关系挖掘的临床数据自动化生成方法及系统
CN116469543B (zh) * 2023-04-21 2023-10-27 脉景(杭州)健康管理有限公司 一种主症兼症识别方法、系统及设备
CN117077641B (zh) * 2023-10-16 2024-01-19 北京亚信数据有限公司 医疗数据合成方法及装置
CN117809854A (zh) * 2023-12-29 2024-04-02 重庆邮电大学 一种基于医学因果知识嵌入的危险因素因果关系提取方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110634566A (zh) * 2019-09-24 2019-12-31 成都成信高科信息技术有限公司 一种中医临床诊断数据处理系统及方法、信息数据处理终端
AU2020102667A4 (en) * 2020-10-11 2021-01-14 George, Tony DR Adversarial training for large scale healthcare data using machine learning system
US20210064760A1 (en) * 2019-09-03 2021-03-04 Microsoft Technology Licensing, Llc Protecting machine learning models from privacy attacks
CN113808734A (zh) * 2021-09-08 2021-12-17 宁波工程学院 一种基于深度学习的因果医疗诊断方法
CN114220549A (zh) * 2021-12-16 2022-03-22 无锡中盾科技有限公司 一种基于可解释机器学习的有效生理学特征选择和医学因果推理方法
CN114864099A (zh) * 2022-07-05 2022-08-05 浙江大学 一种基于因果关系挖掘的临床数据自动化生成方法及系统

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6354192B2 (ja) * 2014-02-14 2018-07-11 オムロン株式会社 因果ネットワーク生成システム
CN112151130B (zh) * 2019-01-15 2022-11-04 合肥工业大学 一种基于文献检索的决策支持系统和构建方法
KR20210147651A (ko) * 2020-05-29 2021-12-07 의료법인 이원의료재단 Gan을 이용한 의료 데이터 생성 방법 및 그 시스템
CN112835709B (zh) * 2020-12-17 2023-09-22 华南理工大学 基于生成对抗网络的云负载时序数据生成方法、系统和介质
CN113378991A (zh) * 2021-07-07 2021-09-10 上海联影医疗科技股份有限公司 医疗数据生成方法、装置、电子设备及存储介质
CN113990520B (zh) * 2021-11-05 2024-06-07 天津工业大学 一种基于可控生成对抗网络的中医药处方生成方法
CN114664452B (zh) * 2022-05-20 2022-09-23 之江实验室 一种基于因果校验数据生成的全科多疾病预测系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210064760A1 (en) * 2019-09-03 2021-03-04 Microsoft Technology Licensing, Llc Protecting machine learning models from privacy attacks
CN110634566A (zh) * 2019-09-24 2019-12-31 成都成信高科信息技术有限公司 一种中医临床诊断数据处理系统及方法、信息数据处理终端
AU2020102667A4 (en) * 2020-10-11 2021-01-14 George, Tony DR Adversarial training for large scale healthcare data using machine learning system
CN113808734A (zh) * 2021-09-08 2021-12-17 宁波工程学院 一种基于深度学习的因果医疗诊断方法
CN114220549A (zh) * 2021-12-16 2022-03-22 无锡中盾科技有限公司 一种基于可解释机器学习的有效生理学特征选择和医学因果推理方法
CN114864099A (zh) * 2022-07-05 2022-08-05 浙江大学 一种基于因果关系挖掘的临床数据自动化生成方法及系统

Also Published As

Publication number Publication date
CN114864099A (zh) 2022-08-05
CN114864099B (zh) 2022-11-01

Similar Documents

Publication Publication Date Title
WO2024008043A1 (zh) 一种基于因果关系挖掘的临床数据自动化生成方法及系统
CN109935336B (zh) 一种儿童呼吸科疾病的智能辅助诊断系统
Spiegelhalter et al. Statistical and knowledge‐based approaches to clinical decision‐support systems, with an application in gastroenterology
CN106845147B (zh) 医学经验总结模型的建立方法、装置
CN109102886B (zh) 多推理模式融合的老年病推理诊断系统
CN111048167B (zh) 一种层级式病例结构化方法及系统
CN111316281A (zh) 基于机器学习的自然语言情境中数值数据的语义分类
CN111260448A (zh) 基于人工智能的药品推荐方法及相关设备
WO2023029506A1 (zh) 病情分析方法、装置、电子设备及存储介质
WO2023071530A1 (zh) 一种小样本弱标注条件下的医疗事件识别方法及系统
CN109119160B (zh) 多重推理方式的专家分诊系统及其方法
CN109360658A (zh) 一种基于词向量模型的疾病模式挖掘方法及装置
CN117034142B (zh) 一种不平衡医疗数据缺失值填充方法及系统
CN114783603A (zh) 基于多源图神经网络融合的患病风险预测方法及系统
Pokharel et al. Representing EHRs with temporal tree and sequential pattern mining for similarity computing
Hansen et al. Assigning diagnosis codes using medication history
Mani et al. Building Bayesian network models in medicine: The MENTOR experience
WO2024131025A1 (zh) 数据处理方法、装置、电子设备及存储介质
Sun et al. A general fine-tuned transfer learning model for predicting clinical task acrossing diverse EHRs datasets
CN116110594A (zh) 基于关联文献的医学知识图谱的知识评价方法及系统
CN113057588A (zh) 一种病症预警方法、装置、设备及介质
Reid Diabetes diagnosis and readmission risks predictive modelling: USA
AU2021100217A4 (en) Traditional Chinese Medicine Data Processing Method and System Combining Attribute-based Constrained Concept Lattice
Gunaratnam A web-based perinatal decision support system framework using a knowledge-based-approach to estimate clinical outcomes: neonatal mortality and preterm birth in twins pregnancies
Sapna et al. Integration of Fuzzy Clustering Technique with Big Data for Disease Diagnosis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834810

Country of ref document: EP

Kind code of ref document: A1