CN115188484A - Multi-party mixed data tracing method and system based on potential group tool variables - Google Patents

Info

Publication number
CN115188484A
Authority
CN
China
Prior art keywords
cluster
data
treatment
medical record
variables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210836782.7A
Other languages
Chinese (zh)
Inventor
况琨
吴安鹏
吴飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Higher Research Institute Of Shanghai Zhejiang University
Shanghai AI Innovation Center
Original Assignee
Higher Research Institute Of Shanghai Zhejiang University
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Higher Research Institute Of Shanghai Zhejiang University and Shanghai AI Innovation Center
Priority to CN202210836782.7A
Publication of CN115188484A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics, for mining of medical data, e.g. analysing previous cases of other patients
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data, for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Optimization (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a multi-party mixed data tracing method and system based on potential group tool variables. The method first maps the condition information to a characterization space through characterization learning. Then, for a given cluster number, it identifies the heterogeneous treatment scheme allocation mechanisms with an expectation maximization algorithm; heterogeneity here means that the intervention variables and the confusion variables have different causal relationships in different data sources. Finally, based on the heterogeneous treatment scheme allocation mechanisms, the medical record data are divided into several sample subgroups, and the optimal group tool variable is selected with a correlation-independence index and used as the source indication variable of the multi-party data. In this way the differences in diagnosis and treatment means among medical institutions are traced and the medical record samples are grouped. Based on the source indication variable, the potential group tool variable can further be embedded in a tool variable regression method for joint learning with multi-party knowledge, which assists accurate treatment scheme recommendation for each patient.

Description

Multi-party mixed data tracing method and system based on potential group tool variables
Technical Field
The invention relates to the field of causal inference, in particular to a multi-party mixed data tracing method and system based on potential group tool variables in medical record data.
Background
Causal inference is a powerful explanatory tool that plays an important role in decision making in many fields, such as precision medicine, policy making, accurate recommendation and the improvement of teaching strategies. The gold standard for identifying treatment/intervention effects or potential outcome functions is the randomized controlled trial, but because of its enormous cost and the ethical concerns involved, causal analysis can often only be made on observed data. Moreover, owing to the lack of uniform data collection specifications, data from multiple sources that follow different causal relationships are often mixed together, which presents additional challenges to causal analysis.
Mixed data deviation and unobserved confusion deviation are ubiquitous in causal data. The tool variable (known in the causal inference literature as the instrumental variable) is the most classical and most credible method for removing the influence of unobserved confusion from an observed data set. However, the tool variable method usually depends on tool variables selected by human expert knowledge. When human prior knowledge is deficient, the selected tool variables may not strictly satisfy the conditions that a tool variable must meet in the algorithm, so that the final estimate is unreliable. One solution is to synthesize tool variables from a large set of candidate tool variables by testing or screening, but this raises the same problem: candidate tool variables are themselves not easy to find.
A medical record data set records the patient's signs (weight, height, age, sex, nature of work and relevant examination results), oral information (the patient's description of his or her own condition and past medical history), the treatment scheme recommended by the doctor, and the treatment result obtained at the return visit. Medical record data are collected and gathered by different medical institutions. Because of differences in treatment concepts, skilled techniques and medical instruments and equipment, different medical institutions form different treatment scheme distribution mechanisms for the same disease; that is, the treatment scheme distribution mechanisms may be heterogeneous across medical institutions. Heterogeneous treatment scheme distribution mechanisms represent the different causal relationships that exist between the intervention variables and the confusion variables of different data sources, and each heterogeneous treatment scheme distribution mechanism corresponds to one diagnosis and treatment means. However, owing to the lack of medical record data collection specifications in the medical field, the source medical institution of each piece of medical record data and the corresponding diagnosis and treatment means are often missing from the records, so that several causal relationships are mixed in one data set. How to trace the source of the medical record data samples in a medical record data set, and to find the diagnosis and treatment means category or the medical institution corresponding to each piece of medical record data, is therefore a technical problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to overcome the following defect: owing to the lack of medical record data collection specifications in the medical field, historically collected medical record data come from different medical institutions and differ in diagnosis and treatment means, which brings extra deviation to the estimation of the real potential results. The invention provides a multi-party mixed data tracing method and system based on potential group tool variables. Based on the heterogeneous treatment scheme distribution mechanisms, the data can be divided into several potential sample groups with different covariate-intervention relations; a subgroup indicator is recovered from the mixed overall data by characterization learning and an expectation maximization algorithm, and is embedded as a tool variable into a downstream prediction or recommendation task, so that the potential result function is predicted more accurately.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a multi-party mixed data tracing method based on potential cluster tool variables, which includes the following steps:
S1, acquiring a medical record data set to be traced and identified, sourced from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information;
S2, selecting one candidate cluster number within the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable and the corresponding condition information as the confusion variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning;
S3, fixing the expectation and the covariance matrix of the characterization space obtained in S2, and identifying, by an expectation maximization algorithm, the heterogeneous treatment scheme distribution mechanisms corresponding to the candidate cluster number, wherein the heterogeneous treatment scheme distribution mechanisms represent the different causal relationships between the intervention variables and the confusion variables of different data sources, and each heterogeneous treatment scheme distribution mechanism corresponds to one diagnosis and treatment means;
S4, traversing all candidate cluster numbers within the value range of the cluster hyperparameter, and executing S2 and S3 for each candidate cluster number to obtain the corresponding heterogeneous treatment scheme distribution mechanisms; for each candidate cluster number, dividing the samples in the medical record data set into as many sample subgroups as the candidate cluster number; then, based on a correlation-independence index, selecting the optimal cluster number from all candidate cluster numbers together with the potential group tool variable of each sample under the optimal cluster number; and, taking the potential group tool variables as the source indication variables of the multi-party mixed medical record data, clustering all the medical record data into a plurality of groups, the medical record data in each group having the same diagnosis and treatment means, i.e. belonging to the same heterogeneous treatment scheme distribution mechanism, so that the differences in diagnosis and treatment means of the different medical institutions in the medical record data set are obtained by tracing.
Preferably, in S1, the patient signs in each piece of medical record data include weight, height, age, sex, nature of work and relevant examination results, the oral information is derived from the patient's description of his or her own condition and past medical history, and the treatment result is derived from the return visit.
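For illustration only, one record of S1 could be held in a small structure such as the one below. This is a minimal sketch: the class name, the field names and the sample values are hypothetical and are not defined by the invention.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MedicalRecord:
    """One medical record as described in S1 (all names are illustrative assumptions)."""
    patient_signs: Dict[str, float]            # e.g. weight, height, age, examination results
    oral_information: str                      # patient's own description and past medical history
    treatment_scheme: float                    # intervention variable T given by the institution
    treatment_result: float                    # outcome obtained at the return visit
    source_institution: Optional[int] = None   # missing in the mixed data set; to be traced


# Made-up values, used only to show how the structure is filled in.
record = MedicalRecord(
    patient_signs={"weight_kg": 70.0, "height_cm": 175.0, "age": 52.0},
    oral_information="intermittent chest pain for two weeks",
    treatment_scheme=1.0,
    treatment_result=0.8,
)
```

The `source_institution` field is left empty on purpose: it stands for the source label that the tracing method is meant to recover.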
As a preferable aspect of the first aspect, the S2 specifically includes the following substeps:
S201, for each candidate cluster number K within the value range of the cluster hyperparameter, taking all observed patient signs and oral information as confusion variables and mapping them, by a characterization learning algorithm and through a mapping function, to a characterization space whose dimensions are mutually independent, the non-independent data and the multivariate complex interaction terms being jointly learned as a noise term:
Figure BDA0003748767000000031
wherein X is the condition information in the medical record data, serving as the confusion variable; T is the treatment scheme in the medical record data, serving as the intervention variable; $\varepsilon_{TZ}$ denotes the error term caused by unobserved patient signs and oral information or by measurement errors;
Figure BDA0003748767000000032
represents the heterogeneous treatment scheme distribution mechanisms corresponding to the K potential cluster tool variables $Z \in \{1, \dots, K\}$;
Figure BDA0003748767000000033
The input of the mechanism is the confusion variable X; K is the currently selected candidate cluster number; z is an instantiation of the potential cluster tool variable Z; R is the characterization space finally obtained by learning; $R_j$ is the j-th component of the data characterization space R, $j \in \{1, \dots, m_R\}$, where $m_R$ is the total number of characterization dimensions; $\alpha_{zj}$ is the linear fitting coefficient of the corresponding characterization; $\beta_z$ is the noise term jointly learned from the non-independent data and the multivariate complex interaction terms; and $1_{[Z=z]}$ is a conditional function that equals 1 when the true treatment scheme distribution mechanism between the sample data X and T is Z = z, and 0 otherwise;
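The symbol definitions of S201 suggest a treatment that is a group-specific linear function of the characterization plus noise. The simulation below is a minimal sketch under that reading: the exact formula is not recoverable from this text, so the functional form, the treatment of $\beta_z$ as a group-specific constant, the function name `simulate_assignment` and all numeric values are assumptions.

```python
import numpy as np


def simulate_assignment(r, z, alpha, beta, noise_std=0.1, seed=0):
    """Draw T_i = alpha[z_i] . r_i + beta[z_i] + eps_i (assumed form, see the text above)."""
    rng = np.random.default_rng(seed)
    linear_part = np.einsum("ij,ij->i", alpha[z], r)   # row-wise dot product alpha_z . r
    return linear_part + beta[z] + rng.normal(0.0, noise_std, size=len(z))


rng = np.random.default_rng(42)
n, m_r, k = 300, 2, 3                       # samples, characterization dims, groups (made up)
r = rng.normal(size=(n, m_r))               # characterization R
z = rng.integers(0, k, size=n)              # latent group indicator Z (unobserved in practice)
alpha = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])   # alpha_zj, one row per group
beta = np.array([0.0, 2.0, -2.0])           # beta_z, one value per group
t = simulate_assignment(r, z, alpha, beta)  # treatments under three different mechanisms
```

Each group uses its own row of `alpha` and its own `beta`, which is what makes the mixed data heterogeneous: the same characterization leads to a different treatment depending on the hidden source.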
S202, calculating the expectation and the covariance of the characterization space based on the characterization space R finally obtained by learning in S201:
Figure BDA0003748767000000041
wherein $r_i$ is the characterization vector of the i-th sample, $\sigma(R, R)$ is the covariance matrix, and n is the total number of samples in the medical record data set;
S203, defining the likelihood function and the log-likelihood function of the complete data (in which z comes from the modeling of the potential tool variable) as follows:
Figure BDA0003748767000000042
Figure BDA0003748767000000043
wherein t is an instantiation of the intervention variable T, r is an instantiation of the characterization space R, and $t_i, r_i, z_i$ are the t, r and z corresponding to the i-th sample;
Figure BDA0003748767000000044
is the joint probability distribution of t, r and z given the distribution parameter $\theta$; $\pi_k$ is the probability that $t_i, r_i$ come from group $z_i = k$;
Figure BDA0003748767000000045
is the joint probability distribution of $t_i, r_i$ given the distribution parameters $\mu_k, \Sigma_k$, which are the mean and the variance respectively; and
Figure BDA0003748767000000046
is a conditional function that equals 1 when $z_i = k$ and 0 otherwise, $k \in \{1, \dots, K\}$.
As a preferable aspect of the first aspect, the S3 specifically includes the following substeps:
S301, initializing the heterogeneous data distribution with random numbers:
Figure BDA0003748767000000047
wherein K is the candidate cluster number selected in S2;
S302, using the characterization space information obtained in S202
Figure BDA0003748767000000048
to reinitialize the heterogeneous data distribution $\theta$ as $\theta^{(0)} = \{\pi^{(0)}, \mu^{(0)}, \Sigma^{(0)}\}$:
Figure BDA0003748767000000049
wherein
Figure BDA00037487670000000410
are, respectively, the randomly initialized mean of T, variance of T and covariance matrix of T and R, and
Figure BDA00037487670000000411
is the transpose of
Figure BDA00037487670000000412
S303, in the s-th iteration, first executing the expectation step, i.e. given the observed data {T, R} and the current heterogeneous data distribution estimate $\theta^{(s)}$, computing the expectation of the log-likelihood function of the complete data:
Figure BDA0003748767000000051
wherein the expectation
Figure BDA0003748767000000052
is the conditional probability of the i-th sample belonging to the k-th group with respect to $\theta^{(s)}$:
Figure BDA0003748767000000053
wherein
Figure BDA0003748767000000054
is the probability that the sample t, r comes from group z = i, the conditional probabilities of the K groups summing to 1, and
Figure BDA0003748767000000055
is the joint probability distribution of T and R given the distribution parameter
Figure BDA0003748767000000056
S304, continuing with the maximization step of the s-th iteration, i.e. given the observed data {T, R} and the current heterogeneous data distribution estimate $\theta^{(s)}$, maximizing the expectation $Q(\theta, \theta^{(s)})$ of the log-likelihood function of the complete data and updating the heterogeneous data distribution estimate to $\theta^{(s+1)}$:
$\theta^{(s+1)} = \arg\max_{\theta} Q(\theta, \theta^{(s)})$
wherein $\theta^{(s+1)}$ is obtained by solving for the following parameters:
Figure BDA0003748767000000057
Figure BDA0003748767000000058
Figure BDA0003748767000000059
wherein
Figure BDA0003748767000000061
denotes the concatenation of T and R along the feature dimension, and
Figure BDA0003748767000000062
is a matrix, with $M^2 = M M^{T}$;
S305, iterating the expectation step S303 and the maximization step S304 of the expectation maximization algorithm until convergence, finally obtaining the distribution convergence solution corresponding to the current K value:
Figure BDA0003748767000000063
wherein $\theta^*$ characterizes the different causal relationships that exist between the intervention variables and the confusion variables of different data sources, together with their corresponding distributions.
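The expectation and maximization steps of S303 to S305 can be sketched with a generic Gaussian mixture fitted to the concatenation of T and R. This is a textbook EM under stated assumptions, not the exact update formulas of the invention: the function names, the small ridge term added to each covariance, the optional `init_mu` argument and the synthetic data are all illustrative, and the initialization is simpler than the one described in S301 and S302.

```python
import numpy as np


def gaussian_log_pdf(x, mean, cov):
    """Log density of a multivariate normal, evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def em_gmm(t, r, k, n_iter=200, tol=1e-8, seed=0, init_mu=None):
    """Fit a K-component Gaussian mixture to the concatenation [T, R] by EM."""
    x = np.column_stack([t, r])
    n, d = x.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    if init_mu is None:
        mu = x[rng.choice(n, size=k, replace=False)]   # random data points as starting means
    else:
        mu = np.asarray(init_mu, dtype=float).copy()
    sigma = np.stack([np.cov(x, rowvar=False) + 1e-6 * np.eye(d)] * k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation step: gamma[i, j] = P(z_i = j | t_i, r_i, current parameters).
        log_w = np.stack(
            [np.log(pi[j]) + gaussian_log_pdf(x, mu[j], sigma[j]) for j in range(k)],
            axis=1,
        )
        log_norm = np.logaddexp.reduce(log_w, axis=1)
        gamma = np.exp(log_w - log_norm[:, None])
        # Maximization step: closed-form updates of mixing weights, means and covariances.
        nk = gamma.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (gamma.T @ x) / nk[:, None]
        for j in range(k):
            diff = x - mu[j]
            sigma[j] = (gamma[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
        ll = log_norm.sum()                 # observed-data log-likelihood before this update
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, sigma, gamma


# Synthetic mixed data: two hidden sources with opposite assignment mechanisms (made up).
rng = np.random.default_rng(1)
z_true = rng.integers(0, 2, size=400)
r = rng.normal(size=400)
t = np.where(z_true == 0, 2.0 * r + 5.0, -2.0 * r - 5.0) + 0.3 * rng.normal(size=400)
pi_hat, mu_hat, sigma_hat, gamma = em_gmm(t, r, k=2, init_mu=[[5.0, 0.0], [-5.0, 0.0]])
z_hat = gamma.argmax(axis=1)                # most probable group for each sample
```

Taking the most probable group per sample, as in the last line, is one plausible way to turn the converged solution into a group label; the formula actually used for that reconstruction is not recoverable from this text.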
As a preferable aspect of the first aspect, the step S4 specifically includes the following substeps:
S401, traversing all candidate cluster numbers within the value range of the cluster hyperparameter K, and executing S2 and S3 for each candidate K value to obtain the distribution convergence solution of the complete data $\theta^* = \{\pi^*, \mu^*, \Sigma^*\}$; then, based on the distribution convergence solution corresponding to each K value, reconstructing the potential cluster tool variable corresponding to each medical record data sample in the medical record data set:
Figure BDA0003748767000000064
wherein the subscript i denotes the parameter corresponding to the i-th sample, i = 1, 2, …, n;
S402, for all values of the cluster hyperparameter K, using the correlation-independence index MMD as the screening index and selecting the K that minimizes the MMD as the optimal cluster number:
Figure BDA0003748767000000065
$K^* = \arg\min_K \mathrm{MMD}_K(Z, R), \quad K \in \{1, 2, \dots, 10\}$
wherein
Figure BDA0003748767000000066
represents the mean of the characterizations R corresponding to all samples in the k-th sample subgroup, and $K^*$ is the optimal cluster number;
S403, taking the potential cluster tool variable $z_i$ corresponding to each sample under the optimal cluster number $K^*$ as the optimal potential cluster tool variable $Z^*$, and at the same time using $z_i$ as the source indication variable to cluster the medical record data into groups, wherein the medical record data in each group have the same source indication variable, which indicates that the medical record data in the same group adopt the same diagnosis and treatment means, i.e. belong to the same heterogeneous treatment scheme distribution mechanism, so that the diagnosis and treatment means category of each piece of medical record data is traced.
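The screening of S402 can be sketched as follows. Because the MMD formula itself is not recoverable from this text, the sketch assumes a simple mean-embedding form: the average squared distance between the subgroup means of R. The function names and the toy arrays are hypothetical. With this assumed definition a single group trivially scores zero, so the toy example only compares K = 2 and K = 3.

```python
import numpy as np
from itertools import combinations


def group_mean_mmd(z, r):
    """Average squared distance between subgroup means of R (assumed form of the index)."""
    groups = np.unique(z)
    if len(groups) < 2:
        return 0.0
    means = [r[z == g].mean(axis=0) for g in groups]
    gaps = [np.sum((a - b) ** 2) for a, b in combinations(means, 2)]
    return float(np.mean(gaps))


def select_cluster_number(assignments, r):
    """Pick the K whose reconstructed group labels give the smallest index value."""
    scores = {k: group_mean_mmd(z, r) for k, z in assignments.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Toy characterizations and two candidate groupings (made-up values).
r = np.array([[0.0], [2.0], [0.0], [2.0]])
assignments = {
    2: np.array([0, 0, 1, 1]),   # both subgroups have mean 1.0, so R looks independent of Z
    3: np.array([0, 1, 2, 2]),   # subgroup means 0.0, 2.0 and 1.0 differ
}
best_k, scores = select_cluster_number(assignments, r)
```

A small index value means the group label carries little information about the characterization, which is the property a tool variable is expected to have with respect to the confusion variables.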
Preferably, the characterization learning algorithm employs a variational auto-encoder, a principal component analysis, a correlation minimization characterization learning, or a prior knowledge-based characterization.
In a second aspect, the present invention provides a multi-party mixed data tracing system based on potential cluster tool variables, comprising:
a data set acquisition module, used for acquiring a medical record data set to be traced and identified, sourced from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information;
a characterization module, used for selecting one candidate cluster number within the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable and the corresponding condition information as the confusion variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning;
an expectation maximization algorithm module, used for fixing the expectation and the covariance matrix of the characterization space obtained in the characterization module, and identifying, by an expectation maximization algorithm, the heterogeneous treatment scheme distribution mechanisms corresponding to the candidate cluster number, wherein the heterogeneous treatment scheme distribution mechanisms represent the different causal relationships between the intervention variables and the confusion variables of different data sources, and each heterogeneous treatment scheme distribution mechanism corresponds to one diagnosis and treatment means;
a grouping tracing module, used for traversing all candidate cluster numbers within the value range of the cluster hyperparameter, and executing the characterization module and the expectation maximization algorithm module for each candidate cluster number to obtain the corresponding heterogeneous treatment scheme distribution mechanisms; for each candidate cluster number, dividing the samples in the medical record data set into as many sample subgroups as the candidate cluster number; then, based on a correlation-independence index, selecting the optimal cluster number from all candidate cluster numbers together with the potential group tool variable of each sample under the optimal cluster number; and, taking the potential group tool variables as the source indication variables of the multi-party mixed medical record data, clustering all the medical record data into a plurality of groups, the medical record data in each group having the same diagnosis and treatment means, i.e. belonging to the same heterogeneous treatment scheme distribution mechanism, so that the differences in diagnosis and treatment means of the different medical institutions in the medical record data set are obtained by tracing.
In a third aspect, the present invention provides an accurate treatment recommendation system, which includes:
a joint learning module, configured to obtain potential cluster tool variables corresponding to each sample in the optimal cluster number obtained by any one of the multi-party mixed data tracing identification methods in the first aspect, embed the obtained potential cluster tool variables in a tool variable regression method, and perform joint learning by combining with multi-party knowledge to obtain a counterfactual prediction function;
and the treatment scheme recommending module is used for inputting the state information of the target case as a confusion variable into the counterfactual predicting function to obtain a treatment result predicted value which can be achieved by the target case under each diagnosis and treatment means and is used as a reference for selecting the diagnosis and treatment means.
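The recommending module can be sketched as a loop that queries the counterfactual prediction function once per diagnosis and treatment means. The names `recommend_treatment` and `toy_predictor` are hypothetical, the toy predictor is a made-up stand-in for a learned function, and picking the highest predicted result assumes that a larger value means a better outcome, which the text does not state.

```python
from typing import Callable, Dict, Sequence, Tuple


def recommend_treatment(
    predict: Callable[[Sequence[float], int], float],
    condition: Sequence[float],
    treatments: Sequence[int],
) -> Tuple[int, Dict[int, float]]:
    """Evaluate the predictor under each treatment and return the best one with all scores."""
    predicted = {t: predict(condition, t) for t in treatments}
    best = max(predicted, key=predicted.get)   # assumes higher predicted result is better
    return best, predicted


def toy_predictor(condition, treatment):
    """Hypothetical stand-in for a learned counterfactual prediction function."""
    age = condition[0]
    return {0: 0.5, 1: 0.9 - 0.005 * age, 2: 0.4 + 0.004 * age}[treatment]


best, predicted = recommend_treatment(toy_predictor, [40.0], [0, 1, 2])
```

Returning the full dictionary of predicted results alongside the best option keeps the output usable as a reference for the clinician rather than as a final decision.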
As a preferred aspect of the third aspect, during joint learning each learning sample needs to provide as input the condition information of the medical record data, the treatment scheme given by the medical institution, the optimal potential group tool variable and the treatment result after treatment according to the treatment scheme, so that a counterfactual prediction function capable of predicting the treatment results under different diagnosis and treatment means from the condition information is learned.
As a preferable aspect of the third aspect, the tool variable regression method includes a two-stage least squares regression method, a polynomial-based two-stage least squares regression method, a kernel-based two-stage least squares regression method, a least squares regression method based on a deep learning algorithm, and a two-stage least squares regression method based on adversarial moment conditions.
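The first listed option, plain two-stage least squares, can be sketched with the recovered group label used as the tool variable. This is a minimal sketch on synthetic data: the function name, the linear outcome model, the true effect of 1.5 and all other numbers are assumptions made for the illustration.

```python
import numpy as np


def two_stage_least_squares(y, t, x, z, k):
    """2SLS estimate of the effect of T on Y, with one-hot group labels Z as tool variables."""
    n = len(y)
    ones = np.ones((n, 1))
    z_onehot = np.eye(k)[z][:, 1:]     # drop one column to avoid collinearity with the intercept
    # Stage 1: regress T on the tool variables and the observed confusion variables.
    stage1 = np.hstack([ones, z_onehot, x])
    t_hat = stage1 @ np.linalg.lstsq(stage1, t, rcond=None)[0]
    # Stage 2: regress Y on the fitted treatment and the observed confusion variables.
    stage2 = np.hstack([ones, t_hat[:, None], x])
    coef = np.linalg.lstsq(stage2, y, rcond=None)[0]
    return coef[1]


# Synthetic data with an unobserved confusion variable u (all values made up).
rng = np.random.default_rng(0)
n, k = 5000, 3
z = rng.integers(0, k, size=n)                     # group label acting as tool variable
x = rng.normal(size=(n, 1))                        # observed confusion variable
u = rng.normal(size=n)                             # unobserved confusion variable
t = np.array([0.0, 2.0, -2.0])[z] + 0.5 * x[:, 0] + u + 0.25 * rng.normal(size=n)
y = 1.5 * t + x[:, 0] + 2.0 * u + 0.25 * rng.normal(size=n)

tau_iv = two_stage_least_squares(y, t, x, z, k)
ols_design = np.hstack([np.ones((n, 1)), t[:, None], x])
tau_ols = np.linalg.lstsq(ols_design, y, rcond=None)[0][1]   # naive estimate, for comparison
```

In this setup u drives both the treatment and the outcome, so the naive regression is expected to overstate the effect, whereas the two-stage estimate uses only the variation in T that comes from the group label.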
The traditional tool variable method usually depends on tool variables selected by human expert knowledge; when human prior knowledge is deficient, the tool variables may not strictly satisfy the conditions that a tool variable must meet in the algorithm, so that the final estimate is unreliable. Compared with the prior art, the invention targets the huge data sets shared by multiple medical institutions in the medical scene. When tracing and identifying diagnosis and treatment means, it recovers potential group tool variables from heterogeneous treatment/intervention data and infers the potential result function, so that the potential result function can be estimated free of the confusion effect even in the presence of unobserved confusion variables, without relying on human expert knowledge to designate tool variables or to generate a candidate set of tool variables. The method can identify the differences in diagnosis and treatment means among different medical institutions, so that the medical record data are clustered and grouped by diagnosis and treatment means and the source of the medical record data is traced. Furthermore, based on the identified differences in diagnosis and treatment means, the invention can perform joint learning with multi-party knowledge, which facilitates recommending the optimal treatment scheme and assists in providing accurate treatment for each patient.
Drawings
FIG. 1 is a flow diagram of a multi-party hybrid data tracing method based on potential cluster tool variables.
FIG. 2 is a block diagram of a multi-party hybrid data traceability system based on potential cluster tool variables.
FIG. 3 is a block diagram of an accurate treatment scheme recommendation system.
FIG. 4 is a schematic diagram of heterogeneous treatment/intervention data in the embodiment.
FIG. 5 is a graph illustrating the multi-party hybrid data traceability recognition accuracy and the corresponding visualization thereof in the embodiment.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in FIG. 1, a preferred embodiment of the present invention provides a multi-party mixed data tracing method based on potential group tool variables. For a huge data set shared by multiple medical institutions in the medical scene, the method traces and identifies the source of each piece of medical record data, thereby identifying the differences in diagnosis and treatment means among the medical institutions, and performs joint learning with multi-party knowledge so as to assist in providing accurate treatment for each patient. The multi-party mixed data tracing method specifically comprises the following steps:
S1, acquiring a medical record data set to be traced and identified, sourced from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information.
In this embodiment, the patient signs in each piece of medical record data include weight, height, age, sex, nature of work and relevant examination results, the oral information is derived from the patient's description of his or her own condition and past medical history, and the treatment result is derived from the return visit.
S2, selecting one candidate cluster number within the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable (also called the treatment/intervention variable) and the corresponding condition information as the confusion variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning.
In this embodiment, the step S2 specifically includes the following sub-steps:
S201, for each candidate cluster number K within the value range of the cluster hyperparameter, taking all observed patient signs and oral information as confusion variables and mapping them, by a characterization learning algorithm and through a mapping function, to a characterization space whose dimensions are mutually independent, the non-independent data and the multivariate complex interaction terms being jointly learned as a noise term:
Figure BDA0003748767000000091
wherein X is the condition information in the medical record data, serving as the confusion variable; T is the treatment scheme in the medical record data, serving as the intervention variable; $\varepsilon_{TZ}$ denotes the error term caused by unobserved patient signs and oral information or by measurement errors;
Figure BDA0003748767000000092
represents the heterogeneous treatment scheme distribution mechanisms corresponding to the K potential cluster tool variables $Z \in \{1, \dots, K\}$;
Figure BDA0003748767000000093
The input of the mechanism is the confusion variable X; K is the currently selected candidate cluster number; z is an instantiation of the potential cluster tool variable Z; $h_{zj}(X)$ is the polynomial characterization learning function of variable j, and $\xi_{zj,1}$ is the linear fitting coefficient of the corresponding polynomial; $x_j$ is the j-th dimension of X, the superscript denoting a power whose maximum value needs to be tuned according to the fitting result; R is the characterization space finally obtained by learning; $R_j$ is the j-th component of the data characterization space R, $j \in \{1, \dots, m_R\}$, where $m_R$ is the total number of characterization dimensions; $\alpha_{zj}$ is the linear fitting coefficient of the corresponding characterization; $\beta_z$ is the noise term jointly learned from the non-independent data and the multivariate complex interaction terms; and $1_{[Z=z]}$ is a conditional function that equals 1 when the true treatment scheme distribution mechanism between the sample data X and T is Z = z, and 0 otherwise.
In this embodiment, the characterization may be learned in advance with a characterization learning algorithm such as a variational autoencoder, principal component analysis, or correlation-minimizing characterization learning, or it may be constructed from prior knowledge.
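The formulas of S201 survive here only as figure placeholders. As a reading aid, the symbol definitions given in S201 are consistent with an assignment model of roughly the following shape. This is an assumed reconstruction, not the original formula: the polynomial index l, its upper bound L, and the exact placement of the error term are guesses, and the text does not make clear how h_zj relates to R_j, so the two are listed separately.

```latex
% Assumed reconstruction from the S201 symbol definitions (not the original figure).
T = \sum_{z=1}^{K} \mathbb{1}_{[Z=z]} \, f_z(X) + E_{TZ},
\qquad
f_z(X) = \sum_{j=1}^{m_R} \alpha_{zj} R_j + \beta_z,
\qquad
h_{zj}(X) = \sum_{l=1}^{L} \xi_{zj,l} \, x_j^{\,l}
```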
S202, calculating the expectation and covariance of the characterization space based on the characterization space R obtained by final learning in S201:
Figure BDA0003748767000000101
wherein r is i Is the characterization vector of the ith sample, σ (R, R) is the covariance matrix, and n is the total number of samples in the medical record dataset.
S203, defining the likelihood function and the log-likelihood function of the complete data (where z comes from the potential tool variable modeling) as follows:
Figure BDA0003748767000000102
Figure BDA0003748767000000103
wherein: t is an instantiation of the intervention variable T, r is an instantiation of the characterization space R, and t_i, r_i, z_i are the t, r and z corresponding to the i-th sample;
Figure BDA0003748767000000104
is the joint probability distribution of t, r, z given the distribution parameter θ; π_k is the probability that t_i, r_i come from group z_i = k;
Figure BDA0003748767000000105
is the joint probability distribution of t_i, r_i given the distribution parameters μ_k, Σ_k, where μ_k and Σ_k are the mean and the variance, respectively;
Figure BDA0003748767000000111
is an indicator function that equals 1 when z_i = k and 0 otherwise, k ∈ {1, …, K}.
It should be noted that, in the present invention, medical record data with a source medical institution label is complete data, and medical record data without a source label is incomplete data. The medical record data set belongs to incomplete data, and the complete data of the medical record data set needs to be obtained through subsequent modeling. In S203, only the likelihood function and log-likelihood function calculation formula of the complete data are defined, but at this time, because the medical record data set belongs to incomplete data, the likelihood function and log-likelihood function of the complete data cannot be directly obtained, and need to be obtained by subsequent solution.
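To make the complete-data log-likelihood of S203 concrete, the sketch below evaluates it when the group label of every sample is known. It assumes, which the text suggests but does not state outright, that each group's joint distribution of (t, r) is Gaussian, and it restricts t and r to scalars so that a closed-form 2x2 inverse can be used. The names `log_gauss_2d` and `complete_log_likelihood` are hypothetical.

```python
import math


def log_gauss_2d(v, mu, cov):
    """Log density of a 2-D Gaussian with mean mu and 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = v[0] - mu[0], v[1] - mu[1]
    # Quadratic form (v - mu)^T cov^{-1} (v - mu) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def complete_log_likelihood(samples, labels, pi, mus, covs):
    """Sum over samples of log pi_z + log N(t, r | mu_z, Sigma_z) for the known label z."""
    total = 0.0
    for v, z in zip(samples, labels):
        total += math.log(pi[z]) + log_gauss_2d(v, mus[z], covs[z])
    return total
```

With incomplete data the labels are unknown, which is why S3 replaces the indicator by its conditional expectation.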
S3, fixing the expectation and covariance matrix of the characterization space obtained in S2, and using an expectation maximization algorithm to identify the heterogeneous treatment scheme distribution mechanisms corresponding to the candidate cluster number, wherein the heterogeneous treatment scheme distribution mechanisms represent the different causal relationships between the intervention variables and confounding variables of different data sources, and each heterogeneous treatment scheme distribution mechanism corresponds to one diagnosis and treatment means.
It should be noted that the diagnosis and treatment means of a medical institution are closely tied to its treatment philosophy, technical proficiency, medical instruments and equipment, and so on, so different medical institutions give different treatment schemes for the same condition. For example, for a given condition, medical institution A typically performs a blood test and determines the treatment scheme from its results; medical institution B typically determines the treatment scheme from the physician's own experience; and medical institution C typically performs an examination with ultrasound or radiation equipment and determines the treatment scheme from its results. These three diagnosis and treatment means correspond to three heterogeneous treatment scheme distribution mechanisms: when a medical institution is faced with condition information (the confounding variable), it gives a treatment scheme (the intervention variable) according to its own diagnosis and treatment means, so different causal relationships exist between the intervention variables and confounding variables of different medical institutions.
In this embodiment, the step S3 specifically includes the following sub-steps:
S301, initializing the heterogeneous data distribution with random numbers:
Figure BDA0003748767000000112
where K is the candidate cluster number selected in S2.
S302, using the characterization space information obtained in S202
Figure BDA0003748767000000113
reinitializing the heterogeneous data distribution θ as θ^(0) = {π^(0), μ^(0), Σ^(0)}:
Figure BDA0003748767000000114
wherein
Figure BDA0003748767000000115
are, respectively, the randomly initialized mean of T, variance of T, and covariance matrix of T and R, and
Figure BDA0003748767000000121
is the transpose of
Figure BDA0003748767000000122
.
S303, performing the expectation step of the s-th iteration, i.e., computing, from the given observed data {T, R} and the current heterogeneous data distribution estimate θ^(s), the expectation of the log-likelihood function of the complete data:
Figure BDA0003748767000000123
in which the expectation
Figure BDA0003748767000000124
is the conditional probability, with respect to θ^(s), that the i-th sample belongs to the k-th group:
Figure BDA0003748767000000125
wherein
Figure BDA0003748767000000126
is the probability that the sample t, r originates from group z = i (here i = 1, 2, …, K), the conditional probabilities over the K groups sum to 1, and
Figure BDA0003748767000000127
is the joint probability distribution of T and R given the distribution parameters
Figure BDA0003748767000000128
.
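The expectation step of S303 amounts to computing, for every sample, the posterior probability of each group. The following is a minimal sketch under the same assumptions as a standard Gaussian mixture model: Gaussian group densities, scalar t and r, and hypothetical function names. The log-sum-exp shift is an implementation choice for numerical stability, not something the text prescribes.

```python
import math


def log_gauss_2d(v, mu, cov):
    """Log density of a 2-D Gaussian with mean mu and 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = v[0] - mu[0], v[1] - mu[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def e_step(samples, pi, mus, covs):
    """Return gamma[i][k], the posterior probability that sample i belongs to group k."""
    gammas = []
    for v in samples:
        logs = [math.log(pi[k]) + log_gauss_2d(v, mus[k], covs[k])
                for k in range(len(pi))]
        top = max(logs)  # subtract the maximum before exponentiating
        weights = [math.exp(value - top) for value in logs]
        norm = sum(weights)
        gammas.append([w / norm for w in weights])
    return gammas
```

Each row of the returned list sums to 1, matching the statement that the conditional probabilities over the K groups sum to 1.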
S304, performing the maximization step of the s-th iteration, i.e., maximizing, from the given observed data {T, R} and the current heterogeneous data distribution estimate θ^(s), the expected log-likelihood Q(θ, θ^(s)) of the complete data, and updating the heterogeneous data distribution estimate to θ^(s+1):
θ^(s+1) = argmax_θ Q(θ, θ^(s))
where θ^(s+1) is obtained by solving for the following parameters:
Figure BDA0003748767000000131
Figure BDA0003748767000000132
Figure BDA0003748767000000133
wherein
Figure BDA0003748767000000134
denotes the concatenation of T and R along the feature dimension, and
Figure BDA0003748767000000135
is a matrix, with M^2 = M M^T.
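The update formulas of S304 are available only as figures. The sketch below uses the standard closed-form Gaussian mixture updates on the concatenated vector [t, r], which is an assumption about what those figures contain. It also updates every block of the mean and covariance freely, so it does not implement the fixing of the characterization-space expectation and covariance described in S3. The function name `m_step` is hypothetical.

```python
def m_step(samples, gammas):
    """Weighted updates of mixing weights, means and covariances from posteriors."""
    n = len(samples)
    num_groups = len(gammas[0])
    dim = len(samples[0])
    pi, mus, covs = [], [], []
    for k in range(num_groups):
        nk = sum(g[k] for g in gammas)  # effective number of samples in group k
        mu = [sum(g[k] * v[j] for g, v in zip(gammas, samples)) / nk
              for j in range(dim)]
        # Weighted outer product (v - mu)(v - mu)^T, matching M^2 = M M^T.
        cov = [[sum(g[k] * (v[a] - mu[a]) * (v[b] - mu[b])
                    for g, v in zip(gammas, samples)) / nk
                for b in range(dim)]
               for a in range(dim)]
        pi.append(nk / n)
        mus.append(mu)
        covs.append(cov)
    return pi, mus, covs
```

Alternating this function with an expectation step until the parameters stop changing gives the iteration described in S305.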
S305, iterating the expectation step S303 and the maximization step S304 of the expectation maximization algorithm until convergence, finally obtaining the converged distribution solution corresponding to the current K value:
Figure BDA0003748767000000136
θ* characterizes the different causal relationships between the intervention variables and confounding variables of different data sources, together with their corresponding distributions.
S4, traversing all candidate cluster numbers in the value range of the cluster hyperparameter and executing S2 and S3 for each of them, to obtain the heterogeneous treatment scheme distribution mechanisms corresponding to each candidate cluster number; for each candidate cluster number, dividing the samples in the medical record data set into as many sample subgroups as the candidate cluster number; then, based on a correlation independence index, selecting from all candidate cluster numbers an optimal cluster number and the potential group tool variable corresponding to each sample under that optimal cluster number; and, using the potential group tool variable as the indicator of the different sources in the multi-party mixed medical record data, clustering all medical record data into a plurality of groups. The medical record data in each group share the same diagnosis and treatment means, that is, they belong to the same heterogeneous treatment scheme distribution mechanism, so that the differences between the diagnosis and treatment means of different medical institutions in the medical record data set are traced.
In this embodiment, the step S4 specifically includes the following sub-steps:
S401, traversing all candidate cluster numbers in the value range of the cluster hyperparameter K and executing S2 and S3 for each candidate K value, to obtain the converged distribution solution θ* = {π*, μ*, Σ*} of the complete data; then, based on the converged distribution solution corresponding to each K value, reconstructing the potential cluster tool variable corresponding to each medical record data sample in the medical record data set:
Figure BDA0003748767000000141
where subscript i denotes the parameter corresponding to the ith sample, i =1,2, …, n;
S402, for all values of the cluster hyperparameter K, using the correlation independence index MMD as the screening index, and selecting the K that minimizes MMD as the optimal cluster number:
Figure BDA0003748767000000142
K* = argmin_K MMD_K(Z, R), K ∈ {1, 2, …, 10}
wherein
Figure BDA0003748767000000143
represents the mean of the characterizations R of all samples in the k-th sample subgroup, and K* is the optimal cluster number;
S403, taking the potential cluster tool variable z_i corresponding to each sample under the optimal cluster number K* as the optimal potential cluster tool variable Z*, and using z_i as the source indication variable of each piece of medical record data; the medical record data in each group have the same source indication variable, which indicates that they adopt the same diagnosis and treatment means, that is, they belong to the same heterogeneous treatment scheme distribution mechanism, thereby tracing the diagnosis and treatment means category of each piece of medical record data.
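Steps S401 to S403 can be sketched as follows. Two things here are assumptions, because the corresponding formulas are only figures: the group of a sample is taken as the one with the largest posterior probability, and the index is taken as the sum of squared distances between the subgroup means of R. All function names are hypothetical. Under this assumed form a single group trivially scores zero, so the sketch is illustrative rather than a faithful copy of the patent's MMD index.

```python
def reconstruct_groups(gammas):
    """Assign each sample the group (numbered from 1) with the largest posterior."""
    return [max(range(len(g)), key=lambda k: g[k]) + 1 for g in gammas]


def mean_discrepancy(labels, reps):
    """Sum of squared distances between subgroup means of the characterization R."""
    groups = {}
    for z, r in zip(labels, reps):
        groups.setdefault(z, []).append(r)
    dim = len(reps[0])
    means = [[sum(r[j] for r in rows) / len(rows) for j in range(dim)]
             for rows in groups.values()]
    total = 0.0
    for a in range(len(means)):
        for b in range(a + 1, len(means)):
            total += sum((x - y) ** 2 for x, y in zip(means[a], means[b]))
    return total


def select_best_k(candidates):
    """candidates maps K to (labels, reps); return the K with the smallest index."""
    return min(candidates, key=lambda k: mean_discrepancy(*candidates[k]))
```

The intuition behind minimizing such an index is that a valid group tool variable should be independent of the confounders, so the characterization R should look alike across subgroups.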
It should be noted that the tracing method in S1 to S4 reconstructs the potential group tool variable and thereby identifies the heterogeneous treatment scheme distribution mechanisms (the diagnosis and treatment means of the medical institutions) in a medical record data set without source labels, that is, it identifies the different causal relationships between intervention variables and confounding variables in the medical record data, so that the medical record data can be classified by diagnosis and treatment means category. Medical record data of the same category can be regarded as one sample subgroup whose samples share the same diagnosis and treatment means, that is, the same heterogeneous treatment scheme distribution mechanism; each piece of medical record data can therefore be assigned to its diagnosis and treatment means category, which realizes the tracing.
In addition, if the heterogeneous treatment scheme distribution mechanisms of the different medical institutions in the medical record data set all differ from one another, the categories can be mapped directly to medical institutions, that is, the medical record data samples in each sample subgroup all originate from the same medical institution. If some of the medical record data samples in a sample subgroup carry medical institution source labels (complete data) while the rest lack them (incomplete data), the source labels of the complete data can be used to fill in the missing source labels of the incomplete data in the same subgroup, thereby tracing the institutional source of the medical record data.
Similarly, based on the same inventive concept, as shown in fig. 2, another preferred embodiment of the present invention further provides a multi-party hybrid data traceability system based on potential group tool variables, which corresponds to the multi-party hybrid data traceability method based on potential group tool variables provided in the foregoing embodiment, and includes:
the data set acquisition module is used for acquiring a medical record data set, to be traced and identified, that comes from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information;
the characterization module is used for selecting a candidate cluster number in the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable and the corresponding condition information as the confounding variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning;
the expectation maximization algorithm module is used for fixing the expectation and covariance matrix of the characterization space obtained in the characterization module, and using an expectation maximization algorithm to identify the heterogeneous treatment scheme distribution mechanisms corresponding to the candidate cluster number, wherein the heterogeneous treatment scheme distribution mechanisms represent the different causal relationships between the intervention variables and confounding variables of different data sources, and each heterogeneous treatment scheme distribution mechanism corresponds to one diagnosis and treatment means;
the grouping and tracing module is used for traversing all candidate cluster numbers in the value range of the cluster hyperparameter and running the characterization module and the expectation maximization algorithm module for each of them, to obtain the heterogeneous treatment scheme distribution mechanisms corresponding to each candidate cluster number; for each candidate cluster number, dividing the samples in the medical record data set into as many sample subgroups as the candidate cluster number; then, based on a correlation independence index, selecting from all candidate cluster numbers an optimal cluster number and the potential group tool variable corresponding to each sample under that optimal cluster number; and, using the potential group tool variable as the indicator of the different sources in the multi-party mixed medical record data, clustering all medical record data into a plurality of groups, wherein the medical record data in each group share the same diagnosis and treatment means, that is, they belong to the same heterogeneous treatment scheme distribution mechanism, so that the differences between the diagnosis and treatment means of different medical institutions in the medical record data set are traced.
Since the principle of solving the problem of the multi-party mixed data tracing method based on the potential cluster tool variables is similar to that of the multi-party mixed data tracing system based on the potential cluster tool variables in the embodiment of the present invention, the detailed implementation forms of the modules of the system in this embodiment may also be referred to the detailed implementation forms of the method portions shown in S1 to S4, and repeated details are not repeated.
In addition, in another embodiment of the present invention, based on the multi-party mixed data tracing and identifying method shown in S1 to S4, the accurate treatment plan recommendation can be further implemented through the step S5, which specifically includes:
firstly, acquiring potential cluster tool variables corresponding to each sample under the optimal cluster number obtained by the multi-party mixed data tracing identification method in the embodiments S1 to S4, embedding the acquired potential cluster tool variables into a tool variable regression method, and performing joint learning by combining multi-party knowledge to obtain a counterfactual prediction function;
then, the condition information of the target case is input into the counterfactual prediction function as the confounding variable, to obtain the predicted treatment result that the target case can achieve under each diagnosis and treatment means, which serves as a reference when selecting the diagnosis and treatment means.
In addition, in another embodiment of the present invention, as shown in fig. 3, based on the same inventive concept as the above-mentioned precise treatment plan recommendation, there is also provided a precise treatment plan recommendation system, which includes:
the joint learning module is configured to obtain potential cluster tool variables corresponding to each sample in the optimal cluster number obtained by the multi-party mixed data traceability identification method in the foregoing embodiment, embed the obtained potential cluster tool variables into a tool variable regression method, and perform joint learning by combining multi-party knowledge to obtain a counterfactual prediction function;
and the treatment scheme recommending module is used for inputting the condition information of the target case into the counterfactual prediction function as the confounding variable, to obtain the predicted treatment result that the target case can achieve under each diagnosis and treatment means, which serves as a reference when selecting the diagnosis and treatment means.
In the above accurate treatment scheme recommendation method and system, when joint learning is performed by combining multi-party knowledge, each learning sample must supply the condition information of the medical record data, the treatment scheme given by the medical institution, the optimal potential group tool variable, and the treatment result after treatment according to the treatment scheme, so that the learned counterfactual prediction function can predict, from the condition information, the treatment results under different diagnosis and treatment means.
In the above accurate treatment scheme recommendation method and system, the tool variable (instrumental variable) regression method used includes two-stage least squares regression, polynomial-based two-stage least squares regression, kernel-based two-stage least squares regression, deep-learning-based least squares regression, adversarial-moment-condition-based two-stage least squares regression, and the like.
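As an illustration of the simplest method in this list, the sketch below runs a plain two-stage least squares regression with the group label as the tool variable. It is deliberately reduced: the confounding variables X are omitted, the treatment is a scalar, and the outcome is assumed linear in the treatment. The function name and the toy numbers are hypothetical and are not taken from the patent.

```python
def two_stage_least_squares(groups, treatments, outcomes):
    """Return a prediction function y(t) fitted by 2SLS with a group instrument."""
    # Stage 1: regress T on group dummies, which reduces to the group mean of T.
    sums, counts = {}, {}
    for z, t in zip(groups, treatments):
        sums[z] = sums.get(z, 0.0) + t
        counts[z] = counts.get(z, 0) + 1
    t_hat = [sums[z] / counts[z] for z in groups]
    # Stage 2: ordinary least squares of Y on the fitted treatment, with intercept.
    n = len(outcomes)
    t_bar = sum(t_hat) / n
    y_bar = sum(outcomes) / n
    cov_ty = sum((t - t_bar) * (y - y_bar) for t, y in zip(t_hat, outcomes))
    var_t = sum((t - t_bar) ** 2 for t in t_hat)
    slope = cov_ty / var_t
    intercept = y_bar - slope * t_bar
    return lambda t: intercept + slope * t


# Toy data in which the outcome is exactly 2 * T + 1 (made-up numbers).
groups = [1, 1, 2, 2]
treatments = [1.0, 3.0, 5.0, 7.0]
outcomes = [3.0, 7.0, 11.0, 15.0]
predict = two_stage_least_squares(groups, treatments, outcomes)
```

The returned function plays the role of the counterfactual prediction function: it can be evaluated at treatment values that were never observed for a given case.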
It should be noted that the above accurate treatment scheme recommendation method and system only give the predicted treatment result of the target case under each diagnosis and treatment means; which diagnosis and treatment means is actually adopted is left to the patient or the doctor. The accurate treatment scheme recommendation method and system can be applied in the field of assisted medical treatment, and also in non-medical fields such as scientific research.
In addition, in the above-described embodiments, the modules are executed as program modules executed in sequence, and thus, the processes of data processing are essentially executed. Moreover, as will be clearly understood by those skilled in the art, for convenience and simplicity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again. In the embodiments provided in the present application, the division of the steps or modules in the method and system is only one logical function division, and there may be another division manner in actual implementation, for example, multiple modules or steps may be combined or may be integrated together, and one module or step may also be split.
In the following, the present invention will show the application effect of the multi-party mixed data tracing method based on potential cluster tool variables and the precise treatment recommendation method in the above embodiments on specific data sets by a specific example, so as to facilitate understanding of the essence of the present invention.
Examples
This example uses the Infant Health and Development Program (IHDP) data set and the PM2.5 concentration effect on cardiovascular mortality (PM-CMR) data set. Fig. 4 takes the PM2.5 concentration in the PM-CMR data set as an example to show that a potential cluster tool variable exists in heterogeneous treatment/intervention data; the potential cluster tool variable of IHDP is similar.
The Infant Health and Development Program (IHDP) data set contains 747 twins samples, each with 6 continuous and 19 discrete pre-treatment/pre-intervention variables related to the infant and its mother; the program aimed to study the effect of early specialist home visits on the infant's future mental development. The present example divides the data set into a training set, a validation set and a test set in a 63%/27%/10% ratio. As in previous work, the present example makes assumptions about the potential outcome function and then generates the semi-synthetic data set corresponding to fig. 4.
The PM2.5 concentration effect on cardiovascular mortality (PM-CMR) data set contains data from 2132 cities, each with 6 continuous pre-treatment/pre-intervention variables related to cardiovascular disease; the study aimed to examine the effect of PM2.5 concentration on cardiovascular disease mortality. The present example divides the data set into a training set, a validation set and a test set in a 63%/27%/10% ratio. As in previous work, the present example makes assumptions about the potential outcome function and then generates the semi-synthetic data set corresponding to fig. 4.
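A 63%/27%/10% split of the kind described for both data sets can be sketched as below. The rounding rule and the fixed seed are assumptions made for reproducibility, and `split_indices` is a hypothetical helper; the patent does not say how the split was implemented.

```python
import random


def split_indices(n, seed=0):
    """Shuffle sample indices and cut them into 63% / 27% / 10% parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(0.63 * n)
    n_val = round(0.27 * n)
    # The test set takes whatever remains, so the three parts always cover n.
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = split_indices(747)  # 747 is the IHDP sample count from the text
```

The ratio equals holding out 10% for testing and then splitting the remaining 90% into 70% training and 30% validation.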
To evaluate the algorithm objectively, both data sets were randomly shuffled and the model retrained 10 times; LatGIV_EM was embedded into 9 different downstream tool variable regression methods, and the mean and standard deviation (mean (std)) of the MSE of the fitted potential outcome function over the 10 experiments were calculated.
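The mean (std) entries reported for such repeated runs can be produced with a helper like the one below. Whether the original experiments used the sample or the population standard deviation is not stated; the sample version is assumed here, and `summarize` is a hypothetical name.

```python
import statistics


def summarize(mse_runs):
    """Format repeated-run MSE values as 'mean (std)' with three decimals."""
    mean = statistics.mean(mse_runs)
    std = statistics.stdev(mse_runs)  # sample standard deviation (assumption)
    return f"{mean:.3f} ({std:.3f})"
```

For example, three runs with MSE values 1.0, 2.0 and 3.0 would be reported as a mean of 2.000 with a standard deviation of 1.000.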
For the two data sets, the accuracy of multi-party mixed data source tracing and the corresponding visualization are shown in fig. 5, where LatGIV_EM is the tracing method provided in the foregoing embodiments of the present invention and LatGIV_KM directly uses the clusters obtained by K-Means clustering as the potential tool variable. The recovery accuracy of the method of the present invention reaches about 80%, whereas direct K-Means clustering achieves an accuracy below 60% and cannot guarantee even 30% accuracy for 5 data sources.
Furthermore, on the basis of the multi-party mixed data tracing method provided by the embodiment of the invention, the recommendation accuracy of the accurate treatment scheme recommendation method is further tested. According to the method, a counterfactual prediction function is obtained by combining multi-party knowledge through the optimal group tool variables obtained based on tracing, the counterfactual prediction function is used for predicting the treatment result prediction value which can be achieved by different cases under each diagnosis and treatment means, the optimal treatment scheme is selected according to the prediction value, and accurate treatment scheme recommendation is provided for each patient. Specific results are shown in table 1:
TABLE 1 MSE error (mean (std)) of LatGIV_EM
Figure BDA0003748767000000181
Down the rows of the table, in order: NoneIV means that no tool variable is used; the potential tool variables of UAS are from reference [1]; those of WAS are from reference [2]; those of ModeIV are from reference [3]; those of AutoIV are from reference [4]; those of LatGIV_KM are obtained with the K-Means algorithm; those of LatGIV_EM are obtained with the method proposed in the foregoing embodiment of the present invention; and TrueIV refers to the cluster tool variables known in advance.
Across the columns of the table, in order: Poly2SLS is the most classical two-stage tool variable regression method for predicting potential outcomes (i.e., for accurate treatment recommendation), KernelIV is from reference [5], DeepIV is from reference [6], and DeepGMM is from reference [7].
The above references are specifically as follows:
[1]. Neil M Davies, Stephanie von Hinke Kessler Scholder, Helmut Farbmacher, Stephen Burgess, Frank Windmeijer, and George Davey Smith. 2015. The many weak instruments problem and Mendelian randomization. Statistics in Medicine 34, 3 (2015), 454–468.
[2]. Stephen Burgess, Frank Dudbridge, and Simon G Thompson. 2016. Combining information on multiple instrumental variables in Mendelian randomization: comparison of allele score and summarized data methods. Statistics in Medicine 35, 11 (2016), 1880–1906.
[3]. Jason S Hartford, Victor Veitch, Dhanya Sridhar, and Kevin Leyton-Brown. 2021. Valid causal inference with (some) invalid instruments. In International Conference on Machine Learning. PMLR, 4096–4106.
[4]. Junkun Yuan, Anpeng Wu, Kun Kuang, Bo Li, Runze Wu, Fei Wu, and Lanfen Lin. 2022. Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition. ACM Transactions on Knowledge Discovery from Data (TKDD) 16, 4 (2022), 1–20.
[5]. Rahul Singh, Maneesh Sahani, and Arthur Gretton. 2019. Kernel instrumental variable regression. In NeurIPS 2019. 4593–4605.
[6]. Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. 2017. DeepIV: A flexible approach for counterfactual prediction. In ICML 2017.
[7]. Andrew Bennett, Nathan Kallus, and Tobias Schnabel. 2019. Deep generalized method of moments for instrumental variable analysis. In NeurIPS 2019.
Therefore, compared with the estimation methods in the prior art, the method of the present invention achieves better recommendation accuracy.
The above-mentioned embodiments are merely two preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (10)

1. A multi-party mixed data tracing method based on potential cluster tool variables is characterized by comprising the following steps:
S1, acquiring a medical record data set, to be traced and identified, that comes from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information;
S2, selecting a candidate cluster number in the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable and the corresponding condition information as the confounding variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning;
S3, fixing the expectation and covariance matrix of the characterization space obtained in S2, and using an expectation maximization algorithm to identify the heterogeneous treatment scheme distribution mechanisms corresponding to the candidate cluster number, wherein the heterogeneous treatment scheme distribution mechanisms represent the different causal relationships between the intervention variables and confounding variables of different data sources, and each heterogeneous treatment scheme distribution mechanism corresponds to one diagnosis and treatment means;
S4, traversing all candidate cluster numbers in the value range of the cluster hyperparameter and executing S2 and S3 for each of them, to obtain the heterogeneous treatment scheme distribution mechanisms corresponding to each candidate cluster number; for each candidate cluster number, dividing the samples in the medical record data set into as many sample subgroups as the candidate cluster number; then, based on a correlation independence index, selecting from all candidate cluster numbers an optimal cluster number and the potential group tool variable corresponding to each sample under that optimal cluster number; and, using the potential group tool variable as the indicator of the different sources in the multi-party mixed medical record data, clustering all medical record data into a plurality of groups, wherein the medical record data in each group share the same diagnosis and treatment means, that is, they belong to the same heterogeneous treatment scheme distribution mechanism, so that the differences between the diagnosis and treatment means of different medical institutions in the medical record data set are traced.
2. The multi-party mixed data tracing method based on potential group tool variables of claim 1, wherein in each medical record data in S1, the patient signs include weight, height, age, sex, working property and related examination results, the oral information is derived from the description of the patient about the self disease state and past medical history, and the treatment results are derived from the return visit results.
3. The multi-party hybrid data tracing method based on potential cluster tool variables according to claim 1, wherein said S2 comprises the following sub-steps:
S201, for each candidate cluster number K in the value range of the cluster hyperparameter, using a characterization learning algorithm to take all observed patient signs and oral information as confounding variables and map them, through a mapping function, into a characterization space whose dimensions are mutually independent, while the non-independent data and multivariate complex interaction terms are jointly learned as a noise term:
Figure FDA0003748766990000021
wherein X is the condition information in the medical record data, serving as the confounding variable; T is the treatment scheme in the medical record data, serving as the intervention variable; E_TZ is the error term arising from unobserved patient signs and oral information or from measurement errors;
Figure FDA0003748766990000022
represents the heterogeneous treatment scheme distribution mechanisms corresponding to the K values of the potential cluster tool variable Z ∈ {1, …, K}:
Figure FDA0003748766990000023
Its input is the confounding variable X, and K is the currently selected candidate cluster number; z is an instantiation of the potential cluster tool variable Z; R is the characterization space finally obtained by learning, R_j denotes the j-th component of R, j ∈ {1, …, m_R}, and m_R is the total number of characterization dimensions; α_zj is the linear fitting coefficient of the corresponding characterization; β_z is the noise term jointly learned from the non-independent data and multivariate complex interaction terms; 1_[Z=z] is an indicator function that equals 1 when the true treatment scheme distribution mechanism between sample data X and T is Z = z, and 0 otherwise;
S202, based on the characterization space R finally learned in S201, calculating the expectation and covariance of the characterization space:
Figure FDA0003748766990000024
wherein $r_i$ is the characterization vector of the i-th sample, $\sigma(R, R)$ is the covariance matrix, and n is the total number of samples in the medical record data set;
S203, defining the likelihood function and the log-likelihood function of the complete data as follows:
Figure FDA0003748766990000025
Figure FDA0003748766990000026
wherein t is an instantiation of the intervention variable T, r is an instantiation of the characterization space R, and $t_i$, $r_i$, $z_i$ are the t, r and z corresponding to the i-th sample;
Figure FDA0003748766990000031
is the joint probability distribution of t, r, z given the distribution parameter $\theta$; $\pi_k$ is the probability that $t_i$, $r_i$ come from group $z_i = k$;
Figure FDA0003748766990000032
is the joint probability distribution of $t_i$, $r_i$ given the distribution parameters $\mu_k$, $\Sigma_k$, where $\mu_k$ and $\Sigma_k$ are the mean and the variance, respectively; and
Figure FDA0003748766990000033
is an indicator function that equals 1 when $z_i = k$ and 0 otherwise, with $k \in \{1, \dots, K\}$.
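Sub-steps S202 and S203 can be illustrated with a minimal sketch. The claim's own formulas sit in figures that are not reproduced here, so the following rests on stated assumptions: each group's joint distribution of (t, r) is taken to be multivariate Gaussian, the covariance uses 1/n normalization, and groups are indexed from 0. The function names `representation_moments`, `gaussian_logpdf` and `complete_data_loglik` are hypothetical.

```python
import numpy as np

def representation_moments(R):
    """Sample mean and covariance of the representation vectors r_i (rows of R)."""
    mu = R.mean(axis=0)
    centered = R - mu
    # Assumption: 1/n normalization; the claim's figure may use a different one.
    sigma = centered.T @ centered / R.shape[0]
    return mu, sigma

def gaussian_logpdf(V, mean, cov):
    """Log density of a multivariate normal evaluated at each row of V."""
    d = V.shape[1]
    diff = V - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def complete_data_loglik(T, R, z, pi, mus, covs):
    """sum_i sum_k 1[z_i = k] * (log pi_k + log N(t_i, r_i | mu_k, Sigma_k))."""
    V = np.hstack([T, R])  # concatenate T and R along the feature dimension
    total = 0.0
    for k in range(len(pi)):
        mask = z == k  # indicator 1[z_i = k]
        if mask.any():
            total += np.sum(np.log(pi[k]) + gaussian_logpdf(V[mask], mus[k], covs[k]))
    return total
```

The indicator function only selects which group's density each sample contributes to, which is why the complete-data log-likelihood reduces to a sum over the samples of each group.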
4. The multi-party mixed data tracing method based on potential group tool variables according to claim 1, wherein said S3 specifically comprises the following sub-steps:
S301, initializing the heterogeneous data distribution with random numbers:
Figure FDA0003748766990000034
where K is the candidate cluster number selected in S2;
S302, using the characterization space information obtained in S202,
Figure FDA0003748766990000035
reinitializing the heterogeneous data distribution $\theta$ as $\theta^{(0)} = \{\pi^{(0)}, \mu^{(0)}, \Sigma^{(0)}\}$:
Figure FDA0003748766990000036
wherein
Figure FDA0003748766990000037
are, respectively, the randomly initialized mean of T, variance of T, and covariance matrix of T and R, and
Figure FDA0003748766990000038
is the transpose of
Figure FDA0003748766990000039
S303, performing the expectation step of the s-th iteration, i.e., based on the given observed data {T, R} and the current heterogeneous data distribution estimate $\theta^{(s)}$, computing the expectation of the log-likelihood function of the complete data as:
Figure FDA00037487669900000310
wherein the expectation
Figure FDA00037487669900000311
is the conditional probability distribution of the i-th sample over the k-th group with respect to $\theta^{(s)}$:
Figure FDA00037487669900000312
wherein
Figure FDA00037487669900000313
is the probability that the sample t, r comes from group z = k, the conditional probabilities over the K groups summing to 1; and
Figure FDA00037487669900000314
is, given the distribution parameters
Figure FDA00037487669900000315
the joint probability distribution of T and R;
S304, then performing the maximization step of the s-th iteration, i.e., based on the given observed data {T, R} and the current heterogeneous data distribution estimate $\theta^{(s)}$, maximizing the expectation $Q(\theta, \theta^{(s)})$ of the log-likelihood function of the complete data and updating the heterogeneous data distribution estimate to $\theta^{(s+1)}$:
$\theta^{(s+1)} = \arg\max_{\theta} Q(\theta, \theta^{(s)})$
wherein the parameters in $\theta^{(s+1)}$ are solved as:
Figure FDA0003748766990000041
Figure FDA0003748766990000042
Figure FDA0003748766990000043
wherein
Figure FDA0003748766990000044
denotes the concatenation of T and R along the feature dimension;
Figure FDA0003748766990000045
is a matrix, and $M^2 = M M^T$;
S305, in the expectation maximization algorithm, iteratively performing the expectation step S303 and the maximization step S304 to finally obtain the distribution convergence solution corresponding to the current K value:
Figure FDA0003748766990000046
wherein $\theta^*$ characterizes the different causal relationships, and their corresponding distributions, that exist between the intervention variables and the confounding variables of different data sources.
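The iteration of S303 and S304 can be sketched as a plain Gaussian-mixture expectation maximization loop over the concatenation of T and R. This is a simplified illustration under assumptions, not the claimed procedure: it does not hold the moments of R fixed, it initializes the means from randomly chosen samples, it runs a fixed number of iterations instead of testing convergence, and the function name `em_gaussian_mixture` and the regularization constant `reg` are hypothetical.

```python
import numpy as np

def em_gaussian_mixture(T, R, K, n_iter=100, seed=0, reg=1e-6):
    """Plain EM for a K-component Gaussian mixture over V = [T, R]."""
    rng = np.random.default_rng(seed)
    V = np.hstack([T, R])  # concatenate along the feature dimension
    n, d = V.shape
    # Random initialization (assumption: means drawn from the samples).
    pi = np.full(K, 1.0 / K)
    mus = V[rng.choice(n, size=K, replace=False)]
    covs = np.stack([np.cov(V, rowvar=False) + reg * np.eye(d)] * K)
    for _ in range(n_iter):
        # Expectation step: posterior probability of each group per sample.
        log_w = np.empty((n, K))
        for k in range(K):
            diff = V - mus[k]
            _, logdet = np.linalg.slogdet(covs[k])
            maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(covs[k]), diff)
            log_w[:, k] = np.log(pi[k] + 1e-12) - 0.5 * (
                d * np.log(2 * np.pi) + logdet + maha)
        log_w -= log_w.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Maximization step: closed-form updates of pi, mu and Sigma.
        nk = gamma.sum(axis=0) + 1e-12
        pi = nk / nk.sum()
        mus = (gamma.T @ V) / nk[:, None]
        for k in range(K):
            diff = V - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
    return pi, mus, covs, gamma
```

The returned `gamma` holds, for every sample, the posterior probability of each group, which is the quantity needed later to assign each medical record to a group.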
5. The multi-party mixed data tracing method based on potential group tool variables according to claim 1, wherein said S4 comprises the following sub-steps:
S401, traversing all candidate cluster numbers within the value range of the cluster hyperparameter K, executing S2 and S3 for each candidate K value, and obtaining the distribution convergence solution of the complete data $\theta^* = \{\pi^*, \mu^*, \Sigma^*\}$; then, based on the distribution convergence solution corresponding to each K value, reconstructing the potential cluster tool variable corresponding to each medical record data sample in the medical record data set:
Figure FDA0003748766990000047
where the subscript i denotes the parameter corresponding to the i-th sample, i = 1, 2, …, n;
S402, for all values of the cluster hyperparameter K, using the correlation-independence index MMD (maximum mean discrepancy) as the screening index, and selecting the cluster hyperparameter K that minimizes the MMD as the optimal cluster number:
Figure FDA0003748766990000051
$K^* = \arg\min_K \mathrm{MMD}_K(Z, R), \quad K \in \{1, 2, \dots, 10\}$
wherein
Figure FDA0003748766990000052
represents the mean of the characterizations R corresponding to all samples in the k-th sample subgroup, and $K^*$ is the optimal cluster number;
S403, taking the potential cluster tool variable $z_i$ corresponding to each sample under the optimal cluster number $K^*$ as the optimal potential cluster tool variable $Z^*$, and using $z_i$ as the source indication variable, so that the medical record data in each group share the same source indication variable, indicating that the medical record data in the same group adopt the same diagnosis and treatment means, i.e., belong to the same heterogeneous treatment scheme distribution mechanism, thereby tracing the source of the diagnosis and treatment means category in each piece of medical record data.
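The grouping and selection in S401 to S403 can be sketched as follows. The exact MMD expression is in a figure that is not reproduced here, so the index below is an assumed linear-kernel variant: the average squared distance between each subgroup's mean characterization and the overall mean. The posterior matrix `gamma` is assumed to come from an expectation maximization run, and all function names are hypothetical.

```python
import numpy as np

def assign_groups(gamma):
    """Reconstruct z_i as the group with the highest posterior probability."""
    return np.argmax(gamma, axis=1)

def linear_mmd(z, R):
    """Assumed linear-kernel index: mean squared gap between each subgroup's
    mean representation and the overall mean representation."""
    overall = R.mean(axis=0)
    groups = np.unique(z)
    gaps = [np.sum((R[z == k].mean(axis=0) - overall) ** 2) for k in groups]
    return float(np.mean(gaps))

def select_cluster_number(assignments, R):
    """assignments maps each candidate K to its reconstructed group labels."""
    scores = {K: linear_mmd(z, R) for K, z in assignments.items()}
    best_K = min(scores, key=scores.get)
    return best_K, scores
```

Under this assumed variant a single group always scores zero, so a usage of the sketch would restrict the candidates to K of at least 2; whether the claimed index behaves the same way cannot be determined from the text alone.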
6. The multi-party mixed data tracing method based on potential group tool variables, wherein the characterization learning algorithm employs a variational auto-encoder, principal component analysis, correlation-minimization characterization learning, or characterization based on prior knowledge.
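Of the characterization learning options listed in claim 6, principal component analysis is the simplest to illustrate. The sketch below maps the confounding variables X to an $m_R$-dimensional characterization through a singular value decomposition; the function name `pca_representation` is hypothetical, and the resulting components are uncorrelated, which is weaker than the independence the claims call for.

```python
import numpy as np

def pca_representation(X, m_R):
    """Project centered confounders X onto their first m_R principal axes."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m_R].T
```

For example, `pca_representation(X, 2)` on an n-by-p matrix returns an n-by-2 matrix whose two columns have zero sample covariance with each other.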
7. A multi-party mixed data tracing system based on potential group tool variables, comprising:
a data set acquisition module, configured to acquire a medical record data set to be traced and identified that comes from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution, and a treatment result after treatment according to the treatment scheme, the condition information comprising patient signs and oral information;
a characterization module, configured to select a candidate cluster number within the value range of the cluster hyperparameter, take the treatment scheme given by the medical institution as the intervention variable and the corresponding condition information as the confounding variable, and map the condition information observed in the medical record data set to a characterization space through characterization learning;
an expectation maximization algorithm module, configured to fix the expectation and covariance matrix of the characterization space obtained in the characterization module, and identify, by means of the expectation maximization algorithm, the heterogeneous treatment scheme distribution mechanisms corresponding to the candidate cluster number, wherein the heterogeneous treatment scheme distribution mechanisms represent the different causal relationships between the intervention variables and the confounding variables of different data sources, and each heterogeneous treatment scheme distribution mechanism corresponds to one diagnosis and treatment means;
a grouping and tracing module, configured to traverse all candidate cluster numbers within the value range of the cluster hyperparameter and execute the characterization module and the expectation maximization algorithm module for each candidate cluster number, to obtain the heterogeneous treatment scheme distribution mechanisms corresponding to each candidate cluster number; for each candidate cluster number, divide the samples in the medical record data set into as many sample subgroups as the candidate cluster number; then, based on the correlation-independence index, select from all candidate cluster numbers an optimal cluster number and the potential group tool variable corresponding to each sample under the optimal cluster number; and use the potential group tool variables as the source indication variables in the multi-party mixed medical record data, so that all medical record data in the multi-party mixed medical record data are clustered into a plurality of groups, the medical record data in each group having the same diagnosis and treatment means, i.e., belonging to the same heterogeneous treatment scheme distribution mechanism, whereby the differences in diagnosis and treatment means among different medical institutions in the medical record data set are traced to their source.
8. A precision treatment scheme recommendation system, comprising:
a joint learning module, configured to acquire the potential cluster tool variable corresponding to each sample under the optimal cluster number obtained by the multi-party mixed data tracing method according to any one of claims 1 to 7, embed the acquired potential cluster tool variables into a tool variable regression method, and perform joint learning with multi-party knowledge to obtain a counterfactual prediction function;
and a treatment scheme recommendation module, configured to input the condition information of a target case, as the confounding variable, into the counterfactual prediction function to obtain predicted values of the treatment result that the target case can achieve under each diagnosis and treatment means, as a reference for selecting the diagnosis and treatment means.
9. The precision treatment scheme recommendation system of claim 8, wherein, when the joint learning with multi-party knowledge is performed, each learning sample requires as input the condition information of the corresponding medical record data, the treatment scheme given by the medical institution, the optimal potential group tool variable, and the treatment result after treatment according to the treatment scheme, so as to learn a counterfactual prediction function capable of predicting, based on the condition information, the treatment results under different diagnosis and treatment means.
10. The precision treatment scheme recommendation system of claim 8, wherein the tool variable regression method comprises a two-stage least squares regression method, a polynomial-based two-stage least squares regression method, a kernel-based two-stage least squares regression method, a deep learning algorithm-based least squares regression method, or a two-stage least squares regression method based on adversarial moment conditions.
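The first regression option in claim 10, two-stage least squares, can be sketched with the reconstructed group variable used as the instrument. This is an illustration under assumptions rather than the claimed system: the treatment is a single continuous value, the outcome model is linear, the group labels are integers from 0 to K-1 with every group present, and the function names are hypothetical.

```python
import numpy as np

def two_stage_least_squares(Y, T, X, z, K):
    """Return [intercept, effect of T, effects of X] from a linear 2SLS fit."""
    n = len(Y)
    Z = np.eye(K)[z]  # one-hot encoding of the group tool variable
    # Stage 1: predict the treatment from the instrument and the confounders.
    # The one-hot columns already span the intercept.
    first = np.hstack([Z, X])
    coef1, *_ = np.linalg.lstsq(first, T, rcond=None)
    T_hat = first @ coef1
    # Stage 2: regress the outcome on the predicted treatment and confounders.
    second = np.column_stack([np.ones(n), T_hat, X])
    coef2, *_ = np.linalg.lstsq(second, Y, rcond=None)
    return coef2

def predict_outcome(coef2, x, t):
    """Counterfactual prediction for confounders x under treatment value t."""
    return coef2[0] + coef2[1] * t + x @ coef2[2:]
```

Calling `predict_outcome` with the same x and several candidate values of t gives the kind of per-treatment outcome comparison that the recommendation module in claim 8 describes.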
CN202210836782.7A 2022-07-15 2022-07-15 Multi-party mixed data tracing method and system based on potential group tool variables Pending CN115188484A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836782.7A CN115188484A (en) 2022-07-15 2022-07-15 Multi-party mixed data tracing method and system based on potential group tool variables

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210836782.7A CN115188484A (en) 2022-07-15 2022-07-15 Multi-party mixed data tracing method and system based on potential group tool variables

Publications (1)

Publication Number Publication Date
CN115188484A true CN115188484A (en) 2022-10-14

Family

ID=83519421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836782.7A Pending CN115188484A (en) 2022-07-15 2022-07-15 Multi-party mixed data tracing method and system based on potential group tool variables

Country Status (1)

Country Link
CN (1) CN115188484A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117290659A (en) * 2023-11-24 2023-12-26 华信咨询设计研究院有限公司 Data tracing method based on regression analysis
CN117290659B (en) * 2023-11-24 2024-04-02 华信咨询设计研究院有限公司 Data tracing method based on regression analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination