CN115188484A

CN115188484A - Multi-party mixed data tracing method and system based on potential group tool variables

Info

Publication number: CN115188484A
Application number: CN202210836782.7A
Authority: CN
Inventors: 况琨; 吴安鹏; 吴飞
Original assignee: Higher Research Institute Of Shanghai Zhejiang University; Shanghai AI Innovation Center
Current assignee: Higher Research Institute Of Shanghai Zhejiang University; Shanghai AI Innovation Center
Priority date: 2022-07-15
Filing date: 2022-07-15
Publication date: 2022-10-14

Abstract

The invention discloses a multi-party mixed data tracing method and system based on potential group tool variables. The method first maps the condition information to a characterization space through characterization learning. Then, for a given number of clusters, it uses an expectation maximization algorithm to identify the heterogeneous treatment scheme allocation mechanisms, that is, the different causal relationships that exist between intervention variables and confusion variables in different data sources. Finally, based on the heterogeneous treatment scheme allocation mechanisms, it divides the medical record data into several sample subgroups and, using a correlation index, selects the optimal group tool variable as the source indication variable of the multi-party data. This traces the data back to the differing diagnosis and treatment means of the medical institutions and groups the medical record samples accordingly. Based on the source indication variable, the potential group tool variable can further be embedded in a tool variable regression method and combined with multi-party knowledge for joint learning, providing auxiliary, accurate treatment scheme recommendations for each patient.

Description

Multi-party mixed data tracing method and system based on potential group tool variables

Technical Field

The invention relates to the field of causal inference, in particular to a multi-party mixed data tracing method and system based on potential group tool variables in medical record data.

Background

Causal inference is a powerful explanatory tool and plays an important role in decision making in many fields, such as precision medicine, policy making, targeted recommendation and the improvement of teaching strategies. The gold standard for identifying treatment/intervention effects or potential outcome functions is the randomized controlled trial, but because of the enormous costs and ethical concerns involved, causal analysis can often only be based on observational data. Moreover, because uniform data collection specifications are lacking, data from multiple sources governed by different causal relationships are often mixed together, which presents additional challenges to causal analysis.

Given the mixed-data bias and the unobserved confounding bias that are ubiquitous in causal data, the tool variable (instrumental variable) method is the most classical and most credible approach to removing the influence of unobserved confounding from observational data sets. However, the method usually depends on tool variables selected using human expert knowledge. When that prior knowledge is deficient, the chosen variables may not strictly satisfy the conditions that a tool variable must meet in the algorithm, so the final estimate is unreliable. One solution is to synthesize a tool variable from a large set of candidate tool variables by testing or screening, but this has the same problem: suitable candidates are not easy to find.

A medical record data set records the patient's signs (weight, height, age, sex, nature of work, relevant examination results), oral information (the patient's description of his or her own condition and past medical history), the treatment scheme recommended by the doctor and the treatment result obtained at the return visit. Medical record data are collected and aggregated by different medical institutions. Because of differences in treatment concepts, specialist skills and medical instruments and equipment, different medical institutions apply different treatment scheme allocation mechanisms to the same disease, that is, the treatment scheme allocation mechanism may be heterogeneous across medical institutions. Heterogeneous treatment scheme allocation mechanisms represent the different causal relationships that exist between the intervention variables and the confusion variables of different data sources, and each such mechanism corresponds to one diagnosis and treatment means. However, because the medical field lacks specifications for collecting medical record data, the medical institution from which a record originates and the corresponding diagnosis and treatment means are often missing from the record, so several causal relationships are mixed in one data set. How to trace the source of the medical record samples in such a data set, and find the diagnosis and treatment means category or the medical institution corresponding to each record, is therefore a technical problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to overcome the following defect: because the medical field lacks specifications for collecting medical record data, historically collected medical record data come from different medical institutions and differ in diagnosis and treatment means, which introduces extra bias into the estimation of the true potential outcome. The invention provides a multi-party mixed data tracing method and system based on potential group tool variables. Based on the heterogeneous treatment scheme allocation mechanisms, it divides the data into several potential sample groups with different covariate-intervention relationships, recovers a subgroup indicator from the mixed overall data using characterization learning and an expectation maximization algorithm, and embeds the subgroup indicator as a tool variable into a downstream prediction or recommendation task to predict the potential outcome function more accurately.

The technical scheme adopted by the invention is as follows:

In a first aspect, the present invention provides a multi-party mixed data tracing method based on potential group tool variables, which includes the following steps:

S1, acquiring a medical record data set to be traced and identified, sourced from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information;

S2, selecting a candidate value for the number of clusters from the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable and the corresponding condition information as the confusion variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning;

S3, fixing the expectation and the covariance matrix of the characterization space obtained in S2, and using an expectation maximization algorithm to identify the heterogeneous treatment scheme allocation mechanisms corresponding to the candidate number of clusters, wherein the heterogeneous treatment scheme allocation mechanisms represent the different causal relationships between the intervention variables and the confusion variables of different data sources, and each heterogeneous treatment scheme allocation mechanism corresponds to one diagnosis and treatment means;

S4, traversing all candidate values of the number of clusters in the value range of the cluster hyperparameter and executing S2 and S3 for each candidate value to obtain the corresponding heterogeneous treatment scheme allocation mechanisms; for each candidate value, dividing the samples in the medical record data set into as many sample subgroups as the candidate number of clusters; then, based on a correlation independence index, selecting the optimal number of clusters from all candidate values together with the potential group tool variable of each sample under the optimal number of clusters; and, taking the potential group tool variable as the source indication variable of the multi-party mixed medical record data, clustering all the medical record data into a plurality of groups, the medical record data in each group sharing the same diagnosis and treatment means, that is, belonging to the same heterogeneous treatment scheme allocation mechanism, so that the differences in diagnosis and treatment means among the medical institutions in the medical record data set are obtained by tracing.

Preferably, in S1, the patient signs in each piece of medical record data include weight, height, age, sex, nature of work and relevant examination results; the oral information is derived from the patient's description of his or her own condition and past medical history; and the treatment result is derived from the return visit.

As a preferred option of the first aspect, S2 specifically includes the following substeps:

S201, for each candidate number of clusters $K$ in the value range of the cluster hyperparameter, taking all observed patient signs and oral information as confusion variables and mapping them, by a characterization learning algorithm, through a mapping function into a characterization space whose dimensions are mutually independent, while the non-independent data and the multivariate complex interaction terms are jointly learned as a noise term:

wherein $X$ is the condition information in the medical record data, serving as the confusion variable; $T$ is the treatment scheme in the medical record data, serving as the intervention variable; $\epsilon_{TZ}$ is the error term representing unobserved patient signs and oral information or arising from measurement error;

the assignment function represents the heterogeneous treatment scheme allocation mechanism corresponding to the $K$ values of the potential group tool variable $Z \in \{1, \dots, K\}$; its input is the confusion variable $X$, and $K$ is the currently selected candidate number of clusters; $z$ is an instantiation of the potential group tool variable $Z$; $R$ is the characterization space finally obtained by learning; $R_j$ is the $j$-th component of the characterization space $R$, $j \in \{1, \dots, m_R\}$; $m_R$ is the total number of characterization dimensions; $\alpha_{zj}$ is the linear fitting coefficient of the corresponding characterization; $\beta_z$ is the noise term jointly learned from the non-independent data and the multivariate complex interaction terms; and $1_{[Z=z]}$ is a conditional function, equal to 1 when the true treatment scheme allocation mechanism between the sample data $X$ and $T$ is $Z = z$, and 0 otherwise;
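A form of the assignment model consistent with the symbols defined above can be sketched as follows. This is a reading under stated assumptions, not necessarily the patent's exact expression: the function name $f_Z$ and the additive placement of the error term are hypothetical.

```latex
T = f_Z(X) + \epsilon_{TZ}, \qquad
f_Z(X) = \sum_{z=1}^{K} 1_{[Z=z]} \left( \sum_{j=1}^{m_R} \alpha_{zj} R_j + \beta_z \right)
```

Under this reading, each group $z$ has its own linear map from the characterization $R$ to the treatment $T$, and the conditional function selects the map of the group to which a sample truly belongs.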

S202, calculating the expectation and the covariance of the characterization space based on the characterization space $R$ finally obtained by learning in S201:

wherein $r_i$ is the characterization vector of the $i$-th sample, $\sigma(R, R)$ is the covariance matrix, and $n$ is the total number of samples in the medical record data set;
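The computation in S202 can be illustrated with a short NumPy sketch. It assumes the usual sample mean and a covariance normalised by $1/n$ (the normalisation is an assumption, as is the function name `characterization_moments`).

```python
import numpy as np

def characterization_moments(R):
    """Return the sample mean and covariance of an (n, m_R) characterization matrix."""
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    mean_R = R.mean(axis=0)               # expectation of R, shape (m_R,)
    centered = R - mean_R
    cov_RR = centered.T @ centered / n    # sigma(R, R), shape (m_R, m_R); 1/n is an assumption
    return mean_R, cov_RR

# Three samples with a two-dimensional characterization.
R = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
mean_R, cov_RR = characterization_moments(R)
print(mean_R)
print(cov_RR)
```

Both quantities are computed once per characterization and are then held fixed in the later steps.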

S203, defining the likelihood function and the log-likelihood function of the complete data (in which $z$ comes from modeling the potential group tool variable) as follows:

wherein $t$ is an instantiation of the intervention variable $T$; $r$ is an instantiation of the characterization space $R$; $t_i, r_i, z_i$ are the $t$, $r$ and $z$ of the $i$-th sample; the likelihood is the joint probability distribution of $t, r, z$ given the distribution parameter $\theta$; $\pi_k$ is the probability that $t_i, r_i$ come from group $z_i = k$; the component density is the joint probability distribution of $t_i, r_i$ under the given distribution parameters $\mu_k, \Sigma_k$, which are the mean and the covariance matrix respectively; and the indicator $1_{[z_i = k]}$ is a conditional function equal to 1 when $z_i = k$ and 0 otherwise, $k \in \{1, \dots, K\}$.
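Reading the definitions in S203 as a Gaussian mixture over the concatenated vector $(t_i, r_i)$, the complete-data likelihood and log-likelihood can be sketched as below. The Gaussian form and the symbols $P$ and $\phi$ are assumptions made for illustration; they follow the textbook mixture model rather than a formula confirmed by the text.

```latex
P(t, r, z \mid \theta) = \prod_{i=1}^{n} \prod_{k=1}^{K}
  \left[ \pi_k \, \phi(t_i, r_i \mid \mu_k, \Sigma_k) \right]^{1_{[z_i = k]}}

\log P(t, r, z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} 1_{[z_i = k]}
  \left[ \log \pi_k + \log \phi(t_i, r_i \mid \mu_k, \Sigma_k) \right]
```

Because the group labels $z_i$ are not observed, this log-likelihood cannot be evaluated directly, which is why an expectation maximization algorithm is used.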

As a preferred option of the first aspect, S3 specifically includes the following substeps:

S301, initializing the heterogeneous data distribution $\theta$ with random numbers, where $K$ is the candidate number of clusters selected in S2;

S302, using the characterization space information obtained in S202, namely the expectation and the covariance matrix $\sigma(R, R)$ of the characterization space, to reinitialize the heterogeneous data distribution $\theta$ to $\theta^{(0)} = \{\pi^{(0)}, \mu^{(0)}, \Sigma^{(0)}\}$:

wherein the mean of $T$, the variance of $T$ and the covariance matrix of $T$ and $R$ are each initialized at random, and the covariance matrix of $T$ and $R$ also appears in transposed form;
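The initialization in S301 and S302 can be sketched as follows. The sketch assumes uniform initial weights and a block layout in which the $R$ blocks come from S202 while the $T$ entries are random; the function name `init_theta`, the random ranges and the scaling that keeps each covariance positive definite are all illustrative assumptions.

```python
import numpy as np

def init_theta(T, R, K, seed=0):
    """Build theta^(0) = (pi, mu, Sigma) for K components over the stacked vector [t, r]."""
    rng = np.random.default_rng(seed)
    T = np.asarray(T, dtype=float).reshape(-1)
    R = np.asarray(R, dtype=float)
    m = R.shape[1]
    mean_R = R.mean(axis=0)                                   # fixed block, from S202
    cov_RR = np.atleast_2d(np.cov(R.T, bias=True)) + 1e-8 * np.eye(m)
    L = np.linalg.cholesky(cov_RR)
    pi0 = np.full(K, 1.0 / K)                                 # uniform weights (assumption)
    mu0 = np.zeros((K, m + 1))
    Sigma0 = np.zeros((K, m + 1, m + 1))
    for k in range(K):
        mu_T = rng.normal(T.mean(), T.std() + 1e-8)           # random mean of T
        var_T = (T.var() + 1e-8) * rng.uniform(0.5, 1.5)      # random variance of T
        u = rng.normal(size=m)
        u /= np.linalg.norm(u)
        rho = rng.uniform(0.0, 0.5)
        # Random cov(T, R); the scaling keeps the Schur complement var_T * (1 - rho^2) positive.
        cov_TR = np.sqrt(var_T) * rho * (L @ u)
        mu0[k] = np.concatenate(([mu_T], mean_R))
        Sigma0[k, 0, 0] = var_T
        Sigma0[k, 0, 1:] = cov_TR
        Sigma0[k, 1:, 0] = cov_TR                             # transposed block
        Sigma0[k, 1:, 1:] = cov_RR
    return pi0, mu0, Sigma0

rng = np.random.default_rng(1)
R = rng.normal(size=(200, 3))
T = R @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=200)
pi0, mu0, Sigma0 = init_theta(T, R, K=3)
```

Drawing the cross-covariance through the Cholesky factor of $\sigma(R, R)$ is a design choice of this sketch: it guarantees that every initial covariance matrix is valid, which a fully unconstrained random draw would not.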

S303, in the $s$-th iteration, first performing the expectation step, that is, from the given observed data $\{T, R\}$ and the current estimate $\theta^{(s)}$ of the heterogeneous data distribution, computing the expectation of the log-likelihood function of the complete data:

wherein the expectation term is the conditional probability, with respect to $\theta^{(s)}$, that the $i$-th sample belongs to the $k$-th group:

wherein the mixing weight is the probability that a sample $t, r$ originates from group $z = i$ (here $i = 1, 2, \dots, K$), the conditional probabilities of the $K$ groups sum to 1, and the component density is the joint probability distribution of $T$ and $R$ under the given distribution parameters;
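Under a Gaussian mixture reading of S303, the expectation step amounts to computing, for every sample, the posterior probability of each group. The following is a minimal sketch under that assumption; the function names `log_gaussian` and `e_step` are hypothetical.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log-density of N(mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def e_step(T, R, pi, mu, Sigma):
    """Return gamma with gamma[i, k] = P(z_i = k | t_i, r_i, theta^(s))."""
    X = np.column_stack([T, R])                      # stack t_i and r_i per sample
    log_w = np.stack(
        [np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k]) for k in range(len(pi))],
        axis=1,
    )
    log_w -= log_w.max(axis=1, keepdims=True)        # stabilise before exponentiating
    gamma = np.exp(log_w)
    return gamma / gamma.sum(axis=1, keepdims=True)  # each row sums to 1

# Two well-separated groups with a scalar treatment and a 1-D characterization.
T = np.array([0.0, 0.1, 5.0, 5.1])
R = np.array([[0.0], [0.2], [5.0], [4.9]])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
Sigma = np.stack([np.eye(2), np.eye(2)])
gamma = e_step(T, R, pi, mu, Sigma)
```

Working in log space avoids underflow when a sample lies far from a component, which matters once the characterization has more than a few dimensions.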

S304, continuing with the maximization step in the $s$-th iteration, that is, from the given observed data $\{T, R\}$ and the current estimate $\theta^{(s)}$ of the heterogeneous data distribution, maximizing the expectation $Q(\theta, \theta^{(s)})$ of the log-likelihood function of the complete data and updating the estimate of the heterogeneous data distribution to $\theta^{(s+1)}$:

$\theta^{(s+1)} = \arg\max_{\theta} Q(\theta, \theta^{(s)})$

wherein the parameters of $\theta^{(s+1)}$ are solved as follows:

wherein the concatenated vector denotes the splicing of $T$ and $R$ along the feature dimension, and, for a matrix $M$, $M^2 = MM^T$;
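For a Gaussian mixture, the maximizer in S304 has the familiar closed form: weighted proportions, weighted means and weighted outer products of the concatenated vector. The sketch below implements that unconstrained textbook update as an assumption; it does not re-impose the fixed characterization blocks mentioned in S3, and the name `m_step` is hypothetical.

```python
import numpy as np

def m_step(T, R, gamma):
    """Closed-form Gaussian mixture update from posterior probabilities gamma (n, K)."""
    X = np.column_stack([T, R])                  # [t_i, r_i] spliced along the feature dimension
    n, d = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)                       # effective size of each group
    pi = Nk / n
    mu = (gamma.T @ X) / Nk[:, None]
    Sigma = np.zeros((K, d, d))
    for k in range(K):
        diff = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]   # weighted sum of M M^T terms
    return pi, mu, Sigma

T = np.array([0.0, 0.1, 5.0, 5.1])
R = np.array([[0.0], [0.2], [5.0], [4.9]])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pi, mu, Sigma = m_step(T, R, gamma)
```

With hard one-hot posteriors, as in this usage example, the update reduces to the per-group sample proportion, mean and covariance.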

S305, in the expectation maximization algorithm, iteratively executing the expectation step S303 and the maximization step S304 until convergence, finally obtaining the converged distribution solution $\theta^*$ corresponding to the current value of $K$.

$\theta^*$ characterizes the different causal relationships, and their corresponding distributions, that exist between the intervention variables and the confusion variables of different data sources.

As a preferred option of the first aspect, S4 specifically includes the following substeps:

S401, traversing all candidate values in the value range of the cluster hyperparameter $K$, executing S2 and S3 for each candidate value of $K$, and obtaining the converged distribution solution of the complete data $\theta^* = \{\pi^*, \mu^*, \Sigma^*\}$; based on the converged solution corresponding to each value of $K$, reconstructing the potential group tool variable of each medical record sample in the medical record data set:

where the subscript $i$ denotes the parameter corresponding to the $i$-th sample, $i = 1, 2, \dots, n$;

S402, for all values of the cluster hyperparameter $K$, using the correlation independence index MMD as the screening index and selecting the value of $K$ that minimizes the MMD as the optimal number of clusters:

$K^* = \arg\min_{K} \mathrm{MMD}_K(Z, R), \quad K \in \{1, 2, \dots, 10\}$

wherein the subgroup mean represents the mean of the characterizations $R$ of all samples in the $k$-th sample subgroup, and $K^*$ is the optimal number of clusters;

S403, taking the potential group tool variable $z_i$ of each sample under the optimal number of clusters $K^*$ as the optimal potential group tool variable $Z^*$, and at the same time taking $z_i$ as the source indication variable by which the medical record data are clustered into groups; the medical record data in each group have the same source indication variable, which indicates that the records in the same group use the same diagnosis and treatment means, that is, belong to the same heterogeneous treatment scheme allocation mechanism, thereby tracing the diagnosis and treatment means category of each piece of medical record data.

Preferably, the characterization learning algorithm employs a variational auto-encoder, principal component analysis, correlation-minimization characterization learning, or prior-knowledge-based characterization.
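One natural way to reconstruct the potential group tool variable in S401 is to assign each sample to the group with the largest posterior probability under the converged solution. This hard-assignment rule is an assumption of the sketch, as is the function name `reconstruct_group_variable`.

```python
import numpy as np

def reconstruct_group_variable(gamma):
    """Assign each sample to its most probable group, using 1-based labels as in Z in {1, ..., K}."""
    return np.argmax(gamma, axis=1) + 1

# Posterior probabilities of three samples over K = 3 groups.
gamma = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.8, 0.1]])
z = reconstruct_group_variable(gamma)
print(z)  # [1 3 2]
```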
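The screening in S402 can be illustrated with a linear-kernel version of the MMD, which compares subgroups only through their mean characterizations. This particular form (the average squared distance between subgroup means over all pairs) is an assumption, and the names `linear_mmd` and `select_num_clusters` are hypothetical.

```python
import numpy as np
from itertools import combinations

def linear_mmd(z, R):
    """Average squared distance between subgroup means of R over all pairs of subgroups."""
    z = np.asarray(z)
    R = np.asarray(R, dtype=float)
    means = [R[z == k].mean(axis=0) for k in np.unique(z)]
    if len(means) < 2:
        return 0.0                                # a single subgroup has nothing to compare
    return float(np.mean([np.sum((a - b) ** 2) for a, b in combinations(means, 2)]))

def select_num_clusters(assignments, R):
    """assignments maps each candidate K to its label vector z; return the K with the smallest index."""
    scores = {K: linear_mmd(z, R) for K, z in assignments.items()}
    return min(scores, key=scores.get), scores

R = np.array([[0.0], [1.0], [10.0], [11.0]])
assignments = {2: np.array([1, 1, 2, 2]), 3: np.array([1, 2, 3, 1])}
best_K, scores = select_num_clusters(assignments, R)
print(best_K, scores)
```

A small index means the subgroup label carries little information about the characterization, which is the independence a tool variable is expected to have with respect to the confusion variables. Under this reading a single group scores zero trivially, so in practice the comparison is only informative for two or more groups.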

In a second aspect, the present invention provides a multi-party mixed data tracing system based on potential group tool variables, comprising:

a data set acquisition module, used for acquiring a medical record data set to be traced and identified, sourced from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information;

a characterization module, used for selecting a candidate value for the number of clusters from the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable and the corresponding condition information as the confusion variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning;

an expectation maximization algorithm module, used for fixing the expectation and the covariance matrix of the characterization space obtained in the characterization module, and using an expectation maximization algorithm to identify the heterogeneous treatment scheme allocation mechanisms corresponding to the candidate number of clusters, wherein the heterogeneous treatment scheme allocation mechanisms represent the different causal relationships between the intervention variables and the confusion variables of different data sources, and each heterogeneous treatment scheme allocation mechanism corresponds to one diagnosis and treatment means;

a grouping tracing module, used for traversing all candidate values of the number of clusters in the value range of the cluster hyperparameter and running the characterization module and the expectation maximization algorithm module for each candidate value to obtain the corresponding heterogeneous treatment scheme allocation mechanisms; for each candidate value, dividing the samples in the medical record data set into as many sample subgroups as the candidate number of clusters; then, based on a correlation independence index, selecting the optimal number of clusters from all candidate values together with the potential group tool variable of each sample under the optimal number of clusters; and, taking the potential group tool variable as the source indication variable of the multi-party mixed medical record data, clustering all the medical record data into a plurality of groups, the medical record data in each group sharing the same diagnosis and treatment means, that is, belonging to the same heterogeneous treatment scheme allocation mechanism, so that the differences in diagnosis and treatment means among the medical institutions in the medical record data set are obtained by tracing.

In a third aspect, the present invention provides an accurate treatment scheme recommendation system, which includes:

a joint learning module, used for obtaining the potential group tool variable of each sample under the optimal number of clusters by the multi-party mixed data tracing method according to any option of the first aspect, embedding the obtained potential group tool variables in a tool variable regression method, and performing joint learning in combination with multi-party knowledge to obtain a counterfactual prediction function;

and a treatment scheme recommendation module, used for inputting the condition information of a target case, as the confusion variable, into the counterfactual prediction function to obtain the predicted treatment result that the target case can achieve under each diagnosis and treatment means, as a reference for selecting the diagnosis and treatment means.

As a preferred option of the third aspect, when the counterfactual prediction function is learned, each learning sample supplies the condition information of the medical record, the treatment scheme given by the medical institution, the optimal potential group tool variable and the treatment result after treatment according to the treatment scheme, so that the learned counterfactual prediction function can predict the treatment results under different diagnosis and treatment means from the condition information.

As a preferred option of the third aspect, the tool variable regression method includes a two-stage least squares regression method, a polynomial-based two-stage least squares regression method, a kernel-based two-stage least squares regression method, a least squares regression method based on a deep learning algorithm, and a two-stage least squares regression method based on adversarial moment conditions.
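To illustrate how a recovered group label can serve as a tool variable, the sketch below runs plain two-stage least squares on simulated data with an unobserved confounder. Everything here is an illustrative assumption: the data-generating coefficients, the zero-based group labels, the omission of observed covariates and the function name `two_stage_least_squares`.

```python
import numpy as np

def two_stage_least_squares(T, Y, Z, K):
    """2SLS with one-hot group indicators as instruments; returns (intercept, effect of T on Y)."""
    T = np.asarray(T, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Z_onehot = np.eye(K)[np.asarray(Z)]            # (n, K); columns sum to 1, so no extra intercept
    # Stage 1: regress T on the group indicators (the fitted value is the group mean of T).
    coef1, *_ = np.linalg.lstsq(Z_onehot, T, rcond=None)
    T_hat = Z_onehot @ coef1
    # Stage 2: regress Y on the fitted treatment.
    design = np.column_stack([np.ones_like(T_hat), T_hat])
    coef2, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef2[0], coef2[1]

rng = np.random.default_rng(0)
n, K = 20000, 3
Z = rng.integers(0, K, size=n)                     # group labels 0..K-1 (zero-based in this sketch)
U = rng.normal(size=n)                             # unobserved confounder
T = np.array([0.0, 2.0, 4.0])[Z] + U + 0.5 * rng.normal(size=n)
Y = 1.5 * T + 2.0 * U + 0.5 * rng.normal(size=n)   # true effect of T on Y is 1.5

_, beta_iv = two_stage_least_squares(T, Y, Z, K)
beta_ols = np.polyfit(T, Y, 1)[0]                  # naive regression, biased by U
print(beta_iv, beta_ols)
```

Because the group label shifts the treatment but is independent of the confounder in this simulation, the second stage recovers an effect close to the true value, whereas the naive regression is pulled upward by the confounder.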

The traditional tool variable method usually depends on tool variables selected using human expert knowledge. When that prior knowledge is deficient, the chosen variables may not strictly satisfy the conditions that a tool variable must meet in the algorithm, so the final estimate is not credible. Compared with the prior art, the invention targets the tracing and identification of diagnosis and treatment means in the large data sets shared by multiple medical institutions in medical scenarios. It recovers potential group tool variables from heterogeneous treatment/intervention data and infers the potential outcome function from them, so that the potential outcome function can be estimated free of confounding effects even when unobserved confusion variables are present, without relying on human expert knowledge to designate tool variables or to generate a candidate set of tool variables. The method identifies the differences in diagnosis and treatment means among medical institutions, so the medical record data can be clustered and grouped by diagnosis and treatment means, which realizes the tracing of the medical record data. Based on the identified differences, the invention can further perform joint learning in combination with multi-party knowledge, which facilitates recommending the optimal treatment scheme and helps provide accurate treatment for each patient.

Drawings

FIG. 1 is a flow diagram of the multi-party mixed data tracing method based on potential group tool variables.

FIG. 2 is a block diagram of the multi-party mixed data tracing system based on potential group tool variables.

FIG. 3 is a block diagram of the accurate treatment scheme recommendation system.

FIG. 4 is a schematic diagram of the heterogeneous treatment/intervention data in an embodiment.

FIG. 5 shows the multi-party mixed data tracing identification accuracy and its visualization in an embodiment.

Detailed Description

The invention will be further elucidated and described with reference to the drawings and the detailed description.

As shown in FIG. 1, a preferred embodiment of the present invention provides a multi-party mixed data tracing method based on potential group tool variables. For a large data set shared by multiple medical institutions in a medical scenario, the method traces and identifies the source of each piece of medical record data, identifies the differences in diagnosis and treatment means among the medical institutions, and performs joint learning in combination with multi-party knowledge, so as to help provide accurate treatment for each patient. The multi-party mixed data tracing method specifically comprises the following steps:

S1, acquiring a medical record data set to be traced and identified, sourced from a plurality of medical institutions, wherein each piece of medical record data comprises condition information, a treatment scheme given by the medical institution and a treatment result after treatment according to the treatment scheme, and the condition information comprises patient signs and oral information.

In this embodiment, the patient signs in each piece of medical record data include weight, height, age, sex, nature of work and relevant examination results; the oral information is derived from the patient's description of his or her own condition and past medical history; and the treatment result is derived from the return visit.

S2, selecting a candidate value for the number of clusters from the value range of the cluster hyperparameter, taking the treatment scheme given by the medical institution as the intervention variable (also called the treatment/intervention variable) and the corresponding condition information as the confusion variable, and mapping the condition information observed in the medical record data set to a characterization space through characterization learning.

In this embodiment, step S2 specifically includes the following substeps:

S201, for each candidate number of clusters $K$ in the value range of the cluster hyperparameter, taking all observed patient signs and oral information as confusion variables and mapping them, by a characterization learning algorithm, through a mapping function into a characterization space whose dimensions are mutually independent, while the non-independent data and the multivariate complex interaction terms are jointly learned as a noise term:

wherein the assignment function represents the heterogeneous treatment scheme allocation mechanism corresponding to the $K$ values of the potential group tool variable $Z \in \{1, \dots, K\}$; its input is the confusion variable $X$, and $K$ is the currently selected candidate number of clusters; $z$ is an instantiation of the potential group tool variable $Z$; $h_{zj}(X)$ is the polynomial characterization learning function of the $j$-th variable; $\xi_{zj,1}$ is a linear fitting coefficient of the corresponding polynomial; $x_j$ is the $j$-th dimension of $X$, and its superscript denotes the power, the maximum power being tuned according to the fitting result; $R$ is the characterization space finally obtained by learning; $R_j$ is the $j$-th component of the characterization space $R$, $j \in \{1, \dots, m_R\}$; $m_R$ is the total number of characterization dimensions; $\alpha_{zj}$ is the linear fitting coefficient of the corresponding characterization; $\beta_z$ is the noise term jointly learned from the non-independent data and the multivariate complex interaction terms; and $1_{[Z=z]}$ is a conditional function, equal to 1 when the true treatment scheme allocation mechanism between the sample data $X$ and $T$ is $Z = z$, and 0 otherwise.

In this embodiment, the characterization may be learned in advance using a characterization learning algorithm such as a variational auto-encoder, principal component analysis, correlation-minimization characterization learning, or prior-knowledge-based characterization.

S202, calculating the expectation and the covariance of the characterization space based on the characterization space $R$ finally obtained by learning in S201:

wherein $r_i$ is the characterization vector of the $i$-th sample, $\sigma(R, R)$ is the covariance matrix, and $n$ is the total number of samples in the medical record data set.

S203, defining the likelihood function and the log-likelihood function of the complete data (in which $z$ comes from modeling the potential group tool variable):

wherein the component density is the joint probability distribution of $t_i, r_i$ under the given distribution parameters $\mu_k, \Sigma_k$, which are the mean and the covariance matrix respectively; and the indicator $1_{[z_i = k]}$ is a conditional function equal to 1 when $z_i = k$ and 0 otherwise, $k \in \{1, \dots, K\}$.

It should be noted that, in the present invention, medical record data carrying a source medical institution label are complete data, and medical record data without a source label are incomplete data. The medical record data set consists of incomplete data, so its complete data must be obtained through subsequent modeling. S203 only defines the formulas of the likelihood function and the log-likelihood function of the complete data; because the medical record data set is incomplete at this point, these functions cannot be evaluated directly and must be obtained by the subsequent solution.

S3, fixing the expectation and the covariance matrix of the characterization space obtained in S2, and using an expectation maximization algorithm to identify the heterogeneous treatment scheme allocation mechanisms corresponding to the candidate number of clusters, wherein the heterogeneous treatment scheme allocation mechanisms represent the different causal relationships between the intervention variables and the confusion variables of different data sources, and each heterogeneous treatment scheme allocation mechanism corresponds to one diagnosis and treatment means.

It should be noted that the diagnosis and treatment means of a medical institution is closely related to its treatment concepts, specialist skills, medical instruments and equipment and the like, so medical institutions with different diagnosis and treatment means give different treatment schemes for the same condition. For example, for a given condition, medical institution A typically performs a blood test and determines the treatment scheme from the test results, medical institution B typically determines the treatment scheme from subjective clinical experience, and medical institution C typically performs an examination with ultrasound or radiation equipment and determines the treatment scheme from the examination results. The three diagnosis and treatment means correspond to three heterogeneous treatment scheme allocation mechanisms. That is, when a medical institution is presented with condition information (the confusion variable), it gives a treatment scheme (the intervention variable) according to its own diagnosis and treatment means, so different causal relationships exist between the intervention variables and the confusion variables of different medical institutions.
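The three-institution example can be made concrete with a small simulation in which each hypothetical institution assigns treatment through its own linear rule on the same two-dimensional characterization of the condition information. All coefficients, the noise level and the function name `simulate_mixed_records` are made up for illustration.

```python
import numpy as np

def simulate_mixed_records(n=3000, seed=0):
    """Simulate mixed records whose treatment rule depends on a hidden source label Z."""
    rng = np.random.default_rng(seed)
    alpha = np.array([[2.0, 0.0],      # institution A relies on the first feature
                      [0.0, 2.0],      # institution B relies on the second feature
                      [-1.0, -1.0]])   # institution C uses both, with opposite sign
    beta = np.array([0.0, 3.0, -3.0])  # institution-specific offsets
    Z = rng.integers(0, 3, size=n)     # hidden source label (zero-based in this sketch)
    R = rng.normal(size=(n, 2))        # characterization of the condition information
    noise = 0.3 * rng.normal(size=n)
    T = np.einsum("ij,ij->i", alpha[Z], R) + beta[Z] + noise
    return Z, R, T

Z, R, T = simulate_mixed_records()
# With the labels known, a per-group regression recovers each institution's rule.
for z in range(3):
    mask = Z == z
    design = np.column_stack([R[mask], np.ones(mask.sum())])
    coef, *_ = np.linalg.lstsq(design, T[mask], rcond=None)
    print(z, np.round(coef, 2))
```

In the setting the method addresses, the label `Z` is missing from the records, so the per-group rules cannot be fitted directly and the groups have to be recovered first.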

In this embodiment, step S3 specifically includes the following substeps:

S301, initializing the heterogeneous data distribution $\theta$ with random numbers, where $K$ is the candidate number of clusters selected in S2.

S302, using the characterization space information obtained in S202, namely the expectation and the covariance matrix $\sigma(R, R)$ of the characterization space, to reinitialize the heterogeneous data distribution $\theta$ to $\theta^{(0)} = \{\pi^{(0)}, \mu^{(0)}, \Sigma^{(0)}\}$:

wherein the covariance matrix of $T$ and $R$ also appears in transposed form.

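S303, in the $s$-th iteration, first performing the expectation step, that is, from the given observed data $\{T, R\}$ and the current estimate $\theta^{(s)}$ of the heterogeneous data distribution, computing the expectation of the log-likelihood function of the complete data, in which the expectation term is the conditional probability, with respect to $\theta^{(s)}$, that the $i$-th sample belongs to the $k$-th group:

wherein the mixing weight is the probability that a sample $t, r$ originates from group $z = i$ (in this case $i = 1, 2, \dots, K$), the conditional probabilities of the $K$ groups sum to 1, and the component density is the joint probability distribution of $T$ and $R$ under the given distribution parameters.

S304, continuing with the maximization step in the $s$-th iteration, that is, maximizing the expectation $Q(\theta, \theta^{(s)})$ of the log-likelihood function of the complete data and updating the estimate of the heterogeneous data distribution to $\theta^{(s+1)}$:

$\theta^{(s+1)} = \arg\max_{\theta} Q(\theta, \theta^{(s)})$

wherein the parameters of $\theta^{(s+1)}$ are solved as follows:

wherein the concatenated vector denotes the splicing of $T$ and $R$ along the feature dimension, and, for a matrix $M$, $M^2 = MM^T$.

S305, iteratively executing the expectation step S303 and the maximization step S304 until convergence, finally obtaining the converged distribution solution $\theta^*$ corresponding to the current value of $K$.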

S4, traversing all cluster quantity candidate values in the cluster over-parameter value range, and respectively executing S2 and S3 on each cluster quantity candidate value to obtain a heterogeneous treatment scheme distribution mechanism corresponding to each cluster quantity candidate value; according to the value to be selected of the number of each cluster, samples in the medical record data set are divided into sample subgroups the number of which is the same as that of the value to be selected of the number of the clusters, then an optimal number of the clusters and potential group tool variables corresponding to the samples under the optimal number of the clusters are selected from all the values to be selected of the number of the clusters based on the correlation independent indexes, the potential group tool variables are used as different source indication variables in the multi-party mixed medical record data, all the medical record data in the multi-party mixed medical record data are clustered and divided into a plurality of groups, the medical record data in each group have the same diagnosis and treatment means, namely belong to the same heterogeneous treatment scheme distribution mechanism, and therefore the difference of the diagnosis and treatment means of different medical institutions in the medical record data set is obtained from the source.

In this embodiment, the step S4 specifically includes the following sub-steps:

S401: Traverse all candidate values within the value range of the cluster hyperparameter K, execute S2 and S3 for each candidate K, and obtain the converged distribution solution of the complete data θ^* = {π^*, μ^*, Σ^*}; based on the converged solution corresponding to each K, reconstruct the potential group tool variable of each medical record data sample in the medical record data set:

S402: Over all values of the cluster hyperparameter K, use the correlation independence index MMD (maximum mean discrepancy) as the screening index, and select the K that minimizes the MMD as the optimal cluster number:

K^* = argmin_K MMD_K(Z, R), K ∈ {1, 2, …, 10}

where the mean term in MMD_K is the mean of the characterizations R of all samples in the k-th sample subgroup, and K^* is the optimal cluster number;

S403: Take the potential group tool variable z_i of each sample under the optimal cluster number K^* as the optimal potential group tool variable Z^*, and use z_i as the source indication variable of that sample. The medical record data in each group share the same source indication variable, which indicates that the medical record data in the same group adopt the same diagnosis and treatment means, i.e., belong to the same heterogeneous treatment scheme allocation mechanism. The category of diagnosis and treatment means of each medical record is thereby traced.
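The selection in S402 and S403 can be illustrated with a short sketch. The text above does not reproduce the exact expression of MMD_K, so the form used here is an assumption: a linear-kernel discrepancy equal to the average squared distance between each subgroup's mean characterization and the overall mean. The names `mmd_linear` and `select_k` are hypothetical.

```python
import numpy as np

def mmd_linear(z, R):
    """Assumed MMD form: mean squared distance between each subgroup mean
    of R and the overall mean of R (small when Z carries no information on R)."""
    overall = R.mean(axis=0)
    groups = np.unique(z)
    return float(np.mean([np.sum((R[z == k].mean(axis=0) - overall) ** 2)
                          for k in groups]))

def select_k(assignments, R):
    """assignments maps each candidate K to the array of group labels z
    reconstructed for that K; returns the K with the smallest MMD and all scores."""
    scores = {K: mmd_linear(z, R) for K, z in assignments.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

Under this assumed form a single cluster trivially scores zero, so the sketch is only informative when comparing candidates with K ≥ 2.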

It should be noted that, by reconstructing the potential group tool variable, the tracing method of S1 to S4 can identify the heterogeneous treatment scheme allocation mechanisms (the diagnosis and treatment means of the corresponding medical institutions) in a medical record data set without source labels. That is, it identifies the different causal relationships between intervention variables and confusion variables in the medical record data, so that the medical record data can be classified by category of diagnosis and treatment means. Medical record data of the same category can be regarded as one sample subgroup whose samples share the same diagnosis and treatment means, i.e., a consistent heterogeneous treatment scheme allocation mechanism. The category of diagnosis and treatment means to which each medical record belongs can thus be determined, which realizes the tracing.

In addition, if the heterogeneous treatment scheme allocation mechanisms of the different medical institutions in the medical record data set differ from one another, the classification categories can also be mapped directly to the medical institutions, that is, the medical record data samples in each sample subgroup all originate from the same medical institution. If some of the medical record data samples in a sample subgroup carry medical institution source labels (complete data) while the others lack them (incomplete data), the missing source labels of the incomplete data can be filled in from the source labels of the complete data in the same sample subgroup, thereby tracing the institution of origin of the medical record data.
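The label completion described above can be sketched as follows. The text only states that missing labels are supplemented from the labelled samples of the same subgroup; taking the most frequent label in the subgroup is an assumption made here, and `fill_source_labels` is a hypothetical helper.

```python
from collections import Counter

def fill_source_labels(groups, labels):
    """groups: subgroup id of each sample; labels: institution label or None.
    Missing labels are filled with the most frequent label of the same subgroup
    (assumed rule); subgroups without any labelled sample stay unlabelled."""
    counts = {}
    for g, lab in zip(groups, labels):
        if lab is not None:
            counts.setdefault(g, Counter())[lab] += 1
    filled = []
    for g, lab in zip(groups, labels):
        if lab is None and g in counts:
            lab = counts[g].most_common(1)[0][0]
        filled.append(lab)
    return filled
```

For example, `fill_source_labels([0, 0, 1], ["A", None, None])` labels the second sample "A" and leaves the third unlabelled, because its subgroup contains no labelled sample.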

Similarly, based on the same inventive concept, as shown in fig. 2, another preferred embodiment of the present invention further provides a multi-party hybrid data traceability system based on potential group tool variables, which corresponds to the multi-party hybrid data traceability method based on potential group tool variables provided in the foregoing embodiment, and includes:

the grouping and tracing module, configured to traverse all candidate values of the cluster number within the value range of the cluster hyperparameter, and to run the characterization module and the expectation maximization algorithm module for each candidate value to obtain the heterogeneous treatment scheme allocation mechanism corresponding to that candidate value; for each candidate value, the samples in the medical record data set are divided into as many sample subgroups as the candidate cluster number; then, based on the correlation independence index, the optimal cluster number and the potential group tool variable of each sample under the optimal cluster number are selected from all candidate values; the potential group tool variables serve as source indication variables in the multi-party mixed medical record data, all medical record data are clustered into several groups, and the medical record data in each group share the same diagnosis and treatment means, i.e., belong to the same heterogeneous treatment scheme allocation mechanism, so that the differences in diagnosis and treatment means among the different medical institutions in the medical record data set are traced to their source.

Since the multi-party mixed data tracing system based on potential group tool variables in this embodiment solves the problem on a principle similar to that of the tracing method described above, the specific implementation of each module of the system may refer to the specific implementation of the method described in S1 to S4, and the details are not repeated here.

In addition, in another embodiment of the present invention, based on the multi-party mixed data tracing and identifying method shown in S1 to S4, the accurate treatment plan recommendation can be further implemented through the step S5, which specifically includes:

firstly, acquiring potential cluster tool variables corresponding to each sample under the optimal cluster number obtained by the multi-party mixed data tracing identification method in the embodiments S1 to S4, embedding the acquired potential cluster tool variables into a tool variable regression method, and performing joint learning by combining multi-party knowledge to obtain a counterfactual prediction function;

then, the condition information of the target case is input into the counterfactual prediction function as the confusion variable, and the predicted treatment result value that the target case can achieve under each diagnosis and treatment means is obtained as a reference for selecting the diagnosis and treatment means.
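The two parts of S5 can be sketched with the simplest of the tool variable (instrumental variable) regression methods, linear two-stage least squares, using the one-hot group variable as the instrument. This is a sketch under the assumption of linear treatment and outcome models; `one_hot`, `fit_2sls` and `predict_outcome` are hypothetical names, and the kernel-based or deep variants would replace both stages.

```python
import numpy as np

def one_hot(z, K):
    """Encode integer group labels z in {0, ..., K-1} as one-hot rows."""
    return np.eye(K)[z]

def fit_2sls(X, T, Y, Z_onehot):
    """Linear two-stage least squares with the group variable as instrument."""
    n = len(T)
    # Stage 1: regress the treatment T on the instrument and the confounders X
    # (the one-hot columns already span the intercept).
    A1 = np.hstack([Z_onehot, X])
    w1, *_ = np.linalg.lstsq(A1, T, rcond=None)
    T_hat = A1 @ w1
    # Stage 2: regress the outcome Y on the predicted treatment, X and an intercept.
    A2 = np.hstack([T_hat[:, None], X, np.ones((n, 1))])
    w2, *_ = np.linalg.lstsq(A2, Y, rcond=None)
    return w2  # [treatment effect, confounder coefficients..., intercept]

def predict_outcome(w2, x, t):
    """Counterfactual prediction for one case with confounders x under treatment t."""
    return float(w2[0] * t + x @ w2[1:-1] + w2[-1])
```

Evaluating `predict_outcome` for the same `x` over each candidate treatment value gives the per-treatment predicted outcomes that the text describes as a reference for the choice of treatment.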

In addition, in another embodiment of the present invention, as shown in fig. 3, based on the same inventive concept as the above-mentioned precise treatment plan recommendation, there is also provided a precise treatment plan recommendation system, which includes:

the joint learning module is configured to obtain potential cluster tool variables corresponding to each sample in the optimal cluster number obtained by the multi-party mixed data traceability identification method in the foregoing embodiment, embed the obtained potential cluster tool variables into a tool variable regression method, and perform joint learning by combining multi-party knowledge to obtain a counterfactual prediction function;

In the above accurate treatment scheme recommendation method and system, when joint learning is performed with multi-party knowledge, each learning sample takes as input the condition information of the medical record, the treatment scheme given by the medical institution, the optimal potential group tool variable, and the treatment result obtained after treatment under that scheme. Learning thus yields a counterfactual prediction function that can predict, from the condition information, the treatment results under different diagnosis and treatment means.

In the above accurate treatment scheme recommendation method and system, the tool variable regression method used includes a two-stage least squares regression method, a polynomial-based two-stage least squares regression method, a kernel-based two-stage least squares regression method, a deep-learning-based least squares regression method, a two-stage least squares regression method based on adversarial moment conditions, and the like.

It should be noted that the above accurate treatment scheme recommendation method and system only provide the predicted treatment result value of the target case under each diagnosis and treatment means; which diagnosis and treatment means is actually adopted is decided by the patient or the doctor. The accurate treatment scheme recommendation method and system can be applied to assisted medical care, and also to non-medical fields such as scientific research.

In addition, in the above embodiments, the modules correspond to program modules executed in sequence, so that what is executed is essentially a data processing flow. As will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiments, and details are not described here again. In the embodiments provided in the present application, the division of the method and system into steps or modules is only a division by logical function; other divisions are possible in an actual implementation. For example, several modules or steps may be combined or integrated, and one module or step may be split.

In the following, the present invention will show the application effect of the multi-party mixed data tracing method based on potential cluster tool variables and the precise treatment recommendation method in the above embodiments on specific data sets by a specific example, so as to facilitate understanding of the essence of the present invention.

Examples

This example uses the Infant Health and Development Program (IHDP) data set and the data set on the effect of PM2.5 concentration on cardiovascular mortality (PM-CMR). As shown in fig. 4, taking the PM2.5 concentration in the PM-CMR data set as an example, the heterogeneous treatment/intervention data contain a potential group tool variable; the potential group tool variable of IHDP is similar.

The Infant Health and Development Program (IHDP) data set contains 747 twins samples, each with 6 continuous and 19 discrete pre-treatment/pre-intervention variables related to the infant and its mother; it was collected to study the effect of early specialist home visits on the future mental development of infants. In this example, the data set is divided into a training set, a validation set, and a test set in a 63%/27%/10% ratio. As in previous work, assumptions are made about the potential outcome function, and a semi-synthetic data set corresponding to fig. 4 is then generated for this example.

The PM-CMR data set on the effect of PM2.5 concentration on cardiovascular disease mortality contains data for 2132 cities, each with 6 continuous pre-treatment/pre-intervention variables related to cardiovascular disease; it was collected to study the effect of PM2.5 concentration on cardiovascular disease mortality. In this example, the data set is divided into a training set, a validation set, and a test set in a 63%/27%/10% ratio. As in previous work, assumptions are made about the potential outcome function, and a semi-synthetic data set corresponding to fig. 4 is then generated for this example.
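The 63%/27%/10% split used for both data sets can be sketched as below. The shuffling, the seed, and the rounding rule (the test set receives the remainder) are assumptions, and `split_indices` is a hypothetical helper.

```python
import numpy as np

def split_indices(n, fractions=(0.63, 0.27, 0.10), seed=0):
    """Shuffle n sample indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    # The test set takes whatever remains, so the three parts always cover all samples.
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]
```

With this rounding rule, 747 samples give 471/202/74 and 2132 samples give 1343/576/213.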

To evaluate the performance of the algorithm objectively, both data sets were randomly shuffled and the models retrained 10 times. LatGIV_EM was embedded in 9 different downstream tool variable regression methods, and the mean and standard deviation (mean (std)) of the MSE of the potential outcome function fit over the 10 experiments were computed.

For the two data sets, the accuracy of multi-party mixed data source tracing and the corresponding visualization are shown in fig. 5, where LatGIV_EM is the tracing method provided in the foregoing embodiments of the present invention, and LatGIV_KM is the method that directly uses the clusters obtained by K-Means clustering as potential tool variables. The recovery accuracy of the method of the present invention reaches about 80%, whereas direct K-Means clustering achieves an accuracy below 60% and cannot guarantee even 30% accuracy with 5 data sources.
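Because cluster indices are arbitrary, a recovery accuracy of this kind has to be computed up to a relabelling of the groups. The text does not state how the accuracy in fig. 5 was computed, so the following is an assumed definition: the best agreement with the true source labels over all permutations of the predicted labels, which is feasible for the small numbers of data sources considered here. `recovery_accuracy` is a hypothetical name.

```python
from itertools import permutations
import numpy as np

def recovery_accuracy(z_true, z_pred, K):
    """Best fraction of matching labels over all relabellings of z_pred.
    Both label arrays must take integer values in {0, ..., K-1}."""
    z_true = np.asarray(z_true)
    z_pred = np.asarray(z_pred)
    best = 0.0
    for perm in permutations(range(K)):
        mapped = np.array(perm)[z_pred]  # relabel predicted group k as perm[k]
        best = max(best, float(np.mean(mapped == z_true)))
    return best
```

The exhaustive search costs K! evaluations; for larger K an assignment solver would be the usual replacement.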

Furthermore, on the basis of the multi-party mixed data tracing method provided by the embodiments of the invention, the recommendation accuracy of the accurate treatment scheme recommendation method was tested. Using the optimal group tool variables obtained by tracing, the method combines multi-party knowledge to learn a counterfactual prediction function, uses it to predict the treatment result that each case can achieve under each diagnosis and treatment means, and selects the optimal treatment scheme according to the predicted values, thereby providing an accurate treatment scheme recommendation for each patient. Specific results are shown in Table 1:

Table 1. MSE of LatGIV_EM (mean (std))

In the rows of the table, from top to bottom: NoneIV means that no tool variable is used; the potential tool variables of UAS come from reference [1]; those of WAS come from reference [2]; those of ModeIV come from reference [3]; those of AutoIV come from reference [4]; those of LatGIV_KM are obtained with the K-Means algorithm; those of LatGIV_EM are obtained with the method proposed in the foregoing embodiments of the present invention; and TrueIV refers to the group tool variables known in advance.

In the columns of the table, from left to right: Poly2SLS is the most classical two-stage tool variable regression method for predicting potential outcomes (i.e., for accurate treatment scheme recommendation), KernelIV is from reference [5], DeepIV is from reference [6], and DeepGMM is from reference [7].

The above references are specifically as follows:

[1] Neil M Davies, Stephanie von Hinke Kessler Scholder, Helmut Farbmacher, Stephen Burgess, Frank Windmeijer, and George Davey Smith. 2015. The many weak instruments problem and Mendelian randomization. Statistics in Medicine 34, 3 (2015), 454–468.

[2] Stephen Burgess, Frank Dudbridge, and Simon G Thompson. 2016. Combining information on multiple instrumental variables in Mendelian randomization: comparison of allele score and summarized data methods. Statistics in Medicine 35, 11 (2016), 1880–1906.

[3] Jason S Hartford, Victor Veitch, Dhanya Sridhar, and Kevin Leyton-Brown. 2021. Valid causal inference with (some) invalid instruments. In International Conference on Machine Learning. PMLR, 4096–4106.

[4] Junkun Yuan, Anpeng Wu, Kun Kuang, Bo Li, Runze Wu, Fei Wu, and Lanfen Lin. 2022. Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition. ACM Transactions on Knowledge Discovery from Data (TKDD) 16, 4 (2022), 1–20.

[5] Rahul Singh, Maneesh Sahani, and Arthur Gretton. 2019. Kernel instrumental variable regression. In NeurIPS 2019. 4593–4605.

[6] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. 2017. DeepIV: A flexible approach for counterfactual prediction. In ICML 2017.

[7] Andrew Bennett, Nathan Kallus, and Tobias Schnabel. 2019. Deep generalized method of moments for instrumental variable analysis. In NeurIPS 2019.

Therefore, compared with the estimation methods of the prior art, the method of the present invention achieves better recommendation accuracy.

The above-mentioned embodiments are merely two preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims

1. A multi-party mixed data tracing method based on potential cluster tool variables is characterized by comprising the following steps:

2. The multi-party mixed data tracing method based on potential group tool variables of claim 1, wherein in each medical record data in S1, the patient signs include weight, height, age, sex, working property and related examination results, the oral information is derived from the description of the patient about the self disease state and past medical history, and the treatment results are derived from the return visit results.

3. The multi-party hybrid data tracing method based on potential cluster tool variables according to claim 1, wherein said S2 comprises the following sub-steps:

The input of the method is the confusion variable X; K is the currently selected candidate value of the cluster number; z is an instantiation of the potential group tool variable Z; R is the characterization space finally learned, R_j denotes the j-th component of the data characterization space R, j ∈ {1, …, m_R}, and m_R is the total characterization dimension; α_zj is the linear fitting coefficient of the corresponding characterization; β_z is the noise term jointly learned from non-independent data and multivariate complex interaction terms; 1_[Z=z] is an indicator function, equal to 1 when the true treatment scheme allocation mechanism between the sample data X and T is Z = z, and 0 otherwise;

s203, defining the likelihood function and the log likelihood function of the complete data as follows:

is the joint probability distribution of t_i, r_i under the given distribution parameters μ_k, Σ_k, where μ_k and Σ_k are the mean and the variance, respectively,

is an indicator function, i.e., it equals 1 when z_i = k and 0 otherwise, k ∈ { 1.

4. The multi-party hybrid data tracing method based on potential cluster tool variables according to claim 1, wherein said S3 specifically comprises the following sub-steps:

S301: initializing the heterogeneous data distribution with random numbers

K is the candidate value of the cluster number selected in S2;

S302: using the characterization space information obtained in S202

where the superscript T denotes the transpose;

S303: in the s-th iteration, first performing the expectation step, i.e., based on the given observed data {T, R} and the current estimate θ^(s) of the heterogeneous data distribution, computing the expectation of the log-likelihood function of the complete data:

where the likelihood term in the expectation is the joint probability distribution of T and R under the given distribution parameters;

θ^(s+1) = argmax_θ Q(θ, θ^(s))

where the parameters in θ^(s+1) are obtained by solving this maximization, the concatenation term denotes the splicing of T and R along the feature dimension, and, for a matrix M, M² = MM^T;

5. The multi-party hybrid data tracing method based on potential cluster tool variables according to claim 1, wherein said step S4 comprises the following sub-steps:

S402: over all values of the cluster hyperparameter K, using the correlation independence index MMD (maximum mean discrepancy) as the screening index, and selecting the K that minimizes the MMD as the optimal cluster number;

K^* = argmin_K MMD_K(Z, R), K ∈ {1, 2, …, 10}

where the mean term in MMD_K is the mean of the characterizations R of all samples in the k-th sample subgroup, and K^* is the optimal cluster number;

6. The multi-party hybrid data traceability method based on potential cohort tool variables, wherein the characterization learning algorithm employs a variational auto-encoder, principal component analysis, relevance minimization characterization learning, or a priori knowledge based characterization.

7. A multi-party hybrid data traceability system based on potential cluster tool variables, comprising:

the grouping and tracing module, configured to traverse all candidate values of the cluster number within the value range of the cluster hyperparameter, and to run the characterization module and the expectation maximization algorithm module for each candidate value to obtain the heterogeneous treatment scheme allocation mechanism corresponding to that candidate value; for each candidate value, the samples in the medical record data set are divided into as many sample subgroups as the candidate cluster number; then, based on the correlation independence index, the optimal cluster number and the potential group tool variable of each sample under the optimal cluster number are selected from all candidate values; the potential group tool variables serve as source indication variables in the multi-party mixed medical record data, all medical record data are clustered into several groups, and the medical record data in each group share the same diagnosis and treatment means, i.e., belong to the same heterogeneous treatment scheme allocation mechanism, so that the differences in diagnosis and treatment means among the different medical institutions in the medical record data set are traced to their source.

8. An accurate treatment recommendation system, comprising:

the joint learning module, configured to acquire the potential cluster tool variable of each sample under the optimal cluster number obtained by the multi-party mixed data tracing method according to any one of claims 1 to 7, to embed the acquired potential cluster tool variables into a tool variable regression method, and to perform joint learning with multi-party knowledge to obtain a counterfactual prediction function;

9. The system of claim 8, wherein when performing the joint learning with the multi-party knowledge, each learning sample requires inputting the condition information corresponding to the medical record data, the treatment plan given by the medical institution, the optimal potential group tool variables, and the treatment result after the treatment according to the treatment plan, so as to learn and obtain a counterfactual prediction function capable of predicting the treatment result under different diagnosis and treatment means based on the condition information.

10. The accurate treatment recommendation system of claim 8, wherein the tool variable regression method comprises a two-stage least squares regression method, a polynomial-based two-stage least squares regression method, a kernel-based two-stage least squares regression method, a deep-learning-based least squares regression method, or a two-stage least squares regression method based on adversarial moment conditions.