CN117936106A

CN117936106A - Causal discovery and disease development track prediction system and method based on time sequence data

Info

Publication number: CN117936106A
Application number: CN202311763900.7A
Authority: CN
Inventors: 丁鼐; 张文卓; 孙周健
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2023-12-21
Filing date: 2023-12-21
Publication date: 2024-04-26

Abstract

The invention discloses a system and method for causal discovery and disease development trajectory prediction based on time-series data. The system comprises a data preprocessing module that preprocesses a patient's time-series disease data, a causal derivation module that predicts the patient's disease feature derivatives and the prediction relation matrix between disease features, and a trajectory prediction module that predicts the trajectory of the patient's disease feature states. The method comprises: optimizing the causal discovery and disease development trajectory prediction system and, once optimization is complete, using it to obtain the patient's predicted disease trajectory and the prediction relation matrix between disease features, thereby realizing causal discovery and disease development trajectory prediction. The invention mines causal relations from longitudinal data of patients with chronic diseases and provides causally interpretable predicted disease trajectories. It addresses two problems: that trajectory predictions made by deep learning cannot be interpreted, and that parent and child features in a causal relation influence each other. It thereby assists doctors in making clinical decisions.

Description

Causal discovery and disease development track prediction system and method based on time sequence data

Technical Field

The invention relates to the field of data processing, and in particular to a causal discovery and disease development trajectory prediction system and method based on time-series data.

Background

If the course of a disease after onset, such as the disease type after a period of time or death within a few years, can be predicted, doctors can take treatment measures in advance. Under this assumption, recent studies have focused on using machine learning methods to build more accurate predictive models.

However, machine learning models are black boxes, so predicting disease progression with them often gives the physician insufficient guidance: a mathematical fit does not tell the physician which factors lead to which changes in the disease. The causal relationships among the various factors of the human body during disease are more instructive for clinical decision-making.

For trajectory prediction, researchers typically use recurrent neural networks (RNNs) and Transformers to learn representations from sequential patient data and predict disease progression trajectories. Both are designed to model discrete trajectories with fixed time intervals, whereas in practice continuous patient trajectories may be needed. Recent studies have generally employed two approaches to model continuous trajectories.

The first approach treats trajectory prediction as solving a dynamical system. With a neural ODE solver, a dynamical system parameterized by a neural network can be optimized, enabling continuous patient trajectory prediction. The second approach models continuous trajectories by modifying the architecture of the RNN or Transformer. Although these models perform well, they are generally unsuitable for treatment effect analysis because they do not capture the causal information in the data.

For causal discovery, there are generally two ways to discover causal relationships from sequence data. The first models the data generation process with an ODE-based linear dynamical system and employs a sparse penalty (e.g., a ridge penalty) to eliminate unnecessary feature interactions. Previous studies have shown that this approach can correctly reconstruct the causal structure from observations. A method called the physics-informed network (PIN) is widely used to find the governing equations of physical processes. The second approach exploits the Granger causality assumption that each sampled variable is affected only by earlier observations; the causal graph is then summarized by analyzing the Granger causal information. Causal discovery methods clearly can aid prognostic analysis, but they have not yet been applied to disease trajectory prediction involving both linear and nonlinear relationships.

The existing DAG-GNN model (directed acyclic graph neural network) applies nonlinearity globally and does not consider combinations of linear and nonlinear relationships; data-driven models discover causal relations only among the existing parameters and do not consider hidden influencing factors; and models using the reverse-time attention mechanism cannot discover correlation relationships and cannot judge direct correlation.

Disclosure of Invention

To solve the problems described in the background, the invention provides a causal discovery and disease development trajectory prediction system and method based on time-series data. The invention overcomes the inability of existing disease trajectory prediction methods to achieve both performance and interpretability. It mines the causal relations in patients' time-series disease data in a way that medical staff can understand, and realizes causal discovery and trajectory prediction based on electronic medical records. This makes it possible to intervene on child features in a causal relation without influence from ancestor features, which helps complete clinical decision support tasks.

The technical scheme adopted by the invention is as follows:

1. A causal discovery and disease development trajectory prediction system based on time-series data, comprising:

A data preprocessing module, used for preprocessing the patient's time-series disease data.

A causal derivation module, used for predicting, from the preprocessed time-series disease data, the patient's disease feature derivatives and the prediction relation matrix between disease features.

A trajectory prediction module, used for obtaining the predicted trajectory of the patient's disease feature states from the preprocessed time-series disease data and the disease feature derivatives.

The patient's time-series disease data comprise a plurality of disease feature records of the patient, recorded in text form at different time points.

2. A prediction method using the causal discovery and disease development trajectory prediction system, comprising:

1) Inputting the real disease data from each patient's electronic medical record, together with preset simulated disease data, into the causal discovery and disease development trajectory prediction system, and iteratively optimizing the system to obtain an optimized system; the preset simulated disease data have the same form as the real disease data from the electronic medical records.

2) Inputting the real disease data from the electronic medical record of the patient to be predicted into the optimized system, which outputs the patient's predicted disease trajectory and the prediction relation matrix between disease features, thereby realizing causal discovery and disease development trajectory prediction.

In step 1), the system is iteratively optimized as follows: an optimization module iteratively optimizes the causal discovery and disease development trajectory prediction system; a causal graph identification module constructs a causal mask matrix from the optimized systems obtained during optimization; and finally the causal mask matrix is input into the causal derivation module, where it serves as that module's processing matrix. The retained models are more likely to identify the correct causal relationships.

The causal graph identification module identifies the most reliable causal relationships from the performance and stability of the multiple trained systems and returns a causal relationship matrix, i.e., the causal mask matrix.

The optimization module performs the optimization with the augmented Lagrangian method, computes the final loss according to the loss function, and retains the optimized systems whose loss satisfies the retention condition. For each retained optimized system, the causal graph identification module obtains the neural connectivity matrix of that system, specifically as follows:

Here $m_{ij}$ indicates whether the i-th disease feature is a cause of the j-th disease feature, and the corresponding entry of a companion certainty matrix indicates whether that relation has been determined: one value marks the relation as uncertain and the other marks it as determined. K denotes the total number of disease features.

When the invalid ratio $Y_{ij}$ of the connection from the i-th to the j-th disease feature is greater than the preset acceptance ratio $\rho$, the connection is considered invalid: $m_{ij} = 0$ and the relation is marked as determined. When $Y_{ij}$ is less than $1 - \rho$, the connection is considered valid: $m_{ij} = 1$ and the relation is marked as determined. Here $Y_{ij} = e_{ij}/N$ is the invalid ratio, i.e., the proportion of the N retained optimized systems that consider the connection $i \to j$ invalid, and $e_{ij}$ is the number of retained converged systems that consider the connection invalid.

This is repeated until the causal relation between every pair of disease features in the system has been determined, which yields the causal mask matrix $m_k$.

Initially, every causal relationship in the matrix $m$ and in the certainty matrix is marked as uncertain.
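The voting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `update_causal_mask`, the nested-list layout of the votes, and the use of `None` for a still-uncertain relation are all hypothetical choices, and $\rho$ is assumed to be greater than 0.5 so that the two thresholds do not overlap.

```python
def update_causal_mask(invalid_votes, rho):
    """Build a causal mask from the votes of N retained systems.

    invalid_votes[n][i][j] is True when retained system n judges the
    connection i -> j to be invalid (hypothetical input layout).
    Returns a K x K list: 1 = valid edge, 0 = invalid edge,
    None = still uncertain (assumed encoding of the certainty matrix).
    """
    n_systems = len(invalid_votes)
    k = len(invalid_votes[0])
    mask = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            # e_ij: number of retained systems that consider i -> j invalid
            e_ij = sum(1 for votes in invalid_votes if votes[i][j])
            y_ij = e_ij / n_systems  # invalid ratio Y_ij = e_ij / N
            if y_ij > rho:
                mask[i][j] = 0       # determined: no causal edge
            elif y_ij < 1 - rho:
                mask[i][j] = 1       # determined: causal edge
    return mask


# Four retained systems voting on two features.
example_votes = [
    [[True, False], [True, False]],
    [[True, False], [True, False]],
    [[False, False], [True, False]],
    [[False, False], [True, True]],
]
example_mask = update_causal_mask(example_votes, rho=0.7)
```

In this example the connection 0 -> 0 receives an invalid ratio of 0.5, which lies between the two thresholds, so it stays uncertain and would need further rounds of training before it is fixed.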

In step 2), inputting the real disease data from the electronic medical record of the patient to be predicted into the optimized system for processing specifically comprises:

2.1) The real disease data from the electronic medical record of the patient to be predicted are input into the data preprocessing module, which outputs preprocessed real disease data.

2.2) The preprocessed real disease data and the causal mask matrix $m_k$ are input into the causal derivation module, which outputs the patient's disease feature derivatives and the prediction relation matrix between disease features.

2.3) The preprocessed real disease data and the disease feature derivatives are input into the trajectory prediction module, which outputs the predicted trajectory of the patient's disease feature states.

In step 2.1), the data preprocessing module integrates the real disease data from the electronic medical record of the patient to be predicted into text data in chronological order, and then performs format unification, missing-data processing, and outlier processing in sequence to obtain the preprocessed real disease data.

Format unification unifies the units of the data; missing-data processing removes disease features whose missing rate exceeds 30%; outlier processing applies a normalized mapping to discrete indicators.
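As a minimal sketch of the missing-data step described above (not the patent's implementation), the following Python function drops every feature whose missing rate exceeds a threshold. The record layout (a list of dicts with `None` for a missing value), the feature names, and the names `drop_sparse_features` and `max_missing_rate` are assumptions made for illustration.

```python
def drop_sparse_features(records, max_missing_rate=0.30):
    """Remove features whose missing rate exceeds max_missing_rate.

    records: list of dicts mapping feature name -> value, where None
    (or an absent key) marks a missing value. Hypothetical layout.
    """
    names = sorted({name for record in records for name in record})
    kept = []
    for name in names:
        missing = sum(1 for record in records if record.get(name) is None)
        if missing / len(records) <= max_missing_rate:
            kept.append(name)
    # Rebuild every record with only the retained features.
    return [{name: record.get(name) for name in kept} for record in records]


# "mmse" is missing in 1 of 4 visits (25%), "tau" in 2 of 4 (50%).
visits = [
    {"mmse": 28, "tau": None},
    {"mmse": 27, "tau": 310.0},
    {"mmse": None, "tau": None},
    {"mmse": 25, "tau": 295.0},
]
cleaned = drop_sparse_features(visits)
```

With the 30% threshold only `mmse` survives in this example; the remaining missing value is kept as `None` so that a later step can mask it.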

In step 2.2), the causal derivation module comprises a Gumbel-Sigmoid function, a recurrent neural network, a feedforward neural network, and a sparse penalty. The Gumbel-Sigmoid function first maps the discrete variables in the preprocessed real disease data to continuous variables. The mapped data are then input into the recurrent neural network; its output is multiplied by the causal mask matrix $m_k$ and input into the feedforward neural network, which outputs the patient's disease feature derivatives and an adjacency matrix. The adjacency matrix is constrained to form a directed acyclic graph (DAG) and a sparse penalty is applied to it, yielding the prediction relation matrix between the patient's disease features.

The causal derivation module maps continuous and discrete variables into the same new variable space and predicts the derivatives of the features. It introduces the connectivity of the feedforward neural network to evaluate the predictive relations between features, and for discrete variables it uses the logit in place of the true value. It describes the predictive relations between features by constructing an adjacency matrix and removes spurious causal connections with a sparse penalty. It also takes into account the DAG property of the causal graph to ensure the causal interpretability of the model.
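The text names the Gumbel-Sigmoid function without giving its formula. The sketch below uses the commonly used relaxation, in which the difference of two Gumbel noise samples is added to the logit before a temperature-scaled sigmoid; the `temperature` parameter and the function names are assumptions, and the patent may use a different variant.

```python
import math
import random


def stable_sigmoid(x):
    # Numerically stable logistic function (avoids overflow in exp).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_gumbel(rng):
    # Standard Gumbel(0, 1) sample; u is clamped so both logs stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_sigmoid(logit, temperature=1.0, rng=None):
    """Map the logit of a binary variable to a relaxed value in [0, 1]."""
    rng = rng or random.Random()
    noise = sample_gumbel(rng) - sample_gumbel(rng)
    return stable_sigmoid((logit + noise) / temperature)


rng = random.Random(0)
relaxed = [gumbel_sigmoid(4.0, temperature=0.5, rng=rng) for _ in range(5)]
```

Lower temperatures push the output toward 0 or 1, while the mapping stays differentiable with respect to the logit, which is what allows discrete features to share one variable space with continuous ones.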

In step 2.3), the trajectory prediction module comprises a long short-term memory (LSTM) network, a reparameterization method, and a numerical ordinary differential equation (ODE) solver. The preprocessed real disease data are input into the LSTM network, which outputs statistics of the patient's disease features. The reparameterization method then randomly samples from these statistics to obtain the patient's initial disease state. The initial disease state and the disease feature derivatives are input into the numerical ODE solver, which outputs the predicted trajectory of the patient's disease feature states. The reparameterization technique introduces randomness so that the actual situation is simulated more accurately.

The numerical ODE solver uses a variational autoencoder (VAE) to estimate the posterior probability distribution of the patient's initial disease state, estimates the rate of change of the patient's disease features from the disease feature derivatives, the initial disease state, and the posterior probability distribution, and thereby predicts the trajectory of the patient's disease feature states.
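The two numerical steps above, sampling an initial state by reparameterization and integrating the feature derivatives, can be illustrated as follows. This is a sketch under assumptions: the patent does not name a specific solver, so forward Euler is used here purely for simplicity, and the toy derivative function stands in for the feedforward network of the causal derivation module. All names are hypothetical.

```python
import random


def reparameterize(mu, sigma, rng):
    # Reparameterization: state = mean + std * standard normal noise.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def euler_trajectory(v0, derivative, t_end, step):
    """Integrate dv/dt = derivative(v) from t = 0 to t_end (forward Euler)."""
    n_steps = round(t_end / step)
    state = list(v0)
    trajectory = [list(state)]
    for _ in range(n_steps):
        rates = derivative(state)
        state = [x + step * r for x, r in zip(state, rates)]
        trajectory.append(list(state))
    return trajectory


def toy_derivative(v):
    # Toy stand-in for the learned network: feature 0 decays on its own,
    # feature 1 is driven by feature 0 (feature 0 is its causal parent).
    return [-v[0], 0.5 * v[0]]


rng = random.Random(0)
initial_state = reparameterize(mu=[1.0, 0.0], sigma=[0.1, 0.1], rng=rng)
trajectory = euler_trajectory(initial_state, toy_derivative, t_end=2.0, step=0.01)
```

Because the solver can be stopped at any time point, the predicted trajectory is continuous in time rather than tied to fixed visit intervals.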

The invention combines a variational autoencoder and an LSTM network for feature estimation and statistic calculation, and solves for the disease progression trajectory through an ordinary differential equation. Model parameters are optimized with the augmented Lagrangian method, which ensures the interpretability of the causal relations. Reliable causal relations are screened through multi-model training and a sparse penalty, so that disease development trajectories are predicted accurately and the causal graph is identified. The invention mines causal relations from longitudinal data of patients with chronic diseases and provides causally interpretable predicted disease trajectories. It addresses two problems: that trajectory predictions made by deep learning cannot be interpreted, and that parent and child features in a causal relation influence each other. It thereby assists doctors in making clinical decisions.

The beneficial effects of the invention are as follows:

While achieving disease trajectory prediction performance comparable to current leading models, the invention additionally provides causal discovery. It estimates the patient's disease progression trajectory more accurately by introducing techniques such as ODE solving and the variational autoencoder. Its causal derivation method maps continuous and discrete variables into the same new variable space and introduces neural network connectivity to evaluate the relations between features, which improves the interpretability of the causal relations. The augmented Lagrangian method solves the optimization problem effectively and improves the efficiency and convergence speed of model training. By training multiple models and jointly considering their performance and stability, the causal graph identification algorithm achieves higher robustness and reliability and can accurately identify the causal relations between features. For discrete variables, the invention uses the Gumbel-Sigmoid function for mapping, which makes the model's handling of discrete variables more flexible and accurate.

Drawings

FIG. 1 is a schematic diagram of a causal discovery and disease progression trajectory prediction system according to the present invention;

FIG. 2 is a predictive flow diagram of the system of the present invention;

FIG. 3 is an optimization flow chart of the system of the present invention;

FIG. 4 is a diagram of an example of a causal discovery function in the present invention.

Detailed Description

The invention will be described in further detail with reference to the accompanying drawings and specific examples.

The invention uses techniques such as neural ordinary differential equations and variational autoencoders to estimate the patient's disease progression trajectory more accurately. It adopts a causal derivation method that maps continuous and discrete variables into the same new variable space and introduces neural network connectivity to evaluate the relations between features, thereby improving the interpretability of the causal relations.

As shown in FIG. 1, the causal discovery and disease development trajectory prediction system based on time-series data comprises: a data preprocessing module, used for preprocessing the patient's time-series disease data; a causal derivation module, used for predicting, from the preprocessed time-series disease data, the patient's disease feature derivatives and the prediction relation matrix between disease features; and a trajectory prediction module, used for obtaining the predicted trajectory of the patient's disease feature states from the preprocessed time-series disease data and the disease feature derivatives.

The patient's time-series disease data comprise a plurality of disease feature records of the patient, recorded in text form at different time points.

As shown in FIG. 2 and FIG. 3, the prediction method using the causal discovery and disease development trajectory prediction system of the invention comprises:

In step 1), the system is iteratively optimized as follows: an optimization module iteratively optimizes the causal discovery and disease development trajectory prediction system; a causal graph identification module constructs a causal mask matrix from the optimized systems obtained during optimization; and finally the causal mask matrix is input into the causal derivation module, where it serves as that module's processing matrix. The retained models are more likely to identify the correct causal relationships.

Here $m_{ij}$ indicates whether the i-th disease feature is a cause of the j-th disease feature, and the corresponding entry of a companion certainty matrix indicates whether that relation has been determined: one value marks the relation as uncertain and the other marks it as determined. K denotes the total number of disease features.

When the invalid ratio $Y_{ij}$ of the connection from the i-th to the j-th disease feature is greater than the preset acceptance ratio $\rho$, the connection is considered invalid: $m_{ij} = 0$ and the relation is marked as determined. When $Y_{ij}$ is less than $1 - \rho$, the connection is considered valid: $m_{ij} = 1$ and the relation is marked as determined. Here $Y_{ij} = e_{ij}/N$ is the invalid ratio, i.e., the proportion of the N retained optimized systems that consider the connection $i \to j$ invalid, and $e_{ij}$ is the number of retained converged systems that consider the connection invalid.

Initially, every causal relationship in the matrix $m$ and in the certainty matrix is marked as uncertain.

In step 2), inputting the real disease data from the electronic medical record of the patient to be predicted into the optimized system for processing comprises the following steps:

In step 2.1), the data preprocessing module integrates the real disease data from the electronic medical record of the patient to be predicted into text data in chronological order, and then performs format unification, missing-data processing, and outlier processing in sequence to obtain the preprocessed real disease data. Format unification unifies the units of the data; missing-data processing removes disease features whose missing rate exceeds 30%; outlier processing applies a normalized mapping to discrete indicators.

In step 2.2), the causal derivation module comprises a Gumbel-Sigmoid function, a recurrent neural network, a feedforward neural network, and a sparse penalty. The Gumbel-Sigmoid function first maps the discrete variables in the preprocessed real disease data to continuous variables. The mapped data are then input into the recurrent neural network; its output is multiplied by the causal mask matrix $m_k$ and input into the feedforward neural network, which outputs the patient's disease feature derivatives and an adjacency matrix. The adjacency matrix is constrained to form a directed acyclic graph (DAG) and a sparse penalty is applied to it, yielding the prediction relation matrix between the patient's disease features.

In step 2.3), the trajectory prediction module comprises an LSTM network, a reparameterization method, and a numerical ODE solver. The preprocessed real disease data are input into the LSTM network, which outputs statistics of the patient's disease features. The reparameterization method then randomly samples from these statistics to obtain the patient's initial disease state. The initial disease state and the disease feature derivatives are input into the numerical ODE solver, which outputs the predicted trajectory of the patient's disease feature states. The reparameterization technique introduces randomness so that the actual situation is simulated more accurately.

The numerical ODE solver uses a variational autoencoder (VAE) to estimate the posterior probability distribution of the patient's initial disease state, estimates the rate of change of the patient's disease features from the disease feature derivatives, the initial disease state, and the posterior probability distribution, and thereby predicts the trajectory of the patient's disease feature states.

Specific embodiments of the invention are as follows:

As shown in FIG. 1, based on the preprocessed patient time-series data, the invention uses an LSTM network to estimate statistics, namely the mean $\mu_i$ and standard deviation $\sigma_i$ of the i-th patient's time-series data during training, and then uses the reparameterization method to randomly sample an estimate of the i-th patient's feature values at the initial time point $t_0$, which gives the initial disease state. In the causal derivation process, the value of the k-th feature at time t may be a continuous or a discrete variable; a discrete variable must be represented by its logit, and the values are then mapped into a common variable space. The output of the recurrent neural network is then obtained, and K+1 independent components are used to predict the disease feature derivatives of each patient, which avoids data leakage. The causal mask matrix initially has all elements equal to 1. Neural network connectivity is introduced to evaluate the predictive effect of the features, and an adjacency matrix is constructed from the connectivity vectors. The derivatives predicted by the K+1 independent components, together with the initial values, are used to predict the development trajectories of the K features.

The data preprocessing module preprocesses the text data as follows: it extracts from the electronic medical record the required time-series data of multiple features of the same patient, integrates them into text data, unifies units and formats, removes indicators whose missing rate exceeds 30%, and applies a normalized mapping to discrete indicators. The final data set consists of a sequence of two-element tuples of $s_i$ and $I_i$, where $s_i$ denotes the sequence of visit data of the i-th patient up to a given time point and $I_i$ denotes the label data sequence to be predicted, specifically as follows:

Here $v_{ij}$ denotes the disease data of the j-th visit of the i-th patient and can be regarded as shorthand for $v_i(t_{ij})$; $v_i(t)$ is a vector of K elements representing the patient's features at time t, and $v_i$ denotes the progression trajectory of the i-th patient. $m_{ij} \in \{0,1\}^K$ indicates which features are missing in $v_{ij}$ (1 means missing), and $t_{ij}$ denotes the time of the j-th visit of the i-th patient. The sequences also record the actual number of visits and the number of visits to be predicted for the i-th patient. The value of the k-th feature at the j-th visit of the i-th patient may be either a continuous or a discrete variable. This embodiment assumes that discrete variables are binary; the approach generalizes naturally to categorical variables.

In the trajectory prediction module, the patient's disease progression trajectory is estimated by solving an ordinary differential equation. First, an LSTM network parameterized by $\phi$ is used to estimate the statistics.

The patient's features at the initial time point $t_0$ are then estimated, where the initial time point is the moment at which the patient has not yet been diagnosed but begins to show symptoms. The estimate is obtained by random sampling with the reparameterization method; this step introduces randomness to better simulate the actual situation. A standard variational autoencoder is used to estimate the posterior probability distribution of the initial state, and the rate of change of the patient's disease features is estimated from the disease feature derivatives, the initial disease state, and the posterior probability distribution, so that the trajectory of the patient's disease feature states is predicted.

Since an ODE-based model can only model the dynamics of continuous variables, the state of a discrete variable actually represents its logit rather than its true value. The state is first mapped to a new variable: for continuous variables the mapping is the identity, and for discrete variables the logit is discretized. This step unifies the handling of continuous and discrete variables, specifically as follows:

To obtain the disease feature state at any time point t, the state is estimated with a numerical ODE solver from $f_\theta$ and $v_{i0}$, as follows:

where $f_\theta$ denotes a feedforward neural network.

In the causal derivation module, $f_\theta$ is designed so that the model predicts feature trajectories in a causally interpretable manner. The model uses K+1 independent components, one to predict each feature derivative. Each component is a feedforward neural network without bias terms and with a real-valued output. Neural network connectivity is introduced to evaluate the predictive effect of the features. Based on the connectivity vectors, a connectivity matrix is derived, where $C^k$ is the connectivity vector of the k-th component. This matrix is similar to an adjacency matrix: a nonzero element for the pair (j, k) means that the j-th feature can predict the k-th feature to some extent. A sparse penalty is then applied to this matrix to remove spurious causal connections.

The causal relationships between features in many diseases can be modeled as a directed acyclic graph (DAG). The causal graph is therefore assumed to be a DAG, and an additional score-based penalty is applied on that basis.

Here Tr denotes the trace of a matrix and exp(A) denotes the matrix exponential of a non-negative square adjacency matrix A. The masked matrix $(1 - I) \circ A$ forms a DAG if and only if $h(\theta) = 0$. The factor $(1 - I)$ means that self-loops are allowed, i.e., a feature may predict its own derivative.

If A represents a cyclic graph, some power of A must have a positive diagonal element, so that Tr(exp(A)) - K > 0; each element (i, j) of the k-th power of A is the weighted count of paths from element i to element j in k steps, which is non-negative. The DAG constraint therefore equals zero if and only if A represents a DAG. Once the constraint holds, each feature derivative is predicted only by the feature itself and its causal parent features, which makes the model causally interpretable.
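The acyclicity argument can be checked numerically. The sketch below evaluates Tr(exp((1 - I) ∘ A)) - K with a truncated Taylor series in plain Python; the function names and the truncation length `terms` are assumptions for illustration, and a real implementation would normally use a library matrix exponential on the learned connectivity matrix.

```python
def mat_mul(a, b):
    # Plain matrix product of two square lists of lists.
    n = len(a)
    return [[sum(a[i][p] * b[p][j] for p in range(n)) for j in range(n)]
            for i in range(n)]


def dag_penalty(adj, terms=30):
    """Return Tr(exp((1 - I) * adj)) - K for a non-negative K x K matrix.

    The diagonal is zeroed first, so self-loops are not penalized.
    exp is approximated by the first `terms` terms of its Taylor series.
    """
    k = len(adj)
    masked = [[0.0 if i == j else float(adj[i][j]) for j in range(k)]
              for i in range(k)]
    power = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    trace_exp = float(k)  # zeroth term: trace of the identity matrix
    factorial = 1.0
    for n in range(1, terms + 1):
        power = mat_mul(power, masked)  # power now holds masked ** n
        factorial *= n
        trace_exp += sum(power[i][i] for i in range(k)) / factorial
    return trace_exp - k


# Chain 0 -> 1 -> 2 with a self-loop on feature 0: acyclic after masking.
chain = [[0.7, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
# Two features that predict each other: a cycle of length 2.
cycle = [[0.0, 1.0], [1.0, 0.0]]
h_chain = dag_penalty(chain)
h_cycle = dag_penalty(cycle)
```

The chain gives a penalty of exactly zero, while the two-node cycle gives a strictly positive value, 2(cosh 1 - 1), which is about 1.086. This is the behavior the constraint relies on.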

In the optimization module, the model parameters are optimized with the augmented Lagrangian method, so that the model predicts the disease development trajectory accurately while remaining causally interpretable. The parameters are optimized by minimizing the loss function. The optimization objective is to reconstruct the observed data over the observed visits and to predict the future trajectory accurately over the visits to be predicted. Stochastic gradient descent is used to solve each subproblem approximately. Since a numerical ODE solver is typically not differentiable, the adjoint sensitivity method is used to make the ODE solver differentiable and backpropagation feasible.
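The structure of the augmented Lagrangian method (an inner gradient-descent subproblem followed by a multiplier update) can be shown on a toy problem. The sketch below is not the patent's loss: it minimizes $(w - 2)^2$ subject to the single equality constraint $w - 1 = 0$, with a fixed penalty coefficient `mu`, plain gradient descent instead of stochastic gradient descent, and hypothetical function and parameter names. Practical schemes usually also increase `mu` when the constraint violation stops shrinking.

```python
def augmented_lagrangian(loss_grad, h, h_grad, w0,
                         mu=10.0, lr=0.05, outer_iters=20, inner_iters=200):
    """Minimize loss(w) subject to h(w) = 0 for a scalar parameter w.

    Each outer iteration approximately minimizes
        loss(w) + lam * h(w) + (mu / 2) * h(w) ** 2
    by gradient descent, then updates the multiplier lam.
    """
    w, lam = w0, 0.0
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            # Gradient of the augmented Lagrangian with respect to w.
            grad = loss_grad(w) + (lam + mu * h(w)) * h_grad(w)
            w -= lr * grad
        lam += mu * h(w)  # multiplier (dual) update
    return w, lam


# Toy problem: the loss pulls w toward 2, the constraint forces w = 1.
w_opt, lam_opt = augmented_lagrangian(
    loss_grad=lambda w: 2.0 * (w - 2.0),
    h=lambda w: w - 1.0,
    h_grad=lambda w: 1.0,
    w0=0.0,
)
```

In the setting described above, the role of `h` would be played by the DAG constraint and the role of the loss by the reconstruction and prediction error, with each inner subproblem solved only approximately.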

In the causal graph identification module, multiple models are trained and the most reliable causal relationships are identified from their performance and stability. Finding causal relationships between features remains challenging for the system. The first challenge is that, because the model uses a numerical optimizer, the elements of the connectivity matrix cannot be penalized to exactly zero but only to very small values. In addition, the system is fragile, because the neural ordinary differential equations and the matrix exponential operation in the system are very sensitive to parameter initialization and input noise. When the initial parameters are poor, the system tends to converge to a poor point. Even if the system converges to a satisfactory point and achieves good predictive performance, it can easily identify causal edges with the wrong direction. The module uses the matrix $m \in \{0,1\}^{(K+1)\times(K+1)}$ to describe the causal relationships between features and a companion certainty matrix to record whether each causal relationship has been determined.

The present embodiment employs one real medical dataset (the ADNI dataset, Alzheimer's Disease Neuroimaging Initiative) and four simulated datasets to evaluate the performance of the present system. Real dataset: the ADNI dataset comprises a comprehensive collection of multimodal data from large subject cohorts, including healthy controls, individuals with mild cognitive impairment, and Alzheimer's disease patients. The pre-processed ADNI dataset retains 88 features (23 continuous features and 65 discrete features) for which at least three visit records are available. Simulated datasets: the Hao dataset records the progression of four features (amyloid (A_β), phosphorylated tau (τ_p), neurodegeneration (N) and cognitive decline score (C)) in patients with late-stage mild cognitive impairment. The Zheng dataset records the progression trajectories of four features, A_β, tau, N and C, in Alzheimer's disease patients, as shown in figure 4. The MM-25 and MM-50 datasets are two Michaelis-Menten (MM) kinetics datasets generated to evaluate the model in a high-dimensional scenario. MM-25 contains 20 features and MM-50 contains 45 features.

The Zheng dataset is free of confounding, while the other three simulated datasets contain unobserved confounding factors. All datasets are normalized before training. The details of each dataset are shown in table 1:

Dataset	Hao	Zheng	MM-25	MM-50	ADNI
Number of patients	1024	1024	1024	1024	275
Average number of visits	15	15	15	15	3.7
Number of features	4	4	20	45	88
Average visit interval	1.00	2.00	0.25	0.25	1.65
All variables continuous	Yes	Yes	Yes	Yes	No

To highlight the characteristics of the method of the invention, a comparative experiment was carried out in this embodiment. The disease trajectory prediction system provided in this embodiment (Causal Trajectory Prediction, hereinafter CTP) was compared with two common models and three recently proposed models. The five comparison models are:

1) Linear ordinary differential equation (Linear Ordinary Differential Equation, LODE): the LODE baseline uses the same structure as CTP, but models the derivatives of the features with a linear function and removes spurious connections with a ridge loss;

2) Neural ordinary differential equation (Neural Ordinary Differential Equation, NODE): NODE also uses the same structure, but does not use the ridge loss or the DAG loss to optimize its parameters;

3) Neural graphical model in continuous time (Neural Graphical Modelling in Continuous-time, NGM): NGM uses the same structure, but only adds a group ridge loss to the first layer of the neural network to extract causal relationships;

4) Neural controlled differential equation in continuous time (Continuous-time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations, TE-CDE): TE-CDE uses a controlled differential equation to evaluate the patient trajectory at any point in time and uses adversarial training to adjust for unmeasured confounding factors;

5) Counterfactual ordinary differential equation (Counterfactual Ordinary Differential Equation, CF-ODE): CF-ODE adopts a Bayesian framework, using a NODE equipped with uncertainty estimates, to continuously predict the impact of treatment over time.

Table 2 shows the results of the comparison between the disease trajectory prediction system of the invention and the comparison models on the disease trajectory prediction task, including the prediction performance of the system on the real ADNI dataset. Prediction performance on continuous features is evaluated with the mean squared error (Mean Squared Error, MSE), and on discrete features with the macro-averaged area under the receiver operating characteristic curve (Area Under the Curve, AUC). For a fair comparison, the causal identification method is not applied in the disease trajectory prediction experiments; it is applied only in the causal discovery experiments. The comparison shows that the method of the invention achieves performance comparable to the baseline models on the ADNI dataset; the results on the four simulated datasets indicate that the system of the invention also achieves better or comparable performance relative to the comparison models.
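The two prediction metrics named above can be written out as a short sketch. It assumes one list of true values and one list of predictions per feature, computes the AUC of a binary feature by pairwise comparison of positive and negative scores (ties count one half), and takes the macro average as the unweighted mean over discrete features. The function names and the sample values are illustrative only.

```python
def mse(y_true, y_pred):
    """Mean squared error for one continuous feature."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(labels, scores):
    """AUC of one binary feature: the share of (positive, negative)
    pairs in which the positive sample receives the higher score."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(label_columns, score_columns):
    """Unweighted mean of the per-feature AUC values."""
    values = [auc(l, s) for l, s in zip(label_columns, score_columns)]
    return sum(values) / len(values)


# Made-up values for one continuous and two discrete features.
print(mse([1.0, 2.0], [1.0, 4.0]))
print(macro_auc([[0, 0, 1, 1], [0, 1]],
                [[0.1, 0.4, 0.35, 0.8], [0.2, 0.9]]))
```

The pairwise form is quadratic in the number of samples and is chosen here for readability; a rank-based computation gives the same value more efficiently.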

TABLE 2

Table 3 shows the results of the comparison between the disease trajectory prediction system of the invention and the comparison models on the causal discovery task. Because the true causal relationships of the real dataset are unknown and cannot be evaluated, only the four simulated datasets are considered. Accuracy, the harmonic mean of precision and recall (F1 Score, F1), and AUC are used to evaluate causal discovery performance. The comparison shows that the NODE model cannot extract causal relationships between features, since its AUC is below 0.57 on all four datasets. The performance of TE-CDE and CF-ODE is not ideal because they are not designed to extract causal relationships between features. LODE and NGM achieve better performance by taking advantage of the ridge penalty. NGM is superior to LODE, which may be due to its use of neural networks.
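The edge-level accuracy and F1 score used for causal discovery can be sketched as follows. The sketch assumes that both the true causal graph and the predicted one are given as square 0/1 matrices and that every entry, including the diagonal, is scored; whether the original evaluation excludes any entries is not stated here. The function name `edge_metrics` and the example graphs are illustrative.

```python
def edge_metrics(true_adj, pred_adj):
    """Return (accuracy, f1) of predicted edges against true edges."""
    tp = fp = fn = tn = 0
    n = len(true_adj)
    for i in range(n):
        for j in range(n):
            t, p = true_adj[i][j], pred_adj[i][j]
            if t and p:
                tp += 1          # edge correctly found
            elif p:
                fp += 1          # spurious edge
            elif t:
                fn += 1          # missed edge
            else:
                tn += 1          # absence correctly predicted
    accuracy = (tp + tn) / (n * n)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, f1


# Made-up graphs: one edge found, one spurious, one missed.
true_graph = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
pred_graph = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
accuracy, f1 = edge_metrics(true_graph, pred_graph)
print(accuracy, f1)
```

Because causal graphs are usually sparse, accuracy is dominated by correctly predicted absences, which is one reason to report F1 and AUC alongside it.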

Furthermore, the original CTP model can identify causal relationships between features better than all baselines, and its performance can be further improved by the causal identification algorithm. For example, the original CTP achieves a causal discovery accuracy of only 0.56 on the Hao dataset. When the causal identification algorithm is applied (denoted CTP*), however, its causal discovery performance improves significantly. The invention can therefore effectively accomplish causal identification and disease trajectory prediction based on time series data.

TABLE 3

Corresponding to the foregoing embodiments of the causal discovery and disease development trajectory prediction method based on time series data, the embodiments of the present invention further provide a computer readable storage medium having a computer program stored thereon which, when executed by a processor, implements the functions of each module of the disease development model based on causal trajectory prediction and of its method. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device having data processing capabilities described in the foregoing embodiments. The computer readable storage medium may also be an external storage device of any device having data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any device having data processing capabilities. The computer readable storage medium is used for storing the computer program and other programs and data required by the device, and can also be used for temporarily storing data that has been output or is to be output.

The foregoing description covers only preferred embodiments of the invention and is not intended to limit it; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention falls within its scope. The above embodiments merely illustrate the design concept and features of the present invention and are intended to enable those skilled in the art to understand and implement it; the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications made according to the principles and design ideas of the present invention are within the scope of the present invention.

Claims

1. A causal discovery and disease progression trajectory prediction system based on time series data, comprising:

a data preprocessing module for preprocessing time-series disease data of a patient;

a causal deriving module for predicting, according to the preprocessed time-series disease data of the patient, the disease feature quantity of the patient and a disease feature-to-feature prediction relation matrix;

2. The causal discovery and disease progression track prediction system based on time series data according to claim 1, wherein: the time series disease data of the patient comprise a plurality of disease characteristic data of the patient corresponding to different time points recorded in a text form.

3. A causal discovery and disease progression trajectory prediction method using the causal discovery and disease progression trajectory prediction system according to any one of claims 1-2, comprising:

1) Inputting real disease data and preset simulated disease data of electronic medical records of each patient into a causal discovery and disease development track prediction system, and continuously optimizing the causal discovery and disease development track prediction system to obtain an optimized causal discovery and disease development track prediction system;

4. The causal discovery and disease progression trajectory prediction method according to claim 3, wherein: in the step 1), the causal discovery and disease development trajectory prediction system is continuously optimized, specifically by using an optimization module; a causal graph identification module constructs a causal mask matrix from each optimized causal discovery and disease development trajectory prediction system obtained during the optimization process; and finally the causal mask matrix is input into the causal deriving module to serve as a processing matrix of the causal deriving module.

5. The causal discovery and disease progression trajectory prediction method according to claim 4, wherein: the optimization module specifically performs the optimization with an augmented Lagrangian method, calculates the final loss according to a loss function, and retains optimized systems on the basis of that loss; for each retained optimized causal discovery and disease development trajectory prediction system, the causal graph identification module acquires the neural connection matrix of that system, specifically as follows:

6. The causal discovery and disease progression trajectory prediction method according to claim 4, wherein: in the step 2), the real disease data of the electronic medical record of the patient to be predicted are input into the optimized causal discovery and disease development trajectory prediction system for processing, which specifically comprises the following steps:

2.1) Inputting the real disease data of the electronic medical record of the patient to be predicted into the data preprocessing module for processing, and outputting preprocessed real disease data;

2.2) Inputting the preprocessed real disease data and the causal mask matrix m_k into the causal deriving module for processing, and outputting the disease feature vector quantity and the disease feature prediction relation matrix of the patient;

7. The causal discovery and disease progression trajectory prediction method according to claim 6, wherein: in the step 2.1), the data preprocessing module integrates the real disease data of the electronic medical record of the patient to be predicted into text data in chronological order, and then performs format unification, missing data processing and outlier processing in sequence to obtain the output preprocessed real disease data.

8. The causal discovery and disease progression trajectory prediction method according to claim 6, wherein: in the step 2.2), the causal deriving module comprises a Gumbel-Sigmoid function, a recurrent neural network, a feedforward neural network and a sparsity penalty; the Gumbel-Sigmoid function is first used to map the discrete variables in the preprocessed real disease data into continuous variables; the mapped preprocessed real disease data are then input into the recurrent neural network for processing; the processed output is multiplied by the causal mask matrix m_k and then input into the feedforward neural network for processing, which outputs the disease feature vector quantity and the adjacency matrix of the patient; and the adjacency matrix is constrained to form a directed acyclic graph (DAG) and then subjected to the sparsity penalty to obtain the disease feature prediction relation matrix of the patient.

9. The causal discovery and disease progression trajectory prediction method according to claim 6, wherein: in the step 2.3), the trajectory prediction module comprises a long short-term memory (LSTM) network, a reparameterization method and a numerical ODE solver; the preprocessed real disease data are input into the LSTM network for processing, which outputs the disease feature statistics of the patient; the disease feature statistics are then randomly sampled by the reparameterization method to obtain the initial disease state of the patient; and the initial disease state and the disease feature derivative quantity of the patient are input into the numerical ODE solver for processing, which outputs the disease feature state prediction trajectory of the patient.

10. The causal discovery and disease progression trajectory prediction method according to claim 9, wherein: the numerical ODE solver uses a variational autoencoder (VAE) to estimate the posterior probability distribution of the initial disease state of the patient, and obtains the rate of change of the disease features of the patient from the disease feature vector quantity, the initial disease state and the posterior probability distribution estimate of the patient, so as to predict the disease feature state prediction trajectory of the patient.