CN115691652A - Causal relationship prediction method based on multi-sample data - Google Patents

Info

Publication number
CN115691652A
Authority
CN
China
Prior art keywords
data
variable
causal
network
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211440950.7A
Other languages
Chinese (zh)
Inventor
刘小平
张月蕾
常啸
陈洛南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Institute of Advanced Studies of UCAS
Original Assignee
Hangzhou Institute of Advanced Studies of UCAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Institute of Advanced Studies of UCAS filed Critical Hangzhou Institute of Advanced Studies of UCAS
Priority to CN202211440950.7A priority Critical patent/CN115691652A/en
Publication of CN115691652A publication Critical patent/CN115691652A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a causal relationship prediction method based on multi-sample data. The method uses expression data of a specific representation as initial input data, calculates correlation coefficients among the initial input data, and uses these correlation coefficients to construct an initial network. On the basis of the initial correlation network, an H0 hypothesis and an H1 hypothesis are posed for each edge, and different regressions are performed on part of the expression data under the two hypotheses to obtain a linear regression equation for each edge under each hypothesis. The same remaining expression data are then substituted into the regression equations of the two hypotheses to obtain the errors of the two regressions, and the fitting errors of the two regression equations are compared to judge whether the causal relationship exists, so as to construct the whole causal network. The invention provides a new concept of causality and a method for establishing it, referred to as the CVP method, which is a data-driven, model-free algorithm for processing time-independent data.

Description

Causal relationship prediction method based on multi-sample data
Technical Field
The invention relates to the technical field of computational biology and bioinformatics, in particular to a causal relationship prediction method based on multi-sample data.
Background
Causal inference from observed data is a core problem in many fields of natural science and engineering, such as biology, earth science, economics, medicine, neuroscience and machine learning. Granger causality (GC), a representative method, infers potential causal relationships from time-series data, and since 1969 GC-based methods have been widely used in many fields. However, most biological data are time-independent, for example data organized by phenotype or stage rather than as time series, so this approach is not suitable.
Causality between molecules can generally be represented by a causal graph, in which nodes represent different molecules and directed edges characterize direct causal relationships between molecules. Existing mainstream algorithms for predicting causal relationships include the well-known Granger causality (GC), convergent cross mapping (CCM), transfer entropy (TE) and Bayesian networks. Specifically, GC is based on time-lag information, with the current state explained by past information, for example through an autoregressive model. As a nonlinear version of the GC method, TE uses the temporal asymmetry of information to determine causal relationships and is widely used in biological fields such as neuroscience and physiology. Both GC and TE infer causality in the original state space. CCM, by contrast, goes beyond GC theory and measures causality in the delay-embedding space, reflecting the nonlinear relationship between two variables in the current state. All of the above algorithms require time-dependent or time-series data. In contrast, Bayesian networks and structural causal models (SCMs) can process time-independent data on the basis of statistical independence and intervention, and can identify direct causal relationships between molecules. However, these methods rely on a directed acyclic graph structure to infer causal relationships, which limits their application to biomolecular networks, which typically contain feedback loops or cyclic interactions.
Therefore, inferring causal relationships in real biological/molecular systems with high accuracy from time-independent data rather than time-series data, so as to better characterize the relationships among the data and provide a sound data basis for practical decision-making, remains a technical problem to be solved.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a causal relationship prediction method based on multi-sample data.
The technical scheme of the invention is as follows:
A causal relationship prediction method based on multi-sample data infers causality without relying on time dependency, and specifically comprises the following steps:
101) Initial network construction step: using expression data of a specific representation as initial input data, calculating correlation coefficients among the initial input data, and constructing an initial network from these correlation coefficients;
102) Causal relationship judgment step: on the basis of the initial correlation network, posing an H0 hypothesis and an H1 hypothesis for each edge, where H0 represents that the causal relationship does not exist and H1 represents that the causal relationship exists; performing different regressions on part of the expression data under the two hypotheses to obtain a linear regression equation for each edge under each hypothesis;
103) Causal network construction step: substituting the same remaining expression data into the regression equations obtained under the two hypotheses to obtain the errors of the two regressions; comparing the two fitting errors to judge whether the causal relationship exists, and thereby constructing the whole causal network.
Further, in step 101), the correlation between two variables is evaluated using the partial correlation coefficient, which eliminates the indirect influence of other genes, so that the direct correlation between genes is obtained and a correlation network is constructed.
Further, the Pearson correlation coefficient is calculated as follows:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
wherein X and Y are two variables, cov(X, Y) is their covariance, and $\sigma_X$ and $\sigma_Y$ are their standard deviations;
the first-order partial correlation coefficient is calculated as follows:
$$r_{12 \cdot 3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
wherein $r_{12}$ represents the correlation coefficient between variable 1 and variable 2, $r_{13}$ represents the correlation coefficient between variable 1 and variable 3, and $r_{23}$ represents the correlation coefficient between variable 2 and variable 3.
Further, in step 102), under the H0 hypothesis the effect variable is fitted by regression on the covariates only, excluding the cause variable; under the H1 hypothesis the effect variable is fitted by regression on the cause variable together with the covariates. The specific regression formulas are respectively as follows:
H0: $Y = \hat{f}(Z_1, Z_2, \ldots, Z_{n-2}) + \hat{\varepsilon}$
H1: $Y = f(X, Z_1, Z_2, \ldots, Z_{n-2}) + \varepsilon$
wherein Y represents the effect variable; X represents the cause variable; $Z_1, \ldots, Z_{n-2}$ are the other variables, excluding X and Y; and $\varepsilon$ and $\hat{\varepsilon}$ are noise terms.
Further, the expression data of the specific representation are divided into a training set and a test set, wherein the training set is used for the regression and the test set is used for evaluating the error; the division methods include K-fold cross-validation and the leave-one-out method.
Further, the training set data are substituted into each regression equation to fit it, and the total squared error is obtained as follows:
the distance between the output of the fitted equation and the true value is taken as the error, and the errors obtained from the multiple cross-validation tests are summed to obtain the total error, according to the formula:
$$e = \sum_{i=1}^{m} e_i$$
wherein $e_i$ represents the error obtained in the i-th cross-validation test, and m represents the total number of cross-validation tests.
Further, the error in step 103) may be calculated by a distance metric; the comparison is made by means of the causal strength, which is calculated as follows:
$$CS_{X \to Y} = \ln \frac{\hat{e}}{e}$$
where e represents the mean sum of squared residuals on the cross-validation test sets obtained under the H1 hypothesis, in which the effect variable is regressed on the cause variable and the other random variables, and $\hat{e}$ represents the mean sum of squared residuals on the cross-validation test sets obtained under the H0 hypothesis, in which the effect variable is regressed on all random variables except the cause variable.
Further, the expression data of the specific expression includes gene expression data, biological chain data, and disease transmission model data.
Further, the construction device comprises an initial network construction module, a data segmentation module, a regression fitting module, a fitting error module, a cause and effect judgment module and a network judgment module;
the initial network construction module is used for calculating correlation coefficients among input data by using expression data of a specific representation as original input data and constructing an initial network by using the correlation coefficients among the data;
the data segmentation module is used for segmenting original input data into a training set and a test set; the training set is used for training the regression fitting module, and the testing set is used for calculating the error of the fitting error module;
the regression fitting module is used for posing an H0 hypothesis and an H1 hypothesis for each edge on the basis of the initial correlation network, where H0 represents that the causal relationship does not exist and H1 represents that the causal relationship exists, and for performing different regressions on part of the expression data under the two hypotheses to obtain a linear regression equation for each edge under each hypothesis;
the fitting error module is used for feeding the test set into the regression equations obtained by the regression fitting module and calculating the resulting fitting errors;
the cause and effect judgment module is used for judging the causal strength from the multiple errors that the fitting error module obtains over the multiple data set splits produced by the data segmentation module, so as to obtain the real causal relationship;
and the network judgment module is used for traversing, by means of the above modules, the node pairs corresponding to the edges in the initial network, finally obtaining the real causal relationship network.
Compared with the prior art, the invention has the following advantages. The invention provides a new concept of causality and a method for establishing it, referred to as the CVP method, which is a data-driven, model-free algorithm for processing time-independent data. The CVP method quantifies causality through cross-validated predictability and statistical independence of observed variables on time-independent data, analogous to, but distinct from, GC on time-dependent data. The CVP method does not need redundant prior information and is suitable for real data from various backgrounds, including gene regulatory networks and other causal networks with feedback loops. The CVP method can effectively identify causal relationships in complex biological systems, and can better reveal regulatory mechanisms and explain biological functions. In particular, deriving disease-specific causal/regulatory networks may reveal the underlying mechanisms of molecular effects, providing accurate data and decision guidance for further precision treatment.
Drawings
FIG. 1 is a schematic diagram of the overall framework of the present invention.
Detailed Description
Reference will now be made in detail to the embodiments of the present invention, wherein like or similar reference numerals refer to like or similar elements or elements of similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and are only for explaining the present invention and are not to be construed as limiting the present invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Step numbers in the various embodiments are provided only for ease of description and do not necessarily imply a sequential order. Different steps in each embodiment can be combined in different sequences to achieve the purpose of the invention.
The invention is further described with reference to the following figures and detailed description.
Example:
As shown in FIG. 1, a causal relationship prediction method based on multi-sample data specifically includes the following steps:
101) Initial network construction step: using expression data of a specific representation as initial input data, correlation coefficients among the initial input data are calculated, and an initial network is constructed from these correlation coefficients. The expression data of a specific representation include gene expression data, biological chain data, disease transmission model data, and the like. For example, expression data under normal conditions are used as reference data for disease research; such data are mainly gene expression data, chiefly RNA-Seq data, together with other data meeting the standard, such as biological chain data and disease transmission model data.
Specifically, the correlation between two variables is evaluated using a Pearson correlation coefficient or a partial correlation coefficient; the partial correlation coefficient removes the indirect influence of other genes, so that the direct correlation between genes is obtained and a correlation network is constructed.
The Pearson correlation coefficient is calculated as follows:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
wherein X and Y are two variables, cov(X, Y) is their covariance, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.
That is, the Pearson correlation coefficient (PCC) measures the strength of correlation between two variables. When a PCC or a partial correlation coefficient (PTCC) is used to construct the correlation network, the network nodes represent genes and the edges represent the strength of correlation between genes. The magnitude of a correlation is affected by many factors, such as sample size and sequencing depth. To reduce the impact of such technical factors across data sets, an adaptive threshold is used instead of a fixed one: if the feature dimension is below 1000, the correlation network retains all edges; otherwise, only the most highly correlated edges are retained, such that the average degree of the network is 100.
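The adaptive thresholding described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `correlation_network`, its parameters, and the reading of "average degree of 100" as keeping the n × 100 / 2 strongest edges (by absolute correlation) are all assumptions.

```python
import numpy as np

def correlation_network(data, max_features=1000, avg_degree=100):
    """Build an undirected correlation network (hypothetical helper).

    data: array of shape (samples, variables).
    Returns a dict mapping (i, j) with i < j to the Pearson correlation.
    """
    n_vars = data.shape[1]
    corr = np.corrcoef(data, rowvar=False)      # variables are columns
    iu, ju = np.triu_indices(n_vars, k=1)       # each unordered pair once
    strength = np.abs(corr[iu, ju])
    if n_vars < max_features:
        keep = np.arange(len(iu))               # small problem: keep all edges
    else:
        # Assumption: average degree d means n * d / 2 undirected edges.
        n_edges = min(len(iu), n_vars * avg_degree // 2)
        keep = np.argsort(strength)[::-1][:n_edges]
    return {(int(iu[k]), int(ju[k])): float(corr[iu[k], ju[k]]) for k in keep}
```

With the default arguments, a data set with fewer than 1000 variables yields the complete graph; the thresholds are exposed as parameters only so that the branch can be exercised on small inputs.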
The first-order partial correlation coefficient is calculated as follows:
$$r_{12 \cdot 3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
wherein $r_{12}$ represents the correlation coefficient between variable 1 and variable 2, $r_{13}$ represents the correlation coefficient between variable 1 and variable 3, and $r_{23}$ represents the correlation coefficient between variable 2 and variable 3.
That is, the partial correlation coefficient from statistics is used to evaluate the direct correlation between two variables, i.e., the correlation that remains after the influence of a set of other random variables is removed. Computing the PTCC between two genes also yields a hypothesis-test p-value; edges whose PTCC p-value is less than 0.1 are retained, giving the partial correlation network.
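As a small worked illustration of the first-order partial correlation, the sketch below evaluates the standard formula from three pairwise correlations. The function name `partial_corr` is hypothetical, and the p-value filtering step mentioned above is not included.

```python
import math

def partial_corr(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2, controlling for 3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# If variable 3 is uncorrelated with both others, nothing is removed.
direct = partial_corr(0.5, 0.0, 0.0)

# If the 1-2 correlation is fully explained by variable 3 (r12 = r13 * r23),
# the partial correlation vanishes up to rounding error.
indirect = partial_corr(0.72, 0.9, 0.8)
```

The second case is the situation the text describes: two genes appear correlated only because both follow a third, so no direct edge should be drawn between them.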
102) Causal relationship judgment step: on the basis of the initial correlation network, an H0 hypothesis and an H1 hypothesis are posed for each edge. H0 represents that the causal relationship does not exist, and H1 represents that the causal relationship exists. Different regressions are performed on part of the expression data under the two hypotheses to obtain a linear regression equation for each edge under each hypothesis. That is, for each gene, a sub-network consisting of the gene and its first-order neighbors is extracted from the correlation network, and the causal strength (CS) of each neighboring gene on that gene is quantified by this scheme.
Under the H0 hypothesis, the effect variable is fitted by regression on the other variables only, excluding the cause variable; under the H1 hypothesis, the effect variable is fitted by regression on the cause variable together with the other variables. The specific regression formulas are respectively as follows:
H0: $Y = \hat{f}(Z_1, Z_2, \ldots, Z_{n-2}) + \hat{\varepsilon}$
H1: $Y = f(X, Z_1, Z_2, \ldots, Z_{n-2}) + \varepsilon$
wherein Y represents the effect variable; X represents the cause variable; $Z_1, \ldots, Z_{n-2}$ are the other covariates, excluding X and Y; and $\varepsilon$ and $\hat{\varepsilon}$ are noise terms.
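The two regressions can be illustrated with ordinary least squares. The sketch below is a hedged example on synthetic data: the helper `fit_linear`, the variable names, and the simulated relationship between `x`, `z` and `y` are assumptions introduced only for illustration, and an intercept term is added by choice.

```python
import numpy as np

def fit_linear(features, target):
    """Least-squares fit of target on features plus an intercept (hypothetical helper)."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef  # coef[0] is the intercept

# Synthetic data: y depends on the candidate cause x and on covariate z[:, 0].
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))          # covariates Z_1, Z_2
x = rng.normal(size=200)               # candidate cause X
y = 2.0 * x + z[:, 0] + 0.1 * rng.normal(size=200)

# H0: the cause variable is left out of the regression.
coef_h0 = fit_linear(z, y)
# H1: the cause variable is included alongside the covariates.
coef_h1 = fit_linear(np.column_stack([x, z]), y)
```

Here `coef_h1[1]` is the fitted coefficient of X; under H0 no such coefficient exists, which is what makes the two prediction errors comparable on held-out samples.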
103) Causal network construction step: the same remaining expression data are substituted into the regression equations obtained under the two hypotheses to obtain the errors of the two regressions. By comparing the two fitting errors, it can be judged whether the causal relationship exists, and the whole causal network is constructed. The overall process can be summarized as follows: based on a significance test of the causal strength (CS), redundant causal relationships are removed to obtain the optimal causal sub-network of each gene, and the sub-networks of all genes are integrated to obtain the final causal network.
Specifically, the expression data of a specific representation are divided into a training set and a test set, wherein the training set is used for the regression and the test set is used for evaluating the error; the division methods include K-fold cross-validation and the leave-one-out method.
The training set data are substituted into each regression equation to fit it, and the total squared error is obtained as follows:
the distance between the output of the fitted equation and the true value is taken as the error, and the errors obtained from the multiple cross-validation tests are summed to obtain the total error, according to the formula:
$$e = \sum_{i=1}^{m} e_i$$
wherein $e_i$ represents the error obtained in the i-th cross-validation test, and m represents the total number of cross-validation tests.
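The summation of per-fold errors can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `cv_total_error`, the use of squared residuals as the distance, the random fold assignment with a fixed seed, and the linear least-squares model are choices made here, not details confirmed by the text. Setting `k` equal to the number of samples gives the leave-one-out case.

```python
import numpy as np

def cv_total_error(features, target, k=5, seed=0):
    """Sum of squared test-set residuals over k cross-validation folds."""
    n = len(target)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    total = 0.0
    for test_idx in folds:                      # one pass per fold: e_i
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit on the training part only.
        a_train = np.column_stack([np.ones(train.sum()), features[train]])
        coef, *_ = np.linalg.lstsq(a_train, target[train], rcond=None)
        # Evaluate on the held-out part.
        a_test = np.column_stack([np.ones(len(test_idx)), features[test_idx]])
        resid = target[test_idx] - a_test @ coef
        total += float(np.sum(resid ** 2))      # accumulate e = sum of e_i
    return total
```

Using the same `seed` for the H0 and H1 regressions keeps the train/test splits identical, so the two totals are computed on the same held-out samples.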
The error can be calculated by a distance metric, and the comparison is made by means of the causal strength, which is calculated as follows:
$$CS_{X \to Y} = \ln \frac{\hat{e}}{e}$$
where e represents the mean sum of squared residuals on the cross-validation test sets obtained under the H1 hypothesis, in which the effect variable is regressed on the cause variable and the other random variables, and $\hat{e}$ represents the mean sum of squared residuals on the cross-validation test sets obtained under the H0 hypothesis, in which the effect variable is regressed on all random variables except the cause variable.
The causal strength $CS_{X \to Y}$ can be defined as the difference in the predictability of Y without X and with X, i.e., $CS_{X \to Y} = \ln(\hat{e}/e)$. A statistical significance test can also be used to verify the difference between e and $\hat{e}$. However, statistical tests are too time-consuming for inferring a large-scale network. The causal strength $CS_{X \to Y}$ is therefore used to efficiently quantify the difference between the error terms on the test samples, which is better suited to large-scale networks.
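The decision rule can be illustrated with a few lines of code. The sketch assumes the log-ratio form CS = ln(ê/e), which is positive exactly when the error with X (e) is smaller than the error without X (ê); the function names are hypothetical, and any other measure with the same sign behavior would fit the description equally well.

```python
import math

def causal_strength(err_h0, err_h1):
    """Log-ratio of the H0 (without X) and H1 (with X) test errors (assumed form)."""
    return math.log(err_h0 / err_h1)

def has_causal_edge(err_h0, err_h1):
    """A positive causal strength is read as 'X has a causal effect on Y'."""
    return causal_strength(err_h0, err_h1) > 0.0
```

For instance, if leaving X out doubles the held-out error, the causal strength is ln 2, whereas equal errors give exactly zero and no edge.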
The specific process is as follows:
1) Target variable pairs are selected according to the initial network, and two hypotheses, H0 and H1, are posed for each pair.
The H0 hypothesis in this step: one variable (the cause variable) has no causal effect on the other variable (the effect variable).
The H1 hypothesis in this step: one variable (the cause variable) has a causal effect on the other variable (the effect variable).
2) The data set is divided by cross-validation into a training set and a test set, and the data of the cause variable are selectively masked according to the hypothesis.
Selective masking in this step means that the information of the cause variable is masked under the H0 hypothesis and included under the H1 hypothesis.
The cross-validation data division in this step includes K-fold cross-validation and the leave-one-out method.
3) Under each hypothesis, a regression equation for the effect variable is established using the training set data.
The regression equations established in this step are as follows.
The regression equation under the H0 hypothesis:
$$Y = \hat{f}(Z_1, Z_2, \ldots, Z_{n-2}) + \hat{\varepsilon}$$
The regression equation under the H1 hypothesis:
$$Y = f(X, Z_1, Z_2, \ldots, Z_{n-2}) + \varepsilon$$
The training set data are substituted into each regression equation to fit it, and the total squared error is obtained. The distance between the output of the fitted equation and the true value is taken as the error, and the errors obtained from the multiple cross-validation tests are summed to obtain the total error. The specific formula is as follows:
$$e = \sum_{i=1}^{m} e_i$$
The errors obtained under the two hypotheses are then compared to determine whether a causal relationship exists between the pair of variables. The comparison is defined by the causal strength: if the causal strength is positive, the causal relationship is considered to exist; if it is negative, the causal relationship is considered to be absent. The specific formula for the causal strength is as follows:
$$CS_{X \to Y} = \ln \frac{\hat{e}}{e}$$
The above steps are repeated for every variable pair connected by an edge in the initial network to obtain all the causal relationships in the initial network, i.e., the final causal relationship network (target network).
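The whole procedure, from an initial undirected network to a directed causal network, can be sketched end to end as follows. This is a simplified illustration under several assumptions: linear least-squares regression, squared test error, a log-ratio causal strength, covariates taken as the effect variable's neighbors in the initial network, and a hypothetical `min_cs` cut-off (0 reproduces the "positive means causal" rule). The function names are invented for the example.

```python
import numpy as np

def cv_error(features, target, k=5, seed=0):
    """Summed squared test-set error over k cross-validation folds."""
    n = len(target)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    total = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        a_train = np.column_stack([np.ones(train.sum()), features[train]])
        coef, *_ = np.linalg.lstsq(a_train, target[train], rcond=None)
        a_test = np.column_stack([np.ones(len(test_idx)), features[test_idx]])
        total += float(np.sum((target[test_idx] - a_test @ coef) ** 2))
    return total

def causal_network(data, edges, min_cs=0.0, k=5):
    """Return {(cause, effect): CS} for directed edges with CS above min_cs.

    data: array (samples, variables); edges: undirected pairs (i, j).
    """
    neighbours = {}
    for i, j in edges:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)
    result = {}
    for i, j in edges:
        for x, y in ((i, j), (j, i)):           # test both directions
            z = sorted(neighbours[y] - {x})     # covariates: other neighbours of y
            err_h0 = cv_error(data[:, z], data[:, y], k)        # X masked
            err_h1 = cv_error(data[:, [x] + z], data[:, y], k)  # X included
            cs = float(np.log(err_h0 / err_h1)) # assumes err_h1 > 0
            if cs > min_cs:
                result[(x, y)] = cs
    return result
```

Both regressions reuse the same seed and therefore the same folds, mirroring the requirement that the two equations be evaluated on the same remaining data. Note that with only two variables and a linear model the errors behave symmetrically, so direction is informative only when the covariate sets of the two endpoints differ.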
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A causal relationship prediction method based on multi-sample data, characterized in that the method specifically comprises the following steps:
101) Initial network construction step: using expression data of a specific representation as initial input data, calculating correlation coefficients among the initial input data, and constructing an initial network from these correlation coefficients;
102) Causal relationship judgment step: on the basis of the initial correlation network, posing an H0 hypothesis and an H1 hypothesis for each edge, where H0 represents that the causal relationship does not exist and H1 represents that the causal relationship exists; performing different regressions on part of the expression data under the two hypotheses to obtain a linear regression equation for each edge under each hypothesis;
103) Causal network construction step: substituting the same remaining expression data into the regression equations obtained under the two hypotheses to obtain the errors of the two regressions; comparing the two fitting errors to judge whether the causal relationship exists, and thereby constructing the whole causal network.
2. A method of causal relationship prediction based on multi-sample data as claimed in claim 1, wherein: in the step 101), estimating the correlation between two variables by using a Pearson correlation coefficient or a partial correlation coefficient, and constructing a correlation network; wherein the partial correlation coefficient is used for eliminating indirect influence of genes so as to obtain direct correlation between the genes.
3. A method of causal relationship prediction based on multi-sample data as claimed in claim 2, wherein: the Pearson correlation coefficient is calculated as follows:
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
wherein X and Y are two variables, cov(X, Y) is their covariance, and $\sigma_X$ and $\sigma_Y$ are their standard deviations;
the first-order partial correlation coefficient is calculated as follows:
$$r_{12 \cdot 3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
wherein $r_{12}$ represents the correlation coefficient between variable 1 and variable 2, $r_{13}$ represents the correlation coefficient between variable 1 and variable 3, and $r_{23}$ represents the correlation coefficient between variable 2 and variable 3.
4. A method of causal relationship prediction based on multi-sample data as claimed in claim 1, wherein: in step 102), under the H0 hypothesis the effect variable is fitted by regression on the covariates excluding the cause variable; under the H1 hypothesis the effect variable is fitted by regression on the cause variable together with the other variables; the specific regression formulas are respectively as follows:
H0:
$$Y = \hat{f}(Z) + \hat{\varepsilon}$$
H1:
$$Y = f(X, Z) + \varepsilon$$
wherein Y represents the effect variable; X represents the cause variable; Z represents the other covariates, excluding X and Y; and $\varepsilon$ and $\hat{\varepsilon}$ are noise terms.
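The two regressions in this claim can be sketched as follows, assuming ordinary least squares with an intercept as the linear regression. The helper name `fit_linear`, the synthetic data, and the coefficient values are assumptions made for illustration only.

```python
import numpy as np

def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns the coefficients."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict(coef, features):
    """Evaluate a fitted linear model on new feature rows."""
    design = np.column_stack([np.ones(features.shape[0]), features])
    return design @ coef

# Synthetic data in which X really does drive Y (an assumption of this demo).
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=(n, 2))          # other covariates Z
x = rng.normal(size=n)               # candidate cause X
y = 2.0 * x + z[:, 0] - 0.5 * z[:, 1] + 0.1 * rng.normal(size=n)

xz = np.column_stack([x, z])
coef_h0 = fit_linear(z, y)           # H0: Y regressed on Z only
coef_h1 = fit_linear(xz, y)          # H1: Y regressed on X and Z

mse_h0 = float(np.mean((y - predict(coef_h0, z)) ** 2))
mse_h1 = float(np.mean((y - predict(coef_h1, xz)) ** 2))
```

In this sketch the H1 model fits far better than the H0 model because X carries information about Y that Z does not; the method itself compares the two models on held-out data rather than on the fitting data.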
5. The method of claim 1, wherein: the expression data of a specific representation are divided into a training set and a test set, the training set being used for regression and the test set being used for judgment; the division modes include K-fold cross-validation and the leave-one-out method.
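The two division modes named in this claim can be sketched with index lists. This is a minimal version that splits samples in their given order without shuffling, which is an assumption; the function names are illustrative.

```python
def k_fold_splits(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

def leave_one_out_splits(n_samples):
    """Leave-one-out is K-fold with one sample per fold."""
    return k_fold_splits(n_samples, n_samples)

folds = k_fold_splits(10, 3)
loo = leave_one_out_splits(4)
```

Each sample appears in exactly one test set, so every observation is used once for judgment and otherwise for regression.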
6. A method of causal relationship prediction based on multi-sample data as claimed in claim 5, wherein: the training-set data are substituted into the different regression equations to fit them, and the total error is obtained as follows:
the distance between the output of the fitted equation and the true value is taken as the error, and the errors obtained from the multiple rounds of cross-validation are summed to obtain the total error, according to the formula:
$$e = \sum_{i=1}^{m} e_i$$
wherein $e_i$ denotes the error obtained in the i-th round of cross-validation, and m denotes the total number of cross-validation rounds.
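The accumulation of per-round errors can be sketched as below. The claim only says "distance"; the squared Euclidean distance used here is an assumption, as are the function names and the toy numbers.

```python
def fold_error(predicted, observed):
    """Squared Euclidean distance between predictions and true values
    (the choice of distance is an assumption of this sketch)."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def total_error(fold_results):
    """Sum the errors e_i over the m cross-validation rounds.

    fold_results is an iterable of (predicted, observed) pairs, one per round.
    """
    return sum(fold_error(p, o) for p, o in fold_results)

# Two hypothetical rounds: errors 1.0 and 4.0.
rounds = [([1.0, 2.0], [1.0, 3.0]), ([0.0], [2.0])]
e_total = total_error(rounds)
```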
7. The method of claim 1, wherein: the error in step 103) may be calculated by a distance metric; the comparison is carried out by comparing the causal strength, which is calculated as follows:
$$CS_{X \to Y} = \ln \frac{\hat{e}}{e}$$
wherein e denotes, under the H1 hypothesis, the average sum of squared residuals on the cross-validation test sets for the regression of the effect variable on the cause variable and the other covariates, and $\hat{e}$ denotes, under the H0 hypothesis, the average sum of squared residuals on the cross-validation test sets for the regression of the effect variable on all random variables except the cause variable.
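A minimal sketch of the causal strength comparison follows. It assumes the strength is the natural log of the ratio of the H0 test error to the H1 test error, ordinary least squares as the regression, and contiguous K-fold splits; the function names and synthetic data are illustrative assumptions.

```python
import numpy as np

def cv_squared_error(features, target, k=5):
    """Average over k folds of the test-set sum of squared residuals."""
    n = len(target)
    errors = []
    for test_idx in np.array_split(np.arange(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        a_train = np.column_stack([np.ones(train.sum()), features[train]])
        coef, *_ = np.linalg.lstsq(a_train, target[train], rcond=None)
        a_test = np.column_stack([np.ones(len(test_idx)), features[test_idx]])
        errors.append(np.sum((target[test_idx] - a_test @ coef) ** 2))
    return float(np.mean(errors))

def causal_strength(x, y, z, k=5):
    """Log ratio of the H0 error (Y on Z) to the H1 error (Y on X and Z)."""
    e_hat = cv_squared_error(z, y, k)                      # H0
    e = cv_squared_error(np.column_stack([x, z]), y, k)    # H1
    return float(np.log(e_hat / e))

rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=(n, 2))
x = rng.normal(size=n)
unrelated = rng.normal(size=n)       # has no influence on y by construction
y = 1.5 * x + z[:, 0] + 0.2 * rng.normal(size=n)

cs_true = causal_strength(x, y, z)
cs_null = causal_strength(unrelated, y, z)
```

In this toy setting the strength for the genuine predictor is clearly positive, while the strength for the unrelated variable stays near zero, since adding it does not reduce the held-out error.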
8. A method for causal relationship prediction based on multi-sample data according to any of claims 1-7, wherein: the expression data of a specific representation include gene expression data, biological chain data, and disease transmission model data.
9. A method for causal relationship prediction based on multi-sample data according to any of claims 1-7, wherein: the construction device comprises an initial network construction module, a data segmentation module, a regression fitting module, a fitting error module, a cause and effect judgment module and a network judgment module;
the initial network construction module is used for calculating correlation coefficients among input data by using expression data of a specific representation as original input data and constructing an initial network by using the correlation coefficients among the data;
the data segmentation module is used for segmenting original input data into a training set and a test set; the training set is used for training the regression fitting module, and the testing set is used for calculating the error of the fitting error module;
the regression fitting module is used for posing an H0 hypothesis and an H1 hypothesis for each edge on the basis of the initial correlation network, where H0 represents that no causal relationship exists on the edge and H1 represents that a causal relationship exists, and for performing a separate regression on part of the expression data under each hypothesis to obtain a linear regression equation for each edge under each hypothesis;
the fitting error module is used for substituting the test set into the regression equations obtained by the regression fitting module and calculating the resulting fitting errors;
the cause and effect judgment module is used for judging the causal strength from the multiple errors obtained by the fitting error module over the multiple data set segmentations produced by the data segmentation module, so as to obtain the real causal relationship;
and the network judgment module is used for traversing, by means of the above modules, the node pairs corresponding to the edges in the initial network, so as to finally obtain the real causal relationship network.
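The way the modules fit together can be sketched as a single traversal over the edges of the initial network. This is only an illustration under several assumptions: Pearson correlation with a fixed cutoff for the initial network, ordinary least squares, contiguous K-fold splits, a log-ratio causal strength, and a zero cutoff on that strength. All names and parameter values are hypothetical.

```python
import numpy as np

def cv_error(features, target, k):
    """Data segmentation, regression fitting and fitting error in one helper."""
    n = len(target)
    errors = []
    for test_idx in np.array_split(np.arange(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        a_train = np.column_stack([np.ones(train.sum()), features[train]])
        coef, *_ = np.linalg.lstsq(a_train, target[train], rcond=None)
        a_test = np.column_stack([np.ones(len(test_idx)), features[test_idx]])
        errors.append(np.sum((target[test_idx] - a_test @ coef) ** 2))
    return float(np.mean(errors))

def causal_network(data, corr_threshold=0.3, cs_threshold=0.0, k=5):
    """data: (n_samples, n_vars) array. Returns {(cause, effect): strength}."""
    n_vars = data.shape[1]
    corr = np.corrcoef(data, rowvar=False)        # initial network construction
    network = {}
    for i in range(n_vars):                       # candidate cause
        for j in range(n_vars):                   # candidate effect
            if i == j or abs(corr[i, j]) < corr_threshold:
                continue
            others = [v for v in range(n_vars) if v not in (i, j)]
            e_hat = cv_error(data[:, others], data[:, j], k)        # H0
            e = cv_error(data[:, others + [i]], data[:, j], k)      # H1
            strength = float(np.log(e_hat / e))   # cause and effect judgment
            if strength > cs_threshold:
                network[(i, j)] = strength
    return network

rng = np.random.default_rng(2)
n = 300
a = rng.normal(size=n)
b = 1.2 * a + 0.3 * rng.normal(size=n)   # b is generated from a
c = rng.normal(size=n)                   # c is unrelated to a and b
net = causal_network(np.column_stack([a, b, c]))
```

In this toy run the unrelated variable is removed at the correlation stage and receives no edges, while the pair of related variables is retained and scored.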
CN202211440950.7A 2022-11-17 2022-11-17 Causal relationship prediction method based on multi-sample data Pending CN115691652A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211440950.7A CN115691652A (en) 2022-11-17 2022-11-17 Causal relationship prediction method based on multi-sample data

Publications (1)

Publication Number Publication Date
CN115691652A true CN115691652A (en) 2023-02-03

Family

ID=85054033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211440950.7A Pending CN115691652A (en) 2022-11-17 2022-11-17 Causal relationship prediction method based on multi-sample data

Country Status (1)

Country Link
CN (1) CN115691652A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434969A (en) * 2023-06-14 2023-07-14 之江实验室 Multi-center chronic disease prediction device based on causal structure invariance
CN116434969B (en) * 2023-06-14 2023-09-12 之江实验室 Multi-center chronic disease prediction device based on causal structure invariance

Similar Documents

Publication Publication Date Title
CN114422381B (en) Communication network traffic prediction method, system, storage medium and computer equipment
Liu et al. Multiobjective criteria for neural network structure selection and identification of nonlinear systems using genetic algorithms
EP4198848A1 (en) Method and system for multi-step prediction of future wind speed based on automatic reservoir neural network
Sadat Hosseini et al. Short-term load forecasting of power systems by gene expression programming
US20220245499A1 (en) Quantum circuit simulation
CN115691652A (en) Causal relationship prediction method based on multi-sample data
CN113988464A (en) Network link attribute relation prediction method and equipment based on graph neural network
CN112187554A (en) Operation and maintenance system fault positioning method and system based on Monte Carlo tree search
Rivero et al. DoME: A deterministic technique for equation development and Symbolic Regression
EP3739473B1 (en) Optimization device and method of controlling optimization device
CN114124260B (en) Spectrum prediction method, device, medium and equipment based on composite 2D-LSTM network
US11989656B2 (en) Search space exploration for deep learning
CN114694379A (en) Traffic flow prediction method and system based on self-adaptive dynamic graph convolution
Lei et al. A novel time-delay neural grey model and its applications
Oh et al. Genetically optimized hybrid fuzzy set-based polynomial neural networks
CN115545210A (en) Method and related apparatus for quantum computing
WO2019194128A1 (en) Model learning device, model learning method, and program
Alqawasmi et al. Estimation of ARMA model order using artificial neural networks
US20230401363A1 (en) GaN Distributed RF Power Amplifier Automation Design with Deep Reinforcement Learning
Abbas et al. Volterra system identification using adaptive genetic algorithms
CN113852970B (en) Multi-dimensional spectrum prediction method, system, device and medium based on graph neural network
WO2020054046A1 (en) Optimization device, control method of optimization device, and control program of optimization device
Ng et al. Comparative studies in problems of missing extreme daily streamflow records
CN115859048A (en) Noise processing method and device for partial discharge signal
CN111583991B (en) Method, system, equipment and medium for gene regulation and control network reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination