CN115691652A

CN115691652A - Causal relationship prediction method based on multi-sample data

Info

Publication number: CN115691652A
Application number: CN202211440950.7A
Authority: CN
Inventors: 刘小平; 张月蕾; 常啸; 陈洛南
Original assignee: Hangzhou Institute of Advanced Studies of UCAS
Current assignee: Hangzhou Institute of Advanced Studies of UCAS
Priority date: 2022-11-17
Filing date: 2022-11-17
Publication date: 2023-02-03

Abstract

The invention discloses a causal relationship prediction method based on multi-sample data. Expression data of a specific representation are used as initial input data, correlation coefficients among the initial input data are calculated, and an initial network is constructed from those correlation coefficients. On the basis of this initial correlation network, an H0 hypothesis and an H1 hypothesis are set up for each pair of edges, and a different regression is performed on part of the expression data under each hypothesis to obtain a linear regression equation for each pair of edges. The same remaining expression data are then substituted into the regression equations of the two hypotheses to obtain the errors of the two regressions, and comparing the fitting errors of the two regression equations determines whether a causal relationship exists, so that the whole causal network is constructed. The invention provides a new causal concept and a method for establishing it, called the CVP method for short, which is a data-driven, model-free algorithm for processing time-independent data.

Description

Causal relationship prediction method based on multi-sample data

Technical Field

The invention relates to the technical field of computational biology and bioinformatics, in particular to a causal relationship prediction method based on multi-sample data.

Background

Causal inference from observed data is a core problem in many research fields of natural science and engineering, such as biology, earth science, economics, medicine, neuroscience, and machine learning. Granger causality (GC), a representative method, infers potential causal relationships from time series data, and since 1969 GC-based methods have been widely used in many fields. However, most biological data are time-independent, for example data indexed by phenotype or phase rather than time series, so this approach is not suitable for them.

Causality between molecules can generally be represented by a causal graph, in which nodes represent different molecules and directed edges characterize direct causal relationships between them. Existing mainstream algorithms for predicting causal relationships include the well-known Granger causality (GC), convergent cross mapping (CCM), transfer entropy (TE), and Bayesian theory. Specifically, GC relies on time-lag information, explaining the current state with past information, for example through an autoregressive model. As a nonlinear version of the GC method, TE uses the temporal asymmetry of information to determine causal relationships and is widely used in biological fields such as neuroscience and physiology. Both GC and TE infer causality in the original state space. CCM, by contrast, measures causality in the delay-embedding space, going beyond GC theory to reflect the nonlinear relationship between two variables in the current state. All of the above algorithms require time-dependent or time series data to measure causal relationships. In contrast, Bayesian networks and structural causal models (SCMs) can process time-independent data on the basis of statistical independence and intervention, and can identify direct causal relationships between molecules. However, these methods all rely on a directed acyclic graph structure to infer causal relationships, which limits their application to biomolecular networks, which typically have feedback loops or circular interactions.

Therefore, inferring causal relationships in real biological and molecular systems with high precision from time-independent data rather than time-dependent data, so as to better reveal the relationships in the data and provide strong support for practical decisions, remains a technical problem to be solved.

Disclosure of Invention

The invention overcomes the defects of the prior art and provides a causal relationship prediction method based on multi-sample data.

The technical scheme of the invention is as follows:

A causal relationship prediction method based on multi-sample data, which infers cause and effect without considering time dependency, specifically comprises the following steps:

101) Initial network construction step: using expression data of a specific representation as initial input data, calculating correlation coefficients among the initial input data, and constructing an initial network from those correlation coefficients;

102) Causal relationship judgment step: on the basis of the initial correlation network, setting up an H0 hypothesis and an H1 hypothesis for each pair of edges, where an H0 edge represents that the causal relationship does not exist and an H1 edge represents that it exists; performing a different regression on part of the expression data under each hypothesis to obtain a linear regression equation for each pair of edges;

103) Causal network construction step: substituting the same remaining expression data into the regression equations of the two hypotheses to obtain the errors of the two regressions; comparing the fitting errors of the two determines whether the causal relationship exists, and the whole causal network is constructed accordingly.

Further, in step 101), the relevance between the two variables is evaluated by using the partial correlation coefficient to eliminate the indirect influence of the genes, so as to obtain the direct relevance between the genes and construct a relevance network.

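Further, the Pearson correlation coefficient is calculated as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

wherein X and Y are two variables, cov(X, Y) is their covariance, and σ_X and σ_Y are their standard deviations;

the first-order partial correlation coefficient is calculated as follows:

$$r_{12\cdot 3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{\left(1 - r_{13}^{2}\right)\left(1 - r_{23}^{2}\right)}}$$

in the formula, r₁₂ represents the correlation coefficient between variable 1 and variable 2, r₁₃ represents the correlation coefficient between variable 1 and variable 3, and r₂₃ represents the correlation coefficient between variable 2 and variable 3.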
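As an illustration of the two coefficients, the following minimal Python sketch computes the Pearson correlation of two sample vectors and the first-order partial correlation from three pairwise coefficients. The function names and the toy values are assumptions made for this example and do not come from the patent.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2 given variable 3."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Toy data (hypothetical): g1 and g2 are both driven by g3,
# so most of their raw correlation is indirect.
g3 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
g1 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
g2 = [0.9, 2.2, 2.8, 4.1, 5.2, 5.8]

r12, r13, r23 = pearson(g1, g2), pearson(g1, g3), pearson(g2, g3)
r12_given_3 = partial_corr(r12, r13, r23)
print(f"raw r12 = {r12:.3f}, partial r12.3 = {r12_given_3:.3f}")
```

In this toy case the partial coefficient is much smaller in magnitude than the raw one, which is the sense in which the partial correlation removes the indirect influence of a third variable.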

Further, in step 102), under the H0 hypothesis the effect variable is fitted by regression on the covariates only, excluding the cause variable; under the H1 hypothesis the effect variable is fitted by regression on the cause variable together with the covariates; the specific regression formulas are respectively as follows:

H0: $Y = \hat{f}(Z_1, Z_2, \ldots, Z_{n-2}) + \hat{\varepsilon}$

H1: $Y = f(X, Z_1, Z_2, \ldots, Z_{n-2}) + \varepsilon$

wherein Y represents the effect variable; X represents the cause variable; $Z_1, \ldots, Z_{n-2}$ are the other variables, excluding X and Y; and $\varepsilon$ and $\hat{\varepsilon}$ are noise terms.

Further, the expression data of the specific representation are divided into a training set and a test set, where the training set is used for regression and the test set is used for evaluating the error; the division modes include K-fold cross-validation and the leave-one-out method.

Further, the training-set data are substituted into the different regression equations to fit them, and the total squared error is obtained in the following specific manner:

first, the distance between the output of the fitted equation and the true value is taken as the error, and the errors obtained in the multiple cross-validation tests are summed to obtain the total error, by the following formula:

$$e = \sum_{i=1}^{m} e_i^{2}$$

wherein $e_i$ represents the error obtained in the i-th cross-validation test, and m represents the total number of cross-validations.

Further, the error in step 103) may be calculated by means of a distance metric; the comparison is carried out by comparing the causal strength, which is calculated by the following formula:

$$CS_{X \to Y} = \ln\frac{\hat{e}}{e}$$

where e represents the mean sum of squared residuals on the cross-validated test set for the regression of the effect variable on the cause variable and the other variables under the H1 hypothesis, and $\hat{e}$ represents the mean sum of squared residuals on the cross-validated test set for the regression of the effect variable on all variables except the cause variable under the H0 hypothesis.

Further, the expression data of the specific representation include gene expression data, biological chain data, and disease transmission model data.

Further, the construction device comprises an initial network construction module, a data segmentation module, a regression fitting module, a fitting error module, a cause and effect judgment module and a network judgment module;

the initial network construction module is used for calculating correlation coefficients among input data by using expression data of a specific representation as original input data and constructing an initial network by using the correlation coefficients among the data;

the data segmentation module is used for segmenting original input data into a training set and a test set; the training set is used for training the regression fitting module, and the testing set is used for calculating the error of the fitting error module;

the regression fitting module is used for carrying out H0 hypothesis and H1 hypothesis on each pair of edges on the basis of the initial correlation network; the edge of H0 represents that the causal relationship does not exist, and the edge of H1 represents that the causal relationship exists; performing different regressions by using partial expression data on the basis of different assumptions to obtain a linear regression equation of each pair of edges;

the fitting error module is used for inputting the test set into the regression equations obtained by the regression fitting module and calculating the resulting fitting error;

the cause and effect judgment module is used for judging the causal strength from the multiple errors that the fitting error module obtains over the multiple data-set splits produced by the data segmentation module, so as to obtain the real causal relationship;

and the network judgment module is used for traversing, by means of the above modules, the node pairs corresponding to the edges in the initial network, finally obtaining the real causal relationship network.

Compared with the prior art, the invention has the following advantages: the invention provides a new causal concept and a method for establishing it, called the CVP method for short, which is a data-driven, model-free algorithm for processing time-independent data. The CVP method quantifies causality on time-independent data through cross-validated predictability and the statistical independence of the observed variables, which is similar to, but different from, GC on time-dependent data. The CVP method requires no redundant prior information and is suitable for real data from various backgrounds, including gene regulatory networks and other causal networks with feedback loops. The CVP method can effectively identify causal relationships in complex biological systems, and can better reveal regulatory mechanisms and explain biological functions. In particular, deriving disease-specific causal or regulatory networks may reveal the underlying mechanisms of molecular effects, providing accurate data and decision guidance for further precision treatment.

Drawings

FIG. 1 is a schematic diagram of the overall framework of the present invention;

Detailed Description

Reference will now be made in detail to the embodiments of the present invention, wherein like or similar reference numerals refer to like or similar elements or elements of similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and are only for explaining the present invention and are not to be construed as limiting the present invention.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Reference numerals in the various embodiments are provided for steps of the description only and are not necessarily associated in a substantially sequential manner. Different steps in each embodiment can be combined in different sequences, so that the purpose of the invention is achieved.

The invention is further described with reference to the following figures and detailed description.

Example:

As shown in FIG. 1, a causal relationship prediction method based on multi-sample data specifically includes the following steps:

101) Initial network construction step: expression data of a specific representation are used as initial input data, correlation coefficients among the initial input data are calculated, and an initial network is constructed from those correlation coefficients. The expression data of a specific representation include gene expression data, biological chain data, disease transmission model data, and the like. For example, expression data under normal conditions serve as reference data for disease research; such data are mainly gene expression data, chiefly RNA-Seq data, together with other data meeting the standard, such as biological chain data and disease transmission model data.

Specifically, the correlation between two variables is evaluated using the Pearson correlation coefficient (PCC) or the partial correlation coefficient (PTCC); the partial correlation coefficient removes the indirect influence of other genes, so that the direct correlation between genes is obtained and a correlation network is constructed.

The Pearson correlation coefficient is calculated as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

wherein X and Y are two variables, cov(X, Y) is their covariance, and σ_X and σ_Y are their standard deviations.

That is, the Pearson correlation coefficient measures the strength of correlation between two variables. When the PCC or PTCC is used to construct the correlation network, the network nodes represent genes and the edges represent the strength of correlation between genes. The magnitude of the correlation is affected by many factors, such as sample size and sequencing depth. To reduce the impact of such technical factors across different data sets, an adaptive threshold is used instead of a fixed one. If the feature dimension is below 1000, all edges of the correlation network are retained; otherwise, only the most highly correlated edges are retained, such that the average degree of the network is 100.
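The adaptive threshold can be sketched as follows. This is one possible reading, not the patented implementation: it assumes that "average degree 100" means keeping the n × 100 / 2 strongest edges of an undirected network with n nodes, and the function name, argument names, and toy values are hypothetical.

```python
def build_correlation_network(genes, corr, dim_cutoff=1000, avg_degree=100):
    """Return retained undirected edges as (gene_a, gene_b, r), strongest first.

    genes: list of gene names.
    corr:  dict mapping (gene_a, gene_b) -> correlation coefficient.
    """
    edges = sorted(((a, b, r) for (a, b), r in corr.items()),
                   key=lambda edge: abs(edge[2]), reverse=True)
    if len(genes) < dim_cutoff:
        return edges  # low-dimensional data: keep every edge
    # Assumption: an undirected graph with n nodes and average degree d
    # has n * d / 2 edges, so keep that many of the strongest edges.
    n_keep = len(genes) * avg_degree // 2
    return edges[:n_keep]

genes = ["g1", "g2", "g3", "g4"]
corr = {("g1", "g2"): 0.9, ("g1", "g3"): -0.7, ("g2", "g3"): 0.2,
        ("g1", "g4"): 0.1, ("g2", "g4"): -0.5, ("g3", "g4"): 0.05}

all_edges = build_correlation_network(genes, corr)
# Shrunken cutoffs so that the toy example exercises the thresholding branch.
top_edges = build_correlation_network(genes, corr, dim_cutoff=4, avg_degree=2)
```

Ranking by absolute value treats strong negative correlations as equally informative as strong positive ones, which is a design choice of this sketch.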

The first-order partial correlation coefficient is calculated as follows:

$$r_{12\cdot 3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{\left(1 - r_{13}^{2}\right)\left(1 - r_{23}^{2}\right)}}$$

wherein r₁₂, r₁₃ and r₂₃ are the correlation coefficients between variables 1 and 2, variables 1 and 3, and variables 2 and 3, respectively.

That is, the partial correlation coefficient from statistics is used to evaluate the direct correlation between two variables: it is the correlation that remains between them after the influence of a set of other random variables is removed. Calculating the PTCC between two genes also yields a corresponding hypothesis-test p-value, and the edges whose PTCC p-value is less than 0.1 are selected to obtain the partial correlation network.
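The text does not say which hypothesis test produces the p-value. The sketch below assumes one common choice, Fisher's z-transform with a normal approximation, and reads the filtering rule as keeping edges with p < 0.1. The function names and the toy coefficients are hypothetical.

```python
import math

def partial_corr_pvalue(r, n_samples, n_controls=1):
    """Two-sided p-value for a partial correlation via Fisher's z-transform.

    Assumption: z = atanh(r) * sqrt(n - k - 3) is approximately standard
    normal under the null, where k is the number of controlled variables.
    """
    z = math.atanh(r) * math.sqrt(n_samples - n_controls - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def filter_edges(ptcc, n_samples, alpha=0.1):
    """Keep the edges whose PTCC p-value is below alpha."""
    return {edge: r for edge, r in ptcc.items()
            if partial_corr_pvalue(r, n_samples) < alpha}

# Hypothetical first-order partial correlations estimated from 30 samples.
ptcc = {("g1", "g2"): 0.62, ("g1", "g3"): 0.05, ("g2", "g3"): -0.48}
kept = filter_edges(ptcc, n_samples=30)
```

A t-test on the partial correlation would be an equally reasonable choice; the normal approximation is used here only because it needs nothing beyond the standard library.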

102) Causal relationship judgment step: on the basis of the initial correlation network, an H0 hypothesis and an H1 hypothesis are set up for each pair of edges. An H0 edge represents that the causal relationship does not exist, and an H1 edge represents that it exists. A different regression is performed on part of the expression data under each hypothesis to obtain a linear regression equation for each pair of edges. That is, for each gene, the sub-network formed by the gene and its first-order neighbors is extracted from the correlation network, and the causal strength (CS) of each neighboring gene on that gene is quantified by this scheme.

Under the H0 hypothesis, the effect variable is fitted by regression on the other variables only, excluding the cause variable; under the H1 hypothesis, the effect variable is fitted by regression on the cause variable together with the other variables; the specific regression formulas are respectively as follows:

H0: $Y = \hat{f}(Z_1, Z_2, \ldots, Z_{n-2}) + \hat{\varepsilon}$

H1: $Y = f(X, Z_1, Z_2, \ldots, Z_{n-2}) + \varepsilon$

wherein Y represents the effect variable; X represents the cause variable; $Z_1, \ldots, Z_{n-2}$ are the other covariates, excluding X and Y; and $\varepsilon$ and $\hat{\varepsilon}$ are noise terms.

103) Causal network construction step: the same remaining expression data are substituted into the regression equations of the two hypotheses to obtain the errors of the two regressions. Comparing the fitting errors of the two determines whether the causal relationship exists, and the whole causal network is constructed accordingly. In brief, redundant causal relationships are removed on the basis of a significance test of the causal strength (CS) to obtain the optimal causal sub-network of each gene, and the sub-networks of all genes are integrated to obtain the final causal network.

Specifically, the expression data of a specific representation are divided into a training set and a test set, where the training set is used for regression and the test set is used for evaluating the error; the division modes include K-fold cross-validation and the leave-one-out method.

The training-set data are substituted into the different regression equations to fit them, and the total squared error is obtained in the following specific manner:

first, the distance between the output of the fitted equation and the true value is taken as the error, and the errors obtained in the multiple cross-validation tests are summed to obtain the total error, by the following formula:

$$e = \sum_{i=1}^{m} e_i^{2}$$

wherein $e_i$ represents the error obtained in the i-th cross-validation test, and m represents the total number of cross-validation tests.
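The splitting and error accumulation can be sketched as below. The sketch assumes that each fold error is the Euclidean distance between predictions and true values on that fold and that the total is the sum of squared fold errors. It uses a placeholder model that predicts the training mean, chosen only to keep the example short; the fold assignment by index remainder and all names are also assumptions.

```python
import math

def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs; k == n gives leave-one-out."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def total_squared_error(y, k, fit, predict):
    """Sum over folds of e_i ** 2, where e_i is the Euclidean distance
    between predictions and true values on the i-th test fold."""
    total = 0.0
    for train, test in k_fold_splits(len(y), k):
        model = fit([y[i] for i in train])
        e_i = math.sqrt(sum((y[i] - predict(model)) ** 2 for i in test))
        total += e_i ** 2
    return total

# Placeholder model (an assumption for illustration): predict the training mean.
def fit_mean(train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model):
    return model

y = [1.0, 2.0, 3.0, 4.0]
e_loo = total_squared_error(y, k=len(y), fit=fit_mean, predict=predict_mean)
e_2fold = total_squared_error(y, k=2, fit=fit_mean, predict=predict_mean)
```

In the method itself, the placeholder would be replaced by the H0 or H1 regression, so that the same splits produce one total error per hypothesis.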

The error can be calculated by means of a distance metric, and the comparison is carried out by comparing the causal strength, which is calculated by the following formula:

$$CS_{X \to Y} = \ln\frac{\hat{e}}{e}$$

The causal strength $CS_{X \to Y}$ can be defined as the difference between the ability to explain Y without X and with X, that is, the difference between $\hat{e}$ and e, where e is the test error under the H1 hypothesis and $\hat{e}$ is the test error under the H0 hypothesis.

A significance test can also be used to verify the difference between e and $\hat{e}$. However, statistical tests are too time-consuming for inferring a large-scale network. The causal strength $\omega_{X \to Y}$ is therefore used to evaluate the difference between the test-sample error terms efficiently, which is better suited to large-scale networks.
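The decision rule can be sketched in a few lines. The sketch assumes a log-ratio form of the causal strength, ln(ê / e), which is positive exactly when including the cause variable lowers the test error. The function names and the error values are hypothetical.

```python
import math

def causal_strength(e_h0, e_h1):
    """Assumed log-ratio causal strength: ln(e_h0 / e_h1).

    e_h0: total test error of the H0 regression (cause variable masked).
    e_h1: total test error of the H1 regression (cause variable included).
    """
    return math.log(e_h0 / e_h1)

def has_causal_edge(e_h0, e_h1):
    """A positive causal strength is read as 'X causes Y'."""
    return causal_strength(e_h0, e_h1) > 0

cs_present = causal_strength(e_h0=12.0, e_h1=3.0)  # X improves the prediction of Y
cs_absent = causal_strength(e_h0=3.0, e_h1=3.6)    # X does not help
```

Because the rule only compares two numbers per candidate edge, it avoids running a separate statistical test for each edge.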

The specific process is as follows:

1) Target variable pairs are selected according to the initial network, and two different hypotheses, H0 and H1, are set up for each pair.

The H0 hypothesis in this step assumes that one variable (the cause variable) has no causal effect on the other variable (the effect variable).

The H1 hypothesis in this step assumes that one variable (the cause variable) has a causal effect on the other variable (the effect variable).

2) The data set is divided by cross-validation into a training set and a test set, and the data of the cause variable in the data set are selectively masked according to the hypothesis.

The selective masking in this step means that the information of the cause variable is masked under the H0 hypothesis and included under the H1 hypothesis.

The data division modes for cross-validation in this step include K-fold cross-validation and the leave-one-out method.

3) Under each hypothesis, a regression equation for the effect variable is established using the training-set data.

The regression equations established in this step are as follows:

the regression equation under the H0 hypothesis is:

$Y = \hat{f}(Z_1, Z_2, \ldots, Z_{n-2}) + \hat{\varepsilon}$

the regression equation under the H1 hypothesis is:

$Y = f(X, Z_1, Z_2, \ldots, Z_{n-2}) + \varepsilon$

The training-set data are substituted into the different regression equations to fit them, and the total squared error is obtained. The distance between the output of the fitted equation and the true value is taken as the error, and the errors obtained in the multiple cross-validation tests are summed to obtain the total error. The specific formula is as follows:

$$e = \sum_{i=1}^{m} e_i^{2}$$

wherein $e_i$ is the error obtained in the i-th cross-validation test and m is the total number of cross-validation tests.

The errors obtained under the two hypotheses are compared to determine whether a causal relationship exists between the pair of variables. The comparison is defined by the causal strength: if the causal strength is positive, the causal relationship is considered to exist; conversely, if it is negative, the causal relationship is considered to be absent. The specific formula of the causal strength is as follows:

$$CS_{X \to Y} = \ln\frac{\hat{e}}{e}$$

wherein e and $\hat{e}$ are the total errors under the H1 hypothesis and the H0 hypothesis, respectively.

The above steps are repeated for every variable pair connected by an edge in the initial network to obtain all the causal relationships in the initial network, that is, the final causal relationship network (target network).
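The whole procedure can be put together in a short, standard-library-only Python sketch. It is an illustration under stated assumptions rather than the patented implementation: the regression f is taken to be ordinary least squares, folds are assigned by index remainder, the causal strength is taken as ln(ê / e), every other variable in the data set is used as a covariate, and the significance-test pruning of redundant edges is omitted. All function names and the simulated data are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols_fit(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    xs = [[1.0] + list(r) for r in rows]
    p = len(xs[0])
    xtx = [[sum(r[i] * r[j] for r in xs) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(xs, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def cv_error(data, effect, predictors, k=5):
    """Total squared test error over k folds for regressing effect on predictors."""
    def rows(idx):
        return [[data[v][i] for v in predictors] for i in idx]
    n = len(data[effect])
    total = 0.0
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        beta = ols_fit(rows(train), [data[effect][i] for i in train])
        total += sum((data[effect][i] - predict(beta, r)) ** 2
                     for i, r in zip(test, rows(test)))
    return total

def causal_strength(data, cause, effect, k=5):
    others = [v for v in data if v not in (cause, effect)]
    e_h0 = cv_error(data, effect, others, k)            # H0: cause variable masked
    e_h1 = cv_error(data, effect, [cause] + others, k)  # H1: cause variable included
    return math.log(e_h0 / e_h1)

def cvp_network(data, edges, k=5):
    """Test both directions of each undirected edge; keep those with CS > 0."""
    directed = {}
    for a, b in edges:
        for cause, effect in ((a, b), (b, a)):
            cs = causal_strength(data, cause, effect, k)
            if cs > 0:
                directed[(cause, effect)] = cs
    return directed

# Simulated samples (hypothetical): Y depends on X and Z plus small noise.
random.seed(0)
n = 60
X = [random.gauss(0, 1) for _ in range(n)]
Z = [random.gauss(0, 1) for _ in range(n)]
Y = [2 * x + 0.5 * z + random.gauss(0, 0.1) for x, z in zip(X, Z)]
data = {"X": X, "Y": Y, "Z": Z}
net = cvp_network(data, [("X", "Y")])
```

Each direction of an edge is tested independently in this sketch, so both directions can be retained for the same pair of variables; the significance-based removal of redundant relationships described above is not part of the sketch.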

While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A causal relationship prediction method based on multi-sample data, characterized in that the method specifically comprises the following steps:

102) Causal relationship judgment step: on the basis of the initial correlation network, setting up an H0 hypothesis and an H1 hypothesis for each pair of edges, where an H0 edge represents that the causal relationship does not exist and an H1 edge represents that it exists; performing a different regression on part of the expression data under each hypothesis to obtain a linear regression equation for each pair of edges;

2. A method of causal relationship prediction based on multi-sample data as claimed in claim 1, wherein: in the step 101), estimating the correlation between two variables by using a Pearson correlation coefficient or a partial correlation coefficient, and constructing a correlation network; wherein the partial correlation coefficient is used for eliminating indirect influence of genes so as to obtain direct correlation between the genes.

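3. A method of causal relationship prediction based on multi-sample data as claimed in claim 2, wherein: the Pearson correlation coefficient is calculated as follows:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

wherein X and Y are two variables, cov(X, Y) is their covariance, and σ_X and σ_Y are their standard deviations;

the first-order partial correlation coefficient is calculated as follows:

$$r_{12\cdot 3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{\left(1 - r_{13}^{2}\right)\left(1 - r_{23}^{2}\right)}}$$

in the formula, r₁₂ represents the correlation coefficient between variable 1 and variable 2, r₁₃ represents the correlation coefficient between variable 1 and variable 3, and r₂₃ represents the correlation coefficient between variable 2 and variable 3.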

4. A method of causal relationship prediction based on multi-sample data as claimed in claim 1, wherein: in step 102), under the H0 hypothesis the effect variable is fitted by regression on the covariates only, excluding the cause variable; under the H1 hypothesis the effect variable is fitted by regression on the cause variable together with the other variables; the specific regression formulas are respectively as follows:

H0: $Y = \hat{f}(Z_1, Z_2, \ldots, Z_{n-2}) + \hat{\varepsilon}$

H1: $Y = f(X, Z_1, Z_2, \ldots, Z_{n-2}) + \varepsilon$

wherein Y represents the effect variable; X represents the cause variable; $Z_1, \ldots, Z_{n-2}$ are the other covariates, excluding X and Y; and $\varepsilon$ and $\hat{\varepsilon}$ are noise terms.

5. The method of claim 1, wherein: the expression data of a specific representation are divided into a training set and a test set, where the training set is used for regression and the test set is used for evaluation; the division modes include K-fold cross-validation and the leave-one-out method.

6. A method of causal relationship prediction based on multi-sample data as claimed in claim 5, wherein: the training-set data are substituted into the different regression equations to fit them, and the total error is obtained in the following specific manner:

$$e = \sum_{i=1}^{m} e_i^{2}$$

wherein $e_i$ represents the error obtained in the i-th cross-validation test, and m represents the total number of cross-validations.

7. The method of claim 1, wherein: the error in step 103) may be calculated by means of a distance metric; the comparison is carried out by comparing the causal strength, which is calculated by the following formula:

$$CS_{X \to Y} = \ln\frac{\hat{e}}{e}$$

wherein e represents the mean sum of squared residuals on the cross-validated test set obtained by regressing the effect variable on the cause variable and the other covariates under the H1 hypothesis, and $\hat{e}$ represents the mean sum of squared residuals on the cross-validated test set obtained by regressing the effect variable on all variables except the cause variable under the H0 hypothesis.

8. A method for causal relationship prediction based on multi-sample data according to any of claims 1-7, wherein: expression data for a particular representation includes gene expression data, biological chain data, disease transmission model data.

9. A method for causal relationship prediction based on multi-sample data according to any of claims 1-7, wherein: the construction device comprises an initial network construction module, a data segmentation module, a regression fitting module, a fitting error module, a cause and effect judgment module and a network judgment module;

the cause and effect judgment module is used for judging the causal strength from the multiple errors that the fitting error module obtains over the multiple data-set splits produced by the data segmentation module, so as to obtain the real causal relationship;

and the network judgment module traverses, by means of the above modules, the node pairs corresponding to the edges in the initial network, finally obtaining the real causal relationship network.