CN113808665B

CN113808665B - Causal correlation analysis method for fine localization of genome-wide pathogenic SNP

Info

Publication number: CN113808665B
Application number: CN202111149486.1A
Authority: CN
Inventors: 薛付忠; 孙晓茹; 李洪凯; 杨帆
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2024-03-08
Anticipated expiration: 2041-09-29
Also published as: CN113808665A

Abstract

The invention provides a causal correlation analysis method for fine localization of genome-wide pathogenic SNPs, used to fine-map the pathogenic SNPs of complex human diseases and to reduce the false positive rate of GWAS results. Under the guidance of a causal inference framework, a causal GWAS analysis strategy for genome-wide fine mapping of pathogenic loci (the CDSFM algorithm) is constructed. Under the constraint of a specific causal graph model, the strategy applies stepwise conditional-independence adjustment, which effectively reduces the false positive rate, improves the true positive rate, raises the hit rate for capturing pathogenic SNPs to more than 90%, and gives higher detection efficiency.

Description

Causal correlation analysis method for fine localization of genome-wide pathogenic SNP

Technical Field

The invention relates to the technical field of biological genetic engineering, in particular to a causal correlation analysis method for fine positioning of a whole genome pathogenic SNP.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Since the genome-wide association study (GWAS) method was proposed, tens of thousands of SNPs across the genome have been reported to be statistically associated with more than 4000 common diseases. However, few of these SNPs have been functionally validated in the laboratory, and their genetic mechanisms remain difficult to elucidate. Such a high false positive rate not only complicates subsequent validation but also leads scientists outside genetics to doubt GWAS results. How to further pinpoint the true pathogenic SNPs and reduce the false positive rate is a problem widely studied by researchers at present.

The inventors found that many researchers have proposed various algorithms for fine positioning from multiple angles, which can be roughly classified into the following four categories:

(1) Heuristic fine-mapping methods. The usual approach is to screen marginally associated SNPs with a generalized linear regression model or a generalized linear mixed model, and then, based on the LD structure between the top SNP and the surrounding SNPs, select the SNPs whose R² exceeds a certain threshold. However, such methods cannot effectively reduce the false positive rate when the SNPs in a region are highly correlated.

(2) Conditional methods, represented by conditional regression analysis, which judge whether the conditional P values of the remaining SNPs are still significant once the top SNP is given. In many cases, however, the top SNP in a region is not necessarily a true pathogenic site, and the judgment of the model is affected when two SNPs are highly correlated.

(3) Penalized regression models, represented by LASSO, which perform variable selection by judging whether a regression coefficient is 0. The sparse model built under this strategy keeps only a few SNPs in a region of highly correlated SNPs, so true pathogenic SNPs are easily deleted by mistake.

(4) Bayesian models, represented by the Bayesian variable selection model, which perform fine mapping by computing the posterior probability that each SNP is a pathogenic site. These methods often require the number of pathogenic sites to be preset, and an incorrect setting affects the analysis results.
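The LD-based screening described in category (1) can be illustrated with a minimal sketch. It computes the squared Pearson correlation (r²) between two genotype vectors and keeps the SNPs whose r² with the top SNP exceeds a threshold. The function names, the 0/1/2 allele-count coding, the SNP identifiers, and the 0.8 threshold are assumptions made for illustration and do not come from the patent.

```python
def ld_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding assumed)."""
    n = len(g1)
    if n != len(g2) or n == 0:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return float("nan")  # monomorphic SNP: r2 is undefined
    return cov * cov / (v1 * v2)


def snps_in_ld_with_top(genotypes, top, threshold=0.8):
    """Return the SNP ids whose r2 with the top SNP exceeds the (hypothetical) threshold."""
    return [snp for snp, g in genotypes.items()
            if snp != top and ld_r2(genotypes[top], g) > threshold]


# Toy genotypes for six individuals (hypothetical data).
genotypes = {
    "rsA": [0, 1, 2, 1, 0, 2],
    "rsB": [0, 1, 2, 1, 0, 2],  # identical to rsA, so r2 = 1
    "rsC": [2, 0, 1, 1, 2, 0],
}
tagged = snps_in_ld_with_top(genotypes, "rsA")
```

Because every SNP that is highly correlated with the top SNP passes such a filter, the sketch also shows why this heuristic alone cannot separate a pathogenic SNP from its close neighbours.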

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a causal correlation analysis method for fine localization of genome-wide pathogenic SNPs. Under the guidance of a causal inference framework, it constructs a causal GWAS analysis strategy and method for genome-wide fine mapping of pathogenic SNPs (Causal diagnostic-based Stepwise Fine-Mapping, the CDSFM algorithm), thereby reducing the false positive rate of GWAS results and improving the hit rate for capturing pathogenic SNPs.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A first aspect of the invention provides a causal correlation analysis method for fine localization of genome-wide pathogenic SNPs.

A causal correlation analysis method for fine localization of genome-wide pathogenic SNPs, comprising the following steps:

acquiring genome data to be analyzed;

performing a genome-wide association analysis on the genome data using a single-factor regression model, screening the significant SNPs with a P value below a preset threshold, defining them as the first candidate gene set, and sorting the SNPs in the first candidate gene set by P value in ascending order;

fixing the SNP with the smallest P value in the first candidate gene set, SNP₀₁, with the remaining SNPs forming the first candidate gene subset; performing a binary regression analysis of the outcome on SNP₀₁ together with each SNP in the first candidate gene subset in turn, testing the conditional independence between each of the two SNPs and the outcome, and deciding according to the rejection conditions whether a tested SNP is removed from the first candidate gene set;

if the removed SNP is SNP₀₁, or all other SNPs in the first candidate gene subset have been tested, fixing the second SNP in the sorted first candidate gene set, SNP₀₂, and repeating the above process until no more SNPs are removed from the first candidate gene subset; the SNPs remaining in the first candidate gene set after this process are recorded as the second candidate gene subset;

if the number of SNPs in the second candidate gene subset is less than or equal to 2, the calculation ends and all SNPs in the second candidate gene subset are the screened pathogenic sites; otherwise, the calculation continues until the number of SNPs in the m-th candidate gene subset is less than or equal to m+1, at which point the calculation stops.
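The first screening step above can be sketched as follows, assuming the marginal P values have already been computed by whatever single-factor regression is in use. The function name, the dictionary layout, and the tie-breaking by SNP identifier are illustrative assumptions; the default threshold of 5×10⁻⁸ is the example value given later in the description.

```python
def first_candidate_set(p_values, threshold=5e-8):
    """Keep SNPs whose marginal P value is below the threshold, sorted by ascending P.

    p_values maps a SNP id to its marginal P value (hypothetical input layout).
    Ties are broken by SNP id, which is an assumption of this sketch.
    """
    kept = [(p, snp) for snp, p in p_values.items() if p < threshold]
    kept.sort()
    return [snp for p, snp in kept]


# Hypothetical marginal P values from a single-factor GWAS scan.
marginal_p = {"snp1": 1e-12, "snp2": 3e-9, "snp3": 0.2, "snp4": 4e-10, "snp5": 6e-8}
s0 = first_candidate_set(marginal_p)
```

In this toy input `snp3` and `snp5` fail the threshold, and the remaining SNPs are ordered so that the first element plays the role of SNP₀₁.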

The rejection conditions for SNPs in this process are as follows: if the LD between the two SNPs is equal to 1, both SNPs are retained in the first candidate gene set; or, on analyzing the two conditional P values, if either P value is missing, both SNPs are retained in the first candidate gene set;

if neither P value is missing, the P values are compared with the preset threshold: if both P values are greater than the preset threshold, or both are less than it, both SNPs are retained in the first candidate gene set; if one P value is greater than the preset threshold and the other is less than it, the SNP whose P value is greater than the preset threshold is removed from the first candidate gene set.
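The pairwise rejection conditions above reduce to a small decision function. The sketch below is one possible reading: the function name and return labels are hypothetical, the LD measure is whatever the analyst uses (the text only says it is "equal to 1"), and a P value exactly equal to the threshold is treated as non-significant, which the text does not specify for the binary case.

```python
import math


def pair_decision(p_fixed, p_other, ld, alpha=0.05):
    """Decide which SNP of a two-SNP model to remove.

    Returns "fixed", "other", or None (keep both).
    """
    if math.isclose(ld, 1.0):
        return None  # perfect LD: both SNPs are retained
    if p_fixed is None or p_other is None or math.isnan(p_fixed) or math.isnan(p_other):
        return None  # a missing conditional P value: both SNPs are retained
    fixed_sig = p_fixed < alpha   # assumption: P == alpha counts as non-significant
    other_sig = p_other < alpha
    if fixed_sig == other_sig:
        return None  # both below or both above the threshold: keep both
    return "other" if fixed_sig else "fixed"


decisions = [
    pair_decision(1e-9, 0.4, 0.6),          # only the fixed SNP stays significant
    pair_decision(0.3, 1e-5, 0.6),          # only the other SNP stays significant
    pair_decision(1e-9, 1e-7, 0.6),         # both significant
    pair_decision(1e-9, 0.4, 1.0),          # perfect LD
    pair_decision(float("nan"), 0.4, 0.6),  # missing P value
]
```

Note that the rule can remove the fixed (top) SNP itself, which is what distinguishes it from a conditional analysis that always trusts the top SNP.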

Further, the m-th candidate gene subset is obtained by using m-ary regression analysis to test the conditional independence of the SNPs.

When the m-ary regression model is used to analyze the candidate gene set Sₘ₋₁ (m = 1, …, n), the first m-1 SNPs in Sₘ₋₁ are given first, and each remaining SNP is then added to the regression model in turn for an m-ary regression analysis against the outcome Y;

the corresponding rejection conditions in this process are: if the m conditional P values obtained are all greater than 0.05 or all less than 0.05, all m SNPs are retained in Sₘ₋₁; otherwise, the SNPs with a P value greater than or equal to 0.05 are deleted; if adding a new SNP causes a collinearity problem with a given SNP, both SNPs are retained; all SNPs finally remaining in Sₘ₋₁ are denoted Sₘ.
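One pass of the m-ary step can be sketched as below. The regression itself is left to a caller-supplied function, `fit_pvalues`, which is a hypothetical interface: it receives the list of SNPs in the model and returns their conditional P values, or `None` when the fit fails because of collinearity. Keeping the first m-1 SNPs in every model for the whole pass, even if one of them has been marked for removal, is a simplifying assumption of this sketch.

```python
def m_ary_stage(candidates, m, fit_pvalues, alpha=0.05):
    """Apply the m-ary rejection rule to the ordered candidate list S_{m-1}."""
    given = list(candidates[:m - 1])   # the first m-1 SNPs are held in the model
    removed = set()
    for snp in candidates[m - 1:]:     # add each remaining SNP in turn
        model = given + [snp]
        pvals = fit_pvalues(model)
        if pvals is None:
            continue                   # collinearity: the SNPs are retained
        significant = [pvals[s] < alpha for s in model]
        if all(significant) or not any(significant):
            continue                   # all below or all above alpha: keep all
        removed.update(s for s in model if pvals[s] >= alpha)
    return [s for s in candidates if s not in removed]


# Hypothetical conditional P values for three ternary models.
table = {
    ("s1", "s2", "s3"): {"s1": 1e-6, "s2": 1e-4, "s3": 0.01},
    ("s1", "s2", "s4"): {"s1": 1e-6, "s2": 1e-4, "s4": 0.6},
    ("s1", "s2", "s5"): None,  # collinear with a given SNP
}
s_next = m_ary_stage(["s1", "s2", "s3", "s4", "s5"], 3, lambda model: table[tuple(model)])
```

Here only `s4` is dropped: `s3` stays because all three P values are significant, and `s5` stays because its model could not be fitted.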

The single-factor or multi-factor regression model is a linear regression model or a logistic regression model.

The statistical principles on which the conditional independence tests rely, covering the causal relationships that may arise, include:

considering both the LD structure and the conditional P values, a true pathogenic site does not become conditionally independent of the outcome because of a falsely associated site;

given a true pathogenic site, a falsely associated site is conditionally independent of the outcome;

when two pathogenic sites are in strong LD, both may be conditionally independent of the outcome at the same time;

when there is no true pathogenic SNP in the model, SNPs in stronger LD with the pathogenic SNP are more easily retained.
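The second principle above (a falsely associated site becomes conditionally independent of the outcome once the true pathogenic site is given) can be illustrated with a small simulation. Everything in this sketch is an assumption chosen for illustration: the sample size, allele frequency, effect size, the way the tag SNP is generated, and the use of ordinary least squares with a normal approximation for the P values (adequate only for large samples). It is not the patent's implementation.

```python
import math
import random


def ols_pvalues(columns, y):
    """OLS with an intercept; returns (slopes, two-sided P values), normal approximation."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, solved by Gauss-Jordan on [X'X | X'y | I].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    rhs = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    aug = [A[i] + [rhs[i]] + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    beta = [aug[i][k] for i in range(k)]
    inv_diag = [aug[i][k + 1 + i] for i in range(k)]  # diagonal of (X'X)^-1
    resid = [y[r] - sum(b * x for b, x in zip(beta, X[r])) for r in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    pvals = [math.erfc(abs(b) / math.sqrt(sigma2 * d) / math.sqrt(2))
             for b, d in zip(beta, inv_diag)]
    return beta[1:], pvals[1:]


rng = random.Random(0)
n = 5000
# Causal SNP: allele count 0/1/2 with allele frequency 0.3 (assumed).
causal = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
# Tag SNP: copies the causal genotype 90% of the time, otherwise an independent draw.
tag = [g if rng.random() < 0.9 else (rng.random() < 0.3) + (rng.random() < 0.3)
       for g in causal]
y = [1.0 * g + rng.gauss(0.0, 1.0) for g in causal]  # only the causal SNP affects Y

_, p_marginal = ols_pvalues([tag], y)                 # tag SNP alone
coefs, p_conditional = ols_pvalues([causal, tag], y)  # tag SNP given the causal SNP
```

In such a setup the tag SNP is strongly associated with Y on its own, while in the two-SNP model its coefficient shrinks towards zero and the causal SNP remains highly significant. The conditional P value of the tag SNP is therefore expected to exceed 0.05 in most runs, which is the situation in which the rejection conditions remove it.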

In a second aspect, the invention provides a causal correlation analysis system for fine localization of whole genome pathogenic SNPs.

A causal correlation analysis system for the fine localization of whole genome pathogenic SNPs, comprising:

a data acquisition module configured to: acquiring genome data to be analyzed;

a causal GWAS module configured to:

perform a genome-wide association analysis on the genome data using a single-factor regression model, screen the significant SNPs with a P value below a preset threshold, define the selected SNPs as the first candidate gene set, and sort the SNPs in the first candidate gene set by P value in ascending order;

fix the SNP with the smallest P value in the first candidate gene set, SNP₀₁, with the remaining SNPs forming the first candidate gene subset; perform a binary regression analysis of the outcome on SNP₀₁ together with each SNP in the first candidate gene subset in turn, test the conditional independence between each of the two SNPs and the outcome, and decide according to the rejection criteria whether to remove the SNPs from the first candidate gene set;

A third aspect of the invention provides a computer readable storage medium having stored thereon a program which when executed by a processor performs the steps of the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to the first aspect of the invention.

In a fourth aspect, the present invention provides an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to the first aspect of the invention when the program is executed.

Compared with the prior art, the invention has the beneficial effects that:

1. With the method, system, medium, or electronic device of the invention, a causal GWAS analysis strategy for genome-wide fine mapping of pathogenic sites (the CDSFM algorithm) is constructed under the guidance of a causal inference framework. Under the constraint of a specific causal graph model, the strategy applies stepwise conditional-independence adjustment, which effectively reduces the false positive rate (False Discovery Rate, FDR), improves the true positive rate (True Discovery Rate, TDR), raises the hit rate for capturing pathogenic SNPs to more than 90%, and gives higher detection efficiency.

2. The method, the system, the medium or the electronic equipment breaks through the limitation of statistical association, has overall performance obviously superior to the existing fine positioning methods such as a generalized linear regression model, a LASSO regression model, a GCTA model, a Bayesian variable selection regression model and the like, and provides a new strategy and a new method for fine positioning of genetic susceptibility sites of the whole genome.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

Fig. 1 is a statistical schematic diagram of a CDSFM algorithm according to an embodiment of the present invention.

Fig. 2 is an exemplary diagram of a CDSFM algorithm provided in embodiment 1 of the present invention.

FIG. 3 is a schematic diagram of the framework of the causal GWAS method for fine localization of whole genome-oriented susceptibility sites provided in example 2 of the present invention.

FIG. 4 is a schematic diagram comparing the results of the CDSFM algorithm with those of other fine-mapping algorithms when the outcome in example 2 is a quantitative trait.

FIG. 5 is a schematic diagram comparing the results of the CDSFM algorithm with those of other fine-mapping algorithms when the outcome in example 2 is a qualitative trait.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Embodiments of the invention and features of the embodiments may be combined with each other without conflict.

The invention constructs a fine-mapping algorithm, CDSFM, which considers the LD structure and the conditional P values at the same time. The statistical principles on which it is based are: a true pathogenic site does not become conditionally independent of the outcome because of a falsely associated site; given a true pathogenic site, a falsely associated site is conditionally independent of the outcome; when two pathogenic sites are in strong LD, both may be conditionally independent of the outcome at the same time; and when there is no true pathogenic SNP in the model, SNPs in stronger LD with the pathogenic SNP are more easily retained.

These four principles correspond one-to-one to the causal graph scenarios in Fig. 1 and together cover all the causal relationships that can arise in conditional independence testing. The decision rule of the algorithm is as follows: in the regression analysis, a SNP whose conditional P value is greater than 0.05 is removed from the candidate set; if the conditional P values of the SNPs in the model are all greater than 0.05 or all less than 0.05, all SNPs in the model are retained.

Example 1:

Taking a small dataset as an example, assume that the true causal relationship between the SNPs and Y in the data is as shown in Fig. 2. Assume there are 7 SNPs in the original dataset. To screen the pathogenic sites of the outcome Y, a unary regression model is first used to judge whether each SNP is marginally independent of Y, and SNP₇ is eliminated. In the binary regression model, SNP₄ is conditionally independent of Y given SNP₁, so SNP₄ is deleted. Similarly, SNP₅ and SNP₆ can be eliminated using the ternary regression model. At this point fewer than 4 SNPs remain in the candidate set, so a quaternary regression model cannot be constructed and the computation terminates. {SNP₁, SNP₂, SNP₃} is then the screened set of pathogenic sites.
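The stopping behaviour in this example can be replayed in a few lines. The sketch uses the stopping rule stated elsewhere in the description (stop once the set Sₘ holds at most m+1 SNPs) and assumes the indexing of example 2, in which S₀ follows the unary step, S₁ the binary step, and S₂ the ternary step. The function name and the size table are illustrative.

```python
def should_stop(set_size, m):
    """Stopping rule as stated in the text: stop once S_m holds at most m + 1 SNPs."""
    return set_size <= m + 1


# Candidate-set sizes in the worked example, starting from 7 SNPs:
# 6 remain after the unary step (S_0), 5 after the binary step (S_1),
# and 3 after the ternary step (S_2).
sizes = {0: 6, 1: 5, 2: 3}
stopped_at = next(m for m in sorted(sizes) if should_stop(sizes[m], m))
```

With these sizes the rule first fires at m = 2, since 3 ≤ 2 + 1, which matches the statement that a quaternary model cannot be built from fewer than 4 SNPs.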

Example 2:

as shown in fig. 3, embodiment 2 of the present invention provides a causal GWAS method for fine localization of pathogenic SNPs for whole genome, comprising the steps of:

(1) First, it is determined whether each SNP in the genome is independent of the outcome Y. A single-factor regression model (such as a linear regression model or a logistic regression model) is used to perform a genome-wide association analysis on the sample. Based on the results, the significant SNPs with a P value below a certain threshold (e.g., P < 5×10⁻⁸) are screened and defined as the candidate gene set S₀.

The SNPs in the gene set S₀ are sorted by P value in ascending order.

(2) Fix the SNP with the smallest P value in S₀, denoted SNP₀₁; the remaining SNPs form the subset SNP₀ⱼ (j = 2, …, J) of S₀.

SNP₀₁ and SNP₀ⱼ (j = J, …, 2) are entered together in a regression analysis of the outcome Y (e.g., using a binary regression model), and the conditional independence between each of the two SNPs and the outcome is tested.

To account for collinearity, if the LD between the two SNPs in the model is equal to 1, both SNPs are kept in the gene set S₀. The two conditional P values are then analyzed: if either P value is missing, both SNPs are kept in S₀; if neither P value is missing, they are compared with the chosen significance threshold (e.g., a significance level of 0.05). If both P values are greater than 0.05, or both are less than 0.05, both SNPs are kept in S₀; if one P value is greater than 0.05 and the other is less than 0.05, the SNP with the P value greater than 0.05 is removed from S₀ and takes no further part in the analysis. If the removed SNP is SNP₀₁, or all remaining SNPs in S₀ have been tested, the second SNP in the sorted S₀, SNP₀₂, is fixed and the above process is repeated until no more SNPs are removed from S₀. The SNPs still remaining in S₀ after this process are recorded as the candidate gene set S₁.

(3) If the number of SNPs in the candidate gene set S₁ is less than or equal to 2, the calculation ends and all SNPs in S₁ are the screened pathogenic sites; otherwise, the above analysis is applied to S₁ with a ternary regression equation to obtain the candidate gene set S₂. This iterative process is repeated until the number of SNPs in Sₘ is less than or equal to m+1, at which point the computation stops and Sₘ is the set of true pathogenic sites.

(4) It should be noted that when the m-ary regression model is used to analyze the candidate gene set Sₘ₋₁ (m = 1, …, n), the first m-1 SNPs in Sₘ₋₁ are given first, and each remaining SNP is then added to the regression model in turn for a regression analysis against the outcome Y. If the m conditional P values obtained are all greater than 0.05 or all less than 0.05, all m SNPs are retained in Sₘ₋₁; otherwise, the SNPs with a P value greater than or equal to 0.05 are deleted. If adding a new SNP causes a collinearity problem with a given SNP, both SNPs are retained. All SNPs finally remaining in Sₘ₋₁ are denoted Sₘ.
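The rotation of the fixed SNP in step (2) can be sketched as a loop over the P-sorted list. The two-SNP regression and the rejection rule are delegated to a caller-supplied function, `decide`, which is a hypothetical interface returning the identifier of the SNP to remove or `None`. Reading "repeat until no SNP is removed" as "repeat full passes until a pass removes nothing" is an interpretation of the text, not something it states explicitly.

```python
def binary_stage(sorted_snps, decide):
    """Rotate the fixed SNP through the P-sorted list and apply pairwise decisions.

    decide(fixed, other) must return fixed, other, or None (keep both).
    """
    remaining = list(sorted_snps)
    changed = True
    while changed:                      # repeat passes until nothing is removed
        changed = False
        for fixed in list(remaining):   # SNP_01 first, then SNP_02, and so on
            if fixed not in remaining:
                continue                # already removed earlier in this pass
            # Test the other SNPs from the least significant upward (j = J, ..., 2).
            for other in reversed([s for s in remaining if s != fixed]):
                drop = decide(fixed, other)
                if drop is None:
                    continue
                remaining.remove(drop)
                changed = True
                if drop == fixed:
                    break               # fixed SNP removed: move on to the next one
    return remaining


# Hypothetical pairwise outcomes: s4 loses against s1, and s2 loses against s3.
outcomes = {frozenset(("s1", "s4")): "s4", frozenset(("s2", "s3")): "s2"}
s1_set = binary_stage(["s1", "s2", "s3", "s4"],
                      lambda fixed, other: outcomes.get(frozenset((fixed, other))))
```

In this toy run `s4` is removed while `s1` is fixed, and `s2` is removed while it is itself the fixed SNP, leaving `s1` and `s3` as the set carried into the next stage.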

The method described in this embodiment breaks through the limitation of statistical association, and the overall performance is significantly better than the existing fine localization methods (fig. 4 and 5) such as generalized linear regression model, LASSO regression model, GCTA model and bayesian variable selection regression model, and provides a new strategy and a new method for fine localization of genetic pathogenic sites of whole genome.

Example 3:

the embodiment 3 of the invention provides a causal correlation analysis system for fine localization of genome-wide pathogenic SNP, comprising:

a data acquisition module configured to: acquiring genome data to be analyzed;

a causal GWAS module configured to:

perform a genome-wide association analysis on the genome data using a single-factor regression model, screen the significant SNPs with a P value below a preset threshold, define the selected SNPs as the first candidate gene set (i.e., S₀), and sort the SNPs in the first candidate gene set by P value in ascending order;

fix the SNP with the smallest P value in the first candidate gene set, SNP₀₁, with the remaining SNPs forming the first candidate gene subset; perform a binary regression analysis of the outcome on SNP₀₁ together with each SNP in the first candidate gene subset in turn, test the conditional independence between each of the two SNPs and the outcome, and remove the SNPs with a P value greater than the preset threshold from the first candidate gene set;

if the removed SNP is SNP₀₁, or all other SNPs in the first candidate gene subset have been tested, fix the second SNP in the sorted first candidate gene set, SNP₀₂, and repeat the above process until no more SNPs are removed from the first candidate gene subset; the SNPs remaining in the first candidate gene set after this process are recorded as the second candidate gene subset (i.e., S₁);

if the number of SNPs in the second candidate gene subset is less than or equal to 2, or the second candidate gene subset is equal to the first candidate gene subset, end the calculation, all SNPs in the second candidate gene subset being the screened pathogenic sites; otherwise, continue the calculation until the number of SNPs in the m-th candidate gene subset is less than or equal to m+1, and then stop.

The working method of the system is the same as the causal correlation analysis method for fine localization of the genome-wide pathogenic SNP provided in example 1, and will not be described here.

Example 4:

embodiment 4 of the present invention provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, performs the steps of the causal correlation analysis method for fine localization of whole genome pathogenic SNPs as described in embodiment 1 of the present invention.

Example 5:

embodiment 5 of the present invention provides an electronic device, including a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the steps in the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to embodiment 1 of the present invention when executing the program.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A causal correlation analysis method for fine localization of whole genome pathogenic SNP is characterized in that:

the method comprises the following steps:

acquiring genome data to be analyzed;

fixing the SNP with the smallest P value in the first candidate gene set, SNP₀₁, with the remaining SNPs forming the first candidate gene subset; performing a binary regression analysis of the outcome on SNP₀₁ together with each SNP in the first candidate gene subset in turn, testing the conditional independence between each of the two SNPs and the outcome, and deciding according to the rejection criteria whether to remove the SNPs from the first candidate gene set;

if the removed SNP is SNP₀₁, or all other SNPs in the first candidate gene subset have been tested, fixing the second SNP in the sorted first candidate gene set, SNP₀₂, and repeating the above process until no more SNPs are removed from the first candidate gene subset; the SNPs remaining in the first candidate gene set after this process are the second candidate gene subset;

2. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:

if the LD between the two SNPs is equal to 1, both SNPs are retained in the first candidate gene set; or, on analyzing the two conditional P values, if either P value is missing, both SNPs are retained in the first candidate gene set;

3. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:

the m-th candidate gene subset is obtained by using m-ary regression analysis to test the conditional independence of the SNPs.

4. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:

when a multiple regression model is used to analyze the candidate gene set Sₘ₋₁, m = 1, …, n, the first m-1 SNPs in Sₘ₋₁ are given, and the remaining SNPs are added to the regression model in turn for a regression analysis against the outcome Y;

the corresponding rejection conditions in this process are: if the m conditional P values obtained are all greater than 0.05 or all less than 0.05, all m SNPs are retained in Sₘ₋₁; otherwise, the SNPs with a P value greater than or equal to 0.05 are deleted; if adding a new SNP causes a collinearity problem with a given SNP, both SNPs are retained; all SNPs finally remaining in Sₘ₋₁ are denoted Sₘ.

5. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:

6. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:

or,

given a truly pathogenic site, the falsely associated site is independent of the outcome condition.

7. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:

the statistical principle on which the condition independent examination of possible causal relationships is based also includes:

or,

8. A causal correlation analysis system for fine localization of whole genome pathogenic SNPs, characterized in that:

comprising the following steps:

a data acquisition module configured to: acquiring genome data to be analyzed;

a causal GWAS module configured to:

9. A computer readable storage medium having stored thereon a program, which when executed by a processor performs the steps in the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to any one of claims 1-7.

10. An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor performs the steps in the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to any one of claims 1-7 when the program is executed by the processor.