CN113808665B - Causal correlation analysis method for fine localization of genome-wide pathogenic SNP - Google Patents
- Publication number: CN113808665B
- Application number: CN202111149486.1A
- Authority: CN (China)
- Prior art keywords: snp, snps, pathogenic, genome, causal
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention provides a causal correlation analysis method for fine localization of genome-wide pathogenic SNPs, used to finely localize the pathogenic SNPs of complex human diseases and to reduce the false positive rate of GWAS results. Guided by a causal inference framework, a causal GWAS analysis strategy for genome-wide fine localization of pathogenic loci (the CDSFM algorithm) is constructed. Under the constraint of a specific causal graph model, the strategy adjusts stepwise through conditional independence tests, which effectively reduces the false positive rate, raises the true positive rate, raises the hit rate for capturing pathogenic SNPs to more than 90%, and gives higher detection efficiency.
Description
Technical Field
The invention relates to the technical field of biological genetic engineering, and in particular to a causal correlation analysis method for fine localization of genome-wide pathogenic SNPs.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Since the genome-wide association study (GWAS) method was proposed, tens of thousands of SNPs across the whole genome have been reported to be statistically associated with more than 4000 common diseases. However, few of these SNPs have been functionally validated in the laboratory, and their genetic mechanisms are difficult to elucidate. Such a high false positive rate makes subsequent validation difficult and also leads scientists outside genetics to doubt GWAS results. How to further pinpoint the true pathogenic SNPs and reduce the false positive rate is a problem that researchers are currently studying widely.
The inventors found that researchers have proposed a variety of fine localization algorithms from multiple angles, which can be roughly divided into the following four categories:
(1) Heuristic fine localization methods. The usual approach is to screen marginally associated SNPs with a generalized linear regression model or a generalized linear mixed model, and then, according to the LD structure around the top SNP, select the SNPs whose R² with the top SNP exceeds a certain threshold. However, such methods cannot effectively reduce the false positive rate when the SNPs in a region are highly correlated.
(2) Conditional regression methods, represented by conditional regression analysis, which test whether the conditional P values of the remaining SNPs are still significant given the top SNP. In most cases, however, the top SNP in a region is not necessarily the true pathogenic site, so the judgment of the model is affected when the two SNPs are highly correlated.
(3) Penalized regression models, represented by LASSO, which select variables by judging whether the regression coefficient is 0. The sparse model built under this strategy keeps only a few SNPs in a region of highly correlated SNPs, so the true pathogenic SNP is easily deleted by mistake.
(4) Bayesian models, represented by the Bayesian variable selection model, which perform fine localization by calculating the posterior probability that each SNP is a pathogenic site. These methods often require the number of pathogenic sites to be preset, and an incorrect setting affects the analysis results.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a causal correlation analysis method for fine localization of genome-wide pathogenic SNPs. Guided by a causal inference framework, it constructs a causal GWAS analysis strategy and method for genome-wide fine localization of pathogenic SNPs (Causal diagnostic-based Stepwise Fine-Mapping, the CDSFM algorithm), thereby reducing the false positive rate of GWAS results and raising the hit rate for capturing pathogenic SNPs.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A first aspect of the invention provides a causal correlation analysis method for fine localization of genome-wide pathogenic SNPs.
A causal correlation analysis method for fine localization of genome-wide pathogenic SNPs, comprising the following steps:
acquiring genome data to be analyzed;
performing genome-wide association analysis on the genome data with a single-factor regression model, screening the significant SNPs whose P value is below a preset threshold, defining them as the first candidate gene set, and sorting the SNPs in the first candidate gene set by P value in ascending order;
fixing SNP_01, the SNP with the smallest P value in the first candidate gene set, with the remaining SNPs forming the first candidate gene subset; performing binary regression analysis of the outcome on SNP_01 together with each SNP in the first candidate gene subset in turn, testing the conditional independence of the two SNPs from the outcome, and deciding according to the rejection conditions whether a tested SNP is removed from the first candidate gene set;
if the deleted SNP is SNP_01, or all other SNPs in the first candidate gene subset have been tested, fixing SNP_02, the second SNP in the ordered first candidate gene subset, and repeating the above process until no more SNPs are removed from the first candidate gene subset; the SNPs still remaining in the first candidate gene set after this process are recorded as the second candidate gene subset;
if the number of SNPs in the second candidate gene subset is less than or equal to 2, the calculation ends and all SNPs in the second candidate gene subset are the screened pathogenic sites; otherwise, the calculation continues until the number of SNPs in the m-th candidate gene subset is less than or equal to m+1, at which point the calculation stops.
The rejection conditions for SNPs in this process are as follows: if the LD between the two SNPs equals 1, both SNPs are retained in the first candidate gene subset; otherwise the two conditional P values are analyzed, and if either P value is missing, both SNPs are retained in the first candidate gene subset;
if neither P value is missing, the P values are compared with the preset threshold: if both P values are greater than the threshold or both are smaller than it, both SNPs are retained in the first candidate gene subset; if one P value is greater than the preset threshold and the other is smaller, the SNP whose P value is greater than the threshold is removed from the first candidate gene subset.
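The pairwise rejection conditions above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `pairwise_decision`, its argument names, the use of `None` for a missing P value, and the treatment of a P value exactly equal to the threshold as non-significant are not specified in the text.

```python
from typing import Optional, Tuple


def pairwise_decision(ld: float,
                      p_fixed: Optional[float],
                      p_tested: Optional[float],
                      threshold: float = 0.05) -> Tuple[bool, bool]:
    """Return (keep_fixed, keep_tested) for one binary regression.

    ld        -- LD between the two SNPs (1 means perfectly correlated)
    p_fixed   -- conditional P value of the fixed SNP, None if missing
    p_tested  -- conditional P value of the tested SNP, None if missing
    """
    # LD equal to 1: the two SNPs cannot be separated, so keep both.
    if ld >= 1.0:
        return True, True
    # A missing conditional P value: keep both.
    if p_fixed is None or p_tested is None:
        return True, True
    # Assumption: a P value exactly at the threshold counts as non-significant.
    fixed_sig = p_fixed < threshold
    tested_sig = p_tested < threshold
    # Both significant or both non-significant: keep both.
    if fixed_sig == tested_sig:
        return True, True
    # Exactly one SNP is non-significant: that SNP is removed.
    return fixed_sig, tested_sig


# Fixed SNP stays significant, tested SNP does not: only the tested SNP is dropped.
print(pairwise_decision(ld=0.6, p_fixed=1e-6, p_tested=0.40))
```

Returning a pair of booleans rather than mutating a set keeps the rule separate from the bookkeeping of which SNP is currently fixed.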
Further, m-ary regression analysis is used to test the conditional independence of the SNPs and thereby obtain the m-th candidate gene subset.
When an m-ary regression model is used to analyze the candidate gene set S_{m-1} (m = 1, …, n), the first m-1 SNPs in S_{m-1} are given first, and each remaining SNP is then added to the regression model in turn for m-ary regression analysis on the outcome Y;
the corresponding rejection conditions in this process are: if the m conditional P values obtained are all greater than 0.05 or all less than 0.05, all m SNPs are retained in S_{m-1}; otherwise, the SNPs with P value greater than or equal to 0.05 are deleted; if adding a new SNP causes a collinearity problem with a given SNP, both SNPs are retained; all SNPs finally remaining in S_{m-1} are denoted S_m.
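The m-ary rejection conditions generalize the pairwise rule. The sketch below is one hedged reading of them: `mary_removals` is a hypothetical name, and representing a collinearity problem as a `None` P value is an assumption made for illustration, since the text does not say how collinearity is detected.

```python
from typing import List, Optional, Sequence


def mary_removals(p_values: Sequence[Optional[float]],
                  alpha: float = 0.05) -> List[int]:
    """Return the positions of the SNPs to delete after one m-ary regression.

    p_values holds the m conditional P values of the SNPs in the model
    (the m-1 given SNPs followed by the newly added one).
    """
    # Assumption: collinearity shows up as a missing P value; keep every SNP.
    if any(p is None for p in p_values):
        return []
    significant = [p < alpha for p in p_values]
    # All significant or all non-significant: keep every SNP.
    if all(significant) or not any(significant):
        return []
    # Mixed result: delete the SNPs with P >= alpha.
    return [i for i, sig in enumerate(significant) if not sig]


# Ternary model: the second SNP loses significance once the others are included.
print(mary_removals([1e-4, 0.30, 1e-3]))
```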
The single-factor or multi-factor regression model is a linear regression model or a logistic regression model.
The statistical principles on which the conditional independence tests are based, covering the possible causal relationships, include:
with the LD structure and the conditional P values both taken into account, a true pathogenic site does not become conditionally independent of the outcome because of a falsely associated site;
given a true pathogenic site, a falsely associated site is conditionally independent of the outcome;
when two pathogenic sites are in strong LD, both may appear conditionally independent of the outcome at the same time;
when the model contains no true pathogenic SNP, SNPs in stronger LD with the pathogenic SNP are more likely to be retained.
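The first two principles can be illustrated numerically. The following sketch builds a small, fully deterministic toy dataset (not data from the invention) in which `g1` is the causal SNP and `g2` is a correlated, non-causal SNP, then fits linear regressions by ordinary least squares. The helper `ols_pvalues`, the 0/1 genotype coding, the cell counts, and the use of a normal approximation for the P values (instead of the t distribution) are all assumptions made to keep the example self-contained.

```python
import math


def ols_pvalues(columns, y):
    """OLS of y on an intercept plus the given predictor columns.

    Returns two-sided P values for the predictors (normal approximation).
    Raises ZeroDivisionError if the predictors are perfectly collinear.
    """
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[a] + [1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    rss = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    sigma2 = rss / (n - k)
    pvals = []
    for a in range(1, k):
        z = beta[a] / math.sqrt(sigma2 * inv[a][a])
        pvals.append(math.erfc(abs(z) / math.sqrt(2)))
    return pvals


# Toy data: (g1, g2) -> number of samples; g2 agrees with g1 in 80 of 100 samples.
cells = {(0, 0): 40, (1, 1): 40, (0, 1): 10, (1, 0): 10}
g1, g2, y = [], [], []
for (a, b), count in cells.items():
    for i in range(count):
        g1.append(a)
        g2.append(b)
        # The outcome depends on g1 only, plus balanced noise of +/- 0.5.
        y.append(a + (0.5 if i % 2 == 0 else -0.5))

p_marginal_g2 = ols_pvalues([g2], y)[0]          # g2 alone
p_cond_g1, p_cond_g2 = ols_pvalues([g1, g2], y)  # g1 and g2 together
print(p_marginal_g2, p_cond_g1, p_cond_g2)
```

In this constructed dataset the non-causal `g2` is significant in the single-factor model but loses significance once the causal `g1` is in the model, while `g1` stays significant, which is the pattern the rejection conditions rely on. Real genotype data would not be this clean.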
A second aspect of the invention provides a causal correlation analysis system for fine localization of genome-wide pathogenic SNPs.
A causal correlation analysis system for the fine localization of whole genome pathogenic SNPs, comprising:
a data acquisition module configured to: acquiring genome data to be analyzed;
a causal GWAS module configured to:
perform genome-wide association analysis on the genome data with a single-factor regression model, screen the significant SNPs whose P value is below a preset threshold, define the selected SNPs as the first candidate gene set, and sort the SNPs in the first candidate gene set by P value in ascending order;
fix SNP_01, the SNP with the smallest P value in the first candidate gene set, with the remaining SNPs forming the first candidate gene subset; perform binary regression analysis of the outcome on SNP_01 together with each SNP in the first candidate gene subset in turn, test the conditional independence of the two SNPs from the outcome, and decide according to the rejection criteria whether to remove an SNP from the first candidate gene set;
if the deleted SNP is SNP_01, or all other SNPs in the first candidate gene subset have been tested, fix SNP_02, the second SNP in the ordered first candidate gene subset, and repeat the above process until no more SNPs are removed from the first candidate gene subset; the SNPs still remaining in the first candidate gene set after this process are recorded as the second candidate gene subset;
if the number of SNPs in the second candidate gene subset is less than or equal to 2, the calculation ends and all SNPs in the second candidate gene subset are the screened pathogenic sites; otherwise, the calculation continues until the number of SNPs in the m-th candidate gene subset is less than or equal to m+1, at which point the calculation stops.
A third aspect of the invention provides a computer readable storage medium having stored thereon a program which when executed by a processor performs the steps of the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to the first aspect of the invention.
In a fourth aspect, the present invention provides an electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, the processor implementing the steps in the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to the first aspect of the invention when the program is executed.
Compared with the prior art, the invention has the beneficial effects that:
1. With the method, system, medium, or electronic device of the invention, a causal GWAS analysis strategy for genome-wide fine localization of pathogenic sites (the CDSFM algorithm) is constructed under the guidance of a causal inference framework. Under the constraint of a specific causal graph model, the strategy adjusts stepwise through conditional independence tests, which effectively reduces the false positive rate (False Discovery Rate, FDR), raises the true positive rate (True Discovery Rate, TDR), raises the hit rate for capturing pathogenic SNPs to more than 90%, and gives higher detection efficiency.
2. The method, system, medium, or electronic device of the invention breaks through the limitation of statistical association. Its overall performance is significantly better than existing fine localization methods such as the generalized linear regression model, the LASSO regression model, the GCTA model, and the Bayesian variable selection regression model, and it provides a new strategy and method for fine localization of genome-wide genetic susceptibility sites.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a schematic diagram of the statistical principles of the CDSFM algorithm according to an embodiment of the present invention.
Fig. 2 is an example diagram of the CDSFM algorithm provided in Example 1 of the present invention.
Fig. 3 is a schematic diagram of the framework of the causal GWAS method for genome-wide fine localization of susceptibility sites provided in Example 2 of the present invention.
Fig. 4 is a schematic diagram comparing the results of the CDSFM algorithm with other fine localization algorithms when the outcome in Example 2 is a quantitative trait.
Fig. 5 is a schematic diagram comparing the results of the CDSFM algorithm with other fine localization algorithms when the outcome in Example 2 is a qualitative trait.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
The invention constructs a fine localization algorithm, CDSFM, which considers the LD structure and the conditional P values at the same time. The statistical principles on which it is based are: (1) a true pathogenic site does not become conditionally independent of the outcome because of a falsely associated site; (2) given a true pathogenic site, a falsely associated site is conditionally independent of the outcome; (3) when two pathogenic sites are in strong LD, both may appear conditionally independent of the outcome at the same time; (4) when the model contains no true pathogenic SNP, SNPs in stronger LD with the pathogenic SNP are more likely to be retained.
The four principles above correspond one-to-one to the causal graph scenarios in Fig. 1 and together cover all causal relationships that may arise in conditional independence testing. The decision criterion of the algorithm is as follows: in a regression model, if the conditional P value of an SNP is greater than 0.05, the SNP is removed from the candidate set; if the conditional P values of the SNPs in the model are all greater than 0.05 or all less than 0.05, all SNPs in the model are retained.
Example 1:
Taking a small dataset as an example, assume that the true causal relationships between the SNPs and Y are as shown in Fig. 2. Suppose the original dataset contains 7 SNPs. To screen the pathogenic sites of the outcome Y, a unary regression model is first used to judge whether each SNP is marginally independent of Y, and SNP_7 is removed. In the binary regression model, SNP_4 is conditionally independent of Y given SNP_1, so SNP_4 is deleted. Similarly, SNP_5 and SNP_6 can be removed with the ternary regression model. At this point fewer than 4 SNPs remain in the candidate set, so a quaternary regression model cannot be constructed and the operation terminates. {SNP_1, SNP_2, SNP_3} are then the screened pathogenic sites.
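The bookkeeping in this example can be traced with a few lines of Python. The removal table below is copied from the example text, not computed from data, and the variable names (`removed_at`, `arity`) are illustrative only. The sketch shows how the stopping rule follows from needing at least as many SNPs as the next regression model has terms.

```python
# SNPs removed at each stage, as stated in Example 1:
# unary (1 SNP per model), binary (2 SNPs), ternary (3 SNPs).
removed_at = {1: {"SNP7"}, 2: {"SNP4"}, 3: {"SNP5", "SNP6"}}

candidates = [f"SNP{i}" for i in range(1, 8)]  # SNP1 ... SNP7
arity = 1
trace = []
# A model with `arity` SNPs can only be built if that many SNPs remain.
while len(candidates) >= arity:
    dropped = removed_at.get(arity, set())
    candidates = [s for s in candidates if s not in dropped]
    trace.append((arity, len(candidates)))
    arity += 1

print(trace)        # (stage arity, SNPs remaining after the stage)
print(candidates)   # the screened sites
```

The loop stops before the quaternary stage because only three SNPs remain, which matches the rule that the calculation ends once the number of remaining SNPs is no greater than the number used in the last model.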
Example 2:
As shown in Fig. 3, Example 2 of the present invention provides a causal GWAS method for fine localization of genome-wide pathogenic SNPs, comprising the following steps:
(1) First, determine whether each SNP in the genome is independent of the outcome Y. A single-factor regression model (such as a linear regression model or a logistic regression model) is used to perform genome-wide association analysis on the sample. Based on the results, the significant SNPs whose P value is below a certain threshold (such as P < 5×10⁻⁸) are screened and defined as the candidate gene set S_0, and the SNPs in S_0 are sorted by P value in ascending order.
(2) Fix SNP_01, the SNP with the smallest P value in S_0; the remaining SNPs form the subset SNP_0j (j = 2, …, J) of S_0.
SNP_01 and each SNP_0j (j = J, …, 2) are entered together in a regression on the outcome Y (for example, a binary regression model), and the conditional independence of the two SNPs from the outcome is tested.
To allow for collinearity, if the LD between the two SNPs in the model equals 1, both SNPs are kept in the gene set S_0. Otherwise the two conditional P values are analyzed. If either P value is missing, both SNPs are kept in S_0. If neither P value is missing, the P values are compared with a defined significance threshold (for example, a significance level of 0.05): if both P values are greater than 0.05 or both are less than 0.05, both SNPs are kept in S_0; if one P value is greater than 0.05 and the other is less than 0.05, the SNP with P value greater than 0.05 is removed from S_0 and is not analyzed further. If the deleted SNP is SNP_01, or all remaining SNPs in S_0 have been tested, SNP_02, the second SNP in the ordered S_0, is fixed and the above process is repeated until no more SNPs are removed from S_0. The SNPs still remaining in S_0 after this process are recorded as the candidate gene set S_1.
(3) If the number of SNPs in the candidate gene set S_1 is less than or equal to 2, the calculation ends and all SNPs in S_1 are the screened pathogenic sites; otherwise, a ternary regression equation is applied to S_1 in the same way to obtain the candidate gene set S_2. This iteration is repeated until the number of SNPs in S_m is less than or equal to m+1, at which point the operation stops and S_m is the true pathogenic gene set.
(4) Note that when the m-ary regression model is used to analyze the candidate gene set S_{m-1} (m = 1, …, n), the first m-1 SNPs in S_{m-1} are given first, and each remaining SNP is then added to the regression model in turn for regression analysis on the outcome Y. If the m conditional P values obtained are all greater than 0.05 or all less than 0.05, all m SNPs are retained in S_{m-1}; otherwise, the SNPs with P value greater than or equal to 0.05 are deleted. If adding a new SNP causes a collinearity problem with a given SNP, both SNPs are retained. All SNPs finally remaining in S_{m-1} are denoted S_m.
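Steps (1) and (2) can be sketched as follows. This is one possible reading of the procedure, not the patented implementation: the function names, the callback `cond_p` that returns the two conditional P values of a binary regression, the callback `ld_is_one`, and the sweep order over fixed SNPs are all assumptions. The toy oracle at the end stands in for real regressions so that the example runs without genotype data.

```python
from typing import Callable, Dict, List, Optional, Tuple

PairP = Tuple[Optional[float], Optional[float]]


def screen_and_sort(marginal_p: Dict[str, float],
                    threshold: float = 5e-8) -> List[str]:
    """Step (1): keep SNPs with P below the threshold, smallest P first."""
    hits = [snp for snp, p in marginal_p.items() if p < threshold]
    return sorted(hits, key=lambda snp: marginal_p[snp])


def binary_stage(s0: List[str],
                 cond_p: Callable[[str, str], PairP],
                 ld_is_one: Callable[[str, str], bool],
                 alpha: float = 0.05) -> List[str]:
    """Step (2): fix each SNP in turn and test it against the others."""
    kept = list(s0)
    changed = True
    while changed:                      # repeat until a full sweep removes nothing
        changed = False
        for fixed in list(kept):
            if fixed not in kept:
                continue                # already removed during this sweep
            for other in list(kept):
                if other == fixed or other not in kept:
                    continue
                if ld_is_one(fixed, other):
                    continue            # LD equal to 1: keep both
                p_fixed, p_other = cond_p(fixed, other)
                if p_fixed is None or p_other is None:
                    continue            # missing P value: keep both
                sig_fixed, sig_other = p_fixed < alpha, p_other < alpha
                if sig_fixed == sig_other:
                    continue            # both or neither significant: keep both
                loser = other if sig_fixed else fixed
                kept.remove(loser)
                changed = True
                if loser == fixed:
                    break               # move on to the next fixed SNP
    return kept


# Toy oracle (hypothetical): A and C are causal, B is a non-causal proxy of A.
marginal = {"A": 1e-12, "B": 1e-14, "C": 1e-9, "D": 0.2}
causal, proxy_of = {"A", "C"}, {"B": "A"}


def toy_cond_p(x: str, y: str) -> PairP:
    def p(snp: str, partner: str) -> float:
        if snp in causal:
            return 1e-6
        return 0.5 if proxy_of.get(snp) == partner else 1e-6
    return p(x, y), p(y, x)


s0 = screen_and_sort(marginal)
s1 = binary_stage(s0, toy_cond_p, lambda a, b: False)
print(s0, s1)
```

In the toy run the proxy `B` has the smallest marginal P value and is fixed first, yet it is the one removed once it is tested together with `A`, which reflects the point in the background that the top SNP is not necessarily the true pathogenic site.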
The method of this example breaks through the limitation of statistical association. Its overall performance is significantly better than existing fine localization methods such as the generalized linear regression model, the LASSO regression model, the GCTA model, and the Bayesian variable selection regression model (Figs. 4 and 5), and it provides a new strategy and method for fine localization of genome-wide genetic pathogenic sites.
Example 3:
Example 3 of the present invention provides a causal correlation analysis system for fine localization of genome-wide pathogenic SNPs, comprising:
a data acquisition module configured to: acquiring genome data to be analyzed;
a causal GWAS module configured to:
perform genome-wide association analysis on the genome data with a single-factor regression model, screen the significant SNPs whose P value is below a preset threshold, define the selected SNPs as the first candidate gene set (i.e., S_0), and sort the SNPs in the first candidate gene set by P value in ascending order;
fix SNP_01, the SNP with the smallest P value in the first candidate gene set, with the remaining SNPs forming the first candidate gene subset; perform binary regression analysis of the outcome on SNP_01 together with each SNP in the first candidate gene subset in turn, test the conditional independence of the two SNPs from the outcome, and remove the SNPs whose P value is greater than a preset threshold from the first candidate gene set;
if the deleted SNP is SNP_01, or all other SNPs in the first candidate gene subset have been tested, fix SNP_02, the second SNP in the ordered first candidate gene subset, and repeat the above process until no more SNPs are removed from the first candidate gene subset; the SNPs still remaining in the first candidate gene set after this process are recorded as the second candidate gene subset (i.e., S_1);
if the number of SNPs in the second candidate gene subset is less than or equal to 2, or the second candidate gene subset is equal to the first candidate gene subset, the calculation ends and all SNPs in the second candidate gene subset are the screened pathogenic sites; otherwise, the calculation continues until the number of SNPs in the m-th candidate gene subset is less than or equal to m+1, at which point the calculation stops.
The working method of the system is the same as the causal correlation analysis method for fine localization of genome-wide pathogenic SNPs provided in Example 1 and is not repeated here.
Example 4:
Example 4 of the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, performs the steps of the causal correlation analysis method for fine localization of genome-wide pathogenic SNPs described in Example 1 of the present invention.
Example 5:
Example 5 of the present invention provides an electronic device comprising a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the causal correlation analysis method for fine localization of genome-wide pathogenic SNPs described in Example 1 of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above description covers only the preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (10)
1. A causal correlation analysis method for fine localization of genome-wide pathogenic SNPs, characterized in that:
the method comprises the following steps:
acquiring genome data to be analyzed;
performing genome-wide association analysis on the genome data using a single-factor regression model, screening significant SNPs whose P value is below a preset threshold, defining the selected SNPs as a first candidate gene set, and sorting the SNPs in the first candidate gene set by P value in ascending order;
fixing the SNP with the smallest P value in the first candidate gene set, denoted SNP_01, the remaining SNPs constituting a first candidate gene subset; performing binary regression analysis of SNP_01, paired in turn with each SNP in the first candidate gene subset, against the outcome; calculating the conditional independence between each of the two SNPs and the outcome; and judging, according to a rejection criterion, whether to remove an SNP from the first candidate gene set;
if the removed SNP is SNP_01, or all other SNPs in the first candidate gene subset have been tested, fixing the second SNP in the sorted first candidate gene subset, denoted SNP_02, and repeating the above process until no more SNPs are removed from the first candidate gene subset, the SNPs remaining in the first candidate gene subset after completion of the process being a second candidate gene subset;
if the number of SNPs in the second candidate gene subset is less than or equal to 2, ending the calculation, all SNPs in the second candidate gene subset being the screened pathogenic sites; otherwise, continuing the calculation until the number of SNPs in the m-th candidate gene subset is less than or equal to m+1, and then stopping the calculation.
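The screening step and the first (pairwise) pruning round of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `screen_and_sort` and `pairwise_round` are hypothetical, and the rejection criterion is passed in as a caller-supplied function `decide`, which is assumed to run the binary regression for a pair of SNPs and return the names of the SNPs to remove (an empty set keeps both). The later m-ary rounds and the stopping test on the subset size are not shown.

```python
from typing import Callable, Dict, List, Set


def screen_and_sort(p_values: Dict[str, float], threshold: float) -> List[str]:
    """Keep SNPs whose single-factor P value is below the threshold,
    sorted by P value in ascending order (the first candidate gene set)."""
    kept = [snp for snp, p in p_values.items() if p < threshold]
    return sorted(kept, key=lambda snp: p_values[snp])


def pairwise_round(
    ordered: List[str],
    decide: Callable[[str, str], Set[str]],
) -> List[str]:
    """Run the pairwise round: fix each SNP in turn, test it against the
    others, and repeat whole passes until a pass removes nothing.

    `decide(fixed, other)` is a hypothetical callback standing in for the
    binary regression plus rejection criterion.
    """
    remaining = list(ordered)
    changed = True
    while changed:
        changed = False
        for fixed in list(remaining):
            if fixed not in remaining:
                continue  # removed earlier in this pass
            for other in list(remaining):
                if other == fixed or other not in remaining:
                    continue
                removed = decide(fixed, other)
                if removed:
                    remaining = [s for s in remaining if s not in removed]
                    changed = True
                if fixed in removed:
                    break  # the fixed SNP is gone: move on to the next one
    return remaining
```

Keeping the rejection criterion behind a callback separates the loop control described in this claim from the pairwise rule of the dependent claims, so either part can be changed without touching the other.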
2. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:
when the LD between two SNPs is equal to 1, both SNPs are retained in the first candidate gene subset; the two conditional P values are analyzed, and if either P value is a missing value, both SNPs are retained in the first candidate gene subset;
if neither P value is missing, the P values are compared with the preset threshold: if both P values are greater than, or both are smaller than, the preset threshold, both SNPs are kept in the first candidate gene subset; and if one of the two P values is greater than the preset threshold and the other is smaller than the preset threshold, the SNP whose P value is greater than the preset threshold is removed from the first candidate gene subset.
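The pairwise rejection rule of claim 2 amounts to a small decision function. The sketch below is one reading of the claim under stated assumptions: the function name `snps_to_remove` is hypothetical, a missing conditional P value is represented by `None`, `ld` is whichever LD measure the analysis uses, and a P value exactly equal to the threshold (a case the claim does not address) is treated as keeping both SNPs.

```python
import math
from typing import Optional, Set


def snps_to_remove(
    snp_a: str,
    snp_b: str,
    ld: float,
    p_a: Optional[float],
    p_b: Optional[float],
    threshold: float,
) -> Set[str]:
    """Return the SNP names to drop for one pair; an empty set keeps both."""
    # LD equal to 1: the claim retains both SNPs.
    if math.isclose(ld, 1.0):
        return set()
    # A missing conditional P value (None here) also retains both.
    if p_a is None or p_b is None:
        return set()
    # Exactly one P value above the threshold: drop that SNP.
    if p_a > threshold and p_b < threshold:
        return {snp_a}
    if p_b > threshold and p_a < threshold:
        return {snp_b}
    # Both above, both below, or a tie with the threshold: keep both.
    return set()
```

For example, with a threshold of 0.05, a pair with conditional P values 1e-6 and 0.4 and LD below 1 would lose only the second SNP.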
3. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:
the m-th candidate gene subset is obtained by performing the conditional independence calculation on the SNPs using m-ary regression analysis.
4. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:
when a multiple regression model is used to analyze the candidate gene set S_{m-1}, m = 1, …, n, given S_{m-1}, the remaining SNPs are added to the regression model in turn and regressed against the outcome Y;
the corresponding rejection conditions in this process are as follows: if the m conditional P values obtained are all greater than, or all less than, 0.05, all m SNPs are retained in S_{m-1}; otherwise, the SNPs with a P value greater than or equal to 0.05 are deleted; if adding a new SNP causes a collinearity problem with a given SNP, both SNPs are retained; the SNPs finally remaining in S_{m-1} are denoted S_m.
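The m-ary rejection condition of claim 4 can be sketched in the same way. The function name `mary_rejection` and the `collinear` flag are hypothetical; the flag is assumed to be set by the caller when the regression reports a collinearity problem. Treating a P value exactly equal to 0.05 as "not significant" is an assumption made here so that the keep-all test agrees with the deletion test of the claim.

```python
from typing import Dict, Set


def mary_rejection(
    p_by_snp: Dict[str, float],
    alpha: float = 0.05,
    collinear: bool = False,
) -> Set[str]:
    """Return the SNPs to delete after one m-ary regression.

    `p_by_snp` maps each of the m SNPs in the model to its conditional
    P value.
    """
    # Collinearity between the new SNP and a given SNP: keep everything.
    if collinear:
        return set()
    values = list(p_by_snp.values())
    all_nonsignificant = all(p >= alpha for p in values)
    all_significant = all(p < alpha for p in values)
    # All on the same side of alpha: all m SNPs are retained.
    if all_nonsignificant or all_significant:
        return set()
    # Mixed result: delete the SNPs with P >= alpha.
    return {snp for snp, p in p_by_snp.items() if p >= alpha}
```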
5. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:
the single-factor or multi-factor regression model is a linear regression or logistic regression model.
6. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:
the statistical principles underlying the conditional independence test of possible causal relationships include:
taking the LD structure and the conditional P values into account, a true pathogenic site is not conditionally independent of the outcome given a falsely associated site;
or,
given a true pathogenic site, a falsely associated site is conditionally independent of the outcome.
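In conditional-independence notation, the two principles of claim 6 can be summarized as follows. The symbols are illustrative assumptions rather than notation from the claims: Y is the outcome, X_c a true pathogenic SNP, and X_f a falsely associated SNP in LD with X_c.

$$
Y \not\!\perp\!\!\!\perp X_c \mid X_f, \qquad Y \perp\!\!\!\perp X_f \mid X_c
$$

For a two-SNP regression with link function g (identity for linear regression, logit for logistic regression),

$$
g\big(\mathrm{E}[Y \mid X_c, X_f]\big) = \beta_0 + \beta_c X_c + \beta_f X_f ,
$$

and assuming the model is correctly specified, these statements correspond to \beta_c \neq 0 and \beta_f = 0, so the conditional P value of X_c would be expected to stay small while that of X_f would not.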
7. The causal association analysis method for fine localization of genome-wide pathogenic SNPs according to claim 1, characterized in that:
the statistical principles underlying the conditional independence test of possible causal relationships further include:
when two pathogenic sites are in strong LD, both may be conditionally independent of the outcome at the same time;
or,
when no true pathogenic SNP is present, SNPs in stronger LD with the pathogenic SNP are more likely to be retained.
8. A causal correlation analysis system for fine localization of whole genome pathogenic SNPs, characterized in that:
comprising:
a data acquisition module configured to: acquire genome data to be analyzed;
a causal GWAS module configured to:
perform genome-wide association analysis on the genome data using a single-factor regression model, screen significant SNPs whose P value is below a preset threshold, define the selected SNPs as a first candidate gene set, and sort the SNPs in the first candidate gene set by P value in ascending order;
fix the SNP with the smallest P value in the first candidate gene set, denoted SNP_01, the remaining SNPs constituting a first candidate gene subset; perform binary regression analysis of SNP_01, paired in turn with each SNP in the first candidate gene subset, against the outcome; calculate the conditional independence between each of the two SNPs and the outcome; and judge, according to a rejection criterion, whether to remove an SNP from the first candidate gene set;
if the removed SNP is SNP_01, or all other SNPs in the first candidate gene subset have been tested, fix the second SNP in the sorted first candidate gene subset, denoted SNP_02, and repeat the above process until no more SNPs are removed from the first candidate gene subset, the SNPs remaining in the first candidate gene subset after completion of the process being recorded as a second candidate gene subset;
if the number of SNPs in the second candidate gene subset is less than or equal to 2, end the calculation, all SNPs in the second candidate gene subset being the screened pathogenic sites; otherwise, continue the calculation until the number of SNPs in the m-th candidate gene subset is less than or equal to m+1, and then stop the calculation.
9. A computer readable storage medium having stored thereon a program, which when executed by a processor performs the steps in the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to any one of claims 1-7.
10. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the steps in the causal correlation analysis method for fine localization of whole genome pathogenic SNPs according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111149486.1A CN113808665B (en) | 2021-09-29 | 2021-09-29 | Causal correlation analysis method for fine localization of genome-wide pathogenic SNP |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111149486.1A CN113808665B (en) | 2021-09-29 | 2021-09-29 | Causal correlation analysis method for fine localization of genome-wide pathogenic SNP |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113808665A | 2021-12-17 |
CN113808665B | 2024-03-08 |
Family
ID=78896991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111149486.1A (CN113808665B, Active) | Causal correlation analysis method for fine localization of genome-wide pathogenic SNP | 2021-09-29 | 2021-09-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113808665B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117352162B (en) * | 2023-10-24 | 2024-10-01 | 重庆邮电大学 | Disease factor data processing method based on double-rule causal feature selection |
CN118335200B (en) * | 2024-06-12 | 2024-09-03 | 山东大学 | Lung adenocarcinoma subtype classification system, medium and equipment based on causal feature selection |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010224815A (en) * | 2009-03-23 | 2010-10-07 | Japan Found Cancer Res | Retrieval algorithm for epistasis effect based on comprehensive genome wide snp information |
CN104838384A (en) * | 2012-11-26 | 2015-08-12 | 皇家飞利浦有限公司 | Diagnostic genetic analysis using variant-disease association with patient-specific relevance assessment |
CN108004340A (en) * | 2016-10-27 | 2018-05-08 | 河南农业大学 | One cultivate peanut full-length genome SNP exploitation method |
CN112185464A (en) * | 2020-09-29 | 2021-01-05 | 山东大学 | Multi-character full transcriptome association analysis method and system based on Mendelian randomization and application thereof |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6432974B2 (en) * | 2014-06-20 | 2018-12-05 | 国立大学法人東北大学 | TagSNP Selection Method, Selection Computer System, and Selection Software |
Non-Patent Citations (2)
Title |
---|
Genetically Predicted Insomnia in Relation to 14 Cardiovascular Conditions and 17 Cardiometabolic Risk Factors: A Mendelian Randomization Study; Xinhui Liu et al.; Journal of the American Heart Association; full text * |
Identification and Estimation of Causal Effects Using a Negative-Control Exposure in Time-Series Studies With Applications to Environmental Epidemiology; Yu, Yuanyuan et al.; American Journal of Epidemiology; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN113808665A (en) | 2021-12-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||