CN101894216B - Method of discovering SNP group related to complex disease from SNP information - Google Patents
- Publication number: CN101894216B
- Application number: CN2010102309492A
- Authority: CN (China)
- Legal status: Expired - Fee Related
Abstract
The invention discloses a method for discovering SNP groups related to a complex disease from SNP data. It addresses a shortcoming of the prior art: the multiple causes of disease susceptibility, and the SNP combination related to each cause, cannot be found in the SNP data of a complex disease. The method comprises the steps of: preprocessing the SNP data set of the complex disease; searching the preprocessed data set for candidate SNP groups of suspected pathogenic causes according to the relevance measure of the SNP group; calculating the stability measure of the relevance of every candidate SNP group; adding the SNP group with the maximum relevance stability measure, as a suspected pathogenic cause, to the set of SNP groups related to the complex disease; and outputting the SNP groups in that set, evaluating how strongly each is suspected to be a pathogenic cause by how close its relevance stability is to 1. The invention can find several pathogenic SNP groups hidden in the data at the same time, and can be used for research into pathogenic mechanisms, for early diagnosis of complex diseases, and for the development of biomedicine.
Description
Technical Field
The invention belongs to the technical field of data processing and in particular relates to a method for discovering SNP groups related to a complex disease from single nucleotide polymorphism (SNP) data by using a maximum stability criterion. It can be used for research into the pathogenesis of complex diseases, for early diagnosis, and for the development of biological medicines.
Background
A complex disease arises from the combined action of multiple genetic and environmental factors, and its onset and progression are influenced by many genes organized in a complex network. Complex diseases differ from Mendelian genetic diseases in that, in most cases, no single major gene is sufficient to cause the disease: the effect of any single gene may be negligible or even absent, yet the combined effect of these individually insignificant genes may be the cause of the disease. These characteristics make it very difficult to find the causative genes, or related markers, needed for pathogenesis research, early diagnosis, and biopharmaceutical development. The main open problems are how to find the multiple causes of a disease on a genome-wide scale, and which genes combine to form each cause.
To overcome these problems, researchers have tried to discover multiple disease markers. The existing methods fall into three groups: methods based on hypothesis testing, methods based on feature selection, and methods based on causal analysis:
(1) Methods based on hypothesis testing. These are currently the most important methods for finding single-gene diseases, and the search is usually exhaustive. They can find single pathogenic SNPs in whole-genome data, or pathogenic two-SNP combinations in medium-scale data.
(2) Methods based on feature selection. Some can process large-scale data but generally ignore the correlation between features; some can examine limited feature correlation but handle only small and medium-scale data; others can be combined with a classifier in various ways and aim at the best generalization performance of the classifier, but are computationally expensive and suitable only for small-scale data.
(3) Methods based on causal analysis. Most take the form of causal networks; they are currently at the stage of theoretical research and cannot process large-scale data.
None of these three kinds of method starts from discovering the causative SNPs that objectively exist in the data. Computationally, they either find only single pathogenic SNPs, and so cannot discover the multiple SNP groups associated with a complex disease, or their computational cost is too large for whole-genome data.
Disclosure of Invention
The present invention aims to overcome the disadvantages of the conventional methods by providing a method for discovering complex-disease-related SNP groups from SNP data. Starting from the goal of discovering the causative SNPs that objectively exist in the data, it finds multiple possible causes of disease susceptibility, and the SNP combination related to each cause, from complex disease SNP data.
The technical scheme of the invention comprises the following steps:
(1) let C be the set of SNP groups related to the complex disease, initially empty; let M be the preset number of SNP groups related to the complex disease to be discovered, with default value 6; let L be the upper limit on the number of SNPs contained in an SNP group, with default value 5; following the principle that variation of either gene of a pair of alleles on homologous chromosomes is treated as having an equivalent influence on the disease, preprocess the SNP data into:
$$\Omega = \{(x_i, y_i)\}_{i=1}^{N}$$
where $N$ is the number of samples in the SNP data, $x_i \in \{0,1,2,3\}^d$, $d$ is the number of SNPs in the data, $y_i \in \{1, 2\}$ is the class label of sample $x_i$ (1 denotes the disease group, 2 denotes the control group), $y = [y_1, y_2, \ldots, y_N]$, and $\Omega$ denotes the preprocessed data;
(2) the relevance measure $AS(F_r/\Omega)$ between an SNP group $F_r$ and the class label $y$ is defined as the mutual information $MI(F_r; y)$ between $F_r$ and $y$:
$$AS(F_r/\Omega) = MI(F_r; y) = \sum_{F_r}\sum_{y} p(F_r, y)\,\log\frac{p(F_r, y)}{p(F_r)\,p(y)}$$
where $F_r$ is a set of $r$ SNPs, $p(F_r, y)$ is the joint probability of $F_r$ and $y$, $p(F_r)$ is the joint probability of the $r$ SNPs, and $p(y)$ is the probability of the class label $y$;
(3) based on the relevance measure $AS(F_r/\Omega)$ of the SNP group $F_r$, search $\Omega$ for candidate SNP groups of suspected pathogenic causes according to the following steps:
(3a) calculate the relevance measure of every single SNP in $\Omega$, and add the SNPs with the K largest relevance measures to the set D of candidate SNP groups of suspected pathogenic causes;
(3b) select an unmarked SNP group $F_r$ from D and go to step (3c); if no unmarked SNP group $F_r$ can be selected from D, end step (3);
(3c) if the number of SNPs contained in $F_r$ equals L, mark $F_r$ and go to step (3b); otherwise go to step (3d);
(3d) calculate the relevance measure of each new SNP group formed by $F_r$ together with one of the remaining SNPs in $\Omega$, add the SNP groups with the K largest relevance measures to the candidate set D, mark $F_r$, and go to step (3b);
(4) calculate the stability measure $ST(F_r)$ of the relevance of every candidate SNP group of suspected pathogenic causes found by the search:
[formula for $ST(F_r)$ missing from the source text]
where $F_r$ is a set of $r$ SNPs; $\mu_\delta(r)$ and $\sigma_\delta(r)$ are respectively the mean and the mean square error of the volatilities $\delta(F_r^i)$; $F_r^i$ is the set of $r$ SNPs obtained in the $i$-th sampling, $i = 1, 2, \ldots, m_f$; and $m_f$ is the number of samplings with replacement of the SNPs in $\Omega$, with default value 100;
$\delta(F_r)$ is the volatility of the relevance measure $AS(F_r/\Omega)$:
[formula for $\delta(F_r)$ missing from the source text]
where $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ are respectively the mean and the mean square error of the relevance measures $AS(F_r/Z_i)$ obtained from $m_s$ samplings with replacement of the samples in $\Omega$:
[formulas for $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ missing from the source text]
where $Z_i$ is the data obtained in the $i$-th sampling with replacement, $AS(F_r/Z_i)$ is the relevance measure of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$, and $m_s$ has default value 1000;
(5) according to the maximum stability criterion, select from the candidate set D the SNP group with the largest relevance stability measure as one SNP group of a suspected pathogenic cause and add it to the set C of SNP groups related to the complex disease; remove from $\Omega$ the SNPs contained in C; if the number of SNP groups in C is less than M, go to step (3), otherwise go to step (6);
(6) output the SNP groups in C, and evaluate how strongly each SNP group is suspected to be a pathogenic SNP group by how close the stability of its relevance is to 1.
The invention has the following advantages:
(1) The invention uses mutual information as the measure of relevance between an SNP group and the disease, which captures nonlinear as well as linear statistical relationships between them.
(2) The invention discovers SNP groups of suspected pathogenic causes by using the stability of the relevance of the SNP group. Using sampling with replacement, this evaluation judges from a statistical point of view whether the association between an SNP group and the disease is stable, which makes it possible to find the SNP groups that are objectively related to the disease.
(3) In the process of finding the SNP groups related to the complex disease, no manually chosen parameter is introduced and no existing hypothesis-based machine learning, pattern recognition, or data mining method is used, so the influence of human assumptions is avoided as far as possible.
(4) The invention can find, from complex disease SNP data, multiple possible causes of disease susceptibility and the SNP combination related to each cause.
Drawings
FIG. 1 is a flow chart of the present invention for discovering multiple pathogenic SNP groups.
Detailed Description
Referring to FIG. 1, the specific implementation steps of the invention are as follows:
Step 1. Preprocess and initialize the SNP data.
(1.1) Following the principle that variation of either gene of a pair of alleles on homologous chromosomes is treated as having an equivalent influence on the disease, convert the SNP data into data containing only the values 0, 1, 2 and 3, where 0 denotes missing data;
(1.2) let $\Omega = \{(x_i, y_i)\}_{i=1}^{N}$ denote the preprocessed data, where $N$ is the number of samples in the SNP data, $x_i \in \{0,1,2,3\}^d$, $d$ is the number of SNPs in the data, $y_i \in \{1, 2\}$ is the class label of sample $x_i$ (1 denotes the disease group, 2 denotes the control group), and $y = [y_1, y_2, \ldots, y_N]$;
(1.3) let C be the set of SNP groups related to the complex disease, initialized as empty; let M be the number of SNP groups related to the complex disease that are to be found, with default value 6; let L be the upper limit on the number of SNPs contained in an SNP group, with default value 5.
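Step 1 can be illustrated with a minimal sketch. The text fixes only that the codes lie in {0, 1, 2, 3}, that 0 denotes missing data, and that the two alleles are treated equivalently; the concrete mapping used here (1 = reference homozygote, 2 = heterozygote, 3 = alternate homozygote) and the names `encode_genotype`, `preprocess`, `ref` and `alt` are assumptions made for illustration only.

```python
MISSING = 0  # the text fixes 0 as the code for missing data


def encode_genotype(genotype, ref, alt):
    """Map one genotype call (two allele letters) to a code in {0, 1, 2, 3}.

    Hypothetical coding, not taken from the patent:
    1 = ref/ref, 2 = heterozygous, 3 = alt/alt.
    """
    if genotype is None or len(genotype) != 2:
        return MISSING
    if any(a not in (ref, alt) for a in genotype):
        return MISSING
    # Counting alt alleles ignores which homologous chromosome carries
    # the variant, so "AG" and "GA" receive the same code.
    return 1 + sum(a == alt for a in genotype)


def preprocess(calls, alleles, labels):
    """Build Omega as a list of (x_i, y_i) pairs.

    calls   : N rows, each a list of d genotype strings
    alleles : d (ref, alt) pairs, one per SNP
    labels  : N class labels, 1 = disease group, 2 = control group
    """
    omega = []
    for row, y in zip(calls, labels):
        x = [encode_genotype(g, ref, alt) for g, (ref, alt) in zip(row, alleles)]
        omega.append((x, y))
    return omega
```

For example, `preprocess([["AG", "CC"]], [("A", "G"), ("C", "T")], [1])` yields one sample with genotype codes `[2, 1]` and class label 1 under this assumed coding.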
Step 2. Define the relevance measure.
(2.1) The relevance measure $AS(F_r/\Omega)$ is defined as the mutual information $MI(F_r; y)$ between the SNP group $F_r$ and the class label $y$, given by formula (1):
$$AS(F_r/\Omega) = MI(F_r; y) = \sum_{F_r}\sum_{y} p(F_r, y)\,\log\frac{p(F_r, y)}{p(F_r)\,p(y)} \qquad (1)$$
where $F_r$ is a set of $r$ SNPs, $p(F_r, y)$ is the joint probability of $F_r$ and $y$, $p(F_r)$ is the joint probability of the $r$ SNPs, and $p(y)$ is the probability of the class label $y$;
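The mutual information of formula (1) can be estimated from the preprocessed data by plugging empirical frequencies into the definition. The following is a minimal sketch under stated assumptions: the function name `mutual_information` is hypothetical, the logarithm is taken to base 2 (the text does not state the base), and the missing-data code 0 is simply treated as a fourth genotype value.

```python
import numpy as np


def mutual_information(X, y, snps):
    """Plug-in estimate of MI(F_r; y) in bits.

    X    : (N, d) integer array of genotype codes in {0, 1, 2, 3}
    y    : (N,) array of class labels in {1, 2}
    snps : column indices of the SNPs that make up the group F_r
    """
    X = np.asarray(X)
    y = np.asarray(y)
    # Each distinct row of the selected columns is one joint genotype of F_r.
    _, f_idx = np.unique(X[:, list(snps)], axis=0, return_inverse=True)
    _, y_idx = np.unique(y, return_inverse=True)
    f_idx, y_idx = f_idx.ravel(), y_idx.ravel()
    # Empirical joint distribution p(F_r, y) as a contingency table.
    joint = np.zeros((f_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (f_idx, y_idx), 1.0)
    joint /= joint.sum()
    p_f = joint.sum(axis=1, keepdims=True)  # p(F_r)
    p_y = joint.sum(axis=0, keepdims=True)  # p(y)
    nz = joint > 0                          # 0 * log 0 is taken as 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_f @ p_y)[nz])))
```

A small interaction example shows why a group can score well when its members do not: if the class label depends only on whether two SNPs carry the same code, each SNP alone has zero mutual information with the label while the pair has one full bit.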
Step 3. Search $\Omega$ for candidate SNP groups of suspected pathogenic causes.
(3.1) Calculate the relevance measure of every single SNP by formula (1), and add the SNPs with the K largest relevance measures to the set D of candidate SNP groups of suspected pathogenic causes;
(3.2) select an unmarked SNP group $F_r$ from D and go to step (3.3); if no unmarked SNP group $F_r$ can be selected from D, end step 3;
(3.3) if the number of SNPs contained in $F_r$ equals L, mark $F_r$ and go to step (3.2); otherwise go to step (3.4);
(3.4) by formula (1), calculate the relevance measure of each new SNP group formed by $F_r$ together with one of the remaining SNPs in $\Omega$, add the SNP groups with the K largest relevance measures to the candidate set D, mark $F_r$, and go to step (3.2).
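The search of steps (3.1) to (3.4) can be sketched as a breadth-first expansion in which every group keeps only its K best one-SNP extensions. This is an illustrative sketch, not the patented implementation: the name `search_candidates` is hypothetical, the relevance measure is passed in as a callable `relevance(group)`, and skipping a group that is already in D is an assumption, since the text does not say how duplicates reached by different paths are handled.

```python
from collections import deque


def search_candidates(snps, relevance, K, L):
    """Return the candidate set D as a list of SNP-index tuples.

    snps      : iterable of SNP indices still present in Omega
    relevance : callable mapping a tuple of SNP indices to a float
    K         : number of best groups kept at each expansion
    L         : upper limit on the number of SNPs in a group
    """
    snps = list(snps)
    # (3.1) the K single SNPs with the largest relevance seed D.
    top = sorted(((s,) for s in snps), key=relevance, reverse=True)[:K]
    D = list(top)
    in_D = set(top)
    unmarked = deque(top)
    while unmarked:
        F = unmarked.popleft()          # (3.2) pick an unmarked group; popping marks it
        if len(F) == L:                 # (3.3) size limit reached, nothing to expand
            continue
        # (3.4) extend F by each remaining SNP and keep the K best extensions.
        ext = [tuple(sorted(F + (s,))) for s in snps if s not in F]
        ext.sort(key=relevance, reverse=True)
        for G in ext[:K]:
            if G not in in_D:           # assumption: a group enters D only once
                in_D.add(G)
                D.append(G)
                unmarked.append(G)
    return D
```

Because each group contributes at most K extensions, D holds at most K + K^2 + ... + K^L groups, which is what keeps the search tractable compared with enumerating every combination of up to L SNPs.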
Step 4. Calculate the stability measure of the relevance of every candidate SNP group $F_r$ of suspected pathogenic causes found by the search.
(4.1) The stability measure $ST(F_r)$ of the relevance of $F_r$ is defined by formula (2):
[formula (2) missing from the source text]
where $F_r$ is a set of $r$ SNPs; $\mu_\delta(r)$ and $\sigma_\delta(r)$ are respectively the mean and the mean square error of the volatilities $\delta(F_r^i)$; $F_r^i$ is the set of $r$ SNPs obtained in the $i$-th sampling, $i = 1, 2, \ldots, m_f$; and $m_f$ is the number of samplings with replacement of the SNPs in $\Omega$, with default value 100;
$\delta(F_r)$ is the volatility of the relevance measure $AS(F_r/\Omega)$, defined by formula (3):
[formula (3) missing from the source text]
where $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ are respectively the mean and the mean square error of the relevance measures $AS(F_r/Z_i)$ obtained from $m_s$ samplings with replacement of the samples in $\Omega$, given by formula (4):
[formula (4) missing from the source text]
where $Z_i$ is the data obtained in the $i$-th sampling with replacement, $AS(F_r/Z_i)$ is the relevance measure of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$, and $m_s$ has default value 1000;
(4.2) calculate the volatility $\delta(F_r)$ of the relevance measure $AS(F_r/\Omega)$ as follows:
(4.2.1) perform $m_s$ samplings with replacement of the samples in $\Omega$ to obtain data $Z_i$, $i = 1, 2, \ldots, m_s$;
(4.2.2) for every $Z_i$, calculate the relevance measure $AS(F_r/Z_i)$ of $F_r$ in $Z_i$ from the following formula: [formula missing from the source text]
(4.2.3) calculate the mean $\mu_{AS}(F_r)$ and the mean square error $\sigma_{AS}(F_r)$ of the relevance measures $AS(F_r/Z_i)$ from formula (4);
(4.2.4) calculate the volatility $\delta(F_r)$ of the relevance measure from formula (3);
(4.3) calculate $\mu_\delta(r)$ and $\sigma_\delta(r)$ as follows:
(4.3.1) perform $m_f$ samplings with replacement of the SNPs in $\Omega$ to obtain SNP groups $F_r^i$, each containing $r$ SNPs, $i = 1, 2, \ldots, m_f$;
(4.3.2.1) perform $m_s$ samplings with replacement of the samples in $\Omega$ to obtain data $Z_j$, $j = 1, 2, \ldots, m_s$;
(4.3.2.3) calculate the mean $\mu_{AS}(F_r^i)$ and the mean square error $\sigma_{AS}(F_r^i)$ of the relevance measures $AS(F_r^i/Z_j)$ from the following formula: [formula missing from the source text]
(4.3.3) calculate, from the following formula, the mean $\mu_\delta(r)$ and the mean square error $\sigma_\delta(r)$ of the volatilities $\delta(F_r^i)$ of the relevance measures, $i = 1, 2, \ldots, m_f$: [formula missing from the source text]
(4.4) for each SNP group $F_r$ in the candidate set D obtained in step 3, calculate the stability of its relevance as follows:
(4.4.1) calculate the volatility $\delta(F_r)$ of the SNP group $F_r$ according to step (4.2);
(4.4.2) calculate $\mu_\delta(r)$ and $\sigma_\delta(r)$ according to step (4.3);
(4.4.3) calculate the lower limit of integration of formula (2) [expression missing from the source text] and substitute it into formula (2) to obtain the stability of the SNP group $F_r$.
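The resampling procedure of step 4 can be sketched as follows, with the caveat that formulas (2) and (3) are not reproduced in the text, so two stand-ins are used and should be read as assumptions rather than as the patented definitions: the volatility is taken to be the coefficient of variation $\sigma_{AS}/\mu_{AS}$, and the stability is taken to be the upper tail of a standard normal distribution starting at $(\delta(F_r) - \mu_\delta(r))/\sigma_\delta(r)$, which lies in [0, 1] and approaches 1 for groups that fluctuate much less than random groups of the same size. The population standard deviation stands in for the "mean square error", and all function names are hypothetical.

```python
import math

import numpy as np


def bootstrap_as(X, y, group, relevance, m_s, rng):
    """Mean and standard deviation of AS(F_r / Z_i) over m_s resamples Z_i."""
    n = len(y)
    vals = np.empty(m_s)
    for i in range(m_s):
        idx = rng.integers(0, n, size=n)  # draw N samples with replacement
        vals[i] = relevance(X[idx], y[idx], group)
    return float(vals.mean()), float(vals.std())


def volatility(mu_as, sigma_as):
    # ASSUMPTION: coefficient of variation stands in for the lost formula (3).
    return sigma_as / mu_as if mu_as > 0 else math.inf


def reference_volatility(X, y, r, relevance, m_f, m_s, rng):
    """Mean and standard deviation of the volatility of m_f random r-SNP groups."""
    d = X.shape[1]
    deltas = np.empty(m_f)
    for i in range(m_f):
        # ASSUMPTION: SNPs within one random group are distinct.
        group = tuple(int(s) for s in rng.choice(d, size=r, replace=False))
        deltas[i] = volatility(*bootstrap_as(X, y, group, relevance, m_s, rng))
    return float(deltas.mean()), float(deltas.std())


def stability(delta, mu_delta, sigma_delta):
    # ASSUMPTION: standard normal upper tail stands in for the lost formula (2).
    z = (delta - mu_delta) / sigma_delta
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Here `relevance(X, y, group)` is any callable that evaluates the relevance measure on a resampled data set. With the defaults named in the text ($m_f = 100$, $m_s = 1000$) the reference distribution alone requires 100,000 relevance evaluations per group size $r$, so caching `reference_volatility` per value of $r$ would be a natural design choice.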
Step 5. Select an SNP group of a suspected pathogenic cause.
(5.1) According to the maximum stability criterion, select from the candidate set D the SNP group with the largest relevance stability measure, take it as one SNP group of a suspected pathogenic cause, and denote it S;
(5.2) add S to the set C of SNP groups related to the complex disease; if the number of SNP groups in C equals M, go to step 6;
(5.3) remove the SNPs contained in S from the data $\Omega$ and go to step 3 to find the next SNP group related to the disease.
Step 6. Output the set of SNP groups related to the complex disease, and evaluate how strongly each SNP group is suspected to be a pathogenic SNP group by how close the stability of its relevance is to 1.
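The outer loop formed by steps 3 to 6 can be sketched as follows. The candidate search and the stability measure are passed in as callables so that the sketch stays independent of how they are computed; the name `discover_groups` and the two early-exit guards (no SNPs left, empty candidate set) are assumptions added so that the loop always terminates.

```python
def discover_groups(snps, search, stability, M):
    """Greedy selection by the maximum stability criterion.

    snps      : iterable of all SNP indices in Omega
    search    : callable(remaining_snps) -> list of candidate groups (step 3)
    stability : callable(group) -> stability of its relevance in [0, 1] (step 4)
    M         : number of SNP groups to discover
    Returns the set C as a list of (group, stability) pairs.
    """
    remaining = list(snps)
    C = []
    while len(C) < M and remaining:
        D = search(remaining)                        # step 3
        if not D:
            break
        S = max(D, key=stability)                    # step 5.1
        C.append((S, stability(S)))                  # step 5.2
        # Step 5.3: SNPs of a selected group leave the data, so the
        # discovered groups never share an SNP.
        remaining = [s for s in remaining if s not in S]
    return C                                         # step 6: output with stabilities
```

Removing the SNPs of each selected group before the next search is what lets the method report several distinct suspected causes instead of variations on the single strongest one.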
The effect of the method of the invention is described in more detail through the following experimental examples. These examples are for illustration only and are not intended to limit the scope of the invention.
Experiment 1: discovery of SNP groups related to a complex disease in simulated data.
The simulated data were obtained by biologists, who added 7 known complex-disease-related SNP groups to real SNP data from a New York population; the 7 SNP groups follow different models of association with the disease. There are two data sets: the first contains 2000 samples and 100 SNPs and is denoted SNP100; the second contains 2000 samples and 2000 SNPs and is denoted SNP2000. Details of the data are given in Table 1. The experimental results obtained on the two data sets of Table 1 are shown in Table 2.
TABLE 1 Experimental data
Data set name | Number of SNPs | Number of samples | Number of samples in disease group | Number of samples in control group |
SNP100 | 100 | 2000 | 1127 | 873 |
SNP2000 | 2000 | 2000 | 1181 | 819 |
In Table 2, q is the rank of each SNP of a discovered SNP group when all single SNPs are ordered by relevance from largest to smallest; the discovered SNP groups are the disease-related SNP groups found in the data by the method of the invention; the pathogenic SNP groups are the known complex-disease-related SNP groups added to the data in advance by biologists; relevance is the relevance measure between the discovered SNP group and the disease; stability is the stability measure of the relevance of the discovered SNP group; the P value is the measure commonly used in this field to evaluate the quality of an SNP group discovered from SNP data; SNPs in the table are denoted by their index in the data.
TABLE 2 results of experiments on SNP100 and SNP2000 data for the discovery of a pathogenic SNP set by the method of the invention
Experiment 2: discovery of SNP group related to lung adenocarcinoma.
The real lung adenocarcinoma data contain 191 disease samples, 99 control samples, and 238304 SNPs, with 5.55% of the data missing. The results of applying the method to discover pathogenic SNP groups in these data are shown in Table 3, where SNPs are denoted by their index in the data.
TABLE 3 results of experiments on the discovery of a pathogenic SNP set for lung adenocarcinoma data by the method of the present invention
q | Discovered SNP set | Relevance | Stability | P value |
187716,1 | 130199,177958 | 0.223783 | 0.998951 | 1.3701e-005 |
3130,70815,2 | 102091,180050,234964 | 0.568258 | 0.986097 | 6.7758e-005 |
62201,3,14707 | 48316,144695,181381 | 0.586346 | 0.980482 | 7.4825e-006 |
2712,4 | 66357,206952 | 0.204549 | 0.997601 | 1.897e-005 |
5,2525,197037 | 7938,116763,236441 | 0.653206 | 0.984182 | 0.010945 |
114680,20,6 | 41440,76592,236930 | 0.492324 | 0.972419 | 0.0013376 |
From Tables 2 and 3, the following conclusions can be drawn:
(1) For the SNP100 data, the method of the invention discovers 6 of the 7 real pathogenic SNP groups in the simulated data; for the SNP2000 data, it discovers 5 of the 7; for the real lung adenocarcinoma data, it also finds 6 SNP groups of suspected pathogenic causes. The method can therefore find disease-related SNP groups in SNP data.
(2) The simulated-data experiments also show that the number of real pathogenic SNP groups found does not drop noticeably when the number of SNPs in the data increases from 100 to 2000 (6 versus 5 of 7), so the method is robust to the number of SNPs in the data; moreover, SNP groups following different models of association with the disease are found, which shows that the method is robust to the association model.
(3) The stability of the relevance of the pathogenic SNP groups discovered by the invention is very high, close to 1. Compared with the commonly used P value, the stability measure can reveal more hidden SNP groups of suspected pathogenic causes, which shows the advantage of this evaluation.
(4) The q values of the SNPs in the discovered pathogenic SNP groups show that the method also finds SNP groups whose members have weak individual relevance but a strong combined effect, such as the combination 83, 85, 100 and the combination 1818, 1747, 1998. This further shows that the invention is well suited to finding pathogenic SNP groups that are individually weakly related but jointly strongly related to the disease.
Claims (3)
1. A method for discovering SNP groups related to a complex disease from single nucleotide polymorphism (SNP) data, comprising the following steps:
(1) let C be the set of SNP groups related to the complex disease, initially empty; let M be the preset number of SNP groups related to the complex disease to be discovered, with default value 6; let L be the upper limit on the number of SNPs contained in an SNP group, with default value 5; following the principle that variation of either gene of a pair of alleles on homologous chromosomes is treated as having an equivalent influence on the disease, preprocess the SNP data into $\Omega = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of samples in the SNP data, $x_i \in \{0,1,2,3\}^d$, $d$ is the number of SNPs in the data, $y_i \in \{1, 2\}$ is the class label of sample $x_i$ (1 denotes the disease group, 2 denotes the control group), $y = [y_1, y_2, \ldots, y_N]$, and $\Omega$ denotes the preprocessed data;
(2) define the relevance measure $AS(F_r/\Omega)$ between an SNP group $F_r$ and the class label $y$ as the mutual information $MI(F_r; y)$ between $F_r$ and $y$:
$$AS(F_r/\Omega) = MI(F_r; y) = \sum_{F_r}\sum_{y} p(F_r, y)\,\log\frac{p(F_r, y)}{p(F_r)\,p(y)}$$
where $F_r$ is a set of $r$ SNPs, $p(F_r, y)$ is the joint probability of $F_r$ and $y$, $p(F_r)$ is the joint probability of the $r$ SNPs, and $p(y)$ is the probability of the class label $y$;
(3) based on the relevance measure $AS(F_r/\Omega)$ of the SNP group $F_r$, search $\Omega$ for candidate SNP groups of suspected pathogenic causes according to the following steps:
(3a) calculate the relevance measure of every single SNP in $\Omega$, and add the SNPs with the K largest relevance measures to the set D of candidate SNP groups of suspected pathogenic causes;
(3b) select an unmarked SNP group $F_r$ from D and go to step (3c); if no unmarked SNP group $F_r$ can be selected from D, end step (3);
(3c) if the number of SNPs contained in $F_r$ equals L, mark $F_r$ and go to step (3b); otherwise go to step (3d);
(3d) calculate the relevance measure of each new SNP group formed by $F_r$ together with one of the remaining SNPs in $\Omega$, add the SNP groups with the K largest relevance measures to the candidate set D, mark $F_r$, and go to step (3b);
(4) calculate the stability measure $ST(F_r)$ of the relevance of every candidate SNP group of suspected pathogenic causes found by the search:
[formula for $ST(F_r)$ missing from the source text]
Its value lies in [0, 1], where $F_r$ is a set of $r$ SNPs; $\mu_\delta(r)$ and $\sigma_\delta(r)$ are respectively the mean and the mean square error of the volatilities $\delta(F_r^i)$; $F_r^i$ is the set of $r$ SNPs obtained in the $i$-th sampling, $i = 1, 2, \ldots, m_f$; and $m_f$ is the number of samplings with replacement of the features in $\Omega$, with default value 100;
$\delta(F_r)$ is the volatility of the relevance measure $AS(F_r/\Omega)$:
[formula for $\delta(F_r)$ missing from the source text]
where $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ are respectively the mean and the mean square error of the relevance measures $AS(F_r/Z_i)$ obtained from $m_s$ samplings with replacement of the samples in $\Omega$:
[formulas for $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$ missing from the source text]
where $Z_i$ is the data obtained in the $i$-th sampling with replacement, $AS(F_r/Z_i)$ is the relevance measure of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$, and $m_s$ has default value 1000;
(5) according to the maximum stability criterion, select from the candidate set D the SNP group with the largest relevance stability as one SNP group of a suspected pathogenic cause and add it to the set C of SNP groups related to the complex disease; remove from $\Omega$ the SNPs contained in C; if the number of SNP groups in C is less than M, go to step (3), otherwise go to step (6);
(6) output all SNP groups in C, and evaluate how strongly each SNP group is suspected to be a pathogenic SNP group by how close the stability of its relevance is to 1.
2. The method of claim 1, wherein the lower limit [expression missing from the source text] in the stability measure formula of the relevance given in step (4) is calculated by the following steps:
(4a) perform $m_f$ samplings with replacement of the features in $\Omega$ to obtain SNP groups $F_r^i$, each containing $r$ SNPs [steps (4b) and (4c) are missing from the source text];
(4d) perform $m_s$ samplings with replacement of the samples in $\Omega$ to obtain data $Z_i$, $i = 1, 2, \ldots, m_s$;
(4e) calculate the relevance measure $AS(F_r/Z_i)$ of $F_r$ in the data $Z_i$, $i = 1, 2, \ldots, m_s$;
(4f) calculate the mean $\mu_{AS}(F_r)$ and the mean square error $\sigma_{AS}(F_r)$ of the relevance measures $AS(F_r/Z_i)$;
(4g) calculate the volatility $\delta(F_r)$ of the relevance measure from $\mu_{AS}(F_r)$ and $\sigma_{AS}(F_r)$;
3. The method of claim 2, wherein the mean $\mu_\delta(r)$ and the mean square error $\sigma_\delta(r)$ of the volatility of the relevance in step (4c) are calculated as follows:
[formulas missing from the source text] where $F_r^i$ is a set containing $r$ SNPs, $m_f$ is the number of samplings with replacement of the SNPs in $\Omega$, and $\delta(F_r^i)$ is the volatility of the relevance measure $AS(F_r^i/\Omega)$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010102309492A CN101894216B (en) | 2010-07-16 | 2010-07-16 | Method of discovering SNP group related to complex disease from SNP information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101894216A (en) | 2010-11-24 |
CN101894216B (en) | 2012-09-05 |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102629305B (en) * | 2012-03-06 | 2015-02-25 | Shanghai University | Feature selection method for SNP (Single Nucleotide Polymorphism) data |
CN103366100A (en) * | 2013-06-25 | 2013-10-23 | Xidian University | Method for filtering SNPs (Single Nucleotide Polymorphisms) unrelated to complex diseases from the whole genome |
CN104462868B (en) * | 2014-12-11 | 2017-04-05 | Xidian University | Whole-genome SNP site analysis method combining random forest and Relief-F |
CN105354444B (en) * | 2015-11-24 | 2018-06-19 | South China University of Technology | Method for screening susceptible SNP combinations of complex diseases based on susceptible SNPs |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1480532A (en) * | 2002-09-04 | 2004-03-10 | Huazhong Agricultural University | Pig melanocortin-3 receptor gene and method for detecting its single nucleotide polymorphism |
CN101346724A (en) * | 2005-11-26 | 2009-01-14 | Gene Security Network, LLC | System and method for cleaning noisy genetic data and using genetic, phenotypic and clinical data to make predictions |
CN101570788A (en) * | 2009-06-09 | 2009-11-04 | East China Normal University | Method for recognizing genotype through single nucleotide polymorphism chip |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040102905A1 (en) * | 2001-03-26 | 2004-05-27 | Epigenomics Ag | Method for epigenetic feature selection |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2012-09-05; termination date: 2018-07-16 |