CN101894216B

CN101894216B - Method of discovering SNP group related to complex disease from SNP information

Info

Publication number: CN101894216B
Application number: CN2010102309492A
Authority: CN
Inventors: 张军英; 耿耀君; 于国强; 赵晓雪; 尚军亮; 王跃
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2010-07-16
Filing date: 2010-07-16
Publication date: 2012-09-05
Anticipated expiration: 2030-07-16
Also published as: CN101894216A

Abstract

The invention discloses a method for discovering SNP groups related to a complex disease from SNP data, solving the problem that prior-art methods cannot find, in the SNP data of a complex disease, the multiple causes of disease susceptibility and the SNP combination related to each cause. The method comprises the steps of: preprocessing an SNP data set of the complex disease; searching the preprocessed data set for candidate SNP groups that are suspected causes of the disease, according to a measure of the relevance of each SNP group; calculating the stability measure of the relevance of every candidate SNP group; adding the SNP group with the maximum relevance stability measure, as a suspected cause, to a set of SNP groups related to the complex disease; and outputting the SNP groups in that set, and evaluating how strongly each SNP group is suspected to be a cause by how close its relevance stability is to 1. The invention can simultaneously find multiple causative SNP groups hidden in the data, and can be used for research into pathogenic mechanisms, early diagnosis of complex diseases, and the development of biomedicines.

Description

Method for finding SNP group related to complex disease from SNP data

Technical Field

The invention belongs to the technical field of data processing, and in particular relates to a method for discovering SNP groups related to a complex disease from single nucleotide polymorphism (SNP) data by using a maximum stability criterion. It can be used for research into the pathogenesis of complex diseases, for early diagnosis, and for the development of biological medicines.

Background

A complex disease arises from the combined action of multiple genetic and environmental factors, and its onset and development are influenced by many genes organized in a complex network. Complex diseases differ from Mendelian genetic diseases in that, in most cases, no single major gene is sufficient to cause the disease: the effect of any one gene on the disease may be negligible or even nonexistent, yet the combined effect of these individually insignificant genes may be what causes the disease. These characteristics make it very difficult to find the causative genes or related markers needed for pathogenesis research, early diagnosis and biopharmaceutical development. The main open problems are how to find the multiple causes of a disease across the whole genome, and which genes combine to form each cause.

To overcome these problems, researchers have developed a variety of methods for finding disease markers. These mainly include methods based on hypothesis testing, methods based on feature selection, and methods based on causal analysis:

(1) Methods based on hypothesis testing. This is currently the most important approach for finding single-gene diseases, and the search is usually exhaustive. Single pathogenic SNPs can be found in whole-genome data, or pathogenic two-SNP combinations can be found in medium-scale data.

(2) Methods based on feature selection. Some variants can process large-scale data but generally do not consider correlations between features; others can examine limited feature correlations but can only handle small and medium-sized data; still others can be combined with a classifier in various ways and aim at the best generalization performance of the classifier, but are computationally expensive and only suitable for small-scale data.

(3) Methods based on causal analysis. Most of these take the form of causal networks; they are currently only at the theoretical research stage and cannot process large-scale data.

None of the three kinds of method sets out to discover the causative SNPs that are objectively present in the data. Computationally, they can either find only single pathogenic SNPs, and so cannot be used to discover the multiple SNP groups associated with a complex disease, or their very large computational cost prevents their use on whole-genome data.

Disclosure of Invention

The present invention aims to overcome the disadvantages of the conventional methods by providing a method for discovering SNP groups related to a complex disease from SNP data. Starting from the goal of discovering the causative SNPs that are objectively present in the data, the method finds multiple possible causes of the disease, and the SNP combination related to each cause, from the SNP data of the complex disease.

The technical scheme for realizing the invention comprises the following steps:

(1) Let C be the set of single nucleotide polymorphism (SNP) groups related to the complex disease, with initial value empty; let M be the preset number of SNP groups related to the complex disease to be discovered, with default value 6; let L be the upper limit on the number of SNPs contained in an SNP group, with default value 5. Following the principle that a variation in either allele of a pair of homologous chromosomes is treated as having an equivalent influence on the disease, the SNP data are preprocessed into:

$$\Omega = \{(x_i, y_i)\}_{i=1}^{N}$$

where N is the number of samples in the SNP data, x_i ∈ {0, 1, 2, 3}^d is the i-th sample, d is the number of SNPs in the data, y_i ∈ {1, 2} is the class label of sample x_i (1 denotes the disease group, 2 the control group), y = [y_1, y_2, ..., y_N], and Ω denotes the preprocessed data;

(2) The relevance measure AS(F_r/Ω) between an SNP group F_r and the class label y is defined as the mutual information MI(F_r; y) between F_r and y:

$$AS(F_r/\Omega) = MI(F_r; y) = \sum_{F_r} \sum_{y} p(F_r, y) \log \frac{p(F_r, y)}{p(F_r)\, p(y)}$$

where F_r is a set of r SNPs, p(F_r, y) is the joint probability of F_r and y, p(F_r) is the joint probability of the r SNPs, and p(y) is the probability of the class label y;

(3) Based on the relevance measure AS(F_r/Ω) of an SNP group F_r, search Ω for candidate SNP groups that are suspected causes of the disease, as follows:

(3a) calculate the relevance measure of each SNP in Ω, and add the SNPs with the K largest relevance values to the set D of candidate suspected-cause SNP groups;

(3b) select an unmarked SNP group F_r from D and go to step (3c); if no unmarked SNP group can be selected from D, end step (3);

(3c) if the number of SNPs contained in F_r equals L, mark F_r and go to step (3b); otherwise go to step (3d);

(3d) calculate the relevance measure of each new SNP group formed by F_r together with one of the remaining SNPs in Ω, add the SNP groups with the K largest relevance values to the set D of candidate suspected-cause SNP groups, mark F_r, and go to step (3b);

(4) Calculate the stability measure ST(F_r) of the relevance of every candidate suspected-cause SNP group found by the search:

$$ST(F_r) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}}^{+\infty} \exp\left(-\frac{x^2}{2}\right) dx$$

where F_r is a set of r SNPs; μ_δ(r) and σ_δ(r) are respectively the mean and standard deviation of the volatility δ(F_r^i), where F_r^i is the set of r SNPs obtained in the i-th sampling, i = 1, 2, ..., m_f, and m_f, with default value 100, is the number of samplings with replacement of the SNPs in Ω;

δ(F_r) is the volatility of the relevance measure AS(F_r/Ω), computed from μ_AS(F_r) and σ_AS(F_r),

where μ_AS(F_r) and σ_AS(F_r) are respectively the mean and standard deviation of the relevance measures AS(F_r/Z_i) obtained from m_s samplings with replacement of the samples in Ω,

where Z_i is the data obtained in the i-th sampling with replacement, AS(F_r/Z_i) is the relevance measure of F_r in the data Z_i, i = 1, 2, ..., m_s, and m_s has default value 1000;

(5) According to the maximum stability criterion, select from the set D of candidate suspected-cause SNP groups the SNP group whose relevance has the largest stability measure, take it as one suspected-cause SNP group, add it to the set C of SNP groups related to the complex disease, and remove from Ω the SNPs contained in the groups of C; if the number of SNP groups in C is less than M, go to step (3), otherwise go to step (6);

(6) Output the SNP groups in C, and evaluate how strongly each SNP group is suspected to be a causative SNP group by how close the stability of its relevance is to 1.
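As a reading aid, the integral in step (4) can be rewritten with the standard normal cumulative distribution function Φ. This is a standard identity rather than anything stated in the source, and the symbol z is introduced here only for illustration:

```latex
ST(F_r) = \frac{1}{\sqrt{2\pi}} \int_{z}^{+\infty} e^{-x^2/2}\,dx = 1 - \Phi(z),
\qquad
z = \frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}
```

Under this reading, z = 0 (volatility equal to the average over randomly drawn groups of r SNPs) gives ST = 0.5, and ST approaches 1 as the volatility of F_r falls further below that average.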

The invention has the following advantages:

(1) The invention uses mutual information as the measure of relevance between an SNP group and the disease, which captures not only linear but also nonlinear statistical relationships between the SNP group and the disease.

(2) The invention proposes using the stability of the relevance of an SNP group to discover suspected-cause SNP groups. Using sampling with replacement, this evaluation method judges from a statistical standpoint whether the association between an SNP group and the disease is stable, which makes it possible to find SNP groups that are objectively related to the disease.

(3) In the process of finding the SNP groups related to the complex disease, no artificial parameter is introduced, and no existing hypothesis-based machine learning, pattern recognition or data mining method is used, so the influence of artificial assumptions is avoided as far as possible.

(4) The invention can find, from the SNP data of a complex disease, multiple possible causes of disease susceptibility and the SNP combination related to each cause.

Drawings

FIG. 1 is a flow chart of the present invention for discovering multiple pathogenic SNP groups.

Detailed Description

Referring to fig. 1, the specific implementation steps of the present invention are as follows:

Step 1, preprocessing and initializing the SNP data.

(1.1) Following the principle that a variation in either allele of a pair of homologous chromosomes is treated as having an equivalent influence on the disease, process the SNP data into data containing only the values 0, 1, 2 and 3, where 0 denotes missing data;

(1.2) let

$$\Omega = \{(x_i, y_i)\}_{i=1}^{N}$$

denote the preprocessed data, where N is the number of samples in the SNP data, x_i ∈ {0, 1, 2, 3}^d is the i-th sample, d is the number of SNPs in the data, y_i ∈ {1, 2} is the class label of sample x_i (1 denotes the disease group, 2 the control group), and y = [y_1, y_2, ..., y_N];

(1.3) let C be the set of SNP groups related to the complex disease, initialized to empty; let M be the number of SNP groups related to the complex disease that are expected to be found, with default value 6; let L be the upper limit on the number of SNPs contained in an SNP group, with default value 5.
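A minimal sketch of step 1 follows. The source states only that the codes are 0 to 3, that 0 marks missing data, and that the two alleles are treated equivalently; the concrete mapping below (1 plus the number of minor alleles), the two-letter genotype strings, and the names `encode_genotype` and `preprocess` are assumptions made for illustration.

```python
import numpy as np


def encode_genotype(genotype, minor_allele):
    """Map a two-letter genotype such as 'AG' to a code in {0, 1, 2, 3}.

    ASSUMED coding (not given in the source): 0 = missing, 1 = no minor allele,
    2 = one minor allele, 3 = two minor alleles.
    """
    if not genotype or len(genotype) != 2 or any(a not in "ACGT" for a in genotype):
        return 0  # missing or unreadable call
    # Counting minor alleles ignores allele order, so 'AG' and 'GA' get the same code.
    return 1 + sum(a == minor_allele for a in genotype)


def preprocess(genotypes, minor_alleles, labels):
    """Build the data set Omega as (X, y): X is N x d codes, y holds labels 1 or 2."""
    X = np.array(
        [[encode_genotype(g, m) for g, m in zip(row, minor_alleles)] for row in genotypes],
        dtype=np.int8,
    )
    y = np.asarray(labels, dtype=np.int8)
    return X, y


# Three samples, two SNPs; label 1 = disease group, 2 = control group.
genotypes = [["AA", "CT"], ["AG", "TT"], ["GA", "NN"]]
X, y = preprocess(genotypes, minor_alleles=["G", "T"], labels=[1, 2, 2])
C, M, L = [], 6, 5  # step (1.3): empty result set and the default limits
```

Any other coding that keeps 0 for missing data and gives 'AG' and 'GA' the same value would serve equally well for the later steps.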

Step 2, defining the relevance measure.

(2.1) The relevance measure AS(F_r/Ω) is defined as the mutual information MI(F_r; y) between the SNP group F_r and the class label y, given by formula (1):

$$AS(F_r/\Omega) = MI(F_r; y) = \sum_{F_r} \sum_{y} p(F_r, y) \log \frac{p(F_r, y)}{p(F_r)\, p(y)} \tag{1}$$

where F_r is a set of r SNPs, p(F_r, y) is the joint probability of F_r and y, p(F_r) is the joint probability of the r SNPs, and p(y) is the probability of the class label y;
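The relevance measure can be sketched as an empirical mutual information estimate, with probabilities replaced by observed frequencies. The function name `association`, the base-2 logarithm, and the treatment of the missing code 0 as an ordinary genotype value are assumptions; the source does not specify them.

```python
from collections import Counter
from math import log2

import numpy as np


def association(X, y, snp_set):
    """Empirical mutual information, in bits, between the SNPs in snp_set and y.

    X is an (N, d) array of genotype codes in {0, 1, 2, 3}; y holds class labels.
    """
    n = len(y)
    labels = [int(c) for c in y]
    # Each sample's genotype pattern over the chosen SNPs is one joint outcome of F_r.
    patterns = [tuple(int(v) for v in row) for row in X[:, list(snp_set)]]
    joint = Counter(zip(patterns, labels))  # counts behind p(F_r, y)
    n_f = Counter(patterns)                 # counts behind p(F_r)
    n_y = Counter(labels)                   # counts behind p(y)
    # p(f, c) / (p(f) p(c)) simplifies to count * n / (n_f[f] * n_y[c]).
    return sum(
        (count / n) * log2(count * n / (n_f[f] * n_y[c]))
        for (f, c), count in joint.items()
    )


X = np.array([[1, 1], [1, 3], [3, 1], [3, 3]])
y = np.array([1, 1, 2, 2])
as_snp0 = association(X, y, [0])  # SNP 0 determines y exactly: 1 bit
as_snp1 = association(X, y, [1])  # SNP 1 is independent of y: 0 bits
```

Because the joint pattern of all r SNPs is treated as a single outcome, the number of possible patterns grows as 4^r, which is one practical reason for bounding the group size by L.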

Step 3, searching Ω for candidate SNP groups that are suspected causes of the disease.

(3.1) Calculate the relevance measure of each SNP by formula (1), and add the SNPs with the K largest relevance values to the set D of candidate suspected-cause SNP groups;

(3.2) select an unmarked SNP group F_r from D and go to step (3.3); if no unmarked SNP group can be selected from D, end step 3;

(3.3) if the number of SNPs contained in F_r equals L, mark F_r and go to step (3.2); otherwise go to step (3.4);

(3.4) calculate by formula (1) the relevance measure of each new SNP group formed by F_r together with one of the remaining SNPs in Ω, add the SNP groups with the K largest relevance values to the set D of candidate suspected-cause SNP groups, mark F_r, and go to step (3.2).
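The search in step 3 can be sketched as follows. The source gives no default for K, so `K=3` is a hypothetical choice; the queue of unmarked groups, the tie-breaking rule, and the names `search_candidates` and `association` are likewise illustrative assumptions.

```python
from collections import Counter, deque
from math import log2

import numpy as np


def association(X, y, snp_set):
    """Empirical mutual information (bits) between the SNPs in snp_set and y."""
    n = len(y)
    labels = [int(c) for c in y]
    patterns = [tuple(int(v) for v in row) for row in X[:, sorted(snp_set)]]
    joint, n_f, n_y = Counter(zip(patterns, labels)), Counter(patterns), Counter(labels)
    return sum((k / n) * log2(k * n / (n_f[f] * n_y[c])) for (f, c), k in joint.items())


def search_candidates(X, y, snps, K=3, L=5):
    """Return the candidate set D as a list of frozensets of SNP column indices."""
    snps = sorted(snps)

    def top_k(groups):
        # Highest relevance first; ties broken by SNP indices for reproducibility.
        return sorted(groups, key=lambda g: (-association(X, y, g), sorted(g)))[:K]

    D = top_k([frozenset([s]) for s in snps])  # step (3.1)
    in_D = set(D)
    unmarked = deque(D)
    while unmarked:                            # step (3.2)
        F = unmarked.popleft()                 # F counts as marked once popped
        if len(F) == L:                        # step (3.3): full size, do not extend
            continue
        extended = {F | {s} for s in snps if s not in F}
        for G in top_k(extended):              # step (3.4)
            if G not in in_D:
                in_D.add(G)
                D.append(G)
                unmarked.append(G)
    return D


rng = np.random.default_rng(0)
X = rng.integers(1, 4, size=(200, 6))  # toy genotype codes 1..3
y = np.where(X[:, 0] == 3, 1, 2)       # the label depends on SNP 0 only
D = search_candidates(X, y, range(6), K=2, L=2)
```

With these settings D holds at most K single SNPs plus K extensions of each, so its size is bounded by K + K^2 + ... + K^L in general.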

Step 4, calculating the stability measure of the relevance of every candidate suspected-cause SNP group F_r found by the search.

(4.1) The stability measure ST(F_r) of the relevance of F_r is defined by formula (2):

$$ST(F_r) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}}^{+\infty} \exp\left(-\frac{x^2}{2}\right) dx \tag{2}$$

where μ_δ(r) and σ_δ(r) are respectively the mean and standard deviation of the volatility δ(F_r^i), F_r^i is the set of r SNPs obtained in the i-th sampling, and δ(F_r), the volatility of the relevance measure AS(F_r/Ω), is defined by formula (3);

(4.2) calculate the volatility δ(F_r) of the relevance measure AS(F_r/Ω) as follows:

(4.2.1) perform m_s samplings with replacement on the samples in Ω to obtain data Z_i, i = 1, 2, ..., m_s;

(4.2.2) for every Z_i, calculate the relevance measure AS(F_r/Z_i) of F_r in Z_i;

(4.2.3) calculate the mean μ_AS(F_r) and standard deviation σ_AS(F_r) of the relevance measures AS(F_r/Z_i) from formula (4);

(4.2.4) calculate the volatility δ(F_r) of the relevance measure from formula (3);

(4.3) calculate μ_δ(r) and σ_δ(r) as follows:

(4.3.1) perform m_f samplings with replacement on the SNPs in Ω to obtain SNP groups F_r^i, each containing r SNPs, i = 1, 2, ..., m_f;

(4.3.2) calculate the volatility δ(F_r^i) of the relevance measure of each F_r^i, i = 1, 2, ..., m_f:

(4.3.2.1) perform m_s samplings with replacement on the samples in Ω to obtain data Z_j, j = 1, 2, ..., m_s;

(4.3.2.2) for every Z_j, calculate the relevance measure AS(F_r^i/Z_j) of F_r^i in Z_j;

(4.3.2.3) calculate the mean μ_AS(F_r^i) and standard deviation σ_AS(F_r^i) of the relevance measures AS(F_r^i/Z_j);

(4.3.2.4) calculate the volatility δ(F_r^i) of the relevance measure;

(4.3.3) calculate the mean μ_δ(r) and standard deviation σ_δ(r) of the volatilities δ(F_r^i), i = 1, 2, ..., m_f;

(4.4) for each SNP group F_r in the set D of candidate suspected-cause SNP groups obtained in step 3, calculate the stability of its relevance as follows:

(4.4.1) calculate the volatility δ(F_r) of the SNP group F_r by step (4.2);

(4.4.2) calculate μ_δ(r) and σ_δ(r) by step (4.3);

(4.4.3) calculate the lower limit of integration of formula (2),

$$\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)},$$

and substitute it into formula (2) to calculate the stability of the SNP group F_r.
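Step 4 can be sketched with bootstrap resampling as below. Formula (3) for the volatility is not reproduced in this text, so the code ASSUMES the coefficient of variation σ_AS / μ_AS; that choice, the population standard deviation, drawing the r SNPs of each random group without repetition, and all function names are illustrative assumptions rather than the patented definitions.

```python
from collections import Counter
from math import erfc, log2, sqrt

import numpy as np


def association(X, y, snp_set):
    """Empirical mutual information (bits) between the SNPs in snp_set and y."""
    n = len(y)
    labels = [int(c) for c in y]
    patterns = [tuple(int(v) for v in row) for row in X[:, list(snp_set)]]
    joint, n_f, n_y = Counter(zip(patterns, labels)), Counter(patterns), Counter(labels)
    return sum((k / n) * log2(k * n / (n_f[f] * n_y[c])) for (f, c), k in joint.items())


def volatility(X, y, snp_set, m_s, rng):
    """Volatility of the relevance over m_s resamplings of the samples (step 4.2)."""
    n = len(y)
    values = np.empty(m_s)
    for i in range(m_s):
        idx = rng.integers(0, n, size=n)  # Z_i: rows drawn with replacement
        values[i] = association(X[idx], y[idx], snp_set)
    # ASSUMPTION: volatility = sigma_AS / mu_AS; expects a positive mean relevance.
    return values.std() / values.mean()


def upper_tail(z):
    """(1 / sqrt(2*pi)) * integral from z to +infinity of exp(-x^2 / 2) dx."""
    return 0.5 * erfc(z / sqrt(2))


def stability(X, y, snp_set, m_s=1000, m_f=100, seed=0):
    """Stability ST(F_r) of formula (2), under the assumptions stated above."""
    rng = np.random.default_rng(seed)
    r, d = len(snp_set), X.shape[1]
    delta = volatility(X, y, snp_set, m_s, rng)
    # Step (4.3): volatilities of m_f randomly drawn groups of r SNPs.
    ref = np.array([
        volatility(X, y, rng.choice(d, size=r, replace=False), m_s, rng)
        for _ in range(m_f)
    ])
    z = (delta - ref.mean()) / ref.std()  # lower limit of integration, step (4.4.3)
    return upper_tail(z)


rng = np.random.default_rng(1)
X = rng.integers(1, 4, size=(300, 8))
y = np.where(X[:, 0] == 3, 1, 2)           # the label depends on SNP 0 only
st = stability(X, y, [0], m_s=30, m_f=20)  # small counts to keep the demo fast
```

Since μ_δ(r) and σ_δ(r) depend only on r, a fuller implementation could compute them once per group size and reuse them for every candidate in D.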

Step 5, selecting an SNP group suspected of causing the disease.

(5.1) According to the maximum stability criterion, select from the set D of candidate suspected-cause SNP groups the SNP group with the largest relevance stability measure, take it as one suspected-cause SNP group, and denote it S;

(5.2) add S to the set C of SNP groups related to the complex disease; if the number of SNP groups in C equals M, go to step 6;

(5.3) remove the SNPs contained in S from the data Ω, and go to step 3 to find the next SNP group related to the disease.

Step 6, outputting the SNP groups related to the complex disease, and evaluating how strongly each SNP group is suspected to be a causative SNP group by how close the stability of its relevance is to 1.
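The outer loop of steps 3 to 6 can be sketched independently of how the search and the stability measure are implemented. Here `search` and `stability` are hypothetical callables supplied by the caller, and the early exits when Ω or D becomes empty are safeguards added for this sketch, not part of the source.

```python
def discover_snp_groups(snps, search, stability, M=6):
    """Greedy selection by the maximum stability criterion.

    search(remaining) returns the candidate set D for the SNPs still in Omega (step 3);
    stability(F) returns the stability measure of a candidate group F (step 4).
    Returns C as a list of (group, stability) pairs in order of discovery.
    """
    remaining = set(snps)  # SNPs still present in Omega
    C = []
    while len(C) < M and remaining:
        D = search(remaining)
        if not D:
            break
        scores = {F: stability(F) for F in D}  # evaluate each candidate once
        S = max(scores, key=scores.get)        # step (5.1)
        C.append((S, scores[S]))               # step (5.2)
        remaining -= set(S)                    # step (5.3)
    return C                                   # step 6: groups with their stability


# Toy stand-ins: every pair of remaining SNPs is a candidate, and groups of
# low-numbered SNPs are declared more stable.
def all_pairs(remaining):
    ordered = sorted(remaining)
    return [frozenset((a, b)) for a in ordered for b in ordered if a < b]


def toy_stability(F):
    return 1.0 - 0.1 * min(F) - 0.01 * max(F)


C = discover_snp_groups(range(6), all_pairs, toy_stability, M=2)
```

Removing the SNPs of each selected group before searching again is what lets later rounds surface groups that a dominant group would otherwise mask.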

The effect of the method of the invention is described in more detail through the following experimental examples. These examples are for illustrative purposes and are not intended to limit the scope of the invention.

Experiment 1: Discovery of SNP groups related to a complex disease in simulated data.

The simulated data were obtained by biologists adding 7 known SNP groups related to complex diseases to real SNP data from a New York population; the 7 SNP groups differ in their association models with the disease. There are two data sets: the first contains 2000 samples and 100 SNPs and is denoted SNP100; the second contains 2000 samples and 2000 SNPs and is denoted SNP2000. Details of the data are given in Table 1. The experimental results obtained on the two data sets of Table 1 are given in Table 2.

TABLE 1 Experimental data

Data set name	Number of SNPs	Number of samples	Samples in disease group	Samples in control group
SNP100	100	2000	1127	873
SNP2000	2000	2000	1181	819

In Table 2, q denotes the rank of each SNP of a discovered SNP group when all individual SNPs are ordered by relevance from largest to smallest; "discovered SNP group" denotes an SNP group related to the disease that the method of the invention found in the data; "pathogenic SNP group" denotes a known SNP group related to complex diseases that biologists added to the data in advance; "relevance" denotes the relevance measure between the discovered SNP group and the disease; "stability" denotes the stability measure of the relevance of the discovered SNP group; the P value is a measure commonly used in this field to evaluate the quality of an SNP group found in SNP data; SNPs in the table are identified by their numbers in the data.

TABLE 2 results of experiments on SNP100 and SNP2000 data for the discovery of a pathogenic SNP set by the method of the invention

Experiment 2: discovery of SNP group related to lung adenocarcinoma.

The real lung adenocarcinoma data contain 191 disease samples, 99 control samples and 238304 SNPs, with 5.55% of the data missing. The results of applying the method to these data to discover pathogenic SNP groups are shown in Table 3, in which SNPs are identified by their numbers in the data.

TABLE 3 results of experiments on the discovery of a pathogenic SNP set for lung adenocarcinoma data by the method of the present invention

q	Discovered SNP set	Relevance	Stability	P value
187716, 1	130199, 177958	0.223783	0.998951	1.3701e-005
3130, 70815, 2	102091, 180050, 234964	0.568258	0.986097	6.7758e-005
62201, 3, 14707	48316, 144695, 181381	0.586346	0.980482	7.4825e-006
2712, 4	66357, 206952	0.204549	0.997601	1.897e-005
5, 2525, 197037	7938, 116763, 236441	0.653206	0.984182	0.010945
114680, 20, 6	41440, 76592, 236930	0.492324	0.972419	0.0013376

From Tables 2 and 3, the following conclusions can be drawn:

(1) For the SNP100 data, the method of the invention discovers 6 of the 7 true pathogenic SNP groups in the simulated data; for the SNP2000 data, it discovers 5 of the 7; for the real lung adenocarcinoma data, 6 suspected-cause SNP groups are found. The method of the invention can therefore find disease-related SNP groups in SNP data.

(2) The simulated-data experiments also show that the number of true pathogenic SNP groups found does not drop noticeably as the number of SNPs in the data increases, so the method is robust to the number of SNPs in the data; at the same time, SNP groups with different association models with the disease are found, which shows that the method is robust to the association model.

(3) In terms of stability, the relevance of the pathogenic SNP groups discovered by the invention is highly stable, with stability close to 1. Compared with the commonly used P value, the stability measure can uncover more hidden suspected-cause SNP groups, which shows the advantage of this evaluation method.

(4) In terms of the q values of the SNPs in the discovered pathogenic SNP groups: some SNP groups whose members have weak individual relevance but a strong combined effect, such as the combination 83, 85, 100 and the combination 1818, 1747, 1998, are also found successfully by the method of the invention. This further shows that the invention is well able to find pathogenic SNP groups whose SNPs are individually weakly associated but jointly strongly associated with the disease.
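Conclusion (4) can be illustrated with a constructed toy case, not taken from the experiments: when the class label is the exclusive-or of two SNPs, each SNP alone carries no information about the label, yet the pair determines it completely. The function name `association` and the base-2 logarithm are assumptions.

```python
from collections import Counter
from math import log2

import numpy as np


def association(X, y, snp_set):
    """Empirical mutual information (bits) between the SNPs in snp_set and y."""
    n = len(y)
    labels = [int(c) for c in y]
    patterns = [tuple(int(v) for v in row) for row in X[:, list(snp_set)]]
    joint, n_f, n_y = Counter(zip(patterns, labels)), Counter(patterns), Counter(labels)
    return sum((k / n) * log2(k * n / (n_f[f] * n_y[c])) for (f, c), k in joint.items())


# Four equally frequent genotype patterns; disease (label 1) when the two SNPs differ.
X = np.array([[1, 1], [1, 3], [3, 1], [3, 3]])
y = np.array([2, 1, 1, 2])

single_0 = association(X, y, [0])   # 0 bits: SNP 0 alone is uninformative
single_1 = association(X, y, [1])   # 0 bits: SNP 1 alone is uninformative
pair = association(X, y, [0, 1])    # 1 bit: together they determine the label
```

In such a case both SNPs would rank low individually (large q) while their combination has high relevance, which is the pattern described in conclusion (4).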

Claims

1. A method for discovering SNP groups related to a complex disease from single nucleotide polymorphism (SNP) data, comprising the following steps:

(1) letting C be the set of SNP groups related to the complex disease, with initial value empty; letting M be the preset number of SNP groups related to the complex disease to be discovered, with default value 6; letting L be the upper limit on the number of SNPs contained in an SNP group, with default value 5; and, following the principle that a variation in either allele of a pair of homologous chromosomes is treated as having an equivalent influence on the disease, preprocessing the SNP data into:

$$\Omega = \{(x_i, y_i)\}_{i=1}^{N}$$

wherein N is the number of samples in the SNP data, x_i ∈ {0, 1, 2, 3}^d is the i-th sample, d is the number of SNPs in the data, y_i ∈ {1, 2} is the class label of sample x_i (1 denotes the disease group, 2 the control group), y = [y_1, y_2, ..., y_N], and Ω denotes the preprocessed data;

(2) defining the relevance measure AS(F_r/Ω) between an SNP group F_r and the class label y as the mutual information MI(F_r; y) between F_r and y:

$$AS(F_r/\Omega) = MI(F_r; y) = \sum_{F_r} \sum_{y} p(F_r, y) \log \frac{p(F_r, y)}{p(F_r)\, p(y)}$$

wherein F_r is a set of r SNPs, p(F_r, y) is the joint probability of F_r and y, p(F_r) is the joint probability of the r SNPs, and p(y) is the probability of the class label y;

(3) based on the relevance measure AS(F_r/Ω) of an SNP group F_r, searching Ω for candidate SNP groups that are suspected causes of the disease, as follows:

(3a) calculating the relevance measure of each SNP in Ω, and adding the SNPs with the K largest relevance values to the set D of candidate suspected-cause SNP groups;

(3b) selecting an unmarked SNP group F_r from D and going to step (3c); if no unmarked SNP group can be selected from D, ending step (3);

(3c) if the number of SNPs contained in F_r equals L, marking F_r and going to step (3b); otherwise going to step (3d);

(3d) calculating the relevance measure of each new SNP group formed by F_r together with one of the remaining SNPs in Ω, adding the SNP groups with the K largest relevance values to the set D of candidate suspected-cause SNP groups, marking F_r, and going to step (3b);

(4) calculating the stability measure ST(F_r) of the relevance of every candidate suspected-cause SNP group found by the search:

$$ST(F_r) = \frac{1}{\sqrt{2\pi}} \int_{\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}}^{+\infty} \exp\left(-\frac{x^2}{2}\right) dx$$

whose value lies in [0, 1], wherein F_r is a set of r SNPs; μ_δ(r) and σ_δ(r) are respectively the mean and standard deviation of the volatility δ(F_r^i), F_r^i is the set of r SNPs obtained in the i-th sampling, i = 1, 2, ..., m_f, and m_f, with default value 100, is the number of samplings with replacement of the features in Ω;

δ(F_r) is the volatility of the relevance measure AS(F_r/Ω), computed from μ_AS(F_r) and σ_AS(F_r),

wherein μ_AS(F_r) and σ_AS(F_r) are respectively the mean and standard deviation of the relevance measures AS(F_r/Z_i) obtained from m_s samplings with replacement of the samples in Ω;

(5) according to the maximum stability criterion, selecting from the set D of candidate suspected-cause SNP groups the SNP group whose relevance has the highest stability, taking it as one suspected-cause SNP group, adding it to the set C of SNP groups related to the complex disease, and removing from Ω the SNPs contained in the groups of C; if the number of SNP groups in C is less than M, going to step (3), otherwise going to step (6);

(6) outputting all the SNP groups in C, and evaluating how strongly each SNP group is suspected to be a causative SNP group by how close the stability of its relevance is to 1.

2. The method of claim 1, wherein the lower limit

$$\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}$$

in the formula for the stability measure of the relevance given in step (4) is determined by the following steps:

(4a) performing m_f samplings with replacement on the features in Ω to obtain SNP groups F_r^i, each containing r SNPs, i = 1, 2, ..., m_f;

(4b) calculating the volatility δ(F_r^i) of the relevance of each F_r^i, i = 1, 2, ..., m_f;

(4c) calculating the mean μ_δ(r) and standard deviation σ_δ(r) of the volatilities δ(F_r^i) of the relevance;

(4d) performing m_s samplings with replacement on the samples in Ω to obtain data Z_i, i = 1, 2, ..., m_s;

(4e) calculating the relevance measure AS(F_r/Z_i) of F_r in the data Z_i, i = 1, 2, ..., m_s;

(4f) calculating the mean μ_AS(F_r) and standard deviation σ_AS(F_r) of the relevance measures AS(F_r/Z_i);

(4g) calculating the volatility δ(F_r) of the relevance measure from μ_AS(F_r) and σ_AS(F_r);

(4h) determining the lower limit in the formula for the stability measure of the relevance:

$$\frac{\delta(F_r) - \mu_\delta(r)}{\sigma_\delta(r)}$$

3. The method of claim 2, wherein the mean μ_δ(r) and standard deviation σ_δ(r) of the volatilities δ(F_r^i) of the relevance in step (4c) are calculated over i = 1, 2, ..., m_f,

wherein F_r^i is a set containing r SNPs, m_f is the number of samplings with replacement performed on the SNPs in Ω, and δ(F_r^i) is the volatility of the relevance measure AS(F_r^i/Ω),

wherein Z_j is the data obtained in the j-th sampling with replacement, AS(F_r^i/Z_j) is the relevance measure of F_r^i in the data Z_j, and j = 1, 2, ..., m_s.