CN103366100A

CN103366100A - Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome

Info

Publication number: CN103366100A
Application number: CN2013102796270A
Authority: CN
Inventors: 张军英; 刘丹; 赵晓雪; 谭芳慧
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2013-06-25
Filing date: 2013-06-25
Publication date: 2013-10-23

Abstract

A method for filtering SNPs unrelated to complex diseases from the whole genome, for use in research on pathogenic mechanisms, early diagnosis, and biopharmaceutical development for complex diseases. (1) Preprocess and initialize the single nucleotide polymorphism (SNP) data: following the principle that a variation in either allele on homologous chromosomes is treated as having the same effect on the disease, the SNP data is converted into data containing only the values 0, 1, 2, and 3. (2) Define the association measure: the association I(Y;X) between an SNP subset X and a disease Y is defined as the mutual information MI(Y;X) between X and Y. (3) Use the FGSA method to search the SNP set for candidate SNP groups suspected of causing the disease. (4) Using the frequency-association priority criterion, select from the candidate SNP groups those whose frequency of occurrence exceeds the threshold. (5) Output the top-ranked SNPs whose frequency is greater than the threshold. The method retains the SNPs of pathogenic causes that are masked by other pathogenic causes, laying the foundation for the subsequent discovery of those causes.

Description

A method for filtering SNPs irrelevant to complex diseases from the whole genome

Technical field

The invention belongs to the technical field of data processing. Specifically, it proposes a method for filtering SNPs unrelated to complex diseases out of whole-genome single nucleotide polymorphism (SNP) data, which can be used in research on the pathogenic mechanisms of complex diseases, early diagnosis, and biopharmaceutical development.

Background

Complex diseases result from the combined action of multiple genetic and environmental factors, and their onset and progression are influenced by many genes organized in a complex network. Unlike Mendelian genetic diseases, complex diseases usually have no major gene that is sufficient on its own to cause the disease. The contribution of any single gene may be negligible or even absent, yet the joint effect of several such genes may be the cause of the disease. These characteristics make it very difficult to discover the pathogenic genes of complex diseases, or related markers, for use in research on pathogenic mechanisms, early diagnosis, and biopharmaceutical development. The main open problem is how to find, on a genome-wide scale, the multiple causes of a disease and which genes combine to form each cause.

Existing solutions fall into two categories: direct methods and two-step methods. A direct method searches the original SNP set directly and can only handle small and medium-scale data; the scale it can handle depends on the algorithm (for example, MDR and BEAM differ considerably in this respect). A two-step method first filters out of the original SNP set those SNPs that are unrelated to the disease and then searches the remaining SNPs. The invention concerns the first of these two steps: SNP filtering.

In most two-step methods the first step is exhaustive: pairs of SNPs with high scores are found and kept for the second step (for example BOOST and AntEpiSeeker). The ability of this approach to handle large-scale data is extremely limited. For this reason, an ant colony algorithm has been introduced to find SNP combinations of higher order than exhaustive search can handle (AntEpiSeeker), and random forests have been introduced to obtain the set of SNPs needing further examination by finding SNP subsets with strong classification ability. All of these filtering methods have the following shortcomings:

1. The order of SNP interaction that can be handled is very limited. Because exhaustive enumeration is computationally expensive, only pairs of SNPs are enumerated and scored, so only second-order SNP interactions (interactions between two SNPs) are retained and higher-order SNP interactions are lost.

2. The number of SNPs that can be handled is very limited, because the filtering process requires complex computation. The random forest method, for example, handles only about 100 SNPs; AntEpiSeeker can handle comparatively more (for example 5,000 SNPs).

3. They cannot handle the case of multiple pathogenic causes in which one cause is masked by other causes.

Summary of the invention

The purpose of the invention is to overcome the shortcomings of two-step methods in finding, on a genome-wide scale, the multiple causes of a disease and which genes combine to form each cause. The invention provides a method for filtering SNPs unrelated to complex diseases from the whole genome. The method filters out of the whole-genome SNP data those SNPs that are unrelated to the disease phenotype both individually and in combination, and thereby retains all high-order SNP factors that may be related to the disease individually or in combination, laying the foundation for further detection and identification of the causal SNPs.

The technical solution of the invention comprises the following steps:

(1) Preprocessing and initialization of the whole-genome SNP data

Following the principle that a variation in either allele on homologous chromosomes is treated as having the same effect on the disease, the SNP data is preprocessed into N labeled samples (x_i, y_i), i = 1, ..., N, where x_i ∈ {0, 1, 2, 3}^d holds the values of sample i at the d SNP sites: the two alleles at a site are coded 1 for homozygous AA, 2 for homozygous aa, 3 for heterozygous Aa or aA, and 0 when the data is missing; y_i ∈ {1, 2} is the class label of sample x_i, with 1 denoting the disease group and 2 the control group; N is the number of samples in the SNP data, d is the number of SNPs in the data, and the set of SNPs involved is denoted Ω;

(2) Define the association measure

Every SNP subset may act as a pathogenic factor through interaction; a factor containing l SNPs is called an l-order factor. The association I(Y;X) between an SNP factor X and the disease Y is defined as the mutual information MI(Y;X) between X and Y:

$I(Y;X) = H(Y) - H(Y|X)$ (1)

where $H(Y) = -\sum_{y \in \{0,1\}} p(y) \log p(y)$ and $H(Y|X) = -\sum_{y \in \{0,1\}} \sum_{x \in \{1,2,3\}^{|X|}} p(y|x) \log p(y|x)$ are the entropy and the conditional entropy, respectively;

(3) Use the Factor based Genetic Search Algorithm (FGSA) to search the SNP set Ω for candidate SNP groups suspected of causing the disease;

(4) Rank the SNPs according to the frequency-association priority criterion;

(5) Output the top-ranked SNPs whose frequency is greater than the threshold.

Significant effects of the invention compared with the prior art:

The invention discloses a method for filtering SNPs unrelated to complex diseases out of whole-genome single nucleotide polymorphism (SNP) data: the factor based genetic search algorithm (FGSA). Because it works on factors, it can search for SNP factors whose SNPs are each weakly associated with the disease but strongly associated when combined. Its layer-by-layer peeling criterion ensures that, when several pathogenic factors exist, the presence of one strongly associated factor does not mask the search for other factors that may also be strongly associated. The frequency-association priority criterion ensures that the solutions found are relatively stable. The method has the following significant effects:

(1) The method filters disease-unrelated SNPs out of the whole genome effectively: it removes those SNPs that are unrelated to the disease phenotype both individually and in combination, and makes the number of SNPs remaining after filtering as small as possible.

(2) The method retains the SNPs of pathogenic causes that are masked by other pathogenic causes, laying the foundation for the subsequent discovery of those causes.

(3) The method can handle whole-genome SNP scale, for example data with more than 10,000 SNPs. It can, for instance, process the AMD data (AMD: loss of central vision caused by retinal damage), which contains 103611 SNPs, 96 case samples, and 50 control samples. The AMD-related SNPs found by a variety of other methods are all in the SNP set that remains after filtering with this method (see the description of the experimental comparison for details).
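To illustrate the kind of factor the text refers to (SNPs that are individually unrelated to the disease but jointly related), the following Python sketch builds a hypothetical two-SNP data set with an XOR-like disease model and compares the mutual information of each SNP alone with that of the pair. The data, the function names, and the use of base-2 logarithms are assumptions made for this example and are not taken from the patent; the conditional entropy is computed with the standard identity H(Y|X) = H(X,Y) - H(X).

```python
import math
from collections import Counter
from itertools import product

def entropy(values):
    """Shannon entropy (in bits) of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(Y;X) = H(Y) - H(Y|X), using H(Y|X) = H(X,Y) - H(X)."""
    return entropy(ys) - (entropy(list(zip(xs, ys))) - entropy(xs))

# Hypothetical toy data: genotype codes 1 (AA) and 3 (Aa) at two SNPs, all
# four combinations equally frequent. A sample is a case (label 1) only when
# exactly one of the two SNPs is heterozygous; otherwise it is a control (2).
samples = list(product([1, 3], repeat=2)) * 25
labels = [1 if (a == 3) != (b == 3) else 2 for a, b in samples]

mi_snp1 = mutual_information([s[0] for s in samples], labels)
mi_snp2 = mutual_information([s[1] for s in samples], labels)
mi_pair = mutual_information(samples, labels)
print(mi_snp1, mi_snp2, mi_pair)
```

In this toy model each SNP alone carries no information about the label, while the pair determines it completely, which is why a filter that scores SNPs one at a time would discard both.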

Description of drawings

Fig. 1 is a flowchart of the FGSA algorithm of the invention;

Fig. 2 is a flowchart of the genetic algorithm in Fig. 1.

Detailed description

With reference to Fig. 1 and Fig. 2, the method of the invention, called the FGSA method, is implemented in the following steps:

Step 1: preprocess and initialize the SNP data.

(1.1) Following the principle that a variation in either allele on homologous chromosomes can be treated as having the same effect on the disease, the SNP data is processed into N labeled samples (x_i, y_i), i = 1, ..., N, where x_i ∈ {0, 1, 2, 3}^d holds the values of sample i at the d SNP sites: the two alleles at a site are coded 1 for homozygous AA, 2 for homozygous aa, 3 for heterozygous Aa or aA, and 0 when the allele data at the site is missing; y_i ∈ {1, 2} is the class label of sample x_i, with 1 denoting the disease group and 2 the control group; N is the number of samples in the SNP data and d is the number of SNPs in the data. The resulting data contains only the values 0, 1, 2, and 3, where 0 denotes missing data, and the set of SNPs involved is denoted Ω.
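The coding rule of step 1 can be sketched in Python as follows. The input format (two-character genotype strings such as "AA" or "aA", with `None` or anything unrecognized treated as missing) and the function names are assumptions made for this example; the patent specifies only the output codes 0, 1, 2, and 3.

```python
def encode_genotype(genotype, major="A", minor="a"):
    """Map a two-allele genotype string to the codes used in step 1.

    1 = homozygous AA, 2 = homozygous aa, 3 = heterozygous (Aa or aA),
    0 = missing or unrecognized (input format is an assumption).
    """
    if genotype is None or len(genotype) != 2:
        return 0
    alleles = set(genotype)
    if alleles == {major}:
        return 1
    if alleles == {minor}:
        return 2
    if alleles == {major, minor}:
        return 3
    return 0

def encode_sample(genotypes):
    """Encode one sample, i.e. one row x_i of the preprocessed data."""
    return [encode_genotype(g) for g in genotypes]

raw = ["AA", "aA", "aa", None, "Aa", "??"]
encoded = encode_sample(raw)
print(encoded)  # [1, 3, 2, 0, 3, 0]
```

Treating Aa and aA identically reflects the stated principle that it does not matter which of the two homologous chromosomes carries the variant.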

Step 2: define the association measure.

(2.1) Every SNP subset may act as a pathogenic factor; a factor containing l SNPs is called an l-order factor. The association I(Y;X) between a factor X and the disease Y is defined as the mutual information MI(Y;X) between X and Y, expressed as formula (1):

$I(Y;X) = H(Y) - H(Y|X)$ (1)

where $H(Y) = -\sum_{y \in \{0,1\}} p(y) \log p(y)$ and $H(Y|X) = -\sum_{y \in \{0,1\}} \sum_{x \in \{1,2,3\}^{|X|}} p(y|x) \log p(y|x)$ are the entropy and the conditional entropy, respectively.
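A minimal Python sketch of this association measure, estimated from sample counts, might look as follows. Several choices here are assumptions rather than part of the patent: the function and variable names, the base-2 logarithm, the skipping of samples that have a missing value (0) in any SNP of the factor, and the use of the standard conditional entropy, in which each genotype combination is weighted by its empirical probability (the printed formula does not show this weighting explicitly).

```python
import math
from collections import Counter

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def factor_mutual_information(data, labels, factor):
    """Estimate I(Y;X) for the SNP columns listed in `factor`.

    data   : list of samples, each a list of genotype codes in {0, 1, 2, 3}
    labels : list of class labels, one per sample
    factor : tuple of column indices (an l-order factor)
    Samples with a missing value (0) in the factor are skipped (assumption).
    """
    joint = Counter()
    for row, y in zip(data, labels):
        x = tuple(row[j] for j in factor)
        if 0 not in x:
            joint[(x, y)] += 1
    if not joint:
        return 0.0
    x_counts, y_counts = Counter(), Counter()
    for (x, y), c in joint.items():
        x_counts[x] += c
        y_counts[y] += c
    h_y = entropy_from_counts(y_counts.values())
    h_xy = entropy_from_counts(joint.values())
    h_x = entropy_from_counts(x_counts.values())
    return h_y - (h_xy - h_x)  # H(Y) - H(Y|X)

# Hypothetical four-sample data set with three SNPs.
data = [[1, 3, 2], [1, 1, 2], [3, 3, 0], [3, 1, 1]]
labels = [1, 2, 2, 1]
mi_pair = factor_mutual_information(data, labels, (0, 1))
mi_single = factor_mutual_information(data, labels, (0,))
print(mi_pair, mi_single)
```

Because the measure takes a tuple of columns, the same function scores single SNPs and higher-order factors alike.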

Step 3: use the FGSA method to search the SNP set Ω for candidate SNP groups suspected of causing the disease.

(3.1) Set the genetic algorithm parameters and the FGSA parameters

Set the number l of SNPs in the SNP group to be found. Set the genetic algorithm parameters: population size N_l, crossover probability P_c, mutation probability P_m, and number of iterations Iter_l. Set the FGSA parameters: number of repeated searches Num_l and number M_l of l-order interactions to be found in each search;

(3.2) Initialize the counter of l-order interactions to be found, k = 1; set the SNP set associated with suspected pathogenic causes to S_1 = Φ (the empty set); set the SNP set under examination to Ω^* = Ω;

(3.3) Randomly initialize the population: randomly generate l distinct valid SNP numbers from the SNPs in Ω^* to form an l-order factor, which serves as one individual. Generate N_l individuals in total to form the population;

(3.4) Fitness calculation: for each individual in the population, compute the mutual information according to formula (1) as the fitness of that individual;

(3.5) Selection: based on the fitness values of the individuals in the population, select N_l individuals using roulette-wheel selection together with an elitist strategy;

(3.6) Crossover: pick any two of the N_l individuals and randomly choose a crossover point; with crossover probability P_c, swap the parts of the two individuals after the crossover point to form two new individuals;

(3.7) Mutation: for each individual, randomly generate a valid SNP number and, with mutation probability P_m, use it to replace any one SNP number in the individual;

(3.8) Next generation: all individuals obtained after (3.7) form the next-generation population;

(3.9) If the number of iterations is less than Iter_l, go to (3.4);

(3.10) Take the individual with the largest fitness in the population, denote it s_k, add it to the set S_k of SNPs suspected of being associated with the disease, and remove this individual from the SNP set Ω^*;

(3.11) Repeat (3.3) to (3.10) M_l times, each time adding one individual to S_k and removing that individual from the data; after M_l repetitions, S_k is obtained;

(3.12) Reset Ω^* = Ω and repeat (3.2) to (3.11) Num_l times in total, obtaining the SNP sets $S_1, S_2, \ldots, S_{Num_l}$;

(3.13) Output the SNP set containing all the l-order interactions found.
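The inner genetic search of steps (3.3) to (3.9) can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name, the `seed` argument, keeping the elite individual unchanged in every generation, pairing selected individuals in order for crossover, and discarding crossover children that would contain a repeated SNP are all choices made for the example. The fitness function is passed in as a callable and is assumed to be non-negative (as mutual information is); the toy fitness used in the demonstration is hypothetical.

```python
import random

def genetic_search(candidates, l, fitness, pop_size=10, p_c=0.9, p_m=0.25,
                   iterations=200, seed=None):
    """Search `candidates` for an l-order factor with high fitness."""
    rng = random.Random(seed)
    candidates = list(candidates)
    # (3.3) each individual is a tuple of l distinct SNP numbers
    population = [tuple(rng.sample(candidates, l)) for _ in range(pop_size)]
    for _ in range(iterations):
        # (3.4) fitness of every individual
        scores = [fitness(ind) for ind in population]
        # (3.5) elitism plus roulette-wheel selection
        elite = population[scores.index(max(scores))]
        weights = [s + 1e-12 for s in scores]  # avoid an all-zero wheel
        rest = rng.choices(population, weights=weights, k=pop_size - 1)
        # (3.6) one-point crossover on consecutive pairs
        for i in range(0, len(rest) - 1, 2):
            if l > 1 and rng.random() < p_c:
                cut = rng.randint(1, l - 1)
                a, b = rest[i], rest[i + 1]
                c1, c2 = a[:cut] + b[cut:], b[:cut] + a[cut:]
                # assumption: children with a repeated SNP are discarded
                if len(set(c1)) == l and len(set(c2)) == l:
                    rest[i], rest[i + 1] = c1, c2
        # (3.7) mutation: replace one SNP number by a random valid one
        for i, ind in enumerate(rest):
            if rng.random() < p_m:
                new_snp = rng.choice(candidates)
                if new_snp not in ind:
                    pos = rng.randrange(l)
                    rest[i] = ind[:pos] + (new_snp,) + ind[pos + 1:]
        # (3.8) next generation
        population = [elite] + rest
    return max(population, key=fitness)

# Hypothetical fitness: fraction of a planted factor {3, 7} that is covered.
planted = {3, 7}

def toy_fitness(ind):
    return len(planted & set(ind)) / len(planted)

best = genetic_search(range(20), l=2, fitness=toy_fitness, seed=0)
print(sorted(best), toy_fitness(best))
```

In the method itself the fitness would be the mutual information of formula (1) evaluated on the SNP columns of the individual, and `candidates` would be the current set Ω^*.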

Step 4: compute the frequency of each SNP in v.

(4.1) Frequency calculation: the number of times an SNP appears among the suspected pathogenic causes found in step 3 is taken as the frequency of that SNP;

(4.2) Rank the SNPs according to the frequency-association priority criterion: SNPs with higher frequency come first, and among SNPs of equal frequency the one whose single-SNP association with the disease (its mutual information) is larger comes first.

Step 5: output the top-ranked SNPs whose frequency is greater than the threshold.
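Steps 4 and 5 amount to counting, sorting with a tie-breaker, and thresholding. A short Python sketch, assuming the search results are available as a list of runs (each a list of factors, each factor a tuple of SNP numbers) and that the single-SNP mutual information values have already been computed; the names and the toy numbers are hypothetical.

```python
from collections import Counter

def rank_and_filter(runs, single_mi, threshold):
    """Rank SNPs by the frequency-association priority criterion.

    runs      : list of runs, each a list of factors (tuples of SNP numbers)
    single_mi : dict mapping SNP number -> mutual information of that SNP alone
    threshold : keep only SNPs whose frequency is strictly greater than this
    Returns a list of (snp, frequency) pairs, best first.
    """
    freq = Counter(snp for run in runs for factor in run for snp in factor)
    # Higher frequency first; equal frequency -> higher single-SNP MI first.
    ranked = sorted(freq, key=lambda s: (-freq[s], -single_mi.get(s, 0.0)))
    return [(s, freq[s]) for s in ranked if freq[s] > threshold]

# Hypothetical output of three search runs and hypothetical MI values.
runs = [[(1, 2), (3, 4)], [(1, 2), (4, 5)], [(1, 3), (2, 4)]]
single_mi = {1: 0.02, 2: 0.10, 3: 0.05, 4: 0.01, 5: 0.30}
kept = rank_and_filter(runs, single_mi, threshold=1)
print(kept)  # [(2, 3), (1, 3), (4, 3), (3, 2)]
```

SNP 5 has the largest single-SNP mutual information in this toy input but is dropped because it appears only once, which shows that frequency takes priority over association.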

In step 3, the genetic algorithm of (3.3) to (3.8) takes as an individual an l-order factor formed by randomly generating l distinct valid SNP numbers from the SNPs in Ω^*, and obtains better individuals through crossover and mutation. This genetic evolutionary search reflects the factor-based nature of the filtering and makes it possible to find SNP factors whose SNPs are each weakly associated with the disease but strongly associated when combined. Step (3.10) adds the individual with the largest fitness in the population to the set of SNPs suspected of being associated with the disease and removes that individual from the data. This peels off the pathogenic factors layer by layer, so that when several pathogenic factors exist, the presence of one strongly associated factor does not mask the search for other factors that may also be strongly associated. The frequency-association priority criterion in step 4 ensures that the solutions found in this way are relatively stable.
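The layer-by-layer peeling and the repeated runs described above can be sketched as a small outer loop in Python. The sketch assumes that "removing the individual" means deleting its SNPs from the working set, and it takes the inner search as a callable; the exhaustive `best_pair` function and the score table are hypothetical stand-ins for the genetic search and the mutual information, used only to make the example deterministic.

```python
from itertools import combinations

def fgsa_runs(all_snps, search, m_l, num_l):
    """Repeated layer-by-layer peeling, as in steps (3.2) and (3.10)-(3.13).

    search(remaining) must return one factor (a tuple of SNP numbers) drawn
    from the set `remaining`. Returns num_l runs, each a list of m_l factors.
    """
    runs = []
    for _ in range(num_l):
        remaining = set(all_snps)       # (3.12) reset the working set to Omega
        found = []                      # factors collected in this run
        for _ in range(m_l):            # (3.11) find m_l factors per run
            factor = search(remaining)
            found.append(factor)        # (3.10) keep the best factor ...
            remaining -= set(factor)    # ... and peel its SNPs off the data
        runs.append(found)
    return runs

# Hypothetical association strengths: one strong and one weaker factor.
scores = {(1, 2): 0.9, (3, 4): 0.5}

def best_pair(remaining):
    """Stand-in search: best-scoring pair among the remaining SNPs."""
    return max(combinations(sorted(remaining), 2),
               key=lambda pair: scores.get(pair, 0.0))

runs = fgsa_runs(range(6), best_pair, m_l=2, num_l=3)
print(runs[0])  # [(1, 2), (3, 4)]
```

Once the strong factor (1, 2) has been removed, the weaker factor (3, 4) becomes the best remaining candidate and is found in the next pass of the same run.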

The effect of the method is described in more detail through the following experimental examples. These examples are given for illustration and are not intended to limit the scope of the invention.

In the following experiments, the parameters of the method are: N_l = 10, P_c = 0.9, P_m = 0.25, Iter_l = 5000, Num_l = 20, M_l = 8.

Experiment 1: SNP filtering on simulated data.

The simulated data was obtained from real SNP data of the New York population, into which biologists inserted 7 SNP groups known to be associated with complex diseases; the association model between each of these 7 SNP groups and the disease is different. There are two data sets: the first contains 2000 samples and 100 SNPs and is denoted SNP100; the second contains 2000 samples and 2000 SNPs and is denoted SNP2000. The data is summarized in Table 1.

Table 1 Experimental data

Experiments were run on both data sets. The FGSA parameters (Iter_2, Iter_3) were set separately for the two data sets, to (Iter_2, Iter_3) = (600, 1100) and (1200, 1700) respectively; (M_2, M_3) was the same for both data sets, (M_2, M_3) = (8, 5). The two groups of experiments differ only in the frequency threshold: in the first, the thresholds for the two data sets were Th = 3 and Th = 2 respectively, and in the second the threshold was Th = 1.

The experimental results of the FGSA algorithm and of several typical feature selection methods (the minimum redundancy maximum relevance method, mRMR; the maximum entropy method, ME; and others) are shown in Table 2 and Table 3, where "-" indicates that no result was obtained because the computation was too expensive. The compression rate is the ratio of the number of SNPs obtained after filtering to the total number of SNPs in the data, and the factor rate is the ratio of the number of true pathogenic factors contained in the filtered SNPs to the number of true pathogenic factors in the data.

In Table 2, a denotes a true positive. The factor rate is the number of pathogenic factors contained in the filtered SNPs as a percentage of the total number of pathogenic factors contained in Ω before filtering; the compression rate is the number of SNPs after filtering as a percentage of the total number of SNPs in Ω before filtering.
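The two evaluation measures defined above can be computed as in the following Python sketch. It assumes that a pathogenic factor counts as "contained" only when every one of its SNPs survives filtering, which matches the later remark that a factor must be completely included; the function names and the toy numbers are hypothetical.

```python
def compression_rate(kept_snps, total_snps):
    """Number of SNPs kept after filtering divided by the total number."""
    return len(set(kept_snps)) / total_snps

def factor_rate(kept_snps, true_factors):
    """Fraction of true factors whose SNPs are all among the kept SNPs."""
    kept = set(kept_snps)
    hits = sum(1 for factor in true_factors if set(factor) <= kept)
    return hits / len(true_factors)

# Hypothetical filtering result on a 100-SNP data set with three true factors.
kept = [1, 2, 3, 4, 9]
true_factors = [(1, 2), (3, 4, 5), (9,)]
cr = compression_rate(kept, total_snps=100)
fr = factor_rate(kept, true_factors)
print(cr, fr)
```

Here the factor (3, 4, 5) is not counted because SNP 5 was filtered out, even though two of its three SNPs were kept.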

Table 2 Performance of the FGSA-frequency algorithm and comparison with other algorithms

Table 3 Performance of the FGSA-frequency algorithm (results with frequency threshold Th = 1)

Table 2 and Table 3 show that:

(a) at roughly the same compression rate, the factor rate of the FGSA method is higher than that of the other methods, which demonstrates the effectiveness of the algorithm;

(b) the factor rate of mRMR is only 3/7, meaning that the selected SNP set completely contains only 3 of the 7 pathogenic factors, whereas the factor rate of the FGSA method is 5/7 to 6/7, meaning that it completely contains 5 or 6 of them, which clearly shows the effectiveness of the FGSA method;

(c) the compression rate grows with the SNP scale N: the larger N is, the higher the compression rate, reaching 97% with 2000 SNPs, which indicates that the FGSA method is even more effective at whole-genome SNP scale;

(d) when a factor contains too many SNPs, the curse of dimensionality arises; this is one of the main reasons why none of these methods ever completely selected the pathogenic factor of length 5.

Experiment 2: SNP filtering on real AMD data.

AMD (age-related macular degeneration) is a medical condition affecting older adults that causes loss of central vision through damage to the retina. The AMD data (see Table 4) contains 103611 SNPs, 96 case samples, and 50 control samples, with 0.811% of the data missing. Sample values are 0, 1, 2, and 3, where 0 means the value is missing. AMD data is widely used for SNP association analysis; many methods have been applied to it and have identified some associated genes. Table 5 lists the SNPs found by FGSA that were also found by other methods (such as BOOST, AntEpiSeeker, epiMODE, BEAM, HapForest, Single-Marker, and DASSO-MB).

Table 4 AMD data

| | Number of SNPs | Number of case samples | Number of control samples | Proportion of missing data |
| AMD data | 103611 | 96 | 50 | 0.822% |

Table 5 SNPs found by the FGSA method that were also found by other methods

Table 5 shows that:

(a) the SNPs found by FGSA overlap heavily with those found by other methods, which demonstrates the effectiveness of the FGSA method.

(b) FGSA also reports SNPs that the other methods did not find but whose frequency is nevertheless high, namely the SNPs numbered 19405, 6693, 56674, 80178, 76784, 92627, 46516, 88957, 42568, 51958, 41808, and 47428, with frequencies 35, 26, 26, 25, 24, 24, 22, 21, 20, 15, 11, and 9 respectively. The possibility that they are associated with the disease cannot be ruled out. In particular, it cannot be ruled out that the true pathogenic factors lie in the set formed by the SNPs in Table 5 together with the SNPs above, or in a subset of them.

Claims

1. A method for filtering SNPs unrelated to complex diseases from the whole genome, comprising the steps of:

Step 1, preprocessing and initialization of genome-wide SNP data

According to the principle that the variation of any gene in the homologous chromosome alleles has the same impact on the disease, the SNP data is preprocessed into:

where x_i ∈ {0, 1, 2, 3}^d holds the values of sample i at the d SNP sites: the two alleles at a site are coded 1 for homozygous AA, 2 for homozygous aa, 3 for heterozygous Aa or aA, and 0 when the allele data at the site is missing; y_i ∈ {1, 2} is the class label of sample x_i, with 1 denoting the disease group and 2 the control group; N is the number of samples in the SNP data, d is the number of SNPs in the data, and the set of SNPs involved is denoted Ω;

Step 2, define the relevance measure

The association I(Y; X) between a SNP factor X and a disease Y is defined as the mutual information MI(Y; X) between X and Y, expressed as formula (1):

I(Y;X)＝H(Y)-H(Y|X) (1)

where

$H(Y) = -\sum_{y \in \{0,1\}} p(y) \log p(y)$

is the entropy and

$H(Y|X) = -\sum_{y \in \{0,1\}} \sum_{x \in \{1,2,3\}^{|X|}} p(y|x) \log p(y|x)$

is the conditional entropy;

Step 3, use the factor based genetic search algorithm (FGSA) to search the SNP set Ω for SNP groups suspected of causing the disease, so as to filter out the SNPs unrelated to complex diseases;

(3.1) Set the genetic algorithm parameters and the FGSA parameters

Set the number l of SNPs in the SNP group to be found. Set the genetic algorithm parameters, including population size N_l, crossover probability P_c, mutation probability P_m, and number of iterations Iter_l; set the FGSA parameters, including the number of repeated searches Num_l and the number M_l of l-order interactions to be found in each search;

(3.2) Initialize the counter of l-order interactions to be found, k = 1; set the SNP set associated with suspected pathogenic causes to S_1 = Φ (the empty set); set the SNP set under examination to Ω^* = Ω;

(3.3) Randomly initialize the population: randomly generate l distinct valid SNP numbers from the SNPs in Ω^* to form an l-order factor as one individual, and generate N_l individuals in total to form the population;

(3.4) Fitness calculation: for each individual in the population, compute the mutual information according to formula (1) as the fitness of that individual;

(3.5) Selection: based on the fitness values of the individuals in the population, select N_l individuals using roulette-wheel selection together with an elitist strategy;

(3.6) Crossover: pick any two of the N_l individuals and randomly choose a crossover point; with crossover probability P_c, swap the parts of the two individuals after the crossover point to form two new individuals;

(3.7) Mutation: for each individual, randomly generate a valid SNP number and, with mutation probability P_m, use it to replace any one SNP number in the individual;

(3.8) Next generation: all individuals obtained after step (3.7) form the next-generation population;

(3.9) If the number of iterations is less than Iter_l, go to step (3.4);

(3.11) Repeat steps (3.3) to (3.10) M_l times, each time adding one individual to S_k and removing that individual from the data; after M_l repetitions, S_k is obtained;

(3.12) Reset Ω^* = Ω and repeat (3.2) to (3.11) Num_l times, obtaining the SNP sets

$S_1, S_2, \ldots, S_{Num_l}$;

(3.13) Output the SNP set containing all the l-order interactions found

Step 4, compute the frequency of each SNP in v

(4.1) Frequency calculation: the number of times an SNP appears among the suspected pathogenic causes found in step 3 is taken as the frequency of that SNP;

(4.2) Rank the SNPs according to the frequency-association priority criterion: SNPs with higher frequency come first, and among SNPs of equal frequency the one whose single-SNP association with the disease (its mutual information) is larger comes first;

Step 5, output the top-ranked SNPs whose frequency is greater than the threshold.