CN103366100A - Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome - Google Patents


Info

Publication number
CN103366100A
Authority
CN
China
Prior art keywords
snp
snps
disease
data
individual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102796270A
Other languages
Chinese (zh)
Inventor
张军英
刘丹
赵晓雪
谭芳慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN2013102796270A
Publication of CN103366100A
Current legal status: Pending

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for filtering SNPs unrelated to complex diseases out of whole-genome data, for use in research on the pathogenic mechanisms of complex diseases, early diagnosis, and biopharmaceutical development. (1) Preprocess and initialize the single nucleotide polymorphism (SNP) data: on the principle that a variation in either allele of a homologous chromosome pair can be treated as having an equivalent effect on the disease, the SNP data are encoded using only the values 0, 1, 2, and 3. (2) Define the association measure: the association I(Y;X) between an SNP subset X and the disease Y is defined as the mutual information MI(Y;X) between X and Y. (3) Use the FGSA method to search the SNP set for candidate SNP groups suspected of being pathogenic causes. (4) Following the frequency-association priority criterion, select from the set of candidate SNP groups those whose frequency of occurrence exceeds a threshold. (5) Output the top-ranked SNPs whose frequency exceeds the threshold. The method retains SNPs corresponding to pathogenic causes that are masked by other pathogenic causes, laying a foundation for the later discovery of those causes.

Figure 201310279627

Description

A method for filtering SNPs unrelated to complex diseases from the whole genome

Technical field

The invention belongs to the technical field of data processing. Specifically, it proposes a method for filtering SNPs unrelated to complex diseases out of whole-genome single nucleotide polymorphism (SNP) data. The method can be used in research on the pathogenic mechanisms of complex diseases, early diagnosis, and biopharmaceutical development.

Background

Complex diseases arise from the combined action of multiple genetic and environmental factors, and their onset and progression are influenced by many genes organized in a complex network. Unlike Mendelian genetic diseases, complex diseases in most cases have no major gene that is sufficient on its own to cause the disease. The contribution of any single gene may be negligible or even absent, yet a combination of such individually negligible genes can, through their joint effect, be a cause of the disease. These characteristics make it very difficult to discover the pathogenic genes of complex diseases, or related markers, for use in studying pathogenic mechanisms, early diagnosis, and biopharmaceutical development. The main open problem is how to identify, on a genome-wide scale, the multiple causes of a disease and which genes act jointly as one cause.

Existing solutions fall into two classes: direct methods and two-step methods. Direct methods search the original SNP set directly and can only handle small and medium-sized data; the scale they can handle depends on the algorithm (for example, MDR and BEAM differ considerably in this respect). Two-step methods first filter out of the original SNP set those SNPs that are unrelated to the disease and then search the remaining SNPs. The present invention concerns the first step of the two-step approach: SNP filtering.

In most two-step methods the first step is exhaustive: pairwise SNP combinations with high scores are found and kept for the second step (as in BOOST, AntEpiSeeker, and others). Exhaustive search has very limited capacity for large-scale data. For this reason, ant colony optimization has been introduced to find SNP combinations of higher order than exhaustive search allows (AntEpiSeeker), and random forests have been introduced to obtain the SNPs that need further examination by finding SNP subsets with strong classification ability. These filtering methods share the following shortcomings:

1. The order of SNP interactions they can handle is very limited. Because exhaustive search is computationally expensive, only pairs of SNPs are enumerated and scored, so only second-order SNP interactions (interactions between two SNPs) are retained and higher-order interactions are lost.

2. The number of SNPs they can handle is very limited, because the filtering process requires complex computation. The random forest method handles only about 100 SNPs; AntEpiSeeker can handle comparatively more (for example, 5,000 SNPs).

3. They cannot handle the multi-cause situation in which one pathogenic cause is masked by other pathogenic causes.

Summary of the invention

The purpose of the invention is to overcome the shortcomings of existing two-step methods in identifying, on a genome-wide scale, the multiple causes of a disease and which genes jointly form a cause. It provides a method for filtering SNPs unrelated to complex diseases from the whole genome. From whole-genome SNP data, the method filters out those SNPs that are unrelated to the disease phenotype both individually and in combination. It thereby retains all higher-order SNP factors that may be related to the disease individually or in combination, and lays a foundation for subsequently detecting and identifying the SNPs that cause the disease.

The technical solution of the invention comprises the following steps:

(1) Preprocess and initialize the whole-genome SNP data

On the principle that a variation in either allele of a homologous chromosome pair is treated as having an equivalent effect on the disease, the SNP data are preprocessed into:

Figure BSA00000921376200021

where $x_i \in \{0,1,2,3\}^d$ holds the values at the SNP sites of sample $i$: the value at a site is 1 when its two alleles are homozygous AA, 2 when homozygous aa, 3 when heterozygous Aa or aA, and 0 when the data are missing; $y_i \in \{1,2\}$ is the class label of sample $x_i$, with 1 denoting the disease group and 2 the control group; $N$ is the number of samples in the SNP data and $d$ is the number of SNPs; the set of SNPs involved is denoted $\Omega$;

(2) Define the association measure

Any SNP subset may become a pathogenic factor through interaction; a factor containing $l$ SNPs is called an order-$l$ factor. The association $I(Y;X)$ between an SNP factor $X$ and the disease $Y$ is defined as the mutual information $MI(Y;X)$ between $X$ and $Y$:

$$I(Y;X) = H(Y) - H(Y \mid X) \qquad (1)$$

where

$$H(Y) = -\sum_{y \in \{0,1\}} p(y)\log p(y), \qquad H(Y \mid X) = -\sum_{y \in \{0,1\}} \sum_{x \in \{1,2,3\}^{|X|}} p(y \mid x)\log p(y \mid x)$$

are the entropy and the conditional entropy, respectively;

(3) Use the factor-based genetic search algorithm (FGSA) to search the SNP set $\Omega$ for candidate SNP groups suspected of being pathogenic causes;

(4) Sort the SNPs according to the frequency-association priority criterion;

(5) Output the top-ranked SNPs whose frequency exceeds the threshold.

Compared with the prior art, the invention has the following notable effects:

The invention discloses a method for filtering SNPs unrelated to complex diseases from whole-genome SNP data: the factor-based genetic search algorithm (FGSA). Because it works on factors, it can find SNP factors whose members are individually weakly associated with the disease but jointly strongly associated. Its layer-by-layer peeling criterion ensures that, when several pathogenic factors exist, a strongly associated factor does not mask the search for other factors that may also be strongly associated. The frequency-association priority criterion ensures the relative stability of the solutions found. The notable effects are:

(1) The method effectively filters disease-unrelated SNPs from the whole genome: it filters out SNPs that are unrelated to the disease phenotype both individually and in combination, and keeps the number of SNPs remaining after filtering as small as possible.

(2) The method retains SNPs corresponding to pathogenic causes that are masked by other pathogenic causes, laying a foundation for the later discovery of those causes.

(3) The method can handle genome-wide SNP scale, that is, data with more than 10,000 SNPs. For example, it can process the AMD data (AMD causes loss of central vision through retinal damage), which contain 103611 SNPs, 96 case samples, and 50 control samples. The AMD-related SNPs found by a variety of other methods are all in the SNP set remaining after filtering by this method (see the experimental comparison section for details).

Description of drawings

Fig. 1 is a flowchart of the FGSA algorithm of the invention;

Fig. 2 is a flowchart of the genetic algorithm in Fig. 1.

Detailed description

With reference to Fig. 1 and Fig. 2, the method of the invention is called the FGSA method. Its implementation steps are as follows:

Step 1: preprocess and initialize the SNP data.

(1.1) On the principle that a variation in either allele of a homologous chromosome pair can be treated as having an equivalent effect on the disease, the SNP data are processed into

Figure BSA00000921376200041

where $x_i \in \{0,1,2,3\}^d$ holds the values at the SNP sites of sample $i$: the value at a site is 1 when its two alleles are homozygous AA, 2 when homozygous aa, 3 when heterozygous Aa or aA, and 0 when the allele data at that site are missing; $y_i \in \{1,2\}$ is the class label of sample $x_i$, with 1 denoting the disease group and 2 the control group; $N$ is the number of samples in the SNP data and $d$ is the number of SNPs. The processed data thus contain only the values 0, 1, 2, and 3, with 0 denoting missing data. The set of SNPs involved is denoted $\Omega$.
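The encoding in step (1.1) can be sketched in a few lines of Python. This is only an illustration: the input format (one genotype string such as "AA" per SNP site) and the names `GENOTYPE_CODE` and `encode_genotypes` are assumptions made for the example, not part of the patent text. Any string other than AA, aa, Aa, or aA is treated as missing.

```python
# Codes taken from the text: AA -> 1, aa -> 2, Aa or aA -> 3, missing -> 0.
GENOTYPE_CODE = {"AA": 1, "aa": 2, "Aa": 3, "aA": 3}


def encode_genotypes(rows):
    """Map N samples (each a list of d genotype strings) to an N x d list of codes.

    Hypothetical helper: unknown or absent genotypes fall back to 0 (missing).
    """
    return [[GENOTYPE_CODE.get(genotype, 0) for genotype in row] for row in rows]


samples = [["AA", "aa", "Aa"], ["aA", "NA", "AA"]]
encoded = encode_genotypes(samples)
print(encoded)
```

Mapping both heterozygous orderings to the same code reflects the stated principle that it does not matter which of the two homologous chromosomes carries the variant.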

Step 2: define the association measure.

(2.1) Any SNP subset may become a pathogenic factor; a factor containing $l$ SNPs is called an order-$l$ factor. The association $I(Y;X)$ between a factor $X$ and the disease $Y$ is defined as the mutual information $MI(Y;X)$ between $X$ and $Y$, expressed as formula (1):

$$I(Y;X) = H(Y) - H(Y \mid X) \qquad (1)$$

where

$$H(Y) = -\sum_{y \in \{0,1\}} p(y)\log p(y), \qquad H(Y \mid X) = -\sum_{y \in \{0,1\}} \sum_{x \in \{1,2,3\}^{|X|}} p(y \mid x)\log p(y \mid x)$$

are the entropy and the conditional entropy, respectively.
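As a rough illustration of formula (1), the sketch below estimates I(Y;X) from empirical frequencies. Several points are assumptions of this example rather than statements from the patent: the names `entropy` and `mutual_information` are hypothetical; the conditional entropy is computed in the standard sample-weighted form, H(Y|X) = Σ_x p(x) H(Y | X = x); logarithms are base 2; and the missing-data code 0 is simply treated as one more genotype value.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Empirical Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(X, y, factor):
    """Estimate I(Y;X) = H(Y) - H(Y|X) for the SNP columns listed in `factor`."""
    # Group the class labels by the joint genotype pattern on the factor's SNPs.
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(tuple(row[j] for j in factor), []).append(label)
    n = len(y)
    # H(Y|X) as the p(x)-weighted average of the per-pattern label entropies.
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(y) - h_cond


# Two SNPs that carry no information alone but determine the label jointly.
X = [[1, 1], [1, 2], [2, 1], [2, 2]]
y = [1, 2, 2, 1]
print(mutual_information(X, y, (0,)), mutual_information(X, y, (0, 1)))
```

The toy data show why the measure is evaluated on factors rather than single SNPs: each SNP alone has zero mutual information with the label, while the pair has one full bit.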

Step 3: use the FGSA method to search the SNP set $\Omega$ for candidate SNP groups suspected of being pathogenic causes.

(3.1) Set the genetic algorithm parameters and the FGSA parameters

Set the number $l$ of SNPs in the SNP groups to be found. Set the genetic algorithm parameters: the population size $N_l$, the crossover probability $p_c$, the mutation probability $p_m$, and the number of iterations $Iter_l$. Set the FGSA parameters: the number of repeated searches $Num_l$ and the number $M_l$ of order-$l$ interactions to find in each search;

(3.2) Initialize the counter of order-$l$ interactions to be found, $k = 1$; set the set of SNPs related to suspected pathogenic causes to $S_1 = \emptyset$; set the SNP set to be examined to $\Omega^* = \Omega$;

(3.3) Randomly initialize the population: randomly draw $l$ distinct valid SNP indices from the SNPs in $\Omega^*$ to form an order-$l$ factor, which serves as one individual. Generate $N_l$ individuals in total to form the population;

(3.4) Fitness calculation: for each individual in the population, compute the mutual information according to formula (1) as its fitness;

(3.5) Selection: based on the fitness values of the individuals in the population, select $N_l$ individuals using roulette-wheel selection together with an elitist strategy;

(3.6) Crossover: take any two of the $N_l$ individuals, choose a crossover point at random, and with crossover probability $p_c$ swap the parts after the crossover point to form two new individuals;

(3.7) Mutation: for each individual, randomly generate a valid SNP index and, with mutation probability $p_m$, use it to replace one of the SNP indices in the individual;

(3.8) Produce the next generation: all individuals obtained after (3.7) form the next-generation population;

(3.9) If the number of iterations is less than $Iter_l$, go to (3.4);

(3.10) Take the individual with the highest fitness in the population, denote it $s_k$, add it to the set $S_k$ of SNPs suspected of being related to the disease, and remove it from the SNP set $\Omega^*$, i.e.

Figure BSA00000921376200051

(3.11) Repeat (3.3) to (3.10) $M_l$ times, each time adding one individual to $S_k$ and removing it from the data; after $M_l$ repetitions, $S_k$ is obtained;

(3.12) Reset $\Omega^* = \Omega$ and repeat (3.2) to (3.11) $Num_l$ times in total, obtaining the SNP sets $S_1, S_2, \ldots, S_{Num_l}$;

(3.13) Output the SNP set containing all the order-$l$ interactions

Figure BSA00000921376200053
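The inner genetic search of (3.3) to (3.9) might look roughly like the Python sketch below. It is a simplified reading of the steps, not the patented implementation, and several details are assumptions: the names `ga_search`, `repair`, and `seed` are hypothetical; `fitness` is any non-negative scoring function supplied by the caller (for instance a mutual-information estimate); duplicate SNP indices produced by crossover or mutation are repaired by random replacement, which the text does not specify; and the elitist strategy is approximated by copying the best individual found so far into each new generation.

```python
import random


def repair(ind, candidates, rng):
    """Replace repeated SNP indices so the individual keeps distinct SNPs (assumed step)."""
    seen, fixed = set(), []
    for snp in ind:
        while snp in seen:
            snp = rng.choice(candidates)
        seen.add(snp)
        fixed.append(snp)
    return tuple(fixed)


def ga_search(candidates, order, fitness, pop_size=10, p_c=0.9, p_m=0.25,
              iterations=200, seed=0):
    """Return the fittest order-`order` factor (tuple of SNP indices) found."""
    rng = random.Random(seed)
    candidates = list(candidates)
    if len(candidates) < order:
        raise ValueError("need at least `order` candidate SNPs")
    # (3.3) each individual is `order` distinct SNP indices from the candidate set
    population = [tuple(rng.sample(candidates, order)) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(iterations):                      # (3.9) iterate
        scores = [fitness(ind) for ind in population]    # (3.4) fitness
        # (3.5) roulette-wheel selection; the offset keeps every weight positive
        parents = rng.choices(population, weights=[s + 1e-9 for s in scores],
                              k=pop_size)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):   # (3.6) one-point crossover
            if order > 1 and rng.random() < p_c:
                cut = rng.randrange(1, order)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [a, b]
        if len(children) < pop_size:                     # odd population size
            children.append(parents[-1])
        population = []
        for ind in children:                             # (3.7) mutation
            if rng.random() < p_m:
                pos = rng.randrange(order)
                ind = ind[:pos] + (rng.choice(candidates),) + ind[pos + 1:]
            population.append(repair(ind, candidates, rng))
        # (3.8) next generation, with the best individual so far carried over
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best = champion
        population[0] = best
    return best


# Toy fitness for demonstration only: prefers larger SNP indices.
print(ga_search(range(20), 2, lambda ind: sum(ind) / 100, iterations=50, seed=1))
```

The default values for `pop_size`, `p_c`, and `p_m` mirror the parameter values reported for the experiments; the iteration count is shortened for the demonstration.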

Step 4: compute the frequency of each SNP in v.

(4.1) Frequency calculation: the number of times an SNP appears among the suspected pathogenic causes found in step 3 is taken as that SNP's frequency;

(4.2) Sort the SNPs by the frequency-association priority criterion: SNPs with higher frequency come first, and among SNPs with equal frequency, those whose single-SNP association with the disease (mutual information) is larger come first.

Step 5: output the top-ranked SNPs whose frequency exceeds the threshold.
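Steps 4 and 5 amount to counting, sorting, and thresholding. A minimal Python sketch follows, assuming the search results are available as a list of runs, each run a list of factors, each factor a tuple of SNP indices. The function name `rank_snps` and the dictionary `single_snp_mi` of precomputed single-SNP mutual information values are hypothetical names introduced for the example.

```python
from collections import Counter


def rank_snps(runs, single_snp_mi, threshold):
    """Rank SNPs by frequency, break ties by single-SNP mutual information,
    and keep only those whose frequency is strictly greater than `threshold`."""
    # (4.1) frequency = number of times the SNP appears in any found factor
    freq = Counter(snp for run in runs for factor in run for snp in factor)
    # (4.2) higher frequency first; at equal frequency, higher mutual information first
    ranked = sorted(freq, key=lambda s: (-freq[s], -single_snp_mi.get(s, 0.0)))
    # Step 5: output the top-ranked SNPs above the frequency threshold
    return [(s, freq[s]) for s in ranked if freq[s] > threshold]


runs = [[(0, 1), (2, 3)], [(0, 2), (1, 4)]]
mi = {0: 0.1, 1: 0.3, 2: 0.2}
print(rank_snps(runs, mi, threshold=1))
```

In this toy input SNPs 0, 1, and 2 each appear twice, so their order is decided by the mutual information values, while SNPs 3 and 4 appear once and are filtered out.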

The genetic algorithm in (3.3) to (3.8) of step 3 takes as an individual an order-$l$ factor formed from $l$ distinct valid SNP indices drawn at random from the SNPs in $\Omega^*$, and obtains better individuals through crossover and mutation. This evolutionary search reflects the factor-based nature of the filtering, and thereby ensures the search for SNP factors whose members are individually weakly associated with the disease but jointly strongly associated. Step (3.10) adds the individual with the highest fitness in the population to the set of SNPs suspected of being related to the disease and removes that individual from the data. This peels off the pathogenic factors layer by layer, so that when several pathogenic factors exist, a strongly associated factor does not mask the search for other factors that may also be strongly associated. The frequency-association priority criterion in step 4 ensures the relative stability of the solutions found in this way.
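The layer-by-layer peeling of (3.10) to (3.12) can be sketched as an outer loop around any single-factor search. In the Python sketch below, `search_best` is a hypothetical callable standing in for one genetic-algorithm run over the remaining SNPs; the names `fgsa_candidates`, `factors_per_run`, and `runs` are also assumptions, and "removing the individual" is interpreted as removing its SNPs from the remaining set.

```python
def fgsa_candidates(all_snps, search_best, factors_per_run, runs):
    """Collect suspected factors with layer-by-layer peeling.

    `search_best(remaining)` must return the best factor (a tuple of SNP ids)
    found among `remaining`; it is a stand-in for one genetic search.
    """
    collected = []                             # S_1, ..., S_Num
    for _ in range(runs):                      # (3.12) restart from the full set
        remaining = set(all_snps)              # Omega* = Omega
        found = []                             # S_k for this run
        for _ in range(factors_per_run):       # (3.11) repeat M times
            factor = search_best(remaining)
            found.append(factor)               # (3.10) add s_k to S_k ...
            remaining -= set(factor)           # ... and peel its SNPs off Omega*
        collected.append(found)
    return collected


# Deterministic stand-in search: always returns the two smallest remaining ids.
stub = lambda remaining: tuple(sorted(remaining)[:2])
print(fgsa_candidates(range(6), stub, factors_per_run=2, runs=2))
```

Because each found factor is removed before the next search, a dominant factor cannot be returned twice within a run, which gives weaker factors a chance to surface.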

The effect of the method is described in more detail through the following experimental examples. These examples are illustrative and are not intended to limit the scope of the invention.

In the following experiments the parameters of the method are: $N_l = 10$, $p_c = 0.9$, $p_m = 0.25$, $Iter_l = 5000$, $Num_l = 20$, $M_l = 8$.

Experiment 1: filtering SNPs in simulated data.

The simulated data were obtained by biologists who added 7 SNP groups known to be associated with complex diseases to real SNP data from a New York population; the 7 groups follow different models of association with the disease. There are two datasets: the first contains 2000 samples and 100 SNPs and is denoted SNP100; the second contains 2000 samples and 2000 SNPs and is denoted SNP2000. The data are summarized in Table 1.

Table 1 Experimental data

Figure BSA00000921376200061

Experiments were run on both datasets. The FGSA parameters $(Iter_2, Iter_3)$ were set per dataset, to (600, 1100) and (1200, 1700) respectively, while $(M_2, M_3) = (8, 5)$ was used for both. The two sets of experiments differ only in the frequency threshold: in one, the thresholds for the two datasets are Th = 3 and Th = 2 respectively; in the other, Th = 1.

The experimental results of FGSA and of several typical feature selection methods (minimum redundancy maximum relevance, mRMR; maximum entropy, ME; and others) are shown in Tables 2 and 3, where "-" indicates that no result was obtained because the computation was too expensive. The compression rate is the ratio of the number of SNPs remaining after filtering to the total number of SNPs in the data, and the factor rate is the ratio of the number of true pathogenic factors contained in the filtered SNPs to the number of true pathogenic factors in the data.

In Table 2, "a" marks true positives. The factor rate is defined as the number of pathogenic factors contained in the filtered SNPs as a percentage of the total number of pathogenic factors contained in $\Omega$ before filtering; the compression rate is the number of SNPs after filtering as a percentage of the total number of SNPs in $\Omega$ before filtering.
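The two evaluation measures defined above are simple ratios. The sketch below computes them in Python; the function names are hypothetical, and a factor is counted only when all of its SNPs survive filtering, which is an assumption consistent with the later remark that a factor must be "completely contained" in the selected set.

```python
def compression_rate(kept_snps, total_snps):
    """Share of the original SNPs that remain after filtering."""
    return len(set(kept_snps)) / total_snps


def factor_rate(kept_snps, true_factors):
    """Share of true pathogenic factors whose SNPs are all retained."""
    kept = set(kept_snps)
    return sum(set(factor) <= kept for factor in true_factors) / len(true_factors)


kept = [1, 2, 3, 7]
true_factors = [(1, 2), (3, 4), (7,)]
print(compression_rate(kept, 100), factor_rate(kept, true_factors))
```

Here 4 of 100 SNPs are kept, and two of the three factors are fully contained: (3, 4) is lost because SNP 4 was filtered out.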

Table 2 Performance of the FGSA-frequency algorithm and comparison with other algorithms

Table 3 Performance of the FGSA-frequency algorithm (results with frequency threshold Th = 1)

Figure BSA00000921376200072

Tables 2 and 3 show that:

(a) At roughly comparable compression rates, the factor rate of FGSA is higher than that of the other methods, which demonstrates the effectiveness of the algorithm;

(b) The factor rate of mRMR is only 3/7, meaning that the selected SNP set completely contains only 3 of the 7 pathogenic factors, whereas the factor rate of FGSA is 5/7 to 6/7, meaning that it completely contains 5 or 6 of them; this clearly shows the effectiveness of FGSA;

(c) The compression rate grows with the SNP scale N: the larger N is, the higher the compression rate, reaching 97% with 2000 SNPs, which indicates that FGSA is more effective in the genome-wide setting;

(d) When a factor contains too many SNPs, the curse of dimensionality sets in; this is one important reason why the pathogenic factor of length 5 was never completely selected by any of these methods.

Experiment 2: SNP filtering of real AMD data.

AMD (age-related macular degeneration) is a medical condition affecting older adults in which damage to the retina causes loss of central vision. The AMD data (see Table 4) contain 103611 SNPs, 96 case samples, and 50 control samples, with 0.811% of the data missing. Values are 0, 1, 2, or 3, where 0 denotes missing data. The AMD data are frequently used for SNP association analysis; many methods have been applied to them and have identified a number of associated genes. Table 5 lists the SNPs found by FGSA that were also found by other methods (such as BOOST, AntEpiSeeker, epiMODE, BEAM, HapForest, Single-Marker, and DASSO-MB).

Table 4 AMD data

Dataset | Number of SNPs | Number of case samples | Number of control samples | Proportion of missing data
AMD data | 103611 | 96 | 50 | 0.822%

Table 5 SNPs found by the FGSA method that were also found by other methods

Figure BSA00000921376200091

Table 5 shows that:

(a) The SNPs found by FGSA overlap heavily with those found by other methods, which indicates the effectiveness of the FGSA method.

(b) FGSA also identifies high-frequency SNPs that other methods did not find, namely the SNPs numbered 19405, 6693, 56674, 80178, 76784, 92627, 46516, 88957, 42568, 51958, 41808, and 47428, with frequencies 35, 26, 26, 25, 24, 24, 22, 21, 20, 15, 11, and 9 respectively. The possibility that they are associated with the disease cannot be ruled out. In particular, it cannot be ruled out that the true pathogenic factors lie in the set formed by the SNPs in Table 5 together with the SNPs above, or in a subset of them.

Claims (1)

1.从全基因组中过滤与复杂疾病无关SNP的方法,包括如下步骤:1. A method for filtering SNPs unrelated to complex diseases from the whole genome, comprising the steps of: 步骤1,对全基因组SNP数据进行预处理和初始化Step 1, preprocessing and initialization of genome-wide SNP data 根据同源染色体等位基因中任一基因的变异对疾病的影响等同对待的原则,将SNP数据预处理成:
Figure FSA00000921376100011
其中xi∈{0,1,2,3}d为SNP i对应位点的取值:对应位点上的两个等位基因当为纯合子AA时取1,纯合子aa时取2,杂合子Aa或aA时取3,当对应位点上的等位基因数据缺失时取0;yi∈{1,2}为样本xi的类标,1表示疾病组,2表示对照组,N为SNP数据中样本的个数,d为数据中SNP的个数,并记所涉及的SNP的集合为Ω;
According to the principle that the variation of any gene in the homologous chromosome alleles has the same impact on the disease, the SNP data is preprocessed into:
Figure FSA00000921376100011
Where x i ∈ {0, 1, 2, 3} d is the value of the corresponding site of SNP i: when the two alleles at the corresponding site are homozygous AA, it takes 1, and when it is homozygous aa, it takes 2, Take 3 for heterozygote Aa or aA, take 0 when the allele data on the corresponding locus is missing; y i ∈ {1, 2} is the class label of sample xi , 1 means the disease group, 2 means the control group, N is the number of samples in the SNP data, d is the number of SNPs in the data, and record the set of SNPs involved as Ω;
步骤2,定义关联性测度Step 2, define the relevance measure 将一个SNP因素X与疾病Y之间的关联性I(Y;X)定义为X与Y之间的互信息MI(Y;X),表示为式(1):The association I(Y; X) between a SNP factor X and a disease Y is defined as the mutual information MI(Y; X) between X and Y, expressed as formula (1): I(Y;X)=H(Y)-H(Y|X)    (1)I(Y;X)=H(Y)-H(Y|X) (1) 式中 H ( Y ) = - Σ y ∈ { 0,1 } p ( y ) log p ( y ) 为熵, H ( Y | X ) = - Σ y ∈ { 0,1 } Σ x ∈ { 1,2,3 , } | X | p ( y | x ) log p ( y | x ) 为条件熵;In the formula h ( Y ) = - Σ the y ∈ { 0,1 } p ( the y ) log p ( the y ) is the entropy, h ( Y | x ) = - Σ the y ∈ { 0,1 } Σ x ∈ { 1,2,3 , } | x | p ( the y | x ) log p ( the y | x ) is the conditional entropy; 步骤3,运用基于因素的遗传搜索方法“FGSA”搜索SNP的集合Ω中侯选疑似致病原因的SNP组,从而过滤与复杂疾病无关的SNP;Step 3, use the factor-based genetic search method "FGSA" to search for the SNP group in the SNP set Ω that is suspected to be the cause of the disease, so as to filter the SNPs that are not related to complex diseases; (3.1)设置遗传算法参数及FGSA相关参数(3.1) Set genetic algorithm parameters and FGSA related parameters 设置要找的SNP组中SNP个数l。设置遗传算法参数,包括种群规模Nl,交叉概率Pc,变异概率pm,迭代次数Iterl;设置FGSA相关参数,包括重复搜索次数Numl,每次要找的l阶交互作用的个数MlSet the number l of SNPs in the SNP group to be found. 
Set genetic algorithm parameters, including population size Nl, crossover probability P c , mutation probability p m , iteration number Iter l ; set FGSA related parameters, including repeated search times Num l , and the number of l-order interactions to be found each time M l ; (3.2)初始化要找的l阶交互作用的个数k=1;设疑似致病原因相关SNP集合S1=Φ;设要考察的SNP集合为Ω*=Ω;(3.2) Initialize the number of the first-order interactions to be found k=1; set the SNP set S 1 =Φ related to the suspected cause of disease; set the SNP set to be investigated as Ω * =Ω; (3.3)随机初始化种群:从Ω*中的SNP中随机生成l个不同的有效SNP编号,构成一个l阶因素作为一个个体,总计生成Nl个个体构成种群;(3.3) Randomly initialize the population: Randomly generate l different effective SNP numbers from the SNP in Ω * , form an l-order factor as an individual, and generate a total of N l individuals to form the population; (3.4)适应度计算:对种群中的每个个体,根据式(1)计算互信息作为该个体的适应度;(3.4) Calculation of fitness: For each individual in the population, calculate the mutual information according to formula (1) as the fitness of the individual; (3.5)选择操作:依据种群中各个个体的适应度数值,采用轮盘赌方式和精英策略进行选择操作,选出Nl个个体;(3.5) Selection operation: According to the fitness value of each individual in the population, use roulette and elite strategy to select N1 individuals; (3.6)交叉操作:从Nl个个体中任取两个个体,随机选择交叉点,对这两个个体以交叉概率pc将交叉点后面的部分对调,形成两个新个体;(3.6) Crossover operation: Randomly select two individuals from N1 individuals, randomly select the intersection point, and exchange the part behind the intersection point for these two individuals with the crossover probability p c to form two new individuals; (3.7)变异操作:对每个个体,随机生成一个有效SNP编号,依变异概率pm替换掉这个个体中的任一SNP编号;(3.7) Mutation operation: For each individual, randomly generate an effective SNP number, and replace any SNP number in this individual according to the mutation probability p m ; (3.8)产生下一代种群:由步骤(3.7)操作后得到的所有个体作为下一代种群;(3.8) Generate the population of the next generation: all individuals obtained after the operation of step (3.7) are used as the population of the next generation; 
(3.9) If the number of iterations is less than Iter_l, return to step (3.4);
(3.10) Take the individual with the greatest fitness in the population, denote it s_k, add it to the set S_k of SNPs suspected of being related to the disease, and remove it from the SNP set Ω*, that is, $S_k = S_k \cup s_k$, $\Omega^* = \Omega^* \setminus s_k$;
(3.11) Repeat steps (3.3) to (3.10) M_l times, each time adding one individual to S_k and removing that individual from the data, so that S_k is obtained after M_l repetitions;
(3.12) Reset Ω* = Ω and repeat steps (3.2) to (3.11) Num_l times to obtain the SNP sets $S_1, S_2, \ldots, S_{Num_l}$;
(3.13) Output the SNP set V containing all of the l-order interactions found [formula image FSA00000921376100023 in the original];
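Steps (3.1) to (3.13) can be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the function names (`mutual_information`, `ga_search`, `fgsa`), the default parameter values, the base-2 logarithm, the small constant added to the roulette weights, and the `repair` step that restores l distinct SNP numbers after crossover or mutation are all assumptions. The conditional entropy is computed in its standard form, with each term weighted by the joint probability p(x, y).

```python
import math
import random
from collections import Counter

def mutual_information(genotypes, labels, snp_group):
    """I(Y;X) = H(Y) - H(Y|X), X being the joint genotype of the SNPs in snp_group."""
    n = len(labels)
    h_y = -sum(c / n * math.log2(c / n) for c in Counter(labels).values())
    x_counts, xy_counts = Counter(), Counter()
    for row, y in zip(genotypes, labels):
        x = tuple(row[j] for j in snp_group)
        x_counts[x] += 1
        xy_counts[(x, y)] += 1
    # Standard conditional entropy: -sum_{x,y} p(x,y) * log p(y|x)
    h_y_x = -sum(c / n * math.log2(c / x_counts[x]) for (x, _), c in xy_counts.items())
    return h_y - h_y_x

def ga_search(genotypes, labels, candidates, l, pop_size, p_c, p_m, iterations, rng):
    """One genetic search (steps 3.3 to 3.9); returns the fittest l-SNP group."""
    def fitness(ind):
        return mutual_information(genotypes, labels, ind)

    def repair(ind):
        # Assumption: duplicated SNP numbers are replaced by random candidates.
        seen = []
        for snp in ind:
            while snp in seen:
                snp = rng.choice(candidates)
            seen.append(snp)
        return tuple(seen)

    population = [tuple(rng.sample(candidates, l)) for _ in range(pop_size)]  # (3.3)
    for _ in range(iterations):                                               # (3.9)
        scores = [fitness(ind) for ind in population]                         # (3.4)
        elite = population[scores.index(max(scores))]                         # elitism
        weights = [max(s, 0.0) + 1e-12 for s in scores]
        selected = rng.choices(population, weights=weights, k=pop_size - 1)   # (3.5)
        children = []
        for a, b in zip(selected[::2], selected[1::2]):                       # (3.6)
            if l > 1 and rng.random() < p_c:
                cut = rng.randrange(1, l)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [a, b]
        if len(selected) % 2:
            children.append(selected[-1])
        mutated = []
        for ind in children:                                                  # (3.7)
            ind = list(ind)
            if rng.random() < p_m:
                ind[rng.randrange(l)] = rng.choice(candidates)
            mutated.append(repair(ind))
        population = [elite] + mutated                                        # (3.8)
    return max(population, key=fitness)

def fgsa(genotypes, labels, l, M, num_runs, pop_size=20, p_c=0.8, p_m=0.1,
         iterations=30, seed=0):
    """Returns num_runs lists, each holding the M groups found in one repeated search."""
    rng = random.Random(seed)
    runs = []
    for _ in range(num_runs):                          # (3.12): reset and repeat
        candidates = list(range(len(genotypes[0])))    # Omega* = Omega
        found = []                                     # S_k
        for _ in range(M):                             # (3.11)
            best = ga_search(genotypes, labels, candidates, l,
                             pop_size, p_c, p_m, iterations, rng)
            found.append(best)                         # (3.10): add s_k to S_k ...
            candidates = [s for s in candidates if s not in best]  # ... drop it from Omega*
        runs.append(found)
    return runs
```

In this sketch `genotypes` is a list of per-sample rows of encoded SNP values and `labels` holds the 0/1 disease status. Removing each found group from the candidate pool before the next search is what lets a later search pick up SNP groups that a stronger association would otherwise mask.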
Step 4, compute the frequency of each SNP in V
(4.1) Frequency of each SNP: the number of times an SNP appears among the suspected causal SNPs found in Step 3 is taken as that SNP's frequency;
(4.2) Sort the SNPs by the frequency-association priority criterion: SNPs with a higher frequency rank first and, among SNPs of equal frequency, the SNP whose single-SNP association with the disease (its mutual information) is larger ranks first;
Step 5, output the top-ranked SNPs whose frequency is greater than the threshold.
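The frequency-association ordering of Steps 4 and 5 could look roughly like the sketch below. The function name `rank_snps`, the input layout (a list of repeated searches, each a list of SNP-number tuples), and the example numbers are assumptions made only for illustration; the single-SNP mutual information values are taken as already computed with formula (1).

```python
from collections import Counter

def rank_snps(runs, single_snp_mi, threshold):
    """Rank SNPs by frequency, break ties by single-SNP mutual information,
    and keep only SNPs whose frequency is greater than the threshold."""
    # (4.1) frequency = number of found groups in which the SNP appears
    freq = Counter(snp for found in runs for group in found for snp in group)
    # (4.2) higher frequency first; equal frequency: larger mutual information first
    ranked = sorted(freq, key=lambda s: (-freq[s], -single_snp_mi[s]))
    # Step 5: output the top-ranked SNPs above the threshold
    return [(s, freq[s]) for s in ranked if freq[s] > threshold]

# Hypothetical example: three repeated searches, two SNP pairs found in each.
example_runs = [[(0, 1), (2, 3)], [(0, 1), (4, 5)], [(0, 2), (1, 3)]]
example_mi = {0: 0.10, 1: 0.30, 2: 0.05, 3: 0.20, 4: 0.0, 5: 0.0}
result = rank_snps(example_runs, example_mi, threshold=1)
print(result)  # [(1, 3), (0, 3), (3, 2), (2, 2)]
```

SNPs 0 and 1 both appear three times, so the tie is broken in favour of SNP 1, which has the larger single-SNP mutual information; SNPs 4 and 5 appear only once and are filtered out by the threshold.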
CN2013102796270A 2013-06-25 2013-06-25 Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome Pending CN103366100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102796270A CN103366100A (en) 2013-06-25 2013-06-25 Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102796270A CN103366100A (en) 2013-06-25 2013-06-25 Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome

Publications (1)

Publication Number Publication Date
CN103366100A true CN103366100A (en) 2013-10-23

Family

ID=49367427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102796270A Pending CN103366100A (en) 2013-06-25 2013-06-25 Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome

Country Status (1)

Country Link
CN (1) CN103366100A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462868A (en) * 2014-12-11 2015-03-25 西安电子科技大学 Genome-wide SNP (single nucleotide polymorphism) site analysis method based on combination of random forest and Relief-F
CN108256293A (en) * 2018-02-09 2018-07-06 哈尔滨工业大学深圳研究生院 A kind of statistical method and system of the disease association assortment of genes
CN110135057A (en) * 2019-05-14 2019-08-16 北京工业大学 Soft-sensing method for dioxin emission concentration in solid waste incineration process based on multi-layer feature selection
CN110428897A (en) * 2019-06-19 2019-11-08 西安电子科技大学 Medical diagnosis on disease information processing method based on SNP pathogenic factor Yu disease association relationship
CN112270957A (en) * 2020-10-19 2021-01-26 西安邮电大学 High-order SNP pathogenic combination data detection method, system and computer equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894216A (en) * 2010-07-16 2010-11-24 西安电子科技大学 A method for discovering groups of SNPs associated with complex diseases from SNP data
CN102629305A (en) * 2012-03-06 2012-08-08 上海大学 Feature selection method facing to SNP (Single Nucleotide Polymorphism) data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUNYING ZHANG et al.: "A Genetic Algorithm to Filter SNPs for SNP Association Study", 《WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY (WI-IAT), 2012 IEEE/WIC/ACM INTERNATIONAL CONFERENCES ON》, vol. 1, 7 December 2012 (2012-12-07), pages 684 - 687, XP032391329, DOI: doi:10.1109/WI-IAT.2012.146 *
蒋胜利: "高维数据的特征选择与特征提取研究", 《中国博士学位论文全文数据库 信息科技辑》, vol. 2011, no. 12, 15 December 2011 (2011-12-15), pages 138 - 49 *
蒋胜利 et al.: "基于多重遗传算法的单核苷酸多态性特征选择", 《四川大学学报(工程科学版)》, vol. 42, no. 2, 20 March 2010 (2010-03-20), pages 132 - 138 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462868A (en) * 2014-12-11 2015-03-25 西安电子科技大学 Genome-wide SNP (single nucleotide polymorphism) site analysis method based on combination of random forest and Relief-F
CN104462868B (en) * 2014-12-11 2017-04-05 西安电子科技大学 A kind of full-length genome SNP site analysis method of combination random forest and Relief F
CN108256293A (en) * 2018-02-09 2018-07-06 哈尔滨工业大学深圳研究生院 A kind of statistical method and system of the disease association assortment of genes
CN110135057A (en) * 2019-05-14 2019-08-16 北京工业大学 Soft-sensing method for dioxin emission concentration in solid waste incineration process based on multi-layer feature selection
CN110135057B (en) * 2019-05-14 2021-03-02 北京工业大学 Soft measurement method of dioxin emission concentration in solid waste incineration process based on multi-layer feature selection
US11976817B2 (en) 2019-05-14 2024-05-07 Beijing University Of Technology Method for detecting a dioxin emission concentration of a municipal solid waste incineration process based on multi-level feature selection
CN110428897A (en) * 2019-06-19 2019-11-08 西安电子科技大学 Medical diagnosis on disease information processing method based on SNP pathogenic factor Yu disease association relationship
CN110428897B (en) * 2019-06-19 2022-03-18 西安电子科技大学 A disease diagnosis information processing method based on the relationship between SNP pathogenic factors and diseases
CN112270957A (en) * 2020-10-19 2021-01-26 西安邮电大学 High-order SNP pathogenic combination data detection method, system and computer equipment
CN112270957B (en) * 2020-10-19 2023-11-07 西安邮电大学 High-order SNP pathogenic combination data detection method, system and computer equipment

Similar Documents

Publication Publication Date Title
Marand et al. A cis-regulatory atlas in maize at single-cell resolution
Varshney et al. Designing future crops: genomics-assisted breeding comes of age
Kelly et al. Analysis of the giant genomes of Fritillaria (Liliaceae) indicates that a lack of DNA removal characterizes extreme expansions in genome size
Smith et al. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants
Arruda et al. Genomic selection for predicting Fusarium head blight resistance in a wheat breeding program
Xu et al. Predicting hybrid performance in rice using genomic best linear unbiased prediction
Gernandt et al. Multi‐locus phylogenetics, lineage sorting, and reticulation in Pinus subsection Australes
Song et al. Conserved noncoding sequences provide insights into regulatory sequence and loss of gene expression in maize
CN104462868B (en) A kind of full-length genome SNP site analysis method of combination random forest and Relief F
Lavarenne et al. The spring of systems biology-driven breeding
CA2932507C (en) Improved molecular breeding methods
Rowley et al. A draft genome and high-density genetic map of European hazelnut (Corylus avellana L.)
CN110993113B (en) LncRNA-disease relation prediction method and system based on MF-SDAE
Aono et al. Machine learning approaches reveal genomic regions associated with sugarcane brown rust resistance
Ding et al. Population-genomic analyses reveal bottlenecks and asymmetric introgression from Persian into iron walnut during domestication
Senerchia et al. Evolutionary dynamics of retrotransposons assessed by high-throughput sequencing in wild relatives of wheat
CN103366100A (en) Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome
Timilsena et al. Phylogenomic resolution of order-and family-level monocot relationships using 602 single-copy nuclear genes and 1375 BUSCO genes
Grover et al. Dual domestication, diversity, and differential introgression in Old World cotton diploids
Wang et al. Genetic diversity and population structure in the endangered tree Hopea hainanensis (Dipterocarpaceae) on Hainan Island, China
Zhang et al. The lack of negative association between TE load and subgenome dominance in synthesized Brassica allotetraploids
Yu et al. Genomic analyses reveal dead‐end hybridization between two deeply divergent kiwifruit species rather than homoploid hybrid speciation
Cornet et al. Holocentric repeat landscapes: From micro‐evolutionary patterns to macro‐evolutionary associations with karyotype evolution
Morales‐Briones et al. Phylogenomic analyses in Phrymaceae reveal extensive gene tree discordance in relationships among major clades
Qin et al. Phylogenomics and divergence pattern of Polygonatum (Asparagaceae: Polygonateae) in the north temperate region

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131023