CN103366100A - Method for filtering SNPs (single nucleotide polymorphisms) unrelated to complex diseases from the whole genome
- Publication number
- CN103366100A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
A method for filtering SNPs unrelated to complex diseases from the whole genome, for use in research on the pathogenic mechanisms of complex diseases, early diagnosis, and biopharmaceutical development. (1) Preprocess and initialize the single nucleotide polymorphism (SNP) data: on the principle that a variation in either allele of a homologous chromosome pair can be treated as having the same effect on the disease, the SNP data are encoded using only the values 0, 1, 2, and 3. (2) Define the association measure: the association I(Y;X) between a SNP subset X and a disease Y is defined as the mutual information MI(Y;X) between X and Y. (3) Use the FGSA method to search the SNP set for candidate SNP groups suspected of being pathogenic causes. (4) Following the frequency-association priority criterion, select from the set of candidate SNP groups those whose frequency of occurrence exceeds a threshold. (5) Output the top-ranked SNPs whose frequency is greater than the threshold. The method retains SNPs that correspond to pathogenic causes masked by other pathogenic causes, laying a foundation for the later discovery of those causes.
Description
Technical field
The invention belongs to the technical field of data processing. Specifically, it proposes a method for filtering SNPs unrelated to complex diseases out of genome-wide single nucleotide polymorphism (SNP) data. The method can be used in research on the pathogenic mechanisms of complex diseases, early diagnosis, and biopharmaceutical development.
Background
Complex diseases arise from the joint action of multiple genetic and environmental factors, and their onset and progression are influenced by multiple genes organized in a complex network. Unlike Mendelian genetic diseases, complex diseases usually have no major gene that is sufficient to cause the disease on its own. The contribution of any single gene may be negligible or even absent, yet a combination of such individually negligible genes can have a joint effect that causes the disease. These characteristics make the pathogenic genes of complex diseases very hard to discover, so it is difficult to find pathogenic genes or related markers for research on pathogenic mechanisms, early diagnosis, and biopharmaceutical development. The main open problem is how to find, on a genome-wide scale, the multiple causes of a disease and which genes combine to form one cause.
Existing solutions fall into two classes: direct methods and two-step methods. A direct method searches the original SNP set directly and can only handle small and medium-scale data; the scale it can handle depends on the algorithm (for example, MDR and BEAM differ considerably). A two-step method first filters out of the original SNP set those SNPs that are unrelated to the disease and then searches the remaining SNPs. The present invention concerns the first step of the two-step approach: SNP filtering.
In most two-step methods the first step is exhaustive: all pairwise SNP combinations are scored, and the high-scoring ones are kept for the second step (for example, BOOST and AntEpiSeeker). This approach has very limited ability to handle large-scale data. For that reason, an ant colony algorithm has been introduced to find SNP combinations of higher order than exhaustive search allows (AntEpiSeeker), and random forests have been introduced to obtain the SNPs that need further examination by finding SNP subsets with strong classification ability. These filtering methods all have the following shortcomings:
1. The SNP interaction order they can handle is very limited. Because exhaustive enumeration is computationally expensive, only SNP pairs are enumerated and scored, so only second-order SNP interactions (interactions between two SNPs) are retained and higher-order SNP interactions are lost.
2. The number of SNPs they can handle is very limited, because the filtering process requires complex computation. The random forest method handles only about 100 SNPs; by comparison, AntEpiSeeker can handle more (for example, 5,000 SNPs).
3. They cannot handle the multi-cause case in which one pathogenic cause is masked by other pathogenic causes.
Summary of the invention
The purpose of the invention is to overcome the shortcomings of existing two-step methods in finding, on a genome-wide scale, the multiple causes of a disease and which genes combine to form one cause. It provides a method for filtering SNPs unrelated to complex diseases from the whole genome. The method filters out of the genome-wide SNP data those SNPs that are unrelated to the disease phenotype both individually and in combination, and thereby retains all high-order SNP pathogenic factors that may be related to the disease individually or in combination. This lays a foundation for further detecting and identifying the SNP causes of the disease.
The technical solution of the invention comprises the following steps:
(1) Preprocess and initialize the genome-wide SNP data
On the principle that a variation in either allele of a homologous chromosome pair is treated as having the same effect on the disease, the SNP data are preprocessed into $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \{0,1,2,3\}^d$ is the genotype vector of sample i, each component being the value at the site of one SNP: for the two alleles at that site, the value is 1 for homozygote AA, 2 for homozygote aa, 3 for heterozygote Aa or aA, and 0 when the data are missing. $y_i \in \{1,2\}$ is the class label of sample $x_i$, where 1 denotes the disease group and 2 the control group. N is the number of samples in the SNP data, d is the number of SNPs in the data, and the set of SNPs involved is denoted Ω.
(2) Define the association measure
Every SNP subset may, through interaction, constitute a pathogenic factor. A factor containing l SNPs is called an l-order factor. The association I(Y;X) between a SNP factor X and the disease Y is defined as the mutual information MI(Y;X) between X and Y:
I(Y;X) = H(Y) - H(Y|X)    (1)
where H(Y) is the entropy of Y and H(Y|X) is the conditional entropy of Y given X.
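For reference, the two terms of equation (1) have the following standard information-theoretic definitions. This is a sketch that assumes the method uses these usual definitions, with the probabilities estimated from sample frequencies; the logarithm base is not stated in the text.

```latex
H(Y) = -\sum_{y} p(y)\,\log p(y), \qquad
H(Y \mid X) = -\sum_{x} p(x) \sum_{y} p(y \mid x)\,\log p(y \mid x)
```

Here x ranges over the joint genotype configurations of the SNPs in factor X, so a factor of order l has at most 4^l configurations under the 0/1/2/3 encoding.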
(3) Use the factor-based genetic search algorithm (FGSA) to search the SNP set Ω for candidate SNP groups suspected of being pathogenic causes;
(4) Rank the SNPs according to the frequency-association priority criterion;
(5) Output the top-ranked SNPs whose frequency is greater than the threshold.
Compared with the prior art, the invention has the following notable effects:
The invention discloses a method for filtering SNPs unrelated to complex diseases from genome-wide SNP data: the factor-based genetic search algorithm (FGSA). Because the search is factor-based, it can find SNP factors that are only weakly associated with the disease individually but strongly associated in combination. Its layer-by-layer peeling criterion ensures that, when several pathogenic factors exist, one strongly associated factor does not mask the search for others that may also be strongly associated. Its frequency-association priority criterion keeps the solutions found relatively stable. The notable effects are:
(1) The method effectively filters disease-unrelated SNPs out of the whole genome: it removes the SNPs that are unrelated to the disease phenotype both individually and in combination, and keeps the number of SNPs remaining after filtering as small as possible.
(2) The method retains the SNPs that correspond to pathogenic causes masked by other pathogenic causes, laying a foundation for the later discovery of those causes.
(3) The method can handle genome-wide SNP data, such as data with more than 10,000 SNPs. For example, it can process the AMD data (AMD causes loss of central vision due to retinal damage), which contain 103,611 SNPs, 96 case samples, and 50 control samples. The AMD-related SNPs found by a variety of other methods are all in the SNP set remaining after filtering by this method (see the experimental comparison for details).
Description of drawings
Fig. 1 is the flowchart of the FGSA algorithm of the invention;
Fig. 2 is the flowchart of the genetic algorithm in Fig. 1.
Detailed description of embodiments
With reference to Fig. 1 and Fig. 2, the method of the invention is called the FGSA method. Its implementation steps are as follows:
Step 1: preprocess and initialize the SNP data.
(1.1) On the principle that a variation in either allele of a homologous chromosome pair can be treated as having the same effect on the disease, the SNP data are processed into $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \{0,1,2,3\}^d$ is the genotype vector of sample i, each component being the value at the site of one SNP: for the two alleles at that site, the value is 1 for homozygote AA, 2 for homozygote aa, 3 for heterozygote Aa or aA, and 0 when the allele data at that site are missing. $y_i \in \{1,2\}$ is the class label of sample $x_i$, where 1 denotes the disease group and 2 the control group. N is the number of samples in the SNP data and d is the number of SNPs in the data. The processed data thus contain only the values 0, 1, 2, and 3, with 0 marking missing data. The set of SNPs involved is denoted Ω.
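The encoding in step (1.1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `encode_genotype` and `encode_sample` are hypothetical, genotypes are assumed to arrive as two-character strings using the allele symbols "A" and "a" from the text, and anything else is treated as missing.

```python
from typing import List, Optional, Sequence


def encode_genotype(pair: Optional[str], major: str = "A", minor: str = "a") -> int:
    """Map one two-allele genotype to the codes used in step (1.1).

    AA -> 1, aa -> 2, Aa or aA -> 3, missing or unreadable -> 0.
    """
    if pair is None or len(pair) != 2:
        return 0  # missing data
    if any(allele not in (major, minor) for allele in pair):
        return 0  # unknown allele symbol, treated as missing (assumption)
    if pair == major * 2:
        return 1  # homozygote AA
    if pair == minor * 2:
        return 2  # homozygote aa
    return 3      # heterozygote; allele order is ignored


def encode_sample(genotypes: Sequence[Optional[str]]) -> List[int]:
    """Encode one sample, i.e. one vector x_i in {0,1,2,3}^d."""
    return [encode_genotype(g) for g in genotypes]


if __name__ == "__main__":
    print(encode_sample(["AA", "aa", "Aa", "aA", None]))
```

Treating Aa and aA identically is what the "either allele is treated the same" principle amounts to in this encoding.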
Step 2: define the association measure.
(2.1) Every SNP subset may constitute a pathogenic factor. A factor containing l SNPs is called an l-order factor. The association I(Y;X) between a factor X and the disease Y is defined as the mutual information MI(Y;X) between X and Y, expressed as equation (1):
I(Y;X) = H(Y) - H(Y|X)    (1)
where H(Y) is the entropy of Y and H(Y|X) is the conditional entropy of Y given X.
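Equation (1) can be estimated from sample counts as in the sketch below. The names `entropy` and `mutual_information` are hypothetical, the base-2 logarithm is an assumption (the text does not state a base), and the missing-data code 0 is simply treated as one more genotype value, which is also an assumption.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (base 2, an assumption) of a list of labels."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(X, y, snp_idx):
    """Estimate I(Y;X) = H(Y) - H(Y|X) for the factor made of columns snp_idx.

    X is a list of encoded samples (lists of 0/1/2/3), y the class labels.
    """
    n = len(y)
    groups = {}
    for row, label in zip(X, y):
        # Joint genotype configuration of the SNPs in the factor.
        groups.setdefault(tuple(row[j] for j in snp_idx), []).append(label)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(y) - h_cond


if __name__ == "__main__":
    # Toy data: the label depends on SNP 0 and SNP 1 only through their combination.
    X = [[1, 1], [1, 2], [2, 1], [2, 2]]
    y = [1, 2, 2, 1]
    print(mutual_information(X, y, [0]), mutual_information(X, y, [0, 1]))
```

In the toy data each SNP alone carries no information about the label while the pair determines it completely, which is the "weak individually, strong in combination" situation the factor-based measure is meant to capture.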
Step 3: use the FGSA method to search the SNP set Ω for candidate SNP groups suspected of being pathogenic causes.
(3.1) Set the genetic algorithm parameters and the FGSA parameters
Set the number l of SNPs in the SNP groups to be found. Set the genetic algorithm parameters: population size $N_l$, crossover probability $P_c$, mutation probability $P_m$, and number of iterations $Iter_l$. Set the FGSA parameters: number of repeated searches $Num_l$ and number $M_l$ of l-order interactions to find in each search;
(3.2) Initialize the counter of l-order interactions found, k = 1; set the set of SNPs suspected of being related to pathogenic causes to $S_1 = \emptyset$; set the SNP set under consideration to $\Omega^* = \Omega$;
(3.3) Randomly initialize the population: randomly generate l distinct valid SNP indices from the SNPs in $\Omega^*$ to form one l-order factor, which serves as one individual. Generate $N_l$ individuals in total to form the population;
(3.4) Fitness calculation: for each individual in the population, compute the mutual information according to equation (1) and use it as the individual's fitness;
(3.5) Selection: based on the fitness values of the individuals in the population, select $N_l$ individuals using roulette-wheel selection combined with an elitist strategy;
(3.6) Crossover: take any two of the $N_l$ individuals, choose a crossover point at random, and with crossover probability $P_c$ swap the parts of the two individuals after the crossover point to form two new individuals;
(3.7) Mutation: for each individual, randomly generate a valid SNP index and, with mutation probability $P_m$, use it to replace any one SNP index in that individual;
(3.8) Produce the next generation: all individuals obtained after (3.7) form the next-generation population;
(3.9) If the number of iterations is less than $Iter_l$, go to (3.4);
(3.10) Take the individual with the highest fitness in the population, denoted $s_k$, add it to the set $S_k$ of SNPs suspected of being related to the disease, and remove it from the SNP set $\Omega^*$, that is, $S_k \leftarrow S_k \cup \{s_k\}$ and $\Omega^* \leftarrow \Omega^* \setminus s_k$;
(3.11) Repeat (3.3) to (3.10) $M_l$ times, each time adding one individual to $S_k$ and removing it from the data; after $M_l$ repetitions, $S_k$ is obtained;
(3.12) Reset $\Omega^* = \Omega$ and repeat (3.2) to (3.11) $Num_l$ times in total, thereby obtaining the SNP sets from all repeated searches;
(3.13) Output the SNP sets containing the l-order interactions found.
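Steps (3.1) to (3.13) can be sketched in Python as below. This is an illustrative reading of the text, not the patent's implementation. The names `run_ga` and `fgsa` are hypothetical, and several details are assumptions: base-2 logarithms, pairing consecutive selected individuals for crossover, rejecting crossover or mutation results that would repeat a SNP within an individual, and carrying one elite individual unchanged into the next generation. The default iteration counts are deliberately small and are not the values used in the experiments.

```python
import math
import random
from collections import Counter


def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def fitness(X, y, factor):
    """Step (3.4): mutual information between the SNPs in `factor` and the labels."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(tuple(row[j] for j in factor), []).append(label)
    return entropy(y) - sum(len(g) / len(y) * entropy(g) for g in groups.values())


def run_ga(X, y, pool, l, n_pop, p_c, p_m, n_iter, rng):
    """Steps (3.3)-(3.10): evolve l-order factors over `pool`, return the fittest."""
    pop = [rng.sample(pool, l) for _ in range(n_pop)]
    for _ in range(n_iter):
        fits = [fitness(X, y, ind) for ind in pop]
        elite = list(pop[max(range(n_pop), key=fits.__getitem__)])
        # Roulette-wheel selection; the epsilon keeps every weight positive.
        weights = [max(f, 0.0) + 1e-12 for f in fits]
        pop = [list(ind) for ind in rng.choices(pop, weights=weights, k=n_pop - 1)]
        # One-point crossover on consecutive pairs (pairing scheme is an assumption).
        for a, b in zip(pop[0::2], pop[1::2]):
            if l > 1 and rng.random() < p_c:
                cut = rng.randrange(1, l)
                a2, b2 = a[:cut] + b[cut:], b[:cut] + a[cut:]
                if len(set(a2)) == l and len(set(b2)) == l:  # keep SNPs distinct
                    a[:], b[:] = a2, b2
        # Mutation: replace one randomly chosen position with a new SNP index.
        for ind in pop:
            if rng.random() < p_m:
                new = rng.choice(pool)
                if new not in ind:
                    ind[rng.randrange(l)] = new
        pop.append(elite)  # elitist strategy: the best individual survives unchanged
    return max(pop, key=lambda ind: fitness(X, y, ind))


def fgsa(X, y, l, n_pop=10, p_c=0.9, p_m=0.25, n_iter=50, m_l=3, num_l=5, seed=0):
    """Steps (3.2)-(3.13): peel off m_l factors per search, repeat num_l searches."""
    rng = random.Random(seed)
    found = []
    for _ in range(num_l):
        pool = list(range(len(X[0])))      # reset Omega* = Omega
        for _ in range(m_l):
            if len(pool) < l:
                break
            best = run_ga(X, y, pool, l, n_pop, p_c, p_m, n_iter, rng)
            found.append(tuple(sorted(best)))
            pool = [j for j in pool if j not in best]  # layer-by-layer peeling
    return found
```

Removing each found factor from the pool before the next genetic search is what lets a second, weaker factor surface even when a stronger one would otherwise dominate every run.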
Step 4: calculate the frequency of each SNP in the SNP sets output by step 3.
(4.1) Frequency calculation: the frequency of a SNP is the number of times it appears among the suspected pathogenic SNP groups found in step 3;
(4.2) Rank the SNPs according to the frequency-association priority criterion: SNPs with higher frequency come first, and among SNPs with the same frequency, the one whose individual association with the disease (its single-SNP mutual information) is larger comes first.
Step 5: output the top-ranked SNPs whose frequency is greater than the threshold.
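Steps 4 and 5 amount to counting, sorting, and thresholding, as the sketch below shows. The function name `rank_snps` and its argument names are hypothetical; the single-SNP mutual information values are assumed to have been computed beforehand and passed in as a dictionary.

```python
from collections import Counter


def rank_snps(factors, single_mi, threshold):
    """Rank SNPs by the frequency-association priority criterion.

    factors:   iterable of SNP-index tuples found by the search (step 3)
    single_mi: dict mapping SNP index -> single-SNP mutual information
    threshold: only SNPs with frequency strictly greater than this are kept
    Returns a list of (snp, frequency) pairs, best first.
    """
    freq = Counter(snp for factor in factors for snp in factor)
    # Higher frequency first; ties broken by larger single-SNP mutual information.
    ranked = sorted(freq, key=lambda s: (-freq[s], -single_mi.get(s, 0.0)))
    return [(s, freq[s]) for s in ranked if freq[s] > threshold]


if __name__ == "__main__":
    factors = [(0, 1), (0, 2), (1, 3), (0, 1)]
    single_mi = {0: 0.10, 1: 0.30, 2: 0.05, 3: 0.20}
    print(rank_snps(factors, single_mi, threshold=1))
```

Because the genetic search is stochastic, counting how often a SNP recurs across repeated searches is what gives the output its relative stability.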
In step 3, the genetic algorithm of (3.3) to (3.8) treats as one individual an l-order factor formed from l distinct valid SNP indices randomly generated from the SNPs in $\Omega^*$, and obtains better individuals through crossover and mutation. This genetic evolutionary search reflects the factor-based nature of the method's SNP filtering and ensures that the search can find SNP factors that are weakly associated with the disease individually but strongly associated in combination. Steps (3.10) and (3.11) add the individual with the highest fitness in the population to the set of SNPs suspected of being related to the disease and remove that individual from the data. This peels off the pathogenic factors layer by layer and ensures that, when several pathogenic factors exist, one strongly associated factor does not mask the search for others that may also be strongly associated. The frequency-association priority criterion in step 4 keeps the solutions found in this way relatively stable.
The effect of the method is described in more detail through the following experimental examples. These examples are for illustration and are not intended to limit the scope of the invention.
In the following experiments, the parameters of the method are: $N_l = 10$, $P_c = 0.9$, $P_m = 0.25$, $Iter_l = 5000$, $Num_l = 20$, $M_l = 8$.
Experiment 1: SNP filtering on simulated data.
The simulated data were built on real SNP data from a New York population, into which biologists inserted 7 SNP groups known to be associated with complex diseases; the 7 groups follow different disease association models. There are two datasets: the first contains 2000 samples and 100 SNPs and is denoted SNP100; the second contains 2000 samples and 2000 SNPs and is denoted SNP2000. The data are summarized in Table 1.
Table 1. Experimental data
Experiments were run on both datasets. The FGSA parameters $(Iter_2, Iter_3)$ were set per dataset, to (600, 1100) and (1200, 1700) respectively; $(M_2, M_3) = (8, 5)$ for both datasets. Two sets of experiments were run, differing only in the frequency threshold: in one, the thresholds for the two datasets were Th = 3 and Th = 2 respectively; in the other, Th = 1.
The experimental results of the FGSA algorithm and several typical feature selection methods (the minimum-redundancy maximum-relevance method, mRMR; the maximum entropy method, ME; and others) are shown in Table 2 and Table 3, where "-" means that no result was obtained because the computation was too expensive. The compression rate is the ratio of the number of SNPs obtained after filtering to the total number of SNPs in the data. The factor rate is the ratio of the number of true pathogenic factors contained in the filtered SNPs to the number of true pathogenic factors in the data.
In Table 2, a denotes true positives. The factor rate is defined as the number of pathogenic factors contained in the filtered SNPs as a percentage of the total number of pathogenic factors contained in Ω before filtering; the compression rate is the number of SNPs after filtering as a percentage of the total number of SNPs in Ω before filtering.
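The two evaluation measures defined above can be computed as in this sketch. The function names are hypothetical, and counting a factor only when every one of its SNPs survives the filter is an assumption, based on the text's later remark that the selected SNP set "completely contains" a factor.

```python
def compression_rate(kept_snps, total_snps):
    """Number of SNPs kept after filtering divided by the number before filtering."""
    return len(set(kept_snps)) / total_snps


def factor_rate(kept_snps, true_factors):
    """Fraction of true pathogenic factors fully contained in the kept SNPs."""
    kept = set(kept_snps)
    # A factor counts only if all of its SNPs were kept (assumption, see lead-in).
    return sum(set(f) <= kept for f in true_factors) / len(true_factors)


if __name__ == "__main__":
    kept = [0, 1, 2, 5]
    true_factors = [(0, 1), (2, 3), (5,)]
    print(compression_rate(kept, 100), factor_rate(kept, true_factors))
```

In the example, the factor (2, 3) is not counted because SNP 3 was filtered out, so two of the three factors are retained.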
Table 2. Performance of the FGSA-frequency algorithm and comparison with other algorithms
Table 3. Performance of the FGSA-frequency algorithm (results with frequency threshold Th = 1)
Tables 2 and 3 show that:
(a) At roughly comparable compression rates, the factor rate of the FGSA method is higher than that of the other methods, which demonstrates the effectiveness of the algorithm;
(b) The factor rate of mRMR is only 3/7, meaning that the selected SNP set completely contains only 3 of the 7 pathogenic factors, whereas the factor rate of the FGSA method is 5/7 to 6/7, meaning that it completely contains 5 or 6 of them. This clearly demonstrates the effectiveness of the FGSA method;
(c) The compression rate grows with the number of SNPs: the more SNPs, the higher the compression rate, reaching 97% with 2000 SNPs. This indicates that the FGSA method is more effective in the genome-wide setting;
(d) When a factor contains too many SNPs, the curse of dimensionality sets in. This is one important reason why the pathogenic factor of length 5 was never completely selected by these methods.
Experiment 2: SNP filtering on real AMD data.
AMD is a medical condition affecting older adults that causes loss of central vision due to retinal damage. The AMD data (see Table 4) contain 103,611 SNPs, 96 case samples, and 50 control samples, with 0.811% of the data missing. Sample values are 0, 1, 2, and 3, where 0 means the value is missing. The AMD data are widely used in SNP association analysis; many methods have been applied to them and have identified some associated genes. Table 5 lists the SNPs found by FGSA that were also found by other methods (such as BOOST, AntEpiSeeker, epiMODE, BEAM, HapForest, Single-Marker, and DASSO-MB).
Table 4. AMD data
Table 5. SNPs found by the FGSA method that were also found by other methods
Table 5 shows that:
(a) The SNPs found by FGSA overlap heavily with those found by other methods, which demonstrates the effectiveness of the FGSA method.
(b) FGSA also reports high-frequency SNPs that the other methods did not find, namely the SNPs numbered 19405, 6693, 56674, 80178, 76784, 92627, 46516, 88957, 42568, 51958, 41808, and 47428, with frequencies of 35, 26, 26, 25, 24, 24, 22, 21, 20, 15, 11, and 9 respectively. The possibility that these SNPs are associated with the disease cannot be ruled out. In particular, the true pathogenic factors may lie in the set formed by the SNPs in Table 5 together with the SNPs above, or in a subset of them.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013102796270A CN103366100A (en) | 2013-06-25 | 2013-06-25 | Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103366100A (en) | 2013-10-23 |
Family
ID=49367427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013102796270A Pending CN103366100A (en) | 2013-06-25 | 2013-06-25 | Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103366100A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462868A (en) * | 2014-12-11 | 2015-03-25 | 西安电子科技大学 | Genome-wide SNP (single nucleotide polymorphism) site analysis method based on combination of random forest and Relief-F |
CN108256293A (en) * | 2018-02-09 | 2018-07-06 | 哈尔滨工业大学深圳研究生院 | A kind of statistical method and system of the disease association assortment of genes |
CN110135057A (en) * | 2019-05-14 | 2019-08-16 | 北京工业大学 | Soft-sensing method for dioxin emission concentration in solid waste incineration process based on multi-layer feature selection |
CN110428897A (en) * | 2019-06-19 | 2019-11-08 | 西安电子科技大学 | Medical diagnosis on disease information processing method based on SNP pathogenic factor Yu disease association relationship |
CN112270957A (en) * | 2020-10-19 | 2021-01-26 | 西安邮电大学 | High-order SNP pathogenic combination data detection method, system and computer equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894216A (en) * | 2010-07-16 | 2010-11-24 | 西安电子科技大学 | A method for discovering groups of SNPs associated with complex diseases from SNP data |
CN102629305A (en) * | 2012-03-06 | 2012-08-08 | 上海大学 | Feature selection method facing to SNP (Single Nucleotide Polymorphism) data |
-
2013
- 2013-06-25 CN CN2013102796270A patent/CN103366100A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894216A (en) * | 2010-07-16 | 2010-11-24 | 西安电子科技大学 | A method for discovering groups of SNPs associated with complex diseases from SNP data |
CN102629305A (en) * | 2012-03-06 | 2012-08-08 | 上海大学 | Feature selection method facing to SNP (Single Nucleotide Polymorphism) data |
Non-Patent Citations (3)
Title |
---|
JUNYING ZHANG等: "A Genetic Algorithm to Filter SNPs for SNP Association Study", 《WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY (WI-IAT), 2012 IEEE/WIC/ACM INTERNATIONAL CONFERENCES ON》, vol. 1, 7 December 2012 (2012-12-07), pages 684 - 687, XP032391329, DOI: doi:10.1109/WI-IAT.2012.146 * |
蒋胜利: "高维数据的特征选择与特征提取研究"", 《中国博士学位论文全文数据库 信息科技辑》, vol. 2011, no. 12, 15 December 2011 (2011-12-15), pages 138 - 49 * |
蒋胜利等: "基于多重遗传算法的单核苷酸多态性特征选择", 《四川大学学报(工程科学版)》, vol. 42, no. 2, 20 March 2010 (2010-03-20), pages 132 - 138 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462868A (en) * | 2014-12-11 | 2015-03-25 | Xidian University | Genome-wide SNP (single nucleotide polymorphism) site analysis method based on combination of random forest and Relief-F |
CN104462868B (en) * | 2014-12-11 | 2017-04-05 | Xidian University | Genome-wide SNP site analysis method combining random forest and Relief-F |
CN108256293A (en) * | 2018-02-09 | 2018-07-06 | Harbin Institute of Technology Shenzhen Graduate School | Statistical method and system for disease-associated gene combinations |
CN110135057A (en) * | 2019-05-14 | 2019-08-16 | Beijing University of Technology | Soft-sensing method for dioxin emission concentration in solid waste incineration process based on multi-layer feature selection |
CN110135057B (en) * | 2019-05-14 | 2021-03-02 | Beijing University of Technology | Soft measurement method of dioxin emission concentration in solid waste incineration process based on multi-layer feature selection |
US11976817B2 (en) | 2019-05-14 | 2024-05-07 | Beijing University Of Technology | Method for detecting a dioxin emission concentration of a municipal solid waste incineration process based on multi-level feature selection |
CN110428897A (en) * | 2019-06-19 | 2019-11-08 | Xidian University | Disease diagnosis information processing method based on the relationship between SNP pathogenic factors and diseases |
CN110428897B (en) * | 2019-06-19 | 2022-03-18 | Xidian University | A disease diagnosis information processing method based on the relationship between SNP pathogenic factors and diseases |
CN112270957A (en) * | 2020-10-19 | 2021-01-26 | Xi'an University of Posts and Telecommunications | High-order SNP pathogenic combination data detection method, system and computer equipment |
CN112270957B (en) * | 2020-10-19 | 2023-11-07 | Xi'an University of Posts and Telecommunications | High-order SNP pathogenic combination data detection method, system and computer equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Marand et al. | A cis-regulatory atlas in maize at single-cell resolution | |
Varshney et al. | Designing future crops: genomics-assisted breeding comes of age | |
Kelly et al. | Analysis of the giant genomes of Fritillaria (Liliaceae) indicates that a lack of DNA removal characterizes extreme expansions in genome size | |
Smith et al. | Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants | |
Arruda et al. | Genomic selection for predicting Fusarium head blight resistance in a wheat breeding program | |
Xu et al. | Predicting hybrid performance in rice using genomic best linear unbiased prediction | |
Gernandt et al. | Multi‐locus phylogenetics, lineage sorting, and reticulation in Pinus subsection Australes | |
Song et al. | Conserved noncoding sequences provide insights into regulatory sequence and loss of gene expression in maize | |
CN104462868B (en) | Genome-wide SNP site analysis method combining random forest and Relief-F | |
Lavarenne et al. | The spring of systems biology-driven breeding | |
CA2932507C (en) | Improved molecular breeding methods | |
Rowley et al. | A draft genome and high-density genetic map of European hazelnut (Corylus avellana L.) | |
CN110993113B (en) | LncRNA-disease relation prediction method and system based on MF-SDAE | |
Aono et al. | Machine learning approaches reveal genomic regions associated with sugarcane brown rust resistance | |
Ding et al. | Population-genomic analyses reveal bottlenecks and asymmetric introgression from Persian into iron walnut during domestication | |
Senerchia et al. | Evolutionary dynamics of retrotransposons assessed by high-throughput sequencing in wild relatives of wheat | |
CN103366100A (en) | Method for filtering SNP (Single Nucleotide Polymorphism) unrelated to complex diseases from whole-genome | |
Timilsena et al. | Phylogenomic resolution of order- and family-level monocot relationships using 602 single-copy nuclear genes and 1375 BUSCO genes | |
Grover et al. | Dual domestication, diversity, and differential introgression in Old World cotton diploids | |
Wang et al. | Genetic diversity and population structure in the endangered tree Hopea hainanensis (Dipterocarpaceae) on Hainan Island, China | |
Zhang et al. | The lack of negative association between TE load and subgenome dominance in synthesized Brassica allotetraploids | |
Yu et al. | Genomic analyses reveal dead‐end hybridization between two deeply divergent kiwifruit species rather than homoploid hybrid speciation | |
Cornet et al. | Holocentric repeat landscapes: From micro‐evolutionary patterns to macro‐evolutionary associations with karyotype evolution | |
Morales‐Briones et al. | Phylogenomic analyses in Phrymaceae reveal extensive gene tree discordance in relationships among major clades | |
Qin et al. | Phylogenomics and divergence pattern of Polygonatum (Asparagaceae: Polygonateae) in the north temperate region |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20131023 |