CN109448794A - Epistatic locus mining method based on genetic tabu search and Bayesian networks

Publication number: CN109448794A
Authority: CN (China)
Prior art keywords: network, SNP, site, individual, taboo
Legal status: Granted, Active
Application number: CN201811287261.0A
Other languages: Chinese (zh)
Other versions: CN109448794B (en)
Inventors: 刘建晓, 果杨, 钟芷漫, 杨晨, 胡江峰, 蒋雅玲, 梁子珍, 高辉
Current Assignee: Huazhong Agricultural University
Original Assignee: Huazhong Agricultural University
Application filed by: Huazhong Agricultural University
Priority: CN201811287261.0A
Publications: CN109448794A (application), CN109448794B (grant)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Genetics & Genomics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Physiology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an epistatic locus mining method based on genetic tabu search and Bayesian networks, comprising: (1) converting genotype data into Boolean data in binary form; (2) rapidly computing, with logical AND operations, the conditional mutual information between every SNP pair and the phenotype, taking the top-N node pairs, and building an initial network graph containing the SNP loci; (3) starting from the initial network individual, generating new individuals by randomly adding, deleting, and reversing edges until the number of network individuals reaches the population size; (4) evolving the Bayesian network structure through the three operations of the genetic algorithm and the scoring mechanism of the Bayesian network to find the optimal network structure, thereby quickly and accurately obtaining the epistatic loci that influence the phenotypic trait. The invention helps biological researchers obtain the epistatic loci that influence a particular phenotypic trait, supports gene function mining, and provides a reference for dissecting the genetic basis of complex quantitative traits in different species.

Description

Epistatic locus mining method based on genetic tabu search and Bayesian networks
Technical field
The present invention relates to the technical field of bioinformatics, and in particular to an epistatic locus mining method based on genetic tabu search and Bayesian networks.
Background
As living standards and medical conditions have improved, diseases determined solely by environmental factors (such as infectious diseases and malnutrition) have largely been brought under control, and complex diseases and Mendelian diseases have become the main diseases affecting human health. A Mendelian disease is a single-gene disorder whose inheritance follows Mendel's laws; researchers have identified the relevant genes by positional cloning and have largely clarified the mode of inheritance. Complex diseases account for about 80% or more of human diseases and cause great harm to human health. Common chronic diseases such as asthma, cancer, diabetes, hypertension, senile dementia, rheumatoid arthritis, schizophrenia, heart disease, cardiovascular disease, obesity, and tumors are collectively called complex diseases. Their causes are extremely complex and involve the environment, genes, and the interactions between them. It is therefore urgent to clarify the pathogenesis and genetic mechanism of complex diseases, so as to provide a scientific basis for their diagnosis and treatment and to safeguard human health; this work has important research significance.
From the viewpoint of genetics, the genetic factors that determine complex biological traits fall into three categories: main effects of genes, gene-gene interactions, and gene-environment interactions. Many biological experiments have shown that the main factor controlling complex traits is the interaction between genes. Gene-gene interaction, also called epistasis, manifests mainly as interaction between SNPs. Meanwhile, the rapid development of high-throughput technologies has produced massive amounts of biological data. Using genome-wide association studies (GWAS) to screen these data for SNPs significantly associated with disease across the genome, and thereby to clarify the genetic mechanism of complex diseases, is a current hot topic in bioinformatics. GWAS methods focus mainly on detecting main-effect genes. Although early studies found many phenotype-associated loci with this approach, those loci explain only a small fraction of the genetic variation. One of the most important reasons is that these studies ignore gene-gene interaction, that is, epistasis. Mining epistatic loci is therefore a principal means of explaining the genetic mechanism of complex diseases. However, current epistasis detection methods still suffer from heavy computation, high algorithmic complexity, low efficiency, and high false-positive rates, so they cannot accurately and efficiently detect the SNP loci and SNP combinations associated with disease. Proposing a more effective and more accurate genome-wide epistasis detection algorithm therefore has great research significance and plays an important role in the discovery of pathogenic mechanisms and in the diagnosis, treatment, and prevention of complex diseases.
Summary of the invention
The technical problem to be solved by the present invention is to provide, in view of the defects of the prior art, an epistatic locus mining method based on genetic tabu search and Bayesian networks.
The technical solution adopted by the present invention to solve this problem is as follows:
The present invention provides an epistatic locus mining method based on genetic tabu search and Bayesian networks, comprising the following steps:
Step 1: SNP genotype data are expressed in the form 0, 1, 2, where 0 denotes the homozygous common genotype, 1 denotes the heterozygote, and 2 denotes the homozygous rare genotype. The gene samples to be mined are obtained and, sample by sample, divided into the three groups 0, 1, and 2, so that the genotype data are converted into Boolean data (0, 1) in binary form.
Step 2: Based on information entropy theory, the conditional mutual information between every SNP pair and the phenotypic trait is computed; the node pairs are sorted by the computed mutual information, the top-N node pairs are taken, and an initial network graph containing the SNP pairs is built.
Step 3: Provided that no cycle is created, the initial network individual is modified by randomly adding, deleting, or reversing an edge to generate the next network individual, and a further new individual is then generated from that individual. This operation is repeated until the number of network individuals reaches the initial population size.
Step 4: The initial network population obtained in step 3, which consists of Bayesian networks containing the SNP loci, is evolved through the three operations of the genetic algorithm optimized by tabu search (selection, crossover, and mutation) together with the scoring mechanism of the Bayesian network. The optimal network structure is found, yielding the epistatic loci that influence the phenotypic trait.
Further, the method of the invention also includes a method for evaluating the constructed network:
Step 5: The fitness function serves as the criterion for judging the quality of a network individual, and the quality of a network is evaluated by BIC scoring.
Further, in step 2 and step 5 of the invention, the genotype data are converted into Boolean data in binary form and the binary data are operated on directly with logical AND, so that the conditional mutual information between nodes and the BIC score of the Bayesian network can be computed rapidly.
Further, the initial network graph containing the SNP pairs is built in step 2 of the invention as follows:
Step 2.1: The number nlocus of epistatic loci to be mined is set, and all combinations of nlocus loci are enumerated over all loci. Based on information entropy theory, the conditional mutual information between each combination of nlocus loci and the phenotypic trait is computed rapidly with logical AND operations.
Step 2.2: The node pairs are sorted by the computed conditional mutual information and the top-N node pairs are taken, where N is determined from experimental results. For each SNP locus not contained in the top-N node pairs, the node pair in which it first occurs is selected and added to the top-N node pairs.
Step 2.3: All SNP loci are treated as network nodes. For each of the top-N node pairs obtained in step 2.2, the corresponding edge between the two nodes is inserted into the network, which yields the initial network graph.
Further, the evolution in step 4 of the invention proceeds as follows:
Step 4.1, selection. The networks are scored with the Bayesian network scoring method. The optimal Bayesian network individual with the highest score is placed at the first position of the population, and roulette-wheel selection is used to choose the network individuals that enter the next generation.
Step 4.2, tabu crossover. Two networks are evolved with a multi-column crossover method, and each exchange is checked for cycle creation. To avoid the premature convergence caused by ordinary crossover, the memory function of tabu search is used: after crossover, the offspring network is compared with the individuals in the tabu list. If it is not in the tabu list, the offspring network individual enters the next generation and is stored in the tabu list. If it is already in the tabu list, the offspring is discarded and tabu crossover is repeated until the offspring produced is not in the tabu list.
Step 4.3, tabu mutation. With a given mutation probability, a network individual undergoes edge addition, deletion, or reversal, and the mutation that increases the network score the most is selected to obtain a better network structure. Using the memory function of tabu search, inferior solutions produced by mutation that do not improve the current fitness are stored in the tabu list.
Further, the calculation in step 2.1 of the invention is as follows:
When mining k epistatic SNP loci that influence the phenotype Class, I(Class | SNP_1, ..., SNP_k) denotes the conditional mutual information between the k epistatic SNP loci and Class, computed as:
$$I(Class \mid SNP_1, \ldots, SNP_k) = H(Class) + H(SNP_1, \ldots, SNP_k) - H(Class, SNP_1, \ldots, SNP_k)$$
The information entropy H(Class) of Class is computed as:
$$H(Class) = -\sum_{c} p(c) \log p(c)$$
The information entropy H(SNP_1, ..., SNP_k) of the k SNP loci is computed as:
$$H(SNP_1, \ldots, SNP_k) = -\sum_{g_1, \ldots, g_k} p(g_1, \ldots, g_k) \log p(g_1, \ldots, g_k)$$
Here c ranges over the values of Class and each g_i ranges over the genotypes 0, 1, 2 of SNP_i.
Further, the BIC score of the invention is computed as follows:
Let D denote the sample data and G the Bayesian network structure. By Bayes' formula:
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$
where P(G) is the prior knowledge of the network structure.
Let θ_G denote the parameters of the network structure. Expanding the likelihood term P(D | G) by marginal integration gives:
$$P(D \mid G) = \int P(D \mid G, \theta_G)\, P(\theta_G \mid G)\, d\theta_G$$
This leads to the BIC scoring method of the Bayesian network:
$$BIC(G \mid D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} m_{ijk} \log \frac{m_{ijk}}{m_{ij}} - \frac{\log m}{2} \sum_{i=1}^{n} q_i (r_i - 1)$$
where m is the total number of samples, n is the number of variables, r_i is the number of values of the i-th variable, q_i is the number of value combinations of the parents of the i-th variable, m_ijk is the number of samples in which the i-th variable takes its k-th value while its parents take their j-th combination, and m_ij = Σ_k m_ijk.
The beneficial effects of the invention are as follows. The epistatic locus mining method based on genetic tabu search and Bayesian networks first converts the genotype data into Boolean data in binary form and rapidly computes, with logical AND operations, the conditional mutual information between every SNP pair and the phenotype. After sorting the SNP pairs by the computed mutual information, it takes the top-N node pairs and builds an initial network graph containing the SNP loci. Then, provided that no cycle is created, it generates new individuals from the initial network individual by the three operations of randomly adding, deleting, and reversing an edge until the population size is reached. The memory mechanism of tabu search is applied to the evolutionary operations of the genetic algorithm: the three genetic operations (selection, crossover, and mutation) and the BIC scoring method of the Bayesian network are used to evolve the Bayesian network structure and find the optimal network structure, so that the epistatic loci influencing the phenotypic trait are obtained quickly and accurately, which supports gene function mining.
Brief description of the drawings
The present invention is further described below with reference to the drawings and embodiments. In the drawings:
Fig. 1 is a flow diagram of an embodiment of the invention;
Fig. 2 shows the representation of genotype data;
Fig. 3 shows the representation of binary Boolean data;
Fig. 4 shows the matrix encoding of a Bayesian network structure;
Fig. 5 shows the procedure that avoids creating cycles;
Fig. 6 is a schematic diagram of a crossover operation that produces no new individual;
Fig. 7 shows the Bayesian network mutation operation;
Fig. 8 compares the 2-locus epistasis detection accuracy of different methods;
Fig. 9 compares the 2-locus epistasis detection efficiency of different methods;
Fig. 10 compares the 3-locus epistasis detection accuracy of different methods.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
1. Genotype data are expressed in the form 0, 1, 2. For example, for a SNP with alleles A and T, AA is represented by 0, TT by 2, and AT/TA by 1. Fig. 2 shows the genotype data of 11 SNPs (SNP_A to SNP_K) for 4 samples. The last column, Class, is the phenotypic trait, where Class=1 denotes a case (affected) and Class=0 denotes a control. To speed up the later computation of conditional mutual information between nodes and of the Bayesian network score, the genotype data are converted into Boolean data (0/1) in binary form. Fig. 3 shows the genotype data of SNP_A to SNP_D from Fig. 2 after conversion to this binary Boolean format.
In Fig. 3, the first 4 columns are the binary representation of genotype 0, the middle 4 columns are the binary representation of genotype 1, and the last 4 columns are the binary representation of genotype 2. When a SNP has genotype 0 in a given sample, the bit at the corresponding position in the first 4 columns is set to 1, as shown by the boxed data in Fig. 2 and Fig. 3.
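The encoding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `encode_snp` and `to_bit_string` and the sample genotypes are hypothetical, and the genotypes are not the values of Fig. 2.

```python
from typing import List, Tuple


def encode_snp(genotypes: List[int]) -> Tuple[int, ...]:
    """Return three bitmasks, one per genotype 0, 1, 2.

    Bit s of mask g is set when sample s carries genotype g.
    """
    masks = [0, 0, 0]
    for sample, g in enumerate(genotypes):
        masks[g] |= 1 << sample
    return tuple(masks)


def to_bit_string(mask: int, n_samples: int) -> str:
    """Render a mask with sample 0 as the leftmost character."""
    return "".join("1" if (mask >> s) & 1 else "0" for s in range(n_samples))


# Hypothetical genotypes of one SNP over 4 samples.
snp = [0, 1, 2, 0]
masks = encode_snp(snp)
# One 4-bit group per genotype, mirroring the three 4-column groups of Fig. 3.
print(" ".join(to_bit_string(m, len(snp)) for m in masks))
```

Storing each genotype group as one integer lets a later step intersect sample sets with a single AND instead of looping over samples.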
2. Based on the genotype data in binary Boolean form, the conditional mutual information between SNP pairs and the phenotypic trait is computed with information entropy theory; the node pairs are sorted, the top-N node pairs are taken, and the initial network containing the SNP loci is built.
(1) When mining k epistatic SNP loci that influence the phenotype Class, formula (1) gives the conditional mutual information between the k SNP loci and Class. Formula (2) gives the information entropy H(Class) used in formula (1), and formula (3) gives the information entropy H(SNP_1, ..., SNP_k) of the k SNP loci used in formula (1).
$$I(Class \mid SNP_1, \ldots, SNP_k) = H(Class) + H(SNP_1, \ldots, SNP_k) - H(Class, SNP_1, \ldots, SNP_k) \tag{1}$$
$$H(Class) = -\sum_{c} p(c) \log p(c) \tag{2}$$
$$H(SNP_1, \ldots, SNP_k) = -\sum_{g_1, \ldots, g_k} p(g_1, \ldots, g_k) \log p(g_1, \ldots, g_k) \tag{3}$$
Here c ranges over the values of Class and each g_i ranges over the genotypes 0, 1, 2 of SNP_i.
With the genotype data in binary form, the conditional mutual information between nodes can be computed quickly by logical AND operations. For example, formula (4) gives the conditional mutual information I(Class | SNP_B, SNP_C) for the matrix M_bit of Fig. 3.
$$I(Class \mid SNP_B, SNP_C) = H(Class) + H(SNP_B, SNP_C) - H(Class, SNP_B, SNP_C) \tag{4}$$
H(SNP_B, SNP_C) in formula (4) is computed with formula (5):
$$H(SNP_B, SNP_C) = -\sum_{b=0}^{2} \sum_{c=0}^{2} p(b, c) \log p(b, c) \tag{5}$$
Each probability in formula (5) can be obtained directly by a binary logical AND followed by counting the 1 characters in the resulting binary string. For the underlined binary values of M_bit in Fig. 3, p(1,1) in formula (5) is computed with formula (6):
$$p(1, 1) = \frac{\operatorname{count}_1(B_1 \wedge C_1)}{m} \tag{6}$$
where B_1 and C_1 are the binary strings of SNP_B and SNP_C for genotype 1, count_1 is the number of 1 characters, and m is the number of samples.
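The AND-and-count computation described above can be sketched as follows. This is a hedged illustration under stated assumptions: the helper names (`masks_for`, `joint_counts`, `mutual_information`) are hypothetical, base-2 logarithms are assumed because the text does not state the base, and the toy data are not taken from Fig. 3.

```python
from itertools import product
from math import log2
from typing import List, Sequence


def popcount(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def masks_for(values: Sequence[int], n_levels: int) -> List[int]:
    """One bitmask per level; bit s is set when sample s has that level."""
    masks = [0] * n_levels
    for s, v in enumerate(values):
        masks[v] |= 1 << s
    return masks


def joint_counts(mask_lists: List[List[int]]) -> List[int]:
    """Count samples for every value combination via AND plus popcount."""
    counts = []
    for combo in product(*mask_lists):
        acc = combo[0]
        for mask in combo[1:]:
            acc &= mask  # samples that match every value in the combination
        counts.append(popcount(acc))
    return counts


def entropy(counts: List[int], m: int) -> float:
    """Shannon entropy of the empirical distribution given by the counts."""
    return -sum(c / m * log2(c / m) for c in counts if c)


def mutual_information(snps: List[Sequence[int]], phenotype: Sequence[int]) -> float:
    """H(Class) + H(SNPs) - H(Class, SNPs), as in formula (1)."""
    m = len(phenotype)
    snp_masks = [masks_for(s, 3) for s in snps]
    cls_masks = masks_for(phenotype, 2)
    h_class = entropy(joint_counts([cls_masks]), m)
    h_snps = entropy(joint_counts(snp_masks), m)
    h_joint = entropy(joint_counts([cls_masks] + snp_masks), m)
    return h_class + h_snps - h_joint


# Toy data: the phenotype is the XOR of two SNPs, a purely epistatic pattern.
snp_b, snp_c, cls = [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]
print(mutual_information([snp_b, snp_c], cls))
```

In the toy data neither SNP alone carries information about the phenotype, yet the pair determines it completely, which is the kind of signal the pairwise score is meant to surface.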
(2) The node pairs are sorted from high to low by the conditional mutual information between the k epistatic SNP loci and Class computed above, and the top-N node pairs are taken, where N can be determined from experimental results. For each SNP node not contained in the top-N node pairs, the first node pair in the sorted list in which it occurs is selected and added to the top-N node pairs.
(3) All SNP loci are treated as network nodes. The top-N node pairs obtained above are traversed one by one, and for each pair the corresponding edge is inserted into the network, which yields the initial network graph.
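Steps (2) and (3) can be sketched as follows. The names `select_pairs` and `initial_network` and the scores are hypothetical. The text does not say how an edge is oriented, so this sketch assumes the edge runs from the lower-indexed SNP to the higher-indexed one, which also keeps the initial graph acyclic.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def select_pairs(scores: Dict[Pair, float], n_snps: int, top_n: int) -> List[Pair]:
    """Take the top-N pairs, then cover every SNP that is still missing."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = ranked[:top_n]
    covered = {node for pair in chosen for node in pair}
    for snp in range(n_snps):
        if snp in covered:
            continue
        # First (highest-ranked) remaining pair that contains this SNP.
        for pair in ranked[top_n:]:
            if snp in pair:
                chosen.append(pair)
                covered.update(pair)
                break
    return chosen


def initial_network(pairs: List[Pair], n_snps: int) -> List[List[int]]:
    """Adjacency matrix with adj[i][j] = 1 when node i is a parent of node j."""
    adj = [[0] * n_snps for _ in range(n_snps)]
    for a, b in pairs:
        lo, hi = min(a, b), max(a, b)  # assumed orientation: low index -> high index
        adj[lo][hi] = 1
    return adj


# Hypothetical mutual-information scores for all pairs of 4 SNPs.
scores = {(0, 1): 0.9, (0, 2): 0.5, (1, 2): 0.4, (0, 3): 0.3, (1, 3): 0.2, (2, 3): 0.1}
pairs = select_pairs(scores, n_snps=4, top_n=2)
print(pairs)
print(initial_network(pairs, 4))
```

SNP 3 is absent from the two best pairs, so its best-ranked pair (0, 3) is appended, which guarantees that every locus appears in the initial graph.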
3. When searching the space of Bayesian network structures, each individual in the genetic tabu search used by the invention corresponds to one Bayesian network structure. A Bayesian network individual is represented by an adjacency matrix: if the network contains n SNP loci, each individual is an n × n adjacency matrix C. A 0/1 encoding is used: C_ij = 1 if node i is a parent of node j, and C_ij = 0 otherwise, as shown in Fig. 4.
A Bayesian network individual in the initial population of the genetic algorithm consists of SNP nodes and the edges between them. Provided that no cycle is created, the initial network individual is modified by randomly adding, deleting, or reversing an edge to generate the next network individual. A new network individual is then generated from that individual, and so on until the number of network individuals reaches the initial population size. The genetic algorithm starts iterating from this initial population.
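The population construction above can be sketched as follows. The cycle test uses Kahn's topological ordering, which is one standard choice and not necessarily the patent's; the function names, the retry limit `max_tries`, and the seed network are assumptions for illustration.

```python
import random
from typing import List

Matrix = List[List[int]]


def has_cycle(adj: Matrix) -> bool:
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    removed = 0
    while queue:
        i = queue.pop()
        removed += 1
        for j in range(n):
            if adj[i][j]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return removed < n


def random_neighbor(adj: Matrix, rng: random.Random, max_tries: int = 100) -> Matrix:
    """Apply one random edge addition, deletion, or reversal that keeps a DAG."""
    n = len(adj)
    for _ in range(max_tries):
        new = [row[:] for row in adj]
        i, j = rng.sample(range(n), 2)  # two distinct nodes, so no self-loops
        op = rng.choice(("add", "delete", "reverse"))
        if op == "add" and not new[i][j] and not new[j][i]:
            new[i][j] = 1
        elif op == "delete" and new[i][j]:
            new[i][j] = 0
        elif op == "reverse" and new[i][j]:
            new[i][j], new[j][i] = 0, 1
        else:
            continue  # operation not applicable to this node pair
        if not has_cycle(new):
            return new
    return [row[:] for row in adj]  # fall back to an unchanged copy


def initial_population(seed_net: Matrix, size: int, rng: random.Random) -> List[Matrix]:
    """Chain of networks: each individual is derived from the previous one."""
    population = [seed_net]
    while len(population) < size:
        population.append(random_neighbor(population[-1], rng))
    return population


seed_net = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
population = initial_population(seed_net, size=6, rng=random.Random(0))
print(len(population), all(not has_cycle(net) for net in population))
```

Deriving each individual from the previous one, rather than always from the seed, lets the population drift further from the initial graph as it grows.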
4. Starting from the initial network population, the Bayesian networks containing the SNP loci are evolved through the genetic operations (selection, crossover, and mutation) and the scoring mechanism of the Bayesian network. The optimal network structure is found, and the epistatic loci influencing the phenotypic trait are thereby obtained quickly and accurately. In addition, to increase the diversity of the population and reach the global optimum, the tabu search strategy is applied to the crossover and mutation operations of the genetic algorithm, which also accelerates convergence.
(1) Selection operation
The selection operation picks good individuals from the current population so that they serve as parents of the next generation. Its principle is that an individual with higher fitness has a higher probability of being selected. The networks are scored with the Bayesian network scoring method and the higher-scoring networks are favored: the optimal Bayesian network individual with the highest score is placed at the first position of the population, and roulette-wheel selection is used to choose the better network individuals that enter the next generation. If the fitness of the i-th network is f_i, the probability P_i that individual i is selected is given below, where N is the population size.
$$P_i = \frac{f_i}{\sum_{j=1}^{N} f_j}$$
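The selection step can be sketched as follows. The names are hypothetical, and the sketch assumes that fitness values are already positive; BIC scores are usually negative, so some shift or rescaling would be needed first, and the text does not specify one.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def selection_probabilities(fitness: Sequence[float]) -> List[float]:
    """P_i = f_i / sum_j f_j, assuming every f_i is positive."""
    total = sum(fitness)
    return [f / total for f in fitness]


def select(population: Sequence[T], fitness: Sequence[float], rng: random.Random) -> List[T]:
    """Keep the best individual first, fill the rest by roulette wheel."""
    best = max(range(len(population)), key=fitness.__getitem__)
    probs = selection_probabilities(fitness)
    rest = rng.choices(population, weights=probs, k=len(population) - 1)
    return [population[best]] + rest


# Hypothetical individuals labelled by name, with positive fitness values.
names = ["net_a", "net_b", "net_c", "net_d"]
fit = [1.0, 3.0, 4.0, 2.0]
print(selection_probabilities(fit))
print(select(names, fit, random.Random(1)))
```

Placing the best individual at position 0 before sampling is a simple form of elitism, so the top score can never decrease from one generation to the next.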
(2) Tabu crossover operation
Crossover produces better individuals in the new generation, and the new individuals inherit characteristics of their parents. To speed up convergence, a multi-column crossover method is used, and whenever crossover changes a network a cycle check is performed. Suppose the population contains two network individuals Individual1 and Individual2. Two columns f1, f2 of Individual1 and two columns s1, s2 of Individual2 are selected at random. Column f1 of Individual1 is exchanged with column s1 of Individual2, and column f2 of Individual1 is exchanged with column s2 of Individual2, that is, f1 ↔ s1 and f2 ↔ s2, which gives Individual1[... s1 ... s2 ...] and Individual2[... f1 ... f2 ...]. During crossover the networks are checked for cycles: when neither Individual1 nor Individual2 contains a cycle, they are accepted as new offspring. If exchanging one position would create a cycle, the exchange at that position is skipped and the next position is examined, until both columns have been processed. This is illustrated in Fig. 5, where the procedure runs as follows:
<1> The second column of Individual1 (I1.f2) and the first column of Individual2 (I2.s1) are exchanged row by row, as shown in the "exchange first column" part of the figure. The crossover succeeds and yields two individuals whose first column has been crossed.
<2> The third column of Individual1 (I1.f3) and the third column of Individual2 (I2.s3) are exchanged row by row, as shown in the "exchange second column" part of the figure. In the first row, exchanging the value 0 in the third column of Individual1 with the value 1 in the third column of Individual2 would create a cycle in Individual1, as marked by the red circle in the figure. The algorithm skips the first row and continues the exchange from the second row. When the exchange is complete, the two final individuals are obtained.
Across generations, ordinary crossover may produce identical offspring, so that the chromosomes in the population become locally similar, the search stagnates, and premature convergence occurs. In Fig. 6, in the first iteration, I1.f1 of Individual1 is crossed with I2.s2 of Individual2 and I1.f2 of Individual1 with I2.s3 of Individual2, which gives the three new offspring network individuals shown in the second iteration. In the second iteration, I1.f2 of Individual1 is crossed with I3.p2 of Individual3 and I1.f3 of Individual1 with I3.p3 of Individual3. The two offspring I1 and I3 obtained in the third iteration duplicate their parents, so this crossover produces no new offspring.
Ordinary crossover is prone to premature convergence. To solve this problem, the memory function of tabu search is used: after crossover, the offspring network is compared one by one with the individuals in the tabu list. If it is not in the tabu list, the offspring network individual enters the next generation and is stored in the tabu list. If it is already in the tabu list, the offspring is discarded and tabu crossover is repeated until the offspring produced is not in the tabu list.
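The tabu crossover described above can be sketched as follows. This is an illustration under assumptions, not the patent's code: the tabu list is modelled as a set of matrix keys, the retry loop is bounded by a hypothetical `max_tries` (the text repeats until a non-tabu offspring appears), and all names are invented for the example.

```python
import random
from typing import List, Set, Tuple

Matrix = List[List[int]]
Key = Tuple[Tuple[int, ...], ...]


def has_cycle(adj: Matrix) -> bool:
    """Kahn's algorithm; self-loops and 2-cycles are reported as cycles."""
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    removed = 0
    while queue:
        i = queue.pop()
        removed += 1
        for j in range(n):
            if adj[i][j]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return removed < n


def key(adj: Matrix) -> Key:
    """Hashable form of a network, used as the tabu-list entry."""
    return tuple(map(tuple, adj))


def swap_columns(a: Matrix, b: Matrix, col_a: int, col_b: int) -> Tuple[Matrix, Matrix]:
    """Swap one column row by row, skipping any row that would create a cycle."""
    a, b = [row[:] for row in a], [row[:] for row in b]
    for r in range(len(a)):
        a[r][col_a], b[r][col_b] = b[r][col_b], a[r][col_a]
        if has_cycle(a) or has_cycle(b):
            a[r][col_a], b[r][col_b] = b[r][col_b], a[r][col_a]  # undo this row
    return a, b


def tabu_crossover(a: Matrix, b: Matrix, tabu: Set[Key], rng: random.Random,
                   n_cols: int = 2, max_tries: int = 50) -> Tuple[Matrix, Matrix]:
    """Multi-column crossover that rejects offspring already in the tabu list."""
    n = len(a)
    for _ in range(max_tries):
        child_a, child_b = a, b
        cols = zip(rng.sample(range(n), n_cols), rng.sample(range(n), n_cols))
        for col_a, col_b in cols:
            child_a, child_b = swap_columns(child_a, child_b, col_a, col_b)
        if key(child_a) not in tabu and key(child_b) not in tabu:
            tabu.update((key(child_a), key(child_b)))
            return child_a, child_b
    return a, b  # assumed fallback: keep the parents if every attempt is tabu


parent_1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
parent_2 = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
tabu_list: Set[Key] = set()
child_1, child_2 = tabu_crossover(parent_1, parent_2, tabu_list, random.Random(0))
print(child_1, child_2, len(tabu_list))
```

Checking for cycles after each single-entry swap and undoing only the offending row mirrors the row-skipping behaviour described for Fig. 5, at the cost of one cycle test per row.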
(3) Tabu mutation operation
The mutation operation first selects an individual from the population at random and then, with a given mutation probability P_m, applies an edge addition, deletion, or reversal to the chosen network individual in order to increase the diversity of the population. In Fig. 7, the mutation on the matrix deletes the edge between node A and node C in the corresponding network. Provided that no cycle is created, the mutation that increases the network score the most is selected to obtain a better network structure. Ordinary mutation is highly random and easily destroys network individuals with good fitness; it climbs poorly and is easily trapped in local optima. The present invention uses the memory function of tabu search: when a mutation produces an inferior solution, that solution is stored in the tabu list and the search then continues from the current individual. This avoids revisiting the same solutions, helps escape local optima, improves the hill-climbing ability of the mutation operation, and thus helps find better network individuals quickly.
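The tabu mutation can be sketched as follows. The score function is passed in as a parameter; the toy score used here (negative Hamming distance to a target graph) merely stands in for the BIC score and is purely an assumption for the demonstration, as are all function names.

```python
import random
from typing import Callable, Iterator, List, Set, Tuple

Matrix = List[List[int]]
Key = Tuple[Tuple[int, ...], ...]


def has_cycle(adj: Matrix) -> bool:
    """Kahn's algorithm: acyclic iff every node can be removed in order."""
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    removed = 0
    while queue:
        i = queue.pop()
        removed += 1
        for j in range(n):
            if adj[i][j]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return removed < n


def key(adj: Matrix) -> Key:
    return tuple(map(tuple, adj))


def neighbors(adj: Matrix) -> Iterator[Matrix]:
    """All networks reachable by one edge addition, deletion, or reversal."""
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            new = [row[:] for row in adj]
            if adj[i][j]:
                new[i][j] = 0
                yield new  # deletion of edge i -> j
                rev = [row[:] for row in new]
                rev[j][i] = 1
                yield rev  # reversal of edge i -> j
            elif not adj[j][i]:
                new[i][j] = 1
                yield new  # addition of edge i -> j


def tabu_mutate(adj: Matrix, score: Callable[[Matrix], float], tabu: Set[Key],
                rng: random.Random, p_m: float = 0.1) -> Matrix:
    """With probability p_m, move to the best acyclic non-tabu neighbor."""
    if rng.random() >= p_m:
        return adj
    candidates = [c for c in neighbors(adj) if not has_cycle(c) and key(c) not in tabu]
    if not candidates:
        return adj
    best = max(candidates, key=score)
    if score(best) > score(adj):
        return best
    tabu.add(key(best))  # inferior solution: remember it, keep the current network
    return adj


target = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]


def toy_score(adj: Matrix) -> float:
    """Stand-in score: negative Hamming distance to the target graph."""
    return -sum(x != y for ra, rt in zip(adj, target) for x, y in zip(ra, rt))


start = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
print(tabu_mutate(start, toy_score, set(), random.Random(0), p_m=1.0))
```

Recording a rejected candidate in the tabu list means the next mutation of the same individual evaluates a different neighbor instead of retrying the same inferior move.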
5. Fitness function evaluation
The fitness function is the criterion for judging the quality of a network individual; it determines which good individuals are retained and which poor individuals are eliminated. In the genetic tabu search of the invention, fitness is computed by scoring the Bayesian network and then serves as the basis of the evolutionary search. The quality of a network is judged with the commonly used BIC scoring method.
Bayesian scoring selects the Bayesian network structure with the maximum posterior probability, given prior knowledge and sample data. Let D denote the sample data and G the Bayesian network structure; Bayes' formula gives formula (7), where P(G) is the prior knowledge of the network structure.
$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)} \tag{7}$$
Let θ_G denote the parameters of the network structure. Expanding the likelihood term P(D | G) in formula (7) by marginal integration gives formula (8).
$$P(D \mid G) = \int P(D \mid G, \theta_G)\, P(\theta_G \mid G)\, d\theta_G \tag{8}$$
This leads to the BIC scoring method of the Bayesian network, shown in formula (9).
$$BIC(G \mid D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} m_{ijk} \log \frac{m_{ijk}}{m_{ij}} - \frac{\log m}{2} \sum_{i=1}^{n} q_i (r_i - 1) \tag{9}$$
Here m is the total number of samples, n is the number of variables, r_i is the number of values of the i-th variable, q_i is the number of value combinations of the parents of the i-th variable, m_ijk is the number of samples in which the i-th variable takes its k-th value while its parents take their j-th combination, and m_ij = Σ_k m_ijk. As with the mutual information, the BIC score can be computed quickly by applying logical AND operations to the Boolean data in binary form.
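The BIC score can be sketched as follows, using the standard form with a log-likelihood term and a penalty of (log m)/2 per free parameter. For brevity the counts m_ijk are gathered with `collections.Counter` rather than with the AND-based bit counting the text describes; the function name, the natural logarithm, and the toy data are assumptions.

```python
from collections import Counter
from math import log
from typing import List, Sequence


def bic_score(data: Sequence[Sequence[int]], adj: List[List[int]],
              arity: Sequence[int]) -> float:
    """BIC of a discrete Bayesian network.

    data  : one row per sample, one column per variable
    adj   : adj[p][i] = 1 when variable p is a parent of variable i
    arity : number of values r_i of each variable
    """
    m, n = len(data), len(adj)
    total = 0.0
    for i in range(n):
        parents = [p for p in range(n) if adj[p][i]]
        q_i = 1
        for p in parents:
            q_i *= arity[p]  # number of parent value combinations
        m_ij = Counter(tuple(row[p] for p in parents) for row in data)
        m_ijk = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
        # Log-likelihood term: sum of m_ijk * log(m_ijk / m_ij) over observed cells.
        total += sum(c * log(c / m_ij[j]) for (j, _), c in m_ijk.items())
        # Penalty term: (log m / 2) * q_i * (r_i - 1).
        total -= 0.5 * log(m) * q_i * (arity[i] - 1)
    return total


# Toy data in which variable 1 is an exact copy of variable 0.
rows = [(0, 0), (1, 1), (0, 0), (1, 1)]
empty = [[0, 0], [0, 0]]
edge = [[0, 1], [0, 0]]
print(bic_score(rows, empty, [2, 2]), bic_score(rows, edge, [2, 2]))
```

On the toy data the network with the edge scores higher than the empty network, because the gain in likelihood outweighs the penalty for the extra parameters.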
6. Algorithm termination
The algorithm of the invention terminates when the maximum number of iterations is reached or when the score of the best individual has remained unchanged for k consecutive generations. Otherwise, the new generation obtained after selection, crossover, and mutation replaces the previous generation, and the iteration continues.
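The stopping rule can be sketched as a generic loop. The names `evolve`, `next_generation`, and `patience` (playing the role of k) are hypothetical, and the toy population of integers only exercises the control flow; it does not model networks.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def evolve(population: List[T], score: Callable[[T], float],
           next_generation: Callable[[List[T]], List[T]],
           max_iter: int = 100, patience: int = 10) -> Tuple[T, int]:
    """Iterate until max_iter or until the best score is unchanged for
    `patience` consecutive generations; return the best individual and
    the number of generations actually run."""
    best = max(population, key=score)
    best_score = score(best)
    stale = 0
    iterations = 0
    while iterations < max_iter and stale < patience:
        population = next_generation(population)  # selection, crossover, mutation
        iterations += 1
        candidate = max(population, key=score)
        if score(candidate) > best_score:
            best, best_score, stale = candidate, score(candidate), 0
        else:
            stale += 1
    return best, iterations


# Toy run: individuals are integers that grow by 1 per generation up to 5.
best, generations = evolve(
    [0],
    score=lambda x: float(x),
    next_generation=lambda pop: [min(x + 1, 5) for x in pop],
    max_iter=100,
    patience=3,
)
print(best, generations)
```

In the toy run the score improves for five generations and then stays flat for three, so the loop stops well before the iteration cap.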
7. Experiments demonstrate the effectiveness of the epistatic locus mining method based on genetic tabu search and Bayesian networks by comparing the accuracy and efficiency of 2-locus and 3-locus epistasis detection.
The following is an embodiment in which the method of the invention is used to mine epistatic loci on data sets generated by the GAMETES software; the experiments show in detail how efficiently the method mines epistatic loci. GAMETES is a tool commonly used in the field to generate simulated epistasis data[4]. It generates simulated epistasis data quickly and accurately, and by changing its parameters it can produce specific 2-locus or multi-locus epistasis models. The configurable parameters include the number of SNP loci, the heritability (h2), the minor allele frequency (MAF), and the prevalence. In a generated data file, the first row contains the locus names and the last column is the Class label, where 1 denotes a case and 0 denotes a control. Genotype data are expressed as 0, 1, 2, where 0 denotes the homozygous common genotype, 1 denotes the heterozygote, and 2 denotes the homozygous rare genotype.
The Bayesian network epistatic locus mining method optimized by genetic tabu search is denoted Epi-GTBN. The epistasis detection methods compared in the experiments are BEAM, AntEpiSeeker, SNPRuler, MDR, BOOST, and the Bayesian network learning method hill-climbing. Different data sets were generated with GAMETES by setting different heritabilities h2 (0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequencies MAF (0.1, 0.2, 0.3, 0.4); the data set for each parameter setting contains 100 files. The accuracy of epistatic locus mining is computed with formula (10), where Num_edge is the number of data sets in which the target epistatic loci are detected.
Test 1.2 site epistasis Detection accuracies and efficiency comparative
In this experiment, the different heritability h of setting are compared2With epistasis position in the case of minimum gene frequency MAF The accuracy rate that point excavates.Fig. 8 and Fig. 9 gives the accuracy rate of 2 site epistasis of distinct methods excavation and efficiency compares.
By Fig. 8 as it can be seen that in different h2Under MAF value condition, 2 sites of BEAM and hill-climbing method Epistasis Detection accuracy will be far below other 5 kinds of methods.The inspection of Epi-GTBN, MDR, BOOST and AntEpiSeeker method It is maximum to survey accuracy rate, the detection accuracy of substantially 100%, SNPRuler method will be slightly lower than other 4 kinds of methods.
Figure 9 shows that AntEpiSeeker takes the longest for 2-site epistasis detection, far longer than the other six methods. BEAM, BOOST, and SNPRuler take the least time, while MDR, hill-climbing, and Epi-GTBN fall in between; among these three, Epi-GTBN takes less time than MDR and hill-climbing. In Epi-GTBN, the genotype data are converted to Boolean data in binary representation, so the conditional mutual information between nodes and the BIC score of a network can be computed quickly with logical AND operations. This saves a large amount of computation time when constructing the initial network and when scoring networks.
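The Boolean encoding mentioned above can be sketched as follows. This is only an illustrative reading of the idea, assuming one Boolean mask per genotype value and NumPy arrays as the storage; the function names encode_boolean and joint_counts are hypothetical and the actual data layout used by Epi-GTBN is not given in the text.

```python
import numpy as np

def encode_boolean(genotypes):
    """Split a 0/1/2 genotype vector into three Boolean masks, one per genotype."""
    g = np.asarray(genotypes)
    return [g == value for value in (0, 1, 2)]

def joint_counts(masks_a, masks_b):
    """3x3 contingency table of two SNPs, computed with logical AND only."""
    return np.array(
        [[int(np.count_nonzero(a & b)) for b in masks_b] for a in masks_a]
    )

# Six samples genotyped at two sites.
snp1 = [0, 1, 2, 0, 1, 0]
snp2 = [0, 0, 1, 2, 1, 0]
table = joint_counts(encode_boolean(snp1), encode_boolean(snp2))
```

Each cell of the table is the number of samples that carry a given genotype combination; these counts are the inputs to both the entropy terms and the BIC score, which is why a fast AND-and-count step can reduce the overall running time.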
In short, the 2-site epistasis detection accuracy of MDR, BOOST, and AntEpiSeeker is essentially the same as that of Epi-GTBN, approximately 100%. However, MDR and AntEpiSeeker take significantly more detection time than Epi-GTBN. In addition, the parameter setting of AntEpiSeeker is relatively complicated, and its results depend closely on that setting. BOOST can only be used for 2-site epistasis detection and cannot detect epistasis among more than two sites. The Epi-GTBN method of the invention therefore achieves good detection accuracy without compromising detection efficiency.
Experiment 2. Comparison of 3-site epistasis detection accuracy
This experiment compares the accuracy of 3-site epistasis mining of the different methods under different heritabilities h² and minor allele frequencies (MAF), as shown in Figure 10.
The experimental results show that the 3-site epistasis detection accuracy in Figure 10 is essentially consistent with the 2-site results in Figure 8. BEAM and hill-climbing have the lowest detection accuracy. MDR and Epi-GTBN have the highest detection accuracy, approximately 100%, which is higher than that of SNPRuler.
It should be understood that those of ordinary skill in the art may make modifications or changes in light of the above description, and all such modifications and changes shall fall within the scope of protection of the appended claims of the invention.

Claims (7)

1. A method for mining epistatic sites based on genetic tabu search and a Bayesian network, characterized by comprising the following steps:
Step 1: represent the SNP genotype data in the form 0, 1, 2, where 0 denotes the homozygous common genotype, 1 the heterozygous genotype, and 2 the homozygous rare genotype; obtain the genetic samples to be mined, divide them, by sample, into the three groups 0, 1, and 2, and convert the genotype data into Boolean data (0, 1) in binary representation;
Step 2: based on information entropy theory, compute the conditional mutual information between every SNP site pair and the phenotypic trait, rank the node pairs by the computed mutual information, take the top-N node pairs, and construct an initial network graph containing the SNP site pairs;
Step 3: subject to not creating a cycle, generate the next network individual from the initial network individual by randomly adding an edge, deleting an edge, or reversing an edge, and then generate a new network individual from that next network individual; repeat this operation until the number of network individuals reaches the initial population size;
Step 4: evolve the initial network population obtained in step 3, which consists of Bayesian networks containing SNP sites, using the three operations of a genetic algorithm optimized by tabu search (selection, crossover, and mutation) together with the scoring mechanism of the Bayesian network, and find the optimal network structure, thereby obtaining the epistatic gene sites that affect the phenotypic trait.
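The population-generation procedure of step 3 can be sketched as below. This is a minimal sketch under stated assumptions: networks are represented as sets of directed edges, the cycle test uses Kahn's algorithm, and the retry cap of 100 is arbitrary; none of these choices, nor the function names, are specified in the claim.

```python
import random

def has_cycle(nodes, edges):
    """Return True if the directed graph contains a cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    for _, target in edges:
        indegree[target] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        current = queue.pop()
        visited += 1
        for source, target in edges:
            if source == current:
                indegree[target] -= 1
                if indegree[target] == 0:
                    queue.append(target)
    return visited != len(nodes)

def random_neighbor(nodes, edges, rng):
    """Apply one random add/delete/reverse-edge move that keeps the graph acyclic."""
    edges = set(edges)
    for _ in range(100):  # bounded retries; the cap is an arbitrary assumption
        operation = rng.choice(["add", "delete", "reverse"])
        candidate = set(edges)
        if operation == "add":
            u, v = rng.sample(nodes, 2)
            if (u, v) in candidate or (v, u) in candidate:
                continue
            candidate.add((u, v))
        elif not candidate:
            continue
        else:
            u, v = rng.choice(sorted(candidate))
            candidate.remove((u, v))
            if operation == "reverse":
                candidate.add((v, u))
        if not has_cycle(nodes, candidate):
            return candidate
    return edges  # fall back to the unchanged network

def initial_population(nodes, initial_edges, size, seed=0):
    """Chain random moves from the initial network until the population is full."""
    rng = random.Random(seed)
    population = [set(initial_edges)]
    while len(population) < size:
        population.append(random_neighbor(nodes, population[-1], rng))
    return population
```

Each new individual is derived from the previous one, as the claim describes, so the population drifts gradually away from the initial network rather than sampling structures independently.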
2. The method for mining epistatic sites based on genetic tabu search and a Bayesian network according to claim 1, characterized in that the method further comprises evaluating the constructed network:
Step 5: use a fitness function as the criterion for judging the quality of a network individual, and evaluate the quality of the network using BIC scoring.
3. The method for mining epistatic sites based on genetic tabu search and a Bayesian network according to claim 2, characterized in that, in step 2 and step 5, the genotype data are converted into Boolean data in binary representation and the binary data are operated on directly with the logical AND operation, so that the conditional mutual information between nodes and the BIC score of the Bayesian network are computed rapidly.
4. The method for mining epistatic sites based on genetic tabu search and a Bayesian network according to claim 1, characterized in that the initial network graph containing SNP site pairs is constructed in step 2 as follows:
Step 2.1: set the number nlocus of epistatic gene sites to be mined, enumerate the combinations of nlocus sites among all sites, and, based on information entropy theory, use the logical AND operation to rapidly compute the conditional mutual information between each combination of nlocus sites and the phenotypic trait;
Step 2.2: rank the node pairs by the computed conditional mutual information and take the top-N node pairs, where the value of N is determined from experimental results; for each SNP site not contained in the top-N node pairs, select the node pair in which it first appears and insert that pair into the top-N node pairs;
Step 2.3: treat all SNP sites as nodes of the network and, according to the top-N node pairs obtained in step 2.2, insert the edge corresponding to each node pair into the network to construct the initial network graph.
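The ranking and coverage rule of step 2.2 can be illustrated with a short sketch. It assumes nlocus = 2, that the scores have already been computed, and that "first appears" refers to the position in the ranked list; the function name select_pairs and the site names A to D are hypothetical.

```python
def select_pairs(scored_pairs, top_n):
    """Keep the top-N pairs, then add the first-ranked pair of every uncovered SNP.

    scored_pairs: list of ((snp_a, snp_b), score) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    selected = [pair for pair, _ in ranked[:top_n]]
    covered = {snp for pair in selected for snp in pair}
    for pair, _ in ranked[top_n:]:
        # A pair is added only if it brings in an SNP that is not yet covered.
        if any(snp not in covered for snp in pair):
            selected.append(pair)
            covered.update(pair)
    return selected

scored = [(("A", "B"), 0.9), (("A", "C"), 0.5), (("B", "C"), 0.4), (("C", "D"), 0.1)]
pairs = select_pairs(scored, top_n=1)
```

With top_n = 1 only the pair (A, B) is taken initially; C and D are then brought in through the highest-ranked pairs that contain them, so every SNP site ends up as a node with at least one edge in the initial graph.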
5. The method for mining epistatic sites based on genetic tabu search and a Bayesian network according to claim 1, characterized in that the evolution in step 4 is performed as follows:
Step 4.1, selection: score the networks using the scoring method of the Bayesian network, place the optimal Bayesian network individual with the highest score at the first position of the population, and select the network individuals that enter the next generation by roulette-wheel selection;
Step 4.2, tabu crossover: evolve two networks using a multi-row crossover method and check whether a cycle is produced; to avoid the premature convergence that ordinary crossover can cause, use the memory function of tabu search to compare the offspring network produced by crossover with the individuals in the tabu list; if the offspring network individual does not belong to the tabu list, it enters the next generation and is stored in the tabu list; if it already belongs to the tabu list, it is discarded and the tabu crossover is repeated until the offspring produced does not belong to the tabu list;
Step 4.3, tabu mutation: apply an add-edge, delete-edge, or reverse-edge operation to a network individual with a given mutation probability, and select the variant that increases the network score the most, so as to obtain an optimized network structure; using the memory function of tabu search, store in the tabu list the inferior solutions produced by mutation, which fail to improve the current fitness value.
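The selection and tabu-crossover steps can be sketched as follows. This is a simplified illustration, not the claimed procedure: the crossover operator edge_mix is a hypothetical stand-in for the multi-row crossover, the cycle check is omitted for brevity, scores are shifted to be positive because BIC scores are typically negative, and the retry limit max_tries is an assumption.

```python
import random

def roulette_select(population, scores, rng):
    """Pick one individual with probability proportional to its shifted score."""
    lowest = min(scores)
    weights = [s - lowest + 1e-9 for s in scores]  # shift so all weights are positive
    return rng.choices(population, weights=weights, k=1)[0]

def tabu_crossover(parent_a, parent_b, crossover, tabu, rng, max_tries=50):
    """Repeat crossover until the child is not in the tabu list, then record it."""
    for _ in range(max_tries):
        child = frozenset(crossover(parent_a, parent_b, rng))
        if child not in tabu:
            tabu.add(child)
            return child
    return None  # every attempt produced an already-seen network

def edge_mix(a, b, rng):
    """Hypothetical crossover: keep each edge of either parent with probability 1/2."""
    return {edge for edge in sorted(a | b) if rng.random() < 0.5}

rng = random.Random(1)
tabu = set()
parent_a = {("S1", "C")}
parent_b = {("S2", "C")}
results = [tabu_crossover(parent_a, parent_b, edge_mix, tabu, rng) for _ in range(6)]
```

Because every accepted child is recorded, the same network structure is never accepted twice, which is the mechanism the claim relies on to avoid premature convergence.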
6. The method for mining epistatic sites based on genetic tabu search and a Bayesian network according to claim 4, characterized in that the calculation in step 2.1 is as follows:
When mining k epistatic SNP sites that affect the phenotype Class, I(Class | SNP_1, ..., SNP_k) denotes the conditional mutual information between the k epistatic SNP sites and the phenotype Class, computed as:
I(Class | SNP_1, ..., SNP_k) = H(Class) + H(SNP_1, ..., SNP_k) - H(Class, SNP_1, ..., SNP_k)
The information entropy H(Class) of Class is computed as:
The information entropy H(SNP_1, ..., SNP_k) of the k SNP sites is computed as:
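The quantity defined in this claim can be computed as in the sketch below. The entropy formulas themselves do not appear in the text, so the sketch assumes the standard Shannon entropy of the empirical distribution, in bits; the function names entropy and information are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy (bits) of the empirical distribution over hashable rows."""
    counts = Counter(rows)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def information(class_labels, snp_columns):
    """H(Class) + H(SNP_1..SNP_k) - H(Class, SNP_1..SNP_k), as in the claim."""
    snp_rows = list(zip(*snp_columns))
    joint_rows = [(c,) + row for c, row in zip(class_labels, snp_rows)]
    return entropy(class_labels) + entropy(snp_rows) - entropy(joint_rows)

# A purely epistatic (XOR) toy model: neither site is informative alone.
snp1 = [0, 0, 1, 1]
snp2 = [0, 1, 0, 1]
cls = [0, 1, 1, 0]
pair_score = information(cls, [snp1, snp2])
single_score = information(cls, [snp1])
```

In this toy example the pair of sites carries one full bit of information about the class while a single site carries none, which is the kind of signal the ranking in step 2 is designed to pick up.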
7. The method for mining epistatic sites based on genetic tabu search and a Bayesian network according to claim 2, characterized in that the BIC score is calculated as follows:
Let D denote the sample data and G the Bayesian network structure; by Bayes' formula:
P(G | D) = P(D | G) P(G) / P(D)
where P(G) denotes the prior knowledge of the network structure;
Let θ_G denote the parameters of the network structure; expanding the above by marginal integration gives:
P(D | G) = ∫ P(D | G, θ_G) P(θ_G | G) dθ_G
The BIC scoring method for the Bayesian network is then obtained:
where m denotes the total number of samples, n the number of variables, r_i the number of values of the i-th variable, q_i the number of value combinations of the parent variables of the i-th variable, and m_ijk the number of samples in which the i-th variable takes its k-th value and its parent variables take their j-th combination.
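The BIC formula itself does not appear in the text. As a hedged illustration, the sketch below assumes the commonly used form for discrete Bayesian networks, namely the sum over i, j, k of m_ijk * log(m_ijk / m_ij) minus (log m / 2) times the sum over i of q_i * (r_i - 1), with m_ij the sum of m_ijk over k. Whether the patent uses exactly this form is an assumption, and the function name bic_score and its argument layout are hypothetical.

```python
from collections import Counter
from math import log

def bic_score(data, parents, arity):
    """Assumed standard BIC of a discrete Bayesian network.

    data: list of dicts mapping variable -> observed value
    parents: dict mapping variable -> tuple of parent variables
    arity: dict mapping variable -> number of values r_i
    """
    m = len(data)
    score = 0.0
    for var, pa in parents.items():
        # m_ijk: counts per (parent configuration j, value k); m_ij: counts per j.
        m_ijk = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        m_ij = Counter(tuple(row[p] for p in pa) for row in data)
        score += sum(n * log(n / m_ij[j]) for (j, _), n in m_ijk.items())
        q_i = 1
        for p in pa:
            q_i *= arity[p]
        score -= 0.5 * log(m) * q_i * (arity[var] - 1)  # complexity penalty
    return score

# XOR toy data: Class depends jointly on S1 and S2.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 2
data = [{"S1": a, "S2": b, "Class": c} for a, b, c in rows]
arity = {"S1": 2, "S2": 2, "Class": 2}
with_parents = bic_score(data, {"S1": (), "S2": (), "Class": ("S1", "S2")}, arity)
without_parents = bic_score(data, {"S1": (), "S2": (), "Class": ()}, arity)
```

On this toy data the structure in which Class has both sites as parents scores higher than the empty structure, because the gain in likelihood outweighs the larger penalty term.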
CN201811287261.0A 2018-10-31 2018-10-31 Genetic taboo and Bayesian network-based epistatic site mining method Active CN109448794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811287261.0A CN109448794B (en) 2018-10-31 2018-10-31 Genetic taboo and Bayesian network-based epistatic site mining method

Publications (2)

Publication Number Publication Date
CN109448794A true CN109448794A (en) 2019-03-08
CN109448794B CN109448794B (en) 2021-04-30

Family

ID=65549784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811287261.0A Active CN109448794B (en) 2018-10-31 2018-10-31 Genetic taboo and Bayesian network-based epistatic site mining method

Country Status (1)

Country Link
CN (1) CN109448794B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060154290A1 (en) * 2003-03-07 2006-07-13 Illumigen Biosciences, Inc. Method and apparatus for pattern identification in diploid DNA sequence data
US20120036096A1 (en) * 2010-08-05 2012-02-09 King Fahd University Of Petroleum And Minerals Method of generating an integrated fuzzy-based guidance law for aerodynamic missiles
US20140283152A1 (en) * 2013-03-14 2014-09-18 University Of Florida Research Foundation, Inc. Method for artificial selection
CN103632067A (en) * 2013-11-07 2014-03-12 浙江大学 Seed quantitative trait locus positioning method based on mixed linear model
CN107590364A (en) * 2017-08-29 2018-01-16 集美大学 A kind of quick bayes method of new estimation genomic breeding value

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CARLA CHIA-MING CHEN ET AL.: "Methods for Identifying SNP Interactions: A Review on Variations of Logic Regression,Random Forest and Bayesian Logistic Regression", 《IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114207620A (en) * 2019-07-29 2022-03-18 国立研究开发法人理化学研究所 Data interpretation device, method and program, data integration device, method and program, and digital city construction system
US11669541B2 (en) 2019-07-29 2023-06-06 Riken Data interpretation apparatus, method, and program, data integration apparatus, method, and program, and digital city establishing system
CN114207620B (en) * 2019-07-29 2023-08-15 国立研究开发法人理化学研究所 Data interpretation device, method, storage medium, data integration device, method, storage medium, and digital city construction system
CN110570909A (en) * 2019-09-11 2019-12-13 华中农业大学 Method for mining epistatic sites of artificial bee colony optimized Bayesian network
CN110570909B (en) * 2019-09-11 2023-03-03 华中农业大学 Method for mining epistatic sites of artificial bee colony optimized Bayesian network
CN111180012A (en) * 2019-12-27 2020-05-19 哈尔滨工业大学 Gene identification method based on empirical Bayes and Mendelian randomized fusion
CN111833967A (en) * 2020-07-10 2020-10-27 华中农业大学 K-tree-based epistatic site mining method for optimizing Bayesian network
CN111833967B (en) * 2020-07-10 2022-05-20 华中农业大学 K-tree-based epistatic site mining method for optimizing Bayesian network
TWI741760B (en) * 2020-08-27 2021-10-01 財團法人工業技術研究院 Learning based resource allocation method, learning based resource allocation system and user interface
CN112447263A (en) * 2020-11-22 2021-03-05 西安邮电大学 Multitask high-order SNP upper detection method, system, storage medium and equipment
CN112447263B (en) * 2020-11-22 2023-12-26 西安邮电大学 Multi-task high-order SNP upper detection method, system, storage medium and equipment

Also Published As

Publication number Publication date
CN109448794B (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN109448794A (en) A kind of epistasis site method for digging based on heredity taboo and Bayesian network
Ravinet et al. Interpreting the genomic landscape of speciation: a road map for finding barriers to gene flow
Zhu et al. A novel adaptive hybrid crossover operator for multiobjective evolutionary algorithm
Morrison et al. Molecular homology and multiple-sequence alignment: an analysis of concepts and practice
JPH04503876A (en) Genetic synthesis of neural networks
CN110853756B (en) Esophagus cancer risk prediction method based on SOM neural network and SVM
CN106599936A (en) Characteristic selection method based on binary ant colony algorithm and system thereof
Martín-Hernanz et al. Maximize resolution or minimize error? Using genotyping-by-sequencing to investigate the recent diversification of Helianthemum (Cistaceae)
CN110111840A (en) A kind of somatic mutation detection method
WO2023197718A1 (en) Circular rna ires prediction method
CN109801681B (en) SNP (Single nucleotide polymorphism) selection method based on improved fuzzy clustering algorithm
CN108509764B (en) Ancient organism pedigree evolution analysis method based on genetic attribute reduction
Motsinger et al. Comparison of neural network optimization approaches for studies of human genetics
Wang et al. Interpretation of manhattan plots and other outputs of genome-wide association studies
CN110879778A (en) Novel dynamic feedback and improved patch evaluation software automatic restoration method
Alapati Discrete optimization of truss structure using genetic algorithm
CN110400597A (en) A kind of genetype for predicting method based on deep learning
CN114742173A (en) Transformer fault diagnosis method and system based on neural network
CN111833964A (en) Method for mining superior locus of Bayesian network optimized by integer linear programming
CN114282130A (en) Fraud website identification method based on selection of mutant moth flame optimization algorithm
Ni et al. New insights into trait introgression with the look-ahead intercrossing strategy
CN110533186A (en) Appraisal procedure, device, equipment and the readable storage medium storing program for executing of crowdsourcing pricing structure
Zhang et al. Aligning multiple protein sequence by an improved genetic algorithm
Davenport et al. Using bioinformatics to analyse germplasm collections
CN114628031B (en) Multi-mode optimization method for detecting dynamic network biomarkers of cancer individual patients

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant