CN109448794A - Epistatic locus mining method based on genetic tabu search and Bayesian network
- Publication number
- CN109448794A, CN201811287261.0A
- Authority
- CN
- China
- Prior art keywords
- network
- snp
- site
- individual
- taboo
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Genetics & Genomics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Physiology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an epistatic locus mining method based on genetic tabu search and a Bayesian network, comprising: (1) converting genotype data into Boolean data in binary representation; (2) rapidly computing the conditional mutual information between every SNP locus pair and the phenotype using logical AND operations, taking the top-N node pairs, and building an initial network graph containing the SNP loci; (3) starting from the initial network individual, generating new individuals by randomly adding, deleting, and reversing edges until the number of network individuals reaches the population size; (4) evolving the Bayesian network structure through the three operations of the genetic algorithm and the scoring mechanism of the Bayesian network to find the optimal network structure, thereby quickly and accurately obtaining the epistatic loci that influence the phenotypic trait. The invention can help biologists identify the epistatic loci that influence a particular phenotypic trait, support gene function discovery, and provide a reference for dissecting the genetic basis of complex quantitative traits in different species.
Description
Technical field
The present invention relates to the field of bioinformatics, and in particular to an epistatic locus mining method based on genetic tabu search and a Bayesian network.
Background
With the continuing improvement of living standards and medical care, diseases determined solely by environmental factors (such as infectious diseases and malnutrition) have largely been brought under control, and complex diseases and Mendelian diseases have become the main diseases affecting human health. A Mendelian disease is a single-gene disorder whose inheritance follows Mendel's laws; researchers have identified the relevant genes by positional cloning and have largely clarified the mode of inheritance. Complex diseases account for roughly 80% or more of human diseases and cause great harm to human health. Common chronic conditions such as asthma, cancer, diabetes, hypertension, senile dementia, rheumatoid arthritis, schizophrenia, heart disease, cardiovascular disease, obesity, and tumors are collectively called complex diseases. Their causes are highly complex and involve the environment, genes, and the interactions between them. It is therefore urgent to clarify the pathogenesis and genetic mechanisms of complex diseases, in order to provide a scientific basis for their diagnosis and treatment and to safeguard human health; this work has important research significance.
From the perspective of genetics, the genetic factors that determine complex biological traits fall into three groups: main effects of genes, gene-gene interactions, and gene-environment interactions. Numerous biological experiments have found that gene-gene interaction is the main factor controlling complex traits. Gene-gene interaction, also called epistasis, manifests mainly as interaction between SNPs. Meanwhile, the rapid development of high-throughput technology has produced massive amounts of biological data. Using genome-wide association study (GWAS) methods to screen these data for SNPs significantly associated with disease across the genome, and thereby clarify the genetic mechanisms of complex diseases, is a hot topic in current bioinformatics research. GWAS methods focus mainly on detecting main-effect genes. Although earlier studies found many phenotype-associated loci in this way, those loci explain only a small fraction of the heritable variation. One of the most important reasons is that these studies ignore gene-gene interaction, that is, epistasis. Mining epistatic loci is therefore a principal means of explaining the genetic mechanisms of complex diseases. However, current epistasis detection methods still suffer from heavy computation, high algorithmic complexity, low efficiency, and high false-positive rates, so they cannot accurately and efficiently detect disease-associated SNP loci and their combinations. Proposing a more effective and more accurate genome-wide epistasis detection algorithm is therefore of great research significance and plays an important role in discovering the pathogenic mechanisms of complex diseases and in their diagnosis, treatment, and prevention.
Summary of the invention
The technical problem to be solved by the present invention is to provide, in view of the defects of the prior art, an epistatic locus mining method based on genetic tabu search and a Bayesian network.
The technical solution adopted by the present invention to solve this technical problem is as follows.
The present invention provides an epistatic locus mining method based on genetic tabu search and a Bayesian network, comprising the following steps:
Step 1: For SNP genotype data, express each genotype as 0, 1, or 2, where 0 denotes the homozygous common genotype, 1 the heterozygote, and 2 the homozygous rare genotype. Obtain the gene samples to be mined, partition them sample by sample into the three groups 0, 1, and 2, and convert the genotype data into Boolean 0/1 data in binary form.
Step 2: Based on information entropy theory, compute the conditional mutual information between every SNP locus pair and the phenotypic trait, sort the node pairs by the computed mutual information, take the top-N node pairs, and build an initial network graph containing the SNP locus pairs.
Step 3: Without creating cycles, generate the next network individual from the initial network individual by randomly adding, deleting, or reversing an edge, and then generate a new network individual from that one. Repeat this operation until the number of network individuals reaches the initial population size.
Step 4: Using the three operations of a genetic algorithm optimized by tabu search (selection, crossover, and mutation) together with the scoring mechanism of the Bayesian network, evolve the initial network population obtained in step 3, which consists of Bayesian networks containing the SNP loci, to find the optimal network structure and thereby obtain the epistatic loci that influence the phenotypic trait.
Further, the method of the invention also includes a step for evaluating the constructed networks:
Step 5: Use a fitness function as the criterion for judging the quality of a network individual, evaluating each network with the BIC score.
Further, in step 2 and step 5 of the invention, because the genotype data have been converted into Boolean data in binary form, the binary data are operated on directly with logical AND, so that the conditional mutual information between nodes and the BIC score of the Bayesian network are computed rapidly.
Further, the initial network graph containing the SNP locus pairs is built in step 2 of the invention as follows:
Step 2.1: Set the number of epistatic loci to be mined, nlocus. Enumerate the combinations of nlocus loci among all loci and, based on information entropy theory, rapidly compute with logical AND operations the conditional mutual information between each combination of nlocus loci and the phenotypic trait.
Step 2.2: Sort the node pairs by the computed conditional mutual information and take the top-N node pairs, where N is determined from experimental results. For any SNP locus not contained in the top-N node pairs, select the node pair in which it first appears in the sorted list and insert that pair into the top-N node pairs.
Step 2.3: Treat all SNP loci as network nodes and, for each of the top-N node pairs obtained in step 2.2, insert the corresponding edge between the two nodes to build the initial network graph.
Further, the evolution in step 4 of the invention proceeds as follows:
Step 4.1, selection: Score each network with the Bayesian network scoring method, place the highest-scoring Bayesian network individual at the first position of the population, and choose the network individuals that enter the next generation by roulette-wheel selection.
Step 4.2, tabu crossover: Evolve two networks by multi-column crossover and check whether a cycle is created. To avoid the premature convergence that ordinary crossover tends to cause, use the memory of tabu search: after crossover, compare the offspring network with the individuals in the tabu list. If the offspring is not in the tabu list, it enters the next generation and is stored in the tabu list. If it is already in the tabu list, discard it and repeat the tabu crossover until the offspring produced is not in the tabu list.
Step 4.3, tabu mutation: With a certain mutation probability, apply an add-edge, delete-edge, or reverse-edge operation to a network individual, choosing the mutation that increases the network score the most so as to obtain an optimized network structure. Using the memory of tabu search, store in the tabu list the inferior solutions produced by mutation that fail to improve the current fitness.
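Further, the calculation in step 2.1 of the invention is as follows:
When mining k epistatic SNP loci that influence the phenotype Class, let I(Class | SNP_1, ..., SNP_k) denote the conditional mutual information between the k epistatic SNP loci and Class. It is computed as:
$$I(\mathrm{Class} \mid \mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = H(\mathrm{Class}) + H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) - H(\mathrm{Class},\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k)$$
The information entropy H(Class) of Class is computed as:
$$H(\mathrm{Class}) = -\sum_{c} p(c)\,\log p(c)$$
The information entropy H(SNP_1, ..., SNP_k) of the k SNP loci is computed as:
$$H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = -\sum_{s_1,\ldots,s_k} p(s_1,\ldots,s_k)\,\log p(s_1,\ldots,s_k)$$
Here c ranges over the values of Class and each s_i ranges over the genotypes 0, 1, 2 of SNP_i.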
Further, the BIC score of the invention is computed as follows:
Let D denote the sample data and G the Bayesian network structure. By Bayes' theorem:
$$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$
where P(G) is the prior knowledge of the network structure.
Let θ_G denote the parameters of the network structure. Expanding P(D | G) by marginal integration over the parameters gives:
$$P(D \mid G) = \int P(D \mid G, \theta_G)\,P(\theta_G \mid G)\,d\theta_G$$
This leads to the BIC scoring method of the Bayesian network:
$$\mathrm{BIC}(G \mid D) = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i} m_{ijk}\,\log\frac{m_{ijk}}{m_{ij}} - \frac{\log m}{2}\sum_{i=1}^{n} q_i\,(r_i - 1)$$
where m is the total number of samples, n is the number of variables, r_i is the number of values of the i-th variable, q_i is the number of value combinations of the parents of the i-th variable, m_ijk is the number of samples in which the i-th variable takes its k-th value while its parents take their j-th combination, and m_ij = Σ_k m_ijk.
The beneficial effects of the invention are as follows. The epistatic locus mining method based on genetic tabu search and a Bayesian network first converts genotype data into Boolean data in binary representation and uses logical AND operations to rapidly compute the conditional mutual information between every SNP locus pair and the phenotype. After the SNP locus pairs are sorted by the computed mutual information, the top-N node pairs are taken and an initial network graph containing the SNP loci is built. Then, without creating cycles, new individuals are generated from the initial network individual by the three operations of randomly adding, deleting, and reversing an edge until the population size is reached. The memory mechanism of tabu search is applied in the evolutionary operations of the genetic algorithm, and the three genetic operations (selection, crossover, and mutation) together with the BIC scoring method of the Bayesian network are used to evolve the Bayesian network structure and find the optimal network structure. The epistatic loci that influence the phenotypic trait are thereby obtained quickly and accurately, which supports gene function discovery.
Brief description of the drawings
The present invention is further described below with reference to the drawings and embodiments. In the drawings:
Fig. 1 is a flow chart of an embodiment of the invention;
Fig. 2 shows the genotype data representation;
Fig. 3 shows the binary Boolean data representation;
Fig. 4 shows the matrix encoding of a Bayesian network structure;
Fig. 5 shows the procedure that avoids creating cycle structures;
Fig. 6 illustrates a crossover operation that produces no new individual;
Fig. 7 shows the Bayesian network mutation operation;
Fig. 8 compares the 2-locus epistasis detection accuracy of the different methods;
Fig. 9 compares the 2-locus epistasis detection efficiency of the different methods;
Fig. 10 compares the 3-locus epistasis detection accuracy of the different methods.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
1. Genotype data are represented as 0, 1, or 2. For example, for a SNP with alleles A and T, AA is coded as 0, TT as 2, and AT/TA as 1. Fig. 2 shows the genotype data of 11 SNPs (SNP_A to SNP_K) for 4 samples. The last column, Class, is the phenotypic trait, where Class=1 denotes a case (affected) and Class=0 denotes a control. To make the later computation of conditional mutual information between nodes and of the Bayesian network score more efficient, the genotype data are converted into Boolean 0/1 data in binary form. Fig. 3 shows the genotype data of SNP_A to SNP_D from Fig. 2 after conversion to the binary Boolean format.
In Fig. 3, the first 4 columns are the binary representation of genotype 0, the middle 4 columns that of genotype 1, and the last 4 columns that of genotype 2. When a SNP has genotype 0 in a given sample, the bit at the corresponding position in the first 4 columns is set to 1, as shown by the boxed data in Fig. 2 and Fig. 3.
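As a minimal sketch of this encoding, assuming each genotype group is stored as a Python integer used as a bitmask with one bit per sample (the function name `encode_snp` and the sample values are illustrative and not taken from the patent), the conversion could look like this:

```python
from typing import Dict, List


def encode_snp(genotypes: List[int]) -> Dict[int, int]:
    """Return one sample bitmask per genotype value (0, 1, 2).

    Bit i of masks[g] is 1 exactly when sample i carries genotype g.
    """
    masks = {0: 0, 1: 0, 2: 0}
    for i, g in enumerate(genotypes):
        if g not in masks:
            raise ValueError(f"unexpected genotype {g!r} for sample {i}")
        masks[g] |= 1 << i
    return masks


# Hypothetical genotypes of one SNP across four samples (not the values of Fig. 2).
masks = encode_snp([0, 1, 2, 0])
# Every sample falls in exactly one genotype group, so the masks cover all 4 bits.
assert masks[0] | masks[1] | masks[2] == 0b1111
```

Storing each group as a single integer means that a joint count over two SNPs later reduces to one AND followed by a bit count.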
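2. Based on the genotype data in binary Boolean form, information entropy theory is used to compute the conditional mutual information between SNP locus pairs and the phenotypic trait. The node pairs are sorted, the top-N node pairs are taken, and an initial network containing the SNP loci is built.
(1) When mining k epistatic SNP loci that influence the phenotype Class, formula (1) gives the conditional mutual information between the k SNP loci and Class. The information entropy H(Class) in formula (1) is computed with formula (2), and the information entropy H(SNP_1, ..., SNP_k) of the k SNP loci in formula (1) is computed with formula (3).
$$I(\mathrm{Class} \mid \mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = H(\mathrm{Class}) + H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) - H(\mathrm{Class},\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) \qquad (1)$$
$$H(\mathrm{Class}) = -\sum_{c} p(c)\,\log p(c) \qquad (2)$$
$$H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = -\sum_{s_1,\ldots,s_k} p(s_1,\ldots,s_k)\,\log p(s_1,\ldots,s_k) \qquad (3)$$
With genotype data in binary form, the conditional mutual information between nodes can be computed quickly through logical AND operations. For example, for the matrix M_bit of Fig. 3, the conditional mutual information I(Class | SNP_B, SNP_C) is computed with formula (4).
$$I(\mathrm{Class} \mid \mathrm{SNP}_B,\mathrm{SNP}_C) = H(\mathrm{Class}) + H(\mathrm{SNP}_B,\mathrm{SNP}_C) - H(\mathrm{Class},\mathrm{SNP}_B,\mathrm{SNP}_C) \qquad (4)$$
H(SNP_B, SNP_C) in formula (4) is computed with formula (5).
$$H(\mathrm{SNP}_B,\mathrm{SNP}_C) = -\sum_{b=0}^{2}\sum_{c=0}^{2} p(b,c)\,\log p(b,c) \qquad (5)$$
When evaluating formula (5), the binary logical AND operation can be applied directly and the result obtained by counting the character 1 in the resulting binary string. For example, for the underlined binary values in M_bit of Fig. 3, p(1,1) in formula (5) is computed with formula (6), where M_bit(SNP = g) denotes the bit string of a SNP in the columns for genotype g, count_1 is the number of 1 bits, and m is the number of samples.
$$p(1,1) = \frac{\mathrm{count}_1\big(M_{bit}(\mathrm{SNP}_B = 1) \wedge M_{bit}(\mathrm{SNP}_C = 1)\big)}{m} \qquad (6)$$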
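The AND-and-count computation can be sketched as follows. This is an illustrative sketch rather than the patented implementation: it assumes bitmasks stored as Python integers, base-2 logarithms, and hypothetical helper names (`encode`, `joint_cells`, `entropy`, `mutual_information`). The toy data are invented so that the phenotype depends on the two SNPs only jointly.

```python
from itertools import product
from math import log2
from typing import Dict, List


def encode(values: List[int], levels: int) -> Dict[int, int]:
    """Bitmask per value: bit i is set when sample i takes that value."""
    masks = {v: 0 for v in range(levels)}
    for i, v in enumerate(values):
        masks[v] |= 1 << i
    return masks


def joint_cells(*variables: Dict[int, int]) -> List[int]:
    """AND one mask per variable for every combination of values."""
    cells = []
    for combo in product(*(var.values() for var in variables)):
        mask = combo[0]
        for other in combo[1:]:
            mask &= other
        cells.append(mask)
    return cells


def entropy(cells: List[int], m: int) -> float:
    """Shannon entropy in bits; each cell's count is its number of 1 bits."""
    h = 0.0
    for mask in cells:
        count = bin(mask).count("1")
        if count:
            p = count / m
            h -= p * log2(p)
    return h


def mutual_information(cls: Dict[int, int], snps: List[Dict[int, int]], m: int) -> float:
    """H(Class) + H(SNP_1..SNP_k) - H(Class, SNP_1..SNP_k), as in formula (1)."""
    return (entropy(joint_cells(cls), m)
            + entropy(joint_cells(*snps), m)
            - entropy(joint_cells(cls, *snps), m))


# Hypothetical toy data: Class is 1 only when the two genotypes differ.
snp_b = encode([0, 0, 1, 1], levels=3)
snp_c = encode([0, 1, 0, 1], levels=3)
cls = encode([0, 1, 1, 0], levels=2)
pair_score = mutual_information(cls, [snp_b, snp_c], m=4)
single_score = mutual_information(cls, [snp_b], m=4)
```

In this toy case the pair carries one full bit of information about Class while either SNP alone carries none, which is the kind of purely epistatic signal a single-locus scan would miss.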
(2) The node pairs are sorted from high to low by the conditional mutual information computed above between the k epistatic SNP loci and Class, and the top-N node pairs are taken, where N can be determined from experimental results. For any SNP node not contained in the top-N node pairs, the node pair in which it first appears in the sorted list is selected and inserted into the top-N node pairs.
(3) All SNP loci are treated as network nodes. The top-N node pairs obtained above are traversed in turn, and for each pair the corresponding edge between the two nodes is inserted, which builds the initial network graph.
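The pair selection and graph construction above can be sketched as follows. The function names and scores are hypothetical, and the edge orientation (from the lower to the higher node index) is an assumption made here only to keep the starting graph acyclic; the source does not state how the initial edges are directed.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def select_pairs(scores: Dict[Pair, float], n_snps: int, top_n: int) -> List[Pair]:
    """Top-N pairs by score, plus the first-ranked pair of every SNP left out."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    chosen = ranked[:top_n]
    covered = {snp for pair in chosen for snp in pair}
    for pair in ranked[top_n:]:
        if len(covered) == n_snps:
            break
        # A pair holding a still-uncovered SNP is that SNP's first appearance.
        if any(snp not in covered for snp in pair):
            chosen.append(pair)
            covered.update(pair)
    return chosen


def initial_network(pairs: List[Pair], n_snps: int) -> List[List[int]]:
    """Adjacency matrix with one edge per selected pair.

    Assumption: each edge points from the lower to the higher index,
    which guarantees that the initial graph has no cycle.
    """
    adj = [[0] * n_snps for _ in range(n_snps)]
    for a, b in pairs:
        adj[min(a, b)][max(a, b)] = 1
    return adj


# Hypothetical mutual-information scores for 4 SNPs (indices 0..3).
scores = {(0, 1): 0.9, (0, 2): 0.5, (0, 3): 0.3,
          (1, 2): 0.4, (1, 3): 0.2, (2, 3): 0.1}
pairs = select_pairs(scores, n_snps=4, top_n=2)
network = initial_network(pairs, n_snps=4)
```

With `top_n=2`, SNP 3 is missing from the two best pairs, so its best-ranked pair `(0, 3)` is appended before the graph is built.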
3. When searching the space of Bayesian network structures, each individual in the genetic tabu algorithm used by the invention corresponds to one Bayesian network structure. A Bayesian network individual is represented by an adjacency matrix: if the network contains n SNP loci, each individual is an n × n adjacency matrix C. A 0/1 encoding is used, with C_ij = 1 if node i is a parent of node j and C_ij = 0 otherwise, as shown in Fig. 4.
The Bayesian network individuals in the initial population of the genetic algorithm consist of SNP nodes and the edges between them. Without creating cycles, the next network individual is generated from the initial network individual by randomly adding, deleting, or reversing an edge. A new network individual is then generated from that one, and so on, until the number of network individuals reaches the initial population size. The genetic algorithm starts iterating from this initial population.
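A minimal sketch of this population build-up is given below. It assumes the adjacency-matrix encoding described above, uses Kahn's topological sort as the cycle check (the source does not say which check is used), and caps the number of retries per individual; all names are illustrative.

```python
import random
from typing import List

Matrix = List[List[int]]


def is_acyclic(adj: Matrix) -> bool:
    """Kahn's algorithm: the graph is a DAG when every node can be removed."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while ready:
        i = ready.pop()
        removed += 1
        for j in range(n):
            if adj[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return removed == n


def random_neighbor(adj: Matrix, rng: random.Random, max_tries: int = 100) -> Matrix:
    """Apply one random add, delete, or reverse operation that keeps the graph acyclic."""
    n = len(adj)
    for _ in range(max_tries):
        child = [row[:] for row in adj]
        i, j = rng.sample(range(n), 2)
        op = rng.choice(("add", "delete", "reverse"))
        if op == "add" and not child[i][j] and not child[j][i]:
            child[i][j] = 1
        elif op == "delete" and child[i][j]:
            child[i][j] = 0
        elif op == "reverse" and child[i][j]:
            child[i][j], child[j][i] = 0, 1
        else:
            continue  # operation not applicable to this node pair
        if is_acyclic(child):
            return child
    return [row[:] for row in adj]  # fall back to an unchanged copy


def initial_population(seed_net: Matrix, size: int, rng: random.Random) -> List[Matrix]:
    """Chain of individuals, each derived from the previous one."""
    population = [[row[:] for row in seed_net]]
    while len(population) < size:
        population.append(random_neighbor(population[-1], rng))
    return population
```

Calling `initial_population(seed_net, size, random.Random(0))` with a fixed seed makes the generated population reproducible, which is convenient when comparing runs.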
4. Starting from the initial network population, the Bayesian networks containing the SNP loci are evolved through the genetic operations (selection, crossover, and mutation) and the scoring mechanism of the Bayesian network to find the optimal network structure, so that the epistatic loci influencing the phenotypic trait are obtained quickly and accurately. In addition, to increase population diversity and reach the global optimum, the tabu search strategy is applied in the crossover and mutation operations of the genetic algorithm, which also accelerates convergence.
(1) Selection
The selection operation picks good individuals from the current population so that they act as parents and produce the next generation. Its principle is that fitter individuals are more likely to be selected. Each network is scored with the Bayesian network scoring method and higher-scoring networks are preferred: the highest-scoring Bayesian network individual is placed at the first position of the population, and roulette-wheel selection chooses the better network individuals that enter the next generation. If the fitness of the i-th network is f_i, the probability P_i that it is selected is given by the following formula, where N is the population size.
$$P_i = \frac{f_i}{\sum_{j=1}^{N} f_j}$$
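The selection step can be sketched as below. Roulette-wheel selection needs positive fitness values, whereas BIC scores are typically negative, so the sketch shifts the scores first; that shift (`to_fitness`) is an assumption of this example and is not described in the source. The other names are illustrative as well.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def to_fitness(scores: Sequence[float]) -> List[float]:
    """Shift raw scores so that every fitness is positive (assumed convention)."""
    low = min(scores)
    return [s - low + 1.0 for s in scores]


def selection_probabilities(fitness: Sequence[float]) -> List[float]:
    """P_i = f_i / sum_j f_j."""
    total = sum(fitness)
    return [f / total for f in fitness]


def select(population: Sequence[T], fitness: Sequence[float], rng: random.Random) -> List[T]:
    """Best individual first, remaining slots filled by roulette-wheel draws."""
    best = max(range(len(population)), key=lambda i: fitness[i])
    probs = selection_probabilities(fitness)
    rest = rng.choices(population, weights=probs, k=len(population) - 1)
    return [population[best]] + rest
```

Keeping the best individual at index 0 mirrors the text's rule that the highest-scoring network is placed at the first position of the population.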
(2) Tabu crossover
Crossover can produce better individuals in the new generation, and the new individuals inherit characteristics of their parents. To speed up convergence, multi-column crossover is used. Whenever crossover changes a network, a cycle check is performed. Suppose the population contains two network individuals Individual1 and Individual2. Two columns f1, f2 of Individual1 and two columns s1, s2 of Individual2 are selected at random. Column f1 of Individual1 is swapped with column s1 of Individual2, and column f2 of Individual1 is swapped with column s2 of Individual2, that is, f1 ↔ s1 and f2 ↔ s2, which gives Individual1 [...s1...s2...] and Individual2 [...f1...f2...]. During crossover the networks are checked for cycles; when neither Individual1 nor Individual2 contains a cycle structure, they are accepted as new offspring. If a swap would create a cycle, the swap at that position is skipped and the next position is examined, until both columns have been processed. The procedure is shown in Fig. 5 and runs as follows:
<1> The second column of Individual1 (I1.f2) and the first column of Individual2 (I2.s1) are swapped row by row, as shown in the "exchange first column" part of the figure. The crossover succeeds and yields two individuals whose first column has been exchanged.
<2> The third column of Individual1 (I1.f3) and the third column of Individual2 (I2.s3) are swapped row by row, as shown in the "exchange second column" part of the figure. In the first row, swapping the value 0 in the third column of Individual1 with the value 1 in the third column of Individual2 would create a cycle in Individual1, as marked by the red circle in the figure. The algorithm therefore skips the first row and starts swapping from the second row. When the exchange finishes, the final two individuals are obtained.
Across generations, ordinary crossover may produce identical offspring, so that the chromosomes in the population become locally similar. The search then stagnates and premature convergence easily occurs. In Fig. 6, in the first iteration, column I1.f1 of Individual1 is crossed with column I2.s2 of Individual2, and column I1.f2 of Individual1 with column I2.s3 of Individual2, giving the three new offspring network individuals shown in the second iteration. In the second iteration, column I1.f2 of Individual1 is crossed with column I3.p2 of Individual3, and column I1.f3 of Individual1 with column I3.p3 of Individual3. The two offspring I1 and I3 obtained in the third iteration are identical to their parents, so this crossover produces no new offspring.
To solve the premature convergence that ordinary crossover tends to cause, the memory of tabu search is used: after crossover, the offspring network is compared one by one with the individuals in the tabu list. If the offspring is not in the tabu list, it enters the next generation and is stored in the tabu list. If it is already in the tabu list, it is discarded and the tabu crossover is repeated until the offspring produced is not in the tabu list.
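The tabu check around crossover can be sketched as below. The sketch assumes a fixed-length tabu list (a `deque` with `maxlen`) and a cap on retries, neither of which is specified in the source; `make_child` stands for any crossover routine and is a hypothetical parameter.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

Matrix = List[List[int]]
Key = Tuple[Tuple[int, ...], ...]


def key_of(net: Matrix) -> Key:
    """Hashable form of an adjacency matrix, used as the tabu-list entry."""
    return tuple(tuple(row) for row in net)


def tabu_crossover(make_child: Callable[[], Matrix],
                   tabu: Deque[Key],
                   max_tries: int = 50) -> Optional[Matrix]:
    """Repeat crossover until the offspring is not in the tabu list.

    An accepted offspring is recorded in the tabu list. Returns None when
    no new offspring appears within max_tries (an assumed safeguard).
    """
    for _ in range(max_tries):
        child = make_child()
        k = key_of(child)
        if k not in tabu:
            tabu.append(k)
            return child
    return None


# Assumed tabu-list length of 10; the oldest entries are dropped automatically.
tabu_list: Deque[Key] = deque(maxlen=10)
```

Because every accepted offspring is remembered, a later crossover that reproduces the same network is rejected, which is what prevents the repeated individuals illustrated in Fig. 6.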
(3) Tabu mutation
The mutation operation first selects an individual from the population at random and then, with a certain mutation probability P_m, applies an add-edge, delete-edge, or reverse-edge operation to the selected network individual, which increases population diversity. In Fig. 7, the mutation in the matrix corresponds to deleting the edge between node A and node C in the network. Provided that no cycle is created, the mutation that increases the network score the most is chosen, so as to obtain a better network structure. Ordinary mutation is highly random and easily destroys network individuals with good fitness; it also has weak hill-climbing ability and is easily trapped in local optima. The invention uses the memory of tabu search: when mutation produces an inferior solution, that solution is stored in the tabu list and the search continues from the current individual. This avoids circuitous searching, escapes local optima, improves the hill-climbing ability of mutation, and helps find a better network individual quickly.
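One possible reading of the tabu mutation is sketched below. It enumerates every single-edge change instead of sampling one at random, treats any candidate that does not beat the current score as an inferior solution, and takes the scoring function as a parameter; these choices and all names are assumptions of the sketch, not details confirmed by the source.

```python
import random
from typing import Callable, Iterator, List, Set, Tuple

Matrix = List[List[int]]
Key = Tuple[Tuple[int, ...], ...]


def is_acyclic(adj: Matrix) -> bool:
    """Kahn's algorithm: the graph is a DAG when every node can be removed."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while ready:
        i = ready.pop()
        removed += 1
        for j in range(n):
            if adj[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return removed == n


def neighbors(adj: Matrix) -> Iterator[Matrix]:
    """All networks reachable by adding, deleting, or reversing one edge."""
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if adj[i][j]:
                deleted = [row[:] for row in adj]
                deleted[i][j] = 0
                yield deleted
                flipped = [row[:] for row in deleted]
                flipped[j][i] = 1
                yield flipped
            elif not adj[j][i]:
                added = [row[:] for row in adj]
                added[i][j] = 1
                yield added


def tabu_mutate(adj: Matrix, score: Callable[[Matrix], float], tabu: Set[Key],
                rng: random.Random, p_m: float) -> Matrix:
    """With probability p_m, return the best-scoring acyclic, non-tabu neighbor."""
    if rng.random() >= p_m:
        return adj
    current = score(adj)
    best, best_score = adj, current
    for cand in neighbors(adj):
        key = tuple(map(tuple, cand))
        if key in tabu or not is_acyclic(cand):
            continue
        s = score(cand)
        if s <= current:
            tabu.add(key)  # inferior solution: remember it, do not revisit
        elif s > best_score:
            best, best_score = cand, s
    return best
```

In practice `score` would be the BIC score of the network on the sample data; any function that maps an adjacency matrix to a number works with this sketch.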
5. Fitness evaluation
The fitness function is the criterion for judging the quality of a network individual; it determines which good individuals are retained and which poor individuals are eliminated. In the genetic tabu algorithm of the invention, fitness is computed by scoring the Bayesian network, and this score guides the evolutionary search. The widely used BIC scoring method is adopted to evaluate the networks.
Bayesian scoring selects the Bayesian network structure with the largest posterior probability given the prior knowledge and the sample data. Let D denote the sample data and G the Bayesian network structure. Bayes' theorem gives formula (7), where P(G) is the prior knowledge of the network structure.
$$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)} \qquad (7)$$
Let θ_G denote the parameters of the network structure. Expanding P(D | G) in formula (7) by marginal integration over the parameters gives formula (8).
$$P(D \mid G) = \int P(D \mid G, \theta_G)\,P(\theta_G \mid G)\,d\theta_G \qquad (8)$$
This leads to the BIC scoring method of the Bayesian network, shown in formula (9).
$$\mathrm{BIC}(G \mid D) = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i} m_{ijk}\,\log\frac{m_{ijk}}{m_{ij}} - \frac{\log m}{2}\sum_{i=1}^{n} q_i\,(r_i - 1) \qquad (9)$$
Here m is the total number of samples, n is the number of variables, r_i is the number of values of the i-th variable, q_i is the number of value combinations of the parents of the i-th variable, m_ijk is the number of samples in which the i-th variable takes its k-th value while its parents take their j-th combination, and m_ij = Σ_k m_ijk. As with the mutual information, the BIC score can be computed quickly by applying logical AND operations to the Boolean data in binary form.
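For illustration, the standard BIC score of a discrete Bayesian network can be computed as in the sketch below. It counts with `collections.Counter` for readability rather than with the bitwise AND operations the text describes, uses natural logarithms, and takes the number of values per variable (`arity`) as an input; these choices and the function name are assumptions of the example.

```python
from collections import Counter
from math import log
from typing import List, Sequence


def bic_score(data: Sequence[Sequence[int]], adj: List[List[int]],
              arity: Sequence[int]) -> float:
    """BIC = sum_ijk m_ijk * log(m_ijk / m_ij) - (log m / 2) * sum_i q_i * (r_i - 1).

    adj[p][i] == 1 means variable p is a parent of variable i.
    """
    m = len(data)
    n = len(arity)
    total = 0.0
    for i in range(n):
        parents = [p for p in range(n) if adj[p][i]]
        q_i = 1
        for p in parents:
            q_i *= arity[p]
        # m_ijk: samples with parent combination j and value k of variable i.
        m_ijk = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
        # m_ij: samples with parent combination j.
        m_ij = Counter(tuple(row[p] for p in parents) for row in data)
        for (j, _k), count in m_ijk.items():
            total += count * log(count / m_ij[j])
        total -= 0.5 * log(m) * q_i * (arity[i] - 1)
    return total


# Hypothetical data: two binary variables that are always equal, 8 samples.
data = [(0, 0), (1, 1)] * 4
no_edge = bic_score(data, [[0, 0], [0, 0]], arity=[2, 2])
with_edge = bic_score(data, [[0, 1], [0, 0]], arity=[2, 2])
```

For these data the network with the edge scores higher than the empty network: the gain in log-likelihood from explaining the second variable outweighs the penalty for the extra parameters.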
6. Termination
The algorithm of the invention terminates when the maximum number of iterations is reached or when the score of the best individual has remained unchanged for k consecutive generations. Otherwise, the new generation produced by selection, crossover, and mutation replaces the previous generation, and the iteration continues.
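The termination rule can be sketched as a generic loop. The names `evolve`, `next_generation`, and `patience` (the k of the text) are illustrative, and `next_generation` stands for one full round of selection, tabu crossover, and tabu mutation.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def evolve(population: List[T],
           fitness: Callable[[T], float],
           next_generation: Callable[[List[T]], List[T]],
           max_iter: int,
           patience: int) -> Tuple[T, float, int]:
    """Iterate until max_iter generations have run or the best score has
    stayed unchanged for `patience` consecutive generations."""
    best = max(population, key=fitness)
    best_score = fitness(best)
    stale = 0
    generations = 0
    while generations < max_iter and stale < patience:
        population = next_generation(population)
        generations += 1
        candidate = max(population, key=fitness)
        score = fitness(candidate)
        if score > best_score:
            best, best_score, stale = candidate, score, 0
        else:
            stale += 1
    return best, best_score, generations
```

The function returns the best individual seen, its score, and the number of generations actually run, so a caller can tell which of the two stopping conditions fired.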
7. Experiments demonstrate the efficiency of the epistatic locus mining method based on genetic tabu search and a Bayesian network, comparing the accuracy and efficiency of two-locus and three-locus epistasis detection.
The following embodiment applies the method of the invention to data sets generated with the GAMETES software, and the related experiments show in detail how efficiently the method mines epistatic loci. GAMETES is a widely used tool for generating simulated epistasis data [4]. It can generate such data quickly and accurately and produces specific two-locus or multi-locus epistasis models by varying its parameters. The configurable parameters include the number of SNP loci, the heritability (h²), the minor allele frequency (MAF), and the prevalence. In a generated data file, the first row contains the locus names and the last column is the Class label, where 1 denotes a case and 0 denotes a control. Genotypes are coded as 0, 1, or 2, where 0 denotes the homozygous common genotype, 1 the heterozygote, and 2 the homozygous rare genotype.
The Bayesian network epistatic locus mining method optimized by genetic tabu search is denoted Epi-GTBN. The epistasis detection methods compared in the experiments are BEAM, AntEpiSeeker, SNPRuler, MDR, BOOST, and the Bayesian network learning method hill-climbing. Different data sets were generated with GAMETES by setting different heritabilities h² (0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequencies MAF (0.1, 0.2, 0.3, 0.4); each parameter setting comprises 100 files. The accuracy of epistatic locus mining is computed with formula (10), where Num_edge is the number of data sets, out of the 100 files of a parameter setting, in which the target epistatic loci are detected.
$$\mathrm{Accuracy} = \frac{Num_{edge}}{100} \qquad (10)$$
Experiment 1. Comparison of 2-locus epistasis detection accuracy and efficiency
This experiment compares the accuracy of epistatic locus mining under different settings of heritability h² and minor allele frequency MAF. Fig. 8 and Fig. 9 show the accuracy and efficiency of 2-locus epistasis mining for the different methods.
Fig. 8 shows that, across the h² and MAF settings, the 2-locus epistasis detection accuracy of BEAM and hill-climbing is far below that of the other 5 methods. Epi-GTBN, MDR, BOOST, and AntEpiSeeker have the highest detection accuracy, essentially 100%, and the detection accuracy of SNPRuler is slightly lower than that of these 4 methods.
Fig. 9 shows that AntEpiSeeker needs the most time for 2-locus epistasis detection, far more than the other 6 methods. BEAM, BOOST, and SNPRuler need the least time, and MDR, hill-climbing, and Epi-GTBN fall in between. Among these, Epi-GTBN takes less time than MDR and hill-climbing. In Epi-GTBN the genotype data are converted into Boolean data in binary form, so logical AND operations can be used to compute the conditional mutual information between nodes and the BIC score of a network quickly. This saves a large amount of computation time when building the initial network and scoring networks.
In summary, the two-locus epistasis detection accuracy of MDR, BOOST, and AntEpiSeeker is essentially the same as that of Epi-GTBN, at about 100%. However, MDR and AntEpiSeeker take significantly more detection time than Epi-GTBN. In addition, the parameter settings of AntEpiSeeker are rather complicated and its results depend closely on them. BOOST can only be used for 2-locus epistasis detection and cannot detect epistasis among more loci. The Epi-GTBN method of the invention thus achieves good detection accuracy without sacrificing detection efficiency.
Experiment 2. Comparison of 3-site epistasis detection accuracy
This experiment compares the accuracy of 3-site epistasis mining for the different methods under different settings of heritability h² and minor allele frequency (MAF), as shown in Fig. 10.
The experimental results show that the 3-site epistasis detection accuracy in Fig. 10 is almost consistent with the 2-site epistasis detection results in Fig. 8. The BEAM and hill-climbing methods have the lowest epistasis detection accuracy. MDR and Epi-GTBN have the highest detection accuracy, essentially 100%, which is higher than that of the SNPRuler method.
It should be understood that those of ordinary skill in the art may make modifications or changes based on the above description, and all such modifications and changes shall fall within the protection scope of the appended claims of the present invention.
Claims (7)
1. An epistatic site mining method based on genetic tabu search and a Bayesian network, characterized by comprising the following steps:
Step 1: for SNP genotype data, express the genotype data in the form 0, 1, 2, where 0 denotes the homozygous common genotype, 1 denotes the heterozygote, and 2 denotes the homozygous rare genotype; obtain the gene samples to be mined, divide them, in units of samples, into the three groups 0, 1 and 2, and convert the genotype data into Boolean data (0, 1) in binary form;
Step 2: based on information entropy theory, compute the conditional mutual information between every SNP site pair and the phenotypic trait, rank the node pairs by the computed mutual information, take the top-N node pairs, and construct an initial network graph containing the SNP site pairs;
Step 3: on the premise that no cycle is produced, generate the next network individual from the initial network individual by randomly adding an edge, deleting an edge or reversing an edge, and then generate a new network individual from that next network individual; repeat this generation of new network individuals until the number of network individuals reaches the initial population size;
Step 4: evolve the initial network population obtained in step 3, which consists of Bayesian networks containing SNP sites, using the three operations (selection, crossover and mutation) of a genetic algorithm optimized by tabu search together with the scoring mechanism of the Bayesian network, and find the optimal network structure, thereby obtaining the epistatic gene sites that affect the phenotypic trait.
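As an illustration of step 3, the following Python sketch grows an initial population by repeatedly applying one random edge operation (add, delete or reverse) to the most recently generated network and rejecting any result that contains a directed cycle. It is a sketch under stated assumptions, not the patented implementation: networks are represented as sets of directed edges, the initial network counts as the first individual, and the names `has_cycle`, `random_neighbor` and `initial_population` are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def has_cycle(n_nodes: int, edges: Set[Edge]) -> bool:
    """Detect a directed cycle with Kahn's topological sort."""
    indeg = [0] * n_nodes
    children: Dict[int, List[int]] = {v: [] for v in range(n_nodes)}
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
    queue = [v for v in range(n_nodes) if indeg[v] == 0]
    visited = 0
    while queue:
        u = queue.pop()
        visited += 1
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return visited < n_nodes  # unvisited nodes lie on a cycle


def random_neighbor(n_nodes: int, edges: Set[Edge], rng: random.Random) -> Set[Edge]:
    """Apply one random add / delete / reverse operation; retry until acyclic."""
    while True:
        new_edges = set(edges)
        op = rng.choice(["add", "delete", "reverse"])
        if op == "add":
            u, v = rng.sample(range(n_nodes), 2)
            if (u, v) in new_edges or (v, u) in new_edges:
                continue
            new_edges.add((u, v))
        else:
            if not new_edges:
                continue
            u, v = rng.choice(sorted(new_edges))
            new_edges.remove((u, v))
            if op == "reverse":
                new_edges.add((v, u))
        if not has_cycle(n_nodes, new_edges):
            return new_edges


def initial_population(n_nodes: int, initial_edges: Set[Edge],
                       pop_size: int, seed: int = 0) -> List[Set[Edge]]:
    """Chain random operations from the initial network until pop_size is reached."""
    rng = random.Random(seed)
    population = [set(initial_edges)]
    while len(population) < pop_size:
        population.append(random_neighbor(n_nodes, population[-1], rng))
    return population


# Toy example: nodes 0-2 are SNPs and node 3 stands for the phenotype (an assumption).
population = initial_population(4, {(0, 3), (1, 3)}, pop_size=5)
print(len(population))
```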
2. The epistatic site mining method based on genetic tabu search and a Bayesian network according to claim 1, characterized in that the method further comprises a step of evaluating the constructed networks:
Step 5: use a fitness function as the criterion for judging the quality of a network individual, and evaluate the quality of each network by BIC scoring.
3. The epistatic site mining method based on genetic tabu search and a Bayesian network according to claim 2, characterized in that, in step 2 and step 5, the genotype data is converted into Boolean data in binary form, and the binary data is operated on directly with logical AND operations, so that the conditional mutual information between nodes and the BIC score of the Bayesian network are computed rapidly.
4. The epistatic site mining method based on genetic tabu search and a Bayesian network according to claim 1, characterized in that the specific method of constructing the initial network graph containing SNP site pairs in step 2 is:
Step 2.1: set the number nlocus of epistatic gene sites to be mined, enumerate the combinations of nlocus sites among all sites, and, based on information entropy theory, use logical AND operations to rapidly compute the conditional mutual information between each combination of nlocus sites and the phenotypic trait;
Step 2.2: rank the different node pairs by the computed conditional mutual information and take the top-N node pairs, where the value of N is determined from experimental results; for each SNP site not included in the top-N node pairs, select the node pair in which it first appears and insert that pair into the top-N node pairs;
Step 2.3: treat all gene SNP sites as nodes and, according to the top-N node pairs obtained in step 2.2, insert the corresponding edges between the nodes to construct the initial network graph.
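Steps 2.2 and 2.3 can be sketched in Python as follows. The sketch assumes that the mutual-information score of every SNP pair has already been computed and is passed in as a dictionary, and that "the node pair in which it first appears" means the highest-ranked pair outside the top N that contains the missing SNP. It also assumes that each selected pair contributes one edge between its two SNP nodes. The names `select_pairs` and `build_initial_edges` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]


def select_pairs(scores: Dict[Pair, float], n_snps: int, top_n: int) -> List[Pair]:
    """Take the top-N pairs, then add one pair for every SNP left uncovered."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = ranked[:top_n]
    covered = {snp for pair in selected for snp in pair}
    for snp in range(n_snps):
        if snp in covered:
            continue
        # First (highest-ranked) remaining pair in which the missing SNP appears.
        for pair in ranked[top_n:]:
            if snp in pair:
                selected.append(pair)
                covered.update(pair)
                break
    return selected


def build_initial_edges(pairs: List[Pair]) -> Set[Pair]:
    """Insert one edge per selected pair (edge orientation here is an assumption)."""
    return {(min(a, b), max(a, b)) for a, b in pairs}


# Toy scores for all pairs among 4 SNPs (made-up values for illustration only).
scores = {(0, 1): 0.9, (0, 2): 0.5, (1, 2): 0.4,
          (2, 3): 0.3, (0, 3): 0.2, (1, 3): 0.1}
pairs = select_pairs(scores, n_snps=4, top_n=2)
edges = build_initial_edges(pairs)
print(pairs)
```

In this toy run SNP 3 is absent from the top 2 pairs, so the pair (2, 3), the highest-ranked pair containing it, is appended.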
5. The epistatic site mining method based on genetic tabu search and a Bayesian network according to claim 1, characterized in that the specific method of evolution in step 4 is:
Step 4.1, selection operation: score the networks using the scoring method of the Bayesian network, place the optimal (highest-scoring) Bayesian network individual at the first position of the population, and select the network individuals that enter the next generation by roulette-wheel selection;
Step 4.2, tabu crossover operation: evolve two networks using a multi-row crossover method and check whether a cycle is produced; to avoid the premature convergence caused by an ordinary crossover operation, use the memory function of tabu search: after the crossover operation, compare the generated offspring network with the individuals in the tabu list; if it does not belong to the tabu list, let this offspring network individual enter the next generation and store it in the tabu list; if it already belongs to the tabu list, discard this offspring individual and repeat the tabu crossover operation until the generated offspring does not belong to the tabu list;
Step 4.3, tabu mutation operation: apply edge addition, edge deletion and edge reversal to a network individual with a certain mutation probability, and select the mutation that increases the network score the most so as to obtain an optimized network structure; using the memory function of tabu search, store in the tabu list the inferior solutions produced by mutation that do not improve the current fitness value.
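The selection and tabu crossover operations of steps 4.1 and 4.2 can be sketched in Python as below. Several details are assumptions made for illustration: scores are shifted to be positive before roulette-wheel selection (BIC scores are typically negative), the retry loop is bounded by a hypothetical `max_tries` parameter, and `mix_edges` is a simple stand-in crossover rather than the multi-row crossover named in the claim. The toy parents only contain edges from lower to higher node indices, so every child is acyclic; a real implementation would need the cycle check described in step 4.2.

```python
import random
from typing import Callable, FrozenSet, List, Set, Tuple

Network = FrozenSet[Tuple[int, int]]
Crossover = Callable[[Network, Network, random.Random], Network]


def roulette_select(population: List[Network], scores: List[float],
                    rng: random.Random) -> Network:
    """Pick one individual with probability proportional to its shifted score."""
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]  # shift so all weights are positive
    return rng.choices(population, weights=weights, k=1)[0]


def select_generation(population: List[Network], scores: List[float],
                      rng: random.Random) -> List[Network]:
    """Keep the best individual in first position, fill the rest by roulette."""
    best = population[max(range(len(scores)), key=scores.__getitem__)]
    rest = [roulette_select(population, scores, rng) for _ in range(len(population) - 1)]
    return [best] + rest


def tabu_crossover(a: Network, b: Network, crossover: Crossover,
                   tabu: Set[Network], rng: random.Random,
                   max_tries: int = 100) -> Network:
    """Repeat the crossover until the child is not in the tabu list, then record it."""
    for _ in range(max_tries):
        child = crossover(a, b, rng)
        if child not in tabu:
            tabu.add(child)
            return child
    return a  # assumption: fall back to a parent if every attempt was tabu


def mix_edges(a: Network, b: Network, rng: random.Random) -> Network:
    """Hypothetical crossover: keep shared edges, inherit each other edge with p = 0.5."""
    child = set(a & b)
    for edge in sorted(a ^ b):
        if rng.random() < 0.5:
            child.add(edge)
    return frozenset(child)


rng = random.Random(1)
parent_a: Network = frozenset({(0, 1), (1, 3)})
parent_b: Network = frozenset({(0, 2), (2, 3)})
tabu: Set[Network] = {parent_a, parent_b}
child = tabu_crossover(parent_a, parent_b, mix_edges, tabu, rng)
next_gen = select_generation([parent_a, parent_b], [-10.0, -5.0], rng)
```

Storing networks as frozensets makes tabu-list membership a constant-time hash lookup.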
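6. The epistatic site mining method based on genetic tabu search and a Bayesian network according to claim 4, characterized in that the specific calculation method in step 2.1 is:
When mining k epistatic SNP sites that affect the phenotype Class, let $I(Class \mid SNP_1, \ldots, SNP_k)$ denote the conditional mutual information between the k epistatic SNP sites and the phenotype Class; it is computed as:
$$I(Class \mid SNP_1, \ldots, SNP_k) = H(Class) + H(SNP_1, \ldots, SNP_k) - H(Class, SNP_1, \ldots, SNP_k)$$
The information entropy $H(Class)$ of Class is computed as:
$$H(Class) = -\sum_{c} P(c) \log P(c)$$
where the sum runs over the values c of Class.
The information entropy $H(SNP_1, \ldots, SNP_k)$ of the k SNP sites is computed as:
$$H(SNP_1, \ldots, SNP_k) = -\sum_{s_1, \ldots, s_k} P(s_1, \ldots, s_k) \log P(s_1, \ldots, s_k)$$
where the sum runs over the genotype combinations $(s_1, \ldots, s_k)$ of the k SNP sites.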
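The quantity I(Class | SNP1, ..., SNPk) = H(Class) + H(SNP1, ..., SNPk) - H(Class, SNP1, ..., SNPk) can be computed from sample frequencies as in the following Python sketch. The entropies are estimated from empirical counts and use base-2 logarithms, which is an assumption; the function names `entropy` and `mutual_information` are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(observations: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy (base 2) of a sequence of observations."""
    m = len(observations)
    return -sum((c / m) * math.log2(c / m) for c in Counter(observations).values())


def mutual_information(phenotype: List[int], snps: List[List[int]]) -> float:
    """H(Class) + H(SNP_1..SNP_k) - H(Class, SNP_1..SNP_k) from sample counts."""
    combos = list(zip(*snps))            # one genotype combination per sample
    joint = list(zip(phenotype, *snps))  # phenotype together with the combination
    return entropy(phenotype) + entropy(combos) - entropy(joint)


# Toy example of a purely epistatic effect: the phenotype is the XOR of two SNPs.
phenotype = [0, 1, 1, 0]
snp1 = [0, 0, 1, 1]
snp2 = [0, 1, 0, 1]

pair_score = mutual_information(phenotype, [snp1, snp2])
single_score = mutual_information(phenotype, [snp1])
print(pair_score, single_score)
```

In this toy data neither SNP alone carries information about the phenotype, while the pair determines it completely, which is the kind of interaction the ranking in step 2.2 is meant to surface.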
7. The epistatic site mining method based on genetic tabu search and a Bayesian network according to claim 2, characterized in that the specific method of computing the BIC score is:
Let D denote the sample data and G the Bayesian network structure; by Bayes' formula:
$$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$
where $P(G)$ denotes the prior knowledge of the network structure;
Let $\theta_G$ denote the parameters of the network structure; expanding $P(D \mid G)$ by marginal integration gives:
$$P(D \mid G) = \int P(D \mid G, \theta_G)\,P(\theta_G \mid G)\,d\theta_G$$
from which the BIC scoring method of the Bayesian network is obtained:
$$BIC(G \mid D) = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i} m_{ijk} \log\frac{m_{ijk}}{\sum_{k'=1}^{r_i} m_{ijk'}} - \frac{\log m}{2}\sum_{i=1}^{n} q_i\,(r_i - 1)$$
where m denotes the total number of samples, n denotes the number of variables, $r_i$ denotes the number of values of the i-th variable, $q_i$ denotes the number of combinations of the parent variables of the i-th variable, and $m_{ijk}$ denotes the number of samples in which the i-th variable takes its k-th value and its parent variables take their j-th combination.
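A BIC score of this kind can be sketched in Python as follows. The sketch uses the standard BIC form for discrete Bayesian networks (log-likelihood from the counts m_ijk minus a penalty of (log m)/2 per free parameter), which is assumed here to match the scoring of the claim; the function name `bic_score` and the data layout (one tuple of variable values per sample, a dictionary mapping each variable to its parent list) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bic_score(data: List[Sequence[int]], parents: Dict[int, List[int]],
              arity: List[int]) -> float:
    """Standard BIC of a discrete Bayesian network structure given complete data."""
    m = len(data)  # total number of samples
    loglik = 0.0
    n_params = 0
    for i, r_i in enumerate(arity):
        pa = parents.get(i, [])
        # m_ijk: samples where variable i takes value k and its parents take combination j
        m_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        # sum over k of m_ijk: samples where the parents take combination j
        m_ij = Counter(tuple(row[p] for p in pa) for row in data)
        loglik += sum(c * math.log(c / m_ij[j]) for (j, _), c in m_ijk.items())
        q_i = math.prod(arity[p] for p in pa)  # number of parent combinations
        n_params += q_i * (r_i - 1)
    return loglik - 0.5 * math.log(m) * n_params


# Toy data: variable 1 is an exact copy of variable 0 (8 samples, binary variables).
data = [(0, 0), (1, 1)] * 4
arity = [2, 2]

score_independent = bic_score(data, {}, arity)        # no edges
score_dependent = bic_score(data, {1: [0]}, arity)    # edge 0 -> 1
print(score_independent, score_dependent)
```

Because variable 1 is fully determined by variable 0 in this toy data, the structure with the edge scores higher despite its larger parameter penalty.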
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811287261.0A CN109448794B (en) | 2018-10-31 | 2018-10-31 | Genetic taboo and Bayesian network-based epistatic site mining method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109448794A (en) | 2019-03-08 |
CN109448794B (en) | 2021-04-30 |
Family
ID=65549784
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811287261.0A Active CN109448794B (en) | 2018-10-31 | 2018-10-31 | Genetic taboo and Bayesian network-based epistatic site mining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109448794B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110570909A (en) * | 2019-09-11 | 2019-12-13 | 华中农业大学 | Method for mining epistatic sites of artificial bee colony optimized Bayesian network |
CN111180012A (en) * | 2019-12-27 | 2020-05-19 | 哈尔滨工业大学 | Gene identification method based on empirical Bayes and Mendelian randomized fusion |
CN111833967A (en) * | 2020-07-10 | 2020-10-27 | 华中农业大学 | K-tree-based epistatic site mining method for optimizing Bayesian network |
CN112447263A (en) * | 2020-11-22 | 2021-03-05 | 西安邮电大学 | Multitask high-order SNP upper detection method, system, storage medium and equipment |
TWI741760B (en) * | 2020-08-27 | 2021-10-01 | 財團法人工業技術研究院 | Learning based resource allocation method, learning based resource allocation system and user interface |
CN114207620A (en) * | 2019-07-29 | 2022-03-18 | 国立研究开发法人理化学研究所 | Data interpretation device, method and program, data integration device, method and program, and digital city construction system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060154290A1 (en) * | 2003-03-07 | 2006-07-13 | Illumigen Biosciences, Inc. | Method and apparatus for pattern identification in diploid DNA sequence data |
US20120036096A1 (en) * | 2010-08-05 | 2012-02-09 | King Fahd University Of Petroleum And Minerals | Method of generating an integrated fuzzy-based guidance law for aerodynamic missiles |
CN103632067A (en) * | 2013-11-07 | 2014-03-12 | 浙江大学 | Seed quantitative trait locus positioning method based on mixed linear model |
US20140283152A1 (en) * | 2013-03-14 | 2014-09-18 | University Of Florida Research Foundation, Inc. | Method for artificial selection |
CN107590364A (en) * | 2017-08-29 | 2018-01-16 | 集美大学 | A kind of quick bayes method of new estimation genomic breeding value |
Non-Patent Citations (1)
Title |
---|
CARLA CHIA-MING CHEN ET AL.: "Methods for Identifying SNP Interactions: A Review on Variations of Logic Regression,Random Forest and Bayesian Logistic Regression", 《IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114207620A (en) * | 2019-07-29 | 2022-03-18 | 国立研究开发法人理化学研究所 | Data interpretation device, method and program, data integration device, method and program, and digital city construction system |
US11669541B2 (en) | 2019-07-29 | 2023-06-06 | Riken | Data interpretation apparatus, method, and program, data integration apparatus, method, and program, and digital city establishing system |
CN114207620B (en) * | 2019-07-29 | 2023-08-15 | 国立研究开发法人理化学研究所 | Data interpretation device, method, storage medium, data integration device, method, storage medium, and digital city construction system |
CN110570909A (en) * | 2019-09-11 | 2019-12-13 | 华中农业大学 | Method for mining epistatic sites of artificial bee colony optimized Bayesian network |
CN110570909B (en) * | 2019-09-11 | 2023-03-03 | 华中农业大学 | Method for mining epistatic sites of artificial bee colony optimized Bayesian network |
CN111180012A (en) * | 2019-12-27 | 2020-05-19 | 哈尔滨工业大学 | Gene identification method based on empirical Bayes and Mendelian randomized fusion |
CN111833967A (en) * | 2020-07-10 | 2020-10-27 | 华中农业大学 | K-tree-based epistatic site mining method for optimizing Bayesian network |
CN111833967B (en) * | 2020-07-10 | 2022-05-20 | 华中农业大学 | K-tree-based epistatic site mining method for optimizing Bayesian network |
TWI741760B (en) * | 2020-08-27 | 2021-10-01 | 財團法人工業技術研究院 | Learning based resource allocation method, learning based resource allocation system and user interface |
CN112447263A (en) * | 2020-11-22 | 2021-03-05 | 西安邮电大学 | Multitask high-order SNP upper detection method, system, storage medium and equipment |
CN112447263B (en) * | 2020-11-22 | 2023-12-26 | 西安邮电大学 | Multi-task high-order SNP upper detection method, system, storage medium and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109448794B (en) | 2021-04-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||