CN109448794B

CN109448794B - Genetic taboo and Bayesian network-based epistatic site mining method

Info

Publication number: CN109448794B
Application number: CN201811287261.0A
Authority: CN
Inventors: 刘建晓; 果杨; 钟芷漫; 杨晨; 胡江峰; 蒋雅玲; 梁子珍; 高辉
Original assignee: Huazhong Agricultural University
Current assignee: Huazhong Agricultural University
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2021-04-30
Anticipated expiration: 2038-10-31
Also published as: CN109448794A

Abstract

The invention discloses an epistatic site mining method based on genetic tabu search and Bayesian networks, comprising the following steps: 1. converting the genotype data into Boolean data represented in binary form; 2. rapidly calculating the conditional mutual information between any SNP site pair and the phenotype using logical AND operations, taking the top-N node pairs, and constructing an initial network graph containing the SNP sites; 3. generating new individuals from the initial network individual by randomly adding, deleting and reversing edges until the number of network individuals reaches the population size; 4. evolving the Bayesian network structure through the three operations of a genetic algorithm and the scoring mechanism of the Bayesian network to find an optimal network structure, thereby quickly and accurately obtaining the epistatic loci affecting phenotypic traits. The invention can help biological researchers obtain the epistatic loci influencing specific phenotypic traits, thereby assisting gene function mining and providing a reference for analysing the genetic basis of complex quantitative traits in different species.

Description

Genetic taboo and Bayesian network-based epistatic site mining method

Technical Field

The invention relates to the technical field of bioinformatics, in particular to an epistatic site mining method based on genetic tabu search and Bayesian networks.

Background

With continuing improvements in living standards and medical care, diseases determined solely by environmental factors (such as infectious diseases and malnutrition) are largely under control, and complex diseases and Mendelian genetic diseases have become the major diseases affecting human health. A Mendelian disease is a single-gene disease whose inheritance follows Mendel's laws; researchers identify the responsible genes by positional cloning, and the mode of inheritance of such diseases has largely been clarified. Complex diseases account for more than about 80% of human diseases and cause great harm to human health. Common chronic diseases such as asthma, cancer, diabetes, hypertension, senile dementia, rheumatoid arthritis, schizophrenia, heart disease, cardiovascular disease, obesity and tumors are collectively called complex diseases. Their etiology is very complex, involving the environment, genes, and the interactions between them. Elucidating the causes and genetic mechanisms of complex diseases is therefore urgent: it provides a scientific basis for their diagnosis and treatment and helps safeguard human health.

From the viewpoint of genetics, the genetic factors that determine complex traits mainly comprise three aspects: main effects of genes, gene-gene interactions, and gene-environment interactions. A large body of experimental research indicates that gene-gene interaction is the main factor controlling complex traits. Gene-gene interaction, also known as epistasis, is primarily manifested as interaction between SNPs. Meanwhile, the rapid development of high-throughput technology has produced massive amounts of biological data. Using genome-wide association studies (GWAS) to screen genome-wide data for SNPs significantly associated with disease, and thereby explain the genetic mechanisms of complex diseases, is a hot topic in current bioinformatics research. GWAS methods focus mainly on detecting major-effect genes; although they have identified many phenotype-associated sites, previous studies can explain only a small part of the genetic variation. One of the most important reasons is that these studies neglect gene-gene interactions, i.e., epistasis. Mining epistatic sites is therefore a principal means of explaining the genetic mechanisms of complex diseases. However, current epistasis detection methods still suffer from heavy computation, high algorithmic complexity, low efficiency and high false positive rates, so that disease-associated SNP sites and their combinations cannot be detected accurately and efficiently. Providing a more effective and accurate genome-wide epistasis detection algorithm therefore has important research significance, and is also important for uncovering the pathogenesis of complex diseases and for their diagnosis, treatment and prevention.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, an epistatic site mining method based on genetic tabu search and Bayesian networks.

The technical scheme adopted by the invention to solve this technical problem is as follows:

The invention provides an epistatic site mining method based on genetic tabu search and Bayesian networks, which comprises the following steps:

step 1, expressing the SNP genotype data as 0, 1 and 2, wherein 0 represents the homozygous common genotype, 1 represents the heterozygous genotype, and 2 represents the homozygous rare genotype; acquiring the gene samples to be mined, splitting the data of each SNP into three groups corresponding to genotypes 0, 1 and 2, each group having one position per sample, and converting the genotype data into Boolean data (0 and 1) represented in binary form;

step 2, calculating the conditional mutual information between any SNP site pair and the phenotypic trait based on information entropy theory, sorting the node pairs by the calculated mutual information, taking the top-N node pairs, and constructing an initial network graph containing the SNP site pairs;

step 3, on the premise that no cycle is generated, generating the next network individual from the initial network individual by randomly adding, deleting and reversing edges, and then generating a new network individual from that individual; repeating this operation until the number of network individuals reaches the initial population size;

step 4, evolving the initial network population obtained in step 3, which consists of Bayesian networks comprising SNP sites, through the three operations (selection, crossover and mutation) of a genetic algorithm optimized by tabu search together with the scoring mechanism of the Bayesian network, finding the optimal network structure, and thus obtaining the epistatic loci influencing phenotypic traits.

Further, the method of the present invention further includes a method of determining the constructed network:

step 5, adopting a fitness function as the standard for judging the quality of a network individual, and judging the quality of the network with the BIC scoring method.

Furthermore, in step 2 and step 5 of the invention, the genotype data is converted into Boolean data represented in binary form and the binary data is operated on directly with logical AND operations, so that the conditional mutual information between nodes and the BIC score of the Bayesian network can be calculated rapidly.

Further, the specific method for constructing the initial network map containing the SNP site pairs in step 2 of the present invention is as follows:

step 2.1, setting the number nlocus of epistatic loci to be mined, enumerating the combinations of nlocus sites among all sites, and, based on information entropy theory, rapidly calculating with logical AND operations the conditional mutual information between each combination of nlocus sites and the phenotypic trait;

step 2.2, sorting the node pairs by the calculated conditional mutual information and taking the top-N node pairs, wherein N is determined from experimental results; for each SNP site not contained in the top-N node pairs, selecting the node pair in which it first appears in the sorted list and inserting that pair into the top-N node pairs;

step 2.3, taking all SNP sites as nodes of the network, and inserting the edge corresponding to each of the top-N node pairs obtained in step 2.2 into the network graph to construct the initial network graph.

Further, the specific method for performing evolution in step 4 of the present invention is:

step 4.1, selection operation; scoring the networks using the scoring method of the Bayesian network, placing the best-scoring Bayesian network individual at the first position of the population, and selecting network individuals to enter the next generation by roulette-wheel selection;

step 4.2, tabu crossover operation; evolving two networks by a multi-column crossover method and checking whether a cycle is generated; to avoid the premature convergence caused by ordinary crossover, using the memory function of tabu search to compare each offspring network produced by crossover with the individuals in the tabu list; if the offspring does not belong to the tabu list, it enters the next generation and is stored in the tabu list; if it belongs to the tabu list, it is discarded and the tabu crossover operation is performed again until the generated offspring does not belong to the tabu list;

step 4.3, tabu mutation operation; adding, deleting and reversing edges of network individuals with a given mutation probability, and selecting the mutation that increases the network score the most, so as to obtain an optimized network structure; using the memory function of tabu search to store in the tabu list the inferior solutions generated by mutation.

Further, the specific calculation method in step 2.1 of the present invention is:

when mining k epistatic SNP sites affecting the phenotype Class, I(Class|SNP₁,...SNP_k) denotes the conditional mutual information between the k epistatic SNP sites and the phenotype Class, and the calculation formula is as follows:

I(Class|SNP₁,...SNP_k)＝H(Class)+H(SNP₁,...SNP_k)-H(Class,SNP₁,...SNP_k)

the information entropy H(Class) of Class is calculated as:

H(Class)＝-Σ_c p(c) log p(c)

the information entropy H(SNP₁,…,SNP_k) of the k SNP sites is calculated as:

H(SNP₁,…,SNP_k)＝-Σ p(s₁,…,s_k) log p(s₁,…,s_k)

where c ranges over the values of Class and (s₁,…,s_k) ranges over the genotype combinations of the k sites;

further, the specific method for calculating the BIC score of the invention comprises the following steps:

and D represents sample data, G represents a Bayesian network structure, and the sample data can be obtained according to a Bayesian formula:

P(G|D)＝P(D|G)P(G)/P(D)

wherein P(G) represents the prior knowledge of the network structure;

with θ_G denoting the parameters of the network structure, P(D|G) in the above equation is expanded by marginalizing over θ_G:

P(D|G)＝∫P(D|G,θ_G)P(θ_G|G)dθ_G

and further obtaining a BIC scoring method of the Bayesian network:

where m denotes the total number of samples, n denotes the number of variables, r_i denotes the number of values of the i-th variable, q_i denotes the number of value combinations of the parent variables of the i-th variable, and m_ijk denotes the number of samples in which the i-th variable takes its k-th value and its parent variables take their j-th combination.

The invention has the following beneficial effects: the epistatic site mining method based on genetic tabu search and Bayesian networks first converts the genotype data into Boolean data represented in binary form and rapidly calculates the conditional mutual information between any SNP site pair and the phenotype using logical AND operations. The SNP site pairs are sorted by the calculated mutual information, the top-N node pairs are taken, and an initial network graph containing the SNP sites is constructed. Then, on the premise that no cycle is generated, new individuals are generated from the initial network individual by randomly adding, deleting and reversing edges until the population size is reached. The memory mechanism of the tabu search algorithm is applied to the evolutionary operations of the genetic algorithm; the Bayesian network structure is evolved using the three operations of the genetic algorithm (selection, crossover and mutation) and the BIC scoring method of the Bayesian network, the optimal network structure is found, and the epistatic loci influencing the phenotypic trait are obtained rapidly and accurately, assisting gene function mining.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a schematic flow chart of an embodiment of the present invention;

FIG. 2 is a representation of genotype data;

FIG. 3 is a representation of binary Boolean data;

FIG. 4 is a representation of a Bayesian network structure matrix code;

FIG. 5 shows the process for avoiding the creation of cycle structures;

FIG. 6 is a schematic of a crossover operation that produces no new individual;

FIG. 7 shows the Bayesian network mutation operation;

FIG. 8 is a comparison of the 2-site epistasis detection accuracy of different methods;

FIG. 9 is a comparison of the 2-site epistasis detection efficiency of different methods;

FIG. 10 is a comparison of the 3-site epistasis detection accuracy of different methods.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

1. Genotype data is expressed as 0, 1 and 2. For example, for a SNP with alleles A and T, AA is represented by 0, TT by 2, and AT/TA by 1. FIG. 2 shows the genotype data of 11 SNPs (SNP_A～SNP_K) for 4 samples; the last column, Class, indicates the phenotypic trait, where Class = 1 denotes case (diseased) and Class = 0 denotes control. To speed up the subsequent calculation of conditional mutual information between nodes and of the Bayesian network score, the genotype data is converted into Boolean data (0/1) represented in binary form. The conversion of the genotype data of SNP_A～SNP_D in FIG. 2 into binary Boolean data is shown in FIG. 3.

In FIG. 3, the first 4 columns are binary representations with genotype 0, the middle 4 columns are binary representations with genotype 1, and the last 4 columns are binary representations with genotype 2. When the genotype of a specific SNP in a certain sample is 0, the binary value of the corresponding position in the first 4 columns is represented by 1, as shown by the corresponding data in the boxes of FIGS. 2 and 3.
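The encoding just described can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes each SNP is stored as three Python integers used as bit strings (one per genotype value), with bit s set when sample s carries that genotype. The function name `encode_genotypes` and the bit order are assumptions made for the example.

```python
def encode_genotypes(genotypes):
    """Return three bitmasks, one per genotype value 0, 1 and 2.

    Bit s of masks[g] is 1 when sample s has genotype g, so every
    sample sets exactly one bit across the three masks.
    """
    masks = [0, 0, 0]
    for sample_index, genotype in enumerate(genotypes):
        masks[genotype] |= 1 << sample_index
    return masks


# Hypothetical SNP observed in four samples.
example_masks = encode_genotypes([0, 1, 2, 0])
```

With this layout, counting the samples that share a genotype combination reduces to a bitwise AND followed by a count of set bits.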

2. Based on the genotype data represented in binary Boolean form, the conditional mutual information between SNP site pairs and the phenotypic trait is calculated using information entropy theory, the node pairs are sorted, the top-N node pairs are taken, and an initial network graph containing the SNP sites is constructed.

(1) When k epistatic SNP sites affecting a phenotype Class are mined, formula (1) is used to calculate the conditional mutual information between the k SNP sites and Class. The information entropy H(Class) of Class in formula (1) is calculated with formula (2), and the information entropy H(SNP₁,…,SNP_k) of the k SNP sites in formula (1) is calculated with formula (3).

I(Class|SNP₁,...SNP_k)＝H(Class)+H(SNP₁,...SNP_k)-H(Class,SNP₁,...SNP_k) (1)

Based on the genotype data expressed in binary form, the conditional mutual information between nodes can be calculated rapidly with logical AND operations. For example, the conditional mutual information I(Class|SNP_B,SNP_C) for M_bit in FIG. 3 is calculated using formula (4).

I(Class|SNP_B,SNP_C)＝H(Class)+H(SNP_B,SNP_C)-H(Class,SNP_B,SNP_C) (4)

Formula (5) is used to calculate H(SNP_B,SNP_C) in formula (4).

When formula (5) is evaluated, binary logical AND operations can be used directly, and the result is obtained by counting the number of 1s in the resulting binary string. Using the underlined binary values of M_bit in FIG. 3, p(1,1) in formula (5) is calculated with formula (6).
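The counting scheme can be illustrated as follows. This is a sketch under stated assumptions rather than the patent's formulas (5) and (6), which are not reproduced in the text: each variable is held as one bitmask per value, a joint count is the number of set bits in the AND of the masks, and the entropies in formula (4) follow from those counts. The function names, the base-2 logarithm and the toy data are all assumptions.

```python
from math import log2


def encode(values, n_levels):
    """One bitmask per level; bit s is set when sample s takes that level."""
    masks = [0] * n_levels
    for s, v in enumerate(values):
        masks[v] |= 1 << s
    return masks


def joint_probability(mask_a, mask_b, m):
    """p(a, b): AND the two masks and count the 1 bits."""
    return bin(mask_a & mask_b).count("1") / m


def entropy(mask_lists, m):
    """Joint entropy of the variables whose per-level masks are given."""
    cells = [(1 << m) - 1]  # start from "all samples"
    for masks in mask_lists:
        cells = [cell & mask for cell in cells for mask in masks]
    h = 0.0
    for cell in cells:
        count = bin(cell).count("1")
        if count:
            p = count / m
            h -= p * log2(p)
    return h


def mutual_information(class_masks, snp_mask_lists, m):
    """H(Class) + H(SNPs) - H(Class, SNPs), as in formula (4)."""
    return (entropy([class_masks], m)
            + entropy(snp_mask_lists, m)
            - entropy([class_masks] + snp_mask_lists, m))


# Toy data: SNP_B fully determines Class, SNP_C is constant.
cls = encode([0, 0, 1, 1], 2)
snp_b = encode([0, 0, 2, 2], 3)
snp_c = encode([1, 1, 1, 1], 3)
score = mutual_information(cls, [snp_b, snp_c], 4)
```

Because each joint count is a single AND plus a bit count, no per-sample loop is needed once the masks are built.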

(2) The node pairs are sorted from high to low by the calculated conditional mutual information between the k epistatic SNP sites and Class, and the top-N node pairs are taken, where N can be determined from experimental results. For each SNP node not included in the top-N node pairs, the node pair in which it first appears in the sorted list is selected and inserted into the top-N node pairs.

(3) All SNP sites are taken as nodes of the network; the top-N node pairs obtained above are traversed in turn, and the edge corresponding to each node pair is inserted into the network graph, thereby constructing the initial network graph.
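Sub-steps (2) and (3) can be sketched as below. The text does not state how an edge between two SNPs is oriented, so this example assumes, purely for illustration, that each edge points from the lower to the higher node index, which also keeps the graph acyclic. The names `select_pairs` and `build_initial_graph` are hypothetical.

```python
def select_pairs(scored_pairs, top_n):
    """scored_pairs: list of ((i, j), score). Return the chosen node pairs."""
    ranked = [pair for pair, _ in
              sorted(scored_pairs, key=lambda item: item[1], reverse=True)]
    selected = ranked[:top_n]
    covered = {node for pair in selected for node in pair}
    # For SNPs missing from the top-N pairs, add the first ranked pair
    # in which each of them appears.
    for pair in ranked[top_n:]:
        if not set(pair) <= covered:
            selected.append(pair)
            covered.update(pair)
    return selected


def build_initial_graph(pairs, n_snps):
    """Adjacency matrix with adj[i][j] == 1 meaning i is a parent of j."""
    adj = [[0] * n_snps for _ in range(n_snps)]
    for i, j in pairs:
        parent, child = min(i, j), max(i, j)  # assumed orientation
        adj[parent][child] = 1
    return adj


scored = [((0, 1), 0.9), ((1, 2), 0.5), ((0, 2), 0.4), ((2, 3), 0.1)]
pairs = select_pairs(scored, top_n=1)
graph = build_initial_graph(pairs, n_snps=4)
```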

3. When searching the space of Bayesian network structures, each individual in the genetic tabu algorithm adopted by the invention corresponds to one Bayesian network structure. Bayesian network individuals are represented by adjacency matrices: if the number of SNP sites in the network is n, each individual is represented as an n × n adjacency matrix C. A 0/1 encoding is used: if node i is the parent of node j, C_ij = 1; otherwise, C_ij = 0, as shown in FIG. 4.

The Bayesian network individuals in the initial population of the genetic algorithm are composed of SNP nodes and the edges between them. On the premise that no cycle is generated, the next network individual is generated from the initial network individual by randomly adding, deleting and reversing edges. A new network individual is then generated from that individual, and so on, until the number of network individuals reaches the initial population size. The genetic algorithm starts iterating from this initial population.
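A possible reading of this population-building step is sketched below, using the adjacency-matrix convention described above (entry [i][j] equal to 1 means i is a parent of j). The cycle test uses Kahn's topological sort, which is a choice made for this example; the text does not say how cycles are detected. All function names are hypothetical.

```python
import random


def is_acyclic(adj):
    """Kahn's algorithm: the graph is a DAG if every node can be removed."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in range(n):
            if adj[u][v]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return removed == n


def random_neighbor(adj, rng):
    """Apply one random add, delete or reverse that keeps the graph acyclic."""
    n = len(adj)
    while True:
        new = [row[:] for row in adj]
        i, j = rng.sample(range(n), 2)
        op = rng.choice(("add", "delete", "reverse"))
        if op == "add" and not new[i][j] and not new[j][i]:
            new[i][j] = 1
        elif op == "delete" and new[i][j]:
            new[i][j] = 0
        elif op == "reverse" and new[i][j]:
            new[i][j], new[j][i] = 0, 1
        else:
            continue  # operation not applicable here, draw again
        if is_acyclic(new):
            return new


def initial_population(seed_graph, size, rng):
    """Chain of individuals, each derived from the previous one."""
    population = [seed_graph]
    while len(population) < size:
        population.append(random_neighbor(population[-1], rng))
    return population


population = initial_population(
    [[0, 1, 0], [0, 0, 1], [0, 0, 0]], size=5, rng=random.Random(0))
```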

4. Starting from the initial network population, the Bayesian networks comprising the SNP sites are evolved through the genetic operations (selection, crossover and mutation) and the scoring mechanism of the Bayesian network to find an optimal network structure, so that the epistatic loci influencing phenotypic traits are obtained rapidly and accurately. In addition, to enhance the diversity of the population and obtain the global optimum, a tabu search strategy is applied to the crossover and mutation operations of the genetic algorithm, which accelerates the convergence of the algorithm.

(1) Selection operation

The selection operation picks good individuals from the current population to serve as parents of the next generation; its principle is that individuals with higher fitness are more likely to be selected. The networks are scored with the scoring method of the Bayesian network, the best-scoring Bayesian network individual is placed at the first position of the population, and network individuals are selected to enter the next generation by roulette-wheel selection. Let the fitness value of the i-th network be f_i; the probability P_i that individual i is selected is given by Eq. (7), where N represents the size of the population.
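A compact sketch of this selection step follows. It assumes the usual roulette-wheel rule P_i = f_i / (f_1 + ... + f_N), which is the standard form and is assumed here because the equation itself is not reproduced in the text, and it assumes all fitness values are positive (raw BIC scores are typically negative and would need shifting first). The function name is hypothetical.

```python
import random


def roulette_select(population, fitness, rng):
    """Keep the best individual first, then draw the rest proportionally."""
    best_index = max(range(len(population)), key=lambda i: fitness[i])
    total = sum(fitness)
    probabilities = [f / total for f in fitness]  # P_i = f_i / sum_j f_j
    drawn = rng.choices(population, weights=probabilities,
                        k=len(population) - 1)
    return [population[best_index]] + drawn


next_generation = roulette_select(["a", "b", "c"], [1.0, 3.0, 6.0],
                                  random.Random(0))
```

Placing the best individual at the first position means the top score can never decrease from one generation to the next.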

(2) Tabu crossover operation

Crossover can produce better individuals in the new generation, and the new individuals inherit the characteristics of their parents. To accelerate convergence, a multi-column crossover method is adopted. Whenever crossover changes a network, a check for cycle generation is performed. Suppose there are two network individuals in the population, Individual₁ and Individual₂. Two columns f₁, f₂ of Individual₁ and two columns s₁, s₂ of Individual₂ are selected at random. Column f₁ of Individual₁ is exchanged with column s₁ of Individual₂, and column f₂ of Individual₁ is exchanged with column s₂ of Individual₂.

This yields Individual₁[...s₁...s₂...] and Individual₂[...f₁...f₂...]. During the crossover it is checked whether a cycle is generated in either network; when neither Individual₁ nor Individual₂ contains a cycle, they are accepted as new offspring individuals. If an exchange would create a cycle, the exchange at that position is skipped and the next exchange is examined, until both columns have been processed, as shown in FIG. 5. The process in FIG. 5 is performed as follows:

<1> The second column of Individual₁ (I1.f2) and the first column of Individual₂ (I2.s1) are exchanged row by row, as shown in the "exchange first column" part of the figure. The crossover succeeds, giving two individuals whose first column pair has been exchanged.

<2> The third column of Individual₁ (I1.f3) and the third column of Individual₂ (I2.s3) are exchanged row by row, as shown in the "exchange second column" part of the figure. In the first row, exchanging the value 0 in the third column of Individual₁ with the value 1 in the third column of Individual₂ would create a cycle in Individual₁, as shown by the red circle in the figure. The algorithm therefore skips the first row and continues the exchange from the second row. The two final individuals are obtained once the exchange is complete.

Ordinary crossover may produce identical offspring in different generations, leading to local similarity among the chromosomes in the population, so that the search stalls and premature convergence easily occurs. As in FIG. 6, in the first iteration, I1.f1 of Individual₁ is exchanged with I2.s2 of Individual₂, and I1.f2 of Individual₁ with I2.s3 of Individual₂, giving the three new offspring network individuals shown in the second iteration. In the second iteration, I1.f2 of Individual₁ is exchanged with I3.p2 of Individual₃, and I1.f3 of Individual₁ with I3.p3 of Individual₃. The two offspring I1 and I3 obtained in the third iteration duplicate their parents, so the crossover produces no new offspring individual.

Because ordinary crossover easily leads to premature convergence, the memory function of tabu search is used: after crossover, each generated offspring network is compared one by one with the individuals in the tabu list. If the offspring does not belong to the tabu list, it enters the next generation and is stored in the tabu list. If it already belongs to the tabu list, the offspring is discarded and the tabu crossover operation is performed again until the generated offspring does not belong to the tabu list.
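The tabu crossover can be sketched as follows. This is an illustrative reading, not the patent's code: columns are exchanged entry by entry, an exchange that would create a cycle is undone, and offspring already in the tabu list are rejected. The retry cap `max_tries`, the use of a Python set of tuples as the tabu list, and all function names are assumptions added for the example.

```python
import random


def is_acyclic(adj):
    """Kahn's algorithm on an adjacency matrix (adj[i][j] == 1 is i -> j)."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in range(n):
            if adj[u][v]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return removed == n


def crossover(a, b, cols_a, cols_b):
    """Swap the chosen columns row by row, skipping cycle-creating swaps."""
    a = [row[:] for row in a]
    b = [row[:] for row in b]
    for ca, cb in zip(cols_a, cols_b):
        for r in range(len(a)):
            a[r][ca], b[r][cb] = b[r][cb], a[r][ca]
            if not (is_acyclic(a) and is_acyclic(b)):
                a[r][ca], b[r][cb] = b[r][cb], a[r][ca]  # undo this entry
    return a, b


def tabu_crossover(a, b, tabu, rng, max_tries=50):
    """Retry crossover until neither child is in the tabu list."""
    n = len(a)
    for _ in range(max_tries):
        children = crossover(a, b, rng.sample(range(n), 2),
                             rng.sample(range(n), 2))
        keys = [tuple(map(tuple, child)) for child in children]
        if all(key not in tabu for key in keys):
            tabu.update(keys)
            return children
    return a, b  # assumed fallback: keep the parents unchanged


parent_a = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
parent_b = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
tabu_list = set()
child_a, child_b = tabu_crossover(parent_a, parent_b, tabu_list,
                                  random.Random(1))
```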

(3) Tabu mutation operation

The mutation operation first selects an individual of the population at random and, with a given mutation probability P_m, performs edge addition, edge deletion and edge reversal on the selected network individual, thereby increasing the diversity of the population. As in FIG. 7, the mutation shown in the matrix corresponds to deleting the edge between node A and node C in the network. While ensuring that no cycle is generated, the mutation that increases the network score the most is selected, so as to obtain a better network structure. Ordinary mutation is highly random and easily destroys well-adapted network individuals; it also has poor hill-climbing ability and easily falls into local optima. The invention uses the memory function of tabu search: when a mutation produces an inferior solution, that solution is stored in the tabu list, and the search then continues from the current individual. This avoids revisiting the same solutions, helps the search escape local optima, improves the hill-climbing ability of the mutation operation, and so helps find better network individuals quickly.
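One way to read the tabu mutation is sketched below: with probability P_m, every acyclic single-edge change not already in the tabu list is scored, the best one is kept if it improves the score, and otherwise it is recorded in the tabu list and the current individual is retained. This interpretation, the `score` callback, the default `p_m` value and the function names are assumptions for illustration.

```python
import random


def is_acyclic(adj):
    """Kahn's algorithm on an adjacency matrix (adj[i][j] == 1 is i -> j)."""
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in range(n):
            if adj[u][v]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    ready.append(v)
    return removed == n


def neighbors(adj):
    """Yield every graph reachable by one edge add, delete or reversal."""
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            new = [row[:] for row in adj]
            if adj[i][j]:
                new[i][j] = 0
                yield new                       # delete i -> j
                reversed_edge = [row[:] for row in new]
                reversed_edge[j][i] = 1
                yield reversed_edge             # reverse i -> j
            elif not adj[j][i]:
                new[i][j] = 1
                yield new                       # add i -> j


def tabu_mutate(adj, score, tabu, rng, p_m=0.1):
    """Best-improvement mutation that remembers inferior moves."""
    if rng.random() >= p_m:
        return adj
    candidates = [g for g in neighbors(adj)
                  if is_acyclic(g) and tuple(map(tuple, g)) not in tabu]
    if not candidates:
        return adj
    best = max(candidates, key=score)
    if score(best) > score(adj):
        return best
    tabu.add(tuple(map(tuple, best)))  # remember the inferior solution
    return adj


# Toy score that prefers graphs with exactly two edges.
def toy_score(g):
    return -abs(sum(map(sum, g)) - 2)


mutated = tabu_mutate([[0, 1, 0], [0, 0, 0], [0, 0, 0]], toy_score,
                      set(), random.Random(0), p_m=1.0)
```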

5. Fitness function evaluation

The fitness function is the standard for judging the quality of network individuals; it determines which good individuals are retained and which poor individuals are eliminated. The genetic tabu algorithm computes fitness by scoring the Bayesian network and uses it as the basis for the evolutionary search; the quality of a network is judged with the commonly used BIC scoring method.

Bayesian scoring selects the Bayesian network structure with the maximum posterior probability given the prior knowledge and the sample data. With D denoting the sample data and G the Bayesian network structure, equation (7) follows from Bayes' formula, where P(G) represents the prior knowledge of the network structure.

P(G|D)＝P(D|G)P(G)/P(D) (7)

With θ_G denoting the parameters of the network structure, P(D|G) in equation (7) is expanded by marginalizing over θ_G to obtain equation (8).

P(D|G)＝∫P(D|G,θ_G)P(θ_G|G)dθ_G (8)

The BIC scoring method of the Bayesian network is then obtained, as shown in equation (9).

Here m denotes the total number of samples, n denotes the number of variables, r_i denotes the number of values of the i-th variable, q_i denotes the number of value combinations of the parent variables of the i-th variable, and m_ijk denotes the number of samples in which the i-th variable takes its k-th value and its parent variables take their j-th combination. Likewise, based on the Boolean data expressed in binary form, the BIC score is computed with logical AND operations, which makes the calculation fast.
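A hedged sketch of such a score is given below. Since equation (9) is not reproduced in the text, the code assumes the standard BIC form for discrete Bayesian networks, Σ_i Σ_j Σ_k m_ijk log(m_ijk / m_ij) − (log m / 2) Σ_i q_i (r_i − 1) with m_ij = Σ_k m_ijk, and the natural logarithm. The counts m_ijk are obtained by ANDing bitmasks and counting 1 bits; the function name and data layout are assumptions.

```python
from itertools import product
from math import log


def bic_score(adj, data, levels):
    """adj[p][i] == 1 means p is a parent of i; data[i] lists the values of
    variable i per sample; levels[i] is r_i, its number of values."""
    n, m = len(adj), len(data[0])
    all_samples = (1 << m) - 1
    # masks[i][k]: bit s is set when variable i takes value k in sample s.
    masks = [[sum(1 << s for s, v in enumerate(column) if v == k)
              for k in range(levels[i])]
             for i, column in enumerate(data)]
    total = 0.0
    for i in range(n):
        parents = [p for p in range(n) if adj[p][i]]
        q_i = 1
        for p in parents:
            q_i *= levels[p]
        for combo in product(*(range(levels[p]) for p in parents)):
            parent_mask = all_samples
            for p, value in zip(parents, combo):
                parent_mask &= masks[p][value]
            m_ij = bin(parent_mask).count("1")
            for k in range(levels[i]):
                m_ijk = bin(parent_mask & masks[i][k]).count("1")
                if m_ijk:
                    total += m_ijk * log(m_ijk / m_ij)
        total -= 0.5 * log(m) * q_i * (levels[i] - 1)  # complexity penalty
    return total


# Toy data in which the second variable copies the first.
toy_data = [[0, 0, 1, 1], [0, 0, 1, 1]]
with_edge = bic_score([[0, 1], [0, 0]], toy_data, [2, 2])
without_edge = bic_score([[0, 0], [0, 0]], toy_data, [2, 2])
```

On this toy data the network with the edge scores higher, because the gain in likelihood outweighs the extra parameter penalty.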

6. Algorithm termination

When the maximum iteration times are reached or the score value of the optimal individual is kept unchanged after k generations, the algorithm is ended. Otherwise, the selected, crossed and mutated new generation individuals are used for replacing the previous generation individuals, and the iteration execution is returned to continue.
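The stopping rule can be sketched as a small driver loop. The callbacks `step` and `best_score`, and the name `patience` for the number k of unchanged generations, are hypothetical placeholders for the selection, crossover and mutation machinery described above.

```python
def run_until_converged(step, best_score, max_iterations, patience):
    """Iterate until the iteration cap is reached or the best score has
    not improved for `patience` consecutive generations."""
    best = best_score()
    stale = 0
    for generation in range(1, max_iterations + 1):
        step()                      # selection, crossover, mutation
        current = best_score()
        if current > best:
            best, stale = current, 0
        else:
            stale += 1
        if stale >= patience:
            return generation, best
    return max_iterations, best


# Toy run: the best score follows a fixed sequence and then plateaus.
history = [1, 2, 3, 3, 3, 3, 3]
position = [0]
result = run_until_converged(
    step=lambda: position.__setitem__(0, position[0] + 1),
    best_score=lambda: history[position[0]],
    max_iterations=6,
    patience=3,
)
```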

7. Experiments demonstrate the efficiency of the epistatic site mining method based on genetic tabu search and Bayesian networks; the accuracy and efficiency of two-site and three-site epistasis detection are compared respectively.

The following embodiment applies the method of the invention to mine epistatic sites on data sets generated by the GAMETES software, and the related experiments illustrate in detail the efficiency of the method. GAMETES is software commonly used in the field for generating epistasis simulation data [4]. It can generate epistasis simulation data quickly and accurately, and produces specific two-site or multi-site epistasis models by varying its parameters. The parameters that can be set include the number of SNP sites, the heritability (h²), the minor allele frequency (MAF), and the prevalence. In the generated simulation data file, the first line contains the site names and the last column is the Class label, with 1 representing case and 0 representing control. Genotype data are expressed as 0, 1 and 2, where 0 denotes the homozygous common genotype, 1 the heterozygous genotype, and 2 the homozygous rare genotype.

The genetic tabu-optimized Bayesian network epistatic site mining method is denoted Epi-GTBN. The epistasis detection methods used for experimental comparison are BEAM, AntEpiSeeker, SNPRuler, MDR, BOOST and the Bayesian network learning method hill-climbing. By setting different heritability values h² (0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequencies MAF (0.1, 0.2, 0.3, 0.4), different data sets were generated with the GAMETES software; the data set for each parameter setting consists of 100 files. The accuracy of epistatic site mining is calculated with formula (10), where Num_edge denotes the number of data sets in which the target epistatic sites are detected.

Experiment 1. Comparison of accuracy and efficiency of 2-site epistasis detection

In this experiment, the accuracy of epistatic site mining is compared under different settings of heritability h² and minor allele frequency MAF. FIGS. 8 and 9 show the accuracy and efficiency comparison of 2-site epistasis mining for the different methods.

As can be seen from FIG. 8, for the different h² and MAF values, the 2-site epistasis detection accuracy of the BEAM and hill-climbing methods is far lower than that of the other 5 methods. The detection accuracy of the Epi-GTBN, MDR, BOOST and AntEpiSeeker methods is the highest, essentially 100%, and the detection accuracy of the SNPRuler method is slightly lower than that of these 4 methods.

As can be seen from FIG. 9, the AntEpiSeeker method takes the most time for 2-site epistasis detection, far more than the other 6 methods. BEAM, BOOST and SNPRuler take the least time, while MDR, hill-climbing and Epi-GTBN are intermediate; among these, Epi-GTBN takes less time than MDR and hill-climbing. In the Epi-GTBN method, genotype data is converted into Boolean data represented in binary form, so the conditional mutual information between nodes and the BIC score of a network can be calculated quickly with logical AND operations. This saves a significant amount of computing time when building the initial network and scoring networks.

In summary, the two-site epistasis detection accuracy of the MDR, BOOST and AntEpiSeeker methods is essentially the same as that of the Epi-GTBN method, at essentially 100%. However, the detection times of the MDR and AntEpiSeeker methods are significantly longer than that of the Epi-GTBN method. In addition, the parameter setting of the AntEpiSeeker method is complicated, and its results depend closely on the parameter settings. The BOOST method can only detect 2-site epistasis and cannot detect epistasis among more than two sites. The Epi-GTBN method therefore achieves better detection accuracy without compromising detection efficiency.

Experiment 2. Comparison of accuracy of 3-site epistasis detection

In this experiment, the accuracy of 3-site epistasis mining by the different methods is compared under different settings of heritability h² and minor allele frequency MAF, as shown in FIG. 10.

As can be seen from the experimental results, the 3-site epistasis detection accuracy in FIG. 10 is broadly consistent with the 2-site epistasis detection accuracy in FIG. 8. The BEAM and hill-climbing methods have the lowest epistasis detection accuracy. The detection accuracy of MDR and Epi-GTBN is the highest, essentially 100%, and higher than that of the SNPRuler method.

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. An epistatic site mining method based on genetic taboos and Bayesian networks is characterized by comprising the following steps:

step 1, expressing the SNP genotype data into data in the forms of 0, 1 and 2, wherein 0 represents the common genotype of homozygote, 1 represents heterozygote, and 2 represents the rare genotype of homozygote; acquiring a gene sample to be mined, dividing the gene sample into three groups of 0, 1 and 2 by taking the number of the sample as a unit, and converting genotype data into Boolean data 0 and 1 represented in a binary form;

2. The genetic taboo and bayesian network based epistatic site mining method according to claim 1, characterized by further comprising a method of judging the constructed network:

3. The genetic taboo and Bayesian network-based epistatic site mining method according to claim 2, wherein in steps 2 and 5, genotype data is converted into Boolean data represented in binary form, and the binary data is operated on directly with logical AND operations, thereby rapidly calculating the conditional mutual information between nodes and the BIC score of the Bayesian network.

4. The genetic taboo and Bayesian network-based epistatic site mining method according to claim 1, wherein the specific method for constructing the initial network map comprising SNP site pairs in step 2 is as follows:

5. The genetic taboo and bayesian network-based epistatic site mining method according to claim 1, wherein the specific method for evolution in step 4 is:

6. The genetic taboo and bayesian network-based epistatic site mining method according to claim 4, wherein the specific calculation method in step 2.1 is:

I(Class|SNP₁,...SNP_k)＝H(Class)+H(SNP₁,...SNP_k)-H(Class,SNP₁,...SNP_k)

7. The genetic taboo and Bayesian network-based epistatic site mining method according to claim 2, wherein the specific method for calculating the BIC score is as follows:

P(G|D)＝P(D|G)P(G)/P(D)

wherein P(G) represents the prior knowledge of the network structure;

P(D|G)＝∫P(D|G,θ_G)P(θ_G|G)dθ_G

and further obtaining a BIC scoring method of the Bayesian network: