CN109448794B - Genetic taboo and Bayesian network-based epistatic site mining method - Google Patents

Publication number
CN109448794B
Authority
CN
China
Prior art keywords
network
snp
genetic
epistatic
bayesian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811287261.0A
Other languages
Chinese (zh)
Other versions
CN109448794A (en
Inventor
刘建晓
果杨
钟芷漫
杨晨
胡江峰
蒋雅玲
梁子珍
高辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Agricultural University
Original Assignee
Huazhong Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong Agricultural University filed Critical Huazhong Agricultural University
Priority to CN201811287261.0A priority Critical patent/CN109448794B/en
Publication of CN109448794A publication Critical patent/CN109448794A/en
Application granted granted Critical
Publication of CN109448794B publication Critical patent/CN109448794B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Genetics & Genomics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Physiology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an epistatic site mining method based on genetic tabu search and Bayesian networks, which comprises the following steps: 1. converting the genotype data into Boolean data represented in binary form; 2. rapidly calculating the conditional mutual information between any SNP site pair and the phenotype by using logical AND operations, taking out the top-N node pairs, and constructing an initial network graph containing the SNP sites; 3. generating new individuals by randomly adding, deleting and reversing edges based on the initial network individual until the number of network individuals reaches the population size; 4. evolving the Bayesian network structure through the three operations of a genetic algorithm and the scoring mechanism of the Bayesian network, finding an optimal network structure, and thereby quickly and accurately obtaining the epistatic loci affecting phenotypic traits. The invention can help biological researchers obtain the epistatic loci influencing specific phenotypic traits, thereby assisting gene function mining and providing a reference for analyzing the genetic basis of complex quantitative traits in different species.

Description

Genetic taboo and Bayesian network-based epistatic site mining method
Technical Field
The invention relates to the technical field of bioinformatics, in particular to an epistatic site mining method based on genetic tabu search and Bayesian networks.
Background
As living standards and medical care have improved, diseases determined solely by environmental factors (such as infectious diseases and malnutrition) have largely been brought under control, and complex diseases and Mendelian genetic diseases have become the major diseases affecting human health. A Mendelian genetic disease is a single-gene disease whose inheritance follows Mendel's laws; researchers have used positional cloning to identify the responsible genes and have largely clarified the mode of inheritance of such diseases. Complex diseases account for more than about 80% of human diseases and cause great harm to human health. They include common chronic diseases such as asthma, cancer, diabetes, hypertension, senile dementia, rheumatoid arthritis, schizophrenia, heart disease, cardiovascular disease, obesity and tumors. The etiology of complex diseases is very complex, involving multiple factors such as the environment, genes, and their interactions. Elucidating the causes and genetic mechanisms of complex diseases is therefore urgently needed, both to provide a scientific basis for their diagnosis and treatment and to safeguard human health.
From the point of view of genetics, the genetic factors determining the complex traits of organisms fall into three categories: main effects of genes, gene-gene interactions, and gene-environment interactions. A large body of experimental research indicates that the main factor controlling complex biological traits is the interaction between genes. Gene-gene interaction, also known as epistasis, is primarily manifested as interaction between SNPs. Meanwhile, with the rapid development of high-throughput technology, a huge amount of biological data is now being generated. Using genome-wide association studies (GWAS) to screen genome-scale data for SNPs that are significantly associated with diseases, and thereby explain the genetic mechanism of complex diseases, is a hot topic in current bioinformatics research. GWAS methods focus mainly on detecting major genes; although they have found many phenotype-associated sites, previous studies can explain only a small proportion of the genetic variation. One of the most important reasons is that these studies neglect gene-gene interactions, i.e., epistasis. Mining epistatic sites is therefore a principal means of explaining the genetic mechanism of complex diseases. However, current epistasis detection methods still suffer from heavy computation, high algorithmic complexity, low efficiency and high false positive rates, so the SNP sites and SNP combinations associated with diseases cannot be detected accurately and efficiently. Providing a more effective and accurate genome-wide epistasis detection algorithm is therefore of great research significance and is also important for discovering the pathogenesis of complex diseases and for their diagnosis, treatment and prevention.
Disclosure of Invention
The invention aims to solve the technical problem of providing an epistatic site mining method based on genetic tabu search and Bayesian networks that addresses the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problem is as follows:
the invention provides an epistatic site mining method based on genetic tabu search and Bayesian networks, which comprises the following steps:
step 1, representing the SNP genotype data as the values 0, 1 and 2, wherein 0 represents the homozygous common genotype, 1 represents the heterozygous genotype, and 2 represents the homozygous rare genotype; acquiring the gene samples to be mined, dividing them, indexed by sample number, into three groups corresponding to genotypes 0, 1 and 2, and converting the genotype data into Boolean data (0 and 1) represented in binary form;
step 2, calculating the conditional mutual information between any SNP site pair and the phenotypic trait based on information entropy theory, sorting the node pairs by the calculated mutual information, taking out the top-N node pairs, and constructing an initial network graph containing the SNP site pairs;
step 3, on the premise that no cycle is created, generating the next network individual by randomly adding, deleting and reversing edges of the initial network individual, and then generating a new network individual on the basis of that next network individual; repeating this operation until the number of network individuals reaches the initial population size;
step 4, evolving the initial network population obtained in step 3, whose individuals are Bayesian networks comprising SNP sites, through the three operations (selection, crossover and mutation) of a genetic algorithm optimized by tabu search together with the scoring mechanism of the Bayesian network, finding the optimal network structure, and thereby obtaining the epistatic loci influencing phenotypic traits.
Further, the method of the present invention also includes a method of evaluating the constructed network:
step 5, adopting a fitness function as the standard for judging the quality of a network individual, and adopting the BIC scoring method to judge the quality of the network.
Furthermore, in step 2 and step 5 of the invention, the genotype data is converted into Boolean data represented in binary form, and the binary data is operated on directly with logical AND operations, so that the conditional mutual information between nodes and the BIC score of the Bayesian network can be calculated rapidly.
Further, the specific method for constructing the initial network graph containing the SNP site pairs in step 2 of the present invention is as follows:
step 2.1, setting the number nlocus of epistatic loci to be mined, enumerating the combinations of nlocus sites among all sites, and, based on information entropy theory, rapidly calculating the conditional mutual information between each combination of nlocus sites and the phenotypic trait by using logical AND operations;
step 2.2, sorting the node pairs by the calculated conditional mutual information and taking out the top-N node pairs, wherein the value of N is determined from experimental results; for each SNP site not contained in the top-N node pairs, selecting the first node pair in the sorted list in which it appears, and inserting that node pair into the top-N node pairs;
step 2.3, taking all SNP sites as nodes in the network, and inserting the edges corresponding to the top-N node pairs obtained in step 2.2 into the network graph to construct the initial network graph.
Further, the specific method for performing the evolution in step 4 of the present invention is:
step 4.1, selection operation; scoring each network with the scoring method of the Bayesian network, placing the Bayesian network individual with the highest score at the first position of the population, and selecting network individuals to enter the next generation by roulette-wheel selection;
step 4.2, tabu crossover operation; evolving two networks with a multi-column crossover method while checking for cycles; to avoid the premature convergence caused by ordinary crossover, using the memory function of tabu search to compare each offspring network produced by the crossover with the individuals in the tabu list; if the offspring network does not belong to the tabu list, it enters the next generation and is stored in the tabu list; if it already belongs to the tabu list, the offspring is discarded and the tabu crossover operation is performed again until the generated offspring does not belong to the tabu list;
step 4.3, tabu mutation operation; adding, deleting and reversing edges of a network individual with a given mutation probability, and selecting the mutation that increases the network score the most, so as to obtain an optimized network structure; using the memory function of tabu search, inferior solutions produced by mutation are stored in the tabu list so that the current fitness value can be improved.
Further, the specific calculation method in step 2.1 of the present invention is:
when mining k epistatic SNP sites affecting the phenotype Class, I(Class|SNP1,...,SNPk) denotes the conditional mutual information between the k epistatic SNP sites and the phenotype Class, calculated as:
$$I(\mathrm{Class} \mid \mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = H(\mathrm{Class}) + H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) - H(\mathrm{Class},\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k)$$
the information entropy H(Class) of Class is calculated as:
$$H(\mathrm{Class}) = -\sum_{c} p(c)\,\log p(c)$$
the information entropy H(SNP1,...,SNPk) of the k SNP sites is calculated as:
$$H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = -\sum_{s_1,\ldots,s_k} p(s_1,\ldots,s_k)\,\log p(s_1,\ldots,s_k)$$
where c ranges over the values of Class and each s_i ranges over the genotype values 0, 1 and 2.
Further, the specific method for calculating the BIC score of the invention is as follows:
with D denoting the sample data and G denoting the Bayesian network structure, Bayes' theorem gives:
P(G|D)=P(D|G)P(G)/P(D)
wherein P(G) represents the prior knowledge of the network structure;
with θ_G denoting the parameters of the network structure, marginalizing over θ_G expands the likelihood in the above equation:
P(D|G)=∫P(D|G,θ_G)P(θ_G|G)dθ_G
from which the BIC score of the Bayesian network is obtained:
$$\mathrm{BIC}(G \mid D) = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i} m_{ijk}\,\log\frac{m_{ijk}}{m_{ij}} - \frac{\log m}{2}\sum_{i=1}^{n} q_i\,(r_i - 1)$$
where m denotes the total number of samples, n denotes the number of variables, r_i denotes the number of values of the i-th variable, q_i denotes the number of value combinations of the parent variables of the i-th variable, m_ijk denotes the number of samples in which the i-th variable takes its k-th value and its parent variables take their j-th combination, and m_ij is the sum of m_ijk over k.
The invention has the following beneficial effects: in the epistatic site mining method based on genetic tabu search and Bayesian networks, the genotype data is first converted into Boolean data represented in binary form, and the conditional mutual information between any SNP site pair and the phenotype is rapidly calculated with logical AND operations. The SNP site pairs are sorted by the calculated mutual information, the top-N node pairs are taken out, and an initial network graph containing the SNP sites is constructed. Then, on the premise that no cycle is created, new individuals are generated from the initial network individual by randomly adding, deleting and reversing edges until the population size is reached. The memory mechanism of the tabu search algorithm is applied to the evolutionary operations of the genetic algorithm; the Bayesian network structure is evolved using the three operations of the genetic algorithm (selection, crossover and mutation) and the BIC scoring method of the Bayesian network, the optimal network structure is found, and the epistatic loci influencing phenotypic traits are obtained rapidly and accurately, which assists gene function mining.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a representation of genotype data;
FIG. 3 is a representation of binary Boolean data;
FIG. 4 is a representation of a Bayesian network structure matrix code;
FIG. 5 shows the process for avoiding the creation of cycles;
FIG. 6 is a schematic of a crossover operation that produces no new individuals;
FIG. 7 shows the Bayesian network mutation operation;
FIG. 8 is a comparison of the accuracy of 2-locus epistasis detection by different methods;
FIG. 9 is a comparison of the efficiency of 2-locus epistasis detection by different methods;
FIG. 10 is a comparison of the accuracy of 3-locus epistasis detection by different methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
1. Genotype data is expressed in the form of 0, 1, 2. For example, for a SNP with alleles A and T, AA is represented by 0, TT by 2, and AT/TA by 1. FIG. 2 shows the genotype data of 11 SNPs (SNP_A to SNP_K) for the corresponding 4 samples; the last column, Class, indicates the phenotypic trait, where Class = 1 denotes a case (diseased) and Class = 0 denotes a control. To speed up the subsequent calculation of conditional mutual information between nodes and of Bayesian network scores, the genotype data is converted into Boolean data (0/1) represented in binary form. The conversion of the genotype data of SNP_A to SNP_D in FIG. 2 into binary Boolean form is shown in FIG. 3.
In FIG. 3, the first 4 columns are binary representations with genotype 0, the middle 4 columns are binary representations with genotype 1, and the last 4 columns are binary representations with genotype 2. When the genotype of a specific SNP in a certain sample is 0, the binary value of the corresponding position in the first 4 columns is represented by 1, as shown by the corresponding data in the boxes of FIGS. 2 and 3.
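The encoding described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `encode_snp` and the sample genotypes are hypothetical, and each genotype value is stored as one integer bitmask in which bit s corresponds to sample s.

```python
def encode_snp(genotypes):
    """Return three bitmasks, one per genotype value (0, 1, 2).

    Bit s of masks[g] is 1 when sample s carries genotype g.
    """
    masks = [0, 0, 0]
    for sample, g in enumerate(genotypes):
        masks[g] |= 1 << sample
    return masks


# Hypothetical genotypes of one SNP across 4 samples.
masks = encode_snp([0, 2, 1, 0])
# Bit strings read right to left: the rightmost bit is sample 0.
print([format(m, "04b") for m in masks])  # ['1001', '0100', '0010']
```

Storing each genotype group as a machine integer is one possible layout; it lets later steps count co-occurrences with a single AND followed by a bit count.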
2. Based on the genotype data represented in binary Boolean form, the conditional mutual information between SNP site pairs and the phenotypic trait is calculated using information entropy theory, the node pairs are sorted, the top-N node pairs are taken out, and an initial network graph containing the SNP sites is constructed.
(1) When k epistatic SNP sites affecting a phenotype Class are mined, formula (1) is used to calculate the conditional mutual information between the k SNP sites and Class. The information entropy H(Class) in formula (1) is calculated with formula (2), and the information entropy H(SNP1,...,SNPk) of the k SNP sites in formula (1) is calculated with formula (3).
$$I(\mathrm{Class} \mid \mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = H(\mathrm{Class}) + H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) - H(\mathrm{Class},\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) \tag{1}$$
$$H(\mathrm{Class}) = -\sum_{c} p(c)\,\log p(c) \tag{2}$$
$$H(\mathrm{SNP}_1,\ldots,\mathrm{SNP}_k) = -\sum_{s_1,\ldots,s_k} p(s_1,\ldots,s_k)\,\log p(s_1,\ldots,s_k) \tag{3}$$
Here c ranges over the values of Class and each s_i ranges over the genotype values 0, 1 and 2.
Based on the genotype data expressed in binary form, the conditional mutual information between nodes can be rapidly calculated through logical AND operations. For example, the conditional mutual information I(Class|SNP_B,SNP_C) for M_bit in FIG. 3 is calculated using formula (4).
$$I(\mathrm{Class} \mid \mathrm{SNP}_B,\mathrm{SNP}_C) = H(\mathrm{Class}) + H(\mathrm{SNP}_B,\mathrm{SNP}_C) - H(\mathrm{Class},\mathrm{SNP}_B,\mathrm{SNP}_C) \tag{4}$$
Formula (5) is used to calculate H(SNP_B,SNP_C) in formula (4).
$$H(\mathrm{SNP}_B,\mathrm{SNP}_C) = -\sum_{b=0}^{2}\sum_{c=0}^{2} p(b,c)\,\log p(b,c) \tag{5}$$
When evaluating formula (5), binary logical AND operations can be applied directly, and each probability is obtained by counting the number of 1 bits in the resulting binary string. Using the underlined binary values of M_bit in FIG. 3, p(1,1) in formula (5) is calculated with formula (6):
Figure BDA0001849343330000072
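The AND-and-count computation of H(Class) + H(SNP_B, SNP_C) - H(Class, SNP_B, SNP_C) can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names (`encode`, `popcount`, `entropy`, `mutual_information`) are hypothetical, logarithms are taken base 2, and each genotype or class group is held as an integer bitmask with one bit per sample.

```python
from math import log2


def popcount(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def encode(values, n_levels):
    """One bitmask per level; bit s is set when sample s has that level."""
    masks = [0] * n_levels
    for s, v in enumerate(values):
        masks[v] |= 1 << s
    return masks


def entropy(counts, m):
    """Shannon entropy (bits) of a distribution given as counts out of m."""
    return -sum(c / m * log2(c / m) for c in counts if c)


def mutual_information(snp_a, snp_b, cls):
    """H(Class) + H(SNP_a, SNP_b) - H(Class, SNP_a, SNP_b) via bitwise AND."""
    m = len(cls)
    a, b, c = encode(snp_a, 3), encode(snp_b, 3), encode(cls, 2)
    pair = [x & y for x in a for y in b]  # 9 joint-genotype masks
    h_class = entropy([popcount(x) for x in c], m)
    h_pair = entropy([popcount(x) for x in pair], m)
    h_joint = entropy([popcount(x & y) for x in pair for y in c], m)
    return h_class + h_pair - h_joint


# Hypothetical XOR-like data: neither SNP alone predicts Class, the pair does.
snp_a, snp_b, cls = [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]
print(mutual_information(snp_a, snp_b, cls))  # 1.0
```

In this toy data set the score for the pair is 1 bit while a single SNP paired with itself scores 0, which is the kind of purely interactive signal the pairwise ranking is meant to expose.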
(2) The node pairs are sorted from high to low by the calculated conditional mutual information between the k epistatic SNP sites and Class, and the top-N node pairs are taken out, where the value of N can be determined from experimental results. For each SNP node not included in the top-N node pairs, the first node pair in the sorted list that contains it is selected and inserted into the top-N node pairs.
(3) All SNP sites are taken as nodes in the network; the top-N node pairs obtained above are traversed, and the edge corresponding to each node pair is inserted into the network graph in turn to construct the initial network graph.
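The pair-selection rule in (2) and (3) can be sketched as below. The function name `select_pairs` and the example scores are hypothetical; the sketch returns the chosen node pairs, which would then be inserted as edges into the initial graph.

```python
def select_pairs(scored_pairs, top_n):
    """Pick the top-N pairs, then add the best-ranked pair for every SNP
    that the top-N pairs do not yet cover.

    scored_pairs: list of ((snp_i, snp_j), score) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    chosen = [pair for pair, _ in ranked[:top_n]]
    covered = {snp for pair in chosen for snp in pair}
    for pair, _ in ranked[top_n:]:
        # The first pair (in rank order) containing an uncovered SNP is kept.
        if any(snp not in covered for snp in pair):
            chosen.append(pair)
            covered.update(pair)
    return chosen


# Hypothetical scores for four SNPs A-D.
scores = [(("A", "B"), 0.9), (("A", "C"), 0.5),
          (("C", "D"), 0.4), (("B", "C"), 0.3)]
print(select_pairs(scores, top_n=1))
# [('A', 'B'), ('A', 'C'), ('C', 'D')]
```

Scanning the remainder of the ranked list once is enough, because the first pair that mentions an uncovered SNP is by construction its highest-ranked pair.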
3. When searching the space of Bayesian network structures, each individual in the genetic tabu algorithm adopted by the invention corresponds to one Bayesian network structure. Bayesian network individuals are represented by adjacency matrices: if the number of SNP sites in the network is n, each individual is represented as an n × n adjacency matrix C. With the 0/1 encoding scheme, if node i is a parent of node j then C_ij = 1; otherwise C_ij = 0, as shown in FIG. 4.
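A minimal sketch of this adjacency-matrix encoding, together with the cycle check that later steps rely on, is given below. The check uses Kahn's topological-sort algorithm, which is an assumption of this sketch; the text does not state which cycle detection method is used.

```python
def has_cycle(adj):
    """adj[i][j] == 1 means node i is a parent of node j.

    Kahn's algorithm: repeatedly remove nodes with no remaining parents;
    if some node can never be removed, the graph contains a cycle.
    """
    n = len(adj)
    indegree = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    stack = [j for j in range(n) if indegree[j] == 0]
    removed = 0
    while stack:
        i = stack.pop()
        removed += 1
        for j in range(n):
            if adj[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    stack.append(j)
    return removed < n


# Hypothetical 3-node network: 0 -> 1 -> 2.
dag = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]
print(has_cycle(dag))  # False

dag[2][0] = 1          # adding 2 -> 0 closes the loop 0 -> 1 -> 2 -> 0
print(has_cycle(dag))  # True
```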
The Bayesian network individuals in the initial population of the genetic algorithm consist of SNP nodes and the edges between them. On the premise that no cycle is created, the next network individual is generated by randomly adding, deleting or reversing an edge of the initial network individual. A further network individual is then generated from that one, and so on, until the number of network individuals reaches the initial population size. The genetic algorithm starts iterating from this initial population.
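The chained generation of the initial population can be sketched as follows. The names `is_acyclic`, `random_neighbor` and `initial_population` are hypothetical, one random edge operation is applied per new individual (an assumption), and the seeded `random.Random` is used only to make the illustration reproducible.

```python
import random


def is_acyclic(adj):
    """Repeatedly strip nodes that have no parents among the remaining nodes."""
    remaining = set(range(len(adj)))
    while remaining:
        roots = [j for j in remaining
                 if not any(adj[i][j] for i in remaining)]
        if not roots:
            return False
        remaining -= set(roots)
    return True


def random_neighbor(adj, rng):
    """Apply one random add / delete / reverse that keeps the graph acyclic."""
    n = len(adj)
    while True:
        new = [row[:] for row in adj]
        i, j = rng.sample(range(n), 2)
        op = rng.choice(("add", "delete", "reverse"))
        if op == "add" and not new[i][j] and not new[j][i]:
            new[i][j] = 1
        elif op == "delete" and new[i][j]:
            new[i][j] = 0
        elif op == "reverse" and new[i][j]:
            new[i][j], new[j][i] = 0, 1
        else:
            continue  # operation not applicable to this node pair, retry
        if is_acyclic(new):
            return new


def initial_population(seed_network, size, rng=None):
    """Each new individual is derived from the previously generated one."""
    rng = rng or random.Random(0)
    population = [seed_network]
    while len(population) < size:
        population.append(random_neighbor(population[-1], rng))
    return population


seed = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
population = initial_population(seed, size=5)
print(len(population), all(is_acyclic(p) for p in population))  # 5 True
```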
4. Starting from the initial network population, the Bayesian networks comprising the SNP sites are evolved through the genetic operations (selection, crossover and mutation) and the scoring mechanism of the Bayesian network; the optimal network structure is found, and the epistatic loci influencing phenotypic traits are thereby obtained rapidly and accurately. Meanwhile, to enhance population diversity and reach the global optimum, a tabu search strategy is applied to the crossover and mutation operations of the genetic algorithm, which accelerates the convergence of the algorithm.
(1) Selection operation
The selection operation picks good individuals from the current population to serve as parents that breed the next generation; its principle is that individuals with higher fitness are more likely to be selected. Each network is scored with the scoring method of the Bayesian network, the Bayesian network individual with the highest score is placed at the first position of the population, and network individuals are selected to enter the next generation by roulette-wheel selection. Let the fitness value of the i-th network be f_i; then individual i is selected with probability P_i as shown in Eq. (7), where N represents the size of the population.
$$P_i = \frac{f_i}{\sum_{j=1}^{N} f_j} \tag{7}$$
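Roulette-wheel selection with the best individual kept at the first position can be sketched as follows, assuming the selection probability of individual i is its fitness divided by the total fitness of the population. The function names are hypothetical, and the sketch further assumes all fitness values are positive; raw BIC scores are typically negative, so some shift or rescaling (not specified in the text) would be needed before they are used this way.

```python
import random


def selection_probabilities(fitness):
    """P_i = f_i / sum_j f_j; assumes every fitness value is positive."""
    total = sum(fitness)
    return [f / total for f in fitness]


def select_next_generation(population, fitness, rng=None):
    """Keep the best individual first, fill the rest by roulette wheel."""
    rng = rng or random.Random(0)
    best = max(range(len(population)), key=fitness.__getitem__)
    probs = selection_probabilities(fitness)
    rest = rng.choices(population, weights=probs, k=len(population) - 1)
    return [population[best]] + rest


print(selection_probabilities([1, 3]))  # [0.25, 0.75]
next_gen = select_next_generation(["net_a", "net_b", "net_c"], [1.0, 5.0, 2.0])
print(next_gen[0])                      # net_b
```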
(2) Tabu crossover operation
Better individuals can be obtained in the new generation through the crossover operation, and the new individuals inherit characteristics of their parents. To accelerate convergence, a multi-column crossover method is adopted. Whenever a crossover changes a network, cycle detection is performed. Suppose Individual1 and Individual2 are two network individuals in the population. Two columns f1, f2 of Individual1 and two columns s1, s2 of Individual2 are selected at random. Column f1 of Individual1 is exchanged with column s1 of Individual2, and column f2 of Individual1 with column s2 of Individual2. That is, from
Individual1[...f1...f2...], Individual2[...s1...s2...]
we obtain
Individual1[...s1...s2...], Individual2[...f1...f2...].
During the crossover operation, the networks are checked for cycles. When neither Individual1 nor Individual2 contains a cycle, they are considered new offspring individuals. If an exchange would create a cycle, the exchange at that position is skipped and the next exchange is examined, until both columns have been processed, as shown in FIG. 5. The process in FIG. 5 proceeds as follows:
<1> The second column of Individual1 (I1.f2) and the first column of Individual2 (I2.s1) are exchanged row by row, as shown in the "exchange first column" part of the figure. The crossover succeeds, giving two individuals whose first column pair has been crossed.
<2> The third column of Individual1 (I1.f3) and the third column of Individual2 (I2.s3) are exchanged row by row, as shown in the "exchange second column" part of the figure. In the first row, exchanging the value 0 in the third column of Individual1 with the value 1 in the third column of Individual2 would create a cycle in Individual1, as shown in the red-circled part of the figure. The algorithm therefore skips the first row and continues the exchange from the second row. The two final individuals are obtained once the exchange is complete.
Ordinary crossover operations may produce identical offspring across different generations of the population, making chromosomes in the population locally similar, which stalls the search and easily leads to premature convergence. As in FIG. 6, in the first iteration I1.f1 of Individual1 is crossed with I2.s2 of Individual2, and I1.f2 of Individual1 with I2.s3 of Individual2, giving the three new offspring network individuals shown in the second iteration. In the second iteration, I1.f2 of Individual1 is crossed with I3.p2 of Individual3, and I1.f3 of Individual1 with I3.p3 of Individual3. The two offspring I1 and I3 obtained in the third iteration duplicate their parents, so the crossover produces no new offspring.
Because ordinary crossover easily leads to premature convergence, the memory function of tabu search is used: after each crossover, the generated offspring networks are compared one by one with the individuals in the tabu list. If an offspring network does not belong to the tabu list, it enters the next generation and is stored in the tabu list. If it already belongs to the tabu list, the offspring is discarded and the tabu crossover operation is performed again until the generated offspring does not belong to the tabu list.
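The column exchange with cycle skipping and the tabu-list check can be sketched as follows. All names are hypothetical. Two details are assumptions of this sketch rather than statements from the text: the tabu list is modeled as a Python set of matrix tuples, and `max_tries` caps the retries so the loop always terminates.

```python
import random


def is_acyclic(adj):
    """Repeatedly strip nodes that have no parents among the remaining nodes."""
    remaining = set(range(len(adj)))
    while remaining:
        roots = [j for j in remaining
                 if not any(adj[i][j] for i in remaining)]
        if not roots:
            return False
        remaining -= set(roots)
    return True


def crossover(parent1, parent2, column_pairs):
    """Swap column f of parent1 with column s of parent2, row by row,
    undoing any single-entry swap that would create a cycle."""
    c1 = [row[:] for row in parent1]
    c2 = [row[:] for row in parent2]
    for f, s in column_pairs:
        for row in range(len(c1)):
            a, b = c1[row][f], c2[row][s]
            c1[row][f], c2[row][s] = b, a
            if not (is_acyclic(c1) and is_acyclic(c2)):
                c1[row][f], c2[row][s] = a, b  # skip this position
    return c1, c2


def tabu_crossover(parent1, parent2, tabu, rng, max_tries=100):
    """Retry the crossover until neither child is already in the tabu set."""
    n = len(parent1)
    for _ in range(max_tries):
        pairs = [(rng.randrange(n), rng.randrange(n)) for _ in range(2)]
        children = crossover(parent1, parent2, pairs)
        keys = [tuple(map(tuple, child)) for child in children]
        if all(key not in tabu for key in keys):
            tabu.update(keys)
            return children
    return None  # no unseen offspring found within max_tries


p1 = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]  # 0 -> 1
p2 = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]  # 1 -> 2
c1, c2 = crossover(p1, p2, [(1, 2)])
print(c1)  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(c2)  # [[0, 0, 1], [0, 0, 1], [0, 0, 0]]
```

In the example, the swap in row 1 would place a 1 on the diagonal of the first child (a self-loop), so that position is skipped while the other rows are exchanged.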
(3) Tabu mutation operation
The mutation operation first randomly selects an individual from the population and, with a given mutation probability P_m, adds, deletes or reverses edges of the selected network individual, thereby increasing the diversity of the population. As in FIG. 7, the mutation in the matrix corresponds to deleting the edge between node A and node C in the network. Subject to no cycle being created, the mutation that increases the network score the most is selected, so as to obtain a better network structure. Ordinary mutation is highly random and can easily damage network individuals with good fitness; it also has weak hill-climbing ability and is prone to falling into local optima. The invention uses the memory function of tabu search: when a mutation produces an inferior solution, that solution is stored in the tabu list, and the search then continues from the current state. This avoids revisiting the same solutions, helps the search escape local optima, improves the hill-climbing ability of the mutation operation, and thus helps to find better network individuals quickly.
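One possible reading of the tabu mutation step is sketched below: all single-edge changes are enumerated, the acyclic candidate with the largest score gain is kept, and candidates that score worse than the current network are recorded in the tabu set. The function names, the set-based tabu list and the toy scoring function are assumptions made for illustration; in the method itself the score would be the BIC score.

```python
import random


def is_acyclic(adj):
    """Repeatedly strip nodes that have no parents among the remaining nodes."""
    remaining = set(range(len(adj)))
    while remaining:
        roots = [j for j in remaining
                 if not any(adj[i][j] for i in remaining)]
        if not roots:
            return False
        remaining -= set(roots)
    return True


def neighbors(adj):
    """Yield every network reachable by one edge addition, deletion or reversal."""
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            new = [row[:] for row in adj]
            if adj[i][j]:
                new[i][j] = 0
                yield [row[:] for row in new]  # delete i -> j
                new[j][i] = 1
                yield new                      # reverse to j -> i
            elif not adj[j][i]:
                new[i][j] = 1
                yield new                      # add i -> j


def tabu_mutate(adj, score, tabu, p_m, rng):
    """With probability p_m, move to the best-scoring acyclic neighbor;
    neighbors worse than the current network are stored in the tabu set."""
    if rng.random() >= p_m:
        return adj
    current = score(adj)
    best, best_score = adj, current
    for cand in neighbors(adj):
        key = tuple(map(tuple, cand))
        if key in tabu or not is_acyclic(cand):
            continue
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
        elif s < current:
            tabu.add(key)  # remember inferior solutions
    return best


# Toy score: negative distance to a hypothetical target structure 0 -> 1.
target = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
score = lambda a: -sum(abs(a[i][j] - target[i][j])
                       for i in range(3) for j in range(3))
empty = [[0] * 3 for _ in range(3)]
tabu = set()
print(tabu_mutate(empty, score, tabu, p_m=1.0, rng=random.Random(0)))
# [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
print(len(tabu))  # 5
```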
5. Fitness function evaluation.
The fitness function is the standard for judging the quality of network individuals; it determines which good individuals are retained and which poor individuals are eliminated. The genetic tabu algorithm computes fitness by scoring the Bayesian network and uses it as the basis for the evolutionary search; the widely used BIC scoring method is adopted to judge network quality.
Bayesian scoring selects the Bayesian network structure with the maximum posterior probability given the prior knowledge and the sample data. With D denoting the sample data and G the Bayesian network structure, formula (7) follows from Bayes' theorem, where P(G) represents the prior knowledge of the network structure.
P(G|D)=P(D|G)P(G)/P(D) (7)
With θ_G denoting the parameters of the network structure, the likelihood P(D|G) in formula (7) is expanded by marginalizing over θ_G to obtain formula (8).
P(D|G)=∫P(D|G,θG)P(θG|G)dθG (8)
And further obtaining a BIC scoring method of the Bayesian network, as shown in the formula (9).
$$\mathrm{BIC}(G \mid D) = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i} m_{ijk}\,\log\frac{m_{ijk}}{m_{ij}} - \frac{\log m}{2}\sum_{i=1}^{n} q_i\,(r_i - 1) \tag{9}$$
Here m denotes the total number of samples, n denotes the number of variables, r_i denotes the number of values of the i-th variable, q_i denotes the number of value combinations of the parent variables of the i-th variable, m_ijk denotes the number of samples in which the i-th variable takes its k-th value and its parent variables take their j-th combination, and m_ij is the sum of m_ijk over k. As with the mutual information, the BIC score can be computed rapidly by applying logical AND operations to the Boolean data expressed in binary form.
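A BIC score built from the counts m_ijk can be sketched as below. The sketch assumes the standard BIC form for discrete Bayesian networks (log-likelihood term minus (log m)/2 times the number of free parameters, the sum of q_i(r_i - 1)); whether this matches formula (9) term for term is an assumption. For readability it counts with `collections.Counter` rather than with the bitwise AND trick, and the names `bic_score` and `arities` are hypothetical.

```python
from collections import Counter
from math import log


def bic_score(data, adj, arities):
    """BIC of a discrete Bayesian network.

    data:    list of sample tuples, data[s][i] is the value of variable i
    adj:     adjacency matrix, adj[p][i] == 1 means p is a parent of i
    arities: arities[i] is r_i, the number of values variable i can take
    """
    m, n = len(data), len(adj)
    score = 0.0
    for i in range(n):
        parents = [p for p in range(n) if adj[p][i]]
        q_i = 1
        for p in parents:
            q_i *= arities[p]
        # m_ijk: counts of (parent configuration j, value k); m_ij: counts of j.
        m_ijk = Counter((tuple(row[p] for p in parents), row[i]) for row in data)
        m_ij = Counter(tuple(row[p] for p in parents) for row in data)
        score += sum(c * log(c / m_ij[j]) for (j, _), c in m_ijk.items())
        score -= 0.5 * log(m) * q_i * (arities[i] - 1)
    return score


# Hypothetical data in which variable 1 copies variable 0.
data = [(0, 0), (0, 0), (1, 1), (1, 1)]
empty = [[0, 0], [0, 0]]
edge = [[0, 1], [0, 0]]  # 0 -> 1
print(bic_score(data, edge, [2, 2]) > bic_score(data, empty, [2, 2]))  # True
```

With these four samples the empty network scores -10 ln 2 and the network with the edge scores -7 ln 2, so the structure that captures the dependency is preferred despite its larger penalty.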
6. Algorithm termination
The algorithm terminates when the maximum number of iterations is reached or when the score of the best individual remains unchanged for k generations. Otherwise, the new generation of individuals produced by selection, crossover and mutation replaces the previous generation, and the iteration continues.
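The termination rule can be sketched as a generic loop. The names `evolve`, `step` and `patience` are hypothetical: `step` stands for one round of selection, crossover and mutation, and `patience` plays the role of the k generations without improvement.

```python
def evolve(population, step, fitness, max_iter, patience):
    """Iterate until max_iter is reached or the best score has not
    improved for `patience` consecutive generations."""
    best = max(population, key=fitness)
    stale = 0
    for _ in range(max_iter):
        population = step(population)  # new generation replaces the old one
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best, stale = candidate, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


# Toy stand-in: individuals are integers, "evolution" increments them up to 5.
step = lambda pop: [min(x + 1, 5) for x in pop]
print(evolve([0, 1], step, fitness=lambda x: x, max_iter=100, patience=3))  # 5
print(evolve([0, 1], step, fitness=lambda x: x, max_iter=2, patience=3))    # 3
```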
7. Experiments demonstrate the efficiency of the epistatic site mining method based on genetic tabu search and Bayesian networks; the accuracy and efficiency of 2-locus and 3-locus epistasis detection are compared respectively.
The following is an embodiment in which the method of the present invention is applied to mine epistatic sites on data sets generated by the GAMETES software; related experiments illustrate in detail the efficiency of the method. GAMETES is software commonly used in the field for generating epistasis simulation data [4]. It can quickly and accurately generate epistasis simulation data, producing a specific two-locus or multi-locus epistasis model by varying its parameters. Parameters that can be set include the number of SNP sites, the heritability (h^2), the minor allele frequency (MAF), and the prevalence. In the generated simulation data file, the first line holds the site names and the last column is the Class label, with 1 representing a case and 0 a control. Genotype data are indicated as 0, 1, 2: 0 for the homozygous common genotype, 1 for the heterozygous genotype, and 2 for the homozygous rare genotype.
The Bayesian network epistatic site mining method optimized by genetic tabu search is denoted Epi-GTBN. The epistasis detection methods used for experimental comparison are BEAM, AntEpiSeeker, SNPRuler, MDR, BOOST and the Bayesian network learning method hill-climbing. By setting different heritabilities h^2 (0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequencies MAF (0.1, 0.2, 0.3, 0.4), different data sets were generated with the GAMETES software; the data set for each parameter setting consists of 100 files. The accuracy of epistatic site mining is calculated with formula (10), where Num_edge indicates the number of data sets in which the target epistatic sites are detected.
$$\mathrm{Accuracy} = \frac{Num_{edge}}{100} \qquad (10)$$
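As a worked illustration of the accuracy measure described in the text (the share of data sets in which the target epistatic sites are detected), the sketch below counts hits over per-file results. The function name `detection_accuracy` and the representation of a detection as a tuple of site names are assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple


def detection_accuracy(
    detected_per_file: Sequence[Iterable[Tuple[str, ...]]],
    target: Tuple[str, ...],
) -> float:
    """Fraction of data sets whose reported combinations include the target sites."""
    if not detected_per_file:
        raise ValueError("at least one data set is required")
    target_set = frozenset(target)
    # A file counts as a hit if any reported combination equals the target,
    # regardless of the order in which the sites are listed.
    hits = sum(
        1
        for detected in detected_per_file
        if any(frozenset(combo) == target_set for combo in detected)
    )
    return hits / len(detected_per_file)
```

For a parameter setting with 100 files, `detected_per_file` would hold 100 entries and the numerator corresponds to Num_edge.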
Experiment 1. Comparison of the accuracy and efficiency of 2-site epistasis detection
In this experiment, the accuracy of epistatic site mining is compared under different heritabilities h² and minor allele frequencies MAF. FIGS. 8 and 9 compare the accuracy and the running time of 2-site epistasis mining for the different methods.
As can be seen from FIG. 8, under the different h² and MAF values, the 2-site epistasis detection accuracy of the BEAM and hill-climbing methods is far lower than that of the other 5 methods. The Epi-GTBN, MDR, BOOST and AntEpiSeeker methods have the highest detection accuracy, essentially 100%, and the accuracy of the SNPRuler method is slightly lower than that of these 4 methods.
As can be seen from FIG. 9, the AntEpiSeeker method takes the longest time for 2-site epistasis detection, far longer than the other 6 methods. BEAM, BOOST and SNPRuler take the least time, and MDR, hill-climbing and Epi-GTBN fall in between, with Epi-GTBN taking less time than MDR and hill-climbing. In the Epi-GTBN method, genotype data are converted into Boolean data represented in binary form, so that the conditional mutual information between nodes and the BIC score of a network can be computed quickly with logical AND operations. This saves a significant amount of computing time when building the initial network and scoring networks.
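The Boolean encoding and AND-based counting mentioned above can be sketched as follows. Each genotype value of a SNP gets one bitmask over the samples, and the number of samples sharing a genotype combination is the population count of the ANDed masks. The function names, the use of Python integers as bitmasks and the base-2 logarithm are assumptions for illustration; the mutual information follows the formula given in the claims, I = H(Class) + H(SNP_1,...,SNP_k) - H(Class, SNP_1,...,SNP_k).

```python
import math
from itertools import product
from typing import List, Sequence


def to_bitmasks(values: Sequence[int], n_states: int) -> List[int]:
    """One integer bitmask per state; bit s is set when sample s has that state."""
    masks = [0] * n_states
    for sample, value in enumerate(values):
        masks[value] |= 1 << sample
    return masks


def popcount(x: int) -> int:
    return bin(x).count("1")


def entropy(counts: Sequence[int], total: int) -> float:
    # Base-2 logarithm is an assumption; empty cells contribute nothing.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def mutual_information(snps: Sequence[Sequence[int]], labels: Sequence[int]) -> float:
    """Mutual information between k SNP columns and the class, counted via bitwise AND."""
    m = len(labels)
    snp_masks = [to_bitmasks(s, 3) for s in snps]   # genotypes 0, 1, 2
    class_masks = to_bitmasks(labels, 2)            # control 0, disease 1
    snp_counts: List[int] = []
    joint_counts: List[int] = []
    for combo in product(range(3), repeat=len(snps)):
        mask = (1 << m) - 1                         # start with all samples
        for masks, genotype in zip(snp_masks, combo):
            mask &= masks[genotype]                 # samples matching this combination
        snp_counts.append(popcount(mask))
        for class_mask in class_masks:
            joint_counts.append(popcount(mask & class_mask))
    class_counts = [popcount(cm) for cm in class_masks]
    return entropy(class_counts, m) + entropy(snp_counts, m) - entropy(joint_counts, m)
```

In an XOR-like toy data set, each SNP alone carries no information about the class while the pair determines it, which is the kind of signal a pairwise screen is meant to pick up.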
In conclusion, the 2-site epistasis detection accuracy of the MDR, BOOST and AntEpiSeeker methods is essentially the same as that of the Epi-GTBN method, at essentially 100%. However, the detection times of the MDR and AntEpiSeeker methods are significantly longer than that of the Epi-GTBN method. In addition, parameter setting for the AntEpiSeeker method is complicated, and its results depend closely on the parameter settings. The BOOST method can only detect 2-site epistasis and cannot be used to detect epistasis among more sites. Therefore, the Epi-GTBN method achieves better detection accuracy without sacrificing epistasis detection efficiency.
Experiment 2. Comparison of the accuracy of 3-site epistasis detection
In this experiment, the accuracy of 3-site epistasis mining by the different methods is compared under different heritabilities h² and minor allele frequencies MAF, as shown in FIG. 10.
As the experimental results show, the 3-site epistasis detection accuracy in FIG. 10 follows essentially the same pattern as the 2-site accuracy in FIG. 8. The BEAM and hill-climbing methods have the lowest epistasis detection accuracy. MDR and Epi-GTBN have the highest detection accuracy, essentially 100%, which is higher than that of the SNPRuler method.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (7)

1. An epistatic site mining method based on genetic tabu search and Bayesian networks, characterized by comprising the following steps:
step 1, representing the SNP genotype data in the form 0, 1 and 2, wherein 0 denotes the homozygous common genotype, 1 denotes the heterozygous genotype, and 2 denotes the homozygous rare genotype; acquiring the gene samples to be mined, dividing them into the three groups 0, 1 and 2 indexed by sample number, and converting the genotype data into Boolean data (0 and 1) represented in binary form;
step 2, calculating, based on information entropy theory, the conditional mutual information between every SNP site pair and the phenotypic trait, sorting the node pairs by the calculated mutual information, taking the top-N node pairs, and constructing an initial network graph containing the SNP site pairs;
step 3, without creating a cycle, generating a next network individual by randomly adding, deleting and reversing edges of the initial network individual, and then generating a new network individual from that next network individual; repeating the generation of new network individuals until the number of network individuals reaches the initial population size;
step 4, evolving the initial network population obtained in step 3, each individual of which is a Bayesian network comprising SNP sites, through the three operations of a tabu search optimized genetic algorithm (selection, crossover and mutation) together with the scoring mechanism of the Bayesian network, to find the optimal network structure and thereby obtain the epistatic gene sites that influence the phenotypic trait.
2. The epistatic site mining method based on genetic tabu search and Bayesian networks according to claim 1, characterized by further comprising a method of evaluating the constructed network:
step 5, using a fitness function as the criterion for the quality of a network individual, and using the BIC scoring method to evaluate the quality of the network.
3. The epistatic site mining method based on genetic tabu search and Bayesian networks according to claim 2, wherein in steps 2 and 5 the genotype data are converted into Boolean data represented in binary form, and the binary data are operated on directly with logical AND operations, so that the conditional mutual information between nodes and the BIC score of the Bayesian network are calculated rapidly.
4. The epistatic site mining method based on genetic tabu search and Bayesian networks according to claim 1, wherein the initial network graph comprising SNP site pairs is constructed in step 2 as follows:
step 2.1, setting the number nlocus of epistatic gene sites to be mined, enumerating all combinations of nlocus sites among all sites, and, based on information entropy theory, rapidly calculating with logical AND operations the conditional mutual information between each combination of nlocus sites and the phenotypic trait;
step 2.2, sorting the node pairs by the calculated conditional mutual information and taking the top-N node pairs, wherein N is determined from experimental results; for each SNP site not contained in the top-N node pairs, selecting the first node pair in which it appears and inserting that node pair into the top-N node pairs;
step 2.3, taking all SNP sites as nodes of the network, and inserting the edges corresponding to the top-N node pairs obtained in step 2.2 into the network graph to construct the initial network graph.
5. The epistatic site mining method based on genetic tabu search and Bayesian networks according to claim 1, wherein the evolution in step 4 is carried out as follows:
step 4.1, selection operation: scoring the networks with the Bayesian network scoring method, placing the Bayesian network individual with the highest score at the first position of the population, and selecting the network individuals that enter the next generation by roulette wheel selection;
step 4.2, tabu crossover operation: evolving two networks by a multi-column crossover method and checking whether a cycle is created; to avoid the premature convergence caused by ordinary crossover, the memory function of tabu search is used to compare each offspring network generated by crossover with the individuals in the tabu list; if the offspring individual is not in the tabu list, it enters the next generation and is stored in the tabu list; if it is in the tabu list, the offspring individual is discarded and the tabu crossover operation is repeated until the generated offspring is not in the tabu list;
step 4.3, tabu mutation operation: adding, deleting and reversing edges of the network individuals with a given mutation probability, and selecting the mutation that increases the network score the most, thereby obtaining an optimized network structure; the memory function of tabu search is used to store in the tabu list the suboptimal solutions generated by mutation that can improve the current fitness value.
6. The epistatic site mining method based on genetic tabu search and Bayesian networks according to claim 4, wherein the calculation in step 2.1 is as follows:
when mining k epistatic SNP sites that influence the phenotype Class, $I(Class \mid SNP_1,\ldots,SNP_k)$ denotes the conditional mutual information between the k epistatic SNP sites and the phenotype Class, calculated as:
$$I(Class \mid SNP_1,\ldots,SNP_k) = H(Class) + H(SNP_1,\ldots,SNP_k) - H(Class, SNP_1,\ldots,SNP_k)$$
the information entropy H(Class) of Class is calculated as follows, where c ranges over the values of Class:
$$H(Class) = -\sum_{c} P(c)\,\log P(c)$$
the information entropy $H(SNP_1,\ldots,SNP_k)$ of the k SNP sites is calculated as follows, where $s_1,\ldots,s_k$ range over the genotype values of the k sites:
$$H(SNP_1,\ldots,SNP_k) = -\sum_{s_1,\ldots,s_k} P(s_1,\ldots,s_k)\,\log P(s_1,\ldots,s_k)$$
7. The epistatic site mining method based on genetic tabu search and Bayesian networks according to claim 2, wherein the BIC score is calculated as follows:
let D denote the sample data and G the Bayesian network structure; by Bayes' formula:
$$P(G \mid D) = P(D \mid G)\,P(G)/P(D)$$
where P(G) represents prior knowledge of the network structure;
letting $\theta_G$ denote the parameters of the network structure, P(D|G) in the above formula is expanded by marginal integration:
$$P(D \mid G) = \int P(D \mid G, \theta_G)\,P(\theta_G \mid G)\,d\theta_G$$
from which the BIC score of the Bayesian network is obtained:
$$BIC(G \mid D) = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i} m_{ijk}\,\log\frac{m_{ijk}}{m_{ij}} - \frac{\log m}{2}\sum_{i=1}^{n} q_i\,(r_i - 1)$$
where m denotes the total number of samples, n the number of variables, $r_i$ the number of values of the i-th variable, $q_i$ the number of value combinations of the parent variables of the i-th variable, $m_{ijk}$ the number of samples in which the i-th variable takes its k-th value and its parent variables take their j-th combination, and $m_{ij} = \sum_{k} m_{ijk}$.
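The BIC score of claim 7 can be illustrated with a short sketch. It assumes the standard BIC form for discrete Bayesian networks, which is consistent with the symbol definitions in the claim: a log-likelihood term over the counts m_ijk minus a penalty of (log m)/2 per free parameter. The function name `bic_score`, the row-wise data layout and the `parents` and `arity` arguments are hypothetical choices for this example, and the counts are taken directly from rows rather than with the Boolean AND encoding.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def bic_score(
    data: Sequence[Sequence[int]],
    parents: Dict[int, Sequence[int]],
    arity: Sequence[int],
) -> float:
    """BIC of a discrete Bayesian network: log-likelihood minus (log m / 2) * parameters."""
    m = len(data)
    log_likelihood = 0.0
    n_params = 0
    for i, pa in parents.items():
        # m_ijk: samples where the parents take combination j and variable i takes value k.
        m_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        # m_ij: samples where the parents take combination j (sum of m_ijk over k).
        m_ij: Counter = Counter()
        for (j, _k), count in m_ijk.items():
            m_ij[j] += count
        log_likelihood += sum(
            count * math.log(count / m_ij[j]) for (j, _k), count in m_ijk.items()
        )
        q_i = math.prod(arity[p] for p in pa)  # 1 when the variable has no parents
        n_params += q_i * (arity[i] - 1)
    return log_likelihood - 0.5 * math.log(m) * n_params
```

On a toy data set in which one binary variable fully determines another, the structure containing that edge scores higher than the empty structure, because the gain in likelihood outweighs the extra parameter penalty.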
CN201811287261.0A 2018-10-31 2018-10-31 Genetic taboo and Bayesian network-based epistatic site mining method Active CN109448794B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811287261.0A CN109448794B (en) 2018-10-31 2018-10-31 Genetic taboo and Bayesian network-based epistatic site mining method

Publications (2)

Publication Number Publication Date
CN109448794A CN109448794A (en) 2019-03-08
CN109448794B true CN109448794B (en) 2021-04-30

Family

ID=65549784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811287261.0A Active CN109448794B (en) 2018-10-31 2018-10-31 Genetic taboo and Bayesian network-based epistatic site mining method

Country Status (1)

Country Link
CN (1) CN109448794B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant