CN109390032B

CN109390032B - Method for exploring disease-related SNP (single nucleotide polymorphism) combination in data of whole genome association analysis based on evolutionary algorithm

Info

Publication number: CN109390032B
Application number: CN201811299072.5A
Authority: CN
Inventors: 孙立岩; 刘桂霞
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2018-11-02
Filing date: 2018-11-02
Publication date: 2020-07-31
Anticipated expiration: 2038-11-02
Also published as: CN109390032A

Abstract

The invention discloses a method for exploring SNP (single nucleotide polymorphism) combinations related to diseases in data of whole genome association analysis based on an evolutionary algorithm, which comprises the following steps of: initializing a group and an individual record table, and calculating evaluation indexes of individuals in the group; wherein the evaluation index includes: ce. gini, k2, g, cec, ginic, k2c, gc; step two, ranking and fusing the evaluation indexes; step three, judging whether the evolution of the population reaches a termination condition, and if so, outputting an evolution result; generating a random number between 0 and 1, judging whether the random number is greater than the exploration probability, and determining to generate a new individual in an exploration or utilization mode according to the judgment result; and fifthly, adjusting the new individual, calculating the evaluation index of the adjusted new individual, adding the evaluation index into the individual record, and judging whether the eight evaluation indexes of the new individual are all larger than the maximum value of the corresponding evaluation indexes maintained in the current group.

Description

Method for exploring disease-related SNP (single nucleotide polymorphism) combination in data of whole genome association analysis based on evolutionary algorithm

Technical Field

The invention relates to the technical field of evolutionary algorithms, in particular to a method for exploring SNP (single nucleotide polymorphism) combinations related to diseases in data of whole genome association analysis based on the evolutionary algorithms.

Background

With the rapid development of high-throughput genotyping technology, more and more diseased group-control group (case-control) data based on single-nucleotide polymorphism (single-nucleotide polymorphism SNP) of the whole Genome emerge, the scale of the data often comprises thousands of samples and hundreds of thousands of SNPs, researchers expect to analyze the data by various methods of statistics, computers and biology to find out SNP associated with diseases, so as to further explore disease potential models, bring more guidance to explanation, prevention and treatment of diseases, and the research direction is called as whole Genome association analysis (Genome-wide association study GWAS). Due to the existence of epistasis (epistasis), some SNPs can show the correlation with diseases only when being combined with other SNPs for analysis, the correlation between the SNP combination and the diseases has to be considered for comprehensive analysis of GWAS data, and if only two-order epistasis is considered, namely the relation between every two SNPs and the diseases is considered, the search space is hundreds of millions, so that the method for quickly and accurately finding out the SNP combination related to the diseases in the GWAS is of great significance. Currently, methods for exploring related SNPs or SNP combinations on GWAS data are mainly divided into five major categories: exhaustive search, random search, filtered search, model-based search, and evolutionary algorithm-based search. Exhaustive searches, such as the MDR algorithm, which requires the calculation of the association of each SNP combination with the disease, are computationally burdensome as the degree of the superordinate considered increases, and such algorithms often fail to complete the analysis of GWAS data in a significant time. Random searches, such as BEAM, measure the association of each SNP with the disease through a random sampling process, which, although relatively efficient, tends to be very inaccurate. Filtering searches, such as BOOST, fastchi, are often divided into two stages, the first stage, which uses a simple and fast score to measure each SNP or each SNP combination, and only the checked SNPs or SNP combinations can enter the second stage, which aims to reduce the search space by using some fast indicators, and in the second stage, each SNP combination needs to be evaluated in a very small search space, but since the score used in the first stage often cannot achieve satisfactory precision, many important SNPs cannot enter the second stage for precise analysis, and therefore, the precision of such algorithms is still unsatisfactory. Model-based search, such as ts-RF and AdaBoost, the algorithm uses a mainstream classifier to construct a model with accurate classification on GWAS data, and then uses the score of the model on variables when construction is completed to measure the relevance of each SNP to a disease, but because the model tends to select variables with higher marginal effect in the construction process, the algorithm has poor performance when searching for SNPs related to the disease but without marginal effect. Searching based on evolutionary algorithms, such as FHSA-SED, MACOED, CSE, which use an index for evaluating the association between SNP combinations and diseases as a target function, and search for SNP combinations that optimize the target function using evolutionary algorithms such as genetic algorithm or ant colony algorithm, but the design of evolutionary algorithms and the selection of target functions are difficult to study. In summary, although there are currently many algorithms that explore disease-related SNPs or SNP combinations on GWAS data, they are summarized as follows: 1. with precision, measures of association of SNP combinations with disease often select only one measure, which does not find well SNP combinations that do not fit their hypothesis for the pathogenic model. 2. On the memory, many algorithms have high requirements on the memory when analyzing GWAS data, and most computing platforms cannot run. 3. In speed, most algorithms have high computational complexity, and even many algorithms cannot analyze GWAS data within a reasonable time. 4. The results of the operation of the algorithm are not interpreted biologically well, leading to poor utilization of the results by subsequent researchers.

Disclosure of Invention

The invention designs and develops a method for exploring SNP (single nucleotide polymorphism) combinations related to diseases in data of whole genome association analysis based on an evolutionary algorithm, and by adopting the SEE (Sort, amplification and amplification) algorithm designed by the invention, eight indexes for evaluating the SNP combinations are fused as a target function of the evolutionary algorithm, so that the defects of the existing method in precision, memory, speed and interpretability are overcome, the time and space required by calculation are greatly reduced, and the precision and interpretability of results are improved.

The technical scheme provided by the invention is as follows:

a method for exploring disease-associated SNP combinations in genome-wide association analyzed data based on evolutionary algorithms, comprising the steps of:

initializing a group and an individual record table, and calculating evaluation indexes of individuals in the group; wherein the evaluation index includes: ce. gini, k2, g, cec, ginic, k2c, gc;

step two, ranking and fusing the evaluation indexes, and calculating an exploration probability er through the following formula:

wherein numspecs is the de-duplication number of SNPs in the current population, numtop is the number of individuals in the population, l is the length of each individual, and pe is a parameter of the SEE algorithm;

step three, if the evolution of the population reaches a termination condition, generating a result; otherwise, generating a random number between 0 and 1; if the random number is larger than the exploration probability, generating a new individual by using the method, and if the random number is smaller than the exploration probability, generating a new individual by using the method;

step four, calculating the evaluation indexes of the new individual, if the eight evaluation indexes of the new individual are all larger than the maximum value of the corresponding evaluation indexes maintained in the current group, the new individual is considered to be useless for the group, the step three is carried out again, if at least one of the eight evaluation indexes of the new individual is smaller than the maximum value of the corresponding evaluation indexes maintained in the current group, the new individual is considered to be useful for the group, the new individual is used for replacing the worst individual in the current group, and the step two is carried out again.

Preferably, in said step one,

the calculation method of the cec comprises the following steps: cec (Y, I) ═ ce (Y | I) -ce (Y | E); wherein, ce (Y | I) is the ce value of the SNP combination, E is the SNP with the minimum ce value in the SNP combination, and ce (Y | E) is the ce value of E;

the ginic calculation method comprises the following steps: ginic (Y, I) ═ gini (Y | I) -gini (Y | E); wherein gini (Y | I) is the gini value of the SNP combination, E is the SNP of the smallest gini value in the SNP combination, and gini (Y | E) is the gini value of E;

the k2c calculation method comprises the following steps:

wherein k2(Y, I) is the k2 value of the SNP combination, E is the SNP with the minimum k2 value in the SNP combination, and k2(Y, E) is the k2 value of E;

the gc calculation method comprises the following steps:

wherein g (Y, I) is the g value of the SNP combination, E is the SNP with the minimum g value in the SNP combination, and g (Y | E) is the g value of E.

Preferably, in said step one,

the ce calculation method comprises the following steps:

wherein I is an individual of the SNP combination; y is phenotype, sample status; mi is mutual information; h is information entropy; c is a possible value set of the SNP combination; s is a set of phenotype dereferencing values; p (c, s) is the sample proportion of the SNP combination value in the sample being c and the phenotype value being s; p (c) is the sample proportion with the SNP combination value of c in the sample;

the gini calculation method comprises the following steps:

in the formula, p(s) is the sample proportion of which the phenotype value is s in the sample, C is a possible value set of the SNP combination, and p (C) is the sample proportion of which the SNP combination value is C in the sample; s is a set of phenotype evaluable values; p (s | c) is the proportion of the sample with the phenotype value of s in the sample with the SNP combination value of c;

the k2 calculation method is as follows:

wherein C is the possible value set of the SNP combination, S L is the length of the SNP combination, C is the possible value set of the SNP combination, S is the set of the phenotype possible values, m_cThe number of samples with the value c for the SNP combination in the sample, m_c,sThe number of samples with the SNP combination value of c and the phenotype value of s in the samples is shown;

the calculation method of g is as follows:

in the formula, E_c,sWhen the SNP combination is independent of the phenotype, the value of the SNP combination in the sample is c and the value of the phenotype isIs the expected value of the number of samples of S, C is the set of possible values of the SNP combination, S is the set of phenotypical values, m_cThe number of samples, m, of which the SNP combination in a sample has the value c_sNumber of samples with a shape value of s in the sample, m_c,sAnd the number of samples with the SNP combination value of C and the phenotype value of s in the samples is shown, pvaue _ of is the pvalue value of the calculated variable under the chi-square distribution, and the degree of freedom is the length of C minus 1.

Preferably, in the first step, a plurality of individuals are randomly generated to constitute an initial population, and G is made 0.

Preferably, in the second step, the ranking and fusing the evaluation indexes includes: and sequencing each index of all individuals, obtaining 8 serial numbers for each individual, respectively representing the number of the individuals smaller than the index under the corresponding index, weighting and accumulating the serial numbers to obtain a weight, and fusing and sequencing the weights.

Preferably, the weight value is initially set to a default value of 1.

Preferably, in the third step, the termination condition is reached when G exceeds a set upper limit.

Preferably, in the fourth step, the generating of the new individual process by the exploring method comprises:

randomly generating two integers in the interval [0, numPop), and assigning the larger integer to a first variable A; randomly generating a real number on the interval [0,1), and assigning the real number to a second variable B;

if the second variable is smaller than the quotient of the first variable and numPop, randomly generating an individual, wherein e is a new individual generated by the SEE in the exploration method; otherwise, assigning the A-th individual on the population to e, randomly generating an integer third variable C on the interval [0, o), randomly generating an integer on the interval [0, n), assigning the integer to the C-th element of e, wherein e is a new individual generated by the SEE in a method for exploration;

wherein numPop is the number of individuals in the population; o is the number of elements each individual contains; n is the number of SNPs in the data.

Preferably, in the fourth step, the generating of the new individual process by the method to be utilized comprises:

randomly generating two random integers on the interval [0, numPop), assigning the smaller one to a fourth variable D, randomly generating two random integers on the interval [0, numPop), assigning the smaller one to a fifth variable E, assigning the D element of the group pop to E, randomly generating two real numbers F and G on the interval [0, o), replacing the F element of the E with the G element of the E individual in the group pop, wherein E is a new individual generated by the SEE with the method;

Compared with the prior art, the invention has the following beneficial effects: aiming at the problem of insufficient precision of the current algorithm, the invention provides a method for fusing 8 evaluation indexes by a sequencing method, thereby improving the precision. Aiming at the problem of high space complexity of the current algorithm, the space complexity of the algorithm used when the individual is adjusted is very small while a storage structure similar to BOOST is used, so that the requirement of a memory is reduced, and SEE software can stably run on most computing platforms. Aiming at the deficiency of the current algorithm speed, a new evolutionary algorithm needs to be designed, and the analysis speed of GWAS data is increased by combining with the adjustment of a new individual algorithm. Aiming at the problem of insufficient interpretability of the result, the method maps the SNP to the gene, and utilizes cytoscape software to draw a network, so that the interpretability of the result is increased.

Drawings

Fig. 1 is a main flow of the SEE algorithm.

Fig. 2 is a diagram illustrating how 8 evaluation indexes are fused by ranking.

Fig. 3(a) is a diagram illustrating how new individuals may be adjusted by way of example.

Fig. 3(b) is a diagram illustrating how new individuals may be adjusted by way of example.

Fig. 3(c) is a diagram illustrating how new individuals may be adjusted.

Fig. 3(d) is a diagram illustrating how new individuals may be adjusted by way of example.

FIG. 4 is a partial result of analysis on a CD data set using the SEE algorithm.

Detailed Description

The present invention is further described in detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.

As shown in FIG. 1, the invention provides a method for searching SNP (single nucleotide polymorphism) combinations related to diseases in data of whole genome association analysis based on an evolutionary algorithm, which comprises the following specific steps:

step S110, initializing a group and an individual record table;

randomly generating a plurality of (numPop, user-specified parameters) individuals (SNP combinations) to form an initial population, setting G to 0 (when an evolutionary algorithm is applied, the number of evolutions of the population is generally recorded, and when the population is initialized, G to 0, and each time a new individual is generated by the evolutionary algorithm, G is added by 1, the parameter maxter needs to be specified when the SEE algorithm runs, when G is equal to maxter, the SEE running is terminated, and a larger maxter can make the result of the SEE better but also needs more running time), initializing the individual (SNP combination) record table to be empty, and then adding all the individuals existing in the population to the tracking.

Step S120, calculating the evaluation index of each individual in the group;

in order to evaluate the relation between a SNP combination (individual) and a disease, the invention defines 8 evaluation indexes, wherein the first four indexes are the measurement values which are commonly used in the field and used for evaluating the SNP combination and the disease relation, the last four indexes are the design in the invention and are used for measuring whether the relation between the SNP combination and the disease is caused by the super strong marginal effect of a certain SNP, the smaller the eight indexes are, the more remarkable and obvious the relation between the represented SNP combination and the disease is, and the values of the 8 indexes are calculated one by one for each individual.

In the present invention, the 8 indexes are respectively:

(1) ce (conditional expression) comes from the derivation of mi (mutual information), for many years, many researchers use mi as an index for measuring SNP combinations and disease relationships, and since h (y) is always constant in the process of analyzing GWAS data, ce can become an equivalent alternative to mi, and the calculation method of ce is as follows:

mi(I,Y)＝H(Y)-ce(Y|I)

wherein I is an individual (SNP combination); y is phenotype, sample status; mi is mutual information; h is information entropy; c is a possible value set of the SNP combination (individual I); s is a set of phenotype dereferencing values; a sample proportion of a SNP combination value in a p (c, s) sample being c and a phenotype value being s; p (c) is the sample proportion with the SNP combination value of c in the sample;

(2) gini is a method used by CART (classification regression tree) to measure the relationship between variables, and in this field it is also commonly used to measure the relationship between SNP combinations and diseases, and gini is calculated as follows:

gini(Y|I)＝Gini(Y,I)；

wherein p(s) is the sample proportion of which the table type value is s in the sample; c is a possible value set of the SNP combination; p (c) is the sample proportion with the SNP combination value of c in the sample; s is a set of phenotype dereferencing values; p (s | c) is the proportion of samples with a phenotype value of s in all samples with a SNP combination value of c;

(3) k2 is a Bayesian network-based scoring criterion, and is also often used as an index for measuring the relationship between SNP combinations and diseases, and the calculation method of k2 is as follows:

wherein C is the possible value set of the SNP combination, S L is the length (including how many SNPs) of the SNP combination, C is the possible value set of the SNP combination, S is the set of the phenotype possible values, m_cThe number of samples with the SNP combination value of c in the sample is obtained; m is_c,sThe number of samples with the SNP combination value of c and the phenotype value of s in the samples is shown;

(4) g is the pvalue value of G-test, which is widely used as a statistical method for testing whether two variables are independent of each other to judge the relationship between SNP combination and disease, and the calculation method of G is as follows:

g(Y,I)＝pvalue_of(G-statistic(Y,I))；

in the formula, E_c,sWhen the SNP combination is independent of the phenotype, the expected value of the number of samples with the SNP combination value of c and the phenotype value of s in the sample is obtained; c is a possible value set of the SNP combination; s is a set of phenotype dereferencing values; m is_cThe number of samples with the SNP combination value of c in the sample is obtained; m is_sThe number of samples with the table type value of s in the samples is shown; m is_c,sThe number of samples with the SNP combination value of c and the phenotype value of s in the samples is shown; pvaue _ of is the pvalue value of the calculated variable under chi-square distribution, with the degree of freedom being the length of C minus 1;

(5) the calculation method of the cec is as follows:

cec(Y,I)＝ce(Y|I)-ce(Y|E)；

wherein ce (Y | I) is the ce value of the SNP combination; e is the SNP with the minimum ce value in the SNP combination; ce (Y | E) is the ce value of E;

(6) ginic was used to measure how much less the fraction of a SNP combination under gini's index is than the SNP that it possesses the smallest gini, and ginic was calculated as follows:

ginic(Y,I)＝gini(Y|I)-gini(Y|E)；

wherein gini (Y | I) is the gini value of the SNP combination; e is the SNP with the smallest gini value in the SNP combination; gini (Y | E) is the gini value of E;

(7) k2c is a measure of how much less the fraction of a combination of SNPs under the index k2 is than the fraction of the SNP with the smallest k2, and k2c is calculated as follows:

wherein k2(Y, I) is the k2 value of the SNP combination; e is the SNP with the smallest k2 value in the SNP combination; k2(Y, E) is the k2 value of E;

(8) gc is used to measure how much less the fraction of a SNP combination under g is than the SNP it possesses the smallest g, and g is calculated as follows:

wherein g (Y, I) is the g value of the SNP combination; e is the SNP with the minimum g value in the SNP combination; g (Y | E) is the g value of E.

S130, sequencing the groups, and calculating exploration probability;

in order to use the 8 indexes simultaneously, the 8 indexes need to be fused into a new index, however, the 8 indexes come from different fields, the value ranges and the distribution are very different, the 8 indexes cannot be effectively utilized by simply adding or multiplying, the invention designs a method based on sequencing to fuse the 8 evaluation indexes, fig. 2 illustrates the method designed by the invention, in order to enable the SEE algorithm to find the optimal solution, the evolution process needs to simultaneously consider the two processes of 'S160, exploration' and 'S170 and utilization', which respectively represent the number of SNPs in the enriched population to prevent the population from being trapped in a locally optimal algorithm and fully utilize the combination of SNPs in the population to generate a better solution, and are two methods of complementation and opposition, and in order to find the optimal solution, the two methods are not available.

When the number of SNPs in a population is large, SEE should be more biased to call "S160, explore", and conversely "S170, utilize"; aiming at the idea, the variable of the exploration probability (er) is designed, and the calculation method is as follows:

er＝1-coverage^pe

where numspecs is the number of SNPs in the current population (deduplication), numppo is the number of individuals in the population, l is the length of each individual, pe is one of the parameters of the SEE algorithm and needs to be specified by the user, the larger pe favors the generation of new individuals by "S160, exploration" during the entire evolution, and the smaller pe favors "S170, utilization".

Step S140, whether a termination condition is achieved;

whether the evolution of the population is terminated is judged, generally the SEE continuously maintains a variable G, G is 0 when the algorithm is started, G is larger and larger along with the running of the SEE, and when the G reaches the preset upper limit maxIter (parameters specified by a user are larger, the larger the maxIter is, the longer the SEE running time is, and the better the result is), the SEE algorithm terminates the evolution of the population.

Step S150, generating a random number between 0 and 1;

a random number between 0 and 1 is generated, and if the random number is less than the search probability, "S160, search" is performed, otherwise, "S170, utilization" is performed.

Step S160, generating a new individual by an exploration method;

SEE favors the invocation of the method to generate new individuals when the number of SNPs in a population is small, and generally increases the number of SNPs in a population (deduplication) when an individual generated by the method is included in the population, as described in detail below:

algorithm 1 generates new individuals in an Exploration (Exploration) approach:

defining: numPop is the number of individuals in the group and is specified by the user when the SEE is run; o is the number of elements contained in each individual, and is specified by a user when the SEE is run; n is the number of SNPs in the data, determined by the content of the data analyzed by SEE.

1. Randomly generating two integers over the interval [0, numtop), assigning the larger integer to i 1;

2. randomly generating a real number on the interval [0,1), and assigning the real number to ran;

3. if ran is less than i1/numPop, execute 4, otherwise, execute 5;

4. randomly generating an individual e, namely randomly generating each element of e, wherein each element is in the interval [0, n), and executing 8;

5. assigning e to an i1 th individual pop [ i1] on the population pop;

6. randomly generating an integer i2 over the interval [0, o);

7. randomly generating an integer on the interval [0, n), and assigning to the i2 th element of e, namely e [ i2 ];

e is a new individual generated by the method explored by SEE.

The pseudo code of the method is as follows:

step S170, generating a new individual by using the method;

when the number of SNPs in a population is large, SEE is biased to call the method to generate new individuals, and the idea is mainly referred to the crossover operation in genetic algorithm, and the detailed description of the method is as follows:

algorithm 2 generated new individuals in the method of (Exploitation):

defining: pop, numPop, o are as defined for algorithm 1.

1. Randomly generating two random integers over the interval [0, numtop), assigning the smaller one to i 1;

2. randomly generating two random integers over the interval [0, numtop), assigning the smaller one to i 2;

3. assigning the i1 th element of the group pop to e, i.e., e ═ pop [ i1 ];

4. randomly generating two real numbers i3 and i4 over the interval [0, o);

5. replacing the i3 th element of e with the i4 th element of the i2 th individual in the group pop, i.e. e [ i3] ═ pop [ i2] [ i4 ];

e is a new individual generated by the method used for SEE;

the pseudo code of the method is as follows:

step S180, adjusting a new individual;

the generated new individual is explored or utilized and can not be directly added into a group as a possible solution, the method needs to adjust the new individual to enable the new individual to become a possible solution, the same solution is prevented from being repeatedly calculated according to the content of an individual record table, meanwhile, the repeated solution is fully utilized, namely when the evolutionary algorithm repeatedly visits the same solution, the solution adjacent to the repeated solution is preferentially selected to become an adjusted individual, and the idea effectively accelerates the search of the evolutionary algorithm on the global optimum. FIG. 3 illustrates how the adjustment of a new individual is made, the detailed description of the adjustment being as follows:

algorithm 3 adjusts the individuals:

defining: x is the input, individual to be adjusted; the steplnTable is a parameter designated by a user and used for determining the maximum distance allowed when an individual is adjusted; the training is an individual record table, is maintained by the SEE operation, and stores all the individuals which are calculated once; n, o are as defined for algorithm 1.

1. Initializing two empty queues, waiting0 and waiting 1;

2. if x contains two identical elements, the adjustment fails, and the return is made, otherwise, 3 is executed;

3. if x is not in the tracking, the adjustment is successful, x is returned as the adjusted individual, otherwise, 4 is executed;

4. adding x to waiting0 and waiting1, initializing i to 0;

5. if waiting1 is not null, take an individual assignment from waiting1 to x, perform 6, otherwise perform 10;

6. if x is not in the tracking, the adjustment is successful, x is the adjusted individual, waiting1 and waiting0 are emptied, and if not, 7 is executed;

7. initializing j to 0;

8. assigning x to xx, xx [ j ] ═ xx [ j ] + 1;

9. if xx [ j ] is above [0, n), and all elements above xx satisfy a small to large order, add xx to waiting1, j +1, if j < o, perform 8, otherwise perform 10;

10. if waiting0 is not null, take an individual assignment from waiting0 to x, execute 11, otherwise execute 15;

11. if x is not in the tracking, the adjustment is successful, x is the adjusted individual, waiting1 and waiting0 are emptied, and if not, 12 is executed;

12. initializing j to 0;

13. assigning x to xx, xx [ j ] ═ xx [ j ] -1;

14. if xx [ j ] is above [0, n), and all elements above xx satisfy a small to large order, add xx to waiting0, j +1, if j < o, perform 13, otherwise perform 15;

if i is equal to i +1, executing 5, otherwise, failing to adjust, emptying waiting1 and waiting0, and returning;

the pseudo code for the adjustment is as follows:

s190, calculating the adjusted new individual and adding the new individual to an individual record table;

calculating 8 evaluation indexes of the adjusted new individual, adding the individual to an individual record table, and recording that the individual is already calculated.

S200, whether the new individual is useful for the population;

the SEE algorithm can continuously maintain the maximum value of 8 evaluation indexes in the population in the evolution process of the population (when the SEE algorithm is operated, each individual has 8 values, corresponding to 8 indexes, for each index, the maximum value in all the individuals is taken, the 8 maximum values can be continuously updated along with the operation of the SEE algorithm and the evolution of the population), if any one of the 8 evaluation indexes of a new individual is smaller than the maximum value of the corresponding index in the current population, the new individual is considered to be useful for the population by the invention, otherwise, the new individual is useless.

S210, replacing the worst individual in the population with the new individual;

and replacing the worst individual in the population by the new individual, wherein the population is sorted according to rankSum from small to large, the worst individual is the last individual in the population, namely the individual with the largest rankSum, and the last individual in the population is replaced by the new individual.

And S220, generating a result.

The SEE algorithm provides a threshold value for each of the 8 evaluation indexes, a user can set the threshold values by specifying parameters when running a program, and when the algorithm is finished, only individuals of which the values of the 8 evaluation indexes are not more than the corresponding threshold values are returned as results; in addition, since the result of "SNP combination related to disease" is not well interpreted biologically only, and thus the result is not well utilized by the subsequent researchers, the present invention utilizes NCBI database to convert SNP into gene and SNP combination into gene combination, the result is better interpretable and convenient for the subsequent researchers to utilize, and fig. 4 shows the partial result obtained by analyzing CD data set by SEE algorithm according to the present invention.

Examples

As shown in fig. 4, CD (Crohn's Disease) data set in seven Disease data sets WTCCC1 was analyzed by SEE algorithm of the present invention, and 2 nd-order SNP combinations related to CD were searched, and finally SNPs in the result were converted into gene identities according to NCBI to obtain a series of gene pairs, which are networks drawn according to the gene pairs, wherein each edge represents that at least 4 SNP combinations and CD are related on the two connected genes, and it can be clearly seen from fig. 4 that SEE that genes L DB2, L OC107986262, RRP15, SMG1P5, etc. may play some key roles for CD Disease, which is worthy of further study.

Aiming at the problem of insufficient precision of the current algorithm, the technical scheme of the invention provides a method for fusing 8 different indexes for evaluating the relation between the SNP combination and the disease by utilizing a sequencing method, wherein the first 4 indexes ce, gini, k2 and g are indexes which are widely applied in the field, cec, ginic, k2c and gc are new indexes which are designed for evaluating the epistasis of the SNP combination and are used for measuring whether the relation between the SNP combination and the disease is caused by the super strong marginal effect of a certain SNP, the smaller the 8 indexes are, the stronger the relation between the SNP combination and the disease is, in order to fuse the 8 indexes, the invention respectively sequences the 8 indexes on a group, then obtains the final fused index rankSum by taking the weighted sum of all index serial numbers of each individual, and fuses a plurality of indexes to improve the precision of the result, the information quantity of the group of the evolutionary algorithm is larger in the running process, and the evolutionary algorithm also plays a certain role in accelerating the running of the algorithm.

The technical scheme of the invention aims at the problem that the current algorithm usually needs a super-large memory, uses a storage structure similar to BOOST, and uses 'bit' to store GWAS data, thereby greatly reducing the required memory space, and for the GWAS data with the scale of about 5000 samples and 500000 SNP, the memory required by SEE software during the operation is about 1GB, so that most computing platforms can use the SEE software to analyze the GWAS data.

The technical scheme of the invention provides a new evolutionary algorithm aiming at the problem of insufficient speed of the current algorithm, the group is continuously evolved through the two processes of exploration and utilization, the probability of generating new individuals by calling the two processes is determined through the number of SNPs in the group, the SEE algorithm utilizes an individual record table to adjust the new individuals, the same SNP combination cannot be repeatedly calculated, the information of the group is fully utilized, and when the evolutionary algorithm repeatedly accesses the same individual, the individuals adjacent to the individual are also preferentially accessed. In addition, because the invention uses 'bit' to store data, the process of calculating 8 indexes also makes full use of 'bit operation', which also greatly improves the speed of SEE software.

The technical scheme of the invention aims at the problem of insufficient interpretability of the current algorithm and provides that SNP is mapped to a gene to display a result, after a SNP combination related to a disease is found, the SNP is mapped to the gene by using an NCBI database, the result is converted into the combination of the gene, then, a network of the gene and the gene is drawn by using cytoscape software, so that genes having larger influence in the disease can be seen, and the interpretability of the algorithm is improved.

As shown in fig. 2, in another embodiment, in step S130, 8 evaluation indexes are fused by sorting. Firstly, the invention sequences each index of all individuals, each individual can obtain 8 serial numbers which respectively represent the number of the individuals smaller than the number of the individuals under the corresponding index, the serial numbers are weighted and accumulated to obtain values, the indexes generated by 8 evaluation indexes are fused in the invention, the weights control which indexes are emphasized or discarded by the SEE algorithm, and the invention recommends that all weights are set as default values 1.

As shown in fig. 3, in another embodiment, in step S180, the specific implementation process of adjusting the new individual process includes: assume that the length of an individual is 2 and that the new individuals resulting from exploitation or exploration are [3,4 ]. In fig. 3(b), 3(c), 3(d), F represents that this solution is not feasible because the present invention requires that SNPs in individuals are in increasing order and not identical, x represents that this individual has been calculated, stored in the individual record table, o represents the individual that will be returned by the adjustment algorithm, and one of the individuals marked as o will be returned after the adjustment is finished; fig. 3(a) shows the manhattan distance between each possible solution and [3,4], in fig. 3(b), when the individual [3,4] is not calculated, the [3,4] is returned as the adjusted individual, in fig. 3(c), when the individual [3,4] has been calculated, the individual [3,5] which is closest to the distance [3,4] and which has not been calculated is returned as the adjusted individual, in fig. 3(d), when [3,4] and the individual whose manhattan distance is 1 are both calculated, the individual which is 2 from the adjusted individual and which has not been calculated is returned as the adjusted individual, in the SEE algorithm, steplntable is a parameter for controlling the adjustment depth, and when there is no individual whose manhattan distance is smaller than steplntable and which has not been calculated, the adjustment fails.

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. A method for searching SNP combinations related to diseases in data of genome-wide association analysis based on an evolutionary algorithm, which is characterized by comprising the following steps:

wherein numspeces is the de-duplication number of SNPs in the current population, numtop is the number of individuals in the population, l is the length of each individual, and pe is a parameter;

step four, calculating the evaluation indexes of the new individual, if the eight evaluation indexes of the new individual are all larger than the maximum value of the corresponding evaluation indexes maintained in the current group, the new individual is considered to be useless for the group, the step three is carried out again, if at least one of the eight evaluation indexes of the new individual is smaller than the maximum value of the corresponding evaluation indexes maintained in the current group, the new individual is considered to be useful for the group, the new individual is used for replacing the worst individual in the current group, and the step two is carried out again;

in the first step, the first step is carried out,

the ce calculation method comprises the following steps:

the gini calculation method comprises the following steps:

the k2 calculation method is as follows:

wherein C is the possible value set of the SNP combination, S L is the length of the SNP combination, C is the possible value set of the SNP combination, S is the set of the phenotype possible values, m_cThe number of samples with the value c for the SNP combination in the sample, m_c,sThe SNP combination in the sample takes the value c and the phenotypeThe number of samples with value s;

the calculation method of g is as follows:

in the formula, E_c,sWhen the SNP combination is independent of the phenotype, the expected value of the number of samples with the SNP combination value of C and the phenotype value of S in a sample is shown, C is a possible value set of the SNP combination, S is a set of phenotype possible values, m is_cThe number of samples, m, of which the SNP combination in a sample has the value c_sNumber of samples with a shape value of s in the sample, m_c,sThe method comprises the steps of calculating the number of samples with the SNP combination value of C and the phenotype value of s in the samples, wherein the pvalue _ of is the pvalue value of a calculated variable under chi-square distribution, the degree of freedom is the length of C minus 1, and G is the evolution frequency of a population;

the k2c calculation method comprises the following steps:

the gc calculation method comprises the following steps:

wherein g (Y, I) is the g value of the SNP combination, E is the SNP with the minimum g value in the SNP combination, and g (Y | E) is the g value of E;

in the second step, the ranking and fusing the evaluation indexes comprises: sequencing each index of all individuals, obtaining 8 serial numbers for each individual, respectively representing the number of the individuals smaller than the index under the corresponding index, performing weighted accumulation on the serial numbers to obtain a weight, and performing fusion sequencing on the weight;

in step three, the generation of new individual processes by the explored method comprises:

if the second variable is smaller than the quotient of the first variable and numPop, randomly generating an individual e, namely a new individual generated by an exploration method; otherwise, assigning the A-th individual on the population to e, randomly generating an integer third variable C on the interval [0, o), randomly generating an integer on the interval [0, n), and assigning the integer to the C-th element of e, wherein e is a new individual generated by an exploration method;

wherein numPop is the number of individuals in the population; o is the number of elements each individual contains; n is the number of SNPs in the data;

generating new individual processes in a utilized manner includes:

randomly generating two random integers on the interval [0, numPop), assigning the smaller one to a fourth variable D, randomly generating two random integers on the interval [0, numPop), assigning the smaller one to a fifth variable E, assigning the D element of the group pop to E, randomly generating two real numbers F and G on the interval [0, o), replacing the F element of the E with the G element of the E individual in the group pop, wherein E is a new individual generated by utilizing the method;

2. The method for exploring disease-associated SNP combinations in genome-wide association analysis data based on evolutionary algorithm as claimed in claim 1, wherein in step one, a plurality of individuals are randomly generated to compose an initial population, and G is made 0.

3. The method for searching for SNP (single nucleotide polymorphism) combinations related to diseases in genome-wide association analysis data based on evolutionary algorithm as claimed in claim 2, wherein the weight is initially set to a default value of 1.

4. The method for searching for a combination of SNPs associated with a disease in data analyzed by genome wide association based on evolutionary algorithm as claimed in claim 1, wherein in the third step, the termination condition is reached when G exceeds a set upper limit.