CN112447263A - Multitask high-order SNP epistasis detection method, system, storage medium and equipment - Google Patents
- Publication number
- CN112447263A (application CN202011315829.2A)
- Authority
- CN
- China
- Prior art keywords
- snp
- order
- multitask
- data
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Abstract
The invention belongs to the technical field of single nucleotide polymorphism (SNP) epistasis detection and discloses a multitask high-order SNP epistasis detection method, system, storage medium and equipment. The method uses Plink software to read PED and MAP format data from a VCF (Variant Call Format) file and arranges the converted binary-format files into a sample matrix; sets the search algorithm parameters according to the number of SNP sites and the sample size in the data; reads in the SNP sample data and prepares the first-stage search; and detects high-order epistatic SNP combinations using a multitask, multi-harmony-memory harmony search (HS) algorithm. The method uses several harmony memories, each storing SNP combinations of a different order. Applying multitask technology allows high-order SNP epistasis detection at several different orders to run simultaneously, promotes mutual learning among individuals in the population, increases population diversity, and thereby improves global search capability.
Description
Technical Field
The invention belongs to the technical field of single nucleotide polymorphism epistasis detection, and particularly relates to a multitask high-order SNP epistasis detection method, system, storage medium and equipment.
Background
A single nucleotide polymorphism (SNP) is a polymorphism caused by variation at a single base site at the genome level. It may be a transition or a transversion of a single base, or may be caused by the insertion or deletion of a base. For example, if a base pair that is C-G in SEQ ID No. 1 appears as A-T in SEQ ID No. 2, that site is called an SNP site. The human genome contains more than 3 million such SNP sites. Most SNPs pose no threat to human health, but some variant sites are closely related to it. Epistasis (epistatic effect) denotes the interaction between genes or SNPs, traditionally defined as an allele at one locus masking the phenotypic expression of an allele at another locus. Epistasis among multiple SNPs means that the SNPs act jointly on the expression of a phenotype. A complex disease may be influenced by the joint action of multiple SNPs: if SNP variation occurs simultaneously at those sites in a person, the probability of disease increases markedly. A k-order SNP epistasis is an epistasis in which k SNPs jointly act on a phenotype (or disease state). High-order SNP epistasis detection with k > 2 faces a severe combinatorial explosion of SNP combinations; the amount of computation is huge, and existing computers cannot complete genome-wide epistatic combination detection in a practical time. Although many methods have been proposed for high-order SNP epistasis detection, such as exhaustive, parallel computing and Monte Carlo methods, they still suffer from high search cost and low detection power.
Existing high-order SNP epistasis detection techniques can generally complete detection at only one order (for example, order 3) per run. Detecting SNP epistasis at different orders (order 2, order 3, …, order k) requires multiple trial runs, which is computationally very costly.
The prior art has found many susceptibility genes by associating individual SNPs with disease states, but this approach does not explain complex diseases well. Biologists therefore generally consider that high-order epistatic SNP combinations are an important possible cause of complex diseases. Because the number of high-order SNP combinations is extremely large (a very complex combinatorial explosion problem), existing computers can hardly test all possible SNP combinations, which is one of the most important challenges facing the prior art. In addition, accurately identifying whether a high-order SNP combination has an epistatic effect is itself an important and complicated research topic: existing methods often show preference for particular SNP epistasis models and are difficult to apply to all epistasis models.
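To give a sense of the scale of this combinatorial explosion, the following minimal Python sketch counts the k-order SNP combinations for a marker panel. The panel size of 100,000 SNPs is an assumed example value chosen for illustration, not a figure taken from this document.

```python
from math import comb


def n_combinations(n_snps: int, k: int) -> int:
    """Number of distinct k-order SNP combinations among n_snps sites."""
    return comb(n_snps, k)


if __name__ == "__main__":
    n_snps = 100_000  # assumed panel size, for illustration only
    for k in (2, 3, 4, 5):
        print(f"order {k}: {n_combinations(n_snps, k):.3e} combinations")
```

Each additional order multiplies the count by roughly n/k, which is why exhaustive evaluation quickly becomes impractical as k grows.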
At present, a high-order SNP epistasis detection method must overcome two main problems. First, the number of SNP combinations grows explosively with the order, so the search cost is very high. Second, because the pathogenesis of complex diseases is unknown, the possible SNP epistasis models are diverse, and correctly identifying the pathogenic high-order SNP combination is also a great challenge.
From the perspective of search, the prior art can be classified as follows:
1. Exhaustive search. All k-order SNP combinations (combinations of k SNPs) are enumerated and their association is evaluated with some method. The advantage is that no possibly pathogenic SNP combination is missed; the drawback is that the amount of computation is extremely large, and when k is greater than 3 the computation cannot be completed in a practical time.
2. Stochastic search. Random sampling is used to search the solution space, which greatly reduces the amount of computation, but the success rate is low and the method favors epistatic SNP combinations with marginal effects.
3. Machine learning-based methods. Methods such as random forests and support vector machines use feature selection to remove, from a high-dimensional SNP set, the SNP sites that do not improve classification performance. The drawbacks are a large amount of computation and low classification accuracy on small-sample, high-dimensional data.
4. Stepwise search. A statistical method first screens out a set of SNPs with marginal effects, and higher-order epistatic SNP combinations are then sought among the SNP sites in that set. The advantages are a small amount of computation and high search speed. The drawback is that high-order epistatic SNP combinations with no (or low) marginal effect are difficult to find.
5. Search based on swarm intelligence optimization. Individuals in a population learn from and exchange the information they carry, which can markedly improve search efficiency. However, how to ensure that a global optimum is obtained without preference for particular high-order SNP epistasis models remains an important open problem for this approach.
6. Single-task search detection. Existing detection techniques can complete SNP epistasis detection at only one order (for example, order 3) per run. Detecting SNP epistasis at different orders (order 2, order 3, …, order k) requires multiple trial runs, which is computationally very costly.
From the perspective of how the association between an SNP combination and the disease state is determined (the evaluation method), the prior art can be classified as follows:
(1) Statistical test methods. Based on hypothesis testing theory, these methods analyse the significance of differences in the distribution of the genotypes corresponding to an SNP combination between disease (Case) samples and normal (Control) samples, and screen out the SNP combinations whose genotype distributions differ significantly between the Case and Control samples.
(2) Mutual information (MI). Using information theory, these methods compute the mutual information between the genotype corresponding to an SNP combination and the disease state, and thereby analyse the correlation between the genotype of the SNP combination and the disease state.
(3) Machine learning (ML). Sample data corresponding to a specified SNP combination are used for training and testing, and the classification accuracy of the SNP combination on the Case and Control samples is then evaluated.
(4) Bayesian network-based evaluation methods. The Bayesian network is a two-layer probabilistic graphical model: one layer consists of a set of SNP nodes and the other of a disease node. Their conditional dependencies are represented as a set of edges in a directed acyclic graph.
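The mutual-information evaluation in item (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation function used by the invention: genotypes are assumed to be coded 0/1/2 with no missing values, disease status is coded 0/1, and the function name `mutual_information` is hypothetical.

```python
import numpy as np


def mutual_information(genotypes, status) -> float:
    """MI in bits between the joint genotype of a SNP combination and disease status.

    genotypes: (n_samples, k) integer array with values in {0, 1, 2} (assumed coding)
    status:    (n_samples,) integer array with values in {0, 1}
    """
    genotypes = np.asarray(genotypes, dtype=int)
    status = np.asarray(status, dtype=int)
    k = genotypes.shape[1]
    # Encode each joint genotype of the k SNPs as one base-3 integer.
    codes = genotypes @ (3 ** np.arange(k))
    # Contingency table: joint genotype (rows) by disease status (columns).
    joint = np.zeros((3 ** k, 2))
    np.add.at(joint, (codes, status), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip empty cells, whose contribution is zero
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))
```

For a pure interaction such as an XOR pattern over two SNPs, each SNP alone has zero mutual information with the status while the pair carries one full bit, which illustrates why combinations without marginal effects are missed by single-SNP screening.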
These evaluations require multiple testing. Mutual information and Bayesian network evaluations are lightweight but show preference for particular models. The advantage of machine learning is that SNP combinations of arbitrary order can be evaluated and compared, but for higher-order SNP combinations the recognition accuracy is low and the amount of computation is large. Existing detection methods for high-order epistatic SNP combinations mainly have the following defects: (1) They depend too heavily on the SNP epistatic (pathogenic) model, so they favor certain simulation models and are difficult to apply to unknown models. For real complex-disease data sets in particular, it is difficult to provide an efficient detection method. (2) The P-value threshold used in statistical test methods is set manually, which results in poor sensitivity of the detection result. (3) Most existing swarm intelligence search algorithms use a single association evaluation function, or several with similar behavior, so the search results are not accurate enough and truly pathogenic epistatic SNP combinations may be missed. (4) Detection power is low for data containing multiple pathogenic SNP combinations.
Although the prior art achieves some success in detecting high-order epistatic SNP combinations, it generally has the following disadvantages:
(1) The detection methods have high computational complexity, or easily miss real epistatic SNP combinations.
(2) The sensitivity of the detection results is low, and generality is very low.
(3) The detection methods show preference for particular SNP epistasis models, and the success rate of the detection algorithms is not high enough; the single-task detection approach requires repeated trial runs for unknown diseases, so the amount of computation is large and heuristic search is hindered.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Existing detection methods have high computational complexity or easily miss real epistatic SNP combinations.
(2) Existing detection methods have low sensitivity and low generality.
(3) Existing detection methods show preference for particular SNP epistasis models, and the success rate of the detection algorithms is not high enough; the single-task detection approach requires repeated trial runs for unknown diseases, so the amount of computation is large and heuristic search is hindered.
The difficulties in solving the above problems and defects are:
(1) The number of sites in the human genome is huge and the number of combinations grows exponentially. Existing computers and methods cannot test k-order (k > 2) SNP combinations for association within a limited time, and no effective method can quickly find candidate k-order epistatic SNP combinations.
(2) SNP epistasis models are diverse, for example models with main effects plus interaction and models with interaction but no main effects. No single method can correctly identify all SNP epistasis models, so methods show preference for particular models.
The significance of solving these problems and defects is as follows: it gives biologists an effective method for analysing the causes of complex diseases and for quickly finding candidate pathogenic genes, so that effective measures can be taken for diagnosis and targeted therapy.
The invention adopts a multitask harmony search strategy, whose significance is as follows:
(1) Harmony search is a swarm-intelligence-based search method that can complete the search in polynomial time and has strong global search capability. The invention adopts a harmony search strategy to improve search speed.
(2) Multitasking: searches for several high-order epistatic SNP combinations of different orders run simultaneously, and the tasks can exchange information and promote one another, which improves search capability. The parallel search speed across tasks is therefore greatly improved.
(3) Multiple tasks use multiple association evaluation functions, which improves the ability to recognize diverse SNP epistasis models.
Disclosure of Invention
To address the problems in the prior art, the invention provides a multitask high-order SNP epistasis detection method, system, storage medium and equipment.
The invention is implemented as follows. The multitask high-order SNP epistasis detection method comprises the following steps:
reading PED and MAP format data from a VCF file using Plink software, and arranging the converted binary-format files into a sample matrix;
setting the search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in the SNP sample data and preparing the first-stage search;
and detecting high-order epistatic SNP combinations using a multitask, multi-harmony-memory harmony search algorithm.
Further, the Plink software is used to read PED and MAP format data from the VCF file, and the data are then converted into the binary-format files FAM, BED and BIM and arranged into a sample matrix.
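As an illustration of arranging the binary files into a sample matrix, the sketch below decodes the contents of a PLINK 1 SNP-major BED file into a samples-by-SNPs genotype matrix. It is a minimal sketch rather than the invention's own preprocessing code: the function name `decode_bed` is hypothetical, the sample and SNP counts are assumed to come from the line counts of the FAM and BIM files, and the BED file itself is assumed to have been produced beforehand, for example with a PLINK 1.9 command such as `plink --vcf input.vcf --make-bed --out data` (file names are placeholders).

```python
import numpy as np

# PLINK 1 two-bit codes -> number of A1 alleles; -1 marks a missing call.
# Code 0 = homozygous A1, 1 = missing, 2 = heterozygous, 3 = homozygous A2.
_BED_CODE_TO_GENOTYPE = np.array([2, -1, 1, 0], dtype=np.int8)


def decode_bed(raw: bytes, n_samples: int, n_snps: int) -> np.ndarray:
    """Decode SNP-major PLINK .bed bytes into an (n_samples, n_snps) matrix."""
    if raw[:3] != b"\x6c\x1b\x01":
        raise ValueError("not a SNP-major PLINK .bed file")
    bytes_per_snp = (n_samples + 3) // 4  # four samples are packed per byte
    body = np.frombuffer(raw, dtype=np.uint8, offset=3)
    if body.size != bytes_per_snp * n_snps:
        raise ValueError("file size does not match n_samples and n_snps")
    body = body.reshape(n_snps, bytes_per_snp)
    # Within each byte the lowest-order bit pair belongs to the first sample.
    codes = np.stack([(body >> shift) & 0b11 for shift in (0, 2, 4, 6)], axis=-1)
    codes = codes.reshape(n_snps, bytes_per_snp * 4)[:, :n_samples]
    return _BED_CODE_TO_GENOTYPE[codes].T
```

In use, `raw` would be the full contents of the BED file read in binary mode, and the disease status column of the FAM file would supply the Case/Control labels for the rows of the matrix.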
Further, the harmony search algorithm parameters are set according to the number of SNP sites and the sample size in the data. The parameters comprise the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR and the pitch adjusting (local fine-tuning) rate PAR.
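The role of these four parameters can be seen in a generic harmony search improvisation step, sketched below. This follows the standard harmony search scheme rather than the exact procedure of the invention; the default values and the names `HSParams` and `improvise` are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class HSParams:
    max_t: int = 10_000   # MaxT: maximum number of generations (assumed default)
    hms: int = 50         # HMS: harmony memory size (assumed default)
    hmcr: float = 0.95    # HMCR: harmony memory considering rate (assumed default)
    par: float = 0.3      # PAR: pitch adjusting rate (assumed default)


def improvise(memory, n_snps, k, p, rng):
    """Build one new k-SNP combination from a harmony memory (a list of k-tuples)."""
    new = []
    while len(new) < k:
        if rng.random() < p.hmcr:
            # Memory consideration: copy the SNP at this position from a stored harmony.
            snp = rng.choice(memory)[len(new)]
            if rng.random() < p.par:
                # Pitch adjustment: shift to a neighbouring SNP index, kept in range.
                snp = min(max(snp + rng.choice((-1, 1)), 0), n_snps - 1)
        else:
            # Random selection from the whole SNP index range.
            snp = rng.randrange(n_snps)
        if snp not in new:  # a combination must not repeat a SNP
            new.append(snp)
    return tuple(sorted(new))
```

A higher HMCR makes the search rely more on combinations already stored in memory, while PAR controls how often a remembered SNP is perturbed locally; MaxT and HMS bound the number of such improvisations and the number of stored combinations.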
Further, the harmony search algorithm of the multitask high-order SNP epistasis detection method is a meta-heuristic search algorithm, and the high-order SNP epistasis detection problem is expressed as a combinatorial optimization problem in which X represents a combination of k SNPs. The optimization aims to find, from the genome, the epistatic SNP combination X* that has the strongest association with the disease state Y.
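In symbols, one formulation consistent with this description is the following. The association score $f$ (assumed here to be larger for stronger association), the SNP count $N$ and the search space $\Omega_k$ are notational assumptions introduced for illustration.

$$
X^{*} = \arg\max_{X \in \Omega_k} f(X, Y), \qquad X = (x_1, x_2, \ldots, x_k), \qquad \Omega_k = \{\, X : 1 \le x_1 < x_2 < \cdots < x_k \le N \,\}
$$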
Further, the objective of the multitask harmony search algorithm used by the multitask high-order SNP epistasis detection method is to find several epistatic SNP combinations of different orders from the genome. In the mathematical model, X_i represents a k_i-order (k_i > 2) SNP combination, and the goal is to find, from the genome, the k_1-order, k_2-order, …, k_M-order SNP combinations that have the strongest association with the disease state.
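Under the same kind of notational assumptions (a hypothetical association score $f_i$ for task $i$, larger for stronger association, and $\Omega_{k_i}$ the set of all $k_i$-SNP combinations), the multitask problem with $M$ tasks can be written as:

$$
\{X_1^{*}, X_2^{*}, \ldots, X_M^{*}\} = \Big\{ \arg\max_{X_1 \in \Omega_{k_1}} f_1(X_1, Y),\; \arg\max_{X_2 \in \Omega_{k_2}} f_2(X_2, Y),\; \ldots,\; \arg\max_{X_M \in \Omega_{k_M}} f_M(X_M, Y) \Big\}
$$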
Furthermore, each task of the multitask high-order SNP epistasis detection method has its own independent harmony memory HM and uses its own selection mechanism for selection and elimination. During the search, each iteration generates one new individual for each task. New individuals are generated in two ways: intra-group learning and inter-group combination cross learning. The tasks of the multitask harmony search method may all use the same type of association evaluation function or may each use a different type; each individual in a harmony memory may even be evaluated with several different types of evaluation function. The coding mechanism is as follows: the tasks share a unified encoding, and when their orders differ a left-to-right selection strategy is used.
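The mechanism described above can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the invention's implementation: every individual is assumed to be encoded with as many SNPs as the highest order, a task of order k reads the first k entries (left-to-right selection), the worst member of a memory is replaced when the new individual scores higher, and the cross-learning probability `p_cross` and all function names are hypothetical.

```python
import random


def decode(individual, k):
    """Left-to-right selection: a task of order k uses the first k SNPs of the code."""
    return individual[:k]


def new_individual(memories, task, k_max, n_snps, rng, p_cross=0.3, hmcr=0.95):
    """Create one unified-length individual for `task` (p_cross is hypothetical)."""
    others = [t for t in memories if t != task]
    child = []
    while len(child) < k_max:
        if rng.random() < hmcr:
            # Inter-group learning borrows from another task's memory,
            # intra-group learning from the task's own memory.
            src = rng.choice(others) if others and rng.random() < p_cross else task
            snp = rng.choice(memories[src])[len(child)]
        else:
            snp = rng.randrange(n_snps)
        if snp not in child:  # keep the SNPs of a combination distinct
            child.append(snp)
    return tuple(child)


def multitask_search(score, orders, n_snps, hms=10, max_t=200, seed=0):
    """score(snps) -> float is a user-supplied association measure (higher is better)."""
    rng = random.Random(seed)
    k_max = max(orders)
    # One independent harmony memory per task, all with the unified length k_max.
    memories = {k: [tuple(rng.sample(range(n_snps), k_max)) for _ in range(hms)]
                for k in orders}
    for _ in range(max_t):
        for k in orders:  # each iteration creates one new individual per task
            child = new_individual(memories, k, k_max, n_snps, rng)
            hm = memories[k]
            worst = min(range(hms), key=lambda i: score(decode(hm[i], k)))
            if score(decode(child, k)) > score(decode(hm[worst], k)):
                hm[worst] = child
    return {k: max((decode(ind, k) for ind in memories[k]), key=score) for k in orders}
```

A call such as `multitask_search(score, orders=(2, 3), n_snps=30)` returns one best combination per order. Supplying a different `score` per task, as the text allows, would only require passing a mapping from order to scoring function.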
A further object of the invention is to provide computer equipment comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
reading PED and MAP format data from a VCF file using Plink software, and arranging the converted binary-format files into a sample matrix;
setting the search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in the SNP sample data and preparing the first-stage search;
and detecting high-order epistatic SNP combinations using a multitask, multi-harmony-memory harmony search algorithm.
Another object of the invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
reading PED and MAP format data from a VCF file using Plink software, converting the data into binary-format files and arranging them into a sample matrix;
setting the search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in the SNP sample data and preparing the first-stage search;
and detecting high-order epistatic SNP combinations using a multitask, multi-harmony-memory harmony search algorithm.
Another object of the invention is to provide an SNP detection information data processing terminal for implementing the above multitask high-order SNP epistasis detection method.
Another object of the invention is to provide a multitask high-order SNP epistasis detection system for implementing the multitask high-order SNP epistasis detection method, comprising:
a data preprocessing module, used for reading PED and MAP format data from a VCF file using Plink software and arranging the converted binary-format files into a sample matrix;
an algorithm parameter setting module, used for setting the harmony search algorithm parameters according to the number of SNP sites and the sample size in the data, the parameters comprising the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR and the pitch adjusting (local fine-tuning) rate PAR;
a data reading module, used for reading in the SNP sample data and preparing the first-stage search;
and a high-order epistatic SNP combination detection module, used for detecting high-order epistatic SNP combinations using a multitask, multi-harmony-memory harmony search algorithm.
Taking all the above technical solutions together, the advantages and positive effects of the invention are as follows. The invention provides a multitask harmony search detection method that uses several harmony memories, each storing SNP combinations of a different order. Applying multitask technology promotes mutual learning among individuals, increases population diversity, and thereby improves global search capability.
The multitask high-order SNP epistasis detection method is easy to understand and implement. With a multitask harmony search strategy, high-order SNP epistasis detection at several different orders can run at the same time, which greatly improves detection performance and gives high detection speed and strong search capability. Each task uses one harmony memory and the same or a different type of association evaluation function. This increases the diversity of the populations (harmony memories), and cross learning between individuals of different populations strengthens global search capability. Using several different types of association evaluation function improves discrimination among SNP epistasis models, reduces model preference, and further improves the ability to detect high-order epistatic SNP combinations.
The invention addresses the low sensitivity of the prior art in high-order SNP epistasis detection, as well as its low identification accuracy and its preference for particular SNP epistasis models. It also addresses the limitation that existing detection techniques can detect SNP epistasis at only one order per run: the invention can detect several high-order epistatic SNP combinations of different orders simultaneously. By using a multi-harmony-memory strategy, the invention improves the global detection capability of the harmony search strategy and reduces the amount of computation required by the SNP combinatorial explosion problem.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings described below are only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the multitask high-order SNP epistasis detection method provided by an embodiment of the invention.
Fig. 2 is a schematic structural diagram of the multitask high-order SNP epistasis detection system provided by an embodiment of the invention.
In Fig. 2: 1. data preprocessing module; 2. algorithm parameter setting module; 3. data reading module; 4. high-order epistatic SNP combination detection module.
Fig. 3 is a flowchart of an implementation of the multitask high-order SNP epistasis detection method provided by an embodiment of the invention.
Fig. 4 is a flowchart of high-order epistatic SNP combination detection using a multitask, multi-harmony-memory harmony search algorithm, provided by an embodiment of the invention.
Fig. 5 is a schematic diagram of the basic rules for generating a harmony, provided by an embodiment of the invention.
Fig. 6(a) is a schematic diagram of generating a new individual within a population by combination crossover and single-site interchange, provided by an embodiment of the invention.
Fig. 6(b) is a schematic diagram of generating a new individual within a population by single-site variation, provided by an embodiment of the invention.
Fig. 7 is a schematic diagram of inter-task cross learning provided by an embodiment of the invention.
Fig. 8 is a schematic diagram of individual transfer learning between tasks, provided by an embodiment of the invention.
Fig. 9 is a basic flowchart for generating a new individual, provided by an embodiment of the invention.
Fig. 10 is a comparison graph of detection capability provided by an embodiment of the invention.
Fig. 11 is a comparison chart of algorithm detection time provided by an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
To address the problems in the prior art, the invention provides a multitask high-order SNP epistasis detection method, system, storage medium and equipment, which are described in detail below with reference to the drawings.
As shown in Fig. 1, the multitask high-order SNP epistasis detection method provided by the invention comprises the following steps:
S101: reading PED and MAP format data from a VCF file using Plink software, and arranging the converted binary-format files (FAM, BED and BIM) into a sample matrix;
S102: setting the harmony search algorithm parameters according to the number of SNP sites and the sample size in the data, the parameters comprising the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR and the pitch adjusting (local fine-tuning) rate PAR;
S103: reading in the SNP sample data and preparing the first-stage search;
S104: detecting high-order epistatic SNP combinations using a multitask, multi-harmony-memory harmony search algorithm.
Those skilled in the art may also implement the multitask high-order SNP epistasis detection method provided by the invention with other steps; the method shown in Fig. 1 is only one specific embodiment.
As shown in Fig. 2, the multitask high-order SNP epistasis detection system provided by the invention comprises:
the data preprocessing module 1, used for reading PED and MAP format data from a VCF file using Plink software, converting the data into binary-format files (FAM, BED and BIM) and arranging them into a sample matrix;
the algorithm parameter setting module 2, used for setting the harmony search algorithm parameters according to the number of SNP sites and the sample size in the data, the parameters comprising the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR and the pitch adjusting (local fine-tuning) rate PAR;
the data reading module 3, used for reading in the SNP sample data and preparing the first-stage search;
and the high-order epistatic SNP combination detection module 4, used for detecting high-order epistatic SNP combinations using a multitask, multi-harmony-memory harmony search algorithm.
The technical solution of the present invention is further described below with reference to the accompanying drawings.
SNP: single nucleotide polymorphism.
High-order SNP epistasis: multiple SNP sites act in combination on a phenotype or disease state.
Multitask (multi-task): high-order SNP epistasis detection is carried out for several different orders simultaneously.
Multi-harmony-memory (multiple harmony memory) harmony search strategy: a harmony search algorithm with multiple harmony memories.
Single-task optimization means that one optimization task is completed at a time; the task can be a single-objective or a multi-objective optimization problem.
Multitask optimization is a newer optimization technique that exploits the potential relevance among several different tasks, letting them influence, interact with and learn from each other so that several optimization tasks are completed quickly. Multitask optimization can solve several single-objective optimization problems simultaneously, and can also solve several multi-objective optimization problems simultaneously.
The invention uses multitask optimization to detect high-order SNP epistasis of several different orders at the same time.
The harmony search algorithm is a meta-heuristic search algorithm that simulates the process by which musicians create a harmony, aiming to find the optimal combination of notes and thus play the best harmony. Harmony search has strong global search capability and is well suited to combinatorial optimization problems. The high-order SNP epistasis detection problem can be expressed as a combinatorial optimization problem in which X represents a combination of k SNPs and the aim is to find, from the genome, the SNP epistatic combination $X^*$ having the strongest association with the disease state Y.
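One way to write this single-task problem compactly is sketched below. It assumes a hypothetical score function $f$ for which larger values mean stronger association, and a hypothetical symbol $N$ for the number of SNP sites in the data; neither symbol appears in the original text, and an evaluation function for which smaller values are better would use $\arg\min$ instead.

```latex
X^{*} = \arg\max_{X \subseteq \{1,\dots,N\},\; |X| = k} f(X, Y)
```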
The multitask harmony search algorithm adopted by the invention aims to find several SNP epistatic combinations of different orders from the genome. In the corresponding mathematical model, $X_i$ represents a $k_i$-order ($k_i > 2$) SNP combination, and the goal is to find, from the genome, the $k_1$-order, $k_2$-order, …, $k_M$-order SNP epistatic combinations $X_1^*, X_2^*, \ldots, X_M^*$ having the strongest association with the disease state.
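The multitask model can be sketched as $M$ problems of the same form solved simultaneously, one per order. The sketch assumes a hypothetical per-task score function $f_i$ for which larger values mean stronger association, and a hypothetical symbol $N$ for the number of SNP sites; these symbols are illustrative and do not appear in the original text.

```latex
X_i^{*} = \arg\max_{X_i \subseteq \{1,\dots,N\},\; |X_i| = k_i} f_i(X_i, Y), \qquad i = 1, 2, \dots, M
```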
The invention adopts a harmony search algorithm with multiple tasks and multiple harmony memories, which can solve several optimization problems simultaneously. Each task corresponds to an independent harmony memory (HM), and each harmony memory applies its own selection mechanism to keep better individuals and eliminate worse ones. During the search, each iteration generates a new individual for each task. New individuals are generated mainly in two ways: intra-population learning (including intra-population crossover, single-site interchange, single-site mutation and the like) and combined cross learning between populations. The tasks of the multitask harmony search method can all use the same type of relevance evaluation function (such as a Bayesian network method or a statistical test method) or can each use a different type, and each individual in a harmony memory can even be evaluated with several different types of evaluation functions (similar to multi-objective optimization). In terms of search strategy, the invention adopts a new framework that improves search efficiency. Running multiple tasks simultaneously improves search performance. In particular, for disease models without marginal effects, the multitask search mechanism can discover the marginal effects of some low-order combinations, which in turn promotes the discovery of higher-order SNP epistatic combinations.
The coding mechanism adopted by the invention is as follows: all tasks share a unified coding, and when the task orders differ, sites are selected from left to right. For example, in the 3-order task with solution vector X = (2, 6, 9, 14, 49), only the SNP site combination (2, 6, 9) is selected for association evaluation.
Under this coding scheme, SNP sites 14 and 49 are not evaluated in the 3-order task, but they can be used for cross learning with other tasks, and within this task they can be exchanged with the preceding SNP sites by single-site interchange to help optimize the individuals of this task's population.
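The left-to-right selection and the single-site interchange can be illustrated with a short Python sketch. The function names `active_sites` and `single_site_interchange` are hypothetical and chosen only for this illustration; the example vector is the one used in the text.

```python
def active_sites(harmony, order):
    """Return the leftmost `order` sites, i.e. the sites a k-order task evaluates."""
    if not 1 <= order <= len(harmony):
        raise ValueError("order must be between 1 and the coding length")
    return tuple(harmony[:order])


def single_site_interchange(harmony, order, active_pos, reserve_pos):
    """Swap one evaluated site with one of the trailing (unevaluated) sites.

    active_pos indexes the first `order` sites; reserve_pos indexes the rest.
    """
    swapped = list(harmony)
    j = order + reserve_pos
    swapped[active_pos], swapped[j] = swapped[j], swapped[active_pos]
    return tuple(swapped)


x = (2, 6, 9, 14, 49)
print(active_sites(x, 3))                    # (2, 6, 9)
print(single_site_interchange(x, 3, 1, 0))   # (2, 14, 9, 6, 49)
```

After the interchange, the 3-order task would evaluate (2, 14, 9), while sites 6 and 49 remain available for cross learning with other tasks.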
As shown in FIG. 3, the multitask high-order SNP epistasis detection method specifically includes the following steps:
(1) Data preprocessing
PED and MAP format data are read from the VCF file using PLINK software, and the data are further converted into binary format files (FAM, BED and BIM) and arranged into a sample matrix.
(2) Algorithm parameter setting
The harmony search algorithm parameters are set according to the number of SNP sites and the sample size of the data; the parameters include the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR, the pitch adjusting rate (local fine-tuning probability) PAR and the like.
(3) Data reading
SNP sample data are read in, and the first-stage search is prepared.
(4) High-order SNP (single nucleotide polymorphism) epistatic combination detection is performed using the multitask, multi-harmony-memory harmony search algorithm (the algorithm flow is shown in FIG. 3).
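For step (1), once PLINK has written the binary files, the BED file can be unpacked into a sample matrix. The sketch below is a minimal pure-Python decoder, assuming the SNP-major PLINK 1 BED layout (three magic bytes, then two bits per sample for each SNP, lowest bits first). The function name `decode_bed` and the output codes (0, 1, 2 for the three genotypes, -1 for missing) are illustrative assumptions and are not taken from the patent.

```python
BED_MAGIC = b"\x6c\x1b\x01"  # PLINK 1 BED header, SNP-major mode

# 2-bit BED code -> genotype code used in this sketch (an assumption):
# 00 homozygous first allele, 01 missing, 10 heterozygous, 11 homozygous second allele
GENOTYPE_OF_CODE = (0, -1, 1, 2)


def decode_bed(raw, n_samples, n_snps):
    """Decode BED bytes into a samples x SNPs matrix (list of lists)."""
    if raw[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK 1 BED file")
    bytes_per_snp = (n_samples + 3) // 4  # four samples per byte
    if len(raw) != 3 + n_snps * bytes_per_snp:
        raise ValueError("file size does not match the given dimensions")
    matrix = [[0] * n_snps for _ in range(n_samples)]
    for j in range(n_snps):
        start = 3 + j * bytes_per_snp
        block = raw[start:start + bytes_per_snp]
        for i in range(n_samples):
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            matrix[i][j] = GENOTYPE_OF_CODE[code]
    return matrix


# Synthetic example: 3 samples, 2 SNPs (one byte per SNP).
raw = BED_MAGIC + bytes([0b00111000, 0b00100001])
print(decode_bed(raw, n_samples=3, n_snps=2))  # [[0, -1], [1, 0], [2, 1]]
```

In practice the sample and SNP counts would come from the line counts of the FAM and BIM files that accompany the BED file.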
Relevance evaluation of the initial harmonies:
Pseudo code 1.1: relevance evaluation in which each harmony (individual) is evaluated with a single evaluation function.
Function of the code: relevance evaluation is performed on the individuals in the populations of the K-1 tasks (orders 2, 3, …, K), each population having size NP.
With pseudo code 1.1, each individual yields M = K-1 fitness values.
Relevance evaluation of the initial harmonies when each harmony (individual) is evaluated with 3 evaluation functions:
Function of the code: the individuals in the populations of the K-1 tasks (orders 2, 3, …, K; population size NP) are each evaluated with the relevance evaluation functions $f_1$, $f_2$ and $f_3$.
Each individual yields K x 3 fitness values (multiple evaluation indices).
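The evaluation step can be illustrated with a short Python sketch. It assumes a Pearson chi-square statistic on the genotype-combination by disease-state contingency table as the single relevance function (the text names Bayesian network and statistical test methods only as examples), and the names `chi_square_score` and `evaluate_all_orders` are hypothetical. Each individual is scored once per task order, using the leftmost k sites of its unified coding.

```python
from collections import Counter


def chi_square_score(genotypes, labels, sites):
    """Pearson chi-square of (genotype combination at `sites`) x (disease state)."""
    cells = Counter(
        (tuple(row[s] for s in sites), y) for row, y in zip(genotypes, labels)
    )
    combos, classes = Counter(), Counter()
    for (combo, y), n in cells.items():
        combos[combo] += n
        classes[y] += n
    total = len(labels)
    score = 0.0
    for combo, n_combo in combos.items():
        for y, n_class in classes.items():
            expected = n_combo * n_class / total
            observed = cells.get((combo, y), 0)
            score += (observed - expected) ** 2 / expected
    return score


def evaluate_all_orders(genotypes, labels, harmony, max_order):
    """One fitness value per task order 2..max_order (K-1 values in total)."""
    return {
        k: chi_square_score(genotypes, labels, harmony[:k])
        for k in range(2, max_order + 1)
    }


# Toy data: the label depends on sites 0 and 1 jointly (XOR), not on either alone.
genotypes = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 0]]
labels = [0, 1, 1, 0]
print(chi_square_score(genotypes, labels, (0,)))            # 0.0
print(evaluate_all_orders(genotypes, labels, (0, 1, 2), 3))  # {2: 4.0, 3: 4.0}
```

The toy data show a model without marginal effect: site 0 alone scores 0.0, while the pair (0, 1) scores 4.0. Evaluating with three functions instead of one would simply repeat the call with two further scoring functions.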
Pseudo code 2 (task division)
Task division: all individuals are divided among the tasks.
Generation of new individuals:
K fitness values can be calculated for each individual.
Generation of a new individual in the population of task k:
New individuals are generated according to the basic rules of FIG. 5.
Pseudo code 3: new individuals are generated according to the harmony memory rules.
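As an illustration of harmony-memory-based generation, the sketch below follows the standard harmony search improvisation rule (memory consideration with probability HMCR, pitch adjustment with probability PAR, otherwise random selection) together with a replace-the-worst memory update. It is not the patent's own pseudo code: the function names, the use of 0-based SNP indices, the choice of a neighbouring index as the pitch adjustment, and the assumption that larger scores are better are all illustrative.

```python
import random


def improvise(memory, n_snps, hmcr, par, rng):
    """Build one new harmony (tuple of distinct SNP indices) from a harmony memory."""
    length = len(memory[0])
    new = []
    for pos in range(length):
        if rng.random() < hmcr:
            # Memory consideration: copy this position from a random stored harmony.
            site = rng.choice(memory)[pos]
            if rng.random() < par:
                # Pitch adjustment (assumed here to mean a neighbouring SNP index).
                site = min(n_snps - 1, max(0, site + rng.choice((-1, 1))))
        else:
            # Random selection from the whole search space.
            site = rng.randrange(n_snps)
        while site in new:  # keep the sites of one harmony distinct
            site = rng.randrange(n_snps)
        new.append(site)
    return tuple(new)


def update_memory(memory, scores, candidate, candidate_score):
    """Replace the worst stored harmony if the candidate scores higher."""
    worst = min(range(len(scores)), key=scores.__getitem__)
    if candidate_score > scores[worst] and candidate not in memory:
        memory[worst] = candidate
        scores[worst] = candidate_score
        return True
    return False


rng = random.Random(1)
memory = [(0, 1, 2), (3, 4, 5)]
scores = [1.0, 2.0]
candidate = improvise(memory, n_snps=10, hmcr=0.9, par=0.3, rng=rng)
print(candidate, update_memory(memory, scores, candidate, 1.5))
```

In a multitask setting, one such memory and score list would be kept per task, and `improvise` would be called once per task in every iteration.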
The generation of new individuals within a population and by cross learning between populations is shown in FIG. 6(a), FIG. 6(b) and FIGS. 7 to 9.
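Cross learning between populations can be pictured as recombining two harmonies taken from the memories of different tasks, which is possible because all tasks share the unified coding. The sketch below uses a plain uniform crossover that keeps SNP sites distinct. This operator and the name `inter_task_crossover` are assumptions made for illustration, since the exact operators are defined by the figures rather than the text.

```python
import random


def inter_task_crossover(parent_a, parent_b, rng):
    """Uniform crossover of two unified-coded harmonies (tuples of distinct sites)."""
    child = []
    for site_a, site_b in zip(parent_a, parent_b):
        # Pick which parent is tried first for this position.
        first, second = (site_a, site_b) if rng.random() < 0.5 else (site_b, site_a)
        if first not in child:
            child.append(first)
        elif second not in child:
            child.append(second)
        else:
            # Both sites are already used: take any unused site from either parent.
            child.append(next(s for s in parent_a + parent_b if s not in child))
    return tuple(child)


rng = random.Random(0)
from_task_3 = (2, 6, 9, 14, 49)   # a harmony from the 3-order task's memory
from_task_4 = (6, 21, 9, 33, 2)   # a harmony from the 4-order task's memory
print(inter_task_crossover(from_task_3, from_task_4, rng))
```

The child can then be evaluated by either task, each reading its own number of leading sites.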
Comparison of test results of the present invention in 5 simulation data sets (see Table 1, Table 2, Table 3, FIG. 10, FIG. 11)
TABLE 1 Simulation data set parameters
Data set | Order of high-order SNP combination | Number of SNPs | Sample size | Maximum number of allowed |
DME Data1 | 5 | 1000 | 1000 | 500000 |
DME Data2 | 5 | 1000 | 1000 | 500000 |
DME Data3 | 5 | 1000 | 2000 | 500000 |
DME Data4 | 5 | 10000 | 1000 | 5000000 |
DME Data5 | 5 | 10000 | 2000 | 5000000 |
DME Data6 | 5 | 10000 | 5000 | 5000000 |
TABLE 2 Comparison of detection capability
Data set | EPI-ACO | SNPHarvester | MP-HS-DHSI | NHSA-DHSC | The method of the invention |
DME Data1 | 75.00% | 63.00% | 85.00% | 84.00% | 83.00% |
DME Data2 | 79.00% | 58.00% | 86.00% | 87.00% | 87.00% |
DME Data3 | 85.00% | 70.00% | 89.00% | 88.00% | 90.00% |
DME Data4 | 63.00% | 48.00% | 75.00% | 73.00% | 81.00% |
DME Data5 | 65.00% | 44.00% | 81.00% | 79.00% | 84.00% |
DME Data6 | 69.00% | 52.00% | 89.00% | 81.00% | 92.00% |
TABLE 3 Comparison of average detection time (unit: seconds)
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
The above description covers only specific embodiments of the present invention and does not limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention falls within the scope defined by the appended claims.
Claims (10)
1. A multitask high-order SNP epistasis detection method, characterized in that the method comprises the following steps:
reading PED and MAP format data from a VCF file using PLINK software, converting the data into binary format files, and arranging them into a sample matrix;
setting search algorithm parameters according to the number of SNP sites and the sample size of the data;
reading in SNP sample data and preparing the first-stage search;
and performing high-order SNP (single nucleotide polymorphism) epistatic combination detection using a multitask, multi-harmony-memory harmony search algorithm.
2. The multitask high-order SNP epistasis detection method according to claim 1, wherein PLINK software is used to read PED and MAP format data from a VCF file, and the data are further converted into FAM, BED and BIM files and arranged into a sample matrix.
3. The multitask high-order SNP epistasis detection method according to claim 1, wherein the harmony search algorithm parameters are set according to the number of SNP sites and the sample size of the data, the parameters including the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR and the pitch adjusting rate PAR.
4. The multitask high-order SNP epistasis detection method according to claim 1, wherein the harmony search algorithm of the method is a meta-heuristic search algorithm, and the high-order SNP epistasis detection problem is expressed as a combinatorial optimization problem in which X represents a combination of k SNPs and the aim is to find, from the genome, the SNP epistatic combination $X^*$ having the strongest association with the disease state Y.
5. The multitask high-order SNP epistasis detection method according to claim 1, wherein the multitask harmony search algorithm adopted by the method aims to find several SNP epistatic combinations of different orders from the genome, and in the corresponding mathematical model $X_i$ represents a $k_i$-order ($k_i > 2$) SNP combination, the goal being to find, from the genome, the $k_1$-order, $k_2$-order, …, $k_M$-order SNP epistatic combinations $X_1^*, X_2^*, \ldots, X_M^*$ having the strongest association with the disease state.
6. The multitask high-order SNP epistasis detection method according to claim 1, wherein each task of the method corresponds to an independent harmony memory (HM), and each harmony memory applies its own selection mechanism to keep better individuals and eliminate worse ones; during the search, each iteration generates a new individual for each task; new individuals are generated in two ways: intra-population learning and combined cross learning between populations;
the tasks of the multitask harmony search method can all use the same type of relevance evaluation function or can each use a different type of relevance evaluation function, and each individual in a harmony memory can even be evaluated with several different types of evaluation functions;
the unified coding mechanism is as follows: all tasks share a unified coding and search in a unified search space; when the relevance evaluation of a k-order task is carried out, the coding is read from the left and k consecutive positions are selected as the individual coding for that task.
7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the following steps:
reading PED and MAP format data from a VCF file using PLINK software, converting the data into binary format files, and arranging them into a sample matrix;
setting search algorithm parameters according to the number of SNP sites and the sample size of the data;
reading in SNP sample data and preparing the first-stage search;
and detecting high-order SNP epistatic combinations of several different orders using a multitask, multi-harmony-memory harmony search algorithm.
8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to carry out the following steps:
reading PED and MAP format data from a VCF file using PLINK software, converting the data into binary format files, and arranging them into a sample matrix;
setting search algorithm parameters according to the number of SNP sites and the sample size of the data;
reading in SNP sample data and preparing the first-stage search;
and performing high-order SNP (single nucleotide polymorphism) epistatic combination detection using a multitask, multi-harmony-memory harmony search algorithm.
9. An information data processing terminal for SNP (single nucleotide polymorphism) epistasis detection, used for implementing the multitask high-order SNP epistasis detection method according to any one of claims 1 to 6.
10. A multitask high-order SNP epistasis detection system for implementing the multitask high-order SNP epistasis detection method according to any one of claims 1 to 6, wherein the system comprises:
the data preprocessing module, used for reading PED and MAP format data from a VCF file using PLINK software, converting the data into binary format files, and arranging them into a sample matrix;
the algorithm parameter setting module, used for setting the harmony search algorithm parameters according to the number of SNP sites and the sample size of the data, the parameters including the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR and the pitch adjusting rate PAR;
the data reading module, used for reading in SNP sample data and preparing the first-stage search;
and the multitask high-order SNP epistatic combination detection module, used for performing high-order SNP epistatic combination detection using a multitask, multi-harmony-memory harmony search algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011315829.2A CN112447263B (en) | 2020-11-22 | 2020-11-22 | Multi-task high-order SNP upper detection method, system, storage medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112447263A true CN112447263A (en) | 2021-03-05 |
CN112447263B CN112447263B (en) | 2023-12-26 |
Family
ID=74738143
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011315829.2A Active CN112447263B (en) | 2020-11-22 | 2020-11-22 | Multi-task high-order SNP upper detection method, system, storage medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112447263B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010224815A (en) * | 2009-03-23 | 2010-10-07 | Japan Found Cancer Res | Retrieval algorithm for epistasis effect based on comprehensive genome wide snp information |
WO2017159686A1 (en) * | 2016-03-15 | 2017-09-21 | Repertoire Genesis株式会社 | Monitoring and diagnosis for immunotherapy, and design for therapeutic agent |
CN109448794A (en) * | 2018-10-31 | 2019-03-08 | 华中农业大学 | A kind of epistasis site method for digging based on heredity taboo and Bayesian network |
CN110633386A (en) * | 2019-09-27 | 2019-12-31 | 哈尔滨理工大学 | Model similarity calculation method based on genetic and acoustic mixed search |
Non-Patent Citations (4)
Title |
---|
SHAUN PURCELL: "PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses", THE AMERICAN JOURNAL OF HUMAN GENETICS, vol. 81, pages 559 - 575, XP055061306, DOI: 10.1086/519795 * |
SHOUHENG TUO: "Multipopulation harmony search algorithm for the detection of high-order SNP interactions", BIOINFORMATICS, vol. 36, no. 16, pages 4389 * |
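Yang Jun; Yin Jianping; Zhan Yubin: "Application of tabu-search-based multifactor dimensionality reduction in epistasis detection", Journal of Wuhan University (Natural Science Edition), no. 06 * |
Zhai Junchang; Gao Liqun; Ouyang Haibin; Liu Hongzhi: "Improved novel global harmony search algorithm", Journal of Northeastern University (Natural Science), no. 10 * |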
Also Published As
Publication number | Publication date |
---|---|
CN112447263B (en) | 2023-12-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||