CN112447263A - Multitask high-order SNP epistasis detection method, system, storage medium and equipment - Google Patents

Info

Publication number
CN112447263A
Authority
CN
China
Prior art keywords
snp
order
multitask
data
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011315829.2A
Other languages
Chinese (zh)
Other versions
CN112447263B (en
Inventor
拓守恒
刘凡
李超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Posts and Telecommunications
Original Assignee
Xian University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Posts and Telecommunications
Priority to CN202011315829.2A
Publication of CN112447263A
Application granted
Publication of CN112447263B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of single nucleotide polymorphism (SNP) epistasis detection, and discloses a multitask high-order SNP epistasis detection method, system, storage medium and equipment. The method uses Plink software to read PED and MAP format data from a VCF (Variant Call Format) file, converts the data into binary format files, and arranges them into a sample matrix; sets the search algorithm parameters according to the number of SNP sites and the sample size in the data; reads in the SNP sample data and prepares the first-stage search; and performs high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search (HS) algorithm. The invention provides a multitask harmony search detection method that uses multiple harmony memories to store SNP combinations of different orders separately. Applying the multitask technique allows high-order SNP epistasis detection at several different orders to be carried out simultaneously, promotes mutual learning among individuals in the populations, enhances population diversity, and thereby improves the global search capability.

Description

Multitask high-order SNP epistasis detection method, system, storage medium and equipment
Technical Field
The invention belongs to the technical field of single nucleotide polymorphism epistasis detection, and particularly relates to a multitask high-order SNP epistasis detection method, system, storage medium and equipment.
Background
A single nucleotide polymorphism (SNP) is a polymorphism caused by variation at a single base site at the genome level. It may be a transition or a transversion of a single base, or may be caused by the insertion or deletion of a base. For example, if a C-G base pair in sequence 1 appears as A-T in sequence 2, that site is called an SNP site. Across the whole human genome there are more than 3 million such SNP sites. Most SNPs pose no threat to human health, but some SNP variant sites are closely related to it. The epistatic effect denotes the interaction between genes or SNPs, traditionally defined as an allele at one locus masking the phenotypic expression of an allele at another locus. An epistatic effect among multiple SNPs means that the SNPs jointly act on the expression of a phenotype. A complex disease may be influenced by the joint action of multiple SNPs: if SNP variations occur simultaneously at those sites in a person, the probability of disease increases markedly. A k-order SNP epistasis is an epistasis in which k SNPs jointly act on a phenotype (or disease state). High-order SNP epistasis detection with k > 2 is a very complex combinatorial explosion problem. The amount of computation is huge, and existing computers cannot complete epistatic combination detection over the whole genome in a practical amount of time. Although many methods have been proposed for high-order SNP epistasis detection, such as exhaustive methods, parallel computing methods and Monte Carlo methods, the problems of high search cost and low detection capability remain.
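The scale of the combinatorial explosion described above can be checked with a few lines of arithmetic. The sketch below counts the k-order combinations among a panel of SNP sites; the panel size of one million sites is a hypothetical figure chosen only for illustration, not a value taken from the patent.

```python
from math import comb

def count_snp_combinations(n_snps: int, order: int) -> int:
    """Number of distinct k-order SNP combinations among n_snps sites."""
    return comb(n_snps, order)

# Hypothetical panel size, chosen only for illustration.
n_snps = 1_000_000
for k in (2, 3, 4):
    print(f"k={k}: {count_snp_combinations(n_snps, k):.3e} combinations")
```

Each increase of the order by one multiplies the count by roughly n/k, which is why exhaustive evaluation quickly becomes impractical for k > 2.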
Existing high-order SNP epistasis detection techniques can basically complete SNP epistasis detection for only one order (for example, order 3) at a time. Detecting SNP epistasis at different orders (order 2, order 3, ..., order k) requires multiple trial runs, which is very costly.
The prior art has found many susceptibility genes by associating individual SNPs with disease states, but this approach does not explain complex diseases well. It is therefore widely held in biology that high-order SNP epistatic combinations are an important possible cause of complex diseases. However, the number of high-order SNP combinations is extremely large, a very complex combinatorial explosion problem, so existing computers can hardly examine all possible SNP combinations; this is one of the most important challenges facing the existing technology. In addition, accurately identifying whether a high-order SNP combination has an epistatic effect is itself an important and complicated research topic. Existing methods often show a preference for particular SNP epistasis models and are difficult to apply to all epistasis models.
At present, high-order SNP epistasis detection methods need to overcome two main problems. First, the number of high-order SNP combinations is so large that all possible combinations cannot be examined in a practical amount of time. Second, because the pathogenesis of complex diseases is unknown and the possible SNP epistasis models are diverse, correctly identifying the pathogenic high-order SNP combinations is also a great challenge.
From the perspective of search, the prior art can be classified as follows:
1. Exhaustive search. All k-order SNP combinations (combinations of k SNPs) are enumerated and their association is evaluated with some method. The advantage is that no possibly pathogenic SNP combination is missed; the disadvantage is that the amount of computation is extremely large, and when k is larger than 3 the computation cannot be completed in a practical amount of time.
2. Stochastic search. The solution space is searched by random sampling, which greatly reduces the amount of computation, but the success rate is low, and the method works better for SNP epistatic combinations that have marginal effects.
3. Machine-learning-based methods. These methods (such as random forests and support vector machines) use feature selection to remove, from a high-dimensional SNP set, the SNP sites that do not help classification performance. Their disadvantages are a large amount of computation and low classification accuracy on small-sample, high-dimensional data.
4. Stepwise search. A statistical method first screens out a set of SNPs with marginal effects, and higher-order SNP epistatic combinations are then sought among the SNP sites in that set. The advantages are a small amount of computation and a high search speed. The disadvantage is that high-order SNP epistatic combinations with no (or low) marginal effect are difficult to find.
5. Search techniques based on swarm intelligence optimization. These techniques use the information carried by individuals in a population, which learn from and communicate with one another, and can markedly improve search efficiency. However, how to ensure that a global optimum is obtained, and that the method has no preference for particular high-order SNP epistasis models, is an important open problem.
6. Single-task search detection techniques. Existing detection techniques can complete SNP epistasis detection for only one order (for example, order 3) at a time. Detecting SNP epistasis at different orders (order 2, order 3, ..., order k) requires multiple trial runs, which is very costly.
From the perspective of how the association between an SNP combination and the disease state is determined (the evaluation method), the prior art can be classified as follows:
(1) Statistical test methods. Based on hypothesis testing theory, these methods analyze the significance of the difference between the genotype distributions of an SNP combination in the disease (Case) samples and the normal (Control) samples, and screen out the SNP combinations whose genotype distributions differ significantly between the Case and Control samples.
(2) Mutual information (MI). Using ideas from information theory, the mutual information between the genotype of an SNP combination and the disease state is analyzed, giving an association analysis between the genotype of the SNP combination and the disease state.
(3) Machine learning (ML). The sample data corresponding to a specified SNP combination are used for training and testing, and the classification accuracy of the SNP combination on the Case and Control samples is then evaluated.
(4) Bayesian-network-based evaluation methods. The Bayesian network is a two-layer probabilistic graphical model, with one layer consisting of a set of SNP nodes and the other consisting of a disease node. Their conditional dependencies are represented as a set of edges in a directed acyclic graph.
The above evaluation methods require multiple testing. Mutual information and Bayesian network evaluation are lightweight evaluation methods, but they have a preference for particular models. The advantage of machine learning is that SNP combinations of arbitrary orders can be evaluated and compared, but for higher-order SNP combinations the recognition accuracy is low and the amount of computation is large. Existing detection methods for high-order SNP epistatic combinations mainly have the following defects: (1) The detection method depends too heavily on SNP epistatic (pathogenic) models, so it has a preference for certain simulation models and is difficult to apply to the detection of unknown models; in particular, it is difficult to provide an efficient detection method for real complex disease data sets. (2) The P-value threshold used in statistical test methods is set manually, resulting in poor sensitivity of the detection results. (3) Most existing swarm intelligence search algorithms use a single association evaluation function, or several with similar behavior, so the search results are not accurate enough and true pathogenic SNP epistatic combinations may be missed. (4) The detection capability is low for data in which multiple pathogenic SNP combinations are present.
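As an illustration of the mutual information evaluation listed in item (2), the following minimal sketch estimates I(G; Y) between the joint genotype G of a chosen SNP combination and the disease state Y from raw counts. The function name, the 0/1/2 genotype coding and the toy data are assumptions made for this example; the patent does not specify an implementation.

```python
from collections import Counter
from math import log2

def mutual_information(genotypes, labels, snp_idx):
    """I(G; Y) in bits, where G is the joint genotype at the sites in snp_idx.

    genotypes: list of per-sample lists with values 0, 1 or 2
    labels:    list of 0 (Control) / 1 (Case), same length as genotypes
    """
    n = len(labels)
    joint = Counter()   # counts of (genotype combination, label)
    g_cnt = Counter()   # counts of genotype combination
    y_cnt = Counter()   # counts of label
    for row, y in zip(genotypes, labels):
        g = tuple(row[i] for i in snp_idx)
        joint[(g, y)] += 1
        g_cnt[g] += 1
        y_cnt[y] += 1
    mi = 0.0
    for (g, y), c in joint.items():
        p_gy = c / n
        mi += p_gy * log2(p_gy / ((g_cnt[g] / n) * (y_cnt[y] / n)))
    return mi

# Toy XOR-style data (hypothetical): the label depends on both sites jointly.
toy_genotypes = [[0, 0], [0, 1], [1, 0], [1, 1]]
toy_labels = [0, 1, 1, 0]
print(mutual_information(toy_genotypes, toy_labels, (0,)))    # 0.0
print(mutual_information(toy_genotypes, toy_labels, (0, 1)))  # 1.0
```

In the toy data neither site carries any information about the label on its own, while the pair determines it completely. This is the "no marginal effect" situation that the stepwise search methods described above struggle with.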
Although the prior art shows some effect in detecting high-order SNP epistatic combinations, it generally has the following disadvantages:
(1) The detection methods have high computational complexity, or easily miss true SNP epistatic combinations.
(2) The sensitivity of the detection results is not high, and generality is very low.
(3) The detection methods have a preference for particular SNP epistasis models, and the success rate of the detection algorithms is not high enough; single-task detection methods need repeated trial runs for unknown diseases, so the amount of computation is large and heuristic search is hindered.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Existing detection methods have high computational complexity, or easily miss true SNP epistatic combinations.
(2) Existing detection methods give detection results with low sensitivity and low generality.
(3) Existing detection methods have a preference for particular SNP epistasis models, and the success rate of the detection algorithms is not high enough; single-task detection methods need repeated trial runs for unknown diseases, so the amount of computation is large and heuristic search is hindered.
The difficulties in solving the above problems and defects are:
(1) The number of sites in the human whole genome is huge and the number of combinations grows exponentially. Existing computers and methods cannot carry out association detection on k-order (k > 2) SNP combinations in a limited time, and there is no effective method that can quickly find possible k-order SNP epistatic combinations.
(2) SNP epistatic effect models are rich and diverse, for example main effect + interaction models and no main effect + interaction models. No single method can correctly identify all SNP epistasis models, and a preference for particular epistatic effect models exists.
The significance of solving the above problems and defects is that it gives biologists an effective method for analyzing the pathogenic causes of complex diseases, so that possible pathogenic genes can be found quickly and effective measures for diagnosis and targeted therapy can then be taken.
The invention adopts a multitask harmony search strategy, which has the following significance:
(1) The harmony search strategy is a swarm-intelligence-based search method that can complete the search within polynomial time and has strong global search capability. The invention adopts a harmony search strategy to improve the search speed.
(2) The multitask method can search for several high-order SNP epistatic combinations of different orders simultaneously. The tasks can communicate with and promote one another, which improves the search capability and therefore greatly improves the parallel search speed of the tasks.
(3) Multiple tasks use multiple association evaluation functions, which improves the ability to recognize diverse SNP epistasis models.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a multitask high-order SNP epistasis detection method, system, storage medium and equipment.
The invention is realized as follows. The multitask high-order SNP epistasis detection method comprises the following steps:
reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
setting the search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in the SNP sample data and preparing the first-stage search;
and performing high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search algorithm.
Further, the Plink software is used to read PED and MAP format data from the VCF file, and the data are then converted into the binary format files FAM, BED and BIM and arranged into a sample matrix.
Further, the harmony search algorithm parameters are set according to the number of SNP sites and the sample size in the data; the parameters comprise the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory consideration probability HMCR and the local fine-tuning probability PAR.
Further, the harmony search algorithm of the multitask high-order SNP epistasis detection method is a meta-heuristic search algorithm, and the high-order SNP epistasis detection problem is expressed as the following combinatorial optimization problem:
Figure BDA0002791384210000061
where X represents a combination of k SNPs. The optimization problem aims to find, from the genome, the SNP epistatic combination X* having the strongest association with the disease state Y.
Further, the objective of the multitask harmony search algorithm adopted by the multitask high-order SNP epistasis detection method is to find multiple SNP epistatic combinations of different orders from a genome, and the mathematical model is represented as follows:
Figure BDA0002791384210000062
where X_i represents a k_i-order (k_i > 2) SNP combination. The goal of this problem is to find, from the genome, the k_1-order, k_2-order, ..., k_M-order combinations of SNPs having the strongest association with the disease state
Figure BDA0002791384210000063
Furthermore, each task of the multitask high-order SNP epistasis detection method corresponds to an independent harmony memory (HM), and each harmony memory uses its own selection mechanism for survival-of-the-fittest selection. During the search, each iteration generates a new individual for each task. New individuals are generated in two ways: by intra-population learning, and by inter-population combination cross learning. The tasks of the multitask harmony search method may all use the same type of association evaluation function, or may each use a different type; each individual in a harmony memory may even be evaluated with several different types of evaluation function. The coding mechanism is as follows: all tasks use a unified coding, and when the orders differ, a left-to-right selection strategy is used.
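The left-to-right selection strategy under a unified coding can be read as taking a prefix of the shared solution vector. The helper below is a minimal sketch of that reading; the function name `decode_for_task` is hypothetical, and the vector is the example given later in the detailed description.

```python
def decode_for_task(solution, order):
    """Return the first `order` SNP indices of a unified solution vector."""
    if order > len(solution):
        raise ValueError("task order exceeds the length of the solution vector")
    return tuple(solution[:order])

x = (2, 6, 9, 14, 49)          # example vector from the description
print(decode_for_task(x, 3))   # (2, 6, 9)
```

Because every task reads from the same vector layout, an individual stored for one order can be handed to a task of another order without re-encoding.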
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
setting the search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in the SNP sample data and preparing the first-stage search;
and performing high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search algorithm.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
setting the search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in the SNP sample data and preparing the first-stage search;
and performing high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search algorithm.
Another object of the present invention is to provide an SNP detection information data processing terminal for implementing the above multitask high-order SNP epistasis detection method.
Another object of the present invention is to provide a multitask high-order SNP epistasis detection system for implementing the multitask high-order SNP epistasis detection method, including:
the data preprocessing module, used for reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
the algorithm parameter setting module, used for setting the harmony search algorithm parameters according to the number of SNP sites and the sample size in the data, wherein the parameters comprise the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory consideration probability HMCR and the local fine-tuning probability PAR;
the data reading module, used for reading in the SNP sample data and preparing the first-stage search;
and the high-order SNP epistatic combination detection module, used for performing high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search algorithm.
Combining all the above technical schemes, the invention has the following advantages and positive effects: the invention provides a multitask harmony search detection method that uses multiple harmony memories to store SNP combinations of different orders separately. Applying the multitask technique promotes mutual learning among individuals, enhances population diversity, and thereby improves the global search capability.
The multitask high-order SNP epistasis detection method is easy to understand and implement. By adopting a multitask harmony search strategy, high-order SNP epistasis detection at several different orders can be carried out at the same time, which greatly improves detection performance and gives the method a high detection speed and strong search capability. Each task uses one harmony memory and the same or a different type of association evaluation function; on the one hand this enhances the diversity of the populations (harmony memories), and on the other hand cross learning between individuals of different populations enhances the global search capability. Using several different types of association evaluation function enhances the ability to discriminate SNP epistasis models, reduces model preference, and thereby improves the detection of high-order SNP epistatic combinations.
The invention can solve the problem of the low sensitivity of the prior art in high-order SNP epistasis detection. It can solve the problems of low identification accuracy in high-order SNP epistasis detection and of preference for particular SNP epistasis models in the prior art. It can also solve the problem that existing detection techniques can carry out SNP epistasis detection at only one order at a time: the invention can detect several high-order SNP epistatic combinations of different orders simultaneously. By using a strategy of multiple harmony memories, the invention can improve the global detection capability of the harmony search strategy and reduce the amount of computation required by the SNP combinatorial explosion problem.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the multitask high-order SNP epistasis detection method provided by an embodiment of the invention.
FIG. 2 is a schematic structural diagram of the multitask high-order SNP epistasis detection system provided by an embodiment of the invention;
in fig. 2: 1. data preprocessing module; 2. algorithm parameter setting module; 3. data reading module; 4. high-order SNP epistatic combination detection module.
FIG. 3 is a flowchart of an implementation of the multitask high-order SNP epistasis detection method according to an embodiment of the present invention.
FIG. 4 is a flowchart of high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search algorithm according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of the basic rules for generating a harmony, provided by an embodiment of the present invention.
Fig. 6(a) is a schematic diagram of generating new individuals within a population by combination crossover and single-site interchange according to an embodiment of the present invention.
FIG. 6(b) is a schematic diagram of generating new individuals within a population by single-site variation according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of inter-task (intra) cross learning provided by an embodiment of the present invention.
FIG. 8 is a schematic diagram of individual transfer learning between tasks according to an embodiment of the present invention.
Fig. 9 is a basic flowchart for generating a new individual according to an embodiment of the present invention.
Fig. 10 is a comparison graph of detection capabilities provided by embodiments of the present invention.
Fig. 11 is a comparison chart of algorithm detection time provided by the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Aiming at the problems in the prior art, the invention provides a multitask high-order SNP epistasis detection method, system, storage medium and equipment, which are described in detail below with reference to the attached drawings.
As shown in fig. 1, the multitask high-order SNP epistasis detection method provided by the present invention comprises the following steps:
S101: reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files (FAM, BED and BIM), and arranging them into a sample matrix;
S102: setting the harmony search algorithm parameters according to the number of SNP sites and the sample size in the data, wherein the parameters comprise the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory consideration probability HMCR and the local fine-tuning probability PAR;
S103: reading in the SNP sample data and preparing the first-stage search;
S104: performing high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search algorithm.
Those skilled in the art may also implement the multitask high-order SNP epistasis detection method provided by the present invention using other steps; the method shown in fig. 1 is only one specific example.
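Step S101 can be illustrated with a small parser that turns PED-format text into a sample matrix. This is a sketch under stated assumptions, not the patent's implementation: it assumes the standard PED layout of six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two allele columns per SNP, with "0" for a missing allele and phenotype 2 for affected samples. The function name, the 0/1/2 minor-allele coding and the toy lines are hypothetical. In PLINK 1.9 the PED and MAP files themselves would typically be produced from a VCF file with a command along the lines of `plink --vcf input.vcf --recode --out prefix`, with `--make-bed` giving the BED, BIM and FAM files instead.

```python
from collections import Counter

def ped_to_matrix(ped_lines):
    """Parse PED-format lines into (genotype matrix, labels).

    Each genotype is the number of copies (0, 1 or 2) of the site's minor
    allele; -1 marks a missing call. Labels follow the PED phenotype column:
    2 (affected) becomes 1, anything else becomes 0.
    """
    rows = [line.split() for line in ped_lines if line.strip()]
    n_snps = (len(rows[0]) - 6) // 2
    # Pick the less frequent allele at each site as the counted allele.
    minor = []
    for j in range(n_snps):
        counts = Counter()
        for r in rows:
            for a in r[6 + 2 * j: 8 + 2 * j]:
                if a != "0":
                    counts[a] += 1
        # A site with fewer than two observed alleles has no minor allele.
        minor.append(min(counts, key=counts.get) if len(counts) > 1 else None)
    matrix, labels = [], []
    for r in rows:
        labels.append(1 if r[5] == "2" else 0)
        geno = []
        for j in range(n_snps):
            a1, a2 = r[6 + 2 * j], r[7 + 2 * j]
            if "0" in (a1, a2):
                geno.append(-1)
            else:
                geno.append((a1 == minor[j]) + (a2 == minor[j]))
        matrix.append(geno)
    return matrix, labels

# Toy PED content (hypothetical): 4 samples, 2 SNP sites.
toy_ped = [
    "F1 S1 0 0 1 2 A A G T",
    "F2 S2 0 0 2 1 A C G G",
    "F3 S3 0 0 1 1 A A G G",
    "F4 S4 0 0 2 2 0 0 T T",
]
sample_matrix, sample_labels = ped_to_matrix(toy_ped)
print(sample_matrix, sample_labels)
```

For real data a dedicated reader of the binary BED format would be far more efficient; the text parser is used here only because it keeps the example self-contained.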
As shown in fig. 2, the multitask high-order SNP epistasis detection system provided by the present invention includes:
the data preprocessing module 1, used for reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files (FAM, BED and BIM), and arranging them into a sample matrix;
the algorithm parameter setting module 2, used for setting the harmony search algorithm parameters according to the number of SNP sites and the sample size in the data, wherein the parameters comprise the maximum number of generations MaxT, the harmony memory size HMS, the harmony memory consideration probability HMCR and the local fine-tuning probability PAR;
the data reading module 3, used for reading in the SNP sample data and preparing the first-stage search;
and the high-order SNP epistatic combination detection module 4, used for performing high-order SNP epistatic combination detection using multitasking, multiple harmony memories and a harmony search algorithm.
The technical solution of the present invention is further described below with reference to the accompanying drawings.
SNP: single nucleotide polymorphism.
High-order SNP epistasis: multiple SNP sites act jointly on a phenotype or disease state.
Multitask (multi-task): high-order SNP epistasis detection at several different orders is carried out simultaneously.
Multiple harmony memories and harmony search strategy: a harmony search algorithm with multiple harmony memories.
Single-task optimization means focusing on completing one optimization task at a time; the task can be a single-objective optimization problem or a multi-objective optimization problem.
Multitask optimization is a novel optimization technique that exploits the potential relatedness among several different tasks, which influence, interact with and learn from one another, so that several optimization tasks can be completed quickly. Multitask optimization can solve several single-objective optimization problems simultaneously, and can also solve several multi-objective optimization problems simultaneously.
The invention adopts the multitask optimization technique to carry out several high-order SNP epistasis detections at once.
The harmony search algorithm is a meta-heuristic search algorithm that simulates the process by which musicians create harmony, aiming to find the best combination of notes and play the best harmony. Harmony search has strong global search capability and is well suited to solving combinatorial optimization problems. The high-order SNP epistasis detection problem can be expressed as the following combinatorial optimization problem:
Figure BDA0002791384210000111
where X denotes a combination of k SNPs. The optimization problem aims to find, from the genome, the SNP epistatic combination X* that has the strongest association with the disease state Y.
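Since the formula itself is only available as an image, the following LaTeX block is a hedged reading of the definitions just given, not a transcription. It assumes N SNP sites in the data and an association evaluation function f for which a larger value means a stronger association; for a score where smaller is better, the arg max becomes an arg min.

```latex
X^{*} = \arg\max_{X \subset \{1,\dots,N\},\; |X| = k} f(X, Y)
```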
The goal of the multitask harmony search algorithm adopted by the invention is to find several SNP epistatic combinations of different orders from the genome. The mathematical model can be expressed as:
Figure BDA0002791384210000112
where X_i denotes a k_i-order (k_i > 2) SNP combination. The goal of this problem is to find, from the genome, the k_1-order, k_2-order, …, k_M-order SNP epistatic combinations X_1*, X_2*, …, X_M* that have the strongest association with the disease state.
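The multitask model can be written out in the same hedged way. The block below is a reconstruction from the surrounding definitions, assuming N SNP sites, M tasks, and one association evaluation function f_i per task (the f_i may all be the same function); as before, arg max assumes that a larger score means a stronger association.

```latex
X_i^{*} = \arg\max_{X_i \subset \{1,\dots,N\},\; |X_i| = k_i} f_i(X_i, Y), \qquad i = 1, 2, \dots, M
```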
The invention adopts a multitask harmony search algorithm with multiple harmony memories, which can solve several optimization problems simultaneously. Each task has its own independent harmony memory (HM), and each harmony memory applies its own selection mechanism for survival of the fittest. During the search, every iteration generates one new individual for each task. New individuals are generated mainly in two ways: intra-population learning (including intra-population crossover, single-point interchange, single-point mutation and the like) and combined cross-learning between populations.
The tasks of the multitask harmony search method may all use the same type of association evaluation function (for example a Bayesian network method or a statistical test method), or each task may use a different type; several different types of evaluation functions may even be applied to each individual in a harmony memory (similar to multi-objective optimization).
In terms of search strategy, the invention adopts a new framework that significantly improves search efficiency. Running multiple tasks simultaneously improves search performance. In particular, for disease models without marginal effects, the multitask search mechanism can find low-order combinations that do show marginal effects, which in turn promotes the discovery of higher-order SNP epistatic combinations.
The coding mechanism adopted by the invention is as follows: all tasks share a unified code, and when the orders differ, sites are selected from left to right. For example, in the 3-order task, for a solution vector X = (2, 6, 9, 14, 49), only the SNP site combination (2, 6, 9) is selected for association evaluation.
Under this coding scheme, SNP sites 14 and 49 are not evaluated in the 3-order task, but they can still be used for cross-learning with other tasks, and within this task they can be exchanged with the leading SNP sites through single-point interchange, which helps optimize the individuals in this task's population.
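The left-to-right selection and the single-point interchange can be illustrated with a short Python sketch. The function names `evaluated_sites` and `single_point_interchange` are hypothetical, and choosing the two swapped positions uniformly at random is an assumption; the text only states that trailing sites may be exchanged with leading ones.

```python
import random


def evaluated_sites(x, k):
    """Return the first k SNP sites of the unified code x (left-to-right selection)."""
    if not 1 <= k <= len(x):
        raise ValueError("order k must be between 1 and len(x)")
    return tuple(x[:k])


def single_point_interchange(x, k, rng=random):
    """Swap one evaluated site (index < k) with one unevaluated site (index >= k)."""
    if k >= len(x):
        return tuple(x)  # no trailing sites to swap with
    i = rng.randrange(k)           # position inside the evaluated prefix
    j = rng.randrange(k, len(x))   # position in the unevaluated tail
    y = list(x)
    y[i], y[j] = y[j], y[i]
    return tuple(y)


x = (2, 6, 9, 14, 49)
print(evaluated_sites(x, 3))  # (2, 6, 9)
print(single_point_interchange(x, 3, random.Random(0)))
```

Because every task reads a prefix of the same vector, an individual can be handed from one task to another without re-encoding.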
As shown in FIG. 3, the multitask high-order SNP epistasis detection method specifically includes the following steps:
(1) Data preprocessing
PED and MAP format data are read from the VCF file using Plink software, further converted into binary format files (FAM, BED and BIM), and arranged into a sample matrix.
(2) Algorithm parameter setting
The harmony search algorithm parameters are set according to the number of SNP sites and the sample size in the data. The parameters include the maximum number of evolution generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR, the pitch adjusting rate PAR, and the like.
(3) Data reading. The SNP sample data is read in and the first-stage search is prepared.
(4) High-order SNP epistatic combination detection is performed using the multitask harmony search algorithm with multiple harmony memories (the algorithm flow is shown in FIG. 3).
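Step (1) can be sketched as two PLINK invocations. The Python snippet below only builds the command lines and does not run them; it assumes PLINK 1.9 is installed under the name `plink`, and the file names `cohort.vcf` and `cohort` are placeholders. Arranging the resulting files into a sample matrix is not covered here.

```python
def plink_commands(vcf_path, out_prefix, plink="plink"):
    """Build two PLINK 1.9 calls: VCF -> PED/MAP, then PED/MAP -> BED/BIM/FAM."""
    # --recode writes <out_prefix>.ped and <out_prefix>.map
    to_text = [plink, "--vcf", vcf_path, "--recode", "--out", out_prefix]
    # --make-bed writes <out_prefix>.bed, .bim and .fam
    to_binary = [plink, "--file", out_prefix, "--make-bed", "--out", out_prefix]
    return [to_text, to_binary]


for cmd in plink_commands("cohort.vcf", "cohort"):
    print(" ".join(cmd))
    # To execute for real: subprocess.run(cmd, check=True)
```

Building the argument lists separately from executing them keeps the conversion step easy to log and to test without PLINK being present.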
Association evaluation of the initial harmonies:
(Pseudocode 1.1: each harmony (individual) is evaluated with a single evaluation function.)
Function of the following code: association evaluation is performed on the individuals in the populations of the K-1 tasks (orders 2, 3, …, K), each population having size NP.
Figure BDA0002791384210000121
Figure BDA0002791384210000133
With pseudocode 1.1, each individual computes M = K-1 fitness values.
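A minimal Python sketch of this per-task evaluation follows. It is not the pseudocode from the figure. The Pearson chi-square statistic stands in for the association evaluation function (the text names Bayesian network and statistical test methods without fixing one), genotypes are assumed to be coded 0/1/2 per sample and SNP, labels are assumed to be 0 for control and 1 for case, and the function names are hypothetical.

```python
from collections import Counter


def chi_square_score(genotypes, labels, sites):
    """Pearson chi-square statistic of the (genotype combination x class) table."""
    cells = Counter()
    for row, y in zip(genotypes, labels):
        cells[(tuple(row[s] for s in sites), y)] += 1
    n = len(labels)
    combo_total = Counter()
    class_total = Counter()
    for (combo, y), count in cells.items():
        combo_total[combo] += count
        class_total[y] += count
    stat = 0.0
    for combo, ct in combo_total.items():
        for y, yt in class_total.items():
            expected = ct * yt / n
            observed = cells.get((combo, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def evaluate_population(population, genotypes, labels, max_order):
    """Score the leading 2, 3, ..., K sites of each individual (K - 1 values each)."""
    return [
        [chi_square_score(genotypes, labels, x[:k]) for k in range(2, max_order + 1)]
        for x in population
    ]


genotypes = [[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]]
labels = [0, 0, 1, 1]
print(evaluate_population([(0, 1, 2)], genotypes, labels, max_order=3))
```

Each inner list holds one fitness value per order, so an individual with a unified code of length K yields K - 1 values.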
Association evaluation of the initial harmonies (each harmony (individual) is evaluated with 3 evaluation functions):
Function of the following code: the individuals in the populations of the K-1 tasks (orders 2, 3, …, K), each population having size NP, are evaluated with the association evaluation functions f_1, f_2 and f_3 respectively.
Figure BDA0002791384210000131
Each individual computes K × 3 fitness values (multiple evaluation indices).
Pseudocode 2: task division
Task division is performed for all individuals as follows:
Figure BDA0002791384210000132
Figure BDA0002791384210000141
Creation of new individuals:
Figure BDA0002791384210000142
K fitness values can be calculated for each individual.
Generation of a new individual in the population of task k:
New individuals are generated according to the basic rules of FIG. 5.
Pseudocode 3: generation of new individuals according to the harmony memory rules
Figure BDA0002791384210000143
Figure BDA0002791384210000151
Generation of new individuals within a population and cross-learning between populations are shown in FIGS. 6(a), 6(b) and 7 to 9.
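To make the generation of new individuals and the per-task memory update concrete, the following Python sketch combines the standard harmony search improvisation step (memory consideration with rate HMCR, pitch adjustment with rate PAR, otherwise random selection) with one harmony memory per task. It is a simplified stand-in for the operators shown in the figures: the `cross_rate` parameter, the neighbour-index pitch adjustment, the toy score, and all default values are assumptions of this example, and a larger score is assumed to be better.

```python
import random


def improvise(memories, k, n_snps, hmcr, par, cross_rate, rng):
    """Create one new harmony for task k from its own memory or another task's."""
    own = memories[k]
    others = [m for order, m in memories.items() if order != k]
    length = len(own[0][1])
    new = []
    for pos in range(length):
        if rng.random() < hmcr:
            # Memory consideration; with probability cross_rate learn from another task.
            pool = rng.choice(others) if others and rng.random() < cross_rate else own
            site = rng.choice(pool)[1][pos]
            if rng.random() < par:
                # Pitch adjustment, taken here as a step to a neighbouring SNP index.
                site = (site + rng.choice((-1, 1))) % n_snps
        else:
            site = rng.randrange(n_snps)  # random selection from the whole search space
        while site in new:                # resample until all sites are distinct
            site = rng.randrange(n_snps)
        new.append(site)
    return tuple(new)


def search(orders, n_snps, score, hms=10, max_t=500,
           hmcr=0.9, par=0.3, cross_rate=0.2, seed=0):
    """Multitask loop: one harmony memory per order, one new harmony per task per iteration."""
    rng = random.Random(seed)
    length = max(orders)  # unified code long enough for the highest order
    memories = {}
    for k in orders:
        codes = [tuple(rng.sample(range(n_snps), length)) for _ in range(hms)]
        memories[k] = [(score(x[:k]), x) for x in codes]
    for _ in range(max_t):
        for k in orders:
            x = improvise(memories, k, n_snps, hmcr, par, cross_rate, rng)
            fx = score(x[:k])
            worst = min(range(hms), key=lambda i: memories[k][i][0])
            if fx > memories[k][worst][0]:  # replace the worst harmony if the new one is better
                memories[k][worst] = (fx, x)
    return {k: max(memories[k], key=lambda e: e[0]) for k in orders}


def toy_score(sites):
    """Hypothetical stand-in: counts how many of the planted sites {3, 7, 11} are selected."""
    return len({3, 7, 11} & set(sites))


best = search([2, 3], n_snps=30, score=toy_score)
print(best)
```

In a real run, `toy_score` would be replaced by an association evaluation function computed on the genotype and phenotype data.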
Comparison of detection results of the present invention on the simulation data sets (see Table 1, Table 2, Table 3, FIG. 10 and FIG. 11):
TABLE 1 Simulation data set parameters
Data set    Order of high-order SNP combination    Number of SNPs    Sample size    Maximum number of allowed evaluations
DME Data1   5                                      1000              1000           500000
DME Data2   5                                      1000              1000           500000
DME Data3   5                                      1000              2000           500000
DME Data4   5                                      10000             1000           5000000
DME Data5   5                                      10000             2000           5000000
DME Data6   5                                      10000             5000           5000000
TABLE 2 Comparison of detection power
Data set    EPI-ACO    SNPHarvester    MP-HS-DHSI    NHSA-DHSC    Method of the invention
DME Data1   75.00%     63.00%          85.00%        84.00%       83.00%
DME Data2   79.00%     58.00%          86.00%        87.00%       87.00%
DME Data3   85.00%     70.00%          89.00%        88.00%       90.00%
DME Data4   63.00%     48.00%          75.00%        73.00%       81.00%
DME Data5   65.00%     44.00%          81.00%        79.00%       84.00%
DME Data6   69.00%     52.00%          89.00%        81.00%       92.00%
TABLE 3 Comparison of average detection time (unit: seconds)
Figure BDA0002791384210000152
Figure BDA0002791384210000161
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
The above description covers only specific embodiments of the present invention and does not limit its scope of protection. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention falls within the scope of protection defined by the appended claims.

Claims (10)

1. A multitask high-order SNP epistasis detection method, characterized by comprising the following steps:
reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
setting search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in SNP sample data and preparing for the first-stage search;
and performing high-order SNP epistatic combination detection using a multitask harmony search algorithm with multiple harmony memories.
2. The multitask high-order SNP epistasis detection method according to claim 1, wherein Plink software is used to read PED and MAP format data from a VCF file, and the data are further converted into FAM, BED and BIM files and arranged into a sample matrix.
3. The multitask high-order SNP epistasis detection method according to claim 1, wherein the harmony search algorithm parameters are set according to the number of SNP sites and the sample size in the data, the parameters including the maximum number of evolution generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR, and the pitch adjusting rate PAR.
4. The multitask high-order SNP epistasis detection method according to claim 1, wherein the harmony search algorithm used by the method is a meta-heuristic search algorithm, and the multitask high-order SNP epistasis detection problem is expressed as the following combinatorial optimization problem:
Figure FDA0002791384200000011
where X denotes a combination of k SNPs, and the optimization problem aims to find, from the genome, the SNP epistatic combination X* that has the strongest association with the disease state Y.
5. The multitask high-order SNP epistasis detection method according to claim 1, wherein the goal of the multitask harmony search algorithm used by the method is to find several SNP epistatic combinations of different orders from the genome, and the mathematical model is expressed as:
Figure FDA0002791384200000012
where X_i denotes a k_i-order (k_i > 2) SNP combination, and the goal of this problem is to find, from the genome, the k_1-order, k_2-order, …, k_M-order SNP epistatic combinations X_1*, X_2*, …, X_M* that have the strongest association with the disease state.
6. The multitask high-order SNP epistasis detection method according to claim 1, wherein each task of the method has its own independent harmony memory (HM), and each harmony memory applies its own selection mechanism for survival of the fittest; during the search, each iteration generates one new individual for each task; new individuals are generated in two ways: intra-population learning and combined cross-learning between populations;
the tasks of the multitask harmony search method may use the same type of association evaluation function or different types of association evaluation functions, and several different types of evaluation functions may even be applied to each individual in a harmony memory;
the unified coding mechanism adopted is as follows: all tasks share a unified code and search in a unified search space; when the association evaluation of a k-order task is performed, the code is read from the left and k consecutive code positions are selected as the individual code for that task.
7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the following steps:
reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
setting search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in SNP sample data and preparing for the first-stage search;
and detecting several high-order SNP epistatic combinations of different orders using a multitask harmony search algorithm with multiple harmony memories.
8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
setting search algorithm parameters according to the number of SNP sites and the sample size in the data;
reading in SNP sample data and preparing for the first-stage search;
and performing high-order SNP epistatic combination detection using a multitask harmony search algorithm with multiple harmony memories.
9. An SNP epistasis detection information data processing terminal, which is used for implementing the multitask high-order SNP epistasis detection method according to any one of claims 1 to 6.
10. A multitask high-order SNP epistasis detection system for implementing the multitask high-order SNP epistasis detection method according to any one of claims 1 to 6, wherein the system comprises:
a data preprocessing module, used for reading PED and MAP format data from a VCF file using Plink software, converting the data into binary format files, and arranging them into a sample matrix;
an algorithm parameter setting module, used for setting the harmony search algorithm parameters according to the number of SNP sites and the sample size in the data, the parameters including the maximum number of evolution generations MaxT, the harmony memory size HMS, the harmony memory considering rate HMCR, and the pitch adjusting rate PAR;
a data reading module, used for reading in SNP sample data and preparing for the first-stage search;
and a multitask high-order SNP epistatic combination detection module, used for performing high-order SNP epistatic combination detection using a multitask harmony search algorithm with multiple harmony memories.
CN202011315829.2A 2020-11-22 2020-11-22 Multi-task high-order SNP upper detection method, system, storage medium and equipment Active CN112447263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011315829.2A CN112447263B (en) 2020-11-22 2020-11-22 Multi-task high-order SNP upper detection method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011315829.2A CN112447263B (en) 2020-11-22 2020-11-22 Multi-task high-order SNP upper detection method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN112447263A true CN112447263A (en) 2021-03-05
CN112447263B CN112447263B (en) 2023-12-26

Family

ID=74738143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011315829.2A Active CN112447263B (en) 2020-11-22 2020-11-22 Multi-task high-order SNP upper detection method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112447263B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010224815A (en) * 2009-03-23 2010-10-07 Japan Found Cancer Res Retrieval algorithm for epistasis effect based on comprehensive genome wide snp information
WO2017159686A1 (en) * 2016-03-15 2017-09-21 Repertoire Genesis株式会社 Monitoring and diagnosis for immunotherapy, and design for therapeutic agent
CN109448794A (en) * 2018-10-31 2019-03-08 Huazhong Agricultural University A kind of epistasis site method for digging based on heredity taboo and Bayesian network
CN110633386A (en) * 2019-09-27 2019-12-31 Harbin University of Science and Technology Model similarity calculation method based on genetic and acoustic mixed search

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
SHAUN PURCELL: "PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses", THE AMERICAN JOURNAL OF HUMAN GENETICS, vol. 81, pages 559 - 575, XP055061306, DOI: 10.1086/519795 *
SHOUHENG TUO: "Multipopulation harmony search algorithm for the detection of high-order SNP interactions", BIOINFORMATICS, vol. 36, no. 16, pages 4389 *
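Yang Jun; Yin Jianping; Zhan Yubin: "Application of tabu-search-based multifactor dimensionality reduction in epistasis detection", Journal of Wuhan University (Natural Science Edition), no. 06 *
Zhai Junchang; Gao Liqun; Ouyang Haibin; Liu Hongzhi: "Improved novel global harmony search algorithm", Journal of Northeastern University (Natural Science), no. 10 *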

Also Published As

Publication number Publication date
CN112447263B (en) 2023-12-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant