CN115101133A - Integrated learning-based SNP interaction detection system - Google Patents
- Publication number: CN115101133A (CN202210860224.4A)
- Authority: CN (China)
- Prior art keywords: snp, combinations, classifier, subsets, combination
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The disclosure provides an SNP interaction detection system based on ensemble learning, which belongs to the technical fields of artificial intelligence data mining and classification and of bioinformatics. The scheme comprises: a data acquisition module configured to: acquire SNP sequence information of diseased samples and non-diseased samples, and preprocess it to construct a whole genome SNP set; an SNP subset partitioning and combination generation module configured to: divide the genome-wide SNP set into a plurality of SNP subsets, and construct SNP combinations based on the SNP subsets; a multi-classifier parallel evaluation module configured to: evaluate the association of the SNP combinations with the disease in parallel using a plurality of classifiers; and a result verification module configured to: verify, using the chi-square test, the statistical significance of the SNP combinations that the classifiers assessed as associated with the disease.
Description
Technical Field
The disclosure belongs to the technical fields of artificial intelligence data mining and classification and of bioinformatics, and particularly relates to an ensemble learning-based SNP interaction detection system.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of whole genome sequencing and high-throughput technologies, researchers have obtained information on hundreds of millions of single nucleotide polymorphisms (SNPs). However, finding disease-related combinations of multiple SNPs in such a large amount of SNP data with a suitable machine learning technique remains a difficulty in applying machine learning to association detection.
Current methods for detecting SNP combinations associated with diseases include: using statistical methods such as hypothesis testing to study the significance of the association between each SNP combination and the disease; dividing the samples into a plurality of subsets using SNP information and judging from the division result whether the SNP combination is related to the disease; and clustering the whole genome SNP data with a clustering algorithm and then searching for related SNP combinations within each cluster.
The inventors have found that, because SNP data are high-dimensional, the number of combinations grows exponentially with the dimensionality, so that testing the combinations one by one is too computationally heavy to carry out and the false positive rate is too high. Machine learning techniques for studying the association between interactions of multiple SNPs and diseases or traits therefore still have considerable room for improvement.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides an ensemble learning-based SNP interaction detection system. By dividing the SNP set into a plurality of subsets, selecting within the subsets the SNP combinations that may be related to a disease, and then iteratively selecting more strongly related SNP combinations, the system reduces the required memory space and running time. The system uses a plurality of classifiers for joint evaluation, which reduces the influence that each classifier's preference for particular disease models has on the overall effect of the algorithm. The classifiers run in parallel, which improves the detection speed and reduces the hardware requirements of the system.
According to a first aspect of embodiments of the present disclosure, there is provided an ensemble learning-based SNP interaction detection system, comprising:
a data acquisition module configured to: acquire SNP sequence information of diseased samples and non-diseased samples, and preprocess it to construct a whole genome SNP set;
an SNP subset partitioning and combination generation module configured to: divide the genome-wide SNP set into a plurality of SNP subsets, and construct SNP combinations based on the SNP subsets;
a multi-classifier parallel evaluation module configured to: evaluate the association of the SNP combinations with the disease in parallel using a plurality of classifiers;
a result verification module configured to: verify, using the chi-square test, the statistical significance of the SNP combinations that the classifiers assessed as associated with the disease.
Further, dividing the genome-wide SNP set into a plurality of SNP subsets and constructing high-dimensional SNP combinations based on the SNP subsets is, taking two-site SNP combinations as an example, specifically:
uniformly dividing the genome-wide SNP set into a plurality of SNP subsets;
in the first iteration, for each SNP subset, selecting two different SNPs to form a two-site SNP combination, all possible SNP combinations within the subset forming a set;
in the second and later iterations, selecting one SNP from each of two different SNP subsets to form a two-site SNP combination, all possible SNP combinations between the two subsets forming a set; and adding the SNP combinations output by the previous iteration as possibly related to the disease to a set of SNP combinations not yet evaluated, as the input of a classifier in the current iteration.
Further, the multi-classifier parallel evaluation module comprises a scoring voting module, an exchange voting module and a screening module, wherein:
a scoring voting module configured to: score the input SNP combinations using each classifier, and vote according to the scores;
an exchange voting module configured to: exchange the SNP combinations that each classifier considers possibly related to the disease into all other classifiers, and repeat the scoring and voting;
a screening module configured to: count the votes of each classifier, screen out the SNP combinations whose total number of votes is larger than a preset threshold, and input them into the result verification module.
Further, verifying with the chi-square test the statistical significance of the SNP combinations that the classifiers assessed as related to the disease is specifically:
calculating, using the chi-square test, the p-values of all SNP combinations that the multiple classifiers consider related to the disease;
sorting the SNP combinations in ascending order of p-value;
finding the inflection point of the p-values, and outputting the SNP combinations before the inflection point as the final detection result.
Further, acquiring the SNP sequence information of the diseased samples and the non-diseased samples and preprocessing it to construct the whole genome SNP set is specifically:
labeling each diseased sample as 1 and each non-diseased sample as 0; for the mutation status of each sample at each SNP site, labeling the site 0 if neither allele is mutated, 1 if one of the two alleles is mutated, 2 if both alleles are mutated, and 3 if the site data is missing; deleting samples in which more than 5% of the SNPs are missing; deleting SNPs that are missing in more than 5% of the samples; calculating the p-value of each SNP using the chi-square test and deleting SNPs with p-value > 0.0001; and deleting SNPs whose minor allele frequency is less than 0.1.
Further, the plurality of classifiers includes classifiers based on the Gini index, K2-score, entropy, information gain, and APDS.
According to a second aspect of the embodiments of the present disclosure, there is provided an electronic device, comprising a memory, a processor and a computer program stored in the memory for execution, the processor implementing the following steps when executing the program:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a whole genome SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations found strongly associated with the disease under each classifier into all other classifiers, and evaluating and voting again; screening out the SNP combinations whose total number of votes exceeds a preset threshold, resetting their vote counts, merging them with the SNP combinations not yet evaluated by the classifiers, and repeating the above steps until all SNP combinations have been evaluated;
verifying, based on the chi-square test, the SNP combinations screened out in the last iteration whose total number of votes exceeds the preset threshold, finding the inflection point in the sorted p-value sequence, and outputting the SNP combinations before the inflection point.
According to a third aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a whole genome SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations found strongly associated with the disease under each classifier into all other classifiers, and evaluating and voting again; screening out the SNP combinations whose total number of votes exceeds a preset threshold, resetting their vote counts, merging them with the SNP combinations not yet evaluated by the classifiers, and repeating the above steps until all SNP combinations have been evaluated;
verifying, based on the chi-square test, the SNP combinations screened out in the last iteration whose total number of votes exceeds the preset threshold, finding the inflection point in the sorted p-value sequence, and outputting the SNP combinations before the inflection point.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when run on one or more processors, performs the steps of:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a whole genome SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations found strongly associated with the disease under each classifier into all other classifiers, and evaluating and voting again; screening out the SNP combinations whose total number of votes exceeds a preset threshold, resetting their vote counts, merging them with the SNP combinations not yet evaluated by the classifiers, and repeating the above steps until all SNP combinations have been evaluated;
verifying, based on the chi-square test, the SNP combinations screened out in the last iteration whose total number of votes exceeds the preset threshold, finding the inflection point in the sorted p-value sequence, and outputting the SNP combinations before the inflection point.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The present disclosure provides an ensemble learning-based SNP interaction detection system that reduces the required memory space and running time by dividing the SNP set into a plurality of subsets, selecting within the subsets the SNP combinations that may be related to the disease, and then iteratively selecting more strongly related SNP combinations. The system uses a plurality of classifiers for joint evaluation, which reduces the influence that each classifier's preference for particular disease models has on the overall effect of the algorithm. The classifiers run in parallel, which improves the detection speed and reduces the hardware requirements of the system.
(2) The scheme of the present disclosure includes all possible SNP combinations in the association evaluation, which avoids missing SNP combinations significantly related to the disease and makes the algorithm results more reliable. At the same time, the association of each SNP combination is evaluated by a plurality of classifiers, which reduces the influence of any single classifier's model preference on the overall result, and the classifiers can be executed in parallel on multiple devices, which reduces the computational burden and the requirements on the experimental environment.
(3) The scheme of the present disclosure divides the whole genome SNP set into a plurality of SNP subsets and evaluates the SNP combinations step by step over multiple iterations, instead of evaluating all possible SNP combinations directly at one time as is usually done, which reduces the storage space required of the equipment. The SNP combinations significantly related to the disease are delimited by the inflection point of the chi-square test p-values rather than by a hard threshold, which reduces the influence of parameter settings on the experimental result.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic overall flow chart of an ensemble learning-based SNP interaction detection system according to an embodiment of the present disclosure.
Detailed Description
The present disclosure is further illustrated by the following examples in conjunction with the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example one:
The present embodiment aims to provide an ensemble learning-based SNP interaction detection system.
An ensemble learning-based SNP interaction detection system, comprising:
a data acquisition module configured to: acquire SNP sequence information of diseased samples and non-diseased samples, and preprocess it to construct a whole genome SNP set;
an SNP subset partitioning and combination generation module configured to: divide the genome-wide SNP set into a plurality of SNP subsets, and construct SNP combinations based on the SNP subsets;
a multi-classifier parallel evaluation module configured to: evaluate the association of the SNP combinations with the disease in parallel using a plurality of classifiers;
a result verification module configured to: verify, using the chi-square test, the statistical significance of the SNP combinations that the classifiers assessed as associated with the disease.
Further, the SNP subset partitioning and combination generation module specifically comprises an SNP subset partitioning module and a combination generation module, wherein:
an SNP subset partitioning module configured to:
divide the genome-wide SNP set into a plurality of SNP subsets;
a combination generation module configured to:
construct high-dimensional SNP combination sets from the SNP subsets; and merge the SNP combinations considered related to the disease in the previous iteration with the SNP combinations not yet evaluated, as the input of the current iteration.
Furthermore, the multi-classifier parallel evaluation module comprises a scoring voting module, an exchange voting module and a screening module, wherein:
a scoring voting module configured to: score the SNP combinations using a single classifier, and vote according to the scores;
an exchange voting module configured to: exchange the SNP combinations that a single classifier considers possibly related to the disease into all other classifiers, and repeat the scoring and voting;
a screening module configured to: count the votes of each classifier, screen out the SNP combinations with a high total number of votes, and input them into the combination generation module.
Further, SNP sequence information of the diseased samples and the non-diseased samples is acquired and preprocessed as follows:
each diseased sample is labeled 1 and each non-diseased sample is labeled 0;
the mutation status of each sample at each SNP site in the SNP data is labeled 0 if neither allele is mutated, 1 if one of the two alleles is mutated, 2 if both alleles are mutated, and 3 if the site data is missing;
samples in which more than 5% of the SNPs are missing are deleted; SNPs that are missing in more than 5% of the samples are deleted; the p-value of each SNP is calculated using the chi-square test and SNPs with p-value > 0.0001 are deleted; SNPs whose minor allele frequency is less than 0.1 are deleted.
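The preprocessing filters above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the list-of-lists genotype matrix, and the choice to compute each SNP's missing rate on the samples that survive the sample filter are all assumptions. The chi-square p-value is computed on the 2 x k table of disease status against observed genotypes, using the closed-form tail probabilities for 1 and 2 degrees of freedom.

```python
import math

MISSING = 3  # genotype code for missing data, as defined in the text


def chi2_pvalue(genotypes, labels):
    """Chi-square p-value for a 2 x k status-by-genotype table (k <= 3)."""
    observed = {}  # (label, genotype) -> count, missing calls skipped
    for g, y in zip(genotypes, labels):
        if g != MISSING:
            observed[(y, g)] = observed.get((y, g), 0) + 1
    gts = sorted({g for _, g in observed})
    n = sum(observed.values())
    if len(gts) < 2:
        return 1.0  # a single genotype carries no association signal
    row = {y: sum(observed.get((y, g), 0) for g in gts) for y in (0, 1)}
    col = {g: sum(observed.get((y, g), 0) for y in (0, 1)) for g in gts}
    stat = 0.0
    for y in (0, 1):
        for g in gts:
            expected = row[y] * col[g] / n
            if expected > 0:
                stat += (observed.get((y, g), 0) - expected) ** 2 / expected
    # Exact chi-square tail: df = 1 for two genotypes, df = 2 for three.
    if len(gts) == 2:
        return math.erfc(math.sqrt(stat / 2))
    return math.exp(-stat / 2)


def minor_allele_frequency(genotypes):
    """Genotype codes 0/1/2 count mutated alleles out of two per sample."""
    called = [g for g in genotypes if g != MISSING]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def preprocess(matrix, labels, max_missing=0.05, p_cutoff=1e-4, min_maf=0.1):
    """matrix[s][j] is the genotype code of sample s at SNP j.

    Returns the indices of the retained samples and retained SNPs.
    """
    if not matrix:
        return [], []
    n_snps = len(matrix[0])
    keep_samples = [s for s, r in enumerate(matrix)
                    if r.count(MISSING) / n_snps <= max_missing]
    rows = [matrix[s] for s in keep_samples]
    ys = [labels[s] for s in keep_samples]
    keep_snps = []
    for j in range(n_snps if rows else 0):
        column = [r[j] for r in rows]
        if column.count(MISSING) / len(column) > max_missing:
            continue  # SNP missing in too many samples
        if chi2_pvalue(column, ys) > p_cutoff:
            continue  # no marginal association with the disease
        if minor_allele_frequency(column) < min_maf:
            continue  # minor allele too rare
        keep_snps.append(j)
    return keep_samples, keep_snps
```

For example, `preprocess(matrix, labels)` on a matrix whose last sample is entirely missing drops that sample first and then filters the SNPs on the remaining samples.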
Further, dividing the genome-wide SNP set into a plurality of SNP subsets and generating SNP combinations, taking two-site SNP combinations as an example, specifically includes:
S1021, uniformly dividing the whole genome SNP set into a plurality of SNP subsets;
S1022, in the first iteration, for each SNP subset, selecting two different SNPs to form a two-site SNP combination, all possible SNP combinations within the subset forming a set;
S1023, in the second and later iterations, selecting one SNP from each of two different SNP subsets to form a two-site SNP combination, all possible SNP combinations between the two subsets forming a set; and adding the SNP combinations output by the previous iteration as possibly related to the disease to a set of SNP combinations not yet evaluated, as the input of one of the classifiers in the current iteration.
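Steps S1021 to S1023 can be illustrated with a short Python sketch. The function names and the use of contiguous, near-equal chunks for the "uniform" division are assumptions; the source does not state how SNPs are assigned to subsets.

```python
from itertools import combinations, product


def split_evenly(snps, n_subsets):
    """S1021: divide the SNP list into n_subsets parts of near-equal size."""
    size, extra = divmod(len(snps), n_subsets)
    subsets, start = [], 0
    for k in range(n_subsets):
        end = start + size + (1 if k < extra else 0)
        subsets.append(snps[start:end])
        start = end
    return subsets


def within_subset_pairs(subset):
    """S1022: all two-site combinations inside one subset."""
    return list(combinations(subset, 2))


def between_subset_pairs(subset_a, subset_b):
    """S1023: all two-site combinations with one SNP from each subset."""
    return list(product(subset_a, subset_b))


def iteration_input(new_pairs, carried_over):
    """S1023: merge the previous iteration's candidates with untested pairs."""
    seen, merged = set(), []
    for pair in list(carried_over) + list(new_pairs):
        key = tuple(sorted(pair))  # treat (a, b) and (b, a) as the same pair
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged
```

With two subsets of three SNPs each, the within-subset sets hold 3 pairs each and the between-subset set holds 9, which together cover all 15 possible pairs of 6 SNPs, so no combination is skipped.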
Further, multiple classifiers are used to evaluate multiple SNP combination sets in parallel, which specifically includes the following steps:
Consider the set of SNP combinations input to the $i$-th classifier in the $t$-th iteration [set symbol not reproduced in the source text], where $i \in \{1, \dots, |D|\}$. $|D| = 5$ classifiers are used for detection: Gini index, K2-score, entropy, information gain, and APDS (absolute stability differential score). Specifically:
S1031, the Gini index (GI) classifier calculates the score of each SNP combination as $GI = \mathrm{Gini}(\mathrm{parent}) - \mathrm{Gini}(\mathrm{split})$ [definitions of the two terms not reproduced in the source text], where $N$ is the number of all samples, $N_{case}(m)$ is the number of individuals with genotype $m$ that exhibit the disease, $N_{control}(m)$ is the number of individuals with genotype $m$ that do not exhibit the disease, and $N_{total}(m)$ is the number of samples with genotype $m$. A higher score represents a stronger association between the SNP combination and the disease;
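As an illustration of step S1031, the sketch below computes $GI = \mathrm{Gini}(\mathrm{parent}) - \mathrm{Gini}(\mathrm{split})$ from the counts $N$, $N_{case}(m)$ and $N_{total}(m)$. Because the source does not reproduce the definitions of the two terms, the sketch assumes the standard decision-tree Gini impurity: $\mathrm{Gini}(\mathrm{parent}) = 1 - p_{case}^2 - p_{control}^2$ over all samples, and $\mathrm{Gini}(\mathrm{split})$ as the average impurity of the genotype groups weighted by $N_{total}(m)/N$. The function name is hypothetical.

```python
from collections import Counter


def gini_score(genotype_combos, labels):
    """Gini(parent) - Gini(split) for one SNP combination.

    genotype_combos[s] is the combined genotype m of sample s (any hashable,
    e.g. a tuple of per-SNP codes); labels[s] is 1 for case, 0 for control.
    Assumes the standard decision-tree Gini impurity.
    """
    n = len(labels)
    p_case = sum(labels) / n
    gini_parent = 1 - p_case ** 2 - (1 - p_case) ** 2

    n_total = Counter(genotype_combos)  # N_total(m)
    n_case = Counter(m for m, y in zip(genotype_combos, labels) if y == 1)

    gini_split = 0.0
    for m, total in n_total.items():
        p = n_case[m] / total  # N_case(m) / N_total(m)
        gini_split += total / n * (1 - p ** 2 - (1 - p) ** 2)
    return gini_parent - gini_split
```

Under this assumed definition, a combination whose genotypes separate cases from controls perfectly scores 0.5 on a balanced sample, and a combination with a single genotype shared by everyone scores 0.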
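S1032, the K2-score classifier calculates the score of each SNP combination as [formula not reproduced in the source text]. For simplicity of calculation, its logarithmic form is used: [formula not reproduced in the source text]. A higher score represents a stronger association between the SNP combination and the disease;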
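S1033, the entropy (ES) classifier calculates the score of each SNP combination as [formula not reproduced in the source text], where [definition not reproduced in the source text]. A higher score represents a stronger association between the SNP combination and the disease;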
S1034, the information gain (IG) classifier calculates the score of each SNP combination as $IG = [H(S_i \mid Y) + H(S_j \mid Y) - H(S_i, S_j \mid Y)] - [H(S_i) + H(S_j) - H(S_i, S_j)]$, where [definition not reproduced in the source text]. A higher score represents a stronger association between the SNP combination and the disease;
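The information gain score of step S1034 can be evaluated directly from empirical entropies. The following Python sketch follows the IG expression as written in the text; the base-2 logarithm and the identity $H(X \mid Y) = H(X, Y) - H(Y)$ used for the conditional terms are assumptions, since the source does not reproduce the definition of $H$.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (base 2, assumed) of a list of symbols."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(values, given):
    """H(X | Y) computed as H(X, Y) - H(Y)."""
    return entropy(list(zip(values, given))) - entropy(given)


def information_gain(s_i, s_j, y):
    """IG = [H(Si|Y) + H(Sj|Y) - H(Si,Sj|Y)] - [H(Si) + H(Sj) - H(Si,Sj)]."""
    pair = list(zip(s_i, s_j))
    conditional = (conditional_entropy(s_i, y)
                   + conditional_entropy(s_j, y)
                   - conditional_entropy(pair, y))
    marginal = entropy(s_i) + entropy(s_j) - entropy(pair)
    return conditional - marginal
```

On an XOR-like toy example (`s_i = [0, 0, 1, 1]`, `s_j = [0, 1, 0, 1]`, `y = [0, 1, 1, 0]`), neither SNP alone is informative but the pair determines the label, and the score is 1 bit.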
S1035, the APDS classifier calculates the score of each SNP combination as [formula not reproduced in the source text]. A higher score represents a stronger association between the SNP combination and the disease;
S1036, after all the SNP combinations input into a classifier have been scored, sorting the SNP combinations in descending order of score; an SNP combination whose rank is smaller than [threshold not reproduced in the source text] is considered possibly related to the disease, and its vote count under the current classifier is updated to [formula not reproduced in the source text], where $b_u$ is a preset parameter, $o$ is the rank after sorting, and [symbol not reproduced in the source text] is the number of SNP combinations input into the classifier in the current iteration. Because the classifiers are independent of one another and do not affect each other's input combinations, this process can be executed in parallel;
S1037, exchanging the SNP combinations possibly related to the disease in each classifier into all other classifiers; the SNP combination set to be evaluated after the exchange is [formula not reproduced in the source text], where $j \in D \setminus i$ and $j_1 \cup j_2 \cup \dots \cup j_{|D|-1} \cup i = D$. The new combination set is scored and voted on in the same way as in the previous step;
S1038, counting the votes that each classifier gave to each SNP combination, and taking the SNP combinations whose total number of votes is greater than $V_S$ as the output of the current iteration [output symbol not reproduced in the source text]. If there are still SNP combination sets that have not been evaluated by any classifier, the output is input to the combination generation module and the next iteration begins; otherwise, the output is passed to the result verification module.
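One round of scoring, exchange voting and screening (steps S1036 to S1038) might look like the Python sketch below. It is heavily simplified and rests on stated assumptions: the vote-update formula involving $b_u$ and the rank cutoff are not recoverable from the source, so the sketch gives exactly one vote to every combination in the top `top_fraction` of a classifier's ranking. `vote_round`, `top_ranked`, `top_fraction` and `vote_threshold` (standing in for $V_S$) are hypothetical names, and the classifiers are run sequentially here although the text notes they can run in parallel.

```python
import math


def top_ranked(combos, scorer, top_fraction):
    """Sort by descending score and keep the leading fraction (at least one)."""
    ranked = sorted(combos, key=scorer, reverse=True)
    if not ranked:
        return []
    return ranked[:max(1, math.ceil(top_fraction * len(ranked)))]


def vote_round(inputs, scorers, top_fraction=0.5, vote_threshold=1):
    """inputs[i] is the list of SNP combinations given to classifier i.

    Assumption: each top-ranked combination receives one vote per classifier;
    the source's actual vote formula is not available.
    """
    votes, flagged = {}, []
    # S1036: each classifier scores its own input set and votes.
    for combos, scorer in zip(inputs, scorers):
        chosen = top_ranked(combos, scorer, top_fraction)
        flagged.append(chosen)
        for combo in chosen:
            votes[combo] = votes.get(combo, 0) + 1
    # S1037: each classifier re-scores the candidates flagged by the others.
    for i, scorer in enumerate(scorers):
        exchanged = list(dict.fromkeys(
            c for j, chosen in enumerate(flagged) if j != i for c in chosen))
        for combo in top_ranked(exchanged, scorer, top_fraction):
            votes[combo] = votes.get(combo, 0) + 1
    # S1038: keep combinations whose total votes exceed the threshold.
    return [combo for combo, v in votes.items() if v > vote_threshold]
```

The returned list corresponds to the output of one iteration; in the scheme described above it would either be merged with the untested combinations for the next iteration or passed on to the chi-square verification.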
Further, the evaluation results of the multiple classifiers are verified using the chi-square test, and the SNP combinations significantly associated with the disease are selected. This specifically includes the following steps:
S1041, calculating, using the chi-square test, the p-values of all SNP combinations that the multiple classifiers consider related to the disease;
S1042, sorting the SNP combinations in ascending order of p-value;
S1043, finding the inflection point of the p-values; the SNP combinations before the inflection point are regarded as significantly related to the disease and are the final output of the algorithm.
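Steps S1042 and S1043 can be sketched as follows. The source does not define how the inflection point is located, so this example assumes one simple heuristic: after sorting by ascending p-value, the cut is placed at the largest drop in $-\log_{10} p$ between consecutive combinations. The function name and the dictionary input are hypothetical, and the p-values are taken as already computed by the chi-square test.

```python
import math


def select_before_inflection(p_values):
    """p_values maps each SNP combination to its chi-square p-value.

    Returns the combinations ranked before the assumed inflection point,
    i.e. before the largest gap in -log10(p) along the sorted sequence.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])  # S1042
    if len(ranked) < 2:
        return [combo for combo, _ in ranked]
    # Clamp at 1e-300 so that a p-value of exactly 0 does not break log10.
    logs = [-math.log10(max(p, 1e-300)) for _, p in ranked]
    gaps = [logs[k] - logs[k + 1] for k in range(len(logs) - 1)]
    cut = gaps.index(max(gaps)) + 1  # S1043: first position after the gap
    return [combo for combo, _ in ranked[:cut]]
```

Because the cut is derived from the shape of the p-value sequence rather than from a fixed significance level, the number of reported combinations adapts to the data, in line with the stated aim of avoiding a hard threshold.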
In summary, the scheme divides the genome-wide SNP set into a plurality of SNP subsets, generates a plurality of high-dimensional SNP combination sets from the subsets, inputs the SNP combination sets into the multiple classifiers, calculates the score of each SNP combination in the sets in parallel, and votes according to the scores. The SNP combinations with high scores under each classifier are exchanged into all other classifiers for scoring and voting again, and the SNP combinations with a high total number of votes across the classifiers are considered possibly related to the disease and are either input into the next iteration or passed to the next step. The chi-square test is then used to verify the statistical association of these SNP combinations with the disease, and the SNP combinations significantly associated with the disease are output as the final result of the algorithm. Because the SNP set is divided into a plurality of subsets and the iteration continues until all SNP combinations have been evaluated, truly related combinations are not missed; because the classifiers can run in parallel within each iteration, the computational burden is reduced and all possible combinations across the whole genome can be examined exhaustively. In addition, the algorithm uses a plurality of classifiers to decide jointly whether an SNP combination is related to the disease, which reduces the influence of a single classifier's preference for particular data models on the result. Finally, the chi-square test provides further verification, which increases the reliability of the result, and no hard threshold is used to decide whether a combination is related to the disease, which reduces the influence of parameter settings on the result.
Example two:
The present embodiment aims to provide an electronic device.
An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the program:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a whole genome SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations found strongly associated with the disease under each classifier into all other classifiers, and evaluating and voting again; screening out the SNP combinations whose total number of votes exceeds a preset threshold, resetting their vote counts, merging them with the SNP combinations not yet evaluated by the classifiers, and repeating the above steps until all SNP combinations have been evaluated;
verifying, based on the chi-square test, the SNP combinations screened out in the last iteration whose total number of votes exceeds the preset threshold, finding the inflection point in the sorted p-value sequence, and outputting the SNP combinations before the inflection point.
Further, steps executed by the electronic device in this embodiment are the same as the scheme executed by the system in the first embodiment, and technical details thereof have been described in detail in the first embodiment, and thus are not described again here.
Embodiment three:
This embodiment aims to provide a non-transitory computer-readable storage medium.
A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a genome-wide SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations that show strong disease association under each classifier into all other classifiers, and evaluating and voting on them again; screening out the SNP combinations whose total number of votes is higher than a preset threshold, clearing their votes, mixing them with the SNP combinations not yet evaluated by the classifiers, and repeating these steps until all SNP combinations have been evaluated;
and verifying, based on the chi-square test, the SNP combinations whose total number of votes was higher than the preset threshold in the last iteration, finding the inflection point in the p-value sequence, and outputting the SNP combinations before the inflection point.
Further, the steps implemented by the program stored on the non-transitory computer-readable storage medium of this embodiment are the same as those performed by the system of the first embodiment; the technical details have been described in the first embodiment and are not repeated here.
Embodiment four:
This embodiment aims to provide a computer program product.
A computer program product comprising a computer program which, when run on one or more processors, performs the following steps:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a genome-wide SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations that show strong disease association under each classifier into all other classifiers, and evaluating and voting on them again; screening out the SNP combinations whose total number of votes is higher than a preset threshold, clearing their votes, mixing them with the SNP combinations not yet evaluated by the classifiers, and repeating these steps until all SNP combinations have been evaluated;
and verifying, based on the chi-square test, the SNP combinations whose total number of votes was higher than the preset threshold in the last iteration, finding the inflection point in the p-value sequence, and outputting the SNP combinations before the inflection point.
Further, the steps executed by the computer program product of this embodiment are the same as those performed by the system of the first embodiment; the technical details have been described in the first embodiment and are not repeated here.
The ensemble learning-based SNP interaction detection system described above can be implemented and has broad application prospects.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Claims (10)
1. An ensemble learning-based SNP interaction detection system, comprising:
a data acquisition module configured to: acquire SNP sequence information of diseased samples and non-diseased samples, and preprocess it to construct a genome-wide SNP set;
a SNP subset partitioning and combination generation module configured to: divide the genome-wide SNP set into a plurality of SNP subsets, and construct SNP combinations based on the SNP subsets;
a multi-classifier parallel evaluation module configured to: evaluate the association between the SNP combinations and the disease in parallel using a plurality of classifiers;
a result verification module configured to: verify, using the chi-square test, the statistical significance of the SNP combinations that the plurality of classifiers assessed as disease-associated.
2. The ensemble learning-based SNP interaction detection system according to claim 1, wherein the genome-wide SNP set is divided into a plurality of SNP subsets and high-dimensional SNP combinations, for example two-site SNP combinations, are constructed based on the SNP subsets, specifically:
uniformly dividing the genome-wide SNP set into a plurality of SNP subsets;
in the first iteration, for each SNP subset, selecting two different SNPs to form a two-site SNP combination, all possible SNP combinations within the subset forming a set;
in the second and subsequent iterations, selecting one SNP from each of two different SNP subsets to form a two-site SNP combination, all possible SNP combinations between the two subsets forming a set; and adding the SNP combinations output by the previous iteration as possibly related to the disease to the set of SNP combinations not yet examined, as the input of the classifiers in the current iteration.
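The subset partitioning and two-site combination generation recited in claim 2 can be illustrated with a short sketch. The function names are hypothetical, and the claim does not say how pairs of subsets are scheduled across later iterations; the sketch simply enumerates every pair of subsets to show that the within-subset and between-subset sets together cover all two-site combinations exactly once.

```python
from itertools import combinations, product
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_uniformly(snps: Sequence[str], n_subsets: int) -> List[List[str]]:
    """Divide the SNP list into n_subsets contiguous parts of near-equal size."""
    size, extra = divmod(len(snps), n_subsets)
    subsets, start = [], 0
    for k in range(n_subsets):
        end = start + size + (1 if k < extra else 0)
        subsets.append(list(snps[start:end]))
        start = end
    return subsets


def within_subset_pairs(subset: Sequence[str]) -> List[Pair]:
    """First iteration: all pairs of two different SNPs from one subset."""
    return list(combinations(subset, 2))


def between_subset_pairs(a: Sequence[str], b: Sequence[str]) -> List[Pair]:
    """Later iterations: one SNP from each of two different subsets."""
    return list(product(a, b))


# Toy usage with six hypothetical SNP identifiers split into three subsets.
snps = ["rs1", "rs2", "rs3", "rs4", "rs5", "rs6"]
subsets = split_uniformly(snps, 3)
first_iteration = [within_subset_pairs(s) for s in subsets]
later_iterations = [between_subset_pairs(a, b) for a, b in combinations(subsets, 2)]
all_pairs = [p for group in first_iteration + later_iterations for p in group]
```

In this toy case the three within-subset sets contribute 3 pairs and the three between-subset sets contribute 12, which matches the 15 possible pairs of six SNPs.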
3. The ensemble learning-based SNP interaction detection system according to claim 1, wherein the multi-classifier parallel evaluation module includes a scoring voting module, an exchange voting module, and a screening module, wherein:
the scoring voting module is configured to: score the input SNP combinations using each classifier, and vote according to the scores;
the exchange voting module is configured to: exchange the SNP combinations that each classifier considers possibly related to the disease into all other classifiers, and repeat the scoring and voting;
the screening module is configured to: count the votes of all classifiers, screen out the SNP combinations whose total number of votes is greater than a preset threshold, and input them into the result verification module.
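One possible reading of the scoring, exchange, and screening modules of claim 3 is sketched below. Several details are assumptions that the claim leaves open: each classifier is modelled as a scoring function where a higher score means stronger association, a classifier votes for its `k` best-scoring combinations, an exchanged combination earns a vote from another classifier if it ranks in that classifier's top `k` when pooled with the classifier's own set, and the classifiers start from disjoint combination sets. All names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set

ScoreFn = Callable[[str], float]  # higher = stronger association (assumption)


def top_k(score: ScoreFn, combos: Sequence[str], k: int) -> Set[str]:
    """One classifier votes for its k best-scoring combinations."""
    return set(sorted(combos, key=score, reverse=True)[:k])


def ensemble_round(
    classifiers: Sequence[ScoreFn],
    combo_sets: Sequence[Sequence[str]],
    k: int,
    vote_threshold: int,
) -> List[str]:
    votes: Counter = Counter()
    # Scoring voting: each classifier votes within its own combination set.
    own_votes = [top_k(f, s, k) for f, s in zip(classifiers, combo_sets)]
    for winners in own_votes:
        votes.update(winners)
    # Exchange voting: winners of classifier i are re-scored by every other
    # classifier j and gain a vote if they reach j's top k.
    for i, winners in enumerate(own_votes):
        for j, score in enumerate(classifiers):
            if i == j:
                continue
            pool = list(dict.fromkeys([*combo_sets[j], *winners]))
            votes.update(top_k(score, pool, k) & winners)
    # Screening: keep combinations whose total votes exceed the threshold.
    return sorted(c for c, n in votes.items() if n > vote_threshold)


# Toy usage: two classifiers backed by fixed score tables (made-up numbers).
table_a = {"p1": 9, "p2": 1, "p3": 2, "p4": 8, "p5": 0, "p6": 1}
table_b = {"p1": 7, "p2": 0, "p3": 1, "p4": 9, "p5": 2, "p6": 3}
selected = ensemble_round(
    [table_a.__getitem__, table_b.__getitem__],
    [["p1", "p2", "p3"], ["p4", "p5", "p6"]],
    k=2,
    vote_threshold=1,
)
```

Here `p1` and `p4` are each supported by both classifiers and pass the screen, while `p3` and `p6` receive only the vote of the classifier that first saw them.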
4. The ensemble learning-based SNP interaction detection system according to claim 1, wherein the statistical significance of the SNP combinations that the plurality of classifiers assessed as disease-associated is verified using the chi-square test, specifically:
calculating, using the chi-square test, the p-value of each SNP combination that the multiple classifiers consider related to the disease;
sorting the SNP combinations in ascending order of p-value;
finding the inflection point of the p-values, and outputting the SNP combinations before the inflection point as the final detection result.
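The verification steps of claim 4 can be sketched with the standard library alone. The chi-square statistic is computed on the genotype-combination by case/control contingency table, and the p-value uses the closed-form series for the chi-square upper tail. The claim does not define how the inflection point is located; the sketch assumes it is the largest drop between consecutive values of -log10(p) after sorting, which is only one plausible choice. All function names are hypothetical, and both classes are assumed to be present in `labels`.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term = total = 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return math.exp(-x / 2) * total
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(math.sqrt(x / 2)) + 2 * density * total


def combination_p_value(genotypes: Sequence[Hashable], labels: Sequence[int]) -> float:
    """Chi-square test of independence between genotype combination and label."""
    table = Counter(zip(genotypes, labels))
    rows = set(genotypes)
    n = len(labels)
    col_totals = [sum(1 for y in labels if y == c) for c in (0, 1)]
    stat = 0.0
    for g in rows:
        row_total = table[(g, 0)] + table[(g, 1)]
        for c in (0, 1):
            expected = row_total * col_totals[c] / n
            stat += (table[(g, c)] - expected) ** 2 / expected
    return chi2_sf(stat, len(rows) - 1)


def before_inflection(p_values: Dict[Hashable, float]) -> List[Hashable]:
    """Sort ascending by p-value and cut at the largest drop in -log10(p)."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    if len(ranked) < 2:
        return [combo for combo, _ in ranked]
    logs = [-math.log10(max(p, 1e-300)) for _, p in ranked]
    gaps = [logs[i] - logs[i + 1] for i in range(len(logs) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return [combo for combo, _ in ranked[:cut]]
```

For example, p-values of 1e-12, 1e-10, 0.2, and 0.5 show their largest gap between the second and third entries, so only the first two combinations would be output under this assumption.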
5. The ensemble learning-based SNP interaction detection system according to claim 1, wherein the SNP sequence information of diseased samples and non-diseased samples is acquired and preprocessed to construct the genome-wide SNP set, specifically:
labeling each diseased sample as 1 and each non-diseased sample as 0; and, for each sample at each SNP site, encoding the mutation status as 0 if neither allele is mutated, 1 if one of the two alleles is mutated, 2 if both alleles are mutated, and 3 if the data for the site are missing.
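The encoding in claim 5 amounts to a small lookup, sketched below. The input representation is an assumption: each genotype is given as two allele characters plus the non-mutated (reference) allele for the site, and `"0"` denotes a missing call. The function names and the missing-call symbol are hypothetical.

```python
MISSING_CODE = 3


def encode_genotype(allele_1: str, allele_2: str, reference: str,
                    missing_symbol: str = "0") -> int:
    """Return 0, 1, or 2 mutated alleles, or 3 when the site is missing."""
    if missing_symbol in (allele_1, allele_2):
        return MISSING_CODE
    return int(allele_1 != reference) + int(allele_2 != reference)


def encode_label(is_diseased: bool) -> int:
    """Diseased samples are labelled 1, non-diseased samples 0."""
    return 1 if is_diseased else 0


# Toy usage: one sample typed at four sites whose reference allele is "A".
sample = [("A", "A"), ("A", "G"), ("G", "G"), ("0", "0")]
codes = [encode_genotype(a, b, "A") for a, b in sample]
```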
6. The ensemble learning-based SNP interaction detection system according to claim 5, wherein the preprocessing further includes: deleting samples in which more than 5% of the SNPs are missing; deleting SNPs that are missing in more than 5% of the samples; calculating a p-value for each SNP using the chi-square test and deleting SNPs with p-value > 0.0001; and deleting SNPs whose minor allele frequency is less than 0.1.
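The filters in claim 6 can be sketched on a genotype matrix that uses the 0/1/2/3 coding of claim 5 (rows are samples, columns are SNPs). This is a minimal illustration, not the patented implementation: the filters are applied in the order the claim lists them, the minor allele frequency is estimated from non-missing calls as the smaller of the mutated-allele frequency and its complement, and the per-SNP chi-square p-value is supplied through an optional, hypothetical callable `snp_p_value` rather than computed here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

MISSING = 3  # code for a missing genotype


def missing_rate(values: Sequence[int]) -> float:
    return sum(v == MISSING for v in values) / len(values)


def minor_allele_frequency(column: Sequence[int]) -> float:
    observed = [v for v in column if v != MISSING]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))  # two alleles per sample
    return min(freq, 1 - freq)


def quality_control(
    matrix: Sequence[Sequence[int]],
    snp_ids: Sequence[str],
    snp_p_value: Optional[Callable[[int, List[int]], float]] = None,
    max_missing: float = 0.05,
    max_p: float = 1e-4,
    min_maf: float = 0.1,
) -> Tuple[List[int], List[str]]:
    """Return the indices of retained samples and the ids of retained SNPs."""
    # 1. Drop samples with too many missing SNPs.
    rows = [i for i, row in enumerate(matrix) if missing_rate(row) <= max_missing]
    kept = []
    for j, snp in enumerate(snp_ids):
        column = [matrix[i][j] for i in rows]
        # 2. Drop SNPs missing in too many of the remaining samples.
        if not column or missing_rate(column) > max_missing:
            continue
        # 3. Drop SNPs whose single-SNP p-value exceeds the cutoff.
        if snp_p_value is not None and snp_p_value(j, rows) > max_p:
            continue
        # 4. Drop SNPs with a low minor allele frequency.
        if minor_allele_frequency(column) < min_maf:
            continue
        kept.append(snp)
    return rows, kept


# Toy usage with five samples and three hypothetical SNPs.
toy = [[0, 1, 0], [1, 2, 0], [3, 3, 3], [2, 1, 0], [0, 3, 0]]
kept_rows, kept_snps = quality_control(toy, ["rs1", "rs2", "rs3"])
```

In the toy matrix the third and fifth samples are dropped for missing calls, and `rs3` is dropped because no mutated allele is observed.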
7. The ensemble learning-based SNP interaction detection system according to claim 1, wherein the plurality of classifiers includes classifiers based on the Gini index, K2-score, entropy, information gain, and APDS.
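Two of the measures named in claim 7 can be illustrated on the contingency table of genotype combinations against case/control labels. The exact scoring formulas used by the system are not given in this text, so the sketch uses common textbook definitions as an assumption: the Gini score is the sample-weighted Gini impurity of the labels within each genotype cell (lower suggests stronger association), and information gain is the label entropy minus the weighted within-cell entropy (higher suggests stronger association). K2-score and APDS are not shown.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def cell_counts(genotypes: Sequence[Hashable], labels: Sequence[int]) -> Dict[Hashable, List[int]]:
    """Count [controls, cases] for every observed genotype combination."""
    cells: Dict[Hashable, List[int]] = defaultdict(lambda: [0, 0])
    for g, y in zip(genotypes, labels):
        cells[g][y] += 1
    return cells


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy in bits of a count vector; empty classes are skipped."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)


def gini_score(genotypes: Sequence[Hashable], labels: Sequence[int]) -> float:
    """Weighted Gini impurity of the labels across genotype cells."""
    n = len(labels)
    score = 0.0
    for counts in cell_counts(genotypes, labels).values():
        size = sum(counts)
        score += size / n * (1 - sum((c / size) ** 2 for c in counts))
    return score


def information_gain(genotypes: Sequence[Hashable], labels: Sequence[int]) -> float:
    """Label entropy minus the weighted entropy within genotype cells."""
    n = len(labels)
    cells = cell_counts(genotypes, labels)
    class_counts = [sum(c[0] for c in cells.values()), sum(c[1] for c in cells.values())]
    conditional = sum(sum(c) / n * entropy(c) for c in cells.values())
    return entropy(class_counts) - conditional
```

A genotype combination that separates cases from controls perfectly gives a Gini score of 0 and, for balanced classes, an information gain of 1 bit; a combination carried identically by every sample gives 0.5 and 0 respectively.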
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor, when executing the program, performs the following steps:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a genome-wide SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations that show strong disease association under each classifier into all other classifiers, and evaluating and voting on them again; screening out the SNP combinations whose total number of votes is higher than a preset threshold, clearing their votes, mixing them with the SNP combinations not yet evaluated by the classifiers, and repeating these steps until all SNP combinations have been evaluated;
and verifying, based on the chi-square test, the SNP combinations whose total number of votes was higher than the preset threshold in the last iteration, finding the inflection point in the p-value sequence, and outputting the SNP combinations before the inflection point.
9. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the following steps:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a genome-wide SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations that show strong disease association under each classifier into all other classifiers, and evaluating and voting on them again; screening out the SNP combinations whose total number of votes is higher than a preset threshold, clearing their votes, mixing them with the SNP combinations not yet evaluated by the classifiers, and repeating these steps until all SNP combinations have been evaluated;
and verifying, based on the chi-square test, the SNP combinations whose total number of votes was higher than the preset threshold in the last iteration, finding the inflection point in the p-value sequence, and outputting the SNP combinations before the inflection point.
10. A computer program product comprising a computer program, wherein the computer program, when run on one or more processors, performs the following steps:
acquiring SNP data of diseased individuals and non-diseased individuals, and preprocessing the data to construct a genome-wide SNP set;
dividing the genome-wide SNP set into a plurality of SNP subsets, and constructing SNP combinations based on the SNP subsets;
inputting a part of the SNP combinations into the classifiers for voting; exchanging the SNP combinations that show strong disease association under each classifier into all other classifiers, and evaluating and voting on them again; screening out the SNP combinations whose total number of votes is higher than a preset threshold, clearing their votes, mixing them with the SNP combinations not yet evaluated by the classifiers, and repeating these steps until all SNP combinations have been evaluated;
and verifying, based on the chi-square test, the SNP combinations whose total number of votes was higher than the preset threshold in the last iteration, finding the inflection point in the p-value sequence, and outputting the SNP combinations before the inflection point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210860224.4A CN115101133A (en) | 2022-07-21 | 2022-07-21 | Integrated learning-based SNP interaction detection system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210860224.4A CN115101133A (en) | 2022-07-21 | 2022-07-21 | Integrated learning-based SNP interaction detection system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115101133A (en) | 2022-09-23 |
Family
ID=83299323
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210860224.4A Pending CN115101133A (en) | 2022-07-21 | 2022-07-21 | Integrated learning-based SNP interaction detection system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115101133A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104462868A (en) * | 2014-12-11 | 2015-03-25 | 西安电子科技大学 | Genome-wide SNP (single nucleotide polymorphism) site analysis method based on combination of random forest and Relief-F |
CN108256293A (en) * | 2018-02-09 | 2018-07-06 | 哈尔滨工业大学深圳研究生院 | A kind of statistical method and system of the disease association assortment of genes |
Non-Patent Citations (4)
Title |
---|
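XIN WANG et al.: "ELSSI: parallel SNP–SNP interactions detection by ensemble multi-type detectors", HTTPS://DOI.ORG/10.1093/BIB/BBAC213, pages 2 - 4 * |
Yao Yuchen: "Study on the characteristics of single nucleotide polymorphisms associated with hereditary diseases and traits", China Master's Theses Full-text Database (Medicine and Health Sciences), no. 07, 15 July 2019 (2019-07-15), pages 3 * |
Fang Yalan et al.: "Analysis of pathogenic genetic loci based on several machine learning algorithms", Journal of Huanggang Normal University, vol. 39, no. 3, 30 June 2019 (2019-06-30), pages 2 * |
Du Yongwang et al.: "Genomic selection for the RFI trait in broilers incorporating GWAS prior marker information", Acta Veterinaria et Zootechnica Sinica, HTTPS://KNS.CNKI.NET/KCMS/DETAIL/11.1985.S.20220720.0946.002.HTML, 20 July 2022 (2022-07-20), pages 8 * |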
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108023876B (en) | Intrusion detection method and intrusion detection system based on sustainability ensemble learning | |
Kumar et al. | Feature selection and classification of microarray data using MapReduce based ANOVA and K-nearest neighbor | |
Lamba et al. | Feature Selection of Micro-array expression data (FSM)-A Review | |
Karim et al. | Convolutional embedded networks for population scale clustering and bio-ancestry inferencing | |
Alzubi et al. | Hybrid feature selection method for autism spectrum disorder SNPs | |
CN110379464A (en) | The prediction technique of DNA transcription terminator in a kind of bacterium | |
CN107480441B (en) | Modeling method and system for children septic shock prognosis prediction | |
US9008974B2 (en) | Taxonomic classification system | |
CN113823356A (en) | Methylation site identification method and device | |
Kalna et al. | Clustering coefficients for weighted networks | |
Lee | The fractal dimension as a measure for characterizing genetic variation of the human genome | |
CN108154189A (en) | Grey relational cluster method based on LDTW distances | |
Wang et al. | Gaebic: a novel biclustering analysis method for mirna-targeted gene data based on graph autoencoder | |
CN115101133A (en) | Integrated learning-based SNP interaction detection system | |
Khodaei et al. | A Markov chain-based feature extraction method for classification and identification of cancerous DNA sequences | |
CN110502669A (en) | The unsupervised chart dendrography learning method of lightweight and device based on the side N DFS subgraph | |
Li et al. | MODA: MOdule Differential Analysis for weighted gene co-expression network | |
Dong et al. | Decision system for copper flotation backbone process | |
Sato et al. | Directed acyclic graph kernels for structural RNA analysis | |
Salem et al. | A new gene selection technique based on hybrid methods for cancer classification using microarrays | |
Tuna et al. | Classification with binary gene expressions | |
Zhou et al. | A hybrid algorithm of minimum spanning tree and nearest neighbor for classifying human cancers | |
Gan et al. | A survey of pattern classification-based methods for predicting survival time of lung cancer patients | |
Schmidt et al. | Scalable induction of probabilistic real-time automata using maximum frequent pattern based clustering | |
CN115240765A (en) | SNP (Single nucleotide polymorphism) interaction detection system based on heterogeneous biomolecular network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||