CN108090328B - Cancer driver gene identification method based on machine learning and multiple statistical principles - Google Patents
- Publication number: CN108090328B
- Application number: CN201711496093.1A
- Authority: CN (China)
- Legal status: Active
Classifications
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The invention discloses a cancer driver gene identification method based on machine learning and multiple statistical principles, comprising the following steps: (1) arranging the data into a standard format; (2) calculating the background variation rate; (3) statistically testing for cancer driver genes; (4) simulating the distribution of the test statistic by Monte Carlo simulation; and (5) adjusting the P value. The method accounts for the background variation rate of each sample, gene and mutation type, as well as the influence of different mutation types on protein function, and uses a score test to identify driver genes. It is highly robust and broadly applicable to many types of cancer; it achieves a better balance between sensitivity and specificity, detecting a larger number of driver genes while keeping the false-positive rate low. The invention is of significance for finding potential therapeutic target sites for cancer and for developing anti-cancer drugs.
Description
Technical Field
The invention belongs to the interdisciplinary field of bioinformatics and cancer medicine, and relates to a cancer driver gene identification method that uses machine learning and multiple statistical methods.
Background
Cancer is, for the most part, a disease caused by somatic mutations. Driver genes are directly responsible for the development of cancer, whereas passenger genes have no direct relationship with cancer, so it is necessary to identify the driver genes. Several tumor sequencing projects worldwide, such as The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET), have established comprehensive catalogues of somatic mutations in various types of cancer. One of the main objectives of these sequencing projects is to identify the driver genes responsible for cancer. Finding cancer driver genes not only improves our understanding of how tumors arise and develop, but also provides potential therapeutic targets for some cancers.
Bioinformatic tools that identify cancer driver genes from data sets have been developed. They can be classified into three categories according to their underlying principle. The first category comprises mutation-frequency-based tools, which identify genes whose mutation frequency is higher than the background mutation rate as driver genes. Representative tools are MutSigCV (Lawrence M S, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes [J]. Nature, 2013, 499(7457):214-218.) and patent application No. CN201310284338.X, "A method and kit for detecting the non-small cell lung cancer driver gene mutation spectrum, and applications thereof". The second category comprises tools built on known pathways or interaction networks. Representative tools are DawnRank (Hou J P, Ma J. DawnRank: discovering personalized driver genes in cancer [J]. Genome Medicine, 2014, 6(7):56.) and patent application No. CN201510111810.9, "A method for screening cancer driver genes based on biological networks". The third category comprises "hot spot" tools, where a hot spot is a position that has a significant effect on the three-dimensional conformation of a peptide chain or protein. A representative tool is OncodriveCLUST (Tamborero D, Gonzalez-Perez A, Lopez-Bigas N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes [J]. Bioinformatics, 2013, 29(18):2238.).
However, these bioinformatics tools still have shortcomings. First, the algorithms do not achieve a good balance between sensitivity and specificity: some have high sensitivity but very low specificity, while others have high specificity but low sensitivity. Second, the methods lack robustness across tumor types: for some tumor types they perform well and find many reliable driver genes, but for others they perform poorly.
Disclosure of Invention
The present invention aims to provide a highly robust algorithm for identifying cancer driver genes. The method is based on machine learning and multiple statistical methods, shows high sensitivity and specificity on data from many cancer types, greatly reduces the false positives produced by conventional methods, and lays an important foundation for subsequent gene function research and targeted drug screening.
The technical scheme provided by the invention is as follows: a cancer driver gene identification method based on machine learning and multiple statistical principles is realized by the following steps:
(1) Arranging the data into a standard format: the input data must be in the Mutation Annotation Format (MAF) commonly used by The Cancer Genome Atlas (TCGA), or a file containing seven key columns: column 1 is the gene name, column 5 is the chromosome on which the gene is located, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the number of the sample carrying the mutation. Once arranged into this format, the input data can be used in the subsequent steps;
(2) Calculating the background variation rate: the background variation rate of gene mutations is calculated by an empirical Bayes method. Synonymous and non-synonymous mutations each follow their own distribution (the distribution formulas are not reproduced in this text).
Here p denotes the sample number in the input data, g the gene name, and t the mutation type. The quantities entering the model are: the number of possible t-type synonymous mutations on gene g; the number of possible t-type non-synonymous mutations on gene g; and the actual numbers of t-type synonymous and non-synonymous mutations observed in sample p, gene g. β_pgt denotes the background variation rate of t-type mutations in sample p, gene g, and θ_pgt denotes the total variation rate of t-type mutations in sample p, gene g, with θ_pgt = β_pgt + α_gt, where α_gt denotes the incidence of t-type mutations in gene g. The background variation rate is calculated according to a formula that is likewise not reproduced in this text;
(3) Statistical testing for cancer driver genes: a hypothesis test is performed on each gene to decide whether it is a driver gene. For a gene g under test, the null hypothesis is H0: α_g1 = … = α_gT = 0, where T is the total number of mutation types; the alternative hypothesis is that at least one mutation type t has α_gt > 0. The test statistic is a score test statistic (formula not reproduced in this text),
where ω_t is a weight parameter measuring the functional consequence of t-type mutations. For genes in the training data, the weights are calculated with a machine-learning approach (formula not reproduced in this text); for genes that are not in the training data, the weight is the mean of the weights of the training genes.
(4) Monte Carlo simulation of the distribution of the statistic: for the score test in step (3), the theoretical distribution of the statistic is the standard normal distribution when the sample size is large enough. However, because gene mutation frequencies are very low, only a few samples carry a mutation in any given gene, and the actual distribution of the statistic does not follow the standard normal distribution. The probability associated with the statistic is therefore computed by simulating its distribution. Because the mutation counts follow a Poisson distribution, mutation data can be generated artificially from that distribution (formula not reproduced in this text),
using the estimated value of β_pgt in place of β_pgt. Substituting the simulated data into the test statistic formula gives the simulated distribution; substituting the real data gives the observed value of the statistic; the statistical significance P value is then obtained from the simulated distribution;
(5) Adjusting the P value: the significance P value of each gene is adjusted according to the Benjamini-Hochberg method, i.e. P_adj = P × G / r, where P_adj is the adjusted P value, P is the original P value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene when all genes are arranged in ascending order of their P values. A gene is judged to be a driver gene if its adjusted P value is below a threshold (usually 0.05).
Further, the data in step (1) may be obtained from sources including, but not limited to, The Cancer Genome Atlas (TCGA).
Further, the data in step (1) can be generated by a platform including, but not limited to, an Illumina sequencer.
Further, the software used to arrange the data in step (1) includes, but is not limited to, R.
Further, the mutation types in step (2) include missense mutations, nonsense mutations, nonstop mutations (translation fails to terminate normally), splice-site mutations, transcription initiation site mutations, and insertion/deletion mutations.
Further, the sources of the samples in step (2) include, but are not limited to, lung cancer, cervical cancer, breast cancer and ovarian cancer.
Further, the threshold value of the significance P value in the step (5) is taken to be 0.05.
The cancer driver gene identification method provided by the invention adopts machine learning and various statistical methods, considers the influence of various mutation types on the protein function, has high robustness, and is widely applicable to various cancers; and a better balance between sensitivity and specificity is achieved, a larger number of driving genes can be detected, and lower false positive can be maintained. The invention has important significance for searching potential sites for cancer treatment and developing anti-cancer drugs.
Drawings
FIG. 1 is a schematic diagram of the implementation of the cancer driver gene identification method based on machine learning and various statistical principles of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following drawings and specific examples, but the present invention is not limited thereto.
1. Experimental materials:
Experimental sample data: lung squamous cell carcinoma mutation data downloaded from the TCGA database (http://tcga-data.nci.nih.gov/docs/publications/lusc_2012/);
Operating system: Linux
Software: R and Perl, both downloaded from their official websites.
2. Experimental procedure, as shown in figure 1:
(1) The data are arranged as follows: column 1 is the gene name, column 5 is the chromosome on which the gene is located, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the number of the sample carrying the mutation.
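The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the patented method: it assumes a tab-separated file, treats lines starting with `#` as comments, and uses hypothetical field labels (`gene`, `chromosome`, and so on) chosen here for illustration. The demo row is made up.

```python
import csv
import io

# 1-based column positions named in step (1); the labels are illustrative only.
KEY_COLUMNS = {
    "gene": 1, "chromosome": 5, "start": 6, "classification": 9,
    "reference": 11, "mutated": 13, "sample": 16,
}

def read_key_columns(handle, has_header=True):
    """Return one dict per mutation record, keeping only the seven key columns."""
    # Skip blank lines and comment lines (assumed to start with '#').
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    reader = csv.reader(rows, delimiter="\t")
    if has_header:
        next(reader, None)  # discard the header row
    records = []
    for fields in reader:
        if len(fields) < 16:
            continue  # row too short to contain column 16
        records.append({name: fields[pos - 1] for name, pos in KEY_COLUMNS.items()})
    return records

# Made-up illustrative row with 16 tab-separated fields.
demo_row = ["TP53", "", "", "", "17", "7577120", "", "", "Missense_Mutation",
            "", "C", "", "T", "", "", "SAMPLE-01"]
demo_records = read_key_columns(io.StringIO("\t".join(demo_row) + "\n"),
                                has_header=False)
```

Selecting columns by position rather than by header name follows the wording of step (1), which specifies column numbers only.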
(2) Calculating the background variation rate: the rate is calculated according to the empirical Bayes formula of step (2) of the method (formula not reproduced in this text).
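The patent's own formula for the background rate is not available in this text, so the following is only a generic illustration of empirical Bayes rate estimation, not the patented calculation. It assumes synonymous counts are Poisson with a Gamma prior on the rate, fits the prior by the method of moments, and returns posterior-mean rates. The function name and the input layout are hypothetical.

```python
import statistics

def shrunken_rates(counts, sites):
    """Gamma-Poisson empirical Bayes shrinkage of per-unit mutation rates.

    counts[i]: observed synonymous mutations in unit i (e.g. a sample-gene pair)
    sites[i]:  number of positions at which such a mutation could occur
    """
    raw = [n / s for n, s in zip(counts, sites)]
    pooled = sum(counts) / sum(sites)
    # Method of moments: between-unit variance = total variance - Poisson noise.
    noise = pooled * statistics.mean([1.0 / s for s in sites])
    between = statistics.variance(raw) - noise
    if between <= 0 or pooled == 0:
        # No detectable heterogeneity: shrink everything to the pooled rate.
        return [pooled] * len(counts)
    shape = pooled ** 2 / between  # Gamma prior shape
    rate = pooled / between        # Gamma prior rate (prior mean = shape / rate)
    # Posterior mean of a Gamma(shape, rate) prior after observing n in s sites.
    return [(shape + n) / (rate + s) for n, s in zip(counts, sites)]
```

Each estimate lies between the unit's raw rate and the pooled rate, which is the stabilising effect that motivates empirical Bayes when per-gene counts are very small.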
(3) Statistical testing for cancer driver genes: for gene g, the null hypothesis is H0: α_g1 = … = α_gT = 0, i.e. the mutation rate corresponding to every mutation type equals 0; the alternative hypothesis is that there is a mutation type t whose mutation rate is greater than 0. The score test statistic is calculated from a formula that is not reproduced in this text,
where θ_pgt = β_pgt + α_gt. For genes in the training data, the weights are calculated with a machine-learning approach (formula not reproduced in this text); for genes that are not in the training data, the weight is the mean of the weights of the training genes. The training data come from the IntOGen database (https://www.intogen.org/search).
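Since the statistic itself is not available in this text, the sketch below derives one possible weighted score statistic from first principles and should not be read as the patent's formula. It assumes the non-synonymous count n for sample p and type t is Poisson with mean N_t·(β_pgt + α_gt), where N_t is the number of possible mutation sites. Under H0 the score with respect to α_gt is n/β − N_t, with variance N_t/β. The argument layout and names are hypothetical.

```python
import math

def weighted_score_statistic(counts, sites, beta, weights):
    """Standardised weighted score statistic for H0: all alpha_gt = 0.

    counts[t][p]: observed non-synonymous mutations of type t in sample p
    sites[t]:     number of possible type-t mutation sites in the gene
    beta[t][p]:   background rate for type t in sample p
    weights[t]:   weight omega_t for mutation type t
    """
    numerator = 0.0
    variance = 0.0
    for t, w in enumerate(weights):
        for n, b in zip(counts[t], beta[t]):
            numerator += w * (n / b - sites[t])   # Poisson score at alpha = 0
            variance += w * w * sites[t] / b      # Fisher information at alpha = 0
    return numerator / math.sqrt(variance)

# Toy inputs (made up): two mutation types, two samples.
toy_sites = [1000, 500]
toy_beta = [[0.002, 0.004], [0.002, 0.002]]
z = weighted_score_statistic([[5, 4], [3, 1]], toy_sites, toy_beta, [1.0, 2.0])
```

A large positive value means more non-synonymous mutations than the background rate predicts, weighted toward mutation types with larger ω_t.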
(4) Monte Carlo simulation of the distribution of the statistic: the simulated data are generated from a Poisson distribution (formula not reproduced in this text).
Substituting the simulated data into the formula in step (3) gives the simulated distribution, from which the significance P value of the statistic computed from the real data in step (3) is obtained.
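The simulation step can be sketched as follows, under stated assumptions: counts are drawn from a Poisson distribution whose mean is the expected background count, the test statistic is passed in as a function, and the P value is the fraction of simulated statistics at least as large as the observed one. The `standardized_excess` statistic is a simple stand-in chosen for illustration, not the patent's score statistic, and the add-one correction is a common convention rather than something stated in the source.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means seen here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def monte_carlo_pvalue(observed, expected, statistic, n_sim=2000, seed=0):
    """One-sided empirical P value of statistic(observed) under the Poisson null."""
    rng = random.Random(seed)
    real = statistic(observed, expected)
    hits = 0
    for _ in range(n_sim):
        simulated = [sample_poisson(mu, rng) for mu in expected]
        if statistic(simulated, expected) >= real:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)

def standardized_excess(counts, expected):
    """Stand-in statistic: excess of total count over its null expectation."""
    total = sum(expected)
    return (sum(counts) - total) / math.sqrt(total)
```

The smallest P value this procedure can return is 1 / (n_sim + 1), so the number of simulations bounds the resolution available to the multiple-testing adjustment in step (5).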
(5) Adjusting the P value: all genes are arranged in ascending order of their calculated P values, and each gene is assigned its rank. The adjusted P value is P_adj = P × G / r, where P is the original P value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene in this ascending order.
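The adjustment in step (5) can be sketched in Python as below. The core is the stated formula P × G / r with r the ascending rank. Two additions are assumptions taken from common Benjamini-Hochberg implementations rather than from the source: capping the adjusted values at 1, and an optional step that keeps them monotone in the rank.

```python
def bh_adjust(pvalues, enforce_monotone=True):
    """Benjamini-Hochberg adjustment: P * G / r, with r the ascending rank."""
    G = len(pvalues)
    order = sorted(range(G), key=lambda i: pvalues[i])  # indices, smallest P first
    adjusted = [0.0] * G
    running_min = 1.0
    # Walk from the largest P value down so the monotone step is a running minimum.
    for rank in range(G, 0, -1):
        i = order[rank - 1]
        value = min(1.0, pvalues[i] * G / rank)
        if enforce_monotone:
            running_min = min(running_min, value)
            value = running_min
        adjusted[i] = value
    return adjusted

def call_drivers(genes, pvalues, threshold=0.05):
    """Return the genes whose adjusted P value is below the threshold."""
    return [g for g, q in zip(genes, bh_adjust(pvalues)) if q < threshold]
```

With `enforce_monotone=False` the function returns exactly P × G / r as written in the text; the monotone variant never gives a gene with a smaller P value a larger adjusted value than a gene ranked after it.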
3. The experimental results are as follows:
This example identifies 14 driver genes: RB1, KEAP1, TP53, ARID1A, ZNF268, PTEN, MLL2, NFE2L2, MACF1, NPAT, CAPN8, TTN, CDKN2A, and MUC16. Seven of them (RB1, KEAP1, TP53, ARID1A, PTEN, NFE2L2 and CDKN2A) also appear in the Cancer Gene Census (CGC) database (http://cancer.sanger.ac.uk/census), a coincidence rate of 50%, which indicates that the results of the present invention agree well with current research. Table 1 lists three other algorithms for identifying driver genes, including MDPFinder (Zhao J, Zhang S, Wu L Y, et al. Efficient methods for identifying mutated driver pathways in cancer [J]. Bioinformatics, 2012, 28(22):2940-). The three algorithms were run on the same data used in this example and found 0, 14 and 3 driver genes, respectively, with CGC coincidence rates of 0%, 21% and 0%. This shows that the invention outperforms the three algorithms when the number of driver genes found (sensitivity) and the CGC coincidence rate (specificity) are considered together. In addition, biochemical experiments verified that the NPAT gene newly identified by the invention is indeed related to the abnormal proliferation of lung squamous cell carcinoma, which further indicates that the genes identified by the method are highly reliable.
Table 1: Performance comparison of the present invention with three other algorithms
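The CGC coincidence rate used as the specificity measure in this comparison is simply the share of identified genes that also appear in the reference list. A minimal sketch, using the gene names given in the results; `in_cgc` holds only the seven overlapping genes named in the text, not the full Cancer Gene Census, and the function name is chosen here for illustration:

```python
identified = ["RB1", "KEAP1", "TP53", "ARID1A", "ZNF268", "PTEN", "MLL2",
              "NFE2L2", "MACF1", "NPAT", "CAPN8", "TTN", "CDKN2A", "MUC16"]
# Genes the text reports as present in the Cancer Gene Census.
in_cgc = {"RB1", "KEAP1", "TP53", "ARID1A", "PTEN", "NFE2L2", "CDKN2A"}

def coincidence_rate(found, reference):
    """Fraction of identified genes that also appear in the reference set."""
    if not found:
        return 0.0  # an algorithm that finds no genes has a 0% coincidence rate
    return len(set(found) & set(reference)) / len(found)

rate = coincidence_rate(identified, in_cgc)
print(f"{rate:.0%}")  # 7 of 14 genes -> 50%
```

Returning 0.0 for an empty result mirrors how the comparison reports 0% for the algorithm that found no driver genes.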
Claims (7)
1. A method for identifying a cancer driver gene based on machine learning and multiple statistical principles, comprising the steps of:
(1) arranging the data into a standard format: the input data must be in the mutation annotation format commonly used by The Cancer Genome Atlas, or a file containing seven key columns: column 1 is the gene name, column 5 is the chromosome on which the gene is located, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the number of the sample carrying the mutation;
(2) calculating the background variation rate: the background variation rate of gene mutations is calculated by an empirical Bayes method; synonymous and non-synonymous mutations each follow their own distribution (the distribution formulas are not reproduced in this text);
p denotes the sample number in the input data, g the gene name, and t the mutation type; the quantities entering the model are the number of possible t-type synonymous mutations on gene g, the number of possible t-type non-synonymous mutations on gene g, and the actual numbers of t-type synonymous and non-synonymous mutations observed in sample p, gene g; β_pgt denotes the background variation rate of t-type mutations in sample p, gene g, and θ_pgt denotes the total variation rate of t-type mutations in sample p, gene g, with θ_pgt = β_pgt + α_gt, where α_gt denotes the incidence of t-type mutations in gene g; the background variation rate is calculated according to a formula that is likewise not reproduced in this text;
(3) statistical testing for cancer driver genes: a hypothesis test is performed on each gene to decide whether it is a driver gene; for a gene g under test, the null hypothesis is H0: α_g1 = … = α_gT = 0, where T is the total number of mutation types, and the alternative hypothesis is that at least one mutation type t has α_gt > 0; the test statistic is a score test statistic (formula not reproduced in this text),
where ω_t is a weight parameter measuring the functional consequence of t-type mutations; for genes in the training data, the weights are calculated with a machine-learning approach (formula not reproduced in this text); for genes not in the training data, the weight is the mean of the weights of the training genes;
(4) Monte Carlo simulation of the distribution of the statistic: for the score test in step (3), the theoretical distribution of the statistic is the standard normal distribution when the sample size is large enough; because gene mutation frequencies are very low, only a few samples carry a mutation in any given gene, and the actual distribution of the statistic does not follow the standard normal distribution, so the probability associated with the statistic is computed by simulating its distribution; because the mutation counts follow a Poisson distribution, mutation data are generated artificially from that distribution (formula not reproduced in this text),
using the estimated value of β_pgt in place of β_pgt; substituting the simulated data into the test statistic formula gives the simulated distribution, substituting the real data gives the observed value of the statistic, and the statistical significance Q value is obtained from the simulated distribution;
(5) adjusting the Q value: the significance Q value of each gene is adjusted according to the Benjamini-Hochberg method, i.e. Q_adj = Q × G / r, where Q_adj is the adjusted Q value, Q is the original Q value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene when all genes are arranged in ascending order of their Q values; genes whose significance Q value is within the threshold are driver genes.
2. The method of claim 1, wherein the data in step (1) are obtained from The Cancer Genome Atlas (TCGA).
3. The method of claim 1, wherein the data in step (1) are generated by an Illumina sequencer.
4. The method of claim 1, wherein the software used to arrange the data in step (1) comprises R.
5. The method of claim 1, wherein the mutation types in step (2) comprise missense mutations, nonsense mutations, nonstop mutations (translation fails to terminate normally), splice-site mutations, transcription initiation site mutations, and insertion/deletion mutations.
6. The method of claim 1, wherein the samples in step (2) are from lung cancer, cervical cancer, breast cancer and ovarian cancer.
7. The method of claim 1, wherein the threshold for the significance Q value in step (5) is 0.05.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711496093.1A CN108090328B (en) | 2017-12-31 | 2017-12-31 | Cancer driver gene identification method based on machine learning and multiple statistical principles |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108090328A CN108090328A (en) | 2018-05-29 |
CN108090328B true CN108090328B (en) | 2020-04-10 |
Family
ID=62180262
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711496093.1A Active CN108090328B (en) | 2017-12-31 | 2017-12-31 | Cancer driver gene identification method based on machine learning and multiple statistical principles |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108090328B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110189795B (en) * | 2019-05-05 | 2023-06-23 | 西安电子科技大学 | Sub-space learning-based detection method for subgroup-specific driving genes |
CN111785325B (en) * | 2020-06-23 | 2021-10-22 | 西北工业大学 | Method for identifying heterogeneous cancer driver genes of mutually exclusive constraint graph Laplace |
CN112259163B (en) * | 2020-10-28 | 2022-04-22 | 广西师范大学 | Cancer driving module identification method based on biological network and subcellular localization data |
CN113517021B (en) * | 2021-06-09 | 2022-09-06 | 海南精准医疗科技有限公司 | Cancer driver gene prediction method |
CN117809741B (en) * | 2024-03-01 | 2024-07-12 | 浙江大学 | Method and device for predicting cancer characteristic genes based on molecular evolution selective pressure |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013188600A1 (en) * | 2012-06-12 | 2013-12-19 | Washington University | Copy number aberration driven endocrine response gene signature |
CN104059966A (en) * | 2014-05-20 | 2014-09-24 | 吴松 | STAG2 gene mutant sequence and detection method thereof as well as use of STAG2 gene mutation in detecting bladder cancer |
CN104732116A (en) * | 2015-03-13 | 2015-06-24 | 西安交通大学 | Method for screening cancer driver gene based on biological network |
CN106709278A (en) * | 2017-01-10 | 2017-05-24 | 河南省医药科学研究院 | Method for carrying out screening and functional analysis on driver genes of NSCLC (Non-Small Cell Lung Cancer) |
CN106980763A (en) * | 2017-03-30 | 2017-07-25 | 大连理工大学 | A kind of cancer based on gene mutation frequency drives the screening technique of gene |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170002319A1 (en) * | 2015-05-13 | 2017-01-05 | Whitehead Institute For Biomedical Research | Master Transcription Factors Identification and Use Thereof |
2017-12-31: Application CN201711496093.1A (CN), patent CN108090328B, status Active
Non-Patent Citations (3)
Title |
---|
DrGaP: A Powerful Tool for Identifying Driver Genes and Pathways in Cancer Sequencing Studies; Xing Hua et al.; The American Journal of Human Genetics; 2013-09-05; pp. 439-451 * |
Only three driver gene mutations are required for the development of lung and colorectal cancers; Cristian Tomasetti et al.; PNAS; 2015-01-06; Vol. 112 (No. 1); pp. 118-123 * |
Research on Statistical Methods Based on Cancer Gene Sequencing Data; Xing Hua; China Doctoral Dissertations Full-text Database, Basic Sciences; 2013-01-15 (No. 1); p. A006-18 * |
Also Published As
Publication number | Publication date |
---|---|
CN108090328A (en) | 2018-05-29 |
Similar Documents
Publication | Title |
---|---|
CN108090328B (en) | Cancer driver gene identification method based on machine learning and multiple statistical principles |
Salvadores et al. | Matching cell lines with cancer type and subtype of origin via mutational, epigenomic, and transcriptomic patterns |
CN106980763B (en) | Screening method of cancer driver gene based on gene mutation frequency |
CN109767810B (en) | High-throughput sequencing data analysis method and device |
CN108351916A (en) | Neoantigen is analyzed |
WO2022095280A1 (en) | Marker, detection method and detection system for homologous recombination deletion |
CN104885090A (en) | Systems and methods for tumor clonality analysis |
WO2015072438A1 (en) | Complementary pcr primer set for als-related gene sequence analysis, method for analyzing als-related gene sequence, and method for testing als |
Andreassen et al. | Will SNPs be useful predictors of normal tissue radiosensitivity in the future? |
CN111662981A (en) | Cancer gene detection kit based on second-generation sequencing probe capture method |
Zhao et al. | Optimization of cell lines as tumour models by integrating multi-omics data |
CN105483210A (en) | RNA (ribonucleic acid) editing locus detection method |
KR20150024232A (en) | Examination methods of the origin marker of resistance from drug resistance gene about disease |
Li et al. | Mining the coding and non-coding genome for cancer drivers |
JP2021101629A5 (en) | |
Asif et al. | Analysis of endometrial carcinoma TCGA reveals differences in DNA methylation in tumors from Black and White women |
CN110885886B (en) | Method for differential diagnosis of glioblastoma and typing of survival prognosis of glioma |
US20240068041A1 (en) | Free dna-based disease prediction model and construction method therefor and application thereof |
Zhu et al. | Identification of somatic copy number variations in plasma cell free DNA correlating with intrinsic resistances to EGFR targeted therapy in T790M negative non-small cell lung cancer |
Ponomarenko et al. | Mining DNA sequences to predict sites which mutations cause genetic diseases |
WO2019129200A1 (en) | C-site extraction method and apparatus |
CA3128379A1 (en) | Stratification of risk of virus associated cancers |
CN112397140A (en) | Target identification method and device based on allosteric mechanism and storage medium |
CN116042820B (en) | Colon cancer DNA methylation molecular markers and application thereof in preparation of early diagnosis kit for colon cancer |
CN115612743B (en) | HPV integration gene combination and application thereof in prediction of cervical cancer recurrence and metastasis |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |