CN108090328B - Cancer driver gene identification method based on machine learning and multiple statistical principles - Google Patents

Info

Publication number
CN108090328B
Authority
CN
China
Prior art keywords
gene
mutation
data
genes
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711496093.1A
Other languages
Chinese (zh)
Other versions
CN108090328A (en)
Inventor
刘鹏渊
韩毅
陆燕
周莉媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a cancer driver gene identification method based on machine learning and multiple statistical principles, comprising the following steps: (1) arranging the data into a standard format; (2) calculating the background variation rate; (3) statistically testing for cancer driver genes; (4) simulating the distribution of the test statistic by Monte Carlo simulation; (5) adjusting the P values. The method accounts for the background variation rate of each sample, gene and mutation type, as well as the effect of the different mutation types on protein function, and uses a score test to identify driver genes. It is highly robust and broadly applicable to many types of cancer; it achieves a better balance between sensitivity and specificity, detecting a larger number of driver genes while keeping the false-positive rate low. The invention is of great significance for finding potential targets for cancer treatment and for developing anti-cancer drugs.

Description

Cancer driver gene identification method based on machine learning and multiple statistical principles
Technical Field
The invention belongs to the cross field of bioinformatics and cancer medicine, and relates to a cancer driver gene identification method adopting machine learning and various statistical methods.
Background
Cancer is, for the most part, a disease caused by somatic mutations. Driver genes are directly responsible for the development of cancer, whereas passenger genes have no direct relationship with cancer, so it is necessary to identify the driver genes. Several tumor sequencing projects worldwide, such as The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET), have established comprehensive catalogs of somatic mutations in many types of cancer. One of the main objectives of these sequencing projects is to identify the driver genes responsible for cancer. Finding cancer driver genes not only improves our understanding of how tumors arise and develop, but can also provide potential therapeutic targets for some cancers.
Bioinformatics tools have been developed that use mathematical methods to identify cancer driver genes; they can be classified into three categories according to their rationale. The first category comprises mutation-frequency-based tools, which identify genes whose mutation frequency is higher than the background mutation rate as driver genes. Representative examples are MutSigCV (Lawrence M S, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes [J]. Nature, 2013, 499(7457): 214-218.) and patent application No. CN201310284338.X, "a method and kit for detecting non-small cell lung cancer driver gene mutation spectrum and applications". The second category comprises tools built on known pathways or interaction networks. Representative examples are DawnRank (Hou J P, Ma J. DawnRank: discovering personalized driver genes in cancer [J]. Genome Medicine, 2014, 6(7): 56.) and patent application No. CN201510111810.9, "a method for screening cancer driver genes based on biological networks". The third category comprises "hot spot" tools, where a hot spot is a location that has a significant effect on the three-dimensional conformation of a peptide chain or protein. A representative example is OncodriveCLUST (Tamborero D, Gonzalez-Perez A, Lopez-Bigas N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes [J]. Bioinformatics, 2013, 29(18): 2238.).
However, the above bioinformatics tools still have some disadvantages. First, these algorithms do not achieve a good balance between sensitivity and specificity: some have high sensitivity but too low specificity, while others have high specificity but low sensitivity. Second, these methods lack robustness across tumor types: for some tumor types they perform well and find many reliable driver genes, but for others they perform poorly.
Disclosure of Invention
The present invention aims to provide a highly robust algorithm for identifying cancer driver genes. The method is based on machine learning and various statistical methods, can show higher sensitivity and specificity on data of various cancers, greatly reduces false positive caused by the traditional method, and lays an important foundation for subsequent gene function research and targeted drug screening.
The technical scheme provided by the invention is as follows: a cancer driver gene identification method based on machine learning and multiple statistical principles is realized by the following steps:
(1) arranging the data into a standard format: the input data must be in the Mutation Annotation Format (MAF) commonly used by The Cancer Genome Atlas (TCGA), or a file containing seven key fields: column 1 is the gene name, column 5 is the chromosome on which the gene lies, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the number of the sample carrying the mutation; once arranged in this format, the input data can be used in the subsequent steps;
(2) calculating the background variation rate: calculating the background variation rate of gene mutation by an empirical Bayes method; synonymous mutations and non-synonymous mutations obey the following distribution, respectively:
Figure BDA0001536434630000031
Figure BDA0001536434630000032
p represents the sample number in the input data, g represents the gene name, and t represents the mutation type;
Figure BDA0001536434630000033
represents the number of possible t-type synonymous mutations in gene g,
Figure BDA0001536434630000034
represents the number of possible occurrences of t-type non-synonymous mutations on the g gene,
Figure BDA0001536434630000035
and
Figure BDA0001536434630000036
represent the actual numbers of t-type synonymous and non-synonymous mutations, respectively, observed in gene g of sample p; $\beta_{pgt}$ is the background variation rate of t-type mutations in gene g of sample p; $\theta_{pgt}$ is the total variation rate of t-type mutations in gene g of sample p, with $\theta_{pgt} = \beta_{pgt} + \alpha_{gt}$, where $\alpha_{gt}$ is the incidence of t-type mutations in gene g; the background variation rate is calculated according to the following formula:
Figure BDA0001536434630000037
wherein,
Figure BDA0001536434630000038
Figure BDA0001536434630000039
p is the total number of samples;
(3) statistical testing for cancer driver genes: a hypothesis test is performed on each gene to judge whether it is a driver gene; for the gene g under test, the null hypothesis is $H_0: \alpha_{g1} = \dots = \alpha_{gT} = 0$, where T is the total number of mutation types, and the alternative hypothesis is
$H_1: \exists\, t \in \{1, \dots, T\},\ \alpha_{gt} > 0$
The test statistic
Figure BDA00015364346300000311
is a score test statistic:
Figure BDA00015364346300000312
wherein,
Figure BDA00015364346300000313
$\omega_t$ is the weight parameter measuring the functional consequence of t-type mutations; for genes in the training data, the weight parameter is calculated, following the machine-learning approach, as follows:
Figure BDA0001536434630000041
for genes that are not in the training data, the weights are the mean of the weights of the training genes.
(4) Monte Carlo simulation of the distribution of the statistic: for the score test in step (3), the theoretical distribution of the statistic is the standard normal distribution when the sample size is large enough; however, because the mutation frequency of a gene is very low, very few samples carry a mutation in any given gene, and the actual distribution of the statistic does not follow the standard normal distribution, so the distribution of the statistic must be simulated artificially in order to calculate its probability; because
Figure BDA0001536434630000042
follows a Poisson distribution, mutation data can be generated artificially according to the following distribution:
Figure BDA0001536434630000043
wherein
Figure BDA0001536434630000044
is the simulated data,
Figure BDA0001536434630000045
is the estimate of $\beta_{pgt}$; the simulated data are substituted into the test statistic formula to obtain the simulated distribution, the real data are substituted into the statistic to calculate the observed value, and the statistical significance P value is obtained from the simulated distribution;
(5) P value adjustment: the significance P value of each gene is adjusted according to the Benjamini-Hochberg method, i.e., $P_{adj} = P \times G / r$, where $P_{adj}$ is the adjusted P value, $P$ is the original P value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene when all genes are sorted by P value in ascending order; a gene is judged to be a driver gene if its adjusted P value is below the threshold (usually 0.05).
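The Monte Carlo procedure of step (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the Poisson means `null_means` stand in for the expected mutation counts under the estimated background variation rate, the statistic is passed in as a function because the score statistic itself is given only as a formula image, the placeholder statistic `excess_count` is hypothetical, and the one-sided upper-tail comparison and the add-one correction are choices made for this sketch.

```python
import math
import random
from typing import Callable, Sequence


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson variate with Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def monte_carlo_p_value(
    observed_counts: Sequence[int],
    null_means: Sequence[float],
    statistic: Callable[[Sequence[int], Sequence[float]], float],
    n_sim: int = 10_000,
    seed: int = 0,
) -> float:
    """Empirical upper-tail P value of `statistic` under Poisson null counts."""
    rng = random.Random(seed)
    observed = statistic(observed_counts, null_means)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        # Simulate one data set from the null (background-only) model.
        simulated = [sample_poisson(mean, rng) for mean in null_means]
        if statistic(simulated, null_means) >= observed:
            at_least_as_extreme += 1
    # Add-one correction (an assumption of this sketch) avoids a P value of 0.
    return (at_least_as_extreme + 1) / (n_sim + 1)


def excess_count(counts: Sequence[int], means: Sequence[float]) -> float:
    """Hypothetical placeholder statistic: observed minus expected total count."""
    return sum(counts) - sum(means)
```

Passing the real score statistic in place of `excess_count` would give the simulated distribution described in step (4); the number of simulations `n_sim` bounds the smallest P value that can be resolved.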
Further, the data in step (1) may be obtained from sources including, but not limited to, The Cancer Genome Atlas (TCGA).
Further, the data in step (1) can be generated by a platform including, but not limited to, an Illumina sequencer.
Further, the data sorting tool software in the step (1) includes, but is not limited to, R software.
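The column layout required in step (1) can be illustrated with a short Python sketch. It is only an illustration under assumptions: the input is taken to be tab-delimited with one header row and optional comment lines starting with `#`, and the field names in `KEY_COLUMNS` are labels chosen here for readability, not names defined by the invention.

```python
import csv
from typing import Dict, Iterable, List

# 1-based column positions named in step (1); the field names are illustrative.
KEY_COLUMNS: Dict[int, str] = {
    1: "gene",
    5: "chromosome",
    6: "start_position",
    9: "mutation_class",
    11: "reference_sequence",
    13: "mutant_sequence",
    16: "sample_id",
}


def extract_key_columns(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one record per mutation, keeping only the seven key columns."""
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row (assumed to be present)
    records = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) < max(KEY_COLUMNS):
            raise ValueError(
                f"expected at least {max(KEY_COLUMNS)} columns, got {len(row)}"
            )
        records.append({name: row[pos - 1] for pos, name in KEY_COLUMNS.items()})
    return records
```

The function accepts any iterable of lines, so an open file handle can be passed directly; rows with fewer than 16 columns are rejected rather than silently padded.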
Further, the mutation types in step (2) include missense mutation, nonsense mutation, nonstop mutation (translation fails to terminate normally), splice site mutation, transcription initiation site mutation and insertion/deletion mutation.
Further, the sources of the samples in step (2) include, but are not limited to, lung cancer, cervical cancer, breast cancer and ovarian cancer.
Further, the threshold value of the significance P value in the step (5) is taken to be 0.05.
The cancer driver gene identification method provided by the invention adopts machine learning and various statistical methods, considers the influence of various mutation types on the protein function, has high robustness, and is widely applicable to various cancers; and a better balance between sensitivity and specificity is achieved, a larger number of driving genes can be detected, and lower false positive can be maintained. The invention has important significance for searching potential sites for cancer treatment and developing anti-cancer drugs.
Drawings
FIG. 1 is a schematic diagram of the implementation of the cancer driver gene identification method based on machine learning and various statistical principles of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following drawings and specific examples, but the present invention is not limited thereto.
1. Experimental materials:
Experimental sample data: lung squamous cell carcinoma mutation data downloaded from the TCGA database (http://tcga-data.nci.nih.gov/docs/publications/lusc_2012/);
Operating system: Linux;
Software: R and Perl, both downloaded from their official websites.
2. Experimental procedure, as shown in figure 1:
(1) The data are arranged as follows: column 1 is the gene name, column 5 is the chromosome on which the gene lies, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the number of the sample carrying the mutation.
(2) Calculating the background variation rate: calculated according to the following formula,
Figure BDA0001536434630000061
wherein,
Figure BDA0001536434630000062
Figure BDA0001536434630000063
(3) Statistical testing for cancer driver genes: for gene g, the null hypothesis is $H_0: \alpha_{g1} = \dots = \alpha_{gT} = 0$, i.e., the mutation rate corresponding to every mutation type equals 0; the alternative hypothesis is
$H_1: \exists\, t \in \{1, \dots, T\},\ \alpha_{gt} > 0$
i.e., there is a mutation type t whose corresponding mutation rate is greater than 0. The score test statistic is calculated as follows:
Figure BDA0001536434630000064
where $\theta_{pgt} = \beta_{pgt} + \alpha_{gt}$; for genes in the training data, the weight parameter is calculated, following the machine-learning approach, as follows:
Figure BDA0001536434630000065
For genes that are not in the training data, the weight is the mean of the weights of the training genes. The training data come from the IntOGen database (https://www.intogen.org/search).
(4) Monte Carlo simulation of the distribution of the statistic: the simulated data are generated according to a Poisson distribution, namely:
Figure BDA0001536434630000066
The simulated data are substituted into the formula in step (3) to obtain the simulated distribution. The significance P value of the statistic calculated from the real data in step (3) can then be obtained from this distribution.
(5) P value adjustment. All genes are sorted in ascending order of their calculated P values and each gene is assigned its rank; the adjusted P value is then $P_{adj} = P \times G / r$, where $P$ is the original P value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene in this ascending ordering.
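The adjustment in step (5) can be sketched in Python as follows. The sketch applies the formula exactly as stated in the text, P × G / r with r the ascending rank; the cap at 1.0, the function names and the gene labels are assumptions made for illustration. Common Benjamini-Hochberg implementations additionally enforce that adjusted values never decrease with rank, which the stated formula does not require and which is omitted here.

```python
from typing import Dict, List


def adjust_p_values(p_values: Dict[str, float]) -> Dict[str, float]:
    """Adjust each P value as P * G / r, where r is the ascending rank."""
    total_genes = len(p_values)
    ranked = sorted(p_values.items(), key=lambda item: item[1])  # smallest first
    adjusted = {}
    for rank, (gene, p_value) in enumerate(ranked, start=1):
        # Cap at 1.0 (an assumption of this sketch) so the result stays a probability.
        adjusted[gene] = min(1.0, p_value * total_genes / rank)
    return adjusted


def call_drivers(p_values: Dict[str, float], threshold: float = 0.05) -> List[str]:
    """Return the genes whose adjusted P value is below the threshold."""
    adjusted = adjust_p_values(p_values)
    return [gene for gene, value in adjusted.items() if value < threshold]


# Hypothetical P values for four placeholder genes.
example = {"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.04, "GENE_D": 0.5}
print(call_drivers(example))  # ['GENE_A', 'GENE_B']
```

In this example GENE_C has a raw P value below 0.05 but an adjusted value of about 0.053, so it is not called a driver gene.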
3. The experimental results are as follows:
This example identifies 14 driver genes: RB1, KEAP1, TP53, ARID1A, ZNF268, PTEN, MLL2, NFE2L2, MACF1, NPAT, CAPN8, TTN, CDKN2A and MUC16. Seven of them (RB1, KEAP1, TP53, ARID1A, PTEN, NFE2L2 and CDKN2A) also appear in the Cancer Gene Census (CGC) database (http://cancer.sanger.ac.uk/census), indicating that the results of the invention agree well with current research. Table 1 lists three other algorithms for identifying driver genes, including MDPFinder (Zhao J, Zhang S, Wu L Y, et al. Efficient methods for identifying mutated driver pathways in cancer [J]. Bioinformatics, 2012, 28(22): 2940-). Run on the same data used in this example, the three algorithms found 0, 14 and 3 driver genes, with CGC coincidence rates of 0%, 21% and 0%, respectively. This shows that the invention outperforms the three algorithms when the number of driver genes found (sensitivity) and the CGC coincidence rate (specificity) are considered together. In addition, biochemical experiments verified that the NPAT gene newly identified by the invention is indeed related to the abnormal proliferation of lung squamous cell carcinoma, which further supports the reliability of the genes identified by the method.
Table 1: Performance comparison of the present invention with the three other algorithms
Figure BDA0001536434630000081
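The CGC coincidence rate used in the comparison above can be reproduced from the gene lists reported in this example. The following Python sketch is illustrative only: the function name is an assumption, and the reference list contains just the seven overlapping genes named in the text rather than the full Cancer Gene Census.

```python
from typing import Iterable


def coincidence_rate(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Fraction of predicted genes that also appear in the reference set."""
    predicted_set = set(predicted)
    if not predicted_set:
        return 0.0  # no predictions: report a rate of 0
    return len(predicted_set & set(reference)) / len(predicted_set)


# The 14 driver genes identified in this example.
identified = [
    "RB1", "KEAP1", "TP53", "ARID1A", "ZNF268", "PTEN", "MLL2",
    "NFE2L2", "MACF1", "NPAT", "CAPN8", "TTN", "CDKN2A", "MUC16",
]
# The seven identified genes reported as present in the CGC database.
reported_cgc_hits = ["RB1", "KEAP1", "TP53", "ARID1A", "PTEN", "NFE2L2", "CDKN2A"]

print(f"{coincidence_rate(identified, reported_cgc_hits):.0%}")  # 50%
```

Seven of the 14 identified genes are in the CGC database, which gives a coincidence rate of 50%, compared with the 0%, 21% and 0% reported for the other three algorithms.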

Claims (7)

1. A method for identifying a cancer driver gene based on machine learning and multiple statistical principles, comprising the steps of:
(1) arranging the data into a standard format: the input data must be in the mutation annotation format commonly used by The Cancer Genome Atlas, or a file containing seven key fields: column 1 is the gene name, column 5 is the chromosome on which the gene lies, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the number of the sample carrying the mutation;
(2) calculating the background variation rate: calculating the background variation rate of gene mutation by an empirical Bayes method; synonymous mutations and non-synonymous mutations obey the following distribution, respectively:
Figure FDA0002320521900000011
Figure FDA0002320521900000012
p represents the sample number in the input data, g represents the gene name, and t represents the mutation type;
Figure FDA0002320521900000013
represents the number of possible t-type synonymous mutations on the g gene,
Figure FDA0002320521900000014
represents the number of possible occurrences of t-type non-synonymous mutations on the g gene,
Figure FDA0002320521900000015
and
Figure FDA0002320521900000016
represent the actual numbers of t-type synonymous and non-synonymous mutations, respectively, observed in gene g of sample p; $\beta_{pgt}$ is the background variation rate of t-type mutations in gene g of sample p; $\theta_{pgt}$ is the total variation rate of t-type mutations in gene g of sample p, with $\theta_{pgt} = \beta_{pgt} + \alpha_{gt}$, where $\alpha_{gt}$ is the incidence of t-type mutations in gene g; the background variation rate is calculated according to the following formula:
Figure FDA0002320521900000017
wherein,
Figure FDA0002320521900000018
Figure FDA0002320521900000021
p is the total number of samples;
(3) statistical testing for cancer driver genes: a hypothesis test is performed on each gene to judge whether it is a driver gene; for the gene g under test, the null hypothesis is $H_0: \alpha_{g1} = \dots = \alpha_{gT} = 0$, where T is the total number of mutation types, and the alternative hypothesis is
$H_1: \exists\, t \in \{1, \dots, T\},\ \alpha_{gt} > 0$
The test statistic
Figure FDA0002320521900000023
is a score test statistic:
Figure FDA0002320521900000024
wherein,
Figure FDA0002320521900000025
$\omega_t$ is the weight parameter measuring the functional consequence of t-type mutations; for genes in the training data, the weight parameter is calculated, following the machine-learning approach, as follows:
Figure FDA0002320521900000026
for genes not in the training data, the weight is the mean of the weights of the training genes;
(4) Monte Carlo simulation of the distribution of the statistic: for the score test in step (3), the theoretical distribution of the statistic is the standard normal distribution when the sample size is large enough; because the mutation frequency of a gene is very low, very few samples carry a mutation in any given gene, and the actual distribution of the statistic does not follow the standard normal distribution, so the distribution of the statistic must be simulated artificially in order to calculate its probability; because
Figure FDA0002320521900000027
follows a Poisson distribution, mutation data are generated artificially according to the following distribution:
Figure FDA0002320521900000028
wherein
Figure FDA0002320521900000029
is the simulated data,
Figure FDA00023205219000000210
is the estimate of $\beta_{pgt}$; the simulated data are substituted into the test statistic formula to obtain the simulated distribution, the real data are substituted into the statistic to calculate the observed value, and the statistical significance Q value is obtained from the simulated distribution;
(5) adjusting the Q value: the significance Q value of each gene is adjusted according to the Benjamini-Hochberg method, i.e., $Q_{adj} = Q \times G / r$, where $Q_{adj}$ is the adjusted Q value, $Q$ is the original Q value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene when all genes are sorted by Q value in ascending order; genes whose significance Q value falls within the threshold range are driver genes.
2. The method of claim 1, wherein the method comprises the steps of: the data in step (1) are obtained from The Cancer Genome Atlas (TCGA).
3. The method of claim 1, wherein the method comprises the steps of: the data in step (1) are generated by an Illumina sequencer.
4. The method of claim 1, wherein the method comprises the steps of: the data sorting tool software in the step (1) comprises R software.
5. The method of claim 1, wherein the method comprises the steps of: the mutation types in step (2) comprise missense mutation, nonsense mutation, nonstop mutation (translation fails to terminate normally), splice site mutation, transcription initiation site mutation and insertion/deletion mutation.
6. The method of claim 1, wherein the method comprises the steps of: the sample in the step (2) is from lung cancer, cervical cancer, breast cancer and ovarian cancer.
7. The method of claim 1, wherein the method comprises the steps of: the threshold value of the significance Q value in the step (5) is 0.05.
CN201711496093.1A 2017-12-31 2017-12-31 Cancer driver gene identification method based on machine learning and multiple statistical principles Active CN108090328B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711496093.1A CN108090328B (en) 2017-12-31 2017-12-31 Cancer driver gene identification method based on machine learning and multiple statistical principles

Publications (2)

Publication Number Publication Date
CN108090328A CN108090328A (en) 2018-05-29
CN108090328B CN108090328B (en) 2020-04-10

Family

ID=62180262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711496093.1A Active CN108090328B (en) 2017-12-31 2017-12-31 Cancer driver gene identification method based on machine learning and multiple statistical principles

Country Status (1)

Country Link
CN (1) CN108090328B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189795B (en) * 2019-05-05 2023-06-23 西安电子科技大学 Sub-space learning-based detection method for subgroup-specific driving genes
CN111785325B (en) * 2020-06-23 2021-10-22 西北工业大学 Method for identifying heterogeneous cancer driver genes of mutually exclusive constraint graph Laplace
CN112259163B (en) * 2020-10-28 2022-04-22 广西师范大学 Cancer driving module identification method based on biological network and subcellular localization data
CN113517021B (en) * 2021-06-09 2022-09-06 海南精准医疗科技有限公司 Cancer driver gene prediction method
CN117809741B (en) * 2024-03-01 2024-07-12 浙江大学 Method and device for predicting cancer characteristic genes based on molecular evolution selective pressure

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013188600A1 (en) * 2012-06-12 2013-12-19 Washington University Copy number aberration driven endocrine response gene signature
CN104059966A (en) * 2014-05-20 2014-09-24 吴松 STAG2 gene mutant sequence and detection method thereof as well as use of STAG2 gene mutation in detecting bladder cancer
CN104732116A (en) * 2015-03-13 2015-06-24 西安交通大学 Method for screening cancer driver gene based on biological network
CN106709278A (en) * 2017-01-10 2017-05-24 河南省医药科学研究院 Method for carrying out screening and functional analysis on driver genes of NSCLC (Non-Small Cell Lung Cancer)
CN106980763A (en) * 2017-03-30 2017-07-25 大连理工大学 A kind of cancer based on gene mutation frequency drives the screening technique of gene

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170002319A1 (en) * 2015-05-13 2017-01-05 Whitehead Institute For Biomedical Research Master Transcription Factors Identification and Use Thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DrGaP: A Powerful Tool for Identifying Driver Genes and Pathways in Cancer Sequencing Studies; Xing Hua et al.; The American Journal of Human Genetics; 2013-09-05; pp. 439-451 *
Only three driver gene mutations are required for the development of lung and colorectal cancers; Cristian Tomasetti et al.; PNAS; 2015-01-06; Vol. 112 (No. 1); pp. 118-123 *
Research on statistical methods based on cancer gene sequencing data (基于癌症基因测序数据的统计方法研究); Hua Xing (花兴); China Doctoral Dissertations Full-text Database, Basic Sciences; 2013-01-15 (No. 1); pp. A006-18 *

Also Published As

Publication number Publication date
CN108090328A (en) 2018-05-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant