CN108090328B

CN108090328B - Cancer driver gene identification method based on machine learning and multiple statistical principles

Info

Publication number: CN108090328B
Application number: CN201711496093.1A
Authority: CN
Inventors: 刘鹏渊; 韩毅; 陆燕; 周莉媛
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2017-12-31
Filing date: 2017-12-31
Publication date: 2020-04-10
Anticipated expiration: 2037-12-31
Also published as: CN108090328A

Abstract

The invention discloses a cancer driver gene identification method based on machine learning and multiple statistical principles, comprising the following steps: (1) arranging the data into a standard format; (2) calculating the background mutation rate; (3) statistically testing for cancer driver genes; (4) simulating the distribution of the test statistic by Monte Carlo; and (5) adjusting the P values. The method accounts for the background mutation rate of each sample, gene and mutation type, as well as the effect of each mutation type on protein function, and uses a score test to call driver genes. It is highly robust and broadly applicable to many types of cancer, achieves a better balance between sensitivity and specificity, and can detect more driver genes while keeping the false-positive rate low. The invention is of significance for finding potential therapeutic targets in cancer and for developing anti-cancer drugs.

Description

Cancer driver gene identification method based on machine learning and multiple statistical principles

Technical Field

The invention belongs to the cross field of bioinformatics and cancer medicine, and relates to a cancer driver gene identification method adopting machine learning and various statistical methods.

Background

Cancer is, for the most part, a disease caused by somatic mutations. Driver genes are directly responsible for the development of cancer, whereas passenger genes have no direct relationship with it, so identifying driver genes is essential. Several tumor sequencing projects worldwide, such as The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET), have established comprehensive catalogs of somatic mutations in many types of cancer. One of the main objectives of these sequencing projects is to identify the driver genes responsible for cancer. Finding cancer driver genes not only improves the understanding of how tumors arise and progress, but also provides potential therapeutic targets for some cancers.

Bioinformatics tools that use mathematical methods to identify cancer driver genes have been developed; they fall into three categories according to their rationale. The first category comprises mutation-frequency-based tools, which identify genes whose mutation frequency is higher than the background mutation rate as driver genes. Representatives of this category are MutSigCV (Lawrence M S, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes [J]. Nature, 2013, 499(7457):214-218.) and patent application No. CN201310284338.X, "a method and kit for detecting non-small cell lung cancer driver gene mutation spectrum and applications". The second category comprises tools built on known pathways or interaction networks. Representatives of this category are DawnRank (Hou J P, Ma J. DawnRank: discovering personalized driver genes in cancer [J]. Genome Medicine, 2014, 6(7):56.) and patent application No. CN201510111810.9, "a method for screening cancer driver genes based on biological networks". The third category comprises "hot spot" tools, where a hot spot is a position that has a significant effect on the three-dimensional conformation of a peptide chain or protein. A representative of this category is OncodriveCLUST (Tamborero D, Gonzalez-Perez A, Lopez-Bigas N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes [J]. Bioinformatics, 2013, 29(18):2238.).

However, the above bioinformatics tools still have shortcomings. First, these algorithms do not achieve a good balance between sensitivity and specificity: some have high sensitivity but very low specificity, while others have high specificity but low sensitivity. Second, these methods lack robustness across tumor types: for some tumor types they perform well and find many reliable driver genes, but for others they perform poorly.

Disclosure of Invention

The present invention aims to provide a highly robust algorithm for identifying cancer driver genes. The method is based on machine learning and multiple statistical methods, shows higher sensitivity and specificity on data from many types of cancer, greatly reduces the false positives produced by traditional methods, and lays an important foundation for subsequent gene function research and targeted drug screening.

The technical scheme provided by the invention is as follows: a cancer driver gene identification method based on machine learning and multiple statistical principles is realized by the following steps:

(1) Arranging the data into a standard format: the input data must be in the Mutation Annotation Format (MAF) commonly used by The Cancer Genome Atlas (TCGA), or in a file containing seven key fields: column 1 is the gene name, column 5 is the chromosome on which the gene is located, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the identifier of the sample carrying the mutation; once arranged into this format, the input data can be used in the subsequent steps;
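As an illustration of the seven-field layout described in step (1), the following minimal Python sketch reads a tab-separated file and keeps only the columns named there. The field labels in `KEY_COLUMNS`, the function name `read_key_fields` and the handling of `#` comment lines are assumptions made for this sketch; the text itself names R as one possible tool and does not prescribe an implementation.

```python
import csv
import io

# 1-based column positions listed in step (1); the labels are hypothetical names.
KEY_COLUMNS = {
    "gene": 1,
    "chromosome": 5,
    "start_position": 6,
    "mutation_class": 9,
    "reference_seq": 11,
    "mutated_seq": 13,
    "sample_id": 16,
}


def read_key_fields(handle, has_header=True):
    """Yield one dict per mutation record holding only the seven key fields."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header_pending = has_header
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if header_pending:
            header_pending = False  # first real row is the column-name row
            continue
        if len(row) < 16:
            raise ValueError(f"expected at least 16 columns, got {len(row)}")
        yield {name: row[pos - 1] for name, pos in KEY_COLUMNS.items()}


# Usage with one synthetic record (all values are made up for the demo).
cells = [""] * 16
cells[0], cells[4], cells[5] = "TP53", "17", "7577120"
cells[8], cells[10], cells[12], cells[15] = "Missense_Mutation", "C", "T", "SAMPLE-01"
demo = "\t".join(f"col{i}" for i in range(1, 17)) + "\n" + "\t".join(cells) + "\n"
records = list(read_key_fields(io.StringIO(demo)))
print(records[0]["gene"], records[0]["mutation_class"])
```

Raising an error on short rows, rather than silently padding them, is a design choice of the sketch so that malformed input is noticed before the statistical steps.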

(2) Calculating the background mutation rate: the background mutation rate of gene mutations is calculated by an empirical Bayes method; synonymous mutations and non-synonymous mutations follow the distributions below, respectively:

[formula not available]

where p denotes the sample index in the input data, g denotes the gene, and t denotes the mutation type. The quantities in the distributions are the number of possible type-t synonymous mutations in gene g, the number of possible type-t non-synonymous mutations in gene g, and the numbers of type-t synonymous and non-synonymous mutations actually observed in sample p, gene g. β_pgt denotes the background mutation rate of type-t mutations in sample p, gene g; θ_pgt denotes the total mutation rate of type-t mutations in sample p, gene g, with θ_pgt = β_pgt + α_gt, where α_gt denotes the incidence of type-t mutations in gene g. The background mutation rate is calculated according to the following formula:

[formula not available]

where P is the total number of samples;
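The formulas of step (2) are not legible in this text. As a reading aid only, one model consistent with the definitions above and with the Poisson assumption stated in step (4) can be written as follows. The symbol names S_gt, N_gt, s_pgt and n_pgt are placeholders chosen for this sketch and are not taken from the patent; the empirical Bayes estimator of β_pgt is not reconstructed here.

```latex
% Assumed notation (hypothetical symbol names, not the patent's own):
%   S_{gt}, N_{gt}   : numbers of possible type-t synonymous / non-synonymous
%                      mutations in gene g
%   s_{pgt}, n_{pgt} : observed type-t synonymous / non-synonymous counts
%                      in sample p, gene g
\begin{align}
  s_{pgt} &\sim \mathrm{Poisson}\!\left(S_{gt}\,\beta_{pgt}\right), \\
  n_{pgt} &\sim \mathrm{Poisson}\!\left(N_{gt}\,\theta_{pgt}\right),
  \qquad \theta_{pgt} = \beta_{pgt} + \alpha_{gt}.
\end{align}
```

Under this reading, the synonymous counts depend on β_pgt alone, which would explain why they can serve to estimate the background rate, while any excess of non-synonymous mutations is absorbed by α_gt.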

(3) Statistical testing for cancer driver genes: a hypothesis test is performed on each gene to judge whether it is a driver gene; for the gene g under test, the null hypothesis is H₀: α_g1 = … = α_gT = 0, where T is the total number of mutation types, and the alternative hypothesis is H₁: α_gt > 0 for at least one mutation type t.

The test statistic is a score test statistic:

[formula not available]

where ω_t is the weight parameter measuring the functional consequence of type-t mutations. For genes in the training data, the weights are calculated following the machine-learning approach by the formula:

[formula not available]

For genes that are not in the training data, the weights are the mean of the weights of the training genes.
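Because the score statistic itself is not legible in this text, the sketch below shows only the textbook score test for H₀: α = 0 under an assumed Poisson model, observed[p] ~ Poisson(N × (β_p + α)), for a single mutation type. The function names, the argument `n_possible` and the weighted sum in `weighted_statistic` are assumptions made for illustration and may differ from the patent's formula.

```python
import math


def poisson_score_z(observed, background_rates, n_possible):
    """Standardised score for H0: alpha = 0 when
    observed[p] ~ Poisson(n_possible * (background_rates[p] + alpha))."""
    # Score: derivative of the log-likelihood with respect to alpha at alpha = 0.
    score = sum(n / b - n_possible for n, b in zip(observed, background_rates))
    # Fisher information for alpha, evaluated at alpha = 0.
    info = sum(n_possible / b for b in background_rates)
    return score / math.sqrt(info)


def weighted_statistic(z_by_type, weights):
    """Assumed combination: weighted sum of the per-type scores."""
    return sum(weights[t] * z for t, z in z_by_type.items())


# Toy numbers: two samples, background rate 0.5, two possible sites.
z_null = poisson_score_z([1, 1], [0.5, 0.5], 2)    # counts equal expectation
z_excess = poisson_score_z([3, 1], [0.5, 0.5], 2)  # one sample has excess mutations
print(z_null, round(z_excess, 4))
```

A positive value indicates more mutations than the background rate predicts; with counts exactly at their expectation the score is zero.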

(4) Simulating the distribution of the statistic by Monte Carlo: for the score test in step (3), the theoretical distribution of the statistic is the standard normal distribution when the sample size is sufficiently large; however, because gene mutation frequencies are very low, only a few samples carry a mutation in any given gene, and the actual distribution of the statistic does not follow the standard normal distribution. The distribution of the statistic therefore has to be simulated in order to compute its significance. Because the mutation counts follow a Poisson distribution, mutation data can be generated artificially according to the following distribution:

[formula not available]

where the simulated data are generated using the estimated value of β_pgt. The simulated data are substituted into the test statistic formula to obtain the simulated distribution, the real data are substituted into the statistic to calculate the real statistic value, and the statistical significance P value is obtained from the simulated distribution;
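The simulation in step (4) can be sketched as follows, assuming that each simulated count is drawn from a Poisson distribution whose mean is the estimated background expectation. The callable `statistic`, the list `null_means`, the Knuth sampler and the add-one correction are choices made for this sketch and are not specified in the text.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means
    typical of per-gene mutation counts."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def monte_carlo_p_value(statistic, observed, null_means, n_sim=1000, seed=0):
    """One-sided empirical P value of statistic(observed) under the null."""
    rng = random.Random(seed)
    real_value = statistic(observed)
    exceed = 0
    for _ in range(n_sim):
        simulated = [sample_poisson(m, rng) for m in null_means]
        if statistic(simulated) >= real_value:
            exceed += 1
    # Add-one correction (a common convention) keeps the estimate above zero.
    return (exceed + 1) / (n_sim + 1)


# Toy usage: 50 samples, expected background count 0.01 per sample,
# and the plain sum of counts standing in for the test statistic.
null_means = [0.01] * 50
p_small = monte_carlo_p_value(sum, [1] * 50, null_means)
p_one = monte_carlo_p_value(sum, [0] * 50, null_means)
print(p_small, p_one)
```

The number of simulations bounds the smallest attainable P value at 1 / (n_sim + 1), so `n_sim` has to be chosen with the multiple-testing threshold of step (5) in mind.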

(5) Adjusting the P values: the significance P value of each gene is adjusted by the Benjamini-Hochberg method, i.e. P_adjvalue = pvalue × G / r, where P_adjvalue is the adjusted P value, pvalue is the original P value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene when all genes are sorted by P value in ascending order; a gene is identified as a driver gene if its adjusted P value does not exceed the threshold (usually 0.05).
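The adjustment in step (5) can be illustrated with a short Python sketch. It applies pvalue × G / r with r taken as the ascending rank and, in addition, takes a running minimum from the largest P value downwards and caps the result at 1; these last two operations belong to the usual Benjamini-Hochberg procedure but are not spelled out in the text, so they are an assumption here. The function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    g = len(p_values)
    order = sorted(range(g), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * g
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest P value to the smallest so that each adjusted
    # value never exceeds the one ranked after it.
    for rank in range(g, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * g / rank)
        adjusted[i] = running_min
    return adjusted


# Toy usage with four made-up P values and the 0.05 threshold.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
is_driver = [a <= 0.05 for a in adj]
print([round(a, 4) for a in adj], is_driver)
```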

Further, the data in step (1) may be obtained from sources including, but not limited to, The Cancer Genome Atlas (TCGA).

Further, the data in step (1) can be generated by a platform including, but not limited to, an Illumina sequencer.

Further, the data sorting tool software in the step (1) includes, but is not limited to, R software.

Further, the mutation types in step (2) include missense mutations, nonsense mutations, nonstop mutations (translation fails to terminate normally), splice site mutations, transcription initiation site mutations and insertion/deletion mutations.

Further, the sources of the samples in step (2) include, but are not limited to, lung cancer, cervical cancer, breast cancer and ovarian cancer.

Further, the threshold value of the significance P value in the step (5) is taken to be 0.05.

The cancer driver gene identification method provided by the invention uses machine learning and multiple statistical methods, accounts for the effect of each mutation type on protein function, is highly robust, and is broadly applicable to many types of cancer. It achieves a better balance between sensitivity and specificity, and can detect more driver genes while keeping the false-positive rate low. The invention is of significance for finding potential therapeutic targets in cancer and for developing anti-cancer drugs.

Drawings

FIG. 1 is a schematic diagram of the implementation of the cancer driver gene identification method based on machine learning and various statistical principles of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following drawings and specific examples, but the present invention is not limited thereto.

1. Experimental materials:

Experimental sample data: lung squamous cell carcinoma mutation data downloaded from the TCGA database (http://tcga-data.nci.nih.gov/docs/publications/lusc_2012/);

Operating system: Linux;

Software: R and Perl, both downloaded from their official websites.

2. Experimental procedure, as shown in figure 1:

(1) The data are arranged as follows: column 1 is the gene name, column 5 is the chromosome on which the gene is located, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the identifier of the sample carrying the mutation.

(2) Calculating the background mutation rate according to the following formula:

[formula not available]

(3) Statistical testing for cancer driver genes: for gene g, the null hypothesis is H₀: α_g1 = … = α_gT = 0, i.e. the mutation rate corresponding to every mutation type equals 0; the alternative hypothesis is H₁: α_gt > 0 for at least one t, i.e. there exists a mutation type t whose corresponding mutation rate is greater than 0. The score test statistic is calculated as follows:

[formula not available]

where θ_pgt = β_pgt + α_gt. For genes in the training data, the weights are calculated following the machine-learning approach as follows:

[formula not available]

For genes that are not in the training data, the weights are the mean of the weights of the training genes. The training data come from the IntOGen database (https://www.intogen.org/search).

(4) Simulating the distribution of the statistic by Monte Carlo: the simulated data are generated from a Poisson distribution, namely:

[formula not available]

Substituting the simulated data into the formula in step (3) gives the simulated distribution, from which the significance P value of the statistic calculated from the real data in step (3) is obtained.

(5) Adjusting the P values: all genes are sorted in ascending order of their calculated P values and each gene receives its rank; the adjusted P value is then P_adjvalue = pvalue × G / r, where pvalue is the original P value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene in that ascending order.

3. The experimental results are as follows:

This example identified 14 driver genes: RB1, KEAP1, TP53, ARID1A, ZNF268, PTEN, MLL2, NFE2L2, MACF1, NPAT, CAPN8, TTN, CDKN2A and MUC16. Seven of them (RB1, KEAP1, TP53, ARID1A, PTEN, NFE2L2 and CDKN2A), i.e. 50%, are also listed in the Cancer Gene Census (CGC) database (http://cancer.sanger.ac.uk/census), indicating that the results of the invention agree well with current research. Table 1 lists three other algorithms for identifying driver genes, among them MDPFinder (Zhao J, Zhang S, Wu L Y, et al. Efficient methods for identifying mutated driver pathways in cancer [J]. Bioinformatics, 2012, 28(22):2940-[…]). Run on the same data as this example, the three algorithms found 0, 14 and 3 driver genes, respectively, with CGC overlap rates of 0%, 21% and 0%, respectively. This shows that the invention outperforms the three algorithms when the number of driver genes found (sensitivity) and the CGC overlap rate (specificity) are considered together. In addition, biochemical experiments verified that NPAT, a gene newly identified by the invention, is indeed associated with abnormal proliferation in lung squamous cell carcinoma, which further supports the reliability of the genes identified by the method.

Table 1: Performance comparison of the present invention with three other algorithms
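The CGC overlap rate used as the specificity measure in the comparison can be recomputed from the gene lists given in the results. In the sketch below, `in_cgc` contains only the seven genes that the text reports as present in the CGC database, not the full census, and the function name is hypothetical.

```python
# Gene lists copied from the experimental results of this example.
identified = ["RB1", "KEAP1", "TP53", "ARID1A", "ZNF268", "PTEN", "MLL2",
              "NFE2L2", "MACF1", "NPAT", "CAPN8", "TTN", "CDKN2A", "MUC16"]
in_cgc = {"RB1", "KEAP1", "TP53", "ARID1A", "PTEN", "NFE2L2", "CDKN2A"}


def overlap_rate(genes, reference):
    """Fraction of `genes` that also appear in `reference`."""
    if not genes:
        return 0.0  # an algorithm that reports no genes has no overlap
    return sum(gene in reference for gene in genes) / len(genes)


rate = overlap_rate(identified, in_cgc)
print(f"{rate:.0%}")  # 7 of 14 genes, prints 50%
```

Defining the rate as 0 for an empty gene list matches how the text reports 0% for the algorithm that found no driver genes.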

Claims

1. A method for identifying a cancer driver gene based on machine learning and multiple statistical principles, comprising the steps of:

(1) arranging the data into a standard format: the input data must be in the Mutation Annotation Format (MAF) commonly used by The Cancer Genome Atlas, or in a file containing seven key fields: column 1 is the gene name, column 5 is the chromosome on which the gene is located, column 6 is the start position of the mutated sequence, column 9 is the mutation classification, column 11 is the normal reference sequence corresponding to the mutated sequence, column 13 is the mutated sequence, and column 16 is the identifier of the sample carrying the mutation;

represents the number of possible t-type synonymous mutations on the g gene,

and

wherein,

p is the total number of samples;

(3) statistical testing for cancer driver genes: a hypothesis test is performed on each gene to judge whether it is a driver gene; for the gene g under test, the null hypothesis is H₀: α_g1 = … = α_gT = 0, where T is the total number of mutation types, and the alternative hypothesis is H₁: α_gt > 0 for at least one mutation type t;

the test statistic is a score test statistic:

[formula not available]

wherein,

[formula not available]

for genes not in the training data, the weight is the mean of the weights of the training genes;

(4) simulating the distribution of the statistic by Monte Carlo: for the score test in step (3), the theoretical distribution of the statistic is the standard normal distribution when the sample size is sufficiently large; because gene mutation frequencies are very low, only a few samples carry a mutation in any given gene and the actual distribution of the statistic does not follow the standard normal distribution, so the distribution of the statistic must be simulated in order to compute its significance; because the mutation counts follow a Poisson distribution, mutation data are artificially generated according to the following distribution:

[formula not available]

wherein the simulated data are generated using the estimated value of β_pgt; the simulated data are substituted into the test statistic formula to obtain the simulated distribution, the real data are substituted into the statistic to calculate the real statistic value, and the statistical significance Q value is obtained from the simulated distribution;

(5) adjusting the Q values: the significance Q value of each gene is adjusted by the Benjamini-Hochberg method, i.e. Q_adjvalue = Qvalue × G / r, wherein Q_adjvalue is the adjusted Q value, Qvalue is the original Q value, G is the total number of genes subjected to hypothesis testing, and r is the rank of the gene when all genes are sorted by Q value in ascending order; genes whose adjusted significance Q value is within the threshold are driver genes.

2. The method according to claim 1, wherein the data in step (1) are obtained from The Cancer Genome Atlas (TCGA).

3. The method according to claim 1, wherein the data in step (1) are generated by an Illumina sequencer.

4. The method according to claim 1, wherein the software used to arrange the data in step (1) comprises R software.

5. The method according to claim 1, wherein the mutation types in step (2) comprise missense mutations, nonsense mutations, nonstop mutations (translation fails to terminate normally), splice site mutations, transcription initiation site mutations and insertion/deletion mutations.

6. The method according to claim 1, wherein the samples in step (2) are from lung cancer, cervical cancer, breast cancer and ovarian cancer.

7. The method according to claim 1, wherein the threshold of the significance Q value in step (5) is 0.05.