CN109346127A

CN109346127A - A statistical analysis method for detecting potential cancer driver genes

Info

Publication number: CN109346127A
Application number: CN201810902841.XA
Authority: CN
Inventors: 李淼新; 蒋琳
Original assignee: National Sun Yat Sen University
Current assignee: Sun Yat Sen University; National Sun Yat Sen University
Priority date: 2018-08-09
Filing date: 2018-08-09
Publication date: 2019-02-15
Anticipated expiration: 2038-08-09
Also published as: CN109346127B

Abstract

The present invention provides a statistical analysis method for detecting potential cancer driver genes. The method has the advantage of accurately fitting the somatic mutation rate across the genome, so that cancer driver genes can be screened more effectively. In particular, the method is not limited by sample size and also improves the detection of cancer driver genes in small samples.

Description

A statistical analysis method for detecting potential cancer driver genes

Technical field

The present invention relates to the field of biotechnology, and more particularly to a statistical analysis method for detecting potential cancer driver genes.

Background art

At present there are two main classes of methods for detecting cancer driver genes from somatic mutations: (1) background mutation rate (BMR) methods and (2) ratio-metric methods. BMR methods assess whether a gene carries more somatic mutations in cancer samples than expected, as in MutSigCV^[1] and MuSiC^[2], where the expected number of mutations is predicted and estimated from several correlated indicators. These predictors include gene features, coding region length and so on. MutSigCV additionally proposes three variables directly related to cancer cells (DNA replication timing, transcriptional activity and chromatin state in cancer cells) to improve the prediction of the expected background mutations. Ratio-metric methods detect cancer driver genes by examining the ratios among different types of somatic mutations within a gene. For example, a method named the 20/20 rule makes a simple assessment of the proportions of inactivating mutations and recurrent missense mutations to detect cancer driver genes^[3]. Oncodrive-fm^[4] and OncodriveFML^[5] integrate the functional impact of mutations into the assessment to improve prediction. OncodriveCLUST^[6] considers the clustering of mutation positions. A recent method named 20/20+^[7] extends the 20/20 ratio idea, incorporates 18 additional features of possible positive selection in cancer cells (for example, protein interaction network metrics) and predicts cancer driver genes by machine learning. Because it needs Monte Carlo simulation to estimate the statistical significance (the P value) of its prediction scores, it can be slow.

Although the principles of both classes of methods are fairly simple, technical barriers remain, especially poor performance in small samples. For example, a recent study^[7] showed that the P values calculated by existing cancer driver gene tests do not follow a uniform distribution, which indicates that their background mutation models fit poorly. Computer-based random simulation can be used to correct the P value distribution, but the key is to fit the background genes properly; only then can true cancer driver genes be accurately distinguished from the noisy background genes. The problem becomes more acute when the sample is too small to produce a stable model, so existing statistical analyses are usually inefficient at detecting cancer driver genes in small samples. Some studies have therefore proposed integrating shared gene features by supervised learning before testing the sample^[7]. However, cancers are highly heterogeneous and specific^[1]. Adding too many extra generic predictive features may cause cancer driver genes specific to the test sample to be overlooked, and a model built from known driver genes is often limited in its ability to find new driver genes. In addition, because of poor fitting, the cancer driver genes predicted by different tools are hard to reconcile with one another, and merging these results is usually both difficult and biased. More effective methods are therefore urgently needed to reveal the complete landscape of cancer driver genes.

Summary of the invention

Because cancer is highly heterogeneous, and most cancer driver genes appear to have relatively mild effects and weak signals, existing methods are usually inefficient at identifying cancer driver genes at typical sample sizes. In actual research, moreover, sample sizes are often not large enough owing to limited resources and funding, and existing methods find it even harder to identify cancer driver genes accurately in small samples.

To achieve the above object of the invention, the following technical solution is adopted:

A statistical analysis method for detecting potential cancer driver genes, comprising the following steps:

S1. Let $c_{i,j}$ denote the number of mutant alleles observed in the cancer samples at nonsynonymous or splicing mutation site $j$ of a background gene $i$. Suppose the gene has $m_i$ mutation sites, and let $y_i = \sum_{j=1}^{m_i} c_{i,j}$ denote the total number of mutant alleles over all mutation sites. $y_i$ follows a negative binomial (NB) distribution:

$$y_i \sim \mathrm{NB}(\mu_i, \theta)$$

where $\mu_i$ is the expected number of mutations and $\theta$ is the dispersion parameter of the distribution;

The probability mass function is then

$$f(y_i \mid \mu_i, \theta) = \frac{\Gamma(y_i + \theta)}{\Gamma(\theta)\, y_i!} \left(\frac{\mu_i}{\mu_i + \theta}\right)^{y_i} \left(\frac{\theta}{\mu_i + \theta}\right)^{\theta}$$

where $\Gamma(\cdot)$ is the gamma function;

S2. The mutant allele counts of background gene $i$ are fitted with a zero-truncated negative binomial model, whose probability mass function is

$$g(y_i \mid \mu_i, \theta) = \frac{f(y_i \mid \mu_i, \theta)}{1 - f(0 \mid \mu_i, \theta)}, \qquad y_i = 1, 2, \dots$$

S3. Construct a generalized linear regression model:

η = log(μ_i) = β₀ + β₁×[x₁, number of mutant alleles at synonymous mutation sites]

+β₂×[x₂, coding region length]

+β₃×[x₃, constraint score for potential de novo mutations]

+β₄×[x₄, expression level in cancer cell lines from the Cancer Cell Line Encyclopedia database]

+β₅×[x₅, DNA replication timing in HeLa cells]

+β₆×[x₆, HiC long-range chromatin interaction in K562 cells]

The coefficients of the regression equation and the distribution parameter θ are estimated by maximum likelihood;

The logarithm of the number of mutant alleles at nonsynonymous and splicing mutation sites in gene $i$, $\log(\hat{\mu}_i)$, can then be computed as:

$$\log(\hat{\mu}_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \hat{\beta}_2 x_{i,2} + \dots + \hat{\beta}_6 x_{i,6}$$

where $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_6$ are the estimated coefficients of the regression equation;

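S4. Once the parameters of the regression equation are determined, the probability that gene $i$ has zero mutations is

$$f(0 \mid \hat{\mu}_i, \hat{\theta}) = \left(\frac{\hat{\theta}}{\hat{\mu}_i + \hat{\theta}}\right)^{\hat{\theta}}$$

In the zero-truncated model, the raw residual of gene $i$ is:

$$e_i = y_i - \frac{\hat{\mu}_i}{1 - f(0 \mid \hat{\mu}_i, \hat{\theta})}$$

The deviance residual of gene $i$ is:

$$d_i = \operatorname{sign}(e_i)\sqrt{2\left[\,ll(y_i \mid \tilde{\mu}_i, \hat{\theta}) - ll(y_i \mid \hat{\mu}_i, \hat{\theta})\,\right]}$$

where sign(x) is the standard sign function and $ll(\mu, \theta)$ is the natural log-likelihood function of the zero-truncated negative binomial distribution:

$$ll(y_i \mid \mu_i, \theta) = \ln\left[g(y_i \mid \mu_i, \theta)\right].$$

$\tilde{\mu}_i$ is the mean parameter corresponding to the observation $y_i$, obtained from the following equation:

$$y_i = \frac{\tilde{\mu}_i}{1 - f(0 \mid \tilde{\mu}_i, \hat{\theta})}$$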

S5. Standardize the deviance residuals: $\tilde{e}_i = (d_i - \bar{d})/s_d$, where $\bar{d}$ and $s_d$ are the estimated mean and standard deviation of the deviance residuals. The P value of the standardized deviance residual is computed from the standard normal distribution:

$$p_i = 1 - \Phi(\tilde{e}_i)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution:

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt;$$

S6. Compute the P values of all genes using steps S1 to S5;

S7. Remove the significant genes whose P values are too small, using a threshold;

S8. Recompute the P values of the remaining genes using steps S1 to S5;

S9. Repeat steps S7 and S8 until no significant gene with a too-small P value remains; the model parameters determined at this point are then used to compute the P values of all genes;

Using steps S1 to S9, P values are generated for the genes of a reference somatic mutation sample, and the genes whose P values are too small are removed. The somatic mutations of the retained genes are then integrated into the small sample, and the high-frequency somatic mutations in genes and the corresponding P values are estimated by the method of steps S1 to S9.

To further improve detection power, in step S1 a weighting model based on a random forest model is constructed to predict the score $s_{i,j}$ of mutation site $j$ in gene $i$, and $s_{i,j}$ is converted into an integer score $w_{i,j} = [s_{i,j}/0.1]$. The integer score serves as the priority weight of the mutation site, and the weighted number of mutant alleles is:

$$y_i^{w} = \sum_{j=1}^{m_i} w_{i,j}\, c_{i,j}$$

The weighted number of mutant alleles is also assumed to follow a negative binomial distribution:

$$y_i^{w} \sim \mathrm{NB}(\mu_i^{w}, \theta^{w})$$

where $\mu_i^{w}$ is the expected number of mutations and $\theta^{w}$ is the dispersion parameter of the distribution.

The subsequent steps are then carried out as before.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The method provided by the invention has the advantage of accurately fitting the somatic mutation rate across the genome, so that cancer driver genes can be screened from the background genes more effectively. In particular, the method is not limited by sample size and also improves the detection of driver genes in small samples.

Description of the drawings

Fig. 1 is a schematic diagram of the architecture of the method.

Fig. 2 compares the performance of 4 methods in detecting cancer driver genes for 11 cancers.

In Fig. 2, a: mean of the logarithm of the fold difference; b: number of significant genes; c: number of significant cancer genes shared by the 5 methods; d: number of significant genes unique to each method; e: number of genes unique to each method that match genes in the cancer gene set.

Genes whose P values fall below the threshold FDR = 0.1 are selected as significant. Cancer name labels: BLCA: bladder urothelial carcinoma; BRCA: breast cancer; COAD: colon cancer; UCEC: endometrial carcinoma; HNSC: head and neck squamous cell carcinoma; KIRC: kidney clear cell carcinoma; LUAD: lung adenocarcinoma; LUSC: lung squamous cell carcinoma; MEL: melanoma; OV: ovarian serous cystadenocarcinoma; STAD: stomach adenocarcinoma.

Detailed description of the embodiments

The drawings are for illustrative purposes only and shall not be construed as limiting this patent;

The present invention is further described below with reference to the drawings and embodiments.

Embodiment 1

As shown in Fig. 1, the framework of the method provided by the invention has three layers. Layer ① is iterative zero-truncated negative binomial regression (ITER), a background mutation model that estimates the expected number of nonsynonymous and splicing mutations in a single gene from various genomic features. Layer ② is weighted iterative zero-truncated negative binomial regression (WITER), which generates priority weights and uses the predicted weights to distinguish potential high-risk mutations from low-risk mutations in cancer samples. Layer ③ integrates reference samples: by adding an independent reference sample, it alleviates the instability of the regression model caused by an insufficient sample size in the user's own data. Layer ① is part of layer ②, and layers ① and ② are part of layer ③. The main inputs are the mutant allele counts at nonsynonymous and splicing mutation sites and at synonymous mutation sites in the somatic cells of cancer patients. The outputs are the somatic mutation z-score and P value of each gene in the cancer samples.

First layer: iterative zero-truncated negative binomial regression

The first layer proposes a new method for fitting the distribution of somatic mutation counts of genes across the genome, named iterative zero-truncated negative binomial regression (ITER). The difference between the observed number of somatic mutations and the estimated expected number is used to measure whether a gene carries an excess of somatic mutations in a given cancer. Nonsynonymous and splicing mutations are the mutations of interest: if the number of mutations in a gene exceeds the number expected for a gene of its type, the gene may be a driver gene that significantly promotes cancer growth. Let $c_{i,j}$ denote the number of mutant alleles observed in the cancer samples at nonsynonymous or splicing mutation site $j$ of a background gene $i$. Suppose the gene has $m_i$ mutation sites, and let $y_i = \sum_{j=1}^{m_i} c_{i,j}$ denote the total number of mutant alleles over all mutation sites. $y_i$ follows a negative binomial (NB) distribution:

$$y_i \sim \mathrm{NB}(\mu_i, \theta)$$

where $\mu_i$ is the expected number of mutations and $\theta$ is the dispersion parameter of the distribution. The probability mass function (PMF) is

$$f(y_i \mid \mu_i, \theta) = \frac{\Gamma(y_i + \theta)}{\Gamma(\theta)\, y_i!} \left(\frac{\mu_i}{\mu_i + \theta}\right)^{y_i} \left(\frac{\theta}{\mu_i + \theta}\right)^{\theta}$$

where $\Gamma(\cdot)$ is the gamma function.

However, somatic mutations are rare, and at ordinary sample sizes many genes carry no somatic mutation, so the observed data contain an excess of zeros. This zero inflation makes it difficult for the regression equation to fit the mutation counts accurately. The first layer therefore fits the mutant allele counts of background gene $i$ with a zero-truncated negative binomial model, whose probability mass function is

$$g(y_i \mid \mu_i, \theta) = \frac{f(y_i \mid \mu_i, \theta)}{1 - f(0 \mid \mu_i, \theta)}, \qquad y_i = 1, 2, \dots$$
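The negative binomial mass function and its zero-truncated version can be evaluated numerically as follows. This is a minimal sketch, assuming the usual mean-dispersion form of the negative binomial; the function names `nb_pmf` and `ztnb_pmf` are illustrative and are not part of the invention.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial PMF with mean mu and dispersion theta (assumed form)."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (mu + theta))
        + y * math.log(mu / (mu + theta))
    )
    return math.exp(log_p)


def ztnb_pmf(y, mu, theta):
    """Zero-truncated PMF: renormalise the NB mass over y = 1, 2, ..."""
    if y < 1:
        return 0.0
    return nb_pmf(y, mu, theta) / (1.0 - nb_pmf(0, mu, theta))
```

Because the zero class is excluded, the truncated masses over y ≥ 1 sum to one, so genes without any observed mutation do not enter the fit.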

Based on the zero-truncated negative binomial distribution, we construct a generalized linear regression model to predict the expected number of mutant alleles at nonsynonymous and splicing mutation sites in gene i. The regression equation contains 6 covariates:

η = log(μ_i) = β₀ + β₁×[x₁, number of mutant alleles at synonymous mutation sites]

+β₂×[x₂, coding region length]

+β₃×[x₃, constraint score for potential de novo mutations]

+β₄×[x₄, expression level in cancer cell lines from the Cancer Cell Line Encyclopedia database]

+β₅×[x₅, DNA replication timing in HeLa cells]

+β₆×[x₆, HiC long-range chromatin interaction in K562 cells],

The number of mutant alleles at synonymous mutation sites is counted in the user's own test sample. Coding region length is estimated from the RefGene reference gene model. The gene constraint score is calculated according to Samocha et al. (2014)^[8]. The last three covariates reuse the predictor variables of MutSigCV^[1]. The expression value is the average expression across 91 cell lines in the Cancer Cell Line Encyclopedia (CCLE). DNA replication timing is measured in HeLa cells and ranges from 100 (early) to 1000 (late). The chromatin state of a gene is measured by a HiC experiment in the K562 cell line and ranges roughly from -50 (closed) to +50 (open). Because some covariates have missing values, the missForest method is used to impute them; missForest is a widely used nonparametric imputation method based on random forests. Other covariates can of course be added to the model. We use the maximum likelihood method in the R package countreg (https://r-forge.r-project.org/R/?group_id=522) to estimate the coefficients of the regression equation and the distribution parameter θ.
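The log link and the quantity that maximum likelihood optimises can be sketched as below. This is only an illustration in Python with hypothetical function names, assuming the usual mean-dispersion form of the negative binomial; the description itself relies on the R package countreg, and a numerical optimiser (not shown) would still be needed to minimise `neg_loglik` over the coefficients and θ.

```python
import math


def expected_count(beta, x):
    """mu_i = exp(beta_0 + sum_k beta_k * x_k) for one gene's covariates x."""
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return math.exp(eta)


def ztnb_loglik(y, mu, theta):
    """Log of the zero-truncated negative binomial mass at y >= 1."""
    log_f0 = theta * math.log(theta / (mu + theta))  # log f(0 | mu, theta)
    log_f = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + log_f0
        + y * math.log(mu / (mu + theta))
    )
    return log_f - math.log(1.0 - math.exp(log_f0))


def neg_loglik(beta, theta, covariates, counts):
    """Objective for maximum likelihood: minus the summed log-likelihood."""
    return -sum(
        ztnb_loglik(y, expected_count(beta, x), theta)
        for x, y in zip(covariates, counts)
    )
```

Only genes with at least one observed mutant allele contribute a term, which is what the zero truncation implies.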

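After the above model has been fitted, the logarithm of the number of nonsynonymous and splicing mutations in gene $i$, $\log(\hat{\mu}_i)$, can be computed as:

$$\log(\hat{\mu}_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \hat{\beta}_2 x_{i,2} + \dots + \hat{\beta}_6 x_{i,6}$$

where $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_6$ are the estimated coefficients of the regression equation.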

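Once the parameters of the equation and the dispersion parameter are determined, the probability that gene $i$ has zero mutations is

$$f(0 \mid \hat{\mu}_i, \hat{\theta}) = \left(\frac{\hat{\theta}}{\hat{\mu}_i + \hat{\theta}}\right)^{\hat{\theta}}$$

In the zero-truncated model, the raw residual of gene $i$ is:

$$e_i = y_i - \frac{\hat{\mu}_i}{1 - f(0 \mid \hat{\mu}_i, \hat{\theta})}$$

The deviance residual of gene $i$ is:

$$d_i = \operatorname{sign}(e_i)\sqrt{2\left[\,ll(y_i \mid \tilde{\mu}_i, \hat{\theta}) - ll(y_i \mid \hat{\mu}_i, \hat{\theta})\,\right]}$$

where sign(x) is the standard sign function and $ll(\mu, \theta)$ is the natural log-likelihood function of the zero-truncated negative binomial distribution:

$$ll(y_i \mid \mu_i, \theta) = \ln\left[g(y_i \mid \mu_i, \theta)\right].$$

$\tilde{\mu}_i$ is the mean parameter corresponding to the observation $y_i$, obtained from the following equation:

$$y_i = \frac{\tilde{\mu}_i}{1 - f(0 \mid \tilde{\mu}_i, \hat{\theta})}$$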
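The residual computation can be sketched as follows. This sketch rests on stated assumptions: the fitted value is taken to be the mean of the zero-truncated distribution, the deviance residual follows the standard definition, and the saturated mean parameter is found by bisection from the condition that the truncated mean equals the observation. All function names are hypothetical.

```python
import math


def zero_prob(mu, theta):
    """f(0 | mu, theta) for the negative binomial."""
    return (theta / (mu + theta)) ** theta


def truncated_mean(mu, theta):
    """Mean of the zero-truncated distribution."""
    return mu / (1.0 - zero_prob(mu, theta))


def ztnb_loglik(y, mu, theta):
    """Log of the zero-truncated negative binomial mass at y >= 1."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (mu + theta))
        + y * math.log(mu / (mu + theta))
        - math.log(1.0 - zero_prob(mu, theta))
    )


def saturated_loglik(y, theta):
    """Log-likelihood at the mean parameter whose truncated mean equals y."""
    if y == 1:
        return 0.0  # limit mu -> 0: all truncated mass sits on y = 1
    lo, hi = 1e-9, float(y)  # truncated_mean(y) > y, so the root lies below y
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, theta) < y:
            lo = mid
        else:
            hi = mid
    return ztnb_loglik(y, 0.5 * (lo + hi), theta)


def deviance_residual(y, mu_hat, theta_hat):
    """Signed square root of twice the log-likelihood gap for one gene."""
    raw = y - truncated_mean(mu_hat, theta_hat)  # raw residual
    sign = (raw > 0) - (raw < 0)
    gap = saturated_loglik(y, theta_hat) - ztnb_loglik(y, mu_hat, theta_hat)
    return sign * math.sqrt(max(0.0, 2.0 * gap))
```

A gene whose observed count equals its fitted truncated mean gets a residual of zero, and counts above the fitted mean give positive residuals.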

In the analysis of real data, we found that the standard normal distribution can be used to approximate the P values of the standardized deviance residuals. The deviance residuals are standardized as $\tilde{e}_i = (d_i - \bar{d})/s_d$, where $\bar{d}$ and $s_d$ are the estimated mean and standard deviation of the deviance residuals. The P value is calculated with the following formula:

$$p_i = 1 - \Phi(\tilde{e}_i)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution.
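The standardization and the one-sided P value can be sketched with the standard library alone. The use of the sample standard deviation (`statistics.stdev`) is an assumption, since the estimator is not specified here, and the function names are illustrative.

```python
import math
import statistics


def normal_cdf(x):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def residual_p_values(deviance_residuals):
    """Return standardized residuals and their upper-tail P values."""
    mean = statistics.fmean(deviance_residuals)
    sd = statistics.stdev(deviance_residuals)  # assumption: sample SD
    z = [(d - mean) / sd for d in deviance_residuals]
    p = [1.0 - normal_cdf(v) for v in z]
    return z, p
```

The test is one-sided: only genes with more mutant alleles than expected receive small P values.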

Assuming that most genes are non-driver background genes, the ITER model estimates, under this null hypothesis, the expected number of alleles at somatic nonsynonymous and splicing mutation sites. A larger standardized deviance residual $\tilde{e}_i$ means that the observed number of alleles at somatic mutation sites exceeds the predicted expectation by more, and the gene is more likely to be a cancer driver gene.

The first layer uses an iterative regression scheme to reduce the influence of driver genes on the null-hypothesis regression model.

Step 1: compute the P values of all genes with ITER.

Step 2: remove the significant genes whose P values are too small, using a false discovery rate (FDR) threshold of ≤ 0.1.

Step 3: recompute the P values of the remaining genes with ITER.

Step 4: repeat steps 2 and 3 until no significant gene with a too-small P value remains.

The ITER model obtained in the last iteration is the one closest to the null hypothesis, and it is used to compute the P values of all genes (including the genes removed during iteration).
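The four steps above can be sketched as a loop. The sketch assumes that the FDR is controlled with the Benjamini-Hochberg procedure, which is not stated here, and it treats the model fit and the per-gene P value as hypothetical callables `fit` and `p_value` supplied by the caller.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


def iterative_p_values(genes, fit, p_value, fdr=0.1):
    """Refit on background genes until no gene passes the FDR threshold."""
    background = list(genes)
    while True:
        model = fit(background)
        q = bh_adjust([p_value(model, g) for g in background])
        kept = [g for g, qv in zip(background, q) if qv > fdr]
        if len(kept) == len(background) or not kept:
            break
        background = kept
    # the last model is applied to every gene, including the removed ones
    return [p_value(model, g) for g in genes]
```

Each pass either stops or strictly shrinks the background set, so the loop always terminates.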

Second layer: weighted iterative zero-truncated negative binomial regression

The second layer extends ITER by weighting mutation sites, giving the more powerful WITER method. Let $s_{i,j} \in [0, 1]$ denote the score of mutation site $j$ in gene $i$; the score indicates how likely the site is to be a cancer driver mutation site. $s_{i,j}$ is converted into an integer score $w_{i,j}$ by taking the integer value of $s_{i,j}/0.1$, i.e. $w_{i,j} = [s_{i,j}/0.1]$. This integer score serves as the priority weight of the mutation site. ITER is the special case of WITER in which $w_{i,j} = 1$. The weighted number of mutant alleles is:

$$y_i^{w} = \sum_{j=1}^{m_i} w_{i,j}\, c_{i,j}$$

Likewise, the weighted number of mutant alleles is assumed to follow a negative binomial distribution:

$$y_i^{w} \sim \mathrm{NB}(\mu_i^{w}, \theta^{w})$$

where $\mu_i^{w}$ is the expected number of mutations and $\theta^{w}$ is the dispersion parameter of the distribution. After the original mutant allele count $y_i$ is replaced by the weighted count $y_i^{w}$, the iterative zero-truncated negative binomial regression procedure is unchanged; it tests whether a gene carries an excess of weighted mutant alleles at its nonsynonymous and splicing mutation sites.
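The weighting step can be sketched as follows. The rounding direction is an assumption: the sketch rounds $s_{i,j}/0.1$ up to the next integer, so that any positive score gives a weight of at least 1; the function names are hypothetical.

```python
import math


def integer_weight(score, step=0.1):
    """Map a score in [0, 1] to an integer weight (ceiling rule assumed)."""
    # round first so that floating-point noise such as 0.9 / 0.1 =
    # 9.000000000000002 does not push the ceiling up by one
    return math.ceil(round(score / step, 9))


def weighted_allele_count(allele_counts, scores):
    """Weighted count for one gene: sum over sites of w_ij * c_ij."""
    return sum(integer_weight(s) * c for c, s in zip(allele_counts, scores))
```

When every site has weight 1, the weighted count reduces to the plain allele count used by ITER.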

The second layer builds a weighting model to predict the score $s_{i,j}$ of potential high-frequency somatic driver mutations. The weighting model is based on a random forest (500 decision trees) whose training set comes from COSMIC (V83), a large database of somatic mutations in cancer. To avoid circular reasoning, all samples (n = 7,916) of the 34 cancers used for testing were removed from the COSMIC database. A positive training set was assembled from 4,320 somatic mutations in COSMIC (V83) whose incidence in primary cancer tissue is more than 15 times the average level. A further 258,846 somatic mutations were randomly drawn from the COSMIC samples as the negative control set; each control mutation occurs only once in primary cancer tissue. The predictors for each mutation are 19 gene functional deleteriousness scores from the dbNSFP v3.5^[9] database.

Third layer: ITER or WITER with a borrowed reference sample for small-sample cancer analysis

In a small sample, when the number of somatic mutations is too small (for example, fewer than 28,000), it is difficult to build a stable regression model. Note, however, that the core of ITER and WITER is a model built on non-driver background genes. When the mutation rates of the non-driver genes of two cancers are close to each other, integrating the background genes of one cancer into those of the other is feasible and effective. The third layer therefore proposes a strategy of borrowing a reference sample, which makes it possible to build a stable ITER or WITER model for a small sample. It is generally carried out in two steps:

The first step generates P values for the genes of a reference somatic mutation sample with the ITER or WITER method described above (the sample used as reference is assumed to contain enough mutations). Genes with too-small P values are removed with a very loose threshold, for example genes whose P values correspond to an FDR below 0.8.

The second step integrates the somatic mutations of the retained genes into the user's own small sample and inputs them all into ITER or WITER to build a new regression model. Finally, the new model is used to estimate the high-frequency somatic mutations in genes and the corresponding P values.
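The two steps can be sketched as a small pooling helper. How the retained reference mutations are combined with the user's data is not detailed here, so the sketch assumes that reference background genes are simply appended as extra rows tagged with their origin. The row layout, the function name and the gene label `GENE_X` in the usage example are hypothetical.

```python
def pool_with_reference(user_rows, reference_rows, reference_fdr, cutoff=0.8):
    """Append reference background genes to the user's small sample.

    Rows are (gene, mutant_allele_count, covariates) tuples; reference_fdr
    maps each reference gene to the FDR of its P value from the first step.
    """
    # keep only reference genes that look like background (FDR >= cutoff)
    background = [
        row for row in reference_rows if reference_fdr[row[0]] >= cutoff
    ]
    tagged_user = [("user",) + tuple(row) for row in user_rows]
    tagged_ref = [("reference",) + tuple(row) for row in background]
    return tagged_user + tagged_ref


user = [("TP53", 5, [1.0])]
reference = [("KRAS", 40, [0.5]), ("GENE_X", 2, [0.2])]
pooled = pool_with_reference(user, reference, {"KRAS": 0.001, "GENE_X": 0.95})
```

The pooled rows would then be passed to ITER or WITER in place of the user's rows alone.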

Compared with other methods, the method provided by the invention not only detects more cancer driver genes accurately but is also fast. In all 11 cancers tested, the method consistently detected more significant cancer genes while avoiding inflation and deflation of statistical significance (see Fig. 2). In evaluations on several very small samples, the method detected significant cancer driver genes even with a sample size of only about 30. Based on these results, the method was applied in full to produce a global map of potential driver genes for 32 cancers. The map contains more than 100 genes specific to 23 cancers; these genes are clear potential markers for the diagnosis and treatment of cancer.

Obviously, the above embodiments are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the claims of the invention.

References

[1] Lawrence M S, Stojanov P, Polak P, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes[J]. Nature, 2013, 499(7457): 214-218.

[2] Dees N D, Zhang Q, Kandoth C, et al. MuSiC: identifying mutational significance in cancer genomes[J]. Genome Res, 2012, 22(8): 1589-98.

[3] Vogelstein B, Papadopoulos N, Velculescu V E, et al. Cancer genome landscapes[J]. Science, 2013, 339(6127): 1546-58.

[4] Gonzalez-Perez A, Lopez-Bigas N. Functional impact bias reveals cancer drivers[J]. Nucleic Acids Res, 2012, 40(21): e169.

[5] Mularoni L, Sabarinathan R, Deu-Pons J, et al. OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations[J]. Genome Biol, 2016, 17(1): 128.

[6] Tamborero D, Gonzalez-Perez A, Lopez-Bigas N. OncodriveCLUST: exploiting the positional clustering of somatic mutations to identify cancer genes[J]. Bioinformatics, 2013, 29(18): 2238-44.

[7] Tokheim C J, Papadopoulos N, Kinzler K W, et al. Evaluating the evaluation of cancer driver genes[J]. Proc Natl Acad Sci U S A, 2016, 113(50): 14330-14335.

[8] Samocha K E, Robinson E B, Sanders S J, et al. A framework for the interpretation of de novo mutation in human disease[J]. Nat Genet, 2014, 46(9): 944-50.

[9] Liu X, Jian X, Boerwinkle E. dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions[J]. Hum Mutat, 2011, 32(8): 894-9.

Claims

1. A statistical analysis method for detecting potential cancer driver genes, characterized by comprising the following steps:

S1. Let $c_{i,j}$ denote the number of mutant alleles observed in the cancer samples at nonsynonymous or splicing mutation site $j$ of a background gene $i$; suppose gene $i$ has $m_i$ mutation sites, and let $y_i = \sum_{j=1}^{m_i} c_{i,j}$ denote the total number of mutant alleles over all mutation sites; $y_i$ follows a negative binomial (NB) distribution:

$$y_i \sim \mathrm{NB}(\mu_i, \theta)$$

The probability mass function is then

$$f(y_i \mid \mu_i, \theta) = \frac{\Gamma(y_i + \theta)}{\Gamma(\theta)\, y_i!} \left(\frac{\mu_i}{\mu_i + \theta}\right)^{y_i} \left(\frac{\theta}{\mu_i + \theta}\right)^{\theta}$$

where $\Gamma(\cdot)$ is the gamma function;

S2. The mutant allele counts of background gene $i$ are fitted with a zero-truncated negative binomial model, whose probability mass function is

$$g(y_i \mid \mu_i, \theta) = \frac{f(y_i \mid \mu_i, \theta)}{1 - f(0 \mid \mu_i, \theta)}, \qquad y_i = 1, 2, \dots$$

S3. Construct a generalized linear regression model:

η = log(μ_i) = β₀ + β₁×[x₁, number of mutant alleles at synonymous mutation sites]

+β₂×[x₂, coding region length]

+β₃×[x₃, constraint score for potential de novo mutations]

+β₄×[x₄, expression level in cancer cell lines from the Cancer Cell Line Encyclopedia database]

+β₅×[x₅, DNA replication timing in HeLa cells]

+β₆×[x₆, HiC long-range chromatin interaction in K562 cells]

The logarithm of the number of mutant alleles at nonsynonymous and splicing mutation sites in gene $i$, $\log(\hat{\mu}_i)$, can then be computed as:

$$\log(\hat{\mu}_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \hat{\beta}_2 x_{i,2} + \dots + \hat{\beta}_6 x_{i,6}$$

where $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_6$ are the estimated coefficients of the regression equation;

S4. In the zero-truncated model, the raw residual of gene $i$ is:

$$e_i = y_i - \frac{\hat{\mu}_i}{1 - f(0 \mid \hat{\mu}_i, \hat{\theta})}$$

The deviance residual of gene $i$ is:

$$d_i = \operatorname{sign}(e_i)\sqrt{2\left[\,ll(y_i \mid \tilde{\mu}_i, \hat{\theta}) - ll(y_i \mid \hat{\mu}_i, \hat{\theta})\,\right]}$$

$$ll(y_i \mid \mu_i, \theta) = \ln\left[g(y_i \mid \mu_i, \theta)\right]$$

$\tilde{\mu}_i$ is the mean parameter corresponding to the observation $y_i$, obtained from the following equation:

$$y_i = \frac{\tilde{\mu}_i}{1 - f(0 \mid \tilde{\mu}_i, \hat{\theta})};$$

S5. Standardize the deviance residuals: $\tilde{e}_i = (d_i - \bar{d})/s_d$, where $\bar{d}$ and $s_d$ are the estimated mean and standard deviation of the deviance residuals; the P value of the standardized deviance residual is computed from the standard normal distribution:

$$p_i = 1 - \Phi(\tilde{e}_i)$$

S6. Compute the P values of all genes using steps S1 to S5;

S7. Remove the significant genes whose P values are too small, using a threshold;

S8. Recompute the P values of the remaining genes using steps S1 to S5;

S9. Repeat steps S7 and S8 until no significant gene with a too-small P value remains; the model parameters determined at this point are then used to compute the P values of all genes;

Using steps S1 to S9, P values are generated for the genes of a reference somatic mutation sample, and the genes whose P values are too small are removed; the somatic mutations of the retained genes are then integrated into the small sample, and the high-frequency somatic mutations in genes and the corresponding P values are estimated by the method of steps S1 to S9.

2. The statistical analysis method for detecting potential cancer driver genes according to claim 1, characterized in that: in step S1, a weighting model based on a random forest model is constructed to predict the score $s_{i,j}$ of mutation site $j$ in gene $i$, and $s_{i,j}$ is converted into an integer score $w_{i,j} = [s_{i,j}/0.1]$; the integer score serves as the priority weight of the mutation site, and the weighted number of mutant alleles is:

$$y_i^{w} = \sum_{j=1}^{m_i} w_{i,j}\, c_{i,j}$$

The weighted number of mutant alleles is assumed to follow a negative binomial distribution:

$$y_i^{w} \sim \mathrm{NB}(\mu_i^{w}, \theta^{w})$$

The subsequent steps are then carried out as before.