CN106682455B

CN106682455B - Statistical test method for multi-sample consistent copy number variation regions

Info

Publication number: CN106682455B
Application number: CN201611040980.3A
Authority: CN
Inventors: 袁细国; 李�杰; 张军英; 杨利英; 高美虹
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2016-11-24
Filing date: 2016-11-24
Publication date: 2019-03-26
Anticipated expiration: 2036-11-24
Also published as: CN106682455A

Abstract

The invention discloses a statistical test method for multi-sample consistent copy number variation regions. Correlation coefficients between copy number sites are fitted to a curve and the derivative at each site is calculated; hypothesis testing detects significant derivative values, which determine the copy number breakpoints and establish candidate copy number variation regions. A hypothesis-testing null distribution is constructed by randomly permuting CNVs in both the whole-genome and the sample directions, and consistent copy number variation regions across multiple samples are detected. Because the invention avoids using sequencing read counts directly, it can tolerate a certain amount of sequencing error and noise and can accurately locate the boundaries of copy number variation regions. Randomly permuting CNVs in both the whole-genome and the sample directions yields a more realistic null distribution than permutation in a single direction. The approach also favors the detection of diverse consistent CNVs, that is, consistent copy number variation regions present in subsets of the samples.

Description

Statistical test method for multi-sample consistent copy number variation regions

Technical field

The invention belongs to the field of copy number variation, and in particular relates to a statistical test method for multi-sample consistent copy number variation regions.

Background

Next-generation sequencing provides more comprehensive and richer data on genomic variation, and offers an important platform for understanding the mechanisms of life and of cancer cell development. Copy number variation (CNV) is an important form of genomic variation and is closely related to the onset and progression of cancer. Systematic analysis of CNV data from next-generation sequencing platforms is therefore an important route to discovering cancer genes and studying the molecular mechanisms of cancer cells. The difficulty lies in accurately detecting diverse CNV patterns from high-resolution, low-depth read data.

Prior art: researchers in China and abroad have proposed various copy number variation detection schemes, which fall broadly into those based on a single tumor sample and those based on tumor-normal paired samples, such as SegSeq [D.Y.Chiang et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nat Methods, vol.6, no.1, pp.99-103, Jan, 2009], EWT [S.T.Yoon et al., “Sensitive and accurate detection of copy number variants using read depth of coverage,” Genome Research, vol.19, no.9, pp.1586-1592, Sep, 2009], BIC-seq [R.Xi et al., “Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion,” Proc Natl Acad Sci U S A, vol.108, no.46, pp.E1128-36, Nov 15, 2011], CNVnator [A.Abyzov et al., “CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing,” Genome Res, vol.21, no.6, pp.974-84, Jun, 2011], ReadDepth [C.A.Miller et al., “ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads,” PLoS One, vol.6, no.1, pp.e16327, 2011], Control-FREEC [V.Boeva et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol.27, no.2, pp.268-9, Jan 15, 2011], CNV-TV [J.Duan et al., “CNV-TV: a robust method to discover copy number variation from short sequencing reads,” BMC Bioinformatics, vol.14, pp.150, 2013], CNVeM [Z.Wang et al., “CNVeM: copy number variation detection using uncertainty of read mapping,” J Comput Biol, vol.20, no.3, pp.224-36, Mar, 2013], and m-HMM [H.Wang et al., “Copy number variation detection using next generation sequencing read counts,” BMC Bioinformatics, vol.15, Apr 14, 2014]. Most of these methods use sequencing depth to compute the read count at each gene locus and then predict copy number variation regions from changes in read count across the whole genome or a whole chromosome. Such methods are relatively easy to implement and perform well on high-depth data. Their shortcoming is that they depend directly on read counts, which are unstable: read counts fluctuate randomly, and this fluctuation is often misinterpreted as being caused by copy number variation. For low-depth data in particular, the ratio of the random fluctuation amplitude to the copy number variation amplitude is high, so these methods struggle to achieve good detection results.

In addition, some researchers have proposed copy number variation detection methods based on multiple samples, such as cnvHiTSeq [E.Bellos et al., “cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data,” Genome Biol, vol.13, no.12, pp.R120, 2012], VarScan2+CMDS [D.C.Koboldt et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Res, vol.22, no.3, pp.568-76, Mar, 2012; Q.Zhang et al., “CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data,” Bioinformatics, vol.26, no.4, pp.464-9, Feb 15, 2010], JointSLM [A.Magi et al., “Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm,” Nucleic Acids Research, vol.39, no.10, May, 2011], cn.MOPS [G.Klambauer et al., “cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate,” Nucleic Acids Res, vol.40, no.9, pp.e69, May, 2012], CBSBR [J.Duan et al., “Common copy number variation detection from multiple sequenced samples,” IEEE Trans Biomed Eng, vol.61, no.3, pp.928-37, Mar, 2014], and CODEX [Y.Jiang et al., “CODEX: a normalization and copy number variation detection method for whole exome sequencing,” Nucleic Acids Res, vol.43, no.6, pp.e39, Mar 31, 2015]. Most of these methods detect consistent copy number variation regions from the correlation between copy number variation sites or from differences between samples. Their advantage is that they capture the biological characteristics of copy number structural variation and can therefore distinguish consistent copy number variation regions from random copy number variations. Their disadvantage is that they have difficulty detecting weakly significant consistent copy number variation regions. In multi-sample detection these methods also tend to restrict the number of samples, which limits their ability to detect highly consistent copy number variation regions within one cancer type or across cancers.

In summary, existing statistical test methods for consistent copy number variation regions rely too heavily on changes in sequencing read counts, which makes statistically significant detection results hard to obtain; they cannot handle large numbers of samples and have high computational complexity, which hinders the detection of consistent copy number variation regions across multiple samples.

Summary of the invention

The purpose of the invention is to provide a statistical test method for multi-sample consistent copy number variation regions, intended to solve the following problems of existing methods: they rely too heavily on changes in sequencing read counts, so statistically significant detection results are hard to obtain; the number of samples cannot be large; and their computational complexity is high, which hinders the detection of consistent copy number variation regions across multiple samples.

The invention is realized as follows. In the statistical test method for multi-sample consistent copy number variation regions, the correlation coefficients between copy number sites are fitted to a curve, from which the derivative at each site is calculated. Hypothesis testing detects significant derivative values, which determine the copy number breakpoints and establish candidate copy number variation regions. A hypothesis-testing null distribution is constructed by randomly permuting CNVs in both the whole-genome and the sample directions, and consistent copy number variation regions across multiple samples are detected.

Further, the sequencing data file is preprocessed before the correlation-coefficient curve is fitted, specifically as follows:

After the sequencing data file has been aligned, the read count at each site is calculated. The read counts are normalized by the sample mean read count to obtain read count signals that are comparable between samples, using the following formula:
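The formula is absent from this text. One form consistent with the symbol definitions that follow, offered here as an assumption rather than as the patent's exact expression, rescales each sample so that its mean read count matches the mean over all samples:

```latex
x'_{nm} = x_{nm} \cdot \frac{\mathrm{mean\_RC}}{\mathrm{mean\_RC}_{n}} \tag{1}
```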

where mean_RC_n and mean_RC denote the mean read count of the n-th sample and the mean read count over all samples, respectively; x_nm is the read count at the m-th site of the n-th sample; and x'_nm is the normalized read count at that site.

Further, bins of equal length are defined and the per-site read counts of each sample are converted into per-bin read counts; copy number variation is then detected bin by bin.

Further, based on the preprocessed data matrix M, in which each row represents a sample and each column represents a bin, the correlation coefficients between bins are calculated by Pearson correlation analysis and fitted to a curve, from which the derivative at each bin is obtained.

Further, a hypothesis-testing null distribution is established with the derivative values as background, and the derivative values are tested for significance. A significant derivative value means that a breakpoint exists at the position of that bin; the breakpoints yield the candidate copy number variation regions.

Further, significant CNVs are detected in the candidate copy number variation regions by an iterative loop, specifically: a hypothesis-testing null distribution is constructed by randomly permuting the candidate CNV regions across the whole genome, and the candidate CNV regions are tested against it. If a significant CNV is found, it is removed from the genome, the null distribution is reconstructed, and the candidate CNV regions are tested again, until no new CNVs are found.

Further, detecting multi-sample consistent copy number variation regions comprises: randomly permuting CNVs across the whole genome and across samples to construct a permuted data matrix M^t, and calculating the frequency f with which the random CNVs occur across the samples; repeating this process n times, n > 1000, to obtain a distribution of the frequency f, which is the hypothesis-testing null distribution; and testing the CNV frequencies of the data matrix before permutation against it, calculating the p value of each CNV, and determining the CNVs that vary consistently across samples according to a significance level threshold.

Another object of the invention is to provide a cancer gene using the statistical test method for multi-sample consistent copy number variation regions.

Another object of the invention is to provide a cancer cell molecule using the statistical test method for multi-sample consistent copy number variation regions.

The statistical test method provided by the invention establishes a computational method grounded in statistical theory that detects consistent copy number variation regions across multiple samples, providing a direct and feasible technical means for discovering potential cancer genes. After normalizing the read counts, the invention derives bins and uses them as primitives: the correlation coefficients between bins are calculated in the multi-sample space and fitted to a curve, from which the derivative at each bin is calculated. Significance testing of the derivative values detects copy number breakpoints and thus yields the candidate CNV regions. CNV regions are then detected in each single sample by an iterative loop: for the candidate CNV regions, a null distribution is constructed by random permutation and used to test CNV significance; CNVs found to be significant are removed and the null distribution is rebuilt, and the loop terminates when no new CNV is detected. The advantage of this procedure is that weakly significant CNVs can be detected. Building on single-sample CNV detection, consistent copy number variation regions are detected in the multi-sample space from CNV frequency: the frequency with which a CNV occurs across samples serves as the test statistic, and consistent copy number variation regions are detected by a multi-sample permutation test.

Most existing methods rely too heavily on changes in sequencing read counts. Because sequencing technology itself introduces errors and reads are noisy, these methods struggle to obtain statistically significant detection results for low-depth samples. The invention therefore fits the correlation coefficients between copy number variation sites to a curve and calculates the derivative at each gene locus, thereby converting the problem of detecting copy number variation regions into a problem of testing the significance of derivative values. The method thus does not depend directly on the magnitude of the sequencing read counts and can tolerate a certain amount of sequencing error and noise.

Existing multi-sample copy number variation detection methods impose restrictions on the number or characteristics of the samples. For example, CBSBR requires that the number of samples not be large (the algorithm defaults to 6 samples) and has high computational complexity, while cn.MOPS requires clear differences between samples, which hinders the detection of consistent copy number variation regions across multiple samples. The invention therefore establishes a new statistical test model that detects diverse copy number variation patterns by an iterative removal process, places no restriction on the number of samples, and has controllable computational complexity. Table 1 compares the methods.

Table 1. Comparison of the computational complexity of the four methods

Method	DCC	CBSBR	FREEC	cn.MOPS
Running time	22 s	1721 s	50 s	38 s
Time complexity	O(mn)	O(mnk)	O(n)	O(mn)
Space complexity	O(mn)	O(m²n²)	O(n)	O(mn)
Software platform	C++	MATLAB	C++	R

Here DCC is the method of the invention, and the table reports results obtained on a genome of length 5 Gb.

The invention calculates derivative values from the fitted correlation-coefficient curve in order to test for copy number breakpoints and thereby determine the candidate copy number variation regions. On the one hand, this avoids using sequencing read counts directly and can tolerate a certain amount of sequencing error and noise; on the other hand, it can accurately locate the boundaries of copy number variation regions. Randomly permuting CNVs in both the whole-genome and the sample directions yields a more realistic hypothesis-testing null distribution than permutation in a single direction. The strategy also favors the detection of diverse consistent CNVs, that is, consistent copy number variation regions present in subsets of the samples.

Brief description of the drawings

Fig. 1 is a flowchart of the statistical test method for multi-sample consistent copy number variation regions provided by an embodiment of the invention.

Fig. 2 is a schematic performance comparison of the invention (DCC) and the cn.MOPS method provided by an embodiment of the invention.

Detailed description of embodiments

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here serve only to explain the invention and are not intended to limit it.

The application principle of the invention is explained in detail below with reference to the drawings.

As shown in Fig. 1, the statistical test method for multi-sample consistent copy number variation regions provided by an embodiment of the invention comprises the following steps:

S101: After the sequencing data file (i.e., the Fastq file) has been aligned, calculate the read count at each site; normalize the read counts by the sample mean read count to obtain read count signals that are comparable between samples.

S102: Based on the preprocessed data matrix M (in which each row represents a sample and each column represents a bin), calculate the correlation coefficients between bins by Pearson correlation analysis, construct the fitted curve of the correlation coefficients, and obtain the derivative at each bin from it; with the derivative values as background, establish a hypothesis-testing null distribution and test the derivative values for significance, a significant value meaning that a breakpoint exists at the position of that bin, thereby obtaining the candidate copy number variation regions.

S103: Based on the CNVs defined in single samples, construct a hypothesis-testing null distribution by a multi-sample permutation strategy; test the CNV frequencies of the data matrix before permutation, calculate the p value of each CNV, and determine the CNVs that vary consistently across samples according to a significance level threshold.

The application principle of the invention is further described below with reference to specific embodiments.

(1) Data preprocessing

After the sequencing data file (i.e., the Fastq file) has been aligned, the read count at each site is calculated. The read counts are normalized by the sample mean read count to obtain read count signals that are comparable between samples, as given by formula (1).

Starting from the normalized data, and in order to reduce the data dimension and the between-site variability introduced by random factors, the invention defines bins of equal length and converts the per-site read counts of each sample into per-bin read counts. Copy number variation is then detected bin by bin.
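The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the scaling factor (overall mean divided by sample mean) is an assumed reading of the normalization, and summing per-site counts within a bin is likewise an assumption (a mean would work equally well).

```python
from typing import List


def normalize_read_counts(counts: List[List[float]]) -> List[List[float]]:
    """Rescale each sample (row) so its mean read count equals the mean over samples.

    Assumption: every sample has a nonzero mean read count.
    """
    sample_means = [sum(row) / len(row) for row in counts]
    overall_mean = sum(sample_means) / len(sample_means)
    return [
        [x * overall_mean / mu for x in row]
        for row, mu in zip(counts, sample_means)
    ]


def bin_read_counts(row: List[float], bin_size: int) -> List[float]:
    """Sum per-site counts into consecutive equal-length bins.

    A trailing partial bin is dropped (an illustrative choice).
    """
    n_bins = len(row) // bin_size
    return [sum(row[b * bin_size:(b + 1) * bin_size]) for b in range(n_bins)]


def build_matrix(counts: List[List[float]], bin_size: int) -> List[List[float]]:
    """Return the samples-by-bins matrix M used by the later steps."""
    return [bin_read_counts(row, bin_size) for row in normalize_read_counts(counts)]
```

Two samples that differ only in sequencing depth, such as `[2, 4, 6, 8]` and `[1, 2, 3, 4]`, end up with identical rows in M after this step, which is the comparability the normalization is meant to provide.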

(2) Derivative value testing and copy number variation detection in single samples

Based on the preprocessed data matrix M (in which each row represents a sample and each column represents a bin), the correlation coefficients between bins are calculated by Pearson correlation analysis, the fitted curve of the correlation coefficients is constructed, and the derivative at each bin is obtained from it.
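A rough sketch of this step is given below. The text does not say which bin pairs are correlated or how the curve is fitted, so this example assumes correlations between adjacent bins (columns of M), a moving average as a stand-in for the fitted curve, and finite differences for the derivative. All function names are hypothetical.

```python
import math
from typing import List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def adjacent_bin_correlation(M: List[List[float]]) -> List[float]:
    """Correlate each bin (column) with its right-hand neighbour across samples."""
    n_bins = len(M[0])
    cols = [[row[j] for row in M] for j in range(n_bins)]
    return [pearson(cols[j], cols[j + 1]) for j in range(n_bins - 1)]


def smooth(values: List[float], window: int = 3) -> List[float]:
    """Moving average, used here in place of a proper curve fit (assumption)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def derivative(values: List[float]) -> List[float]:
    """Central differences in the interior, one-sided differences at the ends."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    d = [values[1] - values[0]]
    d += [(values[i + 1] - values[i - 1]) / 2 for i in range(1, n - 1)]
    d.append(values[-1] - values[-2])
    return d
```

Calling `derivative(smooth(adjacent_bin_correlation(M)))` gives one derivative value per adjacent-bin pair; large magnitudes mark positions where the correlation level changes abruptly.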

With the derivative values as background, a hypothesis-testing null distribution is established and the derivative values are tested for significance; a significant value means that a breakpoint exists at the position of that bin, which yields the candidate copy number variation regions. This approach makes full use of the intrinsic correlation between copy number variation sites: sites within the same copy number variation region have correlation coefficients of similar level, so testing the derivative values locates the points where the correlation coefficient changes abruptly and yields candidate copy number variation regions of unequal length.
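One simple way to realize "derivative values as background" is an empirical tail test, sketched below. This is an assumption made for illustration: each derivative magnitude is ranked against all derivative magnitudes, and bins whose empirical p value falls at or below a threshold `alpha` are called breakpoints. The function names and the default threshold are hypothetical.

```python
from typing import List, Tuple


def empirical_p_values(derivs: List[float]) -> List[float]:
    """Fraction of background magnitudes at least as large as each magnitude."""
    mags = [abs(d) for d in derivs]
    n = len(mags)
    return [sum(1 for m in mags if m >= x) / n for x in mags]


def call_breakpoints(derivs: List[float], alpha: float = 0.05) -> List[int]:
    """Indices whose derivative magnitude is significant at level alpha."""
    p = empirical_p_values(derivs)
    return [i for i, pv in enumerate(p) if pv <= alpha]


def candidate_regions(breakpoints: List[int], n_bins: int) -> List[Tuple[int, int]]:
    """Split [0, n_bins) at the breakpoints into half-open candidate regions."""
    inner = [b for b in sorted(set(breakpoints)) if 0 < b < n_bins]
    edges = [0] + inner + [n_bins]
    return list(zip(edges, edges[1:]))
```

Because the regions are delimited by whichever breakpoints pass the test, they naturally come out with unequal lengths, as the text describes.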

For the candidate copy number variation regions, significant CNVs are detected by an iterative loop, as follows. A hypothesis-testing null distribution is constructed by randomly permuting the candidate CNV regions across the whole genome, and the candidate CNV regions are tested against it. If a significant CNV is found, it is removed from the genome, the null distribution is reconstructed, and the candidate CNV regions are tested again, until no new CNVs are found.
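The loop can be sketched as below under several stated assumptions: the test statistic is the absolute difference between a region's mean signal and the background median, the null distribution is built by drawing the same number of bins at random (without replacement) from the bins not yet called, and the p value uses the usual add-one correction. None of these choices is specified in the text, and the function name is hypothetical.

```python
import random
import statistics
from typing import List, Tuple

Region = Tuple[int, int]  # half-open interval [start, end) in bin coordinates


def iterative_cnv_test(signal: List[float], regions: List[Region],
                       alpha: float = 0.05, n_perm: int = 1000,
                       seed: int = 0) -> List[Region]:
    """Call significant regions, remove them from the background, and repeat."""
    rng = random.Random(seed)
    called: List[Region] = []
    remaining = list(regions)
    while True:
        masked = {i for s, e in called for i in range(s, e)}
        background = [v for i, v in enumerate(signal) if i not in masked]
        baseline = statistics.median(background)
        newly = []
        for s, e in remaining:
            length = e - s
            if length > len(background):
                continue  # too few background bins to permute against
            observed = abs(sum(signal[s:e]) / length - baseline)
            hits = 0
            for _ in range(n_perm):
                draw = rng.sample(background, length)
                if abs(sum(draw) / length - baseline) >= observed:
                    hits += 1
            p_value = (hits + 1) / (n_perm + 1)
            if p_value <= alpha:
                newly.append((s, e))
        if not newly:
            return called  # no new CNVs: stop iterating
        called.extend(newly)
        remaining = [r for r in remaining if r not in newly]
```

Removing called regions before rebuilding the null distribution narrows the background, so a weaker CNV that was masked by a strong one in the first pass has a chance of reaching significance in a later pass.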

(3) Detection of multi-sample consistent copy number variation regions

Based on the CNVs defined in single samples, a hypothesis-testing null distribution is constructed by a multi-sample permutation strategy: CNVs are randomly permuted across the whole genome and across samples to construct a permuted data matrix M^t, from which the frequency f with which the random CNVs occur across the samples is calculated. This process is repeated n times (n > 1000) to obtain a distribution of the frequency f, which is the hypothesis-testing null distribution. The CNV frequencies of the data matrix before permutation are tested against it, the p value of each CNV is calculated, and the CNVs that vary consistently across samples are determined according to a significance level threshold.
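A minimal sketch of the multi-sample permutation test follows. It assumes the single-sample calls are encoded as a 0/1 matrix (samples by bins), permutes in both directions by shuffling all matrix entries, and pools the per-bin frequencies of every permuted matrix into one null distribution. The encoding, the bin-level shuffle (which ignores CNV segment structure), and the function names are illustrative assumptions; only the repetition count n > 1000 comes from the text.

```python
import random
from typing import List


def column_frequencies(calls: List[List[int]]) -> List[float]:
    """Fraction of samples carrying a CNV call at each bin."""
    n_samples = len(calls)
    n_bins = len(calls[0])
    return [sum(row[j] for row in calls) / n_samples for j in range(n_bins)]


def consensus_p_values(calls: List[List[int]], n_perm: int = 1001,
                       seed: int = 0) -> List[float]:
    """Permutation p value for the observed CNV frequency at each bin."""
    rng = random.Random(seed)
    n_samples, n_bins = len(calls), len(calls[0])
    observed = column_frequencies(calls)
    flat = [v for row in calls for v in row]
    null: List[float] = []
    for _ in range(n_perm):
        rng.shuffle(flat)  # moves calls across both bins and samples
        permuted = [flat[i * n_bins:(i + 1) * n_bins] for i in range(n_samples)]
        null.extend(column_frequencies(permuted))
    return [
        (sum(1 for g in null if g >= f) + 1) / (len(null) + 1)
        for f in observed
    ]


def consistent_bins(calls: List[List[int]], alpha: float = 0.05,
                    n_perm: int = 1001, seed: int = 0) -> List[int]:
    """Bins whose CNV frequency is significant at level alpha."""
    p = consensus_p_values(calls, n_perm=n_perm, seed=seed)
    return [j for j, pv in enumerate(p) if pv <= alpha]
```

A bin called in every sample gets a very small p value because a random placement of the same number of calls almost never fills a whole column, whereas a bin called in a single sample is indistinguishable from the null.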

Performance comparison. Fig. 2 compares the performance of the invention (DCC) with that of the cn.MOPS method; the experiment tests CNV detection performance on sequenced DNA at different tumor purities. Fig. 2 shows that the method of the invention achieves comparatively high performance.

The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A statistical test method for multi-sample consistent copy number variation regions, characterized in that the correlation coefficients between copy number sites are fitted to a curve and the derivative at each site is calculated; hypothesis testing detects significant derivative values, which determine the copy number breakpoints and establish candidate copy number variation regions; and a hypothesis-testing null distribution is constructed by randomly permuting copy number variations (CNVs) in both the whole-genome and the sample directions, and consistent copy number variation regions across multiple samples are detected;

the statistical test method for multi-sample consistent copy number variation regions specifically comprising:

(1) after the sequencing data file has been aligned, calculating the read count at each site; normalizing the read counts by the sample mean read count to obtain read count signals that are comparable between samples;

(2) based on the preprocessed data matrix M, in which each row represents a sample and each column represents a bin, calculating the correlation coefficients between bins by Pearson correlation analysis, constructing the fitted curve of the correlation coefficients, and obtaining the derivative at each bin from it; with the derivative values as background, establishing a hypothesis-testing null distribution and testing the derivative values for significance, a significant value meaning that a breakpoint exists at the position of that bin, thereby obtaining the candidate copy number variation regions;

(3) based on the CNVs defined in single samples, constructing a hypothesis-testing null distribution by a multi-sample permutation strategy; testing the CNV frequencies of the data matrix before permutation, calculating the p value of each CNV, and determining the CNVs that vary consistently across samples according to a significance level threshold.

2. The statistical test method for multi-sample consistent copy number variation regions according to claim 1, characterized in that the sequencing data file is preprocessed before the fitted curve of the correlation coefficients is constructed, specifically comprising:

after the sequencing data file has been aligned, calculating the read count at each site; normalizing the read counts by the sample mean read count to obtain read count signals that are comparable between samples, the calculation formula being:

where mean_RC_n and mean_RC denote the mean read count of the n-th sample and the mean read count over all samples, respectively; x_nm is the read count at the m-th site of the n-th sample; and x'_nm is the normalized read count at that site.

3. The statistical test method for multi-sample consistent copy number variation regions according to claim 2, characterized in that bins of equal length are defined, the per-site read counts of each sample are converted into per-bin read counts, and copy number variation is detected bin by bin.

4. The statistical test method for multi-sample consistent copy number variation regions according to claim 1, characterized in that significant CNVs are detected in the candidate copy number variation regions by an iterative loop, specifically comprising: constructing a hypothesis-testing null distribution by randomly permuting the candidate copy number variation (CNV) regions across the whole genome, and testing the candidate CNV regions against it; if a significant CNV is found, removing it from the genome, reconstructing the null distribution, and testing the candidate CNV regions again, until no new CNVs are found.

5. The statistical test method for multi-sample consistent copy number variation regions according to claim 1, characterized in that detecting multi-sample consistent copy number variation regions comprises: randomly permuting CNVs across the whole genome and across samples to construct a permuted data matrix M^t, and calculating the frequency f with which the random CNVs occur across the samples; repeating this process n times, n > 1000, to obtain a distribution of the frequency f, which is the hypothesis-testing null distribution; and testing the CNV frequencies of the data matrix before permutation, calculating the p value of each copy number variation (CNV), and determining the CNVs that vary consistently across samples according to a significance level threshold.

6. A cancer gene using the statistical test method for multi-sample consistent copy number variation regions according to any one of claims 1 to 5.

7. A cancer cell molecule using the statistical test method for multi-sample consistent copy number variation regions according to any one of claims 1 to 5.