CN105760712B

CN105760712B - A copy number variation detection method based on next-generation sequencing

Info

Publication number: CN105760712B
Application number: CN201610114354.8A
Authority: CN
Inventors: 李垚垚; 袁细国; 张军英; 杨利英; 白俊
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2016-03-01
Filing date: 2016-03-01
Publication date: 2019-03-26
Anticipated expiration: 2036-03-01
Also published as: CN105760712A

Abstract

The invention discloses a copy number variation detection method based on next-generation sequencing. The method comprises preprocessing of the copy number variation data, construction of a sliding window, calculation of the statistic, implementation of the permutation strategy and construction of the null distribution, and performance evaluation of the algorithm. The performance evaluation judges whether the algorithm can obtain a higher true positive rate while the false positive rate is controlled, evaluates whether the algorithm can accurately estimate p-values, assesses its ability to detect the boundaries of copy number variations, and analyzes its computational complexity. The invention solves the problem of copy number variation detection errors caused by differences in sequencing platform and sequencing level, making results more accurate; it normalizes the data using the multimodal features of the frequency histogram so as to accurately separate normal regions from copy number variation regions; and it considers the combined effect of the variant read count and the relevance between variant sites, establishing a new model that resolves the inconsistency problem and objectively estimates the significance level of copy number variations.

Description

A copy number variation detection method based on next-generation sequencing

Technical field

The invention belongs to the field of high-throughput sequencing technologies for sequencing DNA molecules, and more particularly relates to a copy number variation detection method based on next-generation sequencing.

Background technique

Copy number variation (CNV) is an important phenomenon in cancer genomes. It mainly manifests as two states, amplification and deletion of copy number, and is closely tied to the occurrence and development of cancer cells. Detecting CNVs that occur concurrently in the same region across multiple cancer samples, analyzing in an integrated way the influence of CNVs on genome-wide expression, and identifying the cancer genes whose expression is affected by CNVs are of great significance for studying the occurrence and metastasis of cancer. Although CNV detection methods based on a single sample are increasingly mature, their detection sensitivity and other aspects still cannot satisfy the detection of regions where CNVs occur jointly in multiple samples. Systematic CNV analysis therefore provides an important route to studying the pathogenesis of cancer at the molecular level, and its most fundamental and central problem is how to detect the CNVs associated with tumor-related genes in multiple cancer samples.

Next-generation sequencing (NGS) is a high-throughput sequencing technology that can obtain one million or even several million short sequences at a time, with advantages such as high speed, high resolution, low cost and high reproducibility. Studying CNV detection on the basis of NGS data therefore greatly improves speed and accuracy while also reducing cost.

Numerous studies have shown that CNV functional patterns are often implied in the consistently varied regions of cancer genome samples, and that in NGS the number of reads aligned to each region of the genome is proportional to the copy number of that region. Establishing a statistics-based computational method to detect the significance level of CNVs that occur concurrently (common CNVs) in multiple cancer samples therefore provides a direct and feasible technical means for identifying CNV functional patterns and discovering potential cancer genes, and in turn provides important information for biomedical practitioners in the prediction and diagnosis of cancer. Establishing a reasonable and effective statistical test model is thus essential.

The density of high-throughput genome-wide CNV sites and the complexity of their structure pose great challenges to building a statistical test model and detecting CNV significance, mainly in two respects. First, the difficulties of the problem itself: a) the number of sites exceeds 1.8 million while the number of samples is often small, forming a high-dimensional, small-sample data pattern; b) differences in sequencing platform and sequencing level introduce systematic errors, so samples of different sequencing levels must be normalized; c) the read signal (read depth, RD) at a genomic site is easily affected by noise such as sequencing errors and alignment errors; d) CNV sites are strongly correlated rather than independent, so the detection factors interact; e) detecting the amplification or deletion state of copy number must consider two features, namely the read count at each site and the relevance between sites, which requires a mechanism that reasonably weighs the two. Second, the challenges for the theory and method used to solve the problem: a) the data scale is large, so effectively controlling the time and space complexity of the computation is a challenge; b) how to fully account for the relevance between CNV sites and reduce the conservativeness of CNV significance estimation is a difficult problem; c) how to establish a null distribution consistent with the statistic, and so strengthen the statistical meaning of the significance estimate, is a key problem that has not yet been solved.

In terms of sample size, existing copy number variation detection methods fall into CNV detection methods based on single-sample analysis and methods based on multiple samples. Technically, the main approaches are detection methods based on fluorescence in situ hybridization, comparative genomic hybridization based on microarrays, and copy number detection based on next-generation sequencing. The first two have very low resolution and have difficulty detecting short CNVs, whereas NGS-based methods stand out because of their high throughput. NGS-based CNV detection methods follow two main technical routes, one based on PEM (paired-end mapping) signatures and one based on DOC (depth of coverage). PEM-based methods can detect small CNVs but have difficulty detecting large insertions (copy number amplifications) and CNVs in complex regions (such as SDs). DOC-based methods can detect large CNVs. Some methods therefore combine the two; CNVer, for example, improves the breakpoint accuracy of CNV regions by integrating DOC and PEM signatures. At present, DOC-based methods are more favored.

Segmentation-based DOC detection models mainly involve different segmentation methods, such as CBS and LASSO, and different segmentation methods produce different detection results. ReadDepth, for example, uses the CBS segmentation algorithm to identify the boundaries of copy number variations more accurately, and retains high sensitivity and specificity when detecting low-coverage data. The FREEC method is not constrained by the need for a control sample and uses LASSO to recover accurate CNV boundaries, but it ignores part of the read count variation, which easily causes false detections; in addition, its GC-content normalization may be affected by subclonality, which in turn affects CNV detection. The SegSeq and rSW-seq methods compare directly against a control sample and can therefore detect and accurately identify CNV regions quickly, but they do not account for the local features of multiple samples, which leads to large errors in the results. Because of the local features of sequencing technologies and of the genome, segmentation algorithms tend to yield a relatively high false positive rate. SeqCNA also does not require a control sample and uses LOESS or polynomial fitting, which makes it suitable for detecting local small CNVs but unsuitable for cancer sample data.

DOC statistical significance models based on hypothesis testing mainly involve two key elements, the test statistic and the null distribution, and the quality of their design directly affects the validity of the significance estimate and the ability to identify CNV functional patterns. The EWT method fits a Gaussian probability distribution model to the RD of consecutive segments (windows) and detects CNVs with a one-sided Z-test; it can detect large copy number variation regions, but it does not consider the relevance between sites, cannot accurately locate insertions (CNVs), and is insensitive to small CNVs. The CNV-seq method fits a Poisson distribution model to the RD ratio (relative to a reference sample) of non-overlapping segments (windows), computes the significance of the Z-score, and introduces a segmentation algorithm to detect CNVs; this improves sensitivity on low-coverage data but tends to raise the false positive rate. CNA-seg is based on the HMM approach of SegSeq and JointSLM and additionally introduces a chi-square (χ²) statistic to detect CNVs.

Methods for detecting common CNVs in multiple samples based on DOC are still not very mature. The main ones are the CMDS method [17], the cn.MOPS method, the JointSLM method, and detection methods based on penalized sparse regression models. CMDS builds a correlation diagonal matrix for each single site across multiple samples and computes its significance to detect CNVs; it is more accurate than detecting samples individually and also improves the trade-off between time and space complexity. cn.MOPS reduces the influence of technical and biological noise and is suited to detecting CNVs whose variation amplitude in the same region differs across samples, but it is insensitive to CNVs of consistent amplitude. JointSLM extends EWT to multi-sample detection and introduces a hidden Markov model (HMM) to detect CNVs, but it fails when the common CNV occurs in only some of the samples. Detection methods based on penalized regression models fit a penalized regression model to the RD signals of multiple samples, convert the boundary detection of common CNVs (cCNVs) into a change-point detection problem, and apply a significance test, thereby improving accuracy and lowering the false discovery rate. Their accuracy declines, however, when the samples differ in ancestry.

Comparative analysis of these existing DOC-based models [3,7,9-27] shows that most methods produce a very high false discovery rate, especially when no reference sample is available. When designing the statistic, existing NGS-based significance models all take CNV structural segments as the detection primitive, and the quantified statistic uses the frequency and amplitude of the CNV together with the relevance between CNV sites. For the construction of the null distribution, most methods rely on a random permutation strategy.

From the biological characteristics of CNV data, CNV sites are not independent; neighboring CNV sites form an organic whole. Taking the single site as the detection primitive therefore makes it difficult to estimate CNV significance objectively, while taking the structural segment as the detection primitive tends to ignore the relevance between sites within the structure. Moreover, although many methods consider the read count of a CNV and the relevance between sites when computing the statistic, they do not weigh these two features reasonably and are prone to false CNV detections.

Existing CNV significance detection methods mainly have the following deficiencies:

(1) Statistics that take the single CNV site as primitive easily make the significance estimate conservative; statistics that take the CNV structural segment as primitive retain the inherent structural characteristics of copy number to some extent but ignore the correlation between internal sites, making it difficult to estimate CNV significance objectively;

(2) The frequency of CNVs and the relevance of variant sites are not reasonably weighed, so the biological manifestations linking CNVs to cancer are difficult to locate;

(3) When methods based on single-sample detection are used to detect the cCNVs of multiple samples, systematic or platform errors become serious;

(4) Multiple samples from different sequencing platforms or sequencing levels are not integrated automatically, which greatly limits the detection of CNV functional patterns that occur concurrently in multiple samples;

(5) For sample data with a low coverage level, sensitivity is low and detection performs poorly.

Summary of the invention

The purpose of the present invention is to provide a copy number variation detection method based on next-generation sequencing, which applies different normalization measures to data of different coverage so that the data are easier to process and systematic error is reduced; integrates multiple samples and proposes a significance detection theory and method that takes the CNV structural unit as primitive; and, guided by a supervised learning mechanism, establishes a null distribution consistent with the statistic so as to improve the accuracy of significance estimation.

The invention is realized as follows: a copy number variation detection method based on next-generation sequencing, comprising the following steps:

Preprocessing of copy number variation data: to address batch effects in the CNV signal and alignment quality, filter out reads with relatively low alignment quality; normalize GC content to adjust the read count at each site of the data samples; normalize the sequencing levels of multiple samples so that the data correspond to the same sequencing level; for data samples with low coverage depth, normalize the data directly to the same level; for data samples with high coverage depth, first define the copy number amplification and deletion states according to the features of the data frequency histogram;

Construction of the sliding window: integrate the normalized samples to obtain a high-dimensional matrix; construct a sliding window that, starting from the initial position, computes the frequency of each site and at the same time uses the Pearson formula to compute the correlation between sites within each window; slide the window step by step until every site has been covered;

Calculation of the statistic: for each sliding window, compute a statistic that reflects the amplification or deletion state of the copy number variation; a training set built from known copy number variation patterns is used to learn the weights of the frequency and the correlation coefficient, w₁ and w₂, with which the statistic is computed:

S_test=w₁*f+w₂*a

where f, a and S_test denote, respectively, the frequency, the correlation and the statistic value of the copy number variation patterns in the training set;

Implementation of the permutation strategy and construction of the null distribution: for the normalized samples, compute the detection statistic corresponding to each site on the whole genome and construct the null distribution T; then apply random permutation to the sample data, randomly permuting, for each sample, the positions at which its values occur on the whole genome until all s samples have been permuted, which yields one permuted sample set; for each permuted sample set, compute the statistic for the occurrence of tandem copy number variation; finally, compute the significance level of the detection statistic:

Estimation of CNV significance: the regions where CNVs occur are evaluated from the p-values obtained for all sites of the samples; if a p-value is below a set threshold (such as 0.05), the CNV is considered to have biological significance or a function in cancer. For each CNV structural unit, null distributions are established separately for the amplification and deletion states, so that the significance levels of amplification and deletion are detected separately.

Performance evaluation of the algorithm: judge whether the algorithm can obtain a higher true positive rate (TPR) while the false positive rate (FPR) is controlled; evaluate whether the algorithm can accurately estimate p-values (Type I error rate); assess the ability to detect the boundaries of copy number variations; and analyze the computational complexity of the algorithm.

Further, in filtering out reads with relatively low alignment quality to address batch effects in the CNV signal and alignment quality, the filtered reads are those below Q30.

Further, in integrating the normalized samples to obtain a high-dimensional matrix, the matrix has dimensions s × N, where s is the number of samples and N is the number of sites per sample; because a copy number variation appears as a contiguous region, the relevance between neighboring copy number variation sites is strong, up to 0.985, and the relevance between distant sites is weak.

Further, for each sliding window, the statistic is computed to reflect the amplification or deletion state of the copy number variation: for low-coverage samples, the read count frequency of each site and the correlation coefficients between that site and the other sites in the window are computed directly, and the frequency and correlation coefficient are combined to quantify the statistic (S); for high-coverage samples, the frequency histogram is used to accurately separate copy number amplification and deletion, two states with different biological functions, and the statistic (S) of each state is computed separately.

Further, in the calculation of the statistic, S_test in the training set is assigned a relative value according to the relationship between known copy number variation patterns in public databases and gene expression levels.

Further, in computing the detection statistic at each site on the whole genome for the normalized samples, constructing the null distribution T, and then applying random permutation to the sample data, the sample data form a data matrix in which each row represents a sample and each column represents a site on the whole genome.

Further, in the estimation of significance with null distributions designed on the basis of CNV length, if the p-value is below the set threshold of 0.05, the CNV has biological significance or a function in cancer, and the amplification and deletion states of the CNV have different biological functions and manifestations.

Further, in the performance evaluation of the algorithm, evaluating whether the algorithm can accurately estimate p-values means evaluating whether the statistical model of the algorithm has strong statistical meaning.

The invention solves the problem that the prior art tends toward conservative estimates of copy number variation significance. It automatically integrates multiple samples to detect regions where copy number variations occur jointly in the same region, avoiding the detection errors of prior methods that detect copy number variation regions only in single or paired samples, and studies the relationship between copy number variation and cancer at the level of the patient population. It solves the problem of copy number variation detection errors caused by differences in sequencing platform and sequencing level, making results more accurate. For the data format of next-generation sequencing, it normalizes the data using the multimodal features of the frequency histogram so as to accurately separate normal regions from copy number variation regions. When designing a statistic, the prior art considers only the read count at copy number variation sites, or considers the variant read count and the relevance between adjacent variant sites inconsistently; addressing this, the invention considers the combined effect of the variant read count and the relevance between variant sites and establishes a new model that resolves the inconsistency and objectively estimates the significance level of copy number variations.

When detecting multi-sample cCNVs, the invention integrates multiple samples, which reduces the systematic errors or sequencing platform errors caused by detecting samples one at a time with single-sample methods and greatly improves detection performance.

In the early normalization (standardization) of the data, the invention applies different processing methods to data of different sequencing levels. Whereas the prior art is insensitive on low-coverage data, the invention maintains high sensitivity regardless of sequencing coverage, which lays a foundation for subsequently improving the accuracy of copy number variation detection.

Detecting copy number variations in regions common to multiple samples requires considering not only that the regions where copy number variations occur show the same amplification or deletion signal across samples, but also the correlation between adjacent sites, which has important biological significance for detection. A statistic and a statistical test model built on these two features therefore help estimate the significance level of copy number variations in common regions more objectively. The prior art often emphasizes only the amplitude of the copy number variation region and ignores the correlation between sites. The invention considers both features in establishing the statistical test model and uses a supervised learning strategy to weigh them so that the statistic is computed reasonably; this not only makes the hypothesis test model consistent with the statistic but also strengthens both the statistical and the biological meaning of the significance estimate.

In data processing, the invention applies different normalization methods to data of different coverage levels; in particular, for high-coverage data it first defines the copy number amplification and deletion states according to the features of the data frequency histogram, separating out a normal (0)-amplification (1) data set and a normal (0)-deletion (-1) data set. When designing the statistic, the invention takes the single site as the detection primitive and combines, in the quantified statistic, the read count of the single CNV site with the relevance between sites, which fundamentally improves the accuracy of the significance estimate. The invention integrates multiple samples and uses supervised learning to weigh two features, the read count (amplitude) of genome-wide sites and the correlation between sites, so as to quantify the statistic reasonably, and constructs a hypothesis test model consistent with the statistic, thereby improving the statistical meaning of the significance estimate.
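The separation into a normal-amplification data set and a normal-deletion data set can be illustrated with a minimal sketch. It is not the procedure of the invention: the cut points `low` and `high` are hypothetical stand-ins for the thresholds that would be read from the frequency histogram, the function names are invented for this example, and treating the opposite state as normal (0) in each output set is an assumption.

```python
from typing import List, Tuple


def discretize(depths: List[float], low: float, high: float) -> List[int]:
    """Map normalized read depths to -1 (deletion), 0 (normal), 1 (amplification).

    `low` and `high` are assumed cut points; the source derives the states
    from the frequency histogram but gives no formula for the thresholds.
    """
    states = []
    for d in depths:
        if d < low:
            states.append(-1)
        elif d > high:
            states.append(1)
        else:
            states.append(0)
    return states


def split_states(states: List[int]) -> Tuple[List[int], List[int]]:
    """Return the (normal/amplification, normal/deletion) data sets.

    Assumption: in the amplification set deletions are recoded as 0,
    and in the deletion set amplifications are recoded as 0.
    """
    amplification = [1 if s == 1 else 0 for s in states]
    deletion = [-1 if s == -1 else 0 for s in states]
    return amplification, deletion
```

Applied to one sample, `discretize` yields a three-state vector and `split_states` yields the two vectors on which the amplification and deletion statistics could then be computed separately.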

Given simulated data consisting of 5 samples containing 18 concurrent copy number variations (cCNVs), the invention detects 17 cCNV regions, whereas a prior method such as FREEC, using single-sample detection followed by global comparison, detects only 15 cCNV regions. Numerous experiments also show that, compared with FREEC, the invention is more accurate in detecting the boundaries of variation regions.

Description of the drawings

Fig. 1 is a flow chart of the copy number variation detection method based on next-generation sequencing provided by an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further elaborated below with reference to embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it.

The application principle of the invention is further described below with reference to the accompanying drawing.

A copy number variation detection method based on next-generation sequencing comprises the following steps:

S101: Preprocessing of copy number variation data: to address batch effects in the CNV signal and alignment quality, filter out reads with relatively low alignment quality; normalize GC content to adjust the read count at each site of the data samples; normalize the sequencing levels of multiple samples so that the data correspond to the same sequencing level; for data samples with low coverage depth, normalize the data directly to the same level; for data samples with high coverage depth, first define the copy number amplification and deletion states according to the features of the data frequency histogram;

S102: Construction of the sliding window: integrate the normalized samples to obtain a high-dimensional matrix; construct a sliding window that, starting from the initial position, computes the frequency of each site and at the same time uses the Pearson formula to compute the correlation between sites within each window; slide the window step by step until every site has been covered;

S103: Calculation of the statistic: for each sliding window, compute a statistic that reflects the amplification or deletion state of the copy number variation; a training set built from known copy number variation patterns is used to learn the weights of the frequency and the correlation coefficient, w₁ and w₂, with which the statistic is computed:

S_test=w₁*f+w₂*a

S104: Implementation of the permutation strategy and construction of the null distribution: for the normalized samples, compute the detection statistic corresponding to each site on the whole genome and construct the null distribution T; then apply random permutation to the sample data, randomly permuting, for each sample, the positions at which its values occur on the whole genome until all s samples have been permuted, which yields one permuted sample set; for each permuted sample set, compute the statistic for the occurrence of tandem copy number variation; finally, compute the significance level of the detection statistic:

Here p-value denotes the p-value corresponding to each site of the sample, K is the number of random permutations, T is the statistic for the null distribution, and T_i is the statistic from the i-th permutation. Each time T_i is greater than T the count is incremented by one, and the p-value is obtained from the final count over the K permutations (p-value, T_i and T are all vectors).

S105: Estimation of CNV significance: the regions where CNVs occur are evaluated from the p-values obtained for all sites of the samples; if a p-value is below a set threshold (such as 0.05), the CNV is considered to have biological significance or a function in cancer. For each CNV structural unit, null distributions are established separately for the amplification and deletion states, so that the significance levels of amplification and deletion are detected separately.

S106: Performance evaluation of the algorithm: judge whether the algorithm can obtain a higher true positive rate (TPR) while the false positive rate (FPR) is controlled; evaluate whether the algorithm can accurately estimate p-values (Type I error rate); assess the ability to detect the boundaries of copy number variations; and analyze the computational complexity of the algorithm.
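The TPR and FPR named in step S106 can be computed per site when the true CNV sites are known, as in simulated data. The sketch below uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN); the function name and the boolean per-site encoding are assumptions made for this illustration and are not taken from the source.

```python
from typing import List, Tuple


def tpr_fpr(truth: List[bool], called: List[bool]) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) over genomic sites.

    truth[i]  -- True if site i really lies in a CNV region
    called[i] -- True if the detection method reported site i as CNV
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    # Guard against empty classes so the function never divides by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

Sweeping the significance threshold and recording the resulting (FPR, TPR) pairs would show whether a higher TPR is reached at a controlled FPR.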

In filtering out reads with relatively low alignment quality to address batch effects in the CNV signal and alignment quality, the filtered reads are those below Q30.

In integrating the normalized samples to obtain a high-dimensional matrix, the matrix has dimensions s × N, where s is the number of samples and N is the number of sites per sample; because a copy number variation appears as a contiguous region, the relevance between neighboring copy number variation sites is strong, up to 0.985, and the relevance between distant sites is weak.

For each sliding window, the statistic is computed to reflect the amplification or deletion state of the copy number variation: for low-coverage samples, the read count frequency of each site and the correlation coefficients between that site and the other sites in the window are computed directly, and the frequency and correlation coefficient are combined to quantify the statistic (S); for high-coverage samples, the frequency histogram is used to accurately separate copy number amplification and deletion, two states with different biological functions, and the statistic (S) of each state is computed separately.

In the calculation of the statistic, S_test in the training set is assigned a relative value according to the relationship between known copy number variation functional patterns in public databases and gene expression levels.

In computing the detection statistic at each site on the whole genome for the normalized samples, constructing the null distribution T, and then applying random permutation to the sample data, the sample data form a data matrix in which each row represents a sample and each column represents a site on the whole genome.
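The row-wise permutation described above can be sketched as follows. This is one reading of the scheme, under stated assumptions: the statistic of the unpermuted matrix is compared with the statistic of each permuted matrix, the final count is divided by the number of permutations, and `site_means` is only a placeholder statistic, since the statistic of the invention combines frequency and correlation. All function and parameter names are hypothetical.

```python
import random
from typing import Callable, List

Matrix = List[List[float]]  # rows = samples, columns = genomic sites


def site_means(data: Matrix) -> List[float]:
    """Placeholder per-site statistic: mean signal across samples."""
    n_samples = len(data)
    return [sum(column) / n_samples for column in zip(*data)]


def permutation_p_values(
    data: Matrix,
    statistic: Callable[[Matrix], List[float]] = site_means,
    k: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Estimate one p-value per site from k random permutations.

    In each round every row (sample) is shuffled independently along the
    genome; a site's counter is incremented whenever the permuted statistic
    exceeds the statistic of the unpermuted data at that site.
    """
    rng = random.Random(seed)
    observed = statistic(data)
    counts = [0] * len(observed)
    for _ in range(k):
        permuted = [rng.sample(row, len(row)) for row in data]
        for j, (t_i, t) in enumerate(zip(statistic(permuted), observed)):
            if t_i > t:
                counts[j] += 1
    # Assumption: the p-value is the exceedance count divided by k.
    return [count / k for count in counts]
```

A site that is altered in every sample receives a p-value near zero, because shuffling rarely aligns the altered positions of all samples at one site again.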

In the estimation of CNV significance, if the p-value is below the set threshold of 0.05, the CNV has biological significance or a function in cancer, and the amplification and deletion states of the CNV have different biological functions and manifestations.
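Thresholding per-site p-values can be turned into candidate regions by merging consecutive significant sites. The sketch below is an illustration under assumptions rather than the method of the invention: the merging rule, the function name `significant_regions` and the parameter `alpha` are all hypothetical.

```python
from typing import List, Tuple


def significant_regions(p_values: List[float], alpha: float = 0.05) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, inclusive, of runs of sites with p < alpha."""
    regions = []
    start = None
    for i, p in enumerate(p_values):
        if p < alpha:
            if start is None:
                start = i  # a new significant run begins here
        elif start is not None:
            regions.append((start, i - 1))  # the run ended at the previous site
            start = None
    if start is not None:
        regions.append((start, len(p_values) - 1))  # run reaches the last site
    return regions
```

Because amplification and deletion are tested against separate null distributions, the function would be called once on the amplification p-values and once on the deletion p-values.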

In the performance evaluation of the algorithm, evaluating whether the algorithm can accurately estimate p-values means evaluating whether the statistical model of the algorithm has strong statistical meaning.

The invention is further described below in connection with its application principle.

On the basis of a thorough study of the biological characteristics of copy number and of statistical theory, a statistical test model is established and a CNV significance detection algorithm is designed; the algorithm is tested repeatedly on a large amount of simulated data, and its performance is analyzed and evaluated from multiple angles.

(1) Preprocessing of copy number variation data

Appropriate preprocessing of the copy number variation sample data is important for detecting the significance of copy number variations. a) To address batch effects in the CNV signal and quality problems in alignment, reads with relatively low alignment quality (< Q30) are filtered out. b) In data produced by next-generation sequencing, sequencing coverage is affected by GC content, which in turn affects copy number variation detection; GC content is therefore normalized to adjust the read count at each site of the data samples. c) Because the sequencing levels of multiple samples may differ, the subsequent statistic cannot be computed directly; the data are meaningful only after being normalized to the same sequencing level. For data samples with low coverage depth, the data can be normalized directly to the same level; for data samples with high coverage depth, the copy number amplification and deletion states can first be defined according to the features of the data frequency histogram.
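The three preprocessing steps a) to c) can be sketched as below. The source does not specify the normalization formulas, so the sketch substitutes common choices and labels them as assumptions: GC correction by the ratio of the overall median depth to the median depth of sites with the same GC percentage, and depth normalization by scaling each sample to a common median. The read representation `(position, quality)` and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple


def filter_reads(reads: List[Tuple[int, int]], min_quality: int = 30) -> List[Tuple[int, int]]:
    """Keep reads, given as (position, quality), whose quality is at least Q30."""
    return [read for read in reads if read[1] >= min_quality]


def gc_normalize(depths: List[float], gc_percent: List[int]) -> List[float]:
    """Median-based GC correction (assumed scheme, not given in the source)."""
    overall = median(depths)
    by_gc: Dict[int, List[float]] = defaultdict(list)
    for depth, gc in zip(depths, gc_percent):
        by_gc[gc].append(depth)
    gc_median = {gc: median(values) for gc, values in by_gc.items()}
    # Leave a site unchanged when its GC class has zero median depth.
    return [
        depth * overall / gc_median[gc] if gc_median[gc] > 0 else depth
        for depth, gc in zip(depths, gc_percent)
    ]


def scale_to_common_depth(samples: List[List[float]], target: float = 1.0) -> List[List[float]]:
    """Scale each sample so that its median depth equals `target` (assumed scheme)."""
    scaled = []
    for depths in samples:
        m = median(depths)
        scaled.append([d * target / m for d in depths] if m > 0 else list(depths))
    return scaled
```

After these steps every sample is expressed on the same scale, so the rows can be stacked into the s × N matrix used by the sliding window.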

(2) Construction of the sliding window

Combining the multiple samples after standardization yields a high-dimensional matrix (number of samples s × number of sites N per sample). Because a copy number variation appears as a contiguous region, the correlation between neighboring copy number variation sites is usually strong, reaching as high as 0.985, whereas the correlation between distant sites is weak or even negligible. To calculate the correlation between sites more accurately, a sliding window is constructed: starting from the initial position, the frequency of each site is calculated and, at the same time, the correlation between the sites within each window is calculated using the Pearson formula; the window is then slid step by step until every site has been covered. The size of the sliding window has little influence on the result; it is provisionally set to 10 here, and its effect will be examined later by experiment.
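As a non-limiting illustration, the windowed Pearson correlation could be sketched in Python as follows. Summarizing each window by the mean of its pairwise correlations, sliding by one site at a time, and treating a constant site as uncorrelated are assumptions of this sketch; the description only states that the Pearson formula is applied within each window.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant site is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def window_correlations(matrix, size=10):
    """Slide a window over the sites (columns) of an s x N matrix, one site
    at a time, and return the mean pairwise Pearson correlation per window."""
    n_sites = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_sites)]
    results = []
    for start in range(max(1, n_sites - size + 1)):
        window = columns[start:start + size]
        pairs = [(i, j) for i in range(len(window))
                 for j in range(i + 1, len(window))]
        if not pairs:  # a single-site window has no pairs
            results.append(0.0)
            continue
        total = sum(pearson(window[i], window[j]) for i, j in pairs)
        results.append(total / len(pairs))
    return results
```

Each row of `matrix` is one sample and each column one site, matching the s × N layout described above; the default window size of 10 follows the provisional value given in the text.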

(3) Calculation of the statistic

For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation. Because new-generation sequencing data are affected by sequencing coverage depth, the statistic is calculated separately for low-coverage and high-coverage samples, which greatly broadens the applicability of the invention. For low-coverage samples, the read-count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage-depth samples, the frequency histogram is used to accurately distinguish the copy number amplification and deletion states, which have different biological functions, and the statistic (S) is calculated separately for the two states, which helps to better detect the significance of the copy number variation. The difficulty here is how to weigh frequency against correlation coefficient reasonably; to this end, a training set is constructed from known copy number variation patterns to learn the weights of frequency and correlation coefficient, w₁ and w₂, which are then used to calculate the statistic.

S_test=w₁*f+w₂*a

where f, a, and S_test denote, respectively, the frequency, the correlation, and the statistic value of a copy number variation pattern in the training set. Because S_test is not given explicitly in the training set, it is assigned a relative value based on the relationship between known copy number variation functional patterns and gene expression levels in public databases.
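As a non-limiting illustration, learning w₁ and w₂ from a training set could be sketched in Python as follows. The description does not state which learning procedure is used; ordinary least squares without an intercept is assumed here purely for illustration, and the function names are hypothetical.

```python
def fit_weights(f, a, s):
    """Least-squares estimate of (w1, w2) in s = w1 * f + w2 * a.

    f, a, s are equally long lists holding the frequency, the correlation,
    and the assigned statistic value of each training pattern.
    """
    sff = sum(x * x for x in f)
    saa = sum(x * x for x in a)
    sfa = sum(x * y for x, y in zip(f, a))
    sfs = sum(x * y for x, y in zip(f, s))
    sas = sum(x * y for x, y in zip(a, s))
    det = sff * saa - sfa * sfa
    if det == 0:
        raise ValueError("frequency and correlation are collinear")
    # Solve the 2 x 2 normal equations by Cramer's rule.
    w1 = (sfs * saa - sfa * sas) / det
    w2 = (sff * sas - sfa * sfs) / det
    return w1, w2


def statistic(f, a, w1, w2):
    """S = w1 * f + w2 * a for a single site or window."""
    return w1 * f + w2 * a
```

Once the weights are fitted, `statistic` evaluates the formula above for any new site from its frequency and correlation values.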

(4) Implementation of the permutation strategy and construction of the null distribution

For the multiple samples after standardization, the detection statistic corresponding to each site on the whole genome is calculated, and the null distribution T is constructed. Random permutation is then applied to the sample data (each row of the data matrix represents one sample, and each column represents one site on the whole genome). The procedure is as follows: a) for each sample, the positions at which its values occur along the whole genome are randomly permuted, until all s samples have been permuted, forming one complete permuted sample set; b) for each permuted sample set, the statistic of copy number variation occurrence is calculated; c) finally, the significance of the detection statistic is calculated:
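As a non-limiting illustration, the permutation procedure could be sketched in Python as follows. The sketch assumes that the p value of a site is the fraction of the K permuted statistics T_i^* that exceed the observed statistic T at that site; the counting rule follows the symbol description in claim 1, while the division by K is an assumption. The per-site mean used as the default statistic is a placeholder, not the statistic S of step (3).

```python
import random


def site_mean(matrix):
    """Placeholder statistic: mean over samples at each site (assumption)."""
    n_samples = len(matrix)
    n_sites = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_samples for j in range(n_sites)]


def permutation_p_values(matrix, statistic=site_mean, K=1000, seed=0):
    """Per-site empirical p values from K random permutations.

    Each permutation shuffles the site positions of every sample (row)
    independently, then recomputes the statistic T_i^* for every site.
    """
    rng = random.Random(seed)
    T = statistic(matrix)
    counts = [0] * len(T)
    for _ in range(K):
        permuted = [rng.sample(row, len(row)) for row in matrix]
        T_star = statistic(permuted)
        for j, (t_star, t) in enumerate(zip(T_star, T)):
            if t_star > t:
                counts[j] += 1  # count permutations that exceed the observed value
    return [c / K for c in counts]
```

A site whose observed statistic is never exceeded under permutation receives a p value of 0 in this sketch; a larger K gives a finer resolution of small p values.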

(5) Null distribution design based on CNV length and estimation of significance

The region in which a CNV occurs is evaluated by the p values corresponding to all sites of the sample; if the p value is less than a set threshold (for example 0.05), the CNV is considered to have biological significance or a cancer-related function. In addition, considering that the amplification and deletion states of a CNV have different biological functions and manifestations, null distributions are established separately for the amplification and deletion states of each CNV structural unit, so that the significance of amplification and of deletion is detected separately.
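As a non-limiting illustration, turning per-site p values into candidate CNV regions could be sketched in Python as follows. Merging consecutive significant sites into one region is an assumption of this sketch; the description only states that the region is evaluated from the per-site p values against the threshold.

```python
def significant_regions(p_values, alpha=0.05):
    """Return (start, end) index pairs, inclusive, of runs of consecutive
    sites whose p value is below alpha."""
    regions = []
    start = None
    for i, p in enumerate(p_values):
        if p < alpha:
            if start is None:
                start = i  # a new significant run begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:  # close a run that reaches the last site
        regions.append((start, len(p_values) - 1))
    return regions
```

To treat amplification and deletion separately, the same function could be applied once to p values obtained from the amplification null distribution and once to those obtained from the deletion null distribution.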

(6) Performance evaluation of the algorithm

The present invention intends to evaluate the performance of the algorithm in the following four respects: a) whether the algorithm can obtain a high true positive rate (TPR) while the false positive rate (FPR) is kept under control; b) whether the algorithm can accurately estimate the p value (Type I error rate), that is, whether the statistical model of the algorithm has strong statistical significance; c) the ability to detect the boundaries of copy number variations; d) the computational complexity of the algorithm.
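As a non-limiting illustration, criteria a) and b) could be computed on simulated data in Python as follows. The per-site boolean encoding of calls and ground truth, and the function names, are assumptions of this sketch.

```python
def tpr_fpr(called, truth):
    """True and false positive rates from per-site boolean calls.

    called[i] is True if site i was reported as a CNV;
    truth[i] is True if site i really lies in a simulated CNV.
    """
    tp = sum(1 for c, t in zip(called, truth) if c and t)
    fp = sum(1 for c, t in zip(called, truth) if c and not t)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


def type_one_error_rate(null_p_values, alpha=0.05):
    """Fraction of p values below alpha among sites with no true CNV.

    If the p values are well calibrated, this fraction should be close
    to alpha.
    """
    return sum(1 for p in null_p_values if p < alpha) / len(null_p_values)
```

Comparing `type_one_error_rate` against the chosen threshold on data simulated without any CNV gives a direct check of whether the p values are estimated accurately.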

The normal-cell copy numbers detected with 1000 Affymetrix genome-wide SNP 6.0 arrays are intended to be used as the background. Taking NGS technology and data characteristics into account, a Markov CNV simulation method is constructed based on probability theory and a non-stationary model, and large-scale NGS-based CNV data are simulated to test the performance of the present invention. Preliminary simulation experiments show that the algorithm has a high boundary detection capability while maintaining a high TPR.
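As a non-limiting illustration, a much simplified Markov simulation of CNV states and read counts could be sketched in Python as follows. This is not the simulation method of the invention: the three-state chain with a fixed probability of staying in the current state (a homogeneous chain rather than a non-stationary model), the copy numbers 1, 2 and 3, the Gaussian noise, and all parameter values are assumptions chosen for brevity.

```python
import random

# Assumed copy number of each state (hypothetical values).
COPY_NUMBER = {"deletion": 1, "normal": 2, "amplification": 3}


def simulate_states(n_sites, stay=0.99, seed=0):
    """Generate a state per site with a first-order Markov chain.

    With probability `stay` the next site keeps the current state;
    otherwise it switches to one of the other two states at random.
    """
    rng = random.Random(seed)
    states = ["normal"]
    for _ in range(n_sites - 1):
        current = states[-1]
        if rng.random() < stay:
            states.append(current)
        else:
            others = [s for s in COPY_NUMBER if s != current]
            states.append(rng.choice(others))
    return states


def simulate_read_counts(states, depth=30.0, noise_sd=3.0, seed=0):
    """Read depth proportional to copy number, plus Gaussian noise."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(depth * COPY_NUMBER[s] / 2, noise_sd))
            for s in states]
```

Because the simulated states are known, the output can serve as ground truth when computing TPR, FPR, and boundary accuracy for a detection algorithm.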

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A copy number variation detection method based on new-generation sequencing, characterized in that the method comprises the following steps:

Preprocessing of copy number variation data: filtering out reads of low alignment quality to address batch effects in the CNV signal and quality problems in the alignment process; standardizing the GC content to adjust the read count at each site of the data sample; normalizing the sequencing levels of the multiple samples so that the data correspond to the same sequencing level; for data samples with low coverage depth, normalizing the data directly to the same level; for data samples with high coverage depth, first defining the copy number amplification and deletion states according to the features of the data frequency histogram;

Construction of the sliding window: combining the multiple samples after standardization to obtain a high-dimensional matrix; constructing a sliding window that, starting from the initial position, calculates the frequency of each site and at the same time calculates the correlation between the sites within each window using the Pearson formula, and sliding the window step by step until every site has been covered, thereby calculating the correlation between sites;

Calculation of the statistic: calculating, for each sliding window, a statistic reflecting the amplification or deletion state of the copy number variation; constructing a training set from known copy number variation patterns and learning the weight w₁ of the frequency and the weight w₂ of the correlation coefficient in order to calculate the statistic,

S_test=w₁*f+w₂*a

where f, a, and S_test denote, respectively, the frequency, the correlation, and the statistic value of a copy number variation pattern in the training set;

Implementation of the permutation strategy and construction of the null distribution: calculating, for the multiple samples after standardization, the detection statistic corresponding to each site on the whole genome and constructing the null distribution T; then applying random permutation to the sample data by randomly permuting, for each sample, the positions at which its values occur along the whole genome, until all s samples have been permuted, forming one complete permuted sample set; calculating, for each permuted sample set, the statistic of copy number variation occurrence; and finally calculating the significance of the detection statistic:

where p-value denotes the p value corresponding to each site of the sample, K is the number of random permutations, T is the statistic of the null distribution, and T_i^* is the statistic of the i-th permutation; if T_i^* is greater than T, the count is incremented by one, and the p value is thereby obtained; p-value, T_i^*, and T are vectors;

Estimation of CNV significance: evaluating the region in which a CNV occurs by the p values corresponding to all sites of the sample, the CNV being considered to have biological significance if the p value is less than the set threshold of 0.05; and establishing, for each CNV structural unit, null distributions separately for the amplification and deletion states, so as to detect the significance of amplification and of deletion separately;

Performance evaluation of the algorithm: judging whether the algorithm can obtain a true positive rate while the false positive rate is kept under control; evaluating whether the algorithm can estimate the p value; evaluating the ability to detect the boundaries of copy number variations; and analyzing the computational complexity of the algorithm.

2. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that the reads of low alignment quality that are filtered out to address batch effects in the CNV signal and quality problems in the alignment process are reads below Q30.

3. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that the high-dimensional matrix obtained by combining the multiple samples after standardization has dimensions of number of samples s × number of sites N per sample.

4. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, for each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation; for low-coverage samples, the read-count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic S; for high-coverage-depth samples, the frequency histogram is used to accurately distinguish the copy number amplification and deletion states, which have different biological functions, and the statistic S is calculated separately for the two states.

5. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, in the calculation of the statistic, S_test in the training set is assigned a relative value based on the relationship between known copy number variation patterns and gene expression levels in public databases.

6. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, in the step of calculating, for the multiple samples after standardization, the detection statistic corresponding to each site on the whole genome, constructing the null distribution T, and then applying random permutation to the sample data, each row of the data matrix of the sample data represents one sample and each column represents one site on the whole genome.

7. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, in the null distribution design based on CNV length and the estimation of significance, if the p value is less than the set threshold of 0.05, the CNV has biological significance, and the amplification and deletion states of the CNV have different biological functions and manifestations.