CN105760712B - A copy number variation detection method based on next-generation sequencing - Google Patents
A copy number variation detection method based on next-generation sequencing
- Publication number
- CN105760712B (application number CN201610114354.8A)
- Authority
- CN
- China
- Prior art keywords
- copy number
- sample
- cnv
- site
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Engineering & Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Chemical & Material Sciences (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
The invention discloses a copy number variation detection method based on next-generation sequencing. The method comprises preprocessing of copy number variation data, construction of a sliding window, calculation of a statistic, implementation of a permutation strategy and construction of a null distribution, and performance evaluation of the algorithm. The performance evaluation judges whether the algorithm achieves a high true positive rate while the false positive rate is kept under control, assesses whether the algorithm estimates p-values accurately, assesses its ability to detect the boundaries of copy number variations, and analyzes its computational complexity. The invention solves the problem of copy number variation detection errors caused by differences in sequencing platform and sequencing level, making the results more accurate; it normalizes the data using the features of the multimodal frequency histogram so as to separate normal regions from copy number variation regions accurately; and it considers the combined effect of the variant read count and the correlation between variant sites, establishing a new model that resolves the inconsistency problem and objectively estimates the significance level of copy number variation.
Description
Technical field
The invention belongs to the field of high-throughput sequencing of DNA molecules, and in particular relates to a copy number variation detection method based on next-generation sequencing.
Background art
Copy number variation (CNV) is an important phenomenon in cancer genomes. It mainly takes two forms, copy number amplification and copy number deletion, and is closely tied to the emergence and development of cancer cells. Detecting CNVs that recur in the same region across multiple cancer samples, integrating an analysis of how CNVs affect genome-wide expression, and identifying the cancer genes whose expression is affected by CNVs are of great significance for studying the occurrence and metastasis of cancer. Although CNV detection methods based on a single sample are increasingly mature, their detection sensitivity still falls short of what is needed to detect regions where CNVs occur jointly in multiple samples. Systematic CNV analysis therefore provides an important route to studying the pathogenesis of cancer at the molecular level, and its most fundamental and central problem is how to detect CNVs associated with tumor-related genes in multiple cancer samples.
Next-generation sequencing (NGS) is a high-throughput sequencing technology that can obtain a million or even millions of short sequence reads at a time, and has the advantages of high speed, high resolution, low cost, and high repeatability. Studying CNV detection on the basis of NGS data therefore greatly improves speed and accuracy while also reducing cost.
Numerous studies have shown that CNV functional patterns are often hidden in regions that vary consistently across cancer genome samples, and that in NGS the number of reads aligned to each region of the genome is proportional to the copy number of that region. Establishing a computational method grounded in statistical theory to detect the significance level of CNVs that are common to multiple cancer samples therefore provides a direct and feasible technical means of identifying CNV functional patterns and discovering potential cancer genes, and in turn gives biologists and physicians important information for predicting and diagnosing cancer. Establishing a reasonable and effective statistical test model is thus essential.
The density of genome-wide CNV sites in high-throughput data and the complexity of their structure pose great challenges for building a statistical test model and for detecting CNV significance, mainly in the following two respects. First, difficulties inherent in the problem: a) the number of sites exceeds 1.8 million while the number of samples is usually small, giving a high-dimensional, small-sample data pattern; b) differences in sequencing platform and sequencing level introduce systematic errors, so samples with different sequencing levels must be normalized; c) the read signal at a genomic site (read depth, RD) is easily affected by noise such as sequencing errors and alignment errors; d) CNV sites are strongly correlated rather than independent, so the factors being tested interact; e) detecting a copy number amplification or deletion state requires considering two features, the read count at each site and the correlation between sites, which calls for a mechanism that balances the two reasonably. Second, challenges in the theory and methods for solving the problem: a) the data scale is large, so effectively controlling the time and space complexity of the computation is a challenge; b) fully accounting for the correlation between CNV sites so as to reduce the conservativeness of CNV significance estimation is difficult; c) establishing a null hypothesis distribution that is consistent with the statistic, so as to strengthen the statistical meaning of the significance estimate, is a key problem that has not yet been solved.
In terms of sample size, existing copy number variation detection methods fall into those based on single-sample analysis and those based on multiple samples. In terms of technology, the main approaches are detection based on fluorescence in situ hybridization, comparative genomic hybridization based on microarrays, and copy number detection based on next-generation sequencing. The first two have very low resolution and have difficulty detecting short CNVs, whereas NGS-based methods stand out because of their high throughput. NGS-based CNV detection follows two main technical routes: paired-end mapping (PEM) signatures and depth of coverage (DOC). PEM-based methods can detect small CNVs but have difficulty detecting large insertions (copy number amplifications) and CNVs in complex regions (such as SDs). DOC-based methods can detect large CNVs. Some methods therefore combine the two; CNVer, for example, improves the breakpoint accuracy of CNV regions by integrating DOC and PEM signatures. DOC-based methods are currently the more favored.
Segmentation-based DOC detection models rely on various segmentation methods, such as CBS and LASSO, and different segmentation methods yield different results. ReadDepth, for example, uses the CBS segmentation algorithm to identify copy number variation boundaries more accurately and retains high sensitivity and specificity on low-coverage data. The FREEC method is not constrained by the need for a control sample and uses LASSO regression to locate CNV boundaries precisely, but it ignores part of the variation in read counts, which easily leads to false detections; in addition, its GC-content normalization may be affected by subclonality, which in turn affects CNV detection. The Segseq and rSW-seq methods compare directly against a control sample and can therefore detect and identify CNV regions quickly and accurately, but they do not take into account the local characteristics of multiple samples, which leads to large errors in the results. Because of the local characteristics of sequencing technologies and of the genome, segmentation algorithms tend to give relatively high false positive rates. SeqCNA likewise does not require a control sample and uses LOESS or polynomial fitting, making it suitable for detecting small local CNVs, but it is not suitable for cancer sample data.
DOC statistical significance models based on hypothesis testing involve two key elements, the test statistic and the null distribution; the quality of their design directly affects the validity of the significance estimate and the performance of CNV functional pattern identification. The EWT method fits a Gaussian probability distribution to the RD of consecutive segments (windows) and detects CNVs with a one-sided Z-test; it can detect large copy number variation regions, but it does not consider the correlation between sites, cannot precisely locate insertions (CNVs), and is insensitive to small CNVs. The CNV-seq method fits a Poisson distribution model to the RD ratio (relative to a reference sample) of non-overlapping segments (windows), computes the significance of the Z-score, and introduces a segmentation algorithm to detect CNVs; this improves sensitivity on low-coverage data but tends to raise the false positive rate. CNA-seg is an HMM method based on Segseq and JointSLM that also introduces the chi-square (χ²) statistic to detect CNVs.
Methods for detecting CNVs common to multiple samples based on DOC are not yet mature. The main ones are the CMDS method [17], the cn.MOPS method, the JointSLM method, and methods based on penalized sparse regression models. CMDS builds a correlation diagonal matrix for each site across multiple samples and computes its significance to detect CNVs; compared with testing samples individually it is more accurate, and it also improves the trade-off in time and space complexity. cn.MOPS reduces the influence of technical and biological noise and is suited to detecting CNVs whose amplitude differs between samples in the same region, but it is insensitive to CNVs of consistent amplitude. JointSLM extends EWT to multiple samples and introduces a hidden Markov model (HMM) to detect CNVs, but it fails when a common CNV occurs in only some of the samples. The penalized regression approach fits a penalized regression model to the RD signals of multiple samples, converts the detection of common CNV (cCNV) boundaries into a change-point detection problem, and applies a significance test, thereby improving accuracy and lowering the false discovery rate. Its accuracy drops, however, when the samples differ in ancestry.
A comparative analysis of these existing DOC-based models [3,7,9-27] shows that most of them produce a very high false discovery rate, which is especially pronounced when no reference sample is available. When designing the statistic, existing NGS-based significance models all take CNV structural segments as the detection primitive and quantify the statistic using the frequency and amplitude of CNVs and the correlation between CNV sites. For the construction of the null distribution, most methods rely on a random permutation strategy.
From the standpoint of the biological characteristics of CNV data, CNV sites are not independent: neighboring CNV sites form an organic whole. Taking single sites as the detection primitive therefore makes it hard to estimate CNV significance objectively, while taking structural segments as the primitive tends to ignore the correlation between sites inside the structure. Moreover, although many methods consider both the read count of a CNV and the correlation between sites when computing the statistic, they do not balance these two features reasonably and are prone to false CNV calls.
Existing CNV significance detection methods have the following main shortcomings:
(1) A statistic that takes single CNV sites as the primitive tends to make the significance estimate conservative. Taking CNV structural segments as the primitive preserves the inherent structure of the copy number to some extent but ignores the correlation between internal sites, making it hard to estimate the CNV significance level of the statistic objectively.
(2) The frequency of CNVs and the correlation between variant sites are not balanced reasonably, which makes it hard to locate the biological manifestations linking CNVs to cancer.
(3) When methods based on single-sample detection are used to detect cCNVs in multiple samples, systematic errors or platform errors are severe.
(4) Multiple samples from different sequencing platforms or sequencing levels are not integrated automatically, which severely limits the detection of CNV functional patterns common to multiple samples.
(5) For sample data with low coverage, the methods are insensitive and detection performance is poor.
Summary of the invention
The object of the present invention is to provide a copy number variation detection method based on next-generation sequencing that applies different normalization measures to data of different coverage, making the data easier to work with and reducing systematic error; integrates multiple samples and proposes a significance detection theory and method that takes CNV structural units as the primitive; and, guided by a supervised learning mechanism, establishes a null distribution consistent with the statistic so as to improve the accuracy of the significance estimate.
The invention is realized as follows. A copy number variation detection method based on next-generation sequencing comprises the following steps:
Preprocessing of copy number variation data: filter out batch effects in the CNV signal and reads with relatively low alignment quality; adjust the read count at each site of the data samples by normalizing for GC content; normalize the sequencing levels of the multiple samples so that the data correspond to the same sequencing level. For data samples with low coverage depth, the data are normalized directly to the same level; for data samples with high coverage depth, the copy number amplification and deletion states are first defined from the features of the data frequency histogram;
Construction of the sliding window: integrate the normalized samples to obtain a high-dimensional matrix; construct a sliding window and, starting from the initial position, compute the frequency of each site while using the Pearson formula to compute the correlation between sites within each window; slide the window step by step until every site has been covered;
Calculation of the statistic: for each site in each sliding window, compute a statistic that reflects the amplification or deletion state of the copy number variation. A training set is constructed from known copy number variation patterns to learn the weights w_1 and w_2 of the frequency and the correlation coefficient, and the statistic is computed as

S_test = w_1 * f + w_2 * a

where f, a, and S_test denote, respectively, the frequency, the correlation, and the statistic value of the copy number variation patterns in the training set;
Implementation of the permutation strategy and construction of the null distribution: for the normalized samples, compute the detection statistic for each site on the whole genome and construct the null distribution T; then apply random permutation to the sample data: for each sample, randomly permute the positions at which it occurs on the whole genome, until all s samples have been permuted, which yields one permuted sample set; for each permuted sample set, compute the statistic for the occurrence of concurrent copy number variation; finally, compute the significance level of the detection statistic;
Estimation of the CNV significance level: the regions where CNVs occur are evaluated using the p-values obtained for all sites of the samples; if a p-value is smaller than a set threshold (for example 0.05), the CNV is considered to have biological significance or a function in cancer. For each CNV structural unit, separate null distributions are established for the amplification and deletion states, so that the significance levels of amplification and deletion are detected separately.
Performance evaluation of the algorithm: judge whether the algorithm can achieve a high true positive rate (TPR) while the false positive rate (FPR) is kept under control; assess whether the algorithm estimates p-values accurately (Type I error rate); assess its ability to detect the boundaries of copy number variations; and analyze its computational complexity.
Further, in filtering out batch effects in the CNV signal and reads with relatively low alignment quality, the reads filtered out are those with quality below Q30 (reads < Q30).
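The quality filter described above can be illustrated with a minimal sketch. It assumes that each read carries a single Phred-scaled quality score and that "Q30" means a score of 30; the `Read` record, its `quality` field, and `filter_low_quality` are hypothetical names chosen for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    name: str
    position: int
    quality: int  # assumed Phred-scaled quality score


def filter_low_quality(reads: List[Read], min_quality: int = 30) -> List[Read]:
    """Keep only reads whose quality is at least min_quality (drops reads < Q30)."""
    return [r for r in reads if r.quality >= min_quality]


# toy input: r2 falls below the threshold and is removed
reads = [Read("r1", 100, 37), Read("r2", 150, 12), Read("r3", 220, 30)]
kept = filter_low_quality(reads)
```

Whether the threshold should apply to base quality or to alignment quality is not stated in the text, so the single `quality` field here stands in for whichever score is intended.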
Further, in integrating the normalized samples to obtain a high-dimensional matrix, the matrix has dimensions s × N, where s is the number of samples and N is the number of sites per sample. Within a region exhibiting copy number variation, the correlation between copy number variant sites is strong, reaching up to 0.985, while the correlation between sites farther apart is weaker.
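As a rough illustration of the s × N matrix and the Pearson correlation between sites inside one window, the following sketch uses NumPy. The function name `window_site_correlation`, the window width, and the toy data are assumptions made for the example only.

```python
import numpy as np


def window_site_correlation(matrix: np.ndarray, start: int, width: int) -> np.ndarray:
    """Pearson correlation between the sites (columns) of one sliding window.

    matrix has shape (s, N): s samples by N sites.
    Returns a (width, width) correlation matrix.
    """
    window = matrix[:, start:start + width]
    # rowvar=False treats each column (site) as a variable
    return np.corrcoef(window, rowvar=False)


rng = np.random.default_rng(0)
s, N = 5, 8
base = rng.normal(size=(s, 1))
# toy data: the first 4 sites share a common signal, the last 4 are independent noise
matrix = np.hstack([base + 0.05 * rng.normal(size=(s, 4)), rng.normal(size=(s, 4))])
corr = window_site_correlation(matrix, start=0, width=4)
```

Sliding the window amounts to calling the function with increasing `start` values until every site has been covered.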
Further, for each sliding window a statistic is computed to reflect the amplification or deletion state of the copy number variation. For low-coverage samples, the read count frequency at each site and the correlation coefficients between that site and the other sites in the window are computed directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For high-coverage samples, the frequency histogram is used to separate accurately the amplification and deletion states of the copy number, which have different biological functions and manifestations, and the statistic (S) is computed for each state separately.
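One possible reading of the histogram-based separation for a high-coverage sample is sketched below. It is only an assumption-laden illustration: the tallest histogram bin is taken as the normal copy number level, and the relative margin `delta`, the bin count, and the function name `split_states` are hypothetical choices that the patent does not specify.

```python
import numpy as np


def split_states(depth: np.ndarray, bins: int = 20, delta: float = 0.25):
    """Label the sites of one high-coverage sample as -1, 0, or 1.

    The centre of the tallest histogram bin is taken as the normal level
    (an assumption); sites more than `delta` (relative) above or below it
    are labelled amplification (1) or deletion (-1).
    """
    counts, edges = np.histogram(depth, bins=bins)
    k = int(np.argmax(counts))
    normal = 0.5 * (edges[k] + edges[k + 1])  # centre of the main peak
    states = np.zeros(depth.shape, dtype=int)
    states[depth > normal * (1 + delta)] = 1
    states[depth < normal * (1 - delta)] = -1
    # two data sets: normal/amplification only and normal/deletion only
    amp_only = np.where(states == 1, 1, 0)
    del_only = np.where(states == -1, -1, 0)
    return states, amp_only, del_only


depth = np.array([100, 102, 98, 101, 99, 150, 152, 100, 50, 49, 101, 100], dtype=float)
states, amp_only, del_only = split_states(depth)
```

The two returned arrays correspond to keeping the amplification and deletion states apart so that a statistic can be computed for each state separately.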
Further, in the calculation of the statistic, the value of S_test in the training set is assigned a relative value according to the relationship between known copy number variation patterns in public databases and gene expression levels.
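The text does not say which learner is used to obtain the weights. As a minimal sketch, assuming ordinary least squares is an acceptable stand-in, the weights in S_test = w_1 * f + w_2 * a could be fitted as follows; `learn_weights`, `statistic`, and the training values are hypothetical.

```python
import numpy as np


def learn_weights(f: np.ndarray, a: np.ndarray, s_test: np.ndarray):
    """Fit w1, w2 in S_test = w1*f + w2*a by ordinary least squares (assumed learner)."""
    X = np.column_stack([f, a])
    w, *_ = np.linalg.lstsq(X, s_test, rcond=None)
    return float(w[0]), float(w[1])


def statistic(f, a, w1, w2):
    """Weighted combination of frequency f and correlation a."""
    return w1 * f + w2 * a


# hypothetical training set generated with w1 = 0.6 and w2 = 0.4
f_train = np.array([0.2, 0.8, 0.5, 1.0, 0.0])
a_train = np.array([0.9, 0.3, 0.7, 0.1, 0.5])
s_train = 0.6 * f_train + 0.4 * a_train
w1, w2 = learn_weights(f_train, a_train, s_train)
```

Once fitted, `statistic(f, a, w1, w2)` can be applied to the frequency and correlation computed for each site or window of the test data.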
Further, in computing the detection statistic for each site on the whole genome for the normalized samples, constructing the null distribution T, and then applying random permutation to the sample data, the sample data take the form of a data matrix in which each row represents one sample and each column represents one site on the whole genome.
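The random permutation of the data matrix can be sketched as an independent shuffle of the site order within every row. This is an illustration under that assumption; `permute_samples` is a hypothetical helper name.

```python
import numpy as np


def permute_samples(matrix: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Independently shuffle the site order within every sample (row)."""
    permuted = np.empty_like(matrix)
    for i in range(matrix.shape[0]):
        permuted[i] = rng.permutation(matrix[i])
    return permuted


rng = np.random.default_rng(42)
matrix = np.arange(15).reshape(3, 5)  # 3 samples by 5 sites
permuted = permute_samples(matrix, rng)
```

Each call produces one permuted sample set; repeating the call K times gives the K sets from which permuted statistics are computed.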
Further, in the null distribution design based on CNV length and the estimation of the significance level, if the p-value is less than the set threshold of 0.05, the CNV has biological significance or a function in cancer; the amplification and deletion states of the CNV have different biological functions and manifestations.
Further, in the performance evaluation of the algorithm, assessing whether the algorithm estimates p-values accurately means assessing whether the statistical model of the algorithm has strong statistical meaning.
The invention solves the problem that the prior art tends to be conservative when estimating the significance of copy number variation. It automatically integrates multiple samples to detect the regions where copy number variation occurs jointly in the same area, avoiding the detection errors of prior art that detects copy number variation regions only in a single sample or a paired sample, and allowing the relationship between copy number variation and cancer to be studied across a patient population. It solves the problem of copy number variation detection errors caused by differences in sequencing platform and sequencing level, making the results more accurate. For the next-generation sequencing data format, it normalizes the data using the features of the multimodal frequency histogram so as to separate normal regions from copy number variation regions accurately. The prior art considers only the read count at copy number variant sites, or is inconsistent in how it accounts for the variant read count and the correlation between adjacent variant sites when designing the statistic; to address this, the invention considers the combined effect of the variant read count and the correlation between variant sites and establishes a new model that resolves the inconsistency, so as to estimate the significance level of copy number variation objectively.
When detecting cCNVs in multiple samples, the invention integrates the samples, reducing the systematic errors or sequencing platform errors caused by detecting samples one at a time with single-sample methods and greatly improving detection performance.
In the early normalization (standardization) of the data, the invention applies different processing methods to data of different sequencing levels. Whereas the prior art is insensitive on low-coverage data, the invention has high sensitivity regardless of sequencing coverage, which lays the foundation for improving the accuracy of subsequent copy number variation detection.
Detecting copy number variation in a region common to multiple samples requires considering not only whether the samples show the same amplification or deletion signal in the region where the variation occurs; the correlation between adjacent sites is also biologically important for detection. A statistic and statistical test model built on both features therefore helps estimate the significance level of copy number variation in the common region more objectively. The prior art often emphasizes only the amplitude of the copy number variation region and ignores the correlation between sites. The invention therefore considers both features together, establishes a statistical test model, and uses a supervised learning strategy to balance the two features so that the statistic is computed reasonably. This makes the hypothesis test model consistent with the statistic and strengthens both the statistical and the biological meaning of the significance estimate.
In data processing, the invention applies different normalization methods to data of different coverage levels. For high-coverage data in particular, the copy number amplification and deletion states are first defined from the features of the data frequency histogram, and a data set containing only normal (0) and amplification (1) states and a data set containing only normal (0) and deletion (-1) states are separated out. When designing the statistic, the invention takes single sites as the detection primitive and quantifies the statistic by combining the read count at each CNV site with the correlation between sites, which fundamentally improves the accuracy of the significance estimate. The invention integrates multiple samples and uses supervised learning to balance the two features, the read count (amplitude) at genome-wide sites and the correlation between sites, so as to quantify the statistic reasonably, and constructs a hypothesis test model consistent with the statistic, thereby improving the statistical meaning of the significance estimate.
On simulated data consisting of 5 samples containing 18 concurrent copy number variations (cCNVs), the invention detects 17 cCNV regions, whereas prior art such as FREEC, using single-sample detection followed by global comparison, detects only 15 cCNV regions. Repeated experiments also show that, compared with FREEC, the invention detects the boundaries of variation regions more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of the copy number variation detection method based on next-generation sequencing provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit it.
The application principle of the invention is further described below with reference to the accompanying drawing.
A copy number variation detection method based on next-generation sequencing comprises the following steps:
S101: Preprocessing of copy number variation data: filter out batch effects in the CNV signal and reads with relatively low alignment quality; adjust the read count at each site of the data samples by normalizing for GC content; normalize the sequencing levels of the multiple samples so that the data correspond to the same sequencing level. For data samples with low coverage depth, the data are normalized directly to the same level; for data samples with high coverage depth, the copy number amplification and deletion states are first defined from the features of the data frequency histogram;
S102: Construction of the sliding window: integrate the normalized samples to obtain a high-dimensional matrix; construct a sliding window and, starting from the initial position, compute the frequency of each site while using the Pearson formula to compute the correlation between sites within each window; slide the window step by step until every site has been covered;
S103: Calculation of the statistic: for each sliding window, compute a statistic that reflects the amplification or deletion state of the copy number variation. A training set is constructed from known copy number variation patterns to learn the weights w_1 and w_2 of the frequency and the correlation coefficient, and the statistic is computed as

S_test = w_1 * f + w_2 * a

where f, a, and S_test denote, respectively, the frequency, the correlation, and the statistic value of the copy number variation patterns in the training set;
S104: Implementation of the permutation strategy and construction of the null distribution: for the normalized samples, compute the detection statistic for each site on the whole genome and construct the null distribution T; then apply random permutation to the sample data: for each sample, randomly permute the positions at which it occurs on the whole genome, until all s samples have been permuted, which yields one permuted sample set; for each permuted sample set, compute the statistic for the occurrence of concurrent copy number variation; finally, compute the significance level of the detection statistic. Here p-value denotes the p-value of each site of the samples, K is the number of random permutations, and T is the statistic under the null distribution. Whenever the statistic of the i-th permutation is greater than T, the count is incremented by one, and the p-value is finally obtained from this count (the p-value, the permutation statistic, and T are all vectors).
S105: Estimation of the CNV significance level: the regions where CNVs occur are evaluated using the p-values obtained for all sites of the samples; if a p-value is smaller than a set threshold (for example 0.05), the CNV is considered to have biological significance or a function in cancer. For each CNV structural unit, separate null distributions are established for the amplification and deletion states, so that the significance levels of amplification and deletion are detected separately.
S106: Performance evaluation of the algorithm: judge whether the algorithm can achieve a high true positive rate (TPR) while the false positive rate (FPR) is kept under control; assess whether the algorithm estimates p-values accurately (Type I error rate); assess its ability to detect the boundaries of copy number variations; and analyze its computational complexity.
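For the performance evaluation step, the true positive rate and false positive rate can be computed from per-site calls when the ground truth is known, for example on simulated data. The sketch below assumes boolean per-site labels and a 0.05 threshold; `tpr_fpr` and the toy arrays are hypothetical.

```python
import numpy as np


def tpr_fpr(truth: np.ndarray, called: np.ndarray):
    """True positive rate and false positive rate for boolean per-site calls."""
    tp = np.sum(called & truth)
    fp = np.sum(called & ~truth)
    fn = np.sum(~called & truth)
    tn = np.sum(~called & ~truth)
    return tp / (tp + fn), fp / (fp + tn)


# toy ground truth (True = site lies in a simulated CNV) and toy p-values
truth = np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=bool)
p_values = np.array([0.01, 0.20, 0.03, 0.50, 0.04, 0.90, 0.60, 0.001])
called = p_values < 0.05
tpr, fpr = tpr_fpr(truth, called)
```

Sweeping the threshold and recording the resulting (FPR, TPR) pairs shows how much sensitivity is obtained at a given controlled false positive rate.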
In filtering out reads of relatively low alignment quality to address the batch effect of the CNV signal and quality problems in the alignment process, the filtered reads are those below Q30 (reads < Q30).
In integrating the multiple standardized samples to obtain a high-dimensional matrix, the matrix has dimensions s × N, where s is the number of samples and N is the number of sites per sample. Because copy number variation appears as a contiguous region, the correlation between neighboring copy number variation sites is relatively strong, up to 0.985, while the correlation between more distant sites is weaker.
For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation. For low-coverage samples, the read-count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For samples with high coverage depth, a frequency histogram is used to distinguish accurately between copy number amplification and deletion, two states with different biological functions and manifestations, and the statistic (S) is calculated separately for each state.
In the calculation of the statistic, $S_{test}$ in the training set is intended to be assigned a relative value according to the relationship between known copy number variation functional patterns in public databases and gene expression levels.
In calculating, for the multiple standardized samples, the detection statistic of each site across the whole genome, constructing the null distribution T and then applying random permutation to the sample data, the sample data are organized so that each row of the data matrix represents one sample and each column represents one site on the whole genome.
In the estimation of significance level based on CNV, if the p-value is less than the set threshold of 0.05, the CNV has biological significance or a cancer-related function; the amplification and deletion states of the CNV have different biological functions and manifestations.
In the performance evaluation of the algorithm, evaluating whether the algorithm can accurately estimate the p-value means evaluating whether the statistical model of the algorithm has strong statistical significance.
The invention is further described below with reference to its application principle.
On the basis of a thorough study of the biological characteristics of copy number and of statistical theory, a statistical test model is established and a detection algorithm for the CNV significance level is designed; the algorithm is tested repeatedly on a large amount of simulated data, and its performance is analyzed and evaluated from multiple angles.
(1) Preprocessing of copy number variation data
Appropriate preprocessing of the copy number variation sample data is important for detecting the significance of copy number variation. a) To address the batch effect of the CNV signal and quality problems in the alignment process, reads with relatively low alignment quality (< Q30) are filtered out. b) In data produced by new-generation sequencing technology, the sequencing coverage is affected by GC content, which in turn affects copy number variation detection. The GC content therefore needs to be standardized in order to adjust the read count of each site in the data samples. c) Because the sequencing levels of multiple samples may differ considerably, the subsequent statistic cannot be calculated directly; the data are meaningful only after being normalized to the same sequencing level. Data samples with low coverage depth can be normalized directly to the same level; for data samples with high coverage depth, the copy number amplification and deletion states can first be defined from the features of the data frequency histogram.
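Steps b) and c) can be sketched as follows. The text does not specify the correction formulas, so this sketch assumes a common median-based scheme: sites are grouped into GC bins and each read count is scaled by the ratio of the global median to its bin median, after which every sample is rescaled to the same mean depth. The function names, the bin count and the target depth are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def gc_normalize(counts, gc, n_bins=10):
    """Scale each site's read count by global_median / median of its GC bin."""
    global_med = median(counts)
    bin_of = [min(int(g * n_bins), n_bins - 1) for g in gc]
    bins = defaultdict(list)
    for c, b in zip(counts, bin_of):
        bins[b].append(c)
    bin_med = {b: median(v) for b, v in bins.items()}
    # Sites in a bin whose median is zero are left unscaled.
    return [c * global_med / bin_med[b] if bin_med[b] > 0 else float(c)
            for c, b in zip(counts, bin_of)]

def depth_normalize(samples, target=1.0):
    """Rescale every sample (row) so that its mean read count equals target."""
    out = []
    for row in samples:
        mean = sum(row) / len(row)
        out.append([x * target / mean for x in row] if mean > 0 else list(row))
    return out

# Toy example: two low-GC sites and two high-GC sites with inflated coverage.
corrected = gc_normalize([10, 12, 20, 22], [0.32, 0.38, 0.62, 0.68])
leveled = depth_normalize([[1, 2, 3], [10, 20, 30]])
```

After `gc_normalize`, both GC bins share the global median, and after `depth_normalize` the two samples, which differed tenfold in depth, become directly comparable.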
(2) Construction of the sliding window
Integrating the multiple standardized samples yields a high-dimensional matrix (s × N, where s is the number of samples and N is the number of sites per sample). Because copy number variation appears as a contiguous region, the correlation between neighboring copy number variation sites is usually strong, up to 0.985, while the correlation between more distant sites is weaker and can even be ignored. To calculate the correlation between sites more accurately, a sliding window is constructed: starting from the initial position, the frequency of each site is calculated and, at the same time, the correlation between the sites within each window is calculated with the Pearson formula; the window is then slid step by step until every site has been covered. The choice of window size has little influence on the result; it is provisionally set to 10 here, and its effect will be examined later by experiment.
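The window construction can be sketched as follows. Two points are assumptions rather than statements of the text: the "frequency" of a site is taken here as its mean value across samples, and the correlation of a site is summarized as the mean Pearson coefficient between that site and the other sites of the window. The function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)

def window_features(matrix, size=10, step=1):
    """Yield (start, freqs, corrs) for each window over an s x N matrix."""
    if size < 2:
        raise ValueError("a window needs at least two sites")
    n_sites = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_sites)]
    for start in range(0, n_sites - size + 1, step):
        idx = range(start, start + size)
        freqs = [sum(cols[j]) / len(cols[j]) for j in idx]
        corrs = []
        for j in idx:
            others = [pearson(cols[j], cols[k]) for k in idx if k != j]
            corrs.append(sum(others) / len(others))
        yield start, freqs, corrs

# Toy matrix: 3 samples (rows) x 4 sites (columns), window size 2.
toy = [[1, 2, 5, 1], [2, 4, 3, 1], [3, 6, 1, 1]]
features = list(window_features(toy, size=2))
```

With `step=1`, consecutive windows overlap, so each site is evaluated against every neighbor that can share a window with it.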
(3) Calculation of the statistic
For each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation. Because new-generation sequencing data are affected by sequencing coverage depth, the statistic is calculated separately for low-coverage and high-coverage samples, which greatly broadens the applicability of the invention. For low-coverage samples, the read-count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic (S). For samples with high coverage depth, a frequency histogram is used to distinguish accurately between copy number amplification and deletion, two states with different biological functions and manifestations, and the statistic (S) is calculated separately for each state, which helps to better detect the significance level of copy number variation. The difficulty here is how to weigh the frequency against the correlation coefficient reasonably. To this end, a training set is constructed from known copy number variation patterns to learn the weights of the frequency and the correlation coefficient, $w_1$ and $w_2$, which are used to compute the statistic:
$S_{test} = w_1 f + w_2 a$
where $f$, $a$ and $S_{test}$ denote, respectively, the frequency, the correlation and the statistic value of a copy number variation pattern in the training set. Because $S_{test}$ is not given explicitly in the training set, it is intended to be assigned a relative value according to the relationship between known copy number variation functional patterns in public databases and gene expression levels.
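One way the two weights could be learned is sketched below. The text does not name a learning method, so ordinary least squares without an intercept is assumed here purely for illustration; `fit_weights` and `statistic` are hypothetical names, and the training values are synthetic.

```python
def fit_weights(f, a, s):
    """Least-squares fit of s ~ w1*f + w2*a (no intercept) via the 2x2 normal equations."""
    ff = sum(x * x for x in f)
    aa = sum(y * y for y in a)
    fa = sum(x * y for x, y in zip(f, a))
    fs = sum(x * z for x, z in zip(f, s))
    as_ = sum(y * z for y, z in zip(a, s))
    det = ff * aa - fa * fa
    if det == 0:
        raise ValueError("frequency and correlation are collinear in the training set")
    w1 = (fs * aa - as_ * fa) / det
    w2 = (as_ * ff - fs * fa) / det
    return w1, w2

def statistic(f, a, w1, w2):
    """S = w1*f + w2*a for one site or window."""
    return w1 * f + w2 * a

# Synthetic training set generated with w1 = 0.6 and w2 = 0.4.
train_f = [1.0, 2.0, 3.0, 4.0]
train_a = [0.9, 0.1, 0.5, 0.7]
train_s = [0.6 * x + 0.4 * y for x, y in zip(train_f, train_a)]
w1, w2 = fit_weights(train_f, train_a, train_s)
```

On this noise-free training set the fit recovers the generating weights; with real data, the relative values assigned to $S_{test}$ would determine the fitted weights.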
(4) Implementation of the permutation strategy and construction of the null distribution
For the multiple standardized samples, the detection statistic of each site across the whole genome is calculated and the null distribution T is constructed. Random permutation is then applied to the sample data (each row of the data matrix represents one sample, and each column represents one site on the whole genome). The procedure is as follows: for each sample, the positions at which its values occur along the whole genome are randomly permuted, until all s samples have been permuted, which yields one complete permuted sample set; for each permuted sample set, the statistic for the occurrence of copy number variation is calculated; finally, the significance level of the detection statistic is calculated.
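The permutation procedure can be sketched as follows. Several points are assumptions: T is read here as the statistic computed on the unpermuted data, the p-value is taken as the count of permutations with a larger statistic divided by the number of permutations K, and the per-site statistic is replaced by a simple column mean (`column_means`, a hypothetical stand-in for the statistic S).

```python
import random

def column_means(matrix):
    """Hypothetical per-site statistic: mean value of each column (site)."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]

def permutation_pvalues(matrix, stat_fn, k=200, seed=0):
    """Estimate per-site p-values from k random permutations of an s x N matrix."""
    rng = random.Random(seed)
    observed = stat_fn(matrix)          # T: statistic on the unpermuted data
    exceed = [0] * len(observed)
    for _ in range(k):
        # Shuffle the genomic positions of every sample independently.
        permuted = [rng.sample(row, len(row)) for row in matrix]
        t_star = stat_fn(permuted)      # statistic of the i-th permutation
        for j, (ts, t) in enumerate(zip(t_star, observed)):
            if ts > t:
                exceed[j] += 1          # count permutations whose statistic exceeds T
    return [c / k for c in exceed]      # assumed normalization: count / K

# Toy data: 5 samples, 20 sites, site 3 amplified in every sample.
data = [[5.0 if j == 3 else 1.0 for j in range(20)] for _ in range(5)]
pvals = permutation_pvalues(data, column_means)
```

In this toy case no permutation can exceed the observed value at site 3, so its estimated p-value is 0, while the unaltered sites receive clearly non-significant values.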
(5) Null distribution design and significance level estimation based on CNV length
The regions in which CNVs occur are evaluated from the p-values obtained for all sites of the sample; if the p-value is less than a set threshold (such as 0.05), the CNV is considered to have biological significance or a cancer-related function. In addition, because the amplification and deletion states of a CNV have different biological functions and manifestations, null distributions are established separately for the amplification and deletion states of each CNV structural unit, so that the significance level of each state is detected separately.
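Thresholding the per-site p-values into regions can be sketched as follows. Merging consecutive significant sites into one region is an assumption of this sketch, as are the function name `call_regions` and the toy p-values; amplification and deletion are handled with separate p-value vectors, mirroring the separate null distributions described above.

```python
def call_regions(pvalues, alpha=0.05):
    """Merge consecutive sites with p < alpha into (start, end) regions, end inclusive."""
    regions, start = [], None
    for i, p in enumerate(pvalues):
        if p < alpha:
            if start is None:
                start = i                    # a new significant run begins
        elif start is not None:
            regions.append((start, i - 1))   # the run ended at the previous site
            start = None
    if start is not None:
        regions.append((start, len(pvalues) - 1))
    return regions

# Hypothetical p-values from the amplification and deletion null distributions.
amp_p = [0.40, 0.01, 0.02, 0.03, 0.60, 0.90]
del_p = [0.70, 0.80, 0.50, 0.90, 0.04, 0.01]
calls = {"amplification": call_regions(amp_p), "deletion": call_regions(del_p)}
```

Here sites 1 to 3 are reported as one amplified region and sites 4 to 5 as one deleted region.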
(6) Performance evaluation of the algorithm
The invention intends to evaluate the performance of the algorithm in the following four respects: a) whether the algorithm can achieve a higher true positive rate (TPR) while the false positive rate (FPR) is kept under control; b) whether the algorithm can accurately estimate the p-value (Type I error rate), that is, whether the statistical model of the algorithm has strong statistical significance; c) the boundary detection capability for copy number variation; d) the computational complexity of the algorithm.
Normal-cell copy numbers detected with 1000 Affymetrix whole-genome SNP 6.0 chips are intended to serve as the background; taking NGS technology and data characteristics into account, a Markov CNV simulation method is constructed on the basis of probability theory and a non-stationary model, and large-scale NGS-based CNV data are simulated to test the performance of the invention. Some simulation experiments show that the algorithm has relatively high boundary detection capability while maintaining a relatively high TPR.
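A much simplified version of such a simulation is sketched below. It is not the simulation method of the invention: it assumes a stationary first-order Markov chain over three copy number states and Gaussian read-count noise, with no chip-derived background. The state table, the transition probability `stay` and the depth parameters are hypothetical.

```python
import random

# Hypothetical copy number assigned to each state.
COPY_NUMBER = {"deletion": 1, "normal": 2, "amplification": 3}

def simulate_states(n_sites, stay=0.98, seed=0):
    """First-order Markov chain over CNV states along the genome."""
    rng = random.Random(seed)
    names = list(COPY_NUMBER)
    state, path = "normal", []
    for _ in range(n_sites):
        path.append(state)
        if rng.random() > stay:   # leave the current state with probability 1 - stay
            state = rng.choice([s for s in names if s != state])
    return path

def simulate_counts(path, depth_per_copy=15.0, noise_sd=2.0, seed=1):
    """Read count per site: copy number times depth plus Gaussian noise, floored at 0."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(COPY_NUMBER[s] * depth_per_copy, noise_sd)) for s in path]

path = simulate_states(500)
counts = simulate_counts(path)
```

Because the true state of every site is known, the output can serve as ground truth for measuring TPR, FPR and boundary accuracy.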
The foregoing is merely a description of preferred embodiments of the invention and is not intended to limit the invention; any modifications, equivalent replacements and improvements made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (7)
1. A copy number variation detection method based on new-generation sequencing, characterized in that the method comprises the following steps:
Preprocessing of copy number variation data: filtering out reads of low alignment quality to address the batch effect of the CNV signal and quality problems in the alignment process; standardizing the GC content to adjust the read count of each site in the data samples; normalizing the sequencing levels of multiple samples so that the data correspond to the same sequencing level; for data samples with low coverage depth, normalizing the data directly to the same level; for data samples with high coverage depth, first defining the copy number amplification and deletion states from the features of the data frequency histogram;
Construction of the sliding window: integrating the multiple standardized samples to obtain a high-dimensional matrix; constructing a sliding window which, starting from the initial position, calculates the frequency of each site and at the same time calculates the correlation between the sites within each window with the Pearson formula, and sliding the window step by step until every site has been covered, thereby calculating the correlation between sites;
Calculation of the statistic: calculating, for each sliding window, a statistic reflecting the amplification or deletion state of the copy number variation; constructing a training set from known copy number variation patterns and learning the weight $w_1$ of the frequency and the weight $w_2$ of the correlation coefficient, so as to calculate the statistic:
$S_{test} = w_1 f + w_2 a$
where $f$, $a$ and $S_{test}$ denote, respectively, the frequency, the correlation and the statistic value of a copy number variation pattern in the training set;
Implementation of the permutation strategy and construction of the null distribution: calculating, for the multiple standardized samples, the detection statistic of each site across the whole genome and constructing the null distribution T; then applying random permutation to the sample data, in which, for each sample, the positions at which its values occur along the whole genome are randomly permuted, until all s samples have been permuted, which yields one complete permuted sample set; calculating, for each permuted sample set, the statistic for the occurrence of copy number variation; and finally calculating the significance level of the detection statistic, where p-value denotes the p-value of each site of the sample, K is the number of random permutations, T is the statistic under the null distribution, and $T_i^*$ is the statistic of the i-th permutation; each time $T_i^*$ is greater than T the count is incremented by one, which finally yields the p-value; p-value, $T_i^*$ and T are all vectors;
Estimation of significance level based on CNV: evaluating the regions in which CNVs occur from the p-values obtained for all sites of the sample; if the p-value is less than the set threshold of 0.05, the CNV is considered to have biological significance; for each CNV structural unit, establishing null distributions separately for the amplification and deletion states, so as to detect the significance level of each state separately;
Performance evaluation of the algorithm: judging whether the algorithm can achieve a true positive rate while the false positive rate is kept under control; evaluating whether the algorithm can estimate the p-value; the boundary detection capability for copy number variation; and analyzing the computational complexity of the algorithm.
2. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that the reads of low alignment quality filtered out to address the batch effect of the CNV signal and quality problems in the alignment process are reads < Q30.
3. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that the high-dimensional matrix obtained by integrating the multiple standardized samples has dimensions s × N, where s is the number of samples and N is the number of sites per sample.
4. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, for each sliding window, a statistic is calculated to reflect the amplification or deletion state of the copy number variation; for low-coverage samples, the read-count frequency of each site and the correlation coefficients between that site and the other sites in the window are calculated directly, and the frequency and correlation coefficients are combined to quantify the statistic S; for samples with high coverage depth, a frequency histogram is used to distinguish accurately between copy number amplification and deletion, two states with different biological functions and manifestations, and the statistic S is calculated separately for each state.
5. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, in the calculation of the statistic, $S_{test}$ in the training set is intended to be assigned a relative value according to the relationship between known copy number variation patterns in public databases and gene expression levels.
6. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, in calculating, for the multiple standardized samples, the detection statistic of each site across the whole genome, constructing the null distribution T and then applying random permutation to the sample data, each row of the data matrix represents one sample and each column represents one site on the whole genome.
7. The copy number variation detection method based on new-generation sequencing according to claim 1, characterized in that, in the null distribution design and significance level estimation based on CNV length, if the p-value is less than the set threshold of 0.05, the CNV has biological significance; the amplification and deletion states of the CNV have different biological functions and manifestations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610114354.8A CN105760712B (en) | 2016-03-01 | 2016-03-01 | A kind of copy number mutation detection method based on new-generation sequencing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610114354.8A CN105760712B (en) | 2016-03-01 | 2016-03-01 | A kind of copy number mutation detection method based on new-generation sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105760712A (en) | 2016-07-13 |
CN105760712B (en) | 2019-03-26 |
Family
ID=56331603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610114354.8A Active CN105760712B (en) | 2016-03-01 | 2016-03-01 | A kind of copy number mutation detection method based on new-generation sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105760712B (en) |
Families Citing this family (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372459B (en) * | 2016-08-30 | 2019-03-15 | 天津诺禾致源生物信息科技有限公司 | A kind of method and device based on amplification second filial sequencing copy number variation detection |
RU2768718C2 (en) * | 2016-09-22 | 2022-03-24 | Иллумина, Инк. | Detection of somatic variation of number of copies |
CN108073790B (en) * | 2016-11-10 | 2022-03-01 | 安诺优达基因科技(北京)有限公司 | Chromosome variation detection device |
CN106682455B (en) * | 2016-11-24 | 2019-03-26 | 西安电子科技大学 | A kind of Statistical Identifying Method of multisample copy number consistency variable region |
CN106682450B (en) * | 2016-11-24 | 2019-05-07 | 西安电子科技大学 | A kind of new-generation sequencing copy number variation emulation mode based on state transition model |
CN106845154B (en) * | 2016-12-29 | 2022-04-08 | 浙江安诺优达生物科技有限公司 | A device for FFPE sample copy number variation detects |
CN106650312B (en) * | 2016-12-29 | 2022-05-17 | 浙江安诺优达生物科技有限公司 | Device for detecting copy number variation of circulating tumor DNA |
CN108256292B (en) * | 2016-12-29 | 2021-11-02 | 浙江安诺优达生物科技有限公司 | Copy number variation detection device |
CN106778072B (en) * | 2016-12-30 | 2019-05-21 | 西安交通大学 | For the process bearing calibration of second generation Oncogenome high-flux sequence data |
CN106676178B (en) * | 2017-01-19 | 2020-03-24 | 北京吉因加科技有限公司 | Method and system for evaluating tumor heterogeneity |
CN110462063B (en) * | 2017-05-23 | 2023-06-23 | 深圳华大生命科学研究院 | Mutation detection method and device based on sequencing data and storage medium |
CN107229839B (en) * | 2017-05-25 | 2020-05-22 | 西安电子科技大学 | Indel detection method based on next generation sequencing data |
CN108563923B (en) * | 2017-12-05 | 2020-08-18 | 华南理工大学 | Distributed storage method and system for genetic variation data |
CN108197428B (en) * | 2017-12-25 | 2020-06-19 | 西安交通大学 | Copy number variation detection method for next generation sequencing technology based on parallel dynamic programming |
CN112365927B (en) * | 2017-12-28 | 2023-08-25 | 安诺优达基因科技(北京)有限公司 | CNV detection device |
CN108427864B (en) * | 2018-02-14 | 2019-01-29 | 南京世和基因生物技术有限公司 | A kind of detection method, device and computer-readable medium copying number variation |
CN109658983B (en) * | 2018-12-20 | 2019-11-19 | 深圳市海普洛斯生物科技有限公司 | A kind of method and apparatus identifying and eliminate false positive in variance detection |
CN109887546B (en) * | 2019-01-15 | 2019-12-27 | 明码(上海)生物科技有限公司 | Single-gene or multi-gene copy number detection system and method based on next-generation sequencing |
CN110310704A (en) * | 2019-05-08 | 2019-10-08 | 西安电子科技大学 | A kind of copy number mutation detection method based on local outlier factor |
CN112885406B (en) * | 2020-04-16 | 2023-01-31 | 深圳裕策生物科技有限公司 | Method and system for detecting HLA heterozygosity loss |
CN111508559B (en) * | 2020-04-21 | 2021-08-13 | 北京橡鑫生物科技有限公司 | Method and device for detecting target area CNV |
CN111429966A (en) * | 2020-04-23 | 2020-07-17 | 长沙金域医学检验实验室有限公司 | Chromosome copy number variation discrimination method and device based on robust linear regression |
CN111627498B (en) * | 2020-05-21 | 2022-10-04 | 北京吉因加医学检验实验室有限公司 | Method and device for correcting GC bias of sequencing data |
CN111863124B (en) * | 2020-06-06 | 2024-01-30 | 聊城大学 | Copy number variation detection method, system, storage medium and computer equipment |
CN113270141B (en) * | 2021-06-10 | 2023-02-21 | 哈尔滨因极科技有限公司 | Genome copy number variation detection integration algorithm |
CN113284558B (en) * | 2021-07-02 | 2024-03-12 | 赛福解码(北京)基因科技有限公司 | Method for distinguishing gene expression difference and long copy number variation in RNA sequencing data |
CN114758720B (en) * | 2022-06-14 | 2022-09-02 | 北京贝瑞和康生物技术有限公司 | Method, apparatus and medium for detecting copy number variation |
CN115064210B (en) * | 2022-07-27 | 2022-11-18 | 北京大学第三医院(北京大学第三临床医学院) | Method for identifying chromosome cross-exchange positions in diploid embryonic cells and application |
CN117409856B (en) * | 2023-10-25 | 2024-03-29 | 北京博奥医学检验所有限公司 | Mutation detection method, system and storable medium based on single sample to be detected targeted gene region second generation sequencing data |
CN118016150B (en) * | 2023-11-30 | 2024-10-01 | 东莞博奥木华基因科技有限公司 | Model construction for detecting copy number variation of genetic sequence and application thereof |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103778350A (en) * | 2014-01-09 | 2014-05-07 | 西安电子科技大学 | Somatic copy number alteration obviousness detection method based on two-dimension statistic model |
CN104221022A (en) * | 2012-04-05 | 2014-12-17 | 深圳华大基因医学有限公司 | Method and system for detecting copy number variation |
CN104603284A (en) * | 2012-09-12 | 2015-05-06 | 深圳华大基因研究院 | Method for detecting copy number variations by genome sequencing fragments |
CN104694384A (en) * | 2015-03-20 | 2015-06-10 | 上海美吉生物医药科技有限公司 | Mitochondrial DNA copy index variability detecting device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004044225A2 (en) * | 2002-11-11 | 2004-05-27 | Affymetrix, Inc. | Methods for identifying dna copy number changes |
-
2016
- 2016-03-01 CN CN201610114354.8A patent/CN105760712B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN105760712A (en) | 2016-07-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||