CN106682455B - A statistical test method for consistent copy number variation regions across multiple samples - Google Patents

A statistical test method for consistent copy number variation regions across multiple samples

Info

Publication number
CN106682455B
Authority
CN
China
Prior art keywords
copy number
multisample
variable region
sample
cnvs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611040980.3A
Other languages
Chinese (zh)
Other versions
CN106682455A (en)
Inventor
袁细国
李�杰
张军英
杨利英
高美虹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201611040980.3A
Publication of CN106682455A
Application granted
Publication of CN106682455B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention discloses a statistical test method for consistent copy number variation regions across multiple samples. The correlation coefficients between copy number sites are fitted to a curve and the derivative value at each site is calculated; significant derivative values are detected by hypothesis testing, which determines the copy number breakpoints and establishes the candidate copy number variation regions. A null distribution for hypothesis testing is then constructed by randomly permuting CNVs in two directions, along the whole genome and across samples, and the consistent copy number variation regions in the multiple samples are detected. Because the method avoids using sequencing read counts directly, it tolerates a certain amount of sequencing error and noise and can locate the boundaries of copy number variation regions accurately. Randomly permuting CNVs in both the genome and sample directions yields a more realistic null distribution than permutation in a single direction. The method also helps to detect diverse consistent CNVs, i.e. consistent copy number variation regions present in subsets of the samples.

Description

A statistical test method for consistent copy number variation regions across multiple samples
Technical field
The invention belongs to the field of copy number variation, and in particular relates to a statistical test method for consistent copy number variation regions across multiple samples.
Background art
Next-generation sequencing technology provides more comprehensive and richer genome variation data and offers an important platform for understanding, in depth, the mechanisms of life and of cancer cell development. Copy number variation (CNV) is an important type of variation in the genome and is closely related to the occurrence and development of cancer. Systematic analysis of CNV data from next-generation sequencing platforms therefore provides an important route to discovering cancer genes and studying the molecular mechanisms of cancer cells; the difficulty lies in accurately detecting diverse CNV patterns from high-resolution, low-sequencing-depth read data.
Prior art: experts in China and abroad have proposed various copy number variation detection schemes, which can be broadly divided into schemes based on a single tumor sample and schemes based on tumor-normal paired samples, such as SegSeq [D. Y. Chiang et al., “High-resolution mapping of copy-number alterations with massively parallel sequencing,” Nat Methods, vol. 6, no. 1, pp. 99-103, Jan, 2009], EWT [S. T. Yoon et al., “Sensitive and accurate detection of copy number variants using read depth of coverage,” Genome Research, vol. 19, no. 9, pp. 1586-1592, Sep, 2009], BIC-seq [R. Xi et al., “Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion,” Proc Natl Acad Sci U S A, vol. 108, no. 46, pp. E1128-36, Nov 15, 2011], CNVnator [A. Abyzov et al., “CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing,” Genome Res, vol. 21, no. 6, pp. 974-84, Jun, 2011], ReadDepth [C. A. Miller et al., “ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads,” PLoS One, vol. 6, no. 1, pp. e16327, 2011], Control-FREEC [V. Boeva et al., “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, vol. 27, no. 2, pp. 268-9, Jan 15, 2011], CNV-TV [J. Duan et al., “CNV-TV: a robust method to discover copy number variation from short sequencing reads,” BMC Bioinformatics, vol. 14, pp. 150, 2013], CNVeM [Z. Wang et al., “CNVeM: copy number variation detection using uncertainty of read mapping,” J Comput Biol, vol. 20, no. 3, pp. 224-36, Mar, 2013], and m-HMM [H. Wang et al., “Copy number variation detection using next generation sequencing read counts,” BMC Bioinformatics, vol. 15, Apr 14, 2014]. Most of these methods use sequencing depth to calculate the read count at each gene site and then predict copy number variation regions from the changes in read count across the whole genome or a whole chromosome. Such methods are relatively easy to implement and perform well on data of high sequencing depth. Their shortcoming is their direct dependence on read counts, which are inherently unstable: read counts fluctuate randomly, and this random fluctuation is often misinterpreted as being caused by copy number variation. For data of low sequencing depth in particular, the ratio of the amplitude of the random fluctuation to the amplitude of the copy number variation is high, so such methods struggle to detect copy number variation well.
In addition, some experts have proposed multi-sample copy number variation detection methods, such as cnvHiTSeq [E. Bellos et al., “cnvHiTSeq: integrative models for high-resolution copy number variation detection and genotyping using population sequencing data,” Genome Biol, vol. 13, no. 12, pp. R120, 2012], VarScan2+CMDS [D. C. Koboldt et al., “VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing,” Genome Res, vol. 22, no. 3, pp. 568-76, Mar, 2012; Q. Zhang et al., “CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data,” Bioinformatics, vol. 26, no. 4, pp. 464-9, Feb 15, 2010], JointSLM [A. Magi et al., “Detecting common copy number variants in high-throughput sequencing data by using JointSLM algorithm,” Nucleic Acids Research, vol. 39, no. 10, May, 2011], cn.MOPS [G. Klambauer et al., “cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate,” Nucleic Acids Res, vol. 40, no. 9, pp. e69, May, 2012], CBSBR [J. Duan et al., “Common copy number variation detection from multiple sequenced samples,” IEEE Trans Biomed Eng, vol. 61, no. 3, pp. 928-37, Mar, 2014], and CODEX [Y. Jiang et al., “CODEX: a normalization and copy number variation detection method for whole exome sequencing,” Nucleic Acids Res, vol. 43, no. 6, pp. e39, Mar 31, 2015]. Most of these methods detect consistent copy number variation regions on the basis of the correlation between copy number variation sites or the differences between samples. Their advantage is that they capture the biological characteristics of copy number structural variation and can thus distinguish consistent copy number variation regions from random copy number variation. Their disadvantage is that weakly significant consistent copy number variation regions are hard to detect. These methods also tend to limit the number of samples in multi-sample copy number variation detection, which restricts their ability to detect highly consistent copy number variation regions in a particular cancer type or across cancer types.
In conclusion the Statistical Identifying Method of available sample copy number consistency variable region excessively relies on sequencing read number Variation, it is difficult to obtain have statistical significance detection effect;Sample size cannot be excessive, and computation complexity is higher, is unfavorable for examining Survey copy number consistency variable region in multisample.
Summary of the invention
The purpose of the present invention is to provide a kind of Statistical Identifying Methods of multisample copy number consistency variable region, it is intended to The Statistical Identifying Method for solving available sample copy number consistency variable region excessively relies on the variation of sequencing read number, it is difficult to obtain There must be the detection effect of statistical significance;Sample size cannot be excessive, and computation complexity is higher, is unfavorable for copying in detection multisample The problem of number consistency variable region.
The invention is realized in this way a kind of Statistical Identifying Method of multisample copy number consistency variable region, described The Statistical Identifying Method of multisample copy number consistency variable region is fitted to curve based on the coefficient of relationship in copy number site, with This calculates the derivative value in each site, by assuming that the method for inspection detects significant derivative value, so that it is determined that copy number breakpoint, builds Vertical copy number variation candidates region;Hypothesis is constructed by way of the random permutation CNVs on full-length genome and sample both direction Zero cloth is examined, copy number consistency variable region in multisample is detected.
Further, the sequencing data files need to be preprocessed before the correlation coefficient curve is fitted, specifically:
The sequencing data files are aligned and the read count at each site is calculated; the read counts are normalized according to the mean read count of each sample to obtain read count signals that are comparable between samples. The calculation formula is as follows:
where mean_RC_n and mean_RC denote the mean read count of the n-th sample and the mean of the read counts of the multiple samples, respectively; x_nm denotes the read count at the m-th site of the n-th sample; and x'_nm denotes the normalized read count at the corresponding site.
Further, equal-length bins are defined and the per-site read counts of each sample are converted into per-bin read counts; copy number variation detection is then carried out in units of bins.
Further, based on the preprocessed data matrix M, in which each row represents a sample and each column represents a bin, the correlation coefficients between bins are calculated by Pearson correlation analysis and fitted to a curve, from which the derivative value at each bin is obtained.
Further, using the derivative values as background, a null distribution for hypothesis testing is established and the derivative values are tested for significance; a significant value means that a breakpoint exists at the position of that bin, which yields the candidate copy number variation regions.
Further, significant CNVs are detected in the candidate copy number variation regions by an iterative loop, specifically: a null distribution is constructed by randomly permuting the candidate CNV regions across the whole genome, and the candidate CNV regions are tested against it; whenever a significant CNV is found, it is removed from the genome, the null distribution is reconstructed, and the candidate CNV regions are tested again, until no new CNVs are found.
Further, detecting the consistent copy number variation regions across multiple samples includes: randomly permuting CNVs along the whole genome and across samples to construct a permuted data matrix Mt, and calculating the frequency f with which the random CNVs occur across the samples; repeating this process N times, N > 1000, to obtain a distribution of the frequency f, i.e. the null distribution for hypothesis testing; testing the CNV frequencies of the data matrix before permutation against this distribution, calculating the p-value of each CNV, and determining the CNVs with consistent variation across samples according to a significance level threshold.
Another object of the present invention is to provide a cancer gene to which the statistical test method for consistent copy number variation regions across multiple samples is applied.
Another object of the present invention is to provide a cancer cell molecule to which the statistical test method for consistent copy number variation regions across multiple samples is applied.
The statistical test method provided by the invention establishes a calculation method grounded in statistical theory, detects consistent copy number variation regions across multiple samples, and provides a direct and feasible technical means of discovering potential cancer genes. After normalizing the read counts, the invention derives bins as the basic units, calculates the correlation coefficients between bins in the multi-sample space, fits them to a curve, and calculates the derivative at each bin from that curve. Significance testing of the derivative values detects the copy number breakpoints and yields the candidate CNV regions. CNV regions are then detected in each single sample by an iterative loop: for the candidate CNV regions, a null distribution is constructed by random permutation and used to test the significance of each CNV; CNVs found to be significant are removed and the null distribution is rebuilt, and the loop terminates when no new CNV is detected. The advantage of this procedure is that weakly significant CNVs can be detected. Building on the single-sample CNV detection, consistent copy number variation regions are detected in the multi-sample space according to CNV frequency: a statistic is built from the frequency with which each CNV occurs across the samples, and the consistent copy number variation regions are detected by a multi-sample permutation test.
Most existing methods rely too heavily on changes in sequencing read counts. Because sequencing technologies themselves introduce errors and reads carry strong noise, these methods struggle to achieve statistically significant detection results on samples of low sequencing depth. The present invention therefore fits the correlation coefficients between copy number variation sites to a curve and calculates the derivative value at each gene site, which converts the problem of detecting copy number variation regions into a problem of testing the significance of derivative values. The method thus does not depend directly on the magnitude of the sequencing read counts and can tolerate a certain amount of sequencing error and noise.
Existing multi-sample copy number variation detection methods place certain restrictions on the number or characteristics of the samples. For example, CBSBR requires that the sample size not be too large (the algorithm defaults to 6 samples) and has high computational complexity, while cn.MOPS requires obvious differences between samples, which hinders the detection of consistent copy number variation regions across multiple samples. The present invention therefore establishes a new statistical test model that detects diverse copy number variation patterns through an iterative removal process, places no limit on the number of samples, and has controllable computational complexity. Table 1 compares the methods.
Table 1. Comparison of the computational complexity of the four methods
Method             DCC     CBSBR        FREEC   cn.MOPS
Running time       22 s    1721 s       50 s    38 s
Time complexity    O(mn)   O(mnk)       O(n)    O(mn)
Space complexity   O(mn)   O(m^2 n^2)   O(n)    O(mn)
Software platform  C++     MATLAB       C++     R
DCC is the method of the present invention; the running times are the results of detection on a genome of length 5 Gb.
The present invention calculates derivative values from the curve fitted to the correlation coefficients in order to test for copy number breakpoints and thereby determine the candidate copy number variation regions. On the one hand, this avoids using sequencing read counts directly and tolerates a certain amount of sequencing error and noise; on the other hand, it locates the boundaries of copy number variation regions accurately. Randomly permuting CNVs in both the whole-genome and sample directions yields a more realistic null distribution than permutation in a single direction. The strategy also helps to detect diverse consistent CNVs, i.e. consistent copy number variation regions present in subsets of the samples.
Brief description of the drawings
Fig. 1 is a flowchart of the statistical test method for consistent copy number variation regions across multiple samples provided by an embodiment of the present invention.
Fig. 2 is a schematic performance comparison of the present invention (DCC) and the cn.MOPS method provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here merely illustrate the present invention and are not intended to limit it.
The application principle of the invention is explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the statistical test method for consistent copy number variation regions across multiple samples provided by an embodiment of the present invention comprises the following steps:
S101: align the sequencing data files (i.e. FASTQ files) and calculate the read count at each site; normalize the read counts according to the mean read count of each sample to obtain read count signals that are comparable between samples;
S102: based on the preprocessed data matrix M (in which each row represents a sample and each column represents a bin), calculate the correlation coefficients between bins by Pearson correlation analysis, construct a fitted curve of the correlation coefficients, and obtain the derivative value at each bin from it; using the derivative values as background, establish a null distribution for hypothesis testing and test the derivative values for significance, where a significant value means that a breakpoint exists at the position of that bin, to obtain the candidate copy number variation regions;
S103: based on the CNVs defined in single samples, construct a null distribution for hypothesis testing by a multi-sample permutation strategy; test the CNV frequencies of the data matrix before permutation against it, calculate the p-value of each CNV, and determine the CNVs with consistent variation across samples according to a significance level threshold.
The application principle of the invention is further described below with reference to a specific embodiment.
(1) Data preprocessing
The sequencing data files (i.e. FASTQ files) are aligned and the read count at each site is calculated; the read counts are normalized according to the mean read count of each sample to obtain read count signals that are comparable between samples, as shown in formula (1).
where mean_RC_n and mean_RC denote the mean read count of the n-th sample and the mean of the read counts of the multiple samples, respectively; x_nm denotes the read count at the m-th site of the n-th sample; and x'_nm denotes the normalized read count at the corresponding site.
On the basis of the normalized data, in order to reduce the data dimension and the differences between sites caused by random factors, the present invention defines equal-length bins and converts the per-site read counts of each sample into per-bin read counts. Copy number variation detection is then carried out in units of bins.
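The preprocessing step can be sketched as follows. This is a minimal illustration, not the patented implementation: formula (1) is not reproduced in this text, so the code assumes the form x'_nm = x_nm * mean_RC / mean_RC_n, which is consistent with the symbol definitions above. The function names, the choice to sum counts within a bin, and the bin size are hypothetical.

```python
import numpy as np

def normalize_read_counts(x):
    """Scale each sample (row) so that its mean read count equals the overall mean.

    Assumed form of formula (1): x'[n, m] = x[n, m] * mean_RC / mean_RC_n.
    Every sample is assumed to have a non-zero mean read count.
    """
    x = np.asarray(x, dtype=float)
    mean_rc_n = x.mean(axis=1, keepdims=True)  # mean read count of each sample
    mean_rc = mean_rc_n.mean()                 # mean over the sample means
    return x * mean_rc / mean_rc_n

def bin_read_counts(x_norm, bin_size):
    """Sum per-site counts into equal-length bins; a trailing partial bin is dropped."""
    n_samples, n_sites = x_norm.shape
    n_bins = n_sites // bin_size
    trimmed = x_norm[:, : n_bins * bin_size]
    return trimmed.reshape(n_samples, n_bins, bin_size).sum(axis=2)

# Two samples sequenced at different depths but with the same profile.
counts = np.array([[1, 2, 3, 4, 5],
                   [2, 4, 6, 8, 10]])
M = bin_read_counts(normalize_read_counts(counts), bin_size=2)
```

After normalization the two rows of M are identical, which is the sense in which the read count signals become comparable between samples.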
(2) Derivative value testing and single-sample copy number variation detection
Based on the preprocessed data matrix M (in which each row represents a sample and each column represents a bin), the correlation coefficients between bins are calculated by Pearson correlation analysis, a fitted curve of the correlation coefficients is constructed, and the derivative value at each bin is obtained from it.
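A rough sketch of this step is given below. The text does not specify which pairs of bins are correlated or how the curve is fitted, so the sketch makes two assumptions: the Pearson coefficient is taken between each bin and its right-hand neighbour across samples, and the "fitted curve" is approximated by a moving average. The derivative is a finite difference from `numpy.gradient`. All function names are hypothetical.

```python
import numpy as np

def adjacent_bin_correlation(M):
    """Pearson correlation, across samples, between each bin (column) and the next."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=0)
    norms = np.sqrt((centered ** 2).sum(axis=0))
    num = (centered[:, :-1] * centered[:, 1:]).sum(axis=0)
    den = norms[:-1] * norms[1:]
    # A constant column has zero variance; report a correlation of 0 there.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def smooth(values, window=3):
    """Moving average with edge padding (window must be odd); stands in for curve fitting."""
    values = np.asarray(values, dtype=float)
    padded = np.pad(values, window // 2, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

def bin_derivatives(M, window=3):
    """Finite-difference derivative of the smoothed correlation curve."""
    return np.gradient(smooth(adjacent_bin_correlation(M), window))

# Four samples, three bins: bins 0 and 1 rise together, bin 2 falls.
M_example = np.array([[1, 2, 4],
                      [2, 4, 3],
                      [3, 6, 2],
                      [4, 8, 1]])
r = adjacent_bin_correlation(M_example)
d = bin_derivatives(M_example)
```

In the example the correlation drops from +1 to -1 between the two bin pairs, so the derivative is negative; a sharp change of this kind is what the subsequent significance test looks for.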
Using the derivative values as background, a null distribution for hypothesis testing is established and the derivative values are tested for significance; a significant value means that a breakpoint exists at the position of that bin, which yields the candidate copy number variation regions. This approach makes full use of the intrinsic correlation between copy number variation sites: sites within the same copy number variation region have correlation coefficients of similar level, so testing the derivative values finds the positions where the correlation coefficient changes abruptly and yields candidate copy number variation regions of unequal length.
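One way to picture the breakpoint test is the sketch below. The text does not state how the null distribution is built from the derivative values, so this example assumes a normal approximation whose mean and standard deviation are estimated from all derivative values (the "background"), and computes a two-sided p-value for each bin. The function names and the threshold are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def derivative_p_values(derivs):
    """Two-sided p-value of each derivative under a normal background model (assumption)."""
    d = np.asarray(derivs, dtype=float)
    sd = d.std()
    if sd == 0:
        return np.ones_like(d)  # no variation at all: nothing can be significant
    z = np.abs(d - d.mean()) / sd
    return np.array([erfc(v / sqrt(2.0)) for v in z])

def candidate_regions(p_values, alpha, n_bins):
    """Split [0, n_bins) at every bin whose derivative is significant.

    Returns half-open (start, end) bin intervals between consecutive breakpoints.
    """
    breakpoints = np.flatnonzero(np.asarray(p_values) < alpha)
    bounds = [0] + [int(b) for b in breakpoints if 0 < b < n_bins] + [n_bins]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]

# Flat derivative profile with two abrupt changes, at bins 7 and 14.
derivs = np.zeros(21)
derivs[7], derivs[14] = 5.0, -5.0
p = derivative_p_values(derivs)
regions = candidate_regions(p, alpha=0.01, n_bins=derivs.size)
```

Because the regions are delimited by the detected breakpoints rather than by a fixed window, they naturally come out with unequal lengths.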
For the candidate copy number variation regions, significant CNVs are detected by an iterative loop, as follows: a null distribution for hypothesis testing is constructed by randomly permuting the candidate CNV regions across the whole genome, and the candidate CNV regions are tested against it; whenever a significant CNV is found, it is removed from the genome, the null distribution is reconstructed, and the candidate CNV regions are tested again, until no new CNVs are found.
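The iterative loop can be illustrated with the following sketch for a single sample. It is only one possible reading of the procedure: the test statistic (absolute deviation of the region mean from the background median) and the way the permutation is drawn (random bins from the part of the genome not yet removed) are assumptions, as are the function name and parameters.

```python
import numpy as np

def iterative_cnv_test(signal, regions, n_perm=200, alpha=0.05, seed=0):
    """Repeatedly test candidate regions, removing significant ones from the background."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    active = np.ones(signal.size, dtype=bool)  # bins still part of the background
    remaining = list(regions)
    significant = []
    while remaining:
        background = signal[active]
        if background.size == 0:
            break
        baseline = np.median(background)
        found = []
        for start, end in remaining:
            length = end - start
            observed = abs(signal[start:end].mean() - baseline)
            # Null: the same statistic on `length` bins drawn at random
            # from the current background.
            null = np.array([
                abs(rng.permutation(background)[:length].mean() - baseline)
                for _ in range(n_perm)
            ])
            p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
            if p_value < alpha:
                found.append((start, end))
        if not found:
            break  # no new CNV: terminate the loop
        for start, end in found:
            active[start:end] = False  # remove the significant CNV from the genome
            remaining.remove((start, end))
        significant.extend(found)
    return significant

# Flat background with one amplified region at bins 10..14.
signal = np.full(40, 2.0)
signal[10:15] = 6.0
significant = iterative_cnv_test(signal, [(0, 10), (10, 15), (15, 40)])
```

Removing strong CNVs before rebuilding the null distribution narrows the background, which is why weaker CNVs that were masked in an earlier round can become detectable in a later one.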
(3) Detection of consistent copy number variation regions across multiple samples
Based on the CNVs defined in single samples, a null distribution for hypothesis testing is constructed by a multi-sample permutation strategy: CNVs are randomly permuted along the whole genome and across samples to construct a permuted data matrix Mt, from which the frequency f with which the random CNVs occur across the samples is calculated; this process is repeated n times (n > 1000) to obtain a distribution of the frequency f, i.e. the null distribution for hypothesis testing. The CNV frequencies of the data matrix before permutation are tested against this distribution, the p-value of each CNV is calculated, and the CNVs with consistent variation across samples are determined according to a significance level threshold.
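The two-direction permutation test can be sketched as below, under stated assumptions: the single-sample calls are encoded as a binary samples-by-bins matrix, each CNV segment is reassigned to a random sample and a random genomic position to build Mt, and the null distribution is taken as the maximum per-bin frequency of each permuted matrix. The encoding, the maximum statistic, and the function names are illustrative choices, not details given in the text.

```python
import numpy as np

def segment_lengths(C):
    """Lengths of all runs of 1s (CNV segments) in a binary samples-by-bins matrix."""
    lengths = []
    for row in np.asarray(C, dtype=int):
        edges = np.diff(np.concatenate(([0], row, [0])))
        starts = np.flatnonzero(edges == 1)
        ends = np.flatnonzero(edges == -1)
        lengths.extend((ends - starts).tolist())
    return lengths

def consistency_p_values(C, n_perm=1001, seed=0):
    """Per-bin p-values for the observed CNV frequency across samples."""
    rng = np.random.default_rng(seed)
    C = np.asarray(C, dtype=int)
    n_samples, n_bins = C.shape
    f_obs = C.mean(axis=0)  # observed frequency of a CNV at each bin
    lengths = segment_lengths(C)
    null_max = np.empty(n_perm)
    for i in range(n_perm):
        Mt = np.zeros_like(C)  # permuted data matrix
        for length in lengths:
            row = rng.integers(n_samples)             # sample direction
            start = rng.integers(n_bins - length + 1)  # genome direction
            Mt[row, start:start + length] = 1
        null_max[i] = Mt.mean(axis=0).max()
    exceed = (null_max[:, None] >= f_obs[None, :]).sum(axis=0)
    return f_obs, (1 + exceed) / (n_perm + 1)

# Ten samples, fifty bins; every sample carries a CNV at bins 20..24.
C = np.zeros((10, 50), dtype=int)
C[:, 20:25] = 1
f_obs, p = consistency_p_values(C)
consistent_bins = np.flatnonzero(p < 0.01)
```

Note that shuffling whole rows alone would leave every per-bin frequency unchanged, which is why the sketch moves individual segments between samples as well as along the genome.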
Performance comparison. Fig. 2 compares the performance of the present invention (DCC) with that of the cn.MOPS method; the experiment tests CNV detection performance on sequenced DNA under different tumor purities. Fig. 2 shows that the method of the present invention performs comparatively well.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A statistical test method for consistent copy number variation regions across multiple samples, characterized in that the method fits the correlation coefficients between copy number sites to a curve, calculates the derivative value at each site, and detects significant derivative values by hypothesis testing, thereby determining the copy number breakpoints and establishing the candidate copy number variation regions; and constructs a null distribution for hypothesis testing by randomly permuting copy number variations (CNVs) in two directions, along the whole genome and across samples, to detect the consistent copy number variation regions in the multiple samples;
the statistical test method specifically comprising:
(1) aligning the sequencing data files and calculating the read count at each site; normalizing the read counts according to the mean read count of each sample to obtain read count signals that are comparable between samples;
(2) based on the preprocessed data matrix M, in which each row represents a sample and each column represents a bin, calculating the correlation coefficients between bins by Pearson correlation analysis, constructing a fitted curve of the correlation coefficients, and obtaining the derivative value at each bin from it; using the derivative values as background, establishing a null distribution for hypothesis testing and testing the derivative values for significance, where a significant value means that a breakpoint exists at the position of that bin, to obtain the candidate copy number variation regions;
(3) based on the CNVs defined in single samples, constructing a null distribution for hypothesis testing by a multi-sample permutation strategy; testing the CNV frequencies of the data matrix before permutation, calculating the p-value of each CNV, and determining the CNVs with consistent variation across samples according to a significance level threshold.
2. The statistical test method for consistent copy number variation regions across multiple samples according to claim 1, characterized in that the sequencing data files need to be preprocessed before the fitted curve of the correlation coefficients is constructed, specifically:
aligning the sequencing data files and calculating the read count at each site; normalizing the read counts according to the mean read count of each sample to obtain read count signals that are comparable between samples, the calculation formula being:
where mean_RC_n and mean_RC denote the mean read count of the n-th sample and the mean of the read counts of the multiple samples, respectively; x_nm denotes the read count at the m-th site of the n-th sample; and x'_nm denotes the normalized read count at the corresponding site.
3. The statistical test method for consistent copy number variation regions across multiple samples according to claim 2, characterized in that equal-length bins are defined, the per-site read counts of each sample are converted into per-bin read counts, and copy number variation detection is carried out in units of bins.
4. The statistical test method for consistent copy number variation regions across multiple samples according to claim 1, characterized in that significant CNVs are detected in the candidate copy number variation regions by an iterative loop, specifically: constructing a null distribution for hypothesis testing by randomly permuting the candidate CNV regions across the whole genome and testing the candidate CNV regions against it; if a significant CNV is found, removing it from the genome, reconstructing the null distribution, and testing the candidate CNV regions again, until no new CNVs are found.
5. The statistical test method for consistent copy number variation regions across multiple samples according to claim 1, characterized in that detecting the consistent copy number variation regions across multiple samples comprises: randomly permuting CNVs along the whole genome and across samples to construct a permuted data matrix Mt, and calculating the frequency f with which the random CNVs occur across the samples; repeating this process n times, n > 1000, to obtain a distribution of the frequency f, i.e. the null distribution for hypothesis testing; testing the CNV frequencies of the data matrix before permutation, calculating the p-value of each CNV, and determining the CNVs with consistent variation across samples according to a significance level threshold.
6. A cancer gene to which the statistical test method for consistent copy number variation regions across multiple samples according to any one of claims 1 to 5 is applied.
7. A cancer cell molecule to which the statistical test method for consistent copy number variation regions across multiple samples according to any one of claims 1 to 5 is applied.
CN201611040980.3A 2016-11-24 2016-11-24 A statistical test method for consistent copy number variation regions across multiple samples Active CN106682455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611040980.3A CN106682455B (en) 2016-11-24 2016-11-24 A statistical test method for consistent copy number variation regions across multiple samples

Publications (2)

Publication Number Publication Date
CN106682455A CN106682455A (en) 2017-05-17
CN106682455B true CN106682455B (en) 2019-03-26

Family

ID=58866051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611040980.3A Active CN106682455B (en) 2016-11-24 2016-11-24 A kind of Statistical Identifying Method of multisample copy number consistency variable region

Country Status (1)

Country Link
CN (1) CN106682455B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107967410B (en) * 2017-11-27 2021-07-30 University of Electronic Science and Technology of China Fusion method for gene expression and methylation data
CN111508559B (en) * 2020-04-21 2021-08-13 北京橡鑫生物科技有限公司 Method and device for detecting target area CNV
CN112767999A (en) * 2021-01-05 2021-05-07 Shanghai Institute of Materia Medica, Chinese Academy of Sciences Analysis method and device for whole genome sequencing data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778350A (en) * 2014-01-09 2014-05-07 Xidian University Somatic copy number alteration obviousness detection method based on two-dimension statistic model
CN105760712A (en) * 2016-03-01 2016-07-13 Xidian University Copy number variation detection method based on next generation sequencing

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CNV-TV: A robust method to discover copy number variation from short sequencing reads; Junbo Duan et al.; BMC Bioinformatics; 20131231; Vol. 14 (No. 150); pp. 1-12
Common Copy Number Variation Detection From Multiple Sequenced Samples; Junbo Duan et al.; IEEE Trans Biomed Eng; 20140331; Vol. 61 (No. 3); pp. 928-937
Copy number variation detection using next generation sequencing read counts; Heng Wang et al.; BMC Bioinformatics; 20141231; Vol. 15 (No. 109); pp. 1-14
Research and design of copy number variation detection algorithms for next-generation sequencing; Li Yan et al.; Chinese Journal of Bioinformatics (《生物信息学》); 20150930; Vol. 13 (No. 3); pp. 186-191

Also Published As

Publication number Publication date
CN106682455A (en) 2017-05-17

Similar Documents

Publication Publication Date Title
Tarabichi et al. A practical guide to cancer subclonal reconstruction from DNA sequencing
Agrawal et al. Large-scale analysis of disease pathways in the human interactome
Ay et al. Analysis methods for studying the 3D architecture of the genome
JP6240210B2 (en) Accurate and rapid mapping of target sequencing leads
Hong et al. Inferring the origin of metastases from cancer phylogenies
WO2020035446A1 (en) Systems and methods for using neural networks for germline and somatic variant calling
Liu et al. Quantitative assessment of cell population diversity in single-cell landscapes
CN106682455B (en) A kind of Statistical Identifying Method of multisample copy number consistency variable region
Halperin et al. A method to reduce ancestry related germline false positives in tumor only somatic variant calling
Zhang et al. Statistical method evaluation for differentially methylated CpGs in base resolution next-generation DNA sequencing data
Sefer A comparison of topologically associating domain callers over mammals at high resolution
Park et al. i6mA-DNC: Prediction of DNA N6-Methyladenosine sites in rice genome based on dinucleotide representation using deep learning
Rackham et al. A Bayesian approach for analysis of whole-genome bisulfite sequencing data identifies disease-associated changes in DNA methylation
Gilmore et al. ACE: A workbench using evolutionary genetic algorithms for analyzing association in TCGA
Wyllie et al. M. tuberculosis microvariation is common and is associated with transmission: analysis of three years prospective universal sequencing in England
US20210324465A1 (en) Systems and methods for analyzing and aggregating open chromatin signatures at single cell resolution
WO2017201400A1 (en) Determination of cell types in mixtures using targeted bisulfite sequencing
Li et al. SM-RCNV: a statistical method to detect recurrent copy number variations in sequenced samples
Wu et al. Computational Systems Biology
Hu et al. Processing UMI Datasets at High Accuracy and Efficiency with the Sentieon ctDNA Analysis Pipeline
CN116825182B (en) Method for screening bacterial drug resistance characteristics based on genome ORFs and application
Lauria Rank-based miRNA signatures for early cancer detection
Aljouie et al. Cross-validation and cross-study validation of chronic lymphocytic leukaemia with exome sequences and machine learning
Haque et al. Detection of copy number variations from NGS data by using an adaptive kernel density estimation-based outlier factor
Shi et al. Ultra-rapid metagenotyping of the human gut microbiome

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant