CN105986008A

CN105986008A - CNV detection method and CNV detection apparatus

Info

Publication number: CN105986008A
Application number: CN201510039685.5A
Authority: CN
Inventors: 李甫强; 史旭莲; 谢国云; 鲁娜; 赵至坤; 蒋润泽; 梁瀚; 侯勇; 吴逵
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2015-01-27
Filing date: 2015-01-27
Publication date: 2016-10-05

Abstract

The invention provides a CNV detection method comprising the steps of: 1) acquiring a genome sequencing result of a target individual; 2) aligning the sequencing result with a reference sequence to obtain an alignment result, wherein the reference sequence comprises a plurality of windows; 3) on the basis of the alignment result, calculating an initial alignment ratio for each window, which is the number of reads aligned to that window divided by the average number of reads aligned per window, wherein the average is the total number of reads aligned to all windows divided by the number of windows; 4) merging adjacent windows whose initial alignment ratios show no significant difference and defining each merged group as a primary region, each remaining individual window also being a primary region; and 5) if the alignment ratio of a primary region is not equal to a preset alignment ratio, determining that a CNV exists in that primary region.

Description

CNV detection method and device

Technical field

The present invention relates to the field of bioinformatics and, in particular, to a method and an apparatus for detecting CNV.

Background art

Single-cell sequencing uses next-generation sequencing to sequence the trace amount of DNA in an individual cell. The technique mainly comprises three parts: single-cell isolation, extraction and amplification of single-cell nucleic acid, and sequencing. As a revolutionary technique, single-cell sequencing has been widely applied in scientific research and biomedicine in recent years. Examples include sequencing single tumor cells to reveal tumor heterogeneity at the single-cell level and to infer the evolutionary course of a tumor; non-invasive prenatal diagnosis; assembling the genomes of microorganisms that cannot be cultured; obtaining genomes from trace amounts of cells (applicable, for example, in forensics); and preimplantation embryo diagnosis. Single-cell sequencing solves the difficulty of obtaining genomes from trace amounts of cells and provides new methods for research on disease pathogenesis and diagnostics.

Copy number variation (Copy Number Variants, CNV) in single cells plays a very important role in single-cell research. CNV generally refers to the loss or duplication of a chromosomal fragment larger than 1 kb. CNV is a genetic polymorphism widespread in animal and plant genomes, and its mutation frequency is far higher than that of SNPs. Genome studies have shown that CNV is associated with some human diseases, including complex diseases such as tumors, obesity, autism, autoimmune diseases and systemic lupus erythematosus (SLE). In studies of tumor heterogeneity and evolution, detecting CNVs in single tumor cells and comparing the CNVs between single cells, and between single cells and the corresponding tissue, reveals tumor heterogeneity at the level of individual cells and provides a basis for inferring tumor evolution. Non-invasive prenatal diagnosis requires testing trace DNA for chromosomal aneuploidy (one kind of CNV) that causes Down syndrome (47, +21), Edwards syndrome (47, +18), Patau syndrome (47, +13) and the like. Preimplantation embryo diagnosis and screening require analysis of a single germ cell or embryonic cell. Forensic evidence samples (trace amounts of blood, semen and so on) require analysis of trace amounts of cells. In general, the biomedical field currently poses both a demand and a challenge for detecting large-fragment CNVs in trace amounts of cells, even in individual cells.

Existing CNV detection methods, such as CNV-seq, PennCNV, CNAseg and readDepth, are mostly designed for tissue sequencing data. Single-cell sequencing data, especially low-depth sequencing data, have low genome coverage and high amplification bias, and the number of aligned short sequences fluctuates greatly between genomic regions, so these CNV detection methods are not well suited to detecting copy number variation in single cells.

Summary of the invention

The present invention aims to solve at least one of the above problems, or at least to provide a commercial alternative.

According to one aspect of the present invention, a method for detecting CNV is provided, comprising the following steps: obtaining a genome sequencing result of a target individual, the sequencing result comprising a plurality of reads; aligning the sequencing result with a reference sequence to obtain an alignment result, wherein the reference sequence comprises a plurality of windows and the alignment result comprises the number of reads aligned to each window; calculating, based on the alignment result, the initial alignment ratio (comparison rate) of each window, where initial alignment ratio of a window = number of reads aligned to that window / average number of reads aligned per window, and average number of reads aligned per window = total number of reads aligned to all windows / number of windows; merging adjacent windows whose initial alignment ratios show no significant difference, each merged group of adjacent windows being defined as a primary region and each remaining individual window also being a primary region; and, when the alignment ratio of a primary region is not equal to a predetermined alignment ratio, determining that a CNV exists in that primary region. The alignment ratio of a primary region is the mean of the initial alignment ratios of the windows it contains. The predetermined alignment ratio is the alignment ratio that occurs most frequently among all windows, where the alignment ratio of a window is the alignment ratio of the primary region in which it lies.

In one embodiment of the invention, the genome is obtained from a single cell of the target individual: a sequencing library is built from the single-cell genome and sequenced to obtain the sequencing result. Optionally, building the sequencing library includes subjecting the genome to degenerate oligonucleotide-primed PCR (DOP-PCR), multiple displacement amplification (MDA) and/or multiple annealing and looping-based amplification cycles (MALBAC), so as to obtain enough nucleic acid for library construction and/or for sequencing. Sequencing can use existing platforms, including but not limited to CG (Complete Genomics), Illumina/Solexa, Life Technologies ABI SOLiD and Roche 454; the library is prepared according to the selected platform, and single-end or paired-end sequencing may be chosen. The sequencing result so obtained consists of many short sequences, each of which is called a read. The alignment can be carried out with known alignment software such as Bowtie, SOAP, BWA and/or TeraMap. In one embodiment of the invention, only reads that align to a unique position on the reference sequence are used to calculate the alignment ratio, which improves data accuracy and thereby the accuracy of CNV detection.
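The initial alignment ratio defined above can be illustrated with a minimal sketch. The function name `initial_ratios` and the read counts are hypothetical; the only thing taken from the text is the formula (count in a window divided by the mean count over all windows).

```python
from typing import List


def initial_ratios(read_counts: List[int]) -> List[float]:
    """Initial alignment ratio per window: reads in the window / mean reads per window."""
    if not read_counts:
        raise ValueError("need at least one window")
    # Mean = total reads aligned to all windows / number of windows.
    mean_count = sum(read_counts) / len(read_counts)
    if mean_count == 0:
        raise ValueError("no reads aligned to any window")
    return [count / mean_count for count in read_counts]


# Hypothetical counts of uniquely aligned reads in five windows (mean is 100).
counts = [90, 110, 100, 200, 0]
ratios = initial_ratios(counts)
print(ratios)
```

By construction the ratios average to 1, so a window at the typical coverage sits near 1 and deviations from 1 are what later steps examine.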

The windows can be determined in advance or at the same time as the target individual is tested. In one embodiment of the invention, the windows are predetermined. Determining the windows includes: aligning a short-sequence set, comprising a plurality of short sequences, with the reference sequence and determining the start positions of the short sequences aligned to the reference sequence; and delimiting windows on the reference sequence so that each window contains the same number of start positions, optionally with no overlap between windows. During alignment, depending on the alignment parameters, at most m mismatched bases (mismatches) are allowed in a short sequence, m preferably being 1 or 2; a short sequence with more than m mismatched bases is treated as not alignable to the reference sequence. A start position is the position on the reference sequence matched by the first base of an aligned short sequence. When the first bases of several short sequences align to the same position of the reference sequence, that position is recorded only once, that is, it counts as one start position. The first base of a short sequence, and hence the direction of the short sequence, is defined relative to the direction of the reference sequence: for example, the base of a short sequence that matches the foremost position of the reference sequence (the position with the smallest coordinate) is called its first base. Each window is required to contain the same number of start positions, while the number of sites in it that are not matched by any read is unrestricted, so the windows generally differ in size, which helps reduce the bias introduced by single-cell genome amplification. Under the same design, windows can equally be delimited so that each contains the same number of some other specific position, a specific position being the position on the reference sequence matched by the base at a fixed location in each aligned short sequence, for example the last base of each aligned short sequence.
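The window delimitation described above can be sketched as follows for one chromosome. The function name `delimit_windows`, the toy coordinates, and the choice to drop a trailing group that cannot be filled are assumptions made for illustration; the text only requires that each window hold the same number of distinct start positions and that windows do not overlap.

```python
from typing import List, Tuple


def delimit_windows(start_positions: List[int], per_window: int) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) coordinates of non-overlapping windows,
    each holding `per_window` distinct start positions."""
    if per_window < 1:
        raise ValueError("per_window must be positive")
    # A position hit by the first base of several reads is recorded only once.
    unique = sorted(set(start_positions))
    windows = []
    # Assumption: a trailing group with fewer than `per_window` positions is dropped.
    for k in range(0, len(unique) - per_window + 1, per_window):
        chunk = unique[k:k + per_window]
        windows.append((chunk[0], chunk[-1]))
    return windows


# Hypothetical start positions of aligned reads on one chromosome.
starts = [5, 5, 9, 12, 40, 41, 41, 90, 300]
windows = delimit_windows(starts, per_window=3)
print(windows)
```

The two resulting windows hold three start positions each but span different lengths, which is the variable-length property the text relies on.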

The short-sequence set may come from a simulated sequence set and/or from a sequencing result. The sequencing result may be one's own sequencing data from human nucleic acid or a sequencing result of a nucleic acid sample published by others, and the nucleic acid may be genomic DNA or cell-free DNA. Preferably, the simulated sequences aligned to the reference genome are distributed relatively evenly. In one embodiment of the invention, the simulated sequences are obtained as follows: starting from the base at one end of a chromosome of length Q in the reference sequence, copy P bases of the chromosome to obtain the first simulated sequence; move one base toward the other end of the chromosome and copy P bases to obtain the second simulated sequence; move two bases toward the other end and copy P bases to obtain the third simulated sequence; and so on until the (Q-P+1)-th simulated sequence is obtained, whose last base coincides with the base at the other end of the chromosome. Here P is the length of a simulated sequence, preferably P ≥ 10. In one embodiment of the invention, the windows do not overlap and the total number of windows is not more than 100,000. The window size can be adjusted according to the required CNV detection precision; for the human reference genome, whose size is fixed, the window size is inversely proportional to the number of windows. In this embodiment, the total number of windows is no less than 10,000 and no more than 100,000, and the windows do not overlap, which favors accurate detection of CNVs of at least 1 kb as generally defined.
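The sliding construction of simulated sequences can be sketched in a few lines. The function name `simulate_reads` and the toy chromosome are hypothetical; the step of one base and the count of Q-P+1 sequences follow the description above.

```python
from typing import Iterator


def simulate_reads(chromosome: str, p: int) -> Iterator[str]:
    """Yield every length-p substring of the chromosome, shifting one base at a time."""
    q = len(chromosome)
    if p < 1 or p > q:
        raise ValueError("read length must be between 1 and the chromosome length")
    # Start offsets 0 .. q-p give q-p+1 simulated sequences in total.
    for start in range(q - p + 1):
        yield chromosome[start:start + p]


# Toy chromosome with Q = 10 and read length P = 4 (real use would have P >= 10).
reads = list(simulate_reads("ACGTACGTAC", 4))
print(len(reads), reads[0], reads[-1])
```

Because every reference position is the start of exactly one simulated sequence, the positions that survive unique alignment reflect mappability rather than sampling noise.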

In one embodiment of the invention, the target individual is a human. Humans are diploid, with a chromosome set number of 2. Preferably, the reference sequence is at least part of the human reference genome, for example HG19, which can be obtained from the NCBI database, or is the reference sequence made up of all the windows. In another embodiment of the invention, each base of the pseudoautosomal regions of the Y chromosome of the human reference genome is replaced with N, where N stands for any one of A, T, C and G; this helps avoid false-positive CNV calls in the pseudoautosomal regions of the sex chromosomes.
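Replacing a region of a sequence with N can be sketched as below. The function name `mask_regions`, the 1-based inclusive coordinate convention, and the toy sequence and coordinates are all assumptions for illustration; real pseudoautosomal coordinates would have to be taken from the reference genome annotation.

```python
from typing import Iterable, Tuple


def mask_regions(sequence: str, regions: Iterable[Tuple[int, int]]) -> str:
    """Replace bases inside 1-based inclusive (start, end) regions with 'N'."""
    bases = list(sequence)
    for start, end in regions:
        if start < 1 or end < start:
            raise ValueError("regions must satisfy 1 <= start <= end")
        # Clip the region to the sequence length before masking.
        for k in range(start - 1, min(end, len(bases))):
            bases[k] = "N"
    return "".join(bases)


# Toy sequence and hypothetical regions standing in for the Y-chromosome PARs.
masked = mask_regions("ACGTACGTAC", [(2, 4), (9, 10)])
print(masked)
```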

In one embodiment of the invention, before adjacent windows whose initial alignment ratios show no significant difference are merged, the relationship between alignment ratio and GC content is used to apply a GC correction to the initial alignment ratio of each window, giving a corrected alignment ratio for each window. This eliminates or reduces the effect of GC content on the sequencing result and the alignment ratio. The corrected alignment ratio then replaces the initial alignment ratio of each window in subsequent detection. For example, the alignment ratio of a primary region becomes the mean of the corrected alignment ratios of all the windows it contains. When the predetermined alignment ratio is determined, the alignment ratio of each primary region is assigned to the windows it contains, so that all windows in the same primary region share the alignment ratio of that region; the alignment ratios of all windows are then tallied to count how often each value occurs, and the most frequent window alignment ratio is taken as the predetermined alignment ratio.

The relationship between alignment ratio and GC content can be established in advance from the sequencing data of control samples and stored for correcting the sequencing result of each sample to be tested, the control sample preferably being a tissue sample of the same species as the target individual; it can also be established from the genome sequencing result of the target sample at the time the target sample is tested. In one embodiment of the invention, the sequencing result of the target sample under test is used directly. The relationship is established as follows: obtain sequencing data, consisting of a plurality of reads, for the nucleic acid of at least one sample; align the sequencing data with a reference sequence comprising a plurality of windows to obtain an alignment result comprising the number of reads aligned to each window; calculate the initial alignment ratio of each window, where initial alignment ratio of a window = number of reads aligned to that window / average number of reads aligned per window, and average number of reads aligned per window = total number of reads aligned to all windows / number of windows; and, from the resulting pairs of values (initial alignment ratio of a window, GC content of that window), establish the relationship between alignment ratio and GC content by two-dimensional regression analysis. In one embodiment of the invention, the two-dimensional regression analysis used is locally weighted scatterplot smoothing (LOWESS).
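A GC correction of this kind can be sketched with a simplified LOWESS written in plain Python. Several points are assumptions rather than details from the text: the smoother has no robustness iterations, the smoothing fraction `frac` is arbitrary, and the correction divides each ratio by the fitted ratio at its GC content (the text does not state how the fitted relationship is applied). The names `lowess_fit` and `gc_correct` are hypothetical.

```python
import math
from typing import List


def lowess_fit(x: List[float], y: List[float], frac: float = 0.5) -> List[float]:
    """Simplified LOWESS: tricube-weighted local linear fit evaluated at each x."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in x:
        dmax = sorted(abs(xi - x0) for xi in x)[k - 1] or 1.0
        sw = swx = swy = swxx = swxy = 0.0
        for xi, yi in zip(x, y):
            u = abs(xi - x0) / dmax
            w = (1 - u ** 3) ** 3 if u < 1 else 0.0  # tricube weight
            sw += w
            swx += w * xi
            swy += w * yi
            swxx += w * xi * xi
            swxy += w * xi * yi
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate case: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw + slope * x0)
    return fitted


def gc_correct(ratios: List[float], gc: List[float], frac: float = 0.5) -> List[float]:
    """Assumed correction: ratio divided by the ratio expected at the window's GC content."""
    expected = lowess_fit(gc, ratios, frac)
    return [r / e if e > 0 else r for r, e in zip(ratios, expected)]


# Hypothetical windows whose raw ratio rises linearly with GC content.
gc_values = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
raw = [0.5 + g for g in gc_values]
corrected = gc_correct(raw, gc_values, frac=0.7)
print([round(c, 6) for c in corrected])
```

In this toy case the whole GC trend is removed and every corrected ratio is 1; in real data the residual deviations from 1 are what the segmentation step examines. A production implementation would more likely use an established LOWESS routine with robustness iterations.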

Adjacent windows whose initial alignment ratios show no significant difference are adjacent windows whose initial alignment ratios do not differ substantially. For example, because the initial or corrected alignment ratios are a set of values that fluctuate around 1, either 1 or 1 ± 10% can be used as the boundary for a substantial difference: adjacent windows whose corrected alignment ratios are all below 0.90, all between 0.90 and 1.10, or all above 1.10 are windows without substantial difference. In one embodiment of the invention, merging adjacent windows whose initial alignment ratios show no significant difference means merging adjacent windows whose corrected alignment ratios are all greater than 1 or all less than 1.

Further, to determine exactly the size of a detected CNV and the positions where it occurs (its breakpoints), the method also comprises determining the secondary regions within a primary region, including: (1) using the formula for Z_ij, which is computed from the partial sums S_i, S_j and S_n defined below, calculating the difference between the alignment ratio of sub-region M within the primary region and that of all other windows in the same primary region to obtain all Z_ij, and taking Z_c = max_{1≤i<j≤n} |Z_ij|; (2) comparing Z_c with a first critical value and, when Z_c is greater than the first critical value, taking the corresponding sub-region M as a secondary region, the secondary region being a CNV region and its boundaries being the positions where the CNV occurs; and (3) removing the secondary region from the primary region, updating i, j and n, and repeating steps (1) and (2) until no Z_c exceeds the first critical value. Here i and j are indices of windows in the primary region, n is the number of windows in the primary region, sub-region M is the region from the (i+1)-th window to the j-th window of the primary region, R_i is the corrected alignment ratio of the i-th window in the primary region, the first critical value is the critical value of the Z_ij distribution corresponding to a first predetermined probability, the first predetermined probability is ≥ 95%, 1 ≤ i < j ≤ n, S_i = R₁ + … + R_i, S_j = R₁ + … + R_j, and S_n = R₁ + … + R_n.

The null hypothesis is that sub-region M is a normal region without variation. The Z_ij distribution means that Z_ij follows the standard normal distribution; the first predetermined probability and the first critical value correspond one to one, and general statistics textbooks include tables giving the critical value for each probability. In one embodiment of the invention, when Z_c falls in the rejection region, that is, Z_c is greater than the first critical value corresponding to a first predetermined probability of, for example, 99.9%, a small-probability event has occurred and the null hypothesis is rejected, that is, sub-region M is a variant region. In this process, windows are first merged according to the direction of their corrected alignment ratios, all greater than 1 or all less than 1, to obtain large primary regions; each primary region is then examined iteratively to determine the boundaries of the CNVs in it, that is, to determine its secondary regions. Secondary regions can therefore be determined in several primary regions in parallel, which speeds up CNV detection.

In one embodiment of the invention, the alignment ratio of a secondary region is the mean of the corrected alignment ratios of all the windows it contains. In one embodiment of the invention, the method also includes determining the type of CNV by comparing the alignment ratio of the secondary region with the predetermined alignment ratio: when the alignment ratio of the secondary region is greater than the predetermined alignment ratio, the secondary region is determined to be a copy number gain region; when it is less than the predetermined alignment ratio, the secondary region is determined to be a copy number loss region. In another embodiment of the invention, the copy number of a secondary region is calculated with the following formula: copy number of the secondary region = alignment ratio of the secondary region / predetermined alignment ratio × chromosome set number of the target individual, where the alignment ratio of the secondary region is the mean of the corrected alignment ratios of all the windows it contains.
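The formula used in step (1) is not reproduced in the text above. Given the partial sums S_i, S_j and S_n and the comparison of windows i+1 to j against the remaining n-(j-i) windows, one form that fits these definitions is the circular binary segmentation (CBS) statistic shown below. It is offered as a reconstruction under that assumption, not as the patent's verbatim formula; for Z_ij to be compared against standard normal critical values, the corrected ratios R would additionally need to be scaled to unit variance.

```latex
Z_{ij} = \left[\frac{1}{j-i} + \frac{1}{n-j+i}\right]^{-1/2}
         \left[\frac{S_j - S_i}{j-i} - \frac{S_n - S_j + S_i}{n-j+i}\right],
\qquad S_k = R_1 + \dots + R_k,\quad 1 \le i < j \le n
```

The second bracket is the mean corrected ratio of sub-region M minus the mean of all other windows in the primary region; the first bracket rescales that difference for the sizes of the two groups.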

"No significant difference" can also mean that the difference is not statistically significant. For example, a predetermined probability is set, usually no less than 95%, and a statistical test such as a z test or a t test is applied to the corrected alignment ratios of adjacent windows; if the difference among the corrected alignment ratios is not significant (p > 0.05), the windows are considered to show no significant difference. In one embodiment of the invention, merging adjacent windows whose initial alignment ratios show no significant difference means merging adjacent windows whose corrected alignment ratios do not differ with statistical significance, as described below, so that the primary regions obtained by merging are CNV regions.

Merging adjacent windows whose initial alignment ratios show no significant difference specifically includes: (a) using the formula for Z_xy, which is computed from the partial sums S_x, S_y and S_w defined below, calculating the difference between the alignment ratio of region N and that of all other windows to obtain all Z_xy, and taking Z_b = max_{1≤x<y≤w} |Z_xy|; (b) comparing Z_b with a critical value and, when Z_b exceeds the critical value, taking the corresponding region N as a primary region; and (c) removing the primary region, updating x, y and w, and repeating steps (a) and (b) until no Z_b exceeds the critical value. Here x and y are window indices, w is the total number of windows, region N is the region from the (x+1)-th window to the y-th window, R_x is the corrected alignment ratio of the x-th window, the critical value is the critical value of the Z_xy distribution corresponding to the predetermined probability, the predetermined probability is ≥ 95%, 1 ≤ x < y ≤ w, S_x = R₁ + … + R_x, S_y = R₁ + … + R_y, and S_w = R₁ + … + R_w. The Z_xy distribution means that Z_xy follows the standard normal distribution, and the predetermined probability and the critical value correspond one to one.

In one embodiment of the invention, the null hypothesis is that region N is a normal region without variation; when Z_b falls in the rejection region, that is, Z_b exceeds the critical value corresponding to a predetermined probability of, for example, 99.9%, a small-probability event has occurred and the null hypothesis is rejected, that is, region N is a variant region. This process iterates over all windows to determine the boundaries of the CNVs, and the primary regions so determined are CNV regions.

In one embodiment of the invention, the method also includes determining the type of the CNV by comparing the alignment ratio of the primary region with the predetermined alignment ratio: when the alignment ratio of the primary region is greater than the predetermined alignment ratio, the primary region is determined to be a copy number gain region; when it is less than the predetermined alignment ratio, the primary region is determined to be a copy number loss region. In another embodiment of the invention, the method also includes calculating the copy number of the primary region with the following formula: copy number of the primary region = alignment ratio of the primary region / predetermined alignment ratio × chromosome set number of the target individual, where the alignment ratio of the primary region is the mean of the corrected alignment ratios of all the windows it contains.
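Steps (a) and (b) can be sketched as a brute-force scan over all pairs (x, y). The sketch assumes that Z_xy takes the circular binary segmentation form built from the partial sums S_x, S_y and S_w, which is an assumption consistent with the definitions in the text rather than a formula quoted from it. The function name `max_cbs_statistic` and the example ratios are hypothetical, and the ratios are not rescaled to unit variance, so the returned value is only comparable to a critical value after such scaling.

```python
import math
from typing import List, Tuple


def max_cbs_statistic(ratios: List[float]) -> Tuple[float, int, int]:
    """Return (Z_b, x, y) maximising |Z_xy| over 1 <= x < y <= w.

    Region N is windows x+1 .. y (1-based); S_k = R_1 + ... + R_k.
    """
    w = len(ratios)
    if w < 2:
        raise ValueError("need at least two windows")
    partial = [0.0]
    for r in ratios:
        partial.append(partial[-1] + r)  # partial[k] = S_k
    best = (0.0, 0, 0)
    for x in range(1, w):
        for y in range(x + 1, w + 1):
            inside = y - x        # windows in region N
            outside = w - y + x   # all other windows
            scale = (1 / inside + 1 / outside) ** -0.5
            mean_in = (partial[y] - partial[x]) / inside
            mean_out = (partial[w] - partial[y] + partial[x]) / outside
            z = scale * (mean_in - mean_out)
            if abs(z) > best[0]:
                best = (abs(z), x, y)
    return best


# Hypothetical corrected ratios: windows 4 to 6 are elevated.
example = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
z_b, x, y = max_cbs_statistic(example)
print(round(z_b, 4), x, y)
```

The scan picks x = 3 and y = 6, so region N is windows 4 to 6, matching the elevated stretch. Step (c) would remove that region and rescan the remainder until no Z_b exceeds the chosen critical value. The double loop is quadratic in the number of windows, which is acceptable for a sketch but slow for tens of thousands of windows.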

The CNV detection method of the above aspect of the invention, or of any of its embodiments, addresses shortcomings of existing CNV detection workflows. For example, existing methods use windows of fixed length, cannot adequately handle the bias and repetitive-sequence problems introduced by whole-genome amplification in single-cell sequencing, and are not well suited to detecting CNVs in diploid single cells. The method of the invention is well suited to CNV detection based on single-cell sequencing data, particularly low-depth single-cell sequencing, and is effective for single-cell or tissue sequencing data generated on different sequencing platforms with different amplification methods, so it is widely applicable. When different sequencing platforms use different whole-genome amplification methods for single-cell sequencing, the method shows good sensitivity and specificity in detecting CNV, especially for Proton-platform sequencing data based on multiple annealing and looping-based amplification cycles (MALBAC). Moreover, the results obtained with the method are highly reproducible and reliable. Compared with existing CNV detection methods, the method of the invention uses windows of variable length, which helps keep the average number of aligned short sequences per window stable and avoids the influence of repetitive-sequence regions, making CNV detection more accurate.

According to another aspect of the present invention, an apparatus for detecting CNV is provided, which can be used to carry out the CNV detection method of the above aspect of the invention or of any of its embodiments. The apparatus includes: a data input unit for receiving data; a data output unit for outputting data; a processor for executing a computer-executable program, execution of which includes carrying out the CNV detection method of the above aspect or of any of its embodiments; and a storage unit for storing data, including the computer-executable program. The computer-executable program can be stored in a storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc and the like. The present invention also provides a computer-readable storage medium for storing a program for execution by a computer, execution of which includes carrying out the CNV detection method of the above aspect or of any of its embodiments. The foregoing description of the advantages and technical features of the CNV detection method of the invention also applies to this CNV detection apparatus and computer-readable storage medium and is not repeated here.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken together with the accompanying drawings, in which:

Fig. 1 is the density distribution of the ratio of each window after window merging in a specific embodiment of the present invention;

Fig. 2 is a schematic diagram of the CNV detection results for CG-platform single-cell sequencing data based on MDA amplification in a specific embodiment of the present invention;

Fig. 3 is a schematic diagram of the CNV detection results for Proton-platform single-cell sequencing data based on MDA amplification in a specific embodiment of the present invention;

Fig. 4 is a schematic diagram of the CNV detection results for Proton-platform single-cell sequencing data based on MALBAC amplification in a specific embodiment of the present invention.

Detailed description of the invention

The general steps of the method of the invention, and the ways of obtaining the relevant information, are described below.

1. First, download the hg19 reference genome sequence file chromFa.tar.gz from the UCSC website (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/). Because the CNV detection method uses only short sequences that align to exactly one position on the reference genome, the sequence of the pseudoautosomal regions on the Y chromosome is replaced with N. This modified version of the Y chromosome is used in all subsequent steps. The pseudoautosomal regions are the only places where exchange can occur between the X chromosome and the Y chromosome, which is the origin of the name: exchange normally occurs between autosomes, whereas the X and Y chromosomes usually do not exchange, except in the pseudoautosomal regions. As a result, both males and females carry two copies of the genes in these regions, so the genes of the pseudoautosomal regions behave like autosomal genes rather than following the sex-linked inheritance pattern of the sex chromosomes, hence the name.

2. Determine the size of each detection window (window).

A) For raw data from the Proton sequencing platform, single-end simulated data can be used to delimit the windows. Using the hg19 reference genome as the basis, simulate single-end short sequences: starting from the first base of each chromosome, take 50 bases as one read (reads) and generate an ID and quality values for it to produce FASTQ format; then move one base downstream at a time and repeat until the end of the short sequence is the last base of the chromosome. Align the simulated data to the reference genome with bowtie, keep only the uniquely aligned short sequences (that is, remove short sequences that align to more than one position), and convert the alignment result to BAM format with samtools.

The alignment parameters may be set as: bowtie -S -t -n 2 -e 70 -m 1 --best --strata. The same parameters can be used for the subsequent alignment of single-cell sequencing data. Their meanings are: -n 2, at most 2 mismatches are allowed in the high-fidelity (seed) region; -e 70, the sum of the quality values at mismatched positions may not exceed 70; --best, the alignments reported for each short sequence are ordered from best to worst match quality; --strata, used together with --best, reports only the alignments in the best-quality stratum; -m 1, only short sequences with a single reportable alignment are reported, and those with more are suppressed.

B) For raw data from the CG platform, the windows are delimited using the sequencing data of the DNA of one cell from a cell line of a normal person. The CG raw data undergo the standard bioinformatics analysis pipeline, for example alignment to the reference genome with the Teramap software, after which the alignment result is converted to BAM format.

For data from either platform, duplicate short sequences can finally be removed with samtools. Record every position on the reference genome that is covered by the first base of a short sequence, and divide these positions into 10,000 to 100,000 windows so that each window contains the same number of positions while its interval length varies. Then calculate the GC content of the reference genome sequence contained in each window.
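The per-window GC content can be computed as in the short sketch below. The function name `gc_content` is hypothetical, and leaving N bases out of the denominator is an assumption; the text does not say how masked or unknown bases are treated.

```python
def gc_content(window_sequence: str) -> float:
    """Fraction of G and C among the called bases (A, C, G, T) of a window."""
    seq = window_sequence.upper()  # reference FASTA may contain soft-masked lowercase
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return float("nan")  # window made up entirely of N or other symbols
    return (seq.count("G") + seq.count("C")) / called


# Toy window sequence with two masked bases and two lowercase bases.
gc = gc_content("ACGTNNgc")
print(round(gc, 4))
```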

3. Extract the DNA of a single cell, perform whole genome amplification, then construct a library and sequence it to obtain the raw data, and carry out the corresponding analysis and processing to obtain alignment results in BAM format.

A) Proton platform: the raw data (BAM format) are converted to FASTQ format with BEDTools. Trimmomatic is then used to trim short sequences longer than 50 bp from the 3' end to the effective length (50 bp plus the length of the primer of the whole genome amplification method). For example, the primer of multiple annealing and looping-based amplification cycles (MALBAC) is 35 bp, so the effective length is 85 bp. Short sequences shorter than the effective length are filtered out at the same time. The trimmed short sequences are aligned to the reference genome with bowtie, the result is converted to a BAM file with samtools view and sorted, and duplicate short sequences are removed.
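The effective-length rule can be shown with a small sketch. It only mirrors the arithmetic in the text (50 bp plus the primer length) on plain strings; the function name `trim_to_effective_length` is hypothetical, and the actual pipeline performs this step with Trimmomatic rather than with code like this.

```python
def trim_to_effective_length(reads, primer_len, base_len=50):
    """Crop reads from the 3' end to base_len + primer_len and drop shorter reads."""
    effective_len = base_len + primer_len
    return [read[:effective_len] for read in reads if len(read) >= effective_len]


# MALBAC primer of 35 bp gives an effective length of 85 bp.
toy_reads = ["A" * 100, "C" * 85, "G" * 60]
trimmed = trim_to_effective_length(toy_reads, primer_len=35)
```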

B) CG platform: the analysis pipeline developed for the platform is used to align the raw short-sequence data to the reference genome hg19. The alignment result is then converted to BAM format and sorted, and duplicate short sequences are removed with the single-end mode of samtools.

4. Count the short sequences aligned to each window and normalize the counts, that is, calculate for each window the alignment ratio (ratio) = number of short sequences aligned to the window / mean number of short sequences aligned across all windows.
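The normalization in this step is a direct division by the mean window count, as in the following sketch (the function name `window_ratios` is illustrative).

```python
def window_ratios(read_counts):
    """Ratio of each window = its read count / mean read count over all windows."""
    mean_count = sum(read_counts) / len(read_counts)
    return [count / mean_count for count in read_counts]


ratios = window_ratios([10, 20, 30])  # mean count is 20
```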

5. Apply GC correction to the normalized ratio of each window, using the ratio-GC content relationship determined by the LOWESS algorithm, to obtain the corrected ratio.
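A possible shape of the GC correction is sketched below under two stated assumptions. First, the fit is a simplified single-pass LOWESS (tricube-weighted local linear regression without robustness iterations), written in plain Python so that it is self-contained. Second, the correction divides each ratio by its fitted value; the text does not say how the fitted relationship is applied, so this choice, like the names `lowess_fit` and `gc_correct`, is illustrative.

```python
def lowess_fit(xs, ys, frac=0.5):
    """Single-pass LOWESS: tricube-weighted local linear fit evaluated at each x."""
    n = len(xs)
    k = min(n, max(2, int(round(frac * n))))     # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12                # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(ratios, gc_values, frac=0.5):
    """Assumed correction: divide each ratio by the ratio expected at its GC content."""
    fitted = lowess_fit(gc_values, ratios, frac)
    return [r / f if f > 0 else r for r, f in zip(ratios, fitted)]


gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
raw = [0.5 + g for g in gc]          # toy ratios that depend only on GC content
corrected = gc_correct(raw, gc)
```

In the toy data the ratio is fully explained by GC content, so the corrected ratios all come out close to 1.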

6. For each sample, based on the corrected ratio values of all windows, the CBS segmentation software can be used to merge windows into non-overlapping regions (segments) and to calculate the ratio value of each region; this ratio value is assigned to every window in the region (segment). Specifically: (a) using the formula for Z_xy, calculate the difference between the alignment ratio of region N and that of all other windows to obtain all Z_xy, where region N is the region from the (x+1)-th window to the y-th window and Z_xy follows the standard normal distribution, and take Z_b = max_{1≤x<y≤w} |Z_xy|; (b) compare Z_b with the critical value; when Z_b exceeds the critical value, the corresponding region N is a region in which windows are to be merged, that is, region N is a region where a CNV occurs; (c) remove the merged region, update x, y and w, and repeat steps (a) and (b) until no Z_b exceeds the critical value, that is, divide the windows cyclically until no further windows can be merged. Here x and y are window indices, w is the total number of windows, R_x is the corrected alignment ratio of the x-th window, the critical value is the value of the Z_xy distribution that corresponds to a predetermined probability, the predetermined probability is ≥ 95%, 1 ≤ x < y ≤ w, S_x = R_1 + … + R_x, S_y = R_1 + … + R_y, and S_w = R_1 + … + R_w. Because Z_xy follows the standard normal distribution, predetermined probabilities and critical values correspond one to one. The process can be understood as a hypothesis test: assume that region N is a normal, unaltered region; when Z_b falls in the rejection region, that is, Z_b is greater than the critical value corresponding to 99.9%, a small-probability event has occurred, the null hypothesis is rejected, and region N is a variant region. The ratio of each merged region is the mean of the corrected ratios of the windows it contains; this ratio value is then assigned to all windows in the region and is taken as the alignment ratio of each of those windows.
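Step (a) can be sketched as follows. The formula for Z_xy is not reproduced in this text, so the sketch assumes the standard circular binary segmentation statistic, Z_xy = [(S_y - S_x)/(y - x) - (S_w - S_y + S_x)/(w - y + x)] / (sigma * sqrt(1/(y - x) + 1/(w - y + x))), which is consistent with the definitions of S_x, S_y and S_w above but is not confirmed as the exact expression used. The function name `max_cbs_statistic` and the noise scale `sigma` are illustrative assumptions.

```python
import math
from itertools import accumulate


def max_cbs_statistic(ratios, sigma=1.0):
    """Return (Z_b, x, y) maximizing |Z_xy| over 1 <= x < y <= w.

    Region N is windows x+1..y; it is compared with all remaining windows.
    """
    w = len(ratios)
    S = [0.0] + list(accumulate(ratios))          # S[k] = R_1 + ... + R_k
    best = (0.0, None, None)
    for x in range(1, w):
        for y in range(x + 1, w + 1):
            inside = y - x                        # windows in region N
            outside = w - y + x                   # all other windows
            mean_in = (S[y] - S[x]) / inside
            mean_out = (S[w] - S[y] + S[x]) / outside
            z = (mean_in - mean_out) / (sigma * math.sqrt(1 / inside + 1 / outside))
            if abs(z) > best[0]:
                best = (abs(z), x, y)
    return best


toy = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0]
z_b, x, y = max_cbs_statistic(toy)
```

For the toy ratios the maximum is reached at x = 3 and y = 6, that is, region N is windows 4 to 6. In practice `sigma` would be an estimate of the window-to-window noise, so that Z_b can be compared with a standard normal critical value as in step (b).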

Then a density curve of the ratios of all windows is plotted, as shown in Figure 1. For a near-diploid cell, or a cell in which diploid is the modal ploidy, the ratio value at the highest peak of the density plot is the ratio value that corresponds to a copy number of 2 for that cell.
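The peak of the ratio distribution can be located in several ways. The sketch below swaps the density curve for a simple fixed-width histogram, which is an assumption made to keep the example dependency-free; the function name `modal_ratio` and the bin width of 0.05 are likewise illustrative.

```python
from collections import Counter


def modal_ratio(ratios, bin_width=0.05):
    """Approximate the density peak by the centre of the most populated histogram bin."""
    bins = Counter(round(r / bin_width) for r in ratios)
    top_bin, _count = bins.most_common(1)[0]
    return top_bin * bin_width


window_ratio_values = [0.98, 1.0, 1.02, 1.01, 0.99, 1.5, 1.52, 0.5]
diploid_ratio = modal_ratio(window_ratio_values)
```

Most toy windows cluster around 1.0, so that value is taken as the ratio corresponding to copy number 2.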

7. Divide the ratio of each region by the ratio that corresponds to a copy number of 2, then multiply by 2, to obtain the copy number of each region.
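The conversion from ratio to copy number is a single scaling, sketched below with the illustrative name `copy_number`; the default `ploidy=2` reflects the diploid case described in the text.

```python
def copy_number(region_ratio, diploid_ratio, ploidy=2):
    """Copy number = region ratio / ratio at copy number 2, times 2."""
    return region_ratio / diploid_ratio * ploidy


gain = copy_number(1.5, 1.0)   # ratio above the diploid ratio
loss = copy_number(0.5, 1.0)   # ratio below the diploid ratio
```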

8. Calculate the sensitivity and specificity of CNV detection: sensitivity = LT/LC and specificity = LT/L, where L is the total length of the CNVs (≥ 1 Mb) found by single-cell sequencing, LC is the total length of the CNVs (≥ 1 Mb) found by tissue sequencing, and LT is the total length of the CNVs (≥ 1 Mb) found by both single-cell sequencing and tissue sequencing.
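The two measures can be computed from interval lists as in the sketch below. It assumes half-open (start, end) intervals on a single chromosome that do not overlap within a list, and it ignores the direction of the CNV (gain or loss); the text does not specify these details, and the function names are illustrative.

```python
MIN_CNV_LEN = 1_000_000  # only CNVs of at least 1 Mb are counted


def total_length(intervals):
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Total length shared between two lists of half-open intervals."""
    return sum(max(0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b)


def sensitivity_specificity(single_cell, tissue):
    """Return (LT / LC, LT / L) as defined in the text."""
    sc = [iv for iv in single_cell if iv[1] - iv[0] >= MIN_CNV_LEN]
    ti = [iv for iv in tissue if iv[1] - iv[0] >= MIN_CNV_LEN]
    lt = overlap_length(sc, ti)
    return lt / total_length(ti), lt / total_length(sc)


single_cell_cnvs = [(0, 4_000_000), (10_000_000, 12_000_000)]
tissue_cnvs = [(1_000_000, 5_000_000)]
sens, spec = sensitivity_specificity(single_cell_cnvs, tissue_cnvs)
```

Here LT is 3 Mb, LC is 4 Mb and L is 6 Mb, giving a sensitivity of 0.75 and a specificity of 0.5.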

The detection method and detection results of the present invention are described in detail below with reference to specific individual samples. The following examples are only intended to explain the present invention and should not be construed as limiting it. In the description of the present invention, terms such as "primary" and "secondary" are used only for convenience of reference or description and should not be understood as indicating an order or relative importance; unless otherwise stated, "multiple" means two or more.

Unless otherwise stated, the reagents, sequences (adapters, tags and primers), software and instruments that are not specifically described in the following examples are conventional commercial products or are publicly available, for example the library construction kits for the HiSeq 2000 sequencing platform purchased from Illumina.

Embodiment 1: Test of the CNV detection method on low-depth CG platform sequencing data based on MDA amplification

With the rapid development of high-throughput sequencing, a variety of sequencing technologies have become important tools for single-cell genomics research: second-generation sequencing, represented by Complete Genomics (CG), Illumina Solexa and Roche 454, and third-generation sequencing (single-molecule sequencing), which includes the Helicos Genetic Analysis System, single-molecule real-time sequencing (SMRT) and nanopore single-molecule sequencing. The CG platform, a second-generation sequencing technology focused on the human genome, can sequence the whole human genome completely and accurately, has a high sequencing throughput and is well recognized in the industry. The CG platform mainly comprises three parts: the sequencing platform, high-throughput process automation, and a complete data management solution. Its sequencing platform includes DNA nanoball arrays (DNB™ arrays) and combinatorial probe-anchor ligation sequencing (cPAL™); the use of these two technologies greatly reduces reagent consumption and shortens imaging time. We therefore first used raw data from the CG platform to validate the CNV detection method of the present invention experimentally.

Three single cells were isolated from the glioblastoma tissue of a patient; the tissue sample was provided by Beijing Tiantan Hospital. The DNA of each single cell was extracted and amplified by MDA whole genome amplification, a library was constructed, and low-depth single-cell sequencing was performed on the CG platform. Single-cell CNVs were then detected and analyzed according to the method of the present invention. To verify the CNV detection performance of the method, we also extracted DNA from the tissue, constructed a library, performed whole genome sequencing on the CG platform, and obtained the CNV result of the tissue using the standard CG analysis pipeline. The CNVs detected in the 3 single-cell samples (P1-T2-SC#) and the tissue sample (P1-T2) are shown in Figure 2. The thick black lines parallel to the X axis indicate the copy number of each region: a value greater than 2 indicates a copy number gain in the region, a value less than 2 indicates a copy number loss, and a value equal to 2 indicates a normal copy number. The ratio value of each window is shown as a scatter point.

The sensitivity and specificity of the CNV detection method were further evaluated (sensitivity = LT/LC, specificity = LT/L). The estimated average sensitivity of the 5 samples is 91.01% and the specificity is 74.47%; the results are shown in Table 1.

Table 1

Next, the repeatability of the CNV detection method based on MDA amplification on the CG sequencing platform was assessed; the repeatability between samples was higher than 0.7. See Table 2 for details.

Table 2

The sensitivity, specificity and repeatability statistics show that the results of the CNV analysis and detection pipeline of the present invention are valid and that the pipeline is feasible on the CG sequencing platform.

Embodiment 2: Test of the CNV detection method on low-depth single-cell sequencing on the Proton platform based on MDA amplification

Most existing single-cell sequencing data are produced on Illumina sequencing platforms. Although Illumina sequencers have a high throughput, their run time is long and their sequencing cost is high, which can limit the rapid development of single-cell CNV detection and analysis. Some studies do not need high sequencing throughput but place higher demands on sequencing time and cost, and in such cases the Proton sequencing platform is the better choice. The Proton platform runs fast, with a sequencing cycle of only a few hours, and has a low sequencing cost, so it is better suited to deployment in hospitals or third-party testing institutions, where it shortens detection time, reduces cost and thereby improves detection efficiency. However, single-cell sequencing CNV detection on Proton has rarely been reported.

Five cells (MDA-2_BGC#) of the human gastric adenocarcinoma cell line BGC823, obtained from Peking University Cancer Hospital, were extracted, amplified by multiple displacement amplification (MDA), subjected to library construction, and then sequenced at low depth as single cells on the Proton platform. In parallel, DNA was extracted from cells of the human gastric adenocarcinoma cell line BGC823 (BGC), a conventional library was constructed, and sequencing was performed on the Proton platform. CNVs were then detected and analyzed according to our scheme. The CNVs detected in the 5 BGC823 single-cell samples and the tissue sample (BGC) are shown in Figure 3. For the copy number of each region (thick black lines parallel to the X axis), a value greater than 2 indicates a copy number gain in the region, a value less than 2 indicates a copy number loss, and a value equal to 2 indicates a normal copy number; the ratio values of the windows in the three kinds of copy number regions are shown as scatter points in different shades of gray.

Large-fragment CNVs were detected on multiple chromosomes across the whole genome, consistent with the CNV detection result of the cell population, which demonstrates the validity of the method of the present invention for detecting CNVs.

Then, based on the CNV detection results of the five single cells and the cell population, we further evaluated the sensitivity and specificity of the CNV detection method (sensitivity = LT/LC, specificity = LT/L). The estimated average sensitivity of the 5 single-cell samples is 85.86% and the specificity is 81.18%; the results are shown in Table 3.

Table 3

The repeatability of the CNV detection method based on MDA amplification on the Proton sequencing platform was assessed; the repeatability between samples was higher than 0.7. See Table 4 for details.

Table 4

The sensitivity, specificity and repeatability statistics show that the results of the CNV detection pipeline of the present invention are valid and that the pipeline is feasible for sequencing data from the Proton sequencing platform with MDA amplification.

Embodiment 3: Test of the CNV detection method on low-depth single-cell sequencing on the Proton platform based on MALBAC amplification

Five cells of the human gastric adenocarcinoma cell line (BGC823) were extracted, amplified by MALBAC whole genome amplification, subjected to conventional library construction, and then sequenced at low depth as single cells on the Proton platform; in parallel, DNA of a cell population (BGC) of the human gastric adenocarcinoma cell line (BGC823) was extracted and sequenced on the Proton platform. The raw data obtained were analyzed for CNVs according to our scheme, and CNVs were found in the five BGC823 samples. The results are shown in Figure 4, in which the abscissa represents the chromosomes, the right-hand vertical axis lists the 5 single-cell samples and the cell population sample, and the left-hand vertical axis is the copy number. The thick black lines on the plot represent the value calculated for each region: a value greater than 2 indicates a copy number gain in the region, a value less than 2 indicates a copy number loss, and a value equal to 2 indicates a normal copy number. The ratio values of the windows in the three kinds of copy number regions are shown as scatter points in different shades of gray.

The sensitivity and specificity of the CNV detection method of the present invention were further evaluated (sensitivity = LT/LC, specificity = LT/L). The estimated average sensitivity of the 5 samples is 84.72% and the specificity is 85.18%; the results are shown in Table 5.

Table 5

The repeatability of the CNV detection method based on MALBAC amplification on the Proton sequencing platform was assessed; the repeatability between samples was higher than 0.92. See Table 6 for details.

Table 6

The sensitivity, specificity and repeatability statistics show that the results of the CNV detection pipeline of the present invention are valid and that the pipeline is feasible for sequencing data from the Proton sequencing platform with MALBAC amplification.

Claims

1. A method for detecting CNV, characterized in that it comprises the following steps:

obtaining a genome sequencing result of a target individual, the sequencing result comprising multiple reads;

aligning the sequencing result to a reference sequence to obtain an alignment result, the reference sequence comprising multiple windows, and the alignment result comprising the number of reads aligned to each window;

based on the alignment result, calculating the initial alignment ratio of each window, wherein the initial alignment ratio of a window = number of reads aligned to the window / mean number of reads aligned to all windows, and the mean number of reads aligned to all windows = total number of reads aligned to all windows / number of windows;

merging multiple adjacent windows whose initial alignment ratios show no significant difference, the merged adjacent windows being defined as a primary region, and each remaining individual window also being called a primary region;

when the alignment ratio of a primary region is not equal to a predetermined alignment ratio, determining that a CNV exists in the primary region, wherein

the alignment ratio of a primary region is the mean of the initial alignment ratios of the windows contained in the primary region, and

the predetermined alignment ratio is the alignment ratio that occurs most frequently among the alignment ratios of all windows, the alignment ratio of a window being the alignment ratio of the primary region in which it is located.

2. The method of claim 1, characterized in that the genome is obtained from a single cell of the target individual;

optionally, the sequencing result is obtained by constructing a genome sequencing library of the cell and sequencing the sequencing library;

optionally, constructing the sequencing library comprises subjecting the genome to degenerate oligonucleotide-primed PCR, multiple displacement amplification and/or multiple annealing and looping-based amplification cycles.

3. The method of claim 1, characterized in that the reference sequence is a human reference genome;

optionally, each base of the pseudoautosomal region of the Y chromosome of the human reference genome is replaced with N, where N represents any one of A, T, C and G.

4. The method of claim 1, characterized in that determining the windows comprises:

aligning a short-sequence set to the reference sequence and determining the starting positions, on the reference sequence, of the aligned short sequences, the short-sequence set comprising multiple short sequences and being derived from a simulated sequence set and/or a sequencing result;

delimiting windows on the reference sequence such that each window contains the same number of starting positions;

optionally, obtaining the simulated sequences of the simulated sequence set comprises:

starting from the base at one end of a chromosome of length Q of the reference sequence, copying P bases of the chromosome to obtain the first simulated sequence,

moving one base toward the other end of the chromosome and copying P bases of the chromosome to obtain the second simulated sequence,

moving two bases toward the other end of the chromosome and copying P bases of the chromosome to obtain the third simulated sequence,

and so on, until the (Q-P+1)-th simulated sequence is obtained, the terminal base of the (Q-P+1)-th simulated sequence coinciding with the base at the other end of the chromosome, wherein

P is the length of a simulated sequence, P ≥ 10;

optionally, the windows do not overlap;

optionally, the total number of windows is not more than 100,000.

5. The method of claim 1, characterized in that, before merging the multiple adjacent windows whose initial alignment ratios show no significant difference, GC correction is applied to the initial alignment ratio of each window using an alignment ratio-GC content relationship, to obtain the corrected alignment ratio of each window,

and the initial alignment ratio of each window is replaced with the corrected alignment ratio of that window.

6. The method of claim 5, characterized in that establishing the alignment ratio-GC content relationship comprises:

obtaining sequencing data of the nucleic acid of at least one sample, the sequencing data consisting of multiple reads;

aligning the sequencing data to a reference sequence to obtain an alignment result, the reference sequence comprising multiple windows, and the alignment result comprising the number of reads aligned to each window;

calculating the initial alignment ratio of each window, wherein the initial alignment ratio of a window = number of reads aligned to the window / mean number of reads aligned to all windows, and the mean number of reads aligned to all windows = total number of reads aligned to all windows / number of windows;

based on multiple pairs of the initial alignment ratio of a window and the GC content of that window, establishing the alignment ratio-GC content relationship by two-dimensional regression analysis;

optionally, the two-dimensional regression analysis is locally weighted regression scatterplot smoothing.

7. The method of claim 5, characterized in that merging the multiple adjacent windows whose initial alignment ratios show no significant difference means merging adjacent windows that satisfy the following:

the difference between their corrected alignment ratios is not statistically significant.

8. The method of claim 7, characterized in that merging the multiple adjacent windows whose initial alignment ratios show no significant difference comprises:

(a) using the formula for Z_xy, calculating the difference between the alignment ratio of region N and that of all other windows to obtain all Z_xy, and taking Z_b = max_{1≤x<y≤w} |Z_xy|,

(b) comparing Z_b with a critical value; when Z_b exceeds the critical value, the corresponding region N is a primary region,

(c) removing the primary region, updating x, y and w, and carrying out steps (a) and (b) until no Z_b exceeds the critical value, wherein

x and y are window indices,

w is the total number of windows,

region N is the region from the (x+1)-th window to the y-th window,

R_x is the corrected alignment ratio of the x-th window,

the critical value is the value of the Z_xy distribution that corresponds to a predetermined probability, the predetermined probability being ≥ 95%,

1 ≤ x < y ≤ w,

S_x = R_1 + ... + R_x,

S_y = R_1 + ... + R_y,

S_w = R_1 + ... + R_w.

9. The method of claim 8, characterized in that it further comprises:

determining the type of the CNV by comparing the alignment ratio of the primary region with the predetermined alignment ratio, which comprises:

when the alignment ratio of the primary region is greater than the predetermined alignment ratio, determining that the primary region is a copy number gain region,

when the alignment ratio of the primary region is less than the predetermined alignment ratio, determining that the primary region is a copy number loss region.

10. The method of any one of claims 7-9, characterized in that it further comprises:

calculating the copy number of the primary region using the following formula:

copy number of a primary region = alignment ratio of the primary region / predetermined alignment ratio × number of chromosome sets of the target individual,

wherein the alignment ratio of the primary region is the mean of the corrected alignment ratios of all the windows it contains.

11. A device for detecting CNV, characterized in that it comprises:

a data input unit for receiving data;

a data output unit for outputting data;

a processor for executing an executable program, wherein executing the executable program comprises carrying out the method of any one of claims 1-10; and

a storage unit for storing data, including the executable program.