
CN104428425A - Methods for determining absolute genome-wide copy number variations of complex tumors

Info

Publication number: CN104428425A
Application number: CN201380034335.9A
Authority: CN
Inventors: Aaron Halpern; Krishna Pant
Original assignee: Callida Genomics Inc
Current assignee: Callida Genomics Inc
Priority date: 2012-05-04
Filing date: 2013-05-06
Publication date: 2015-03-18
Also published as: HK1203220A1; EP2844771A4; EP2844771A1; WO2013166517A1

Abstract

Methods for interpreting absolute copy number of complex tumors and for determining the copy number of a genomic region at a detection position of a target sequence in a sample are disclosed. In certain aspects, genomic regions of a target sequence in a sample are sequenced and measurement data for sequence coverage is obtained. Sequence coverage bias is corrected and may be normalized against a baseline sample. Hidden Markov Model (HMM) segmentation, scoring, and output are performed, and in some embodiments population-based no-calling and identification of low-confidence regions may also be performed. A total copy number value and region-specific copy number value for a plurality of regions are then estimated.

Description

Methods for determining absolute genome-wide copy number variations of complex tumors

Related application

This application is a continuation-in-part of, and claims priority to, U.S. Application No. 13/270,989, filed October 11, 2011, which claims priority to U.S. Provisional Patent Application No. 61/503,327, filed June 30, 2011, and U.S. Provisional Patent Application No. 61/392,567, filed October 13, 2010. This application also claims priority to U.S. Provisional Patent Application No. 61/643,225, filed May 4, 2012. The entire contents of each of these applications are incorporated herein by reference as if fully set forth herein.

Background of invention

Genomic abnormalities are often associated with various genetic diseases, degenerative diseases, and cancer. For example, deletions or gains of gene copies, and deletions or amplifications of genomic fragments or specific regions, are common in cancer. For example, alterations of proto-oncogenes and tumor suppressor genes, respectively, are common features of tumorigenesis. Identifying and cloning the specific genomic regions associated with cancer and various genetic diseases is therefore useful both for research into tumorigenesis and for developing better methods of diagnosis and prognosis.

Identifying polynucleotides whose copy number is altered in cancerous, precancerous, or low-metastatic-potential cells, relative to normal cells of the same tissue type, provides a basis for diagnostic tools, facilitates drug discovery by providing targets for candidate agents, and helps identify therapeutic targets for cancer therapies better suited to the type of cancer being treated.

In diagnostic genome sequencing, the accuracy required for clinical diagnosis further compounds the computational complexity of sequence analysis involving the 3 billion base pairs of the human genome, so that 60 billion or more sequence data points must be analyzed to provide one accurate genome sequence. Early sequencing methods handled this complexity by generating sequence data from thousands of isolated, very long DNA fragments, thereby preserving the contextual integrity of the sequence information and reducing the redundant testing needed for accurate data. However, because of the complexity of preparing the genomic fragments and the relatively high cost of the many separate biochemical tests required, this approach, which was used to produce the first complete human genome, cost billions of dollars per genome.

In addition, the presence of two different copies of the genome in each human cell further complicates the contextual information in the genome, so that accurate clinical analysis and diagnosis require the ability to distinguish DNA sequences by genome copy. A major challenge is therefore to distinguish sequence differences between the two unique copies of 3 billion DNA bases, which are interspersed with millions of inherited single nucleotide polymorphisms (SNPs), thousands of short insertions and deletions, and hundreds of spontaneous mutations.

However, high degrees of aneuploidy, stromal (normal) contamination, and genomic heterogeneity make it challenging to estimate absolute total and lesser-allele copy numbers from whole-genome sequencing read data of a tumor sample. Although some progress has been made in this field, no reliable method yet exists. For example, some methods have been developed that help identify copy number variations ("CNVs") in a complete DNA sequence and help establish confidence in that identification by comparing the sequence with a reference sequence or with multiple different copies of the sequence. In these methods, both the identification of copy number and its confirmation rely on different sample series, and the data used are error-prone and known to contain certain artifactual biases.

Summary of the invention

In certain aspects, the invention provides a method for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample, the method comprising: obtaining measurement data for the sequence coverage of the sample; correcting the measurement data for sequence coverage bias, wherein the sequence coverage bias correction comprises performing ploidy-aware baseline correction; and estimating a total copy number value and region-specific copy number values for a plurality of genomic regions. In one embodiment, the method comprises performing hidden Markov model (HMM) segmentation, scoring, and output. In another embodiment, the method comprises performing population-based no-calling and identification of low-confidence regions.

In certain aspects, the techniques and/or methods described herein provide a series of steps (such as a model) for interpreting the absolute copy number of complex tumors. In some embodiments, computer logic is configured to execute a model that processes total and allele-specific read depth data, producing an easily interpreted plot of the tumor sample as well as an automated analysis. The analysis is based on a model that assumes the tumor portion of the sample is homogeneous across most of the genome, but allows a portion of the sample to be affected by tumor heterogeneity. The result data obtained by executing the final model can be input to a separate model-based segmentation process (such as an HMM); for example, the result data can serve as the initial input states of the model-based segmentation, and the state descriptions can be used to annotate the final set of segments. Because the modeling is separated from the final segmentation, a visualization of the tumor can be presented to the user, and if there is a problem, the automatically inferred model can be replaced with an alternative model.

In one aspect, the method further comprises normalizing the sequence coverage by comparison with a baseline sample.
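Normalization against a baseline sample can be illustrated with a minimal sketch. It assumes coverage has already been summarized per genomic window, and it uses a plain ratio of mean-scaled coverages, which is one simple choice and not necessarily the ploidy-aware correction of the disclosed method. The function name `normalize_to_baseline` and the parameter `min_baseline` are hypothetical.

```python
def normalize_to_baseline(sample_cov, baseline_cov, min_baseline=1e-9):
    """Return per-window coverage ratios of sample to baseline.

    Assumption: both inputs are equal-length lists of per-window coverage.
    Each series is first scaled by its own mean so that differences in
    overall sequencing depth cancel out.
    """
    sample_mean = sum(sample_cov) / len(sample_cov)
    baseline_mean = sum(baseline_cov) / len(baseline_cov)
    ratios = []
    for s, b in zip(sample_cov, baseline_cov):
        b_scaled = b / baseline_mean
        if b_scaled < min_baseline:
            # The baseline has no usable coverage here, so no ratio is reported.
            ratios.append(None)
        else:
            ratios.append((s / sample_mean) / b_scaled)
    return ratios
```

A ratio near 1 suggests the sample matches the baseline in that window; windows where the baseline itself has no coverage are left undetermined rather than forced to a value.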

In one aspect, the method further comprises determining the sequence coverage by measuring the sequence coverage depth at each position of the sample genome.

In one aspect, the method further comprises correcting sequence coverage bias by calculating window-averaged coverage.
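Window-averaged coverage can be sketched as follows, assuming per-position depth is available as a list and that fixed, non-overlapping windows are used. The window layout is an assumption made for illustration, and the function name `window_mean_coverage` is hypothetical.

```python
def window_mean_coverage(depth, window):
    """Average per-position depth over consecutive non-overlapping windows.

    Assumption: `depth` is a list of per-position read depths and `window`
    is a positive window length in bases. The last window may be shorter.
    """
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means
```

Averaging over windows smooths position-to-position sampling noise at the cost of resolution, so the window length trades sensitivity to small events against stability of the estimate.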

In one aspect, the method further comprises performing adjustments to account for GC bias introduced during library construction and sequencing.
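One common way to adjust for GC bias is to rescale each window's coverage by the mean coverage of windows with similar GC content. The sketch below follows that approach as an assumption; the text here does not specify the adjustment actually used. The names `gc_fraction`, `gc_correct`, and `n_bins` are hypothetical.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_correct(coverage, gc, n_bins=10):
    """Rescale window coverage so every GC bin has the same mean coverage.

    Assumption: `coverage` and `gc` are equal-length lists holding the
    coverage and GC fraction (0..1) of each window.
    """
    overall = sum(coverage) / len(coverage)

    def bin_of(g):
        return min(int(g * n_bins), n_bins - 1)

    bins = defaultdict(list)
    for c, g in zip(coverage, gc):
        bins[bin_of(g)].append(c)
    bin_mean = {b: sum(v) / len(v) for b, v in bins.items()}

    corrected = []
    for c, g in zip(coverage, gc):
        m = bin_mean[bin_of(g)]
        corrected.append(c * overall / m if m > 0 else 0.0)
    return corrected
```

After correction, windows that differ only in GC content report similar coverage, so remaining differences are more likely to reflect copy number.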

In another embodiment, the method further comprises performing adjustments to compensate for bias based on other weighting factors associated with individual mappings.

In one aspect, the method further comprises steps performed by a sequencer, the steps comprising: a) providing a plurality of amplicons, wherein: i) each amplicon comprises multiple copies of a fragment of a target nucleic acid, ii) each amplicon comprises a plurality of adaptors interspersed at predetermined sites within the fragment, each adaptor comprising at least one anchor probe hybridization site, and iii) the plurality of amplicons comprises fragments that substantially cover the target nucleic acid; b) providing a random array of the amplicons fixed to a surface at a density such that at least a majority of the amplicons are optically resolvable; c) hybridizing one or more anchor probes to the random array; d) hybridizing one or more sequencing probes to the random array, thereby forming perfectly matched duplexes between the one or more sequencing probes and fragments of the target nucleic acid; e) ligating the anchor probes to the sequencing probes; f) identifying at least one nucleotide adjacent to at least one interspersed adaptor; and g) repeating steps (c)-(f) until the nucleotide sequence of the target nucleic acid is identified.

In one aspect, the method further comprises determining the measurement data by performing steps comprising: a) determining reads of the sequences of a plurality of approximately random fragments representative of the genome in the sample, wherein the plurality provides a sampling of the sample genome whereby, on average, a base position of the genome is sampled one or more times; b) obtaining mapping data for the reads by mapping the reads to a reference genome or to an assembled sequence (for example, an assembled sequence of the sample itself or of a related baseline sample); and c) obtaining coverage data by measuring the intensity of the reads along the reference genome or along the assembled sequence, wherein the measurement data comprise the mapping data and the coverage data.
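Step (c) can be illustrated with a short sketch that turns mapped read intervals into per-position depth. It assumes each mapping is reduced to a 0-based start position and a length on a single linear reference, and that every mapping counts equally. The function name `coverage_from_mappings` is hypothetical.

```python
def coverage_from_mappings(mappings, genome_length):
    """Compute per-position read depth from (start, length) mappings.

    Assumption: positions are 0-based on one linear reference sequence.
    A difference array is used so each mapping costs O(1) to record.
    """
    diff = [0] * (genome_length + 1)
    for start, length in mappings:
        if start < 0 or start >= genome_length:
            continue  # mapping falls outside the reference; ignore it
        end = min(start + length, genome_length)
        diff[start] += 1
        diff[end] -= 1

    depth = []
    running = 0
    for d in diff[:genome_length]:
        running += d
        depth.append(running)
    return depth
```

For example, three mappings `(0, 3)`, `(2, 4)`, and `(8, 5)` on a 10-base reference overlap only at position 2, which therefore has depth 2.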

In another embodiment, the method further comprises generating an initial model that estimates the number of states and their means based on the overall coverage distribution.

In another embodiment, the method further comprises optimizing the initial model by sequentially adding states to the model, sequentially removing states from the model, or a combination thereof.

In another embodiment, normalizing further comprises determining a normalized corrected coverage.

In another embodiment, described method also comprises being measured sequence coverage by fragment replication and being obtained degree of confidence observed value and detects position described collection of illustrative plates to be partly attributed to each.

In one aspect, the method comprises performing HMM calculations to determine the ploidy at each detection position.
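As an illustration of an HMM calculation that assigns a copy number to each position, the sketch below runs Viterbi decoding over windowed coverage. Several choices are assumptions made only for this example: Gaussian emissions with a shared standard deviation, a single "stay" probability for all states, a uniform initial distribution, and Viterbi decoding itself. The names `viterbi_copy_number`, `sigma`, and `stay_prob` are hypothetical.

```python
import math


def viterbi_copy_number(coverage, means, sigma, stay_prob=0.99):
    """Most likely copy-number path for a coverage series.

    Assumption: `means` maps copy number -> expected coverage, and all
    states share the same Gaussian standard deviation `sigma`.
    """
    if not coverage:
        return []
    copy_numbers = sorted(means)
    k = len(copy_numbers)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1 - stay_prob) / (k - 1)) if k > 1 else float("-inf")

    def log_emit(x, cn):
        z = (x - means[cn]) / sigma
        return -0.5 * z * z  # constants shared by all states are omitted

    score = [log_emit(coverage[0], cn) for cn in copy_numbers]
    back = []
    for x in coverage[1:]:
        new_score, pointers = [], []
        for j, cn in enumerate(copy_numbers):
            # Best predecessor state for arriving in state j.
            best_i = max(
                range(k),
                key=lambda i: score[i] + (log_stay if i == j else log_switch),
            )
            trans = log_stay if best_i == j else log_switch
            new_score.append(score[best_i] + trans + log_emit(x, cn))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [copy_numbers[s] for s in path]
```

The high stay probability penalizes state changes, so isolated noisy windows tend to be absorbed into the surrounding segment while sustained shifts in coverage produce a new segment.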

In another embodiment, the method further comprises generating a plurality of hidden Markov model (HMM) states corresponding to respective copy numbers, wherein, if the sample is a normal sample, performing HMM segmentation, scoring, and output comprises: for each state having a copy number N greater than 0, initializing the mean of the emission distribution of the HMM to N/2 times the median coverage expected in a diploid portion of the sample; and for the state having copy number 0, initializing the mean of the emission distribution to a positive value that is less than the value used for the state having copy number 1.
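The initialization rule for a normal sample can be written out directly. In this sketch the copy-number-0 mean is set to a fixed fraction of the copy-number-1 mean; the text only requires a positive value below the copy-number-1 value, so the fraction is an assumption. The names `initial_emission_means` and `zero_state_fraction` are hypothetical.

```python
def initial_emission_means(diploid_median, max_copy_number, zero_state_fraction=0.1):
    """Initial emission means per copy-number state for a normal sample.

    Assumption: `diploid_median` is the median coverage expected in a
    diploid portion of the sample, and 0 < zero_state_fraction < 1.
    """
    if not 0 < zero_state_fraction < 1:
        raise ValueError("zero_state_fraction must be between 0 and 1")
    means = {}
    for n in range(1, max_copy_number + 1):
        means[n] = (n / 2) * diploid_median  # N/2 times the diploid median
    # Copy number 0: positive, but below the copy-number-1 mean.
    means[0] = zero_state_fraction * means[1]
    return means
```

With a diploid median of 40, copy numbers 1 through 4 start at means of 20, 40, 60, and 80. Keeping the copy-number-0 mean above zero leaves room for the small amount of coverage that mismapped reads can contribute to a deleted region.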

In another embodiment, the method further comprises generating a plurality of HMM states corresponding to respective copy numbers, wherein, if the sample is a tumor sample, performing HMM segmentation, scoring, and output comprises: generating an initial HMM model by estimating the number of states and the mean of each state based on the coverage distribution; optimizing the initial model by modifying the number of states in the model and optimizing the parameters of each state; and modifying the number of states in the model by sequentially adding states to the model, sequentially removing states, or a combination thereof.

In another embodiment, the method further comprises adjusting the initial model, comprising: a) adding a new state between a pair of states if adding the new state increases the likelihood associated with the HMM by more than a first predetermined threshold; b) repeating step (a) iteratively between every pair of states until no further additions can be made; c) removing a state from the HMM if removing the state does not reduce the likelihood by more than a second predetermined threshold; and d) repeating step (c) iteratively for all states.
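Steps (a) through (d) can be sketched as a greedy add-then-remove loop over state means. The sketch assumes the model is summarized by a sorted list of state means, that a new state is placed at the midpoint of a neighboring pair, and that the caller supplies the likelihood function. None of these details is specified in the text above; the names `refine_states`, `log_likelihood`, `add_threshold`, and `remove_threshold` are hypothetical.

```python
def refine_states(means, log_likelihood, add_threshold, remove_threshold):
    """Greedily add and then remove states based on likelihood changes.

    Assumptions: `log_likelihood(states)` scores a sorted list of state
    means against the data and is bounded above, and `add_threshold` is
    positive, so the addition loop terminates.
    """
    states = sorted(means)

    # Steps (a)-(b): keep inserting midpoint states while the gain is large.
    added = True
    while added:
        added = False
        for i in range(len(states) - 1):
            candidate = (states[i] + states[i + 1]) / 2
            trial = states[:i + 1] + [candidate] + states[i + 1:]
            if log_likelihood(trial) - log_likelihood(states) > add_threshold:
                states = trial
                added = True
                break

    # Steps (c)-(d): drop states whose removal costs little likelihood.
    removed = True
    while removed and len(states) > 1:
        removed = False
        for i in range(len(states)):
            trial = states[:i] + states[i + 1:]
            if log_likelihood(states) - log_likelihood(trial) <= remove_threshold:
                states = trial
                removed = True
                break
    return states
```

As a usage illustration with a toy score (negative squared distance from each data point to its nearest state mean, standing in for a real HMM likelihood), starting from means 10, 30, and 100 on data clustered at 10, 20, and 30 adds a state at 20 and then drops the unused state at 100.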

Another embodiment comprises a non-transitory computer-readable storage medium having instructions stored thereon for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample, the instructions, when executed by a computer processor, causing the processor to perform the following operations: obtaining measurement data for the sequence coverage of the sample using data generated from paired-end mappings; correcting the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction; and estimating, based at least on the corrected measurement data, a total copy number value and a region-specific copy number value for each of a plurality of genomic regions.

Another embodiment comprises a non-transitory computer-readable storage medium having instructions tangibly embodied thereon, the instructions, when executed by a computer processor, causing the processor to perform the following operations: obtaining measurement data for the sequence coverage of a biological sample comprising a target sequence; correcting the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction; performing hidden Markov model (HMM) segmentation, scoring, and output based on the corrected measurement data; performing population-based no-calling and identification of low-confidence regions based on the HMM scoring and output; and estimating a total copy number value and region-specific copy number values for a plurality of regions.

Another embodiment comprises a system for determining copy number variation of a genomic region at a detection position of a target sequence, the system comprising: a. a computer processor; and b. a computer-readable storage medium coupled to the processor, the storage medium having instructions tangibly embodied thereon, the instructions, when executed by the computer processor, causing the processor to perform the following operations: obtaining measurement data for the sequence coverage of the sample using data generated from paired-end mappings; correcting the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction; and estimating, based at least on the corrected measurement data, a total copy number value and a region-specific copy number value for each of a plurality of genomic regions.

This summary is provided to introduce, in simplified form, a selection of concepts that are further described in the detailed description below. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Other features, details, utilities, and advantages of the claimed subject matter will be apparent from the following written detailed description, including those aspects illustrated in the accompanying drawings and defined in the appended claims.

Brief description of the drawings

The following drawings represent one form of presenting the data provided by embodiments of the invention. The drawings are not intended to limit in any way the practice of the aspects of the invention described herein; rather, they help illustrate key concepts of the invention.

Fig. 1 is a general block diagram illustrating a system for calling variations in a sample containing a target sequence, according to an embodiment of the present disclosure.

Fig. 2 is a general flow chart illustrating a CNV calling method according to an embodiment of the present disclosure.

Fig. 3 depicts a general computer system incorporating and operating in accordance with certain aspects of the present disclosure.

Figs. 4A and 4B illustrate an exemplary sequencing system.

Fig. 5 illustrates an exemplary computing device used in, or connected to, a sequencer and/or computer system.

Fig. 6 is a diagram of a 1-component tumor model.

Fig. 7 is a diagram of a 2-component tumor model.

Fig. 8 is a diagram of an exemplary embodiment of measuring read coverage and segmentation.

Fig. 9 is a diagram of exemplary initial state estimation logic.

Figs. 10A-10C illustrate exemplary results that may show the robustness of the process for samples comprising varying percentages of tumor and normal tissue.

Fig. 11 illustrates a strong statistical correlation between tumor and both high average copy number and high variability.

Detailed Description Of The Invention

In the following description, numerous specific details are set forth to provide a more thorough understanding of the invention. However, it will be apparent to one skilled in the art that the invention may be practiced without one or more of these specific details. In other instances, features and procedures well known to those skilled in the art have not been described, in order to avoid obscuring the invention.

Although the invention is described primarily with reference to specific embodiments, it is also envisioned that other embodiments will become apparent to those skilled in the art upon reading the present disclosure, and it is intended that such embodiments be included within the methods of the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All publications mentioned herein are incorporated herein by reference for the purpose of describing and disclosing the devices, compositions, formulations, and methods that are described in the publications and that might be used in connection with the invention.

Where a range of values is provided, it is understood that each intervening value between the upper and lower limits of that range (to the tenth of the unit of the lower limit, unless the context clearly dictates otherwise), and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, which are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

Exemplary sequencing methods

An exemplary method for sequencing a target nucleic acid includes sample preparation, which involves extracting target nucleic acid from a DNA sample and fragmenting it to produce fragmented target nucleic acid templates that generally include one or more adaptors. The target nucleic acid templates are optionally subjected to an amplification method to form nucleic acid nanoballs, which, for purposes of analysis, are typically disposed on a surface or substrate. The substrate may carry a patterned or random arrangement of nucleic acid nanoballs.

Methods of forming nucleic acid nanoballs are described in the following published patent applications: WO2007120208, WO2006073504, WO2007133831, and US2007099208, and U.S. Patent Application Serial Nos. 11/679,124; 11/981,761; 11/981,661; 11/981,605; 11/981,793; 11/981,804; 11/451,691; 11/981,607; 11/981,767; 11/982,467; 11/451,692; 12/335,168; 11/541,225; 11/927,356; 11/927,388; 11/938,096; 11/938,106; 10/547,214; 11/981,730; 11/981,685; 11/981,797; 12/252,280; 11/934,695; 11/934,697; 11/934,703; 12/265,593; 12/266,385; 11/938,213; 11/938,221; 12/325,922; 12/329,365 and 12/335,188, all of which are incorporated herein by reference in their entirety for all purposes, and in particular for all teachings related to forming nucleic acid nanoballs.

Methods of forming arrays of nucleic acid nanoballs are described in published patent applications WO2007120208, WO2006073504, WO2007133831, and US2007099208, and U.S. Patent Application Serial Nos. 11/679,124; 11/981,761; 11/981,661; 11/981,605; 11/981,793; 11/981,804; 11/451,691; 11/981,607; 11/981,767; 11/982,467; 11/451,692; 12/335,168; 11/541,225; 11/927,356; 11/927,388; 11/938,096; 11/938,106; 10/547,214; 11/981,730; 11/981,685; 11/981,797; 12/252,280; 11/934,695; 11/934,697; 11/934,703; 12/265,593; 12/266,385; 11/938,213; 11/938,221; 12/325,922; 12/329,365 and 12/335,188, all of which are incorporated herein by reference in their entirety for all purposes, and in particular for all teachings related to forming arrays of nucleic acid nanoballs.

Methods of using nucleic acid nanoballs in sequencing reactions and in the detection of specific target sequences are also described in U.S. Patent Application Serial Nos. 11/679,124; 11/981,761; 11/981,661; 11/981,605; 11/981,793; 11/981,804; 11/451,691; 11/981,607; 11/981,767; 11/982,467; 11/451,692; 12/335,168; 11/541,225; 11/927,356; 11/927,388; 11/938,096; 11/938,106; 10/547,214; 11/981,730; 11/981,685; 11/981,797; 12/252,280; 11/934,695; 11/934,697; 11/934,703; 12/265,593; 12/266,385; 11/938,213; 11/938,221; 12/325,922; 12/329,365; and 12/335,188, each of which is incorporated herein by reference in its entirety for all purposes, and in particular for all teachings related to carrying out sequencing reactions on nucleic acid nanoballs.

It should be understood that any of the sequencing methods described herein and known in the art can be applied to nucleic acid templates and/or nucleic acid nanoballs in solution, or to nucleic acid templates and/or nucleic acid nanoballs disposed on a surface and/or in an array.

Nucleic acid nanoballs are subjected to nucleotide sequencing, typically by sequencing-by-ligation techniques, including combinatorial probe anchor ligation ("cPAL") methods, which are described in, for example, Drmanac et al., "Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays," Science 327:78-81, 2009 (January 1, 2010), as well as published PCT patent applications WO07/133831, WO06/138257, WO06/138284, WO07/044245, WO08/070352, WO08/058282, and WO08/070375, and published U.S. patent applications 2007-0037152 and 2008-0221832. In such methods, known labels (such as specific fragments containing a single molecule of a distinguishable fluorophore) are attached as labels to the target nucleic acid templates according to well-understood rules, and are then re-indexed on the same type of DNA strand to provide a basis for overlapping data. The sequencing processes mentioned herein are merely representative. In another embodiment, tags are used. Other processing techniques known in the art or yet to be developed may be used.

The nucleic acid nanoballs assembled on the substrate are then irradiated with radiation sufficient to excite the fluorophores, causing the fluorophore associated with each specific label C, G, A, or T to fluoresce at its unique wavelength. From this, a spatial image can be produced by a camera on a (standard or time-delay integration, TDI) CCD array, by a scanner in place of a CCD array, or by other electronic current/voltage sensing techniques applicable in a sequencer. Other sensing mechanisms, such as impedance-change sensors, may also be used. The irradiation may be spectrum-specific so that only one selected fluorophore is excited at a time and then recorded by the camera; or the input to the camera may be filtered so that only spectrum-specific received fluorescent radiation is sensed and recorded; or all fluorescent radiation may be sensed and recorded simultaneously on a color LCD array and then analyzed for spectral content at each interrogation site bearing a nucleic acid construct. Image acquisition produces a series of images of many interrogation sites, which are analyzed by computer processing of intensity levels, based on spectrum-specific fluorescence intensity, in a process referred to herein as base calling, which is explained in more detail below.

cPAL and other sequencing methods can also be used to detect specific sequences, such as single nucleotide polymorphisms ("SNPs"), in nucleic acid constructs (including nucleic acid nanoballs as well as linear and circular nucleic acid templates). Calls, or identifications of base-call sequences, may contain errors, for example because of the inherent nature of the sequencing procedure. Errors can be identified using computer-based Reed-Solomon error correction, whether in the form of a computer processor executing a Reed-Solomon algorithm or in the form of a comparison mechanism that uses precomputed expected base-call sequences, such as a look-up table. The "called" sequences can be flagged and may be corrected to yield corrected base-call sequences.

It should be understood that the size of the sites and structures described herein is only a small fraction of the size of the sites and structures analyzed on a substrate, because the latter are not easily depicted. For example, the substrate may be a photolithographically etched, surface-modified (SOM) 25 mm x 75 mm silicon substrate with a grid-patterned array of approximately 300 nm spots for binding nucleic acid nanoballs, which increases DNA content per array and improves image information density compared with random genomic DNA arrays.

Sequencing probes may be detectably labeled with a wide variety of labels. Although the foregoing description is primarily directed to embodiments in which the sequencing probes are labeled with fluorophores, it should be understood that similar embodiments using sequencing probes with other types of labels are encompassed by the invention. Moreover, the methods of the invention can use unlabeled constructs.

In some embodiments, multiple cPAL cycles (whether single, double, triple, etc.) identify multiple bases in the regions of the target nucleic acid adjacent to the adaptors. (In an alternative design, a single cPAL cycle may yield multiple bases.) In brief, the cPAL method is performed repeatedly to interrogate multiple bases within the target nucleic acid by cycling anchor probe hybridization and enzymatic ligation reactions with pools of sequencing probes designed to detect nucleotides at varying positions removed from the interface between the adaptor and the target nucleic acid. In any given cycle, the sequencing probes used are designed such that the identity of one or more bases at one or more positions is correlated with the identity of the label attached to that sequencing probe. Once the ligated sequencing probe (and hence the base or bases at the interrogation position or positions) is detected, the ligated complex is stripped from the nucleic acid nanoball, and a new cycle of adaptor and sequencing probe hybridization and ligation is conducted. In this manner, oversampled data can be obtained.

The definition selected

" joint " refers to the construct of the genetic modification comprising " joint component ", and wherein one or more joints may be interspersed within the target nucleic acid of library construction body.According to the purposes of joint, be included in joint component in any joint or feature extensively various, but generally include restriction endonuclease identification and/or shearing site, primer combines (for the library construction body that increases) or anchor primer combines (target nucleic acid in sequencing library construct) site, nickase site etc.In some respects, joint by genetic modification to comprise following one or more: 1) about 20 to about 250 Nucleotide, or about 40 or are less than about 60 Nucleotide to about 100 oligonucleotide, or be less than the length of about 50 Nucleotide; 2) in order to be connected to target nucleic acid as at least one and the feature of usual two " arms "; 3) be positioned at 5 ' end of joint and/or the difference of 3 ' end and the grappling binding site of uniqueness for the order-checking of contiguous target nucleic acid; And 4) optionally one or more restriction site.On the one hand, joint can for the joint scattered.So-called " joint of distribution " means the oligonucleotide of the position of inserting interval in target nucleic acid interior region herein.On the one hand, " inside " of target nucleic acid means the site of target nucleic acid inside before such as cyclisation and cutting process, and described process can calling sequence inversion, or similar transformation, and it destroys the sequence of target nucleic acid inner nucleotide.The use promotion rebuilding series of the joint scattered and calibration, can allow from bases such as readings 20,30,40 in not having calibration because run from the sequence of 10 bases of single joint at every turn.

" amplicon " refers to the product of polynucleotide amplification reaction.That is, it copies from one or more homing sequence the polynucleotide group obtained.Amplicon can be generated by multiple amplified reaction, include but not limited to polymerase chain reaction (PCRs), linear polymerization enzyme reaction, amplification based on nucleotide sequence, rolling circle amplification and similar reaction (consult as U.S. Patent number 4,683,195,4,965,188,4,683,202,4,800159,5,210,015,6,174,670,5,399,491,6,287,824 and 5,854,033; And US publication 2006/0024711).

When used in the context of identification, the term "base" refers to the purine or pyrimidine group (or an analog or variant thereof) associated with the nucleotide at a given position in a target nucleic acid. Thus, calling a base and identifying a nucleotide both refer to determining a data value that identifies the purine or pyrimidine group (or an analog or variant thereof) at a specific position in a target nucleic acid. Purine and pyrimidine groups include the four main nucleotide bases C, G, A and T.

"Polynucleotide", "nucleic acid", "oligonucleotide", "oligomer" or grammatical equivalents as used herein generally refer to at least two nucleotides covalently linked together in a linear fashion. A nucleic acid generally contains phosphodiester bonds, although in some cases nucleic acid analogs may be included that have alternative backbones such as phosphoramidite, phosphorodithioate or methylphosphoroamidite linkages, or peptide nucleic acid backbones and linkages. Other nucleic acid analogs include those with bicyclic structures, including locked nucleic acids, positive backbones, non-ionic backbones and non-ribose backbones.

The term "reference polynucleotide sequence", or simply "reference", refers to a known nucleotide sequence of a reference organism. The reference may be the entire genome sequence of a reference organism (e.g., a reference genome), a portion of a reference genome, a consensus sequence of many reference organisms, a compilation sequence based on different components of different organisms, a collection of genome sequences obtained from a population of organisms, or any other suitable sequence. The reference may also include information about variants of the reference known to be found in a population of organisms. The reference organism may also be specific to the sample being sequenced, possibly a related individual or the same individual, separately sampled (possibly normal tissue to complement a cancer sequence).

"Sample polynucleotide sequence" refers to a nucleic acid sequence of a sample or target organism derived from a gene, a regulatory element, genomic DNA, cDNA, RNAs (including mRNAs, rRNAs, siRNAs, miRNAs and the like), and/or fragments thereof. A sample polynucleotide sequence may be a nucleic acid from a sample, or a secondary nucleic acid such as the product of an amplification reaction. A polynucleotide fragment "derived from" a sample polynucleotide sequence or sample polynucleotide (or any polynucleotide) can mean that the fragment is formed by fragmenting the sample polynucleotide (or any other polynucleotide) by physical, chemical and/or enzymatic means. "Derived from" a polynucleotide can also mean that the fragment is the result of replication or amplification of a particular subset of the nucleotide sequence of the source polynucleotide.

"Read" refers to a set of one or more data values that represent one or more nucleotide bases. "Mated read" (also called "mate-pair") generally refers to a set of individual nucleotide reads originating from two separate regions (arms) of genomic sequence located at opposite ends of a DNA fragment, hundreds or thousands of bases apart. Mated reads may be produced during sequencing from fragments of a larger contiguous polynucleotide (e.g., DNA) obtained from the sample organism that is to be variant-called and/or reassembled.

"Mapping" refers to one or more data values that relate a read (e.g., a mated read) to zero, one or more locations in a reference to which the read is similar, for example by matching the instantiated read to one or more keys in an index that correspond to locations in the reference.

"Hybridization" refers to the process in which two single-stranded polynucleotides bind non-covalently to form a stable double-stranded polynucleotide. The resulting (usually) double-stranded polynucleotide is a "hybrid" or "duplex". "Hybridization conditions" typically include salt concentrations below about 1 M, more usually below about 500 mM, and possibly below about 200 mM. Hybridization temperatures can be as low as 5 °C, but are typically above 22 °C, more typically above about 30 °C, and often above 37 °C.

"Ligation" means forming a covalent bond or linkage between the termini of two or more nucleic acids (e.g., oligonucleotides and/or polynucleotides) in a template-driven reaction. The nature of the bond or linkage may vary widely, and the ligation may be carried out enzymatically or chemically. As used herein, ligations are usually carried out enzymatically to form a phosphodiester linkage between the 5' carbon of the terminal nucleotide of one oligonucleotide and the 3' carbon of another nucleotide. Template-driven ligation reactions are described in the following references: U.S. Patent Nos. 4,883,750; 5,476,930; 5,593,826 and 5,871,921.

"Logic" means a set of instructions which, when executed by one or more processors (e.g., CPUs) of one or more computing devices and/or computer systems, are operable to perform one or more functions and/or to return data in the form of one or more results and/or data used by other logic elements. In various embodiments and implementations, any given logic may be implemented as one or more software components executable by one or more processors (e.g., CPUs), as one or more hardware components such as application-specific integrated circuits (ASICs) and/or field-programmable gate arrays (FPGAs), or as any combination of one or more software components and one or more hardware components. The software component of any particular logic may be implemented, without limitation, as a standalone or client-server software application, as a client in a client-server system, as a server in a client-server system, as one or more software modules, as one or more libraries of functions, and as one or more statically and/or dynamically linked libraries. During execution, the instructions of any particular logic may be embodied as one or more computer processes, threads, fibers and any other suitable run-time entities that can be instantiated on the hardware of one or more computing devices and can be allocated computing resources including, without limitation, memory, CPU time, storage space and network bandwidth.

"Primer" means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and of being extended from its 3' end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Primers are usually extended by a DNA polymerase.

"Probe" generally refers to an oligonucleotide that is complementary to an oligonucleotide or target nucleic acid under investigation. Probes used in certain aspects of the claimed invention are labeled in a way that permits detection, for example with a fluorescent or other optically discernible tag.

"Sequence determination" (also referred to as "sequencing") of a target nucleic acid means determination of information relating to the sequence of nucleotide bases in the target nucleic acid. Such information may include the identification or determination of partial as well as full sequence information of the target nucleic acid. The sequence information may be determined with varying degrees of statistical reliability or confidence. In one aspect, sequencing includes the determination of the identity and ordering of a plurality of contiguous nucleotides in a target nucleic acid, starting from different nucleotides in the target nucleic acid. Sequencing and its various steps are carried out by a sequencing machine that includes a reaction subsystem and an imaging subsystem. The reaction subsystem includes a flow device (on which biochemical reactions take place between various reagents, buffers and the like and a biological sample or fragments derived from it) and various other components (e.g., tubing, valves, injectors, actuators, motors and the like) configured to dispose the reagents, buffers, sample fragments and the like onto or within the flow device. The imaging subsystem includes a camera, a microscope (and/or appropriate lenses and tubes), a stage that holds the flow device during sequencing, and various other components (e.g., motors, actuators, robotic arms and the like) for placing and adjusting the flow device on the stage and for adjusting the relative positions of the camera and the microscope.

"Target nucleic acid" means a nucleic acid of (usually) unknown sequence derived from a gene, a regulatory element, genomic DNA, cDNA, RNA (including mRNA, rRNA, siRNA, miRNA and the like) and fragments thereof. A target nucleic acid may be a nucleic acid from a sample, or a secondary nucleic acid such as the product of an amplification reaction. Target nucleic acids can be obtained from virtually any source and can be prepared using methods known in the art. For example, target nucleic acids can be isolated directly without amplification, or isolated after amplification using methods known in the art, including without limitation polymerase chain reaction (PCR), strand displacement amplification (SDA), multiple displacement amplification (MDA), rolling circle amplification (RCA), rolling circle replication (RCR) and other amplification methods (including whole genome amplification). Target nucleic acids may also be obtained through cloning, including but not limited to cloning into vehicles such as plasmids, yeast and bacterial artificial chromosomes. In some aspects, the target nucleic acids comprise mRNAs or cDNAs. In certain embodiments, the target DNA is created using isolated transcripts from a biological sample. Target nucleic acids can be obtained from a sample using methods known in the art. It should be understood that the sample may comprise any number of substances, including but not limited to bodily fluids of virtually any organism, such as blood, urine, serum, lymph, saliva, anal and vaginal secretions, perspiration and semen, with mammalian samples being preferred and human samples being particularly preferred. Methods of obtaining target nucleic acids from various organisms are well known in the art. Samples comprising human genomic DNA find use in many embodiments. In some aspects, such as whole genome sequencing, about 20 to about 1,000,000 or more genome-equivalents of DNA are preferably obtained to ensure that the population of target DNA fragments sufficiently covers the entire genome.

Exemplary methods of genome sequencing and CNV estimation

The present invention relates to methods for estimating the copy number variation of a genomic region of interest at a detection position of a target sequence in a sample, which find use in a variety of applications as described herein.

The methods of the present disclosure may also include extracting and fragmenting target nucleic acids from a sample, and/or sequencing the target nucleic acids for which CNV estimation is performed. The fragmented nucleic acids can be used to produce target nucleic acid templates, which generally include one or more adaptors. The target nucleic acid templates are subjected to amplification methods to form nucleic acid concatemers, such as nucleic acid nanoballs.

In one aspect, a nucleic acid template may comprise target nucleic acid and multiple interspersed adaptors; such templates are also referred to herein as "library constructs", "circular templates", "circular constructs", "target nucleic acid templates" and other grammatical equivalents. The nucleic acid template constructs are assembled by inserting adaptor molecules at multiple sites throughout each target nucleic acid. The interspersed adaptors permit sequence information to be acquired from multiple sites in the target nucleic acid consecutively or simultaneously.

In another embodiment, nucleic acid templates formed from a plurality of genomic fragments can be used to create a library of nucleic acid templates. In some embodiments, such libraries of nucleic acid templates comprise target nucleic acids that together encompass all or part of an entire genome. That is, by using a sufficient number of starting genomes (e.g., genomes of cells), combined with random fragmentation, the resulting target nucleic acids of the particular size used to create the circular templates sufficiently "cover" the genome, although it should be understood that bias may occasionally be introduced inadvertently and prevent the entire genome from being represented.

Further embodiments and examples of methods of constructing nucleic acid templates are described in U.S. Serial Nos. 11/679,124; 11/981,761; 11/981,661; 11/981,605; 11/981,793; 11/981,804; 11/451,691; 11/981,607; 11/981,767; 11/982,467; 11/451,692; 12/335,168; 11/541,225; 11/927,356; 11/927,388; 11/938,096; 11/938,106; 10/547,214; 11/981,730; 11/981,685; 11/981,797; 12/252,280; 11/934,695; 11/934,697; 11/934,703; 12/265,593; 12/266,385; 11/938,213; 11/938,221; 12/325,922; 12/329,365; and 12/335,188, each of which is incorporated herein by reference in its entirety for all purposes, and in particular for all teachings related to constructing the nucleic acid templates of the techniques described herein.

The nucleic acid templates of the techniques described herein may be double-stranded or single-stranded, and they may be linear or circular. In some embodiments, libraries of nucleic acid templates are created, and in further embodiments the target sequences contained among the different templates in such a library together cover all or part of an entire genome. It should be understood that these libraries of nucleic acid templates may comprise diploid genomes, or they may be processed using methods known in the art to separate sequences from one parental set of chromosomes from those of the other set. As those skilled in the art will appreciate, the single-stranded circular templates in a library may together comprise both strands (i.e., the "Watson" and "Crick" strands) of a chromosome or chromosomal region, or circles comprising sequences from one strand or the other may be separated into their own libraries using methods known in the art.

For any sequencing method known in the art and described herein that uses nucleic acid templates, the techniques described herein provide methods for determining at least about 10 to about 200 bases in a target nucleic acid. In further embodiments, the techniques described herein provide methods for determining at least about 20 to about 180, about 30 to about 160, about 40 to about 140, about 50 to about 120, about 60 to about 100, and about 70 to about 80 bases in a target nucleic acid. In still other embodiments, sequencing methods are used to identify 5, 10, 15, 20, 25, 30 or more bases adjacent to one or both ends of each adaptor in a nucleic acid template.

Technical overview of CNV calling

CNV calling for normal samples and for tumor samples shares many features, but also differs. In some embodiments, both types of sample go through the following three steps.

1) Calculation of sequence coverage.

2) Estimation and correction of coverage bias:

A. Modeling of coverage bias;

B. Correction of the modeled bias;

C. Coverage smoothing.

3) Normalization of coverage by comparison to a baseline sample or set of samples.

Both normal samples and tumor samples are then segmented using hidden Markov models (HMMs), but different models are used for the two sample types in the following steps:

4A) HMM segmentation, scoring and output for normal samples;

4B) Modified HMM segmentation, scoring and output for tumor samples.

Finally, normal samples go through a "no-calling" process, which identifies suspect CNV calls in the following step:

5) Population-based no-calling/identification of low-confidence regions.

In various embodiments, the above steps of CNV calling are carried out by different types of logic executing on one or more computing devices.

Exemplary embodiments of the CNV calling technique

1. Calculation of sequence coverage

As used below, "DNB" refers to the sequence of a nucleic acid nanoball from which one or more reads (e.g., mated reads) are sequenced. It should be noted that, among the reads sequenced from a biological sample or fragments thereof, a DNB is represented by one or more reads that may or may not cover the entire sequence making up the DNB. For example, in one embodiment, a DNB is represented by a mated read comprising two or more arm reads derived from opposite ends of the DNB and separated by several hundred bases of unknown sequence.

In one aspect, all paired-end (e.g., full DNB) mappings that satisfy the mate-pair constraints are used to calculate sequence coverage. In one embodiment, a unique paired-end mapping contributes a single count to each base of the reference that is aligned to the DNB. Reference bases aligned to non-unique paired-end mappings are weighted (i.e., given a fractional count) according to the estimated probability that the mapping is the correct location of the DNB in the reference. Each DNB is thus attributed fractionally to each of its mappings in proportion to the confidence in that mapping, which makes it possible to give reasonable coverage estimates in regions where mappings are non-unique.

In one aspect, each position i of the reference genome R receives the following coverage value c_{i}:

c_{i} = \sum_{m \in M_{i}} P({DNB}_{m} | R, m) / (\alpha + \sum_{n \in N(m)} P({DNB}_{m} | R, n))

where M_{i} is the set of mappings over all DNBs in which a called base is aligned to position i, {DNB}_{m} is the DNB described by mapping m, N(m) is the set of all mappings involving {DNB}_{m}, and \alpha is the probability that the DNB was produced in a manner that does not allow it to map to the reference.
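The weighting described above can be illustrated with a minimal Python sketch. It assumes each DNB is supplied with its candidate mappings as a list of (aligned reference positions, likelihood) pairs, where the likelihood stands in for P(DNB_m | R, m). The function name, the input layout and the example values are hypothetical and are not taken from the disclosure.

```python
from collections import defaultdict

def coverage_from_mappings(dnb_mappings, alpha):
    """Fractional coverage per reference position.

    dnb_mappings: dict of DNB id -> list of (aligned_positions, likelihood),
    one entry per candidate mapping of that DNB (hypothetical layout).
    alpha: probability that the DNB cannot be mapped to the reference.
    """
    coverage = defaultdict(float)
    for mappings in dnb_mappings.values():
        # Denominator: alpha plus the summed likelihood of all mappings of this DNB.
        denom = alpha + sum(lik for _, lik in mappings)
        for positions, lik in mappings:
            weight = lik / denom  # fractional count contributed by this mapping
            for pos in positions:
                coverage[pos] += weight
    return dict(coverage)

example = {
    "dnb1": [([10, 11, 12], 0.9)],                          # unique mapping
    "dnb2": [([11, 12, 13], 0.3), ([50, 51, 52], 0.3)],     # two equally likely mappings
}
cov = coverage_from_mappings(example, alpha=0.0)
```

With alpha set to zero, the unique mapping contributes a full count to each aligned base, while the DNB with two equally likely mappings contributes half a count at each location.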

According to the techniques described herein, computer logic (e.g., the CNV caller 18 in Fig. 1 and/or a component thereof, such as coverage calculation logic 22) calculates the coverage values of the positions (or loci) in the reference genome based on the DNB mappings. The computer logic then includes the calculated coverage values in the measurement data used for subsequent processing.

2. Estimation and correction of coverage bias (coverage operations within a sample)

At present, genome sequencing may give rise to coverage bias that can affect copy number estimation. One element of the bias relates to GC content over intervals close to the length of the initial DNA fragment that becomes a DNB (e.g., approximately 400 bp), although other factors are also known. In one embodiment, it is generally preferred to model and correct for such bias before, or as part of, copy number estimation.

In another embodiment, it is desirable to apply some smoothing to short-scale fluctuations in coverage, which may be at least partly specific to an individual circle library or DNB.

Several methods of bias correction and smoothing can be used. All operations and steps in these methods are carried out by computer logic (e.g., the CNV caller 18 in Fig. 1 and/or a component thereof, such as GC correction logic 34) based on measurement data, which includes but is not limited to the coverage value of each position in the reference genome.

Method 1: Post-hoc coverage correction

In one embodiment, the sequence coverage described above is smoothed by window averaging and then adjusted to account for GC bias in the library construction and sequencing process.

Window averaging is carried out by calculating the mean of the unsmoothed coverage values of the positions in the window. For window length N, the mean coverage at position i is:

\bar{c}_{i} = \sum_{j = i - N/2}^{i + (N/2 - 1)} c_{j} / N
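As an illustration, the centered window average can be written in a few lines of Python. This is a sketch that assumes an even window length and returns None where the window would extend past either end of the coverage array; that edge handling is an assumption made for the example and is not specified in the text.

```python
def window_mean(cov, i, n):
    """Mean of cov[j] for j in [i - n/2, i + n/2 - 1]; None if the window is incomplete."""
    start, stop = i - n // 2, i + n // 2  # stop is exclusive
    if start < 0 or stop > len(cov):
        return None
    return sum(cov[start:stop]) / n

cov = [2, 4, 6, 8, 10, 12]
smoothed = [window_mean(cov, i, 4) for i in range(len(cov))]
# smoothed == [None, None, 5.0, 7.0, 9.0, None]
```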

In one embodiment, a set of adjustment factors is calculated from this smoothed coverage. GC content is calculated in 1000-base-pair windows (i.e., N=1000) at every 1000th base along each reference contig. Each window is assigned to one of 1000 bins based on the number of G and C bases present in the portion of the reference covered by the window. Let W be the set of tabulated windows (identified by their center positions), and W_{b} the set of windows with [G+C]=b. The mean uncorrected coverage of each bin b, \hat{c}_{b}, is determined as:

\hat{c}_{b} = \sum_{i \in W_{b}} \bar{c}_{i} / |W_{b}|

Letting \hat{C} be the mean coverage over the whole genome, the correction factor f_{b} for each GC bin b is defined as:

f_{b} = \hat{C} / \hat{c}_{b}

In another embodiment, additional smoothing operations can be used in estimating the correction factors. This can, for example, provide greater robustness to small-sample variation or overfitting. For example, the terms f_{b} can be smoothed using curve fitting, piecewise regression, sliding-window averaging, LOESS and the like.


The corrected, smoothed coverage is then calculated as follows (where the 1000-base window centered at position i is assigned to bin b_{i}):

\bar{c}_{i}^{'} = \bar{c}_{i} * f_{b_{i}}
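The bin-based correction can be sketched as follows, assuming the per-window G+C counts and smoothed coverages are already available as parallel lists, and taking the genome-wide mean as the mean over those windows. The function names and the toy numbers are illustrative only; bins with zero mean coverage are simply skipped, which is a choice made for this example.

```python
from collections import defaultdict

def gc_bin_factors(window_gc, window_cov):
    """Correction factor per GC bin: genome-wide mean coverage / mean coverage of the bin."""
    total, count = defaultdict(float), defaultdict(int)
    for b, c in zip(window_gc, window_cov):
        total[b] += c
        count[b] += 1
    genome_mean = sum(window_cov) / len(window_cov)
    # Bins whose summed coverage is zero are left out (no factor can be formed).
    return {b: genome_mean / (total[b] / count[b]) for b in total if total[b] > 0}

def apply_factors(window_gc, window_cov, factors):
    """Multiply each window's coverage by the factor of its bin (None if no factor)."""
    return [c * factors[b] if b in factors else None
            for b, c in zip(window_gc, window_cov)]

gc = [300, 300, 500, 500]          # G+C count of each 1000-base window
cov = [20.0, 20.0, 40.0, 40.0]     # smoothed coverage of each window
factors = gc_bin_factors(gc, cov)
corrected = apply_factors(gc, cov, factors)
```

In this toy input the low-GC windows are under-covered and the high-GC windows over-covered; after correction every window sits at the genome-wide mean of 30.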

The corrected, smoothed coverage of larger windows of length l=n*1000 (n a positive integer) can be calculated as the mean of the values of the constituent 1000-base windows.

In addition to the above, it should be clear that many variations are possible. The window size and the window shift can be varied. Some positions can be ignored based on various features such as structural annotation (e.g., repeats), excess or insufficient variability across multiple samples, mappability/uniqueness under the mapping criteria, depth of coverage in simulated data (a measure of mappability) and the like (with the corresponding windows either expanded to obtain a fixed number of acceptable positions, or averaged over the acceptable positions only). The arithmetic mean can be replaced, where appropriate, by the median, mode or another summary statistic. The correction factors can be calculated from the coverage of individual positions rather than from the mean coverage of windows, with smoothing/averaging applied after the correction rather than before it.

This exemplary method can be extended to take multiple predictors of coverage into account by calculating correction factors for multi-dimensional bins of genomic positions. For example, one can consider not only GC content on the scale of the whole DNB but also GC content on the scale of an individual DNB arm. Alternatively, a separate correction factor can be calculated for each predictor, which corresponds to an assumption that the effects are independent.

Method 2: Mapping-level coverage correction

In a second method of bias correction and smoothing, individual mappings are given an additional weighting factor to compensate for bias before smoothing. DNBs (mappings) that are more likely to arise because of bias than would be expected under uniform random sampling are down-weighted, while DNBs that are less likely to arise because of bias are up-weighted (and may contribute more than a full count to the coverage calculation). Letting q_{m} be the correction factor of mapping m (defined below), the corrected coverage at position i can be calculated as:

c_{i}^{'} = \sum_{m \in M_{i}} q_{m} * P({DNB}_{m} | R, m) / (\alpha + \sum_{n \in N(m)} P({DNB}_{m} | R, n))

The correction factor q_{m} is determined from odds ratios derived from a logistic regression model fitted to distinguish the mappings in the real dataset from the mappings in a dataset simulated by uniform random sampling of the reference genome. The model predicts whether a given mapping is real or simulated on the basis of a degree-1 B-spline (piecewise linear) with knots at every 5th percentile of the GC content distribution of the combined (real + simulated) dataset. For example, the corresponding R code is:

library(splines)
model <- glm(isReal ~ bs(dnbGCpcnt, df=20, degree=1, Boundary.knots=c(0,1)), data=d, family=binomial)

where the input dataset d consists of records of unique paired-end simulated mappings and an equal number of records of unique paired-end real mappings. For simulated records, isReal=0; for real records, isReal=1. dnbGCpcnt is the GC percentage of the portion of the reference spanned by the mapping.

Given the model thus obtained, the correction factor q_{m} is taken to be the simulated:real odds ratio predicted by the model for the GC percentage of mapping m. Thus, if a given GC percentage is three times as likely in the real data as in the simulated data, real mappings with that GC content are weighted by a factor of 1/3.
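The step from a fitted model to a weight can be shown with a short Python sketch. It assumes, as in the dataset described above, that real and simulated records are equally numerous, so that a predicted probability of "real" of 0.75 corresponds to a GC value that is three times as likely in real data as in simulated data. The function names are hypothetical; the fitting of the model itself is not reproduced here.

```python
import math

def weight_from_probability(p_real):
    """Simulated:real odds for a mapping that the model calls real with probability p_real."""
    return (1.0 - p_real) / p_real

def weight_from_log_odds(log_odds_real):
    """Same weight computed from the linear predictor (log odds of 'real') of a logistic model."""
    return math.exp(-log_odds_real)

q = weight_from_probability(0.75)           # real:simulated odds of 3:1
q_alt = weight_from_log_odds(math.log(3))   # identical case expressed as log odds
```

Both calls give a weight of one third, which matches the worked example in the text.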

Similar factor-based odds ratios can be determined using logistic models that account for numerous characteristics of the mappings, including factors such as:

Composition of the whole fragment (~500 bp);

Composition of the genomic segments in the final DNB (~80 bp);

Choice of base at each position in the final DNB;

Oligomers at specific positions in the initial fragment;

Sequence adjacent to the adaptors (e.g., effects on ligation efficiency);

Sequence at the usual positions of restriction enzyme sites;

Predicted physical properties;

Melting temperature;

Flexibility/curvature;

Measured/predicted features of genomic regions, such as histone binding and methylation.

The model can include not only linear effects of individual measurements but also various transformations of individual measurements (e.g., piecewise linear or polynomial fits, or binning) and interaction terms.

In one embodiment, the model-corrected coverage is then smoothed by sliding-window averaging and rounded to an integer. The width of the window is configurable; the default is 2 kb. By default, the mean coverage of abutting windows is reported (i.e., the window shift equals the window width), but other shift amounts can be applied. The mean corrected coverage is reported at the midpoint position of each window.

Each contig (or contiguous region of loci) of the reference genome is processed separately, so that, with the default width of 2 kb, each contig longer than 2 kb yields coverage values at 1 kb, 3 kb, 5 kb, ... relative to the start of the contig. For such a position i, the smoothed coverage is therefore given by:

\bar{c}_{i}^{'} = \sum_{j = i - 1000}^{i + 999} c_{j}^{'} / 2000

The first window of each contig starts at the first base of the contig; the window is shifted until the end of the window passes the end of the contig. Because the start position of a contig relative to its chromosome can be an arbitrary value, the chromosomal position reported for a given window may not be a round number.
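The abutting-window reporting can be sketched in Python as below. The sketch assumes one list of per-base corrected coverage per contig and drops a trailing partial window, which is one reading of "until the end of the window passes the end of the contig"; the function name and the sample data are illustrative.

```python
def abutting_window_means(corrected, width=2000):
    """Return (midpoint offset, rounded mean coverage) for each full window of a contig.

    Offsets are relative to the contig start. A trailing partial window is dropped
    (an assumption made for this sketch).
    """
    out = []
    for start in range(0, len(corrected) - width + 1, width):
        window = corrected[start:start + width]
        out.append((start + width // 2, round(sum(window) / width)))
    return out

contig = [30.2] * 2000 + [45.7] * 2000 + [10.0] * 500
windows = abutting_window_means(contig)
# windows == [(1000, 30), (3000, 46)]
```

To report chromosomal coordinates, the contig's start position on its chromosome would be added to each midpoint offset.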

Method 3: GC normalization process

In one embodiment, computer logic (e.g., the CNV caller 18 in Fig. 1 and/or a component thereof, such as GC correction logic 34) estimates and corrects coverage bias as follows.

First, GC content is calculated for the 1000-base window centered on each position of the genome (excluding positions less than 500 bases from the end of a contig). For example, a function GC(j) can be set to 1 if the base at position j is G or C and to 0 otherwise, and the GC content gc_{i} at position i can be calculated as follows:

gc_{i} = \sum_{j = i - 500}^{i + 499} GC(j)

Positions less than 500 bases from either end of a contig are not considered when estimating the GC correction factors.

Next, for each possible GC value γ, the mean coverage \bar{c}_{γ} of the positions with gc_{i}=γ is determined. Letting n_{γ} be the number of positions i in the genome with gc_{i}=γ, the mean coverage can be calculated as:

\bar{c}_{γ} = \sum_{i : gc_{i} = γ} c_{i} / n_{γ}

In an example implementation, positions with coverage >500 can be excluded.

Next, the two steps above are carried out for a simulation. Using "*" to denote simulated results, the simulated mean coverage can be determined as:

\bar{c}_{γ}^{*} = \sum_{i : gc_{i} = γ} c_{i}^{*} / n_{γ}

It should be noted that the simulated result is not entirely uniform, because the GC content of ubiquitous repeats, microsatellite regions and the like differs from that of the genome as a whole.

Next, for each GC value, the ratio of sample coverage to simulated coverage is calculated, adjusting for the overall mean coverage of the sample and of the simulation (\bar{c} and \bar{c}^{*}, respectively). For example, this ratio can be calculated as:

f(γ) = (\bar{c}_{γ} / \bar{c}) / (\bar{c}_{γ}^{*} / \bar{c}^{*})

Next, a smoothed coverage ratio as a function of GC, \hat{f}_{γ}, is obtained. For example, locally weighted polynomial regression can be used as follows:

\hat{f}_{γ} = LOESS(f(γ))

For the local regression operation, LOESS smoothing is performed except in numerically unstable regions, where LOWESS is performed instead.

Next, the GC-corrected (single-base) coverage of each position in the genome is calculated as:

c_{i}^{'} = c_{i} / \hat{f}_{gc_{i}}

Near the ends of contigs, "missing bases" are filled in with the genome-wide average GC content. If the GC content of the window at a given position is too extreme (i.e., <20% or >80% GC), the coverage value is treated as unknown (e.g., as missing data).
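The ratio and correction steps of this method can be sketched in Python as follows. The sketch assumes per-position GC values and sample and simulated coverages supplied as parallel lists, and it deliberately omits the LOESS smoothing step, applying the raw ratio f(γ) directly; a real implementation would smooth the ratios first. All names and numbers are illustrative.

```python
from collections import defaultdict

def mean_by_gc(gc, cov):
    """Mean coverage of the positions sharing each GC value (positions with gc None skipped)."""
    total, n = defaultdict(float), defaultdict(int)
    for g, c in zip(gc, cov):
        if g is not None:
            total[g] += c
            n[g] += 1
    return {g: total[g] / n[g] for g in total}

def gc_ratio(gc, sample_cov, sim_cov):
    """Unsmoothed ratio per GC value: sample vs. simulation, each scaled by its overall mean."""
    s, m = mean_by_gc(gc, sample_cov), mean_by_gc(gc, sim_cov)
    s_all = sum(sample_cov) / len(sample_cov)
    m_all = sum(sim_cov) / len(sim_cov)
    return {g: (s[g] / s_all) / (m[g] / m_all) for g in s if m[g] > 0}

def correct(gc, sample_cov, f):
    """Divide each position's coverage by the ratio for its GC value (None if unavailable)."""
    return [c / f[g] if g in f and f[g] > 0 else None
            for g, c in zip(gc, sample_cov)]

gc = [400, 400, 600, 600]
sample = [10.0, 10.0, 30.0, 30.0]
sim = [20.0, 20.0, 20.0, 20.0]
f = gc_ratio(gc, sample, sim)
corrected = correct(gc, sample, f)
```

Here the simulation is flat while the sample is under-covered at low GC and over-covered at high GC, so the corrected coverage comes out uniform at 20.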

Window smoothing is carried out by taking the mean of c_{i}^{'} over the positions i in a given window. The windows tile the genome (adjacent, non-overlapping), with window boundaries chosen as defined in the section below titled "Window boundary definition". That is, for the window corresponding to the interval [i, j), the mean corrected coverage is calculated as:

\bar{c}_{i, j}^{'} = \sum_{k = i}^{j - 1} c_{k}^{'} / (j - i)

It should be noted that, for notational convenience, the subscript "j" is omitted in the sections below, i.e., \bar{c}_{i}^{'} is used in place of \bar{c}_{i, j}^{'}, because there is at most one window starting at position i.

3. Normalization of coverage by comparison to a baseline sample

In various embodiments, the operations, calculations and method steps described in this section (Section 3) can be carried out by computer logic, e.g., the CNV caller 18 in Fig. 1 and/or a component thereof, such as ploidy-aware correction logic 36.

In some embodiments, coverage bias that is not corrected by the adjustments described above can be taken into account by comparison to a baseline sample. However, to obtain coverage proportional to absolute copy number, the baseline sample can be adjusted according to the copy number in that sample.

Letting d_{i}^{'} and p_{i} be the coverage and ploidy of the baseline sample at position i, and \hat{d} an estimate of the typical diploid coverage of the baseline sample, the bias correction factor b_{i} can be determined as:

b_{i} = (p_{i} / 2) * \hat{d} / d_{i}^{'}

(In one embodiment, \hat{d} is taken to be the 45th percentile of window coverage in the autosomes.) The normalized corrected coverage is then calculated as:

\bar{c}_{i}^{''} = \bar{c}_{i}^{'} * b_{i}

If p_{i}=0 (in which case d_{i}^{'} is due to mapping error and the coverage behavior at this position is not a reliable indicator), \bar{c}_{i}^{''} is treated as missing. This bias correction, which is based on the known or assumed ploidy and the coverage at a given position in the baseline sample, is referred to herein as "ploidy-aware baseline correction". Specifically, ploidy-aware baseline correction adjusts the coverage value of the baseline or reference sample at a position (or locus) of the target polynucleotide sequence based on the ploidy and coverage detected at that same position, as an element of using baseline values to correct the coverage of the sample being analyzed.
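A short Python sketch may help make ploidy-aware baseline correction concrete. It assumes the correction factor has the form b_i = (p_i / 2) * d / d'_i, where d is the typical diploid coverage of the baseline; this form is an assumption chosen to be consistent with the surrounding definitions (it keeps corrected coverage proportional to absolute copy number) and is not quoted from the text. Names and values are illustrative.

```python
def baseline_factor(baseline_cov, baseline_ploidy, diploid_cov):
    """Assumed form b_i = (p_i / 2) * d / d'_i; None where the baseline is unusable."""
    if baseline_ploidy == 0 or baseline_cov <= 0:
        return None  # treated as missing, as for p_i = 0 in the text
    return (baseline_ploidy / 2.0) * diploid_cov / baseline_cov

def normalize(sample_cov, baseline_cov, baseline_ploidy, diploid_cov):
    """Multiply each window's corrected coverage by its baseline correction factor."""
    out = []
    for c, d, p in zip(sample_cov, baseline_cov, baseline_ploidy):
        b = baseline_factor(d, p, diploid_cov)
        out.append(None if b is None else c * b)
    return out

sample = [40.0, 60.0, 20.0, 5.0]
base_cov = [40.0, 60.0, 20.0, 3.0]
base_ploidy = [2, 2, 1, 0]
normalized = normalize(sample, base_cov, base_ploidy, diploid_cov=40.0)
```

In the second window the diploid baseline is over-covered by a factor of 1.5, so the sample's matching excess is removed. In the third window the baseline is known to carry one copy, so the sample's coverage of 20 is left as it is and still reads as a single copy.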

In some embodiments, a set of samples rather than a single sample can be used as the baseline, to reduce sensitivity to sampling fluctuation (statistical noise) or to library-specific bias. For example, a baseline sample set S can be used as follows:

p_{i} = \sum_{s \in S} p_{i}^{s}

d_{i} = \sum_{s \in S} d_{i}^{s}

where p_{i} is the ploidy at window i. Ideally, this would be the true ploidy of the baseline sample for this window. Because the true ploidy is unknown, however, it must be estimated.

Therefore, in one embodiment, the CNV that baseline production process comprises each baseline gene group reads, and uses wherein euchromosome copy number to be 2 and the suitable stand-in of sex chromosome sex.Be used as the stand-in of baseline to provide the indirect method of the variation of suppressor gene picture group spectrum drafting property, such as the region of the tumor-necrosis factor glycoproteins of corresponding high copy, high identity, it is during collection of illustrative plates is drawn " exuberant ".But this may can not solve coverage deviation due to biological chemistry.In the region of medium coverage deviation, if the yardstick of deviation length is short relative to the length of window, then available correct ploidy reads baseline gene group, and therefore correction factor will suitably make up described deviation.But, there is the region of the sustained deviation of the coverage >50% of the diploid mean value caused away from correct ploidy, baseline gene group will make its copy number misread; This causes the baseline " correction " strengthening the trend reading CNV in this position, namely causes misreading of strong/consistent abnormal ploidy.In other embodiments, the ploidy estimation of baseline gene group can based on extraneous information (CNV such as based on chip reads), the exhibition of manual plan or the automatic business processing attempting measuring Population pattern by analyzing multiple genome simultaneously.

In other embodiments, the typical diploid coverage can be determined in several ways. For example, it can be estimated as the median over positions that have ploidy 2 in the initial ploidy estimate of the baseline sample, taken to be the coverage value of the model, or taken to be some fixed percentile of genome-wide coverage (possibly adjusted for male and female samples). A set of samples, rather than a single sample, can be used as the baseline. In this case, d'_{i} and p_{i} may be taken to be the sums of the coverage and the ploidy, respectively, of all baseline samples at reference position i, and the typical diploid coverage may be taken to be the sum of the typical diploid coverages of the samples. Alternatively, the mean or median of each value calculated for the several baseline samples can be used, to provide an estimate that is less sensitive to coverage differences between baseline samples.
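The ways of combining baseline samples described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `combine_baseline`, the `how` argument, and the example numbers are all hypothetical.

```python
from statistics import median


def combine_baseline(per_sample, how="sum"):
    """Combine per-window values from several baseline samples.

    per_sample: list of equal-length lists, one list per baseline sample.
    how: "sum" (as in the summation over S above), "mean", or "median".
    """
    if not per_sample:
        raise ValueError("need at least one baseline sample")
    columns = list(zip(*per_sample))  # one tuple of sample values per window
    if how == "sum":
        return [sum(col) for col in columns]
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "median":
        return [median(col) for col in columns]
    raise ValueError(f"unknown method: {how}")


# Hypothetical coverage d_i^s and ploidy p_i^s: three samples, four windows.
coverage = [[40, 42, 20, 61], [38, 41, 39, 59], [44, 40, 41, 60]]
ploidy = [[2, 2, 1, 3], [2, 2, 2, 3], [2, 2, 2, 3]]

d = combine_baseline(coverage)               # summed coverage per window
p = combine_baseline(ploidy)                 # summed ploidy per window
d_med = combine_baseline(coverage, "median") # less sensitive to one odd sample
```

In the third window one sample carries a single-copy loss; the summed ploidy (5 rather than 6) records this, so the summed coverage can still be interpreted against the correct total copy number.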

If no sample is supplied as a baseline, then simply set:

\bar{c}_{i}^{''} = \bar{c}_{i}^{'} .

4A. HMM segmentation, scoring, and output for normal samples

In various embodiments, the operations, calculations, and method steps described in this section (Section 4A) can be performed by computer logic such as, for example, the CNV caller 18 in Fig. 1 and/or its components, such as the HMM model logic 20.

In some aspects, there are many methods for segmenting quantitative time series that could be applied to CNV calling, i.e. applied to the coverage data produced by the preceding sequence of steps. Hidden Markov models (HMMs) provide one such class of methods with several attractive properties (explicit model fitting, flexible models, natural confidence measures, the ability to constrain the model, and the ability to integrate multiple models of coverage generation). In such a model the states correspond to copy number levels, the emissions are some form of coverage (observed/corrected/relative), and transitions between states are changes in copy number. The emission probabilities can be modeled as Poisson, negative binomial, a mixture of Poissons, a piecewise model fitted to the data, and so on. The model can be selected using goodness-of-fit measures and cross-validation. In one embodiment, it is desirable to smooth the per-position coverage values over a longer (sliding) window, although the window width should be much narrower than the smallest expected event size. In one embodiment, it is desirable to constrain the model in various ways, for example by requiring that the expected outputs of the copy number levels (e.g. the means of the emission distributions of the HMM states) be multiples of a common value, as expected from discrete changes in copy number. In one embodiment, it is desirable to include in the expected coverage distribution a component corresponding to "contamination" of a tumor sample by normal tissue, or to capture tumor heterogeneity, for example by using a mixture model.

In another aspect, other signals (e.g. their parameters and values) can be integrated into CNV detection, or other signals (e.g. data values) can be used to confirm or filter the output of the coverage-based CNV detector. Such signals include the presence of anomalous mate pairs at the boundary between two copy number levels, or changes in allelic balance at heterozygous positions.

In yet another aspect, a specific HMM-based method can be used to estimate copy number as a function of position in the reference genome. For example, GC-corrected, window-averaged, normalized coverage data can be input to an HMM whose states correspond to integer ploidy (copy number). The copy number along the genome can be estimated as the ploidy of the most likely sequence of states of the model. Various scores are computed from the posterior probabilities produced by the HMM. This aspect is described in more detail below.

Model definition:

A fully connected HMM with states corresponding to ploidy 0, ploidy 1, ploidy 2, ..., ploidy 9, and ploidy "10 or more" is defined by a matrix of transition probabilities, the initial state probabilities, and the emission probabilities. (In various embodiments, the exact number of states can be changed.)

The coverage distributions (i.e. the state emission probabilities) are modeled as negative binomials, each of which can be parameterized by the mean and variance of the distribution.

Model estimation:

In principle, all model parameters can be estimated by expectation maximization (EM) using the Baum-Welch algorithm; in practice, however, unconstrained estimation (especially of the coverage distributions) does not give satisfactory results. To deal with this, in one embodiment, the initial values are chosen and subsequent updates are constrained to reflect the following assumptions: coverage depends on the copy number of the given reference segment in the target genome; copy number is integer-valued; coverage is linearly related to copy number; most of the genome is diploid, so that a "typical" autosomal value can be used to determine the mean coverage for ploidy = 2; for states corresponding to ploidy >= 1, the standard deviation of the state is proportional to the mean of the state; and for the state corresponding to ploidy = 0, a separate variance can be used to account for the effects of mismapping and non-unique mapping. Given these constraints and assumptions, only two parameters of the coverage distributions are free: the single value that relates the standard deviation to the coverage for ploidy >= 1, and the variance parameter for ploidy = 0.

In one embodiment, the transition probabilities can be estimated from the data, but the default behavior is to keep them at their initial values. The user can set the initial values; if they are not set, the initial values default to t_{ij} = 0.01, meaning that for any two distinct states i and j, if the model is in state i at time t, there is a 1% probability that the state at time t+1 is j. In another embodiment, the transition probabilities can be estimated from the data, but the risk of overfitting is high. A set of default values can therefore be used in which the probability of a transition from one state to any other given state at any "time" (window) is set to 0.003, and the probability of remaining in a given state is taken to be 1 - 0.003*10 = 0.97.

The initial state probabilities are all set to 1 divided by the number of states.
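The default transition and initial-state settings described above can be written out directly. The sketch below assumes the 11 states named in the model definition (ploidy 0 through 9 plus "10 or more"); the function name `build_hmm_defaults` is hypothetical.

```python
def build_hmm_defaults(n_states=11, p_switch=0.003):
    """Return (transition matrix, initial state probabilities).

    Every off-diagonal entry is p_switch; the diagonal holds the remaining
    mass, so each row sums to 1. With 11 states and p_switch = 0.003 the
    diagonal is 1 - 0.003 * 10 = 0.97, as stated in the text.
    """
    stay = 1.0 - p_switch * (n_states - 1)
    if stay <= 0.0:
        raise ValueError("p_switch is too large for this many states")
    transitions = [
        [stay if i == j else p_switch for j in range(n_states)]
        for i in range(n_states)
    ]
    initial = [1.0 / n_states] * n_states  # uniform initial distribution
    return transitions, initial


transitions, initial = build_hmm_defaults()
```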

The mean of the emission (coverage) distribution of the state with ploidy n is initialized as follows, except as noted below:

Here the median is taken, over all positions, of the normalized, smoothed, corrected coverage at each position. To allow for some apparent coverage caused by mapping errors, in one embodiment μ₀ is set to 1, i.e. μ₀ = 1; in another embodiment, μ₀ can be set to a different value. The initial estimates of the means are not updated during subsequent model fitting.

The initial variance of the ploidy 2 state is set as:

In some embodiments, the variances of the other states are set so that the standard deviation is proportional to the mean:

\sigma_{n}^{2} = \sigma_{2}^{2} \cdot (n / 2)^{2}

In another embodiment, the initial variance of the negative binomial can be set as follows:

\sigma_{n}^{2} = 3 \cdot \mu_{n} .
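A hedged sketch of this initialization follows. Several points are assumptions made only for illustration: the initialization formulas for the state means and for the ploidy 2 variance are not reproduced in this text, so the code takes the mean of ploidy n to be n/2 times the median coverage (consistent with the linearity and "mostly diploid" assumptions stated earlier) and takes the ploidy 2 variance as an input. The names `init_emissions` and `nb_params` are hypothetical, and the conversion to the (r, p) form of the negative binomial is the standard moment-matching identity.

```python
def init_emissions(median_cov, var2, n_states=11, mu0=1.0):
    """Initial means and variances for ploidy states 0 .. n_states - 1.

    Assumption: mean of ploidy n is median_cov * n / 2 (ploidy 2 sits at the
    median). Variances for n >= 1 follow sigma_n^2 = sigma_2^2 * (n / 2)^2.
    The ploidy 0 variance is a separate free parameter, so it is left None.
    """
    means = [mu0] + [median_cov * n / 2 for n in range(1, n_states)]
    variances = [None] + [var2 * (n / 2) ** 2 for n in range(1, n_states)]
    return means, variances


def nb_params(mean, var):
    """Convert (mean, variance) to negative binomial (r, p).

    Uses mean = r(1 - p)/p and variance = r(1 - p)/p^2, which requires
    variance > mean (overdispersion).
    """
    if var <= mean:
        raise ValueError("negative binomial needs variance > mean")
    p = mean / var
    r = mean * mean / (var - mean)
    return r, p


# Hypothetical sample: median coverage 40, ploidy 2 variance 3 * 40 = 120.
means, variances = init_emissions(40.0, 120.0)
r2, p2 = nb_params(means[2], variances[2])
```

With these numbers the standard deviation of the ploidy 4 state is exactly twice that of the ploidy 2 state, which is the proportionality the constraint is meant to enforce.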

The variance parameters are updated by EM until the model "converges", e.g. until the change between successive iterations in the log-likelihood of the data given the model is sufficiently small, for example below some threshold.

In another aspect, the initial variance estimates can be updated during model fitting (using EM modified to constrain the means), but are constrained never to fall below the values given above. The model operates under the following assumptions: most of the genome is diploid; the median of the overall distribution is close to the median and mean of the diploid part of the genome; and copy numbers are strictly integer-valued. In this respect, adjustments will be needed over time to handle highly aneuploid samples, tumors with substantial "normal contamination", and copy number estimation in non-unique regions of the reference.

The update procedure is allowed to iterate until it "converges", e.g. until the log-likelihood of the data given the model changes by less than 0.001 between successive iterations.

Ploidy inference, segmentation, and scoring:

In another embodiment, after the estimation procedure has converged, standard HMM inference calculations are performed. The final result is based on the most likely state at each position. (A standard alternative is to assign the ploidy of the states on the single most likely path.)

In one embodiment, the "called ploidy" (calledPloidy) of each position in the input is taken to be the ploidy of the most likely state at that position. The "ploidy score" (ploidyScore) is a phred-like score (i.e. a logarithm-based score measured in decibels, dB) that reflects the confidence that the called ploidy is correct. The "CNV type score" (CNVTypeScore) is a phred-like score that reflects the confidence that the called ploidy correctly indicates whether the position has decreased ploidy, expected ploidy, or increased ploidy relative to the nominal expectation (diploid, except that male sex chromosomes are expected to be haploid). Additional scores at each position ("score ploidy=0", "score ploidy=1", etc.) reflect the probability of each possible ploidy; the score of each state is int(10 log10(L_{is})), where L_{is} is the likelihood of state s at position i.
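The called ploidy and the per-state scores can be illustrated as below. This is only a sketch: `call_position` is a hypothetical name, the likelihood values are made up, and the floor used to avoid log10(0) is an added assumption. The ploidyScore and CNVTypeScore are defined elsewhere in the document and are not reproduced here.

```python
import math


def call_position(likelihoods, floor=1e-300):
    """Return (called ploidy, per-state scores) for one position.

    likelihoods[s] is L_is, the likelihood of state s at this position.
    The called ploidy is the index of the most likely state. Each state
    score is int(10 * log10(L_is)); int() truncates toward zero.
    """
    called = max(range(len(likelihoods)), key=lambda s: likelihoods[s])
    scores = [int(10 * math.log10(max(l, floor))) for l in likelihoods]
    return called, scores


# Hypothetical likelihoods for ploidy 0, 1, and 2 at a single position.
called, scores = call_position([0.05, 0.5, 0.45])
```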

In another embodiment, a "segment" is a run of adjacent positions that have the same called ploidy. The "begin" and "end" positions of a segment are taken to lie at the midpoints outside the segment's first and last windows. Each segment is given a ploidy score and a CNV type score: the ploidy score is the mean of the ploidy scores of the positions in the segment, and the CNV type score is the mean of the CNV type scores of the positions in the segment.
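Grouping positions into segments and averaging their scores can be sketched as follows. The function name and the dictionary keys are hypothetical, and the sketch works on window indices only; it does not compute the genomic begin and end coordinates of a segment.

```python
from itertools import groupby


def build_segments(called, ploidy_scores, cnv_type_scores):
    """Merge runs of equal called ploidy and average the per-position scores."""
    out = []
    start = 0
    for ploidy, run in groupby(called):
        length = len(list(run))
        window = slice(start, start + length)
        out.append({
            "first": start,                 # index of first window in the run
            "last": start + length - 1,     # index of last window in the run
            "ploidy": ploidy,
            "ploidyScore": sum(ploidy_scores[window]) / length,
            "CNVTypeScore": sum(cnv_type_scores[window]) / length,
        })
        start += length
    return out


# Hypothetical per-window calls and scores.
segs = build_segments(
    called=[2, 2, 2, 3, 3, 2],
    ploidy_scores=[30, 40, 50, 10, 20, 60],
    cnv_type_scores=[40, 40, 40, 20, 30, 50],
)
```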

The precise definitions of, and rationale for, the above scores are given in the section below entitled "Score computation".

4B. Modifications to HMM segmentation, scoring, and output for tumor samples (tumor CNV method)

In various embodiments, the operations, calculations, and method steps described in this section (Section 4B) can be performed by computer logic such as, for example, the CNV caller 18 in Fig. 1 and/or its components, such as the HMM model logic 20.

In some aspects, copy number calling in tumor samples poses several challenges for the method described so far. Because the average copy number may be high, it is inadvisable to assume that the diploid ("normal") regions of the genome have coverage close to the sample median. Even if the typical coverage of diploid regions can be determined (e.g. by analysis of minor allele frequency), the change in coverage expected for a gain or loss of a single copy is not necessarily 50% of that value, because of possible contamination by an unknown amount of adjacent or admixed normal cells ("normal contamination"). Even among the tumor cells, because of tumor heterogeneity, segments of the genome may not be characterized by an integer copy number.

It is therefore useful to relax the assumptions that constrain the coverage levels of the model states, allowing the coverage ratios to take continuous values. This increases the challenge of finding the correct values, and it also introduces the problem of deciding how many states to include, leading to an analysis that includes a model selection component. The goal of the analysis is accordingly modified to segmenting the genome into regions of uniform "abundance class", without forcing a given class to be interpreted as an integer copy number.

In theory, HMMs with different numbers of states could simply be fitted, using EM to determine the expected coverage level of each state, and the number of states giving the best fit could be selected. In practice, unconstrained estimation of the model parameters for any given number of states is not a robust process. To address this, in another aspect, an additional initial step or module is introduced that estimates the number of states and their means from the overall coverage distribution, together with a further step that refines the initial model by sequentially adding states to the model and then sequentially removing states from it.

Initial model generation:

In one embodiment, the (corrected, normalized, window-averaged) coverage distribution of the whole genome to be segmented is a mixture of the distributions of the different abundance classes. One way to identify the distinct abundance classes is to use input data representing the initial states (or peaks) generated by computer logic that implements the model (as described earlier) for interpreting the absolute copy number of complex tumors. The resulting peaks P are used as the states of the initial model, with the expected coverage of each state equal to the center of the corresponding peak. EM can be used to estimate the variances (with the same constraints on model fitting as described above for segmentation of normal samples).

One way to identify the distinct abundance classes is to find the peaks of a smoothed whole-genome coverage distribution. In another embodiment, an alternative is to identify a mixture model that closely matches the observed coverage distribution. An improvement on direct peak identification is obtained by applying the quantile function of a suitable distribution to the cumulative distribution function (cdf) and then taking differences between successive values before smoothing and peak detection. The latter approach gives better sensitivity to small peaks outside the central abundance classes.
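The simplest of these approaches, finding the peaks of a smoothed coverage distribution, can be sketched as follows. The moving-average smoother, the strict local-maximum test, and the names `smooth` and `find_peaks` are assumptions chosen for illustration; the quantile-transform refinement is not implemented here.

```python
def smooth(values, half_width=1):
    """Centered moving average; the window is truncated at both ends."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def find_peaks(hist, half_width=1, min_height=0.0):
    """Indices where the smoothed histogram has a strict local maximum.

    hist[i] is the number of positions with coverage i. Each returned index
    is a candidate abundance class whose expected coverage is the peak center.
    """
    s = smooth(hist, half_width)
    return [
        i for i in range(1, len(s) - 1)
        if s[i] > s[i - 1] and s[i] > s[i + 1] and s[i] >= min_height
    ]


# Hypothetical histogram with a large peak at coverage 3 and a small one at 9.
hist = [0, 1, 5, 9, 5, 1, 0, 0, 2, 6, 2, 0]
peaks = find_peaks(hist)
```

Raising `min_height` discards minor bumps, at the cost of missing small abundance classes; that trade-off is what the quantile-based variant described in the text is intended to improve.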

For example, given a coverage histogram H = h_{0}, h_{1}, h_{2}, ..., where h_{i} is the number of positions with coverage i, with n the smallest value such that less than 0.001 of the full histogram is truncated, and letting Q(p) be the quantile function of a suitable distribution, the resulting peaks P can be computed as follows:

N = \sum_{i = 0}^{n} h_{i}

c_{i} = h_{i} / N

q_{i} = Q(c_{i})

d_{i} = q_{i} - q_{i-1}

D = d_{1}, d_{2}, \ldots, d_{n}

S = \mathrm{smooth}(D)

s_{i} = S(i)

P = \{ i \mid m_{i} = 1 \text{ and } d_{i} \geq 0.002 \}

The resulting peaks P are used as the states of the initial model, with the expected coverage of each state equal to the center of the corresponding peak. EM can be used to estimate the variances (with the same constraints on model fitting as described above for segmentation of normal samples).

Model refinement:

In another embodiment, once the initial model has been inferred in this way, it is refined iteratively. First, additional states are evaluated. An added state is evaluated between each pair of consecutive states (abundance classes ordered by expected coverage), and the addition is accepted if the likelihood Pr(data | model) improves by more than some threshold. That is, for each pair of consecutive states i and j with expected coverages c_{i} and c_{j}, the addition of a state i' with initial coverage c_{i'} = (c_{i} + c_{j})/2 is attempted. c_{i'} is optimized by EM with the expected coverage levels of all other (pre-existing) states held fixed. If the optimization yields a value outside the interval (c_{i}, c_{j}), or if the improvement in Pr(data | model) does not exceed the acceptance threshold, the addition is rejected; otherwise it is accepted. If an addition is accepted, the addition of another state between i and i' is attempted, recursively, until an addition is rejected. Once additions between all pairs of consecutive states have been rejected, the addition process stops. Second, the removal of states is evaluated. One state at a time is removed from the model, and the resulting model is optimized using EM; if the resulting model is not significantly worse than the previous model, the removal is accepted.
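The control flow of this add-then-remove refinement can be sketched as below. This is a deliberately simplified illustration and not the disclosed procedure: the caller-supplied `score` function stands in for log Pr(data | model) after EM fitting, the candidate state is left at the midpoint instead of being optimized by EM (so the "outside the interval" rejection can never trigger), and all names, thresholds, and data are hypothetical.

```python
def add_states(levels, score, threshold):
    """Try inserting a midpoint state between each pair of adjacent levels."""
    levels = sorted(levels)
    i = 0
    while i < len(levels) - 1:
        candidate = (levels[i] + levels[i + 1]) / 2
        trial = levels[:i + 1] + [candidate] + levels[i + 1:]
        if score(trial) - score(levels) > threshold:
            levels = trial   # accepted: retry between level i and the new state
        else:
            i += 1           # rejected: move on to the next adjacent pair
    return levels


def remove_states(levels, score, tolerance):
    """Drop any state whose removal costs no more than `tolerance`."""
    levels = list(levels)
    i = 0
    while i < len(levels) and len(levels) > 1:
        trial = levels[:i] + levels[i + 1:]
        if score(levels) - score(trial) <= tolerance:
            levels = trial   # not significantly worse: accept the removal
        else:
            i += 1
    return levels


def toy_score(levels, data):
    """Stand-in for a log-likelihood: minus squared distance to nearest level."""
    return -sum(min((x - c) ** 2 for c in levels) for x in data)


data = [0, 0, 5, 5, 10, 10]                  # hypothetical coverage values
score = lambda lv: toy_score(lv, data)
grown = add_states([0, 10], score, threshold=1.0)
pruned = remove_states([0, 2.5, 5, 10], score, tolerance=1.0)
```

In the example, the level at 5 is added because it explains two data points that neither 0 nor 10 fits, while the level at 2.5 is removed because no data point depends on it.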

In certain embodiments, the segmentation further comprises generating an initial model by estimating the number of states and their means from the overall coverage distribution. In certain embodiments, the method comprises optimizing the initial model by any of various methods known to those skilled in quantitative data modeling, including modifying the number of states in the model and optimizing the parameters of each state. For example, the number of states in the model can be modified by sequentially adding states to the model, then sequentially removing states, or by a combination of both; similar procedures are used in model selection methods for multivariate regression. The parameters of each state can be optimized by expectation maximization or by any of many other methods for optimizing multivariate models.

Those skilled in the art will recognize variations on the foregoing procedure. For example, starting from the largest model, the removal of each state can be attempted to determine which state has the least impact, that state can be removed, and the process can be repeated recursively. Those skilled in quantitative model selection will know many such alternatives.

Segmentation and segment scoring:

Once the model has been selected and its parameters optimized, the segmentation and segment scores are determined as described above for normal samples. Briefly, segments of consecutive positions with the same most likely state are reported, and the score expresses the mean, over the positions in the segment, of the probability of a classification error.

The present disclosure differs from many known methods. A key difference is that, instead of intensity measurements (e.g. microarray data) at a large but specific set of positions on the genome, the method is based on sequencing depth-of-coverage measurements (e.g. next-generation sequencing data) associated with each position on the genome. Some other differences are described below:

1) Use of fractional counts to measure coverage. In yet another embodiment, when a mated read (e.g. corresponding to an entire DNB) maps to more than one location, a confidence measure is used to apportion the mapping fractionally to each location. As a result, coverage in segmental duplications can be assessed to a greater degree than with other methods.

2) The method of correcting coverage bias. In yet another embodiment, the method of weighting each DNB (one particular embodiment uses logistic regression) provides the ability to model the effect of multiple bias factors, which arguably gives better bias correction than previous methods.

3) Use of estimates of the copy number in each baseline/matched sample. In yet another embodiment, estimating the copy number of each sample in a general baseline or a matched baseline avoids one of the challenges of previous methods that involve computing relative intensity (microarray) or relative coverage (sequencing-based CNV), namely that the real samples used as the baseline may themselves contain CNVs. When a baseline sample has a CNV, the intensity/coverage measured at the CNV locus does not provide an estimate of the intensity of the normal (usually diploid) copy number, so that the relative coverage of the target sample has a different relationship to absolute copy number there than in most of the genome. Adjusting the baseline samples themselves according to their estimated copy number preserves the expected linear relationship between copy number and relative coverage, allowing more accurate inference of absolute copy number.

4) Two distinguishing features of the HMM. In yet another embodiment, these features allow more robust modeling of the data (more accurate CNV calls).

a) The means of the states are determined by methods that provide an alternative to the usual HMM training method (EM), which does not appear to converge reliably to useful values on coverage data.

i) For normal samples, the median coverage in the portion of the sample expected to be diploid is used to determine the mean of the diploid state, and the other states (copy numbers) are determined as 50% increments or decrements from the diploid state. (The 0-copy state is special and is given a value slightly above 0 to allow for mapping errors.)

ii) For tumor samples, a separate process is used to infer an initial set of levels; this process can be based on analysis of a histogram of the coverage data; once the initial levels have been selected, further computation is applied to refine the set of levels.

b) The method of estimating (constraining) the variances of the states. At least in some embodiments, the variance is constrained to depend linearly on the state mean, reflecting the fact that most of the variance results from bias rather than sampling noise; thus, in a given sample, a state (coverage level) whose mean is twice that of a second state will typically have twice the spread (standard deviation) of observed coverage of the second state.

5) Use of coverage data from a large baseline (e.g. 50 samples) to identify positions at which some aspect of the sequencing process results in highly variable coverage levels.

In yet another embodiment, if such positions are not identified as problematic, they will lead to false CNV calls. Once identified, such positions are marked as having unknown copy number rather than being assigned a false change.
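Item 1) above, fractional counting, can be illustrated with a small sketch. The normalization (weights for one read summing to 1), the data layout, and the name `add_fractional_coverage` are assumptions made for illustration only; the text does not specify how the confidence measure is converted into fractions.

```python
from collections import defaultdict


def add_fractional_coverage(coverage, placements):
    """Spread one read's unit of coverage over its candidate mappings.

    placements: list of (position, confidence) pairs for a single read.
    Each position receives confidence / (sum of confidences), so every read
    contributes a total of exactly one count.
    """
    total = sum(conf for _, conf in placements)
    if total <= 0:
        return  # no usable mapping; the read contributes nothing
    for position, conf in placements:
        coverage[position] += conf / total


coverage = defaultdict(float)
add_fractional_coverage(coverage, [(100, 0.75), (500, 0.25)])  # ambiguous read
add_fractional_coverage(coverage, [(100, 1.0)])                # unique read
```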

Window boundary definition (for window smoothing)

When selecting the window boundaries for window smoothing, in one example embodiment, most windows are defined so that their chromosome coordinates are whole multiples of the window length, so that for 2k windows, for example, the chromosome positions of the window boundaries end in "x000", where x is an even digit. The boundaries of these windows are called "default boundaries". The exceptions to the default boundaries are the windows at the ends of contigs. A window never spans bases from more than one contig, even if the gap between contigs is small enough to allow it. In addition, the bases outside the outermost full default window of each contig are treated specially. These "outer bases" are either added to the first full window toward the center of the contig or placed in a window of their own, depending on whether the number of bases is greater than 1/2 of the window width. For example, for a contig running from position 17891 to position 25336 and a window width of 2000, the following list of window intervals can be used: (17891, 20000), (20000, 22000), (22000, 24000), (24000, 25336).

Note that the first 109 bases of the contig are added to the adjacent 2k interval to their right, while the last 1336 bases are placed in a window of their own. A contig smaller than the window width (e.g. chrM for 100k windows) is made into a single window containing the entire contig. No windows are reported for the gaps between contigs. To illustrate, suppose that a chromosome consists of three contigs as shown in Table 1.

Table 1: Example chromosome contigs

Contig number	Start position	End position
1	17891	25336
2	25836	29277
3	33634	34211

This yields the following windows for use and reporting; the contig numbers are shown here only for clarity of presentation:

Contig 1: (17891, 20000), (20000, 22000), (22000, 24000), (24000, 25336)

Contig 2: (25836, 28000), (28000, 29277)

Contig 3: (33634, 34211)

The results of this approach are:

Every non-gap base of the genome is contained in a window (and in only one window);

Each window is confined to a single contig;

Windows are between 0.5 and 1.5 times the nominal window width;

Window boundaries are generally round numbers, which makes it more evident that segment boundaries correspond to window boundaries and reduces the chance of overinterpreting the precision of CNV call boundaries.
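The window boundary rules can be expressed compactly in code. The sketch below reproduces the three-contig example; the function name `contig_windows` is hypothetical, and one case is an assumption because the text does not cover it: a contig that contains no full default window is returned as a single window, even if it is longer than the window width.

```python
def contig_windows(start, end, width):
    """Return the (begin, end) windows for one contig.

    Default boundaries are multiples of `width`. Bases outside the outermost
    full default window are merged into that window unless they number more
    than half the window width, in which case they get their own window.
    """
    first = -(-start // width) * width   # first default boundary >= start
    last = (end // width) * width        # last default boundary <= end
    if last - first < width:
        # No full default window fits (assumed to cover small contigs too).
        return [(start, end)]
    edges = list(range(first, last + 1, width))
    if first - start > width / 2:
        edges.insert(0, start)           # leading bases form their own window
    else:
        edges[0] = start                 # leading bases join the first window
    if end - last > width / 2:
        edges.append(end)                # trailing bases form their own window
    else:
        edges[-1] = end                  # trailing bases join the last window
    return list(zip(edges[:-1], edges[1:]))


contigs = [(17891, 25336), (25836, 29277), (33634, 34211)]
windows = [contig_windows(s, e, 2000) for s, e in contigs]
```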

5. Population-based identification of no-call/low-confidence regions

In various embodiments, the calculations and method steps described in this section (Section 5) are performed by computer logic such as, for example, the CNV caller 18 in Fig. 1 and/or its components, such as the population-based no-call logic 38.

In one aspect, the HMM-based calls described above often include multiple putative CNVs that are artifacts or are of little interest. These arise mainly in one of two situations: A) the reference genome sequence does not account for the coverage pattern in most or all sample genomes, while most or all sample genomes match one another; B) there is more variation in coverage than can be explained by a small number of discrete ploidy levels. Identifying and annotating such regions increases the usefulness of CNV inference. In what follows, regions annotated in this way are considered "no-called", in the sense that no discrete estimate of ploidy can be given for them.

Such behavior can arise for many reasons; some possible mechanisms include:

Errors in the reference genome. For example, two contigs may in fact overlap each other in most or all genomes, i.e. correspond to a single genomic interval. In this case, the ends of the two contigs may consist of highly similar sequence that is otherwise unique, causing DNBs to map to both locations. The observed/measured coverage will be reduced, resulting in an apparent copy number loss. Alternatively, most or all sample genomes may contain a duplication that is absent from the reference. In this case, the observed coverage in the part of the reference corresponding to the duplicated segment will be elevated, resulting in a copy number gain relative to the reference that is not a true polymorphism.

Uncorrected coverage bias. In one aspect, regions that are substantially over-represented or under-represented in the sequencing results may look like CNVs relative to the reference. To preserve the ability to make absolute copy number inferences, baseline correction is performed as described above, taking into account the initial copy number inference for the baseline genomes. A possible consequence is that regions that are severely biased in both the baseline and the target sample are treated as true CNVs. The signature of this type of event is that most or all samples show a similarly elevated or depressed coverage pattern.

Analysis artifacts. Although rare, there are occasional mapping artifacts that can cause a large number of false mappings at a given location. Such artifacts can result from a particular arrangement of variation from the reference within a repeated segment, which makes the wrong reference copy of the repeat more similar to the sequence of the target sample. Depending on the variation present in a given sample, these can cause very large spikes in coverage at certain positions in the reference.

Segmental duplications and tandem repeats. Segments that are present in duplicated form in the reference and that vary in the population can give rise to changes in coverage between samples that are smaller than those of a typical copy number gain or loss in unique sequence. In the limiting case, population variability in high-copy sequence types can lead to an essentially continuous range of coverage values across a large number of samples.

Unstable estimates due to extreme correction factors or very low raw coverage. Examples include: 1) regions where coverage is very low because of GC content and the GC correction factor is correspondingly large, so that the noise in the coverage estimate is amplified by the correction factor; 2) regions where coverage is very low in both simulated and real data because of mapping overflow, resulting in a large correction term in the baseline bias correction factor; 3) regions where nearly all baseline genomes have ploidy 0.

Such regions can be identified in various ways. Ultimately, manual curation of the coverage patterns at individual positions is highly effective, but in some cases it is prohibitive because of a lack of data, the degree of effort required, and/or process instability. The use of sequence similarity and/or structural annotation shows some promise, since in practice a large fraction of the problematic regions correspond to known repeated parts of the reference genome (segmental duplications, self chains, STRs, repeat-masker elements); however, because many true copy number polymorphisms occur in such regions, excluding such segments too broadly is not feasible, and finding more selective criteria is very challenging. Therefore, in yet another aspect, it is desirable to identify problematic regions directly from the coverage data.

Two classes of coverage pattern account for several of the situations above. The first class involves regions where coverage is more variable than can be explained by a small number of discrete ploidy levels ("hypervariable regions"). The second class involves regions where coverage is not as expected for a euploid match to the reference but is similar in all samples ("invariant regions").

Given a substantial number of genomes (e.g. 50 or more), the "background set", summary statistics of the bias-corrected and smoothed but unnormalized coverage data are sufficient to separate the genome (heuristically, or imperfectly) into well-behaved regions, hypervariable regions, and invariant regions. The following summary statistics can be computed for each genomic position i over a set G of n genomes. For 1 ≤ x ≤ n, let \bar{c}_{i \langle x \rangle}^{'} be the x-th order statistic of \bar{c}_{i}^{'}(g) for g ∈ G, i.e. the x-th smallest corrected and smoothed coverage at position i among the genomes of the background set.

Median \tilde{m}_{i} (the median of \bar{c}_{i}^{'}(g) over g \in G)

Spread s_{i}:

s_{i} = \bar{c}_{i \langle n \rangle}^{'} - \bar{c}_{i \langle 1 \rangle}^{'} = \max_{g \in G} \bar{c}_{i}^{'}(g) - \min_{g \in G} \bar{c}_{i}^{'}(g)

Clustering coefficient q_{i}:

q_{i} = \frac{\min_{1 \leq q < r < s < n} \left[ SSE(i, 0, q) + SSE(i, q, r) + SSE(i, r, s) + SSE(i, s, n) \right]}{SSE(i, 0, n)}

where SSE(i, x, y) is the sum of squared errors, namely

C_{i, x, y} = \frac{1}{y - x} \sum_{x < t \leq y} \bar{c}_{i \langle t \rangle}^{'}

SSE(i, x, y) = \sum_{x < t \leq y} \left( \bar{c}_{i \langle t \rangle}^{'} - C_{i, x, y} \right)^{2}
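The three summary statistics can be computed for one position as sketched below. The brute-force search over all three cut points follows the definition of the clustering coefficient directly, with prefix sums used only to make each SSE evaluation cheap. The function name, the handling of the degenerate case in which all values are identical (returned as 1.0), and the example values are assumptions.

```python
from itertools import combinations
from statistics import median


def summary_stats(coverages):
    """Return (median, spread, clustering coefficient) for one position.

    coverages: corrected, smoothed coverage at this position, one value per
    background genome. The clustering coefficient is the smallest total
    within-group SSE over all splits of the sorted values into four
    contiguous groups, divided by the SSE of the undivided set.
    """
    v = sorted(coverages)
    n = len(v)
    if n < 4:
        raise ValueError("need at least four background genomes")
    # Prefix sums of values and squared values for O(1) SSE evaluation.
    ps, pss = [0.0], [0.0]
    for x in v:
        ps.append(ps[-1] + x)
        pss.append(pss[-1] + x * x)

    def sse(x, y):
        total = ps[y] - ps[x]
        return max(0.0, (pss[y] - pss[x]) - total * total / (y - x))

    whole = sse(0, n)
    if whole == 0.0:
        q_i = 1.0  # assumption: identical values show no cluster structure
    else:
        best = min(
            sse(0, q) + sse(q, r) + sse(r, s) + sse(s, n)
            for q, r, s in combinations(range(1, n), 3)
        )
        q_i = best / whole
    return median(v), v[-1] - v[0], q_i


# Hypothetical position with four tight clusters (a simple polymorphism).
m_i, s_i, q_i = summary_stats([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
# Hypothetical position with evenly spread values (no clear clusters).
m_u, s_u, q_u = summary_stats(list(range(12)))
```

A small coefficient means the background values fall into a few tight groups, as expected for a simple polymorphism; values spread more evenly give a larger coefficient.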

Given these summary statistics, criteria can be defined for marking a position as hypervariable or invariant.

Annotation of hypervariable regions

A position that meets all four of the following criteria can be marked "hypervariable" (rather than being called a CNV or classified as euploid):

(i) The position is called CNV/aneuploid by the HMM inference process described above.

(ii) The coverage values in the background set do not cluster in a way that suggests a simple polymorphism in the population. Formally, for a value Q that can be chosen empirically:

q_{i} > Q

(iii) The range of coverage values at this position in the background set is wider than that seen for the great majority of the (euploid) genome. Formally, for a value S that can be chosen empirically:

s_{i} / \tilde{m}_{i} > S

(iv) The coverage observed in the target sample falls within the range of values seen in the background set, or falls outside the observed range by only a small absolute amount (e.g. an amount that could easily be explained by sampling or process variation). Formally, for values R and X that can be chosen empirically:

| \bar{c}_{i}^{'} - \tilde{m}_{i} | < \min (s_{i} * R, X)

Annotation of invariant regions

Positions that satisfy all of the following criteria are labeled "invariant" (rather than being labeled as a CNV):

(i) The position is called CNV/aneuploid by the HMM segmentation process described above.

(ii) For a value Q that can be chosen empirically, as described below:

q_{i} > Q

(iii) The coverage at this position across the background samples shows low variability, suggesting the absence of a high minor-allele-frequency polymorphism in the population and low process variation (artifacts). For a value S that can be chosen empirically, as described below:

s_{i} / \tilde{m}_{i} < S

(iv) The coverage observed in the sample of interest falls within the range of values seen in the background set, or falls outside that range by only a small absolute amount (for example, an amount readily explained by sampling or process variation). Formally, for values R and X that can be chosen empirically, as described below:

\left| \bar{c}'_{i} - \tilde{m}_{i} \right| < \min(s_{i} \cdot R, X)
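
The two sets of criteria can be read as a simple per-position decision rule. The sketch below is one hypothetical way to express it; the `Cutoffs` container, the function name, the use of separate cutoff sets for the two labels, and the guard against a non-positive median are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cutoffs:
    """Empirically chosen thresholds Q, S, R, X (hypothetical container)."""
    Q: float
    S: float
    R: float
    X: float


def annotate_position(called_cnv: bool, q: float, s: float, m: float,
                      c_target: float, hyper: Cutoffs,
                      invar: Cutoffs) -> Optional[str]:
    """Return 'hypervariable', 'invariant', or None (call left as is)."""
    if not called_cnv:          # criterion (i): only CNV/aneuploid calls qualify
        return None
    if m <= 0:                  # assumption: skip positions with no usable median
        return None
    rel_spread = s / m

    def near_background(cut: Cutoffs) -> bool:
        # criterion (iv): target coverage close to the background median
        return abs(c_target - m) < min(s * cut.R, cut.X)

    if q > hyper.Q and rel_spread > hyper.S and near_background(hyper):
        return "hypervariable"
    if q > invar.Q and rel_spread < invar.S and near_background(invar):
        return "invariant"
    return None
```

Passing the same `Cutoffs` object for both labels reproduces the case in which a single set of thresholds is shared.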

Refinement of annotation

In one aspect, the criteria above can cause CNV calls to be excessively fragmented into alternating called and no-called segments. Ideally, if the observed coverage is sufficiently similar to that of the flanking un-annotated intervals, short intervals that would be no-called under the criteria above (that is, annotated as "hypervariable" or "invariant") are allowed to remain called (left un-annotated). Specifically, the "hypervariable" or "invariant" label can be suppressed for an interval that meets the criteria above but is shorter than L bases and is part of a longer segment in the HMM output.
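
A minimal sketch of the length-based suppression rule follows. The `Interval` record, its field names, and the half-open coordinate convention are assumptions introduced for the example; the coverage-similarity check against flanking intervals is not modeled here.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    start: int                 # first base of the interval
    end: int                   # one past the last base (half-open)
    label: Optional[str]       # "hypervariable", "invariant", or None
    segment_length: int        # length of the HMM segment containing it


def suppress_short_labels(intervals: List[Interval],
                          min_length: int) -> List[Interval]:
    """Drop labels on intervals shorter than min_length (the cutoff L)
    when they sit inside a longer HMM segment."""
    cleaned = []
    for iv in intervals:
        length = iv.end - iv.start
        inside_longer_segment = iv.segment_length > length
        if iv.label is not None and length < min_length and inside_longer_segment:
            iv = iv._replace(label=None)   # keep the call, remove the annotation
        cleaned.append(iv)
    return cleaned
```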

Selection of cutoffs

In one aspect, the cutoffs Q, S, R, X, and L in the criteria above can be selected based on analysis of a subset of initial CNV calls and of the genome-wide distribution of the summary statistics of background coverage. Given an initial set of CNV calls (a "training set") that has been classified into those that are suspect (to be labeled "hypervariable" or "invariant") and those considered to be real CNVs, together with the summary statistics for the whole genome (that is, at selected positions spaced along the genome, such as those resulting from the windowing described above), near-optimal cutoffs can be identified using the following criteria:

● most of the genome is called euploid or CNV/aneuploid (for example, only a small fraction of the genome is no-called, that is, annotated as hypervariable or invariant);

● most of the suspect regions in the training set are no-called;

● most of the trusted regions in the training set are called (not annotated).

The training set can be obtained by manual curation of an initial set of CNV calls. The curation can involve manual inspection of coverage profiles to identify suspect calls, and comparison with external data sets of putative CNVs identified by independent methods.

Candidate values of Q, R, S, and L are evaluated by measuring concordance with the training set, or with an independent test set, together with the fraction of the genome that is no-called. The final choice of cutoffs can involve a trade-off between call completeness (the fraction of the genome that is called) and the number of suspect CNV calls.

Score calculation

The CNV segment scores described above are set out more explicitly in this section.

Given an HMM, one can compute the probability that a given output sequence D = d_1, ..., d_t of length t arises as the result of a particular state sequence σ = s_1, ..., s_t, where the HMM consists of n states defined by initial state probabilities P = p_1, ..., p_n, transition probabilities T = {t_{ij}}, and emission probabilities E = {e_{s,d}}:

\Pr(D, σ | P, T, E) = p_{s_{1}} \, e_{s_{1}, d_{1}} \prod_{i = 2}^{t} t_{s_{i - 1} s_{i}} \, e_{s_{i}, d_{i}}

The probability of the data given the model is the sum over all possible state sequences, that is, over the set S of all possible state sequences of length t:

\Pr(D | P, T, E) = \sum_{σ \in S} \Pr(D, σ | P, T, E)

This equation, and other equations involving sums over subsets of S, can be computed efficiently using the Forward/Backward algorithm. Application of Bayes' rule gives the probability of a given path conditional on the data and the model:

\Pr(σ | P, T, E, D) = \frac{\Pr(D, σ | P, T, E)}{\Pr(D | P, T, E)}

From this it can be seen that the most probable path given the data and the model is the path that maximizes Pr(D, σ | P, T, E). The path that maximizes this quantity can be determined efficiently using the Viterbi algorithm.
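
For illustration, a textbook Viterbi decoder is sketched below. It assumes a discrete emission alphabet and models P, T, and E as plain nested lists (`init`, `trans`, `emit`); these names and the log-space formulation are assumptions for the example, and the HMM described in this disclosure may use a different emission model.

```python
import math


def _log(p):
    """Natural log that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, init, trans, emit):
    """Return the state path maximizing Pr(D, sigma | P, T, E).

    obs   : observed symbols d_1..d_t as integer indices
    init  : init[s] = p_s
    trans : trans[i][j] = t_ij
    emit  : emit[s][d] = e_sd
    """
    n = len(init)
    score = [_log(init[s]) + _log(emit[s][obs[0]]) for s in range(n)]
    backpointers = []
    for d in obs[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n):
            # best predecessor of state j at this position
            i_best = max(range(n), key=lambda i: prev[i] + _log(trans[i][j]))
            pointers.append(i_best)
            score.append(prev[i_best] + _log(trans[i_best][j]) + _log(emit[j][d]))
        backpointers.append(pointers)
    # trace back from the best final state
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Working in log space avoids numerical underflow when the sequence of positions is long.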

However, probabilities of partial paths can also be computed. For example, the probability that the path through the model giving rise to the observed data sequence is in a particular state q at a particular time u can be computed as follows:

\Pr(s_{u} = q | P, T, E, D) = \frac{\Pr(D, s_{u} = q | P, T, E)}{\Pr(D | P, T, E)}

The denominator was discussed above. The numerator can be obtained by summing the joint probability of the data and a specific path over all paths for which s_u = q, a set denoted S_{s_u = q}:

\Pr(D, s_{u} = q | P, T, E) = \sum_{σ \in S_{s_{u} = q}} \Pr(D, σ | P, T, E)

Therefore:

\Pr(s_{u} = q | P, T, E, D) = \frac{\sum_{σ \in S_{s_{u} = q}} \Pr(D, σ | P, T, E)}{\sum_{σ \in S} \Pr(D, σ | P, T, E)}

State assignments ("called ploidy") are made as follows. The inferred state (ploidy) at position u is the state with maximum probability:

\hat{s}_{u} = \arg\max_{q} \Pr(s_{u} = q | P, T, E, D)

(Ties are broken arbitrarily.) The ploidy score at position u, π_u, is then:

π_{u} = -10 \log_{10} \left( 1 - \Pr(s_{u} = \hat{s}_{u} | P, T, E, D) \right)

and the CNV type score (also called the DEI score) at position u, δ_u, is:

δ_{u} = -10 \log_{10} \left( 1 - \sum_{q = a}^{b} \Pr(s_{u} = q | P, T, E, D) \right)

The bounds a and b of the summation are as follows. For regions expected to be diploid: a = 0, b = 1 if the called ploidy is below 2; a = b = 2 if the called ploidy is 2; and a = 3, b = the maximum ploidy (typically 10) if the called ploidy is above 2. For regions expected to be haploid (male sex chromosomes): a = 0, b = 0 if the called ploidy is below 1; a = b = 1 if the called ploidy is 1; and a = 2, b = the maximum ploidy (typically 10) if the called ploidy is above 1.
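
The per-position quantities above can be sketched as follows. This is an illustration under stated assumptions: emissions are discrete, state index q equals ploidy q, the bounds a and b select the decrease / equal / increase class that contains the called ploidy, and the floor applied to the error term (to keep the logarithm finite) is an implementation choice not taken from the text. All function names are hypothetical.

```python
import math


def state_posteriors(obs, init, trans, emit):
    """Pr(s_u = q | P, T, E, D) for every position u (scaled forward-backward)."""
    n, t = len(init), len(obs)
    fwd = []
    for u in range(t):
        if u == 0:
            raw = [init[q] * emit[q][obs[0]] for q in range(n)]
        else:
            raw = [emit[q][obs[u]] * sum(fwd[-1][p] * trans[p][q] for p in range(n))
                   for q in range(n)]
        z = sum(raw)
        fwd.append([v / z for v in raw])        # rescale to avoid underflow
    bwd = [[1.0] * n for _ in range(t)]
    for u in range(t - 2, -1, -1):
        raw = [sum(trans[p][q] * emit[q][obs[u + 1]] * bwd[u + 1][q] for q in range(n))
               for p in range(n)]
        z = sum(raw)
        bwd[u] = [v / z for v in raw]
    post = []
    for f, b in zip(fwd, bwd):
        raw = [x * y for x, y in zip(f, b)]
        z = sum(raw)
        post.append([v / z for v in raw])       # normalize per position
    return post


def phred(p):
    """-10 * log10(1 - p); the 1e-300 floor is an assumption."""
    return -10.0 * math.log10(max(1.0 - p, 1e-300))


def position_scores(post_u, expected_ploidy=2):
    """Return (called ploidy, ploidy score pi_u, CNV type score delta_u)."""
    max_ploidy = len(post_u) - 1
    called = max(range(len(post_u)), key=lambda q: post_u[q])
    if called < expected_ploidy:                # decrease class
        a, b = 0, expected_ploidy - 1
    elif called == expected_ploidy:             # equal class
        a, b = expected_ploidy, expected_ploidy
    else:                                       # increase class
        a, b = expected_ploidy + 1, max_ploidy
    return called, phred(post_u[called]), phred(sum(post_u[a:b + 1]))
```

Because the type score sums the posterior over every ploidy in the same class, it is never lower than the ploidy score at the same position.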

A segment is defined as a maximal run of positions with the same called ploidy. For a segment from position l to position r, the ploidy score π_{l, r} is taken to be the mean of the ploidy scores of the constituent positions:

π_{l, r} = \frac{\sum_{u = l}^{r} π_{u}}{r - l + 1}

Similarly, the CNV type score of the segment, δ_{l, r}, is the mean of the CNV type scores of the constituent positions:

δ_{l, r} = \frac{\sum_{u = l}^{r} δ_{u}}{r - l + 1}
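
A short sketch of the segment-level averaging follows. It assumes the per-position called ploidies and scores are already available as parallel lists; the function name and the tuple layout of the result are choices made for the example.

```python
from itertools import groupby


def segment_scores(called, ploidy_scores, type_scores):
    """Group positions into maximal runs of equal called ploidy and average scores.

    Returns a list of (l, r, ploidy, mean ploidy score, mean type score),
    with l and r as inclusive 0-based position indices.
    """
    segments = []
    left = 0
    for ploidy, run in groupby(called):
        right = left + len(list(run)) - 1
        width = right - left + 1
        segments.append((
            left,
            right,
            ploidy,
            sum(ploidy_scores[left:right + 1]) / width,
            sum(type_scores[left:right + 1]) / width,
        ))
        left = right + 1
    return segments
```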

Alternative scoring methods:

Alternative scores for a segment can be computed from the likelihoods of partial paths. For example, the probability that the true path is in state q at every position from l to r can be computed as follows:

\Pr(s_{l} = s_{l + 1} = \cdots = s_{r} = q | P, T, E, D) = \frac{\sum_{σ \in S_{s_{l} = s_{l + 1} = \cdots = s_{r} = q}} \Pr(D, σ | P, T, E)}{\sum_{σ \in S} \Pr(D, σ | P, T, E)}

Another statistic that may be relevant to the confidence in a segment boundary is the probability that position u is in state q but position u-1 is not (or, similarly, position u+1):

\Pr(s_{u} = q, s_{u - 1} \neq q | P, T, E, D) = \frac{\sum_{σ \in S_{s_{u} = q, s_{u - 1} \neq q}} \Pr(D, σ | P, T, E)}{\sum_{σ \in S} \Pr(D, σ | P, T, E)}

Finally, alternatives to the DEI score defined above can be computed; for example, the probability of being in a state with ploidy greater than 2 at every position from l to r is:

\Pr(s_{i : l \leq i \leq r} > 2 | P, T, E, D) = \frac{\sum_{σ \in S_{s_{i : l \leq i \leq r} > 2}} \Pr(D, σ | P, T, E)}{\sum_{σ \in S} \Pr(D, σ | P, T, E)}

As noted above, all of these sums over paths can be computed efficiently via the Forward-Backward algorithm.

The use of HMMs is well known in the art and is discussed, for example, in Rabiner, L. R., A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 1989, 77(2):257-286.

Exemplary implementation systems for CNV calling

Computer system

Exemplary computer systems that can be used according to embodiments of the present disclosure can execute software, and the results can be presented to a user on a monitor or other display device. In some embodiments, an exemplary computer system configured to estimate copy number variation in a target sequence of a sample can present results to the user as a graphical user interface (GUI) on a display device such as a computer monitor. Fig. 3 illustrates an example of the architecture of a computer system 400 configured to implement the estimation of copy number variation of the present disclosure. As shown in Fig. 3, computer system 400 can include one or more processors 402 (such as a CPU). The processor 402 is coupled to a communication infrastructure 406 (for example, a communication bus, crossbar, or network). Computer system 400 can include a display interface 422 that forwards graphics, text, and other data from the communication infrastructure 406 (or from a frame buffer, not shown) for display on a display unit 424.

Computer system 400 can also include a main memory 404, such as random access memory (RAM), and a secondary memory 408. The secondary memory 408 can include, for example, a hard disk drive (HDD) 410 and/or a removable storage drive 412, which can represent a floppy disk drive, a magnetic tape drive, an optical disk drive, or the like. The removable storage drive 412 reads from and/or writes to a removable storage unit 416. The removable storage unit 416 can be a floppy disk, magnetic tape, optical disk, or the like. It will be understood that the removable storage unit 416 can include a computer-readable storage medium having computer software and/or data stored thereon.

In alternative embodiments, the secondary memory 408 can include other similar devices that allow computer programs, computer logic, or other instructions to be loaded into computer system 400. The secondary memory 408 can include a removable storage unit 418 and a corresponding interface 514. Examples of such removable storage units include, but are not limited to, USB or flash drives, which allow software and data to be transferred from the removable storage unit 418 to computer system 400.

Computer system 400 can also include a communications interface 420. The communications interface 420 allows software and data to be transferred between computer system 400 and external devices. Examples of the communications interface 420 include a modem, an Ethernet card, a wireless network card, a Personal Computer Memory Card International Association (PCMCIA) slot and card, and the like. Software and data transferred via the communications interface 420 can be in the form of signals, which can be electronic, electromagnetic, optical, or the like, and which are capable of being received by the communications interface 420. These signals can be provided to the communications interface 420 via a communications path (for example, a channel), which can be implemented using wire, cable, fiber optics, a telephone line, a cellular link, a radio frequency (RF) link, and other communication channels.

In this document, the terms "computer program medium" and "computer-readable storage medium" refer to non-transitory media such as the main memory 404, the removable storage drive 412, and a hard disk installed in the hard disk drive 410. These computer program products provide software or other logic to computer system 400. Computer programs (also referred to as computer control logic) are stored in the main memory 404 and/or the secondary memory 408. Computer programs or other software logic can also be received via the communications interface 420. Such computer programs or logic, when executed by a processor, enable computer system 400 to perform the features of the methods discussed herein. For example, the main memory 404, the secondary memory 408, or the removable storage units 416 or 418 can be encoded with computer program code (instructions) for performing operations corresponding to the processes shown in Fig. 3.

In embodiments implemented using software logic, the software instructions can be stored in a computer program product and loaded into computer system 400 using the removable storage drive 412, the hard disk drive 410, or the communications interface 420. In other words, the computer program product, which can be a computer-readable storage medium, can have instructions tangibly embodied thereon. The software instructions, when executed by the processor 402, cause the processor 402 to perform the functions (operations) of the methods described herein. In another embodiment, the methods are implemented primarily in hardware, using, for example, hardware components such as a digital signal processor comprising application-specific integrated circuits (ASICs). In yet another embodiment, the methods are implemented using a combination of hardware and software.

Exemplary system for CNV calling according to embodiments of the present disclosure. Fig. 1 is a block diagram illustrating a system for calling variations in a sample polynucleotide sequence according to one exemplary embodiment. In this embodiment, the system can include a computer cluster 10 of one or more computing devices, such as computers 12, and a data repository 14. The computers 12 can be connected to the data repository 14 via a high-speed local area network (LAN) 16. At least some of the computers 12 can execute instances of a CNV caller 18. (In some embodiments, a CNV caller such as CNV caller 18 can be included as part of assembly pipeline logic that is configured and operable to assemble raw reads into mapped and sequenced genomes that include detected variations from a reference genome; examples of such embodiments are described in U.S. Application No. 12/770,089, filed April 29, 2010, which is incorporated herein by reference in its entirety as if fully set forth herein.) The CNV caller 18 can include HMM model logic 20, coverage calculation logic 22, GC correction logic 34, ploidy-aware correction logic 36, and population-based no-calling logic 38.

The data repository 14 can store several databases, including one or more databases that store a reference polynucleotide sequence 24, mated reads 26 obtained by sequencing a sample polynucleotide sequence using a biochemical process, and mapped mated reads 28 generated from the mated reads 26.

The reference polynucleotide sequence 24 (hereinafter simply "the reference") refers to a known nucleotide sequence of a reference organism (for example, a known genome). This includes references comprising sequences with known variations at one or more locations in the genome. A polynucleotide molecule is an organic polymer molecule composed of nucleotide monomers covalently bonded in a chain. Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are examples of polynucleotides with distinct biological functions. The genome of an organism (such as a human) is the entirety (or substantial entirety) of the organism's hereditary information, encoded as DNA or RNA. A haploid genome contains one copy of each hereditary unit of the organism. In diploid organisms such as mammals, the genome is a series of complementary polynucleotides comprising two copies of the majority of the hereditary information, organized as sets of chromosomes having discrete hereditary units, or alleles. Each copy of an allele is provided at a specific position on an individual chromosome, and the genotype for each allele in the genome comprises the pair of alleles present at the specific positions on homologous chromosomes, which determines a particular characteristic or trait. Where the genome comprises two identical copies of an allele, it is homozygous for that allele; where the genome comprises two different alleles, it is heterozygous for that locus. The DNA itself is organized as two strands of complementary polynucleotides.

The reference 24 can be an entire genome sequence, a portion of a reference genome, a consensus sequence of many reference organisms, a compilation sequence based on different components of different organisms, or any other suitable sequence. The reference 24 can also include information about variations of the reference known to be found in a population of organisms.

The mated reads 26 can be obtained during a sequencing process performed on polynucleotide sequences from a biological sample of an organism, for example, nucleic acid sequences from a gene, genomic DNA, RNA, or a fragment thereof to be analyzed. The mated reads 26 can be obtained from a sample comprising an entire genome, such as an entire mammalian genome and, more specifically, an entire human genome. In another embodiment, the mated reads 26 can be specific fragments from a complete genome. In one embodiment, the mated reads 26 are obtained by sequencing amplified nucleic acid constructs, such as amplimers produced using polymerase chain reaction (PCR) or rolling circle replication. Examples of amplimers that can be used are described, for example, in U.S. Patent Publication Nos. 20090111705, 20090111706, and 20090075343, which are incorporated herein by reference in their entirety.

The mapped mated reads 28 are the mated reads 26 that have been mapped to locations in the reference 24. Exemplary mapping methods are described in the following patent applications, the entire contents of each of which are incorporated herein by reference: U.S. Patent Application No. 12/698,965, filed February 2, 2010; U.S. Patent Application No. 12/698,986, filed February 2, 2010; and U.S. Patent Application No. 12/698,994, filed February 2, 2010.

The copy number variation (CNV) caller 18 generates and scores sequences for the purpose of identifying and calling copy number variations or differences detected in the sequence of the mapped mated reads 28 relative to the reference 24.

The CNV caller 18 can output a CNV calls file 32, list, or other data structure containing the identified variations, each of which describes the manner in which a portion of the sequence of the mapped mated reads 28 is observed to differ from the reference 24 at or near a specific location.

The computer cluster 10 can be configured such that instances of the CNV caller 18 executing on different computers 12 operate in parallel on different portions of the reference 24 and the mapped mated reads 26. A job scheduler 30 is responsible for assigning jobs or work packets to the different computers 12 in the computer cluster 10.

The computers 12 can include typical hardware components (not shown), including one or more processors, input devices (such as a keyboard, a pointing device, and the like), and output devices (such as a display device and the like). One example of a computer 12 is the computer system 400 shown in Fig. 3 and/or computing device 2500. The computers 12 can include computer-readable/writable media, for example, memory and storage devices (such as flash memory, a hard drive, an optical disk drive, a magnetic disk drive, and the like) containing computer instructions that, when executed by a processor, implement the functions disclosed herein. The computers 12 can also include computer-writable media for implementing the data repository 14 and for storing the CNV calls file 32. The computers 12 can also include wired or wireless network communication interfaces for communication.

Data generation

In some embodiments, a sequencer (for example, a sequencer as shown in Figs. 4A and 4B) can be used to generate the mated reads 26, which are obtained from sample polynucleotides of an organism to be analyzed. In one embodiment, the sequencer provides discrete but related sets of data, such that the contents of the mated reads 26 can include predicted spatial relationships and/or separation variations. The relationships can be determined based on existing knowledge about the biochemical process used to generate the mated reads 26 (that is, based on the sequences that would be expected to be obtained if the biochemical process were applied to a sample), on empirical estimates based on preliminary analysis of the sequence data of the mated reads 26 or subsets thereof, on estimation by experts, or on other suitable techniques.

Many biochemical processes can be used by the sequencer to generate the mated reads 26 for use with the CNV calling methods of the present invention. These include, but are not limited to: hybridization methods as disclosed in U.S. Patent Nos. 6,864,052, 6,309,824, and 6,401,267; sequencing-by-synthesis methods as disclosed in U.S. Patent Nos. 6,210,891, 6,828,100, 6,833,246, 6,911,345, and 7,329,496, in Margulies, et al. (2005), Nature 437:376-380, and in Ronaghi, et al. (1996), Anal. Biochem. 242:84-89; ligation-based methods as disclosed in U.S. Patent No. 6,306,597, WO2006073504, and WO2007120208; nanopore sequencing technologies as disclosed in U.S. Patent Nos. 5,795,782, 6,015,714, 6,627,067, 7,238,485, and 7,258,838 and in U.S. Patent Application 200600317120090029477; and nanochannel sequencing technologies as disclosed in U.S. Patent Application Publication No. 20090111115, all of which are incorporated herein by reference in their entirety. In one specific implementation, a combinatorial probe anchor ligation (cPAL) process can be used in some embodiments (see U.S. Patent Application Publication Nos. 20080234136 and 20070099208, which are incorporated herein by reference in their entirety).

Once the primary mapped mated reads have been generated, the information is processed according to the CNV calling method of the present disclosure, as shown in Fig. 2. Fig. 2 depicts an exemplary method for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample: obtaining mapped reads to measure the sequence coverage of the sample 202; correcting sequence coverage bias, where the bias correction includes performing ploidy-aware baseline correction 204; and, after performing population-based identification of no-call/low-confidence regions 206 and performing HMM segmentation, scoring, and output 208, estimating a total copy number value and region-specific copy number values for a plurality of genomic regions 210.

An example of the output of the CNV calling process for a diploid/non-tumor/aneuploid sample (for example, as provided in the CNV calls file 32 of Fig. 1), generated according to an exemplary embodiment of the present disclosure, is shown in Table 2.

Table 2

In Table 2, the column "chromosome" identifies the chromosome number; the columns "begin" and "end" identify the starting and ending loci of a given region; the column "ploidy" indicates the ploidy (that is, the copy number) of the region; the column "ploidy score" indicates the score of the given region (where the score is an algorithm-based value expressed in decibels, dB); the column "type" indicates the type of ploidy observed for the region (for example, "=" indicates the normal ploidy of 2, "+" indicates a ploidy higher than normal, "-" indicates a ploidy lower than normal, "hypervariable" indicates that the ploidy cannot be called, and "invariant" indicates that the ploidy differs from normal but is observed to be the same as in the baseline, the baseline being a set of at least several reference genomes); and the column "type score" indicates the confidence score for the type called in the "type" column of the same row. For example, the second row of Table 2 indicates that the region starting at locus 5100001 on chromosome 1 and ending at locus 5800000 has a ploidy of 3 with a score of 15 dB, and a type of "increase" with a type score of 40.

An example of the output of the CNV calling process for a non-diploid/tumor/aneuploid sample (for example, as provided in the CNV calls file 32 of Fig. 1), generated according to an exemplary embodiment of the present disclosure, is shown in Table 3.

Table 3

In Table 3, the column "chromosome" identifies the chromosome number; the columns "begin" and "end" identify the starting and ending loci of a given region; the column "level" indicates the coverage level of the region as output by the HMM model (where, because of the aneuploidy and other characteristics of tumor samples, the coverage level is computed without assuming a normal ploidy of 2); and the column "level score" indicates the confidence score for the level called in the "level" column of the same row. For example, the second row of Table 3 indicates that the region starting at locus 10001 on chromosome 2 and ending at locus 243189373 has a coverage level of 1.05 with a score of 38.

Graphical techniques for interpreting absolute copy number

In an exemplary embodiment, bias-corrected window-averaged coverage and allele-specific read counts are input into OptSeg logic, which determines a segmentation into regions of uniform coverage and lesser allele fraction (LAF). The effect of the segmentation is conceptually similar to circular binary segmentation, without a fixed set of model states, but it provides a globally optimal solution. Overly short segments are suppressed to reduce noise.

The product of LAF and total coverage provides an estimate of lesser-allele coverage. The two-dimensional space defined by total coverage and lesser-allele coverage is useful for a graphical representation of the tumor, in which most states are expected to lie at the vertices of a regular grid. Density in this space can be tabulated and kernel-smoothed by computer logic prior to visualization. Sufficiently high peaks in the smoothed distribution determine an initial set of states.

Rule-based logic finds a constrained (one-tumor-component) model that attempts to capture as much of the density of the initial states as possible, subject to a limit on the maximum average ploidy. The support of the data for a given model can be assessed visually, and various properties of the model can be observed directly from the plot.

The constrained model provides an estimate of the stromal contamination fraction, and peaks that match the model receive integer interpretations in terms of total and lesser copy number. Initial states that are not accounted for by the model are interpreted as the result of tumor heterogeneity; they can be included in the final model and receive non-integer total and/or lesser copy numbers.

The final model can be input to logic that implements a separate model-based segmentation process (such as an HMM) in order to annotate segments with states from the final set. Because the modeling process and the final segmentation are independent, computer logic can generate a visual representation of the tumor, and, if there are problems, an alternative model can be substituted for the automatically derived one.

LAF estimation

The lesser allele fraction (also called the "lesser copy ratio") is the fraction of copies of a given region in the sample that carry the less abundant allele. LAF can be estimated from reads in the tumor sample at loci that are heterozygous in the matched normal sample. The estimate allows for uncertainty of phase (for example, either allele at a given locus may be the "lesser allele", independent of the read counts) in order to avoid bias. Binomial sampling error is handled with a beta-binomial model.

Conceptual modeling

Exemplary standard one-component-plus-stromal-contamination model

In certain embodiments, the following assumptions are made:

A sample cell is a tumor cell with probability t, or a normal cell with probability 1-t.

All tumor cells have the same genome.

All normal cells are diploid.

For a given region i of the genome, let:

a_i = major allele copy number per tumor cell in region i

b_i = lesser allele copy number per tumor cell in region i (a_i ≥ b_i)

c_i = average copy number in region i

l_i = lesser allele fraction in region i

Then:

c_{i} = 2 (1 - t) + (a_{i} + b_{i}) t

l_{i} = (1 - t + b_{i} t) / c_{i}

The copy number states (c_i, l_i) allowed in this model lie on a square grid in the plot (see Fig. 6).
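
The two model equations can be evaluated directly to list the states a given tumor fraction would produce. The sketch below is illustrative; the function names, the `max_major` limit, and returning NaN for the lesser allele fraction when the average copy number is zero are assumptions.

```python
def expected_state(a, b, t):
    """Return (c, l) for major copy number a, lesser copy number b, tumor fraction t.

    c = 2(1 - t) + (a + b) t
    l = (1 - t + b t) / c
    """
    if b > a:
        raise ValueError("lesser copy number b must not exceed major copy number a")
    if not 0.0 <= t <= 1.0:
        raise ValueError("tumor fraction t must lie in [0, 1]")
    c = 2.0 * (1.0 - t) + (a + b) * t
    if c == 0.0:
        # Assumption: with no copies present (pure tumor, a = b = 0), LAF is undefined.
        return 0.0, float("nan")
    return c, (1.0 - t + b * t) / c


def state_grid(t, max_major=3):
    """Map each allowed (a, b) with b <= a <= max_major to its (c, l)."""
    return {
        (a, b): expected_state(a, b, t)
        for a in range(max_major + 1)
        for b in range(a + 1)
    }
```

For example, with t = 0.5 a tumor state (a, b) = (2, 1) gives c = 2.5 and l = 0.4, while a pure tumor (t = 1) in state (1, 0) gives c = 1 and l = 0, the loss-of-heterozygosity case.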

Multi-component models

The one-component tumor model of possible states (Fig. 6) can readily be extended to more complex models involving two or more components (subclones) in the tumor portion of the sample. However, even a two-component model leads to a great expansion of the possible states (see Fig. 7). The overall model (the relative fractions of each component, the coverage per full copy, and so on) and the interpretation of individual states are typically underdetermined (see Figs. 7-9).

Illustration of results

The robustness of the process to varying degrees of stromal contamination is seen in an artificial (in silico) titration of stromal content using matched tumor and normal cell lines. Figs. 10A-10C illustrate three data sets: the first contains a pure tumor cell line, the second contains 50% tumor and 50% matched normal cells, and the third contains 25% tumor and 75% normal cells.

Although all regions of the genome have substantial lesser-allele content in the contaminated samples, the lowest states can be interpreted as loss-of-heterozygosity (LOH) states.

A tumor with a high average copy number and highly variable copy number can nevertheless have a convincing fit, in which regions where the tumor is heterogeneous are highlighted (as shown in Fig. 11).

Some samples are difficult: they are hard to interpret even manually. At the least, the visualization makes a poor model fit easy to recognize (as shown in Fig. 12).

In certain embodiments, regions that are heterogeneous within the tumor are not resolved into combinations of homogeneous components, but are reported in terms of their average appearance. In certain other embodiments, in which the sample peaks clearly define a grid, model selection has some degeneracy, such that a state with major and lesser copy numbers (a, b) cannot be distinguished from a state (a+n, b+n) with less stromal contamination. In those embodiments, the model with the lowest possible ploidy is preferred.

For the exemplary fragmentation procedure based on model that CNV reads

In exemplary embodiment, for the method for the genome area copy number of working sample target polynucleotide sequence detection site, described method comprises: the take off data obtaining described sample sequence coverage; The sequence coverage deviation of correcting measuring data, wherein, sequence coverage offset correction comprises carries out the relevant baseline correction of ploidy; Carry out hidden Markov model (HMM) fragmentation procedure, scoring and output; And estimate total copy numerical value and the regiospecificity copy numerical value of multiple genome area.

For example, in one embodiment, a plurality of HMM states correspond to respective copy numbers, the sample is a tumor sample, and performing HMM segmentation, scoring, and output comprises: generating an initial model for the HMM using input data that represent the initial states determined by the model for interpreting the absolute copy number of complex tumors (as described above); optimizing the initial model by modifying the number of states in the model and optimizing the parameters of each state; and modifying the number of states in the model by sequentially adding states to the model, then sequentially removing states, or a combination thereof.
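The control flow of adding and then removing states can be sketched as follows. The sketch is an illustration under simplifying assumptions: `log_likelihood` is a toy stand-in for the HMM likelihood (each data point is scored against its nearest state mean), a candidate state is placed midway between two adjacent states, and state parameters are not re-optimized after each change. All names and thresholds are hypothetical.

```python
def log_likelihood(data, means, sd=1.0):
    """Toy stand-in for the HMM likelihood: score each point by its nearest state mean."""
    return sum(-0.5 * (min(abs(x - m) for m in means) / sd) ** 2 for x in data)

def refine_states(data, means, add_threshold, remove_threshold, score=log_likelihood):
    """Add states between adjacent pairs, then remove states that contribute little."""
    means = sorted(means)
    # Addition pass: try a new state midway between each adjacent pair of states.
    added = True
    while added:
        added = False
        for lo, hi in zip(means, means[1:]):
            candidate = sorted(means + [(lo + hi) / 2])
            if score(data, candidate) - score(data, means) > add_threshold:
                means, added = candidate, True
                break  # rescan the updated set of pairs
    # Removal pass: drop any state whose removal costs little likelihood.
    removed = True
    while removed and len(means) > 1:
        removed = False
        for m in means:
            candidate = [x for x in means if x != m]
            if score(data, means) - score(data, candidate) <= remove_threshold:
                means, removed = candidate, True
                break
    return means

# Hypothetical coverage values clustered at three levels; the initial model
# misses the middle level and contains one unused level.
data = [10] * 5 + [20] * 5 + [30] * 5
final = refine_states(data, [10, 30, 50], add_threshold=5.0, remove_threshold=5.0)
print(final)
```

In this example the addition pass inserts the missing middle state and the removal pass discards the unused state, which mirrors the add-then-remove order described above.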

In one embodiment, the method further comprises annotating the plurality of genomic regions with total copy number values and region-specific copy number values, based on the initial states determined by the model for interpreting the absolute copy number of complex tumors.

Exemplary sequencing machines and computing devices

In some embodiments, sequencing of DNA samples (e.g., samples representing whole human genomes) is performed by a sequencing system. Two examples of sequencing systems are shown in Figs. 4A and 4B.

Figs. 4A and 4B are block diagrams of an exemplary sequencing system 2490 that, according to exemplary embodiments of the present invention, is configured to implement the techniques and/or methods for interpreting the absolute copy number of complex tumors and for CNV calling. Sequencing system 2490 can include or be associated with multiple subsystems, for example, one or more sequencing machines such as sequencer 2491, one or more computing systems such as computing system 2497, and one or more data repositories such as data repository 2495. In the embodiment shown in Fig. 4A, the subsystems of system 2490 are communicatively connected over one or more networks 2493, which may include packet-switching or other types of network infrastructure devices (e.g., routers, switches, etc.) that facilitate the exchange of information between remote systems. In the embodiment shown in Fig. 4B, sequencing system 2490 is a sequencing device in which the subsystems (e.g., sequencer 2491, computing system 2497, and possibly data repository 2495) are components that are communicatively and/or operatively coupled and integrated within the sequencing device.

In some operational contexts, data repository 2495 and/or computing system 2497 of the embodiments shown in Figs. 4A and 4B may be configured within a cloud computing environment 2496. In a cloud computing environment, the storage devices comprising a data repository and/or the computing devices comprising a computing system may be allocated and instantiated for use as a utility and on demand; thus, the cloud computing environment provides as services the infrastructure (e.g., physical and virtual machines, raw/block storage, firewalls, load balancers, aggregators, networks, storage clusters, etc.), the platforms (e.g., a computing device and/or a solution stack that may include an operating system, a programming language execution environment, a database server, a web server, an application server, etc.), and the software (e.g., applications, application programming interfaces or APIs, etc.) related to and/or necessary for performing any storage and computing tasks.

It is noted that, in various embodiments, the techniques described herein can be performed by various systems and devices that include some or all of the above subsystems and components (e.g., sequencers, computing systems, and data repositories) in various configurations and form factors; thus, the exemplary embodiments and configurations shown in Figs. 4A and 4B are to be regarded as illustrative and not restrictive.

Sequencer 2491 is configured and operable to receive target nucleic acids 2492 derived from fragments of a biological sample and to sequence the target nucleic acids. Any suitable machine that can perform sequencing may be used, where such a machine may use various sequencing techniques including, without limitation, sequencing by hybridization, sequencing by ligation, sequencing by synthesis, single-molecule sequencing, optical sequence detection, electromagnetic sequence detection, voltage-change sequence detection, and any other now-known or later-developed technique suitable for generating sequencing reads from DNA. In various embodiments, a sequencer can sequence the target nucleic acids and generate sequencing reads that may or may not include gaps and that may or may not be mate-pair (or paired-end) reads. As shown in Figs. 4A and 4B, sequencer 2491 sequences target nucleic acids 2492 and obtains sequencing reads 2494, which are transmitted for (temporary and/or persistent) storage to one or more data repositories 2495 and/or for processing by one or more computing systems 2497.

Data repository 2495 may be implemented on one or more storage devices (e.g., hard disk drives, optical disks, solid-state drives, etc.) that may be configured as an array of disks (e.g., a SCSI array), a storage cluster, or any other suitable storage device organization. The storage devices of a data repository can be configured as internal/integral components of system 2490 or as external components attachable to system 2490 (e.g., external hard drives or disk arrays) (as shown in Fig. 4B), and/or may be communicatively interconnected in a suitable manner such as a grid, a storage cluster, a storage area network (SAN), and/or network-attached storage (NAS) (as shown in Fig. 4A). In various embodiments and implementations, a data repository may be implemented on the storage devices as one or more file systems that store information as files, as one or more databases that store information in data records, and/or as any other suitable data storage organization.

Computing system 2497 may include one or more computing devices that comprise general-purpose processors (e.g., central processing units, or CPUs), memory, and computer logic 2499 which, along with configuration data and/or operating system (OS) software, can perform some or all of the techniques and methods described herein, and/or can control the operation of sequencer 2491. For example, any of the data processing and data analysis methods described herein can be totally or partially performed by a computing device that includes a processor configured to execute logic 2499 for performing the various steps of the methods. Further, although method steps may be presented as numbered steps, it is understood that the steps of the methods described herein can be performed at the same time (e.g., in parallel by a cluster of computing devices) or in a different order. The functionality of computer logic 2499 may be implemented as a single integrated module (e.g., integrated logic) or may be combined in two or more software modules that may provide some additional functionality.

In some embodiments, computer system 2497 may be a single computing device. In other embodiments, computer system 2497 may comprise multiple computing devices that are communicatively and/or operatively interconnected in a grid, a cluster, or a cloud computing environment. Such multiple computing devices may be configured in different form factors such as computing nodes, blades, or any other suitable hardware configuration. For these reasons, computer system 2497 in Figs. 4A and 4B is to be regarded as illustrative and not restrictive.

Fig. 5 is a block diagram of an exemplary computing device 2500 that can be configured, as part of a sequencer and/or a computer system, to execute instructions for performing the techniques and/or methods described herein.

In Fig. 5, computing device 2500 comprises several components that are interconnected directly or indirectly via one or more system buses such as bus 2575. Such components may include, but are not limited to, keyboard 2578, persistent storage device(s) 2579 (e.g., hard disks, solid-state disks, optical disks, etc.), and display adapter 2582, to which one or more display devices (e.g., LCD monitors, flat-screen monitors, plasma screens, etc.) can be connected. Peripherals and input/output (I/O) devices, which couple to I/O controller 2571, can be connected to computing device 2500 by any number of means known in the art, including but not limited to one or more serial ports, one or more parallel ports, and one or more universal serial buses (USBs). External interface(s) 2581 (which may include a network interface card and/or serial ports) can be used to connect computing device 2500 to a network (e.g., the Internet or a local area network (LAN)). External interface(s) 2581 may also include a number of input interfaces that can receive information from various external devices such as a sequencer or any component thereof. The interconnection via system bus 2575 allows one or more processors (e.g., CPUs) 2573 to communicate with each connected component, to execute (and control the execution of) instructions from system memory 2572 and/or from storage device(s) 2579, and to exchange information between the various components. System memory 2572 and/or storage device(s) 2579 may be embodied as one or more computer-readable non-transitory storage media that store the sequences of instructions executed by processor(s) 2573, as well as other data. Such computer-readable non-transitory storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electromagnetic media (e.g., hard disk drives, solid-state drives, thumb drives, floppy disks, etc.), optical media such as compact disks (CDs) or digital versatile disks (DVDs), flash memory, and the like. Various data values and other structured or unstructured information can be output from one component or subsystem to another component or subsystem, can be presented to a user via display adapter 2582 and a suitable display device, can be sent through external interface(s) 2581 over a network to a remote device or a remote data repository, or can be (temporarily and/or permanently) stored on storage device(s) 2579.

It should be understood that any of the methods and functionality performed by computing device 2500 can be implemented in the form of logic, using hardware and/or computer software, in a modular or integrated manner.

While this invention may be satisfied by embodiments in many different forms, as described in detail in connection with preferred embodiments of the invention, it is understood that the present disclosure is to be considered as exemplary of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated and described herein. Numerous variations may be made by persons skilled in the art without departing from the spirit of the invention. The scope of the invention is to be measured by the appended claims and their equivalents. The abstract and the title are not to be construed as limiting the scope of the present invention, as their purpose is to enable the appropriate authorities, as well as the general public, to quickly determine the general nature of the invention.

Claims

1. A method for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample, the method comprising:

obtaining measurement data for the sequence coverage of the sample, using data generated from mate-pair mappings;

correcting the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction; and

estimating, based at least on the corrected measurement data, a total copy number value and a region-specific copy number value for each of a plurality of genomic regions;

wherein the method is performed by one or more computing devices.

2. The method of claim 1, wherein the method further comprises performing hidden Markov model (HMM) segmentation, scoring, and output based on the corrected measurement data.

3. The method of claim 1, wherein the method further comprises performing population-based no-calling and identification of low-confidence regions.

4. The method of claim 1, wherein the method further comprises normalizing the measurement data for sequence coverage by comparison with sequence data from a baseline sample.

5. The method of claim 1, wherein obtaining the measurement data for sequence coverage comprises measuring the sequence coverage depth at each position of the genome.

6. The method of claim 1, wherein correcting the measurement data for sequence coverage bias comprises calculating window-averaged coverage.

7. The method of claim 1, wherein correcting the measurement data for sequence coverage bias comprises performing adjustments to account for GC bias in the library construction and sequencing processes.

8. The method of claim 1, wherein correcting the measurement data for sequence coverage bias comprises performing adjustments based on additional weighting factors associated with individual mappings to compensate for bias.

9. the method for claim 1, wherein described sequence coverage c _idetermined by following formula:

c_{t} = \underset{m &Element; M_{t}}{Σ} P (DNB | R, m) / (&Proportional; + \underset{n &Element; N (m)}{Σ} P ({DNB}_{m} | R, n)) .
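By way of illustration only, the coverage sum of claim 9 can be read as a sum of fractional mapping weights. The sketch below rests on assumptions that the claim does not state: each mapping that overlaps position i contributes its own likelihood divided by the summed likelihoods of all candidate mappings of the same read (including itself), plus a constant `alpha` that is assumed to absorb the possibility that the read does not originate from the reference. The function names and the example likelihoods are hypothetical.

```python
def mapping_weight(p_m, p_all, alpha):
    """Fraction of one read attributed to mapping m.

    p_m    -- likelihood of the read given the reference and mapping m
    p_all  -- likelihoods of every candidate mapping of the same read (assumed to include m)
    alpha  -- constant added to the denominator (assumed non-reference term)
    """
    return p_m / (alpha + sum(p_all))

def coverage_at(position_mappings, alpha=1e-6):
    """Sum the fractional weights of all mappings that overlap one position."""
    return sum(mapping_weight(p_m, p_all, alpha) for p_m, p_all in position_mappings)

# Hypothetical position covered by a uniquely mapped read and by one of two
# equally likely placements of a repeat-derived read.
mappings = [
    (0.9, [0.9]),        # unique mapping: weight 1 when alpha is 0
    (0.4, [0.4, 0.4]),   # ambiguous mapping: weight 0.5 when alpha is 0
]
c_i = coverage_at(mappings, alpha=0.0)
print(c_i)
```

With `alpha` set to zero the uniquely mapped read counts fully and the ambiguous read counts as half; a positive `alpha` reduces every weight slightly.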

10. The method of claim 1, wherein obtaining the measurement data for sequence coverage comprises:

a) determining reads that represent the sequences of a plurality of approximately random fragments of the genome in the sample, wherein the plurality provides a sampling of the genome whereby, on average, a base position of the genome is sampled one or more times;

b) obtaining mapping data by mapping the reads to a reference genome, or by mapping the reads to an assembled sequence; and

c) obtaining coverage data by measuring the intensity of the sampled sequences along the reference genome or along the assembled sequence,

wherein the measurement data comprise the mapping data and the coverage data.

11. The method of claim 10, wherein determining the reads further comprises the following steps:

a) providing a plurality of amplicons, wherein:

i) each amplicon comprises multiple copies of a fragment of a target nucleic acid,

ii) each amplicon comprises a plurality of adaptors interspersed at predetermined sites within the fragment, each adaptor comprising at least one anchor probe hybridization site, and

iii) the plurality of amplicons comprises fragments that substantially cover the target nucleic acid;

b) providing a random array of the amplicons fixed to a surface at a density such that at least a majority of the amplicons are optically resolvable;

c) hybridizing one or more anchor probes to the random array;

d) hybridizing one or more sequencing probes to the random array, such that perfectly matched duplexes form between the one or more sequencing probes and fragments of the target nucleic acid;

e) ligating the anchor probes to the sequencing probes;

f) identifying at least one nucleotide adjacent to at least one interspersed adaptor; and

g) repeating steps (c)-(f) until a nucleotide sequence of the target nucleic acid is identified;

wherein steps (a)-(g) are performed by a sequencer.

12. The method of claim 2, wherein performing HMM segmentation further comprises generating an initial model that estimates the number of states and their means based on the overall coverage distribution.

13. The method of claim 12, wherein performing HMM segmentation comprises optimizing the initial model by making one or more modifications to the number of states in the model, and optimizing the parameters of each state.

14. The method of claim 12, wherein the corrected coverage at position i is:

$$c_i' = \sum_{m \in M_i} q_m \cdot \frac{P(DNB \mid R, m)}{\alpha + \sum_{n \in N(m)} P(DNB_m \mid R, n)}$$

15. The method of claim 4, wherein normalizing the measurement data comprises determining the normalized corrected coverage using the following equation:

$$\bar{c}_i'' = \bar{c}_i' \cdot \frac{\tilde{d}}{d_i'} \cdot \frac{p_i}{2}$$
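By way of illustration only, the normalization of claim 15 can be applied window by window. The sketch assumes an interpretation that the claim itself does not spell out: the symbol d with a tilde is a typical (for example, median) corrected coverage of the baseline sample, d'_i is the baseline coverage in window i, and p_i is the ploidy of the baseline sample in window i. The function name and all numeric values are hypothetical.

```python
def normalize_coverage(sample, baseline, baseline_ploidy, baseline_typical):
    """Apply c'' = c' * (d_typical / d_i) * (p_i / 2) to each window."""
    return [
        c * (baseline_typical / d) * (p / 2)
        for c, d, p in zip(sample, baseline, baseline_ploidy)
    ]

# Three hypothetical windows: diploid in both samples, a gain in the sample,
# and a window that is haploid in both the sample and the baseline.
sample = [40.0, 60.0, 20.0]
baseline = [40.0, 40.0, 20.0]
baseline_ploidy = [2, 2, 1]

normalized = normalize_coverage(sample, baseline, baseline_ploidy, baseline_typical=40.0)
print(normalized)
```

Under this reading the factor p_i / 2 keeps the third window at half the diploid level; without it, dividing by a haploid baseline would make the window appear diploid.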

16. the method for claim 1, it also comprises the collection of illustrative plates utilizing the generation of the estimated value of sequence coverage for the sequenced fragments of described genomic more than one position, and utilizes the degree of confidence observed value of each collection of illustrative plates described each collection of illustrative plates to be partly attributed to each detection position.

17. the method for claim 1, its also comprise carry out HMM calculate with the multiple determining each inspection positions.

18. the method for claim 1, its also comprise carry out HMM calculate with the multiple score determining each inspection positions, the multiple that described multiple Score Lists is shown in described detection position finding is correct degree of confidence.

19. the method for claim 1, it also comprises and carries out HMM and calculate with the CNV type score determining each inspection positions, and the multiple that described CNV type Score Lists is shown in described detection position finding correctly represents the degree of confidence of ploidy, the ploidy of expection or the ploidy of increase reduced in described detection position.

20. The method of claim 2, wherein a plurality of HMM states correspond to respective copy numbers, and wherein, if the sample is a normal sample, performing HMM segmentation, scoring, and output comprises:

for each state having a copy number N greater than 0, initializing the mean of the emission distribution of the HMM to N/2 times the median coverage in a portion of the sample expected to be diploid; and

for the state having copy number 0, initializing the mean of the emission distribution to a positive value smaller than the value used for the state having copy number 1.
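By way of illustration only, the initialization of claim 20 can be written in a few lines. The sketch assumes a fixed maximum copy number and picks the copy-number-0 mean as an arbitrary fraction of the copy-number-1 mean; the claim requires only that this value be positive and smaller than the copy-number-1 value. The function name and the parameter `zero_fraction` are hypothetical.

```python
def initial_emission_means(diploid_median, max_copy_number, zero_fraction=0.1):
    """Initial emission means for states with copy number 0..max_copy_number."""
    # States with N > 0: N/2 times the median coverage of the diploid portion.
    means = [n / 2 * diploid_median for n in range(max_copy_number + 1)]
    # State with N = 0: a positive value below the copy-number-1 mean.
    # The fraction used here is an arbitrary illustrative choice.
    means[0] = zero_fraction * means[1]
    return means

means = initial_emission_means(diploid_median=40.0, max_copy_number=4)
print(means)
```

The diploid state (N = 2) is centered on the diploid median itself, and every other state is spaced in steps of half that median.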

21. The method of claim 2, wherein a plurality of HMM states correspond to respective copy numbers, and wherein, if the sample is a tumor sample, performing HMM segmentation, scoring, and output comprises:

generating an initial model for the HMM by estimating the number of states and the mean of each state based on the coverage distribution;

optimizing the initial model by modifying the number of states in the model and optimizing the parameters of each state; and

modifying the number of states in the model by sequentially adding states to the model, then sequentially removing states, or a combination thereof.

22. The method of claim 21, wherein modifying the initial model comprises:

a) adding a new state between a pair of states if adding the new state increases the likelihood associated with the HMM by more than a first predetermined threshold;

b) repeating step (a) iteratively between every pair of states until no more additions can be made;

c) removing a state from the HMM if removal of the state does not reduce the likelihood by more than a second predetermined threshold; and

d) repeating step (c) iteratively for all states.

23. The method of claim 2, wherein a plurality of HMM states correspond to respective copy numbers, and wherein performing HMM segmentation, scoring, and output comprises initializing, for each state having copy number N, the variance of the emission distribution of the HMM to a constant times the mean of the emission distribution of that state.

24. A computer-readable storage medium comprising instructions tangibly embodied thereon which, when executed by a computer processor, cause the processor to perform the following operations:

obtaining measurement data for the sequence coverage of a biological sample, using data generated from mate-pair mappings;

estimating, based at least on the corrected measurement data, a total copy number value and a region-specific copy number value for each of a plurality of genomic regions.

25. A computer-readable storage medium comprising instructions tangibly embodied thereon which, when executed by a computer processor, cause the processor to perform the following operations:

obtaining measurement data for the sequence coverage of a sample comprising a target sequence;

correcting the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction;

performing hidden Markov model (HMM) segmentation, scoring, and output based on the corrected measurement data;

performing population-based no-calling and identification of low-confidence regions based on the HMM scoring and output; and

estimating a total copy number value and region-specific copy number values for a plurality of regions based on the HMM scoring and output.

26. A system for determining copy number variation of a genomic region at a detection position of a target polynucleotide sequence, the system comprising:

a. a computer processor; and

b. a computer-readable storage medium coupled to the processor, the storage medium having instructions tangibly embodied thereon which, when executed by the processor, cause the processor to perform the following operations:

27. A method for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample, the method comprising:

performing hidden Markov model (HMM) segmentation, scoring, and output based on corrected measurement data; and

estimating a total copy number value and a region-specific copy number value for each of a plurality of genomic regions;

wherein the method is performed by one or more computing devices.

28. The method of claim 27, further comprising:

generating, by computer logic, input data that represent initial states by running a model for interpreting the absolute copy number of complex tumors;

wherein performing HMM segmentation further comprises generating an initial model based on the input data.

29. The method of claim 27, further comprising annotating the plurality of genomic regions with total copy number values and region-specific copy number values, based on state interpretations generated by running a model for interpreting the absolute copy number of complex tumors.

30. A computer-readable storage medium comprising instructions tangibly embodied thereon which, when executed by a computer processor, cause the processor to perform the method of any one of claims 27-29.

31. An apparatus for determining copy number variation of a genomic region at a detection position of a target polynucleotide sequence, the apparatus comprising:

a computer processor; and

a computer-readable storage medium coupled to the processor, the storage medium having instructions tangibly embodied thereon which, when executed by the processor, cause the processor to perform the method of any one of claims 27-29.