Methods for interpreting absolute copy number of complex tumors and for determining the copy number of a genomic region at a detection position of a target sequence in a sample are disclosed. In certain aspects, genomic regions of a target sequence in a sample are sequenced and measurement data for sequence coverage is obtained. Sequence coverage bias is corrected and may be normalized against a baseline sample. Hidden Markov Model (HMM) segmentation, scoring, and output are performed, and in some embodiments population-based no-calling and identification of low-confidence regions may also be performed. A total copy number value and region-specific copy number value for a plurality of regions are then estimated.


Method for determining genome-wide absolute copy number variation in complex tumors

[0001] RELATED APPLICATIONS

[0002] This application is a continuation-in-part of, and claims priority to, U.S. Application No. 13/270,989, filed October 11, 2011, which claims priority to U.S. Provisional Patent Application No. 61/503,327, filed June 30, 2011, and U.S. Provisional Patent Application No. 61/392,567, filed October 13, 2010. This application also claims priority to U.S. Provisional Patent Application No. 61/643,225, filed May 4, 2012. The entire contents of each of these applications are incorporated herein by reference in their entirety, as if fully set forth herein.

[0003] Background of the Invention

[0004] Genomic abnormalities are commonly associated with various genetic diseases, degenerative diseases, and cancer. For example, deletions or gains of gene copies, and deletions or amplifications of genomic segments or specific regions, are common in cancer. Alterations in proto-oncogenes and in tumor suppressor genes, for instance, are each common features of tumorigenesis. The identification and cloning of specific genomic regions associated with cancer and various genetic diseases is therefore valuable both for the study of tumorigenesis and for the development of better diagnostic and prognostic methods.

[0005] Identification of polynucleotides that correspond to copy number alterations in cancerous cells, precancerous cells, or cells of low metastatic potential, relative to normal cells of the same tissue type, provides a basis for diagnostic tools, facilitates drug discovery by providing targets for candidate agents, and further serves to identify therapeutic targets for cancer therapy that are better suited to the type of cancer to be treated.

[0006] In diagnostic genome sequencing, the accuracy required for clinical diagnosis further compounds the computational complexity of analyzing the three billion base pairs of the human genome, such that 60 billion or more sequence data points must be analyzed to produce one accurate genome sequence. Early sequencing methods handled this complexity by generating sequence data from thousands of isolated, very long DNA fragments, thereby preserving the contextual integrity of the sequence information and reducing the redundant testing needed for accurate data. However, owing to the upfront complexity of preparing genomic fragments and the relatively high cost of many individual biochemical tests, this approach, which was used to produce the first complete human genome, cost hundreds of millions of dollars per genome.

[0007] In addition, the presence of two different copies of the genome in each human cell further complicates the contextual information in the genome, such that accurate clinical analysis and diagnosis require the ability to distinguish DNA sequences according to genome copy. The main challenge is therefore to discern sequence differences between two distinct copies of three billion DNA bases that are interspersed with millions of inherited single nucleotide polymorphisms (SNPs), thousands of short insertions and deletions, and hundreds of spontaneous mutations.

[0008] However, high levels of aneuploidy, stromal (normal) contamination, and genomic heterogeneity make it challenging to estimate absolute total and lesser-allele copy numbers from whole-genome sequencing read data of tumor samples. Although some progress has been made in this area, no reliable method yet exists. For example, methods have been developed that help identify copy number variation ("CNV") in complete DNA sequences and that help establish confidence in such identification based on comparing the sequence with a reference sequence or with multiple different copies of the sequence. In these methods, both the identification of copy number and its confirmation rely on different sample series, and the data used in such methods are relatively error-prone and well known to contain certain artifacts.

[0009] Summary of the Invention

[0010] In certain aspects, the present invention provides a method of determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample, the method comprising: obtaining measurement data for the sequence coverage of the sample; correcting the measurement data for sequence coverage bias, wherein correcting the sequence coverage bias comprises performing ploidy-aware baseline correction; and estimating a total copy number value and a region-specific copy number value for each of a plurality of genomic regions. In one embodiment, the method comprises performing hidden Markov model (HMM) segmentation, scoring, and output. In another embodiment, the method comprises performing population-based no-calling and identification of low-confidence regions.

[0011] In certain aspects, the techniques and/or methods described herein provide a series of steps (e.g., a model) for interpreting the absolute copy number of complex tumors. In some embodiments, computer logic is configured to execute a model that processes total and allele-specific sequencing depth (read depth) data to produce readily interpretable graphics for a tumor sample, together with an automated analysis process. The analysis is based on a model that assumes the tumor portion of the sample is homogeneous over most of the genome but allows part of the sample to be affected by tumor heterogeneity. The result data obtained by executing the final model can be input to a separate model-based segmentation process (e.g., an HMM); for example, the result data can be used as the initial input states of the model-based segmentation process, and the state descriptions can be used to annotate the final set of segments. Because model processing is separated from final segmentation, a visualization of the tumor can be presented to the user; if there is a problem, an alternative model can be substituted for the automatically derived model.

[0012] In one aspect, the method further comprises normalizing the sequence coverage by comparison with a baseline sample.
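By way of illustration only, normalization against a baseline sample can be sketched as a per-window coverage ratio. The function name `normalize_to_baseline`, the choice to scale each sample by its own median before taking the ratio, and the use of `None` for windows with no baseline coverage are assumptions made for this example; the disclosure does not specify them.

```python
from statistics import median


def normalize_to_baseline(sample, baseline):
    """Return per-window ratios of sample coverage to baseline coverage.

    Each series is first divided by its own median (an assumed scaling),
    so that a ratio near 1.0 means "same relative coverage as baseline".
    Windows where the baseline has no coverage are reported as None.
    """
    if len(sample) != len(baseline):
        raise ValueError("sample and baseline must have the same length")
    s_med = median(sample)
    b_med = median(baseline)
    if s_med <= 0 or b_med <= 0:
        raise ValueError("median coverage must be positive")
    ratios = []
    for s, b in zip(sample, baseline):
        if b <= 0:
            ratios.append(None)  # cannot normalize without baseline signal
        else:
            ratios.append((s / s_med) / (b / b_med))
    return ratios
```

In this sketch, a window whose ratio is about 2.0 has twice the relative coverage of the baseline, which is the kind of signal a later segmentation step would consume.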

[0013] In one aspect, the method further comprises determining the sequence coverage by measuring the depth of sequence coverage at each position of the sample genome.

[0014] In one aspect, the method further comprises correcting sequence bias by calculating window-averaged coverage.
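As a minimal sketch of window-averaged coverage, the following example averages per-position depth over consecutive, non-overlapping windows. The function name `window_mean_coverage` and the use of non-overlapping windows are assumptions for illustration; the disclosure does not fix the window layout.

```python
def window_mean_coverage(depth, window):
    """Average per-position coverage over consecutive non-overlapping windows.

    The final window may be shorter than `window` and is averaged over
    the positions it actually contains.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    means = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means
```

Averaging over windows trades positional resolution for lower sampling noise, which is why window size would in practice be chosen relative to the overall sequencing depth.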

[0015] In one aspect, the method further comprises making an adjustment to account for GC bias arising during library construction and sequencing.
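One simple way such a GC adjustment could be made is to rescale each window by the mean coverage of all windows with similar GC content. This is an assumed, generic binning approach shown for illustration, not the adjustment procedure of the disclosure; the function name `gc_correct` and the parameter `n_bins` are hypothetical.

```python
from collections import defaultdict


def gc_correct(coverage, gc_fraction, n_bins=10):
    """Rescale window coverage so every GC bin has the same mean coverage.

    coverage:    per-window coverage values
    gc_fraction: per-window GC content in [0, 1]
    n_bins:      number of equal-width GC bins (assumed parameter)
    """
    if len(coverage) != len(gc_fraction):
        raise ValueError("inputs must have the same length")

    def gc_bin(gc):
        # A GC fraction of exactly 1.0 falls into the last bin.
        return min(int(gc * n_bins), n_bins - 1)

    overall = sum(coverage) / len(coverage)
    by_bin = defaultdict(list)
    for cov, gc in zip(coverage, gc_fraction):
        by_bin[gc_bin(gc)].append(cov)
    bin_mean = {b: sum(v) / len(v) for b, v in by_bin.items()}

    corrected = []
    for cov, gc in zip(coverage, gc_fraction):
        m = bin_mean[gc_bin(gc)]
        corrected.append(cov * overall / m if m > 0 else 0.0)
    return corrected
```

After this rescaling, differences in coverage that track GC content alone are removed, while differences between windows within the same GC bin are preserved.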

[0016] In another embodiment, the method further comprises making adjustments based on additional weighting factors associated with individual mappings to compensate for bias.

[0017] In one aspect, the method further comprises steps performed by a sequencer, the steps comprising: a) providing a plurality of amplicons, wherein: i) each amplicon comprises multiple copies of a fragment of a target nucleic acid, ii) each amplicon comprises a plurality of interspersed adaptors at predetermined sites within the fragment, each adaptor comprising at least one anchor probe hybridization site, and iii) the plurality of amplicons comprises fragments that substantially cover the target nucleic acid; b) providing a random array of the amplicons fixed to a surface at a density such that at least a majority of the amplicons are optically resolvable; c) hybridizing one or more anchor probes to the random array; d) hybridizing one or more sequencing probes to the random array so as to form perfectly matched duplexes between the one or more sequencing probes and fragments of the target nucleic acid; e) ligating the anchor probes to the sequencing probes; f) identifying at least one nucleotide adjacent to at least one interspersed adaptor; and g) repeating steps (c)-(f) until a nucleotide sequence of the target nucleic acid is identified.

[0018] In one aspect, the method further comprises determining the measurement data by performing steps comprising: a) determining reads representing the sequences of a plurality of approximately random fragments of the genome in the sample, wherein the plurality provides a sampling of the sample genome whereby, on average, each base position of the genome is sampled one or more times; b) obtaining mapping data for the reads by mapping the reads to a reference genome, or by mapping the reads to an assembled sequence (e.g., an assembled sequence of the sample itself or of a related baseline sample); and c) obtaining coverage data by measuring the intensity of the reads along the reference genome or along the assembled sequence, wherein the measurement data comprise the mapping data and the coverage data.
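As an illustration of step (c), the following minimal sketch counts how many mapped reads overlap each reference position. The function name `coverage_depth` and the representation of each mapped read as a half-open `(start, end)` interval on the reference are assumptions made for the example.

```python
def coverage_depth(reads, ref_length):
    """Return per-position coverage for reads given as half-open (start, end) intervals."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in reads:
        start = max(start, 0)          # clip reads hanging off the left edge
        end = min(end, ref_length)     # clip reads hanging off the right edge
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth
```

The difference-array form keeps the cost proportional to the number of reads plus the reference length, rather than to the total number of read bases.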

[0019] In another embodiment, the method further comprises generating an initial model that estimates the number of states and their means based on the overall coverage distribution.

[0020] In another embodiment, the method further comprises optimizing the initial model by sequentially adding states to the model, then sequentially removing states from the model, or a combination thereof.

[0021] In another embodiment, normalizing further comprises determining a normalized corrected coverage.

[0022] In another embodiment, the method further comprises determining the sequence coverage by fragment duplication and obtaining a confidence measure in order to attribute the mapping fractionally to each detection position.

[0023] In one aspect, the method comprises performing HMM calculations to determine the ploidy at each detection position.

[0024] In another embodiment, the method further comprises generating a plurality of hidden Markov model (HMM) states corresponding to respective copy numbers, wherein, if the sample is a normal sample, performing HMM segmentation, scoring, and output comprises: for each state having a copy number N greater than 0, initializing the mean of the HMM emission distribution to N/2 times the median coverage in a portion of the sample expected to be diploid; and, for the state having a copy number of 0, initializing the mean of the emission distribution to a positive value that is less than the value used for the state having a copy number of 1.
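The initialization rule for a normal sample can be sketched as follows. The function name `init_emission_means` and the parameter `zero_fraction`, which sets the copy-number-0 mean to a fixed fraction of the copy-number-1 mean, are hypothetical; the disclosure requires only that the copy-number-0 mean be positive and smaller than the copy-number-1 mean.

```python
def init_emission_means(diploid_median, max_copy_number, zero_fraction=0.1):
    """Initial emission mean for each copy-number state 0..max_copy_number.

    For N > 0 the mean is (N / 2) * diploid_median, so the diploid state
    (N = 2) sits exactly at the diploid median coverage. The N = 0 state
    gets a small positive mean below the N = 1 mean (assumed fraction).
    """
    if diploid_median <= 0:
        raise ValueError("diploid_median must be positive")
    if max_copy_number < 1:
        raise ValueError("max_copy_number must be at least 1")
    if not 0 < zero_fraction < 1:
        raise ValueError("zero_fraction must lie strictly between 0 and 1")
    means = [0.0] * (max_copy_number + 1)
    for n in range(1, max_copy_number + 1):
        means[n] = n / 2 * diploid_median
    means[0] = zero_fraction * means[1]
    return means
```

A strictly positive mean for copy number 0 reflects that a homozygous deletion still tends to show some apparent coverage, for example from mismapped reads.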

[0025] In another embodiment, the method further comprises generating a plurality of HMM states corresponding to respective copy numbers, wherein, if the sample is a tumor sample, performing HMM segmentation, scoring, and output comprises: estimating the number of states and the mean of each state based on the coverage distribution to generate an initial HMM model; optimizing the initial model by modifying the number of states in the model and optimizing the parameters of each state; and modifying the number of states in the model by sequentially adding states to the model, then sequentially removing states, or a combination thereof.

[0026] In another embodiment, the method further comprises adjusting the initial model, comprising: a) adding a new state between a pair of states if adding the new state increases the likelihood associated with the HMM by more than a first predetermined threshold; b) repeating step (a) cyclically for each pair of states until no further additions are possible; c) removing a state from the HMM if removing the state does not reduce the likelihood by more than a second predetermined threshold; and d) repeating step (c) iteratively for all states.
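Steps (a)-(d) can be sketched as a greedy search over state means. To keep the example self-contained, the HMM likelihood is replaced by the log-likelihood of an equal-weight Gaussian mixture with a fixed standard deviation; that stand-in, the placement of a new state at the midpoint of a pair, and the names `log_likelihood` and `refine_states` are all assumptions for illustration and are not taken from the disclosure.

```python
import math


def log_likelihood(means, data, sigma):
    """Stand-in score: log-likelihood under an equal-weight Gaussian mixture."""
    norm = len(means) * sigma * math.sqrt(2 * math.pi)
    total = 0.0
    for x in data:
        dens = sum(math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in means) / norm
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return total


def refine_states(means, data, sigma, add_threshold, remove_threshold):
    """Greedily add, then remove, states according to likelihood thresholds."""
    means = sorted(means)
    # Steps (a)-(b): insert a state between an adjacent pair whenever that
    # raises the score by more than add_threshold; repeat until none does.
    added = True
    while added:
        added = False
        base = log_likelihood(means, data, sigma)
        for i in range(len(means) - 1):
            mid = (means[i] + means[i + 1]) / 2
            candidate = means[:i + 1] + [mid] + means[i + 1:]
            if log_likelihood(candidate, data, sigma) - base > add_threshold:
                means, added = candidate, True
                break
    # Steps (c)-(d): drop a state whenever doing so lowers the score by no
    # more than remove_threshold; repeat until no state can be dropped.
    removed = True
    while removed and len(means) > 1:
        removed = False
        base = log_likelihood(means, data, sigma)
        for i in range(len(means)):
            candidate = means[:i] + means[i + 1:]
            if base - log_likelihood(candidate, data, sigma) <= remove_threshold:
                means, removed = candidate, True
                break
    return means
```

With a positive `add_threshold` the addition loop terminates because the stand-in score is bounded above, and the removal loop terminates because each pass shrinks the model by one state.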

[0027] Another embodiment comprises a non-transitory computer-readable storage medium having instructions stored thereon for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample, wherein the instructions, when executed by a computer processor, cause the processor to: obtain measurement data for the sequence coverage of the sample using data generated from paired-end mappings; correct the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction; and estimate, based at least on the corrected measurement data, a total copy number value and a region-specific copy number value for each of a plurality of genomic regions.

[0028] Another embodiment comprises a non-transitory computer-readable storage medium having instructions tangibly embodied thereon, wherein the instructions, when executed by a computer processor, cause the processor to: obtain measurement data for the sequence coverage of a biological sample comprising a target sequence; correct the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction; perform hidden Markov model (HMM) segmentation, scoring, and output based on the corrected measurement data; perform population-based no-calling and identification of low-confidence regions based on the HMM scoring and output; and estimate a total copy number value and a region-specific copy number value for a plurality of regions.

[0029] Another embodiment comprises a system for determining copy number variation of a genomic region at a detection position of a target sequence, comprising: a. a computer processor; and b. a computer-readable storage medium coupled to the processor, the storage medium having instructions tangibly embodied thereon, wherein the instructions, when executed by the computer processor, cause the processor to: obtain measurement data for the sequence coverage of the sample using data generated from paired-end mappings; correct the measurement data for sequence coverage bias, wherein correcting the measurement data comprises performing ploidy-aware baseline correction; and estimate, based at least on the corrected measurement data, a total copy number value and a region-specific copy number value for each of a plurality of genomic regions.

[0030] This Summary is provided to introduce, in simplified form, a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Other features, details, utilities, and advantages of the claimed subject matter will be apparent from the following written Detailed Description, including those aspects illustrated in the accompanying drawings and defined in the appended claims.


[0031] The following drawings represent one format for presenting data provided by embodiments of the present invention. The drawings are not intended to limit in any way the practice of the aspects of the invention described herein, but rather to help illustrate the basic concepts of the invention.

[0032] FIG. 1 depicts a generalized block diagram illustrating a system for calling variations in a sample containing a target sequence, according to an embodiment of the present disclosure.

[0033] FIG. 2 depicts a generalized flowchart illustrating a CNV calling method according to an embodiment of the present disclosure.

[0034] FIG. 3 depicts a generalized computer system incorporating and operating according to certain aspects of the present disclosure.

[0035] FIGS. 4A and 4B illustrate exemplary sequencing systems.

[0036] FIG. 5 illustrates an exemplary computing device that may be used in, or in conjunction with, a sequencer and/or a computer system.

[0037] FIG. 6 is a diagram of a 1-component tumor model.

[0038] FIG. 7 is a diagram of a 2-component tumor model.

[0039] FIG. 8 is a diagram of an exemplary embodiment of determining read coverage and segmentation.

[0040] FIG. 9 is a diagram of exemplary initial state estimation logic.

[0041] FIGS. 10A-10C illustrate exemplary results demonstrating the robustness of the process across samples containing varying percentages of tumor and normal tissue.

[0042] FIG. 11 illustrates the strong statistical correlation between tumors and both high average copy number and high variability.

[0043] Detailed Description of the Invention

[0044] In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the invention may be practiced without one or more of these specific details. In other instances, features and procedures well known to those skilled in the art have not been described in order to avoid obscuring the invention.

[0045] Although the invention is described primarily with reference to specific embodiments, it is also contemplated that other embodiments will become apparent to those skilled in the art upon reading the present disclosure, and such embodiments are intended to be encompassed within the methods of the invention.

[0046] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All publications mentioned herein are incorporated herein by reference for the purpose of describing and disclosing the devices, compositions, formulations, and methods that are described in the publications and that might be used in connection with the present invention.

[0047] Where a range of values is provided, it is understood that each intervening value between the upper and lower limits of that range (to the tenth of the unit of the lower limit, unless the context clearly dictates otherwise), and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, which are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

[0048] Exemplary Sequencing Methods

[0049] 用于测序靶核酸的例示性方法包括样品制备,其涉及从DNA样品中提取并且分段靶核酸,以产生通常包括一个或多个接头的片段化的靶核酸模板。 Exemplary Method [0049] for sequencing a target nucleic acid sample preparation, which involves extracting the target nucleic acid and a segment from a DNA sample, to produce a fragment generally comprises one or more joints of the target nucleic acid template. 所述靶核酸模板任选地经过扩增方法以形成核酸纳米球,出于分析的目的,其通常被配置在表面或基质上。 Optionally, the target nucleic acid template to form a nucleic acid amplification method through nanospheres, for purposes of analysis, which is typically disposed on a surface or substrate. 基质可通过核酸纳米球的模式化或随机排列进行生产。 Matrix pattern can be produced by nucleic acid nanospheres or randomly arranged. 形成核酸纳米球的方法描述于以下公开的专利申请:W02007120208、W02006073504、W02007133831 和US2007099208,美国专利申请系列号11/679, 124 ;11/981,761 ;11/981,661 ;11/981,605 ;11/981,793 ;11/981,804 ; 11/451,691 ; 11/981,607 ; 11/981,767 ; 11/982,467 ; 11/451,692 ; 12/335, 168 ; 11/541, 225 ; 11/927, 356 ; 11/927, 388 ; 11/938, 096 ; 11/938, 106 ; 10/547, 214 ; 11/981,730 ; 11/981,685 ; 11/981,797 ; 12/252,280 ; 11/934,695 ; 11/934,697 ; 11/934,703 ; 12/265, 593 ; 12/266,385 ; 11/938,213 ; 11/938,221 ; 12/325,922 ; 12/329, 365和12/335, 188,所有这些通过引用整体并入本文,用于所有的目的,尤其是用于所有与形成核酸纳米球有关的教导。 The method of forming a nucleic acid nanospheres described in the following published patent applications: W02007120208, W02006073504, W02007133831 and US2007099208, U.S. Patent Application Serial No. 
11/679, 124; 11 / 981,761; 11 / 981,661; 11 / 981,605 ; 11 / 981,793; 11 / 981,804; 11 / 451,691; 11 / 981,607; 11 / 981,767; 11 / 982,467; 11 / 451,692; 12/335, 168; 11/541, 225; 11/927, 356 ; 11/927, 388; 11/938, 096; 11/938, 106; 10/547, 214; 11 / 981,730; 11 / 981,685; 11 / 981,797; 12 / 252,280; 11 / 934,695; 11 / 934,697; 11 / 934,703; 12/265, 593; 12 / 266,385; 11 / 938,213; 11 / 938,221; 12 / 325,922; 12/329, 365 and 12/335, 188, all of which are incorporated herein by reference in their entirety, for all object, in particular for all teachings related to form a nucleic acid nanospheres. 形成核酸纳米球阵列的方法描述于公开的专利申请W02007120208、TO2006073504、TO2007133831 和US2007099208,以及美国专利申请系列号11/679, 124 ;11/981,761 ;11/981,661 ;11/981,605 ;11/981,793 ;11/981,804 ; 11/451,691 ; 11/981,607 ; 11/981,767 ; 11/982,467 ; 11/451,692 ; 12/335, 168 ; 11/541,225 ; 11/927, 356 ; 11/927, 388 ; 11/938, 096 ; 11/938, 106 ;10/547, 214 ; 11/981,730 ; 11/981,685 ; 11/981,797 ; 12/252,280 ; 11/934,695 ; 11/934,697 ; 11/934, 703 ;12/265, 593 ;12/266, 385 ;11/938, 213 ;11/938, 221 ;12/325, 922 ;12/329, 365 和12/335, 188,所有这些通过引用整体并入本文,用于所有的目的,尤其是用于与形成核酸纳米球的阵列有关的所有教导。 The method of forming an array of nucleic acid nanospheres described in published patent applications W02007120208, TO2006073504, TO2007133831 and US2007099208, and U.S. Patent Application Serial No. 11/679, 124; 11 / 981,761; 11 / 981,661; 11 / 981,605 ; 11 / 981,793; 11 / 981,804; 11 / 451,691; 11 / 981,607; 11 / 981,767; 11 / 982,467; 11 / 451,692; 12/335, 168; 11 / 541,225; 11/927, 356; 11 / 927, 388; 11/938, 096; 11/938, 106; 10/547, 214; 11 / 981,730; 11 / 981,685; 11 / 981,797; 12 / 252,280; 11 / 934,695; 11 / 934,697; 11/934 , 703; 12/265, 593; 12/266, 385; 11/938, 213; 11/938, 221; 12/325, 922; 12/329, 365 and 12/335, 188, all of which by reference in their entirety incorporated herein for all purposes, in particular for all teachings form a nucleic acid array associated nanospheres. 
美国专利申请系列号11/679, 124 ;11/981,761 ; 11/981,661 ; 11/981,605 ; 11/981,793 ; 11/981,804 ; 11/451,691 ; 11/981,607 ; 11/981,767 ; 11/982,467 ; 11/451,692 ; 12/335, 168 ; 11/541,225 ; 11/927,356 ; 11/927,388 ; 11/938,096 ; 11/938, 106 ; 10/547, 214 ; 11/981,730 ; 11/981,685 ; 11/981,797 ; 12/252,280 ; 11/934,695 ; 11/934,697 ; 11/934,703 ; 12/265,593 ; 12/266, 385 ;11/938, 213 ;11/938, 221 ;12/325, 922 ;12/329, 365 ;以及12/335, 188 中也描述了测序反应与特定靶序列的检测中使用核酸纳米球的方法,通过引用将其每一个整体并入本文,用于所有的目的,尤其是用于与核酸纳米球上进行测序反应有关的所有教导。 U.S. Patent Application Serial No. 11/679, 124; 11 / 981,761; 11 / 981,661; 11 / 981,605; 11 / 981,793; 11 / 981,804; 11 / 451,691; 11 / 981,607; 11 / 981,767; 11 / 982,467; 11 / 451,692; 12/335, 168; 11 / 541,225; 11 / 927,356; 11 / 927,388; 11 / 938,096; 11/938, 106; 10/547, 214; 11 / 981,730; 11 / 981,685; 11 / 981,797; 12 / 252,280; 11 / 934,695; 11 / 934,697; 11 / 934,703; 12 / 265,593; 12/266, 385; 11/938, 213; 11/938, 221; 12/325, 922; 12/329, 365; and 12/335, 188 also describes a method of using a nucleic acid sequencing reaction nanoballs the detection of a specific target sequence, each by reference in its entirety herein for all purposes, particularly for the nucleic acid nanospheres sequencing reactions were performed on all the relevant teachings. 应理解的是,任一本文所述的和本领域中已知的测序方法均可以应用于溶液中的核酸模板和/ 或核酸纳米球,或应用于被配置在表面上和/或阵列中的核酸模板和/或核酸纳米球。 It should be understood that any of the herein and known in the art can be applied to methods of sequencing a nucleic acid template and / or nucleic acid nanospheres solution, or to be disposed on the surface and / or arrays nucleic acid template and / or nucleic acid nanospheres. 
[0050] 在核酸纳米球上进行核苷酸测序过程,通常通过测序-连接技术,包括组合的探针锚定连接("cPAL")方法,其描述于例如Drmanacetal·,"HumanGenomeSequencing UsingUnchainedBaseReadsonSelf-AssemblingDNANanaoarrays,,'Science327: 78-81,2009(2010 年1月1日)以及出版的?(:1'专利申请恥07/133831、恥06/138257、 W006/138284、W007/044245、W008/070352、W008/058282、W008/070375 ;以及出版的美国专利申请2007-0037152和2008-0221832中。在此类方法中,根据充分理解了的规则,将已知的标记物(如含有可分辨的荧光团单个分子的特定片段)作为标记物连接于靶核酸模板, 然后在相同类型的DNA链上索引的重新排序以提供重叠数据的基础。本文提及的测序过程仅仅是代表性的。在另一实施方案中,使用了标签。可以使用本领域中已知的或研发的其他处理技术。然后用辐射照射基质上聚集的核酸纳米球以激 [0050] Nucleotide sequencing process on the nucleic acid nanospheres, typically by sequencing - connection technology, a combination comprising the anchor probe is connected ( "cPAL") method, which is described in e.g. Drmanacetal ·, "HumanGenomeSequencing UsingUnchainedBaseReadsonSelf-AssemblingDNANanaoarrays, , 'Science327: 78-81,2009 (2010 Nian 1 1st) and published (:? 1' patent application 07/133831 shame, shame 06/138257, W006 / 138284, W007 / 044245, W008 / 070352, W008 / 058282, W008 / 070375; and U.S. Patent application publication 2007-0037152 and 2008-0221832 in such a method, in accordance with well-understood rules, known markers (e.g., comprising a single resolvable fluorophores. specific molecule fragment) as a marker attached to a target nucleic acid template, and the index in the same type of DNA strand reordered to provide overlapping data base. sequencing process mentioned herein is merely representative. 
in another embodiment , a label may be used in the art, or other processing techniques known in research and development and then the substrate with the radiation aggregation stimulated to a nucleic acid nanospheres 足以引起与每个具体标记物C、G、A或T有关的荧光团在它们独特的波长处发射荧光的荧光团,从此处可以通过照相机在(标准的或延时集成TDI)CCD阵列上或代替CCD阵列的扫描仪,或其他的可应用于测序仪中的电子流/电压感应技术产生空间图像。也可使用其他的感应机制,诸如阻抗变化感应器。辐照可为光谱特异的从而一次只激发一种选择的荧光团,然后可以通过照相机记录,或可过滤照相机的输入以感应并且只记录接收到的光谱特异的荧光辐射,或可以在彩色的LCD阵列上同时感应并且记录的所有的荧光辐射,再然后在其中有核酸构建体的每一询问位点上分析光谱含量。图像采集产生了许多询问位点的一系列图像,所述询问位点可以基于光谱特异性荧光强度,通过在本文称为碱基读取的过程中对强度水平的计算机处理来进行分析,所述过程将在下文中 Sufficient to cause fluorophore emits fluorescence at a wavelength of their unique and specific for each marker C, G, A or T related fluorophores, from there on through a camera (or a standard delay integration TDI) CCD array or instead of the CCD array of the scanner, or may be other electron current / voltage sensing technique to create a space in the sequencer images may also be used other sensing mechanism, such as a change in the impedance of the sensor. irradiation may be such that a specific spectrum exciting fluorophores only one selected, can then be recorded by the camera, or the camera may be input to the filter inductor and only records the spectrum specific fluorescence radiation received, or can be simultaneously induced in the array of color LCD and records all fluorescence radiation, and then the nucleic acid construct in which the analysis of the spectral content of each interrogation position thereof. 
many image acquisition series of images generated interrogation site, the query may be based on site-specific fluorescence intensity spectra, by process is referred to herein as a base intensity level of the read process to analyze the computer, the process will hereinafter 有更为详尽的解释。cPAL和其他的测序方法也可以用于检测特异的序列,诸如包括核酸构建体中的单核苷酸多态性("SNP"),(所述的核酸构建体包括核酸纳米球以及直链和环状的核酸模板)。读取或碱基读取序列的鉴定,例如由于测序程序特性的明显原因,碱基读取可能包含误差。使用基于计算机处理的里德-索罗门(Reed-Solomon)误差校正,不论以进行里德-索罗门算法的计算机处理器的形式,还是以使用预先计算的预期碱基读取序列的比较机制的形式,诸如在检查表中,可以鉴定误差。 A more detailed explanation .cPAL sequencing and other methods can also be used to detect specific sequences, including such as a single nucleotide polymorphism ( "SNP") in the body of a nucleic acid construct, (the nucleic acid construct comprises nucleic acid nanospheres and linear and circular nucleic acid template) reading or identifying nucleotide sequence read, e.g. obvious reasons sequencing program characteristics, the base may include an error reading the computer-based processing using the Reed - Solomon (Reed-Solomon) error correction, whether to perform the Reed - Solomon algorithm in the form of computer processors, or to use a pre-calculated expected read compare the nucleotide sequences of the form of the mechanism, such as checklist , the error can be identified. 可以标记"未读取的"序列并且可以进行校正,以产生校正的碱基读取序列。 It may be marked "not read" sequence and may be corrected to produce a corrected read nucleotide sequence. 应理解的是,本文所述的位点与结构的大小只是基质上所分析的位点与结构的大小的极小部分,因为它们不容易进行描述。 It should be understood that the size of the sites described herein with only a very small part of the structure of the size of the site and the structure on the substrate of the analysis because they are not easily described. 
For example, the substrate may be a photolithographically etched, surface-modified (SOM) 25 mm x 75 mm silicon substrate having a grid-patterned array of approximately 300 nm spots for binding nucleic acid nanoballs, which increases the DNA content per array and improves image information density compared with random genomic DNA arrays.

[0051] Sequencing probes may be detectably labeled with a wide variety of labels. Although the foregoing is directed primarily to embodiments in which the sequencing probes are labeled with fluorophores, it should be understood that similar embodiments using sequencing probes that carry other types of labels are encompassed by the present invention. Moreover, the methods of the invention may use unlabeled structures.

[0052] In some embodiments, multiple cPAL cycles (whether single, double, triple, etc.) identify multiple bases in the region of the target nucleic acid adjacent to an adaptor. (In alternative designs, a single cPAL cycle may be used to produce multiple bases.) Briefly, the cPAL method is performed repeatedly to interrogate multiple bases in the target nucleic acid by cycling anchor probe hybridization and enzymatic ligation reactions with pools of sequencing probes designed to detect nucleotides at different positions removed from the interface between the adaptor and the target nucleic acid. In any given cycle, the sequencing probes used are designed such that the identity of one or more bases at one or more positions is correlated with the identity of the label attached to the sequencing probe. Once the ligated sequencing probe is detected (and hence the base or bases at the interrogated position or positions), the ligated complex is stripped from the nucleic acid nanoball and a new cycle of adaptor and sequencing probe hybridization and ligation is conducted. In this manner, oversampled data can be obtained.

[0053] Selected definitions

[0054] "Adaptor" refers to an engineered construct comprising "adaptor elements", where one or more adaptors may be interspersed within the target nucleic acid of a library construct. The adaptor elements or features included in any adaptor vary widely depending on the use of the adaptor, but typically include sites for restriction endonuclease recognition and/or cleavage, sites for primer binding (for amplifying library constructs) or anchor primer binding (for sequencing the target nucleic acid in the library construct), nickase sites, and the like. In some aspects, adaptors are engineered to include one or more of the following: 1) a length of about 20 to about 250 nucleotides, or about 40 to about 100 oligonucleotides, or less than about 60 nucleotides, or less than about 50 nucleotides; 2) features that allow ligation to the target nucleic acid as at least one, and typically two, "arms"; 3) different and distinct anchor binding sites at the 5' end and/or the 3' end of the adaptor for use in sequencing the adjacent target nucleic acid; and 4) optionally one or more restriction sites. In one aspect, adaptors may be interspersed adaptors. An "interspersed adaptor" herein means an oligonucleotide inserted at spaced locations within the interior region of a target nucleic acid. In one aspect, "interior" with respect to a target nucleic acid means a site internal to the target nucleic acid prior to processing, such as circularization and cleavage, that may introduce sequence inversions or similar transformations that disrupt the order of nucleotides within the target nucleic acid. The use of interspersed adaptors facilitates sequence reconstruction and alignment, because sequence runs of 10 bases each from a single adaptor can allow 20, 30, 40, etc. bases to be read without alignment per se.

[0055] "Amplicon" refers to the product of a polynucleotide amplification reaction; that is, a population of polynucleotides replicated from one or more starting sequences. Amplicons may be produced by a variety of amplification reactions, including but not limited to polymerase chain reactions (PCRs), linear polymerase reactions, nucleic acid sequence-based amplification, rolling circle amplification, and similar reactions (see, e.g., U.S. Patent Nos. 4,683,195; 4,965,188; 4,683,202; 4,800,159; 5,210,015; 6,174,670; 5,399,491; 6,287,824; and 5,854,033; and U.S. Publication No. 2006/0024711).

[0056] When used in the context of identification, the term "base" refers to the purine or pyrimidine group (or an analog or variant thereof) associated with the nucleotide at a given position within the target nucleic acid. Thus, to call a base and to identify a nucleotide both refer to determining a data value that identifies the purine or pyrimidine group (or an analog or variant thereof) at a particular position within the target nucleic acid. Purine and pyrimidine groups include the four main nucleotide bases C, G, A, and T.

[0057] As used herein, "polynucleotide", "nucleic acid", "oligonucleotide", "oligomer", or grammatical equivalents generally refer to at least two nucleotides covalently linked together in a linear fashion. A nucleic acid generally contains phosphodiester bonds, although in some cases nucleic acid analogs may be included that have alternative backbones such as phosphoramidite, phosphorodithioate, or methylphosphoramidite linkages, or peptide nucleic acid backbones and linkages. Other nucleic acid analogs include those with bicyclic structures, including locked nucleic acids, positive backbones, non-ionic backbones, and non-ribose backbones.

[0058] The term "reference polynucleotide sequence", or simply "reference", refers to a known nucleotide sequence of a reference organism. The reference may be an entire genome sequence of a reference organism (e.g., a reference genome), a portion of a reference genome, a consensus sequence of many reference organisms, a compilation sequence based on different components of different organisms, a collection of genome sequences drawn from a population of organisms, or any other appropriate sequence. The reference may also include information regarding variations of the reference known to be found in a population of organisms. The reference organism may also be specific to the sample being sequenced, which may be obtained separately from a related individual or from the same individual (possibly the normal sample that complements a cancer sequence).

[0059] "Sample polynucleotide sequence" refers to a nucleic acid sequence of a sample or target organism that is derived from a gene, a regulatory element, genomic DNA, cDNA, RNAs (including mRNAs, rRNAs, siRNAs, miRNAs, etc.), and/or fragments thereof. A sample polynucleotide sequence may be a nucleic acid from a sample, or a secondary nucleic acid such as the product of an amplification reaction. For a sample polynucleotide sequence or a polynucleotide fragment to be "derived from" a sample polynucleotide (or any polynucleotide) can mean that the sample sequence or polynucleotide fragment is formed by physically, chemically, and/or enzymatically fragmenting the sample polynucleotide (or any other polynucleotide). To be "derived from" a polynucleotide may also mean that the fragment is the result of replicating or amplifying a particular subset of the nucleotide sequence of the source polynucleotide.

[0060] "Read" refers to a set of one or more data values that represent one or more nucleotide bases. A "mated read" (also referred to as a "mate-pair") generally refers to a set of individual nucleotide reads originating from two distinct regions (arms) of genomic sequence that are located at opposite ends of a DNA fragment, separated by a distance of hundreds or thousands of bases. Mated reads may be generated during sequencing from fragments of a larger contiguous polynucleotide (e.g., DNA) obtained from the sample organism in which variations are to be called and/or reassembled.

[0061] "Mapping" refers to one or more data values that relate a read (e.g., a mated read) to zero, one, or more locations in the reference to which the read is similar, for example by matching the instantiated read to one or more keys within an index that correspond to locations within the reference.
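By way of illustration only, the index-based matching mentioned in this definition might be sketched as follows. The k-mer index, the function names `build_index` and `map_read`, and the choice of the first k bases of the read as the lookup key are assumptions made for this example and are not taken from the disclosure.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-base key in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def map_read(read: str, index: dict, k: int) -> list:
    """Return candidate reference positions for a read: zero, one, or more."""
    key = read[:k]  # hypothetical choice: the first k bases form the key
    return index.get(key, [])


reference = "ACGTACGTTTGACGTA"
index = build_index(reference, k=4)
unique_hits = map_read("TTGACG", index, k=4)  # key occurs once
repeat_hits = map_read("ACGTAC", index, k=4)  # key occurs several times
no_hits = map_read("GGGGGG", index, k=4)      # key absent from the reference
```

The positions returned are only candidates that share a key with the read; a practical mapper would go on to verify and score each candidate.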

[0062] "Hybridization" refers to the process in which two single-stranded polynucleotides bind non-covalently to form a stable double-stranded polynucleotide. The resulting (usually) double-stranded polynucleotide is a "hybrid" or "duplex". "Hybridization conditions" will typically include salt concentrations of less than about 1 M, more usually less than about 500 mM, and possibly less than about 200 mM. Hybridization temperatures can be as low as 5 °C, but are typically greater than 22 °C, more typically greater than about 30 °C, and often in excess of 37 °C.

[0063] "Ligation" means forming a covalent bond or linkage between the termini of two or more nucleic acids (e.g., oligonucleotides and/or polynucleotides) in a template-driven reaction. The nature of the bond or linkage may vary widely, and the ligation may be carried out enzymatically or chemically. As used herein, ligations are usually carried out enzymatically to form a phosphodiester linkage between the 5' carbon of a terminal nucleotide of one oligonucleotide and the 3' carbon of another nucleotide. Template-driven ligation reactions are described in the following references: U.S. Patent Nos. 4,883,750; 5,476,930; 5,593,826; and 5,871,921.

[0064] "Logic" refers to a set of instructions which, when executed by one or more processors (e.g., CPUs) of one or more computing devices and/or computer systems, is operable to perform one or more functions and/or to return data in the form of one or more results and/or data used by other logic elements. In various embodiments and implementations, any given logic may be implemented as one or more software components executable by one or more processors (e.g., CPUs), as one or more hardware components such as application-specific integrated circuits (ASICs) and/or field-programmable gate arrays (FPGAs), or as any combination of one or more software components and one or more hardware components. The software components of any particular logic may be implemented, without limitation, as a standalone or client-server software application, as a client in a client-server system, as a server in a client-server system, as one or more software modules, as one or more libraries of functions, and as one or more static and/or dynamically linked libraries. During execution, the instructions of any particular logic may be embodied as one or more computer processes, threads, fibers, and any other suitable run-time entities that can be instantiated in the hardware of one or more computing devices and can be allocated computing resources including, without limitation, memory, CPU time, storage space, and network bandwidth.

[0065] "Primer" means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and of being extended from its 3' end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Primers are usually extended by a DNA polymerase.

[0066] "Probe" generally refers to an oligonucleotide that is complementary to an oligonucleotide or target nucleic acid under investigation. Probes used in certain aspects of the claimed invention are labeled in a way that permits detection, for example with a fluorescent or other optionally discernible tag.

[0067] "Sequence determination" (also referred to as "sequencing") of a target nucleic acid means determination of information relating to the sequence of nucleotide bases in the target nucleic acid. Such information may include the identification or determination of partial as well as full sequence information of the target nucleic acid. The sequence information may be determined with varying degrees of statistical reliability or confidence. In one aspect, sequencing includes determining the identity and ordering of a plurality of contiguous nucleotides in the target nucleic acid, starting from different nucleotides in the target nucleic acid. Sequencing and its various steps are performed by a sequencer comprising a reaction subsystem and an imaging subsystem. The reaction subsystem includes a flow device (on which biochemical reactions take place between various reagents, buffers, etc., and a biochemical sample or fragments derived therefrom) and various other components (e.g., tubing, valves, injectors, actuators, motors, etc.) configured to dispose the reagents, buffers, sample fragments, etc. on or in the flow device. The imaging subsystem includes a camera, a microscope (and/or appropriate lenses and tubes), a stage that holds the flow device during sequencing, and various other components (e.g., motors, actuators, robotic arms, etc.) for placing and adjusting the flow device on the stage and for adjusting the relative positions of the camera and the microscope.

[0068] "Target nucleic acid" means a nucleic acid of (usually) unknown sequence derived from a gene, a regulatory element, genomic DNA, cDNA, RNA (including mRNA, rRNA, siRNA, miRNA, etc.), and fragments thereof. A target nucleic acid may be a nucleic acid from a sample, or a secondary nucleic acid such as the product of an amplification reaction. Target nucleic acids can be obtained from virtually any source and can be prepared using methods known in the art. For example, target nucleic acids can be isolated directly without amplification, or isolated by amplification using methods known in the art, including but not limited to polymerase chain reaction (PCR), strand displacement amplification (SDA), multiple displacement amplification (MDA), rolling circle amplification (RCA), rolling circle replication (RCR), and other amplification (including whole-genome amplification) methods. Target nucleic acids may also be obtained through cloning, including but not limited to cloning into vehicles such as plasmids, yeast artificial chromosomes, and bacterial artificial chromosomes. In some aspects, the target nucleic acids comprise mRNAs or cDNAs. In certain embodiments, the target DNA is created using isolated transcripts from a biological sample. Target nucleic acids may be obtained from a sample using methods known in the art. It should be understood that the sample may comprise any number of substances, including but not limited to bodily fluids of virtually any organism, such as blood, urine, serum, lymph, saliva, anal and vaginal secretions, perspiration, and semen, with mammalian samples being preferred and human samples being particularly preferred. Methods of obtaining target nucleic acids from various organisms are well known in the art. Samples comprising human genomic DNA find use in many embodiments. In some aspects, such as whole-genome sequencing, about 20 to about 1,000,000 or more genome equivalents of DNA are preferably obtained to ensure that the population of target DNA fragments sufficiently covers the entire genome.

[0069] Exemplary methods of genome sequencing and CNV estimation

[0070] The present invention relates to methods for estimating copy number variation of a genomic region of interest at a detection position of a target sequence in a sample, which find use in a variety of applications as described herein.

[0071] The methods of the present disclosure may also include extracting and fragmenting target nucleic acids from a sample, and/or sequencing the target nucleic acids on which CNV estimation is performed. The fragmented nucleic acids may be used to produce target nucleic acid templates that generally include one or more adaptors. The target nucleic acid templates are subjected to amplification methods to form nucleic acid concatemers, such as nucleic acid nanoballs.

[0072] In one aspect, a nucleic acid template may comprise a target nucleic acid and multiple interspersed adaptors, and is also referred to herein as a "library construct", "circular template", "circular construct", "target nucleic acid template", and other grammatical equivalents. The nucleic acid template constructs are assembled by inserting adaptor molecules at a multiplicity of sites throughout each target nucleic acid. The interspersed adaptors permit acquisition of sequence information from multiple sites in the target nucleic acid, consecutively or simultaneously.

[0073] In another embodiment, nucleic acid templates formed from a plurality of genomic fragments may be used to create a library of nucleic acid templates. In some embodiments, such libraries of nucleic acid templates comprise target nucleic acids that together encompass all or part of an entire genome. That is, by using a sufficient number of starting genomes (e.g., genomes of cells), combined with random fragmentation, the resulting target nucleic acids of a particular size that are used to create the circular templates sufficiently "cover" the genome, although it should be understood that, on occasion, bias may be introduced inadvertently and prevent the entire genome from being represented.

[0074] Further embodiments and examples of methods of constructing nucleic acid templates are described in U.S. Patent Application Serial Nos. 11/679,124; 11/981,761; 11/981,661; 11/981,605; 11/981,793; 11/981,804; 11/451,691; 11/981,607; 11/981,767; 11/982,467; 11/451,692; 12/335,168; 11/541,225; 11/927,356; 11/927,388; 11/938,096; 11/938,106; 10/547,214; 11/981,730; 11/981,685; 11/981,797; 12/252,280; 11/934,695; 11/934,697; 11/934,703; 12/265,593; 12/266,385; 11/938,213; 11/938,221; 12/325,922; 12/329,365; and 12/335,188, each of which is incorporated herein by reference in its entirety for all purposes, and in particular for all teachings related to constructing the nucleic acid templates of the technology described herein.

[0075] The nucleic acid templates of the technology described herein may be double-stranded or single-stranded, and they may be linear or circular. In some embodiments, libraries of nucleic acid templates are generated, and in other embodiments the target sequences contained among the different templates in such libraries together cover all or part of an entire genome. It should be understood that these libraries of nucleic acid templates may comprise diploid genomes, or they may be processed using methods known in the art to isolate sequences from one set of parental chromosomes over the other. As will be appreciated by those skilled in the art, the single-stranded circular templates in a library may together comprise both strands of a chromosome or chromosomal region (i.e., both the "Watson" and "Crick" strands), or circles containing sequences from one strand or the other may be isolated into their own libraries using methods known in the art.

[0076] For any of the sequencing methods known in the art and described herein that use nucleic acid templates, the technology described herein provides methods for determining at least about 10 to about 200 bases in a target nucleic acid. In another embodiment, the technology described herein provides methods for determining at least about 20 to about 180, about 30 to about 160, about 40 to about 140, about 50 to about 120, about 60 to about 100, and about 70 to about 80 bases in a target nucleic acid. In still other embodiments, sequencing methods are used to identify 5, 10, 15, 20, 25, 30, or more bases adjacent to one or both ends of each adaptor in a nucleic acid template.

[0077] Technical overview of CNV calling

[0078] CNV calling for normal samples and for tumor samples shares some features but also differs. In some embodiments, both types of samples go through the following three steps.

[0079] 1) Calculation of sequence coverage.

[0080] 2) Estimation and correction of coverage bias:

[0081] a. modeling of coverage bias;

[0082] b. correction of the modeled bias;

[0083] c. coverage smoothing.

[0084] 3) Normalization of coverage by comparison to a baseline sample or set of samples.

[0085] Thereafter, both normal samples and tumor samples are segmented using a hidden Markov model (HMM), but different models are used for the two sample types in the following steps:

[0086] 4A) HMM segmentation, scoring, and output for normal samples;

[0087] 4B) modified HMM segmentation, scoring, and output for tumor samples.

[0088] Finally, normal samples undergo a "no-calling" process, which identifies suspect CNV calls in the following step:

[0089] 5) population-based no-calling and identification of low-confidence regions.

[0090] In various embodiments, the above steps of CNV calling may be performed by different types of logic executed on one or more computing devices.

[0091] Exemplary embodiments of the CNV calling technique

[0092] 1. Calculation of sequence coverage

[0093] As used below, "DNB" refers to the sequence of a nucleic acid nanoball from which one or more reads (e.g., mated reads) have been sequenced. It is noted that, among the reads sequenced from a biological sample or fragments thereof, a DNB is represented by one or more reads that may or may not cover the entire sequence constituting the DNB. For example, in one embodiment, a DNB is represented by a mated read comprising two or more arm reads that originate from opposite ends of the DNB and are separated by an unknown sequence of a few hundred bases.

[0094] In one aspect, all paired-end (e.g., full-DNB) mappings that satisfy the mate-pair constraints are used to compute sequence coverage. In one embodiment, a unique paired-end mapping contributes a single count to each base of the reference that is aligned to the DNB. Reference bases aligned to non-unique paired-end mappings are weighted (e.g., given fractional counts) based on the estimated probability that the mapping is the correct location of the DNB in the reference. Thus, attributing a fraction of a DNB to each mapping in proportion to the confidence in that mapping provides the ability to give reasonable coverage estimates in regions where mappings are non-unique.

[0095] In one aspect, each position i of the reference genome R receives the following coverage value c_i:

Figure CN104428425AD00161

[0097] where M_i is the set of mappings over all DNBs such that a called base in each mapping is aligned to position i, DNB_m is the DNB described by mapping m, N(m) is the set of all mappings involving DNB_m, and α is the probability that a DNB is generated in a manner that does not allow it to map to the reference.
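The fractional counting described above might be sketched as follows, purely as an illustration. The equation itself is available here only as a figure, so the normalization used below (each mapping weighted by its likelihood divided by α plus the summed likelihoods of all mappings of the same DNB) is an assumed reading of the definitions of M_i, N(m), and α. The function name, the tuple layout, and the treatment of each mapping as one contiguous interval are likewise hypothetical simplifications.

```python
from collections import defaultdict


def fractional_coverage(mappings, ref_length, alpha=0.0):
    """Accumulate per-position coverage from weighted DNB mappings.

    mappings: list of (dnb_id, start, end, likelihood), end exclusive.
    alpha: assumed probability mass for a DNB that cannot map to the
    reference; alpha plus the summed likelihoods must be positive.
    """
    # Sum the likelihoods over N(m): all mappings of the same DNB.
    total = defaultdict(float)
    for dnb_id, _start, _end, likelihood in mappings:
        total[dnb_id] += likelihood

    coverage = [0.0] * ref_length
    for dnb_id, start, end, likelihood in mappings:
        weight = likelihood / (alpha + total[dnb_id])  # fractional count
        for i in range(start, end):
            coverage[i] += weight
    return coverage


mappings = [
    ("dnb1", 0, 4, 1.0),    # unique mapping: contributes a full count
    ("dnb2", 2, 6, 0.75),   # two candidate placements of the same DNB
    ("dnb2", 6, 10, 0.25),
]
coverage = fractional_coverage(mappings, ref_length=10)
```

With these inputs the uniquely mapped DNB adds one count per covered base, while the ambiguous DNB is split 0.75 to 0.25 between its two placements.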

[0098] According to the techniques described herein, computer logic (e.g., CNV caller 18 in FIG. 1 and/or a component thereof, such as coverage calculation logic 22) computes a coverage value for each position (or locus) in the reference genome based on the DNB mappings. The computer logic then includes the computed coverage values in the measurement data used for subsequent processing.

[0099] 2. Estimation and correction of coverage bias (intra-sample coverage operations)

[0100] Currently, genome sequencing may result in coverage bias that can affect copy number estimation. One element of the bias relates to the GC content within an interval approximating the length of the initial DNA fragment that becomes a DNB (e.g., approximately 400 bp), although other factors are also known. In one embodiment, it is generally preferable to model and correct for such bias before, or as part of, copy number estimation.

[0101] In another embodiment, it is desirable to apply some smoothing to short-scale fluctuations in coverage, which may be at least partly specific to an individual circle library or DNB.

[0102] Several methods of bias correction and smoothing can be used. All of the operations and steps in these methods may be performed by computer logic (e.g., CNV caller 18 in FIG. 1 and/or a component thereof, such as GC correction logic 34) based on measurement data that includes, but is not limited to, the coverage value for each position in the reference genome.

[0103] Method 1: post hoc coverage correction

[0104] In one embodiment, the sequence coverage described above is smoothed by window-averaging and then adjusted to account for GC bias in the library construction and sequencing processes.

[0105] Window-averaging is performed by computing the mean of the unsmoothed coverage values of the positions within the window. For a window of length N, the mean coverage recorded at position i is

Figure CN104428425AD00162
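The window-averaging step can be illustrated with a minimal Python sketch. The function name `window_average` and the centering convention (the window starts N/2 positions before i) are assumptions made for illustration; the text itself only states that the unsmoothed values within the window are averaged.

```python
def window_average(coverage, i, n):
    """Mean of the unsmoothed coverage values in the length-n window around index i.

    Assumed convention: the window covers indices [i - n//2, i - n//2 + n).
    """
    start = i - n // 2
    end = start + n  # exclusive
    if start < 0 or end > len(coverage):
        raise ValueError("window extends past the ends of the contig")
    return sum(coverage[start:end]) / n


# Toy per-position coverage; the spike at index 6 lies outside the window used below.
cov = [10, 12, 11, 9, 8, 10, 30, 10]
smoothed = window_average(cov, 3, 4)  # averages indices 1..4
```

In practice N would be on the order of 1000 bases, and a running sum would avoid recomputing each window from scratch.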

[0107] In one embodiment, a set of adjustment factors is calculated from such smoothed coverage. GC content is calculated in 1000-base-pair windows (i.e., N = 1000), every 1000 bases along each reference contig. Each window is assigned to one of 1000 bins based on the number of G's and C's present in the portion of the reference covered by the window. Let W be the set of tabulated windows (identified by their center positions), and let W_b be the set of windows with [G+C] = b. The mean uncorrected coverage d̄_b for each bin b is determined as:

[0108]

Figure CN104428425AD00163

[0109] Let d̄ be the mean coverage over the whole genome (i.e., averaged over the windows in W). For each GC bin b, the correction factor f_b is defined as:

[0110] $f_b = \bar{d} / \bar{d}_b$

[0111] In another embodiment, the correction factors may be estimated using additional smoothing operations, which may, for example, provide greater robustness to small-sample variation or overfitting. For example, the terms f_b may be smoothed using curve fitting, piecewise regression, sliding-window averaging, LOESS, or the like.

Figure CN104428425AD00171

[0113] The corrected, smoothed coverage for the 1000-base window centered at position i (assigned to bin b_i) is then calculated as follows:

Figure CN104428425AD00172
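As a rough illustration of Method 1, the following Python sketch bins windows by their G+C count, computes a per-bin correction factor as the genome-wide mean coverage divided by the bin mean, and multiplies each window's coverage by the factor of its bin. The function name `gc_bin_correction` and the toy data are hypothetical, and multiplying by the factor is an assumption that is consistent with the factor definition above rather than a transcription of the original equation.

```python
from collections import defaultdict


def gc_bin_correction(window_cov, window_gc):
    """Return (factors, corrected) for parallel lists of window coverage and G+C counts."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for d, b in zip(window_cov, window_gc):
        sums[b] += d
        counts[b] += 1
    genome_mean = sum(window_cov) / len(window_cov)
    # Factor for bin b: genome-wide mean divided by the mean coverage of bin b.
    factors = {b: genome_mean / (sums[b] / counts[b]) for b in sums if sums[b] > 0}
    # Windows in a bin with zero total coverage have no factor and are left as None.
    corrected = [d * factors[b] if b in factors else None
                 for d, b in zip(window_cov, window_gc)]
    return factors, corrected


cov = [80.0, 120.0, 100.0, 100.0]  # smoothed coverage per 1000-base window
gc = [300, 600, 450, 450]          # number of G or C bases per window
factors, corrected = gc_bin_correction(cov, gc)
```

In this toy case the under-covered low-GC window and the over-covered high-GC window are both brought back to the genome-wide mean of 100.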

[0115] The corrected, smoothed coverage for a larger window of length l = η*1000 (η a positive integer) can be calculated as the mean of the values for the constituent 1000-base windows.

[0116] Beyond the above, it should be clear that many variations of these embodiments are possible. The window size and shift can be changed. Certain positions can be ignored (with the corresponding windows either expanded to obtain a fixed number of acceptable positions, or averaged over the acceptable positions only) based on a variety of characteristics, such as structural annotations (e.g., repeats), excessive or insufficient variability across multiple samples, accessibility/uniqueness under the criteria used for mapping, depth of coverage in simulated data (a measure of mappability), and so on. The arithmetic mean can be replaced by the median, mode, or another summary statistic where appropriate. Correction factors can be computed from the coverage at individual positions rather than from window-averaged coverage, with smoothing/averaging applied after correction rather than before.

[0117] This class of exemplary methods can be extended to take multiple predictors of coverage into account by computing correction factors for multidimensional bins of genome positions. For example, one can consider not only GC content at the scale of the whole DNB but also GC content at the scale of an individual DNB arm. Alternatively, a separate correction factor can be computed for each predictor, which corresponds to an assumption that the effects are independent.

[0118] Method 2: Mapping-level coverage correction

[0119] In a second method of bias correction and smoothing, individual mappings are given additional weight factors to compensate for bias prior to smoothing. DNBs (mappings) that are more likely to arise because of bias than expected under uniform random sampling are down-weighted, while DNBs that are less likely to arise because of bias are up-weighted (and may contribute more than a whole count to the coverage calculation). Let q_m be the correction factor for mapping m (defined below); the corrected coverage at position i can be calculated as:

Figure CN104428425AD00173

[0121] The correction factors q_m are determined from odds ratios derived from a logistic regression model fitted to discriminate mappings in a real data set from mappings in a data set simulated by uniform random sampling of the reference genome. The model predicts whether a given mapping is real or simulated based on a first-order B-spline (piecewise linear) with knots at every fifth percentile of the GC content distribution in the combined (real + simulated) data set. For example, the corresponding R code is:

[0122] model <- glm(isReal ~ bs(dnbGCpcnt, df = 20, degree = 1, Boundary.knots = c(0, 1)), data = d, family = binomial)  # bs() is provided by the splines package

[0123] where the input data set d consists of records for equal numbers of unique paired-end simulated mappings and unique paired-end real mappings. For simulated records, isReal = 0; for real records, isReal = 1. dnbGCpcnt is the GC percentage of the portion of the reference spanned by the mapping.

[0124] Given the resulting model, the correction factor q_m is taken to be the model-predicted simulated:real odds ratio for the GC percentage of mapping m. Thus, if a given GC percentage is three times as likely in the real data as in the simulated data, real mappings with that GC content are weighted by a factor of 1/3.
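The conversion from a fitted model to a mapping weight can be sketched in Python as follows. The function names `mapping_weight` and `weighted_coverage` are hypothetical, and computing the corrected coverage at a position as the sum of the weights of the mappings that cover it is an assumption consistent with the description of Method 2, not a transcription of the original equation.

```python
def mapping_weight(p_real):
    """Simulated:real odds for one mapping, given the model's predicted
    probability that a mapping with its GC percentage is real."""
    if not 0.0 < p_real < 1.0:
        raise ValueError("predicted probability must lie strictly between 0 and 1")
    return (1.0 - p_real) / p_real


def weighted_coverage(weights_of_covering_mappings):
    """Corrected coverage at a position: sum of the weights of the mappings covering it."""
    return sum(weights_of_covering_mappings)


# With equal numbers of real and simulated records, a GC percentage that is three
# times as likely in real data has a predicted probability of being real of 0.75.
q = mapping_weight(0.75)
coverage_at_i = weighted_coverage([q, q, mapping_weight(0.5)])
```

A mapping that the model cannot distinguish from simulation (probability 0.5) keeps a weight of 1, so unbiased regions are left unchanged.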

[0125] Similar odds ratios can be determined using logistic models that account for many characteristics of a mapping, including factors such as:

[0126] • composition of the whole fragment (~500 bp);

[0127] • composition of the genomic segments in the final DNB (~80 bp);

[0128] • choice of base at each position in the final DNB;

[0129] • oligomers at specific positions in the initial fragment;

[0130] • sequence adjacent to the adaptors (e.g., affecting ligation efficiency);

[0131] • sequence at the usual positions of restriction sites;

[0132] • predicted physical characteristics;

[0133] • melting temperature;

[0134] • flexibility/curvature;

[0135] • measured/measurable/predicted characteristics of the genomic region, such as histone binding and methylation.

[0136] The model may include not only linear effects of individual measurements but also various transformations of individual measurements (e.g., piecewise-linear or polynomial fits, or binning) and interaction terms.

[0137] In one embodiment, the model-corrected coverage is then smoothed by sliding-window averaging and rounded to an integer. The window width is configurable; the default is 2 kb. By default, mean coverage is reported for adjacent windows (i.e., a window shift equal to the window width), but other shifts can be applied. The mean corrected coverage is reported at the midpoint of each window.

[0138] Each contig (or contiguous locus region) of the reference genome is processed separately, so that with the default width of 2 kb, each contig of length > 2 kb yields coverage values at 1 kb, 3 kb, 5 kb, ... relative to the start of the contig. Thus, for such a position i, the smoothed coverage is given by:

Figure CN104428425AD00181

[0140] The first window of each contig begins at the first base of the contig; the window is shifted until its end passes the end of the contig. Because the starting position of a contig relative to its chromosome can be arbitrary, the reported chromosomal position of a given window may not be a round number.
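The tiling of a contig into adjacent windows can be sketched as follows. The function name `tile_windows` is hypothetical, and the sketch assumes that a trailing partial window is simply dropped, which is one reading of shifting the window until its end passes the end of the contig.

```python
def tile_windows(contig_cov, width=2000):
    """Return [(midpoint_offset, rounded mean coverage)] for adjacent,
    non-overlapping windows; offsets are relative to the contig start."""
    out = []
    start = 0
    while start + width <= len(contig_cov):  # stop once the window would pass the contig end
        window = contig_cov[start:start + width]
        # Python's round() rounds exact halves to the nearest even integer.
        out.append((start + width // 2, round(sum(window) / width)))
        start += width  # shift equal to the window width (the stated default)
    return out


# A 5 kb contig with the default 2 kb width yields values at 1 kb and 3 kb only.
midpoints = [m for m, _ in tile_windows([5] * 5000)]
small = tile_windows([1, 1, 1, 1, 3, 3, 3, 3, 9], width=4)
```

To report chromosomal rather than contig-relative positions, the contig's start offset on its chromosome would be added to each midpoint, which is why reported positions need not be round numbers.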

[0141] Method 3: GC normalization procedure

[0142] In one embodiment, computer logic (e.g., the CNV caller 18 of FIG. 1 and/or its components, such as GC correction logic 34) estimates and corrects coverage bias as follows.

[0143] First, GC content is calculated for a 1000-base window centered on each position of the genome (excluding positions less than 500 bases from a contig end). For example, the function GC(j) may be set to 1 if the base at position j is G or C and to 0 otherwise, and the GC content gc_i at position i may be calculated as follows:

[0144]

Figure CN104428425AD00182
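The per-position GC count can be computed efficiently with prefix sums, as in the Python sketch below. The function name `gc_content` and the half-open window convention [i - half, i + half) are assumptions for illustration.

```python
def gc_content(seq, half=500):
    """Return {i: number of G/C bases in seq[i-half : i+half]} for every
    position i that is at least `half` bases from both ends of the contig."""
    prefix = [0]
    for base in seq:
        # GC(j) is 1 for G or C and 0 otherwise; prefix holds the running total.
        prefix.append(prefix[-1] + (1 if base in "GCgc" else 0))
    return {i: prefix[i + half] - prefix[i - half]
            for i in range(half, len(seq) - half + 1)}


# Tiny example with a 4-base window (half=2) instead of 1000 bases.
toy = gc_content("ACGTGC", half=2)
```

With the default `half=500`, each value is a count between 0 and 1000, so it can serve directly as the GC value used to group positions in the following steps.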

[0145] Positions less than 500 bases from either end of a contig are not considered during estimation of the GC correction factors.

[0146] Next, for each possible GC value Y, the mean coverage of the positions with gc_i = Y is determined. Letting n_Y be the number of positions i in the genome with gc_i = Y, the mean coverage can be calculated as follows:

[0147]

Figure CN104428425AD00191

[0148] In an example implementation, positions with coverage > 500 may be excluded.

[0149] Next, the two steps above are carried out for simulated data. Using a superscript to denote simulated results, the simulated mean coverage can be determined as follows:

[0150]

Figure CN104428425AD00192

[0151] It should be noted that this result is not entirely uniform, because the GC content of ubiquitous repeats, microsatellite regions, and the like is not similar to that of the genome as a whole.

[0152] Next, the ratio of sample coverage to simulated coverage is calculated for each GC value, adjusting for the overall mean coverage of the sample and of the simulation, respectively. For example, the ratio can be calculated as follows:

[0153]

Figure CN104428425AD00193

[0154] Next, a smoothed coverage ratio is obtained as a function of GC; for example, locally weighted polynomial regression may be used as follows:

Figure CN104428425AD00194

[0156] As the local regression operation, LOESS smoothing is performed except in numerically unstable regions, where LOWESS is performed instead.

[0157] Next, the GC-corrected (single-base) coverage at each position of the genome is calculated as follows:

Figure CN104428425AD00195

[0159] Near contig ends, the 'missing bases' are filled in with the genome-wide average GC content. If the GC content of the window at a given position is too extreme (i.e., < 20% or > 80% GC), the coverage value is treated as unknown (e.g., as missing data).
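The ratio and correction steps of Method 3 can be sketched in Python as below. Both formulas are assumptions reconstructed from the prose: the ratio divides the sample's mean coverage at a GC value by the simulated mean at that value after scaling each by its own overall mean, and the correction divides single-base coverage by the ratio. The function names are hypothetical, and the LOESS smoothing of the ratios is deliberately omitted.

```python
def gc_ratios(sample_mean_by_gc, sim_mean_by_gc, sample_mean, sim_mean):
    """Ratio of sample to simulated coverage per GC value, each scaled by its overall mean."""
    ratios = {}
    for gc, d in sample_mean_by_gc.items():
        s = sim_mean_by_gc.get(gc)
        if s:  # skip GC values with no (or zero) simulated coverage
            ratios[gc] = (d / sample_mean) / (s / sim_mean)
    return ratios


def gc_corrected(cov, gc, ratios, window=1000):
    """Divide single-base coverage by the ratio for its window's GC count.

    Returns None (missing data) when GC is below 20% or above 80% of the
    window, or when no usable ratio exists for that GC value.
    """
    if gc < 0.20 * window or gc > 0.80 * window:
        return None
    r = ratios.get(gc)
    if not r:
        return None
    return cov / r


ratios = gc_ratios({400: 50.0, 500: 100.0}, {400: 20.0, 500: 20.0}, 100.0, 20.0)
fixed = gc_corrected(50.0, 400, ratios)    # under-covered GC value is scaled up
dropped = gc_corrected(50.0, 100, ratios)  # 10% GC is too extreme
```

In a fuller implementation the raw ratios would be replaced by their smoothed values before the division, so that sparsely populated GC values do not produce noisy corrections.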

[0160] Window-smoothing is performed by taking the mean of the values at each position i within a given window. Window boundaries are chosen as defined in the section below entitled "Window boundary definition", so that the windows tile the genome (adjacent, non-overlapping). That is, for the window corresponding to the interval [i, j), the mean corrected coverage is calculated as:

Figure CN104428425AD00196

[0162] It should be noted that, for notational convenience, the subscript "j" is omitted in the sections below, so that the window quantity carries only the subscript i, because at most one window starts at position i.

[0163] 3. Normalizing coverage by comparison with baseline samples

[0164] In various embodiments, the operations, calculations, and method steps described in this section (Section 3) can be performed by computer logic such as, for example, the CNV caller 18 of FIG. 1 and/or its components, such as ploidy-aware correction logic 36.

[0165] In some embodiments, coverage bias that is not corrected by the adjustments described above can be taken into account by comparison with a baseline sample. However, to obtain coverage proportional to absolute copy number, the baseline sample can be adjusted according to the copy number in that sample.

[0166] Letting d'_i and p_i be the coverage and ploidy of the baseline sample at position i, and d be an estimate of the typical diploid coverage of the baseline sample, a bias correction factor for position i can be determined as follows:

[0167]

Figure CN104428425AD00201

[0168] (In one embodiment, the typical diploid coverage estimate is taken to be the 45th percentile of the windows in the autosomes.) The normalized corrected coverage is then calculated as follows:

Figure CN104428425AD00202

[0170] If p_i = 0 (in which case any baseline coverage at that position is attributable to mismapping, and coverage at this position is not a reliable indicator of behavior), the normalized corrected coverage is treated as missing. Bias correction of this kind, based on the known or assumed ploidy and coverage at a position in the baseline sample, is referred to herein as "ploidy-aware baseline correction". Specifically, ploidy-aware baseline correction adjusts the coverage value at a given position of the baseline or reference sample, based on the ploidy and coverage detected at each position (or locus) of the target polynucleotide sequence of the target sample, as a component of using baseline values to correct the coverage of the sample being analyzed.
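One way to picture ploidy-aware baseline correction is the Python sketch below. The formulas are assumptions chosen to be consistent with the prose: the bias factor is the baseline coverage divided by the coverage expected for the baseline's ploidy (the typical diploid coverage scaled by p_i / 2), and the sample coverage is divided by that factor. The function names are hypothetical.

```python
def baseline_factor(base_cov, base_ploidy, diploid_cov):
    """Baseline coverage relative to what its ploidy predicts; None if ploidy is 0."""
    if base_ploidy == 0:
        return None  # baseline coverage here can only come from mismapping
    expected = diploid_cov * base_ploidy / 2.0
    return base_cov / expected


def normalize(sample_cov, factor):
    """Divide the sample's corrected coverage by the bias factor; None means missing."""
    if factor is None or factor == 0:
        return None
    return sample_cov / factor


# A position over-covered by 20% in a diploid baseline region ...
f_diploid = baseline_factor(60.0, 2, 50.0)
# ... and the same relative bias where the baseline carries a single copy.
f_haploid = baseline_factor(30.0, 1, 50.0)
normalized = normalize(120.0, f_diploid)
```

Both factors are 1.2 in this toy case, which is the point of scaling by ploidy: a true single-copy region in the baseline is not mistaken for a region of low coverage bias.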

[0171] In some embodiments, the sequences of a set of samples, rather than a single sample, can be used as the baseline in order to reduce sensitivity to fluctuations due to sampling (statistical noise) or to library-specific bias. For example, the following can be used for a set of baseline samples S:

Figure CN104428425AD00203

[0174] where p_i is the ploidy at window i. Ideally, this would be the true ploidy of the baseline sample for this window. However, because it is unknown, it must be estimated.

[0175] Thus, in one embodiment, the baseline generation process includes CNV calling on each baseline genome using a simulation in which the autosomal copy number is 2 and the sex chromosomes are appropriate to the sex of the sample. Using a simulation as the baseline provides an indirect way of correcting for variation in the mappability of the genome, for example in regions corresponding to high-copy, high-identity repeats that "overflow" during mapping. However, this may not address coverage bias due to biochemistry. In regions of moderate coverage bias, if the length scale of the bias is short relative to the window length, the baseline genome will be called with the correct ploidy, and the correction factor will therefore compensate appropriately for the bias. However, regions with sustained bias that moves coverage away from that of the correct ploidy by more than 50% of the diploid mean will have their copy number miscalled in the baseline genome; this leads to a baseline "correction" that reinforces the tendency to call a CNV at that location, i.e., leads to strong/consistent miscalls of abnormal ploidy. In other embodiments, the ploidy estimates for the baseline genomes can be based on external information (e.g., array-based CNV calls), manual curation, or automated processing that attempts to determine population patterns by analyzing multiple genomes simultaneously.

[0176] In other embodiments, the typical diploid coverage estimate d can be determined in a variety of ways. For example, it can be taken to be the median coverage of positions estimated to have ploidy 2 in a prior ploidy estimate for the baseline sample, a model-based coverage value, or some fixed percentile of genome-wide coverage (possibly adjusted for male versus female samples). A set of samples, rather than a single sample, can be used as the baseline. In this case, d'_i and p_i may be taken as the sums of the coverage and ploidy of all baseline samples at reference position i, and d may be taken as the sum of the samples' typical diploid coverages. Alternatively, the mean or median of the values computed for each of several baseline samples can be used, in order to provide an estimate that is less sensitive to differences in coverage among the baseline samples.

[0177] If no sample is supplied as a baseline, the bias correction factor is simply set as follows:

Figure CN104428425AD00211

[0179] 4A. HMM segmentation, scoring, and output for normal samples

[0180] In various embodiments, the operations, calculations, and method steps described in this section (Section 4A) can be performed by computer logic such as, for example, the CNV caller 18 of FIG. 1 and/or its components, such as HMM model logic 20.

[0181] In certain aspects, there are many methods for segmenting quantitative time series that can be applied to CNV calling, i.e., to the coverage data produced by the sequence of steps above. Hidden Markov models (HMMs) provide one such approach with certain attractive properties (well-defined model-fitting methods, flexible models, natural confidence measures, the ability to constrain the model, and the ability to incorporate a variety of coverage-generation models), in which states correspond to copy number levels, emissions are some form of coverage (observed/corrected/relative), and transitions between states are changes in copy number. Emission probabilities can be modeled as Poisson, negative binomial, a mixture of Poissons, a piecewise model fitted to data, and so on. Model selection can be performed using goodness-of-fit measures and cross-validation. In one embodiment, it is desirable to smooth the per-position coverage values over longer (sliding) windows, although the window width is ideally much narrower than the minimum event size of interest. In one embodiment, it is desirable to constrain the model in various ways, for example by requiring that the expected outputs of the copy number levels (e.g., the means of the emission distributions of the HMM states) be consistent multiples of one another, as expected from discrete copy number changes. In one embodiment, it is desirable to include in the expected coverage distribution a component corresponding to "contamination" of a tumor sample with normal tissue, or to capture tumor heterogeneity, for example using a mixture model.

[0182] In another aspect, it is possible to integrate other signals (e.g., their parameters and values) into CNV detection, or to use other signals (e.g., data values) to confirm or filter the output of a coverage-based CNV detector. Such other signals include the presence of anomalous mate pairs at the boundary between two copy number levels, or changes in allele balance at heterozygous positions.

[0183] In yet another aspect, a specific HMM-based method can be used to estimate copy number as a function of reference genome position. For example, GC-corrected, window-averaged, normalized coverage data can be input to an HMM whose states correspond to integer ploidy (copy number). Copy number along the genome can be estimated as the ploidy of the most likely sequence of model states. Various scores are computed based on the posterior probabilities generated by the HMM. This is described in more detail below.

[0184] Model definition:

[0185] A fully connected HMM with states corresponding to ploidy 0, ploidy 1, ploidy 2, ..., ploidy 9, and ploidy "10 or more" is defined by matrices of transition probabilities, initial state probabilities, and emission probabilities. (In various embodiments, the exact number of states can be modified.)

[0186] The coverage distributions (i.e., the state emission probabilities) are modeled as negative binomials, which can be parameterized by the mean and variance of each state's distribution.
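A negative binomial specified by its mean and variance can be converted to the more common (r, p) form, as in the Python sketch below. The (r, p) convention used here, with mean r(1 - p)/p and variance r(1 - p)/p^2, is a standard one chosen for illustration; the source does not state which convention its implementation uses, and the function names are hypothetical.

```python
import math


def nb_params(mean, variance):
    """Convert (mean, variance) to (r, p), where mean = r(1-p)/p and variance = r(1-p)/p^2."""
    if variance <= mean:
        raise ValueError("a negative binomial requires variance > mean")
    p = mean / variance
    r = mean * mean / (variance - mean)
    return r, p


def nb_log_pmf(k, r, p):
    """Log probability of observing coverage k under the (r, p) negative binomial."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))


r, p = nb_params(40.0, 120.0)  # e.g., a state with mean coverage 40 and variance 120
total = sum(math.exp(nb_log_pmf(k, r, p)) for k in range(2000))
```

Working with the log probability keeps emission likelihoods numerically stable when they are multiplied along long runs of windows.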

[0187] Model estimation:

[0188] In principle, all of the model parameters can be estimated by expectation maximization (EM) using the Baum-Welch algorithm; in practice, however, unconstrained estimation (especially of the coverage distributions) does not always give satisfactory results. To address this, in one embodiment, initial values are chosen and subsequent updates are constrained to reflect the following assumptions: coverage is assumed to depend on the copy number of a given reference segment in the genome of interest; copy number is assumed to be integer-valued; coverage is assumed to be linearly related to copy number; most of the genome is assumed to be diploid, so that a "typical" value for the autosomes can be used to determine the mean coverage for ploidy = 2; for states corresponding to ploidy >= 1, the standard deviation of a state is assumed to be proportional to the mean of the state; and for the state corresponding to ploidy = 0, a separate variance can be used to account for the effects of mismapping and non-unique mappings. Given these constraints and assumptions, the coverage distributions have only two free parameters: a single value relating coverage to standard deviation for ploidy >= 1, and a variance parameter for ploidy = 0.

[0189] In one embodiment, the transition probabilities can be estimated from the data, but the default behavior is to keep the initial values. The initial values can be set by the user; if they are not set, the transition probability between distinct states may default to 0.01, meaning that, for any distinct states i and j, if the model is in state i at time t, there is a 1% probability that it is in state j at time t+1. In another embodiment, the transition probabilities can be estimated from the data, but the risk of overfitting is high. A set of default values can therefore be used, such that the probability of a transition from one state to any other given state at any "time" (window) is set to 0.003, and the probability of remaining in a given state is taken to be 1 - 0.003*10 = 0.97.
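The default transition matrix described here can be built as in the following Python sketch. The function name `transition_matrix` is hypothetical; the values (11 states, 0.003 per off-diagonal entry, 0.97 on the diagonal) come from the text.

```python
def transition_matrix(n_states=11, p_change=0.003):
    """Fully connected transition matrix with a uniform probability of moving
    to each other state and the remainder on the diagonal."""
    stay = 1.0 - p_change * (n_states - 1)
    return [[stay if i == j else p_change for j in range(n_states)]
            for i in range(n_states)]


A = transition_matrix()
initial = [1.0 / len(A)] * len(A)  # initial state probabilities: 1 / number of states
```

Each row sums to 1, and the high diagonal value encodes the expectation that copy number changes are rare from one window to the next.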

[0190] The initial state probabilities are all set to 1 divided by the number of states.

[0191] The mean of the emission (coverage) distribution of the state with ploidy n is initialized as follows, except as noted below:

[0192]

Figure CN104428425AD00221

[0193] where the median is taken over all positions at which the normalized, smoothed, corrected coverage has been calculated. To account for the presence of some apparent coverage due to mismapping, in one embodiment μ_0 is set to 1, i.e., μ_0 = 1; in another embodiment, μ_0 may be set to μ_0 = 0.1. The initial estimates of the means are not updated during subsequent model fitting.

[0194] The initial variance of the ploidy 2 state is set to:

Figure CN104428425AD00222

[0197] In some embodiments, the variances of the other states are set so that the standard deviation is proportional to the mean:

[0198] $\sigma_n^2 = \sigma_2^2 \cdot (n/2)^2$
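The initialization of the state means and variances can be sketched in Python as follows. The mean formula (n times half the median coverage) is an assumption consistent with the stated linear relation between coverage and copy number and with the median representing ploidy 2; the ploidy 2 variance is taken as an input because its formula is not reproduced here. The function name and arguments are hypothetical.

```python
def init_state_params(median_cov, var2, max_ploidy=10, mu0=1.0):
    """Return (means, variances) keyed by ploidy state 0..max_ploidy.

    The ploidy 0 state gets mean mu0 and no variance here, since its variance
    is a separate free parameter of the model.
    """
    means, variances = {}, {}
    for n in range(max_ploidy + 1):
        means[n] = mu0 if n == 0 else n * median_cov / 2.0
        # Standard deviation proportional to the mean: var_n = var_2 * (n / 2)^2.
        variances[n] = var2 * (n / 2.0) ** 2 if n >= 1 else None
    return means, variances


means, variances = init_state_params(median_cov=40.0, var2=100.0)
```

With a median coverage of 40, the ploidy 4 state starts with mean 80 and four times the ploidy 2 variance, so its standard deviation is exactly double.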

[0199] In another embodiment, the initial variance of the negative binomial may be set as follows:

[0200] σξ = d*J-Itt

[0201] The parameters that determine the variances are updated by EM until the model 'converges', e.g., until the change in the log-likelihood of the data given the model between successive iterations is sufficiently small, for example below a certain threshold.

[0202] In another aspect, the initial variance estimates can be updated during model fitting (using a modified EM that constrains the means), but are constrained never to be smaller than the values above. The model operates under the following assumptions: most of the genome is diploid; the median of the overall distribution will be close to the median and mean of the diploid portion of the genome; and copy number is strictly integer-valued. In this regard, adjustments will be needed over time in order to estimate copy number for highly aneuploid samples, for tumors with substantial "normal contamination", and for regions that are non-unique in the reference.

[0203] The update procedure is allowed to iterate until it 'converges', e.g., until the log-likelihood of the data given the model changes by less than 0.001 between successive iterations.

[0204] Ploidy inference, segmentation, and scoring:

[0205] In another embodiment, after the estimation procedure converges, the usual HMM inference computations are performed. The final result is based on the most likely state at each position. (The standard alternative is to assign the ploidy of the states corresponding to the most likely single path.)

[0206] In one embodiment, the "calledPloidy" at each position of the input is taken to be the ploidy of the most likely state at that position. The "ploidyScore" is a phred-like score (i.e., a log-based score measured in decibels, dB) reflecting the confidence that the called ploidy is correct. The "CNVTypeScore" is a phred-like score reflecting the confidence that the called ploidy correctly indicates whether the position has decreased ploidy, expected ploidy, or increased ploidy relative to the nominal expectation (diploid, except that the sex chromosomes are expected to be haploid in males). Additional scores at each position ("score ploidy = 0", "score ploidy = 1", etc.) reflect the probability of each possible ploidy; the score for each state is int(10 log10(L_is)), where L_is is the likelihood of state s at position i.
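The per-position call and per-state scores can be sketched in Python as below. The function name `call_position` is hypothetical, and the sketch covers only the quantities defined explicitly in the text (the most likely state and int(10 log10(L_is))); the ploidyScore and CNVTypeScore are not computed because their definitions are given elsewhere.

```python
import math


def call_position(likelihoods):
    """likelihoods[s] is the likelihood of ploidy state s at one position.

    Returns (calledPloidy, per-state scores). Python's int() truncates toward
    zero, which is assumed to match the int(...) in the text.
    """
    called = max(range(len(likelihoods)), key=lambda s: likelihoods[s])
    scores = [int(10 * math.log10(l)) if l > 0 else None for l in likelihoods]
    return called, scores


# Toy three-state example: ploidy 1 is the most likely state.
called, scores = call_position([0.05, 0.5, 0.45])
```

Because likelihoods are at most 1, the per-state scores are zero or negative, with values closer to zero indicating more probable states.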

[0207] In another embodiment, a "segment" is a run of adjacent positions having the same called ploidy. The 'begin' and 'end' positions of a segment are taken to lie outward of the midpoints of its first and last windows. Each segment is given a ploidyScore equal to the mean of the ploidyScores of the positions in the segment, and a CNVTypeScore equal to the mean of the CNVTypeScores of the positions in the segment.
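Grouping positions into segments amounts to a run-length pass over the called ploidies, as in the Python sketch below. The function name `build_segments` and the output field names are hypothetical, and the sketch reports the first and last window midpoints of each run rather than extending them outward to the window boundaries.

```python
def build_segments(positions, called, ploidy_scores, cnv_type_scores):
    """Merge runs of adjacent positions with equal called ploidy and average their scores."""
    out = []
    start = 0
    for k in range(1, len(called) + 1):
        # Close the current run at the end of the input or when the ploidy changes.
        if k == len(called) or called[k] != called[start]:
            n = k - start
            out.append({
                "first": positions[start],
                "last": positions[k - 1],
                "ploidy": called[start],
                "ploidyScore": sum(ploidy_scores[start:k]) / n,
                "CNVTypeScore": sum(cnv_type_scores[start:k]) / n,
            })
            start = k
    return out


segs = build_segments([1000, 3000, 5000, 7000], [2, 2, 3, 3],
                      [30, 40, 10, 20], [50, 50, 5, 15])
```

To obtain begin and end coordinates as described in the text, half a window width could be subtracted from "first" and added to "last", assuming positions are window midpoints.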

[0208] The precise definitions of, and rationale for, the above scores are given in the section below entitled "Score computation".

[0209] 4B. Modifications to HMM segmentation, scoring, and output for tumor samples (tumor CNV method)

[0210] In various embodiments, the operations, calculations, and method steps described in this section (Section 4B) can be performed by computer logic such as, for example, the CNV caller 18 of FIG. 1 and/or its components, such as HMM model logic 20.

[0211] In certain aspects, copy number calling in tumor samples poses several challenges to the methods described so far. Because of the possibility of a high average copy number, it is unwise to assume that diploid ("normal") regions of the genome have coverage close to the sample median. Even if the typical coverage of diploid regions can be determined (e.g., through analysis of minor allele frequencies), the expected change in coverage for a single-copy gain or loss is not necessarily 50% of that value, because of the possibility of an unknown amount of contamination from adjacent or intermixed normal cells ("normal contamination"). Moreover, even among the tumor cells, segments of the genome may not be characterizable by an integer copy number because of tumor heterogeneity.

[0212] It is therefore useful to relax the assumption that restricts the coverage levels of the model states, allowing the coverage ratios to be continuous-valued. This adds the challenge of finding the correct values and also raises the question of how many states to include, so that the analysis includes a model selection component. The goal of the analysis is accordingly modified to segmenting the genome into regions of uniform "abundance class", without forcing an interpretation of a given class as an integer copy number.

[0213] In principle, HMMs with different numbers of states could simply be fitted, using EM to determine the expected coverage level of each state, and the number of states giving the best fit could be chosen. In practice, unconstrained estimation of the model parameters for any given number of states is not a robust process. To address this, in another aspect, an additional initial step or module is introduced that estimates the number of states and their means from the overall coverage distribution, together with a further step that refines the initial model by sequentially adding states to the model and then sequentially removing states from it.

[0214] Initial model generation:

[0215] In one embodiment, the (corrected, normalized, window-averaged) coverage distribution of the whole genome to be segmented is a mixture of the distributions of the different abundance classes. One method of identifying distinct abundance classes is to use input data representing the initial states (or peaks) generated by computer logic that executes the model for interpreting the absolute copy number of complex tumors (as described previously). The resulting peak positions P are used as the states of the initial model, with the expected coverage value of each state equal to the center of the corresponding peak. The variances can be estimated using EM (in the same way as the constrained model fitting described above for the segmentation of normal samples).

[0216] One method of identifying distinct abundance classes is to look for peaks in the smoothed whole-genome coverage distribution. In another embodiment, an alternative is to identify a mixture model that closely fits the observed coverage distribution. An improvement on direct peak identification is obtained by applying the quantile function of the normal distribution to the cumulative distribution function (cdf) and then taking the differences between successive values before smoothing and peak detection. The latter approach provides better sensitivity for identifying small peaks away from the central abundance classes.

[0217] For example, given a histogram of coverage $H = h_1, h_2, \ldots, h_n$, where $h_i$ is the number of positions with coverage $i$ and $n$ is the smallest value such that less than 0.001 of the full histogram is truncated, and letting $Q(p)$ be the quantile function of the normal distribution, the resulting peak positions P can be computed as follows:

Figure CN104428425AD00241

[0219] $c_i = \sum_{j \le i} h_j / N$

[0220] $q_i = Q(c_i)$

[0221] $d_i = q_i - q_{i-1}$

[0222] $D = d_1, d_2, \ldots, d_n$

[0223] $S = \mathrm{smooth}(D)$

[0224] $s_i = S(i)$

[0225]

Figure CN104428425AD00251

[0226] $P = \{\, i \mid m_i = 1 \text{ and } d_i \ge 0.002 \,\}$

[0227] The resulting peak positions P are used as the states of the initial model, with the expected coverage value of each state equal to the center of the corresponding peak. The variances can be estimated using EM (in the same way as the constrained model fitting described above for the segmentation of normal samples).
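The peak-finding steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `peak_positions`, the moving-average smoother, the clipping of the cumulative fraction away from 0 and 1, and the local-maximum rule used for $m_i$ are assumptions made for the example and are not taken from the disclosure.

```python
from statistics import NormalDist


def peak_positions(hist, min_step=0.002, half_width=1):
    """Return indices i that look like peaks of the transformed histogram.

    hist[i] is the number of positions with coverage i (already truncated).
    min_step plays the role of the 0.002 threshold on d_i.
    half_width is the half-width of an assumed moving-average smoother.
    """
    total = float(sum(hist))
    quantile = NormalDist().inv_cdf
    eps = 0.5 / total  # assumption: keep the cdf strictly inside (0, 1)

    # q_i = Q(c_i), with c_i the cumulative fraction up to coverage i
    q, running = [], 0
    for count in hist:
        running += count
        c = min(max(running / total, eps), 1.0 - eps)
        q.append(quantile(c))

    # d_i = q_i - q_{i-1}; d_0 has no predecessor and is set to 0 here
    d = [0.0] + [q[i] - q[i - 1] for i in range(1, len(q))]

    # S = smooth(D): simple centred moving average (assumed smoother)
    s = []
    for i in range(len(d)):
        lo, hi = max(0, i - half_width), min(len(d), i + half_width + 1)
        s.append(sum(d[lo:hi]) / (hi - lo))

    # m_i = 1 where s_i is a local maximum (assumed rule); keep d_i >= min_step
    return [i for i in range(1, len(s) - 1)
            if s[i] > s[i - 1] and s[i] >= s[i + 1] and d[i] >= min_step]
```

Each returned index would then seed one state of the initial model, with its expected coverage set to that coverage value.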

[0228] Model refinement:

[0229] In another embodiment, once the initial model has been inferred in this way, the model is refined iteratively. First, additional states are evaluated. The addition of a state between each pair of successive states (abundance classes ordered by expected coverage) is evaluated, and the addition is accepted if the improvement in likelihood (Pr(data | model)) exceeds a threshold. That is, between each pair of successive states $i$ and $j$ with expected coverages $c_i$ and $c_j$, an attempt is made to add a state $i'$ with initial coverage $c_{i'} = (c_i + c_j)/2$. The value $c_{i'}$ is optimized using EM with the expected coverage levels of all other (pre-existing) states held fixed. If the optimization yields a value outside the interval $(c_i, c_j)$, or if the improvement in Pr(data | model) does not exceed the acceptance threshold, the addition is rejected; otherwise, it is accepted. If the addition is accepted, an attempt is made to add another state between $i$ and $i'$, recursing until no further addition is accepted. The addition process terminates once additions between all pairs of successive states have been rejected. Second, the removal of states is evaluated. States are removed from the model one at a time, and the resulting model is optimized using EM; if the resulting model is not significantly worse than the previous model, the state removal is accepted.
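The control flow of this refinement can be sketched as follows. The sketch is deliberately simplified and makes several assumptions: `loglik` and `optimise` are hypothetical callbacks standing in for the HMM likelihood and the EM step, the toy versions below score a plain list of coverage values rather than an HMM, and the removal pass does not re-optimize the remaining levels.

```python
def add_states(levels, loglik, optimise, threshold):
    """Greedy state addition between successive coverage levels.

    loglik(levels) stands in for log Pr(data | model).
    optimise(levels, start) stands in for EM on the new level only,
    with all existing levels held fixed.
    """
    levels = sorted(levels)
    current = loglik(levels)
    i = 0
    while i < len(levels) - 1:
        lo, hi = levels[i], levels[i + 1]
        candidate = optimise(levels, (lo + hi) / 2.0)
        trial = sorted(levels + [candidate])
        gain = loglik(trial) - current
        if lo < candidate < hi and gain > threshold:
            # accepted: the next pass retries between state i and the new state
            levels, current = trial, current + gain
        else:
            i += 1  # rejected: move on to the next pair of successive states
    return levels


def remove_states(levels, loglik, tolerance):
    """Drop states one at a time while the fit is not significantly worse."""
    levels = sorted(levels)
    current = loglik(levels)
    i = 0
    while len(levels) > 1 and i < len(levels):
        trial = levels[:i] + levels[i + 1:]
        score = loglik(trial)
        if current - score <= tolerance:
            levels, current = trial, score
        else:
            i += 1
    return levels


# Toy stand-ins (assumptions for illustration only).
DATA = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0]


def toy_loglik(levels):
    # negative squared distance of each value to its nearest level
    return -sum(min((x - l) ** 2 for l in levels) for x in DATA)


def toy_optimise(levels, start):
    # mean of the values that are closer to the new level than to any other
    near = [x for x in DATA if all(abs(x - start) < abs(x - l) for l in levels)]
    return sum(near) / len(near) if near else start
```

With the toy data, `add_states([1.0, 3.0], toy_loglik, toy_optimise, 0.5)` inserts a level at 2.0 and then rejects further additions, since no later candidate improves the fit.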

[0230] In certain embodiments, the segmenting further comprises estimating the number of states and their means from the overall coverage distribution in order to generate an initial model. In certain embodiments, the method comprises optimizing the initial model by any of a variety of methods known to those skilled in quantitative data modeling, including modifying the number of states in the model and optimizing the parameters of each state. For example, the number of states in the model can be modified by sequentially adding states to the model, then sequentially removing states, or by a combination of the two; similar procedures are used in model selection methods for multivariate regression. The parameters of each state can be optimized by expectation maximization or by any of many other methods for optimizing multivariate models.

[0231] Variations on the foregoing procedure are well known to those skilled in the art. For example, one can try removing each state in turn from the largest model to determine which state has the least impact, remove that state, and recurse. Such alternatives are known to those skilled in methods of quantitative model selection. In another example, the number of states in the model can be modified by sequentially adding states to the model, then sequentially removing states, or by a combination of the two; similar procedures are used in model selection methods for multivariate regression. The parameters of each state can be optimized by expectation maximization or by any of many other methods for optimizing multivariate models.

[0232] Segmentation and segment scores:

[0233] Once a model has been selected and its parameters optimized, segmentation and segment scores are determined as described above for normal samples. In brief, segments of consecutive positions having the same most likely state are reported, with a score representing the average, over the positions in the segment, of the probability of a classification error.

[0234] The present disclosure differs from many known methods. A key difference is that, instead of intensity measurements at a large but specific set of locations on the genome (e.g., microarray data), the described methods relate to sequencing-based depth-of-coverage measurements at every position on the genome (e.g., next-generation sequencing data). Some other differences are as follows:

[0235] 1) Use of fractional counts to measure coverage. In yet another embodiment, when a mate-paired read (e.g., one corresponding to an entire DNB) maps to more than one location, a confidence measure is used to attribute the mapping fractionally to each location. As a result, coverage in segmental duplications can be assessed to a greater extent than with other methods.

[0236] 2) One of the described methods of correcting coverage bias. In yet another embodiment, the method of weighting each DNB (in one particular embodiment, using logistic regression) provides the ability to model the effects of multiple bias factors, which arguably gives better bias correction than previous methods.

[0237] 3) Use of copy number estimates for each baseline/matched sample. In yet another embodiment, estimating the copy number of each sample in a general baseline or a matched baseline avoids one of the challenges of previous methods that involve computing relative intensity (microarrays) or relative coverage (sequencing-based CNV), namely the fact that the samples used as a baseline may themselves have CNVs. When a baseline sample has a CNV, the intensity/coverage measured within the CNV locus does not provide an estimate of the intensity at normal (typically diploid) copy number, so that the relative coverage of the target sample has a different relationship to absolute copy number there than in the bulk of the genome. Adjusting for the estimated copy number of the baseline samples themselves preserves the expected linear relationship between copy number and relative coverage, allowing absolute copy number to be inferred more accurately.

[0238] 4) Within the HMM, two features are distinctive. In yet another embodiment, these features allow more robust modeling of the data (more accurate CNV calling).

[0239] a) The method by which the mean of each state is determined. These methods provide an alternative to the usual HMM training method (EM), which does not appear to converge reliably on useful values for coverage.

[0240] i) For normal samples, the median coverage in the portion of the sample expected to be diploid is used to determine the mean of the diploid state, and the other states (copy numbers) are set at increments or decrements of 50% from the diploid state. (The 0-copy state is special and is given a value slightly above 0 to allow for mapping errors.)

[0241] ii) For tumor samples, a separate process is used to infer an initial set of levels; this process can be based on histogram analysis of the coverage data; once the initial levels are selected, additional computation is applied to refine the set of levels.

[0242] b) The method by which the variance of each state is estimated (constrained). In at least some embodiments, the variance is constrained to be linearly related to the state mean, reflecting the fact that most of the variance results from bias rather than sampling noise; thus, in a given sample, a state (coverage level) with twice the mean of a second state will typically have twice the spread (standard deviation) of coverage observed for the second state.

[0243] 5) Use of coverage data from a large baseline (e.g., 50 samples) to identify positions at which some aspect of the sequencing process leads to highly variable coverage levels.

[0244] In yet another embodiment, if such positions were not identified as problematic, they would lead to false CNV calls. Once identified, such positions are flagged as being of unknown copy number rather than being assigned a false variation.
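Items 1), 3) and 4) of the list above can be illustrated with a short Python sketch. All function names and parameters are hypothetical. In particular, normalizing the confidence values of a read so that they sum to one, averaging the ploidy-adjusted baseline coverages, and the default values for the 0-copy level and the spread per unit mean are assumptions made for the example, not values from the disclosure.

```python
def fractional_coverage(reads, length):
    """Item 1: spread each read over its candidate locations.

    reads is a list of reads; each read is a list of (position, confidence)
    pairs. Confidences are normalized per read (an assumption), so every
    read contributes a total count of 1.
    """
    coverage = [0.0] * length
    for placements in reads:
        total = sum(conf for _, conf in placements)
        if total <= 0:
            continue
        for pos, conf in placements:
            coverage[pos] += conf / total
    return coverage


def relative_coverage(target_cov, baseline_covs, baseline_ploidies,
                      normal_ploidy=2):
    """Item 3: ratio of target coverage to a ploidy-adjusted baseline.

    Each baseline coverage is rescaled to what it would be at normal
    ploidy before averaging; baseline samples with ploidy 0 are skipped.
    """
    adjusted = [cov * normal_ploidy / ploidy
                for cov, ploidy in zip(baseline_covs, baseline_ploidies)
                if ploidy > 0]
    reference = sum(adjusted) / len(adjusted)
    return target_cov / reference


def normal_sample_states(diploid_median, max_copy=4, zero_level=0.05,
                         sd_per_unit_mean=0.1):
    """Item 4: state means in 50% steps, with spread proportional to mean.

    Returns (copy_number, mean, standard_deviation) tuples. zero_level and
    sd_per_unit_mean are placeholder values chosen for illustration.
    """
    per_copy = diploid_median / 2.0
    states = []
    for copies in range(max_copy + 1):
        mean = copies * per_copy if copies > 0 else zero_level * per_copy
        states.append((copies, mean, sd_per_unit_mean * mean))
    return states
```

For example, with baseline coverages of 40, 60 and 40 at estimated ploidies 2, 3 and 2, the adjusted baseline is 40 at every sample, so a target coverage of 20 gives a relative coverage of 0.5, i.e. one copy.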

[0245] Window boundary definition (for window smoothing)

[0246] When choosing window boundaries for window smoothing, in one example embodiment, the majority of windows are defined such that their chromosome coordinates are integer multiples of the window length, so that for 2k windows, for example, the chromosome positions of the window boundaries end in "x000", where x is an even digit. The boundaries of these windows are referred to as "default boundaries". The exceptions to these default boundaries are windows at the ends of contigs. A window never spans bases taken from more than one contig, even if the gap between the contigs is small enough to permit it. Moreover, the bases outside the outermost full default windows of each contig are treated specially. These "outer bases" are either added to the first full window toward the center of the contig or placed in their own window, depending on whether the number of bases is greater than 1/2 of the window width. For example, for a contig extending from position 17891 to position 25336 and a window width of 2000, the following list of window intervals would be used: (17891, 20000), (20000, 22000), (22000, 24000), (24000, 25336).

[0247] Note that the first 109 bases of the contig are added to the 2k interval immediately to their right, whereas the last 1336 bases are placed in their own window. Contigs smaller than the window width (e.g., chrM for 100k windows) are made into a single window comprising the entire contig. No windows are reported for the gaps between contigs. By way of illustration, suppose that a chromosome consists of three contigs as shown in Table 1.

[0248] Table 1: Example chromosome contigs

[0249]

Figure CN104428425AD00271

[0250] This results in the following windows being used/reported; the contig numbers are shown here only for clarity of presentation:

[0251] Contig 1: (17891, 20000), (20000, 22000), (22000, 24000), (24000, 25336)

[0252] Contig 2: (25836, 28000), (28000, 29277)

[0253] Contig 3: (33634, 34211)

[0254] The consequences of this approach are:

[0255] • All ungapped bases of the genome are included in a window (and in only one window);

[0256] • Windows are confined to a single contig;

[0257] • Windows are between 0.5 and 1.5 times the nominal window width;

[0258] • Window boundaries are typically round numbers, which makes it more apparent that segment boundaries correspond to window boundaries and leaves less opportunity for over-interpreting the precision of CNV call boundaries.
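The window rules above can be sketched in Python as follows. The function name `window_intervals` is hypothetical, and treating any contig that contains no full default window as a single window is a simplifying assumption (the text states this only for contigs shorter than the window width).

```python
def window_intervals(start, end, width):
    """Return the (begin, end) windows for one contig spanning start..end."""
    first = -(-start // width) * width  # first default boundary >= start
    last = (end // width) * width       # last default boundary <= end

    # No full default window fits inside the contig: use one window.
    if last - first < width:
        return [(start, end)]

    bounds = list(range(first, last + 1, width))
    windows = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

    # Outer bases on the left: own window if more than half a width,
    # otherwise merged into the first full window.
    left = first - start
    if left > width / 2:
        windows.insert(0, (start, first))
    elif left > 0:
        windows[0] = (start, windows[0][1])

    # Outer bases on the right, handled the same way.
    right = end - last
    if right > width / 2:
        windows.append((last, end))
    elif right > 0:
        windows[-1] = (windows[-1][0], end)

    return windows
```

Calling `window_intervals(17891, 25336, 2000)` reproduces the four windows listed for contig 1: the 109 leading bases are merged into the first full window, and the 1336 trailing bases form their own window.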

[0259] 5. Population-based no-calling / identification of low-confidence regions

[0260] In various embodiments, the computations and method steps described in this section (Section 5) can be performed by computer logic such as, for example, the CNV caller 18 of FIG. 1 and/or components thereof, such as the population-based no-calling logic 38.

[0261] In one aspect, the HMM-based calls described above typically include a variety of putative CNVs that are either artifacts or of little interest. Principally, these arise in one of two situations: A) the reference genome sequence does not provide a good account of the coverage pattern in most or all sample genomes, while most or all sample genomes match one another; B) there is more variation in coverage than can be explained by a small number of discrete ploidy levels. The utility of CNV inference can be increased by identifying and annotating such regions. In what follows, regions so annotated are regarded as "no-called", in the sense that no discrete estimate of ploidy is given for them.

[0262] Such behavior can arise for a variety of reasons; some possible mechanisms include:

[0263] • Errors in the reference genome. For example, two contigs may in fact overlap one another, i.e. correspond to a single genomic interval, in most or all genomes. In this case, the ends of the two contigs consist, to some extent, of highly similar sequence that would otherwise be unique, so that DNBs map to both locations. The observed/measured coverage is reduced, leading to an apparent copy number decrease. Alternatively, most or all sample genomes may contain a duplicated sequence that is absent from the reference. In this case, the observed coverage within the portion of the reference corresponding to the duplicated segment is elevated, leading to a copy number gain relative to the reference that is not a true polymorphism.

[0264] • Uncorrected coverage bias. In one aspect, regions that are substantially over-represented or under-represented in the sequencing results relative to the reference may appear to be CNVs. To retain the ability to make absolute copy number inferences, baseline correction is performed as described above, taking into account an initial copy number inference for the baseline genomes. One consequence may be that regions with severe bias in both the baseline and the target sample are interpreted as true CNVs. The signature of this type of event is that most or all samples show a similar pattern of elevated or depressed coverage.

[0265] • Analysis artifacts. Although rare, occasional mapping artifacts exist that can result in a large number of spurious mappings at a given location. Such artifacts can result from particular arrangements of variation from the reference within repeated segments, such that the wrong reference copy of the repeat is more similar to the sequence of the target sample. These can lead to very large spikes in coverage at certain positions on the reference, in a manner that depends on the variations present in a given sample.

[0266] • Segmental duplications and tandem repeats. Segments that are present in duplicated form in the reference and are subject to population variation can lead to variation in coverage between samples that is smaller than the typical gain or loss of one copy in unique sequence. In the limit, sufficient variability in the population for a high-copy sequence type can lead to an essentially continuous range of coverage values across a large number of samples.

[0267] • Unstable estimates due to extreme correction factors or very low raw coverage. Examples include: 1) regions where coverage is very low because of GC-related bias and the GC correction factor is correspondingly large, so that noise in the coverage estimate is amplified by the correction factor; 2) regions where coverage is very low in simulated as well as real data because of mapping overflow, leading to a large correction term in the baseline bias correction factor; 3) regions where almost all of the baseline genomes have a ploidy of 0.

[0268] Such regions can be identified in various ways. Ultimately, manual curation of the coverage patterns at individual locations is highly effective, but in some cases it is prohibitive because of a lack of data, the degree of effort required, and/or process instability. The use of sequence similarity and/or structural annotations has some promise, since in practice a large fraction of the problematic regions correspond to known repetitive portions of the reference genome (segmental duplications, self chains, STRs, repeat-masker elements); however, because many true copy number polymorphisms occur in such regions, it is not feasible to exclude such segments too broadly, and finding more selective criteria is challenging. Therefore, in yet another aspect, it is desirable to be able to identify problematic regions directly from the coverage data.

[0269] Two types of coverage pattern account for several of the situations above. The first involves regions where coverage is more variable than can be explained by a small number of discrete ploidy levels ("hypervariable regions"). The second involves regions where coverage is not what would be expected of a euploid region matching the reference, but is similar in all samples ("invariant regions").

[0270] Given a substantial number of genomes (e.g., 50 or more) as a "background set", summary statistics of the bias-corrected and smoothed, but unnormalized, coverage data suffice to separate the genome (e.g., heuristically or imperfectly) into well-behaved regions, hypervariable regions and invariant regions. The following summary statistics, computed for each genomic position $i$ over a set $G$ of $n$ genomes, can be used in this way. For $1 \le x \le n$: the $x$-th order statistic over $g \in G$, i.e., the $x$-th smallest corrected and smoothed coverage at position $i$ among the genomes of the background set.

[0271] Median

Figure CN104428425AD00281


Figure CN104428425AD00291

[0277] where SSE(i, x, y) is the sum of squared errors of the $x$-th through $y$-th order statistics at position $i$, i.e.,

Figure CN104428425AD00292

[0279] Given these summary statistics, criteria can be defined for flagging positions as hypervariable or invariant.

[0280] Annotation of hypervariable regions

[0281] A position that satisfies all four of the following criteria may be flagged as "hypervariable" (rather than being called a CNV or classified as euploid):

[0282] (i) The position is called CNV/aneuploid by the HMM inference process described above.

[0283] (ii) The coverage values in the background set do not cluster in a manner suggestive of a simple polymorphism in the population. Formally, for a value Q that can be chosen empirically:

[0284] $q_i > Q$

[0285] (iii) The range of coverage values at the position in the background set is wider than that seen across the bulk of the (euploid) genome. Formally, for a value S that can be chosen empirically:

[0286] SlZifll > S [0286] SlZifll> S

[0287] (iv) The coverage observed for the target sample falls within the range of values seen in the background set, or falls outside the observed range by only a small absolute amount (e.g., an amount that could readily be explained by sampling or process variation). Formally, for values R and X that can be chosen empirically:

Figure CN104428425AD00293

[0289] Annotation of invariant regions

[0290] A position that satisfies all of the following criteria is flagged as "invariant" (rather than being called a CNV):

[0291] (i) The position is called CNV/aneuploid by the HMM segmentation process described above.

[0292] (ii) The coverage values in the background set do not cluster in a manner suggestive of a simple polymorphism in the population. Formally, for a value Q that can be chosen empirically:

[0293] $q_i > Q$

[0294] (iii) Coverage at the position shows low variability across the background samples, suggesting the absence of high-minor-allele-frequency polymorphism in the population and low process variation (artifacts). For a value S that can be chosen empirically:

Figure CN104428425AD00294

[0296] (iv) The coverage observed for the target sample falls within the range of values seen in the background set, or falls outside the observed range by only a small absolute amount (e.g., an amount that could readily be explained by sampling or process variation). Formally, for values R and X that can be chosen empirically:

[0297] 1¾ -wi|[ < mm(S| * ff) [0297] 1¾ -wi | [<mm (S | * ff)

[0298] Refinement of annotations

[0299] In one aspect, the above criteria can cause CNV calls to be excessively fragmented into alternating called and no-called segments. It is desirable to allow short intervals that would be "no-called" under the above criteria (i.e., annotated as "hypervariable" or "invariant") to remain called (unannotated) if the observed coverage is very similar to that of the flanking unannotated intervals. Specifically, the "hypervariable" or "invariant" flag can be suppressed for intervals shorter than L bases that satisfy the above criteria but are part of a longer segment in the HMM output.
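A minimal Python sketch of this suppression rule follows. The function name `suppress_short_flags` and the tuple layouts are hypothetical, and "part of a longer segment" is interpreted here as being contained in an HMM segment that is strictly longer than the flagged interval.

```python
def suppress_short_flags(flagged, hmm_segments, min_len):
    """Drop no-call flags on short intervals inside longer HMM segments.

    flagged is a list of (begin, end, label) with label "hypervariable"
    or "invariant"; hmm_segments is a list of (begin, end); min_len
    plays the role of the cutoff L.
    """
    kept = []
    for begin, end, label in flagged:
        length = end - begin
        inside_longer = any(
            seg_begin <= begin and end <= seg_end
            and (seg_end - seg_begin) > length
            for seg_begin, seg_end in hmm_segments
        )
        if length < min_len and inside_longer:
            continue  # the interval stays called (unannotated)
        kept.append((begin, end, label))
    return kept
```

In this sketch, a flagged interval that coincides with a whole HMM segment keeps its flag even when it is shorter than L, because it is not part of a longer segment.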

[0300] Selection of cutoff values

[0301] In one aspect, the cutoffs Q, S, R, X and L in the above criteria can be chosen on the basis of an analysis of a subset of initial CNV calls and a comparison with the genome-wide distribution of the summary statistics of background coverage. Given a classification of an initial set of CNV calls (a "training set") into those that are suspect (to be flagged "hypervariable" or "invariant") and those considered to be true CNVs, together with summary statistics for the whole genome (i.e., for selected positions spaced along the genome, such as those arising from the windows described above), approximately optimal cutoffs can be identified using the following criteria:

[0302] • The vast majority of the genome is called euploid or CNV/aneuploid (i.e., only a small fraction of the genome is no-called/annotated as hypervariable or invariant);

[0303] • The vast majority of the problematic regions in the "training set" are no-called;

[0304] • The vast majority of the trusted regions in the training set are called (unannotated).

[0305] The training set can be obtained by manual curation of a set of initial CNV calls. The curation may involve manual inspection of coverage profiles to assess the calls, and comparison with external data sets of putative CNVs identified by independent methods.

[0306] Candidate values of Q, R, S and L are evaluated by determining their concordance with the training set or with a separate test set, as well as the fraction of the genome that is no-called. The final choice of cutoffs may involve a trade-off between call completeness (the fraction of the genome that is called) and the number of problematic CNV calls.
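The three quantities weighed in this trade-off can be tabulated for one candidate set of cutoffs as sketched below. The function name `evaluate_cutoffs` and the input layout are hypothetical: the sketch assumes that the no-call decision for each training region and each genome window has already been made with the candidate cutoffs.

```python
def evaluate_cutoffs(training, genome_no_called):
    """Summarize how one candidate set of cutoffs behaves.

    training is a list of (is_problematic, is_no_called) pairs, one per
    curated region; genome_no_called is a list of booleans, one per
    genome window, True where the window is no-called.
    """
    problematic = [no_call for bad, no_call in training if bad]
    trusted = [no_call for bad, no_call in training if not bad]
    return {
        # fraction of problematic training regions that are no-called
        "problematic_no_called": sum(problematic) / len(problematic),
        # fraction of trusted training regions that remain called
        "trusted_called": sum(not nc for nc in trusted) / len(trusted),
        # fraction of the genome (by window) that is no-called
        "genome_no_called": sum(genome_no_called) / len(genome_no_called),
    }
```

Good candidates would score high on the first two fractions while keeping the third small; how the three are weighed against each other is left open here.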

[0307] Score computation

[0308] This section describes more explicitly the CNV segment scores described above.

[0309] For a given HMM, one can compute the probability that a given sequence of outputs $D = d_1, \ldots, d_t$ of length $t$ arises as the result of a particular state sequence $\sigma = s_1, \ldots, s_t$, where the HMM consists of $n$ states and is defined by initial state probabilities $P = p_1, \ldots, p_n$, transition probabilities $T = \{t_{ij}\}$ and emission probabilities $E = \{e_{sd}\}$, as follows:

Figure CN104428425AD00301

[0311] The probability of the data given the model is the sum over all possible state sequences, i.e., over the set S of all possible state sequences of length t:

Figure CN104428425AD00302

[0313] This equation and other equations involving sums over subsets of S can be computed efficiently using the Forward/Backward algorithm. Application of Bayes' rule gives the probability of a given path given the data and the model:

Figure CN104428425AD00303

[0315] From this it can be seen that the most probable path given the data and the model is the path that maximizes Pr(D, σ | P, T, E). The path that maximizes this expression can be determined efficiently using the Viterbi algorithm.

[0316] However, probabilities of partial paths can also be computed. For example, the probability that the path through the model that actually gave rise to the observed data sequence was in a particular state q at a particular time u can be computed as follows:

Figure CN104428425AD00311

[0318] The denominator was discussed above; the numerator can be obtained by summing the probability of the data and a specific path over the subset of paths in S for which s_u = q:

Figure CN104428425AD00312
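The equation images referenced in this passage are not reproduced in the text. Assuming only the standard HMM definitions given above (initial probabilities p, transition probabilities t_ij, emission probabilities e_sd), the quantities described would take the following form. The set name S_{u,q} is notation introduced here for illustration and is not taken from the source.

```latex
\begin{align}
\Pr(D,\sigma \mid P,T,E) &= p_{s_1}\, e_{s_1 d_1} \prod_{i=2}^{t} t_{s_{i-1} s_i}\, e_{s_i d_i} \\
\Pr(D \mid P,T,E) &= \sum_{\sigma \in S} \Pr(D,\sigma \mid P,T,E) \\
\Pr(\sigma \mid D,P,T,E) &= \frac{\Pr(D,\sigma \mid P,T,E)}{\Pr(D \mid P,T,E)} \\
\Pr(s_u = q \mid D,P,T,E) &= \frac{\sum_{\sigma \in S_{u,q}} \Pr(D,\sigma \mid P,T,E)}{\Pr(D \mid P,T,E)},
\qquad S_{u,q} = \{\sigma \in S : s_u = q\}
\end{align}
```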

[0322] State assignment (the "called ploidy") proceeds as follows: the inferred state (ploidy) at position u, ĉ_u, is the state with the maximal probability:

Figure CN104428425AD00313

[0324] (Ties are broken arbitrarily.) The ploidy score at position u, π_u, is then:

Figure CN104428425AD00314

[0326] and the CNV-type score at position u (also called the DEI score), δ_u, is:

Figure CN104428425AD00315

[0328] The limits a and b of the sum are as follows. For regions expected to be diploid: if ĉ_u < 2, a = 0 and b = 1; if ĉ_u = 2, a = b = 2; if ĉ_u > 2, a = 3 and b = the maximum ploidy (typically 10). For regions expected to be haploid (male sex chromosomes): if ĉ_u < 1, a = 0 and b = 0; if ĉ_u = 1, a = b = 1; if ĉ_u > 1, a = 2 and b = the maximum ploidy (typically 10).
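The choice of summation limits described above can be sketched as a small function. This is a minimal illustration, not the patented implementation: the function and parameter names are hypothetical, and the rule is written relative to the expected ploidy (2 for diploid regions, 1 for haploid regions) so that both cases share one code path.

```python
def dei_sum_bounds(called_ploidy: int, expected_ploidy: int = 2,
                   max_ploidy: int = 10) -> tuple[int, int]:
    """Return the inclusive limits (a, b) of the CNV-type (DEI) score sum.

    called_ploidy   -- ploidy state called at the position
    expected_ploidy -- 2 for diploid regions, 1 for haploid regions
    max_ploidy      -- largest ploidy state in the model (typically 10)
    """
    if expected_ploidy not in (1, 2):
        raise ValueError("expected_ploidy must be 1 or 2")
    if not 0 <= called_ploidy <= max_ploidy:
        raise ValueError("called_ploidy out of range")
    if called_ploidy < expected_ploidy:
        # Loss: sum over every state below the expected ploidy.
        return 0, expected_ploidy - 1
    if called_ploidy == expected_ploidy:
        # Normal: the sum contains only the expected state.
        return expected_ploidy, expected_ploidy
    # Gain: sum over every state above the expected ploidy.
    return expected_ploidy + 1, max_ploidy
```

For a diploid region called at ploidy 3, the sketch returns the range 3 through 10, so the score reflects the probability of any gain rather than of exactly three copies.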

[0329] A segment is defined as a maximal run of positions with the same called ploidy. For a segment from position l to position r, the ploidy score of the segment is taken to be the mean of the ploidy scores of its constituent positions:

Figure CN104428425AD00316

[0331] and similarly, the CNV-type score of the segment is the mean of the CNV-type scores of its constituent positions:

Figure CN104428425AD00317

[0333] Alternative methods for scoring:

[0334] An alternative set of scores for a segment can be computed based on the likelihood of partial paths. For example, the probability that the true path is in state q from position l to position r can be computed as follows:

Figure CN104428425AD00321

[0336] Another statistic that may be relevant to computing confidence in segment boundaries is the probability of being in state q at position u but not at position u-1 (or, analogously, at position u+1):

Figure CN104428425AD00322

[0338] Finally, alternatives to the DEI score defined above can be computed; for example, the probability of being in a state with ploidy greater than 2 from position l to position r is:

Figure CN104428425AD00323

[0340] As noted earlier, sums over all paths can be computed efficiently via the Forward-Backward algorithm.
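The per-position state probabilities used for state calling can be illustrated with a textbook Forward-Backward pass. The following is a minimal sketch under stated assumptions: discrete emissions, a toy two-state model with made-up probabilities, and no numerical scaling (a production implementation for genome-length sequences would need scaling or log-space arithmetic). All names are hypothetical.

```python
def forward_backward(init, trans, emit, data):
    """Return (posteriors, likelihood) for a discrete-emission HMM.

    init[s]     -- initial probability of state s
    trans[r][s] -- probability of moving from state r to state s
    emit[s][d]  -- probability that state s emits symbol d
    data        -- observed symbols d_1..d_t, given as indices
    posteriors[u][q] is Pr(s_u = q | D, P, T, E).
    """
    n, t = len(init), len(data)
    fwd = [[0.0] * n for _ in range(t)]
    bwd = [[1.0] * n for _ in range(t)]
    # Forward pass: probability of the data up to u, ending in state s.
    for s in range(n):
        fwd[0][s] = init[s] * emit[s][data[0]]
    for u in range(1, t):
        for s in range(n):
            fwd[u][s] = emit[s][data[u]] * sum(
                fwd[u - 1][r] * trans[r][s] for r in range(n))
    # Backward pass: probability of the data after u, given state r at u.
    for u in range(t - 2, -1, -1):
        for r in range(n):
            bwd[u][r] = sum(
                trans[r][s] * emit[s][data[u + 1]] * bwd[u + 1][s]
                for s in range(n))
    likelihood = sum(fwd[t - 1])  # Pr(D | P, T, E)
    post = [[fwd[u][s] * bwd[u][s] / likelihood for s in range(n)]
            for u in range(t)]
    return post, likelihood


def call_states(post):
    """Call the state with the largest posterior at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in post]


# Toy example with invented parameters (two states, two symbols).
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]
data = [0, 1, 1]
post, likelihood = forward_backward(init, trans, emit, data)
calls = call_states(post)
```

In this sketch `max` returns the first maximal state on a tie, which is one way of making the arbitrary tie-breaking choice mentioned in the text.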

[0341] The use of HMM models is known in the art, as discussed, for example, in Rabiner, L. R., A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, 1989, 77(2):257-286.

[0342] Exemplary Implementation Mechanisms for CNV Calling

[0343] Computer System

[0344] An exemplary computer system that can be used in accordance with embodiments of the present disclosure can execute software, and the results can be presented to a user on a monitor or other display device. In some embodiments, an exemplary computer system configured to estimate copy number variation in a target sequence of a sample can present the results to the user as a graphical user interface (GUI) on a display device such as a computer monitor. FIG. 3 illustrates an example of the architecture of a computer system 400 configured to implement the copy number variation estimation of the present disclosure. As shown in FIG. 3, computer system 400 may include one or more processors 402 (such as a CPU). Processor 402 is connected to a communication infrastructure 406 (e.g., a communications bus, crossover bar, or network). Computer system 400 may include a display interface 422 that forwards graphics, text, and other data from communication infrastructure 406 (or from a frame buffer, not shown) for display on display unit 424.

[0345] Computer system 400 may also include a main memory 404, such as random access memory (RAM), and a secondary memory 408. Secondary memory 408 may include, for example, a hard disk drive (HDD) 410 and/or a removable storage drive 412, which may represent a floppy disk drive, a magnetic tape drive, an optical disk drive, or the like. Removable storage drive 412 reads from and/or writes to a removable storage unit 416. Removable storage unit 416 may be a floppy disk, magnetic tape, optical disk, or the like. It should be understood that removable storage unit 416 may include a computer-readable storage medium having computer software and/or data stored thereon.

[0346] In alternative embodiments, secondary memory 408 may include other similar devices for allowing computer programs, computer logic, or other instructions to be loaded into computer system 400. Secondary memory 408 may include a removable storage unit 418 and a corresponding interface 514. Examples of such removable storage units include, but are not limited to, USB or flash drives, which allow software and data to be transferred from removable storage unit 418 to computer system 400.

[0347] Computer system 400 may also include a communications interface 420. Communications interface 420 allows software and data to be transferred between computer system 400 and external devices. Examples of communications interface 420 may include a modem, an Ethernet card, a wireless network card, a Personal Computer Memory Card International Association (PCMCIA) slot and card, and the like. Software and data transferred via communications interface 420 may be in the form of signals, which may be electronic, electromagnetic, optical, or the like, capable of being received by communications interface 420. These signals may be provided to communications interface 420 via a communications path (e.g., a channel), which may be implemented using wire, cable, fiber optics, a telephone line, a cellular link, a radio frequency (RF) link, and other communication channels.

[0348] In this document, the terms "computer program medium" and "computer-readable storage medium" refer to non-transitory media such as main memory 404, removable storage drive 412, and a hard disk installed in hard disk drive 410. These computer program products provide software or other logic to computer system 400. Computer programs (also called computer control logic) are stored in main memory 404 and/or secondary memory 408. Computer programs or other software logic may also be received via communications interface 420. Such computer programs or logic, when executed by a processor, enable computer system 400 to perform the features of the methods discussed herein. For example, main memory 404, secondary memory 408, or removable storage unit 416 or 418 may be encoded with computer program code (instructions) for performing operations corresponding to the process shown in FIG. 3.

[0349] In embodiments implemented using software logic, the software instructions may be stored in a computer program product and loaded into computer system 400 using removable storage drive 412, hard disk drive 410, or communications interface 420. In other words, the computer program product (which may be a computer-readable storage medium) may have instructions tangibly embodied thereon. The software instructions, when executed by processor 402, cause processor 402 to perform the functions (operations) of the methods described herein. In another embodiment, the methods are implemented primarily in hardware using, for example, hardware components such as a digital signal processor comprising application-specific integrated circuits (ASICs). In yet another embodiment, the methods are implemented using a combination of hardware and software.

[0350] Exemplary system for CNV calling according to embodiments of the present disclosure. FIG. 1 is a block diagram illustrating a system for calling variations in a sample polynucleotide sequence according to an exemplary embodiment. In this embodiment, the system may include a computer cluster 10 of one or more computing devices, such as computers 12, and a data repository 14. Computers 12 may be connected to data repository 14 via a high-speed local area network (LAN) 16. At least a portion of computers 12 may execute instances of a CNV caller 18. (In some embodiments, a CNV caller such as CNV caller 18 may be included as part of assembly pipeline logic that is configured and operable to assemble raw reads into a mapped and sequenced genome, including detected variations from a reference genome; examples of such embodiments are described in U.S. Application No. 12/770,089, filed April 29, 2010, which is incorporated herein by reference in its entirety as if fully set forth herein.) CNV caller 18 may include HMM model logic 20, coverage calculation logic 22, GC correction logic 34, ploidy-related correction logic 36, and population-based no-calling logic 38.

[0351] Data repository 14 may store several databases, including one or more databases that store a reference polynucleotide sequence 24, mated reads 26 obtained by sequencing a sample polynucleotide sequence using biochemical processes, and mapped mated reads 28 generated from mated reads 26.

[0352] Reference polynucleotide sequence 24 (hereinafter, the reference) refers to a known sequence of nucleotides (e.g., a known genome) of a reference organism. This includes references comprising sequences having known variations at one or more locations within the genome. A polynucleotide molecule is an organic polymer molecule composed of nucleotide monomers covalently bonded in a chain. Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are examples of polynucleotides with distinct biological functions. The genome of an organism (such as a human) is the entirety (or substantially the entirety) of the organism's hereditary information, which is encoded as DNA or RNA. A haploid genome contains one copy of each hereditary unit of the organism. In diploid organisms such as mammals, the genome is a series of complementary polynucleotides comprising two copies of the majority of the hereditary information, organized as sets of chromosomes having discrete genetic units, or alleles. Each copy of an allele is provided at a specific position on an individual chromosome, and the genotype for each allele in the genome comprises the pair of alleles present at particular positions on homologous chromosomes, which determines a specific characteristic or trait. Where a genome comprises two identical copies of an allele, it is homozygous for that allele; where the genome comprises two different alleles, it is heterozygous for that locus. The DNA itself is organized as two strands of complementary polynucleotides.

[0353] Reference 24 may be an entire genome sequence, a portion of a reference genome, a consensus sequence of many reference organisms, a compilation sequence based on different components of different organisms, or any other appropriate sequence. Reference 24 may also include information regarding variations of the reference known to be found in a population of organisms.

[0354] Mated reads 26 may be obtained during a sequencing process performed on polynucleotide sequences obtained from a biological sample of an organism, e.g., nucleic acid sequences from a gene, genomic DNA, RNA, or a fragment thereof that is to be analyzed. Mated reads 26 can be obtained from a sample comprising an entire genome, such as an entire mammalian genome, and more specifically an entire human genome. In another embodiment, mated reads 26 may be specific fragments from a complete genome. In one embodiment, mated reads 26 may be obtained by performing sequencing on amplified nucleic acid constructs, such as amplimers created using polymerase chain reaction (PCR) or rolling circle replication. Examples of amplimers that may be used are described, for example, in U.S. Patent Publication Nos. 20090111705, 20090111706, and 20090075343, which are incorporated herein by reference in their entirety.

[0355] Mapped mated reads 28 refer to mated reads 26 that have been mapped to a location in reference 24. Exemplary mapping methods are described in the following patent applications: U.S. Patent Application No. 12/698,965, filed February 2, 2010; U.S. Patent Application No. 12/698,986, filed February 2, 2010; and U.S. Patent Application No. 12/698,994, filed February 2, 2010, the entire contents of each of which are incorporated herein by reference.

[0356] CNV caller 18 generates and scores sequences for the purpose of identifying and calling copy number variations or differences detected in the sequences of mapped mated reads 28 relative to reference 24.

[0357] CNV caller 18 may output a CNV call file 32, a list, or another data structure containing the identified variations, each describing the manner in which a portion of the sequence of mapped mated reads 28 is observed to differ from reference 24 at or near a specific location.

[0358] Computer cluster 10 may be configured such that instances of CNV caller 18 executing on different computers 12 operate in parallel on different portions of reference 24 and mapped mated reads 26. A job scheduler 30 is responsible for assigning jobs or work packets across the different computers 12 in computer cluster 10.

[0359] Computers 12 may include typical hardware components (not shown), including one or more processors, input devices (e.g., keyboard, pointing device, etc.), and output devices (e.g., a display device, etc.). One example of computer 12 is computer system 400 shown in FIG. 3 and/or computing device 2500. Computers 12 may include computer-readable/writable media, e.g., memory and storage devices (e.g., flash memory, hard disk drive, optical disk drive, magnetic disk drive, etc.), containing computer instructions that, when executed by the processor, implement the functions disclosed herein. Computers 12 may further include computer-writable media for implementing data repository 14 and for storing CNV call file 32. Computers 12 may also include wired or wireless network communication interfaces for communication.

[0360] Data Generation

[0361] In some embodiments, a sequencing machine (e.g., the sequencing machine shown in FIGS. 4A and 4B) may be used to generate mated reads 26 obtained from sample polynucleotides of the organism to be analyzed. In one embodiment, the sequencing machine provides the data in discrete but related sets, such that the contents of mated reads 26 may include predicted spatial relationships and/or separation variations. These relationships may be determined based on existing knowledge about the biochemical processes used to generate mated reads 26 (i.e., based on the sequences that would be expected if the biochemical processes were applied to a sample), empirical estimates based on preliminary analysis of the sequence data of mated reads 26 or subsets thereof, estimation by experts, or other appropriate techniques.

[0362] Many biochemical processes can be used by a sequencing machine to facilitate generation of mated reads 26 for use with the CNV calling methods of the present invention. These include, but are not limited to: hybridization methods as disclosed in U.S. Patent Nos. 6,864,052, 6,309,824, and 6,401,267; sequencing-by-synthesis methods as disclosed in U.S. Patent Nos. 6,210,891, 6,828,100, 6,833,246, 6,911,345, and 7,329,496, in Margulies et al. (2005), Nature 437:376-380, and in Ronaghi et al. (1996), Anal. Biochem. 242:84-89; ligation-based methods as disclosed in U.S. Patent No. 6,306,597, WO2006073504, and WO2007120208; nanopore sequencing technology as disclosed in U.S. Patent Nos. 5,795,782, 6,015,714, 6,627,067, 7,238,485, and 7,258,838 and U.S. Patent Application Nos. 2006003171 and 20090029477; and nanochannel sequencing technology as disclosed in U.S. Patent Application Publication No. 20090111115, all of which are incorporated herein by reference in their entirety. In one specific embodiment, a combinatorial probe anchor ligation (cPAL) process may be used (see U.S. Patent Application Publication Nos. 20080234136 and 20070099208, which are incorporated herein by reference in their entirety).

[0363] Once the primary mapped mated read data has been generated, the information is processed according to the CNV calling method of the present disclosure as shown in FIG. 2. FIG. 2 depicts an exemplary method for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample: mapped read data is obtained to measure sequence coverage for the sample 202; sequence coverage bias is corrected, where the bias correction includes performing ploidy-related baseline correction 204; and, after performing population-based no-calling/identification of low-confidence regions 206 and performing HMM segmentation, scoring, and output 208, a total copy number value and region-specific copy number values are estimated for a plurality of genomic regions 210.

[0364] An example of the output of the CNV calling process for a diploid/non-tumor/non-aneuploid sample (e.g., as provided in variation call file 32 of FIG. 1), produced according to an exemplary embodiment of the present disclosure, is shown in Table 2.

[0365] Table 2

[0366]

Figure CN104428425AD00351

[0367]

Figure CN104428425AD00361

[0368] In Table 2, the column "Chromosome" identifies the chromosome number; the columns "Begin" and "End" identify the starting and ending loci of a given region; the column "Ploidy" indicates the ploidy (e.g., copy number) of the region; the column "Ploidy Score" indicates the score of the given region (where the score is an algorithm-based value expressed in decibels, dB); the column "Type" indicates the type of ploidy observed for the region (e.g., "=" indicates normal ploidy of 2, "+" indicates higher-than-normal ploidy, "-" indicates lower-than-normal ploidy, "hypervariable" indicates that the ploidy cannot be called, and "invariant" indicates that the ploidy differs from normal but is the same as that observed in the baseline, the baseline being a collection of at least several reference genomes); and the column "Type Score" indicates the confidence score of the type called in the "Type" column of the same row. For example, the second row of Table 2 indicates that the region beginning at locus 5100001 on chromosome 1 and ending at locus 5800000 has a ploidy of 3 with a score of 15 dB, and has the "increased" type with a type score of 40.
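The columns just described map naturally onto a small record type. The sketch below is illustrative only: the class name, field names, and helper methods are hypothetical, the on-disk format of the call file is not specified here, and the assumption that the "increased" type is written as "+" follows the symbol list above. The example instance uses the values of the row discussed in the text.

```python
from dataclasses import dataclass

# Type symbols listed in the description of Table 2.
CALLED_TYPES = {"=", "+", "-"}
NO_CALL_TYPES = {"hypervariable", "invariant"}


@dataclass(frozen=True)
class CnvSegmentCall:
    chromosome: str
    begin: int           # starting locus of the region
    end: int             # ending locus of the region
    ploidy: int          # called ploidy (copy number)
    ploidy_score: float  # confidence in the called ploidy, in dB
    cnv_type: str        # one of CALLED_TYPES or NO_CALL_TYPES
    type_score: float    # confidence in the called type

    def __post_init__(self):
        if self.cnv_type not in CALLED_TYPES | NO_CALL_TYPES:
            raise ValueError(f"unknown type: {self.cnv_type!r}")
        if self.end < self.begin:
            raise ValueError("end precedes begin")

    def is_called(self) -> bool:
        """True when the region received a ploidy call."""
        return self.cnv_type in CALLED_TYPES

    def is_cnv(self) -> bool:
        """True when the region is called as a gain or a loss."""
        return self.cnv_type in {"+", "-"}


# The row discussed in the text: chromosome 1, loci 5100001 to 5800000.
example = CnvSegmentCall("1", 5100001, 5800000, 3, 15.0, "+", 40.0)
```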

[0369] An example of the output of the CNV calling process for a non-diploid/tumor/aneuploid sample (e.g., as provided in variation call file 32 of FIG. 1), produced according to an exemplary embodiment of the present disclosure, is shown in Table 3.

[0370] Table 3

Figure CN104428425AD00362

[0373] In Table 3, the column "Chromosome" identifies the chromosome number; the columns "Begin" and "End" identify the starting and ending loci of a given region; the column "Level" indicates the coverage level output by the HMM model for the region (where, because of aneuploidy and other characteristics of tumor samples, coverage levels are calculated without assuming a normal ploidy of 2); and the column "Level Score" indicates the confidence score of the level called in the "Level" column of the same row. For example, the second row of Table 3 indicates that the region beginning at locus 10001 on chromosome 2 and ending at locus 243189373 has a coverage level of 1.05 with a score of 38.

[0374] Graphical Interpretation of Absolute Copy Number

[0375] In an illustrative embodiment, bias-corrected window-averaged coverage and allele-specific read counts are input to OptSeg logic, which determines a segmentation into regions of uniform coverage and lesser allele fraction (LAF). The effect of the segmentation is conceptually similar to circular binary segmentation in that it has no fixed set of modeled states, but it provides a globally optimal solution. Overly short segments are suppressed to reduce noise.

[0376] The LAF results, together with total coverage, provide an estimate of lesser allele coverage. The two-dimensional space defined by total coverage and lesser allele coverage can be used for a graphical presentation of the tumor in which most states are expected to lie at the vertices of a rectilinear grid. Density in this space can be tabulated and kernel-smoothed by computer logic prior to visualization. Peaks of sufficient height in the smoothed distribution determine an initial set of states.

[0377] Rule-based logic finds a constrained (one-tumor-component) model that attempts to capture as much as possible of the density at the initial states, within a maximum average ploidy limit. The data support for a given model can be assessed visually, and various properties of the model can be observed directly from the plot.

[0378] The constrained model provides an estimate of the stromal contamination fraction, and the peaks matching the model receive interpretations in terms of integer total and lesser copy numbers. Initial states not accounted for by the model are interpreted as the result of tumor heterogeneity; they may be included in the final model and receive non-integer total and/or lesser copy numbers.

[0379] The final model, together with the state interpretations, may be input to logic implementing a separate model-based segmentation process (such as an HMM) for annotating a final set of segments. Because the modeling process and the final segmentation are independent, computer logic can generate a visual presentation of the tumor; if the automatically derived model is problematic, it can be replaced with an alternative model.

[0380] LAF Estimation

[0381] The lesser allele fraction (also called the "lesser copy proportion") is the fraction of copies of a given region of the sample that carry the less abundant allele. LAF can be estimated from read counts in the tumor sample at loci that are heterozygous variants in the matched normal sample. The estimation allows for phase uncertainty (e.g., either allele at a given locus may be the "lesser allele", independent of the read counts) in order to avoid biased estimates. Binomial sampling error can be handled via a beta-binomial model.
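The role of phase uncertainty can be illustrated with a simplified estimator. The sketch below is an assumption-laden stand-in, not the method of the disclosure: it uses a plain binomial mixture instead of the beta-binomial model mentioned above, treats each allele at a locus as equally likely to be the lesser allele, and maximizes the likelihood by grid search. All function names and read counts are hypothetical.

```python
import math


def _log_mixture(k, m, f):
    """Log-likelihood (up to a constant) of k and m reads for the two
    alleles at one locus, with either allele equally likely to be the
    lesser allele at fraction f."""
    a = k * math.log(f) + m * math.log(1 - f)
    b = m * math.log(f) + k * math.log(1 - f)
    top = max(a, b)
    # log(0.5 * e^a + 0.5 * e^b), evaluated stably.
    return top + math.log(0.5 * math.exp(a - top) + 0.5 * math.exp(b - top))


def estimate_laf(counts, step=0.005):
    """Grid-search maximum-likelihood LAF over (0, 0.5].

    counts -- list of (reads_allele_1, reads_allele_2) at loci that are
              heterozygous in the matched normal sample.
    """
    best_f, best_ll = None, -math.inf
    for i in range(1, int(round(0.5 / step)) + 1):
        f = i * step
        ll = sum(_log_mixture(k, m, f) for k, m in counts)
        if ll > best_ll:
            best_f, best_ll = f, ll
    return best_f


def naive_laf(counts):
    """Biased alternative: always treat the rarer allele as the lesser one."""
    return sum(min(k, m) for k, m in counts) / sum(k + m for k, m in counts)


# Invented counts for a balanced region and for an imbalanced region.
balanced = [(48, 52), (53, 47), (50, 50), (45, 55)]
imbalanced = [(30, 70), (70, 30), (28, 72), (71, 29)]
laf_balanced = estimate_laf(balanced)
laf_imbalanced = estimate_laf(imbalanced)
```

In a balanced region the naive estimator is pulled below 0.5 by sampling noise alone, because it always picks the smaller count; the mixture likelihood does not commit to a phase at each locus and so avoids that systematic shift.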

[0382] Conceptual Modeling

[0383] An exemplary standard one-component-plus-stromal-contamination model

[0384] In certain embodiments, the following assumptions are made:

[0385] • A cell in the sample is a tumor cell with probability t, or a normal cell with probability 1-t.

[0386] • All tumor cells have the same genome.

[0387] • All normal cells are diploid.

[0388] For a given region i of the genome, if:

[0389] • a_i = major allele copy number per tumor cell in i

[0390] • b_i = lesser allele copy number per tumor cell in i (a_i ≥ b_i)

[0391] • C_i = average copy number in i

[0392] • l_i = lesser allele fraction in i

[0393] then:

[0394] C_i = 2(1-t) + (a_i + b_i)t

[0395] l_i = (1 - t + b_i t)/C_i

[0396] The copy number states (C_i, l_i) allowed in this model lie on a square grid in the plot (see FIG. 6).
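The two model equations can be evaluated directly. The following minimal sketch computes the expected average copy number and lesser allele fraction for a given tumor state and tumor fraction; the function name and the handling of the zero-coverage corner case are assumptions made for illustration.

```python
import math


def expected_state(a: int, b: int, t: float) -> tuple[float, float]:
    """Return (C_i, l_i) under the one-component-plus-stromal model.

    a -- major allele copy number per tumor cell
    b -- lesser allele copy number per tumor cell (0 <= b <= a)
    t -- probability that a sampled cell is a tumor cell (0 <= t <= 1)
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if not 0 <= b <= a:
        raise ValueError("require 0 <= b <= a")
    # Normal cells contribute 2 copies, tumor cells contribute a + b.
    c = 2 * (1 - t) + (a + b) * t
    if c == 0:
        # Homozygous deletion in a pure tumor: the fraction is undefined.
        return 0.0, math.nan
    # Normal cells contribute 1 lesser-allele copy, tumor cells contribute b.
    l = (1 - t + b * t) / c
    return c, l


# Copy-neutral LOH (a=2, b=0) in a sample that is 50% tumor.
c_loh, l_loh = expected_state(2, 0, 0.5)
```

With t = 0.5, the copy-neutral LOH state keeps an average copy number of 2 while its lesser allele fraction drops to 0.25, which is why LOH remains visible in a contaminated sample even though the lesser allele never disappears from the reads.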

[0397] Multi-Component Models

[0398] The one-component tumor model of possible states (FIG. 6) can readily be extended to more complex models involving two or more components (subclones) of the tumor portion of the sample. However, even a two-component model leads to a great expansion of the possible states (see FIG. 7). Both the overall model (the relative fraction of each component, the coverage per full copy, etc.) and the interpretation of individual states are generally underdetermined (see FIGS. 7-9).

[0399] Illustration of results

[0400] The robustness of the process to varying degrees of stromal contamination can be seen in an artificial (in silico) titration of stromal content using a matched tumor and normal cell line. FIGS. 10A-10C show three data sets: the first contains the pure tumor cell line, the second contains 50% tumor/50% matched normal cells, and the third contains 25% tumor/75% normal cells.

[0401] Although all regions of the genome in the contaminated samples have substantial minor-allele content, the lowest states can be interpreted as loss-of-heterozygosity (LOH) states.

[0402] Tumors with a high mean copy number and highly variable copy number can yield striking fits, in which regions where the tumor is heterogeneous are highlighted (see FIG. 11).

[0403] Some samples are difficult: they are hard to interpret even manually. At a minimum, visualization makes a poor model fit easy to identify (see FIG. 12).

[0404] In certain embodiments, regions of intra-tumor heterogeneity are not decomposed into combinations of homogeneous components but are instead reported according to their average behavior. In certain other embodiments, in which the sample peaks clearly define a lattice, there is some degeneracy in model selection, which can make a state with major and minor copy numbers (a, b) indistinguishable from a state (a+n, b+n) with less stromal contamination. In those embodiments, the model with the lowest possible ploidy is preferred.
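The degeneracy described above can be checked numerically under one additional assumption that is not stated in the text: that sequence coverage fixes the mean copy number only up to an unknown scale factor (the coverage per whole copy). Under that assumption, a model with tumor fraction t and states (a, b) produces the same observable lattice as a model with tumor fraction t' = t/(1 - nt), states (a+n, b+n), and scale factor k = 1 - nt. The sketch below uses hypothetical function names and is meant only to make the ambiguity concrete.

```python
def observed_state(a, b, t, scale=1.0):
    """Scaled mean copy number and minor allele fraction for state (a, b)."""
    mean_copies = 2 * (1 - t) + (a + b) * t
    minor_fraction = (1 - t + b * t) / mean_copies
    return scale * mean_copies, minor_fraction


def higher_ploidy_equivalent(t, n):
    """Return (t_prime, scale) of the model that shifts every state by n.

    The equivalent model only exists when t * (1 + n) <= 1, so that the
    implied tumor fraction t_prime does not exceed 1.
    """
    if n < 0 or t * (1 + n) > 1:
        raise ValueError("no valid equivalent model for this t and n")
    scale = 1 - n * t
    return t / scale, scale


if __name__ == "__main__":
    t, n = 0.4, 1
    t_prime, scale = higher_ploidy_equivalent(t, n)
    for a, b in [(1, 0), (1, 1), (2, 0), (2, 1), (3, 1)]:
        low = observed_state(a, b, t)
        high = observed_state(a + n, b + n, t_prime, scale)
        print((a, b), low, high)
```

In this example a sample that is 40% tumor with states (a, b) looks the same as a sample that is about 67% tumor with states (a+1, b+1), which is consistent with the stated preference for the model with the lowest possible ploidy as a tie-breaking rule.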

[0405] Exemplary model-based segmentation process for CNV calling

[0406] In an exemplary embodiment, a method for determining the copy number of a genomic region at a detection position of a target polynucleotide sequence in a sample comprises: obtaining measurement data for the sequence coverage of the sample; correcting the measurement data for sequence coverage bias, wherein correcting the sequence coverage bias includes performing ploidy-aware baseline correction; performing hidden Markov model (HMM) segmentation, scoring, and output; and estimating a total copy number value and region-specific copy number values for a plurality of genomic regions.

[0407] For example, in one embodiment, the method includes generating a plurality of states of an HMM, each corresponding to a respective copy number, wherein the sample is a tumor sample; and performing HMM segmentation, scoring, and output, including: generating an initial model for the HMM using input data representing the initial states determined by the model for interpreting absolute copy number of complex tumors (as described above); optimizing the initial model by modifying the number of states in the model and optimizing the parameters of each state; and modifying the number of states in the model by successively adding states to the model, then removing states, or a combination thereof.

[0408] In one embodiment, the method further comprises annotating the total copy number value and the region-specific copy number values of the plurality of genomic regions based on the initial states determined by the model for interpreting absolute copy number of complex tumors.

[0409] Exemplary sequencer and computing device

[0410] In some embodiments, sequencing of a DNA sample (e.g., a sample representing an entire human genome) can be performed by a sequencing system. Two examples of sequencing systems are shown in FIGS. 4A and 4B.

[0411] FIGS. 4A and 4B are block diagrams of exemplary sequencing systems 2490 configured to implement the techniques and/or methods for interpreting absolute copy number of complex tumors and for CNV calling according to exemplary embodiments of the present invention. Sequencing system 2490 may include or be associated with a plurality of subsystems, e.g., one or more sequencers such as sequencer 2491, one or more computing systems such as computing system 2497, and one or more data repositories such as data repository 2495. In the embodiment shown in FIG. 4A, the subsystems of system 2490 may be communicatively connected through one or more networks 2493, which may include packet-switched or other types of network infrastructure devices (e.g., routers, switches, etc.) that facilitate the exchange of information between remote systems. In the embodiment shown in FIG. 4B, sequencing system 2490 is a sequencing device in which the subsystems (e.g., sequencer 2491, computing system 2497, and possibly data repository 2495) are components communicatively and/or operatively connected and integrated within the sequencing device.

[0412] In some operating contexts, the data repository 2495 and/or computing system 2497 of the embodiments shown in FIGS. 4A and 4B may be configured within a cloud computing environment 2496. In a cloud computing environment, the storage devices comprising the data repository and/or the computing devices comprising the computing system may be allocated and instantiated for use as utility and on-demand computing; thus, the cloud computing environment provides, as services, the infrastructure (e.g., physical and virtual machines, raw/block storage, firewalls, load balancers, aggregators, networks, storage clusters, etc.), the platforms (e.g., computing devices and/or solution stacks that may include operating systems, programming-language execution environments, database servers, web servers, application servers, etc.), and the software (e.g., applications, application programming interfaces or APIs, etc.) necessary to perform any storage-related and/or computing tasks.

[0413] It is noted that, in various embodiments, the techniques described herein can be performed by various systems and devices that include some or all of the above subsystems and components (e.g., sequencers, computing systems, and data repositories) in different configurations and form factors; thus, the exemplary embodiments and configurations shown in FIGS. 4A and 4B are to be regarded as illustrative rather than restrictive.

[0414] Sequencer 2491 is configured and operable to receive target nucleic acids 2492 derived from fragments of a biological sample and to sequence the target nucleic acids. Any instrument capable of sequencing may be used, and such instruments may employ various sequencing techniques, including but not limited to sequencing by hybridization, sequencing by ligation, sequencing by synthesis, single-molecule sequencing, optical sequence detection, electromagnetic sequence detection, voltage-change sequence detection, and any other now-known or later-developed technique suitable for reading sequence from DNA. In various embodiments, the sequencer can determine the sequence of the target nucleic acids and generate sequence reads, which may or may not include gaps and may or may not be mate-pair (or paired-end) reads. As shown in FIGS. 4A and 4B, sequencer 2491 sequences target nucleic acids 2492 and obtains sequence reads 2494, which are transmitted for (temporary and/or permanent) storage in one or more data repositories 2495 and/or for processing by one or more computing systems 2497.

[0415] Data repository 2495 may be implemented on one or more storage devices (e.g., hard disk drives, optical disks, solid-state drives, etc.) configured as a disk array (e.g., a SCSI array), a storage cluster, or any other suitable storage organization. The storage devices of the data repository may be configured as internal/integral components of system 2490 or as external components attached to system 2490 (e.g., an external hard drive or disk array) (as shown in FIG. 4B), and/or may be communicatively interconnected in a suitable manner such as a grid, a storage cluster, a storage area network (SAN), and/or network attached storage (NAS) (as shown in FIG. 4A). In various embodiments and implementations, the data repository may be implemented on the storage devices as one or more file systems that store information as files, one or more databases that store information in data records, and/or any other suitable data storage organization.

[0416] Computing system 2497 may include one or more computing devices that include general-purpose processors (e.g., central processing units, or CPUs), memory, and computer logic 2499 which, together with configuration data and/or operating system (OS) software, can perform some or all of the techniques and methods described herein and/or can control the operation of sequencer 2491. For example, any of the data processing and data analysis methods described herein can be performed, in whole or in part, by a computing device that includes a processor configured to execute logic 2499 for carrying out the various method steps. Further, although method steps may be presented as numbered steps, it is understood that the steps of the methods described herein can be performed at the same time (e.g., in parallel by a cluster of computing devices) or in a different order. The functions of computer logic 2499 may be implemented as a single integrated module (e.g., integrated logic) or may be combined in two or more software modules that may provide some additional functionality.

[0417] In some embodiments, computing system 2497 may be a single computing device. In other embodiments, computing system 2497 may include multiple computing devices communicatively and/or operatively interconnected in a grid, a cluster, or a cloud computing environment. Such multiple computing devices may be configured in different form factors, such as computing nodes, blades, or any other suitable hardware configuration. Thus, computing system 2497 in FIGS. 4A and 4B is to be regarded as illustrative rather than restrictive.

[0418] FIG. 5 is a block diagram of an exemplary computing device 2500 that can be configured as part of a sequencer and/or a computing system to execute instructions implementing the techniques and/or methods described herein.

[0419] In FIG. 5, computing device 2500 includes several components that are interconnected directly or indirectly via one or more system buses such as bus 2575. Such components may include, but are not limited to, keyboard 2578, persistent storage device(s) 2579 (e.g., hard disks, solid-state disks, optical disks, etc.), and display adapter 2582, to which one or more display devices (e.g., LCD monitors, flat-panel monitors, plasma screens, etc.) can be coupled. Peripherals and input/output (I/O) devices, which couple to I/O controller 2571, can be connected to computing device 2500 by any number of means known in the art, including but not limited to one or more serial ports, one or more parallel ports, and one or more universal serial buses (USB). External interface(s) 2581 (which may include a network interface card and/or serial ports) can be used to connect computing device 2500 to a network (e.g., the Internet or a local area network (LAN)). External interface(s) 2581 may also include a number of input interfaces that can receive information from various external devices such as a sequencer or any component thereof. The interconnection via system bus 2575 allows one or more processors (e.g., CPUs) 2573 to communicate with each connected component and to execute (and control the execution of) instructions from system memory 2572 and/or from storage device(s) 2579, and also allows the exchange of information between the components. System memory 2572 and/or storage device(s) 2579 may be embodied as one or more computer-readable non-transitory storage media that store the sequences of instructions executed by processor(s) 2573, as well as other data. Such computer-readable non-transitory storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), electromagnetic media (e.g., hard disk drives, solid-state drives, thumb drives, floppy disks, etc.), optical media such as compact discs (CDs) or digital versatile discs (DVDs), flash memory, and the like. Various data values and other structured and/or unstructured information can be output from one component or subsystem to another component or subsystem, can be presented to a user via display adapter 2582 and a suitable display device, can be sent through external interface(s) 2581 over a network to a remote device or a remote data repository, or can be (temporarily and/or permanently) stored on storage device(s) 2579.

[0420] It should be understood that any of the methods and functions performed by computing device 2500 can be implemented in the form of logic using hardware and/or computer software in a modular or integrated manner.

[0421] While this invention is satisfied by embodiments in many different forms, as described in detail in connection with preferred embodiments of the invention, it is understood that the present disclosure is to be considered as exemplary of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated and described herein. Numerous variations may be made by persons skilled in the art without departing from the spirit of the invention. The scope of the invention will be measured by the appended claims and their equivalents. The abstract and the title are not to be construed as limiting the scope of the present invention, as their purpose is to enable the appropriate authorities, as well as the general public, to quickly determine the general nature of the invention.

