WO2016045106A1 - CNV analysis method and detection device for single-cell chromosomes - Google Patents

CNV analysis method and detection device for single-cell chromosomes

Info

Publication number
WO2016045106A1
Authority
WO
WIPO (PCT)
Prior art keywords
breakpoint
value
sequencing
cnv
breakpoints
Prior art date
Application number
PCT/CN2014/087604
Other languages
English (en)
French (fr)
Inventor
李剑
夏滢颖
陈大洋
甄贺富
张彩芬
张爱萍
张现东
刘赛军
李尉
黄奕乐
Original Assignee
深圳华大基因股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因股份有限公司
Priority to PCT/CN2014/087604 (WO2016045106A1)
Priority to CN201480082248.5A (CN106795551B)
Publication of WO2016045106A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • the present invention relates to the field of biotechnology, and more particularly to a CNV analysis method and detection device for single cell chromosomes.
  • PGS Preimplantation Genetic Screening
  • chromosome copy number information can also be detected at the single cell level to study the mechanisms of cancer development and progression. Single-cell or micro-samples for chromosome copy number detection are used in many ways.
  • FISH Fluorescent In Situ Hybridization
  • the present invention provides a single-cell chromosome copy number variation (CNV) analysis technique suitable for the (unequal-length) reads obtained by sequencing single-cell whole genome amplification products; it corrects the bias introduced by amplification through GC correction and the addition of a control set.
  • CNV chromosome copy number variation
  • the sample used in the present invention is a single cell, a few cells, or a micro DNA sample.
  • the cell type may be embryonic cells for preimplantation genetic testing, single tumor cells for cancer research, nucleated red blood cells from maternal peripheral blood, plasma or amniotic fluid for prenatal diagnosis, tissue sections for pathological studies, and the like.
  • the whole genome amplification refers to genome-wide amplification of single cells, several cells or micronucleic acid samples
  • the method may be any of partially random primer amplification (Degenerate Oligonucleotide Primer PCR, abbreviated DOP-PCR), fully random primer amplification (Primer Extension Preamplification PCR, abbreviated PEP-PCR), Multiple Displacement Amplification (MDA), OmniPlex WGA, and the like.
  • DOP-PCR Degenerate Oligonucleotide Primer PCR
  • PEP-PCR Primer Extension Preamplification PCR
  • MDA Multiple Displacement Amplification
  • Commercial kits such as REPLI-g from QIAgen, GenomePlex WGA from Sigma Aldrich, Sureplex from New England Biolabs, PicoPlex WGA from Rubicon Genomics, and illustra Genomiphi V2 from GE Healthcare can also be used.
  • the invention can perform chromosome copy number analysis on the sequencing sequence generated by the new generation high-throughput semiconductor sequencing platform.
  • the new generation high-throughput semiconductor sequencing platforms include the Ion Torrent™ and Ion Proton™ sequencing platforms.
  • the specific analysis method is as follows:
  • a first aspect of the present invention provides a CNV analysis method for single-cell chromosomes, comprising: a first step of extracting valid data; a second step of aligning the extracted valid data and then determining whether the Y chromosome is present; a third step of dividing the aligned sequences into windows and then performing GC content correction; a fourth step of performing breakpoint screening on the GC-corrected data; and a fifth step of filtering the breakpoint-screened data against the judgment conditions and visualizing the result.
  • the sequence alignment is a SOAP alignment.
  • the Y chromosome determination is based on the support number of the specific genes on the Y chromosome.
  • the GC content correction is performed by: calculating a correction coefficient; multiplying the original number of reads by the correction coefficient to obtain the corrected number of reads; and dividing the corrected number of reads by the genome-wide mean of the corrected number of reads of the sample sequence to obtain the Ratio value.
  • the breakpoint screening step comprises the following three sub-steps performed in sequence:
  • an initialization step, in which all chromosomes of the genome are joined end to end to form a ring, each window on the genome is treated as a point, the same number of points is taken on the left and on the right of each point as the initial comparison point sets, a preliminary run test is performed on the initial comparison point sets, possible breakpoints are selected according to the P value, and a preliminary breakpoint set is established, the P value being the P value obtained by a chi-square test on the GC-corrected copy values of all windows between two possible breakpoints;
  • a preliminary breakpoint screening step, in which, on the left and right of each possible breakpoint, the points between it and the adjacent possible breakpoints form two preliminary comparison point sets, a run test is performed on these two sets, and the calculated P value is used as the new P value of that possible breakpoint;
  • and a loop step for determining the final breakpoints, in which the regions between the breakpoints adjacent to the possible breakpoint with the largest P value are repeatedly merged by the run test and the P values of those adjacent breakpoints are updated, until the largest P value is smaller than the threshold or the number of possible breakpoints is smaller than the minimum breakpoint value; the breakpoints that finally remain are the selected breakpoints.
  • the judgment conditions are the following two: (a) the CNV fragment is not smaller than 1M; (b) Ratio ≤ 0.7 or Ratio ≥ 1.3.
  • the visualization refers to drawing a karyotype map of the CNV and a peak map corresponding to the Ratio value of each window.
  • a second aspect of the present invention provides a CNV detection method for single-cell chromosomes, comprising the steps of: constructing a library according to the PF rapid library construction method of the first aspect of the present invention; sequencing the constructed library on the machine to obtain sequencing results; and performing information analysis on the sequencing results.
  • the sequencing is performed using high throughput sequencing technology.
  • the on-machine sequencing is performed using an IonProton sequencer.
  • a third aspect of the present invention provides a CNV detection device for single-cell chromosomes, comprising: a library construction unit, which constructs a library and outputs it; a sequencing unit, which is connected to the library construction unit and sequences the library output by that unit on the machine to output sequencing results; and an analysis unit, which is connected to the sequencing unit and performs information analysis on the sequencing results output by the sequencing unit according to the CNV analysis method of the first aspect of the present invention.
  • the sequencing is performed using high throughput sequencing technology.
  • the on-machine sequencing is performed using an IonProton sequencer.
  • the present invention develops a method for detecting chromosome copy number variation of a single cell, a few cells or a trace nucleic acid sample in view of the unequal length of the sequencing sequence of the Ion Proton sequencing platform. Particularly in the field of in vitro fertilization-embryo transfer, the present invention enables accurate detection of aneuploidy and microdeletions, microduplications of embryonic chromosomes prior to implantation into the uterine cavity.
  • the present invention corrects errors arising during sequencing by adding a control set; reduces the influence of amplification bias by correcting each batch of data, which improves detection accuracy; determines whether the Y chromosome is present from the support number of the specific genes on the Y chromosome, which is more accurate than a coverage-based strategy; and determines the position and size of each CNV with a distinctive breakpoint screening strategy.
  • the Ion Proton sequencing platform is fast, simple, and scalable. Combined with the information analysis process described in the present invention, the Ion Proton sequencing platform can effectively advance clinical research progress in cancer and hereditary diseases.
  • FIG. 1 is a flow chart showing the CNV analysis method of the present invention.
  • FIG. 2 is a structural diagram showing the CNV detection device for single-cell chromosomes of the present invention.
  • FIG. 3 is a schematic diagram showing the breakpoint screening process of the present invention.
  • The present invention is further described below with reference to the drawings and specific embodiments. It is to be understood that the following embodiments are merely illustrative of the invention and are not intended to limit its scope.
  • a control sample set is a set of known normal samples, relative to the test sample.
  • its library construction method, sequencing reagents and sequencing type should be as consistent as possible with those of the sample to be tested.
  • the control sample set is established to reduce accidental experimental error and to provide a reference for the GC correction, standardization, fragmentation and copy number variation estimation of the test sample data. To increase the credibility of the control, we built the control set from 30 male and 30 female samples.
  • Sequence tags generated by high-throughput sequencing platforms are called reads.
  • the bam data format is converted into the fastQ format required by the alignment software, and 50 bp are taken from the 5' end of each read for subsequent analysis.
  • on that basis, 20 bp are then cut from the 5' end to exclude the effect of WGA (the DOP-PCR primer sequence) on subsequent analysis.
  • unique reads are reads with only one alignment position on the reference genome.
  • the trimmed DNA sequences in fastQ format are aligned with SOAPaligner/soap2 (the SOAP alignment shown in Figure 1) against the human genome reference sequence, version 37.3 (hg19; NCBI Build 37.3), from the NCBI database.
  • up to two mismatched bases are allowed during alignment, which yields the position of each sequence on the genome.
  • basic statistics of the sequences are collected before alignment, including quality values, alignment rate, GC content, duplication rate, genome coverage, sequencing depth and Q20 value, and the sequencing data are quality-controlled on that basis.
  • to make the control more targeted, the present invention includes a step of determining the Y chromosome of the sample.
  • There are two determination methods: one is based on the support number of the specific genes on the Y chromosome; the other is based on the average depth of the Y chromosome.
  • the traditional method judges whether the Y chromosome is present from the average depth of reads on the Y chromosome (because depth may differ between parts of a chromosome, the average is used to represent the sequencing depth of the chromosome); that is, the Y chromosome is considered present when its average depth exceeds a threshold.
  • the Y-chromosome-specific gene support number is obtained by selecting five genes unique to the Y chromosome and, after a screening (the number of reads in a gene region must meet a minimum requirement), counting how many of these genes are found in the sequencing results of the sample; for example, if for a sample the number of reads in the regions of four of the five genes exceeds the threshold, the support number is 4.
  • the Y-chromosome-specific gene support number effectively avoids the influence of homologous sequences and also reduces the influence of sequencing error and sample fluctuation on the Y chromosome determination; it is equivalent to narrowing the scope of observation, which reduces the possibility of error within that scope.
  • the support number of the specific genes on the Y chromosome is used as the final basis for determination.
  • hg18 is broken into fragments of the read length (50 bp) to build simulated data.
  • after the simulated data are aligned to the reference genome, windows are drawn so that the number of reads in each window is 100K; this ensures that the number of reads in all windows of a normal sample is highly uniform, which facilitates the subsequent detection of copy number variation.
  • to locate breakpoints more accurately, each window is then slid left and right over a certain range so that the number of reads in it increases by 20K. In other words, the human genome reference sequence is divided into windows of about 100 kb that slide by 20 kb; the method is not limited to such windows, and windows of other lengths may be used depending on the sequencing read length.
  • GC% is the GC content of each window. For example, if the number of unique reads in a window W is 100, the GC content of each of these reads is calculated and their median (assumed to be 47%) is taken as the GC content of the window.
  • applying this to all windows on the genome gives the average number of unique reads per window (assumed to be 130).
  • the windows on the sample sequence and on the reference sequence are each divided into correction units by GC% (step 0.05), and the median (Mi) of the number of reads of the windows in each correction unit is calculated.
  • for example, with a step of 5%, if the GC content of the sample genome ranges from 35% to 55%, it is divided into the four correction units 35%~40%, 40%~45%, 45%~50% and 50%~55%, and window W falls in the 45%~50% unit.
  • corrected number of reads = original number of reads × correction coefficient ci of the correction unit the window belongs to   (1.2)
  • the above process is the step of establishing the control set for the samples.
  • when building the control set, particular care must be taken that the amplification kit, library construction method, sequencing method and other conditions are the same for the controls and the samples, so as to effectively reduce the copy number deviation in regions of high or low GC content and improve the accuracy of copy number variation detection.
  • Each window is treated as a point, and n points (for example, 100 windows) are taken on its left and right sides for a run test, giving a P value for each point. The m points with the smallest P values are kept (10000 points in the example). By loop iteration, the point with the largest P value is deleted each time and the P values of the points to its left and right are updated, until the P values of the remaining points are below 1e-25 or fewer than 24 points remain. The remaining points are taken as candidate CNV breakpoints (i.e. the boundary points of each CNV fragment); the Ratio value between two breakpoints (the mean of the GC-corrected Ratio values of all windows between them) and the P value (the P value of a chi-square test on the GC-corrected Ratio values of all windows between them) are then calculated.
  • Initialization (finding breakpoints): the same number of points (currently 100) is taken on the left and on the right of each point as two comparison point sets, and a preliminary run test is performed on the two sets. Possible breakpoints are selected by P value (the P values are sorted in ascending order and the first 10000 points, i.e. the 10000 points with the smallest P values, are chosen) to build the preliminary breakpoint set. The subsequent work is to verify the points in the breakpoint set repeatedly and screen out those that are not breakpoints.
  • Preliminary breakpoint screening: on each side of every breakpoint, the points between it and the adjacent breakpoint form the left and right point sets; a run test is performed on the two sets and the calculated P value becomes the new P value of the breakpoint. (In effect this updates the P value: the point sets now generally contain more elements, so their fluctuation is closer to that of the sample as a whole and the run test result is closer to the real situation, which is why the new P value replaces the one obtained during initialization.)
  • Determining the final breakpoints in a loop: the point with the largest P value in the breakpoint set (M) is selected, its adjacent breakpoints on the left and right (L and R) are each re-tested against the regions reaching to their own neighbors, and M is deleted from the set, which merges the regions on either side of M and updates the P values of L and R. The point with the largest P value in the updated breakpoint set is then selected and the steps are repeated until the largest P value is below the threshold (currently set to 1e-25; the user can change this value) or the number of breakpoints is below the minimum breakpoint value (because all chromosomes of the genome were joined into a ring at the start, the breakpoint set contains at least the minimum number of breakpoints; in this embodiment the minimum breakpoint value is 24). The breakpoints that remain in the set are the breakpoints of the final CNV result, i.e. the start and end positions of the regions where copy number variation has occurred.
  • the characteristics of the screening are as follows: 1. by forming a ring, the genome is treated as a whole, so aneuploidy is detected more effectively than by methods that search for breakpoints chromosome by chromosome; 2. screening breakpoints with the run test is less affected by fluctuation of the observed values than traditional parametric tests; 3. repeated run tests exclude a large number of false positive signals, making breakpoint detection more accurate.
  • the run test, also called the "coherence test", judges from the number of runs formed by the ordering of the sample observations, and can test the randomness of a sample and whether two populations share the same distribution.
  • in the breakpoint screening strategy above, the run test is mainly used to check whether the two sides of a breakpoint are coherent. A large P value indicates that the point sets on both sides follow the same distribution and are highly coherent, so the point is unlikely to be a breakpoint; conversely, a small P value indicates that the point sets on the two sides belong to different distributions and are weakly coherent, so the point may be a breakpoint.
  • CNV positive signal
  • the above procedure can be executed automatically by a computer program, which takes the data produced by next-generation sequencing, applies batch correction to the test sample, and then performs data correction, standardization and fragmentation against the control set to estimate the degree and size of copy number variation in the test sample.
  • according to yet another aspect of the present invention, a CNV detection device for single-cell chromosomes is provided. As shown in FIG. 2, the device includes a library construction unit 100, a sequencing unit 200, and an analysis unit 300.
  • the library construction unit 100 constructs a library and outputs it.
  • the sequencing unit 200 is connected to the library construction unit 100 and sequences the library output by the library construction unit 100 on the machine to output the sequencing results.
  • the analysis unit 300 is connected to the sequencing unit 200 and performs information analysis on the sequencing results output by the sequencing unit 200 using the above analysis technique.
  • the present invention has been validated on more than 300 samples with known results, with a signal detection rate of 100%.
  • the following is a partial result display:

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

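A CNV analysis method and detection device for single-cell chromosomes. The CNV analysis method comprises: a first step of extracting valid data; a second step of aligning the extracted valid data and then determining whether the Y chromosome is present; a third step of dividing the aligned sequences into windows and then performing GC content correction; a fourth step of performing breakpoint screening on the GC-corrected data; and a fifth step of filtering the breakpoint-screened data against the judgment conditions and visualizing the result.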

Description

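CNV analysis method and detection device for single-cell chromosomes
Technical Field
The present invention relates to the field of biotechnology, and more particularly to a CNV analysis method and detection device for single-cell chromosomes.
Background
Much scientific research and many clinical applications now have to be carried out at the level of a single cell or of trace samples. Analysing DNA genetic information at the single-cell level to determine whether a cell, embryo or individual carries a chromosome copy number abnormality is a common research approach. For example, Preimplantation Genetic Screening (PGS) in assisted reproduction involves genetic DNA testing of gametes, single blastomeres or embryonic cells to determine whether the chromosomes of the fertilized egg or embryo are normal, so that normal embryos can be selected for implantation. Copy number detection on the very small number of fetal cells or fetal chromosomes in maternal peripheral blood can also establish whether the fetus is normal, for the purpose of non-invasive prenatal diagnosis. In cancer research, chromosome copy number information can likewise be measured at the single-cell level to study the mechanisms of cancer initiation and progression. Chromosome copy number detection on single cells or trace samples is thus applied in many areas.
For detecting chromosome copy number abnormalities at the single-cell level, Fluorescent In Situ Hybridization (FISH) has long been used. However, because the number of fluorescent dyes is limited, only a few pairs of chromosomes can be examined, and the procedure is complex and unsuited to large-scale testing. With the continued development of high-throughput sequencing, combined with single-cell whole genome amplification, genome-wide chromosome copy number detection from a single cell has become possible. However, the amplification bias that is unavoidable in single-cell whole genome amplification may mask the variation originally present in the genome.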
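Summary of the Invention
In view of the above problems, the present invention provides a single-cell chromosome copy number variation (Copy Number Variation, CNV) analysis technique suitable for the (unequal-length) reads obtained by sequencing single-cell whole genome amplification products. It corrects the bias introduced by amplification through GC correction, the addition of a control set and similar measures, automates the information analysis, and is suitable for testing large numbers of samples.
The samples used in the present invention are single cells, a few cells, or trace DNA samples. The cell type may be embryonic cells for preimplantation genetic testing, single tumor cells for cancer research, nucleated red blood cells from maternal peripheral blood, plasma or amniotic fluid for prenatal diagnosis, tissue sections for pathological studies, and the like.
In the present invention, whole genome amplification refers to genome-wide amplification of a single cell, a few cells or a trace nucleic acid sample. The method may be any of partially random primer amplification (Degenerate Oligonucleotide Primer PCR, DOP-PCR), fully random primer amplification (Primer Extension Preamplification PCR, PEP-PCR), Multiple Displacement Amplification (MDA), OmniPlex WGA and the like. Any commercial kit may also be used, such as REPLI-g from QIAgen, GenomePlex WGA from Sigma Aldrich, Sureplex from New England Biolabs, PicoPlex WGA from Rubicon Genomics, or illustra Genomiphi V2 from GE Healthcare.
The present invention can perform chromosome copy number analysis on reads produced by new-generation high-throughput semiconductor sequencing platforms, which include the Ion Torrent™ and Ion Proton™ sequencing platforms.
The object of the present invention is to provide an information analysis method that corrects the unavoidable amplification bias through GC correction, the addition of a control set and similar measures, and that enables automated analysis of large numbers of samples. The analysis method is as follows: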
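A first aspect of the present invention provides a CNV analysis method for single-cell chromosomes, comprising: a first step of extracting valid data; a second step of aligning the extracted valid data and then determining whether the Y chromosome is present; a third step of dividing the aligned sequences into windows and then performing GC content correction; a fourth step of performing breakpoint screening on the GC-corrected data; and a fifth step of filtering the breakpoint-screened data against the judgment conditions and visualizing the result.
Preferably, the sequence alignment is a SOAP alignment.
Preferably, the Y chromosome determination is based on the support number of the specific genes on the Y chromosome.
Preferably, the GC content correction is performed by: calculating a correction coefficient; multiplying the original number of reads by the correction coefficient to obtain the corrected number of reads; and dividing the corrected number of reads by the genome-wide mean of the corrected number of reads of the sample sequence to obtain the Ratio value.
Preferably, the breakpoint screening step comprises the following three sub-steps performed in sequence: an initialization step, in which all chromosomes of the genome are joined end to end to form a ring, each window on the genome is treated as a point, the same number of points is taken on the left and on the right of each point as the initial comparison point sets, a preliminary run test is performed on the initial comparison point sets, possible breakpoints are selected according to the P value, and a preliminary breakpoint set is established, the P value being the P value obtained by a chi-square test on the GC-corrected copy values of all windows between two possible breakpoints; a preliminary breakpoint screening step, in which, on the left and right of each possible breakpoint, the points between it and the adjacent possible breakpoints form two preliminary comparison point sets, a run test is performed on these two sets, and the calculated P value is used as the new P value of that possible breakpoint; and a loop step for determining the final breakpoints, in which the regions between the breakpoints adjacent to the possible breakpoint with the largest P value are repeatedly merged by the run test and the P values of those adjacent breakpoints are updated, until the largest P value is smaller than the threshold or the number of possible breakpoints is smaller than the minimum breakpoint value, the breakpoints that finally remain being the selected breakpoints.
Preferably, the judgment conditions are the following two:
(a) the CNV fragment is not smaller than 1M;
(b) Ratio ≤ 0.7 or Ratio ≥ 1.3.
Preferably, the visualization consists of drawing a karyotype map of the CNVs and a peak plot of the Ratio value of each window.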
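A second aspect of the present invention provides a CNV detection method for single-cell chromosomes, comprising the steps of: constructing a library according to the PF rapid library construction method of the first aspect of the present invention; sequencing the constructed library on the machine to obtain sequencing results; and performing information analysis on the sequencing results.
Preferably, the on-machine sequencing is performed using high-throughput sequencing technology.
Preferably, the on-machine sequencing is performed using an IonProton sequencer.
A third aspect of the present invention provides a CNV detection device for single-cell chromosomes, comprising: a library construction unit, which constructs a library and outputs it; a sequencing unit, which is connected to the library construction unit and sequences the library output by that unit on the machine to output sequencing results; and an analysis unit, which is connected to the sequencing unit and performs information analysis on the sequencing results output by the sequencing unit according to the CNV analysis method of the first aspect of the present invention.
Preferably, the on-machine sequencing is performed using high-throughput sequencing technology.
Preferably, the on-machine sequencing is performed using an IonProton sequencer.
Given that the reads of the Ion Proton sequencing platform are of unequal length, the present invention provides a method for detecting chromosome copy number variation in a single cell, a few cells or a trace nucleic acid sample. In particular, in the field of in vitro fertilization and embryo transfer, the present invention enables accurate detection of aneuploidy, microdeletions and microduplications of embryonic chromosomes before transfer into the uterine cavity. In addition, the present invention corrects errors arising during sequencing by adding a control set; reduces the influence of amplification bias by correcting each batch of data, which improves detection accuracy; determines whether the Y chromosome is present from the support number of the specific genes on the Y chromosome, which is more accurate than a coverage-based strategy; and determines the position and size of each CNV with a distinctive breakpoint screening strategy. The Ion Proton sequencing platform is fast, simple and scalable; combined with the information analysis workflow described here, it can effectively advance clinical research on cancer and hereditary diseases.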
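Brief Description of the Drawings
FIG. 1 is a flow chart showing the CNV analysis method of the present invention.
FIG. 2 is a structural diagram showing the CNV detection device for single-cell chromosomes of the present invention.
FIG. 3 is a schematic diagram showing the breakpoint screening process of the present invention.
Detailed Description
The present invention is further described below with reference to the drawings and specific embodiments. It is to be understood that the following embodiments are merely illustrative of the invention and are not intended to limit its scope.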
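CNV sample detection
Before samples are tested, a control sample set must first be obtained. A control sample set is a set of known normal samples, relative to the test sample. Its library construction method, sequencing reagents and sequencing type should be as consistent as possible with those of the sample to be tested. The control sample set is established to reduce accidental experimental error and to provide a reference for the GC correction, standardization, fragmentation and copy number variation estimation of the test sample data. To increase the credibility of the control, we built the control set from 30 male and 30 female samples.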
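The steps of the analysis method are described in detail below with reference to FIG. 1:
1 Extracting valid data
Sequence tags produced by a high-throughput sequencing platform are called reads. Because the sequencing data of the Ion Proton platform are of unequal length, the bam data are converted into the fastQ format required by the alignment software, and 50 bp are taken from the 5' end of each read for subsequent analysis; on that basis, 20 bp are then cut from the 5' end to exclude the effect of WGA (the DOP-PCR primer sequence) on subsequent analysis. The reason for taking 50 bp is that, with existing algorithms, the longer the retained read length, the smaller the total amount of data available for downstream analysis, while the shorter the retained read length, the lower the unique map rate after trimming. To balance data volume against unique map rate and maximise the amount of valid data, gradient tests led us to conclude that 50 bp reads are the best choice under current conditions. Here, unique reads are reads with only one alignment position on the reference genome.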
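The trimming rule in step 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names and the `(name, bases, qualities)` record layout are hypothetical, the text is read literally (keep the first 50 bases, then drop the first 20 of those, leaving 30), and discarding reads shorter than 50 bp is an assumption because the source does not say how such reads are handled.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # hypothetical layout: (read name, bases, qualities)


def trim_read(seq: str, qual: str, keep: int = 50, primer: int = 20) -> Tuple[str, str]:
    """Keep the first `keep` bases from the 5' end, then drop the first `primer` of them."""
    return seq[primer:keep], qual[primer:keep]


def trim_records(records: Iterable[Record], keep: int = 50, primer: int = 20) -> Iterator[Record]:
    """Trim every read that is long enough; shorter reads are skipped (an assumption)."""
    for name, seq, qual in records:
        if len(seq) < keep:
            continue
        trimmed_seq, trimmed_qual = trim_read(seq, qual, keep, primer)
        yield name, trimmed_seq, trimmed_qual
```

If the intended order is instead "remove the 20 bp primer first, then keep 50 bp", the slice would become `seq[primer:primer + keep]`; the source wording allows either reading.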
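2 Sequence alignment
The trimmed DNA sequences in fastQ format are aligned with SOAPaligner/soap2 (the SOAP alignment shown in FIG. 1) against the human genome reference sequence, version 37.3 (hg19; NCBI Build 37.3), from the NCBI database. Up to two mismatched bases are allowed during alignment, which yields the position of each sequence on the genome. Before alignment, basic statistics of the sequences are collected, including quality values, alignment rate, GC content, duplication rate, genome coverage, sequencing depth and Q20 value, and the sequencing data are quality-controlled on that basis. To keep repetitive sequences from interfering with the copy number variation analysis, only reads that align uniquely to the human genome reference sequence (unique reads) are selected, the duplicate sequences produced by amplification are removed from them, and the duplicate alignment rate is calculated. For details of SOAP, see http://soap.genomics.org.cn/.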
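The selection of unique reads and the removal of amplification duplicates can be illustrated as follows. This is a hedged sketch: it assumes the alignment hits have already been parsed into `(read_id, chromosome, position, strand)` tuples, which is a hypothetical layout and not SOAP's actual output format, and it treats reads sharing chromosome, position and strand as duplicates, which is a common convention rather than something the source specifies.

```python
from collections import Counter
from typing import List, Tuple

Hit = Tuple[str, str, int, str]  # hypothetical: (read_id, chromosome, position, strand)


def unique_deduplicated(hits: List[Hit]) -> Tuple[List[Hit], float]:
    """Return the unique, de-duplicated hits and the duplicate rate among unique reads."""
    hits_per_read = Counter(hit[0] for hit in hits)
    seen = set()
    kept: List[Hit] = []
    unique_total = 0
    for read_id, chrom, pos, strand in hits:
        if hits_per_read[read_id] != 1:  # more than one alignment position: not unique
            continue
        unique_total += 1
        key = (chrom, pos, strand)
        if key in seen:  # same position and strand as an earlier read: duplicate
            continue
        seen.add(key)
        kept.append((read_id, chrom, pos, strand))
    duplicate_rate = 0.0 if unique_total == 0 else 1 - len(kept) / unique_total
    return kept, duplicate_rate
```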
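Y chromosome determination during sequence alignment
To make the control more targeted, the present invention includes a step of determining the Y chromosome of the sample. There are two determination methods: one is based on the support number of the specific genes on the Y chromosome; the other is based on the average depth of the Y chromosome. The traditional method judges whether the Y chromosome is present from the average depth of reads on the Y chromosome (because depth may differ between parts of a chromosome, the average is used to represent the sequencing depth of the chromosome); that is, the Y chromosome is considered present when its average depth exceeds a threshold. The traditional method is strongly affected by sequencing error, homologous sequences and sample condition (for example, the sequencing data of some embryo samples in poor condition fluctuate widely overall), and may produce false positives (for example, a male sample with normal sex chromosomes being called -X) or false negatives (for example, an XXY individual being called normal).
The Y-chromosome-specific gene support number, by contrast, is obtained by selecting five genes unique to the Y chromosome and, after a screening (the number of reads in a gene region must meet a minimum requirement), counting how many of these genes are found in the sequencing results of the sample. For example, if for a sample the number of reads in the regions of four of the five genes exceeds the threshold, the support number is 4.
The Y-chromosome-specific gene support number effectively avoids the influence of homologous sequences and also reduces the influence of sequencing error and sample fluctuation on the Y chromosome determination; it is equivalent to narrowing the scope of observation, which reduces the possibility of error within that scope.
Based on a comparison of existing results, the support number of the specific genes on the Y chromosome is used as the final basis for determination.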
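The support number described above amounts to counting the Y-specific genes that pass a read-count threshold. The sketch below is illustrative only: the gene names, both thresholds and the `y_present` decision rule are hypothetical, since the source names neither the five genes nor the cut-offs, and treating a count equal to the threshold as passing is an assumption.

```python
from typing import Mapping


def y_support_number(reads_per_gene: Mapping[str, int], min_reads: int) -> int:
    """Count the Y-specific genes whose region holds at least `min_reads` reads."""
    return sum(1 for count in reads_per_gene.values() if count >= min_reads)


def y_present(reads_per_gene: Mapping[str, int], min_reads: int, min_support: int) -> bool:
    """Call the Y chromosome present when enough genes are supported (hypothetical rule)."""
    return y_support_number(reads_per_gene, min_reads) >= min_support
```

With five placeholder genes of which four pass the read threshold, `y_support_number` returns 4, matching the worked example in the text.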
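3 Window division
hg18 is broken into fragments of the read length (50 bp) to build simulated data. After the simulated data are aligned to the reference genome, windows are drawn so that the number of reads in each window is 100K; this ensures that the number of reads in all windows of a normal sample is highly uniform, which facilitates the subsequent detection of copy number variation. Then, to locate breakpoints more accurately, each window is slid left and right over a certain range so that the number of reads in it increases by 20K. In other words, the human genome reference sequence is divided into windows of about 100 kb that slide by 20 kb; the method is not limited to such windows, and windows of other lengths may be used depending on the sequencing read length.
GC content correction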
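First, the number of unique reads in each window is counted and the GC content (GC%) of each window is calculated. For example, if the number of unique reads in a window W is 100, the GC content of each of these reads is calculated and their median (assumed to be 47%) is taken as the GC content of the window. Applying this to all windows on the genome gives the average number of unique reads per window (assumed to be 130).
Next, the windows on the sample sequence and on the reference sequence are each divided into correction units by GC% (step 0.05), and the median (Mi) of the number of reads of the windows in each correction unit is calculated. For example, with a step of 5%, if the GC content of the sample genome ranges from 35% to 55%, it is divided into the four correction units 35%~40%, 40%~45%, 45%~50% and 50%~55%, and window W falls in the 45%~50% unit.
Then the correction coefficient ci of each correction unit is calculated according to formula (1.1).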
ci = (mean number of unique reads over all windows on the genome) / Mi   (1.1)
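For example, when the median number of reads of the windows in each correction unit is calculated, suppose the median of the 45%~50% unit is 110; formula (1.1) then gives its correction coefficient as C = 130/110 ≈ 1.18.
The corrected number of reads of each window, and the genome-wide mean of the corrected number of reads of the sample sequence, are then calculated according to formula (1.2).
corrected number of reads = original number of reads × correction coefficient ci of the correction unit the window belongs to   (1.2)
In the example above, the corrected number of reads of window W = 100 × 1.18 = 118. The corrected number of reads of every window on the genome is calculated in the same way, and their mean is taken (assumed to be 125). Finally, the Ratio value of each window is calculated according to formula (1.3) for use in subsequent analysis.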
Ratio = (corrected number of reads of the window) / (mean corrected number of reads over all windows of the sample genome)   (1.3)
In the example above, the Ratio value of window W = 118/125 = 0.9440.
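Formulas (1.1) to (1.3) can be put together in a short sketch. It is a minimal illustration for a single sample, assuming the per-window unique read counts and GC percentages are already available; the function name `gc_correct` and its return layout are hypothetical, GC is passed in percent so that the 5% bins can be found with a floor division, and every bin is assumed to have a non-zero median.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence, Tuple


def gc_correct(unique_reads: Sequence[float], gc_percent: Sequence[float],
               step: float = 5.0) -> Tuple[Dict[int, float], List[float], List[float]]:
    """Return (coefficient per GC bin, corrected reads per window, Ratio per window)."""
    genome_mean = mean(unique_reads)
    bins: Dict[int, List[float]] = {}
    for reads, gc in zip(unique_reads, gc_percent):
        bins.setdefault(math.floor(gc / step), []).append(reads)
    # (1.1): coefficient = genome-wide mean / median of the correction unit
    coeff = {b: genome_mean / median(values) for b, values in bins.items()}
    # (1.2): corrected reads = original reads x coefficient of the window's unit
    corrected = [reads * coeff[math.floor(gc / step)]
                 for reads, gc in zip(unique_reads, gc_percent)]
    # (1.3): Ratio = corrected reads / genome-wide mean of corrected reads
    corrected_mean = mean(corrected)
    ratios = [value / corrected_mean for value in corrected]
    return coeff, corrected, ratios
```

For a window with 100 reads in a unit whose median is 110, with a genome-wide mean of 130, the sketch gives a corrected count of about 118.2; the text rounds the coefficient to 1.18 first and so reports 118.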
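The above process is the step of establishing the control set for the samples. When building the control set, particular care must be taken that the amplification kit, library construction method, sequencing method and other conditions are the same for the controls and the samples; only then can the copy number deviation in regions of high or low GC content be effectively reduced and the accuracy of copy number variation detection improved.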
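4 Breakpoint screening
Each window is treated as a point, and n points (for example, 100 windows) are taken on its left and right sides for a run test, giving a P value for each point. The m points with the smallest P values are kept (10000 points in the example). By loop iteration, the point with the largest P value is deleted each time and the P values of the points to its left and right are updated, until the P values of the remaining points are below 1e-25 or fewer than 24 points remain. The remaining points are taken as candidate CNV breakpoints (i.e. the boundary points of each CNV fragment); the Ratio value between two breakpoints (the mean of the GC-corrected Ratio values of all windows between them) and the P value (the P value of a chi-square test on the GC-corrected Ratio values of all windows between them) are then calculated.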
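With reference to FIG. 3, the breakpoint screening proceeds as follows. All chromosomes of the genome are joined end to end to form a ring. Each window on the genome is treated as a point (below, "point" means window), and the observation used in every test below is the Ratio value of each point:
1) Initialization (finding breakpoints): the same number of points (currently 100) is taken on the left and on the right of each point as two comparison point sets, and a preliminary run test is performed on the two sets. Possible breakpoints are selected by P value (the P values are sorted in ascending order and the first 10000 points, i.e. the 10000 points with the smallest P values, are chosen) to build the preliminary breakpoint set. The subsequent work is to verify the points in the breakpoint set repeatedly and screen out those that are not breakpoints.
2) Preliminary breakpoint screening: on each side of every breakpoint, the points between it and the adjacent breakpoint form the left and right point sets; a run test is performed on the two sets and the calculated P value becomes the new P value of the breakpoint. (In effect this updates the P value: the point sets now generally contain more elements, so their fluctuation is closer to that of the sample as a whole and the run test result is closer to the real situation, which is why the new P value replaces the one obtained during initialization.)
3) Determining the final breakpoints in a loop: the point with the largest P value in the breakpoint set (call it M) is selected, and a run test is performed for each of its adjacent breakpoints on the left and right (call them L and R): on the left and right of L, the points between L and its adjacent breakpoints form the left and right point sets, a run test is performed on the two sets, and the calculated P value becomes the new P value of L; R is treated in the same way, and M is deleted from the breakpoint set (M has the largest P value and can therefore be regarded as the point least likely to be a true breakpoint). The net effect is that the regions on either side of M, between L and R, are merged and the P values of L and R are updated. After this, the point with the largest P value in the updated breakpoint set is selected and the steps are repeated until the largest P value is below the threshold (currently set to 1e-25; the user can change this value) or the number of breakpoints is below the minimum breakpoint value (because all chromosomes of the genome were joined into a ring at the start, the breakpoint set contains at least the minimum number of breakpoints; in this embodiment the minimum breakpoint value is 24). The breakpoints that remain in the set are the breakpoints of the final CNV result, i.e. the start and end positions of the regions where copy number variation has occurred.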
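The characteristics of this screening are: 1. by forming a ring, the genome is treated as a whole, so aneuploidy is detected more effectively than by methods that search for breakpoints chromosome by chromosome; 2. screening breakpoints with the run test is less affected by fluctuation of the observed values than traditional parametric tests; 3. repeated run tests exclude a large number of false positive signals, making breakpoint detection more accurate.
The run test, also called the "coherence test", judges from the number of runs formed by the ordering of the sample observations, and can test the randomness of a sample and whether two populations share the same distribution. In the breakpoint screening strategy above, the run test is mainly used to check whether the two sides of a breakpoint are coherent. A large P value indicates that the point sets on both sides of the breakpoint follow the same distribution and are highly coherent, so the point is unlikely to be a breakpoint; conversely, a small P value indicates that the point sets on the two sides belong to different distributions and are weakly coherent, so the point may be a breakpoint.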
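The three sub-steps and the run test can be sketched together as follows. This is a simplified reading under several assumptions, not the patent's implementation: the run test is taken to be the two-sample Wald-Wolfowitz runs test with a one-sided normal approximation (the source does not name the variant), arcs with fewer than `min_n` points are given P = 1 because the approximation is unreliable there, ties are broken by sort order, and the breakpoint set is never reduced below `min_breakpoints`. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def runs_p(left: Sequence[float], right: Sequence[float], min_n: int = 10) -> float:
    """One-sided two-sample runs test: a small P means few runs, i.e. different levels."""
    n1, n2 = len(left), len(right)
    if min(n1, n2) < min_n:  # assumption: too few points to test, treat as no evidence
        return 1.0
    labels = [lab for _, lab in sorted([(v, 0) for v in left] + [(v, 1) for v in right])]
    runs = 1 + sum(a != b for a, b in zip(labels, labels[1:]))
    n = n1 + n2
    mu = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    return 0.5 * math.erfc(-(runs - mu) / math.sqrt(2 * var))


def arc(values: Sequence[float], start: int, end: int) -> List[float]:
    """Windows start .. end-1 on the ring, wrapping past the last window if needed."""
    if start < end:
        return list(values[start:end])
    return list(values[start:]) + list(values[:end])


def screen_breakpoints(ratios: Sequence[float], n: int = 100, m: int = 10000,
                       threshold: float = 1e-25, min_breakpoints: int = 24) -> List[int]:
    size = len(ratios)
    if n < 1 or 2 * n > size or min_breakpoints < 2 or m < min_breakpoints:
        raise ValueError("need 1 <= n <= size/2 and 2 <= min_breakpoints <= m")

    # 1) Initialization: compare the n windows on each side of every point.
    def flank_p(i: int) -> float:
        return runs_p(arc(ratios, (i - n) % size, i), arc(ratios, i, (i + n) % size))

    bps = sorted(sorted(range(size), key=flank_p)[:m])

    # 2) Preliminary screening: compare the arcs reaching to the neighboring breakpoints.
    def neighbour_p(k: int) -> float:
        prev_bp, here, next_bp = bps[k - 1], bps[k], bps[(k + 1) % len(bps)]
        return runs_p(arc(ratios, prev_bp, here), arc(ratios, here, next_bp))

    pvals = [neighbour_p(k) for k in range(len(bps))]

    # 3) Loop: drop the breakpoint with the largest P value (merging the regions on
    #    either side of it) and refresh the P values of its two neighbors.
    while len(bps) > min_breakpoints and max(pvals) >= threshold:
        k = max(range(len(bps)), key=pvals.__getitem__)
        del bps[k], pvals[k]
        for j in {(k - 1) % len(bps), k % len(bps)}:
            pvals[j] = neighbour_p(j)
    return bps
```

The list-based loop is quadratic in the number of candidates, which is fine for illustration; a production version would more likely use a priority queue with a linked list of breakpoints. Because ties and short arcs are handled arbitrarily here, the surviving breakpoints can sit a few windows away from the true boundary.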
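5 Data filtering and visualization
Each positive signal (CNV) is checked against two conditions: a) the CNV fragment is not smaller than 1M; b) Ratio ≤ 0.7 (deletion) or Ratio ≥ 1.3 (duplication). CNVs are called according to these conditions, and a karyotype map and a peak plot of the Ratio value of each window are drawn.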
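The two filtering conditions translate directly into a small predicate. This is a minimal sketch: the `Segment` type and `classify` function are hypothetical names, and "1M" is interpreted as 1,000,000 bp, which the source does not spell out.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    chrom: str
    start: int   # first base of the segment
    end: int     # one past the last base of the segment
    ratio: float # mean GC-corrected Ratio of the windows in the segment


def classify(seg: Segment, min_size: int = 1_000_000,
             loss: float = 0.7, gain: float = 1.3) -> Optional[str]:
    """Return 'deletion' or 'duplication' for a reportable CNV, otherwise None."""
    if seg.end - seg.start < min_size:  # condition (a): fragment not smaller than 1M
        return None
    if seg.ratio <= loss:               # condition (b): Ratio <= 0.7 is a deletion
        return "deletion"
    if seg.ratio >= gain:               # condition (b): Ratio >= 1.3 is a duplication
        return "duplication"
    return None
```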
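The above procedure can be executed automatically by a computer program, which takes the data produced by next-generation sequencing, applies batch correction to the test sample, and then performs data correction, standardization and fragmentation against the control set to estimate the degree and size of copy number variation in the test sample.
According to yet another aspect of the present invention, a CNV detection device for single-cell chromosomes is provided. As shown in FIG. 2, the device includes a library construction unit 100, a sequencing unit 200, and an analysis unit 300.
According to an embodiment of the present invention, the library construction unit 100 constructs a library and outputs it.
The sequencing unit 200 is connected to the library construction unit 100 and sequences the library output by the library construction unit 100 on the machine to output the sequencing results.
The analysis unit 300 is connected to the sequencing unit 200 and performs information analysis on the sequencing results output by the sequencing unit 200 using the above analysis technique.
Those skilled in the art will understand that any device known in the art that is suitable for the above operations may be used as a component of each of the above units. The term "connected" as used herein is to be interpreted broadly: the connection may be direct or indirect through an intermediate medium, and a person of ordinary skill in the art can understand its specific meaning according to the circumstances.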
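Results
The present invention has been validated on more than 300 samples with known results, with a signal detection rate of 100%. Some of the results are shown below:
Table 1 Validation results of CNV detection on the Ion Proton platform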
Figure PCTCN2014087604-appb-000003
Figure PCTCN2014087604-appb-000004
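It should further be understood that, after reading the above teachings of the present invention, those skilled in the art may make various changes or modifications to it, and that such equivalents likewise fall within the scope defined by the claims appended to this application.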

Claims (10)

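  1. A CNV analysis method for single-cell chromosomes, characterized by comprising the following steps:
    a first step of extracting valid data;
    a second step of aligning the extracted valid data and then determining whether the Y chromosome is present;
    a third step of dividing the aligned sequences into windows and then performing GC content correction;
    a fourth step of performing breakpoint screening on the GC-corrected data; and
    a fifth step of filtering the breakpoint-screened data against the judgment conditions and visualizing the result.
  2. The CNV analysis method according to claim 1, characterized in that
    the sequence alignment is a SOAP alignment.
  3. The CNV analysis method according to claim 1, characterized in that
    the Y chromosome determination is based on the support number of the specific genes on the Y chromosome.
  4. The CNV analysis method according to claim 1, characterized in that
    the GC content correction is performed by the following steps:
    calculating a correction coefficient;
    multiplying the original number of reads by the correction coefficient to obtain the corrected number of reads; and
    dividing the corrected number of reads by the genome-wide mean of the corrected number of reads of the sample sequence to obtain the Ratio value.
  5. The CNV analysis method according to claim 1, characterized in that
    the fourth step comprises the following three sub-steps performed in sequence:
    an initialization step, in which all chromosomes of the genome are joined end to end to form a ring, each window on the genome is treated as a point, the same number of points is taken on the left and on the right of each point as the initial comparison point sets, a preliminary run test is performed on the initial comparison point sets, possible breakpoints are selected according to the P value, and a preliminary breakpoint set is established, the P value being the P value obtained by a run test on the GC-corrected Ratio values of all windows between two possible breakpoints;
    a preliminary breakpoint screening step, in which, on the left and right of each possible breakpoint, the points between it and the adjacent possible breakpoints form two preliminary comparison point sets, a run test is performed on these two sets, and the calculated P value is used as the new P value of that possible breakpoint; and
    a loop step for determining the final breakpoints, in which the regions between the breakpoints adjacent to the possible breakpoint with the largest P value are repeatedly merged by the run test and the P values of those adjacent breakpoints are updated, until the largest P value is smaller than the threshold or the number of possible breakpoints is smaller than the minimum breakpoint value, the breakpoints that finally remain being the selected breakpoints.
  6. The CNV analysis method according to claim 1, characterized in that
    the judgment conditions are the following two:
    (a) the CNV fragment is not smaller than 1M;
    (b) Ratio ≤ 0.7 or Ratio ≥ 1.3.
  7. The CNV analysis method according to claim 1, characterized in that
    the visualization consists of drawing a karyotype map of the CNVs and a peak plot of the Ratio value of each window.
  8. A CNV detection device for single-cell chromosomes, characterized by comprising:
    a library construction unit, which constructs a library and outputs it;
    a sequencing unit, which is connected to the library construction unit and sequences the library output by the library construction unit on the machine to output sequencing results; and
    an analysis unit, which is connected to the sequencing unit and performs information analysis on the sequencing results output by the sequencing unit according to the CNV analysis method of any one of claims 1-7.
  9. The CNV detection device according to claim 8, characterized in that
    the on-machine sequencing is performed using high-throughput sequencing technology.
  10. The CNV detection device according to claim 8, characterized in that
    the on-machine sequencing is performed using an Ion Proton sequencer.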
PCT/CN2014/087604 2014-09-26 2014-09-26 CNV analysis method and detection device for single-cell chromosomes WO2016045106A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/087604 WO2016045106A1 (zh) 2014-09-26 2014-09-26 CNV analysis method and detection device for single-cell chromosomes
CN201480082248.5A CN106795551B (zh) 2014-09-26 2014-09-26 CNV analysis method and detection device for single-cell chromosomes

Publications (1)

Publication Number Publication Date
WO2016045106A1 (zh) 2016-03-31

Family

ID=55580153

Country Status (2)

Country Link
CN (1) CN106795551B (zh)
WO (1) WO2016045106A1 (zh)

Also Published As

Publication number Publication date
CN106795551A (zh) 2017-05-31
CN106795551B (zh) 2020-11-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14902852

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09.08.2017)

122 Ep: pct application non-entry in european phase

Ref document number: 14902852

Country of ref document: EP

Kind code of ref document: A1