WO2013149385A1 - 一种拷贝数变异检测方法和系统 - Google Patents

一种拷贝数变异检测方法和系统 Download PDF

Info

Publication number
WO2013149385A1
WO2013149385A1 PCT/CN2012/073545 CN2012073545W WO2013149385A1 WO 2013149385 A1 WO2013149385 A1 WO 2013149385A1 CN 2012073545 W CN2012073545 W CN 2012073545W WO 2013149385 A1 WO2013149385 A1 WO 2013149385A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
window
cnv
sample
breakpoint
Prior art date
Application number
PCT/CN2012/073545
Other languages
English (en)
French (fr)
Inventor
李旭超
陈盛培
陈芳
谢伟伟
汪建
王俊
杨焕明
张秀清
Original Assignee
深圳华大基因健康科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to KR1020147031062A priority Critical patent/KR101795124B1/ko
Priority to EP12873786.3A priority patent/EP2835752B8/en
Priority to US14/389,898 priority patent/US20150056619A1/en
Priority to SG11201406250SA priority patent/SG11201406250SA/en
Priority to PCT/CN2012/073545 priority patent/WO2013149385A1/zh
Priority to RU2014144349A priority patent/RU2014144349A/ru
Application filed by 深圳华大基因健康科技有限公司 filed Critical 深圳华大基因健康科技有限公司
Priority to JP2015503724A priority patent/JP5972448B2/ja
Priority to CN201280066929.3A priority patent/CN104221022B/zh
Priority to AU2012376134A priority patent/AU2012376134B2/en
Publication of WO2013149385A1 publication Critical patent/WO2013149385A1/zh
Priority to IL234875A priority patent/IL234875B/en
Priority to US15/881,902 priority patent/US11371074B2/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C99/00Subject matter not provided for in other groups of this subclass
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2535/00Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
    • C12Q2535/122Massive parallel sequencing
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2537/00Reactions characterised by the reaction format or use of a specific feature
    • C12Q2537/10Reactions characterised by the reaction format or use of a specific feature the purpose or use of
    • C12Q2537/165Mathematical modelling, e.g. logarithm, ratio

Definitions

  • the invention relates to the field of bioinformatics technology, in particular to a copy number variation
  • CNV is a variant of genomic structure.
  • Narrowly defined CNV usually refers to a copy of a DNA fragment on a chromosome.
  • the types and causes of genomic structural variation include: 1. deletion (end deletion, intermediate deletion); 2. translocation (mutual translocation, Robertson translocation); 3. inversion; 4. circular chromosome; 5. double-wire Granular chromosomes; 6. Insertion and so on.
  • Generalized CNV also includes structural variations such as chromosomal aneuploidy and partial aneuploidy.
  • One technical problem to be solved by the present invention is to provide a copy number variation detecting method and system capable of accurately detecting copy number variation including microdeletions/microduplications.
  • a copy number variation detecting method comprising: acquiring read order information of at least a part of nucleic acid molecules in a test sample; determining a unique alignment to (genome) reference according to the read order information Sequence sequence Signing; dividing the genome reference sequence into windows, counting the number of sequence tags falling into each window; performing GC correction on the number of sequence tags of each window and correcting according to the number of expected sequence tags corrected by the control sample set to obtain the adjusted number of sequence tags Using the start or end point of the window as the demarcation point, calculate the difference significance value of the numerical group consisting of the adjusted number of sequence labels on both sides, and select the boundary point with the smaller significance value as the candidate CNV breakpoint; for each CNV From the breakpoint to the previous CNV breakpoint and the two CNC breakpoint sequences, calculate the difference significance value of the two numerical populations consisting of the adjusted sequence number of the windows contained in the two sequences, the minimum of each culling Significant candidate CNV breakpoints and update the difference significance values
  • the method further comprises the step of sequencing at least a portion of the nucleic acid molecules in the sample to obtain the reading information.
  • each window has the same reference unique reads, or each window has the same length.
  • the termination threshold is obtained from a control sample set consisting of normal samples.
  • performing GC correction on the number of sequence labels of each window includes the following steps: grouping the windows by GC content, obtaining a correction coefficient based on the average number of sequence labels in the group and the average number of sequence labels of all the windows, the sequence of the window.
  • the number of tags is corrected to obtain the number of GC-corrected sequence tags.
  • the number of expected sequence labels corrected by the comparison sample set is obtained by: calculating a ratio of the number of GC-corrected sequence labels per window to the total number of labels in the control set; based on the ratio, obtaining an average of the corresponding window ratios of all the control samples The number is calculated based on the average number and the total number of sequence labels of the sample to be tested, and the expected value of the number of labels of each window sequence of the sample to be tested is calculated.
  • the method further includes: performing a confidence selection on the segment between the CNV breakpoints; the confidence selection includes the following steps: using the control set according to the distribution rule of the adjusted number of sequence tags Determine the adjusted order A confidence interval in which the number of column labels is normal; when the mean value of the number of sequence labels adjusted within the segment is outside the confidence interval, it is considered that the segment between the CNV breakpoints does have an abnormality.
  • the number of sequence tags conforms to a normal distribution
  • the confidence interval is a 95% confidence interval
  • a single chromosome cyclization or a genome-wide horizontal cyclization process is performed when a candidate CNV breakpoint is selected.
  • the method further comprises: the test sample is a sample from a human sample, and the test sample comprises amniotic fluid obtained by amniocentesis, villus obtained by biopsy of the villus, and cord blood obtained by puncture of the abdominal umbilical vein.
  • a spontaneously aborted fetal tissue or a peripheral blood of a human and/or a genomic DNA molecule of the test sample obtained by a salt extraction method, a column chromatography method, a magnetic bead method, or a SDS method; and/ Or randomly, the genomic DNA molecule of the test sample is subjected to enzymatic cleavage, atomization, ultrasound, or HydroShear method to obtain a DNA fragment; and/or unidirectionally sequence the genomic DNA molecule fragment of the test sample. Or the two-dimensional sequencing to obtain the DNA fragment reading information.
  • the method further comprises: adding a different tag sequence to the DNA fragment of each test sample to distinguish different test samples.
  • a copy number variation detecting system comprising: a reading order acquiring unit, configured to acquire reading order information of at least a part of nucleic acid molecules in a sample to be tested; and a sequence label determining unit, configured to The reading order information determines a unique alignment to the (genome) reference sequence sequence label; the label number statistics unit is configured to divide the genome reference sequence into a window, and count the number of sequence labels falling into each window; the label number adjustment unit is used for each The number of sequence labels of the window is GC corrected and corrected according to the number of expected sequence labels corrected by the comparison sample set to obtain the adjusted number of sequence labels; the candidate breakpoint selection unit is used to calculate the two points starting or ending with the window as the demarcation point The side consists of the adjusted number of sequence tags to form the difference significance value of the numerical population, and the demarcation point with the smaller significance value is selected as the candidate CNV.
  • a breakpoint determining unit configured to calculate, for each of the two CNV breakpoints to the preceding CNV breakpoint and the subsequent CNV breakpoint, the number of the adjusted sequence labels of the window included in the two sequences The difference significance value of the two numerical populations, each time rejecting the least significant candidate CNV breakpoint and updating the difference significance value of the two candidate CNV breakpoints around the rejected candidate CNV breakpoint, loop iteration, until all candidates
  • the difference significance value of the CNV breakpoint is less than the termination threshold to determine the CNV breakpoint.
  • each window has the same number of reference sequence labels, or each window has the same length.
  • the termination threshold is obtained from a control sample set consisting of normal samples.
  • the label number adjusting unit includes: a GC correction module, configured to group the windows by GC content, obtain a correction coefficient based on an average number of sequence labels in the group and an average number of sequence labels of all the windows, and perform the correction coefficient on the window sequence number Correcting the number of sequence labels obtained by GC correction; a window correction module, configured to calculate a ratio of the number of sequence labels corrected by GC in each window of the control set to the total number of labels; and based on the ratio, obtain an average of corresponding window ratios of all control samples; The average number and the total number of sequence labels of the sample to be tested are calculated, and the expected value of the number of labels of each window sequence of the sample to be tested is calculated, and the number of the sequenced labels corrected by the GC is corrected by the number of expected sequence labels corrected by the sample set to obtain the adjusted sequence label. number.
  • the system further includes a breakpoint filtering unit, configured to determine, according to a distribution rule of the number of sequence tags, after the breakpoint determining unit determines the CNV breakpoint, use the control set to determine a normal confidence interval of the number of sequence tags; When the mean of the number of sequence tags is outside the confidence interval, the segment between the CNV breakpoints is considered to be abnormal.
  • a breakpoint filtering unit configured to determine, according to a distribution rule of the number of sequence tags, after the breakpoint determining unit determines the CNV breakpoint, use the control set to determine a normal confidence interval of the number of sequence tags; When the mean of the number of sequence tags is outside the confidence interval, the segment between the CNV breakpoints is considered to be abnormal.
  • the adjusted number of sequence tags conforms to a normal distribution
  • the confidence interval is a 95% confidence interval
  • the candidate breakpoint selection unit performs a single chromosome cyclization or a genome-wide horizontal cyclization process when selecting candidate CNV breakpoints.
  • the test sample is a sample from a human sample, the amniotic fluid obtained by amniocentesis, the villus obtained by biopsy of the villus, the cord blood obtained by puncture of the abdominal umbilical vein, and the fetus of spontaneous abortion.
  • Peripheral blood of tissue or human body; and/or genomic DNA molecules of said test sample are obtained by salt extraction, column chromatography, magnetic bead method, or SDS method; and/or The genomic DNA molecule of the test sample is randomly interrupted by enzyme digestion, atomization, ultrasound, or HydroShear method to obtain a DNA fragment;
  • the DNA fragment reading information is obtained by performing unidirectional sequencing or bidirectional sequencing on the genomic DNA molecule fragment of the test sample.
  • test sequences are distinguished by different tag sequences added to the DNA fragments of the test sample.
  • An advantage of the copy number variation detection method and system of the present invention is that it is clinically feasible to accurately detect copy number variations including smaller microdeletion/microrepeat regions.
  • FIG. 1 is a flow chart showing an embodiment of a copy number variation detecting method of the present invention
  • Figure 2 is a flow chart showing another embodiment of the copy number variation detecting method of the present invention.
  • Figure 3 is a flow chart showing still another embodiment of the copy number variation detecting method of the present invention.
  • Figure 4 is a schematic flow chart showing the analysis of chromosome CNV in an implementation of the present invention.
  • Figure 5 is a block diagram showing an embodiment of a copy number variation detecting system of the present invention.
  • Figure 6 shows the junction of another embodiment of the copy number variation detecting system of the present invention.
  • 7A-7H are diagrams showing the results of detection of eight samples in one application example of the present invention. detailed description
  • Copy number variation refers to a change in the copy of a nucleic acid molecule larger than lkb compared to a normal sample nucleic acid sequence.
  • the circumstances and causes of copy number variation include: missing, such as micro-missing; insertion, such as micro-insertion, micro-repetition, repetition, inversion, transposition, and complex multi-site variation.
  • Aneuploidy refers to the increase or decrease of the number of chromosomes in the genetic material compared to the normal sample, and further includes the increase or decrease of the whole or part of the chromosome.
  • the copy number variation in the present invention also includes the case of aneuploidy.
  • Sequencing The process of obtaining sample nucleic acid sequence information. Sequencing can be performed by a variety of sequencing methods, including but not limited to the dideoxy chain termination method; preferred high throughput sequencing methods include, but are not limited to, second generation sequencing techniques or single molecule sequencing techniques.
  • Second generation sequencing platform (Metzker ML. Sequencing technologies-the next generation. Nat Rev Genet. 2010 Jan; ll(l): 31-46) including but not limited to Illumiim-Solexa (GA TM , HiSeq2000TM, etc.), ABI- Solid and Roche-454 (pyrophosphate sequencing) sequencing platforms; single-molecule sequencing platforms (technologies) including but not limited to Helicos' True Single Molecule DNA sequencing, Pacific Biosciences single-molecule real-time sequencing (single molecule) Real-time (SMRTTM) ), and nanopore sequencing technology from Oxford Nanopore Technologies (Rusk, Nicole (2009-04-01). Cheap Third-Generation Sequencing. Nature Methods 6 (4): 2446 (4 ) contradict
  • Sequencing types can be single-end sequencing and two-way (Pair-end)
  • the sequencing length can be 50 bp, 90 bp, or 100 bp.
  • the sequencing platform is Illumina/Solexa
  • the sequencing type is Pair-end sequencing, resulting in a 100 bp size DNA sequence molecule having a bidirectional positional relationship.
  • the sequencing depth of the sequencing can be determined based on the size of the chromosome fragment of the test sample to be tested. The higher the sequencing depth, the higher the sensitivity of the detection, and the smaller the number of deletions and repeats detected.
  • the sequencing depth can be 0.1-30 X, i.e., the total amount of data is 0.1-30 times the length of the human genome, for example, in one embodiment of the invention, the sequencing depth is 0.1 X, (2.5 X 108 bp).
  • Reads A nucleic acid sequence of a fixed length (generally greater than 20 bp), such as the result of a sequencing sequence produced by a sequencer; by sequence alignment, a specific region or position of the reference sequence can be located.
  • Sequence alignment refers to the process by which one or more nucleic acid sequences are compared to a reference sequence. It is common to compare a shorter nucleic acid sequence (such as a read sequence) with a reference genomic sequence to determine the location of a shorter nucleus on the reference genome.
  • sequence alignment can be performed by any of a sequence alignment program, ELAND (Efficient local alignment of nucleotide data), SOAP (Short Oligonucleotide Analysis Package), and BWA (Burrows-Wheeler Aligner). .
  • the criteria for successful comparison are divided into non-fault-tolerant comparisons (100% match) and partial fault-tolerant comparisons (less than 100% match).
  • Sequence tag A read-only (read) of a unique position that can locate a reference sequence (such as a reference genome sequence).
  • Reference unique reads A sequence that has a fixed length and a unique alignment position on a reference sequence (usually a reference genome).
  • the process of obtaining a reference sequence tag includes, for example, cleaving the reference genome into fixed length sequences, aligning the sequences back to the reference genome, and selecting a sequence uniquely aligned to the reference genome as a reference unique alignment sequence.
  • the fixed length depends on the sequence length of the sequencing result of the sequencer, and the specific length can be referred to specifically. Different sequencers get The length of the sequencing results is different. For each sequencing, the length of the sequencing results may be different. The selection of the length is subjective and empirical.
  • Tag sequence is a sequence of nucleic acids of a specific length that serves as an identifier.
  • each sample can be labeled with a different tag sequence for sample differentiation during sequencing (Micah Hamady, Jeffrey J Walker, J Kirk Harris et Al. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods, 2008, March, Vol. 5 No. 3), thereby enabling simultaneous sequencing of multiple samples.
  • the tag sequence is designed to distinguish between different sequences, but does not affect other functions of the DNA molecule to which the tag sequence is added.
  • GC calibration Because there is a certain GC bias between the sequencing batches, the copy number deviation will occur in the high GC or low GC region of the genome, and the sequencing data will be corrected by GC based on the control sample set to obtain the corrected relative in each window. The number of sequencing sequences can remove this bias and improve the accuracy of copy number variation detection.
  • Average The average number referred to herein is generally the arithmetic mean or median.
  • Number of sequence tags can be based on the initial number of statistics to obtain the number of statistics, or the number of sequence tags can be corrected by other parameters, for example, can be a ratio, in some cases with "copy rate" can exchange.
  • Test sample In some cases, it may be referred to as a sample to be tested, and refers to a sample containing a nucleic acid molecule, and the nucleic acid molecule is suspected of being mutated.
  • the type of the nucleic acid is not particularly limited and may be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), preferably DNA.
  • DNA deoxyribonucleic acid
  • RNA ribonucleic acid
  • Control sample It is a sample that is considered to be normal relative to the sample. Normally, normal means that the phenotype is normal.
  • Control sample set refers to a set of control sample compositions. In one embodiment of the invention, the number of control samples in the set is required to be greater than 30.
  • sequencing technology has been used more and more widely in the detection of chromosomal variation.
  • the present disclosure designs a high-throughput sequencing technology for genome-wide horizontal copy number variation screening, which has high throughput, high specificity, and accurate positioning.
  • the advantages By obtaining a sample of the subject, DNA is extracted, high-throughput sequencing is performed, and the obtained data is analyzed to obtain a test result.
  • Fig. 1 is a flow chart showing an embodiment of a copy number variation detecting method of the present invention.
  • read information of at least a portion of the nucleic acid molecules in the test sample is obtained. At least a portion of the nucleic acid molecules or all of the nucleic acid molecules in the sample can be sequenced to obtain read order information.
  • the reading order information of a part of the nucleic acid molecules of the test sample can be obtained, or the reading order information of all the nucleic acid molecules can be obtained.
  • a genomic DNA molecule from a test sample is randomly broken to obtain a DNA fragment, and the DNA fragment is sequenced, and then a certain length of reading is obtained. The length of the obtained reading sequence may be within a certain range, and a fixed length reading order can be obtained by the intercepting operation.
  • the length of the DNA fragment can range from 50 bp to 1500 bp, for example, 50 bp to 150 bp, 150 bp to 350 bp, 350 bp to 500 bp, 500 bp to 700 bp, 700 bp to 1000 bp, or lOOObp to 1500 bp.
  • 50 bp, 90 bp, 100 bp, 150 bp, 300 bp, 350 bp, 500 bp, 700 bp, 1000 bp, 1500 bp can be selected.
  • the length of the reading order is different in view of the sequencer
  • the sequence tag is selected based on the reading order, the sequence of 20 bp or more in the reading sequence is generally selected for alignment, preferably above 26 bp.
  • Step 104 Determine a sequence tag uniquely aligned to the (genome) reference sequence according to the read order information. For example, all or part of the sequence of the reading sequence is aligned with the genomic reference sequence to obtain the site information of the reading sequence on the genome, and the site information of the reading sequence on the specific chromosome is obtained.
  • the human genome reference sequence can be the human genome reference sequence in the NCBI database.
  • the human genome sequence is the human genome reference sequence of version 36 (hgl8; NCBI Build 36) in the NCBI database, and the alignment software used is SOAPaligner/soap2schreibselection and genomic reference sequence unique alignment
  • the DNA fragment reads, ie, the last reading in the human genome reference sequence, that is, the only sequence tag aligned to the (genomic) reference sequence.
  • the genome reference sequence is divided into windows, and the number of sequence tags falling into each window is counted.
  • the window is determined by the following method:
  • the reference genome is interrupted into fragments of the same length as the sample to be detected, and the same parameters of the same alignment software are used for comparison, and the positions on the chromosome that can be uniquely aligned are screened;
  • a window is defined by a matching length of a certain length. You can choose between windows to have cross slides.
  • the number of sites that can be uniquely aligned in the window is related to the amount of sequencing data of the sample to be tested.
  • the number of expected reads of the sample to be tested falling in each window is more than 300, so that the number of reads in the window is consistent with the Poisson distribution. For example, suppose the number of points that the genome can be uniquely aligned is N, the number of valid reads of the sample to be tested is n, and the expected number of reads per window is E, then
  • Each window of the reference genome should contain X uniquely aligned sites.
  • Step 108 Perform GC correction on the number of sequence labels of each window and perform correction according to the number of expected sequence labels corrected by the comparison sample set to obtain the adjusted number of sequence labels.
  • performing GC correction on the number of sequence labels of each window includes the steps of: grouping the windows by GC content, obtaining a correction coefficient based on the average number of sequence labels in the group and the sequence label average of all windows, The number of window sequence labels is corrected to obtain the number of GC-corrected sequence labels; The set of corrected expected sequence tag numbers is obtained by: calculating a ratio of the number of GC-corrected sequence tags per window to the total number of tags in the control set; based on the ratio, obtaining an average of the corresponding window ratios of all control samples; And the total number of sequence labels of the sample to be tested, and calculate the expected value of the number of labels of each window sequence of the sample to be tested.
  • Step 110 Taking the start point or the end point of the window as a demarcation point, calculating a difference saliency value of the numerical group consisting of the number of the adjusted sequence labels on both sides, and selecting a demarcation point having a small significance value (ie, the difference is significant)
  • Candidate CNV breakpoints For example, a predetermined number of windows are selected as candidate CNV breakpoints based on the P value representing the significance level of the copy number variation on each side of each window in the genome-wide range, and the difference significance value of each candidate CNV breakpoint is obtained, that is, p value.
  • Step 112 Calculate the difference between two numerical groups consisting of the number of adjusted sequence labels of the window included in the two segments for each CNV breakpoint to the two CNV breakpoints and the subsequent CNV breakpoint sequence. Significant value, each time the least significant candidate CNV breakpoint is culled and the difference significance value of the two candidate CNV breakpoints around the rejected candidate CNV breakpoint is updated, loop iteration, until the difference of significance of all candidate CNV breakpoints The values are all less than the termination threshold to determine the CNV breakpoint.
  • the termination threshold is usually preset. For example, the termination threshold is obtained by analyzing the control sample set consisting of normal samples.
  • the loop iteration performs the CNV breakpoint selection, thereby completing the CNV detection, and accurately detecting the smaller copy number variation including the microdeletion/microrepetition region.
  • the test sample may be amniotic fluid obtained by amniocentesis, villus obtained from villus biopsy, cord blood obtained by abdominal umbilical vein puncture, fetal tissue of spontaneous abortion or genomic DNA obtained from human peripheral blood.
  • Conventional DNA extraction methods such as salting out, column chromatography, magnetic beads, and SDS.
  • column chromatography is preferably used.
  • the column chromatography refers to the action of blood, tissue or cells through the action of cell lysate and proteinase K to obtain a dew-exposed DNA molecule, and the DNA molecule is combined with a silica gel membrane under high salt conditions. DNA molecules elute from the silica gel membrane at low salt and high pH. See the Tiangen TIANamp Micro DNA Kit (DP316) data sheet for specific principles and methods)).
  • each DNA fragment of the sample may be added with a different label sequence (index), and the length may be 4 bp to 12 bp for use in Differentiation of test samples during sequencing (Micah Hamady, Jeffrey J Walker, J Kirk Harris et al. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods, 2008, March, Vol.5 ⁇ 3) .
  • index label sequence
  • the length may be 4 bp to 12 bp for use in Differentiation of test samples during sequencing (Micah Hamady, Jeffrey J Walker, J Kirk Harris et al. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods, 2008, March, Vol.5 ⁇ 3) .
  • the detection of a plurality of test samples can be simultaneously processed, the efficiency is improved, and the detection cost is reduced.
  • Fig. 2 is a flow chart showing another embodiment of the copy number variation detecting method of the present invention.
  • the genomic DNA molecule of the test sample is randomly interrupted to obtain a DNA fragment.
  • Randomly interrupted DNA molecules can be digested, nebulized, sonicated, or HydroShear.
  • an ultrasonic method such as Covaris' S-series (based on AFA technology, when the acoustic energy released by the sensor can pass through the DNA sample, the dissolved gas forms a bubble. When the energy is removed, the bubble bursts and produces The ability to break DNA molecules.
  • Covaris' S-series based on AFA technology
  • Step 204 sequencing the DNA fragment to obtain a DNA fragment sequencing sequence, that is, reading order.
  • the read sequence obtained by sequencing can be within a certain length range, and a fixed length read sequence can be obtained from the DNA fragment sequencing sequence by a truncation operation.
  • the DNA fragment sequencing sequence refers to a fixed length read sequence.
  • Sequencing method used It can be a high throughput sequencing method Illumina/Hiseq2000, ABI/SOLiD, Roche/454.
  • the sequencing type can be Single-end (unidirectional) sequencing or Pair-end (bidirectional) sequencing, and the sequencing length can be 50 bp to 1500 bp.
  • the sequencing platform is Illumina/Hiseq2000, and the sequencing type is Pair-end sequencing, and a 100 bp DNA sequence molecule having a bidirectional positional relationship is obtained.
  • the depth of sequencing can be determined according to the size of the chromosomal variation fragment detected. The higher the sequencing depth, the higher the sensitivity of detection, and the smaller the number of missing and repeated fragments can be detected.
  • the amount of sequencing the human test sample is between 2 ⁇ 900 X 10 8 Article sequencing fragments.
  • Step 206 Align the reading sequence with the genomic reference sequence to obtain the position information of the reading sequence on the genome.
  • step 206 a read order on the unique alignment with the genomic reference sequence is selected as the sequence label.
  • Step 208 counting the number of uniquely aligned sequence tags of the DNA fragments falling into each window of the genome. For each test sample, count the number of sequence labels that each window falls (denoted as ⁇ , subscripts i and j represent the window number and sample number, respectively, for distinction, not repeated below).
  • Step 210 Determine an average GC content of each window on the genome to determine the window correction coefficient, and obtain the number of corrections of the sequence label of each window according to the correction coefficient. This step mainly corrects the number of sequence labels of the window according to the GC content of each window, which can be called batch correction or GC correction.
  • the average GC content of the sequence tags dropped by each window was counted (denoted).
  • the sum of the number of G and C bases in all sequence tags is ⁇ , and the total length of all sequence tags is recorded as
  • each statistical window is pressed by its ' Grouping, ie ⁇ "the same window is grouped into a group.
  • Step 212 Divide the number of corrections of the sequence label of each window by the expected number of the window to obtain the adjusted sequence number of each window, that is, the copy rate.
  • the expected number of values for each window is obtained by composing a control set from a normal sample. This step mainly performs correction of the number of sequence labels per window based on normal sample data, which may be referred to as window correction.
  • the percentage of relative sequence tags ( ⁇ ) is defined as the ratio of the number of sequence tags ( n " ) in the window to the total number of whole genome tags ( ).
  • the copy rate of each window is equal to the number of modified sequence tags "" divided by the expected value in the window (the total number of genome-wide serial tags multiplied by the number of window sequence tags), ie ⁇ XN
  • the method of building the library, the sequencing reagent and the type of sequencing should be consistent with the sample to be tested, thereby improving the correction effect of the sample to be tested.
  • the sample in the control set should be a normal sample with a sample size greater than 30.
  • Step 214 Select a predetermined number of windows as the candidate CNV breakpoints according to the ⁇ value representing the significant difference of the copy number variation on each side of the window in the genome-wide range, and obtain the ⁇ value of each candidate CNV breakpoint.
  • Candidate CNV breakpoint selection For all windows of the whole genome, calculate a certain number of windows on the left and right sides of each window (the number of windows is usually greater than 30 or the minimum sample size limit of the test model is satisfied, so that the test model is statistically significant) Copy The difference in the rate of change in the shell rate, according to the level of significant difference in the genome-wide range (p value from small to large), select a certain number (for example, 1% of the total number of windows) of the point (corresponding to the window) as the candidate CNV breakpoint ( Breakpoint, That is, the boundary point of each CNV segment).
  • Step 216 culling the least significant candidate CNV breakpoint each time and updating the p-values of the two candidate CNV breakpoints of the culled candidate CNV breakpoint, loop iteration until the p-values of all candidate CNV breakpoints are less than the termination p
  • the value ie, the termination threshold
  • the terminating p value is obtained from the control set.
  • Iterative Merging By successive loop iterations, each time the least significant candidate breakpoint is culled, and the p values of the left and right breakpoints are updated until all p values are less than the end p value.
  • the comparison sample may be subjected to the above-described iterative merging operation, and the maximum p-value of the merging of each iteration is recorded, and the iteration is terminated when it is merged into one segment.
  • the point where the maximum p value changes most sharply that is, the point where the slope change is most obvious (the point with the largest curvature) in the curve of the p value corresponds to or before the merge process
  • the maximum p value is used as the termination threshold.
  • the termination of the iterative merging may also be set to the number of segments after the iterative merging equal to the number of predicted segments, for example, when the whole genome is analyzed, the control sample is terminated when the number of merged segments is 24 , terminate p by calculation at this time The average value of the values can effectively obtain the terminating 7 value.
  • a single chromosome cyclization means: When calculating the window near the start of a chromosome, if the number of valid windows on the left side is not enough for statistical testing, then a sufficient window is obtained from the tail of the chromosome to calculate it; for the same reason, near the end point The position on the right side where a valid window is not available is obtained from the front end of the chromosome. This operation allows the windows at the front and the end of the chromosome to still be calculated.
  • Whole genome level cyclization is to index the tail of the previous chromosome when the number of effective windows at the front end of each chromosome is insufficient, and to index the front end of the next chromosome when the tail is insufficient, and chromosome 1 is connected to the Y chromosome.
  • the method further comprises: performing confidence selection on the segment between the CNV breakpoints: determining a normal confidence interval of the copy rate according to the distribution rule of the copy rate; and determining the mean value of the intra-segment copy rate in a confidence interval Outside, it is considered that there is an abnormality in the segment between the CNV breakpoints.
  • the copy rate is in a normal distribution with a 95% confidence interval. This step is used to filter the fragmentation results to obtain reliable results. If the average ⁇ of the segment is less than the lower threshold or greater than the upper threshold, it will be output as a positive result.
  • Threshold selection Calculate the copy rate distribution of the window in each control sample. According to the central limit theorem, the reading order is random in the window, so the copy rate r is normally distributed, and the significance level is 0.05. Quantile. The mean value in the control set was calculated as the upper and lower thresholds for screening the copy rate variation.
  • the accuracy of the detection result is improved by batch correction and window correction.
  • the accuracy can be increased by expanding the reference set to reduce the pressure on the amount of starting DNA.
  • Figure 3 is a flow chart showing still another embodiment of the method for detecting genomic copy number variation of the present invention.
  • the process flow of the control set consisting of normal samples (3A) and the process flow of the test sample (3B) are included.
  • process 3A includes:
  • step 310A DNA molecules of the control sample are extracted.
  • Step 311A the DNA molecules of the control sample are randomly broken into DNA fragments for sequencing, and the DNA fragment test sequence data of the control sample is obtained, that is, the reading order.
  • Step 312A comparing the reading of the control sample to the reference genome.
  • Step 313A the statistical window uniquely compares the number of readings, that is, the number of sequence labels.
  • step 314A batch correction is performed on the control sample.
  • Step 315A obtaining a window number expectation value from the control sample for window correction of the test sample.
  • Step 316A breakpoint selection and fragmentation. Selecting candidate CNV breakpoints, culling the least significant candidate CNV breakpoints at each time and updating the P values of the two candidate CNV breakpoints around the culled candidate CNV breakpoints, loop iterating, leaving the final segment with a predetermined number of numbers (eg 24) stop.
  • a predetermined number of numbers eg 24
  • Step 317A determining the termination threshold.
  • Process 3B includes:
  • Step 310B extracting DNA molecules of the test sample.
  • Step 311B the DNA molecules of the test sample are randomly broken into DNA fragments for sequencing, and the DNA fragments of the test samples are read.
  • Step 312B the reading of the DNA fragment of the test sample is compared with the reference genome.
  • Step 313B go to the statistics window to uniquely compare the number of readings, that is, the number of sequence labels.
  • step 314B the test sample is subjected to batch correction.
  • step 315B the test sample is subjected to window correction according to the expected number of windows obtained from the control sample.
  • Step 316B breakpoint selection and fragmentation.
  • step 317B the result is filtered.
  • the database construction method, sequencing reagent and sequencing type should be consistent with the test sample, thereby improving the calibration effect of the control sample on the test sample.
  • the sample in the control set should be a normal sample with a sample size greater than 30.
  • FIG. 4 shows a simplified flow chart of chromosome CNV analysis in one implementation of the invention.
  • step 401 DNA extraction and sequencing: After extracting genomic DNA according to the Tiangen DP327-02 Kit operating manual, the library was built according to the Illumina/Hiseq2000 standard database construction process. In this process, the 500 bp DNA molecule itself is added to the linker used for sequencing, and each sample is labeled with a different tag sequence (index), so that data of multiple samples can be made in one sequencing data. differentiate.
  • Step 402 Sequence alignment: Sequencing by Illumina/Hiseq2000 sequencing method (other or similar sequencing methods such as ABI/SOLiD can achieve the same or similar effect), each sample obtains a DNA reading of a certain length fragment, and it is combined with the NCBI database.
  • the standard human genome reference sequence is subjected to a SOAP2 alignment to obtain information that is read in the corresponding position of the genome.
  • only the readings that are uniquely aligned with the human genome reference sequence are selected, that is, only the previous reading sequence, ie the number of sequence tags, is aligned on the human genome reference sequence, as a subsequent CNV analysis.
  • Valid data is possible.
  • Step 403 PSCC analysis.
  • PSCC bioinformatics methods
  • Step 404 Perform CNV analysis based on the copy number segment obtained in step 403, and use the test sample copy rates ⁇ 0.7 and ⁇ 1.3 as detection thresholds of fragment deletion and repetition, respectively, and analyze the whole genome horizontal copy number variation segment, and then perform Resultable Visualization.
  • the implemented software algorithm is a series of programs developed by Shenzhen Huada Genetic Research Institute for detection of genome-wide copy number variation, collectively referred to as
  • PSCC PSCC. It is capable of batch-correcting the sample by data generated by next-generation sequencing technology, and then performing data correction, normalization, and fragmentation with the control set to estimate the degree and magnitude of copy number variation of the sample. At lower sequencing depth
  • a single copy number variant (CNV) fragment of about 0.5 Mb can be detected (50 M sequencing short sequences).
  • Fig. 5 is a view showing the configuration of an embodiment of the genome copy number variation detecting system of the present invention.
  • the system includes: a read order obtaining unit 51, which acquires read order information of at least a part of nucleic acid molecules in the test sample; and a sequence label determining unit 52 that determines a unique alignment to the (genome) reference sequence.
  • the tag number counting unit 53 divides the genome reference sequence into windows, and counts the number of sequence tags falling into each window; the tag number adjusting unit 54 performs GC correction on the number of sequence tags of each window and corrects according to the comparison sample set The number of expected sequence tags is corrected to obtain the adjusted number of sequence tags; the candidate breakpoint selection unit 55 uses the start or end point of the window as a demarcation point, and calculates the difference significance value of the numerical group composed of the adjusted number of sequence tags on both sides.
  • each window can have the same number of reference sequence tags
  • the candidate breakpoint selection unit 55 performs a single chromosome cyclization or a genome-wide horizontal cyclization process when selecting a candidate CNV breakpoint.
  • the sequence tag determining unit determines the sequence tag uniquely aligned to the (genome) reference sequence according to the reading order, and the tag number adjusting unit corrects the number of sequence tags of each window, and is determined by the candidate breakpoint selecting unit and the breakpoint.
  • the unit performs cyclical iteration of gene significant differences to perform CNV breakpoint selection, thereby completing CNV detection, and accurately detecting copy number variation including smaller microdeletion/microrepetition regions.
  • Fig. 6 is a block diagram showing another embodiment of the copy number variation detecting system of the present invention.
  • the system includes: a read order obtaining unit 51, a sequence tag determining unit 52, a tag number counting unit 53, a tag number adjusting unit 64, a candidate breakpoint selecting unit 55, and a breakpoint determining unit 56.
  • the read order obtaining unit 51, the sequence label determining unit 52, the label number counting unit 53, the candidate breakpoint selecting unit 55, and the breakpoint determining unit 56 can be referred to the detailed description of Fig. 5, and will not be described in detail herein for the sake of brevity.
  • the tag number adjustment unit 64 includes a GC correction module 641 and a window correction module 642.
  • the GC correction module 641 groups the windows according to the GC content, obtains a correction coefficient based on the average number of the sequence labels in the group and the average number of the sequence labels of all the windows, and corrects the number of the window sequence labels to obtain the GC-corrected sequence label number;
  • the window correction module 642 calculates a ratio of the number of sequence labels corrected by the GC in each window of the comparison set to the total number of labels; based on the ratio, obtains an average of the corresponding window ratios of all the control samples; and based on the average number and the sequence label of the sample to be tested Total, calculate the expected value of the number of labels of each window sequence of the sample to be tested, and correct the number of sequence labels that are corrected by the GC to the number of expected sequence labels corrected by the sample set to obtain the adjusted number of sequence tags, also called the copy rate.
  • the system further includes a breakpoint filtering unit 67, which determines a normal confidence interval of the copy rate according to a distribution rule of the copy rate after the breakpoint determining unit determines the CNV breakpoint; When the mean of the internal copy rate is outside the confidence interval, it is considered that the segment between the CNV breakpoints does have an abnormality.
  • the number of sequence tags conforms to a normal distribution, and the confidence interval is
  • the test sample is a sample from a human sample
  • the genomic DNA molecule of the test sample is amniotic fluid obtained by amniocentesis, villus obtained by biopsy of the villus, and cord blood obtained by puncture of the abdominal umbilical vein.
  • Genomic DNA obtained from spontaneously aborted fetal tissue or human peripheral blood
  • genomic DNA molecules of the test sample are obtained by salt extraction, column chromatography, magnetic bead method, or SDS DNA extraction method
  • the genomic DNA molecule of the sample is randomly fragmented by digestion, atomization, ultrasound, or HydroShear method
  • the DNA fragment sequencing sequence is obtained by unidirectional sequencing or bidirectional sequencing of the genomic DNA molecule fragment of the test sample. .
  • Different test samples can be distinguished by different tag sequences added to the DNA fragments of the test sample.
  • Samplel The DNA of 8 samples (hereinafter referred to as Samplel, Sample2, Sample3...Sample8) was extracted, and the extracted DNA was constructed according to the modified Illumina/Hiseq2000 standard library process, and the DNA molecules concentrated in 500b were
  • each sample was labeled with a different tag sequence, and then hybridized with a complementary junction of the Flowcell surface, and the nucleic acid molecules were clustered under certain conditions, and then sequenced by double-end sequencing on an Illumina Hiseq2000.
  • a DNA fragment of length lOObp is listed.
  • the DNA of about 100 ng (Quant-IT dsDNA HS Assay kit) obtained from the above amniotic fluid sample was used to construct a modified Illumina/Hiseq 2000 standard procedure, and the specific process refers to the prior art (see http: ⁇ www. Illumina/Solexa standard library specification provided by illumina.com/).
  • the size of the DNA library and the insert fragment were determined to be about 500 bp by the 2100 Bioanalyzer (Agilent), and the QPCR was accurately quantified and then sequenced.
  • Sequencing In this example, the DNA samples obtained from the above 8 samples were processed according to the Illumina/Solexa officially published ClusterStation and Hiseb2000 (PEsequencing) instructions, so that each sample obtained about 5G data amount for sequencing on the machine. Each sample is separated according to the index tag area.
  • the comparison software SOAP2 the sequenced DNA sequence was compared with the human genome reference sequence of the version 36 (hgl8; NCBIBuild36) in the NCBI database to obtain information on the position of the measured DNA sequence at the corresponding position of the genome.
  • each statistical window is grouped by the average of the readings falling into it, and the median or arithmetic mean of each group is divided by the median or arithmetic mean of the whole genome level.
  • the correction coefficient c s is obtained between the numbers, and the subscript g represents the GC content of the different groups. Put the original read of each window The number ⁇ is multiplied by the correction coefficient to obtain the correction value of the number of sequence tags in each window (denoted as c).
  • Data correction ' Select the sequencing data of 90 YH populations as the control set, and define the relative sequence number of labels (p") as the window. Number of sequenced sequence tags
  • the copy rate of each window is equal to the number of correction correction sequence labels "" divided by the expected value in the window (the total number of genome-wide sequence labels multiplied by the number of window sequence labels), ie
  • Threshold and filtering Filter the fragmented results. If fragment When the average r " is less than 0.7 or greater than 1.3, it is output as a positive result.
  • Table 1 CNV results for 8 samples where chr represents the chromosome and T7 represents the chromosome 3 trisomy. XYY has no sex chromosome trisomy abnormalities. 7A-7H are schematic views showing the results of detection of eight samples.
  • the microdeletion fragment as small as 0.4M is as large as the entire number of chromosomes. This method can accurately detect and locate, which proves that the detection efficiency and precision are excellent. .
  • the invention performs genome-wide copy number variation detection analysis on the applicable population, and is beneficial to providing genetic counseling and providing clinical decision-making basis for accurate pathological determination of patients with micro-deletion syndrome.
  • the population to which the present invention is applicable may be all microdeletion patients or potential microdeletion carriers, and the applicable population is merely illustrative of the invention and should not be construed as limiting the scope of the invention.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Immunology (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Computing Systems (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

本发明公开一种基因组拷贝数变异检测方法和系统,涉及生物信息学技术领域。该方法包括:获得读序;根据读序确定序列标签;统计落入各个窗口的序列标签数目;对各个窗口的序列标签数目进行GC校正并根据以对照样本集修正的期望序列标签数目进行修正获得调整后的序列标签数目;选取显著性值较小的分界点为候选的CNV断点;每次剔除最不显著的候选CNV断点并更新被剔除的候选CNV断点左右两个候选CNV断点的差异显著性值,循环迭代,直至所有候选CNV断点的差异显著性值都小于终止阈值,从而确定CNV断点。本发明的方法和系统,具有临床可行性,在使用50M左右的数据情况下,可精确检测到0.5M的微缺失/微重复区域。

Description

一种拷贝数变异检测方法和系统 技术领域
本发明涉及生物信息学技术领域, 尤其涉及一种拷贝数变异
( Copy Number Variation, CNV )检测方法和系统。 背景技术
CNV是基因组结构变异的一种, 狭义的 CNV通常指染色体 上 DNA片段的拷贝 «L生了变化。 基因组结构变异的类型及原 因包括: 1.缺失(末端缺失、 中间缺失); 2.易位(相互易位、 罗 伯逊易位); 3.倒位; 4.环状染色体 ; 5.双着丝粒染色体; 6.插入 等。 广义的 CNV也包括例如染色体非整倍性和部分非整倍性的 结构变异。
目前拷贝数变异的检测方法, 主要有高分辨率染色体核型分 析、 FISH (荧光原位杂交)、 Array CGH (比较基因组杂交)、 MLPA (多重连接探针扩增技术) 和 PCR ( Polymerase Chain Reaction, 多聚 式反应) 的方法等, 其中遗传学诊断以 FISH 检测为金标准, 可以有效地检测出大部分已知染色体缺失或重复。 但这些方法普遍存在效率较低, 尤其在全基因组水平全面扫描的 情况下, 资源消耗较大或无法检测未知 CNV等缺点。
因此, 开发一种新的拷贝数变异检测方法, 对已知的位点进 行鉴定, 并对未知的位点进行发现性探索, 已经迫在眉睫。 发明内容
本发明要解决的一个技术问题是提供一种拷贝数变异检测方 法和系统, 能够精确检测包括微缺失 /微重复的拷贝数变异。
根据本发明的一个方面, 提供一种拷贝数变异检测方法, 其 特征在于, 包括: 获取受试样品中至少一部分核酸分子的读序信 息; 根据读序信息确定唯一比对至(基因组)参考序列的序列标 签; 将基因组参考序列划分窗口,统计落入各个窗口的序列标签数 目; 对各个窗口的序列标签数目进行 GC校正并根据以对照样本 集修正的期望序列标签数目进行修正获得调整后的序列标签数目; 以窗口的起点或终点作为分界点, 计算两侧由调整后的序列标签 数目组成数值群体的差异显著性值, 选取显著性值较小的分界点 为候选的 CNV断点; 对于每个 CNV断点至在先 CNV断点和在 后 CNV 断点的两段序列, 计算两段序列包含的窗口的调整后的 序列标签数目组成的两个数值群体的差异显著性值, 每次剔除最 不显著的候选 CNV断点并更新被剔除的候选 CNV断点左右两个 候选 CNV断点的差异显著性值, 循环迭代, 直至所有候选 CNV 断点的差异显著性值都小于终止阈值, 从而确定 CNV断点。
可选地, 还包括对受试样本中的至少一部分核酸分子进行测 序以获得读序信息的步骤。
可选地, 每个窗口具有相同的参考序列标签数目 (reference unique reads ), 或者, 每个窗口具有相同的长度。
可选地, 终止阈值根据由正常样本组成对照样本集获得。
可选地, 对各个窗口的序列标签数目进行 GC校正包括以下 的步骤: 将窗口按 GC含量分组, 基于组内序列标签的平均数与 所有窗口的序列标签平均数获得修正系数, 对该窗口序列标签数 进行修正获得 GC校正的序列标签数。
可选地, 对照样本集修正的期望序列标签数目由以下方法获 得: 计算对照集中每个窗口经过 GC校正的序列标签数与标签总 数的比值; 基于该比值, 获得所有对照样本对应窗口比值的平均 数; 基于上述平均数和待测样本的序列标签总数, 计算待测样本 各个窗口序列标签数的期望值。
可选地, 在确定 CNV断点后还包括:对根据 CNV断点之间 的片段进行置信选择的步骤; 所述置信选择包括以下步骤: 根据 调整后的序列标签数目的分布规律, 利用对照集确定调整后的序 列标签数目正常的置信区间; 当片段内调整后的序列标签数目的 均值处于置信区间之外时, 认为该 CNV 断点之间的片段确实存 在异常。
可选地, 序列标签数目符合正态分布, 所述置信区间为 95% 置信区间。
可选地, 在选择候选的 CNV 断点时进行单条染色体环化或 者全基因组水平环化处理。
可选地, 还包括: 所述受试样品是来自人类样本的样品, 所 述受试样品包括是羊膜腔穿刺获得的羊水、 绒毛活检得到的绒毛、 经腹脐静脉穿刺得到的脐带血、 自然流产的胎儿组织或人体的外 周血; 和 /或所述受试样品的基因组 DNA分子采用盐析法、 柱层 析法、 磁珠法、 或 SDS法的 DNA提取方法获得; 和 /或对所述受 试样品的基因组 DNA 分子采用酶切、 雾化、 超声、 或者 HydroShear法随机打断得到 DNA片段; 和 /或对所述受试样品的 基因组 DNA分子片段进行单向测序或双向测序获得所述 DNA片 段读序信息。
可选地, 该方法还包括: 在每个受试样品的 DNA 片段上加 上不同的标签序列以区别不同的受试样品。
根据本发明的另一方面, 提供一种拷贝数变异检测系统, 包 括: 读序获取单元, 用于获取据受试样品中至少一部分核酸分子 的读序信息; 序列标签确定单元, 用于根据读序信息确定唯一比 对至(基因组)参考序列的序列标签; 标签数目统计单元, 用于 将基因组参考序列划分窗口,统计落入各个窗口的序列标签数目; 标签数目调整单元, 用于对各个窗口的序列标签数目进行 GC校 正并根据以对照样本集修正的期望序列标签数目进行修正获得调 整后的序列标签数目; 候选断点选择单元, 用于以窗口的起点 或终点作为分界点, 计算两侧由调整后的序列标签数目组成数值 群体的差异显著性值, 选取显著性值较小的分界点为候选的 CNV 断点; 断点确定单元, 用于对于每个 CNV断点至在先 CNV断点 和在后 CNV 断点的两段序列, 计算所述两段序列包含的窗口的 调整后的序列标签数目组成的两个数值群体的差异显著性值, 每 次剔除最不显著的候选 CNV断点并更新被剔除的候选 CNV断点 左右两个候选 CNV 断点的差异显著性值, 循环迭代, 直至所有 候选 CNV断点的差异显著性值都小于终止阈值, 从而确定 CNV 断点。
可选地, 每个窗口具有相同的参考序列标签数目, 或者, 每 个窗口具有相同的长度。
可选地, 终止阈值根据由正常样本组成对照样本集获得。
可选地, 标签数目调整单元包括: GC校正模块, 用于将窗 口按 GC含量分组, 基于组内序列标签的平均数与所有窗口的序 列标签平均数获得修正系数, 对该窗口序列标签数进行修正获得 GC校正的序列标签数; 窗口修正模块, 用于计算对照集中每个 窗口经过 GC校正的序列标签数与标签总数的比值; 基于该比值, 获得所有对照样本对应窗口比值的平均数; 基于上述平均数和待 测样本的序列标签总数, 计算待测样本各个窗口序列标签数的期 望值, 对 GC校正后的序列标签数以对照样本集修正的期望序列 标签数目进行修正获得调整后的序列标签数目。
可选地, 该系统还包括断点过滤单元, 用于在所述断点确定 单元确定 CNV 断点后根据序列标签数目的分布规律, 利用对照 集确定序列标签数目正常的置信区间; 当片段内序列标签数目的 均值处于置信区间之外时, 认为该 CNV 断点之间的片段确实存 在异常。
可选地, 调整后的序列标签数目符合正态分布, 所述置信区 间为 95%置信区间。
可选地, 候选断点选择单元在选择候选的 CNV 断点时进行 单条染色体环化或者全基因组水平环化处理。 可选地, 受试样品是来自人类样本的样品, 所述受试样品包 括是羊膜腔穿刺获得的羊水、 绒毛活检得到的绒毛、 经腹脐静脉 穿刺得到的脐带血、 自然流产的胎儿组织或人体的外周血; 和 /或 所述受试样品的基因组 DNA分子采用盐析法、 柱层析法、 磁珠 法、 或 SDS法的 DNA提取方法获得; 和 /或对所述受试样品的基 因组 DNA分子采用酶切、 雾化、 超声、 或者 HydroShear法随机 打断得到 DNA片段;
和 /或
对所述受试样品的基因组 DNA分子片段进行单向测序或双 向测序获得所述 DNA片段读序信息。
可选地, 通过受试样品 DNA 片段上所加的不同标签序列以 区别不同的受试样品。
本发明的拷贝数变异检测方法和系统的一个优点在于, 具有 临床可行性, 可精确检测到包括较小微缺失 /微重复区域的拷贝数 变异。 附图说明
图 1 示出本发明的拷贝数变异检测方法的一个实施例的流程 图;
图 2 示出本发明的拷贝数变异检测方法的另一个实施例的流 程图;
图 3 示出本发明的拷贝数变异检测方法的又一个实施例的流 程图;
图 4示出本发明一个实现中对染色体 CNV分析的简要流程 图;
图 5示出本发明的拷贝数变异检测系统的一个实施例的结构 图;
图 6示出本发明的拷贝数变异检测系统的另一个实施例的结 构图;
图 7A-7H示出本发明的一个应用例中 8个样品的检测结果示 意图。 具体实施方式
对本文中用到的术语进行说明:
拷贝数变异 ( copy number variation, CNV ) : 是指待测样品 核酸序列与正常样品核酸序列相比, 大于 lkb的核酸分子的拷贝 生了变化。 拷贝数变异的情形及原因包括: 缺失, 例如微缺 失; 插入, 例如微插入, 微重复, 重复, 倒置, 转座和复杂的多 位点变异。
非整倍体: 是指与正常样本相比, 遗传物质中存在染色体数 目的增加或减少, 进一步包括整条或部分染色体的增加或减少。 本发明中拷贝数变异也包括非整倍体的情形。
测序: 获得样品核酸序列信息的过程。 测序可通过各种测序 方法进行, 包括但不限于双脱氧链终止法; 优选高通量的测序方 法, 包括但不限于第二代测序技术或者是单分子测序技术。
第二代测序平台 (Metzker ML. Sequencing technologies-the next generation. Nat Rev Genet. 2010 Jan; ll(l):31-46 ) 包括但不 限于 Illumiim-Solexa ( GATM,HiSeq2000™等)、 ABI-Solid 和 Roche-454 (焦磷酸测序) 测序平台; 单分子测序平台 (技术) 包括但不限于 Helicos公司的真实单分子测序技术(True Single Molecule DNA sequencing ), Pacific Biosciences公司单分子实时 测序 ( single molecule real-time (SMRT™) ), 以及 Oxford Nanopore Technologies公司的纳米孔测序技术等(Rusk, Nicole (2009-04-01). Cheap Third-Generation Sequencing. Nature Methods 6 (4): 2446 (4 )„
测序类型可以为单向 (single-end )测序和双向 (Pair-end ) 测序, 测序长度可以为 50bp、 90bp、 或 100bp。 在本发明的一个 实施方案中, 测序平台为 Illumina/Solexa, 测序类型为 Pair-end 测序, 得到具有双向位置关系的 lOObp大小的 DNA序列分子。
本发明的一个实施方案中, 测序的测序深度可以依据检测的 受试样品染色体变异片段大小确定, 测序深度越高, 检测的灵敏 度越高, 即可检出的缺失和重复的片段越小。 测序深度可以是 0.1-30 X , 即总数据量为人类基因组长度的 0.1-30倍, 例如在本 发明的一个实施方案中, 测序深度为 0.1 X , ( 2.5 X 108bp )。
读序 (reads ): —定长度的核酸序列 (一般大于 20bp ),例如 由测序仪产生的测序序列的结果; 通过序列比对的方法, 可定位 至参考序列的 特定区域或位置。
序列比对(比对): 是指一个或多个核酸序列与参考序列进 行比较的过程。 具体常见为将一段较短的核酸序列 (如读序)与 参考基因组序列相比较, 以确定较短核 列在参考基因组上的 位置。 在利用计算机进行序列比对时, 序列比对可以通过任何一 种序列比对程序, ELAND ( efficient local alignment of nucleotide data ) , SOAP ( Short Oligonucleotide Analysis Package ) 和 BWA ( Burrows-Wheeler Aligner )等程序进行。 比对成功的认定 标准又分为不容错比对(100%匹配)和部分容错比对(小于 100% 的匹配)。
序列标签: 是指可定位参考序列 (例如参考基因组序列)唯 一位置的读序( reads )。
参考序列标签 ( reference unique reads ): 是指具有固定长 度, 且在参考序列 (通常是参考基因组)上有唯一比对位置的序 列。 获得参考序列标签的过程, 例如包括, 将参考基因组切割为 固定长度的序列, 将这些序列比对回参考基因组, 选择唯一比对 到参考基因组的序列为参考唯一比对序列。 固定长度依测序仪的 测序结果序列长度而定, 具体可参考平均长度。 不同测序仪得到 的测序结果长度是不同的, 具体每一次测序, 测序结果的长度也 可能不同, 该长度的选^ 在一定主观和经验因素。
标签序列 (index ): 是一段特定长度的, 起标识性作用的核 酸序列。 当待测的 DNA分子来自多个受试样本时, 每个样本可 以被加上不同的标签序列, 以用于在测序过程中进行样品的区分 (Micah Hamady, Jeffrey J Walker, J Kirk Harris et al. Error- correcting barcoded primers forpyrosequencing hundreds of samples in multiplex. Nature Methods, 2008, March, Vol.5 No.3) , 从而实现同时对多个样品进行测序。 标签序列为了区分 不同序列, 但不影响添加标签序列的 DNA分子的其他功能。
GC校正: 因为测序批次间 /内存在一定的 GC偏向性, 会使 基因组中高 GC或低 GC区域出现拷贝数偏差, 对测序数据基于 对照样本集进行 GC校正得到每个窗口中校正后的相对测序序列 数, 可以去除此偏向性, 提高拷贝数变异检测的精度。
平均数:本文所指平均数一般为算术平均数或中位数。
序列标签数目: 序列标签数目可以基于最初的数量统计得到 个数统计量, 也可以是序列标签个数经其他参数修正得到的修正 值, 例如可以是一个比值, 一些情况下与 "拷贝率" 可以互换。
受试样品: 在一些情形下可被称为待测样品, 是指含有核酸 分子, 而且核酸分子被怀疑发生变异的样品。 核酸的类型并不受 特别限制, 可以是脱氧核糖核酸(DNA ), 也可以是核糖核酸 ( RNA ), 优选 DNA。 对于 RNA, 可以通过常规手段将其转换 为具有相应序列的 DNA, 进行后续检测和分析。
对照样本: 是相对受试样本而言的, 是被认为正常的样本, 通常情况下, 正常是指表型正常。
对照样本集(对照集): 是指对照样本组成的集合, 本发明 一个实施中, 集合中对照样本的数目要求大于 30。
下面参照附图对本发明进行更全面的描述, 其中说明本发明 的示例性实施例。
随着高通量测序技术的不断发展与测序成本的不断降低, 测 序技术在染色体变异检测方面得到了越来越广泛的应用。
为了提高目前临床上对拷贝数变异检测的技术, 本公开设计 了一种基于高通量测序技术的进行全基因组水平拷贝数变异筛查 的技术方案, 具有通量高、 特异性高、 定位准确的优点。 通过获 取受试者样品, 提取 DNA, 进行高通量测序, 对获得的数据进行 分析, 得出检测结果。
图 1 示出本发明的拷贝数变异检测方法的一个实施例的流程 图。
如图 1 所示, 在步骤 102, 获取受试样品中至少一部分核酸 分子的读序 (reads )信息。 可以对受试样本中的至少一部分核酸 分子或者全部核酸分子进行测序以获得读序信息。 可以获取受试 样品的部分核酸分子的读序信息, 或者获取全部核酸分子的读序 信息。 例如, 将来自受试样品的基因组 DNA分子随机打断得到 DNA片段, 对 DNA片段进行测序, 然后获得一定长度的读序。 获得的读序的长度可能在一定范围内, 可以通过截取操作获得固 定长度的读序。 DNA 片段的长度可以在 50bp~1500bp , 例如 50bp~150bp、 150bp - 350bp、 350bp~500bp、 500bp~700bp、 700bp ~ 1000bp, 或 lOOObp ~ 1500bp。 例如, 可以选择 50bp、 90bp、 100bp、 150bp、 300bp、 350bp、 500bp、 700bp、 1000bp、 1500bp. 在一个实施例中, 优选 300bp ~ 700bp, 更优选 350bp ~ 500bp„ 读序的长度鉴于测序仪的不同可能会有较大的差别, 例 如 illumina-solexa和 life technologies-solid仪器目前普遍读序长 度在 300b 以内, 而 roche-454、 传统的 Sanger测序、 最新的单 分子的测序系统得到的读序长度可接近或超过 1000bp。 为了适应 唯一比对的需要, 在基于读序选择序列标签时, 一般会选择读序 中 20bp以上的序列去比对, 优选地, 在 26bp以上。 步骤 104, 根据读序信息确定唯一比对至(基因组)参考序 列的序列标签。 例如, 将读序的全部或部分序列与基因组参考序 列进行比对以获得读序在基因组上的位点信息, 得到读序在具体 染色体上的位点信息。 对于人类受试样本, 人类基因组参考序列 可以是 NCBI数据库中的人类基因组参考序列。 在本发明的一个 实施例中, 人类基因组序列是 NCBI数据库中版本 36 ( hgl8; NCBI Build 36 ) 的人类基因组参考序列, 所采用的比对软件是 SOAPaligner/soap2„ 选择与基因组参考序列唯一比对上的 DNA 片段读序, 即只在人类基因组参考序列上比对上一次的读序, 也 就是唯一比对至(基因组)参考序列的序列标签。
步骤 106, 将基因组参考序列划分窗口, 统计落入各个窗口 的序列标签数目。 对于窗口采用以下方法确定: 将参考基因组打 断为与待检测样品测序长度相同的片段, 采用相同的比对软件的 相同参数进行比对, 筛选出染色体上可被唯一比对的位置; 然后 每间隔一定长度的可比对位置划定一次窗口。 窗口之间可以选择 是否有交叉滑动。 窗口内应包含的可被唯一比对的位点数与待测 样品的测序数据量有关。 一般要使待测样品落在每个窗口中的期 望 reads数在 300条以上, 从而保证落在窗口内的 reads数分布 符合泊松分布。 例如, 假设基因组可被唯一比对的点数为 N, 待 检测样品的有效 reads数为 n, 每个窗口的期望 reads数为 E, 则
― E
参考基因组的每个窗口中应包含 X 个可被唯一比对位点。
步骤 108, 对各个窗口的序列标签数目进行 GC校正并根据 以对照样本集修正的期望序列标签数目进行修正获得调整后的序 列标签数目。 在一个实施例中, 对各个窗口的序列标签数目进行 GC校正包括以下的步骤:将窗口按 GC含量分组, 基于组内序列 标签的平均数与所有窗口的序列标签平均数获得修正系数, 对该 窗口序列标签数进行修正获得 GC校正的序列标签数; 对照样本 集修正的期望序列标签数目通过以下方法获得: 计算对照集中每 个窗口经过 GC校正的序列标签数与标签总数的比值; 基于该比 值, 获得所有对照样本对应窗口比值的平均数; 基于上述平均数 和待测样本的序列标签总数, 计算待测样本各个窗口序列标签数 的期望值
步骤 110, 以窗口的起点或终点作为分界点, 计算两侧由调 整后的序列标签数目组成数值群体的差异显著性值, 选取显著性 值较小 (即差异显著性较大) 的分界点为候选的 CNV 断点。 例 如, 在全基因组范围内根据表征每个窗口两侧拷贝数变异的显著 性水平的 P值选取预定数量的窗口作为候选 CNV断点, 获得每 个候选 CNV断点的差异显著性值, 即 p值。
步骤 112, 对于每个 CNV断点至在先 CNV断点和在后 CNV 断点的两段序列, 计算所述两段序列包含的窗口的调整后的序列 标签数目组成的两个数值群体的差异显著性值, 每次剔除最不显 著的候选 CNV断点并更新被剔除的候选 CNV断点左右两个候选 CNV断点的差异显著性值, 循环迭代, 直至所有候选 CNV断点 的差异显著性值都小于终止阈值, 从而确定 CNV 断点。 终止阈 值通常预先设定。 例如, 通过对由正常样本组成对照样本集进行 分析处理获得该终止阈值。
上述实施例中, 通过将获取的读序与基因组参考序列比对, 选择唯一比对的读序进行窗口统计, 对每个窗口的序列标签数目 进行 GC校正和对照集修正后进行基因显著性差异的循环迭代进 行 CNV断点选择, 从而完成 CNV的检测, 能够精确检测到较小 的包括微缺失 /微重复区域的拷贝数变异。
对于人类样本, 受试样品可以是羊膜腔穿刺获得的羊水、 绒 毛活检得到的绒毛、 经腹脐静脉穿刺得到的脐带血、 自然流产的 胎儿组织或人体外周血中获取的基因组 DNA, 可以采用盐析法、 柱层析法、 磁珠法、 SDS 法等常规 DNA提取方法。 在一个实施 例中, 优选采用柱层析法, 该柱层析法是指血液、 组织或细胞经 过细胞裂解液和蛋白酶 K的作用后得到棵露的 DNA分子, DNA 分子在高盐条件下与硅胶膜结合, 在低盐、 高 pH值时 DNA分 子从硅胶膜上洗脱下来。 具体原理和方法参见 Tiangen TIANamp Micro DNA Kit ( DP316 )产品说明书))。
当待测的 DNA片段来自多个受试样本的 DNA分子时, 每个 受试样本的 DNA 片段可以被加上不同的标签序列 (index ), 长 度可选 4bp ~ 12bp , 以用于在测序过程中进行受试样品的区分 (Micah Hamady, Jeffrey J Walker, J Kirk Harris et al. Error- correcting barcoded primers forpyrosequencing hundreds of samples in multiplex. Nature Methods, 2008, March, Vol.5 Νο·3)。 通过这样的方式, 能够同时处理多个受试样品的检测, 提高了效 率, 减小了检测成本。
图 2 示出本发明的拷贝数变异检测方法的另一个实施例的流 程图。
步骤 202, 对受试样品的基因组 DNA 分子随机打断得到 DNA 片段。 DNA分子随机打断处理可以采用酶切、 雾化、 超声、 或者 HydroShear法。 优选地, 采用超声法, 例如 Covaris公司 的 S-series (基于 AFA技术, 当由传感器释放的声能 /才0¾能通过 DNA样品时, 溶解气体形成气泡。 当能量移除后, 气泡破裂并产 生断裂 DNA分子的能力。 通过设置一定的能量强度和时间间隔 等条件, 可将 DNA分子打断至一定范围的大小。 具体原理和方 法请参见 Covaris公司的 S-series说明书), 将 DNA分子打断为 比较集中的一定大小的片段。
步骤 204, 对 DNA片段进行测序获得 DNA片段测序序列, 即读序。 测序获得的读序可以在一定长度范围内, 可以通过截取 操作从 DNA 片段测序序列获得固定长度的读序, 在本实施例下 文中 DNA 片段测序序列指固定长度的读序。 所采用的测序方法 可以为高通量测序方法 Illumina/Hiseq2000、 ABI/SOLiD、 Roche/454。 测序类型可以为 Single-end (单向)测序或 Pair-end (双向)测序, 测序长度可以为 50bp~1500bp。 在本发明的一个 实施例中, 测序平台为 Illumina/Hiseq2000, 测序类型为 Pair- end测序, 得到具有双向位置关系的 lOObp大小的 DNA序列分 子。 对测序深度有一定要求, 测序深度可以依据检测的染色体变 异片段大小确定, 测序深度越高, 检测的灵敏度越高, 即可检出 的缺失和重复的片段越小。 在本发明的一个实施例中, 人类受试 样本的测序量在 2~900 X 108条测序片段之间。
步骤 206, 将读序与基因组参考序列进行比对以获得读序在 基因组上的位点信息。
步骤 206, 选择与基因组参考序列唯一比对上的读序作为序 列标签。
步骤 208, 统计落入基因组上每个窗口的 DNA片段唯一比对 的序列标签的个数。 对于每个受试样品, 统计每个窗口所落的序 列标签个数(记为 ^, 下标 i和 j分别代表窗口编号和样本编号, 以作区分, 下不复述)。
步骤 210, 确定基因组上每个窗口的平均 GC含量以确定该 窗口修正系数, 根据修正系数获得每个窗口的序列标签的修正个 数。 该步骤主要根据每个窗口的 GC含量对窗口的序列标签个数 进行修正, 可称为批次修正或 GC校正。
统计每个窗口所落的序列标签的平均 GC含量(记为 )。 平均 GC含量 ( GC- )的计算, 即计算落在统计窗口内的所有序 列标签的平均 GC含量水平。 在计算时, 将所有序列标签中的 G 和 C碱基的个数总和即为 ^, 所有序列标签的总长度记为 则
为了修正样品数据量的差异, 将各个统计窗口按其 ' 进行 分组, 即^ "相同的窗口划为一组。 取各分组中序列标签数" 的 中位数或算术平均数 , 除全基因组水平序列标签数的中位数 或算术平均数^, 得到修正系数^ ( c- = m- ), 下标 g代表 不同分组的 GC含量。 将原始的每个窗口的序列标签数" ' 乘以修 正系数^得到每个窗口内序列标签数的修正值(记为"" )
步骤 212, 将每个窗口的序列标签的修正个数除以该窗口的 个数期望值获得每个窗口的调整后的序列标签数, 即拷贝率。 其 中每个窗口的个数期望值通过由正常样本组成对照集 (control set) 获得。 该步骤主要根据正常样本数据进行每个窗口的序列标签的 个数的校正, 可以称为窗口校正。
在对照集样本中, 定义相对序列标签数百分比( ^ )为窗口 内序列标签数( n" )和全基因组总序列标签数( )之比, 即
N , 然后计算得到对照集中每个窗口的平均百分比 ,
1 n ^ 'J。 在受试样本中, 每个窗口的拷贝率^等于修正序列标 签数""除以窗口内的期望值(全基因组总的序列标签数乘以窗口 序列标签数百分比), 即^ XN
在对照集的选择中, 建库方法、 测序试剂及测序类型等应尽 量与待测样品一致, 从而提高对照样品对待测样品的校正效果。 对照集中的样品应为正常样本, 且样本量大于 30。
步骤 214, 在全基因组范围内根据表征每个窗口两侧拷贝数 变异的显著性差异的 ρ值选取预定数量的窗口作为候选 CNV断 点, 获得每个候选 CNV断点的 ρ值。
1 )候选 CNV断点选择: 针对全基因组的所有窗口, 计算每 个窗口左右两侧一定数量窗口 (窗口数量通常大于 30 或满足检 验模型的最低样品量限制, 以使得检验模型具有统计意义) 的拷 贝率变化差异, 根据全基因组范围的显著性差异水平(p值从小 到大), 选取一定数量(例如, 总窗口数量的 1% )的 (与窗口对 应)点作为候选 CNV断点( Breakpoint, 即每个 CNV片段的边 界点)。
2 )初始化: 将排序后的所有断点的集合记为 = { , W , 则在每个断点 k左右两侧, 分别存在相邻的断点 k+l、 k-l。 通过 计算 k_i与 k之间的拷贝数集合与 k与 k+1之间的拷贝率集合之 间的显著性差异, 得到代表每个断点两侧显著性差异的 p值。 例 如, 在一个实施例中, 选用非参数检验中的游程检验, 利用两个 群体元素混合后的分布均匀状态评价此两个群体的差异显著性, 详见: Wald, A. & Wolfowitz, J. On a test whether two samples are from the same population. The Annals of Mathematical Statistics 11, 147-162 (1940)。
步骤 216, 每次剔除最不显著的候选 CNV断点并更新被剔除 的候选 CNV断点左右两个候选 CNV断点的 p值, 循环迭代, 直 至所有候选 CNV断点的 p值都小于终止 p值(即, 终止阈值), 终止 p值根据对照集获得。
迭代合并: 通过不断的循环迭代, 每次将最不显著的候选突 破点剔除, 同时更新左右两个断点的 p值, 直到所有 p值都小于 终止 p值为止。
终止 p值的获得: 例如, 可以在将对照样品进行上述迭代合 并的操作, 记录每次迭代合并的最大 p值, 迭代合并至一条片段 时终止。 此时, 根据最大 p值的变化趋势, 将最大 p值变化最剧 烈的点 (即取 p值的变化曲线中斜率变化最明显的点 (曲率最大 的点))对应或之前的那次合并过程的最大 p值作为终止阈值。 在一个实施例中, 亦可将迭代合并的终止设置为迭代合并后的片 段数等于预知的片段数, 例如, 当对全基因组进行分析时, 对照 样品迭代合并后的片段数为 24 个时终止, 通过计算此时终止 p 值的平均值, 可以有效地获得终止 7值。
上述步骤 214和 216也可以称为片段化。 注意: 在步骤 214 中 1 )和 2 ) 中的窗口和断点选择时, 可以考虑进行单条染色体 环化或者全基因组水平环化。 单条染色体环化是指: 在计算染色 体起点附近的窗口时, 如果左侧的有效窗口数量不足以进行统计 检验, 则从本条染色体的尾部反向获得足够的窗口进行计算; 同 理, 对于终点附近右侧无法获得足够有效窗口的位置则从染色体 前端获得。 这一操作使得位于染色体前部和尾部的窗口依然可以 计算。 全基因组水平环化则是在每条染色体前端有效窗口数不足 时索引前一条染色体的尾端, 而尾端不足时索引下一条染色体的 前端, 1号染色体则和 Y染色体相连。
在步骤 216后还可以包括对根据 CNV断点之间的片段进行 置信选择的步骤: 根据拷贝率的分布规律, 利用对照集确定拷贝 率正常的置信区间; 当片段内拷贝率的均值处于置信区间之外时 , 认为该 CNV 断点之间的片段确实存在异常。 在一个实施例中, 拷贝率符合正态分布, 置信区间为 95%置信区间。 通过该步驟对 片段化结果进行过滤, 获得可靠的结果。 如果片段的平均^小于 阈值下限或者大于阈值上限时, 将作为阳性结果输出。
阈值选择: 统计每个对照样品中窗口的拷贝率分布, 根据中 心极限定理, 读序在窗口内是随机的, 所以拷贝率 r是服从正态 分布的, 取其显著性水平为 0.05时的左右分位点。 计算其在对照 集中的均值, 分别作为筛选拷贝率变异时的上下限阈值。
上述实施例中, 通过批次修正和窗口校正, 提高了检测结果 的准确率。 通过引入参照集, 可以通过扩大参照集来提高精度, 以减轻对起始 DNA量的压力。
图 3 示出本发明的基因组拷贝数变异检测方法的又一个实施 例的流程图。 在图 3 中, 包括了由正常样本组成的对照集的处理 流程 ( 3A )和受试样品的处理流程 ( 3B )„ 对照集主要用于获取 对受试样品进行校正的数据以及受试样品处理的迭代终止条件的 终止阈值。
如图 3所示, 流程 3A包括:
步骤 310A, 提取对照样品的 DNA分子。
步骤 311A, 对对照样品的 DNA分子随机打断为 DNA片段 进行测序, 获得对照样品的 DNA片段测试序列数据, 即读序。
步骤 312A, 将对照样品的读序与参考基因组进行对比。
步骤 313A, 统计窗口唯一比对读序数, 即序列标签数目。 步骤 314A, 对对照样品进行批次修正。
步骤 315A, 通过对照样品获得窗口个数期望值, 用于受试 样品的窗口校正。
步骤 316A, 断点选择与片段化。 选择候选 CNV断点, 每次 剔除最不显著的候选 CNV断点并更新被剔除的候选 CNV断点左 右两个候选 CNV断点的 P值, 循环迭代, 使最终的片段剩余预 定个数(例如 24个) 时停止。
步骤 317A, 确定终止阈值。 通过计算此时终止 p值的平均 值, 可以有效地获得终止 p值, 作为受试样品迭代终止条件的终 止阈值。
流程 3B包括:
步骤 310B, 提取受试样品的 DNA分子。
步骤 311B, 对受试样品的 DNA分子随机打断为 DNA片段 进行测序, 获得受试样品的 DNA片段的读序。
步骤 312B, 将受试样品的 DNA片段的读序与参考基因组进 行对比。
步骤 313B, 去统计窗口唯一比对读序数, 即序列标签数。 步骤 314B, 对受试样品进行批次修正。
步骤 315B, 根据对照样品获得的窗口个数期望值对受试样品 进行窗口校正。 步骤 316B, 断点选择与片段化。
步骤 317B, 进行结果过滤。
在对照集的选择中, 建库方法、 测序试剂及测序类型等应尽 量与受试样品一致, 从而提高对照样品对受试样品的校正效果。 对照集中的样品应为正常样本, 且样本量大于 30。
图 4示出本发明一个实现中染色体 CNV分析的简要流程图。 如图 4 所示, 步骤 401 , DNA提取及测序: 根据 Tiangen DP327-02 Kit 操作手册提取基因组 DNA 后, 基本按照 Illumina/Hiseq2000 标准建库流程进行建库。 在这个过程中, 本 身集中于 500bp的 DNA分子两端被加上测序所用接头, 每个样 本被加上不同的标签序列 (index ), 从而在一次测序得到的数据 中可以使多个样本的数据区分开。
步骤 402, 序列比对: 利用测序方法 Illumina/Hiseq2000 测 序 (用其它测序方法如 ABI/SOLiD能达到相同或相近的效果), 每个样本得到一定长度片段的 DNA读序, 将其与 NCBI数据库 中的标准人类基因组参考序列进行 SOAP2 比对, 得到读序定位 于基因组相应位置的信息。 为避免重复序列对 CNV分析的干扰, 只选取与人类基因组参考序列唯一比对的读序, 即只在人类基因 组参考序列上比对上一次的读序, 也即序列标签数, 作为后续 CNV分析的有效数据。
步骤 403, PSCC 分析。 运用本申请人自主开发的针对全基 因组拷贝数变异检测的一系列生物信息学方法(下称 PSCC ), 在 对受试样本进行批次性差异修正, 根据对照集(control set ), 对 受试样品进行窗口校正 ( correction )、 标准化 ( Normalization ) 和拷贝数片段化 ( segmentation )。
步骤 404, 基于步骤 403中得到的拷贝数片段进行 CNV分析, 将受试样品拷贝率≤0.7和≥1.3分别作为片段缺失和重复的检测阈 值, 分析得到全基因组水平拷贝数变异片段, 然后进行结果的可 视化。
上述实施例中, 实现的软件算法是一种由深圳华大基因研究 院开发针对全基因组拷贝数变异检测的一系列程序, 统称为
PSCC。 它能够通过新一代测序技术产生的数据, 将受试样本进 行批次修正, 然后和对照集合进行数据校正、 标准化和片段化, 估算出受试样本的拷贝数变异程度和大小。 在较低测序深度
( 50M条测序短序列)的时候可检测出 0.5Mb左右的单个拷贝数 变异(CNV ) 片段。
图 5示出本发明的基因组拷贝数变异检测系统的一个实施例 的结构图。 如图 5所示, 该系统包括: 读序获取单元 51, 获取受 试样品中至少一部分核酸分子的读序信息; 序列标签确定单元 52, 确定唯一比对至(基因组)参考序列的读序作为序列标签; 标签 数目统计单元 53, 将基因组参考序列划分窗口, 统计落入各个窗 口的序列标签数目; 标签数目调整单元 54 , 对各个窗口的序列标 签数目进行 GC校正并根据以对照样本集修正的期望序列标签数 目进行修正获得调整后的序列标签数目; 候选断点选择单元 55, 以窗口的起点或终点作为分界点, 计算两侧由调整后的序列标签 数目组成数值群体的差异显著性值, 选取显著性值较小 (即差异 显著性较大)的分界点为候选的 CNV断点; 断点确定单元 56, 对于每个 CNV断点至在先 CNV断点和在后 CNV断点的两段序 列, 计算所述两段序列包含的窗口的调整后的序列标签数目组成 的两个数值群体的差异显著性值, 每次剔除最不显著的候选 CNV 断点并更新被剔除的候选 CNV断点左右两个候选 CNV断点的差 异显著性值, 循环迭代, 直至所有候选 CNV 断点的差异显著性 值都小于终止阈值, 从而确定 CNV 断点; 该终止阈值可以根据 由正常样本组成对照样本集获得。 标签数目统计单元 53 划分窗 口时, 可以使得每个窗口具有相同的参考序列标签数目
( reference unique reads ), 或者, 使得每个窗口具有相同的长度。 在一个实施例中, 候选断点选择单元 55在选择候选的 CNV断点 时进行单条染色体环化或者全基因组水平环化处理
上述实施例中, 序列标签确定单元根据读序确定唯一比对至 (基因组)参考序列的序列标签, 标签数目调整单元对各个窗口 的序列标签数目进行修正, 由候选断点选择单元和断点确定单元 进行基因显著性差异的循环迭代进行 CNV 断点选择, 从而完成 CNV 的检测, 能够精确检测到包括较小的微缺失 /微重复区域的 拷贝数变异。
图 6示出本发明的拷贝数变异检测系统的另一个实施例的结 构图。 如图 6所示, 该系统包括: 读序获取单元 51、 序列标签确 定单元 52、 标签数目统计单元 53、 标签数目调整单元 64、 候选 断点选择单元 55和断点确定单元 56。 读序获取单元 51、 序列标 签确定单元 52、 标签数目统计单元 53、 候选断点选择单元 55、 和断点确定单元 56可以参见图 5 的具体描述, 为简洁起见在此 不再详细描述。 标签数目调整单元 64 包括 GC校正模块 641和 窗口修正模块 642。 其中, GC校正模块 641, 将窗口按 GC含量 分组, 基于组内序列标签的平均数与所有窗口的序列标签平均数 获得修正系数, 对该窗口序列标签数进行修正获得 GC校正的序 列标签数; 窗口修正模块 642, 计算对照集中每个窗口经过 GC 校正的序列标签数与标签总数的比值; 基于该比值, 获得所有对 照样本对应窗口比值的平均数; 基于上述平均数和待测样本的序 列标签总数, 计算待测样本各个窗口序列标签数的期望值, 对 GC校正后的序列标签数以对照样本集修正的期望序列标签数目 进行修正获得调整后的序列标签数目, 也称拷贝率。
才艮据本发明的一个实施例, 该系统还包括断点过滤单元 67, 在断点确定单元确定 CNV 断点后根据拷贝率的分布规律, 利用 对照集确定拷贝率正常的置信区间; 当片段内拷贝率的均值处于 置信区间之外时, 认为该 CNV 断点之间的片段确实存在异常。 在一个实施例中, 序列标签数目符合正态分布, 所述置信区间为
95%置信区间。
在一个实施例中, 受试样品是来自人类样本的样品, 所述受 试样品的基因组 DNA分子是羊膜腔穿刺获得的羊水、 绒毛活检 得到的绒毛、 经腹脐静脉穿刺得到的脐带血、 自然流产的胎儿组 织或人体外周血中获取的基因组 DNA; 受试样品的基因组 DNA 分子采用盐析法、 柱层析法、 磁珠法、 或 SDS法的 DNA提取方 法获得; 对受试样品的基因组 DNA分子为采用酶切、 雾化、 超 声、 或者 HydroShear法随机打断得到 DNA片段; DNA片段测 序序列通 it t受试样品的基因组 DNA分子片段进行单向测序或 双向测序获得。 可以通过受试样品 DNA 片段上所加的不同标签 序列以区别不同的受试样品。
对于图 5至图 6中各个单元的功能, 可以参考上文中关于本 发明方法的实施例中对应部分的说明, 为简洁起见, 在此不再详 述。
本领域技术人员将意识到硬件、 固件和软件配置在这些情况 下的可替换性, 以及如何最好地实现每个特定应用地该功能。
下面将结合应用例对本发明的实施例进行详细描述, 但是本 领域技术人员将会理解, 下列应用例仅用于说明本发明, 而不应 视为限定本发明的范围。 应用例中未注明具体条件者, 按照常规 条件或制造商建议的条件进行。 所用试剂或仪器未注明生产厂商 者, 均为可以通过市场获得的常规产品。 以下括号内为各个试剂 或试剂盒的厂家货号。 所使用的测序用的接头和标签序列 index 来源于 Illumina 公司 的 Multiplexing Sample Preparation Oligonutide Kit。
应用例一、 2个染色体数目异常及 6个微缺失样品检测
1、 DNA提取:
按照 Tiangen的 TIANamp Micro DNA Kit ( DP316 )操作 流程提取 8 例样品 ( 以下简称 Samplel 、 Sample2、 Sample3...Sample8 ) 的 DNA , 所提取 DNA 按照修改后的 Illumina/Hiseq2000 标准建库流程进行建库, 在本身集中于 500b 的 DNA分子两端被加上测序所用接头, 每个样本被加上 不同的标签序列 (index ), 然后与 Flowcell表面互补接头杂交, 在一定条件下使核酸分子成簇生长, 然后在 Illumina Hiseq2000 上通过双末端测序, 得到长度为 lOObp的 DNA片^ Ψ列。
具体的, 将获自上述羊水样品的约 lOOng ( Quant-IT dsDNA HS Assay kit定量)的 DNA, 进行修改后的 Illumina/Hiseq2000 标准流程建库 , 具体流程参照现有技术 ( 参见 http:〃 www.illumina.com/提供的 Illumina/Solexa 标准建库说明 书)。 经 2100Bioanalyzer (Agilent)确定 DNA文库大小及插入片 段为约 500bp, QPCR精确定量后可上机测序。
2.测序: 本实施例中, 对于获自上述 8例样品的 DNA样本按 照 Illumina/Solexa 官方公布的 ClusterStation 和 Hiseq2000 ( PEsequencing )说明书进行操作, 使每个样品得到约 5G数据 量进行上机测序, 每个样本根据所述 index标签区进行分开。 利 用比对软件 SOAP2, 将测序所得 DNA序列与 NCBI数据库中版 本 36 ( hgl8; NCBIBuild36 ) 的人类基因组参考序列进行比对, 得到所测 DNA序列定位于基因组相应位置的信息。
3·数据分析
a )基本统计: 统计每个窗口所落的序列标签个数(即唯一 比对的 reads数, 记为" , 下标 i和 j分别代表窗口编号和样本 编号, 以作区分, 下不复述)和平均 GC含量(记为 GC' 。
b)批次修正: 为了修正样品数据量的差异, 将各个统计窗口 按落入其中的 reads平均 GC分组, 取各分组的中位数或算术平 均数除全基因组水平的中位数或算术平均数之间得到修正系数 cs , 下标 g代表不同分组的 GC含量。 将原始的每个窗口的 reads个 数 ^乘以修正系数 得到每个窗口内序列标签数的修正值(记为 c )数据校正'. 选取 90例 YH群体的测序数据作为对照集, 定义相对序列标签数百分比 ( p" ) 为窗口内测序序列标签数
( n" )和全基因组总序列标签数( )之比, 即 。 然后 一 ρ = γΡ
计算得到每个窗口的平均百分比 ' " "。 在受试样本中, 每个窗口的拷贝率 ^等于修正修正序列标签数""除以窗口内的期 望值(全基因组总序列标签数乘以窗口序列标签数百分比), 即
d ) 片段化
①候选断点选择: 针对全基因组的所有^, 计算其左右两侧
100个窗口的拷贝率变化差异, 根据全基因组范围的显著水平 ( P 值从小到大), 选取最显著的 10000个点作为候选 CNV断点。
②初始化: 将排序后的所有断点的集合记为 = {}, 则在每个断点 k左右两侧, 分别存在相邻的断点 k+l、 k-l o 通过 计算 k-1与 k之间的拷贝率集合与 k与 k+1之间的拷贝率集合之 间的显著性差异(本实施例中选用非参数检验中的游程检验), 得到代表每个断点两侧显著性差异的 p值。
③迭代合并: 通过不断的循环迭代, 每次将最不显著的候选 断点剔除, 同时更新左右两个断点的 p值, 直到所有 p值都小于 1E- 时为止。
注意: 在步骤①和②中的窗口和断点选择时, 进行了全基因 组水平环化, 从而使得位于染色体前部和尾部的窗口依然可以计 算。
e ) 阈值与过滤: 对片段化过后的结果进行过滤。 如果片段 的平均 r"小于 0.7或者大于 1.3时, 作为阳性结果输出。
f )结果可视化。
4. 结果统计
对于 8个样品, 详细的检测结果和验证结果如下表 1所示。 其中验证结果通过 CGH 芯片 (比较基因组杂交芯片)得出。 实验使用 Human Genome CGH Microarray Kit , ( Agilent Technologies Inc. ), 完全按照厂商的使用说明进行操作。
Figure imgf000026_0001
表 1. 8例样品的 CNV结果其中 chr代表染色体, T7代表 7号染色体三体型。 XYY未性染色体三体异常。 图 7A-7H示出 8个样品的检测结果示意图。
从上表 1和图 7A-7H可以看到, 小至 0.4M的微缺失片段, 大到整条染色体数目异常, 本方法都可精确的进行检测和定位, 证明其检测效率和精度都较为优秀。
与目前已报道的拷贝数变异检测的分析方法对比, 本公开的 优越性主要有一下几点:
( 1 )分辨率, 使用 50M左右的数据, 可精确检测到 0.5M 的微缺失区域。
( 2 )可扩展性。 除了通过增加测序量之外, 可以通过扩大 参照集来提高精度, 以减轻对起始 DNA量的压力。
( 3 ) 更稳定, 更加全面。 已报道文章中, 并无明确指出自 身的操作细节, 而本发明涉及数据批次修正, 群体校正, 片段化 条件优选等的各个方面。
本发明对适用人群进行全基因组拷贝数变异检测分析, 有利 于提供遗传咨询和提供临床决策依据, 对微缺失综合征患者进行 准确的病理判定。 本发明适用人群可以是所有微缺失患者或潜在 微缺失携带者, 适用人群举例仅用于说明本发明, 而不应为限定 本发明的范围。
本发明的描述是为了示例和描述起见而给出的, 而并不是无 遗漏的或者将本发明限于所公开的形式。 很多修改和变化对于本 领域的普通技术人员而言是显然的。 选择和描述实施例是为了更 好说明本发明的原理和实际应用 , 并且使本领域的普通技术人员 能够理解本发明从而设计适于特定用途的带有各种修改的各种实 施例。

Claims

权 利 要 求
1. 一种拷贝数变异检测方法, 其特征在于, 包括:
获取受试样品中至少一部分核酸分子的读序信息;
根据所述读序确定唯一比对至基因组参考序列的序列标签; 将基因组参考序列划分窗口, 统计落入各个窗口的序列标签 数目;
对各个窗口的序列标签数目进行 GC校正并根据以对照样本 集修正的期望序列标签数目进行修正获得调整后的序列标签数目; 以窗口的起点或终点作为分界点, 计算两侧由调整后的序列 标签数目组成数值群体的差异显著性值, 选取显著性值较小的分 界点为候选的 CNV断点;
对于每个 CNV断点至在先 CNV断点和在后 CNV断点的两 段序列, 计算所述两段序列包含的窗口的调整后的序列标签数目 组成的两个数值群体的差异显著性值, 每次剔除最不显著的候选 CNV断点并更新被剔除的候选 CNV断点左右两个候选 CNV断 点的差异显著性值, 循环迭代, 直至所有候选 CNV 断点的差异 显著性值都小于终止阈值, 从而确定 CNV断点。
2. 根据权利要求 1所述的方法, 其特征在于, 还包括对受试 样本中的至少一部分核酸分子进行测序以获得读序信息的步骤。
3.根据权利要求 1所述的方法, 其特征在于, 每个窗口具有 相同的参考序列标签数目, 或者, 每个窗口具有相同的长度。
4. 根据权利要求 1或 2所述的方法, 其特征在于, 所述终止 阈值根据由正常样本组成对照样本集获得。
5. 根据权利要求 1所述的方法, 其特征在于, 所 ¾t各个窗 口的序列标签数目进行 GC校正包括以下的步骤:
将窗口按 GC含量分组, 基于组内序列标签的平均数与所有 窗口的序列标签平均数获得修正系数, 对该窗口序列标签数进行 修正获得 GC校正的序列标签数;
和 /或
对照样本集修正的期望序列标签数目由以下方法获得: 计算对照集中每个窗口经过 GC校正的序列标签数与序列标 签总数的比值; 基于该比值, 获得所有对照样本对应窗口比值的 平均数; 基于上述平均数和待测样本的序列标签总数, 计算待测 样本各个窗口序列标签数的期望值。
6. 根据权利要求 5所述的方法, 其特征在于, 在选择候选的 CNV断点时进行单条染色体环化或者全基因组水平环化处理。
7. 根据权利要求 1 或 2 所述的方法, 其特征在于, 在确定 CNV断点后还包括:
对根据 CNV断点之间的片段进行置信选择的步骤;
所述置信选择包括以下步骤: 根据调整后的序列标签数目的 分布规律, 利用对照集确定调整后的序列标签数目正常的置信区 间; 当片段内调整后的序列标签数目的均值处于置信区间之外时 , 认为该 CNV断点之间的片段确实存在异常。
8. 根据权利要求 7所述的方法, 其特征在于, 所述调整后的 序列标签数目符合正态分布, 所述置信区间为 95%置信区间。
9. 根据权利要求 1至 8中任意一项所述的方法, 其特征在于, 还包括:
所述受试样品是来自人类样本的样品 , 所述受试样品包括是 羊膜腔穿刺获得的羊水、 绒毛活检得到的绒毛、 经腹脐静脉穿刺 得到的脐带血、 自然流产的胎儿组织或人体外周血;
和 /或
所述受试样品的基因组 DNA分子采用盐析法、 柱层析法、 磁珠法、 或 SDS法的 DNA提取方法获得;
和 /或
对所述受试样品的基因组 DNA分子采用酶切、 雾化、 超声、 或者 HydroShear法随机打断得到 DNA片段;
和 /或
对所述受试样品的基因组 DNA分子片段进行单向测序或双 向测序获得所述 DNA片段读序。
10. 根据权利要求 1所述的方法, 其特征在于, 还包括: 在每个受试样品的 DNA 片段上加上不同的标签序列以区别 不同的受试样品。
11. 一种拷贝数变异检测系统, 其特征在于, 包括:
读序获取单元, 用于获取受试样品中至少一部分核酸分子的 读序信息;
序列标签确定单元, 用于根据所述读序信息确定唯一比对至 基因组参考序列的序列标签;
标签数目统计单元, 用于将基因组参考序列划分窗口, 统计 落入各个窗口的序列标签数目;
标签数目调整单元, 用于对各个窗口的序列标签数目进行
GC校正并根据以对照样本集修正的期望序列标签数目进行修正 获得调整后的序列标签数目;
候选断点选择单元, 用于以窗口的起点或终点作为分界点, 计算两侧由调整后的序列标签数目组成数值群体的差异显著性值, 选取显著性值较小的分界点为候选的 CNV断点;
断点确定单元, 用于对于每 CNV断点至在 CNV断点和在后 CNV断点的两段序列, 计算所述两段序列包含的窗口的调整后的 序列标签数目组成的两个数值群体的差异显著性值, 每次剔除最 不显著的候选 CNV断点并更新被剔除的候选 CNV断点左右两个 候选 CNV断点的差异显著性值, 循环迭代, 直至所有候选 CNV 断点的差异显著性值都小于终止阈值, 从而确定 CNV断点。
12. 根据权利要求 11所述的系统, 其特征在于, 每个窗口具 有相同的参考序列标签数目, 或者, 每个窗口具有相同的长度。
13. 根据权利要求 11所述的系统, 其特征在于, 所述终止阈 值根据由正常样本组成对照样本集获得。
14. 根据权利要求 11所述的系统, 其特征在于, 所述标签数 目调整单元包括:
GC校正模块, 用于将窗口按 GC含量分组, 基于组内序列 标签的平均数与所有窗口的序列标签平均数获得修正系数, 对该 窗口序列标签数进行修正获得 GC校正的序列标签数;
窗口修正模块, 用于计算对照集中每个窗口经过 GC校正的 序列标签数与序列标签总数的比值; 基于该比值, 获得所有对照 样本对应窗口比值的平均数; 基于上述平均数和待测样本的序列 标签总数, 计算待测样本各个窗口序列标签数的期望值, 对 GC 校正后的序列标签数以对照样本集修正的期望序列标签数目进行 修正获得调整后的序列标签数目。
15. 根据权利要求 11所述的系统, 其特征在于, 还包括断点 过滤单元, 用于在所述断点确定单元确定 CNV 断点后根据序列 标签数目的分布规律, 利用对照集确定序列标签数目正常的置信 区间; 当片段内序列标签数目的均值处于置信区间之外时, 认为 该 CNV断点之间的片段确实存在异常。
16. 根据权利要求 15所述的系统, 其特征在于, 所述序列标 签数目符合正态分布, 所述置信区间为 95%置信区间。
17. 根据权利要求 11至 16中任意一项所述的系统, 其特征 在于, 所述受试样品是来自人类样本的样品, 所述受试样品包括 是羊膜腔穿刺获得的羊水、 绒毛活检得到的绒毛、 经腹脐静脉穿 刺得到的脐带血、 自然流产的胎儿组织或人体的外周血;
和 /或
所述受试样品的基因组 DNA分子采用盐析法、 柱层析法、 磁珠法、 或 SDS法的 DNA提取方法获得;
和 /或 对所述受试样品的基因组 DNA分子采用酶切、 雾化、 超声、 或者 HydroShear法随机打断得到 DNA片段;
和 /或
对所述受试样品的基因组 DNA分子片段进行单向测序或双 向测序获得所述 DNA片段的读序。
18. 根据权利要求 11所述的系统, 其特征在于, 通过受试样 品 DNA片段上所加的不同标签序列以区别不同的受试样品。
19.根据权利要求 11所述的系统, 其特征在于, 候选断点选 择单元在选择候选的 CNV 断点时进行单条染色体环化或者全基 因组水平环化处理。
PCT/CN2012/073545 2012-04-05 2012-04-05 一种拷贝数变异检测方法和系统 WO2013149385A1 (zh)

Priority Applications (11)

Application Number Priority Date Filing Date Title
EP12873786.3A EP2835752B8 (en) 2012-04-05 2012-04-05 Method and system for detecting copy number variation
US14/389,898 US20150056619A1 (en) 2012-04-05 2012-04-05 Method and system for determining copy number variation
SG11201406250SA SG11201406250SA (en) 2012-04-05 2012-04-05 Method and system for detecting copy number variation
PCT/CN2012/073545 WO2013149385A1 (zh) 2012-04-05 2012-04-05 一种拷贝数变异检测方法和系统
RU2014144349A RU2014144349A (ru) 2012-04-05 2012-04-05 Способ и система детекции вариации числа копий
KR1020147031062A KR101795124B1 (ko) 2012-04-05 2012-04-05 복제 수 변이를 검측하기 위한 방법 및 시스템
JP2015503724A JP5972448B2 (ja) 2012-04-05 2012-04-05 コピー数変異を検出する方法及びシステム
CN201280066929.3A CN104221022B (zh) 2012-04-05 2012-04-05 一种拷贝数变异检测方法和系统
AU2012376134A AU2012376134B2 (en) 2012-04-05 2012-04-05 Method and system for detecting copy number variation
IL234875A IL234875B (en) 2012-04-05 2014-09-29 A method for detecting a copy number change
US15/881,902 US11371074B2 (en) 2012-04-05 2018-01-29 Method and system for determining copy number variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/073545 WO2013149385A1 (zh) 2012-04-05 2012-04-05 一种拷贝数变异检测方法和系统

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US14/389,898 A-371-Of-International US20150056619A1 (en) 2012-04-05 2012-04-05 Method and system for determining copy number variation
US15/881,902 Continuation US11371074B2 (en) 2012-04-05 2018-01-29 Method and system for determining copy number variation

Publications (1)

Publication Number Publication Date
WO2013149385A1 true WO2013149385A1 (zh) 2013-10-10

Family

ID=49299922

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/073545 WO2013149385A1 (zh) 2012-04-05 2012-04-05 一种拷贝数变异检测方法和系统

Country Status (10)

Country Link
US (2) US20150056619A1 (zh)
EP (1) EP2835752B8 (zh)
JP (1) JP5972448B2 (zh)
KR (1) KR101795124B1 (zh)
CN (1) CN104221022B (zh)
AU (1) AU2012376134B2 (zh)
IL (1) IL234875B (zh)
RU (1) RU2014144349A (zh)
SG (1) SG11201406250SA (zh)
WO (1) WO2013149385A1 (zh)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104560697A (zh) * 2015-01-26 2015-04-29 上海美吉生物医药科技有限公司 一种基因组拷贝数不稳定性的检测装置
CN104694384A (zh) * 2015-03-20 2015-06-10 上海美吉生物医药科技有限公司 线粒体dna拷贝数变异性的检测装置
CN105243299A (zh) * 2015-09-30 2016-01-13 深圳华大基因科技服务有限公司 一种检测cnv的精确断点及断点周围特征的方法及装置
WO2016045106A1 (zh) * 2014-09-26 2016-03-31 深圳华大基因股份有限公司 单细胞染色体的cnv分析方法和检测装置
CN107208153A (zh) * 2015-01-13 2017-09-26 香港中文大学 血浆线粒体dna分析的应用
TWI607332B (zh) * 2016-12-21 2017-12-01 國立臺灣師範大學 Correlation between persistent organic pollutants and microRNAs station
WO2019157791A1 (zh) * 2018-02-14 2019-08-22 南京世和基因生物技术有限公司 一种拷贝数变异的检测方法、装置以及计算机可读介质
CN113362891A (zh) * 2014-09-12 2021-09-07 伊鲁米纳剑桥有限公司 用短读测序数据检测重复扩增
CN115132271A (zh) * 2022-09-01 2022-09-30 北京中仪康卫医疗器械有限公司 一种基于批次内校正的cnv检测方法

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224543A (zh) * 2014-05-30 2016-01-06 国际商业机器公司 用于处理时间序列的方法和装置
CN104745718B (zh) * 2015-04-23 2018-02-16 北京中仪康卫医疗器械有限公司 一种检测人类胚胎染色体微缺失和微重复的方法
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
KR101848438B1 (ko) 2015-10-29 2018-04-13 바이오코아 주식회사 디지털 pcr을 이용한 산전진단 방법
AU2016355983B2 (en) * 2015-11-18 2021-12-23 Sophia Genetics S.A. Methods for detecting copy-number variations in next-generation sequencing
NZ745249A (en) 2016-02-12 2021-07-30 Regeneron Pharma Methods and systems for detection of abnormal karyotypes
CN105760712B (zh) * 2016-03-01 2019-03-26 西安电子科技大学 一种基于新一代测序的拷贝数变异检测方法
EP3967324A1 (en) 2016-07-20 2022-03-16 BioNTech SE Selecting neoepitopes as disease-specific targets for therapy with enhanced efficacy
CN106520940A (zh) * 2016-11-04 2017-03-22 深圳华大基因研究院 一种染色体非整倍体和拷贝数变异检测方法及其应用
US11993811B2 (en) * 2017-01-31 2024-05-28 Myriad Women's Health, Inc. Systems and methods for identifying and quantifying gene copy number variations
WO2018161245A1 (zh) * 2017-03-07 2018-09-13 深圳华大基因研究院 一种染色体变异的检测方法及装置
CN109097457A (zh) * 2017-06-20 2018-12-28 深圳华大智造科技有限公司 确定核酸样本中预定位点突变类型的方法
CA3085739A1 (en) * 2017-12-14 2019-06-20 Ancestry.Com Dna, Llc Detection of deletions and copy number variations in dna sequences
CN109979529B (zh) * 2017-12-28 2021-01-08 北京安诺优达医学检验实验室有限公司 Cnv检测装置
CN109979535B (zh) * 2017-12-28 2021-03-02 浙江安诺优达生物科技有限公司 一种胚胎植入前遗传学筛查装置
CN108256289B (zh) * 2018-01-17 2020-10-16 湖南大地同年生物科技有限公司 一种基于目标区域捕获测序基因组拷贝数变异的方法
KR102036609B1 (ko) * 2018-02-12 2019-10-28 바이오코아 주식회사 디지털 pcr을 이용한 산전진단 방법
CN108415886B (zh) * 2018-03-07 2019-04-05 清华大学 一种基于生产工序的数据标签纠错方法及装置
CN108664766B (zh) * 2018-05-18 2020-01-31 广州金域医学检验中心有限公司 拷贝数变异的分析方法、分析装置、设备及存储介质
WO2021114139A1 (zh) * 2019-12-11 2021-06-17 深圳华大基因股份有限公司 一种基于血液循环肿瘤dna的拷贝数变异检测方法和装置
CN111261225B (zh) * 2020-02-06 2022-08-16 西安交通大学 一种基于二代测序数据的反转相关复杂变异检测方法
CN113496761B (zh) * 2020-04-03 2023-09-19 深圳华大生命科学研究院 确定核酸样本中cnv的方法、装置及应用
DE102020116178A1 (de) * 2020-06-18 2021-12-23 Analytik Jena Gmbh Verfahren zum Erkennen einer Amplifikationsphase in einer Amplifikation
CN111968701B (zh) * 2020-08-27 2022-10-04 北京吉因加科技有限公司 检测指定基因组区域体细胞拷贝数变异的方法和装置
CN114220481B (zh) * 2021-11-25 2023-09-08 深圳思勤医疗科技有限公司 基于全基因组测序完成待测样本的核型分析的方法、系统和计算机可读介质
CN114999573B (zh) * 2022-04-14 2023-07-07 哈尔滨因极科技有限公司 一种基因组变异检测方法及检测系统
CN114758720B (zh) * 2022-06-14 2022-09-02 北京贝瑞和康生物技术有限公司 用于检测拷贝数变异的方法、设备和介质
CN114864000B (zh) * 2022-07-05 2022-09-09 北京大学第三医院(北京大学第三临床医学院) 一种动态鉴定人类单细胞染色体拷贝数的方法
CN116386718B (zh) * 2023-05-30 2023-08-01 北京华宇亿康生物工程技术有限公司 检测拷贝数变异的方法、设备和介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004044225A2 (en) * 2002-11-11 2004-05-27 Affymetrix, Inc. Methods for identifying dna copy number changes
CN101449161A (zh) * 2006-05-03 2009-06-03 人口诊断公司 评估遗传病缺陷的方法

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7979215B2 (en) * 2007-07-30 2011-07-12 Agilent Technologies, Inc. Methods and systems for evaluating CGH candidate probe nucleic acid sequences
US20120178635A1 (en) * 2009-08-06 2012-07-12 University Of Virginia Patent Foundation Compositions and methods for identifying and detecting sites of translocation and dna fusion junctions
JP2011078409A (ja) * 2009-09-10 2011-04-21 Fujifilm Corp アレイ比較ゲノムハイブリダイゼーション法による核酸変異解析法

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004044225A2 (en) * 2002-11-11 2004-05-27 Affymetrix, Inc. Methods for identifying dna copy number changes
CN101449161A (zh) * 2006-05-03 2009-06-03 人口诊断公司 评估遗传病缺陷的方法

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
METZKER ML: "Sequencing technologies-the next generation", NAT REV GENET., vol. 11, no. 1, January 2010 (2010-01-01), pages 31 - 46
MICAH HAMADY; JEFFREY J WALKER; J KIRK HARRIS ET AL.: "Error-correcting barcoded primers forpyrosequencing hundreds of samples in multiplex", NATURE METHODS, vol. 5, no. 3, March 2008 (2008-03-01)
RUSK, NICOLE: "Cheap Third-Generation Sequencing", NATURE METHODS, vol. 6, no. 4, 1 April 2009 (2009-04-01), pages 2446
See also references of EP2835752A4
WALD, A.; WOLFOWITZ, J.: "The Annals of Mathematical Statistics", vol. 11, 1940, article "On a test whether two samples are from the same population", pages: 147 - 162

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362891A (zh) * 2014-09-12 2021-09-07 伊鲁米纳剑桥有限公司 用短读测序数据检测重复扩增
CN106795551B (zh) * 2014-09-26 2020-11-20 深圳华大基因股份有限公司 单细胞染色体的cnv分析方法和检测装置
WO2016045106A1 (zh) * 2014-09-26 2016-03-31 深圳华大基因股份有限公司 单细胞染色体的cnv分析方法和检测装置
CN106795551A (zh) * 2014-09-26 2017-05-31 深圳华大基因股份有限公司 单细胞染色体的cnv分析方法和检测装置
CN107208153A (zh) * 2015-01-13 2017-09-26 香港中文大学 血浆线粒体dna分析的应用
US11242559B2 (en) 2015-01-13 2022-02-08 The Chinese University Of Hong Kong Method of nuclear DNA and mitochondrial DNA analysis
CN104560697A (zh) * 2015-01-26 2015-04-29 上海美吉生物医药科技有限公司 一种基因组拷贝数不稳定性的检测装置
CN104694384B (zh) * 2015-03-20 2017-02-08 上海美吉生物医药科技有限公司 线粒体dna拷贝数变异性的检测装置
CN104694384A (zh) * 2015-03-20 2015-06-10 上海美吉生物医药科技有限公司 线粒体dna拷贝数变异性的检测装置
CN105243299A (zh) * 2015-09-30 2016-01-13 深圳华大基因科技服务有限公司 一种检测cnv的精确断点及断点周围特征的方法及装置
TWI607332B (zh) * 2016-12-21 2017-12-01 國立臺灣師範大學 Correlation between persistent organic pollutants and microRNAs station
WO2019157791A1 (zh) * 2018-02-14 2019-08-22 南京世和基因生物技术有限公司 一种拷贝数变异的检测方法、装置以及计算机可读介质
CN115132271A (zh) * 2022-09-01 2022-09-30 北京中仪康卫医疗器械有限公司 一种基于批次内校正的cnv检测方法
CN115132271B (zh) * 2022-09-01 2023-07-04 北京中仪康卫医疗器械有限公司 一种基于批次内校正的cnv检测方法

Also Published As

Publication number Publication date
US11371074B2 (en) 2022-06-28
AU2012376134A1 (en) 2014-11-06
RU2014144349A (ru) 2016-05-27
EP2835752A4 (en) 2015-11-18
CN104221022A (zh) 2014-12-17
AU2012376134B2 (en) 2016-03-03
US20180148765A1 (en) 2018-05-31
JP2015512264A (ja) 2015-04-27
CN104221022B (zh) 2017-11-21
US20150056619A1 (en) 2015-02-26
KR101795124B1 (ko) 2017-12-01
EP2835752A1 (en) 2015-02-11
SG11201406250SA (en) 2014-11-27
JP5972448B2 (ja) 2016-08-17
EP2835752B8 (en) 2018-12-26
KR20140140122A (ko) 2014-12-08
IL234875B (en) 2019-03-31
EP2835752B1 (en) 2018-09-19

Similar Documents

Publication Publication Date Title
WO2013149385A1 (zh) 一种拷贝数变异检测方法和系统
AU2019250200B2 (en) Error Suppression In Sequenced DNA Fragments Using Redundant Reads With Unique Molecular Indices (UMIs)
EP3191993B1 (en) Detecting repeat expansions with short read sequencing data
CN106715711B (zh) 确定探针序列的方法和基因组结构变异的检测方法
JP2020108377A (ja) 非侵襲的に胎児の性染色体異数性のリスクを計算する方法
WO2013097062A1 (zh) 一种遗传变异检测方法
US20200286586A1 (en) Sequence-graph based tool for determining variation in short tandem repeat regions
US20220254442A1 (en) Methods and systems for visualizing short reads in repetitive regions of the genome
RU2825664C2 (ru) Инструмент на основе графов последовательностей для определения вариаций в областях коротких тандемных повторов
RU2799654C2 (ru) Инструмент на основе графов последовательностей для определения вариаций в областях коротких тандемных повторов
TWI564742B (zh) Methods for determining the aneuploidy of fetal chromosomes, systems and computer-readable media
WO2014153755A1 (zh) 确定胎儿染色体非整倍性的方法、系统和计算机可读介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12873786

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 14389898

Country of ref document: US

ENP Entry into the national phase

Ref document number: 2015503724

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20147031062

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2014144349

Country of ref document: RU

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2012873786

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2012376134

Country of ref document: AU

Date of ref document: 20120405

Kind code of ref document: A