WO2016090583A1 - Device and method for sequencing data processing - Google Patents

Info

Publication number
WO2016090583A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
sequence
result
read
reads
Prior art date
Application number
PCT/CN2014/093511
Other languages
French (fr)
Chinese (zh)
Inventor
刘敬一
刘兴民
刘耿
赵鑫
杨明
侯勇
吴逵
李波
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院
Priority to CN201480082793.4A (patent CN107077533B)
Priority to PCT/CN2014/093511 (patent WO2016090583A1)
Publication of WO2016090583A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection

Definitions

  • the present invention relates to the field of bioinformatics. Specifically, the present invention relates to the processing of sequencing data, and more particularly to a sequencing data processing apparatus, a sequencing data processing system, and a method for processing sequencing data.
  • cfDNA (cell-free DNA), which is present in serum, plasma and other body fluids, is an effective biomarker that can be applied to detecting a variety of variations, such as those in cancer, fetal chromosomal abnormalities and other genetic mutations. Because quantitative analysis techniques with high sensitivity and accuracy were lacking, previous studies focused on a number of known disease-related genes or regions, such as the melanoma-associated GNAQ gene (Metz, Stephan HD, et al. "Ultradeep sequencing detects GNAQ and GNA11 mutations in cell-free DNA from plasma of patients with uveal melanoma." Cancer Medicine 2.2 (2013): 208-215.), trisomy 21 (Liao, Gary JW, et al. "Noninvasive prenatal diagnosis of fetal trisomy 21 by allelic ratio analysis using targeted massively parallel sequencing of maternal plasma DNA." PLoS One 7.5 (2012): e38154.) and the like.
  • MPS Massively Parallel Sequencing
  • CNV Copy-Number Variations
  • Copy number variation is an important biomarker for many human diseases (such as cancer, hereditary diseases and cardiovascular diseases) and has become a research focus for many of them.
  • the detection of copy number variation in tumors can reveal the loss or doubling of tumor DNA throughout the genome.
  • CGH comparative genomic hybridization
  • ROMA representative oligonucleotide microarray analysis
  • These platforms have insufficient detection capabilities for small CNVs (below 20 kb), and have problems such as cumbersome operations and high costs.
  • the present invention is directed to solving at least some of the above technical problems or at least providing a commercial choice.
  • the present invention provides a sequencing data processing apparatus, the apparatus comprising: a data receiving unit configured to receive the sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment; the two reads of each read pair come from the positive strand and the negative strand of the chromosome fragment, respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment; each read contains a gap, and the two reads of a read pair are defined as a left arm and a right arm, respectively;
  • a processor for executing a data processing program, wherein executing the data processing program includes aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or alignment results of a plurality of the left arms and a plurality of the right arms.
  • the read pairs from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library.
  • multiple read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform.
  • the distance between the two reads of a read pair is controlled by the read length and by the distance between the enzyme's recognition site and its cleavage site.
  • the CG platform constructs a multi-linker paired-end library by enzymatic cleavage and sequences the constructed circular library with its combinatorial probe-anchor ligation (cPAL) sequencing technique.
  • cPAL combinatorial probe-anchor ligation sequencing
  • the bases on both sides of each linker are read. Because the two segments of a linker are attached after restriction enzyme digestion to construct the paired-end library, and each enzyme has a preferred cutting distance but in practice often cuts one position more or one position less than that distance, the reads often carry a gap, typically +1 or -1. In addition, if the same enzyme is used for multiple digestions during library construction, the cutting position shifts more easily and the resulting reads carry gaps; for example, when constructing a multi-linker circular library, the Alu enzyme is used for two digestions to join different portions of the linkers, and reading the bases adjacent to the linkers yields reads with gaps of +3/-3.
  • the size of the gap in the present invention may also be zero.
  • the 2-AD sequencing output has a total length of 60 bp and can be divided into two mate-paired read pairs; in each read pair, the reads have a small gap at around the 10 bp position and an invalid sequencing site N at the 20 bp position, and the distance between the two reads of a pair is generally less than 2000 bp.
  • the terms "positive strand" and "negative strand" as used herein refer to the two complementary strands that constitute a chromosome fragment, and are relative: if one strand is called the positive strand, its complementary strand is the negative strand. In an embodiment of the present invention, the strand that matches the reference sequence is referred to as the positive strand, and the other strand as the negative strand.
  • the alignment can be performed using known alignment software, such as SOAP or BWA, or using TeraMap, the alignment software of the CG platform.
  • when the alignment is performed using TeraMap, the resulting alignment result is in TeraMap format.
  • eliminating the gap of each read in the alignment result means that, for a read with a negative gap, the negative gap is removed, that is, the overlapping bases are removed, and for a read with a positive gap, the positive gap is filled with as many N as the size of the gap, where N is A, T, C or G.
  • for example, a read with a negative gap such as -2 nt can be divided into two parts based on the gap, and the adjoining ends of the two parts overlap by 2 nt. If the two parts of the read are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, gives the read ATCGCTTAAGTACGATTC.
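The gap elimination described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `eliminate_gap` and its arguments are hypothetical, and keeping the first part's copy of the overlapping bases is an assumption (the text only says the overlap is removed).

```python
def eliminate_gap(first_part: str, second_part: str, gap: int) -> str:
    """Join the two parts of a read, removing a negative gap or padding a positive one."""
    if gap < 0:
        overlap = -gap
        # Negative gap: the parts overlap by `overlap` bases, so keep a single copy
        # (assumption: the copy carried by the first part is the one retained).
        return first_part + second_part[overlap:]
    # Zero or positive gap: fill the unread positions with the placeholder base N.
    return first_part + "N" * gap + second_part


print(eliminate_gap("ATCGCTTAAG", "AGTACGATTC", -2))  # ATCGCTTAAGTACGATTC
print(eliminate_gap("ATCG", "GGTA", 3))               # ATCGNNNGGTA
```

The first call reproduces the -2 nt example from the text; the second shows a positive gap of 3 nt being replaced by three N.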
  • the aligning in the method of one aspect of the invention comprises: aligning the left arm and the right arm of each read pair to the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; and using one of the first-level left alignment result and the first-level right alignment result as a baseline and aligning the other arm against it, to obtain a second-level left alignment result and a second-level right alignment result.
  • from these, the read-pair alignment result can be obtained.
  • the first alignment is a global alignment against the reference sequence.
  • the second alignment, which aligns the right arm/left arm using the left arm/right arm alignment result as the baseline, is a local alignment. Alignments from the second-level left alignment result and the second-level right alignment result that fall on the same chromosome, at a distance matching the expected distance between the two reads of a pair, are then paired into a read pair, which gives the read-pair alignment result.
  • the aligning comprises: setting the size of the gap so that each left arm or each right arm is aligned with the reference sequence multiple times, to obtain an optimal alignment result.
  • the gap of each left arm or each right arm is set in turn to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt and 7 nt, giving a corresponding plurality of reads.
  • each of these reads is aligned with the reference sequence, and the best-aligned sequence is used as the left arm/right arm, where the quality of an alignment may be judged by the default scoring of the alignment software used.
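The gap scan above can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: it replaces a real aligner (SOAP, BWA or TeraMap) with a brute-force mismatch count, treats N as matching any base, and breaks ties by preferring the smallest absolute gap. The names `candidate_reads`, `best_hit` and `best_gap` are hypothetical.

```python
def candidate_reads(part_a, part_b, gaps=range(-3, 8)):
    """Build one candidate read per assumed gap size (-3 nt to 7 nt)."""
    out = {}
    for gap in gaps:
        if gap < 0:
            out[gap] = part_a + part_b[-gap:]        # drop the overlapping bases
        else:
            out[gap] = part_a + "N" * gap + part_b   # pad with N
    return out


def best_hit(read, reference):
    """Toy aligner: return (mismatches, position) of the best ungapped placement."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for r, w in zip(read, window) if r != "N" and r != w)
        if best is None or mismatches < best[0]:
            best = (mismatches, pos)
    return best  # None when the read is longer than the reference


def best_gap(part_a, part_b, reference):
    """Return (gap, position, mismatches) of the best-scoring candidate, or None."""
    scored = []
    for gap, read in candidate_reads(part_a, part_b).items():
        hit = best_hit(read, reference)
        if hit is not None:
            scored.append((hit[0], abs(gap), gap, hit[1]))
    if not scored:
        return None
    mismatches, _, gap, pos = min(scored)
    return gap, pos, mismatches


reference = "GGGATCGCTTAAGTACGATTCCCC"
print(best_gap("ATCGCTTAAG", "AGTACGATTC", reference))  # (-2, 3, 0)
```

In practice the scoring would come from the chosen alignment software's own default evaluation, as the text notes; the mismatch count here only makes the selection step concrete.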
  • executing the data processing program further includes, before eliminating the gap of each read in the alignment result, extracting the unique alignment results from the alignment result to replace the alignment result. The unique alignment results comprise a plurality of read pairs that align uniquely to the reference sequence, where both reads of each pair align to the same chromosome of the reference sequence and the distance between the two reads of each pair matches the expected distance between the two positions of the chromosome fragment from which they came.
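The three conditions for a unique alignment result (unique hit, same chromosome, expected distance) can be written as a simple filter. This is a sketch under assumptions: the `ArmHit` record and its fields are hypothetical, and the default distance bounds are placeholders to be set from the library design rather than values prescribed by the text.

```python
from dataclasses import dataclass


@dataclass
class ArmHit:
    chrom: str   # chromosome the arm aligned to
    pos: int     # leftmost reference position of the alignment
    n_hits: int  # number of reference locations the arm aligned to


def is_unique_pair(left: ArmHit, right: ArmHit, min_dist: int = 0, max_dist: int = 2000) -> bool:
    """Keep a pair only if both arms are unique, co-chromosomal and suitably spaced."""
    if left.n_hits != 1 or right.n_hits != 1:
        return False
    if left.chrom != right.chrom:
        return False
    return min_dist <= abs(right.pos - left.pos) <= max_dist


def extract_unique(pairs, **bounds):
    """Return the read pairs that pass is_unique_pair."""
    return [(l, r) for l, r in pairs if is_unique_pair(l, r, **bounds)]


pairs = [
    (ArmHit("chr1", 1000, 1), ArmHit("chr1", 1140, 1)),  # kept
    (ArmHit("chr1", 1000, 1), ArmHit("chr2", 1140, 1)),  # different chromosomes
    (ArmHit("chr1", 1000, 3), ArmHit("chr1", 1140, 1)),  # left arm is multi-mapped
]
print(len(extract_unique(pairs)))  # 1
```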
  • executing the data processing program further comprises correcting each read pair in the unique alignment results so that both reads align to the positive strand of the same chromosome of the reference sequence. For example, for a read pair whose reads align to the positive strand and the negative strand of a chromosome respectively, the read aligned to the negative strand is replaced with its reverse complement, which achieves the correction.
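The strand correction amounts to a reverse-complement operation on reads aligned to the negative strand. A minimal sketch follows; the function names and the "+"/"-" strand labels are illustrative assumptions.

```python
# Translation table for complementing bases; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def correct_to_positive(read: str, strand: str) -> str:
    """Express a read on the positive strand; `strand` is '+' or '-'."""
    return reverse_complement(read) if strand == "-" else read


print(correct_to_positive("ATCGN", "-"))  # NCGAT
print(correct_to_positive("ATCGN", "+"))  # ATCGN
```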
  • executing the data processing program further comprises implementing a data format conversion, the data format conversion comprising converting the alignment result or the format of the unique alignment result.
  • the format of the universal alignment result is required to be SAM or BAM, so as to facilitate subsequent analysis of the data based on the alignment result or the unique alignment result.
  • SAM and BAM are common alignment formats: SAM is a text format, and BAM is the compressed binary form of SAM.
  • Because different alignment software is used, the format of the output alignment result or unique alignment result may not suit existing downstream data processing or analysis software. For example, the output format of the aforementioned TeraMap alignment result does not meet the input format requirements of most existing variant detection software, such as SOAPsnp, GATK or SOAPindel. Converting the data format yields a universal alignment result in a common data format, which facilitates further analysis and processing of the data.
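To make the target of the format conversion concrete, the sketch below writes one gap-eliminated read as a SAM record with the eleven mandatory tab-separated fields. The input arguments are hypothetical, since the TeraMap layout is not described here. The sketch also assumes an ungapped match CIGAR, no mate information, and unknown mapping quality and base qualities.

```python
def sam_header(ref_lengths: dict) -> str:
    """Minimal SAM header: one @HD line plus one @SQ line per reference sequence."""
    lines = ["@HD\tVN:1.6\tSO:unsorted"]
    lines += [f"@SQ\tSN:{name}\tLN:{length}" for name, length in ref_lengths.items()]
    return "\n".join(lines)


def to_sam_line(name: str, chrom: str, pos: int, seq: str, reverse: bool = False) -> str:
    """Format one aligned read as a SAM record (pos is 1-based).

    For a read aligned to the negative strand, `seq` must already be the
    reverse complement, i.e. expressed on the positive reference strand.
    """
    flag = 16 if reverse else 0   # 0x10 marks a reverse-strand alignment
    fields = [
        name, str(flag), chrom, str(pos),
        "255",                    # MAPQ 255: mapping quality not available
        f"{len(seq)}M",           # CIGAR: every base aligned, no indels (assumption)
        "*", "0", "0",            # RNEXT, PNEXT, TLEN: no mate information
        seq,
        "*",                      # QUAL: base qualities not stored
    ]
    return "\t".join(fields)


print(sam_header({"chr1": 1000}))
print(to_sam_line("read1", "chr1", 101, "ATCGCTTAAGTACGATTC"))
```

A text SAM file written this way can then be compressed to BAM with standard tooling before it is passed to downstream variant detection software.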
  • a sequencing data processing system comprising a host and a display, the system further comprising a sequencing data processing device in accordance with one or any embodiment of the present invention.
  • a method for processing sequencing data, comprising the steps of: acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment, where the two reads of each read pair come from the positive strand and the negative strand of the chromosome fragment, respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment, each read comprises a gap, and the two reads of a read pair are defined as a left arm and a right arm, respectively; aligning the sequencing data with a reference sequence to obtain an alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or alignment results of a plurality of the left arms and a plurality of the right arms; and eliminating the gap of each read in the alignment result to obtain a universal alignment result.
  • the read pairs from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing it.
  • in one embodiment of the present invention, multiple read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform, and the distance between the two reads of a read pair is controlled by the read length and by the distance between the enzyme's recognition site and its cleavage site.
  • CG Complete Genomics
  • the size of the gap in the present invention may also be zero. Among the plurality of reads from a multi-linker library, any one read can form a read pair with any other read.
  • eliminating the gap of each read in the alignment result means that, for a read with a negative gap, the negative gap is removed, that is, the overlapping bases are removed, and for a read with a positive gap, the positive gap is filled with as many N as the size of the gap, where N is A, T, C or G. For example, a read with a negative gap such as -2 nt can be divided into two parts based on the gap, and the adjoining ends of the two parts overlap by 2 nt. If the two parts of the read are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, gives the read ATCGCTTAAGTACGATTC.
  • obtaining the sequencing data comprises constructing a sequencing library, the sequencing library being a single-stranded circular DNA library composed of one strand of the chromosome fragment and at least one predetermined DNA sequence.
  • the single-stranded circular library can be constructed by a known library construction method, for example, by constructing a single-linker circular double-stranded library with reference to the paired-end library construction of the SOLiD platform of Life Technologies, and then separating the two strands to obtain a single-stranded circular library.
  • the single-stranded circular library is constructed using the CG library construction technique; for the library construction, see US7897344, which yields a multi-linker single-stranded circular library.
  • each pair of reads is from both ends of the chromosome segment.
  • the two parts of a linker are ligated to the two ends of a chromosome fragment, and the product is separated into single strands and circularized to obtain a 1-linker single-stranded circular library, which consists of one strand of the chromosome fragment and a predetermined DNA sequence joining the two ends of that strand.
  • the library is amplified by rolling circle amplification to form DNA nanoballs (DNBs), the DNBs are loaded onto a chip, and the DNBs are sequenced with the cPAL technology of the CG platform. For loading DNBs onto a chip and for cPAL, see US8278039B2 and US8518640B2, respectively.
  • the predetermined DNA sequence is a known sequence and is a link of the aforementioned linker or linker.
  • the improved CG library construction method constructs a 1-linker circular single-strand library by the steps of: (1) extracting the nucleic acid to be tested; (2) phosphorylating the ends of the nucleic acid to obtain a terminal phosphorylation product; (3) end-repairing the terminal phosphorylation product to obtain an end repair product; (4) ligating a first sequence and a second sequence to the two ends of the end repair product to obtain a ligation product; (5) performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) using the biotin label to perform single-strand separation on the amplification product to obtain a single-stranded product; (7) circularizing the single-stranded product with a fourth sequence to obtain the sequencing library.
  • the fourth sequence is capable of joining the first sequence and the second sequence to form the linker. Nick translation eliminates the nick caused by the dideoxynucleotide at the other end of the first sequence and/or the second sequence attached to the ends of the end repair product. Using at least one biotin-labeled primer gives at least one strand of the amplification product a biotin label, so that the single-stranded product can easily be separated on the basis of that label.
  • the improved CG library construction method constructs a 1-linker circular single-strand library by the steps of: (1) extracting the nucleic acid to be tested; (2) end-repairing the nucleic acid to obtain an end repair product; (3) phosphorylating the ends of the end repair product to obtain a terminal phosphorylation product; (4) ligating a first sequence and a second sequence to the two ends of the terminal phosphorylation product to obtain a ligation product; (5) performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) using the biotin label to perform single-strand separation on the amplification product to obtain a single-stranded product; (7) circularizing the single-stranded product with a fourth sequence to obtain the sequencing library; wherein the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
  • End repair yields blunt-ended nucleic acid fragments to which other nucleotides or sequences can be attached.
  • Terminal phosphorylation reduces the ligation of sample nucleic acid fragments to one another, so that samples with low nucleic acid content can still be built into a library that meets the library requirements.
  • The single-linker circular single-strand library is shown in Figure 1. The constructed single-linker circular single-strand library (1-AD) is sequenced on the machine; each 1-AD read pair has a total length of about 30 bp, with one read of 12 bp and the other of 19 bp, and the median genomic distance between the two reads of a pair is about 140 bp.
  • single-linker library construction needs little starting material, which suits samples with low cfDNA content, and it also has the advantages of short construction time and low construction cost.
  • the aligning in the method of the invention comprises: aligning the left arm and the right arm of each read pair to the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; using one of the first-level left alignment result and the first-level right alignment result as a baseline and aligning the other arm against it, to obtain a second-level left alignment result and a second-level right alignment result; and, based on the second-level left alignment result and the second-level right alignment result, obtaining the alignment results of a plurality of the read pairs, or the alignment results of a plurality of the left arms and of the right arms.
  • the aligning includes arranging the gaps such that each left or each right arm is compared with the reference sequence multiple times to obtain an optimal alignment result.
  • a computer readable storage medium for storing a program for execution by a computer, the execution of the program comprising performing the sequencing data processing method of the aforementioned aspect of the invention or of any one of its embodiments.
  • the foregoing description of the advantages and technical features of the sequencing data processing method of the present invention is also applicable to the computer readable storage medium, and details are not described herein again.
  • the storage medium may include: a read only memory, a random access memory, a magnetic disk or an optical disk, and the like.
  • the present invention provides a method for detecting copy number variation (CNV), the method comprising: a. acquiring a nucleic acid of a sample to be tested; b. sequencing the nucleic acid to obtain sequencing data; c. processing the sequencing data to obtain a universal alignment result; d. detecting the CNV based on the universal alignment result; wherein step c is performed using the sequencing data processing device and/or method of one aspect of the invention or of any of its specific embodiments.
  • CNV copy number variation
  • step b includes constructing a sequencing library from the nucleic acid, the sequencing library being a single-stranded circular DNA library, and the construction of the single-stranded circular DNA library comprises: phosphorylating the ends of the nucleic acid to obtain a terminal phosphorylation product; end-repairing the terminal phosphorylation product to obtain an end repair product; ligating a first sequence and a second sequence to the two ends of the end repair product to obtain a ligation product; performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; performing single-strand separation on the amplification product to obtain a single-stranded product; and circularizing the single-stranded product with a fourth sequence to obtain the sequencing library; wherein the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
  • end repair is performed followed by terminal phosphorylation.
  • The single-linker circular single-stranded library is shown in Figure 1. Single-linker library construction needs little starting material, which suits samples with low cfDNA content, and it also has the advantages of short construction time and low construction cost.
  • the fourth sequence is capable of joining the first sequence and the second sequence to form the linker. Nick translation eliminates the nick caused by the dideoxynucleotide at the other end of the first sequence and/or the second sequence attached to the ends of the end repair product, and using at least one biotin-labeled primer gives at least one strand of the amplification product a biotin label, so that the single-stranded product can easily be separated on the basis of that label.
  • sequencing of the constructed library is performed using a combinatorial probe anchor ligation sequencing technique, such as using a CG sequencing platform.
  • the detection of CNV based on the general comparison result can utilize the currently known CNV detection methods, such as using hidden Markov model, circular binary segmentation, hierarchical segmentation or kernel smoothing algorithm.
  • step d includes: setting a plurality of windows on the reference sequence, each window being a part of the reference sequence; and, when the amount of reads matching a window in the universal alignment result differs significantly from the amount of reads matching the same window in the universal alignment result of a control sample, determining that the CNV is present in the nucleic acid of the sample to be tested.
  • the size of the window can be adjusted according to the size of the CNV to be detected, and the universal alignment result of the control sample can be obtained by the sequencing data processing method of one aspect of the present invention or of any of its specific embodiments. Whether the difference is significant can be assessed with a z-score (standard score).
  • the predetermined threshold is 3, that is, when the absolute value of z is greater than 3, it is determined that a CNV occurs in the window.
  • the amount of the read segment may be a number or a ratio.
  • the z-score standard score
  • the sequencing depth of a window = the amount of reads aligned to the window / the size of the window.
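The window depth and the z-score test can be put together as follows. This is a sketch under stated assumptions: the text names the z-score and the threshold of 3 but does not spell out the formula, so the usual standard score against the control samples is assumed, with the sample standard deviation. The function names are hypothetical.

```python
from statistics import mean, stdev


def window_depth(read_count: int, window_size: int) -> float:
    """Sequencing depth of a window = reads aligned to the window / window size."""
    return read_count / window_size


def window_z(sample_depth: float, control_depths: list) -> float:
    """Standard score of the sample depth relative to the control samples (assumed form)."""
    mu = mean(control_depths)
    sigma = stdev(control_depths)  # sample standard deviation; needs >= 2 controls
    if sigma == 0:
        raise ValueError("control depths have zero variance")
    return (sample_depth - mu) / sigma


def has_cnv(z: float, threshold: float = 3.0) -> bool:
    """Call a CNV in the window when |z| exceeds the predetermined threshold."""
    return abs(z) > threshold


controls = [1.0, 1.2, 0.8, 1.0]  # illustrative depths, not real data
print(has_cnv(window_z(2.0, controls)))  # True
print(has_cnv(window_z(1.1, controls)))  # False
```

With real data the control list would hold one depth per control sample for the same window, ideally after GC correction.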
  • the GC content in the reads during the actual sequencing process will have a certain effect on the depth of sequencing [Alkan, Can, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci , Fereydoun Hormozdiari, Jacob O Kitzman, et al. "Personalized Copy Number and Segmental Duplication Maps Using next-Generation Sequencing.” Nature Genetics 41, no. 10 (October 2009): 1061–67], first performing GC content correction, eliminating GC The effect of the content on the depth of sequencing.
  • the GC content correction can use the sequencing data of multiple control samples: the GC content and the average sequencing depth are calculated for multiple windows, and a two-dimensional regression analysis is performed on the GC content versus sequencing depth data, for example using locally weighted scatterplot smoothing (lowess regression), to establish the relationship between the two; the sequencing depth of each window is then corrected according to this regression relationship.
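The GC correction described above can be sketched as follows. This is only an illustration under stated assumptions: it uses a simplified, single-pass locally weighted linear regression with tricube weights (no robustness iterations) as a stand-in for lowess, and it rescales each window depth by the ratio of the overall mean depth to the depth expected at that window's GC content. The patent does not specify the correction formula, and the names `lowess_fit`, `gc_correct` and `frac` are hypothetical.

```python
def lowess_fit(x, y, frac=0.5):
    """Fit y against x by local weighted linear regression at every x value."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours defining the bandwidth
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights; points at or beyond the bandwidth get weight 0.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def gc_correct(gc, depth, frac=0.5):
    """Rescale window depths so that the GC-dependent trend is removed.

    Assumption: corrected = observed * overall_mean / expected_at_this_GC.
    """
    expected = lowess_fit(gc, depth, frac)
    overall = sum(depth) / len(depth)
    return [d * overall / e if e > 0 else d for d, e in zip(depth, expected)]

# Toy data: depth rises linearly with GC content, so correction flattens it.
gc = [0.30 + 0.02 * i for i in range(10)]
depth = [10 + 20 * g for g in gc]
corrected = gc_correct(gc, depth)
```

In practice the regression would be fitted on windows from the control samples, and a library implementation of lowess could replace the hand-written fit.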
  • the relationship between the sequencing depth and the GC content can be established by: obtaining sequencing data of the nucleic acids of a plurality of control samples, the sequencing data being composed of a plurality of reads; setting a plurality of windows on the reference sequence; aligning the sequencing data of each control sample with the windows of the reference sequence; and calculating the sequencing depth and GC content of each window for each control sample.
  • the number of the aforementioned control samples is not less than 30. With at least 30 samples, the sample data can be treated as following a specific distribution and are therefore suitable for most statistical tests, for example the t test or z test, which generally require the sample data to conform to a normal distribution.
  • the sequencing data, alignment results and the like of the foregoing control samples can be obtained by the sequencing data processing method of one aspect of the present invention or of any of its specific embodiments; they can be obtained at the same time as the sequencing data and alignment result of the sample to be tested, or obtained in advance and saved for later use.
  • the present invention provides a CNV detecting apparatus for performing all or part of the steps of the CNV detecting method of one aspect of the present invention, the apparatus comprising: a nucleic acid acquiring device for acquiring the nucleic acid of a sample to be tested; a sequencing device for sequencing the nucleic acid from the nucleic acid acquiring device to obtain sequencing data; a data processing device for processing the sequencing data from the sequencing device to obtain a universal alignment result; and a detecting device for detecting the CNV based on the universal alignment result from the data processing device; wherein the data processing device comprises: a data receiving unit for receiving the sequencing data from the sequencing device, the sequencing data including a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment, the two reads of each read pair coming from the positive and negative strands of the chromosome fragment respectively, or both from the positive strand or both from the negative strand, each read including a gap, and the two reads of a read pair being defined as the left arm and the right arm respectively; a processor for executing a data processing program, execution of which includes aligning the sequencing data with the reference sequence to obtain an alignment result and eliminating the gap of each read in the alignment result to obtain the universal alignment result, the alignment result comprising alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and of a plurality of the right arms; and at least one storage unit for storing data, including the data processing program.
  • the foregoing description of the advantages and technical features of the CNV detecting method in any of its specific embodiments also applies to the CNV detecting device of this aspect of the present invention and is not repeated here. Those skilled in the art will understand that all or some of the units of this device may optionally and detachably include one or more subunits to perform or implement the various embodiments of the aforementioned CNV detection method of the present invention.
  • Sequencing data are obtained by single-linker sequencing on the CG platform, which is cheaper and faster.
  • the TeraMap2Sam conversion software was developed to convert the TeraMap alignment results of the CG platform into the common SAM format, so that widely used open-source software such as Samtools and GATK can be used directly for mutation detection.
  • the CNV detection program developed from the CNV detection method and/or device of the present invention performs CNV analysis based on the standard score (z-score) method, and is fast and of high resolution.
  • FIG. 1 is a schematic view showing the structure of a single-linker circular single-stranded library in one embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a sequencing data processing apparatus in an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a sequencing data processing system in an embodiment of the present invention.
  • FIG. 4 is a flow chart of a method for processing sequencing data in an embodiment of the present invention.
  • Figure 5 is a flow chart showing a method of processing sequencing data in an embodiment of the present invention.
  • FIG. 6 is a flow chart of a CNV detecting method in an embodiment of the present invention.
  • FIG. 7 is a flow chart of a CNV detecting method in an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a CNV detecting apparatus in an embodiment of the present invention.
  • Figure 9 is a flow diagram showing the construction and sequencing of a single linker library in one embodiment of the present invention.
  • Figure 10 is a flow chart of the algorithm of the Teramap2Sam software in one embodiment of the present invention.
  • the processing device 100 includes a data receiving unit 10, a processor 20, and a storage unit 30.
  • the processor 20 is connected to the data receiving unit 10 and the storage unit 30, and the storage unit 30 is connected to the data receiving unit 10.
  • the data receiving unit 10 is configured to receive sequencing data, where the sequencing data include multiple read pairs; each read pair consists of two reads that come from two positions of a chromosome fragment; the two reads of each read pair come from the positive and negative strands of the chromosome fragment respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment; each read contains a gap, and the two reads of a read pair are defined as the left arm and the right arm respectively.
  • the read pairs coming from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library.
  • multiple pairs of read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform.
  • the distance between the two reads of a read pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme.
  • the CG platform constructs a multi-linker paired-end library by enzymatic cleavage, and the constructed circular library is sequenced by its combinatorial probe-anchor ligation (cPAL) sequencing technique.
  • the fragments on the two sides of a linker are used to construct the paired-end library. Each enzyme has a preferred cutting distance, but in actual digestion the cut often falls one position further or one position nearer than the preferred distance, so the reads often carry a gap, typically +1 or -1. In addition or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position shifts easily, which also leaves the resulting reads with a gap. For example, when a multi-linker circular library is constructed, the Alu enzyme is used for two digestions to join different portions of the plurality of linkers, and reading the bases adjacent to these linkers produces reads with a gap of +3/-3.
  • the size of the gap in the present invention may also be zero.
  • the 2-AD sequencing output has a total length of 60 bp and can be divided into two mate-paired read pairs; each read has a small gap at the 10 bp position and an invalid sequencing site N at the 20 bp position, and the distance between the two reads of a read pair is generally less than 2000 bp. Among the multiple reads obtained from a multi-linker library, any read can form a read pair with any other read.
  • the positive strand and the negative strand are the two complementary strands constituting a chromosome fragment, and the terms are relative: when one strand is called the positive strand, its complementary strand may be called the negative strand. In an embodiment of the present invention, the strand that matches the reference sequence is referred to as the positive strand, and the other strand is referred to as the negative strand.
  • the processor 20 is configured to execute a data processing program, execution of which includes: aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result, wherein the alignment result comprises alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and of a plurality of the right arms.
  • the comparison can be performed by using known comparison software, such as SOAP, BWA, etc., or by using the comparison software TeraMap of the CG platform.
  • the alignment is performed using TeraMap, and the resulting alignment result is in the TeraMap format.
  • eliminating the gap of each read in the alignment result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the positive gap with as many N as the size of the gap, where N is A, T, C or G; a read with a gap of 0 is not processed.
  • based on the gap, a read can be divided into two parts. For example, when the ends of the two parts overlap by 2 nt and the two parts of the read are ATCGCTTAAG and AGTACGATTC respectively, eliminating the negative gap, that is, the overlapping AG, gives the corresponding read ATCGCTTAAGTACGATTC.
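The gap elimination rule can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `eliminate_gap` is hypothetical, a negative gap is assumed to mean that the start of the second part repeats the end of the first part, and a positive gap is filled with the literal placeholder character `N`.

```python
def eliminate_gap(part1, part2, gap):
    """Join the two parts of a read, given the gap between them.

    gap < 0: the parts overlap by -gap bases; the repeated bases are dropped
             from the start of part2.
    gap > 0: the missing bases are filled with 'N' placeholders.
    gap == 0: the parts are simply concatenated.
    """
    if gap < 0:
        return part1 + part2[-gap:]
    return part1 + "N" * gap + part2

# The example from the text: a 2 nt overlap (gap of -2) on the bases AG.
merged = eliminate_gap("ATCGCTTAAG", "AGTACGATTC", -2)
# A hypothetical positive gap of 1 base.
padded = eliminate_gap("ATCG", "GGCA", 1)
```

After this step every read is a single contiguous string, which is what generic tools expect in the SEQ column of an alignment record.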
  • the storage unit 30 is for storing data, and the above-described data processing program is stored in the storage unit 30, and intermediate data or results of the processing of the sequencing data from the data receiving unit 10 and the processor 20 are also stored.
  • FIG. 3 is a block diagram showing the structure of a system in an embodiment of the sequencing data processing system of the present invention.
  • the sequencing data processing system 1000 includes a sequencing data processing device 100, a host 200, and a display device 300.
  • the host 200 can be an audio/video/signal source device, such as a computer host, mainframe, etc., for transmitting display data required by the display device 300.
  • the host 200 includes at least one interface electrically connected to the sequencing data processing device 100.
  • the sequencing data processing device 100 receives the sequencing data output from the host 200, processes the sequencing data, and then outputs the processed data or results to the display device 300.
  • the sequencing data processing method comprises the steps of: S1 acquiring sequencing data, the sequencing data comprising a plurality of pairs of read segments, each pair of read segments consisting of two read segments, respectively, from two positions of a chromosome segment, each pair of reads The two reads of the pair are from the positive and negative strands of the chromosome fragment, respectively, or both reads of each pair of read lengths are from the positive strand of the chromosome fragment or the negative strand of the chromosome fragment, Each read segment includes a gap, and two reads of a pair of read pairs are respectively defined as a left arm and a right arm; S2 compares the sequencing data with a reference sequence to obtain a comparison result, and the comparison result Include a comparison result of a plurality of the pair of read segments, and/or, the comparison result includes a comparison result of the plurality of the left arms and a comparison result of the plurality of the
  • the read pairs coming from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library.
  • multiple read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform, and the distance between the two reads of a read pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme.
  • the CG platform constructs a multi-linker paired-end library by enzymatic cleavage, and the constructed circular library is sequenced by its combinatorial probe-anchor ligation (cPAL) sequencing technique. The bases on both sides of each linker are read, because they were joined to the linker by restriction enzyme digestion.
  • the fragments on the two sides of a linker are used to construct the paired-end library. Each enzyme has a preferred cutting distance, but in actual digestion the cut often falls one position further or one position nearer than the preferred distance, so the reads often carry a gap, typically +1 or -1. In addition or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position shifts easily, which also leaves the resulting reads with a gap. For example, when a multi-linker circular library is constructed, the Alu enzyme is used for two digestions to join different portions of multiple linkers, and reading the bases next to these linkers produces reads with a gap of +3/-3.
  • the size of the gap in the present invention may also be zero. Among the multiple reads obtained from a multi-linker library, any read can form a read pair with any other read.
  • the term "positive strand” and "negative strand” as used herein are complementary two strands constituting a chromosome fragment, and are opposite. When a strand is a positive strand, the complementary strand can be said to be a minus strand.
  • a chain that matches a reference sequence is referred to as a positive chain, and another chain is referred to as a negative chain.
  • the comparison can be performed by using known comparison software, such as SOAP, BWA, etc., or by using the comparison software TeraMap of the CG platform.
  • the alignment is performed using TeraMap, and the resulting alignment result is in the TeraMap format.
  • eliminating the gap of each read in the alignment result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the positive gap with as many N as the size of the gap, where N is A, T, C or G; a read with a gap of 0 is not processed.
  • based on the gap, a read can be divided into two parts. For example, when the ends of the two parts overlap by 2 nt and the two parts of the read are ATCGCTTAAG and AGTACGATTC respectively, eliminating the negative gap, that is, the overlapping AG, gives the corresponding read ATCGCTTAAGTACGATTC.
  • FIG. 5 is a flow chart showing the data processing of one embodiment of the sequencing data processing method of the present invention.
  • the sequencing data processing method comprises: S10 acquiring sequencing data, the sequencing data comprising a plurality of pairs of read pairs, each pair of read segments consisting of two read segments, respectively, from two positions of one chromosome segment, each pair of read pairs The two reads in the pair are from the positive and negative strands of the chromosome fragment, or the two reads in each pair of read lengths are from the positive strand of the chromosome fragment or the negative strand of the chromosome fragment, each Each of the read segments includes a gap, and two reads of a pair of read pairs are respectively defined as a left arm and a right arm; S20 compares the sequencing data with a reference sequence to obtain a comparison result, and the comparison result includes Aligning results of a plurality of the pair of read segments, and/or, the comparison result includes a comparison result of the plurality of the left arms and a comparison result of the plurality
  • Fig. 6 is a flow chart showing the detection of an embodiment of the CNV detecting method of the present invention.
  • the CNV detection method comprises the steps of: S11 acquiring the nucleic acid of a sample to be tested; S12 sequencing the nucleic acid to obtain sequencing data; S13 processing the sequencing data to obtain a universal alignment result; S14 detecting the CNV based on the universal alignment result; wherein S13 is performed using the sequencing data processing device and/or the sequencing data processing method of one aspect of the invention or of any of its embodiments.
  • detection of CNV based on the universal alignment result can use currently known CNV detection methods, such as hidden Markov models, circular binary segmentation, hierarchical segmentation or kernel smoothing algorithms.
  • Fig. 7 is a flow chart showing the detection of an embodiment of the CNV detecting method of the present invention.
  • the CNV detection method includes the steps of: S110 acquiring nucleic acid of a sample to be tested; S120 sequencing the nucleic acid to obtain sequencing data; S130 processing the sequencing data to obtain a general comparison result, and S130 is by the above-mentioned invention.
  • S150 corrects the sequencing depth of the window by the relationship between the sequencing depth and the GC content, and obtains the corrected sequencing depth of the window;
  • S160 determining that the CNV is present in the nucleic acid of the sample to be tested when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window of the control sample, wherein the window is part of the reference sequence.
  • the number of the aforementioned control samples is not less than 30. With at least 30 samples, the sample data can be treated as following a specific distribution and are therefore suitable for most statistical tests, for example the t test or z test, which generally require the sample data to follow a normal distribution.
  • the sequencing data, alignment results and the like of the foregoing control samples can be obtained by the sequencing data processing method of one aspect of the present invention or of any of its specific embodiments; they can be obtained at the same time as the sequencing data and alignment result of the sample to be tested, or obtained in advance and saved for later use.
  • the relationship between the sequencing depth and the GC content of the windows is determined using two-dimensional regression analysis, for example Lowess regression.
  • FIG. 8 is a block diagram showing the structure of an embodiment of a CNV detecting apparatus of the present invention.
  • the device 2000 includes: a nucleic acid acquisition device 200 for acquiring nucleic acid of a sample to be tested; a sequencing device 400 for sequencing nucleic acid from the nucleic acid acquisition unit to obtain sequencing data; and a data processing device 600 for The sequencing data of the sequencing device is processed to obtain a universal alignment result; the detection device 800 is configured to detect the CNV based on a universal comparison result from the data processing device 600; wherein the data processing device 600 includes a data receiving unit 610, configured to receive sequencing data from the sequencing device, the sequencing data comprising a plurality of pairs of read pairs, each pair of read segments consisting of two read segments, respectively, from two locations of a chromosome segment The two reads of each pair of read lengths are from the positive and negative strands of the chromosome fragment, respectively, or the two reads of each pair of read lengths are from the positive strand of the chromosome fragment
  • each read includes a gap, and the two reads of a read pair are defined as the left arm and the right arm respectively; a processor 630 configured to execute a data processing program, execution of which includes aligning the sequencing data with a reference sequence to obtain an alignment result and eliminating the gap of each read in the alignment result to obtain a universal alignment result, the alignment result including alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and of a plurality of the right arms; and at least one storage unit 650 for storing data, including the data processing program.
  • peripheral blood plasma of lung cancer patients was taken as the test object.
  • the samples were from Southwest Hospital and tested as follows:
  • the above reaction product was purified by 60 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.
  • the above reaction product was purified by 40 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.
  • the two strands of the first sequence are: TTGGCCTCCGACT/3-ddT/(SEQ ID NO: 1), /5phos/AAGTCGGAGGCCAAGCGGTCGT/ddC/ (SEQ ID NO: 2).
  • the two strands of the second sequence are: /5Phos/GTCTCCAGTCGAAGCCCGACG/3ddC/(SEQ ID NO: 3), GCTTCGACTGGAGA/3ddC/(SEQ ID NO: 4).
  • the upstream primer in the third sequence is/5-bio/TCCTAAGACCGCTTGGCCTCCGACT (SEQ ID NO: 5),
  • the intermediate "x" is a variable tag sequence region, in which each position can be replaced by N, where N is A, T, C or G. When the library is not mixed with other sample libraries and only one sample library is sequenced on the machine, no tag sequence is required, that is, the third sequence can be 5Phos/AGACAAGCTCGATCGGGCTTCGACTGGAGAC (SEQ ID NO: 7). In this example the sample is tumor cell-free nucleic acid, so the content of the target nucleic acid (ctDNA) in the mixed nucleic acid is low. If several such sample libraries were pooled on the machine, the mixed data would have to be split into the data of the respective samples, which loses part of the data; moreover, the reads of the single-linker circular library are relatively short, so deep sequencing is needed to obtain enough data to detect mutations accurately. Preferably, therefore, a single sample library is sequenced per run.
  • the above reaction product was purified by 40 ul of Ampure XP beads and eluted with 37.4 ul of Elution buffer.
  • the above reaction product was purified by 50 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.
  • the PCR product was subjected to concentration determination using a Qubit dsDNA HS assay kit.
  • 0.5% Tween20, 1X BBB/Tween Mix, 1X BWB/Tween Mix, 0.1M NaOH.
  • the 0.5% Tween20 configuration method is the same as the above, and the other three configuration methods are as follows:
  • the product of this step can be stored frozen at -20 °C.
  • the ligase reaction mixture is shaken and thoroughly mixed. After centrifugation, 11 ul of the ligase reaction mixture is added to the EP tube to which the primer reaction mixture has been added, shaken for 10 s, and briefly centrifuged.
  • the starting amount of the sample used for the preparation of DNB was adjusted to 35.3 ng-53 ng according to the concentration of single-stranded molecular quantitative determination.
  • the corresponding volume of sample (≤60 ul) was transferred to the Biorad PCR plate and brought to a total volume of no more than 120 ul with 1X TE.
  • the final concentration is 5.625-7.5fmol/ul
  • the volume is 120ul
  • the total amount is 35.3ng-53ng
  • DNB preparation for 1-adapter sequencing needs 120 fmol, that is, 16 ul at 7.5 fmol/ul. Therefore, the library needs to be diluted to 7.5 fmol/ul.
  • the data output by the sequencer in the first embodiment are processed.
  • with the sequencing data processing method and/or the CNV detection method of the present invention, based on the CG platform sequencing technology, enrichment, library construction, sequencing and data analysis can be performed on ultra-low amounts of cfDNA.
  • the sequencing reads are short, with repeated sequencing and small gaps at specific positions, so it is difficult to align the sequencing results directly using ordinary alignment software.
  • the CG platform's proprietary TeraMap is therefore used for alignment. Its working principle is as follows. First, it aligns the two ends of a read pair (LeftArm, RightArm) separately, trying a variety of gap values when processing the reads in order to obtain more alignment results. Then, taking the alignment result of each end as a reference, it aligns the other end locally (for example, for 4-AD the range of the local alignment is 0 to 700 bp). If both ends align well to the same chromosome and the insert size meets expectations (for example, for 4-AD the distance between the two reads of a read pair is 0-700 bp), only the best alignment result is output; otherwise, the multiple alignment results of both ends are output.
  • TeraMap is the alignment software of the CG sequencing platform and can align the CG-specific reads to the reference genome. Its output format consists of three parts, briefly: the first line is the read sequence information; the second and third lines describe the alignment status of the reads; the fourth and fifth lines give the details of the read alignment results.
  • the Teramap2Sam software is developed according to the method of the present invention; it removes the gaps in the TeraMap alignment results and converts the results into the SAM (sequence alignment/map) format.
  • Step 1: extract the unique alignment results. Whether an alignment is unique is determined from matchCount in the TeraMap output; in addition, the insert length must meet the requirements and the reads of the two ends must align to the same reference sequence.
  • Step 2: remove the gap. The gap positions in the reads are determined according to the gaps field, and the read sequences are corrected accordingly.
  • Step 3: calculate FLAG. The FLAG parameter of the SAM file is calculated according to the alignment directions of the two ends of the read pair, giving the alignment result.
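The FLAG calculation in the third step can be sketched as below. The bit values are those defined by the SAM format; everything else is an assumption made for illustration: the function name `pair_flags` is hypothetical, the left arm is treated as the first read of the pair, and both arms are assumed to be mapped. This is not the actual Teramap2Sam code.

```python
# FLAG bits as defined in the SAM format.
PAIRED = 0x1          # read is part of a pair
PROPER_PAIR = 0x2     # both ends aligned as expected
REVERSE = 0x10        # this read aligns to the reverse strand
MATE_REVERSE = 0x20   # the mate aligns to the reverse strand
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80

def pair_flags(left_reverse, right_reverse, proper_pair=True):
    """Return (left_flag, right_flag) for a mapped read pair.

    Assumption: the left arm is reported as the first read in the pair.
    """
    base = PAIRED | (PROPER_PAIR if proper_pair else 0)
    left = base | FIRST_IN_PAIR
    right = base | SECOND_IN_PAIR
    if left_reverse:
        left |= REVERSE
        right |= MATE_REVERSE
    if right_reverse:
        right |= REVERSE
        left |= MATE_REVERSE
    return left, right

# Left arm on the forward strand, right arm on the reverse strand.
flags = pair_flags(left_reverse=False, right_reverse=True)
```

Since the text notes that both arms may also come from the same strand, the function accepts any combination of directions rather than only the forward/reverse case.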
  • SAM is a more general format for storing alignment information.
  • Each line describes the alignment of a read and consists mainly of eleven fields; further optional fields can be appended to carry more information, for example XT:A:U means that the read is a unique alignment.
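As an illustration of the eleven mandatory SAM fields and of an optional tag such as XT:A:U, the following minimal Python sketch parses one tab-separated SAM line. The field names follow the SAM format; the function name `parse_sam_line` and the example record (read name, coordinates, sequence) are invented for illustration.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields and optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = {
        name: (int(value) if name in INT_FIELDS else value)
        for name, value in zip(SAM_FIELDS, cols)
    }
    # Optional fields have the form TAG:TYPE:VALUE, e.g. XT:A:U.
    tags = {}
    for item in cols[len(SAM_FIELDS):]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["tags"] = tags
    return record

# Hypothetical record for illustration only.
line = "read1\t99\tchr1\t100\t60\t18M\t=\t300\t218\tATCGCTTAAGTACGATTC\t*\tXT:A:U"
rec = parse_sam_line(line)
```

Because the converted output uses these generic columns, tools such as Samtools can read it without knowing anything about the TeraMap format.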
  • BAM binary compression format
  • CG developed the Assembly Software for its read structure to reassemble the reads, and perform the follow-up work after the assembly is completed.
  • the reads are short (12 bp).
  • the original CG mutation detection tool is no longer applicable or the detection result is not good.
  • BAM data to detect copy number variation.
  • the existing copy number variation detection methods include hidden Markov model, circular binary segmentation, hierarchical segmentation, and kernel smoothing algorithm.
  • We use the z-score (standard score) to obtain copy number variation results based on the read depth distribution of multiple windows with a total length of 1,000,000 bp.
  • because the GC content of the reads has a certain influence on the sequencing depth during actual sequencing, the GC content and the average sequencing depth were calculated for multiple windows with a total length of 1,000,000 bp, lowess regression was performed on the GC content versus sequencing depth data, and GC correction was carried out according to the regression curve.
  • the standard score, also called the z-score, is calculated as $z = (x - \mu) / \sigma$, where $x$ is a specific score, $\mu$ is the mean, and $\sigma$ is the standard deviation.
  • the z value represents the distance between the raw score and the population mean, calculated in units of standard deviation. z is negative when the raw score is lower than the mean, and positive when it is higher.
  • copy number variation can be detected effectively by measuring, in units of standard deviation, the distance between the read count (raw score) in a 2000 bp window and the overall mean read count (from multiple normal control samples).
  • a positive z value reflects a copy number greater than 2 (a normal sample has 2 copies), such as a duplication, and a negative z value reflects a copy number less than 2, such as a deletion.
  • the above CNV detection method in this embodiment is written as a program named calcu_zscore_query, and a region where the absolute value of z is larger than 3 is judged to be a CNV.
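The z-score rule can be sketched in Python as follows. This is not the calcu_zscore_query program, whose source is not given; the function name `call_cnv_windows`, the state labels, the use of the sample standard deviation (`statistics.stdev`), and the toy numbers are all assumptions. Only four controls are used to keep the example short, whereas the text calls for at least 30.

```python
from statistics import mean, stdev

def call_cnv_windows(sample_depths, control_depths, threshold=3.0):
    """Label each window by comparing the sample with the control samples.

    sample_depths:  per-window depth (or read count) of the tested sample.
    control_depths: one per-window list per control sample, same windows.
    Returns a list of (window_index, z, state) tuples.
    """
    calls = []
    for i, x in enumerate(sample_depths):
        ctrl = [c[i] for c in control_depths]
        mu, sigma = mean(ctrl), stdev(ctrl)
        if sigma == 0:
            # z is undefined when all controls agree exactly.
            calls.append((i, None, "undetermined"))
            continue
        z = (x - mu) / sigma
        if z > threshold:
            state = "gain"    # e.g. duplication, copy number above 2
        elif z < -threshold:
            state = "loss"    # e.g. deletion, copy number below 2
        else:
            state = "normal"
        calls.append((i, z, state))
    return calls

# Toy data: three windows, four control samples.
controls = [[10, 10, 10], [12, 12, 12], [8, 8, 8], [10, 10, 10]]
calls = call_cnv_windows([10, 20, 2], controls)
```

The inputs would normally be the GC-corrected window depths, so that depth differences caused by GC content are not mistaken for copy number changes.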
  • Compared with the traditional method, the CG single-linker sequencing method enables library construction and sequencing from ultra-low input: only 1-10 ng of nucleic acid is needed for library construction, corresponding to a peripheral blood volume of 2-5 ml, and the standardized CG process is simple and fast.
  • After the TeraMap alignment result is converted to SAM format, it is more versatile than the closed-source TeraMap format and can be processed using software such as Samtools.
  • CNV can be detected quickly using the z-score (standard score): CNV analysis of 50× whole-genome data takes only 4 hours, whereas, for comparison, the CONTRA software [ http://sourceforge.net/projects/contra-cnv/ ] takes more than 1 day.
  • TeraMap is used for comparison.
  • the original reads are obtained using the CG platform's integrated tool makeADF and are then aligned to the reference sequence with TeraMap.
  • the resulting alignment results are converted to the generic SAM format using TeraMap2Sam. Table 1 shows the results.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

A device (100) for sequencing data processing is provided. The device includes: a data receiving unit (10) for receiving sequencing data, the sequencing data comprising a plurality of read pairs, each read pair being composed of two reads that originate from two positions of a chromosome fragment, and each read containing a gap; a processor (20) for executing a data processing program, execution of which includes aligning the sequencing data with a reference sequence to obtain an alignment result and eliminating the gap of each read in the alignment result to obtain a universal alignment result; and at least one storage unit (30) for storing data, including the data processing program. A system and method for sequencing data processing, a computer-readable storage medium, and a method and device for detecting CNV are also provided.

Description

Sequencing data processing device and method
Technical field
The present invention relates to the field of bioinformatics, and specifically to a device and method for processing sequencing data. More specifically, it relates to a sequencing data processing device, a sequencing data processing system, a method for processing sequencing data, a computer-readable storage medium, a method for detecting CNV, and a CNV detection apparatus.
Background
Cell-free DNA (cfDNA), which is present in serum, plasma and other body fluids, is an effective biomarker that can be used to detect many kinds of mutations, for example in diseases caused by genetic mutation such as cancer and fetal chromosomal abnormalities. Because quantitative analysis techniques with high sensitivity and accuracy were lacking, much previous research focused on a few known disease-related genes or chromosomes, such as the GNAQ gene in melanoma (Metz, Claudia HD, et al. "Ultradeep sequencing detects GNAQ and GNA11 mutations in cell-free DNA from plasma of patients with uveal melanoma." Cancer Medicine 2.2 (2013): 208-215.) and chromosome 21 in trisomy 21 (Liao, Gary JW, et al. "Noninvasive prenatal diagnosis of fetal trisomy 21 by allelic ratio analysis using targeted massively parallel sequencing of maternal plasma DNA." PLoS One 7.5 (2012): e38154.).
The emergence of next-generation sequencing technologies such as 454 (Roche), Solexa (Illumina) and SOLiD (ABI) has rapidly increased sequencing throughput and sharply reduced sequencing cost, which opens up new approaches to cfDNA detection. Massively parallel sequencing (MPS) is currently the mainstream cfDNA detection technology and is widely used in molecular diagnosis based on plasma DNA, detection of fetal chromosomal aneuploidy, whole-genome karyotyping, and even fetal whole-genome sequencing. Copy-number variations (CNVs) are deletions, insertions, duplications and complex multi-site variations, ranging from 1000 bp to several million bp, that are widespread in the human genome. Copy-number variation is an important biomarker of many human diseases (such as cancer, hereditary diseases and cardiovascular diseases) and has become a focus of research on many of them. In particular, CNV detection in tumors can reveal deletions or amplifications of tumor DNA across the whole genome. Existing CNV detection platforms include comparative genomic hybridization (CGH) based on large inserts and representational oligonucleotide microarray analysis (ROMA). These platforms have insufficient ability to detect small CNVs (below 20 kb) and suffer from cumbersome operation and high cost.
Summary of the invention
The present invention aims to solve at least one of the above technical problems, at least to some extent, or at least to provide a commercial alternative.
According to a first aspect of the present invention, a sequencing data processing device is provided. The device includes: a data receiving unit for receiving sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, where the two reads of each pair come from the positive strand and the negative strand of the chromosome fragment respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment, each read contains a gap, and the two reads of a read pair are defined as the left arm and the right arm; a processor for executing a data processing program, where executing the program includes aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a general alignment result, the alignment result comprising alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and of a plurality of the right arms; and at least one storage unit for storing data, including the data processing program.
Read pairs that come from two positions of one chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing it. In one embodiment of the present invention, a plurality of read pairs are obtained using the library construction method and sequencing platform of Complete Genomics (CG); the distance between the two reads of a pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme. The CG platform constructs a multi-adaptor mate-pair library by enzymatic digestion and sequences the resulting circular library with its proprietary combinatorial probe-anchor ligation (cPAL) technology, reading the bases on both sides of each adaptor. Because the library is built by using enzymatic digestion to attach the two segments of an adaptor, and because every enzyme has a preferred cutting distance but in practice often cuts one position further or one position nearer, reads frequently carry a gap, commonly +1 or -1. In addition, or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position tends to vary from one digestion to the next, and this variation also introduces gaps into the reads. For example, when a multi-adaptor circular library is constructed with two Alu enzyme digestions that attach different parts of several adaptors, reading the bases next to those adaptors yields reads with gaps of +3/-3. In the present invention the gap size can also be 0. Taking the current two-adaptor (2-AD) sequencing library of the CG platform as an example, the 2-AD sequencing output has a total length of 60 bp and can be divided into two mate-paired read pairs; each read in a pair has a small gap at the 10 bp position and an invalid sequencing site N at the 20 bp position, and the genomic distance between the two reads of a pair is generally less than 2000 bp. Among the multiple reads from a multi-adaptor library, any read can form a read pair with any other read.
The "positive strand" and "negative strand" are the two complementary strands that make up a chromosome fragment. The terms are relative: calling one strand the positive strand makes its complement the negative strand. In one embodiment of the present invention, the strand that matches the reference sequence is called the positive strand and the other strand the negative strand.
In the present invention, the alignment can be performed with known alignment software such as SOAP or BWA, or with TeraMap, the alignment software of the CG platform. In one embodiment, the alignment is performed with TeraMap and the alignment result is in TeraMap format. In one embodiment, eliminating the gap of each read in the alignment result means the following: for a read with a negative gap, the negative gap is removed, that is, the overlapping bases are removed; for a read with a positive gap, the gap is filled with N, one N for each position of the gap, where N is A, T, C or G. For example, a read with a negative gap of -2 nt can be split at the gap into two parts whose ends overlap by 2 nt. If the two parts are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, gives the read ATCGCTTAAGTACGATTC.
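The gap-elimination rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `eliminate_gap`, the representation of a read as two string parts plus a signed gap size, the literal character "N" as the filler, and the consistency check on the overlap are all hypothetical choices, not details given in the text.

```python
def eliminate_gap(first: str, second: str, gap: int, filler: str = "N") -> str:
    """Join the two parts of a gapped read into one contiguous read.

    gap < 0: the parts overlap by |gap| bases, so one copy of the overlap is dropped.
    gap > 0: |gap| bases are missing between the parts, so they are padded with `filler`.
    gap == 0: the parts are simply concatenated.
    """
    if gap < 0:
        overlap = -gap
        # Assumption: the overlapping bases agree; a real pipeline would need
        # a policy for disagreements caused by sequencing errors.
        if first[-overlap:] != second[:overlap]:
            raise ValueError("parts do not overlap as the gap size claims")
        return first + second[overlap:]
    return first + filler * gap + second


# The worked example from the text: a -2 nt gap with the overlapping bases AG.
merged = eliminate_gap("ATCGCTTAAG", "AGTACGATTC", -2)
# A positive gap of 2 nt is padded with two filler characters.
padded = eliminate_gap("ATCG", "TTAA", 2)
```

Keeping the gap as a signed integer lets one function cover the negative, zero and positive cases that the text distinguishes.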
In one embodiment of the present invention, the alignment described for this aspect includes: aligning the left arm and the right arm of each read pair to the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; taking one of these first-level results as the reference and aligning the other against it, to obtain a second-level left alignment result and a second-level right alignment result; and, based on the second-level results, obtaining the alignment results of a plurality of the read pairs, or the alignment results of a plurality of the left arms and of a plurality of the right arms. After these two rounds of alignment, the read-pair alignment result is obtained. In one embodiment, the first round is a global alignment against the reference sequence, and the second round, which re-aligns the right-arm (or left-arm) results using the left-arm (or right-arm) results as the baseline, is a local alignment. In this way, two reads, one from the second-level left result and one from the second-level right result, that align to the same chromosome at a distance consistent with expectation can be matched into a read pair.
In one embodiment of the present invention, the alignment includes setting the size of the gap so that each left arm or each right arm is aligned to the reference sequence several times, in order to obtain the best alignment. For example, the gap of each left arm or right arm is set in turn to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt and 7 nt to obtain the corresponding candidate reads, each candidate is aligned to the reference sequence, and the best-aligned candidate is taken as that left arm or right arm. Alignment quality can be judged by the default scoring of the alignment software used.
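A toy sketch of trying each gap size and keeping the best one is given below. It assumes the start position of the arm on the reference is already known and scores each candidate by counting matching bases; this simple count merely stands in for the default scoring of whatever alignment software is used, and the names `score_with_gap` and `best_gap` are hypothetical.

```python
GAP_SIZES = range(-3, 8)  # -3 nt through 7 nt, the values listed in the text


def count_matches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))


def score_with_gap(ref: str, pos: int, first: str, second: str, gap: int) -> int:
    """Score one candidate: `first` placed at `pos`, `second` shifted by `gap`."""
    second_start = pos + len(first) + gap
    if second_start < 0 or second_start + len(second) > len(ref):
        return -1  # candidate would fall outside the reference
    return (count_matches(first, ref[pos:pos + len(first)])
            + count_matches(second, ref[second_start:second_start + len(second)]))


def best_gap(ref: str, pos: int, first: str, second: str) -> int:
    """Return the gap size whose candidate matches the reference best."""
    return max(GAP_SIZES, key=lambda g: score_with_gap(ref, pos, first, second, g))


reference = "GGATCGCTTAAGTACGATTCGG"
chosen = best_gap(reference, 2, "ATCGCTTAAG", "AGTACGATTC")
```

With the two read parts from the earlier worked example, the candidate with a -2 nt gap is the only one that matches the toy reference at every position.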
In one embodiment of the present invention, executing the data processing program further includes, before eliminating the gap of each read in the alignment result, extracting the unique alignments from the alignment result and using them in place of the alignment result. The unique alignment result comprises a plurality of read pairs that align uniquely to the reference sequence, where both reads of each pair align to the same chromosome of the reference sequence, and the distance between the two reads of each pair is consistent with the expected distance between the two positions of the chromosome fragment from which they came.
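The three conditions for a unique alignment (one hit per read, same chromosome, plausible distance) can be expressed as a small filter. The sketch below is illustrative only: the `Hit` record, the dictionary layout of the input, and the default `max_distance` of 2000 bp (borrowed from the 2-AD figure quoted earlier) are assumptions rather than details of the described device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chrom: str   # reference chromosome name
    pos: int     # leftmost aligned position on the reference
    strand: str  # "+" or "-"


def is_unique_pair(left_hits, right_hits, max_distance=2000):
    """True if each arm has exactly one hit, on the same chromosome, close enough."""
    if len(left_hits) != 1 or len(right_hits) != 1:
        return False
    left, right = left_hits[0], right_hits[0]
    return left.chrom == right.chrom and abs(left.pos - right.pos) <= max_distance


def extract_unique(pairs, max_distance=2000):
    """Keep only the read pairs that satisfy the uniqueness conditions."""
    return {
        read_id: (left[0], right[0])
        for read_id, (left, right) in pairs.items()
        if is_unique_pair(left, right, max_distance)
    }


pairs = {
    "p1": ([Hit("chr1", 100, "+")], [Hit("chr1", 240, "-")]),                       # kept
    "p2": ([Hit("chr1", 100, "+"), Hit("chr2", 5, "+")], [Hit("chr1", 240, "-")]),  # multi-hit
    "p3": ([Hit("chr1", 100, "+")], [Hit("chr2", 240, "-")]),                       # different chromosomes
    "p4": ([Hit("chr1", 100, "+")], [Hit("chr1", 5000, "-")]),                      # too far apart
}
unique = extract_unique(pairs)
```

The distance threshold would in practice depend on the library type; the text gives a median of about 140 bp for the 1-adaptor library and under 2000 bp for the 2-AD library.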
In one embodiment of the present invention, executing the data processing program further includes a correction so that every read pair in the unique alignment result is aligned to the positive strand of the same chromosome of the reference sequence. For example, for a pair of reads that align to the positive and negative strands of a chromosome respectively, the read aligned to the negative strand is converted to its reverse complement; replacing the read with its reverse complement accomplishes the correction.
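The strand correction amounts to reverse-complementing any read that aligned to the negative strand. A minimal sketch follows; the helper names are hypothetical, and it assumes the read sequence is stored as it was sequenced rather than already reverse-complemented by the aligner.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_positive_strand(seq: str, strand: str) -> str:
    """Express a read on the positive strand, given the strand it aligned to."""
    return reverse_complement(seq) if strand == "-" else seq


corrected = to_positive_strand("AAGC", "-")
unchanged = to_positive_strand("AAGC", "+")
```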
In one embodiment of the present invention, executing the data processing program further includes data format conversion, which comprises converting the format of the alignment result or of the unique alignment result. In one embodiment, the general alignment result is required to be in SAM or BAM format to facilitate further analysis based on the alignment result or the unique alignment result. SAM and BAM are common alignment formats; BAM is the compressed binary form of SAM. Depending on the alignment software used, the format of the output alignment result or unique alignment result may not be accepted by existing downstream data processing or analysis programs. For example, the TeraMap-format alignment result mentioned above does not meet the input format requirements of most existing variant detection software, such as SOAPsnp, GATK or SOAPindel. Converting the data format yields a general alignment result in a common format, which makes further analysis of the alignment data easier.
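To make the target format concrete, the sketch below builds one SAM alignment line for a gap-free read on the positive strand. It is not the TeraMap2Sam converter and does not parse TeraMap output; the function name and its simplifications (flag 0, no mate information, quality string omitted, every base treated as aligned) are assumptions made for illustration.

```python
def sam_record(name: str, chrom: str, pos: int, seq: str,
               mapq: int = 255, flag: int = 0) -> str:
    """Format one alignment as a tab-separated SAM line with the 11 mandatory fields."""
    fields = [
        name,             # QNAME: read name
        str(flag),        # FLAG: 0 = forward strand, no pairing information
        chrom,            # RNAME: reference sequence name
        str(pos),         # POS: 1-based leftmost mapping position
        str(mapq),        # MAPQ: 255 means mapping quality is unavailable
        f"{len(seq)}M",   # CIGAR: every base aligned (gaps already eliminated)
        "*",              # RNEXT: mate reference not recorded in this sketch
        "0",              # PNEXT: mate position not recorded
        "0",              # TLEN: template length not recorded
        seq,              # SEQ: read bases
        "*",              # QUAL: base qualities not stored
    ]
    return "\t".join(fields)


line = sam_record("read1", "chr1", 100, "ATCGCTTAAGTACGATTC")
```

A converter for real read pairs would also set the pairing bits of FLAG and fill RNEXT, PNEXT and TLEN from the mate's alignment.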
According to a second aspect of the present invention, a sequencing data processing system is provided, comprising a host and a display, and further comprising the sequencing data processing device of the first aspect or of any of its embodiments. The foregoing description of the advantages and technical features of the sequencing data processing device also applies to this system and is not repeated here.
According to a third aspect of the present invention, a sequencing data processing method is provided, comprising the following steps: obtaining sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, where the two reads of each pair come from the positive strand and the negative strand of the chromosome fragment respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment, each read contains a gap, and the two reads of a read pair are defined as the left arm and the right arm; aligning the sequencing data with a reference sequence to obtain an alignment result, the alignment result comprising alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and of a plurality of the right arms; and eliminating the gap of each read in the alignment result to obtain a general alignment result. For the way read pairs are obtained, the gaps contained in reads, alignment, gap elimination, the alignment result and the general alignment result, refer to the description of the corresponding technical features of the device of the first aspect or of any of its embodiments.
For example, likewise, read pairs that come from two positions of one chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing it. In one embodiment of the present invention, a plurality of read pairs are obtained using the library construction method and sequencing platform of Complete Genomics (CG); the distance between the two reads of a pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme. The CG platform constructs a multi-adaptor mate-pair library by enzymatic digestion and sequences the resulting circular library with its proprietary combinatorial probe-anchor ligation (cPAL) technology, reading the bases on both sides of each adaptor. Because the library is built by using enzymatic digestion to attach the two segments of an adaptor, and because every enzyme has a preferred cutting distance but in practice often cuts one position further or one position nearer, reads frequently carry a gap, commonly +1 or -1. In addition, or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position tends to vary from one digestion to the next, and this variation also introduces gaps into the reads. For example, when a multi-adaptor circular library is constructed with two Alu enzyme digestions that attach different parts of several adaptors, reading the bases next to those adaptors yields reads with gaps of +3/-3. In the present invention the gap size can also be 0. Among the multiple reads from a multi-adaptor library, any read can form a read pair with any other read.
The "positive strand" and "negative strand" are the two complementary strands that make up a chromosome fragment. The terms are relative: calling one strand the positive strand makes its complement the negative strand. In one embodiment of the present invention, the strand that matches the reference sequence is called the positive strand and the other strand the negative strand.
In the present invention, the alignment can be performed with known alignment software such as SOAP or BWA, or with TeraMap, the alignment software of the CG platform. In one embodiment, the alignment is performed with TeraMap and the alignment result is in TeraMap format. In one embodiment, eliminating the gap of each read in the alignment result means the following: for a read with a negative gap, the negative gap is removed, that is, the overlapping bases are removed; for a read with a positive gap, the gap is filled with N, one N for each position of the gap, where N is A, T, C or G. For example, a read with a negative gap of -2 nt can be split at the gap into two parts whose ends overlap by 2 nt. If the two parts are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, gives the read ATCGCTTAAGTACGATTC.
In one embodiment of the present invention, obtaining the sequencing data includes constructing a sequencing library, the sequencing library being a single-stranded circular DNA library composed of one strand of the chromosome fragment and at least one predetermined DNA sequence. The single-stranded circular library can be constructed by known library construction methods, for example by following the construction of the SOLiD mate-pair library of Life Technologies to obtain a single-adaptor circular double-stranded library and then separating the strands to obtain a single-stranded circular library. In one embodiment, the single-stranded circular library is constructed with CG's library construction technology; for library construction, see US7897344, which yields a multi-adaptor single-stranded circular library.
In one embodiment of the present invention, the two reads of each pair come from the two ends of the chromosome fragment. Following a modified version of CG's library construction technology, the two parts of one adaptor are ligated to the two ends of a chromosome fragment, the strands are separated, and the single strand is circularized to obtain a 1-adaptor single-stranded circular library. This library consists of one strand of the chromosome fragment and one predetermined DNA sequence joining the two ends of that strand. Rolling circle amplification forms DNA nanoballs (DNBs), which are sequenced with CG's cPAL technology; for loading DNBs onto the chip and for cPAL, see US8278039B2 and US8518640B2 respectively. The predetermined DNA sequence is a known sequence, namely the adaptor mentioned above or one strand of it.
The modified CG method for constructing the 1-adaptor circular single-stranded library includes the steps of: (1) extracting the nucleic acid to be tested; (2) phosphorylating the ends of the nucleic acid to obtain a terminally phosphorylated product; (3) end-repairing the terminally phosphorylated product to obtain an end-repaired product; (4) ligating a first sequence and a second sequence to the two ends of the end-repaired product to obtain a first ligation product; (5) performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) separating the strands of the amplification product by means of the biotin label to obtain a single-stranded product; (7) circularizing the single-stranded product using a fourth sequence to obtain the sequencing library; where the fourth sequence can join one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or of the second sequence is a dideoxynucleotide. By joining the first sequence and the second sequence, the fourth sequence forms one adaptor. Nick translation removes the nick caused by the dideoxynucleotide at the other end of the first and/or second sequence ligated to the two ends of the end-repaired product. Using at least one biotin-labeled primer gives at least one strand of the amplification product a biotin label, which makes it easy to isolate the single-stranded product afterwards.
In another embodiment of the present invention, the modified CG method for constructing the 1-adaptor circular single-stranded library includes the steps of: (1) extracting the nucleic acid to be tested; (2) end-repairing the nucleic acid to obtain an end-repaired product; (3) phosphorylating the ends of the end-repaired product to obtain a terminally phosphorylated product; (4) ligating a first sequence and a second sequence to the two ends of the terminally phosphorylated product to obtain a first ligation product; (5) performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) separating the strands of the amplification product by means of the biotin label to obtain a single-stranded product; (7) circularizing the single-stranded product using a fourth sequence to obtain the sequencing library; where the fourth sequence can join one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or of the second sequence is a dideoxynucleotide. The order of end repair and terminal phosphorylation is not restricted. End repair produces blunt-ended nucleic acid fragments to which other nucleotides or sequences can be ligated. Terminal phosphorylation reduces ligation of sample nucleic acid fragments to one another, so that a library can be constructed even from samples with very little nucleic acid and still meet the loading requirement of the sequencer.
The single-adaptor circular single-stranded library is shown in Figure 1. When the constructed single-adaptor circular single-stranded library (1-AD) is sequenced, the output read pair has a total length of about 30 bp, with one read of 12 bp and the other of 19 bp, and the median genomic distance between the two reads of a pair is about 140 bp. Single-adaptor library construction requires less input material, which suits cases where cfDNA content is low, and also has the advantages of short construction time and low cost.
In one embodiment of the present invention, the alignment in this method includes: aligning the left arm and the right arm of each read pair to the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; taking one of these first-level results as the reference and aligning the other against it, to obtain a second-level left alignment result and a second-level right alignment result; and, based on the second-level results, obtaining the alignment results of a plurality of the read pairs, or the alignment results of a plurality of the left arms and of a plurality of the right arms. After these two rounds of alignment, the read-pair alignment result is obtained. In one embodiment, the first round is a global alignment against the reference sequence, and the second round, which re-aligns the right-arm (or left-arm) results using the left-arm (or right-arm) results as the baseline, is a local alignment. In this way, two reads, one from the second-level left result and one from the second-level right result, that align to the same chromosome at a distance consistent with expectation can be matched into a read pair.
In one embodiment of the invention, the alignment includes setting the size of the gap so that each left arm or each right arm is aligned to the reference sequence multiple times, to obtain the best alignment result. For example, the gap of each left arm or each right arm is set in turn to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt and 7 nt to obtain a corresponding set of candidate reads; each candidate read is aligned to the reference sequence, and the best-aligned sequence is taken as that left arm or right arm. Here the quality of an alignment can be judged by the default scoring of the alignment software used.
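The gap scan described above can be sketched as follows. This is only an illustration under stated assumptions: the arm is taken to consist of two segments, a negative gap is treated as an overlap and a positive gap as unread bases written as N, and the score is a simple mismatch count over ungapped placements, standing in for the default scoring of whatever alignment software is actually used. All function names are hypothetical.

```python
def apply_gap(first, second, gap):
    """Join two segments of one arm under an assumed gap size (in nt)."""
    if gap < 0:
        # Negative gap: the segments overlap by |gap| bases; drop the repeat.
        return first + second[-gap:]
    # Zero or positive gap: `gap` unread bases lie between the segments.
    return first + "N" * gap + second


def min_mismatches(candidate, reference):
    """Fewest mismatches over all ungapped placements; N matches any base."""
    best = None
    for start in range(len(reference) - len(candidate) + 1):
        window = reference[start:start + len(candidate)]
        mm = sum(1 for c, r in zip(candidate, window) if c != "N" and c != r)
        if best is None or mm < best:
            best = mm
    return best  # None when the candidate is longer than the reference


def best_gap(first, second, reference, gaps=range(-3, 8)):
    """Try every gap size and keep the candidate with the fewest mismatches.

    Ties are broken in favour of the smaller absolute gap (an assumption).
    Returns (gap, candidate, mismatches), or None if nothing could be placed.
    """
    scored = []
    for gap in gaps:
        candidate = apply_gap(first, second, gap)
        mm = min_mismatches(candidate, reference)
        if mm is not None:
            scored.append((mm, abs(gap), gap, candidate))
    if not scored:
        return None
    mm, _, gap, candidate = min(scored)
    return gap, candidate, mm
```

With the segments ATCGCTTAAG and AGTACGATTC and a toy reference that contains ATCGCTTAAGTACGATTC, the scan selects a gap of -2 nt.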
In one embodiment of the invention, executing the data processing program further includes, before eliminating the gap of each read in the alignment result, extracting the unique alignment results from the alignment result and using them in place of the alignment result. The unique alignment results comprise a plurality of read pairs that align uniquely to the reference sequence, in which each read pair aligns to a single chromosome of the reference sequence and the distance between the two reads of each read pair matches the expected distance between the two positions of the chromosome fragment from which they originate.
In one embodiment of the invention, executing the data processing program further includes a correction such that each read pair in the unique alignment results aligns to the positive strand of the same chromosome of the reference sequence. For example, for a read pair whose two reads align to the positive and negative strands of a chromosome respectively, the read that aligns to the negative strand is converted to its reverse complement; replacing the read with its reverse complement achieves the correction.
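The strand correction amounts to taking a reverse complement. A minimal Python sketch follows; the function names and the "+"/"-" strand encoding are assumptions made for illustration.

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_positive_strand(read_seq, strand):
    """Return the read as it would appear on the positive strand.

    strand is "+" or "-" (assumed encoding); negative-strand reads are
    replaced by their reverse complement, positive-strand reads are kept.
    """
    return reverse_complement(read_seq) if strand == "-" else read_seq
```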
In one embodiment of the invention, executing the data processing program further includes data format conversion, which comprises converting the format of the alignment result or of the unique alignment result. In one embodiment of the invention, the universal alignment result is required to be in SAM or BAM format, to facilitate further analysis of the data based on the alignment result; SAM and BAM are common alignment formats, BAM being the compressed binary form of SAM. Because different alignment software may be used, the format of the output alignment result or unique alignment result may not suit existing downstream data processing or analysis programs. For example, the TeraMap-format alignment result mentioned above does not meet the input format requirements of most existing variant detection software, such as SOAPsnp, GATK or SOAPindel. Converting the data format yields a universal alignment result in a common data format, which facilitates further analysis and processing of the alignment data.
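To make the target format concrete, the sketch below writes one SAM alignment line with the eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). It is not the patent's conversion software and it does not parse TeraMap output, whose field layout is not given here; the function name and its parameters are hypothetical, and the read is assumed to be unpaired and already gap-free so that a single `M` CIGAR operation suffices.

```python
def sam_record(name, chrom, pos, seq, reverse=False, mapq=255, qual="*"):
    """Format one unpaired, ungapped alignment as a SAM line.

    pos is the 1-based leftmost reference position. MAPQ 255 means
    "mapping quality not available"; QUAL "*" means qualities not stored.
    """
    flag = 16 if reverse else 0      # 0x10 marks a reverse-strand alignment
    cigar = f"{len(seq)}M"           # whole read aligned without indels
    fields = [name, flag, chrom, pos, mapq, cigar, "*", 0, 0, seq, qual]
    return "\t".join(str(field) for field in fields)
```

In SAM, SEQ is written as it appears on the forward reference strand, which matches the strand correction described earlier.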
According to a fourth aspect of the invention, there is provided a computer-readable storage medium for storing a program for execution by a computer, execution of the program including carrying out the sequencing data processing method of the foregoing aspect of the invention or of any of its specific embodiments. The foregoing description of the advantages and technical features of the sequencing data processing method of the invention also applies to this computer-readable storage medium and is not repeated here. The storage medium may include read-only memory, random-access memory, a magnetic disk, an optical disc, or the like.
According to a fifth aspect of the invention, there is provided a method for detecting copy number variation (CNV), the method comprising: a. obtaining nucleic acid from a sample to be tested; b. sequencing the nucleic acid to obtain sequencing data; c. processing the sequencing data to obtain a universal alignment result; d. detecting the CNV based on the universal alignment result; wherein step c is carried out using the sequencing data processing device and/or method of one aspect of the invention or of any of its specific embodiments. The foregoing description of the advantages and technical features of the sequencing data processing device and/or method of the invention also applies to the CNV detection method of this aspect and is not repeated here.
In one embodiment of the invention, step b includes constructing a sequencing library from the nucleic acid, the sequencing library being a single-stranded circular DNA library. Construction of the single-stranded circular DNA library comprises: phosphorylating the ends of the nucleic acid to obtain an end-phosphorylated product; end-repairing the end-phosphorylated product to obtain an end-repaired product; ligating a first sequence and a second sequence to the two ends of the end-repaired product to obtain a first ligation product; performing nick translation and amplification on the ligation product using a third sequence to obtain an amplified product, the third sequence being a primer pair in which at least one primer carries a biotin label; separating single strands of the amplified product by means of the biotin label to obtain a single-stranded product; and circularizing the single-stranded product to obtain the sequencing library; wherein a fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
In another embodiment of the invention, end repair is performed first, followed by end phosphorylation. End repair produces blunt-ended nucleic acid fragments to which other nucleotides or sequences can be ligated. End phosphorylation reduces ligation of sample nucleic acid fragments to one another, so that libraries can be constructed even from samples with very low nucleic acid content while still meeting the loading requirement for sequencing. The single-adapter circular single-stranded library is shown in Figure 1. Single-adapter library construction requires only a small amount of input material, which suits samples with low cfDNA content, and it also has the advantages of short construction time and low construction cost. The fourth sequence can join the first sequence and the second sequence to form one adapter. Nick translation removes the nick caused by the dideoxynucleotide at the other end of the first sequence and/or the second sequence ligated to the two ends of the end-repaired product. Because at least one primer carries a biotin label, at least one strand of the amplified product carries the label, which makes it easy to isolate the single-stranded product afterwards. In one embodiment of the invention, the constructed library is sequenced by combinatorial probe-anchor ligation sequencing, for example on the CG sequencing platform.
CNV detection based on the universal alignment result can use currently known CNV detection methods, such as hidden Markov models, circular binary segmentation, rank segmentation, or kernel smoothing algorithms. In one embodiment of the invention, step d includes: setting a plurality of windows on the reference sequence, each window being a part of the reference sequence; and, when the amount of reads matched to a window in the universal alignment result differs significantly from the amount of reads matched to the same window in the universal alignment result of control samples, determining that the CNV is present in the nucleic acid of the sample to be tested. The window size can be adjusted according to the size of the CNV to be detected, and the universal alignment result of the control samples can be obtained by the method of one aspect of the invention or by the sequencing data processing method of any of its specific embodiments. Whether the difference is significant can be judged with a statistical test, for example by computing a z value as a z-score (standard score); when the z value is above or below a predetermined threshold, the window region is judged to contain a CNV. For example, if the normal control is diploid (CNV=2), a positive z value indicates CNV>2 for that window of the sample to be tested, and a negative z value indicates CNV<2. In one embodiment of the invention, the predetermined threshold is set to 3, that is, a CNV is called in the window when the absolute value of z is greater than 3. The amount of reads may be a count or a ratio. For example, copy number variation can also be detected with a z-score test on the difference between the sequencing depth of a window in the sample to be tested and the sequencing depth of the corresponding window in the control samples, where the sequencing depth of a window = the amount of reads aligned to the window / the size of the window.
In one embodiment of the invention, because the GC content of reads has some influence on sequencing depth in actual sequencing [Alkan, Can, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O Kitzman, et al. "Personalized Copy Number and Segmental Duplication Maps Using next-Generation Sequencing." Nature Genetics 41, no. 10 (October 2009): 1061–67], a GC content correction is performed first to remove the effect of GC content on sequencing depth. The GC content correction can use the sequencing data of multiple control samples: the GC content and the average sequencing depth are computed for a number of windows, a two-dimensional regression analysis is performed on the GC content versus sequencing depth data, for example using locally weighted scatterplot smoothing (lowess regression) to establish the relationship between the two, and the sequencing depth of each window is then corrected for GC content according to the fitted relationship.
The relationship between sequencing depth and GC content can be established as follows: obtaining sequencing data for the nucleic acid of multiple control samples, the sequencing data consisting of a plurality of reads; setting a plurality of windows on the reference sequence, each window being a part of the reference sequence; aligning the sequencing data of each control sample to the windows of the reference sequence and counting, for each control sample, the number of reads aligned to each window, to obtain the sequencing depth of each window, where the sequencing depth of a window = the total number of reads from all control samples aligned to the window / (number of control samples * size of the window); and, based on the sequencing depth and the GC content of each window, establishing the relationship between sequencing depth and GC content by two-dimensional regression analysis.
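The following Python sketch illustrates the pooled depth formula and one possible GC correction. Two points are assumptions rather than the patent's method: the depth versus GC relationship is estimated as the mean depth per GC bin, a simpler stand-in for the lowess regression named in the text, and the correction rescales each window's depth by the ratio of the overall mean depth to the depth expected for its GC bin, since the text does not give the correction formula. All function names are hypothetical.

```python
from collections import defaultdict


def gc_content(seq):
    """Fraction of G and C among the called (A/C/G/T) bases of a window."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def pooled_depth(read_counts, window_size):
    """Depth of one window pooled over controls.

    read_counts holds one read count per control sample, so the result is
    total reads / (number of control samples * window size).
    """
    return sum(read_counts) / (len(read_counts) * window_size)


def gc_bin(gc, bin_percent=5):
    """Index of the GC bin (5 percentage points wide by default)."""
    return round(gc * 100) // bin_percent


def fit_gc_relation(gc_values, depths, bin_percent=5):
    """Mean depth per GC bin: a simple stand-in for a lowess fit."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for gc, depth in zip(gc_values, depths):
        index = gc_bin(gc, bin_percent)
        sums[index] += depth
        counts[index] += 1
    return {index: sums[index] / counts[index] for index in sums}


def correct_depth(depth, gc, relation, overall_mean, bin_percent=5):
    """Rescale a window's depth by overall mean / expected depth at its GC."""
    expected = relation.get(gc_bin(gc, bin_percent))
    if not expected:
        return depth  # no control data for this GC bin: leave unchanged
    return depth * overall_mean / expected
```

After correction, windows with different GC content but normal copy number should have comparable depths, which is what the subsequent significance test relies on.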
In one embodiment of the invention, step d includes: setting a plurality of windows on the reference sequence, each window being a part of the reference sequence, and calculating the sequencing depth of each window, where the sequencing depth of a window = the number of reads aligned to the window in the universal alignment result / the size of the window; correcting the sequencing depth of the window using the relationship between sequencing depth and GC content, to obtain the corrected sequencing depth of the window; and, when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window in the control samples, determining that the CNV is present in the nucleic acid of the sample to be tested. Preferably, there are no fewer than 30 control samples; with 30 or more samples, the sample data follow a distribution suitable for most statistical tests. For example, tests such as the t-test and the z-test generally require the data from multiple samples to follow a normal distribution. The corrected sequencing depth of the same window in a control sample is obtained by correcting the sequencing depth of that window in the control sample using the relationship between sequencing depth and GC content, where the sequencing depth of the same window in the control sample = the number of reads in the control sample's sequencing data aligned to the window / the size of the window. The sequencing data, alignment results and so on of the control samples can be obtained by the sequencing data processing method of one aspect of the invention or of any of its specific embodiments; they can be obtained at the same time as the sequencing data and alignment results of the sample to be tested, or obtained in advance and stored for later use.
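The window-level significance test can be sketched in Python as below. The threshold of 3 and the depth formula come from the text; using the sample standard deviation of the control depths, and the "gain"/"loss"/"none" labels, are assumptions made for illustration.

```python
from statistics import mean, stdev


def window_depth(read_count, window_size):
    """Sequencing depth of a window = reads aligned to it / window size."""
    return read_count / window_size


def z_score(test_depth, control_depths):
    """Standard score of the test sample's depth against the controls."""
    mu = mean(control_depths)
    sigma = stdev(control_depths)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("control depths have zero variance")
    return (test_depth - mu) / sigma


def call_cnv(test_depth, control_depths, threshold=3.0):
    """Call a CNV in one window when |z| exceeds the threshold.

    With diploid controls, "gain" corresponds to CNV>2 and "loss" to CNV<2.
    """
    z = z_score(test_depth, control_depths)
    if z > threshold:
        return "gain"
    if z < -threshold:
        return "loss"
    return "none"
```

In practice the depths passed in would be the GC-corrected depths, and the control list would contain 30 or more samples as the text recommends.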
According to a sixth aspect of the invention, there is provided a CNV detection apparatus for carrying out all or some of the steps of the CNV detection method of one aspect of the invention, the apparatus comprising: a nucleic acid acquisition device for obtaining nucleic acid from a sample to be tested; a sequencing device for sequencing the nucleic acid from the nucleic acid acquisition device to obtain sequencing data; a data processing device for processing the sequencing data from the sequencing device to obtain a universal alignment result; and a detection device for detecting the CNV based on the universal alignment result from the data processing device. The data processing device comprises: a data receiving unit for receiving the sequencing data from the sequencing device, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read containing a gap, and the two reads of a read pair being defined as the left arm and the right arm; a processor for executing a data processing program, execution of which includes aligning the sequencing data to a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain the universal alignment result, the alignment result comprising alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and a plurality of the right arms; and at least one storage unit for storing data, including the data processing program. The foregoing description of the advantages and technical features of the CNV detection method of one aspect of the invention or of any of its specific embodiments also applies to the CNV detection apparatus of this aspect and is not repeated here. Moreover, a person of ordinary skill in the art will understand that all or some of the units of this apparatus may optionally and detachably include one or more subunits to carry out or implement the specific embodiments of the foregoing CNV detection method of the invention.
Obtaining sequencing data by single-adapter sequencing on the CG platform is cheaper and faster. Using the data processing device, system and/or method of the invention, the TeraMap2Sam conversion software was developed to convert TeraMap alignment results from the CG platform into the universal SAM format, so that many excellent open-source tools such as Samtools and GATK can be used directly for variant detection, widening the options for downstream analysis. The CNV detection program developed using the CNV detection method and/or apparatus of the invention performs CNV analysis based on the standard score (z-score) method, and is fast with high resolution.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Figure 1 is a schematic diagram of the structure of a single-adapter circular single-stranded library in one embodiment of the invention;
Figure 2 is a schematic structural diagram of a sequencing data processing device in one embodiment of the invention;
Figure 3 is a schematic structural diagram of a sequencing data processing system in one embodiment of the invention;
Figure 4 is a flow chart of a sequencing data processing method in one embodiment of the invention;
Figure 5 is a flow chart of a sequencing data processing method in one embodiment of the invention;
Figure 6 is a flow chart of a CNV detection method in one embodiment of the invention;
Figure 7 is a flow chart of a CNV detection method in one embodiment of the invention;
Figure 8 is a schematic structural diagram of a CNV detection apparatus in one embodiment of the invention;
Figure 9 is a flow chart of single-adapter library construction and sequencing in one embodiment of the invention;
Figure 10 is an algorithm flow chart of the Teramap2Sam software in one embodiment of the invention.
Detailed Description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting it. Note that the terms "first", "second", "third", "fourth", "first-level", "second-level" and the like used herein are for convenience of description only; they are not to be understood as indicating or implying relative importance, nor as implying any sequential order. In the description of the invention, unless otherwise stated, "a plurality" means two or more.
Figure 2 shows a schematic structural diagram of one embodiment of the sequencing data processing device of the invention. The sequencing data processing device 100 includes a data receiving unit 10, a processor 20 and a storage unit 30; the processor 20 is connected to the data receiving unit 10 and to the storage unit 30, and the storage unit 30 is connected to the data receiving unit 10.
The data receiving unit 10 is used to receive sequencing data. The sequencing data comprise a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment; the two reads of each read pair come from the positive strand and the negative strand of the chromosome fragment respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment. Each read contains a gap, and the two reads of a read pair are defined as the left arm and the right arm. Read pairs of this kind, whose reads come from two positions of one chromosome fragment, can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library. In one embodiment of the invention, a plurality of read pairs are obtained using the library construction method and sequencing platform of Complete Genomics (CG); the distance between the two reads of a read pair is governed by the read length and by the distance between the enzyme's recognition site and its cleavage site.
The CG platform constructs a multi-adapter mate-pair library by enzymatic digestion and sequences the resulting circular library with its combinatorial probe-anchor ligation (cPAL) technology, reading the bases on both sides of each adapter. Because the mate-pair library is built by digesting and ligating the two sections of an adapter, and because each enzyme has a preferred cutting distance but in practice often cuts one position further or one position nearer than that distance, reads often carry a gap, commonly +1 or -1. In addition or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position tends to vary from one digestion to the next, and this variation also leaves gaps in the reads. For example, when a multi-adapter circular library is constructed, the Alu enzyme is used in two digestions to attach different parts of multiple adapters, and reading the bases next to these adapters produces reads with gaps of +3/-3. In the invention the gap size may also be 0. Taking the current two-adapter (2-AD) sequencing library of the CG platform as an example, the 2-AD sequencing output has a total length of 60 bp and can be divided into two mate-paired read pairs; each read of a pair has a small gap at the 10 bp position and an invalid sequencing site N at the 20 bp position, and the genomic distance between the two reads of a read pair is generally less than 2000 bp. Among the multiple reads from a multi-adapter library, any read can form a read pair with any other read. The "positive strand" and "negative strand" referred to here are the two complementary strands of a chromosome fragment and are relative terms: if one strand is called the positive strand, its complement is the negative strand. In one embodiment of the invention, the strand that matches the reference sequence is called the positive strand and the other strand the negative strand.
The processor 20 is used to execute a data processing program. Executing the data processing program includes aligning the sequencing data to a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result; the alignment result comprises alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and a plurality of the right arms. The alignment can be performed with known alignment software such as SOAP or BWA, or with TeraMap, the alignment software of the CG platform. In one embodiment of the invention, the alignment is performed with TeraMap, and the resulting alignment result is in TeraMap format. In one embodiment of the invention, eliminating the gap of each read in the alignment result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the gap with as many N as the gap size, where N is A, T, C or G; and leaving a read with a gap of 0 unchanged. For example, a read with a negative gap of -2 nt can be divided at the gap into two parts whose ends overlap by 2 nt. If the two parts of the read are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, gives the read ATCGCTTAAGTACGATTC.
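The gap-elimination rule can be written as a short Python function that reproduces the example above. The function name and the list-based representation of a read (segments plus the gaps between them) are assumptions for illustration, as is the choice to raise an error when the overlapping bases of a negative gap disagree; real data with sequencing errors may need a more lenient policy.

```python
def eliminate_gaps(segments, gaps):
    """Merge the segments of one read into a single gap-free sequence.

    segments: base strings in read order.
    gaps: gaps[i] is the gap (in nt) between segments[i] and segments[i + 1];
          negative means overlap, positive means unread bases, 0 means none.
    """
    if len(gaps) != len(segments) - 1:
        raise ValueError("need exactly one gap between adjacent segments")

    merged = segments[0]
    for gap, segment in zip(gaps, segments[1:]):
        if gap < 0:
            overlap = -gap
            # The overlapping bases were read twice and should agree.
            if merged[-overlap:] != segment[:overlap]:
                raise ValueError("overlapping bases disagree")
            merged += segment[overlap:]
        elif gap > 0:
            # Unread bases are written as N, one per missing position.
            merged += "N" * gap + segment
        else:
            merged += segment
    return merged
```

Calling `eliminate_gaps(["ATCGCTTAAG", "AGTACGATTC"], [-2])` returns `ATCGCTTAAGTACGATTC`, the read given in the text.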
The storage unit 30 is used to store data. It stores the data processing program described above, the sequencing data from the data receiving unit 10, and the intermediate data or results produced by the processor 20.
Figure 3 shows a schematic structural diagram of one embodiment of the sequencing data processing system of the invention. The sequencing data processing system 1000 includes a sequencing data processing device 100, a host 200 and a display device 300. The host 200 may be an audio/video/signal source device, such as a computer host or a mainframe, which transmits the display data required by the display device 300. The host 200 includes at least one interface electrically connected to the sequencing data processing device 100. The sequencing data processing device 100 receives the sequencing data output by the host 200, processes the sequencing data, and then outputs the processed data or results to the display device 300.
Figure 4 is a flow chart of sequencing data processing in one embodiment of the sequencing data processing method of the present invention. The method comprises the steps of: S1, acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read containing a gap, and the two reads of a read pair being defined as the left arm and the right arm respectively; S2, aligning the sequencing data to a reference sequence to obtain alignment results, the alignment results including alignment results for a plurality of the read pairs and/or alignment results for a plurality of the left arms and a plurality of the right arms; S3, eliminating the gap in each read of the alignment results to obtain universal alignment results. For the features of how read pairs are obtained, the gaps contained in reads, alignment, gap elimination, the alignment results and the universal alignment results, reference may be made to the description above of the corresponding technical features of the sequencing data processing device in one aspect of the invention or in any of its embodiments. For example, here too, read pairs whose reads come from two positions of one chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library. In one embodiment of the invention, a plurality of read pairs is obtained with the library construction method and sequencing platform of Complete Genomics (CG); the distance between the two reads of a read pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme. The CG platform constructs a multi-adapter mate-pair library by enzymatic digestion and sequences the constructed circular library with its combinatorial probe-anchor ligation (cPAL) sequencing technology, reading the bases on both sides of each adapter. Because the mate-pair library is built by enzymatic cutting followed by ligation of the two segments of an adapter, and because each enzyme has a preferred cutting distance but in practice often cuts one position further or one position nearer than that distance, reads often carry a gap, commonly +1 or -1. In addition or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position tends to vary from one digestion to the next, and this variation also introduces gaps into the reads. For example, when a multi-adapter circular library is built using two digestions with the Alu enzyme to attach different parts of several adapters, reading the bases next to these adapters produces reads with gaps of +3/-3. In the present invention the gap size may also be 0. Among the reads from a multi-adapter library, any read can form a read pair with any other read. The "positive strand" and "negative strand" referred to here are the two complementary strands that make up a chromosome fragment. The terms are relative: if one strand is called the positive strand, its complementary strand is the negative strand. Here, the strand that matches the reference sequence is called the positive strand and the other strand is called the negative strand. Alignment can be performed with known alignment software such as SOAP or BWA, or with TeraMap, the alignment software of the CG platform. In one embodiment of the invention, alignment is performed with TeraMap, and the resulting alignment results are in TeraMap format. In one embodiment of the invention, eliminating the gap in each read of the alignment results means the following: for a read with a negative gap, the negative gap is removed, that is, the overlapping bases are removed; for a read with a positive gap, the gap is filled with as many N as the size of the gap, where N is A, T, C or G; a read with a gap of 0 is left unchanged. For example, a read with a negative gap of -2 nt can be split at the gap into two parts whose ends overlap by 2 nt. If the two parts are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, gives the read ATCGCTTAAGTACGATTC.
Figure 5 is a data processing flow chart of one embodiment of the sequencing data processing method of the present invention. The method comprises: S10, acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read containing a gap, and the two reads of a read pair being defined as the left arm and the right arm respectively; S20, aligning the sequencing data to a reference sequence to obtain alignment results, the alignment results including alignment results for a plurality of the read pairs and/or alignment results for a plurality of the left arms and a plurality of the right arms; S30, extracting the unique alignment results from the alignment results and using them in place of the alignment results, the unique alignment results comprising a plurality of read pairs that align uniquely to the reference sequence, each such read pair aligning to the same chromosome of the reference sequence, and the distance between the two reads of each read pair being consistent with the expected distance between the two positions of the chromosome fragment from which they come; S40, correcting the unique alignment results so that both reads of each read pair align to the positive strand of the same chromosome of the reference sequence. For example, for a read pair whose two reads align to the positive strand and the negative strand of a chromosome respectively, the read that aligns to the negative strand is converted to its reverse complement; replacing the read with its reverse complement achieves the correction; S50, eliminating the gap in each read of the unique alignment results to obtain universal alignment results.
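Steps S30 and S40 can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the method's actual implementation: the `AlignedRead` record, its field names, the use of a hit count to express unique alignment, and the 700 bp default for the expected distance are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, replace

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


@dataclass(frozen=True)
class AlignedRead:
    seq: str     # read bases
    chrom: str   # chromosome the read aligned to
    pos: int     # leftmost aligned position on the reference
    strand: str  # "+" (positive strand) or "-" (negative strand)
    hits: int    # number of reference locations the read aligned to


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_unique_pair(left: AlignedRead, right: AlignedRead, max_distance: int = 700) -> bool:
    """S30 (sketch): both arms align uniquely, to the same chromosome,
    and no further apart than the expected distance (placeholder default)."""
    return (
        left.hits == 1
        and right.hits == 1
        and left.chrom == right.chrom
        and abs(right.pos - left.pos) <= max_distance
    )


def to_positive_strand(read: AlignedRead) -> AlignedRead:
    """S40 (sketch): replace a negative-strand read with its reverse complement."""
    if read.strand == "-":
        return replace(read, seq=reverse_complement(read.seq), strand="+")
    return read
```

A pair that passes `is_unique_pair` would then have both arms passed through `to_positive_strand` before gap elimination in S50.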
Figure 6 is a detection flow chart of one embodiment of the CNV detection method of the present invention. The CNV detection method comprises the steps of: S11, obtaining nucleic acid from the sample to be tested; S12, sequencing the nucleic acid to obtain sequencing data; S13, processing the sequencing data to obtain universal alignment results; S14, detecting the CNV based on the universal alignment results; wherein S13 is performed with the sequencing data processing device and/or the sequencing data processing method of one aspect of the invention or of any of its embodiments. CNV detection based on the universal alignment results can use currently known CNV detection methods, such as a hidden Markov model, circular binary segmentation, rank segmentation, or a kernel smoothing algorithm.
Figure 7 is a detection flow chart of one embodiment of the CNV detection method of the present invention. The CNV detection method comprises the steps of: S110, obtaining nucleic acid from the sample to be tested; S120, sequencing the nucleic acid to obtain sequencing data; S130, processing the sequencing data to obtain universal alignment results, S130 being performed with the sequencing data processing device and/or the sequencing data processing method of one aspect of the invention or of any of its embodiments described above; S140, setting a plurality of windows on the reference sequence and calculating the sequencing depth of each window, where the sequencing depth of a window = the number of reads in the universal alignment results that align to the window / the size of the window; S150, correcting the sequencing depth of the window using the relationship between sequencing depth and GC content to obtain the corrected sequencing depth of the window; S160, determining that the CNV is present in the nucleic acid of the sample to be tested when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window in the control samples, wherein the window is a part of the reference sequence. The number of control samples is not less than 30. With at least 30 samples, the sample data can be treated as following a specific distribution and is therefore suitable for most statistical tests; for example, tests such as the t-test and the z-test generally require that the data from multiple samples follow a normal distribution. The corrected sequencing depth of the same window of a control sample is obtained by correcting the sequencing depth of that window of the control sample using the relationship between sequencing depth and GC content, where the sequencing depth of the same window of the control sample = the number of reads in the control sample's sequencing data that align to the window / the size of the window. The sequencing data, alignment results and so on of the control samples can be obtained by the sequencing data processing method of one aspect of the invention or of any of its embodiments described above; they can be obtained at the same time as the sequencing data and alignment results of the sample to be tested, or obtained in advance and stored for later use. The relationship between sequencing depth and GC content can be established as follows: obtain sequencing data from the nucleic acid of a plurality of control samples, the sequencing data consisting of a plurality of reads; set a plurality of windows on the reference sequence, align the sequencing data of each control sample to the windows of the reference sequence, count for each control sample the reads that align to each window, and obtain the sequencing depth of each window, the window being a part of the reference sequence and the sequencing depth of a window = the total number of reads from all control samples that align to the window / (the number of control samples * the size of the window); then, based on the sequencing depth and the GC content of each window, establish the relationship between sequencing depth and GC content by two-dimensional regression analysis, for example Lowess regression.
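The depth formulas of S140 and the comparison of S160 can be sketched as follows. This is a minimal illustration, not the detection program itself: the function names are hypothetical, reads are assigned to a window by their start position (an assumption), the GC correction of S150 (for example the Lowess fit) is deliberately left out so the depths passed to `z_score` are assumed to be already corrected, and the z-score is one possible way to express a "significant difference" from the controls.

```python
from statistics import mean, stdev


def window_depth(read_starts, window_start, window_size):
    """Sequencing depth of a window = reads aligned to the window / window size.
    Assumption: a read belongs to the window that contains its start position."""
    n_reads = sum(1 for p in read_starts if window_start <= p < window_start + window_size)
    return n_reads / window_size


def pooled_control_depth(read_counts, window_size):
    """Depth of a window pooled over control samples =
    total reads over all controls / (number of controls * window size)."""
    return sum(read_counts) / (len(read_counts) * window_size)


def z_score(sample_depth, control_depths):
    """Distance of the sample's corrected depth from the control mean,
    in units of the control standard deviation."""
    return (sample_depth - mean(control_depths)) / stdev(control_depths)
```

In use, `control_depths` would hold the corrected depth of the same window in each of the 30 or more control samples, and a window would be flagged when the absolute z-score exceeds a chosen threshold.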
Figure 8 is a schematic diagram of the structure of one embodiment of the CNV detection apparatus of the present invention. The apparatus 2000 includes: a nucleic acid acquisition device 200 for obtaining nucleic acid from the sample to be tested; a sequencing device 400 for sequencing the nucleic acid from the nucleic acid acquisition device to obtain sequencing data; a data processing device 600 for processing the sequencing data from the sequencing device to obtain universal alignment results; and a detection device 800 for detecting the CNV based on the universal alignment results from the data processing device 600. The data processing device 600 includes: a data receiving unit 610 for receiving the sequencing data from the sequencing device, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read containing a gap, and the two reads of a read pair being defined as the left arm and the right arm respectively; a processor 630 for executing a data processing program, execution of which comprises aligning the sequencing data to a reference sequence to obtain alignment results and eliminating the gap in each read of the alignment results to obtain universal alignment results, the alignment results including alignment results for a plurality of the read pairs and/or alignment results for a plurality of the left arms and a plurality of the right arms; and at least one storage unit 650 for storing data, including the data processing program. The description above of the advantages and technical features of the CNV detection method of one aspect of the invention or of any of its embodiments applies equally to the CNV detection apparatus of this aspect of the invention and is not repeated here. Moreover, a person of ordinary skill in the art will understand that all or some of the units of this apparatus may optionally and detachably include one or more subunits to perform or implement the embodiments of the CNV detection method of the invention described above.
The following examples merely illustrate preferred embodiments of the invention. Where specific techniques or conditions are not stated in the examples, they follow the techniques or conditions described in the literature of the field (for example, J. Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd edition, translated by Huang Peitang et al., Science Press) or the product instructions. Reagents and instruments whose manufacturer is not indicated are conventional, commercially available products or services.
Example 1
Peripheral blood plasma from a lung cancer patient was used as the test material; the sample came from Southwest Hospital. Testing was carried out as follows:
(1) Library construction and sequencing
The library construction and sequencing workflow is shown in Figure 9. All specific sequences given below are written from the 5' end to the 3' end, left to right. Text between "/ /" in a sequence denotes a terminal modification group: "phos" denotes phosphorylation, "dd" denotes dideoxy, and "bio" denotes biotin.
1. Extraction of cfDNA (using the SnoMag Circulating DNA Kit):
1) Add 200 μl of plasma to a 1.5 ml EP tube, then add 600 μl of Buffer LSB.
2) Add 20 μl of NanoMag Circulating Beads and mix. Leave at room temperature for 10 min, mixing once every 2-3 min.
3) Place the EP tube on a magnetic rack for 1 min to collect the beads, then discard the supernatant.
4) Remove the EP tube from the rack, add 150 μl of Buffer WA, and mix.
5) Place the EP tube on a magnetic rack for 1 min to collect the beads, then discard the supernatant.
6) Remove the EP tube from the rack, add 150 μl of 75% ethanol, and mix.
7) Place the EP tube on a magnetic rack for 1 min to collect the beads, then discard the supernatant.
8) Repeat steps 6-7 once.
9) Dry the magnetic beads at room temperature for 5 min.
10) Add 32 μl of elution buffer, mix the beads, and leave at room temperature for 5 min.
11) Place the EP tube on a magnetic rack for 1 min, then transfer the supernatant to a new 1.5 ml EP tube.
2. Library construction:
1) rSAP dephosphorylation
Component            Volume
cfDNA                30 μl
10x NEBuffer 2       3.5 μl
rSAP (1 U/μl)        1.5 μl
Total                35 μl
Reaction conditions:
Figure PCTCN2014093511-appb-000001
2) End blunting with T4 DNA Polymerase
Figure PCTCN2014093511-appb-000002
Reaction conditions:
12°C    20 min
4°C     hold
Purify the reaction product with 60 μl of Ampure XP beads and elute in 22 μl of elution buffer.
3) Ligation of the first sequence and the second sequence to the two ends of the blunt-ended DNA fragments
Figure PCTCN2014093511-appb-000003
Reaction conditions:
20°C    15 min
4°C     hold
Purify the reaction product with 40 μl of Ampure XP beads and elute in 22 μl of elution buffer.
The two strands of the first sequence are: TTGGCCTCCGACT/3-ddT/ (SEQ ID NO: 1) and /5phos/AAGTCGGAGGCCAAGCGGTCGT/ddC/ (SEQ ID NO: 2).
The two strands of the second sequence are: /5Phos/GTCTCCAGTCGAAGCCCGACG/3ddC/ (SEQ ID NO: 3) and GCTTCGACTGGAGA/3ddC/ (SEQ ID NO: 4).
4) Nick translation
Figure PCTCN2014093511-appb-000004
The upstream primer of the third sequence is /5-bio/TCCTAAGACCGCTTGGCCTCCGACT (SEQ ID NO: 5).
The downstream primer of the third sequence is 5Phos/AGACAAGCTCxxxxxxxxxxGATCGGGCTTCGACTGGAGAC (SEQ ID NO: 6). The "x" positions in the middle are a variable tag sequence region and can be written as N, where N is A, T, C or G. When the library of only one sample is loaded on the sequencer, with no libraries from other samples pooled with it, no tag sequence is needed, and the downstream primer of the third sequence can be 5Phos/AGACAAGCTCGATCGGGCTTCGACTGGAGAC (SEQ ID NO: 7). In this example the sample is tumor cell-free nucleic acid, so the content of target nucleic acid (ctDNA) in the mixed nucleic acid is low. If the libraries of several such samples were pooled for sequencing, the pooled data would have to be split and assigned to the individual samples, and part of the data would be lost. In addition, the single-adapter circular library built here gives relatively short reads, so accurate detection of variation requires deep sequencing to obtain a relatively large amount of data. It is therefore preferable to load a single sample library per run.
Reaction conditions:
60°C    5 min
37°C    0.1°C/sec, hold
Add 8 μl of the following Nick Translation mix to the reaction from the previous step:
Figure PCTCN2014093511-appb-000005
Reaction conditions:
37°C    20 min
4°C     hold
Purify the reaction product with 40 μl of Ampure XP beads and elute in 37.4 μl of elution buffer.
5) PCR with Pfx
Figure PCTCN2014093511-appb-000006
Reaction conditions:
Figure PCTCN2014093511-appb-000007
Purify the reaction product with 50 μl of Ampure XP beads and elute in 22 μl of elution buffer.
6) Qubit quantification
Measure the concentration of the PCR product with the Qubit dsDNA HS assay kit.
7) Strand separation
a) Pool multiple libraries to give about 160 ng of DNA in total. Add 1X TE to the sample to a total volume of 60 μl.
b) Prepare the following reagents in advance: 4X BBB, Streptavidin Beads, 0.3 M MOPS acid, 0.5% Tween20, 1X BBB/Tween Mix, 1X BWB/Tween Mix, 0.1 M NaOH. Of these, the 1X BWB/Tween Mix, the 0.1 M NaOH and the Streptavidin Beads must be prepared fresh before use.
c) Prepare the following four reagents 15 min in advance: 0.5% Tween20, 1X BBB/Tween Mix, 1X BWB/Tween Mix, 0.1 M NaOH. The 0.5% Tween20 is prepared as described above; the other three are prepared as follows:
d) 1X BBB/Tween Mix
1X BBB           30 μl
0.5% Tween20     0.3 μl
Total            30.3 μl
e) 1X BWB/Tween Mix
1X BWB           2000 μl
0.5% Tween20     20 μl
Total            2020 μl
f) 0.1 M NaOH
0.5 M NaOH       15.6 μl
Water            62.4 μl
Total            78.0 μl
g) Wash the Streptavidin Beads as follows:
· Take 30 μl of Streptavidin Beads per sample. Add 3-5 volumes of 1X BBB, mix, and place on a magnetic rack to let the beads collect. Turn the non-stick tube so that the beads move back and forth through the 1X BBB wash. Discard the supernatant, then repeat the wash once.
· Remove the non-stick tube from the rack, resuspend the beads in 1 volume (30 μl) of 1X BBB/Tween Mix, mix, and leave at room temperature.
h) Add 20 μl of 4X BBB to the 60 μl PCR product sample and mix, then transfer it to the non-stick tube from the previous step, which contains the beads resuspended in 30 μl of 1X BBB/Tween Mix, and mix. Let this 110 μl mixture bind at room temperature for 15-20 min, flicking gently once midway to mix.
i) Place the non-stick tube on the magnetic rack for 3-5 min, discard the supernatant, and wash twice with 1 ml of 1X BWB/Tween Mix in the same way as for washing the Streptavidin Beads.
j) Add 26 μl of 0.1 M NaOH to the beads, pipette to mix, and leave for 10 min. Place on the magnetic rack for 3-5 min, then transfer the supernatant to a new 1.5 ml EP tube.
k) Add 13 μl of 0.3 M MOPS to the 1.5 ml EP tube and mix; set aside for later use.
l) The product of this step can be stored frozen at -20°C.
8) Splint circularization
a) Add 10 μl of the 20 μM fourth sequence to the 39 μl of sample from the previous step. The fourth sequence is TCGAGCTTGTCTTCCTAAGACCGC (SEQ ID NO: 8);
b) Prepare the ligase reaction mix 5 min in advance, as follows:
Water                  4.2 μl
10x TA Buffer (LK1)    6 μl
100 mM ATP             0.6 μl
600 U/μl Ligase        0.2 μl
Total                  11 μl
c) Vortex the ligase reaction mix thoroughly and spin it down. Add 11 μl of the ligase reaction mix to the EP tube that already contains the primer reaction mixture, vortex for 10 s to mix, and spin briefly.
d) Incubate in a PCR machine at 37°C for 1.5 h.
e) When the reaction is complete, take out 5 μl of sample for analysis on a 6% denaturing gel; the remaining volume of about 55 μl goes on to the next enzymatic reaction.
9) Enzymatic digestion (Exo I and III)
a) Prepare the primer reaction mix about 5 min in advance, as follows:
10x TA Buffer (LK1)    1 μl
20 U/μl Exo I          3 μl
200/μl Exo III         1 μl
Total                  5 μl
b) Vortex the mix thoroughly and spin it down, then add 5 μl of the reaction mix to each 55 μl sample from the previous step;
c) Vortex for 10 s to mix, spin down, and incubate in a PCR machine at 37°C for 30 min.
d) After the 30 min digestion, add 2.5 μl of 500 mM EDTA to the sample to stop the enzymatic reaction.
e) Purify the sample with PEG32 beads/Tween20 as follows:
Transfer 59 μl of the sample from the previous step to a 1.5 ml non-stick tube, add 78 μl of PEG32 beads/Tween20 (PEG32 beads:Tween20 = 100:1), and bind at room temperature for 15 min, pipetting to mix once during this time;
f) Place the non-stick tube on the magnetic rack for 3-5 min and discard the supernatant. Wash twice with 700 μl of 75% ethanol; during each wash, turn the tube front to back 2-3 times so that the beads move through the ethanol;
g) Air-dry at room temperature, then redissolve in 27 μl of TE/Tween20 (TE:Tween20 = 500:1) for a total of 15 min, mixing once midway;
h) Transfer the supernatant to a new 1.5 ml EP tube and quantify the final product with the Qubit™ ssDNA Assay Kit. Mix buffer and dye at a ratio of 199:1, then vortex and spin down to give the dye working solution. Add 190 μl of the working solution to 10 μl of each of the two standards, vortex and spin down. Add 198 μl of the working solution to 2 μl of sample, vortex, spin down, and measure on the Qubit instrument.
i) Concentration normalization
Based on the concentration measured by single-strand quantification, adjust the starting amount of sample used for DNB preparation to a uniform 35.3 ng-53 ng. Transfer the corresponding volume of sample (<60 μl) to a Biorad PCR plate and top up with 1X TE so that the total volume does not exceed 120 μl.
At a final concentration of 5.625-7.5 fmol/μl in a volume of 120 μl, the total amount is 35.3 ng-53 ng. DNB preparation for 1-adapter sequencing requires 120 fmol, that is, 16 μl at 7.5 fmol/μl. The library therefore needs to be diluted to 7.5 fmol/μl.
10) CG 1-Adapter sequencing
Sequencing follows the standard workflow of the CG platform. The DNA nanochip is a high-throughput sequencing technology originated by CG. In this example the improved single-adapter sequencing library is sequenced; compared with other sequencing schemes this costs less, is faster, and has integrated quality control to ensure sequencing quality.
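Two pieces of arithmetic in the protocol above, the 0.1 M NaOH preparation in step 7) and the normalization to 7.5 fmol/μl before DNB preparation, can be checked with a short Python sketch. The function names are hypothetical, and the measured library concentration of 15 fmol/μl in the usage lines is an invented value used only to show the calculation.

```python
def diluted_concentration(stock_conc, stock_volume, diluent_volume):
    """Concentration after mixing a stock with diluent (C1*V1 = C2*V2)."""
    return stock_conc * stock_volume / (stock_volume + diluent_volume)


def amount_fmol(conc_fmol_per_ul, volume_ul):
    """Amount of DNA (fmol) contained in a given volume."""
    return conc_fmol_per_ul * volume_ul


def dilution_volumes(measured_conc, target_conc, final_volume):
    """Return (library volume, diluent volume) needed to reach target_conc
    in final_volume, starting from a library at measured_conc."""
    if measured_conc < target_conc:
        raise ValueError("library is already below the target concentration")
    library_volume = target_conc * final_volume / measured_conc
    return library_volume, final_volume - library_volume


# 15.6 ul of 0.5 M NaOH plus 62.4 ul of water, as in step 7) f).
naoh_molar = diluted_concentration(0.5, 15.6, 62.4)
# 16 ul at 7.5 fmol/ul, as required for DNB preparation.
dnb_input_fmol = amount_fmol(7.5, 16)
# Hypothetical library measured at 15 fmol/ul, diluted with 1X TE to 120 ul.
library_ul, te_ul = dilution_volumes(15.0, 7.5, 120.0)
```

The first two values reproduce the 0.1 M and 120 fmol figures given in the text; the third shows how the volume of library and of 1X TE would follow from a measured concentration.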
实施例二 Embodiment 2
The off-machine data from Embodiment 1 are processed. Using the sequencing data processing method and/or the CNV detection method of the present invention together with CG platform sequencing, ultra-low amounts of cfDNA can be enriched, built into a library, sequenced, and analyzed. In this example, owing to the particular CG sequencing principle, the reads are short and show re-sequenced bases and small gaps at specific positions, so the raw reads are difficult to align directly with ordinary alignment software.
To handle this read structure, we align with TeraMap, the CG platform's proprietary aligner. It works as follows. First, the two ends of a read (LeftArm, RightArm) are aligned separately, and TeraMap tries multiple gap values for each arm in order to obtain more alignments. Next, the alignment of each arm is used as a reference for a local alignment of the other arm (for example, for 4-AD the local alignment range is 0-700 bp). If both arms align well to the same chromosome and the insert size is as expected (for example, for 4-AD the two reads of a read pair lie 0-700 bp apart), only the best alignment is output; otherwise all alignments of both arms are output. TeraMap maps CG-specific reads to the reference genome. Its output format consists of three parts: line 1 holds the read sequence information; lines 2 and 3 summarize the alignment of the read; lines 4 and 5 give the detailed alignment results.
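The output rule described above (report only the best pair when both arms agree, otherwise report everything) can be sketched as follows. This is a minimal illustration, not TeraMap's implementation: the `ArmHit` record, the additive score, and the use of the start-position difference as the arm distance are all assumptions; only the 0-700 bp range comes from the 4-AD example in the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ArmHit:
    chrom: str  # reference sequence the arm aligned to
    pos: int    # leftmost aligned position
    score: int  # hypothetical alignment score, higher is better


Pair = Tuple[Optional[ArmHit], Optional[ArmHit]]


def select_pair_output(left_hits: List[ArmHit], right_hits: List[ArmHit],
                       min_dist: int = 0, max_dist: int = 700) -> Tuple[str, List[Pair]]:
    """Return ("best", [pair]) if a concordant pair exists, else ("all", every hit)."""
    # A pair is concordant when both arms hit the same chromosome and lie
    # within the expected distance of each other.
    concordant = [
        (l, r)
        for l in left_hits
        for r in right_hits
        if l.chrom == r.chrom and min_dist <= abs(r.pos - l.pos) <= max_dist
    ]
    if concordant:
        best = max(concordant, key=lambda p: p[0].score + p[1].score)
        return "best", [best]
    # No concordant pair: report every arm alignment separately.
    unpaired: List[Pair] = [(l, None) for l in left_hits]
    unpaired += [(None, r) for r in right_hits]
    return "all", unpaired
```

Summing the two arm scores is only one possible way to rank concordant pairs; any monotone combination would serve the same purpose in this sketch.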
Line 1:
Column  Field  Type     Description
1       QNAME  String   Reference sequence ID
2       POS    Integer  Position aligned to on the reference sequence
3       SEQ    String   Sequence of the aligned fragment
Line 2:
Figure PCTCN2014093511-appb-000008
Line 4:
Figure PCTCN2014093511-appb-000009
Figure PCTCN2014093511-appb-000010
Because TeraMap alignments contain gaps, downstream analysis cannot be run on them directly. The Teramap2Sam software was therefore developed according to the method of the present invention to remove the gaps from TeraMap alignment results and convert them to SAM (Sequence Alignment/Map) format. The main workflow of Teramap2Sam has three parts; the algorithm flowchart is shown in Figure 10.
Step 1: Extract unique alignments. Whether a read is uniquely aligned is determined from the matchCount field of the TeraMap output; in addition, the insert length must meet the requirement and both arms of the read must align to the same reference sequence.
Step 2: Remove gaps. The gap positions in each read are determined from the gaps field, and the read sequence is corrected accordingly.
Step 3: Compute FLAG. The FLAG value in the SAM file is computed from the alignment orientation of the paired-end reads, recording the alignment status.
SAM is a widely used format for storing alignment information. Each line is the alignment of one read and consists of eleven mandatory fields, which may be followed by optional fields carrying more information; for example, XT:A:U indicates that the read is uniquely aligned. A brief description follows:
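The three steps above can be sketched as below. This is an illustrative sketch under stated assumptions, not the Teramap2Sam source: the record keys other than `matchCount`, the `(offset, size)` representation of the gaps field, the 0-700 bp insert bounds (borrowed from the 4-AD example), and the choice of which re-sequenced bases to drop are all hypothetical. The FLAG bit values are those defined by the SAM specification.

```python
from typing import Dict, List, Tuple


def is_unique_pair(rec: Dict, min_insert: int = 0, max_insert: int = 700) -> bool:
    """Step 1: keep pairs that map once, to one reference, at a valid insert size."""
    return (
        rec["matchCount"] == 1
        and rec["left_chrom"] == rec["right_chrom"]
        and min_insert <= rec["insert_size"] <= max_insert
    )


def remove_gaps(seq: str, gaps: List[Tuple[int, int]]) -> str:
    """Step 2: fill positive gaps with N and drop bases covered by negative gaps.

    Each gap is (offset, size): offset is an index into the raw read; a
    positive size means bases are missing at that point, a negative size
    means the next |size| bases were sequenced twice (assumed layout).
    """
    out = []
    prev = 0
    for offset, size in sorted(gaps):
        out.append(seq[prev:offset])
        if size > 0:
            out.append("N" * size)   # placeholder for the unsequenced bases
            prev = offset
        else:
            prev = offset + (-size)  # skip the re-sequenced bases
    out.append(seq[prev:])
    return "".join(out)


def sam_flag(is_first: bool, reverse: bool, mate_reverse: bool,
             proper_pair: bool = True) -> int:
    """Step 3: build the SAM FLAG from pairing and orientation."""
    flag = 0x1                       # read is paired
    if proper_pair:
        flag |= 0x2                  # both mates properly aligned
    if reverse:
        flag |= 0x10                 # this read on the reverse strand
    if mate_reverse:
        flag |= 0x20                 # mate on the reverse strand
    flag |= 0x40 if is_first else 0x80  # first or second read of the pair
    return flag
```

For a forward first read with a reverse mate, `sam_flag(True, False, True)` gives 99, and the mate, `sam_flag(False, True, False)`, gives 147.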
Figure PCTCN2014093511-appb-000011
Figure PCTCN2014093511-appb-000012
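As an illustration of the record layout, the sketch below splits one SAM line into its eleven mandatory fields and any optional tags. The field names follow the SAM specification; the function names `parse_sam_line` and `is_unique` are hypothetical helpers, and real pipelines would normally use an existing SAM/BAM library instead.

```python
from typing import Dict

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line: str) -> Dict:
    """Parse one tab-separated SAM alignment line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    rec: Dict = dict(zip(SAM_FIELDS, cols[:11]))
    for name in INT_FIELDS:
        rec[name] = int(rec[name])
    # Optional fields have the form TAG:TYPE:VALUE, for example XT:A:U.
    tags = {}
    for item in cols[11:]:
        tag, typ, value = item.split(":", 2)
        tags[tag] = (typ, value)
    rec["tags"] = tags
    return rec


def is_unique(rec: Dict) -> bool:
    """True if the read carries XT:A:U, the unique-alignment tag named in the text."""
    return rec["tags"].get("XT") == ("A", "U")
```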
In practice, the binary compressed form of SAM (BAM) is mainly used to save storage. In addition, CG developed Assembly Software for its read structure, which reassembles the reads; follow-up work such as variant detection is carried out after assembly.
Because the special structure of CG single-adapter reads makes them very short (as short as 12 bp), CG's own variant detection tools are unsuitable or perform poorly on some special data. To address this, we first developed a tool that converts TeraMap alignments to the general SAM/BAM format, the alignment format in common use for high-throughput sequencing, and then detect copy number variation from the BAM data. Existing copy number variation detection methods include hidden Markov models, circular binary segmentation, rank segmentation, and kernel smoothing. We obtain copy number variation results with a z-score (standard score) based on the read-depth distribution over multiple windows totaling up to 1,000,000 bp.
Because the GC content of reads affects sequencing depth in practice, we apply a GC correction to the alignment results (BAM) to remove the effect of GC content on depth. Specifically, the GC content and mean sequencing depth are computed for multiple windows totaling up to 1,000,000 bp, a LOWESS regression is fitted to the GC-versus-depth data, and the GC bias is corrected according to the regression curve.
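A minimal sketch of such a correction is given below, in pure Python so that it runs without third-party packages. It fits a locally weighted linear regression (tricube weights, no robustness iterations) of depth on GC content and then rescales each window's depth by the ratio of the overall mean depth to the fitted depth. The text does not state the correction formula, so that rescaling, the `frac` bandwidth parameter, and the function names are assumptions.

```python
import math
from typing import List


def lowess_fit(x: List[float], y: List[float], frac: float = 0.5) -> List[float]:
    """Fitted values of a locally weighted linear regression at each x."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours used per local fit
    fitted = []
    for xi in x:
        dists = sorted(abs(xj - xi) for xj in x)
        h = dists[k - 1] or 1e-12            # bandwidth: k-th nearest distance
        sw = swx = swy = swxx = swxy = 0.0
        for xj, yj in zip(x, y):
            u = abs(xj - xi) / h
            if u >= 1.0:
                continue
            w = (1.0 - u ** 3) ** 3          # tricube weight
            sw += w
            swx += w * xj
            swy += w * yj
            swxx += w * xj * xj
            swxy += w * xj * yj
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)          # degenerate case: weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * xi)
    return fitted


def gc_correct(gc: List[float], depth: List[float], frac: float = 0.5) -> List[float]:
    """Rescale window depths so the fitted GC trend becomes flat (assumed formula)."""
    fit = lowess_fit(gc, depth, frac)
    mean_depth = sum(depth) / len(depth)
    return [d * mean_depth / f if f > 0 else d for d, f in zip(depth, fit)]
```

When depth depends on GC content exactly linearly, the local fits reproduce the data and every corrected depth equals the overall mean, which is the intended effect of removing the GC trend.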
The standard score, also called the z-score, is the difference between a score and the mean divided by the standard deviation: z = (x - μ)/σ, where x is a particular raw score, μ is the mean, and σ is the standard deviation. The z value expresses the distance between the raw score and the population mean in units of standard deviation. z is negative when the raw score is below the mean and positive when it is above. In this example, copy number variation is detected by measuring, in units of standard deviation, the distance between the read count in a 2000 bp window (the raw score) and the mean read count of the population (multiple normal control samples). A positive z indicates a copy number greater than 2 (a normal sample is diploid), such as a duplication; a negative z indicates a copy number less than 2, such as a deletion. The CNV detection method of this embodiment was implemented as a program named calcu_zscore_query, and regions where the absolute value of z is greater than 3 are judged to carry a CNV.
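The per-window z-score test can be sketched as follows. This is not the calcu_zscore_query program itself: the function names and data layout are hypothetical, and the use of the population standard deviation over the controls is an assumption, since the text does not say which estimator is used. The threshold of 3 comes from the text.

```python
import statistics
from typing import List


def window_zscores(sample_counts: List[float],
                   control_counts: List[List[float]]) -> List[float]:
    """z = (x - mu) / sigma per window, with mu and sigma taken from controls.

    control_counts holds one list of per-window read counts per control sample.
    """
    z = []
    for w, x in enumerate(sample_counts):
        ctrl = [counts[w] for counts in control_counts]
        mu = statistics.mean(ctrl)
        sigma = statistics.pstdev(ctrl)  # population SD (assumed estimator)
        z.append((x - mu) / sigma if sigma > 0 else 0.0)
    return z


def call_cnv(z: List[float], threshold: float = 3.0) -> List[str]:
    """Label each window: |z| > threshold is called as a gain or a loss."""
    calls = []
    for value in z:
        if value > threshold:
            calls.append("gain")    # copy number above 2, e.g. duplication
        elif value < -threshold:
            calls.append("loss")    # copy number below 2, e.g. deletion
        else:
            calls.append("normal")
    return calls
```

A window whose controls show no variation gets z = 0 here to avoid division by zero; a production tool would more likely flag such windows as uninformative.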
Compared with conventional methods, our approach based on CG single-adapter sequencing supports library construction and sequencing from ultra-low input: library construction needs only 1-10 ng of nucleic acid, which requires 2-5 ml of peripheral blood, and the standardized CG workflow is simple and fast. Once converted to SAM format, the TeraMap alignments are more widely usable than in the closed-source TeraMap format and can be processed with software such as Samtools. In addition, the z-score (standard score) detects CNVs quickly: CNV analysis of 50x whole-genome data takes only 4 hours, whereas the CONTRA software [http://sourceforge.net/projects/contra-cnv/] needs more than one day.
In this example, alignment is performed with TeraMap. After sequencing, the raw reads are obtained with makeADF, the CG platform's integrated tool, and are then aligned to the reference sequence with TeraMap. The resulting alignments are converted to the general SAM format with TeraMap2Sam. The results are shown in Table 1.
Table 1
Figure PCTCN2014093511-appb-000013

Claims (40)

  1. A sequencing data processing device, characterized by comprising:
    a data receiving unit for receiving sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, the two reads of each read pair coming respectively from the positive strand and the negative strand of the chromosome fragment, or both coming from the positive strand of the chromosome fragment or the negative strand of the chromosome, each read containing a gap, and the two reads of a read pair being defined as a left arm and a right arm, respectively;
    a processor for executing a data processing program, wherein executing the data processing program comprises aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a general alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or
    the alignment result comprising alignment results of a plurality of the left arms and alignment results of a plurality of the right arms; and
    at least one storage unit for storing data, including the data processing program.
  2. The device of claim 1, characterized in that the aligning comprises:
    aligning the left arm and the right arm of each read pair separately with the reference sequence to obtain a primary left alignment result and a primary right alignment result,
    taking each of the primary left alignment result and the primary right alignment result in turn as a reference and aligning the other arm, to obtain a secondary left alignment result and a secondary right alignment result,
    based on the secondary left alignment result and the secondary right alignment result, obtaining the alignment results of a plurality of the read pairs, or obtaining the alignment results of a plurality of the left arms and of a plurality of the right arms.
  3. The device of claim 2, characterized in that the aligning comprises setting the size of the gap so that each left arm or each right arm is aligned with the reference sequence multiple times.
  4. The device of claim 3, characterized in that aligning each left arm or each right arm with the reference sequence multiple times comprises setting the gap of each left arm or each right arm to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt, and 7 nt, respectively, to obtain a corresponding plurality of reads, and aligning each of the corresponding plurality of reads with the reference sequence.
  5. The device of any one of claims 1-4, characterized in that the format of the alignment result is TeraMap.
  6. The device of any one of claims 1-5, characterized in that executing the data processing program further comprises, before eliminating the gap of each read in the alignment result, extracting the unique alignment results from the alignment result to replace the alignment result, the unique alignment results comprising a plurality of read pairs uniquely aligned to the reference sequence, each such read pair being aligned to the same chromosome of the reference sequence, and the distance between the two reads of each such read pair conforming to the distance between the two positions of the chromosome fragment.
  7. The device of claim 6, characterized in that executing the data processing program further comprises making a correction so that each read pair in the unique alignment results is aligned to the positive strand of the same chromosome of the reference sequence.
  8. The device of claim 6 or 7, characterized in that executing the data processing program further comprises performing data format conversion, the data format conversion comprising converting the format of the alignment result or of the unique alignment results.
  9. The device of any one of claims 1-8, characterized in that eliminating the gap of each read in the alignment result or in the unique alignment results comprises:
    if the read contains a positive gap, filling the positive gap with N according to its size,
    if the read contains a negative gap, removing the negative gap, wherein
    N is A, T, C, or G.
  10. The device of any one of claims 1-9, characterized in that the format of the general alignment result is SAM or BAM.
  11. A sequencing data processing system comprising a host and a display device, characterized in that the system further comprises the sequencing data processing device of any one of claims 1-10.
  12. A sequencing data processing method, characterized by comprising the following steps:
    obtaining sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, the two reads of each read pair coming respectively from the positive strand and the negative strand of the chromosome fragment, or both coming from the positive strand or the negative strand of the chromosome fragment, each read containing a gap, and the two reads of a read pair being defined as a left arm and a right arm, respectively;
    aligning the sequencing data with a reference sequence to obtain an alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or
    the alignment result comprising alignment results of a plurality of the left arms and alignment results of a plurality of the right arms;
    eliminating the gap of each read in the alignment result to obtain a general alignment result.
  13. The method of claim 12, characterized in that obtaining the sequencing data comprises constructing a sequencing library to obtain the sequencing library, the sequencing library being a single-stranded circular DNA library composed of one strand of the chromosome fragment and at least one predetermined DNA sequence.
  14. The method of claim 12, characterized in that the reads of each pair come respectively from the two ends of the chromosome fragment.
  15. The method of claim 14, characterized in that obtaining the sequencing result comprises constructing a sequencing library to obtain the sequencing library, the sequencing library being a single-stranded circular DNA library composed of one strand of the chromosome fragment and one predetermined DNA sequence joining the two ends of that strand.
  16. The method of claim 15, characterized in that constructing the sequencing library comprises:
    (1) extracting the nucleic acid to be tested;
    (2) phosphorylating the ends of the nucleic acid to obtain an end-phosphorylated product;
    (3) end-repairing the end-phosphorylated product to obtain an end-repaired product;
    (4) ligating a first sequence and a second sequence to the two ends of the end-repaired product to obtain a first ligation product;
    (5) performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair, at least one primer of which carries a biotin label;
    (6) separating single strands from the amplification product using the biotin label to obtain a single-stranded product;
    (7) circularizing the single-stranded product using a fourth sequence to obtain the sequencing library; wherein
    the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
  17. The method of claim 15, characterized in that constructing the sequencing library comprises:
    (1) extracting the nucleic acid to be tested;
    (2) end-repairing the nucleic acid to obtain an end-repaired product;
    (3) phosphorylating the ends of the end-repaired product to obtain an end-phosphorylated product;
    (4) ligating a first sequence and a second sequence to the two ends of the end-phosphorylated product to obtain a first ligation product;
    (5) performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair, at least one primer of which carries a biotin label;
    (6) separating single strands from the amplification product using the biotin label to obtain a single-stranded product;
    (7) circularizing the single-stranded product using a fourth sequence to obtain the sequencing library; wherein
    the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
  18. The method of any one of claims 12-17, characterized in that the aligning comprises:
    aligning the left arm and the right arm of each read pair separately with the reference sequence to obtain a primary left alignment result and a primary right alignment result,
    taking each of the primary left alignment result and the primary right alignment result in turn as a reference and aligning the other arm, to obtain a secondary left alignment result and a secondary right alignment result,
    based on the secondary left alignment result and the secondary right alignment result, obtaining the alignment results of a plurality of the read pairs, or obtaining the alignment results of a plurality of the left arms and of a plurality of the right arms.
  19. The method of any one of claims 12-18, characterized in that the aligning comprises setting the size of the gap so that each left arm or each right arm is aligned with the reference sequence multiple times.
  20. The method of claim 19, characterized in that aligning each left arm or each right arm with the reference sequence multiple times comprises setting the gap of each left arm or each right arm to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt, and 7 nt, respectively, to obtain a corresponding plurality of reads, and aligning each of the corresponding plurality of reads with the reference sequence.
  21. The method of any one of claims 12-20, characterized in that the format of the alignment result is TeraMap.
  22. The method of any one of claims 12-21, characterized in that, before eliminating the gap of each read in the alignment result, the unique alignment results are extracted from the alignment result to replace the alignment result, the unique alignment results comprising a plurality of read pairs uniquely aligned to the reference sequence, each such read pair being aligned to the same chromosome of the reference sequence, and the distance between the two reads of each such read pair conforming to the size of the chromosome fragment.
  23. The method of claim 22, characterized in that the unique alignment results are corrected so that each read pair in the unique alignment results is aligned to the positive strand of the same chromosome of the reference sequence.
  24. The method of claim 22 or 23, characterized in that obtaining the general alignment result further comprises performing data format conversion on the alignment result or the unique alignment results.
  25. The method of any one of claims 12-24, characterized in that eliminating the gap of each read in the alignment result or in the unique alignment results comprises:
    if the read contains a positive gap, filling the positive gap with N according to its size,
    if the read contains a negative gap, removing the negative gap, wherein
    N is A, T, C, or G.
  26. The method of any one of claims 12-25, characterized in that the format of the general alignment result is SAM or BAM.
  27. A computer-readable storage medium, characterized in that it stores a program for execution by a computer, execution of the program comprising carrying out the method of any one of claims 12-26.
  28. A method for detecting CNV, characterized by comprising:
    a. obtaining nucleic acid of a sample to be tested;
    b. sequencing the nucleic acid to obtain sequencing data;
    c. processing the sequencing data to obtain a general alignment result;
    d. detecting the CNV based on the general alignment result; wherein step c is performed using the sequencing data processing device of any one of claims 1-10.
  29. The method of claim 28, characterized in that step b comprises constructing a sequencing library from the nucleic acid to obtain the sequencing library, the sequencing library being a single-stranded circular DNA library.
  30. The method of claim 29, characterized in that constructing the sequencing library comprises:
    phosphorylating the ends of the nucleic acid to obtain an end-phosphorylated product;
    end-repairing the end-phosphorylated product to obtain an end-repaired product;
    ligating a first sequence and a second sequence to the two ends of the end-repaired product to obtain a first ligation product;
    performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair, at least one primer of which carries a biotin label;
    separating single strands from the amplification product using the biotin label to obtain a single-stranded product;
    circularizing the single-stranded product using a fourth sequence to obtain the sequencing library, wherein
    the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
  31. The method of claim 29, characterized in that constructing the sequencing library comprises:
    end-repairing the nucleic acid to obtain an end-repaired product;
    phosphorylating the ends of the end-repaired product to obtain an end-phosphorylated product;
    ligating a first sequence and a second sequence to the two ends of the end-phosphorylated product to obtain a first ligation product;
    performing nick translation and amplification on the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair, at least one primer of which carries a biotin label;
    separating single strands from the amplification product using the biotin label to obtain a single-stranded product;
    circularizing the single-stranded product using a fourth sequence to obtain the sequencing library, wherein
    the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
  32. The method of any one of claims 28-31, characterized in that the sequencing is performed using combinatorial probe-anchor ligation sequencing.
  33. The method of claim 28, characterized in that step d comprises:
    setting a plurality of windows on the reference sequence, and determining that the CNV is present in the nucleic acid of the sample to be tested when the amount of reads matched to a window in the general alignment result differs significantly from the amount of reads matched to the same window in the general alignment result of control samples, wherein
    the window is a part of the reference sequence.
  34. The method of claim 33, characterized in that the general alignment result of the control samples is obtained by the sequencing data processing method of any one of claims 12-26.
  35. The method of claim 28, characterized in that step d comprises:
    setting a plurality of windows on the reference sequence and calculating the sequencing depth of each window, where the sequencing depth of a window = the number of reads aligned to the window in the general alignment result / the size of the window;
    correcting the sequencing depth of the window using the relationship between sequencing depth and GC content to obtain a corrected sequencing depth of the window;
    determining that the CNV is present in the nucleic acid of the sample to be tested when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window of control samples, wherein
    the window is a part of the reference sequence.
  36. The method of any one of claims 33-35, characterized in that the number of control samples is not less than 30.
  37. The method of claim 35, characterized in that establishing the relationship between sequencing depth and GC content comprises:
    obtaining sequencing data of nucleic acids of a plurality of control samples, the sequencing data consisting of a plurality of reads;
    setting a plurality of windows on the reference sequence, aligning the sequencing data of each of the plurality of control samples to the windows of the reference sequence, counting, for each control sample, the number of reads in its sequencing data that align to each window, and obtaining the sequencing depth of each window, wherein the window is a part of the reference sequence, and the sequencing depth of the window = the total number of reads from all control samples that align to the window / (the number of control samples × the size of the window);
    establishing the relationship between sequencing depth and GC content by two-dimensional regression analysis, based on the sequencing depth of each window and the GC content of that window.
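The two per-window quantities that feed the regression can be sketched in Python as below. The function names are hypothetical, and excluding ambiguous bases (such as N) from the GC denominator is an assumption that the claim does not address.

```python
def gc_content(window_seq):
    """Fraction of G and C among the called bases (A, C, G, T) of a window.

    Assumption: ambiguous bases such as N are left out of the denominator.
    """
    seq = window_seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def pooled_window_depth(counts_per_sample, window_size):
    """Pooled depth = total reads over all controls / (n_controls * window size).

    counts_per_sample holds one list per control sample, each giving the
    number of reads aligned to every window, in the same window order.
    """
    n_samples = len(counts_per_sample)
    n_windows = len(counts_per_sample[0])
    totals = [
        sum(sample[w] for sample in counts_per_sample) for w in range(n_windows)
    ]
    return [total / (n_samples * window_size) for total in totals]
```

Pairing `gc_content` of each window with its pooled depth gives the (GC, depth) points on which the regression is fitted.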
  38. The method of claim 37, characterized in that the two-dimensional regression analysis is locally weighted scatterplot smoothing (LOWESS).
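A simplified, dependency-free Python sketch of locally weighted regression is given below to illustrate how expected depth could be estimated as a function of GC content. It is a single-pass local linear fit with tricube weights and omits the robustness iterations of full LOWESS; the function name and the `frac` default are assumptions, not values from the claims.

```python
def lowess_predict(xs, ys, x0, frac=0.5):
    """Estimate y at x0 by a tricube-weighted local linear fit.

    xs: GC content per window; ys: sequencing depth per window.
    frac: share of points used in each local fit (assumed default).
    """
    n = len(xs)
    k = min(n, max(2, int(round(frac * n))))
    dists = sorted(abs(x - x0) for x in xs)
    h = dists[k - 1]  # bandwidth: distance to the k-th nearest point
    if h == 0:
        # All k nearest points sit exactly at x0: average their y values.
        same = [y for x, y in zip(xs, ys) if x == x0]
        return sum(same) / len(same)
    weights = [
        (1 - (abs(x - x0) / h) ** 3) ** 3 if abs(x - x0) < h else 0.0
        for x in xs
    ]
    if sum(weights) == 0.0:
        # Every neighbour lies exactly on the bandwidth edge: weight equally.
        weights = [1.0 if abs(x - x0) <= h else 0.0 for x in xs]
    # Weighted least squares on u = x - x0; the intercept is the estimate.
    sw = sum(weights)
    su = sum(w * (x - x0) for w, x in zip(weights, xs))
    sy = sum(w * y for w, y in zip(weights, ys))
    suu = sum(w * (x - x0) ** 2 for w, x in zip(weights, xs))
    suy = sum(w * (x - x0) * y for w, x, y in zip(weights, xs, ys))
    denom = sw * suu - su * su
    if denom <= 1e-12:
        return sy / sw  # degenerate fit: fall back to the weighted mean
    return (suu * sy - su * suy) / denom
```

Calling `lowess_predict(gc_values, depths, g)` for the GC content `g` of each window yields the expected depth at that GC content, which is the relationship the correction step relies on.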
  39. The method of claim 35, characterized in that the corrected sequencing depth of the same window of the control sample is obtained by correcting the sequencing depth of the same window of the control sample using the relationship between sequencing depth and GC content, wherein the sequencing depth of the same window of the control sample = the number of reads in the sequencing data of the control sample that align to the window / the size of the window.
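The claims do not state the arithmetic of the correction itself. One common choice, shown here purely as an assumption, is a multiplicative rescaling: each window's depth is divided by the depth expected at its GC content and multiplied by a global baseline. The same hypothetical function would be applied to the test sample and to each control sample, so that corrected values are compared with corrected values.

```python
from statistics import median


def correct_depths(depths, expected_depths):
    """Rescale window depths so that GC-driven differences are removed.

    depths: raw depth per window for one sample.
    expected_depths: depth predicted from the depth/GC relationship for the
    GC content of each window (same order as depths).
    Assumption: corrected = raw * median(expected) / expected.
    """
    baseline = median(expected_depths)
    return [
        d * baseline / e if e > 0 else 0.0
        for d, e in zip(depths, expected_depths)
    ]
```

With this choice, a sample whose raw depths follow the fitted GC curve exactly ends up with the same corrected depth in every window.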
  40. A CNV detection apparatus, characterized by comprising:
    a nucleic acid acquisition device for acquiring nucleic acid of a sample to be tested;
    a sequencing device for sequencing the nucleic acid from the nucleic acid acquisition unit to obtain sequencing data;
    a data processing device for processing the sequencing data from the sequencing device to obtain a universal alignment result;
    a detection device for detecting the CNV based on the universal alignment result from the data processing device; wherein
    the data processing device comprises:
    a data receiving unit for receiving the sequencing data from the sequencing device, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of one chromosome fragment, the two reads of each read pair coming respectively from the positive strand and the negative strand of the chromosome fragment, or both coming from the positive strand of the chromosome fragment or both from the negative strand of the chromosome, each read containing a gap, and the two reads of a read pair being defined as a left arm and a right arm, respectively;
    a processor for executing a data processing program, wherein executing the data processing program comprises aligning the sequencing data to a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain the universal alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or the alignment result comprising alignment results of a plurality of the left arms and alignment results of a plurality of the right arms; and
    at least one storage unit for storing data, including the data processing program.
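This passage does not spell out how the gap in each read is eliminated. The Python sketch below shows one hypothetical reading: an aligned arm is held as a list of sub-read blocks, positive gaps between blocks are filled with reference bases, and overlapping blocks are trimmed, which yields a single contiguous read. The `AlignedArm` layout, the function name, and the fill-from-reference rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedArm:
    """One arm (left or right) of a read pair after alignment.

    Hypothetical layout: blocks are (0-based reference start, bases) for
    each sub-read, sorted by start position.
    """
    blocks: List[Tuple[int, str]]


def eliminate_gaps(arm: AlignedArm, reference: str) -> Tuple[int, str]:
    """Merge the blocks of one arm into a single gap-free read.

    Assumption: a positive gap is filled with the reference bases it spans,
    and an overlap between consecutive blocks is trimmed from the later one.
    """
    start, seq = arm.blocks[0]
    end = start + len(seq)
    for block_start, block_seq in arm.blocks[1:]:
        if block_start > end:
            # Positive gap: bridge it with reference bases, then append.
            seq += reference[end:block_start] + block_seq
        else:
            # Overlap (or adjacency): drop the bases already covered.
            seq += block_seq[end - block_start:]
        end = start + len(seq)
    return start, seq
```

The returned start position and contiguous sequence could then be written out in a standard alignment record, which is what would make the result usable by general-purpose downstream tools.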

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480082793.4A CN107077533B (en) 2014-12-10 2014-12-10 Sequencing data processing device and method
PCT/CN2014/093511 WO2016090583A1 (en) 2014-12-10 2014-12-10 Device and method for sequencing data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/093511 WO2016090583A1 (en) 2014-12-10 2014-12-10 Device and method for sequencing data processing

Publications (1)

Publication Number Publication Date
WO2016090583A1 true WO2016090583A1 (en) 2016-06-16

Family

ID=56106452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/093511 WO2016090583A1 (en) 2014-12-10 2014-12-10 Device and method for sequencing data processing

Country Status (2)

Country Link
CN (1) CN107077533B (en)
WO (1) WO2016090583A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110462056A (en) * 2017-05-19 2019-11-15 深圳华大生命科学研究院 Samples sources detection method, device and storage medium based on DNA sequencing data
CN111383717A (en) * 2018-12-29 2020-07-07 北京安诺优达医学检验实验室有限公司 Method and system for constructing biological information analysis reference data set
CN115132271A (en) * 2022-09-01 2022-09-30 北京中仪康卫医疗器械有限公司 CNV detection method based on batch internal correction

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016090585A1 (en) * 2014-12-10 2016-06-16 深圳华大基因研究院 Sequencing data processing apparatus and method
CN116254320A (en) * 2022-12-15 2023-06-13 纳昂达(南京)生物科技有限公司 Flat-end double-stranded joint element, kit and flat-end library building method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101914628A (en) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 Method and system for detecting polymorphism locus of genome target region
CN103525939A (en) * 2013-10-28 2014-01-22 广州爱健生物技术有限公司 Method and system for noninvasive detection of fetus chromosome aneuploid
CN103824001A (en) * 2014-02-27 2014-05-28 北京诺禾致源生物信息科技有限公司 Method and device for detecting chromosome
US20140272954A1 (en) * 2013-03-15 2014-09-18 Nabsys, Inc. Methods and systems for electronic karyotyping
CN104093858A (en) * 2012-11-13 2014-10-08 深圳华大基因医学有限公司 Method, system and computer readable medium for determining whether chromosome number variation exists in biological sample

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2171088T3 (en) * 2007-06-19 2016-01-25 Stratos Genomics Inc Nucleic acid sequencing in a high yield by expansion
WO2009076238A2 (en) * 2007-12-05 2009-06-18 Complete Genomics, Inc. Efficient base determination in sequencing reactions
WO2011143231A2 (en) * 2010-05-10 2011-11-17 The Broad Institute High throughput paired-end sequencing of large-insert clone libraries
ES2766860T5 (en) * 2013-05-15 2023-02-23 Bgi Genomics Co Ltd Method for detecting chromosomal structural abnormalities and device for it
CN104156631B (en) * 2014-07-14 2017-07-18 天津华大基因科技有限公司 The chromosome triploid method of inspection
CN104133914B (en) * 2014-08-12 2017-03-08 厦门万基生物科技有限公司 A kind of GC deviation eliminating high-flux sequence introducing and the detection method to chromosome copies number variation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110462056A (en) * 2017-05-19 2019-11-15 深圳华大生命科学研究院 Samples sources detection method, device and storage medium based on DNA sequencing data
CN110462056B (en) * 2017-05-19 2023-08-29 深圳华大生命科学研究院 Sample source detection method, device and storage medium based on DNA sequencing data
CN111383717A (en) * 2018-12-29 2020-07-07 北京安诺优达医学检验实验室有限公司 Method and system for constructing biological information analysis reference data set
CN115132271A (en) * 2022-09-01 2022-09-30 北京中仪康卫医疗器械有限公司 CNV detection method based on batch internal correction

Also Published As

Publication number Publication date
CN107077533A (en) 2017-08-18
CN107077533B (en) 2021-07-27

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14907893; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 14907893; Country of ref document: EP; Kind code of ref document: A1)