WO2016090583A1

WO2016090583A1 - Device and method for sequencing data processing

Info

Publication number: WO2016090583A1
Application number: PCT/CN2014/093511
Authority: WO
Inventors: 刘敬一; 刘兴民; 刘耿; 赵鑫; 杨明; 侯勇; 吴逵; 李波
Original assignee: 深圳华大基因研究院
Priority date: 2014-12-10
Filing date: 2014-12-10
Publication date: 2016-06-16
Also published as: CN107077533A; CN107077533B

Abstract

A device (100) for sequencing data processing is provided. The device includes: a data receiving unit (10) for receiving sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads originating from two positions of a chromosome fragment, and each read containing a gap; a processor (20) for executing a data processing program, where executing the program includes aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result; and at least one storage unit (30) for storing data, including the data processing program. A system and method for sequencing data processing, a computer readable storage medium, and a method and device for detecting CNV are also provided.

Description

Sequencing data processing device and method

Technical field

The present invention relates to the field of bioinformatics. Specifically, it relates to a sequencing data processing apparatus and method, and more particularly to a sequencing data processing apparatus, a sequencing data processing system, a sequencing data processing method, a computer readable storage medium, a method of detecting CNV, and a CNV detecting apparatus.

Background technique

cfDNA (cell-free DNA), which is present in serum, plasma and other body fluids, is an effective biomarker that can be applied to the detection of a variety of mutations, such as those associated with cancer, fetal chromosomal variation and other genetic mutations. Because quantitative analysis techniques lacked sufficient sensitivity and accuracy, previous studies focused on a small number of known disease-related genes and conditions, such as the GNAQ gene in uveal melanoma (Metz, Claudia H. D., et al. "Ultradeep sequencing detects GNAQ and GNA11 mutations in cell-free DNA from plasma of patients with uveal melanoma." Cancer Medicine 2.2 (2013): 208-215) and trisomy 21 (Liao, Gary J. W., et al. "Noninvasive prenatal diagnosis of fetal trisomy 21 by allelic ratio analysis using targeted massively parallel sequencing of maternal plasma DNA." PLoS One 7.5 (2012): e38154).

The arrival of the next-generation sequencing technologies 454 (Roche), Solexa (Illumina) and SOLiD (ABI) has led to a rapid increase in sequencing throughput and a sharp drop in sequencing costs, which opens new approaches to cfDNA detection. Massively parallel sequencing (MPS) is the most widely used cfDNA detection technology. It is applied in plasma DNA molecular diagnosis, detection of fetal chromosomal aneuploidy, whole-genome karyotyping, and even fetal whole-genome sequencing. Copy number variation (CNV) refers to deletions, insertions, duplications and complex multi-site variations, ranging from 1000 bp to millions of bp, that are widespread in the human genome. Copy number variation is an important biomarker for many human diseases (such as cancer, hereditary diseases and cardiovascular diseases) and has become a research focus for many of them. In particular, detecting copy number variation in tumors can reveal the loss or gain of tumor DNA across the genome. Currently available CNV detection platforms include comparative genomic hybridization (CGH) based on large-insert clones, representational oligonucleotide microarray analysis (ROMA), and the like. These platforms have insufficient detection capability for small CNVs (below 20 kb) and suffer from cumbersome operation and high cost.

Summary of the invention

The present invention is directed to solving at least some of the above technical problems or at least providing a commercial choice.

According to a first aspect of the present invention, a sequencing data processing apparatus is provided. The apparatus comprises: a data receiving unit configured to receive sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment, where the two reads of each read pair come from the positive and negative strands of the chromosome fragment respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment, each read contains a gap, and the two reads of a read pair are defined as a left arm and a right arm; a processor for executing a data processing program, where executing the data processing program includes aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result, the alignment result comprising alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and of a plurality of the right arms; and at least one storage unit for storing data, including the data processing program. Read pairs coming from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library. In one embodiment of the present invention, the read pairs are obtained using the library construction method and sequencing platform of Complete Genomics (CG). The distance between the two reads of a read pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme.
The CG platform constructs a multi-adaptor paired-end library by enzymatic cleavage, and the resulting circular library is sequenced using combinatorial probe-anchor ligation (cPAL) sequencing, in which the bases on both sides of each adaptor are read. Because the two parts of an adaptor are ligated after restriction enzyme digestion to build the paired-end library, and because each enzyme has a preferred cutting distance but in practice often cuts one position further or one position nearer than that distance, reads often carry a gap, frequently of +1 or -1. In addition or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position tends to vary, and this variation also produces gaps in the resulting reads. For example, when a multi-adaptor circular library is constructed, the Alu enzyme is used in two digestions to join different parts of the adaptors, and reading the bases adjacent to the adaptors yields reads with gaps of +3/-3. The size of a gap in the present invention may also be zero. Taking the current two-adaptor (2-AD) sequencing library of the CG platform as an example, the 2-AD sequencing output has a total length of 60 bp, which can be divided into two mate-paired read pairs; within each read pair, the reads have a small gap at the 10 bp position and an invalid sequencing site N at the 20 bp position, and the distance between the two reads of a read pair is generally less than 2000 bp. Among the reads from a multi-adaptor library, one read can form a read pair with any other read. The terms "positive strand" and "negative strand" as used herein denote the two complementary strands that constitute a chromosome fragment, and are relative to each other.
If one strand is called the positive strand, its complementary strand is called the negative strand. In an embodiment of the present invention, the strand that matches the reference sequence is called the positive strand and the other strand the negative strand. In the present invention, the alignment can be performed using known alignment software such as SOAP or BWA, or using TeraMap, the alignment software of the CG platform. In one embodiment of the invention, the alignment is performed using TeraMap, and the resulting alignment result is in TeraMap format. In one embodiment of the present invention, eliminating the gap of each read in the alignment result means that, for a read with a negative gap, the negative gap is removed, that is, the overlapping bases are removed, and for a read with a positive gap, the positive gap is filled with N bases of the same size, where N is A, T, C or G. For example, a read with a negative gap of -2 nt can be divided into two parts at the gap, and the adjoining ends of the two parts overlap by 2 nt. If the two parts of the read are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, yields the read ATCGCTTAAGTACGATTC.
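The gap elimination described above can be illustrated with a minimal Python sketch. The function name `eliminate_gap` and the use of the literal character "N" as the filler for positive gaps are assumptions made for illustration; the description only states that a negative gap is removed by dropping the overlapping bases and that a positive gap is filled with N.

```python
def eliminate_gap(first_part: str, second_part: str, gap: int, fill: str = "N") -> str:
    """Join the two parts of a read so that the joined read has no gap.

    gap < 0: the parts overlap by -gap bases; the duplicated bases are dropped.
    gap > 0: gap bases are missing between the parts; they are filled with `fill`.
    gap == 0: the parts are simply concatenated.
    """
    if gap < 0:
        # Drop the overlapping bases from the start of the second part.
        return first_part + second_part[-gap:]
    return first_part + fill * gap + second_part


# Worked example from the description: a -2 nt gap, overlapping bases "AG".
print(eliminate_gap("ATCGCTTAAG", "AGTACGATTC", -2))  # ATCGCTTAAGTACGATTC
# A positive gap of 2 nt is filled with two placeholder bases.
print(eliminate_gap("ATCG", "TTAA", 2))               # ATCGNNTTAA
```

The first call reproduces the read ATCGCTTAAGTACGATTC given in the example above.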

In one embodiment of the invention, the alignment comprises: aligning the left arm and the right arm of each read pair with the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; using one of the first-level left alignment result and the first-level right alignment result as a reference and aligning the other arm against it, to obtain a second-level left alignment result and a second-level right alignment result; and obtaining, based on the second-level left alignment result and the second-level right alignment result, the alignment results of a plurality of the read pairs, or the alignment results of a plurality of the left arms and of a plurality of the right arms. Thus the read alignment result is obtained after two rounds of alignment. In one embodiment of the present invention, the first round is a global alignment against the reference sequence, and the second round, in which the alignment result of the left arm (or right arm) serves as the reference for aligning the right arm (or left arm), is a local alignment. In this way, two reads taken respectively from the second-level left alignment result and the second-level right alignment result that align to the same chromosome at a distance consistent with the expected distance are paired into a read pair, giving the alignment result of the read pairs.
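The final pairing step can be sketched as follows. This is only an illustration of pairing arm hits that fall on the same chromosome at an expected distance; it does not perform the global or local alignment itself. The `Hit` record, the `read_id` field and the `min_dist`/`max_dist` bounds are hypothetical names introduced here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    read_id: str  # hypothetical identifier shared by both arms of one read pair
    chrom: str    # chromosome the arm aligned to
    pos: int      # leftmost aligned position on that chromosome


def pair_arms(left_hits, right_hits, min_dist, max_dist):
    """Pair left-arm and right-arm hits of the same read that lie on the
    same chromosome at a distance within [min_dist, max_dist]."""
    right_by_read = {}
    for hit in right_hits:
        right_by_read.setdefault(hit.read_id, []).append(hit)

    pairs = []
    for left in left_hits:
        for right in right_by_read.get(left.read_id, []):
            same_chrom = left.chrom == right.chrom
            if same_chrom and min_dist <= abs(right.pos - left.pos) <= max_dist:
                pairs.append((left, right))
    return pairs


left = [Hit("r1", "chr1", 1000), Hit("r1", "chr2", 500)]
right = [Hit("r1", "chr1", 1140), Hit("r1", "chr3", 700)]
print(pair_arms(left, right, min_dist=50, max_dist=2000))
```

Only the chr1 hits are paired here, because the other candidate hits fall on different chromosomes.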

In an embodiment of the invention, the alignment comprises: setting the size of the gap to several values and aligning each left arm or each right arm with the reference sequence multiple times to obtain an optimal alignment result. For example, the gap of each left arm or each right arm is set to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt and 7 nt in turn to obtain a plurality of corresponding reads; each of these reads is aligned with the reference sequence, and the read with the optimal alignment is taken as the left arm or right arm. Which alignment is optimal may be judged by the default scoring of the alignment software used.
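The gap scan can be illustrated with a toy sketch. The scoring used here (fewest mismatches over an exhaustive ungapped scan of a short reference, with filler bases ignored) is an assumption standing in for the default scoring of a real aligner such as TeraMap, SOAP or BWA; the function names are hypothetical. Because filler bases are not scored, this toy score tends to favor positive gaps on real data, and ties are resolved in favor of the first gap tested.

```python
def build_candidate(first_part, second_part, gap):
    """Build the read implied by a given gap size (negative = overlap)."""
    if gap < 0:
        return first_part + second_part[-gap:]
    return first_part + "N" * gap + second_part


def mismatches(read, ref, start):
    """Count mismatches of `read` placed at `start` on `ref`, ignoring N."""
    window = ref[start:start + len(read)]
    return sum(1 for r, w in zip(read, window) if r != "N" and r != w)


def best_gap(first_part, second_part, ref, gaps=range(-3, 8)):
    """Try every gap size from -3 to 7 nt and return (mismatches, gap, start)
    for the placement with the fewest mismatches."""
    best = None
    for gap in gaps:
        read = build_candidate(first_part, second_part, gap)
        for start in range(len(ref) - len(read) + 1):
            score = mismatches(read, ref, start)
            if best is None or score < best[0]:
                best = (score, gap, start)
    return best


ref = "GGATCGCTTAAGTACGATTCCC"
print(best_gap("ATCGCTTAAG", "AGTACGATTC", ref))  # (0, -2, 2)
```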

In an embodiment of the present invention, executing the data processing program further includes, before eliminating the gap of each read in the alignment result, extracting the unique alignment result from the alignment result to replace the alignment result. The unique alignment result comprises a plurality of read pairs that align uniquely to the reference sequence, where the two reads of each read pair align to the same chromosome of the reference sequence and the distance between the two reads of each read pair is consistent with the expected distance between the two positions of the chromosome fragment from which they came.
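A minimal sketch of extracting the unique alignment result is given below. The tuple layout of a candidate alignment and the rule that "unique" means a read pair identifier occurring exactly once among the candidates are assumptions for illustration; an actual aligner may report uniqueness differently.

```python
from collections import Counter


def unique_pairs(pairs, min_dist, max_dist):
    """Keep read pairs that occur once, lie on one chromosome, and whose
    two reads are separated by a distance within [min_dist, max_dist].

    Each element of `pairs` is a hypothetical tuple:
    (read_id, left_chrom, left_pos, right_chrom, right_pos).
    """
    counts = Counter(p[0] for p in pairs)
    kept = []
    for read_id, lchrom, lpos, rchrom, rpos in pairs:
        if counts[read_id] != 1:
            continue  # the pair aligns to more than one place
        if lchrom != rchrom:
            continue  # the two reads align to different chromosomes
        if not min_dist <= abs(rpos - lpos) <= max_dist:
            continue  # distance does not match the expected distance
        kept.append((read_id, lchrom, lpos, rchrom, rpos))
    return kept


candidates = [
    ("r1", "chr1", 1000, "chr1", 1140),
    ("r2", "chr1", 500, "chr2", 640),
    ("r3", "chr2", 100, "chr2", 240),
    ("r3", "chr5", 900, "chr5", 1040),
    ("r4", "chr3", 100, "chr3", 9000),
]
print(unique_pairs(candidates, min_dist=50, max_dist=2000))
```

In this example only r1 survives: r2 spans two chromosomes, r3 has two candidate placements, and r4 is too far apart.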

In one embodiment of the invention, executing the data processing program further comprises performing a correction such that both reads of each read pair in the unique alignment result align to the positive strand of the same chromosome of the reference sequence. For example, for a read pair whose two reads align to the positive and negative strands of a chromosome respectively, the read that aligns to the negative strand is replaced with its reverse complement, which achieves the correction.
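The strand correction amounts to a reverse complement of the read that aligned to the negative strand. A minimal sketch, assuming the strand is recorded as "+" or "-" (a hypothetical convention, not one stated in the description):

```python
# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_positive_strand(seq: str, strand: str) -> str:
    """Return the read as it appears on the positive strand."""
    return reverse_complement(seq) if strand == "-" else seq


print(to_positive_strand("ATCGN", "-"))  # NCGAT
print(to_positive_strand("ATCGN", "+"))  # ATCGN
```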

In one embodiment of the invention, executing the data processing program further comprises performing a data format conversion, which converts the format of the alignment result or of the unique alignment result. In an embodiment of the present invention, the universal alignment result is required to be in SAM or BAM format, to facilitate subsequent analysis based on the alignment result or the unique alignment result. SAM and BAM are common alignment formats; SAM is a text format and BAM is its compressed binary counterpart. Because different alignment software is used, the format of the output alignment result or unique alignment result may not be usable by existing downstream data processing or analysis programs. For example, an alignment result in the aforementioned TeraMap format does not meet the input format requirements of most existing variant detection software, such as SOAPsnp, GATK or SOAPindel. Converting the data format yields a universal alignment result in a common data format, which facilitates further analysis and processing of the data.
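As an illustration of the target format, the sketch below writes a gap-free, strand-corrected read as a single-end SAM record with the eleven mandatory SAM fields. The input arguments are hypothetical, since the layout of the TeraMap output is not described here; the mapping quality is set to 255 (not available) and the CIGAR string assumes an ungapped match over the whole read, both of which are simplifying assumptions.

```python
def sam_header(ref_lengths):
    """Build minimal SAM header lines from a {chromosome: length} mapping."""
    lines = ["@HD\tVN:1.6\tSO:unsorted"]
    lines += [f"@SQ\tSN:{name}\tLN:{length}" for name, length in ref_lengths.items()]
    return lines


def sam_record(name, chrom, pos, seq, reverse=False):
    """Format one alignment as a SAM line.

    Fields: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
    `pos` is the 1-based leftmost position; `seq` is given on the
    positive strand, as SAM expects.
    """
    flag = 16 if reverse else 0          # 16 = read aligned to reverse strand
    cigar = f"{len(seq)}M"               # assumed: ungapped match, full length
    fields = [name, str(flag), chrom, str(pos), "255", cigar, "*", "0", "0", seq, "*"]
    return "\t".join(fields)


for line in sam_header({"chr1": 248956422}):
    print(line)
print(sam_record("r1", "chr1", 100, "ATCGCTTAAGTACGATTC"))
```

A BAM file could then be produced from such SAM text with standard tools; that step is outside this sketch.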

According to a second aspect of the present invention, there is provided a sequencing data processing system comprising a host and a display, the system further comprising a sequencing data processing device in accordance with one or any embodiment of the present invention. The foregoing description of the advantages and technical features of the sequencing data processing apparatus is equally applicable to the system of the present invention and will not be described herein.

According to a third aspect of the present invention, a method for processing sequencing data is provided, the method comprising the steps of: acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment, where the two reads of each read pair come from the positive and negative strands of the chromosome fragment respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment, each read contains a gap, and the two reads of a read pair are defined as a left arm and a right arm; aligning the sequencing data with a reference sequence to obtain an alignment result, the alignment result comprising alignment results of a plurality of the read pairs and/or alignment results of a plurality of the left arms and of a plurality of the right arms; and eliminating the gap of each read in the alignment result to obtain a universal alignment result. For the way the read pairs are acquired, the gaps contained in the reads, the alignment, the elimination of gaps, the alignment result and the universal alignment result, reference may be made to the description of the corresponding technical features of the apparatus in the above aspect or any of its embodiments. For example, in the same way, read pairs coming from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library.
In one embodiment of the present invention, the read pairs are obtained using the library construction method and sequencing platform of Complete Genomics (CG), and the distance between the two reads of a read pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme. The CG platform constructs a multi-adaptor paired-end library by enzymatic cleavage, and the resulting circular library is sequenced using combinatorial probe-anchor ligation (cPAL) sequencing, in which the bases on both sides of each adaptor are read. Because the two parts of an adaptor are ligated after restriction enzyme digestion to build the paired-end library, and because each enzyme has a preferred cutting distance but in practice often cuts one position further or one position nearer than that distance, reads often carry a gap, frequently of +1 or -1. In addition or alternatively, if the same enzyme is used for several digestions during library construction, the cutting position tends to vary, and this variation also produces gaps in the resulting reads. For example, when a multi-adaptor circular library is constructed, the Alu enzyme is used in two digestions to join different parts of the adaptors, and reading the bases adjacent to the adaptors yields reads with gaps of +3/-3. The size of a gap in the present invention may also be zero. Among the reads from a multi-adaptor library, one read can form a read pair with any other read. The terms "positive strand" and "negative strand" as used herein denote the two complementary strands that constitute a chromosome fragment, and are relative to each other: if one strand is called the positive strand, its complementary strand is called the negative strand.
In an embodiment of the present invention, the strand that matches the reference sequence is called the positive strand and the other strand the negative strand. In the present invention, the alignment can be performed using known alignment software such as SOAP or BWA, or using TeraMap, the alignment software of the CG platform. In one embodiment of the invention, the alignment is performed using TeraMap, and the resulting alignment result is in TeraMap format. In one embodiment of the present invention, eliminating the gap of each read in the alignment result means that, for a read with a negative gap, the negative gap is removed, that is, the overlapping bases are removed, and for a read with a positive gap, the positive gap is filled with N bases of the same size, where N is A, T, C or G. For example, a read with a negative gap of -2 nt can be divided into two parts at the gap, and the adjoining ends of the two parts overlap by 2 nt. If the two parts of the read are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, the overlapping AG, yields the read ATCGCTTAAGTACGATTC.

In one embodiment of the invention, acquiring the sequencing data comprises constructing a sequencing library, the sequencing library being a single-stranded circular DNA library constructed from one strand of the chromosome fragment and at least one predetermined DNA sequence. The single-stranded circular library can be constructed by a known library construction method, for example by constructing a single-adaptor circular double-stranded library with reference to the paired-end library construction of SOLiD from Life Technologies and then separating the two strands to obtain a single-stranded circular library. In one embodiment of the invention, the single-stranded circular library is constructed using the CG library construction technique, for which reference may be made to US7897344, to obtain a multi-adaptor single-stranded circular library.

In one embodiment of the invention, the two reads of each read pair come from the two ends of the chromosome fragment. With reference to an improved CG library construction technique, the two parts of one adaptor are ligated to the two ends of a chromosome fragment, and single strands are separated and circularized to obtain a 1-adaptor single-stranded circular library. The 1-adaptor single-stranded circular library consists of one strand of the chromosome fragment and a predetermined DNA sequence joining the two ends of that strand. Rolling-circle amplification produces DNA nanoballs (DNBs), which are loaded onto a chip and sequenced using the CG cPAL sequencing technology; for loading DNBs onto a chip and for the cPAL technology, reference may be made to US8278039B2 and US8518640B2, respectively. The predetermined DNA sequence is a known sequence and is the aforementioned adaptor or a part of the adaptor. Constructing a 1-adaptor circular single-stranded library by the improved CG library construction method comprises the steps of: (1) extracting the nucleic acid to be tested; (2) phosphorylating the ends of the nucleic acid to obtain a terminal phosphorylation product; (3) end-repairing the terminal phosphorylation product to obtain an end repair product; (4) ligating a first sequence and a second sequence to the two ends of the end repair product to obtain a ligation product; (5) performing nick translation and amplification of the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) performing single-strand separation of the amplification product using the biotin label to obtain a single-stranded product; and (7) circularizing the single-stranded product with a fourth sequence to obtain the sequencing library; wherein the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
The fourth sequence is capable of joining the first sequence and the second sequence to form the adaptor. Nick translation eliminates the nick caused by the dideoxynucleotide at the other end of the first sequence and/or second sequence attached to the two ends of the end repair product. Using at least one biotin-labeled primer gives at least one strand of the amplification product a biotin label, so that the single-stranded product can easily be separated on the basis of the biotin label. In another embodiment of the present invention, constructing a 1-adaptor circular single-stranded library by the improved CG library construction method comprises the steps of: (1) extracting the nucleic acid to be tested; (2) end-repairing the nucleic acid to obtain an end repair product; (3) phosphorylating the ends of the end repair product to obtain a terminal phosphorylation product; (4) ligating a first sequence and a second sequence to the two ends of the terminal phosphorylation product to obtain a ligation product; (5) performing nick translation and amplification of the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; (6) performing single-strand separation of the amplification product using the biotin label to obtain a single-stranded product; and (7) circularizing the single-stranded product with a fourth sequence to obtain the sequencing library; wherein the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide. The order of the end repair and terminal phosphorylation steps is not limited. End repair produces blunt-ended nucleic acid fragments to which other nucleotides or sequences can be attached.
Terminal phosphorylation reduces ligation of sample nucleic acid fragments to one another, so that a library meeting the requirements for sequencing can be constructed even from samples with low nucleic acid content. The single-adaptor circular single-stranded library is shown in Figure 1. When the constructed single-adaptor circular single-stranded library (1-AD) is sequenced, the read pair output by 1-AD sequencing has a total length of about 30 bp, with one read of 12 bp and the other of 19 bp, and the median genomic distance between the two reads of a read pair is about 140 bp. The single-adaptor library requires only a small amount of input material, which suits samples with low cfDNA content, and it has the advantages of short construction time and low construction cost.

In one embodiment of the invention, the alignment in the method of the invention comprises: aligning the left arm and the right arm of each read pair with the reference sequence separately, to obtain a first-level left alignment result and a first-level right alignment result; using one of the first-level left alignment result and the first-level right alignment result as a reference and aligning the other arm against it, to obtain a second-level left alignment result and a second-level right alignment result; and obtaining, based on the second-level left alignment result and the second-level right alignment result, the alignment results of a plurality of the read pairs, or the alignment results of a plurality of the left arms and of a plurality of the right arms. Thus the read alignment result is obtained after two rounds of alignment. In one embodiment of the present invention, the first round is a global alignment against the reference sequence, and the second round, in which the alignment result of the left arm (or right arm) serves as the reference for aligning the right arm (or left arm), is a local alignment. In this way, two reads taken respectively from the second-level left alignment result and the second-level right alignment result that align to the same chromosome at a distance consistent with the expected distance are paired into a read pair, giving the alignment result of the read pairs.

In one embodiment of the invention, the alignment includes setting the size of the gap to several values and aligning each left arm or each right arm with the reference sequence multiple times to obtain an optimal alignment result. For example, the gap of each left arm or each right arm is set to -3 nt, -2 nt, -1 nt, 0 nt, 1 nt, 2 nt, 3 nt, 4 nt, 5 nt, 6 nt and 7 nt in turn to obtain a plurality of corresponding reads; each of these reads is aligned with the reference sequence, and the read with the optimal alignment is taken as the left arm or right arm. Which alignment is optimal may be judged by the default scoring of the alignment software used.

In one embodiment of the invention, executing the data processing program further comprises performing a correction such that both reads of each read pair in the unique alignment result align to the positive strand of the same chromosome of the reference sequence. For example, for a read pair whose two reads align to the positive and negative strands of a chromosome respectively, the read that aligns to the negative strand is replaced with its reverse complement, which achieves the correction.

According to a fourth aspect of the present invention, there is provided a computer readable storage medium for storing a program for execution by a computer, where executing the program comprises performing the sequencing data processing method of the aforementioned aspect of the invention or any one of its embodiments. The foregoing description of the advantages and technical features of the sequencing data processing method of the present invention also applies to the computer readable storage medium and is not repeated here. The storage medium may include a read only memory, a random access memory, a magnetic disk, an optical disk, and the like.

According to a fifth aspect of the present invention, a method for detecting copy number variation (CNV) is provided, the method comprising: a. acquiring nucleic acid from a sample to be tested; b. sequencing the nucleic acid to obtain sequencing data; c. processing the sequencing data to obtain a universal alignment result; and d. detecting the CNV based on the universal alignment result; wherein step c is performed using the sequencing data processing apparatus and/or method of the aforementioned aspects of the invention or any of their embodiments. The above description of the advantages and technical features of the sequencing data processing apparatus and/or method of the present invention also applies to the CNV detection method of this aspect of the present invention and is not repeated here.

In one embodiment of the present invention, step b includes constructing a sequencing library from the nucleic acid, the sequencing library being a single-stranded circular DNA library, and constructing the single-stranded circular DNA library comprises: phosphorylating the ends of the nucleic acid to obtain a terminal phosphorylation product; end-repairing the terminal phosphorylation product to obtain an end repair product; ligating a first sequence and a second sequence to the two ends of the end repair product to obtain a ligation product; performing nick translation and amplification of the ligation product using a third sequence to obtain an amplification product, the third sequence being a primer pair in which at least one primer carries a biotin label; performing single-strand separation of the amplification product using the biotin label to obtain a single-stranded product; and circularizing the single-stranded product with a fourth sequence to obtain the sequencing library; wherein the fourth sequence is capable of joining one end of the first sequence and one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide. In another embodiment of the invention, end repair is performed first, followed by terminal phosphorylation. End repair produces blunt-ended nucleic acid fragments to which other nucleotides or sequences can be attached. Terminal phosphorylation reduces ligation of sample nucleic acid fragments to one another, so that a library meeting the requirements for sequencing can be constructed even from samples with low nucleic acid content. The single-adaptor circular single-stranded library is shown in Figure 1. The single-adaptor library requires only a small amount of input material, which suits samples with low cfDNA content, and it also has the advantages of short construction time and low library construction cost.
The fourth sequence is capable of joining the first sequence and the second sequence to form one such adaptor. Nick translation eliminates the nick caused by the dideoxynucleotide at the other end of the first sequence and/or second sequence attached to the ends of the end repair product. Using at least one biotin-labeled primer gives at least one strand of the amplification product a biotin label, so that the single-stranded product can subsequently be separated easily on the basis of the biotin label. In one embodiment of the invention, the constructed library is sequenced using combinatorial probe-anchor ligation sequencing, for example on a CG sequencing platform.

The detection of CNV based on the universal alignment result can use currently known CNV detection methods, such as a hidden Markov model, circular binary segmentation, hierarchical segmentation or a kernel smoothing algorithm. In an embodiment of the present invention, step d includes: setting a plurality of windows on the reference sequence, each window being a part of the reference sequence, and determining that a CNV is present in the nucleic acid of the sample to be tested when the amount of reads aligned to a window in the universal alignment result differs significantly from the amount of reads aligned to the same window in the universal alignment result of a control sample. The size of the windows can be adjusted according to the size of the CNV to be detected, and the universal alignment result of the control sample can be obtained by the sequencing data processing method of the aforementioned aspect of the invention or any of its embodiments. Whether the difference is significant can be judged using a statistical test, for example by computing a z-score (standard score): when the z value is greater than or less than a predetermined threshold, a CNV is determined to exist in the window region. For example, if the normal control is diploid (copy number 2), a positive z value indicates that the copy number of that window in the sample to be tested is greater than 2, and a negative z value indicates that it is less than 2. In one embodiment of the present invention, the predetermined threshold is set to 3, that is, when the absolute value of z is greater than 3, a CNV is determined to occur in the window. The amount of reads may be a count or a proportion.
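The window-based z-score test can be sketched as follows. The use of several control samples to estimate a per-window mean and standard deviation is an assumption made for illustration (the description speaks only of a control sample and a z-score test), and the function names are hypothetical. The threshold of 3 follows the embodiment above.

```python
from statistics import mean, stdev


def window_z_scores(sample_counts, control_counts):
    """Compute one z-score per window.

    sample_counts: read amounts per window for the sample to be tested.
    control_counts: one list of per-window read amounts per control sample
    (at least two controls are needed to estimate a standard deviation).
    """
    scores = []
    for w, x in enumerate(sample_counts):
        ctrl = [c[w] for c in control_counts]
        mu, sd = mean(ctrl), stdev(ctrl)
        scores.append((x - mu) / sd if sd > 0 else 0.0)
    return scores


def call_cnv(z_scores, threshold=3.0):
    """Label each window: gain (z > threshold), loss (z < -threshold), or normal."""
    calls = []
    for z in z_scores:
        if z > threshold:
            calls.append("gain")
        elif z < -threshold:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls


controls = [[98, 100, 102], [100, 98, 102], [102, 102, 96]]
sample = [100, 160, 40]
print(call_cnv(window_z_scores(sample, controls)))  # ['normal', 'gain', 'loss']
```

In practice the read amounts would first be made comparable between samples, for example by using proportions rather than raw counts, as the description allows.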
For example, the z-score (standard score) may also be used based on the difference between the sequencing depth of the window of the sample to be tested and the sequencing depth of the corresponding window of the control sample. A test is performed to detect copy number variation, the depth of sequencing of the window = the amount of reads to the window / the size of the window. In one embodiment of the invention, it is contemplated that the GC content in the reads during the actual sequencing process will have a certain effect on the depth of sequencing [Alkan, Can, Jeffrey M Kidd, Tomas Marques-Bonet, Gozde Aksay, Francesca Antonacci , Fereydoun Hormozdiari, Jacob O Kitzman, et al. "Personalized Copy Number and Segmental Duplication Maps Using next-Generation Sequencing." Nature Genetics 41, no. 10 (October 2009): 1061–67], first performing GC content correction, eliminating GC The effect of the content on the depth of sequencing. The GC content correction can utilize the sequencing data of multiple control samples, take the GC content of multiple window calculation windows and the average sequencing depth, and perform two-dimensional regression analysis on the GC-sequence depth data, for example, using local weighted regression. The point smoothing method (lowess regression) establishes the relationship between the two, and corrects the GC content of each window according to the regression relationship. The relationship between the sequencing depth and the GC content can be established by obtaining sequencing data of a plurality of control sample nucleic acids, the sequencing data being composed of a plurality of reading segments; setting a plurality of windows on the reference sequence, Sequencing data of the plurality of control samples are respectively compared with the window of the reference sequence, and each of the sequencing data of each control sample is calculated. 
The number of reads of the mouth, the depth of sequencing of each window is obtained, the window being part of the reference sequence, the sequencing depth of the window = the total number of reads of the window on the alignment of the respective control samples / (Control sample number * size of the window); based on the sequencing depth of each window and the GC content of the window, the relationship between the sequencing depth and the GC content was established by two-dimensional regression analysis.
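The z-score judgment described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the program of the invention: the function names `window_z_scores` and `call_cnv` are hypothetical, the per-window depths are assumed to have been computed already, and the use of the sample standard deviation of the controls is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def window_z_scores(sample_depths, control_depths):
    """Return one z value per window.

    sample_depths: depths of the sample under test, one value per window.
    control_depths: one list per control sample, aligned to the same windows
                    (at least two controls are needed for a standard deviation).
    """
    z_values = []
    for w, depth in enumerate(sample_depths):
        controls = [c[w] for c in control_depths]
        mu = mean(controls)
        sigma = stdev(controls)  # assumption: sample standard deviation
        # A window with no spread among controls cannot be tested; report 0.
        z_values.append((depth - mu) / sigma if sigma > 0 else 0.0)
    return z_values


def call_cnv(z_values, threshold=3.0):
    """Label each window using |z| > threshold (3 in the embodiment above)."""
    calls = []
    for z in z_values:
        if z > threshold:
            calls.append("gain")    # CNV > 2 relative to a diploid control
        elif z < -threshold:
            calls.append("loss")    # CNV < 2 relative to a diploid control
        else:
            calls.append("normal")
    return calls
```

For instance, with three controls whose depths in one window are 20, 21 and 19, a test depth of 30 gives z = 10 and the window is labelled as a gain.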

In an embodiment of the present invention, step d includes: setting a plurality of windows on the reference sequence, each window being part of the reference sequence; calculating the sequencing depth of each window, where the sequencing depth of the window = the number of reads aligned to the window in the general comparison result / the size of the window; correcting the sequencing depth of the window using the relationship between the sequencing depth and the GC content to obtain the corrected sequencing depth of the window; and determining that the CNV is present in the sample nucleic acid to be tested when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window of the control sample. Preferably, the number of control samples is not less than 30. With 30 or more samples, the sample data follow a specific distribution and are therefore suitable for most statistical test methods; tests such as the t test and the z test generally require the data of multiple samples to conform to a normal distribution. The corrected sequencing depth of the same window of the control sample is obtained by correcting the sequencing depth of that window using the relationship between the sequencing depth and the GC content, where the sequencing depth of the same window of the control sample = the number of reads aligned to the window in the sequencing data of the control sample / the size of the window. The sequencing data, the comparison result and the like of the control sample can be obtained with reference to the sequencing data processing method of one aspect of the present invention or of any of its specific embodiments; they can be obtained at the same time as the sequencing data and the comparison result of the sample to be tested, or obtained in advance and saved for later use.
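As an illustration of the depth formula in this step, the sketch below counts reads per window on one chromosome and divides by the window size. The function name `window_depths` is hypothetical, and assigning each read to the window that contains its start position is an assumption made here for simplicity; the text does not specify how reads spanning a window boundary are counted.

```python
def window_depths(read_starts, chrom_length, window_size):
    """Sequencing depth per window = reads assigned to the window / window size.

    read_starts: 0-based start positions of aligned reads on one chromosome.
    A read is assigned to the window containing its start (an assumption).
    """
    n_windows = (chrom_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // window_size] += 1

    depths = []
    for i, count in enumerate(counts):
        # The last window may be shorter than window_size.
        size = min(window_size, chrom_length - i * window_size)
        depths.append(count / size)
    return depths
```

The window size is a free parameter here, matching the statement that it can be adjusted to the size of the CNV to be detected.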

According to a sixth aspect of the present invention, the present invention provides a CNV detecting apparatus for performing all or part of the steps of the CNV detecting method of one aspect of the present invention, the apparatus comprising: a nucleic acid acquiring device for acquiring nucleic acid of a sample to be tested; a sequencing device for sequencing the nucleic acid from the nucleic acid acquiring device to obtain sequencing data; a data processing device for processing the sequencing data from the sequencing device to obtain a general comparison result; and a detecting device for detecting the CNV based on the general comparison result from the data processing device. The data processing device comprises: a data receiving unit for receiving sequencing data from the sequencing device, the sequencing data including a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment respectively, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read including a gap, and the two reads of a read pair being defined as a left arm and a right arm respectively; a processor for executing a data processing program, where executing the data processing program includes aligning the sequencing data with the reference sequence to obtain a comparison result and eliminating the gap of each read in the comparison result to obtain a general comparison result, the comparison result including comparison results of a plurality of the read pairs and/or comparison results of a plurality of the left arms and comparison results of a plurality of the right arms; and at least one storage unit for storing data, including the data processing program. The foregoing description of the advantages and technical features of the CNV detecting method in one aspect of the present invention or in any of its specific embodiments also applies to the CNV detecting apparatus of this aspect of the present invention and is not repeated here. Those skilled in the art will understand that all or some of the units of this apparatus may optionally and detachably include one or more subunits to perform or implement the various embodiments of the aforementioned CNV detecting method of the present invention.

Sequencing data are obtained by single-linker sequencing on the CG platform, which is cheaper and faster. Using the data processing apparatus, system and/or method of the present invention, the TeraMap2Sam conversion software was developed to convert the TeraMap comparison results of the CG platform into the common SAM format, so that many excellent open-source tools, such as Samtools and GATK, can be used directly for mutation detection, which widens the choice of subsequent analyses. The CNV detection program developed from the CNV detecting method and/or apparatus of the present invention performs CNV analysis based on the standard score (z-score) method, and offers high speed and high resolution.

DRAWINGS

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments in conjunction with the following drawings, in which:

Figure 1 is a schematic view showing the structure of a single-linker circular single-stranded library in one embodiment of the present invention;

Figure 2 is a schematic structural diagram of a sequencing data processing apparatus in an embodiment of the present invention;

Figure 3 is a schematic structural diagram of a sequencing data processing system in an embodiment of the present invention;

Figure 4 is a flow chart of a method for processing sequencing data in an embodiment of the present invention;

Figure 5 is a flow chart showing a method of processing sequencing data in an embodiment of the present invention;

Figure 6 is a flow chart of a CNV detecting method in an embodiment of the present invention;

Figure 7 is a flow chart of a CNV detecting method in an embodiment of the present invention;

Figure 8 is a schematic structural diagram of a CNV detecting apparatus in an embodiment of the present invention;

Figure 9 is a flow diagram showing the construction and sequencing of a single linker library in one embodiment of the present invention;

Figure 10 is a flow chart of the algorithm of the TeraMap2Sam software in one embodiment of the present invention.

Detailed description

The embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the drawings, in which the same or similar reference numerals refer to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are intended to be illustrative of the invention and are not to be construed as limiting. It should be noted that the terms "first", "second", "third", "fourth" or "first grade", "secondary grade" and the like as used herein are merely for convenience of description; they are not to be understood as indicating or implying relative importance, nor as implying a sequential relationship. In the description of the present invention, "a plurality" means two or more unless otherwise stated.

Figure 2 is a schematic view showing the structure of an embodiment of the sequencing data processing device of the present invention. The sequencing data processing device 100 includes a data receiving unit 10, a processor 20, and a storage unit 30. The processor 20 is connected to the data receiving unit 10 and the storage unit 30, and the storage unit 30 is connected to the data receiving unit 10. The data receiving unit 10 is configured to receive sequencing data, where the sequencing data includes a plurality of read pairs, each read pair consists of two reads that come from two positions of a chromosome fragment respectively, the two reads of each read pair come from the positive strand and the negative strand of the chromosome fragment respectively, or both come from the positive strand or both from the negative strand of the chromosome fragment, each read contains a gap, and the two reads of a read pair are defined as the left arm and the right arm respectively. Read pairs whose two reads come from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library. In one embodiment of the present invention, the plurality of read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform. The distance between the two reads of a read pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme. The CG platform constructs a multi-linker paired-end library by enzymatic cleavage, and the constructed circular library is sequenced by its combinatorial probe-anchor ligation (cPAL) sequencing technique, which reads the bases on both sides of each linker. Because the paired-end library is built by enzymatic cleavage followed by ligation at the two sides of a linker, and because each enzyme has a preferred cutting distance but in actual digestion often cuts one position more or one position less than that distance, the reads often carry a gap, frequently of +1 or -1. In addition or alternatively, if the same enzyme is used for multiple digestions during library construction, the cutting position readily shifts, and this shift also leaves gaps in the resulting reads. For example, when constructing a multi-linker circular library, the Alu enzyme is used for two digestions to join different parts of the plurality of linkers, and reading the bases adjacent to these linkers produces reads with gaps of +3/-3. The size of the gap in the present invention may also be zero. Taking the current two-linker (2-AD) sequencing library of the CG platform as an example, the 2-AD sequencing output has a total length of 60 bp, which can be divided into two mate-paired reads; each read has a small gap at the 10 bp position and an invalid sequencing site N at the 20 bp position, and the distance between the two reads of a read pair is generally less than 2000 bp. Among the plurality of reads from a multi-linker library, any one read can form a read pair with any other read. The terms "positive strand" and "negative strand" as used herein refer to the two complementary strands constituting a chromosome fragment and are relative: if one strand is called the positive strand, its complementary strand is called the negative strand. In an embodiment of the present invention, the strand that matches the reference sequence is referred to as the positive strand, and the other strand is referred to as the negative strand.

The processor 20 is configured to execute a data processing program. Executing the data processing program includes: aligning the sequencing data with a reference sequence to obtain a comparison result, and eliminating the gap of each read in the comparison result to obtain a general comparison result, where the comparison result includes comparison results of a plurality of the read pairs and/or comparison results of a plurality of the left arms and comparison results of a plurality of the right arms. The alignment can be performed with known alignment software, such as SOAP or BWA, or with TeraMap, the alignment software of the CG platform. In one embodiment of the invention, the alignment is performed with TeraMap, and the resulting comparison result is in TeraMap format. In one embodiment of the present invention, eliminating the gap of each read in the comparison result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the positive gap with a matching number of N, where N is A, T, C or G; and leaving a read with a gap of 0 unprocessed. For example, a read with a negative gap of -2 nt can be divided at the gap into two parts whose ends overlap by 2 nt. If the two parts of the read are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, removing the overlapping AG, gives the read ATCGCTTAAGTACGATTC.
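The gap elimination rule can be illustrated with a short sketch. It is a minimal example under stated assumptions and not the TeraMap2Sam implementation: the function name `eliminate_gap` and its parameters are hypothetical, a read is assumed to be already split into the two parts on either side of its gap, and for a negative gap the copy of the overlapping bases in the left part is the one that is kept.

```python
def eliminate_gap(left_part, right_part, gap, fill="N"):
    """Join the two parts of a read around a gap.

    gap < 0: the parts overlap by |gap| bases; the duplicated bases at the
             start of right_part are dropped (assumption: keep the left copy).
    gap > 0: the parts are separated by `gap` unread bases; pad with `fill`.
    gap == 0: the parts are simply concatenated.
    """
    if gap < 0:
        return left_part + right_part[-gap:]
    return left_part + fill * gap + right_part


# Worked example from the text: a -2 nt gap with overlapping "AG".
merged = eliminate_gap("ATCGCTTAAG", "AGTACGATTC", -2)
```

Here `merged` is the 18 nt read given in the example above, and a positive gap of 1 between ACGT and TTAA would yield ACGTNTTAA.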

The storage unit 30 is for storing data. The above-described data processing program is stored in the storage unit 30, and the intermediate data or results produced when the sequencing data from the data receiving unit 10 are processed by the processor 20 are also stored there.

Figure 3 is a block diagram showing the structure of an embodiment of the sequencing data processing system of the present invention. The sequencing data processing system 1000 includes a sequencing data processing device 100, a host 200, and a display device 300. The host 200 can be an audio/video/signal source device, such as a computer host or a mainframe, for transmitting the display data required by the display device 300. The host 200 includes at least one interface electrically connected to the sequencing data processing device 100. The sequencing data processing device 100 receives the sequencing data output from the host 200, processes the sequencing data, and then outputs the processed data or results to the display device 300.

Figure 4 is a flow chart of one embodiment of the sequencing data processing method of the present invention. The sequencing data processing method comprises the steps of: S1, acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment respectively, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read including a gap, and the two reads of a read pair being defined as a left arm and a right arm respectively; S2, aligning the sequencing data with a reference sequence to obtain a comparison result, the comparison result including comparison results of a plurality of the read pairs and/or comparison results of a plurality of the left arms and comparison results of a plurality of the right arms; S3, eliminating the gap of each read in the comparison result to obtain a general comparison result. For the way the read pairs are acquired, the gaps included in the reads, the alignment, the elimination of the gaps, the comparison result and the general comparison result, reference may be made to the description of the corresponding technical features of the sequencing data processing device in one aspect or any embodiment of the present invention above. For example, in the same way, read pairs whose two reads come from two positions of a chromosome fragment can be obtained by constructing a paired-end library or a mate-pair library and sequencing the constructed library. In one embodiment of the present invention, the plurality of read pairs are obtained using the library construction method of Complete Genomics (CG) and its sequencing platform, and the distance between the two reads of a read pair is controlled by the read length and by the distance between the recognition site and the cleavage site of the enzyme. The CG platform constructs a multi-linker paired-end library by enzymatic cleavage, and the constructed circular library is sequenced by its combinatorial probe-anchor ligation (cPAL) sequencing technique, which reads the bases on both sides of each linker. Because the paired-end library is built by enzymatic cleavage followed by ligation at the two sides of a linker, and because each enzyme has a preferred cutting distance but in actual digestion often cuts one position more or one position less than that distance, the reads often carry a gap, frequently of +1 or -1. In addition or alternatively, if the same enzyme is used for multiple digestions during library construction, the cutting position readily shifts, and this shift also leaves gaps in the resulting reads. For example, when constructing a multi-linker circular library, the Alu enzyme is used for two digestions to join different parts of the plurality of linkers, and reading the bases adjacent to these linkers produces reads with gaps of +3/-3. The size of the gap in the present invention may also be zero. Among the plurality of reads from a multi-linker library, any one read can form a read pair with any other read. The terms "positive strand" and "negative strand" as used herein refer to the two complementary strands constituting a chromosome fragment and are relative: if one strand is called the positive strand, its complementary strand is called the negative strand. Here, the strand that matches the reference sequence is referred to as the positive strand, and the other strand is referred to as the negative strand. The alignment can be performed with known alignment software, such as SOAP or BWA, or with TeraMap, the alignment software of the CG platform. In one embodiment of the invention, the alignment is performed with TeraMap, and the resulting comparison result is in TeraMap format. In one embodiment of the present invention, eliminating the gap of each read in the comparison result means: for a read with a negative gap, removing the negative gap, that is, removing the overlapping bases; for a read with a positive gap, filling the positive gap with a matching number of N, where N is A, T, C or G; and leaving a read with a gap of 0 unprocessed. For example, a read with a negative gap of -2 nt can be divided at the gap into two parts whose ends overlap by 2 nt. If the two parts of the read are ATCGCTTAAG and AGTACGATTC, eliminating the negative gap, that is, removing the overlapping AG, gives the read ATCGCTTAAGTACGATTC.

Figure 5 is a flow chart of one embodiment of the sequencing data processing method of the present invention. The sequencing data processing method comprises: S10, acquiring sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment respectively, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read including a gap, and the two reads of a read pair being defined as a left arm and a right arm respectively; S20, aligning the sequencing data with a reference sequence to obtain a comparison result, the comparison result including comparison results of a plurality of the read pairs and/or comparison results of a plurality of the left arms and comparison results of a plurality of the right arms; S30, extracting the unique comparison results from the comparison result to replace the comparison result, the unique comparison results comprising a plurality of read pairs that align uniquely to the reference sequence, where each such read pair aligns to the same chromosome of the reference sequence and the distance between the two reads of each read pair is consistent with the expected distance between the two positions of the chromosome fragment from which they are derived; S40, correcting the unique comparison results so that each read pair aligns to the positive strand of the same chromosome of the reference sequence, for example, for a read pair whose two reads align to the positive strand and the negative strand of a chromosome respectively, the read aligned to the negative strand is converted to its complementary strand, that is, the correction is achieved by replacing that read with its reverse complement; S50, eliminating the gap of each read in the unique comparison results to obtain a general comparison result.
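Steps S30 and S40 can be sketched as below. This is an illustrative sketch only: the record type `ArmHit`, its fields, and the functions `keep_pair` and `to_positive_strand` are hypothetical names that do not come from TeraMap or TeraMap2Sam, and the 2000 bp distance limit is borrowed from the 2-AD example given earlier as an assumed stand-in for the "expected distance".

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class ArmHit:
    chrom: str    # chromosome the arm aligned to
    pos: int      # alignment position on the reference
    strand: str   # "+" or "-"
    seq: str      # read sequence
    n_hits: int   # number of places this arm aligned to


def keep_pair(left, right, max_distance=2000):
    """S30: keep pairs that align uniquely, on one chromosome, at a plausible distance."""
    unique = left.n_hits == 1 and right.n_hits == 1
    same_chrom = left.chrom == right.chrom
    return unique and same_chrom and abs(right.pos - left.pos) <= max_distance


def to_positive_strand(hit):
    """S40: replace a negative-strand read with its reverse complement."""
    if hit.strand == "-":
        return ArmHit(hit.chrom, hit.pos, "+", reverse_complement(hit.seq), hit.n_hits)
    return hit
```

A pair that passes `keep_pair` would then have both arms passed through `to_positive_strand` before gap elimination in S50.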

Figure 6 is a flow chart of one embodiment of the CNV detecting method of the present invention. The CNV detection method comprises the steps of: S11, acquiring nucleic acid of a sample to be tested; S12, sequencing the nucleic acid to obtain sequencing data; S13, processing the sequencing data to obtain a general comparison result; S14, detecting the CNV based on the general comparison result; where S13 is performed using the sequencing data processing device and/or the sequencing data processing method of one aspect of the invention or of any of its embodiments. Detection of CNV based on the general comparison result can use currently known CNV detection methods, such as a hidden Markov model, circular binary segmentation, hierarchical segmentation or a kernel smoothing algorithm.

Figure 7 is a flow chart of one embodiment of the CNV detecting method of the present invention. The CNV detection method includes the steps of: S110, acquiring nucleic acid of a sample to be tested; S120, sequencing the nucleic acid to obtain sequencing data; S130, processing the sequencing data to obtain a general comparison result, S130 being performed using the sequencing data processing device and/or the sequencing data processing method of the above aspect of the invention or of any of its embodiments; S140, setting a plurality of windows on the reference sequence, each window being part of the reference sequence, and calculating the sequencing depth of each window, where the sequencing depth of the window = the number of reads aligned to the window in the general comparison result / the size of the window; S150, correcting the sequencing depth of the window using the relationship between the sequencing depth and the GC content to obtain the corrected sequencing depth of the window; S160, determining that the CNV is present in the sample nucleic acid to be tested when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window of the control sample. The number of control samples is not less than 30. With 30 or more samples, the sample data follow a specific distribution and are therefore suitable for most statistical test methods; tests such as the t test and the z test generally require the sample data to conform to a normal distribution. The corrected sequencing depth of the same window of the control sample is obtained by correcting the sequencing depth of that window using the relationship between the sequencing depth and the GC content, where the sequencing depth of the same window of the control sample = the number of reads aligned to the window in the sequencing data of the control sample / the size of the window. The sequencing data, the comparison result and the like of the control sample can be obtained with reference to the sequencing data processing method of one aspect of the present invention or of any of its specific embodiments; they can be obtained at the same time as the sequencing data and the comparison result of the sample to be tested, or obtained in advance and saved for later use. The relationship between the sequencing depth and the GC content can be established as follows: obtaining sequencing data of a plurality of control sample nucleic acids, the sequencing data being composed of a plurality of reads; setting a plurality of windows on the reference sequence, each window being part of the reference sequence; aligning the sequencing data of the plurality of control samples to the windows of the reference sequence respectively, and counting the number of reads aligned to each window in the sequencing data of each control sample to obtain the sequencing depth of each window, where the sequencing depth of a window = the total number of reads aligned to the window across the control samples / (number of control samples * size of the window); and, based on the sequencing depth and the GC content of each window, establishing the relationship between the sequencing depth and the GC content by two-dimensional regression analysis, for example Lowess regression.
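The GC correction in S150 can be illustrated with the following sketch. It rests on several assumptions that are not taken from the text: the function names are hypothetical, a per-bin mean over 5% GC bins is swapped in as a simple stand-in for the Lowess regression, and the correction rescales each depth by the ratio of the overall mean control depth to the depth expected at that GC content, a common choice that the text itself does not spell out.

```python
from collections import defaultdict
from statistics import mean


def gc_content(window_seq):
    """Fraction of G and C among the called (A/C/G/T) bases of a window."""
    called = [b for b in window_seq.upper() if b in "ACGT"]
    return sum(b in "GC" for b in called) / len(called) if called else 0.0


def control_depths(read_counts_per_control, window_size):
    """Depth per window = total reads over all controls / (n_controls * window size)."""
    n_controls = len(read_counts_per_control)
    totals = [sum(column) for column in zip(*read_counts_per_control)]
    return [t / (n_controls * window_size) for t in totals]


def gc_bin(gc, bin_percent=5):
    """Index of the GC bin, computed on integer percentages to avoid float drift."""
    return int(round(gc * 100)) // bin_percent


def fit_gc_relationship(gc_values, depths):
    """Map each GC bin to the mean control depth in it (stand-in for Lowess)."""
    bins = defaultdict(list)
    for gc, depth in zip(gc_values, depths):
        bins[gc_bin(gc)].append(depth)
    return {b: mean(ds) for b, ds in bins.items()}, mean(depths)


def correct_depth(depth, gc, relationship):
    """Rescale a window depth so that windows of different GC become comparable."""
    expected_by_bin, overall = relationship
    expected = expected_by_bin.get(gc_bin(gc))
    if not expected:
        return depth  # no usable control data for this GC bin: leave uncorrected
    return depth * overall / expected
```

The same `correct_depth` call would be applied to the windows of the sample to be tested and to those of each control sample before the significance test in S160.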

Figure 8 is a block diagram showing the structure of an embodiment of the CNV detecting apparatus of the present invention. The apparatus 2000 includes: a nucleic acid acquisition device 200 for acquiring nucleic acid of a sample to be tested; a sequencing device 400 for sequencing the nucleic acid from the nucleic acid acquisition device 200 to obtain sequencing data; a data processing device 600 for processing the sequencing data from the sequencing device 400 to obtain a general comparison result; and a detection device 800 for detecting the CNV based on the general comparison result from the data processing device 600. The data processing device 600 includes: a data receiving unit 610 for receiving sequencing data from the sequencing device, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that come from two positions of a chromosome fragment respectively, the two reads of each read pair coming from the positive strand and the negative strand of the chromosome fragment respectively, or both coming from the positive strand or both from the negative strand of the chromosome fragment, each read including a gap, and the two reads of a read pair being defined as a left arm and a right arm respectively; a processor 630 for executing a data processing program, where executing the data processing program includes aligning the sequencing data with a reference sequence to obtain a comparison result and eliminating the gap of each read in the comparison result to obtain a general comparison result, the comparison result including comparison results of a plurality of the read pairs and/or comparison results of a plurality of the left arms and comparison results of a plurality of the right arms; and at least one storage unit 650 for storing data, including the data processing program. The foregoing description of the advantages and technical features of the CNV detecting method in one aspect of the present invention or in any of its specific embodiments also applies to the CNV detecting apparatus of this aspect of the present invention and is not repeated here. Those skilled in the art will understand that all or some of the units of this apparatus may optionally and detachably include one or more sub-units to perform or implement the various embodiments of the aforementioned CNV detecting method of the present invention.

The following examples merely illustrate preferred embodiments of the invention. Where specific techniques or conditions are not indicated in the examples, they are carried out according to the techniques or conditions described in the literature in the field (for example, J. Sambrook et al., translated by Huang Peitang et al., "Molecular Cloning: A Laboratory Manual", third edition, Science Press) or according to the product specifications. Reagents or instruments for which no manufacturer is indicated are commercially available products.

Embodiment 1

The peripheral blood plasma of lung cancer patients was taken as the test object. The samples were from Southwest Hospital and tested as follows:

(1) Library establishment and sequencing

The library construction and sequencing process is shown in Figure 9. The specific sequences given below are written 5' to 3' from left to right. The text between "/" in a sequence is a terminal modification group: "phos" indicates phosphorylation, "dd" indicates dideoxy, and "bio" indicates biotin.

1. Extraction of cfDNA (using SnoMag Circulating DNA Kit):

1) Take 200ul of plasma in a 1.5ml EP tube and add 600ul of buffer LSB.

2) Add 20 μl NanoMag Circulating Beads and incubate for 10 min at room temperature, mixing once every 2-3 min.

3) Place the EP tube on a magnetic stand for 1 min and discard the supernatant.

4) Remove the EP tube and add 150uL Buffer WA and mix.

5) Place the EP tube on a magnetic stand for 1 min and discard the supernatant.

6) Remove the EP tube and add 150 uL of 75% ethanol and mix.

7) Place the EP tube on a magnetic stand for 1 min and discard the supernatant.

8) Repeat steps 6) to 7).

9) Dry the magnetic beads for 5 min at room temperature.

10) Add 32 ul of elution buffer, mix with the magnetic beads, and let stand for 5 min at room temperature.

11) Place the EP tube on a magnetic stand for 1 min and transfer the supernatant to a new 1.5 ml EP tube.

2. Construction of the library:

1) rSAP dephosphorylation

cfDNA	30ul
10x NEBuffer 2	3.5ul
rSAP(1U/ul)	1.5ul
Total	35ul

Reaction conditions:

2) T4DNA Polymerase end fill

Reaction conditions:

12℃	20min
4℃	hold

The above reaction product was purified by 60 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.

3) The first sequence and the second sequence are respectively ligated to both ends of the end-filled DNA fragment

Reaction conditions:

20℃	15min
4℃	hold

The above reaction product was purified by 40 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.

The two strands of the first sequence are: TTGGCCTCCGACT/3-ddT/(SEQ ID NO: 1), /5phos/AAGTCGGAGGCCAAGCGGTCGT/ddC/ (SEQ ID NO: 2).

The two strands of the second sequence are: /5Phos/GTCTCCAGTCGAAGCCCGACG/3ddC/(SEQ ID NO: 3), GCTTCGACTGGAGA/3ddC/(SEQ ID NO: 4).

4) Nick Translation

The upstream primer in the third sequence is /5-bio/TCCTAAGACCGCTTGGCCTCCGACT (SEQ ID NO: 5).

The downstream primer in the third sequence is

5Phos/AGACAAGCTCxxxxxxxxxxGATCGGGCTTCGACTGGAGAC (SEQ ID NO: 6), where the "x" positions in the middle form a variable tag sequence region that can be replaced by N, N being A, T, C or G. When no other sample libraries are pooled and only one sample library is loaded on the machine, no tag sequence is required, i.e. the downstream primer of the third sequence can be

5Phos/AGACAAGCTCGATCGGGCTTCGACTGGAGAC (SEQ ID NO: 7). In this example, because the sample is tumor cell-free nucleic acid, the content of the target nucleic acid (ctDNA) in the mixed nucleic acid is low. If several such sample libraries were pooled on the machine, the mixed data would have to be split by sample, which loses part of the data. In addition, reads from the single-adapter circular library are relatively short, so accurate mutation detection requires deep sequencing to obtain a relatively large amount of data. Preferably, therefore, a single sample library is loaded on the machine.

Reaction conditions:

60℃	5min
37℃	0.1℃/secs-hold

Add 8ul of the following nick translation mixture to the above reaction.

Reaction conditions:

37℃	20min
4℃	hold

The above reaction product was purified by 40 ul of Ampure XP beads and eluted with 37.4 ul of Elution buffer.

5) PCR with Pfx

Reaction conditions:

The above reaction product was purified by 50 ul of Ampure XP beads and eluted with 22 ul of Elution buffer.

6) Qubit quantification

The PCR product was subjected to concentration determination using a Qubit dsDNA HS assay kit.

7) Strand Separation

a) Multiple libraries were mixed to give a total of about 160 ng of DNA. The sample was brought to a total volume of 60 ul with 1 x TE.

b) Prepare the following reagents in advance: 4X BBB, Streptavidin Beads, 0.3M MOPS acid, 0.5% Tween 20, 1X BBB/Tween Mix, 1X BWB/Tween Mix, 0.1 M NaOH. Among them, 1X BWB/Tween Mix, 0.1M NaOH, and Streptavidin Beads are ready for use.

c) Prepare the following four reagents 15 minutes in advance:

0.5% Tween20, 1X BBB/Tween Mix, 1X BWB/Tween Mix, 0.1M NaOH.

The 0.5% Tween20 is prepared as above; the other three are prepared as follows:

d) 1X BBB/Tween Mix

1X BBB	30ul
0.5% Tween20	0.3ul
Total	30.3ul

e) 1X BWB/Tween Mix

1X BWB	2000ul
0.5% Tween20	20ul
Total	2020ul

f) 0.1M NaOH

0.5M NaOH	15.6ul
Water	62.40ul
Total	78.0ul

g) Streptavidin Beads washing method is as follows:

· Take 30ul Streptavidin Beads per sample: add 3-5 volumes of 1X BBB, mix, and place on a magnetic stand to let the beads collect. Turn the non-stick tube so that the beads move back and forth through the 1X BBB wash, then discard the supernatant. Repeat the above operation once.

· Remove the non-stick tube, resuspend the beads in 1 volume (30 ul) of 1X BBB/Tween Mix, mix and let stand at room temperature.

h) Add 20 ul of 4X BBB to the 60 ul PCR product mixture, then transfer to the non-stick tube containing the beads resuspended in 30 ul of 1X BBB/Tween Mix. Incubate the 110 ul mixture at room temperature for 15-20 min, mixing once during the incubation.

i) Place the above non-stick tube on the magnetic stand for 3-5 min, discard the supernatant, and wash twice with 1 ml of 1X BWB/Tween Mix. The method is the same as the washing method for the Streptavidin Beads.

j) Add 26 ul of 0.1 M NaOH to the above beads, mix by blowing and let stand for 10 min, then place on a magnetic stand for 3-5 min, and take the supernatant into a new 1.5 ml EP tube.

k) Add 13ul of 0.3M MOPS to the above 1.5ml EP tube and mix for later use.

l) The product of this step can be stored frozen at -20 °C.

8) Splint Circularization

a) Add 10ul of 20uM fourth sequence to the 39ul sample obtained in the previous step. The fourth sequence is

TCGAGCTTGTCTTCCTAAGACCGC (SEQ ID NO: 8);

b) Prepare the ligase reaction mixture 5 minutes in advance, prepared as follows:

Water	4.2ul
10x TA Buffer(LK1)	6ul
100mM ATP	0.6ul
600U/ul Ligase	0.2ul
Total	11ul

c) The ligase reaction mixture is shaken and thoroughly mixed. After centrifugation, 11 ul of the ligase reaction mixture is added to the EP tube to which the primer reaction mixture has been added, shaken for 10 s, and centrifuged briefly.

d) Incubate in a PCR machine for 1.5 h at 37 °C.

e) After the reaction is completed, 5 ul of the sample is taken out and checked by electrophoresis on a 6% denaturing gel, and the remaining volume of about 55 ul is carried into the next enzyme reaction.

9) Enzymatic digestion (Exo I and III)

a) Prepare the digestion reaction mixture about 5 minutes in advance, as follows:

10x TA Buffer(LK1)	1ul
20U/ul Exo I	3ul
200/ul Exo III	1ul
Total	5ul

b) The mixture is shaken and thoroughly mixed, and after centrifugation, 5 ul of the reaction mixture is separately added to the 55 ul sample obtained in the previous step;

c) Shake for 10 s to mix, centrifuge briefly, and incubate in a PCR machine at 37 ° C for 30 min.

d) After the enzyme digestion was completed for 30 min, 2.5 ul of 500 mM EDTA was added to the sample to terminate the enzyme reaction.

e) The above sample was purified with PEG32beads/tween20 as follows:

Transfer 59 ul of the product of the above step to a 1.5 ml non-stick tube, add 78 ul of PEG32beads/tween20 (PEG32beads: tween20=100:1), and bind at room temperature for 15 min, mixing once by pipetting during the incubation;

f) Place the non-stick tube on the magnetic stand for 3-5 min, discard the supernatant, and wash twice with 700 ul of 75% ethanol. During each wash, turn the non-stick tube front to back 2-3 times so that the beads move through the ethanol;

g) After drying at room temperature, dissolve in 27 ul TE/tween20 (TE: tween20=500:1) for 15 min, mixing once midway;

h) Transfer the supernatant to a new 1.5 ml EP tube and quantify the final product with the Qubit™ ssDNA Assay Kit. The ratio of buffer to dye is 199:1; after combining them, vortex and centrifuge to mix. Take two 190 ul aliquots of the diluted dye working solution, add 10 ul of one of the two standards to each, then vortex and centrifuge to mix. Add 198 ul of the diluted dye working solution to 2 ul of sample, vortex, centrifuge, and quantify on the Qubit instrument.

i) Normalization of concentration

The starting amount of sample used for DNB preparation was adjusted to 35.3 ng-53 ng according to the concentration determined by single-strand quantification. The corresponding volume of sample (<60 ul) was transferred to a Biorad PCR plate and made up with 1X TE to a total volume of no more than 120 ul.

The final concentration is 5.625-7.5 fmol/ul, the volume is 120 ul, and the total amount is 35.3 ng-53 ng. DNB preparation for 1-adapter sequencing requires 120 fmol, i.e. 16 ul at 7.5 fmol/ul. Therefore, the library needs to be diluted to 7.5 fmol/ul.
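The arithmetic behind the DNB input can be checked with a short sketch. The function names below are hypothetical and are not part of the CG workflow; the sketch only applies amount = concentration x volume and the dilution relation C1*V1 = C2*V2 to the figures quoted above. The 15 fmol/ul stock in the second call is an invented example value.

```python
def volume_for_amount(amount_fmol: float, conc_fmol_per_ul: float) -> float:
    """Volume (ul) of library containing amount_fmol at the given concentration."""
    if conc_fmol_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return amount_fmol / conc_fmol_per_ul


def diluent_to_add(stock_conc: float, stock_volume_ul: float, target_conc: float) -> float:
    """Volume (ul) of 1X TE to add so the stock reaches target_conc (C1*V1 = C2*V2)."""
    if target_conc <= 0 or stock_conc < target_conc:
        raise ValueError("stock must be at least as concentrated as the target")
    final_volume = stock_conc * stock_volume_ul / target_conc
    return final_volume - stock_volume_ul


# 120 fmol at 7.5 fmol/ul corresponds to the 16 ul quoted in the text.
print(volume_for_amount(120, 7.5))
# Hypothetical stock: 10 ul at 15 fmol/ul needs 10 ul of diluent to reach 7.5 fmol/ul.
print(diluent_to_add(15.0, 10.0, 7.5))
```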

10) CG 1-Adapter sequencing

Sequencing is performed using the standardized process of the CG platform. DNA nanochips are a high-throughput sequencing technology pioneered by CG. Sequencing the improved single-adapter library of this example is less expensive and faster than other sequencing protocols, and quality control is integrated to ensure sequencing quality.

Embodiment 2

The data output by the sequencer in Embodiment 1 is processed. Using the sequencing data processing method and/or the CNV detection method of the present invention, based on the CG platform sequencing technology, ultra-low-input cfDNA enrichment, library construction, sequencing and data analysis can be performed. In this example, owing to the particular sequencing principle of CG, the sequencing reads are short and contain repeatedly sequenced bases and small gaps at specific positions, so it is difficult to align the sequencing results directly with ordinary alignment software. For this special read structure, we use TeraMap, the CG platform's proprietary aligner. Its working principle is as follows. First, it aligns the two ends of a read pair (LeftArm, RightArm) separately, trying a variety of gap values when processing each read in order to obtain more alignment results. Then, taking the alignment result of each end as a reference, it locally aligns the other end (for example, for 4-AD, the range of the local alignment is 0 to 700 bp). If both ends align well to the same chromosome and the insert size meets expectations (for example, for 4-AD, the distance between the two reads of a read pair is 0-700 bp), only the best alignment result is output; otherwise, the multiple alignment results of both ends are output. TeraMap is alignment software for the CG sequencing platform that can align CG-specific reads to the reference genome. Its output format consists of three parts, briefly described as follows: the first line is the read sequence information; the second and third lines describe the alignment status of the reads; the fourth and fifth lines give the details of the read alignment results.
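The pairing rule described above can be illustrated with a minimal sketch. This is not TeraMap's implementation, which is closed source. The ArmHit record, the additive score and the function names are assumptions made for illustration; only the same-chromosome condition and the 0-700 bp distance come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ArmHit:
    """One candidate alignment of an arm (hypothetical record layout)."""
    chrom: str
    pos: int    # leftmost position on the reference
    score: int  # higher is better (assumed scoring scheme)


MAX_INSERT = 700  # upper bound of the arm distance quoted for 4-AD


def concordant_pairs(left: List[ArmHit], right: List[ArmHit],
                     max_insert: int = MAX_INSERT) -> List[Tuple[ArmHit, ArmHit]]:
    """All left/right combinations on the same chromosome within max_insert."""
    return [(l, r) for l in left for r in right
            if l.chrom == r.chrom and abs(r.pos - l.pos) <= max_insert]


def select_output(left: List[ArmHit], right: List[ArmHit]):
    """Return the single best concordant pair, else every hit of both arms."""
    pairs = concordant_pairs(left, right)
    if pairs:
        best = max(pairs, key=lambda p: p[0].score + p[1].score)
        return "pair", [best]
    return "unpaired", (left, right)


left_hits = [ArmHit("chr1", 1000, 30), ArmHit("chr2", 500, 28)]
right_hits = [ArmHit("chr1", 1400, 29), ArmHit("chr3", 10, 31)]
print(select_output(left_hits, right_hits))
```

In this sketch the chr1/chr1 combination is the only concordant one, so it is the only result reported; if no combination were concordant, all hits of both arms would be returned.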

First line:

Column number	Field	Type	Introduction
1	QNAME	String	Reference sequence number
2	POS	Integer	Position aligned to on the reference sequence
3	SEQ	String	Sequence information of the aligned fragment

second line:

The fourth line:

Because the gaps in the TeraMap output make downstream analysis impossible, the Teramap2Sam software was developed according to the method of the present invention; it removes the gaps from the TeraMap alignment result and converts the result into SAM (Sequence Alignment/Map) format. The main process of the Teramap2Sam software can be divided into three parts, and the algorithm flow chart is shown in Figure 10.

Step 1: Extract the unique alignment results. Whether a read pair is uniquely aligned is determined from the matchCount value in the TeraMap output; in addition, the insert length must meet the requirements and the reads at both ends must align to the same reference sequence.

Step 2: Remove the gaps. The gap positions in the reads are determined from the gaps field, and the read sequence is corrected accordingly.

Step 3: Calculate FLAG. The FLAG value of the SAM file is calculated from the alignment directions of the two reads of the pair, yielding the converted alignment result.
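The three steps can be sketched as follows. This is a minimal illustration, not the Teramap2Sam source. The representation of a read as a list of segments with one gap value between neighbours is an assumption, as is trimming the overlap from the start of the following segment. The positive/negative gap rule (fill with N, remove the overlap) follows the claims, and the FLAG bits are those defined by the SAM specification.

```python
from typing import Sequence

# SAM FLAG bits (from the SAM specification)
PAIRED, PROPER_PAIR = 0x1, 0x2
REVERSE, MATE_REVERSE = 0x10, 0x20
FIRST_IN_PAIR, SECOND_IN_PAIR = 0x40, 0x80


def is_unique_pair(match_count: int, left_chrom: str, right_chrom: str,
                   left_pos: int, right_pos: int, max_insert: int = 700) -> bool:
    """Step 1: keep pairs with one match, same reference, acceptable insert."""
    return (match_count == 1 and left_chrom == right_chrom
            and abs(right_pos - left_pos) <= max_insert)


def remove_gaps(segments: Sequence[str], gaps: Sequence[int]) -> str:
    """Step 2: join read segments; gaps[i] lies between segments i and i+1."""
    if not segments or len(gaps) != len(segments) - 1:
        raise ValueError("need one gap value between consecutive segments")
    read = segments[0]
    for gap, seg in zip(gaps, segments[1:]):
        if gap >= 0:
            read += "N" * gap + seg  # positive gap: pad unknown bases with N
        else:
            read += seg[-gap:]       # negative gap: drop the overlapping bases
    return read


def sam_flag(is_first: bool, reverse: bool, mate_reverse: bool) -> int:
    """Step 3: FLAG for one read of a properly paired, uniquely aligned pair."""
    flag = PAIRED | PROPER_PAIR
    if reverse:
        flag |= REVERSE
    if mate_reverse:
        flag |= MATE_REVERSE
    flag |= FIRST_IN_PAIR if is_first else SECOND_IN_PAIR
    return flag


print(remove_gaps(["ACGTA", "CCGGT", "TTAAC"], [2, -1]))
print(sam_flag(True, False, True), sam_flag(False, True, False))
```

With a forward left arm and a reverse right arm, the two FLAG values are the familiar 99 and 147 of a properly paired read pair.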

SAM is a more general format for storing alignment information. Each line describes the alignment of one read and consists mainly of eleven fields; further optional fields can be appended to carry more information, for example XT:A:U indicates that the read is uniquely aligned. A brief description is as follows:
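The eleven mandatory fields defined by the SAM specification are QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL, followed by optional TAG:TYPE:VALUE fields. The sketch below splits one alignment line into these fields; the SamRecord type and the example line are illustrative assumptions, not output of the software described here.

```python
from typing import Dict, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated SAM alignment line into its fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    tags = {}
    for item in cols[11:]:
        tag, _type, value = item.split(":", 2)  # e.g. XT:A:U
        tags[tag] = value
    return SamRecord(cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
                     cols[5], cols[6], int(cols[7]), int(cols[8]),
                     cols[9], cols[10], tags)


# Invented example line for illustration only.
example = "read1\t99\tchr1\t1000\t60\t16M\t=\t1400\t416\tACGTANNCCGGTTAAC\t*\tXT:A:U"
record = parse_sam_line(example)
print(record.rname, record.pos, record.tags)
```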

In order to save storage resources in actual use, the binary compression format (BAM) is mainly used. In addition, CG developed the Assembly Software for its read structure to reassemble the reads, and perform the follow-up work after the assembly is completed.

Owing to the shortcomings of CG single-adapter reads, in particular their short length (12 bp), the original CG mutation detection tools are no longer applicable, or give poor results, in some special data processing. In response, we first developed a tool that converts TeraMap alignment results into the common SAM/BAM format, which is the alignment format generally used for high-throughput sequencing, and then use the BAM data to detect copy number variation. Existing copy number variation detection methods include the hidden Markov model, circular binary segmentation, hierarchical segmentation, and kernel smoothing algorithms. We use the z-score (standard score) to obtain copy number variation results based on the read depth distribution over multiple windows with a total length of 1,000,000 bp.

Considering that the GC content of the reads has a certain influence on sequencing depth during actual sequencing, we apply a GC content correction to the alignment results (BAM) to eliminate the influence of GC content on depth. Specifically, multiple windows with a total length of 1,000,000 bp are taken, the GC content and average sequencing depth of each window are calculated, lowess regression is performed on the GC content versus sequencing depth data, and the sequencing depth is corrected according to the regression curve.
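A minimal sketch of such a GC correction is given below. It uses a plain locally weighted linear regression with tricube weights and no robustness iterations, which is a simplified stand-in for a full lowess implementation. The text does not give the correction formula, so scaling each window by overall mean depth / fitted depth is an assumption, as are all function names and the frac parameter.

```python
from typing import List, Sequence


def gc_fraction(window_seq: str) -> float:
    """Fraction of G/C among the A/C/G/T bases of a window sequence."""
    seq = window_seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def lowess_fit(x: Sequence[float], y: Sequence[float], frac: float = 0.5) -> List[float]:
    """Locally weighted linear fit of y on x, evaluated at every x."""
    n = len(x)
    k = min(n, max(2, int(round(frac * n))))  # neighbours per local fit
    fitted = []
    for xi in x:
        dists = sorted(abs(xj - xi) for xj in x)
        h = dists[k - 1] or 1e-12             # local bandwidth
        w = [(1 - min(abs(xj - xi) / h, 1.0) ** 3) ** 3 for xj in x]
        sw = sum(w)                           # >= 1: the point itself has weight 1
        mx = sum(wi * xj for wi, xj in zip(w, x)) / sw
        my = sum(wi * yj for wi, yj in zip(w, y)) / sw
        sxx = sum(wi * (xj - mx) ** 2 for wi, xj in zip(w, x))
        sxy = sum(wi * (xj - mx) * (yj - my) for wi, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (xi - mx))
    return fitted


def gc_correct(depths: Sequence[float], gcs: Sequence[float], frac: float = 0.5) -> List[float]:
    """Scale each window depth by overall mean / fitted depth at its GC content."""
    fitted = lowess_fit(gcs, depths, frac)
    overall = sum(depths) / len(depths)
    return [d * overall / f if f > 0 else d for d, f in zip(depths, fitted)]


# Invented example: depth rises linearly with GC, so correction flattens it.
gcs = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
depths = [10 + 20 * g for g in gcs]
print([round(c, 3) for c in gc_correct(depths, gcs)])
```

In this toy input the depth depends only on GC content, so after correction every window has the same depth, which is the behaviour the correction is meant to achieve.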

The standard score, also called the z-score, is obtained by dividing the difference between a score and the mean by the standard deviation, expressed as z = (x - μ) / σ, where x is a specific score, μ is the mean, and σ is the standard deviation. The z value represents the distance between the original score and the population mean, measured in units of the standard deviation: z is negative when the original score is lower than the mean and positive when it is higher. In this example, copy number variation can be effectively detected by measuring, in units of the standard deviation, the distance between the read count in a 2000 bp window (the original score) and the mean read count of that window over multiple normal control samples. A positive z value reflects a copy number greater than 2 (a normal sample has 2 copies), such as a duplication; a negative z value reflects a copy number less than 2, such as a deletion. The above CNV detection method of this embodiment is written as a program named calcu_zscore_query, and a region where the absolute value of z is larger than 3 is judged to be a CNV.
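The z-score test can be sketched in a few lines. This is not the calcu_zscore_query program, whose source is not given. The function names and the example counts are invented, and using the sample standard deviation (statistics.stdev) over the controls is an assumption, since the text does not say which estimator is used. The threshold of 3 comes from the text.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def window_zscores(test_counts: Sequence[float],
                   control_counts: Sequence[Sequence[float]]) -> List[float]:
    """z = (x - mu) / sigma per window; mu and sigma come from the controls."""
    zs = []
    for w, x in enumerate(test_counts):
        ctrl = [sample[w] for sample in control_counts]
        mu, sigma = mean(ctrl), stdev(ctrl)
        zs.append((x - mu) / sigma if sigma > 0 else 0.0)
    return zs


def call_cnv(zs: Sequence[float], threshold: float = 3.0) -> List[Tuple[int, str]]:
    """Windows with |z| > threshold: 'gain' if z > 0, 'loss' if z < 0."""
    calls = []
    for w, z in enumerate(zs):
        if z > threshold:
            calls.append((w, "gain"))
        elif z < -threshold:
            calls.append((w, "loss"))
    return calls


# Invented read counts: four control samples and one test sample, three windows.
controls = [[100, 100, 100], [102, 98, 101], [98, 102, 99], [100, 100, 100]]
test_sample = [100, 150, 60]
print(call_cnv(window_zscores(test_sample, controls)))
```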

Compared with the traditional method, the CG single-adapter sequencing method makes it possible to construct and sequence libraries from ultra-low input: only 1-10 ng of nucleic acid is needed for library construction, corresponding to 2-5 ml of peripheral blood, and the standardized CG process is simple and fast. Once the TeraMap alignment result has been converted to SAM format, it is more versatile than the closed-source TeraMap format and can be processed with software such as Samtools. In addition, CNV can be detected quickly using the z-score (standard score): CNV analysis of 50X whole-genome data takes only 4 hours, whereas, for comparison, the CONTRA software [ http://sourceforge.net/projects/contra-cnv/ ] takes more than 1 day.

In this example, TeraMap is used for alignment. After sequencing is completed, the raw reads are obtained with makeADF, a tool integrated in the CG platform, and are then aligned to the reference sequence with TeraMap. The resulting alignment results are converted to the generic SAM format using TeraMap2Sam. Table 1 shows the results.

Table 1

Claims

A sequencing data processing device, characterized by comprising:

a data receiving unit, configured to receive sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that originate respectively from two positions of one chromosome fragment, the two reads of each read pair originating from the positive strand and the negative strand of the chromosome fragment respectively, or both originating from the positive strand of the chromosome fragment or both from the negative strand of the chromosome fragment, each read containing a gap, and the two reads of a read pair being defined as the left arm and the right arm respectively;

a processor, configured to execute a data processing program, executing the data processing program comprising: aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or,

the alignment result comprising alignment results of a plurality of the left arms and alignment results of a plurality of the right arms; and

at least one storage unit, configured to store data, including the data processing program.
The device of claim 1 wherein said comparing comprises

aligning the left arm and the right arm of each read pair with the reference sequence respectively, to obtain a first-level left alignment result and a first-level right alignment result,

taking one of the first-level left alignment result and the first-level right alignment result as a reference and aligning the other arm, to obtain a second-level left alignment result and a second-level right alignment result,

obtaining the alignment results of the plurality of read pairs based on the second-level left alignment result and the second-level right alignment result, or obtaining the alignment results of the plurality of left arms and the alignment results of the plurality of right arms.
The apparatus of claim 2 wherein said comparing comprises setting the gaps so that each left arm or each right arm is aligned with said reference sequence a plurality of times.
The apparatus of claim 3 wherein aligning each left arm or each right arm with said reference sequence a plurality of times comprises setting the gap of each left arm or each right arm to -3nt, -2nt, -1nt, 0nt, 1nt, 2nt, 3nt, 4nt, 5nt, 6nt, and 7nt respectively to obtain a corresponding plurality of reads, and aligning the corresponding plurality of reads with the reference sequence respectively.
The apparatus of any of claims 1-4, wherein the format of the comparison result is TeraMap.
Apparatus according to any of claims 1-5, wherein executing said data processing program further comprises: before the gap of each read in said alignment result is eliminated, extracting a unique alignment result from said alignment result to replace the alignment result, the unique alignment result comprising a plurality of read pairs that are uniquely aligned with the reference sequence, the two reads of each such read pair being aligned to the same chromosome of the reference sequence, and the distance between the two reads of each such read pair corresponding to the distance between the two positions of the chromosome fragment.
The apparatus of claim 6 wherein executing said data processing program further comprises correcting the unique alignment result so that each read pair in the unique alignment result is aligned to the positive strand of the same chromosome of said reference sequence.
The apparatus of claim 6 or 7, wherein executing the data processing program further comprises implementing a data format conversion, the data format conversion comprising converting the alignment result or the format of the unique alignment result.
Apparatus according to any of claims 1-8, wherein eliminating the gap of each read in said alignment result or said unique alignment result comprises,

If the read segment includes a positive gap, fill the size of the positive gap with N,

If the read segment includes a negative gap, the negative gap is removed, wherein

N is A, T, C or G.
The apparatus of any of claims 1-9, wherein the format of the universal alignment result is SAM or BAM.
A sequencing data processing system comprising a host and a display device, characterized in that the system further comprises the sequencing data processing device of any of claims 1-10.
A sequencing data processing method, comprising the following steps:

obtaining sequencing data, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads that originate respectively from two positions of one chromosome fragment, the two reads of each read pair originating from the positive strand and the negative strand of the chromosome fragment respectively, or both originating from the positive strand of the chromosome fragment or both from the negative strand of the chromosome fragment, each read containing a gap, and the two reads of a read pair being defined as the left arm and the right arm respectively;

aligning the sequencing data with a reference sequence to obtain an alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or,

the alignment result comprising alignment results of a plurality of the left arms and alignment results of a plurality of the right arms;

eliminating the gap of each read in the alignment result to obtain a universal alignment result.
The method of claim 12, wherein obtaining the sequencing data comprises constructing a sequencing library, the sequencing library being a single-stranded circular DNA library formed by one strand of the chromosome fragment and at least one predetermined DNA sequence.
The method of claim 12 wherein the two reads of each read pair are from the two ends of said chromosome fragment.
The method of claim 14 wherein obtaining the sequencing data comprises constructing a sequencing library, the sequencing library being a single-stranded circular DNA library formed by one strand of said chromosome fragment and a predetermined DNA sequence linked to both ends of the one strand.
The method of claim 15 wherein constructing said sequencing library comprises

(1) extracting a nucleic acid to be tested;

(2) terminal phosphorylating the nucleic acid to obtain a terminal phosphorylated product;

(3) repairing the terminal phosphorylation product at the end to obtain a terminal repair product;

(4) connecting the first sequence and the second sequence to both ends of the terminal repair product to obtain a first ligation product;

(5) performing nick translation and amplification of the ligation product using a third sequence to obtain an amplification product, the third sequence being a pair of primer pairs, at least one primer of the primer pair carrying a biotin label;

(6) performing single-strand separation of the amplification product using the biotin label to obtain a single-stranded product;

(7) cyclizing the single-stranded product with a fourth sequence to obtain the sequencing library;

The fourth sequence is capable of joining one end of the first sequence to one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
The method of claim 15 wherein constructing said sequencing library comprises

(1) extracting a nucleic acid to be tested;

(2) repairing the nucleic acid at the end to obtain a terminal repair product;

(3) terminal phosphorylating the terminal repair product to obtain a terminal phosphorylation product;

(4) connecting the first sequence and the second sequence to both ends of the terminal phosphorylation product to obtain a first ligation product;

(5) performing nick translation and amplification of the ligation product using a third sequence to obtain an amplification product, the third sequence being a pair of primer pairs, at least one primer of the primer pair carrying a biotin label;

(6) performing single-strand separation of the amplification product using the biotin label to obtain a single-stranded product;

(7) cyclizing the single-stranded product with a fourth sequence to obtain the sequencing library;

The fourth sequence is capable of joining one end of the first sequence to one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
The method of any of claims 12-17, wherein said comparing comprises

aligning the left arm and the right arm of each read pair with the reference sequence respectively, to obtain a first-level left alignment result and a first-level right alignment result,

taking one of the first-level left alignment result and the first-level right alignment result as a reference and aligning the other arm, to obtain a second-level left alignment result and a second-level right alignment result,

obtaining the alignment results of the plurality of read pairs based on the second-level left alignment result and the second-level right alignment result, or obtaining the alignment results of the plurality of left arms and the alignment results of the plurality of right arms.
A method according to any of claims 12-18, wherein said aligning comprises setting the gaps so that each left arm or each right arm is aligned with said reference sequence a plurality of times.
The method of claim 19, wherein aligning each left arm or each right arm with the reference sequence a plurality of times comprises setting the gap of each left arm or each right arm to -3nt, -2nt, -1nt, 0nt, 1nt, 2nt, 3nt, 4nt, 5nt, 6nt, and 7nt respectively to obtain a corresponding plurality of reads, and aligning the corresponding plurality of reads with the reference sequence respectively.
The method of any of claims 12-20, wherein the format of the comparison result is TeraMap.
A method according to any one of claims 12 to 21, characterized in that before the gap of each read in the alignment result is eliminated, a unique alignment result is extracted from the alignment result to replace the alignment result, the unique alignment result comprising a plurality of read pairs that are uniquely aligned with the reference sequence, the two reads of each such read pair being aligned to the same chromosome of the reference sequence, and the distance between the two reads of each such read pair corresponding to the size of the chromosome fragment.
The method of claim 22 wherein the unique alignment result is modified such that each pair of the unique alignment results is aligned to a positive strand of the same chromosome of the reference sequence.
The method of claim 22 or 23, wherein obtaining the universal alignment result further comprises performing a data format conversion on the comparison result or the unique alignment result.
A method according to any one of claims 12-24, characterized in that eliminating the gap of each read in the alignment result or the unique alignment result comprises,

If the read segment includes a positive gap, fill the size of the positive gap with N,

If the read segment includes a negative gap, the negative gap is removed, wherein

N is A, T, C or G.
The method of any of claims 12-25, wherein the format of the universal alignment result is SAM or BAM.
A computer readable storage medium for storing a program for execution by a computer, the execution of the program comprising performing the method of any of claims 12-26.
A method for detecting CNV, characterized in that

a. obtaining nucleic acid of the sample to be tested;

b. sequencing the nucleic acid to obtain sequencing data;

c. processing the sequencing data to obtain a universal alignment result;

d. detecting the CNV based on the universal alignment result; wherein the c step is performed using the sequencing data processing apparatus of any of claims 1-10.
The method of claim 28, wherein the step b comprises performing a sequencing library construction on the nucleic acid to obtain a sequencing library, the sequencing library being a single-stranded circular DNA library.
The method of claim 29, wherein said sequencing library construction comprises

Phosphorylating the nucleic acid at the end to obtain a terminal phosphorylated product;

End-repairing the terminal phosphorylation product to obtain a terminal repair product;

Connecting the first sequence and the second sequence to both ends of the end repair product to obtain a first ligation product;

Amplifying the product by performing nick translation and amplification using a third sequence, wherein the third sequence is a pair of primer pairs, at least one primer of the primer pair carrying a biotin label;

Single-strand separation of the amplification product using the biotin label to obtain a single-stranded product;

The single-stranded product is cyclized using a fourth sequence to obtain the sequencing library, wherein

The fourth sequence is capable of joining one end of the first sequence to one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
The method of claim 29, wherein said sequencing library construction comprises

Repairing the nucleic acid at the end to obtain a terminal repair product;

End-phosphorylation of the terminal repair product to obtain a terminal phosphorylated product;

Connecting the first sequence and the second sequence to both ends of the terminal phosphorylation product to obtain a first ligation product;

Amplifying the product by performing nick translation and amplification using a third sequence, wherein the third sequence is a pair of primer pairs, at least one primer of the primer pair carrying a biotin label;

Single-strand separation of the amplification product using the biotin label to obtain a single-stranded product;

The single-stranded product is cyclized using a fourth sequence to obtain the sequencing library, wherein

The fourth sequence is capable of joining one end of the first sequence to one end of the second sequence, and the other end of the first sequence and/or the second sequence is a dideoxynucleotide.
The method of any of claims 28-31, wherein said sequencing is performed using a combinatorial probe anchor ligation sequencing technique.
The method of claim 28 wherein the step d comprises

setting a plurality of windows on the reference sequence, and determining that the nucleic acid of the sample to be tested has the CNV when the number of reads aligned to a window in the universal alignment result differs significantly from the number of reads aligned to the same window in the universal alignment result of a control sample, wherein

The window is part of the reference sequence.
The method of claim 33, wherein the universal alignment result of said control sample is obtained by the sequencing data processing method of any of claims 12-26.
The method of claim 28 wherein the step d comprises

setting a plurality of windows on the reference sequence and calculating the sequencing depth of each window, where sequencing depth of a window = number of reads aligned to the window in the universal alignment result / size of the window;

correcting the sequencing depth of the window using the relationship between sequencing depth and GC content, to obtain a corrected sequencing depth of the window;

determining that the CNV exists in the nucleic acid of the sample to be tested when the corrected sequencing depth of the window differs significantly from the corrected sequencing depth of the same window of a control sample, wherein

The window is part of the reference sequence.
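As an illustrative, non-limiting sketch of the depth calculation and GC correction recited above, the following Python code computes the sequencing depth of a window and rescales it by the depth expected at that window's GC content. The claim does not fix the form of the correction; the ratio-based rescaling, the callable `expected_depth_at_gc` (standing in for the established depth/GC relationship), and all function names are assumptions made only for this example.

```python
def sequencing_depth(read_count, window_size):
    """Sequencing depth of a window = reads aligned to the window / window size."""
    return read_count / window_size


def gc_content(window_seq):
    """Fraction of G and C among the unambiguous (A, C, G, T) bases of a window."""
    seq = window_seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def correct_depth(depth, gc, expected_depth_at_gc, overall_mean_depth):
    """Rescale a window depth by the depth expected at its GC content.

    Assumed correction: corrected = depth * overall mean / expected(gc).
    """
    expected = expected_depth_at_gc(gc)
    if expected <= 0:
        return depth  # no usable expectation for this GC value; leave as is
    return depth * overall_mean_depth / expected
```

Under this assumed correction, a window whose raw depth equals the depth expected at its GC content is mapped to the overall mean depth, so residual differences from the control sample are not attributable to GC content.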
The method according to any one of claims 33 to 35, characterized in that the number of the control samples is not less than 30.
The method of claim 35, wherein the establishing of the relationship between the sequencing depth and the GC content comprises

Obtaining sequencing data of a plurality of control sample nucleic acids, the sequencing data consisting of a plurality of reads;

Setting a plurality of windows on the reference sequence, aligning the sequencing data of each of the plurality of control samples to the windows of the reference sequence, counting the number of reads aligned to each window in the sequencing data of each control sample, and obtaining the sequencing depth of each window, the window being part of the reference sequence, wherein the sequencing depth of the window = the total number of reads aligned to the window across the control samples / (the number of control samples * the size of the window);

Based on the sequencing depth of each window and the GC content of the window, the relationship between the sequencing depth and the GC content is established using two-dimensional regression analysis.
The method of claim 37, wherein said two-dimensional regression analysis is a locally weighted scatterplot smoothing (LOWESS) method.
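As an illustrative, non-limiting sketch of establishing the relationship between sequencing depth and GC content, the following Python code pools window depths across control samples and evaluates a locally weighted linear fit of depth against GC content. This is a simplified single-pass variant with tricube weights and no robustness iterations; the smoothing fraction `frac` and the function names `pooled_depth` and `lowess_fit` are assumptions made only for this example.

```python
def pooled_depth(control_counts, window_size):
    """Depth per window = total reads across controls / (n_controls * window size)."""
    n_controls = len(control_counts)
    n_windows = len(control_counts[0])
    return [
        sum(sample[i] for sample in control_counts) / (n_controls * window_size)
        for i in range(n_windows)
    ]


def lowess_fit(gc_values, depths, gc0, frac=0.5):
    """Estimate the expected depth at GC content gc0 by a local weighted line."""
    n = len(gc_values)
    k = min(n, max(2, int(round(frac * n))))  # number of neighbours used
    dists = sorted(abs(x - gc0) for x in gc_values)
    h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th neighbour
    weights = []
    for x in gc_values:
        u = abs(x - gc0) / h
        weights.append((1 - u ** 3) ** 3 if u < 1 else 0.0)  # tricube kernel
    sw = sum(weights)
    if sw == 0:
        return sum(depths) / n  # degenerate neighbourhood: plain mean
    mx = sum(w * x for w, x in zip(weights, gc_values)) / sw
    my = sum(w * y for w, y in zip(weights, depths)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, gc_values))
    if sxx == 0:
        return my  # a single effective point: weighted mean
    sxy = sum(w * (x - mx) * (y - my)
              for w, x, y in zip(weights, gc_values, depths))
    return my + (sxy / sxx) * (gc0 - mx)
```

Evaluating `lowess_fit` at the GC content of a window gives the expected depth for that window, which can then serve as the depth/GC relationship when correcting window depths.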
The method of claim 35, wherein the corrected sequencing depth of the same window of the control sample is obtained by correcting the sequencing depth of the same window of the control sample using the relationship between the sequencing depth and the GC content, wherein the sequencing depth of the same window of the control sample = the number of reads in the sequencing data of the control sample aligned to the window / the size of the window.
A CNV detecting device, characterized by comprising

a nucleic acid acquisition device for acquiring nucleic acid of the sample to be tested;

a sequencing device for sequencing nucleic acid from the nucleic acid acquisition device to obtain sequencing data;

a data processing device for processing sequencing data from the sequencing device to obtain a universal alignment result;

a detecting device for detecting the CNV based on the universal alignment result from the data processing device; wherein

The data processing device includes

a data receiving unit, configured to receive sequencing data from the sequencing device, the sequencing data comprising a plurality of read pairs, each read pair consisting of two reads derived respectively from two locations of a chromosome fragment, wherein the two reads of each read pair are derived respectively from the positive strand and the negative strand of the chromosome fragment, or both reads of each read pair are derived from the positive strand of the chromosome fragment or from the negative strand of the chromosome fragment, each read contains a gap, and the two reads of a read pair are defined as the left arm and the right arm, respectively;

a processor for executing a data processing program, the executing of the data processing program comprising: aligning the sequencing data with a reference sequence to obtain an alignment result, and eliminating the gap of each read in the alignment result to obtain a universal alignment result, the alignment result comprising alignment results of a plurality of the read pairs, and/or the alignment result comprising a plurality of alignment results of the left arm and a plurality of alignment results of the right arm; and

at least one storage unit for storing data, the data including the data processing program.