
CN111321209A - Method for double-end correction of circulating tumor DNA sequencing data

Info

Publication number: CN111321209A
Application number: CN202010220739.9A
Authority: CN
Inventors: 王军一; 肖雯; 叶可勇; 闫楠; 刘杰
Original assignee: Hangzhou Heyi Gene Technology Co ltd
Current assignee: Hangzhou Heyi Gene Technology Co ltd
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2020-06-23

Abstract

The invention discloses a method for double-end correction of circulating tumor DNA sequencing data, comprising: cfDNA extraction, target capture library construction and sequencing; sequencing data quality control; and a double-end correction step for the sequencing data. The method uses double-end (paired-end) sequencing to sequence the two ends of the same DNA fragment from the forward and reverse directions, and performs base correction in the sequence overlapping region of ctDNA high-throughput sequencing data. Because the overlapping-region sequences of the two reads should be consistent, this reduces the sequencing error rate, improves the accuracy of ctDNA gene mutation detection and analysis, particularly for low-abundance gene mutations, reduces the false positive rate, and improves the application value of ctDNA detection in clinical treatment.

Description

Method for double-end correction of circulating tumor DNA sequencing data

Technical Field

The invention belongs to the technical field of biology, and particularly relates to a method for double-end correction of circulating tumor DNA sequencing data.

Background

In recent years, tumor incidence and mortality have continued to rise and patients are trending younger, making tumors one of the major factors that seriously threaten life and health and impose a heavy social burden. The 5-year survival rate of Chinese tumor patients is about 40 percent, far behind the 60 percent of developed countries. The tumor prevention and treatment situation in China is therefore severe, and effective methods are urgently needed to improve the prevention, control, diagnosis and treatment of cancers and the survival rate of patients.

Gene mutations have been shown to play an important role in the regulation of tumor cell growth and progression. Because of tumor heterogeneity and the complexity of genetic mutations, conventional detection methods cannot accurately detect cancer-related genetic mutations. With the rapid development of high-throughput sequencing and computing technology, combining high-throughput sequencing with optimized DNA separation and extraction, targeted capture, and bioinformatic analysis makes accurate detection and analysis of ultra-low-frequency mutations possible, and provides a foundation for the wide clinical application of precision tumor treatment.

Circulating tumor DNA (ctDNA) consists of small DNA fragments, about 170bp long, derived from tumor cells. It is released from tumor cells into the peripheral blood circulation and cleaved to form endogenous single-stranded or double-stranded DNA, and it carries molecular mutation information consistent with the primary tumor tissue. A large number of studies show that ctDNA is highly consistent with the genomic information of tumor tissue. ctDNA assays may therefore be used as a complement to gene assays on clinical tissue samples, or as a replacement in some cases.

Because the proportion of ctDNA in cfDNA (cell-free DNA) is very low, in some samples even below 0.1%, ctDNA gene mutation detection is easily affected by various interference factors (DNA extraction, library construction, targeted capture technology, etc.). The sequencing error rate of high-throughput sequencing technology is on the order of one in a thousand, and errors of this magnitude greatly affect the accuracy of tumor gene mutation detection, especially the detection of mutations of extremely low abundance in ctDNA. Reducing the error rate of the sequencing result is therefore a key technical step in ctDNA gene mutation detection. With double-end sequencing, the two ends of the same DNA fragment are sequenced from the forward and reverse directions, so the two reads can correct each other to a certain extent; moreover, when the DNA fragment is short, the two reads overlap, and using the overlapping region to correct sequencing errors can improve the accuracy of ctDNA gene mutation detection.
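The scale of the problem described above can be illustrated with a short calculation. The following Python sketch is not part of the disclosed method; it assumes independent sequencing errors, an equal chance of each of the three wrong bases, and a hypothetical sequencing depth of 10,000, and the function names are illustrative only.

```python
def expected_counts(depth, variant_fraction, error_rate):
    """Expected reads at one position that carry a true variant base, and
    expected reads that show that same base purely by sequencing error."""
    true_reads = depth * variant_fraction
    # Assumption: an error turns the base into any of the 3 other bases
    # with equal probability, so 1/3 of errors mimic a given variant.
    error_reads = depth * (1 - variant_fraction) * error_rate / 3
    return true_reads, error_reads


def concordant_error_rate(error_rate):
    """Probability that both reads of a pair show the same wrong base at an
    overlapping position, assuming the two errors are independent."""
    return error_rate * error_rate / 3


true_reads, error_reads = expected_counts(10000, 0.001, 0.001)
print(true_reads, error_reads)        # about 10 true reads vs about 3.3 error reads
print(concordant_error_rate(0.001))   # about 3.3e-07
```

Under these assumptions, a 0.1% variant and a one-in-a-thousand error rate yield read counts of the same order of magnitude, whereas an error that appears identically in both overlapping reads is several orders of magnitude rarer.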

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a method for double-end correction of circulating tumor DNA sequencing data, namely a method for performing base correction according to the overlapping region of cfDNA double-end sequencing, which is realized by the following technical scheme:

the invention discloses a method for double-end correction of circulating tumor DNA sequencing data, which comprises the following steps:

1) cfDNA extraction, target capture library construction and sequencing:

extracting cfDNA in sample plasma by using a magnetic bead method for sample library construction; adding sequencing adapters at two ends of cfDNA molecules, wherein the sequencing adapters contain 8bp tag sequences for off-line data splitting, performing hybridization capture by using a liquid phase molecular probe, capturing target DNA fragments, and completing library construction; sequencing the constructed library by using a high-throughput sequencer, wherein the sequencing read length is 150 bp;

2) sequencing data quality control:

splitting sequencing data of different samples sequenced in the step 1) according to different label sequences, and performing quality control on the split sequencing data;

3) double-end correction of sequencing data:

performing double-end data correction on the quality control qualified sample in the step 2), wherein the specific method is as follows;

a) performing reverse-complement conversion on the R2 sequence, using Kmers to search for identical starting positions in R1 and R2, and judging whether an overlap exists; here R1 is the forward sequencing read, R2 is the reverse sequencing read, a Kmer is a nucleotide fragment of length K bp obtained by breaking up a sequencing read, and the overlap is the sequence overlapping region;

b) if the overlap exists, judging the positions of the overlap at the leftmost end and the rightmost end, namely the positions of the first same Kmer and the last same Kmer;

c) judging whether the overlap lengths of R1 and R2 are consistent, and discarding the two fragments if the overlap lengths are inconsistent;

d) judging the overlap length, setting a threshold value to be 40bp, and if the overlap length is smaller than the threshold value, not correcting;

e) correcting erroneous bases in the overlap region, and, if more than 5 corrections are needed within the same overlap, discarding the sequencing reads of that fragment;

4) using sequence alignment software to align the corrected sequencing data obtained in the step 3) against a standard human genome to generate an alignment result file;

5) performing gene mutation detection analysis on the result file obtained in the step 4) by using mutation detection software;

6) performing functional annotation of the gene mutation results of step 5) using annotation software.
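Sub-steps a) to d) of step 3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the text does not give a value for K, so `k=12` is an assumed default, the function names and return format are hypothetical, and repeated Kmers are handled only by taking the first and last occurrence in R2.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_overlap(r1, r2, k=12, min_overlap=40):
    """Locate the overlap between R1 and reverse-complemented R2.

    Returns (status, r1_start, r2_start, length), where status is
    'no_overlap', 'discard' (overlap lengths disagree), 'too_short'
    (below min_overlap, left uncorrected) or 'correct'.
    """
    r2rc = reverse_complement(r2)                      # step a)
    first_pos, last_pos = {}, {}
    for j in range(len(r2rc) - k + 1):
        kmer = r2rc[j:j + k]
        first_pos.setdefault(kmer, j)                  # first occurrence in R2
        last_pos[kmer] = j                             # last occurrence in R2
    shared = [i for i in range(len(r1) - k + 1) if r1[i:i + k] in first_pos]
    if not shared:
        return ("no_overlap", None, None, 0)
    # step b): first and last identical Kmer mark the overlap boundaries
    i_first, i_last = shared[0], shared[-1]
    j_first = first_pos[r1[i_first:i_first + k]]
    j_last = last_pos[r1[i_last:i_last + k]]
    len_r1 = i_last - i_first + k
    len_r2 = j_last - j_first + k
    if len_r1 != len_r2:                               # step c)
        return ("discard", i_first, j_first, 0)
    if len_r1 < min_overlap:                           # step d)
        return ("too_short", i_first, j_first, len_r1)
    return ("correct", i_first, j_first, len_r1)
```

A pair whose status is `correct` would then go on to the base correction of sub-step e); the returned start positions give the aligned overlap windows in R1 and in the reverse-complemented R2.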

As a further improvement, the method for correcting erroneous bases in the overlap region in step 3) of the invention, applied where the corresponding bases of R1 and R2 are inconsistent, is as follows:

when the sequencing quality values of the R1 and R2 bases are both greater than or equal to 30, the bases at both positions are replaced by N;

when the sequencing quality value of one of the R1 and R2 bases is greater than 30 and that of the other is less than 30, the base with the quality value less than 30 is replaced by the base with the quality value greater than 30;

when the sequencing quality values of the R1 and R2 bases are both less than 30, the bases at both positions are replaced by N;

the above operation performs traversal correction on all the segments.
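The three correction rules can be written as a small Python function. The sketch below rests on assumptions that the text leaves open: a quality value of exactly 30 is treated as high quality, every inconsistent position (including those masked with N) counts toward the limit of 5 corrections, and the inputs are the already aligned overlap windows of R1 and reverse-complemented R2 with their quality values in the same orientation. The function name and return format are hypothetical.

```python
def correct_overlap(bases1, quals1, bases2, quals2, q_cutoff=30, max_corrections=5):
    """Apply the quality-based rules to two aligned overlap windows.

    bases1/bases2 are equal-length strings, quals1/quals2 lists of integer
    quality values. Returns (new_bases1, new_bases2, n_corrections), or
    None when the pair exceeds max_corrections and should be discarded.
    """
    out1, out2 = list(bases1), list(bases2)
    n_corrections = 0
    for i, (b1, q1, b2, q2) in enumerate(zip(bases1, quals1, bases2, quals2)):
        if b1 == b2:
            continue                      # consistent position, nothing to do
        n_corrections += 1
        high1, high2 = q1 >= q_cutoff, q2 >= q_cutoff  # assumption: 30 counts as high
        if high1 and not high2:
            out2[i] = b1                  # trust the high-quality R1 base
        elif high2 and not high1:
            out1[i] = b2                  # trust the high-quality R2 base
        else:
            out1[i] = out2[i] = "N"       # both high or both low: mask with N
    if n_corrections > max_corrections:
        return None
    return "".join(out1), "".join(out2), n_corrections


print(correct_overlap("ACGTA", [35, 35, 35, 10, 10],
                      "ATGAC", [35, 10, 35, 35, 10]))
# ('ACGAN', 'ACGAN', 3)
```

Masking with N when both bases are high quality reflects that neither read can be preferred; masking when both are low quality reflects that neither can be trusted.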

As a further improvement, the sample plasma in step 1) of the present invention is derived from human plasma.

As a further improvement, the high-throughput sequencer in step 1) is an Illumina NextSeq CN500 sequencer, a BGISEQ-100 sequencer, a BGISEQ-1000 sequencer or a DA8600 sequencer.

As a further improvement, the sequencing mode in the step 1) is double-ended sequencing.

As a further improvement, the quality control is carried out on the split sequencing data by using fastqc software in the step 2).

As a further improvement, the software used for sequence alignment in step 4) of the present invention is BWA.

As a further improvement, the software used for mutation detection in step 5) of the present invention is varscan.

As a further improvement, the annotation software for gene mutation results in step 6) of the present invention is annovar.

The invention has the following beneficial effects:

The invention adopts a double-end sequencing method, in which the two ends of the same DNA fragment are sequenced simultaneously from the forward and reverse directions. When the detected fragment is short, the two reads overlap, and the overlapping region is derived from the same DNA fragment. In the absence of sequencing errors, the forward and reverse reads are completely consistent within the overlapping region. An algorithm developed on this property corrects sequencing errors and thereby improves the accuracy of gene mutation detection.

The length of the ctDNA is about 170bp, and double-end sequencing can generate an overlapping region.
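The size of the overlap follows directly from the fragment length and the 150bp read length. The Python sketch below works through that arithmetic; it assumes both reads start exactly at the fragment ends and ignores adapter read-through, and the function name is illustrative.

```python
def expected_overlap(fragment_len, read_len=150):
    """Length of the region covered by both reads of a pair, in bp."""
    covered = min(read_len, fragment_len)   # a read cannot extend past the fragment
    return max(0, 2 * covered - fragment_len)


print(expected_overlap(170))   # 130
print(expected_overlap(260))   # 40
print(expected_overlap(350))   # 0
```

Under these assumptions a 170bp fragment gives a 130bp overlap, and any fragment of up to 260bp gives an overlap of at least 40bp with 150bp reads.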

The proportion of ctDNA in cfDNA is very low, in some samples even below 0.1%, so ctDNA gene mutation detection is more susceptible to various interference factors. The method performs base correction in the sequence overlapping region of ctDNA high-throughput sequencing data. Because the overlapping-region sequences of double-end sequencing should be consistent, this reduces the sequencing error rate, effectively improves the accuracy of ctDNA gene mutation detection and analysis, particularly the accuracy of low-abundance gene mutation detection, reduces the false positive rate, and improves the application value of ctDNA detection in clinical treatment.

Drawings

FIG. 1: the main flow diagram of the scheme of the invention;

FIG. 2: schematic flow chart of step (3) of the invention.

Detailed Description

The technical solution of the present invention is further illustrated by the following specific examples:

(1) cfDNA extraction, target capture library construction and sequencing:

extracting cfDNA in sample plasma by using a magnetic bead method for sample library construction; the sample plasma is derived from human plasma;

adding sequencing adapters at two ends of cfDNA molecules, wherein the sequencing adapters contain 8bp tag sequences for off-line data splitting, performing hybridization capture by using a liquid phase molecular probe, capturing target DNA fragments, and completing library construction;

sequencing the constructed library by using a high-throughput sequencer, wherein the sequencing read length is 150bp, the double ends of the library are sequenced, and the high-throughput sequencer is an Illumina NextSeq CN500 sequencer, a BGISEQ-100 sequencer, a BGISEQ-1000 sequencer or a DA8600 sequencer;

(2) sequencing data quality control:

splitting sequencing data of different samples sequenced in the step (1) according to different label sequences, and performing quality control on the split sequencing data by using fastqc software;

(3) double-end correction of sequencing data:

performing double-end data correction on the QC qualified samples in the step (2), wherein the specific method is as follows;

a) Performing reverse-complement transformation on the R2 (reverse sequencing read) sequence, using Kmers (nucleotide fragments of length K bp obtained by breaking up a sequencing read) to search for identical starting positions in R1 (forward sequencing read) and R2, and judging whether an overlap (sequence overlapping region) exists.

b) If an overlap exists, determining the leftmost and rightmost positions of the overlap, namely the positions of the first identical Kmer and the last identical Kmer.

c) Judging whether the overlap lengths of R1 and R2 are consistent; if they are not, discarding the two fragments, which would otherwise give rise to false positive single-base insertion-deletion mutations in subsequent variant detection.

d) Judging the overlap length against a threshold of 40bp; if the overlap length is smaller than the threshold, no correction is performed.

e) Correcting erroneous bases in the overlap region; if more than 5 corrections are needed in the overlap of the same fragment, the sequencing reads of that fragment are discarded. Where the corresponding bases of R1 and R2 in the overlap are inconsistent, the following correction method is adopted:

1. When the sequencing quality values of the R1 and R2 bases are both greater than or equal to 30, the bases at both positions are replaced by N.

2. When the sequencing quality value of one of the R1 and R2 bases is greater than 30 and that of the other is less than 30, the base with the quality value less than 30 is replaced by the base with the quality value greater than 30.

3. When the sequencing quality values of the R1 and R2 bases are both less than 30, the bases at both positions are replaced by N.

The above operation performs traversal correction on all the fragments.

(4) Performing sequence alignment of the corrected sequencing data obtained in the step (3) against a standard human genome by using BWA software to generate an alignment result file;

(5) performing gene mutation detection analysis on the result file obtained in the step (4) by using varscan software;

(6) performing functional annotation of the gene mutation results of step (5) using annovar software.
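Step (2) above splits the sequencing data of different samples according to their 8bp tag sequences. The text does not say how the tag is exposed in the data, so the Python sketch below assumes, purely for illustration, an Illumina-style FASTQ header whose last colon-separated field is the tag; the function names, the sample sheet mapping and the `undetermined` bucket are all hypothetical.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)                      # skip the '+' separator line
        qual = next(it)
        yield header.strip(), seq.strip(), qual.strip()


def split_by_tag(lines, tag_to_sample):
    """Group FASTQ records by sample using the tag at the end of the header."""
    buckets = defaultdict(list)
    for header, seq, qual in parse_fastq(lines):
        tag = header.rsplit(":", 1)[-1]            # assumed tag location
        sample = tag_to_sample.get(tag, "undetermined")
        buckets[sample].append((header, seq, qual))
    return dict(buckets)


records = [
    "@M1:1:FC:1:1:1:1 1:N:0:ACGTACGT", "ACGT", "+", "IIII",
    "@M1:1:FC:1:1:2:2 1:N:0:TTTTGGGG", "GGCC", "+", "IIII",
]
print(sorted(split_by_tag(records, {"ACGTACGT": "sampleA"})))
# ['sampleA', 'undetermined']
```

Reads whose tag is not in the mapping are kept in a separate bucket rather than dropped, so that unexpected tags remain visible during quality control.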


Applying the technical scheme of the invention, 1 group of cfDNA reference standards with 8 known mutation sites at a mutation frequency of 0.5% was analyzed to verify the accuracy of the detection results. The specific process comprises the following steps:

(1) cfDNA extraction, target capture library construction and sequencing:

extracting and purifying the cfDNA of the reference standard by using a commercial nucleic acid extraction kit, and, without fragmenting the cfDNA, directly taking 30ng of purified cfDNA for constructing a sample library;

adding sequencing adapters at two ends of the cfDNA molecules of 100-350bp, wherein the sequencing adapters contain 8bp tag sequences used for distinguishing data among a plurality of different samples, performing hybridization capture by using a liquid phase molecular probe, capturing target DNA fragments, and completing library construction;

finally, performing double-end sequencing on the constructed library by using an Illumina NextSeq CN500 sequencer, wherein the sequencing read length is 150 bp;

(2) sequencing data quality control:

(3) double-end correction of sequencing data:

performing double-end data correction on the quality control qualified sample in the step (2), wherein the specific method is as follows;

a) Performing reverse-complement transformation on the R2 sequence, using Kmers to search for identical starting positions in R1 and R2, and judging whether an overlap exists.

c) Judging whether the overlap lengths of R1 and R2 are consistent; if they are not, discarding the two fragments, which would otherwise give rise to false positive single-base insertion-deletion mutations in subsequent variant detection.

d) Judging the overlap length against a threshold of 40bp; if the overlap length is smaller than the threshold, no correction is performed.

1. When the sequencing quality values of the R1 and R2 bases are both greater than or equal to 30, the bases at both positions are replaced by N.

2. When the sequencing quality value of one of the R1 and R2 bases is greater than 30 and that of the other is less than 30, the base with the quality value less than 30 is replaced by the base with the quality value greater than 30.

3. When the sequencing quality values of the R1 and R2 bases are both less than 30, the bases at both positions are replaced by N.

All the fragments are subjected to traversal correction through the above operation;

(4) performing sequence alignment of the corrected sequencing data obtained in the step (3) against a standard human genome by using BWA software to generate an alignment result file;

(5) performing gene mutation detection analysis on the result file obtained in the step (4) by using varscan software;

The detection of the 8 known mutation sites in the mutation detection results is summarized in Table 1: the 8 gene mutation sites were accurately detected in the 3 HD778 samples. When the sequencing data were not subjected to double-end correction, 2 low-frequency false positive mutations were detected in each sample; after the sequencing data were subjected to double-end correction, no low-frequency false positive mutations were detected in any sample. The double-end correction method provided by the invention therefore effectively improves the accuracy of low-frequency mutation detection.
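The comparison described above, between called mutations and the known sites of a reference standard, amounts to simple set arithmetic. The Python sketch below shows one way to tabulate it; the site identifiers are placeholders invented for illustration and are not the actual mutation sites of the standard, and the function name is hypothetical.

```python
def summarize_calls(called, known):
    """Count known sites detected, known sites missed, and extra calls."""
    called, known = set(called), set(known)
    return {
        "detected": len(called & known),
        "missed": len(known - called),
        "false_positives": len(called - known),
    }


# Placeholder identifiers only; real input would be the called variants.
known_sites = {"site%d" % i for i in range(1, 9)}
uncorrected_calls = known_sites | {"extra1", "extra2"}
corrected_calls = set(known_sites)

print(summarize_calls(uncorrected_calls, known_sites))
# {'detected': 8, 'missed': 0, 'false_positives': 2}
print(summarize_calls(corrected_calls, known_sites))
# {'detected': 8, 'missed': 0, 'false_positives': 0}
```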

Wherein, cfDNA: cell-free DNA;

magnetic bead method: a method in which DNA is specifically adsorbed by magnetic beads;

sequencing quality value: a measure of the probability that a base was called incorrectly; the higher the sequencing quality value, the better the sequencing quality;

Illumina NextSeq CN500, BGISEQ-100, BGISEQ-1000 and DA8600 are models of high-throughput sequencers;

double-end sequencing: sequencing both ends of the DNA fragment;

fastqc, BWA, varscan and annovar are software names; they have no industry-wide Chinese names and are referred to directly by their English names or abbreviations.
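The sequencing quality value defined above is conventionally a Phred score, Q = -10 log10(P), where P is the probability that the base call is wrong, so the cutoff of 30 used in the correction rules corresponds to an error probability of one in a thousand. The Python sketch below shows the conversion; the Phred+33 character encoding of FASTQ quality strings is an assumption about the data format that the text does not state, and the function names are illustrative.

```python
def phred_to_error_prob(q):
    """Error probability for a Phred quality value: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality string to integer quality values.

    Assumption: Phred+33 encoding (ASCII code minus 33).
    """
    return [ord(c) - offset for c in qual_string]


print(phred_to_error_prob(30))   # approximately 0.001
print(decode_quality("I5+"))     # [40, 20, 10]
```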

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and refinements without departing from the core technical features of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for paired end correction of circulating tumor DNA sequencing data, comprising the steps of:

1) cfDNA extraction, target capture library construction and sequencing:

2) sequencing data quality control:

3) double-end correction of sequencing data:

a) performing reverse-complement conversion on the R2 sequence, using Kmers to search for identical starting positions in R1 and R2, and judging whether an overlap exists; the R2 is the reverse sequencing read, the Kmer is a nucleotide fragment of length K bp obtained by breaking up a sequencing read, the R1 is the forward sequencing read, and the overlap is the sequence overlapping region;

2. The method for double-ended correction of circulating tumor DNA sequencing data according to claim 1, wherein the step 3) corrects the overlap region error bases as follows:

the above operation performs traversal correction on all the segments.

3. The method for paired end correction of circulating tumor DNA sequencing data according to claim 1 or 2, wherein the sample plasma of step 1) is derived from human plasma.

4. The method for paired end correction of circulating tumor DNA sequencing data according to claim 1, wherein the high-throughput sequencer in step 1) is an Illumina NextSeq CN500 sequencer, a BGISEQ-100 sequencer, a BGISEQ-1000 sequencer or a DA8600 sequencer.

5. The method for paired end correction of circulating tumor DNA sequencing data according to claim 1, 2 or 4, wherein the sequencing mode in step 1) is paired end sequencing.

6. The bioinformatic processing method for circulating tumor DNA analysis according to claim 1, wherein in the step 2) quality control is performed on the split sequencing data using fastqc software.

7. The method for processing bioinformation for circulating tumor DNA analysis according to claim 1, wherein the software for sequence alignment in step 4) is BWA.

8. The bioinformatic processing method for circulating tumor DNA analysis according to claim 1, wherein the software used for mutation detection in step 5) is varscan.

9. The bioinformatic processing method for circulating tumor DNA analysis according to claim 1, wherein the gene mutation result annotation software in step 6) is annovar.