CN114937475A

CN114937475A - Automatic evaluation method for error correction result of PacBio sequencing data

Info

Publication number: CN114937475A
Application number: CN202210380137.9A
Authority: CN
Inventors: 张艳菊; 王鹤杰; 林若翰; 周帆
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2022-04-12
Filing date: 2022-04-12
Publication date: 2022-08-23

Abstract

The invention discloses an automatic evaluation method for error correction results of PacBio sequencing data. The method comprises: performing quality control on raw PacBio sequencing data to obtain sequences within a set threshold range, marked as clean reads; correcting the clean reads with the error correction method to be evaluated to obtain sequences marked as corrected reads, and recording the memory and time consumed by error correction; comparing the clean reads and the corrected reads to obtain the error correction output rate TH and the average length of the corrected sequences; aligning the corrected reads to the corresponding reference genome to obtain an alignment MSA, and statistically analyzing it to obtain the sensitivity and accuracy of error correction; assembling the corrected reads to obtain contigs; and aligning the contigs to the corresponding reference genome to obtain the alignment contigs MSA, from which the number of contigs, the genome coverage rate and the NGA50 are computed.

Description

Automatic evaluation method for error correction result of PacBio sequencing data

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to an automatic evaluation method for error correction results of PacBio sequencing data.

Background

In recent years, third-generation sequencing technology, represented by PacBio, has become one of the most widely used DNA sequencing technologies. By reading the nucleotide sequence in real time at single-molecule resolution, it achieves read lengths of more than 10 kb while maintaining high throughput. This resolves several problems caused by the short read lengths of second-generation sequencing (usually only 150 to 600 bp), such as difficulty in identifying large structural variations on chromosomes, in observing gene fusions, and in detecting alternative splicing at the RNA level.

Although PacBio sequencing has achieved a major breakthrough in read length, it comes with a very high error rate, typically around 15%. Reducing the sequence error rate before downstream applications is therefore an important step in analyzing PacBio DNA sequencing data. Some methods improve sequence accuracy by improving the sequencing technology itself, such as the Circular Consensus Sequencing (CCS) method proposed by PacBio. However, such methods significantly reduce read length and also increase sequencing cost. In contrast, computational methods can improve sequence accuracy while using a lower-cost sequencing protocol, and are an efficient and feasible solution.

A number of computational methods have been proposed to solve the error correction problem of PacBio sequencing data. Depending on the strategy employed, they fall into two categories: hybrid correction methods and self-correction methods. These methods perform differently on PacBio sequencing data from genomes of different scales and at different sequencing depths. In practice, researchers usually only need to select the best-performing error correction method for their own research interests and data. A reasonable and effective automatic evaluation method for error correction results of PacBio sequencing data is therefore of great significance.

At present, few automatic evaluation tools exist in the field of PacBio sequencing data error correction, and those that do have shortcomings: evaluation is carried out only on small-scale genome sequencing data or simulated data, ignoring the performance of error correction tools on large-scale genomes; only data of a single depth are used, ignoring the influence of depth variation in actual sequencing on performance; and the evaluation indexes are relatively one-sided, without evaluation from multiple aspects such as error correction performance, resource requirements and downstream application performance.

Disclosure of Invention

The invention aims to provide an automatic evaluation method for error correction results of PacBio sequencing data that addresses the shortcomings described in the background art.

The technical scheme for realizing the purpose of the invention is as follows:

An automatic evaluation method for error correction results of PacBio sequencing data comprises the following steps:

1) performing quality control on the raw PacBio sequencing data to obtain sequences within a set threshold range; specifically, the per-base quality scores, per-sequence quality scores, GC content and sequence duplication level of the raw PacBio sequencing data are checked, a threshold is set for each factor, the raw sequencing data are screened, and after sequences below or above the thresholds are removed, the remaining sequences are marked as clean reads;

2) correcting the clean reads obtained from quality control with the error correction method to be evaluated to obtain sequences marked as corrected reads, and recording the memory and time consumed during error correction;

3) comparing the clean reads before error correction with the corrected reads after error correction, and performing statistical analysis to obtain the error correction output rate TH and the average length of the corrected sequences; this specifically comprises the following steps:

3-1) comparing the amount of data in the corrected reads with that in the clean reads before error correction, and calculating the error correction output rate TH of the error correction process;

3-2) calculating the average length of the corrected reads, and visualizing the length distribution;

4) aligning the corrected reads to the corresponding reference genome to obtain an alignment MSA;

5) performing statistical analysis on the alignment MSA to obtain the sensitivity and accuracy of error correction, which specifically comprises the following steps:

5-1) counting, for the corrected reads, the number TP of successfully corrected bases and the number FN of erroneous but uncorrected bases;

5-2) calculating the sensitivity of the error correction process from TH, TP and FN;

5-3) counting the total number of the three kinds of erroneous bases (insertions, deletions and substitutions) contained in the corrected reads;

5-4) calculating the accuracy of the corrected reads from the total number of the three kinds of erroneous bases obtained in step 5-3);

5-5) calculating the genome coverage rate of the corrected reads;

6) assembling the corrected reads to obtain contigs, which specifically comprises the following steps:

6-1) performing pairwise alignment of the corrected reads, retaining for each read only the two alignment results that are best in quality and length, and computing an estimate of the overall coverage;

6-2) further screening the alignment results according to the coverage information, and discarding alignments whose coverage is smaller than a threshold;

6-3) generating a string graph by discarding low-coverage alignment results, trimming the ends, removing overlaps whose support is smaller than a threshold, and eliminating short branches;

6-4) generating contigs from the string graph;

7) aligning the assembled contigs to the corresponding reference genome to obtain an alignment contigs MSA;

8) performing statistical analysis on the contigs MSA obtained in step 7), specifically counting the number of contigs, calculating the genome coverage rate of the contigs and calculating the NGA50 of the contigs.
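The statistics of step 8) can be illustrated with a minimal sketch. The text does not define NGA50, so the code assumes the common definition: the length of the aligned contig block at which the blocks of that length or longer, taken together, cover at least half of the reference genome. The function names and the input format (a plain list of aligned block lengths) are hypothetical.

```python
from typing import List, Optional


def contig_count(contig_lengths: List[int]) -> int:
    """Number of assembled contigs."""
    return len(contig_lengths)


def nga50(aligned_block_lengths: List[int], genome_length: int) -> Optional[int]:
    """NGA50 under the assumed definition.

    Blocks are visited from longest to shortest; the first block at which the
    running total reaches half of the reference genome length is returned.
    Returns None when the aligned blocks never reach that total.
    """
    target = genome_length / 2
    running_total = 0
    for length in sorted(aligned_block_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None
```

Returning None rather than 0 keeps an assembly that covers less than half of the genome distinguishable from one with very short blocks.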

The automatic evaluation method for error correction results of PacBio sequencing data is suitable for real data, including large genomes and various sequencing depths, and overcomes the limitation of the prior art that only simulated data or real data from small genomes can be used. Taking the practical workflow for third-generation sequencing data as its starting point, the method selects 8 indexes and evaluates error correction performance, resource requirements, downstream application performance and other aspects, which makes it more comprehensive than the prior art.

Researchers can systematically weigh computing resources, data scale and experimental purpose according to their research interests, determine reasonable weights for the evaluation indexes, and select a suitable correction tool using the automatic evaluation method provided by the invention. Developers in the field can use the method to verify the performance of a newly developed tool along multiple dimensions, to guide improvements to it or to demonstrate its effectiveness.

Drawings

Fig. 1 is a flowchart of an automated evaluation method for error correction results of PacBio sequencing data.

Detailed Description

The invention will be further described with reference to the following drawings and examples, which are not intended to limit the invention.

The embodiment is as follows:

1) performing quality control on the raw PacBio sequencing data to obtain sequences within a set threshold range; specifically, factors such as the per-base quality scores, per-sequence quality scores, GC content and sequence duplication level of the raw PacBio sequencing data are checked, thresholds are set according to these factors to screen the raw sequencing data, and after sequences below or above the thresholds are removed, the remaining sequences are marked as clean reads;

wherein the raw sequencing data can be statistically reviewed and filtered using the FastQC software;

2) calling the error correction tool to be evaluated to correct the clean reads obtained from quality control, yielding sequences marked as corrected reads, while monitoring the peak memory usage of the program during error correction and recording the memory and time consumed by the complete error correction process;
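The threshold screening of step 1) can be sketched as follows. This is only an illustration, not the FastQC workflow itself: the `Read` class, the function names and the default thresholds are all hypothetical, and only two of the listed factors (mean read quality and GC content) are checked.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class Read:
    name: str
    seq: str
    quals: List[int]  # per-base Phred quality scores


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def filter_clean_reads(
    reads: List[Read],
    min_mean_q: float = 7.0,                    # hypothetical threshold
    gc_range: Tuple[float, float] = (0.2, 0.8),  # hypothetical threshold
) -> List[Read]:
    """Keep reads whose mean quality and GC content fall inside the thresholds."""
    clean = []
    for read in reads:
        if not read.seq or not read.quals:
            continue  # empty records cannot be scored
        if mean(read.quals) < min_mean_q:
            continue  # sequence quality below the lower threshold
        low, high = gc_range
        if not low <= gc_content(read.seq) <= high:
            continue  # GC content below or above the allowed range
        clean.append(read)
    return clean
```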
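One possible way to record the time and peak memory of step 2) is shown below, assuming a Unix-like system and an error correction tool that runs as a child process. The function name `run_and_measure` and the returned dictionary keys are hypothetical; the text does not say how the monitoring is implemented.

```python
import resource
import subprocess
import sys
import time
from typing import Dict, List


def run_and_measure(cmd: List[str]) -> Dict[str, float]:
    """Run a command and report wall time, CPU time and peak memory.

    ru_maxrss for RUSAGE_CHILDREN is the largest resident set size among all
    child processes waited for so far, so the correction tool should be the
    first or the largest child started by this script. The unit is kilobytes
    on Linux and bytes on macOS.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {"wall_seconds": wall, "cpu_seconds": cpu, "peak_rss": after.ru_maxrss}


if __name__ == "__main__":
    # Stand-in command; a real run would pass the correction tool's command line.
    print(run_and_measure([sys.executable, "-c", "x = list(range(1000))"]))
```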

3-1) dividing the total number of bases in the corrected reads by the total number of bases in the clean reads before error correction to obtain the error correction output rate TH of the error correction process;

3-2) calculating the average length of the corrected reads, and visualizing the length distribution with a violin plot;
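Steps 3-1) and 3-2) reduce to simple arithmetic on read lengths, sketched below. The helper names are hypothetical, and the input is assumed to be FASTA text supplied line by line.

```python
from typing import Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new record
        elif lengths:
            lengths[-1] += len(line)   # sequence lines extend the current record
    return lengths


def output_rate(clean_lengths: List[int], corrected_lengths: List[int]) -> float:
    """TH: bases in the corrected reads divided by bases in the clean reads."""
    total_clean = sum(clean_lengths)
    if total_clean == 0:
        raise ValueError("clean reads contain no bases")
    return sum(corrected_lengths) / total_clean


def mean_length(lengths: List[int]) -> float:
    """Average read length (0.0 when there are no reads)."""
    return sum(lengths) / len(lengths) if lengths else 0.0
```

The list returned by `fasta_lengths` could be passed to a plotting routine such as matplotlib's `violinplot` to draw the length distribution.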

4) aligning the corrected reads to the corresponding reference genome to obtain an alignment MSA; specifically, the corrected reads are aligned to the corresponding reference genome using the alignment software Minimap2 to obtain the alignment file of the corrected reads, namely the MSA;

5-1) counting, for the corrected reads, the number TP of successfully corrected bases and the number FN of erroneous but uncorrected bases;

5-2) calculating the sensitivity of the error correction process from TH, TP and FN according to the formula sensitivity = TH × TP / (TP + FN);

5-3) counting the total number of the three kinds of erroneous bases (insertions, deletions and substitutions) contained in the corrected reads;

5-4) calculating the accuracy of the corrected reads from the total number of the three kinds of erroneous bases obtained in step 5-3), wherein the error rate is obtained by dividing the total number of the three kinds of erroneous bases by the total number of bases in the corrected reads;

5-5) calculating the genome coverage rate of the corrected reads; specifically, the proportion of the reference genome covered by the corrected reads in the alignment file is calculated;
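The indexes of steps 5-2), 5-4) and 5-5) can be sketched as below. Two points are assumptions rather than statements from the text: accuracy is taken as one minus the error rate, and genome coverage is computed for a single reference sequence from half-open alignment intervals [start, end). All function names are hypothetical.

```python
from typing import List, Tuple


def sensitivity(th: float, tp: int, fn: int) -> float:
    """Sensitivity as given in step 5-2): TH * TP / (TP + FN)."""
    return th * tp / (tp + fn)


def error_rate(insertions: int, deletions: int, substitutions: int,
               total_bases: int) -> float:
    """Erroneous bases divided by the total bases of the corrected reads."""
    return (insertions + deletions + substitutions) / total_bases


def accuracy(insertions: int, deletions: int, substitutions: int,
             total_bases: int) -> float:
    """Assumption: accuracy is the complement of the error rate."""
    return 1.0 - error_rate(insertions, deletions, substitutions, total_bases)


def genome_coverage(intervals: List[Tuple[int, int]], genome_length: int) -> float:
    """Fraction of the reference covered by at least one aligned read."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous merged block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)         # overlapping: extend the block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length
```

Merging overlapping intervals before summing ensures that positions covered by several reads are counted once.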

6) assembling the corrected reads to obtain contigs, which specifically comprises the following steps:

6-2) further screening the alignment results according to the coverage information, and discarding alignments whose coverage is less than 50%;

6-3) generating a string graph by discarding low-coverage alignment results, trimming the ends, removing overlaps whose support is less than 3, and eliminating short branches;

6-4) generating contigs from the string graph;

specifically, the corrected reads may be assembled using the miniasm software.

7) aligning the assembled contigs to the corresponding reference genome to obtain an alignment contigs MSA; specifically, the assembled contigs can be aligned to the corresponding reference genome using the alignment software Minimap2 to obtain the alignment file of the contigs.
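The external tool calls of steps 4), 6) and 7) might be driven from a script along the following lines. This is a sketch under assumptions: the assembler is taken to be miniasm fed with Minimap2 overlaps, the presets `map-pb`, `ava-pb` and `asm5` are choices made here and are not named in the text, and all function names and file names are hypothetical. The functions only build command lines; each command writes to standard output, which the caller would redirect to a file.

```python
from typing import Iterable, Iterator, List


def align_reads_cmd(reference: str, reads: str, threads: int = 4) -> List[str]:
    """Step 4): align corrected reads to the reference, SAM on stdout."""
    return ["minimap2", "-ax", "map-pb", "-t", str(threads), reference, reads]


def overlap_cmd(reads: str, threads: int = 4) -> List[str]:
    """Step 6): all-versus-all read overlaps, PAF on stdout."""
    return ["minimap2", "-x", "ava-pb", "-t", str(threads), reads, reads]


def assemble_cmd(reads: str, overlaps: str) -> List[str]:
    """Step 6): build the assembly graph from the overlaps, GFA on stdout."""
    return ["miniasm", "-f", reads, overlaps]


def align_contigs_cmd(reference: str, contigs: str, threads: int = 4) -> List[str]:
    """Step 7): align assembled contigs to the reference, SAM on stdout."""
    return ["minimap2", "-ax", "asm5", "-t", str(threads), reference, contigs]


def gfa_to_fasta(gfa_lines: Iterable[str]) -> Iterator[str]:
    """Extract contig sequences from the segment (S) lines of a GFA file."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            yield ">" + fields[1]
            yield fields[2]
```

Keeping command construction separate from execution makes it easy to wrap each call with the resource monitoring of step 2) or to log the exact command lines used in an evaluation run.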

Claims

1. An automatic evaluation method for error correction results of PacBio sequencing data, characterized by comprising the following steps:

1) performing quality control on the raw PacBio sequencing data to obtain sequences within a set threshold range; specifically, the per-base quality scores, per-sequence quality scores, GC content and sequence duplication level of the raw PacBio sequencing data are checked, thresholds are set according to these factors to screen the raw sequencing data, and after sequences below or above the thresholds are removed, the remaining sequences are marked as clean reads;

5-4) calculating the accuracy of the corrected reads from the total number of the three kinds of erroneous bases obtained in step 5-3);

5-5) calculating the genome coverage rate of the corrected reads;

6-4) generating contigs from the string graph;

8) performing statistical analysis on the contigs MSA obtained in step 7), specifically counting the number of contigs, calculating the genome coverage rate of the contigs and calculating the NGA50 of the contigs.