CN114937475A - Automatic evaluation method for error correction result of PacBio sequencing data - Google Patents

Automatic evaluation method for error correction result of PacBio sequencing data Download PDF

Info

Publication number
CN114937475A
CN114937475A CN202210380137.9A CN202210380137A CN114937475A CN 114937475 A CN114937475 A CN 114937475A CN 202210380137 A CN202210380137 A CN 202210380137A CN 114937475 A CN114937475 A CN 114937475A
Authority
CN
China
Prior art keywords
error correction
sequence
corrected
sequences
contigs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210380137.9A
Other languages
Chinese (zh)
Inventor
张艳菊
王鹤杰
林若翰
周帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202210380137.9A priority Critical patent/CN114937475A/en
Publication of CN114937475A publication Critical patent/CN114937475A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Data Mining & Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an automatic evaluation method of an error correction result of PacBio sequencing data, which comprises the steps of carrying out quality control on original PacBio sequencing data to obtain a sequencing sequence which accords with a set threshold range; correcting the clean reads after the quality control by using an error correction method to be evaluated to obtain a sequence marked as corrected reads, and counting memory resources and time consumption required by error correction; comparing and analyzing clean reads and corrected reads before and after error correction to obtain error correction output rate TH and average length of sequences after error correction; comparing the corrected reads with the corresponding reference genome to obtain an alignment sequence MSA, and performing statistical analysis to obtain the correction sensitivity and the correction rate; assembling the corrected reads to obtain contigs; and (3) comparing the contigs with the corresponding reference genome to obtain the statistic analysis of the comparative contigs MSA, and counting the number of the contigs, the genome coverage rate and the NGA 50.

Description

Automatic evaluation method for error correction result of PacBio sequencing data
Technical Field
The invention relates to the technical field of biological information, in particular to an automatic evaluation method for a PacBio sequencing data error correction result.
Background
In recent years, the third generation sequencing technology, represented by PacBio, has become one of the most widely used DNA sequencing technologies. The third generation sequencing technology enables the read length to reach more than 10kb on the premise of ensuring high flux by a method for displaying the nucleotide sequence under single molecule resolution in real time, and solves a plurality of problems caused by reading length (usually only 150 to 600bp) in second generation sequencing, such as difficulty in identifying large structural variation on chromosome, observation gene fusion, difficulty in detecting variable shearing on RNA level and the like.
Although the PacBio sequencing technique has made a major breakthrough in read length, it introduces a very high error rate, typically around 15%. Therefore, reducing the sequence error rate prior to downstream applications is an important step in analyzing PacBio DNA sequencing data. There are methods for improving the accuracy of sequences by improving sequencing technologies, such as the Circular Consensus Sequencing (CCS) method proposed by PacBio corporation. However, such methods can significantly reduce sequence read length and also result in increased sequencing costs. In contrast, computational methods can improve sequence accuracy at a lower cost sequencing protocol, and are an efficient and feasible solution.
There have been a number of computational approaches to solving the error correction problem of PacBio sequencing data. These methods can be classified into two categories, i.e., a hybrid correction strategy method and a self-correction strategy method, depending on the strategy employed. These methods all performed differently for PacBio sequencing data of different scale genomes at different sequencing depths. In actual use, researchers usually only need to select an error correction method with optimal performance according to the research interests and data of the researchers. Therefore, a reasonable and effective automatic evaluation method for the sequencing data error correction result of PacBio is significant.
At present, less automatic evaluation tools exist in the field of PacBio sequencing data error correction, and some defects exist, such as evaluation is carried out only by using small-scale genome sequencing data or simulation data, and the performance of error correction tools on large-scale genomes is ignored; only data with a single depth are used, and the influence of depth change in the actual sequencing process on the performance is ignored; the evaluation index is relatively one-sided, and evaluation is not performed from multiple aspects such as error correction performance, resource requirements and downstream application performance.
Disclosure of Invention
The invention aims to provide an automatic evaluation method for the error correction result of PacBio sequencing data, aiming at the defects in the background art.
The technical scheme for realizing the purpose of the invention is as follows:
an automatic evaluation method for error correction results of PacBio sequencing data comprises the following steps:
1) performing quality control on the original PacBio sequencing data to obtain a sequencing sequence which accords with a set threshold range; specifically, factors of base mass fraction, sequence mass fraction, GC content and sequence repetition level of original PacBio sequencing data are checked, a threshold value is set according to various factors, the original sequencing data are screened, and after sequences lower than and higher than the threshold value are removed, the obtained sequences are marked as clean reads;
2) correcting the clean reads after quality control by using an error correction method to be evaluated to obtain sequences marked as corrected reads, and counting memory resources and time consumption required by error correction while correcting the errors;
3) comparing the sequence clean reads before error correction with the sequence corrected reads after error correction, and performing statistical analysis to obtain an error correction output rate TH and an average length of the sequence after error correction; the method specifically comprises the following steps:
3-1) comparing the data quantity of the corrected sequences and the clear sequences before error correction, and calculating the error correction output rate TH in the error correction process;
3-2) calculating the average length of corrected reads of the sequence after error correction, and visualizing the length distribution;
4) comparing the corrected sequences with the corresponding reference genomes to obtain a comparison sequence MSA;
5) performing statistical analysis on the compared sequence MSA to obtain the sensitivity and accuracy of error correction, and specifically comprising the following steps:
5-1) counting the successfully corrected base number TP and the error corrected base number FN of the corrected sequences;
5-2) calculating the sensitivity of the error correction process by using TH, TP and FN;
5-3) counting the total number of the inserted, deleted and replaced three wrong bases contained in the corrected sequence corrected reads;
5-4) calculating the correct rate of corrected sequences of the three wrong base numbers obtained in the step 5-3);
5-5) calculating the genome coverage rate of the corrected sequences;
6) assembling the corrected sequences corrected reads to obtain contigs sequences, which specifically comprises the following steps:
6-1) carrying out pairwise comparison on the corrected sequences, wherein each reads only retains two comparison results with optimal quality and length, and calculating the overall coverage rate estimation;
6-2) comparing and further screening the results according to the coverage rate information, and discarding the comparison with the coverage rate smaller than a threshold value;
6-3) generating a character string diagram by discarding the comparison result with low coverage rate, trimming the tail end, removing the overlapped part with the support quantity smaller than the threshold value, and eliminating short branches;
6-4) generating contigs sequences by using the character string diagram;
7) comparing the contigs sequence obtained by assembly with a reference genome corresponding to the contigs sequence to obtain a comparison sequence contigs MSA;
8) carrying out statistical analysis on the contigs MSA obtained in the step 7), specifically counting the number of contigs sequences, calculating the genome coverage rate of the contigs sequences and calculating the NGA50 of the contigs sequences.
The automatic evaluation method for the error correction result of PacBio sequencing data is suitable for real data including large genomes and various sequencing depths, and overcomes the defect that only simulation data or real data of small genomes can be used in the prior art. Meanwhile, the method takes the practical use flow of the third generation sequencing data as a starting point, selects 8 indexes, and evaluates the error correction performance, the resource requirement, the downstream application performance and other aspects, and is more comprehensive compared with the prior art.
Researchers can systematically consider computing resources, data magnitude and experimental purposes according to research interests, determine reasonable evaluation index weight, and select a proper correction tool by using the automatic evaluation method provided by the invention. Developers in the field can verify the performance of the development method from multiple dimensions according to the automatic evaluation method, so as to guide the development method to improve the method or prove the effectiveness of the method.
Drawings
Fig. 1 is a flowchart of an automated evaluation method for error correction results of PacBio sequencing data.
Detailed Description
The invention will be further described with reference to the following drawings and examples, which are not intended to limit the invention.
The embodiment is as follows:
an automatic evaluation method for error correction results of PacBio sequencing data comprises the following steps:
1) performing quality control on the original PacBio sequencing data to obtain a sequencing sequence which accords with a set threshold range; specifically, factors such as base mass fraction, sequence mass fraction, GC content and sequence repetition level of original PacBio sequencing data are checked, a threshold value is set according to the factors to screen the original sequencing data, and after sequences lower than and higher than the threshold value are removed, the obtained sequence is marked as clean reads;
wherein raw sequencing data can be statistically reviewed and filtered using FastQC software;
2) calling an error correction tool which is expected to be evaluated, correcting the sequence clean reads after quality control to obtain a sequence marked as corrected reads, monitoring the memory occupation peak value of a program in the error correction process, and counting memory resources and time consumption required by the complete error correction process;
3) comparing the sequence clean reads before error correction with the sequence corrected reads after error correction, and performing statistical analysis to obtain an error correction output rate TH and an average length of the sequence after error correction; the method specifically comprises the following steps:
3-1) dividing the base number of the corrected sequences by the base number of the clean sequences before error correction to obtain the error correction output rate TH in the error correction process;
3-2) calculating the average length of the corrected sequences, and visualizing the length distribution by using a violin graph;
4) comparing the corrected sequences with the corresponding reference genomes to obtain a comparison sequence MSA; specifically, the corrected reads are compared to the corresponding reference genome by using comparison software Minimap2 to obtain a comparison file of the corrected reads, namely a sequence MSA;
5) performing statistical analysis on the compared sequence MSA to obtain the sensitivity and accuracy of error correction, and specifically comprising the following steps:
5-1) counting the number TP of successfully corrected bases and the number FN of erroneous but uncorrected bases of corrected sequences;
5-2) calculating the sensitivity of the error correction process according to the formula (TH × TP/(TP + FN) using TH, TP, and FN;
5-3) counting the total number of the basic groups of the three errors of insertion, deletion and replacement contained in the corrected sequence corrected reads;
5-4) calculating the correct rate of corrected sequences of corrected reads according to the total number of the three wrong bases obtained in the step 5-3), wherein the specific error rate is obtained by dividing the total number of the three wrong bases by the total number of the corrected reads;
5-5) calculating the genome coverage rate of the corrected sequences; specifically, the ratio of the reference genome covered by the corrected reads in the alignment file is calculated;
6) assembling the corrected sequences to obtain contigs sequences, which specifically comprises the following steps:
6-1) carrying out pairwise comparison on the corrected sequences, wherein each reads only retains two comparison results with optimal quality and length, and calculating the overall coverage rate estimation;
6-2) according to the coverage rate information, further screening the results by comparison, and discarding the comparison with the coverage rate less than 50%;
6-3) generating a character string diagram by discarding the comparison result with low coverage, trimming the tail end, removing the overlapped part with the support quantity less than 3, and eliminating short branches;
6-4) generating contigs sequences by using the character string diagram;
the corrected reads may be assembled specifically using minism software.
7) Comparing the contigs sequence obtained by assembly with a reference genome corresponding to the contigs sequence to obtain a comparison sequence contigs MSA; specifically, the assembled contigs can be aligned to the corresponding reference genome by using alignment software Minimap2 to obtain an alignment file of the contigs.
8) Carrying out statistical analysis on the contigs MSA obtained in the step 7), specifically counting the number of contigs sequences, calculating the genome coverage rate of the contigs sequences and calculating the NGA50 of the contigs sequences.

Claims (1)

1. An automatic evaluation method for error correction results of PacBio sequencing data is characterized by comprising the following steps:
1) performing quality control on the original PacBio sequencing data to obtain a sequencing sequence which accords with a set threshold range; specifically, factors of base mass fraction, sequence mass fraction, GC content and sequence repetition level of original PacBio sequencing data are checked, a threshold value is set according to the factors to screen the original sequencing data, and after sequences lower than and higher than the threshold value are removed, the obtained sequence is marked as clean reads;
2) correcting the clean reads after quality control by using an error correction method to be evaluated to obtain sequences marked as corrected reads, and counting memory resources and time consumption required by error correction while correcting the errors;
3) comparing the sequence clean reads before error correction with the sequence corrected reads after error correction, and performing statistical analysis to obtain an error correction output rate TH and an average length of the sequence after error correction; the method specifically comprises the following steps:
3-1) comparing the data quantity of the corrected sequences and the clear sequences before error correction, and calculating the error correction output rate TH in the error correction process;
3-2) calculating the average length of corrected reads of the sequence after error correction, and visualizing the length distribution;
4) comparing the corrected sequences with the corresponding reference genomes to obtain a comparison sequence MSA;
5) performing statistical analysis on the compared sequence MSA to obtain the sensitivity and accuracy of error correction, and specifically comprising the following steps:
5-1) counting the successfully corrected base number TP and the error corrected base number FN of the corrected sequences;
5-2) calculating the sensitivity of the error correction process by using TH, TP and FN;
5-3) counting the total number of the basic groups of the three errors of insertion, deletion and replacement contained in the corrected sequence corrected reads;
5-4) calculating the correct rate of the corrected sequences of the three wrong bases obtained in the step 5-3);
5-5) calculating the genome coverage rate of the corrected sequences;
6) assembling the corrected sequences to obtain contigs sequences, which specifically comprises the following steps:
6-1) carrying out pairwise comparison on the corrected sequences, wherein each reads only retains two comparison results with optimal quality and length, and calculating the overall coverage rate estimation;
6-2) comparing and further screening the results according to the coverage rate information, and discarding the comparison with the coverage rate smaller than a threshold value;
6-3) generating a character string diagram by discarding the comparison result with low coverage rate, trimming the tail end, removing the overlapped part with the support quantity smaller than the threshold value, and eliminating short branches;
6-4) generating contigs sequences by using the character string diagram;
7) comparing the contigs sequence obtained by assembly with a reference genome corresponding to the contigs sequence to obtain a comparison sequence contigs MSA;
8) and (4) carrying out statistical analysis on the contigs MSA obtained in the step 7), specifically counting the number of contigs sequences, calculating the genome coverage rate of the contigs sequences and calculating the NGA50 of the contigs sequences.
CN202210380137.9A 2022-04-12 2022-04-12 Automatic evaluation method for error correction result of PacBio sequencing data Pending CN114937475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210380137.9A CN114937475A (en) 2022-04-12 2022-04-12 Automatic evaluation method for error correction result of PacBio sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210380137.9A CN114937475A (en) 2022-04-12 2022-04-12 Automatic evaluation method for error correction result of PacBio sequencing data

Publications (1)

Publication Number Publication Date
CN114937475A true CN114937475A (en) 2022-08-23

Family

ID=82861884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210380137.9A Pending CN114937475A (en) 2022-04-12 2022-04-12 Automatic evaluation method for error correction result of PacBio sequencing data

Country Status (1)

Country Link
CN (1) CN114937475A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110295784A1 (en) * 2008-12-12 2011-12-01 Jun Wang Error correcting method of test sequence, corresponding system and gene assembly equipment
WO2013097149A1 (en) * 2011-12-29 2013-07-04 深圳华大基因科技服务有限公司 Method and device for estimating repeating sequence content of genome
WO2013097061A1 (en) * 2011-12-31 2013-07-04 深圳华大基因研究院 Bs and rrbs sequencing-based bioinformatics analysis method and device
US20140188397A1 (en) * 2011-05-17 2014-07-03 Bgi Tech Solutions Co., Ltd. Methods of acquiring genome size and error
CN107563151A (en) * 2017-09-18 2018-01-09 杭州和壹基因科技有限公司 A kind of PacBio sequencing datas assemble the error correction method of obtained genome sequence
CN108573127A (en) * 2017-03-14 2018-09-25 深圳华大基因科技服务有限公司 Processing method and its application of initial data is sequenced in a kind of nucleic acid third generation
CN108629156A (en) * 2017-03-21 2018-10-09 深圳华大基因科技服务有限公司 The method, apparatus and computer readable storage medium of three generations's sequencing data error correction
CN112133368A (en) * 2020-10-13 2020-12-25 南开大学 Automated analysis method of metagenome sequencing data based on third-generation sequencing technology
CN112687339A (en) * 2021-01-21 2021-04-20 深圳吉因加医学检验实验室 Method and device for counting sequence errors in plasma DNA fragment sequencing data
CN113496760A (en) * 2020-04-01 2021-10-12 深圳华大基因科技服务有限公司 Polyploid genome assembling method and device based on third-generation sequencing

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110295784A1 (en) * 2008-12-12 2011-12-01 Jun Wang Error correcting method of test sequence, corresponding system and gene assembly equipment
US20140188397A1 (en) * 2011-05-17 2014-07-03 Bgi Tech Solutions Co., Ltd. Methods of acquiring genome size and error
WO2013097149A1 (en) * 2011-12-29 2013-07-04 深圳华大基因科技服务有限公司 Method and device for estimating repeating sequence content of genome
WO2013097061A1 (en) * 2011-12-31 2013-07-04 深圳华大基因研究院 Bs and rrbs sequencing-based bioinformatics analysis method and device
CN108573127A (en) * 2017-03-14 2018-09-25 深圳华大基因科技服务有限公司 Processing method and its application of initial data is sequenced in a kind of nucleic acid third generation
CN108629156A (en) * 2017-03-21 2018-10-09 深圳华大基因科技服务有限公司 The method, apparatus and computer readable storage medium of three generations's sequencing data error correction
CN107563151A (en) * 2017-09-18 2018-01-09 杭州和壹基因科技有限公司 A kind of PacBio sequencing datas assemble the error correction method of obtained genome sequence
CN113496760A (en) * 2020-04-01 2021-10-12 深圳华大基因科技服务有限公司 Polyploid genome assembling method and device based on third-generation sequencing
CN112133368A (en) * 2020-10-13 2020-12-25 南开大学 Automated analysis method of metagenome sequencing data based on third-generation sequencing technology
CN112687339A (en) * 2021-01-21 2021-04-20 深圳吉因加医学检验实验室 Method and device for counting sequence errors in plasma DNA fragment sequencing data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周卫星;石海鹤;: "高通量测序中序列拼接算法的研究进展", 计算机科学, no. 05, 15 May 2019 (2019-05-15) *
马东娜;张兴坦;魏柳锋;李仪莹;钟伟民;赵茜;尤民生;: "基因组二代测序数据与三代测序数据的混合校正和组装", 基因组学与应用生物学, no. 04, 25 April 2018 (2018-04-25) *

Similar Documents

Publication Publication Date Title
Steinegger et al. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold
Venturini et al. Leveraging multiple transcriptome assembly methods for improved gene structure annotation
Novák et al. Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2
CN102206704B (en) Method and device for assembling genome sequence
CN112133368A (en) Automated analysis method of metagenome sequencing data based on third-generation sequencing technology
Springer et al. ILS-aware analysis of low-homoplasy retroelement insertions: inference of species trees and introgression using quartets
Delhomme et al. Guidelines for RNA-Seq data analysis
CN102521528A (en) Method for screening gene sequence data
Rivera-Rivera et al. LS³: A method for improving phylogenomic inferences when evolutionary rates are heterogeneous among taxa
CN118186103A (en) Lateolabrax japonicus 100k liquid phase chip and application thereof
CN112466416B (en) Material data cleaning method combining nickel-based alloy priori knowledge
CN117612605A (en) Virus whole genome sequence assembly analysis method based on high-throughput sequencing
CN114937475A (en) Automatic evaluation method for error correction result of PacBio sequencing data
CN112397148A (en) Sequence comparison method, sequence correction method and device thereof
CN103270175B (en) Method and system for detecting the insertion sites of transgenic foreign fragments
CN111292806B (en) Transcriptome analysis method by using nanopore sequencing
CN112825268B (en) Sequencing result comparison method and application thereof
CN112750501A (en) Optimized analysis method for macrovirome process
CN116469462B (en) Ultra-low frequency DNA mutation identification method and device based on double sequencing
CN108595914B (en) High-precision prediction method for tobacco mitochondrial RNA editing sites
CN111445949A (en) Method for annotating genome of high-altitude polyploid fish by using nanopore sequencing data
CN106021998A (en) Computation pipeline of single-pass multiple variant calls
CN112259169B (en) Method for rapidly obtaining chloroplast genome from transcriptome data
Natsidis et al. Systematic errors in orthology inference: a bug or a feature for evolutionary analyses?
Armstrong Enabling comparative genomics at the scale of hundreds of species

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination