CN115938480A

CN115938480A - Device and system for optimizing a method of correcting genome assembly results using long-read sequencing

Info

Publication number: CN115938480A
Application number: CN202111114295.1A
Authority: CN
Inventors: 贺丽娟; 邓天全; 陈世璇; 杨林峰; 谢敏
Original assignee: Wuhan Huada Gene Technology Service Co ltd
Current assignee: Wuhan Huada Gene Technology Service Co ltd
Priority date: 2021-09-23
Filing date: 2021-09-23
Publication date: 2023-04-07

Abstract

The invention provides a method for error correction of long-read sequencing data, which comprises the following steps: (1) grouping reference sequences to obtain a plurality of reference sequence subsets, each consisting of a portion of the reference sequences; (2) performing error correction processing separately for each of the plurality of reference sequence subsets, the error correction processing being performed based on: (a) the reference sequences contained in the reference sequence subset; (b) the alignment results corresponding to (a) within the total alignment result; and (c) the portion of the long-read sequencing data corresponding to (a); and (3) integrating the error correction results obtained from the plurality of reference sequence subsets so as to obtain the error correction result of the long-read sequencing data.

Description

Device and system for optimizing a method of correcting genome assembly results using long-read sequencing

Technical Field

The invention relates to the field of bioinformatics, in particular to a method and a system for optimizing the error correction of genome assembly results by long-read sequencing, and more particularly to a method for correcting a long-read assembly result, a device for correcting a long-read assembly result and a computer-readable storage medium.

Background

With the development of technology, single-molecule sequencing data, represented by PacBio and Nanopore, offer ultra-long read lengths that can resolve the assembly of complex genomic regions, and single-molecule sequencing is currently one of the most important sequencing technologies for genome assembly. However, single-molecule sequencing combines long reads with a high sequencing error rate. Some assembly software, such as WTDBG2, miniasm and Flye, assembles directly from the error-prone data, so correcting the resulting long-read assembly is particularly important. Other software first corrects the error-prone data and then assembles it. For example, Canu achieves high assembly accuracy but has a high computational cost, performs poorly on highly complex genomes and is not suitable for all species; software such as Falcon can effectively resolve highly complex regions, but the accuracy of its assembly result is low. In both cases, error correction of the genome assembly result after assembly is an essential step.

In the post-assembly error correction step, short-read data are more accurate than long-read data, but long reads are ultra-long and can span large repetitive regions, so they correct complex genomic regions more accurately. A high-quality, high-accuracy assembly can therefore be obtained only if long-read data are used for error correction of the genome assembly.

Minimap2 has now largely replaced other aligners for long reads and performs very well when long reads are used for genome error correction. Alignment is especially fast when the error correction software takes paf format as input. For a large genome, however, the error correction step still requires a large amount of memory and a long run time. If a large genome is split and each block is aligned and corrected separately, the task resembles the alignment and correction of a small genome, which reduces memory and shortens the analysis time to some extent. But eukaryotic genomes contain many repetitive sequences that are similar to one another, and a local alignment cannot take the alignment information of the whole genome into account, so the correction result for the whole genome is biased and inaccurate. Current error correction software therefore supports only global alignment (that is, all data are aligned to the whole genome) and corrects errors based on the global alignment result, so the memory and run-time problems remain. There is thus still a need to optimize the error correction scheme to reduce peak memory and run time during error correction.

Disclosure of Invention

The present application is based on the discovery and recognition by the inventors of the following problems:

Multiple studies show that, for large genomes, memory use rises sharply and run time is long during long-read error correction. The inventors found that, when a genome is corrected with long-read data, processing the genome, the alignment result and the long-read data module by module effectively divides one large error correction task into many small ones. All association information between the long-read data and the reference genome is retained, which ensures the accuracy of the error correction result. In addition, the resulting subtasks can be executed in parallel, which effectively reduces the peak memory and run time consumed by error correction, maximizes time efficiency and reduces the cost of whole-genome error correction analysis.

In a first aspect of the invention, a method for error correction of long-read sequencing data is presented. According to an embodiment of the invention, the method comprises: (1) grouping reference sequences to obtain a plurality of reference sequence subsets, each consisting of a portion of the reference sequences; (2) performing error correction processing separately for each of the plurality of reference sequence subsets, the error correction processing being performed based on: (a) the reference sequences contained in the reference sequence subset; (b) the alignment results corresponding to (a) within the total alignment result; and (c) the portion of the long-read sequencing data corresponding to (a); and (3) integrating the error correction results obtained for each reference sequence subset in step (2) so as to obtain the error correction result of the long-read sequencing data. With the method of the embodiment of the invention, data import, alignment and error correction for the long-read sequencing data are carried out in parallel as multiple tasks, which markedly reduces error correction time and improves the efficiency of sequencing data error correction.

According to an embodiment of the present invention, the method may further include at least one of the following additional technical features:

According to an embodiment of the present invention, before error correction is performed on the long-read sequencing data, the method further includes the following steps: (3) grouping the long-read sequencing data to obtain a plurality of sequencing data subsets of sequencing reads; (4) aligning each of the plurality of sequencing data subsets to the reference sequence to obtain an alignment result for each sequencing data subset; (5) merging the alignment results of the plurality of sequencing data subsets to obtain the total alignment result of the plurality of sequencing data subsets.

According to an embodiment of the invention, the long sequencing reads are sequencing reads of length 10K or more.

According to an embodiment of the invention, the grouping is performed randomly.

According to an embodiment of the present invention, the number of sequencing reads in each of the sequencing data subsets is not particularly limited.

According to an embodiment of the present invention, before step (3), the method comprises: assembling the long-read sequencing data to obtain a preliminary assembly result, the preliminary assembly result constituting the reference sequence in step (4).

According to an embodiment of the invention, at least one of step (4) and step (2) is performed as multiple tasks running in parallel.

According to an embodiment of the present invention, the grouping of the reference sequences is based on the following criteria: (1) Not internally segmenting each sequence in the reference sequence; (2) The total length of sequences contained in each of the reference sequence subsets differs by no more than 20%.

In a second aspect of the invention, a sequencing method is provided. According to an embodiment of the invention, the method comprises: obtaining a nucleic acid sample; performing long-read sequencing on the nucleic acid sample to obtain long-read sequencing data; and performing error correction processing on the sequencing data according to the method of the first aspect so as to obtain an error-corrected sequencing result. With the method of the embodiment of the invention, data import, alignment and error correction for the long-read sequencing data are carried out in parallel as multiple tasks, which markedly reduces error correction time and improves the efficiency of sequencing data error correction.

According to an embodiment of the invention, the nucleic acid sample is derived from a host whose genomic sequence is unknown.

In a third aspect of the invention, an apparatus for error correction of long-read sequencing data is presented. According to an embodiment of the invention, the apparatus comprises: a first grouping module, configured to group the long-read sequencing data to obtain a plurality of sequencing data subsets of sequencing reads; an alignment module, configured to align each of the sequencing data subsets to a reference sequence so as to obtain an alignment result for each sequencing data subset; an alignment result merging module, configured to merge the alignment results of the plurality of sequencing data subsets so as to obtain a total alignment result; a second grouping module, configured to group the reference sequences into a plurality of reference sequence subsets, each consisting of a portion of the reference sequences; an error correction module, configured to perform error correction processing on each of the plurality of reference sequence subsets, where the error correction processing is performed based on: (a) the reference sequences contained in the reference sequence subset; (b) the partial alignment results corresponding to (a) within the total alignment result; and (c) the portion of the sequencing data corresponding to (a); and an error correction result integration module, configured to integrate the error correction results obtained for the plurality of reference sequence subsets so as to obtain the error correction result of the long-read sequencing data.

In a fourth aspect of the invention, an apparatus for error correction of long-read sequencing data is presented. According to an embodiment of the invention, the apparatus includes a memory and a processor; the memory is configured to store a program; the processor is configured to implement the method of the first aspect by executing the program stored in the memory. With the apparatus of the embodiment of the invention, data import, alignment and error correction for the long-read sequencing data are carried out in parallel as multiple tasks, which markedly improves the efficiency of sequencing data error correction and reduces the time cost.

In a fifth aspect of the invention, a computer-readable storage medium is presented. According to an embodiment of the invention, the storage medium has stored therein a program executable by a processor to implement the method of the first aspect.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

Fig. 1 is a flow diagram of a long-read error correction process according to an embodiment of the present invention, wherein:

step 101 represents preparing the assembly result file and the long-read data file,

step 102 represents aligning the long-read data to the assembly result file,

step 103 represents correcting the genome assembly result with the long-read data according to the alignment result,

step 104 represents the final error correction result generated after error correction;

Fig. 2 is a flowchart of a block-wise alignment and block-wise error correction method according to an embodiment of the present invention, wherein:

step 201 represents genome preparation and splitting, which splits the genome sequence into M modules,

step 202 represents preparation and splitting of the alignment file, which yields M alignment results corresponding to the M sub-reference sequences,

step 203 represents splitting of the long-read data, which yields M long-read data subsets corresponding to the M alignment results,

step 204 represents error correction of the M sub-blocks, which yields M error correction results,

step 205 represents merging the error correction results to obtain the final error correction result;

Fig. 3 is a flowchart of obtaining an alignment result according to an embodiment of the present invention, in which:

step 301 represents splitting the long-read data into N data sub-blocks,

step 302 represents aligning the N data sub-blocks to the reference genome to obtain N alignment results,

step 303 represents merging the N alignment results into a final long-read alignment result; and

Fig. 4 shows an apparatus for error correction of long-read sequencing data according to an embodiment of the present invention, including: a first grouping module, an alignment module, an alignment result merging module, a second grouping module, an error correction module and an error correction result integration module.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

The following describes, with reference to the drawings, a device and a system for optimizing a method of correcting genome assembly results by long-read sequencing according to an embodiment of the present disclosure. Before the embodiments of the present invention are described in detail, the method for correcting genome assembly results by long-read sequencing is first introduced for ease of understanding:

1. Preparation and splitting of the reference genome

The long-read data are assembled with long-read assembly software to obtain a long-read assembly result that still contains errors, namely the assembly result file shown in step 201.

Splitting of the assembly result file: first, the total size of the reference genome and the length of the longest sequence are counted, and the sub-block size is determined. If the sub-blocks differ greatly in length, memory settings and time efficiency in the subsequent analysis are affected. To keep the sub-block sizes consistent, the method recommends that the sub-block length be set larger than the longest reference genome sequence. Once the sub-block length is determined, the assembly result file is read and sequence lengths are accumulated; when the accumulated length exceeds the specified sub-block size, the first sub-block is generated, and the next sub-block is built in the same way, until M sub-blocks have been constructed in step 201. The method splits by sequence name and corresponding sequence length and never breaks a sequence internally.
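The accumulation rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation; the function name `group_sequences` and the contig names and lengths in `demo` are hypothetical.

```python
from typing import Iterable, List, Tuple


def group_sequences(
    lengths: Iterable[Tuple[str, int]], block_size: int
) -> List[List[str]]:
    """Group whole sequences into sub-blocks by accumulated length.

    A sub-block is closed as soon as its accumulated length exceeds
    block_size, so no sequence is ever cut internally.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks: List[List[str]] = []
    current: List[str] = []
    total = 0
    for name, length in lengths:
        current.append(name)
        total += length
        if total > block_size:
            blocks.append(current)
            current, total = [], 0
    if current:  # leftover sequences form the last, possibly smaller, block
        blocks.append(current)
    return blocks


# Hypothetical contig names and lengths, for illustration only.
demo = [("ctg1", 30), ("ctg2", 30), ("ctg3", 10), ("ctg4", 50), ("ctg5", 5)]
print(group_sequences(demo, block_size=50))
```

Because a block is closed only after the threshold is crossed, a block can exceed the nominal size by up to one sequence length, which is one reason to choose a sub-block size larger than the longest sequence.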

2. Preparation and splitting of the alignment file

The process comprises the following steps:

(1) Data comparison

The sequenced long-read data are aligned to the reference genome with the Minimap2 alignment software. The method supports both a global alignment result and a block-wise alignment result. As shown in fig. 3, the block-wise alignment result is obtained by splitting the long-read data, aligning each part, sorting and merging, while the reference genome itself is not split. Long-read data come in two formats: bam format (the raw PacBio output file) and Fastq format (PacBio raw output after conversion, or Nanopore raw output). Methods for splitting data in both formats are described here.

Segmenting the bam format file:

PacBio sequencing is performed on a chip containing millions of zero-mode waveguides (ZMW wells for short; a ZMW is a nano-optical device, a well with a conductive layer, that confines light to a small observation volume). A DNA polymerase is immobilized at the bottom of each ZMW well, DNA is synthesized by polymerization along the DNA template, and the fluorescence signals excited during synthesis are detected. In a PacBio library, sequencing adapters are added to both ends of the double-stranded DNA to form a dumbbell shape; the double strand is opened during sequencing and the molecule is sequenced repeatedly, in a circular fashion, in the same ZMW well. In the raw output data the adapter sequences have been removed and each read is labeled with its ZMW well, so sequencing reads from the same ZMW well come from the same DNA fragment. If reads from the same DNA fragment are placed in different subfiles, alignment redundancy, computation and storage all increase. Therefore, when a PacBio bam file is split, reads from the same ZMW well must be placed in the same subfile. This is done with the program bamsieve in the SMRT Link installation package, available from the official PacBio download page (https://www.pacb.com/support/software-downloads/). First, the list of ZMWs on the chip in the PacBio raw data is obtained with the bamsieve parameter "--show-zmws", giving the total number M of ZMWs. The number of ZMWs per subfile is then determined from the chosen number N of subfiles (for example, with N set to 100, the ZMW list is split into 100 parts), and the sub-ZMW lists are generated. Finally, the long reads corresponding to each sub-ZMW list are extracted with the bamsieve parameter "--whitelist" to obtain N bam-format subfiles.
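The step of dividing the ZMW list into N sub-lists can be illustrated as follows. This is only a sketch of the list arithmetic: the function name `split_zmw_list` is hypothetical, and obtaining the ZMW list and extracting reads per sub-list are left to the external tool described above.

```python
from typing import List, Sequence


def split_zmw_list(zmws: Sequence[int], n_parts: int) -> List[List[int]]:
    """Split a ZMW list into at most n_parts contiguous sub-lists.

    Sub-list sizes differ by at most one, and every ZMW lands in exactly
    one sub-list, so reads from one ZMW well stay in one subfile.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    base, extra = divmod(len(zmws), n_parts)
    chunks: List[List[int]] = []
    start = 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        if size == 0:  # fewer ZMWs than requested parts
            break
        chunks.append(list(zmws[start:start + size]))
        start += size
    return chunks


# Hypothetical ZMW hole numbers, for illustration only.
print(split_zmw_list(list(range(10)), n_parts=3))
```

Each sub-list could then be written to its own file and passed to the extraction tool; the exact file format that tool expects is not specified here and would need to be checked against its documentation.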

Splitting the Fastq file:

To split a Fastq file, it is only necessary to count the total number S of reads in the file, set the number N of subfiles (for example, 100) and determine the number of reads per subfile. The list of read names is split first, and the long-read sequences are then divided into N subfiles according to these read-name lists.
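A toy version of this Fastq split is sketched below. It assumes strict four-line Fastq records and loads everything into memory, which is fine for illustration but not for a 50X data set, where a streaming two-pass approach would be needed. The names `read_fastq` and `split_fastq` and the demo reads are hypothetical.

```python
import io
import math
from itertools import islice
from typing import Iterator, List, TextIO, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a Fastq stream, assuming 4 lines per record."""
    while True:
        chunk = list(islice(handle, 4))
        if not chunk:
            return
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = (line.rstrip("\n") for line in chunk)
        yield header, seq, plus, qual


def split_fastq(handle: TextIO, n_parts: int) -> List[List[Record]]:
    """Split S reads into parts of ceil(S / n_parts) reads each.

    Rounding up means the result can contain fewer than n_parts parts.
    """
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    records = list(read_fastq(handle))
    if not records:
        return []
    per_part = math.ceil(len(records) / n_parts)
    return [records[i:i + per_part] for i in range(0, len(records), per_part)]


# Five made-up reads, split into two parts.
demo = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
parts = split_fastq(io.StringIO(demo), n_parts=2)
print([len(p) for p in parts])
```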

As shown in step 301 of fig. 3, data splitting yields N data sub-blocks. Each sub-block is aligned to the reference genome to obtain N alignment results, as shown in step 302; the alignment software proposed by the method is Minimap2. The alignment results are then merged into a final alignment result, as shown in step 303. The alignment result can be in paf or bam format. Because paf is a plain table, the paf files are simply concatenated. For the bam format, the alignment result of each subfile is first sorted with samtools sort, and the sorted files are then merged into the final bam-format alignment result with samtools merge, wherein the download address of samtools (https:// githu. If the official PacBio error correction software GCPP is used, alignment can be performed with pbmm2 from the SMRT Link software package (which includes minimap2 alignment and sorting), and the alignment results can be merged with "dataset create --type AlignmentSet".
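Since paf is a line-oriented table, merging the N sub-results amounts to concatenation. The sketch below shows this for any text streams; `merge_paf` is a hypothetical helper name, and the only care taken is to skip blank lines and make sure every record ends with a newline.

```python
import io
from typing import Iterable, TextIO


def merge_paf(sub_files: Iterable[TextIO], merged: TextIO) -> int:
    """Concatenate PAF sub-results into one stream; return records written."""
    written = 0
    for handle in sub_files:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines between or inside sub-files
            merged.write(line if line.endswith("\n") else line + "\n")
            written += 1
    return written


# Two tiny in-memory stand-ins for per-block alignment results.
part_a = io.StringIO("read1\t100\t0\t100\t+\tctg1\t1000\t0\t100\t90\t100\t60\n")
part_b = io.StringIO("read2\t80\t0\t80\t-\tctg2\t500\t10\t90\t70\t80\t60")
total = io.StringIO()
print(merge_paf([part_a, part_b], total))
```

The same function works with files opened via `open(path)`, so the sub-results never need to be held in memory together.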

(2) Splitting of comparison files

(1) The alignment result used here is the global alignment result shown as 202 in fig. 2. As step 202 shows, alignment result file X is obtained from sequence sub-block file X and the global alignment result, where X = 1, 2, ..., M.

If the global alignment file is in paf format, the sub-alignment file corresponding to each reference genome sub-block is obtained by pattern matching on the reference sequence names. If the alignment file is in bam format, a bed file is generated for each sequence sub-block file, in which the start and end positions are the start and end of each sequence in the reference, and the bam file is then split with samtools view. If the official PacBio error correction software GCPP is used, "dataset split-constraints" from the SMRT Link software package can be used.
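Both variants can be sketched briefly. For paf, the target (reference) sequence name is the sixth tab-separated column, so a dictionary lookup routes each record to its sub-block; for bam, one whole-sequence BED interval per reference sequence is written. The function names are hypothetical, and the dictionary lookup is a swapped-in equivalent of the pattern matching mentioned above.

```python
from typing import Iterable, List, Sequence, Tuple


def split_paf_by_block(
    paf_lines: Iterable[str], blocks: Sequence[Sequence[str]]
) -> List[List[str]]:
    """Route each PAF record to the sub-block holding its target sequence."""
    block_of = {name: i for i, names in enumerate(blocks) for name in names}
    out: List[List[str]] = [[] for _ in blocks]
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns
            continue
        idx = block_of.get(fields[5])  # column 6: target sequence name
        if idx is not None:
            out[idx].append(line)
    return out


def bed_lines(sequences: Iterable[Tuple[str, int]]) -> List[str]:
    """One BED interval per whole sequence (0-based start, end = length)."""
    return [f"{name}\t0\t{length}" for name, length in sequences]


def paf(query: str, target: str) -> str:
    """Build a made-up 12-column PAF record for the demo."""
    return f"{query}\t100\t0\t100\t+\t{target}\t1000\t0\t100\t90\t100\t60\n"


records = [paf("r1", "ctg1"), paf("r2", "ctg3"), paf("r1", "ctg2")]
print(split_paf_by_block(records, [["ctg1", "ctg2"], ["ctg3"]]))
print(bed_lines([("ctg1", 1000), ("ctg2", 1000)]))
```

Note that read `r1` aligns to two reference sequences; because the split is applied to the merged global result, every alignment of that read is kept in whichever sub-block its target belongs to.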

Note that, when the alignment file is split, the per-sub-block alignment results from the data alignment step are not used directly. They must first be merged, and the alignment results corresponding to each reference genome sub-block are then extracted from the merged file. The alignment result of a reference sequence subset thus contains every alignment of every long read against the sequences in that subset. This matters because error correction must consider the best alignment information at each position of the reference genome: the best alignment within one data sub-block does not reflect the best alignment over all data at the whole-genome level. Only after the alignment files of all sub-blocks are merged is the alignment information of all long reads at each reference position available, so that the best alignment at each position can be determined.

Generating paf-format files with Minimap2 takes very little time and memory, so when CPU resources are plentiful the alignment can be run without splitting the data.

3. Splitting of the long-read data

A bam file carries the original long-read sequence information, so splitting the alignment file also splits the long-read sequences. A paf alignment file does not contain base sequences, so the long-read data must be read during error correction, and the long reads corresponding to the alignment result must be matched to each sub-block.

As shown in step 203 of fig. 2, the long-read data are split based on the paf-format alignment result, so that each alignment subfile has a corresponding long-read data subset. Specifically, a list of long-read sequence names is obtained from each paf-format alignment sub-block of step 202, and the reads on that list are extracted from the original long-read sequencing data file by pattern matching. This gives the data sub-blocks of step 203, data X, where X = 1, 2, ..., M. The long-read data in this step are in Fasta or Fastq format.
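The extraction of a long-read subset for one alignment sub-block can be sketched as follows. Read names come from the first paf column; a set lookup stands in for the pattern matching described above. All function names are hypothetical, reads are represented as simple (name, sequence) pairs rather than parsed Fasta or Fastq records, and the final check that every aligned read was found is an added safeguard.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Read = Tuple[str, str]  # (read name, sequence)


def read_names_from_paf(paf_lines: Iterable[str]) -> Set[str]:
    """Collect the distinct query names (first PAF column)."""
    return {line.split("\t", 1)[0] for line in paf_lines if line.strip()}


def extract_reads(records: Iterable[Read], wanted: Set[str]) -> Iterator[Read]:
    """Yield only the reads whose name appears in the wanted set."""
    for name, seq in records:
        if name in wanted:
            yield name, seq


def extract_checked(records: Iterable[Read], paf_lines: Iterable[str]) -> List[Read]:
    """Extract the subset and verify that no aligned read is missing."""
    wanted = read_names_from_paf(paf_lines)
    subset = list(extract_reads(records, wanted))
    missing = wanted - {name for name, _ in subset}
    if missing:
        raise ValueError(f"{len(missing)} aligned reads not found in read file")
    return subset


# Made-up alignment sub-block and read set.
block_paf = ["r1\t100\t0\t100\t+\tctg1\n", "r1\t100\t0\t50\t-\tctg2\n", "r3\t80\t0\t80\t+\tctg1\n"]
all_reads = [("r1", "ACGT"), ("r2", "GGGG"), ("r3", "TTTT")]
print(extract_checked(all_reads, block_paf))
```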

4. Reference sequence sub-block error correction

As shown at 204 in fig. 2, the error correction software takes sequence sub-block X, alignment sub-block X and long-read data sub-block X and generates the corresponding error-corrected sub-block X, where X = 1, 2, ..., M. This step relies on the error correction of each reference genome sub-block being completed independently. In the commonly used long-read error correction software Arrow and RACON, each sequence is corrected by deriving a consensus sequence from its own alignment results, so the correction of one sequence is largely independent of the others and the scheme is feasible. The method turns one large alignment file and one large long-read data set into hundreds of small alignment blocks and hundreds of small long-read data sets, which greatly reduces peak memory requirements and shortens analysis time.

5. Error correction result merging

As shown at 205 in fig. 2, the error correction results of the M error correction sub-blocks are merged into a final consensus sequence file, which is the result of correcting the reference genome with the long-read data.

The invention divides a large data volume into many small data volumes while retaining the complete association between the long-read data and the whole genome sequence, and the genome is corrected sequence by sequence, which effectively ensures the accuracy of the genome error correction result. The splitting consumes very few resources during preprocessing, yet it reduces the run time and peak memory of each error correction subtask and allows many subtasks to run in parallel, which effectively shortens the run time and reduces the overall cost of whole-genome error correction analysis. The whole long-read error correction optimization strategy is particularly suitable for a cluster made up of many small machines, where its analysis efficiency is very high.

As an example, referring to fig. 4, fig. 4 is a diagram of an apparatus for error correction of the long read sequencing data according to an embodiment of the present invention, where the apparatus includes: a first grouping module to group the long read sequencing data to obtain a plurality of sequencing data subsets of sequencing reads; an alignment module, configured to respectively align each of the plurality of sequencing data subsets with a reference sequence, so as to obtain an alignment result of each of the sequencing data subsets; a second grouping module for grouping the reference sequences so that a plurality of reference sequence subsets are formed of a portion of the reference sequences; an error correction module, configured to perform error correction processing on each of the plurality of reference sequence subsets, where the error correction processing is performed based on: (a) A reference sequence contained in the subset of reference sequences; (b) Partial comparison results corresponding to (a) in the total comparison results; (c) A portion of the sequencing data corresponding to (a); and the error correction result integration module is used for integrating the error correction results obtained in the plurality of reference sequence subsets so as to obtain the error correction result of the long-read sequencing data.

The system can effectively carry out grouping processing on the long reads, the reference genome and the comparison result, and retains all the associated information between the long reads and the reference genome, so that the accuracy of the error correction result is ensured, one large error correction task is effectively divided into a plurality of error correction subtasks, the plurality of subtasks are executed in parallel, the peak value memory and the execution time of operation consumed in the error correction process are effectively reduced, the time efficiency is maximized, and the cost of the whole genome error correction analysis is reduced.

The invention also provides computer equipment, which comprises a memory and a processor; the memory is configured to store a program; the processor is configured to implement the long-read data error correction method described above by executing the program stored in the memory.

In addition, the present invention also provides a computer-readable storage medium having a program stored therein, the program being executable by a processor to implement the method for error correction of long-read data described above.

The invention will now be described with reference to specific examples, which are intended to be illustrative only and not to be limiting in any way.

Example 1 Block-wise alignment and block-wise error correction of a long-read preliminary assembly

In this example, the genome of an animal is about 3 Gb in size and was sequenced to a depth of about 50X on the Nanopore PromethION, giving about 150 Gb of data in total. RACON was used to correct a preliminary assembly with a total length of about 2.87 Gb; the reference genome here is this long-read preliminary assembly.

1. Preparing comparison files

Data alignment: the 50X long-read data in Fastq format were split into blocks of 500 Mb each, giving 299 sub-blocks. Each data sub-block was aligned globally against the 2.87 Gb reference genome to produce an alignment sub-block result in paf format. After all alignment subtasks had finished, the 299 paf-format sub-alignment files were merged into one total alignment file in paf format. The flow is shown in fig. 3.

2. Preparation and cutting of reference genome

The total size of the reference genome was counted as 2.87 Gb, in Fasta format, and the longest assembled sequence was 36 Mb. The sub-file size was set to 50 Mb, and splitting produced 60 sub-files.

3. Splitting of the total alignment file

In the paf format alignment result obtained in step 1, the sixth column is the sequence name of the reference genome, so the alignment information corresponding to the reference genome in the total alignment result file in paf format is extracted according to the reference sequence name of each reference genome sub-block in step 2. That is, 60 alignment subfiles are finally generated, wherein the sixth column of each alignment subfile contains all the sequence names of the reference genome sub-block to which it corresponds.

4. Splitting the long-read data

The first column of the paf format alignment result obtained in step 1 gives the names of the sequencing reads involved in the alignment. Therefore, each of the 60 sub-alignment files obtained in step 3 is processed as follows: according to the long-read sequence names in the first column of the paf file, the corresponding sequences are extracted from the full long-read data file to produce a new long-read data sub-file. This finally yields 60 long-read data sub-files in the same format as the original long-read data file. Note that the number of sequences in each long-read data sub-block extracted from the long-read data should equal the number of sequences used in the corresponding alignment file. The final reference sequence sub-blocks, alignment sub-blocks and long-read data sub-blocks correspond one to one; in this embodiment there are 60 of each.
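The read extraction for one alignment sub-file can be sketched as follows, again assuming uncompressed four-line Fastq records. The function names `read_names_from_paf` and `extract_fastq_reads` are hypothetical and are used here only for illustration.

```python
from typing import Iterable, Iterator, List, Set


def read_names_from_paf(paf_lines: Iterable[str]) -> Set[str]:
    """Collect the unique read names from paf column 1."""
    return {line.split("\t", 1)[0] for line in paf_lines if line.strip()}


def extract_fastq_reads(fastq_lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield the lines of every Fastq record whose read name is in wanted."""
    record: List[str] = []
    for line in fastq_lines:
        record.append(line)
        if len(record) == 4:
            # Header is '@name optional description'; compare the name only.
            parts = record[0][1:].split()
            name = parts[0] if parts else ""
            if name in wanted:
                yield from record
            record = []
```

The consistency requirement stated above can be checked by comparing the number of extracted records with the size of the set returned by `read_names_from_paf`.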

5. Error correction of the reference genome sub-blocks

Using Racon v1.3.3 software, the reference genome sub-blocks, alignment sub-blocks and long-read data sub-blocks, which correspond one to one, are taken as input to correct the reference genome sub-blocks, giving 60 error correction sub-files in Fasta format.
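The per-block Racon invocations can be assembled as in the sketch below. It only builds the command lines and does not run them. It relies on Racon's positional argument order (sequences, overlaps, target sequences) and its `-t` thread option; the file naming scheme and the helper name `racon_command` are hypothetical.

```python
from typing import List


def racon_command(reads: str, overlaps: str, target: str, threads: int = 5) -> List[str]:
    """Build one Racon command: racon -t N <sequences> <overlaps> <target>."""
    return ["racon", "-t", str(threads), reads, overlaps, target]


# One command per sub-block; the file names below are placeholders.
commands = [
    racon_command(f"reads_{i:02d}.fastq", f"aln_{i:02d}.paf", f"ref_{i:02d}.fasta")
    for i in range(1, 61)
]
```

Racon writes the corrected sequences to standard output, so each command would be run with its output redirected to a separate Fasta sub-file; the 60 commands are independent and can be scheduled as parallel tasks.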

6. Error correction result merging

The error correction sub-files are merged to obtain the final consensus sequence in Fasta format, which is the final result of correcting the reference genome with the long-read data.

7. Statistics of time and resources

The time and resource statistics are shown in table 1. In the alignment step, splitting the long-read data used 5 threads, took 1.67h, had a peak memory of 0.2G, and consumed a total CPU time of 8.33. The alignment was divided into 299 sub-alignment tasks with the CPU count set to 5; the peak memory over all alignment tasks was 17G, the maximum running time was 0.12h, and the total CPU time was 96.1. Merging the alignment results consumed a CPU time of 0.05. The total CPU time of the alignment step of this embodiment is therefore 104.48.

In the error correction step, splitting the genome consumed a CPU time of 0.1, and splitting the alignment results consumed a CPU time of 3. Splitting the long-read sequencing data used 5 threads per sub-task, with a peak memory of 0.3G, a peak running time over the 60 tasks of 1.28h, and a total CPU time of 243. RACON error correction was run with 5 threads, with a peak memory of 29G, a peak running time of 4.71h, and a total CPU time of 302.56. Merging the error correction results consumed a CPU time of 0.08. The CPU time of the entire error correction step of this embodiment is therefore 548.74.

After splitting, RACON error correction of the genome with long-read data can be completed within about 8h on a machine with 5 CPUs and 32G of memory, and the total CPU time of the whole analysis is 653.22.
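The CPU time totals quoted for this example can be checked by summing the per-step figures. The short script below uses only the numbers given in the text, in the units reported there; the dictionary keys are descriptive labels chosen for this illustration.

```python
# CPU time figures quoted in the text for example 1 (units as reported there).
alignment_cpu = {
    "split long-read data": 8.33,
    "block-wise alignment": 96.1,
    "merge alignment results": 0.05,
}
correction_cpu = {
    "split genome": 0.1,
    "split alignment results": 3.0,
    "split long-read data": 243.0,
    "RACON error correction": 302.56,
    "merge error correction results": 0.08,
}

alignment_total = round(sum(alignment_cpu.values()), 2)
correction_total = round(sum(correction_cpu.values()), 2)
overall_total = round(alignment_total + correction_total, 2)

print(alignment_total, correction_total, overall_total)
```

The sums reproduce the stated totals of 104.48 for the alignment step, 548.74 for the error correction step, and 653.22 for the whole analysis.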

Table 1:

Example 2 Global alignment and block-wise error correction of the long-read preliminary assembly result

In this example, the genome of an animal is about 3G in size. Sequencing was performed on a Nanopore PromethION to a depth of about 50X, giving a total data volume of about 150G, and RACON software was used to correct the preliminary assembly result, whose total length is about 2.87G.

1. Global alignment

The Fastq format data, with a data volume of about 150G, is aligned against the 2.87G preliminary assembly result using minimap2 as the alignment software, producing a single alignment result in paf format.

2. Block error correction

The reference genome has a total size of 2.87G, and the longest assembled sequence is 36M. The sub-block size is set to 50M, the genome is split based on the lengths and names of the reference sequences, and the final number of genome sub-files is 60.

The reference sequence names in each reference genome sub-file are matched against the sixth column (sequence names) of the paf format alignment file obtained in step 1, and the alignment results corresponding to each reference genome sub-block are extracted. This yields 60 sub-alignment files that correspond exactly to the reference genome sub-blocks. Then, according to the long-read sequence names in the first column of each sub-alignment file, the corresponding sequences are extracted from the full long-read data file, producing 60 long-read data sub-files that correspond to the reference sequence sub-blocks and the alignment sub-blocks.

Racon v1.3.3 is used to correct each reference genome sub-file based on that sub-file, its corresponding alignment sub-file and its long-read data sub-file, giving an error correction result for each reference genome sub-file.

The error correction results are merged to obtain the final error correction result.

3. Time and resource statistics

The time and resource statistics are shown in table 2. In the alignment step, this method aligns all the long-read data against the whole genome; it used 20 threads, with a peak memory of 20.92G, a running time of 3.45h, and an alignment CPU time of 69.05.

Example 2 differs from example 1 only in the alignment step, and the alignment results obtained by the two methods are completely identical. The specific operations, resource usage and time of the error correction step are therefore the same as in example 1 and are not described again here. The total CPU time of this step is 548.74.

The overall analysis time of this embodiment is 9.67h, the peak memory is 29G, the peak thread count is 20, and the total CPU time is 614.43.

Table 2:

Example 3 Global alignment and global error correction of the long-read preliminary assembly result

The Fastq format data, with a data volume of about 150G, is aligned against the 2.87G preliminary assembly result using minimap2 as the alignment software, producing an alignment result in paf format. The alignment result, the original long-read sequencing data in Fastq format and the preliminary assembly result are then used as input to Racon v1.3.3, which corrects the preliminary assembly result with the long-read data.

The time and resource usage is shown in table 3. In this embodiment, the alignment step was run with 32 threads, with a peak memory of 21.16G, a total running time of 2.09h, and a total alignment CPU time of 66.98. The error correction step was run with 60 threads, with a peak memory of 444.15G, a running time of 89.68h, and a CPU time of 5380.18. For the whole error correction process, the total time is therefore 91.77h, the peak memory is 445G, the peak thread count is 60, and the total CPU time is 5646.95.

Table 3:

Item	Peak time (h)	Number of threads	Number of tasks	Peak memory (G)	CPU time
Global alignment	2.09	32.00	1.00	21.16	66.98
Global error correction	89.68	60.00	1.00	444.15	5380.18
Total	91.77	60.00	1.00	444.15	5646.95

Example 4 Comparison of error correction methods

In this example, the methods described in examples 1 to 3 were compared. The specific comparison process and results are as follows:

1. Comparison of alignment procedures

Table 4 shows the resource consumption of the alignment process in the three methods of example 1 (the long-read sequencing data is split, each long-read data subset is aligned against the reference genome, and the results are merged), example 2 (direct alignment, 20 threads) and example 3 (direct alignment, 32 threads). Here, time is the peak running time; actual threads is the number of threads set for the analysis; peak memory is the largest memory usage during the run; and actual CPU time is the total CPU time accumulated over the whole alignment process, which for example 1 includes data splitting and merging of the alignment results. Alignment entries is the number of lines of alignment information, and alignment file size is the size of the final alignment file. As can be seen from table 4, the alignment files obtained by the three methods have the same size and the same alignment results, and after sorting, a check with the diff command shows that the three results are completely identical. In terms of time usage, the higher the thread setting, the lower the total CPU time. When thread resources are limited, the method of example 1 is recommended, that is, the long-read sequencing data is split into multiple subsets and each subset is aligned against the reference genome to obtain its alignment result.
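The sort-and-diff check mentioned above can be expressed in a few lines. The sketch below is an in-memory equivalent for illustration, not the commands used by the inventors; the function name `same_alignments` is hypothetical.

```python
from typing import Iterable


def same_alignments(paf_a: Iterable[str], paf_b: Iterable[str]) -> bool:
    """Return True if two paf files contain the same records in any order."""
    def normalized(lines: Iterable[str]):
        return sorted(line.rstrip("\n") for line in lines if line.strip())

    return normalized(paf_a) == normalized(paf_b)
```

Sorting first matters because merging the sub-alignment files of example 1 produces records in a different order from a single global alignment run, even when the set of records is identical.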

Table 4:

2. Comparison of error correction procedures

The alignment results are identical regardless of whether the long-read sequencing data is split during alignment. Therefore, only the two error correction approaches of example 2 (block-wise error correction) and example 3 (global error correction) are compared here.

The time and resource usage of the error correction process in example 2 and example 3 is shown in table 5. The method of example 2 is far lower than that of example 3 in time, thread usage, peak memory and total CPU time. In addition, the inventors illustrate the effect of the error correction method of example 2 using the sub-block (about 52M) that took the longest after the whole genome was split. With the CPU count set to 5, importing the reference sequence took 0.395s, importing the long-read data took 1614.617s, importing the alignment results took 4.017s, determining overlap relationships from the alignment results took 2050.158s, interface output took 11.790s, and obtaining the corrected consensus sequence took 13271.216s, for a total of 16953.343s; the total time is significantly reduced.

Table 5:

Error correction	Time (h)	Actual threads	Peak memory (G)	Actual CPU time
Implementation method 2 (block-wise error correction)	6.22	5.00	30.00	694.54
Implementation method 3 (global error correction)	89.68	60.00	444.15	5380.18

In addition, the inventors compared the method of example 1 (the optimized method, i.e., block-wise error correction) with the method of example 3 (the method before optimization, i.e., global error correction). The percentage of time consumed by each step in the two methods is shown in table 6. After the genome is split for error correction, the share of total time spent importing long-read data falls from 93.86% to 9.52%, and the most time-consuming step becomes consensus sequence acquisition, whose share of total time rises from 3.64% to 78.28%. Because the consensus sequence is derived from the alignment results of each sequence, and the alignment results of each reference genome sub-block are complete, the time efficiency of consensus sequence acquisition does not change much whether or not the genome is split. The method of example 1 therefore improves the overall error correction efficiency mainly by reducing the data import time.

Table 6:

Item	Before optimization (%)	After optimization (%)
Reference sequence import	0.05	0.00
Long-read data import	93.86	9.52
Alignment result import	0.04	0.02
Determining overlap relationships	2.28	12.09
Interface output	0.13	0.07
Consensus sequence acquisition	3.64	78.28
Total	100.00	100.00
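The "after optimization" percentages can be recomputed from the per-step timings reported above for the slowest sub-block. The script below uses only those reported numbers and takes the stated total of 16953.343s as the denominator; the dictionary labels are descriptive names chosen for this illustration.

```python
# Per-step timings (seconds) reported for the slowest sub-block in example 2.
timings_s = {
    "Reference sequence import": 0.395,
    "Long-read data import": 1614.617,
    "Alignment result import": 4.017,
    "Determining overlap relationships": 2050.158,
    "Interface output": 11.790,
    "Consensus sequence acquisition": 13271.216,
}
total_s = 16953.343  # total time as reported in the text

# Share of the total time spent in each step, in percent.
shares = {step: round(100 * t / total_s, 2) for step, t in timings_s.items()}

for step, pct in shares.items():
    print(f"{step}\t{pct:.2f}")
```

The computed shares agree with the "after optimization" column: long-read data import accounts for 9.52% and consensus sequence acquisition for 78.28% of the total time.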

The error correction results of the method of example 1 and the method of example 3 are compared in table 7. The first column, Ctg_N50_len, is the N50 length of the assembly: the assembled sequences are sorted from longest to shortest and their lengths are accumulated, and N50 is the length of the sequence at which the cumulative length first reaches 50% of the total assembly length. It reflects the overall length level of the assembled sequences and is used to judge the integrity of the genome sequence. The second column, Ctg_length, is the total length of the assembled sequences. The third column, Busco_comp, is the proportion of complete genes found when the assembly is compared with the benchmarking universal single-copy orthologous genes (BUSCO). In theory, the genes in the single-copy ortholog reference set are conserved across species and present as single copies, so genome completeness can be evaluated by the proportion of these genes identified in the assembly; the higher the proportion of complete single-copy genes, the more complete the assembly, with a maximum of 100. The fourth and fifth columns are Merqury evaluation results; Merqury evaluates a genome assembly by comparing it with the set of k-mers from high-accuracy sequencing reads. The fourth column, QV (Pred), is the QV value estimated by Merqury and reflects the accuracy of the genome. The fifth column, Completeness, is the completeness value estimated by Merqury. As can be seen from table 7, relative to the original data, the quality value after error correction improves from 22 to 28, the k-mer completeness improves from 82% to more than 92%, and the BUSCO completeness improves from 63.3% to more than 84%. The results of the method of example 2 and the method of example 3 differ slightly, which is attributable to variation within the analysis software; the accuracy and completeness of the whole genome are comparable.
On the premise that the analysis result is unchanged, the method of example 2 effectively solves the problems of high memory consumption and long analysis time in error correction with long-read data.
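For reference, the N50 statistic described above can be computed as in the following sketch. The function name `n50` is chosen for illustration, and the input is assumed to be a list of assembled sequence lengths.

```python
from typing import List


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated; N50 is the
    length of the sequence at which the running total first reaches half of
    the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For example, for sequence lengths 2, 3, 4, 5 and 6 (total 20), the two longest sequences already sum to 11, which exceeds half the total, so the N50 is 5.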

Table 7:

3. Conclusion

In summary, when the computing environment consists of many machines with small CPU resources and a single machine is CPU-limited, block-wise alignment followed by block-wise error correction is recommended. The optimization effect of the block-wise error correction step is pronounced: in this embodiment the CPU time is reduced to 1/7 of the original and the overall analysis time to 1/10 of the original. The block-wise alignment and block-wise error correction method of example 1 can therefore effectively improve the time efficiency and CPU utilization of the error correction process.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are exemplary and not to be construed as limiting the present invention, and that changes, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for error correction of long read sequencing data, comprising:

(1) Grouping reference sequences to obtain a plurality of reference sequence subsets consisting of a portion of the reference sequences;

(2) Performing error correction processing separately for each of the plurality of reference sequence subsets, the error correction processing being performed based on:

(a) A reference sequence contained in the subset of reference sequences;

(b) Partial comparison results corresponding to (a) in the total comparison results;

(c) A portion of the long read sequencing data corresponding to (a);

(d) Integrating the error correction results obtained for each reference sequence subset in step (2) so as to obtain the error correction result of the long read sequencing data.

2. The method of claim 1, wherein before performing step (1), further comprising:

(3) Grouping the long read sequencing data to obtain a plurality of sequencing data subsets of sequencing reads;

(4) Comparing each of the plurality of sequencing data subsets with the reference sequence to obtain comparison results of each of the sequencing data subsets;

(5) Merging the alignment results of the plurality of sequencing data subsets to obtain a total alignment result of the plurality of sequencing data subsets.

3. The method of claim 2, wherein the long sequencing reads are sequencing reads that are greater than 10K in length.

4. The method of claim 2, wherein the grouping is performed randomly.

5. The method of claim 2, wherein prior to step (3), comprising:

assembling the long read sequencing data to obtain a preliminary assembly result, the preliminary assembly result constituting the reference sequence in step (4).

6. The method of claim 1 or 2, wherein at least one of the steps (4) and (2) is performed simultaneously for multiple tasks.

7. The method of claim 1, wherein the grouping the reference sequences is based on the following criteria:

(1) Not internally segmenting each sequence in the reference sequences;

(2) The total length of sequences contained in each of the reference sequence subsets differs by no more than 20%.

8. A sequencing method, comprising:

obtaining a nucleic acid sample;

performing long read sequencing on the nucleic acid sample to obtain long read sequencing data;

performing error correction processing on the sequencing data according to the method of any one of claims 1 to 7 so as to obtain error-corrected sequencing results.

9. The method of claim 8, wherein the nucleic acid sample is derived from a host of unknown genomic sequence.

10. An apparatus for error correction of long read sequencing data, comprising:

a first grouping module to group the long read sequencing data to obtain a plurality of sequencing data subsets of sequencing reads;

an alignment module, configured to respectively align each of the plurality of sequencing data subsets with a reference sequence, so as to obtain an alignment result of each of the sequencing data subsets;

an alignment result merging module, configured to merge the alignment results of the multiple sequencing data subsets so as to obtain a total alignment result of the multiple sequencing data subsets;

a second grouping module to group the reference sequences to obtain a plurality of reference sequence subsets comprised of a portion of the reference sequences;

an error correction module, configured to perform error correction processing on each of the plurality of reference sequence subsets, where the error correction processing is performed based on:

(a) A reference sequence contained in the subset of reference sequences;

(c) A portion of the sequencing data corresponding to (a);

and the error correction result integration module is used for integrating the error correction results obtained in the plurality of reference sequence subsets so as to obtain the error correction result of the long read sequencing data.

11. A computer device, characterized by comprising a memory and a processor;

the memory being configured to store a program;

the processor being configured to implement the method of any one of claims 1 to 7 by executing the program stored in the memory.

12. A computer-readable storage medium characterized by: the storage medium has stored therein a program executable by a processor to implement the method of any one of claims 1 to 7.