CN107590362B

CN107590362B - Method for judging whether overlapping assembly is correct or incorrect based on long read sequence sequencing

Info

Publication number: CN107590362B
Application number: CN201710720048.3A
Authority: CN
Inventors: 邬三毛; 肖世俊; 郭文浒; 陈楠生
Original assignee: Wuhan Frasergen Co Ltd
Current assignee: Wuhan Frasergen Co Ltd
Priority date: 2017-08-21
Filing date: 2017-08-21
Publication date: 2019-12-06
Anticipated expiration: 2037-08-21
Also published as: CN107590362A

Abstract

The invention relates to a method for judging whether an overlap assembly is correct or incorrect based on long-read sequencing data. The method is carried out by aligning the long-read sequencing data to the overlap assembly result to be processed, wherein the average read length of the long-read sequencing is not less than 2 kb. With the method of the invention, the correctness of an overlap assembly can be judged at the contig assembly stage, incorrect overlap assemblies can be eliminated, and contig sequence information of higher reliability can be provided.

Description

Method for judging whether overlapping assembly is correct or incorrect based on long read sequence sequencing

Technical Field

The invention relates to the field of genome sequencing and assembly, in particular to a method for judging whether overlapping assembly is correct or incorrect based on long-read sequencing.

Background

Since the raw output of high-throughput sequencing is not a complete, continuous genome but a series of fragments with overlapping ends, specific assembly algorithms and software must be used to assemble the fragments into a relatively complete genome. However, because of quality problems in the raw data or defects in the assembly software, some errors inevitably remain in the final assembly result. By scale, these errors fall into two types: errors at the base level and errors at the contig level. The former can be corrected by existing error-correction means, but no mature error-correction method currently exists for the latter.

These residual errors have a large impact on subsequent genomic analysis. Correcting them, however, normally requires data more reliable than the raw sequencing data, so a method that performs contig-level error correction using the raw sequencing data itself is both necessary and significant.

Disclosure of Invention

To solve the above problems, the present invention provides an overlap-assembly-level error correction method based on long-read sequencing, characterized in that the method is performed by aligning long-read sequencing data to the overlap assembly result to be processed, and the average read length of the long-read sequencing is not less than 2 kb.

In one embodiment, the method comprises the steps of:

S1: obtaining long read sequencing data;

S2: aligning the long-read sequencing data to the overlap assembly result to be processed to obtain an alignment result;

S3: judging whether the overlap assembly result is correct or incorrect according to the information in the alignment result.

In one embodiment, S2 includes the steps of:

S21: aligning the long-read sequencing data to the overlap assembly result;

S22: clustering the alignments of the same read in the long-read sequencing data, selecting the class with the longest total alignment length as the final alignment of the read, extracting the alignment information, and merging discrete alignments;

S23: scanning the alignments on each overlap assembly sequence of the overlap assembly result in turn, and recording each abnormal break window, the number of abnormal breakpoints in that window, and the number of times the window is spanned by reads in the long-read sequencing data. Preferably, not every position of the whole genome is scanned; only abnormal break windows are scanned, an abnormal break window being a window of 50-200 bp that contains at least one abnormal breakpoint. An abnormal breakpoint is a position where an alignment starts or ends without being at the start or end of the read; such breakpoints are likely to be caused by assembly errors.
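The definition of an abnormal breakpoint and its window in S23 can be sketched as follows. This is only an illustration under stated assumptions: the record type `Aln`, its field names, the tolerance `tol`, and the use of fixed, non-overlapping 100 bp bins are hypothetical choices (the text only requires a window of 50-200 bp), and only forward-strand alignments are handled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Aln:
    """One merged alignment of a read on a contig (hypothetical record layout)."""
    read_len: int    # full length of the read
    read_start: int  # 0-based start of the aligned part on the read
    read_end: int    # end (exclusive) of the aligned part on the read
    ctg_start: int   # 0-based start of the alignment on the contig
    ctg_end: int     # end (exclusive) of the alignment on the contig


def abnormal_breakpoints(aln, tol=0):
    """Return contig positions where the alignment stops inside the read."""
    points = []
    if aln.read_start > tol:               # alignment begins inside the read
        points.append(aln.ctg_start)
    if aln.read_len - aln.read_end > tol:  # alignment ends inside the read
        points.append(aln.ctg_end)
    return points


def abnormal_break_windows(alns, window=100, tol=0):
    """Map window index -> breakpoint count; windows without breakpoints are absent."""
    counts = Counter()
    for aln in alns:
        for pos in abnormal_breakpoints(aln, tol):
            counts[pos // window] += 1
    return dict(counts)


alns = [
    Aln(10_000, 0, 10_000, 500, 10_500),       # full-length alignment, no breakpoint
    Aln(8_000, 0, 5_000, 20_030, 25_030),      # stops inside the read at contig 25030
    Aln(9_000, 4_000, 9_000, 25_060, 30_060),  # starts inside the read at contig 25060
]
windows = abnormal_break_windows(alns)
print(windows)
```

Because only windows that contain at least one abnormal breakpoint are stored, the scan never has to visit every position of the assembly, which matches the preference stated in S23.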

In a preferred embodiment, filtering is performed both before and after the clustering in S22.

Preferably, in S22, the filtering before clustering rejects alignments whose alignment length is less than 20-100 bp, alignments whose alignment length accounts for less than 0.01-0.1 of the read's own length, and alignments whose identity is less than 85-95%; the filtering after clustering rejects alignments whose alignment length accounts for less than 0.3-0.6 of the read's own length. These two filtering passes reduce noise and false-positive alignments.
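The two filtering passes might look like the sketch below. The function names and the default cut-offs (50 bp, 0.05, 0.90 and 0.5) are arbitrary picks from inside the ranges given above, not values stated in the source, and the ratio is interpreted here as alignment length divided by the full read length, which is an assumption.

```python
def passes_pre_cluster_filter(aln_len, read_len, identity,
                              min_len=50, min_frac=0.05, min_identity=0.90):
    """Keep an alignment only if it is long enough, covers enough of the
    read, and is similar enough to the contig."""
    return (aln_len >= min_len
            and aln_len / read_len >= min_frac
            and identity >= min_identity)


def passes_post_cluster_filter(total_aln_len, read_len, min_frac=0.5):
    """After clustering, keep a read only if its merged alignment covers
    a sufficient fraction of the read."""
    return total_aln_len / read_len >= min_frac


# A 5 kb alignment of a 10 kb read at 92% identity survives the first pass.
print(passes_pre_cluster_filter(5_000, 10_000, 0.92))
# A merged alignment covering only 20% of the read is dropped in the second pass.
print(passes_post_cluster_filter(2_000, 10_000))
```

The first pass is deliberately loose so that short pieces of a split read can still join a cluster; the stricter coverage requirement is applied only after merging.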

In a preferred embodiment, in S23, the abnormal break window is a window in which the number of abnormal breakpoints is greater than a threshold, the threshold being calculated as (sequencing depth × 6)/100. We refer to such a window as a high abnormal break window; the probability of an assembly error in such a window is greatly increased.
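As a worked example of the threshold, reading the formula as sequencing depth multiplied by 6 and divided by 100 (the form given in claim 4): at the 30X depth used in the embodiment the threshold is 1.8, so a window needs at least 2 abnormal breakpoints to qualify. The function names are hypothetical, and the strict comparison follows the paragraph above, whereas claim 4 words it as "equal to or greater than".

```python
def high_break_threshold(depth):
    """Threshold on the abnormal breakpoint count for a given sequencing depth."""
    return depth * 6 / 100


def is_high_break_window(n_breakpoints, depth):
    """True if the window counts as a high abnormal break window."""
    return n_breakpoints > high_break_threshold(depth)


print(high_break_threshold(30))     # threshold at 30X depth
print(is_high_break_window(2, 30))  # 2 breakpoints exceed the threshold
print(is_high_break_window(1, 30))  # 1 breakpoint does not
```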

In one embodiment, in S23, the number of times reads span the abnormal break window is calculated as follows: the region formed by 200 bp to the left and 200 bp to the right of the midpoint of the abnormal break window is taken as the read-spanning judgment window, and the number of reads spanning this judgment window is counted.
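The spanning count can be sketched as below, assuming each read has already been reduced to a single merged interval `(ctg_start, ctg_end)` on the contig and that a read "spans" the judgment window when its interval covers the window completely. The function name and the inclusive boundary handling are illustrative assumptions.

```python
def count_spanning_reads(intervals, window_start, window_end, flank=200):
    """Count reads whose contig interval covers midpoint - flank .. midpoint + flank."""
    mid = (window_start + window_end) // 2
    left, right = mid - flank, mid + flank
    return sum(1 for start, end in intervals if start <= left and end >= right)


intervals = [
    (20_000, 30_000),  # covers the whole judgment window
    (24_900, 26_000),  # starts inside the judgment window
    (24_000, 25_200),  # ends inside the judgment window
    (24_850, 25_250),  # covers the judgment window exactly
]
# Abnormal break window 25000-25100: midpoint 25050, judgment window 24850-25250.
n_spanning = count_spanning_reads(intervals, 25_000, 25_100)
print(n_spanning)
```

A correctly assembled region is expected to be crossed by many reads, whereas a misjoin is expected to be crossed by few or none, which is why this count serves as one of the two classifier features.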

In one embodiment, the information in the alignment result in S3 includes the number of abnormal breakpoints in each abnormal break window and the number of times the window is spanned by reads in the long-read sequencing data, and S3 specifically comprises the following steps:

S31: constructing an SVM model using two indexes, the number of abnormal breakpoints and the read-spanning count, as the feature vector;

S32: training the SVM model with known correct and incorrect assembly results to obtain a classifier;

S33: using the classifier to judge whether the assembly at each abnormal break window in the overlap assembly result is correct or incorrect.
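Steps S31 to S33 can be illustrated with a small linear SVM. The source does not say which SVM implementation or kernel was used, so the sketch below swaps in a plain linear soft-margin SVM trained by batch subgradient descent on the regularised hinge loss, written in pure Python. The training points, the hyperparameters `lam`, `lr` and `epochs`, and the function names are all invented for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Batch subgradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:           # point violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Return +1 (assembly judged correct) or -1 (judged wrong)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Feature vector: (abnormal breakpoint count, read-spanning count). Toy data only.
X = [(2, 30), (3, 25), (1, 28), (4, 22),   # labelled correct (+1)
     (12, 2), (15, 1), (9, 3), (20, 0)]    # labelled wrong (-1)
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print(classify(w, b, (2, 26)))   # few breakpoints, many spanning reads
print(classify(w, b, (14, 1)))   # many breakpoints, almost no spanning reads
```

In practice a library implementation with feature scaling would normally be preferred; the hand-written trainer is used here only to keep the example self-contained.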

In one embodiment, in S32, the known correct and incorrect assembly results are obtained by assembling the sequencing data of a known genome sequence with assembly software, aligning the assembly result to the reference genome, and labeling the correctly and incorrectly assembled positions. For example, the output of overlap assembly software or scaffolding software is aligned to the reference genome and the correct and incorrect positions are labeled; a model trained on such data can be used to identify errors in the output of that overlap assembly or scaffolding software. Alternatively, the incorrect assembly results among the known correct and incorrect assembly results are obtained by artificially generating incorrect assemblies; after alignment, all screened sites are labeled, and a model trained on such data can identify incorrect joins with obvious characteristics.

In one embodiment, the long-read sequencing is third-generation sequencing. Third-generation sequencing can produce reads of around 10 kb, which are well suited to error correction of overlap assembly results in the method of the invention.

In one embodiment, the overlap assembly result may be an overlap assembly result based on second-generation sequencing or third-generation sequencing.

By using the method of the invention, the correctness of the overlapping assembly can be judged in the stage of overlapping assembly of contigs, the wrong overlapping assembly can be eliminated, and the contig sequence information with higher reliability can be provided.

Drawings

FIG. 1 is a flowchart of an embodiment of judging whether an overlap assembly is correct;

FIG. 2 shows the output of the program in the example.

Detailed Description

The principles and features of the present invention are described below using, as an example, the correction of a miniasm assembly of C. elegans. The example is given solely for illustration and is not intended to limit the scope of the invention.

Nematodes are among the most classical model organisms, and many important theoretical findings in modern molecular biology, such as apoptosis and RNA silencing, derive from studies on nematodes. The Caenorhabditis elegans genome is about 97 Mb in size, and the nuclear genome has 6 chromosomes. Testing on the nematode genome is therefore highly representative for this application.

Third-generation sequencing was carried out on the nematode genome to obtain 8 GB of raw reads, and the raw data were assembled with the miniasm assembly software to obtain 48 contigs with an N50 of about 3.4 Mb. 30X of the raw data was then aligned to these 85 contigs.

In this embodiment, the flow of the error correction method is shown in fig. 1, and the specific steps are as follows:

S1: performing third-generation sequencing to obtain third-generation sequencing data;

S21: aligning the third-generation sequencing data to the overlap assembly result;

S22: performing preliminary processing on the alignment result: rejecting alignments whose alignment length is less than 20-100 bp, alignments whose alignment length as a proportion of the read's own length is below a threshold, and alignments whose identity is lower than 85-95%, so as to filter noise and false-positive alignments out of the alignment result; then clustering the alignments of the same read, selecting the class with the longest total alignment length as the alignment from which information is extracted, merging discrete alignments, and fitting partially lost alignment information; filtering the clustering result again to further reduce false-positive alignments, rejecting alignments whose alignment length as a proportion of the read's own length is below a threshold; when the alignments of different reads overlap over a large fragment, only the best alignment is retained;

S23: scanning the alignments on each overlap assembly sequence of the overlap assembly result in turn, recording each abnormal break window, the number of abnormal breakpoints in the window, and the number of times the window is spanned by reads in the third-generation sequencing data, and filtering to obtain the information of the high abnormal break windows;

S31: constructing an SVM model using the number of abnormal breakpoints and the read-spanning count as the feature vector;
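Returning to the clustering in step S22 above, a minimal sketch of choosing the best class for one read and merging its discrete alignments is given below. The source does not state the clustering criterion, so grouping by contig and strand is an assumption, as are the tuple layout and the function name.

```python
from collections import defaultdict


def best_cluster(alns):
    """alns: (contig, strand, ctg_start, ctg_end) tuples, all from ONE read.

    Groups the alignments by (contig, strand), picks the group with the
    longest total aligned length, and merges it into a single span.
    """
    clusters = defaultdict(list)
    for contig, strand, start, end in alns:
        clusters[(contig, strand)].append((start, end))
    (contig, strand), spans = max(
        clusters.items(),
        key=lambda kv: sum(end - start for start, end in kv[1]),
    )
    merged_start = min(start for start, _ in spans)
    merged_end = max(end for _, end in spans)
    return contig, strand, merged_start, merged_end


read_alns = [
    ("ctg1", "+", 1_000, 4_000),
    ("ctg1", "+", 4_200, 9_000),  # second piece of the same placement
    ("ctg7", "-", 500, 2_500),    # shorter, competing placement
]
merged = best_cluster(read_alns)
print(merged)
```

Merging the two ctg1 pieces into one span bridges the small gap between them, which is one way to read "fitting partially lost alignment information" in S22.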

The above steps were implemented as a software program. The alignment result is input into the program, which classifies the suspected error sites according to the trained model and finally outputs an identification result file.

The identification results are shown in FIG. 2. Each point represents a detected site, and the color of the point represents the judgment obtained using the reference genome. Points with a high ratio of read-spanning count to abnormal breakpoint count are consistent with the reference genome, i.e., correct sites; points where this ratio is extremely low are inconsistent with the reference genome, i.e., wrong sites. The separating line in the middle is the classification model obtained by SVM training: sites judged correct by the model lie above the line, and sites judged wrong lie below it. As the figure shows, the software completely separates the correct and wrong sites, and its judgments are fully consistent with the reference genome detection result.

Therefore, the detection result of the software is consistent with the detection result of the reference genome.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for judging whether an overlap assembly is correct or incorrect based on long-read sequencing data, characterized in that the method is carried out by aligning the long-read sequencing data to an overlap assembly result to be processed, wherein the average read length of the long-read sequencing is not less than 2 kb, the method comprising the following steps:

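S1: obtaining long-read sequencing data;

S2: aligning the long-read sequencing data to the overlap assembly result to be processed to obtain an alignment result;

S3: judging whether the overlap assembly result is correct or incorrect according to the information in the alignment result;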

wherein S2 comprises the following steps:

S21: aligning the long-read sequencing data to the overlap assembly result;

S22: clustering the alignments of the same read in the long-read sequencing data, selecting the class with the longest total alignment length as the final alignment of the read, and merging discrete alignments;

S23: scanning the alignments on each overlap assembly sequence of the overlap assembly result in turn, and recording each abnormal break window, the number of abnormal breakpoints in the window, and the number of times the window is spanned by reads in the long-read sequencing data;

the information in the alignment result comprises the number of abnormal breakpoints in the abnormal break window and the number of times the window is spanned by reads in the long-read sequencing data.

2. The method of claim 1, wherein filtering is performed before and after the clustering in S22.

3. The method according to claim 2, wherein in S22, the filtering before clustering rejects alignments whose alignment length is less than 20-100 bp, alignments whose alignment length accounts for less than 0.01-0.1 of the read's own length, and alignments whose identity is less than 85-95%; and the filtering after clustering rejects alignments whose alignment length accounts for less than 0.3-0.6 of the read's own length.

4. The method according to claim 1, wherein in S23, the abnormal break window is a high abnormal break window, the high abnormal break window being an abnormal break window whose number of abnormal breakpoints is equal to or greater than (sequencing depth × 6)/100.

5. The method of claim 4, wherein in S23, the number of times reads span the abnormal break window is calculated as follows: the region formed by 200 bp to the left and 200 bp to the right of the midpoint of the abnormal break window is taken as the read-spanning judgment window, and the number of reads spanning this judgment window is counted.

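6. The method according to claim 1, wherein S3 specifically comprises the following steps:

S31: constructing an SVM model using the number of abnormal breakpoints and the read-spanning count as the feature vector;

S32: training the SVM model with known correct and incorrect assembly results to obtain a classifier;

S33: using the classifier to judge whether the assembly at each abnormal break window in the overlap assembly result is correct or incorrect.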

7. The method according to claim 6, wherein the known correct and incorrect assembly results in S32 are obtained by assembling the sequencing data of a known genome sequence with assembly software, aligning the assembly result to the reference genome, and labeling the correctly and incorrectly assembled positions; or the incorrect assembly results among the known correct and incorrect assembly results are obtained by artificially generating incorrect assemblies.

8. The method of any one of claims 1-7, wherein the long-read sequencing is third-generation sequencing.