CN107590362B - Method for judging whether overlapping assembly is correct or incorrect based on long read sequence sequencing - Google Patents

Publication number
CN107590362B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710720048.3A
Other languages
Chinese (zh)
Other versions
CN107590362A (en)
Inventor
邬三毛
肖世俊
郭文浒
陈楠生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Frasergen Co Ltd
Original Assignee
Wuhan Frasergen Co Ltd

Abstract

The invention relates to a method for judging whether an overlapping assembly is correct or incorrect based on long-read sequencing data. The method aligns the long-read sequencing data to the overlapping assembly result to be processed, wherein the average read length of the long-read sequencing is not less than 2 kb. With the method of the invention, the correctness of an overlapping assembly can be judged at the contig assembly stage, incorrect overlapping assemblies can be eliminated, and contig sequence information of higher reliability can be provided.

Description

Method for judging whether overlapping assembly is correct or incorrect based on long read sequence sequencing
Technical Field
The invention relates to the field of genome sequencing and assembly, in particular to a method for judging whether overlapping assembly is correct or incorrect based on long-read sequencing.
Background
Because the raw output of high-throughput sequencing is not a complete, continuous genome but a series of fragments with overlapping ends, specific assembly algorithms and software must be used to assemble the fragments into a relatively complete genome. However, owing to quality problems in the raw data or defects in the assembly software, some errors inevitably remain in the final assembly result. By scale, these errors fall into two types: errors at the base level and errors at the contig level. The former can be corrected by existing error correction means, but no mature correction method currently exists for the latter.
Errors that remain in the assembly result have a large impact on subsequent genomic analysis. Correcting them, however, ordinarily requires data of higher reliability than the raw sequencing data, so a method that performs contig-level error correction using the raw sequencing data itself is very necessary and significant.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for overlapping-assembly-level error correction based on long-read sequencing, characterized in that the method is performed by aligning long-read sequencing data to the overlapping assembly result to be processed, wherein the average read length of the long-read sequencing is not less than 2 kb.
In one embodiment, the method comprises the steps of:
S1: obtaining long-read sequencing data;
S2: aligning the long-read sequencing data to the overlapping assembly result to be processed to obtain an alignment result;
S3: judging whether the overlapping assembly result is correct or incorrect according to the information of the alignment result.
In one embodiment, S2 includes the steps of:
S21: aligning the long-read sequencing data to the overlapping assembly result;
S22: clustering the alignments of each read in the long-read sequencing data, selecting the cluster with the longest total alignment length as the final alignment of that read, extracting the alignment information, and merging discrete alignments;
S23: scanning the alignments on each overlapping assembly sequence (contig) of the overlapping assembly result in turn, and recording each abnormal breakpoint window, the number of abnormal breakpoints in that window, and the number of times the window is spanned by reads in the long-read sequencing data. Preferably, not every position of the whole genome is scanned here; only the abnormal breakpoint windows are scanned, an abnormal breakpoint window being a window of 50-200 bp that contains at least one abnormal breakpoint. An abnormal breakpoint is a point where an alignment starts or ends at a position that is not the start or end of the read; such breakpoints are likely to be caused by assembly errors.
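The breakpoint definition in S23 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Alignment` record, its field names, the `slack` tolerance, the fixed 100 bp binning, and the forward-strand-only handling are all assumptions made for illustration, and reads that merely overhang a contig end are not treated specially.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Alignment:              # hypothetical record, not a format named in the text
    read_id: str
    read_len: int             # full length of the read
    read_start: int           # 0-based start of the aligned part on the read
    read_end: int             # exclusive end of the aligned part on the read
    contig: str
    ref_start: int            # 0-based start on the contig
    ref_end: int              # exclusive end on the contig

def abnormal_breakpoints(aln, slack=0):
    """Return contig positions where the alignment stops inside the read."""
    points = []
    if aln.read_start > slack:                 # alignment begins after the read start
        points.append(aln.ref_start)
    if aln.read_len - aln.read_end > slack:    # alignment ends before the read end
        points.append(aln.ref_end)
    return points

def breakpoint_windows(alignments, window=100, slack=0):
    """Count abnormal breakpoints per (contig, window index)."""
    counts = Counter()
    for aln in alignments:
        for pos in abnormal_breakpoints(aln, slack):
            counts[(aln.contig, pos // window)] += 1
    return counts

alns = [
    Alignment("r1", 5000, 0, 5000, "ctg1", 1000, 6000),     # full-length alignment
    Alignment("r2", 5000, 0, 3000, "ctg1", 2040, 5040),     # stops before the read end
    Alignment("r3", 4000, 1500, 4000, "ctg1", 5010, 7510),  # starts after the read start
]
windows = breakpoint_windows(alns)
print(dict(windows))
```

Only windows that receive at least one breakpoint appear in `windows`, which mirrors the preference for scanning abnormal breakpoint windows rather than every genome position.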
In a preferred embodiment, filtering is performed both before and after the clustering in S22.
Preferably, in S22, the filtering before clustering rejects alignments whose alignment length is below a threshold of 20-100 bp, alignments whose alignment length is less than a fraction of 0.01-0.1 of the read's own length, and alignments whose identity is below a threshold of 85-95%; the filtering after clustering rejects alignments whose ratio of alignment length to the read's own length is less than 0.3-0.6. These two filtering passes reduce noise and false-positive alignments.
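The two filtering passes around the clustering step can be sketched as follows. This is a simplified illustration under stated assumptions: alignments are plain dictionaries with hypothetical keys, "clustering" is reduced to grouping a read's alignments by target contig (the text does not say how clusters are formed), and the thresholds are single values picked from inside the stated ranges.

```python
from collections import defaultdict

# Hypothetical alignment record keys: read, read_len, contig, aln_len, identity (0..1)

def pre_filter(alns, min_len=50, min_frac=0.05, min_identity=0.90):
    """First pass: drop short, low-fraction and low-identity alignments."""
    return [a for a in alns
            if a["aln_len"] >= min_len
            and a["aln_len"] / a["read_len"] >= min_frac
            and a["identity"] >= min_identity]

def best_cluster_per_read(alns, min_total_frac=0.5):
    """Group by (read, contig), keep the cluster with the longest total aligned
    length per read, then apply the second, post-clustering filter."""
    clusters = defaultdict(list)
    for a in alns:
        clusters[(a["read"], a["contig"])].append(a)
    best = {}
    for (read, contig), members in clusters.items():
        total = sum(m["aln_len"] for m in members)
        if read not in best or total > best[read][1]:
            best[read] = (contig, total, members[0]["read_len"])
    return {read: (contig, total)
            for read, (contig, total, read_len) in best.items()
            if total / read_len >= min_total_frac}

raw = [
    {"read": "r1", "read_len": 10000, "contig": "ctg1", "aln_len": 4000, "identity": 0.92},
    {"read": "r1", "read_len": 10000, "contig": "ctg1", "aln_len": 3000, "identity": 0.91},
    {"read": "r1", "read_len": 10000, "contig": "ctg2", "aln_len": 2500, "identity": 0.95},
    {"read": "r1", "read_len": 10000, "contig": "ctg2", "aln_len": 30,   "identity": 0.99},
    {"read": "r1", "read_len": 10000, "contig": "ctg3", "aln_len": 5000, "identity": 0.80},
    {"read": "r2", "read_len": 8000,  "contig": "ctg1", "aln_len": 1500, "identity": 0.93},
]
kept = pre_filter(raw)
final = best_cluster_per_read(kept)
print(final)
```

In this toy input, the 30 bp hit and the 80%-identity hit are removed before clustering, and read `r2` is removed afterwards because its best cluster covers too little of the read.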
In a preferred embodiment, in S23, the abnormal breakpoint window is a window in which the number of abnormal breakpoints is greater than a threshold, the threshold being calculated as (sequencing depth × 6)/100. Such a window is referred to as a high abnormal breakpoint window; the probability of an assembly error within it is greatly increased.
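As a worked example of the threshold, reading the formula as (sequencing depth × 6)/100 as stated in claim 4, a depth of 30X gives a threshold of 1.8 breakpoints. The sketch below uses the strict "greater than" comparison from this paragraph (claim 4 words it as "equal to or greater than"); the function names and the window-count dictionary are hypothetical.

```python
def high_window_threshold(depth):
    """Threshold on abnormal breakpoints per window: (sequencing depth * 6) / 100."""
    return depth * 6 / 100

def high_windows(window_counts, depth):
    """Keep only windows whose breakpoint count exceeds the threshold."""
    threshold = high_window_threshold(depth)
    return {w: n for w, n in window_counts.items() if n > threshold}

counts = {("ctg1", 50): 2, ("ctg1", 73): 1}   # toy breakpoint counts per window
selected = high_windows(counts, depth=30)
print(high_window_threshold(30), selected)
```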
In one embodiment, in S23, the number of times reads span an abnormal breakpoint window is calculated as follows: the region formed by 200 bp to the left and 200 bp to the right of the midpoint of the abnormal breakpoint window is taken as the read-spanning judgment window, and the reads that span this judgment window are counted.
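The spanning count can be sketched in a few lines. The tuple layout `(contig, ref_start, ref_end)`, the 100 bp window size, and the rule that a read "spans" when its alignment covers the whole 400 bp judgment region are assumptions made for this illustration.

```python
def spanning_count(alignments, contig, window_start, window_size=100, flank=200):
    """Count alignments covering the region midpoint-flank .. midpoint+flank."""
    mid = window_start + window_size // 2
    left, right = mid - flank, mid + flank
    return sum(1 for c, start, end in alignments
               if c == contig and start <= left and end >= right)

alignments = [
    ("ctg1", 1000, 6000),   # covers 4850..5250: spans
    ("ctg1", 2040, 5040),   # ends inside the region: does not span
    ("ctg1", 5010, 7510),   # starts inside the region: does not span
    ("ctg1", 4800, 5300),   # covers the region: spans
    ("ctg2", 0, 10000),     # different contig: ignored
]
n_span = spanning_count(alignments, "ctg1", window_start=5000)
print(n_span)
```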
In one embodiment, the information of the alignment result in S3 includes the number of abnormal breakpoints in each abnormal breakpoint window and the number of times the window is spanned by reads in the long-read sequencing data, and S3 specifically includes the following steps:
S31: constructing an SVM model using the two indexes, the number of abnormal breakpoints and the number of read spans, as the feature vector;
S32: training the SVM model with known correct and incorrect assembly results to obtain a classifier;
S33: using the classifier to judge whether the assembly at each abnormal breakpoint window in the overlapping assembly result is correct or incorrect.
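Steps S31 to S33 can be sketched with a small linear SVM. The text does not specify a kernel, library, or training procedure, so the following is a stand-in: a dependency-free linear SVM fitted by subgradient descent on the regularised hinge loss. The training points, the label convention (+1 correct, -1 misassembled), and the hyperparameters are invented for illustration only.

```python
def train_linear_svm(samples, labels, lr=0.01, lam=0.01, epochs=200):
    """Fit weights w and bias b by subgradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1 - lr * lam) for wi in w]           # regularisation shrink
            if margin < 1:                                   # inside margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(model, x):
    """Return +1 (judged correct) or -1 (judged misassembled)."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Feature vector per window: (abnormal breakpoint count, read span count). Toy values.
X = [(2, 30), (3, 25), (2, 28), (20, 1), (25, 0), (18, 2)]
y = [1, 1, 1, -1, -1, -1]

model = train_linear_svm(X, y)
print(predict(model, (1, 35)), predict(model, (30, 1)))
```

A window with many spanning reads and few breakpoints falls on the "correct" side of the learned line, and the reverse pattern falls on the "misassembled" side, which matches the qualitative behaviour described for the classifier.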
In one embodiment, in S32, the known correct and incorrect assembly results are obtained by assembling sequencing data of a known genome sequence with assembly software, aligning the assembly result to the reference genome, and labeling the correctly and incorrectly assembled positions. For example, the results produced by overlap assembly software or scaffolding software are aligned to the reference genome and the correct and incorrect positions are labeled; a model trained on such data can be used to identify errors in the results of overlap assembly software or scaffolding software. Alternatively, the incorrect assembly results among the known correct and incorrect assembly results are obtained by artificially generating misassemblies, and after alignment all screened sites are labeled; a model trained on such data can identify misjoins with obvious characteristics.
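One way to picture the artificially generated misassemblies is to join pieces of two unrelated contigs and remember where the junction lies, so that windows overlapping the junction can be labeled as incorrect. The sketch below is only an assumption about how this could be done; the function names, cut positions, and toy sequences are hypothetical.

```python
def make_chimera(contig_a, contig_b, cut_a, cut_b):
    """Join the prefix of contig_a to the suffix of contig_b.

    Returns the chimeric sequence and the junction position within it."""
    return contig_a[:cut_a] + contig_b[cut_b:], cut_a

def label_window(window_start, window_size, junction):
    """Label a window -1 (misassembled) if it contains the junction, else +1."""
    return -1 if window_start <= junction < window_start + window_size else 1

chimera, junction = make_chimera("AAAAACCCCC", "GGGGGTTTTT", cut_a=5, cut_b=5)
print(chimera, junction, label_window(0, 4, junction), label_window(4, 4, junction))
```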
In one embodiment, the long-read sequencing is third-generation sequencing. Third-generation sequencing can produce read lengths of around 10 kb, which are well suited to error correction of overlapping assembly results in the method of the invention.
In one embodiment, the overlapping assembly result may be one based on second-generation sequencing or third-generation sequencing.
By using the method of the invention, the correctness of the overlapping assembly can be judged in the stage of overlapping assembly of contigs, the wrong overlapping assembly can be eliminated, and the contig sequence information with higher reliability can be provided.
Drawings
FIG. 1 is a flowchart of an embodiment of determining whether an overlapping assembly is correct;
FIG. 2 shows the output of the program in the example.
Detailed Description
The principles and features of the present invention are described below using, as an example, the correction of a miniasm assembly of C. elegans. The example is given solely for illustration and is not intended to limit the scope of the invention.
Nematodes are among the most classical model organisms, and many important theoretical findings in modern molecular biology, such as apoptosis and RNA silencing, derive from studies on nematodes. The Caenorhabditis elegans genome is about 97 Mb in size, and the nuclear genome has 6 chromosomes. Testing on the nematode genome is therefore highly representative for this application.
Third-generation sequencing of the nematode genome yielded 8 GB of raw reads, and the raw data were assembled with the miniasm assembly software to obtain 48 contigs with an N50 of about 3.4 Mb. Raw data at 30X depth were then aligned to these contigs.
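For readers unfamiliar with the N50 statistic quoted above, it is the contig length at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the contig lengths are toy values, not the lengths from this embodiment.

```python
def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

toy_lengths = [80, 70, 50, 40, 30, 20]   # hypothetical contig lengths
print(n50(toy_lengths))
```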
In this embodiment, the flow of the error correction method is shown in FIG. 1, and the specific steps are as follows:
S1: performing third-generation sequencing to obtain third-generation sequencing data;
S21: aligning the third-generation sequencing data to the overlapping assembly result;
S22: performing preliminary processing on the alignment result: rejecting alignments whose alignment length is below a threshold of 20-100 bp, alignments whose alignment length as a proportion of the read's own length is below a threshold, and alignments whose identity is below a threshold of 85-95%, so as to filter noise and false-positive alignments out of the alignment result; then clustering the alignments of the same read, selecting the cluster with the longest total alignment length as the alignment from which information is extracted, merging discrete alignments, and fitting partially lost alignment information; filtering the clustering result again to further reduce false-positive alignments, rejecting alignments whose alignment length as a proportion of the read's own length is below a threshold; and, when the alignments of different sequences overlap over large segments, retaining only the best alignment;
S23: scanning the alignments on each overlapping assembly sequence (contig) of the overlapping assembly result in turn, recording each abnormal breakpoint window, the number of abnormal breakpoints in that window, and the number of times the window is spanned by reads in the third-generation sequencing data, and filtering to obtain the information of the high abnormal breakpoint windows;
S31: constructing an SVM model using the number of abnormal breakpoints and the number of read spans as the feature vector;
S32: training the SVM model with known correct and incorrect assembly results to obtain a classifier;
S33: using the classifier to judge whether the assembly at each abnormal breakpoint window in the overlapping assembly result is correct or incorrect.
The above steps were implemented as a software program. The alignment result is input to the program, which classifies the suspected error sites according to the trained model and finally outputs an identification result file.
The identification results are shown in FIG. 2. Each point represents a detected site, and the color of the point represents the judgment obtained using the reference genome. Points with a higher ratio of read span count to abnormal breakpoint count are consistent with the reference genome, i.e. they are correct sites; points where this ratio is extremely low are inconsistent with the reference genome, i.e. they are incorrect sites. The separating line in the middle is the classification model obtained by SVM training: sites the model judges correct lie above the line, and sites it judges incorrect lie below it. As can be seen from the figure, the software completely separates the correct and incorrect sites, and its judgments are fully consistent with the reference genome results.
Therefore, the detection result of the software is consistent with the detection result of the reference genome.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A method for judging whether an overlapping assembly is correct or incorrect based on long-read sequencing data, characterized in that the long-read sequencing data are aligned to an overlapping assembly result to be processed, wherein the average read length of the long-read sequencing is not less than 2 kb, the method comprising the following steps:
S1: obtaining long-read sequencing data;
S2: aligning the long-read sequencing data to the overlapping assembly result to be processed to obtain an alignment result;
S3: judging whether the overlapping assembly result is correct or incorrect according to the information of the alignment result;
wherein S2 includes the following steps:
S21: aligning the long-read sequencing data to the overlapping assembly result;
S22: clustering the alignments of each read in the long-read sequencing data, selecting the cluster with the longest total alignment length as the final alignment of that read, and merging discrete alignments;
S23: scanning the alignments on each overlapping assembly sequence of the overlapping assembly result in turn, and recording each abnormal breakpoint window, the number of abnormal breakpoints in that window, and the number of times the window is spanned by reads in the long-read sequencing data;
and wherein the alignment result comprises the number of abnormal breakpoints in the abnormal breakpoint window and the number of times the window is spanned by reads in the long-read sequencing data.
2. The method of claim 1, wherein filtering is performed before and after the clustering in S22.
3. The method according to claim 2, wherein in S22 the filtering before clustering rejects alignments whose alignment length is less than 20-100 bp, alignments whose alignment length is less than 0.01-0.1 of the read's own length, and alignments whose identity is less than 85-95%; and the filtering after clustering rejects alignments whose ratio of alignment length to the read's own length is less than 0.3-0.6.
4. The method according to claim 1, wherein in S23 the abnormal breakpoint window is a high abnormal breakpoint window, a high abnormal breakpoint window being an abnormal breakpoint window in which the number of abnormal breakpoints is equal to or greater than (sequencing depth × 6)/100.
5. The method of claim 4, wherein in S23 the number of times reads span the abnormal breakpoint window is calculated by taking the region formed by 200 bp to the left and 200 bp to the right of the midpoint of the abnormal breakpoint window as the read-spanning judgment window and counting the reads that span the judgment window.
6. The method according to claim 1, wherein S3 specifically comprises the steps of:
S31: constructing an SVM model using the number of abnormal breakpoints and the number of read spans as the feature vector;
S32: training the SVM model with known correct and incorrect assembly results to obtain a classifier;
S33: using the classifier to judge whether the assembly at each abnormal breakpoint window in the overlapping assembly result is correct or incorrect.
7. The method according to claim 6, wherein the known correct and incorrect assembly results in S32 are obtained by assembling sequencing data of a known genome sequence with assembly software, aligning the assembly result to the reference genome, and labeling the correctly and incorrectly assembled positions; or the incorrect assembly results among the known correct and incorrect assembly results are obtained by artificially generating misassemblies.
8. The method of any one of claims 1-7, wherein the long-read sequencing is third-generation sequencing.
Priority Applications (1)

Application Number: CN201710720048.3A
Publication: CN107590362B (en)
Status: Active
Priority Date: 2017-08-21
Filing Date: 2017-08-21
Title: Method for judging whether overlapping assembly is correct or incorrect based on long read sequence sequencing

Publications (2)

Publication Number    Publication Date
CN107590362A          2018-01-16
CN107590362B          2019-12-06

Family ID: 61041668

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A method for judging the correctness and error of overlapping assembly based on long reading sequence sequencing

Effective date of registration: 20210918

Granted publication date: 20191206

Pledgee: Wuhan area branch of Hubei pilot free trade zone of Bank of China Ltd.

Pledgor: WUHAN FRASERGEN INFORMATION Co.,Ltd.

Registration number: Y2021420000096

PE01 Entry into force of the registration of the contract for pledge of patent right
PC01 Cancellation of the registration of the contract for pledge of patent right

Granted publication date: 20191206

Pledgee: Wuhan area branch of Hubei pilot free trade zone of Bank of China Ltd.

Pledgor: WUHAN FRASERGEN INFORMATION CO.,LTD.

Registration number: Y2021420000096

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A method for determining the correctness of overlapping assembly based on long read sequencing

Granted publication date: 20191206

Pledgee: Guanggu Branch of Wuhan Rural Commercial Bank Co.,Ltd.

Pledgor: WUHAN FRASERGEN INFORMATION CO.,LTD.

Registration number: Y2024980021037