CN109817277B - Quality control method based on PacBio full-length transcriptome sequencing data

Info

Publication number: CN109817277B
Application number: CN201811641409.6A
Authority: CN
Inventors: 郑洪坤; 许国路; 杨春鹤; 张雪川
Original assignee: Beijing Biomarker Technologies Co ltd
Current assignee: Beijing Biomarker Technologies Co ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2022-03-18
Anticipated expiration: 2038-12-29
Also published as: CN109817277A

Abstract

The invention provides a quality control method based on PacBio full-length transcriptome sequencing data, comprising the following steps: 1) obtaining high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline; 2) correcting the low-quality consensus full-length sequences based on Illumina sequencing data, and filtering out sequences that still fail to reach the high-quality standard; 3) merging the high-quality and corrected low-quality consensus full-length sequences, and filtering according to the following criteria: removing overlong sequences resulting from sequence chimerism; removing consensus full-length sequences that show a palindromic sequence in the self-alignment result; and removing sequences to which another consensus full-length sequence aligns at multiple positions. Chimeric sequences that may remain among the consensus full-length sequences are filtered by several criteria, which reduces the proportion of false positive results in the final transcriptome and improves the accuracy of subsequent transcriptome analyses.

Description

Quality control method based on PacBio full-length transcriptome sequencing data

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a quality control method based on PacBio full-length transcriptome sequencing data, which is used for filtering a chimeric sequence in the PacBio full-length transcriptome sequencing data.

Background

The transcriptome is the link between genomic genetic information and the proteome that carries out biological functions. Regulation at the transcriptional level is the most important and, at present, the most widely studied mode of regulation in organisms, and transcriptome research is one of the essential tools for understanding life processes. Transcriptome sequencing can sequence the transcriptome of a sample at any time point or under any condition, dynamically reflect gene transcription levels, simultaneously identify and quantify rare and common transcripts, and provide sequence structure information for sample-specific transcripts.

However, sequencing based on second-generation high-throughput platforms often cannot accurately obtain or assemble complete transcripts, and cannot accurately identify isoforms or transcripts expressed from different alleles, which limits a deeper understanding of life activities. Full-length transcriptome sequencing based on PacBio SMRT single-molecule real-time sequencing does not require fragmenting the RNA; the platform's ultra-long reads contain the sequence of a single complete transcript, so complete transcripts can be obtained without assembly in later analysis.

The analysis pipeline for obtaining a full-length transcriptome with PacBio sequencing mainly comprises identifying full-length sequences, clustering at the isoform level to obtain consensus sequences, and polishing the consensus sequences. During the analysis, sequencing errors can prevent the linker sequence from being correctly identified, so that subsequences in the original polymerase sequence remain connected by the linker sequence and form a chimeric sequence. In the full-length sequence identification step, part of the chimeric sequences are filtered out by checking whether a primer sequence is present in the middle of the sequence (see FIG. 1), but some chimeric sequences are not filtered because the primer sequence cannot be correctly identified. In particular, when no reference genome has been sequenced for the species, possible chimeric sequences cannot be identified from alignments to a reference genome. Retaining these unrecognized chimeric sequences in the final transcriptome strongly affects the accuracy of later transcriptome analyses. To improve the accuracy of transcriptome sequencing data, chimeric sequences that the prior art cannot identify need to be further removed, but no such method has been reported so far.

Disclosure of Invention

The invention aims to provide a quality control method based on PacBio full-length transcriptome sequencing data, which is used for filtering chimeric sequences in the PacBio full-length transcriptome sequencing data so as to improve the accuracy of the transcriptome sequencing data.

In order to realize the purpose of the invention, the quality control method based on PacBio full-length transcriptome sequencing data comprises the following steps:

(1) obtaining high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;

as is well known in the art, high quality is judged by the average accuracy of the sequence, with an accuracy threshold of 0.99;

(2) correcting the low-quality consensus full-length sequences based on Illumina sequencing data, and filtering out sequences that still fail to reach the high-quality standard;

(3) merging the high-quality and corrected low-quality consensus full-length sequences, and removing overlong sequences generated by sequence chimerism;

(4) removing consensus full-length sequences that show a palindromic sequence in the self-alignment result;

(5) removing chimeric sequences to which another consensus full-length sequence aligns at multiple positions.

In the quality control method, the high-quality and low-quality consensus full-length sequences of step (1) are obtained as follows: primer sequences in the middle of the full-length sequences are identified, the chimeric sequences joined by primer sequences are preliminarily filtered out, and the remaining sequences are processed further to give high-quality and low-quality consensus full-length sequences after polishing and error correction. The further processing is a conventional technique in the field and comprises: 1) clustering all full-length non-chimeric sequences according to sequence similarity to obtain consensus sequences; 2) error-correcting the consensus sequences using the raw data.

In step (1), the criterion for a high-quality consensus full-length sequence is an average sequence accuracy greater than 0.99.

In step (2) of the quality control method, the low-quality consensus full-length sequences obtained in step (1) are corrected with proovread, and the consensus full-length sequences whose accuracy after correction is greater than 0.99 are retained.
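
The accuracy criterion of steps (1) and (2) can be illustrated with a minimal Python sketch. It assumes that the average accuracy of each sequence is already available as a mapping from sequence ID to a value between 0 and 1; the function names and the input format are hypothetical, and the correction itself (performed with proovread in the method) is not shown. The strict "greater than 0.99" comparison follows the wording above.

```python
from typing import Dict, Set, Tuple

ACCURACY_THRESHOLD = 0.99  # threshold stated in the description


def split_by_accuracy(
    accuracy: Dict[str, float], threshold: float = ACCURACY_THRESHOLD
) -> Tuple[Set[str], Set[str]]:
    """Split sequence IDs into (high_quality, low_quality) by average accuracy."""
    high = {seq_id for seq_id, acc in accuracy.items() if acc > threshold}
    low = set(accuracy) - high
    return high, low


def keep_after_correction(
    corrected_accuracy: Dict[str, float], threshold: float = ACCURACY_THRESHOLD
) -> Set[str]:
    """Keep corrected low-quality sequences whose accuracy now exceeds the threshold."""
    return {seq_id for seq_id, acc in corrected_accuracy.items() if acc > threshold}


# Hypothetical accuracies before and after correction.
high_ids, low_ids = split_by_accuracy({"t1": 0.998, "t2": 0.97, "t3": 0.95})
rescued_ids = keep_after_correction({"t2": 0.995, "t3": 0.98})
```

In this toy input, `t1` is high quality from the start, `t2` is rescued by correction, and `t3` is discarded.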

In step (3), the overlong sequences generated by sequence chimerism are the sequences longer than 15000 bp in the merged set of high-quality consensus full-length sequences and corrected low-quality consensus full-length sequences that meet the criterion.
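
A minimal Python sketch of the merge and length filter in step (3) follows. It assumes the sequences are held in memory as dictionaries mapping sequence ID to nucleotide string; the function name and data layout are hypothetical and only illustrate the 15000 bp cutoff.

```python
from typing import Dict

MAX_LENGTH_BP = 15000  # length cutoff stated in the description


def merge_and_filter_length(
    high_quality: Dict[str, str],
    corrected_low_quality: Dict[str, str],
    max_length: int = MAX_LENGTH_BP,
) -> Dict[str, str]:
    """Merge both sequence sets and drop sequences longer than max_length."""
    merged = {**high_quality, **corrected_low_quality}
    return {seq_id: seq for seq_id, seq in merged.items() if len(seq) <= max_length}


# Hypothetical toy input: "t3" is 15001 bp long and is removed.
kept = merge_and_filter_length(
    {"t1": "ACGT" * 100},
    {"t2": "A" * 15000, "t3": "A" * 15001},
)
```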

In step (4) of the quality control method of the present invention, the palindromic sequence simultaneously satisfies the following conditions:

1) the consensus full-length sequence contains two segments that align to each other in reverse orientation;

2) the alignment length is greater than 500 bp;

3) the alignment similarity is greater than 95%.
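
The three conditions above can be checked on Blast self-alignment results as in the following Python sketch. It assumes the alignments are given as BLAST tabular lines with the default 12 columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), and it interprets "reverse orientation" as a minus-strand hit, which blastn reports with sstart greater than send. The function name is hypothetical, and this is one possible reading of the criteria rather than the patented implementation.

```python
from typing import Iterable, Set

MIN_ALIGN_LEN = 500    # alignment length must exceed 500 bp
MIN_IDENTITY = 95.0    # alignment similarity must exceed 95%


def palindromic_ids(blast_lines: Iterable[str]) -> Set[str]:
    """Return IDs of sequences with a long, high-identity reverse self-hit."""
    flagged: Set[str] = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if qseqid != sseqid:
            continue  # only self-alignments are relevant for step (4)
        is_reverse = sstart > send  # minus-strand hit in tabular output
        if is_reverse and length > MIN_ALIGN_LEN and pident > MIN_IDENTITY:
            flagged.add(qseqid)
    return flagged


# Hypothetical self-alignment records.
example_hits = [
    "t1\tt1\t100.0\t1200\t0\t0\t1\t1200\t1\t1200\t0.0\t2000",  # trivial forward self-hit
    "t1\tt1\t97.5\t620\t10\t2\t1\t620\t1200\t585\t0.0\t900",   # reverse, long, similar
    "t2\tt2\t99.0\t300\t3\t0\t1\t300\t900\t601\t1e-90\t500",   # reverse but too short
]
palindromes = palindromic_ids(example_hits)
```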

In step (5), Blast is used to compare each consensus full-length sequence remaining after removal of the palindromic sequences in step (4) against all other sequences. When a sequence aligns to multiple positions of any other sequence, the alignment directions at two adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the sequence that was aligned to is judged to be a chimeric sequence and is filtered out.
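
The step (5) criterion can be sketched in Python as follows. The sketch assumes all-against-all Blast results in the default 12-column tabular format and a dictionary of sequence lengths. It also makes three assumptions that the text does not spell out: the query is the sequence whose own length is used for the 95% check, the subject is the sequence flagged as chimeric, and "adjacent positions" means consecutive qualifying hits ordered by subject coordinate. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def chimeric_ids(
    blast_lines: Iterable[str],
    seq_lengths: Dict[str, int],
    min_coverage: float = 0.95,
    min_identity: float = 95.0,
) -> Set[str]:
    """Flag subjects hit by one query at adjacent positions in opposite directions."""
    # (query, subject) -> list of (leftmost subject coordinate, strand)
    hits: Dict[Tuple[str, str], List[Tuple[int, str]]] = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # self-alignments belong to step (4)
        pident, length = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if pident > min_identity and length > min_coverage * seq_lengths[query]:
            strand = "+" if sstart <= send else "-"
            hits[(query, subject)].append((min(sstart, send), strand))

    flagged: Set[str] = set()
    for (_query, subject), positions in hits.items():
        positions.sort()
        for (_, left), (_, right) in zip(positions, positions[1:]):
            if left != right:  # neighbouring hits on opposite strands
                flagged.add(subject)
                break
    return flagged


# Hypothetical records: query A (1000 bp) hits C twice, once on each strand.
lengths = {"A": 1000, "C": 2100, "D": 2100}
records = [
    "A\tC\t99.0\t980\t9\t0\t1\t980\t1\t980\t0.0\t1700",
    "A\tC\t98.5\t975\t14\t0\t1\t975\t2050\t1076\t0.0\t1650",
    "A\tD\t99.0\t980\t9\t0\t1\t980\t1\t980\t0.0\t1700",
    "A\tD\t99.0\t980\t9\t0\t1\t980\t1100\t2079\t0.0\t1700",
]
chimeras = chimeric_ids(records, lengths)
```

Here `C` is flagged because the two hits of `A` lie next to each other in opposite directions, while `D` is kept because both hits are in the same direction.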

The invention provides the use of the quality control method for further removing chimeric sequences from the consensus full-length sequences obtained by the IsoSeq pipeline in the absence of a reference genome.

The invention provides application of the quality control method in reducing the proportion of false positive results in transcriptome sequencing data.

The invention provides application of the quality control method in improving accuracy of transcriptome sequencing data.

The quality control method based on PacBio full-length transcriptome sequencing data provided by the invention identifies palindromic sequences and filters chimeric sequences based on sequence length and sequence alignment, and is not limited to the linker and primer sequence information that joins a chimeric sequence. (The prior art filters chimeric sequences only on the basis of this linker and primer sequence information, so the accuracy of the transcriptome sequencing data is low.) The method can remove chimeric sequences that the prior art cannot identify when the linker and primer sequences have a high sequencing error rate, and it reduces the proportion of false positive results in the final transcriptome. As a result, low-quality consensus full-length sequences can be included in the analysis to obtain more transcripts, and the accuracy of subsequent transcriptome analyses is improved.

Drawings

FIG. 1 shows the structure of an artificial chimeric sequence that the prior art identifies by detecting a primer sequence in the middle of the full-length sequence.

Detailed Description

The following examples further illustrate the present invention but are not to be construed as limiting the invention. Modifications or substitutions to methods, procedures, or conditions of the invention may be made without departing from the spirit and scope of the invention.

Unless otherwise specified, the technical means used in the examples are conventional means well known to those skilled in the art.

Example 1

The sequencing data of this example comprised 23G of PacBio full-length transcriptome sequencing data from one masson pine sample, and Illumina sequencing data from 3 biological replicates of masson pine samples, with no less than 6G per replicate.

The data are analyzed according to the quality control method of the invention, and possible chimeric sequences are filtered to obtain the final transcriptome. The specific method comprises the following steps:

(2) correcting the low-quality consensus full-length sequences with proovread based on the Illumina sequencing data, and filtering out sequences whose average accuracy after correction is below 0.99;

(3) merging the high-quality and corrected low-quality consensus full-length sequences and counting the lengths of all sequences; no sequences were filtered at this step on the basis of the 15000 bp length threshold;

(4) performing Blast self-alignment on all sequences remaining after the previous step, and filtering out all sequences that show a palindromic sequence in the self-alignment result (criterion for a palindromic sequence: two or more segments in the sequence align to each other in reverse orientation, with an alignment length greater than 500 bp and an alignment similarity greater than 95%); 340 sequences were filtered in total;

(5) aligning all sequences against each other with Blast; if a sequence aligns to multiple positions of another sequence, the alignment directions at adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the sequence that was aligned to is removed; 610 sequences were filtered in total.

The IsoSeq analysis pipeline identified 2551 chimeric sequences by detecting primer sequences in the middle of full-length sequences; this corresponds to step (1) of the method. On the basis of step (1), the method filtered a further 950 possible chimeric sequences through steps (2) to (5), which account for 27.14% of all chimeric sequences (2551+950). These results indicate that the method of the invention can further reduce the false positive rate of sequencing data and improve sequencing accuracy.

Example 2

The sequencing data of this example comprised 21.88G of PacBio full-length transcriptome sequencing data from one pooled lemon sample, and Illumina sequencing data from the 3 individual samples in the pool (3 biological replicates per sample), with no less than 6G per replicate.

(4) performing Blast self-alignment on all sequences remaining after the previous step, and filtering out all sequences that show a palindromic sequence in the self-alignment result (criterion for a palindromic sequence: two or more segments in the sequence align to each other in reverse orientation, with an alignment length greater than 500 bp and an alignment similarity greater than 95%); 737 sequences were filtered in total;

(5) aligning all sequences against each other with Blast; if a sequence aligns to multiple positions of another sequence, the alignment directions at adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the sequence that was aligned to is removed; 549 sequences were filtered in total.

The IsoSeq analysis pipeline identified 3252 chimeric sequences by detecting primer sequences in the middle of full-length sequences; this corresponds to step (1) of the method in this example. On the basis of step (1), the method filtered a further 1286 possible chimeric sequences through the subsequent steps, which account for 28.34% of all chimeric sequences (3252+1286). These results indicate that the method of the invention filters a large number of chimeric sequences that the prior art cannot identify, can further reduce the false positive rate of sequencing data, and improves sequencing accuracy.
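
The proportions reported in Examples 1 and 2 can be reproduced from the counts given in the text with a short Python calculation. The function name is hypothetical; the counts are those stated in steps (4) and (5) and in the IsoSeq results of each example.

```python
def additional_fraction(isoseq_chimeras: int, extra_filtered: int) -> float:
    """Share of all chimeric sequences that were found only by the extra filters."""
    return extra_filtered / (isoseq_chimeras + extra_filtered)


# Example 1: 340 palindromic + 610 multi-position sequences, 2551 found by IsoSeq.
example_1 = additional_fraction(2551, 340 + 610)
# Example 2: 737 palindromic + 549 multi-position sequences, 3252 found by IsoSeq.
example_2 = additional_fraction(3252, 737 + 549)

print(f"Example 1: {example_1:.2%}")  # 27.14%
print(f"Example 2: {example_2:.2%}")  # 28.34%
```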

Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. The quality control method based on PacBio full-length transcriptome sequencing data comprises the following steps:

(5) removing chimeric sequences to which another consensus full-length sequence aligns at multiple positions; characterized in that Blast is used to compare each consensus full-length sequence remaining after removal of the palindromic sequences in step (4) against all other sequences, and when a sequence aligns to multiple positions of any other sequence, the alignment directions at two adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the sequence that was aligned to is judged to be a chimeric sequence and is filtered out.

2. The quality control method according to claim 1, wherein the high-quality and low-quality consensus full-length sequences in step (1) are obtained by identifying primer sequences in the middle of the full-length sequences, preliminarily filtering out the chimeric sequences joined by primer sequences, and further processing the remaining sequences to obtain the high-quality and low-quality consensus full-length sequences after polishing and error correction.

3. The quality control method according to claim 1, wherein in step (1), the criterion for a high-quality consensus full-length sequence is an average sequence accuracy greater than 0.99.

4. The quality control method according to claim 3, wherein the step (2) corrects the low-quality consensus full-length sequence obtained in the step (1) and retains the consensus full-length sequence with the corrected sequence accuracy of more than 0.99.

5. The quality control method according to claim 1, wherein the overlong sequences resulting from sequence chimerism in step (3) are the sequences longer than 15000 bp in the merged set of high-quality consensus full-length sequences and corrected low-quality consensus full-length sequences that meet the criterion.

6. The quality control method according to any one of claims 1 to 5, wherein in the step (4), the palindromic sequence simultaneously satisfies the following conditions:

2) the alignment length is greater than 500 bp;

3) the alignment similarity is greater than 95%.

7. Use of the quality control method according to any one of claims 1 to 6 for further removal of chimeric sequences from the consensus full-length sequences obtained by the IsoSeq pipeline in the absence of a reference genome.

8. Use of the quality control method according to any one of claims 1 to 6 for reducing the proportion of false positive results in transcriptome sequencing data.

9. Use of the quality control method according to any one of claims 1 to 6 for improving accuracy of transcriptome sequencing data.