CN109817277B - Quality control method based on PacBio full-length transcriptome sequencing data - Google Patents
Quality control method based on PacBio full-length transcriptome sequencing data
- Publication number
- CN109817277B (application CN201811641409.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention provides a quality control method based on PacBio full-length transcriptome sequencing data, comprising the following steps: 1) obtaining high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline; 2) correcting the low-quality consensus full-length sequences based on Illumina sequencing data, and filtering out the sequences that still fail to reach the high-quality standard; 3) merging the high-quality and corrected low-quality consensus full-length sequences, and filtering them according to the following criteria: removing overlong sequences resulting from sequence chimerism; removing consensus full-length sequences that show a palindromic sequence in their self-alignment results; and removing sequences to which other consensus full-length sequences align at multiple positions. Filtering the chimeric sequences that may remain among the consensus full-length sequences by several criteria reduces the proportion of false positive results in the final transcriptome and improves the accuracy of subsequent transcriptome analyses.
Description
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a quality control method based on PacBio full-length transcriptome sequencing data, which is used for filtering a chimeric sequence in the PacBio full-length transcriptome sequencing data.
Background
The transcriptome is the link between genomic genetic information and the proteome that carries out biological functions. Regulation at the transcriptional level is currently the most important and most widely studied mode of regulation in organisms, and transcriptome research is an essential tool for understanding life processes. Transcriptome sequencing can sequence the transcriptome of a sample at any time point or under any condition, dynamically reflect gene transcription levels, identify and quantify both rare and common transcripts, and provide sequence structure information for sample-specific transcripts.
However, sequencing on second-generation high-throughput platforms often cannot accurately obtain or assemble complete transcripts, and cannot accurately identify isoforms or transcripts expressed from different alleles, which limits a deeper understanding of life activities. Full-length transcriptome sequencing based on PacBio SMRT single-molecule real-time sequencing does not require fragmenting the RNA; the platform's ultra-long reads contain the sequence of a single complete transcript, so complete transcripts can be obtained without assembly in later analysis.
The analysis pipeline for obtaining a full-length transcriptome with PacBio sequencing mainly comprises identifying full-length sequences, clustering them at the isoform level to obtain consensus sequences, and polishing the consensus sequences. During analysis, sequencing errors can prevent the adapter (linker) sequence from being correctly identified, so that subreads in the original polymerase read remain joined through the adapter sequence and form a chimeric sequence. In the full-length sequence identification step, some chimeric sequences are filtered out by checking whether a primer sequence is present in the middle of the sequence (see figure 1), but others escape filtering because the primer sequence cannot be correctly identified. In particular, when no sequenced reference genome is available for the species, possible chimeric sequences cannot be identified from alignments against a reference genome. Retaining these unrecognized chimeric sequences in the final transcriptome strongly affects the accuracy of later transcriptome analyses. To improve the accuracy of transcriptome sequencing data, chimeric sequences that the prior art cannot identify need to be further removed, but no such method has been reported so far.
Disclosure of Invention
The invention aims to provide a quality control method based on PacBio full-length transcriptome sequencing data, which is used for filtering chimeric sequences in the PacBio full-length transcriptome sequencing data so as to improve the accuracy of the transcriptome sequencing data.
In order to realize the purpose of the invention, the quality control method based on PacBio full-length transcriptome sequencing data comprises the following steps:
(1) obtaining high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
as is well known in the art, high quality is judged by the average accuracy of the sequence, with an accuracy threshold of 0.99;
(2) correcting the low-quality consensus full-length sequences based on Illumina sequencing data, and filtering out the sequences that still fail to reach the high-quality standard;
(3) merging the high-quality and corrected low-quality consensus full-length sequences, and removing overlong sequences resulting from sequence chimerism;
(4) removing the consensus full-length sequences that show a palindromic sequence in their self-alignment results;
(5) removing chimeric sequences to which other consensus full-length sequences align at multiple positions.
In the quality control method, the high-quality and low-quality consensus full-length sequences in step (1) are obtained by identifying primer sequences in the middle of the full-length sequences, preliminarily filtering out the chimeric sequences so identified as joined by primer sequences, and further processing the remainder to obtain high-quality and low-quality consensus full-length sequences after polishing and error correction. The further processing is conventional in the art and comprises: 1) clustering all full-length non-chimeric sequences by sequence similarity to obtain consensus sequences; 2) error-correcting the consensus sequences using the raw data.
In step (1), a consensus full-length sequence is judged to be high quality if its average sequence accuracy is greater than 0.99.
In step (2) of the quality control method, the low-quality consensus full-length sequences obtained in step (1) are corrected with proovread, and the consensus full-length sequences whose accuracy after correction is greater than 0.99 are retained.
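The 0.99 accuracy threshold can be illustrated with a minimal Python sketch. The patent does not state how average accuracy is computed; this sketch assumes it is one minus the mean per-base error probability derived from Phred quality scores, and the function names `mean_accuracy` and `keep_high_accuracy` are hypothetical.

```python
from typing import Dict, List


def mean_accuracy(phred_scores: List[int]) -> float:
    """Average per-base accuracy, assuming accuracy = 1 - mean(10^(-Q/10))."""
    if not phred_scores:
        return 0.0
    error_probs = [10 ** (-q / 10) for q in phred_scores]
    return 1.0 - sum(error_probs) / len(error_probs)


def keep_high_accuracy(records: Dict[str, List[int]],
                       threshold: float = 0.99) -> List[str]:
    """Return the names of sequences whose average accuracy exceeds the threshold."""
    return [name for name, quals in records.items()
            if mean_accuracy(quals) > threshold]


# Hypothetical corrected sequences: name -> per-base Phred scores.
corrected = {"seq_a": [30, 30, 30], "seq_b": [10, 10, 10]}
print(keep_high_accuracy(corrected))
```

In this toy input, `seq_a` has an average accuracy of about 0.999 and is retained, while `seq_b` (about 0.9) is filtered out.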
In step (3), the overlong sequences resulting from sequence chimerism are the sequences longer than 15000 bp in the merged set of high-quality consensus full-length sequences and corrected low-quality consensus full-length sequences that meet the accuracy condition.
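The merge-and-length filter of step (3) might be sketched as follows. This is an illustration only: the in-memory dictionaries and the name `merge_and_drop_overlong` are assumptions, and a real pipeline would read the sequences from FASTA or FASTQ files.

```python
from typing import Dict

MAX_LENGTH_BP = 15000  # length threshold stated in step (3)


def merge_and_drop_overlong(high_quality: Dict[str, str],
                            corrected_low_quality: Dict[str, str],
                            max_length: int = MAX_LENGTH_BP) -> Dict[str, str]:
    """Merge both sequence sets and drop sequences longer than max_length."""
    merged = dict(high_quality)
    merged.update(corrected_low_quality)
    return {name: seq for name, seq in merged.items() if len(seq) <= max_length}


# Hypothetical toy input: one sequence exceeds the threshold and is removed.
hq = {"t1": "A" * 100}
lq = {"t2": "C" * 15001, "t3": "G" * 15000}
print(sorted(merge_and_drop_overlong(hq, lq)))
```

A sequence of exactly 15000 bp is kept here because the text removes only sequences with a length greater than 15000 bp.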
In step (4) of the quality control method of the present invention, a palindromic sequence simultaneously satisfies the following conditions:
1) the consensus full-length sequence contains two segments that align to each other in reverse orientation;
2) the alignment length is greater than 500 bp;
3) the alignment similarity is greater than 95%.
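The three conditions above can be checked against tabular alignment output. The sketch below assumes the self-alignment was produced by BLAST+ `blastn` in its default 12-column tabular format (`-outfmt 6`), where a minus-strand hit is reported with `sstart` greater than `send`. The patent specifies only "Blast", so the tool variant, the output format, and the function name `find_palindromic_ids` are assumptions.

```python
from typing import Iterable, Set


def find_palindromic_ids(blast_lines: Iterable[str],
                         min_length: int = 500,
                         min_identity: float = 95.0) -> Set[str]:
    """Return IDs of sequences with a reverse-orientation self-hit.

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    palindromic = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        is_self_hit = qseqid == sseqid
        is_reverse = sstart > send  # minus-strand hit in blastn tabular output
        if is_self_hit and is_reverse and length > min_length and pident > min_identity:
            palindromic.add(qseqid)
    return palindromic


# Hypothetical self-alignment rows.
rows = [
    "t1\tt1\t100.0\t2000\t0\t0\t1\t2000\t1\t2000\t0.0\t3694",   # trivial self-hit
    "t1\tt1\t98.5\t800\t12\t0\t1\t800\t2000\t1201\t0.0\t1400",  # reverse, long, similar
    "t2\tt2\t100.0\t1500\t0\t0\t1\t1500\t1\t1500\t0.0\t2771",   # trivial self-hit only
    "t3\tt3\t96.0\t300\t12\t0\t1\t300\t900\t601\t1e-100\t400",  # reverse but too short
]
print(find_palindromic_ids(rows))
```

The trivial full-length hit of each sequence against itself is on the plus strand, so the reverse-orientation test excludes it without special handling.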
In step (5), Blast is used to align each consensus full-length sequence remaining after the palindromic sequences are removed in step (4) against all other sequences. When a sequence aligns to multiple positions on any one other sequence, the alignment directions at two adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the aligned-to sequence is judged to be a chimeric sequence and is filtered out.
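One possible reading of the step (5) criteria is sketched below in Python. It assumes that the sequence to be removed is the alignment target (subject), that "the sequence's own length" refers to the aligning (query) sequence, and that each hit must individually pass the length and similarity thresholds. The `Hit` record and `find_chimeric_subjects` are hypothetical names, and the patent text leaves room for other interpretations.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Hit:
    query: str
    subject: str
    identity: float     # percent identity of the alignment
    length: int         # alignment length in bp
    sstart: int         # start on the subject
    send: int           # end on the subject (smaller than sstart for reverse hits)
    query_length: int   # full length of the query sequence

    @property
    def is_reverse(self) -> bool:
        return self.sstart > self.send

    @property
    def subject_left(self) -> int:
        return min(self.sstart, self.send)


def find_chimeric_subjects(hits: Iterable[Hit],
                           min_identity: float = 95.0,
                           min_query_fraction: float = 0.95) -> Set[str]:
    """Flag subjects hit at adjacent positions in opposite directions by one query."""
    grouped = defaultdict(list)
    for hit in hits:
        if hit.query == hit.subject:
            continue  # self-alignments are handled in step (4)
        if (hit.identity > min_identity
                and hit.length > min_query_fraction * hit.query_length):
            grouped[(hit.query, hit.subject)].append(hit)

    chimeric = set()
    for (_, subject), pair_hits in grouped.items():
        pair_hits.sort(key=lambda h: h.subject_left)
        for left, right in zip(pair_hits, pair_hits[1:]):
            if left.is_reverse != right.is_reverse:
                chimeric.add(subject)
                break
    return chimeric


# Hypothetical hits: "a" covers two adjacent regions of "c" in opposite directions.
hits = [
    Hit("a", "c", 99.0, 980, 1, 980, 1000),
    Hit("a", "c", 98.0, 970, 1990, 1021, 1000),
    Hit("a", "d", 99.0, 980, 1, 980, 1000),
    Hit("a", "d", 99.0, 980, 1001, 1980, 1000),  # same direction, not flagged
]
print(find_chimeric_subjects(hits))
```

Sorting the hits by their leftmost subject coordinate is what makes "adjacent positions" well defined in this sketch.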
The invention provides use of the quality control method for further removing chimeric sequences from the consensus full-length sequences obtained with the IsoSeq pipeline when no reference genome is available.
The invention provides application of the quality control method in reducing the proportion of false positive results in transcriptome sequencing data.
The invention provides application of the quality control method in improving accuracy of transcriptome sequencing data.
The quality control method based on PacBio full-length transcriptome sequencing data provided by the invention identifies palindromic sequences and filters chimeric sequences based on sequence length and sequence alignment, and is not limited to the adapter and primer sequence information that joins the parts of a chimeric sequence. The prior art filters chimeric sequences based only on that adapter and primer sequence information, which leaves the accuracy of the transcriptome sequencing data low. The method can remove chimeric sequences that the prior art cannot identify even when the adapter and primer sequences have a high sequencing error rate, and it reduces the proportion of false positive results in the final transcriptome. As a result, low-quality consensus full-length sequences can be included in the analysis to recover more transcripts, and the accuracy of subsequent transcriptome analyses is improved.
Drawings
FIG. 1 shows the structure of an artificial chimeric sequence identified by identifying the middle primer sequence of the full-length sequence in the prior art.
Detailed Description
The following examples further illustrate the present invention but are not to be construed as limiting the invention. Modifications or substitutions to methods, procedures, or conditions of the invention may be made without departing from the spirit and scope of the invention.
Unless otherwise specified, the technical means used in the examples are conventional means well known to those skilled in the art.
Example 1
The sequencing data of this example comprised 23 G of PacBio full-length transcriptome sequencing data from one masson pine sample, and Illumina sequencing data from three biological replicates of masson pine samples, with no less than 6 G per replicate.
The data are analyzed according to the quality control method of the invention, and possible chimeric sequences are filtered to obtain the final transcriptome. The specific method comprises the following steps:
(1) obtaining high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
(2) correcting the low-quality consensus full-length sequences with proovread based on the Illumina sequencing data, and filtering out the sequences whose average accuracy after correction is below 0.99;
(3) merging the high-quality and corrected low-quality consensus full-length sequences and counting the lengths of all sequences; no sequences longer than 15000 bp were filtered out at this step;
(4) performing a Blast self-alignment on all sequences remaining after the previous step, and filtering out all sequences whose self-alignment results show a palindromic sequence (criterion: two or more segments within the sequence align to each other in reverse orientation, with an alignment length greater than 500 bp and an alignment similarity greater than 95%); 340 sequences were filtered out in total;
(5) aligning all sequences against one another with Blast; if a sequence aligns to multiple positions on another sequence, the alignment directions at adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the aligned-to sequence is removed; 610 sequences were filtered out in total.
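The Blast alignments in steps (4) and (5) could be produced with the NCBI BLAST+ command-line tools. The sketch below only builds the command lines; it assumes nucleotide `blastn` is the variant used (the example says only "Blast"), and the file names, thread count, and helper names are hypothetical.

```python
from typing import List

# Tabular output with the columns needed for direction, length, and similarity checks.
OUTFMT = "6 qseqid sseqid pident length qstart qend sstart send qlen slen"


def makeblastdb_command(fasta_path: str) -> List[str]:
    """Command that builds a nucleotide BLAST database from the FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl"]


def blastn_all_vs_all_command(fasta_path: str, out_path: str,
                              threads: int = 4) -> List[str]:
    """Command that aligns every sequence in the file against all sequences in it."""
    return ["blastn", "-query", fasta_path, "-db", fasta_path,
            "-outfmt", OUTFMT, "-num_threads", str(threads), "-out", out_path]


# Hypothetical file names for the merged consensus sequences and the result table.
print(" ".join(makeblastdb_command("merged_consensus.fasta")))
print(blastn_all_vs_all_command("merged_consensus.fasta", "all_vs_all.tsv"))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed. Rows with identical query and subject IDs serve the self-alignment of step (4), and the remaining rows serve step (5).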
The IsoSeq analysis pipeline identified 2551 chimeric sequences by detecting primer sequences in the middle of full-length sequences; this corresponds to step (1) of the method. On top of step (1), the method filtered out a further 950 possible chimeric sequences through steps (2) to (5), which account for 27.14% of all chimeric sequences (2551 + 950). These results indicate that the method of the invention can further reduce the false positive rate of sequencing data and improve sequencing accuracy.
Example 2
The sequencing data of this example comprised 21.88 G of PacBio full-length transcriptome sequencing data from one pooled lemon sample, and Illumina sequencing data from the three individual samples in the pool (three biological replicates per sample), with no less than 6 G per replicate.
The data are analyzed according to the quality control method of the invention, and possible chimeric sequences are filtered to obtain the final transcriptome. The specific method comprises the following steps:
(1) obtaining high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
(2) correcting the low-quality consensus full-length sequences with proovread based on the Illumina sequencing data, and filtering out the sequences whose average accuracy after correction is below 0.99;
(3) merging the high-quality and corrected low-quality consensus full-length sequences and counting the lengths of all sequences; no sequences longer than 15000 bp were filtered out at this step;
(4) performing a Blast self-alignment on all sequences remaining after the previous step, and filtering out all sequences whose self-alignment results show a palindromic sequence (criterion: two or more segments within the sequence align to each other in reverse orientation, with an alignment length greater than 500 bp and an alignment similarity greater than 95%); 737 sequences were filtered out in total;
(5) aligning all sequences against one another with Blast; if a sequence aligns to multiple positions on another sequence, the alignment directions at adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the aligned-to sequence is removed; 549 sequences were filtered out in total.
The IsoSeq analysis pipeline identified 3252 chimeric sequences by detecting primer sequences in the middle of full-length sequences; this corresponds to step (1) of the method in this example. On top of step (1), the method filtered out a further 1286 possible chimeric sequences through the subsequent steps, which account for 28.34% of all chimeric sequences (3252 + 1286). These results indicate that the method of the invention filters out a large number of chimeric sequences that the prior art cannot identify, further reduces the false positive rate of sequencing data, and improves sequencing accuracy.
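The proportions reported in Examples 1 and 2 can be reproduced from the counts given in the text. The following Python sketch is only an arithmetic check; the function name `extra_chimera_share` is hypothetical.

```python
from typing import Tuple


def extra_chimera_share(isoseq_chimeras: int, palindromic: int,
                        cross_aligned: int) -> Tuple[int, float]:
    """Return the additional chimeras and their percentage of all chimeras."""
    extra = palindromic + cross_aligned
    share = round(100 * extra / (isoseq_chimeras + extra), 2)
    return extra, share


# Counts taken from Example 1 (masson pine) and Example 2 (lemon).
print(extra_chimera_share(2551, 340, 610))
print(extra_chimera_share(3252, 737, 549))
```

In both examples, the additional chimeras are the sum of the sequences removed by the palindrome filter of step (4) and the cross-alignment filter of step (5).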
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.
Claims (9)
1. A quality control method based on PacBio full-length transcriptome sequencing data, comprising the following steps:
(1) obtaining high-quality and low-quality consensus full-length sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
(2) correcting the low-quality consensus full-length sequences based on Illumina sequencing data, and filtering out the sequences that still fail to reach the high-quality standard;
(3) merging the high-quality and corrected low-quality consensus full-length sequences, and removing overlong sequences resulting from sequence chimerism;
(4) removing the consensus full-length sequences that show a palindromic sequence in their self-alignment results;
(5) removing chimeric sequences to which other consensus full-length sequences align at multiple positions; characterized in that Blast is used to align each consensus full-length sequence remaining after the palindromic sequences are removed in step (4) against all other sequences, and when a sequence aligns to multiple positions on any one other sequence, the alignment directions at two adjacent positions are opposite, the alignment length is greater than 95% of the sequence's own length, and the alignment similarity is greater than 95%, the aligned-to sequence is judged to be a chimeric sequence and is filtered out.
2. The quality control method according to claim 1, wherein the high-quality and low-quality consensus full-length sequences in step (1) are obtained by identifying primer sequences in the middle of the full-length sequences, preliminarily filtering out the chimeric sequences so identified as joined by primer sequences, and further processing the remainder to obtain high-quality and low-quality consensus full-length sequences after polishing and error correction.
3. The quality control method according to claim 1, wherein in step (1), a consensus full-length sequence is judged to be high quality if its average sequence accuracy is greater than 0.99.
4. The quality control method according to claim 3, wherein step (2) corrects the low-quality consensus full-length sequences obtained in step (1) and retains the consensus full-length sequences whose accuracy after correction is greater than 0.99.
5. The quality control method according to claim 1, wherein the overlong sequences resulting from sequence chimerism in step (3) are the sequences longer than 15000 bp in the merged set of high-quality consensus full-length sequences and corrected low-quality consensus full-length sequences that meet the accuracy condition.
6. The quality control method according to any one of claims 1 to 5, wherein in step (4), the palindromic sequence simultaneously satisfies the following conditions:
1) the consensus full-length sequence contains two segments that align to each other in reverse orientation;
2) the alignment length is greater than 500 bp;
3) the alignment similarity is greater than 95%.
7. Use of the quality control method according to any one of claims 1 to 6 for further removing chimeric sequences from the consensus full-length sequences obtained with the IsoSeq pipeline in the absence of a reference genome.
8. Use of the quality control method according to any one of claims 1 to 6 for reducing the proportion of false positive results in transcriptome sequencing data.
9. Use of the quality control method according to any one of claims 1 to 6 for improving accuracy of transcriptome sequencing data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811641409.6A CN109817277B (en) | 2018-12-29 | 2018-12-29 | Quality control method based on PacBio full-length transcriptome sequencing data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109817277A | 2019-05-28 |
CN109817277B | 2022-03-18 |
Family
ID=66603337
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109817277B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||