CN109817277B - Quality control method based on PacBio full-length transcriptome sequencing data

Info

Publication number
CN109817277B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811641409.6A
Other languages
Chinese (zh)
Other versions
CN109817277A (en)
Inventor
郑洪坤
许国路
杨春鹤
张雪川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Biomarker Technologies Co ltd
Original Assignee
Beijing Biomarker Technologies Co ltd
Priority date
2018-12-29
Filing date
2018-12-29
Publication date
2022-03-18
Application filed by Beijing Biomarker Technologies Co ltd
Priority to CN201811641409.6A
Publication of CN109817277A
Application granted
Publication of CN109817277B

Abstract

The invention provides a quality control method based on PacBio full-length transcriptome sequencing data, comprising the following steps: 1) obtaining high-quality and low-quality full-length consensus sequences from raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline; 2) correcting the low-quality full-length consensus sequences with Illumina sequencing data, and filtering out sequences that still fail to reach the high-quality standard; 3) merging the high-quality and corrected low-quality full-length consensus sequences, and filtering them according to the following criteria: removing overlong sequences produced by chimerism; removing full-length consensus sequences whose self-alignment reveals a palindromic sequence; and removing sequences to which another full-length consensus sequence aligns at multiple positions. Filtering with several criteria removes chimeric sequences that may remain among the full-length consensus sequences, reduces the proportion of false positive results in the final transcriptome, and improves the accuracy of downstream transcriptome analyses.

Description

Quality control method based on PacBio full-length transcriptome sequencing data
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a quality control method based on PacBio full-length transcriptome sequencing data, used to filter chimeric sequences out of PacBio full-length transcriptome sequencing data.
Background
The transcriptome is the link that connects the genetic information of the genome with the proteome that carries out biological functions. Regulation at the transcriptional level is the most important and most widely studied mode of regulation in organisms, and transcriptome research is an essential tool for understanding life processes. Transcriptome sequencing can profile the transcriptome of a sample at any time point or under any condition, dynamically reflect gene transcription levels, identify and quantify both rare and common transcripts, and provide sequence structure information for sample-specific transcripts.
However, sequencing based on second-generation high-throughput platforms often cannot accurately obtain or assemble complete transcripts, and cannot accurately identify isoforms or allele-specifically expressed transcripts, which limits a deeper understanding of biological activity. Full-length transcriptome sequencing based on PacBio SMRT (single-molecule real-time) sequencing does not require RNA fragmentation; each of the platform's ultralong reads contains the sequence of a single complete transcript, so complete transcripts can be obtained without assembly in downstream analysis.
The analysis pipeline for obtaining a full-length transcriptome from PacBio sequencing mainly comprises full-length sequence identification, isoform-level clustering to obtain consensus sequences, and consensus sequence polishing. During analysis, sequencing errors can prevent the adapter (linker) sequence from being correctly identified, so that subreads in the original polymerase read remain joined through the adapter sequence and form a chimeric sequence. In the full-length sequence identification step, some chimeric sequences are filtered out by checking whether a primer sequence occurs in the middle of the sequence (see FIG. 1), but others escape filtering because the primer sequence cannot be correctly identified. In particular, when the species has no sequenced reference genome, possible chimeric sequences cannot be identified from alignments to a reference genome. Unrecognized chimeric sequences that remain in the final transcriptome greatly reduce the accuracy of downstream transcriptome analyses. To improve the accuracy of transcriptome sequencing data, chimeric sequences that the prior art cannot identify need to be removed, but no such method has been reported so far.
Disclosure of Invention
The invention aims to provide a quality control method based on PacBio full-length transcriptome sequencing data that filters chimeric sequences out of the data and thereby improves the accuracy of the transcriptome sequencing data.
To achieve this aim, the quality control method based on PacBio full-length transcriptome sequencing data comprises the following steps:
(1) obtaining high-quality and low-quality full-length consensus sequences from raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
as is well known in the art, high quality is judged by the average accuracy of the sequence, with an accuracy threshold of 0.99;
(2) correcting the low-quality full-length consensus sequences with Illumina sequencing data, and filtering out sequences that still fail to reach the high-quality standard;
(3) merging the high-quality and corrected low-quality full-length consensus sequences, and removing overlong sequences produced by chimerism;
(4) removing full-length consensus sequences whose self-alignment reveals a palindromic sequence;
(5) removing chimeric sequences to which another full-length consensus sequence aligns at multiple positions.
In step (1) of the quality control method, the high-quality and low-quality full-length consensus sequences are obtained by identifying primer sequences in the middle of full-length sequences, which provides a preliminary filter for chimeric sequences joined through a primer sequence, followed by further processing and polishing. The further processing is conventional in the art and comprises: 1) clustering all full-length non-chimeric sequences by sequence similarity to obtain consensus sequences; 2) error-correcting the consensus sequences using the raw data.
In step (1), a full-length consensus sequence is judged to be high quality when its average accuracy is greater than 0.99.
In step (2) of the quality control method, the low-quality full-length consensus sequences obtained in step (1) are corrected with proovread, and the full-length consensus sequences whose accuracy after correction is greater than 0.99 are retained.
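The retention rule in step (2) can be sketched as a simple threshold filter. The source does not say how the average accuracy is computed, so this sketch assumes, purely for illustration, that per-base Phred quality scores are available for each corrected sequence and that accuracy is one minus the mean per-base error probability. The names `mean_accuracy` and `retain_high_accuracy` are hypothetical.

```python
from typing import Dict, Sequence

HQ_ACCURACY = 0.99  # threshold stated in the text


def mean_accuracy(phred_scores: Sequence[int]) -> float:
    """Average accuracy from Phred scores (assumed definition, not from the source)."""
    if not phred_scores:
        raise ValueError("empty quality string")
    # Standard Phred relation: error probability p = 10^(-Q/10).
    error_probs = [10 ** (-q / 10) for q in phred_scores]
    return 1.0 - sum(error_probs) / len(error_probs)


def retain_high_accuracy(
    corrected: Dict[str, Sequence[int]], threshold: float = HQ_ACCURACY
) -> Dict[str, float]:
    """Return {sequence id: accuracy} for sequences strictly above the threshold."""
    accuracies = {name: mean_accuracy(q) for name, q in corrected.items()}
    return {name: acc for name, acc in accuracies.items() if acc > threshold}


if __name__ == "__main__":
    # Hypothetical corrected low-quality sequences with per-base Phred scores.
    example = {"lq1": [30] * 100, "lq2": [10] * 100}
    print(sorted(retain_high_accuracy(example)))
```

The comparison is strict (`>`), matching the wording "greater than 0.99".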
In step (3), the overlong sequences produced by chimerism are the sequences longer than 15000 bp in the merged set of high-quality full-length consensus sequences and corrected low-quality full-length consensus sequences that meet the standard.
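The length criterion of step (3) can be illustrated with a short sketch. It assumes the merged consensus sequences are held in FASTA format. The helper names `read_fasta` and `drop_overlong` are hypothetical, and only the 15000 bp threshold comes from the text.

```python
from typing import Dict, Iterable, Iterator, Tuple

MAX_LENGTH_BP = 15000  # threshold stated in step (3)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""  # identifier = first word of header
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_overlong(
    records: Iterable[Tuple[str, str]], max_len: int = MAX_LENGTH_BP
) -> Dict[str, str]:
    """Keep sequences whose length does not exceed max_len."""
    return {name: seq for name, seq in records if len(seq) <= max_len}


if __name__ == "__main__":
    demo = [">iso1", "ACGT" * 10, ">iso2", "A" * 15001]
    print(sorted(drop_overlong(read_fasta(demo))))
```

A sequence of exactly 15000 bp is kept, since the text removes only sequences longer than 15000 bp.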
In step (4) of the quality control method of the invention, a palindromic sequence satisfies all of the following conditions:
1) the full-length consensus sequence contains two segments that align to each other in reverse orientation;
2) the alignment length is greater than 500 bp;
3) the alignment similarity is greater than 95%.
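The three conditions above can be checked against a BLAST self-alignment. The sketch below assumes the hits were written in BLAST's default 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), where a minus-strand hit has sstart greater than send. The function names are hypothetical, and the source does not specify the BLAST settings used.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Hit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_tabular(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse 12-column BLAST tabular lines, skipping blanks and comments."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]),
                  int(f[6]), int(f[7]), int(f[8]), int(f[9]))


def palindromic_ids(
    hits: Iterable[Hit], min_length: int = 500, min_identity: float = 95.0
) -> Set[str]:
    """Ids of sequences with a reverse-orientation self-hit above both thresholds."""
    flagged: Set[str] = set()
    for h in hits:
        is_self = h.qseqid == h.sseqid
        # Query and subject coordinates run in opposite directions on a reverse hit.
        is_reverse = (h.qend - h.qstart) * (h.send - h.sstart) < 0
        if is_self and is_reverse and h.length > min_length and h.pident > min_identity:
            flagged.add(h.qseqid)
    return flagged
```

The trivial forward self-hit that every sequence produces against itself is ignored because only reverse-orientation hits are counted.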
In step (5), each full-length consensus sequence remaining after removal of the palindromic sequences in step (4) is aligned against all other sequences using BLAST. When a sequence aligns to multiple positions of any one other sequence, the alignment directions at two adjacent positions are opposite, the alignment length is greater than 95% of the aligning sequence's own length, and the alignment similarity is greater than 95%, the sequence that was aligned to is judged to be a chimeric sequence and is filtered out.
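A minimal sketch of the step (5) test follows, under three stated assumptions: the 95% length and similarity thresholds apply to each individual hit; "adjacent positions" means neighbouring qualifying hits when sorted along the aligned-to sequence; and the sequence removed is the one aligned to (the subject). The text leaves each of these points open to interpretation. The `Alignment` record and `chimeric_subjects` function are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Alignment(NamedTuple):
    query: str       # sequence being aligned
    subject: str     # sequence it aligns to
    identity: float  # percent identity of the alignment
    length: int      # alignment length in bp
    s_start: int     # alignment start on the subject
    s_end: int       # alignment end on the subject (s_start > s_end: minus strand)


def chimeric_subjects(
    alignments: Iterable[Alignment],
    query_lengths: Dict[str, int],
    min_fraction: float = 0.95,
    min_identity: float = 95.0,
) -> Set[str]:
    """Subjects hit at adjacent positions in opposite orientations by one query."""
    grouped: Dict[Tuple[str, str], List[Alignment]] = defaultdict(list)
    for aln in alignments:
        if aln.query == aln.subject:
            continue  # self-hits belong to the palindrome check, not this one
        long_enough = aln.length > min_fraction * query_lengths[aln.query]
        if long_enough and aln.identity > min_identity:
            grouped[(aln.query, aln.subject)].append(aln)

    flagged: Set[str] = set()
    for (_, subject), hits in grouped.items():
        # Order qualifying hits by their leftmost coordinate on the subject.
        hits.sort(key=lambda a: min(a.s_start, a.s_end))
        for left, right in zip(hits, hits[1:]):
            if (left.s_start < left.s_end) != (right.s_start < right.s_end):
                flagged.add(subject)
                break
    return flagged
```

The rationale for this reading is that a chimera made of a transcript joined to its own reverse complement would be hit twice, in opposite orientations, by the intact transcript.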
The invention provides use of the quality control method for further removing chimeric sequences from full-length consensus sequences obtained with the IsoSeq pipeline when no reference genome is available.
The invention provides use of the quality control method for reducing the proportion of false positive results in transcriptome sequencing data.
The invention provides use of the quality control method for improving the accuracy of transcriptome sequencing data.
The quality control method provided by the invention identifies palindromic sequences and filters chimeric sequences on the basis of sequence length and sequence alignment, rather than relying only on the adapter and primer sequences that join the parts of a chimeric sequence. (The prior art filters chimeric sequences using only this adapter and primer sequence information, which leaves the transcriptome sequencing data less accurate.) The method can therefore remove chimeric sequences that the prior art cannot identify even when the adapter and primer sequences have a high sequencing error rate, reducing the proportion of false positive results in the final transcriptome. As a result, low-quality full-length consensus sequences can be included in the analysis to obtain more transcripts, and the accuracy of downstream transcriptome analyses is improved.
Drawings
FIG. 1 shows the structure of an artificial chimeric sequence that the prior art identifies by detecting a primer sequence in the middle of the full-length sequence.
Detailed Description
The following examples further illustrate the present invention but are not to be construed as limiting the invention. Modifications or substitutions to methods, procedures, or conditions of the invention may be made without departing from the spirit and scope of the invention.
Unless otherwise specified, the technical means used in the examples are conventional means well known to those skilled in the art.
Example 1
The sequencing data of this example comprised 23 G of PacBio full-length transcriptome sequencing data from 1 masson pine sample, and Illumina sequencing data from 3 biological replicates of masson pine samples, with no less than 6 G per replicate.
The data were analyzed according to the quality control method of the invention, and possible chimeric sequences were filtered to obtain the final transcriptome. The specific steps were as follows:
(1) obtaining high-quality and low-quality full-length consensus sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
(2) correcting the low-quality full-length consensus sequences with proovread using the Illumina sequencing data, and filtering out sequences whose average accuracy after correction was below 0.99;
(3) merging the high-quality and corrected low-quality full-length consensus sequences and counting the lengths of all sequences; no sequence longer than 15000 bp was found, so none was filtered at this step;
(4) performing a BLAST self-alignment of all sequences remaining after the previous step, and filtering out every sequence whose self-alignment revealed a palindromic sequence (criterion: two or more segments of the sequence align to each other in reverse orientation with an alignment length greater than 500 bp and an alignment similarity greater than 95%); 340 sequences were filtered in total;
(5) aligning all sequences against one another with BLAST; if a sequence aligned to multiple positions of another sequence, with opposite alignment directions at adjacent positions, an alignment length greater than 95% of its own length, and an alignment similarity greater than 95%, the sequence that was aligned to was removed; 610 sequences were filtered in total.
The IsoSeq analysis pipeline identified 2551 chimeric sequences by detecting primer sequences in the middle of full-length sequences; this corresponds to step (1) of the method. On top of step (1), the method filtered a further 950 possible chimeric sequences through steps (2) to (5), which account for 27.14% of all chimeric sequences (2551 + 950). These results indicate that the method of the invention can further reduce the false positive rate of the sequencing data and improve its accuracy.
Example 2
The sequencing data of this example comprised 21.88 G of PacBio full-length transcriptome sequencing data from 1 pooled lemon sample, and Illumina sequencing data from the 3 individual samples in the pool (3 biological replicates per sample), with no less than 6 G per replicate.
The data were analyzed according to the quality control method of the invention, and possible chimeric sequences were filtered to obtain the final transcriptome. The specific steps were as follows:
(1) obtaining high-quality and low-quality full-length consensus sequences from the raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
(2) correcting the low-quality full-length consensus sequences with proovread using the Illumina sequencing data, and filtering out sequences whose average accuracy after correction was below 0.99;
(3) merging the high-quality and corrected low-quality full-length consensus sequences and counting the lengths of all sequences; no sequence longer than 15000 bp was found, so none was filtered at this step;
(4) performing a BLAST self-alignment of all sequences remaining after the previous step, and filtering out every sequence whose self-alignment revealed a palindromic sequence (criterion: two or more segments of the sequence align to each other in reverse orientation with an alignment length greater than 500 bp and an alignment similarity greater than 95%); 737 sequences were filtered in total;
(5) aligning all sequences against one another with BLAST; if a sequence aligned to multiple positions of another sequence, with opposite alignment directions at adjacent positions, an alignment length greater than 95% of its own length, and an alignment similarity greater than 95%, the sequence that was aligned to was removed; 549 sequences were filtered in total.
The IsoSeq analysis pipeline identified 3252 chimeric sequences by detecting primer sequences in the middle of full-length sequences; this corresponds to step (1) of the method in this example. On top of step (1), the method filtered a further 1286 possible chimeric sequences through the subsequent steps, which account for 28.34% of all chimeric sequences (3252 + 1286). These results indicate that the method of the invention filters a large number of chimeric sequences that the prior art cannot identify, further reduces the false positive rate of the sequencing data, and improves its accuracy.
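The percentages reported in Examples 1 and 2 follow from the counts given in the text. The short check below recomputes them; `added_fraction` is a hypothetical helper name, and every number is taken from the two examples.

```python
def added_fraction(primer_based: int, additional: int) -> float:
    """Share of all identified chimeras that were found only by the extra filters."""
    return additional / (primer_based + additional)


# Example 1 (masson pine): 340 palindromic + 610 multi-position = 950 additional.
example_1_extra = 340 + 610
# Example 2 (lemon): 737 palindromic + 549 multi-position = 1286 additional.
example_2_extra = 737 + 549

print(f"Example 1: {added_fraction(2551, example_1_extra):.2%}")  # 27.14%
print(f"Example 2: {added_fraction(3252, example_2_extra):.2%}")  # 28.34%
```

In both examples the totals equal the sum of the step (4) and step (5) counts, which is consistent with no sequences having been removed by the 15000 bp length filter.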
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (9)

1. A quality control method based on PacBio full-length transcriptome sequencing data, comprising the following steps:
(1) obtaining high-quality and low-quality full-length consensus sequences from raw PacBio full-length transcriptome sequencing data using the IsoSeq analysis pipeline;
(2) correcting the low-quality full-length consensus sequences with Illumina sequencing data, and filtering out sequences that still fail to reach the high-quality standard;
(3) merging the high-quality and corrected low-quality full-length consensus sequences, and removing overlong sequences produced by chimerism;
(4) removing full-length consensus sequences whose self-alignment reveals a palindromic sequence;
(5) removing chimeric sequences to which another full-length consensus sequence aligns at multiple positions; characterized in that each full-length consensus sequence remaining after removal of the palindromic sequences in step (4) is aligned against all other sequences using BLAST, and when a sequence aligns to multiple positions of any one other sequence, the alignment directions at two adjacent positions are opposite, the alignment length is greater than 95% of the aligning sequence's own length, and the alignment similarity is greater than 95%, the sequence that was aligned to is judged to be a chimeric sequence and is filtered out.
2. The quality control method according to claim 1, wherein the high-quality and low-quality full-length consensus sequences in step (1) are obtained by identifying primer sequences in the middle of full-length sequences to preliminarily filter out chimeric sequences joined through a primer sequence, followed by further processing, polishing and error correction.
3. The quality control method according to claim 1, wherein in step (1), a full-length consensus sequence is judged to be high quality when its average accuracy is greater than 0.99.
4. The quality control method according to claim 3, wherein step (2) corrects the low-quality full-length consensus sequences obtained in step (1) and retains the full-length consensus sequences whose accuracy after correction is greater than 0.99.
5. The quality control method according to claim 1, wherein the overlong sequences produced by chimerism in step (3) are the sequences longer than 15000 bp in the merged set of high-quality full-length consensus sequences and corrected low-quality full-length consensus sequences that meet the standard.
6. The quality control method according to any one of claims 1 to 5, wherein in step (4), a palindromic sequence satisfies all of the following conditions:
1) the full-length consensus sequence contains two segments that align to each other in reverse orientation;
2) the alignment length is greater than 500 bp;
3) the alignment similarity is greater than 95%.
7. Use of the quality control method according to any one of claims 1 to 6 for further removing chimeric sequences from full-length consensus sequences obtained with the IsoSeq pipeline in the absence of a reference genome.
8. Use of the quality control method according to any one of claims 1 to 6 for reducing the proportion of false positive results in transcriptome sequencing data.
9. Use of the quality control method according to any one of claims 1 to 6 for improving the accuracy of transcriptome sequencing data.
CN201811641409.6A 2018-12-29 2018-12-29 Quality control method based on PacBio full-length transcriptome sequencing data Active CN109817277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811641409.6A CN109817277B (en) 2018-12-29 2018-12-29 Quality control method based on PacBio full-length transcriptome sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811641409.6A CN109817277B (en) 2018-12-29 2018-12-29 Quality control method based on PacBio full-length transcriptome sequencing data

Publications (2)

Publication Number Publication Date
CN109817277A (en) 2019-05-28
CN109817277B (en) 2022-03-18

Family

ID=66603337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811641409.6A Active CN109817277B (en) 2018-12-29 2018-12-29 Quality control method based on PacBio full-length transcriptome sequencing data

Country Status (1)

Country Link
CN (1) CN109817277B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279391A (en) * 2015-09-06 2016-01-27 苏州协云和创生物科技有限公司 Metagenome 16S rRNA high-throughput sequencing data processing and analysis process control method
CN106170547A (en) * 2014-01-21 2016-11-30 高效基因设计技术研究协会 The preparation method of cells D NA compositions and the manufacture method of DNA union body
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN108103178A (en) * 2018-01-23 2018-06-01 北京优迅医学检验所有限公司 The high-throughput detection kit and detection method of neoplastic hematologic disorder fusion
CN108388771A (en) * 2018-01-24 2018-08-10 安徽微分基因科技有限公司 A kind of bio-diversity automatic analysis method
CN108486271A (en) * 2018-03-29 2018-09-04 中国科学院微生物研究所 White fungus strain method for detecting purity based on high throughput sequencing technologies and its application
CN108920901A (en) * 2018-07-24 2018-11-30 中国医学科学院北京协和医院 A kind of sequencing data mutation analysis system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding;Sten Anslan et al.;《MycoKeys》;20180911;全文 *
QC-Chain: Fast and Holistic Quality Control Method for Next-Generation Sequencing Data;Qian Zhou et al.;《PLOS ONE》;20130402;第29-40页 *
基于PacBio平台的全长转录组测序;任毅鹏 等;《科学通报》;20160303;第61卷(第11期);全文 *

Also Published As

Publication number Publication date
CN109817277A (en) 2019-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant