CN114627967A

CN114627967A - Method for accurately annotating third-generation full-length transcripts

Info

Publication number: CN114627967A
Application number: CN202210252816.8A
Authority: CN
Inventors: 张函槊; 张成胜
Original assignee: Genex Health Co Ltd
Current assignee: Genex Health Co Ltd
Priority date: 2022-03-15
Filing date: 2022-03-15
Publication date: 2022-06-14

Abstract

The invention discloses a method for accurately annotating third-generation full-length transcripts. The method comprises the following steps: obtaining corrected transcript structure information; parsing a human reference genome annotation file and extracting the information of each specific gene-transcript as reference information; analyzing the corrected transcript structure information to obtain preliminary classification information, and converting the transcripts that are not of the normal type to obtain normal-type data; annotating the normal-type data to obtain specific gene-transcript annotations and the final transcript classification information; and collating the annotation information and the classification information to obtain accurate annotations of the third-generation full-length transcripts. Experiments show that annotation accuracy is markedly improved when the method provided by the invention is used to annotate third-generation full-length transcripts. The invention has important application value.

Description

Method for accurately annotating third-generation full-length transcripts

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a method for accurately annotating a third-generation full-length transcript.

Background

Full-length transcript sequencing is a technique that uses third-generation sequencing to obtain the full-length sequence of an mRNA. Compared with second-generation sequencing, third-generation sequencing has the advantage of long reads: a single read can span the entire length of most transcripts, which yields complete transcript sequence information and avoids the errors introduced by assembling short second-generation reads. Full-length transcript sequencing therefore has clear advantages.

After the sequencing data are obtained, transcript structure data can be derived with alignment software and structure analysis software, and the transcripts are then annotated. Because third-generation sequencing technology is relatively new, little software exists for annotating full-length transcript sequencing data, and the existing software (such as SQANT) can annotate transcripts but with low accuracy.

Disclosure of Invention

The purpose of the invention is to accurately annotate third-generation full-length transcripts.

The invention first provides a method for accurately annotating third-generation full-length transcripts, which comprises the following steps:

(1) obtaining corrected transcript structure information;

(2) parsing the human reference genome annotation file, and extracting the information of each specific gene-transcript as reference information;

(3) analyzing the corrected transcript structure information obtained in step (1) to obtain preliminary classification information; converting the transcripts that are not of the normal type; and obtaining normal-type data;

(4) annotating the normal-type data obtained in step (3) to obtain specific gene-transcript annotations and the final transcript classification information;

(5) collating the annotation information and the classification information to obtain accurate annotations of the third-generation full-length transcripts.

In the above method, in step (1), the corrected transcript structure information is obtained by correction using the structure information of the third-generation full-length transcripts and the original sequence information.

In step (1), the specific steps of obtaining the corrected transcript structure information may be as follows:

(1-1) judging the length consistency between the alignment position information in the structure information and the corresponding position in the original sequence information;

(1-2) judging the base consistency between the sequence at the alignment position in the structure information and the original sequence;

(1-3) judging the structural integrity consistency between the original sequence and the transcript;

(1-4) combining the consistency information from steps (1-1), (1-2) and (1-3) by weighting to judge the accuracy of the structure, and adjusting the abnormal regions according to the consistency data to obtain the corrected transcript structure information.
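The weighting in steps (1-1) to (1-4) can be illustrated with a minimal sketch. The source does not disclose how each consistency value is computed, what the weights are, or what threshold marks a structure as needing adjustment, so the names `Consistency`, `WEIGHTS` and `PASS_THRESHOLD` and all numeric values below are assumptions chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights and pass threshold; the source does not disclose values.
WEIGHTS = {"length": 0.3, "base": 0.4, "integrity": 0.3}
PASS_THRESHOLD = 0.9


@dataclass
class Consistency:
    length: float     # step (1-1): length consistency, 0..1
    base: float       # step (1-2): base consistency, 0..1
    integrity: float  # step (1-3): structural integrity consistency, 0..1


def length_consistency(aligned_len: int, read_len: int) -> float:
    """Ratio of the shorter to the longer length (1.0 means identical)."""
    longer = max(aligned_len, read_len)
    return 1.0 if longer == 0 else min(aligned_len, read_len) / longer


def base_consistency(aligned_seq: str, read_seq: str) -> float:
    """Fraction of positions with identical bases, compared position by position."""
    longer = max(len(aligned_seq), len(read_seq))
    if longer == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(aligned_seq.upper(), read_seq.upper()))
    return matches / longer


def structure_score(c: Consistency) -> float:
    """Step (1-4): weighted combination of the three consistency values."""
    return (WEIGHTS["length"] * c.length
            + WEIGHTS["base"] * c.base
            + WEIGHTS["integrity"] * c.integrity)


def needs_adjustment(c: Consistency) -> bool:
    """True when the weighted score falls below the (assumed) pass threshold."""
    return structure_score(c) < PASS_THRESHOLD
```

In such a scheme, a structure flagged by `needs_adjustment` would be passed to the adjustment step, while structures that pass are left unchanged.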

In the above method, in the step (2), the reference genome annotation file may be obtained from a public database.

In the above method, in the step (2), the information of the specific gene-transcript can be used as the reference information after the data format is converted.
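As an illustration of step (2), the following sketch reads exon records from an annotation file in the standard nine-column GTF format and groups them by gene and transcript. The source does not name the file format or the converted data format, so the choice of GTF, the function name `parse_gtf_exons` and the resulting dictionary layout are assumptions.

```python
import re

# Matches quoted GTF attributes such as: gene_id "ENSG..."; transcript_id "ENST...";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_gtf_exons(lines):
    """Return {(gene_id, transcript_id): {"chrom", "strand", "exons"}}.

    Exon coordinates are kept as in GTF: 1-based, inclusive (start, end).
    """
    reference = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records carry the transcript structure
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        transcript_id = attrs.get("transcript_id")
        if gene_id is None or transcript_id is None:
            continue
        entry = reference.setdefault(
            (gene_id, transcript_id),
            {"chrom": fields[0], "strand": fields[6], "exons": []},
        )
        entry["exons"].append((int(fields[3]), int(fields[4])))
    for entry in reference.values():
        entry["exons"].sort()  # order exons by genomic start
    return reference
```

A file could be passed directly, for example `parse_gtf_exons(open(path))`, where `path` points to an annotation file downloaded from a public database.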

In the above method, in the step (3), the specific steps of obtaining the normal type data may be as follows:

(3-1) calculating, from the structure information of each transcript, data such as its strand specificity, structural continuity and integrity, and obtaining a classification value by weighted calculation;

(3-2) classifying the transcripts into four classes according to the classification values: normal transcripts, fusion transcripts, structural variant transcripts and abnormal transcripts;

(3-3) for each fusion transcript from step (3-2), calculating the number of fused genes and their fusion breakpoints, cutting the transcript into several fragments at the fusion breakpoints, and repeating step (3-1) for each fragment until fragments classified as normal are obtained; for each structural variant transcript from step (3-2), calculating the structural variation region and its type, removing the structural variation region, and repeating step (3-1) with the remaining fragments treated as a whole until fragments classified as normal are obtained;

(3-4) treating each normal transcript obtained in step (3-2) as a single whole, and combining the normal transcripts with the normal fragments obtained in step (3-3) to obtain the normal-type data.
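Steps (3-1) and (3-2) can be sketched as follows. The source names the inputs (strand specificity, structural continuity, integrity) and states that a weighted classification value is computed, but it does not give the formulas, weights, or the rule that maps the value to the four classes. Everything below, including the `Block` record, the three feature definitions, `WEIGHTS`, `NORMAL_MIN`, `ABNORMAL_MAX` and the tie-breaking rule between fusion and structural variant, is therefore a hypothetical illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical weights and cut-offs; the source does not disclose them.
WEIGHTS = (0.3, 0.4, 0.3)  # strand specificity, continuity, integrity
NORMAL_MIN = 0.95
ABNORMAL_MAX = 0.5


@dataclass
class Block:
    """One aligned block of a read (assumed representation)."""
    chrom: str
    strand: str      # "+" or "-"
    read_start: int  # 0-based, half-open interval on the read
    read_end: int


def strand_specificity(blocks: List[Block]) -> float:
    """Fraction of blocks that lie on the majority strand."""
    plus = sum(b.strand == "+" for b in blocks)
    return max(plus, len(blocks) - plus) / len(blocks)


def continuity(blocks: List[Block]) -> float:
    """Fraction of adjacent block pairs on the same chromosome and strand."""
    if len(blocks) < 2:
        return 1.0
    same = sum(a.chrom == b.chrom and a.strand == b.strand
               for a, b in zip(blocks, blocks[1:]))
    return same / (len(blocks) - 1)


def integrity(blocks: List[Block], read_len: int) -> float:
    """Fraction of the read covered by aligned blocks."""
    covered = sum(b.read_end - b.read_start for b in blocks)
    return min(1.0, covered / read_len)


def classification_value(blocks: List[Block], read_len: int) -> float:
    w_s, w_c, w_i = WEIGHTS
    return (w_s * strand_specificity(blocks)
            + w_c * continuity(blocks)
            + w_i * integrity(blocks, read_len))


def classify(blocks: List[Block], read_len: int) -> str:
    """Map the weighted value to one of the four classes (assumed rule)."""
    if not blocks or read_len <= 0:
        return "abnormal"
    value = classification_value(blocks, read_len)
    if value >= NORMAL_MIN:
        return "normal"
    if value < ABNORMAL_MAX:
        return "abnormal"
    # Intermediate values: a break in continuity suggests a fusion,
    # otherwise the read is treated as carrying a structural variation.
    return "fusion" if continuity(blocks) < 1.0 else "structural_variant"
```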

In the above method, in the step (4), the specific steps of obtaining the transcript classification information may be as follows:

(4-1) importing the reference information obtained in step (2), and judging whether each fragment in the normal-type data intersects the reference information: if so, extracting the intersecting annotation content; if not, classifying the fragment as a new transcript;

(4-2) for each transcript for which intersecting annotation was extracted in step (4-1), judging whether the intersecting annotation contains a single known transcript whose structure is consistent with the structure of the transcript: if so, the transcript is a known transcript and the corresponding annotation information is retained; if not, judging whether each exon region has unique annotation information; if no exon region has unique annotation information, judging the transcript to be a new transcript, calculating the similarity coefficient between each annotated gene and the transcript, and taking the gene with the highest coefficient as the annotation of the transcript; if unique annotation information exists, collecting all uniquely annotated exons and judging their consistency; if the uniquely annotated exons are fully consistent, judging the transcript to be a new transcript; if they are not consistent, judging it to be a fusion transcript; and, at the same time, calculating the annotation information that must be output for the relevant class;

(4-3) merging the fragments of multi-fragment transcripts: if the annotations of the fragments are inconsistent, the transcript is judged to be a fusion transcript; if the final classifications of the fragments are inconsistent, the transcript is classified by priority in the order fusion transcript, new transcript, known transcript; the annotation information and the classification information are thereby obtained.

The application of any of the above methods to the accurate annotation of third-generation full-length transcripts also falls within the scope of protection of the present invention.
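The decision sequence of steps (4-1) and (4-2) can be sketched for a single fragment as below. Several details are assumptions rather than statements of the source: exons are 1-based inclusive intervals on one chromosome and strand, "structure consistent" is taken to mean an identical exon list, an exon counts as "uniquely annotated" when it overlaps exactly one gene, and the similarity coefficient is approximated as shared bases divided by fragment length. The function name `annotate_fragment` and its return shape are likewise hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]                          # 1-based, inclusive
Reference = Dict[Tuple[str, str], List[Interval]]   # (gene, transcript) -> exons


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def annotate_fragment(
    exons: List[Interval], reference: Reference
) -> Tuple[str, Tuple[str, ...], Optional[str]]:
    """Return (class, annotated genes, known transcript id or None)."""
    exons = sorted(exons)
    # (4-1) keep only reference transcripts that intersect the fragment
    hits = {key: ref for key, ref in reference.items()
            if any(overlap_len(e, r) for e in exons for r in ref)}
    if not hits:
        return "novel", (), None
    # (4-2) a known transcript with an identical structure
    for (gene, tx), ref in hits.items():
        if sorted(ref) == exons:
            return "known", (gene,), tx
    # exons that overlap exactly one gene are "uniquely annotated"
    unique_genes = set()
    for e in exons:
        genes = {g for (g, _), ref in hits.items()
                 if any(overlap_len(e, r) for r in ref)}
        if len(genes) == 1:
            unique_genes |= genes
    if not unique_genes:
        total = sum(e[1] - e[0] + 1 for e in exons)

        def similarity(gene: str) -> float:
            shared = sum(
                max((overlap_len(e, r)
                     for (g, _), ref in hits.items() if g == gene
                     for r in ref), default=0)
                for e in exons)
            return shared / total

        best = max(sorted({g for g, _ in hits}), key=similarity)
        return "novel", (best,), None
    if len(unique_genes) == 1:
        return "novel", (next(iter(unique_genes)),), None
    return "fusion", tuple(sorted(unique_genes)), None
```

For example, with a reference holding gene `G1` at exons (100, 200) and (300, 400) and gene `G2` at (1000, 1100) and (1200, 1300), a fragment with exons (300, 400) and (1000, 1100) has uniquely annotated exons from two different genes and is reported as a fusion.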

The invention mainly provides a novel method for accurately annotating full-length transcripts. The method builds on high-precision transcript structures. Although the method itself places no requirement on the quality of the transcript structure data, higher precision markedly improves annotation accuracy, and the results produced by most alignment software and structure analysis software still leave considerable room for improvement. The invention therefore also provides a method for correcting transcript structure data of ordinary quality, which can be used when needed. This correction method takes complete transcript structure information as its premise and compares the alignment position information with the original transcript sequence information. By comparing the two in terms of length consistency, base consistency, integrity consistency and so on, it confirms the accuracy of the variations recorded in the transcript structure information, such as single-base variations and insertions and deletions. The accuracy of the transcript structure information is then confirmed from the accuracy of these variations. If a difference fails the standard, it is adjusted according to the consistency check results until it either meets the standard and is retained, or is judged to be erroneous and discarded entirely.
On the premise that complete and accurate information meeting the standard is left unchanged, this correction method adjusts the portion of low-quality information that does not meet the standard, which markedly improves the low-quality information and thus the accuracy of subsequent annotation.

The novel method for accurately annotating full-length transcripts provided by the invention comprises a type determination method for the reference-free stage and a transcript annotation method for the reference-based stage.

The type determination method of the reference-free stage comprises the following steps: analyzing the structure information of the input full-length transcripts, and preliminarily dividing the structures into four types (normal, fusion, structural variation and abnormal) according to structural continuity, the strand specificity of the alignment results, and so on; analyzing the fusion breakpoints of each fusion-type transcript, and then dividing the fusion transcript into several single normal transcript fragments at the fusion breakpoints; analyzing the variation region of each structural variation transcript, judging the type of the regions that remain after the variation region is removed, retaining the normal fragments and discarding the abnormal fragments; marking abnormal-type transcripts with their class only, without annotating them; and encoding the normal transcripts together with the normal fragments of the other, processed transcript types for subsequent annotation.

The transcript annotation method of the reference-based stage comprises the following steps: parsing the annotation file to obtain the structure information annotated for each gene-transcript as reference information; comparing the encoded information with the reference information, and judging whether an annotated region corresponds to the encoded information; judging whether a known transcript in the annotated region matches the target transcript; judging whether the transcript is a fusion transcript of adjacent genes; merging the multi-fragment transcripts; and collating and outputting the finally matched annotation information and structure classification information.

In experiments, the method provided by the invention and the SQANT software were each used to annotate third-generation full-length transcript sequencing data from the melanoma cell line COLO829, and the annotation accuracy was determined. The results show an annotation accuracy of 99% for the method provided by the invention and 94% for the SQANT software. The method provided by the invention thus annotates third-generation full-length transcript sequencing data markedly more accurately. The invention has important application value.
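The fragment handling described above can be sketched with three small helpers: cutting at fusion breakpoints, removing a variation region, and re-classifying the pieces until only normal fragments remain. The source does not describe how breakpoints or variation regions are detected, so `classify` and `breakpoints_of` are passed in as hypothetical callbacks, coordinates are assumed to be half-open positions on the read, and `max_depth` is an added safety limit.

```python
from typing import Callable, Iterable, List, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) on the read


def split_at_breakpoints(start: int, end: int,
                         breakpoints: Iterable[int]) -> List[Span]:
    """Cut [start, end) into consecutive fragments at each interior breakpoint."""
    cuts = sorted({b for b in breakpoints if start < b < end})
    bounds = [start, *cuts, end]
    return list(zip(bounds, bounds[1:]))


def remove_region(start: int, end: int, region: Span) -> List[Span]:
    """Return what is left of [start, end) after deleting `region`.

    The remaining pieces are meant to be re-classified together as one unit.
    """
    r0, r1 = max(region[0], start), min(region[1], end)
    if r0 >= r1:
        return [(start, end)]  # region lies outside the span
    return [(s, e) for s, e in ((start, r0), (r1, end)) if s < e]


def normal_fragments(start: int, end: int,
                     classify: Callable[[int, int], str],
                     breakpoints_of: Callable[[int, int], Iterable[int]],
                     max_depth: int = 5) -> List[Span]:
    """Split fusion-type spans repeatedly until every kept piece is normal."""
    label = classify(start, end)
    if label == "normal":
        return [(start, end)]
    if label != "fusion" or max_depth == 0:
        return []  # abnormal pieces are discarded
    pieces = split_at_breakpoints(start, end, breakpoints_of(start, end))
    if len(pieces) == 1:
        return []  # no usable breakpoint was found
    kept: List[Span] = []
    for s, e in pieces:
        kept.extend(normal_fragments(s, e, classify, breakpoints_of, max_depth - 1))
    return kept
```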
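The final merging of multi-fragment transcripts can be sketched as follows: fragments whose gene annotations disagree make the transcript a fusion, and otherwise the final class follows the priority fusion, new, known. The per-fragment result shape `(class, genes)`, the class labels and the function name `merge_fragment_results` are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

# Priority order for the final class of a multi-fragment transcript.
PRIORITY = ("fusion", "novel", "known")

FragmentResult = Tuple[str, Tuple[str, ...]]  # (class, annotated genes)


def merge_fragment_results(
    results: Sequence[FragmentResult],
) -> Tuple[str, Tuple[str, ...]]:
    """Combine per-fragment results into one class and one gene list."""
    if not results:
        raise ValueError("at least one fragment result is required")
    # Collect genes in first-seen order for the output annotation.
    genes: List[str] = []
    for _, fragment_genes in results:
        for name in fragment_genes:
            if name not in genes:
                genes.append(name)
    # Fragments annotated to different gene sets indicate a fusion.
    gene_sets = {frozenset(g) for _, g in results if g}
    if len(gene_sets) > 1:
        return "fusion", tuple(genes)
    # Otherwise take the highest-priority class seen among the fragments.
    labels = {label for label, _ in results}
    final = next(p for p in PRIORITY if p in labels)
    return final, tuple(genes)
```

Fragments without any gene annotation are ignored when checking for disagreement, which is a design choice of this sketch rather than something the source specifies.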

Drawings

FIG. 1 is a schematic flow chart of the accurate annotation of third-generation full-length transcripts.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments, which are given for the purpose of illustration only and are not intended to limit the scope of the invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.

The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

Example 1. Establishment of a method for accurately annotating third-generation full-length transcript sequencing data

Through extensive experiments, the inventors established a method for accurately annotating third-generation full-length transcript sequencing data.

The method comprises the following specific steps:

1. Obtaining corrected transcript structure information

(1) The third-generation full-length transcript sequencing data are analyzed by bioinformatics methods to obtain structure information.

(2) The structure information and the original sequence information (namely the third-generation full-length transcript sequencing data) are input for correction to obtain corrected transcript structure information. The method comprises the following specific steps:

2. The human reference genome annotation file obtained from a public database is parsed, and the information of each specific gene-transcript is extracted and converted into a data format convenient for subsequent use, as reference information.

3. The input corrected transcript structure information is analyzed to obtain preliminary classification information; the transcripts that are not of the normal type are converted; and normal-type data are finally obtained. The specific steps are as follows:

(3-1) From the structure information of each transcript, data such as its strand specificity, structural continuity and integrity are calculated, and a classification value is obtained by weighted calculation.

(3-2) The transcripts are classified into four classes according to the classification values: normal transcripts, fusion transcripts, structural variant transcripts and abnormal transcripts. Abnormal transcripts are discarded.

(3-3) For each fusion transcript from step (3-2), the number of fused genes and their fusion breakpoints are calculated, the transcript is cut into several fragments at the fusion breakpoints, and step (3-1) is repeated for each fragment until fragments classified as normal are obtained. For each structural variant transcript from step (3-2), the structural variation region and its type are calculated, the structural variation region is removed, and step (3-1) is repeated with the remaining fragments treated as a whole until fragments classified as normal are obtained.

4. The normal-type data obtained in step 3 are annotated to obtain specific gene-transcript annotations and the final transcript classification information. The specific steps are as follows:

(4-1) The reference information obtained in step 2 is imported, and each fragment in the normal-type data is checked for intersection with the reference information: if there is an intersection, the intersecting annotation content is extracted; if not, the fragment is classified as a new transcript.

(4-2) For each transcript for which intersecting annotation was extracted in step (4-1), it is judged whether the intersecting annotation contains a single known transcript whose structure is consistent with the structure of the transcript: if so, the transcript is a known transcript and the corresponding annotation information is retained; if not, it is judged whether each exon region has unique annotation information. If no exon region has unique annotation information, the transcript is judged to be a new transcript, the similarity coefficient between each annotated gene and the transcript is calculated, and the gene with the highest coefficient is taken as the annotation of the transcript. If unique annotation information exists, all uniquely annotated exons are collected and their consistency is judged: if the uniquely annotated exons are fully consistent, the transcript is judged to be a new transcript; if they are not consistent, it is judged to be a fusion transcript. At the same time, the annotation information that must be output for the relevant class is calculated.

(4-3) The fragments of multi-fragment transcripts are merged: if the annotations of the fragments are inconsistent, the transcript is judged to be a fusion transcript; if the final classifications of the fragments are inconsistent, the transcript is classified by priority in the order fusion transcript, new transcript, known transcript. The annotation information and the classification information are then output.

5. The annotation information and the classification information are collated to obtain accurately annotated third-generation full-length transcript sequencing data.

The flow chart of the method established by the present invention for accurately annotating third-generation full-length transcripts is shown in FIG. 1.

Example 2. Comparison of the annotation method established in Example 1 with existing full-length transcript annotation software on third-generation full-length transcript sequencing data

The third-generation full-length transcript sequencing data used in this example are from the melanoma cell line COLO829.

1. The third-generation full-length transcript sequencing data were annotated with the annotation method established in Example 1, and the annotation accuracy was determined.

The results show that the annotation method established in Example 1 annotates the third-generation full-length transcript sequencing data with an accuracy of 99%.

2. The third-generation full-length transcript sequencing data were annotated with the SQANT software, and the annotation accuracy was determined.

The results show that the SQANT software annotates the third-generation full-length transcript sequencing data with an accuracy of 94%.

These results show that annotation accuracy is markedly improved when the method provided by the invention is used to annotate third-generation full-length transcript sequencing data.

The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced within a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. The use of some of the essential features is possible within the scope of the claims attached below.

Claims

1. A method for accurately annotating third-generation full-length transcripts, comprising the following steps:

(1) obtaining corrected transcript structure information;

(5) collating the annotation information and the classification information to obtain accurate annotations of the third-generation full-length transcripts.

2. The method of claim 1, wherein: in step (1), the corrected transcript structure information is obtained by correction using the structure information of the third-generation full-length transcripts and the original sequence information.

3. The method of claim 2, wherein: in the step (1), the specific steps of obtaining the corrected transcript structure information are as follows:

4. The method of claim 1, wherein: in the step (2), the human reference genome annotation file is obtained from a public database.

5. The method of claim 1, wherein: in the step (2), the information of the specific gene-transcript can be used as reference information after the data format is converted.

6. The method of claim 1, wherein: in the step (3), the specific steps for obtaining the normal data are as follows:

(3-3) for each fusion transcript from step (3-2), calculating the number of fused genes and their fusion breakpoints, cutting the transcript into several fragments at the fusion breakpoints, and repeating step (3-1) for each fragment until fragments classified as normal are obtained; for each structural variant transcript from step (3-2), calculating the structural variation region and its type, removing the structural variation region, and repeating step (3-1) with the remaining fragments treated as a whole until fragments classified as normal are obtained;

7. The method of claim 1, wherein: in the step (4), the specific steps of obtaining the transcript classification information are as follows:

(4-2) for each transcript for which intersecting annotation was extracted in step (4-1), judging whether the intersecting annotation contains a single known transcript whose structure is consistent with the structure of the transcript: if so, the transcript is a known transcript and the corresponding annotation information is retained; if not, judging whether each exon region has unique annotation information; if no exon region has unique annotation information, judging the transcript to be a new transcript, calculating the similarity coefficient between each annotated gene and the transcript, and taking the gene with the highest coefficient as the annotation of the transcript; if unique annotation information exists, collecting all uniquely annotated exons and judging their consistency; if the uniquely annotated exons are fully consistent, judging the transcript to be a new transcript; if they are not consistent, judging it to be a fusion transcript; and, at the same time, calculating the annotation information that must be output for the relevant class;

8. Use of the method of any one of claims 1 to 7 for the accurate annotation of third-generation full-length transcripts.