CN114708912A

CN114708912A - Recognition algorithm for plant mitochondrial genome coding circular RNA

Info

Publication number: CN114708912A
Application number: CN202210276265.9A
Authority: CN
Inventors: 张亚锋; 廖珣; 常冯瑞
Original assignee: South China Agricultural University
Current assignee: South China Agricultural University
Priority date: 2022-03-21
Filing date: 2022-03-21
Publication date: 2022-07-05

Abstract

The invention discloses an algorithm for identifying circular RNA (circRNA) encoded by plant mitochondrial genomes. The identification method, MeCi, is designed according to the characteristics of plant mitochondrial genomes and comprises: aligning RNA-seq data to a reference genome using a blast tool to obtain an alignment result file; screening the data in the alignment result file by sequence features to obtain candidate circRNA; and conditionally screening the candidate circRNA to obtain plant mitochondrial genome-encoded circRNA. Analysis of maize and Arabidopsis transcriptome data with MeCi yields newly predicted mitochondrial genome-encoded circRNA that the prior art could not detect. Experimental evidence shows that the newly predicted mitochondrial genome-encoded circRNA is highly reliable, and that MeCi outperforms existing methods and can effectively identify plant mitochondrial genome-encoded circRNA.

Description

Recognition algorithm for plant mitochondrial genome coding circular RNA

Technical Field

The invention relates to the technical field of bioinformatics, in particular to an algorithm for identifying circular RNA encoded by plant mitochondrial genomes.

Background

Ribonucleic acid (RNA) is a ribonucleotide polymer transcribed from deoxyribonucleic acid (DNA), and includes messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), small nuclear RNA (snRNA), and the like. RNA exists both as linear molecules and as molecules with a circular structure whose ends are joined by a covalent bond, i.e., circular RNA (circRNA). The closed ring structure of the circRNA molecule makes it more stable than linear RNA and resistant to degradation by exonucleases. Transcriptome sequencing (RNA-seq) is the basis for high-throughput identification of circRNA. RNA-seq can be divided into two types: single-end sequencing, which reads the RNA sequence from one end, and paired-end sequencing, which reads sequences from both ends of the RNA and subsequently merges the two reads through the sequence in which they overlap. Paired-end sequencing can therefore yield longer reads.

Identifying circRNA from high-throughput sequencing results is a critical step in studying its function and requires suitable algorithms. Previous studies have shown that intron reverse splicing is the major pathway for the production of circRNA. The circRNA recognition algorithms now in common use include CIRI2^[1], CIRI^[2], find_circ^[3], CIRCexplorer^[4], UROBORUS^[5], and the like. These algorithms first identify reverse splicing products by different principles, and then filter them according to the conserved sequences of nuclear intron splice sites, gene annotation information, and the like, so as to obtain high-confidence candidate circRNA.

The plant mitochondrial genome differs from the nuclear genome mainly in the following respects: (1) plant mitochondrial introns are mostly self-splicing group II introns, and no report shows that these introns have a reverse splicing pathway; (2) the plant mitochondrial genome is small relative to the nuclear genome (e.g., maize B73 has a mitochondrial genome of about 570 kb and a nuclear genome of about 2.18 Gb); (3) the coding regions of the plant mitochondrial genome contain a large number of RNA editing sites, which are absent from the nuclear genome. Because of these differences, circRNA encoded by the plant mitochondrial genome has different characteristics from circRNA encoded by the nuclear genome. Most algorithms in common use today are designed for nuclear genome-encoded circRNA and cannot effectively recognize plant mitochondrial genome-encoded circRNA.

Take CIRI2 as an example. CIRI2 is one of the most widely used circRNA recognition algorithms at present. It uses maximum likelihood estimation based on multiple seed matching to identify reads containing reverse splicing junctions (back-splice junctions, BSJ), and reduces false positives caused by repeated sequences and base mismatches by calculating the false discovery rate (FDR). CIRI2 has several problems in identifying circRNA encoded by the plant mitochondrial genome: (1) CIRI2 screens circRNA by recognizing intron reverse splicing products, whereas group II introns of the plant mitochondrial genome show no reverse splicing; (2) the plant mitochondrial genome has a large number of RNA editing sites, so that when CIRI2 predicts plant mitochondrial genome-encoded circRNA the FDR is too high, and a large number of circRNAs that actually exist are filtered out; (3) the plant mitochondrial genome contains a large number of repetitive sequences, and CIRI2 filters out circRNAs derived from these repetitive regions.

References:

[1] Gao Y, Zhang J, Zhao F. Circular RNA identification based on multiple seed matching[J]. Briefings in Bioinformatics. 2018, 19(5): 803-810.

[2] Gao Y, Wang J, Zhao F. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification[J]. Genome Biology. 2015, 16(1).

[3] Memczak S, Jens M, Elefsinioti A, et al. Circular RNAs are a large class of animal RNAs with regulatory potency[J]. Nature. 2013, 495(7441): 333-338.

[4] Ma X, Xue W, Chen L, et al. CIRCexplorer pipelines for circRNA annotation and quantification from non-polyadenylated RNA-seq datasets[J]. Methods. 2021, 196: 3-10.

[5] Song X, Zhang N, Han P, et al. Circular RNA profile in gliomas revealed by identification tool UROBORUS[J]. Nucleic Acids Research. 2016, 44(9): e87.

Disclosure of Invention

The technical problem to be solved by the invention is how to identify circular RNA (circRNA) encoded by plant mitochondrial genomes.

To solve the above technical problem, the invention first provides a method for identifying circRNA encoded by a plant mitochondrial genome. The method comprises the following steps: aligning RNA-seq data to a reference genome using a blast tool to obtain an alignment result file; screening the data in the alignment result file by sequence features to obtain a candidate circRNA file; and conditionally screening the candidate circRNA file to obtain plant mitochondrial genome-encoded circRNA.

In the method, the sequence feature screening may consist of removing from the alignment result file the reads for which only one sequence fragment aligns to the reference genome, and retaining only the reads for which exactly two sequence fragments align to the reference genome in opposite directions.

In the method described above, the sequence feature screening may comprise the following steps:

A1) Removing from the alignment result file the reads for which only one sequence fragment aligns to the reference genome, and retaining the alignment results of reads for which two or more sequence fragments can be aligned to the reference genome in opposite directions.

A2) Extracting from the alignment result file the position information of the two or more sequence fragments and the strand (plus or minus) information of each sequence fragment on the reference genome.

A3) Processing the strand information, retaining only the reads for which exactly two sequence fragments align to the reference genome in opposite directions, and sorting and integrating the sequence fragments according to the position information to obtain the candidate circRNA.

In the method, the position information may be the ID of the read to which the sequence fragment belongs. The position information may also include the position of the sequence fragment on the read and the position at which the sequence fragment aligns to the reference genome sequence. The direction may be the 5' to 3' direction of the nucleotide sequence.

In the method described above, the number of nucleotides of the sequence fragment is greater than or equal to 11.

In the method, A3) may comprise the following steps: writing each sequence fragment, according to its strand information, into a plus-strand file or a minus-strand file; integrating the sequence fragments in the plus-strand file to obtain a plus-strand integrated file; and exchanging the start and end position information of the sequence fragments in the minus-strand file and then integrating them to obtain a minus-strand integrated file.
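The strand handling in A3) can be illustrated with a minimal Python sketch. It assumes that each aligned fragment is held as a record with start and end coordinates on the read and on the reference genome, and that a minus-strand fragment is one whose reference start is larger than its reference end (the convention of BLAST tabular output). The `Fragment` type, its field names and the function name are hypothetical and are not taken from the invention.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    """One aligned fragment of a read (hypothetical record layout)."""
    read_id: str
    q_start: int  # start of the fragment on the read
    q_end: int    # end of the fragment on the read
    s_start: int  # start of the fragment on the reference genome
    s_end: int    # end of the fragment on the reference genome


def split_by_strand(fragments: List[Fragment]) -> Tuple[List[Fragment], List[Fragment]]:
    """Separate plus- and minus-strand fragments.

    Assumption: s_start > s_end marks a minus-strand alignment. For those
    fragments the start and end positions are exchanged, so that both
    returned lists use s_start <= s_end.
    """
    plus: List[Fragment] = []
    minus: List[Fragment] = []
    for frag in fragments:
        if frag.s_start <= frag.s_end:
            plus.append(frag)
        else:
            minus.append(frag._replace(s_start=frag.s_end, s_end=frag.s_start))
    return plus, minus
```

After this normalisation, fragments from both files can be sorted and integrated with the same coordinate logic.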

In the method, the "E" parameter of the blast tool may be set in the range of 10^-5 to 5.

The setting of the "E" parameter may be specifically 2.

In the method described above, the screening conditions may be as follows:

B1) The number of overlapping nucleotides and the number of vacant nucleotides at the circularization site are each 3 or less.

B2) The length of the candidate circRNA is 10000 nucleotides or less, and is greater than or equal to the length of the corresponding read.

The circularization site can be the junction point, on the read, of the two sequence fragments that align to the reference genome in opposite directions. The overlapping nucleotides can be the nucleotides at which the two sequence fragments aligned in opposite directions overlap on the reference genome. The vacant nucleotides can be nucleotides that lie at the junction of the two sequence fragments aligned in opposite directions and that cannot be aligned to the reference genome.

In the method described above, the length of the candidate circRNA can be the number of nucleotides between the start nucleotide and the stop nucleotide, on the reference genome, that correspond to the circularization site. The start nucleotide may correspond to the 5' terminal nucleotide aligned to the reference genome. The stop nucleotide can correspond to the 3' terminal nucleotide aligned to the reference genome.
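Conditions B1) and B2) can be written as a simple predicate. The following Python sketch is illustrative only. It assumes that the limit of 3 applies separately to the overlapping and to the vacant nucleotides, and that the circRNA length counts both the start and the stop nucleotide (an inclusive span). The function and parameter names are hypothetical.

```python
def circ_length(start: int, stop: int) -> int:
    """Length of a candidate circRNA on the reference genome.

    Assumption: both the start and the stop nucleotide are counted.
    """
    return stop - start + 1


def passes_condition_screen(overlap: int, vacant: int, circ_len: int, read_len: int,
                            max_junction_nt: int = 3, max_circ_len: int = 10000) -> bool:
    """Apply screening conditions B1) and B2) to one candidate circRNA."""
    # B1) at most 3 overlapping and at most 3 vacant nucleotides at the site
    junction_ok = overlap <= max_junction_nt and vacant <= max_junction_nt
    # B2) read length <= circRNA length <= 10000 nucleotides
    length_ok = read_len <= circ_len <= max_circ_len
    return junction_ok and length_ok
```

For example, a candidate spanning reference positions 101 to 600 with 2 overlapping nucleotides and a 150-nt read satisfies both conditions, whereas the same candidate with 4 overlapping nucleotides does not.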

In the method described above, the RNA-seq data may be a fasta-format file. The fasta-format file may be a split RNA-seq data file.

In the method described above, the sequence of the reference genome may be a formatted reference genome sequence. The formatting tool may be formatdb.

To solve the above technical problem, the invention also provides an apparatus for identifying circular RNA (circRNA) encoded by a plant mitochondrial genome. The apparatus may include the following modules:

C1) A sequence alignment module: used for aligning plant mitochondrial transcriptome sequencing (RNA-seq) data to a reference genome using a blast tool to obtain an alignment result file.

C2) A sequence feature screening module: used for screening the data in the alignment result file by sequence features to obtain a candidate circRNA file.

C3) A condition screening module: used for conditionally screening the candidate circRNA file to obtain the final circRNA.

The sequence feature screening module can be established by a method comprising the following step: removing from the alignment result file the reads for which only one sequence fragment aligns to the reference genome, and retaining only the reads for which exactly two sequence fragments can be aligned to the reference genome in opposite directions.

The "E" parameter of the blast tool can be set in the range of 10^-5 to 5.

The setting of the "E" parameter may be specifically 2.

In the above apparatus, the condition screening module may be established by a method comprising the steps of:

B2) The length of the candidate circRNA is 10000 nucleotides or less, and is greater than or equal to the length of the corresponding read.

The circularization site can be the junction point, on the read, of the two sequence fragments that align to the reference genome in opposite directions. The overlapping nucleotides can be the nucleotides at which the two sequence fragments aligned in opposite directions overlap on the reference genome. The vacant nucleotides can be nucleotides that lie at the junction of the two sequence fragments aligned in opposite directions and that cannot be aligned to the reference genome.

In the apparatus described above, the length of the candidate circRNA may be the number of nucleotides between the start nucleotide and the stop nucleotide, on the reference genome, that correspond to the circularization site. The start nucleotide may correspond to the 5' terminal nucleotide aligned to the reference genome. The stop nucleotide can correspond to the 3' terminal nucleotide aligned to the reference genome.

In the apparatus described above, the RNA-seq data may be a fasta-format file. The fasta-format file may be a split RNA-seq data file.

In the apparatus described above, the sequence of the reference genome may be a formatted reference genome sequence. The formatting tool may be formatdb.

In order to solve the above technical problem, the present invention also provides a computer-readable storage medium storing a computer program. The computer program causes a computer to establish the steps of the method or the computer program causes a computer to establish the modules of the apparatus as described above.

In order to solve the above technical problem, the present invention also provides a computer-readable storage medium storing a computer program. The computer program causes a computer to perform the steps of the method or the computer program causes a computer to perform the steps of the module of the apparatus as described above.

In view of the shortcomings of existing methods, the invention aims to develop, based on the characteristics of the plant mitochondrial genome, a recognition algorithm suited to plant mitochondrial genome-encoded circRNA, named the mitochondrion-encoded circRNA identifier (MeCi). The algorithm identifies highly reliable mitochondrial genome-encoded circRNA from plant RNA-seq data.

CIRI2 identifies circRNA by recognizing intron reverse splicing products, whereas most introns of the plant mitochondrial genome are group II introns, for which no reverse splicing pathway has been reported. MeCi instead detects circularization sites (circularization junctions, i.e., the junction on the read of two reverse-aligned sequences; FIG. 1) in RNA-seq reads on the basis of blast alignment, so its recognition of circRNA is not limited to intron reverse splicing. Plant mitochondrial genomes have a large number of RNA editing sites; if CIRI2 is used to predict mitochondrial genome-encoded circRNA, the FDR is too high and a large number of circRNAs are filtered out. By setting a larger blastn E value (default 2; adjustable range 10^-5 to 5), MeCi retains not only short alignment fragments (at least 11 nt) but also reads containing mismatches. Plant mitochondrial genomes contain a certain number of repeats; MeCi allows a mitochondrial genome-encoded circRNA to align to different locations in the genome, whereas CIRI2 discards such circRNAs. CIRI2 does not allow overlapping or missing nucleotides at the circularization site (the junction of two oppositely aligned sequences on the read), but many plant mitochondrial genome-encoded circRNAs contain overlapping or missing nucleotides at their junctions. MeCi allows overlapping and missing nucleotides at the circularization site and limits their number to no more than 3.

The innovations of MeCi, the established method for identifying plant mitochondrial genome-encoded circRNA, are as follows:

MeCi performs alignment with blastn and uses the E value to control mismatches and the length of alignment fragments.

MeCi screens circRNA under conditions set according to the characteristics of the plant mitochondrial genome, independently of the rules of intron reverse splicing. The specific conditions are as follows:

a. when a read is aligned to the reference genome, the read must contain exactly two reverse-aligned fragments;

b. the junction of the two oppositely aligned fragments on the read has no more than 3 overlapping or missing nucleotides;

c. the E value defaults to 2, with an adjustable range of 10^-5 to 5, which allows multiple mismatched nucleotides to occur;

d. each of the two reverse-aligned fragments on the read is no shorter than 11 nucleotides;

e. the number of nucleotides on the reference genome between the start and stop nucleotides corresponding to the circularization site of the two oppositely aligned fragments is greater than the number of nucleotides in the read.

Compared with the prior art, the invention has the beneficial effects that:

Existing circRNA recognition algorithms mainly identify reverse splicing products derived from nuclear introns, whereas plant mitochondrial introns mostly belong to group II introns, for which no reverse splicing pathway is known. MeCi is designed around the characteristics of the plant mitochondrial genome. MeCi first aligns RNA-seq data to a reference genome with blast and selects the reads that contain exactly two reverse-aligned fragments; it then obtains high-confidence reads through a series of screening conditions. The specific screening conditions are as follows: multiple mismatched bases are allowed in the aligned fragments; no more than 3 overlapping or vacant bases are allowed at the circularization site (the junction of two oppositely aligned sequences on the read); a circRNA is allowed to align to multiple locations in the genome; and the length of the circRNA (the number of nucleotides between the first and last nucleotides of the circularization site on the reference genome) must be greater than the read length (the number of nucleotides in the read).

In the examples of the invention, three circRNA recognition algorithms, MeCi, find_circ and CIRI2, were each used to identify mitochondrial genome-encoded circRNA in maize and Arabidopsis RNA-seq data downloaded from NCBI. In the 3 maize replicates, MeCi identified 32686, 23560 and 2638 mitochondrial genome-encoded circRNAs, find_circ identified 1782, 1197 and 1256, and CIRI2 identified 263, 219 and 184, respectively. In the 2 Arabidopsis replicates, MeCi identified 49772 and 60289 mitochondrial genome-encoded circRNAs, find_circ identified 2333 and 2782, and CIRI2 identified 453 and 593, respectively. RT-PCR amplification with divergent primers showed that most of the maize and Arabidopsis mitochondrial genome-encoded circRNAs newly detected by MeCi are correct, which demonstrates the high reliability of MeCi in predicting plant mitochondrial genome-encoded circRNA. The MeCi algorithm is superior to existing circRNA identification methods in its ability to identify plant mitochondrial genome-encoded circRNA, and can identify it effectively.

Drawings

FIG. 1 shows the determination and classification of reads containing circularization sites. If a read comprises two fragments that align to the reference genome in opposite directions, the read is determined to be a candidate circRNA, and the junction point of the two fragments on the read is the circularization site. circRNAs can be classified into three types according to differences at the circularization site. Type I: the circularization site contains nucleotides that align repeatedly to the reference genome (i.e., overlapping nucleotides); Type II: the circularization site contains nucleotides that cannot be aligned to the reference genome (i.e., vacant nucleotides); Type III: there are no overlapping or vacant nucleotides at the circularization site.

FIG. 2 is a flow chart of parameter import and verification.

FIG. 3 is a flow chart of the sequence pre-processing.

FIG. 4 is a flow chart of splitting a sequence file.

FIG. 5 is a flow chart of sequence alignment.

FIG. 6 is a flow chart of Blast comparison data parsing.

FIG. 7 is a flowchart of circular RNA screening.

FIG. 8 is a complete flow chart of plant mitochondrial circular RNA recognition.

FIG. 9 is a comparison of the numbers of highly reliable mitochondrial genome-encoded circRNAs predicted from RNA-seq data by MeCi, CIRI2 and find_circ, jointly or individually.

FIG. 10 shows RT-PCR validation of maize mitochondrial genome-encoded circRNA newly predicted by MeCi. circRNAs from maize Zmrrn26, Zmcob, Zmnad2T2 and zmtny were verified by PCR. The gene structure is shown in the figure; the bold line indicates the intron (int), and the 5' and 3' end positions of the mature linear RNA are indicated by arrows. Black and blue lines represent the positions of the circRNAs detected at each site by MeCi and by RT-PCR, respectively. The positions of the convergent and divergent primers are shown. Con: convergent primer; Di: divergent primer; gD: gDNA; - and +: RNase- and RNase+ cDNA. The arrow points to the target fragment amplified by the convergent primers; the bands amplified by the divergent primers, indicated by the fold lines, were each recovered, cloned, transformed and sequenced. M: DNA marker.

FIG. 11 shows RT-PCR validation of Arabidopsis mitochondrial genome-encoded circRNA newly predicted by MeCi. circRNAs derived from the Arabidopsis Atatp1, Atcox3, Atnad4L, AttrnY and AttrnD sites were verified by PCR. The gene structure is shown, and the 5' and 3' end positions of the mature linear RNA are indicated by arrows. Black and light gray lines represent the positions of the circRNAs detected at each site by MeCi and by RT-PCR, respectively. The positions of the convergent and divergent primers are shown. Con: convergent primer; Di: divergent primer; gD: gDNA; - and +: RNase- and RNase+ cDNA. The arrow points to the target fragment amplified by the convergent primers; the bands amplified by the divergent primers, indicated by the fold lines, were each recovered, cloned, transformed and sequenced. M: DNA marker.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments, and the examples are given only for illustrating the present invention and not for limiting the scope of the present invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.

The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.

Example I. Establishment of MeCi, a recognition algorithm for plant mitochondrial genome-encoded circRNA

1. Data input and data pre-processing

1.1 Data input

When the MeCi algorithm starts, its parameters must be passed in and checked. First, it checks whether the plant mitochondrial RNA-seq sequence file and the reference genome sequence file have been supplied correctly. Second, it checks whether each option (including "-in 1", "-in 2", and "-genome"; Table 1) has been set, and fills any unset parameter with its default value. Third, it creates a result output folder (the flow is shown in FIG. 2).

1.2 Data preprocessing

RNA-seq is of two types, single-end sequencing and paired-end sequencing. Because the two types of sequencing file are processed differently, the input sequence file needs to be preprocessed.

Single-end sequencing: the sequence file is converted from fastq format to fasta format for subsequent processing.

Paired-end sequencing: first, the paired-end read merging program FLASH (download address: https://jaist.dl.sourceform.net/project/flashpage/FLASH-1.2.11.tar.gz) is used to merge the two sequence files produced by paired-end sequencing into one merged sequence file. The merged sequence file is then converted from fastq format to fasta format for subsequent processing (the flow is shown in FIG. 3).
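The fastq-to-fasta conversion used in both branches can be sketched in Python as follows. This is a minimal illustration, not the conversion routine of the invention. It assumes the common four-line fastq layout (header, sequence, separator, quality) with no blank lines, and the function name is hypothetical.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield fasta lines from fastq lines.

    Assumption: every record occupies exactly four lines and the
    input contains no blank lines.
    """
    for index, raw in enumerate(lines):
        line = raw.rstrip("\n")
        position = index % 4
        if position == 0:
            if not line.startswith("@"):
                raise ValueError(f"line {index + 1}: expected a fastq header")
            # '@read_id' becomes '>read_id'
            yield ">" + line[1:]
        elif position == 1:
            # the sequence line is copied unchanged
            yield line
        # separator (position 2) and quality (position 3) lines are dropped
```

The generator works on any iterable of lines, so it can be fed an open file handle and its output written line by line to the fasta file.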

TABLE 1 required input data and optional options for MeCi

2. Sequence alignment

2.1 Sequence file splitting

Before blast alignment, the sequences in the fasta-format sequence file are split into several files of a given size to improve the efficiency of the subsequent sequence alignment. The size of each split file is 0.1%-100% of the number of reads in the sequencing file.

The number of reads in the fasta-format sequence file from step 1.2 is counted and divided by the sequence block size parameter (chunksize; generally set to 0.1%-100% of the number of reads in the sequence file) to obtain the number of sequence blocks to be split. Split files are created and named. A Perl script is then written and called to split the fasta-format sequence file and fill the previously prepared split files, yielding fasta-format split files (the flow is shown in FIG. 4).
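The splitting arithmetic can be illustrated with a short sketch. The invention implements this step as a Perl script; Python is used here only for illustration. The sketch assumes that chunksize is expressed as a number of reads per block and that the block count is rounded up so that no read is lost. The function name and the record representation are hypothetical.

```python
import math
from typing import List, Tuple

# one fasta record as (header, sequence); hypothetical representation
Record = Tuple[str, str]


def split_records(records: List[Record], chunksize: int) -> List[List[Record]]:
    """Split fasta records into blocks of at most `chunksize` reads."""
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    # number of blocks = number of reads / chunksize, rounded up (assumption)
    n_blocks = math.ceil(len(records) / chunksize)
    return [records[i * chunksize:(i + 1) * chunksize] for i in range(n_blocks)]
```

With 10 reads and a chunksize of 4, this produces three blocks holding 4, 4 and 2 reads; each block would then be written to its own split file.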

2.2 Sequence alignment

Sequence alignment is performed using the formatdb and blastall programs provided by NCBI (download address for both programs: https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-x64-linux.tar.gz). The reference genome sequence is first formatted with formatdb to obtain a formatted reference genome sequence, after which sequence alignment begins.

Multithreaded alignment is set up, the number of bases of each sequence in the fasta-format split files is calculated, and the sequences in the fasta-format split files are aligned in batches against the formatted reference genome sequence using blastall, producing an alignment result file (the flow is shown in FIG. 5). The E value used by blastall defaults to 2, with a selectable range of 10^-5 to 5. Decreasing the E value lowers the mismatch rate of the blastn-aligned fragments (the sequence fragments of a read that align to the reference genome) and increases their length; increasing the E value shortens the aligned fragments and raises the mismatch rate. Each aligned fragment of a read contains 11 or more nucleotides.
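As an illustration, the two legacy BLAST command lines can be assembled as below. This Python sketch only builds the argument lists and checks the E value against the stated range; it does not run the programs. The options shown (`-i`, `-p`, `-d`, `-e`, `-m`, `-o`) are standard options of legacy formatdb and blastall, but the choice of tabular output (`-m 8`) and the function names are assumptions, since the text does not state which output format MeCi parses.

```python
from typing import List

E_MIN, E_MAX = 1e-5, 5.0  # selectable range of the E value given in the text


def formatdb_cmd(genome_fasta: str) -> List[str]:
    """Command that formats a nucleotide reference genome (-p F = not protein)."""
    return ["formatdb", "-i", genome_fasta, "-p", "F"]


def blastall_cmd(query_fasta: str, genome_fasta: str, out_path: str,
                 evalue: float = 2.0) -> List[str]:
    """Command that aligns one split file to the formatted reference genome.

    Assumption: tabular output (-m 8) is requested for later parsing.
    """
    if not (E_MIN <= evalue <= E_MAX):
        raise ValueError(f"E value {evalue} is outside the range {E_MIN} to {E_MAX}")
    return ["blastall", "-p", "blastn",
            "-d", genome_fasta, "-i", query_fasta,
            "-e", str(evalue), "-m", "8", "-o", out_path]
```

Each list could be passed to `subprocess.run`, one call per split file, with the calls distributed over worker threads to mirror the batch alignment described above.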

TABLE 2 procedure used by MeCi

3. Sequence feature screening to obtain candidate circular RNA

In this step, the alignment result file obtained in step 2.2 is processed to select the alignment results that match the characteristics of circRNA.

According to existing reports, circRNA recognition algorithms identify circRNA mainly by the sequence features of reads: one read can be divided into two sequences that align separately to the reference genome, and the positions at which the two sequences align to the reference genome are in opposite directions. The junction of the two reverse-aligned sequences on the read is the circularization site of the circRNA (FIG. 1).

circRNAs can be classified into three types according to differences at the circularization site. Type I: the circularization site contains nucleotides that align repeatedly to the reference genome (i.e., overlapping nucleotides); Type II: the circularization site contains nucleotides that cannot be aligned to the reference genome (i.e., vacant nucleotides); Type III: the circularization site contains neither overlapping nor vacant nucleotides (FIG. 1).
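The three types can be told apart from the read coordinates of the two fragments. The sketch below is a hedged illustration: it assumes 1-based inclusive read coordinates and measures the overlap on the read, on the reasoning that a read nucleotide covered by both fragments is one that has been aligned to the reference genome twice. The function name is hypothetical.

```python
from typing import Tuple


def classify_junction(first_q_end: int, second_q_start: int) -> Tuple[str, int]:
    """Classify a circularization site as Type I, II or III.

    first_q_end    : last read position of the fragment nearer the 5' end
    second_q_start : first read position of the fragment nearer the 3' end
    Returns the type and the number of overlapping (Type I) or vacant
    (Type II) nucleotides; Type III always has a count of 0.
    """
    between = second_q_start - first_q_end - 1
    if between < 0:
        return "I", -between   # fragments share read nucleotides
    if between > 0:
        return "II", between   # unaligned nucleotides lie between fragments
    return "III", 0            # fragments abut exactly
```

For instance, fragments covering read positions 1-40 and 38-80 share three nucleotides (Type I), fragments covering 1-40 and 43-80 leave two nucleotides unaligned (Type II), and fragments covering 1-40 and 41-80 abut exactly (Type III).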

In the algorithm, this is implemented mainly by the following steps (the flow is shown in FIG. 6):

First, the data in the blast alignment result file are parsed, and the reads for which only one sequence fragment aligns to the reference genome (that is, reads that align completely to the reference genome) are deleted, so that only the alignment results of candidate reads (reads for which two or more sequence fragments align separately to the reference genome) are kept.

Second, the following are extracted from the alignment result file containing the candidate reads: the position on the read of each sequence fragment that aligns to the reference genome, the position of that fragment on the reference genome sequence, and the ID of the read. The strand (plus or minus) of each sequence fragment on the reference genome is determined, and the data are written into two files, a plus-strand file and a minus-strand file.

Third, the plus-strand file and the minus-strand file are processed separately. The sequence fragments in the plus-strand file are integrated to obtain a plus-strand integrated file; for the minus-strand file, the start and end positions of its sequence fragments are exchanged before integration to obtain a minus-strand integrated file. The sequences in the plus-strand and minus-strand integrated files are sorted by read ID (taken from the original RNA-seq data), and only the reads with exactly two sequence fragments aligned to the reference genome in opposite directions are retained. The two sequence fragments on the same read are then integrated to obtain the candidate circRNA file.
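The grouping in the third step can be sketched in Python as follows. This is an interpretation rather than the invention's code: it assumes that "opposite directions" means the two fragments lie on the same strand but in reversed order on the reference genome relative to their order on the read, and that minus-strand coordinates have already been exchanged so that start is not greater than end. The tuple layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (read_id, q_start, q_end, s_start, s_end, strand), with s_start <= s_end
Hit = Tuple[str, int, int, int, int, str]


def candidate_reads(hits: Iterable[Hit]) -> Dict[str, Tuple[Hit, Hit]]:
    """Keep reads that have exactly two fragments in reversed genomic order."""
    by_read: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        by_read[hit[0]].append(hit)

    candidates: Dict[str, Tuple[Hit, Hit]] = {}
    for read_id in sorted(by_read):          # sort by read ID
        frags = by_read[read_id]
        if len(frags) != 2:                  # exactly two fragments required
            continue
        first, second = sorted(frags, key=lambda h: h[1])  # order on the read
        if first[5] != second[5]:            # assumption: same strand
            continue
        if first[5] == "+":
            reversed_order = first[3] > second[3]
        else:
            reversed_order = first[3] < second[3]
        if reversed_order:
            candidates[read_id] = (first, second)
    return candidates
```

A read whose 5' fragment maps downstream of its 3' fragment on the plus strand is kept, whereas a read whose two fragments map in the same order as on the read, as a simple deletion or gap would produce, is discarded.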

4. Conditional screening is carried out on the candidate circRNA to obtain the final plant mitochondrial genome coding circRNA

The candidate circRNA of step 3 is obtained by sequence feature screening, and false positive may exist. In order to obtain high-reliability circRNA, screening conditions are set to filter candidate circRNA by combining plant mitochondrial genome characteristics, so as to obtain final plant mitochondrial genome coding circRNA (the flow is shown in FIG. 7). The screening conditions were as follows:

first, the number of nucleotides in the circularization site (the junction of two sequence segments aligned in opposite directions on the read length of the reference genome) that overlap (i.e., two sequence segments aligned in opposite directions on the reference genome overlap by several nucleotides) and the number of nucleotides in the circularization site that are not aligned on the reference genome (i.e., two sequence segments aligned in opposite directions on the reference genome overlap by several nucleotides) should be less than or equal to 3 (fig. 1).

Second, the circRNA length (the circularization site corresponds to the number of nucleotides between the starting nucleotide and the terminating nucleotide aligned to the reference genome) is 10000 nucleotides or less and is equal to or greater than the length (the number of sequence nucleotides) corresponding to the read length. Wherein the starting nucleotide corresponds to the 5' terminal nucleotide aligned to the reference genome; the stop nucleotide corresponds to the 3' terminal nucleotide aligned to the reference genome.

The plant mitochondrial genome-encoded circRNAs obtained by screening under the above conditions are written to an output table, and two result files are generated: a data file containing the final plant mitochondrial genome-encoded circRNAs, named circRNA_details.xls, and a file containing the sequencing read data corresponding to the final circRNAs, named read_details.

The complete flow chart of the plant mitochondrial genome coding circRNA recognition method MeCi established by the invention is shown in figure 8.

Example two: Comparison of the effectiveness of MeCi with the prior art

1. Comparison of prediction of plant mitochondrial genome coding circRNAs by three circRNA recognition algorithms

RNA-seq data for maize and Arabidopsis mitochondrial circRNA were downloaded from NCBI (Table 3; download link for both data sets: www.ncbi.nlm.nih.gov/bioproject/PRJNA719584). The maize and Arabidopsis RNA-seq data contained 3 and 2 biological replicates, respectively.

TABLE 3 transcriptome data of maize and Arabidopsis mitochondria circRNA

Mitochondrial genome-encoded circRNAs of maize and Arabidopsis were identified from the two data sets using three circRNA recognition algorithms: MeCi, find_circ and CIRI2. In the 3 maize replicates, MeCi identified 32686, 23560 and 2638 mitochondrial genome-encoded circRNAs, find_circ identified 1782, 1197 and 1256, and CIRI2 identified 263, 219 and 184, respectively (Table 4). In the 2 Arabidopsis replicates, MeCi identified 49772 and 60289 mitochondrial genome-encoded circRNAs, find_circ identified 2333 and 2782, and CIRI2 identified 453 and 593, respectively.

To improve prediction reliability, circRNAs that occurred in at least 2 replicates were taken as highly reliable mitochondrial genome-encoded circRNAs. By this standard, MeCi, find_circ and CIRI2 identified 7524, 674 and 66 mitochondrial genome-encoded circRNAs in maize, and 9819, 685 and 105 in Arabidopsis, respectively. Of these, 37 (maize) and 42 (Arabidopsis) were predicted by all 3 algorithms (Table 4 and fig. 9); 7482 (maize) and 9749 (Arabidopsis) were found only by MeCi and were not identified by the existing algorithms find_circ and CIRI2 (fig. 9).
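The "at least 2 replicates" criterion amounts to counting, for each circRNA, how many replicates report it. The following is a minimal sketch under the assumption that a circRNA is identified by a (start, end, strand) tuple; that key, the function name and the toy coordinates are hypothetical and are not taken from the data sets above.

```python
from collections import Counter


def reproducible_circrnas(replicates, min_support=2):
    """Return circRNAs reported in at least `min_support` replicates.

    `replicates` is a list of collections, one per biological replicate;
    each element identifies a circRNA (here a (start, end, strand) tuple).
    """
    counts = Counter()
    for replicate in replicates:
        counts.update(set(replicate))  # count each circRNA once per replicate
    return {circ for circ, n in counts.items() if n >= min_support}


# Toy example with three replicates (coordinates are made up).
rep1 = [(100, 600, "+"), (2000, 2400, "-")]
rep2 = [(100, 600, "+"), (5000, 5300, "+")]
rep3 = [(2000, 2400, "-"), (100, 600, "+")]
print(sorted(reproducible_circrnas([rep1, rep2, rep3])))
```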

TABLE 4 comparison of the number of circRNAs encoded by the predicted mitochondrial genome in the same RNA-seq data for different circRNA identification methods

The find_circ algorithm mainly comprises the following steps:

(1) Using Bowtie2, each read of the RNA-seq sequence file is aligned to the reference genome, and reads that match the genome completely are discarded.

(2) A 20-base sequence is taken from each of the 5' and 3' ends of every remaining read as an anchor, and the two anchors are aligned to the genome separately. If both anchors align to the reference genome and their order on the read is the reverse of their order on the reference genome, the alignment is extended.

(3) The full-length sequence of the read is aligned to the genome. If the alignment result contains the two aligned sequences from step (2) and the sequences flanking the circularization site (the junction on the read of the two reverse-aligned sequences) form a conserved GT-AG site, the read is considered to support a candidate circRNA.

(4) Candidate circRNAs are further screened by the following conditions. First, the circularization site (the junction on the read of the two reverse-aligned sequences) should be well defined and unique; second, each circRNA is supported by at least two independent reads; third, each anchor should align to only one location on the genome; fourth, the circRNA length (the number of nucleotides between the first and last nucleotides of the circularization site on the reference genome) is not more than 100 kb; fifth, at most two mismatched bases are allowed when the two segments of the read are aligned to the genome.
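The anchor step in (2) can be illustrated with a short sketch. It is not find_circ's actual code: the function names are hypothetical, only the positive strand is considered, and the anchor positions are assumed to come from a separate aligner.

```python
def split_anchors(read: str, size: int = 20):
    """Return the 5' and 3' anchors of a read, or None if the read is too
    short to yield two non-overlapping anchors."""
    if len(read) < 2 * size:
        return None
    return read[:size], read[-size:]


def anchors_reversed(head_pos: int, tail_pos: int) -> bool:
    """True when the 5' anchor maps downstream of the 3' anchor on the
    reference, i.e. read order is the reverse of genome order
    (positive strand only, an assumption of this sketch)."""
    return head_pos > tail_pos


read = "A" * 20 + "C" * 10 + "G" * 20
head, tail = split_anchors(read)
print(head, tail, anchors_reversed(head_pos=500, tail_pos=300))
```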

The main steps of the CIRI2 algorithm are as follows:

(1) Each read of the RNA-seq sequence file is aligned to the reference genome using BWA-MEM. When one part of a read (fragment A) aligns to the genome while the flanking part (fragment B) cannot be aligned exactly to the reference genome sequence, the read is retained for further analysis.

(2) Fragment B is divided into small fragments, and each small fragment is aligned to the reference genome. A small fragment that aligns to the reference genome and whose order relative to fragment A on the read is the reverse of that on the reference genome is of the reverse splicing type. A small fragment that aligns to the reference genome and whose order relative to fragment A is the same on the read and on the reference genome is of the forward splicing type. By adjusting the FDR threshold, the number, length and mismatch rate of the small fragments aligned to the reference genome can be controlled.

(3) The numbers of forward-spliced and reverse-spliced small fragments are counted, and a maximum likelihood estimation method is used to judge whether the read is a reverse-spliced read.

(4) The circRNA information corresponding to the reverse-spliced reads is calculated. Under default parameters, the algorithm requires that the circRNA length (the number of nucleotides between the first and last nucleotides of the circularization site on the reference genome) is no greater than 200 kb and that a conserved GT-AG site flanks the circularization site (the junction on the read of the two reverse-aligned sequences).

2. Result verification

2.1 Experimental validation method

2.1.1 Main reagents and their formulations or sources:

(1) DNA extract (formulation in Table 5)

TABLE 5 DNA extract recipe

Components	Final concentration
Urea	7.0 M
Sodium chloride (NaCl)	0.3 M
Tris-HCl (pH 8.0)	50 mM
EDTA (pH 8.0)	24 mM
Sodium lauroyl sarcosinate (sarkosyl)	1%

(2) Mitochondrial extraction buffer (formulation see Table 6)

TABLE 6 mitochondrial extract formulation

Components	Final concentration
Sucrose	0.3 M
Sodium pyrophosphate (Na₄P₂O₇)	5 mM
Potassium dihydrogen phosphate (KH₂PO₄)	10 mM
Polyvinylpyrrolidone (PVP)	1% (w/v)
EDTA	2 mM
Bovine serum albumin (BSA)	1% (w/v)
Cysteine	5 mM
Ascorbic acid	20 mM

The pH was adjusted to 7.3 using H₃PO₄.

(3) Mitochondrial washing buffer (formulation see Table 7)

TABLE 7 formulation of mitochondrial wash

Components	Final concentration
Sucrose	0.3 M
EGTA	1 mM
MOPS	10 mM

The pH was adjusted to 7.2 using H₃PO₄.

(4) Primary reagents and consumables

The DNA polymerase 2× Taq Master Mix was purchased from Novozan Biotechnology Ltd (Nanjing, China); the RNA and DNA purification and recovery kits were purchased from Tiangen Biotechnology Ltd (Beijing, China); TRIzol reagent was purchased from Invitrogen Life Technologies; PrimeScript™ II reverse transcriptase was purchased from TaKaRa (Beijing, China); RNase R was purchased from Epicenter (USA); and Miracloth was purchased from Calbiochem (USA).

2.1.2 plant Material and planting Environment

The maize (Zea mays) material has the W22 inbred line genetic background, and the Arabidopsis (Arabidopsis thaliana) material is of the Columbia ecotype. Maize was grown in the experimental field of South China Agricultural University, and Arabidopsis was grown in an artificial climate chamber (22 ℃, dark culture).

2.1.3 genomic DNA (gDNA) extraction

Approximately 0.2 g of maize kernels at 15 DAP (days after self-pollination) or Arabidopsis thaliana seedlings at 5 DAG (days after germination) were placed in a mortar, and maize and Arabidopsis gDNA was extracted using the DNA extract (Table 5).

2.1.4 enrichment of maize and Arabidopsis mitochondria

(1) 10 g of maize 15 DAP kernels or Arabidopsis thaliana 5 DAG seedlings were collected in ice-precooled 50 ml centrifuge tubes.

(2) 3-4 ml of mitochondrial extraction buffer was added to a precooled mortar, all tissue in the centrifuge tube was transferred to the mortar, and the tissue was ground thoroughly on ice.

(3) The homogenate was filtered through Miracloth and collected in a 50 ml centrifuge tube precooled in an ice bath, and the tissue remaining on the Miracloth was washed with about 10 ml of mitochondrial extraction buffer.

(4) The filtrate was centrifuged at 8000 g, 4 ℃ for 10 min.

(5) The supernatant was transferred to a new 50 ml centrifuge tube and centrifuged at 20000 g, 4 ℃ for 10 min.

(6) The supernatant was carefully removed; the pellet contained the mitochondria.

(7) 3 ml of mitochondrial washing buffer was added to the pellet, the mixture was shaken gently until evenly mixed and divided among three 1.5 ml RNase-free centrifuge tubes, and the tubes were centrifuged at 20000 g, 4 ℃ for 10 min.

(8) The supernatant was discarded, and the pellet was snap frozen in liquid nitrogen and stored at -80 ℃ for subsequent experiments.

2.1.5 mitochondrial RNA extraction

(1) The mitochondrial pellet obtained in step 2.1.4 and stored at -80 ℃ was mixed with 1 ml of TRIzol reagent, shaken thoroughly and left on ice for 10 min.

(2) 200 μl of chloroform was added, and the sample was mixed well and left on ice for 5 min.

(3) The sample was centrifuged at 12000 rpm for 10 min, and the supernatant was transferred to a new 1.5 ml RNase-free tube.

(4) An equal volume of isopropanol was added, and the sample was mixed and left on ice for 30 min.

(5) The sample was centrifuged at 14000 rpm for 15 min and the supernatant was removed; the white precipitate in the tube was the RNA.

(6) The white precipitate was washed twice with 80% ethanol, and the remaining liquid was finally removed with a pipette.

(7) After standing at room temperature for 5 min, 30 μl of RNase-free water was added.

(8) The RNA was snap frozen in liquid nitrogen and stored in a freezer at -80 ℃ for subsequent use.

2.1.6 removal of gDNA contamination

DNase I was used to remove gDNA contamination from mitochondrial RNA and the reaction system is shown in Table 8.

TABLE 8 DNase I treatment of mitochondrial RNA

Components	Amount
DNase I	3 μl
10× DNase I buffer	3 μl
Mitochondrial RNA	20 μg
RNase-free water	to 30 μl

The reaction conditions were as follows: incubation at 37 ℃ for 30 min, followed by addition of 3 μl EDTA and incubation at 75 ℃ for 10 min to inactivate DNase I. The reaction solution was recovered using an RNA purification kit for subsequent RNase R treatment.

2.1.7 Removal of linear RNA from mitochondrial RNA

Mitochondrial RNA was treated with RNase R to remove linear RNA therefrom, and the reaction system is shown in Table 9.

TABLE 9 RNase R treatment to remove linear RNA

Components	Amount
RNase R	2 μl
10× RNase R buffer	2 μl
Mitochondrial RNA	20 μg
RNase-free water	to 40 μl

Reaction conditions are as follows: incubate at 37 ℃ for 30 min. The reaction solution was recovered using an RNA purification kit for subsequent cDNA synthesis.

2.1.8 first Strand cDNA Synthesis

Using PrimeScript™ II reverse transcriptase and random primers, two first-strand cDNAs were synthesized from RNase R-treated and untreated mitochondrial RNA as templates, and were named RNase+ and RNase-, respectively. The reverse transcription reaction system is shown in Table 10.

TABLE 10 reverse transcription reaction System

Components	Final concentration	Volume
Random primer	50 μM	1.25 μl
dNTP mixture	10 mM	1.25 μl
Template RNA	200 ng	1 μl
RNase inhibitor	40 U/μl	0.5 μl
Reverse transcriptase	200 U/μl	1.25 μl
5× reverse transcriptase buffer	1×	5.0 μl
RNase-free water	/	14.75 μl
Total	/	25 μl

The reaction conditions are as follows: incubation was carried out at 30 ℃ for 10min, at 42 ℃ for 45min and at 70 ℃ for 15 min. The reaction solution was used directly for PCR amplification.
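As a quick consistency check on Table 10, the component volumes can be summed and compared with the stated total. The dictionary keys below are shorthand labels chosen for this sketch; the values are the volumes listed in the table, with the water volume taken to be in μl.

```python
# Component volumes from Table 10, in microlitres.
volumes_ul = {
    "random_primer": 1.25,
    "dntp_mixture": 1.25,
    "template_rna": 1.0,
    "rnase_inhibitor": 0.5,
    "reverse_transcriptase": 1.25,
    "buffer_5x": 5.0,
    "rnase_free_water": 14.75,
}

total_ul = sum(volumes_ul.values())
print(total_ul)  # 25.0, matching the stated 25 μl total
```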

2.1.9 PCR primer design

Eleven maize and Arabidopsis thaliana mcircRNA generation sites were randomly selected, and divergent and convergent primers were designed for them (Table 11). The convergent primers serve as a positive control, amplifying genomic DNA from the gDNA template and reverse transcription products of linear RNA from the cDNA templates, whereas the divergent primers amplify reverse transcription products of circRNA.

TABLE 11 RT-PCR primer sequences and expected fragment sizes

2.1.10 PCR amplification

PCR amplification was performed with 2× Taq Master Mix using the gDNA from step 2.1.3 and the RNase+ and RNase- cDNA from step 2.1.8 as templates. The PCR reaction system and conditions are shown in Tables 12 and 13, respectively.

TABLE 12 PCR reaction System

TABLE 13 PCR reaction conditions

2.1.11 PCR product isolation, recovery and cloning

The PCR products were gel-purified, ligated into the pMD18-T vector and transformed into Escherichia coli DH5α. Colonies that grew after transformation were picked and identified by PCR, and clones containing an insert of the expected size were sent to a company for sequencing.

2.1.12 determination of the circularization site of PCR amplification product

The sequencing results were aligned to the reference genome using the blastn function of NCBI (https://blast.ncbi.nlm.nih.gov). Based on the alignment, it was judged whether the insert of each clone contained a circularization junction.

2.2 PCR amplification and sequencing results analysis

To verify whether the mitochondrial genome-encoded circRNAs newly predicted by MeCi in step 1 are accurate, 11 sites from the maize and Arabidopsis mitochondrial genomes were randomly chosen (maize: Zmrrn26, Zmcob, Zmnad2T2 and ZmtrnY; Arabidopsis: Atatp1, Atcox3, Atnad4L, Atrpl5, AttrnY and AttrnD), and the circRNAs encoded by these sites were amplified using RT-PCR (fig. 10 and 11). According to the MeCi predictions, a mitochondrial genomic site can typically produce multiple circRNA isoforms (fig. 10 and 11, table 14). For example, maize Zmnad2T2 produced 322 circRNA isoforms, while Arabidopsis Atatp1 produced as many as 982 (table 17). Furthermore, for most mitochondrial genome-encoded circRNA isoforms, the 5' ends of the circularization junctions are relatively close together, and the isoforms differ primarily in the position of the 3' end (fig. 10 and 11).

TABLE 14 sequencing results analysis of RT-PCR monoclonals

Agarose gel electrophoresis of the PCR amplification products showed that each pair of convergent primers amplified a single fragment from gDNA, RNase- cDNA and RNase+ cDNA, and that its length was consistent with the predicted size. In contrast, the divergent primers amplified products only from the RNase- and RNase+ cDNA templates, and the amplified fragments were diffuse, consistent with the isoform characteristics described above, i.e., circRNA isoforms from the same site have similar 5' ends and different 3' ends. The fragments amplified by the different divergent primers were sent to a company for sequencing after PCR product recovery, vector cloning, transformation of competent Escherichia coli and identification of positive clones. Monoclonal sequencing and sequence alignment showed that 144 of the 226 circRNAs identified by RT-PCR had been predicted by the MeCi algorithm, a validation rate of 63.72%. In contrast, none of the circRNAs identified by RT-PCR had been predicted by CIRI2 or find_circ. These results indicate that a large proportion of the mitochondrial genome-encoded circRNAs newly detected by MeCi are correct, which supports the reliability of MeCi predictions of plant mitochondrial genome-encoded circRNA.
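The reported validation rate follows directly from the two counts given above. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def validation_rate(confirmed: int, total: int) -> float:
    """Percentage of RT-PCR-identified circRNAs that were also predicted,
    rounded to two decimal places."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * confirmed / total, 2)


rate = validation_rate(144, 226)
print(rate)  # 63.72
```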

In conclusion, the ability of the MeCi algorithm established by the invention to identify the circRNA encoded by the mitochondrial genome of plants is superior to that of the existing circRNA identification method.

The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. The use of some of the essential features is possible within the scope of the claims attached below.

Claims

1. A method for identifying circular RNA encoded by a plant mitochondrial genome, characterized in that the method comprises the following steps: aligning transcriptome sequencing data to a reference genome using a blast tool to obtain an alignment result file; performing sequence feature screening on the data in the alignment result file to obtain a candidate circRNA file; and performing conditional screening on the candidate circRNA file to obtain the circular RNA encoded by the plant mitochondrial genome;

wherein the sequence feature screening removes from the alignment result file reads having only one sequence fragment aligned to the reference genome, and retains only reads having two sequence fragments that align to the reference genome in opposite directions.

2. The method of claim 1, wherein: the sequence feature screening comprises the following steps:

A1) removing from the alignment result file reads having only one sequence fragment aligned to the reference genome, and retaining reads in which more than two sequence fragments align to the reference genome in opposite directions;

A2) extracting, from the alignment result file, the position information of the more than two sequence fragments and the sign information of the sequence fragments relative to the reference genome;

3. The method according to claim 1 or 2, characterized in that: the setting range of the "E" parameter of the blast tool is 10^-5 to 5.

4. The method according to any one of claims 1-3, characterized in that: the conditions of the conditional screening are as follows:

B1) the number of overlapping nucleotides and vacant nucleotides at the circularization site is less than or equal to 3;

B2) the length of the candidate circRNA is less than or equal to 10000 nucleotides and greater than or equal to the length of the corresponding read;

wherein the circularization site is the junction, on the read, of the two sequence fragments that align to the reference genome in opposite directions; the overlapping nucleotides are the nucleotides that overlap at the positions where the two oppositely aligned sequence fragments align to the reference genome; and the vacant nucleotides are the nucleotides that lie at the junction of the two oppositely aligned sequence fragments and cannot be aligned to the reference genome.

5. An apparatus for identifying circular RNA encoded by a plant mitochondrial genome, characterized in that: the apparatus comprises the following modules:

C1) a sequence alignment module, configured to align transcriptome sequencing data of plant mitochondria to a reference genome using a blast tool to obtain an alignment result file;

C2) a sequence feature screening module, configured to perform sequence feature screening on the data in the alignment result file to obtain a candidate circRNA file;

C3) a conditional screening module, configured to perform conditional screening on the candidate circRNA file to obtain the final circRNA;

wherein the sequence feature screening removes from the alignment result file reads having only one sequence fragment aligned to the reference genome, and retains only reads having two sequence fragments that align to the reference genome in opposite directions;

and the setting range of the "E" parameter of the blast tool is 10^-5 to 5.

6. The apparatus of claim 5, wherein: the conditional screening module applies the following condition:

B1) the number of overlapping nucleotides and vacant nucleotides at the circularization site is less than or equal to 3;

wherein the circularization site is the junction, on the read, of the two sequence fragments that align to the reference genome in opposite directions; the overlapping nucleotides are the nucleotides that overlap at the positions where the two oppositely aligned sequence fragments align to the reference genome; and the vacant nucleotides are the nucleotides that lie at the junction of the two oppositely aligned sequence fragments and cannot be aligned to the reference genome.

7. A computer-readable storage medium having stored thereon a computer program for causing a computer to establish the steps of the method of any one of claims 1-4 or a module of the apparatus of any one of claims 5-6.

8. A computer-readable storage medium having stored thereon a computer program for causing a computer to carry out the steps of the method according to any one of claims 1-4 or the functions of the modules of the apparatus according to any one of claims 5-6.