CN109859798B

CN109859798B - Prediction method for interaction of sRNA and target mRNA in bacteria

Info

Publication number: CN109859798B
Application number: CN201910053867.6A
Authority: CN
Inventors: 樊永显; 崔娟; 张龙; 张向文
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2019-01-21
Filing date: 2019-01-21
Publication date: 2023-06-23
Anticipated expiration: 2039-01-21
Also published as: CN109859798A

Abstract

The invention discloses a method for predicting the interaction between an sRNA and its target mRNA in bacteria, comprising the following steps: 1) data collection and arrangement; 2) feature extraction, namely converting the data set into a matrix; 3) F-score feature optimization; 4) training and constructing an SVM model, and predicting to obtain a prediction result. The method can effectively characterize RNA sequence information and improve the prediction accuracy of sRNA-target mRNA interactions, while offering low cost, low time consumption and high prediction speed.

Description

Prediction method for interaction of sRNA and target mRNA in bacteria

Technical Field

The invention relates to the classification and prediction of sequence interactions in bioinformatics, in particular to a method for predicting the interaction between an sRNA and its target mRNA in bacteria.

Background

Non-coding RNA (ncRNA) is RNA that does not code for protein. It has attracted increasing attention in molecular biology as ncRNAs with regulatory functions have been identified in rapid succession. The interaction of regulatory ncRNAs with messenger RNAs (mRNAs) inhibits or activates translation into protein, which can result in dysregulated gene expression and thus in disease. A prominent class of such regulators is the small RNAs (sRNAs) of bacteria, which form complex secondary structures with their target mRNAs through base pairing and thereby affect gene expression directly or indirectly. As the number of sRNAs identified in the metagenomic era keeps growing, the regulatory interactions between an sRNA and its target mRNAs must be considered in order to understand the function of the sRNA as fully as possible. Although the functions of some sRNAs have been confirmed, the functions of a considerable proportion remain unknown, so identifying sRNA targets is important for identifying sRNA functions. However, the biological methods available for recognizing the interaction of an sRNA with its target mRNA are very limited. Using computational techniques in combination with biological information to predict sRNA-mRNA interactions is therefore of great importance for discovering and understanding sRNA function.

Two types of methods are currently used to predict the interaction of an sRNA with its target genes: generic RNA-RNA interaction models, and prediction models built specifically for sRNA-target mRNA interactions. Generic RNA-RNA interaction models, such as RNAcofold, RNAup and RNAduplex (Lorenz R, Bernhart S H, Siederdissen C H Z, et al. ViennaRNA Package 2.0 [J]. Algorithms for Molecular Biology, 2011, 6(1): 26.), mostly report only the binding sites between two RNA molecules without determining whether the two molecules actually interact. In fact, even two randomly selected RNA sequences can present many potential binding sites, which does not guarantee that the two sequences interact. A review of the literature shows that few models are currently dedicated to predicting sRNA-target mRNA interactions. Among them, sTarPicker (Ying X, Cao Y, Wu J, et al. sTarPicker: A Method for Efficient Prediction of Bacterial sRNA Targets Based on a Two-Step Model for Hybridization [J]. PLOS ONE, 2011, 6.) predicts whether an sRNA and a target mRNA interact from the thermodynamic stability and the target accessibility of the RNA-RNA duplex. This method achieves good prediction accuracy, but it uses a large number of features that are difficult to extract. IntaRNA (Busch A, Richter A S, Backofen R. IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions [J]. Bioinformatics, 2008, 24(24): 2849-2856.) determines whether an sRNA and a target mRNA interact by calculating the accessibility of the binding site on the basis of a user-defined seed sequence. These methods contribute significantly to identifying whether an sRNA and its target mRNA interact, but they rely on a larger number of features derived from biochemical experimental data.

Disclosure of Invention

The invention aims at overcoming the defects of the prior art and provides a method for predicting the interaction between sRNA and target mRNA thereof in bacteria. The method can effectively characterize RNA sequence information and improve the prediction precision of sRNA-target mRNA interaction, and meanwhile, the method has the advantages of low cost, less time consumption and high prediction speed.

The technical scheme for realizing the aim of the invention is as follows:

A method for predicting the interaction between an sRNA and its target mRNA in bacteria, which differs from the prior art in that it comprises the following steps:

1) Data collection and arrangement: an sRNA-mRNA interaction data set is obtained from the sRNATarBase 3.0 database; the primary mRNA sequences in the data set are aligned with the corresponding whole genome sequences in the NCBI database, and for each mRNA the sequence fragment from 80 nt upstream to 50 nt downstream of the start codon (AUG) is extracted; the sRNA sequences and the mRNA sequences are then joined to form sequence pairs, each pair consisting of one sRNA sequence joined to one mRNA sequence in the form sRNA-bbbbbb-mRNA, where b is a ligation symbol; the consolidated dataset comprises 241 positive sample RNA sequence pairs with interactions and 185 negative sample RNA sequence pairs without interactions;
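The construction of one sequence pair in step 1) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `mrna_window` and `make_pair` are hypothetical, the position of the start codon is assumed to be known already, and the 50 nt downstream are assumed to be counted from the base after the AUG (the text does not say whether the codon itself is included in the count).

```python
LINKER = "bbbbbb"  # ligation symbol between sRNA and mRNA, as given in the text


def mrna_window(mrna, start_index, upstream=80, downstream=50):
    """Return the fragment from `upstream` nt before the start codon to
    `downstream` nt after it (assumption: counted after the AUG)."""
    if mrna[start_index:start_index + 3] != "AUG":
        raise ValueError("start_index does not point at an AUG start codon")
    left = max(0, start_index - upstream)                  # clip at the 5' end
    right = min(len(mrna), start_index + 3 + downstream)   # clip at the 3' end
    return mrna[left:right]


def make_pair(srna, mrna_fragment):
    """Join one sRNA and one mRNA fragment in the form sRNA-bbbbbb-mRNA."""
    return srna + LINKER + mrna_fragment
```

Under these assumptions each mRNA fragment is at most 133 nt long (80 + 3 + 50), and shorter when the transcript is clipped at either end.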

2) Feature extraction, converting a data set into a matrix: the process comprises the following steps:

(1) Encoding the RNA sequence pairs in the dataset by the K-tuple nucleotide method, where K may take the values 1, 2, 3, …, k, …, m, with m tending to infinity. An RNA sequence consists of the four nucleotides A, C, G and U; taking K nucleotides as a group in any order gives $4^K$ combinations in total, for example $4^2 = 16$ combinations when K = 2. For each sample RNA sequence pair in the dataset, K adjacent nucleotides are taken from left to right starting from the first nucleotide, the window is then shifted one nucleotide to the right and the next K adjacent nucleotides are taken, and this operation is repeated (L-K+1) times to traverse the entire RNA sequence pair, L being the length of each sample RNA sequence pair. The frequency of occurrence of each K-nucleotide combination in the entire RNA sequence pair is counted according to formula (1), and the frequencies of the $4^K$ combinations are converted into a $4^K$-dimensional vector, giving dimensions 1 to $4^K$ of matrix D. The feature vector of this stage is expressed as in formula (2):

wherein the two quantities in formula (1) denote, respectively, the number of times the j-th K-nucleotide combination of the i-th sample in the dataset occurs in the entire RNA sequence pair, and the frequency with which that combination occurs in the entire RNA sequence pair; $D^i$ denotes the feature vector of the i-th sample RNA sequence pair in the dataset, and T denotes the transpose;
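The sliding-window count described in sub-step (1) can be sketched as below. Formula (1) is not reproduced in this text, so the normalisation is an assumption: each count is divided by the number of windows, L-K+1. The function name `kmer_frequencies` is hypothetical, and windows containing a non-nucleotide character (such as the linker `b`) are simply not counted, which is also an assumption.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return the 4**k frequencies of all K-nucleotide combinations,
    ordered lexicographically over the alphabet A, C, G, U."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1          # number of sliding-window positions
    if windows <= 0:
        return [0.0] * len(kmers)
    for i in range(windows):
        word = seq[i:i + k]
        if word in counts:              # skip windows with non-ACGU symbols
            counts[word] += 1
    # Assumed normalisation: count divided by the number of windows.
    return [counts[w] / windows for w in kmers]
```

For K = 3 the result has 64 entries, matching the 64-dimensional vector used in the embodiment.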

(2) Extracting triplet information from the dot-bracket representation of the RNA secondary structures in the dataset: the secondary structure formed by the sRNA-bbbbbb-mRNA sequence is predicted with the RNAfold software. The first line of the prediction result is the sRNA-bbbbbb-mRNA sequence, and the second line is the corresponding secondary structure in dot-bracket notation. Each nucleotide in the secondary structure has only two states, paired and unpaired: a paired nucleotide near the 5' end of the RNA sequence is represented by "(", a paired nucleotide near the 3' end by ")", and an unpaired nucleotide by ".". To simplify the coding, all paired nucleotides are represented by "(". Because "b" is only a character connecting the sRNA and the target mRNA, "b" and its corresponding pairing states are deleted and disregarded. The secondary structure in the prediction result is then converted into a sequence of coding units, where each coding unit consists of the pairing states of three adjacent nucleotides, giving $2^3 = 8$ coding unit forms in total: "(((", "((.", "(.(", "(..", ".((", ".(.", "..(" and "...". The second nucleotide (A, C, G or U) corresponding to each coding unit is extracted and combined with the coding unit to form a triplet, giving 4 × 8 = 32 triplets. For the secondary structure given in dot-bracket notation by the software, the pairing states of three adjacent nucleotides are taken from left to right starting from the first pairing state, the window is then shifted one nucleotide to the right and the pairing states of the next three adjacent nucleotides are taken, and this operation is repeated (L-3+1) times to traverse the whole RNA coding unit sequence, L being the length of each sample RNA sequence pair. The frequency of occurrence of each triplet in the RNA coding unit sequence is calculated according to formula (3), the 32 triplet frequency features are converted into a 32-dimensional vector, and this vector is appended to the $4^K$-dimensional vector obtained from formula (2), giving dimensions $4^K+1$ to $4^K+32$ of matrix D. The feature vector corresponding to this stage is expressed as in formula (4):

wherein $s_{i,j}$ represents the number of times the j-th triplet coding unit of the i-th sample in the dataset occurs in the entire RNA sequence pair, and $p_{i,j}$ represents the frequency of occurrence of the j-th triplet coding unit of the i-th sample in the dataset in the entire RNA sequence pair;
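The triplet encoding of sub-step (2) can be sketched as follows, taking the sequence and its dot-bracket string as inputs rather than calling RNAfold. This is an illustrative sketch: the function name `triplet_frequencies` is hypothetical, formula (3) is not reproduced in this text, and the count $s_{i,j}$ is assumed to be divided by the number of windows remaining after the linker has been removed.

```python
from itertools import product

# 8 coding units over the two states "(" (paired) and "." (unpaired)
STATES = ["".join(p) for p in product("(.", repeat=3)]
# 32 triplets: middle nucleotide followed by the coding unit, e.g. "G((."
TRIPLETS = [n + s for n in "ACGU" for s in STATES]


def triplet_frequencies(pair_seq, dot_bracket, linker="b"):
    """Return the 32 triplet frequencies for one sRNA-bbbbbb-mRNA pair."""
    if len(pair_seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    # Delete linker positions and map both bracket types to "(".
    kept = [(n, "(" if s in "()" else ".")
            for n, s in zip(pair_seq, dot_bracket) if n != linker]
    counts = dict.fromkeys(TRIPLETS, 0)
    windows = len(kept) - 3 + 1
    if windows <= 0:
        return [0.0] * len(TRIPLETS)
    for i in range(windows):
        unit = kept[i][1] + kept[i + 1][1] + kept[i + 2][1]
        key = kept[i + 1][0] + unit     # second (middle) nucleotide + unit
        if key in counts:
            counts[key] += 1
    # Assumed normalisation: count divided by the number of windows.
    return [counts[t] / windows for t in TRIPLETS]
```

Taking the dot-bracket string as an argument keeps the sketch independent of any particular RNAfold installation or output parser.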

(3) Adding energy information: the energy information includes the free energy per base $e_i$, the energy difference $\Delta e_i$ before and after binding of the sRNA and the mRNA, and the accessibility energy $\Delta Acc_i$ of the sRNA and mRNA binding sites, wherein

the free energy per base $e_i$ is the minimum free energy MFE at which the i-th sample RNA sequence pair in the dataset forms a stable secondary structure, divided by the length L of that sample RNA sequence pair, as shown in formula (5):

$e_i = \mathrm{MFE} / L$ (5),

the energy difference $\Delta e_i$ before and after sRNA-mRNA binding is obtained by subtracting, from the minimum free energy MFE of the i-th sample RNA sequence pair in the dataset when it forms a stable secondary structure, the energies released when the sRNA sequence and the mRNA sequence of that pair each form a stable intramolecular secondary structure alone, as expressed in formula (6):

$\Delta e_i = \mathrm{MFE} - E_S - E_M$ (6),

wherein $E_S$ represents the energy released when the sRNA sequence alone forms a stable intramolecular secondary structure, and $E_M$ represents the energy released when the mRNA sequence alone forms a stable intramolecular secondary structure;

the accessibility energy of the sRNA binding site is obtained by subtracting the free energy of the paired bases in the sRNA sequence from the free energy of the unpaired bases in the sRNA sequence, as shown in formula (7); the accessibility energy of the mRNA binding site is obtained in the same way, as shown in formula (8); and the accessibility energy of the sRNA and mRNA binding sites of the i-th sample RNA sequence pair in the dataset is shown in formula (9):

$\Delta SAcc = \Delta Es_{unpaired} - \Delta Es_{paired}$ (7),

$\Delta MAcc = \Delta Em_{unpaired} - \Delta Em_{paired}$ (8),

$\Delta Acc_i = \Delta SAcc + \Delta MAcc$ (9),

wherein $\Delta Es_{unpaired}$ represents the free energy of the unpaired bases in the sRNA sequence, $\Delta Es_{paired}$ represents the free energy of the paired bases in the sRNA sequence, $\Delta Em_{unpaired}$ represents the free energy of the unpaired bases in the mRNA sequence, and $\Delta Em_{paired}$ represents the free energy of the paired bases in the mRNA sequence. The three energy values are converted into a 3-dimensional vector, which is appended to the $4^K+32$-dimensional vector obtained from formula (4), giving dimensions $4^K+32+1$ to $4^K+32+3$ of matrix D. The feature vector corresponding to this stage is expressed as in formula (10):
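The three energy features of sub-step (3) are simple arithmetic once the underlying energies are available. The sketch below only combines them as described in the text and in formulas (6) to (9); the function name `energy_features` and its parameter names are hypothetical, and obtaining the input energies from a folding tool is outside its scope.

```python
def energy_features(mfe, length, e_s, e_m,
                    es_unpaired, es_paired, em_unpaired, em_paired):
    """Return [e_i, delta_e_i, delta_acc_i] for one sample RNA sequence pair."""
    e_i = mfe / length                    # free energy per base: MFE / L
    delta_e = mfe - e_s - e_m             # formula (6)
    s_acc = es_unpaired - es_paired       # formula (7), sRNA binding site
    m_acc = em_unpaired - em_paired       # formula (8), mRNA binding site
    return [e_i, delta_e, s_acc + m_acc]  # formula (9) gives the last entry
```

For instance, with made-up inputs MFE = -40.0, L = 160, $E_S$ = -10.0 and $E_M$ = -20.0, the first two features are -0.25 and -10.0.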

(4) Obtaining specific base combination information, comprising the content ratios of A+U, G+C and A+C in each sample RNA sequence pair in the dataset, expressed as in formulas (11), (12) and (13) respectively:

wherein the quantities in formulas (11) to (13) respectively represent the total content of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset, and the content ratios of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset. The three specific base combination features are converted into a 3-dimensional vector, which is appended to the $4^K+32+3$-dimensional vector obtained from formula (10), giving dimensions $4^K+32+3+1$ to $4^K+32+3+3$ of matrix D. The feature vector corresponding to this stage is expressed as in formula (14):

Finally, a matrix $D_{n \times (4^K+32+3+3)}$ is obtained,

where n represents the total number of samples in the dataset, and $4^K+32+3+3$ represents the dimension into which each sample is converted after feature extraction;
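The base combination features of sub-step (4) and the resulting feature dimension can be sketched as below. Formulas (11) to (13) are not reproduced in this text, so the ratio is an assumption: each combined count is divided by the number of nucleotides in the pair, ignoring the linker. The names `base_composition` and `feature_dimension` are hypothetical.

```python
def base_composition(seq):
    """Return the content ratios [A+U, G+C, A+C] of one RNA sequence pair."""
    n = {base: seq.count(base) for base in "ACGU"}
    total = sum(n.values())               # linker characters are ignored
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [(n["A"] + n["U"]) / total,
            (n["G"] + n["C"]) / total,
            (n["A"] + n["C"]) / total]


def feature_dimension(k):
    """Dimension of one sample after feature extraction: 4^K + 32 + 3 + 3."""
    return 4 ** k + 32 + 3 + 3
```

With K = 3 the dimension evaluates to 102, the width of the feature matrix used in the embodiment.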

3) F-score feature optimization: the features of the matrix $D_{n \times (4^K+32+3+3)}$ obtained in step 2) are selected and optimized by the F-score method, retaining the features of the matrix that carry more discriminative information and deleting the features that carry less. Formula (15) is as follows:

$F_i = \dfrac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n^+ - 1}\sum_{k=1}^{n^+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n^- - 1}\sum_{k=1}^{n^-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$ (15),

wherein $n^+$ represents the total number of positive samples, $n^-$ represents the total number of negative samples, $\bar{x}_i^{(+)}$ represents the mean value of the i-th feature over the positive samples, $\bar{x}_i^{(-)}$ represents the mean value of the i-th feature over the negative samples, $\bar{x}_i$ represents the mean value of the i-th feature over all samples, $x_{k,i}^{(+)}$ represents the i-th feature of the k-th sample in the positive dataset, and $x_{k,i}^{(-)}$ represents the i-th feature of the k-th sample in the negative dataset. The larger the value of $F_i$, the more discriminative information the i-th feature contains and the greater its influence on classification. The $F_i$ values are ranked from largest to smallest, and the features with the greatest influence on classification are selected as the sample data features;
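The ranking in step 3) can be sketched as below. Formula (15) is not legible in this text, so the sketch assumes it is the standard F-score used for SVM feature selection, which matches the quantities defined above: squared deviations of the class means from the overall mean, divided by the sum of the two within-class sample variances. The names `f_scores` and `rank_features` are hypothetical.

```python
def f_scores(pos, neg):
    """Return one F-score per feature.

    pos, neg: lists of feature vectors for positive and negative samples
    (each class needs at least two samples).
    """
    n_pos, n_neg = len(pos), len(neg)
    scores = []
    for i in range(len(pos[0])):
        p = [row[i] for row in pos]
        q = [row[i] for row in neg]
        mean_p = sum(p) / n_pos
        mean_q = sum(q) / n_neg
        mean_all = (sum(p) + sum(q)) / (n_pos + n_neg)
        between = (mean_p - mean_all) ** 2 + (mean_q - mean_all) ** 2
        within = (sum((x - mean_p) ** 2 for x in p) / (n_pos - 1)
                  + sum((x - mean_q) ** 2 for x in q) / (n_neg - 1))
        # A constant feature has zero within-class variance; score it 0.
        scores.append(between / within if within > 0 else 0.0)
    return scores


def rank_features(scores):
    """Return feature indices ordered from largest to smallest F-score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Keeping the first entries of `rank_features(...)` then corresponds to retaining the features with the greatest influence on classification.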

4) Training and constructing an SVM model and predicting to obtain a prediction result: a 5-fold cross-validation experiment is carried out with the SVM algorithm. In the 5-fold cross-validation experiment, the data set is randomly divided into 5 groups; each group is selected in turn as the test set and the remaining groups serve as the training set; an SVM classifier is trained and constructed on the training set, and the test set is then input into the SVM classifier to obtain the classification result.
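The data partitioning of the 5-fold cross-validation in step 4) can be sketched with the standard library alone. The function name `five_fold_splits` and the `seed` parameter are hypothetical; training the SVM on each training set is left to whichever SVM library is used.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Randomly divide sample indices into n_folds groups and return a list
    of (train_indices, test_indices) pairs, one per group."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # random division
    folds = [indices[f::n_folds] for f in range(n_folds)]
    splits = []
    for f in range(n_folds):
        test = folds[f]                             # one group as test set
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        splits.append((train, test))
    return splits
```

Each sample appears in exactly one test set, so every sample is predicted once per cross-validation run.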

This technical scheme builds on the feature extraction method of sTarPicker and adds new features; the data set is encoded by feature extraction to obtain the data format required for modelling. Because extracting data features by several methods produces high-dimensional data, the F-score feature selection method is used to address this problem. Finally, the features retained by feature selection are used to train and construct a prediction model, and the prediction model is used to obtain the classification result.

Compared with the prior art, the technical scheme has the remarkable advantages that:

(1) In addition to fully considering the existing characteristics, in order to describe RNA information more effectively, the technical scheme extracts the K-tuple nucleotide characteristics and the triplet characteristics based on RNA secondary structure point-bracket diagrams;

(2) The dimension of the feature vector is reduced by adopting the F-score feature selection method, the calculation time is reduced, and the phenomenon of overfitting is avoided.

The method can effectively characterize RNA sequence information and improve the prediction precision of sRNA-target mRNA interaction, and meanwhile, the method has the advantages of low cost, less time consumption and high prediction speed.

Drawings

FIG. 1 is a schematic block diagram of a method flow in an embodiment.

Detailed Description

The present invention will now be further illustrated, but not limited, by the following figures and examples.

Examples:

referring to fig. 1, a method for predicting the interaction of sRNA with its target mRNA in a bacterium, comprising the steps of:

Suppose the i-th sample sequence $D^i$ in the dataset has L nucleotides, as in formula (16):

$D^i = R_1 R_2 R_3 R_4 R_5 \cdots R_L$ (16),

(1) Encoding the RNA sequence pairs $D^i$ in the dataset by the K-tuple nucleotide method, where K may take the values 1, 2, 3, …, k, …, m, with m tending to infinity. An RNA sequence consists of the four nucleotides A, C, G and U; taking K nucleotides as a group in any order gives $4^K$ combinations in total. In this example K is 3. For each sample RNA sequence pair in the dataset, 3 adjacent nucleotides are taken from left to right starting from the first nucleotide, the window is then shifted one nucleotide to the right and the next 3 adjacent nucleotides are taken, and this operation is repeated (L-3+1) times to traverse the whole RNA sequence pair, L being the length of each sample RNA sequence pair. The frequency of occurrence of each 3-nucleotide combination in the whole RNA sequence pair is counted according to formula (1), and the frequencies of the 64 combinations are converted into a 64-dimensional vector, giving dimensions 1 to 64 of matrix D. The feature vector of this stage is expressed as in formula (2):

wherein the two quantities in formula (1) denote, respectively, the number of times the j-th 3-nucleotide combination of the i-th sample in the dataset appears in the whole RNA sequence pair, and the frequency with which that combination occurs in the whole RNA sequence pair; $D^i$ denotes the feature vector of the i-th sample RNA sequence pair in the dataset, and T denotes the transpose;

(2) Extracting triplet information from the dot-bracket representation of the RNA secondary structures in the dataset: the secondary structure formed by the sRNA-bbbbbb-mRNA sequence is predicted with the RNAfold software. The first line of the prediction result is the sRNA-bbbbbb-mRNA sequence, and the second line is the corresponding secondary structure in dot-bracket notation. Each nucleotide in the secondary structure has only two states, paired and unpaired: a paired nucleotide near the 5' end of the RNA sequence is represented by "(", a paired nucleotide near the 3' end by ")", and an unpaired nucleotide by ".". To simplify the coding, all paired nucleotides are represented by "(". Because "b" is only a character connecting the sRNA and the target mRNA, "b" and its corresponding pairing states are deleted and disregarded. The secondary structure in the prediction result is then converted into a sequence of coding units, where each coding unit consists of the pairing states of three adjacent nucleotides, giving $2^3 = 8$ coding unit forms in total: "(((", "((.", "(.(", "(..", ".((", ".(.", "..(" and "...". The second nucleotide corresponding to each coding unit, namely A, C, G or U, is extracted and combined with the coding unit to form a triplet, giving 4 × 8 = 32 triplets. For the secondary structure given in dot-bracket notation by the software, the pairing states of three adjacent nucleotides are taken from left to right, the window is then shifted one nucleotide to the right and the pairing states of the next three adjacent nucleotides are taken, and this operation is repeated (L-3+1) times to traverse the whole RNA coding unit sequence, L being the length of each sample RNA sequence pair. The frequency of occurrence of each triplet in the RNA coding unit sequence is calculated according to formula (3), the 32 triplet frequency features are converted into a 32-dimensional vector, and this vector is appended to the 64-dimensional vector obtained from formula (2), giving dimensions 65 to 96 of matrix D. The feature vector corresponding to this stage is expressed as in formula (4):

the energy difference $\Delta e_i$ before and after sRNA-mRNA binding is obtained by subtracting, from the minimum free energy MFE of the i-th sample RNA sequence pair in the dataset when it forms a stable secondary structure, the energies released when the sRNA sequence and the mRNA sequence of that pair each form a stable intramolecular secondary structure alone, as expressed in formula (6):

$\Delta e_i = \mathrm{MFE} - E_S - E_M$ (6),

the accessibility energy of the sRNA binding site is obtained by subtracting the free energy of the paired bases in the sRNA sequence from the free energy of the unpaired bases in the sRNA sequence, as shown in formula (7); the accessibility energy of the mRNA binding site is obtained in the same way, as shown in formula (8); and the accessibility energy of the sRNA and mRNA binding sites of the i-th sample RNA sequence pair in the dataset is shown in formula (9):

$\Delta SAcc = \Delta Es_{unpaired} - \Delta Es_{paired}$ (7),

$\Delta MAcc = \Delta Em_{unpaired} - \Delta Em_{paired}$ (8),

$\Delta Acc_i = \Delta SAcc + \Delta MAcc$ (9),

wherein $\Delta Es_{unpaired}$ represents the free energy of the unpaired bases in the sRNA sequence, $\Delta Es_{paired}$ represents the free energy of the paired bases in the sRNA sequence, $\Delta Em_{unpaired}$ represents the free energy of the unpaired bases in the mRNA sequence, and $\Delta Em_{paired}$ represents the free energy of the paired bases in the mRNA sequence. The three energy values are converted into a 3-dimensional vector, which is appended to the 96-dimensional vector obtained from formula (4), giving dimensions 97 to 99 of matrix D. The feature vector corresponding to this stage is expressed as in formula (10):

wherein the quantities in formulas (11) to (13) respectively represent the content ratios of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset. The three specific base combination features are converted into a 3-dimensional vector, which is appended to the 99-dimensional vector obtained from formula (10), giving dimensions 100 to 102 of matrix D. The feature vector corresponding to this stage is expressed as in formula (14):

The dataset contains 426 samples in total, so a matrix $D_{426 \times 102}$ containing biological information is finally obtained; the matrix $D_{426 \times 102}$ is expressed as in formula (17):

3) F-score feature optimization: the features of the matrix $D_{426 \times 102}$ obtained in step 2) are selected and optimized by the F-score method, retaining the features of the matrix $D_{426 \times 102}$ that carry more discriminative information and deleting the features that carry less, according to formula (15),

wherein the quantities involved are the mean value of the i-th feature over the positive samples, the mean value of the i-th feature over the negative samples, the mean value of the i-th feature over all samples, the i-th feature of the k-th sample in the positive dataset, and the i-th feature of the k-th sample in the negative dataset. The larger the value of $F_i$, the more discriminative information the i-th feature contains. The $F_i$ values are ranked from largest to smallest, the features with the greatest influence on classification are selected as the sample data features, and the feature matrix $D_{426 \times 102}$ is finally reduced to 53 dimensions, denoted as matrix $D_{426 \times 53}$;

4) Training and constructing an SVM model and predicting to obtain a prediction result: classical machine learning algorithms are used to predict sRNA-mRNA interactions, and 5-fold cross-validation experiments are performed. In a 5-fold cross-validation experiment, the data set is randomly divided into 5 groups; each group is selected in turn as the test set and the remaining groups serve as the training set. A classifier model is trained and constructed on the training set, and the test set is then input into the classifier model to obtain the classification result. Because a single random division of the data may introduce bias, this example performs the 5-fold cross-validation experiment 50 times and calculates the average accuracy of the classification results, as shown in Table 1. The accuracies in Table 1 indicate that, with the feature extraction method used in this example, the SVM algorithm is clearly superior to the random forest algorithm (RF) and the K-nearest neighbor algorithm (KNN). The SVM is an effective supervised pattern recognition method that is widely applied in bioinformatics; its basic idea is to map the data into a high-dimensional feature space and then determine an optimal separating hyperplane. This example uses LIBSVM, the free software package written by Chang and Lin, and obtains the optimal classification hyperplane with a radial basis function kernel. The values of the regularization parameter C and the kernel width parameter γ are determined by grid search: C = 32, γ = 0.125.
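For reference, the radial basis function kernel used by SVM packages such as LIBSVM has the form $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$. The sketch below evaluates it with the kernel width reported in this example as the default; the function name `rbf_kernel` is hypothetical, and the sketch illustrates the kernel only, not the SVM training itself.

```python
import math


def rbf_kernel(x, y, gamma=0.125):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical feature vectors give a kernel value of 1, and the value decays towards 0 as the vectors move apart; a larger γ makes that decay faster.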

TABLE 1 comparison of experimental results for different machine learning algorithms based on different K-tuple nucleotides

Table 1 shows the prediction accuracy of the SVM-based method for sRNA-mRNA interactions in bacteria. It also shows that the SVM achieves relatively good prediction accuracy when K = 3, which is why K is set to 3 in the feature extraction of step 2) (1).

Claims

1. A method for predicting the interaction of sRNA with its target mRNA in a bacterium comprising the steps of:

1) Data collection and arrangement: an sRNA-mRNA interaction data set is obtained from the sRNATarBase 3.0 database; the primary mRNA sequences in the data set are aligned with the corresponding whole genome sequences in the NCBI database, and for each mRNA the sequence fragment from 80 nt upstream to 50 nt downstream of the start codon (AUG) is extracted; the sRNA sequences and the mRNA sequences are then joined to form sequence pairs, each pair consisting of one sRNA sequence joined to one mRNA sequence in the form sRNA-bbbbbb-mRNA, where b is a ligation symbol; the consolidated dataset comprises 241 positive sample RNA sequence pairs with interactions and 185 negative sample RNA sequence pairs without interactions;

(1) Encoding the RNA sequence pairs in the dataset by the K-tuple nucleotide method, where K takes the values 1, 2, 3, …, k, …, m, with m tending to infinity. An RNA sequence consists of the four nucleotides A, C, G and U; taking K nucleotides as a group in any order gives $4^K$ combinations in total. For each sample RNA sequence pair in the dataset, K adjacent nucleotides are taken from left to right starting from the first nucleotide, the window is then shifted one nucleotide to the right and the next K adjacent nucleotides are taken, and this operation is repeated (L-K+1) times to traverse the whole RNA sequence pair, L being the length of each sample RNA sequence pair. The frequency of occurrence of each K-nucleotide combination in the whole RNA sequence pair is counted according to formula (1), and the frequencies of the $4^K$ combinations are converted into a $4^K$-dimensional vector, giving dimensions 1 to $4^K$ of matrix D. The feature vector of this stage is expressed as in formula (2):

wherein the first quantity in formula (1) represents the number of times the j-th K-nucleotide combination of the i-th sample in the dataset appears in the whole RNA sequence pair;

(2) Extracting triplet information from the dot-bracket representation of the RNA secondary structures in the dataset: the secondary structure formed by the sRNA-bbbbbb-mRNA sequence is predicted with the RNAfold software. The first line of the prediction result is the sRNA-bbbbbb-mRNA sequence, and the second line is the corresponding secondary structure in dot-bracket notation. Each nucleotide in the secondary structure has only two states, paired and unpaired: a paired nucleotide near the 5' end of the RNA sequence is represented by "(", a paired nucleotide near the 3' end by ")", and an unpaired nucleotide by ".". To simplify the coding, all paired nucleotides are represented by "(". Because "b" is only a character connecting the sRNA and the target mRNA, "b" and its corresponding pairing states are deleted. The secondary structure in the prediction result is then converted into a sequence of coding units, where each coding unit consists of the pairing states of three adjacent nucleotides, giving $2^3 = 8$ coding unit forms in total: "(((", "((.", "(.(", "(..", ".((", ".(.", "..(" and "...". The second nucleotide corresponding to each coding unit, namely A, C, G or U, is extracted and combined with the coding unit to form a triplet, giving 4 × 8 = 32 triplets. For the secondary structure given in dot-bracket notation by the software, the pairing states of three adjacent nucleotides are taken from left to right starting from the first pairing state, the window is then shifted one nucleotide to the right and the pairing states of the next three adjacent nucleotides are taken, and this operation is repeated (L-3+1) times to traverse the whole RNA coding unit sequence, L being the length of each sample RNA sequence pair. The frequency of occurrence of each triplet in the RNA coding unit sequence is calculated according to formula (3), the 32 triplet frequency features are converted into a 32-dimensional vector, and this vector is appended to the $4^K$-dimensional vector obtained from formula (2), giving dimensions $4^K+1$ to $4^K+32$ of matrix D. The feature vector corresponding to this stage is expressed as in formula (4):

$\Delta e_i = \mathrm{MFE} - E_S - E_M$ (6),

$\Delta SAcc = \Delta Es_{unpaired} - \Delta Es_{paired}$ (7),

$\Delta MAcc = \Delta Em_{unpaired} - \Delta Em_{paired}$ (8),

$\Delta Acc_i = \Delta SAcc + \Delta MAcc$ (9),

wherein the quantities in formulas (11) to (13) respectively represent the content ratios of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset;

the three specific base combination features are converted into a 3-dimensional vector, which is appended to the $4^K+32+3$-dimensional vector obtained from formula (10), giving dimensions $4^K+32+3+1$ to $4^K+32+3+3$ of matrix D. The feature vector corresponding to this stage is expressed as in formula (14):

finally, a matrix $D_{n \times (4^K+32+3+3)}$ is obtained,

where n represents the total number of samples in the dataset, and $4^K+32+3+3$ represents the dimension into which each sample is converted after feature extraction;

the features of the matrix that carry more discriminative information are retained and the features that carry less are deleted,

wherein the quantities involved are the mean value of the i-th feature over the positive samples, the mean value of the i-th feature over the negative samples, the mean value of the i-th feature over all samples, the i-th feature of the k-th sample in the positive dataset, and the i-th feature of the k-th sample in the negative dataset. The larger the value of $F_i$, the more discriminative information the i-th feature contains. The $F_i$ values are ranked from largest to smallest, and the features with the greatest influence on classification are selected as the sample data features;