CN109859798B - Prediction method for interaction of sRNA and target mRNA in bacteria - Google Patents

Info

Publication number
CN109859798B
Authority
CN
China
Prior art keywords
srna
mrna
sequence
rna sequence
dataset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910053867.6A
Other languages
Chinese (zh)
Other versions
CN109859798A (en)
Inventor
樊永显
崔娟
张龙
张向文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN201910053867.6A
Publication of CN109859798A
Application granted
Publication of CN109859798B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a method for predicting the interaction between sRNA and its target mRNA in bacteria, comprising the following steps: 1) data collection and arrangement; 2) feature extraction, in which the dataset is converted into a matrix; 3) F-score feature optimization; 4) training and constructing an SVM model and using it to obtain a prediction result. The method effectively characterizes RNA sequence information and improves the prediction accuracy of sRNA-target mRNA interactions, while offering low cost, low time consumption and high prediction speed.

Description

Prediction method for interaction of sRNA and target mRNA in bacteria
Technical Field
The invention relates to classification prediction of sequence interaction in bioinformatics, in particular to a prediction method of interaction between sRNA and target mRNA thereof in bacteria.
Background
Non-coding RNA (ncRNA) is RNA that does not code for proteins. It has attracted increasing attention in molecular biology as ncRNAs with regulatory functions have been identified in rapid succession. The interaction of regulatory ncRNAs with messenger RNAs (mRNAs) inhibits or activates translation into proteins, which can result in dysregulation of gene expression and thus in disease. A prominent class of such regulators is the small RNAs (sRNAs) in bacteria, which form complex secondary structures with target mRNAs by base pairing and thereby directly or indirectly affect gene expression. With the growing number of sRNAs identified in the metagenomic era, the interaction of an sRNA with its target mRNA must be considered in order to understand the function of the sRNA as fully as possible. Although the functions of some sRNAs have been confirmed, the functions of a considerable proportion remain unknown, so identifying sRNA targets is important for determining sRNA function. However, biological methods for recognizing the interaction of an sRNA with its target mRNA are very limited. Using computational techniques in combination with biological information to predict sRNA-mRNA interactions is therefore of great importance for discovering and understanding sRNA function.
Two types of methods are currently used to predict the interaction of sRNA with target genes: general RNA-RNA interaction models and prediction models dedicated to sRNA-target mRNA interactions. General RNA-RNA interaction models, such as RNAcofold, RNAup and RNAduplex (Lorenz R, Bernhart S H, Höner zu Siederdissen C, et al. ViennaRNA Package 2.0 [J]. Algorithms for Molecular Biology, 2011, 6(1): 26.), mostly provide only the binding sites between two RNA molecules without determining whether the two molecules actually interact. In fact, even two randomly selected RNA sequences can present many potential binding sites, which does not guarantee that the two sequences interact. A review of the literature shows that few models are currently dedicated to predicting sRNA-target mRNA interactions. Among them, sTarPicker (Ying X, Cao Y, Wu J, et al. sTarPicker: A Method for Efficient Prediction of Bacterial sRNA Targets Based on a Two-Step Model for Hybridization [J]. PLoS ONE, 2011, 6.) predicts whether an sRNA and a target mRNA interact using thermodynamic stability features and target accessibility features of the RNA-RNA molecules. This method achieves good prediction accuracy, but it uses a large number of features that are difficult to extract. IntaRNA (Busch A, Richter A S, Backofen R. IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions [J]. Bioinformatics, 2008, 24(24): 2849-2856.) calculates the accessibility of the binding site from a user-defined seed sequence to determine whether an sRNA and a target mRNA interact. These approaches contribute significantly to identifying whether an sRNA and its target mRNA interact, but they rely on a larger number of features derived from biochemical experimental data.
Disclosure of Invention
The invention aims at overcoming the defects of the prior art and provides a method for predicting the interaction between sRNA and target mRNA thereof in bacteria. The method can effectively characterize RNA sequence information and improve the prediction precision of sRNA-target mRNA interaction, and meanwhile, the method has the advantages of low cost, less time consumption and high prediction speed.
The technical scheme for realizing the aim of the invention is as follows:
A method for predicting the interaction of sRNA and its target mRNA in bacteria, which differs from the prior art in that it comprises the following steps:
1) Data collection and arrangement: the sRNA-mRNA interaction data set is obtained from the sRNATarBase 3.0 database, the mRNA primary sequences in the data set are aligned with the corresponding mRNA full genome sequences in the NCBI database, the sequence fragments of the mRNA between 80nt upstream and 50nt downstream of the start codon, which is AUG, are intercepted, and then the sRNA sequences and the mRNA sequences are joined to form a sequence pair, each sequence pair consisting of one sRNA sequence joined to one mRNA sequence, shaped as: sRNA-bbbbbb-mRNA, wherein b is a ligation symbol, and the consolidated dataset comprises 241 positive sample RNA sequence pairs with interactions and 185 negative sample RNA sequence pairs without interactions;
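The sequence-pair construction in step 1) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `extract_window` and `make_pair` are hypothetical, the window is assumed to run from 80 nt before the first nucleotide of the AUG start codon to 50 nt after that position, and the hyphens in "sRNA-bbbbbb-mRNA" are taken as notation only, so the joined string contains just the six-character linker.

```python
LINKER = "bbbbbb"  # six linker characters, as in the sRNA-bbbbbb-mRNA layout


def extract_window(mrna, start, upstream=80, downstream=50):
    """Cut the mRNA region around the start codon.

    `start` is the 0-based index of the A in AUG. The exact window
    boundaries are an assumption; the text only says "between 80 nt
    upstream and 50 nt downstream of the start codon".
    """
    if mrna[start:start + 3] != "AUG":
        raise ValueError("no AUG start codon at the given index")
    lo = max(0, start - upstream)
    hi = min(len(mrna), start + downstream)
    return mrna[lo:hi]


def make_pair(srna, mrna_window):
    """Join one sRNA and one mRNA fragment into a single sequence pair."""
    return srna + LINKER + mrna_window
```

Clamping the window at the sequence ends is a design choice of this sketch for transcripts shorter than the nominal window; the source does not say how such cases are handled.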
2) Feature extraction, converting a data set into a matrix: the process comprises the following steps:
(1) Encoding the RNA sequence pairs in the dataset with K-tuple nucleotides, where K can take the values 1, 2, 3, …, m, and m may in principle be arbitrarily large. RNA sequences are composed of four nucleotides, A, C, G and U; taking K nucleotides as a group in any order gives $4^K$ combinations in total, e.g. when K = 2 there are $4^2 = 16$ combinations. For each sample RNA sequence pair in the dataset, K adjacent nucleotides are taken from left to right starting from the first nucleotide, the window is then shifted one nucleotide to the right and the next K adjacent nucleotides are taken; this operation is repeated (L-K+1) times to traverse the entire RNA sequence pair, L being the length of each sample RNA sequence pair. The frequency of occurrence of each K-nucleotide combination in the entire RNA sequence pair is computed according to formula (1), and the frequencies of the $4^K$ combinations are converted into a $4^K$-dimensional vector, giving dimensions 1 to $4^K$ of the matrix D; the feature vector at this stage is expressed as in formula (2):
$$f_{i,j} = \frac{c_{i,j}}{L-K+1} \tag{1}$$
$$D_i = \left[f_{i,1}, f_{i,2}, \ldots, f_{i,4^K}\right]^{T} \tag{2}$$
where $c_{i,j}$ represents the number of times the j-th K-nucleotide combination of the i-th sample in the dataset occurs in the entire RNA sequence pair, $f_{i,j}$ represents the frequency of occurrence of that combination in the entire RNA sequence pair, $D_i$ represents the feature vector of the i-th sample RNA sequence pair in the dataset, and T denotes the transpose;
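The sliding-window count described in sub-step (1) might look like the following minimal Python sketch. The function name `kmer_frequencies` is hypothetical. Two points are assumptions rather than statements of the source: the K-tuples are ordered lexicographically over A, C, G, U, and windows that contain the linker character "b" are left uncounted while the normalizer stays at L-K+1.

```python
from itertools import product

ALPHABET = "ACGU"


def kmer_frequencies(seq, k):
    """Return the 4**k K-tuple frequencies of `seq` as a list."""
    n_windows = len(seq) - k + 1          # the (L-K+1) window positions
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:              # skips windows containing 'b' (assumption)
            counts[window] += 1
    return [counts[m] / n_windows for m in kmers]
```

For K = 3 this yields a 64-dimensional vector per sequence pair.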
(2) Extracting triplet information from the dot-bracket representation of the RNA secondary structure in the dataset: the secondary structure formed by the sRNA-bbbbbb-mRNA sequence is predicted with the RNAfold software. The first line of the prediction result is the sRNA-bbbbbb-mRNA sequence, and the second line is the corresponding secondary structure in dot-bracket notation. Each nucleotide in the secondary structure has only two states, paired and unpaired: a paired nucleotide near the 5' end of the RNA sequence is represented by "(", a paired nucleotide near the 3' end by ")", and an unpaired nucleotide by ".". To simplify coding, all paired nucleotides are represented by "(". Because "b" is only a character connecting the sRNA and the target mRNA, "b" and its corresponding pairing state are deleted and not considered. The secondary structure in the prediction result is then converted into a sequence of coding units, each consisting of the pairing states of three adjacent nucleotides, giving $2^3 = 8$ coding unit forms: "(((", "((.", "(.(", "(..", ".((", ".(.", "..(" and "...". The nucleotide (A, C, G or U) corresponding to each coding unit is extracted and combined with the coding unit to form a triplet, giving 4×8 = 32 triplets. For the secondary structure in dot-bracket notation in the software prediction result, the pairing states of three adjacent nucleotides are taken from left to right starting from the first pairing state, the window is then shifted one nucleotide to the right and the pairing states of the next three adjacent nucleotides are taken; this operation is repeated (L-3+1) times to traverse the entire RNA coding unit sequence, L being the length of each sample RNA sequence pair. The frequency of occurrence of each triplet in the RNA coding unit sequence is calculated according to formula (3), the 32 triplet frequency features are converted into a 32-dimensional vector, and this vector is appended to the $4^K$-dimensional vector obtained by formula (2), giving dimensions $4^K+1$ to $4^K+32$ of the matrix D; the feature vector corresponding to this stage is expressed as in formula (4):
$$p_{i,j} = \frac{s_{i,j}}{L-3+1} \tag{3}$$
$$D_i = \left[f_{i,1}, \ldots, f_{i,4^K}, p_{i,1}, \ldots, p_{i,32}\right]^{T} \tag{4}$$
where $s_{i,j}$ represents the number of times the j-th triplet coding unit of the i-th sample in the dataset occurs in the entire RNA sequence pair, $p_{i,j}$ represents the frequency of occurrence of the j-th triplet coding unit of the i-th sample in the entire RNA sequence pair, and $f_{i,1}, \ldots, f_{i,4^K}$ are the K-tuple frequencies of formula (2);
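The triplet encoding of sub-step (2) can be illustrated with the sketch below. It takes a sequence and a dot-bracket string of equal length as input (for instance the two output lines of RNAfold) and does not call RNAfold itself. The names `TRIPLET_KEYS` and `triplet_features` are hypothetical. The nucleotide attached to each coding unit is taken to be the middle one of the three, and the frequencies are normalized by the number of windows actually traversed after the linker is removed; both choices are assumptions of this sketch.

```python
from itertools import product

# 4 nucleotides x 8 pairing patterns = 32 triplet labels, e.g. "G(((" or "A(.."
TRIPLET_KEYS = [nt + "".join(u) for nt in "ACGU" for u in product("(.", repeat=3)]


def triplet_features(seq, structure, linker="b"):
    """Return the 32 triplet frequencies for one sequence pair."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    # Drop linker positions; map both '(' and ')' to '(' (paired), keep '.' (unpaired).
    kept = [(nt, "(" if st in "()" else ".")
            for nt, st in zip(seq, structure) if nt != linker]
    n_windows = len(kept) - 2
    if n_windows <= 0:
        raise ValueError("need at least three non-linker nucleotides")
    counts = dict.fromkeys(TRIPLET_KEYS, 0)
    for i in range(n_windows):
        unit = kept[i][1] + kept[i + 1][1] + kept[i + 2][1]
        key = kept[i + 1][0] + unit       # middle nucleotide + pairing pattern
        if key in counts:
            counts[key] += 1
    return [counts[key] / n_windows for key in TRIPLET_KEYS]
```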
(3) Adding energy information: the energy information includes the free energy of a single base $e_i$, the difference $\Delta e_i$ between the energy before and after binding of sRNA and mRNA, and the accessible energy $\Delta Acc_i$ of the sRNA and mRNA binding sites, where:
the free energy of a single base $e_i$ is the minimum free energy MFE at which the i-th sample RNA sequence pair in the dataset forms a stable secondary structure, divided by the length L of this sample RNA sequence pair, as shown in formula (5):
$$e_i = \frac{\mathrm{MFE}}{L} \tag{5}$$
the difference $\Delta e_i$ between the energy before and after sRNA-mRNA binding is obtained by subtracting, from the minimum free energy MFE of the i-th sample RNA sequence pair in the dataset when it forms a stable secondary structure, the energies released when the sRNA sequence and the mRNA sequence of that pair each form a stable intramolecular secondary structure on their own, as in formula (6):
$$\Delta e_i = \mathrm{MFE} - E_S - E_M \tag{6}$$
where $E_S$ represents the energy released when the sRNA sequence alone forms a stable intramolecular secondary structure, and $E_M$ represents the energy released when the mRNA sequence alone forms a stable intramolecular secondary structure;
the accessible energy of the sRNA binding site is obtained by subtracting the free energy of the paired bases in the sRNA sequence from the free energy of the unpaired bases in the sRNA sequence, as shown in formula (7); the accessible energy of the mRNA binding site is obtained in the same way, as shown in formula (8); and the accessible energy of the sRNA and mRNA binding sites of the i-th sample RNA sequence pair in the dataset is shown in formula (9):
$$\Delta SAcc = \Delta Es_{unpaired} - \Delta Es_{paired} \tag{7}$$
$$\Delta MAcc = \Delta Em_{unpaired} - \Delta Em_{paired} \tag{8}$$
$$\Delta Acc_i = \Delta SAcc + \Delta MAcc \tag{9}$$
where $\Delta Es_{unpaired}$ represents the free energy of the unpaired bases in the sRNA sequence, $\Delta Es_{paired}$ the free energy of the paired bases in the sRNA sequence, $\Delta Em_{unpaired}$ the free energy of the unpaired bases in the mRNA sequence, and $\Delta Em_{paired}$ the free energy of the paired bases in the mRNA sequence. The three energy values are converted into a 3-dimensional vector and appended to the $(4^K+32)$-dimensional vector obtained by formula (4), giving dimensions $4^K+32+1$ to $4^K+32+3$ of the matrix D; the feature vector corresponding to this stage is expressed as in formula (10):
$$D_i = \left[f_{i,1}, \ldots, f_{i,4^K}, p_{i,1}, \ldots, p_{i,32}, e_i, \Delta e_i, \Delta Acc_i\right]^{T} \tag{10}$$
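Formulas (5) to (9) are simple arithmetic once the underlying energies are known, as the following sketch shows. The function name `energy_features` and its parameter names are hypothetical, and the energies are passed in as plain numbers; how they are computed (for example with a folding program) is outside the scope of this illustration.

```python
def energy_features(mfe_pair, mfe_srna, mfe_mrna, pair_length,
                    srna_unpaired, srna_paired, mrna_unpaired, mrna_paired):
    """Return [e_i, delta_e_i, delta_Acc_i] for one sequence pair."""
    e = mfe_pair / pair_length                   # formula (5): per-base free energy
    delta_e = mfe_pair - mfe_srna - mfe_mrna     # formula (6): energy change on binding
    delta_sacc = srna_unpaired - srna_paired     # formula (7): sRNA site accessibility
    delta_macc = mrna_unpaired - mrna_paired     # formula (8): mRNA site accessibility
    delta_acc = delta_sacc + delta_macc          # formula (9): combined accessibility
    return [e, delta_e, delta_acc]
```

With made-up inputs `energy_features(-30.0, -10.0, -12.0, 150, -1.0, -4.0, -2.0, -6.0)`, the three values work out to -0.2, -8.0 and 7.0.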
(4) Obtaining specific base combination information, comprising the content ratios of A+U, G+C and A+C in each sample RNA sequence pair in the dataset, expressed as in formulas (11), (12) and (13) respectively:
$$r_i^{A+U} = \frac{N_i^{A+U}}{L} \tag{11}$$
$$r_i^{G+C} = \frac{N_i^{G+C}}{L} \tag{12}$$
$$r_i^{A+C} = \frac{N_i^{A+C}}{L} \tag{13}$$
where $N_i^{A+U}$, $N_i^{G+C}$ and $N_i^{A+C}$ respectively represent the total content of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset, and $r_i^{A+U}$, $r_i^{G+C}$ and $r_i^{A+C}$ respectively represent the content ratios of A+U, G+C and A+C in the i-th sample RNA sequence pair. The three specific base combination features are converted into a 3-dimensional vector and appended to the $(4^K+32+3)$-dimensional vector obtained by formula (10), giving dimensions $4^K+32+3+1$ to $4^K+32+3+3$ of the matrix D; the feature vector corresponding to this stage is expressed as in formula (14):
$$D_i = \left[f_{i,1}, \ldots, f_{i,4^K}, p_{i,1}, \ldots, p_{i,32}, e_i, \Delta e_i, \Delta Acc_i, r_i^{A+U}, r_i^{G+C}, r_i^{A+C}\right]^{T} \tag{14}$$
Finally, the matrix $D_{n \times (4^K+32+3+3)}$ is obtained, where n represents the total number of samples in the dataset and $4^K+32+3+3$ represents the dimension into which each sample is converted after feature extraction;
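The base-composition ratios of sub-step (4) and the resulting feature dimension can be checked with a short sketch. The function names `base_composition` and `feature_dimension` are hypothetical, and dividing by the full string length (linker characters included, if present) is an assumption; the text only states that L is the length of the sequence pair.

```python
def base_composition(seq):
    """Return the [A+U, G+C, A+C] content ratios of one sequence pair."""
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    a, c, g, u = (seq.count(nt) for nt in "ACGU")
    return [(a + u) / length, (g + c) / length, (a + c) / length]


def feature_dimension(k):
    """Total feature count: K-tuples + triplets + energies + base ratios."""
    return 4 ** k + 32 + 3 + 3
```

For K = 3 the dimension is 64 + 32 + 3 + 3 = 102.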
3) F-score feature optimization: the features of the matrix $D_{n \times (4^K+32+3+3)}$ obtained in step 2) are selected and optimized by the F-score method, retaining the features of the matrix that carry more discriminative information and deleting those that carry less. The F-score is given by formula (15):
$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \tag{15}$$
where $n_+$ represents the total number of positive samples, $n_-$ the total number of negative samples, $\bar{x}_i^{(+)}$ the mean of the i-th feature over the positive samples, $\bar{x}_i^{(-)}$ the mean of the i-th feature over the negative samples, $\bar{x}_i$ the mean of the i-th feature over all samples, $x_{k,i}^{(+)}$ the i-th feature of the k-th sample in the positive dataset, and $x_{k,i}^{(-)}$ the i-th feature of the k-th sample in the negative dataset. The larger the value of $F_i$, the more discriminative information the i-th feature contains and the greater its influence on classification. The $F_i$ values are ranked from largest to smallest, and the features with the greatest influence on classification are selected as the sample data features;
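The F-score of step 3) can be computed per feature as in this minimal pure-Python sketch. The function name `f_scores` is hypothetical, and returning 0.0 for a feature with zero within-class variance is a convention of this sketch, not something the source specifies.

```python
def f_scores(positives, negatives):
    """Return one F-score per feature column.

    `positives` and `negatives` are lists of equal-length feature rows;
    each class needs at least two samples.
    """
    n_pos, n_neg = len(positives), len(negatives)
    scores = []
    for i in range(len(positives[0])):
        pos = [row[i] for row in positives]
        neg = [row[i] for row in negatives]
        mean_pos = sum(pos) / n_pos
        mean_neg = sum(neg) / n_neg
        mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
        # Numerator: spread of the class means around the overall mean.
        between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
        # Denominator: sum of the two within-class sample variances.
        var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
        var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
        within = var_pos + var_neg
        scores.append(between / within if within > 0 else 0.0)
    return scores
```

A feature whose class means are far apart relative to its within-class variance receives a high score and is ranked first.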
4) Training and constructing an SVM model and predicting to obtain a prediction result: a 5-fold cross-validation experiment is carried out using the SVM algorithm. In the 5-fold cross-validation experiment, the dataset is randomly divided into 5 groups; each group is selected in turn as the test set while the remaining groups serve as the training set. An SVM classifier is trained and constructed on the training set, and the test set is then input into the SVM classifier to obtain the classification result.
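The 5-fold procedure of step 4) can be sketched independently of any particular SVM library. In the illustration below, `k_fold_indices` and `cross_validate` are hypothetical names, and `train_fn` and `predict_fn` are placeholders into which an SVM implementation would be plugged; the sketch itself does not train an SVM.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Randomly partition sample indices into k disjoint groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, predict_fn, k=5, seed=0):
    """Return the overall accuracy of one k-fold cross-validation run."""
    correct = 0
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Train on the remaining k-1 groups, then classify the held-out group.
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, features[i]) == labels[i]
                       for i in test_idx)
    return correct / len(labels)
```

Because the split depends on `seed`, calling `cross_validate` with several seeds and averaging the results gives a repeated cross-validation estimate.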
This technical scheme combines the feature extraction method of sTarPicker with newly added features and encodes the dataset by feature extraction to obtain the required data format. Because extracting features by several methods produces high-dimensional data, the F-score feature selection method is used to reduce the dimensionality. Finally, the features retained by feature selection are used to train and construct a prediction model, from which the classification result is obtained.
Compared with the prior art, the technical scheme has the remarkable advantages that:
(1) In addition to fully considering the existing characteristics, in order to describe RNA information more effectively, the technical scheme extracts the K-tuple nucleotide characteristics and the triplet characteristics based on RNA secondary structure point-bracket diagrams;
(2) The dimension of the feature vector is reduced by adopting the F-score feature selection method, the calculation time is reduced, and the phenomenon of overfitting is avoided.
The method can effectively characterize RNA sequence information and improve the prediction precision of sRNA-target mRNA interaction, and meanwhile, the method has the advantages of low cost, less time consumption and high prediction speed.
Drawings
FIG. 1 is a schematic block diagram of a method flow in an embodiment.
Detailed Description
The present invention will now be further illustrated, but not limited, by the following figures and examples.
Examples:
referring to fig. 1, a method for predicting the interaction of sRNA with its target mRNA in a bacterium, comprising the steps of:
1) Data collection and arrangement: the sRNA-mRNA interaction data set is obtained from the sRNATarBase 3.0 database, the mRNA primary sequences in the data set are aligned with the corresponding mRNA full genome sequences in the NCBI database, the sequence fragments of the mRNA between 80nt upstream and 50nt downstream of the start codon, which is AUG, are intercepted, and then the sRNA sequences and the mRNA sequences are joined to form a sequence pair, each sequence pair consisting of one sRNA sequence joined to one mRNA sequence, shaped as: sRNA-bbbbbb-mRNA, wherein b is a ligation symbol, and the consolidated dataset comprises 241 positive sample RNA sequence pairs with interactions and 185 negative sample RNA sequence pairs without interactions;
2) Feature extraction, converting a data set into a matrix: the process comprises the following steps:
Suppose the i-th sample sequence $D_i$ in the dataset has L nucleotides, as in formula (16):
$$D_i = R_1 R_2 R_3 R_4 R_5 \cdots R_L \tag{16}$$
(1) Encoding the RNA sequence pairs $D_i$ in the dataset with K-tuple nucleotides, where K can take the values 1, 2, 3, …, m, and m may in principle be arbitrarily large. RNA sequences are composed of four nucleotides, A, C, G and U; taking K nucleotides as a group in any order gives $4^K$ combinations in total. In this example K is 3, giving $4^3 = 64$ combinations. For each sample RNA sequence pair in the dataset, 3 adjacent nucleotides are taken from left to right starting from the first nucleotide, the window is then shifted one nucleotide to the right and the next 3 adjacent nucleotides are taken; this operation is repeated (L-3+1) times to traverse the entire RNA sequence pair, L being the length of each sample RNA sequence pair. The frequency of occurrence of each 3-nucleotide combination in the entire RNA sequence pair is computed according to formula (1), and the frequencies of the 64 combinations are converted into a 64-dimensional vector, giving dimensions 1 to 64 of the matrix D; the feature vector at this stage is expressed as in formula (2):
$$f_{i,j} = \frac{c_{i,j}}{L-3+1} \tag{1}$$
$$D_i = \left[f_{i,1}, f_{i,2}, \ldots, f_{i,64}\right]^{T} \tag{2}$$
where $c_{i,j}$ represents the number of times the j-th 3-nucleotide combination of the i-th sample in the dataset occurs in the entire RNA sequence pair, $f_{i,j}$ represents the frequency of occurrence of that combination in the entire RNA sequence pair, $D_i$ represents the feature vector of the i-th sample RNA sequence pair in the dataset, and T denotes the transpose;
(2) Extracting triplet information from the dot-bracket representation of the RNA secondary structure in the dataset: the secondary structure formed by the sRNA-bbbbbb-mRNA sequence is predicted with the RNAfold software. The first line of the prediction result is the sRNA-bbbbbb-mRNA sequence, and the second line is the corresponding secondary structure in dot-bracket notation. Each nucleotide in the secondary structure has only two states, paired and unpaired: a paired nucleotide near the 5' end of the RNA sequence is represented by "(", a paired nucleotide near the 3' end by ")", and an unpaired nucleotide by ".". To simplify coding, all paired nucleotides are represented by "(". Because "b" is only a character connecting the sRNA and the target mRNA, "b" and its corresponding pairing state are deleted and not considered. The secondary structure in the prediction result is then converted into a sequence of coding units, each consisting of the pairing states of three adjacent nucleotides, giving $2^3 = 8$ coding unit forms: "(((", "((.", "(.(", "(..", ".((", ".(.", "..(" and "...". The second nucleotide (A, C, G or U) corresponding to each coding unit is extracted and combined with the coding unit to form a triplet, giving 4×8 = 32 triplets. For the secondary structure in dot-bracket notation in the software prediction result, the pairing states of three adjacent nucleotides are taken from left to right starting from the first pairing state, the window is then shifted one nucleotide to the right and the pairing states of the next three adjacent nucleotides are taken; this operation is repeated (L-3+1) times to traverse the entire RNA coding unit sequence, L being the length of each sample RNA sequence pair. The frequency of occurrence of each triplet in the RNA coding unit sequence is calculated according to formula (3), the 32 triplet frequency features are converted into a 32-dimensional vector, and this vector is appended to the 64-dimensional vector obtained by formula (2), giving dimensions 65 to 96 of the matrix D; the feature vector corresponding to this stage is expressed as in formula (4):
$$p_{i,j} = \frac{s_{i,j}}{L-3+1} \tag{3}$$
$$D_i = \left[f_{i,1}, \ldots, f_{i,64}, p_{i,1}, \ldots, p_{i,32}\right]^{T} \tag{4}$$
where $s_{i,j}$ represents the number of times the j-th triplet coding unit of the i-th sample in the dataset occurs in the entire RNA sequence pair, $p_{i,j}$ represents the frequency of occurrence of the j-th triplet coding unit of the i-th sample in the entire RNA sequence pair, and $f_{i,1}, \ldots, f_{i,64}$ are the 3-tuple frequencies of formula (2);
(3) Adding energy information: the energy information includes the free energy of a single base $e_i$, the difference $\Delta e_i$ between the energy before and after binding of sRNA and mRNA, and the accessible energy $\Delta Acc_i$ of the sRNA and mRNA binding sites, where:
the free energy of a single base $e_i$ is the minimum free energy MFE at which the i-th sample RNA sequence pair in the dataset forms a stable secondary structure, divided by the length L of this sample RNA sequence pair, as shown in formula (5):
$$e_i = \frac{\mathrm{MFE}}{L} \tag{5}$$
the difference $\Delta e_i$ between the energy before and after sRNA-mRNA binding is obtained by subtracting, from the minimum free energy MFE of the i-th sample RNA sequence pair in the dataset when it forms a stable secondary structure, the energies released when the sRNA sequence and the mRNA sequence of that pair each form a stable intramolecular secondary structure on their own, as in formula (6):
$$\Delta e_i = \mathrm{MFE} - E_S - E_M \tag{6}$$
where $E_S$ represents the energy released when the sRNA sequence alone forms a stable intramolecular secondary structure, and $E_M$ represents the energy released when the mRNA sequence alone forms a stable intramolecular secondary structure;
the accessible energy of the sRNA binding site is obtained by subtracting the free energy of the paired bases in the sRNA sequence from the free energy of the unpaired bases in the sRNA sequence, as shown in formula (7); the accessible energy of the mRNA binding site is obtained in the same way, as shown in formula (8); and the accessible energy of the sRNA and mRNA binding sites of the i-th sample RNA sequence pair in the dataset is shown in formula (9):
$$\Delta SAcc = \Delta Es_{unpaired} - \Delta Es_{paired} \tag{7}$$
$$\Delta MAcc = \Delta Em_{unpaired} - \Delta Em_{paired} \tag{8}$$
$$\Delta Acc_i = \Delta SAcc + \Delta MAcc \tag{9}$$
where $\Delta Es_{unpaired}$ represents the free energy of the unpaired bases in the sRNA sequence, $\Delta Es_{paired}$ the free energy of the paired bases in the sRNA sequence, $\Delta Em_{unpaired}$ the free energy of the unpaired bases in the mRNA sequence, and $\Delta Em_{paired}$ the free energy of the paired bases in the mRNA sequence. The three energy values are converted into a 3-dimensional vector and appended to the 96-dimensional vector obtained by formula (4), giving dimensions 97 to 99 of the matrix D; the feature vector corresponding to this stage is expressed as in formula (10):
$$D_i = \left[f_{i,1}, \ldots, f_{i,64}, p_{i,1}, \ldots, p_{i,32}, e_i, \Delta e_i, \Delta Acc_i\right]^{T} \tag{10}$$
(4) Obtaining specific base combination information, comprising the content ratios of A+U, G+C and A+C in each sample RNA sequence pair in the dataset, expressed as in formulas (11), (12) and (13) respectively:
$$r_i^{A+U} = \frac{N_i^{A+U}}{L} \tag{11}$$
$$r_i^{G+C} = \frac{N_i^{G+C}}{L} \tag{12}$$
$$r_i^{A+C} = \frac{N_i^{A+C}}{L} \tag{13}$$
where $N_i^{A+U}$, $N_i^{G+C}$ and $N_i^{A+C}$ respectively represent the total content of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset, and $r_i^{A+U}$, $r_i^{G+C}$ and $r_i^{A+C}$ respectively represent the content ratios of A+U, G+C and A+C in the i-th sample RNA sequence pair. The three specific base combination features are converted into a 3-dimensional vector and appended to the 99-dimensional vector obtained by formula (10), giving dimensions 100 to 102 of the matrix D; the feature vector corresponding to this stage is expressed as in formula (14):
$$D_i = \left[f_{i,1}, \ldots, f_{i,64}, p_{i,1}, \ldots, p_{i,32}, e_i, \Delta e_i, \Delta Acc_i, r_i^{A+U}, r_i^{G+C}, r_i^{A+C}\right]^{T} \tag{14}$$
The dataset contains 426 samples in total, so the matrix $D_{426 \times 102}$ containing the biological information is finally obtained, expressed as in formula (17):
$$D_{426 \times 102} = \begin{bmatrix} D_1^{T} \\ D_2^{T} \\ \vdots \\ D_{426}^{T} \end{bmatrix} \tag{17}$$
3) F-score feature optimization: the features of the matrix $D_{426 \times 102}$ obtained in step 2) are selected and optimized by the F-score method, retaining the features of $D_{426 \times 102}$ that carry more discriminative information and deleting those that carry less. The F-score is given by formula (15):
$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2} \tag{15}$$
where $n_+$ represents the total number of positive samples, $n_-$ the total number of negative samples, $\bar{x}_i^{(+)}$ the mean of the i-th feature over the positive samples, $\bar{x}_i^{(-)}$ the mean of the i-th feature over the negative samples, $\bar{x}_i$ the mean of the i-th feature over all samples, $x_{k,i}^{(+)}$ the i-th feature of the k-th sample in the positive dataset, and $x_{k,i}^{(-)}$ the i-th feature of the k-th sample in the negative dataset. The larger the value of $F_i$, the more discriminative information the i-th feature contains. The $F_i$ values are ranked from largest to smallest, and the features with the greatest influence on classification are selected as the sample data features; the feature matrix $D_{426 \times 102}$ is thereby reduced to 53 dimensions, denoted as the matrix $D_{426 \times 53}$;
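The reduction from 102 to 53 columns amounts to ranking the features by F-score and keeping the top-ranked ones, which can be sketched as follows. The function name `select_top_features` is hypothetical, and the kept columns are returned in rank order, which is an arbitrary choice of this sketch. The source does not state how the cut-off of 53 was chosen, so `n_keep` is simply a parameter here.

```python
def select_top_features(matrix, scores, n_keep):
    """Keep the n_keep columns of `matrix` with the largest scores.

    Returns the reduced matrix and the original indices of the kept columns.
    """
    if n_keep > len(scores):
        raise ValueError("n_keep exceeds the number of features")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:n_keep]
    reduced = [[row[i] for i in kept] for row in matrix]
    return reduced, kept
```

Applied to a 426 × 102 matrix with `n_keep=53`, this returns a 426 × 53 matrix together with the indices needed to apply the same column selection to new samples.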
4) Training and constructing an SVM model and predicting to obtain a prediction result: classical machine learning algorithms are used to predict sRNA-mRNA interactions, and 5-fold cross-validation experiments are performed. In a 5-fold cross-validation experiment, the dataset is randomly divided into 5 groups; each group is selected in turn as the test set while the remaining groups serve as the training set. A classifier model is trained and constructed on the training set, and the test set is then input into the classifier model to obtain the classification result. Because random division of the data may introduce bias, this example performs the 5-fold cross-validation experiment 50 times and calculates the average accuracy of the classification results, as shown in Table 1. The accuracies in Table 1 indicate that, with the feature extraction method used in this example, the SVM algorithm is clearly superior to the random forest algorithm (RF) and the K-nearest neighbor algorithm (KNN). The SVM is an effective supervised pattern recognition method that is widely applied in bioinformatics; its basic idea is to map the data into a high-dimensional feature space and then determine an optimal separating hyperplane. This example uses the free software package LIBSVM written by Chang and Lin, obtains the optimal classification hyperplane with a radial basis function kernel, and determines the values of the regularization parameter C and the kernel width parameter γ by grid search: C = 32, γ = 0.125.
TABLE 1. Comparison of experimental results for different machine learning algorithms based on different K-tuple nucleotides [table provided as an image in the original publication]
As can be seen from Table 1, the SVM-based method achieves good prediction accuracy for sRNA-mRNA interactions in bacteria. Moreover, the SVM obtains relatively good prediction accuracy when K = 3, which explains why K is set to 3 in the feature extraction of step 2) (1).

Claims (1)

1. A method for predicting the interaction of sRNA with its target mRNA in a bacterium comprising the steps of:
1) Data collection and arrangement: the sRNA-mRNA interaction dataset is obtained from the sRNATarBase 3.0 database; the primary mRNA sequences in the dataset are aligned with the corresponding whole-genome sequences in the NCBI database, and the mRNA sequence fragment between 80 nt and 50 nt downstream of the initiation codon (AUG) is extracted; the sRNA sequence and the mRNA sequence are then connected to form a sequence pair, each sequence pair consisting of one sRNA sequence joined to one mRNA sequence in the form sRNA-bbbbbb-mRNA, where b is a linker symbol; the curated dataset comprises 241 positive-sample RNA sequence pairs with interactions and 185 negative-sample RNA sequence pairs without interactions;
2) Feature extraction, converting a data set into a matrix: the process comprises the following steps:
(1) Composing the RNA sequence pairs in the dataset by the K-tuple nucleotide method, where the K of the K-tuple takes the values 1, 2, 3, …, k, …, m and m approaches infinity: an RNA sequence contains four nucleotides, A, C, G and U, so taking K nucleotides as a group in any order gives 4^k combinations in total; for each sample RNA sequence pair in the dataset, starting from the first nucleotide, K adjacent nucleotides are taken from left to right, the window is then shifted right by one nucleotide and K adjacent nucleotides are taken again, and this operation is repeated (L-K+1) times to traverse the whole RNA sequence pair, where L is the length of each sample RNA sequence pair; the frequency of occurrence of each K-nucleotide combination in the whole RNA sequence pair is calculated according to formula (1), and the frequencies of the 4^k combinations are converted into a 4^k-dimensional vector, giving the 1st to 4^k-th dimensions in matrix D; the feature vector of this stage is expressed as in formula (2):
Figure FDA0004230548110000011
Figure FDA0004230548110000012
wherein
Figure FDA0004230548110000013
represents the number of times the j-th K-nucleotide combination of the i-th sample in the dataset occurs in the whole RNA sequence pair,
Figure FDA0004230548110000014
represents the frequency of occurrence of the j-th K-nucleotide combination of the i-th sample in the dataset in the whole RNA sequence pair, D_i represents the feature vector of the i-th sample RNA sequence pair in the dataset, and T denotes the transpose;
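The sliding-window K-tuple counting of step (1) can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `kmer_frequencies` is hypothetical, and two details are assumptions because formula (1) is only available as an image here: counts are divided by the number of windows (L-K+1), and windows that contain the linker character b are not counted.

```python
from itertools import product

def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Return the 4**k frequencies of all K-tuples, in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1            # L - K + 1 sliding windows
    if windows <= 0:
        return [0.0] * len(kmers)
    for start in range(windows):
        kmer = sequence[start:start + k]
        if kmer in counts:                     # assumption: skip windows with linker 'b'
            counts[kmer] += 1
    # Assumption: normalise by the number of windows.
    return [counts[kmer] / windows for kmer in kmers]
```

For K = 3 the result has 4^3 = 64 entries; for the sequence `AUGAUG` the four windows are AUG, UGA, GAU and AUG, so the entry for AUG is 0.5.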
(2) Extracting triplet information from the dot-bracket representation of the RNA secondary structures in the dataset: the secondary structure formed by the sRNA-bbbbbb-mRNA sequence is predicted with the RNAfold software; the first line of the prediction result is the sRNA-bbbbbb-mRNA sequence, and the second line is the corresponding secondary structure in dot-bracket notation; each nucleotide in the secondary structure has only two states, paired and unpaired, where a paired nucleotide near the 5' end of the RNA sequence is represented by "(", a paired nucleotide near the 3' end of the RNA sequence is represented by ")", and an unpaired nucleotide is represented by "."; "b" is only the character connecting the sRNA and the target mRNA, so "b" and its corresponding pairing state are deleted; the secondary structure in the prediction result is converted into a sequence of coding units, where each coding unit consists of the pairing states of three adjacent nucleotides in the secondary structure, giving 2^3 = 8 coding unit forms, namely "(((", "((.", "(.(", "(..", ".((", ".(.", "..(" and "..."; the second nucleotide corresponding to each coding unit, namely A, C, G or U, is extracted and combined with the coding unit to form a triplet, giving 4×8 = 32 triplets; for the dot-bracket secondary structure in the software prediction result, starting from the first pairing state, the pairing states of three adjacent nucleotides are taken from left to right, the window is then shifted right by one nucleotide and the pairing states of the next three adjacent nucleotides are taken, and this operation is repeated (L-3+1) times to traverse the whole RNA coding unit sequence, where L is the length of each sample RNA sequence pair; the frequency of occurrence of each triplet in the RNA coding unit sequence is calculated according to formula (3), the 32 triplet frequency features are converted into a 32-dimensional vector, and this vector is appended to the 4^k-dimensional vector obtained by formula (2), giving the (4^k+1)-th to (4^k+32)-th dimensions in matrix D; the feature vector of this stage is expressed as in formula (4):
Figure FDA0004230548110000021
Figure FDA0004230548110000022
wherein s is i,j Representing the number of times the jth triplet coding unit of the ith sample in the dataset occurs in the entire RNA sequence pair, p i,j Representing the frequency of occurrence of the jth triplet coding unit of the ith sample in the dataset in the entire RNA sequence pair;
(3) Adding energy information: the energy information includes the free energy per base e_i, the energy difference Δe_i before and after binding of the sRNA and the mRNA, and the accessible energy ΔAcc_i of the sRNA and mRNA binding sites, wherein
the free energy per base e_i is the minimum free energy MFE at which the i-th sample RNA sequence pair in the dataset forms a stable secondary structure, divided by the length L of this sample RNA sequence pair, as shown in formula (5):
e_i = MFE / L (5),
the energy difference Δe_i before and after sRNA-mRNA binding is obtained by subtracting, from the minimum free energy MFE of the i-th sample RNA sequence pair in the dataset when it forms a stable secondary structure, the energies of the sRNA sequence and the mRNA sequence of that pair when each alone forms a stable intramolecular secondary structure, as shown in formula (6):
Δe_i = MFE - E_S - E_M (6),
wherein E_S represents the energy released when the sRNA sequence alone forms a stable intramolecular secondary structure, and E_M represents the energy released when the mRNA sequence alone forms a stable intramolecular secondary structure;
the accessible energy of the sRNA binding site is obtained by subtracting the free energy of the paired bases in the sRNA sequence from the free energy of the unpaired bases in the sRNA sequence, as shown in formula (7); the accessible energy of the mRNA binding site is obtained in the same way, as shown in formula (8); and the accessible energy of the sRNA and mRNA binding sites of the i-th sample RNA sequence pair in the dataset is shown in formula (9):
ΔSAcc = ΔEs_unpaired - ΔEs_paired (7),
ΔMAcc = ΔEm_unpaired - ΔEm_paired (8),
ΔAcc_i = ΔSAcc + ΔMAcc (9),
wherein ΔEs_unpaired represents the free energy of the unpaired bases in the sRNA sequence, ΔEs_paired represents the free energy of the paired bases in the sRNA sequence, ΔEm_unpaired represents the free energy of the unpaired bases in the mRNA sequence, and ΔEm_paired represents the free energy of the paired bases in the mRNA sequence; the three energy values are converted into a 3-dimensional vector, which is appended to the (4^k+32)-dimensional vector obtained by formula (4), giving the (4^k+32+1)-th to (4^k+32+3)-th dimensions in matrix D; the feature vector of this stage is expressed as in formula (10):
Figure FDA0004230548110000031
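The three energy features of step (3) reduce to simple arithmetic once the individual energies are known. The sketch below assumes that all energies have already been computed elsewhere (for example from secondary-structure predictions); the function name `energy_features` and its parameter names are hypothetical and chosen only to mirror the symbols in formulas (5) to (9).

```python
def energy_features(mfe, length, e_srna, e_mrna,
                    es_unpaired, es_paired, em_unpaired, em_paired):
    """Return [e_i, delta_e_i, delta_acc_i] following formulas (5) to (9)."""
    e_per_base = mfe / length                     # formula (5): MFE / L
    delta_e = mfe - e_srna - e_mrna               # formula (6): MFE - E_S - E_M
    delta_s_acc = es_unpaired - es_paired         # formula (7): sRNA accessibility
    delta_m_acc = em_unpaired - em_paired         # formula (8): mRNA accessibility
    delta_acc = delta_s_acc + delta_m_acc         # formula (9): combined accessibility
    return [e_per_base, delta_e, delta_acc]
```

With made-up illustrative values, `energy_features(-30.0, 150, -10.0, -15.0, -2.0, -8.0, -3.0, -12.0)` gives `[-0.2, -5.0, 15.0]`; these numbers are not taken from the dataset.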
(4) Obtaining specific base combination information, namely the content ratios of A+U, G+C and A+C in each sample RNA sequence pair in the dataset, expressed as in formulas (11), (12) and (13), respectively:
Figure FDA0004230548110000032
Figure FDA0004230548110000033
Figure FDA0004230548110000034
wherein
Figure FDA0004230548110000035
respectively represent the total content of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset, and
Figure FDA0004230548110000036
respectively represent the content ratios of A+U, G+C and A+C in the i-th sample RNA sequence pair in the dataset;
the three specific base combination features are converted into a 3-dimensional vector, which is appended to the (4^k+32+3)-dimensional vector obtained by formula (10), giving the (4^k+32+3+1)-th to (4^k+32+3+3)-th dimensions in matrix D; the feature vector of this stage is expressed as in formula (14):
Figure FDA0004230548110000037
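The base combination features of step (4) can be sketched as follows. The function name `base_combination_ratios` is hypothetical, and because formulas (11) to (13) are only available as images here, the denominator is an assumption: each ratio is taken relative to the number of A, C, G and U characters in the sequence pair, with linker characters ignored.

```python
def base_combination_ratios(sequence):
    """Return the [A+U, G+C, A+C] content ratios of an RNA sequence."""
    # Assumption: only A, C, G and U are counted, so linker characters are ignored.
    bases = [b for b in sequence.upper() if b in "ACGU"]
    total = len(bases)
    if total == 0:
        return [0.0, 0.0, 0.0]
    count = {b: bases.count(b) for b in "ACGU"}
    return [(count["A"] + count["U"]) / total,
            (count["G"] + count["C"]) / total,
            (count["A"] + count["C"]) / total]
```

For the toy pair `AAUGCbbbbbbC`, six nucleotides remain after the linker is ignored, so the A+U and G+C ratios are both 0.5 and the A+C ratio is 4/6.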
finally, the matrix D_{n×(4^k+32+3+3)} is obtained, where n represents the total number of samples in the dataset and 4^k+32+3+3 represents the dimension into which each sample is converted after feature extraction;
3) F-score feature optimization: the features of the matrix D_{n×(4^k+32+3+3)} obtained in step 2) are selected and optimized by the F-score method; the features of the matrix that contain more discriminative information are retained, and the features of the matrix that contain less discriminative information are deleted; formula (15) is as follows:
F_i = \frac{(\bar{x}_i^{(+)} - \bar{x}_i)^2 + (\bar{x}_i^{(-)} - \bar{x}_i)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}(x_{k,i}^{(+)} - \bar{x}_i^{(+)})^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}(x_{k,i}^{(-)} - \bar{x}_i^{(-)})^2} (15),
wherein n_+ represents the total number of positive samples, n_- represents the total number of negative samples, \bar{x}_i^{(+)} represents the mean value of the i-th feature over the positive samples, \bar{x}_i^{(-)} represents the mean value of the i-th feature over the negative samples, \bar{x}_i represents the mean value of the i-th feature over all samples, x_{k,i}^{(+)} represents the i-th feature of the k-th sample in the positive dataset, and x_{k,i}^{(-)} represents the i-th feature of the k-th sample in the negative dataset; the larger the value of F_i, the more discriminative information the i-th feature contains; the F_i values are ranked from largest to smallest, and the features with the greatest influence on classification are selected as sample data features;
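The F-score ranking of step 3) can be sketched as follows. The sketch uses the F-score as it is commonly defined for feature selection (between-class separation of the feature means divided by the sum of the within-class sample variances) and assumes that this is what formula (15) expresses. The function names `f_scores` and `top_features` are hypothetical, and each class needs at least two samples.

```python
from statistics import mean

def f_scores(positive, negative):
    """F-score of every feature column; rows are samples, columns are features."""
    scores = []
    for pos_col, neg_col in zip(zip(*positive), zip(*negative)):
        mean_pos, mean_neg = mean(pos_col), mean(neg_col)
        mean_all = mean(pos_col + neg_col)       # mean over all samples
        between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
        within = (sum((x - mean_pos) ** 2 for x in pos_col) / (len(pos_col) - 1)
                  + sum((x - mean_neg) ** 2 for x in neg_col) / (len(neg_col) - 1))
        # A constant feature has zero within-class variance; score it as 0.
        scores.append(between / within if within else 0.0)
    return scores

def top_features(positive, negative, keep):
    """Indices of the `keep` highest-scoring features, best first."""
    scores = f_scores(positive, negative)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
```

In a toy example with positive rows `[[1.0, 1.0], [2.0, 5.0], [3.0, 9.0]]` and negative rows `[[7.0, 2.0], [8.0, 5.0], [9.0, 8.0]]`, the first feature separates the classes and scores 9.0, while the second has identical class means and scores 0.0, so `top_features(..., keep=1)` returns `[0]`.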
4) Training and constructing an SVM model and predicting to obtain a prediction result: a 5-fold cross-validation experiment is carried out with the SVM algorithm; in the 5-fold cross-validation experiment, the dataset is randomly divided into 5 groups, each group is selected in turn as the test set and the remaining groups are used as the training set; the SVM classifier is trained on the training set, and the test set is then input into the SVM classifier to obtain the classification result.
CN201910053867.6A 2019-01-21 2019-01-21 Prediction method for interaction of sRNA and target mRNA in bacteria Active CN109859798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910053867.6A CN109859798B (en) 2019-01-21 2019-01-21 Prediction method for interaction of sRNA and target mRNA in bacteria

Publications (2)

Publication Number Publication Date
CN109859798A CN109859798A (en) 2019-06-07
CN109859798B true CN109859798B (en) 2023-06-23

Family

ID=66895364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910053867.6A Active CN109859798B (en) 2019-01-21 2019-01-21 Prediction method for interaction of sRNA and target mRNA in bacteria

Country Status (1)

Country Link
CN (1) CN109859798B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379464B (en) * 2019-07-29 2023-05-12 桂林电子科技大学 Method for predicting DNA transcription terminator in bacteria
CN111951889B (en) * 2020-08-18 2023-12-22 安徽农业大学 Recognition prediction method and system for M5C locus in RNA sequence
CN113140255B (en) * 2021-04-19 2022-05-10 湖南大学 Method for predicting interaction of lncRNA-miRNA of plant
CN113344272B (en) * 2021-06-08 2022-06-21 汕头大学 Prediction method of interaction relation between circRNA, miRNA and RBP based on machine learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004066183A2 (en) * 2003-01-22 2004-08-05 European Molecular Biology Laboratory Microrna
JP2007082436A (en) * 2005-09-20 2007-04-05 Bioinformatics Institute For Global Good Inc METHOD FOR PREDICTING OR IDENTIFYING TARGET mRNA CONTROLLED BY FUNCTIONAL RNA, AND APPLICATION THEREOF
KR20180017827A (en) * 2016-08-11 2018-02-21 인하대학교 산학협력단 Method and System of Predicting protein-binding regions in RNA Using Nucleotide Profiles and Compositions

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2002316267A1 (en) * 2001-06-14 2003-01-02 Rigel Pharmaceuticals, Inc. Multidimensional biodata integration and relationship inference
US20040002083A1 (en) * 2002-01-29 2004-01-01 Ye Ding Statistical algorithms for folding and target accessibility prediction and design of nucleic acids
EP2101275A1 (en) * 2008-03-10 2009-09-16 Koninklijke Philips Electronics N.V. Method for polynucleotide design and selection
US9703929B2 (en) * 2014-10-21 2017-07-11 uBiome, Inc. Method and system for microbiome-derived diagnostics and therapeutics
CN104765846B (en) * 2015-04-17 2018-01-23 西安电子科技大学 A kind of data characteristics sorting technique of feature based extraction algorithm
CN106148324B (en) * 2015-05-12 2019-05-10 中国科学院上海生命科学研究院 RNA-RNA interaction analyzes and identifies method and its application
US10443103B2 (en) * 2015-09-16 2019-10-15 Innomedicine, LLC Chemotherapy regimen selection
CN105930687A (en) * 2016-04-11 2016-09-07 中国人民解放军第三军医大学 Method for predicting outer membrane proteins at bacterial whole genome level
CN106599615B (en) * 2016-11-30 2019-04-05 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of sequence signature analysis method for predicting miRNA target gene
CN107742063A (en) * 2017-10-20 2018-02-27 桂林电子科技大学 A kind of prokaryotes σ54The Forecasting Methodology of promoter
CN108090327B (en) * 2017-12-20 2022-03-29 吉林大学 Prediction method for exogenous miRNA (micro ribonucleic acid) regulation and control target gene containing three-dimensional free energy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
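RIsearch: fast RNA-RNA interaction search using a simplified nearest-neighbor energy model; Anne Wenzel et al.; Bioinformatics; Vol. 28 (No. 21); pp. 2738-2746 *
Research on methods for mature microRNA identification and function prediction; Wang Ying; China Doctoral Dissertations Full-text Database (Basic Sciences) (No. 6); full text *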

Also Published As

Publication number Publication date
CN109859798A (en) 2019-06-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant