CN113140255A

CN113140255A - Method for predicting plant lncRNA-miRNA interaction

Info

Publication number: CN113140255A
Application number: CN202110416052.7A
Authority: CN
Inventors: 蔡立军; 高明玉; 付祥政; 任宣百; 曾湘祥
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2021-07-20
Anticipated expiration: 2041-04-19
Also published as: CN113140255B

Abstract

The invention discloses a method for predicting plant lncRNA-miRNA interactions, comprising the following steps: acquiring an original data set and processing it to obtain an lncRNA sequence feature set, an lncRNA interaction spectrum set, an miRNA sequence feature set and an miRNA interaction spectrum set; calculating the similarities S_lncS, S_lncP, S_miS and S_miP corresponding respectively to the lncRNA sequence feature set, the lncRNA interaction spectrum set, the miRNA sequence feature set and the miRNA interaction spectrum set; combining the lncRNA sequence feature similarity S_lncS and the lncRNA interaction spectrum similarity S_lncP to obtain an lncRNA combined similarity, and combining the miRNA sequence feature similarity S_miS and the miRNA interaction spectrum similarity S_miP to obtain an miRNA combined similarity; and randomly selecting 1/5 of the elements of the first interaction matrix and setting them to zero to obtain a first training matrix, and transposing the first training matrix. The method addresses the technical problem that existing machine-learning-based methods for predicting lncRNA-miRNA interactions cannot be applied to lncRNAs or miRNAs that lack expression profiles or target genes.

Description

Method for predicting plant lncRNA-miRNA interaction

Technical Field

The invention belongs to the field of plant biomolecule relationship prediction, and particularly relates to a method for predicting plant lncRNA-miRNA interaction.

Background

A growing body of research has shown that non-coding RNAs, particularly lncRNAs and miRNAs, play a role in a variety of biological processes. miRNAs are about 22 nucleotides long and control gene expression at the post-transcriptional level, thereby regulating cell development, proliferation, apoptosis, carcinogenesis and differentiation. lncRNAs, which are typically longer than 200 nucleotides, are widely involved in many important regulatory processes, such as X chromosome silencing, genomic imprinting, chromatin modification, transcriptional activation, transcriptional interference and intranuclear trafficking. lncRNAs regulate gene expression primarily at three levels: epigenetic, transcriptional and post-transcriptional. At the epigenetic level, gene expression is regulated primarily through DNA modification, histone modification, non-coding RNA regulation, chromatin remodeling and nucleosome positioning, with some specific lncRNAs recruiting chromatin remodeling and modification complexes to specific sites and altering the DNA methylation state; at the transcriptional level, some lncRNAs bind to certain transcription factors as ligands and form complexes that control the transcriptional activity of genes; at the post-transcriptional level, some lncRNAs may also be directly involved in alternative splicing, RNA editing, and protein translation and translocation. lncRNAs also control the expression of miRNAs and thereby influence the expression of miRNA target genes: an lncRNA can compete with mRNA for miRNA molecules, thereby modulating miRNA-mediated target inhibition. miRNAs and lncRNAs thus interact with each other to exert a higher level of post-transcriptional regulation.

With the rapid development of computer technology, numerous computational methods have been used to study miRNAs, lncRNAs and proteins and their interactions. Although methods for predicting lncRNA-miRNA interactions exist, most of them are animal-oriented, and very few plant lncRNA-miRNA interactions have been demonstrated. For example, NPInter v4.0 records a wide range of functional interactions between long non-coding RNAs and other molecules for more than 30 species, but only two of these species are plants. Of the 71 RNA-RNA interactions recorded for these two plants, only one is an miRNA-lncRNA interaction. Because related studies are so few, the mechanism of plant miRNA-lncRNA interaction remains quite unclear. Moreover, lncRNA sequences are poorly conserved: the amino acid and nucleotide fragments of lncRNA molecules of different species, or even of the same species, can change in the course of biological evolution, which is particularly evident between distantly related species; conclusions drawn from animal research therefore cannot be guaranteed to apply to plants.

Existing methods for predicting lncRNA-miRNA interactions are mainly machine learning methods, which extract biological features and train a model to produce a binary result, i.e., an output indicating whether or not an lncRNA and an miRNA interact.

However, existing machine-learning-based methods for predicting lncRNA-miRNA interactions have some non-negligible drawbacks: first, they rely on specific data features, and some lncRNAs or miRNAs may lack features such as expression profiles or target genes, in which case the machine learning approach is not applicable; second, for isolated lncRNAs or miRNAs, which have no known interaction with any miRNA or lncRNA, it is difficult to predict the interactions they may have.

Disclosure of Invention

In view of the above drawbacks or needs for improvement in the prior art, the present invention provides a method for predicting plant lncRNA-miRNA interactions. It aims to solve two technical problems: that existing machine-learning-based methods for predicting lncRNA-miRNA interactions, because they depend on specific data features, cannot be applied to lncRNAs or miRNAs that lack an expression profile or target genes; and that they cannot predict interactions for isolated lncRNAs or miRNAs that have no known interaction with any miRNA or lncRNA.

To achieve the above object, according to one aspect of the present invention, there is provided a method for predicting a plant lncRNA-miRNA interaction, comprising the steps of:

(1) acquiring an original data set, and processing the original data set to obtain an lncRNA sequence feature set, an lncRNA interaction spectrum set, an miRNA sequence feature set and an miRNA interaction spectrum set; the method comprises the following substeps:

(1-1) acquiring an original data set, and dividing the original data set into a name set of miRNA, a name set of lncRNA, a binding sequence set of miRNA and lncRNA, and a tag set;

(1-2) inquiring a corresponding miRNA sequence in a reference sequence for each miRNA name in the miRNA name set obtained in the step (1-1), wherein all miRNA sequences form a miRNA sequence set;

(1-3) cutting the miRNA sequence set obtained in the step (1-2) by referring to the binding sequence set obtained in the step (1-1) to obtain a lncRNA sequence set;

(1-4) respectively carrying out de-duplication treatment on the miRNA sequence set obtained in the step (1-2) and the lncRNA sequence set obtained in the step (1-3) to obtain a de-duplicated miRNA sequence set and a de-duplicated lncRNA sequence set;

(1-5) extracting a plurality of characteristics from the sequence set of the miRNA after the duplication removal in the step (1-4) and the sequence set of the lncRNA after the duplication removal so as to respectively obtain the sequence characteristic set of the miRNA and the sequence characteristic set of the lncRNA.

(1-6) respectively carrying out duplication removal treatment on the miRNA name set and the lncRNA name set obtained in the step (1-1) to obtain a duplication removed miRNA name set and a duplication removed lncRNA name set;

(1-7) numbering the name set of the de-duplicated miRNA obtained in the step (1-6) and the name set of the de-duplicated lncRNA from 1 in sequence to obtain an lncRNA number set and an miRNA number set;

(1-8) generating a first interaction matrix according to the lncRNA number set and the miRNA number set obtained in the step (1-7) and the lncRNA name set, the miRNA name set and the label set obtained in the step (1-1), and transposing the first interaction matrix to obtain a second interaction matrix;

(1-9) generating an interaction spectrum set of lncRNA according to the lncRNA corresponding to each number in the lncRNA number set obtained in the step (1-7) and the first interaction matrix obtained in the step (1-8), and generating an interaction spectrum set of miRNA according to the miRNA corresponding to each number in the miRNA number set obtained in the step (1-7) and the second interaction matrix obtained in the step (1-8);

(2) calculating the similarities S_lncS, S_lncP, S_miS and S_miP respectively corresponding to the lncRNA sequence feature set, the lncRNA interaction spectrum set, the miRNA sequence feature set and the miRNA interaction spectrum set obtained in the step (1);

(3) combining the lncRNA sequence feature similarity S_lncS and the lncRNA interaction spectrum similarity S_lncP obtained in the step (2) to obtain the lncRNA combined similarity, and combining the miRNA sequence feature similarity S_miS and the miRNA interaction spectrum similarity S_miP obtained in the step (2) to obtain the miRNA combined similarity;

(4) randomly selecting 1/5 of the elements of the first interaction matrix obtained in the step (1-8) and setting them to zero to obtain a first training matrix, and transposing the first training matrix to obtain a second training matrix;

(5) processing the first training matrix obtained in the step (4) by using a label propagation algorithm based on the combined similarity of lncRNA obtained in the step (3) to obtain a first prediction matrix, and processing the second training matrix obtained in the step (4) by using a label propagation algorithm based on the combined similarity of miRNA obtained in the step (3) to obtain a second prediction matrix;

(6) performing weighted summation processing on the first prediction matrix and the second prediction matrix obtained in the step (5) to obtain a final prediction matrix;

(7) acquiring, based on the final prediction matrix obtained in the step (6), the interaction relation between the lncRNA with each number in the lncRNA number set and the miRNA with each number in the miRNA number set obtained in the step (1-7).
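Steps (4) and (5) can be illustrated with a minimal Python sketch. The function names, the `fraction`, `alpha` and `seed` parameters, and in particular the closed-form propagation F = (1 - alpha) * (I - alpha * S)^-1 * Y are assumptions made for illustration; the text above only states that a label propagation algorithm is applied to the training matrix using the combined similarity, and does not give its form.

```python
import numpy as np


def mask_fraction(Y, fraction=0.2, seed=0):
    """Step (4): zero a random `fraction` of all entries of Y (1/5 by default)."""
    rng = np.random.default_rng(seed)
    train = np.asarray(Y, dtype=float).copy()
    n_hidden = int(train.size * fraction)
    # Flat positions of the entries that are hidden from training.
    hidden = rng.choice(train.size, size=n_hidden, replace=False)
    train.flat[hidden] = 0.0
    return train, hidden


def label_propagation(S, Y_train, alpha=0.5):
    """Step (5), ASSUMED form: F = (1 - alpha) * (I - alpha * S)^-1 * Y_train.

    S is a square similarity matrix over the rows of Y_train; alpha is a
    hypothetical propagation weight in (0, 1).
    """
    n = S.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y_train)
```

Under these assumptions, the first prediction matrix would come from `label_propagation(S_lnc, train)` and the second from `label_propagation(S_mi, train.T)`, where `S_lnc` and `S_mi` stand for the combined similarities of step (3).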

Preferably, each sample in the original data set includes the name of one miRNA, the name of one lncRNA, the binding sequence of these two biomolecules, and a label, where a label of 1 indicates that the miRNA and the lncRNA in the sample are associated, and a label of 0 indicates that they are not.

Preferably, the deduplication of the miRNA sequence set in the step (1-4) is performed as follows: an empty file is first created; the sequences in the miRNA sequence set are then read in order, and for each sequence it is determined whether it already exists in the newly created file; if not, it is written to the new file; if it does, it is not written and the next sequence in the miRNA sequence set is read. These operations are repeated until every sequence in the miRNA sequence set has been read and checked. The resulting new file is the deduplicated miRNA sequence set;
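The deduplication procedure described above can be sketched in memory rather than with files. This is an illustrative Python sketch only: the function name `deduplicate` is hypothetical, and a list plus a set stand in for the newly created file.

```python
def deduplicate(items):
    """Keep the first occurrence of each item, preserving reading order."""
    seen = set()      # membership check, plays the role of "already in the file"
    unique = []       # plays the role of the newly created file
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique
```

The same function applies unchanged to sequence sets and to name sets, since both are deduplicated by the same read, check and write loop.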

Preferably, the feature extraction process of the step (1-5) is implemented using the online tool Pse-in-One 2.0.

The features extracted in step (1-5) include k-mer features, GC content (i.e., the proportion of guanine and cytosine in the sequence), the number of base pairs, and minimum free energy.
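Two of the listed features, k-mer frequencies and GC content, are simple enough to sketch directly. The Python code below is an illustration under stated assumptions, not the Pse-in-One 2.0 implementation: the function names, the default `k=2`, the `ACGU` alphabet and the conversion of T to U are all choices made here. The remaining two features depend on RNA secondary-structure prediction and are not sketched.

```python
from itertools import product


def gc_content(seq):
    """Proportion of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Relative frequency of every k-mer over `alphabet`, in lexicographic order."""
    seq = seq.upper().replace("T", "U")   # ASSUMPTION: DNA-style input is converted
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = max(len(seq) - k + 1, 0)      # number of sliding windows
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:              # windows with other symbols are skipped
            counts[window] += 1
    return [counts[m] / total if total else 0.0 for m in kmers]
```

Concatenating such per-sequence vectors row by row gives a feature matrix with one row per lncRNA or miRNA.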

Preferably, the deduplication of the miRNA name set in the step (1-6) is performed as follows: an empty file is first created; the names in the miRNA name set are then read in order, and for each name it is determined whether it already exists in the newly created file; if not, it is written to the new file; if it does, it is not written and the next name in the miRNA name set is read. These operations are repeated until every name in the miRNA name set has been read and checked, and the resulting new file is the deduplicated miRNA name set.

Preferably, the element in the first row and first column of the first interaction matrix is the label in the label set corresponding to the lncRNA name numbered 1 in the lncRNA number set and the miRNA name numbered 1 in the miRNA number set obtained in the step (1-7); the element in the first row and second column is the label corresponding to the lncRNA name numbered 1 and the miRNA name numbered 2, …, and so on; the element in the m-th row and n-th column is the label corresponding to the lncRNA name numbered m and the miRNA name numbered n, wherein m is 7963 and n is 1340.
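The construction of the first and second interaction matrices can be sketched as follows. The function name and the in-memory lists are hypothetical; note also that the sketch numbers names from 0 (Python indexing), whereas the text numbers them from 1.

```python
import numpy as np


def build_interaction_matrix(lnc_names, mi_names, labels):
    """Fill Y[i, j] with the label of (lncRNA i, miRNA j), per-sample inputs."""
    # dict.fromkeys deduplicates while keeping first-occurrence order.
    lnc_index = {name: i for i, name in enumerate(dict.fromkeys(lnc_names))}
    mi_index = {name: j for j, name in enumerate(dict.fromkeys(mi_names))}
    Y = np.zeros((len(lnc_index), len(mi_index)), dtype=int)
    for lnc, mi, label in zip(lnc_names, mi_names, labels):
        Y[lnc_index[lnc], mi_index[mi]] = label
    return Y, lnc_index, mi_index
```

With this layout, `Y` plays the role of the first interaction matrix (lncRNA rows, miRNA columns), `Y.T` that of the second, and row `i` of each matrix is the interaction spectrum of the corresponding molecule.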

Preferably, in step (1-9), for the lncRNA numbered 1 in the lncRNA number set, the corresponding interaction spectrum represents the interactions between that lncRNA and the miRNAs corresponding to all the numbers in the miRNA number set obtained in the step (1-7), i.e., the first row of the first interaction matrix; for the lncRNA numbered l in the lncRNA number set, the corresponding interaction spectrum represents the interactions between that lncRNA and the miRNAs corresponding to all the numbers in the miRNA number set obtained in the step (1-7), i.e., the l-th row of the first interaction matrix.

Preferably, the sequence feature similarity S_lncS corresponding to the lncRNA sequence feature set in step (2) is calculated by the following steps:

(2-1) calculating the cosine distance between each sequence feature in the lncRNA sequence feature set obtained in the step (1) and other sequence features to obtain a cosine distance matrix corresponding to the lncRNA sequence feature set;

(2-2) constructing an ordered subscript matrix by referring to the cosine distance matrix obtained in the step (2-1);

(2-3) generating a cosine-distance-based neighbor indication matrix C₁ according to a preset cosine distance neighbor ratio K₁ and the ordered subscript matrix obtained in the step (2-2);

(2-4) initializing a linear neighborhood similarity matrix G (its elements may take any numerical values), and calculating the updated linear neighborhood similarity matrix G = (C₁ .* G) according to the neighbor indication matrix C₁ obtained in the step (2-3); constructing a sequence feature matrix M from the lncRNA sequence feature set obtained in the step (1), wherein each row of M represents one lncRNA sequence feature; and setting a counter cnt1 = 1;

(2-5) judging whether cnt1 is larger than 40, if so, indicating that the iteration process is ended (obtaining a linear neighborhood similarity matrix G after iteration at the moment), and entering the step (2-7), otherwise, entering the step (2-6);

(2-6) iteratively updating the linear neighborhood similarity matrix G obtained in the step (2-4), according to the neighbor indication matrix C₁ obtained in the step (2-3), as G = (C₁ .* G) .* (M*M' + λ*e') ./ ((C₁ .* G) * (M*M' + λ*e')), where .* and ./ denote element-wise multiplication and division, * denotes matrix multiplication, the Lagrange factor λ = μ*e, μ denotes the weight parameter in the data reconstruction process, e is an all-ones column vector, and e' denotes the transpose of e; setting null values in G to 0, setting the counter cnt1 = cnt1 + 1, and returning to the step (2-5);

(2-7) obtaining a matrix M₂ = G*M from the sequence feature matrix M of the lncRNA obtained in the step (2-4) and the linear neighborhood similarity matrix G obtained after the iteration in the step (2-5);

(2-8) calculating the Euclidean distances between the rows of the matrix M₂ obtained in the step (2-7) to obtain a Euclidean distance matrix;

(2-9) constructing a new ordered subscript matrix by referring to the Euclidean distance matrix obtained in the step (2-8);

(2-10) generating a Euclidean-distance-based neighbor indication matrix C₂ according to a preset Euclidean distance neighbor ratio K₂ and the new ordered subscript matrix obtained in the step (2-9);

(2-11) initializing a final linear neighborhood similarity matrix W = G, calculating the updated final linear neighborhood similarity matrix W = (C₂ .* W) according to the neighbor indication matrix C₂ obtained in the step (2-10), and setting a counter cnt2 = 1;

(2-12) judging whether cnt2 is larger than 40, if so, ending the iterative updating process (obtaining a final linear neighborhood similarity matrix W after iteration at the moment), and entering the step (2-14), otherwise, entering the step (2-13);

(2-13) iteratively updating the final linear neighborhood similarity matrix W obtained in the step (2-11), according to the neighbor indication matrix C₂ obtained in the step (2-10), as W = (C₂ .* W) .* (M₂*M₂' + λ₂*e₂') ./ ((C₂ .* W) * (M₂*M₂' + λ₂*e₂')), where the Lagrange factor λ₂ = μ₂*e₂, μ₂ denotes the weight parameter in the data reconstruction process, e₂ is an all-ones column vector, and e₂' denotes the transpose of e₂; setting null values in W to 0, setting the counter cnt2 = cnt2 + 1, and returning to the step (2-12);

(2-14) setting the sequence feature similarity S_lncS of the lncRNA equal to the final linear neighborhood similarity matrix W updated in the step (2-13), and ending the process.
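The iterative update of steps (2-4) to (2-6), and of steps (2-11) to (2-13), can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: the function name, the `mu`, `init` and `seed` parameters and the random initialization are hypothetical, and the "null values" mentioned in the text are read here as the NaN or infinite entries produced by division by zero.

```python
import numpy as np


def linear_neighborhood_similarity(M, C, mu=1.0, n_iter=40, init=None, seed=0):
    """Multiplicative update G = (C.*G) .* K ./ ((C.*G) * K), K = M*M' + lambda*e'."""
    n = M.shape[0]
    if init is None:
        # The text allows arbitrary initial values; random ones are used here.
        init = np.random.default_rng(seed).random((n, n))
    G = C * init                              # restrict G to the chosen neighbors
    # With lambda = mu * e, the term lambda * e' adds mu to every entry.
    K = M @ M.T + mu * np.ones((n, n))
    for _ in range(n_iter):                   # counter runs from 1 to 40
        CG = C * G
        with np.errstate(divide="ignore", invalid="ignore"):
            G = CG * K / (CG @ K)
        # ASSUMPTION: "null values" means NaN/inf from a zero denominator.
        G = np.nan_to_num(G, nan=0.0, posinf=0.0, neginf=0.0)
    return G
```

Under these assumptions, the first stage would be `G = linear_neighborhood_similarity(M, C1)` and the second stage `W = linear_neighborhood_similarity(G @ M, C2, init=G)`, with `C1` and `C2` standing for the cosine-based and Euclidean-based neighbor indication matrices.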

Preferably, in the cosine distance matrix of step (2-1), the first element in the first row is the cosine distance between the 1st lncRNA sequence feature in the sequence feature set and itself, the second element in the first row is the cosine distance between the 1st lncRNA sequence feature and the 2nd lncRNA sequence feature in the sequence feature set, …, and so on;

specifically, the step (2-2) first records the subscripts of all elements of the cosine distance matrix obtained in the step (2-1) to obtain an initial subscript matrix, that is: in the cosine distance matrix, the first element in the first row represents the cosine distance between the 1st lncRNA sequence feature and itself, and its entry in the initial subscript matrix is (1, 1); the second element in the first row represents the cosine distance between the 1st lncRNA sequence feature and the 2nd lncRNA sequence feature, and its entry in the initial subscript matrix is (1, 2), …, and so on; then, for each row of the initial subscript matrix, the cosine distance matrix is queried with the index value represented by each element to obtain the corresponding distance value, the distance values of the row are sorted in ascending order, and the index values in the initial subscript matrix are reordered to follow the sorted distance values, thereby obtaining the ordered subscript matrix.
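Sorting each row's subscripts by ascending distance is what an argsort does, so step (2-2) can be sketched in one call. The function name is hypothetical; the sketch stores only the column part j of each subscript (i, j), since i is the row itself, and uses 0-based indices where the text uses 1-based ones.

```python
import numpy as np


def ordered_index_matrix(D):
    """Row i lists the column indices of D[i] from smallest to largest distance."""
    # A stable sort keeps the original column order among equal distances.
    return np.argsort(D, axis=1, kind="stable")
```

Because the self-distance on the diagonal is set to the maximum value 1 in the embodiment, each row's own index tends to be sorted towards the end of its row.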

Preferably, in step (2-3), the neighbor indication matrix C₁ has the same size as the cosine distance matrix sorted in the step (2-2);

the step (2-3) is specifically as follows: firstly, for the first row of the sorted cosine distance matrix, the index values in the ordered subscript matrix corresponding to the first Int(total number of columns of the cosine distance matrix × cosine distance neighbor ratio) elements of the row are taken (where Int denotes the rounding operation), the elements of the first row of the neighbor indication matrix C₁ in the columns corresponding to these index values are all set to 1, and the elements in the remaining columns of the first row are all set to 0; then, for the second row of the sorted cosine distance matrix, the index values in the ordered subscript matrix corresponding to the first Int(total number of columns of the cosine distance matrix × cosine distance neighbor ratio) elements of the row are taken, the elements of the second row of the neighbor indication matrix C₁ in the columns corresponding to these index values are all set to 1, and the elements in the remaining columns of the second row are all set to 0, …, and so on; finally, the diagonal elements of the resulting neighbor indication matrix C₁ are all set to 0;

in step (2-8), the element in the first row and first column of the Euclidean distance matrix is the Euclidean distance between the first row of the matrix M₂ and itself, the element in the first row and second column is the Euclidean distance between the first row and the second row of the matrix M₂, …, and so on.
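The neighbor indication matrix of step (2-3) can be sketched as follows; the same function would serve for the Euclidean case of step (2-10). The function name is hypothetical, and `Int` is assumed here to truncate towards zero, which the text does not specify.

```python
import numpy as np


def neighbor_indicator(D, ratio):
    """C[i, j] = 1 if j is among the Int(n_cols * ratio) nearest columns of row i."""
    n_rows, n_cols = D.shape
    k = int(n_cols * ratio)                    # ASSUMPTION: Int truncates
    order = np.argsort(D, axis=1, kind="stable")
    C = np.zeros((n_rows, n_cols), dtype=int)
    rows = np.arange(n_rows)[:, None]          # broadcast row ids against top-k columns
    C[rows, order[:, :k]] = 1
    np.fill_diagonal(C, 0)                     # a point is never its own neighbor
    return C
```

For the Euclidean stage, `D` could be computed from the rows of M₂, for example as `np.linalg.norm(M2[:, None, :] - M2[None, :, :], axis=2)`, which is practical only for small matrices because of the intermediate three-dimensional array.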

Preferably, the final prediction matrix of step (6) = β × first prediction matrix + (1 - β) × second prediction matrix, wherein the weight β of the first prediction matrix is 0.35.
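The weighted summation can be sketched in a few lines. The function and argument names are hypothetical. The sketch also assumes that the second prediction matrix, which is derived from the transposed training matrix, has miRNA rows and lncRNA columns and must be transposed back before the two are added; the text does not state this explicitly.

```python
import numpy as np


def combine_predictions(F_lnc, F_mi, beta=0.35):
    """Final score = beta * F_lnc + (1 - beta) * F_mi', both in lncRNA x miRNA layout."""
    # ASSUMPTION: F_mi is miRNA x lncRNA, so it is transposed before summation.
    return beta * np.asarray(F_lnc) + (1.0 - beta) * np.asarray(F_mi).T
```

With β = 0.35, the miRNA-side prediction carries the larger weight (0.65) in the final score.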

In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:

1. Because steps (2) to (5) are adopted, a novel linear neighborhood similarity measure is constructed: a similarity network is built from general features of the biomolecules, namely common sequence features and interaction spectra, rather than from molecule-specific features, and plant lncRNA-miRNA interactions are then predicted on that network using a label propagation algorithm. This solves the technical problem that existing machine-learning-based methods, which depend on specific data features, cannot predict interactions for lncRNAs or miRNAs that lack an expression profile or target genes;

2. Because step (2) is adopted, a novel linear neighborhood similarity measure is constructed in which linear neighborhood neighbors are selected for each data point on the basis of both cosine distance and Euclidean distance. The selected neighbors are therefore closest to the original data point both in spatial direction and in straight-line distance, which speeds up neighborhood selection, improves its accuracy, and makes the method more accurate than conventional similarity network construction methods;

3. Because steps (2) to (5) are adopted, a similarity network is constructed from common features of the biomolecule sequences, and plant lncRNA-miRNA interactions are predicted on that basis using a label propagation algorithm. This solves the technical problem that existing machine-learning-based methods cannot predict interactions for isolated lncRNAs or miRNAs that have no known interaction with any miRNA or lncRNA;

4. Because steps (1) to (3) are adopted, multi-dimensional information, namely the common sequence features and the interaction spectra of the biomolecules, is extracted from the original data, a similarity network is constructed for each of the two information sources, and the two similarities are finally combined into a combined similarity. This overcomes the limitation of a single information source and yields a more comprehensive prediction result;

5. Because steps (4) to (6) are adopted, interactions are predicted separately from the lncRNA perspective and from the miRNA perspective, and the final prediction result is obtained as a weighted sum of the two. This overcomes the limitation of a single-perspective prediction and yields a more reasonable prediction result.

Drawings

FIG. 1 is a flow chart of the method of the present invention for predicting plant lncRNA-miRNA interactions.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

As shown in fig. 1, the present invention provides a method for predicting plant lncRNA-miRNA interaction, comprising the steps of:

(1) acquiring an original data set, and processing the original data set to obtain an lncRNA sequence feature set, an lncRNA interaction spectrum set, an miRNA sequence feature set and an miRNA interaction spectrum set;

The advantage of this step is that the required data features are sorted by type and processed in a uniform format, which speeds up the construction of the similarity networks in step (2).

Specifically, this step includes the following substeps:

(1-1) acquiring a raw data set, and dividing the raw data set into a name set of miRNA (in the present embodiment, the size is 15000), a name set of lncRNA (in the present embodiment, the size is 15000), a binding sequence set of miRNA and lncRNA (in the present embodiment, the size is 15000), and a tag set (in the present embodiment, the size is 15000);

Specifically, the data set in this step is downloaded from the GitHub website.

Each sample in the raw data set includes the name of one miRNA (e.g., ath-miR5020a), the name of one lncRNA (e.g., CNT2086271), the binding sequence of these two biomolecules (e.g., TGGAAGAAGGTGAG), and a tag (1 indicates that the miRNA and the lncRNA in the sample are associated, and 0 indicates that they are not).
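The division of step (1-1) into four parallel sets can be sketched as below. The on-disk layout of the data set is not described in the text, so the tab-separated format and the column order (miRNA name, lncRNA name, binding sequence, tag) are assumptions, as is the function name.

```python
def split_samples(lines):
    """Split per-sample records into four parallel lists, one entry per sample."""
    mi_names, lnc_names, bindings, tags = [], [], [], []
    for line in lines:
        line = line.strip()
        if not line:                      # skip blank lines
            continue
        # ASSUMPTION: four tab-separated fields in this order.
        mi, lnc, binding, tag = line.split("\t")
        mi_names.append(mi)
        lnc_names.append(lnc)
        bindings.append(binding)
        tags.append(int(tag))
    return mi_names, lnc_names, bindings, tags
```

All four lists have the same length (15000 in the embodiment), and entry i of each list belongs to the same sample.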

(1-2) querying each miRNA name in the name set of the miRNAs obtained in the step (1-1) for a corresponding miRNA sequence in a reference sequence, wherein all the miRNA sequences form a sequence set of miRNAs (the size of the miRNA sequences is 15000 in the embodiment);

Specifically, the reference sequence used in this step is the .fa-format reference sequence file downloaded from the miRBase database, release 22.1.
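Looking up each miRNA name in a FASTA reference can be sketched as follows. The function names are hypothetical, and the sketch assumes that the sequence name is the first whitespace-delimited token of each `>` header line; names absent from the reference are returned as `None`.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a {name: sequence} dictionary."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ASSUMPTION: the name is the first token after '>'.
            name = line[1:].split()[0]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line       # sequences may span several lines
    return sequences


def lookup_sequences(names, reference):
    """Return one sequence per name, keeping the order and length of `names`."""
    return [reference.get(name) for name in names]
```

Keeping the output aligned with the input name list preserves the one-entry-per-sample layout, so the resulting sequence set has the same size as the name set.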

(1-3) cutting the sequence set of the miRNA obtained in the step (1-2) by referring to the binding sequence set obtained in the step (1-1) to obtain a sequence set of lncRNA (in the embodiment, the size of the sequence set is 15000);

(1-4) performing deduplication processing on the sequence set of the miRNA obtained in the step (1-2) and the sequence set of the lncRNA obtained in the step (1-3) respectively to obtain a sequence set of the miRNA after deduplication (in the present embodiment, 1340 sequences) and a sequence set of the lncRNA after deduplication (in the present embodiment, 7963 sequences);

In this step, the miRNA sequence set is deduplicated as follows: an empty file is first created; the sequences in the miRNA sequence set are then read in order, and for each sequence it is determined whether it already exists in the newly created file; if not, it is written to the new file; if it does, it is not written and the next sequence in the miRNA sequence set is read. These operations are repeated until every sequence in the miRNA sequence set has been read and checked. The resulting new file is the deduplicated miRNA sequence set. The same procedure is applied to the lncRNA sequence set to obtain the deduplicated lncRNA sequence set.

(1-5) extracting a plurality of features from the sequence set of the miRNA and the sequence set of the lncRNA after the de-duplication in the step (1-4) to obtain a sequence feature set of the miRNA (in the present embodiment, the size of 1340) and a sequence feature set of the lncRNA (in the present embodiment, the size of 7963), respectively.

Specifically, the feature extraction process in this step is implemented using the online tool Pse-in-One 2.0.

The features extracted in this step include k-mer features, GC content (i.e., the proportion of guanine and cytosine in the sequence), the number of base pairs, and minimum free energy.

In step (1-6), the miRNA name set obtained in the step (1-1) is deduplicated as follows: an empty file is first created; the names in the miRNA name set are then read in order, and for each name it is determined whether it already exists in the newly created file; if not, it is written to the new file; if it does, it is not written and the next name in the miRNA name set is read. These operations are repeated until every name in the miRNA name set has been read and checked, and the resulting new file is the deduplicated miRNA name set. The same procedure is applied to the lncRNA name set to obtain the deduplicated lncRNA name set.

(1-7) numbering the deduplicated miRNA name set and the deduplicated lncRNA name set obtained in the step (1-6) sequentially from 1 to obtain an lncRNA number set (in the present embodiment, its size is 7963) and an miRNA number set (in the present embodiment, its size is 1340);

Specifically, in the first interaction matrix generated in step (1-8), the element in the first row and first column is the label in the label set corresponding to the lncRNA name numbered 1 in the lncRNA number set and the miRNA name numbered 1 in the miRNA number set obtained in the step (1-7); the element in the first row and second column is the label corresponding to the lncRNA name numbered 1 and the miRNA name numbered 2, …, and so on; the element in the m-th row and n-th column is the label corresponding to the lncRNA name numbered m and the miRNA name numbered n, wherein m is 7963 and n is 1340.

(1-9) generating an interaction spectrum set of lncRNA (in the present embodiment, its size is 7963) according to the lncRNA corresponding to each number in the lncRNA number set obtained in the step (1-7) and the first interaction matrix obtained in the step (1-8), and generating an interaction spectrum set of miRNA (in the present embodiment, its size is 1340) according to the miRNA corresponding to each number in the miRNA number set obtained in the step (1-7) and the second interaction matrix obtained in the step (1-8);

Specifically, for the lncRNA numbered 1 in the lncRNA number set, the corresponding interaction spectrum represents the interactions between that lncRNA and the miRNAs corresponding to all the numbers in the miRNA number set obtained in the step (1-7), i.e., the first row of the first interaction matrix; for the lncRNA numbered l in the lncRNA number set, the corresponding interaction spectrum represents the interactions between that lncRNA and the miRNAs corresponding to all the numbers in the miRNA number set obtained in the step (1-7), i.e., the l-th row of the first interaction matrix. The interaction spectra of the miRNAs are obtained in the same way.

Specifically, the sequence similarity S_lncS corresponding to the lncRNA sequence feature set is calculated in this step as follows:

(2-1) for each sequence feature in the lncRNA sequence feature set obtained in step (1), calculating the cosine distance between that sequence feature and every other sequence feature to obtain the cosine distance matrix corresponding to the lncRNA sequence feature set (of size 7963 × 7963 in the present embodiment);

The advantage of this step is that the neighborhood of points lying in a similar direction to a given data point in the feature space is determined first, which narrows the neighborhood range and speeds up neighbor selection.

Specifically, for the sequence feature of the l-th lncRNA, the cosine distance to every other sequence feature lies between 0 and 1, and a smaller value means the two features are closer; the cosine distance between a sequence feature and itself is set to 1 so that its own influence is ignored during data reconstruction.

Specifically, in the cosine distance matrix, the first element of the first row is the cosine distance between the 1st lncRNA sequence feature in the sequence feature set and itself (namely 1), the second element of the first row is the cosine distance (between 0 and 1) between the 1st lncRNA sequence feature and the 2nd lncRNA sequence feature in the sequence feature set, …, and so on.
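The cosine distance matrix described above can be sketched as follows. The text does not give the formula, so this sketch assumes the common definition of cosine distance as 1 minus cosine similarity, and assumes non-negative feature vectors with no all-zero rows (which keeps the distances within 0 to 1); the function name and toy features are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(features):
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms                 # assumes no all-zero rows
    dist = 1.0 - unit @ unit.T              # assumed: distance = 1 - cosine similarity
    dist = np.clip(dist, 0.0, 1.0)          # guard against floating-point round-off
    np.fill_diagonal(dist, 1.0)             # self-distance set to 1, as in the text
    return dist

# Toy feature vectors (hypothetical).
features = [[1, 0], [0, 1], [1, 1]]
D = cosine_distance_matrix(features)
```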

Specifically, the subscripts of all elements of the cosine distance matrix obtained in step (2-1) are first recorded to obtain an initial subscript matrix. That is, in the cosine distance matrix, the first element of the first row represents the cosine distance between the 1st lncRNA sequence feature and itself and is recorded as (1, 1) in the initial subscript matrix; the second element of the first row represents the cosine distance between the 1st and the 2nd lncRNA sequence features and is recorded as (1, 2), …, and so on; the j-th element of the i-th row represents the cosine distance between the i-th and the j-th lncRNA sequence features and is recorded as (i, j). Then, for each row of the initial subscript matrix, the cosine distance matrix is queried with the index value represented by each element to obtain the corresponding distance value, the distance values of that row are sorted in ascending order, and the index values in the initial subscript matrix are reordered to match the sorted distance values, yielding the ordered subscript matrix;

Specifically, assume that the first row of the initial subscript matrix is {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9), (1, 10)} and that the first row of the cosine distance matrix obtained in step (2-1) is {1.0, 0.9, 0.1, 0.2, 0.4, 0.6, 0.3, 0.5, 0.8, 0.7}. The first row of the sorted cosine distance matrix is then {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, and the first row of the updated subscript matrix is {(1, 3), (1, 4), (1, 7), (1, 5), (1, 8), (1, 6), (1, 10), (1, 9), (1, 2), (1, 1)}; that is, the 3rd lncRNA sequence feature is closest to the 1st lncRNA sequence feature in cosine distance, followed by the 4th lncRNA sequence feature, then the 7th, and so on. The same operation is performed on every row of the initial subscript matrix to obtain the ordered subscript matrix.
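The sorting of one row can be sketched with an argsort, using the distance values of the example above. The variable names are hypothetical, and a stable sort is assumed so that ties keep their original column order.

```python
import numpy as np

# First row of the cosine distance matrix from the example.
row = np.array([1.0, 0.9, 0.1, 0.2, 0.4, 0.6, 0.3, 0.5, 0.8, 0.7])

# Column positions that sort the row in ascending order (0-based).
order = np.argsort(row, kind="stable")

# Convert to the 1-based column subscripts used in the text.
ordered_columns = (order + 1).tolist()
sorted_row = row[order].tolist()
```

Here `ordered_columns` starts with 3, 4, 7, i.e. the 3rd, 4th and 7th sequence features are the nearest to the 1st one.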

In the present embodiment, the cosine distance neighbor ratio K₁ = 90%.

Specifically, the neighbor indication matrix C₁ has the same size as the sorted cosine distance matrix of step (2-2). First, for the first row of the sorted cosine distance matrix (i.e., the cosine distances between the 1st lncRNA sequence feature and every sequence feature), the index values in the ordered subscript matrix corresponding to the first Int(total number of columns of the cosine distance matrix × cosine distance neighbor ratio) elements of that row are taken (where Int denotes the rounding operation); in the first row of C₁, the elements in the columns given by these index values are all set to 1 and the elements in the remaining columns are all set to 0. Then, for the second row of the sorted cosine distance matrix (i.e., the cosine distances between the 2nd lncRNA sequence feature and every sequence feature), the index values corresponding to the first Int(total number of columns of the cosine distance matrix × cosine distance neighbor ratio) elements of that row are taken; in the second row of C₁, the elements in the columns given by these index values are all set to 1 and the elements in the remaining columns are all set to 0, …, and so on. Finally, the diagonal elements of the neighbor indication matrix C₁ are all set to 0.

More specifically, for the example in (2-2), neighbors are taken according to the ratio K₁: the cosine distance neighbors of the 1st lncRNA sequence feature are the 3rd, 4th, 7th, 5th, 8th, 6th, 10th, 9th and 2nd lncRNA sequence features, so in the first row of the neighbor indication matrix C₁ the elements in the third, fourth, seventh, fifth, eighth, sixth, tenth, ninth and second columns are 1 and the element in the first column is 0.
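A minimal sketch of building the neighbor indication matrix is given below. The function name `neighbor_indicator` is hypothetical, and Int is assumed here to mean truncation toward zero (a small epsilon guards against floating-point error); the toy matrix reuses the example row as its first row and fills the other rows with 1 purely for illustration.

```python
import numpy as np

def neighbor_indicator(dist, ratio):
    n_rows, n_cols = dist.shape
    # Assumption: Int truncates; the epsilon protects against round-off.
    k = int(n_cols * ratio + 1e-9)
    order = np.argsort(dist, axis=1, kind="stable")   # ascending per row
    C = np.zeros((n_rows, n_cols), dtype=int)
    rows = np.arange(n_rows)[:, None]
    C[rows, order[:, :k]] = 1      # mark the k nearest columns in every row
    np.fill_diagonal(C, 0)         # a feature is never its own neighbor
    return C

# Toy 10 x 10 distance matrix: row 0 is the example row, other rows are filler.
dist = np.ones((10, 10))
dist[0] = [1.0, 0.9, 0.1, 0.2, 0.4, 0.6, 0.3, 0.5, 0.8, 0.7]
C1 = neighbor_indicator(dist, 0.9)
```

With a ratio of 90% and 10 columns, 9 neighbors are kept, so the first row of `C1` is 0 in the first column and 1 elsewhere.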

(2-6) according to the neighbor indication matrix C₁ obtained in step (2-3), iteratively updating the linear neighborhood similarity matrix G obtained in step (2-4) by G = (C₁ .* G) .* (M*M' + λ*e') ./ ((C₁ .* G) * (M*M' + λ*e')), where .* and ./ denote element-wise multiplication and division, * denotes matrix multiplication, the Lagrange factor λ = μ*e, μ denotes the weight parameter in the data reconstruction, e is the all-ones column vector, and e' denotes the transpose of e; null values in G are set to 0, the counter is updated as cnt1 = cnt1 + 1, and the process returns to step (2-5);
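The iteration of step (2-6) can be sketched as follows, under several assumptions that are not confirmed by the text shown here: the update is read as G = (C₁ .* G) .* P ./ ((C₁ .* G) * P) with P = M*M' + λ*e', M is taken to be a non-negative feature matrix with one row per lncRNA, G is initialized randomly, "null" values are treated as non-finite values, and a fixed iteration count stands in for the stopping test of step (2-5). All names and default values are hypothetical.

```python
import numpy as np

def linear_neighborhood_similarity(M, C, mu=1.0, max_iter=50, seed=0):
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    e = np.ones((n, 1))                  # all-ones column vector
    lam = mu * e                         # Lagrange factor, lambda = mu * e
    P = M @ M.T + lam @ e.T              # M*M' + lambda*e'
    G = np.random.default_rng(seed).random((n, n))   # assumed initial G
    for _ in range(max_iter):            # assumed stopping rule: fixed count
        CG = C * G                       # element-wise product C .* G
        Q = CG @ P                       # matrix product in the denominator
        with np.errstate(divide="ignore", invalid="ignore"):
            G = CG * P / Q
        G[~np.isfinite(G)] = 0.0         # null (non-finite) entries set to 0
    return G

# Toy inputs (hypothetical): 4 features of length 3 and a 4 x 4 neighbor matrix.
M = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [1, 0, 0]])
C = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
G = linear_neighborhood_similarity(M, C)
```

Because C₁ multiplies G element-wise in every iteration, entries outside the selected neighborhood stay at 0.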

The advantage of this step is that, within the neighborhood of points lying in similar directions, the points at a small straight-line distance from a given data point are determined, which improves the accuracy of the neighborhood.

Specifically, for the l-th lncRNA sequence feature, the smaller its Euclidean distance to the sequence feature of another lncRNA, the closer the two are.

Specifically, the element in the first row and first column of the Euclidean distance matrix is the Euclidean distance between the first row of the matrix M₂ and itself, the element in the first row and second column is the Euclidean distance between the first and second rows of the matrix M₂, …, and so on.
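The row-to-row Euclidean distance matrix can be sketched as below. The function name and the toy matrix are hypothetical; the broadcasting approach is chosen for brevity and would need a more memory-efficient variant for a matrix with thousands of rows.

```python
import numpy as np

def euclidean_distance_matrix(M2):
    M2 = np.asarray(M2, dtype=float)
    # diff[i, j, :] is row i minus row j.
    diff = M2[:, None, :] - M2[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy matrix standing in for M2 (hypothetical values).
M2 = [[0, 0], [3, 4], [6, 8]]
E = euclidean_distance_matrix(M2)
```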

Specifically, the process of constructing the new ordered subscript matrix is identical to the process of constructing the ordered subscript matrix in step (2-2) and is not repeated here.

In the present embodiment, the Euclidean distance neighbor ratio K₂ = 100%.

The process of this step is identical to that of the above step (2-3), and is not described herein again.

The lncRNA interaction spectrum similarity S_lncP, the miRNA sequence feature similarity S_miS, and the miRNA interaction spectrum similarity S_miP are calculated in a manner similar to the lncRNA sequence feature similarity S_lncS, which is not repeated here.

The advantage of this step is that a more accurate similarity network is constructed from multiple information sources of the biomolecules, which improves the accuracy of predicting interactions between known lncRNAs and miRNAs and also allows potential interactions of isolated lncRNAs and miRNAs to be predicted. For an isolated lncRNA or miRNA, sequence feature similarity helps predict its likely interactions.

Specifically, for the first interaction matrix of 7963 rows and 1340 columns, this step randomly selects 7963 × 1340 / 5 = 2134084 elements and sets them to zero.
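The random zeroing of one fifth of the elements can be sketched as follows. The function name, the seed and the toy matrix are hypothetical; sampling positions without replacement is an assumption, and the second training matrix is taken to be the transpose of the first.

```python
import numpy as np

def make_training_matrix(interaction, seed=0):
    rng = np.random.default_rng(seed)
    n_hidden = interaction.size // 5          # one fifth of all elements
    # Assumption: positions are drawn uniformly without replacement.
    idx = rng.choice(interaction.size, size=n_hidden, replace=False)
    train = interaction.copy()
    np.put(train, idx, 0)                     # zero the selected flat positions
    return train

# Toy all-ones matrix so the effect of zeroing is easy to count.
interaction = np.ones((10, 5), dtype=int)
first_train = make_training_matrix(interaction)
second_train = first_train.T                  # transposed training matrix

# Number of elements that would be zeroed for a 7963 x 1340 matrix.
n_hidden_embodiment = 7963 * 1340 // 5
```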

Specifically, the propagation parameter α is set to 0.7 in the label propagation algorithm; label propagation is performed on the first training matrix using the lncRNA combined similarity to obtain a first prediction matrix, and label propagation is performed on the second training matrix using the miRNA combined similarity to obtain a second prediction matrix.
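The text shown here does not spell out the propagation rule, so the sketch below uses a commonly used form of label propagation, F ← α·S·F + (1 − α)·Y, iterated until the change is small. This form, the row-normalized toy similarity matrix, the tolerance and the function name are all assumptions made for illustration.

```python
import numpy as np

def label_propagation(S, Y, alpha=0.7, tol=1e-6, max_iter=1000):
    S = np.asarray(S, dtype=float)
    Y = np.asarray(Y, dtype=float)
    F = Y.copy()
    for _ in range(max_iter):
        # Mix information from similar nodes with the initial labels.
        F_next = alpha * S @ F + (1 - alpha) * Y
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next
    return F

# Toy row-normalized similarity (3 nodes) and training labels (3 x 2).
S = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
Y = np.array([[1, 0], [0, 1], [0, 0]])
F = label_propagation(S, Y)

# Closed-form fixed point of the same rule, for comparison.
F_closed = (1 - 0.7) * np.linalg.inv(np.eye(3) - 0.7 * S) @ Y
```

With a 7963 × 7963 lncRNA similarity and a 7963 × 1340 training matrix the same product S·F is well defined; the miRNA side uses the 1340 × 1340 similarity with the transposed training matrix.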

The advantage of this step is that the lncRNA-side and miRNA-side predictions are combined, which offsets the inaccurate values and large discrepancies produced by either prediction alone and yields a more reasonable and stable prediction result.

Specifically, the final prediction matrix of this step = β × the first prediction matrix + (1 − β) × the second prediction matrix, where the weight β of the first prediction matrix is 0.35.

Specifically, if the element in the first row and first column of the final prediction matrix is 1, the lncRNA numbered 1 in the lncRNA numbering set and the miRNA numbered 1 in the miRNA numbering set interact; if the element in the first row and second column is 1, the lncRNA numbered 1 and the miRNA numbered 2 interact, …, and so on; if the element in the i-th row and j-th column is 1, the lncRNA numbered i in the lncRNA numbering set and the miRNA numbered j in the miRNA numbering set interact.
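The weighted combination can be sketched as below. The sketch assumes that the second prediction matrix is stored in miRNA × lncRNA orientation and must be transposed back before the sum; the function name and toy values are hypothetical.

```python
import numpy as np

def combine_predictions(first_pred, second_pred, beta=0.35):
    first_pred = np.asarray(first_pred, dtype=float)
    second_pred = np.asarray(second_pred, dtype=float)
    # Assumption: a shape mismatch means the second matrix is transposed.
    if second_pred.shape != first_pred.shape:
        second_pred = second_pred.T
    return beta * first_pred + (1 - beta) * second_pred

# Toy predictions: 2 lncRNAs x 3 miRNAs, and the miRNA-side matrix (3 x 2).
first_pred = [[1, 0, 0], [0, 1, 0]]
second_pred = [[0, 0], [1, 1], [0, 0]]
final_pred = combine_predictions(first_pred, second_pred)
```

For square inputs the shape test cannot detect the orientation, so the caller would have to transpose explicitly.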

Comparison results

Several evaluation metrics were selected. In the results obtained by the invention, the area under the ROC curve (Area Under Curve, AUC for short) is 0.979, the recall (REC for short) is 0.9626, the specificity (SPE for short) is 0.9994, and the area under the PR curve (Area Under the Precision-Recall Curve, AUPR for short) is 0.5297.
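For reference, recall and specificity follow the standard confusion-matrix definitions, REC = TP / (TP + FN) and SPE = TN / (TN + FP). The sketch below computes them on hypothetical toy labels; it does not reproduce the reported values.

```python
def recall_and_specificity(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels (hypothetical): 3 positives and 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
rec, spe = recall_and_specificity(y_true, y_pred)
```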

The invention selects the PmliPred model (published on the Bioinformatics official website) and the CINN model (published on the Springer official website) as reference methods; both are used to predict plant lncRNA-miRNA interactions. PmliPred builds its prediction model by combining a machine learning method with a deep learning method, and its final prediction is a fuzzy decision made from the two parts; CINN uses CNN and IndRNN to construct an integrated deep learning model, the former automatically extracting functional features of gene sequences and the latter obtaining sequence feature representations and dependency relationships. Both methods perform well, but not as well as the method of the invention, especially in AUC and ACC values. Since both methods use deep learning, the invention also builds the hybrid models CNNRF1 and CNNRF2 in the same manner and reports their results. The reference method SLNSM (Springer official website), which was created for prediction in animals, was also run on the data set, and the method of the invention was observed to be superior, which demonstrates its advantage. The specific results are as follows:

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for predicting plant lncRNA-miRNA interaction, which is characterized by comprising the following steps:

2. The method of claim 1, wherein each sample in the raw data set comprises the name of a miRNA, the name of an lncRNA, the binding sequences of these two biomolecules in the sample, and a label, wherein a label of 1 indicates that the miRNA in the sample is associated with the lncRNA, and a label of 0 indicates that the miRNA in the sample is not associated with the lncRNA.

3. The method for predicting a plant lncRNA-miRNA interaction of claim 1 or 2,

the step (1-4) of de-duplicating the miRNA sequence set comprises first creating an empty file, then reading the miRNA sequence set in order and judging whether the currently read miRNA sequence already exists in the newly created file; if not, the currently read miRNA sequence is written into the new file; if it already exists, nothing is written and the next sequence in the miRNA sequence set is read. These operations are repeated until all sequences in the miRNA sequence set have been read and judged. The resulting new file is the de-duplicated miRNA sequence set;

4. The method for predicting plant lncRNA-miRNA interaction according to any one of claims 1 to 3, wherein the step (1-6) of de-duplicating the miRNA name set comprises first creating an empty file, then reading the miRNA name set in order and judging whether the currently read miRNA name already exists in the newly created file; if not, the currently read miRNA name is written into the new file; if it already exists, nothing is written and the next name in the miRNA name set is read; these operations are repeated until all names in the miRNA name set have been read and judged, and the resulting new file is the de-duplicated miRNA name set.

5. The method for predicting plant lncRNA-miRNA interaction of claim 4, wherein the element in the first row and first column of the first interaction matrix is the label, in the label set, corresponding to the lncRNA name numbered 1 in the lncRNA numbering set and the miRNA name numbered 1 in the miRNA numbering set obtained in step (1-7); the element in the first row and second column is the label corresponding to the lncRNA name numbered 1 and the miRNA name numbered 2, …, and so on; the element in the m-th row and n-th column is the label corresponding to the lncRNA name numbered m and the miRNA name numbered n, wherein m is 7963 and n is 1340.

6. The method for predicting plant lncRNA-miRNA interaction of claim 5, wherein in the step (1-9), for the lncRNA corresponding to number 1 in the lncRNA numbering set, the interaction spectrum represents the interactions between that lncRNA and the miRNAs corresponding to all numbers in the miRNA numbering set obtained in step (1-7), namely the first row of the first interaction matrix; and for the lncRNA corresponding to number l in the lncRNA numbering set, the interaction spectrum represents the interactions between that lncRNA and the miRNAs corresponding to all numbers in the miRNA numbering set obtained in step (1-7), namely the l-th row of the first interaction matrix.

7. The method for predicting plant lncRNA-miRNA interaction of claim 6, wherein the sequence similarity S_lncS corresponding to the lncRNA sequence feature set is calculated in the step (2) by the following steps:

(2-8) calculating the Euclidean distance between the rows of the matrix M₂ obtained in step (2-7) to obtain a Euclidean distance matrix;

8. The method for predicting plant lncRNA-miRNA interaction of claim 7,

in the cosine distance matrix of the step (2-1), the first element of the first row is the cosine distance between the 1st lncRNA sequence feature in the sequence feature set and itself, the second element of the first row is the cosine distance between the 1st lncRNA sequence feature and the 2nd lncRNA sequence feature in the sequence feature set, …, and so on;

9. The method for predicting plant lncRNA-miRNA interaction of claim 8,

in step (2-3), the neighbor indication matrix C₁ has the same size as the cosine distance matrix sorted in step (2-2);

the step (2-3) is specifically as follows: first, for the first row of the sorted cosine distance matrix, the index values in the ordered subscript matrix corresponding to the first Int(total number of columns of the cosine distance matrix × cosine distance neighbor ratio) elements of that row are taken (wherein Int denotes the rounding operation); in the first row of the neighbor indication matrix C₁, the elements in the columns given by these index values are all set to 1 and the elements in the remaining columns are all set to 0; then, for the second row of the sorted cosine distance matrix, the index values in the ordered subscript matrix corresponding to the first Int(total number of columns of the cosine distance matrix × cosine distance neighbor ratio) elements of that row are taken; in the second row of C₁, the elements in the columns given by these index values are all set to 1 and the elements in the remaining columns are all set to 0, …, and so on; finally, the diagonal elements of the neighbor indication matrix C₁ are all set to 0;

10. The method of predicting plant lncRNA-miRNA interactions according to claim 9, wherein the final prediction matrix of step (6) = β × the first prediction matrix + (1 − β) × the second prediction matrix, wherein the weight β of the first prediction matrix is 0.35.