CN111816255B - RNA binding protein recognition incorporating multi-view and optimal multi-tag chain learning - Google Patents

Publication number
CN111816255B
CN111816255B (application CN202010658127.8A)
Authority
CN
China
Prior art keywords
layer
view
rna
feature extraction
network architecture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010658127.8A
Other languages
Chinese (zh)
Other versions
CN111816255A (en)
Inventor
邓赵红
杨海涛
吴敬
王蕾
王士同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202010658127.8A priority Critical patent/CN111816255B/en
Publication of CN111816255A publication Critical patent/CN111816255A/en
Application granted granted Critical
Publication of CN111816255B publication Critical patent/CN111816255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention belongs to the field of bioinformatics and relates to RNA binding protein recognition integrating multi-view and optimal multi-label chain learning. The method comprises a training stage and a using stage; the training stage comprises initial multi-view data construction, multi-view deep feature extraction model training, multi-label feature learning, and optimal multi-label chain classifier training. The multiple views include an RNA sequence view, an amino acid sequence view, a multi-gap dipeptide component view, and an RNA sequence semantic view. To improve the effectiveness of the multi-view features, the invention uses CNN-based deep learning to construct deep multi-view features from the initial multi-view data. To link the multi-view features with multi-label learning, the invention establishes a multi-label feature learning model that integrates the advantages of all views, and uses an optimal CC chain classifier, different from the common CC multi-label classifier, to learn the associations between labels, thereby improving classification accuracy more effectively.

Description

RNA binding protein recognition integrating multi-view and optimal multi-label chain learning
Technical Field
The invention belongs to the field of bioinformatics and relates to RNA binding protein recognition integrating multi-view and optimal multi-label chain learning.
Background
RNA (ribonucleic acid) is a carrier of genetic information present in biological cells and in some viruses and viroids. In living organisms it mainly regulates the expression of coding genes and serves as the template for protein synthesis after gene transcription, making it an indispensable component of life. For an RNA to perform its function successfully, it is generally mediated by an RNA binding protein (RBP); the lack of an RBP may leave a class of RNAs unable to perform their regulatory or translational functions, causing the organism to lack important proteins or to produce proteins in abnormal excess, impairing its own functions.
RNA binding proteins (RBPs) are key participants in post-transcriptional events, and their domain versatility and structural flexibility enable them to control the metabolism of a large number of transcripts. RBPs are involved in almost all steps of the post-transcriptional regulatory layer, establishing highly dynamic interactions with other proteins and with coding and non-coding RNAs to produce functional units called ribonucleoprotein complexes, which regulate RNA cleavage, polyadenylation, stability, localization, translation, and degradation. Certain specific RBPs have been found to modulate the RNA synthesis of oncoproteins and tumor suppressor proteins; deciphering the intricate binding network between RBPs and their cancer-associated RNA targets would therefore provide a better understanding of tumor biology and potentially reveal new approaches to treating cancer.
With the rapid development of big data and sequencing technologies, it is impractical for the biomedical industry to experimentally test the binding of every pair of RNA and RBP, so many algorithms have emerged that identify RBP binding sites from RNA sequences using machine learning models. For example, Maticzka et al. proposed GraphProt, which learns the binding preferences of RBP sequences and structures from high-throughput experimental data with a unique computational framework; Corrado et al. proposed RNAcommender, a binding-site prediction method that can recommend RNA targets to unexplored RBPs by taking into account the protein structure and the predicted secondary structure of the RNA through available interaction information; and HOCNNLB, proposed by Zhang et al., uses high-order nucleotide encodings as initial features to predict whether a given RNA segment is a binding site. These methods focus on using the sequence or structural features of the original RNA sequence to decide whether a given RNA fragment is a binding site for a particular RBP, but few methods use existing RNA-RBP binding information to aid prediction. To this end, Pan et al. proposed the iDeepM method, which uses multi-label classification and deep learning to find the multiple RBPs an RNA can bind, successfully achieving the desired effect of multi-label classification. However, iDeepM also suffers from the following disadvantages: the single-view RNA sequence data has some effectiveness for prediction, but the limited information content of the RNA sequence alone leads to lower accuracy; moreover, the method uses a convolutional neural network and a long short-term memory network for multi-label classification, which cannot fully learn the relations between labels, affecting prediction accuracy.
Disclosure of Invention
The method disclosed by the invention realizes RNA binding protein recognition integrating multi-view and optimal multi-label chain learning. It comprises a training stage and a using stage; the training stage comprises initial multi-view feature construction, deep multi-view feature extraction model training, multi-label feature learning, and optimal multi-label chain classifier training.
Training stage: the initial multi-view feature construction model uses molecular biology principles, statistical principles, and Word2Vec (word vector) technology to convert the original RNA sequence into an amino acid sequence, multi-gap dipeptide components, and an RNA sequence semantic matrix, obtaining sequence-order, component, and semantic features; together with the original RNA sequence these constitute the initial multi-view features, yielding the initial multi-view feature construction model. The deep multi-view feature extraction model constructs four convolutional neural networks and trains them on the four initial view features to obtain deep multi-view features with better classification ability, yielding the deep multi-view feature extraction model. The extracted deep features are used to train the multi-label feature learning model, and the label-related weighted feature vectors learned by that model are used to train the optimal CC multi-label chain classifier to learn the associations between labels, producing a model capable of identifying RNA binding proteins.
Using stage: acquire the RNA sequence to be tested and construct its initial multi-view features using molecular biology principles, statistical principles, and Word2Vec (word vector) technology; extract the deep features of the 4 views with the four trained convolutional neural networks; weight the spliced deep features of the 4 views with the trained multi-label feature learning model to obtain the label-related weighted feature vectors; and finally predict on the label-related weighted feature vectors with the trained optimal CC multi-label chain classifier to obtain the final prediction result.
In the RNA binding protein recognition integrating multi-view and optimal multi-label chain learning, the multi-view deep feature learning technique optimizes the feature expression of the deep learning structure; the multi-label feature learning technique integrates and corrects the deep features of each view, fully exploiting the strengths of each view's features and establishing label-related weighted feature vectors; and the multi-label technique effectively exploits both the independence of each label and the correlations among labels. Effectively combining the multi-view deep learning, multi-label feature learning, and multi-label learning techniques allows the effective information in an RNA sequence to be fully extracted and improves the generalization ability of the classifier.
The RNA sequence is a segment of biological genetic material described as a text sequence, and a deep convolution model cannot process text directly, so the RNA text sequence must first be preprocessed into a numerical form the program can accept. One-hot encoding is a currently popular technique: a character sequence of length m composed of n distinct elements is turned into an n x m matrix in which each element is converted into an n-dimensional orthonormal basis vector placed at its corresponding position among the m columns. For an RNA sequence of length m, one-hot encoding constructs an initial blank matrix of size 4 x m, and each base is converted into a 4-dimensional orthonormal basis vector filled into the position corresponding to that base in the sequence, as shown in fig. 7. The row headings correspond to a specific RNA sequence whose actual length is 2700. Against the base positions in the column headings, base A in the sequence is expressed as the vector (1, 0, 0, 0)^T, base C as (0, 1, 0, 0)^T, base G as (0, 0, 1, 0)^T, and base U as (0, 0, 0, 1)^T, and so on.
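As a concrete illustration of the one-hot scheme just described, the sketch below (my own illustrative Python, not code from the patent; the function name `one_hot_rna` is an assumption) builds the 4 x m matrix for a short sequence:

```python
# One-hot encode an RNA string into a 4 x m matrix: row order A, C, G, U,
# column j holding the basis vector of the j-th base.
BASES = "ACGU"

def one_hot_rna(seq):
    """Return a 4 x len(seq) matrix (list of 4 rows) one-hot encoding seq."""
    m = len(seq)
    mat = [[0] * m for _ in range(4)]
    for j, base in enumerate(seq):
        mat[BASES.index(base)][j] = 1   # 4-dim orthonormal basis vector
    return mat
```

For example, `one_hot_rna("ACGU")` places a single 1 per column, matching the (1,0,0,0)^T ... (0,0,0,1)^T vectors above.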
The initial feature matrix constructed this way helps feature extraction but carries relatively little information. The amino acid sequence is composed of 20 amino acids and is far richer in information than the RNA sequence, so a one-hot encoding matrix obtained from the converted amino acid sequence can serve feature extraction better. Translating an RNA sequence into an amino acid sequence is unidirectional and unique, but because one amino acid can correspond to several base combinations, the resulting amino acid sequence cannot be restored to the original RNA sequence, which would lose and misrepresent information. For example, the base combination GCA translates into the fixed amino acid A, but the amino acid A may stand for GCA, GCC, GCG, or GCU. To address this problem, three modes of translating the RNA sequence into an amino acid sequence are used: the first form starts translation from the first base, the second skips the first base before starting, and the third skips the first and second bases. This converts an RNA sequence of length m into 3 amino acid sequences of length m/3, and through complementary sequence information the three forms together can restore the original RNA sequence. The base combination GCA above can be uniquely determined from the amino acids R, A, H at the corresponding positions of the three forms' sequences. Splicing the three forms of amino acid sequences therefore yields an amino acid long chain of length m that completely inherits the sequence information of the original RNA sequence while having a richer representation.
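The three-reading-frame translation can be sketched as follows (my own illustrative code, not the patent's). `CODONS` is only a fragment of the standard codon table; any codon not listed maps to the placeholder amino acid 'O', loosely mirroring the patent's added 21st amino acid, and the function names are assumptions:

```python
# Fragment of the standard RNA codon table (assumption: unlisted codons -> 'O').
CODONS = {"GCA": "A", "GCC": "A", "GCG": "A", "GCU": "A",
          "CGA": "R", "CGC": "R", "CAC": "H", "CAU": "H"}

def translate(rna, frame):
    """Translate one reading frame (0, 1 or 2) codon by codon."""
    out = []
    for i in range(frame, len(rna) - 2, 3):
        out.append(CODONS.get(rna[i:i + 3], "O"))
    return "".join(out)

def three_frame_chain(rna):
    """Concatenate the three frame translations into one long chain,
    as in the patent's spliced amino acid long chain of length ~m."""
    return translate(rna, 0) + translate(rna, 1) + translate(rna, 2)
```

For instance, `three_frame_chain("GCAGCC")` translates frame 0 to "AA" and the two shifted frames to placeholder residues under this partial table.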
This long chain is one-hot encoded on the same principle as the RNA sequence, giving an initial feature matrix of size 20 x m, shown in fig. 8; this is the amino acid view data proposed by the invention. The row headings correspond to a specific amino acid sequence whose actual length is 2700. Against the amino acid positions in the column headings, every amino acid in a row sequence can be represented as a 20-dimensional orthonormal basis vector.
Both the RNA-view and amino-acid-view data described above favor extraction of sequence-order features, but the composition of a sequence is just as important as its order. Because 0-gap dipeptides capture the two-dimensional sequence component composition while 1-gap dipeptides carry three-dimensional structural component information, extracting RNA sequence component information with both works best, and the invention adopts this combined form to build the multi-gap dipeptide component view. Since a dipeptide is sensitive to the left-right arrangement of its amino acids, the 21 amino acids used in the invention (20 natural amino acids plus the added placeholder amino acid O) yield 2 x 21 x 21 = 882 multi-gap dipeptide species; the combinations OO and O*O are discarded as not meaningful for this study, leaving 880. The feature vector is obtained by counting the occurrences of these 880 multi-gap dipeptides, which effectively captures the component information of the amino acid and RNA sequences as well as the spatial component information of the amino acids. Because an 880-dimensional feature vector is one-dimensional, extracting deep features from it directly is not ideal, so it is converted into a two-dimensional histogram from which a machine learning model can extract deep features more effectively, as shown in fig. 9. The abscissa of the upper table in the figure is the multi-gap dipeptide species, where "AA" denotes the 0-gap dipeptide with alanine on both sides, 18 being its count in the sample amino acid sequence, and "A*D" denotes the 1-gap dipeptide with alanine on the left, any amino acid in the middle, and aspartic acid on the right. The figure shows only 12 multi-gap dipeptides; the actual number is 880.
The lower graph shows the transformed histogram; the upper limit of the count of each multi-gap dipeptide is set to 30, so a 30 x 880 matrix is taken as the initial multi-gap dipeptide data of this amino acid sequence.
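The counting step behind the 880-dimensional vector can be sketched as follows (illustrative only, not the patent's code; the function name is an assumption). 0-gap dipeptides are adjacent pairs like "AA", 1-gap patterns like "A*D" skip the middle residue, and the pairs OO and O*O are discarded as described above:

```python
from collections import Counter

def multigap_dipeptides(chain):
    """Count 0-gap pairs ('AA') and 1-gap patterns ('A*D') in an
    amino acid chain, skipping the discarded OO and O*O combinations."""
    counts = Counter()
    for i in range(len(chain) - 1):          # 0-gap dipeptides
        pair = chain[i] + chain[i + 1]
        if pair != "OO":
            counts[pair] += 1
    for i in range(len(chain) - 2):          # 1-gap dipeptides
        pat = chain[i] + "*" + chain[i + 2]
        if pat != "O*O":
            counts[pat] += 1
    return counts
```

Clipping each count at 30 and stacking the 880 columns then gives the 30 x 880 histogram matrix described above.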
Natural language processing (NLP) is an important direction in computer science and artificial intelligence, and from the standpoint of initial data, bioinformatics has the same form as NLP's initial research data. NLP methods can therefore be used to solve text encoding and initial feature construction in bioinformatics. The invention uses 6-mer RNAs as the vocabulary for training a semantic model; a 6-mer RNA is a structure of 6 consecutive bases, so the vocabulary consists of 4^6 = 4096 6-mer RNAs in total. The invention uses the currently popular Word2Vec technology to build the semantic model, whose principle is shown in fig. 10. Based on the 92102 RNA sequences in the dataset used in the invention, the following operations are performed one by one: 1) a sliding window of 6 bases obtains the order of the 6-mer RNAs in the RNA sequence; 2) each 6-mer RNA is encoded as its position among the 4096 possibilities (by the rule that 'AAAAAA' is 1 and 'UUUUUU' is 4096); 3) each pair of adjacent 6-mer RNAs is taken as a feature X and a label Y and fed to the semantic model for training; 4) the word vectors of the 4096 6-mer RNAs are extracted from the trained semantic model; 5) each 6-mer RNA in an RNA sequence is replaced by its word vector to construct the RNA sequence semantic matrix. The RNA sequence semantic matrix formed from the 6-mer RNA word vectors not only has a smaller dimensionality but also contains the order of the RNA sequence with 6-base motifs and the contextual structure information, enabling better deep feature learning.
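Steps 1) to 3) above can be sketched as follows (illustrative, not the patent's code). The 1..4096 ranking is assumed to be lexicographic with A=0, C=1, G=2, U=3 as base-4 digits, which is consistent with 'AAAAAA' being 1 and 'UUUUUU' being 4096; all function names are mine:

```python
DIGIT = {"A": 0, "C": 1, "G": 2, "U": 3}   # assumed base-4 digit order

def six_mers(rna):
    """All overlapping 6-mers, stride 1 (the sliding window of step 1)."""
    return [rna[i:i + 6] for i in range(len(rna) - 5)]

def encode(kmer):
    """1-indexed base-4 rank of a 6-mer among the 4096 possibilities."""
    n = 0
    for ch in kmer:
        n = n * 4 + DIGIT[ch]
    return n + 1

def skipgram_pairs(rna):
    """Adjacent 6-mer (feature X, label Y) pairs of step 3, as would be
    fed to a Word2Vec-style model."""
    ks = six_mers(rna)
    return list(zip(ks, ks[1:]))
```

Training an actual Word2Vec model on these pairs (step 3 onward) would be done with a word-embedding library; the word vectors then replace each 6-mer to form the semantic matrix of step 5.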
The specific steps of this part are as follows:
Step 1: use the one-hot conversion matrix of the original RNA sequence as the initial RNA feature X1.
Step 2: convert the original RNA sequence into an amino acid sequence using molecular biology principles and the one-hot method to obtain the initial amino acid sequence feature X2.
Step 3: convert the amino acid sequence into the initial multi-gap dipeptide component feature X3 using statistical principles.
Step 4: train an RNA sequence semantic model with Word2Vec, obtain the 6-mer RNA word vectors, and form the RNA sequence semantic matrix as the initial RNA sequence semantic feature X4. This yields the preliminary multi-view dataset D = {X1, X2, X3, X4, y}.
The deep multi-view feature extraction part of the invention uses convolutional neural networks to automatically extract each view's features from the RNA sequence. From an original RNA sequence, preprocessing yields the amino acid sequence features, the multi-gap dipeptide component features, and the RNA sequence semantic features; four different convolutional neural networks are then constructed, one per view, to automatically extract deep features from the four views.
During training, the CNN computes the error from the final output layer and backpropagates it, which drives the network's learning. Because the feature vector computed by the penultimate layer passes through only one fully connected layer to reach the output layer, optimizing the network against its output layer also optimizes the expression of the penultimate layer's output feature vector; that is, the network learns a better feature representation while it trains, so the output of the penultimate layer is chosen as the learned feature. Features learned automatically by the convolutional neural network have lower dimensionality than the original features and, through nonlinear combination, better discriminative power, giving the subsequent classification model better generalization.
Figs. 11, 12, 13, and 14 show the CNN network architectures used for the four views' deep feature extraction. The notation k@m×n describes the feature maps of each layer: k is the number of feature maps and m×n their size. A two-dimensional convolution kernel is written k×m×n, where k is the number of kernels and m×n the kernel size; the stride of the convolution kernels defaults to 1. The input of each network is the features of one view, and the output is a vector of length 68 (covering binding of the RNA sequence to 67 RBPs plus a 'binds none' label). Each of the first 67 dimensions equals 1 if the sample can bind the RBP of that dimension and 0 otherwise; the 68th dimension equals 1 if the sample RNA sequence binds none of the first 67 RBPs, and 0 otherwise.
Fig. 11 shows the CNN architecture for RNA-view deep feature extraction, comprising 1 two-dimensional convolution layer, 1 pooling layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers. The input to the network is a 4 x 2710 two-dimensional matrix. The first layer is a convolution layer with 101 kernels of size 4 x 10, producing 101 feature maps of 1 x 2701; the second is a pooling layer of length 3, producing 101 feature maps of 1 x 900; the third is a flattening layer, producing 1 feature map of 1 x 90900; the fourth is a dropout layer with probability 0.5, producing 1 feature map of 1 x 90900; the fifth is a fully connected layer converting the 1 x 90900 feature map into a 1 x 202 vector; the sixth is a dropout layer with probability 0.5, producing 1 feature map of 1 x 202; and the seventh is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
Fig. 12 shows the CNN architecture for amino-acid-view deep feature extraction, comprising 1 two-dimensional convolution layer, 1 pooling layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers. The input is a 20 x 2710 two-dimensional matrix. The first layer is a convolution layer with 101 kernels of size 20 x 10, producing 101 feature maps of 1 x 2701; the second is a pooling layer of length 3, producing 101 feature maps of 1 x 900; the third is a flattening layer, producing 1 feature map of 1 x 90900; the fourth is a dropout layer with probability 0.5, producing 1 feature map of 1 x 90900; the fifth is a fully connected layer converting the 1 x 90900 feature map into a 1 x 202 vector; the sixth is a dropout layer with probability 0.5, producing 1 feature map of 1 x 202; and the seventh is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
Fig. 13 shows the CNN architecture for multi-gap-dipeptide-view deep feature extraction, comprising 1 two-dimensional convolution layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers. The input to the network is a 30 x 880 two-dimensional matrix. The first layer is a convolution layer with 101 kernels of size 30 x 10, producing 101 feature maps of 1 x 871; the second is a flattening layer, producing 1 feature map of 1 x 87971; the third is a dropout layer with probability 0.5, producing 1 feature map of 1 x 87971; the fourth is a fully connected layer converting the 1 x 87971 feature map into a 1 x 202 vector; the fifth is a dropout layer with probability 0.5, producing 1 feature map of 1 x 202; and the sixth is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
Fig. 14 shows the CNN architecture for RNA-sequence-semantic-view deep feature extraction, comprising 1 two-dimensional convolution layer, 1 pooling layer, 1 flattening layer, 2 dropout layers, and 2 fully connected layers. The input is a 25 x 2710 two-dimensional matrix. The first layer is a convolution layer with 101 kernels of size 25 x 10, producing 101 feature maps of 1 x 2701; the second is a pooling layer of length 3, producing 101 feature maps of 1 x 900; the third is a flattening layer, producing 1 feature map of 1 x 90900; the fourth is a dropout layer with probability 0.5, producing 1 feature map of 1 x 90900; the fifth is a fully connected layer converting the 1 x 90900 feature map into a 1 x 202 vector; the sixth is a dropout layer with probability 0.5, producing 1 feature map of 1 x 202; and the seventh is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
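The layer sizes quoted in the four architectures follow from simple shape arithmetic for stride-1 convolutions without padding and length-3 non-overlapping pooling. The sketch below (illustrative, not the patent's code) reproduces the 90900 and 87971 flattened sizes; the dipeptide input width of 880 is assumed from the 1 x 871 feature maps:

```python
def conv_len(length, kernel):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel + 1

def pool_len(length, size):
    """Output length of non-overlapping pooling of the given size."""
    return length // size

# RNA view (fig. 11): width-2710 input, 101 kernels of width 10, pooling 3
flat = 101 * pool_len(conv_len(2710, 10), 3)   # 101 * 900 = 90900

# Multi-gap dipeptide view (fig. 13): width-880 input, kernels of width 10,
# no pooling layer
flat_dipep = 101 * conv_len(880, 10)           # 101 * 871 = 87971
```

The same arithmetic applies to the amino acid and semantic views, whose inputs share the width 2710.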
The last layer of each of the four networks uses a sigmoid function as the activation to introduce a nonlinear transformation, expressed as follows:
S(x) = 1 / (1 + e^(-x))
The remaining layers all use the ReLU function as their activation, expressed as follows:
R(x)=max(0,x)
The loss function of the network is the binary cross entropy loss, defined as follows:
Loss = -Σ_i [ p(x_i) log q(x_i) + (1 - p(x_i)) log(1 - q(x_i)) ]
where p(x_i) and q(x_i) denote the membership of the sequence x in class i: p is the true label value, i.e. 1 or 0, and q is the predicted value, with q ∈ (0, 1) because of the sigmoid activation.
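The sigmoid activation and binary cross entropy loss described above can be written out as a plain-Python sketch (illustrative; real training code would use a deep learning framework's built-in versions):

```python
import math

def sigmoid(x):
    """S(x) = 1 / (1 + e^-x), squashing outputs into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bce(p, q, eps=1e-12):
    """Mean binary cross entropy between true labels p (0/1) and
    predicted probabilities q in (0, 1)."""
    total = 0.0
    for pi, qi in zip(p, q):
        qi = min(max(qi, eps), 1.0 - eps)   # clip for numerical safety
        total += -(pi * math.log(qi) + (1 - pi) * math.log(1 - qi))
    return total / len(p)
```

For instance, a uniform 0.5 prediction over any 0/1 labels gives a loss of ln 2 per label, the usual chance-level baseline.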
The specific steps of this part are as follows:
Step 1: train the RNA sequence deep feature extraction network with X1 and y, and take the penultimate layer of the CNN architecture used for RNA-view deep feature extraction as the RNA sequence deep feature.
Step 2: train the amino acid sequence deep feature extraction network with X2 and y, and take the penultimate layer of the CNN architecture used for amino-acid-view deep feature extraction as the amino acid sequence deep feature.
Step 3: train the multi-gap dipeptide component deep feature extraction network with X3 and y, and take the penultimate layer of the CNN architecture used for multi-gap-dipeptide-view deep feature extraction as the multi-gap dipeptide component deep feature.
Step 4: train the RNA sequence semantic deep feature extraction network with X4 and y, and take the penultimate layer of the CNN architecture used for RNA-sequence-semantic-view deep feature extraction as the RNA sequence semantic deep feature, thereby obtaining the deep multi-view dataset.
The invention uses an optimal multi-label chain learning algorithm based on multiple views, comprising two parts: multi-label feature learning and optimal multi-label chain learning. The CC algorithm is a multi-label classification algorithm that can efficiently learn the associations between labels: it constructs multiple binary classifiers to predict the corresponding labels, and after each binary classifier is trained, the label predicted by that classifier is appended to the initial features, which then serve as the input features for training the next binary classifier, until all classifiers are trained. Unlike existing methods, the invention improves the single-view CC algorithm and applies it to the multi-view scenario, injecting the advantages of multi-view data into the CC algorithm so that it learns the associations between labels better. The specific principle is shown in fig. 15. The algorithm is divided into two parts: multi-label feature learning and multi-label learning. First, the deep feature vectors of all views are obtained from the upstream CNN models, spliced together, and fed into the multi-label feature learning model for training. The input of this model is an 808-dimensional vector and the output is a 68-dimensional result corresponding to the 68 labels. Through its training we obtain 68 sets of 808-dimensional weight coefficients, corresponding to the contribution weights of each input dimension for predicting each label. Multiplying the 808-dimensional feature vector by the 68 sets of weight coefficients in turn yields 68 label-related weighted feature vectors for training the downstream CC multi-label classifier. The CC multi-label classifier in this experiment consists of 68 binary classifiers, predicting the membership of one RNA in 68 labels.
First, we obtain the weighted feature vector x_1 from the multi-label feature learning module and use it as the input feature to train the first binary classifier. The first label value it predicts is appended to the weighted feature vector x_2, which is then used to train the second classifier. This process repeats until the last binary classifier is trained. Unlike the traditional CC multi-label classifier, in the optimal CC multi-label classifier proposed by the invention, after the i-th binary classifier is trained, all label values predicted so far are appended to the end of the weighted feature vector x_{i+1} associated with the next label before the (i+1)-th binary classifier is trained. This retains the CC algorithm's ability to learn label associations while letting the advantages of multi-view data show in the training of each sub-classifier, combining the strengths of the multi-view and multi-label algorithms. The training and prediction procedures of the optimal CC multi-label classifier are shown as algorithms 1 and 2.
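The chaining rule described above, where classifier i+1 receives its own label-weighted feature vector plus ALL labels predicted so far, can be sketched as follows. This is a toy illustration under stated assumptions: the base classifier is a stand-in fixed-threshold rule, not the patent's model, and the names `ThresholdClf` and `train_optimal_cc` are mine:

```python
class ThresholdClf:
    """Toy binary classifier: predicts 1 iff the feature sum is positive."""
    def fit(self, X, y):
        return self
    def predict(self, x):
        return 1 if sum(x) > 0 else 0

def train_optimal_cc(weighted_views, labels, n_labels):
    """weighted_views[i][s] is the label-i weighted feature vector of
    sample s; labels[s] is the 0/1 label vector of sample s."""
    chain = []
    history = [[] for _ in labels]            # labels predicted so far, per sample
    for i in range(n_labels):
        # classifier i sees its weighted vector plus all earlier predictions
        X = [weighted_views[i][s] + history[s] for s in range(len(labels))]
        y = [labels[s][i] for s in range(len(labels))]
        clf = ThresholdClf().fit(X, y)
        chain.append(clf)
        for s in range(len(labels)):
            history[s].append(clf.predict(X[s]))
    return chain
```

A plain CC chain would append only the single previous prediction; the loop above appends the whole prediction history, which is the "optimal" variant's distinguishing step.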
The specific steps of this part are as follows:
The first step: splice the deep feature vectors F1, F2, F3 and F4 of the four views to form F, and use F and y to train the multi-label feature learning model, obtaining the 68 label-related weighted feature vectors x1, x2, ..., x68;
The second step: use x1 and y1 to train the first binary classifier of the optimal CC chain multi-label classifier;
The third step: append the label predicted in the previous step to the end of x2, and use the appended x2 together with y2 to train the second binary classifier of the optimal CC chain multi-label classifier;
The fourth step: append all the labels predicted in the previous steps to the end of x3, and use the appended x3 together with y3 to train the third binary classifier of the optimal CC chain multi-label classifier, and so on until the 68th classifier is trained.
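The chained training and prediction procedure above can be sketched as follows. This is an illustrative reconstruction: the patent does not fix the base binary classifier, so scikit-learn logistic regressions are assumed here, and `Xw` stands for the list of per-label weighted feature matrices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_optimal_chain(Xw, Y):
    """Xw: list of L per-label weighted feature matrices, each (n, d);
    Y: (n, L) binary label matrix. The i-th binary classifier is trained on
    its own weighted features plus all labels predicted by classifiers 1..i-1."""
    chain, preds = [], np.empty((Y.shape[0], 0))
    for i in range(Y.shape[1]):
        Xi = np.hstack([Xw[i], preds])      # append labels predicted so far
        clf = LogisticRegression(max_iter=1000).fit(Xi, Y[:, i])
        chain.append(clf)
        preds = np.hstack([preds, clf.predict(Xi)[:, None]])
    return chain

def predict_optimal_chain(chain, Xw):
    """Run the trained chain on new per-label weighted feature matrices."""
    preds = np.empty((Xw[0].shape[0], 0))
    for i, clf in enumerate(chain):
        yi = clf.predict(np.hstack([Xw[i], preds]))[:, None]
        preds = np.hstack([preds, yi])
    return preds
```

The distinguishing point of the "optimal" variant is visible in the loop: classifier i consumes the weighted features associated with label i rather than a single shared feature vector.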
In the using stage of the method, the specific steps are as follows:
The first step: construct a preliminary multi-view test dataset on the test data using the initial multi-view feature construction model;
The second step: obtain a deep multi-view test dataset using the deep multi-view feature extraction model;
The third step: splice the deep multi-view test features to form the test feature vector, and input it into the trained multi-label feature model to obtain the weighted feature vectors;
The fourth step: input the weighted feature vectors into the trained optimal CC multi-label chain classifier to obtain all predicted label values.
Advantages of the invention include the following:
1) Construction of initial multi-view RNA sequence features: many methods exist for constructing features from RNA sequences, and features constructed in different ways have their own effects, advantages and disadvantages. Using multi-view features to describe RNA sequences and to identify the RNA binding proteins they can bind combines the advantages of the different construction methods well. The invention uses the amino acid sequence expression form, which expresses the order and context characteristics of the RNA sequence well, the multi-gap dipeptide data, which express the composition and structural information of the RNA sequence well, and the RNA sequence semantic data, which express the semantic information of the RNA sequence well, to construct the multi-view initial features, so that view information can be complemented from several different aspects.
2) Construction of deep multi-view features: to improve the effectiveness of the multi-view features, deep learning with CNNs is performed on the original multi-view data to construct deep multi-view features. Compared with the original multi-view features, the deep multi-view features have lower dimensionality and better classification performance;
3) Construction of the multi-label feature learning model: the multi-label feature learning technique integrates the learned deep features of the multiple views and uses the logistic regression principle to correct the features for the different labels, so that the multi-label classifier can be trained better;
4) Construction of the optimal chain multi-label classifier: the CC multi-label classifier is improved, and multi-label learning is carried out with the corrected weighted feature vectors obtained from the multi-label feature learning model, yielding a multi-label classifier with stronger generalization capability for RNA binding protein identification.
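To illustrate advantage 1), a possible construction of the multi-gap dipeptide composition feature is sketched below; the exact gap range and normalisation used by the invention are not specified here, so both are assumptions:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def gapped_dipeptide_composition(seq, max_gap=2):
    """Frequency of amino-acid pairs separated by g residues, for g = 0..max_gap.
    Returns a dict keyed by (gap, pair); frequencies are normalised per gap.
    Pairs containing non-standard residues are ignored."""
    feats = {}
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    for g in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - g - 1):
            p = seq[i] + seq[i + g + 1]
            if p in counts:
                counts[p] += 1
                total += 1
        for p, c in counts.items():
            feats[(g, p)] = c / total if total else 0.0
    return feats
```

With `max_gap=2` this yields 3 x 400 = 1200 dimensions; the actual dimensionality in the invention depends on the gap range it adopts.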
Drawings
Fig. 1 is a framework diagram of the algorithmic method of the present invention.
FIG. 2 is a framework diagram of an initial feature data acquisition algorithm from different perspectives in accordance with the present invention.
FIG. 3 is a framework diagram of a multi-view depth feature learning algorithm of the present invention.
Fig. 4 is a framework diagram of the multi-tag feature learning algorithm of the present invention.
Fig. 5 is a framework diagram of a multi-view learning algorithm of the present invention.
FIG. 6 is a framework diagram of an RNA binding protein recognition algorithm of the present invention.
FIG. 7 is the RNA sequence one-hot matrix data.
FIG. 8 is one-hot matrix data of amino acid sequences obtained by transformation of the RNA sequences of FIG. 7.
FIG. 9 is a histogram of the multi-gap dipeptide composition converted from the amino acid sequence of FIG. 8.
FIG. 10 is semantic matrix data obtained by converting the RNA sequence of FIG. 7 after training the semantic model.
FIG. 11 is the RNA sequence deep feature extraction network.
FIG. 12 is the amino acid sequence deep feature extraction network.
FIG. 13 is the multi-gap dipeptide composition deep feature extraction network.
FIG. 14 is the RNA sequence semantic deep feature extraction network.
FIG. 15 is a flowchart of an optimal multi-label chain learning algorithm incorporating multi-label feature learning and multi-label learning.
FIG. 16 is a performance comparison of the algorithm used in the present invention with an existing algorithm on each single class.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings and examples.
As shown in fig. 1 to 6, the method realizes the identification of RNA binding proteins by fusing multi-view and optimal multi-label chain learning, and comprises four parts: initial multi-view feature construction, deep multi-view feature extraction, multi-label feature model training and optimal multi-label chain classifier training. The initial multi-view feature construction part obtains the initial multi-view features of the original RNA sequence; the deep multi-view feature extraction part performs deep feature learning on the initial multi-view features to obtain multi-view deep features; the multi-label feature model training part uses the multi-view deep features to construct label-related weighted feature vectors; and the optimal multi-label chain classifier training part uses the weighted feature vectors to learn a label-associated CC classifier and obtain the final prediction result.
The specific steps of the training phase are as follows. The initial multi-view feature construction part first extracts four kinds of features from the original RNA sequence: the RNA sequence, the amino acid sequence, the multi-gap dipeptide composition and the RNA sequence semantic matrix, constructing multi-view data with 4 views in total.
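The semantic view mentioned above rests on splitting each RNA sequence into 6-mer "words" before training the Word2Vec model. A sketch of that tokenisation might look as follows (the stride of 1 is an assumption; the patent only specifies a 6-mer word stock):

```python
def kmer_tokens(seq, k=6, stride=1):
    """Split an RNA sequence into overlapping k-mer tokens; these token lists
    are the 'sentences' a Word2Vec model would be trained on to obtain the
    6-mer RNA word vectors used for the semantic view."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
```

The resulting token lists could then be passed to a Word2Vec implementation such as gensim's to learn the semantic matrix; that training step is omitted here.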
The original RNA sequence is a text sequence, and a numerical matrix representation can be obtained by transforming it with the one-hot encoding technique. The present algorithm uses the RNA sequence data as the feature of the RNA view. FIG. 7 depicts the features of an RNA sequence after one-hot encoding, where the horizontal axis represents the specific RNA sequence and the vertical axis represents the one-hot encoding rule.
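A minimal sketch of the one-hot step, assuming the row order A, U, G, C (the actual row order is given by the vertical axis of FIG. 7):

```python
import numpy as np

RNA_BASES = "AUGC"  # assumed row order of the one-hot matrix

def one_hot_rna(seq):
    """Encode an RNA string as a 4 x L numeric matrix; the padding base 'B'
    (or any unknown symbol) maps to an all-zero column."""
    mat = np.zeros((len(RNA_BASES), len(seq)))
    for j, base in enumerate(seq):
        i = RNA_BASES.find(base)
        if i >= 0:
            mat[i, j] = 1.0
    return mat
```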
Example 1
According to an embodiment of the training phase, this example uses the RNA-RBP binding data of the AURA2 dataset. The dataset contains 67 RBPs, 73681 RNA sequences and their 550386 binding sites, as shown in Table 1. The number of RNA samples each RBP can bind varies widely, and the RNA sequences differ in length, so we uniformly fix the length at 2700 and pad shorter sequences with the base B. Table 2 compares the RRMVL method used in the present invention with current state-of-the-art methods.
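The length normalisation described above (fixed length 2700, shortfall filled with base B) can be sketched as below; truncating over-length sequences is an assumption, since the text only specifies padding:

```python
def pad_rna(seq, length=2700, pad_base="B"):
    """Pad an RNA sequence with base B up to the fixed length; sequences
    longer than the target length are truncated (an assumed behaviour)."""
    return (seq + pad_base * max(0, length - len(seq)))[:length]
```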
Table 2 performance metrics of the present algorithm in example 1
The comparison covers a decision tree classifier model without deep learning, the currently advanced iDeepM method, and the prediction performance indexes of each single-view sub-model and of the overall model under the RRMVL framework. As the table shows, both the deep-learning iDeepM model and any single-view model under RRMVL outperform the decision tree model, demonstrating the clear advantage of deep learning in extracting features from long samples. Meanwhile, RRMVL with all view models integrated exceeds any single-view model in both AUC and F1 values, reflecting the information complementarity among the multi-view data and showing that multi-view modelling of the data can achieve better results in bioinformatics. Among the single views, the multi-gap dipeptide composition view performs best, because multi-gap dipeptides contain not only sequence order information but also sequence composition and structural information, making it the most informative of all views. The RNA sequence semantic view performs slightly worse than the initial RNA sequence view, because training a good semantic model requires millions of samples, while our dataset contains only 92102 RNA sequences, which is insufficient to train ideal 6-mer RNA word vectors, so its experimental effect is poor. Overall, among the 3 compared algorithms, the RRMVL proposed in the present invention achieves the best results in all 3 AUC and 3 F1 indexes, proving that the multi-view-based optimal multi-label chain learning method achieves the expected effect on the RNA-binding protein identification problem.
Example 2
To test the effects of the multi-label feature learning and the optimal multi-label chain learning used in the invention, two groups of comparison experiments between RRMVL and its variants were carried out on the AURA dataset: the multi-view-voting-based ensemble learning RRMVL method was compared with the multi-label-feature-learning-based RRMVL method, and the RRMVL method without multi-label learning was compared with the RRMVL method based on optimal multi-label chain learning. Because the multi-view-voting-based ensemble model is not a single classifier, it has no AUC index; the five-fold cross-validation results of the remaining methods are shown in the table below.
Table 3 performance testing of the present algorithm in example 2 with respect to the multi-tag feature learning model and the optimal multi-tag chain learning model
From the above table it can be seen that, for multi-view data, the prediction performance of the model with multi-label feature learning is consistently better than that of voting-based ensemble learning, indicating that multi-label feature learning fully exploits the advantages of multi-view data. On the other hand, in the multi-label classification problem, methods using a multi-label classifier consistently outperform those without multi-label techniques, showing that the correlation among labels has a non-negligible effect on prediction. It is worth noting that after multi-label learning the AUC index of RRMVL is reduced, because the classification behaviour of the multi-label CC classifier differs slightly from that of the sigmoid output layer of the neural network. For the three F1 indexes, the multi-view multi-label learning method RRMVL achieves the best results, proving that the proposed method can accurately identify which RBPs can bind to a given unexplored RNA.
Example 3
To study the influence of the number of class samples on the experimental effect, the invention uses RRMVL to carry out independent experiments on the 68 class datasets and compares the results with the iDeepM method; the experimental results are shown in the following table.
Table 4 Prediction effects for different RBPs
The line graph of prediction accuracy is shown in fig. 16. As can be seen from fig. 16, between the two compared algorithms, RRMVL achieves the best prediction accuracy on most classes, and as the number of class samples increases, each index rises gradually and then levels off. Note that when the number of samples is below 5000, the indexes fluctuate strongly, because too few samples in some classes prevent the model from learning the deep features of these classes well. Comparison of the 2 curves also shows that the learning ability of the iDeepM method in a low-sample environment is inferior to that of RRMVL, as reflected in its larger fluctuation amplitude, which indirectly demonstrates the advantage of multi-view data in small-sample learning. In general, the proposed method achieves the desired effect on every class of dataset.

Claims (9)

1. A method for RNA binding protein identification fusing multi-view and optimal multi-label chain learning, characterized in that the steps of the training stage are as follows:
The first step: encode the original RNA sequence into a numerical matrix using the one-hot encoding technique as the original RNA sequence feature X1;
The second step: convert the original RNA sequence into an amino acid sequence using molecular biology principles, and convert it into a numerical matrix using the one-hot encoding technique as the initial amino acid sequence feature X2;
The third step: convert the amino acid sequence into a multi-gap dipeptide columnar numerical matrix using statistical principles as the initial dipeptide composition feature X3;
The fourth step: build a model using the Word2Vec technique, learn word vectors with 6-mer RNA as the word stock, and convert the RNA sequences into RNA sequence semantic matrices as the initial RNA sequence semantic feature X4; obtain the initial multi-view dataset D = {X1, X2, X3, X4, y};
Fifth step: by X 1 Training the deep feature extraction net of the RNA sequence by y, and taking the penultimate layer of the CNN network architecture used for extracting the deep feature of the RNA visual angle as the deep feature of the RNA sequence
Sixth step: by X 2 Training the deep feature extraction network of the y pairs of amino acid sequences, and taking the penultimate layer of the CNN network architecture used for the deep feature extraction of the visual angles of the amino acids as the deep feature of the amino acid sequences
Seventh step: by X 3 Training a y-to-multi-gap dipeptide component depth feature extraction network, and taking the penultimate layer of a CNN network architecture used for multi-gap dipeptide visual angle depth feature extraction as the multi-gap dipeptide component depth feature
Eighth step: by X 4 Training the RNA sequence semantic depth feature extraction network by y, and taking the penultimate layer of the CNN network architecture used for the RNA sequence semantic visual angle depth feature extraction as the RNA sequence semantic depth feature
Ninth step: splicingForm->Use->And y training a multi-tag feature learning model to obtain a weighted feature vector ++associated with each tag>
Tenth step: by means ofTraining an optimal chain type CC multi-label classifier model;
the optimal chain type CC multi-label classifier model is a classifier model which uses weighting feature vectors corresponding to all labels to train all sub-classifiers in the CC multi-label classifier, so as to obtain better classification effect;
eleventh step: constructing a preliminary multi-view test dataset using an initial multi-view feature construction model on test data
Twelfth step: obtaining a depth multi-view test dataset using a depth multi-view feature extraction model
Thirteenth step: splicingForm->Inputting the weighted feature vector into a trained multi-label feature model to obtain the weighted feature vector +.>
Fourteenth step: will beInputting into a trained optimal chain type CC multi-label classifier to obtain all predicted label values +.>
2. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for RNA view deep feature extraction in the fifth step comprises 1 two-dimensional convolution layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers; the first layer of the CNN network architecture is a convolution layer with 101 convolution kernels of size 4 x 10, yielding 101 feature maps of 1 x 2701; the second layer is a pooling layer with pooling length 3, yielding 101 feature maps of 1 x 900; the third layer is a flatten layer, yielding 1 feature map of 1 x 90900; the fourth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 90900; the fifth layer is a fully connected layer converting the 1 x 90900 feature map into a 1 x 202 vector; the sixth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 202; the seventh layer is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
3. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for amino acid view deep feature extraction in the sixth step comprises 1 two-dimensional convolution layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers; the first layer of the CNN network architecture is a convolution layer with 101 convolution kernels of size 20 x 10, yielding 101 feature maps of 1 x 2701; the second layer is a pooling layer with pooling length 3, yielding 101 feature maps of 1 x 900; the third layer is a flatten layer, yielding 1 feature map of 1 x 90900; the fourth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 90900; the fifth layer is a fully connected layer converting the 1 x 90900 feature map into a 1 x 202 vector; the sixth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 202; the seventh layer is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
4. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for multi-gap dipeptide view deep feature extraction in the seventh step comprises 1 two-dimensional convolution layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers; the first layer of the CNN network architecture is a convolution layer with 101 convolution kernels of size 30 x 10, yielding 101 feature maps of 1 x 871; the second layer is a flatten layer, yielding 1 feature map of 1 x 87971; the third layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 87971; the fourth layer is a fully connected layer converting the 1 x 87971 feature map into a 1 x 202 vector; the fifth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 202; the sixth layer is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
5. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for RNA sequence semantic view deep feature extraction in the eighth step comprises 1 two-dimensional convolution layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers; the first layer of the CNN network architecture is a convolution layer with 101 convolution kernels of size 25 x 10, yielding 101 feature maps of 1 x 2701; the second layer is a pooling layer with pooling length 3, yielding 101 feature maps of 1 x 900; the third layer is a flatten layer, yielding 1 feature map of 1 x 90900; the fourth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 90900; the fifth layer is a fully connected layer converting the 1 x 90900 feature map into a 1 x 202 vector; the sixth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 x 202; the seventh layer is a fully connected layer converting the 1 x 202 feature map into a 1 x 68 vector.
6. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 2, wherein the CNN network architecture used for RNA view deep feature extraction, the CNN network architecture used for amino acid view deep feature extraction, the CNN network architecture used for multi-gap dipeptide view deep feature extraction and the CNN network architecture used for RNA sequence semantic view deep feature extraction all use the sigmoid function as the activation function of their last layer to introduce a nonlinear transformation, all remaining layers use the relu function as the activation function, and all four networks use the binary cross-entropy loss function.
7. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 3, wherein the CNN network architecture used for RNA view deep feature extraction, the CNN network architecture used for amino acid view deep feature extraction, the CNN network architecture used for multi-gap dipeptide view deep feature extraction and the CNN network architecture used for RNA sequence semantic view deep feature extraction all use the sigmoid function as the activation function of their last layer to introduce a nonlinear transformation, all remaining layers use the relu function as the activation function, and all four networks use the binary cross-entropy loss function.
8. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 4, wherein the CNN network architecture used for RNA view deep feature extraction, the CNN network architecture used for amino acid view deep feature extraction, the CNN network architecture used for multi-gap dipeptide view deep feature extraction and the CNN network architecture used for RNA sequence semantic view deep feature extraction all use the sigmoid function as the activation function of their last layer to introduce a nonlinear transformation, all remaining layers use the relu function as the activation function, and all four networks use the binary cross-entropy loss function.
9. The RNA binding protein identification fusing multi-view and optimal multi-label chain learning of claim 5, wherein the CNN network architecture used for RNA view deep feature extraction, the CNN network architecture used for amino acid view deep feature extraction, the CNN network architecture used for multi-gap dipeptide view deep feature extraction and the CNN network architecture used for RNA sequence semantic view deep feature extraction all use the sigmoid function as the activation function of their last layer to introduce a nonlinear transformation, all remaining layers use the relu function as the activation function, and all four networks use the binary cross-entropy loss function.
CN202010658127.8A 2020-07-09 2020-07-09 RNA binding protein recognition incorporating multi-view and optimal multi-tag chain learning Active CN111816255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010658127.8A CN111816255B (en) 2020-07-09 2020-07-09 RNA binding protein recognition incorporating multi-view and optimal multi-tag chain learning


Publications (2)

Publication Number Publication Date
CN111816255A CN111816255A (en) 2020-10-23
CN111816255B true CN111816255B (en) 2024-03-08

Family

ID=72842140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010658127.8A Active CN111816255B (en) 2020-07-09 2020-07-09 RNA binding protein recognition incorporating multi-view and optimal multi-tag chain learning

Country Status (1)

Country Link
CN (1) CN111816255B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530595A (en) * 2020-12-21 2021-03-19 无锡市第二人民医院 Cardiovascular disease classification method and device based on multi-branch chain type neural network
CN113139185B (en) * 2021-04-13 2023-09-05 北京建筑大学 Malicious code detection method and system based on heterogeneous information network
CN113470739B (en) * 2021-07-03 2023-04-18 中国科学院新疆理化技术研究所 Protein interaction prediction method and system based on mixed membership degree random block model
CN113470738B (en) * 2021-07-03 2023-07-14 中国科学院新疆理化技术研究所 Overlapping protein complex identification method and system based on fuzzy clustering and gene ontology semantic similarity
CN113658641A (en) * 2021-07-20 2021-11-16 北京大学 Phage classification method, device, equipment and storage medium
CN113837293A (en) * 2021-09-27 2021-12-24 电子科技大学长三角研究院(衢州) mRNA subcellular localization model training method, mRNA subcellular localization model localization method and readable storage medium
CN116913383B (en) * 2023-09-13 2023-11-28 鲁东大学 T cell receptor sequence classification method based on multiple modes

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982344A (en) * 2012-11-12 2013-03-20 浙江大学 Support vector machine sorting method based on simultaneously blending multi-view features and multi-label information
CN109993197A (en) * 2018-12-07 2019-07-09 天津大学 A kind of zero sample multi-tag classification method based on the end-to-end example differentiation of depth
CN109994203A (en) * 2019-04-15 2019-07-09 江南大学 A kind of epilepsy detection method based on EEG signal depth multi-angle of view feature learning
CN110263151A (en) * 2019-05-06 2019-09-20 广东工业大学 A kind of enigmatic language justice learning method towards multi-angle of view multi-tag data
CN116386733A (en) * 2023-04-11 2023-07-04 江南大学 Protein function prediction method based on multi-view multi-scale multi-attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Reader emotion classification based on multi-view multi-label learning; 温雯; 陈颖; 蔡瑞初; 郝志峰; 王丽娟; Computer Science; 2018-08-15 (No. 08); full text *
Image semantic annotation combining deep features and multi-label classification; 李志欣; 郑永哲; 张灿龙; 史忠植; Journal of Computer-Aided Design & Computer Graphics; 2018-02-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN111816255A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN111816255B (en) RNA binding protein recognition incorporating multi-view and optimal multi-tag chain learning
Rodríguez et al. Beyond one-hot encoding: Lower dimensional target embedding
Lumini et al. Deep learning and transfer learning features for plankton classification
CN111985369B (en) Course field multi-modal document classification method based on cross-modal attention convolution neural network
Lumini et al. Deep learning for plankton and coral classification
CN111445944B (en) RNA binding protein recognition based on multi-view depth features and multi-label learning
Coates et al. The importance of encoding versus training with sparse coding and vector quantization
Wang et al. Weakly supervised patchnets: Describing and aggregating local patches for scene recognition
CN110414554B (en) Stacking ensemble learning fish identification method based on multi-model improvement
Zhang et al. Learning non-redundant codebooks for classifying complex objects
Champ et al. A comparative study of fine-grained classification methods in the context of the LifeCLEF plant identification challenge 2015
CN105631416A (en) Method for carrying out face recognition by using novel density clustering
Li et al. On random hyper-class random forest for visual classification
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN110717401A (en) Age estimation method and device, equipment and storage medium
CN112766360A (en) Time sequence classification method and system based on time sequence bidimensionalization and width learning
Xie et al. Feature normalization for part-based image classification
Przewięźlikowski et al. Hypermaml: Few-shot adaptation of deep models with hypernetworks
CN113469338B (en) Model training method, model training device, terminal device and storage medium
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
Liu et al. Multi-digit Recognition with Convolutional Neural Network and Long Short-term Memory
Yang et al. iCausalOSR: invertible Causal Disentanglement for Open-set Recognition
CN112132059B (en) Pedestrian re-identification method and system based on depth conditional random field
CN111931788A (en) Image feature extraction method based on complex value
CN104112147A (en) Nearest feature line based facial feature extracting method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant