CN111816255A

CN111816255A - RNA-binding protein recognition by fusing multi-view and optimal multi-label chain learning

Info

Publication number: CN111816255A
Application number: CN202010658127.8A
Authority: CN
Inventors: 邓赵红; 杨海涛; 吴敬; 王蕾; 王士同
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2020-07-09
Filing date: 2020-07-09
Publication date: 2020-10-23
Anticipated expiration: 2040-07-09
Also published as: CN111816255B

Abstract

The invention belongs to the field of bioinformatics, and relates to RNA binding protein recognition by fusing multi-view and optimal multi-label chain learning. The method comprises a training stage and a using stage, wherein the training stage comprises initial multi-view data construction, multi-view depth feature extraction model training, multi-label feature learning and optimal multi-label chain classifier training. The multiple views include an RNA sequence view, an amino acid sequence view, a multiple gap dipeptide component view and an RNA sequence semantic view. In order to improve the effectiveness of the multi-view features, the deep multi-view features are constructed by utilizing CNN to carry out deep learning based on initial multi-view data. In order to link multi-view features with multi-label learning, the multi-label feature learning model is established for integrating the advantages of all views, and the optimal CC chain classifier different from the common CC multi-label classifier is used for learning the association between labels, so that the classification precision is improved more effectively.

Description

RNA-binding protein recognition by fusing multi-view and optimal multi-label chain learning

Technical Field

The invention belongs to the field of bioinformatics, and relates to RNA binding protein recognition by fusing multi-view and optimal multi-label chain learning.

Background

RNA (ribonucleic acid) is a carrier of genetic information found in biological cells and in some viruses and viroids. It regulates the expression of coding genes, serves as the template for protein synthesis after gene transcription, and is an essential component of living organisms. For an RNA to perform its function properly, it generally needs to be mediated by an RNA-binding protein (RBP). The lack of a particular RBP may therefore prevent a particular RNA from performing its regulatory or translational function, so that the organism lacks certain important proteins or certain proteins proliferate abnormally, impairing the function of the organism.

RNA-binding proteins (RBPs) are key players in post-transcriptional events, and the versatility and structural flexibility of their domains allow RBPs to control the metabolism of a large number of transcripts. RBPs are involved in almost all steps of post-transcriptional regulation. They establish highly dynamic interactions with other proteins and with coding and non-coding RNAs, creating functional units called ribonucleoprotein complexes that regulate RNA splicing, polyadenylation, stability, localization, translation and degradation. Studies have found that certain RBPs regulate the RNAs that synthesize oncoproteins and tumor suppressor proteins. Deciphering the intricate network of binding between RBPs and their cancer-associated RNA targets will therefore provide a better understanding of tumor biology and may lead to new methods of treating cancer.

Despite the rapid development of big data and sequencing technology, it is difficult for the biomedical industry to test the binding of every pair of RNA and RBP experimentally, so a number of algorithms have emerged that use machine learning models to identify RBP binding sites from RNA sequences. For example: Maticzka et al. proposed the GraphProt method, which learns the sequence and structure binding preferences of RBPs from high-throughput experimental data and designs a dedicated computational framework; RNAcommender, a binding site prediction method proposed by Corrado et al., can recommend RNA targets for unexplored RBPs by using available interaction information and taking into account the protein structure and the predicted secondary structure of the RNA; HOCNNLB, proposed by Zhang et al., uses higher-order nucleotide coding as the initial feature to predict whether a given RNA segment is a binding site. These methods focus on using the sequence or structural features of the original RNA sequence to determine whether a given RNA sequence fragment is a binding site for a particular RBP, but few methods use the existing binding information between RNAs and RBPs to aid prediction. To address this, Pan et al. proposed the iDeepM approach, which combines multi-label classification with deep learning to find the multiple RBPs that can bind to one RNA, and achieves the expected multi-label classification effect. However, iDeepM has the following disadvantages: although the single-view RNA sequence data it uses is of some use for prediction, the method is limited by the insufficient amount of information in the RNA sequence, so its precision is low; in addition, the method uses a convolutional neural network and a long short-term memory (LSTM) network for multi-label classification, which cannot fully learn the relations among labels, and this also affects prediction precision.

Disclosure of Invention

The method realizes the RNA binding protein recognition by fusing multi-view and optimal multi-label chain learning, and comprises a training stage and a using stage, wherein the training stage comprises an initial multi-view feature construction model, a depth multi-view feature extraction model, multi-label feature learning and optimal multi-label chain classifier training.

The training stage: the initial multi-view feature construction model converts an original RNA sequence into an amino acid sequence, multi-gap dipeptide components and an RNA sequence semantic matrix by using molecular biology principles, statistical principles and the Word2Vec (word-to-vector) technique, obtaining sequence, component and semantic features, which together with the original RNA sequence form the initial multi-view features; the depth multi-view feature extraction model builds four convolutional neural networks and trains them on the four initial view features to obtain depth multi-view features with better classification capability; the extracted depth features are used to train a multi-label feature learning model, and the label-related weighted feature vectors learned by this model are used to train an optimal CC multi-label chain classifier, which learns the associations among labels and yields a model capable of recognizing RNA-binding proteins.

The use stage: an RNA sequence to be tested is obtained, and its initial multi-view features are constructed by using the molecular biology principles, the statistical principles and the Word2Vec technique; the depth features of the 4 views are extracted with the four trained convolutional neural networks; the trained multi-label feature learning model then weights the concatenated depth features of the 4 views to obtain the label-related weighted feature vectors; finally, the trained optimal CC multi-label chain classifier predicts from the label-related weighted feature vectors to obtain the final prediction result.

The RNA-binding protein recognition method fusing multi-view and optimal multi-label chain learning integrates a multi-view depth feature learning technique, a multi-label feature learning technique and a multi-label learning technique. The deep hierarchical structure of deep learning optimizes the feature representation; the multi-label feature learning technique integrates and corrects the depth features of each view, making full use of the strengths of each view's features to build label-related weighted feature vectors; and the multi-label learning technique effectively exploits both the independence of each label and the correlations among labels. Combining these three techniques allows the effective information in an RNA sequence to be fully extracted and improves the generalization ability of the classifier.

The RNA sequence is a section of biological genetic material described by a character sequence, and the deep convolution model cannot process character information, so the RNA character sequence needs to be preprocessed and converted into a numerical form that the program can accept. One-hot coding is a popular coding technique: a text sequence of length m composed of n kinds of elements is turned into an n × m matrix, in which each element is converted into an n-dimensional orthonormal basis vector and placed at the corresponding position along the length m. For RNA sequences, one-hot coding first constructs a 4 × m blank matrix for an RNA sequence of length m, then converts each base into a 4-dimensional orthonormal basis vector and fills it into the corresponding position of the sequence, as shown in Fig. 7. The row heading is a specific RNA sequence, whose actual length is 2700. By referring to the position of the base in the column heading, base A in the sequence is represented as the vector (1,0,0,0)^T, base C as (0,1,0,0)^T, base G as (0,0,1,0)^T, and base U as (0,0,0,1)^T.
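The one-hot construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `one_hot_rna`, the optional `length` argument for padding, and the choice to leave unknown symbols as all-zero columns are assumptions made for the example.

```python
BASES = "ACGU"  # row order follows the vectors given in the text: A, C, G, U


def one_hot_rna(seq, length=None):
    """Return a 4 x m matrix (list of rows) with one column per base.

    `length` optionally pads or truncates the sequence to a fixed width m;
    padding columns and unknown symbols stay all-zero (an assumption).
    """
    m = len(seq) if length is None else length
    matrix = [[0] * m for _ in BASES]
    for col, base in enumerate(seq[:m]):
        row = BASES.find(base)
        if row >= 0:
            matrix[row][col] = 1
    return matrix


if __name__ == "__main__":
    for row in one_hot_rna("ACGUA", length=6):
        print(row)
```

Each column of the result is the basis vector of the base at that position, so the column for A reads (1,0,0,0) from top to bottom.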

Although the initial feature matrix constructed in this way is helpful for feature extraction, it carries relatively little information. An amino acid sequence is composed of 20 kinds of amino acids and is far richer in information than an RNA sequence, so a one-hot coding matrix obtained from the amino acid sequence can serve feature extraction better. Translation of an RNA sequence into an amino acid sequence is unidirectional and unique, but because one amino acid can correspond to several base combinations, the resulting amino acid sequence cannot be restored to the original RNA sequence, which causes loss and misinterpretation of information. For example, the base combination GCA always translates to the amino acid A (alanine), but the amino acid A can be encoded by GCA, GCC, GCG or GCU. To address this problem, the RNA sequence is translated into amino acid sequences in three modes: a first mode in which translation starts from the first base, a second mode in which translation starts after skipping the first base, and a third mode in which translation starts after skipping the first and second bases. In this way an RNA sequence of length m is converted into 3 amino acid sequences of length about m/3, and the three amino acid sequences complement each other so that the original RNA sequence information can be restored. As in the example above, the base combination GCA can be uniquely identified by the amino acids R, A and H at the corresponding positions in the three sequences. The three amino acid sequences are therefore concatenated into one long amino acid chain of length m, which fully inherits the sequence information of the original RNA sequence and has a richer form of expression.
One-hot coding is applied to this long chain in the same way as for the RNA sequence, giving an initial feature matrix of size 20 × m, as shown in Fig. 8, which is the amino acid view data provided by the invention. The row heading is a specific amino acid sequence, whose actual length is 2700. By referring to the position of the amino acid in the column heading, every amino acid in the row sequence is represented as one 20-dimensional orthonormal basis vector.
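The three translation modes can be illustrated with a short sketch that translates an RNA string in each of the three reading frames and concatenates the results. It uses the standard genetic code. Mapping stop codons to the placeholder symbol O is an assumption made here for illustration (the text only says that O is a temporary amino acid added by the invention), and the function names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; "*" is a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna, frame, stop_symbol="O"):
    """Translate `rna` starting at offset `frame` (0, 1 or 2).

    Incomplete trailing codons are dropped. Stop codons become `stop_symbol`
    (assumed mapping). Symbols other than A, C, G, U raise KeyError.
    """
    codons = (rna[i:i + 3] for i in range(frame, len(rna) - 2, 3))
    return "".join(CODON_TABLE[c].replace("*", stop_symbol) for c in codons)


def three_frame_chain(rna):
    """Concatenate the translations of the three reading frames."""
    return "".join(translate(rna, frame) for frame in range(3))


if __name__ == "__main__":
    rna = "AUGGCAUAA"
    print([translate(rna, f) for f in range(3)])  # ['MAO', 'WH', 'GI']
    print(three_frame_chain(rna))                 # MAOWHGI
```

Because every base is covered by a codon in each frame (apart from the sequence ends), the concatenated chain has a length close to that of the RNA sequence itself.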

The RNA view and amino acid view data described above mainly characterize the order of the sequence, but the composition of a sequence is just as important as its order. Since the 0-gap dipeptide reflects the two-dimensional sequence composition and the 1-gap dipeptide carries three-dimensional structural composition information, extracting RNA sequence composition information with 0-gap and 1-gap dipeptides is the most effective, and the invention uses their combination to extract sequence components, forming the multi-gap dipeptide component view. Since dipeptides are sensitive to the left-right order of their amino acids, the 21 amino acids of the invention (20 natural amino acids plus the temporary amino acid O added by the invention) yield 21 × 21 × 2 = 882 multi-gap dipeptide species; the combinations OO and O*O are of little significance to this study and are discarded, leaving 880 species. Counting the occurrences of these 880 multi-gap dipeptides gives a feature vector that effectively captures the composition information of the amino acid sequence and the RNA sequence as well as the spatial composition information of the amino acids. Because the 880-dimensional feature vector is one-dimensional and not well suited to depth feature extraction, it is converted into a two-dimensional histogram, from which a machine learning model can extract depth features more effectively, as shown in Fig. 9. The abscissa of the upper table in the figure is the multi-gap dipeptide species, where "AA" represents the 0-gap dipeptide with alanine on both the left and the right, and 18 is the number of times it occurs in the amino acid sequence of the sample; "A*D" represents the 1-gap dipeptide with alanine on the left, any amino acid in the middle, and aspartic acid on the right. The figure lists only 12 multi-gap dipeptides; the actual number is 880.
The lower panel is the converted histogram. The upper limit of the count of each multi-gap dipeptide is set to 30, so a matrix of size 30 × 880 is taken as the initial multi-gap dipeptide data for this amino acid sequence.
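The counting and histogram steps can be sketched as follows. This is an illustrative example under stated assumptions: the ordering of the 880 columns, the way bars are filled (row r is set when the count exceeds r), and all function names are choices made for the sketch and are not specified in the text.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY" + "O"  # 20 natural amino acids + placeholder O


def dipeptide_keys():
    """All (left, gap, right) species except O-O pairs: 2 * (21 * 21 - 1) = 880."""
    keys = []
    for gap in (0, 1):
        for left, right in product(AMINO_ACIDS, repeat=2):
            if left == right == "O":
                continue  # OO and O*O are discarded
            keys.append((left, gap, right))
    return keys


def dipeptide_counts(protein):
    """Count 0-gap (adjacent) and 1-gap (one residue skipped) dipeptides."""
    counts = dict.fromkeys(dipeptide_keys(), 0)
    for gap in (0, 1):
        for i in range(len(protein) - gap - 1):
            key = (protein[i], gap, protein[i + gap + 1])
            if key in counts:
                counts[key] += 1
    return counts


def histogram_matrix(counts, height=30):
    """Turn the counts into a height x 880 binary bar chart, capping counts at height."""
    keys = dipeptide_keys()
    return [[1 if min(counts[k], height) > row else 0 for k in keys]
            for row in range(height)]


if __name__ == "__main__":
    c = dipeptide_counts("AADAAD")
    print(c[("A", 0, "A")], c[("A", 1, "D")])
    m = histogram_matrix(c)
    print(len(m), len(m[0]))
```

Each column of the matrix is one bar of the histogram, so a dipeptide that occurs 18 times produces a column with 18 ones.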

Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence, and the raw data studied in bioinformatics has the same form as the raw data studied in NLP. NLP methods can therefore be used to solve the coding of text and the construction of initial features in bioinformatics. The invention uses 6-mer RNAs as the vocabulary for training the semantic model. A 6-mer RNA consists of 6 consecutive bases, so the vocabulary consists of 4^6 = 4096 kinds of 6-mer RNA in total. The invention uses the popular Word2Vec technique to construct the semantic model, whose principle is shown in Fig. 10. The 92102 RNA sequences in the dataset used by the invention are processed one by one as follows: 1) obtain the ordered list of 6-mer RNAs in the RNA sequence using a sliding window 6 bases wide; 2) encode each 6-mer RNA by its position among the 4096 forms (with 'AAAAAA' being 1 and 'UUUUUU' being 4096); 3) take each pair of adjacent 6-mer RNAs as a feature X and a label Y, and feed them into the semantic model for training; 4) extract the word vector of each of the 4096 kinds of 6-mer RNA from the trained semantic model; 5) replace each 6-mer RNA in the RNA sequence with its word vector to construct the RNA sequence semantic matrix. An RNA sequence semantic matrix formed from 6-mer RNA word vectors not only has a smaller dimension, but also contains the order of the RNA sequence with 6-base motifs as units together with context structure information, so that depth feature learning can be carried out better.
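Steps 1) to 3) above can be sketched without any machine learning library. The example below covers only the tokenization, the 1 to 4096 encoding and the construction of adjacent (X, Y) training pairs; it does not train a Word2Vec model. The base order A, C, G, U is an assumption: the text fixes only that 'AAAAAA' is 1 and 'UUUUUU' is 4096, and the function names are hypothetical.

```python
BASES = "ACGU"  # A lowest and U highest per the text; the order of C and G is assumed


def kmers(seq, k=6):
    """Slide a window of width k over the sequence with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_index(kmer):
    """Encode a k-mer as its 1-based rank when read as a base-4 number."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + BASES.index(base)
    return idx + 1


def training_pairs(seq, k=6):
    """Adjacent k-mers as (feature X, label Y) pairs for the semantic model."""
    ids = [kmer_index(w) for w in kmers(seq, k)]
    return list(zip(ids, ids[1:]))


if __name__ == "__main__":
    seq = "ACGUACGU"
    print(kmers(seq))
    print(training_pairs(seq))
```

A sequence of length m yields m − 5 overlapping 6-mers, each of which is later replaced by its learned word vector to form the semantic matrix.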

The specific steps of the part are as follows:

The first step: take the one-hot matrix of the original RNA sequence as the RNA initial feature X¹.

The second step: convert the original RNA sequence into the amino acid sequence initial feature X² using molecular biology principles and the one-hot method.

The third step: convert the amino acid sequence into the multi-gap dipeptide component initial feature X³ using statistical principles.

The fourth step: train an RNA sequence semantic model using the Word2Vec technique, obtain the 6-mer RNA word vectors, and form the RNA sequence semantic matrix as the RNA sequence semantic initial feature X⁴. This yields the preliminary multi-view dataset D = {X¹, X², X³, X⁴, y}.

The depth multi-view feature extraction part of the invention uses a convolutional neural network to automatically extract each view feature of an RNA sequence. For an original RNA sequence, RNA sequence characteristics, amino acid sequence characteristics, multi-gap dipeptide component characteristics and RNA sequence semantic characteristics can be obtained after pretreatment, and four different convolutional neural networks are respectively constructed for the characteristics of four different visual angles to carry out deep automatic extraction on the characteristics of the different visual angles.

During training, the CNN computes the error from the result of the final output layer and back-propagates it to learn the network parameters. Because the feature vector computed by the penultimate layer passes through only one fully connected layer to reach the output layer, optimizing the network according to its output layer also optimizes the representation output by the penultimate layer; that is, the network learns a better feature representation as it trains. The output of the penultimate layer is therefore taken as the feature learned by the network. The features learned automatically by the convolutional neural network have a smaller dimension than the original features, and they are nonlinear combinations with better discriminative ability, so the subsequent classification model can generalize better.

Figs. 11, 12, 13 and 14 show the CNN network architectures used for depth feature extraction from the four views. The notation k@m×n denotes the feature maps of each network layer, where k is the number of feature maps in the layer and m×n is the size of each feature map. The two-dimensional convolution kernels of the network are denoted k×m×n, where k is the number of convolution kernels and m×n is the kernel size. The stride of the convolution kernels defaults to 1. The input of each network is the feature of one view, and the output is a vector of length 68 describing the binding of the RNA sequence to the RBPs. Each of the first 67 dimensions equals 1 if the sample can bind to the RBP of that dimension and 0 otherwise; the 68th dimension equals 1 if the sample RNA sequence cannot bind to any of the 67 RBPs and 0 otherwise.

Fig. 11 is the CNN network architecture for extracting RNA view depth features, which comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers. The input to the network is a two-dimensional matrix of 4 × 2710. The first layer is a convolutional layer with 101 convolution kernels of 4 × 10, yielding 101 feature maps of 1 × 2701; the second layer is a pooling layer with pooling length 3, yielding 101 feature maps of 1 × 900; the third layer is a flatten layer, yielding 1 feature map of 1 × 90900; the fourth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 × 90900; the fifth layer is a fully connected layer, which converts the 1 × 90900 feature map into a 1 × 202 vector; the sixth layer is a dropout layer with probability 0.5, yielding a 1 × 202 vector; the seventh layer is a fully connected layer, which converts the 1 × 202 vector into a 1 × 68 vector.

Fig. 12 is the CNN network architecture for extracting amino acid view depth features, which comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers. The input is a two-dimensional matrix of 20 × 2710. The first layer is a convolutional layer with 101 convolution kernels of 20 × 10, yielding 101 feature maps of 1 × 2701; the second layer is a pooling layer with pooling length 3, yielding 101 feature maps of 1 × 900; the third layer is a flatten layer, yielding 1 feature map of 1 × 90900; the fourth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 × 90900; the fifth layer is a fully connected layer, which converts the 1 × 90900 feature map into a 1 × 202 vector; the sixth layer is a dropout layer with probability 0.5, yielding a 1 × 202 vector; the seventh layer is a fully connected layer, which converts the 1 × 202 vector into a 1 × 68 vector.

Fig. 13 is the CNN network architecture for extracting multi-gap dipeptide view depth features, which comprises 1 two-dimensional convolutional layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers. The input to the network is a two-dimensional matrix of 30 × 880. The first layer is a convolutional layer with 101 convolution kernels of 30 × 10, yielding 101 feature maps of 1 × 871; the second layer is a flatten layer, yielding 1 feature map of 1 × 87971; the third layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 × 87971; the fourth layer is a fully connected layer, which converts the 1 × 87971 feature map into a 1 × 202 vector; the fifth layer is a dropout layer with probability 0.5, yielding a 1 × 202 vector; the sixth layer is a fully connected layer, which converts the 1 × 202 vector into a 1 × 68 vector.

Fig. 14 is the CNN network architecture for extracting RNA sequence semantic view depth features, which comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully connected layers. The input is a two-dimensional matrix of 25 × 2710. The first layer is a convolutional layer with 101 convolution kernels of 25 × 10, yielding 101 feature maps of 1 × 2701; the second layer is a pooling layer with pooling length 3, yielding 101 feature maps of 1 × 900; the third layer is a flatten layer, yielding 1 feature map of 1 × 90900; the fourth layer is a dropout layer with probability 0.5, yielding 1 feature map of 1 × 90900; the fifth layer is a fully connected layer, which converts the 1 × 90900 feature map into a 1 × 202 vector; the sixth layer is a dropout layer with probability 0.5, yielding a 1 × 202 vector; the seventh layer is a fully connected layer, which converts the 1 × 202 vector into a 1 × 68 vector.
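The layer sizes quoted for the four networks follow from simple shape arithmetic, which the sketch below reproduces. It assumes an unpadded ("valid") convolution with stride 1 and a pooling layer that discards any remainder; the helper name `conv_pool_flatten` and the dictionary of views are illustrative. The dipeptide width of 880 is taken from the 30 × 880 histogram matrix described earlier, which is the width that produces the stated 1 × 871 feature maps.

```python
def conv_pool_flatten(height, width, kernel_h, kernel_w=10, filters=101, pool=None):
    """Return (conv width, width after pooling, flattened length).

    Assumes a valid convolution with stride 1 and non-overlapping pooling
    along the width that drops any remainder.
    """
    out_h = height - kernel_h + 1          # 1 when the kernel spans the full height
    conv_w = width - kernel_w + 1
    out_w = conv_w // pool if pool else conv_w
    return conv_w, out_w, filters * out_h * out_w


# (input height, input width, pooling length) for each view
VIEWS = {
    "rna": (4, 2710, 3),
    "amino_acid": (20, 2710, 3),
    "dipeptide": (30, 880, None),  # this network has no pooling layer
    "semantic": (25, 2710, 3),
}

if __name__ == "__main__":
    for name, (h, w, pool) in VIEWS.items():
        print(name, conv_pool_flatten(h, w, kernel_h=h, pool=pool))
    # Each network's penultimate layer has 202 units, so the concatenated
    # depth feature of the four views has 4 * 202 dimensions.
    print("concatenated depth feature length:", 4 * 202)
```

For the RNA view this gives 2710 − 10 + 1 = 2701 columns after convolution, 900 after pooling, and 101 × 900 = 90900 values after flattening.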

The last layer of each of the four networks uses the sigmoid function as its activation function to introduce a nonlinear transformation. The sigmoid function is expressed as follows:

S(x) = 1 / (1 + e^(−x))

The remaining layers all use the ReLU function as the activation function, which is expressed as follows:

R(x) = max(0, x)

The loss function of the network is the binary cross-entropy (binary_crossentropy) loss, which is defined as follows:

L(p, q) = −Σ_i [ p(x_i) log q(x_i) + (1 − p(x_i)) log(1 − q(x_i)) ]

where p(x_i) and q(x_i) both represent the degree of membership of the sequence x in class i; p is the true label value, i.e. 1 or 0, and q is the predicted value, with q ∈ (0,1) because the output is activated by the sigmoid function.

The specific steps of the part are as follows:

The first step: use X¹ and y to train the RNA sequence depth feature extraction network, and take the output of the penultimate layer of the CNN used for RNA view depth feature extraction as the RNA sequence depth feature.

The second step: use X² and y to train the amino acid sequence depth feature extraction network, and take the output of the penultimate layer of the CNN used for amino acid view depth feature extraction as the amino acid sequence depth feature.

The third step: use X³ and y to train the multi-gap dipeptide component depth feature extraction network, and take the output of the penultimate layer of the CNN used for multi-gap dipeptide view depth feature extraction as the multi-gap dipeptide component depth feature.

The fourth step: use X⁴ and y to train the RNA sequence semantic depth feature extraction network, and take the output of the penultimate layer of the CNN used for RNA sequence semantic view depth feature extraction as the RNA sequence semantic depth feature.

This yields the depth multi-view dataset.

The invention uses a multi-view-based optimal multi-label chain learning algorithm, which comprises two parts: multi-label feature learning and optimal multi-label chain learning. The CC (classifier chain) algorithm is a multi-label classification algorithm that can efficiently learn the associations among labels. Its principle is to build several binary classifiers to predict the corresponding labels; after each binary classifier is trained, the algorithm appends the label predicted by that classifier to the initial feature, and the result is used as the input feature for training the next binary classifier, until all classifiers are trained. Unlike existing methods, the invention improves the single-view CC algorithm and applies it to a multi-view scenario, bringing the advantages of multi-view data to the CC algorithm so that it can better learn the associations among labels. The specific principle is shown in Fig. 15. The algorithm is divided into two parts: multi-label feature learning and multi-label learning. First, the depth feature vectors of all views are obtained from the upstream CNN models and concatenated, and the result is used to train the multi-label feature learning model. The input of the model is an 808-dimensional vector (4 views × 202 dimensions), and the output is a 68-dimensional result corresponding to the 68 labels. By training this model we obtain 68 sets of 808-dimensional weight coefficients, each giving the contribution weight of every dimension of the input vector to the prediction of one label. Multiplying the 808-dimensional feature vector by each of the 68 sets of weight coefficients in turn gives 68 weighted feature vectors for training the downstream CC multi-label classifier.
The CC multi-label classifier in this work consists of 68 binary classifiers, which predict the membership of one RNA in the 68 labels. First, we obtain the weighted feature vector x₁ from the multi-label feature learning module and use it as the input feature to train the first binary classifier. The first label value it predicts is appended to the end of the weighted feature vector x₂, which is then used to train the second binary classifier. This process is repeated until the last binary classifier has been trained. Unlike the traditional CC multi-label classifier, in the optimal CC multi-label classifier provided by the invention, after the i-th binary classifier is trained, all label values predicted so far are appended to the end of the weighted feature vector xᵢ₊₁ associated with the next label, and the (i+1)-th binary classifier is then trained on it. The method therefore keeps the ability of the CC algorithm to learn label correlations while also bringing the advantages of multi-view data into the training of the sub-classifiers, combining the strengths of the multi-view and multi-label algorithms. The training and prediction procedures of the optimal CC multi-label classifier are shown in Algorithms 1 and 2.
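The chain described above can be sketched with a toy example. This is a minimal illustration under several assumptions and not the patented Algorithms 1 and 2: the binary classifier is a small gradient-descent logistic regression chosen for self-containment, the label-specific weights `W` are passed in as given rather than learned, and all function names are hypothetical. What the sketch does follow from the text is the structure: label j is predicted from its own weighted feature vector with all previously predicted labels appended.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    """Logistic model; the last entry of w is the bias."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + w[-1])


def train_logreg(X, y, epochs=200, lr=0.5):
    """Batch gradient descent logistic regression (stand-in binary classifier)."""
    d = len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for k in range(d):
                grad[k] += err * xi[k]
            grad[d] += err
        w = [wk - lr * g / len(X) for wk, g in zip(w, grad)]
    return w


def weighted_views(x, W):
    """One label-specific vector per label: x scaled element-wise by that label's weights."""
    return [[wk * xk for wk, xk in zip(w_j, x)] for w_j in W]


def fit_chain(X, Y, W):
    """Train one binary classifier per label, appending earlier predictions."""
    chain, appended = [], [[] for _ in X]
    for j in range(len(Y[0])):
        inputs = [weighted_views(x, W)[j] + prev for x, prev in zip(X, appended)]
        w = train_logreg(inputs, [y[j] for y in Y])
        chain.append(w)
        for prev, inp in zip(appended, inputs):
            prev.append(1.0 if predict_proba(w, inp) >= 0.5 else 0.0)
    return chain


def predict_chain(chain, x, W):
    views, preds = weighted_views(x, W), []
    for j, w in enumerate(chain):
        preds.append(1.0 if predict_proba(w, views[j] + preds) >= 0.5 else 0.0)
    return preds


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    Y = [[0, 0], [0, 0], [1, 1], [1, 1]]   # both toy labels follow the first feature
    W = [[1.0, 1.0], [1.0, 1.0]]           # placeholder label-specific weights
    chain = fit_chain(X, Y, W)
    print(predict_chain(chain, [1.0, 0.0], W))
```

In the setting of the text, `x` would be the 808-dimensional concatenated depth feature, `W` the 68 sets of weight coefficients from the multi-label feature learning model, and the chain would contain 68 classifiers.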

The specific steps of the part are as follows:

The first step: concatenate the depth features of the four views into a single feature vector, and use it together with y to train the multi-label feature learning model, obtaining 68 label-related weighted feature vectors.

The second step: use the first weighted feature vector and y¹ to train the first binary classifier of the optimal CC chain multi-label classifier.

The third step: append the label predicted in the previous step to the second weighted feature vector, and use the extended vector and y² to train the second binary classifier of the optimal CC chain multi-label classifier.

The fourth step: append the labels predicted in the previous steps to the third weighted feature vector, and use the extended vector and y³ to train the third binary classifier of the optimal CC chain multi-label classifier, and so on until the 68th binary classifier is trained.

In the use stage of the method, the specific steps are as follows:

The first step: construct the preliminary multi-view test dataset from the test data using the initial multi-view feature construction model.

The second step: obtain the depth multi-view test dataset using the depth multi-view feature extraction model.

The third step: concatenate the depth features of the four views into a single feature vector and input it to the trained multi-label feature learning model to obtain the label-related weighted feature vectors.

The fourth step: input the weighted feature vectors into the trained optimal CC multi-label chain classifier to obtain all predicted label values.

The advantages of the invention include the following:

1) Construction of initial multi-view RNA sequence features: there are many ways to construct features from an RNA sequence, and each has its own strengths and weaknesses. Using multi-view features to represent RNA sequences and to identify the RNA-binding proteins that can bind to them combines the strengths of the different construction methods. In addition to the RNA sequence itself, the invention introduces an amino acid sequence representation, which expresses the order and context characteristics of the RNA sequence; multi-gap dipeptide data, which express its component and structural information; and RNA sequence semantic data, which express its semantic information. Together these form the initial multi-view features, whose views complement one another from several different aspects.

2) Construction of depth multi-view features: to improve the effectiveness of the multi-view features, deep multi-view features are constructed by deep learning with CNNs on the initial multi-view data. Compared with the initial multi-view features, the multi-view features obtained by depth feature extraction have a smaller data dimension and give better classification results.

3) Construction of a multi-label feature learning model: the learned depth features of the multiple views are integrated using a multi-label feature learning technique, and feature correction is performed for the different labels using the logistic regression principle, yielding multi-label features that are better suited to training a multi-label classifier.

4) Construction of an optimal chain multi-label classifier: the CC multi-label classifier is improved by using the corrected weighted feature vectors obtained from the multi-label feature learning model for multi-label learning, yielding a multi-label classifier with higher generalization ability for RNA-binding protein identification.
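Item 4) can be illustrated with a prediction-only sketch of a classifier chain in which sub-classifier j receives the weighted feature vector of label j together with the predictions already made for labels 0 to j-1. The linear scoring form, the parameter layout and all names are assumptions made for illustration; the trained sub-classifiers and the chain order of the invention are not reproduced here.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def chain_predict(
    weighted_features: Sequence[Sequence[float]],  # one feature vector per label
    feature_weights: Sequence[Sequence[float]],    # hypothetical per-label weights
    label_weights: Sequence[Sequence[float]],      # weights on earlier labels in the chain
    biases: Sequence[float],
    threshold: float = 0.5,
) -> List[int]:
    """Predict labels one by one, feeding earlier predictions to later links."""
    predictions: List[int] = []
    for j, features in enumerate(weighted_features):
        score = biases[j]
        score += sum(w * x for w, x in zip(feature_weights[j], features))
        # label_weights[j] has j entries: one per previously predicted label.
        score += sum(w * y for w, y in zip(label_weights[j], predictions))
        predictions.append(1 if sigmoid(score) >= threshold else 0)
    return predictions
```

The difference from a plain classifier chain, under these assumptions, is only that each link sees its own label-specific feature vector instead of one shared feature vector.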

Drawings

FIG. 1 is a block diagram of the algorithmic method of the present invention.

FIG. 2 is a framework diagram of the different perspective initial feature data acquisition algorithm of the present invention.

FIG. 3 is a multi-view depth feature learning algorithm framework diagram of the present invention.

FIG. 4 is a multi-label feature learning algorithm framework diagram of the present invention.

FIG. 5 is a multi-view learning algorithm framework diagram of the present invention.

FIG. 6 is a block diagram of the RNA binding protein recognition algorithm of the present invention.

FIG. 7 is data of one-hot matrix of RNA sequences.

FIG. 8 is data of the amino acid sequence one-hot matrix obtained by transforming the RNA sequence of FIG. 7.

FIG. 9 is a bar graph of the multiple gap dipeptide elements from the amino acid sequence conversion of FIG. 8.

FIG. 10 is semantic matrix data obtained by transforming the RNA sequence of FIG. 7 after training of a semantic model.

FIG. 11 is an RNA sequence deep feature extraction network.

FIG. 12 is an amino acid sequence deep feature extraction network.

FIG. 13 is a multiple gap dipeptide component deep feature extraction network.

FIG. 14 is a semantic depth feature extraction network for RNA sequences.

FIG. 15 is a flow chart of an optimal multi-label chain learning algorithm that merges multi-label feature learning and multi-label learning.

FIG. 16 is a line graph comparing the performance of the algorithm used in the present invention with existing algorithms on a single class.

Detailed Description

The invention is described in detail below with reference to the figures and examples.

As shown in FIGS. 1 to 6, the method realizes RNA binding protein recognition by fusing multi-view and optimal multi-label chain learning, and comprises four parts of initial multi-view feature construction, deep multi-view feature extraction, multi-label feature model training and optimal multi-label chain classifier training. The initial multi-view characteristic construction part obtains initial multi-view characteristics of an original RNA sequence; the depth multi-view feature extraction part is used for carrying out depth feature learning on the initial multi-view features to obtain multi-view depth features; the multi-label feature model training part uses multi-view depth features to construct a weighted feature vector related to the label; and the optimal multi-label chain classifier training part learns the label-associated CC classifier by using the weighted feature vector to obtain a final prediction result.

Specific steps of the training phase: the initial multi-view feature construction part first extracts four kinds of features from the original RNA sequence, namely the RNA sequence, the amino acid sequence, the multi-gap dipeptide component and the RNA sequence semantic matrix, and constructs multi-view data with four views in total.
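Two of the four views are derived from the RNA sequence through its amino acid translation. The following Python sketch shows one plausible way to obtain them: the codon table is the standard genetic code, while the reading frame (non-overlapping codons from position 0), the treatment of stop codons and unknown symbols, and the choice of gap values are assumptions, since this section does not specify them.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna: str) -> str:
    """Translate non-overlapping codons from position 0 (assumed reading frame)."""
    codons = (rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    # Codons containing unknown symbols map to 'X' (an assumption).
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def gap_dipeptide_counts(protein: str, gap: int) -> Counter:
    """Count residue pairs separated by `gap` intervening residues."""
    step = gap + 1
    return Counter(protein[i] + protein[i + step] for i in range(len(protein) - step))
```

Calling `gap_dipeptide_counts` for several gap values and stacking the resulting histograms gives a matrix in the spirit of the multi-gap dipeptide view; the exact layout used by the invention may differ.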

The original RNA sequence is a text sequence, and one-hot coding converts it into a numerical matrix. The algorithm uses this matrix as the feature of the RNA sequence view. FIG. 7 plots the one-hot coded RNA sequence features, where the horizontal axis represents a specific RNA sequence and the vertical axis represents the one-hot coding rule.
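The one-hot conversion can be sketched in a few lines of Python. The fixed length of 2700 and the padding base B are taken from Example 1; the row order A, C, G, U, the truncation of longer sequences, and the encoding of B as an all-zero column are assumptions made for this illustration.

```python
from typing import List

ALPHABET = "ACGU"


def one_hot_rna(seq: str, length: int = 2700, pad: str = "B") -> List[List[int]]:
    """Return a 4 x length matrix; rows follow ALPHABET, columns follow sequence positions."""
    # Truncate long sequences and right-pad short ones with the placeholder base.
    padded = seq[:length].ljust(length, pad)
    # The placeholder (and any other unknown symbol) becomes an all-zero column (assumption).
    return [[1 if base == row else 0 for base in padded] for row in ALPHABET]
```

With bases as rows and positions as columns, the matrix has the same orientation as FIG. 7 and matches the height of the 4 × 10 convolution kernels recited in claim 2.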

Example 1

According to the training-phase embodiment, this example was performed on the RNA-RBP binding data of the AURA2 dataset. The dataset contains 67 RBPs and 73681 RNA sequences with 550386 binding sites, as shown in Table 1. The number of sample RNAs that each RBP can bind varies greatly. Since the RNA sequences differ in length, a fixed length of 2700 is specified, and sequences shorter than this are padded with the base B. Table 2 compares the RRMVL method used by the invention with current state-of-the-art methods.

Table 2 performance index of the present algorithm in example 1

The table reports the prediction performance indexes of a decision tree classifier model without deep learning, the iDeepM method, which is currently an advanced method in the field, and each single-view model and the overall model under RRMVL. The table shows that both iDeepM, which uses deep learning, and every single-view model under RRMVL outperform the decision tree model, which demonstrates the clear advantage of deep learning in extracting features from long samples. Meanwhile, RRMVL with all view models integrated has higher AUC and F1 values than any single-view model, which reflects the information complementarity among the multi-view data and shows that a multi-view treatment of the data can achieve better results in bioinformatics. Among the single views, the multi-gap dipeptide component view performs best, because multi-gap dipeptides contain not only sequence order information but also sequence component and structural information, making this the most information-rich of all the views. The RNA sequence semantic view performs slightly worse than the initial RNA sequence view, because training a good semantic model generally requires millions of samples, whereas the dataset contains only 92102 RNA sequences, which is not enough to train 6-mer RNA word vectors of ideal quality. Overall, the RRMVL proposed in the invention achieves the best results on all 3 AUC indexes and all 3 F1 indexes among the 3 compared algorithms, demonstrating that the multi-view-based optimal multi-label chain learning method achieves the expected effect on RNA-binding protein identification.

Example 2

In order to verify the effects of the multi-label feature learning and the optimal multi-label chain learning used by the method, two groups of comparison experiments were carried out on the AURA dataset between RRMVL and its variants: RRMVL based on multi-view voting was compared with RRMVL based on multi-label feature learning, and RRMVL without multi-label learning was compared with RRMVL based on optimal multi-label chain learning. Since the ensemble learning model based on multi-view voting is not a classifier, it has no AUC index; the five-fold cross-validation results of the remaining methods are shown in the table below.

Table 3 performance testing of the present algorithm in example 2 on a multi-label feature learning model and an optimal multi-label chain learning model

As can be seen from the above table, applying multi-label feature learning to the multi-view data always gives better prediction performance than voting-based ensemble learning, which indicates that multi-label feature learning makes full use of the advantages of the multi-view data. In addition, the method using the multi-label classifier is always superior to the method without multi-label learning when handling the multi-label classification problem, which shows that the correlation between labels has a non-negligible effect on prediction. Note that the AUC of RRMVL decreases after multi-label learning, because the classification performance of the multi-label CC classifier differs slightly from that of the final "Sigmoid" layer of the neural network. On the three F1 indexes, the multi-view multi-label learning method RRMVL achieves the best results, which again shows that the method provided by the invention can accurately identify the RBPs that an unexplored RNA can bind.
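The voting baseline compared in this example can be pictured as a per-label majority vote over the single-view predictions. The sketch below is an assumption about how such a vote might be implemented (hard 0/1 inputs, ties resolved as 0); the exact voting rule used in the experiment is not stated here.

```python
from typing import List, Sequence


def majority_vote(view_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-view binary label vectors by per-label majority vote."""
    n_views = len(view_predictions)
    # A label is predicted when more than half of the views predict it;
    # ties give 0 (assumption).
    return [1 if 2 * sum(votes) > n_views else 0 for votes in zip(*view_predictions)]
```

A vote of this kind yields hard decisions and no ranking score, which may be why no AUC is reported for the voting variant.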

Example 3

In order to study the influence of the number of samples per class on the experimental results, RRMVL was run independently on the 68 class datasets; the results, compared with those of the iDeepM method, are shown in the table below.

Table 3 Prediction performance for different RBPs

The prediction accuracy line graph is shown in FIG. 16. As can be seen from FIG. 16, of the two compared algorithms, RRMVL achieves the better prediction accuracy in most classes, and every index of both methods rises gradually and then levels off as the number of samples per class increases. Note that when the number of samples is below 5000, the indexes fluctuate strongly, because some classes have too few samples for the model to learn their depth features well. Comparing the two curves, the learning ability of iDeepM in the low-sample regime is inferior to that of RRMVL, as shown by its larger fluctuation range, which indirectly shows the advantage of multi-view data in small-sample learning. In general, the method provided by the invention achieves the expected effect on the datasets of the various classes.

Claims

1. An RNA binding protein recognition that combines multi-view and optimal multi-label chain learning, characterized by the steps of a training phase:

the first step: encoding the original RNA sequence into a numerical matrix using the one-hot coding technique, as the initial RNA sequence feature X¹;

the second step: converting the original RNA sequence into an amino acid sequence using molecular biology principles, and converting it into a numerical matrix using the one-hot coding technique, as the initial amino acid sequence feature X²;

the third step: converting the amino acid sequence into a multi-gap dipeptide histogram numerical matrix using statistical principles, as the initial dipeptide component feature X³;

the fourth step: building a model using the Word2Vec technique, learning word vectors with 6-mer RNA as the vocabulary, and converting the RNA sequence into an RNA sequence semantic matrix as the initial RNA sequence semantic feature X⁴; obtaining the initial multi-view dataset D = {X¹, X², X³, X⁴, y};

the fifth step: training the RNA sequence depth feature extraction network using X¹ and y, and taking the output of the penultimate layer of the CNN network architecture used for RNA view depth feature extraction as the RNA sequence depth feature;

the sixth step: training the amino acid sequence depth feature extraction network using X² and y, and taking the output of the penultimate layer of the CNN network architecture used for amino acid view depth feature extraction as the amino acid sequence depth feature;

the seventh step: training the multi-gap dipeptide component depth feature extraction network using X³ and y, and taking the output of the penultimate layer of the CNN network architecture used for multi-gap dipeptide view depth feature extraction as the multi-gap dipeptide component depth feature;

the eighth step: training the RNA sequence semantic depth feature extraction network using X⁴ and y, and taking the output of the penultimate layer of the CNN network architecture used for RNA sequence semantic view depth feature extraction as the RNA sequence semantic depth feature;

the ninth step: concatenating the four depth features into a single feature vector, and training the multi-label feature learning model using the concatenated feature vector and y, to obtain the weighted feature vector related to each label;

the tenth step: training the optimal chained CC multi-label classifier model using the weighted feature vectors and y;

the optimal chained CC multi-label classifier model is a classifier model which uses the weighted feature vectors corresponding to all labels to train all sub-classifiers in the CC multi-label classifier so as to obtain better classification effect;

the eleventh step: constructing a preliminary multi-view test dataset from the test data using the initial multi-view feature construction model;

the twelfth step: obtaining the depth multi-view test dataset using the depth multi-view feature extraction model;

the thirteenth step: concatenating the depth multi-view test features into a single feature vector, and inputting it into the trained multi-label feature learning model to obtain the weighted feature vectors;

the fourteenth step: inputting the weighted feature vectors into the trained optimal chained CC multi-label classifier to obtain all the predicted label values.

2. The RNA-binding protein recognition fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for RNA view depth feature extraction in the fifth step comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully-connected layers; the first layer of the CNN network architecture is a convolutional layer with 101 convolution kernels of 4 × 10, yielding 101 feature maps of 1 × 2701; the second layer is a pooling layer with a pooling length of 3, yielding 101 feature maps of 1 × 900; the third layer is a flatten layer, yielding 1 feature map of 1 × 90900; the fourth layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 90900; the fifth layer is a fully-connected layer, converting the 1 feature map of 1 × 90900 into 1 vector of 1 × 202; the sixth layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 202; the seventh layer is a fully-connected layer, converting the 1 feature map of 1 × 202 into 1 vector of 1 × 68.

3. The RNA-binding protein recognition fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for amino acid view depth feature extraction in the sixth step comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully-connected layers; the first layer of the CNN network architecture is a convolutional layer with 101 convolution kernels of 20 × 10, yielding 101 feature maps of 1 × 2701; the second layer is a pooling layer with a pooling length of 3, yielding 101 feature maps of 1 × 900; the third layer is a flatten layer, yielding 1 feature map of 1 × 90900; the fourth layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 90900; the fifth layer is a fully-connected layer, converting the 1 feature map of 1 × 90900 into 1 vector of 1 × 202; the sixth layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 202; the seventh layer is a fully-connected layer, converting the 1 feature map of 1 × 202 into 1 vector of 1 × 68.

4. The RNA-binding protein recognition fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for multi-gap dipeptide view depth feature extraction in the seventh step comprises 1 two-dimensional convolutional layer, 1 flatten layer, 2 dropout layers and 2 fully-connected layers; the first layer of the CNN network architecture is a convolutional layer with 101 convolution kernels of 30 × 10, yielding 101 feature maps of 1 × 871; the second layer is a flatten layer, yielding 1 feature map of 1 × 87971; the third layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 87971; the fourth layer is a fully-connected layer, converting the 1 feature map of 1 × 87971 into 1 vector of 1 × 202; the fifth layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 202; the sixth layer is a fully-connected layer, converting the 1 feature map of 1 × 202 into 1 vector of 1 × 68.

5. The RNA-binding protein recognition fusing multi-view and optimal multi-label chain learning of claim 1, wherein the CNN network architecture used for RNA sequence semantic view depth feature extraction in the eighth step comprises 1 two-dimensional convolutional layer, 1 pooling layer, 1 flatten layer, 2 dropout layers and 2 fully-connected layers; the first layer of the CNN network architecture is a convolutional layer with 101 convolution kernels of 25 × 10, yielding 101 feature maps of 1 × 2701; the second layer is a pooling layer with a pooling length of 3, yielding 101 feature maps of 1 × 900; the third layer is a flatten layer, yielding 1 feature map of 1 × 90900; the fourth layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 90900; the fifth layer is a fully-connected layer, converting the 1 feature map of 1 × 90900 into 1 vector of 1 × 202; the sixth layer is a dropout layer with a probability of 0.5, yielding 1 feature map of 1 × 202; the seventh layer is a fully-connected layer, converting the 1 feature map of 1 × 202 into 1 vector of 1 × 68.

6. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning of claim 2, wherein the last layer of the CNN network architecture used for RNA view depth feature extraction, the CNN network architecture used for amino acid view depth feature extraction, the CNN network architecture used for multi-gap dipeptide view depth feature extraction and the CNN network architecture used for RNA sequence semantic view depth feature extraction uses sigmoid function as activation function to introduce nonlinear transformation, the rest layers use relu function as activation function, and the loss functions of four networks use Binary cross-entropy loss function.

7. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning of claim 3, wherein the last layer of the CNN network architecture used for RNA view depth feature extraction, the CNN network architecture used for amino acid view depth feature extraction, the CNN network architecture used for multi-gap dipeptide view depth feature extraction and the CNN network architecture used for RNA sequence semantic view depth feature extraction all use sigmoid function as activation function to introduce nonlinear transformation, the rest layers use relu function as activation function, and the loss functions of four networks use Binary cross-entropy loss function.

8. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning of claim 4, wherein the last layer of the CNN network architecture used for RNA view depth feature extraction, the CNN network architecture used for amino acid view depth feature extraction, the CNN network architecture used for multi-gap dipeptide view depth feature extraction and the CNN network architecture used for RNA sequence semantic view depth feature extraction all use sigmoid function as activation function to introduce nonlinear transformation, the rest layers use relu function as activation function, and the loss functions of four networks use Binary cross-entropy loss function.

9. The RNA-binding protein identification fusing multi-view and optimal multi-label chain learning of claim 5, wherein the last layer of the CNN network architecture used for RNA view depth feature extraction, the CNN network architecture used for amino acid view depth feature extraction, the CNN network architecture used for multi-gap dipeptide view depth feature extraction and the CNN network architecture used for RNA sequence semantic view depth feature extraction all use sigmoid function as activation function to introduce nonlinear transformation, the rest layers use relu function as activation function, and the loss functions of four networks use Binary cross-entropy loss function.
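The layer sizes recited in claims 2 to 5 can be cross-checked with simple arithmetic. The Python sketch below assumes non-overlapping pooling with floor division and plain concatenation of the four penultimate-layer vectors; the helper name `flattened_width` is hypothetical.

```python
def flattened_width(n_maps: int, map_width: int, pool: int = 1) -> int:
    """Width after optional non-overlapping pooling and flattening."""
    return n_maps * (map_width // pool)


# Claims 2, 3 and 5: 101 maps of 1 x 2701, pooling length 3
# -> 101 maps of 1 x 900 -> flattened to 1 x 90900.
pooled_flat = flattened_width(101, 2701, pool=3)

# Claim 4: 101 maps of 1 x 871 with no pooling layer -> flattened to 1 x 87971.
dipeptide_flat = flattened_width(101, 871)

# Four penultimate-layer vectors of width 202, concatenated (assumption).
fused_width = 4 * 202
```

Under these assumptions the recited sizes are mutually consistent, and the concatenated multi-view depth feature would have width 808.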