CN115640529A

CN115640529A - Novel circular RNA-disease association prediction method

Info

Publication number: CN115640529A
Application number: CN202211120279.8A
Authority: CN
Inventors: 王磊
Original assignee: Zaozhuang University
Current assignee: Zaozhuang University
Priority date: 2022-09-15
Filing date: 2022-09-15
Publication date: 2023-01-24

Abstract

The invention discloses a novel circular RNA-disease association prediction method comprising the steps of selecting and establishing a data set, constructing disease attribute information, constructing circular RNA attribute information, extracting features with an attention mechanism, and constructing a randomized deep learning classification model. The invention effectively extracts the attribute information of circular RNAs and diseases, so that this information fully represents the complex relationships underlying circular RNA-disease associations; it uses a graph attention network to mine circular RNA-disease association information thoroughly and to extract its deep features; and it builds the model with a randomized deep learning classifier, which greatly improves prediction accuracy. The method has low computational cost and low power consumption, can effectively predict circular RNA-disease relationships, and achieves a prediction accuracy above 93%.

Description

Novel circular RNA-disease association prediction method

Technical Field

The invention relates to the field of machine learning and bioinformatics, in particular to a novel circular RNA-disease association prediction method.

Background

Circular RNA is a single-stranded, circular, endogenous non-coding RNA that is widely expressed in living organisms, is between 200 and 2000 nucleotides in length, and lacks a free 5' cap and 3' poly(A) tail. Most circular RNAs are expressed at low levels, and they were once considered rare by-products generated by splicing errors during mRNA formation. However, with the development of genome-wide analysis and deep RNA sequencing technologies, circular RNAs were found to be widespread and abundantly expressed in eukaryotic cells and to exhibit tissue and cell specificity. They play an important regulatory role in cell development and in the pathophysiology of disease and are therefore attracting increasing interest.

Recently, growing evidence has indicated that circular RNA is a promising disease-associated biomarker owing to its stability, conservation, abundance, tissue and stage specificity, and resistance to degradation by RNases. For example, Wang et al. found that cZRANB1 may be involved in glaucoma-induced retinal neurodegeneration by modulating the activation of retinal reactive glial cells and affecting the activity of retinal ganglion cells (RGCs). Huang et al. showed that circABCC4 is significantly up-regulated in prostate cancer tissues and cell lines and promotes the expression of FOXP4 in prostate cancer cells by sponging miR-1182. Silencing circABCC4 significantly inhibits the proliferation, cell cycle progression, migration, and invasion of prostate cancer cells, thereby delaying tumor growth.

As research on circular RNA has progressed, data on diseases associated with circular RNAs have gradually accumulated. These data provide sufficient support for predicting circular RNA-disease associations with computational methods. For example, Wang et al. designed the semi-supervised method SGANRDA to predict circular RNA-disease associations. The method combines disease similarity features with natural-language features of circular RNA sequences, pre-trains a generative adversarial network on all samples, and fine-tunes its parameters to achieve optimal model performance. Zhao et al. proposed IBNPKATZ, a circular RNA-disease association prediction model based on the KATZ measure and a bipartite network projection algorithm. The model uses known circular RNA and disease data, describes similarity with the Gaussian interaction profile kernel, and effectively determines whether a circRNA is associated with a disease. Although the above methods work well in predicting circular RNA-disease associations, most of them share the following limitations: 1. when faced with large amounts of circular RNA and disease data, they do not focus specifically on the important information; 2. limited information-processing resources are not allocated to the important information; 3. iterative algorithms require a significant amount of computation time.

Disclosure of Invention

The present invention has been made to solve the above-mentioned problems in the prior art. The invention relates to a novel circular RNA-disease association prediction method comprising the steps of selecting and establishing a data set, constructing disease attribute information, constructing circular RNA attribute information, extracting features with an attention mechanism, and constructing a randomized deep learning classification model. The invention effectively extracts the attribute information of circular RNAs and diseases, so that this information fully represents the complex relationships underlying circular RNA-disease associations; it uses a graph attention network to mine circular RNA-disease association information thoroughly and to extract its deep features; and it builds the model with a randomized deep learning classifier, which greatly improves prediction accuracy. The method has low computational cost and low power consumption, can effectively predict circular RNA-disease relationships, and achieves a prediction accuracy above 93%.

The invention specifically adopts the following technical scheme:

A novel circular RNA-disease association prediction method, the method comprising:

constructing a data set for predicting circular RNA-disease associations based on circular RNA-disease association data;

characterizing the similarity attributes of circular RNAs and diseases by means of a Gaussian interaction profile kernel, and characterizing the associations between different diseases according to disease classification annotations;

based on the data set for predicting circular RNA-disease associations, using a graph attention network to compute in parallel the attention between each network node and its neighboring nodes, and extracting attention mechanism features by processing nodes of different dimensions and applying them to an inductive learning problem;

building an ensemble deep RVFL network model, which is a randomization-based deep learning classifier, and training the model on the attention mechanism features in a closed-form, non-iterative manner to obtain a model for circular RNA-disease association prediction.

Further, after constructing the data set for predicting circular RNA-disease associations based on circular RNA-disease association data, the method further comprises: dividing the data set by the following formula (1):

$$T = T^{+} \cup T^{u} \quad (1)$$

wherein $T$ is the data set for predicting circular RNA-disease associations, $T^{+}$ is the set of positive samples, and $T^{u}$ is the set of unlabeled samples;

randomly selecting a preset number of negative samples from the unlabeled samples by a down-sampling method.

Further, circular RNA-disease association data collected from the CircR2Disease database are used to construct the data set for predicting circular RNA-disease associations.

Further, characterizing the similarity attributes of circular RNAs and diseases by means of the Gaussian interaction profile (GIP) kernel comprises:

using a binary vector to represent the interaction profile of a circular RNA, in which a position is assigned 1 when the circular RNA is associated with the corresponding disease and 0 otherwise;

calculating the GIP similarity $D_{GIP}(d(i), d(j))$ by the following formula:

$$D_{GIP}(d(i), d(j)) = \exp\left(-\sigma_d \lVert IP(d(i)) - IP(d(j)) \rVert^{2}\right)$$

wherein $\sigma_d$ is the width parameter, $m$ is the number of circular RNAs in the data set, $\exp(\cdot)$ is the exponential function, $d(i)$ is the $i$-th disease, $IP(d(i))$ is the vector of the $i$-th disease in the adjacency matrix, $d(j)$ is the $j$-th disease, and $IP(d(j))$ is the vector of the $j$-th disease in the adjacency matrix.

Further, characterizing the associations between different diseases according to disease classification annotations comprises:

using a directed acyclic graph (DAG) to reflect the associations between different diseases according to the disease classification annotations, wherein nodes in the DAG represent diseases and edges represent the relationships between diseases;

calculating the contribution $C_e(d)$ of a disease $d$ in the disease set $N_e$ of $DAG_e$ as follows:

$$C_e(d) = \begin{cases} 1, & d = e \\ \max\{\theta \cdot C_e(d') \mid d' \in \text{children of } d\}, & d \neq e \end{cases}$$

wherein $\theta$ is the semantic contribution factor, $\cdot$ denotes multiplication, $d'$ is a child disease of $d$, $C_e(d')$ is the contribution of disease $d'$, "children of $d$" is the set of child diseases of $d$, and $e$ is the disease whose semantics are computed.

The semantics of the disease are obtained by the following formula:

$$SC(e) = \sum_{d \in N_e} C_e(d)$$

wherein $SC(e)$ is the semantic value of disease $e$ and $N_e$ is the disease set of $DAG_e$.

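Further, using a graph attention network, based on the data set for predicting circular RNA-disease associations, to compute in parallel the attention between each network node and its neighboring nodes, and extracting attention mechanism features by processing nodes of different dimensions and applying them to an inductive learning problem, comprises:

introducing a graph attention network (GAT) to extract features with the attention mechanism. The GAT computes the attention between each node and its neighboring nodes in parallel, can process nodes of different dimensions, and can be applied directly to inductive learning problems, thereby implementing an effective attention mechanism.

Suppose the input of the GAT is

$$h = \{h_1, h_2, \dots, h_N\}, \quad h_i \in \mathbb{R}^{F}$$

and its output is

$$h' = \{h'_1, h'_2, \dots, h'_N\}, \quad h'_i \in \mathbb{R}^{F'}$$

wherein $N$ is the number of nodes, $F$ and $F'$ are the numbers of attributes of the input and output nodes, respectively, $h$ is the set of input vectors, $h_i$ is the $i$-th input vector, $h'$ is the set of output vectors, and $h'_i$ is the $i$-th output vector. A weight matrix $W \in \mathbb{R}^{F' \times F}$ shared by all nodes is trained to obtain the corresponding input-output transformation, and a self-attention mechanism $a$ is applied to each node:

$$a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$$

The attention coefficient $e_{i,j}$ of the self-attention mechanism expresses the importance of node $j$ to node $i$:

$$e_{i,j} = a(W h_i, W h_j)$$

The coefficients are normalized over all neighbors of the node with a softmax function:

$$\alpha_{i,j} = \mathrm{softmax}_j(e_{i,j}) = \frac{\exp(e_{i,j})}{\sum_{r \in \mathcal{N}_i} \exp(e_{i,r})}$$

In the GAT, the attention mechanism $a$ is a single-layer feed-forward neural network parameterized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$, to which the LeakyReLU nonlinear activation is added, as shown below:

$$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T} [W h_i \,\Vert\, W h_j]\right)\right)}{\sum_{r \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T} [W h_i \,\Vert\, W h_r]\right)\right)}$$

wherein $\mathcal{N}_i$ is the set of neighbors of node $i$ and $\Vert$ denotes concatenation.

After the normalized attention coefficients between nodes are obtained, the output feature of each node is calculated by the following formula (9) or (10):

$$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i,j} W h_j\right) \quad (9)$$

$$h'_i = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{k} W^{k} h_j\right) \quad (10)$$

wherein $\sigma(\cdot)$ is an activation function, $k$ is the index of an attention mechanism, $K$ is the number of attention mechanisms, $\alpha_{i,j}^{k}$ is the attention coefficient of the $k$-th attention mechanism, and $W^{k}$ is the corresponding weight matrix.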

Further, the input of each hidden layer in the deep RVFL network model consists of the nonlinearly transformed features of the previous layer and the original input features, and the hidden layers are described as:

$$H^{(l)} = \begin{cases} g\left(X W^{(1)}\right), & l = 1 \\ g\left([H^{(l-1)}\ X]\, W^{(l)}\right), & l > 1 \end{cases} \quad (11)$$

wherein $g(\cdot)$ is a nonlinear activation function, $X$ is the original input features, $H^{(l-1)}$ is the hidden features of layer $l-1$, $W^{(l)}$ is the weight matrix of layer $l$, and $l$ is the layer index; when $l = 1$, $W^{(1)}$ is the weight matrix between the input and the first hidden layer; when $l > 1$, $W^{(l)}$ is the weight matrix between the inner hidden layers;

the input of the output layer in the deep RVFL network model consists of the stacked nonlinear features of the hidden layers and the original features, and is expressed as:

$$D = [H^{(1)}\ H^{(2)}\ \dots\ H^{(l-1)}\ H^{(l)}\ X] \quad (12)$$

the output of the deep RVFL network model is defined as $Y = D\beta_d$; when regularized least squares is used, its closed-form solution is described as:

$$\beta_d = \left(D^{T} D + \lambda I\right)^{-1} D^{T} Y \quad (13)$$

wherein $\beta_d$ is the closed-form solution, $\lambda$ is the regularization parameter, $I$ is the identity matrix, $D$ is the input of the output layer, and $D^{T}$ is its transpose.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a novel method for predicting circular RNA-disease association in an embodiment of the present invention;

FIG. 2 is a graph of the 5FCV ROC curve obtained on a reference data set according to the present invention;

FIG. 3 is a bar graph comparing the results of different classifier models.

Detailed Description

In order that the technical solutions of the present invention may be better understood, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. The embodiments are described in connection with the drawings but are not intended to limit the invention. Where the steps described herein do not depend on one another, the order in which they are described as examples should not be construed as a limitation; those skilled in the art will understand that the order may be adjusted, provided that doing so does not break the logical relationship between the steps or render the overall process impracticable.

The embodiment of the invention provides a novel circular RNA-disease association prediction method, which is carried out according to the following steps as shown in figure 1:

a. Selection and establishment of the data set: a data set for predicting circular RNA-disease associations is constructed from circular RNA-disease association data collected from the CircR2Disease database.

b. Construction of circular RNA and disease attribute information: the similarity attributes of circular RNAs and diseases are characterized by the Gaussian interaction profile kernel, and the associations between diseases are characterized according to the authoritative disease classification annotations provided by MeSH.

c. Feature extraction with the attention mechanism: a graph attention network is used to compute in parallel the attention between each network node and its neighboring nodes, and effective attention mechanism features are extracted by processing nodes of different dimensions and applying them to an inductive learning problem.

d. Construction of the randomized deep learning classification model: the model is trained with the edRVFL algorithm in a closed-form, non-iterative manner, and the high-level representation of the data is used for fast and accurate classification.

Selection and establishment of the data set in step a: the experiments of the invention are performed on the reference data set CircR2Disease. This data set currently contains 739 experimentally validated circular RNA-disease associations involving 100 diseases and 661 circular RNAs. According to the association relationships, the invention divides the data set into positive samples $T^{+}$ and unlabeled samples $T^{u}$.

To construct a balanced data set and make the evaluation more accurate, we randomly selected 739 negative samples from the unlabeled samples using a down-sampling method. Although this does not guarantee that every selected negative sample is a true negative, only 739 ÷ (661 × 100 - 739) ≈ 1.13% of the unlabeled pairs are selected. The probability that a selected negative sample is in fact an associated circular RNA-disease pair is much smaller than this ratio and is negligible from a probabilistic point of view.
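By way of a non-limiting illustration, the down-sampling of step a might be sketched in Python as follows. The function name `sample_negatives`, the index-pair representation, and the placeholder positive pairs are assumptions introduced here for illustration; they are not the CircR2Disease data and not necessarily the implementation used in the embodiment.

```python
import random

def sample_negatives(n_circrna, n_disease, positives, n_negatives, seed=0):
    """Draw n_negatives unlabeled (circRNA, disease) index pairs at random."""
    positive_set = set(positives)
    # Every pair without a known association is treated as unlabeled.
    unlabeled = [(c, d)
                 for c in range(n_circrna)
                 for d in range(n_disease)
                 if (c, d) not in positive_set]
    rng = random.Random(seed)
    return rng.sample(unlabeled, n_negatives), len(unlabeled)

# Placeholder positives (739 distinct index pairs), NOT the real data set.
toy_positives = [(i % 661, i % 100) for i in range(739)]
negatives, n_unlabeled = sample_negatives(661, 100, toy_positives, 739)
sampling_ratio = 739 / n_unlabeled
print(n_unlabeled, f"{100 * sampling_ratio:.2f}%")  # prints: 65361 1.13%
```

The printed ratio reproduces the figure of about 1.13% given above: 739 negatives are drawn from 661 × 100 - 739 = 65361 unlabeled pairs.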

Construction of circular RNA and disease attribute information in step b: the disease information used in the invention comes from the MeSH database, an authoritative thesaurus compiled by the U.S. National Library of Medicine (NLM) and used as the basis for biomedical indexing. In MeSH, the rules governing the subject headings and subheadings of each category are strictly defined, and various cross-references and annotations are attached. Based on the information provided by MeSH, we use directed acyclic graphs (DAGs) to reflect the associations between different diseases. Nodes in the DAG represent diseases, and edges represent the relationships between diseases. The contribution $C_e(d)$ of disease $d$ in $DAG_e$ can then be calculated as follows:

$$C_e(d) = \begin{cases} 1, & d = e \\ \max\{\theta \cdot C_e(d') \mid d' \in \text{children of } d\}, & d \neq e \end{cases}$$

where $\theta$ is the semantic contribution factor and $d'$ is a child disease of $d$.

According to the above definition, the semantics of disease $e$ are obtained by accumulating the contributions of all diseases in the disease set $N_e$:

$$SC(e) = \sum_{d \in N_e} C_e(d)$$
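As a minimal sketch of the disease semantics computation described above, the following Python code propagates contributions from a disease e to its ancestors in a DAG. The value θ = 0.5, the `parents` dictionary layout, and the disease labels "E", "A", and "B" are assumptions made for illustration only; the embodiment does not state them.

```python
def semantic_contributions(target, parents, theta=0.5):
    """Return {d: C_e(d)} for the target disease e and all of its ancestors.

    parents maps each disease to its direct parents in a MeSH-like DAG.
    theta is the semantic contribution factor (0.5 is an assumed value).
    """
    contrib = {target: 1.0}          # C_e(e) = 1
    frontier = [target]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            candidate = theta * contrib[child]
            # Keep the largest contribution over all children of the parent.
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                frontier.append(parent)
    return contrib

def semantic_value(target, parents, theta=0.5):
    """SC(e): sum of the contributions of all diseases in N_e."""
    return sum(semantic_contributions(target, parents, theta).values())

# Hypothetical DAG: E has parents A and B, and B has parent A.
toy_parents = {"E": ["A", "B"], "B": ["A"]}
toy_contrib = semantic_contributions("E", toy_parents)
toy_sc = semantic_value("E", toy_parents)
```

In this toy graph, disease A is reachable both directly (contribution 0.5) and through B (contribution 0.25); the maximum in the definition keeps 0.5, so the semantic value of E is 1 + 0.5 + 0.5 = 2.0.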

The Gaussian interaction profile (GIP) kernel similarity is used to describe circular RNA information, on the assumption that similar circular RNAs are more likely to be associated with functionally similar diseases. We use a binary vector to represent the interaction profile of a circular RNA, in which a position is assigned 1 when the circular RNA is associated with the corresponding disease and 0 otherwise. The GIP similarity $D_{GIP}(d(i), d(j))$ can therefore be calculated with the following formula, where $\sigma_d$ is the width parameter, $m$ is the number of circular RNAs in the data set, and $IP(d(i))$ is the vector of $d(i)$ in the adjacency matrix:

$$D_{GIP}(d(i), d(j)) = \exp\left(-\sigma_d \lVert IP(d(i)) - IP(d(j)) \rVert^{2}\right)$$
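The GIP kernel described above can be sketched in a few lines of Python. The function name `gip_kernel` and the toy profiles are illustrative assumptions. The text does not specify how the width parameter is chosen, so the sketch assumes a common convention in which it is the reciprocal of the mean squared norm of the profiles.

```python
import math

def gip_kernel(profiles):
    """Gaussian interaction profile kernel between binary profiles.

    profiles[i] is the 0/1 association vector (one row or column of the
    adjacency matrix) of entity i. The width parameter is set to the
    reciprocal of the mean squared profile norm (an assumed convention);
    at least one profile must be non-zero.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    sigma = 1.0 / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-sigma * sq_dist)
    return kernel

# Toy profiles: entities 0 and 1 share identical associations.
toy_profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
toy_kernel = gip_kernel(toy_profiles)
```

Identical profiles receive a similarity of 1, and the similarity decays exponentially with the squared distance between profiles.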

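Feature extraction with the attention mechanism in step c: the invention uses an attention mechanism to extract features from the circular RNA-disease data. The attention mechanism originated from the study of human vision. In cognitive science, humans selectively concentrate on one part of the available information while ignoring the rest, thereby making reasonable use of limited visual processing resources. In the invention we introduce the graph attention network (GAT) to extract features with the attention mechanism. The GAT computes the attention between each node and its neighboring nodes in parallel, can process nodes of different dimensions, and can be applied directly to inductive learning problems, thereby implementing an effective attention mechanism.

Assume the input of the GAT is

$$h = \{h_1, h_2, \dots, h_N\}, \quad h_i \in \mathbb{R}^{F}$$

and its output is

$$h' = \{h'_1, h'_2, \dots, h'_N\}, \quad h'_i \in \mathbb{R}^{F'}$$

where $N$ is the number of nodes, and $F$ and $F'$ are the numbers of attributes of the input and output nodes, respectively. To obtain the corresponding input-output transformation, a weight matrix $W \in \mathbb{R}^{F' \times F}$ shared by all nodes is trained. A self-attention mechanism $a$ is then applied to each node:

$$a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$$

Without considering the graph structure information, the attention coefficient $e_{i,j}$ represents the importance of node $j$ to node $i$, as follows:

$$e_{i,j} = a(W h_i, W h_j)$$

To make the attention coefficients easy to compare and compute, a softmax function is introduced to normalize them over all neighbors of a node:

$$\alpha_{i,j} = \mathrm{softmax}_j(e_{i,j}) = \frac{\exp(e_{i,j})}{\sum_{r \in \mathcal{N}_i} \exp(e_{i,r})}$$

In the GAT, the attention mechanism $a$ is a single-layer feed-forward neural network parameterized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$, to which the LeakyReLU nonlinear activation is added, which is expressed as follows:

$$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T} [W h_i \,\Vert\, W h_j]\right)\right)}{\sum_{r \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T} [W h_i \,\Vert\, W h_r]\right)\right)}$$

where $\mathcal{N}_i$ is the set of neighbors of node $i$ and $\Vert$ denotes concatenation.

After the normalized attention coefficients between nodes are obtained through the above operations, the output feature of each node may be calculated as follows:

$$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i,j} W h_j\right) \quad (9)$$

To speed up the self-attention learning process, we use a multi-head attention mechanism in which $K$ attention mechanisms work independently and their results are concatenated after being computed separately:

$$h'_i = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{k} W^{k} h_j\right) \quad (10)$$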
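The attention computation described above can be illustrated with a small NumPy sketch of a single attention layer and its multi-head concatenation. This is an illustration under stated assumptions rather than the trained network of the embodiment: the weights are random instead of learned, `tanh` stands in for the unspecified activation σ, the LeakyReLU slope of 0.2 is assumed, and node features are stored as rows, so `W` has shape (F, F') rather than (F', F).

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, adj, W, a, slope=0.2):
    """Single-head graph attention layer.

    h: (N, F) node features; adj: (N, N) 0/1 adjacency with self-loops;
    W: (F, F_out) shared weights; a: (2 * F_out,) attention weight vector.
    """
    z = h @ W                                    # row i holds W h_i
    f_out = z.shape[1]
    # a^T [z_i || z_j] splits into a term for node i plus a term for node j.
    scores = leaky_relu((z @ a[:f_out])[:, None] + (z @ a[f_out:])[None, :],
                        slope)
    scores = np.where(adj > 0, scores, -np.inf)  # attend to neighbours only
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over neighbours
    return np.tanh(alpha @ z), alpha             # tanh is an assumed sigma

def multi_head(h, adj, heads):
    """Concatenate the outputs of independent attention heads."""
    return np.concatenate([gat_layer(h, adj, W, a)[0] for W, a in heads],
                          axis=1)

rng = np.random.default_rng(0)
h_toy = rng.normal(size=(4, 3))                  # 4 nodes, F = 3 (toy values)
adj_toy = np.array([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]])               # path graph with self-loops
heads = [(rng.normal(size=(3, 2)), rng.normal(size=4)) for _ in range(3)]
out_single, alpha = gat_layer(h_toy, adj_toy, *heads[0])
out_multi = multi_head(h_toy, adj_toy, heads)    # shape (4, 6)
```

Each row of `alpha` sums to 1 and is zero outside the node's neighbourhood, which is the normalization performed by the softmax; with K = 3 heads of width 2, the concatenated output has 6 features per node.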

Construction of the randomized deep learning classification model in step d: in the invention, we use the ensemble deep RVFL network (edRVFL), a randomization-based deep learning classifier, to classify the features and determine whether a circular RNA and a disease are associated. Each hidden layer of edRVFL takes as input the nonlinearly transformed features of the previous layer together with the original input features, which can be described as:

$$H^{(l)} = \begin{cases} g\left(X W^{(1)}\right), & l = 1 \\ g\left([H^{(l-1)}\ X]\, W^{(l)}\right), & l > 1 \end{cases} \quad (11)$$

where $g(\cdot)$ is a nonlinear activation function and $X$ is the original input features; when $l = 1$, $W^{(1)}$ is the weight matrix between the input and the first hidden layer; when $l > 1$, $W^{(l)}$ is the weight matrix between the inner hidden layers. The input of the output layer consists of the stacked nonlinear features of the hidden layers and the original features, and can be expressed as:

$$D = [H^{(1)}\ H^{(2)}\ \dots\ H^{(l-1)}\ H^{(l)}\ X] \quad (12)$$

Thus, the output of edRVFL may be defined as $Y = D\beta_d$. When regularized least squares is used, its closed-form solution can be described as:

$$\beta_d = \left(D^{T} D + \lambda I\right)^{-1} D^{T} Y \quad (13)$$

where $\lambda$ is the regularization parameter and $I$ is the identity matrix.

To make full use of all the data and obtain reliable results, we used five-fold cross-validation (5FCV) to calculate the accuracy (Acc), sensitivity (Sen), precision (Pre), F1 score (F1), and Matthews correlation coefficient (MCC) in the experiments. Specifically, we first randomly split the original data into 5 non-overlapping subsets, then select one subset at a time as the test set and use the remaining 4 as the training set for model training. This step is repeated 5 times so that each subset is used as the test set exactly once. Finally, the averages of the 5 test results are calculated and summarized in Table 1 as the evaluation results of the model. As can be seen from the table, the invention reaches an accuracy of 93.10% on the reference data set, with a standard deviation of 1.91%. Across the five cross-validation folds, the highest accuracy is 96.27% and the lowest is 91.19%. The average F1 and MCC, which reflect the overall performance of the model, reach 93.44% and 86.55%, respectively. For sensitivity and precision, the invention reaches 97.56% and 89.68%, respectively. The ROC curves in FIG. 2 show that the five curves generated by the invention lie toward the upper left corner of the plot, with an AUC value of 0.9235.
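The evaluation protocol described above can be sketched as follows. The helper names `k_fold_indices` and `classification_metrics` are hypothetical, and the confusion-matrix counts in the usage lines are made-up numbers chosen only to exercise the formulas; they are not results of the invention.

```python
import math
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]     # k non-overlapping subsets
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def classification_metrics(tp, fp, tn, fn):
    """Acc, Sen, Pre, F1 and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)
    pre = tp / (tp + fp)
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Acc": acc, "Sen": sen, "Pre": pre, "F1": f1, "MCC": mcc}

splits = list(k_fold_indices(1478))           # 739 positives + 739 negatives
toy_metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

In a full experiment, the metrics would be computed on the test subset of each of the five splits and then averaged, as described for Table 1.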

TABLE 1 results of 5FCV experiments obtained on the reference data set according to the invention

Comparison of different classifier models: to verify whether the edRVFL classifier is responsible for the performance improvement of the invention, we compared it with other classifier models. Specifically, we fed the extracted features into different classifier models, including random vector functional link (RVFL), extreme learning machine (ELM), rotation forest (ROF), random forest (RAF), K-nearest neighbor (KNN), and support vector machine (SVM), performed 5FCV experiments, and compared their results with those of the invention. Table 2 summarizes the results of these classifier models, with the highest values shown in bold. These values are also plotted as a bar chart in FIG. 3. The comparison shows that the invention achieved the best results in accuracy, precision, F1 score, MCC, and AUC, which were 4.87%, 6.03%, 4.41%, 9.23%, and 0.0434 higher, respectively, than the mean of the other classifier models. The invention achieved the second-highest sensitivity, only 0.14% below the best result. In summary, the invention achieves the best overall performance in this comparison. This result indicates that the randomization-based deep learning algorithm can greatly improve model performance and help to predict circular RNA-disease associations accurately.

TABLE 2 Experimental results obtained on the reference data set for different classifier models

Comparison with other existing methods: in recent years, as circular RNA research has progressed, a number of computational models have been designed to predict circular RNA-disease associations rapidly, with satisfactory results. To validate the capability of the invention fully, we compared it with these models, including the method of Wang et al., PWCDA, GCNCDA, NCPCDA, DWNN-RLS, SIMCCDA, and MRLDC. For fairness, we selected models that were evaluated on the same reference data set with the 5FCV method. Since the AUC reflects the overall performance of a model, we summarize the AUC scores of these models in Table 3. As can be seen from the table, the AUC score achieved by the invention is higher than those of these models. The results show that the model based on the randomized deep learning algorithm combined with the attention mechanism has the best performance.

TABLE 3 5FCV AUC scores obtained by different models

Moreover, although exemplary embodiments have been described herein, the scope of the invention includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, or alterations based on the present invention. The elements of the claims are to be interpreted broadly based on the language employed in the claims and are not limited to the examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.

The above description is intended to be illustrative and not restrictive. For example, the above-described examples (or one or more aspects thereof) may be used in combination with each other, and other embodiments may be used by those of ordinary skill in the art upon reading the above description. In addition, in the above-described embodiments, various features may be grouped together to streamline the disclosure. This should not be interpreted as an intention that an unclaimed disclosed feature is essential to any claim. Rather, inventive subject matter may lie in less than all features of a particular disclosed embodiment. Thus, the following claims are hereby incorporated into the detailed description as examples or embodiments, with each claim standing on its own as a separate embodiment, and it is contemplated that the embodiments can be combined with each other in various combinations or permutations. The scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

Claims

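1. A novel circular RNA-disease association prediction method, comprising:

constructing a data set for predicting circular RNA-disease associations based on circular RNA-disease association data;

characterizing the similarity attributes of circular RNAs and diseases by means of a Gaussian interaction profile kernel, and characterizing the associations between different diseases according to disease classification annotations;

based on the data set for predicting circular RNA-disease associations, using a graph attention network to compute in parallel the attention between each network node and its neighboring nodes, and extracting attention mechanism features by processing nodes of different dimensions and applying them to an inductive learning problem;

and building an ensemble deep RVFL network model, which is a randomization-based deep learning classifier, and training the model on the attention mechanism features in a closed-form, non-iterative manner to obtain a model for circular RNA-disease association prediction.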

2. The novel circular RNA-disease association prediction method according to claim 1, wherein after constructing the data set for predicting circular RNA-disease associations based on circular RNA-disease association data, the method further comprises: dividing the data set by the following formula (1):

$$T = T^{+} \cup T^{u} \quad (1)$$

wherein $T$ is the data set for predicting circular RNA-disease associations, $T^{+}$ is the set of positive samples, and $T^{u}$ is the set of unlabeled samples;

and randomly selecting a preset number of negative samples from the unlabeled samples by a down-sampling method.

3. The novel circular RNA-disease association prediction method according to claim 1, wherein circular RNA-disease association data collected from the CircR2Disease database are used to construct the data set for predicting circular RNA-disease associations.

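4. The novel circular RNA-disease association prediction method according to claim 1, wherein characterizing the similarity attributes of circular RNAs and diseases by means of the Gaussian interaction profile (GIP) kernel comprises:

using a binary vector to represent the interaction profile of a circular RNA, in which a position is assigned 1 when the circular RNA is associated with the corresponding disease and 0 otherwise;

calculating the GIP similarity $D_{GIP}(d(i), d(j))$ by the following formula:

$$D_{GIP}(d(i), d(j)) = \exp\left(-\sigma_d \lVert IP(d(i)) - IP(d(j)) \rVert^{2}\right)$$

wherein $\sigma_d$ is the width parameter, $\exp(\cdot)$ is the exponential function, $d(i)$ is the $i$-th disease, $IP(d(i))$ is the vector of the $i$-th disease in the adjacency matrix, $d(j)$ is the $j$-th disease, and $IP(d(j))$ is the vector of the $j$-th disease in the adjacency matrix.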

5. The novel circular RNA-disease association prediction method according to claim 1 or 4, wherein characterizing the associations between different diseases according to the disease classification annotations comprises:

using a directed acyclic graph (DAG) to reflect the associations between different diseases according to the disease classification annotations, wherein nodes in the DAG represent diseases and edges represent the relationships between diseases;

calculating the contribution $C_e(d)$ of a disease $d$ in the disease set $N_e$ of $DAG_e$ as follows:

$$C_e(d) = \begin{cases} 1, & d = e \\ \max\{\theta \cdot C_e(d') \mid d' \in \text{children of } d\}, & d \neq e \end{cases}$$

wherein $\theta$ is the semantic contribution factor, $\cdot$ denotes multiplication, $d'$ is a child disease of $d$, $C_e(d')$ is the contribution of disease $d'$, "children of $d$" is the set of child diseases of $d$, and $e$ is the disease whose semantics are computed;

the semantics of the disease are obtained by the following formula:

$$SC(e) = \sum_{d \in N_e} C_e(d)$$

wherein $SC(e)$ is the semantic value of disease $e$ and $N_e$ is the disease set of $DAG_e$.

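6. The novel circular RNA-disease association prediction method according to claim 1, wherein using a graph attention network, based on the data set for predicting circular RNA-disease associations, to compute in parallel the attention between each network node and its neighboring nodes, and extracting attention mechanism features by processing nodes of different dimensions and applying them to an inductive learning problem, comprises:

supposing the input of the graph attention network GAT is

$$h = \{h_1, h_2, \dots, h_N\}, \quad h_i \in \mathbb{R}^{F}$$

and its output is

$$h' = \{h'_1, h'_2, \dots, h'_N\}, \quad h'_i \in \mathbb{R}^{F'}$$

wherein $N$ is the number of nodes, $F$ and $F'$ are the numbers of attributes of the input and output nodes, respectively, $h$ is the set of input vectors, $h_i$ is the $i$-th input vector, $h'$ is the set of output vectors, and $h'_i$ is the $i$-th output vector; training a weight matrix $W \in \mathbb{R}^{F' \times F}$ shared by all nodes to obtain the corresponding input-output transformation, and applying a self-attention mechanism $a$ to each node, whose attention coefficient $e_{i,j}$ expresses the importance of node $j$ to node $i$:

$$e_{i,j} = a(W h_i, W h_j)$$

normalizing the coefficients over all neighbors of the node with a softmax function:

$$\alpha_{i,j} = \mathrm{softmax}_j(e_{i,j}) = \frac{\exp(e_{i,j})}{\sum_{r \in \mathcal{N}_i} \exp(e_{i,r})}$$

wherein the attention mechanism is a single-layer feed-forward neural network in the GAT, parameterized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$, to which the LeakyReLU nonlinear activation is added, as follows:

$$\alpha_{i,j} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T} [W h_i \,\Vert\, W h_j]\right)\right)}{\sum_{r \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{T} [W h_i \,\Vert\, W h_r]\right)\right)}$$

wherein $\mathcal{N}_i$ is the set of neighbors of node $i$ and $\Vert$ denotes concatenation;

after the normalized attention coefficients between nodes are obtained, calculating the output feature of each node by the following formula (9) or (10):

$$h'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i,j} W h_j\right) \quad (9)$$

$$h'_i = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{i,j}^{k} W^{k} h_j\right) \quad (10)$$

wherein $\sigma(\cdot)$ is an activation function, $k$ is the index of an attention mechanism, $K$ is the number of attention mechanisms, $\alpha_{i,j}^{k}$ is the attention coefficient of the $k$-th attention mechanism, and $W^{k}$ is the corresponding weight matrix.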

7. The novel circular RNA-disease association prediction method according to claim 1, wherein the input of each hidden layer in the deep RVFL network model consists of the nonlinearly transformed features of the previous layer and the original input features, and the hidden layers are described as follows:

$$H^{(l)} = \begin{cases} g\left(X W^{(1)}\right), & l = 1 \\ g\left([H^{(l-1)}\ X]\, W^{(l)}\right), & l > 1 \end{cases} \quad (11)$$

wherein $g(\cdot)$ is a nonlinear activation function, $X$ is the original input features, $H^{(l-1)}$ is the hidden features of layer $l-1$, and $l$ is the layer index; when $l = 1$, $W^{(1)}$ is the weight matrix between the input and the first hidden layer; when $l > 1$, $W^{(l)}$ is the weight matrix between the inner hidden layers;

the input of the output layer in the deep RVFL network model consists of the stacked nonlinear features of the hidden layers and the original features, and is expressed as follows:

$$D = [H^{(1)}\ H^{(2)}\ \dots\ H^{(l-1)}\ H^{(l)}\ X] \quad (12)$$