CN114999635A

CN114999635A - circRNA-disease association relation prediction method based on graph convolution neural network and node2vec

Info

Publication number: CN114999635A
Application number: CN202210702017.6A
Authority: CN
Inventors: 张奕; 王真梅; 蔡钢生
Original assignee: Guilin University of Technology
Current assignee: Guilin University of Technology
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2022-09-02

Abstract

A circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec comprises the following steps: acquiring a circRNA-disease association matrix; calculating the functional similarity of circRNAs, the Gaussian interaction profile kernel similarity of circRNAs, the Gaussian interaction profile kernel similarity of diseases and the semantic similarity of diseases, constructing the integrated similarity of circRNAs and the integrated similarity of diseases, and generating a circRNA-disease heterogeneous graph; using a sparse autoencoder to perform feature extraction and transformation on the circRNA (disease) integrated similarity, converting it into a 64-dimensional feature vector, and fusing the circRNA feature vector and the disease feature vector into a final circRNA-disease feature vector; extracting local structure information of the nodes from the circRNA-disease heterogeneous graph with the graph convolutional neural network; extracting global structure information of the nodes from the circRNA-disease heterogeneous graph with the Node2vec method; and feeding the node information obtained in the preceding two steps into a random forest classifier to predict potential circRNA-disease associations. Predicting disease-related circRNAs computationally saves time and helps to clarify disease pathogenesis and to search for effective treatment schemes.

Description

circRNA-disease association relation prediction method based on graph convolution neural network and node2vec

Technical Field

The invention relates to the field of association prediction in bioinformatics, in particular to a circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec.

Background

In 1976, the first circRNA was discovered in a study of RNA viruses. Because of its unusual structure, unknown function and low abundance, circRNA was long considered an artifact or a mis-spliced product. With the development of sequencing technologies, more and more circRNAs have been identified in thousands of organisms, such as plants, animals and bacteria. circRNAs have been found to have important molecular functions: participating in the regulation of gene expression, acting as molecular sponges that absorb microRNA (miRNA) and inhibit its activity, regulating the expression of messenger RNA, and so on. Mutation or dysfunction of circRNAs disrupts various vital activities and thereby causes disease. Therefore, studying the mechanism of circRNAs in disease occurrence and their role in disease treatment, and understanding the associations between circRNAs and diseases, is an important topic in bioinformatics research that benefits disease prognosis, diagnosis and treatment and opens a new avenue for future research.

Traditional biological experimental verification requires a large amount of manpower and material resources; it is accurate but time-consuming. Mining the biological characteristics of the data with computational methods to predict circRNA-disease associations is convenient and efficient. Current computational methods for predicting circRNA-disease associations fall into two broad categories: methods based on network propagation and methods based on machine learning.

Methods based on network propagation use circRNA and disease association data to construct circRNA (disease) similarity networks and predict potential circRNA-disease associations. Fan et al. developed the computational model KATZHCDA, which applies the KATZ measure to a heterogeneous network built from circRNA expression profiles, disease phenotype similarities and known circRNA-disease associations. The model successfully predicts circRNA-disease associations on heterogeneous networks using a simple measure, but it is not suitable for new diseases without any known circRNA association or for isolated circRNAs without any known disease association. Li et al. proposed DWNCPCDA, a method that predicts circRNA-disease associations using DeepWalk and network consensus projection. Its advantage is that the network embedding method DeepWalk learns node embeddings of the known circRNA-disease association network and is combined with a similarity-based method, which gives greater flexibility to circRNA-disease association prediction. In the future, integrating more biomedical association data on circRNAs or diseases, such as circRNA-miRNA associations and miRNA-disease associations, could further improve its prediction performance.

Machine learning-based methods mine deep features of circRNA and disease data using supervised or unsupervised methods, iteratively optimize model parameters, and design classifiers to identify disease-related circRNAs. Lei et al. proposed the computational method RWRKNN, which applies the random walk with restart algorithm to weight features with global network topology information and uses a K-nearest neighbor algorithm to classify according to these features, improving prediction performance. However, RWRKNN is somewhat weak at revealing associations between diseases and new circRNAs without any known association, or between circRNAs and new diseases without any known association. Ding et al. developed RWLR, a computational model based on random walk and logistic regression, to predict circRNA-disease associations. Using random walk with restart to obtain the global structure information of each circRNA performs better than methods based on similarity alone. RWLR can predict novel circRNAs with no known associated disease. However, RWLR considers only circRNA similarity and does not include sufficient disease information, which results in poor prediction accuracy. Zhang et al. proposed iGRLCDA, a graph representation learning-based method for identifying potential circRNA-disease associations. The method uses a deep learning model combining a graph convolutional neural network and graph factorization to effectively mine higher-level circRNA and disease information. However, iGRLCDA depends on the properties of the known circRNA-disease associations and is less sensitive to new circRNA-disease associations.

In view of this, studying prediction methods for circRNA-disease associations is very important. The invention provides a circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec to predict potential circRNA-disease associations.

Disclosure of Invention

The invention aims to solve the problems of existing circRNA-disease association prediction models, such as low prediction accuracy and time-consuming training, and provides a circRNA-disease association prediction method based on a graph convolutional neural network and node2vec that improves prediction accuracy and reduces training cost.

The technical scheme of the invention specifically comprises the following steps:

Step 1: obtaining a circRNA-disease association matrix.

Experimentally verified circRNA-disease association data are acquired from the circR2Disease database; after redundant data are deleted, only the known associations related to complex human diseases are selected to form the circRNA-disease association matrix.

Step 2: calculating the semantic similarity of diseases, the Gaussian interaction profile kernel similarity of diseases, the functional similarity of circRNAs and the Gaussian interaction profile kernel similarity of circRNAs, constructing the integrated similarity of circRNAs and the integrated similarity of diseases, and generating a circRNA-disease heterogeneous graph.

The annotation terms related to each disease are obtained from the MeSH database, and the semantic similarity between diseases is calculated using directed acyclic graphs (DAGs); the Gaussian interaction profile kernel similarity of circRNAs (diseases) is calculated from the circRNA-disease association matrix; and the functional similarity of circRNAs is calculated from the disease semantic similarity and the circRNA-disease association matrix. By integrating complementary information from multiple data sources and different representation methods, integrated similarity is used to quantify each pair of circRNAs (diseases), overcoming the inherent sparsity and yielding a circRNA integrated similarity matrix and a disease integrated similarity matrix.

Step 3: a sparse autoencoder performs feature extraction and transformation on the circRNA (disease) integrated similarity, converts it into a 64-dimensional feature vector, and fuses the circRNA feature vector and the disease feature vector into a final circRNA-disease feature vector.

A sparse autoencoder can not only learn features automatically but also give a better feature description than the original data. Replacing the original data with the features learned by the sparse autoencoder improves the prediction performance of the model to a certain extent. For this purpose, the invention applies a sparse autoencoder to the circRNA integrated similarity and to the disease integrated similarity separately, minimizes the error between input and output by the back propagation algorithm, and extracts and transforms features to obtain 64-dimensional circRNA (disease) feature vectors. Finally, the circRNA and disease feature vectors are combined to obtain the final circRNA-disease feature vector.

Step 4: the graph convolutional neural network extracts local structure information of the nodes from the circRNA-disease heterogeneous graph.

Local structure information describes the local similarity between nodes in the graph. Specifically, if two nodes are connected by an edge, they are related in the embedding space; if there is no edge between two nodes, their first-order proximity is 0. The graph convolutional neural network takes the structure of the circRNA-disease heterogeneous graph and the features of the circRNA (disease) nodes as input and outputs pooled node information and graph structure information, from which the local structure information is obtained.

Step 5: the Node2vec method extracts global structure information of the nodes from the circRNA-disease heterogeneous graph.

Global structure information describes the relationship between two nodes that are not directly connected. Node2vec is a targeted improvement on DeepWalk: it samples the graph by random walks, mapping the neighborhood structure of nodes into sequences. A Skip-gram model is then trained on the sampled sequences to capture the connectivity between nodes and obtain global structure information.

Step 6: the node information obtained in the preceding two steps is fed into a random forest classifier to predict potential circRNA-disease associations.

Five-fold cross validation is used to obtain the AUC and AUPR values of the invention and the prediction result.

Drawings

FIG. 1 is a schematic flow diagram of the circRNA-disease association prediction method based on a graph convolutional neural network and node2vec.

FIG. 2 is a graph of ROC curves for an implementation of the present invention.

FIG. 3 is a PR graph of the implementation method of the present invention.

Detailed Description

The invention relates to a circRNA-disease association prediction method based on a graph convolutional neural network and node2vec. The present invention is described in further detail below with reference to specific embodiments and simulation experiments. Those skilled in the art should understand that these embodiments only explain the technical principle of the present invention and are not intended to limit its scope of protection.

As shown in FIG. 1, a circRNA-disease association prediction method based on a graph convolutional neural network and node2vec specifically includes the following steps:

Preferably, obtaining the association matrix in step 1 is specifically as follows:

A total of 739 experimentally verified known circRNA-disease associations (involving 661 circRNAs and 100 diseases) were obtained from the circR2Disease database. After the redundant data were deleted, only the 650 known associations related to complex human diseases (involving 585 circRNAs and 88 diseases) were selected as the known association matrix $A \in \mathbb{R}^{n_c \times n_d}$, where $n_c$ and $n_d$ represent the numbers of circRNAs and diseases, respectively. If there is an experimentally verified known association between circRNA $c_i$ and disease $d_j$, the matrix element is defined as $A(c_i, d_j) = 1$; if there is no experimentally verified known association between circRNA $c_i$ and disease $d_j$, the matrix element is defined as $A(c_i, d_j) = 0$.
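The construction of the binary association matrix can be sketched as follows. This is a minimal illustration, assuming the associations are available as (circRNA, disease) pairs; the function name `build_association_matrix` and the toy identifiers are hypothetical and are not taken from circR2Disease.

```python
import numpy as np

def build_association_matrix(pairs):
    """Build a binary circRNA-disease matrix A from (circRNA, disease) pairs."""
    # Deduplicate pairs, mirroring the "delete redundant data" step.
    unique_pairs = sorted(set(pairs))
    circrnas = sorted({c for c, _ in unique_pairs})
    diseases = sorted({d for _, d in unique_pairs})
    c_index = {c: i for i, c in enumerate(circrnas)}
    d_index = {d: j for j, d in enumerate(diseases)}
    A = np.zeros((len(circrnas), len(diseases)), dtype=int)
    for c, d in unique_pairs:
        A[c_index[c], d_index[d]] = 1   # experimentally verified association
    return A, circrnas, diseases

# Toy records (hypothetical identifiers); the last record is a duplicate.
records = [("circA", "glioma"), ("circA", "gastric cancer"),
           ("circB", "glioma"), ("circA", "glioma")]
A, circrnas, diseases = build_association_matrix(records)
print(A)
```

Rows index circRNAs and columns index diseases, so `A.shape` corresponds to $(n_c, n_d)$.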

Preferably, calculating the semantic similarity of diseases in step 2 is specifically as follows:

A disease semantic similarity matrix $DS \in \mathbb{R}^{n_d \times n_d}$ is built from the downloaded disease-related data. The semantic contribution value of any disease $d_t$ to disease $d_i$ is calculated as follows:

In the formula, $\sigma$ represents the attenuation coefficient of the semantic contribution.

The matrix element $DS(d_i, d_j)$ represents the disease semantic similarity between disease $d_i$ and disease $d_j$ and is calculated as follows:
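The formulas for this step are not reproduced in the text. A formulation widely used for DAG-based disease semantic similarity, given here as an assumption rather than as the exact equations of the invention, is

$$D_{d_i}(d_t) = \begin{cases} 1, & d_t = d_i \\ \max\{\sigma \cdot D_{d_i}(d_t') \mid d_t' \in \text{children of } d_t\}, & d_t \neq d_i \end{cases}$$

$$DS(d_i, d_j) = \frac{\sum_{t \in T_i \cap T_j} \left(D_{d_i}(t) + D_{d_j}(t)\right)}{\sum_{t \in T_i} D_{d_i}(t) + \sum_{t \in T_j} D_{d_j}(t)}$$

where $T_i$ is the set containing $d_i$ and all of its ancestors in the DAG. The symbols $D_{d_i}$ and $T_i$, the value $\sigma = 0.5$, and the toy DAG below are illustrative assumptions.

```python
def semantic_contributions(disease, parents, sigma=0.5):
    """Return {term: contribution} for `disease` and all of its DAG ancestors."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            value = sigma * contrib[node]
            # Keep the largest contribution reaching this ancestor over any path.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def semantic_similarity(d_i, d_j, parents, sigma=0.5):
    """Shared contributions of common ancestors over total contributions."""
    ci = semantic_contributions(d_i, parents, sigma)
    cj = semantic_contributions(d_j, parents, sigma)
    shared = set(ci) & set(cj)
    numerator = sum(ci[t] + cj[t] for t in shared)
    return numerator / (sum(ci.values()) + sum(cj.values()))

# Hypothetical toy DAG: child term -> list of parent terms.
parents = {
    "liver carcinoma": ["liver neoplasm"],
    "liver neoplasm": ["neoplasm"],
    "lung neoplasm": ["neoplasm"],
}
print(semantic_similarity("liver carcinoma", "lung neoplasm", parents))
```

In this toy DAG the two diseases share only the root term, so their similarity is low; a disease compared with itself scores 1.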

Preferably, calculating the functional similarity of circRNAs in step 2 is specifically as follows:

The functional similarity of circRNAs is measured by their tendency to be associated with phenotypically similar diseases and is represented by the matrix $CS \in \mathbb{R}^{n_c \times n_c}$. The matrix element $CS(c_i, c_j)$ represents the functional similarity between circRNAs $c_i$ and $c_j$ and is calculated as follows:

In the formula, the set $D_i$ represents the set of diseases associated with circRNA $c_i$; the set $D_j$ represents the set of diseases associated with circRNA $c_j$; and $|D_i|$ and $|D_j|$ represent the numbers of diseases in the sets $D_i$ and $D_j$, respectively.
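The formula itself is not reproduced in the text. A common definition that is consistent with the symbols $D_i$, $D_j$, $|D_i|$ and $|D_j|$, assumed here for illustration, is

$$CS(c_i, c_j) = \frac{\sum_{d \in D_i} \max_{d' \in D_j} DS(d, d') + \sum_{d \in D_j} \max_{d' \in D_i} DS(d, d')}{|D_i| + |D_j|}$$

A minimal sketch under that assumption, with toy values for $A$ and $DS$:

```python
import numpy as np

def functional_similarity(A, DS):
    """circRNA functional similarity from association matrix A and disease similarity DS."""
    n_c = A.shape[0]
    CS = np.zeros((n_c, n_c))
    disease_sets = [np.flatnonzero(A[i]) for i in range(n_c)]
    for i in range(n_c):
        for j in range(n_c):
            Di, Dj = disease_sets[i], disease_sets[j]
            if len(Di) == 0 or len(Dj) == 0:
                continue  # no associated diseases: similarity left at 0
            block = DS[np.ix_(Di, Dj)]
            # Best match of each disease in D_i within D_j, and vice versa.
            total = block.max(axis=1).sum() + block.max(axis=0).sum()
            CS[i, j] = total / (len(Di) + len(Dj))
    return CS

A = np.array([[1, 1, 0],
              [0, 1, 1]])
DS = np.array([[1.0, 0.4, 0.2],
               [0.4, 1.0, 0.6],
               [0.2, 0.6, 1.0]])
CS = functional_similarity(A, DS)
print(np.round(CS, 3))
```

circRNAs without any associated disease receive a functional similarity of 0 in this sketch, which is the sparsity that the integrated similarity is meant to compensate for.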

Preferably, calculating the circRNA (disease) Gaussian interaction profile (GIP) kernel similarity in step 2 is specifically as follows:

The circRNA (disease) GIP kernel similarity is calculated in combination with the association matrix and the disease semantic similarity. The matrix $DK \in \mathbb{R}^{n_d \times n_d}$ represents the GIP kernel similarity of diseases. The matrix element $DK(d_i, d_j)$ represents the GIP kernel similarity between disease $d_i$ and disease $d_j$ and is calculated as follows:

$$DK(d_i, d_j) = \exp\left(-\mu_d \left\| A(\cdot, d_i) - A(\cdot, d_j) \right\|^2\right)$$

In the formula, $A(\cdot, d_i)$ denotes the column of $A$ corresponding to disease $d_i$, and the parameter $\mu_d$ controls the kernel bandwidth of the GIP similarity.

In the same way, the matrix $CK \in \mathbb{R}^{n_c \times n_c}$ represents the GIP kernel similarity of circRNAs. The matrix element $CK(c_i, c_j)$ represents the GIP kernel similarity between circRNAs $c_i$ and $c_j$ and is calculated as follows:

$$CK(c_i, c_j) = \exp\left(-\mu_c \left\| A(c_i, \cdot) - A(c_j, \cdot) \right\|^2\right)$$

In the formula, $A(c_i, \cdot)$ denotes the row of $A$ corresponding to circRNA $c_i$, and the parameter $\mu_c$ controls the kernel bandwidth of the GIP similarity.
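The GIP kernel can be sketched with NumPy as follows. The function name `gip_kernel` is hypothetical, and the bandwidth is passed as a fixed value `mu`, which is an assumption: the text does not state how $\mu_c$ and $\mu_d$ are chosen.

```python
import numpy as np

def gip_kernel(profiles, mu=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    sq_norms = (profiles ** 2).sum(axis=1)
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-mu * sq_dist)

A = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)
CK = gip_kernel(A, mu=1.0)      # circRNA kernel: interaction profiles are rows of A
DK = gip_kernel(A.T, mu=1.0)    # disease kernel: interaction profiles are columns of A
print(np.round(CK, 3))
print(np.round(DK, 3))
```

The same function serves both kernels because the disease profiles are simply the rows of the transposed matrix.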

Preferably, calculating the circRNA (disease) integrated similarity in step 2 is specifically as follows:

Considering the inherent sparsity of the disease semantic similarity and the circRNA functional similarity, complementary information from multiple data sources and different representation methods is integrated, and integrated similarity is used to quantify each pair of circRNAs (diseases) so as to overcome this sparsity. The matrix $X_d \in \mathbb{R}^{n_d \times n_d}$ represents the integrated similarity of diseases, and the matrix element $X_d(d_i, d_j)$ is calculated as follows:

The circRNA integrated similarity is represented by the matrix $X_c \in \mathbb{R}^{n_c \times n_c}$, and the matrix element $X_c(c_i, c_j)$ is calculated as follows:
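The integration formulas are not reproduced in the text. One rule often used in this setting, assumed here purely for illustration, keeps the primary similarity where it is available and falls back to the GIP kernel otherwise:

$$X_d(d_i, d_j) = \begin{cases} DS(d_i, d_j), & DS(d_i, d_j) \neq 0 \\ DK(d_i, d_j), & \text{otherwise} \end{cases} \qquad X_c(c_i, c_j) = \begin{cases} CS(c_i, c_j), & CS(c_i, c_j) \neq 0 \\ CK(c_i, c_j), & \text{otherwise} \end{cases}$$

```python
import numpy as np

def integrate(primary, gip):
    """Use `primary` similarity where it is nonzero, otherwise the GIP kernel value."""
    return np.where(primary != 0, primary, gip)

# Toy matrices: the semantic similarity of the disease pair (0, 1) is missing.
DS = np.array([[1.0, 0.0],
               [0.0, 1.0]])
DK = np.array([[1.0, 0.3],
               [0.3, 1.0]])
X_d = integrate(DS, DK)
print(X_d)
```

Other integration rules, such as averaging the two similarities, are equally plausible; the invention's exact choice cannot be confirmed from the text.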

Preferably, in step 3 the sparse autoencoder performs feature extraction and transformation on the circRNA (disease) integrated similarity, converts it into a 64-dimensional feature vector, and fuses the circRNA feature vector and the disease feature vector into the final circRNA-disease feature vector, specifically as follows:

The sparse autoencoder encodes the original input features and reduces their dimensionality to find potential associations between the input features and to extract expressive high-order features. The sparse autoencoder consists of an encoder and a decoder and is a three-layer neural network comprising an input layer, a hidden layer and an output layer, in which the input layer x is mapped to the hidden layer y. The encoder is calculated as follows:

$$y = \mathrm{sigmoid}(W_1 x^{(i)} + a_1)$$

In the formula, sigmoid represents the activation function; $W_1$ represents the connection parameters between the input layer x and the hidden layer y; and $a_1$ represents the bias.

The decoder is calculated as follows:

$$z = \mathrm{sigmoid}(W_2 y + a_2)$$

In the formula, $W_2$ represents the connection parameters between the hidden layer y and the output layer z, and $a_2$ represents the bias.

The circRNA integrated similarity and the disease integrated similarity are each input into a sparse autoencoder, the error between input and output is minimized by the back propagation algorithm, and features are extracted and transformed to obtain the 64-dimensional feature vectors $Z_c$ and $Z_d$, respectively. The two are combined to obtain the final circRNA-disease feature vector $Z_{cd}$, which is calculated as follows:
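The combination formula is not reproduced in the text; a natural reading, assumed here, is concatenation, $Z_{cd} = [Z_c, Z_d]$. The sketch below trains the encoder and decoder defined above by full-batch gradient descent. The KL-divergence sparsity penalty, the hyperparameters (`rho`, `beta`, `lr`, `epochs`) and the random stand-in similarity matrices are all illustrative assumptions, since the text does not specify the sparsity term or the training settings.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_sparse_autoencoder(X, hidden=64, rho=0.05, beta=0.1,
                             lr=0.5, epochs=300, seed=0):
    """Train a one-hidden-layer sigmoid autoencoder with a KL sparsity penalty.

    Returns the hidden-layer features and the loss recorded at each epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); a1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); a2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        y = sigmoid(X @ W1 + a1)            # encoder
        z = sigmoid(y @ W2 + a2)            # decoder
        rho_hat = np.clip(y.mean(axis=0), 1e-6, 1 - 1e-6)
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        losses.append(0.5 * np.sum((z - X) ** 2) / n + beta * kl)
        # Backpropagation of the reconstruction error and the sparsity penalty.
        delta2 = (z - X) / n * z * (1 - z)
        dy = delta2 @ W2.T + beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / n
        delta1 = dy * y * (1 - y)
        W2 -= lr * (y.T @ delta2); a2 -= lr * delta2.sum(axis=0)
        W1 -= lr * (X.T @ delta1); a1 -= lr * delta1.sum(axis=0)
    return sigmoid(X @ W1 + a1), losses

rng = np.random.default_rng(1)
X_c = rng.random((30, 30)); X_c = (X_c + X_c.T) / 2   # stand-in circRNA similarity
X_d = rng.random((12, 12)); X_d = (X_d + X_d.T) / 2   # stand-in disease similarity
Z_c, loss_c = train_sparse_autoencoder(X_c)
Z_d, loss_d = train_sparse_autoencoder(X_d)
z_cd = np.concatenate([Z_c[3], Z_d[5]])  # feature vector of the pair (c_3, d_5)
print(Z_c.shape, Z_d.shape, z_cd.shape)
```

Each row of the similarity matrix is treated as one input sample, so every circRNA and every disease receives its own 64-dimensional code, and a circRNA-disease pair is described by 128 values.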

Preferably, in step 4 the graph convolutional neural network extracts local structure information of the nodes from the circRNA-disease heterogeneous graph, specifically as follows:

The graph convolutional neural network takes the structure of the graph and the features of each node as input and outputs pooled node information and graph (node) structure information, from which the local structure information of the graph is obtained. For this purpose, the known circRNA-disease association matrix A is converted by calculation into the adjacency matrix of the graph. Local structure information is obtained using the spatial approach of the graph convolutional neural network, which is calculated as follows:

In the formula, ReLU(·) represents the activation function of the two layers of the neural network; the degree matrix is that of the adjacency matrix with added self-loops; W represents a weight matrix; and the adjacency matrix with added self-loops is calculated by adding the identity matrix to the adjacency matrix.
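The propagation formula is not reproduced in the text. The description matches the widely used graph convolution rule, written here with assumed symbols:

$$H^{(l+1)} = \mathrm{ReLU}\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right), \qquad \tilde{A} = \hat{A} + I, \qquad \hat{A} = \begin{bmatrix} 0 & A \\ A^{T} & 0 \end{bmatrix}$$

where $\tilde{D}$ is the degree matrix of $\tilde{A}$. The block construction of $\hat{A}$, the symmetric normalization and the random (untrained) weights in the sketch below are assumptions made only to show the forward computation.

```python
import numpy as np

def normalized_adjacency(A):
    """Bipartite adjacency with self-loops, symmetrically normalized."""
    n_c, n_d = A.shape
    A_hat = np.block([[np.zeros((n_c, n_c)), A],
                      [A.T, np.zeros((n_d, n_d))]])
    A_tilde = A_hat + np.eye(n_c + n_d)          # add self-loops
    deg = A_tilde.sum(axis=1)                    # node degrees, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_forward(A, X, W0, W1):
    """Two-layer graph convolution forward pass with ReLU activations."""
    S = normalized_adjacency(A)
    H = np.maximum(S @ X @ W0, 0.0)
    return np.maximum(S @ H @ W1, 0.0)

rng = np.random.default_rng(0)
A = np.array([[1, 1, 0],
              [0, 1, 1]], dtype=float)           # 2 circRNAs x 3 diseases
X = rng.random((5, 4))                           # one feature row per node
W0 = rng.normal(size=(4, 8))                     # untrained illustrative weights
W1 = rng.normal(size=(8, 4))
H = gcn_forward(A, X, W0, W1)
print(H.shape)
```

Each row of `H` mixes a node's own features with those of its direct neighbours, which is why the output is described as local structure information.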

Preferably, in step 5 the Node2vec method extracts global structure information of the nodes from the circRNA-disease heterogeneous graph, specifically as follows:

Node2vec is a semi-supervised algorithm for scalable feature learning in networks that maximizes the likelihood of preserving the network neighborhoods of nodes in a d-dimensional feature space. First, the graph is sampled by random walks, mapping the neighborhood structure of nodes into sequences; then a Skip-gram model is trained on the sampled sequences to capture the connectivity between nodes and obtain global structure information.
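The sampling stage can be sketched in plain Python as follows. This is a simplified illustration of the second-order biased walk of node2vec on an unweighted graph; the return parameter `p`, the in-out parameter `q`, the walk length and the toy edges are assumed values, and the Skip-gram training on the resulting sequences (for example with a Word2Vec implementation in Skip-gram mode) is not shown.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One second-order biased random walk over an unweighted graph.

    adj maps each node to a set of neighbours; p is the return parameter
    and q the in-out parameter of node2vec.
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(adj[cur])
        if not neighbours:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)      # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)          # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)      # move outward, distance 2 from prev
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk

# Hypothetical toy circRNA-disease edges.
edges = [("c1", "d1"), ("c1", "d2"), ("c2", "d2"), ("c3", "d3"), ("c2", "d3")]
adj = {}
for c, d in edges:
    adj.setdefault(c, set()).add(d)
    adj.setdefault(d, set()).add(c)
rng = random.Random(0)
walks = [node2vec_walk(adj, node, 10, p=1.0, q=0.5, rng=rng) for node in sorted(adj)]
print(walks[0])
```

Choosing `q` below 1 biases the walk away from the previous node, so the sequences reach nodes that are not directly connected to the start node.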

Preferably, feeding the information into the random forest classifier in step 6 is specifically as follows:

The node information obtained in the preceding two steps is fed into a random forest classifier to predict potential circRNA-disease associations and obtain the prediction result.
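The classification and five-fold evaluation can be sketched with scikit-learn as follows. The feature matrix is synthetic stand-in data, and the number of trees, the balanced positive and negative samples and the function name `five_fold_scores` are assumptions; the text does not state how negative samples are selected or how the classifier is configured.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def five_fold_scores(features, labels, seed=0):
    """Mean AUC and AUPR of a random forest under five-fold cross validation."""
    aucs, auprs = [], []
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(features, labels):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(features[train_idx], labels[train_idx])
        prob = clf.predict_proba(features[test_idx])[:, 1]
        aucs.append(roc_auc_score(labels[test_idx], prob))
        auprs.append(average_precision_score(labels[test_idx], prob))
    return float(np.mean(aucs)), float(np.mean(auprs))

# Synthetic stand-in data: rows are circRNA-disease pairs, label 1 = known association.
rng = np.random.default_rng(0)
labels = np.array([1] * 50 + [0] * 50)
features = rng.normal(size=(100, 16)) + labels[:, None] * 0.8
auc, aupr = five_fold_scores(features, labels)
print(round(auc, 3), round(aupr, 3))
```

In the method described here, each row would instead hold the node information of one circRNA-disease pair obtained in the preceding steps.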

The technical effects of the invention are further illustrated by experimental verification as follows:

1. Experimental conditions and contents:

The experiments of the invention were performed on an AMD 1.80 GHz CPU under the Windows 10 operating system.

2. Analysis of experimental results:

The prediction accuracy of circRNA-disease associations is evaluated with five-fold cross validation, and the evaluation indexes are ROC and PR. Here, ROC is the area under the ROC curve, with the false positive rate (FPR) as abscissa and the true positive rate (TPR) as ordinate, and PR is the area under the precision-recall curve, with recall as abscissa and precision as ordinate. Greater ROC and PR values indicate higher prediction accuracy.

The ROC curves and PR curves obtained by five-fold cross validation of the invention are shown in FIG. 2 and FIG. 3.
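How the two areas are obtained from ranked predictions can be illustrated with a short NumPy sketch. It assumes that all prediction scores are distinct, uses the trapezoidal rule for the ROC curve and a step-wise sum (average precision) for the PR curve, and runs on toy labels and scores rather than on results of the invention.

```python
import numpy as np

def roc_pr_areas(y_true, scores):
    """Areas under the ROC and precision-recall curves for distinct scores."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest score first
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y == 1)                       # true positives at each cut-off
    fp = np.cumsum(y == 0)                       # false positives at each cut-off
    tpr = np.concatenate([[0.0], tp / tp[-1]])   # TPR, identical to recall
    fpr = np.concatenate([[0.0], fp / fp[-1]])
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    precision = tp / (tp + fp)
    aupr = np.sum(np.diff(tpr) * precision)      # step-wise area under PR
    return float(auc), float(aupr)

auc, aupr = roc_pr_areas([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(auc, aupr)
```

In the toy example three of the four positive-negative pairs are ranked correctly, which is exactly what an area of 0.75 under the ROC curve expresses.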

The above description is only one specific example of the present invention and should not be construed as limiting the invention in any way. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A circRNA-disease association relation prediction method based on a graph convolution neural network and Node2vec is characterized by comprising the following steps:

Step 1: acquiring a circRNA-disease association matrix;

Step 2: calculating the semantic similarity of diseases, the Gaussian interaction profile kernel similarity of diseases, the functional similarity of circRNAs and the Gaussian interaction profile kernel similarity of circRNAs, constructing the integrated similarity of circRNAs and the integrated similarity of diseases, and generating a circRNA-disease heterogeneous graph;

Step 3: using a sparse autoencoder to perform feature extraction and transformation on the circRNA (disease) integrated similarity, converting it into a 64-dimensional feature vector, and combining the circRNA feature vector and the disease feature vector into a final circRNA-disease feature vector;

Step 4: extracting local structure information of the nodes from the circRNA-disease heterogeneous graph with the graph convolutional neural network;

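Step 5: extracting global structure information of the nodes from the circRNA-disease heterogeneous graph with the Node2vec method;

Step 6: feeding the node information obtained in the preceding two steps into a random forest classifier to predict potential circRNA-disease associations.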

2. The circRNA-disease association prediction method based on graph-convolution neural network and Node2vec as claimed in claim 1, wherein in step 1, specifically:

3. The circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec as claimed in claim 1, wherein step 2 is specifically as follows:

the annotation terms related to each disease are obtained from the MeSH database, and the semantic similarity between diseases is calculated using directed acyclic graphs (DAGs); the Gaussian interaction profile kernel similarity of circRNAs (diseases) is calculated from the circRNA-disease association matrix; the functional similarity of circRNAs is calculated from the disease semantic similarity and the circRNA-disease association matrix; and by integrating complementary information from multiple data sources and different representation methods, integrated similarity is used to quantify each pair of circRNAs (diseases), overcoming the inherent sparsity and yielding a circRNA integrated similarity matrix and a disease integrated similarity matrix.

4. The circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec as claimed in claim 1, wherein step 3 is specifically as follows:

a sparse autoencoder can not only learn features automatically but also give a better feature description than the original data; replacing the original data with the features learned by the sparse autoencoder improves the prediction performance of the model to a certain extent; therefore, the invention applies a sparse autoencoder to the circRNA integrated similarity and to the disease integrated similarity separately, minimizes the error between input and output by the back propagation algorithm, and extracts and transforms features to obtain 64-dimensional circRNA (disease) feature vectors; finally, the circRNA and disease feature vectors are combined to obtain the final circRNA-disease feature vector.

5. The circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec as claimed in claim 1, wherein step 4 is specifically as follows:

local structure information describes the local similarity between nodes in the graph; specifically, if two nodes are connected by an edge, they are related in the embedding space; if there is no edge between two nodes, their first-order proximity is 0; the graph convolutional neural network takes the structure of the circRNA-disease heterogeneous graph and the features of the circRNA (disease) nodes as input and outputs pooled node information and graph structure information, from which the local structure information is obtained.

6. The circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec as claimed in claim 1, wherein step 5 is specifically as follows:

global structure information describes the relationship between two nodes that are not directly connected; Node2vec is a targeted improvement on DeepWalk that samples the graph by random walks, mapping the neighborhood structure of nodes into sequences; a Skip-gram model is then trained on the sampled sequences to capture the connectivity between nodes and obtain global structure information.

7. The circRNA-disease association prediction method based on a graph convolutional neural network and Node2vec as claimed in claim 1, wherein step 6 is specifically as follows:

the node information obtained in the preceding two steps is fed into a random forest classifier to predict potential circRNA-disease associations; five-fold cross validation is used to obtain the AUC and AUPR values of the invention and the prediction result.