CN114496275A

CN114496275A - Microorganism-disease association prediction method and system based on conditional random field

Info

Publication number: CN114496275A
Application number: CN202111563953.5A
Authority: CN
Inventors: 王红; 滑美芳; 王正军; 杨雪; 杨杰; 张双永; 张子姗; 郑子希; 李维新
Original assignee: Shandong Normal University
Current assignee: Shandong Normal University
Priority date: 2021-12-20
Filing date: 2021-12-20
Publication date: 2022-05-13

Abstract

The invention discloses a method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field, comprising the following steps: acquiring correspondence data between microorganisms and diseases, and constructing a microorganism-disease association matrix; according to the microorganism-disease association matrix, acquiring similarity matrices among microorganisms and among diseases, and integrating the similarity matrices with the microorganism-disease association matrix to obtain an adjacency matrix; extracting features from the similarity matrices among microorganisms and among diseases respectively, and combining them to obtain a feature matrix; generating embedding vectors from the adjacency matrix and the feature matrix based on a graph convolutional network; updating the embedding vectors according to the conditional random field; and reconstructing the association matrix from the updated embedding vectors. In this method, the features of microorganisms and diseases are fully mined by the graph convolutional network, a CRF layer is introduced to ensure that similar microorganisms or diseases have similar embeddings in the feature space, and the accuracy of association prediction is improved.

Description

Microorganism-disease association prediction method and system based on conditional random field

Technical Field

The invention belongs to the technical field of medical data processing, and particularly relates to a microorganism-disease association prediction method and system based on a conditional random field.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

A microorganism is a minute organism that may exist in the form of a single cell or in a group of cells. In recent years, as microorganisms have been found to be closely related to prevention, diagnosis and treatment of many complex human diseases, more and more researchers have been working on revealing the association of microorganisms with diseases. As an effective complement to traditional experiments, more and more computational models based on various algorithms are proposed for microbe-disease association prediction to improve efficiency and save cost.

However, despite much research effort to reveal the role of microorganisms in the pathogenesis of human diseases, little is still understood about how microorganisms affect human health and the mechanisms by which they cause disease. It is therefore necessary to study the associations between microorganisms and diseases. In recent years, researchers have proposed a growing number of computational methods that predict microbe-disease associations from data sets of known associations, such as KATZHMDA (based on the KATZ method), PBHMDA (path-based), PRWHMDA (based on random walk), LRLSHMDA (based on machine learning), and WMGHMDA (based on metagraphs). On the one hand, these methods require continual parameter tuning to reach their best performance and are therefore inefficient; on the other hand, they do not deeply mine the features among microorganisms and among diseases, which limits prediction accuracy.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a microorganism-disease association prediction method and system using a graph convolutional network based on a conditional random field. A conditional random field is incorporated into the graph convolutional network to ensure that similar microorganisms (or diseases) are also similar in the feature space, i.e., have similar embeddings. In this way, the potential associations between microorganisms and diseases can be fully mined, and the prediction accuracy is improved.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

a method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field comprises the following steps:

acquiring correspondence data between microorganisms and diseases, and constructing a microorganism-disease association matrix;

according to the microorganism-disease association matrix, acquiring similarity matrices among microorganisms and among diseases, and integrating the similarity matrices with the microorganism-disease association matrix to obtain an adjacency matrix;

extracting features from the similarity matrices among microorganisms and among diseases respectively, and combining them to obtain a feature matrix;

generating embedding vectors from the adjacency matrix and the feature matrix based on a graph convolutional network;

updating the embedding vectors according to the conditional random field;

and reconstructing the association matrix from the updated embedding vectors.

Further, each row of the microorganism-disease association matrix represents a microorganism, each column represents a disease, and the elements in the matrix represent whether the corresponding microorganism is related to the disease or not.

Further, the microorganism similarity matrix calculation method comprises the following steps:

calculating the Gaussian interaction profile kernel similarity and the cosine similarity of the microorganisms respectively, and obtaining the integrated similarity of the microorganisms from the Gaussian interaction profile kernel similarity and the cosine similarity.

Further, the disease similarity matrix calculation method comprises the following steps:

calculating the Gaussian interaction profile kernel similarity and the functional similarity of the diseases respectively, and obtaining the integrated similarity of the diseases from the Gaussian interaction profile kernel similarity and the functional similarity.

Further, features are extracted from the similarity matrices among microorganisms and among diseases using the random walk with restart method, to obtain probability profile vectors of the microorganisms and the diseases.

Further, generating an embedded vector according to the adjacency matrix and the feature matrix based on a graph convolution network comprises:

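$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)$$

where $H^{(l)}$ denotes the embedding at layer $l$ of the GCN, with $H^{(0)} = X$; $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ denotes the normalized similarity weight matrix with self-loops, in which $\tilde{A}$ is the adjacency matrix plus the identity matrix $I$; $\tilde{D}$ denotes the diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; $W^{(l)}$ denotes the weight matrix; and $\sigma$ denotes the activation function.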

Further, updating the embedded vector based on the conditional random field includes:

wherein the initial embedding is set to $H_i$, where $H_i$ denotes the preliminary embedding of node $i$ obtained from the GCN convolutional layer, $\lambda_{ij}$ denotes the attention score between node $i$ and node $j$, $N_i$ is the set of neighbors of node $i$, and $\alpha$ and $\beta$ are weight factors that balance the effect of the first term and the second term on the prediction performance.

One or more embodiments provide a graph data-based enhanced microorganism-disease association prediction system, comprising:

a known association acquisition module configured to acquire correspondence data between microorganisms and diseases, and to construct a microorganism-disease association matrix;

an adjacency matrix calculation module configured to acquire similarity matrices among microorganisms and among diseases according to the microorganism-disease association matrix, and to integrate the similarity matrices with the microorganism-disease association matrix to obtain an adjacency matrix;

a feature preprocessing module configured to extract features from the similarity matrices among microorganisms and among diseases respectively, and to combine them to obtain a feature matrix;

a feature embedding module configured to generate embedding vectors from the adjacency matrix and the feature matrix based on a graph convolutional network, and to update the embedding vectors according to the conditional random field;

and an association prediction module configured to reconstruct the association matrix from the updated embedding vectors.

One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field when executing the program.

One or more embodiments provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field.

The above one or more technical solutions have the following beneficial effects:

The similarities among microorganisms and among diseases are calculated, a heterogeneous association network between microorganisms and diseases is constructed, and embeddings of the microorganism and disease nodes are obtained through a graph convolutional network. A CRF layer is introduced to ensure that similar microorganisms or diseases are also similar in the feature space, i.e., have similar embeddings, and self-attention is then used to distinguish the contributions of adjacent nodes to a given node, which improves the accuracy of subsequent association prediction.

Microorganism similarity is analyzed based on the Gaussian interaction profile kernel similarity and cosine similarity of microorganisms, and disease similarity is analyzed based on the functional similarity and Gaussian interaction profile kernel similarity of diseases. This fully mines the potential associations between microorganisms and diseases, effectively supplements the small number of known associations, and supports the accuracy of subsequent association prediction.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; the exemplary embodiments of the invention and their description serve to explain the invention and do not limit it.

FIG. 1 is a flow diagram of a method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field in one or more embodiments of the invention.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example one

This embodiment discloses a method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field, which specifically comprises the following steps:

Step 1: acquiring correspondence data between microorganisms and diseases, and constructing a microorganism-disease network;

The microorganism-disease network uses a graph data structure: the nodes of the graph are diseases and microorganisms, and each edge connects a microorganism to a disease with which it is associated. To facilitate data storage and subsequent calculation, the microbe-disease network is stored as an adjacency matrix A, in which each row represents a microorganism, each column represents a disease, and each element indicates whether the corresponding microorganism is associated with the corresponding disease: the element value is 1 if they are associated and 0 otherwise. In this example, the initial data set consists of 450 associations between 39 diseases and 292 microorganisms, as shown in Table 1.

Table 1. Statistics of the microbe-disease association data set.
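As an illustration of Step 1, the following minimal sketch builds a 0/1 association matrix from a list of known microbe-disease pairs. The function name `build_association_matrix` and the toy identifiers are hypothetical and are not taken from the data set described here; plain Python lists are used only to keep the sketch dependency-free.

```python
def build_association_matrix(pairs, microbes, diseases):
    """Return a 0/1 matrix with one row per microbe and one column per disease."""
    row = {m: i for i, m in enumerate(microbes)}
    col = {d: j for j, d in enumerate(diseases)}
    A = [[0] * len(diseases) for _ in microbes]
    for m, d in pairs:
        A[row[m]][col[d]] = 1  # 1 marks a known association, 0 an unknown one
    return A


# Hypothetical toy data (3 microbes, 2 diseases, 3 known associations).
microbes = ["m1", "m2", "m3"]
diseases = ["d1", "d2"]
pairs = [("m1", "d1"), ("m2", "d1"), ("m3", "d2")]
A = build_association_matrix(pairs, microbes, diseases)
```

With the data set of this example, the same routine would yield a 292 x 39 matrix containing 450 ones.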

Step 2: acquiring the similarity networks of microorganisms and diseases from the adjacency matrix, specifically as follows:

Step 2.1: the cosine similarity and the Gaussian interaction profile kernel similarity of the microorganisms are calculated respectively.

The cosine similarity of microorganisms is calculated as follows:

$$CM(m(i), m(j)) = \frac{A(i,:) \cdot A(j,:)}{\lVert A(i,:) \rVert \, \lVert A(j,:) \rVert}$$

where $A(i,:)$ denotes the $i$-th row of the adjacency matrix $A$, and $A(j,:)$ denotes the $j$-th row of the adjacency matrix $A$.

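The Gaussian interaction profile kernel similarity of microorganisms is calculated as follows:

$$KM(m(i), m(j)) = \exp\left(-\lambda_m \lVert IP(m(i)) - IP(m(j)) \rVert^2\right)$$

$$\lambda_m = \lambda'_m \Big/ \left(\frac{1}{n_m} \sum_{i=1}^{n_m} \lVert IP(m(i)) \rVert^2\right)$$

where $IP(m(i))$ denotes the interaction profile of microorganism $m(i)$, i.e., the $i$-th row of $A$; $n_m$ is the number of microorganisms; $\lambda_m$ denotes the normalized kernel bandwidth; and $\lambda'_m$ denotes the original bandwidth, typically set to 1.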
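The two microbe similarities of Step 2.1 can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes that the interaction profile of a microbe is its row of the association matrix and that the normalized bandwidth is the original bandwidth divided by the mean squared norm of the profiles (the usual convention for this kernel). The function names are hypothetical.

```python
import math


def cosine_similarity(A):
    """Cosine similarity between every pair of rows of A."""
    n = len(A)
    norms = [math.sqrt(sum(x * x for x in row)) for row in A]
    CM = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:  # all-zero rows keep similarity 0
                dot = sum(a * b for a, b in zip(A[i], A[j]))
                CM[i][j] = dot / (norms[i] * norms[j])
    return CM


def gip_kernel_similarity(A, raw_bandwidth=1.0):
    """Gaussian interaction profile kernel similarity between rows of A."""
    n = len(A)
    # Assumption: bandwidth normalized by the mean squared profile norm;
    # requires at least one nonzero profile.
    mean_sq_norm = sum(sum(x * x for x in row) for row in A) / n
    lam = raw_bandwidth / mean_sq_norm
    KM = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(A[i], A[j]))
            KM[i][j] = math.exp(-lam * sq_dist)
    return KM


# Toy association matrix: 3 microbes (rows) x 2 diseases (columns).
A = [[1, 0], [1, 0], [0, 1]]
CM = cosine_similarity(A)
KM = gip_kernel_similarity(A)
```

Passing the transpose of the association matrix to `gip_kernel_similarity` gives the corresponding kernel similarity between diseases.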

Step 2.2: the functional similarity and the Gaussian interaction profile kernel similarity of the diseases are calculated respectively.

The functional similarity of diseases is calculated as follows:

Based on the hypothesis that similar diseases tend to interact with similar genes, we calculate disease functional similarity from the functional associations between disease-associated genes. The HumanNet v2.0 database (https://www.inetbio.org/humannet/download.php) provides gene interactions, each of which has an associated log-likelihood score (LLS) that assesses the probability of a functional linkage between the genes. For diseases $d_i$ and $d_j$, we first derive their related gene sets $G_i = \{g_{i1}, g_{i2}, \ldots, g_{im}\}$ and $G_j = \{g_{j1}, g_{j2}, \ldots, g_{jn}\}$, where $m$ is the number of genes in $G_i$ and $n$ is the number of genes in $G_j$. We define the functional association between a gene $g$ and a gene set $G = \{g_1, g_2, \ldots, g_k\}$ as follows:

$$F_G(g) = \max_{1 \le i \le k} FSS(g, g_i)$$

where FSS denotes the functional similarity score between genes, defined as follows:

$$FSS(g_i, g_j) = \begin{cases} 1, & i = j \\ LLS'(g_i, g_j), & i \ne j \text{ and } g_i, g_j \text{ interact in HumanNet} \\ 0, & \text{otherwise} \end{cases}$$

where $LLS'$ is the normalized LLS of a gene pair, defined as follows:

$$LLS'(g_i, g_j) = \frac{LLS(g_i, g_j) - LLS_{min}}{LLS_{max} - LLS_{min}}$$

where $LLS_{max}$ and $LLS_{min}$ denote the maximum LLS and the minimum LLS in the HumanNet database, respectively.

Finally, we express the disease functional similarity as:

$$FS(d_i, d_j) = \frac{\sum_{g \in G_i} F_{G_j}(g) + \sum_{g \in G_j} F_{G_i}(g)}{m + n}$$
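A minimal sketch of the functional similarity computation is given below. It rests on stated assumptions: the score between a gene and a gene set is taken as the maximum gene-to-gene score, identical genes score 1, gene pairs without a recorded interaction score 0, and the disease similarity averages the two directed sums over the total number of genes. The helper names and the toy LLS values are hypothetical and do not come from HumanNet.

```python
def normalize_lls(lls, lls_min, lls_max):
    """Min-max normalization of a log-likelihood score into [0, 1]."""
    return (lls - lls_min) / (lls_max - lls_min)


def fss(g1, g2, lls_norm):
    """Functional similarity score between two genes.

    lls_norm maps frozenset({g1, g2}) to a normalized LLS; pairs that are
    absent from the map are treated as non-interacting (score 0).
    """
    if g1 == g2:
        return 1.0
    return lls_norm.get(frozenset((g1, g2)), 0.0)


def gene_to_set(g, gene_set, lls_norm):
    """Assumed association between a gene and a gene set: the best pairwise score."""
    return max(fss(g, h, lls_norm) for h in gene_set)


def disease_functional_similarity(G_i, G_j, lls_norm):
    total = sum(gene_to_set(g, G_j, lls_norm) for g in G_i)
    total += sum(gene_to_set(g, G_i, lls_norm) for g in G_j)
    return total / (len(G_i) + len(G_j))


# Hypothetical raw LLS values with an assumed database range of [1.0, 5.0].
raw = {frozenset(("g1", "g2")): 3.0, frozenset(("g2", "g3")): 1.0}
lls_norm = {pair: normalize_lls(v, 1.0, 5.0) for pair, v in raw.items()}
fs = disease_functional_similarity(["g1", "g2"], ["g2", "g3"], lls_norm)
```

In this toy case the shared gene g2 contributes a score of 1 in both directions, which is what drives the similarity above the level given by the interaction scores alone.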

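The Gaussian interaction profile kernel similarity of diseases is calculated as follows:

$$KD(d(i), d(j)) = \exp\left(-\lambda_d \lVert IP(d(i)) - IP(d(j)) \rVert^2\right)$$

$$\lambda_d = \lambda'_d \Big/ \left(\frac{1}{n_d} \sum_{i=1}^{n_d} \lVert IP(d(i)) \rVert^2\right)$$

where $IP(d(i))$ denotes the interaction profile of disease $d(i)$, i.e., the $i$-th column of $A$; $n_d$ is the number of diseases; $\lambda_d$ denotes the normalized kernel bandwidth; and $\lambda'_d$ denotes the original bandwidth, typically set to 1.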

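Specifically, for microorganisms $m(i)$ and $m(j)$, if a cosine similarity exists between them, the integrated microorganism similarity is defined as the average of CM and KM; otherwise it is defined as KM. The integrated microorganism similarity MS is therefore defined as follows:

$$MS(m(i), m(j)) = \begin{cases} \dfrac{CM(m(i), m(j)) + KM(m(i), m(j))}{2}, & CM(m(i), m(j)) \ne 0 \\ KM(m(i), m(j)), & \text{otherwise} \end{cases}$$

Similarly, the integrated disease similarity DS is defined as follows, where FS denotes the disease functional similarity:

$$DS(d(i), d(j)) = \begin{cases} \dfrac{FS(d(i), d(j)) + KD(d(i), d(j))}{2}, & FS(d(i), d(j)) \ne 0 \\ KD(d(i), d(j)), & \text{otherwise} \end{cases}$$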

We compose a heterogeneous network from the integrated microorganism similarity network MS, the integrated disease similarity network DS, and the known microorganism-disease association network A. The adjacency matrix of this heterogeneous network is

$$A_H = \begin{bmatrix} MS & A \\ A^{T} & DS \end{bmatrix}$$
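The integration step and the assembly of the heterogeneous network can be sketched as below. This is an illustration under assumptions: the fallback to the kernel similarity applies wherever the primary similarity (cosine or functional) is zero, and the heterogeneous adjacency matrix is assembled as the blocks [[MS, A], [A^T, DS]]. The function names and toy values are hypothetical.

```python
def integrate(primary, kernel):
    """Average both similarities where the primary one is nonzero, else use the kernel."""
    n = len(kernel)
    return [
        [(primary[i][j] + kernel[i][j]) / 2 if primary[i][j] != 0 else kernel[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def heterogeneous_adjacency(MS, DS, A):
    """Assemble the block matrix [[MS, A], [A^T, DS]] as one square matrix."""
    A_t = [list(col) for col in zip(*A)]
    top = [ms_row + a_row for ms_row, a_row in zip(MS, A)]
    bottom = [at_row + ds_row for at_row, ds_row in zip(A_t, DS)]
    return top + bottom


# Toy example with 2 microbes and 1 disease.
MS = integrate([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.2], [0.2, 1.0]])
DS = [[1.0]]
A = [[1], [0]]
A_H = heterogeneous_adjacency(MS, DS, A)
```

The result has one row and one column per node, microbes first and diseases second, so the 292 microbes and 39 diseases of this example would give a 331 x 331 matrix.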

Step 3: feature processing of microorganisms and diseases;

As described above, the MS and DS matrices represent microorganism similarity and disease similarity, respectively. Each row or column represents the similarity profile of a microorganism (or disease) and can be regarded as a feature vector of that microorganism (or disease). However, because of the limitations of the calculation methods, the computed similarities may contain noise, so it is not sufficient to use the similarity profiles directly as input features for microorganisms and diseases. We therefore apply a Random Walk with Restart (RWR) based method to extract features from the similarity profiles. RWR is a network-based approach that can effectively capture both the local and the global topological characteristics of a network.

After running RWR on the microorganism similarity network and the disease similarity network, we obtain a probability distribution vector for each microorganism or disease. These probability distribution vectors form a new microorganism feature matrix HM and a new disease feature matrix HD. To make the features comparable between different nodes, we further normalize the probability distribution vectors so that the probabilities in each vector sum to 1. Finally, the normalized probability profile vectors in HM and HD are used as the input features of microorganisms and diseases. The new feature matrix formed is:

$$X = \begin{bmatrix} HM & 0 \\ 0 & HD \end{bmatrix}$$
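A minimal sketch of RWR-based feature extraction on one similarity network follows. The assumptions are labeled in the code: the similarity matrix is row-normalized into transition probabilities, each node is used in turn as the restart seed, and the restart probability of 0.5 is a hypothetical default rather than a value given in this disclosure.

```python
def rwr_features(S, restart=0.5, tol=1e-10, max_iter=1000):
    """Return one stationary RWR distribution per node of similarity matrix S."""
    n = len(S)
    # Assumption: transitions are the row-normalized similarities;
    # a node with no similarity mass jumps uniformly.
    P = []
    for row in S:
        total = sum(row)
        P.append([x / total for x in row] if total > 0 else [1.0 / n] * n)
    features = []
    for seed in range(n):
        p0 = [1.0 if k == seed else 0.0 for k in range(n)]
        p = p0[:]
        for _ in range(max_iter):
            nxt = [
                (1 - restart) * sum(p[u] * P[u][v] for u in range(n)) + restart * p0[v]
                for v in range(n)
            ]
            converged = sum(abs(a - b) for a, b in zip(nxt, p)) < tol
            p = nxt
            if converged:
                break
        total = sum(p)
        features.append([x / total for x in p])  # normalize so each vector sums to 1
    return features


# Toy 2-node similarity network.
S = [[1.0, 0.5], [0.5, 1.0]]
HM = rwr_features(S)
```

Running the same routine on MS and on DS gives the rows of HM and HD respectively; a larger restart probability keeps each vector closer to its seed node.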

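Step 4: obtaining node embeddings using a graph convolutional network;

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)$$

where $H^{(l)}$ denotes the embedding at layer $l$ of the GCN, with $H^{(0)} = X$; $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ denotes the normalized similarity weight matrix with self-loops, in which $\tilde{A} = A_H + I$, $A_H$ is the adjacency matrix of the heterogeneous network and $I$ is the identity matrix; $\tilde{D}$ denotes the diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; $W^{(l)}$ denotes the weight matrix; and $\sigma$ denotes the activation function.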
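One graph convolution layer can be sketched as follows. This is a minimal illustration, assuming the standard symmetric normalization of the adjacency matrix with added self-loops and ReLU as the activation function (the activation is not specified here). The helper names and the toy weights are hypothetical; a practical implementation would use a tensor library and learned weights.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def relu(M):
    return [[max(0.0, v) for v in row] for row in M]


def gcn_layer(adj, H, W, activation=relu):
    """Compute activation(D^-1/2 (adj + I) D^-1/2 H W) for one layer."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees are positive because weights are nonnegative and self-loops add 1.
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    a_hat = [[d_inv_sqrt[i] * a_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
             for i in range(n)]
    return activation(matmul(matmul(a_hat, H), W))


# Toy graph with two connected nodes, identity features, and fixed weights.
adj = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0], [3.0]]
H1 = gcn_layer(adj, X, W)
```

Both nodes receive the same embedding here because, after adding self-loops, each node averages its own features with those of its single neighbor.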

Step 5: updating the embeddings through the conditional random field layer;

Here, the update yields the embedding of node $i$ at layer $k+1$. The initial embedding is set to $H_i$, the preliminary embedding of node $i$ obtained from the GCN convolutional layer; $\lambda_{ij}$ denotes the attention score between nodes and measures the importance of neighbor node $j$ to node $i$; $N_i$ is the set of neighbors of node $i$; and $\alpha$ and $\beta$ are weight factors that balance the effect of the first term and the second term on the prediction performance.

We use self-attention to distinguish the contributions of neighboring nodes to a given node. Formally, the attention $\lambda_{ij}$ between node $i$ and node $j$ is defined as follows.

$$a_{ij} = \mathrm{att}(W_t C_i,\ W_t C_j)$$

where $C_i$ denotes the final embedding of node $i$. The conditional random field is used to ensure that similar microorganisms (or diseases) are also similar in the feature space, i.e., have similar embeddings.
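The sketch below illustrates one possible CRF update with attention, written only from the symbols defined in this step. It assumes an update of the form C_i <- (alpha * H_i + beta * sum_j lambda_ij * C_j) / (alpha + beta * sum_j lambda_ij), as used in CRF-enhanced graph convolution, with attention scores obtained by a softmax over dot products of neighbor embeddings and the projection W_t taken as the identity. These choices, the function names, and the toy values are assumptions, not the exact equations of this disclosure.

```python
import math


def attention_scores(C, neighbors):
    """Softmax-normalized dot-product scores lam[i][j] over each node's neighbors."""
    lam = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue  # nodes without neighbors receive no attention weights
        raw = [sum(a * b for a, b in zip(C[i], C[j])) for j in nbrs]
        top = max(raw)  # subtract the maximum for numerical stability
        exps = [math.exp(r - top) for r in raw]
        z = sum(exps)
        lam[i] = {j: e / z for j, e in zip(nbrs, exps)}
    return lam


def crf_update(H, neighbors, alpha=1.0, beta=1.0, steps=2):
    """Iteratively pull each embedding toward its neighbors (assumed update form)."""
    C = [row[:] for row in H]  # initial embedding is the GCN output H_i
    for _ in range(steps):
        lam = attention_scores(C, neighbors)
        new_C = []
        for i, h in enumerate(H):
            weights = lam.get(i, {})
            denom = alpha + beta * sum(weights.values())
            new_C.append([
                (alpha * h[k] + beta * sum(w * C[j][k] for j, w in weights.items())) / denom
                for k in range(len(h))
            ])
        C = new_C
    return C


# Two mutually neighboring nodes with orthogonal preliminary embeddings.
H = [[1.0, 0.0], [0.0, 1.0]]
neighbors = {0: [1], 1: [0]}
C = crf_update(H, neighbors, alpha=1.0, beta=1.0, steps=1)
```

With equal weights, a single step moves each node halfway toward its neighbor; a larger alpha keeps the result closer to the GCN output, while a larger beta enforces more similarity between neighbors.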

Step 6: reconstructing the association prediction matrix;

The embedding matrices learned by the conditional random field layer for microorganisms and diseases are denoted $C_m$ and $C_d$, respectively. The final association prediction matrix is then:

$$O = C_m W_m (W_d)^{T} (C_d)^{T}$$

where $W_m$ and $W_d$ denote the latent factors that project the embeddings back into the original feature spaces of microorganisms and diseases, respectively.
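The reconstruction in Step 6 is a chain of four matrix products, sketched below. The toy embeddings and the identity projections are hypothetical; in practice the projection matrices would be learned together with the rest of the network.

```python
def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(M):
    return [list(col) for col in zip(*M)]


def reconstruct(C_m, C_d, W_m, W_d):
    """Compute O = C_m W_m W_d^T C_d^T (one row per microbe, one column per disease)."""
    return matmul(matmul(matmul(C_m, W_m), transpose(W_d)), transpose(C_d))


# Toy embeddings: 2 microbes and 1 disease in a 2-dimensional space.
C_m = [[1.0, 0.0], [0.0, 1.0]]
C_d = [[1.0, 1.0]]
W_m = [[1.0, 0.0], [0.0, 1.0]]
W_d = [[1.0, 0.0], [0.0, 1.0]]
O = reconstruct(C_m, C_d, W_m, W_d)
```

Each entry of the result is a score for one microbe-disease pair, so the output has the same shape as the association matrix.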

Example two

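The present embodiment aims to provide a system for predicting microbe-disease associations using a graph convolutional network based on a conditional random field, which includes:

a known association acquisition module configured to acquire correspondence data between microorganisms and diseases, and to construct a microorganism-disease association matrix;

an adjacency matrix calculation module configured to acquire similarity matrices among microorganisms and among diseases according to the microorganism-disease association matrix, and to integrate the similarity matrices with the microorganism-disease association matrix to obtain an adjacency matrix;

a feature preprocessing module configured to extract features from the similarity matrices among microorganisms and among diseases respectively, and to combine them to obtain a feature matrix;

a feature embedding module configured to generate embedding vectors from the adjacency matrix and the feature matrix based on a graph convolutional network, and to update the embedding vectors according to the conditional random field;

and an association prediction module configured to reconstruct the association matrix from the updated embedding vectors.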

Example three

The embodiment aims at providing an electronic device.

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field according to the first embodiment.

Example four

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field according to the first embodiment.

The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.

In one or more embodiments, microorganism similarity and disease similarity are analyzed based on the Gaussian interaction profile kernel similarity and cosine similarity of microorganisms and on the functional similarity and Gaussian interaction profile kernel similarity of diseases, so that the potential associations between microorganisms and diseases are fully mined. The features of microorganisms and diseases are then preprocessed, and the embeddings produced by the GCN layer are updated by the conditional random field layer, which effectively improves prediction accuracy.

Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims

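1. A method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field, characterized by comprising the following steps:

acquiring correspondence data between microorganisms and diseases, and constructing a microorganism-disease association matrix;

according to the microorganism-disease association matrix, acquiring similarity matrices among microorganisms and among diseases, and integrating the similarity matrices with the microorganism-disease association matrix to obtain an adjacency matrix;

extracting features from the similarity matrices among microorganisms and among diseases respectively, and combining them to obtain a feature matrix;

generating embedding vectors from the adjacency matrix and the feature matrix based on a graph convolutional network;

updating the embedding vectors according to the conditional random field;

and reconstructing the association matrix from the updated embedding vectors.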

2. The method for predicting microbe-disease association based on the graph convolution network of conditional random fields as recited in claim 1, wherein each row of the microbe-disease association matrix represents a microbe, each column represents a disease, and elements in the matrix represent whether the corresponding microbe is related to the disease.

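3. The method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field as claimed in claim 1, wherein the microorganism similarity matrix is calculated by:

calculating the Gaussian interaction profile kernel similarity and the cosine similarity of the microorganisms respectively, and obtaining the integrated similarity of the microorganisms from the Gaussian interaction profile kernel similarity and the cosine similarity.

4. The method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field as claimed in claim 1, wherein the disease similarity matrix is calculated by:

calculating the Gaussian interaction profile kernel similarity and the functional similarity of the diseases respectively, and obtaining the integrated similarity of the diseases from the Gaussian interaction profile kernel similarity and the functional similarity.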

5. The method as claimed in claim 1, wherein features are extracted from the similarity matrices among microorganisms and among diseases using the random walk with restart method, to obtain probability profile vectors of the microorganisms and the diseases.

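6. The method of claim 1, wherein generating embedding vectors from the adjacency matrix and the feature matrix based on the graph convolutional network comprises:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right)$$

where $H^{(l)}$ denotes the embedding at layer $l$ of the GCN, with $H^{(0)} = X$; $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ denotes the normalized similarity weight matrix with self-loops, in which $\tilde{A}$ is the adjacency matrix plus the identity matrix $I$; $\tilde{D}$ denotes the diagonal matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$; $W^{(l)}$ denotes the weight matrix; and $\sigma$ denotes the activation function.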

7. The method of claim 1, wherein updating the embedding vectors according to the conditional random field comprises:

wherein the initial embedding is set to $H_i$, where $H_i$ denotes the preliminary embedding of node $i$ obtained from the GCN convolutional layer, $\lambda_{ij}$ denotes the attention score between node $i$ and node $j$, $N_i$ is the set of neighbors of node $i$, and $\alpha$ and $\beta$ are weight factors that balance the effect of the first term and the second term on the prediction performance.

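8. A graph data-based enhanced microorganism-disease association prediction system, comprising:

a known association acquisition module configured to acquire correspondence data between microorganisms and diseases, and to construct a microorganism-disease association matrix;

an adjacency matrix calculation module configured to acquire similarity matrices among microorganisms and among diseases according to the microorganism-disease association matrix, and to integrate the similarity matrices with the microorganism-disease association matrix to obtain an adjacency matrix;

a feature preprocessing module configured to extract features from the similarity matrices among microorganisms and among diseases respectively, and to combine them to obtain a feature matrix;

a feature embedding module configured to generate embedding vectors from the adjacency matrix and the feature matrix based on a graph convolutional network, and to update the embedding vectors according to the conditional random field;

and an association prediction module configured to reconstruct the association matrix from the updated embedding vectors.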

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field as recited in any one of claims 1-7.

10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for predicting microbe-disease associations using a graph convolutional network based on a conditional random field as recited in any one of claims 1-7.