CN115618745A

CN115618745A - Biological network interaction construction method

Info

Publication number: CN115618745A
Application number: CN202211462889.6A
Authority: CN
Inventors: 赵玉凤; 庞华鑫; 张小平; 周佩; 刘佳
Original assignee: Institute Of Information On Traditional Chinese Medicine Cacms
Current assignee: Institute Of Information On Traditional Chinese Medicine Cacms
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2023-01-17
Anticipated expiration: 2042-11-21
Also published as: CN115618745B

Abstract

The invention belongs to the technical field of deep learning for medical big data, and in particular relates to a biological network interaction construction method. The method comprises: establishing an undirected connection network graph G; performing primary processing on the graph G by aggregating the information of neighbor nodes at different levels to a source node to generate a wide-area source node characterization vector; constructing a specific subgraph for each source node from the original network graph G and learning a specificity characterization vector of the source node; and predicting the interaction existing between any two nodes based on the wide-area characterization vector and the specificity characterization vector. From partial information, the method can accurately predict existing interactions between nodes and can also predict unknown interactions, thereby providing a direction for subsequent biomedical research. In addition, it can be clearly shown which neighbor node information the deep model relies on to infer each interaction, and the method has high reliability and robustness.

Description

Biological network interaction construction method

Technical Field

The invention belongs to the technical field of deep learning for medical big data, and in particular relates to a biological network interaction construction method.

Background

Biological systems are complex networks of molecular entities (e.g., genes, proteins, and other biomolecules) linked together by interactions. The interactions between different molecular entities can be represented as an interaction network, with the molecular entities as nodes and their interactions as edges. Network characterization of biological systems provides a conceptual and intuitive framework for studying the direct and indirect interactions between different molecular entities. Using deep learning on the network representation of the nodes to discover new interactions between nodes helps deepen the understanding of biological systems and further clarify the nature and rules of life activities. For example, discovering a new protein-protein interaction can help determine whether the two proteins act synergistically, and discovering a high-probability link between a gene and a disease can indicate whether abnormal expression of the gene may induce the disease. It is therefore significant to design an interaction mining model based on a biological network.

In the field of traditional Chinese medicine (TCM), medicines for treating diseases are often provided in the form of a prescription, where a prescription contains a plurality of medicinal materials and each medicinal material contains a plurality of active ingredients. According to the TCM diagnosis and treatment concept, these compounds form a whole and act together on a disease, so that the body is regulated from multiple aspects and the disease is treated through multiple pathways. This treatment pattern can naturally be described as a network structure in which multiple components are associated with multiple classes of disease, with many-to-many attributes. Exploring such problems and uncovering the actual interactions involved helps elucidate the molecular-level mechanisms by which TCM treats diseases. In addition, TCM has the concept of 'treating the same disease with different methods and treating different diseases with the same method'; clarifying the scientific connotation and principle behind this concept requires the support and application of biological network mining technology.

After multiple factors such as diseases, symptoms, genes, and medicines are fused, the interactions are complex and diverse. Processing the undirected connection network graph obtained after this fusion so as to construct a biological network with differentiated characteristics, and then predicting existing or unknown interactions, remains complex and difficult. The invention therefore provides a biological network interaction construction method.

Disclosure of Invention

In order to solve the above technical problems in the prior art, the present invention provides a biological network interaction construction method.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A biological network interaction construction method comprises the following steps:

S1, establishing an undirected connection network graph G, wherein the nodes V in the graph represent syndrome information, the edges E between nodes represent biological interactions, and the node characterization vector X represents any combination of the biological description, chemical structure, and information coding of the nodes;

S2, performing primary processing on the undirected connection network graph G by using a multi-level graph aggregation module: aggregating the information of neighbor nodes at different levels to the source node to generate a wide-area source node characterization vector;

S3, constructing a specific subgraph for each source node from the original network graph G by using the wide-area source node characterization vector through a subgraph selection module;

S4, learning a specificity characterization vector of the source node based on the specific subgraph;

S5, predicting the interaction existing between any two nodes based on the wide-area source node characterization vector and the specificity characterization vector of the source node.

Further, the syndrome information comprises diseases, symptoms, genes, medicines and biological targets; biological interactions include any biological relationship of disease-disease, disease-symptom, disease-gene, disease-drug, gene-target, etc.

Furthermore, the multi-level graph aggregation module aggregates the information of neighbor nodes at different levels to the source node according to the adjacency matrix A and dimension transformation matrices of different orders, to generate a wide-area node characterization vector.

Furthermore, the multi-level graph aggregation module first transforms the initial information features in the network graph G, applying a fully connected layer to map the initial information into the same low-dimensional shared subspace, where W represents the initial feature mapping parameter matrix, b represents the bias coefficient, and H represents the embedded representation obtained after mapping;

then, a high-order graph convolution encoder is applied to aggregate node information of different orders in the biological network graph G, using the k-th order transition probability matrix of the nodes and the learnable parameter matrix of the k-th order in layer l;

finally, the embedded representations of the nodes at different orders are spliced (concatenated) to obtain the characterization vector of each node.
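As a hedged sketch only, the mapping and aggregation described in this step could take the following form. The notation is an assumption introduced for illustration and is not the patent's own: \hat{A}^{k} stands for the k-th order transition probability matrix, W_k^{(l)} for the learnable matrix of order k in layer l, \sigma for an unspecified nonlinearity, \Vert for concatenation, K for the highest order used, and Z for the resulting node characterization vectors.

```latex
\begin{aligned}
H &= XW + b, \\
H_k^{(l+1)} &= \sigma\!\left(\hat{A}^{k}\, H^{(l)}\, W_k^{(l)}\right), \qquad k = 0, 1, \dots, K, \\
Z &= \big\Vert_{k=0}^{K} H_k .
\end{aligned}
```

Under this reading, order k = 0 keeps each node's own features, and each higher order widens the neighborhood whose information reaches the node before the orders are concatenated.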

Further, the k-th order transition probability matrix is generated as follows: the adjacency matrix A constructed from the edges is first subjected to Laplacian transformation and normalization, and is then raised to the k-th power.
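A minimal sketch of this generation step in plain Python follows. It assumes the normalization is the symmetric form D^{-1/2}(A + I)D^{-1/2} with self-loops, which is a common choice but is not confirmed by the text; the function names and the three-node path graph are hypothetical.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so that every node keeps part of its own signal.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matrix_power(m, k):
    """Raise a square matrix to the non-negative integer power k."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, m)
    return result


# Hypothetical example: path graph 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
A_hat = normalized_adjacency(A)
A_hat_2 = matrix_power(A_hat, 2)  # second-order matrix reaches 2-hop neighbors
```

In this toy graph, nodes 0 and 2 are not adjacent, so their entry in `A_hat` is zero, while the corresponding entry in `A_hat_2` is positive because they share neighbor 1.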

Furthermore, based on the embedded representations produced by the multi-level graph aggregation module, the information weight that each edge at different levels around the source node contributes to learning the source node's representation is calculated. Let the source node be u. The set of neighbor nodes within P layers of the source node u, and the set of edges existing between these nodes, are retrieved from the biological network graph G. For each edge in the edge set, the characterization vectors learned by the multi-level graph aggregation module for the nodes at its two ends are spliced with the characterization vector of the source node u, and the importance of each edge in the edge set to the source node is calculated from the spliced vector. In this calculation, the weight value of an edge (i, j) with respect to the source node u is produced by a parameterized multi-layer perceptron module.
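As a hedged illustration, the edge-importance calculation described here could be written as below. All symbols are notational assumptions introduced for this sketch: z_i, z_j, and z_u are the characterization vectors of the two end nodes and the source node, \Vert is concatenation, \theta denotes the perceptron's parameters, E_u^{P} is the set of edges among the P-layer neighbors of u, and \omega_{ij}^{u} is the resulting weight.

```latex
\omega_{ij}^{u} = \mathrm{MLP}_{\theta}\!\left(\left[\, z_i \,\Vert\, z_j \,\Vert\, z_u \,\right]\right),
\qquad (i, j) \in E_u^{P}.
```

Because z_u enters the input, the same edge (i, j) can receive different weights for different source nodes, which is what makes the selected subgraph specific to each source node.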

Further, the weight values of the edges in the P-layer neighbor set of the source node u are discretized. In the discretization, a sigmoid function maps the calculated value into the unit interval, a random value obeying a uniform distribution on (0, 1) is drawn, and a temperature coefficient is applied.
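A minimal sketch of such a discretization is given below. It assumes the step follows the widely used binary concrete (Gumbel-sigmoid) relaxation, which matches the ingredients listed here (a sigmoid, uniform noise on (0, 1), and a temperature) but is not confirmed as the patent's exact expression. The function names and the `rng` parameter are hypothetical.

```python
import math
import random


def stable_sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_edge_mask(weight, temperature, rng=random):
    """Map an edge weight to a near-binary value in [0, 1].

    Assumed form: sigmoid((log(eps) - log(1 - eps) + weight) / temperature),
    with eps drawn uniformly from (0, 1).
    """
    eps = rng.random()
    # Clamp to avoid log(0) at the ends of the interval.
    eps = min(max(eps, 1e-12), 1.0 - 1e-12)
    noise = math.log(eps) - math.log(1.0 - eps)
    return stable_sigmoid((noise + weight) / temperature)


# Hypothetical usage: a low temperature pushes the mask toward 0 or 1.
generator = random.Random(0)
masks = [relaxed_edge_mask(w, temperature=0.1, rng=generator)
         for w in (-3.0, 0.0, 3.0)]
```

The noise term keeps the selection stochastic during training, while the temperature controls how close the output is to a hard 0/1 decision.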

Further, a simple bilinear layer is defined to learn the characterization of a potential edge, where the layer has a learnable fusion matrix and a bias coefficient b, and its output is the representation of the edge between node i and node j.

The obtained edge representation is input into a 2-layer fully connected network to predict the probability that an edge exists between the two nodes, where FC denotes a fully connected layer, and sigmoid and ELU denote nonlinear activation functions.
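The prediction head can be sketched in plain Python as follows. The ordering sigmoid(FC(ELU(FC(e)))) is an assumption, since the text names the two activations but not where each is applied; the parameter dictionary keys and all toy values are hypothetical.

```python
import math


def bilinear(z_i, z_j, fusion, bias):
    """Edge representation: e[k] = z_i^T fusion[k] z_j + bias[k]."""
    return [
        sum(z_i[p] * w_k[p][q] * z_j[q]
            for p in range(len(z_i)) for q in range(len(z_j))) + b_k
        for w_k, b_k in zip(fusion, bias)
    ]


def linear(x, weight, bias):
    """Fully connected layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + c
            for row, c in zip(weight, bias)]


def elu(x, alpha=1.0):
    """Element-wise exponential linear unit."""
    return [v if v > 0 else alpha * (math.exp(v) - 1.0) for v in x]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    z = math.exp(v)
    return z / (1.0 + z)


def link_probability(z_i, z_j, params):
    """Assumed head: bilinear edge feature -> FC -> ELU -> FC -> sigmoid."""
    edge = bilinear(z_i, z_j, params["fusion"], params["fusion_bias"])
    hidden = elu(linear(edge, params["fc1_w"], params["fc1_b"]))
    logit = linear(hidden, params["fc2_w"], params["fc2_b"])[0]
    return sigmoid(logit)


# Hypothetical toy parameters: 2-d node vectors, 1-d edge feature.
toy_params = {
    "fusion": [[[0.0, 1.0], [0.0, 0.0]]],
    "fusion_bias": [0.0],
    "fc1_w": [[1.0]], "fc1_b": [0.0],
    "fc2_w": [[2.0]], "fc2_b": [0.0],
}
probability = link_probability([1.0, 0.0], [0.0, 1.0], toy_params)
```

In a trained model the fusion matrix and both fully connected layers would be learned jointly; the toy values above only exercise the data flow.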

Further, a binary cross-entropy loss function is applied to the predicted edge probabilities to compute the loss, where the ground truth is a 0-1 adjacency matrix.

After the prediction error loss value is obtained, the parameters in the model are updated by applying the back-propagation algorithm with the model's learning rate, so as to reduce the prediction error.
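For illustration, the standard binary cross-entropy between predicted edge probabilities and 0-1 labels can be computed as in the sketch below. The averaging over samples and the clamping constant `eps` are assumptions made for this example.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean of -(y * log(p) + (1 - y) * log(1 - p)) over all samples."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp probabilities so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical predictions for two positive and two negative node pairs.
loss_good = binary_cross_entropy([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
loss_bad = binary_cross_entropy([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
```

Predictions that agree with the adjacency labels give a smaller loss than predictions that contradict them, which is the signal back-propagation uses to update the parameters.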

Compared with the prior art, the invention has the following beneficial effects:

The invention first establishes an undirected connection network and then aggregates neighbor node information at different levels to a source node to generate a wide-area source node characterization vector. It then constructs a specific subgraph of the source node to obtain a specificity characterization vector. The interaction between any two nodes is predicted based on the wide-area source node characterization vector and the specificity characterization vector, so that both known and unknown relationships among elements such as diseases, medicines, genes, and symptoms can be predicted, providing a reliable data basis and research direction for subsequent biomedical research. In addition, because a subgraph is constructed for each node, it can be clearly shown which neighbor node information the deep model relies on to infer each interaction, and the method has high reliability and robustness.

Drawings

FIG. 1 is a general block diagram of an embodiment of the present invention.

FIG. 2 is a flowchart illustrating a multi-level graph aggregation module according to an embodiment of the invention.

Fig. 3 is a schematic processing diagram of a sub-graph selection module according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of selecting different levels of subgraphs for sample nodes on a DTI data set according to an embodiment of the invention.

Detailed Description

The technical solutions of the present invention are described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the protection scope of the present invention.

An embodiment of the invention is shown in FIG. 1. First, the collected medical record data and public biological information data are preprocessed: important information such as the main disease types and the medicines used in each medical record is extracted, and a plurality of entity graphs are constructed, thereby generating an undirected disease-medicine network graph. Then, all disease and drug types appearing in the medical record data are counted and labeled one by one, so that each disease and drug is uniquely identified. Similarly, for the selected drugs and diseases, matching descriptive features or chemical component features are retrieved from public biological databases such as TCMSP, DrugBank, and OMIM to serve as the initialization features of these entities.

After data preprocessing, the original disease-drug network graph G and the entity initialization feature matrix X are obtained. Then, to verify the effectiveness and robustness of the model and to avoid overfitting, which would reduce generalization performance, the data are divided: 70% of the edges are sampled from the network graph G as the training set G-train, 10% as the validation set G-val, and 20% as the final test set G-test. The edges present in the graph are all regarded as positive samples, and an equal number of negative samples must be selected from the node pairs without edges to guide model training and testing. In addition, the initial features of the entities participate in optimization and updating during model training, yielding low-dimensional embedded characterization vectors.
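The split and negative sampling described above can be sketched as follows. This is an illustrative sketch under stated assumptions: edges are undirected pairs of integer node ids, splitting is a seeded random shuffle, and negatives are drawn uniformly from node pairs that have no edge. The function names are hypothetical.

```python
import random


def split_edges(edges, seed=0, train_pct=70, val_pct=10):
    """Shuffle edges and split them into train / validation / test lists."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def sample_negative_edges(num_nodes, positive_edges, count, seed=0):
    """Draw `count` distinct node pairs that are not connected by an edge."""
    existing = {frozenset(e) for e in positive_edges}
    n_real = sum(1 for e in existing if len(e) == 2)
    available = num_nodes * (num_nodes - 1) // 2 - n_real
    if count > available:
        raise ValueError("not enough unconnected node pairs")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < count:
        i, j = rng.randrange(num_nodes), rng.randrange(num_nodes)
        pair = frozenset((i, j))
        if i != j and pair not in existing:
            negatives.add(pair)  # the set removes duplicate draws
    return [tuple(sorted(pair)) for pair in negatives]


# Hypothetical toy graph: a chain of 11 nodes with 10 edges.
toy_edges = [(i, i + 1) for i in range(10)]
g_train, g_val, g_test = split_edges(toy_edges)
toy_negatives = sample_negative_edges(11, toy_edges, count=len(toy_edges))
```

Integer percentages are used for the split sizes so that the counts do not depend on floating-point rounding.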

After the training, validation, and test sets are divided, the process of learning characterization vectors for node entities and predicting the likelihood of interaction between paired nodes begins. As shown in FIG. 1, the initial feature matrix X of the entity nodes and the graph topology information G are first input into a multi-level graph aggregation module (MOGA) to generate wide-area characterization vectors. In the multi-level graph aggregation module, with reference to FIG. 2, the specific operation flow is as follows:

(1) Apply the Laplacian transformation to the adjacency matrix A, converting it into a normalized adjacency matrix with self-loops;

(2) First compute the 0-order feature aggregation representation by multiplying the 0-th power of the normalized adjacency matrix, the initial features X, and the parameterized weight matrix;

(3) Compute the 1st-order, 2nd-order, 3rd-order, and subsequent aggregation representations in the same manner as (2), up to the K-th order;

(4) Splice (concatenate) the aggregation representations of the different orders to obtain the wide-area characterization vector.
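Steps (2) to (4) can be sketched in plain Python as below, assuming the normalized adjacency matrix from step (1) is already available and that no nonlinearity is applied between orders (the steps above do not mention one). The function names and toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def identity(n):
    """n x n identity matrix, i.e. the 0-th power of any n x n matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def multi_order_aggregate(a_hat, x, weights):
    """Concatenate (a_hat^k @ x @ weights[k]) for k = 0 .. len(weights) - 1."""
    n = len(x)
    power = identity(n)  # a_hat^0
    blocks = []
    for w_k in weights:
        blocks.append(matmul(matmul(power, x), w_k))
        power = matmul(power, a_hat)  # move on to the next order
    # Splice the per-order rows of each node into one long vector.
    return [sum((block[i] for block in blocks), []) for i in range(n)]


# Hypothetical toy input: 2 nodes, 1 feature, orders 0 and 1.
toy_a_hat = [[0.5, 0.5],
             [0.5, 0.5]]
toy_x = [[1.0], [3.0]]
toy_weights = [[[1.0]], [[1.0]]]
wide_area = multi_order_aggregate(toy_a_hat, toy_x, toy_weights)
```

Each node's output row holds its own feature (order 0) followed by the neighborhood average (order 1), which illustrates how the concatenation keeps the orders separate instead of mixing them.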

After the wide-area characterization vector of each node is obtained, the node characterization has already aggregated preliminary K-level neighbor information, which means it has a receptive field covering its K-level neighbors. This helps the subsequent subgraph selection module select the set of neighbors that are highly correlated with the node. Next, the wide-area characterization vectors and the original graph A are input into a subgraph selection module (SGSM) to generate a specific subgraph sub-G for each node, as shown in FIG. 3:

(1) Taking a source node u as the center, sample from the original graph A; the sampling range is the subgraph, comprising edges and nodes, within a transfer distance of P steps from the source node u;

(2) Construct edge representations: combine the node characterization vectors at the two ends of an edge to serve as the edge representation, then splice the edge representation with the source node representation to form a joint embedding, such as i-j-u;

(3) Input the joint embedding into a predefined fully connected layer to generate the weight value of the edge i-j;

(4) Discretize the weights to eliminate irrelevant edges and reduce the size of the subgraph;

(5) Retain the high-probability edges to form the specific subgraph, and generate the subgraph adjacency matrix Sub-A.
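The five steps above can be sketched as follows. This is a simplified illustration, not the module itself: the learned fully connected layer is replaced by a caller-supplied `score_fn`, and the discretization is replaced by a fixed `threshold`; both are stand-ins, and all names and toy values are hypothetical.

```python
from collections import deque


def p_hop_subgraph(edges, source, p):
    """Return the nodes within p hops of `source` and the edges among them."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == p:
            continue  # do not expand beyond p hops
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    nodes = set(dist)
    sub_edges = [(i, j) for i, j in edges if i in nodes and j in nodes]
    return nodes, sub_edges


def joint_embedding(z, i, j, u):
    """Concatenate the vectors of edge ends i, j and source node u (i-j-u)."""
    return z[i] + z[j] + z[u]


def select_edges(edges, source, p, z, score_fn, threshold=0.5):
    """Keep the sampled edges whose score for this source exceeds threshold."""
    _, sub_edges = p_hop_subgraph(edges, source, p)
    return [(i, j) for i, j in sub_edges
            if score_fn(joint_embedding(z, i, j, source)) > threshold]


# Hypothetical toy data: a 5-node chain with 1-d node vectors.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
toy_z = {0: [1.0], 1: [1.0], 2: [0.0], 3: [0.0], 4: [0.0]}
mean_score = lambda v: sum(v) / len(v)  # stand-in for the learned layer
kept = select_edges(toy_edges, source=0, p=2, z=toy_z,
                    score_fn=mean_score, threshold=0.7)
```

The retained edge list plays the role of the subgraph adjacency matrix Sub-A; converting it to a matrix is a matter of setting the corresponding entries to 1.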

The subgraph and the representation of the source node u are input into a new predefined multi-level graph aggregation module (MOGA), from which the source node u obtains its specific subgraph representation. The subgraph selection process is repeated so that a specific subgraph structure and a subgraph vector representation are generated for each node in the graph. The node-specific subgraph representations are then assembled into a representation matrix, so that it is clear which nodes' information in the graph has been aggregated into each node's representation.

Finally, the wide-area characterization vectors and the specificity characterizations are spliced to obtain a comprehensive characterization of each node, which contains rich global node information and reinforces the neighbor information most correlated with the node. To predict whether an interaction link exists between two nodes u and v, their characterizations are input into an interaction prediction module based on a bilinear layer. In this module, the bilinear layer first generates a feature vector for the potential edge, and a multi-layer perceptron classifier then predicts the probability that the edge exists, i.e., the link probability. The multi-layer perceptron comprises two linear mapping layers and an activation function. In the training and optimization stage, a binary cross-entropy loss function measures the difference between the model's predictions and the ground truth, and the model parameters are then updated by combining gradient descent with back-propagation. In the testing and validation stage, the model parameters are fixed and the model predicts the likelihood of an edge between entity nodes based on their input initial feature vectors.

Four different datasets were used to verify the performance of the model: a drug-target protein interaction network (DTI dataset) from the BioSNAP database, a drug-drug interaction network (DDI dataset) from the BioSNAP database, a protein-protein interaction network (PPI dataset) from the human proteome database (The Human Protein), and a gene-disease interaction network (GDI dataset) from the DisGeNET database. The four datasets are described below:

(1) The DTI network contains 5018 drugs and 2325 target proteins, with 15139 drug-target interactions between these entities. A schematic diagram of the different-level subgraphs selected for sample nodes on the DTI dataset is shown in FIG. 4.

(2) The DDI network contains 1514 drugs, among which 48514 drug-drug interactions were extracted from drug labels and biomedical literature.

(3) The PPI network contains 5604 proteins and 23322 interactions, generated by multiple orthogonal high-throughput yeast two-hybrid screens.

(4) The GDI network consists of 81746 interactions between 9413 genes and 10370 diseases, drawn from GWAS studies, animal models, and scientific literature.

To measure the classification performance of all models, two metrics common in machine learning are adopted: the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPRC). To avoid the negative influence of class imbalance, and to calculate the feature interaction weight spectrum for each class, the same number of negative samples as positive samples is sampled for training and testing. The biological network interaction construction method is implemented on the PyTorch platform, version 1.14.0. The main constant hyper-parameters are set by a gradient search method. Specifically, the number of hidden layer units of the global characterization is 16, and the number of neurons of the interaction characterization equals the number of features in the sample. In addition, the classifier, a multi-layer perceptron (MLP), has two hidden layers with 64 and 32 units, and the activation function is a linear rectification function (Sigmoid). The optimizer is Adam, with a dynamic learning rate, to minimize the loss. The model is trained for 50 epochs and, to prevent overfitting, training stops automatically when the validation loss has not decreased for 10 epochs. Finally, because mini-batching, model parameter initialization, and similar steps are random, all experiments are repeated 5 times and the reported results are the averages of the 5 runs. The experimental data are shown in Table 1 and Table 2.
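For reference, AUROC on a balanced set of positive and negative node pairs can be computed directly from its rank interpretation: the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half. The sketch below is a plain-Python illustration of that definition, not the evaluation code used in the experiments; the function name and sample scores are hypothetical.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical link scores: two true edges followed by two non-edges.
example_auroc = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of samples, so it is suitable only for small checks; library implementations compute the same quantity from sorted scores.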

Table 1. AUROC experimental results on the four different datasets

Table 2. AUPRC experimental results on the four different datasets

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software together with a necessary general-purpose hardware platform. On this understanding, the technical solution of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (such as a personal computer, a server, or a network device) to execute the method according to the embodiments or parts of the embodiments.

Although the present invention has been described in detail with reference to examples, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention.

Claims

1. A biological network interaction construction method is characterized by comprising the following steps:

；

S4, learning a specificity characterization vector of the source node based on the specific subgraph;

S5, predicting the interaction existing between any two nodes based on the wide-area source node characterization vector and the specificity characterization vector of the source node.

2. The biological network interaction construction method as claimed in claim 1, wherein the multi-level graph aggregation module aggregates the information of neighbor nodes at different levels to the source node according to the adjacency matrix A and dimension transformation matrices of different orders, to generate a wide-area node characterization vector.

3. The biological network interaction construction method as claimed in claim 2, wherein the multi-level graph aggregation module first transforms the initial information features in the network graph G, applying a fully connected layer to map the initial information into the same low-dimensional shared subspace, where W represents the initial feature mapping parameter matrix, b represents the bias coefficient, and H represents the embedded representation obtained after mapping;

where the quantities involved are the k-th order transition probability matrix of the nodes and the learnable parameter matrix of the k-th order in layer l;

the embedded representations of the nodes at different orders are spliced to obtain the characterization vector of each node.

4. The biological network interaction construction method according to claim 3, wherein the k-th order transition probability matrix is generated as follows: the adjacency matrix A constructed from the edges is first subjected to Laplacian transformation and normalization, and is then raised to the k-th power.

5. The biological network interaction construction method according to claim 4, wherein information weights of learning contributions of edges of different levels around the source node to the representation of the source node are calculated according to the embedded representation obtained after the multi-level graph aggregation model is mapped; setting the source node as u, and searching the neighbor node set of the P layer of the source node u from the biological network graph G