CN112308326B

CN112308326B - Biological network link prediction method based on meta-path and bidirectional encoder

Info

Publication number: CN112308326B
Application number: CN202011226195.3A
Authority: CN
Inventors: 彭绍亮; 王小奇; 李非; 辛彬; 肖霞; 王红; 张兴龙
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2022-12-13
Anticipated expiration: 2040-11-05
Also published as: CN112308326A

Abstract

The invention belongs to the field of computer science and discloses a biological network link prediction method based on meta-paths and a bidirectional encoder. First, a multi-source heterogeneous drug information network is constructed, and multiple semantic paths are designed for sequence sampling to form a large-scale semantic information base. Second, a deep Transformer encoder is combined with a masked language model to build a deep bidirectional encoding representation model that extracts a low-dimensional representation vector for each node. Finally, inductive matrix completion is used to predict biological links such as disease-protein associations, protein-drug interactions and drug-side effect associations, further completing a disease-target-drug-side effect technology system for drug research and development.

Description

Biological network link prediction method based on meta-path and bidirectional encoder

Technical Field

The invention belongs to the field of computer science, relates to the application of artificial intelligence technology, and particularly relates to a biological network link prediction method based on a meta-path and a bidirectional encoder.

Background

Given a set of biomedical entities and their known interactions, predicting other potential interactions (links) between the entities is one of the most important tasks in the biomedical field. More and more researchers are therefore using computational techniques to predict potential interactions in various biomedical networks.

Traditional methods in the biomedical field have invested considerable effort in developing biologically relevant features such as chemical substructures, gene ontologies and topological similarities, and then use supervised learning methods or semi-supervised graph inference models to predict potential interactions. These methods rest primarily on the similarity assumption, i.e., entities with similar biological or structural characteristics are likely to have similar associations. However, prediction methods based on biological features typically face two problems: (1) extracting biological features is very costly, and some features are difficult to obtain at all; although biological entities without features can be removed in preprocessing, the resulting data set is usually small and loses important information, which makes the approach impractical in real applications; (2) biological features may not represent biomedical entities accurately enough, so stable and accurate models may not be achievable.

Network representation methods, which attempt to learn low-dimensional vectors of network nodes automatically, are expected to solve both problems and are widely used in biological link prediction. For example, matrix factorization techniques have been used to predict drug-disease associations; some researchers have proposed a manifold-regularized matrix factorization technique that improves drug-drug interaction prediction by incorporating Laplacian regularization to learn better drug representations; in addition, network representation methods based on random walks and on deep neural networks have been proposed. However, existing methods either consider only the structural features between network nodes and ignore the semantic information between network entities, or capture only short structures and meta-paths, and therefore cannot deeply mine the structural and semantic relations between network nodes.

Disclosure of Invention

To overcome the above defects of the prior art, the invention provides a biological network link prediction method based on meta-paths and a bidirectional encoder. First, a multi-source heterogeneous drug information network is constructed, and multiple meta-paths are designed for sequence sampling to form a large-scale semantic information base. Second, a deep Transformer encoder is combined with a masked language model to build a deep bidirectional encoding representation model that extracts a low-dimensional representation vector for each node. Finally, inductive matrix completion is used to predict biological links such as disease-protein associations, protein-drug interactions and drug-side effect associations, further completing a disease-target-drug-side effect technology system for drug research and development.

The technical scheme adopted by the invention is as follows:

A biological network link prediction method based on meta-paths and a bidirectional encoder includes the following steps:

1) Initialize the parameters, including: the network sequence length l, the node degree threshold deg, the representation vector dimension dim, the number of Transformer encoder layers n, the mask ratio k ∈ (0, 1) of the language model, the probability p ∈ (0, 1) that a masked token is replaced by the special token [MASK], and the probability p' ∈ (0, 1-p) that a masked token is replaced by another token from the semantic text;

2) Construct the drug information network and the meta-paths;

3) Number all nodes in the network as x_i ∈ {x_i | i = 1, 2, ..., num}, where num is the total number of nodes, and for each node x_i sample sequences in turn according to the meta-paths of step 2);

4) Input all semantic sequences into a deep bidirectional Transformer encoder for representation learning to obtain a low-dimensional representation vector for each node, where every Transformer layer has the same structure, consisting of a multi-head self-attention mechanism and a fully connected network;

5) Determine whether the maximum number of training iterations has been reached; if so, output the representation vector of each node and go to step 6); otherwise return to step 4);

6) Predict disease-protein associations using the inductive matrix completion method;

7) In the same way as the disease-protein association prediction of step 6), predict target-drug interactions using the inductive matrix completion method;

8) In the same way as the disease-protein association prediction of step 6), predict drug-side effect associations using the inductive matrix completion method.
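The encoder layer of step 4) is described in words only. Assuming it follows the standard Transformer encoder, with layer normalization applied after each residual connection, one layer can be sketched as follows. The symbols X (the sum of the node and position embeddings), H', H, the projection matrices W, the head count h and the key dimension d_k are notation introduced here for illustration and do not come from the patent text.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\bigl(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\bigr), \quad i = 1, \dots, h \\
\mathrm{MultiHead}(X) &= [\mathrm{head}_1; \dots; \mathrm{head}_h]\, W^{O} \\
H' &= \mathrm{LayerNorm}\bigl(X + \mathrm{MultiHead}(X)\bigr) \\
H &= \mathrm{LayerNorm}\bigl(H' + \mathrm{FFN}(H')\bigr)
\end{aligned}
```

Stacking n such layers, with the output H of one layer used as the input X of the next, would give the deep bidirectional encoder; the rows of the final H then serve as the node representation vectors of dimension dim.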

As a further improvement of the present invention, the step 2) is realized by the following steps:

2.1 Using the public databases DrugBank, UniProt, HPRD, SIDER, CTD, NDFRT and STRING, construct a drug information network containing 4 node types (drugs, targets, diseases and side effects) and 7 edge types, deleting nodes whose degree is less than deg; the 7 edge types are drug-drug interaction, drug-protein interaction, drug-disease association, drug-side effect association, protein-disease association, drug-drug structural similarity and protein-protein sequence similarity;

2.2 According to different biological pathways and pharmacological mechanisms, construct 23 kinds of meta-paths, namely: drug-protein, drug-protein-drug, drug-protein, drug-protein-disease, drug-protein-drug, drug-protein-disease, drug-protein-drug-protein, drug-protein-drug-disease, drug-protein-drug-side effect, drug-protein-disease-protein, drug-protein-disease-drug, protein-drug, protein-drug-protein, protein-drug-disease, protein-drug-side effect, protein-drug-protein, protein-drug-disease, protein-drug-side effect, protein-drug-protein-disease, protein-drug-disease-protein, protein-drug-side effect-drug;
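The sequence sampling of step 3) over the meta-paths of step 2.2 can be illustrated with a minimal sketch. It assumes a uniform random choice among neighbours of the required type at each hop, which the patent text does not specify; the toy network, the node identifiers and the function names are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy network; node ids, types and edges are illustrative only.
NODE_TYPE = {
    "D1": "drug", "D2": "drug",
    "P1": "protein", "P2": "protein",
    "S1": "disease",
    "E1": "side effect",
}
EDGES = [("D1", "P1"), ("D2", "P1"), ("D2", "P2"),
         ("P1", "S1"), ("D1", "E1"), ("D2", "S1")]


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def sample_meta_path(start, meta_path, adj, node_type, rng):
    """Walk from `start`, picking at each hop a random neighbour of the required type.

    Returns the node sequence, or None if `start` has the wrong type
    or the walk reaches a node with no neighbour of the required type.
    """
    if node_type[start] != meta_path[0]:
        return None
    walk = [start]
    for wanted in meta_path[1:]:
        candidates = sorted(n for n in adj[walk[-1]] if node_type[n] == wanted)
        if not candidates:
            return None
        walk.append(rng.choice(candidates))  # assumption: uniform choice
    return walk


def sample_corpus(meta_paths, adj, node_type, walks_per_node=2, seed=0):
    """Sample every node along every meta-path; each walk becomes one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for node in sorted(node_type):
        for meta_path in meta_paths:
            for _ in range(walks_per_node):
                walk = sample_meta_path(node, meta_path, adj, node_type, rng)
                if walk is not None:
                    corpus.append(" ".join(walk))
    return corpus
```

Calling `sample_corpus([("drug", "protein", "disease")], build_adjacency(EDGES), NODE_TYPE)` yields whitespace-separated node sequences that follow the drug-protein-disease meta-path, which is the form of input the tokenization in step 4.1 expects.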

As a further improvement of the present invention, step 4) is realized by the following steps:

4.1 Tokenize all semantic sequences, which includes removing special characters and redundant characters and splitting on whitespace. Then process the semantic sequences with a masked language model: randomly select tokens to be masked from all semantic sequences according to the mask ratio k, and for each selected token generate a random number rand ∈ [0, 1]. If rand < p, replace the token with [MASK], where p ∈ (0, 1) is the probability that a masked token is replaced by [MASK]; if p ≤ rand < p + p', replace the token with one randomly selected from the semantic sequences, where p' ∈ (0, 1-p) is the probability that a masked token is replaced by another token; if p + p' ≤ rand < 1, leave the token unchanged;

4.2 Sum the initial token vector and the position vector of each node to form the input vector, and feed it into the multi-head attention mechanism to obtain an intermediate vector, to which a residual connection and normalization are applied. Then learn further with a fully connected feed-forward network, whose output likewise passes through a residual connection and normalization. This finally yields the low-dimensional representation vector of the node.
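The masking rule of step 4.1 can be sketched as follows. The function name `mask_tokens`, the rounding used to turn the ratio k into a token count, and the choice of drawing replacement tokens from a supplied vocabulary list are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, k, p, p_prime, rng):
    """Apply the three-way masking rule to one tokenized semantic sequence.

    k        mask ratio, in (0, 1)
    p        probability a selected token becomes [MASK], in (0, 1)
    p_prime  probability a selected token becomes a random token, in (0, 1 - p)
    Returns (corrupted tokens, sorted positions selected for prediction).
    """
    if not (0 < k < 1 and 0 < p < 1 and 0 < p_prime < 1 - p):
        raise ValueError("need k, p in (0, 1) and p_prime in (0, 1 - p)")
    if not tokens:
        return [], []
    # Assumption: select at least one position, rounding k * length.
    n_select = max(1, round(k * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    corrupted = list(tokens)
    for pos in positions:
        rand = rng.random()
        if rand < p:
            corrupted[pos] = MASK                # replaced by [MASK]
        elif rand < p + p_prime:
            corrupted[pos] = rng.choice(vocab)   # replaced by another token
        # otherwise (p + p_prime <= rand < 1) the token is left unchanged
    return corrupted, positions
```

For example, `mask_tokens(["D1", "P1", "S1", "P2"], vocab, 0.5, 0.8, 0.1, random.Random(0))` selects two positions; the encoder is then trained to recover the original tokens at those positions from the context on both sides.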

As a further improvement of the present invention, said step 6) is realized by the following steps:

6.1 Count the number N_inter of disease-protein associations in the network, randomly select the same number N_inter of negative samples from the disease-protein association network, mix the positive and negative samples together, and perform 10-fold cross validation;

6.2 Reconstruct the heterogeneous network based on the inductive matrix completion model, removing the network association information of the test set. Specifically, node link prediction is converted into an optimization problem in which r ranges over the 7 types of network edges, P_r is the adjacency matrix of the single network of type r, Z_r is the low-rank matrix to be solved for that single network, and V_u and V_w are the feature vectors of nodes in the single network; the 7 types of network edges are: drug-drug interaction, drug-protein interaction, drug-disease association, drug-side effect association, protein-disease association, drug-drug structural similarity and protein-protein sequence similarity;

6.3 Calculate a disease-target association score in the test set based on the low rank matrix corresponding to the trained disease-protein association.
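The sample construction of step 6.1 can be sketched as below. It assumes that negative samples are drawn from disease-protein pairs with no known association, which is the usual reading but is not stated explicitly in the text; `balanced_samples` and `k_fold` are hypothetical helper names.

```python
import random


def balanced_samples(positives, diseases, proteins, rng):
    """Pair the known associations with an equal number of sampled negatives.

    Assumption: a negative is any (disease, protein) pair that is not a
    known association. Returns a shuffled list of ((disease, protein), label).
    """
    known = set(positives)
    unobserved = [(d, p) for d in diseases for p in proteins
                  if (d, p) not in known]
    if len(unobserved) < len(known):
        raise ValueError("not enough unobserved pairs to balance the positives")
    negatives = rng.sample(unobserved, len(known))
    samples = ([(pair, 1) for pair in sorted(known)]
               + [(pair, 0) for pair in negatives])
    rng.shuffle(samples)
    return samples


def k_fold(samples, n_folds=10):
    """Yield (train, test) splits; each sample lands in exactly one test fold."""
    for i in range(n_folds):
        test = samples[i::n_folds]
        train = [s for j, s in enumerate(samples) if j % n_folds != i]
        yield train, test
```

In each fold, the positive pairs in `test` would be removed from the network before training, as step 6.2 requires, so that the scores of step 6.3 are computed for associations the model has not seen.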

Compared with the prior art, the invention has the beneficial effects that:

The invention integrates the structural relationships between network nodes with semantic relationships such as biological pathways and pharmacological mechanisms by constructing meta-paths of different types and lengths. Second, a multi-head attention mechanism captures the dependencies between network nodes at different distances, balancing local and global information. Finally, the masked language model integrates the contextual relations of the semantic sequences, which further improves the network representation capability considerably. In addition, using an inductive matrix completion model for link prediction effectively solves the cold-start problem faced by sparse networks.

Drawings

FIG. 1 is a flow chart of a method for predicting bio-network links based on meta-paths and bi-directional encoders;

FIG. 2 shows the prediction results of the biological network link prediction method based on meta-paths and a bidirectional encoder.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Fig. 1 shows a flowchart of a bio-network link prediction method based on meta-path and bi-directional encoder according to an embodiment of the present invention.

With reference to FIG. 1, a biological network link prediction method based on meta-paths and a bidirectional encoder includes the following steps:

1) Initialize the parameters, including: the network sequence length l, the node degree threshold deg, the representation vector dimension dim, the number of Transformer encoder layers n, the mask ratio k ∈ (0, 1) of the language model, the probability p ∈ (0, 1) that a masked token is replaced by the special token [MASK], and the probability p' ∈ (0, 1-p) that a masked token is replaced by another token from the semantic text;

2) Construct the drug information network and the meta-paths;

3) Number all nodes in the network as x_i ∈ {x_i | i = 1, 2, ..., num}, where num is the total number of nodes, and for each node x_i sample sequences in turn according to the meta-paths of step 2);

Go to step 6), otherwise go to step 4);

8) In the same way as the disease-protein association prediction of step 6), predict drug-side effect associations using the inductive matrix completion method.

2.2 According to different biological pathways and pharmacological mechanisms, construct 23 kinds of meta-paths, namely: drug-protein, drug-protein-drug, drug-protein, drug-protein-disease, drug-protein-drug, drug-protein-disease, drug-protein-drug-protein, drug-protein-drug-disease, drug-protein-drug-side effect, drug-protein-disease-protein, drug-protein-disease-drug, protein-drug, protein-drug-protein, protein-drug-disease, protein-drug-side effect, protein-drug-protein, protein-drug-disease, protein-drug-side effect, protein-drug-protein-disease, protein-drug-disease-protein, protein-drug-side effect-drug;

4.2 Superimpose the initial token vector and location vector of each node as

And using residual connection and normalization processing to obtain

As a further improvement of the present invention, the step 6) is realized by the following steps:

6.1 Count the number N_inter of disease-protein associations in the network, randomly select the same number N_inter of negative samples from the disease-protein association network, mix the positive and negative samples together, and perform 10-fold cross validation;

Node link prediction is converted into an optimization problem in which r ranges over the 7 types of network edges, P_r is the adjacency matrix of the single network of type r, Z_r is the low-rank matrix to be solved for that single network, and V_u and V_w are the feature vectors of nodes in the single network; the 7 types of network edges are: drug-drug interaction, drug-protein interaction, drug-disease association, drug-side effect association, protein-disease association, drug-drug structural similarity and protein-protein sequence similarity;
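The symbol definitions above are consistent with the usual inductive matrix completion objective. As an assumption (the exact objective, including any regularization term on Z_r, is not given in this text), it can be sketched as a squared reconstruction error summed over the edge types and over the node pairs (u, w) of each single network:

```latex
\min_{\{Z_r\}} \; \sum_{r=1}^{7} \; \sum_{(u, w)}
  \bigl( P_r(u, w) - V_u^{\top} Z_r V_w \bigr)^{2}
```

Under this form, the predicted score for a node pair (u, w) in the network of type r would be V_u^T Z_r V_w. Because the score depends only on the node feature vectors and the learned Z_r, it can also be evaluated for nodes that have no observed edges of that type.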

The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims

1. A biological network link prediction method based on meta-path and bidirectional encoder is characterized by comprising the following steps:

2) Construct the drug information network and the meta-paths, realized by the following steps:

2.1 Using the public databases DrugBank, UniProt, HPRD, SIDER, CTD, NDFRT and STRING, construct a drug information network containing 4 node types (drugs, targets, diseases and side effects) and 7 edge types, deleting nodes whose degree is less than deg; the 7 edge types are drug-drug interaction, drug-protein interaction, drug-disease association, drug-side effect association, protein-disease association, drug-drug structural similarity and protein-protein sequence similarity;

3) Number all nodes in the network as x_i ∈ {x_i | i = 1, 2, ..., num}, where num is the total number of nodes, and for each node x_i sample sequences in turn according to the meta-paths of step 2);

4) Input all semantic sequences into a deep bidirectional Transformer encoder for representation learning to obtain a low-dimensional representation vector for each node, where every Transformer layer has the same structure, consisting of a multi-head self-attention mechanism and a fully connected network;

Go to step 6), otherwise go to step 4);

6) Predict disease-protein associations using the inductive matrix completion method, realized by the following steps:

6.1 Count the number N_inter of disease-protein associations in the network, randomly select the same number N_inter of negative samples from the disease-protein association network, mix the positive and negative samples together, and perform 10-fold cross validation;

Node link prediction is converted into an optimization problem in which r ranges over the 7 types of network edges, P_r is the adjacency matrix of the single network of type r, Z_r is the low-rank matrix to be solved for that single network, and V_u and V_w are the feature vectors of nodes in the single network; the 7 types of network edges are: drug-drug interaction, drug-protein interaction, drug-disease association, drug-side effect association, protein-disease association, drug-drug structural similarity and protein-protein sequence similarity;

6.3 Computing a disease-target association score in the test set based on the low rank matrix corresponding to the trained disease-protein association;

2. The bio-network link prediction method based on meta-path and bi-directional encoder as claimed in claim 1, wherein the step 4) is implemented by the steps of:

4.1 Tokenize all semantic sequences, which includes removing special characters and redundant characters and splitting on whitespace. Then process the semantic sequences with a masked language model: randomly select tokens to be masked from all semantic sequences according to the mask ratio k, and for each selected token generate a random number rand ∈ [0, 1]. If rand < p, replace the token with [MASK], where p ∈ (0, 1) is the probability that a masked token is replaced by [MASK]; if p ≤ rand < p + p', replace the token with one randomly selected from the semantic sequences, where p' ∈ (0, 1-p) is the probability that a masked token is replaced by another token; if p + p' ≤ rand < 1, leave the token unchanged;

4.2 Sum the initial token vector and the position vector of each node to form the input vector, and feed it into the multi-head attention mechanism to obtain an intermediate vector, to which a residual connection and normalization are applied. Then learn further with a fully connected feed-forward network, whose output likewise passes through a residual connection and normalization. This finally yields the low-dimensional representation vector of the node.