CN115116580A

CN115116580A - Virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning

Info

Publication number: CN115116580A
Application number: CN202210813511.XA
Authority: CN
Inventors: 程效龙; 瞿佳
Original assignee: Changzhou University
Current assignee: Changzhou University
Priority date: 2022-07-11
Filing date: 2022-07-11
Publication date: 2022-09-27

Abstract

The invention provides a virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning, which mainly addresses the low accuracy of virus-drug association prediction and the rarity of virus-drug association prediction work in the prior art. The method comprises the following steps: (1) obtaining a data set comprising known virus-drug associations; (2) integrating the similarity matrices of the viruses and the drugs; (3) performing matrix decomposition on the virus-drug adjacency matrix and constructing a new virus-drug association matrix; (4) constructing a heterogeneous graph to predict potential virus-drug associations. The invention predicts virus-drug associations by matrix decomposition and heterogeneous graph reasoning, and has good prediction performance and robustness.

Description

Virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning

Technical Field

The invention relates to the technical field of machine learning and bioinformatics, in particular to a virus-drug association prediction method based on restricted Boltzmann machine matrix decomposition and heterogeneous graph reasoning.

Background

Humans and other higher animals live in close association with microbial communities, which include bacteria, archaea, viruses, fungi and protozoa. Viruses are a class of microorganisms, and outbreaks of new viruses can pose a significant hazard to humans. For example, COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the "novel coronavirus" that has spread rapidly around the world, and no specific vaccine or antiviral drug against SARS-CoV-2 has been found at present. Therefore, it is imperative to find a specific antiviral drug as soon as possible to prevent the spread of SARS-CoV-2. In addition, human immunodeficiency virus (HIV) infection can cause acquired immunodeficiency syndrome (AIDS) through stages that include viral replication and transmission, a long asymptomatic phase, and depletion of CD4+ T cells. Ebola virus (EBOV) enters the body through damaged skin or mucosal surfaces, and EBOV infection can cause fever, mucosal bleeding, and even death. Zika virus (ZIKV) can infect cells through clathrin-mediated endocytosis followed by fusion in acidic endosomes. ZIKV is a flavivirus, related to the dengue, yellow fever and West Nile viruses.

Generally, a person infected with a virus is first treated with drugs. Therefore, there is a need to find effective antiviral drugs. Drug discovery is one of the major goals of pharmaceutical science, an interdisciplinary field that draws on basic sciences including biology, chemistry, physics, and statistics. For thousands of years, nature has been the source of medicinal products, and many useful active compounds have been developed from plant sources. In the 20th century, the discovery of penicillin was the starting point for drug discovery from microbial sources. Most drugs are developed from lead structures based on natural products synthesized by bacteria. Drugs derived from bacterial secondary metabolites have a variety of uses, for example in the diagnosis, alleviation, treatment or prevention of disease, or in the relief of discomfort. It is estimated that in the golden age of microbial natural product screening (1940-1970), tens of millions of soil microorganisms were screened, a tremendous effort that yielded the vast majority of microbial metabolites known today. These substances include widely used antibacterial agents such as erythromycin, streptomycin, tetracycline and vancomycin, and chemotherapeutic drugs such as doxorubicin. 90% of all antibiotics used in clinics today are derived from microorganisms. Currently, 23000 natural products with antibacterial activity are known to be produced by microorganisms, while only 25000 natural products isolated from higher organisms such as plants and animals are known.

However, drug development currently faces two major challenges. On the one hand, the development cycle is long: it takes a long time for a drug to go from initial development to the market. On the other hand, drug resistance has begun to emerge, constituting a serious threat to human health. To address these challenges, combinatorial chemistry has been developed as a key technology that can generate large screening libraries to meet the needs of high-throughput screening. Furthermore, drug repurposing, also known as drug repositioning, is the idea of using drugs that have already been approved for the market to treat other existing diseases. For drug combination therapy and drug repositioning, determining drug-virus associations is crucial. Therefore, detecting interactions between viruses and drugs is of great importance for antiviral therapy and drug development. However, identifying virus-drug associations through traditional wet-laboratory experiments (e.g., culture-based methods) is time-consuming, laborious, and expensive. Therefore, computational methods that efficiently and accurately predict virus-drug associations are a valuable complement to the limited experimental methods.

In summary, because of viral drug resistance and the long development cycle of new drugs, identifying virus-drug associations is of great significance for drug development and disease treatment. Traditional experimental methods are time-consuming and labor-intensive, so developing efficient computational methods to identify potential virus-drug associations is an urgent problem.

Disclosure of Invention

In order to solve the above problems, we invented a virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning, which is more time-saving and labor-saving in predicting virus-drug association.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning comprises the following steps:

Step one: obtaining a known virus-drug association matrix;

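The data set is composed of known virus-drug association data. The form of the data is represented as follows:

$$A(i, p) = \begin{cases} 1, & \text{if drug } d_i \text{ is associated with virus } v_p \\ 0, & \text{otherwise} \end{cases}$$

wherein A(i, p) indicates whether drug $d_i$ is associated with virus $v_p$: if so, the value of A(i, p) is 1; otherwise, it is 0.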
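As an illustration of step one, the following minimal sketch builds the 0/1 association matrix from a list of known (drug, virus) index pairs. The function name `build_association_matrix` and the toy pairs are hypothetical and are not part of the original disclosure.

```python
import numpy as np

def build_association_matrix(pairs, n_drugs, n_viruses):
    """Return an n_drugs x n_viruses matrix A with A[i, p] = 1 for each known pair (i, p)."""
    A = np.zeros((n_drugs, n_viruses), dtype=float)
    for i, p in pairs:
        A[i, p] = 1.0
    return A

# Hypothetical toy data: 3 drugs, 2 viruses, 3 known associations.
known_pairs = [(0, 0), (1, 0), (2, 1)]
A = build_association_matrix(known_pairs, n_drugs=3, n_viruses=2)
print(A)
```

Rows index drugs and columns index viruses, which matches the nd-row, nv-column score matrix described later in the text.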

Step two: integrating the similarity matrices of the viruses and the drugs;

For drug similarity, the chemical structure similarity of the drugs, the side effect similarity of the drugs and the Gaussian interaction profile kernel similarity of the drugs are integrated to obtain the integrated drug similarity. If two drugs have chemical structure similarity or side effect similarity, the integrated drug similarity is the average of their chemical structure similarity and their side effect similarity. Otherwise, the integrated drug similarity is equal to the Gaussian interaction profile kernel similarity of the drugs. The calculation formula is as follows:

$$SD(d_i, d_j) = \begin{cases} \dfrac{SS1(d_i, d_j) + SS2(d_i, d_j)}{2}, & \text{if } d_i \text{ and } d_j \text{ have chemical structure similarity or side effect similarity} \\ GD(d_i, d_j), & \text{otherwise} \end{cases}$$

wherein SS1 is the chemical structure similarity of the drugs, SS2 is the side effect similarity of the drugs, GD is the Gaussian similarity of the drugs, and SD is the integrated drug similarity.
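The integration rule for drug similarity can be sketched as follows. This is a minimal illustration that assumes "having" chemical structure or side effect similarity means a nonzero entry in SS1 or SS2; that reading, the function name and the toy matrices are assumptions.

```python
import numpy as np

def integrate_drug_similarity(SS1, SS2, GD):
    """Average SS1 and SS2 where either is nonzero; fall back to GD elsewhere."""
    has_info = (SS1 != 0) | (SS2 != 0)  # assumption: nonzero entry = similarity is available
    return np.where(has_info, (SS1 + SS2) / 2.0, GD)

# Hypothetical 3-drug example.
SS1 = np.array([[1.0, 0.6, 0.0], [0.6, 1.0, 0.0], [0.0, 0.0, 1.0]])
SS2 = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.0], [0.0, 0.0, 1.0]])
GD = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.7], [0.3, 0.7, 1.0]])
SD = integrate_drug_similarity(SS1, SS2, GD)
print(SD)
```

In the toy data, drugs 0 and 1 use the averaged value, while pairs involving drug 2 fall back to the Gaussian similarity because neither SS1 nor SS2 has an entry for them.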

For virus similarity, the virus sequence similarity and the Gaussian interaction profile kernel similarity of the viruses are integrated to obtain the integrated virus similarity. The formula is as follows:

wherein MV is the sequence similarity of the viruses, GV is the Gaussian similarity of the viruses, and SV is the integrated virus similarity.
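The text does not spell out how the Gaussian interaction profile kernel similarities GD and GV are computed. A minimal sketch of the commonly used definition is given below, in which each drug (or virus) is represented by its row (or column) of the association matrix; the bandwidth normalization and the parameter `gamma_prime` are assumptions rather than details from the original.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel; each row of `profiles` is one interaction profile."""
    sq_norms = np.sum(profiles ** 2, axis=1)
    mean_sq = sq_norms.mean()
    if mean_sq == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq  # bandwidth scaled by the mean squared profile norm
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (profiles @ profiles.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-gamma * sq_dists)

# Hypothetical association matrix: 3 drugs (rows) x 2 viruses (columns).
A_toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
GD_toy = gip_kernel(A_toy)    # drug-drug similarity from the rows of A
GV_toy = gip_kernel(A_toy.T)  # virus-virus similarity from the columns of A
print(GD_toy)
print(GV_toy)
```

Drugs with identical interaction profiles receive similarity 1, and the similarity decays as the profiles diverge.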

Step three: performing matrix decomposition on the virus-drug adjacency matrix and constructing a new virus-drug association matrix;

Since some associations in the drug-virus adjacency matrix of the data set may be redundant or missing, the known drug-virus adjacency matrix is decomposed into two parts. The first part is the product of the original matrix and a low-rank matrix. The low-rank matrix retains the non-redundant data, which can be used to construct a new drug-virus association matrix. The second part is a sparse matrix in which most elements are zero. The decomposition formula is as follows:

$$A = AX + E \tag{3}$$

wherein A is the virus-drug association matrix, X is the low-rank matrix obtained by the decomposition, and E is the sparse matrix obtained by the decomposition;

Then, the low-rank matrix X is obtained using the nuclear norm, and the sparse matrix E is obtained using the sparse norm. Equation (3) above can be converted into the following equation:

wherein $\|\cdot\|_*$ denotes the nuclear norm, $\|\cdot\|_{2,1}$ denotes the sparse ($\ell_{2,1}$) norm, and $\alpha$ is the parameter that controls the weight;

Equation (4) above can also be equivalently expressed as:

In brief, equation (5) above can be viewed as a constrained convex optimization problem. The inexact augmented Lagrange multiplier (IALM) method is used to solve this problem. First, equation (5) above is transformed into an unconstrained problem. Second, the unconstrained problem is minimized using the augmented Lagrangian function of the following equation (6):

wherein $Y_1$ and $Y_2$ are the Lagrange multipliers, $\mu$ is a penalty factor, and $\|\cdot\|_F$ denotes the Frobenius norm;
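For orientation, the standard low-rank representation formulation that matches the symbols defined above is sketched below. The auxiliary variable $J$ and the exact arrangement of the terms are assumptions based on the usual form of this problem, not a transcription of the original equations (4) to (6).

```latex
% Sketch of (4): nuclear-norm / l2,1-norm relaxation of A = AX + E
\min_{X,\,E}\ \|X\|_* + \alpha \|E\|_{2,1}
\quad \text{s.t.}\quad A = AX + E

% Sketch of (5): equivalent form with an auxiliary variable J (assumed symbol)
\min_{X,\,E,\,J}\ \|J\|_* + \alpha \|E\|_{2,1}
\quad \text{s.t.}\quad A = AX + E,\quad X = J

% Sketch of (6): augmented Lagrangian with multipliers Y_1, Y_2 and penalty \mu
L = \|J\|_* + \alpha \|E\|_{2,1}
  + \operatorname{tr}\!\left(Y_1^{T}(A - AX - E)\right)
  + \operatorname{tr}\!\left(Y_2^{T}(X - J)\right)
  + \frac{\mu}{2}\left(\|A - AX - E\|_F^{2} + \|X - J\|_F^{2}\right)
```

Here $\|E\|_{2,1}$ is the sum of the Euclidean norms of the columns of $E$, which encourages whole columns of $E$ to vanish.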

From equation (6) above, two solutions can be obtained, denoted $X^*$ and $E^*$ respectively. Then a new drug-virus association matrix $A^*$ is established from A and $X^*$, as follows:

$$A^* = AX^* \tag{7}$$
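Step three can be sketched with one common inexact augmented Lagrange multiplier scheme for low-rank representation. The auxiliary variable `J`, the update order, the default values of `alpha`, `mu`, `rho` and `tol`, and all function names are assumptions for illustration; the original does not list these details.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def column_shrink(Q, tau):
    """Column-wise shrinkage: proximal operator of tau * l2,1 norm."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale

def lrr_ialm(A, alpha=0.1, mu=1e-4, mu_max=1e10, rho=1.1, tol=1e-6, max_iter=1000):
    """Approximately solve min ||X||_* + alpha * ||E||_{2,1} subject to A = AX + E."""
    n = A.shape[1]
    X = np.zeros((n, n))
    J = np.zeros((n, n))
    E = np.zeros_like(A)
    Y1 = np.zeros_like(A)   # multiplier for A = AX + E
    Y2 = np.zeros((n, n))   # multiplier for X = J
    inv = np.linalg.inv(np.eye(n) + A.T @ A)
    for _ in range(max_iter):
        J = svt(X + Y2 / mu, 1.0 / mu)
        X = inv @ (A.T @ (A - E) + J + (A.T @ Y1 - Y2) / mu)
        E = column_shrink(A - A @ X + Y1 / mu, alpha / mu)
        R1 = A - A @ X - E  # residual of the first constraint
        R2 = X - J          # residual of the second constraint
        Y1 = Y1 + mu * R1
        Y2 = Y2 + mu * R2
        if max(np.abs(R1).max(), np.abs(R2).max()) < tol:
            break
        mu = min(rho * mu, mu_max)
    return X, E

# Hypothetical association matrix: 4 drugs x 3 viruses.
A_demo = np.array([[1.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0]])
X_star, E_star = lrr_ialm(A_demo)
A_star = A_demo @ X_star  # new association matrix, as in equation (7)
print(np.round(A_star, 3))
```

The matrix inverse is computed once because `A` does not change between iterations; for a large number of viruses a linear solve or a factorization would be a more economical choice.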

Step four: constructing a heterogeneous graph to predict potential virus-drug associations.

The new drug-virus association matrix constructed in step three is combined with the integrated drug similarity and the integrated virus similarity into a heterogeneous graph, and the potential association probability of each drug and virus is predicted from the heterogeneous graph. If there is no known association between the drug and the virus, the potential association probability matrix is defined as follows:

wherein nv denotes the number of viruses, nd denotes the number of drugs, $v_l$ denotes any one of the viruses 1 to nv, and $d_m$ denotes any one of the drugs 1 to nd.

In addition, the weights of the edges in the heterogeneous graph, i.e., the integrated drug similarity and the integrated virus similarity, are normalized according to the degrees of their endpoints. The normalization equations are as follows:

wherein $v_l$ denotes any one of the viruses 1 to nv, and $d_m$ denotes any one of the drugs 1 to nd; according to previous studies, the normalization helps convergence. Further, an iterative approach is used to compute the potential association probability matrix between all drugs and all viruses. The iterative equation is as follows:

$$P_{k+1} = a\,SV \times P_k \times SD + (1 - a)A^* \tag{11}$$

wherein $P_{k+1}$ denotes the potential association probability matrix after k+1 iterations, and a denotes a penalty factor.

After the iteration is finished, the potential association probability matrix between all drugs and all viruses is the predicted virus-drug association score matrix, which is still a matrix of nd rows and nv columns. Finally, virus-drug association prediction is carried out using the score matrix.
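The normalization and iteration of step four can be sketched as below. Several points are assumptions: the symmetric degree normalization, the starting point $P_0 = A^*$, the value `a = 0.4`, and the stopping rule. With P stored as nd rows by nv columns, the product is written `SD @ P @ SV` so that the dimensions conform; the factor order SV × P × SD in equation (11) conforms when P is stored transposed.

```python
import numpy as np

def normalize_by_degree(S):
    """Symmetric normalization S[i, j] / sqrt(deg_i * deg_j), where deg_i is the i-th row sum."""
    deg = S.sum(axis=1)
    d = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    return S * d[:, None] * d[None, :]

def propagate(A_new, SD, SV, a=0.4, tol=1e-6, max_iter=1000):
    """Iterate P <- a * SD_n @ P @ SV_n + (1 - a) * A_new until the update is negligible."""
    SD_n = normalize_by_degree(SD)
    SV_n = normalize_by_degree(SV)
    P = A_new.copy()  # assumption: start from the reconstructed association matrix
    for _ in range(max_iter):
        P_next = a * (SD_n @ P @ SV_n) + (1.0 - a) * A_new
        if np.abs(P_next - P).sum() < tol:
            return P_next
        P = P_next
    return P

# Hypothetical toy inputs: 3 drugs, 2 viruses.
SD_demo = np.array([[1.0, 0.4, 0.3], [0.4, 1.0, 0.7], [0.3, 0.7, 1.0]])
SV_demo = np.array([[1.0, 0.2], [0.2, 1.0]])
A_new_demo = np.array([[0.9, 0.1], [0.8, 0.0], [0.1, 0.7]])
P_demo = propagate(A_new_demo, SD_demo, SV_demo, a=0.4)
print(np.round(P_demo, 4))
```

Because the normalized similarity matrices have spectral radius at most 1 and `a` is below 1, each update is a contraction, so the loop settles on a unique fixed point regardless of the starting matrix.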

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The method treats virus-drug association prediction as a prediction task and uses multi-source biological data, comprising a known drug-virus association matrix, a drug chemical structure similarity matrix, a drug side effect similarity matrix, a drug Gaussian similarity matrix, a virus sequence similarity matrix and a virus Gaussian similarity matrix. This rich data helps to predict potential drug-virus associations and enables more accurate virus-drug association prediction.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a graph comparing the ROC curves and global leave-one-out cross-validation AUC values, on the data set of the example, of the virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning and four other existing methods.

FIG. 3 is a graph comparing the ROC curves and local leave-one-out cross-validation AUC values, on the data set of the example, of the virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning and four other existing methods.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless otherwise indicated, it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

The present example provides a virus-drug association prediction method (MDHGIVDA) based on matrix decomposition and heterogeneous graph reasoning, as shown in FIG. 1, comprising the following steps:

Step one: a known virus-drug association matrix is obtained.

Numerous biological experiments have identified many virus-drug associations. The virus-drug association information we used comes from the DrugVirus dataset, which Long et al. constructed based on the DrugVirus database. The DrugVirus dataset includes 933 known virus-drug associations involving 175 drugs and 95 viruses. We build an adjacency matrix A to store the virus-drug association information, where nd denotes the number of drugs and nv denotes the number of viruses. If a drug is associated with a virus, the corresponding entry is 1; otherwise it is 0.

Step two: integrating similarity matrix of virus and drug;

To obtain the integrated drug similarity, we integrate the chemical structure similarity of the drugs, the side effect similarity of the drugs and the Gaussian interaction profile kernel similarity of the drugs. If two drugs have chemical structure similarity or side effect similarity, the integrated drug similarity is the average of their chemical structure similarity and their side effect similarity. Otherwise, the integrated drug similarity is equal to the Gaussian interaction profile kernel similarity of the drugs. The calculation formula is as follows:

For virus similarity, we integrate the virus sequence similarity and the Gaussian interaction profile kernel similarity of the viruses to obtain the integrated virus similarity. The formula is as follows:

wherein MV is the sequence similarity of the viruses, GV is the Gaussian similarity of the viruses, and SV is the integrated virus similarity.

Step three: performing matrix decomposition on the virus-drug adjacency matrix and constructing a new virus-drug association matrix. The known drug-virus adjacency matrix is decomposed as follows:

$$A = AX + E$$

Then, the low-rank matrix X is obtained using the nuclear norm, and the sparse matrix E is obtained using the sparse norm. Thus, the above equation can be converted into the following equation:

wherein $\|\cdot\|_*$ denotes the nuclear norm, $\|\cdot\|_{2,1}$ denotes the sparse ($\ell_{2,1}$) norm, and $\alpha$ is the parameter that controls the weight;

The above formula can also be equivalently expressed as:

Briefly, the above equation can be viewed as a constrained convex optimization problem. The inexact augmented Lagrange multiplier (IALM) method is used to solve this problem. First, the above equation is converted into an unconstrained problem. Second, the unconstrained problem is minimized with the following augmented Lagrangian function.

We can derive two solutions from the above equation, denoted $X^*$ and $E^*$ respectively. Then a new drug-virus association matrix $A^*$ is established from A and $X^*$, as follows:

$$A^* = AX^*$$

Step four: a heterogeneous graph is constructed to predict potential virus-drug associations.

The new drug-virus association matrix, the integrated drug similarity and the integrated virus similarity are combined into a heterogeneous graph. The probability of a potential association between a drug and a virus can then be predicted from the heterogeneous graph. If there is no known association between the drug and the virus, we define their potential association probability by the following formula:

In addition, the weights of the edges (integrated drug similarity and integrated virus similarity) are normalized according to the degrees of their endpoints. The normalization equations are as follows:

According to previous studies, the normalization helps convergence. Further, an iterative approach is used to calculate the potential association probability between the drugs and the viruses. The iterative equation is as follows:

$$P_{k+1} = a\,SV \times P_k \times SD + (1 - a)A^*$$

Finally, the predicted virus-drug association score matrix is a matrix of nd rows and nv columns, which we store in P. The virus-drug pairs are ranked by predicted score, and a higher score indicates a greater likelihood of a virus-drug association. The ranking gives the relative likelihood of association between particular viruses and particular drugs. These predicted associations have considerable reference value: they allow the interaction between specific viruses and drugs to be studied in a targeted manner in the biomedical field, which facilitates drug research and development and the treatment of disease.
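Ranking candidate drugs for one virus from the score matrix can be sketched as follows. Excluding drugs that already have a known association with the virus is an assumption, and the function name and toy values are hypothetical.

```python
import numpy as np

def top_k_drugs(P, A, virus_idx, k=10):
    """Return up to k (drug index, score) pairs for one virus, best score first."""
    scores = P[:, virus_idx]
    candidates = np.flatnonzero(A[:, virus_idx] == 0)  # assumption: skip known associations
    order = candidates[np.argsort(-scores[candidates], kind="stable")]
    return [(int(i), float(scores[i])) for i in order[:k]]

# Hypothetical scores for 4 drugs and 2 viruses.
P_rank = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.3], [0.7, 0.05]])
A_rank = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
ranking = top_k_drugs(P_rank, A_rank, virus_idx=0, k=2)
print(ranking)
```

For virus 0 in the toy data, drug 0 is already known and is skipped, so the two best candidates are drug 3 followed by drug 2.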

Evaluation method: we used global LOOCV, local LOOCV and five-fold cross-validation to evaluate the predictive performance of the proposed method. In LOOCV, each known virus-drug association is selected in turn as the test sample, and the remaining known virus-drug associations are used as training samples. For global LOOCV, all unknown virus-drug pairs are used as candidate samples. We train the model with the training samples and use the trained model to predict the scores of the test sample and the candidate samples. The test sample is then ranked against the candidate samples according to the predicted scores, which finally yields a ranking for every test sample. In local LOOCV, by contrast, the score of the test sample is ranked only against the scores of the candidate samples that involve the drug under investigation in the test sample, which again yields a ranking for every test sample. In five-fold cross-validation, the known virus-drug associations are randomly divided into five subsets; each subset in turn serves as the test samples, and the other four subsets serve as training samples. All unknown virus-drug pairs are treated as candidate samples. The score of each test sample is then ranked against the scores of the candidate samples, yielding a ranking for every test sample. To avoid bias from random sample partitioning, the five-fold cross-validation was repeated 100 times. Finally, we plotted ROC curves and calculated the area under the ROC curve (AUC) to evaluate the predictive performance of the method.
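The global LOOCV procedure can be sketched as below. The `predict` argument is a hypothetical stand-in for the full pipeline (any function that maps a training association matrix to a score matrix), and summarising the rankings as the fraction of test-versus-candidate comparisons won is one common way to obtain an AUC; the exact ROC construction is not specified in the text.

```python
import numpy as np

def global_loocv(A, predict):
    """Hide each known pair in turn, re-predict, and collect (test score, candidate scores)."""
    unknown = (A == 0)  # candidate samples: all pairs with no known association
    results = []
    for i, p in zip(*np.nonzero(A)):
        A_train = A.copy()
        A_train[i, p] = 0.0
        P = predict(A_train)
        results.append((P[i, p], P[unknown]))
    return results

def auc_from_rankings(results):
    """Share of comparisons in which the test sample outscores a candidate; ties count half."""
    wins, total = 0.0, 0
    for test_score, cand_scores in results:
        wins += np.sum(test_score > cand_scores) + 0.5 * np.sum(test_score == cand_scores)
        total += cand_scores.size
    return wins / total

# Hypothetical toy matrix (4 drugs x 3 viruses) and a simple stand-in scorer.
A_cv = np.array([[1.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0]])
toy_predict = lambda A_train: A_train @ A_train.T @ A_train
cv_results = global_loocv(A_cv, toy_predict)
auc = auc_from_rankings(cv_results)
print(auc)
```

A local LOOCV variant would restrict the candidate scores for each test pair to the unknown pairs that share the drug of the test sample, following the description above.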

Evaluation results: in five-fold cross-validation, the AUC and standard deviation obtained by our method (MDHGIVDA) were 0.8299+/-0.0037, while the results for the comparison methods HGIMDA, IMCMDA, KATCMDA and RLSMDA were 0.6996+/-0.0022, 0.6808+/-0.0040, 0.8228+/-0.0023 and 0.6513+/-0.0229, respectively. In global leave-one-out cross-validation, the results are shown in FIG. 2: the AUC obtained by MDHGIVDA is 0.8528, which is higher than 0.7084 for HGIMDA, 0.6902 for IMCMDA, 0.8247 for KATCMDA and 0.6849 for RLSMDA. In local leave-one-out cross-validation, the results are shown in FIG. 3: the AUC obtained by MDHGIVDA is 0.8532, which is higher than 0.7537 for HGIMDA, 0.7436 for IMCMDA, 0.8247 for KATCMDA and 0.6815 for RLSMDA.

Case study: we used a case study to further evaluate the predictive performance of our method (MDHGIVDA). Three viruses were chosen as representatives: Zika virus, the novel coronavirus (SARS-CoV-2) and HIV type 1 (HIV-1). By running MDHGIVDA, we predicted the drugs associated with each of the three viruses. We then ranked the candidate drugs by prediction score and validated the top 10 potentially associated drugs for each of the three viruses by searching the literature on PubMed. The results are shown in tables one, two and three. For Zika virus, SARS-CoV-2 and HIV-1, 10, 8 and 8 of the top ten predicted drugs were verified, respectively.

Table one: Top ten predicted drugs associated with Zika virus

Table two: Top ten predicted drugs associated with the novel coronavirus (SARS-CoV-2)

Table three: Top ten predicted drugs associated with HIV type 1 (HIV-1)

The same or similar reference numerals correspond to the same or similar parts;

the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A virus-drug association prediction method based on matrix decomposition and heterogeneous graph reasoning, characterized by comprising the following steps:

step one: acquiring a known virus-drug association matrix;

a data set is composed of known virus-drug association data, the form of which is expressed as follows:

wherein A(i, p) indicates whether drug $d_i$ is associated with virus $v_p$; if so, the value of A(i, p) is 1, otherwise it is 0,

step two: integrating the similarity matrices of the viruses and the drugs;

for drug similarity, the chemical structure similarity of the drugs, the side effect similarity of the drugs and the Gaussian interaction profile kernel similarity of the drugs are integrated to obtain the integrated drug similarity, and the calculation formula is as follows:

wherein SS1 is the chemical structure similarity of the drugs, SS2 is the side effect similarity of the drugs, GD is the Gaussian similarity of the drugs, and SD is the integrated drug similarity,

for virus similarity, the virus sequence similarity and the Gaussian interaction profile kernel similarity of the viruses are integrated to obtain the integrated virus similarity, and the formula is as follows:

wherein MV is the sequence similarity of the viruses, GV is the Gaussian similarity of the viruses, and SV is the integrated virus similarity,

step three: performing matrix decomposition on the virus-drug adjacency matrix and constructing a new virus-drug association matrix;

the known drug-virus adjacency matrix is decomposed into two parts, wherein the first part is the product of the original matrix and a low-rank matrix and the second part is a sparse matrix, and the decomposition formula is as follows:

$$A = AX + E \tag{3}$$

then, the low-rank matrix X is obtained using the nuclear norm and the sparse matrix E is obtained using the sparse norm, so that equation (3) above can be converted into the following equation:

wherein $\|\cdot\|_*$ denotes the nuclear norm, $\|\cdot\|_{2,1}$ denotes the sparse ($\ell_{2,1}$) norm, and $\alpha$ is the parameter that controls the weight;

equation (4) above can also be equivalently expressed as:

first, equation (5) above is converted into an unconstrained problem, and second, the unconstrained problem is minimized using the augmented Lagrangian function of the following equation (6),

wherein $\|\cdot\|_F$ denotes the Frobenius norm;

from equation (6) above, two solutions can be obtained, denoted $X^*$ and $E^*$ respectively; then a new drug-virus association matrix $A^*$ is established from A and $X^*$, as follows:

$$A^* = AX^* \tag{7}$$

step four: constructing a heterogeneous graph to predict potential virus-drug associations,

the new drug-virus association matrix constructed in step three is combined with the integrated drug similarity and the integrated virus similarity into a heterogeneous graph, the potential association probability of the drug and the virus is predicted from the heterogeneous graph, and if no known association exists between the drug and the virus, the potential association probability matrix is defined as follows:

wherein nv denotes the number of viruses, nd denotes the number of drugs, $v_l$ denotes any one of the viruses 1 to nv, and $d_m$ denotes any one of the drugs 1 to nd,

in addition, the weights of the edges in the heterogeneous graph, i.e., the integrated drug similarity and the integrated virus similarity, are normalized according to the degrees of their endpoints, and the normalization equations are as follows:

wherein $v_l$ denotes any one of the viruses 1 to nv, and $d_m$ denotes any one of the drugs 1 to nd; further, an iterative method is used to calculate the potential association probability matrix between all drugs and all viruses, and the iterative equation is as follows:

$$P_{k+1} = a\,SV \times P_k \times SD + (1 - a)A^* \tag{11}$$

wherein $P_{k+1}$ denotes the potential association probability matrix after k+1 iterations, and a denotes a penalty factor,