CN113223655A

CN113223655A - Medicine-disease associated prediction method based on variational self-encoder

Info

Publication number: CN113223655A
Application number: CN202110496613.9A
Authority: CN
Inventors: 鱼亮; 陈生建
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-05-07
Filing date: 2021-05-07
Publication date: 2021-08-06
Anticipated expiration: 2041-05-07
Also published as: CN113223655B

Abstract

The invention provides a drug-disease association prediction method based on a variational autoencoder, which mainly addresses the low accuracy of drug-disease association prediction in the prior art. The method comprises the following steps: (1) constructing a drug-disease association matrix A and a disease-drug association matrix B; (2) constructing a drug feature matrix C and a disease feature matrix D; (3) constructing a drug-disease association prediction model H based on variational autoencoders; (4) iteratively training the prediction model H; and (5) obtaining the drug-disease association prediction result Y. The method reduces the influence of noise and missing data on the prediction result, fully extracts the implicit information in complex data, effectively improves the accuracy of drug-disease association prediction, and can be used to recommend candidate drugs for drug repositioning.

Description

Medicine-disease associated prediction method based on variational self-encoder

Technical Field

The invention belongs to the technical field of bioinformatics and relates to a drug-disease association prediction method, in particular one based on a variational autoencoder, which can be used in drug repositioning to recommend candidate new therapeutic uses for existing drugs.

Background

The purpose of drug repositioning is to identify new uses for existing drugs. Compared with traditional drug development, it greatly reduces risk and saves cost and time, so it has attracted wide attention; new indications for existing drugs accounted for 20% of the 84 drugs brought to market in 2013. In recent years, non-profit organizations, academic institutions, and governments have placed increasing emphasis on drug repositioning and have provided substantial financial support for it. For example, the National Center for Advancing Translational Sciences (NCATS) in the United States and the Medical Research Council (MRC) in the United Kingdom have launched several major funding programs in the field, with the goal of extending drug molecules that have already undergone substantial research and development in the pharmaceutical industry to new indications. In addition, the U.S. Food and Drug Administration (FDA) has created several public databases dedicated to computational drug repositioning, which greatly assists this work.

The identification of drug-disease associations provides important information for drug discovery and drug repositioning. Because manual curation is time-consuming, a large number of computational methods have been proposed as high-throughput technologies have developed and databases have been continuously updated.

In 2016, Luo et al. published the paper "Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm" in Bioinformatics, disclosing MBIRW, a drug-disease association prediction method based on comprehensive similarity measures and a bi-random walk. Working from the assumption that similar drugs are usually associated with similar diseases and vice versa, the method identifies potential new indications for a given drug. By combining drug or disease feature information with the known drug-disease associations, it establishes a comprehensive similarity measure to compute drug-drug and disease-disease similarity. A drug similarity network and a disease similarity network are then constructed and integrated, through the known drug-disease associations, into a heterogeneous network. On this heterogeneous network, a bi-random walk algorithm is used to predict new potential drug-disease associations.

In 2018, Luo et al. published the paper "Computational drug repositioning using low-rank matrix approximation and randomized algorithms" in Bioinformatics, disclosing DRRS, a drug-disease association prediction method that uses low-rank matrix approximation and randomized algorithms and predicts new drug indications by integrating drug-related and disease-related data. First, a heterogeneous drug-disease network is constructed by integrating the drug-drug, disease-disease, and drug-disease networks. The heterogeneous network is represented by a large drug-disease adjacency matrix whose entries include drug pairs, disease pairs, known drug-disease association pairs, and unknown drug-disease pairs. Then a fast singular value thresholding (SVT) algorithm is used to complete the adjacency matrix with predicted scores for the unknown drug-disease pairs.

However, these algorithms assume a noise-free environment and handle sparse data poorly, that is, their robustness to interference is weak. They also have difficulty learning the deep information in complex data and cannot fully extract its implicit information.

Disclosure of Invention

The purpose of the invention is to overcome the above defects of the prior art by providing a drug-disease association prediction method based on a variational autoencoder, so as to solve the problem of low drug-disease association prediction accuracy in the prior art.

In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:

(1) constructing a drug-disease association matrix A and a disease-drug association matrix B:

(1a) obtaining from a database M drugs S = {S₁, S₂, ..., S_m, ..., S_M}, the N diseases T = {T₁, T₂, ..., T_n, ..., T_N} associated with them, and K drug-disease associations E = {E₁, E₂, ..., E_k, ..., E_K}, where each drug S_m is associated with at least one disease and each disease T_n is associated with at least one drug, K ≥ 1000, M ≥ 100, N ≥ 200, S_m denotes the m-th drug with 1 ≤ m ≤ M, T_n denotes the n-th disease with 1 ≤ n ≤ N, and E_k denotes the k-th drug-disease association;

(1b) constructing a drug-disease association matrix A of size M × N whose element A_mn in row m and column n takes the value 0 or 1, and transposing A to obtain the disease-drug association matrix B, where A_mn = 0 indicates that the association between the m-th drug and the n-th disease is not in the drug-disease association data E, and A_mn = 1 indicates that it is in E;
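Step (1) can be illustrated with a short NumPy sketch. The function name `build_association_matrices`, the 0-based index pairs, and the toy data are assumptions made for illustration; the patent does not prescribe an implementation.

```python
import numpy as np

def build_association_matrices(pairs, n_drugs, n_diseases):
    """Return (A, B): A[m, n] = 1 if (m, n) is a known association, B = A.T."""
    A = np.zeros((n_drugs, n_diseases), dtype=np.int8)
    for m, n in pairs:
        A[m, n] = 1          # association (drug m, disease n) is in E
    return A, A.T            # B is the transpose of A

# Toy data (hypothetical): 3 drugs, 4 diseases, 0-based indices.
toy_pairs = [(0, 1), (1, 0), (1, 3), (2, 2)]
A, B = build_association_matrices(toy_pairs, n_drugs=3, n_diseases=4)
```

Every row and every column of the toy A contains at least one 1, which mirrors the requirement that each drug and each disease has at least one association.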

(2) constructing a drug characteristic matrix C and a disease characteristic matrix D:

(2a) obtaining from a database the P genes G = {G₁, G₂, ..., G_p, ..., G_P} associated with the M drugs S = {S₁, S₂, ..., S_m, ..., S_M} and the Q drug-gene associations R = {R₁, R₂, ..., R_q, ..., R_Q}, where each drug S_m is associated with at least one gene and each gene G_p is associated with at least one drug; constructing a drug-gene association matrix C' of size M × P whose element C'_mp in row m and column p takes the value 0 or 1, where C'_mp = 0 indicates that the association between the m-th drug and the p-th gene is not in the drug-gene association data R, and C'_mp = 1 indicates that it is in R; here P ≥ 200, Q ≥ 1000, 1 ≤ m ≤ M, 1 ≤ p ≤ P, G_p denotes the p-th gene, and R_q denotes the q-th drug-gene association;

(2b) obtaining from a database the O genes G = {G₁, G₂, ..., G_o, ..., G_O} associated with the N diseases T = {T₁, T₂, ..., T_n, ..., T_N} and the J disease-gene associations U = {U₁, U₂, ..., U_j, ..., U_J}, where each disease T_n is associated with at least one gene and each gene G_o is associated with at least one disease; constructing a disease-gene association matrix D' of size N × O whose element D'_no in row n and column o takes the value 0 or 1, where D'_no = 0 indicates that the association between the n-th disease and the o-th gene is not in the disease-gene association data U, and D'_no = 1 indicates that it is in U; here O ≥ 200, J ≥ 1000, 1 ≤ n ≤ N, 1 ≤ o ≤ O, and U_j denotes the j-th disease-gene association;

(2c) reducing the dimensionality of C' (size M × P) and D' (size N × O) respectively to obtain a drug feature matrix C of size M × V and a disease feature matrix D of size N × W, where each row of C is the feature vector of the corresponding drug, each row of D is the feature vector of the corresponding disease, 1 ≤ V ≤ P, and 1 ≤ W ≤ O;

(3) constructing a drug-disease association prediction model H based on variational autoencoders:

(3a) constructing the structure of the prediction model H:

constructing a drug-disease association prediction model comprising a first variational autoencoder f¹ and a second variational autoencoder f² arranged in parallel, wherein the first variational autoencoder f¹ is a neural network comprising a first encoder f_e¹, a first latent variable layer f_z¹, and a first decoder f_d¹ connected in series; f_e¹ comprises a plurality of fully connected layers and a mean-variance layer, the output of f_z¹ is connected to a first data fusion module, f_d¹ comprises a plurality of fully connected layers and an output layer with a sigmoid activation function, and f¹ has its own weight parameters; the second variational autoencoder f² comprises a second encoder f_e², a second latent variable layer f_z², and a second decoder f_d² connected in series; f_e² comprises a plurality of fully connected layers and a mean-variance layer, the output of f_z² is connected to a second data fusion module, f_d² comprises a plurality of fully connected layers and an output layer with a sigmoid activation function, and f² has its own weight parameters;

(3b) defining the loss function Loss1 of the first variational autoencoder f¹ and the loss function Loss2 of the second variational autoencoder f²:

wherein x denotes the input data of f¹ and x̂ denotes the prediction result of f¹; L_re denotes the reconstruction loss of f¹; PO_x denotes the set of elements of x whose value is 1, PO_x = {x_i | x_i = 1, 1 ≤ i ≤ N}; NP_x denotes the set of elements of x whose value is 0, NP_x = {x_j | x_j = 0, 1 ≤ j ≤ N}; x_i and x_j denote the i-th and j-th elements of x respectively; β denotes the non-positive loss attenuation factor, where non-positive means that the association is not among the known associations, β ∈ [0,1]; N(μ_x, δ_x²) denotes the normal distribution with mean μ_x and variance δ_x², N(0,1) denotes the standard normal distribution, and the relative entropy term is taken between N(μ_x, δ_x²) and N(0,1); μ_x and δ_x are output by the mean-variance layer of f_e¹ when the input of f¹ is x; α denotes the relative entropy loss attenuation factor, α ∈ [0,1]; y denotes the input data of f² and ŷ denotes the prediction result of f²;
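The equation images for Loss1 and Loss2 are not present in the text. One form consistent with the symbol definitions is sketched below as an assumption: ℓ stands for an unspecified per-element reconstruction error (for example squared error or cross-entropy), x̂ is the prediction for x, and Loss2 is taken to have the same form with y, ŷ, μ_y, and δ_y. The last line is the standard closed form of the relative entropy between a univariate normal distribution and N(0,1), summed over latent dimensions in practice.

```latex
\begin{aligned}
\mathrm{Loss1} &= L_{re} + \alpha\,\mathrm{KL}\!\left(N(\mu_x,\delta_x^{2})\,\middle\|\,N(0,1)\right),\\
L_{re} &= \sum_{x_i \in PO_x} \ell(x_i,\hat{x}_i) \;+\; \beta \sum_{x_j \in NP_x} \ell(x_j,\hat{x}_j),\\
\mathrm{KL}\!\left(N(\mu,\delta^{2})\,\middle\|\,N(0,1)\right) &= \tfrac{1}{2}\left(\mu^{2} + \delta^{2} - \ln \delta^{2} - 1\right).
\end{aligned}
```

With β < 1, errors on unobserved (0-valued) entries are down-weighted relative to errors on known associations, which matches the description of β as an attenuation factor for non-positive entries.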

(4) iteratively training the drug-disease association prediction model H:

(4a) initializing the iteration counter i and the maximum number of iterations I, I ≥ 300, denoting the weight parameters of the first variational autoencoder f¹ and of the second variational autoencoder f² at the i-th iteration, and letting i = 0;

(4b) taking the drug-disease association matrix A and the drug feature matrix C as the input of the first variational autoencoder f¹ in the prediction model H: the first encoder f_e¹ encodes A row by row; the first latent variable layer f_z¹ samples from the normal distribution constructed from the mean and variance produced by f_e¹; the first data fusion module additively fuses the V-dimensional latent variable sampled by f_z¹ with the feature vector of the corresponding row of the drug feature matrix C; and the first decoder f_d¹ decodes the fusion result of the first data fusion module to obtain the predicted drug-disease association matrix;

(4c) taking the disease-drug association matrix B and the disease feature matrix D as the input of the second variational autoencoder f² in the prediction model H: the second encoder f_e² encodes B row by row; the second latent variable layer f_z² samples from the normal distribution constructed from the mean and variance produced by f_e²; the second data fusion module additively fuses the W-dimensional latent variable sampled by f_z² with the feature vector of the corresponding row of the disease feature matrix D; and the second decoder f_d² decodes the fusion result of the second data fusion module to obtain the predicted disease-drug association matrix;

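(4d) using the loss function Loss1 to calculate, from the encoded mean and variance, A, and the predicted drug-disease association matrix, the loss value L1_i of the first variational autoencoder f¹ in H, and using the loss function Loss2 to calculate, from the encoded mean and variance, B, and the predicted disease-drug association matrix, the loss value L2_i of the second variational autoencoder f² in H;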

(4e) using the back-propagation method to calculate the parameter gradient of f¹ from L1_i, and then using a gradient descent algorithm to update the weight parameters of f¹ with this gradient; likewise, using the back-propagation method to calculate the parameter gradient of f² from L2_i, and then using a gradient descent algorithm to update the weight parameters of f² with this gradient;

(4f) judging whether i ≥ I; if so, the trained drug-disease association prediction model H' is obtained; otherwise, letting i = i + 1 and returning to step (4b);

(5) obtaining a drug-disease association prediction result Y:

taking the drug-disease association matrix A and the drug feature matrix C as the input of the first variational autoencoder f¹ in the trained prediction model H' and propagating forward to obtain the drug-disease association set Y¹ predicted by f¹; likewise, taking the disease-drug association matrix B and the disease feature matrix D as the input of the second variational autoencoder f² in H' and propagating forward to obtain the drug-disease association set Y² predicted by f²; the intersection Y = Y¹ ∩ Y² of Y¹ and Y² is the drug-disease association prediction result, where ∩ denotes intersection.

Compared with the prior art, the invention has the following advantages:

1. The drug-disease association prediction model built by the invention comprises two variational autoencoders arranged in parallel and two data fusion modules. During iterative training and when obtaining the drug-disease association result, the two data fusion modules fuse multiple kinds of information related to drugs and diseases, so that the implicit information in complex data is fully extracted.

2. The prediction model built by the invention learns the distribution of the data rather than a single fixed feature representation of it, which reduces the influence of noise and missing data on the prediction result and improves drug-disease association prediction accuracy over the prior art.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawing and a specific example:

referring to fig. 1, the present example includes the steps of:

step 1) constructing a drug-disease association matrix A and a disease-drug association matrix B:

step 1a) obtaining from a database M drugs S = {S₁, S₂, ..., S_m, ..., S_M}, the N diseases T = {T₁, T₂, ..., T_n, ..., T_N} associated with them, and K drug-disease associations E = {E₁, E₂, ..., E_k, ..., E_K}, where each drug S_m is associated with at least one disease and each disease T_n is associated with at least one drug; in this example, K = 2352, M = 663, and N = 409; S_m denotes the m-th drug with 1 ≤ m ≤ M, T_n denotes the n-th disease with 1 ≤ n ≤ N, and E_k denotes the k-th drug-disease association;

step 1b) constructing a drug-disease association matrix A of size M × N whose element A_mn in row m and column n takes the value 0 or 1, and transposing A to obtain the disease-drug association matrix B, where A_mn = 0 indicates that the association between the m-th drug and the n-th disease is not in the drug-disease association data E, and A_mn = 1 indicates that it is in E.

Step 2), constructing a drug characteristic matrix C and a disease characteristic matrix D:

The drug feature matrix C and the disease feature matrix D of this example are obtained from a drug similarity matrix C' and a disease similarity matrix D'. C' and D' are taken directly from the paper "Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm" published by Luo et al. in Bioinformatics in 2016; C' has size 663 × 663 and D' has size 409 × 409. This example uses principal component analysis to reduce C' and D' to sizes 663 × 10 and 409 × 10 respectively, in the following steps:

step 2a) subtracting from each column of the 663 × 663 drug similarity matrix C' the mean of that column, and subtracting from each column of the 409 × 409 disease similarity matrix D' the mean of that column, to obtain the centered drug similarity matrix C'₁ and the centered disease similarity matrix D'₁;

Step 2b) computing the covariance matrices of C'₁ and D'₁ respectively, to obtain a 663 × 663 covariance matrix for C'₁ and a 409 × 409 covariance matrix for D'₁;

Step 2c) performing eigenvalue decomposition on the two covariance matrices respectively, to obtain 663 eigenvalues and 663 eigenvectors for the covariance matrix of C'₁, and 409 eigenvalues and 409 eigenvectors for the covariance matrix of D'₁;

step 2d) sorting the 663 eigenvalues of the covariance matrix of C'₁ in descending order, selecting the first 10, and using the 10 corresponding eigenvectors as column vectors to form an eigenvector matrix, whose product with C'₁ is the drug feature matrix C of size 663 × 10; likewise, sorting the 409 eigenvalues of the covariance matrix of D'₁ in descending order, selecting the first 10, and using the 10 corresponding eigenvectors as column vectors to form an eigenvector matrix, whose product with D'₁ is the disease feature matrix D of size 409 × 10.
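Steps 2a) to 2d) can be sketched in NumPy as follows. The function name `pca_reduce` and the random symmetric stand-in for a similarity matrix are assumptions for illustration; the real C' and D' come from the cited dataset.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k leading eigenvectors of its covariance matrix."""
    X_centered = X - X.mean(axis=0)            # step 2a: subtract each column's mean
    cov = np.cov(X_centered, rowvar=False)     # step 2b: covariance matrix of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)     # step 2c: eigendecomposition (symmetric matrix)
    top = np.argsort(eigvals)[::-1][:k]        # step 2d: indices of the k largest eigenvalues
    W = eigvecs[:, top]                        # selected eigenvectors as column vectors
    return X_centered @ W                      # product of the centered matrix and W

rng = np.random.default_rng(0)
S = rng.random((20, 20))
S = (S + S.T) / 2                              # symmetric stand-in for a similarity matrix
C = pca_reduce(S, k=10)                        # 20 x 10 feature matrix
```

`np.linalg.eigh` is used because a covariance matrix is symmetric; it returns real eigenvalues in ascending order, so the indices are reversed to pick the largest ones.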

Step 3) constructing a drug-disease association prediction model H based on variational autoencoders:

step 3a) constructing the structure of the prediction model H:

constructing a drug-disease association prediction model H comprising a first variational autoencoder f¹ and a second variational autoencoder f² arranged in parallel, together with a first data fusion module and a second data fusion module, wherein the first variational autoencoder f¹ comprises a first encoder f_e¹, a first latent variable layer f_z¹, and a first decoder f_d¹ connected in series, f_e¹ comprises a plurality of fully connected layers and a mean-variance layer, f_d¹ comprises a plurality of fully connected layers and an output layer with a sigmoid activation function, and f¹ has its own weight parameters; the second variational autoencoder f² comprises a second encoder f_e², a second latent variable layer f_z², and a second decoder f_d² connected in series, f_e² comprises a plurality of fully connected layers and a mean-variance layer, f_d² comprises a plurality of fully connected layers and an output layer with a sigmoid activation function, and f² has its own weight parameters; the first data fusion module is connected to the output of f_z¹, and the second data fusion module is connected to the output of f_z²;

The first encoder f_e¹ comprises one fully connected layer and a mean-variance layer. The fully connected layer has input dimension 663 and output dimension 50. The mean-variance layer consists of two parallel branches, each a fully connected layer that takes the output of the preceding layer as input and has input dimension 50 and output dimension 10; the output of one branch is used as the mean and the output of the other as the variance. The second encoder f_e² has the same structure, except that its fully connected layer has input dimension 409 and output dimension 50;

The first decoder f_d¹ comprises one fully connected layer and an output layer with a sigmoid activation function; the fully connected layer has input dimension 10 and output dimension 50, and the output layer has input dimension 50 and output dimension 663. The second decoder f_d² has the same structure, except that its output layer has output dimension 409;
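The layer layout described above can be sketched as a single NumPy forward pass. Several details are assumptions of this sketch and are not stated in the text: the ReLU hidden activation, treating the variance branch as a log-variance, omitting biases, the random initialization, and taking the input width to be the length of one row of the matrix being encoded. The names `init_vae` and `forward` are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def init_vae(n_in, n_hidden=50, n_latent=10, seed=0):
    """Random weights for one encoder/decoder pair (biases omitted for brevity)."""
    rng = np.random.default_rng(seed)
    shapes = {
        "enc": (n_in, n_hidden),         # encoder fully connected layer
        "mean": (n_hidden, n_latent),    # mean branch of the mean-variance layer
        "logvar": (n_hidden, n_latent),  # variance branch (log-variance: assumption)
        "dec": (n_latent, n_hidden),     # decoder fully connected layer
        "out": (n_hidden, n_in),         # sigmoid output layer
    }
    return {name: rng.normal(0.0, 0.1, size=s) for name, s in shapes.items()}

def forward(params, x, features, rng):
    h = np.maximum(x @ params["enc"], 0.0)       # ReLU activation is an assumption
    mu = h @ params["mean"]
    logvar = h @ params["logvar"]
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps          # sample the latent variable
    fused = z + features                         # additive data fusion
    hidden = np.maximum(fused @ params["dec"], 0.0)
    x_hat = sigmoid(hidden @ params["out"])      # predicted association scores in (0, 1)
    return x_hat, mu, logvar

rng = np.random.default_rng(1)
x_batch = (rng.random((8, 409)) < 0.05).astype(float)  # 8 rows of a toy 0/1 matrix
c_batch = rng.normal(size=(8, 10))                     # matching 10-dimensional feature rows
params = init_vae(n_in=409)
x_hat, mu, logvar = forward(params, x_batch, c_batch, rng)
```

Additive fusion requires the latent dimension to equal the feature dimension, which is why both are 10 in the example.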

The drug-disease association prediction model built by the invention comprises two variational autoencoders arranged in parallel and two data fusion modules. During iterative training and when obtaining the drug-disease association result, the two data fusion modules fuse multiple kinds of information related to drugs and diseases, so that the implicit information in complex data is fully extracted. In addition, the model learns the distribution of the data rather than a single fixed feature representation of it, which reduces the influence of noise and missing data on the prediction result.

Step 3b) defining the loss function Loss1 of the first variational autoencoder f¹ and the loss function Loss2 of the second variational autoencoder f²:

wherein x denotes the input data of f¹ and x̂ denotes the prediction result of f¹; L_re denotes the reconstruction loss of f¹; P denotes the set of elements of x whose value is 1, P = {x_i | x_i = 1, 1 ≤ i ≤ N}; NP denotes the set of elements of x whose value is 0, NP = {x_j | x_j = 0, 1 ≤ j ≤ N}; x_i and x_j denote the i-th and j-th elements of x respectively; β denotes the non-positive loss attenuation factor, where non-positive means that the association is not among the known associations, β ∈ [0,1]; N(μ_x, δ_x²) denotes the normal distribution with mean μ_x and variance δ_x², and the relative entropy term is taken between N(μ_x, δ_x²) and N(0,1); ŷ denotes the prediction result of f²;
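As a numerical illustration of how such a loss could be evaluated, the sketch below assumes a squared-error reconstruction term and a loss of the form reconstruction loss plus α times relative entropy; neither the per-element error nor the values α = β = 0.5 are given in the text. The name `vae_loss` is hypothetical.

```python
import numpy as np

def vae_loss(x, x_hat, mu, var, alpha=0.5, beta=0.5):
    """Weighted reconstruction error plus alpha times KL(N(mu, var) || N(0, 1))."""
    sq_err = (x - x_hat) ** 2                    # per-element error (assumption)
    positive = sq_err[x == 1].sum()              # entries that are known associations
    non_positive = sq_err[x == 0].sum()          # entries not among the known associations
    l_re = positive + beta * non_positive        # beta attenuates the non-positive part
    kl = 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)
    return l_re + alpha * kl

x = np.array([1.0, 0.0, 0.0, 1.0])
x_hat = np.array([0.9, 0.2, 0.1, 0.6])
loss = vae_loss(x, x_hat, mu=np.zeros(2), var=np.ones(2))
shifted = vae_loss(x, x_hat, mu=np.ones(2), var=np.ones(2))
```

With zero mean and unit variance the relative entropy term vanishes, so `loss` is the reconstruction part alone; moving the mean away from zero, as in `shifted`, increases the loss.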

step 4) iteratively training the drug-disease association prediction model H:

step 4a) initializing the iteration counter i and the maximum number of iterations I, with I = 350 in this example, denoting the weight parameters of the first variational autoencoder f¹ and of the second variational autoencoder f² at the i-th iteration, and letting i = 0;

step 4b) taking the drug-disease association matrix A and the drug feature matrix C as the input of the first variational autoencoder f¹ in the prediction model H: the first encoder f_e¹ encodes A row by row; the first latent variable layer f_z¹ samples from the normal distribution constructed from the mean and variance produced by f_e¹; the first data fusion module additively fuses the 10-dimensional latent variable sampled by f_z¹ with the feature vector of the corresponding row of the drug feature matrix C; and the first decoder f_d¹ decodes the fusion result to obtain the predicted drug-disease association matrix;

When the first encoder f_e¹ encodes A row by row, this example selects 8 drugs at a time, that is, the mini-batch size is 8. The latent variable is not sampled directly from the normal distribution constructed from the encoded mean and variance, because the gradient cannot be back-propagated through a direct sampling operation and the model could then not be trained. Instead, ε₁ is first sampled from the standard normal distribution N(0,1), and the latent variable is then computed from ε₁ together with the encoded mean and variance.
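The formula image for this step is not present in the text. The standard reparameterization used for this purpose is shown below as an assumption, with z denoting the sampled latent variable, μ the encoded mean, and δ the encoded standard deviation (the square root of the variance):

```latex
z = \mu + \delta \odot \varepsilon_1, \qquad \varepsilon_1 \sim N(0, I)
```

Because the randomness enters only through ε₁, z is a differentiable function of μ and δ, with ∂z/∂μ = 1 and ∂z/∂δ = ε₁, so gradients can flow back into the encoder.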

Step 4c) taking the disease-drug association matrix B and the disease feature matrix D as the input of the second variational autoencoder f² in the prediction model H: the second encoder f_e² encodes B row by row; the second latent variable layer f_z² samples from the normal distribution constructed from the mean and variance produced by f_e²; the second data fusion module additively fuses the 10-dimensional latent variable sampled by f_z² with the feature vector of the corresponding row of the disease feature matrix D; and the second decoder f_d² decodes the fusion result to obtain the predicted disease-drug association matrix;

When the second encoder f_e² encodes B row by row, this example selects 8 diseases at a time, that is, the mini-batch size is 8. The latent variable is not sampled directly from the normal distribution constructed from the encoded mean and variance, because the gradient cannot be back-propagated through a direct sampling operation and the model could then not be trained. Instead, ε₂ is first sampled from the standard normal distribution N(0,1), and the latent variable is then computed from ε₂ together with the encoded mean and variance.

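Step 4d) using the loss function Loss1 to calculate, from the encoded mean and variance, A, and the predicted drug-disease association matrix, the loss value L1_i of the first variational autoencoder f¹ in H, and using the loss function Loss2 to calculate, from the encoded mean and variance, B, and the predicted disease-drug association matrix, the loss value L2_i of the second variational autoencoder f² in H;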

Step 4e) using the back-propagation method to calculate the parameter gradient of f¹ from L1_i, and then using a gradient descent algorithm to update the weight parameters of f¹ with this gradient; likewise, using the back-propagation method to calculate the parameter gradient of f² from L2_i, and then using a gradient descent algorithm to update the weight parameters of f² with this gradient;

In the update formulas for the weight parameters of f¹ and f², the updated weight parameters of each autoencoder are obtained from its weight parameters before the update, its learning step size, and the gradient of its weight parameters.
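The update formula itself is missing from the text; the plain gradient descent rule, new weights = old weights minus step size times gradient, is assumed in the sketch below. The function name `gradient_descent_step` and the toy objective are hypothetical.

```python
import numpy as np

def gradient_descent_step(params, grads, learning_rate):
    """Return updated parameters: each array moves against its gradient."""
    return {name: value - learning_rate * grads[name]
            for name, value in params.items()}

# Toy check on f(w) = ||w||^2 / 2, whose gradient with respect to w is w itself.
weights = {"w": np.array([1.0, -2.0])}
for _ in range(3):
    weights = gradient_descent_step(weights, {"w": weights["w"]}, learning_rate=0.5)
```

In the method described here the same rule would be applied separately to the two autoencoders, each with its own step size and its own gradient obtained by back-propagation.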

Step 4f) judging whether i ≥ I; if so, the trained drug-disease association prediction model H' is obtained; otherwise, letting i = i + 1 and returning to step 4b).

Step 5) obtaining the drug-disease association prediction result Y:

Taking the intersection of the prediction results Y¹ and Y² of the first variational autoencoder f¹ and the second variational autoencoder f² effectively reduces the proportion of false positives among the drug-disease associations in Y.
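The intersection can be sketched as follows. How a score matrix is turned into a set of predicted associations is not specified in the text; the 0.5 threshold, the function name `predicted_pairs`, and the toy scores are assumptions.

```python
import numpy as np

def predicted_pairs(scores, threshold=0.5):
    """Set of (row, col) index pairs whose score reaches the threshold."""
    rows, cols = np.nonzero(scores >= threshold)
    return set(zip(rows.tolist(), cols.tolist()))

drug_view = np.array([[0.9, 0.2], [0.6, 0.7]])     # output of f1: drugs x diseases
disease_view = np.array([[0.8, 0.1], [0.3, 0.9]])  # output of f2: diseases x drugs

Y1 = predicted_pairs(drug_view)
Y2 = {(m, n) for n, m in predicted_pairs(disease_view)}  # reorder to (drug, disease)
Y = Y1 & Y2                                              # keep pairs both views agree on
```

Here the pair (drug 1, disease 0) is predicted only by the drug-side view and is therefore dropped, which is the false-positive filtering effect described above.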

The technical effects of the invention are further explained by simulation experiments as follows:

1. simulation conditions and contents:

The simulation experiments were run on an Intel(R) Core(TM) i5-7300HQ CPU with a main frequency of 2.50 GHz and 8 GB of memory, using Python 3.6.5 with TensorFlow 1.0 on the PyCharm platform, and the Cdataset dataset proposed by Luo et al. in the paper "Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm" published in Bioinformatics in 2016.

The prediction accuracy of the invention was simulated and compared with that of two existing methods; the results are shown in Table 1. Prior art 1 in Table 1 is MBIRW, the drug repositioning method based on comprehensive similarity measures and bi-random walk proposed by Luo et al. in the paper "Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm" (Bioinformatics, 2016). Prior art 2 in Table 1 is DRRS, the drug repositioning method using low-rank matrix approximation and randomized algorithms proposed by Luo et al. in the paper "Computational drug repositioning using low-rank matrix approximation and randomized algorithms" (Bioinformatics, 2018).

2. And (3) simulation result analysis:

evaluation indexes adopted for representing the prediction precision of the drug-disease association comprise AUC and AUPR.

(1) AUC (Area Under Curve) is the area under the ROC curve (receiver operating characteristic curve). The abscissa of the ROC curve is the false positive rate FPR and the ordinate is the true positive rate TPR, where FPR = FP/(FP + TN) and TPR = TP/(TP + FN). FP is the number of samples that are actually negative but incorrectly predicted as positive by the model, TN is the number of samples that are actually negative and correctly predicted as negative, TP is the number of samples that are actually positive and correctly predicted as positive, and FN is the number of samples that are actually positive but incorrectly predicted as negative.

(2) AUPR (Area Under Precision-Recall Curve) is the area under the PR curve. The ordinate of the PR curve is precision and the abscissa is recall, where Precision = TP/(TP + FP) and Recall = TP/(TP + FN).
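The four quantities that define the ROC and PR curves can be computed from confusion-matrix counts as in the sketch below. The function name `rates` and the example counts are hypothetical and are not results from the experiments.

```python
def rates(tp, fp, tn, fn):
    """Return FPR, TPR, precision, and recall from confusion-matrix counts."""
    return {
        "FPR": fp / (fp + tn),        # false positive rate (ROC abscissa)
        "TPR": tp / (tp + fn),        # true positive rate (ROC ordinate)
        "Precision": tp / (tp + fp),  # PR ordinate
        "Recall": tp / (tp + fn),     # PR abscissa, identical to TPR
    }

example = rates(tp=30, fp=10, tn=50, fn=10)
```

Each decision threshold on the predicted scores gives one such set of counts, hence one point on each curve; AUC and AUPR are the areas under the curves traced out as the threshold varies.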

The AUC and AUPR values of the invention and of the two prior-art methods on the Cdataset dataset are compared in Table 1.

TABLE 1 comparison of the prediction accuracy of the prior art and the present invention

Table 1 shows that both the AUC and the AUPR of the invention are higher than those of the prior art, which demonstrates that the method effectively improves the accuracy of drug-disease association prediction.

The foregoing is only a specific example of the invention and does not limit it in any way. It will be apparent to those skilled in the art that various changes and modifications in form and detail may be made without departing from the principles and structure of the invention; such changes and modifications remain within the scope of the invention as defined by the appended claims.

Claims

1. A drug-disease association prediction method based on a variational autoencoder, characterized by comprising the following steps:

(1a) obtaining from a database M drugs S = {S₁, S₂, ..., S_m, ..., S_M}, the N diseases T = {T₁, T₂, ..., T_n, ..., T_N} associated with them, and K drug-disease associations E = {E₁, E₂, ..., E_k, ..., E_K}, where each drug S_m is associated with at least one disease and each disease T_n is associated with at least one drug, K ≥ 1000, M ≥ 100, N ≥ 200, S_m denotes the m-th drug with 1 ≤ m ≤ M, T_n denotes the n-th disease with 1 ≤ n ≤ N, and E_k denotes the k-th drug-disease association;

the second variational autoencoder f² comprises a second encoder f_e², a second latent variable layer f_z² and a second decoder f_d² connected in series, wherein f_e² comprises a plurality of fully connected layers and a mean-variance layer, f_z² is connected to a second data fusion module, f_d² comprises a plurality of fully connected layers and an output layer with a sigmoid activation function, and the weight parameter of f² is

wherein x represents the input data of f¹,

denotes the prediction result of f¹,

represents a distribution with mean μ_x and variance

denotes the relative entropy of

and N(0,1),

denotes the prediction result of f²;

and the weight parameter of the second variational autoencoder f² is

and let i = 0;

(4b) taking the drug-disease association matrix A and the drug feature matrix C as the input of the first variational autoencoder f¹ in the drug-disease association prediction model H, wherein the first encoder f_e¹ encodes A row by row and, at the first latent variable layer f_z¹, the mean μ_{f1_i} encoded by f_e¹ and the variance

construct a normal distribution

which is decoded to obtain the predicted drug-disease association matrix

(4c) taking the disease-drug association matrix B and the disease feature matrix D as the input of the second variational autoencoder f² in the drug-disease association prediction model H, wherein the second encoder f_e² encodes B row by row and, at the second latent variable layer f_z², the mean encoded by f_e²

and the variance

construct a normal distribution

which is decoded to obtain the predicted disease-drug association matrix

(4d) using the loss function Loss1 and, from

A and

B and

calculating the loss value L2_i of the second variational autoencoder f² in H;

(4e) calculating the parameter gradient of f¹ from L1_i by the back-propagation method, and then updating the weight parameter of f¹ by a gradient descent algorithm using the parameter gradient of f¹;

(5) obtaining a drug-disease association prediction result Y:

taking the drug-disease association matrix A and the drug feature matrix C as the input of the first variational autoencoder f¹ in the trained drug-disease association prediction model H and performing forward propagation to obtain the drug-disease association set Y¹ predicted by f¹; at the same time taking the disease-drug association matrix B and the disease feature matrix D as the input of the second variational autoencoder f² in the trained drug-disease association prediction model H and performing forward propagation to obtain the drug-disease association set Y² predicted by f²; the intersection Y = Y¹ ∩ Y² of Y¹ and Y² is the drug-disease association prediction result, where ∩ denotes intersection.
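The intersection in step (5) can be sketched in Python as follows. The claim does not state how the association sets Y¹ and Y² are read off the decoder outputs, so the score threshold of 0.5 is an assumption, as are the function names `predicted_pairs` and `combine_predictions`. The drug-side output is taken to be M × N (drugs in rows) and the disease-side output N × M (diseases in rows), so the disease-side pairs are flipped before intersecting.

```python
def predicted_pairs(scores, threshold=0.5):
    """Set of (row, column) index pairs whose score reaches `threshold`.

    The threshold is an assumption; the claim does not specify one.
    """
    return {
        (r, c)
        for r, row in enumerate(scores)
        for c, s in enumerate(row)
        if s >= threshold
    }


def combine_predictions(drug_side, disease_side, threshold=0.5):
    """Return Y = Y1 ∩ Y2 as a set of (drug, disease) index pairs.

    drug_side:    M x N scores from the first autoencoder (rows are drugs).
    disease_side: N x M scores from the second autoencoder (rows are diseases).
    """
    y1 = predicted_pairs(drug_side, threshold)
    # Flip (disease, drug) into (drug, disease) so both sets use the same order.
    y2 = {(drug, disease)
          for disease, drug in predicted_pairs(disease_side, threshold)}
    return y1 & y2
```

Taking the intersection keeps only the associations on which both autoencoders agree, which trades some recall for fewer false positives.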

2. The drug-disease association prediction method based on a variational autoencoder according to claim 1, wherein in step (2c) the dimensionality of C' of size M × P and of D' of size N × O is reduced by principal component analysis, implemented by the following steps:

(2c1) subtracting from each column of the drug-gene association matrix C' of size M × P the mean of that column, and subtracting from each column of the disease-gene association matrix D' of size N × O the mean of that column, to obtain the centered drug-gene association matrix C'₁ and the centered disease-gene association matrix D'₁;

(2c2) computing the covariance matrices of C'₁ and D'₁ respectively, to obtain a covariance matrix of size P × P and a covariance matrix of size O × O;

(2c3) performing eigenvalue decomposition on the P × P covariance matrix and on the O × O covariance matrix respectively, to obtain P eigenvalues and P eigenvectors of the former and O eigenvalues and O eigenvectors of the latter;

(2c4) sorting the P eigenvalues in descending order and selecting the first V eigenvalues, the eigenvectors corresponding to these V eigenvalues among the P eigenvectors being used as column vectors to form an eigenvector matrix whose product with C'₁ is the drug feature matrix C of size M × V; and sorting the O eigenvalues in descending order and selecting the first W eigenvalues, the eigenvectors corresponding to these W eigenvalues among the O eigenvectors being used as column vectors to form an eigenvector matrix whose product with D'₁ is the disease feature matrix D of size N × W.
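Steps (2c1) to (2c4) can be sketched with NumPy as follows. The function name `pca_reduce` is hypothetical, and the use of `np.cov` (which normalizes by the number of rows minus one) is an assumption, since the claim does not state the normalization of the covariance matrix.

```python
import numpy as np


def pca_reduce(X, k):
    """Reduce the columns of X (samples x features) to k principal components."""
    # (2c1) center the data: subtract each column's mean from that column
    X_centered = X - X.mean(axis=0)
    # (2c2) covariance matrix of size features x features
    cov = np.cov(X_centered, rowvar=False)
    # (2c3) eigenvalue decomposition; eigh suits symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    # (2c4) keep the eigenvectors of the k largest eigenvalues as columns
    top = np.argsort(eigvals)[::-1][:k]
    projection = eigvecs[:, top]
    # product of the centered data with the eigenvector matrix
    return X_centered @ projection


# Usage with random stand-in data (not the patent's data):
# C_prime is M x P and is reduced to the M x V drug feature matrix C.
rng = np.random.default_rng(0)
C_prime = rng.normal(size=(6, 4))
C = pca_reduce(C_prime, k=2)
```

Applying the same function to D' of size N × O with k = W would give the N × W disease feature matrix D.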

3. The drug-disease association prediction method based on a variational autoencoder according to claim 1, wherein in the structure of the drug-disease association prediction model H built in step (3a), the mean-variance layer of the first encoder f_e¹ comprises two fully connected layers that have different weight parameters and are arranged in parallel, the outputs of the two layers serving as the mean and the variance respectively; and the mean-variance layer of the second encoder f_e² likewise comprises two fully connected layers that have different weight parameters and are arranged in parallel, the outputs of the two layers serving as the mean and the variance respectively.
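The mean-variance layer described in this claim can be sketched with NumPy as two parallel affine maps applied to the same hidden activation. All names here (`mean_variance_layer`, `W_mu`, `W_var`, and so on) are hypothetical, and the softplus applied to the second branch is an assumption made only to keep the variance positive; the claim states just that the two outputs serve as the mean and the variance. The second function gives the standard closed form of the relative entropy between N(μ, σ²) and N(0,1), the kind of term that claim 1 refers to.

```python
import numpy as np


def mean_variance_layer(h, W_mu, b_mu, W_var, b_var):
    """Two parallel fully connected layers with separate weights.

    h: (batch, hidden) activations from the preceding fully connected layers.
    Returns (mean, variance), each of shape (batch, latent).
    """
    mu = h @ W_mu + b_mu
    # Softplus keeps the variance strictly positive (an assumption).
    var = np.logaddexp(0.0, h @ W_var + b_var)
    return mu, var


def kl_to_standard_normal(mu, var):
    """Relative entropy KL(N(mu, var) || N(0, 1)), summed over latent dims."""
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0, axis=-1)


# Usage with random stand-in weights (not trained parameters):
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
W_mu, W_var = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
b_mu, b_var = np.zeros(2), np.zeros(2)
mu, var = mean_variance_layer(h, W_mu, b_mu, W_var, b_var)
kl = kl_to_standard_normal(mu, var)
```

Keeping separate weights for the two branches lets the encoder learn the location and the spread of the latent distribution independently from the same features.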

4. The drug-disease association prediction method based on a variational autoencoder according to claim 1, wherein in step (4e) the gradient descent algorithm is used, through the parameter gradient of f¹, on the weight parameter of f¹