CN111477344B

CN111477344B - Drug side effect identification method based on self-weighted multiple kernel learning

Info

Publication number: CN111477344B
Application number: CN202010280936.XA
Authority: CN
Inventors: 刘勇国; 李杨; 杨尚明; 李巧勤
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2023-06-09
Anticipated expiration: 2040-04-10
Also published as: CN111477344A

Abstract

The invention discloses a drug side effect identification method based on self-weighted multiple kernel learning, which addresses two problems of existing multiple-kernel-learning methods for drug side effect identification: incomplete representation of drug features and unreasonable weight allocation. The method comprises data acquisition, construction of drug kernel matrices and side effect kernel matrices, and further steps. Drug features are described from multiple perspectives, and kernel matrices are constructed with four similarity measures for both the drug features and the side effect features, which reduces the influence of missing features on the prediction result. The optimal kernel matrices of the drugs and of the side effects are constructed by a self-weighting method, whose computed weights adapt better to the different kernel matrices. Extending the kernel matrices with a nearest neighbor method captures the local structure of the drug-side effect relationship.

Description

Drug side effect identification method based on self-weighted multiple kernel learning

Technical Field

The invention relates to the field of multiple kernel learning, in particular to a drug side effect identification method based on self-weighted multiple kernel learning.

Background

In recent years, drug safety problems caused by drug side effects have attracted increasing attention. Drug side effects have become an important cause of failure in clinical drug trials and are a major problem affecting public health. Research related to drug side effects covers several directions: calculating the similarity between drugs and predicting drug targets by using drug-side effect relationships; achieving drug repositioning by using the similarity between side effects; predicting the side effects a drug may cause from information such as its chemical structure; and predicting drug side effects by using disease networks. Drug side effect identification plays an important role in drug research, and predicting drug side effects promptly and accurately has become a research hotspot at home and abroad.

The conventional method for predicting and evaluating potential side effects of drugs is generally to carry out clinical experiments on patients before the drugs are marketed and observe adverse reactions generated after the patients take the drugs.

In recent years, the accumulation of large amounts of drug side effect data, for example in the SIDER database, has given researchers data sources for exploring drug side effects at the molecular level. The development of computing techniques such as complex networks and data mining offers new approaches to identifying drug side effects, and a growing number of studies mine potential drug-side effect associations from massive biological data by computational means. Current methods for drug side effect identification can be divided into classification algorithms and recommendation algorithms. Classification algorithms identify side effects from the chemical and biological features of drugs; the central issue in this approach is extracting effective features from drugs and side effects. Commonly used classifiers include support vector machines and decision trees. Recommendation methods can also identify drug-side effect associations; they include matrix factorization (MF), label propagation algorithms (LPA), collaborative filtering (CF), and bipartite local models. These methods are also applicable to drug-target interaction prediction, drug-side effect association identification, and miRNA-disease association prediction.

The kernel method is one kind of classification algorithm. Because a single kernel is often insufficient for complex problems, many existing kernel-based machine learning algorithms combine multiple kernels to obtain a better similarity measure. Multiple kernel learning is one such kernel method: it combines several base kernels in place of a single kernel. Because it combines kernel functions defined on different input data sources, it suits sample data sets whose features are irregular and heterogeneous, and it offers greater flexibility. Some studies identify drug side effects with multiple kernel learning. The CKA-MKL model [Y. Ding, J. Tang, F. Guo. Identification of drug-side effect association via multiple information integration with centered kernel alignment [J]. Neurocomputing, 2019] constructs multiple kernels in the drug space and in the side effect space, linearly weights the kernels of each space with a multiple kernel learning algorithm based on centered kernel alignment, and finally identifies drug side effects by fusing the drug kernel and the side effect kernel with Kronecker regularized least squares (Kronecker RLS).

Although side effect identification based on multiple kernel learning has made some progress, the following problems remain:

Existing methods do not consider data sources such as drug-target associations. The drug and side effect features they use are therefore incomplete, drugs and side effects cannot be represented accurately, and prediction accuracy suffers.

Most existing multiple kernel learning methods assume that the optimal kernel is a linear combination of base kernels. This assumption may not hold, in which case the weights are allocated improperly and the accuracy of the prediction result is reduced.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: the invention provides a drug side effect identification method based on self-weighted multiple kernel learning, which solves the problems of incomplete drug feature representation and unreasonable weight allocation when kernel functions are weighted.

The invention provides a side effect identification method based on self-weighted multiple kernel learning, which describes drug features more fully from three aspects: side effects, targets, and substructures. To capture the influence of similarity relationships among drugs and among side effects on drug side effect identification, that is, the local structure of the drug-side effect relationship, a nearest neighbor method is used to extend the kernel matrices, which improves the accuracy of the prediction result.

The invention is realized by the following technical scheme:

a drug side effect identification method based on self-weighted multiple kernel learning, comprising the following steps:

step 1: data acquisition: collecting information from a database;

step 2: construction of drug kernel matrices and side effect kernel matrices: constructing a data set representing the drugs, constructing a data set representing the side effects, and constructing a relationship matrix between the drugs and the side effects;

calculating four kinds of similarity data from the relationship matrix, namely a Gaussian interaction profile (GIP) kernel, a correlation coefficient (Corr) kernel, a cosine similarity (COS) kernel, and a mutual information (MI) kernel, and generating the kernel matrices of the drug attribute space and the kernel matrices of the side effect attribute space from the four kinds of similarity data;

step 3: establishing a self-weighted multiple kernel learning objective function from the kernel matrices of the drug attribute space and of the side effect attribute space obtained in step 2, and obtaining the optimal drug kernel matrix and the optimal side effect kernel matrix by iterative updating; extending the drug kernel matrix and the side effect kernel matrix by a nearest neighbor method; then minimizing the objective function by the Gaussian Fields and Harmonic Functions (GFHF) method, and finally obtaining the predicted drug-side effect relationship matrix by continuous iterative updating.

Further, step 2 also includes generating four drug attribute kernels from a drug-substructure relationship matrix and four drug attribute kernels from a drug-target relationship matrix; these eight drug attribute kernels, together with the four drug attribute kernels generated from the drug-side effect relationship matrix, are substituted into step 3 for calculation.

Further, in step 1, the information collected from the database includes information on drugs that have both target protein and side effect information, drug-protein interaction information, target protein information, drug-side effect relationship information, and side effect information.

Further, the chemical structure of a drug is encoded as a molecular fingerprint, which consists of chemical substructures defined in the PubChem database.

Further, the step 2 includes the following detailed steps:

let $D = \{d_1, d_2, \ldots, d_n\}$ denote the set of n drugs, where d denotes a drug, and let $S = \{s_1, s_2, \ldots, s_m\}$ denote the set of m side effects, where s denotes a side effect;

an n×m adjacency matrix F represents the relationship matrix between the drugs and the side effects, where $F_{i,j}$ ($1 \le i \le n$, $1 \le j \le m$) is an element of F: $F_{i,j} = 1$ when drug $d_i$ has side effect $s_j$, and $F_{i,j} = 0$ otherwise; drug $d_i$ is represented by its side effects as $F_{d_i} = (F_{i,1}, F_{i,2}, \ldots, F_{i,m})$, a binary vector of length m in which each element is 1 or 0;

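the Gaussian interaction profile (GIP) kernel is expressed as:

$$K_{\mathrm{GIP}}(d_i, d_k) = \exp\left(-\gamma \lVert F_{d_i} - F_{d_k} \rVert^2\right)$$

where $F_{d_i}$ and $F_{d_k}$ are drug $d_i$ and drug $d_k$ represented by their side effects, respectively, and $\gamma$ is the bandwidth of the Gaussian kernel;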

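the correlation coefficient (Corr) kernel is expressed as:

$$K_{\mathrm{Corr}}(d_i, d_k) = \frac{\mathrm{Cov}(F_{d_i}, F_{d_k})}{\sqrt{\mathrm{Var}(F_{d_i})\,\mathrm{Var}(F_{d_k})}}$$

where $\mathrm{Cov}(F_{d_i}, F_{d_k})$ denotes the covariance of $F_{d_i}$ and $F_{d_k}$, $\mathrm{Var}(F_{d_i})$ denotes the variance of $F_{d_i}$, and $\mathrm{Var}(F_{d_k})$ denotes the variance of $F_{d_k}$;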

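the cosine similarity (COS) kernel is expressed as:

$$K_{\mathrm{COS}}(d_i, d_k) = \frac{F_{d_i} \cdot F_{d_k}}{\lVert F_{d_i} \rVert \, \lVert F_{d_k} \rVert}$$

the mutual information (MI) kernel is expressed as:

$$K_{\mathrm{MI}}(d_i, d_k) = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} f(u, v) \log \frac{f(u, v)}{f(u)\, f(v)}$$

where $u \in \{0, 1\}$ and $v \in \{0, 1\}$ are the values taken by the drug variables in the side effect space, 0 indicating that the drug does not have the side effect and 1 indicating that it does; $f(u)$ is the frequency of u in $F_{d_i}$, $f(v)$ is the frequency of v in $F_{d_k}$, and $f(u, v)$ is the relative observed frequency of the pair (u, v).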

Further, the step 3 includes the following detailed steps:

$K_d^i$ ($i = 1, \ldots, C_d$) denotes a kernel matrix of the drug attribute space and $K_s^j$ ($j = 1, \ldots, C_s$) denotes a kernel matrix of the side effect attribute space, where $C_d$ is the number of kernels of the drug space and $C_s$ is the number of kernels of the side effect space; the objective function of the self-weighted multiple kernel learning is as follows:

$$\min_{K_d^*} \sum_{i=1}^{C_d} \omega_i \lVert K_d^i - K_d^* \rVert_F^2$$

where $\omega_i$ denotes the weight of $K_d^i$:

$$\omega_i = \frac{1}{2 \lVert K_d^i - K_d^* \rVert_F}$$

given the initial value $\omega_i = 1/C_d$, where $C_d$ is the number of drug kernels, $K_d^*$ is calculated; $\omega_i$ then changes dynamically with $K_d^*$, and $\omega_i$ is updated continuously until the optimal drug kernel $K_d^*$ is obtained; the optimal side effect kernel $K_s^*$ is obtained by the same learning method.
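As a hedged aside on why alternating between the weights and the optimal kernel is inexpensive: assuming the objective is the weighted sum of squared Frobenius distances between the base kernels and the optimal kernel (an assumption about the form of the formula, with $J$ introduced here only as a name for that sum), fixing the weights leaves a quadratic problem with a closed-form minimizer.

```latex
% Assumed objective with the weights \omega_i held fixed
J(K_d^*) = \sum_{i=1}^{C_d} \omega_i \,\lVert K_d^i - K_d^* \rVert_F^2

% Setting the gradient with respect to K_d^* to zero
\frac{\partial J}{\partial K_d^*} = -2 \sum_{i=1}^{C_d} \omega_i \,\bigl(K_d^i - K_d^*\bigr) = 0

% Closed-form update: the weight-normalized average of the base kernels
K_d^* = \frac{\sum_{i=1}^{C_d} \omega_i K_d^i}{\sum_{i=1}^{C_d} \omega_i}
```

Under this assumption, each iteration consists of one weighted average of the base kernels followed by one recomputation of the weights from the new distances.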

Further, the nearest neighbor method is specifically as follows: the k nearest neighbor drugs similar to drug $d_i$ are denoted $N(d_i) \subseteq D$, and the elements of the k-nearest-neighbor graph $N^d \in R^{n \times n}$ are defined as follows:

[formula missing in the source text]

$N^d$ is used to sparsify the optimal drug kernel matrix $K_d^*$; the extended drug kernel matrix obtained after sparsification is

$$\tilde{K}_d = N^d \circ K_d^*$$

where $\circ$ denotes the Hadamard product of matrices; for the side effect information, the k nearest neighbor side effects similar to side effect $s_j$ are denoted $N(s_j) \subseteq S$, and the k-nearest-neighbor graph $N^s$ of the side effects is used to sparsify the optimal side effect kernel matrix $K_s^*$, giving the extended side effect kernel matrix

$$\tilde{K}_s = N^s \circ K_s^*$$

Further, the following objective function is minimized using the Gaussian Fields and Harmonic Functions (GFHF) method:

[objective function formula missing in the source text]

where tr(·) denotes the trace of a matrix, $\mu$ and $\sigma$ are non-negative parameters, $E_l(F^*)$ is the loss function, $E_d(F^*)$ is the graph regularization term on the drug feature space, $E_s(F^*)$ is the graph regularization term on the side effect feature space, and $F_{train}$ denotes the part of the drug-side effect relationship matrix used as training data; the objective also uses a diagonal matrix, in which:

[formula missing in the source text]

$L_d \in R^{n \times n}$ and $L_s \in R^{m \times m}$ are Laplacian matrices:

[formula missing in the source text]

$D_d$ and $D_s$ are diagonal matrices:

[formula missing in the source text]

to find $F^*$, let:

[formula missing in the source text]

The objective function may then be rewritten as:

[formula missing in the source text]

where $I_d \in R^{n \times n}$ is an identity matrix; the matrix $F^*$ is updated continuously until the predicted drug-side effect relationship matrix is obtained.

The invention has the following advantages and beneficial effects:

The method describes drug features from multiple perspectives and constructs kernel matrices with four similarity measures for both the drug features and the side effect features, which reduces the influence of missing features on the prediction result. The optimal kernel matrices of the drugs and of the side effects are constructed by a self-weighting method, whose computed weights adapt better to the different kernel matrices. Extending the kernel matrices with a nearest neighbor method captures the local structure of the drug-side effect relationship. On this basis, the invention can identify drug side effects more accurately.

Drawings

The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention. In the drawings:

FIG. 1 is a schematic diagram of a process flow according to the present invention.

FIG. 2 is a schematic diagram of the multiple kernel learning model of the present invention.

Detailed Description

Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive improvements, are intended to fall within the scope of the invention.

A drug side effect identification method based on self-weighted multiple kernel learning, as shown in FIG. 1, comprises the following steps:

step 1: data acquisition: collecting information from a database;

Example 1:

Data acquisition:

The data used in the technical scheme of the invention are derived from the Mizutani database, which contains 658 drugs that have both target protein and side effect information, 5074 drug-protein interactions, 1368 target proteins, 49051 drug-side effect associations, and 1339 side effects. To encode the chemical structure of the drugs, a molecular fingerprint is used, which consists of 881 chemical substructures defined in the PubChem database.

Construction of drug kernels and side effect kernels:

Let $D = \{d_1, d_2, \ldots, d_n\}$ denote the set of n drugs and $S = \{s_1, s_2, \ldots, s_m\}$ denote the set of m side effects. The n×m adjacency matrix F represents the relationship matrix between the drugs and the side effects. $F_{i,j}$ ($1 \le i \le n$, $1 \le j \le m$) is an element of F: $F_{i,j} = 1$ when drug $d_i$ has side effect $s_j$, and $F_{i,j} = 0$ otherwise. Drug $d_i$ is represented by its side effects as $F_{d_i} = (F_{i,1}, F_{i,2}, \ldots, F_{i,m})$, a binary vector of length m in which each element is 1 or 0.
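The construction of the relationship matrix can be illustrated with a minimal sketch. The drug and side effect names below are hypothetical toy values, not entries of the Mizutani database, and `build_relation_matrix` is an illustrative helper name.

```python
def build_relation_matrix(drugs, side_effects, pairs):
    """Return the n x m binary matrix F with F[i][j] = 1 iff drug i has side effect j."""
    row = {d: i for i, d in enumerate(drugs)}
    col = {s: j for j, s in enumerate(side_effects)}
    F = [[0] * len(side_effects) for _ in drugs]
    for d, s in pairs:
        F[row[d]][col[s]] = 1
    return F

# Toy data (hypothetical, for illustration only).
drugs = ["d1", "d2", "d3"]
side_effects = ["nausea", "rash", "headache", "dizziness"]
pairs = [("d1", "nausea"), ("d1", "rash"), ("d2", "rash"),
         ("d2", "headache"), ("d3", "dizziness")]

F = build_relation_matrix(drugs, side_effects, pairs)
for name, profile in zip(drugs, F):
    print(name, profile)  # each row is the binary side effect profile of one drug
```

Each row of `F` is the side effect profile of one drug; the transposed matrix gives the drug profile of each side effect, which is what the side effect kernels are built from.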

Four similarity measures are used to construct the kernel matrices: the Gaussian interaction profile (GIP) kernel, the correlation coefficient (Corr) kernel, the cosine similarity (COS) kernel, and the mutual information (MI) kernel.

The Gaussian interaction profile kernel is constructed from the topology of the known drug-side effect network. It realizes a nonlinear mapping, so that drugs are mapped to a nonlinear representation in which the drug vectors are highly distinguishable. It is expressed as:

$$K_{\mathrm{GIP}}(d_i, d_k) = \exp\left(-\gamma \lVert F_{d_i} - F_{d_k} \rVert^2\right)$$

where $F_{d_i}$ and $F_{d_k}$ are drug $d_i$ and drug $d_k$ represented by their side effects, respectively, and $\gamma$ is the bandwidth of the Gaussian kernel.
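A minimal sketch of the GIP kernel on a toy relationship matrix follows. It assumes the standard Gaussian form exp(-γ‖·‖²) on the binary profiles; the function name `gip_kernel` and the value γ = 0.5 are illustrative choices, since the text does not state how the bandwidth is set.

```python
import math

def gip_kernel(F, gamma=1.0):
    """Gaussian interaction profile kernel between the rows of a binary matrix F."""
    n = len(F)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Squared Euclidean distance between the two interaction profiles.
            sq_dist = sum((a - b) ** 2 for a, b in zip(F[i], F[k]))
            K[i][k] = math.exp(-gamma * sq_dist)
    return K

F = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
K_gip = gip_kernel(F, gamma=0.5)  # gamma is an assumed value
for row in K_gip:
    print([round(x, 4) for x in row])
```

Profiles that differ in more positions get exponentially smaller similarity, which is the source of the nonlinearity mentioned above.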

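The correlation coefficient kernel measures the linear relationship between drug vectors and is expressed as:

$$K_{\mathrm{Corr}}(d_i, d_k) = \frac{\mathrm{Cov}(F_{d_i}, F_{d_k})}{\sqrt{\mathrm{Var}(F_{d_i})\,\mathrm{Var}(F_{d_k})}}$$

where $\mathrm{Cov}(F_{d_i}, F_{d_k})$ denotes the covariance of $F_{d_i}$ and $F_{d_k}$, $\mathrm{Var}(F_{d_i})$ denotes the variance of $F_{d_i}$, and $\mathrm{Var}(F_{d_k})$ denotes the variance of $F_{d_k}$.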
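The correlation coefficient kernel can be sketched as the Pearson correlation between two binary profiles. This is a minimal sketch; the name `corr_kernel` and the choice to return 0 for a constant profile (zero variance) are assumptions made here, not details given in the text.

```python
import math

def corr_kernel(F):
    """Pearson correlation between every pair of rows of F."""
    n, m = len(F), len(F[0])
    means = [sum(row) / m for row in F]
    variances = [sum((x - means[i]) ** 2 for x in F[i]) / m for i in range(n)]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            cov = sum((F[i][j] - means[i]) * (F[k][j] - means[k])
                      for j in range(m)) / m
            denom = math.sqrt(variances[i] * variances[k])
            # Assumed convention: a constant profile has zero similarity.
            K[i][k] = cov / denom if denom > 0 else 0.0
    return K

F = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
K_corr = corr_kernel(F)
for row in K_corr:
    print([round(x, 4) for x in row])
```

Unlike the GIP kernel, the values can be negative when two drugs have largely complementary side effect profiles.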

The cosine similarity kernel treats each drug as a vector in the m-dimensional side effect space and evaluates the similarity of two drugs by the cosine of the angle between their vectors. It measures the difference in direction of two drug variables in the side effect space: the more closely the two directions agree, the higher the similarity. It is expressed as:

$$K_{\mathrm{COS}}(d_i, d_k) = \frac{F_{d_i} \cdot F_{d_k}}{\lVert F_{d_i} \rVert \, \lVert F_{d_k} \rVert}$$
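A minimal sketch of the cosine similarity kernel on the same toy matrix is given below. The name `cosine_kernel` and the convention of returning 0 for an all-zero profile are assumptions for illustration.

```python
import math

def cosine_kernel(F):
    """Cosine of the angle between every pair of rows of F."""
    n = len(F)
    norms = [math.sqrt(sum(x * x for x in row)) for row in F]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            dot = sum(a * b for a, b in zip(F[i], F[k]))
            denom = norms[i] * norms[k]
            # Assumed convention: a drug with no recorded side effects is similar to nothing.
            K[i][k] = dot / denom if denom > 0 else 0.0
    return K

F = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
K_cos = cosine_kernel(F)
for row in K_cos:
    print([round(x, 4) for x in row])
```

For binary profiles the numerator counts the side effects two drugs share, so drugs with no shared side effect get similarity 0.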

The mutual information kernel measures the degree of interdependence between two discrete random variables, here between the observed frequencies of two drugs, and is expressed as:

$$K_{\mathrm{MI}}(d_i, d_k) = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} f(u, v) \log \frac{f(u, v)}{f(u)\, f(v)}$$

where $u \in \{0, 1\}$ and $v \in \{0, 1\}$ are the values taken by the drug variables in the side effect space; 0 indicates that the drug does not have the side effect and 1 indicates that it does. $f(u)$ is the frequency of u in $F_{d_i}$; for example, when u = 1, $f(u)$ is the frequency of 1 in the drug vector $F_{d_i}$. $f(v)$ is the frequency of v in $F_{d_k}$, and $f(u, v)$ is the relative observed frequency of the pair (u, v).
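The mutual information between two binary profiles can be sketched as follows. This is a minimal sketch: the natural logarithm and the convention that terms with f(u, v) = 0 contribute nothing are assumptions, and `mutual_information` is an illustrative name.

```python
import math

def mutual_information(x, y):
    """Mutual information of two equal-length binary vectors, from observed frequencies."""
    m = len(x)
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            f_u = sum(1 for a in x if a == u) / m
            f_v = sum(1 for b in y if b == v) / m
            f_uv = sum(1 for a, b in zip(x, y) if a == u and b == v) / m
            if f_uv > 0:  # assumed convention: 0 * log(0) = 0
                mi += f_uv * math.log(f_uv / (f_u * f_v))
    return mi

x = [1, 1, 0, 0]
y = [0, 1, 1, 0]
mi_same = mutual_information(x, x)   # a profile compared with itself
mi_indep = mutual_information(x, y)  # joint frequencies factorize
print(mi_same, mi_indep)
```

Evaluating `mutual_information` on every pair of rows of the relationship matrix yields the MI kernel matrix.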

The above describes the drug attribute kernels obtained by representing drugs by their side effects. Similarly, the drug attribute kernels obtained by representing drugs by their substructures are $K_{GIP-chem,d}$, $K_{Corr-chem,d}$, $K_{Cos-chem,d}$ and $K_{MI-chem,d}$; the drug attribute kernels obtained by representing drugs by their targets are $K_{GIP-target,d}$, $K_{Corr-target,d}$, $K_{Cos-target,d}$ and $K_{MI-target,d}$; and the side effect attribute kernels obtained by representing side effects by drugs are $K_{GIP-link,s}$, $K_{Corr-link,s}$, $K_{Cos-link,s}$ and $K_{MI-link,s}$.

Multiple kernel learning to generate the optimal kernels:

As shown in FIG. 2, $K_d^i$ ($i = 1, \ldots, C_d$) denotes a kernel of the drug attribute space and $K_s^j$ ($j = 1, \ldots, C_s$) denotes a kernel of the side effect attribute space. $C_d$ is the number of kernels of the drug space, and $C_d = 12$ in this scheme; $C_s$ is the number of kernels of the side effect space, and $C_s = 4$ in this scheme.

Take the generation of the optimal drug kernel $K_d^*$ as an example. Because the optimal kernel should be close to each attribute kernel of the drug or side effect, the objective function of the self-weighted multiple kernel learning is as follows:

$$\min_{K_d^*} \sum_{i=1}^{C_d} \omega_i \lVert K_d^i - K_d^* \rVert_F^2$$

where $\omega_i$ denotes the weight of $K_d^i$:

$$\omega_i = \frac{1}{2 \lVert K_d^i - K_d^* \rVert_F}$$

Because $\omega_i$ depends on the target variable $K_d^*$, $\omega_i$ cannot be calculated while $K_d^*$ is unknown. The initial value $\omega_i = 1/C_d$ is therefore given first, where $C_d$ is the number of drug kernels. After the initial value of $\omega_i$ is obtained, $K_d^*$ is calculated; $\omega_i$ then changes dynamically with $K_d^*$ and is updated continuously until the optimal drug kernel $K_d^*$ is obtained. The optimal side effect kernel $K_s^*$ is obtained by the same method.
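The alternating update described above can be sketched as follows. This is a minimal sketch under stated assumptions: the objective is taken to be a weighted sum of squared Frobenius distances, so that for fixed weights the optimal kernel is the weight-normalized average of the base kernels, and each weight is then reset to 1/(2‖K_i − K*‖_F). The function names, the fixed iteration count, and the `eps` guard against division by zero are illustrative choices.

```python
import math

def frobenius_distance(A, B):
    """Frobenius norm of A - B for two equally sized matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for ra, rb in zip(A, B) for a, b in zip(ra, rb)))

def self_weighted_kernel(kernels, n_iter=20, eps=1e-12):
    """Alternate between the consensus kernel and the per-kernel weights."""
    c, n = len(kernels), len(kernels[0])
    weights = [1.0 / c] * c  # initial value 1/C for every base kernel
    K_star = None
    for _ in range(n_iter):
        total = sum(weights)
        # Weighted average of the base kernels (closed form for fixed weights).
        K_star = [[sum(w * K[i][j] for w, K in zip(weights, kernels)) / total
                   for j in range(n)] for i in range(n)]
        # Kernels closer to the consensus receive larger weights.
        weights = [1.0 / (2.0 * max(frobenius_distance(K, K_star), eps))
                   for K in kernels]
    return K_star, weights

# Toy base kernels: two agree with each other, one is an outlier.
K1 = [[1.0, 0.5], [0.5, 1.0]]
K2 = [[1.0, 0.5], [0.5, 1.0]]
K3 = [[1.0, 0.0], [0.0, 1.0]]
K_star, weights = self_weighted_kernel([K1, K2, K3])
print(K_star, weights)
```

In this toy run the consensus kernel drifts toward the two agreeing kernels and the outlier's weight shrinks, which is the behaviour the self-weighting is meant to provide without hand-tuned weights.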

Graph-based semi-supervised learning:

Semi-supervised learning can capture the global structure of the drug-side effect relationship, but it ignores the influence of similarity relationships among drugs and among side effects on drug side effect identification. The present scheme therefore extends the kernel matrices with a nearest neighbor method. The k nearest neighbor drugs similar to drug $d_i$ are denoted $N(d_i) \subseteq D$, and the elements of the k-nearest-neighbor graph $N^d \in R^{n \times n}$ are defined as follows:

[formula missing in the source text]

$N^d$ is used to sparsify the optimal drug kernel matrix $K_d^*$. The extended drug kernel matrix obtained after sparsification is

$$\tilde{K}_d = N^d \circ K_d^*$$

where $\circ$ denotes the Hadamard product of matrices.

For the side effect information, the k nearest neighbor side effects similar to side effect $s_j$ are denoted $N(s_j) \subseteq S$, and the k-nearest-neighbor graph $N^s$ of the side effects is used to sparsify the optimal side effect kernel matrix $K_s^*$, giving the extended side effect kernel matrix

$$\tilde{K}_s = N^s \circ K_s^*$$
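The nearest neighbor sparsification can be sketched as a mask applied elementwise to a kernel matrix. The rule for the mask entries is not recoverable from the text, so the sketch below assumes a simple symmetric 0/1 rule (an entry is kept if either item is among the other's k nearest neighbors); `knn_mask` and `sparsify` are illustrative names.

```python
def knn_mask(K, k):
    """Assumed rule: entry (i, j) is 1 if j is among the k most similar items to i, or vice versa."""
    n = len(K)
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: K[i][j], reverse=True)
        neighbours.append(set(ranked[:k]) | {i})  # every item keeps itself
    return [[1.0 if (j in neighbours[i] or i in neighbours[j]) else 0.0
             for j in range(n)] for i in range(n)]

def sparsify(K, k):
    """Hadamard (elementwise) product of the k-nearest-neighbor mask and the kernel."""
    N = knn_mask(K, k)
    n = len(K)
    return [[N[i][j] * K[i][j] for j in range(n)] for i in range(n)]

K = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.2],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
K_ext = sparsify(K, k=1)
for row in K_ext:
    print(row)
```

With k = 1 the toy kernel collapses into two blocks, keeping only the strong local similarities and zeroing the weak cross-block ones.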

To find the optimal predicted drug-side effect relationship matrix $F^*$, the following objective function is minimized using the Gaussian Fields and Harmonic Functions (GFHF) method:

[objective function formula missing in the source text]

where tr(·) denotes the trace of a matrix and $\mu$ and $\sigma$ are non-negative parameters. $E_l(F^*)$ is the loss function, $E_d(F^*)$ is the graph regularization term on the drug feature space, and $E_s(F^*)$ is the graph regularization term on the side effect feature space. $F_{train}$ denotes the part of the drug-side effect relationship matrix used as training data. The objective also uses a diagonal matrix, in which:

[formula missing in the source text]

$L_d \in R^{n \times n}$ and $L_s \in R^{m \times m}$ are Laplacian matrices:

[formula missing in the source text]

$D_d$ and $D_s$ are diagonal matrices:

[formula missing in the source text]

To find $F^*$, let:

[formula missing in the source text]

The objective function may then be rewritten as:

[formula missing in the source text]

where $I_d \in R^{n \times n}$ is an identity matrix. The matrix $F^*$ is updated continuously until the predicted drug-side effect relationship matrix is obtained.
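The iterative minimization can be illustrated with a minimal sketch. Because the exact objective is not recoverable from the text, the sketch assumes a common form: a squared-error loss against F_train plus μ·tr(Fᵀ L_d F) and σ·tr(F L_s Fᵀ), with unnormalized Laplacians L = D − K, minimized by plain gradient descent. The function names, step size, and iteration count are illustrative assumptions and the solver is a stand-in for whatever update rule the method actually uses.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def laplacian(K):
    """Unnormalized graph Laplacian L = D - K, with D the diagonal row-sum matrix."""
    n = len(K)
    return [[(sum(K[i]) - K[i][j]) if i == j else -K[i][j] for j in range(n)]
            for i in range(n)]

def gfhf(F_train, K_d, K_s, mu=0.5, sigma=0.5, step=0.1, n_iter=200):
    """Gradient descent on ||F - F_train||^2 + mu*tr(F' L_d F) + sigma*tr(F L_s F')."""
    L_d, L_s = laplacian(K_d), laplacian(K_s)
    F = [[float(x) for x in row] for row in F_train]
    for _ in range(n_iter):
        LdF, FLs = matmul(L_d, F), matmul(F, L_s)
        F = [[F[i][j] - step * 2.0 * ((F[i][j] - F_train[i][j])
                                      + mu * LdF[i][j] + sigma * FLs[i][j])
              for j in range(len(F[0]))] for i in range(len(F))]
    return F

# Toy example: drugs 0 and 1 are similar, drug 2 is isolated; side effects are unrelated.
F_train = [[1, 0], [0, 0], [0, 1]]
K_d = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K_s = [[1.0, 0.0], [0.0, 1.0]]
F_star = gfhf(F_train, K_d, K_s)
for row in F_star:
    print([round(x, 4) for x in row])
```

In the toy result, drug 1 receives a positive score for the first side effect purely through its similarity to drug 0, which is the kind of propagation the graph regularization terms are meant to produce.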

The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention or to restrict the invention to the particular embodiments described; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A drug side effect identification method based on self-weighted multiple kernel learning, characterized by comprising the following steps:

step 1: data acquisition: collecting information from a database; the information collected from the database comprises information on drugs that have both target protein information and side effect information, drug-protein interaction information, target protein information, drug-side effect relationship information, and side effect information; the data are derived from the Mizutani database, which contains 658 drugs having both target protein and side effect information, 5074 drug-protein interactions, 1368 target proteins, 49051 drug-side effect relationships, and 1339 side effects;

step 2: constructing drug kernel matrices and side effect kernel matrices based on the data of step 1: constructing a data set representing the drugs, constructing a data set representing the side effects, and constructing a relationship matrix between the drugs and the side effects;

2. The drug side effect identification method based on self-weighted multiple kernel learning according to claim 1, wherein step 2 further comprises generating four drug attribute kernels from a drug-substructure relationship matrix and four drug attribute kernels from a drug-target relationship matrix, and substituting these eight drug attribute kernels, together with the four drug attribute kernels generated from the drug-side effect relationship matrix, into step 3 for calculation.

3. The drug side effect identification method based on self-weighted multiple kernel learning according to claim 1, wherein the chemical structure of a drug is encoded as a molecular fingerprint, and the molecular fingerprint consists of a plurality of chemical substructures defined in the PubChem database.

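4. The drug side effect identification method based on self-weighted multiple kernel learning according to claim 3, wherein step 2 comprises the following detailed steps:

let $D = \{d_1, d_2, \ldots, d_n\}$ denote the set of n drugs, where d denotes a drug, and let $S = \{s_1, s_2, \ldots, s_m\}$ denote the set of m side effects, where s denotes a side effect;

an n×m adjacency matrix F represents the relationship matrix between the drugs and the side effects, where $F_{i,j}$ ($1 \le i \le n$, $1 \le j \le m$) is an element of F: $F_{i,j} = 1$ when drug $d_i$ has side effect $s_j$, and $F_{i,j} = 0$ otherwise; drug $d_i$ is represented by its side effects as $F_{d_i}$, a binary vector of length m in which each element is 1 or 0;

the Gaussian interaction profile (GIP) kernel is expressed as:

$$K_{\mathrm{GIP}}(d_i, d_k) = \exp\left(-\gamma \lVert F_{d_i} - F_{d_k} \rVert^2\right)$$

where $F_{d_i}$ and $F_{d_k}$ are drug $d_i$ and drug $d_k$ represented by their side effects, respectively, and $\gamma$ is the bandwidth of the Gaussian kernel;

the correlation coefficient (Corr) kernel is expressed as:

$$K_{\mathrm{Corr}}(d_i, d_k) = \frac{\mathrm{Cov}(F_{d_i}, F_{d_k})}{\sqrt{\mathrm{Var}(F_{d_i})\,\mathrm{Var}(F_{d_k})}}$$

where $\mathrm{Cov}(F_{d_i}, F_{d_k})$ denotes the covariance of $F_{d_i}$ and $F_{d_k}$, $\mathrm{Var}(F_{d_i})$ denotes the variance of $F_{d_i}$, and $\mathrm{Var}(F_{d_k})$ denotes the variance of $F_{d_k}$;

the cosine similarity (COS) kernel is expressed as:

$$K_{\mathrm{COS}}(d_i, d_k) = \frac{F_{d_i} \cdot F_{d_k}}{\lVert F_{d_i} \rVert \, \lVert F_{d_k} \rVert}$$

the mutual information (MI) kernel is expressed as:

$$K_{\mathrm{MI}}(d_i, d_k) = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} f(u, v) \log \frac{f(u, v)}{f(u)\, f(v)}$$

where $f(u)$ is the frequency of u in $F_{d_i}$, $f(v)$ is the frequency of v in $F_{d_k}$, and $f(u, v)$ is the relative observed frequency of the pair (u, v).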

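5. The drug side effect identification method based on self-weighted multiple kernel learning according to claim 3, wherein step 3 comprises the following detailed steps:

$K_d^i$ ($i = 1, \ldots, C_d$) denotes a kernel matrix of the drug attribute space, and the objective function of the self-weighted multiple kernel learning is as follows:

$$\min_{K_d^*} \sum_{i=1}^{C_d} \omega_i \lVert K_d^i - K_d^* \rVert_F^2$$

where $\omega_i$ denotes the weight of $K_d^i$:

$$\omega_i = \frac{1}{2 \lVert K_d^i - K_d^* \rVert_F}$$

given the initial value $\omega_i = 1/C_d$; after the initial value of $\omega_i$ is obtained, $K_d^*$ is calculated; $\omega_i$ changes dynamically with $K_d^*$ and is updated continuously until the optimal drug kernel $K_d^*$ is obtained; the optimal side effect kernel $K_s^*$ is obtained by the same learning method.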

6. The drug side effect identification method based on self-weighted multiple kernel learning according to claim 3, wherein the nearest neighbor method is specifically as follows: the k nearest neighbor drugs similar to drug $d_i$ are denoted $N(d_i) \subseteq D$, and the elements of the k-nearest-neighbor graph $N^d \in R^{n \times n}$ are defined as follows:

[formula missing in the source text]

$N^d$ is used to sparsify the optimal drug kernel matrix $K_d^*$; the extended drug kernel matrix obtained after sparsification is

$$\tilde{K}_d = N^d \circ K_d^*$$

where $\circ$ denotes the Hadamard product of matrices; the extended side effect kernel matrix is obtained in the same way:

$$\tilde{K}_s = N^s \circ K_s^*$$

7. The drug side effect identification method based on self-weighted multiple kernel learning according to claim 3, characterized in that the following objective function is minimized using the Gaussian Fields and Harmonic Functions (GFHF) method: