CN111477344A

CN111477344A - Drug side effect identification method based on self-weighted multi-core learning

Info

Publication number: CN111477344A
Application number: CN202010280936.XA
Authority: CN
Inventors: 刘勇国; 李杨; 杨尚明; 李巧勤
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-04-10
Filing date: 2020-04-10
Publication date: 2020-07-31
Anticipated expiration: 2040-04-10
Also published as: CN111477344B

Abstract

The invention discloses a drug side effect identification method based on self-weighted multi-kernel learning. It addresses two problems of existing multi-kernel-learning-based identification methods: incomplete representation of drug features and unreasonable weight assignment when weighting kernel functions. The method comprises data acquisition, construction of drug kernel matrices and side effect kernel matrices, and further steps. It describes drug features from multiple perspectives and constructs kernel matrices for the drug and side effect features using four methods, which reduces the influence of missing features on the prediction result. Optimal kernel matrices for drugs and side effects are constructed with a self-weighting method, whose computed weights adapt better to the different kernel matrices. Expanding the kernel matrices with a nearest neighbor method captures the local structure of the drug-side effect relationship.

Description

Drug side effect identification method based on self-weighted multi-kernel learning

Technical Field

The invention relates to the field of multi-kernel learning, and in particular to a drug side effect identification method based on self-weighted multi-kernel learning.

Background

In recent years, drug safety problems caused by side effects have received much attention. Side effects are an important cause of failure in clinical drug trials and a major public health problem. Research on drug side effects follows several main lines: computing the similarity between drugs and predicting drug targets from drug-side effect relationships; drug repositioning based on the similarity of side effects between drugs; predicting the side effects a drug may cause from information such as its chemical structure; and predicting side effects using disease networks. Identifying drug side effects plays an important role in drug research, and timely, accurate prediction of side effects has become an active research topic worldwide.

The conventional way to predict and evaluate potential drug side effects is to run clinical trials on patients before a drug is marketed and to observe the adverse reactions that occur after the patients take the drug.

In recent years, the accumulation of large amounts of drug side effect data, for example in the SIDER database, has given researchers a data source for exploring drug side effects at the molecular level. Advances in computing techniques such as complex networks and data mining offer new approaches to side effect identification, and a growing number of studies use computational methods to mine potential drug-side effect associations from massive biological data.

Multi-kernel learning combines several kernels. It is suited to sample data sets whose features are irregular and heterogeneous, and it offers greater flexibility.

Although side effect identification based on multi-kernel learning has made some progress, the following problems remain:

Existing methods do not consider data sources such as drug-target associations, so the features considered for drugs and side effects are incomplete. Drugs and side effects cannot be represented accurately, which reduces prediction accuracy.

Most existing multi-kernel learning methods assume that the optimal kernel is a linear combination of base kernels. This assumption may not hold, which leads to inappropriate weight assignment and reduces the accuracy of the prediction result.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: existing multi-kernel-learning-based drug side effect identification methods suffer from incomplete representation of drug features and from unreasonable weight assignment when weighting kernel functions. The invention provides a drug side effect identification method based on self-weighted multi-kernel learning that solves these problems.

In the method provided by the invention, drug features are described more comprehensively from three aspects: side effects, targets and substructures. To capture the influence that the side effect relationships of similar drugs have on side effect identification, that is, the local structure of the drug-side effect relationship, a nearest neighbor method is used to expand the kernel matrices, which improves the accuracy of the prediction result.

The invention is realized by the following technical scheme:

A drug side effect identification method based on self-weighted multi-kernel learning comprises the following steps:

Step 1: data acquisition: collecting information from a database;

Step 2: constructing drug kernel matrices and side effect kernel matrices: constructing a data set representing the drugs, constructing a data set representing the side effects, and constructing a relationship matrix between the drugs and the side effects;

computing four kinds of similarity from the relationship matrix, namely a Gaussian interaction profile kernel (GIP), a correlation coefficient kernel (Corr), a cosine similarity kernel (COS) and a mutual information kernel (MI), and generating the kernel matrices of the drug attribute space and the kernel matrices of the side effect attribute space from the four computed similarities;

Step 3: establishing a self-weighted multi-kernel learning objective function from the kernel matrices of the drug attribute space and of the side effect attribute space obtained in step 2, and updating iteratively to obtain an optimal drug kernel matrix and an optimal side effect kernel matrix; expanding the kernel matrix of the drug attribute space and the kernel matrix of the side effect attribute space using a nearest neighbor method; minimizing the objective function using the Gaussian fields and harmonic functions method with continued iterative updating; and finally obtaining the predicted drug-side effect relationship matrix.

Further, step 2 also includes generating four drug attribute kernels from the drug-substructure relationship matrix and four drug attribute kernels from the drug-target relationship matrix, and substituting these eight drug attribute kernels, together with the four drug attribute kernels generated from the drug-side effect relationship matrix, into step 3 for calculation.

Further, in step 1, the information collected from the database includes information on drugs that have both target protein and side effect information, drug-protein interaction information, target protein information, drug-side effect relationship information, and side effect information.

Further, the chemical structure coding of the drug employs a molecular fingerprint consisting of a plurality of chemical substructures defined in the PubChem database.

Further, the step 2 includes the following detailed steps:

Let $D = \{d_1, d_2, \ldots, d_n\}$ denote the set of $n$ drugs, where $d$ denotes a drug, and let $S = \{s_1, s_2, \ldots, s_m\}$ denote the set of $m$ side effects, where $s$ denotes a side effect;

$F \in \mathbb{R}^{n \times m}$ denotes the relationship matrix between drugs and side effects, and $F_{i,j}$ ($1 \le i \le n$, $1 \le j \le m$) is an element of the adjacency matrix $F$: $F_{i,j} = 1$ when drug $d_i$ has side effect $s_j$, and $F_{i,j} = 0$ otherwise. The side effect profile of drug $d_i$ is expressed as

$$F_{d_i} = (F_{i,1}, F_{i,2}, \ldots, F_{i,m}),$$

which is a binary vector of length $m$ in which each element is 1 or 0;

the Gaussian interaction profile kernel (GIP) is expressed as:

$$K_{GIP}(d_i, d_k) = \exp\left(-\gamma \lVert F_{d_i} - F_{d_k} \rVert^2\right)$$

where $F_{d_i}$ and $F_{d_k}$ are the binary side effect vectors of drug $d_i$ and drug $d_k$, respectively, and $\gamma$ is the bandwidth of the Gaussian kernel;

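the correlation coefficient kernel (Corr) is expressed as:

$$K_{Corr}(d_i, d_k) = \frac{\mathrm{Cov}(F_{d_i}, F_{d_k})}{\sqrt{\mathrm{Var}(F_{d_i})\,\mathrm{Var}(F_{d_k})}}$$

where $\mathrm{Cov}(F_{d_i}, F_{d_k})$ denotes the covariance of $F_{d_i}$ and $F_{d_k}$, $\mathrm{Var}(F_{d_i})$ denotes the variance of $F_{d_i}$, and $\mathrm{Var}(F_{d_k})$ denotes the variance of $F_{d_k}$;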

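the cosine similarity kernel (COS) is expressed as:

$$K_{COS}(d_i, d_k) = \frac{F_{d_i} \cdot F_{d_k}}{\lVert F_{d_i} \rVert \, \lVert F_{d_k} \rVert}$$

the mutual information kernel (MI) is expressed as:

$$K_{MI}(d_i, d_k) = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} f(u, v) \log \frac{f(u, v)}{f(u)\, f(v)}$$

where $u \in \{0, 1\}$ and $v \in \{0, 1\}$ are the values a drug variable takes on the side effect space, 0 meaning that the drug does not have the side effect and 1 meaning that it does; $f(u)$ denotes the observed frequency of $u$ in $F_{d_i}$, $f(v)$ denotes the observed frequency of $v$ in $F_{d_k}$, and $f(u, v)$ denotes the relative observed frequency of the pair $(u, v)$.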

Further, step 3 includes the following detailed steps:

the drug attribute space is represented by $C_d$ kernel matrices and the side effect attribute space by $C_s$ kernel matrices, where $C_d$ is the number of kernels of the drug space and $C_s$ is the number of kernels of the side effect space; the objective function of self-weighted multi-kernel learning is as follows:

[equation not reproduced in the source text]

where $\omega_i$ denotes the weight of the $i$-th drug kernel matrix. Given an initial value of $\omega_i$ defined in terms of $C_d$, the number of drug kernels, the optimal drug kernel is calculated; $\omega_i$ changes dynamically with the optimal drug kernel and is updated continuously until the final optimal drug kernel is obtained.

The optimal side effect kernel is obtained by the same learning method.

Further, the nearest neighbor method is specifically as follows: the $k$ nearest neighbor drugs of drug $d_i$ are denoted $N(d_i) \subseteq D$, and the elements of the $k$-nearest-neighbor graph $N^d \in \mathbb{R}^{n \times n}$ are set as:

[equation not reproduced in the source text]

$N^d$ is used to sparsify the drug kernel matrix: the extended drug kernel matrix is obtained as the Hadamard product of $N^d$ and the drug kernel matrix. For side effect information, the $k$ nearest neighbor side effects of side effect $s_j$ are denoted $N(s_j) \subseteq S$, and the $k$-nearest-neighbor graph of side effects $N^s$ is used in the same way to sparsify the side effect kernel matrix, giving the extended side effect kernel matrix.

Further, the following objective function is minimized using the Gaussian fields and harmonic functions (GFHF) method:

[equation not reproduced in the source text]

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $\mu$ and $\sigma$ are non-negative parameters, $E_l(F^*)$ is the loss function, $E_d(F^*)$ is the graph regularization term on the drug feature space, $E_s(F^*)$ is the graph regularization term on the side effect feature space, and $F_{train}$ is the part of the drug-side effect relationship matrix used as training data. The objective also involves a diagonal matrix (definition not reproduced), the Laplacian matrices $L_d \in \mathbb{R}^{n \times n}$ and $L_s \in \mathbb{R}^{m \times m}$, and the diagonal matrices $D_d$ and $D_s$.

To solve for $F^*$, the objective function is rewritten (rewritten form not reproduced), where $I_d \in \mathbb{R}^{n \times n}$ is an identity matrix. The matrix $F^*$ is updated iteratively until the predicted drug-side effect relationship matrix is obtained.

The invention has the following advantages and beneficial effects:

The method describes drug features from multiple perspectives and constructs kernel matrices for the drug and side effect features using four methods, which reduces the influence of missing features on the prediction result. Optimal kernel matrices for drugs and side effects are constructed with a self-weighting method, whose computed weights adapt better to the different kernel matrices. Expanding the kernel matrices with a nearest neighbor method captures the local structure of the drug-side effect relationship. On this basis, the invention identifies drug side effects more accurately.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a schematic process flow diagram of the present invention.

FIG. 2 is a diagram of the multi-kernel learning model according to the present invention.

Detailed Description

Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangements of components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any inventive changes, are within the scope of the present invention.

A drug side effect identification method based on self-weighted multi-kernel learning, as shown in FIG. 1, comprises the following steps:

step 1: data acquisition: collecting information from a database;

Example 1:

data acquisition:

The data used by the technical scheme of the invention come from the Mizutani database, which contains 658 drugs that have both target protein and side effect information, 5074 drug-protein interactions, 1368 target proteins, 49051 drug-side effect relationships, and 1339 side effects. To encode the chemical structure of a drug, a molecular fingerprint consisting of 881 chemical substructures defined in the PubChem database is used.

Construction of the drug kernels and side effect kernels:

Let $D = \{d_1, d_2, \ldots, d_n\}$ denote the set of $n$ drugs and $S = \{s_1, s_2, \ldots, s_m\}$ the set of $m$ side effects. $F \in \mathbb{R}^{n \times m}$ denotes the relationship matrix between drugs and side effects, and $F_{i,j}$ ($1 \le i \le n$, $1 \le j \le m$) is an element of the adjacency matrix $F$: $F_{i,j} = 1$ when drug $d_i$ has side effect $s_j$, and $F_{i,j} = 0$ otherwise. The side effect profile of drug $d_i$ is expressed as

$$F_{d_i} = (F_{i,1}, F_{i,2}, \ldots, F_{i,m}),$$

which is a binary vector of length $m$, each element having the value 1 or 0.
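As an illustration of the construction above, the relationship matrix can be assembled from a list of known drug-side effect pairs. This is a minimal sketch: the function name `build_relation_matrix`, the zero-based indices and the toy pairs are assumptions made for the example and are not part of the described method.

```python
import numpy as np

def build_relation_matrix(n_drugs, n_side_effects, pairs):
    """Return the n x m binary matrix F with F[i, j] = 1 iff drug i has side effect j."""
    F = np.zeros((n_drugs, n_side_effects), dtype=int)
    for i, j in pairs:
        F[i, j] = 1
    return F

# Hypothetical toy data: 3 drugs, 4 side effects, 5 known relationships.
pairs = [(0, 0), (0, 2), (1, 2), (2, 1), (2, 3)]
F = build_relation_matrix(3, 4, pairs)

# Row i of F is the binary side effect vector of drug i (length m).
profile_d0 = F[0]
```

Each row of `F` is a drug vector on the side effect space; each column is the corresponding vector for a side effect.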

The kernel matrices are constructed using four similarity measures: a Gaussian interaction profile kernel (GIP), a correlation coefficient kernel (Corr), a cosine similarity kernel (COS), and a mutual information kernel (MI).

The Gaussian interaction profile kernel is constructed from the topology of the known drug-side effect network. It realizes a nonlinear mapping: drugs are mapped to a nonlinear representation in which the drug vectors are highly distinguishable. It is expressed as:

$$K_{GIP}(d_i, d_k) = \exp\left(-\gamma \lVert F_{d_i} - F_{d_k} \rVert^2\right)$$

where $F_{d_i}$ and $F_{d_k}$ are the binary side effect vectors of drug $d_i$ and drug $d_k$, respectively, and $\gamma$ is the bandwidth of the Gaussian kernel.
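The Gaussian kernel described above can be sketched as follows. The function name `gip_kernel` and the value of `gamma` are assumptions for illustration; the text only states that the bandwidth is a parameter and does not say how it is chosen.

```python
import numpy as np

def gip_kernel(F, gamma=1.0):
    """K[i, k] = exp(-gamma * ||F[i] - F[k]||^2) for every pair of rows of F."""
    F = np.asarray(F, dtype=float)
    sq_norms = (F ** 2).sum(axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * F @ F.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dist)

# Hypothetical toy relationship matrix: 3 drugs x 4 side effects.
F_toy = np.array([[1, 0, 1, 0],
                  [0, 0, 1, 0],
                  [0, 1, 0, 1]])
K_gip = gip_kernel(F_toy, gamma=0.5)
```

Passing the transpose of the relationship matrix gives the corresponding kernel between side effects.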

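The correlation coefficient kernel measures the linear relationship between drug vectors and is expressed as:

$$K_{Corr}(d_i, d_k) = \frac{\mathrm{Cov}(F_{d_i}, F_{d_k})}{\sqrt{\mathrm{Var}(F_{d_i})\,\mathrm{Var}(F_{d_k})}}$$

where $\mathrm{Cov}(F_{d_i}, F_{d_k})$ denotes the covariance of $F_{d_i}$ and $F_{d_k}$, $\mathrm{Var}(F_{d_i})$ denotes the variance of $F_{d_i}$, and $\mathrm{Var}(F_{d_k})$ denotes the variance of $F_{d_k}$.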
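A small sketch of the correlation coefficient kernel follows. The name `corr_kernel` is hypothetical, and the handling of constant vectors (zero variance, for which the ratio is undefined) is an assumption of this example, since the text does not address that case.

```python
import numpy as np

def corr_kernel(F):
    """Pearson correlation between every pair of rows of F."""
    F = np.asarray(F, dtype=float)
    centered = F - F.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / F.shape[1]
    std = np.sqrt(np.diag(cov))
    denom = np.outer(std, std)
    # Assumption: a zero-variance vector gets correlation 0 with every vector.
    return np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)

# Hypothetical toy relationship matrix: 3 drugs x 4 side effects.
F_toy = np.array([[1, 0, 1, 0],
                  [0, 0, 1, 0],
                  [0, 1, 0, 1]])
K_corr = corr_kernel(F_toy)
```

For inputs without constant rows the result agrees with `numpy.corrcoef`, which could be used directly instead.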

The cosine similarity kernel treats each drug as a vector in the $m$-dimensional side effect space and evaluates the similarity of two drugs as the cosine of the angle between their vectors. It therefore measures the difference in direction of two drug variables on the side effect space: the more closely the directions agree, the higher the similarity. It is expressed as:

$$K_{COS}(d_i, d_k) = \frac{F_{d_i} \cdot F_{d_k}}{\lVert F_{d_i} \rVert \, \lVert F_{d_k} \rVert}$$
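The cosine similarity between drug vectors can be computed for all pairs at once. In this sketch the name `cosine_kernel` is hypothetical, and assigning similarity 0 to an all-zero vector is an assumption made so that the division is always defined.

```python
import numpy as np

def cosine_kernel(F):
    """Cosine of the angle between every pair of rows of F."""
    F = np.asarray(F, dtype=float)
    norms = np.linalg.norm(F, axis=1)
    denom = np.outer(norms, norms)
    dots = F @ F.T
    # Assumption: an all-zero vector has similarity 0 with every vector.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Hypothetical toy relationship matrix: 3 drugs x 4 side effects.
F_toy = np.array([[1, 0, 1, 0],
                  [0, 0, 1, 0],
                  [0, 1, 0, 1]])
K_cos = cosine_kernel(F_toy)
```

Drugs 1 and 3 in the toy matrix share no side effect, so their vectors are orthogonal and the similarity is 0.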

The mutual information kernel measures the degree of interdependence between two discrete random variables, here the interdependence between the observed frequencies of two drugs, and is expressed as:

$$K_{MI}(d_i, d_k) = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} f(u, v) \log \frac{f(u, v)}{f(u)\, f(v)}$$

where $u \in \{0, 1\}$ and $v \in \{0, 1\}$ are the values a drug variable takes on the side effect space, 0 meaning that the drug does not have the side effect and 1 meaning that it does. $f(u)$ denotes the observed frequency of $u$ in $F_{d_i}$; for example, when $u = 1$, $f(u)$ is the frequency of 1 in the drug vector $F_{d_i}$. $f(v)$ denotes the observed frequency of $v$ in $F_{d_k}$, and $f(u, v)$ denotes the relative observed frequency of the pair $(u, v)$.
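The mutual information between two binary drug vectors can be estimated from observed frequencies as sketched below. The function names, the natural logarithm and the convention that terms with zero joint frequency contribute nothing are assumptions of this example.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information of two binary vectors, estimated from observed frequencies."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            f_uv = np.mean((x == u) & (y == v))  # joint frequency of (u, v)
            f_u = np.mean(x == u)                # frequency of u in x
            f_v = np.mean(y == v)                # frequency of v in y
            if f_uv > 0:                         # convention: 0 * log(0) = 0
                mi += f_uv * np.log(f_uv / (f_u * f_v))
    return mi

def mi_kernel(F):
    """Pairwise mutual information between the rows of a binary matrix F."""
    F = np.asarray(F)
    n = F.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for k in range(i, n):
            K[i, k] = K[k, i] = mutual_information(F[i], F[k])
    return K

# Hypothetical toy vectors: rows 0 and 1 are complementary, row 2 is independent of row 0.
F_mi = np.array([[1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 1, 0, 0]])
K_mi = mi_kernel(F_mi)
```

Complementary vectors are fully dependent, so their mutual information is as high as that of a vector with itself, whereas a linear correlation would be -1.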

The above uses side effects to represent the attribute kernels of a drug. Similarly, the attribute kernels of a drug represented by its substructures are $K_{GIP\text{-}chem,d}$, $K_{Corr\text{-}chem,d}$, $K_{Cos\text{-}chem,d}$ and $K_{MI\text{-}chem,d}$; the attribute kernels of a drug represented by its targets are $K_{GIP\text{-}target,d}$, $K_{Corr\text{-}target,d}$, $K_{Cos\text{-}target,d}$ and $K_{MI\text{-}target,d}$; and the attribute kernels of side effects represented by drugs are $K_{GIP\text{-}link,s}$, $K_{Corr\text{-}link,s}$, $K_{Cos\text{-}link,s}$ and $K_{MI\text{-}link,s}$.

Multi-kernel learning generates the optimal kernel:

As shown in FIG. 2, the drug attribute space and the side effect attribute space are each represented by a set of kernels. $C_d$ is the number of kernels of the drug space, with $C_d = 12$ in this scheme; $C_s$ is the number of kernels of the side effect space, with $C_s = 4$ in this scheme.

Taking generation of the optimal drug kernel as an example, an attribute kernel that is closer to the final optimal drug kernel receives a higher weight. To bring the optimal kernel close to each attribute kernel of the drug or side effect, the objective function of self-weighted multi-kernel learning is as follows:

[equation not reproduced in the source text]

where $\omega_i$ denotes the weight of the $i$-th drug attribute kernel. Because $\omega_i$ depends on the target variable, the optimal drug kernel, which cannot be determined directly at the start of the algorithm, $\omega_i$ cannot be calculated at first. An initial value of $\omega_i$ is therefore given in terms of $C_d$, the number of drug kernels. With this initial value the optimal drug kernel is calculated; $\omega_i$ then changes dynamically with the optimal drug kernel and is updated continuously until the final optimal drug kernel is obtained.

The optimal side effect kernel is obtained by the same method.
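The alternating update described above can be illustrated with a short sketch. The exact objective and weight formula are not reproduced in the text, so everything specific here is an assumption: the objective is taken to be the weighted sum of squared Frobenius distances between the optimal kernel and each attribute kernel, the weight of a kernel is taken to be `1 / (2 * distance)`, and the initial weights are taken to be uniform (`1 / C`). The function name `self_weighted_kernel` is hypothetical.

```python
import numpy as np

def self_weighted_kernel(kernels, n_iter=20, eps=1e-12):
    """Alternate between a weighted-average kernel and distance-based weights.

    Assumed (not confirmed by the source) formulation:
      objective: sum_i w_i * ||K_i - K_opt||_F^2
      weights:   w_i = 1 / (2 * ||K_i - K_opt||_F)
    """
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    c = len(kernels)
    weights = np.full(c, 1.0 / c)  # assumed uniform initial weights
    for _ in range(n_iter):
        # For fixed weights, the weighted average minimizes the assumed objective.
        K_opt = sum(w * K for w, K in zip(weights, kernels)) / weights.sum()
        # Kernels closer to K_opt receive larger weights.
        dists = np.array([np.linalg.norm(K - K_opt, "fro") for K in kernels])
        weights = 1.0 / (2.0 * np.maximum(dists, eps))
    return K_opt, weights

# Hypothetical toy kernels: two agree with each other, one is an outlier.
K1 = np.eye(2)
K2 = np.eye(2)
K3 = np.ones((2, 2))
K_opt, w = self_weighted_kernel([K1, K2, K3])
```

In this toy run the outlier kernel ends up with the smallest weight, which matches the stated idea that kernels closer to the final optimal kernel are weighted more heavily.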

Semi-supervised learning based on graphs:

Semi-supervised learning can capture the global structure of the drug-side effect relationships, but it neglects the influence that the side effect relationships of similar drugs have on side effect identification. The scheme therefore expands the kernel matrices using a nearest neighbor method. The $k$ nearest neighbor drugs of drug $d_i$ are denoted $N(d_i) \subseteq D$, and the elements of the $k$-nearest-neighbor graph $N^d \in \mathbb{R}^{n \times n}$ are set as:

[equation not reproduced in the source text]

$N^d$ is used to sparsify the drug kernel matrix: the extended drug kernel matrix is obtained as the Hadamard product of $N^d$ and the drug kernel matrix.

For side effect information, the $k$ nearest neighbor side effects of side effect $s_j$ are denoted $N(s_j) \subseteq S$, and the $k$-nearest-neighbor graph of side effects $N^s$ is used in the same way to sparsify the side effect kernel matrix, giving the extended side effect kernel matrix.
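The sparsification step can be sketched as follows. The rule for the entries of the neighbor graph is not reproduced in the text, so this example assumes one common convention: an entry is 1 when two items are each among the other's k nearest neighbors, 0.5 when only one of them is, and 0 otherwise. The function names are hypothetical.

```python
import numpy as np

def knn_graph(K, k):
    """Symmetric k-nearest-neighbor graph built from a similarity (kernel) matrix K.

    Assumed convention: 1 for mutual neighbors, 0.5 for one-sided, 0 otherwise.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    sim = K.copy()
    np.fill_diagonal(sim, -np.inf)             # an item is not its own neighbor
    nearest = np.argsort(-sim, axis=1)[:, :k]  # k most similar items per row
    A = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    A[rows, nearest.ravel()] = 1.0
    return (A + A.T) / 2.0

def sparsify(K, k):
    """Hadamard (element-wise) product of the neighbor graph and the kernel."""
    return knn_graph(K, k) * np.asarray(K, dtype=float)

# Hypothetical toy kernel over 4 drugs.
K_toy = np.array([[1.0, 0.9, 0.1, 0.2],
                  [0.9, 1.0, 0.3, 0.1],
                  [0.1, 0.3, 1.0, 0.8],
                  [0.2, 0.1, 0.8, 1.0]])
K_sparse = sparsify(K_toy, k=1)
```

Only similarities between close neighbors survive, so the sparsified kernel reflects the local structure around each drug or side effect.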

To find the optimal predicted drug-side effect relationship matrix $F^*$, the following objective function is minimized using the Gaussian fields and harmonic functions (GFHF) method:

[equation not reproduced in the source text]

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, and $\mu$ and $\sigma$ are non-negative parameters. $E_l(F^*)$ is the loss function, $E_d(F^*)$ is the graph regularization term on the drug feature space, and $E_s(F^*)$ is the graph regularization term on the side effect feature space. $F_{train}$ is the part of the drug-side effect relationship matrix used as training data. The objective also involves a diagonal matrix (definition not reproduced), the Laplacian matrices $L_d \in \mathbb{R}^{n \times n}$ and $L_s \in \mathbb{R}^{m \times m}$, and the diagonal matrices $D_d$ and $D_s$.

To solve for $F^*$, the objective function is rewritten (rewritten form not reproduced), where $I_d \in \mathbb{R}^{n \times n}$ is an identity matrix. The matrix $F^*$ is updated iteratively until the predicted drug-side effect relationship matrix is obtained.
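To make the role of the two graph regularization terms concrete, here is a sketch of a graph-regularized least-squares problem of the kind described. It is not the objective of the source, whose equations are not reproduced. The assumptions are: an unweighted squared loss `||F - F_train||_F^2`, unnormalized Laplacians `L = D - K`, symmetric non-negative kernels, and a closed-form solution by eigendecomposition in place of the iterative update mentioned in the text. All function names are hypothetical.

```python
import numpy as np

def laplacian(K):
    """Unnormalized graph Laplacian L = D - K, with D the diagonal degree matrix."""
    K = np.asarray(K, dtype=float)
    return np.diag(K.sum(axis=1)) - K

def predict_relations(F_train, K_d, K_s, mu=1.0, sigma=1.0):
    """Minimize ||F - F_train||_F^2 + mu*tr(F^T L_d F) + sigma*tr(F L_s F^T).

    Setting the gradient to zero gives the Sylvester equation
        (I + mu*L_d) F + F (sigma*L_s) = F_train,
    solved here through the eigendecompositions of the two symmetric matrices.
    """
    F_train = np.asarray(F_train, dtype=float)
    n = F_train.shape[0]
    A = np.eye(n) + mu * laplacian(K_d)  # n x n, symmetric positive definite
    B = sigma * laplacian(K_s)           # m x m, symmetric positive semidefinite
    lam, P = np.linalg.eigh(A)
    gam, Q = np.linalg.eigh(B)
    Y = P.T @ F_train @ Q
    X = Y / (lam[:, None] + gam[None, :])
    return P @ X @ Q.T

# Hypothetical toy data: 3 drugs, 2 side effects.
K_d = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
K_s = np.array([[1.0, 0.5],
                [0.5, 1.0]])
F_train = np.array([[1, 0],
                    [1, 0],
                    [0, 1]])
F_hat = predict_relations(F_train, K_d, K_s, mu=0.5, sigma=0.5)
```

The entries of `F_hat` are real-valued scores; larger values indicate drug-side effect pairs that are more strongly supported by similar drugs and similar side effects.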

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A drug side effect identification method based on self-weighted multi-kernel learning, characterized by comprising the following steps:

step 1: data acquisition: collecting information from a database;

2. The method for identifying drug side effects based on self-weighted multi-kernel learning according to claim 1, wherein step 2 further comprises generating four drug attribute kernels from the drug-substructure relationship matrix and four drug attribute kernels from the drug-target relationship matrix, and substituting these eight drug attribute kernels, together with the four drug attribute kernels generated from the drug-side effect relationship matrix, into step 3 for calculation.

3. The method for identifying drug side effects based on self-weighted multi-kernel learning according to claim 1, wherein in step 1 the information collected from the database includes information on drugs that have both target protein and side effect information, drug-protein interaction information, target protein information, drug-side effect relationship information, and side effect information.

4. The method for identifying drug side effects based on self-weighted multi-kernel learning according to claim 3, wherein the chemical structure of a drug is encoded as a molecular fingerprint consisting of a plurality of chemical substructures defined in the PubChem database.

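5. The method for identifying drug side effects based on self-weighted multi-kernel learning according to claim 4, wherein step 2 comprises the following detailed steps:

the Gaussian interaction profile kernel (GIP) is expressed as:

$$K_{GIP}(d_i, d_k) = \exp\left(-\gamma \lVert F_{d_i} - F_{d_k} \rVert^2\right)$$

where $F_{d_i}$ and $F_{d_k}$ are the binary side effect vectors of drug $d_i$ and drug $d_k$, and $\gamma$ is the bandwidth of the Gaussian kernel;

the correlation coefficient kernel (Corr) is expressed as:

$$K_{Corr}(d_i, d_k) = \frac{\mathrm{Cov}(F_{d_i}, F_{d_k})}{\sqrt{\mathrm{Var}(F_{d_i})\,\mathrm{Var}(F_{d_k})}}$$

where $\mathrm{Cov}(F_{d_i}, F_{d_k})$ denotes the covariance of $F_{d_i}$ and $F_{d_k}$, $\mathrm{Var}(F_{d_i})$ denotes the variance of $F_{d_i}$, and $\mathrm{Var}(F_{d_k})$ denotes the variance of $F_{d_k}$;

the cosine similarity kernel (COS) is expressed as:

$$K_{COS}(d_i, d_k) = \frac{F_{d_i} \cdot F_{d_k}}{\lVert F_{d_i} \rVert \, \lVert F_{d_k} \rVert}$$

the mutual information kernel (MI) is expressed as:

$$K_{MI}(d_i, d_k) = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} f(u, v) \log \frac{f(u, v)}{f(u)\, f(v)}$$

where $u \in \{0, 1\}$ and $v \in \{0, 1\}$ are the values a drug variable takes on the side effect space, 0 meaning that the drug does not have the side effect and 1 meaning that it does; $f(u)$ denotes the observed frequency of $u$ in $F_{d_i}$, $f(v)$ denotes the observed frequency of $v$ in $F_{d_k}$, and $f(u, v)$ denotes the relative observed frequency of the pair $(u, v)$.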

6. The method for identifying drug side effects based on self-weighted multi-kernel learning according to claim 4, wherein step 3 comprises the following detailed steps:

the drug attribute space is represented by kernel matrices, and $\omega_i$ denotes the weight of the $i$-th drug kernel matrix;

given an initial value of $\omega_i$, the optimal drug kernel is calculated, and $\omega_i$ changes dynamically with the optimal drug kernel;

the optimal side effect kernel is obtained by the same learning method.

7. The method for identifying drug side effects based on self-weighted multi-kernel learning according to claim 4, wherein the nearest neighbor method is specifically as follows: the $k$ nearest neighbor drugs of drug $d_i$ are denoted $N(d_i) \subseteq D$, and the elements of the $k$-nearest-neighbor graph $N^d \in \mathbb{R}^{n \times n}$ are set as:

[equation not reproduced in the source text]

$N^d$ is used to sparsify the drug kernel matrix: the extended drug kernel matrix is obtained as the Hadamard product of $N^d$ and the drug kernel matrix; for side effect information, the $k$ nearest neighbor side effects of side effect $s_j$ are denoted $N(s_j) \subseteq S$, and the $k$-nearest-neighbor graph of side effects $N^s$ is used in the same way to sparsify the side effect kernel matrix, giving the extended side effect kernel matrix.

8. The method for identifying drug side effects based on self-weighted multi-kernel learning according to claim 4, wherein the following objective function is minimized using the Gaussian fields and harmonic functions (GFHF) method: