CN110176279B - Lead compound virtual screening method and device based on small sample

Info

Publication number: CN110176279B
Application number: CN201910470488.7A
Authority: CN
Inventors: 黄婉晴; 吴建盛; 胡海峰
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2019-05-31
Filing date: 2019-05-31
Publication date: 2022-08-26
Anticipated expiration: 2039-05-31
Also published as: CN110176279A

Abstract

The invention discloses a lead compound virtual screening method and device based on a small sample. The method comprises the following steps: constructing a virtual screening model by using ligand sample information of a drug target and a homologous drug target thereof and adopting sparse low-rank multi-task learning; learning common characteristics of the homologous drug target and ligand molecule actions by adopting a low-rank regularization term; learning the unique characteristics of the action of the new drug target and the ligand molecules by using a sparse regularization term; the constructed model is used for predicting the activity of the lead compound under a small sample and evaluating the performance of the lead compound. The invention also provides a device for realizing the method. The method and the device can realize the purposes of improving the screening efficiency and saving time and huge capital expenditure. Meanwhile, when the method is applied to the field of drug screening, the requirement of predicting the biological activity of the drug target ligand can be met.

Description

Lead compound virtual screening method and device based on small sample

Technical Field

The invention relates to computer-aided drug design, in particular to a lead compound virtual screening method and a device based on a small sample.

Background

A lead compound is a compound, obtained by any of various routes, that has a certain biological activity and chemical structure and serves as the basis for further structural modification and optimization; it is the starting point of modern new drug research. In the new drug research process, a biologically active lead compound obtained by compound activity screening is the basis of innovative drug research.

Traditional drug screening consumes a large amount of manpower and material resources and suffers from drawbacks such as long experimental cycles. With the rapid development of computers in the 21st century, virtual drug screening technology has been widely applied to drug development, and it plays a key role especially in the discovery of new drug targets and of lead compound structures for rare diseases.

Virtual screening, also called computational screening, simulates the interaction between a target and a candidate drug on a computer using molecular docking software before biological activity screening is carried out, and calculates the activity value between the target and the candidate drug, so as to reduce the number of compounds that actually need to be screened and improve the efficiency of lead compound discovery. Virtual screening can be classified into two categories: receptor-based virtual screening and ligand-based virtual screening.

Currently, drug development for known drug targets has approached saturation. Development of new drugs aiming at new drug targets or rare diseases has become a research hotspot in recent years, but the sample information is insufficient, so that a good ligand virtual screening model is difficult to obtain. Therefore, one technical problem that needs to be urgently solved by those skilled in the art is: how to effectively provide a method for virtually screening a lead compound based on a small sample.

Disclosure of Invention

The purpose of the invention is as follows: in order to solve the problem of virtual screening of small molecule drugs of a new target under a small sample, one of the purposes of the invention is to provide an effective virtual screening method of a lead compound based on the small sample, and the other purpose of the invention is to provide a corresponding virtual screening device of the lead compound based on the small sample according to the method.

The technical scheme is as follows: the invention relates to a lead compound virtual screening method based on a small sample, which comprises the following steps:

(1) constructing a virtual screening model by using ligand sample information of a new drug target and a homologous drug target and adopting sparse low-rank multi-task learning;

(2) learning common characteristics of the homologous drug target and ligand molecule actions by adopting a low-rank regularization term;

(3) learning the unique characteristics of the action of the new drug target and the ligand molecules by using a sparse regularization term;

(4) the constructed model is used for predicting the activity of the lead compound under a small sample and evaluating the performance of the lead compound.

Preferably, in the step (1), the method for constructing the virtual screening model includes:

there is no specific requirement on the number of homologous drug targets; for the sake of model performance, generally more than two homologous drug targets are selected, as many as practical, and three homologous drug targets are used in the specific embodiment of the invention. When selecting homologous drug targets, targets with a relatively large number of known binding ligand molecules are preferred. Homologous drug targets often share sequence homology and functional similarity, tend to interact with similar ligands, and often have similar interaction modes and mechanisms; the abundant ligand sample information of the homologous drug targets can therefore facilitate construction of a lead compound virtual screening model for a new drug target with insufficient sample information;

acquiring the required initial data set, wherein the initial data set comprises information on the new drug target and its homologous drug targets, and the information comprises the SMILES strings of the ligand molecules and the activity values of the interactions with the ligands;

the prediction of the activity value of each drug target is called a task, and the prediction for a plurality of homologous drug targets is called a plurality of tasks; according to the initial data set, the SMILES strings of the ligands bound by the n drug targets in the t tasks are converted into corresponding d-dimensional molecular features by a molecular fingerprint method, wherein each feature is represented by 0/1 entries and is denoted $x_n$; $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ is used as the feature matrix input for the plurality of tasks, where $x_1, x_2, \ldots, x_n$ respectively represent the 1st, 2nd, ..., nth molecular features of the ligands, $\mathbb{R}^{n \times d}$ denotes the space of dimension $n \times d$, and $T$ is the transpose symbol;

the activity values of the n drug targets are denoted as vectors $y_n$; $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times t}$ is used as the activity value input for the plurality of tasks, where $y_1, y_2, \ldots, y_n$ represent the activity values of the 1st, 2nd, ..., nth ligand interactions, $\mathbb{R}^{n \times t}$ denotes the space of dimension $n \times t$, and $T$ is the transpose symbol;

and constructing loss function constraint by using a multi-task learning method, and further obtaining a weight matrix W to obtain the virtual screening model.

Preferably, the step (2) of learning common characteristics of the interaction between the homologous drug target and the ligand molecule specifically comprises: and adding a low-rank regularization term into the loss function to constrain to obtain a P matrix, wherein the P matrix is a low-rank matrix and captures the correlation between tasks.

Preferably, learning the unique characteristics of the interaction between the new drug target and the ligand molecules in step (3) specifically comprises: adding a sparse regularization term to the loss function as a constraint to obtain a Q matrix, wherein the Q matrix is a sparse matrix that selects the unique characteristics of each task.

In step (4), the constructed virtual screening model is used to predict activity values and to evaluate the performance of the model, given a small sample of compounds in SMILES format whose biological activity values are known.

The invention also provides a virtual screening device of the lead compound based on the small sample, which comprises the following components:

the virtual screening module based on sparse low-rank multi-task learning is used for constructing a virtual screening model;

the common characteristic recognition module is used for learning the common characteristics of the interaction of the homologous drug target and the ligand molecule;

the unique characteristic recognition module of the new drug target and ligand molecule action is used for learning the unique characteristics of the new drug target and ligand molecule action;

and the lead compound activity prediction and performance evaluation module is used for predicting the activity of the lead compound based on the small molecule sample and evaluating the performance of the virtual screening model.

Wherein the virtual screening module based on sparse low rank multi-task learning comprises:

the homologous drug target selection module is used for assisting the selection of drug targets for model construction;

an initial module for obtaining from a database an initial data set comprising the SMILES strings of the ligand molecules and the activity values of the interactions with the ligands;

the feature extraction module is used for processing the original data to generate the feature matrix of the ligands;

the activity value generation module is used for collating the activity values of the interactions between the drug targets and the ligand molecules, and applying a -lg (negative base-10 logarithm) transform to the activity values so as to reduce their span;

and the multi-task learning module is used for constructing loss function constraints, further optimizing the loss function constraints to obtain a weight matrix, and learning the virtual screening model.

The common characteristic identification module is a low-rank regularization module and is used for learning the common characteristics to obtain a low-rank matrix representing the correlation between tasks.

The unique feature identification module is a sparse regularization module and is used for learning unique features and obtaining a sparse matrix representing uniqueness of each task.

The lead compound activity prediction and performance evaluation module comprises:

a prediction module for predicting an activity value of the effect;

and the evaluation module is used for obtaining an index for evaluating the performance of the model.

It will be understood by those skilled in the art that the homologous drug targets of the present invention are drug targets that have homology to the new drug target. Biologically, two or more structures are homologous if they have the same ancestry, i.e., they have evolved from a common ancestry. Homologous proteins are in particular evolutionarily related proteins, i.e.proteins of the same or similar function in different species or proteins with significant sequence homology. The homology of proteins is often determined by the similarity of their sequences, which refers to the ratio of the same amino acid residue sequence between the test sequence and the target sequence in the sequence alignment process; generally, when the degree of similarity is higher than 30%, it is usually presumed that the test sequence and the target sequence have functional homology.
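The similarity criterion described above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned by an external tool and use `-` for gaps; the function names and the choice of denominator (aligned columns without gaps) are assumptions made for illustration, not definitions taken from the invention.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of gap-free aligned columns in which both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Compare only columns where neither sequence has a gap character.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def presumed_homologous(aligned_a, aligned_b, threshold=30.0):
    """Apply the 'similarity higher than 30%' rule of thumb from the text."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

For example, `percent_identity("ACDE", "ACDF")` compares four columns, three of which match, so the pair exceeds the 30% threshold.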

Advantageous effects:

To address the insufficient sample information for lead compounds of a new drug target, the invention uses the abundant sample information of homologous drug targets and adopts sparse low-rank multi-task learning to construct a virtual screening model according to the characteristics of the interaction between the homologous drug targets and compound molecules. Because the interaction modes and mechanisms of homologous drug targets with compound molecules are often similar, a low-rank regularization term is used to learn the common characteristics of the interaction between the homologous drug targets and ligand molecules; because the interaction of the new drug target with compound molecules also has unique characteristics, a sparse regularization term is used to learn those unique characteristics. Both the common and the unique characteristics among homologous drug targets are taken into account during model construction, which effectively addresses the problem that a virtual screening model is difficult to construct, or performs poorly, when the number of samples and the available information are insufficient. Virtual screening can thus be carried out more effectively, which is of great help in researching new drug targets and lead compounds for rare diseases. Experiments show that a good modeling result is still achieved in the specific example, which has fewer than one hundred samples.

The method and the device can realize the purposes of improving the screening efficiency and saving time and huge capital expenditure. Meanwhile, when the method is applied to the field of drug screening, the requirement of predicting the biological activity of the drug target ligand can be met.

Drawings

FIG. 1 is a flow chart of a method for virtual screening of lead compounds based on small samples according to the present invention;

FIG. 2 is a flowchart of step 101 of FIG. 1;

FIG. 3 is a flow chart of step 104 in an embodiment of the method of the present invention;

FIG. 4 is a block diagram of an embodiment of the apparatus for virtual screening of lead compounds based on small samples according to the present invention;

FIG. 5 is a block diagram of a module 401 in an embodiment of the apparatus;

FIG. 6 is a block diagram illustrating the structure of the module 404 in an embodiment of the present invention.

Detailed Description

The present invention will be further explained with reference to the following embodiments.

Fig. 1 shows a flow of the method for virtual screening of lead compounds based on small samples in this embodiment, which may include the following steps:

step 101: and constructing a virtual screening model by adopting sparse low-rank multi-task learning.

Specifically, referring to fig. 2, where fig. 2 is a flowchart of step 101 in practical application, step 101 may specifically include:

step 201: selecting a homologous drug target.

Table 1 shows the selected homologous drug targets in this example, which are three G protein-coupled receptors, named as GPCR1, GPCR2, and GPCR3 (in this example, one of them is assumed to be a new drug target, and the other two are homologous drug targets), and the ID numbers in the database are shown in the following table.

TABLE 1

Step 202: the required initial data set is acquired.

The initial data set contains information on the new drug target and its homologous drug targets, including the SMILES strings of the ligand molecules and the activity values of the interactions with the ligands. Taking Table 2 as an example, the initial data set includes:

canonical smiles: the SMILES string of each ligand molecule, used to generate the molecular features of the ligand;

standard value: the activity value of each ligand interaction;

standard units: the units of the activity value.

TABLE 2

The prediction of the activity value of each drug target is called a task, and the prediction of a plurality of homologous drug targets is called a plurality of tasks; for each task, an initial data set is obtained from the database.

Step 203: generating the ligand molecular features according to the initial data set.

According to the initial data set, the corresponding features of the obtained ligand molecules, here the SMILES string of each ligand, are generated by a molecular fingerprint method; specifically, the SMILES strings of the ligands bound by the n drug targets in the t tasks are converted into corresponding d-dimensional molecular features by the PubChem molecular fingerprint method, represented by 0/1 entries and denoted $x_n$;

Considering the tasks together, $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$, and the input data feature matrix is denoted $X$.

Here $x_1, x_2, \ldots, x_n$ respectively represent the 1st, 2nd, ..., nth molecular features of the ligands, $\mathbb{R}^{n \times d}$ denotes the space of dimension $n \times d$, and $T$ is the transpose symbol.

Table 3 gives an example in which only 14 feature dimensions are shown:

TABLE 3

0	0	0	0	0	0	0	0	0	1	1	1	1	1
0	0	0	0	0	0	0	0	0	1	1	1	1	1
0	0	0	0	0	0	0	0	0	1	1	1	1	1
0	0	0	0	0	0	0	0	0	1	1	1	1	1
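As a minimal sketch of how rows like those in Table 3 could be stacked into the feature matrix X, the snippet below assumes each ligand's fingerprint has already been computed by an external tool and is available as a string of 0/1 characters. The function names and the example bit strings are hypothetical; computing the fingerprint itself is outside this sketch.

```python
def fingerprint_to_row(bits):
    """Convert a string of '0'/'1' characters into a list of integer features."""
    if set(bits) - {"0", "1"}:
        raise ValueError("fingerprint must contain only '0' and '1'")
    return [int(c) for c in bits]


def build_feature_matrix(fingerprints):
    """Stack n fingerprints of equal length d into an n x d matrix (list of rows)."""
    rows = [fingerprint_to_row(bits) for bits in fingerprints]
    if not rows:
        raise ValueError("at least one fingerprint is required")
    d = len(rows[0])
    if any(len(row) != d for row in rows):
        raise ValueError("all fingerprints must have the same dimension d")
    return rows


# Four hypothetical ligands with 14-bit fingerprints (same pattern as Table 3).
X = build_feature_matrix(["00000000011111"] * 4)
```

Each row of `X` corresponds to one ligand and each column to one fingerprint bit, so `X` here has n = 4 rows and d = 14 columns.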

Step 204: obtaining the activity value of each ligand interaction according to the initial data set, and applying a -lg (negative base-10 logarithm) transform to the activity values.

The activity values of the n drug targets are denoted as vectors $y_n$; $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times t}$, and the activity value input for the plurality of tasks is denoted $Y$.

Here $y_1, y_2, \ldots, y_n$ represent the activity values of the 1st, 2nd, ..., nth ligand interactions, $\mathbb{R}^{n \times t}$ denotes the space of dimension $n \times t$, and $T$ is the transpose symbol.

Table 4 gives an example in which only 8 values are shown:

TABLE 4

Value
1.523
0.959
-0.653
0.456
0.357
0.699
1.046
1.046
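The -lg processing of step 204 can be sketched as follows. The raw values below are hypothetical and are not the inputs behind Table 4; the function name is likewise an assumption. The point is only that values spanning several orders of magnitude are mapped to a much narrower range.

```python
import math


def neg_log10(value):
    """Return -lg(value); the raw activity value must be strictly positive."""
    if value <= 0:
        raise ValueError("activity value must be positive for a log transform")
    return -math.log10(value)


# Hypothetical raw activity values spanning four orders of magnitude.
raw_values = [0.01, 1.0, 100.0]
transformed = [neg_log10(v) for v in raw_values]

# Span before and after the transform.
raw_span = max(raw_values) - min(raw_values)
transformed_span = max(transformed) - min(transformed)
```

In this example the raw span of roughly 100 shrinks to a span of about 4 after the transform, which is the reduction in span the step aims for.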

Step 205: constructing the loss function constraint by a multi-task learning method.

In machine learning, a general objective function is usually constructed by a loss function and a regularization term.

Objective function: $W^{*} = \arg\min_{W} \, L(W) + \Omega(W)$;

In the objective function, the first term $L(W)$ is the loss function term, which measures the error between the true value and the predicted value of each sample in the model; the second term $\Omega(W)$ is the regularization term, which comprises a low-rank regularization term and a sparse regularization term. The low-rank regularization term is used to learn the common characteristics of the interaction between the homologous drug targets and ligand molecules, and the sparse regularization term is used to learn the unique characteristics of the interaction between the new drug target and ligand molecules.

The loss function typically has the form $L(w) = \sum_{i} L(y_i - f(x_i; w))$;

where $L(y_i - f(x_i; w))$ measures the error between the predicted value $f(x_i; w)$, computed by the model from the feature vector $x_i$ of the i-th sample and the weight vector $w$, and the true label $y_i$;

Because the model must fit the training samples as closely as possible, training drives the loss on the training data toward its minimum.

As a result, overfitting may occur.

To avoid overfitting, a sparse low-rank regularization term, also called the penalty term $\Omega(W)$, is added.
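One way to write such an objective out in full is shown below. This is a sketch under stated assumptions, not the formulation fixed by the invention: it assumes a squared loss, a nuclear norm $\|P\|_{*}$ as the low-rank regularizer, and an entrywise $L_1$ norm $\|Q\|_{1}$ as the sparse regularizer. The symbols $X_j$, $y_j$ (data of task $j$), $p_j$, $q_j$ (the $j$-th columns of $P$ and $Q$) and the trade-off parameters $\lambda_1$, $\lambda_2$ are notation introduced here for illustration.

```latex
\min_{P,\,Q} \; \sum_{j=1}^{t} \left\| X_j\,(p_j + q_j) - y_j \right\|_2^2
  \;+\; \lambda_1 \, \|P\|_{*} \;+\; \lambda_2 \, \|Q\|_{1},
\qquad W = P + Q
```

Under these assumptions the first term plays the role of $L(W)$ and the last two terms together play the role of $\Omega(W)$: a larger $\lambda_1$ pushes the tasks toward a shared low-dimensional structure, while a larger $\lambda_2$ leaves fewer task-specific nonzero weights.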

Step 102: learning the common characteristics of the interaction between the homologous drug targets and the ligand molecules.

A low-rank regularization term is added to the loss function as a constraint to obtain a P matrix, wherein the P matrix is a low-rank matrix that captures the correlation between tasks.

Table 5 shows 8 dimensions selected from the P matrix:

TABLE 5

0.2654	0.0190	0.0089
-0.1584	-0.0428	-0.0029
0.0199	0	-0.0341
-0.1170	-0.0465	-0.0067
0.0158	0.0478	-0.1856
-0.0610	0.0636	-0.0191
0.0755	-0.0562	-0.2221
-0.0126	-0.1671	-0.3132
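To make the notion of a low-rank matrix capturing correlation between tasks concrete, the sketch below builds a hypothetical 4 x 3 matrix in which every task column is a scaled copy of one shared feature-weight pattern, and checks that its rank is 1. The rank routine (Gaussian elimination with a tolerance) and all numbers are illustrative assumptions; they are not the values of Table 5 and not the optimization used to obtain P.

```python
def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [[float(v) for v in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


# Hypothetical shared pattern over d = 4 features, scaled per task (t = 3).
shared_pattern = [0.5, -0.2, 0.0, 0.1]
task_scales = [1.0, 0.8, 1.2]
P_example = [[w * s for s in task_scales] for w in shared_pattern]
```

Here `matrix_rank(P_example)` is 1 even though the matrix has three task columns, which is the sense in which a low-rank component encodes what the tasks have in common.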

Step 103: learning the unique characteristics of the interaction between the new drug target and the ligand molecules.

A sparse regularization term is added to the loss function as a constraint to obtain a Q matrix, wherein Q is a sparse matrix that selects the unique characteristics of each task.

Table 6 shows 8 dimensions selected from the Q matrix:

TABLE 6
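How a sparse regularization term produces a matrix with many exact zeros can be sketched with the soft-thresholding operator, which is the proximal operator of the $L_1$ norm. Using the $L_1$ norm as the sparse regularizer is an assumption made for this illustration, and the matrix values, threshold, and function names below are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink a value toward zero by `threshold`; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def prox_l1(matrix, threshold):
    """Apply soft-thresholding entrywise (proximal operator of the L1 norm)."""
    return [[soft_threshold(v, threshold) for v in row] for row in matrix]


# Hypothetical task-specific weights before shrinkage (2 features x 2 tasks).
Q_raw = [[0.30, -0.02],
         [0.01, -0.50]]
Q_sparse = prox_l1(Q_raw, threshold=0.05)

# Count how many entries were set exactly to zero.
zero_count = sum(1 for row in Q_sparse for v in row if v == 0.0)
```

Entries whose magnitude is below the threshold are zeroed while the larger ones are only shrunk, so what remains in `Q_sparse` marks the few features that matter for one task specifically.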

Step 104: using the constructed model to predict the activity of lead compounds under a small sample and evaluating the performance of the model.

Specifically, referring to FIG. 3, which is a flowchart of step 104 in practical application, step 104 may specifically include:

Step 301: performing the prediction.

As shown in Table 7, from the P and Q matrices obtained in steps 102 and 103, W can be obtained from the constraint $W = P + Q$.

Here P is a low-rank matrix and Q is a sparse matrix; combining them yields the weight matrix W.

The dimensions corresponding to Tables 5 and 6 are shown:

TABLE 7

0.2654	0.0190	0.0089
-0.1584	-0.0428	-0.0029
0.0199	0	-0.0341
-0.1170	-0.0465	-0.0067
0.0158	0.0478	-0.1952
-0.0610	0.0636	-0.0191
0.0755	-0.0562	-0.2221
-0.0126	-0.1671	-0.3132

Preferably, the predicted values of the samples are obtained from the linear relationship $Y = WX$.

Here $Y$ is the matrix formed by the activity value vectors, $W$ is the weight matrix, and $X$ is the feature matrix.

Table 8 shows the predicted values obtained for the samples corresponding to Table 4:
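The prediction step can be sketched in a few lines. All matrices below are hypothetical toy values, and the helper names are assumptions. Note one assumption about orientation: with X stored as an n x d matrix (one row per ligand) and W as a d x t matrix (one column per task), as the dimensions defined earlier suggest, the linear relationship is evaluated as the product of X with W, giving an n x t matrix of predictions.

```python
def mat_add(a, b):
    """Entrywise sum of two matrices of identical shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_mul(a, b):
    """Matrix product of an (n x d) matrix with a (d x t) matrix."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


# Hypothetical components for d = 2 features and t = 2 tasks.
P = [[0.2, 0.2], [0.1, 0.1]]    # low-rank part: both task columns identical
Q = [[0.0, 0.0], [0.0, -0.1]]   # sparse part: one task-specific correction
W = mat_add(P, Q)               # weight matrix W = P + Q

# Hypothetical 0/1 fingerprints for n = 3 ligands.
X = [[1, 0], [1, 1], [0, 1]]
Y_pred = mat_mul(X, W)          # n x t matrix of predicted activity values
```

Each row of `Y_pred` holds the predicted activity of one ligand against each task; the single nonzero entry of `Q` is what makes the second task's prediction differ from the first for ligands that have the second feature.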

TABLE 8

The squared correlation coefficient $r^2$, the evaluation index used in the drug activity prediction challenge organized by Merck on Kaggle in 2012, is:

$$r^2 = \frac{\left[\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right]^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2 \, \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}$$

where $y_i$ is the true activity value, $\bar{y}$ is the mean of the true activity values, $\hat{y}_i$ is the predicted activity value, $\bar{\hat{y}}$ is the mean of the predicted activity values, and $n$ is the number of samples. The larger the value of $r^2$, the better the model.
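A minimal sketch of this evaluation index follows, assuming it is the squared Pearson correlation between true and predicted activity values, which is consistent with the four quantities defined above. The function name and the example values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Squared Pearson correlation between true and predicted activity values."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    # Numerator: squared sum of products of deviations from the means.
    cross = sum((a - mean_true) * (b - mean_pred) for a, b in zip(y_true, y_pred))
    # Denominator: product of the two sums of squared deviations.
    ss_true = sum((a - mean_true) ** 2 for a in y_true)
    ss_pred = sum((b - mean_pred) ** 2 for b in y_pred)
    if ss_true == 0 or ss_pred == 0:
        raise ValueError("r^2 is undefined when either sequence is constant")
    return (cross * cross) / (ss_true * ss_pred)


# Hypothetical true and predicted activity values for four ligands.
score = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

Because the correlation is squared, the index lies between 0 and 1 and rewards predictions that track the ordering and spacing of the true values, regardless of any constant offset or scale.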

To eliminate the influence of the random selection of control samples on the result, 5 groups of control samples are randomly selected for each drug target data set, a ligand biological activity prediction model is constructed for each group, and the mean ± variance is calculated as the final result.
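The repeated-sampling protocol can be sketched as below. The `evaluate` callback stands in for "draw one random group, build a model, return its score" and is entirely hypothetical, as are the function names; the use of the population variance is also an assumption, since the text does not say which variance estimator is meant.

```python
import random
import statistics


def summarize_runs(scores):
    """Return (mean, variance) of the scores from repeated random groups."""
    return statistics.mean(scores), statistics.pvariance(scores)


def repeated_evaluation(evaluate, n_groups=5, seed=0):
    """Run `evaluate` once per random group, each with its own derived seed."""
    rng = random.Random(seed)
    scores = [evaluate(rng.randrange(2 ** 32)) for _ in range(n_groups)]
    return summarize_runs(scores)


# Hypothetical scores from 5 randomly selected groups.
mean_score, var_score = summarize_runs([0.5, 0.6, 0.7, 0.6, 0.6])
```

Reporting the spread alongside the mean shows whether a difference between two algorithms is larger than the variation caused by the random sampling itself.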

To verify the effectiveness of the method, 7 algorithms are selected for comparison, namely SVR, GBDT, RF, LASSO, RR, MTLa and MTGLA.

SVR is support vector regression;

GBDT is gradient boosted decision trees, where the decision trees (DT) in GBDT are regression trees;

RF is random forest;

LASSO is lasso regression, in which the added regularization term is the $L_1$ norm;

RR is ridge regression, in which the added regularization term is the $L_2$ norm;

MTLa is multi-task lasso regression;

MTGLA is multi-task group lasso, in which the added regularization term is the $L_{2,1}$ norm;

The results are shown in Table 9, in which the 8th algorithm is the one adopted in the present invention.

As can be seen from Table 9, the adopted algorithm performs better than the other 7 algorithms; GPCR1, GPCR2, and GPCR3 denote the 3 drug targets selected in this example.

TABLE 9

Corresponding to the method provided by the above-mentioned method embodiment of virtual screening of lead compounds based on small samples, referring to fig. 4, the present application also provides an embodiment of an apparatus for virtual screening of lead compounds based on small samples, and in this embodiment, the apparatus may include:

the virtual screening module 401 based on sparse low-rank multi-task learning is used for constructing a virtual screening model;

referring to fig. 5, fig. 5 is a schematic structural diagram of the virtual screening module 401 based on sparse low rank multi-task learning, which specifically includes:

a homologous drug target selection module 501, configured to assist in selecting a drug target for model construction;

an initial module 502 for obtaining an initial data set containing smiles and activity value information from a database;

the feature extraction module 503 is configured to process the original data to generate a feature matrix of the ligand;

an activity value generation module 504, configured to collate the activity values of the interactions between the drug targets and the ligand molecules, and to apply a -lg (negative base-10 logarithm) transform to the activity values to reduce their span;

and the multi-task learning module 505 is used for constructing loss function constraints, further optimizing the loss function constraints to obtain a weight matrix, and learning a virtual screening model.

a common characteristic recognition module 402 for the interaction between the homologous drug targets and the ligand molecules, configured to learn the common characteristics of that interaction;

low-rank regularization is added to learn the common characteristics and obtain a low-rank matrix representing the correlation between tasks;

a unique characteristic recognition module 403 for the interaction between the new drug target and the ligand molecules, configured to learn the unique characteristics of that interaction;

sparse regularization is added to learn the unique characteristics and obtain a sparse matrix representing the uniqueness of each task;

and a lead compound activity prediction module 404, configured to predict the activity of lead compounds based on the small molecule sample and to evaluate the model performance.

Referring to FIG. 6, FIG. 6 is a schematic structural diagram of the lead compound activity prediction module 404, which specifically includes:

a prediction module 601 for predicting an activity value of the effect;

and the evaluation module 602 is configured to obtain an index for evaluating the performance of the model.

The method and device for virtual screening of lead compounds based on a small sample provided by the present application have been described in detail above. Specific examples are used to explain the principle and implementation of the application, and the description of these examples is intended only to help in understanding the method and its core idea. A person skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A lead compound virtual screening method based on a small sample is characterized by comprising the following steps:

(4) predicting the activity of the lead compound under a small sample by using the constructed model and evaluating the performance of the lead compound;

the step (1) comprises the following steps:

selecting a homologous drug target;

the prediction of the activity value of each drug target is called a task, and the prediction for a plurality of homologous drug targets is called a plurality of tasks; according to the initial data set, the SMILES strings of the ligands bound by the n drug targets in the t tasks are converted into corresponding d-dimensional molecular features by a molecular fingerprint method, wherein each feature is represented by 0/1 entries and is denoted $x_n$; $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$ is used as the feature matrix input for the plurality of tasks, where $x_1, x_2, \ldots, x_n$ respectively represent the 1st, 2nd, ..., nth molecular features of the ligands, $\mathbb{R}^{n \times d}$ denotes the space of dimension $n \times d$, and $T$ is the transpose symbol;

the activity values of the n drug targets are denoted as vectors $y_n$; $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times t}$ is used as the activity value input for the plurality of tasks, where $y_1, y_2, \ldots, y_n$ represent the activity values of the 1st, 2nd, ..., nth ligand interactions, $\mathbb{R}^{n \times t}$ denotes the space of dimension $n \times t$, and $T$ is the transpose symbol; and constructing the loss function constraint by a multi-task learning method, and further obtaining a weight matrix W to obtain the virtual screening model.

2. The method for virtually screening lead compounds based on a small sample according to claim 1, wherein the step (2) comprises: and adding a low-rank regularization term into the loss function to constrain to obtain a P matrix, wherein the P matrix is a low-rank matrix.

3. The method for virtually screening lead compounds based on a small sample according to claim 1, wherein the step (3) comprises: and adding a sparse regularization item to constrain through a loss function to obtain a Q matrix, wherein the Q matrix is a sparse matrix and the unique characteristics of each task are selected.

4. A virtual screening device for a lead compound based on a small sample, comprising:

the common characteristic recognition module is used for learning the common characteristics of the interaction of the homologous drug target and the ligand molecules;

the lead compound activity prediction and performance evaluation module is used for predicting the activity of the lead compound based on a small molecule sample and evaluating the performance of the virtual screening model;

the virtual screening module based on sparse low-rank multi-task learning comprises:

the activity value generation module is used for collating the activity values of the interactions between the drug targets and the ligand molecules, and applying a -lg (negative base-10 logarithm) transform to the activity values so as to reduce their span;

5. The virtual screening apparatus of lead compounds based on small samples according to claim 4, wherein the common feature identification module is a low rank regularization module for learning the common features to obtain a low rank matrix representing correlations between tasks.

6. The virtual screening device of small-sample-based lead compounds according to claim 4, wherein the unique feature identification module is a sparse regularization module for learning unique features to obtain a sparse matrix representing uniqueness of each task.

7. The virtual screening device of lead compounds based on small samples according to claim 4, characterized in that the lead compound activity prediction and performance evaluation module comprises:

a prediction module for predicting an activity value of the effect;