CN112086139A

CN112086139A - Multi-source transfer learning method and device for virtual screening of small molecule drugs

Info

Publication number: CN112086139A
Application number: CN202010854924.3A
Authority: CN
Inventors: 袁露; 吴建盛; 胡海峰
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-08-24
Filing date: 2020-08-24
Publication date: 2020-12-15

Abstract

The invention provides a multi-source transfer learning method and device for virtual screening of small molecule drugs. The method comprises the following steps: acquiring a homologous source data set and sampling it to obtain sampled homologous source data sets; inputting the SMILES of ligand molecules and their biological activity values, and training a neural network to obtain a virtual screening model; training the virtual screening model on the sampled homologous source data sets to obtain model parameters; and predicting the biological activity value of a ligand molecule binding to the drug target.

Description

Multi-source transfer learning method and device for virtual screening of small molecule drugs

Technical Field

The invention relates to a virtual screening method and device, in particular to a multi-source transfer learning method and a multi-source transfer learning device for virtual screening of small molecule drugs.

Background

Virtual screening of drugs is a computational drug discovery technique that searches small molecule libraries for the structures most likely to bind to a drug target. It narrows the set of candidates and greatly reduces the number of compounds that must be screened experimentally, thereby shortening the development cycle and saving cost.

Virtual screening can be classified into two categories: receptor-based virtual screening and ligand-based virtual screening. Receptor-based virtual screening starts from the three-dimensional structure of the target protein, studies the characteristic properties of its binding site and the mode of interaction between the binding site and small molecule compounds, evaluates the binding capacity between the protein and each small molecule compound using an affinity scoring function related to binding energy, and finally selects, from a large number of compound molecules, compounds with a reasonable binding mode and a high predicted score for subsequent bioactivity testing. Ligand-based virtual screening generally starts from small molecule compounds with known activity, searches a compound database for chemical structures that match these compounds according to shape similarity or a pharmacophore model, and then subjects the selected compounds to experimental screening.

The number of compounds with drug-like properties is enormous. Machine learning can help search a huge chemical library while algorithmically cataloging, characterizing and comparing the properties of massive numbers of compounds, helping researchers find the best drug candidates quickly and economically. It can also help make the resulting drugs safer and lower their failure rate in clinical trials. In addition, it helps discover new classes of drugs by exploring unexplored or previously rejected regions of chemical space.

At present, drug development for many known targets is approaching saturation, and developing new drugs requires discovering new drug targets. However, new drug targets are insufficiently studied, and virtual screening for them often suffers from too few training samples, making it difficult to build a good virtual screening model. Existing research shows that transfer learning helps virtual screening for a drug target when training samples are insufficient. In addition, homologous or similar target proteins can often be found for a new drug target, sometimes many of them; such target proteins tend to act on similar compounds, and their modes and mechanisms of interaction are often similar.

Disclosure of Invention

The purpose of the invention is as follows: to solve the problem of virtual screening of small molecule drugs for a new target when only a small sample is available, the invention aims to provide an effective multi-source transfer learning method for virtual screening of small molecule drugs, and further to provide a corresponding multi-source transfer learning device based on the method.

The technical scheme is as follows: the invention provides a multi-source transfer learning method for virtual screening of small molecule drugs, which comprises the following steps:

(1) acquiring a homologous source data set and sampling it to obtain sampled homologous source data sets;

(2) inputting the SMILES of ligand molecules and their biological activity values, and training a neural network to obtain a virtual screening model;

(3) training the virtual screening model on the sampled homologous source data sets to obtain model parameters;

(4) predicting the biological activity value of the ligand molecule binding to the drug target.

Wherein, step (1) comprises:

(1.1) selecting homologous drug targets;

(1.2) acquiring the required initial data sets of the homologous drug targets, wherein each initial data set comprises information on a homologous drug target, the information comprising the SMILES of the required ligand molecules and the activity values of ligand action;

(1.3) randomly sampling, with replacement, the data set corresponding to each homologous drug target at a set sampling ratio, repeating the sampling several times, and obtaining the sampled sub-homologous source data sets.

Preferably, step (2) comprises:

(2.1) acquiring an initial data set of the target drug target, T = {(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)},

wherein x_i is the SMILES of the ith ligand molecule,

y_i is the activity value of the ith ligand acting on the drug target,

N is the number of ligand molecules in the data set,

said initial data set comprising information on homologous drug targets, said information comprising the SMILES of the required ligand molecules and the activity values of ligand action;

(2.2) generating a molecular fingerprint of the ligand molecule, denoted f, by a convolution operation, wherein:

m_j: the attribute vector of the jth atom;

N_I: the neighborhood of atom j;

A_ij: the term associated with the edge connecting atoms i and j;

a weight matrix;

b: a bias vector;

(2.3) generating a weighted molecular fingerprint of the ligand molecule, denoted F, wherein:

f_i is the molecular fingerprint of the ith unit;

w is a parameter of the weight layer;

(2.4) predicting the biological activity value from the generated molecular fingerprint through two fully connected layers, wherein:

ŷ_i is the predicted biological activity value of the binding of the ith ligand molecule;

o_ms: parameters of the fully connected layer;

F_j: the weighted molecular fingerprint of the jth ligand molecule.

Further, step (3) comprises:

(3.1) acquiring the plurality of source data sets generated in step (1), wherein the data sets comprise the SMILES of the required ligand molecules and the activity values of ligand action;

(3.2) training the small molecule drug virtual screening model generated in step (2) on each sub-source domain data set, and obtaining the model parameters;

(3.3) inputting the target domain data set into the virtual screening model, replacing the original parameters of the virtual screening model with the model parameters obtained in the previous step, and obtaining the predicted biological activity values of the target domain.

Preferably, step (4) comprises:

(4.1) comparing the reliability of the target domain biological activity values obtained by training on the plurality of sub-source domains, measured by a correlation coefficient;

(4.2) selecting the several sub-source domains with the largest correlation coefficients, and averaging the corresponding biological activity values to obtain the final predicted biological activity value;

(4.3) comparing the final predicted biological activity value with the actual activity value, and using the correlation coefficient r² to measure the reliability of the prediction.

The invention also provides a multi-source transfer learning device for virtual screening of small molecule drugs, which comprises the following modules:

a homologous source data set generation module, used for acquiring the sub-homologous source data sets;

a virtual screening module, used for constructing a virtual screening model;

a multi-source transfer module, used for helping to construct the virtual screening model of the target drug target by utilizing ligand molecule information of homologous drug targets;

and an activity value prediction module, used for predicting the activity value of the ligand molecule after binding to the drug target and evaluating the performance of the virtual screening model.

Wherein, the homologous source data set generation module: downloads drug target data sets from the UniProt database, the obtained data sets comprising the SMILES of the ligand molecules and the activity values of the action of the ligand molecules on the drug targets; samples the data sets with replacement; and outputs the sampled data sets.

The virtual screening module: predicts the biological activity value of a ligand molecule acting on the drug target, with input: a compound in SMILES format, and output: the biological activity value of its action on the drug target; the biological activity value is then applied to drug design for that drug target;

Preferably, the multi-source transfer learning device for virtual screening of small molecule drugs comprises a transfer module, a virtual screening module and a prediction module. The transfer module is used for transferring information of homologous drug target ligands: the data set to be sampled is input, and the transferred parameters are output by a demo module. The virtual screening module is used for constructing a virtual screening model of small molecule drugs and predicting the biological activity value of a ligand molecule binding to the drug target: the SMILES of the ligand molecules in the data set are input into the demo module, and the initial parameters are replaced with the transferred parameters to obtain the biological activity values acting on the drug target. The prediction module is used for predicting the activity value of the ligand molecule after binding to the drug target and evaluating the performance of the model: the predicted biological activity values are compared with the actual biological activity values, and the model is evaluated using the reliability index.

The activity value prediction module: selects the most reliable prediction data sets using the reliability index, averages their activity values to obtain the final predicted biological activity value, and evaluates the reliability of the final predicted value using the reliability index.

Beneficial effects: the abundant compound sample information of homologous target proteins is used to help drug targets with insufficient sample information establish a virtual screening model. Through multi-source transfer learning, a plurality of homologous or similar drug targets serve as source domains and the target drug target serves as the target domain; compound information from the source domains is transferred to the target domain to help construct the virtual screening model. A model with strong generalization ability can therefore be established with a small sample, improving the accuracy of virtual screening.

Drawings

FIG. 1 is a schematic flow chart of a multi-source transfer learning method for virtual screening of small molecule drugs according to the present application;

FIG. 2 is a flow chart of step 101 in an embodiment of the method of the present application;

FIG. 3 is a flow chart of step 102 in an embodiment of the method of the present application;

FIG. 4 is a flow chart of step 302 in an embodiment of the method of the present application;

FIG. 5 is a flow chart of step 103 in an embodiment of the method of the present application;

FIG. 6 is a flow chart of step 104 in an embodiment of the method of the present application;

FIG. 7 is a schematic flow chart of a multi-source transfer learning apparatus for virtual screening of small molecule drugs according to the present application;

FIG. 8 is a schematic diagram of the structure of a module 601 in an embodiment of the apparatus of the present application;

FIG. 9 is a schematic diagram of the structure of a module 602 in an embodiment of the apparatus of the present application;

FIG. 10 is a schematic diagram of the structure of a module 603 in an embodiment of the apparatus of the present application.

Detailed Description

The present invention will be further explained with reference to the following embodiments.

Fig. 1 shows a schematic diagram of the multi-source transfer learning method for virtual screening of small molecule drugs in this example, which may include the following steps:

Step 101: constructing a data set generation model based on sampling with replacement;

Specifically, referring to FIG. 2, a flowchart of step 101 in practical application, step 101 includes:

Step 201: selecting homologous drug targets according to the target drug target. The target drug target is P46093; its homologous targets were searched for, and four homologous drug targets were selected, as shown in Table 1:

TABLE 1

Step 202: acquiring the required homologous data sets, each comprising the SMILES of the ligand molecules and the corresponding binding biological activity values. Taking Table 2 as an example, a homologous data set includes:

canonical SMILES: used to generate the molecular features of the ligand;

standard value: the activity value of the action of the respective ligand;

TABLE 2

CANONICAL SMILES	STANDARD VALUE
CCCCC(C(=O)NC(CC1CCCCC1)C(=O)	0.78

Step 203: sampling each of the P25106, P25106, P47900 and P3248 data sets with replacement, with the sampling ratio set to 0.5 and the sampling repeated three times per data set, to obtain 12 sub-data sets, namely D1, D2, D3, D4, D5, D6, D7, D8, D9, D10, D11 and D12;
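The sampling in step 203 can be illustrated with a minimal sketch. It assumes each homologous data set is a list of (SMILES, activity value) records; the function name `bootstrap_subsets`, the target labels and the placeholder records are hypothetical and are not taken from the application. With four source data sets and three repetitions each, the sketch yields 4 x 3 = 12 sub-data sets.

```python
import random


def bootstrap_subsets(dataset, ratio=0.5, repeats=3, seed=0):
    """Draw `repeats` sub-data sets, each of size ratio * len(dataset),
    by random sampling with replacement."""
    rng = random.Random(seed)
    size = max(1, int(len(dataset) * ratio))
    return [rng.choices(dataset, k=size) for _ in range(repeats)]


# Hypothetical placeholder records: four homologous targets,
# each with four (ligand identifier, activity value) pairs.
source_sets = {
    f"target_{name}": [(f"ligand_{name}_{i}", 0.1 * i) for i in range(4)]
    for name in "ABCD"
}

sub_datasets = []
for name, records in source_sets.items():
    sub_datasets.extend(bootstrap_subsets(records, ratio=0.5, repeats=3))

print(len(sub_datasets))  # 4 targets x 3 repetitions
```

Because the sampling is with replacement, a record may appear more than once in the same sub-data set, which is what gives the later ensemble step diverse source models.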

Step 102: establishing a virtual screening model based on a graph neural network;

Specifically, referring to FIG. 3, a flowchart of step 102 in practical application, step 102 includes:

Step 301: acquiring the data set T of the target drug target P46093, as shown in Table 3:

TABLE 3

Step 302: generating a molecular fingerprint;

specifically, referring to fig. 4, as a flowchart of step 302 in practical application, step 302 may specifically include:

Step 302-1: input the target data set T = {(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N)}, where x_i is the SMILES of the ith ligand molecule, y_i is the activity value of the action of the ith ligand on the drug target, and N is the number of ligand molecules.

The generation of the molecular fingerprint may comprise L units, each unit consisting of a convolutional layer and an accumulation layer. The following operations are performed for each unit:

Step 302-2: input x_i; after RDKit processing, let x_i contain A_i atoms, each atom in x_i being represented by a 62-dimensional attribute vector m_j (j = 1, ..., A_i);

Step 302-3: parameter initialization: initialize C_l, N_l, E_l and b_l for l ∈ [1, L], and set f = 0 and F = 0;

Step 302-4: randomly select N_S samples from the data set T to form a new sample set;

Step 302-5: the following operation is performed on the ith unit:

the output of the convolutional layer for each atom is computed from:

m_j: the attribute vector of the jth atom;

N_I: the neighborhood of atom j;

A_ij: the term associated with the edge connecting atoms i and j;

a weight matrix;

b: a bias vector;

Step 302-6: the outputs of all atoms pass through a summation layer, whose output is: f = f + z_i;
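One unit of steps 302-5 and 302-6 can be sketched as follows. The convolution formula itself is not reproduced in this text, so the sketch assumes a common graph-convolution form suggested by the parameter names C, N, E and b: z_i = ReLU(C m_i + Σ_{j in neighborhood of i} (N m_j + E A_ij) + b), followed by the summation f = Σ_i z_i. The activation function, the helper names and the toy dimensions are assumptions for illustration only.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in M]


def vadd(*vectors):
    """Element-wise sum of one or more equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def relu(v):
    return [max(0.0, x) for x in v]


def conv_unit(atoms, neighbors, edges, C, N, E, b):
    """Assumed convolution: combine each atom's own attributes with the
    attributes of its neighbors and of the connecting edges."""
    outputs = []
    for i, m_i in enumerate(atoms):
        acc = vadd(matvec(C, m_i), b)
        for j in neighbors[i]:
            acc = vadd(acc, matvec(N, atoms[j]), matvec(E, edges[(i, j)]))
        outputs.append(relu(acc))
    return outputs


def sum_pool(outputs):
    """Summation layer: f = sum of z_i over all atoms."""
    return vadd(*outputs)


# Toy 3-atom chain with 2-dimensional atom attributes and 1-dimensional
# edge attributes (the application uses 62-dimensional atom vectors).
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
edges = {(0, 1): [1.0], (1, 0): [1.0], (1, 2): [1.0], (2, 1): [1.0]}
C = [[1.0, 0.0], [0.0, 1.0]]
N = [[0.5, 0.0], [0.0, 0.5]]
E = [[0.1], [0.1]]
b = [0.0, 0.0]

f = sum_pool(conv_unit(atoms, neighbors, edges, C, N, E, b))
```

Summing over atoms makes the fingerprint independent of atom ordering and of molecule size, which is why a pooling step of this kind follows the per-atom convolution.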

Step 303: generating the weighted molecular fingerprint;

Step 304: connecting the weighted molecular fingerprint generated in step 303 to two fully connected layers, whose output is computed from:

p_jm: the weight connecting neuron j to neuron m;

o_ms: the weight connecting neuron m to neuron s;

ŷ_i: the predicted biological activity value of the ith ligand binding to the drug target.

Step 305: optimizing the error function by iteratively updating the parameter set θ, where θ is the set of all parameters;

Step 306: determining whether the model optimization meets the desired criterion; if not, the process returns to step 304.

Step 307: returning the predicted values ŷ and all model parameters.
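The forward pass of steps 303 and 304 can be sketched under stated assumptions: the weighted fingerprint is taken to be a weighted sum of the per-unit fingerprints, F = Σ_l w_l f_l, and the two fully connected layers are taken to be a ReLU hidden layer with weights p_jm followed by a single linear output with weights o_m. The exact formulas are not reproduced in this text, so these forms, the function names and the toy numbers are illustrative only.

```python
def weighted_fingerprint(unit_fingerprints, w):
    """Assumed form: F = sum over units l of w[l] * f_l."""
    dim = len(unit_fingerprints[0])
    return [
        sum(w_l * f_l[d] for w_l, f_l in zip(w, unit_fingerprints))
        for d in range(dim)
    ]


def predict_activity(F, P, o):
    """Two fully connected layers.
    P[j][m] connects input neuron j to hidden neuron m;
    o[m] connects hidden neuron m to the single output neuron."""
    hidden = [
        max(0.0, sum(P[j][m] * F[j] for j in range(len(F))))
        for m in range(len(o))
    ]
    return sum(o_m * h_m for o_m, h_m in zip(o, hidden))


# Toy values: two units, 2-dimensional fingerprints, two hidden neurons.
unit_fingerprints = [[1.0, 2.0], [3.0, 4.0]]
w = [0.5, 0.25]
P = [[1.0, -1.0], [0.5, 0.0]]
o = [2.0, 3.0]

F = weighted_fingerprint(unit_fingerprints, w)
y_hat = predict_activity(F, P, o)
```

In training (steps 305 and 306), all of w, P, o and the convolution parameters would be updated together to reduce the error between `y_hat` and the measured activity value.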

Step 103: constructing a multi-source transfer learning model based on parameter transfer;

Specifically, referring to FIG. 5, a flowchart of step 103 in practical application, step 103 may include:

Step 401: acquiring the 12 homologous data sets generated in step 101, wherein the data sets comprise the SMILES of the ligand molecules and the activity values of ligand action;

TABLE 4

Serial number	Affiliated target ID
1	P25106
2	P25106
3	P25106
4	P25566
5	P21556
6	P21556
7	P47900
8	P47900
9	P47900
10	P32246
11	P32466
12	P32466

Step 402: training the virtual screening model generated in step 102 on each source domain data set, and acquiring all model parameters;

Step 403: inputting the target data set into the virtual screening model of step 102, replacing the original parameters of the virtual screening model with the model parameters obtained in the previous step, and obtaining the predicted biological activity values of the target domain.
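The parameter transfer of steps 402 and 403 can be sketched as follows. To keep the example self-contained, a one-variable least-squares model stands in for the graph-neural-network screening model; `fit_linear`, `ScreeningModel` and the numeric toy data are hypothetical and only illustrate the flow "train on each source subset, then load the learned parameters into the model used for the target domain".

```python
def fit_linear(data):
    """Stand-in for training: least-squares fit of y = a * x + c."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx if sxx else 0.0
    return {"a": a, "c": mean_y - a * mean_x}


class ScreeningModel:
    """Stand-in for the virtual screening model."""

    def __init__(self):
        self.params = {"a": 0.0, "c": 0.0}  # original (initial) parameters

    def load_params(self, params):
        # Step 403: replace the original parameters with transferred ones.
        self.params = dict(params)

    def predict(self, inputs):
        return [self.params["a"] * x + self.params["c"] for x in inputs]


def transfer_predictions(source_subsets, target_inputs):
    """Return one list of target-domain predictions per source subset."""
    predictions = []
    for subset in source_subsets:
        params = fit_linear(subset)       # step 402: train on a source subset
        model = ScreeningModel()
        model.load_params(params)         # step 403: transfer the parameters
        predictions.append(model.predict(target_inputs))
    return predictions


source_subsets = [
    [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)],
    [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)],
]
per_source = transfer_predictions(source_subsets, target_inputs=[3.0, 4.0])
```

Each source subset therefore produces its own set of target-domain predictions, which is the input expected by the ensemble step that follows.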

Step 104: constructing an activity value prediction model based on ensemble learning;

specifically, referring to fig. 6, as a flowchart of step 104 in practical application, step 104 may specifically include:

Step 501: calculating the reliability of the activity values obtained by training on the plurality of source domains using the correlation coefficient r², where r² ranges from 0 to 1 and a value closer to 1 indicates higher reliability;

wherein y_i represents the actual value;

ŷ_i represents the predicted value.

Step 502: selecting the 5 sub-source domains with the largest r²;

Step 503: averaging the corresponding predicted biological activity values to obtain the final predicted biological activity value.
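Steps 502 and 503 amount to a top-k ensemble. A minimal sketch, assuming the per-source predictions and their r² scores are already available as plain lists; the function name `ensemble_top_k` and the toy numbers are hypothetical.

```python
def ensemble_top_k(predictions, scores, k=5):
    """Average the predictions of the k sub-source domains with the
    largest scores (r² in the application).

    predictions: one list of predicted activity values per sub-source domain
    scores:      one reliability score per sub-source domain
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:k]
    n_ligands = len(predictions[0])
    return [
        sum(predictions[i][j] for i in chosen) / len(chosen)
        for j in range(n_ligands)
    ]


# Toy example with three sub-source domains and k = 2.
predictions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [0.2, 0.9, 0.8]
final = ensemble_top_k(predictions, scores, k=2)
```

If fewer than k sub-source domains are supplied, the slice simply keeps all of them, so the function degrades to a plain average.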

The final predicted biological activity value is compared with the actual biological activity value, and r² and RMSE are used to measure the reliability of the prediction, wherein:

y_i: the activity value of the ith ligand binding to the target;

ŷ_i: the predicted activity value of the ith ligand binding to the target;

ȳ: the average of the activity values of ligands binding to the target;

and the average of the predicted activity values of ligands binding to the target.
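The two metrics can be written out explicitly. The formulas are not reproduced in this text; since the listed symbols include the means of both the actual and the predicted values, the sketch below assumes that r² is the squared Pearson correlation coefficient, and that RMSE has its usual definition. Here N is the number of ligands and the barred ŷ denotes the mean of the predicted values.

```latex
r^2 = \frac{\left[\sum_{i=1}^{N} (y_i - \bar{y})\,(\hat{y}_i - \bar{\hat{y}})\right]^2}
           {\sum_{i=1}^{N} (y_i - \bar{y})^2 \; \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
```

Under this assumption r² lies between 0 and 1, which agrees with the range stated in step 501, while RMSE is expressed in the same units as the activity values.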

Corresponding to the above embodiment of the multi-source transfer learning method for virtual screening of small molecule drugs, the present application also provides an embodiment of a multi-source transfer learning device for virtual screening of small molecule drugs. Referring to FIG. 7, in this example, the device may include:

a homologous source data set generation module 601, configured to obtain the homologous source data sets;

Referring to FIG. 8, a schematic structural diagram of the homologous source data set generation module, the module specifically includes:

a homologous target selection module 701: used for selecting the drug targets that assist model construction;

an initial module 702: used for obtaining the initial data sets comprising the SMILES of the ligands and the activity values of ligand action;

a sampling-with-replacement module 703: used for setting the sampling ratio, repeating the sampling with replacement multiple times, and generating the final source domain data sets;

a virtual screening module 602: used for constructing the virtual screening model; the graph-neural-network-based virtual screening module predicts the biological activity values of ligand molecules acting on drug targets and applies them to new drug design for those drug targets, with input: a compound in SMILES format, and output: the biological activity values of its interaction with these drug targets;

a multi-source transfer learning module 603: used for helping to construct the virtual screening model of the target drug target by utilizing ligand molecule information of the homologous drug targets;

Referring to FIG. 9, a schematic structural diagram of the parameter-transfer-based multi-source transfer learning module 603, the module specifically includes:

a homologous data set selection module 801: used for obtaining information on the homologous drug targets;

a transfer module 802: used for transferring information of homologous drug target ligands;

a virtual screening module 803: used for constructing the virtual screening model of small molecule drugs and predicting the biological activity value of the ligand molecule binding to the target drug target;

a prediction module 804: used for predicting the activity value of the ligand molecule after binding to the drug target and evaluating the performance of the model.

an activity value prediction module 604;

Referring to FIG. 10, a schematic structural diagram of the ensemble-learning-based activity value prediction module 604, the module specifically includes:

an optimal data set selection module 901: used for selecting the data whose predicted values are most reliable;

a mean predicted activity value module 902: used for obtaining the final predicted activity value by averaging.
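How modules 601 to 604 might be wired together can be sketched as below. This is only an illustration of the data flow described above: the class name `TransferScreeningDevice`, its field names and the trivial stand-in callables are hypothetical, and the split of responsibilities between the callables is an assumption rather than the application's own interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TransferScreeningDevice:
    generate_sources: Callable[[], List[list]]                  # role of module 601
    train_model: Callable[[list], dict]                         # role of module 602
    transfer_predict: Callable[[dict, Sequence], List[float]]   # role of module 603
    ensemble: Callable[[List[List[float]]], List[float]]        # role of module 604

    def run(self, target_inputs: Sequence) -> List[float]:
        subsets = self.generate_sources()
        per_source = [
            self.transfer_predict(self.train_model(subset), target_inputs)
            for subset in subsets
        ]
        return self.ensemble(per_source)


# Trivial stand-ins: each "model" just predicts the mean of its training values.
device = TransferScreeningDevice(
    generate_sources=lambda: [[1.0, 2.0], [3.0, 5.0]],
    train_model=lambda subset: {"mean": sum(subset) / len(subset)},
    transfer_predict=lambda params, xs: [params["mean"] for _ in xs],
    ensemble=lambda preds: [sum(col) / len(col) for col in zip(*preds)],
)
result = device.run(["ligand_1", "ligand_2"])
```

Passing the modules in as callables keeps each stage replaceable, for example swapping the stand-in trainer for a graph-neural-network trainer without touching the rest of the pipeline.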

Claims

1. A multi-source transfer learning method for virtual screening of small molecule drugs, characterized by comprising the following steps:

2. The multi-source transfer learning method for virtual screening of small molecule drugs according to claim 1, wherein step (1) comprises:

(1.1) selecting homologous drug targets;

(1.2) acquiring the required homologous drug target data sets, wherein each initial data set comprises information on a homologous drug target, the information comprising the SMILES of the required ligand molecules and the activity values of ligand action;

(1.3) randomly sampling, with replacement, the data set corresponding to each homologous drug target at a set sampling ratio, repeating the sampling several times, and obtaining the sampled sub-homologous source data sets.

3. The multi-source transfer learning method for virtual screening of small molecule drugs according to claim 1, wherein step (2) comprises:

(2.1) acquiring the data set T of the target drug target, wherein the initial data set comprises information on homologous drug targets, the information comprising the SMILES of the required ligand molecules and the activity values of ligand action;

(2.2) using the formula

(2.3) using the formula

(2.4) predicting the biological activity value from the generated molecular fingerprint through two fully connected layers,

4. The multi-source transfer learning method for virtual screening of small molecule drugs according to claim 1, wherein step (3) comprises:

(3.2) training the small molecule drug virtual screening model generated in step (2) on each sub-source domain data set, and obtaining the model parameters;

5. The multi-source transfer learning method for virtual screening of small molecule drugs according to claim 1, wherein step (4) comprises:


6. A multi-source transfer learning device for virtual screening of small molecule drugs, characterized by comprising the following modules:

7. The multi-source transfer learning device for virtual screening of small molecule drugs according to claim 6, wherein the homologous source data set generation module: downloads drug target data sets from the UniProt database, the obtained data sets comprising the SMILES of the ligand molecules and the activity values of the action of the ligand molecules on the drug targets; samples the data sets with replacement; and outputs the sampled data sets.

8. The multi-source transfer learning device for virtual screening of small molecule drugs according to claim 6, wherein the virtual screening module: predicts the biological activity value of a ligand molecule acting on the drug target, with input: a compound in SMILES format, and output: the biological activity value of its action on the drug target.

9. The multi-source transfer learning device for virtual screening of small molecule drugs according to claim 6, wherein the multi-source transfer module comprises a transfer module, a virtual screening module and a prediction module; the transfer module is used for transferring information of homologous drug target ligands, wherein the data set to be sampled is input and the transferred parameters are output by a demo module; the virtual screening module is used for constructing a virtual screening model of small molecule drugs and predicting the biological activity value of a ligand molecule binding to the drug target, wherein the SMILES of the ligand molecules in the data set are input into the demo module and the initial parameters are replaced with the transferred parameters to obtain the biological activity values acting on the drug target; and the prediction module is used for predicting the activity value of the ligand molecule after binding to the drug target and evaluating the performance of the model, wherein the predicted biological activity values are compared with the actual biological activity values and the model is evaluated using the reliability index.

10. The multi-source transfer learning device for virtual screening of small molecule drugs, wherein the activity value prediction module: selects the most reliable prediction data sets using the reliability index, averages their activity values to obtain the final predicted biological activity value, and evaluates the reliability of the final predicted value using the reliability index.