CN115274007A - Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound - Google Patents

Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Info

Publication number
CN115274007A
CN115274007A (application number CN202210903698.2A)
Authority
CN
China
Prior art keywords
molecular
activity
atom
molecule
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210903698.2A
Other languages
Chinese (zh)
Inventor
殷越铭
胡海峰
吴建盛
杨季涛
叶春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202210903698.2A priority Critical patent/CN115274007A/en
Publication of CN115274007A publication Critical patent/CN115274007A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B 15/30 Drug targeting using structural data; Docking or binding prediction

Abstract

The invention provides a generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds. Various classification and regression attributes of molecules are first collected from public databases such as PubChem; the atom and bond features of each molecule are then quantized, encoded and extracted by an attention neural network, and generalizable classification and regression learning is performed on the molecular attributes. A molecular graph attention reconstruction module then rebuilds the molecular graph structure from the extracted molecular features. Finally, an adversarial generative model detects the key perturbation direction of the molecular features through gradients, generates new molecular features along that direction, feeds them to the reconstruction module, and outputs the molecular optimization result. The method unifies AI-based prediction and optimization of molecular attributes and can improve the efficiency and success rate of new drug discovery and design.

Description

Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound
Technical Field
The invention belongs to the interdisciplinary field of computer technology, information technology, data mining and biomedicine, and relates to a generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds.
Background
With the development of deep learning in drug discovery, the generalization and interpretability of molecular activity regression have become key issues of widespread concern. Deep learning models use huge numbers of parameters to fit the structure-function relationships of molecules in training data drawn from a specific distribution, and they generalize poorly to molecules outside that distribution. At the same time, the interpretability of deep learning models built on molecular structure-activity relationships has attracted great attention. The invention defines an adaptive learning rate from the convergence bound of the loss function, which significantly improves the generalization of a model for discovering and optimizing drug lead compounds, and deeply mines the interpretability of molecular activity prediction through molecule generation.
There are many ways to generate high-activity molecules; the most basic and most valuable reference is the matched molecular pair on an activity cliff, known as an MMP-Cliff. An MMP-Cliff is a pair of molecules with only a slight structural difference but a significant difference in molecular properties. MMP-Cliffs generally carry a high content of structure-activity relationship information and inspire medicinal chemists to discover and design highly potent molecules. They are widely used in medicinal chemistry to study changes in compound properties, including bioactivity, toxicity and environmental hazards. Existing MMP-Cliff analysis methods mainly answer three questions: how to identify MMP-Cliffs, how to predict them, and how to optimize molecules based on them. However, these methods are still confined to the MMP-Cliffs already present in existing molecular libraries and cannot generate novel MMP-Cliffs to extend the idea of molecular optimization. The invention generates MMP-Cliffs effectively by designing an adversarial learning scheme, a graph reconstruction algorithm and a molecule generation logic, providing important guidance for molecular optimization.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a generalizable and interpretable depth map learning technique suitable for discovering and optimizing drug lead compounds: adversarial learning and molecule generation techniques are applied within a depth map learning framework, a graph learning algorithm based on generative adversarial subspace enhancement is constructed, and molecular embedding features are selected along the adversarial direction to generate high-activity molecules on activity cliffs. The method effectively improves the generalization and interpretability of the efficacy prediction model and generates lead compounds with greatly improved efficacy.
The technical solution adopted by the invention to solve this problem is a generalizable and interpretable depth map learning technique for discovering and optimizing drug lead compounds, comprising the following steps:
Step 001. Collect molecular activity samples of N GPCR targets from the GLASS molecular activity database, and classify the obtained samples, according to the structural difference and the activity difference between every two molecules, into a set of matched molecular pairs on activity cliffs {(x_i^hi, x_i^lo)} and a set of matched molecular pairs not on activity cliffs, where x_i^hi denotes a highly active molecule and x_i^lo denotes a molecule of low activity. Then go to step 002;
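As an illustration of how the cliff / non-cliff partition of step 001 might be carried out, the sketch below pairs molecules by fingerprint similarity and activity gap. The Tanimoto-similarity and activity-difference thresholds, the Morgan-fingerprint settings and the helper names are illustrative assumptions, not values taken from this disclosure.

```python
# Illustrative sketch of step 001: partition molecule pairs into activity-cliff
# and non-cliff sets. SIM_MIN and DELTA_MIN are assumed thresholds.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

SIM_MIN = 0.9     # assumed structural-similarity threshold (Tanimoto)
DELTA_MIN = 2.0   # assumed activity gap in log-activity units

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def split_cliff_pairs(samples):
    """samples: list of (smiles, activity). Returns (cliff_pairs, non_cliff_pairs),
    each pair ordered as (high-activity molecule, low-activity molecule)."""
    cliff, non_cliff = [], []
    fps = {s: fingerprint(s) for s, _ in samples}
    for (s1, y1), (s2, y2) in combinations(samples, 2):
        if DataStructs.TanimotoSimilarity(fps[s1], fps[s2]) < SIM_MIN:
            continue  # structurally too different to count as a matched pair
        hi, lo = ((s1, y1), (s2, y2)) if y1 >= y2 else ((s2, y2), (s1, y1))
        (cliff if abs(y1 - y2) >= DELTA_MIN else non_cliff).append((hi, lo))
    return cliff, non_cliff
```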
Step 002. Construct training and test data sets from the molecular activity samples of the N GPCR targets: a training set D_tr and a test set D_te. A validation set D_val is randomly split off from the training set at ratio n so that the three sets satisfy the required partition. Set the maximum number of iteration steps e_max for which the validation performance of the model is allowed to degrade, and go to step 003;
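The ratio n and the budget e_max of step 002 can be read as an ordinary hold-out split plus an early-stopping patience counter. The sketch below assumes that reading; the ratio, seed and class names are placeholders.

```python
# Illustrative sketch of step 002: hold out a validation set at ratio n and use
# e_max as an early-stopping patience budget.
import random

def split_train_val(train_samples, n=0.1, seed=0):
    rng = random.Random(seed)
    shuffled = train_samples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * n)
    return shuffled[n_val:], shuffled[:n_val]      # (training set, validation set)

class EarlyStop:
    """Signals a stop once the validation error has not improved for e_max checks."""
    def __init__(self, e_max):
        self.e_max, self.best, self.bad_steps = e_max, float("inf"), 0
    def update(self, val_error):
        if val_error < self.best:
            self.best, self.bad_steps = val_error, 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.e_max        # True -> proceed to step 010
```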
Step 003. On the training-set molecular activity samples (x_i, y_i) in D_tr, construct a depth map neural network E and a neural network N with a dual feature space, and train their model parameters θ_E and θ_N with a mean-square-error loss function L_y, where x denotes a molecular structure sample, y denotes a molecular activity sample, and N(E(x_i)) is the molecular activity predicted from the molecular structure x_i. Then go to step 004;
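A minimal sketch of the joint training in step 003, assuming a stand-in encoder E (the disclosed E is a graph attention network over atoms and bonds, abbreviated here to an MLP over precomputed molecular features) and a predictor N with two hidden branches standing in for the dual feature space with parameters W_1, W_2 and output vector a:

```python
# Illustrative sketch of step 003: joint training of encoder E and predictor N
# with the mean-square-error loss L_y = mean((N(E(x)) - y)^2).
import torch
import torch.nn as nn

class Encoder(nn.Module):                 # stand-in for the graph encoder E
    def __init__(self, in_dim=2048, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))
    def forward(self, x):
        return self.net(x)

class DualSpacePredictor(nn.Module):      # stand-in for N with two hidden branches
    def __init__(self, emb_dim=128, hid=64):
        super().__init__()
        self.w1 = nn.Linear(emb_dim, hid)     # hidden-layer parameters W_1
        self.w2 = nn.Linear(emb_dim, hid)     # hidden-layer parameters W_2
        self.a = nn.Linear(2 * hid, 1)        # output parameter vector a
    def forward(self, f):
        h = torch.cat([torch.relu(self.w1(f)), torch.relu(self.w2(f))], dim=-1)
        return self.a(h).squeeze(-1)

E, N = Encoder(), DualSpacePredictor()
opt = torch.optim.Adam(list(E.parameters()) + list(N.parameters()), lr=1e-3)

def train_step(x, y):                     # x: [batch, 2048] features, y: [batch] activities
    opt.zero_grad()
    loss = nn.functional.mse_loss(N(E(x)), y)
    loss.backward()
    opt.step()
    return loss.item()
```

A loop calling train_step on mini-batches would realize step 003; steps 004 and 005 then modify this loop with the adversarial loss and the adaptive learning rate.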
Step 004. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, define an adversarial feature-subspace enhancement loss function L_a and train the model parameters θ_E and θ_N, where f = E(x_i) denotes the embedded feature vector of a molecule, d = λ·g/‖g‖_2 is the normalized adversarial perturbation vector with step size λ, g is the non-normalized adversarial perturbation vector defined by the gradient of the activity-difference metric function D, r is a random vector with norm less than ε, σ denotes the activation function, and γ denotes a normalization constant. Then go to step 005;
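The adversarial direction of step 004 can be sketched as a normalized gradient step in embedding space. Using the predictor N itself inside the difference measure, and evaluating the gradient at f + r, are assumptions made for illustration only:

```python
# Illustrative sketch of step 004: d = lambda * g / ||g||_2, where g is the
# gradient of an activity-difference measure probed near the embedding f.
import torch

def adversarial_perturbation(N, f, lam=0.1, eps=1e-3):
    r = torch.randn_like(f)
    r = eps * r / (r.norm(dim=-1, keepdim=True) + 1e-12)       # random vector, ||r|| < eps
    probe = (f + r).detach().requires_grad_(True)
    diff = (N(probe) - N(f).detach()).abs().sum()              # stand-in for the difference function D
    g, = torch.autograd.grad(diff, probe)                      # non-normalized direction g
    return lam * g / (g.norm(dim=-1, keepdim=True) + 1e-12)    # normalized perturbation d
```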
Step 005. From the mean-square-error loss function L_y and the adversarial feature-subspace enhancement loss function L_a, define an adaptive learning rate η, where η* = 1/(β_y + β_a) denotes the maximum learning rate that guarantees simultaneous convergence of L_y and L_a, β_y = 4·(a^T × [W_1, W_2] × [1,1])^4 and β_a = (24·(a^T × [W_1, W_2] × [0,1])^2 / (a^T × [W_1, W_2] × [1,1]))^2 are the learning rates that guarantee convergence of L_y and L_a respectively, W_1 and W_2 denote the hidden-layer parameter matrices of the dual-feature-space neural network N, a denotes the output parameter vector of N, 0 and 1 denote the all-zero and all-one vectors respectively, η_max is the maximum learning rate that keeps the model parameters stable, and α is the decay coefficient of the gradient penalty term. Then go to step 006;
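Reading the garbled translation of step 005 as η* = 1/(β_y + β_a), clipped at η_max, gives the sketch below; the exact combination of β_y and β_a, the meaning of the [0,1] selector and the handling of the decay α are assumptions, and the code reuses the stand-in predictor N from the step 003 sketch:

```python
# Illustrative sketch of step 005, assuming eta* = 1 / (beta_y + beta_a) with
#   beta_y = 4 * (a^T [W1, W2] 1)^4
#   beta_a = (24 * (a^T [W1, W2] s)^2 / (a^T [W1, W2] 1))^2
# where s is read here as a 0/1 selector vector; this reading and the clipping
# at eta_max are assumptions.
import torch

def adaptive_lr(N, eta_max=1e-2):
    with torch.no_grad():
        W = torch.cat([N.w1.weight, N.w2.weight], dim=0)   # stacked [W_1, W_2]
        a = N.a.weight.view(-1)                            # output parameter vector a
        ones = torch.ones(W.shape[1])
        sel = torch.cat([torch.zeros(W.shape[1] // 2),     # assumed [0, 1]-style selector
                         torch.ones(W.shape[1] - W.shape[1] // 2)])
        s1 = (a @ W @ ones).abs()
        s2 = (a @ W @ sel).abs()
        beta_y = 4.0 * s1 ** 4
        beta_a = (24.0 * s2 ** 2 / (s1 + 1e-12)) ** 2
        return min((1.0 / (beta_y + beta_a)).item(), eta_max)
```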
Step 006. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, construct a graph attention reconstruction neural network G and define a molecular reconstruction loss function L_rec, where a_i and b_i,j denote the feature vectors of the true atoms and bonds of a molecule and N_mol denotes the total number of molecular samples. Go to step 007;
Step 007. For the reconstructed molecular features G(f), define a chemical-rule loss function L_chem, where the indicator I(G(f)) = 1 means that the reconstructed molecule does not satisfy the chemical rules. Then go to step 008;
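A possible concrete form of the indicator I(·) in step 007 is an RDKit sanitization check; equating "does not satisfy the chemical rules" with "fails RDKit valence/aromaticity sanitization" is an assumption of this sketch:

```python
# Illustrative sketch of step 007's indicator I(.): returns 1 when a reconstructed
# molecule violates basic chemical rules (RDKit sanitization stands in for the
# unspecified "chemical specification").
from rdkit import Chem

def violates_chemical_rules(smiles: str) -> int:
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return 1
    try:
        Chem.SanitizeMol(mol)      # raises on valence or aromaticity violations
        return 0
    except Exception:
        return 1
```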
Step 008. Generate new molecular features from the molecular embedding feature f and the adversarial perturbation vector d: {a_i*, b_i,j*} = G(f + d). From {a_i*, b_i,j*} and the initial molecular structure x_i, restore the new molecular structure x_i*, and go to step 009;
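Step 008 itself reduces to a single perturbed decoding call; assuming that G returns an (atom features, bond features) pair follows the notation {a_i*, b_i,j*} = G(f + d):

```python
# Illustrative sketch of step 008: shift the molecular embedding along the
# adversarial direction and decode it back to atom and bond features with G.
def generate_new_features(G, f, d):
    new_atom_feats, new_bond_feats = G(f + d)   # {a_i*, b_i,j*} = G(f + d)
    return new_atom_feats, new_bond_feats       # later restored to x_i* against x_i
```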
Step 009. For the molecular activity samples (x_k, y_k) of the validation set, compute the validation performance of the model: the Pearson correlation coefficient, mean square error and mean absolute error between the predicted activity N(E(x_k)) and y_k, together with the reconstruction error L_rec. Check how many iteration steps separate the current total error from the lowest historical total error: if the gap is less than e_max, return to step 003; otherwise, go to step 010;
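The validation metrics of step 009 are standard and can be computed as below (the reconstruction error L_rec, which comes from the graph decoder, is omitted from this sketch):

```python
# Illustrative sketch of the step 009 validation metrics.
import numpy as np
from scipy.stats import pearsonr

def validation_metrics(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    pcc, _ = pearsonr(y_pred, y_true)
    return {
        "pearson": float(pcc),
        "mse": float(np.mean((y_pred - y_true) ** 2)),
        "mae": float(np.mean(np.abs(y_pred - y_true))),
    }
```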
Step 010. Take the molecular activities y* = N(E(x)) predicted by the model at the point where the historical total error is lowest, screen out highly active lead compounds, and obtain the new molecular structures x* generated according to step 008. Then go to step 011;
Step 011. On the test-set molecular activity samples (x, y), compare the molecular activity y* predicted by the model with y to estimate the accuracy of the model in predicting molecular activity, and compare the rate at which the new molecular structures x* coincide with x to estimate the success rate of generating matched molecular pairs on activity cliffs. Then go to step 012;
Step 012. Use the model to predict the molecular activity of a specific target over a large molecular database, screen out high-activity molecules, and, starting from these high-activity molecules, generate optimized new molecules with the model to obtain lead compounds with greatly improved efficacy.
2. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the matched molecular pair on an activity cliff in step 001 is defined as two molecules with a small structural difference but a large activity difference, the small structural difference including a single-atom difference, a single-pharmacophore difference or a single-substructure difference.
3. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein in step 005 adaptive learning-rate adjustment is achieved by deriving a theoretical convergence bound, which effectively improves the generalization of the depth map learning method for discovering and optimizing drug lead compounds.
4. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the graph attention reconstruction neural network G constructed in step 006 operates through the following steps:
Step 00601. From the molecular embedding feature f and the embedding feature h_i of each atom, compute the component of the molecular feature on each atom and normalize it with the Softmax function. Set the number of molecular feature distribution rounds T, and go to step 00602;
Step 00602. Obtain the hidden-layer feature vector of each atom used to reconstruct the molecule by computing the vector sum of the molecular feature component and the atomic feature. Let t ← T and go to step 00603;
Step 00603. Through a fully connected neural network with weight matrix W_3 and bias vector c_3, a dropout layer and an ELU activation function, obtain the relation feature vector of the molecule to each atom by taking the inner product with the attention weight vector; then go to step 00604;
Step 00604. Using a GRU and a ReLU, back-derive the hidden-layer feature on each atom from the distributed context information. Let t ← t - 1 and test t: if t > 0, return to step 00603; otherwise, go to step 00605;
Step 00605. Set the number of atom feature distribution rounds L and let l ← L. Replace the molecular supernode with the adjacent atoms, obtain the attention weights of the adjacent atoms by activating the attention weight vector, and let the adjacent atoms form a weighted sum of the context information; then go to step 00606;
Step 00606. With a re-initialized neural network, test l: if l > 0, apply steps 00603 and 00604 to the hidden-layer features of all atoms to obtain the atomic features of the last hidden layer and let l ← l - 1; otherwise, go to step 00607;
Step 00607. Through a fully connected neural network with weight matrix W_4 and bias vector c_4 and an activation function φ_a, infer the initial features of each atom from the initial hidden-layer features; and through a fully connected neural network with weight matrix W_5 and bias vector c_5, an activation function φ_b and a leaky ReLU function, back-derive the initial features of each bond from the initial hidden-layer features of the two atoms it connects.
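A compressed sketch of the reconstruction network G described in steps 00601 to 00607: the molecular embedding f is distributed over the atom embeddings h_i by softmax attention, refined with a GRU cell for T rounds, and mapped back to initial atom features. Layer sizes, the number of rounds, and the omission of the bond-feature branch of step 00607 and of the neighbour-atom stage of steps 00605 and 00606 are simplifying assumptions:

```python
# Illustrative sketch of the graph attention reconstruction network G.
import torch
import torch.nn as nn

class GraphAttnReconstructor(nn.Module):
    def __init__(self, emb_dim=128, atom_feat_dim=42, T=3):
        super().__init__()
        self.T = T
        self.attn = nn.Linear(2 * emb_dim, 1)             # attention over (f, h_i) pairs
        self.relate = nn.Sequential(nn.Linear(2 * emb_dim, emb_dim),
                                    nn.Dropout(0.1), nn.ELU())
        self.gru = nn.GRUCell(emb_dim, emb_dim)           # back-derives hidden states
        self.to_atom = nn.Linear(emb_dim, atom_feat_dim)  # initial atom features (bond branch omitted)

    def forward(self, f, h):                              # f: [emb], h: [n_atoms, emb]
        pair = torch.cat([f.expand_as(h), h], dim=-1)
        w = torch.softmax(self.attn(pair).squeeze(-1), dim=0)   # step 00601
        hidden = h + w.unsqueeze(-1) * f                         # step 00602
        for _ in range(self.T):                                  # steps 00603-00604
            ctx = self.relate(torch.cat([f.expand_as(hidden), hidden], dim=-1))
            hidden = self.gru(ctx, hidden)
        return self.to_atom(hidden)                              # reconstructed atom features

# usage for one molecule: G = GraphAttnReconstructor(); atom_feats = G(f, h)
```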
5. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein restoring the new molecular structure x_i* from {a_i*, b_i,j*} and the initial molecular structure x_i in step 008 comprises the following steps:
Step 00801. Let P(s|a) be the posterior probability that the a-th atom is predicted to be the s-th chemical element, estimated from the molecular embedding feature f, the atom embedding set, the graph generator G and the adversarial perturbation vector d; then go to step 00802;
Step 00802. Determine the key atom position a* according to the maximum a-posteriori probability criterion, where the candidate set consists of the elements whose probability, predicted by the generator G from the original molecular feature f at atom position a, exceeds the threshold P_0, after eliminating the influence of the element already present and of easily confused elements at that atom position. Go to step 00803;
Step 00803. Determine the replacement element s* according to the maximum a-posteriori probability criterion and the activation threshold P_0, change the element at position a* of the initial molecular structure x_i to s*, and obtain the new molecular structure x_i*.
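The maximum-a-posteriori substitution of steps 00801 to 00803 can be sketched as a masked argmax over a per-atom element-probability matrix; how P(s|a) is produced by the generator G is left abstract here, and the exclusion sets and threshold P_0 are passed in by the caller:

```python
# Illustrative sketch of steps 00801-00803: choose atom position a* and
# replacement element s* by a thresholded maximum-a-posteriori rule.
import numpy as np

def pick_substitution(posterior, current_elements, excluded, P0=0.5):
    """posterior: [n_atoms, n_elements] numpy array of P(s|a);
    current_elements: element index currently at each atom position;
    excluded: dict atom_index -> set of element indices to rule out."""
    p = posterior.copy()
    for a, s_cur in enumerate(current_elements):
        p[a, s_cur] = -np.inf                      # never "replace" with the same element
        for s in excluded.get(a, ()):
            p[a, s] = -np.inf                      # easily confused / forbidden elements
    p[p < P0] = -np.inf                            # activation threshold P_0
    if not np.isfinite(p).any():
        return None                                # no admissible substitution
    a_star, s_star = np.unravel_index(np.argmax(p), p.shape)
    return int(a_star), int(s_star)                # modify atom a* to element s*
```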
The invention further provides a generalizable and interpretable depth map learning system for discovering and optimizing drug lead compounds, which consists of a data preprocessing module, a feature extractor, a classification learner, a regression learner and a molecule generator. The feature extractor comprises a quantitative encoding module for the atom and bond information in molecules and a graph attention neural network module. The classification and regression learners each comprise a training module and a generalization optimization module: every classification and regression attribute of the molecules from PubChem is fed to the training module to obtain model parameters, and the generalization optimization module is used to enhance the generalization of the model. The molecule generator comprises a reconstruction module, an adversarial generation module and a control module. The reconstruction module receives the molecule and atom embedding features submitted by the feature extractor and restores the input molecular structure through a graph attention reconstruction network. The model in the adversarial generation module detects the key direction of the embedded features through gradients, generates new molecular embedding features along that direction, feeds them to the reconstruction module, and outputs the molecule generation result. An operator sets the directional optimization attributes in the control module as required, and molecules with the specified attribute improvement are generated under the supervision of the control module.
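One way to picture how these modules hand data to each other is the wiring sketch below; every class and method name in it is an illustrative placeholder rather than part of the disclosed system:

```python
# Illustrative sketch of the module wiring described above.
class LeadOptimizationPipeline:
    def __init__(self, extractor, predictor, reconstructor, perturber):
        self.extractor = extractor          # atom/bond encoding + graph attention
        self.predictor = predictor          # classification / regression learner
        self.reconstructor = reconstructor  # graph attention reconstruction module
        self.perturber = perturber          # adversarial generation module

    def optimize(self, molecule, target_property):
        f = self.extractor(molecule)
        d = self.perturber(f, target_property)   # key perturbation direction
        return self.reconstructor(f + d)          # optimized molecule
```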
Compared with the prior art, the invention has the following beneficial effects:
1. The system provided by the invention has strong adaptability, high generalization, strong interpretability and high practicability, and is of great application value.
2. The invention is suitable for predicting each attribute of drug lead compounds during discovery and optimization and for generating molecules in a directed manner, including the bioactivity values of molecules against various targets and the absorption, distribution, metabolism, excretion and toxicity indexes of finished drugs.
3. With the graph attention reconstruction neural network and the molecule generation logic, the method provided by the invention can effectively generate molecules along a specified optimization direction, has the potential to discover specific drugs, and offers high social and commercial benefit.
4. The adaptive learning-rate adjustment provided by the invention has a theoretical convergence-bound guarantee and works effectively in a wide range of scenarios.
Drawings
FIG. 1 is a schematic diagram of a generalizable and interpretable depth map learning system for discovering and optimizing drug lead compounds in accordance with the present invention.
FIG. 2 is a flow chart of a molecular embedding feature extraction algorithm.
FIG. 3 is a flow chart of a molecular graph attention reconstruction algorithm.
FIG. 4 is a flow chart of a directed property molecule generation algorithm.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the accompanying drawings.
A generalizable and interpretable depth map learning system for discovering and optimizing drug lead compounds is shown in FIG. 1. The system consists of a data preprocessing module, a feature extractor, a classification learner, a regression learner and a molecule generator. The feature extractor comprises a quantitative encoding module for the atom and bond information in molecules and a graph attention neural network module. The classification and regression learners each comprise a training module and a generalization optimization module: every classification and regression attribute of the molecules from PubChem is fed to the training module to obtain model parameters, and the generalization optimization module is used to enhance the generalization of the model. The molecule generator comprises a reconstruction module, an adversarial generation module and a control module. The reconstruction module receives the molecule and atom embedding features submitted by the feature extractor and restores the input molecular structure through a graph attention reconstruction network. The model in the adversarial generation module detects the key direction of the embedded features through gradients, generates new molecular embedding features along that direction, feeds them to the reconstruction module, and outputs the molecule generation result. An operator sets the directional optimization attributes in the control module as required, and molecules with the specified attribute improvement are generated under the supervision of the control module.
As shown in FIGS. 2-4, the generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds comprises the following steps:
Step 001. Collect activity samples for each attribute of the molecules of N GPCR targets from the PubChem molecular activity database, and classify the obtained samples, according to the structural difference and the activity difference between every two molecules, into a set of matched molecular pairs on activity cliffs {(x_i^hi, x_i^lo)} and a set of matched molecular pairs not on activity cliffs, where x_i^hi denotes a highly active molecule and x_i^lo denotes a molecule of low activity. Then go to step 002;
Step 002. Construct training and test data sets from the molecular activity samples of the N GPCR targets: a training set D_tr and a test set D_te. A validation set D_val is randomly split off from the training set at ratio n so that the three sets satisfy the required partition. Set the maximum number of iteration steps e_max for which the validation performance of the model is allowed to degrade, and go to step 003;
Step 003. On the training-set molecular activity samples (x_i, y_i) in D_tr, construct a depth map neural network E and a neural network N with a dual feature space, and train them with a mean-square-error loss function L_y; the flow is shown in FIG. 2. Then go to step 004;
Step 004. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, define an adversarial feature-subspace enhancement loss function L_a and train the model parameters θ_E and θ_N, where f = E(x_i) denotes the embedded feature vector of a molecule, d = λ·g/‖g‖_2 is the normalized adversarial perturbation vector with step size λ, g is the non-normalized adversarial perturbation vector defined by the gradient of the activity-difference metric function D, r is a random vector with norm less than ε, σ denotes the activation function, and γ denotes a normalization constant. Then go to step 005;
Step 005. From the mean-square-error loss function L_y and the adversarial feature-subspace enhancement loss function L_a, define an adaptive learning rate η, where η* = 1/(β_y + β_a) denotes the maximum learning rate that guarantees simultaneous convergence of L_y and L_a, β_y = 4·(a^T × [W_1, W_2] × [1,1])^4 and β_a = (24·(a^T × [W_1, W_2] × [0,1])^2 / (a^T × [W_1, W_2] × [1,1]))^2 are the learning rates that guarantee convergence of L_y and L_a respectively, W_1 and W_2 denote the hidden-layer parameter matrices of the dual-feature-space neural network N, a denotes the output parameter vector of N, 0 and 1 denote the all-zero and all-one vectors respectively, η_max is the maximum learning rate that keeps the model parameters stable, and α is the decay coefficient of the gradient penalty term. Then go to step 006;
Step 006. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, construct a graph attention reconstruction neural network G and define a molecular reconstruction loss function L_rec; the algorithm flow is shown in FIG. 3, where a_i and b_i,j denote the feature vectors of the true atoms and bonds of a molecule and N_mol denotes the total number of molecular samples. Go to step 007;
Step 007. For the reconstructed molecular features G(f), define a chemical-rule loss function L_chem, where the indicator I(G(f)) = 1 means that the reconstructed molecule does not satisfy the chemical rules. Then go to step 008;
Step 008. Generate new molecular features from the molecular embedding feature f and the adversarial perturbation vector d: {a_i*, b_i,j*} = G(f + d). From {a_i*, b_i,j*} and the initial molecular structure x_i, restore the new molecular structure x_i*, and go to step 009;
Step 009. For the molecular activity samples (x_k, y_k) of the validation set, compute the validation performance of the model: the Pearson correlation coefficient, mean square error and mean absolute error between the predicted activity N(E(x_k)) and y_k, together with the reconstruction error L_rec. Check how many iteration steps separate the current total error from the lowest historical total error: if the gap is less than e_max, return to step 003; otherwise, go to step 010;
Step 010. Take the molecular activities y* = N(E(x)) predicted by the model at the point where the historical total error is lowest, screen out highly active lead compounds, and obtain the new molecular structures x* generated according to step 008; the molecule generation algorithm flow is shown in FIG. 4. Then go to step 011;
Step 011. On the test-set molecular activity samples (x, y), compare the molecular activity y* predicted by the model with y to estimate the accuracy of the model in predicting molecular activity, and compare the rate at which the new molecular structures x* coincide with x to estimate the success rate of generating matched molecular pairs on activity cliffs. Then go to step 012;
Step 012. Use the model to predict the molecular activity of a specific target over a large molecular database, screen out high-activity molecules, and, starting from these high-activity molecules, generate optimized new molecules with the model to obtain lead compounds with greatly improved efficacy.
2. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the matched molecular pair on an activity cliff in step 001 is defined as two molecules with a small structural difference but a large activity difference, the small structural difference including a single-atom difference, a single-pharmacophore difference or a single-substructure difference.
3. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein in step 005 adaptive learning-rate adjustment is achieved by deriving a theoretical convergence bound, which effectively improves the generalization of the depth map learning method for discovering and optimizing drug lead compounds.
4. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the graph attention reconstruction neural network G constructed in step 006 operates through the following steps:
Step 00601. From the molecular embedding feature f and the embedding feature h_i of each atom, compute the component of the molecular feature on each atom and normalize it with the Softmax function. Set the number of molecular feature distribution rounds T, and go to step 00602;
Step 00602. Obtain the hidden-layer feature vector of each atom used to reconstruct the molecule by computing the vector sum of the molecular feature component and the atomic feature. Let t ← T and go to step 00603;
Step 00603. Through a fully connected neural network with weight matrix W_3 and bias vector c_3, a dropout layer and an ELU activation function, obtain the relation feature vector of the molecule to each atom by taking the inner product with the attention weight vector; then go to step 00604;
Step 00604. Using a GRU and a ReLU, back-derive the hidden-layer feature on each atom from the distributed context information. Let t ← t - 1 and test t: if t > 0, return to step 00603; otherwise, go to step 00605;
Step 00605. Set the number of atom feature distribution rounds L and let l ← L. Replace the molecular supernode with the adjacent atoms, obtain the attention weights of the adjacent atoms by activating the attention weight vector, and let the adjacent atoms form a weighted sum of the context information; then go to step 00606;
Step 00606. With a re-initialized neural network, test l: if l > 0, apply steps 00603 and 00604 to the hidden-layer features of all atoms to obtain the atomic features of the last hidden layer and let l ← l - 1; otherwise, go to step 00607;
Step 00607. Through a fully connected neural network with weight matrix W_4 and bias vector c_4 and an activation function φ_a, infer the initial features of each atom from the initial hidden-layer features; and through a fully connected neural network with weight matrix W_5 and bias vector c_5, an activation function φ_b and a leaky ReLU function, back-derive the initial features of each bond from the initial hidden-layer features of the two atoms it connects.
6. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein restoring the new molecular structure x_i* from {a_i*, b_i,j*} and the initial molecular structure x_i in step 008 comprises the following steps:
Step 00801. Let P(s|a) be the posterior probability that the a-th atom is predicted to be the s-th chemical element, estimated from the molecular embedding feature f, the atom embedding set, the graph generator G and the adversarial perturbation vector d; then go to step 00802;
Step 00802. Determine the key atom position a* according to the maximum a-posteriori probability criterion, where the candidate set consists of the elements whose probability, predicted by the generator G from the original molecular feature f at atom position a, exceeds the threshold P_0, after eliminating the influence of the element already present and of easily confused elements at that atom position. Go to step 00803;
Step 00803. Determine the replacement element s* according to the maximum a-posteriori probability criterion and the activation threshold P_0, change the element at position a* of the initial molecular structure x_i to s*, and obtain the new molecular structure x_i*.

Claims (5)

1. A generalizable and interpretable depth map learning method for discovering and optimizing a drug lead compound, comprising the following steps:
Step 001. Collect molecular activity samples of N GPCR targets from the GLASS molecular activity database, and classify the obtained samples, according to the structural difference and the activity difference between every two molecules, into a set of matched molecular pairs on activity cliffs {(x_i^hi, x_i^lo)} and a set of matched molecular pairs not on activity cliffs, where x_i^hi denotes a highly active molecule and x_i^lo denotes a molecule of low activity. Then go to step 002;
Step 002. Construct training and test data sets from the molecular activity samples of the N GPCR targets: a training set D_tr and a test set D_te. A validation set D_val is randomly split off from the training set at ratio n so that the three sets satisfy the required partition. Set the maximum number of iteration steps e_max for which the validation performance of the model is allowed to degrade, and go to step 003;
Step 003. On the training-set molecular activity samples (x_i, y_i) in D_tr, construct a depth map neural network E and a neural network N with a dual feature space, and train their model parameters θ_E and θ_N with a mean-square-error loss function L_y, where x denotes a molecular structure sample, y denotes a molecular activity sample, and N(E(x_i)) is the molecular activity predicted from the molecular structure x_i. Then go to step 004;
Step 004. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, define an adversarial feature-subspace enhancement loss function L_a and train the model parameters θ_E and θ_N, where f = E(x_i) denotes the embedded feature vector of a molecule, d = λ·g/‖g‖_2 is the normalized adversarial perturbation vector with step size λ, g is the non-normalized adversarial perturbation vector defined by the gradient of the activity-difference metric function D, r is a random vector with norm less than ε, σ denotes the activation function, and γ denotes a normalization constant. Then go to step 005;
Step 005. From the mean-square-error loss function L_y and the adversarial feature-subspace enhancement loss function L_a, define an adaptive learning rate η, where η* = 1/(β_y + β_a) denotes the maximum learning rate that guarantees simultaneous convergence of L_y and L_a, β_y = 4·(a^T × [W_1, W_2] × [1,1])^4 and β_a = (24·(a^T × [W_1, W_2] × [0,1])^2 / (a^T × [W_1, W_2] × [1,1]))^2 are the learning rates that guarantee convergence of L_y and L_a respectively, W_1 and W_2 denote the hidden-layer parameter matrices of the dual-feature-space neural network N, a denotes the output parameter vector of N, 0 and 1 denote the all-zero and all-one vectors respectively, η_max is the maximum learning rate that keeps the model parameters stable, and α is the decay coefficient of the gradient penalty term. Then go to step 006;
Step 006. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, construct a graph attention reconstruction neural network G and define a molecular reconstruction loss function L_rec, where a_i and b_i,j denote the feature vectors of the true atoms and bonds of a molecule and N_mol denotes the total number of molecular samples. Go to step 007;
Step 007. For the reconstructed molecular features G(f), define a chemical-rule loss function L_chem, where I(G(f)) = 1 indicates that the reconstructed molecule does not meet the chemical rules. Then go to step 008;
Step 008. Generate new molecular features from the molecular embedding feature f and the adversarial perturbation vector d: {a_i*, b_i,j*} = G(f + d). From {a_i*, b_i,j*} and the initial molecular structure x_i, restore the new molecular structure x_i*, and go to step 009;
Step 009. For the molecular activity samples (x_k, y_k) of the validation set, compute the validation performance of the model: the Pearson correlation coefficient, mean square error and mean absolute error between the predicted activity N(E(x_k)) and y_k, together with the reconstruction error L_rec. Check how many iteration steps separate the current total error from the lowest historical total error: if the gap is less than e_max, return to step 003; otherwise, go to step 010;
Step 010. Take the molecular activities y* = N(E(x)) predicted by the model at the point where the historical total error is lowest, screen out highly active lead compounds, and obtain the new molecular structures x* generated according to step 008. Then go to step 011;
Step 011. On the test-set molecular activity samples (x, y), compare the molecular activity y* predicted by the model with y to estimate the accuracy of the model in predicting molecular activity, and compare the rate at which the new molecular structures x* coincide with x to estimate the success rate of generating matched molecular pairs on activity cliffs. Then go to step 012;
Step 012. Use the model to predict the molecular activity of a specific target over a large molecular database, screen out high-activity molecules, and, starting from these high-activity molecules, generate optimized new molecules with the model to obtain lead compounds with greatly improved efficacy.
2. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the matched molecular pair on an activity cliff in step 001 is defined as two molecules with a small structural difference but a large activity difference, the small structural difference including a single-atom difference, a single-pharmacophore difference or a single-substructure difference.
3. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein in step 005 adaptive learning-rate adjustment is achieved by deriving a theoretical convergence bound, which effectively improves the generalization of the depth map learning method for discovering and optimizing drug lead compounds.
4. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the graph attention reconstruction neural network G constructed in step 006 operates through the following steps:
Step 00601. From the molecular embedding feature f and the embedding feature h_i of each atom, compute the component of the molecular feature on each atom and normalize it with the Softmax function. Set the number of molecular feature distribution rounds T, and go to step 00602;
Step 00602. Obtain the hidden-layer feature vector of each atom used to reconstruct the molecule by computing the vector sum of the molecular feature component and the atomic feature. Let t ← T and go to step 00603;
Step 00603. Through a fully connected neural network with weight matrix W_3 and bias vector c_3, a dropout layer and an ELU activation function, obtain the relation feature vector of the molecule to each atom by taking the inner product with the attention weight vector; then go to step 00604;
Step 00604. Using a GRU and a ReLU, back-derive the hidden-layer feature on each atom from the distributed context information. Let t ← t - 1 and test t: if t > 0, return to step 00603; otherwise, go to step 00605;
Step 00605. Set the number of atom feature distribution rounds L and let l ← L. Replace the molecular supernode with the adjacent atoms, obtain the attention weights of the adjacent atoms by activating the attention weight vector, and let the adjacent atoms form a weighted sum of the context information; then go to step 00606;
Step 00606. With a re-initialized neural network, test l: if l > 0, apply steps 00603 and 00604 to the hidden-layer features of all atoms to obtain the atomic features of the last hidden layer and let l ← l - 1; otherwise, go to step 00607;
Step 00607. Through a fully connected neural network with weight matrix W_4 and bias vector c_4 and an activation function φ_a, infer the initial features of each atom from the initial hidden-layer features; and through a fully connected neural network with weight matrix W_5 and bias vector c_5, an activation function φ_b and a leaky ReLU function, back-derive the initial features of each bond from the initial hidden-layer features of the two atoms it connects.
5. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein restoring the new molecular structure x_i* from {a_i*, b_i,j*} and the initial molecular structure x_i in step 008 comprises the following steps:
Step 00801. Let P(s|a) be the posterior probability that the a-th atom is predicted to be the s-th chemical element, estimated from the molecular embedding feature f, the atom embedding set, the graph generator G and the adversarial perturbation vector d; then go to step 00802;
Step 00802. Determine the key atom position a* according to the maximum a-posteriori probability criterion, where the candidate set consists of the elements whose probability, predicted by the generator G from the original molecular feature f at atom position a, exceeds the threshold P_0, after eliminating the influence of the element already present and of easily confused elements at that atom position. Go to step 00803;
Step 00803. Determine the replacement element s* according to the maximum a-posteriori probability criterion and the activation threshold P_0, change the element at position a* of the initial molecular structure x_i to s*, and obtain the new molecular structure x_i*.
CN202210903698.2A 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound Pending CN115274007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210903698.2A CN115274007A (en) 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210903698.2A CN115274007A (en) 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Publications (1)

Publication Number Publication Date
CN115274007A true CN115274007A (en) 2022-11-01

Family

ID=83770557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210903698.2A Pending CN115274007A (en) 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Country Status (1)

Country Link
CN (1) CN115274007A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115966266A (en) * 2023-01-06 2023-04-14 东南大学 Anti-tumor molecule strengthening method based on graph neural network
CN116189809A (en) * 2023-01-06 2023-05-30 东南大学 Drug molecule important node prediction method based on challenge resistance
CN115966266B (en) * 2023-01-06 2023-11-17 东南大学 Anti-tumor molecule strengthening method based on graph neural network
CN116189809B (en) * 2023-01-06 2024-01-09 东南大学 Drug molecule important node prediction method based on challenge resistance
CN116312855A (en) * 2023-02-28 2023-06-23 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound
CN116312855B (en) * 2023-02-28 2023-09-08 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound

Similar Documents

Publication Publication Date Title
Li et al. DeepDSC: a deep learning method to predict drug sensitivity of cancer cell lines
CN109492822B (en) Air pollutant concentration time-space domain correlation prediction method
CN115274007A (en) Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound
CN107862173B (en) Virtual screening method and device for lead compound
US7194359B2 (en) Estimating the accuracy of molecular property models and predictions
Giannakoglou et al. Aerodynamic shape design using evolutionary algorithms and new gradient-assisted metamodels
CN110321603A (en) A kind of depth calculation model for Fault Diagnosis of Aircraft Engine Gas Path
CN114022693B (en) Single-cell RNA-seq data clustering method based on double self-supervision
CN112489722B (en) Method and device for predicting binding energy of drug target
Ghadiri et al. BigFCM: Fast, precise and scalable FCM on hadoop
Jin Compositional kernel learning using tree-based genetic programming for Gaussian process regression
Gerber et al. Fast covariance parameter estimation of spatial Gaussian process models using neural networks
CN113241122A (en) Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network
Ma An Efficient Optimization Method for Extreme Learning Machine Using Artificial Bee Colony.
Khalaf et al. Hybridized deep learning model for perfobond rib shear strength connector prediction
Staples et al. Artificial intelligence for bioinformatics: applications in protein folding prediction
Robati et al. Inflation rate modeling: Adaptive neuro-fuzzy inference system approach and particle swarm optimization algorithm (ANFIS-PSO)
CN115982141A (en) Characteristic optimization method for time series data prediction
Regazzoni et al. A physics-informed multi-fidelity approach for the estimation of differential equations parameters in low-data or large-noise regimes
Drakoulas et al. FastSVD-ML–ROM: A reduced-order modeling framework based on machine learning for real-time applications
Hou et al. Estimating elastic parameters from digital rock images based on multi-task learning with multi-gate mixture-of-experts
CN108876038B (en) Big data, artificial intelligence and super calculation synergetic material performance prediction method
Lu et al. Quality-relevant feature extraction method based on teacher-student uncertainty autoencoder and its application to soft sensors
WO2022082739A1 (en) Method for predicting protein and ligand molecule binding free energy on basis of convolutional neural network
Sainsbury-Dale et al. Neural Bayes estimators for irregular spatial data using graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination