CN115274007A - Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound - Google Patents

Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Info

Publication number
CN115274007A
CN115274007A (application number CN202210903698.2A)
Authority
CN
China
Prior art keywords
molecular
activity
atom
molecule
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210903698.2A
Other languages
Chinese (zh)
Inventor
殷越铭
胡海峰
吴建盛
杨季涛
叶春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202210903698.2A priority Critical patent/CN115274007A/en
Publication of CN115274007A publication Critical patent/CN115274007A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B 15/30 Drug targeting using structural data; Docking or binding prediction

Abstract

The invention provides a generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds. Various classification and regression attributes of molecules are first collected from public databases such as PubChem; the atom and bond features of each molecule are then quantized, encoded and extracted by an attention neural network, and generalizable classification and regression learning is performed on the molecular attributes. A molecular graph attention reconstruction module then rebuilds the molecular graph structure from the extracted molecular features. Finally, an adversarial generative model detects the key perturbation direction of the molecular features through gradients, generates new molecular features along that direction, feeds them to the reconstruction module, and outputs the molecular optimization result. The method unifies AI-based prediction and optimization of molecular attributes and can improve the efficiency and success rate of new drug discovery and design.

Description

Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound
Technical Field
The invention belongs to the interdisciplinary field of computer technology, information technology, data mining and biomedicine, and relates to a generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds.
Background
With the development of deep learning in drug discovery, the generalization and interpretability of molecular activity regression have become key issues of widespread concern. Deep learning models use huge numbers of parameters to fit the structure-function relationships of molecules in training data drawn from a specific distribution, and they generalize poorly to molecules outside that distribution. At the same time, the interpretability of deep learning models built on molecular structure-activity relationships has attracted great attention. The invention defines an adaptive learning rate from the convergence bound of the loss function, which significantly improves the generalization of a model for discovering and optimizing drug lead compounds, and deeply mines the interpretability of molecular activity prediction through molecule generation.
There are many ways to generate high-activity molecules; the most basic and most valuable reference is the matched molecular pair on an activity cliff, known as an MMP-Cliff. An MMP-Cliff is a pair of molecules with only a slight structural difference but a significant difference in molecular properties. MMP-Cliffs generally carry a high content of structure-activity relationship information and inspire medicinal chemists to discover and design highly potent molecules. They are widely used in medicinal chemistry to study changes in compound properties, including bioactivity, toxicity and environmental hazards. Existing MMP-Cliff analysis methods mainly answer three questions: how to identify MMP-Cliffs, how to predict them, and how to optimize molecules based on them. However, these methods are still confined to the MMP-Cliffs already present in existing molecular libraries and cannot generate novel MMP-Cliffs to extend the idea of molecular optimization. The invention generates MMP-Cliffs effectively by designing an adversarial learning scheme, a graph reconstruction algorithm and a molecule generation logic, providing important guidance for molecular optimization.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a generalizable and interpretable depth map learning technique suitable for discovering and optimizing drug lead compounds: adversarial learning and molecule generation techniques are applied within a depth map learning framework, a graph learning algorithm based on generative adversarial subspace enhancement is constructed, and molecular embedding features are selected along the adversarial direction to generate high-activity molecules on activity cliffs. The method effectively improves the generalization and interpretability of the efficacy prediction model and generates lead compounds with greatly improved efficacy.
The technical solution adopted by the invention to solve this problem is a generalizable and interpretable depth map learning technique for discovering and optimizing drug lead compounds, comprising the following steps:
Step 001. Collect molecular activity samples of N GPCR targets from the GLASS molecular activity database, and classify the obtained samples, according to the structural difference and the activity difference between every two molecules, into a set of matched molecular pairs on activity cliffs {(x_i^hi, x_i^lo)} and a set of matched molecular pairs not on activity cliffs, where x_i^hi denotes a highly active molecule and x_i^lo denotes a molecule of low activity. Then go to step 002;
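As an illustration of how the cliff / non-cliff partition of step 001 might be carried out, the sketch below pairs molecules by fingerprint similarity and activity gap. The Tanimoto-similarity and activity-difference thresholds, the Morgan-fingerprint settings and the helper names are illustrative assumptions, not values taken from this disclosure.

```python
# Illustrative sketch of step 001: partition molecule pairs into activity-cliff
# and non-cliff sets. SIM_MIN and DELTA_MIN are assumed thresholds.
from itertools import combinations
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

SIM_MIN = 0.9     # assumed structural-similarity threshold (Tanimoto)
DELTA_MIN = 2.0   # assumed activity gap in log-activity units

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def split_cliff_pairs(samples):
    """samples: list of (smiles, activity). Returns (cliff_pairs, non_cliff_pairs),
    each pair ordered as (high-activity molecule, low-activity molecule)."""
    cliff, non_cliff = [], []
    fps = {s: fingerprint(s) for s, _ in samples}
    for (s1, y1), (s2, y2) in combinations(samples, 2):
        if DataStructs.TanimotoSimilarity(fps[s1], fps[s2]) < SIM_MIN:
            continue  # structurally too different to count as a matched pair
        hi, lo = ((s1, y1), (s2, y2)) if y1 >= y2 else ((s2, y2), (s1, y1))
        (cliff if abs(y1 - y2) >= DELTA_MIN else non_cliff).append((hi, lo))
    return cliff, non_cliff
```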
Step 002. Construct training and test data sets from the molecular activity samples of the N GPCR targets: a training set D_tr and a test set D_te. A validation set D_val is randomly split off from the training set at ratio n so that the three sets satisfy the required partition. Set the maximum number of iteration steps e_max for which the validation performance of the model is allowed to degrade, and go to step 003;
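The ratio n and the budget e_max of step 002 can be read as an ordinary hold-out split plus an early-stopping patience counter. The sketch below assumes that reading; the ratio, seed and class names are placeholders.

```python
# Illustrative sketch of step 002: hold out a validation set at ratio n and use
# e_max as an early-stopping patience budget.
import random

def split_train_val(train_samples, n=0.1, seed=0):
    rng = random.Random(seed)
    shuffled = train_samples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * n)
    return shuffled[n_val:], shuffled[:n_val]      # (training set, validation set)

class EarlyStop:
    """Signals a stop once the validation error has not improved for e_max checks."""
    def __init__(self, e_max):
        self.e_max, self.best, self.bad_steps = e_max, float("inf"), 0
    def update(self, val_error):
        if val_error < self.best:
            self.best, self.bad_steps = val_error, 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.e_max        # True -> proceed to step 010
```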
Step 003. On the training-set molecular activity samples (x_i, y_i) in D_tr, construct a depth map neural network E and a neural network N with a dual feature space, and train their model parameters θ_E and θ_N with a mean-square-error loss function L_y, where x denotes a molecular structure sample, y denotes a molecular activity sample, and N(E(x_i)) is the molecular activity predicted from the molecular structure x_i. Then go to step 004;
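A minimal sketch of the joint training in step 003, assuming a stand-in encoder E (the disclosed E is a graph attention network over atoms and bonds, abbreviated here to an MLP over precomputed molecular features) and a predictor N with two hidden branches standing in for the dual feature space with parameters W_1, W_2 and output vector a:

```python
# Illustrative sketch of step 003: joint training of encoder E and predictor N
# with the mean-square-error loss L_y = mean((N(E(x)) - y)^2).
import torch
import torch.nn as nn

class Encoder(nn.Module):                 # stand-in for the graph encoder E
    def __init__(self, in_dim=2048, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))
    def forward(self, x):
        return self.net(x)

class DualSpacePredictor(nn.Module):      # stand-in for N with two hidden branches
    def __init__(self, emb_dim=128, hid=64):
        super().__init__()
        self.w1 = nn.Linear(emb_dim, hid)     # hidden-layer parameters W_1
        self.w2 = nn.Linear(emb_dim, hid)     # hidden-layer parameters W_2
        self.a = nn.Linear(2 * hid, 1)        # output parameter vector a
    def forward(self, f):
        h = torch.cat([torch.relu(self.w1(f)), torch.relu(self.w2(f))], dim=-1)
        return self.a(h).squeeze(-1)

E, N = Encoder(), DualSpacePredictor()
opt = torch.optim.Adam(list(E.parameters()) + list(N.parameters()), lr=1e-3)

def train_step(x, y):                     # x: [batch, 2048] features, y: [batch] activities
    opt.zero_grad()
    loss = nn.functional.mse_loss(N(E(x)), y)
    loss.backward()
    opt.step()
    return loss.item()
```

A loop calling train_step on mini-batches would realize step 003; steps 004 and 005 then modify this loop with the adversarial loss and the adaptive learning rate.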
Step 004. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, define an adversarial feature-subspace enhancement loss function L_a and train the model parameters θ_E and θ_N, where f = E(x_i) denotes the embedded feature vector of a molecule, d = λ·g/‖g‖_2 is the normalized adversarial perturbation vector with step size λ, g is the non-normalized adversarial perturbation vector defined by the gradient of the activity-difference metric function D, r is a random vector with norm less than ε, σ denotes the activation function, and γ denotes a normalization constant. Then go to step 005;
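The adversarial direction of step 004 can be sketched as a normalized gradient step in embedding space. Using the predictor N itself inside the difference measure, and evaluating the gradient at f + r, are assumptions made for illustration only:

```python
# Illustrative sketch of step 004: d = lambda * g / ||g||_2, where g is the
# gradient of an activity-difference measure probed near the embedding f.
import torch

def adversarial_perturbation(N, f, lam=0.1, eps=1e-3):
    r = torch.randn_like(f)
    r = eps * r / (r.norm(dim=-1, keepdim=True) + 1e-12)       # random vector, ||r|| < eps
    probe = (f + r).detach().requires_grad_(True)
    diff = (N(probe) - N(f).detach()).abs().sum()              # stand-in for the difference function D
    g, = torch.autograd.grad(diff, probe)                      # non-normalized direction g
    return lam * g / (g.norm(dim=-1, keepdim=True) + 1e-12)    # normalized perturbation d
```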
Step 005. From the mean-square-error loss function L_y and the adversarial feature-subspace enhancement loss function L_a, define an adaptive learning rate η, where η* = 1/(β_y + β_a) denotes the maximum learning rate that guarantees simultaneous convergence of L_y and L_a, β_y = 4·(a^T × [W_1, W_2] × [1,1])^4 and β_a = (24·(a^T × [W_1, W_2] × [0,1])^2 / (a^T × [W_1, W_2] × [1,1]))^2 are the learning rates that guarantee convergence of L_y and L_a respectively, W_1 and W_2 denote the hidden-layer parameter matrices of the dual-feature-space neural network N, a denotes the output parameter vector of N, 0 and 1 denote the all-zero and all-one vectors respectively, η_max is the maximum learning rate that keeps the model parameters stable, and α is the decay coefficient of the gradient penalty term. Then go to step 006;
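Reading the garbled translation of step 005 as η* = 1/(β_y + β_a), clipped at η_max, gives the sketch below; the exact combination of β_y and β_a, the meaning of the [0,1] selector and the handling of the decay α are assumptions, and the code reuses the stand-in predictor N from the step 003 sketch:

```python
# Illustrative sketch of step 005, assuming eta* = 1 / (beta_y + beta_a) with
#   beta_y = 4 * (a^T [W1, W2] 1)^4
#   beta_a = (24 * (a^T [W1, W2] s)^2 / (a^T [W1, W2] 1))^2
# where s is read here as a 0/1 selector vector; this reading and the clipping
# at eta_max are assumptions.
import torch

def adaptive_lr(N, eta_max=1e-2):
    with torch.no_grad():
        W = torch.cat([N.w1.weight, N.w2.weight], dim=0)   # stacked [W_1, W_2]
        a = N.a.weight.view(-1)                            # output parameter vector a
        ones = torch.ones(W.shape[1])
        sel = torch.cat([torch.zeros(W.shape[1] // 2),     # assumed [0, 1]-style selector
                         torch.ones(W.shape[1] - W.shape[1] // 2)])
        s1 = (a @ W @ ones).abs()
        s2 = (a @ W @ sel).abs()
        beta_y = 4.0 * s1 ** 4
        beta_a = (24.0 * s2 ** 2 / (s1 + 1e-12)) ** 2
        return min((1.0 / (beta_y + beta_a)).item(), eta_max)
```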
Step 006. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, construct a graph attention reconstruction neural network G and define a molecular reconstruction loss function L_rec, where a_i and b_i,j denote the feature vectors of the true atoms and bonds of a molecule and N_mol denotes the total number of molecular samples. Go to step 007;
Step 007. For the reconstructed molecular features G(f), define a chemical-rule loss function L_chem, where the indicator I(G(f)) = 1 means that the reconstructed molecule does not satisfy the chemical rules. Then go to step 008;
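A possible concrete form of the indicator I(·) in step 007 is an RDKit sanitization check; equating "does not satisfy the chemical rules" with "fails RDKit valence/aromaticity sanitization" is an assumption of this sketch:

```python
# Illustrative sketch of step 007's indicator I(.): returns 1 when a reconstructed
# molecule violates basic chemical rules (RDKit sanitization stands in for the
# unspecified "chemical specification").
from rdkit import Chem

def violates_chemical_rules(smiles: str) -> int:
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return 1
    try:
        Chem.SanitizeMol(mol)      # raises on valence or aromaticity violations
        return 0
    except Exception:
        return 1
```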
Step 008. Generate new molecular features from the molecular embedding feature f and the adversarial perturbation vector d: {a_i*, b_i,j*} = G(f + d). From {a_i*, b_i,j*} and the initial molecular structure x_i, restore the new molecular structure x_i*, and go to step 009;
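Step 008 itself reduces to a single perturbed decoding call; assuming that G returns an (atom features, bond features) pair follows the notation {a_i*, b_i,j*} = G(f + d):

```python
# Illustrative sketch of step 008: shift the molecular embedding along the
# adversarial direction and decode it back to atom and bond features with G.
def generate_new_features(G, f, d):
    new_atom_feats, new_bond_feats = G(f + d)   # {a_i*, b_i,j*} = G(f + d)
    return new_atom_feats, new_bond_feats       # later restored to x_i* against x_i
```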
Step 009. For the molecular activity samples (x_k, y_k) of the validation set, compute the validation performance of the model: the Pearson correlation coefficient, mean square error and mean absolute error between the predicted activity N(E(x_k)) and y_k, together with the reconstruction error L_rec. Check how many iteration steps separate the current total error from the lowest historical total error: if the gap is less than e_max, return to step 003; otherwise, go to step 010;
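The validation metrics of step 009 are standard and can be computed as below (the reconstruction error L_rec, which comes from the graph decoder, is omitted from this sketch):

```python
# Illustrative sketch of the step 009 validation metrics.
import numpy as np
from scipy.stats import pearsonr

def validation_metrics(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    pcc, _ = pearsonr(y_pred, y_true)
    return {
        "pearson": float(pcc),
        "mse": float(np.mean((y_pred - y_true) ** 2)),
        "mae": float(np.mean(np.abs(y_pred - y_true))),
    }
```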
Step 010. Take the molecular activities y* = N(E(x)) predicted by the model at the point where the historical total error is lowest, screen out highly active lead compounds, and obtain the new molecular structures x* generated according to step 008. Then go to step 011;
Step 011. On the test-set molecular activity samples (x, y), compare the molecular activity y* predicted by the model with y to estimate the accuracy of the model in predicting molecular activity, and compare the rate at which the new molecular structures x* coincide with x to estimate the success rate of generating matched molecular pairs on activity cliffs. Then go to step 012;
Step 012. Use the model to predict the molecular activity of a specific target over a large molecular database, screen out high-activity molecules, and, starting from these high-activity molecules, generate optimized new molecules with the model to obtain lead compounds with greatly improved efficacy.
2. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the matched molecular pair on an activity cliff in step 001 is defined as two molecules with a small structural difference but a large activity difference, the small structural difference including a single-atom difference, a single-pharmacophore difference or a single-substructure difference.
3. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein in step 005 adaptive learning-rate adjustment is achieved by deriving a theoretical convergence bound, which effectively improves the generalization of the depth map learning method for discovering and optimizing drug lead compounds.
4. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the graph attention reconstruction neural network G constructed in step 006 operates through the following steps:
Step 00601. From the molecular embedding feature f and the embedding feature h_i of each atom, compute the component of the molecular feature on each atom and normalize it with the Softmax function. Set the number of molecular feature distribution rounds T, and go to step 00602;
Step 00602. Obtain the hidden-layer feature vector of each atom used to reconstruct the molecule by computing the vector sum of the molecular feature component and the atomic feature. Let t ← T and go to step 00603;
Step 00603. Through a fully connected neural network with weight matrix W_3 and bias vector c_3, a dropout layer and an ELU activation function, obtain the relation feature vector of the molecule to each atom by taking the inner product with the attention weight vector; then go to step 00604;
Step 00604. Using a GRU and a ReLU, back-derive the hidden-layer feature on each atom from the distributed context information. Let t ← t - 1 and test t: if t > 0, return to step 00603; otherwise, go to step 00605;
Step 00605. Set the number of atom feature distribution rounds L and let l ← L. Replace the molecular supernode with the adjacent atoms, obtain the attention weights of the adjacent atoms by activating the attention weight vector, and let the adjacent atoms form a weighted sum of the context information; then go to step 00606;
Step 00606. With a re-initialized neural network, test l: if l > 0, apply steps 00603 and 00604 to the hidden-layer features of all atoms to obtain the atomic features of the last hidden layer and let l ← l - 1; otherwise, go to step 00607;
Step 00607. Through a fully connected neural network with weight matrix W_4 and bias vector c_4 and an activation function φ_a, infer the initial features of each atom from the initial hidden-layer features; and through a fully connected neural network with weight matrix W_5 and bias vector c_5, an activation function φ_b and a leaky ReLU function, back-derive the initial features of each bond from the initial hidden-layer features of the two atoms it connects.
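A compressed sketch of the reconstruction network G described in steps 00601 to 00607: the molecular embedding f is distributed over the atom embeddings h_i by softmax attention, refined with a GRU cell for T rounds, and mapped back to initial atom features. Layer sizes, the number of rounds, and the omission of the bond-feature branch of step 00607 and of the neighbour-atom stage of steps 00605 and 00606 are simplifying assumptions:

```python
# Illustrative sketch of the graph attention reconstruction network G.
import torch
import torch.nn as nn

class GraphAttnReconstructor(nn.Module):
    def __init__(self, emb_dim=128, atom_feat_dim=42, T=3):
        super().__init__()
        self.T = T
        self.attn = nn.Linear(2 * emb_dim, 1)             # attention over (f, h_i) pairs
        self.relate = nn.Sequential(nn.Linear(2 * emb_dim, emb_dim),
                                    nn.Dropout(0.1), nn.ELU())
        self.gru = nn.GRUCell(emb_dim, emb_dim)           # back-derives hidden states
        self.to_atom = nn.Linear(emb_dim, atom_feat_dim)  # initial atom features (bond branch omitted)

    def forward(self, f, h):                              # f: [emb], h: [n_atoms, emb]
        pair = torch.cat([f.expand_as(h), h], dim=-1)
        w = torch.softmax(self.attn(pair).squeeze(-1), dim=0)   # step 00601
        hidden = h + w.unsqueeze(-1) * f                         # step 00602
        for _ in range(self.T):                                  # steps 00603-00604
            ctx = self.relate(torch.cat([f.expand_as(hidden), hidden], dim=-1))
            hidden = self.gru(ctx, hidden)
        return self.to_atom(hidden)                              # reconstructed atom features

# usage for one molecule: G = GraphAttnReconstructor(); atom_feats = G(f, h)
```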
5. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein restoring the new molecular structure x_i* from {a_i*, b_i,j*} and the initial molecular structure x_i in step 008 comprises the following steps:
Step 00801. Let P(s|a) be the posterior probability that the a-th atom is predicted to be the s-th chemical element, estimated from the molecular embedding feature f, the atom embedding set, the graph generator G and the adversarial perturbation vector d; then go to step 00802;
Step 00802. Determine the key atom position a* according to the maximum a-posteriori probability criterion, where the candidate set consists of the elements whose probability, predicted by the generator G from the original molecular feature f at atom position a, exceeds the threshold P_0, after eliminating the influence of the element already present and of easily confused elements at that atom position. Go to step 00803;
Step 00803. Determine the replacement element s* according to the maximum a-posteriori probability criterion and the activation threshold P_0, change the element at position a* of the initial molecular structure x_i to s*, and obtain the new molecular structure x_i*.
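The maximum-a-posteriori substitution of steps 00801 to 00803 can be sketched as a masked argmax over a per-atom element-probability matrix; how P(s|a) is produced by the generator G is left abstract here, and the exclusion sets and threshold P_0 are passed in by the caller:

```python
# Illustrative sketch of steps 00801-00803: choose atom position a* and
# replacement element s* by a thresholded maximum-a-posteriori rule.
import numpy as np

def pick_substitution(posterior, current_elements, excluded, P0=0.5):
    """posterior: [n_atoms, n_elements] numpy array of P(s|a);
    current_elements: element index currently at each atom position;
    excluded: dict atom_index -> set of element indices to rule out."""
    p = posterior.copy()
    for a, s_cur in enumerate(current_elements):
        p[a, s_cur] = -np.inf                      # never "replace" with the same element
        for s in excluded.get(a, ()):
            p[a, s] = -np.inf                      # easily confused / forbidden elements
    p[p < P0] = -np.inf                            # activation threshold P_0
    if not np.isfinite(p).any():
        return None                                # no admissible substitution
    a_star, s_star = np.unravel_index(np.argmax(p), p.shape)
    return int(a_star), int(s_star)                # modify atom a* to element s*
```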
The invention further provides a generalizable and interpretable depth map learning system for discovering and optimizing drug lead compounds, which consists of a data preprocessing module, a feature extractor, a classification learner, a regression learner and a molecule generator. The feature extractor comprises a quantitative encoding module for the atom and bond information in molecules and a graph attention neural network module. The classification and regression learners each comprise a training module and a generalization optimization module: every classification and regression attribute of the molecules from PubChem is fed to the training module to obtain model parameters, and the generalization optimization module is used to enhance the generalization of the model. The molecule generator comprises a reconstruction module, an adversarial generation module and a control module. The reconstruction module receives the molecule and atom embedding features submitted by the feature extractor and restores the input molecular structure through a graph attention reconstruction network. The model in the adversarial generation module detects the key direction of the embedded features through gradients, generates new molecular embedding features along that direction, feeds them to the reconstruction module, and outputs the molecule generation result. An operator sets the directional optimization attributes in the control module as required, and molecules with the specified attribute improvement are generated under the supervision of the control module.
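One way to picture how these modules hand data to each other is the wiring sketch below; every class and method name in it is an illustrative placeholder rather than part of the disclosed system:

```python
# Illustrative sketch of the module wiring described above.
class LeadOptimizationPipeline:
    def __init__(self, extractor, predictor, reconstructor, perturber):
        self.extractor = extractor          # atom/bond encoding + graph attention
        self.predictor = predictor          # classification / regression learner
        self.reconstructor = reconstructor  # graph attention reconstruction module
        self.perturber = perturber          # adversarial generation module

    def optimize(self, molecule, target_property):
        f = self.extractor(molecule)
        d = self.perturber(f, target_property)   # key perturbation direction
        return self.reconstructor(f + d)          # optimized molecule
```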
Compared with the prior art, the invention has the following beneficial effects:
1. The system provided by the invention has strong adaptability, high generalization, strong interpretability and high practicability, and is of great application value.
2. The invention is suitable for predicting each attribute of drug lead compounds during discovery and optimization and for generating molecules in a directed manner, including the bioactivity values of molecules against various targets and the absorption, distribution, metabolism, excretion and toxicity indexes of finished drugs.
3. With the graph attention reconstruction neural network and the molecule generation logic, the method provided by the invention can effectively generate molecules along a specified optimization direction, has the potential to discover specific drugs, and offers high social and commercial benefit.
4. The adaptive learning-rate adjustment provided by the invention has a theoretical convergence-bound guarantee and works effectively in a wide range of scenarios.
Drawings
FIG. 1 is a schematic diagram of a generalizable and interpretable depth map learning system for discovering and optimizing drug lead compounds in accordance with the present invention.
FIG. 2 is a flow chart of a molecular embedding feature extraction algorithm.
FIG. 3 is a flow chart of a molecular graph attention reconstruction algorithm.
FIG. 4 is a flow chart of a directed property molecule generation algorithm.
Detailed Description
The technical solution of the present invention is further described in detail below with reference to the accompanying drawings.
A generalizable and interpretable depth map learning system for discovering and optimizing drug lead compounds is shown in FIG. 1. The system consists of a data preprocessing module, a feature extractor, a classification learner, a regression learner and a molecule generator. The feature extractor comprises a quantitative encoding module for the atom and bond information in molecules and a graph attention neural network module. The classification and regression learners each comprise a training module and a generalization optimization module: every classification and regression attribute of the molecules from PubChem is fed to the training module to obtain model parameters, and the generalization optimization module is used to enhance the generalization of the model. The molecule generator comprises a reconstruction module, an adversarial generation module and a control module. The reconstruction module receives the molecule and atom embedding features submitted by the feature extractor and restores the input molecular structure through a graph attention reconstruction network. The model in the adversarial generation module detects the key direction of the embedded features through gradients, generates new molecular embedding features along that direction, feeds them to the reconstruction module, and outputs the molecule generation result. An operator sets the directional optimization attributes in the control module as required, and molecules with the specified attribute improvement are generated under the supervision of the control module.
As shown in FIGS. 2-4, the generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds comprises the following steps:
Step 001. Collect activity samples for each attribute of the molecules of N GPCR targets from the PubChem molecular activity database, and classify the obtained samples, according to the structural difference and the activity difference between every two molecules, into a set of matched molecular pairs on activity cliffs {(x_i^hi, x_i^lo)} and a set of matched molecular pairs not on activity cliffs, where x_i^hi denotes a highly active molecule and x_i^lo denotes a molecule of low activity. Then go to step 002;
Step 002. Construct training and test data sets from the molecular activity samples of the N GPCR targets: a training set D_tr and a test set D_te. A validation set D_val is randomly split off from the training set at ratio n so that the three sets satisfy the required partition. Set the maximum number of iteration steps e_max for which the validation performance of the model is allowed to degrade, and go to step 003;
Step 003. On the training-set molecular activity samples (x_i, y_i) in D_tr, construct a depth map neural network E and a neural network N with a dual feature space, and train them with a mean-square-error loss function L_y; the flow is shown in FIG. 2. Then go to step 004;
Step 004. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, define an adversarial feature-subspace enhancement loss function L_a and train the model parameters θ_E and θ_N, where f = E(x_i) denotes the embedded feature vector of a molecule, d = λ·g/‖g‖_2 is the normalized adversarial perturbation vector with step size λ, g is the non-normalized adversarial perturbation vector defined by the gradient of the activity-difference metric function D, r is a random vector with norm less than ε, σ denotes the activation function, and γ denotes a normalization constant. Then go to step 005;
Step 005. From the mean-square-error loss function L_y and the adversarial feature-subspace enhancement loss function L_a, define an adaptive learning rate η, where η* = 1/(β_y + β_a) denotes the maximum learning rate that guarantees simultaneous convergence of L_y and L_a, β_y = 4·(a^T × [W_1, W_2] × [1,1])^4 and β_a = (24·(a^T × [W_1, W_2] × [0,1])^2 / (a^T × [W_1, W_2] × [1,1]))^2 are the learning rates that guarantee convergence of L_y and L_a respectively, W_1 and W_2 denote the hidden-layer parameter matrices of the dual-feature-space neural network N, a denotes the output parameter vector of N, 0 and 1 denote the all-zero and all-one vectors respectively, η_max is the maximum learning rate that keeps the model parameters stable, and α is the decay coefficient of the gradient penalty term. Then go to step 006;
Step 006. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, construct a graph attention reconstruction neural network G and define a molecular reconstruction loss function L_rec; the algorithm flow is shown in FIG. 3, where a_i and b_i,j denote the feature vectors of the true atoms and bonds of a molecule and N_mol denotes the total number of molecular samples. Go to step 007;
Step 007. For the reconstructed molecular features G(f), define a chemical-rule loss function L_chem, where the indicator I(G(f)) = 1 means that the reconstructed molecule does not satisfy the chemical rules. Then go to step 008;
Step 008. Generate new molecular features from the molecular embedding feature f and the adversarial perturbation vector d: {a_i*, b_i,j*} = G(f + d). From {a_i*, b_i,j*} and the initial molecular structure x_i, restore the new molecular structure x_i*, and go to step 009;
Step 009. For the molecular activity samples (x_k, y_k) of the validation set, compute the validation performance of the model: the Pearson correlation coefficient, mean square error and mean absolute error between the predicted activity N(E(x_k)) and y_k, together with the reconstruction error L_rec. Check how many iteration steps separate the current total error from the lowest historical total error: if the gap is less than e_max, return to step 003; otherwise, go to step 010;
Step 010. Take the molecular activities y* = N(E(x)) predicted by the model at the point where the historical total error is lowest, screen out highly active lead compounds, and obtain the new molecular structures x* generated according to step 008; the molecule generation algorithm flow is shown in FIG. 4. Then go to step 011;
Step 011. On the test-set molecular activity samples (x, y), compare the molecular activity y* predicted by the model with y to estimate the accuracy of the model in predicting molecular activity, and compare the rate at which the new molecular structures x* coincide with x to estimate the success rate of generating matched molecular pairs on activity cliffs. Then go to step 012;
Step 012. Use the model to predict the molecular activity of a specific target over a large molecular database, screen out high-activity molecules, and, starting from these high-activity molecules, generate optimized new molecules with the model to obtain lead compounds with greatly improved efficacy.
2. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the matched molecular pair on an activity cliff in step 001 is defined as two molecules with a small structural difference but a large activity difference, the small structural difference including a single-atom difference, a single-pharmacophore difference or a single-substructure difference.
3. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein in step 005 adaptive learning-rate adjustment is achieved by deriving a theoretical convergence bound, which effectively improves the generalization of the depth map learning method for discovering and optimizing drug lead compounds.
4. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the graph attention reconstruction neural network G constructed in step 006 operates through the following steps:
Step 00601. From the molecular embedding feature f and the embedding feature h_i of each atom, compute the component of the molecular feature on each atom and normalize it with the Softmax function. Set the number of molecular feature distribution rounds T, and go to step 00602;
Step 00602. Obtain the hidden-layer feature vector of each atom used to reconstruct the molecule by computing the vector sum of the molecular feature component and the atomic feature. Let t ← T and go to step 00603;
Step 00603. Through a fully connected neural network with weight matrix W_3 and bias vector c_3, a dropout layer and an ELU activation function, obtain the relation feature vector of the molecule to each atom by taking the inner product with the attention weight vector; then go to step 00604;
Step 00604. Using a GRU and a ReLU, back-derive the hidden-layer feature on each atom from the distributed context information. Let t ← t - 1 and test t: if t > 0, return to step 00603; otherwise, go to step 00605;
Step 00605. Set the number of atom feature distribution rounds L and let l ← L. Replace the molecular supernode with the adjacent atoms, obtain the attention weights of the adjacent atoms by activating the attention weight vector, and let the adjacent atoms form a weighted sum of the context information; then go to step 00606;
Step 00606. With a re-initialized neural network, test l: if l > 0, apply steps 00603 and 00604 to the hidden-layer features of all atoms to obtain the atomic features of the last hidden layer and let l ← l - 1; otherwise, go to step 00607;
Step 00607. Through a fully connected neural network with weight matrix W_4 and bias vector c_4 and an activation function φ_a, infer the initial features of each atom from the initial hidden-layer features; and through a fully connected neural network with weight matrix W_5 and bias vector c_5, an activation function φ_b and a leaky ReLU function, back-derive the initial features of each bond from the initial hidden-layer features of the two atoms it connects.
6. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein restoring the new molecular structure x_i* from {a_i*, b_i,j*} and the initial molecular structure x_i in step 008 comprises the following steps:
Step 00801. Let P(s|a) be the posterior probability that the a-th atom is predicted to be the s-th chemical element, estimated from the molecular embedding feature f, the atom embedding set, the graph generator G and the adversarial perturbation vector d; then go to step 00802;
Step 00802. Determine the key atom position a* according to the maximum a-posteriori probability criterion, where the candidate set consists of the elements whose probability, predicted by the generator G from the original molecular feature f at atom position a, exceeds the threshold P_0, after eliminating the influence of the element already present and of easily confused elements at that atom position. Go to step 00803;
Step 00803. Determine the replacement element s* according to the maximum a-posteriori probability criterion and the activation threshold P_0, change the element at position a* of the initial molecular structure x_i to s*, and obtain the new molecular structure x_i*.

Claims (5)

1. A generalizable and interpretable depth map learning method for discovering and optimizing a drug lead compound, comprising the following steps:
Step 001. Collect molecular activity samples of N GPCR targets from the GLASS molecular activity database, and classify the obtained samples, according to the structural difference and the activity difference between every two molecules, into a set of matched molecular pairs on activity cliffs {(x_i^hi, x_i^lo)} and a set of matched molecular pairs not on activity cliffs, where x_i^hi denotes a highly active molecule and x_i^lo denotes a molecule of low activity. Then go to step 002;
Step 002. Construct training and test data sets from the molecular activity samples of the N GPCR targets: a training set D_tr and a test set D_te. A validation set D_val is randomly split off from the training set at ratio n so that the three sets satisfy the required partition. Set the maximum number of iteration steps e_max for which the validation performance of the model is allowed to degrade, and go to step 003;
Step 003. On the training-set molecular activity samples (x_i, y_i) in D_tr, construct a depth map neural network E and a neural network N with a dual feature space, and train their model parameters θ_E and θ_N with a mean-square-error loss function L_y, where x denotes a molecular structure sample, y denotes a molecular activity sample, and N(E(x_i)) is the molecular activity predicted from the molecular structure x_i. Then go to step 004;
Step 004. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, define an adversarial feature-subspace enhancement loss function L_a and train the model parameters θ_E and θ_N, where f = E(x_i) denotes the embedded feature vector of a molecule, d = λ·g/‖g‖_2 is the normalized adversarial perturbation vector with step size λ, g is the non-normalized adversarial perturbation vector defined by the gradient of the activity-difference metric function D, r is a random vector with norm less than ε, σ denotes the activation function, and γ denotes a normalization constant. Then go to step 005;
Step 005. From the mean-square-error loss function L_y and the adversarial feature-subspace enhancement loss function L_a, define an adaptive learning rate η, where η* = 1/(β_y + β_a) denotes the maximum learning rate that guarantees simultaneous convergence of L_y and L_a, β_y = 4·(a^T × [W_1, W_2] × [1,1])^4 and β_a = (24·(a^T × [W_1, W_2] × [0,1])^2 / (a^T × [W_1, W_2] × [1,1]))^2 are the learning rates that guarantee convergence of L_y and L_a respectively, W_1 and W_2 denote the hidden-layer parameter matrices of the dual-feature-space neural network N, a denotes the output parameter vector of N, 0 and 1 denote the all-zero and all-one vectors respectively, η_max is the maximum learning rate that keeps the model parameters stable, and α is the decay coefficient of the gradient penalty term. Then go to step 006;
Step 006. On the training-set molecular samples x_i in D_tr and the test-set molecular samples x_j in D_te, construct a graph attention reconstruction neural network G and define a molecular reconstruction loss function L_rec, where a_i and b_i,j denote the feature vectors of the true atoms and bonds of a molecule and N_mol denotes the total number of molecular samples. Go to step 007;
Step 007. For the reconstructed molecular features G(f), define a chemical-rule loss function L_chem, where I(G(f)) = 1 indicates that the reconstructed molecule does not meet the chemical rules. Then go to step 008;
Step 008. Generate new molecular features from the molecular embedding feature f and the adversarial perturbation vector d: {a_i*, b_i,j*} = G(f + d). From {a_i*, b_i,j*} and the initial molecular structure x_i, restore the new molecular structure x_i*, and go to step 009;
Step 009. For the molecular activity samples (x_k, y_k) of the validation set, compute the validation performance of the model: the Pearson correlation coefficient, mean square error and mean absolute error between the predicted activity N(E(x_k)) and y_k, together with the reconstruction error L_rec. Check how many iteration steps separate the current total error from the lowest historical total error: if the gap is less than e_max, return to step 003; otherwise, go to step 010;
Step 010. Take the molecular activities y* = N(E(x)) predicted by the model at the point where the historical total error is lowest, screen out highly active lead compounds, and obtain the new molecular structures x* generated according to step 008. Then go to step 011;
Step 011. On the test-set molecular activity samples (x, y), compare the molecular activity y* predicted by the model with y to estimate the accuracy of the model in predicting molecular activity, and compare the rate at which the new molecular structures x* coincide with x to estimate the success rate of generating matched molecular pairs on activity cliffs. Then go to step 012;
Step 012. Use the model to predict the molecular activity of a specific target over a large molecular database, screen out high-activity molecules, and, starting from these high-activity molecules, generate optimized new molecules with the model to obtain lead compounds with greatly improved efficacy.
2. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the matched molecular pair on an activity cliff in step 001 is defined as two molecules with a small structural difference but a large activity difference, the small structural difference including a single-atom difference, a single-pharmacophore difference or a single-substructure difference.
3. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein in step 005 adaptive learning-rate adjustment is achieved by deriving a theoretical convergence bound, which effectively improves the generalization of the depth map learning method for discovering and optimizing drug lead compounds.
4. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein the graph attention reconstruction neural network G constructed in step 006 operates through the following steps:
Step 00601. From the molecular embedding feature f and the embedding feature h_i of each atom, compute the component of the molecular feature on each atom and normalize it with the Softmax function. Set the number of molecular feature distribution rounds T, and go to step 00602;
Step 00602. Obtain the hidden-layer feature vector of each atom used to reconstruct the molecule by computing the vector sum of the molecular feature component and the atomic feature. Let t ← T and go to step 00603;
Step 00603. Through a fully connected neural network with weight matrix W_3 and bias vector c_3, a dropout layer and an ELU activation function, obtain the relation feature vector of the molecule to each atom by taking the inner product with the attention weight vector; then go to step 00604;
Step 00604. Using a GRU and a ReLU, back-derive the hidden-layer feature on each atom from the distributed context information. Let t ← t - 1 and test t: if t > 0, return to step 00603; otherwise, go to step 00605;
Step 00605. Set the number of atom feature distribution rounds L and let l ← L. Replace the molecular supernode with the adjacent atoms, obtain the attention weights of the adjacent atoms by activating the attention weight vector, and let the adjacent atoms form a weighted sum of the context information; then go to step 00606;
Step 00606. With a re-initialized neural network, test l: if l > 0, apply steps 00603 and 00604 to the hidden-layer features of all atoms to obtain the atomic features of the last hidden layer and let l ← l - 1; otherwise, go to step 00607;
Step 00607. Through a fully connected neural network with weight matrix W_4 and bias vector c_4 and an activation function φ_a, infer the initial features of each atom from the initial hidden-layer features; and through a fully connected neural network with weight matrix W_5 and bias vector c_5, an activation function φ_b and a leaky ReLU function, back-derive the initial features of each bond from the initial hidden-layer features of the two atoms it connects.
5. The generalizable and interpretable depth map learning method for discovering and optimizing drug lead compounds according to claim 1, wherein restoring the new molecular structure x_i* from {a_i*, b_i,j*} and the initial molecular structure x_i in step 008 comprises the following steps:
Step 00801. Let P(s|a) be the posterior probability that the a-th atom is predicted to be the s-th chemical element, estimated from the molecular embedding feature f, the atom embedding set, the graph generator G and the adversarial perturbation vector d; then go to step 00802;
Step 00802. Determine the key atom position a* according to the maximum a-posteriori probability criterion, where the candidate set consists of the elements whose probability, predicted by the generator G from the original molecular feature f at atom position a, exceeds the threshold P_0, after eliminating the influence of the element already present and of easily confused elements at that atom position. Go to step 00803;
Step 00803. Determine the replacement element s* according to the maximum a-posteriori probability criterion and the activation threshold P_0, change the element at position a* of the initial molecular structure x_i to s*, and obtain the new molecular structure x_i*.
CN202210903698.2A 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound Pending CN115274007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210903698.2A CN115274007A (en) 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210903698.2A CN115274007A (en) 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Publications (1)

Publication Number Publication Date
CN115274007A true CN115274007A (en) 2022-11-01

Family

ID=83770557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210903698.2A Pending CN115274007A (en) 2022-08-02 2022-08-02 Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound

Country Status (1)

Country Link
CN (1) CN115274007A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115966266A (en) * 2023-01-06 2023-04-14 东南大学 Anti-tumor molecule strengthening method based on graph neural network
CN116189809A (en) * 2023-01-06 2023-05-30 东南大学 Drug molecule important node prediction method based on challenge resistance
CN115966266B (en) * 2023-01-06 2023-11-17 东南大学 Anti-tumor molecule strengthening method based on graph neural network
CN116189809B (en) * 2023-01-06 2024-01-09 东南大学 Drug molecule important node prediction method based on challenge resistance
CN116312855A (en) * 2023-02-28 2023-06-23 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound
CN116312855B (en) * 2023-02-28 2023-09-08 杭州生奥信息技术有限公司 Method for optimizing activity of lead compound

Similar Documents

Publication Publication Date Title
Li et al. DeepDSC: a deep learning method to predict drug sensitivity of cancer cell lines
CN109492822B (en) Air pollutant concentration time-space domain correlation prediction method
CN115274007A (en) Generalizable and interpretable depth map learning method for discovering and optimizing drug lead compound
CN107862173B (en) Virtual screening method and device for lead compound
US7194359B2 (en) Estimating the accuracy of molecular property models and predictions
Giannakoglou et al. Aerodynamic shape design using evolutionary algorithms and new gradient-assisted metamodels
CN110321603A (en) A kind of depth calculation model for Fault Diagnosis of Aircraft Engine Gas Path
CN114022693B (en) Single-cell RNA-seq data clustering method based on double self-supervision
CN112489722B (en) Method and device for predicting binding energy of drug target
Ghadiri et al. BigFCM: Fast, precise and scalable FCM on hadoop
Jin Compositional kernel learning using tree-based genetic programming for Gaussian process regression
Gerber et al. Fast covariance parameter estimation of spatial Gaussian process models using neural networks
CN113241122A (en) Gene data variable selection and classification method based on fusion of adaptive elastic network and deep neural network
Ma An Efficient Optimization Method for Extreme Learning Machine Using Artificial Bee Colony.
Khalaf et al. Hybridized deep learning model for perfobond rib shear strength connector prediction
Staples et al. Artificial intelligence for bioinformatics: applications in protein folding prediction
Robati et al. Inflation rate modeling: Adaptive neuro-fuzzy inference system approach and particle swarm optimization algorithm (ANFIS-PSO)
CN115982141A (en) Characteristic optimization method for time series data prediction
Regazzoni et al. A physics-informed multi-fidelity approach for the estimation of differential equations parameters in low-data or large-noise regimes
Drakoulas et al. FastSVD-ML–ROM: A reduced-order modeling framework based on machine learning for real-time applications
Hou et al. Estimating elastic parameters from digital rock images based on multi-task learning with multi-gate mixture-of-experts
CN108876038B (en) Big data, artificial intelligence and super calculation synergetic material performance prediction method
Lu et al. Quality-relevant feature extraction method based on teacher-student uncertainty autoencoder and its application to soft sensors
WO2022082739A1 (en) Method for predicting protein and ligand molecule binding free energy on basis of convolutional neural network
Sainsbury-Dale et al. Neural Bayes estimators for irregular spatial data using graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination