WO2022077258A1

WO2022077258A1 - Free energy perturbation network design method based on machine learning

Info

Publication number: WO2022077258A1
Application number: PCT/CN2020/120845
Authority: WO
Inventors: 李治鹏; 温书豪; 杨明俊; 林志雄; 邹俊杰; 马健; 赖力鹏
Original assignee: 深圳晶泰科技有限公司
Priority date: 2020-10-14
Filing date: 2020-10-14
Publication date: 2022-04-21

Abstract

A free energy perturbation network design method based on machine learning. The method comprises the following steps: S1, preparing the small molecule data set required for calculation; S2, preparing small molecule/protein input files; S3, calculating ΔΔG and its uncertainty (std) between different small molecule pairs by using FEP; S4, extracting feature descriptors of the small molecules; S5, preparing the training set and test set required for a machine learning model; S6, constructing the machine learning model; S7, training the machine learning model; and S8, compiling error statistics on the test set. The method can handle scenarios in which the binding free energy of a large number of small molecules needs to be calculated and predicted, and the required perturbation network can be designed rapidly. The result obtained correlates more strongly with std, so the calculation precision can be effectively improved. In addition, as the number of calculated molecules increases, more data can be collected for model training, improving the generalization capability and precision of the model.

Description

Free energy perturbation network design method based on machine learning

Technical field

The invention belongs to the technical field of molecular dynamics simulation, and in particular relates to a method for designing a free energy perturbation network based on machine learning.

Background

The binding free energy (ΔG) between a small molecule drug and its target protein plays a very important role in guiding small molecule drug design. Free energy perturbation (FEP), a calculation method based on molecular dynamics (MD), can predict binding free energy. When the prediction task involves multiple small molecules, designing a free energy perturbation network is necessary, because a well-designed network can effectively improve prediction accuracy. In the network graph, each node represents a small molecule, and each edge represents the difference in binding free energy (ΔΔG) between two small molecules. The core problem in designing the network is to judge whether two small molecules should be connected, such that the uncertainty (std) of the ΔΔG calculated along that edge is minimized. Most existing design methods decide whether two small molecules should be connected according to one of the following principles:

(1) Manual judgment based on experience;

(2) Judgment based on Tanimoto similarity score.

The existing methods mainly have the following problems:

1. Manual judgment based on experience: When the number of small molecules to be calculated is n, the total number of edges that could be connected, that is, the total number of molecule pairs that could be calculated by FEP, is n(n-1)/2. As the number of small molecules increases, the number of edges to be judged grows rapidly (quadratically in n), and assessing every candidate edge manually becomes almost impossible.

2. Judgment based on Tanimoto similarity score: With this indicator, similar small molecules are usually connected preferentially (the closer the Tanimoto similarity score is to 1, the more similar the two small molecules are). However, the similarity coefficient is calculated from molecular fingerprints, which capture only a very limited set of small molecule characteristics. Moreover, the fact that two molecules are judged similar by this method does not guarantee that the uncertainty of the calculated ΔΔG is small.
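The two quantities discussed above can be illustrated with a short sketch. It is a minimal example under stated assumptions: a fingerprint is represented as a set of on-bit indices, and the function names `count_candidate_edges` and `tanimoto` are hypothetical, chosen for illustration only.

```python
def count_candidate_edges(n):
    """Number of distinct molecule pairs among n molecules: n(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices.

    Assumption: two empty fingerprints are scored 0.0; this is a convention
    chosen for this sketch, not something stated in the source.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, "molecules ->", count_candidate_edges(n), "candidate edges")
    # Two toy fingerprints sharing 2 of 5 distinct on-bits.
    print("Tanimoto:", tanimoto({1, 2, 3, 4}, {3, 4, 5}))
```

For 200 molecules the count is already 19900 candidate pairs. The toy fingerprints share 2 of 5 distinct bits, giving a score of 0.4; the score reflects only bit overlap and carries no direct information about the uncertainty of ΔΔG.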

Summary of the invention

In view of the above technical problems, the purpose of the present invention is to provide a free energy perturbation network design method based on machine learning, which uses a large number of ΔΔG calculation results to train a machine learning model, so that free energy perturbation networks can be designed more quickly and computational accuracy improved.

To achieve the above object, the present invention provides the following technical solutions:

The method for designing a free energy perturbation network based on machine learning includes the following steps:

S1. Prepare the small molecule data set required for the calculation;

S2. Prepare small molecule/protein input files;

S3. Use FEP to calculate ΔΔG and std between different small molecule pairs;

S4. Extract feature descriptors of small molecules;

S5. Prepare training set and test set;

S6. Build a machine learning model;

S7. Train the machine learning model;

S8. Compile error statistics on the test set.

Specifically, the method includes the following steps:

S1. Prepare the small molecule data set required for the calculation: ensure the diversity of the systems when preparing the data set, so as to avoid overfitting of the model to particular systems;

S2. Prepare small molecule/protein input files: according to the requirements of FEP calculation, generate the initial files for FEP calculation;

S3. Use FEP to calculate ΔΔG and std between different small molecule pairs: design the necessary molecular pairs between small molecules, use FEP to calculate ΔΔG multiple times for each pair, and then obtain the corresponding std value;

S4. Extract feature descriptors of small molecules: extract the two-dimensional structural feature descriptors of the small molecules;

S5. Prepare training set and test set: collect the std results of the molecule pairs calculated by FEP and the two-dimensional feature descriptors of the corresponding small molecules, and divide the collected data into a training set and a test set according to a certain proportion;

S6. Build a machine learning model: use the obtained two-dimensional descriptors of the small molecules as input and the std result of the molecule pair as output to build a machine learning model;

S7. Train the machine learning model: select appropriate parameters to train the model, setting different parameters for different types of machine learning models;

S8. Compile error statistics on the test set: after training is completed, compute the error on the test set, and optimize the model parameters according to that error to obtain the best model.
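Steps S3 and S5 can be sketched as follows. This is a minimal illustration, not the implementation of the invention. It assumes that std is the sample standard deviation of the repeated ΔΔG values, that the two molecules' descriptor vectors are combined by simple concatenation (the source does not say how they are combined), and that the split is random. All function names and the toy data are hypothetical.

```python
import random
from statistics import stdev


def pair_std(ddg_runs):
    """Sample standard deviation of repeated ddG estimates for one pair."""
    if len(ddg_runs) < 2:
        raise ValueError("at least two repeated FEP results are needed")
    return stdev(ddg_runs)


def pair_features(desc_a, desc_b):
    """Assumption: concatenate the two molecules' 2D descriptor vectors."""
    return list(desc_a) + list(desc_b)


def build_dataset(fep_results, descriptors):
    """fep_results maps (mol_a, mol_b) to a list of repeated ddG values;
    descriptors maps a molecule name to its descriptor vector."""
    X, y = [], []
    for (mol_a, mol_b), runs in fep_results.items():
        X.append(pair_features(descriptors[mol_a], descriptors[mol_b]))
        y.append(pair_std(runs))
    return X, y


def split_dataset(X, y, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(indices))
    train, test = indices[:n_train], indices[n_train:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


if __name__ == "__main__":
    # Toy data invented for illustration only.
    fep = {("m1", "m2"): [1.0, 1.2, 0.8], ("m2", "m3"): [0.5, 0.5, 0.5]}
    desc = {"m1": [1.0, 2.0], "m2": [3.0, 4.0], "m3": [5.0, 6.0]}
    X, y = build_dataset(fep, desc)
    print(X, y)
```

Concatenation makes the feature vector depend on the order of the two molecules; a symmetric combination, such as an element-wise absolute difference, would be an equally plausible choice under these assumptions.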

In step S4, the two-dimensional structural feature descriptors of a small molecule include molecular mass, topological connectivity information, and the number of flexible dihedral angles.

Compared with the prior art, the beneficial effects of the present invention are:

1. Automatic design of the perturbation network

Compared with designing the perturbation network manually, this method can handle scenarios in which the binding free energy of a large number of small molecules needs to be calculated and predicted, and can quickly design the required perturbation network.

2. Improve the calculation accuracy of free energy perturbation

Compared with the method based on the Tanimoto similarity score, the results obtained by this method correlate more strongly with std, which can effectively improve the calculation accuracy.

3. Easy to expand

Once the calculation workflow is established, the number of calculated molecules grows as calculations proceed, so more data can be collected for model training, improving the generalization ability and accuracy of the model.

Description of drawings

Fig. 1 is the flow chart of the present invention;

Fig. 2 shows the correlation analysis between the Tanimoto similarity score and std in the embodiment;

Fig. 3 shows the correlation analysis between the RF score and std in the embodiment.

Detailed description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

Following the flowchart shown in Fig. 1, this example selected a total of 200 small molecules from 8 kinase systems and designed 300 molecular pairs. ΔΔG was calculated 5 times for each pair, and the std of those results was used as the output of the model.

The correlation between the Tanimoto similarity score and std was examined first. As shown in Figure 2, the correlation between the two is very weak: the Kendall rank correlation coefficient is -0.113. A perturbation network constructed according to this criterion will therefore introduce relatively large uncertainty.
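For reference, a Kendall rank correlation can be computed as in the sketch below. The source does not state which variant of the coefficient was used; this example assumes the simple tau-a form, which does not correct for ties. The function name `kendall_tau_a` is hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are needed")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Toy scores invented for illustration: one swapped pair out of six.
    print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```

A value near 0 means the ranking given by the score says little about the ranking of std. Library implementations such as `scipy.stats.kendalltau` default to the tau-b variant, which additionally corrects for ties.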

In this embodiment, two-dimensional feature values were extracted for each small molecule, giving 77 feature values per molecule. The data were divided into a training set and a test set at a ratio of 7:3. Random Forest was chosen as the machine learning model for this example. Different combinations of model parameters, such as the maximum number of features, the maximum depth of the decision trees, the minimum number of samples required to split an internal node, and the minimum number of samples at a leaf node, were tried to obtain the best random forest model. With this model, the error on the training set is 0.14 and the error on the test set is 0.31. The RF score obtained from this model was then subjected to the same correlation analysis against std as the Tanimoto similarity score above, as shown in Figure 3. The resulting Kendall correlation coefficient was 0.41.
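A parameter search of this kind might look like the following sketch. It rests on several assumptions that are not in the source: scikit-learn is used as the Random Forest implementation, the data are synthetic stand-ins, each pair is described by 154 features (two concatenated 77-value descriptor vectors), and mean absolute error is used as the error measure. It also selects parameters by cross-validation on the training portion, a substitution made here for illustration, whereas the source describes optimizing the parameters against the test-set error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 pairs x 154 features (assumed 2 x 77).
X = rng.normal(size=(300, 154))
# Synthetic non-negative "std" target; not real FEP data.
y = 0.3 * np.abs(X[:, 0] - X[:, 77]) + np.abs(rng.normal(scale=0.05, size=300))

# 7:3 split into training and test sets, as in the embodiment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The four parameters named in the text, with illustrative candidate values.
param_grid = {
    "max_features": ["sqrt", 0.5],   # maximum number of features
    "max_depth": [5, None],          # maximum depth of each tree
    "min_samples_split": [2, 5],     # min samples to split an internal node
    "min_samples_leaf": [1, 3],      # min samples at a leaf node
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
train_error = mean_absolute_error(y_train, best_model.predict(X_train))
test_error = mean_absolute_error(y_test, best_model.predict(X_test))
print("best parameters:", search.best_params_)
print("train error:", train_error, "test error:", test_error)
```

Under these assumptions, the model's predicted std for a candidate pair would play the role of the RF score. The numbers printed by this sketch come from synthetic data and are not comparable with the errors reported in the embodiment.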

It can be seen that the results obtained by this method can be used to design a free energy perturbation network for a large number of small molecules, and at the same time, the accuracy can be improved compared with the Tanimoto similarity score method.
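The source does not specify how predicted std values are turned into a network. One simple possibility, offered purely as an assumption, is to keep the spanning tree with the lowest total predicted std, found here with Kruskal's algorithm. The function name `design_network` and the toy predictions are hypothetical.

```python
def design_network(molecules, predicted_std):
    """Return the spanning tree edges with the lowest total predicted std.

    predicted_std maps (mol_a, mol_b) to the predicted std of that edge.
    Assumption: a minimum spanning tree is one possible selection rule; it
    is not described in the source.
    """
    parent = {m: m for m in molecules}

    def find(m):
        # Union-find root lookup with path halving.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    edges = []
    # Consider candidate edges from lowest to highest predicted std.
    for (a, b), std in sorted(predicted_std.items(), key=lambda kv: kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two separate components
            parent[root_a] = root_b
            edges.append((a, b, std))
    if len(edges) != len(molecules) - 1:
        raise ValueError("candidate edges do not connect all molecules")
    return edges


if __name__ == "__main__":
    mols = ["A", "B", "C", "D"]
    # Toy predicted std values invented for illustration.
    pred = {("A", "B"): 0.2, ("A", "C"): 0.9, ("A", "D"): 0.5,
            ("B", "C"): 0.3, ("B", "D"): 0.8, ("C", "D"): 0.4}
    for a, b, std in design_network(mols, pred):
        print(a, "-", b, "predicted std", std)
```

A spanning tree connects every molecule with the fewest possible edges, so it contains no redundant paths. Adding further low-std edges to form cycles would be a natural extension under the same assumptions.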

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the appended claims and their equivalents.

Claims

A method for designing a free energy perturbation network based on machine learning, characterized in that it includes the following steps:

S1. Prepare the small molecule data set required for the calculation;

S2. Prepare small molecule/protein input files;

S3. Use FEP to calculate ΔΔG and std between different small molecule pairs;

S4. Extract feature descriptors of small molecules;

S5. Prepare the training set and test set required for the machine learning model;

S6. Build the machine learning model;

S7. Train the machine learning model;

S8. Compile error statistics on the test set.

The method for designing a free energy perturbation network based on machine learning according to claim 1, characterized in that it specifically comprises the following steps:

S1. Prepare the small molecule data set required for the calculation: ensure the diversity of the system when preparing the data set, so as to avoid overfitting of the model to some systems;

S2. Prepare small molecule/protein input files: According to the requirements of FEP calculation, generate an initial file for FEP calculation;

S3. Use FEP to calculate ΔΔG and std between different small molecule pairs: design the necessary molecular pairs between small molecules, use FEP to calculate ΔΔG multiple times for each pair, and then obtain the corresponding std value;

S4. Extract feature descriptors of small molecules: extract the two-dimensional structural feature descriptors of the small molecules;

S5. Prepare the training set and test set required for the machine learning model: collect the std results of the molecule pairs calculated by FEP and the two-dimensional feature descriptors of the corresponding small molecules, and divide the collected data into a training set and a test set;

S6. Build a machine learning model: use the obtained two-dimensional descriptors of the small molecules as input and the std results of the molecule pairs as output to build a machine learning model;

S7. Train the machine learning model: select appropriate parameters to train the model, setting different parameters for different types of machine learning models;

S8. Compile error statistics on the test set: after training is completed, compute the error on the test set, and optimize the model parameters according to that error to obtain the best model.

The method for designing a free energy perturbation network based on machine learning according to claim 2, wherein in step S4, the two-dimensional structural feature descriptors of the small molecule include molecular mass, topological connectivity information, and the number of flexible dihedral angles.