CN116343910A

CN116343910A - Method for predicting the docking pose between a protein and a ligand based on a graph neural network

Info

Publication number: CN116343910A
Application number: CN202310327119.9A
Authority: CN
Inventors: 顾彦慧; 陈晓健; 张先锋; 郝渊苏; 文羽昕; 廖楚悦
Original assignee: Nanjing Normal University
Current assignee: Nanjing Normal University
Priority date: 2023-03-30
Filing date: 2023-03-30
Publication date: 2023-06-27

Abstract

The invention discloses a method for predicting the docking pose between a protein and a ligand based on a graph neural network. First, a biological information sample set of protein-ligand complexes is obtained; the sample set includes sample data and sample label data. Second, a docking pose generation model based on a graph neural network and a multi-view docking pose evaluation model are built, the model parameters are adjusted, and the sample data are processed by the trained structure generation model to obtain the actual output of the protein-ligand docking pose. Finally, mainstream evaluation metrics for docking pose structures are used to evaluate the stability of the output. The invention generates the optimal docking pose structure directly from the biological structure information of the ligand and protein and evaluates the generated result with a multi-view comprehensive evaluation model, thereby improving both the accuracy of protein-ligand docking pose prediction and the validity of its evaluation.

Description

Prediction method of docking pose between protein and ligand based on graph neural network

Technical field

The invention belongs to the field of computer-aided drug design, and specifically relates to a method for predicting the docking pose between a protein and a ligand based on a graph neural network.

Background

In computer-aided drug design, screening out docking poses with high binding affinity between a specific protein and a ligand has long been a difficult problem. Traditional methods combine existing techniques to generate a virtually unlimited number of candidate docking poses and then screen them for the set of poses that best satisfies the conditions, which ignores the global information between the drug molecule and the target protein. In addition, existing docking pose data are relatively scarce, evaluating binding affinity requires a large number of experiments, and good screening results require extensive annotation of the data set. Insufficient annotation often leads to inaccurate predictions of the binding affinity between ligand molecules and target proteins, and the experimental and labor costs of such methods are more than most research projects can afford. Emerging methods that predict docking poses from three-dimensional molecular structure are very promising for predicting protein-ligand binding affinity and improve the efficiency and accuracy of computer-aided drug design.

Among existing methods that use three-dimensional molecular structure to predict docking pose and affinity, various sampling methods are adopted to improve prediction efficiency. For example, Glide applies Monte Carlo sampling to the global information of the pose, which improves the accuracy and speed of docking pose prediction when facing large numbers of drug-molecule ligands. However, methods like Glide often neglect the biological optimization of the docking pose, and pose information that has not been optimized leads to inaccurate predictions of docking pose and affinity; drug production urgently needs more intelligent and accurate prediction models. In addition, such methods often consider only one of intramolecular and intermolecular forces rather than treating both together, so important information is lost during affinity prediction and the prediction becomes inaccurate.

In general, current prediction of protein-ligand docking poses still has many limitations: prediction is time-consuming, existing docking pose data are scarce, molecular spatial structure information is underused, and binding affinity predictions are not accurate enough. This is mainly because candidate drug molecules involve large amounts of data and complex spatial structures. To address these limitations, the invention proposes a prediction model that combines a graph neural network with a generative adversarial network, in order to overcome the low efficiency of ligand docking pose structure prediction and to improve the validity of ligand-protein binding affinity evaluation.

Summary of the invention

Purpose of the invention: The invention provides a method for predicting the docking pose between a protein and a ligand based on a graph neural network. It makes full use of the global information between the protein and the ligand molecule, avoids the loss of important intermolecular information that occurs in traditional prediction methods, and greatly improves the accuracy and efficiency of predicting the docking pose between a target protein and a ligand molecule.

Technical solution: The invention provides a method for predicting the docking pose between a protein and a ligand based on a graph neural network, which specifically includes the following steps:

(1) Acquire a biological information sample set of protein-ligand complexes and preprocess it; the sample set includes sample data and sample annotations of the sample data; encode the sample data to obtain feature vectors;

(2) Construct a docking pose generation model based on a graph neural network and train it with a generative adversarial network; with the target protein fixed, use multi-step pose prediction on the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss difference is calculated, and the spatial positions of the atoms in the docking pose are adjusted; iterate until the discrimination result meets the threshold requirement, at which point the output for all atoms is taken as the final predicted pose;

(3) Construct a docking pose evaluation model based on a graph neural network and evaluate the generated ideal docking pose; for the actual output of the ideal protein-ligand docking pose, evaluate the predicted docking pose result according to existing mainstream evaluation metrics.

Further, the preprocessing of the biological information sample set described in step (1) is as follows:

Remove protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand, and remove proteins with missing or repeated residues, forming a set of N protein-ligand complexes together with their native binding affinity annotations;

Convert the molecular graph of the three-dimensional biomolecular data into a two-dimensional adjacency matrix according to structural properties such as chemical bonds and atomic arrangement; the resulting adjacency matrix conforms to the input format of the graph neural network;

This yields the protein molecular structure adjacency matrix, the ligand molecular structure adjacency matrix, and the native binding affinity sample annotations.
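
The preprocessing rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Complex` record, its field names, and the example entries are hypothetical and do not come from the source, and a real pipeline would derive these values from the structure files.

```python
from dataclasses import dataclass


@dataclass
class Complex:
    """Hypothetical record for one protein-ligand complex."""
    pdb_id: str
    rotatable_bonds: int         # rotatable bonds in the ligand
    ligand_count: int            # ligands bound to the protein
    has_missing_residues: bool
    has_repeated_residues: bool
    affinity: float              # native binding affinity annotation


def keep(c: Complex) -> bool:
    """Apply the removal rules from the preprocessing step."""
    return (
        c.rotatable_bonds >= 2          # drop fewer than two rotatable bonds
        and c.ligand_count == 1         # drop proteins with more than one ligand
        and not c.has_missing_residues
        and not c.has_repeated_residues
    )


def adjacency_matrix(num_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from (i, j) bond pairs."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


# Made-up example entries, not real PDBbind records.
complexes = [
    Complex("aaaa", 3, 1, False, False, 6.2),
    Complex("bbbb", 1, 1, False, False, 4.9),  # too few rotatable bonds
    Complex("cccc", 5, 2, False, False, 7.1),  # more than one ligand
    Complex("dddd", 4, 1, True, False, 5.5),   # missing residues
]
kept = [c for c in complexes if keep(c)]
ligand_adj = adjacency_matrix(3, [(0, 1), (1, 2)])
```

Here `kept` plays the role of the N retained complexes, and `ligand_adj` is the two-dimensional adjacency matrix for a three-atom chain.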

Further, in step (2), the method for training docking pose structure generation based on a generative adversarial network specifically includes the following steps:

(21) Obtain the initial simulated docking pose structures of the protein and ligand; using the AutoDock docking pose structure simulation model, fix the protein molecule, vary the spatial position distribution of the ligand molecule, and simulate random docking pose structures;

(22) Construct a docking pose generation model based on a graph neural network to generate candidate docking poses; initialize the neural network generator according to the simulated docking pose structures; using sample data drawn from the training sample set and a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, calculates the motion of each ligand atom and outputs a movement vector, yielding the generator's actual docking pose structure output;

(23) Construct a multi-view docking pose validity discrimination model to evaluate the actual output of the generation model; taking the native binding affinity annotations in the original PDBbind2016 data set as the benchmark, perform multi-view affinity prediction on the generated ligand docking pose and the target protein, and judge the validity of the generated docking pose; the deviation between the binding affinity calculated for the docking pose and the benchmark is fed back to the generation model to adjust and optimize its parameters;

(24) Repeat the training iterations on the generated docking poses until they reach the expected level on the affinity discrimination metrics, obtaining the trained model and outputting the ideal docking pose result.
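
The control flow of steps (21) to (24) (propose per-atom movement vectors, score the pose, repeat until a threshold is met) can be illustrated with a toy loop. This is not a trained generative adversarial network: `propose_moves` and `score` are hypothetical stand-ins for the generator and the discriminator, the known `target` pose exists only so that the toy has something to converge to, and the step size, noise level, and threshold are arbitrary assumptions.

```python
import random


def propose_moves(coords, target, rng, step=0.5, noise=0.05):
    """Stand-in generator: one movement vector per ligand atom.

    Each vector points part of the way toward the target position and is
    perturbed by Gaussian noise drawn from a defined distribution.
    """
    moves = []
    for atom, goal in zip(coords, target):
        moves.append(tuple(
            step * (g - a) + rng.gauss(0.0, noise)
            for a, g in zip(atom, goal)
        ))
    return moves


def score(coords, target):
    """Stand-in discriminator: mean squared distance to the target pose."""
    total = 0.0
    for atom, goal in zip(coords, target):
        total += sum((a - g) ** 2 for a, g in zip(atom, goal))
    return total / len(coords)


def refine(coords, target, threshold=0.05, max_steps=200, seed=0):
    """Move ligand atoms step by step until the score meets the threshold."""
    rng = random.Random(seed)
    for step_idx in range(max_steps):
        if score(coords, target) <= threshold:
            return coords, step_idx
        moves = propose_moves(coords, target, rng)
        coords = [
            tuple(a + m for a, m in zip(atom, move))
            for atom, move in zip(coords, moves)
        ]
    return coords, max_steps


start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(2.0, 2.0, 2.0), (3.0, 2.0, 2.0)]
final, steps_used = refine(start, target)
```

In the method described above the score would come from the multi-view affinity discriminator rather than from a known target, and the feedback would update generator parameters rather than coordinates directly.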

Further, step (3) is implemented as follows:

An evaluation function is used to measure the root mean square deviation (RMSD) metric, which is defined as follows:

where δ is the position of an atom in a given frame minus its position in the reference frame, i.e. the position offset; T is time; and x is the position of the atom at a given moment. The RMSD value indicates the magnitude of each atom's motion: the larger the value, the larger the spatial range of the atom's motion and the smaller its steric hindrance. In the evaluation model, a smaller RMSD means that the generated docking pose structure is more accurate and valid.
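
The formula itself is not reproduced in this text. As a hedged reconstruction, the standard RMSD over the atoms of a pose is consistent with the definition of δ given here; the atom count N, the index i, and the superscript ref are assumed notation, not taken from the source:

```latex
\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\delta_i^{2}},
\qquad
\delta_i = \lVert x_i - x_i^{\mathrm{ref}} \rVert
```

When a single atom is followed over T time frames, the analogous quantity averages over frames instead of atoms, i.e. $\sqrt{\tfrac{1}{T}\sum_{t=1}^{T}\lVert x(t) - x^{\mathrm{ref}}\rVert^{2}}$, which may be the form the symbols T and x refer to.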

Further, judging the validity of the generated docking pose result in step (23) specifically includes the following steps:

(231) Data feature representation: obtain the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the entire protein sequence, and the generated protein-ligand docking pose, where the protein-ligand docking pose is represented as a bipartite graph;

(232) Data encoding: encode the input data with the graph neural network Attentive FP model, the BIGGNN model, and the ProBert model, respectively, converting them into feature vectors;

(233) Affinity prediction: feed the encoded feature vectors to a multi-layer perceptron to predict the affinity, and compare the resulting binding affinity value with the native binding affinity annotation to judge the validity of the generated pose;

(234) Calculate the deviation between the predicted result and the actual pose; four metrics are used: mean absolute error (MAE), root mean square error (RMSE), standard deviation (SD), and Pearson correlation coefficient (R). The four metrics are defined as follows:

where D is the number of samples in the data set; y and ŷ are the experimentally determined and model-predicted protein-ligand binding affinity values, respectively; and a and b are the intercept and slope of the regression line, respectively. The values of these four metrics determine the loss function, which is fed back to the docking pose generation model to adjust and optimize its parameters.
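
The four formulas are not reproduced in this text. The conventional definitions of these metrics, written with the symbols D, y, a, and b defined here, are given below as a hedged reconstruction; the hat on the predicted value, the sample index i, and the overbars for sample means are assumed notation:

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{D}\sum_{i=1}^{D}\lvert y_i - \hat{y}_i\rvert \\
\mathrm{RMSE} &= \sqrt{\frac{1}{D}\sum_{i=1}^{D}\left(y_i - \hat{y}_i\right)^{2}} \\
\mathrm{SD}   &= \sqrt{\frac{1}{D-1}\sum_{i=1}^{D}\left(y_i - \left(a + b\,\hat{y}_i\right)\right)^{2}} \\
R &= \frac{\sum_{i=1}^{D}\left(\hat{y}_i - \bar{\hat{y}}\right)\left(y_i - \bar{y}\right)}
          {\sqrt{\sum_{i=1}^{D}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}\,\sum_{i=1}^{D}\left(y_i - \bar{y}\right)^{2}}}
\end{aligned}
```

In this form the regression line is fitted with the predicted values as the independent variable, so SD measures the spread of the experimental values around that line.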

Beneficial effects: Compared with the prior art, the invention uses a graph neural network to make full use of the global information between the protein and the ligand molecule, avoids the loss of important intermolecular information that occurs in traditional prediction methods, and greatly improves the accuracy and efficiency of predicting the docking pose between a target protein and a ligand molecule. The invention uses two optimized models, a docking pose structure generation model based on a generative adversarial network and a docking pose evaluation model based on a graph neural network, to make full use of the global information of the biomolecules, generate protein-ligand docking poses from that information, and evaluate the predictions.

Description of drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a schematic flowchart of the preprocessing of protein-ligand complex biological information data in the invention;

Fig. 3 is a schematic flowchart of the method for evaluating generated protein-ligand pose structures in the invention.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the invention proposes a method for predicting the docking pose between a protein and a ligand based on a graph neural network, which specifically includes the following steps:

Step 1: Acquire a biological information sample set of protein-ligand complexes and preprocess it; the sample set includes sample data and sample annotations of the sample data; encode the sample data to obtain feature vectors.

Extract the data in the original PDBbind2016 data set to obtain the biomolecular structure information of its ligands and proteins; remove protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand, and remove proteins with missing or repeated residues, forming a set of N protein-ligand complexes together with their native binding affinity annotations.

Then, according to structural properties such as chemical bonds and atomic arrangement, convert the three-dimensional biomolecular data into a two-dimensional adjacency matrix (this adjacency matrix conforms to the input format of the Pytorch_Geometric graph neural network framework). The resulting multi-dimensional protein-ligand information is then used for docking pose structure generation, so as to obtain a large number of initial docking pose structures at random. Mature models already exist for docking pose structure generation; the AutoDock docking pose structure generation model is used here. Finally, the training sample set and the test sample set are obtained by random division after affinity prediction and annotation of the protein-ligand docking pose structures. Each sample comprises the protein molecular structure adjacency matrix, the ligand molecular structure adjacency matrix, the protein-ligand docking pose structure, and the corresponding binding affinity sample annotation.
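
Two details of this step lend themselves to a short sketch: turning a dense adjacency matrix into the coordinate (COO) edge layout, and randomly dividing samples into training and test sets. PyTorch Geometric stores connectivity as an `edge_index` of shape [2, num_edges]; the sketch below builds that layout with plain Python lists so that it runs without the library. The function names, the 20% test fraction, and the fixed seed are assumptions for illustration only.

```python
import random


def to_edge_index(adj):
    """Convert a dense 0/1 adjacency matrix to a COO edge list.

    Returns [sources, targets], i.e. a 2 x num_edges layout.
    """
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, bonded in enumerate(row):
            if bonded:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


def random_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and cut off a test portion."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
edge_index = to_edge_index(adj)

sample_ids = list(range(10))   # placeholders for the prepared samples
train_set, test_set = random_split(sample_ids)
```

Each undirected bond appears twice in `edge_index`, once per direction, which is how an undirected molecular graph is usually encoded in this layout.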

Step 2: Construct a docking pose generation model based on a graph neural network and train it with a generative adversarial network; with the target protein fixed, use multi-step pose prediction on the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss difference is calculated, and the spatial positions of the atoms in the docking pose are adjusted; iterate until the discrimination result meets the threshold requirement, at which point the output for all atoms is taken as the final predicted pose. As shown in Fig. 2, this specifically includes the following steps:

(2.1) Obtain the initial simulated docking pose structures of the protein and ligand; using the AutoDock docking pose structure generation model, fix the protein molecule, vary the spatial position distribution of the ligand molecule, and generate random docking pose structures;

(2.2) Construct a docking pose generation model based on a graph neural network to generate candidate docking poses; initialize the neural network generator; using sample data drawn from the training sample set and a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, calculates the motion of each ligand atom and outputs a movement vector, yielding the generator's actual docking pose structure output;

(2.3) Construct a multi-view docking pose validity discrimination model to evaluate the actual output of the generation model; taking the native binding affinity annotations in the original PDBbind2016 data set as the benchmark, perform multi-view affinity prediction on the generated ligand docking pose and the target protein, and judge the validity of the generated docking pose; the deviation between the binding affinity calculated for the docking pose and the benchmark is fed back to the generation model to adjust and optimize its parameters.

The validity of the generated docking pose result is judged as shown in Fig. 3; the specific process is as follows:

1) Data feature representation: obtain the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the entire protein sequence, and the generated protein-ligand docking pose, where the protein-ligand docking pose is represented as a bipartite graph.

2) Data encoding: encode the input data with the graph neural network Attentive FP model, the BIGGNN model, and the ProBert model, respectively, converting them into feature vectors.

3) Affinity prediction: feed the encoded feature vectors to a multi-layer perceptron to predict the affinity, and compare the resulting binding affinity value with the native binding affinity annotation to judge the validity of the generated pose.

4) Calculate the deviation between the predicted result and the actual pose; four metrics are used: mean absolute error (MAE), root mean square error (RMSE), standard deviation (SD), and Pearson correlation coefficient (R). The four metrics are defined as follows:

where D is the number of samples in the data set; y and ŷ are the experimentally determined and model-predicted protein-ligand binding affinity values, respectively; and a and b are the intercept and slope of the regression line, respectively. The values of these four metrics determine the loss function, which is fed back to the docking pose generation model to adjust and optimize its parameters.
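
A minimal sketch of how the four metrics could be computed, assuming their conventional definitions (the source formulas are not reproduced in this text). The function name `regression_metrics` and the example values are hypothetical. The regression line is fitted by ordinary least squares with the predicted values as the independent variable, which is an assumption about how a and b are obtained.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, SD and Pearson R for paired affinity values.

    Requires at least two samples and non-constant inputs.
    """
    d = len(y_true)
    mean_t = sum(y_true) / d
    mean_p = sum(y_pred) / d

    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / d
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / d)

    # Least-squares line y_true ~ a + b * y_pred (a: intercept, b: slope).
    sxx = sum((p - mean_p) ** 2 for p in y_pred)
    syy = sum((t - mean_t) ** 2 for t in y_true)
    sxy = sum((p - mean_p) * (t - mean_t) for t, p in zip(y_true, y_pred))
    b = sxy / sxx
    a = mean_t - b * mean_p

    # Spread of the experimental values around the regression line.
    residual = sum((t - (a + b * p)) ** 2 for t, p in zip(y_true, y_pred))
    sd = math.sqrt(residual / (d - 1))

    r = sxy / math.sqrt(sxx * syy)
    return {"MAE": mae, "RMSE": rmse, "SD": sd, "R": r}


# Made-up values: every prediction is 0.5 above the experimental value.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

A constant offset like the one in the example shows the difference between the metrics: MAE and RMSE report the offset, while SD and R, which depend only on the linear relationship, report a perfect fit.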

(2.4) Repeat the training iterations on the generated docking poses until they reach the expected level on the affinity discrimination metrics, obtaining the trained model and outputting the ideal docking pose result.
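
The multi-view discriminator of step (2.3) combines several encoded inputs before predicting affinity. The sketch below illustrates two of its ingredients under loud assumptions: the bipartite pose graph is built with a hypothetical distance cutoff of 4.0 (the source does not state how edges are chosen), the per-view feature vectors are hand-written stand-ins for the outputs of the Attentive FP, BIGGNN, and ProBert encoders, and the perceptron weights are arbitrary rather than trained.

```python
import math


def bipartite_edges(ligand_xyz, protein_xyz, cutoff=4.0):
    """Link ligand atom i to protein atom j when they lie within `cutoff`.

    Edges only run between the two atom sets, so the graph is bipartite.
    """
    edges = []
    for i, lig in enumerate(ligand_xyz):
        for j, prot in enumerate(protein_xyz):
            if math.dist(lig, prot) <= cutoff:
                edges.append((i, j))
    return edges


def mlp(features, w1, b1, w2, b2):
    """One hidden ReLU layer followed by a scalar output."""
    hidden = [
        max(0.0, sum(w * f for w, f in zip(row, features)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def predict_affinity(views, w1, b1, w2, b2):
    """Concatenate the per-view feature vectors and apply the perceptron."""
    fused = [value for view in views for value in view]
    return mlp(fused, w1, b1, w2, b2)


ligand = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
protein = [(0.0, 0.0, 3.0), (0.0, 0.0, 9.0), (10.0, 0.0, 1.0)]
pose_edges = bipartite_edges(ligand, protein)

# Stand-in embeddings: ligand graph, protein graph, sequence, pose graph.
views = [[1.0], [2.0], [3.0], [-1.0]]
w1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [0.5, 0.25]
b2 = 1.0
affinity = predict_affinity(views, w1, b1, w2, b2)
```

Concatenation is only one possible way to fuse the views; the source does not say which fusion scheme is used.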

Step 3: Construct a docking pose evaluation model based on a graph neural network and evaluate the generated ideal docking pose; for the actual output of the ideal protein-ligand docking pose, evaluate the predicted docking pose result according to existing mainstream evaluation metrics.

A traditional evaluation function is used to measure the root mean square deviation (RMSD) metric, which is defined as follows:

where δ is the position of an atom in a given frame minus its position in the reference frame, i.e. the position offset; T is time; and x is the position of the atom at a given moment. The RMSD value indicates the magnitude of each atom's motion: the larger the value, the larger the spatial range of the atom's motion and the smaller its steric hindrance. In the evaluation model, a smaller RMSD means that the generated docking pose structure is more accurate and valid.
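
For a single predicted pose compared against a reference pose, the RMSD evaluation can be sketched as below. The function name `pose_rmsd` is hypothetical, and the sketch assumes that both poses list the same atoms in the same order and already share a coordinate frame (the protein is held fixed), so it performs no superposition or symmetry correction.

```python
import math


def pose_rmsd(predicted, reference):
    """Root mean square deviation between two poses with matching atoms."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("poses must be non-empty and have equal atom counts")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


reference_pose = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
predicted_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
rmsd_value = pose_rmsd(predicted_pose, reference_pose)
```

In the example both atoms are displaced by 2 units along z, so the RMSD equals that displacement; a pose identical to the reference gives 0.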

Claims

1. A method for predicting the docking pose between a protein and a ligand based on a graph neural network, characterized in that it comprises the following steps:

(1) Acquire a biological information sample set of protein-ligand complexes and preprocess it; the sample set includes sample data and sample annotations of the sample data; encode the sample data to obtain feature vectors;

(2) Construct a docking pose generation model based on a graph neural network and train it with a generative adversarial network; with the target protein fixed, use multi-step pose prediction on the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss difference is calculated, and the spatial positions of the atoms in the docking pose are adjusted; iterate until the discrimination result meets the threshold requirement, at which point the output for all atoms is taken as the final predicted pose;

(3) Construct a docking pose evaluation model based on a graph neural network and evaluate the generated ideal docking pose; for the actual output of the ideal protein-ligand docking pose, evaluate the predicted docking pose result according to existing mainstream evaluation metrics.

2. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 1, characterized in that the preprocessing of the biological information sample set described in step (1) is as follows:

Remove protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand, and remove proteins with missing or repeated residues, forming a set of N protein-ligand complexes together with their native binding affinity annotations;

Convert the molecular graph of the three-dimensional biomolecular data into a two-dimensional adjacency matrix according to structural properties such as chemical bonds and atomic arrangement; the resulting adjacency matrix conforms to the input format of the graph neural network;

This yields the protein molecular structure adjacency matrix, the ligand molecular structure adjacency matrix, and the native binding affinity sample annotations.

3. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 1, characterized in that, in step (2), the method for training docking pose structure generation based on a generative adversarial network specifically includes the following steps:

(21) Obtain the initial simulated docking pose structures of the protein and ligand; using the AutoDock docking pose structure simulation model, fix the protein molecule, vary the spatial position distribution of the ligand molecule, and simulate random docking pose structures;

(22) Construct a docking pose generation model based on a graph neural network to generate candidate docking poses; initialize the neural network generator according to the simulated docking pose structures; using sample data drawn from the training sample set and a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, calculates the motion of each ligand atom and outputs a movement vector, yielding the generator's actual docking pose structure output;

(23) Construct a multi-view docking pose validity discrimination model to evaluate the actual output of the generation model; taking the native binding affinity annotations in the original PDBbind2016 data set as the benchmark, perform multi-view affinity prediction on the generated ligand docking pose and the target protein, and judge the validity of the generated docking pose; the deviation between the binding affinity calculated for the docking pose and the benchmark is fed back to the generation model to adjust and optimize its parameters;

(24) Repeat the training iterations on the generated docking poses until they reach the expected level on the affinity discrimination metrics, obtaining the trained model and outputting the ideal docking pose result.

4. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 1, characterized in that step (3) is implemented as follows:

An evaluation function is used to measure the root mean square deviation (RMSD) metric, which is defined as follows:

where δ is the position of an atom in a given frame minus its position in the reference frame, i.e. the position offset; T is time; and x is the position of the atom at a given moment; the RMSD value indicates the magnitude of each atom's motion: the larger the value, the larger the spatial range of the atom's motion and the smaller its steric hindrance; in the evaluation model, a smaller RMSD means that the generated docking pose structure is more accurate and valid.

5. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 3, characterized in that judging the validity of the generated docking pose result in step (23) specifically includes the following steps:

(231) Data feature representation: obtain the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the entire protein sequence, and the generated protein-ligand docking pose, where the protein-ligand docking pose is represented as a bipartite graph;

(232) Data encoding: encode the input data with the graph neural network Attentive FP model, the BIGGNN model, and the ProBert model, respectively, converting them into feature vectors;

(233) Affinity prediction: feed the encoded feature vectors to a multi-layer perceptron to predict the affinity, and compare the resulting binding affinity value with the native binding affinity annotation to judge the validity of the generated pose;

(234) Calculate the deviation between the predicted result and the actual pose; four metrics are used: mean absolute error (MAE), root mean square error (RMSE), standard deviation (SD), and Pearson correlation coefficient (R); the four metrics are defined as follows:

where D is the number of samples in the dataset, y and