CN116343910A - Prediction method of docking pose between protein and ligand based on graph neural network - Google Patents

Prediction method of docking pose between protein and ligand based on graph neural network

Info

Publication number
CN116343910A
Authority
CN
China
Prior art keywords
docking
protein
ligand
pose
posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310327119.9A
Other languages
Chinese (zh)
Inventor
顾彦慧
陈晓健
张先锋
郝渊苏
文羽昕
廖楚悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Normal University
Original Assignee
Nanjing Normal University
Application filed by Nanjing Normal University
Priority to CN202310327119.9A
Publication of CN116343910A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30: Drug targeting using structural data; Docking or binding prediction
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0475: Generative networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/094: Adversarial learning
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Chemical & Material Sciences (AREA)
  • Biotechnology (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medicinal Chemistry (AREA)
  • Public Health (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Bioethics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for predicting the docking pose between a protein and a ligand based on a graph neural network. First, a bioinformatics sample set of protein-ligand complexes is obtained; the sample set includes sample data and sample annotation data. Second, a docking pose generation model based on a graph neural network and a multi-view docking pose evaluation model are constructed, the model parameters are further tuned, and the sample data are processed by the trained structure generation model to obtain the actual docking pose output for the protein and ligand. Finally, mainstream evaluation metrics for docking pose structures are used to assess the stability of the output. The invention generates the optimal docking pose structure directly from the biological structure information of the ligand and protein, and evaluates the generated result with a comprehensive multi-view evaluation model, thereby improving both the accuracy of ligand-protein docking pose prediction and the effectiveness of evaluating the prediction results.

Figure 202310327119

Description

Prediction method of docking pose between protein and ligand based on graph neural network

Technical field

The invention belongs to the field of computer-aided drug design, and specifically relates to a method for predicting the docking pose between a protein and a ligand based on a graph neural network.

Background

In computer-aided drug design, screening for docking poses with high binding affinity between a specific protein and a ligand has long been a difficult problem. Traditional approaches combine existing methods to generate an unbounded number of candidate docking poses and then screen for the set of poses that best satisfies the conditions, ignoring the global information between the drug molecule and the target protein. In addition, existing docking pose data are scarce, the assessment of binding affinity requires a large number of experiments, and good screening results require extensive annotation of the dataset. Insufficient annotation often leads to inaccurate prediction of the binding affinity between ligand molecules and target proteins, and the experimental costs and annotation labor costs of such methods are more than most research projects can bear. Emerging methods that predict docking poses from three-dimensional molecular structure are very promising for predicting protein-ligand binding affinity and improve the efficiency and accuracy of computer-aided drug design.

Existing methods that predict docking poses and affinity from three-dimensional molecular structure use various sampling methods to improve prediction efficiency. Glide, for example, applies Monte Carlo sampling to the global pose information, which improves the accuracy and speed of docking pose prediction when a large number of drug-molecule ligands must be handled. However, methods like Glide often neglect the biological optimization of docking poses, and pose information that has not been optimized leads to inaccurate prediction of docking poses and affinity; drug production urgently needs more intelligent and accurate prediction models. In addition, such methods often consider only one of intramolecular and intermolecular forces rather than both together, so important information is lost during affinity prediction, which makes the prediction inaccurate.

Overall, current prediction of docking poses between proteins and ligands still has many limitations: prediction is time-consuming, existing docking pose data are scarce, molecular spatial structure information is underused, and binding affinity predictions are not accurate enough. This is mainly because candidate drug molecules are numerous and have complex spatial structures. To address the limitations of existing protein-ligand docking pose prediction methods, the invention proposes a prediction model that combines a graph neural network with a generative adversarial network, in order to remedy the low efficiency of ligand docking pose structure prediction and to improve the effectiveness of ligand-protein binding affinity evaluation.

Summary of the invention

Purpose of the invention: the invention provides a method for predicting the docking pose between a protein and a ligand based on a graph neural network. It makes full use of the global information between protein and ligand molecules, avoids the loss of important intermolecular information that affects traditional prediction methods, and greatly improves the accuracy and efficiency of predicting the docking pose between a target protein and a ligand molecule.

Technical solution: the invention provides a method for predicting the docking pose between a protein and a ligand based on a graph neural network, which specifically includes the following steps:

(1) Obtain a bioinformatics sample set of protein-ligand complexes and preprocess it; the sample set includes sample data and the corresponding sample annotations, and the sample data are encoded to obtain feature vectors;

(2) Construct a docking pose generation model based on a graph neural network and train it with a generative adversarial network; with the target protein held fixed, apply multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss is computed, and the spatial positions of the atoms in the docking pose are adjusted; this is iterated until the discrimination result meets the threshold requirement, at which point the output for all atoms is taken as the final predicted pose;

(3) Construct a docking pose evaluation model based on a graph neural network and evaluate the generated ideal docking pose; the actual output of the ideal protein-ligand docking pose is evaluated against existing mainstream evaluation metrics.

Further, the preprocessing of the bioinformatics sample set in step (1) is as follows:

Remove protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand, and remove proteins with missing or duplicated residues, forming a set of N protein-ligand complexes with their native binding affinity annotations;

Convert the molecular graphs of the three-dimensional biomolecular data into two-dimensional adjacency matrices according to structural properties such as chemical bonds and atomic arrangement; the resulting adjacency matrices conform to the input format of the graph neural network;

This yields the protein molecular structure adjacency matrix, the ligand molecular structure adjacency matrix, and the native binding affinity sample annotations.
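The conversion from a bonded structure to an adjacency matrix can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy four-atom ligand, and the choice of an unweighted 0/1 matrix are assumptions, and the second helper shows the sparse edge-list layout that many graph neural network libraries accept in place of a dense matrix.

```python
from typing import List, Tuple


def bonds_to_adjacency(num_atoms: int, bonds: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from an undirected bond list."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        if i == j:
            raise ValueError("an atom cannot be bonded to itself")
        adj[i][j] = 1
        adj[j][i] = 1  # bonds are undirected, so mirror the entry
    return adj


def adjacency_to_edge_index(adj: List[List[int]]) -> List[List[int]]:
    """Flatten the matrix into two parallel lists: edge sources and edge targets."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, connected in enumerate(row):
            if connected:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


# Hypothetical toy ligand: a four-atom chain 0-1-2-3.
adj = bonds_to_adjacency(4, [(0, 1), (1, 2), (2, 3)])
edge_index = adjacency_to_edge_index(adj)
```

A real pipeline would also attach per-atom features (element, charge, coordinates) and per-bond features (bond order); the source does not specify which features are used.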

Further, in step (2), the training method for docking pose structure generation based on a generative adversarial network specifically includes the following steps:

(21) Obtain the initial simulated protein-ligand docking pose structures: using the AutoDock docking pose structure simulation model, hold the protein molecule fixed, vary the spatial position of the ligand molecule, and simulate random docking pose structures;

(22) Construct a docking pose generation model based on a graph neural network to generate candidate docking poses: initialize the neural network generator from the simulated docking pose structures; using sample data drawn from the training sample set and a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, computes the motion of each ligand atom, and outputs a movement vector, giving the generator's actual docking pose structure output;

(23) Construct a multi-view docking pose validity discrimination model to evaluate the actual output of the generation model: taking the native binding affinity annotations in the original PDBbind2016 dataset as the reference, perform multi-view affinity prediction on the generated ligand docking pose and the target protein to judge the validity of the generated docking pose; the deviation between the binding affinity computed for the docking pose and the reference is fed back to the generation model to tune and optimize its parameters;

(24) Repeat the training iterations on the generated docking poses until they reach the desired level on the affinity discrimination metrics, obtaining the trained model and outputting the ideal docking pose result.
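The control flow of steps (22) to (24), in which per-atom movement vectors are applied repeatedly until the discriminator's result meets a threshold, can be sketched with toy stand-ins. Everything named here is hypothetical: `toy_generator` and `toy_discriminator` are simple geometric functions standing in for the graph neural network generator and the multi-view discriminator, `TARGET` is an invented reference pose, and no adversarial training takes place. The sketch shows only the iterate-until-threshold loop.

```python
import math
from typing import Callable, List, Tuple

Pose = List[List[float]]  # one [x, y, z] coordinate per ligand atom


def refine_pose(
    pose: Pose,
    propose_moves: Callable[[Pose], Pose],
    score: Callable[[Pose], float],
    threshold: float,
    max_steps: int = 100,
) -> Tuple[Pose, int]:
    """Apply per-atom movement vectors until the score reaches the threshold."""
    current = [list(atom) for atom in pose]
    for step in range(max_steps):
        if score(current) >= threshold:
            return current, step
        moves = propose_moves(current)
        current = [
            [c + m for c, m in zip(atom, move)]
            for atom, move in zip(current, moves)
        ]
    return current, max_steps


# Hypothetical stand-ins, for illustration only.
TARGET: Pose = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]


def toy_generator(pose: Pose) -> Pose:
    """Stand-in for the generator: move each atom halfway toward TARGET."""
    return [
        [0.5 * (t - c) for c, t in zip(atom, target)]
        for atom, target in zip(pose, TARGET)
    ]


def toy_discriminator(pose: Pose) -> float:
    """Stand-in for the discriminator: higher is better (negated worst offset)."""
    return -max(math.dist(atom, target) for atom, target in zip(pose, TARGET))


start: Pose = [[4.0, 0.0, 0.0], [5.5, 0.0, 0.0]]
final_pose, steps = refine_pose(start, toy_generator, toy_discriminator, threshold=-0.1)
```

In the method described above, the proposal function would be the trained generator and the score would come from affinity prediction against the reference annotations; the `max_steps` cap is an added safeguard that the source does not mention.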

Further, step (3) is implemented as follows:

An evaluation function is used to measure the root-mean-square deviation (RMSD), which is defined as follows:

Figure BDA0004153618380000031

where δ is the position of an atom in a given frame minus its position in the reference structure, i.e., the positional offset; T is time; and x is the position of the atom at a given time. The RMSD value indicates the magnitude of each atom's motion: the larger the value, the larger the spatial range of the atom's motion and the smaller its steric hindrance. In the evaluation model, a smaller RMSD means the generated docking pose structure is more accurate and valid.
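Because the equation image is not recoverable from this text, the following sketch uses the conventional pose RMSD, the square root of the mean squared per-atom offset between a predicted and a reference pose. It is an assumption that this matches the patent's formula, which also involves a time variable T; the function name and toy coordinates are illustrative, and no structural alignment or symmetry correction is applied.

```python
import math
from typing import List, Sequence


def rmsd(predicted: Sequence[Sequence[float]], reference: Sequence[Sequence[float]]) -> float:
    """Root-mean-square deviation over matched atoms, without alignment."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared_offsets = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(predicted, reference)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared_offsets / len(predicted))


# Toy poses: every atom is displaced by exactly 1 unit along z.
predicted_pose: List[List[float]] = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
reference_pose: List[List[float]] = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
value = rmsd(predicted_pose, reference_pose)
```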

Further, judging the validity of the generated docking pose in step (23) specifically includes the following steps:

(231) Data feature representation: obtain the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the full protein sequence, and the generated protein-ligand docking pose, where the docking pose is represented as a bipartite graph;

(232) Data encoding: encode the input data with the Attentive FP graph neural network model, the BIGGNN model, and the ProBert model, respectively, converting them into feature vectors;

(233) Affinity prediction: feed the encoded feature vectors to a multilayer perceptron to predict affinity, and compare the resulting binding affinity value with the native binding affinity annotation to judge the validity of the generated pose;

(234) Compute the deviation between the prediction and the actual pose using four metrics: mean absolute error (MAE), root-mean-square error (RMSE), standard deviation (SD), and the Pearson correlation coefficient (R). The four metrics are defined as follows:

Figure BDA0004153618380000041

Figure BDA0004153618380000042

Figure BDA0004153618380000043

Figure BDA0004153618380000044

where D is the number of samples in the dataset; y and ŷ are the experimentally determined and model-predicted protein-ligand binding affinity values, respectively; and a and b are the intercept and slope of the regression line, respectively. The values of these four metrics determine the loss function, which is fed back to the docking pose generation model to tune and optimize its parameters.
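The four metrics can be sketched in plain Python using their conventional definitions. This is an assumption, since the equation images are not recoverable here: in particular, SD is taken to be the standard deviation of the residuals about the least-squares line y = a + b·ŷ with a D − 1 denominator, as is common in scoring-function benchmarks. The function names and the toy affinity values are illustrative only.

```python
import math
from typing import Sequence, Tuple


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error between measured and predicted affinities."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root-mean-square error between measured and predicted affinities."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def _centered_sums(y: Sequence[float], y_hat: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Return the means and the centered sums needed for regression and R."""
    mean_y = sum(y) / len(y)
    mean_p = sum(y_hat) / len(y_hat)
    s_py = sum((p - mean_p) * (a - mean_y) for a, p in zip(y, y_hat))
    s_pp = sum((p - mean_p) ** 2 for p in y_hat)
    s_yy = sum((a - mean_y) ** 2 for a in y)
    return mean_y, mean_p, s_py, s_pp, s_yy


def pearson_r(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Pearson correlation coefficient between measured and predicted values."""
    _, _, s_py, s_pp, s_yy = _centered_sums(y, y_hat)
    return s_py / math.sqrt(s_pp * s_yy)


def regression_sd(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Standard deviation of residuals about the fitted line y = a + b * y_hat."""
    mean_y, mean_p, s_py, s_pp, _ = _centered_sums(y, y_hat)
    b = s_py / s_pp            # slope of the regression line
    a = mean_y - b * mean_p    # intercept of the regression line
    residual_sq = sum((t - (a + b * p)) ** 2 for t, p in zip(y, y_hat))
    return math.sqrt(residual_sq / (len(y) - 1))


# Toy values: every prediction is exactly 0.5 higher than the measurement.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
metrics = (
    mae(measured, predicted),
    rmse(measured, predicted),
    regression_sd(measured, predicted),
    pearson_r(measured, predicted),
)
```

The toy case shows why the metrics are reported together: a constant offset gives a nonzero MAE and RMSE while SD stays at zero and R at one, because the regression line absorbs the offset.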

Beneficial effects: compared with the prior art, the invention uses a graph neural network to make full use of the global information between protein and ligand molecules, avoids the loss of important intermolecular information that affects traditional prediction methods, and greatly improves the accuracy and efficiency of predicting the docking pose between a target protein and a ligand molecule. The invention uses two optimized models, a docking pose structure generation model based on a generative adversarial network and a docking pose evaluation model based on a graph neural network, to make full use of the global information of the biomolecules, generate protein-ligand docking poses from that information, and evaluate the predictions.

Description of the drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a schematic flowchart of the preprocessing of protein-ligand complex bioinformatics data according to the invention;

Fig. 3 is a schematic flowchart of the method for evaluating generated protein-ligand pose structures according to the invention.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the invention proposes a method for predicting the docking pose between a protein and a ligand based on a graph neural network, which specifically includes the following steps:

Step 1: Obtain a bioinformatics sample set of protein-ligand complexes and preprocess it; the sample set includes sample data and the corresponding sample annotations, and the sample data are encoded to obtain feature vectors.

Data are extracted from the original PDBbind2016 dataset to obtain the biomolecular structure information of the ligands and proteins. Protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand are removed, as are proteins with missing or duplicated residues, forming a set of N protein-ligand complexes with their native binding affinity annotations.
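The filtering rules above can be expressed as a single predicate over per-complex summary records. The sketch below is illustrative only: the `ComplexRecord` fields, the record identifiers, and the affinity numbers are hypothetical, and computing the summary fields from real structure files is out of scope here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComplexRecord:
    """Hypothetical per-complex summary; field names are illustrative."""
    complex_id: str
    rotatable_bonds: int
    ligand_count: int
    has_missing_residues: bool
    has_duplicate_residues: bool
    affinity: float  # native binding affinity annotation


def passes_filters(record: ComplexRecord) -> bool:
    """Keep complexes with at least two rotatable bonds, exactly one ligand,
    and a protein with no missing or duplicated residues."""
    return (
        record.rotatable_bonds >= 2
        and record.ligand_count == 1
        and not record.has_missing_residues
        and not record.has_duplicate_residues
    )


# Invented toy records, not PDBbind entries.
records: List[ComplexRecord] = [
    ComplexRecord("cplx_a", 5, 1, False, False, 6.2),
    ComplexRecord("cplx_b", 1, 1, False, False, 4.8),  # too few rotatable bonds
    ComplexRecord("cplx_c", 4, 2, False, False, 5.5),  # more than one ligand
    ComplexRecord("cplx_d", 3, 1, True, False, 7.1),   # missing residues
    ComplexRecord("cplx_e", 2, 1, False, False, 7.9),
]
kept = [r for r in records if passes_filters(r)]
kept_ids = [r.complex_id for r in kept]
```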

Then, according to structural properties such as chemical bonds and atomic arrangement, the three-dimensional biomolecular data are converted into two-dimensional adjacency matrices (which conform to the input format of the Pytorch_Geometric graph neural network framework). The resulting multi-dimensional protein-ligand information is then used for docking pose structure generation to obtain a large number of random initial docking pose structures; mature models for docking pose structure generation already exist, and the AutoDock model is used here. Finally, the training and test sample sets are obtained by random partitioning after affinity prediction and annotation of the protein-ligand docking pose structures. Each sample comprises a protein molecular structure adjacency matrix, a ligand molecular structure adjacency matrix, a protein-ligand docking pose structure, and the corresponding binding affinity annotation.

Step 2: Construct a docking pose generation model based on a graph neural network and train it with a generative adversarial network; with the target protein held fixed, apply multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss is computed, and the spatial positions of the atoms in the docking pose are adjusted; this is iterated until the discrimination result meets the threshold requirement, at which point the output for all atoms is taken as the final predicted pose. As shown in Fig. 2, this specifically includes the following steps:

(2.1) Obtain the initial simulated protein-ligand docking pose structures: using the AutoDock docking pose structure generation model, hold the protein molecule fixed, vary the spatial position of the ligand molecule, and generate random docking pose structures;

(2.2) Construct a docking pose generation model based on a graph neural network to generate candidate docking poses: initialize the neural network generator; using sample data drawn from the training sample set and a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, computes the motion of each ligand atom, and outputs a movement vector, giving the generator's actual docking pose structure output;

(2.3) Construct a multi-view docking pose validity discrimination model to evaluate the actual output of the generation model: taking the native binding affinity annotations in the original PDBbind2016 dataset as the reference, perform multi-view affinity prediction on the generated ligand docking pose and the target protein to judge the validity of the generated docking pose; the deviation between the binding affinity computed for the docking pose and the reference is fed back to the generation model to tune and optimize its parameters.

The validity of the generated docking pose is judged as shown in Fig. 3; the specific process is as follows:

1) Data feature representation: obtain the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the full protein sequence, and the generated protein-ligand docking pose, where the docking pose is represented as a bipartite graph.

2) Data encoding: encode the input data with the Attentive FP graph neural network model, the BIGGNN model, and the ProBert model, respectively, converting them into feature vectors.

3) Affinity prediction: feed the encoded feature vectors to a multilayer perceptron to predict affinity, and compare the resulting binding affinity value with the native binding affinity annotation to judge the validity of the generated pose.

4) Compute the deviation between the prediction and the actual pose using four metrics: mean absolute error (MAE), root-mean-square error (RMSE), standard deviation (SD), and the Pearson correlation coefficient (R). The four metrics are defined as follows:

Figure BDA0004153618380000061

Figure BDA0004153618380000062

Figure BDA0004153618380000063

Figure BDA0004153618380000064

where D is the number of samples in the dataset; y and ŷ are the experimentally determined and model-predicted protein-ligand binding affinity values, respectively; and a and b are the intercept and slope of the regression line, respectively. The values of these four metrics determine the loss function, which is fed back to the docking pose generation model to tune and optimize its parameters.
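The affinity prediction in sub-step 3) can be sketched as a small multilayer perceptron over the encoded views. The source does not say how the encoder outputs are combined, so concatenation is an assumption here, as are the single hidden layer, the ReLU activation, and every weight value; the three short vectors merely stand in for encoder outputs and are not produced by Attentive FP, BIGGNN, or ProBert.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Sequence[float], weights: Matrix, bias: Sequence[float]) -> Vector:
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def predict_affinity(views: Sequence[Sequence[float]], params: Dict[str, list]) -> float:
    """Concatenate the per-view feature vectors and apply a two-layer perceptron."""
    fused = [value for view in views for value in view]
    hidden = relu(linear(fused, params["W1"], params["b1"]))
    return linear(hidden, params["W2"], params["b2"])[0]


# Hypothetical encoder outputs (three views) and hand-picked weights.
views = [[1.0, 0.0], [0.5], [2.0]]
params = {
    "W1": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, -1.0]],
    "b1": [0.0, 0.0],
    "W2": [[3.0, 5.0]],
    "b2": [0.5],
}
predicted_affinity = predict_affinity(views, params)
```

In practice the weights would be learned against the native affinity annotations, and the predicted values would feed the four metrics defined above.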

(2.4) Repeat the training iterations on the generated docking poses until they reach the desired level on the affinity discrimination metrics, obtaining the trained model and outputting the ideal docking pose result.

Step 3: Construct a docking pose evaluation model based on a graph neural network and evaluate the generated ideal docking pose; the actual output of the ideal protein-ligand docking pose is evaluated against existing mainstream evaluation metrics.

A conventional evaluation function is used to measure the root-mean-square deviation (RMSD), which is defined as follows:

Figure BDA0004153618380000071

where δ is the position of an atom in a given frame minus its position in the reference structure, i.e., the positional offset; T is time; and x is the position of the atom at a given time. The RMSD value indicates the magnitude of each atom's motion: the larger the value, the larger the spatial range of the atom's motion and the smaller its steric hindrance. In the evaluation model, a smaller RMSD means the generated docking pose structure is more accurate and valid.

Claims (5)

1. A method for predicting the docking pose between a protein and a ligand based on a graph neural network, characterized in that it comprises the following steps:
(1) acquiring a biological information sample set of protein-ligand complexes and preprocessing it; the sample set comprises sample data and sample annotations of the sample data, and the sample data are encoded to obtain feature vectors;
(2) constructing a docking pose generation model based on a graph neural network and training it with a generative adversarial network; with the target protein fixed, multi-step pose prediction is applied to the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss difference is calculated, and the spatial positions of the atoms in the docking pose are adjusted; this is iterated until the discrimination result meets the threshold requirement, whereupon the output for all atoms is taken as the final predicted pose;
(3) constructing a docking pose evaluation model based on a graph neural network to evaluate the generated ideal docking pose; for the actual output of the ideal protein-ligand docking pose, the predicted docking pose results are evaluated according to existing mainstream evaluation metrics.

2. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 1, characterized in that the preprocessing of the biological information sample set in step (1) is as follows:
removing protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand, and removing proteins with missing or duplicated residues, to form a set containing N protein-ligand complexes and their native binding affinity annotations;
converting the molecular graph of the three-dimensional biomolecular data into a two-dimensional adjacency matrix according to structural properties such as chemical bonds and atomic arrangement, the resulting adjacency matrix conforming to the input format of the graph neural network;
thereby obtaining the protein molecular structure adjacency matrix, the ligand molecular structure adjacency matrix, and the native binding affinity sample annotations.

3. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 1, characterized in that, in step (2), the training method for docking pose structure generation based on a generative adversarial network specifically comprises the following steps:
(21) obtaining the initial simulated docking pose structure of the protein and ligand: using the AutoDock docking pose structure simulation model, the protein molecule is fixed and the spatial position distribution of the ligand molecule is varied to perform random docking pose structure simulation;
(22) constructing a docking pose generation model based on a graph neural network to generate candidate docking poses: the neural network generator is initialized from the simulated docking pose structure; using sample data drawn from the training sample set and a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, calculates the motion of each ligand atom and outputs a movement vector, yielding the generator's actual docking pose structure output;
(23) constructing a multi-view docking pose validity discrimination model to evaluate the actual output of the generation model: taking the native binding affinity annotations in the original dataset PDBbind2016 as the baseline, multi-view affinity prediction is performed on the generated ligand docking pose and the target protein to judge the validity of the generated docking pose; the deviation between the binding affinity calculated for the docking pose and the baseline is fed back to the generation model to adjust and optimize its parameters;
(24) repeating training iterations on the generated docking poses until they reach the desired level on the affinity discrimination metric, obtaining the trained model and outputting the ideal docking pose result.

4. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 1, characterized in that step (3) is implemented as follows:
an evaluation function is used to measure the root mean square deviation (RMSD) metric, defined as follows:
Figure FDA0004153618360000021
where δ is the position of an atom in a given frame minus its position in the reference frame, i.e., the position offset; T is the time; and x is the position of the atom at a given moment; the RMSD value indicates the amplitude of each atom's motion: the larger the value, the larger the spatial range over which the atom moves and the smaller its steric hindrance; in the evaluation model, a smaller RMSD means the generated docking pose structure is more accurate and valid.

5. The method for predicting the docking pose between a protein and a ligand based on a graph neural network according to claim 3, characterized in that judging the validity of the generated docking pose result in step (23) specifically comprises the following steps:
(231) data feature representation: obtaining the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the full protein sequence, and the generated protein-ligand docking pose, wherein the protein-ligand docking pose is represented as a bipartite graph;
(232) data encoding: encoding the input data with the AttentiveFP, BIGGNN, and ProBert models, respectively, to convert them into feature vectors;
(233) affinity prediction: a multilayer perceptron predicts the affinity from the encoded feature vectors, and the resulting binding affinity value is compared with the native binding affinity annotation to judge the validity of the generated pose;
(234) calculating the deviation between the predicted result and the actual pose, measured with four metrics: mean absolute error (MAE), root mean square error (RMSE), standard deviation (SD), and Pearson correlation coefficient (R), defined as follows:
Figure FDA0004153618360000031
Figure FDA0004153618360000032
Figure FDA0004153618360000033
Figure FDA0004153618360000034
where D is the number of samples in the dataset; y and
Figure FDA0004153618360000035
are, respectively, the experimentally determined and model-predicted protein-ligand binding affinity values; and a and b are the intercept and slope of the regression line, respectively. The values of these four metrics determine the loss function, which is fed back to the docking pose generation model to adjust and optimize its parameters.
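
The four metrics named in step (234) can be sketched as follows. Because the defining formulas survive only as figure references, this sketch assumes the conventional definitions used in binding-affinity benchmarks: MAE and RMSE over the D samples, Pearson R between predicted and experimental values, and SD as the residual standard deviation about the least-squares line y = a + b·ŷ with a divisor of D − 1. The function name `affinity_metrics` and the sample values are hypothetical, and the patent's exact formulas may differ.

```python
import math


def affinity_metrics(y_true, y_pred):
    """Return MAE, RMSE, SD, and Pearson R for experimental vs. predicted affinities."""
    d = len(y_true)
    if d != len(y_pred) or d < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")

    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / d
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / d)

    mean_t = sum(y_true) / d
    mean_p = sum(y_pred) / d
    cov = sum((p - mean_p) * (t - mean_t) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant inputs")

    r = cov / math.sqrt(var_p * var_t)

    # Least-squares regression of experimental values on predictions:
    # b is the slope, a is the intercept (assumed convention).
    b = cov / var_p
    a = mean_t - b * mean_p
    sd = math.sqrt(
        sum((t - (a + b * p)) ** 2 for t, p in zip(y_true, y_pred)) / (d - 1)
    )
    return {"MAE": mae, "RMSE": rmse, "SD": sd, "R": r}


# Hypothetical affinities (e.g., pK values) for three complexes.
experimental = [2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 3.0]
print(affinity_metrics(experimental, predicted))
```

Note that SD and R are insensitive to a linear rescaling of the predictions while MAE and RMSE are not, which is one reason to report all four together.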
CN202310327119.9A 2023-03-30 2023-03-30 Prediction method of docking posture between protein and ligand based on graphic neural network Pending CN116343910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310327119.9A CN116343910A (en) 2023-03-30 2023-03-30 Prediction method of docking posture between protein and ligand based on graphic neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310327119.9A CN116343910A (en) 2023-03-30 2023-03-30 Prediction method of docking posture between protein and ligand based on graphic neural network

Publications (1)

Publication Number Publication Date
CN116343910A true CN116343910A (en) 2023-06-27

Family

ID=86887594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310327119.9A Pending CN116343910A (en) 2023-03-30 2023-03-30 Prediction method of docking posture between protein and ligand based on graphic neural network

Country Status (1)

Country Link
CN (1) CN116343910A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118212973A (en) * 2024-05-20 2024-06-18 齐鲁工业大学(山东省科学院) A method and system for predicting the excitation wavelength of β-amyloid protein fluorescent probe
CN119028462A (en) * 2024-10-28 2024-11-26 西安电子科技大学 A fast and accurate protein-small molecule ligand docking method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination