CN116343910A - Method for predicting the docking pose between a protein and a ligand based on a graph neural network
- Publication number
- CN116343910A (application number CN202310327119.9A)
- Authority
- CN
- China
- Prior art keywords
- docking
- protein
- ligand
- pose
- posture
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a method for predicting the docking pose between a protein and a ligand based on a graph neural network. First, a bioinformatics sample set of protein-ligand complexes is obtained, comprising sample data and sample annotation data. Second, a docking pose generation model based on a graph neural network and a multi-view docking pose evaluation model are constructed, the model parameters are tuned, and the trained structure generation model processes the sample data to produce the actual docking pose output for the protein and ligand. Finally, mainstream evaluation metrics for docked pose structures are used to assess the stability of the output. The invention generates the optimal docking pose structure directly from the biological structure information of the ligand and protein and evaluates the generated result with a multi-view comprehensive evaluation model, thereby improving both the accuracy of ligand-protein docking pose prediction and the effectiveness of evaluating those predictions.
Description
Technical field
The invention belongs to the field of computer-aided drug design and specifically relates to a method for predicting the docking pose between a protein and a ligand based on a graph neural network.
Background
In computer-aided drug design, screening for docking poses with high binding affinity between a specific protein and a ligand has long been a difficult problem. Traditional approaches combine existing methods to generate infinitely many combinations of docking poses and then screen for the set of poses that best satisfies the conditions, ignoring the global information between the drug molecule and the target protein. In addition, existing docking pose information is relatively scarce, evaluating binding affinity requires a large number of experiments, and good screening results require extensive annotation of the dataset. Insufficient annotation often leads to inaccurate prediction of the binding affinity between ligand molecules and target proteins, and the experimental costs and manual annotation costs of such methods are beyond what most research projects can bear. Emerging methods that predict docking poses from three-dimensional molecular structure are very promising for predicting the binding affinity between proteins and ligands, and they improve the efficiency and accuracy of computer-aided drug design.
Among existing methods that predict docking pose and affinity from three-dimensional molecular structure, various sampling methods are used to improve prediction efficiency. Glide, for example, applies Monte Carlo sampling to the global pose information, which improves the accuracy and speed of docking pose prediction when facing a large number of drug-molecule ligands. However, methods similar to Glide often neglect the biological optimization of docking poses, and unoptimized pose information leads to inaccurate predictions of docking pose and affinity; drug production urgently needs more intelligent and accurate prediction models. Moreover, such methods often consider only one of intramolecular and intermolecular forces without treating the two together, so important information is lost during affinity prediction and the predictions become inaccurate.
Overall, current prediction of docking poses between proteins and ligands still has many limitations: prediction is time-consuming, little docking pose information is available, the spatial structure information of molecules is underused, and binding affinity predictions are not accurate enough. This is mainly because candidate drug molecules involve large amounts of data and complex spatial structures. To address the limitations of existing protein-ligand docking pose prediction methods, the invention proposes a prediction model that combines a graph neural network with a generative adversarial network, in order to remedy the low efficiency of ligand docking pose structure prediction and to improve the effectiveness of ligand-protein binding affinity evaluation.
Summary of the invention
Purpose of the invention: The invention provides a graph-neural-network-based method for predicting the docking pose between a protein and a ligand. It makes full use of the global information between the protein and the ligand molecule, avoids the loss of important intermolecular information that affects traditional prediction methods, and greatly improves the accuracy and efficiency of predicting the docking pose between the target protein and the ligand molecule.
Technical solution: The invention provides a graph-neural-network-based method for predicting the docking pose between a protein and a ligand, comprising the following steps:
(1) Obtain a bioinformatics sample set of protein-ligand complexes and preprocess it; the sample set includes sample data and the sample annotations of that data; encode the sample data to obtain feature vectors;
(2) Construct a docking pose generation model based on a graph neural network and train it with a generative adversarial network; with the target protein held fixed, apply multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss difference is computed, and the spatial positions of the atoms in the docking pose are adjusted; iterate until the discrimination result meets the threshold requirement, at which point the output for all atoms is taken as the final predicted pose;
(3) Construct a docking pose evaluation model based on a graph neural network and evaluate the generated ideal docking pose; for the actual output of the ideal protein-ligand docking pose, evaluate the predicted docking pose results against existing mainstream evaluation metrics.
Further, the preprocessing of the bioinformatics sample set in step (1) proceeds as follows:
Remove protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand, and remove proteins with missing or duplicated residues, forming a set of N protein-ligand complexes together with their native binding affinity annotations;
According to structural properties such as chemical bonds and atomic arrangement, convert the molecular graphs of the three-dimensional biomolecular data into two-dimensional adjacency matrices, which conform to the input format of the graph neural network;
This yields the protein molecular structure adjacency matrix, the ligand molecular structure adjacency matrix, and the native binding affinity sample annotations.
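The conversion from a molecular graph to an adjacency matrix can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `adjacency_matrix`, the bond-list input, and the three-atom toy molecule are all assumptions made for the example.

```python
from typing import List, Tuple


def adjacency_matrix(num_atoms: int, bonds: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from bonded atom index pairs."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        if i == j:
            continue  # a bond from an atom to itself is ignored
        adj[i][j] = 1
        adj[j][i] = 1  # chemical bonds are undirected, so mirror the entry
    return adj


# Hypothetical toy molecule: heavy atoms of ethanol, C0-C1-O2.
ethanol_adj = adjacency_matrix(3, [(0, 1), (1, 2)])
```

A dense matrix is used here only because the text speaks of adjacency matrices; a real graph neural network framework may expect a sparse edge list instead.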
Further, in step (2), the training method for docking pose structure generation based on a generative adversarial network comprises the following steps:
(21) Obtain the initial simulated docking pose structure of the protein and ligand; using the AutoDock docking pose structure simulation model, hold the protein molecule fixed, vary the spatial position distribution of the ligand molecule, and simulate random docking pose structures;
(22) Construct a docking pose generation model based on a graph neural network to generate candidate docking poses; initialize the neural network generator from the simulated docking pose structure; using sample data drawn from the training sample set together with a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, computes the motion of each ligand atom, and outputs a movement vector, giving the generator's actual docking pose structure output;
(23) Construct a multi-view discrimination model for docking pose validity to evaluate the actual output of the generation model; taking the native binding affinity annotations in the original PDBbind2016 dataset as the reference result, perform multi-view affinity prediction on the generated ligand docking pose and the target protein to judge the validity of the generated docking pose; the deviation between the computed binding affinity of the docking pose and the reference result is fed back to the generation model to adjust and optimize its parameters;
(24) Repeat the training iterations on the generated docking poses until the docking pose meets the expected level on the affinity discrimination metric; obtain the trained model and output the ideal docking pose result.
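The control flow of steps (21) to (24), in which per-atom movement vectors are applied repeatedly until a discriminator score meets a threshold, can be sketched as below. The sketch only illustrates the loop: `propose_moves` and `score` are hypothetical stand-ins for the graph neural network generator and the multi-view discriminator, neither of which is specified at code level in the source, and no adversarial training is performed here.

```python
import math
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def apply_moves(coords: List[Vec3], moves: List[Vec3]) -> List[Vec3]:
    """Shift every ligand atom by its movement vector."""
    return [(x + dx, y + dy, z + dz) for (x, y, z), (dx, dy, dz) in zip(coords, moves)]


def refine_pose(
    coords: List[Vec3],
    propose_moves: Callable[[List[Vec3]], List[Vec3]],
    score: Callable[[List[Vec3]], float],
    threshold: float,
    max_steps: int = 100,
) -> Tuple[List[Vec3], int]:
    """Apply movement vectors until the score (lower is better) is at or below threshold."""
    steps = 0
    while score(coords) > threshold and steps < max_steps:
        coords = apply_moves(coords, propose_moves(coords))
        steps += 1
    return coords, steps


# Toy stand-ins (assumptions): the "generator" moves each atom halfway toward a
# known target position, and the "discriminator" returns the largest remaining distance.
target: List[Vec3] = [(0.0, 0.0, 0.0)]


def toy_propose(coords: List[Vec3]) -> List[Vec3]:
    return [tuple((t - c) * 0.5 for c, t in zip(atom, ref)) for atom, ref in zip(coords, target)]


def toy_score(coords: List[Vec3]) -> float:
    return max(math.dist(atom, ref) for atom, ref in zip(coords, target))


final_pose, n_steps = refine_pose([(4.0, 0.0, 0.0)], toy_propose, toy_score, threshold=0.5)
```

The `max_steps` cap is a design choice added for the sketch so that the loop terminates even if the threshold is never reached.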
Further, step (3) is implemented as follows:
An evaluation function is used to measure the root-mean-square deviation (RMSD) metric, which is defined as follows:
[Equation not reproduced in the extracted text.]
Here δ denotes the position of an atom in a given frame minus its position in the reference frame, that is, the position offset; T denotes time, and x denotes the position of the atom at a given moment. The RMSD value indicates the magnitude of each atom's motion: the larger the value, the larger the spatial range over which the atom moves and the smaller its steric hindrance. In the evaluation model, a smaller RMSD means a more accurate and valid generated docking pose structure.
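The RMSD formula itself is not present in this text. One common form that is consistent with the symbols described above is given here as an assumption rather than as the patent's exact expression; the atom count N and the reference position x_i^ref are notation introduced for this sketch.

```latex
\mathrm{RMSD}(T) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\lVert \delta_i \rVert^{2}},
\qquad
\delta_i = x_i(T) - x_i^{\mathrm{ref}}
```

Under this form, RMSD is zero when every atom coincides with its reference position and grows with the average displacement.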
Further, judging the validity of the generated docking pose result in step (23) comprises the following steps:
(231) Data feature representation: obtain the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the full protein sequence, and the generated protein-ligand docking pose, where the protein-ligand docking pose is represented as a bipartite graph;
(232) Data encoding: encode the input data with the Attentive FP, BIGGNN, and ProBert models respectively, converting it into feature vectors;
(233) Affinity prediction: pass the encoded feature vectors through a multilayer perceptron to predict the affinity, and compare the resulting binding affinity value with the native binding affinity annotation to judge the validity of the generated pose;
(234) Compute the deviation between the predicted result and the actual pose using four metrics: mean absolute error (MAE), root-mean-square error (RMSE), standard deviation (SD), and the Pearson correlation coefficient (R), defined as follows:
[Equations not reproduced in the extracted text.]
Here D is the number of samples in the dataset, y and ŷ are the experimentally determined and model-predicted protein-ligand binding affinity values respectively, and a and b are the intercept and slope of the regression line. The values of these four metrics determine the loss function, which is fed back to the docking pose generation model to adjust and optimize its parameters.
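The four metric definitions are missing from this text. The standard forms that match the symbols D, y, ŷ, a, and b are reproduced below as a reconstruction under assumptions: the D − 1 normalization in SD and the bar notation for sample means are not confirmed by the source.

```latex
\mathrm{MAE} = \frac{1}{D}\sum_{i=1}^{D}\lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{D}\sum_{i=1}^{D}\left(y_i - \hat{y}_i\right)^{2}}

\mathrm{SD} = \sqrt{\frac{1}{D-1}\sum_{i=1}^{D}\left(y_i - (a + b\,\hat{y}_i)\right)^{2}}
\qquad
R = \frac{\sum_{i=1}^{D}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}
         {\sqrt{\sum_{i=1}^{D}\left(y_i - \bar{y}\right)^{2}}\,\sqrt{\sum_{i=1}^{D}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}}}
```

MAE and RMSE measure absolute error, SD measures scatter around the regression line of y on ŷ, and R measures linear correlation, so a model can score well on R and SD while still carrying a constant offset that MAE and RMSE expose.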
Beneficial effects: Compared with the prior art, the invention uses a graph neural network to make full use of the global information between the protein and the ligand molecule, avoids the loss of important intermolecular information that affects traditional prediction methods, and greatly improves the accuracy and efficiency of predicting the docking pose between the target protein and the ligand molecule. The invention uses two optimization models, a docking pose structure generation model based on a generative adversarial network and a docking pose evaluation model based on a graph neural network, to make full use of the global information of the biomolecules, generate a model of the protein-ligand docking pose from that information, and carry out prediction and evaluation.
Description of the drawings
Fig. 1 is a flowchart of the invention;
Fig. 2 is a schematic flowchart of the preprocessing of the protein-ligand complex bioinformatics data;
Fig. 3 is a schematic flowchart of the method for evaluating the generated protein-ligand pose structure.
Detailed description
The invention is described in further detail below with reference to the drawings.
As shown in Fig. 1, the invention proposes a method for predicting the docking pose between a protein and a ligand based on a graph neural network, comprising the following steps:
Step 1: Obtain a bioinformatics sample set of protein-ligand complexes and preprocess it; the sample set includes sample data and the sample annotations of that data; encode the sample data to obtain feature vectors.
Extract the data in the original PDBbind2016 dataset to obtain the biomolecular structure information of the ligands and proteins. Remove protein-ligand complexes with fewer than two rotatable bonds and proteins with more than one ligand, and remove proteins with missing or duplicated residues, forming a set of N protein-ligand complexes together with their native binding affinity annotations.
Then, according to structural properties such as chemical bonds and atomic arrangement, convert the three-dimensional biomolecular data into two-dimensional adjacency matrices (which conform to the input format of the Pytorch_Geometric graph neural network framework). The resulting multidimensional protein-ligand information is then used for docking pose structure generation, so that a large number of initial docking pose structures is obtained at random. Mature models already exist for docking pose structure generation; the AutoDock docking pose structure generation model is used here. Finally, the protein-ligand docking pose structures undergo affinity prediction and annotation and are randomly divided into a training sample set and a test sample set. Each sample includes the protein molecular structure adjacency matrix, the ligand molecular structure adjacency matrix, the protein-ligand docking pose structure, and the corresponding binding affinity sample annotation.
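The filtering rules and the random train/test division described in Step 1 can be sketched as below. The record fields, the 20% test fraction, and the fixed seed are assumptions chosen for illustration; the source does not state the split ratio or how the PDBbind2016 entries are parsed.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ComplexRecord:
    """Hypothetical summary of one protein-ligand complex."""
    pdb_id: str
    rotatable_bonds: int
    num_ligands: int
    has_residue_defects: bool  # missing or duplicated residues
    affinity: float            # native binding affinity annotation


def passes_filters(rec: ComplexRecord) -> bool:
    """Keep complexes with at least two rotatable bonds, one ligand, and intact residues."""
    return rec.rotatable_bonds >= 2 and rec.num_ligands == 1 and not rec.has_residue_defects


def filter_and_split(
    records: List[ComplexRecord], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[ComplexRecord], List[ComplexRecord]]:
    """Filter the records, shuffle them reproducibly, and return (train, test)."""
    kept = [r for r in records if passes_filters(r)]
    rng = random.Random(seed)  # local generator so the global random state is untouched
    rng.shuffle(kept)
    n_test = int(len(kept) * test_fraction)
    return kept[n_test:], kept[:n_test]
```

Seeding a local `random.Random` instance keeps the split reproducible between runs, which matters when reported metrics are compared across experiments.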
Step 2: Construct a docking pose generation model based on a graph neural network and train it with a generative adversarial network; with the target protein held fixed, apply multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms; a multi-view docking pose discriminator evaluates the docking pose, the loss difference is computed, and the spatial positions of the atoms in the docking pose are adjusted; iterate until the discrimination result meets the threshold requirement, at which point the output for all atoms is taken as the final predicted pose. As shown in Fig. 2, this comprises the following steps:
(2.1) Obtain the initial simulated docking pose structure of the protein and ligand; using the AutoDock docking pose structure generation model, hold the protein molecule fixed, vary the spatial position distribution of the ligand molecule, and generate random docking pose structures;
(2.2) Construct a docking pose generation model based on a graph neural network to generate candidate docking poses; initialize the neural network generator; using sample data drawn from the training sample set together with a defined noise distribution, the generator applies multi-step pose prediction to the ligand molecule to simulate the docking pose structure of its atoms, computes the motion of each ligand atom, and outputs a movement vector, giving the generator's actual docking pose structure output;
(2.3) Construct a multi-view discrimination model for docking pose validity to evaluate the actual output of the generation model; taking the native binding affinity annotations in the original PDBbind2016 dataset as the reference result, perform multi-view affinity prediction on the generated ligand docking pose and the target protein to judge the validity of the generated docking pose; the deviation between the computed binding affinity of the docking pose and the reference result is fed back to the generation model to adjust and optimize its parameters.
The validity of the generated docking pose result is judged as shown in Fig. 3, by the following process:
1) Data feature representation: obtain the input data, namely the ligand molecular graph structure, the protein molecular graph structure, the full protein sequence, and the generated protein-ligand docking pose, where the protein-ligand docking pose is represented as a bipartite graph.
2) Data encoding: encode the input data with the Attentive FP, BIGGNN, and ProBert models respectively, converting it into feature vectors.
3) Affinity prediction: pass the encoded feature vectors through a multilayer perceptron to predict the affinity, and compare the resulting binding affinity value with the native binding affinity annotation to judge the validity of the generated pose.
4) Compute the deviation between the predicted result and the actual pose using four metrics: mean absolute error (MAE), root-mean-square error (RMSE), standard deviation (SD), and the Pearson correlation coefficient (R), defined as follows:
[Equations not reproduced in the extracted text.]
Here D is the number of samples in the dataset, y and ŷ are the experimentally determined and model-predicted protein-ligand binding affinity values respectively, and a and b are the intercept and slope of the regression line. The values of these four metrics determine the loss function, which is fed back to the docking pose generation model to adjust and optimize its parameters.
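The four affinity metrics named in item 4) can be computed as in the sketch below. It assumes the standard definitions: the regression line is a least-squares fit of the experimental values on the predicted values, and SD uses a D − 1 denominator. The source text does not confirm either choice, and the function name `affinity_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def affinity_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return MAE, RMSE, SD, and Pearson R for experimental vs. predicted affinities."""
    d = len(y_true)
    if d != len(y_pred) or d < 2:
        raise ValueError("need two equal-length sequences with at least two samples")

    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / d
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / d)

    mean_t = sum(y_true) / d
    mean_p = sum(y_pred) / d
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("Pearson R is undefined for constant inputs")
    r = cov / math.sqrt(var_t * var_p)

    # Least-squares line y = a + b * y_pred (a: intercept, b: slope).
    b = cov / var_p
    a = mean_t - b * mean_p
    sd = math.sqrt(sum((t - (a + b * p)) ** 2 for t, p in zip(y_true, y_pred)) / (d - 1))

    return {"MAE": mae, "RMSE": rmse, "SD": sd, "R": r}


# Toy values (not from the patent): predictions offset from experiment by a constant 0.5.
metrics = affinity_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

In the toy case the constant offset shows up in MAE and RMSE, while R stays at 1 and SD at 0 because the regression line absorbs the offset.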
(2.4) Repeat the training iterations on the generated docking poses until the docking pose meets the expected level on the affinity discrimination metric; obtain the trained model and output the ideal docking pose result.
Step 3: Construct a docking pose evaluation model based on a graph neural network and evaluate the generated ideal docking pose; for the actual output of the ideal protein-ligand docking pose, evaluate the predicted docking pose results against existing mainstream evaluation metrics.
A conventional evaluation function is used to measure the root-mean-square deviation (RMSD) metric, which is defined as follows:
[Equation not reproduced in the extracted text.]
Here δ denotes the position of an atom in a given frame minus its position in the reference frame, that is, the position offset; T denotes time, and x denotes the position of the atom at a given moment. The RMSD value indicates the magnitude of each atom's motion: the larger the value, the larger the spatial range over which the atom moves and the smaller its steric hindrance. In the evaluation model, a smaller RMSD means a more accurate and valid generated docking pose structure.
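A pose-level RMSD between a generated ligand pose and a reference pose can be computed as in the sketch below. It assumes both poses list the same atoms in the same order and share a coordinate frame; it performs no superposition and no symmetry correction, which a production docking evaluation would typically need. The function name `pose_rmsd` is hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pose_rmsd(predicted: Sequence[Vec3], reference: Sequence[Vec3]) -> float:
    """Root-mean-square deviation over matched atoms of two poses."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    # Squared offset of each atom from its reference position, averaged over atoms.
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


# Toy poses (not from the patent): every atom is displaced by 1 along z.
reference_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
generated_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
toy_rmsd = pose_rmsd(generated_pose, reference_pose)
```

`math.dist` requires Python 3.8 or later.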
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310327119.9A CN116343910A (en) | 2023-03-30 | 2023-03-30 | Prediction method of docking posture between protein and ligand based on graphic neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310327119.9A CN116343910A (en) | 2023-03-30 | 2023-03-30 | Prediction method of docking posture between protein and ligand based on graphic neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116343910A (en) | 2023-06-27 |
Family
ID=86887594
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310327119.9A (Pending; published as CN116343910A) | Prediction method of docking posture between protein and ligand based on graphic neural network | 2023-03-30 | 2023-03-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116343910A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118212973A (en) * | 2024-05-20 | 2024-06-18 | 齐鲁工业大学(山东省科学院) | A method and system for predicting the excitation wavelength of β-amyloid protein fluorescent probe |
CN119028462A (en) * | 2024-10-28 | 2024-11-26 | 西安电子科技大学 | A fast and accurate protein-small molecule ligand docking method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||