CN116959555A - Method and system for compound-protein affinity prediction based on protein three-dimensional structure - Google Patents

Method and system for compound-protein affinity prediction based on protein three-dimensional structure

Info

Publication number
CN116959555A
CN116959555A (application number CN202210828457.6A)
Authority
CN
China
Prior art keywords
protein
compound
features
sequence
affinity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210828457.6A
Other languages
Chinese (zh)
Inventor
王绪化
郭滨杰
郑涵予
江昊翰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Publication of CN116959555A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for predicting compound-protein affinity based on a protein three-dimensional structure, comprising the following steps: S1, a compound feature extraction step, in which updated atomic features and aggregate node features are obtained using a deep graph convolutional network and a multi-head attention algorithm; S2, a protein feature extraction step, in which a feature aggregation algorithm and a co-evolution strategy make the protein's sequence features and structural features represent more complete protein information; S3, a compound-protein affinity prediction step, in which a predicted affinity value is obtained from the atomic features, aggregate node features and sequence features. The invention also discloses a corresponding system comprising a compound extractor, a protein extractor and an affinity predictor. The invention uses a discretized distance matrix and a torsion angle matrix as the protein three-dimensional structure representation, introduces a co-evolution update mechanism to update features between the protein three-dimensional structure and the sequence, and uses aggregate node features to improve affinity prediction accuracy.

Description

Method and system for predicting compound-protein affinity based on protein three-dimensional structure

Technical Field

The present invention relates to the field of drug research and development, and in particular to a method and system for predicting compound-protein affinity based on protein three-dimensional structure.

Background Art

Accurately predicting the binding affinity of compounds to proteins is a major challenge in the virtual screening of drug candidates. Effective high-throughput virtual screening of lead compounds with computational methods can greatly accelerate drug development by reducing R&D time and experimental workload. In recent years in particular, as technology has advanced and data-sharing projects have been vigorously promoted, a large number of new biomedical data sources have become accessible. For example, PubChem currently contains more than 111 million compound structures and 271 million bioactivity data points (https://doi.org/10.1093/nar/gkaa971), which greatly increases the potential of compound-protein binding affinity (CPA) prediction. However, further improving the accuracy of CPA prediction remains the main challenge that virtual drug screening must overcome before it can be applied in practice. To improve CPA prediction accuracy, computational methods have developed largely along two paths, structure-free and structure-based. Although both approaches have helped improve CPA prediction accuracy, each still faces challenges in improving it further.

Structure-free parametric models treat the compound (drug) as a combination of atoms and the protein (target) as a combination of residues, and compute pairwise atom-residue distances to form a [number of atoms] x [number of residues] pairwise interaction matrix; below a certain distance threshold, a compound atom is considered to be in contact with (i.e., to interact non-covalently with) the corresponding protein residue. Each element of the pairwise interaction matrix is a binary value indicating whether the corresponding atom-residue pair interacts. These structure-free parametric models rely on the sequences of the drug and the target to learn interactions from the pairwise matrix, with the goal of predicting drug-target affinity. For example, multi-layer one-dimensional convolutional neural networks (1D-CNNs) have been used to extract features from protein sequences, and the features represented by these matrices serve as input to another downstream deep learning (DL) model that ultimately learns drug-target affinity. However, such interactions change constantly with time and environment, because protein structures are dynamic, whereas the captured pairing information reflects only a single state recorded by electron microscopy. A structure-free parametric model that ignores binding-site information in the three-dimensional structure and extracts information from only a single type of protein feature (sequence features alone) is therefore likely to be severely limited for CPA prediction.

Previous studies have shown that predicted pairwise non-covalent interaction results can be fed into structure-free models to further improve their accuracy in CPA prediction. However, simply adopting the area under the curve (AUC) as the evaluation metric for pairwise-matrix prediction can make the results misleading. The AUC is the area under the receiver operating characteristic (ROC) curve, whose axes measure the model's ability to make true-positive and true-negative predictions; it is plotted by sweeping the threshold that separates positive from negative values in binary classification. As shown in FIG. 16, a compound-protein pairing matrix contains a very large number of negative values and only a few positive (compound-protein interaction) values, for example a dataset with 5 positive pairs against 11,595 negatives. Even if the model predicts an all-negative matrix, the AUC remains high (above 0.95) in this situation. The predicted pairwise interaction matrix is an intermediate result produced during model training, and feeding it back into the original model for CPA prediction is inappropriate. Mathematically, any intermediate result produced while training a machine learning model is merely a representation of the input features (i.e., the compound sequence, the protein sequence and the pairwise interaction matrix). In principle, when this representation is fed back together with its upstream features to predict the final response (i.e., CPA), that response is still a representation of the original input features. In our experiments, we found that feeding the intermediate results into the structure-free model together with the original features did not benefit CPA prediction, indicating that a structure-free model cannot improve CPA prediction accuracy without new information being supplied. Models built on this approach may therefore struggle to improve accuracy further.

Among structure-based methods for CPA prediction, one is molecular docking. Docking is a bioinformatics modeling process between target protein fragments and ligands that takes the potential binding sites and three-dimensional structure of the protein-drug complex into account during prediction. However, only a small number of protein 3D structures are currently available, which greatly limits the applicability of this method. To overcome this limitation, an increasing number of methods based on machine learning (ML) algorithms have been proposed; they rely not only on drug and target sequence data but also on spatial three-dimensional information (e.g., three-dimensional coordinates, the distances between drug and target in the ligand complex, and the torsion angles between protein residues) in order to predict CPA more accurately. Although attractive in theory, developing a practical paradigm that exploits protein three-dimensional structural information for CPA prediction is very difficult.

At present, the main obstacle to applying protein three-dimensional information in computational models is the lack of suitable methods or models that can make reasonable use of it. Some studies have introduced three-dimensional structural information directly into traditional models but failed to achieve good prediction results, and most existing computational methods for compound-protein affinity prediction still predict affinity values from sequence information, without significant improvement on some datasets. Developing a practical multimodal model that makes reasonable use of both the three-dimensional structural information and the sequence information of proteins to further improve CPA prediction accuracy therefore remains an urgent need.

Summary of the Invention

To overcome the defects of the prior art, the object of the present invention is to propose a reasonable representation of the three-dimensional structure of proteins and to develop a multimodal end-to-end model that can aggregate protein sequence information and structural information simultaneously. After the structural and sequence information of the protein is characterized and aggregated, a co-evolutionary attention algorithm is applied to optimize the extraction of the protein's structural and sequence features; a graph convolution layer with special residual links is introduced to avoid the over-smoothing problem encountered in deep graph convolution and thus extract compound features more comprehensively; and an interactive attention algorithm is introduced in the affinity prediction module to realize feature interaction between protein and compound, so that the latent interaction information between them is learned and the affinity prediction accuracy is improved.

The present invention provides a method for predicting compound-protein affinity based on protein three-dimensional structure. Compound features, including atomic features and aggregate node features, are extracted from the compound; protein features implicitly carrying the protein's three-dimensional information are extracted from the protein; the compound features and protein features are then passed to an affinity prediction algorithm to predict the compound-protein affinity. The method specifically comprises the following steps:

S1, a compound composite feature extraction step:

From the compound representation graph, initial atomic features and aggregate node features are obtained; a deep graph convolutional network and a multi-head attention mechanism are used to cyclically update the atomic features and aggregate node features, and the final atomic features and aggregate node features of the compound are output.

S2, a protein feature extraction step:

Protein sequence information and protein structural features are obtained; based on them, a protein feature aggregation algorithm makes the embedded protein sequence features carry the protein structural features, yielding embedded sequence features and embedded structural features; the embedded sequence features and embedded structural features are cyclically updated with a co-evolution strategy, finally yielding updated protein sequence features and protein structural features.

S3, a compound-protein affinity prediction step:

The atomic features and aggregate node features of the compound obtained in step S1 and the protein sequence features obtained in step S2 are passed through the affinity learning unit algorithm to obtain a predicted affinity value; the larger the affinity value, the higher the probability that the compound and the protein bind.
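The three steps compose into a single end-to-end model. As an illustration only, a minimal PyTorch-style sketch of how the outputs of S1 and S2 feed S3 is given below; the module names, signatures and shapes are assumptions chosen for readability and do not reproduce the patent's Algorithms 1-9.

import torch.nn as nn

class FeatNNSketch(nn.Module):
    def __init__(self, compound_extractor, protein_extractor, affinity_predictor):
        super().__init__()
        self.compound_extractor = compound_extractor    # step S1
        self.protein_extractor = protein_extractor      # step S2
        self.affinity_predictor = affinity_predictor    # step S3

    def forward(self, atom_feat, adj, seq_tokens, dist_bins, torsion):
        # S1: atomic features and aggregate node feature of the compound
        atom_out, agg_node = self.compound_extractor(atom_feat, adj)
        # S2: protein sequence features carrying 3D structural information
        seq_feat, struct_feat = self.protein_extractor(seq_tokens, dist_bins, torsion)
        # S3: predicted affinity value for the compound-protein pair
        return self.affinity_predictor(atom_out, agg_node, seq_feat)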

Preferably, step S1 specifically comprises:

S11, obtaining atomic features and aggregate node features from the compound representation graph;

S12, inputting the atomic features into the deep graph convolutional network; the output of the deep graph convolutional network and the aggregate node features are summed according to a first compound-feature weight, and the result is combined with the atomic features through a first gated recurrent unit (GRU) to obtain updated atomic features;

S13, inputting the atomic features into the multi-head attention representation algorithm; the output of the multi-head attention representation algorithm and the aggregate node features are summed according to a second compound-feature weight, and the result is combined with the aggregate node features through a second gated recurrent unit (GRU) to obtain updated aggregate node features;

S14, taking the updated atomic features and updated aggregate node features as the new atomic features and aggregate node features, repeating steps S12-S13 K times, and outputting the atomic features and aggregate node features.

Preferably, the deep graph convolutional network in step S1 has residual connection loops that aggregate each atom's neighbor information onto the atom while avoiding the over-smoothing problem caused by increasing the number of network layers; a GRU aggregates information between successive layers of the network, and the multi-head attention representation algorithm is used to obtain diverse compound features, improving the accuracy of affinity prediction in the affinity prediction step.

Preferably, step S2 specifically comprises:

S21, obtaining protein sequence information and protein structural features, the protein structural features including at least a discretized distance matrix;

S22, the protein feature aggregation algorithm specifically being: the protein sequence information is passed through a first embedding layer to obtain a sequence vector, and the discretized distance matrix is passed through a second embedding layer to obtain a discretized-distance-matrix feature vector, which serves as the embedded distance matrix after feature embedding;

S23, the co-evolution strategy being implemented with N layers of a protein encoding algorithm, each layer identical, one layer of the protein encoding algorithm specifically being:

the embedded distance matrix is summed along its rows and along its columns, and the two results are fused through a gated recurrent unit to obtain spliced embedded structural information; the spliced embedded structural information then enters a diversity convolution layer to obtain diversified protein structural features, so that more diverse protein features are learned;

the embedded sequence features pass through a diversity convolution layer and then an ordinary convolution layer to obtain diversified protein sequence features; the diversified protein structural features and the diversified protein sequence features are summed through gating logic, and the result is passed through a gated recurrent unit and a convolution layer in turn to obtain a structural feature vector carrying both structural information and sequence information;

the structural feature vector is combined with itself via an outer-sum transformation to output the protein structural features;

the embedded updated sequence features pass through a diversity convolution layer to obtain diversified embedded updated sequence features; these are summed with the diversified protein structural features through gating logic to obtain a sequence feature vector fusing structural information and sequence information; a gated recurrent unit then further fuses the diversified protein sequence features with this sequence feature vector, and the protein sequence features are output;

S24, the new embedded distance matrix output by the protein encoding algorithm being the required embedded distance matrix; the resulting embedded distance matrix and embedded sequence features are fed as input to the next layer of the protein encoding algorithm, which continues to update them until all N layers are completed, at which point the output embedded distance matrix constitutes the protein's structural features and the embedded sequence features constitute the protein's sequence features.

Preferably, the protein structural features in step S21 further include a feature auxiliary matrix, and the embedded sequence features obtained in step S22 are additionally summed with the feature auxiliary matrix through gating logic to obtain embedded updated sequence features. The diversity convolution layer in step S23 divides the input feature vector into four equal parts along the feature dimension, passes them through four parallel ordinary convolution layers, sums the outputs, and aggregates the initial features of the sequence input through a residual connection to obtain a diversified feature vector.

Preferably, the feature auxiliary matrix is a torsion angle matrix constructed from the sine and cosine values of the dihedral angles φ and ψ between the α-carbon atom of the protein backbone chain and the amino and carboxyl groups.

Preferably, the discretized distance matrix in step S21 is obtained by dividing the inter-residue distances of the protein's stable structure into M equidistant mapping intervals that statistically approximate a normal distribution, thereby realizing a discretized encoding of the distance matrix, M being a positive integer.

Preferably, step S3 specifically comprises:

S31, receiving the atomic features and aggregate node features of the compound from step S1 as compound information and passing them into the affinity learning unit; the aggregate node features are updated through a linear layer to obtain updated aggregate node features; the compound features are updated through a linear layer and averaged over the atom dimension, and then concatenated with the updated aggregate node features to obtain comprehensive compound features;

S32, receiving the protein sequence features from step S2 as protein information and passing them into the affinity learning unit; the protein sequence features are updated through a convolution layer and averaged over the residue dimension to obtain comprehensive protein features;

the comprehensive protein features and the comprehensive compound features are matrix-multiplied (matmul), and their information is fused through a linear layer to obtain the predicted affinity value.
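For illustration, a hedged sketch of the affinity learning unit described in S31-S32 follows. The exact tensor shapes and the precise form of the matrix multiplication are not fixed by the text; this sketch assumes a per-sample outer product of the two comprehensive feature vectors followed by a linear layer.

import torch
import torch.nn as nn

class AffinityLearningUnit(nn.Module):
    def __init__(self, d_atom, d_agg, d_prot, d_hidden=128):
        super().__init__()
        self.lin_agg = nn.Linear(d_agg, d_hidden)      # updates the aggregate node feature
        self.lin_atom = nn.Linear(d_atom, d_hidden)    # updates the atomic features
        self.conv_prot = nn.Conv1d(d_prot, d_hidden, kernel_size=3, padding=1)
        self.out = nn.Linear(2 * d_hidden * d_hidden, 1)

    def forward(self, atom_feat, agg_node, prot_seq):
        # atom_feat: (B, N_atom, d_atom), agg_node: (B, d_agg), prot_seq: (B, L, d_prot)
        agg = self.lin_agg(agg_node)                          # (B, H)
        comp = self.lin_atom(atom_feat).mean(dim=1)           # mean over the atom dimension
        comp = torch.cat([comp, agg], dim=-1)                 # comprehensive compound feature (B, 2H)
        prot = self.conv_prot(prot_seq.transpose(1, 2))       # (B, H, L)
        prot = prot.mean(dim=-1)                              # mean over the residue dimension (B, H)
        pair = torch.einsum('bi,bj->bij', comp, prot)         # "matmul" of the two comprehensive features
        return self.out(pair.flatten(1)).squeeze(-1)          # predicted affinity value per pair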

The present invention also discloses a system for predicting compound-protein affinity based on protein three-dimensional structure, comprising a compound extractor, a protein extractor and an affinity predictor.

The compound extractor, based on the input compound atomic features and aggregate node features, uses a deep graph convolutional network and a multi-head attention algorithm to update the atomic features and aggregate node features through the attention mechanism, outputs the final atomic features and aggregate node features after the update loop, and sends the final compound features and aggregate node features to the affinity predictor.

The protein extractor, based on the input protein sequence features and protein structural features, obtains embedded sequence features and embedded structural features through the protein feature aggregation algorithm, and updates them with the co-evolution strategy to obtain updated protein sequence features and protein structural features.

The affinity predictor receives the compound features, protein sequence features and aggregate node features, and obtains the predicted affinity value through the affinity learning unit algorithm.

Preferably, the compound extractor includes a deep graph convolution unit and a multi-head attention representation unit, the deep graph convolution unit implementing the deep graph convolutional network and the multi-head attention representation unit implementing the multi-head attention algorithm; the protein extractor includes a protein information aggregation unit and a co-evolution update unit, the protein information aggregation unit implementing the protein feature aggregation algorithm and the co-evolution update unit implementing the co-evolution strategy; the affinity predictor includes the affinity learning unit.

Compared with the prior art, the present invention has the following beneficial effects:

1. When a distance matrix is used to record the distances between protein residues, a discretized distance matrix is used, which reduces the dimensionality of the data representation vector in the word-embedding stage; furthermore, the torsion (dihedral) angles φ and ψ between the α-carbon atom of the protein backbone and the amino and carboxyl groups are used to construct a torsion matrix that accurately represents the direction of the sequence, fully reflecting the three-dimensional structure of the protein and solving the problem of high-dimensional protein representation.

2. A co-evolution strategy is introduced to update the protein's sequence information and structural information jointly, strengthening the correlation between them, so that more comprehensive and diverse feature information is extracted into the final protein representation vector.

3. A residual structure carrying the initial graph-node information is introduced to avoid the over-smoothing problem; on this basis, an aggregation node is introduced to aggregate the global information of the compound, and the graph convolution process is supplemented with gated update units and a message-passing mechanism over the graph, so that information is interactively updated between the atomic nodes and the aggregation node in the graph structure.

4. An affinity learning unit is introduced into the affinity prediction module to aggregate protein and compound features, so that the latent interaction information between protein and compound is learned more comprehensively and the affinity prediction accuracy is improved.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flow chart of the method for predicting compound-protein affinity based on protein three-dimensional structure;

FIG. 2 is a flow chart of the specific steps of compound composite feature extraction;

FIG. 3 is a schematic flow chart of the deep graph convolutional network;

FIG. 4 is a flow chart of the specific steps of the deep graph convolutional network;

FIG. 5 is a schematic diagram of the atomic feature extraction process;

FIG. 6 is a schematic diagram of the protein feature extraction process;

FIG. 7 is a flow chart of the specific steps of the protein feature aggregation algorithm;

FIG. 8 is a flow chart of the specific steps of the single-layer protein encoding algorithm;

FIG. 9 is a schematic flow chart of the diversity convolution layer;

FIG. 10 is a schematic flow chart of the single-layer protein encoding algorithm;

FIG. 11 is a flow chart of the specific steps of the affinity learning unit algorithm;

FIG. 12 compares the performance of the models on a dataset (small dataset) in which compounds were clustered by similarity based on molecular fingerprints;

FIG. 13 compares the performance of the models on a dataset (small dataset) in which proteins were clustered by homology based on a multiple sequence alignment strategy;

FIG. 14 compares the convergence of the models during training (convergence speed and final convergence point);

FIG. 15 shows how each performance evaluation metric changes as the number of graph convolution layers increases;

FIG. 16 illustrates the problems of existing structure-free parametric models in performance evaluation.

DETAILED DESCRIPTION

To better understand the technical solution of the present invention, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise specified.

The present invention proposes a method for predicting compound-protein affinity based on protein three-dimensional structure. Compound features, including atomic features and aggregate node features, are extracted from the compound; protein features implicitly carrying the protein's three-dimensional information, which includes a discretized distance matrix and a torsion angle matrix, are extracted from the protein; the compound features and protein features are then passed to the affinity prediction algorithm to predict the compound-protein affinity. To implement this method, an end-to-end deep learning model based on up-to-date techniques is adopted, named the fast evolving attention and deep graph neural network (FeatNN). A method for predicting compound-protein affinity based on protein three-dimensional structure implemented with FeatNN is shown in FIG. 1; the specific implementation steps are as follows:

S1, compound composite feature extraction step

The compound composite feature extraction step is implemented by the compound extractor.

Compound features include atomic features and bond features; to characterize the compound better, aggregate node features are also introduced. The compound features and aggregate node features are collectively referred to as compound composite features.

From the compound representation graph G = {atom, bond}, the initial atomic features and aggregate node features are obtained; the deep graph convolutional network and multi-head attention algorithm update the atomic features and aggregate node features through the attention mechanism, and the final atomic features and aggregate node features are output after the update loop. At this point the atomic features already incorporate the bond features, so the output atomic features can represent the compound features. The specific steps are shown in FIG. 2.

The inventive points of this step are: the special residual connection loops of the deep graph convolutional network aggregate each atom's neighbor information onto the atom while avoiding the over-smoothing problem caused by deepening the network; a GRU aggregates information between successive layers of the network; the multi-head attention representation algorithm is used to obtain more diverse compound features; aggregate node features are introduced; and an interactive evolution mechanism is used to update the aggregate node features and atomic features.

To describe the compound composite feature extraction step clearly, the following definitions are given first. According to its molecular formula, a compound can be converted into a compound representation graph G = {atom, bond}, where atom denotes the atoms and bond denotes the bonds. atom contains atomic features such as element name, aromatic type, vertex degree and valence; bond contains bond-type and shape features; the values of atom and bond can each be feature-encoded with a one-hot encoding strategy. From the compound representation graph G, the features of every atom and every bond in the compound can be obtained, with i = 1, 2, ..., N_a indexing the atoms and j = 1, 2, ..., N_b indexing the bonds; the original atomic features are defined per atom, and the sum of the atomic features is called the aggregate node feature. With respect to the graph convolution layer l_c and the multi-head attention features (k_c being the number of heads), the atom features or master-node features of the l_c-th layer and the variable V with k_c heads are defined accordingly, where l_c = 1, 2, ..., l_comp and k_c = 1, 2, ..., k_comp.

From the graph representation of the compound, G = {V, E}, the compound features can be obtained. More specifically, each node (i.e., atom) v_i is initially represented by a feature vector of length 82, which is the concatenation of word embeddings representing the corresponding atom's symbol, degree, explicit valence, implicit valence and aromaticity. Each edge (i.e., chemical bond) e_i ∈ E is initially represented by a feature vector of length 6, which is the concatenation of word embeddings representing the bond type: single, double, triple, aromatic, whether the bond is conjugated, and whether the bond is in a ring.
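As an illustration of the atom and bond featurization just described, a hedged RDKit-based sketch follows. The exact vocabularies that yield the 82-dimensional atom vector are not given in the text, so the symbol list and value ranges below are placeholder assumptions; only the length-6 bond encoding follows the description directly.

from rdkit import Chem

ATOM_SYMBOLS = ['C', 'N', 'O', 'S', 'F', 'Cl', 'Br', 'I', 'P', 'other']  # assumed vocabulary

def one_hot(value, choices):
    vec = [0] * len(choices)
    vec[choices.index(value) if value in choices else len(choices) - 1] = 1
    return vec

def atom_features(atom):
    # symbol, degree, explicit/implicit valence, aromaticity (value ranges are assumptions)
    return (one_hot(atom.GetSymbol(), ATOM_SYMBOLS)
            + one_hot(atom.GetDegree(), list(range(6)) + ['other'])
            + one_hot(atom.GetExplicitValence(), list(range(7)) + ['other'])
            + one_hot(atom.GetImplicitValence(), list(range(7)) + ['other'])
            + [int(atom.GetIsAromatic())])

def bond_features(bond):
    bt = bond.GetBondType()
    return [int(bt == Chem.rdchem.BondType.SINGLE),
            int(bt == Chem.rdchem.BondType.DOUBLE),
            int(bt == Chem.rdchem.BondType.TRIPLE),
            int(bt == Chem.rdchem.BondType.AROMATIC),
            int(bond.GetIsConjugated()),
            int(bond.IsInRing())]          # length-6 bond feature, as in the text

mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')   # example compound (aspirin)
atoms = [atom_features(a) for a in mol.GetAtoms()]
bonds = [bond_features(b) for b in mol.GetBonds()]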

First, the atomic features of the compound are fed into a linear layer and into the deep graph convolutional network. After processing by the linear layer, the atomic features are passed to the atom aggregation function, the deep graph convolutional network, the multi-head attention representation algorithm and the first gated recurrent unit (GRU); the atom aggregation function produces the initial aggregate node features. The aggregate node features and the output of the deep graph convolutional network are each processed by a linear layer, summed, and passed through a first activation function (sigmoid) to obtain the first compound-feature weight; likewise, the aggregate node features and the output of the multi-head attention representation algorithm are each processed by a linear layer, summed, and passed through the first activation function (sigmoid) to obtain the second compound-feature weight. The aggregate node features processed by a linear layer are then added to the output of the deep graph convolutional network, the two terms being weighted by the first compound-feature weight and by one minus the first compound-feature weight respectively; the sum is fed into the first GRU, which outputs the atomic features, thereby updating them. Similarly, the aggregate node features processed by a linear layer are added to the output of the multi-head attention representation algorithm, the two terms being weighted by one minus the second compound-feature weight and by the second compound-feature weight respectively; the sum is fed into the second GRU, which outputs the aggregate node features, thereby updating them.

The updated aggregate node features and atomic features are taken as the input of the next iteration; the aggregate node features and atomic features obtained after K iterations are the ones to be output. K is a positive integer greater than 2, with its value set as required.
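For readability, a minimal sketch of one iteration of this update loop is given below. The tensor shapes, the attention read-out from the aggregate node, and the exact placement of the linear layers and gates are assumptions; the sketch does not reproduce the patent's Algorithms 1-2. The feature dimension d must be divisible by the number of attention heads.

import torch
import torch.nn as nn

class CompoundUpdateStep(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gcn_lin = nn.Linear(d, d)          # stands in for the deep graph convolution block
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.lin_a1 = nn.Linear(d, d)
        self.lin_g1 = nn.Linear(d, d)
        self.lin_a2 = nn.Linear(d, d)
        self.lin_g2 = nn.Linear(d, d)
        self.gru_atom = nn.GRUCell(d, d)
        self.gru_agg = nn.GRUCell(d, d)

    def forward(self, atom, agg, adj):
        # atom: (N, d) atomic features, agg: (d,) aggregate node feature, adj: (N, N) adjacency
        gcn_out = torch.relu(self.gcn_lin(adj @ atom))                # neighbour aggregation
        attn_out, _ = self.attn(agg[None, None], atom[None], atom[None])
        attn_out = attn_out.squeeze(0).squeeze(0)                     # (d,) attention read-out
        g1 = torch.sigmoid(self.lin_a1(agg) + self.lin_g1(gcn_out))   # first compound-feature weight
        g2 = torch.sigmoid(self.lin_a2(agg) + self.lin_g2(attn_out))  # second compound-feature weight
        atom_in = g1 * self.lin_a1(agg) + (1 - g1) * gcn_out
        atom_new = self.gru_atom(atom_in, atom)                       # updated atomic features
        agg_in = (1 - g2) * self.lin_a2(agg) + g2 * attn_out
        agg_new = self.gru_agg(agg_in[None], agg[None]).squeeze(0)    # updated aggregate node feature
        return atom_new, agg_new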

Graph convolutional networks (GCNs) can be used in biological research to predict drug-target affinity, but because of depth limitations this approach always suffers from an over-smoothing problem: as the number of layers increases, GCN performance deteriorates because the node representations in the GCN converge to similar values. The deep graph convolutional network uses special residual connection loops to aggregate each atom's neighbor information onto the atom while avoiding the over-smoothing caused by deeper networks, so that deepening the GCN can extract more information.

The purpose of the deep graph convolutional network in the present invention is to aggregate each atom's neighbor information onto the atom, as illustrated in FIG. 3: for the carbon atom C inside the circle, the deep graph convolutional network aggregates the features of the surrounding nitrogen (N), oxygen (O) and bromine (Br) atoms, together with the features of the bonds between them, into the features of the carbon atom. The specific implementation steps are shown in FIG. 4: through a message-passing mechanism, the atomic node features and the adjacency matrix aggregate and update each atom's neighbor-node features onto that atom, supplemented by a special residual connection loop to prevent over-smoothing as the number of network layers grows, where α and θ are hyperparameters of the residual connection loop. The atomic feature extraction loop is shown in FIG. 5, where M denotes the multi-head attention aggregation operation and R denotes the number of iterations of the atomic diversity-feature loop.
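The patent names α and θ as hyperparameters of the residual connection loop but does not state the exact update rule; the following sketch assumes a GCNII-style initial-residual and identity-mapping formulation as one plausible reading.

import math
import torch
import torch.nn as nn

class ResidualGCNLayer(nn.Module):
    def __init__(self, d, layer_idx, alpha=0.1, theta=0.5):
        super().__init__()
        self.weight = nn.Linear(d, d, bias=False)
        self.alpha = alpha
        # layer-dependent identity-mapping strength; layer_idx is 1-based (assumption)
        self.beta = math.log(theta / layer_idx + 1.0)

    def forward(self, h, h0, adj_norm):
        # h: (N, d) current features, h0: (N, d) initial node features,
        # adj_norm: (N, N) normalized adjacency matrix with self-loops
        support = (1 - self.alpha) * (adj_norm @ h) + self.alpha * h0   # initial-residual connection
        out = (1 - self.beta) * support + self.beta * self.weight(support)
        return torch.relu(out)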

The multi-head attention representation algorithm implements the multi-head attention mechanism; its concrete implementation is prior art. In the present invention it is used to increase the diversity of the compound's atomic node features, so that compound information is aggregated more comprehensively, a more accurate compound representation is learned, and affinity accuracy is improved.

Therefore, after repeated processing by the deep graph convolutional network and multi-head attention, the atomic features of the compound can represent the compound features.

The purpose of a linear layer is to map an input vector to a specified dimension for subsequent vector or matrix operations. All linear layers mentioned in this application are distinct linear layers, configured according to actual needs.

The compound composite feature extraction step is implemented in a computer by Algorithm 1 below. The number of layers of the deep graph convolutional network is set as required, and the graph convolution is implemented in a computer by Algorithm 2 below.

S2, protein feature extraction step

Based on the protein sequence features and protein structural features, the protein feature aggregation algorithm produces the embedded sequence features and embedded structural features, and the protein encoding algorithm adopting the co-evolution strategy then produces the updated protein sequence features and protein structural features, as shown in FIG. 6. The protein feature extraction step is implemented in a computer by Algorithm 3 below.

The inventive point of this step is that, by exploiting the protein's structural features through the protein feature aggregation algorithm and protein encoding algorithm proposed here, the protein structural features become implicit in the protein sequence features.

S21, obtaining the sequence features and structural features of the protein;

Protein sequence information can be obtained directly from the relevant databases.

The structural features are three-dimensional structural features; they include at least the discretized distance matrix and may also include auxiliary structural features. In this embodiment the structural features include the discretized distance matrix and a torsion angle matrix serving as the auxiliary structural feature.

The discretized distance matrix of a protein records the distances between residues when the protein is unfolded into a peptide chain and is a symmetric matrix. The inter-residue distance is usually a Euclidean distance taking a value in a continuous interval; a single raw value neither represents the structural feature well nor is easy for a machine learning model to interpret. A discretized distance matrix is therefore proposed: the distance matrix is mapped into low-dimensional bins of equal width, i.e., the continuous interval is divided into several equal sub-intervals and all values within the same sub-interval are set to the same value. This greatly reduces the amount of data in the distance matrix and better reflects the distance characteristics between residues. In a conventional distance matrix the distance values vary at small intervals without clear separation, which hinders feature extraction when data are limited; the present invention divides these poorly separated continuous values into equidistant mapping intervals that statistically approximate a normal distribution, realizing a discretized encoding of the distance matrix. In this embodiment, the distance matrix is mapped into 38 equal-width low-dimensional bins between 3.25 Å and 50.75 Å (Å is the ångström, a unit of length; 1 m = 10^10 Å), with one additional bin for any larger distance and one additional bin for any smaller distance, giving 40 bins in total.
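A short sketch of this 40-bin encoding follows; the bin edges (3.25 Å to 50.75 Å, 38 inner bins plus one underflow and one overflow bin) are taken from the text, while the NumPy implementation itself is only an illustration.

import numpy as np

def discretize_distance_matrix(dist, d_min=3.25, d_max=50.75, n_inner=38):
    edges = np.linspace(d_min, d_max, n_inner + 1)   # 39 edges define 38 inner bins
    # np.digitize: 0 = below d_min, 1..38 = inner bins, 39 = above d_max -> 40 labels
    return np.digitize(dist, edges)

ca_dist = np.random.uniform(0.0, 60.0, size=(120, 120))   # toy inter-residue distance matrix (Å)
bins = discretize_distance_matrix(ca_dist)                 # integer bin indices in [0, 39]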

The torsion angle matrix is constructed with prior art. Since the distance information between residues alone may not describe the protein's three-dimensional structure precisely, the torsion angle matrix is used as an auxiliary feature that accurately represents the direction of the protein sequence. The torsion angle matrix is constructed from the sine and cosine values of the dihedral angles φ and ψ between the α-carbon atom of the protein backbone chain and the amino and carboxyl groups.
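A minimal sketch of this torsion-angle encoding follows; computing φ and ψ from atomic coordinates (for example with standard structural-biology tooling) is assumed to have been done beforehand, and the (L, 4) output layout is an assumption.

import numpy as np

def torsion_features(phi, psi):
    # phi, psi: (L,) arrays of backbone dihedral angles in radians
    return np.stack([np.sin(phi), np.cos(phi),
                     np.sin(psi), np.cos(psi)], axis=-1)   # (L, 4) torsion matrix

phi = np.random.uniform(-np.pi, np.pi, size=200)   # toy angles for a 200-residue chain
psi = np.random.uniform(-np.pi, np.pi, size=200)
tors = torsion_features(phi, psi)                   # fed to a convolution layer in step S22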

S22, obtaining the embedded sequence features and the embedded distance matrix from the protein sequence features and three-dimensional structural features through the protein feature aggregation algorithm, as shown in FIG. 7; the specific steps are:

The protein sequence information is word-embedded by the first embedding layer to obtain the sequence features.

The discretized distance matrix is word-embedded by the second embedding layer to obtain the discretized-distance-matrix features, output as sequence features carrying the discretized distance matrix information.

After feature extraction by a convolution layer, the torsion angle matrix yields the torsion features; the torsion features are passed through a linear layer and the first activation function (sigmoid) to obtain the torsion weight; the torsion features and the sequence features are then summed through gating logic with the torsion weight as the gating weight, and the output is the embedded updated sequence features.
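As an illustration only, a hedged sketch of this aggregation step follows (it is not the patent's Algorithm 5). The embedding sizes, the residue and bin vocabularies, and which term receives the gate versus one minus the gate are assumptions.

import torch
import torch.nn as nn

class ProteinFeatureAggregation(nn.Module):
    def __init__(self, n_res_types=25, n_dist_bins=40, d=64, d_torsion=4):
        super().__init__()
        self.seq_emb = nn.Embedding(n_res_types, d)       # first embedding layer
        self.dist_emb = nn.Embedding(n_dist_bins, d)      # second embedding layer
        self.torsion_conv = nn.Conv1d(d_torsion, d, kernel_size=3, padding=1)
        self.gate_lin = nn.Linear(d, d)

    def forward(self, seq_tokens, dist_bins, torsion):
        # seq_tokens: (L,), dist_bins: (L, L) integer bins, torsion: (L, 4)
        seq = self.seq_emb(seq_tokens)                           # (L, d) sequence features
        dist = self.dist_emb(dist_bins)                          # (L, L, d) embedded distance matrix
        tors = self.torsion_conv(torsion.T[None]).squeeze(0).T   # (L, d) torsion features
        gate = torch.sigmoid(self.gate_lin(tors))                # torsion weight
        seq_updated = gate * tors + (1 - gate) * seq             # embedded updated sequence features
        return seq_updated, dist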

The protein feature aggregation algorithm is implemented in a computer by Algorithm 5 below.

S23, updating the embedded sequence features and embedded distance matrix through the co-evolution strategy to obtain the updated sequence features and updated structural features;

The co-evolution strategy is implemented with N layers of the protein encoding algorithm, each layer identical; the specific steps of one layer are shown in FIG. 8.

The embedded distance matrix is summed along its rows and along its columns, and the two results are fused through a gated recurrent unit to obtain the spliced embedded structural information; the spliced embedded structural information then enters the diversity convolution layer to obtain diversified protein structural features, so that more diverse protein features are learned; at the same time, the spliced embedded structural information is passed through a linear layer and the second activation function (sigmoid) to obtain the structural-information weight.

The embedded sequence features pass through the diversity convolution layer and then an ordinary convolution layer to obtain diversified protein sequence features; the diversified protein sequence features are passed through a linear layer and the third activation function (sigmoid) to obtain the sequence-information weight.

The diversified protein structural features and the diversified protein sequence features are summed through the gating unit with the sequence-information weight as the gating weight, yielding a structural feature vector that preliminarily aggregates the sequence information and structural information; in this sum, the diversified protein sequence features are weighted by the sequence-information weight and the diversified protein structural features by one minus the embedded structural-information weight.

The structural feature vector that preliminarily aggregates the sequence and structural information and the diversified protein structural features are passed through a gated recurrent unit to update the structural feature vector along the sequence dimension, and the dimension mapping is changed by a convolution layer, giving a structural feature vector that carries both structural and sequence information. After a convolution layer specifies the mapping dimension, the structural feature vector is combined with itself by an outer-sum transformation into the protein structural features; this embodiment implements it with an outer-sum function, in which each element of the first input is combined in turn with every element of the second input to form one row, until all elements of the first input have been traversed.

The embedded updated sequence features pass through the diversity convolution layer to obtain diversified embedded updated sequence features; with the structural-information weight as the gating weight, the diversified embedded updated sequence features and the diversified protein structural features are summed through gating logic to obtain a sequence feature vector that fuses structural information and sequence information; a gated recurrent unit then further fuses the diversified embedded updated sequence features with this sequence feature vector, and the protein sequence features are output. In this sum, the diversified protein structural features are weighted by the sequence-information weight and the diversified embedded updated sequence features by one minus the sequence-information weight.
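A heavily simplified sketch of one such encoder layer follows, for orientation only (it is not the patent's Algorithm 4). A plain Conv1d stands in for the diversity convolution layer, the two sequence inputs of the text are collapsed into one, and the gated sums follow the description where it is explicit and are otherwise assumptions.

import torch
import torch.nn as nn

class ProteinEncoderLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.gru_struct = nn.GRUCell(d, d)     # fuses the row-sum and column-sum of the distance map
        self.conv_struct = nn.Conv1d(d, d, 3, padding=1)
        self.conv_seq = nn.Conv1d(d, d, 3, padding=1)
        self.gate_struct = nn.Linear(d, d)
        self.gate_seq = nn.Linear(d, d)
        self.gru_pair = nn.GRUCell(d, d)
        self.gru_seq = nn.GRUCell(d, d)
        self.proj = nn.Conv1d(d, d, 1)

    def forward(self, seq, dist):
        # seq: (L, d) embedded sequence features, dist: (L, L, d) embedded distance matrix
        spliced = self.gru_struct(dist.sum(dim=0), dist.sum(dim=1))    # row and column sums fused
        s_div = self.conv_struct(spliced.T[None]).squeeze(0).T         # diversified structural features
        g_s = torch.sigmoid(self.gate_struct(spliced))                 # structural-information weight
        p_div = self.conv_seq(seq.T[None]).squeeze(0).T                # diversified sequence features
        g_p = torch.sigmoid(self.gate_seq(p_div))                      # sequence-information weight
        pair_vec = g_p * p_div + (1 - g_p) * s_div                     # gated fusion (weighting assumed)
        pair_vec = self.gru_pair(pair_vec, s_div)
        pair_vec = self.proj(pair_vec.T[None]).squeeze(0).T
        dist_new = pair_vec[:, None, :] + pair_vec[None, :, :]         # outer sum -> (L, L, d) distance map
        seq_fused = g_s * s_div + (1 - g_s) * p_div                    # gated fusion with g_s
        seq_new = self.gru_seq(seq_fused, p_div)                       # updated protein sequence features
        return seq_new, dist_new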

In this embodiment, the ordinary convolution layer performs a conventional one-dimensional convolution. The diversity convolution layer performs convolutions in parallel to increase the diversity of the learned features: the input feature vector is first split into four equal parts along the feature dimension using an equal-division strategy, each part is fed into one of four parallel ordinary convolution layers, the four outputs are summed, and a residual connection adds back the initial input features, yielding the diversified feature vector, as shown in Figure 9. The equal-division strategy simply splits the vector into four equal parts along the feature dimension, which requires the number of features to be divisible by the number of parts (four in this embodiment); it is implemented in the computer by Algorithm 9 below.

The diversity convolution layer is implemented in the computer by Algorithm 7 below.
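
For concreteness, a minimal PyTorch sketch of a four-branch diversity convolution layer as described above is given here; the class name, kernel size, and channel sizes are illustrative assumptions and do not reproduce Algorithm 7 exactly.

import torch
import torch.nn as nn

class DivCNN(nn.Module):
    # Split the feature dimension into four equal parts, run four parallel
    # 1-D convolutions, sum their outputs, and add a residual connection.
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        assert channels % 4 == 0, "feature size must be divisible by 4"
        part = channels // 4
        self.branches = nn.ModuleList(
            nn.Conv1d(part, channels, kernel_size, padding=kernel_size // 2)
            for _ in range(4)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        parts = torch.chunk(x, 4, dim=1)                 # equal division strategy
        out = sum(conv(p) for conv, p in zip(self.branches, parts))
        return out + x                                   # residual aggregation of the input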

At this point, the distance matrix features output by the protein encoding algorithm constitute the required protein distance matrix, as shown in Figure 10. The resulting distance matrix and sequence features are passed as input to the next layer of the protein encoding algorithm, which continues to update the protein structural and sequence features until all N layers have been applied. The protein sequence features then implicitly contain the protein structural features and can therefore represent the protein.

Protein encoding is implemented in the computer by Algorithm 4 below.

Step S3, the compound and protein affinity prediction step, is described below with reference to Figure 11.

The compound and protein affinity prediction step introduces the affinity learning unit algorithm to learn latent features of the interaction between the protein and the compound, aggregates those features separately, and finally predicts the affinity.

The atomic features and aggregation node features of the compound from step S1 are passed into the affinity learning unit as the compound information. The aggregation node features are updated by a linear layer to give the updated aggregation node features; the compound (atomic) features are updated by a linear layer and averaged over the atom dimension, and the result is concatenated with the updated aggregation node features to give the comprehensive compound features.

The protein sequence features from step S2 are passed into the affinity learning unit as the protein information. They are updated by a convolution layer and averaged over the residue dimension to give the comprehensive protein features.

The comprehensive protein features and the comprehensive compound features are combined by matrix multiplication (matmul), and their information is fused by a linear layer to give the predicted affinity value. The basic flow is shown in Figure 11 and is implemented in the computer by Algorithm 8 below.
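
A minimal PyTorch sketch of such a fusion head is shown below; the layer sizes, the use of a batched matrix product, and the single output linear layer are assumptions made for illustration and do not reproduce Algorithm 8 exactly.

import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    # Project the compound and protein summary vectors, combine them with a
    # matrix product, and regress a single affinity value.
    def __init__(self, comp_dim: int, prot_dim: int, hidden: int = 64):
        super().__init__()
        self.comp_proj = nn.Linear(comp_dim, hidden)
        self.prot_proj = nn.Linear(prot_dim, hidden)
        self.out = nn.Linear(hidden * hidden, 1)

    def forward(self, comp_feat: torch.Tensor, prot_feat: torch.Tensor) -> torch.Tensor:
        # comp_feat: (batch, comp_dim) comprehensive compound features
        # prot_feat: (batch, prot_dim) comprehensive protein features
        c = self.comp_proj(comp_feat).unsqueeze(2)       # (batch, hidden, 1)
        p = self.prot_proj(prot_feat).unsqueeze(1)       # (batch, 1, hidden)
        inter = torch.bmm(c, p).flatten(1)               # pairwise interaction map
        return self.out(inter).squeeze(-1)               # predicted affinity value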

The inventive point of this step is to fuse the protein sequence features, which implicitly carry the protein structural features, with the atomic features representing the compound by means of the proposed affinity learning unit algorithm, combine them with the aggregation node features, and finally predict the affinity value.

The compound and protein affinity prediction step is implemented in the computer by Algorithm 8 below.

The present invention also proposes a system for compound-protein affinity prediction based on the protein three-dimensional structure, used for predicting compound-protein affinity, as shown in Figure 1, which specifically includes the following modules:

a first input module, a second input module, a compound extractor, a protein extractor, and an affinity predictor;

The first input module receives the compound characterization graph and derives the compound atomic features and aggregation node features from it;

The compound extractor receives the compound atomic features and aggregation node features, produces updated compound features and aggregation node features from this input, and sends them to the affinity predictor;

The second input module receives the protein sequence features and the protein structural features, where the protein structural features include the discretized distance matrix and the torsion angle matrix, and sends these three features to the protein extractor;

The protein extractor receives the protein sequence features and protein structural features, updates the protein sequence features, and sends them to the affinity predictor.

The affinity predictor receives the compound features, protein sequence features, and aggregation node features, performs affinity prediction, and outputs the predicted affinity value for the compound and protein.

Further, the compound extractor includes a deep graph convolution unit and a multi-head attention representation unit. The deep graph convolution unit implements the deep graph convolutional network, which aggregates each atom's neighbour information onto the atom. The multi-head attention representation unit extracts more diverse compound features to improve the prediction accuracy of the affinity.

The protein extractor includes a protein information aggregation unit and a co-evolution update unit. The protein information aggregation unit implements the protein feature aggregation algorithm, so that the protein sequence features carry the protein structural information. The co-evolution update unit implements the co-evolution strategy, so that the extracted protein sequence features can represent the global information of the protein.

The affinity predictor consists of the affinity learning unit.

The method of the present invention is compared with the prior art below; the model implementing the present invention is named Fast Evolving Attention and Deep Graph Neural Network (FeatNN).

1. Construction of the experimental datasets

In this study, two datasets were constructed from PDBbind (version 2020, 23,496 entries in total) and BindingDB (version of February 6, 2022, 41,296 entries in total). The PDBbind database provides experimental binding affinity data for 23,496 biomolecular complexes in the PDB. This embodiment mainly uses the cleaned protein-ligand affinity data (12,628 records) with the different measurement types Ki, Kd and IC50, split into training : validation : test = 7 : 1 : 2, as the small-dataset training samples. BindingDB contains 241,268 binding records covering 8,661 proteins and 1,039,940 small molecules; this embodiment mainly uses the cleaned protein-ligand affinity data of the IC50 measurement type (218,615 records), split into training : validation : test = 7 : 1 : 2, as the large-dataset training samples.
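
A minimal Python sketch of a random 7 : 1 : 2 split is given below purely to illustrate the ratio; the reported experiments additionally use clustered splits, so this is not the exact data pipeline.

import numpy as np

def split_7_1_2(n_samples: int, seed: int = 0):
    # Shuffle indices and cut them into train / validation / test = 7 : 1 : 2.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.7 * n_samples), int(0.1 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]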

The normality of the affinity data was checked, because extreme values may bias the whole model. The BindingDB dataset contains some extreme affinity values, so records outside the interval (μ - 3σ, μ + 3σ) were removed.
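
The three-sigma filtering can be written compactly; the sketch below assumes the affinity labels are held in a NumPy array and simply returns a boolean mask over the records.

import numpy as np

def keep_within_3sigma(values: np.ndarray) -> np.ndarray:
    # Keep only labels inside (mu - 3*sigma, mu + 3*sigma).
    mu, sigma = values.mean(), values.std()
    return (values > mu - 3 * sigma) & (values < mu + 3 * sigma)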

For the distance matrix and torsion angle matrix, the protein structure data were obtained by searching the RCSB PDB (Protein Data Bank) with the PDB IDs provided by the two databases. The RCSB PDB files provide sufficient 3D structural data (spatial coordinates) to extract the distance matrix used by the model, and the torsion angle matrix is computed from the same 3D structural information.
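
As an illustrative sketch only, a residue-residue C-alpha distance matrix can be computed from a downloaded PDB file with Biopython and NumPy as shown below; the atom selection, chain handling, and subsequent discretization used by the actual pipeline may differ.

import numpy as np
from Bio.PDB import PDBParser

def ca_distance_matrix(pdb_path: str, chain_id: str) -> np.ndarray:
    # Parse the structure, collect C-alpha coordinates of one chain, and
    # return the (N_res, N_res) pairwise distance matrix in Angstrom.
    structure = PDBParser(QUIET=True).get_structure("prot", pdb_path)
    chain = structure[0][chain_id]
    coords = np.array([res["CA"].get_coord() for res in chain if "CA" in res])
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)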

For the SMILES sequences, the data were obtained from the RCSB PDB according to the ligand IDs.

2. Determination of the evaluation metrics

The predictive performance of the models is evaluated with the following metrics: R2, RMSE (root mean square error), MAE (mean absolute error), MedAE (median absolute error), the Pearson correlation coefficient, and the Spearman rank correlation coefficient. R2, RMSE and MAE describe the distance between the predicted and true values; the Pearson product-moment correlation coefficient and the Spearman rank correlation coefficient describe the correlation between the predicted and true values. A minimal computation sketch follows the metric descriptions below.

Each metric is described in detail as follows:

R2 is a dimensionless score that describes the effectiveness of a model; it compares the predictions with a baseline that always guesses the mean of the true values;

RMSE is the square root of the MSE and is commonly used as the training loss in machine learning research;

MAE describes the mean absolute error between the predicted and true values;

MedAE is the median absolute error, i.e. the median of the absolute residuals;

the Pearson correlation coefficient describes the linear correlation between two variables;

the Spearman rank correlation coefficient is the rank-based form of the Pearson correlation coefficient; it describes the monotonic association between two variables (for example, whether one variable increases when the other increases).
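
As referenced above, a minimal sketch of the metric computation, assuming scikit-learn and SciPy are available, is:

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (r2_score, mean_squared_error,
                             mean_absolute_error, median_absolute_error)

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    # Compute the six evaluation metrics listed above for one test set.
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MedAE": median_absolute_error(y_true, y_pred),
        "Pearson": pearsonr(y_true, y_pred)[0],
        "Spearman": spearmanr(y_true, y_pred)[0],
    }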

3. Evaluation of FeatNN, BACPI, GraphDTA (GATNet, GAT_GCN, GCNNet, GINConvNet) and MONN

Figure 12 shows tests on datasets clustered by compound under the IC50 and KIKD measurement standards respectively; the proposed FeatNN outperforms the other models on all three metrics (RMSE, Pearson and R2).

Figure 13 shows tests on new protein data; the proposed FeatNN again outperforms the other models on the three metrics (RMSE, Pearson and R2).

Figure 14 compares the convergence behaviour of the models during training (convergence speed and final convergence level); the figure shows that the proposed FeatNN converges best.

Figure 15 illustrates the advantage of deep graph convolution: graph convolutions of different depths are compared, showing how each performance metric changes as the number of graph convolution layers increases.

Table 1 shows the performance evaluation of MONN, GraphDTA (GATGCN, GCNNet, GINConvNet, GATConvNet), BACPI and the proposed FeatNN model, all trained on the large BindingDB dataset (IC50 measurement standard, 218,615 records in total).

Table 1

Tables 2 and 3 show the performance evaluation, under the different measurement types (KIKD or IC50), of the proposed FeatNN and the prior-art models trained on the small PDBbind dataset (KIKD and IC50 measurement standards, 12,628 records in total), after homology-similarity clustering of the compounds (Table 2) and of the proteins (Table 3).

Table 2

Table 3

To further demonstrate that the present invention is realizable, pseudocode for the main modules implementing the invention is given below, in the hope of aiding understanding of the proposed method.

First, the main variables used in the code are defined in Table 4.

Table 4. Common variable definitions

In the symbol definitions, the superscript gives the name and the subscript gives the vector dimension information.

In addition, N_res denotes the number of residues in the input main sequence (cropped during training), N_tors the torsion size, N_ebd_s the embedding size, h the hidden size, N_atom the number of atoms, N_bond the number of bonds, F_atom the atom feature size, F_bond the bond feature size, and nbs the neighbour dimension. Capitalized operator names are used when they encapsulate learnable parameters; for example, Linear denotes a linear transformation with weight matrix W and bias vector b.

LayerNorm denotes layer normalization over the channel dimension, with a learnable gain and bias per channel. Capitalized names are also used for stochastic operators, such as those related to dropout. Lowercase operator names are used for functions without parameters. The operations are defined as follows: ⊙ denotes the Hadamard (element-wise) product, a^T b the dot product of two vectors, and [·,·]_h concatenation along the hidden dimension. h denotes the hidden size, k the number of attention heads, and k_h = h/k the size of each attention head. softmax(x_i) = exp(x_i) / Σ_i exp(x_i) denotes the softmax activation; σ(x) = 1/(1 + e^(-x)) the sigmoid activation; tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)) the hyperbolic tangent activation; gelu denotes the Gaussian error linear unit, with GELU activation f(x) = (x/2)(1 + erf(x/√2)), where erf(·) is the Gaussian error function.

The algorithms of the individual modules are as follows:

Algorithm 1: Graph Convolution

con_neighbor = concat_neighbor(ver_neighbor, edge_neighbor)

neighbor_label = gelu(Linear(con_neighbor))

output = theta ⊙ Linear(support) + (1 - theta) ⊙ support
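
Algorithm 1 above is only a fragment; the following PyTorch sketch fills in one plausible gated graph-convolution step in the same spirit. The sigmoid gate used for theta, the neighbour aggregation by summation, and all layer sizes are assumptions, since Algorithm 1 does not specify them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedGraphConvLayer(nn.Module):
    # Aggregate neighbouring atom and bond features, then blend the
    # transformed representation with the aggregated one through a gate,
    # which helps avoid over-smoothing in deep stacks.
    def __init__(self, atom_dim: int, bond_dim: int, hidden: int):
        super().__init__()
        self.neighbor_lin = nn.Linear(atom_dim + bond_dim, hidden)
        self.update_lin = nn.Linear(hidden, hidden)
        self.gate_lin = nn.Linear(hidden, hidden)

    def forward(self, atom_feat, bond_feat, nbr_idx):
        # atom_feat: (N_atom, atom_dim); bond_feat: (N_atom, nbs, bond_dim)
        # nbr_idx:   (N_atom, nbs) indices of the neighbouring atoms
        nbr_atoms = atom_feat[nbr_idx]                        # gather neighbour atoms
        con = torch.cat([nbr_atoms, bond_feat], dim=-1)       # concat atom + bond features
        support = F.gelu(self.neighbor_lin(con)).sum(dim=1)   # aggregate neighbours
        theta = torch.sigmoid(self.gate_lin(support))         # gating weights
        return theta * self.update_lin(support) + (1 - theta) * support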

Algorithm 2: Compound Extractor (compound composite feature extraction)

Algorithm 3: ProteinExtractor (protein feature extraction)

Algorithm 4: ProteinEncoder (protein encoding)

Algorithm 5: Protein Aggregating (protein feature aggregation)

seq_embed = Embedding(Seq_init)

torsion_embed = CNN_Nts→h1(TorMat_init)

torsion_vector = CNN_h1→h(torsion_embed)

gate = Sigmoid(Linear(torsion_vector))
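
Algorithm 5 is likewise given only in outline; the PyTorch sketch below shows one plausible reading of it, embedding the residue sequence and the torsion-angle matrix and gating the torsion-derived signal into the sequence embedding. The vocabulary size, channel sizes, and the exact form of the gated fusion are assumptions for illustration only.

import torch
import torch.nn as nn

class ProteinAggregating(nn.Module):
    def __init__(self, vocab: int = 25, n_tors: int = 4, h1: int = 64, h: int = 128):
        super().__init__()
        self.seq_embed = nn.Embedding(vocab, h)
        self.tor_cnn1 = nn.Conv1d(n_tors, h1, kernel_size=3, padding=1)
        self.tor_cnn2 = nn.Conv1d(h1, h, kernel_size=3, padding=1)
        self.gate_lin = nn.Linear(h, h)

    def forward(self, seq_ids: torch.Tensor, tor_mat: torch.Tensor) -> torch.Tensor:
        # seq_ids: (N_res,) residue indices; tor_mat: (N_res, n_tors) sin/cos torsions
        seq = self.seq_embed(seq_ids)                                  # (N_res, h)
        tor = self.tor_cnn2(self.tor_cnn1(tor_mat.t().unsqueeze(0)))   # (1, h, N_res)
        tor = tor.squeeze(0).t()                                       # (N_res, h)
        gate = torch.sigmoid(self.gate_lin(tor))                       # torsion-derived gate
        return gate * seq + (1 - gate) * tor                           # gated fusion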

Algorithm 6: Single-layer protein encoding (EvoUpdating)

MixKey = GRU(PairKey1, PairKey2)

Struct_Features = DivCNN(MixKey)

SeqGate = Sigmoid(Linear(Seq2Struct))

StructGate = Sigmoid(Linear(MixKey))

Seq2Struct_Vector = SeqGate ⊙ Seq2Struct + (1 - SeqGate) ⊙ Struct_Features

Struct_Vector = GRU(Seq2Struct_Vector, Struct_Features)

Struct2Seq_Mapping = DeepSparseCNN(Struct_Vector)

Seq_Vector = StructGate ⊙ Struct_Features + (1 - StructGate) ⊙ Seq_Features

Algorithm 7: Diversity convolution layer (DivCNN)

def DivCNN(x):

    x0 = CNN_kernel=1(x)

    # x(seq, hidden_size) → x_k-head(seq, k, head_size)

    # where k is the number of heads, and head_size = hidden_size / k

    x_k-head = TransposeForScores(x)

    return → {x_total}

Algorithm 8: Affinity prediction (AffinityLearning)

# Input projections

# Output projection

return → {Affinity_Prediction}

Algorithm 9: TPS function (TransposeForScores)

def TransposeForScores({input}):

    # input dimension: (N_res / N_atom, h)

    # output dimension: (k, N_res / N_atom, ks)

    output = DimentionReshape(input)

    return {output}
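
The reshape behind Algorithm 9 can be written in a few lines; the sketch below assumes PyTorch tensors and that the hidden size is divisible by the number of heads, matching the divisibility requirement of the equal-division strategy described above.

import torch

def transpose_for_scores(x: torch.Tensor, k: int) -> torch.Tensor:
    # Reshape (N, h) features into k equal heads of size h // k.
    n, h = x.shape
    assert h % k == 0, "hidden size must be divisible by the number of heads"
    return x.view(n, k, h // k).permute(1, 0, 2)   # (k, N, h // k)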

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not depart in essence from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for predicting compound-protein affinity based on a three-dimensional structure of a protein, used for predicting compound-protein affinity, comprising extracting compound characteristics from the compound, wherein the compound characteristics comprise atomic characteristics and aggregation node characteristics, extracting protein characteristics which contain three-dimensional information of the protein from the protein, and predicting the compound-protein affinity by an affinity prediction algorithm from the compound characteristics and the protein characteristics, and specifically comprising the following steps:
S1, compound composite characteristic extraction step,
according to the compound characterization graph, atomic initial characteristics and aggregation node characteristics are obtained, the atomic characteristics and the aggregation node characteristics are circularly updated by utilizing a deep graph convolution network and a multi-head attention mechanism, and the final atomic characteristics and the aggregation node characteristics of the compound are output;
s2, extracting protein characteristics;
acquiring protein sequence information and protein structural characteristics; according to the protein sequence information and the protein structural characteristics, the protein embedding sequence characteristics have protein structural characteristics through a protein characteristic polymerization algorithm, so that the protein embedding sequence characteristics and the embedding structural characteristics are obtained; circularly updating the embedded sequence features and the embedded structure features by adopting a co-evolution strategy to finally obtain updated protein sequence features and protein structure features;
s3, predicting affinity of the compound and the protein;
according to the atomic characteristics and the aggregation node characteristics of the compound obtained in the step S1 and the protein sequence characteristics in the step S2, a predicted affinity value is obtained through an affinity learning unit algorithm, and the larger the affinity value is, the larger the probability that the compound and the protein are combined is.
2. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 1, wherein: the step S1 specifically comprises the following steps:
s11, obtaining atomic characteristics and polymerization node characteristics according to a compound characterization graph;
s12, inputting the atomic characteristics into a deep graph convolution network, adding the output of the deep graph convolution network and the aggregation node characteristics according to a first weight of the compound characteristics, and combining the atomic characteristics with a first gating cyclic function (GRU) to obtain updated atomic characteristics;
s13, inputting the atomic characteristics into a multi-head attention characterization algorithm, adding the output of the multi-head attention characterization algorithm and the aggregation node characteristics according to a second weight of the compound characteristics, and combining the result with the aggregation node characteristics through a second gating cyclic function (GRU) to obtain updated aggregation node characteristics;
s14, repeating the steps S12-S13 K times, taking the updated atomic characteristics and the updated aggregation node characteristics as the atomic characteristics and the aggregation node characteristics, and outputting the final atomic characteristics and aggregation node characteristics.
3. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 2, wherein: the deep graph convolutional network in the step S1 is provided with residual connections for aggregating neighbour information of atoms onto the atoms and avoiding the over-smoothing problem caused when the number of network layers is increased; a GRU aggregates information between successive layers of the network; and a multi-head attention characterization algorithm is used for acquiring diversified compound characteristics, thereby improving the accuracy of affinity prediction in the affinity prediction step.
4. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 1, wherein: the step S2 specifically comprises the following steps:
s21, acquiring protein sequence information and protein structural characteristics, wherein the protein structural characteristics at least comprise a discretization distance matrix;
s22, a protein characteristic aggregation algorithm, which specifically comprises the following steps: the protein sequence information passes through a first embedding layer (Embedding layer) to obtain a sequence vector, and the discretization distance matrix passes through a second embedding layer (Embedding layer) to obtain a discrete distance matrix feature vector; the discrete distance matrix feature vector is used as the embedded distance matrix after feature embedding update;
s23, realizing a co-evolution strategy by adopting N layers of protein coding algorithms, wherein the protein coding algorithms of each layer are the same, and one layer of protein coding algorithm specifically comprises the following steps:
the two results of the embedded distance matrix after row addition and column addition are subjected to feature fusion through a gating circulation unit to obtain spliced embedded structure information, and the spliced embedded structure information enters a diversity convolution layer to obtain diversified protein structural features;
the embedded sequence features sequentially pass through a diversity convolution layer and a common convolution layer to obtain diversified protein sequence features; adding the diversified protein structural features and the diversified protein sequence features through a gate control logic, and sequentially passing through a gate control circulation unit and a convolution layer to obtain structural feature vectors with structural information and sequence information at the same time;
The structural feature vector outputs the structural feature of the protein through self-combination transformation (outer sum);
the sequence features after embedding and updating are subjected to a diversity convolution layer to obtain diversified sequence features after embedding and updating; adding the diversified embedded updated sequence features and the diversified protein structural features through a gate control logic to obtain sequence feature vectors fusing the structural information and the sequence information; further fusing the diversified protein sequence characteristics with sequence characteristic vectors through a gating circulating unit to output protein sequence characteristics;
s24, the new embedding distance matrix output by the protein coding algorithm is the required embedding distance matrix, the obtained embedding distance matrix and the embedding sequence feature are used as the input of the next protein coding algorithm, the embedding distance matrix and the embedding sequence feature are continuously updated until the N protein coding algorithms are all completed, at the moment, the output embedding distance matrix is the structural feature of the protein, and the embedding sequence feature is the sequence feature of the protein.
5. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 4, wherein:
The protein structural features in the step S21 also comprise a feature auxiliary matrix;
the embedded sequence features obtained in the step S22 are added with the feature auxiliary matrix through the gate logic to obtain embedded updated sequence features;
the diversity convolution layer in step S23 equally divides the input feature vector into four parts in the feature dimension, and respectively enters four parallel common convolution layers, sums the output results, and simultaneously, aggregates the initial features of the sequence information input through residual connection to obtain the diversity feature vector.
6. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 5, wherein: the characteristic auxiliary matrix is a torsion angle matrix, and the torsion angle matrix is constructed from the sine and cosine values of the dihedral angles φ and ψ between the alpha carbon atoms of the protein backbone chain and the amino and carboxyl groups.
7. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 4, wherein:
the discretization distance matrix in the step S21 is to divide the distance between residues of the protein stabilizing structure into M equidistant mapping intervals which basically conform to normal distribution in statistics, so as to realize the discretization coding of the distance matrix, wherein M is a positive integer.
8. The method for compound-protein affinity prediction based on protein three-dimensional structure according to claim 1, wherein: the step S3 specifically comprises the following steps:
s31, receiving atomic characteristics and aggregation node characteristics of the compound from the step S1, and transmitting the atomic characteristics and the aggregation node characteristics of the compound as compound information into an affinity learning unit, wherein the aggregation node characteristics are updated through linear layer updating characteristics to obtain updated aggregation node characteristics; the compound features are updated through a linear layer, an average value is obtained in an atomic dimension, and feature stitching is carried out on the compound features and the updated aggregate node features, so that compound comprehensive features are obtained;
s32, receiving the sequence characteristics of the protein from the step S2, transmitting the sequence characteristics of the protein as protein information into an affinity learning unit, updating the characteristics of the protein sequence characteristics through a convolution layer, and meanwhile, obtaining the average value in the residue dimension to obtain the comprehensive characteristics of the protein;
protein complex features and compound complex features are combined by matrix multiplication (matmul) and linear layer fusion to obtain predicted affinity values.
9. A system for compound-protein affinity prediction based on a three-dimensional structure of a protein, comprising: a compound extractor, a protein extractor, and an affinity predictor;
The compound extractor updates the atomic characteristics and the aggregation node characteristics by using a deep graph convolution network and a multi-head attention algorithm according to the input atomic characteristics and the aggregation node characteristics of the compound, and circularly outputs the final atomic characteristics and the aggregation node characteristics; transmitting the final compound signature and the aggregate node signature to an affinity predictor;
a protein extractor, which obtains embedded sequence characteristics and embedded structure characteristics through a protein characteristic aggregation algorithm according to the input protein sequence characteristics and protein structure characteristics; updating the embedded sequence features and the embedded structure features by adopting a co-evolution strategy to obtain updated protein sequence features and protein structure features;
and the affinity predictor is used for receiving the compound characteristic, the protein sequence characteristic and the aggregation node characteristic, and obtaining a predicted affinity value through an affinity learning unit algorithm.
10. A system for compound-protein affinity prediction based on protein three-dimensional structure according to claim 9, characterized by comprising:
the compound extractor comprises a deep map convolution unit and a multi-head attention characterization unit, wherein the deep map convolution unit is used for realizing a deep map convolution network, and the multi-head attention characterization unit is used for realizing a multi-head attention algorithm;
The protein extractor comprises a protein information aggregation unit and a co-evolution updating unit, wherein the protein information aggregation unit is used for realizing a protein characteristic aggregation algorithm, and the co-evolution updating unit is used for realizing a co-evolution strategy;
the affinity predictor includes an affinity learning unit.
CN202210828457.6A 2022-03-29 2022-07-13 Method and system for compound-protein affinity prediction based on protein three-dimensional structure Pending CN116959555A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022103254945 2022-03-29
CN202210325494 2022-03-29

Publications (1)

Publication Number Publication Date
CN116959555A true CN116959555A (en) 2023-10-27

Family

ID=88441520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210828457.6A Pending CN116959555A (en) 2022-03-29 2022-07-13 Method and system for compound-protein affinity prediction based on protein three-dimensional structure

Country Status (1)

Country Link
CN (1) CN116959555A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117393036A (en) * 2023-11-09 2024-01-12 中国海洋大学 A protein multi-level semantic aggregation characterization method for drug-target affinity prediction



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination