WO2021218791A1 - Prediction method and device for ligand-protein interaction - Google Patents

Prediction method and device for ligand-protein interaction

Info

Publication number
WO2021218791A1
Authority
WO
WIPO (PCT)
Prior art keywords
protein
ligand
target
sequences
feature
Prior art date
Application number
PCT/CN2021/089139
Other languages
French (fr)
Chinese (zh)
Inventor
蒋华良
郑明月
陈立凡
Original Assignee
中国科学院上海药物研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院上海药物研究所
Publication of WO2021218791A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/10 Analysis or design of chemical reactions, syntheses or processes
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50 Molecular design, e.g. of drugs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • The present invention relates to the field of drug screening, and in particular to a method and device for predicting ligand-protein interaction.
  • Virtual screening is an important task in early drug research and development. It falls into three categories: structure-based virtual screening, ligand-based virtual screening, and chemical genomics-based virtual screening.
  • Structure-based virtual screening requires the crystal structure of the protein, but the crystal structures of many potential target proteins have not been solved. Structure-based virtual screening therefore cannot be applied to drug screening for such targets.
  • Ligand-based virtual screening requires abundant ligand information, and for many targets too few active small molecules have been reported to build accurate and reliable models.
  • Ligand-based virtual screening also limits the discovery and design of active small molecules with novel structures.
  • Many chemical genomics-based machine learning methods have been proposed to predict ligand-protein interactions. Their drawback is that descriptors of proteins and small molecules must be defined manually.
  • The purpose of the embodiments of the present invention is to provide a method and device for predicting ligand-protein interaction that address the inability of the prior art to accurately predict ligand-protein interaction relationships.
  • A method for predicting ligand-protein interaction, including the following steps:
  • Prediction is performed with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences to obtain the probability that the target protein interacts with the target ligand.
  • Processing the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors specifically includes:
  • A predetermined algorithm is used to encode each of the sequence fragments, yielding several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
  • Acquiring several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand specifically includes:
  • A graph convolutional network is used to process the molecular fingerprint to obtain several atomic feature sequences of the target ligand.
  • Performing prediction with the preset prediction model based on the several protein feature sequences and the several atomic feature sequences, to obtain the probability that the target protein interacts with the target ligand, specifically includes:
  • A calculation is performed based on the target feature sequence to obtain the probability that the target protein binds to the target ligand.
  • The method further includes training the prediction model with a deep learning method, which specifically includes:
  • Model training is performed based on several protein feature sequences of the sample protein, several atomic feature sequences of the sample ligand, and the true value to obtain the prediction model.
  • Performing model training based on several protein feature sequences of the sample protein, several atomic feature sequences of the sample ligand, and the true value to obtain the prediction model specifically includes:
  • The cross entropy is used as the loss function of the prediction model, and the stochastic gradient descent method is used for training to obtain the prediction model.
  • A ligand-protein interaction prediction device, including:
  • The first acquisition module is used to process the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors;
  • The second acquisition module is used to acquire several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand;
  • The prediction module is used to perform prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences to obtain the probability that the target protein interacts with the target ligand.
  • The first acquisition module is specifically configured to:
  • A predetermined algorithm is used to encode each of the sequence fragments, yielding several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
  • The second acquisition module is specifically configured to: use a cheminformatics package to process the SMILES string of the target ligand to obtain the molecular fingerprint of the target ligand;
  • A graph convolutional network is used to process the molecular fingerprint to obtain several atomic feature sequences of the target ligand.
  • The prediction module is specifically configured to:
  • A calculation is performed based on the target feature sequence to obtain the probability that the target protein binds to the target ligand.
  • The beneficial effect of the embodiments of the present invention is that the prediction model is obtained through pre-training, so that when it is necessary to predict whether a certain protein and a certain ligand can interact, only the protein feature sequences of the protein and the atomic feature sequences of the ligand need to be obtained. Using the prediction model, it is possible to predict which protein feature sequences in the protein can interact with which atomic feature sequences in the ligand, so that the probability of interaction between the protein and the ligand can be calculated and the prediction of the protein-ligand interaction becomes more accurate.
  • Fig. 1 is a flowchart of a method for predicting ligand-protein interaction in an embodiment of the present invention.
  • Fig. 2 is a schematic diagram of ligand-protein interaction prediction in an embodiment of the present invention.
  • Fig. 3 is a detailed flowchart of obtaining an interaction feature sequence in an embodiment of the present invention.
  • Fig. 4 is a structural block diagram of a device for predicting ligand-protein interaction in an embodiment of the present invention.
  • An embodiment of the present invention provides a method for predicting ligand-protein interaction, as shown in Fig. 1, including the following steps:
  • Step S101: The primary sequence of the target protein is processed to obtain several protein feature sequences composed of feature vectors.
  • In this step, word2vec, the word-vector embedding method from natural language processing, can be used to process the amino acid sequence of the protein into a sequence of feature vectors, that is, to obtain several protein feature sequences p_1, p_2, ..., p_b.
  • Step S102: Acquire several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand.
  • In this step, the cheminformatics package RDKit can be used to encode the molecular fingerprint graph of the target ligand, and several atomic feature sequences c_1, c_2, ..., c_a of the target ligand can then be learned through the graph convolutional network.
  • Step S103: Perform prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences to obtain the probability that the target protein interacts with the target ligand.
  • In this step, after the protein feature sequences p_1, p_2, ..., p_b and the atomic feature sequences c_1, c_2, ..., c_a of the ligand are obtained, they can be encoded and decoded (within the prediction model) by the Transformer framework from natural language processing, which outputs the interacting target feature sequences x_1, x_2, ..., x_a; a calculation based on the target feature sequences then yields the probability that the target protein binds to the target ligand.
  • The prediction model can thus be used to find out which protein feature sequences can interact with which atomic feature sequences, so that the probability of interaction between the protein and the ligand can be calculated.
  • Another embodiment of the present invention provides a method for predicting ligand-protein interaction, which includes the following steps:
  • Step S201: Divide the primary sequence of the target protein into several sequence fragments, each a group of a predetermined number of consecutive amino acids; encode each sequence fragment with a predetermined algorithm to obtain several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
  • Step S202: Use a cheminformatics package to process the SMILES string of the target ligand to obtain the molecular fingerprint of the target ligand; use a graph convolutional network to process the molecular fingerprint to obtain several atomic feature sequences of the target ligand.
  • Step S203: Use a self-attention mechanism to process the several protein feature sequences and the several atomic feature sequences to determine the target feature sequences that can interact; perform a calculation based on the target feature sequences to obtain the probability that the target protein binds to the target ligand.
  • A predetermined calculation formula is applied to the several target feature sequences to obtain the interaction feature; a fully connected neural network then processes the interaction feature to obtain the predicted value (probability) of the sample protein-sample ligand interaction.
  • The Transformer framework encodes and decodes the inputs and outputs the interacting target feature sequences x_1, x_2, ..., x_a; the preset calculation formula is then applied to the target feature sequences to obtain the interaction feature; finally, the fully connected neural network processes the interaction feature to obtain the probability of target protein-target ligand binding.
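The last stage described above, mapping the interaction feature to a probability with a fully connected network, can be sketched as follows. This is a minimal illustration, not the patented implementation: the layer sizes, the ReLU activation, the single sigmoid output, and the names `dense` and `predict_probability` are all assumptions made for the example.

```python
import math


def dense(x, weight, bias):
    """Compute x · W + b for one input vector; weight has shape (inputs, outputs)."""
    return [sum(xi * wi for xi, wi in zip(x, column)) + b
            for column, b in zip(zip(*weight), bias)]


def predict_probability(feature, w_hidden, b_hidden, w_out, b_out):
    """Hidden layer with ReLU, then one output logit squashed by a sigmoid."""
    hidden = [max(0.0, v) for v in dense(feature, w_hidden, b_hidden)]
    logit = dense(hidden, w_out, b_out)[0]
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sizes (assumed): a 4-dimensional interaction feature and 3 hidden units.
feature = [0.5, -1.0, 2.0, 0.0]
w_hidden = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.1, 0.0], [-0.3, 0.0, 0.5]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[0.7], [-0.5], [0.2]]
b_out = [0.05]
probability = predict_probability(feature, w_hidden, b_hidden, w_out, b_out)
print(probability)
```

In a trained model the weights would come from the training procedure rather than being written by hand; the sketch only shows how a feature vector becomes a value between 0 and 1.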
  • This embodiment provides a method for predicting ligand-protein interaction. Before predicting the interaction between the target protein and the target ligand, the method further includes training a prediction model with a deep learning method.
  • The implementation includes the following steps:
  • Step S301: Obtain experimental data.
  • Step S302: Determine the true value of the sample protein-sample ligand interaction based on the experimental data.
  • The true value y of the interaction can be obtained from the actual experimental data and results.
  • The true value y is either "1" or "0", where 1 indicates that the protein and ligand can interact and 0 indicates that they cannot.
  • Step S303: Obtain several protein feature sequences of the sample protein, and obtain several atomic feature sequences of the sample ligand.
  • A protein with a length of 200 amino acids can be selected from the experimental data, that is, a protein feature sequence with a dimension of 198 × 100 can be obtained.
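A short sketch may help explain where the 198 comes from. It assumes overlapping windows of three consecutive amino acids with a stride of one, which is consistent with 200 - 3 + 1 = 198 but is not stated explicitly in the text. The lookup table below is a hypothetical stand-in for a trained word2vec model, and the function names are illustrative only.

```python
def split_into_kmers(sequence, k=3):
    """Return every overlapping window of k consecutive residues."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def embed_fragments(fragments, table, dim):
    """Look up one feature vector per fragment; unknown fragments map to zeros."""
    zero = [0.0] * dim
    return [table.get(fragment, zero) for fragment in fragments]


# A toy 200-residue sequence built by repeating a short motif.
protein = "MKTAYIAKQR" * 20
fragments = split_into_kmers(protein, k=3)
print(len(fragments))  # 198

# Hypothetical embedding table standing in for a trained word2vec model.
toy_table = {fragment: [1.0] * 100 for fragment in set(fragments)}
features = embed_fragments(fragments, toy_table, dim=100)
print(len(features), len(features[0]))  # 198 100
```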
  • The cheminformatics package RDKit is used to process the SMILES string of the sample ligand, encoding each atom as a 34-dimensional feature vector (as shown in Table 1) to obtain the ligand's molecular fingerprint graph, which is then processed by the graph convolutional network.
  • A sample ligand with 20 non-hydrogen atoms can be selected from the experimental data, that is, an atomic feature sequence with a dimension of 20 × 64 can be obtained.
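To illustrate how a graph convolution turns per-atom input vectors into per-atom feature sequences, here is a minimal pure-Python sketch of one layer. The symmetric normalization with self-loops and the ReLU activation follow the common graph convolutional network formulation and are assumptions; the text does not specify the layer form. The three-atom chain and constant weights are toy data, while the 34 and 64 dimensions mirror the figures quoted above.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 · H · W)."""
    n = len(adjacency)
    # Add self-loops so each atom keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]
    projected = matmul(matmul(norm, features), weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Toy molecule: three heavy atoms in a chain (atom 0 - atom 1 - atom 2).
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
atom_inputs = [[0.1] * 34 for _ in range(3)]   # 34-dimensional input per atom
weight = [[0.01] * 64 for _ in range(34)]      # projects 34 -> 64 dimensions
atom_features = gcn_layer(adjacency, atom_inputs, weight)
print(len(atom_features), len(atom_features[0]))  # 3 64
```

For a ligand with 20 non-hydrogen atoms the same layer would produce a 20 × 64 output, matching the dimension given in the text.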
  • Step S304: Perform model training based on several protein feature sequences of the sample protein, several atomic feature sequences of the sample ligand, and the true value to obtain the prediction model.
  • This step can be divided into the following sub-steps:
  • Step S3041: Use a self-attention mechanism to process several protein feature sequences of the sample protein and several atomic feature sequences of the sample ligand, and predict several sample sequences that can interact.
  • The feature sequence of the sample protein, that is, p_1, p_2, ..., p_b with a dimension of b × 100, can be input into the encoder for encoding, which outputs the encoded sample protein feature sequence p_1, p_2, ..., p_b with a dimension of b × 64. The atomic feature sequence of the sample ligand, c_1, c_2, ..., c_a with a dimension of a × 64, and the encoded sample protein feature sequence p_1, p_2, ..., p_b are then input into the decoder, which finally outputs the interaction feature sequence (that is, several sample sequences) x_1, x_2, ..., x_a with a dimension of a × 64.
  • Step S3042: Apply a preset calculation formula to the several sample sequences to obtain the interaction feature.
  • x'_i is the modulus of the vector x_i
  • α_i is the weight of the vector x_i
  • x_i represents the i-th interaction feature sequence
  • y_interaction represents the interaction feature.
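The calculation formula itself is not reproduced in this text, only its symbols. One reading consistent with those symbols is a norm-weighted sum: compute the modulus of each vector, turn the moduli into weights, and sum the weighted vectors. The sketch below assumes a softmax is used to turn moduli into weights; that choice, and the function name, are assumptions rather than something the source confirms.

```python
import math


def interaction_feature(sequences):
    """Weighted sum of vectors; weights are a softmax over each vector's modulus."""
    norms = [math.sqrt(sum(v * v for v in x)) for x in sequences]
    peak = max(norms)  # subtract the max for numerical stability
    exps = [math.exp(n - peak) for n in norms]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sequences[0])
    y = [sum(w * x[d] for w, x in zip(weights, sequences)) for d in range(dim)]
    return y, weights


# Two toy interaction feature vectors; the longer one receives the larger weight.
y_interaction, alphas = interaction_feature([[3.0, 4.0], [0.0, 0.0]])
print(alphas)
print(y_interaction)
```

Under this reading, vectors with a larger modulus dominate the interaction feature, which is one way a model could emphasize the atoms that matter most for binding.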
  • Step S3043: Use a fully connected neural network to process the interaction feature to obtain a predicted value of the sample protein-sample ligand interaction.
  • Step S3044: Calculate the cross entropy based on the predicted value and the true value.
  • In this step, after the predicted value is obtained, the cross entropy between the predicted value and the true value y is calculated.
  • Step S3045: The cross entropy is used as the loss function of the prediction model, and the stochastic gradient descent method is used for training to obtain the prediction model.
  • Training a model with the stochastic gradient descent method is a common practice and is not described further here.
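The loss and the update rule can be illustrated on a deliberately tiny model. The sketch below trains a single linear layer with a sigmoid output using binary cross entropy and per-sample stochastic gradient descent; the toy data, learning rate, and epoch count are invented for the example and do not come from the source, where the trained model is the full encoder-decoder network.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross entropy for one sample with label 0 or 1."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


def sgd_train(samples, dim, lr=0.1, epochs=200, seed=0):
    """Fit a linear layer + sigmoid, updating after every single sample."""
    rng = random.Random(seed)
    w, b = [0.0] * dim, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the loss with respect to the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def mean_loss(samples, w, b):
    return sum(cross_entropy(y, sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b))
               for x, y in samples) / len(samples)


# Toy data: label 1 means "interacts", label 0 means "does not interact".
samples = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
loss_before = mean_loss(samples, [0.0, 0.0], 0.0)
w, b = sgd_train(samples, dim=2)
loss_after = mean_loss(samples, w, b)
print(loss_before, loss_after)
```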
  • The sample protein feature sequence (that is, the protein feature sequence of the sample protein), p_1, p_2, ..., p_b with a dimension of b × 100, is input into the encoder for encoding, and the encoded sample protein feature sequence is output.
  • Specifically, the encoder processes its input with the formula h_l(X) = (X * W_1 + s) ⊗ σ(X * W_2 + t), where X (of size n × m_1) is the input of layer h_l; W_1 and W_2 (of size k × m_1 × m_2) and s and t (of size m_2) are learnable parameters; n is the length of the sequence; m_1 and m_2 are the dimensions of the input and hidden-layer features; k is the size of the convolution kernel; σ is the sigmoid function; and ⊗ is the Hadamard (element-wise) product.
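The parameter list above (two weight tensors, two biases, a convolution kernel of size k, a sigmoid, and a Hadamard product) matches a gated one-dimensional convolution of the form (X * W_1 + s) ⊗ σ(X * W_2 + t). The sketch below implements that form in pure Python as an illustration. Treating the formula as this gated convolution, the zero padding that keeps the sequence length unchanged, and the toy sizes are all assumptions.

```python
import math


def conv1d(x, weight, bias):
    """Same-length 1D convolution. x: n x m1, weight: k x m1 x m2, bias: m2 (k odd)."""
    n, k, m2 = len(x), len(weight), len(bias)
    pad = k // 2
    out = []
    for pos in range(n):
        row = list(bias)
        for offset in range(k):
            src = pos + offset - pad
            if 0 <= src < n:  # positions outside the sequence act as zero padding
                for i, value in enumerate(x[src]):
                    for j in range(m2):
                        row[j] += value * weight[offset][i][j]
        out.append(row)
    return out


def gated_conv(x, w1, s, w2, t):
    """Element-wise product of a linear branch and a sigmoid gate branch."""
    linear = conv1d(x, w1, s)
    gate = conv1d(x, w2, t)
    return [[a / (1.0 + math.exp(-g)) for a, g in zip(row_a, row_g)]
            for row_a, row_g in zip(linear, gate)]


# Toy sizes (assumed): n = 5 positions, m1 = 2 inputs, m2 = 3 outputs, k = 3.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
w1 = [[[0.1] * 3 for _ in range(2)] for _ in range(3)]
w2 = [[[0.0] * 3 for _ in range(2)] for _ in range(3)]
encoded = gated_conv(x, w1, [0.0] * 3, w2, [0.0] * 3)
print(len(encoded), len(encoded[0]))  # 5 3
```

With the gate weights at zero, the gate equals 0.5 everywhere, so the output is half of the linear branch; a trained gate would instead decide per position how much of each feature passes through.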
  • The atomic feature sequence of the sample ligand (c_1, c_2, ..., c_a with a dimension of a × 64) and the encoded sample protein feature sequence (p_1, p_2, ..., p_b with a dimension of b × 64) are input into the decoder for learning, and the interaction feature sequence (that is, several sample sequences) x_1, x_2, ..., x_a is output. This can be implemented through the self-attention layer.
  • d_k represents a scaling factor, which is the dimension of the hidden-layer feature, 64 in this embodiment.
  • T represents the matrix transpose.
  • The self-attention mechanism is used to compute over the atomic feature sequence and the protein feature sequence.
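The symbols d_k and T above correspond to the standard scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The sketch below assumes that form, takes the atomic feature sequence as the queries and the encoded protein feature sequence as the keys and values (so the output has one row per atom, matching x_1, ..., x_a), and omits the learned projection matrices. These choices are assumptions for illustration.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V, returning the outputs and the weight matrix."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * v[d] for w, v in zip(row, values))
                        for d in range(len(values[0]))])
    return outputs, weights


# Toy data: a = 2 atoms attend over b = 3 protein fragments, with d_k = 4.
atoms = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
protein = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]
interaction, attn = attention(atoms, protein, protein)

# The largest weight in each row points at the fragment an atom attends to most.
top_fragment = [max(range(len(row)), key=row.__getitem__) for row in attn]
print(top_fragment)  # [0, 1]
```

Reading the weight matrix row by row in this way is also one route to the interpretability the description mentions, since each row shows how strongly an atom attends to each protein fragment.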
  • The end-to-end deep learning model TransformerCPI achieves the current best results on three public benchmark data sets.
  • In this embodiment, TransformerCPI also achieves the current best results in label reversal experiments, with a marked improvement over other models, which shows that the method can learn real interaction features.
  • TransformerCPI is also interpretable: it can show not only which amino acid fragments in the protein have a high probability of binding to which atomic feature sequences in the ligand, but also which atoms (atomic feature sequences) in the ligand molecule contribute most to binding, providing guidance for further modification of the molecular structure.
  • Another embodiment of the present invention provides a ligand-protein interaction prediction device, as shown in Fig. 4, including:
  • The first acquisition module is used to process the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors;
  • The second acquisition module is used to acquire several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand;
  • The prediction module is used to perform prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences to obtain the probability that the target protein interacts with the target ligand.
  • The first acquisition module is specifically used to: divide the primary sequence of the target protein into several sequence fragments, each a group of a predetermined number of consecutive amino acids; and encode each sequence fragment with a predetermined algorithm to obtain several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
  • The second acquisition module is specifically configured to: use a cheminformatics package to process the SMILES string of the target ligand to obtain the molecular fingerprint of the target ligand;
  • and use a graph convolutional network to process the molecular fingerprint to obtain several atomic feature sequences of the target ligand.
  • The prediction module is specifically configured to: use a self-attention mechanism to process the several protein feature sequences and the several atomic feature sequences to determine the target feature sequences that can interact; and perform a calculation based on the target feature sequences to obtain the probability that the target protein binds to the target ligand.
  • This embodiment also includes a training module that trains the prediction model with a deep learning method. The training module is used for:
  • Model training is performed based on several protein feature sequences of the sample protein, several atomic feature sequences of the sample ligand, and the true value to obtain the prediction model.
  • The training module is specifically used to:
  • The cross entropy is used as the loss function of the prediction model, and the stochastic gradient descent method is used for training to obtain the prediction model.
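The module structure of the device can be pictured as three pluggable components wired together. The sketch below is only an organizational illustration: the class name `PredictionDevice`, the callable signatures, and the stub modules are hypothetical, and real modules would wrap the embedding, graph convolution, and Transformer steps described earlier.

```python
from dataclasses import dataclass
from typing import Callable, List

Matrix = List[List[float]]


@dataclass
class PredictionDevice:
    first_acquisition: Callable[[str], Matrix]     # primary sequence -> protein feature sequences
    second_acquisition: Callable[[str], Matrix]    # SMILES string -> atomic feature sequences
    prediction: Callable[[Matrix, Matrix], float]  # both sequences -> interaction probability

    def predict(self, protein_sequence: str, ligand_smiles: str) -> float:
        protein_features = self.first_acquisition(protein_sequence)
        atom_features = self.second_acquisition(ligand_smiles)
        probability = self.prediction(protein_features, atom_features)
        if not 0.0 <= probability <= 1.0:
            raise ValueError("prediction module must return a probability")
        return probability


# Stub modules (placeholders only) to show how the pieces connect.
device = PredictionDevice(
    first_acquisition=lambda seq: [[0.0] * 100 for _ in range(max(len(seq) - 2, 0))],
    second_acquisition=lambda smiles: [[0.0] * 64 for _ in smiles],
    prediction=lambda protein, atoms: 0.5,
)
result = device.predict("MKTAYIAKQR", "CCO")
print(result)  # 0.5
```

Keeping the three modules as separate callables mirrors the device description, where each module can be specified and replaced independently.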

Abstract

A prediction method and device for a ligand-protein interaction. The method comprises: processing a primary sequence of a target protein to obtain a plurality of protein feature sequences consisting of feature vectors; obtaining a plurality of atom feature sequences of a target ligand on the basis of a molecular fingerprint spectrum of the target ligand; and performing prediction by using a preset prediction model on the basis of the plurality of protein feature sequences and the plurality of atom feature sequences, and obtaining the probability of the interaction between the target protein and the target ligand. When it is necessary to predict whether a certain protein can interact with a certain ligand, only the protein feature sequences of the protein and the atom feature sequences of the ligand need to be obtained; by using the prediction model, it can be predicted which amino acid segments of the protein interact with which atoms of the ligand, thus the probability of the interaction between the protein and the ligand can be calculated.

Description

Method and device for predicting ligand-protein interaction
Technical field
The present invention relates to the field of drug screening, and in particular to a method and device for predicting ligand-protein interaction.
Background art
Virtual screening is an important task in early drug research and development. It falls into three categories: structure-based virtual screening, ligand-based virtual screening, and chemical genomics-based virtual screening. Structure-based virtual screening requires the crystal structure of the protein, but the crystal structures of many potential target proteins have not been solved, so structure-based virtual screening cannot be applied to drug screening for such targets. Ligand-based virtual screening requires abundant ligand information, and for many targets too few active small molecules have been reported to build accurate and reliable models. In addition, ligand-based virtual screening limits the discovery and design of active small molecules with novel structures. In view of these limitations of structure-based and ligand-based virtual screening, many chemical genomics-based machine learning methods have been proposed to predict ligand-protein interactions. Their drawback is that descriptors of proteins and small molecules must be defined manually.
Because these machine learning models require predefined descriptors of proteins and small molecules, they cannot autonomously learn protein and small-molecule features from data end to end. Machine learning methods of this kind also learn poorly from large samples.
Moreover, existing deep learning models do not extract true interaction features, so the models are misled by statistical regularities unrelated to the task. As a result, they fail to perform well in practical applications and cannot accurately predict ligand-protein interaction relationships.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a method and device for predicting ligand-protein interaction that address the inability of the prior art to accurately predict ligand-protein interaction relationships.
To solve the above technical problem, the embodiments of the present application adopt the following technical solution: a method for predicting ligand-protein interaction, including the following steps:
processing the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors;
acquiring several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand;
performing prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences to obtain the probability that the target protein interacts with the target ligand.
Optionally, processing the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors specifically includes:
dividing the primary sequence of the target protein into several sequence fragments, each a group of a predetermined number of consecutive amino acids;
encoding each sequence fragment with a predetermined algorithm to obtain several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
Optionally, acquiring several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand specifically includes:
processing the SMILES string of the target ligand with a cheminformatics package to obtain the molecular fingerprint of the target ligand;
processing the molecular fingerprint with a graph convolutional network to obtain several atomic feature sequences of the target ligand.
Optionally, performing prediction with the preset prediction model based on the several protein feature sequences and the several atomic feature sequences, to obtain the probability that the target protein interacts with the target ligand, specifically includes:
processing the several protein feature sequences and the several atomic feature sequences with a self-attention mechanism to determine the target feature sequences that can interact;
performing a calculation based on the target feature sequences to obtain the probability that the target protein binds to the target ligand.
Optionally, the method further includes training the prediction model with a deep learning method, which specifically includes:
obtaining experimental data;
determining the true value of the sample protein-sample ligand interaction based on the experimental data;
obtaining several protein feature sequences of the sample protein and several atomic feature sequences of the sample ligand;
performing model training based on the several protein feature sequences of the sample protein, the several atomic feature sequences of the sample ligand, and the true value to obtain the prediction model.
Optionally, performing model training based on the several protein feature sequences of the sample protein, the several atomic feature sequences of the sample ligand, and the true value to obtain the prediction model specifically includes:
processing the several protein feature sequences of the sample protein and the several atomic feature sequences of the sample ligand with a self-attention mechanism to obtain several sample sequences containing interaction information;
applying a preset calculation formula to the several sample sequences to obtain the interaction feature;
processing the interaction feature with a fully connected neural network to obtain a predicted value of the sample protein-sample ligand interaction;
calculating the cross entropy based on the predicted value and the true value;
using the cross entropy as the loss function of the prediction model and training with the stochastic gradient descent method to obtain the prediction model.
为了解决上述技术问题,本申请的实施例采用了如下技术方案:一种配体-蛋白质相互作用的预测装置,包括:In order to solve the above technical problems, the following technical solutions are adopted in the embodiments of the present application: a ligand-protein interaction prediction device, including:
第一获取模块,用于对目标蛋白质的一级序列进行处理,获得由特征向量组成的若干蛋白质特征序列;The first acquisition module is used to process the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors;
第二获取模块,用于基于目标配体的分子指纹图谱获取目标配体的若干原子特征序列;The second acquisition module is used to acquire several atomic characteristic sequences of the target ligand based on the molecular fingerprint of the target ligand;
预测模块,用于基于所述若干蛋白质特征序列以及所述若干原子特征序列利用预设的预测模型进行预测,获得所述目标蛋白质和所述目标配体相互作用的概率The prediction module is used to make predictions based on the several protein feature sequences and the several atomic feature sequences using a preset prediction model to obtain the probability of the target protein interacting with the target ligand
Optionally, the first acquisition module is specifically configured to:
divide the primary sequence of the target protein into several sequence fragments, each consisting of a predetermined number of consecutive amino acids;
encode each sequence fragment with a predetermined algorithm, obtaining several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
Optionally, the second acquisition module is specifically configured to: process the SMILES string of the target ligand with a cheminformatics package to obtain the molecular fingerprint of the target ligand;
process the molecular fingerprint with a graph convolutional network to obtain several atomic feature sequences of the target ligand.
Optionally, the prediction module is specifically configured to:
process the several protein feature sequences and the several atomic feature sequences with a self-attention mechanism to determine the target feature sequences that can interact;
calculate, based on the target feature sequences, the probability that the target protein binds to the target ligand.
The beneficial effect of the embodiments of the present invention is as follows. A prediction model is obtained by training in advance. When it is necessary to predict whether a given protein and a given ligand can interact, only the protein feature sequences of that protein and the atomic feature sequences of that ligand need to be obtained. The prediction model then predicts which protein feature sequences of the protein can interact with which atomic feature sequences of the ligand, from which the probability that the protein and the ligand interact can be calculated, making the prediction of protein-ligand interaction more accurate.
Description of the Drawings
Fig. 1 is a flowchart of the method for predicting ligand-protein interaction in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the principle of predicting ligand-protein interaction in an embodiment of the present invention;
Fig. 3 is a detailed flowchart of obtaining the interaction feature sequences in an embodiment of the present invention;
Fig. 4 is a structural block diagram of the device for predicting ligand-protein interaction in an embodiment of the present invention.
Detailed Description
Various solutions and features of the present application are described here with reference to the drawings.
It should be understood that various modifications can be made to the embodiments disclosed herein. The above description should therefore not be regarded as limiting, but merely as examples of embodiments. Those skilled in the art will think of other modifications within the scope and spirit of the present application.
The drawings, which are included in and constitute a part of the specification, illustrate embodiments of the present application and, together with the general description given above and the detailed description of the embodiments given below, serve to explain the principles of the present application.
These and other features of the present application will become apparent from the following description of preferred forms of embodiments, given as non-limiting examples, with reference to the drawings.
It should also be understood that, although the present application has been described with reference to some specific examples, those skilled in the art can certainly implement many other equivalent forms of the present application, which have the features set out in the claims and therefore all fall within the scope of protection defined thereby.
The above and other aspects, features, and advantages of the present application will become more apparent from the following detailed description when read together with the drawings.
Specific embodiments of the present application are described below with reference to the drawings. It should be understood, however, that the disclosed embodiments are merely examples of the present application, which can be implemented in various ways. Well-known and/or repeated functions and structures are not described in detail, so that unnecessary or redundant detail does not obscure the present application. The specific structural and functional details disclosed herein are therefore not intended to be limiting, but serve only as a basis for the claims and as a representative basis for teaching those skilled in the art to use the present application in varied ways with virtually any suitable detailed structure.
This specification may use the phrases "in one embodiment", "in another embodiment", "in yet another embodiment", or "in other embodiments", each of which may refer to one or more of the same or different embodiments according to the present application.
An embodiment of the present invention provides a method for predicting ligand-protein interaction which, as shown in Fig. 1, includes the following steps:
Step S101: process the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors.
In a specific implementation of this step, the word-embedding method from natural language processing (word2vec) can be used to convert the amino acid sequence of the protein into a sequence of feature vectors, that is, to obtain several protein feature sequences $p_1, p_2, \ldots, p_b$.
Step S102: obtain several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand.
In a specific implementation of this step, the cheminformatics package RDKit can be used to encode the graph molecular fingerprint of the target ligand, and a graph convolutional network then learns several atomic feature sequences $c_1, c_2, \ldots, c_a$ of the target ligand.
Step S103: make a prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences, obtaining the probability that the target protein and the target ligand interact.
In a specific implementation of this step, once the protein feature sequences $p_1, p_2, \ldots, p_b$ and the atomic feature sequences $c_1, c_2, \ldots, c_a$ of the ligand have been obtained, they can be encoded and decoded by the Transformer framework from natural language processing (inside the prediction model), which outputs the interaction target feature sequences $x_1, x_2, \ldots, x_a$. A calculation based on these target feature sequences then gives the probability that the target protein binds to the target ligand.
In this embodiment, when it is necessary to predict whether a given protein and a given ligand can interact, only the protein feature sequences of the protein and the atomic feature sequences of the ligand need to be obtained. The prediction model then predicts which protein feature sequences can interact with which atomic feature sequences, from which the probability that the protein and the ligand interact can be calculated.
Another embodiment of the present invention provides a method for predicting ligand-protein interaction, including the following steps:
Step S201: divide the primary sequence of the target protein into several sequence fragments, each consisting of a predetermined number of consecutive amino acids; encode each sequence fragment with a predetermined algorithm, obtaining several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
In a specific implementation of this step, three consecutive amino acids can be taken as a group, so that the amino acid sequence of the target protein is divided into b overlapping fragments (b = sequence length - 2); the word2vec algorithm is then used to encode these b amino acid fragments into the feature sequences $p_1, p_2, \ldots, p_b$.
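The fragmenting described in step S201 can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_kmers` and the example sequence are hypothetical and do not come from the application, and the word2vec encoding of each fragment is not shown.

```python
def split_into_kmers(sequence: str, k: int = 3) -> list[str]:
    """Return every window of k consecutive residues (overlapping, stride 1)."""
    if len(sequence) < k:
        return []
    # A sequence of length n yields n - k + 1 fragments, i.e. n - 2 for k = 3.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 10-residue sequence, used only to show the fragment count.
fragments = split_into_kmers("MKTAYIAKQR")
print(fragments)
print(len(fragments))  # 8, which equals 10 - 2
```

With a 200-residue protein the same function returns 198 fragments, matching the b = sequence length - 2 relation stated in the text.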
Step S202: process the SMILES string of the target ligand with a cheminformatics package to obtain the molecular fingerprint of the target ligand; process the molecular fingerprint with a graph convolutional network to obtain several atomic feature sequences of the target ligand.
In a specific implementation of this step, the RDKit package can be used to process the SMILES string of the molecule, encoding each atom as a 34-dimensional feature vector to obtain the graph molecular fingerprint of the small molecule; a graph convolutional neural network then processes the graph molecular fingerprint to obtain the atomic feature sequences $c_1, c_2, \ldots, c_a$ (a = number of non-hydrogen atoms in the molecule).
Step S203: process the several protein feature sequences and the several atomic feature sequences with a self-attention mechanism to determine the target feature sequences that can interact; calculate, based on the target feature sequences, the probability that the target protein binds to the target ligand.
In a specific implementation of this step, a preset calculation formula is applied to the target feature sequences to obtain an interaction feature, and a fully connected neural network then processes the interaction feature to obtain the predicted value (probability) of the protein-ligand interaction. More specifically, once the protein feature sequences $p_1, p_2, \ldots, p_b$ and the atomic feature sequences $c_1, c_2, \ldots, c_a$ of the ligand have been obtained, they can be encoded and decoded by the Transformer framework from natural language processing, which outputs the interaction target feature sequences $x_1, x_2, \ldots, x_a$; the preset calculation formula is then applied to the target feature sequences to obtain the interaction feature; finally, the fully connected neural network processes the interaction feature to give the probability that the target protein binds to the target ligand.
This embodiment provides a method for predicting ligand-protein interaction which, before predicting the interaction between the target protein and the target ligand, further includes training the prediction model by a deep learning method. In implementation it includes the following steps:
Step S301: obtain experimental data;
Step S302: determine the true value of the sample protein-sample ligand interaction based on the experimental data;
In a specific implementation of this step, the true value y of the interaction can be obtained from the actual experimental data and results. The true value y is either "1" or "0", where 1 indicates that the pair can interact and 0 indicates that it cannot.
Step S303: obtain several protein feature sequences of the sample protein, and obtain several atomic feature sequences of the sample ligand;
In a specific implementation of this step, the primary sequence of the sample protein can be processed to obtain several protein feature sequences composed of feature vectors. For example, three consecutive amino acids are taken as a group, so that the amino acid sequence of the sample protein is divided into b fragments (b = sequence length - 2); the word-embedding method from natural language processing (word2vec) is then used to encode these b amino acid fragments into a sequence of feature vectors $p_1, p_2, \ldots, p_b$. This sequence contains several protein feature sequences; $p_1$, for example, is one protein feature sequence. Concretely, if a protein of 200 amino acids is selected from the experimental data, the resulting protein feature sequences have dimension 198×100.
When obtaining the atomic feature sequences of the sample ligand in this step, they can be obtained from the molecular fingerprint of the sample ligand. More specifically, the cheminformatics package RDKit can be used to process the SMILES string of the sample ligand, encoding each atom as a 34-dimensional feature vector (as shown in Table 1) to obtain the graph molecular fingerprint of the ligand; a graph convolutional network then processes the molecular fingerprint to obtain the atomic feature sequences $c_1, c_2, \ldots, c_a$ of the sample ligand (a = number of non-hydrogen atoms in the molecule). Concretely, if a sample ligand with 20 non-hydrogen atoms is selected from the experimental data, the resulting atomic feature sequences have dimension 20×64.
Table 1
Figure PCTCN2021089139-appb-000001
Step S304: perform model training based on the several protein feature sequences of the sample protein, the several atomic feature sequences of the sample ligand, and the true value, to obtain the prediction model.
In a specific implementation, this step can be further divided into the following steps:
Step S3041: process the several protein feature sequences of the sample protein and the several atomic feature sequences of the sample ligand with a self-attention mechanism, predicting several sample sequences that can interact.
More specifically, as shown in Fig. 2, the sample protein feature sequences (that is, the protein feature sequences of the sample protein) $p_1, p_2, \ldots, p_b$, of dimension b×100, are input to the encoder, which outputs the encoded sample protein feature sequences $p_1, p_2, \ldots, p_b$ of dimension b×64. The atomic feature sequences of the sample ligand $c_1, c_2, \ldots, c_a$, of dimension a×64, and the encoded sample protein feature sequences, of dimension b×64, are then input to the decoder. After learning by the Transformer decoder, the output is the interaction feature sequences (that is, the several sample sequences) $x_1, x_2, \ldots, x_a$ of dimension a×64.
Step S3042: calculate an interaction feature from the several sample sequences using a preset calculation formula;
In a specific implementation of this step, the following three formulas are used to calculate the interaction feature:
Figure PCTCN2021089139-appb-000002
Figure PCTCN2021089139-appb-000003
Figure PCTCN2021089139-appb-000004
where $x'_i$ is the modulus of the vector $x_i$, $\alpha_i$ is the weight of the vector $x_i$, $x_i$ is the i-th interaction feature sequence, and $y_{interaction}$ is the interaction feature.
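The three formulas survive here only as image references, so the following Python sketch is an assumption rather than a transcription. It assumes that $x'_i$ is the Euclidean norm of $x_i$, that the weights $\alpha_i$ are obtained by a softmax over those norms, and that $y_{interaction}$ is the $\alpha$-weighted sum of the $x_i$. The function names are hypothetical.

```python
import math


def softmax_weights(xs: list[list[float]]) -> list[float]:
    """Assumed weighting: softmax over the Euclidean norm of each vector."""
    norms = [math.sqrt(sum(v * v for v in x)) for x in xs]
    shift = max(norms)  # subtract the maximum for numerical stability
    exps = [math.exp(n - shift) for n in norms]
    total = sum(exps)
    return [e / total for e in exps]


def interaction_feature(xs: list[list[float]]) -> list[float]:
    """Collapse the a x d interaction sequence into one d-dimensional vector."""
    alphas = softmax_weights(xs)
    dim = len(xs[0])
    return [sum(a * x[j] for a, x in zip(alphas, xs)) for j in range(dim)]


# Toy 2 x 2 input; in the embodiment the sequence would be a x 64.
print(interaction_feature([[3.0, 4.0], [0.0, 0.0]]))
```

Under this assumed weighting, vectors with a larger modulus contribute more to the pooled feature, and the output has a fixed size regardless of the number of ligand atoms.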
Step S3043: process the interaction feature with a fully connected neural network to obtain the predicted value of the sample protein-sample ligand interaction;
In this step, once the interaction feature $y_{interaction}$ has been obtained, it is input to the fully connected neural network, which outputs the predicted value $\hat{y}$.
Step S3044: calculate the cross entropy based on the predicted value and the true value;
In this step, once the predicted value $\hat{y}$ has been obtained, the cross entropy between the predicted value $\hat{y}$ and the true value y is calculated.
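As an illustration of this step, the sketch below computes the standard binary cross entropy for a single sample, assuming the predicted value is a probability in (0, 1) and the true value is 0 or 1 as defined in step S302. The function name and the clipping constant `eps` are choices made for this example, not details taken from the application.

```python
import math


def binary_cross_entropy(y_true: int, y_pred: float, eps: float = 1e-12) -> float:
    """Cross entropy between a 0/1 label and a predicted probability."""
    # Clip the prediction so that log() never receives exactly 0 or 1.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


print(binary_cross_entropy(1, 0.9))  # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))  # large loss: confident and wrong
```

The loss approaches zero as the predicted probability approaches the true label and grows without bound as it approaches the opposite label.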
Step S3045: use the cross entropy as the loss function of the prediction model and train by stochastic gradient descent to obtain the prediction model.
Training a model by stochastic gradient descent is a common model training method and is not described further here.
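For readers who want a concrete picture, the sketch below applies stochastic gradient descent with a cross-entropy loss to a deliberately tiny stand-in model (a single logistic unit). It is not the prediction model of the application; the model, learning rate, epoch count, and toy data are all assumptions chosen to keep the example short.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sgd_train(samples, lr: float = 0.5, epochs: int = 200, seed: int = 0):
    """Fit weights w and bias b of a logistic unit, one sample per update."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    b = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in random order each epoch
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the cross entropy w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


# Toy data: feature 1.0 is labelled "interacts" (1), feature 0.0 is not (0).
w, b = sgd_train([([0.0], 0), ([1.0], 1)])
print(sigmoid(w[0] + b), sigmoid(b))
```

In the embodiment the same update rule would be applied to all parameters of the encoder, decoder, and fully connected network, with gradients obtained by backpropagation.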
In this embodiment, when the sample protein feature sequences (that is, the protein feature sequences of the sample protein) $p_1, p_2, \ldots, p_b$, of dimension b×100, are input to the encoder and the encoded sample protein feature sequences are output, the processing uses the following formula in the encoder (formula images PCTCN2021089139-appb-000008 and PCTCN2021089139-appb-000009):
$$h_l(X) = (X * W_1 + s) \otimes \sigma(X * W_2 + t)$$
where $X \in \mathbb{R}^{n \times m_1}$ is the input of layer $h_l$; $W_1$, $s$, $W_2$, and $t$ are learnable parameters; n is the length of the sequence; $m_1$ and $m_2$ are the dimensions of the input features and the hidden-layer features, respectively; k is the size of the convolution kernel; $\sigma$ is the sigmoid function; and $\otimes$ is the Hadamard product of matrices. Parameter settings: k = 7, $m_1$ = 100, $m_2$ = 64. That is, the input is $X = p_1, p_2, \ldots, p_b$ with $X \in \mathbb{R}^{b \times 100}$; $h_l(X) = p_1, p_2, \ldots, p_b$ with $h_l(X) \in \mathbb{R}^{b \times 64}$ is then calculated by one-dimensional convolution and a gated linear unit, the protein feature sequences $p_1, p_2, \ldots, p_b$ are updated, and finally the encoded protein feature sequences $p_1, p_2, \ldots, p_b$ are output.
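The encoder layer described above (one-dimensional convolution followed by a gated linear unit) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: zero padding is assumed so that the sequence length is preserved, the weight layout `W[k][m1][m2]` is a choice made for this example, and the function names are hypothetical.

```python
import math


def conv1d_same(X, W, bias):
    """1D convolution over a sequence. X: n x m1, W: k x m1 x m2, bias: m2.

    Zero padding (an assumption) keeps the output length equal to n.
    """
    n, k, m2 = len(X), len(W), len(bias)
    pad = k // 2
    out = []
    for pos in range(n):
        row = list(bias)
        for dk in range(k):
            src = pos + dk - pad
            if 0 <= src < n:  # positions outside the sequence count as zeros
                for i, xi in enumerate(X[src]):
                    for j in range(m2):
                        row[j] += xi * W[dk][i][j]
        out.append(row)
    return out


def glu_layer(X, W1, s, W2, t):
    """Gated linear unit: (X*W1 + s) multiplied elementwise by sigmoid(X*W2 + t)."""
    A = conv1d_same(X, W1, s)
    B = conv1d_same(X, W2, t)
    return [[a / (1.0 + math.exp(-g)) for a, g in zip(ra, rb)]
            for ra, rb in zip(A, B)]


# Toy sizes (n=2, m1=1, m2=1, k=1); the embodiment uses k=7, m1=100, m2=64.
X = [[1.0], [3.0]]
print(glu_layer(X, [[[2.0]]], [0.0], [[[0.0]]], [0.0]))  # gate is sigmoid(0) = 0.5
```

The second convolution acts as a learned gate with values between 0 and 1 that controls how much of the first convolution's output passes through at each position and channel.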
In this embodiment, the atomic feature sequences of the sample ligand ($c_1, c_2, \ldots, c_a$, of dimension a×64) and the encoded sample protein feature sequences ($p_1, p_2, \ldots, p_b$, of dimension b×64) are input to the decoder for learning, and the interaction feature sequences (that is, the several sample sequences) $x_1, x_2, \ldots, x_a$ are output. This can be implemented by calculating the attention value with the formula of the self-attention layer (formula image PCTCN2021089139-appb-000015):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is a scaling factor equal to the dimension of the hidden-layer features, 64 in this embodiment, and T denotes matrix transposition. Specifically, as shown in Fig. 3, the atomic feature sequences of the sample ligand are first used as the input of the self-attention layer (that is, the formula above), which calculates the attention values of the atomic feature sequences and performs weighted summation and normalization; here $Q = K = V = c_1, c_2, \ldots, c_a$. This result is then used as an input of the second layer (also a self-attention layer), together with the protein feature sequences; the attention values between the atomic feature sequences and the protein feature sequences are calculated by the self-attention mechanism, followed by weighted summation and normalization; here $Q = c_1, c_2, \ldots, c_a$ and $K = V = p_1, p_2, \ldots, p_b$. Finally, the result is used as the input of the third layer (that is, it is input to the convolutional neural network) for a third weighted summation and normalization, which yields the interaction feature sequences (that is, the several sample sequences) $x_1, x_2, \ldots, x_a$.
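The attention calculation used in the first two decoder layers can be illustrated with a short pure-Python sketch of scaled dot-product attention. It assumes the standard form (softmax over scaled query-key dot products, then a weighted sum of the values) and leaves out learned projections, multiple heads, residual connections, and normalization; the function names and toy values are hypothetical.

```python
import math


def softmax(row: list[float]) -> list[float]:
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention. Q: a x d, K: b x d, V: b x d -> a x d."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one weight per key, summing to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


atoms = [[0.5, 0.5], [1.0, 0.0]]                 # stands in for c_1..c_a (a = 2)
protein = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stands in for p_1..p_b (b = 3)

self_att = attention(atoms, atoms, atoms)         # first layer: Q = K = V = c
cross_att = attention(self_att, protein, protein) # second layer: Q = c, K = V = p
print(len(cross_att), len(cross_att[0]))          # a rows, d columns
```

Because the queries come from the ligand atoms, the output always has one row per atom, which is why the interaction sequence $x_1, x_2, \ldots, x_a$ has length a regardless of the protein length b.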
In the embodiments of the present invention, the end-to-end deep learning model TransformerCPI achieves the current best results on three public benchmark datasets. The TransformerCPI model of this embodiment also achieves the current best results in label reversal experiments, with a marked improvement over other models, which shows that the method can learn genuine interaction features. In addition, because TransformerCPI is highly interpretable, it can indicate which amino acid fragments of the protein are most likely to bind to which atomic feature sequences of the ligand, and which atoms (atomic feature sequences) of the ligand molecule contribute most to binding, providing guidance for further modification of the molecular structure.
Another embodiment of the present invention provides a device for predicting ligand-protein interaction which, as shown in Fig. 4, includes:
a first acquisition module, configured to process the primary sequence of a target protein to obtain several protein feature sequences composed of feature vectors;
a second acquisition module, configured to obtain several atomic feature sequences of a target ligand based on the molecular fingerprint of the target ligand;
a prediction module, configured to make a prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences, obtaining the probability that the target protein and the target ligand interact.
In this embodiment, the first acquisition module is specifically configured to: divide the primary sequence of the target protein into several sequence fragments, each consisting of a predetermined number of consecutive amino acids; and encode each sequence fragment with a predetermined algorithm, obtaining several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
In this embodiment, the second acquisition module is specifically configured to: process the SMILES string of the target ligand with a cheminformatics package to obtain the molecular fingerprint of the target ligand; and process the molecular fingerprint with a graph convolutional network to obtain several atomic feature sequences of the target ligand.
Specifically, the prediction module is configured to: process the several protein feature sequences and the several atomic feature sequences with a self-attention mechanism to determine the target feature sequences that can interact; and calculate, based on the target feature sequences, the probability that the target protein binds to the target ligand.
This embodiment further includes a training module for training the prediction model. The training module trains the prediction model by a deep learning method and is configured to:
obtain experimental data;
determine the true value of the sample protein-sample ligand interaction based on the experimental data;
obtain several protein feature sequences of the sample protein, and obtain several atomic feature sequences of the sample ligand;
perform model training based on the several protein feature sequences of the sample protein, the several atomic feature sequences of the sample ligand, and the true value, to obtain the prediction model.
In a specific implementation, the training module is specifically configured to:
process the several protein feature sequences of the sample protein and the several atomic feature sequences of the sample ligand with a self-attention mechanism, predicting several sample sequences that can interact;
calculate an interaction feature from the several sample sequences using a preset calculation formula;
process the interaction feature with a fully connected neural network to obtain a predicted value of the sample protein-sample ligand interaction;
calculate the cross entropy based on the predicted value and the true value;
use the cross entropy as the loss function of the prediction model and train by stochastic gradient descent to obtain the prediction model.
The embodiments of the present invention not only predict the probability of protein-ligand interaction accurately, but also indicate which amino acid sequences of the protein and which atoms of the ligand are involved in binding, providing guidance for further modification of the molecular structure.
The above embodiments are only exemplary embodiments of the present invention and do not limit it; the scope of protection of the present invention is defined by the claims. Those skilled in the art may make various modifications or equivalent substitutions to the present invention within its essence and scope of protection, and such modifications or equivalent substitutions shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

  1. A method for predicting ligand-protein interaction, characterized in that it comprises the following steps:
    processing the primary sequence of a target protein to obtain several protein feature sequences composed of feature vectors;
    obtaining several atomic feature sequences of a target ligand based on the molecular fingerprint of the target ligand;
    making a prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences, to obtain the probability that the target protein and the target ligand interact.
  2. The method according to claim 1, characterized in that processing the primary sequence of the target protein to obtain several protein feature sequences composed of feature vectors specifically comprises:
    dividing the primary sequence of the target protein into several sequence fragments, each consisting of a predetermined number of consecutive amino acids;
    encoding each sequence fragment with a predetermined algorithm, to obtain several protein feature sequences composed of the feature vectors corresponding to the sequence fragments.
  3. The method according to claim 1, characterized in that the obtaining of several atomic feature sequences of the target ligand based on the molecular fingerprint of the target ligand specifically comprises:
    processing the SMILES representation of the target ligand with a cheminformatics package to obtain the molecular fingerprint of the target ligand;
    processing the molecular fingerprint with a graph convolutional network to obtain the several atomic feature sequences of the target ligand.
  4. The method according to claim 1, characterized in that the performing of prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences, to obtain the probability that the target protein and the target ligand interact, specifically comprises:
    processing the several protein feature sequences and the several atomic feature sequences with a self-attention mechanism to determine the target feature sequences capable of interacting;
    performing a calculation based on the target feature sequences to obtain the probability that the target protein binds to the target ligand.
  5. The method according to claim 1, characterized in that the method further comprises: training the prediction model with a deep learning method, which specifically comprises:
    obtaining experimental data;
    determining the true value of the sample protein-sample ligand interaction based on the experimental data;
    obtaining several protein feature sequences of the sample protein and several atomic feature sequences of the sample ligand;
    performing model training based on the several protein feature sequences of the sample protein, the several atomic feature sequences of the sample ligand, and the true value, to obtain the prediction model.
  6. The method according to claim 5, characterized in that the performing of model training based on the several protein feature sequences of the sample protein, the several atomic feature sequences of the sample ligand, and the true value, to obtain the prediction model, specifically comprises:
    processing the several protein feature sequences of the sample protein and the several atomic feature sequences of the sample ligand with a self-attention mechanism to obtain several sample sequences containing interaction information;
    calculating on the several sample sequences with a preset calculation formula to obtain an interaction feature;
    processing the interaction feature with a fully connected neural network to obtain a predicted value of the sample protein-sample ligand interaction;
    calculating the cross entropy based on the predicted value and the true value;
    using the cross entropy as the loss function of the prediction model and training with stochastic gradient descent to obtain the prediction model.
  7. A device for predicting ligand-protein interaction, characterized by comprising:
    a first acquisition module, configured to process the primary sequence of a target protein to obtain several protein feature sequences composed of feature vectors;
    a second acquisition module, configured to obtain several atomic feature sequences of a target ligand based on the molecular fingerprint of the target ligand;
    a prediction module, configured to perform prediction with a preset prediction model based on the several protein feature sequences and the several atomic feature sequences, to obtain the probability that the target protein and the target ligand interact.
  8. The device according to claim 7, characterized in that the first acquisition module is specifically configured to:
    divide the primary sequence of the target protein into several sequence fragments, each fragment being a group of a predetermined number of consecutive amino acids;
    encode each of the sequence fragments with a predetermined algorithm to obtain several protein feature sequences composed of the feature vectors corresponding to the respective sequence fragments.
  9. The device according to claim 7, characterized in that the second acquisition module is specifically configured to: process the SMILES representation of the target ligand with a cheminformatics package to obtain the molecular fingerprint of the target ligand;
    process the molecular fingerprint with a graph convolutional network to obtain the several atomic feature sequences of the target ligand.
  10. The device according to claim 7, characterized in that the prediction module is specifically configured to:
    process the several protein feature sequences and the several atomic feature sequences with a self-attention mechanism to determine the target feature sequences capable of interacting;
    perform a calculation based on the target feature sequences to obtain the probability that the target protein binds to the target ligand.
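Two of the claimed steps lend themselves to a short sketch: dividing a primary sequence into fragments of a predetermined number of consecutive amino acids (claims 2 and 8) and computing the cross entropy between predicted and true interaction values (claim 6). The code below is a minimal illustration under stated assumptions, not the claimed implementation: the fragment length `k`, the `stride` parameter (the claims do not say whether fragments overlap), the example sequence, and the function names are all hypothetical.

```python
import math


def split_into_fragments(sequence, k=3, stride=1):
    """Split a primary sequence into fragments of k consecutive amino acids.

    Assumption: stride=1 gives overlapping fragments, stride=k gives
    non-overlapping ones; the claims leave this choice open.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k]
            for i in range(0, len(sequence) - k + 1, stride)]


def binary_cross_entropy(predicted, actual, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predicted, actual):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(predicted)


# Hypothetical six-residue sequence, split with overlap and without.
fragments = split_into_fragments("MKVLAT", k=3, stride=1)
blocks = split_into_fragments("MKVLAT", k=3, stride=3)

# Hypothetical predictions for one interacting and one non-interacting pair.
loss = binary_cross_entropy([0.9, 0.2], [1, 0])

print(fragments)
print(blocks)
print(round(loss, 4))
```

Each fragment would then be encoded into a feature vector by the predetermined algorithm, and the loss would be minimized by stochastic gradient descent during training; neither of those steps is shown here.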
PCT/CN2021/089139 2020-04-29 2021-04-23 Prediction method and device for ligand-protein interaction WO2021218791A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010356774.3 2020-04-29
CN202010356774.3A CN113571124B (en) 2020-04-29 2020-04-29 Method and device for predicting ligand-protein interaction

Publications (1)

Publication Number Publication Date
WO2021218791A1 true WO2021218791A1 (en) 2021-11-04

Family

ID=78158583

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/089139 WO2021218791A1 (en) 2020-04-29 2021-04-23 Prediction method and device for ligand-protein interaction

Country Status (2)

Country Link
CN (1) CN113571124B (en)
WO (1) WO2021218791A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114446383A (en) * 2022-01-24 2022-05-06 电子科技大学 Quantum computation-based ligand-protein interaction prediction method
CN114927165A (en) * 2022-07-20 2022-08-19 深圳大学 Method, device, system and storage medium for identifying ubiquitination sites
CN115497555A (en) * 2022-08-16 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-species protein function prediction method, device, equipment and storage medium
WO2023097515A1 (en) * 2021-11-30 2023-06-08 京东方科技集团股份有限公司 Rna-protein interaction prediction method and apparatus, and medium and electronic device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116559B (en) * 2022-06-21 2023-04-18 北京百度网讯科技有限公司 Method, device, equipment and medium for determining and training atomic coordinates in amino acid

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862173A (en) * 2017-11-15 2018-03-30 南京邮电大学 A kind of lead compound virtual screening method and device
US20180341754A1 (en) * 2017-05-19 2018-11-29 Accutar Biotechnology Inc. Computational method for classifying and predicting ligand docking conformations
CN109273054A (en) * 2018-08-31 2019-01-25 南京农业大学 Protein Subcellular interval prediction method based on relation map
CN110289050A (en) * 2019-05-30 2019-09-27 湖南大学 A kind of drug based on figure convolution sum term vector-target interaction prediction method
CN110459274A (en) * 2019-08-01 2019-11-15 南京邮电大学 A kind of small-molecule drug virtual screening method and its application based on depth migration study
CN110767266A (en) * 2019-11-04 2020-02-07 山东省计算中心(国家超级计算济南中心) Graph convolution-based scoring function construction method facing ErbB targeted protein family

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHIN BONGGUN, PARK SUNGSOO, KANG KEUNSOO, HO JOYCE C: "Machine Learning for Healthcare Self-Attention Based Molecule Representation for Predicting Drug-Target Interaction", PROCEEDINGS OF MACHINE LEARNING RESEARCH, vol. 106, 31 December 2019 (2019-12-31), pages 1 - 18, XP055861048 *
ZHENG SHUANGJIA, LI YONGJIAN, CHEN SHENG, XU JUN, YANG YUEDONG: "Predicting drug protein interaction using quasi-visual question answering system", BIORXIV, 25 March 2019 (2019-03-25), XP055826404, Retrieved from the Internet <URL:https://www.biorxiv.org/content/10.1101/588178v1.full.pdf> [retrieved on 20210721], DOI: 10.1101/588178 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023097515A1 (en) * 2021-11-30 2023-06-08 京东方科技集团股份有限公司 Rna-protein interaction prediction method and apparatus, and medium and electronic device
CN114446383A (en) * 2022-01-24 2022-05-06 电子科技大学 Quantum computation-based ligand-protein interaction prediction method
CN114446383B (en) * 2022-01-24 2023-04-21 电子科技大学 Quantum calculation-based ligand-protein interaction prediction method
CN114927165A (en) * 2022-07-20 2022-08-19 深圳大学 Method, device, system and storage medium for identifying ubiquitination sites
CN115497555A (en) * 2022-08-16 2022-12-20 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-species protein function prediction method, device, equipment and storage medium
CN115497555B (en) * 2022-08-16 2024-01-05 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-species protein function prediction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113571124A (en) 2021-10-29
CN113571124B (en) 2024-04-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21797036

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21797036

Country of ref document: EP

Kind code of ref document: A1