WO2023151314A1 - Protein conformation-aware representation learning method based on a pre-trained language model - Google Patents

Protein conformation-aware representation learning method based on a pre-trained language model

Info

Publication number
WO2023151314A1
WO2023151314A1 (PCT/CN2022/126696)
Authority
WO
WIPO (PCT)
Prior art keywords
protein
representation
task
amino acid
prompt
Prior art date
Application number
PCT/CN2022/126696
Other languages
English (en)
French (fr)
Inventor
张强
王泽元
韩玉强
陈华钧
Original Assignee
浙江大学杭州国际科创中心
Priority date
Filing date
Publication date
Application filed by 浙江大学杭州国际科创中心
Priority to US 18/278,298 (published as US20240233875A9)
Publication of WO2023151314A1

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 35/00 - ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 - ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B 15/20 - Protein or domain folding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 15/00 - ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B 40/20 - Supervised data analysis
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 5/00 - ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Definitions

  • The invention belongs to the field of representation learning, and in particular relates to a protein conformation-aware representation learning method based on a pre-trained language model.
  • Proteins are sequence data consisting of a number of amino acids. Many current studies use language models from natural language processing to learn embedded representations of proteins in order to predict protein structure and properties.
  • Unsupervised learning methods provide initial parameters for the model through a masked language model objective, fine-tune the model together with a mapping function on labeled data sets, learn a representation of the protein in the task space, and apply it to downstream prediction tasks.
  • However, proteins adopt different conformations in different environments: they fold spontaneously through the chemical properties of their amino acids, and they change shape through contact with other proteins.
  • Current protein language models can output only a single embedding matrix for a given protein, making it difficult to perform well on structure prediction and function prediction tasks at the same time, which limits the expressive power of the model and its application in practical scenarios.
  • The success of prompting in natural language shows that language models have strong expressive power: changing local symbols of a sequence can steer the overall output in a desired direction.
  • The traditional prompting idea is to take an existing pre-trained model and search the semantic space for the optimal prompt. In a protein pre-training model, however, the semantic unit is the amino acid, so prompts cannot be constructed manually the way language prompts can; moreover, prompts in continuous spaces lack interpretability.
  • Patent document CN 106845149 A discloses a protein sequence representation method based on gene ontology information.
  • Patent document CN 104134017 A discloses a method for extracting protein interaction relation pairs based on compact feature representation, comprising the following steps: 1) select the required corpus, organized by sentence and already annotated with protein entities and entity relations; 2) discard the sentences of step 1) that contain no protein entity or only one protein entity, obtaining a sentence set; 3) replace the protein entities in each sentence with placeholders, perform placeholder fusion, and then carry out part-of-speech tagging and syntactic analysis; 4) taking each entity pair as a unit, extract word, part-of-speech, syntactic, and template features; 5) apply a compact-representation operation to the features obtained in step 4); 6) use a support vector machine to train on the features from step 4), or make predictions with the trained model. The purpose of this method is to make the best possible use of the information available in the sentence and to increase the information content of the feature vector.
  • The purpose of the present invention is to provide a protein conformation-aware representation learning method based on a pre-trained language model.
  • By combining pre-training on multiple data sets, protein representations under different conformations can be obtained, improving the accuracy of protein structure prediction and function prediction tasks.
  • An embodiment of the present invention provides a protein conformation-aware representation learning method based on a pre-trained language model, including the following steps:
  • after training, the representation learning module is extracted as the protein representation module.
  • The representation learning module includes a prompt embedding layer, an amino acid embedding layer, a fusion layer, and a pre-trained language model, wherein the prompt embedding layer is used to learn the embedded representation of each type of prompt, the amino acid embedding layer is used to learn the embedded representation of the protein, the fusion layer fuses the embedded representation of each type of prompt with the embedded representation of the protein to obtain a fused representation, and the pre-trained language model performs representation learning on the fused representation to obtain the protein embedding under each type of prompt identifier;
  • the task module includes a task mapping layer corresponding to each type of protein conformation, and each task mapping layer is used for task prediction based on the protein embedding under the corresponding prompt identifier.
  • The amino acid embedding layer includes an amino acid information embedding layer and an amino acid position embedding layer, which are used respectively to extract the amino acid information representation and the amino acid position representation from the amino acid sequence; the two representations are superimposed to obtain the embedded representation of the protein, as sketched below.
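A minimal sketch of such an embedding layer, assuming PyTorch; the vocabulary size, maximum length, and hidden dimension below are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class AminoAcidEmbedding(nn.Module):
    """Residue (information) embedding plus position embedding, superimposed."""
    def __init__(self, vocab_size=25, max_len=2048, dim=768):
        super().__init__()
        self.info = nn.Embedding(vocab_size, dim)  # which amino acid it is
        self.pos = nn.Embedding(max_len, dim)      # where it sits in the chain

    def forward(self, token_ids):                  # token_ids: (B, L)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.info(token_ids) + self.pos(positions)  # superposition by addition

emb = AminoAcidEmbedding()(torch.randint(0, 25, (2, 10)))  # -> shape (2, 10, 768)
```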
  • The pre-trained language model is a pluggable masked pre-trained language model; masked pre-trained language models include BERT, RoBERTa, ALBERT, and XLNet.
  • The task mapping layer includes a two-layer MLP, which is used for task prediction based on the protein embedding under each type of prompt identifier.
  • For each type of prediction task, the loss function minimizes the error between the task prediction result and the label.
  • The protein representation module is applied to protein structure and/or protein function prediction tasks; in application, within the protein representation module, the embedded representations of all prompt types are fused simultaneously into the embedded representation of the protein to obtain the protein embedding under all prompt identifiers, which is then used for protein structure and/or protein function prediction.
  • Fusing the embedded representations of all prompt types simultaneously into the protein embedding includes:
  • first concatenating the embedded representations of all prompts, then fusing the resulting concatenated representation with the embedded representation of the protein.
  • The fusion includes concatenation, fully connected mapping, and convolutional mapping.
  • The protein conformations include a natural folding state and an interaction state.
  • For the natural folding state, amino acid sequences with masks serve as samples and the normal amino acid sequences serve as sample labels, forming a data set. The prompt corresponding to the natural folding state is the sequence prompt, and the corresponding task is the masked amino acid prediction task.
  • For the interaction state, at least two amino acid sequences serve as samples and the protein reaction type serves as the sample label, forming a data set. The prompt corresponding to the interaction state is the interaction prompt, and the corresponding task is the protein interaction prediction task.
  • The beneficial effects of the present invention at least include: conformation-centered data sets are constructed from protein data, prompts are added to reflect protein conformation, and the protein embeddings learned under different prompt identifiers serve different prediction tasks, so that the model learns more discriminative conformation representations and improves the accuracy of structure and function prediction.
  • Figure 1 is a flowchart of the protein conformation-aware representation learning method based on a pre-trained language model;
  • Figure 2 is a schematic diagram of the construction process of the protein representation module provided by the embodiment;
  • Figure 3 is a schematic diagram of the application process of the protein representation module provided by the embodiment.
  • Figure 1 is a flowchart of the protein conformation-aware representation learning method based on a pre-trained language model. As shown in Figure 1, the method provided by the embodiment includes the following steps.
  • Step 1: obtain proteins composed of amino acid sequences, construct different data sets according to protein conformations, and define a prompt for each type of protein conformation.
  • Protein structure data are obtained; each entry of protein structure data consists of an amino acid sequence.
  • Protein function data, i.e., protein-protein interaction data, are obtained at the same time; the interaction data are the reaction types of the proteins.
  • Protein conformation refers to the different states exhibited by proteins, including the natural folding state and the interaction state. After the protein data are obtained, data sets with different properties are constructed according to the different conformations, providing the basis for conformation-centered protein data processing and for learning protein embeddings for different prediction tasks.
  • For the natural folding state, amino acid sequences containing the mask symbol [Mask] serve as samples, and the normal amino acid sequences (the original sequences without masks) serve as sample labels, forming the amino acid sequence data set.
  • The masked amino acid prediction task is then constructed; that is, the amino acid sequence data set is used for masked amino acid prediction, to learn the embedded representation of proteins in their natural folding state.
  • For the interaction state, at least two interacting amino acid sequences serve as samples, and protein reaction types serve as sample labels, forming the protein interaction data set.
  • The protein interaction prediction task is then constructed; that is, the protein interaction data set is used for interaction prediction, to learn the embedded representation of proteins in the interaction state.
  • The interacting proteins in the protein interaction data set are represented as a network graph.
  • The graph is cut by species, and proteins that cannot serve as input to the pre-trained language model are removed.
  • At each training step, the interaction network of one species is selected, proteins with a total length of no more than 2048 amino acids are randomly sampled, and the interaction matrix of the resulting subgraph is obtained and used as input to the representation learning model; a sampling sketch follows below.
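A hedged sketch of that per-step sampling, assuming a simple in-memory graph (a dict of sequences and a set of undirected edges); the greedy length-budgeted selection shown here is one plausible reading, not the patent's fixed procedure.

```python
import random

def sample_subgraph(proteins, edges, max_total_len=2048, rng=random):
    """proteins: {id: sequence}; edges: set of frozenset({id_i, id_j}).
    Returns sampled protein ids and the induced interaction matrix."""
    ids = list(proteins)
    rng.shuffle(ids)
    chosen, total = [], 0
    for pid in ids:                 # keep adding proteins while the budget holds
        if total + len(proteins[pid]) > max_total_len:
            continue
        chosen.append(pid)
        total += len(proteins[pid])
    matrix = [[1 if a != b and frozenset((a, b)) in edges else 0
               for b in chosen] for a in chosen]
    return chosen, matrix
```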
  • Protein reaction types include reaction, expression, activation, post-translational modification, binding, and catalysis.
  • A prompt is defined for each type of protein conformation: x_Seq for the natural folding state and x_IC for the interaction state.
  • The prompts are fused into the protein embedding during learning to mark the protein conformation.
  • Step 2: build a representation learning module based on the pre-trained language model, used to fuse the embedded representation of each type of prompt into the embedded representation of the protein, to obtain the protein embedding under the prompt identifier.
  • To construct the protein embedding, the prompts representing protein conformations are added to the vocabulary of the pre-trained language model.
  • When a protein embedding carrying conformation information is input to the pre-trained language model, the model adds the embedding of the conformation prompt to the protein embedding, as sketched below.
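One way to realize "add the prompts to the vocabulary" with an off-the-shelf masked language model, assuming the Hugging Face transformers API; the checkpoint name and the prompt token strings ("<seq>", "<ic>") are stand-ins for illustration, not names fixed by the patent.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # stand-in checkpoint
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# register the two conformation prompts as new special tokens ...
tokenizer.add_special_tokens({"additional_special_tokens": ["<seq>", "<ic>"]})
# ... and grow the embedding matrix so each prompt gets a learnable embedding row
model.resize_token_embeddings(len(tokenizer))
```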
  • The representation learning module includes a prompt embedding layer, an amino acid embedding layer, a fusion layer, and a pre-trained language model. The prompt embedding layer maps each type of prompt into a high-dimensional space to learn its embedded representation; the amino acid embedding layer maps the protein into a high-dimensional space to learn the protein's embedded representation; the fusion layer fuses the two embeddings to obtain a fused representation.
  • The fusion layer can use concatenation, fully connected mapping, or convolutional mapping to fuse the embeddings (see the sketch below); the pre-trained language model then performs representation learning on the fused representation to obtain the protein embedding under each type of prompt identifier.
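A minimal sketch of the three fusion options, assuming PyTorch; the shapes and exact wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim, mode="concat"):
        super().__init__()
        self.mode = mode
        self.fc = nn.Linear(2 * dim, dim)                   # fully connected mapping
        self.conv = nn.Conv1d(2 * dim, dim, kernel_size=1)  # convolutional mapping

    def forward(self, prompt_emb, protein_emb):
        """prompt_emb: (B, D); protein_emb: (B, L, D)."""
        if self.mode == "concat":  # concatenation: prepend the prompt as a token
            return torch.cat([prompt_emb[:, None, :], protein_emb], dim=1)
        paired = torch.cat(        # pair the prompt with every residue: (B, L, 2D)
            [prompt_emb[:, None, :].expand_as(protein_emb), protein_emb], dim=-1)
        if self.mode == "fc":
            return self.fc(paired)
        return self.conv(paired.transpose(1, 2)).transpose(1, 2)
```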
  • The amino acid embedding layer includes an amino acid information embedding layer and an amino acid position embedding layer, used respectively to extract the amino acid information representation and the amino acid position representation from the amino acid sequence; the two are superimposed (the + symbol in Figure 2) to obtain the embedded representation of the protein.
  • The amino acid information representation characterizes which amino acids make up the protein.
  • The amino acid position representation characterizes the position of each amino acid in the protein.
  • The pre-trained language model is a pluggable masked pre-trained language model; masked pre-trained language models include BERT, RoBERTa, ALBERT, and XLNet. Preferably, RoBERTa can be used.
  • Step 3: build a task module, used to perform task prediction, for the task corresponding to each type of protein conformation, based on the protein embedding under the prompt identifier.
  • The task module includes a task mapping layer for each protein conformation type; each task mapping layer uses a different mapping function to map the protein embedding under the corresponding prompt identifier into a different task space, i.e., to perform a different task prediction.
  • The task mapping layer can use a two-layer MLP as the mapping function.
  • For the masked amino acid prediction task corresponding to the natural folding state and the protein interaction prediction task corresponding to the interaction state, the task module contains two task mapping layers, namely the interaction-conformation task mapping head and the sequence task mapping head, each using a two-layer MLP as its mapping function (a sketch follows).
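A sketch of a task mapping head as a two-layer MLP; the hidden width, activation, and output sizes are not specified in the text and are assumptions here.

```python
import torch.nn as nn

def make_task_head(dim, out_dim, hidden=512):
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

sequence_head = make_task_head(768, 25)  # sequence head: masked amino acid classes
action_head = make_task_head(768, 2)     # interaction head: interact / not interact
```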
  • Step 4: build a loss function for each type of task from the task prediction results and labels, and update the model parameters of the representation learning module and the task module by combining the loss functions of all task types and the different data sets.
  • For each type of prediction task, the loss function minimizes the error between the task prediction result and the label. The loss functions of all prediction task types are then combined, and the model parameters of the representation learning module and the task module are optimized on the data set of each prediction task; that is, the parameters of the prompt embedding layer, amino acid embedding layer, fusion layer, pre-trained language model, and task mapping layers are optimized.
  • The total loss function consists of two parts, the masked amino acid prediction task loss and the protein interaction prediction task loss, as follows.
  • In the masked amino acid prediction task, a training batch consists of N proteins; for each amino acid chain, 15% of the amino acids are randomly selected for prediction.
  • Each selected amino acid has an 80% probability of being replaced by the [Mask] symbol, a 10% probability of being replaced by another amino acid symbol, and a 10% probability of remaining unchanged (see the masking sketch below).
  • Constructing sample data this way yields protein pairs (P, P'), where P' is the altered protein and P is the original protein, which serves as the label of P' in the masked amino acid prediction task.
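A minimal sketch of that 80/10/10 corruption scheme in plain Python; the residue alphabet and the mask string are assumptions for illustration.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues
MASK = "[Mask]"

def corrupt_sequence(seq, select_prob=0.15, seed=None):
    """Return (P': corrupted sequence, P: original sequence used as the label)."""
    rng = random.Random(seed)
    corrupted = list(seq)
    for i, aa in enumerate(seq):
        if rng.random() >= select_prob:  # only ~15% of positions are predicted
            continue
        r = rng.random()
        if r < 0.8:                      # 80%: replace with [Mask]
            corrupted[i] = MASK
        elif r < 0.9:                    # 10%: replace with a different residue
            corrupted[i] = rng.choice([a for a in AMINO_ACIDS if a != aa])
        # remaining 10%: leave the residue unchanged
    return corrupted, list(seq)

p_prime, p = corrupt_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0)
```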
  • The loss function $l_i(x, y)$ for predicting the i-th amino acid in a protein is

    $l_i(x, y) = -\log \frac{\exp(x_{y_i})}{\sum_{c=1}^{C} \exp(x_c)}$

    where $x$ denotes the prediction result, $\exp(x_{y_i}) / \sum_{c} \exp(x_c)$ is the probability that the prediction equals the true label $y_i$, $x_c$ is the score for category $c$, and $C$ is the total number of prediction categories. The total loss $L_S$ of the masked amino acid prediction task is the sum of all amino acid prediction losses in a training batch:

    $L_S = \sum_{n=1}^{N} \sum_{i} l_i\left(x^{(n)}, y^{(n)}\right)$

    (a PyTorch rendering follows below).
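A hedged PyTorch rendering of $l_i$ and $L_S$: per-position softmax cross-entropy summed over the selected positions of a batch. The tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_aa_loss(logits, labels, predict_mask):
    """logits: (B, L, C) scores; labels: (B, L) true residue ids;
    predict_mask: (B, L) bool, True at the positions selected for prediction."""
    per_token = F.cross_entropy(logits.transpose(1, 2), labels,
                                reduction="none")  # -log softmax(x)[y_i] per position
    return per_token[predict_mask].sum()           # L_S: sum over the whole batch

L_S = masked_aa_loss(torch.randn(2, 10, 25),
                     torch.randint(0, 25, (2, 10)),
                     torch.rand(2, 10) > 0.85)
```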
  • In the protein interaction prediction task, a training batch consists of N proteins; the mean of all amino acid embeddings in a protein is used as the protein's embedded representation, and the model predicts whether an interaction exists between any two proteins. Let $X = \{x_1, x_2, \dots, x_N\}$ be the sampled proteins and $y \in \{0, 1\}^{N \times N}$ the interaction matrix; the loss is

    $L_{PPI} = \sum_{i \neq j} \mathrm{BCE}\left(p(y_{ij} \mid x_i, x_j),\, y_{ij}\right)$

    where $i \neq j$ are protein indices, BCE denotes binary cross-entropy, and $p(y_{ij} \mid x_i, x_j)$ is the probability of correctly predicting the interaction between proteins $x_i$ and $x_j$ as label $y_{ij}$ (sketched below).
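A sketch of $L_{PPI}$ under the stated mean-pooling scheme; the bilinear pair scorer is an illustrative assumption (the text fixes only mean pooling and binary cross-entropy over pairs).

```python
import torch
import torch.nn.functional as F

def ppi_loss(residue_embs, lengths, y, scorer):
    """residue_embs: (N, L, D); lengths: (N,); y: (N, N) 0/1 interaction matrix."""
    mask = (torch.arange(residue_embs.size(1))[None, :] < lengths[:, None]).float()
    pooled = (residue_embs * mask[..., None]).sum(1) / lengths[:, None]  # mean pooling
    logits = scorer(pooled) @ pooled.T           # (N, N) pairwise interaction scores
    off_diag = ~torch.eye(len(y), dtype=torch.bool)  # keep only pairs with i != j
    return F.binary_cross_entropy_with_logits(logits[off_diag], y[off_diag].float())

N, L, D = 4, 16, 32
loss = ppi_loss(torch.randn(N, L, D), torch.full((N,), L),
                torch.randint(0, 2, (N, N)), torch.nn.Linear(D, D, bias=False))
```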
  • The amino acid sequence data set of the masked amino acid prediction task is input to the representation learning module.
  • The protein interaction data set of the protein interaction prediction task is likewise input to the representation learning module.
  • The model parameters of the representation learning module and the task module are updated by minimizing the total loss $L = L_S + \lambda L_{PPI}$, which combines the masked amino acid prediction task loss and the protein interaction prediction task loss ($\lambda$ is a hyperparameter); one joint update step is sketched below.
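A sketch of one joint optimization step minimizing $L = L_S + \lambda L_{PPI}$; the method names on `model` and the value of `lam` are hypothetical stand-ins for the two loss computations sketched above.

```python
def train_step(model, mlm_batch, ppi_batch, optimizer, lam=0.5):
    """One update over both data sets (lam: the hyperparameter lambda)."""
    optimizer.zero_grad()
    loss = model.masked_aa_loss(**mlm_batch) + lam * model.ppi_loss(**ppi_batch)
    loss.backward()   # gradients reach both the representation and task modules
    optimizer.step()
    return loss.item()
```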
  • Step 5: after the model parameter update is finished, extract the representation learning module as the protein representation module.
  • The prompt embedding layer, amino acid embedding layer, fusion layer, and pre-trained language model of the parameter-determined representation learning module are extracted as the protein representation module, with the parameter-determined pre-trained language model serving as the protein representation model.
  • The resulting protein representation module is applied to protein structure and/or protein function prediction tasks.
  • In application, the embedded representations of all prompt types are fused simultaneously into the protein's embedded representation to obtain the protein embedding under all prompt identifiers; this embedding is used for structure and/or function prediction.
  • Fusing all prompt embeddings simultaneously comprises: first concatenating the embedded representations of all prompt types, then fusing the resulting concatenated representation with the protein embedding, where the fusion includes concatenation, fully connected mapping, and convolutional mapping.
  • The protein representation module includes the prompt embedding layer, amino acid embedding layer, fusion layer, and protein representation model.
  • When the module is applied to a prediction task, a classifier related to that task (a downstream task mapping head) is added after the protein representation model.
  • The protein representation model and the classifier form an end-to-end task prediction model.
  • After the fusion layer fuses the protein embedding with the concatenated prompt representation, the result is input to the protein representation model for representation learning, yielding the protein embedding under all prompt identifiers. Based on this embedding, the parameters of the classifier in the task prediction model are fine-tuned; the fine-tuned model can then predict protein function and protein structure (see the sketch below).
  • For the two conformation types, the corresponding prompts x_Seq and x_IC are input separately to the prompt embedding layer to extract their embeddings; the two embeddings are concatenated, fused with the protein embedding in the fusion layer, and input to the protein representation model, yielding the protein embedding under the two prompt identifiers.
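A hedged sketch of the application stage: the pre-trained backbone (prompt and amino acid embeddings, fusion layer, protein representation model) is reused, a small classifier is stacked on top, and only the classifier is fine-tuned here; the pooling choice and the frozen/trainable split are assumptions, not requirements of the text.

```python
import torch
import torch.nn as nn

class TaskPredictionModel(nn.Module):
    def __init__(self, backbone, dim=768, num_classes=2):
        super().__init__()
        self.backbone = backbone                # returns (B, L, D) residue embeddings
        self.classifier = nn.Sequential(        # downstream task mapping head
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, token_ids):
        h = self.backbone(token_ids).mean(dim=1)  # pool residues into one vector
        return self.classifier(h)

# fine-tune only the classifier on the small task-specific data set:
# for p in model.backbone.parameters():
#     p.requires_grad = False
```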
  • The protein conformation-aware representation learning method provided by the above embodiments is the first to construct a conformation-centered pre-training data set from public protein data.
  • Unlike existing pre-trained language models, whose generated protein representation is unique and cannot reflect information about the protein in different environments, the proposed protein representation method not only reflects the conformational information of protein interaction and spontaneous protein folding by adding conformation symbols, but also supports the incremental addition of new conformation information under specific conditions.
  • The conformation embedding and fusion method adopted provides uniform conformation information for all amino acids in the protein, so that better amino acid embeddings are obtained.
  • Unlike existing single-task pre-training models, the method uses different pre-training tasks under different conformation symbols and lets the model compute the sum of multiple losses, learning more discriminative conformation symbol representations.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Molecular Biology (AREA)
  • Public Health (AREA)
  • Chemical & Material Sciences (AREA)
  • Physiology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Biochemistry (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed is a protein conformation-aware representation learning method based on a pre-trained language model, comprising: obtaining proteins composed of amino acid sequences, constructing different data sets according to protein conformations, and defining a prompt for each type of protein conformation; constructing a representation learning module based on the pre-trained language model, for fusing the embedded representation of each type of prompt into the embedded representation of the protein to obtain the protein embedding under the prompt identifier; constructing a task module for performing task prediction, for the task corresponding to each type of protein conformation, based on the protein embedding under the prompt identifier; constructing a loss function for each type of task based on the task prediction results and labels, and updating the model parameters of the representation learning module and the task module by combining the loss functions of all task types and the different data sets; and, after the model parameter update is finished, extracting the representation learning module as the protein representation module.

Description

Protein conformation-aware representation learning method based on a pre-trained language model

Technical Field

The invention belongs to the field of representation learning, and in particular relates to a protein conformation-aware representation learning method based on a pre-trained language model.

Background Art

Proteins are sequence data composed of a number of amino acids. Many current studies use language models from natural language processing to learn embedded representations of proteins in order to predict protein structure and properties.

Among them, unsupervised learning methods provide initial parameters for the model through a masked language model objective, fine-tune the model together with a mapping function on labeled data sets, learn a representation of the protein in the task space, and apply it to downstream prediction tasks. However, as macromolecules, proteins adopt different conformations in different environments: they can fold spontaneously through the chemical properties of their amino acids, and they can change shape through contact with other proteins. For a given protein, current protein language models can produce only a single embedding matrix, which makes it difficult to perform well on structure prediction and function prediction tasks at the same time and limits the expressive power of the model and its application in practical scenarios.

The success of the prompting idea in natural language shows that language models have strong expressive power: changing local symbols of a sequence can steer the overall output in a desired direction. The traditional prompting idea is to take an existing pre-trained model and search the semantic space for the optimal prompt. In a protein pre-training model, however, the semantic unit is the amino acid, so prompts cannot be constructed manually the way natural language prompts can; moreover, prompts in continuous spaces lack interpretability.

Patent document CN 106845149 A discloses a protein sequence representation method based on gene ontology (GO) information. It first uses the BLAST program to search the Swiss-Prot database for all protein sequences similar to a protein sequence P and inputs all proteins of the training data set into the GO database to retrieve the ontology information of each protein; it then searches the gene ontology library for the annotated ontology information of protein P. Given the M labels of the prediction problem, protein P is defined as a discrete vector of M elements; that is, the GO information of the proteins in the sequence set is fused into a new vector description of protein P, reducing the dimensionality of the protein sequence representation.

Patent document CN 104134017 A discloses a method for extracting protein interaction relation pairs based on compact feature representation, comprising the following steps: 1) select the required corpus, organized by sentence and already annotated with protein entities and entity relations; 2) discard the sentences of step 1) that contain no protein entity or only one protein entity, obtaining a sentence set; 3) replace the protein entities in each sentence with placeholders, perform placeholder fusion, and then carry out part-of-speech tagging and syntactic analysis; 4) taking each entity pair as a unit, extract word, part-of-speech, syntactic, and template features; 5) apply a compact-representation operation to the features obtained in step 4); 6) use a support vector machine to train on the features obtained in step 4), or make predictions with the trained model. The purpose of this method is to make the best possible use of the information available in the sentence and to increase the information content of the feature vector.

Although each of the two patent applications above has its own merits, neither takes protein conformation into account, so the resulting representation vectors are inaccurate, which limits performance on protein structure prediction and function prediction tasks.
Summary of the Invention

In view of the above, the purpose of the present invention is to provide a protein conformation-aware representation learning method based on a pre-trained language model which, through joint pre-training on multiple data sets, obtains protein representations under different conformations and improves the accuracy of protein structure prediction and function prediction tasks.

To achieve this purpose, an embodiment of the present invention provides a protein conformation-aware representation learning method based on a pre-trained language model, comprising the following steps:

obtaining proteins composed of amino acid sequences, constructing different data sets according to protein conformations, and defining a prompt for each type of protein conformation;

constructing a representation learning module based on the pre-trained language model, for fusing the embedded representation of each type of prompt into the embedded representation of the protein to obtain the protein embedding under the prompt identifier;

constructing a task module for performing task prediction, for the task corresponding to each type of protein conformation, based on the protein embedding under the prompt identifier;

constructing a loss function for each type of task based on the task prediction results and labels, and updating the model parameters of the representation learning module and the task module by combining the loss functions of all task types and the different data sets;

after the model parameter update is finished, extracting the representation learning module as the protein representation module.
In one embodiment, the representation learning module comprises a prompt embedding layer, an amino acid embedding layer, a fusion layer, and a pre-trained language model, wherein the prompt embedding layer is used to learn the embedded representation of each type of prompt, the amino acid embedding layer is used to learn the embedded representation of the protein, the fusion layer fuses the embedded representation of each type of prompt with the embedded representation of the protein to obtain a fused representation, and the pre-trained language model performs representation learning on the fused representation to obtain the protein embedding under each type of prompt identifier;

the task module comprises a task mapping layer corresponding to each type of protein conformation, each task mapping layer being used for task prediction based on the protein embedding under the corresponding prompt identifier.

In one embodiment, the amino acid embedding layer comprises an amino acid information embedding layer and an amino acid position embedding layer, used respectively to extract the amino acid information representation and the amino acid position representation from the amino acid sequence; the two representations are superimposed to obtain the embedded representation of the protein.

In one embodiment, the pre-trained language model is a pluggable masked pre-trained language model; masked pre-trained language models include BERT, RoBERTa, ALBERT, and XLNet.

In one embodiment, the task mapping layer comprises a two-layer MLP used for task prediction based on the protein embedding under each type of prompt identifier.

In one embodiment, for each type of prediction task, the loss function minimizes the error between the task prediction result and the label.

In one embodiment, the protein representation module is applied to protein structure and/or protein function prediction tasks; in application, within the protein representation module, the embedded representations of all prompt types are fused simultaneously into the embedded representation of the protein to obtain the protein embedding under all prompt identifiers, which is used for protein structure and/or protein function prediction.

In one embodiment, fusing the embedded representations of all prompt types simultaneously into the embedded representation of the protein comprises:

first concatenating the embedded representations of all prompt types, then fusing the resulting concatenated representation with the embedded representation of the protein, where the fusion includes concatenation, fully connected mapping, and convolutional mapping.

In one embodiment, the protein conformations include a natural folding state and an interaction state;

for the natural folding state, amino acid sequences with masks serve as samples and normal amino acid sequences serve as sample labels to form a data set; the prompt corresponding to the natural folding state is the sequence prompt, and the corresponding task is the masked amino acid prediction task;

for the interaction state, at least two amino acid sequences serve as samples and the protein reaction type serves as the sample label to form a data set; the prompt corresponding to the interaction state is the interaction prompt, and the corresponding task is the protein interaction prediction task.
Compared with the prior art, the beneficial effects of the present invention at least include the following:

Different conformation-centered data sets are constructed from protein data, and prompts are added to reflect protein conformation, so that protein embeddings under each prompt identifier are learned. On this basis, the protein embeddings under different prompt identifiers are used for different prediction tasks, so that the pre-trained language model computes the sum of the losses of multiple prediction tasks and learns more discriminative conformation prompt representations; fusing these conformation prompt representations into the protein embedding improves the accuracy of protein structure prediction and function prediction tasks.
Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a flowchart of the protein conformation-aware representation learning method based on a pre-trained language model;

Figure 2 is a schematic diagram of the construction process of the protein representation module provided by the embodiment;

Figure 3 is a schematic diagram of the application process of the protein representation module provided by the embodiment.

Detailed Description

To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit its scope of protection.
Figure 1 is a flowchart of the protein conformation-aware representation learning method based on a pre-trained language model. As shown in Figure 1, the method provided by the embodiment comprises the following steps.

Step 1: obtain proteins composed of amino acid sequences, construct different data sets according to protein conformations, and define a prompt for each type of protein conformation.

In the embodiment, protein structure data are obtained from experimental results in the biological field, each entry consisting of an amino acid sequence. At the same time, protein function data are obtained, namely protein-protein interaction data, where the interaction data are the reaction types of the proteins.

Protein conformation refers to the different states exhibited by a protein, including the natural folding state and the interaction state. After the protein data are obtained, data sets with different properties are constructed according to the different conformations, providing the basis for conformation-centered protein data processing and for learning protein embeddings for different prediction tasks.

In the embodiment, for the natural folding state, amino acid sequences containing the mask symbol [Mask] serve as samples, and the normal amino acid sequences (the original sequences without masks) serve as sample labels, forming the amino acid sequence data set. Based on this data set, the masked amino acid prediction task is constructed; that is, the data set is used for masked amino acid prediction, in order to learn the embedded representation of proteins in the natural folding state.

In the embodiment, for the interaction state, at least two interacting amino acid sequences serve as samples, and protein reaction types serve as sample labels, forming the protein interaction data set. Based on this data set, the protein interaction prediction task is constructed; that is, the data set is used for protein interaction prediction, in order to learn the embedded representation of proteins in the interaction state.

In the embodiment, the interacting proteins of the protein interaction data set are represented as a network graph. The graph is cut by species, and proteins that cannot serve as input to the pre-trained language model are removed. At each training step, the interaction network of one species is selected, proteins with a total length of no more than 2048 amino acids are randomly sampled, and the interaction matrix of the resulting subgraph is obtained and used as input to the representation learning model.

Protein reaction types include reaction, expression, activation, post-translational modification, binding, and catalysis. When constructing the protein interaction data set, at least one protein reaction type serves as the label.

In the embodiment, to mark protein conformations, a prompt is defined for each conformation type: x_Seq for the natural folding state and x_IC for the interaction state. During learning, the prompts are fused into the protein embedding to mark the conformation.
Step 2: build a representation learning module based on the pre-trained language model, used to fuse the embedded representation of each type of prompt into the embedded representation of the protein, to obtain the protein embedding under the prompt identifier.

In the embodiment, when constructing the protein embedding, the prompts representing protein conformations are added to the vocabulary of the pre-trained language model; when a protein embedding carrying conformation information is input to the pre-trained language model, the model adds the embedding of the conformation prompt to the protein embedding.

In one implementation, as shown in Figure 2, the representation learning module comprises a prompt embedding layer, an amino acid embedding layer, a fusion layer, and a pre-trained language model. The prompt embedding layer maps each type of prompt into a high-dimensional space to learn its embedded representation; the amino acid embedding layer maps the protein into a high-dimensional space to learn the protein's embedded representation; the fusion layer fuses the two embeddings into a fused representation, and may do so by concatenation, fully connected mapping, or convolutional mapping; the pre-trained language model performs representation learning on the fused representation to obtain the protein embedding under each type of prompt identifier.

In one implementation, as shown in Figure 2, the amino acid embedding layer comprises an amino acid information embedding layer and an amino acid position embedding layer, which respectively extract the amino acid information representation and the amino acid position representation from the amino acid sequence; the two are superimposed (the + symbol in Figure 2) to obtain the protein's embedded representation. The information representation characterizes which amino acids compose the protein, and the position representation characterizes the position of each amino acid.

In the embodiment, the pre-trained language model is a pluggable masked pre-trained language model; masked pre-trained language models include BERT, RoBERTa, ALBERT, and XLNet, and RoBERTa may preferably be used. These pre-trained language models perform representation learning on the input fused representation to obtain the protein embedding under each type of prompt identifier.
Step 3: build a task module, used to perform task prediction, for the task corresponding to each type of protein conformation, based on the protein embedding under the prompt identifier.

In the embodiment, the task module comprises a task mapping layer for each protein conformation type; each task mapping layer uses a different mapping function to map the protein embedding under the corresponding prompt identifier into a different task space, i.e., to perform a different task prediction. Preferably, a two-layer MLP may be used as the mapping function.

For the masked amino acid prediction task corresponding to the natural folding state and the protein interaction prediction task corresponding to the interaction state, the task module contains two task mapping layers, namely the interaction-conformation task mapping head and the sequence task mapping head, each of which uses a two-layer MLP as its mapping function.
Step 4: construct a loss function for each type of task based on the task prediction results and labels, and update the model parameters of the representation learning module and the task module by combining the loss functions of all task types and the different data sets.

In the embodiment, for each type of prediction task, the loss function minimizes the error between the task prediction result and the label. The loss functions of all prediction task types are then combined, and the model parameters of the representation learning module and the task module are optimized on the data set of each prediction task; that is, the parameters of the prompt embedding layer, the amino acid embedding layer, the fusion layer, the pre-trained language model, and the task mapping layers are optimized.

In the embodiment, when the protein conformations include the natural folding state and the interaction state, the total loss function constructed during representation learning comprises two parts, the masked amino acid prediction task loss and the protein interaction prediction task loss, specifically as follows.

In the masked amino acid prediction task, a training batch consists of N proteins; for each amino acid chain, 15% of the amino acids are randomly selected for prediction. Each selected amino acid has an 80% probability of being replaced by the [Mask] symbol, a 10% probability of being replaced by another amino acid symbol, and a 10% probability of remaining unchanged. Constructing sample data in this way yields protein pairs (P, P'), where P' is the altered protein and P is the original protein, which serves as the label of P' in the masked amino acid prediction task. On this basis, the loss function $l_i(x, y)$ for predicting the i-th amino acid in a protein is:
$$l_i(x, y) = -\log \frac{\exp(x_{y_i})}{\sum_{c=1}^{C} \exp(x_c)}$$

where $x$ denotes the prediction result, $\exp(x_{y_i}) / \sum_{c} \exp(x_c)$ denotes the probability that the prediction equals the true label $y_i$, $x_c$ denotes the score of category $c$, and $C$ denotes the total number of prediction categories. The total loss $L_S$ of the masked amino acid prediction task is the sum of the prediction losses of all amino acids in a training batch:

$$L_S = \sum_{n=1}^{N} \sum_{i} l_i\left(x^{(n)}, y^{(n)}\right)$$
In the protein interaction prediction task, a training batch consists of N proteins; the mean of all amino acid embeddings in a protein is taken as the protein's embedded representation, and the model predicts whether an interaction exists between any two proteins. The loss function of this task is similar to that of the masked amino acid prediction task above. Let $X = \{x_1, x_2, \dots, x_N\}$ be the sampled protein sequences and $y \in \{0, 1\}^{N \times N}$ the interaction matrix; the loss function $L_{PPI}$ is:

$$L_{PPI} = \sum_{i \neq j} \mathrm{BCE}\left(p(y_{ij} \mid x_i, x_j),\; y_{ij}\right)$$

where $i \neq j$ are protein indices, BCE denotes binary cross-entropy, and $p(y_{ij} \mid x_i, x_j)$ denotes the probability of correctly predicting the interaction between protein $x_i$ and protein $x_j$ as label $y_{ij}$.

The total loss is then $L = L_S + \lambda L_{PPI}$, where $\lambda$ is a hyperparameter.
In the embodiment, the amino acid sequence data set corresponding to the masked amino acid prediction task and the protein interaction data set corresponding to the protein interaction prediction task are input to the representation learning module, and the model parameters of the representation learning module and the task module are updated with the goal of minimizing the total loss function comprising the masked amino acid prediction task loss and the protein interaction prediction task loss.

Step 5: after the model parameter update is finished, extract the representation learning module as the protein representation module.

In the embodiment, after the model parameters have been updated, the prompt embedding layer, amino acid embedding layer, fusion layer, and pre-trained language model of the parameter-determined representation learning module are extracted as the protein representation module, with the parameter-determined pre-trained language model serving as the protein representation model.

In the embodiment, the resulting protein representation module is applied to protein structure and/or protein function prediction tasks. In application, within the protein representation module, the embedded representations of all prompt types are fused simultaneously into the protein's embedded representation to obtain the protein embedding under all prompt identifiers; this embedding is used for protein structure and/or protein function prediction.

In one implementation, fusing the embedded representations of all prompt types simultaneously into the protein embedding comprises: first concatenating the embedded representations of all prompt types, then fusing the resulting concatenated representation with the protein embedding, where the fusion includes concatenation, fully connected mapping, and convolutional mapping.

In one implementation, as shown in Figure 3, the protein representation module comprises the prompt embedding layer, the amino acid embedding layer, the fusion layer, and the protein representation model. When the module is applied to protein structure and/or protein function prediction tasks, a classifier related to the prediction task (a downstream task mapping head) is added after the protein representation model, and the protein representation model and the classifier form an end-to-end task prediction model.

In application, suitable prompts are selected according to the prediction task, and a small-scale data set for the task is constructed; the proteins and prompts of this data set serve as input to the task prediction model. The embedded representations of each prompt type extracted by the prompt embedding layer are concatenated into a concatenated representation and input to the fusion layer; after the fusion layer fuses the protein embedding with the concatenated representation, the result is input to the protein representation model for representation learning, yielding the protein embedding under all prompt identifiers. Based on this embedding, the parameters of the classifier contained in the task prediction model are fine-tuned, and the fine-tuned task prediction model can predict protein function and protein structure.

Specifically, for the two conformation types, the natural folding state and the interaction state, the corresponding prompts x_Seq and x_IC are input separately to the prompt embedding layer to extract the corresponding embedded representations; the two embeddings are concatenated, input to the fusion layer, fused there with the protein's embedded representation, and input to the protein representation model, yielding the protein embedding under the two prompt identifiers.
The protein conformation-aware representation learning method based on a pre-trained language model provided by the above embodiments is the first to construct a conformation-centered pre-training data set from public protein data.

Unlike existing pre-trained language models, whose generated protein representation is unique and cannot reflect information about the protein in different environments, the proposed protein representation method not only reflects the conformational information of protein interaction and spontaneous protein folding by adding conformation symbols, but also supports the incremental addition of new conformation information under specific conditions.

In the method provided by the above embodiments, the conformation embedding and fusion approach provides uniform conformation information for all amino acids in the protein, so that better amino acid embeddings are obtained.

Unlike existing single-task pre-training models, the method provided by the above embodiments proposes a multi-task pre-training model that uses different pre-training tasks under different conformation symbols, letting the model compute the sum of multiple losses so as to learn more discriminative conformation symbol representations.

The specific embodiments described above explain the technical solution and beneficial effects of the present invention in detail. It should be understood that the above are only the most preferred embodiments of the present invention and do not limit it; any modification, supplement, or equivalent replacement made within the scope of the principles of the present invention shall be included in the scope of protection of the present invention.

Claims (9)

  1. A protein conformation-aware representation learning method based on a pre-trained language model, characterized by comprising the following steps:
    obtaining proteins composed of amino acid sequences, constructing different data sets according to protein conformations, and defining a prompt for each type of protein conformation;
    constructing a representation learning module based on the pre-trained language model, for fusing the embedded representation of each type of prompt into the embedded representation of the protein to obtain the protein embedding under the prompt identifier;
    constructing a task module for performing task prediction, for the task corresponding to each type of protein conformation, based on the protein embedding under the prompt identifier;
    constructing a loss function for each type of task based on the task prediction results and labels, and updating the model parameters of the representation learning module and the task module by combining the loss functions of all task types and the different data sets;
    after the model parameter update is finished, extracting the representation learning module as the protein representation module.
  2. The protein conformation-aware representation learning method based on a pre-trained language model according to claim 1, characterized in that the representation learning module comprises a prompt embedding layer, an amino acid embedding layer, a fusion layer, and a pre-trained language model, wherein the prompt embedding layer is used to learn the embedded representation of each type of prompt, the amino acid embedding layer is used to learn the embedded representation of the protein, the fusion layer fuses the embedded representation of each type of prompt with the embedded representation of the protein to obtain a fused representation, and the pre-trained language model performs representation learning on the fused representation to obtain the protein embedding under each type of prompt identifier;
    the task module comprises a task mapping layer corresponding to each type of protein conformation, each task mapping layer being used for task prediction based on the protein embedding under the corresponding prompt identifier.
  3. The protein conformation-aware representation learning method based on a pre-trained language model according to claim 2, characterized in that the amino acid embedding layer comprises an amino acid information embedding layer and an amino acid position embedding layer, used respectively to extract the amino acid information representation and the amino acid position representation from the amino acid sequence, the two representations being superimposed to obtain the embedded representation of the protein.
  4. The protein conformation-aware representation learning method based on a pre-trained language model according to claim 2, characterized in that the pre-trained language model is a pluggable masked pre-trained language model, the masked pre-trained language models including BERT, RoBERTa, ALBERT, and XLNet.
  5. The protein conformation-aware representation learning method based on a pre-trained language model according to claim 2, characterized in that the task mapping layer comprises a two-layer MLP used for task prediction based on the protein embedding under each type of prompt identifier.
  6. The protein conformation-aware representation learning method based on a pre-trained language model according to claim 1, characterized in that, for each type of prediction task, the loss function minimizes the error between the task prediction result and the label.
  7. The protein conformation-aware representation learning method based on a pre-trained language model according to any one of claims 1 to 6, characterized in that the protein representation module is applied to protein structure and/or protein function prediction tasks;
    in application, within the protein representation module, the embedded representations of all prompt types are fused simultaneously into the embedded representation of the protein to obtain the protein embedding under all prompt identifiers, which is used for protein structure and/or protein function prediction.
  8. The protein conformation-aware representation learning method based on a pre-trained language model according to claim 7, characterized in that fusing the embedded representations of all prompt types simultaneously into the embedded representation of the protein comprises:
    first concatenating the embedded representations of all prompt types, then fusing the resulting concatenated representation with the embedded representation of the protein, where the fusion includes concatenation, fully connected mapping, and convolutional mapping.
  9. The protein conformation-aware representation learning method based on a pre-trained language model according to claim 1, characterized in that the protein conformations include a natural folding state and an interaction state;
    for the natural folding state, amino acid sequences with masks serve as samples and normal amino acid sequences serve as sample labels to form a data set, the prompt corresponding to the natural folding state being the sequence prompt and the corresponding task being the masked amino acid prediction task;
    for the interaction state, at least two amino acid sequences serve as samples and the protein reaction type serves as the sample label to form a data set, the prompt corresponding to the interaction state being the interaction prompt and the corresponding task being the protein interaction prediction task.
PCT/CN2022/126696 2022-02-09 2022-10-21 Protein conformation-aware representation learning method based on a pre-trained language model WO2023151314A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/278,298 US20240233875A9 (en) 2022-02-09 2022-10-21 Perceptual representation learning method for protein conformations based on pre-trained language model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210122014.5A 2022-02-09 Protein conformation-aware representation learning method based on a pre-trained language model
CN202210122014.5 2022-02-09

Publications (1)

Publication Number Publication Date
WO2023151314A1 (zh)

Family

ID=82071767

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/126696 WO2023151314A1 (zh) 2022-02-09 2022-10-21 Protein conformation-aware representation learning method based on a pre-trained language model

Country Status (3)

Country Link
US (1) US20240233875A9 (zh)
CN (1) CN114678061A (zh)
WO (1) WO2023151314A1 (zh)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678061A (zh) 2022-02-09 2022-06-28 浙江大学杭州国际科创中心 Protein conformation-aware representation learning method based on a pre-trained language model
CN115565607B (zh) * 2022-10-20 2024-02-23 抖音视界有限公司 Method, apparatus, readable medium, and electronic device for determining protein information
CN116913393B (zh) * 2023-09-12 2023-12-01 浙江大学杭州国际科创中心 Protein evolution method and device based on reinforcement learning
CN118485081B (zh) * 2024-07-09 2024-10-11 山东科技大学 Prompt-learning knowledge tracing method and system for smart education


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021041199A1 (en) * 2019-08-23 2021-03-04 Geaenzymes Co. Systems and methods for predicting proteins
CN113593631A (zh) * 2021-08-09 2021-11-02 山东大学 Method and system for predicting protein-polypeptide binding sites
CN113887627A (zh) * 2021-09-30 2022-01-04 北京百度网讯科技有限公司 Noise sample identification method and apparatus, electronic device, and storage medium
CN114678061A (zh) * 2022-02-09 2022-06-28 浙江大学杭州国际科创中心 Protein conformation-aware representation learning method based on a pre-trained language model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ningyu Zhang, Zhen Bi, Xiaozhuan Liang, Siyuan Cheng, Haosen Hong, Shumin Deng, Jiazhang Lian, Qiang Zhang, Huajun Chen: "OntoProtein: Protein Pretraining With Gene Ontology Embedding", arXiv, 23 January 2022, XP091140732 *
Qiang Zhang, Zeyuan Wang, Yuqiang Han, Haoran Yu, Xurui Jin, Huajun Chen: "Prompt-Guided Injection of Conformation to Pre-trained Protein Model", arXiv, 7 February 2022, XP091151474 *
Qing-Lin Wu, Yu-Bin Ren, Xiao-Wei Zhai, Dong Chen, Kai Liu: "Protein Sequence Design Using Generative Models", Chinese Journal of Applied Chemistry, vol. 39, no. 1, 10 January 2022, pages 3-17, XP093082958 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116935952A (zh) * 2023-09-18 2023-10-24 浙江大学杭州国际科创中心 Method and device for training a protein prediction model based on a graph neural network
CN116935952B (zh) * 2023-09-18 2023-12-01 浙江大学杭州国际科创中心 Method and device for training a protein prediction model based on a graph neural network
CN116992979A (zh) * 2023-09-28 2023-11-03 北京智精灵科技有限公司 Deep-learning-based cognitive training task pushing method and system
CN116992979B (zh) * 2023-09-28 2024-01-23 北京智精灵科技有限公司 Deep-learning-based cognitive training task pushing method and system
CN117711532A (zh) * 2024-02-05 2024-03-15 北京悦康科创医药科技股份有限公司 Polypeptide amino acid sequence generation model training and related products
CN117711532B (zh) * 2024-02-05 2024-05-10 北京悦康科创医药科技股份有限公司 Polypeptide amino acid sequence generation model training method and polypeptide amino acid sequence generation method

Also Published As

Publication number Publication date
US20240136021A1 (en) 2024-04-25
US20240233875A9 (en) 2024-07-11
CN114678061A (zh) 2022-06-28

Similar Documents

Publication Publication Date Title
WO2023151314A1 (zh) Protein conformation-aware representation learning method based on a pre-trained language model
CN108897989B Biological event extraction method based on a candidate-event-element attention mechanism
CN109960728B Named entity recognition method and system for open-domain conference information
CN115048447B Database natural language interface system based on intelligent semantic completion
EP4394781A1 (en) Reactant molecule prediction method and apparatus, training method and apparatus, and electronic device
CN112766507B Knowledge base question answering method for complex questions based on embeddings and candidate-subgraph pruning
CN113408287B Entity recognition method and apparatus, electronic device, and storage medium
CN117291265B Knowledge graph construction method based on text big data
CN115526236A Text network graph classification method based on multimodal contrastive learning
CN114742016B Document-level event extraction method and apparatus based on multi-granularity entity heterogeneous graphs
CN114648015B Aspect-level sentiment word recognition method based on a dependency attention model
CN117033423A SQL generation method injecting optimal schema items and historical interaction information
CN115390806A Software design pattern recommendation method based on bimodal joint modeling
CN114626463A Language model training method, text matching method, and related apparatus
CN113705222B Slot recognition model training method and apparatus, and slot filling method and apparatus
CN116384371A Joint entity and relation extraction method based on BERT and dependency syntax
US20220138425A1 (en) Acronym definition network
CN115982391B Information processing method and apparatus
CN114358021B Deep-learning-based reply generation method for task-oriented dialogue and storage medium
CN116302953A Software defect localization method based on enhanced embedding-vector semantic representation
CN113590745B An interpretable text inference method
CN114239605A Assisted communication content generation method, apparatus, device, and storage medium
Yan et al. Local hypergraph-based nested named entity recognition as query-based sequence labeling
CN115983270B Intelligent extraction method for e-commerce product attributes
Tang et al. Latent graph learning with dual-channel attention for relation extraction

Legal Events

WWE - WIPO information: entry into national phase (Ref document number: 18278298; Country of ref document: US)
121 - EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22925666; Country of ref document: EP; Kind code of ref document: A1)
NENP - Non-entry into the national phase (Ref country code: DE)