CN116959571A - Training method for protein language model, electronic device, computer readable medium and program product - Google Patents


Info

Publication number
CN116959571A
Authority
CN
China
Prior art keywords
amino acid
acid sequence
language model
training
protein
Prior art date
Legal status (assumed, not a legal conclusion)
Pending
Application number
CN202310832203.6A
Other languages
Chinese (zh)
Inventor
成幸毅
陈波
李绅
曾信
刘迟明
唐杰
宋乐
Current Assignee (listing may be inaccurate)
Baitu Shengke Beijing Intelligent Technology Co ltd
Original Assignee
Baitu Shengke Beijing Intelligent Technology Co ltd
Priority date (assumed, not a legal conclusion)
Filing date
Publication date
Application filed by Baitu Shengke Beijing Intelligent Technology Co ltd filed Critical Baitu Shengke Beijing Intelligent Technology Co ltd
Priority to CN202310832203.6A
Publication of CN116959571A
Status: Pending


Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 — ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 — Gene or protein expression profiling; expression-ratio estimation or normalisation
    • G16B30/00 — ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The present disclosure relates to the field of protein language model training, and in particular to a training method for a protein language model, a method for extracting an amino acid sequence representation using a protein language model, a method for obtaining a new amino acid sequence using a protein language model, a method for obtaining a related amino acid sequence using a protein language model, a method for predicting amino acid sequence perplexity using a protein language model, an electronic device, a computer-readable medium, and a program product. The training method of the protein language model comprises the following steps: during training, performing a first type of training task and at least one of the two training subtasks comprised in a second type of training task; and adjusting parameters of the protein language model based on the first loss value corresponding to the first type of training task and the loss value corresponding to the second type of training task, to obtain the trained protein language model.

Description

Training method for protein language model, electronic device, computer readable medium and program product
Technical Field
The present disclosure relates to the technical field of protein language model training, and in particular to a training method for a protein language model, a method for extracting an amino acid sequence representation using a protein language model, a method for obtaining a mutated amino acid sequence using a protein language model, a method for obtaining a related amino acid sequence using a protein language model, a method for predicting amino acid sequence perplexity using a protein language model, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Proteins are one of the substances necessary for life, and understanding the structure of proteins can help us understand their functions from a mechanism perspective, thereby facilitating subsequent target research and drug development. In the field of artificial intelligence aided protein design and processing, an important issue is how to represent proteins in mathematical form for use in artificial intelligence algorithms, and how to generate proteins.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
In view of this, the present disclosure provides a training method for a protein language model, a method for extracting an amino acid sequence representation using a protein language model, a method for obtaining a mutated amino acid sequence using a protein language model, a method for obtaining a related amino acid sequence using a protein language model, a method for predicting amino acid sequence perplexity using a protein language model, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a training method for a protein language model, comprising: during training, performing a first type of training task and at least one of the two training subtasks comprised in a second type of training task;
The first type of training task comprises: obtaining a first amino acid sequence; performing a first preprocessing operation on the first amino acid sequence to obtain a first preprocessed amino acid sequence, the first preprocessing operation comprising selecting one or more first sites in the first amino acid sequence and masking the tokens at the first sites; inputting the first preprocessed amino acid sequence into the protein language model to obtain a first prediction result; and calculating a first loss value from the tokens at the first sites in the first amino acid sequence and the elements at the first calculation sites in the first prediction result, the first calculation sites being the sites aligned with the first sites. The first training subtask of the second type of training task comprises: obtaining a second amino acid sequence; performing a second preprocessing operation on the second amino acid sequence to obtain a second preprocessed amino acid sequence, the second preprocessing operation comprising selecting one or more second sites in the second amino acid sequence and masking the tokens at the second sites; inputting the second preprocessed amino acid sequence into the protein language model to obtain a second prediction result; and calculating a second loss value from the tokens at the second sites in the second amino acid sequence and the elements at the second calculation sites in the second prediction result, the second calculation sites being selected from the sites after the sites aligned with the second sites. The second training subtask of the second type of training task comprises: obtaining a third amino acid sequence; performing a third preprocessing operation on the third amino acid sequence to obtain a third preprocessed amino acid sequence, the third preprocessing operation comprising selecting one or more third sites that are located at the end of the third amino acid sequence and adjacent to each other, and deleting the tokens at the third sites; inputting the third preprocessed amino acid sequence into the protein language model to obtain a third prediction result; and calculating a third loss value from the tokens at the third sites in the third amino acid sequence and the elements at the third calculation sites in the third prediction result, the third calculation sites being selected from the sites after the sites aligned with the third sites. Parameters of the protein language model are adjusted based on the first loss value and the loss value corresponding to the second type of training task, to obtain the trained protein language model.
According to another aspect of the present disclosure, there is provided a method for extracting an amino acid sequence representation using a protein language model, the protein language model being a trained protein language model obtained by training with the method provided by the present disclosure, the trained protein language model comprising a representation extraction layer, a transformer block, and a classification layer. The method comprises: obtaining a first target amino acid sequence; and inputting the first target amino acid sequence into the representation extraction layer and the transformer block of the trained protein language model, and taking the output of the transformer block as the representation of the first target amino acid sequence.
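By way of illustration and not limitation, the following Python sketch shows how such a representation might be extracted. The attribute names (`embed`, `blocks`, `classifier`) and the tokenizer call in the usage comment are assumptions introduced for this sketch and are not part of the disclosed implementation; the point is only that the classification layer is skipped and the transformer-block output is taken as the representation.

```python
import torch

def extract_representation(model, token_ids):
    """Return per-residue representations for one amino acid sequence.

    `model` is assumed to expose `embed` (the representation extraction layer),
    `blocks` (the transformer blocks) and `classifier` (the classification
    layer); only the first two are used here.
    """
    with torch.no_grad():
        hidden = model.embed(token_ids)      # token embeddings, shape (L, d)
        for block in model.blocks:           # pass through the transformer blocks
            hidden = block(hidden)
        return hidden                        # taken as the sequence representation

# Usage (hypothetical tokenizer mapping amino acid IDs to integer token ids):
# token_ids = tokenizer.encode("PSSLALSVGQKVTMSCKSSQSI")
# reps = extract_representation(trained_model, token_ids)
```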
According to another aspect of the present disclosure, there is provided a method for obtaining a mutated amino acid sequence using a protein language model, the protein language model being a trained protein language model trained with the method provided by the present disclosure. The method comprises: obtaining a second target amino acid sequence, in which the amino acids at some sites are to be substituted to generate a new amino acid sequence; performing a second preprocessing operation on the second target amino acid sequence to obtain a second target preprocessed amino acid sequence, wherein the sites to be substituted serve as the second sites and the length of the second target preprocessed amino acid sequence is L3; inputting the second target preprocessed amino acid sequence into the trained protein language model; feeding the token output at the k-th position of the trained protein language model back into the model as the token input at the (k+1)-th position, where k is greater than L3; and concatenating the second target preprocessed amino acid sequence with the tokens output at the (L3+1)-th through (L3+N)-th positions of the trained protein language model to obtain the new amino acid sequence, where N is the total number of new amino acids corresponding to the second sites; or concatenating the second target preprocessed amino acid sequence with the tokens output from the (L3+1)-th position up to the position immediately before the position at which the trained protein language model outputs the terminator, to obtain the new amino acid sequence.
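By way of illustration and not limitation, the Python sketch below follows one plausible reading of the flow just described: the sites to be substituted are masked (the second preprocessing operation), tokens are then decoded one position at a time with each output fed back as the next input, and decoding stops after N tokens or at the terminator. The helper `model.predict_next`, the mask placeholder and the terminator token are assumptions made for the sketch.

```python
def generate_mutated_sequence(model, sequence, sites_to_mutate,
                              mask_token="[MASK]", end_token="<eop>", max_new=None):
    """Sketch of mutated-sequence generation with a trained protein language model.

    `model.predict_next(tokens)` is a hypothetical call returning the token output
    at the next position given everything fed in so far; it stands in for feeding
    the output at position k back in as the input at position k+1.
    """
    tokens = list(sequence)
    for i in sites_to_mutate:              # the sites to be substituted are the second sites
        tokens[i] = mask_token             # second preprocessing operation (length stays L3)
    if max_new is None:
        max_new = len(sites_to_mutate)     # N, the number of new amino acids

    generated, context = [], list(tokens)
    while len(generated) < max_new:
        nxt = model.predict_next(context)  # outputs at positions L3+1, L3+2, ...
        if nxt == end_token:               # stop at the terminator
            break
        generated.append(nxt)
        context.append(nxt)                # feed the output back as the next input

    # "Splice" the preprocessed sequence with the generated tokens; mapping the
    # generated tokens back onto the masked sites follows the preset
    # correspondence described later in the text.
    return tokens + generated
```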
According to another aspect of the present disclosure, there is provided a method for obtaining a related amino acid sequence using a protein language model, the protein language model being a trained protein language model trained with the method provided by the present disclosure. The method comprises: inputting a third target preprocessed amino acid sequence into the trained protein language model, the third target preprocessed amino acid sequence having length L4, where L4 is greater than or equal to 0; feeding the token output at the q-th position of the trained protein language model back into the model as the token input at the (q+1)-th position, where q is greater than L4; and taking as the related amino acid sequence the tokens output from the (L4+1)-th position up to the position immediately before the position at which the trained protein language model outputs the terminator, or taking as the related amino acid sequence the tokens output from the (L4+1)-th position to a preset position. The related amino acid sequence is an amino acid sequence that continues the third target amino acid sequence, or an amino acid sequence paired with the third target amino acid sequence.
According to another aspect of the present disclosure, there is provided a method for predicting amino acid sequence perplexity using a protein language model, the protein language model being a trained protein language model trained with the method provided by the present disclosure, the trained protein language model comprising a representation extraction layer, a transformer block, and a classification layer. The method comprises: obtaining a fourth target amino acid sequence; taking each site in the fourth target amino acid sequence in turn as a first site and performing the first preprocessing operation to obtain a plurality of fourth target preprocessed amino acid sequences, each fourth target preprocessed amino acid sequence corresponding to one masked site; inputting a first current preprocessed amino acid sequence among the plurality of fourth target preprocessed amino acid sequences into the protein language model to obtain the probabilities predicted by the classification layer; taking the probability corresponding to the first token at the first current masked site as the perplexity factor of the first current masked site, where the first token is the token of the fourth target amino acid sequence at the first current masked site and the first current masked site is the masked site corresponding to the first current preprocessed amino acid sequence; and obtaining the perplexity of the fourth target amino acid sequence from the perplexity factors of all sites.
According to another aspect of the present disclosure, there is provided a method for predicting amino acid sequence perplexity using a protein language model, the protein language model being a trained protein language model trained with the method provided by the present disclosure, the trained protein language model comprising a representation extraction layer, a transformer block, and a classification layer. The method comprises: obtaining a fifth target amino acid sequence of length L5; taking each site in the fifth target amino acid sequence in turn as a second site and performing the second preprocessing operation to obtain a plurality of fifth target preprocessed amino acid sequences, each fifth target preprocessed amino acid sequence corresponding to one masked site; inputting a second current preprocessed amino acid sequence among the plurality of fifth target preprocessed amino acid sequences into the protein language model to obtain the probabilities predicted by the classification layer; taking the probability corresponding to the second token at the (L5+1)-th position as the perplexity factor of the second current masked site, where the second token is the token of the fifth target amino acid sequence at the second current masked site and the second current masked site is the masked site corresponding to the second current preprocessed amino acid sequence; and obtaining the perplexity of the fifth target amino acid sequence from the perplexity factors of all sites.
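By way of illustration and not limitation, the Python sketch below corresponds to the masked (first-preprocessing) variant of perplexity prediction: each site is masked in turn, the classification-layer probability of the true token at that site is taken as the perplexity factor, and the factors are aggregated. The call `model.token_probability` is an assumed helper, and the aggregation shown (exponential of the negative mean log-probability) is one standard choice; the text leaves the exact aggregation open.

```python
import math

def pseudo_perplexity(model, sequence, mask_token="[MASK]"):
    """Sketch of sequence perplexity from per-site perplexity factors.

    `model.token_probability(masked_tokens, site, true_token)` is a hypothetical
    call returning the classification-layer probability of `true_token` at the
    masked site.
    """
    factors = []
    for i, true_token in enumerate(sequence):
        masked = list(sequence)
        masked[i] = mask_token                       # first preprocessing at site i
        factors.append(model.token_probability(masked, i, true_token))
    # aggregate the factors; here: exp of the negative mean log-probability
    return math.exp(-sum(math.log(p) for p in factors) / len(factors))
```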
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flowchart of a method of training a protein language model according to an exemplary embodiment of the present disclosure;
FIGS. 3A-3C illustrate schematic diagrams of exemplary training processes of a protein language model according to exemplary embodiments of the present disclosure;
FIG. 4 illustrates a flowchart of a method for extracting amino acid sequence representations using a protein language model according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of a method for obtaining a new amino acid sequence using a protein language model according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a flowchart of a method for obtaining related amino acid sequences using a protein language model according to an exemplary embodiment of the present disclosure;
FIG. 7 illustrates a flowchart of a method for predicting amino acid sequence perplexity using a protein language model, according to an exemplary embodiment of the disclosure;
FIG. 8 illustrates a flowchart of a method for predicting amino acid sequence perplexity using a protein language model, according to an exemplary embodiment of the disclosure; and
fig. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
Proteins are one of the substances necessary for life, and understanding the sequence and structure of proteins can help us understand their functions from a mechanism perspective, thus facilitating subsequent target research and drug development. In the field of artificial intelligence aided protein design and processing, an important issue is how to represent proteins in mathematical form for use in artificial intelligence algorithms, and how to generate proteins.
Vector representation of the amino acid sequence of a protein is a widely used approach in the related art. A common practice is to mimic the language models (LMs) used in natural language processing (NLP): an amino acid sequence is fed into a language model, which outputs a corresponding vector representation. Protein language models are generally trained in one of two ways: as a masked language model (MLM) or as an autoregressive language model (ARLM). A masked language model is trained by randomly replacing parts of the amino acid sequence (e.g., substituting masks for certain amino acids) and predicting what the masked amino acids originally were. An autoregressive language model predicts the next amino acid from the preceding ones (e.g., given the first half of a sequence, it predicts the second half token by token). A masked language model can represent each amino acid in a sequence by a vector that reflects its context, whereas an autoregressive language model cannot adequately represent the amino acid at the current position because it is trained to predict the next amino acid. Conversely, an autoregressive language model can be used to generate protein sequences, which a masked language model struggles to do.
To address the above problems, the present disclosure provides a training method for a protein language model: a first type of training task associated with masked-language-model training and at least one second type of training task associated with autoregressive-language-model training are performed on the acquired amino acid sequences, and the parameters of the protein language model are adjusted based on the loss value corresponding to the first type of training task and the loss value corresponding to the second type of training task, thereby obtaining a trained protein language model. The present disclosure thus adopts a hybrid training scheme that combines a masked language model and an autoregressive language model, so that the trained protein language model has both the ability to represent each amino acid in an amino acid sequence with a vector and the ability to generate new amino acid sequences.
To better understand the technical solutions provided by the embodiments of the present application, application scenarios to which these solutions are applicable are briefly described below. It should be noted that the application scenarios described below are only intended to illustrate the embodiments of the present application and are not limiting. In specific implementations, the technical solutions provided by the embodiments of the present application can be applied flexibly according to actual needs.
The method provided by the embodiment of the application can be applied to the application scene shown in fig. 1. Referring to fig. 1, an exemplary system 100 in this application scenario includes a plurality of terminal devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling one or more client devices to the server 120. The terminal devices 101, 102, 103, 104, 105, and 106 and the server 120 may be connected by a wired connection or a wireless connection and transmit data. For example, the terminal devices 101, 102, 103, 104, 105, and 106 and the server 120 may be connected by data lines or by wired networks; the terminal devices 101, 102, 103, 104, 105 and 106 and the server 120 may also be connected through a radio frequency module, a WiFi module or a wireless network.
Among them, the terminal devices 101, 102, 103, 104, 105, and 106 may be computers, notebooks, palmtops (Personal Digital Assistant, PDAs), tablet computers, and the like. The server 120 may be a server or a server cluster or a cloud computing center composed of a plurality of servers, or a virtualization platform, or may be a personal computer, a mainframe computer, a computer cluster, or the like. Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The data store 130 may reside in a variety of locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and communicate with the server 120 via a network-based or dedicated connection. The data store 130 may be of different types. In some embodiments, the data store used by the server 120 may be a database, such as a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some implementations, one or more of the databases 130 can also be used by the application to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
Any number of terminal devices, servers, networks, and databases may be provided in the application scenario of the embodiments of the present application according to implementation requirements; the present application places no particular limitation on this. The training method for the protein language model, the method for extracting an amino acid sequence representation using the protein language model, the method for obtaining a new amino acid sequence using the protein language model, the method for obtaining a related amino acid sequence using the protein language model, and the method for predicting amino acid sequence perplexity using the protein language model provided by the embodiments of the present application may be executed by the server 120, or may be executed cooperatively by the terminal devices 101, 102, 103, 104, 105 and 106 and the server 120.
In order to further explain the technical solution provided by the embodiments of the present application, details are described below with reference to the accompanying drawings and the detailed description. Although the embodiments of the present application provide the method operation steps shown in the following embodiments or figures, the methods may include more or fewer steps based on routine or non-inventive work. For steps that have no logically necessary causal relationship, the execution order is not limited to that provided by the embodiments of the present application. In actual processing, or when performed by an apparatus, the methods may be executed sequentially or in parallel as shown in the embodiments or drawings.
According to one aspect of the present disclosure, a training method 200 for a protein language model is provided. As shown in fig. 2, the method 200 includes: step S202, during training, performing a first type of training task and at least one of the two training subtasks included in a second type of training task; and step S203, adjusting parameters of the protein language model based on the first loss value and the loss value corresponding to the second type of training task to obtain the trained protein language model, where the loss value corresponding to the second type of training task includes at least one of the second loss value and the third loss value.
As used in this disclosure, the term "training task" refers to inputting an input sample into a model, obtaining an actual output sample of the model output, comparing the actual output sample to a desired output sample, and adjusting or optimizing the model based on the comparison. In the context of the protein language model of the present disclosure, training tasks refer to inputting a corresponding pre-processed amino acid sequence into a protein prediction model according to a corresponding type of task, obtaining a corresponding predicted amino acid model output by the protein prediction model, and calculating a corresponding loss value according to the corresponding pre-processed amino acid sequence and the vocabulary elements at useful sites in the corresponding predicted amino acid model, thereby adjusting parameters of the model in order to obtain a trained protein language model.
In step S202, the first type of training task includes: obtaining a first amino acid sequence; performing a first preprocessing operation on the first amino acid sequence to obtain a first preprocessed amino acid sequence, the first preprocessing operation comprising selecting one or more first sites in the first amino acid sequence and masking the tokens at the first sites; inputting the first preprocessed amino acid sequence into the protein language model to obtain a first prediction result; and calculating a first loss value from the tokens at the first sites in the first amino acid sequence and the elements at the first calculation sites in the first prediction result, the first calculation sites being the sites aligned with the first sites. The first training subtask of the second type of training task includes: obtaining a second amino acid sequence; performing a second preprocessing operation on the second amino acid sequence to obtain a second preprocessed amino acid sequence, the second preprocessing operation comprising selecting one or more second sites in the second amino acid sequence and masking the tokens at the second sites; inputting the second preprocessed amino acid sequence into the protein language model to obtain a second prediction result; and calculating a second loss value from the tokens at the second sites in the second amino acid sequence and the elements at the second calculation sites in the second prediction result, the second calculation sites being selected from the sites after the sites aligned with the second sites. The second training subtask of the second type of training task includes: obtaining a third amino acid sequence; performing a third preprocessing operation on the third amino acid sequence to obtain a third preprocessed amino acid sequence, the third preprocessing operation comprising selecting one or more third sites that are located at the end of the third amino acid sequence and adjacent to each other, and deleting the tokens at the third sites; inputting the third preprocessed amino acid sequence into the protein language model to obtain a third prediction result; and calculating a third loss value from the tokens at the third sites in the third amino acid sequence and the elements at the third calculation sites in the third prediction result, the third calculation sites being selected from the sites after the sites aligned with the third sites.
According to an embodiment of the present disclosure, a trained protein language model is obtained by performing, on the acquired amino acid sequences, a first type of training task associated with masked-language-model training and at least one second type of training task associated with autoregressive-language-model training, and by adjusting the parameters of the protein language model based on the loss value corresponding to the first type of training task and the loss value corresponding to the second type of training task. Through this hybrid training scheme combining a masked language model and an autoregressive language model, the trained protein language model has both the ability to represent each amino acid in an amino acid sequence with a vector and the ability to generate new amino acid sequences, so that it can adapt to a variety of downstream tasks and its performance on those downstream tasks is improved.
In step S202, the first type of training task may include: step S2021a, obtaining a first amino acid sequence; step S2021b, performing a first preprocessing operation on the first amino acid sequence to obtain a first preprocessed amino acid sequence; step S2021c, inputting the first preprocessed amino acid sequence into the protein language model to obtain a first prediction result; and step S2021d, calculating the first loss value based on the first amino acid sequence and the first prediction result.
In step S2021a, the first amino acid sequence may be obtained from a laboratory or from various accessible databases; the present disclosure places no limitation on its source.
The first amino acid sequence may include a sequence terminator. In the subsequent processing, the sequence terminator may be processed as an amino acid of a general type without special processing.
The original amino acid sequences in the training dataset differ in length, and the end of each original amino acid sequence may include a sequence terminator. The original amino acid sequences may be processed by joining them end to end and truncating the joined sequence into amino acid sequences of a fixed length (e.g., 512 or 1024), with the first amino acid sequence selected from the truncated amino acid sequences. Computing over fixed-length amino acid sequences makes full use of the available compute. For example, the first amino acid sequence may be X1 X2 X3 X4 X5 EOS X6 X7, where EOS is a sequence terminator.
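By way of illustration and not limitation, the Python sketch below shows the end-to-end joining and fixed-length truncation just described; the chunk length and the terminator symbol are example values only.

```python
def build_fixed_length_sequences(raw_sequences, chunk_len=1024, eos="EOS"):
    """Join the original training sequences end to end (each followed by its
    sequence terminator) and cut the stream into fixed-length chunks."""
    stream = []
    for seq in raw_sequences:
        stream.extend(list(seq))
        stream.append(eos)                 # terminator between original sequences
    chunks = [stream[i:i + chunk_len] for i in range(0, len(stream), chunk_len)]
    # the last chunk may be shorter than chunk_len; it can be padded or dropped
    return [c for c in chunks if len(c) == chunk_len]

# Example: build_fixed_length_sequences(["PSSLALSVG", "QKVTMSCK"], chunk_len=8)
```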
In an embodiment, in step S2021b, the first preprocessing operation may include selecting one or more first sites in the first amino acid sequence and masking the tokens at the first sites. Illustratively, masking includes replacing the amino acid at each first site with a first mask placeholder; that is, the first preprocessing operation does not change the length of the first amino acid sequence.
As used in this disclosure, the term "token" refers to the basic unit of text, and in the context of protein language models, particularly refers to the smallest constituent element of an amino acid sequence, which may be various amino acid identifiers (i.e., amino acid IDs, e.g., leucine for L, alanine for a), various tags (e.g., mask placeholders, sequence terminators, prediction initiators, prediction terminators, etc.) for representing amino acid types, and the like. Thus, it is understood that in the context of the present disclosure, when referring to various amino acid sequences, reference may be made to either amino acid sequences that are homologous to amino acid sequences that are present or not found in nature (i.e., the smallest constituent element that constitutes the amino acid sequence comprises only the amino acid ID) or amino acid sequences that are heterologous to amino acid sequences that are present or not found in nature (i.e., the smallest constituent element that constitutes the amino acid sequence comprises or only the various markers). In the context of a protein language model, the tokens can only be selected from a variety of amino acid identifiers and a variety of tags, and thus each amino acid identifier and tag is an alternative token, which each amino acid identifier and each tag constitutes.
As used in this disclosure, the term "site" refers to the location in the text of the basic unit of text, and in the context of a protein language model, particularly refers to the location in the amino acid sequence of the smallest constituent element of the amino acid sequence.
As used in this disclosure, the term "mask" may refer, in the context of a protein language model, to, for example, setting an amino acid ID at a given site (e.g., a first site) to a mask placeholder, thereby rendering the amino acid type unrecognizable from the lexes at that site.
The number and positions of the masked tokens may be random or may be constrained. For example, the number of masked tokens may be about 15% of the length of the first amino acid sequence, and the masked positions may be, for example, the CDR regions of an antibody.
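By way of illustration and not limitation, a minimal Python sketch of the first preprocessing operation with random site selection is given below; the 15% ratio and the mask placeholder string are example values, and the returned (site, true token) pairs are what the loss calculation later needs.

```python
import random

def first_preprocessing(sequence, mask_ratio=0.15, mask_token="[MASK]"):
    """Pick roughly `mask_ratio` of the sites at random as first sites and
    replace their tokens with a mask placeholder; sequence length is unchanged."""
    tokens = list(sequence)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    first_sites = sorted(random.sample(range(len(tokens)), n_mask))
    targets = [(i, tokens[i]) for i in first_sites]   # true tokens for the loss
    for i in first_sites:
        tokens[i] = mask_token
    return tokens, targets
```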
Illustratively, in step S2021c, the length of the first prediction result obtained by inputting the first preprocessed amino acid sequence into the protein language model (i.e., the number of elements or positions it contains) is the same as the length of the first preprocessed amino acid sequence, and each input token has an output element aligned with it.
It will be understood that the protein language model can predict, for each site that needs to be predicted, the probability of that site being each candidate token. In the first type of task, the sites that need to be predicted are the masked sites.
In the context of a protein language model, the prediction result, and the element at a site in the prediction result, can be understood in a broad sense. The element at a site in the prediction result may refer to one of the candidate tokens (for example, the candidate token with the highest probability), to the probability that the predicted site is a specific candidate token (for example, the token at the corresponding input site), to the probabilities that the predicted site is each of the candidate tokens, or to any combination of the three.
In an embodiment, step S2021d comprises: calculating the first loss value from the tokens at the first sites in the first amino acid sequence and the elements at the first calculation sites in the first prediction result.
In an example, the first calculation site may be the site aligned with the first site, that is, the site in the first prediction result that is aligned with the first site in the first amino acid sequence.
In an example, the element at a first calculation site characterizes the probability that the output at that first calculation site is the corresponding input token at the first site. For example, suppose there are 3 first sites in the first amino acid sequence, where the input token at first site a is amino acid ID A, the input token at first site b is amino acid ID L, and the input token at first site c is amino acid ID U; the output element at first calculation site a is the probability 0.7 that first calculation site a is amino acid ID A, the output element at first calculation site b is the probability 0.6 that first calculation site b is amino acid ID L, and the output element at first calculation site c is the probability 0.4 that first calculation site c is amino acid ID U. The first loss value is determined from these three probabilities.
In an example, the element at a first calculation site characterizes the output token at that first calculation site. For example, suppose there are 3 first sites in the first amino acid sequence, where the input token at first site a is amino acid ID A, the input token at first site b is amino acid ID L, and the input token at first site c is amino acid ID U; the output element at first calculation site a is A, the output element at first calculation site b is U, and the output element at first calculation site c is U. The first loss value is then determined from the tensor corresponding to each input token and the tensor corresponding to the output element at the aligned site (for example, tensors of order 0, 1, or higher, which may be determined by looking up a token-to-tensor mapping table).
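By way of illustration and not limitation, the Python sketch below turns the three example probabilities above (0.7, 0.6 and 0.4 for the true tokens at the three first calculation sites) into a single loss value using an average negative log-likelihood; the text does not fix the exact form of the loss.

```python
import math

def first_loss_from_probs(probs):
    """Average negative log-likelihood over the first calculation sites."""
    return -sum(math.log(p) for p in probs) / len(probs)

print(first_loss_from_probs([0.7, 0.6, 0.4]))   # ≈ 0.595
```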
Turning now to fig. 3A, a schematic diagram of an exemplary training process for training tasks of a first type according to an embodiment of the present disclosure is shown. It should be noted that the exemplary training process shown in FIG. 3A is merely an exemplary illustration of a first type of training task and is not intended to be limiting of the first type of training task.
As shown in FIG. 3A, block 301 represents a protein language model, such as the not-yet-fully-trained protein language model in the method 200 described with reference to FIG. 2. The protein language model 301 receives the first preprocessed amino acid sequence 303a and outputs a first prediction result 304a. As an example, the first amino acid sequence 302a includes a total of 5 tokens X0 to X4, and the first preprocessing operation masks tokens X1 and X3. That is, the positions of tokens X1 and X3 in the first amino acid sequence 302a are the first sites 305a, and the number of first sites 305a is two. In the first prediction result 304a output by the protein language model 301, elements X1' and X3' are located at the positions aligned with the positions of tokens X1 and X3 in the first amino acid sequence 302a (i.e., the first sites 305a). That is, the positions of elements X1' and X3' in the first prediction result 304a are the first calculation sites 306a, and the number of first calculation sites 306a is also two. A first loss value can therefore be calculated from the tokens at the first sites 305a in the first amino acid sequence 302a (e.g., tokens X1 and X3) and the elements at the first calculation sites 306a in the first prediction result 304a (e.g., elements X1' and X3').
It will be appreciated that the length of the first amino acid sequence 302a and the number of first sites 305a (and thus first calculation sites 306 a) in the example described above with reference to fig. 3A are merely illustrative, and the present disclosure is not limited in any way. It will also be appreciated that the first sites selected may be located adjacent to each other and that the first sites may be located anywhere in the first amino acid sequence and are not limited to the middle of the sequence.
In an example, training samples (X, Y) may be generated for a first amino acid sequence following the first type of task described above, where X is the first preprocessed amino acid sequence and Y is the content that needs to be predicted (i.e., the true values, namely the masked amino acid IDs at the first sites). Given the first amino acid sequence PSSLALSVGQKVTMSCKSSQSI for the protein language model of the present disclosure (by way of example and not limitation, the protein language model depicted in FIG. 3A), a training sample may be constructed as, for example, (PSSL[MASK]LSVGQKV[MASK]MSCK[MASK]SQSI, ----A-------T----S----), where "[MASK]" represents a mask, the letters A, T and S in Y are the true values, and "-" indicates a non-first-site position, so that the locations of the first sites can be determined.
In step S202, the second type of training task may include at least one of two training subtasks.
In an embodiment, the first training subtask may include: step S2022a, obtaining a second amino acid sequence; step S2022b, performing a second preprocessing operation on the second amino acid sequence to obtain a second preprocessed amino acid sequence; step S2022c, inputting the second preprocessed amino acid sequence into the protein language model to obtain a second prediction result; and step S2022d, calculating a second loss value based on the second amino acid sequence and the second prediction result.
The second amino acid sequence may also be selected from the truncated amino acid sequences described previously.
In an embodiment, the second preprocessing operation may include selecting one or more second sites in the second amino acid sequence and masking the tokens at the second sites. Illustratively, masking includes replacing the amino acid at each second site with a second mask placeholder. The second mask placeholder may be the same as or different from the first mask placeholder.
In an embodiment, step S2022d comprises: calculating the second loss value from the tokens at the second sites in the second amino acid sequence and the elements at the second calculation sites in the second prediction result.
Steps of the first training subtask included in the second type of training task that are similar to those of the first type of training task may be referred to in the description above, and will not be repeated here.
Illustratively, the second calculation sites are selected from the sites after the sites aligned with the second sites.
In an embodiment, the second calculation sites may include the (L1+1)-th through (L1+N_P2+1)-th positions, where L1 is the length of the second preprocessed amino acid sequence and N_P2 is the number of second sites. It follows that the length of the second prediction result is greater than the length of the second preprocessed amino acid sequence.
The output tokens at the (L1+1)-th through (L1+N_P2)-th positions of the second prediction result are expected to be the tokens at the masked second sites, and the output token at the (L1+N_P2+1)-th position is expected to be the prediction terminator. The output tokens at the (L1+1)-th through (L1+N_P2)-th positions need not correspond to the 1st through N_P2-th second sites in order; they correspond according to a preset correspondence. For example, if the number of second sites is 3, the output tokens at the (L1+1)-th through (L1+3)-th positions may be expected to be the tokens at second site a, second site c, and second site b, respectively (rather than necessarily second site a, second site b, second site c), and the output token at the (L1+4)-th position is expected to be the prediction terminator.
In an example, the element at a second calculation site characterizes the probability that the output token at that second calculation site is the corresponding input token at the second site. For example, suppose there are 2 second sites in the second amino acid sequence, where the input token at second site a is amino acid ID A and the input token at second site b is amino acid ID L; the output element at the (L1+1)-th second calculation site (corresponding to second site b) is the probability 0.7 that this site is amino acid ID L, the output element at the (L1+2)-th second calculation site (corresponding to second site a) is the probability 0.6 that this site is amino acid ID A, and the output element at the (L1+3)-th calculation site is the probability 0.4 that this site is the prediction terminator. The second loss value is determined from these three probabilities.
In an example, the element at a second calculation site characterizes the output token at that second calculation site. For example, suppose there are 2 second sites in the second amino acid sequence, where the input token at second site a is amino acid ID A and the input token at second site b is amino acid ID L; the output token at the (L1+1)-th second calculation site (corresponding to second site b) is L, the output token at the (L1+2)-th second calculation site (corresponding to second site a) is U, and the output token at the (L1+3)-th second calculation site is the prediction terminator. The second loss value is then determined from the tensor corresponding to each input token (L, A, and the prediction terminator, respectively) and the tensor corresponding to the output token at the corresponding position (L, U, and the prediction terminator, respectively), where each tensor (for example, of order 0, 1, or higher) may be determined by looking up a token-to-tensor mapping table.
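By way of illustration and not limitation, the Python sketch below constructs the data for the first training subtask: the chosen second sites are masked in place, and the expected output suffix is built from the true tokens of the masked sites followed by a prediction terminator. Left-to-right order is used here as one possible instance of the preset correspondence; the mask and terminator strings are example values.

```python
def second_preprocessing(sequence, second_sites, mask_token="[MASK]", eop="<eop>"):
    """Return the masked input and the target suffix expected at positions
    L1+1 .. L1+N+1 of the second prediction result."""
    tokens = list(sequence)
    suffix_targets = [tokens[i] for i in sorted(second_sites)]  # one preset correspondence
    for i in second_sites:
        tokens[i] = mask_token
    return tokens, suffix_targets + [eop]

# Example: second_preprocessing(["X0", "X1", "X2", "X3", "X4"], second_sites=[1, 2])
# -> (["X0", "[MASK]", "[MASK]", "X3", "X4"], ["X1", "X2", "<eop>"])
```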
Turning now to FIG. 3B, a diagram illustrating an exemplary training process for a first training sub-task of a second type of training task in accordance with an exemplary embodiment of the present disclosure is shown. It should be noted that the exemplary training process shown in FIG. 3B is merely illustrative of a first training sub-task of the second type of training task and is not intended to be limiting of the first training sub-task of the second type of training task.
As shown in FIG. 3B, block 301 represents a protein language model, such as the not-yet-fully-trained protein language model in the method 200 described with reference to FIG. 2. The protein language model 301 receives the second preprocessed amino acid sequence 303b and outputs a second prediction result 304b. As an example, the second amino acid sequence 302b includes a total of 5 tokens X0 to X4, and the second preprocessing operation masks tokens X1 and X2. That is, the positions of tokens X1 and X2 in the second amino acid sequence 302b are the second sites 305b, and the number of second sites 305b is two. By way of example and not limitation, a single mask M may be used to cover two or more mutually adjacent positions included in the second sites 305b. In the second prediction result 304b output by the protein language model 301, the corresponding token X0, mask M, and tokens X3 and X4 may be output at the positions aligned with their positions in the second preprocessed amino acid sequence, although the tokens output at these positions are irrelevant to the calculation of the second loss value. That is, the elements of the second prediction result 304b at the positions aligned with the positions of token X0, mask M, and tokens X3 and X4 (i.e., the 1st through L1-th positions) do not participate in the calculation of the second loss value. In the second prediction result 304b, the positions of elements X1' and X2', together with the position immediately after X2', are the second calculation sites 306b. A second loss value can therefore be calculated based at least in part on the tokens at the second sites 305b in the second amino acid sequence 302b (e.g., tokens X1 and X2) and the elements at the second calculation sites 306b in the second prediction result 304b (e.g., elements X1' and X2'). As described above, the second loss value is also calculated from the output token at the last of the second calculation sites 306b and the prediction terminator.
It will be appreciated that the length of the second amino acid sequence 302B and the number of second sites 305B (and thus second calculation sites 306B) in the example described above with reference to fig. 3B are merely illustrative, and the present disclosure is not limited in any way. It will also be appreciated that the selected second sites may be non-contiguous with each other in position and that the second sites may be anywhere in the second amino acid sequence, not limited to the middle of the sequence.
In an embodiment, the second training subtask may include: step S2023a, obtaining a third amino acid sequence; step S2023b, performing a third preprocessing operation on the third amino acid sequence to obtain a third preprocessed amino acid sequence; step S2023c, inputting the third preprocessed amino acid sequence into the protein language model to obtain a third prediction result; and step S2023d, calculating a third loss value based on the third amino acid sequence and the third prediction result.
In an embodiment, the third preprocessing operation may include selecting one or more third sites in the third amino acid sequence that are located at the end of the sequence and adjacent to each other, and deleting the tokens at the third sites.
In an embodiment, step S2023d comprises: calculating the third loss value from the tokens at the third sites in the third amino acid sequence and the elements at the third calculation sites in the third prediction result.
Steps of the second training subtask included in the second type of training task that are similar to those of the first type of training task may be referred to in the description above, and will not be repeated here.
Illustratively, the third calculation sites are selected from the sites after the sites aligned with the third sites.
In an embodiment, the third calculation sites may include the (L2+1)-th through (L2+N_P3+1)-th positions, where L2 is the length of the third preprocessed amino acid sequence and N_P3 is the number of third sites. It follows that the length of the third prediction result is greater than the length of the third preprocessed amino acid sequence.
The output tokens at the (L2+1)-th through (L2+N_P3)-th positions of the third prediction result are expected to be the tokens at the deleted third sites, and the output token at the (L2+N_P3+1)-th position is expected to be the prediction terminator. The output tokens at the (L2+1)-th through (L2+N_P3)-th positions need not correspond to the 1st through N_P3-th third sites in order; they correspond according to a preset correspondence. For example, if the number of third sites is 3, the output tokens at the (L2+1)-th through (L2+3)-th positions may be expected to be the tokens at third site a, third site c, and third site b, respectively (rather than necessarily third site a, third site b, third site c), and the output token at the (L2+4)-th position is expected to be the prediction terminator.
The calculation of the third loss value can refer to the description of the second loss value, and will not be repeated.
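By way of illustration and not limitation, the corresponding Python sketch for the second training subtask is given below: the last few adjacent tokens (the third sites) are deleted from the input, and they, followed by a prediction terminator, form the output expected at positions L2+1 .. L2+n+1. The terminator string is an example value.

```python
def third_preprocessing(sequence, n_tail, eop="<eop>"):
    """Delete the last `n_tail` adjacent tokens and keep them as the target suffix."""
    tokens = list(sequence)
    kept, deleted = tokens[:-n_tail], tokens[-n_tail:]
    return kept, deleted + [eop]

# Example: third_preprocessing(["X0", "X1", "X2", "X3", "X4"], n_tail=2)
# -> (["X0", "X1", "X2"], ["X3", "X4", "<eop>"])
```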
Turning now to FIG. 3C, a diagram illustrating an exemplary training process for the second training subtask of the second type of training task according to an exemplary embodiment of the present disclosure is shown. It should be noted that the exemplary training process shown in FIG. 3C is merely illustrative of the second training subtask of the second type of training task and is not intended to limit it.
As an example, the third amino acid sequence 302c includes a total of 5 tokens X0 to X4, and the third preprocessing operation deletes tokens X3 and X4, resulting in a third preprocessed amino acid sequence 303c (including X0X 1X 2). Fig. 3C differs from fig. 3B in that the third pre-processed amino acid sequence 303C does not include any mask M. Similar to the first training subtask, in the third predicted result 304c output by the protein language model 301, corresponding tokens X0 to X2 may be output at positions (i.e., 1 st to L2 nd positions) corresponding to the positions at which the tokens X0 to X2 are located in the third amino acid sequence 302c, respectively, although the tokens output at these positions are irrelevant for calculating the third loss value. That is, elements of the third predicted result 304c at positions corresponding to positions of the tokens X0 to X2 in the third amino acid sequence 302c do not participate in the calculation of the third loss value. In the third predicted result 304c, the elements X3 'and X4' are located at positions belonging to the positions included in the third calculation position 306 c. Thus, a third loss value may be calculated based at least in part on the tokens (e.g., tokens X3 and X4) at third position 305c in third amino acid sequence 302c and the elements (e.g., tokens X3 'and X4') at third calculation position 306c in third prediction result 304 c. As described previously, a third penalty value is also calculated based on the output lemma and predicted terminator for the last position at third calculation position 306 c.
It will be appreciated that the length of the third amino acid sequence 302c and the number of third sites 305c (and thus of third calculation positions 306c) in the example described above with reference to FIG. 3C are merely illustrative, and the present disclosure is not limited thereto.
In step S203, parameters of the protein language model may be adjusted based on the first loss value and the loss value corresponding to the training task of the second type, to obtain a trained protein language model.
The second type of training task may include at least one of the first training subtask and the second training subtask, and the loss value corresponding to the second type of training task accordingly includes at least one of the second loss value and the third loss value.
In an example, the parameters of the protein language model may include parameters related to a training process of the protein language model (such as parameters related to convergence of an objective function of the model), and the like, which is not subject to any limitation by the present disclosure.
It will be appreciated that the first amino acid sequence, the second amino acid sequence and the third amino acid sequence may be the same or different. The first type of training task and at least one training subtask of the second type may be executed for the same amino acid sequence, or only one training task (for example, one of the first type of training task, the first training subtask and the second training subtask) may be executed for that amino acid sequence. Multiple groups of loss values corresponding to the training tasks undergone by the amino acid sequence are thereby obtained, and the parameters of the protein language model are adjusted based on these groups of loss values to obtain the trained protein language model.
In step S202, during training, the first type of training task and at least one of the two training subtasks included in the second type of training task may be executed in parallel; alternatively, they may be executed sequentially. The present disclosure does not limit the execution order of the training tasks and training subtasks.
For example, step S220 may include:
in a first training phase, executing the first type of training task; in a second training phase, executing the second type of training task. For example, in the first training phase (epochs 1 to N), the first type of training task is performed; in the second training phase (epochs N+1 to M), the first training subtask and/or the second training subtask is performed. It is understood that the learning rate may differ between training phases.
For example, step S220 may include:
in a first training phase, executing the first type of training task; in a second training phase, executing both the first type and the second type of training task; in a third training phase, executing the second type of training task. For example, in the first training phase (epochs 1 to N), the first type of training task is performed; in the second training phase (epochs N+1 to M), the first type of training task and at least one of the first and second training subtasks are performed; in the third training phase (from epoch M+1 onward), at least one of the first and second training subtasks is performed. It is understood that the learning rate may differ between training phases. For example, the learning rate in the second training phase may be lower than in the first and third training phases, so that training can switch smoothly from the first type of task to the second type. A minimal sketch of such a schedule is given below.
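The following sketch illustrates one way such a three-phase schedule could be organized. The phase boundaries N and M, the concrete learning rates, and the random task mixing are illustrative assumptions, not values or rules fixed by the disclosure.

```python
# Illustrative three-phase schedule: masked task first, mixed tasks in a
# low-learning-rate transition phase, autoregressive subtasks last.
import random

def select_task(epoch: int, N: int, M: int) -> str:
    """Pick a training task for this epoch according to the three-phase plan."""
    if epoch <= N:                       # first phase: first-type (masked) task only
        return "first_type"
    if epoch <= M:                       # second phase: mix both task types
        return random.choice(["first_type", "first_subtask", "second_subtask"])
    return random.choice(["first_subtask", "second_subtask"])   # third phase

def learning_rate(epoch: int, N: int, M: int) -> float:
    """Lower learning rate in the transition phase to smooth the task switch."""
    return 1e-5 if N < epoch <= M else 1e-4

if __name__ == "__main__":
    N, M, total_epochs = 3, 6, 9
    for epoch in range(1, total_epochs + 1):
        print(f"epoch {epoch}: task={select_task(epoch, N, M)}, "
              f"lr={learning_rate(epoch, N, M)}")
```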
Illustratively, step S220 includes: first executing the first type of training task and at least one of the two training subtasks included in the second type of training task on a first data set; then executing at least one of the first type of training task, the first training subtask and the second training subtask on a second data set. For example, training on a general protein data set before fine-tuning on an antibody data set.
According to an embodiment of the present disclosure, in the above method 200, step 2022c may include: inputting the second preprocessed amino acid sequence into the protein language model, and inputting the masked amino acids into the protein language model as the tokens at the (L1+2)-th to (L1+N_P2+1)-th positions, where N_P2 is the number of second sites; and/or step 2023c may include: inputting the third preprocessed amino acid sequence into the protein language model, and inputting the deleted amino acids into the protein language model as the tokens at the (L2+2)-th to (L2+N_P3+1)-th positions.
Illustratively, inputting the second preprocessed amino acid sequence into the protein language model may mean using the second preprocessed amino acid sequence as the input tokens at the 1st to L1-th positions; inputting the third preprocessed amino acid sequence into the protein language model may mean using the third preprocessed amino acid sequence as the input tokens at the 1st to L2-th positions.
Illustratively, step 2022c may further include: inputting a second placeholder into the model as the token at the (L1+1)-th position; and/or inputting the second placeholder into the model as the token at the (L2+1)-th position. The second placeholder may be a prediction start symbol, prompting the model that prediction begins and that the expected tokens should be output from the position of the second placeholder onward. The second placeholder may include the prediction start symbol S shown in FIG. 3B and FIG. 3C.
It is understood that in the second type of training task, the protein language model is trained as an autoregressive model. In an autoregressive model, when a token is input at the k-th position, the model's output at the k-th position is its prediction of the token at the (k+1)-th position; that output is then used as the input token at the (k+1)-th position, and the model's output at the (k+1)-th position is its prediction of the token at the (k+2)-th position. In the training stage, however, the true token is used as the input at the (k+1)-th position instead of the model's own output at the k-th position, so that the model can learn to predict later tokens accurately without prediction errors in the first few positions causing excessive deviation at later positions. Specifically, in the first training subtask, the masked amino acids (i.e., the true tokens corresponding to the expected outputs at the (L1+1)-th to (L1+N_P2)-th positions) are input into the protein language model as the tokens at the (L1+2)-th to (L1+N_P2+1)-th positions; in the second training subtask, the deleted amino acids (i.e., the true tokens corresponding to the expected outputs at the (L2+1)-th to (L2+N_P3)-th positions) are input into the protein language model as the tokens at the (L2+2)-th to (L2+N_P3+1)-th positions. A sketch of this teacher-forcing arrangement is given below.
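The following sketch shows how the inputs, targets, and scored positions could be laid out for the first training subtask under this teacher-forcing scheme. The token IDs (mask placeholder 20, prediction start symbol 21, predicted terminator 22) and the build_autoregressive_io helper are illustrative assumptions; only the position arithmetic follows the description above.

```python
# Teacher forcing for the first training subtask: the true masked tokens are
# fed back as inputs at positions L1+2 .. L1+N+1, and the scored targets at
# positions L1+1 .. L1+N+1 are those tokens followed by the predicted terminator.
S, EOS_PRED = 21, 22            # assumed IDs for the prediction start / terminator marks

def build_autoregressive_io(preprocessed: list, masked_truth: list):
    """Return (input_tokens, target_tokens, loss_positions) for one sequence."""
    L1 = len(preprocessed)
    inputs = preprocessed + [S] + masked_truth           # positions 1 .. L1+N+1
    targets = [None] * L1 + masked_truth + [EOS_PRED]    # only tail positions scored
    loss_positions = list(range(L1, L1 + len(masked_truth) + 1))  # 0-based indices
    return inputs, targets, loss_positions

if __name__ == "__main__":
    # second preprocessed sequence X0 M X3 X4; X1 and X2 were masked out
    preprocessed = [0, 20, 3, 4]     # 20 = assumed mask placeholder ID
    masked_truth = [1, 2]            # true tokens X1, X2
    print(build_autoregressive_io(preprocessed, masked_truth))
```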
It will be appreciated that in an autoregressive model, if no token is input at the k-th position, the model outputs no token at the k-th position. In some embodiments, during the training stage, one of the following two modes may be adopted so that the model's output stops in time rather than continuing indefinitely:
mode 1: in the first training subtask, the masked amino acids are taken as L1+2 to L1+N P2 The +1 position of the character is input into the protein language model, and the character is input into the protein language model at the L1+N P2 +2 positions no longer input a token. And/or in the second training subtask, the deleted amino acids are taken as L2+2th to L2+N P3 The +1 position of the character is input into the protein language model, and the character is input into the protein language model at the L1+N P2 +2 positions no longer input a token.
Mode 2: in the first and second training subtasks, if the output token at a position is not the predicted terminator, that output token is used as the input token for the next position; otherwise, no token is input at the next position.
It should be noted that, although FIG. 3B shows the tokens at the second calculation positions 306b as X1' and X2', representing the (new) amino acid IDs output by the protein language model for the masked amino acid IDs at the second sites 305b in the second amino acid sequence 302b, several special cases are possible given the autoregressive mechanism described above: (a) the protein language model may output more tokens representing (new) amino acid IDs than the number of second sites (e.g., X3', X4', etc. in addition to X1' and X2'); (b) the protein language model may output fewer tokens representing (new) amino acid IDs than the number of second sites (e.g., only X1'); or (c) the protein language model may output zero tokens representing (new) amino acid IDs (e.g., it may directly output the predicted terminator).
For special case (a), the (L1+1)-th to (L1+N_P2+1)-th positions of the second prediction result output by the protein language model can be taken as the second calculation positions, and the elements output from the (L1+N_P2+2)-th position onward are not used in calculating the loss value.
For special cases (b) and (c), suppose the number of second sites in the second amino acid sequence is N and, in the second prediction result, the protein language model outputs only M tokens representing (new) amino acid IDs for the masked amino acid IDs at the second sites, with M smaller than N. The elements at the second calculation positions other than those M tokens (i.e., at the N−M positions for which no token representing a (new) amino acid ID was output) may each be set to a given value so that the second loss value can be calculated.
Likewise, although FIG. 3C shows the tokens at the third calculation positions 306c as X3' and X4', representing the (related) amino acid IDs output by the protein language model for the amino acid IDs deleted at the third sites 305c in the third amino acid sequence 302c, several special cases are possible given the autoregressive mechanism described above: (a) the protein language model may output more tokens representing (related) amino acid IDs than the number of third sites (e.g., X5', X6', etc. in addition to X3' and X4'); (b) the protein language model may output fewer tokens representing (related) amino acid IDs than the number of third sites (e.g., only X3'); or (c) the protein language model may output zero tokens representing (related) amino acid IDs (e.g., it may directly output the predicted terminator).
For special case (a), the (L2+1)-th to (L2+N_P3+1)-th positions of the third prediction result output by the protein language model can be taken as the third calculation positions, and the elements output from the (L2+N_P3+2)-th position onward are not used in calculating the loss value.
For special cases (b) and (c), suppose the number of third sites in the third amino acid sequence is N and, in the third prediction result, the protein language model outputs only M tokens representing (related) amino acid IDs for the amino acid IDs deleted at the third sites, with M smaller than N. The elements at the third calculation positions other than those M tokens (i.e., at the N−M positions for which no token representing a (related) amino acid ID was output) may each be set to a given value so that the third loss value can be calculated. A sketch of this padding scheme is given below.
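The following sketch illustrates special cases (b) and (c): emitted tokens are padded (or truncated) to exactly the number of masked/deleted sites, and a padded calculation position contributes a fixed worst-case term to the loss. The padding ID of -1 and the simple negative-log-likelihood are assumptions for illustration; the disclosure only requires that the uncovered positions be set to "a given value".

```python
# Padding the model's emitted amino-acid tokens before computing the loss.
import math

PAD_ID = -1   # assumed "given value" for calculation positions with no emitted token

def pad_emitted(emitted: list, num_sites: int) -> list:
    """Truncate or pad the emitted amino-acid token IDs to exactly num_sites entries."""
    return (emitted + [PAD_ID] * num_sites)[:num_sites]

def site_loss(prob_of_truth=None) -> float:
    """NLL at one calculation position; a padded position gets a fixed penalty."""
    return -math.log(prob_of_truth if prob_of_truth else 1e-9)

if __name__ == "__main__":
    emitted = [7]                       # model produced one token for three sites (case b)
    print(pad_emitted(emitted, 3))      # -> [7, -1, -1]
    print(round(site_loss(0.8), 4), round(site_loss(None), 4))
```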
According to an embodiment of the present disclosure, in the above method 200, masking the tokens at the first sites may include: for each first site, replacing the amino acid at that site with a first mask placeholder; and/or masking the tokens at the second sites may include: taking a second site that is not adjacent to other second sites, or a plurality of mutually adjacent second sites, as one group of second sites, and replacing the tokens of each group of second sites with one second mask placeholder.
The first mask placeholder and the second mask placeholder may be the same or different.
For example, when the sites of X0 and X1 in X0 X1 X2 X3 X4 are taken as first sites, the masked amino acid sequence is M M X2 X3 X4; when the sites of X0 and X1 in X0 X1 X2 X3 X4 are taken as second sites, the masked amino acid sequence is M X2 X3 X4. A sketch of the two masking operations follows.
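The following sketch contrasts the two masking operations described above: per-site masking preserves the sequence length, while span masking collapses each run of adjacent selected sites into a single placeholder and may shorten the sequence. The placeholder string "M" and the helper names are illustrative assumptions.

```python
# Per-site masking (first type of training task) vs. span masking
# (first subtask of the second type of training task).
def mask_first_type(tokens: list, sites: set) -> list:
    """Replace every selected site with its own mask placeholder (length preserved)."""
    return ["M" if i in sites else t for i, t in enumerate(tokens)]

def mask_second_type(tokens: list, sites: set) -> list:
    """Replace each run of adjacent selected sites with one placeholder (length may shrink)."""
    out, prev_masked = [], False
    for i, t in enumerate(tokens):
        if i in sites:
            if not prev_masked:          # one placeholder per run of adjacent sites
                out.append("M")
            prev_masked = True
        else:
            out.append(t)
            prev_masked = False
    return out

if __name__ == "__main__":
    seq = ["X0", "X1", "X2", "X3", "X4"]
    print(mask_first_type(seq, {0, 1}))    # ['M', 'M', 'X2', 'X3', 'X4']
    print(mask_second_type(seq, {0, 1}))   # ['M', 'X2', 'X3', 'X4']
```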
Through the first type of training task, it is desired that the model can extract amino acid sequence representations, which requires maintaining a correct understanding of each amino acid position; the masking operation therefore does not change the amino acid sequence length in the first type of training task. Through the first subtask of the second type of training task, it is desired that the model can generate different amino acid sequences from a given amino acid sequence, possibly even generating two amino acids from a single amino acid; the masking operation may therefore change the amino acid sequence length, so a certain flexibility needs to be retained.
According to an embodiment of the present disclosure, in the above method 200, the third amino acid sequence may be spliced from the light chain and the heavy chain of an antibody. Selecting, in the third amino acid sequence, one or more third sites that are located at the tail of the sequence and adjacent to each other includes: selecting, as the third sites, the positions corresponding to the chain located at the tail of the third amino acid sequence, the chain at the tail being one of the light chain and the heavy chain.
It will be appreciated that, by training the model with the second subtask of the second type of training task, the model is expected to be able to perform two kinds of inference tasks. Task 1: predicting the second half of an amino acid sequence from its first half. Task 2: predicting, from one amino acid sequence, another amino acid sequence that can pair with it. A specific example of task 2 is predicting one chain of an antibody from the other chain with which it pairs, e.g., predicting VH from VL or predicting VL from VH. When predicting VH from VL, VL and VH are spliced into one amino acid sequence and the positions of VH are selected as the third sites; when predicting VL from VH, VH and VL are spliced into one amino acid sequence and the positions of VL are selected as the third sites. For example, if VH is X1 X2 X3 X4 EOS and VL is X5 X6 X7 EOS, then when predicting VL from VH the spliced amino acid sequence is X1 X2 X3 X4 EOS X5 X6 X7 EOS, and the deleted sites are those of X5 X6 X7 EOS. A sketch of this splicing is given below.
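The following sketch shows how a pairing example could be assembled: the known chain and the target chain are concatenated, the tail chain's positions become the third sites, and the model input keeps only the known chain. The token strings and the EOS marker are illustrative assumptions.

```python
# Building a heavy/light-chain pairing example for the second training subtask.
def build_pairing_example(known_chain: list, target_chain: list):
    """Concatenate known_chain + target_chain; the target chain forms the third sites."""
    spliced = known_chain + target_chain                  # full third amino acid sequence
    third_sites = list(range(len(known_chain), len(spliced)))
    preprocessed = list(known_chain)                      # tokens at third sites deleted
    return spliced, third_sites, preprocessed

if __name__ == "__main__":
    vh = ["X1", "X2", "X3", "X4", "EOS"]
    vl = ["X5", "X6", "X7", "EOS"]
    # Predict VL from VH: VH comes first, the tail chain VL is deleted from the input.
    print(build_pairing_example(vh, vl))
```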
According to an embodiment of the present disclosure, the protein language model in any of the above embodiments may include a representation extraction layer, a transformer block and a classification layer.
The representation extraction layer may be configured to determine, from a token, the representation corresponding to that token. The representation corresponding to a token may be derived from a token representation (determined from the token content alone) and a position representation (determined from the token position), e.g., by summing the two. The position representation may characterize relative position and/or absolute position. Both the token representation and the position representation can be obtained by looking up mapping tables; details can be found in the description of GLM and are not repeated here. Unlike GLM, in the embodiments of the present disclosure the token-representation mapping table records the mappings from the various amino acid IDs and various marks to token representations.
The transformer block may be a transformer block in the form of, e.g., the GLM architecture, which unifies the masked language model and the autoregressive model by means of different attention mask matrices. It is precisely through this special attention mask matrix and mixed training on the two types of tasks that the disclosed embodiments allow understanding tasks and generation tasks to be handled uniformly. A sketch of such an attention mask is given below.
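The following sketch builds one plausible attention mask of this kind: the context part (the preprocessed sequence) attends bidirectionally to itself, while the generation part attends to the context and causally to earlier generation positions. The concrete layout is an assumption based on the GLM architecture, not a verbatim specification from the disclosure.

```python
# GLM-style attention mask: bidirectional over the context, causal over generation.
import numpy as np

def glm_attention_mask(context_len: int, gen_len: int) -> np.ndarray:
    n = context_len + gen_len
    mask = np.zeros((n, n), dtype=bool)      # mask[i, j] == True: position j visible to i
    mask[:, :context_len] = True             # every position sees the whole context
    for i in range(context_len, n):          # generation positions are causal
        mask[i, context_len:i + 1] = True
    return mask                              # context rows never see generation columns

if __name__ == "__main__":
    print(glm_attention_mask(3, 2).astype(int))
```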
FIG. 4 illustrates a method 400 for extracting an amino acid sequence representation using a protein language model according to an embodiment of the present disclosure. The protein language model may be a trained protein language model obtained by any of the method embodiments described above. According to embodiments of the present disclosure, the trained protein language model may include a representation extraction layer, a transformer block and a classification layer. As shown in FIG. 4, the method 400 includes: step S401, obtaining a first target amino acid sequence; step S402, inputting the first target amino acid sequence into the representation extraction layer and the transformer block of the trained protein language model, and taking the output of the transformer block as the representation of the first target amino acid sequence.
It is noted that, on the one hand, the transformer block may encode the representations corresponding to the individual tokens extracted by the representation extraction layer of the model (e.g., token representations, position representations or segment representations) together with the representations of other tokens (e.g., adjacent tokens), thereby encoding useful context information into the representation of each token. On the other hand, the output of the transformer block (e.g., a sequence of representation vectors, one per token in the first target amino acid sequence) may be matrix-multiplied with the classification matrix of the classification layer (i.e., the representation vector of each token output by the transformer block is taken as an inner product with each row or column vector of the classification matrix, depending on the arrangement), thereby obtaining the classification of each token in the first target amino acid sequence output by the classification layer (e.g., which amino acid ID the token is, or which kind of mark it represents). Thus the output of the transformer block of the protein language model can be taken as the representation of the first target amino acid sequence; such a sequence representation fuses the contextual relationships of the tokens in the sequence, which benefits downstream modules. The representation of the first target amino acid sequence may be a sequence of representation vectors, the representation vectors corresponding one-to-one with the tokens in the first target amino acid sequence.
Illustratively, the representation of the first target amino acid sequence may be used for a variety of downstream tasks, such as protein structure prediction or prediction of physicochemical indices of proteins. One way to use the representation for a downstream task is to input it into an MLP (multi-layer perceptron) corresponding to that task to obtain the downstream result, where the MLP may be pre-trained. The trained protein language model may also be further fine-tuned on specific downstream tasks. A sketch of such a downstream head is given below.
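The following sketch shows how a per-token representation might be fed into a small downstream head. The random representation stands in for the transformer block output, and the mean pooling, dimensions and single-hidden-layer MLP are assumptions for illustration only.

```python
# Feeding a sequence representation into an assumed downstream MLP head.
import numpy as np

rng = np.random.default_rng(0)

def mlp_head(x, w1, b1, w2, b2):
    h = np.maximum(x @ w1 + b1, 0.0)       # one hidden layer with ReLU
    return h @ w2 + b2

if __name__ == "__main__":
    seq_len, dim, hidden = 12, 64, 32
    representation = rng.normal(size=(seq_len, dim))   # stand-in for the model output
    pooled = representation.mean(axis=0)               # simple mean pooling
    w1, b1 = rng.normal(size=(dim, hidden)), np.zeros(hidden)
    w2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)
    print(mlp_head(pooled, w1, b1, w2, b2))            # predicted downstream value
```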
FIG. 5 illustrates a method 500 for generating new amino acid sequences using a protein language model in accordance with an embodiment of the present disclosure. The protein language model may be a trained protein language model trained using any of the method embodiments described above. Method 500 can be used, for example, to generate mutant sequences of known amino acid sequences at specified sites, and can also be used to generate amino acid sequences having CDR regions that differ from known sequences.
As shown in FIG. 5, the method 500 includes: step S501, obtaining a second target amino acid sequence (the goal of the method 500 is to replace some of the amino acids at selected sites of the second target amino acid sequence so as to generate a new amino acid sequence); step S502, performing the second preprocessing operation on the second target amino acid sequence to obtain a second target preprocessed amino acid sequence, the sites to be replaced being taken as the second sites, and the length of the second target preprocessed amino acid sequence being L3; step S503, inputting the second target preprocessed amino acid sequence into the trained protein language model, and inputting the token output by the trained protein language model at the k-th position into the model as the input token of the (k+1)-th position, where k is greater than L3 (it is understood that in step S503 the prediction start symbol may also be used as the input token at the (L3+1)-th position); step S504, splicing the second target preprocessed amino acid sequence with the tokens output by the trained protein language model from the (L3+1)-th to the (L3+n)-th positions to obtain the new amino acid sequence, where n is the total number of new amino acids corresponding to the amino acids at the second sites; or splicing the second target preprocessed amino acid sequence with the tokens output from the (L3+1)-th position up to the position immediately before the position at which the model outputs the terminator, to obtain the new amino acid sequence.
For example, a mutated amino acid sequence may be generated from an amino acid sequence of length 6 based on predetermined mutation positions and numbers of mutations. Suppose the amino acids at positions 2 and 5 are to be mutated, the amino acid at position 2 into 1 new amino acid and the amino acid at position 5 into 2 new amino acids, so that n = 3. The amino acid sequence of length 6 is X1 X2 X3 X4 X5 EOS, and the corresponding preprocessed amino acid sequence is X1 M1 X3 X4 M2 EOS, with L3 = 6. In step S503, the 6 tokens of the preprocessed amino acid sequence are used as the inputs at positions 1 to 6, the prediction start symbol S is used as the input at position 7, the model's output token at position 7 is used as the input token at position 8, and the model's output token at position 8 is used as the input token at position 9. In step S504, M1 in the preprocessed amino acid sequence is replaced with the model's output token at position 7, and M2 is replaced with the model's output tokens at positions 8 and 9, giving the mutation result. A sketch of this generation loop is given below.
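The following sketch reproduces the control flow of this example with a toy next-token function standing in for the trained protein language model. The predict_next callable, the placeholder-to-count mapping, and the token strings are assumptions; only the loop structure (feed the preprocessed tokens, feed each output back as the next input, stop after n new tokens or at the terminator) follows the description above.

```python
# Generation loop of the method 500 with a stand-in next-token function.
S, EOS_PRED = "S", "EOS_PRED"

def generate_new_tokens(preprocessed: list, n: int, predict_next) -> list:
    inputs = list(preprocessed) + [S]       # positions 1..L3, then the start symbol
    new_tokens = []
    while len(new_tokens) < n:
        token = predict_next(inputs)        # model output at the current tail position
        if token == EOS_PRED:
            break
        new_tokens.append(token)
        inputs.append(token)                # autoregressive feedback at inference time
    return new_tokens

def splice(preprocessed: list, new_tokens: list, counts: dict) -> list:
    """Replace each mask placeholder with its number of generated tokens."""
    it = iter(new_tokens)
    out = []
    for t in preprocessed:
        if t in counts:
            out.extend(next(it) for _ in range(counts[t]))
        else:
            out.append(t)
    return out

if __name__ == "__main__":
    pre = ["X1", "M1", "X3", "X4", "M2", "EOS"]
    fake_outputs = iter(["A", "C", "D", EOS_PRED])       # stand-in model outputs
    new = generate_new_tokens(pre, 3, lambda _inputs: next(fake_outputs))
    print(splice(pre, new, {"M1": 1, "M2": 2}))          # X1 A X3 X4 C D EOS
```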
In the inference stage of the model, the autoregressive behaviour is that the token output at the k-th position is used as the input token at the (k+1)-th position. Illustratively, k may also be kept smaller than Lmax in order to avoid unnecessary output. Lmax is, for example, L3+n, or the smaller of L3+n and the position at which the trained protein language model outputs the predicted terminator. It can be understood that when the model's last output position is L3+n, the number of predicted tokens has reached the required number, so no token needs to be input at the next position to keep the model generating; and when the model outputs the terminator, prediction stops and likewise no further token needs to be input. In both situations the model stops generating tokens, which saves computing resources.
According to embodiments of the present disclosure, the trained protein language model may include a representation extraction layer, a transformer block and a classification layer. As described above, the model can give the probability of each candidate token at a position that needs to be predicted. Further, for at least one of the positions used for splicing (the (L3+1)-th to (L3+n)-th positions, or the (L3+1)-th position up to the position immediately before the predicted terminator), a candidate token satisfying a preset condition may be selected as the output token of that position; the candidate tokens include, for example, the 20 amino acids and the various marks. The preset condition may be that the corresponding probability satisfies at least one of a preset probability ranking condition, a probability threshold condition and a candidate token content condition, and the selection may be made randomly among the candidate tokens satisfying the preset condition. For example, the token may be chosen from the top-ranked candidates whose cumulative probability reaches 90%, chosen randomly among tokens other than the predicted terminator, or taken as the candidate ranked second in probability, and so on. The probabilities are predicted by the classification layer. In this way, the desired type of new amino acid can be selected flexibly from the tokens predicted by the model, so that an amino acid sequence different from the second target amino acid sequence is easily obtained, and multiple amino acid sequences different from the second target amino acid sequence can be obtained when different preset conditions are used.
It can be understood that the classification layer includes a classification matrix (a linear layer) and a softmax function: after the output of the transformer block is matrix-multiplied with the classification matrix, the softmax function converts the result into the probability of each candidate token at each position. Normally, the candidate token with the highest probability at a position may be taken as the output of that position. When generating a new amino acid sequence, however, the most probable amino acid type is very likely to coincide with the original amino acid type. Selecting the output token from the candidate tokens that satisfy the preset condition therefore makes it highly likely that an amino acid different from the original one is generated, and different preset conditions and selection strategies yield different new amino acids. A sketch of such constrained sampling is given below.
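The following sketch shows one possible realization of the constrained selection: keep the top candidates whose cumulative probability reaches a threshold, exclude the predicted terminator (and, optionally, the original amino acid), then sample. The probability values, the 90% threshold and the exclusion rules are illustrative assumptions.

```python
# Selecting an output token under preset conditions (top-p pool + exclusions).
import random

def constrained_sample(probs: dict, exclude: set, top_p: float = 0.9) -> str:
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cum = [], 0.0
    for token, p in ranked:
        if cum >= top_p:                 # stop once the cumulative mass is reached
            break
        cum += p
        if token not in exclude:
            pool.append(token)
    return random.choice(pool) if pool else ranked[0][0]

if __name__ == "__main__":
    probs = {"A": 0.45, "G": 0.25, "L": 0.2, "EOS_PRED": 0.1}
    # Exclude the terminator and the original amino acid "A" to force a mutation.
    print(constrained_sample(probs, exclude={"EOS_PRED", "A"}))   # 'G' or 'L'
```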
FIG. 6 illustrates a method 600 for obtaining a related amino acid sequence using a protein language model according to an embodiment of the present disclosure. The protein language model may be a trained protein language model obtained by any of the method embodiments described above. As shown in FIG. 6, the method 600 includes: step S601, inputting a third target preprocessed amino acid sequence of length L4 (L4 being greater than or equal to 0) into the trained protein language model, and inputting the token output by the model at the q-th position into the model as the input token of the (q+1)-th position, where q is greater than L4 (it is understood that in step S601 the prediction start symbol may also be used as the input token at the (L4+1)-th position); step S602, taking the tokens output by the trained protein language model from the (L4+1)-th position up to the position immediately before the position at which the model outputs the terminator as the related amino acid sequence; or taking the tokens output from the (L4+1)-th position to a preset position as the related amino acid sequence.
In an embodiment, the related amino acid sequence is an amino acid sequence continued from the third target amino acid sequence. In this case, the third target amino acid sequence may be an amino acid sequence containing amino acids, or it may be empty (in which case the amino acid sequence is predicted directly from scratch). Alternatively, the related amino acid sequence is an amino acid sequence that pairs with the third target amino acid sequence. The tokens from the (L4+1)-th position up to the position before the predicted terminator may be taken as the related amino acid sequence, or an amino acid sequence of the desired length (from the (L4+1)-th position to a preset position) may be cut out as the related amino acid sequence.
For example, if the task of obtaining the related amino acid sequence is to give the first half of an amino acid sequence and have the model predict the second half (i.e., continue the amino acid sequence), the first half can be taken as the third target amino acid sequence; if the task is to give the heavy chain of an antibody and have the model predict the light chain paired with it, the heavy chain can be taken as the third target amino acid sequence.
For example, the third target amino acid sequence is X1 X2 X3 X4 X5. In step S601, X1 X2 X3 X4 X5 are used as the input tokens of the model at positions 1 to 5, EOS is used as the input token at position 6, the model's output token at position 6 is used as the input token at position 7, the model's output token at position 7 is used as the input token at position 8, and so on, until the model's output token at position 10 is the predicted terminator. The tokens output at positions 6 to 9 are taken as the related amino acid sequence.
Illustratively, the trained protein language model includes a representation extraction layer, a transformer block and a classification layer, and for at least one position from the (L4+1)-th position up to the position immediately before the position at which the model outputs the terminator, a candidate token satisfying a preset condition is selected as the output token of that position. The preset condition may be that the corresponding probability satisfies at least one of a preset probability ranking condition, a probability threshold condition and a candidate token content condition; the candidate tokens are selected from the different types of amino acids and the marks; the probabilities are predicted by the classification layer. Reference is made to the foregoing description, which is not repeated here.
The confusion (perplexity) of a designed protein is of general concern in the field of protein design. The confusion reflects the producibility of the protein: amino acid sequences of proteins that truly exist in nature tend to have low confusion and high producibility, whereas amino acid sequences of proteins that can hardly exist in nature tend to have high confusion and low producibility. Determining producibility by wet-lab experiments is costly, so it is desirable to predict the confusion of an amino acid sequence by means of a model.
According to embodiments of the present disclosure, the trained protein language model may include a representation extraction layer, a transformer block and a classification layer. Further, the method 600 may additionally include: step S603, for each position from the (L4+1)-th position up to the position immediately before the position at which the trained protein language model outputs the terminator, taking the probability corresponding to the token output at that position as the confusion factor of that position; step S604, obtaining the confusion of the related amino acid sequence from the confusion factors of the positions.
For example, if the probability of the token output at the (L4+1)-th position is 80%, the probability at the (L4+2)-th position is 70%, and the probability at the last position is 50%, the confusion of the related amino acid sequence can be obtained from these per-position probabilities. Deriving the confusion of the related amino acid sequence from the per-position confusion factors follows known practice, for example by combining (e.g., multiplying) the confusion factors of the positions. A sketch of one common formulation is given below.
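The following sketch turns per-position confusion factors (the probabilities assigned to the output tokens) into a single perplexity-style score. The exponentiated mean negative log-probability used here is one common formulation and is an assumption; the disclosure itself only requires that the factors be combined.

```python
# Perplexity-style confusion from per-position confusion factors.
import math

def sequence_confusion(factors: list) -> float:
    """exp of the mean negative log-probability; higher means more confusing."""
    return math.exp(-sum(math.log(p) for p in factors) / len(factors))

if __name__ == "__main__":
    print(round(sequence_confusion([0.8, 0.7, 0.5]), 4))
```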
It can be seen that, for an amino acid sequence generated by the method 600, the confusion can be obtained at the same time as the sequence is generated. For an amino acid sequence not generated by the method 600, its confusion can also be predicted by the protein language model. Likewise, for an amino acid sequence generated by the method 600, the confusion need not be obtained at generation time but may instead be predicted by the protein language model in an additional step.
In particular, where an amino acid sequence is obtained by method 500 or method 600, when the confusion obtained by method 500 or method 600 is above the confusion threshold, method 500 or method 600 may be repeatedly performed until a new amino acid sequence is obtained that meets the confusion requirement.
Fig. 7 illustrates a method 700 for predicting amino acid sequence confusion using a protein language model, according to an embodiment of the disclosure. According to embodiments of the present disclosure, the protein language model may be a trained protein language model trained using any of the method embodiments described above. In accordance with embodiments of the present disclosure, a trained protein language model may include a representation extraction layer, a transformer block, and a classification layer. As shown in fig. 7, the method 700 includes:
step S701, obtaining a fourth target amino acid sequence; the fourth target amino acid sequence is the amino acid sequence for which confusion is to be determined.
Step S702, sequentially taking each site in the fourth target amino acid sequence as the first site, and performing the first pretreatment operation to obtain a plurality of fourth target pretreated amino acid sequences; wherein each fourth target pretreatment amino acid sequence corresponds to a masked site;
for example, the fourth target amino acid sequence is X1 X2 X3 X4 X5 EOS, and the plurality of fourth target pretreatment amino acid sequences are:
M X2 X3 X4 X5 EOS;
X1 M X3 X4 X5 EOS;
X1 X2 M X4 X5 EOS;
X1 X2 X3 M X5 EOS;
X1 X2 X3 X4 M EOS。
the plurality of fourth target pretreatment amino acid sequences may further include X1 X2 X3 X4 X5 M.
Step S703, inputting a first current pretreated amino acid sequence among the plurality of fourth target pretreated amino acid sequences into the protein language model to obtain the probabilities predicted by the classification layer; taking the probability corresponding to a first token at the first current masked site as the confusion factor of that site, where the first token is the token of the fourth target amino acid sequence at the first current masked site, and the first current masked site is the masked site corresponding to the first current pretreated amino acid sequence;
For example, M X2 X3 X4 X5 EOS is input into the protein language model; at the 1st output position the classification layer predicts the probability of each candidate token (including the 20 amino acids, the marks, etc.), and the probability corresponding to X1 is taken as the confusion factor of the 1st site.
X1 M X3 X4 X5 EOS is input into the protein language model; at the 2nd output position the classification layer predicts the probability of each candidate token, and the probability corresponding to X2 is taken as the confusion factor of the 2nd site.
In this way, the confusion factor of each site can be obtained.
Step S704, obtaining the confusion of the fourth target amino acid sequence from the confusion factors of all the sites.
As can be seen, the method 700 determines the confusion in a manner similar to the first type of training task. Specifically, given a target amino acid sequence whose confusion is to be evaluated, the token at each site may be masked in turn from the first site to the last site of the sequence, yielding a plurality of target preprocessed amino acid sequences. Because the first preprocessing is applied to the target amino acid sequence, the model gives, at the position of the masked site, the probability of each candidate token at that site, and the probability corresponding to the true token of the masked site is taken as the confusion factor of that site. Finally, the confusion of the amino acid sequence is obtained from the confusion factors of all the sites. This effectively simplifies the calculation of the confusion and broadens the uses of the disclosed model. A sketch of this procedure is given below.
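The following sketch outlines the method 700 flow with a toy scoring function standing in for the protein language model: mask each site in turn, query the probability the model assigns to the true token at the masked site, and combine the factors. The score_masked callable and the perplexity-style combination are assumed interfaces, not the model's real API.

```python
# Masked-site confusion in the spirit of the method 700, with a stand-in scorer.
import math

def masked_confusion(sequence: list, score_masked) -> float:
    factors = []
    for i, true_token in enumerate(sequence):
        masked = sequence[:i] + ["M"] + sequence[i + 1:]      # first preprocessing
        factors.append(score_masked(masked, i, true_token))   # p(true token | context)
    return math.exp(-sum(math.log(p) for p in factors) / len(factors))

if __name__ == "__main__":
    seq = ["X1", "X2", "X3", "X4", "X5", "EOS"]
    toy = lambda masked, i, tok: 0.6 + 0.05 * (i % 3)         # stand-in probabilities
    print(round(masked_confusion(seq, toy), 4))
```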
Fig. 8 illustrates a method 800 for predicting amino acid sequence confusion using a protein language model, according to an embodiment of the disclosure. According to embodiments of the present disclosure, the protein language model may be a trained protein language model trained using any of the method embodiments described above. In accordance with embodiments of the present disclosure, a trained protein language model may include a representation extraction layer, a transformer block, and a classification layer. As shown in fig. 8, the method 800 includes:
step S801, obtaining a fifth target amino acid sequence; the length of the fifth target amino acid sequence is L5;
for example, the fifth target amino acid sequence is X1 X2 X3 X4 X5 EOS, and L5 = 6.
Step S802, sequentially taking each site in the fifth target amino acid sequence as the second site, and performing the second pretreatment operation to obtain a plurality of fifth target pretreated amino acid sequences; wherein each fifth target pretreatment amino acid sequence corresponds to a masked site;
for example, the plurality of fifth target pretreatment amino acid sequences are:
M X2 X3 X4 X5 EOS;
X1 M X3 X4 X5 EOS;
X1 X2 M X4 X5 EOS;
X1 X2 X3 M X5 EOS;
X1 X2 X3 X4 M EOS。
the plurality of fifth target pretreatment amino acid sequences may further include X1 X2 X3 X4 X5 M.
Step S803, inputting a second current pretreated amino acid sequence among the plurality of fifth target pretreated amino acid sequences into the protein language model to obtain the probabilities predicted by the classification layer; taking the probability corresponding to a second token at the (L5+1)-th position as the confusion factor of the second current masked site, where the second token is the token of the fifth target amino acid sequence at the second current masked site, and the second current masked site is the masked site corresponding to the second current pretreated amino acid sequence. It will be appreciated that, when the fifth target pretreated amino acid sequence is used as the input tokens for positions 1 to L5, the prediction start symbol may also be used as the input token at the (L5+1)-th position.
For example, M X2 X3 X4 X5 EOS is input into the protein language model; at the (L5+1)-th output position the classification layer predicts the probability of each candidate token, and the probability corresponding to X1 is taken as the confusion factor of the current masked site (the site of X1).
X1 M X3 X4 X5 EOS is input into the protein language model; at the (L5+1)-th output position the classification layer predicts the probability of each candidate token, and the probability corresponding to X2 is taken as the confusion factor of the current masked site (the site of X2).
In this way, the confusion factor of each site can be obtained.
Step S804, obtaining the confusion of the fifth target amino acid sequence from the confusion factors of all the sites.
As can be seen, the method 800 determines the confusion in a manner similar to the first training subtask of the second type of training task. Specifically, given a target amino acid sequence whose confusion is to be evaluated, the token at each site may be masked in turn from the first site to the last site of the sequence, yielding a plurality of target preprocessed amino acid sequences. Each of these preprocessed sequences is then fed into the trained protein language model to obtain the output of the model's classification layer. Because the second preprocessing is applied to the target amino acid sequence, the model gives, after the original length of the sequence, the probability of each candidate token for the masked site, and the probability corresponding to the true token of the masked site is taken as the confusion factor. Finally, the confusion of the amino acid sequence is obtained from the confusion factors of all the sites. This effectively simplifies the calculation of the confusion and broadens the uses of the disclosed model.
According to an embodiment of the present disclosure, there is also provided an electronic apparatus. The electronic device may include at least one processor and a memory communicatively coupled to the at least one processor. The memory may store instructions executable by the at least one processor to enable the at least one processor to perform the methods disclosed in any one of the method embodiments described above.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions. The computer instructions may be used to cause a computer to perform the methods disclosed in any of the method embodiments described above.
According to an embodiment of the present disclosure, there is also provided a computer program product. The computer program product may comprise a computer program. The computer program may, when executed by a processor, implement the method disclosed by any of the method embodiments described above. According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product. Referring to fig. 9, a block diagram of an electronic device 900 that may be a server or a client of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the device 900; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 908 may include, but is not limited to, magnetic disks and optical disks. The communication unit 909 allows the device 900 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning network algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as a training method of a neural network and/or a method of predicting protein structure. For example, in some embodiments, the training method of the neural network and/or the method of predicting protein structure may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into RAM 903 and executed by the computing unit 901, one or more steps of the above-described training method of a neural network and/or method of predicting protein structure may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the neural network and/or the method of predicting the protein structure.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the claims following the grant and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. It is important that as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the disclosure.

Claims (20)

1. A method of training a protein language model, comprising:
in a training process, executing a first type of training task and at least one of two training subtasks included in a second type of training task;
the first type of training task comprises: obtaining a first amino acid sequence; performing a first preprocessing operation on the first amino acid sequence to obtain a first preprocessed amino acid sequence, the first preprocessing operation comprising selecting one or more first sites in the first amino acid sequence and masking the tokens at the first sites; inputting the first preprocessed amino acid sequence into a protein language model to obtain a first prediction result; and calculating a first loss value from the tokens at the first sites in the first amino acid sequence and the elements at first calculation positions in the first prediction result, the first calculation positions being the positions corresponding to the first sites;
the first training subtask of the second type of training task comprises: obtaining a second amino acid sequence; performing a second pretreatment operation on the second amino acid sequence to obtain a second pretreated amino acid sequence; the second preprocessing operation comprises selecting one or more second sites in the second amino acid sequence, and masking the lemma at the second sites; inputting the second pretreatment amino acid sequence into a protein language model to obtain a second prediction result; calculating a second loss value according to the word element at the second position in the second amino acid sequence and the element at a second calculation position in the second prediction result; the second calculation site is selected from a site subsequent to the site para to the second site;
the second training subtask of the second type of training task comprises: obtaining a third amino acid sequence; performing a third preprocessing operation on the third amino acid sequence to obtain a third preprocessed amino acid sequence, the third preprocessing operation comprising selecting, in the third amino acid sequence, one or more third sites that are located at the tail of the sequence and adjacent to each other, and deleting the tokens at the third sites; inputting the third preprocessed amino acid sequence into the protein language model to obtain a third prediction result; and calculating a third loss value from the tokens at the third sites in the third amino acid sequence and the elements at third calculation positions in the third prediction result, the third calculation positions being selected from the positions after the positions corresponding to the third sites;
and adjusting parameters of the protein language model based on the first loss value and the loss value corresponding to the training task of the second type to obtain a trained protein language model, wherein the loss value corresponding to the training task of the second type comprises at least one of the second loss value and the third loss value.
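By way of illustration, the following is a minimal Python sketch, assuming PyTorch tensors and hypothetical shapes rather than the patented implementation, of the two kinds of loss in claim 1: the first-type loss compares the prediction at each masked site with the original token at that same (corresponding) site, while the second-type losses read the prediction for each selected token from a later output position.

```python
import torch
import torch.nn.functional as F

def first_type_loss(logits, original_tokens, first_sites):
    # logits: (seq_len, vocab_size) model output for the first preprocessed sequence
    # original_tokens: (seq_len,) token ids of the unmasked first amino acid sequence
    # Compare the prediction at each masked site with the token at that same site.
    idx = torch.tensor(first_sites)
    return F.cross_entropy(logits[idx], original_tokens[idx])

def second_type_loss(logits, original_tokens, target_sites, prediction_sites):
    # The token at each selected site is predicted at a later output position
    # (for example, a position appended after the original sequence, as in claim 5).
    t = torch.tensor(target_sites)
    p = torch.tensor(prediction_sites)
    return F.cross_entropy(logits[p], original_tokens[t])
```

The total of the first loss value and the second-type loss values would then be backpropagated to adjust the model parameters, as stated in the last step of the claim.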
2. The method of claim 1, wherein,
performing, in the training process, at least one of the training subtasks comprised by the first type of training task and the second type of training task comprises:
in a first training phase, performing the first type of training task; and
in a second training phase, performing the second type of training task.
3. The method of claim 1, wherein,
performing, in the training process, at least one of the training subtasks comprised by the first type of training task and the second type of training task comprises:
in a first training phase, performing the first type of training task;
in a second training phase, performing the first type of training task and the second type of training task; and
in a third training phase, performing the second type of training task.
4. The method according to claim 2 or 3, wherein,
the at least one training subtask comprised by the first type of training task and the second type of training task is performed on a first data set;
after performing the at least one training subtask on the first data set, the method further comprises performing the first type of training task and/or the second type of training task on a second data set.
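A hedged sketch of a phased schedule of the kind described in claims 2 to 4, written in plain Python; the helper loss functions, the optimizer, and the data sets are assumptions supplied by the caller, not elements taken from the patent.

```python
def train_phased(model, optimizer, phases):
    # phases: list of (loss_fns, dataset) pairs; each loss_fn maps (model, batch)
    # to a scalar loss tensor, so a phase may run one or both task types.
    for loss_fns, dataset in phases:
        for batch in dataset:
            loss = sum(fn(model, batch) for fn in loss_fns)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Example schedule mirroring claim 3 on a first data set, then claim 4's pass
# on a second data set (first_task / second_task are the hypothetical loss fns):
# train_phased(model, optimizer,
#              [([first_task], first_dataset),
#               ([first_task, second_task], first_dataset),
#               ([second_task], first_dataset),
#               ([first_task, second_task], second_dataset)])
```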
5. The method of claim 1, wherein,
inputting the second preprocessed amino acid sequence into the protein language model to obtain a second prediction result comprises: inputting the second preprocessed amino acid sequence into the protein language model, and inputting the masked amino acids, as the tokens at the (L1+2)-th to (L1+the number of second sites+1)-th positions, into the protein language model;
and/or,
inputting the third preprocessed amino acid sequence into the protein language model to obtain a third prediction result comprises: inputting the third preprocessed amino acid sequence into the protein language model, and inputting the deleted amino acids, as the tokens at the (L2+2)-th to (L2+the number of third sites+1)-th positions, into the protein language model.
6. The method according to any one of claims 1 to 5, wherein,
masking the tokens at the first sites comprises: for each first site, replacing the token at the first site with a first mask placeholder; and/or,
masking the tokens at the second sites comprises: taking one second site, or a plurality of adjacent second sites, that are not adjacent to other second sites as a group of second sites, and correspondingly replacing the tokens at the group of second sites with a second mask placeholder.
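For illustration, a minimal Python sketch of the two masking branches of claim 6; the placeholder token names are assumptions, and the example sequence is arbitrary.

```python
FIRST_MASK = "<mask1>"   # placeholder names are assumptions
SECOND_MASK = "<mask2>"

def mask_first_sites(tokens, first_sites):
    # Claim 6, first branch: one placeholder per masked site.
    out = list(tokens)
    for i in first_sites:
        out[i] = FIRST_MASK
    return out

def mask_second_sites(tokens, second_sites):
    # Claim 6, second branch: a run of adjacent second sites is replaced
    # by a single placeholder (span-style masking).
    second = set(second_sites)
    out, i = [], 0
    while i < len(tokens):
        if i in second:
            out.append(SECOND_MASK)
            while i in second:        # skip the rest of the adjacent run
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

# mask_second_sites(list("MKTAYIA"), [2, 3, 5])
# -> ['M', 'K', '<mask2>', 'Y', '<mask2>', 'A']
```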
7. The method according to any one of claims 1 to 6, wherein,
inputting the second preprocessed amino acid sequence into the protein language model to obtain a second prediction result further comprises: inputting the second placeholder, as the token at the (L1+1)-th position, into the protein language model; and/or,
inputting the third preprocessed amino acid sequence into the protein language model to obtain a third prediction result further comprises: inputting the second placeholder, as the token at the (L2+1)-th position, into the protein language model.
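Reading claims 5 and 7 together, the model input for the second-type tasks can be assembled as sketched below; the separator name is an assumption, and the construction is illustrative rather than the patented one.

```python
SEP = "<sep>"   # the "second placeholder" of claim 7; the name is an assumption

def build_second_task_input(preprocessed_tokens, masked_tokens):
    # After the length-L1 preprocessed sequence, place the second placeholder at
    # position L1+1 and the masked (or deleted) amino acids at positions
    # L1+2 .. L1+len(masked_tokens)+1, as in claims 5 and 7.
    return list(preprocessed_tokens) + [SEP] + list(masked_tokens)
```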
8. The method of claim 1, wherein,
the third amino acid sequence is obtained by splicing a light chain and a heavy chain of an antibody;
selecting one or more third sites in the third amino acid sequence that are located at the tail of the sequence and adjacent to each other comprises: selecting a site corresponding to a chain positioned at the tail of the sequence in the third amino acid sequence as a third site; the chain at the tail of the sequence is the light chain or the heavy chain.
9. The method of claim 1, wherein the protein language model includes a representation extraction layer, a transformer block, and a classification layer.
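A hedged PyTorch sketch of the three-part structure named in claim 9; the layer sizes, the use of an encoder-style transformer, and the omission of the attention masking needed for the generation-style subtasks are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ProteinLM(nn.Module):
    # Sketch of claim 9's structure; sizes are illustrative, not from the patent.
    def __init__(self, vocab_size=33, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # representation extraction layer
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)   # transformer block
        self.classifier = nn.Linear(d_model, vocab_size)            # classification layer

    def forward(self, token_ids, return_representation=False):
        h = self.transformer(self.embed(token_ids))                 # (batch, seq, d_model)
        return h if return_representation else self.classifier(h)   # logits: (batch, seq, vocab)
```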
10. A method of extracting an amino acid sequence representation using a protein language model, the protein language model being the trained protein language model obtained by training using the method of any one of claims 1-9, the trained protein language model comprising a representation extraction layer, a transformer block, and a classification layer;
the method comprising:
obtaining a first target amino acid sequence;
inputting the first target amino acid sequence into the representation extraction layer and the transformer block of the trained protein language model, and taking the output of the transformer block of the trained protein language model as the representation of the first target amino acid sequence.
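A usage sketch of claim 10, assuming the ProteinLM sketch above; the integer tokenization of amino acid letters is hypothetical.

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"
TOK = {a: i + 1 for i, a in enumerate(AA)}   # hypothetical amino-acid-to-id map

model = ProteinLM()
ids = torch.tensor([[TOK[a] for a in "MKTAYIAKQR"]])   # (1, 10)
with torch.no_grad():
    rep = model(ids, return_representation=True)        # (1, 10, d_model)
# `rep`, the transformer-block output, serves as the sequence representation.
```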
11. A method of generating a new amino acid sequence using a protein language model, the protein language model being the trained protein language model obtained by training using the method of any one of claims 1-9, the method comprising:
obtaining a second target amino acid sequence, wherein amino acids at some sites in the second target amino acid sequence are to be replaced to generate the new amino acid sequence;
performing the second preprocessing operation on the second target amino acid sequence to obtain a second target preprocessed amino acid sequence, wherein the sites whose amino acids are to be replaced are taken as the second sites, and the length of the second target preprocessed amino acid sequence is L3;
inputting the second target preprocessed amino acid sequence into the trained protein language model;
inputting the token output at the k-th position of the trained protein language model, as the token input at the (k+1)-th position, into the trained protein language model, wherein k is greater than L3;
concatenating the second target preprocessed amino acid sequence with the tokens output at the (L3+1)-th to (L3+n)-th positions of the trained protein language model to obtain the new amino acid sequence, wherein n is the total number of new amino acids corresponding to the amino acids at the second sites; or, concatenating the second target preprocessed amino acid sequence with the tokens output from the (L3+1)-th position of the trained protein language model to the position immediately preceding the position at which the trained protein language model outputs a terminator token, to obtain the new amino acid sequence.
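The feed-back loop of claim 11 can be sketched as follows, assuming a model that returns per-position logits (as in the ProteinLM sketch above); greedy decoding is an illustrative choice, not something the claim mandates.

```python
import torch

def generate(model, prefix_ids, n_max, end_id):
    # Feed the token output at position k back in as the input at position k+1;
    # stop after n_max new tokens or when the terminator token is produced.
    ids = list(prefix_ids)                            # preprocessed sequence, length L3
    for _ in range(n_max):
        logits = model(torch.tensor([ids]))[0, -1]    # prediction for the next position
        next_id = int(torch.argmax(logits))
        if next_id == end_id:
            break
        ids.append(next_id)
    return ids[len(prefix_ids):]                      # tokens output after position L3
```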
12. The method of claim 11, wherein,
the trained protein language model comprises a representation extraction layer, a transformer block and a classification layer; a candidate token meeting a preset condition is selected as the output token of at least one of the positions used for the concatenation; the preset condition may be that the corresponding probability meets at least one of a preset probability ranking condition, a probability threshold condition and a candidate token content condition; the candidate tokens are selected from different types of amino acids and tags; and the probability is predicted by the classification layer.
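A minimal sketch of the candidate-token selection of claim 12; the particular top-k and threshold values, and the fallback rule, are assumptions.

```python
import torch

def select_token(probs, allowed_ids, top_k=5, min_prob=0.01):
    # Keep only candidates that rank in the top-k by probability, exceed a
    # probability threshold, and belong to an allowed candidate set
    # (amino acid and tag token ids), per the preset conditions of claim 12.
    ranked = torch.argsort(probs, descending=True)[:top_k]
    for tok in ranked.tolist():
        if tok in allowed_ids and float(probs[tok]) >= min_prob:
            return tok
    return int(ranked[0])   # fall back to the most probable candidate
```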
13. A method of obtaining a related amino acid sequence using a protein language model, the protein language model being the trained protein language model obtained by training using the method of any one of claims 1-9, the method comprising:
inputting a third target preprocessed amino acid sequence into the trained protein language model, wherein the length of the third target preprocessed amino acid sequence is L4, and L4 is greater than or equal to 0;
inputting the token output at the q-th position of the trained protein language model, as the token input at the (q+1)-th position, into the trained protein language model, wherein q is greater than L4;
taking the tokens output from the (L4+1)-th position of the trained protein language model to the position immediately preceding the position at which the trained protein language model outputs a terminator token as the related amino acid sequence; or, taking the tokens output from the (L4+1)-th position to a preset position of the trained protein language model as the related amino acid sequence;
wherein the related amino acid sequence is an amino acid sequence written based on the third target amino acid sequence; or, the related amino acid sequence is an amino acid sequence paired with the third target amino acid sequence.
14. The method of claim 13, wherein,
the trained protein language model comprises a representation extraction layer, a transformer block and a classification layer; for at least one position from the (L4+1)-th position to the position immediately preceding the position at which the trained protein language model outputs a terminator token, a candidate token meeting a preset condition is selected as the output token of that position; the preset condition may be that the corresponding probability meets at least one of a preset probability ranking condition, a probability threshold condition and a candidate token content condition; the candidate tokens are selected from different types of amino acids and tags; and the probability is predicted by the classification layer.
15. The method of claim 14, further comprising:
for each position from the (L4+1)-th position to the position immediately preceding the position at which the trained protein language model outputs the terminator token, taking the probability corresponding to the token output at that position as the confusion factor of that position;
and obtaining the confusion of the related amino acid sequence according to the confusion factors of the positions.
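Claim 15 leaves the aggregation of the per-position confusion factors open; a common choice, shown here as an assumed formula rather than the patented one, is geometric-mean perplexity.

```python
import math

def sequence_confusion(position_probs):
    # Aggregate the per-position probabilities (the confusion factors) into a
    # sequence-level confusion score; geometric-mean perplexity is assumed.
    log_sum = sum(math.log(max(p, 1e-12)) for p in position_probs)
    return math.exp(-log_sum / len(position_probs))

# sequence_confusion([0.5, 0.25, 0.5]) is approximately 2.52
```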
16. A method of predicting amino acid sequence confusion using a protein language model, the protein language model being the trained protein language model obtained by training using the method of any one of claims 1-9, the trained protein language model comprising a representation extraction layer, a transformer block, and a classification layer, the method comprising:
obtaining a fourth target amino acid sequence;
sequentially taking each site in the fourth target amino acid sequence as the first site and performing the first preprocessing operation, to obtain a plurality of fourth target preprocessed amino acid sequences, wherein each fourth target preprocessed amino acid sequence corresponds to one masked site;
inputting a first current preprocessed amino acid sequence among the plurality of fourth target preprocessed amino acid sequences into the protein language model to obtain the probability predicted by the classification layer; taking the probability corresponding to a first token at a first current masked site as the confusion factor of the first current masked site, wherein the first token is the token of the fourth target amino acid sequence at the first current masked site, and the first current masked site is the masked site corresponding to the first current preprocessed amino acid sequence;
and obtaining the confusion of the fourth target amino acid sequence according to the confusion factors of the sites.
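A sketch of the per-site masking procedure of claim 16, assuming integer token ids, a model that returns per-position logits, and the sequence_confusion helper from the sketch above as the (assumed) aggregation.

```python
import torch

def masked_confusion(model, ids, mask_id):
    # Mask one site at a time, read the probability the classification layer
    # assigns to the original token at that site, and aggregate the factors.
    factors = []
    for i in range(len(ids)):
        masked = list(ids)
        masked[i] = mask_id
        with torch.no_grad():
            probs = torch.softmax(model(torch.tensor([masked]))[0, i], dim=-1)
        factors.append(float(probs[ids[i]]))
    return sequence_confusion(factors)
```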
17. A method of predicting amino acid sequence confusion using a protein language model, the protein language model being the trained protein language model obtained by training using the method of any one of claims 1-9, the trained protein language model comprising a representation extraction layer, a transformer block, and a classification layer, the method comprising:
obtaining a fifth target amino acid sequence, wherein the length of the fifth target amino acid sequence is L5;
sequentially taking each site in the fifth target amino acid sequence as the second site and performing the second preprocessing operation, to obtain a plurality of fifth target preprocessed amino acid sequences, wherein each fifth target preprocessed amino acid sequence corresponds to one masked site;
inputting a second current preprocessed amino acid sequence among the plurality of fifth target preprocessed amino acid sequences into the protein language model to obtain the probability predicted by the classification layer; taking the probability corresponding to a second token at the (L5+1)-th position as the confusion factor of the second current masked site, wherein the second token is the token of the fifth target amino acid sequence at the second current masked site, and the second current masked site is the masked site corresponding to the second current preprocessed amino acid sequence;
and obtaining the confusion of the fifth target amino acid sequence according to the confusion factors of the sites.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-17.
19. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-17.
20. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-17.
CN202310832203.6A 2023-07-06 2023-07-06 Training method for protein language model, electronic device, computer readable medium and program product Pending CN116959571A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310832203.6A CN116959571A (en) 2023-07-06 2023-07-06 Training method for protein language model, electronic device, computer readable medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310832203.6A CN116959571A (en) 2023-07-06 2023-07-06 Training method for protein language model, electronic device, computer readable medium and program product

Publications (1)

Publication Number Publication Date
CN116959571A true CN116959571A (en) 2023-10-27

Family

ID=88448503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310832203.6A Pending CN116959571A (en) 2023-07-06 2023-07-06 Training method for protein language model, electronic device, computer readable medium and program product

Country Status (1)

Country Link
CN (1) CN116959571A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711525A (en) * 2024-02-05 2024-03-15 北京悦康科创医药科技股份有限公司 Activity prediction model training and activity prediction related products
CN117711532A (en) * 2024-02-05 2024-03-15 北京悦康科创医药科技股份有限公司 Model training for polypeptide amino acid sequence generation and related products
CN117711525B (en) * 2024-02-05 2024-05-10 北京悦康科创医药科技股份有限公司 Activity prediction model training and activity prediction related products
CN117711532B (en) * 2024-02-05 2024-05-10 北京悦康科创医药科技股份有限公司 Training method for polypeptide amino acid sequence generation model and polypeptide amino acid sequence generation method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination