CN116013407A - Method for generating property-decoupled proteins based on a language model


Info

Publication number: CN116013407A
Application number: CN202211686617.4A
Authority: CN (China)
Prior art keywords: amino acid, property, language model, properties, sequence
Legal status: Pending (assumed status; not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Zhang Qiang (张强), Wang Zeyuan (王泽元), Chen Huajun (陈华钧)
Current assignee: ZJU Hangzhou Global Scientific and Technological Innovation Center
Original assignee: ZJU Hangzhou Global Scientific and Technological Innovation Center
Filing date: 2022-12-26, application filed by ZJU Hangzhou Global Scientific and Technological Innovation Center
Priority to: CN202211686617.4A
Publication date: 2023-04-25 (publication of CN116013407A)

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing


Abstract

The invention discloses a method for generating property-decoupled proteins based on a language model, comprising the following steps: constructing an amino acid property knowledge graph from the amino acid properties; obtaining protein data, decoupling each piece of protein data into an amino acid property sequence according to the knowledge graph, and mapping the amino acid property sequence from property space to vector space to obtain a vector representation of the sequence; modeling and training the language model on a causal prediction task using the vector representation of the amino acid property sequence, so as to optimize the parameters of the language model; and generating proteins with the parameter-optimized language model, which enables specific proteins to be generated on the basis of amino acid properties.

Description

Method for generating property-decoupled proteins based on a language model
Technical Field
The invention relates to the technical field of proteins, and in particular to a method for generating property-decoupled proteins based on a language model.
Background
Proteins are sequence data composed of amino acids and, from this point of view, bear a certain similarity to natural language; a great deal of research has therefore migrated natural language methods to protein sequences. The language model is currently the most prominent paradigm for modeling language; its core idea is to use a known sequence to obtain the probability distribution of an unknown sequence. The two most common language models are the masked language model and the causal language model. The masked language model predicts the probability distribution of words at masked positions from the surrounding context, which is very effective for text understanding. In sequence generation, the causal language model dominates: it models the probability of what follows from what precedes, and generates text through successive iterations. Modeling of natural language has shifted from statistics based on word-to-word co-occurrence frequencies to neural network fits based on word vectors, and experiments show that the distributed representations and nonlinear mappings of neural networks generalize more strongly. GPT-3 scaled its parameters to 175B, and at such scale the model can generate sentences hardly distinguishable from human writing. Inspired by this, several research teams have applied the paradigm to proteins, training models such as ProGen2, ProtGPT2 and RITA, and found that the larger the parameter scale, the better the modeling of protein sequences and the more natural the generated proteins; such models can be expected to yield, by sampling, sequences that differ from natural ones yet have the desired functions.
However, an amino acid language model built on 20 mutually independent amino acid symbols cannot model the properties of the amino acids themselves well, such as the steric hindrance or hydrophilicity of an amino acid, which increases the difficulty of model learning. Secondly, because the properties cannot be decoupled, the embedding of an amino acid symbol is a superposition of the probabilities of each property appearing at the current position, as is the predicted probability of each amino acid appearing there. Proteins with different functions should, however, be distinguishable, so the inability to decouple properties makes it impossible to generate function-specific proteins in a targeted manner, limiting the flexibility of the model in use.
Current generation methods based on causal language models focus mainly on the design of the sampler. Conventional maximization-based samplers aim to generate the sequence that best matches the model's expectation, including greedy generation and its improvement, beam search. Maximization-based methods, however, lead to text degeneration: output that is bland, incoherent, or stuck in repetitive loops. To give generated text more flexibility, researchers designed the Top-k and Nucleus sampling schemes, whose main idea is to first select a set of candidate words and then choose among them according to probability. Note, however, that these sampling methods are all purely probability-based: they cannot determine what meaning the generated sequence carries, nor steer generation toward a desired sequence, which reduces the practicality of the model.
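For concreteness, the following is a minimal NumPy sketch of the Top-k plus Nucleus (Top-p) filtering just described; the function name, default thresholds and interface are illustrative assumptions, not taken from the patent or any specific library.

```python
import numpy as np

def top_k_top_p_sample(logits, k=50, p=0.9, rng=None):
    """Filter a next-token distribution with Top-k, then Nucleus (Top-p),
    and sample from what remains. Defaults are illustrative."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    order = np.argsort(probs)[::-1][:k]       # Top-k: keep the k best candidates
    kept = probs[order]
    # Nucleus: smallest prefix whose cumulative mass reaches p
    cutoff = int(np.searchsorted(np.cumsum(kept), p)) + 1
    order, kept = order[:cutoff], kept[:cutoff]
    return int(rng.choice(order, p=kept / kept.sum()))
```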
Disclosure of Invention
In view of the foregoing, an object of the present invention is to provide a method for generating proteins by decoupling amino acid properties based on a language model: a plurality of language models are trained on a protein data set, and samplers for specific families are trained on family-specific amino acid data sets, taking the property representations output by the language models as input, so as to produce proteins for different domains.
To achieve the above object, an embodiment provides a method for generating property-decoupled proteins based on a language model, comprising the following steps:
constructing an amino acid property knowledge graph according to the amino acid properties;
obtaining protein data, decoupling each piece of protein data into an amino acid property sequence according to the amino acid property knowledge graph, and mapping the amino acid property sequence from property space to vector space to obtain a vector representation of the amino acid property sequence;
modeling and training the language model on a causal prediction task using the vector representation of the amino acid property sequence, so as to optimize the parameters of the language model;
predicting the probability distribution of the next amino acid property from the vector representation of the known amino acid property sequence using the parameter-optimized language model; predicting the amino acid from that probability distribution using a sampler; appending the predicted amino acid's properties to the known amino acid property sequence; repeating until generation is complete; and converting the final amino acid property sequence into an amino acid sequence as the generated protein.
Preferably, in the amino acid property knowledge graph, each amino acid and one of its properties are expressed as a triplet (amino acid, property strength, property category); the knowledge graph is constructed from these triplets, and in property space the property strength is represented by the modulus of a vector and the property category by its direction, yielding the embedding of each property in the knowledge graph.
Preferably, decoupling each piece of protein data into an amino acid property sequence according to the amino acid property knowledge graph comprises:
looking up, in the amino acid property knowledge graph, the properties corresponding to each amino acid in the protein data, and replacing each amino acid by its properties to obtain the amino acid property sequence.
Preferably, mapping the amino acid property sequence from property space to vector space comprises:
adopting different mapping modes for different amino acid properties: a dictionary embedding for discrete properties; for continuous properties, representing the property by the vector direction and the property value by the vector magnitude; and for properties represented as graphs, a graph neural network embedding.
Preferably, the language model is a pluggable model capable of encoding sequences, such as LSTM, Transformer or GPT-3;
the language model predicts the probability distribution of the next amino acid property representation from the vector representation of the known amino acid properties; in the language model, the top and bottom layers encode the different amino acid properties independently without sharing parameters, the middle layers share information across the embeddings of the different properties through sparse self-attention, and the representation of each single amino acid property is predicted through information interaction among the multiple properties, enhanced by the known property information.
Preferably, when training the language model, a loss function is constructed by minimizing the error between the property labels and the property predictions, and the parameters of the language model are updated according to the loss function;
for discrete amino acid properties, the constructed loss function $\ell_b(p_{0:m})$ is:

$$\ell_b(p_{0:m}) = -\sum_{i=1}^{m} \sum_{c=1}^{C} \mathbb{1}[c = y_i]\, \log p^{i}(c) = -\sum_{i=1}^{m} \log p^{i}(y_i)$$

wherein $b$ denotes the batch index, $y$ the property label, $c$ a predicted property class, $C$ the total number of property classes, $p^{i}(c)$ the probability that the model assigns class $c$ at the $i$-th position, $p^{i}(y)$ the probability it assigns the label class, and $p_{0:m}$ the model's predictions over the amino acid property sequence of length $m$.
For continuous amino acid properties, the constructed loss function is:

$$\ell_b(x_{0:m}) = -\sum_{i=1}^{m} \log \mathcal{N}\left(x_i;\, \hat{\mu}_i,\, \hat{\sigma}_i^2\right)$$

wherein $m$ denotes the total number of amino acids, $\mathcal{N}(x; \hat{\mu}, \hat{\sigma}^2)$ the normal probability density over amino acid property values with mean $\hat{\mu}$ and variance $\hat{\sigma}^2$, $x$ the input amino acid property sequence, and $\mu$ and $\sigma$ the amino acid property mean and variance; $\mathcal{N}(x; \hat{\mu}, 1)$ denotes the density with the variance fixed to 1.
Preferably, the output of the language model is a predicted amino acid property representation, which is then mapped to property space using a single-layer linear network, and the loss function is calculated on the predicted amino acid properties.
Preferably, the sampler adopts a neural network, namely a multi-layer perceptron.
Preferably, the predicted amino acid is obtained by the sampler from the probability distribution over the amino acid properties; the amino acid property knowledge graph is used to determine the properties of the predicted amino acid, and those properties are appended to the known amino acid property sequence.
Compared with the prior art, the invention has at least the following beneficial effects:
(1) An amino acid property knowledge graph is constructed for the first time from existing amino acid data, providing finer-grained prior knowledge for protein representation.
(2) Amino acid property probabilities are predicted by the multi-property language model, and domain-specific proteins are generated by a domain sampler. This differs from existing generative models based on amino acid symbols, whose generated sequences are fixed symbols that cannot reflect the properties required at the current position and whose generation space is limited to the 20 natural amino acids. The language model of the invention can describe the properties required at the current position and is interpretable; biologists can even design new artificial amino acids from the properties the model describes, improving the diversity of biological materials. The sampler takes the multiple amino acid property signals as input and can better weigh which amino acid to select to meet the requirements of the current position.
(3) The property embedding scheme of the invention uses knowledge-graph enhancement to map continuous and discrete properties into vector space for use by the language model.
(4) Unlike existing single-stream language modeling generation models, the invention proposes a mixture-of-experts system over the different amino acid properties, allowing limited communication between the properties so as to learn property-specific sequence patterns that guide generation.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for generating a property decoupling protein based on a language model provided in the examples;
FIG. 2 is a schematic representation of the amino acid properties provided in the examples;
FIG. 3 is a schematic diagram of a pre-training and fine-tuning process for a language model provided by an embodiment.
Detailed Description
The present invention is described in further detail below with reference to the drawings and embodiments, in order to make its objects, technical solutions and advantages more apparent. It should be understood that the detailed description is given by way of example only and is not intended to limit the scope of the invention.
FIG. 1 is a flow chart of the method for generating property-decoupled proteins based on a language model according to an embodiment. As shown in FIG. 1, the method comprises the following steps:
and 1, constructing an amino acid quality knowledge graph according to the amino acid properties.
In the embodiment, the physicochemical properties that play an important role in protein function, and their degrees of importance, are obtained from chemical experiments on amino acids, and the amino acid property knowledge graph is constructed from them as the basis for decoupling the amino acids.
As shown in FIG. 2, the amino acid properties include category, solubility, radius, charge, polarity, composition, etc. In the amino acid property knowledge graph, each amino acid and one of its properties are expressed as a triplet (amino acid, property strength, property category); the knowledge graph is constructed from these triplets, and in property space the property strength is represented by the modulus of a vector and the property category by its direction, yielding the embedding of each property in the knowledge graph.
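To make the triplet scheme concrete, here is a small sketch of how such (amino acid, property strength, property category) triples could be stored and embedded so that the vector direction encodes the category and the modulus encodes the strength; the amino acids, categories, strengths and dimensions below are illustrative placeholders, not data from the patent.

```python
import numpy as np

# Illustrative (amino acid, property strength, property category) triples;
# the strengths are placeholders, not measured physicochemical data.
TRIPLES = [
    ("A", 0.31, "hydrophobicity"),
    ("A", 0.52, "radius"),
    ("R", 0.95, "charge"),
    ("R", 0.12, "hydrophobicity"),
]

DIM = 16
rng = np.random.default_rng(0)

def unit_vector(dim, rng):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# One fixed unit direction per property category: direction encodes the category.
category_dirs = {cat: unit_vector(DIM, rng) for cat in {t[2] for t in TRIPLES}}

def property_embedding(strength, category):
    """Vector modulus encodes the property strength, direction the category."""
    return strength * category_dirs[category]

# The knowledge graph as an adjacency map: amino acid -> its property vectors.
kg = {}
for aa, strength, cat in TRIPLES:
    kg.setdefault(aa, []).append(property_embedding(strength, cat))
```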
Step 2, obtain protein data and construct a pre-training data set.
In the embodiment, the protein data are derived from protein sequence corpora, covering protein data measured by protein sequencing experiments in the biological domain. Each piece of protein data is an amino acid sequence composed of amino acids. Protein data that cannot serve as samples, namely those whose amino acid sequences are longer than 2048, are removed; the remaining sequences of similar length are grouped into samples whose total length does not exceed 2048 amino acids.
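As an illustration of this data preparation step, the sketch below drops over-length sequences and packs the remainder into samples of at most 2048 amino acids; the greedy length-sorted packing is an assumption about the intended grouping, not the patent's stated recipe.

```python
MAX_LEN = 2048

def build_samples(sequences):
    """Drop sequences longer than MAX_LEN, then greedily pack the remaining
    length-sorted sequences into samples whose total length stays within
    MAX_LEN. The packing scheme is an assumed illustration."""
    kept = sorted((s for s in sequences if len(s) <= MAX_LEN), key=len)
    samples, current, total = [], [], 0
    for seq in kept:
        if current and total + len(seq) > MAX_LEN:
            samples.append(current)
            current, total = [], 0
        current.append(seq)
        total += len(seq)
    if current:
        samples.append(current)
    return samples
```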
Step 3, decouple the protein data based on the amino acid property knowledge graph and form vector representations.
In the embodiment, each piece of protein data is decoupled into an amino acid property sequence according to the amino acid property knowledge graph, and the property sequence is mapped from property space to vector space to obtain its vector representation.
Specifically, decoupling each piece of protein data into an amino acid property sequence according to the amino acid property knowledge graph comprises: looking up, in the knowledge graph, the properties corresponding to each amino acid in the protein data, and replacing each amino acid by its properties to obtain the amino acid property sequence.
For a given piece of protein data, each amino acid can be mapped, via the amino acid property knowledge graph, into three vectors relating to, for example, solubility, radius and polarity; such a piece of protein data is thus mapped into three amino acid property sequence vectors.
In the embodiment, to feed the amino acid properties into the language model, they must also be mapped from property space to a dense vector space. Different mapping modes are adopted for different amino acid properties: a dictionary embedding for discrete properties; for continuous properties, the vector direction represents the property and the vector magnitude the property value; and for properties represented as graphs, a graph neural network embedding. A conversion from property space to vector space is thereby established through each property and its embedding mode, yielding the vector representation of the amino acid property sequence.
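A compact PyTorch sketch of the three embedding modes follows; the class name, sizes, and the linear stand-in for the graph neural network are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

class PropertyEmbedder(nn.Module):
    """Sketch of the three mapping modes: dictionary lookup for discrete
    properties, direction-times-magnitude for continuous properties, and a
    graph encoder (stubbed by a linear layer) for graph-typed properties."""

    def __init__(self, n_discrete, dim):
        super().__init__()
        self.discrete = nn.Embedding(n_discrete, dim)    # dictionary mode
        self.direction = nn.Parameter(torch.randn(dim))  # continuous mode
        self.graph_encoder = nn.Linear(dim, dim)         # stand-in for a GNN

    def embed_discrete(self, idx):
        return self.discrete(idx)

    def embed_continuous(self, value):
        d = self.direction / self.direction.norm()       # direction encodes the property
        return value.unsqueeze(-1) * d                   # magnitude encodes its value

    def embed_graph(self, node_feats):
        # A real implementation would run a graph neural network over the
        # property graph; a pooled linear layer keeps the sketch self-contained.
        return self.graph_encoder(node_feats.mean(dim=-2))
```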
Step 4, model and train the language model on a causal prediction task using the vector representation of the amino acid property sequence, to optimize the parameters of the language model.
In the embodiment, the language model predicts the probability distribution of the next amino acid's property representation from the vector representation of the known amino acid properties; the property representation at the current position is obtained, on the basis of the known property information, by exchanging information between the properties. The language model is a pluggable model capable of encoding sequences, such as LSTM, Transformer or GPT-3; the embodiment adopts GPT-3, a decoder-only Transformer. As shown in FIG. 3, in the language model, the top and bottom layers encode the different amino acid properties independently without sharing parameters; the middle layers share information across the embeddings of the different properties through sparse self-attention; and the representation of each single amino acid property is predicted through information interaction among the multiple properties, enhanced by the known property information.
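The layer layout can be sketched in PyTorch as follows; dense self-attention and linear per-property encoders stand in for the sparse self-attention and GPT-3 blocks named above, so this is a structural illustration under stated assumptions rather than the patent's exact model.

```python
import torch
import torch.nn as nn

class MultiPropertyLM(nn.Module):
    """Private bottom and top layers per property (no parameter sharing)
    around a shared middle stack in which the property streams exchange
    information via self-attention. A causal attention mask would be
    required for actual causal language modeling; it is omitted here."""

    def __init__(self, n_props, dim=64, n_mid=2):
        super().__init__()
        self.bottoms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_props)])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.middle = nn.TransformerEncoder(layer, num_layers=n_mid)
        self.tops = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_props)])

    def forward(self, prop_seqs):
        # prop_seqs: one (batch, seq_len, dim) tensor per property stream.
        hs = [bottom(x) for bottom, x in zip(self.bottoms, prop_seqs)]
        mixed = self.middle(torch.cat(hs, dim=1))   # streams attend to each other
        chunks = mixed.chunk(len(hs), dim=1)
        return [top(h) for top, h in zip(self.tops, chunks)]
```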
In the embodiment, to compute the loss function, the amino acid property representations produced by the language model are mapped back to the corresponding property spaces according to how each property is expressed. Specifically, a single-layer linear network is used as the mapping head for both discrete and continuous properties, mapping each property representation to its property space to obtain the amino acid property.
In the embodiment, when training the language model, a loss function is constructed by minimizing the error between the property labels and the property predictions, and the parameters of the language model are updated according to the loss function.
In causal language modeling of amino acid properties, a training batch consists of N pieces of protein data; within each piece, the current position can only access the property information of the preceding positions. The model predicts the amino acid properties at each position one by one, reducing the model's perplexity over the whole amino acid property sequence.
The causal language modeling loss function for the discrete amino acid properties of a protein is:

$$\ell_b(p_{0:m}) = -\sum_{i=1}^{m} \sum_{c=1}^{C} \mathbb{1}[c = y_i]\, \log p^{i}(c) = -\sum_{i=1}^{m} \log p^{i}(y_i)$$

where $b$ is the batch index, $y$ is the property label, $c$ is a predicted property class, $C$ is the total number of property classes, $p^{i}(c)$ is the probability that the model assigns class $c$ at the $i$-th position, $p^{i}(y)$ is the probability it assigns the label class, and $p_{0:m}$ denotes the model's predictions over the length-$m$ amino acid property sequence. The total loss is the sum of the prediction losses over a training batch.
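In code, this discrete loss is the usual causal cross-entropy with targets shifted one position; the tensor shapes and function name below are illustrative.

```python
import torch
import torch.nn.functional as F

def discrete_property_loss(logits, labels):
    """Causal cross-entropy for one discrete property.
    logits: (batch, seq_len, C) scores after the single-layer mapping head;
    labels: (batch, seq_len) property classes. Position i is predicted from
    positions < i, so the targets are shifted one step to the left."""
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = labels[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```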
The causal language modeling loss function for the continuous amino acid properties of a protein is:

$$\ell_b(x_{0:m}) = -\sum_{i=1}^{m} \log \mathcal{N}\left(x_i;\, \hat{\mu}_i,\, \hat{\sigma}_i^2\right)$$

where $m$ is the total number of amino acids, $\mathcal{N}(x; \hat{\mu}, \hat{\sigma}^2)$ is the normal probability density over amino acid property values with mean $\hat{\mu}$ and variance $\hat{\sigma}^2$, $x$ is the input amino acid property sequence, and $\mu$ and $\sigma$ are the amino acid property mean and variance; $\mathcal{N}(x; \hat{\mu}, 1)$ denotes the density with the variance fixed to 1. The total loss is the sum of the prediction losses over a training batch.
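The continuous loss corresponds to a Gaussian negative log-likelihood, sketched below with PyTorch's built-in helper; passing no variance fixes it to 1, matching the unit-variance density above, in which case the loss reduces to squared error up to a constant.

```python
import torch
import torch.nn.functional as F

def continuous_property_loss(mu, x, sigma=None):
    """Causal Gaussian negative log-likelihood for one continuous property.
    mu: (batch, seq_len) predicted means; x: (batch, seq_len) observed values;
    sigma: optional predicted standard deviations (None fixes variance to 1)."""
    var = torch.ones_like(mu) if sigma is None else sigma ** 2
    pred_mu, pred_var = mu[:, :-1], var[:, :-1]   # causal shift: predict position i+1
    target = x[:, 1:]
    return F.gaussian_nll_loss(pred_mu, target, pred_var)
```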
Step 5, generate proteins with the parameter-optimized language model.
In the downstream task, as shown in FIG. 3, a domain-specific protein family is used as the fine-tuning data set; the parameter-optimized language model generates amino acid property representations on this data set, and the parameters of the domain-specific sampler are optimized using the property representations and the corresponding amino acids as samples, so that the optimized sampler can generate proteins with the desired function.
In the embodiment, the protein generation process comprises: predicting the probability distribution of the next amino acid property from the vector representation of the known amino acid property sequence using the parameter-optimized language model; predicting the amino acid from that distribution using the sampler; appending the predicted amino acid's properties to the known property sequence; repeating until generation is complete; and converting the final amino acid property sequence into an amino acid sequence as the generated protein, thereby generating domain-specific proteins with the language model and the sampler working together.
In the embodiment, the sampler employs a neural network, namely a multi-layer perceptron.
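Putting the pieces together, the following is a hedged sketch of the generation loop; the model, sampler and knowledge-graph lookup interfaces are assumptions for illustration, not the patent's exact API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate(model, sampler, seed_props, max_len, prop_table):
    """Sketch of the generation loop. `model` maps a property prefix of shape
    (1, t, dim) to per-position property representations, `sampler` is the MLP
    mapping a property representation to logits over the amino acid vocabulary,
    and `prop_table` plays the role of the knowledge graph, mapping an amino
    acid id to its property vector."""
    seq, prefix = [], seed_props
    for _ in range(max_len):
        prop_repr = model(prefix)[:, -1]    # next-property prediction
        logits = sampler(prop_repr)         # sampler scores the amino acids
        aa = int(torch.distributions.Categorical(logits=logits).sample())
        seq.append(aa)                      # emit the amino acid...
        nxt = prop_table[aa].view(1, 1, -1) # ...and look up its properties
        prefix = torch.cat([prefix, nxt], dim=1)  # extend the known prefix
    return seq
```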
The foregoing has described in detail the preferred embodiments and advantages of the invention. It should be appreciated that the description is merely illustrative of the presently preferred embodiments; all changes, additions, substitutions and equivalents made within the spirit and principle of the invention are intended to be included within its scope.

Claims (9)

1. A method for generating property-decoupled proteins based on a language model, comprising the following steps:
constructing an amino acid property knowledge graph according to the amino acid properties;
obtaining protein data, decoupling each piece of protein data into an amino acid property sequence according to the amino acid property knowledge graph, and mapping the amino acid property sequence from property space to vector space to obtain a vector representation of the amino acid property sequence;
modeling and training the language model on a causal prediction task using the vector representation of the amino acid property sequence, so as to optimize the parameters of the language model;
predicting the probability distribution of the next amino acid property from the vector representation of the known amino acid property sequence using the parameter-optimized language model; predicting the amino acid from that probability distribution using a sampler; appending the predicted amino acid's properties to the known amino acid property sequence; repeating until generation is complete; and converting the final amino acid property sequence into an amino acid sequence as the generated protein.
2. The method for generating property-decoupled proteins based on a language model according to claim 1, wherein in the amino acid property knowledge graph each amino acid and one of its properties are expressed as a triplet (amino acid, property strength, property category); the knowledge graph is constructed from these triplets, the property strength is represented in property space by the modulus of a vector and the property category by its direction, yielding the embedding of each property in the knowledge graph.
3. The method for generating property-decoupled proteins based on a language model according to claim 1, wherein decoupling each piece of protein data into an amino acid property sequence according to the amino acid property knowledge graph comprises:
looking up, in the amino acid property knowledge graph, the properties corresponding to each amino acid in the protein data, and replacing each amino acid by its properties to obtain the amino acid property sequence.
4. The method for generating property-decoupled proteins based on a language model according to claim 1, wherein mapping the amino acid property sequence from property space to vector space comprises:
adopting different mapping modes for different amino acid properties: a dictionary embedding for discrete properties; for continuous properties, representing the property by the vector direction and the property value by the vector magnitude; and for properties represented as graphs, a graph neural network embedding.
5. The method for generating property-decoupled proteins based on a language model according to claim 1, wherein the language model is a pluggable model capable of encoding sequences, such as LSTM, Transformer or GPT-3;
the language model predicts the probability distribution of the next amino acid property representation from the vector representation of the known amino acid properties; in the language model, the top and bottom layers encode the different amino acid properties independently without sharing parameters, the middle layers share information across the embeddings of the different properties through sparse self-attention, and the representation of each single amino acid property is predicted through information interaction among the multiple properties, enhanced by the known property information.
6. The method for generating property-decoupled proteins based on a language model according to claim 1, wherein when training the language model, a loss function is constructed by minimizing the error between the property labels and the property predictions, and the parameters of the language model are updated according to the loss function;
for discrete amino acid properties, the constructed loss function $\ell_b(p_{0:m})$ is:

$$\ell_b(p_{0:m}) = -\sum_{i=1}^{m} \sum_{c=1}^{C} \mathbb{1}[c = y_i]\, \log p^{i}(c) = -\sum_{i=1}^{m} \log p^{i}(y_i)$$

wherein $b$ denotes the batch index, $y$ the property label, $c$ a predicted property class, $C$ the total number of property classes, $p^{i}(c)$ the probability that the model assigns class $c$ at the $i$-th position, $p^{i}(y)$ the probability it assigns the label class, and $p_{0:m}$ the model's predictions over the amino acid property sequence of length $m$;
for continuous amino acid properties, the constructed loss function is:

$$\ell_b(x_{0:m}) = -\sum_{i=1}^{m} \log \mathcal{N}\left(x_i;\, \hat{\mu}_i,\, \hat{\sigma}_i^2\right)$$

wherein $m$ denotes the total number of amino acids, $\mathcal{N}(x; \hat{\mu}, \hat{\sigma}^2)$ the normal probability density over amino acid property values with mean $\hat{\mu}$ and variance $\hat{\sigma}^2$, $x$ the input amino acid property sequence, and $\mu$ and $\sigma$ the amino acid property mean and variance; $\mathcal{N}(x; \hat{\mu}, 1)$ denotes the density with the variance fixed to 1.
7. The method for generating property-decoupled proteins based on a language model according to claim 6, wherein the output of the language model is a predicted amino acid property representation, which is then mapped to property space using a single-layer linear network, and the loss function is calculated on the predicted amino acid properties.
8. The method for generating property-decoupled proteins based on a language model according to claim 1, wherein the sampler adopts a neural network, namely a multi-layer perceptron.
9. The method for generating property-decoupled proteins based on a language model according to claim 1, wherein the sampler obtains the predicted amino acid from the probability distribution over the amino acid properties, the amino acid property knowledge graph is used to determine the properties of the predicted amino acid, and those properties are appended to the known amino acid property sequence.

Priority Applications (1)

Application number: CN202211686617.4A
Priority date / filing date: 2022-12-26
Title: Method for generating property-decoupled proteins based on a language model

Publications (1)

Publication number: CN116013407A
Publication date: 2023-04-25

Family

ID: 86026113



Cited By (2)

* Cited by examiner, † Cited by third party

* CN116935952A, filed 2023-09-18, published 2023-10-24, ZJU Hangzhou Global Scientific and Technological Innovation Center (浙江大学杭州国际科创中心): Method and device for training protein prediction model based on graph neural network
* CN116935952B, filed 2023-09-18, published 2023-12-01, ZJU Hangzhou Global Scientific and Technological Innovation Center (浙江大学杭州国际科创中心): Method and device for training protein prediction model based on graph neural network


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination