CN115662501A

CN115662501A - Protein generation method based on position specificity weight matrix

Info

Publication number: CN115662501A
Application number: CN202211312466.6A
Authority: CN
Inventors: 徐仁军; 张之韵; 李佳园
Original assignee: ZJU Hangzhou Global Scientific and Technological Innovation Center
Current assignee: ZJU Hangzhou Global Scientific and Technological Innovation Center
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-01-31

Abstract

The invention discloses a protein generation method based on a position-specific weight matrix. The method comprises: obtaining a training sample set; training a transformer model with a cross-entropy loss function on the training sample set and outputting an amino acid sequence probability distribution; obtaining a predicted amino acid sequence by top-k sampling; performing multiple sequence alignment on the predicted amino acid sequence with the PSI-BLAST method to obtain a position-specific weight matrix; performing information entropy calculation on the position-specific weight matrix to obtain a weight-specificity matrix information value; constructing a total loss function from the weight-specificity matrix information value and the cross-entropy loss function, and updating the model parameters with the total loss function on the training sample set to obtain an amino acid sequence generation model; and, taking a leader sequence as input, generating an amino acid sequence step by step with the amino acid sequence generation model and folding it with the trRosetta model to obtain the secondary and tertiary structure of the protein.

Description

Protein generation method based on position specificity weight matrix

Technical Field

The invention belongs to the technical field of computer-aided protein generation, and particularly relates to a protein generation method based on a position specificity weight matrix.

Background

Bioinformatics is a research hotspot in the intersection of life sciences and computer science. Bioinformatics research efforts have been widely used today for gene discovery and prediction, storage management of gene data, data retrieval and mining, gene expression data analysis, protein structure prediction, gene and protein homology prediction, sequence analysis and comparison, and the like. The genome defines all the proteins that make up the organism, and the gene defines the amino acid sequence that makes up the protein. Although proteins consist of linear sequences of amino acids, they can only have the corresponding activity and the corresponding biological function if they are folded to form a specific spatial structure.

Protein generation has been one of the most actively studied directions in gene and protein characterization research, and almost all protein engineering involves modifying the structure of a native protein. Protein generation is important for protein structure research, protein evolution, therapeutic intervention in disease, and related areas. On the one hand, it can be used to explore and verify the correspondence between protein structure and function; on the other hand, it can enrich the diversity of natural protein databases.

Understanding the spatial structure of a protein helps explain not only what the protein does but also how it does it, so determining protein structure is important. Protein sequence databases are currently growing very quickly, but the number of proteins with known structures is relatively small. Despite significant advances in structure determination technology, determining a protein structure experimentally is still very complicated and costly. As a result, far fewer protein structures have been determined experimentally than there are known protein sequences.

In the field of protein generation, current mainstream methods fall into two categories: traditional statistical methods and machine learning methods. Traditional statistical methods, such as Markov probability models, use biostatistical methods and tools to find pattern information in protein sequences and are guided by biophysical formulas, but they cannot learn long-range interactions between amino acid residues. Machine learning methods, which mainly include long short-term memory (LSTM) networks, generative adversarial networks and graph neural networks (GNNs) and which obtain their features from pre-trained language models, generate proteins with higher accuracy and confidence than traditional statistical methods.

The introduction of GNNs and transformers makes it possible to discover long-range features, which greatly increases the accuracy of protein generation. At present, ProGen, published in 2020, is the best machine learning model in the field of protein generation, and the proteins it generates have been verified both by machine learning and by biological experiments. Its transformer architecture relies on the attention mechanism and dispenses with the recurrent structure of RNNs, so it can attend to long-range interactions in protein sequences.

However, the ProGen model, while highly accurate, also has some problems. It is a large-scale transformer language model with 1.2 billion parameters, pre-trained on a database of 280 million proteins. Compared with knowledge-based machine learning, its interpretability is weak. Current machine learning methods generally neglect biological pattern features, have large parameter counts and high training costs, and offer poor generalization and interpretability.

Disclosure of Invention

The invention provides a protein generation method based on a position-specific weight matrix, which strengthens the interpretability of a model by introducing a position-specific weight matrix (PSSM).

A method for protein generation based on a position-specific weight matrix, comprising:

(1) Obtaining a protein sequence set with a selected function as a sample set, and obtaining a training sample set from the sample set;

(2) Training a transformer model with a cross-entropy loss function on the training sample set and outputting an amino acid sequence probability distribution; performing top-k sampling on the amino acid sequence probability distribution to obtain a predicted amino acid sequence; performing multiple sequence alignment on the predicted amino acid sequence with the PSI-BLAST search method to obtain the position-specific weight matrix of the predicted amino acid sequence; performing information entropy calculation on the sequence alignment values in each row of the position-specific weight matrix to obtain an information entropy value for each position of the amino acid sequence; and averaging these information entropy values over the sequence length to obtain the weight-specificity matrix information value;

(3) Constructing a total loss function from the weight-specificity matrix information value and the cross-entropy loss function, and training the transformer model with the total loss function on the training sample set to update the model parameters, thereby obtaining an amino acid sequence generation model;

(4) In application, a leader sequence is used as input, an amino acid sequence is generated step by step with the amino acid sequence generation model and top-k sampling, and the generated amino acid sequence is folded with the trRosetta model to obtain the predicted secondary and tertiary structure of the protein.

The set of protein sequences with the selected function is obtained from a Kaggle or UniProt open dataset according to GO (Gene Ontology) labels.

The transformer model comprises an input embedding layer, an intermediate hidden layer and an output prediction layer connected in sequence, wherein the intermediate hidden layer is composed of a plurality of transformer modules, and each transformer module comprises a multi-head attention layer, a first Dropout layer, a first Add & Norm layer, a feedforward layer, a second Dropout layer and a second Add & Norm layer connected in sequence.

Obtaining the position-specific weight matrix by multiple sequence alignment of the predicted amino acid sequence against a protein database with the PSI-BLAST search method comprises: aligning the predicted amino acid sequence against the protein database with the PSI-BLAST search method and constructing an initial position-specific weight matrix from the alignment result; searching the protein database again with the initial position-specific weight matrix using the PSI-BLAST search method and constructing a new position-specific weight matrix from the new alignment result; and repeating this alignment until the alignment result no longer changes, which yields the position-specific weight matrix.

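The position-specific weight matrix P is:

$$P = \begin{pmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,20} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ r_{N,1} & r_{N,2} & \cdots & r_{N,20} \end{pmatrix}$$

where N is the length of the amino acid sequence and $r_{i,j}$ is the probability value of the j-th amino acid at the i-th position of the sequence, i.e. the sequence alignment value, with $i \in [1, N]$ and $j \in [1, 20]$.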
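A minimal sketch of how such an N × 20 matrix of probability values could be assembled, assuming an ungapped toy alignment and add-one pseudocounts. The function name `build_pssm`, the example sequences and the pseudocount are illustrative assumptions, not part of the disclosed method; PSI-BLAST itself derives its matrix with sequence weighting and reports log-odds scores rather than these raw frequencies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one column each


def build_pssm(aligned, pseudocount=1.0):
    """Return an N x 20 matrix whose row i holds residue probabilities r[i][j]."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        # Only the 20 standard residues contribute to the normalising total.
        total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        pssm.append([(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS])
    return pssm


alignment = ["MKTA", "MKSA", "MRTA", "MKTG"]  # hypothetical toy alignment
P = build_pssm(alignment)
print(len(P), len(P[0]))  # N rows, 20 columns
```

Each row sums to 1, so it can be read directly as the probability distribution over amino acids at that position.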

Information entropy is calculated over the sequence alignment values in each row of the position-specific weight matrix to obtain the information entropy value $P_i$ of the i-th position of the amino acid sequence:

$$P_i = -\sum_{j=1}^{20} r_{i,j} \log r_{i,j}$$

where $r_{i,j}$ is the probability value of the j-th amino acid at the i-th position of the sequence, i.e. the sequence alignment value, with $i \in [1, N]$ and $j \in [1, 20]$.

The weight-specificity matrix information value H, obtained by averaging the information entropy values over all positions of the amino acid sequence, is:

$$H = \frac{1}{N} \sum_{i=1}^{N} P_i = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{20} r_{i,j} \log r_{i,j}$$

where $P_i$ is the information entropy value of the i-th position of the amino acid sequence, N is the length of the amino acid sequence, and $r_{i,j}$ is the probability value of the j-th amino acid at the i-th position of the sequence, i.e. the sequence alignment value, with $i \in [1, N]$ and $j \in [1, 20]$.
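The row entropy and its average over the sequence length can be sketched as follows. The natural logarithm and the convention that a zero probability contributes nothing to the sum are assumptions, since the text does not state a log base; the function names are illustrative.

```python
import math


def row_entropy(row):
    """Shannon entropy of one row of probabilities (zero entries contribute nothing)."""
    return -sum(r * math.log(r) for r in row if r > 0.0)


def pssm_information_value(pssm):
    """Average of the per-position entropies over the N rows of the matrix."""
    return sum(row_entropy(row) for row in pssm) / len(pssm)


conserved = [1.0] + [0.0] * 19   # one residue always observed at this position
uniform = [1.0 / 20] * 20        # no preference among the 20 residues
H = pssm_information_value([conserved, uniform])
print(row_entropy(conserved), row_entropy(uniform), H)
```

A fully conserved position has zero entropy and a uniform position has entropy log 20, so a lower H corresponds to a matrix with more sharply defined residue preferences.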

The total loss function S is:

$$S = \lambda L + (1 - \lambda) H$$

where L is the cross-entropy loss, H is the weight-specificity matrix information value, λ is the adjusting parameter, $y^{(m)}$ is the predicted value of the m-th amino acid sequence, which L compares with the actual value of the m-th amino acid sequence, and M is the amino acid sequence index.
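A small sketch of the weighted combination S = λL + (1 − λ)H. The cross-entropy helper averages the negative log-probability of the actual residue over positions, which is one common form and an assumption here because the text does not give the exact expression for L; all names and numbers are illustrative.

```python
import math


def cross_entropy(predicted, actual_indices):
    """Mean negative log-probability assigned to the actual residue at each position.

    predicted: list of probability rows; actual_indices: index of the true residue per row.
    """
    return -sum(math.log(p[a]) for p, a in zip(predicted, actual_indices)) / len(actual_indices)


def total_loss(ce_loss, info_value, lam):
    """Weighted sum S = lam * L + (1 - lam) * H, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ce_loss + (1.0 - lam) * info_value


L = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])  # toy two-class predictions
S = total_loss(L, info_value=1.0, lam=0.75)            # hypothetical H and lambda
print(L, S)
```

Setting λ = 1 recovers plain cross-entropy training, while smaller values give the entropy term more influence.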

Compared with the prior art, the invention has the beneficial effects that:

The invention retains the advantages of the transformer model structure and introduces a position-specific weight matrix on top of it. During model training, generated proteins are aligned against a natural protein database by multiple sequence alignment and matched with similar sequences carrying the same motif information, which yields a position-specific weight matrix containing sequence pattern probability distribution information. The trained model therefore carries more biological pattern information and is more interpretable.

Drawings

FIG. 1 is a flow chart of a method for generating a protein based on a position-specific weight matrix according to an embodiment;

FIG. 2 is a block diagram of the process of training the transformer model according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.

The invention provides a protein generation method based on a transformer model combined with a position specificity weight matrix, which comprises the following steps:

S1: Using GO (Gene Ontology) labels, a set of protein sequences with the specific function to be generated is obtained from a Kaggle or UniProt open dataset, and the set is divided into a training sample set and a test sample set.

S2: With the training sample set as input, a transformer model is trained with a cross-entropy loss function to obtain an initial transformer model. The training samples are fed into the initial transformer model to obtain an amino acid sequence probability distribution, which contains the probabilities of the 20 standard amino acids and of the end-of-sequence symbol, and top-k sampling is performed on this distribution to obtain a predicted amino acid sequence.
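Top-k sampling over such a 21-way distribution (20 amino acids plus the end symbol) can be sketched as below. The function name, the value of k and the example distribution are illustrative assumptions; the text does not state which k is used.

```python
import random


def top_k_sample(probs, k, rng=random):
    """Sample an index from the k most probable entries, weighted by their probabilities."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and len(probs)")
    # Indices of the k largest probabilities; everything else is discarded.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # random.choices normalises the weights, so no explicit renormalisation is needed.
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]


# Hypothetical distribution over 20 amino acids plus an end-of-sequence symbol.
distribution = [0.4, 0.3, 0.2] + [0.1 / 18] * 18
index = top_k_sample(distribution, k=3, rng=random.Random(0))
print(index)
```

With k = 1 the procedure reduces to greedy decoding; larger k trades determinism for diversity in the generated sequences.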

S3: The predicted amino acid sequence is aligned against the protein database by multiple sequence alignment with the PSI-BLAST search method to obtain an initial PSSM matrix. The protein database is then searched again with the initial PSSM matrix using PSI-BLAST, i.e. another multiple sequence alignment is performed, and the PSSM matrix is rebuilt from the search results. This process is repeated until no new search results appear, i.e. until the PSSM matrix no longer changes, which gives the final PSSM matrix. The PSI-BLAST search method assigns higher scores to highly conserved protein sequences and can effectively find homologous proteins and distant homologues across species.

The initial PSSM matrix is obtained as follows: a PSI-BLAST search is run on the predicted amino acid sequence to obtain a series of similar amino acid sequences, these sequences are treated as a protein family, and a profile is constructed from them; the matrix form of this profile is the initial PSSM.

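The position-specific weight matrix P is:

$$P = \begin{pmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,20} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ r_{N,1} & r_{N,2} & \cdots & r_{N,20} \end{pmatrix}$$

Information entropy is calculated over the sequence alignment values in each row of the PSSM matrix to obtain the information entropy value $P_i$ of the i-th position of the amino acid sequence:

$$P_i = -\sum_{j=1}^{20} r_{i,j} \log r_{i,j}$$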

Information entropy quantifies the uncertainty of the pattern information contained in the protein. The goal is to generate proteins that carry as much pattern information from the protein database as possible, so the information entropy should be minimized to make the pattern as deterministic as possible. The weight-specificity matrix information value is therefore taken as the PSSM loss value.

The PSSM loss value is added to the cross-entropy loss function of the transformer model to construct the total loss function S:

$$S = \lambda L + (1 - \lambda) H$$

where L is the cross-entropy loss, computed between the predicted value and the actual value of the m-th amino acid sequence, M is the amino acid sequence index, H is the PSSM loss value, and λ is the adjusting parameter. The total loss function balances the PSSM loss value against the cross-entropy loss value through λ, and the invention determines the optimal value of λ by grid search.

The transformer model is then trained with the total loss function on the training sample set to update the model parameters, yielding the amino acid sequence generation model.

The invention uses the Adam optimizer to optimize the parameters. The number of epochs and the batch size are adjusted according to the actual training situation, and dropout and early stopping are used to prevent overfitting, with patience = 10 for early stopping and a dropout rate of 0.3 during training.
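The early-stopping rule can be sketched as follows, assuming the setting of 10 denotes the number of consecutive epochs without improvement in validation loss that is tolerated before training halts. That reading, the function name and the loss values are assumptions for illustration only.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs before the patience budget was exhausted.
    return best_epoch, len(val_losses) - 1


losses = [1.0, 0.8, 0.7] + [0.9] * 20  # hypothetical validation curve
best, stopped = early_stopping_epoch(losses, patience=10)
print(best, stopped)
```

In practice the parameters saved at the best epoch would be restored after stopping.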

S4: A leader sequence is first fed into the amino acid sequence generation model to obtain the probability distribution of the next amino acid given the leader sequence. Top-k sampling is applied to this distribution to obtain the amino acid that follows the leader sequence. The leader sequence extended by this amino acid is then fed into the model as the new input, and these steps are repeated until the end token of the protein sequence is produced, which completes the generation of the amino acid sequence and yields the new amino acid sequence with the highest score.
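The generation loop in S4 can be sketched as below. The trained model is replaced by a hypothetical stand-in, `toy_model`, that returns a next-token distribution for a given prefix; the vocabulary layout, `k`, `max_len` and the leader sequence are likewise assumptions for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
END = "<end>"
VOCAB = list(AMINO_ACIDS) + [END]  # 20 residues plus the end token


def generate(next_token_probs, leader, k=5, max_len=200, rng=random):
    """Extend `leader` one residue at a time with top-k sampling until the end token."""
    sequence = list(leader)
    while len(sequence) < max_len:
        probs = next_token_probs(sequence)
        top = sorted(range(len(VOCAB)), key=lambda i: probs[i], reverse=True)[:k]
        token = VOCAB[rng.choices(top, weights=[probs[i] for i in top], k=1)[0]]
        if token == END:
            break
        sequence.append(token)  # the extended sequence becomes the next input
    return "".join(sequence)


def toy_model(prefix):
    """Hypothetical stand-in for the trained model: favours 'A', ends at 8 residues."""
    if len(prefix) >= 8:
        return [0.0] * 20 + [1.0]
    probs = [0.01] * 21
    probs[0] = 0.8
    return probs


protein = generate(toy_model, "MK", k=3, rng=random.Random(0))
print(protein)
```

The `max_len` guard is a design choice added here so that generation terminates even if the model never emits the end token.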

S5: The confidence of the generated amino acid sequence is verified on the test sample set using sequence logos and trRosetta-based primary and secondary sequence verification. The generated amino acid sequence is folded with the trRosetta model to obtain the spatial stability of the generated sequence and predictions of its secondary and tertiary structures.

The transformer model provided by the application comprises a transformer encoder and a transformer decoder. Specifically, it consists of an input embedding layer, an intermediate hidden layer and an output prediction layer connected in sequence, wherein the intermediate hidden layer is composed of a plurality of transformer modules, and each transformer module comprises a multi-head attention layer, a first Dropout layer, a first Add & Norm layer, a feedforward layer, a second Dropout layer and a second Add & Norm layer connected in sequence.

The PSSM matrix provided by the application has the following calculation principle:

Protein sequences contain sequence fragments with specific patterns, which are called sequence motifs. Sequence motifs are closely related to biological function. For example, the N-glycosylation site motif always follows this pattern: it starts with Asn (asparagine), followed by any amino acid other than Pro (proline), followed by Ser (serine) or Thr (threonine), followed by any amino acid other than Pro.
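The four-residue pattern just described maps directly onto a regular expression over one-letter amino acid codes. The sketch below is illustrative only; the function name and the example sequence are made up, and a lookahead is used so that overlapping sites are all reported.

```python
import re

# N, then anything but P, then S or T, then anything but P.
# Wrapped in a lookahead so that overlapping occurrences are found.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glycosylation_sites(sequence):
    """Return (start_index, matched_fragment) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]


sites = find_n_glycosylation_sites("MNGSAANPTQNNTTK")  # hypothetical sequence
print(sites)  # [(1, 'NGSA'), (10, 'NNTT'), (11, 'NTTK')]
```

The fragment `NPTQ` at index 6 is correctly rejected because proline occupies the second position.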

Motifs are obtained from multiple sequence alignments. Their residues need not be contiguous in the amino acid sequence, but may be packed closely together in the three-dimensional structure. Traditional statistical methods and machine learning methods such as LSTM and GRU easily ignore this long-range information. Likewise, two sequences may be far apart in evolutionary terms yet share the same three-dimensional structure, which LSTM and GRU models also tend to miss because they neglect long-range information.

The present application uses PSI-BLAST (Position-Specific Iterated BLAST) for motif analysis. Unlike the BLAST search method, which is based on a query sequence, PSI-BLAST is based on a profile search. PSI-BLAST starts from the query sequence and obtains a series of similar sequences, which are treated as members of the same protein family, and then constructs a profile from them. The matrix form of this profile is called the position-specific scoring matrix (PSSM). In each round, PSI-BLAST searches the database with the current PSSM, rebuilds the PSSM from the search results, and then searches the database again with the new PSSM, and so on until no new results are produced. PSI-BLAST assigns higher scores to highly conserved protein sequences and can effectively find homologous proteins and distant homologues across species.
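The search, rebuild, search-again cycle can be illustrated with a deliberately simplified toy. It assumes ungapped sequences of equal length, add-one pseudocounts, an average log-odds score against a uniform background and a fixed score threshold. None of this reproduces the real PSI-BLAST algorithm, which uses gapped local alignment, E-value thresholds and sequence weighting; all names and data are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(seqs, pseudocount=1.0):
    """Per-position residue probabilities from equal-length, ungapped sequences."""
    profile = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (column.count(aa) + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


def score(profile, seq):
    """Average log-odds of seq under the profile versus a uniform 1/20 background."""
    return sum(math.log(profile[i][aa] * 20) for i, aa in enumerate(seq)) / len(seq)


def iterative_profile_search(query, database, threshold=0.25, max_rounds=10):
    """Repeat search and profile rebuild until the set of hits stops changing."""
    hits = [query]
    profile = build_profile(hits)
    for _ in range(max_rounds):
        new_hits = [query] + [s for s in database if s != query and score(profile, s) >= threshold]
        if set(new_hits) == set(hits):
            break  # converged: no new results
        hits = new_hits
        profile = build_profile(hits)
    return profile, hits


database = ["MKSA", "MRSA", "MRSG", "WWWW"]  # hypothetical toy database
profile, hits = iterative_profile_search("MKTA", database)
print(hits)
```

In this toy run `MRSG` scores below the threshold against the query alone but is picked up in the second round, once the closer relatives `MKSA` and `MRSA` have reshaped the profile. This is the mechanism by which profile iteration reaches distant homologues.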

The position-specific weight matrix is added to the machine learning model as follows: during training, protein sequences are aligned against a natural protein database by multiple sequence alignment and matched with similar sequences carrying the same motif information, which yields a PSSM matrix containing sequence pattern probability distribution information. This brings more biological pattern information into the model and makes the model more interpretable.

Claims

1. A method for generating a protein based on a position-specific weight matrix, comprising:

2. The method of claim 1, wherein the set of protein sequences with the selected function is obtained from a Kaggle or UniProt open dataset according to GO (Gene Ontology) labels.

3. The method for generating a protein based on a position-specific weight matrix according to claim 1, wherein the transformer model comprises an input embedding layer, an intermediate hidden layer and an output prediction layer connected in sequence, wherein the intermediate hidden layer comprises a plurality of transformer modules, and each of the transformer modules comprises a multi-head attention layer, a first Dropout layer, a first Add & Norm layer, a feedforward layer, a second Dropout layer and a second Add & Norm layer connected in sequence.

4. The method for generating a protein based on a position-specific weight matrix according to claim 1, wherein obtaining the position-specific weight matrix by multiple sequence alignment of the predicted amino acid sequence against a protein database with the PSI-BLAST search method comprises: aligning the predicted amino acid sequence against the protein database with the PSI-BLAST search method and constructing an initial position-specific weight matrix from the alignment result; searching the protein database again with the initial position-specific weight matrix using the PSI-BLAST search method and constructing a new position-specific weight matrix from the new alignment result; and repeating this alignment until the alignment result no longer changes, which yields the position-specific weight matrix.

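5. The method for generating a protein according to claim 1 or 4, wherein the position-specific weight matrix P is:

$$P = \begin{pmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,20} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ r_{N,1} & r_{N,2} & \cdots & r_{N,20} \end{pmatrix}$$

wherein N is the length of the amino acid sequence and $r_{i,j}$ is the sequence alignment value of the j-th amino acid at the i-th position of the sequence, $i \in [1, N]$, $j \in [1, 20]$.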

6. The method of claim 1, wherein information entropy is calculated over the sequence alignment values in each row of the position-specific weight matrix to obtain the information entropy value $P_i$ of the i-th position of the amino acid sequence:

$$P_i = -\sum_{j=1}^{20} r_{i,j} \log r_{i,j}$$

wherein $r_{i,j}$ is the sequence alignment value of the j-th amino acid at the i-th position of the sequence, $i \in [1, N]$, $j \in [1, 20]$.

7. The method for generating a protein based on a position-specific weight matrix according to claim 1, wherein the weight-specificity matrix information value H, obtained by averaging the information entropy values over all positions of the amino acid sequence, is:

$$H = \frac{1}{N} \sum_{i=1}^{N} P_i = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{20} r_{i,j} \log r_{i,j}$$

wherein $P_i$ is the information entropy value of the i-th position of the amino acid sequence, N is the length of the amino acid sequence, and $r_{i,j}$ is the sequence alignment value of the j-th amino acid at the i-th position of the sequence, $i \in [1, N]$, $j \in [1, 20]$.

8. The method of claim 1, wherein the total loss function S is:

$$S = \lambda L + (1 - \lambda) H$$

wherein L is the cross-entropy loss, H is the weight-specificity matrix information value, λ is the adjusting parameter, $y^{(m)}$ is the predicted value of the m-th amino acid sequence, which L compares with the actual value of the m-th amino acid sequence, and M is the amino acid sequence index.