CN115662501A - Protein generation method based on position specificity weight matrix - Google Patents


Info

Publication number
CN115662501A
Authority
CN
China
Prior art keywords
amino acid
sequence
acid sequence
protein
weight matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211312466.6A
Other languages
Chinese (zh)
Inventor
徐仁军
张之韵
李佳园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZJU Hangzhou Global Scientific and Technological Innovation Center
Original Assignee
ZJU Hangzhou Global Scientific and Technological Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZJU Hangzhou Global Scientific and Technological Innovation Center filed Critical ZJU Hangzhou Global Scientific and Technological Innovation Center
Priority to CN202211312466.6A priority Critical patent/CN115662501A/en
Publication of CN115662501A publication Critical patent/CN115662501A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a protein generation method based on a position-specific weight matrix. The method comprises: obtaining a training sample set; training a transformer model with a cross-entropy loss function on the training sample set and outputting an amino acid sequence probability distribution; obtaining a predicted amino acid sequence by top-k sampling; performing multiple sequence alignment on the predicted amino acid sequence with the PSI-BLAST method to obtain a position-specific weight matrix; computing the information entropy of the position-specific weight matrix to obtain a weight specificity matrix information value; constructing a total loss function from the weight specificity matrix information value and the cross-entropy loss function, and updating the model parameters with the total loss function on the training sample set to obtain an amino acid sequence generation model; and, taking a leader sequence as input, generating an amino acid sequence step by step with the amino acid sequence generation model and folding it with the trRosetta model to obtain the secondary and tertiary structure of the protein.

Description

Protein generation method based on position specificity weight matrix
Technical Field
The invention belongs to the technical field of computer-aided protein generation, and particularly relates to a protein generation method based on a position specificity weight matrix.
Background
Bioinformatics is a research hotspot in the intersection of life sciences and computer science. Bioinformatics research efforts have been widely used today for gene discovery and prediction, storage management of gene data, data retrieval and mining, gene expression data analysis, protein structure prediction, gene and protein homology prediction, sequence analysis and comparison, and the like. The genome defines all the proteins that make up the organism, and the gene defines the amino acid sequence that makes up the protein. Although proteins consist of linear sequences of amino acids, they can only have the corresponding activity and the corresponding biological function if they are folded to form a specific spatial structure.
Protein generation has been one of the most closely watched directions in gene and protein research, and almost all protein engineering involves modifying native protein structures. Protein generation is important for protein structure research, protein evolution, therapeutic intervention in disease, and similar areas. On the one hand, it can be used to explore and verify the correspondence between protein structure and function; on the other hand, it can enrich the diversity of natural protein databases.
Understanding the spatial structure of a protein helps explain not only what the protein does but also how it does it, so determining protein structure is important. Protein sequence databases are currently growing very quickly, but the number of proteins with known structures is relatively small. Despite significant advances in structure determination technology, determining a protein structure experimentally is still complicated and costly. As a result, there are far fewer experimentally determined protein structures than known protein sequences.
In the field of protein generation, the current mainstream methods fall into two categories: traditional statistical methods and machine learning methods. Traditional statistical methods, such as Markov probability models, use biostatistical methods and tools to find pattern information in protein sequences and are guided by biophysical formulas, but they cannot learn long-range interactions between amino acid residues. Machine learning methods generate proteins with higher accuracy and confidence than traditional statistical methods. They mainly include long short-term memory (LSTM) networks, generative adversarial networks (GANs) and graph neural networks (GNNs), and they obtain their features from a pre-trained language model.
The introduction of GNNs and transformers makes it possible to discover long-range features and greatly increases the accuracy of protein generation. ProGen, published in 2020, is currently the best machine learning model in the field of protein generation, and the proteins it generates have been verified both computationally and by biological experiments. Its transformer architecture relies on the attention mechanism and removes the recurrent structure of RNNs, so it can attend to long-range interactions in protein sequences.
However, although the ProGen model is highly accurate, it also has problems. ProGen is a large transformer-based language model with 1.2 billion parameters, pre-trained on a database of 280 million protein sequences. Compared with knowledge-guided machine learning, its interpretability is weak. Current machine learning methods generally neglect biological pattern features, have large parameter counts and high training costs, and also generalize poorly and are hard to interpret.
Disclosure of Invention
The invention provides a protein generation method based on a position-specific weight matrix, which strengthens the interpretability of a model by introducing a position-specific weight matrix (PSSM).
A method for protein generation based on a position-specific weight matrix, comprising:
(1) Obtaining a protein sequence set with a selected function as a sample set, and obtaining a training sample set from the sample set;
(2) Training a transformer model with a cross-entropy loss function on the training sample set while outputting an amino acid sequence probability distribution; performing top-k sampling on the probability distribution to obtain a predicted amino acid sequence; performing multiple sequence alignment on the predicted amino acid sequence with the PSI-BLAST search method to obtain the position-specific weight matrix of the predicted amino acid sequence; computing the information entropy of the sequence alignment values in each row of the position-specific weight matrix to obtain an information entropy value for each position of the amino acid sequence; and averaging the information entropy values over all positions to obtain the weight specificity matrix information value;
(3) Constructing a total loss function from the weight specificity matrix information value and the cross-entropy loss function, and training the transformer model with the total loss function on the training sample set to update the model parameters, yielding an amino acid sequence generation model;
(4) In application, taking a leader sequence as input, generating an amino acid sequence step by step with the amino acid sequence generation model and top-k sampling, and folding the generated amino acid sequence with the trRosetta model to obtain the predicted secondary and tertiary structure of the protein.
The set of protein sequences with the selected function is obtained from Kaggle or UniProt open datasets according to GO (Gene Ontology) annotations.
The transformer model comprises an input embedding layer, an intermediate hidden layer and an output prediction layer connected in sequence, wherein the intermediate hidden layer is composed of a plurality of transformer modules, and each transformer module comprises a multi-head attention layer, a first Dropout layer, a first Add & Norm layer, a feedforward layer, a second Dropout layer and a second Add & Norm layer connected in sequence.
Performing multiple sequence alignment of the predicted amino acid sequence against a protein database with the PSI-BLAST search method to obtain the position-specific weight matrix comprises: aligning the predicted amino acid sequence against the protein database with PSI-BLAST and constructing an initial position-specific weight matrix from the alignment result; aligning the initial position-specific weight matrix against the protein database with PSI-BLAST again and constructing a new position-specific weight matrix from the new alignment result; and repeating the alignment until the result no longer changes, which gives the position-specific weight matrix.
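The position-specific weight matrix P is:
$$
P = \begin{bmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,20} \\
r_{2,1} & r_{2,2} & \cdots & r_{2,20} \\
\vdots & \vdots & \ddots & \vdots \\
r_{N,1} & r_{N,2} & \cdots & r_{N,20}
\end{bmatrix}
$$
wherein N is the length of the amino acid sequence, and $r_{i,j}$ is the probability value of the j-th amino acid at the i-th position of the sequence, i.e. the sequence alignment value, $i \in [1, N]$, $j \in [1, 20]$.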
Computing the information entropy of the sequence alignment values in each row of the position-specific weight matrix gives the information entropy value $P_i$ at the i-th position of the amino acid sequence:
$$
P_i = -\sum_{j=1}^{20} r_{i,j} \log r_{i,j}
$$
wherein $r_{i,j}$ is the probability value of the j-th amino acid at the i-th position of the sequence, i.e. the sequence alignment value, $i \in [1, N]$, $j \in [1, 20]$.
The weight specificity matrix information value H, obtained by averaging the information entropy values over all positions of the amino acid sequence, is:
$$
H = \frac{1}{N} \sum_{i=1}^{N} P_i = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{20} r_{i,j} \log r_{i,j}
$$
wherein $P_i$ is the information entropy value at the i-th position of the amino acid sequence, N is the length of the amino acid sequence, and $r_{i,j}$ is the probability value of the j-th amino acid at the i-th position of the sequence, i.e. the sequence alignment value, $i \in [1, N]$, $j \in [1, 20]$.
The total loss function S is:
$$
S = \lambda L + (1 - \lambda) H
$$
$$
L = -\sum_{m=1}^{M} \hat{y}^{(m)} \log y^{(m)}
$$
wherein H is the weight specificity matrix information value, $\lambda$ is the adjusting parameter, $y^{(m)}$ is the predicted value for the m-th amino acid of the sequence, $\hat{y}^{(m)}$ is the actual value for the m-th amino acid of the sequence, and m is the amino acid index, running from 1 to M.
Compared with the prior art, the invention has the beneficial effects that:
The invention retains the advantages of the transformer architecture and introduces a position-specific weight matrix on top of it. During model training, protein sequences are aligned against a natural protein database by multiple sequence alignment and matched with similar sequences that share the same motif information, yielding a position-specific weight matrix that carries the probability distribution of sequence patterns. The trained model therefore contains more biological pattern information and is more interpretable.
Drawings
FIG. 1 is a flow chart of a method for generating a protein based on a position-specific weight matrix according to an embodiment;
FIG. 2 is a block diagram of the process of training the transformer model according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
The invention provides a protein generation method based on a transformer model combined with a position specificity weight matrix, which comprises the following steps:
S1: Using GO (Gene Ontology) annotations, a set of protein sequences with the specific function to be generated is obtained from Kaggle or UniProt open datasets and divided into a training sample set and a test sample set.
S2: With the training sample set as input, a transformer model is trained with a cross-entropy loss function to obtain an initial transformer model. The training samples are input into the initial transformer model to obtain an amino acid sequence probability distribution, which covers the 20 standard amino acids and a sequence end symbol. Top-k sampling is then applied to this distribution to obtain a predicted amino acid sequence.
S3: The predicted amino acid sequence is aligned against the protein database with the PSI-BLAST search method to obtain an initial PSSM. The initial PSSM is then used to search the protein database again with PSI-BLAST, i.e. another round of multiple sequence alignment, and the PSSM is rebuilt from the search results. This process is repeated until no new search results appear, i.e. the PSSM no longer changes, which gives the final PSSM. PSI-BLAST gives higher scores to highly conserved protein sequences and can effectively find homologous proteins and distant homologues across species.
The initial PSSM is obtained as follows: a PSI-BLAST search on the predicted amino acid sequence returns a series of similar amino acid sequences, which are treated as a protein family; a profile is then constructed from this family, and the matrix form of the profile is the initial PSSM.
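The profile-to-matrix step can be illustrated with a minimal sketch. It assumes the family has already been aligned to equal length and uses plain per-column frequencies with a pseudocount; real PSI-BLAST PSSMs are log-odds scores built with sequence weighting, so this is a simplification that matches the "probability value" reading of $r_{i,j}$ used in this document. The function name, the pseudocount and the example family are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_profile(aligned, pseudocount=1.0):
    """Return an N x 20 matrix of per-position amino acid frequencies."""
    length = len(aligned[0])
    profile = []
    for i in range(length):
        # Start every count at the pseudocount so no probability is exactly zero.
        counts = {a: pseudocount for a in AMINO_ACIDS}
        for seq in aligned:
            if seq[i] in counts:  # skip gaps and non-standard symbols
                counts[seq[i]] += 1
        total = sum(counts.values())
        profile.append([counts[a] / total for a in AMINO_ACIDS])
    return profile

# Hypothetical, already-aligned family of four short sequences.
family = ["MKTA", "MKSA", "MRTA", "MKTG"]
pssm = build_profile(family)
print(len(pssm), len(pssm[0]))  # rows = positions, columns = amino acids
```

Each row of `pssm` sums to 1, so it can be fed directly into an entropy calculation over positions.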
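The position-specific weight matrix P is:
$$
P = \begin{bmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,20} \\
r_{2,1} & r_{2,2} & \cdots & r_{2,20} \\
\vdots & \vdots & \ddots & \vdots \\
r_{N,1} & r_{N,2} & \cdots & r_{N,20}
\end{bmatrix}
$$
wherein N is the length of the amino acid sequence, and $r_{i,j}$ is the probability value of the j-th amino acid at the i-th position of the sequence, i.e. the sequence alignment value, $i \in [1, N]$, $j \in [1, 20]$.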
Computing the information entropy of the sequence alignment values in each row of the PSSM gives the information entropy value $P_i$ at the i-th position of the amino acid sequence:
$$
P_i = -\sum_{j=1}^{20} r_{i,j} \log r_{i,j}
$$
The weight specificity matrix information value H, obtained by averaging the information entropy values over all positions of the amino acid sequence, is:
$$
H = \frac{1}{N} \sum_{i=1}^{N} P_i = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{20} r_{i,j} \log r_{i,j}
$$
Information entropy quantifies the uncertainty of the pattern information contained in the protein. The goal is to generate proteins that carry as much pattern information from the protein database as possible, so each position should be as deterministic as possible, which corresponds to low entropy. The weight specificity matrix information value is therefore taken as the PSSM loss value.
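As a rough illustration of the entropy calculation, the sketch below computes a per-position Shannon entropy and its average over positions. It assumes each PSSM row is a probability distribution over the 20 amino acids and uses the natural logarithm; the source does not state the log base, and the function names are hypothetical.

```python
import math

def position_entropy(row):
    """Shannon entropy of one PSSM row (zero-probability terms contribute 0)."""
    return sum(-r * math.log(r) for r in row if r > 0)

def pssm_information_value(pssm):
    """Average of the per-position entropies over all N positions."""
    return sum(position_entropy(row) for row in pssm) / len(pssm)

# Two extreme positions: fully conserved and completely uniform.
conserved = [1.0] + [0.0] * 19
uniform = [1 / 20] * 20

print(position_entropy(conserved))                    # no uncertainty
print(position_entropy(uniform))                      # maximal uncertainty, log(20)
print(pssm_information_value([conserved, uniform]))   # average of the two
```

A conserved position scores 0 and a uniform one scores log 20, so a lower average indicates a more deterministic, motif-like matrix.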
The PSSM loss value is added to the cross-entropy loss function of the transformer model to construct the total loss function S:
$$
S = \lambda L + (1 - \lambda) H
$$
$$
L = -\sum_{m=1}^{M} \hat{y}^{(m)} \log y^{(m)}
$$
wherein H is the weight specificity matrix information value, $\lambda$ is the adjusting parameter, $y^{(m)}$ is the predicted value for the m-th amino acid of the sequence, $\hat{y}^{(m)}$ is the actual value for the m-th amino acid of the sequence, and m is the amino acid index, running from 1 to M. The total loss function balances the PSSM loss value and the cross-entropy loss value through $\lambda$, and the invention determines the optimal value of $\lambda$ by grid search.
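The weighting of the two terms can be sketched in a few lines. The sketch assumes one-hot targets, so the cross-entropy reduces to the negative log-probability of the true token, summed over positions without normalisation; the source does not specify the normalisation. The function names, toy probabilities and the value of `lam` are illustrative only.

```python
import math

def cross_entropy(predicted, targets):
    """Sum of negative log-probabilities assigned to the true token at each position."""
    return -sum(math.log(probs[t]) for probs, t in zip(predicted, targets))

def total_loss(ce_loss, pssm_value, lam):
    """S = lam * L + (1 - lam) * H, with lam in [0, 1]."""
    return lam * ce_loss + (1 - lam) * pssm_value

# Toy example: two positions, three-token vocabulary.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [0, 1]
ce = cross_entropy(predicted, targets)

# Candidate values that a grid search over lam might try.
for lam in (0.25, 0.5, 0.75):
    print(lam, total_loss(ce, pssm_value=1.0, lam=lam))
```

In a grid search, each candidate `lam` would be used for a full training run and the value giving the best validation result kept.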
The transformer model is then trained with the total loss function on the training sample set to update the model parameters, yielding the amino acid sequence generation model.
The invention uses the Adam optimizer to optimize the parameters. The number of epochs and the batch size are adjusted according to the actual training situation, and dropout and early stopping are used to prevent overfitting, with the early-stopping patience set to 10 and the dropout rate set to 0.3 during training.
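Early stopping can be sketched as a small counter over validation losses. The sketch assumes that the value 10 given above refers to the early-stopping patience, i.e. the number of consecutive epochs without improvement that are tolerated; the class name and the loss values are hypothetical, and a smaller patience is used in the demo to keep it short.

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Demo with made-up validation losses and a patience of 3.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In practice the model checkpoint from the best epoch would be restored after stopping.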
S4: The leader sequence is first input into the amino acid sequence generation model to obtain the predicted amino acid probability distribution conditioned on the leader sequence. Top-k sampling is applied to this distribution to obtain the next amino acid after the leader sequence. The leader sequence together with the new amino acid is then input into the model as the new sequence, and these steps are repeated until the end token of the protein sequence is produced, which completes the generation; the new amino acid sequence with the highest score is kept.
S5: Sequence logo and trRosetta are used for primary- and secondary-level verification of the confidence of the generated amino acid sequence against the test sample set. The generated amino acid sequence is folded with the trRosetta model to obtain its spatial stability and the predicted secondary and tertiary structures.
The transformer model provided by the application comprises a transformer encoder and a transformer decoder. Structurally it consists of an input embedding layer, an intermediate hidden layer and an output prediction layer connected in sequence, wherein the intermediate hidden layer is composed of a plurality of transformer modules, and each transformer module comprises a multi-head attention layer, a first Dropout layer, a first Add & Norm layer, a feedforward layer, a second Dropout layer and a second Add & Norm layer connected in sequence.
The PSSM matrix provided by the application has the following calculation principle:
Protein sequences contain sequence fragments with specific patterns, called sequence motifs. Sequence motifs are closely related to biological function. For example, the N-glycosylation site motif always follows this pattern: Asn (asparagine), followed by any amino acid other than Pro (proline), followed by Ser (serine) or Thr (threonine), followed by any amino acid other than Pro.
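The N-glycosylation pattern described above maps directly onto a regular expression. The following sketch scans a one-letter amino acid string for that pattern; the function name and the example sequence are made up for illustration.

```python
import re

# N, then any residue except P, then S or T, then any residue except P.
# The lookahead makes overlapping sites visible to finditer.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(sequence):
    """Return (0-based start index, matched 4-residue fragment) for each site."""
    return [(m.start(), m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]

sites = find_n_glycosylation_sites("MKNGSAPNPTANNTSG")
print(sites)
# [(2, 'NGSA'), (11, 'NNTS'), (12, 'NTSG')]
```

The fragment starting at index 7 (`NPTA`) is rejected because Pro follows Asn, which is exactly the exclusion the motif describes.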
Motifs are obtained from multiple sequence alignments. Their residues need not be contiguous in the amino acid sequence but may lie close together in the three-dimensional structure. Traditional statistical methods and machine learning methods such as LSTM and GRU easily miss this long-range information. Likewise, two sequences may be far apart in evolutionary terms yet share the same three-dimensional structure, which LSTM and GRU models, with their weak handling of long-range information, also tend to miss.
The present application uses PSI-BLAST (Position-Specific Iterated BLAST) for motif analysis. Unlike standard BLAST, which searches with the query sequence alone, PSI-BLAST searches with a profile. It starts from the query sequence, obtains a series of similar sequences, treats them as one protein family, and constructs a profile from them. The matrix form of this profile is called a position-specific scoring matrix (PSSM). In each round PSI-BLAST searches the database with the current PSSM, rebuilds the PSSM from the search results, and searches again with the new PSSM, until no new results are found. PSI-BLAST gives higher scores to highly conserved protein sequences and can effectively find homologous proteins and distant homologues across species.
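For orientation, an iterated search of this kind is typically launched through the NCBI BLAST+ `psiblast` command-line tool. The sketch below only assembles the argument list and does not run anything; the file names, database name, iteration count and E-value are placeholders, and the patent does not state which settings were actually used.

```python
import shlex

def build_psiblast_command(query_fasta, database, pssm_out, iterations=3, evalue=0.001):
    """Assemble a psiblast invocation that writes an ASCII PSSM for the query."""
    return [
        "psiblast",
        "-query", query_fasta,               # FASTA file with the generated sequence
        "-db", database,                     # name of a formatted protein BLAST database
        "-num_iterations", str(iterations),  # number of profile-refinement rounds
        "-evalue", str(evalue),              # E-value cutoff for reported hits
        "-out_ascii_pssm", pssm_out,         # text file receiving the final PSSM
    ]

# Placeholder paths and database name, for illustration only.
cmd = build_psiblast_command("generated.fasta", "swissprot", "generated.pssm")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine with BLAST+ and a local protein database installed. The BLAST+ documentation describes `-num_iterations 0` as running until convergence, which would correspond to the "until no new results" criterion above; this is worth confirming against the installed version.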
Adding the position-specific weight matrix to a machine learning model means that, during training, protein sequences are aligned against a natural protein database by multiple sequence alignment and matched with similar sequences that share the same motif information, yielding a PSSM that carries the probability distribution of sequence patterns. The model thus contains more biological pattern information and becomes more interpretable.

Claims (8)

1. A method for generating a protein based on a position-specific weight matrix, comprising:
(1) Obtaining a protein sequence set with a selected function as a sample set, and obtaining a training sample set from the sample set;
(2) Training a transformer model with a cross-entropy loss function on the training sample set while outputting an amino acid sequence probability distribution; performing top-k sampling on the probability distribution to obtain a predicted amino acid sequence; performing multiple sequence alignment on the predicted amino acid sequence with the PSI-BLAST search method to obtain the position-specific weight matrix of the predicted amino acid sequence; computing the information entropy of the sequence alignment values in each row of the position-specific weight matrix to obtain an information entropy value for each position of the amino acid sequence; and averaging the information entropy values over all positions to obtain the weight specificity matrix information value;
(3) Constructing a total loss function from the weight specificity matrix information value and the cross-entropy loss function, and training the transformer model with the total loss function on the training sample set to update the model parameters, yielding an amino acid sequence generation model;
(4) When the method is applied, a leader sequence is used as input, the generation of an amino acid sequence is completed through an amino acid sequence generation model and top-K sampling in sequence, and the generated amino acid sequence is folded through a trRosetta model to obtain a predicted protein secondary and tertiary structure.
2. The method of claim 1, wherein the set of protein sequences with the selected function is obtained from Kaggle or UniProt open datasets according to GO (Gene Ontology) annotations.
3. The method for generating a protein based on a position-specific weight matrix according to claim 1, wherein the transformer model comprises an input embedding layer, an intermediate hidden layer and an output prediction layer connected in sequence, wherein the intermediate hidden layer comprises a plurality of transformer modules, and each of the transformer modules comprises a multi-head attention layer, a first Dropout layer, a first Add & Norm layer, a feedforward layer, a second Dropout layer and a second Add & Norm layer connected in sequence.
4. The method for generating a protein based on a position-specific weight matrix according to claim 1, wherein performing multiple sequence alignment of the predicted amino acid sequence against a protein database with the PSI-BLAST search method to obtain the position-specific weight matrix comprises: aligning the predicted amino acid sequence against the protein database with PSI-BLAST and constructing an initial position-specific weight matrix from the alignment result; aligning the initial position-specific weight matrix against the protein database with PSI-BLAST again and constructing a new position-specific weight matrix from the new alignment result; and repeating the alignment until the result no longer changes, which gives the position-specific weight matrix.
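5. The method for generating a protein according to claim 1 or 4, wherein the position-specific weight matrix P is:
$$
P = \begin{bmatrix}
r_{1,1} & r_{1,2} & \cdots & r_{1,20} \\
r_{2,1} & r_{2,2} & \cdots & r_{2,20} \\
\vdots & \vdots & \ddots & \vdots \\
r_{N,1} & r_{N,2} & \cdots & r_{N,20}
\end{bmatrix}
$$
wherein N is the length of the amino acid sequence, and $r_{i,j}$ is the sequence alignment value of the j-th amino acid at the i-th position of the sequence, $i \in [1, N]$, $j \in [1, 20]$.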
6. The method of claim 1, wherein the information entropy of the sequence alignment values in each row of the position-specific weight matrix is calculated to obtain the information entropy value $P_i$ at the i-th position of the amino acid sequence as:
$$
P_i = -\sum_{j=1}^{20} r_{i,j} \log r_{i,j}
$$
wherein $r_{i,j}$ is the sequence alignment value of the j-th amino acid at the i-th position of the sequence, $i \in [1, N]$, $j \in [1, 20]$.
7. The method for generating a protein based on a position-specific weight matrix according to claim 1, wherein the weight specificity matrix information value H, obtained by averaging the information entropy values over all positions of the amino acid sequence, is:
$$
H = \frac{1}{N} \sum_{i=1}^{N} P_i = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{20} r_{i,j} \log r_{i,j}
$$
wherein $P_i$ is the information entropy value at the i-th position of the amino acid sequence, N is the length of the amino acid sequence, and $r_{i,j}$ is the sequence alignment value of the j-th amino acid at the i-th position of the sequence, $i \in [1, N]$, $j \in [1, 20]$.
8. The method of claim 1, wherein the total loss function S is:
$$
S = \lambda L + (1 - \lambda) H
$$
$$
L = -\sum_{m=1}^{M} \hat{y}^{(m)} \log y^{(m)}
$$
wherein H is the weight specificity matrix information value, $\lambda$ is the adjustment parameter, $y^{(m)}$ is the predicted value for the m-th amino acid of the sequence, $\hat{y}^{(m)}$ is the actual value for the m-th amino acid of the sequence, and m is the amino acid index, running from 1 to M.
CN202211312466.6A 2022-10-25 2022-10-25 Protein generation method based on position specificity weight matrix Pending CN115662501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211312466.6A CN115662501A (en) 2022-10-25 2022-10-25 Protein generation method based on position specificity weight matrix

Publications (1)

Publication Number Publication Date
CN115662501A true CN115662501A (en) 2023-01-31

Family

ID=84992171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211312466.6A Pending CN115662501A (en) 2022-10-25 2022-10-25 Protein generation method based on position specificity weight matrix

Country Status (1)

Country Link
CN (1) CN115662501A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117476106A (en) * 2023-12-26 2024-01-30 西安慧算智能科技有限公司 Multi-class unbalanced protein secondary structure prediction method and system
CN117476106B (en) * 2023-12-26 2024-04-02 西安慧算智能科技有限公司 Multi-class unbalanced protein secondary structure prediction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination