CN117037897B - Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding - Google Patents

Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Info

Publication number
CN117037897B
CN117037897B CN202310878264.6A
Authority
CN
China
Prior art keywords
amino acid
protein
peptide
mhc class
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310878264.6A
Other languages
Chinese (zh)
Other versions
CN117037897A (en)
Inventor
Wang Fuxu (王福旭)
Zang Tianyi (臧天仪)
Wang Haoyan (王皓俨)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310878264.6A priority Critical patent/CN117037897B/en
Publication of CN117037897A publication Critical patent/CN117037897A/en
Application granted granted Critical
Publication of CN117037897B publication Critical patent/CN117037897B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 - ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 - Drug targeting using structural data; Docking or binding prediction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Peptides Or Proteins (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a method for predicting the affinity of peptides for MHC class I proteins based on protein domain feature embedding. The method uses multi-head attention to learn peptide-bond and amino-acid-residue features and predicts peptide-MHC class I protein affinity; compared with other existing methods, the prediction results are accurate and meet practical needs.

Description

Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding
Technical Field
The invention relates to the technical field of peptide and MHC class I protein affinity prediction, in particular to a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding.
Background
The binding affinity of peptides to MHC class I proteins plays a vital role in oncology, vaccine development, early diagnosis of immune diseases, screening for graft rejection, biochemistry, and neuroscience. In tumor drug and vaccine development, changes in the binding affinity of peptides to MHC class I molecules can alter antigen presentation and recognition and thereby the effect of tumor immunotherapy; in early diagnosis of immune diseases and screening for rejection, peptide-MHC class I affinity can be used to predict autoimmune-disease peptide fragments and thus to diagnose immune diseases and rejection reactions; in biochemistry and neuroscience, changes in peptide binding affinity to MHC class I proteins can also affect the function and activity of neurons and glial cells, while helping to better understand the mechanisms of bioengineering and immune adaptation.
With the development of neoantigen cancer vaccines, identifying neoantigens effectively, accurately, and rapidly has become an urgent problem in the fight against cancer, and efficient prediction of peptide affinity for MHC class I molecules is the basis of efficient neoantigen recognition. With the rapid development of sequencing technology, large numbers of protein sequences have been determined, and abundant sequencing data and tumor immunity data are available as raw material. How to use these protein sequencing data effectively and construct a peptide-MHC class I molecule affinity prediction and analysis method that identifies antigenic peptides rapidly and accurately is a common need of researchers in the field.
Studies have shown that some tumors carry a higher mutational burden than others, so the immunogenic response induced by a neoantigen vaccine may differ across cancer types. Before cancer genome sequencing became available, it was difficult to determine the specific neoantigens of each cancer type. Traditional neoantigen identification typically relies on screening single cDNA libraries, which is very inefficient. The development of second-generation sequencing technology has accelerated neoantigen identification: identified tumor-specific gene mutations are now widely available, and a number of tools have been developed to predict peptide-MHC affinity from them. For example, NetMHC, an algorithm that predicts peptide affinity for MHC class I molecules with a feedforward neural network, is the most widely used allele-specific model. NetMHCpan, a pan-specific model not limited to a particular MHC allele, uses a traditional neural network with a single hidden layer.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding. The method uses multi-head attention to learn peptide-bond and amino-acid-residue features and thereby predicts peptide-MHC class I protein affinity.
The invention is realized by the following technical scheme, and provides a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding, which comprises the following steps:
step 1, constructing a protein domain token dictionary;
step 2, using the given unique MHC class I protein identifier (ID), looking up the corresponding amino acid sequence;
Step 3, obtaining the peptide sequence and the MHC class I protein domain sequence and tokenizing both: after the MHC class I protein amino acid sequence is obtained, the sequence is processed further; the start and end positions of all domains of the MHC class I protein are obtained with hmmscan, the domain amino acid sequences are extracted from these known start and end positions, and the domain amino acid sequences are tokenized according to the autonomously constructed protein domain token dictionary;
Step 4, constructing an amino acid token embedding model;
Step 5, extracting peptide sequence and MHC class I protein amino acid token embedding features, expressed as a peptide-MHC class I protein binding embedding matrix;
step 6, predicting the binding affinity of the peptide to the MHC class I protein.
Further, in step 1, the most frequent adjacent amino acid pairs in the protein domain amino acid sequences are counted to form amino acid tokens, and the top 10000 tokens are taken to form the protein domain token dictionary.
Further, in step 2, binding affinity data in the immune epitope database are used; the binding affinity is expressed as the half-inhibitory concentration in nanomolar, and the half-inhibitory concentration value is converted to a value in the interval 0 to 1 with the formula:

transformed affinity = 1 - log(affinity) / log(50000)

wherein affinity is the experimentally measured binding affinity (IC50, in nM) of the peptide to the MHC class I molecule.
Further, in step 3, the peptide sequence and the MHC class I protein molecule amino acid sequence are tokenized based on the autonomously constructed segmentation dictionary; the amino acid tokens are formed by counting the most frequent adjacent amino acid pairs in the protein domain sequences, and the top 10000 tokens are taken to form the protein domain token dictionary; with 10000 tokens, most entries of the dictionary are 3 or 4 amino acid letters long; these protein domain tokens are conserved because they are well adapted to their environment, and can therefore carry the evolutionary characteristics of the protein. The tokenized sequences are expressed as <t1^1, t2^1, …, tm^1> and <t1^2, t2^2, …, tn^2>, wherein superscript 1 denotes the peptide sequence, superscript 2 denotes the MHC class I protein amino acid sequence, and the subscript indexes the amino acid tokens; the two are combined into one sequence by inserting special tokens:

[CLS] t1^1 t2^1 … tm^1 [SEP] t1^2 t2^2 … tn^2 [EOS]

wherein [CLS], [SEP], and [EOS] are special tokens denoting category, separator, and ending, respectively; the maximum combined length of the peptide sequence and the MHC class I protein molecule amino acid sequence is normalized to 512.
Further, in step 4, a peptide-MHC class I protein affinity prediction model based on protein domain feature embedding is constructed on the Bert model: the model learns deep representations of the protein amino acid sequences in the Uniprot database by pre-training, and the fine-tuned model computes the feature-space distance between the peptide sequence and the MHC class I protein to represent their affinity; the model uses the LAMB optimizer, with the optimizer hyperparameters set to their default values, namely β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay rate λ = 0.01.
Further, in step 5, peptide and MHC class I protein amino acid sequence embedding features are extracted with a multi-head attention mechanism. Given an amino acid token vector list X = <x1, x2, …, xn>, each token vector xi is first processed by multi-head attention, which identifies and attends to positions in X according to the computed correlation between xi and its surrounding tokens; the information of the tokens before and after each vector xi in X is encoded into an output vector yi, weighted by its relevance to xi. The initial vector xi is then added to the output vector yi, the combined vector yi is normalized, and features are extracted by a fully connected feedforward network that uses the GeLU function as activation; the GeLU function is:

GeLU(x) = x · P(X ≤ x)

wherein X ~ N(μ, σ²); μ and σ are parameters determined by validation experiments and are set to μ = 0 and σ = 1;
Each vector yi independently generates an output vector zi through the same feedforward network; finally, yi is added to zi and the sum is normalized, giving the vector list Z = <z1, z2, …, zn> of the whole multi-head-attention-based protein feature embedding method;
The attention mechanism formulas are as follows:

Multihead(Q, K, V) = Concat(head_1, …, head_n) W^O

wherein head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V);

the Attention(Q, K, V) term is computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

In these formulas, Q, K, and V are the matrices formed by the query, key, and value vectors of the amino acid token embeddings, obtained by multiplying the token embeddings by the transformation matrices W_Q, W_K, and W_V, respectively; d_k denotes the dimension of the amino acid token key vectors; head_i denotes the attention output of the i-th attention head;
The multi-head attention mechanism captures the attention relations among amino acid tokens by computing the attention values between them, then adds sequence position information and encodes it to obtain the final embedded features; the amino acid sequence embedding features are expressed as a 512 × 768 matrix, and the multi-head attention mechanism is used for training.
Further, in step 6, after the peptide sequence and MHC class I protein domain amino acid sequence embedding feature vector E = <e0, e1, …, e_{m+n+3}> is obtained from the multi-head attention mechanism, a prediction vector F is generated with a feedforward neural network and normalized by a Softmax function to produce the predicted peptide-MHC class I protein affinity value; the output of the whole peptide-MHC class I protein binding affinity model is a 512 × 768 matrix, where 768 is the vector dimension and 512 corresponds to the number of input amino acid tokens; the first vector corresponds to the relations among the amino acid tokens in the sequence, and the whole output matrix is the feature matrix of peptide-MHC class I protein molecule binding affinity.
Further, peptide-MHC class I protein binding affinity feature extraction is realized by a multi-head self-attention and fully connected neural network model; a protein domain token dictionary is constructed, and multi-head attention is used to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
The beneficial effects of the invention are as follows:
The invention provides a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding, which uses multi-head attention to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
Drawings
FIG. 1 is a block diagram of a method for predicting affinity of peptides to MHC class I proteins based on the characteristic insertion of protein domains according to the present invention;
FIG. 2 is a diagram showing the comparison of the predicted result of the present invention with other reference methods.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-2, the present invention provides a method for predicting the affinity of peptides for MHC class I proteins based on protein domain feature embedding, the method comprising the steps of:
step 1, constructing a protein domain token dictionary;
In step 1, the most frequent adjacent amino acid pairs in the protein domain amino acid sequences are counted to form amino acid tokens, and the top 10000 tokens are taken to form the protein domain token dictionary;
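The patent gives no code for this step; the following is a minimal sketch of the described pair-counting vocabulary construction (byte-pair-encoding style), assuming the domain sequences are plain amino acid strings. All function and variable names are illustrative, not from the patent.

```python
from collections import Counter

def build_domain_vocab(domain_seqs, vocab_size=10000, num_merges=20000):
    """BPE-style sketch: repeatedly merge the most frequent adjacent
    symbol pair in the domain sequences, then keep the top tokens."""
    corpus = [list(seq) for seq in domain_seqs]  # start from single letters
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        merged = a + b
        # Replace every occurrence of the most frequent pair with the merged token.
        for i, toks in enumerate(corpus):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(merged)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            corpus[i] = out
    token_freq = Counter(t for toks in corpus for t in toks)
    # The top-10000 tokens form the protein domain token dictionary.
    return [t for t, _ in token_freq.most_common(vocab_size)]

vocab = build_domain_vocab(["MAVMAPRTLVLLLSGALALTQTWAGS",
                            "MRVTAPRTVLLLLSGALALTETWAGS"])
```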
step 2, using the given unique MHC class I protein identifier (ID), looking up the corresponding amino acid sequence;
in step 2, the immune epitope database is used, which contains binding affinity data and eluted ligand data; the binding affinity data, in which the binding affinity of a peptide to an MHC class I molecule is expressed as the half-inhibitory concentration (50% inhibiting concentration, IC50) in nanomolar (nM), were used; the affinity values were mapped to the range 0 to 1 for training and testing of the prediction method. The half-inhibitory concentration is converted to a value in the interval 0 to 1 using the following formula:

transformed affinity = 1 - log(affinity) / log(50000)

wherein affinity is the measured binding affinity (IC50, in nM) of the peptide to the MHC class I molecule. The lower the half-inhibitory concentration, the stronger the binding of the peptide to the MHC class I molecule: a peptide binds strongly when the affinity value is below 50 nM, i.e. the transformed value is above 0.638; it binds weakly when the affinity value is below 500 nM, i.e. the transformed value is above 0.426; and it does not bind when the affinity value is above 500 nM, i.e. the transformed value is below 0.426.
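For illustration, a minimal sketch of this transformation and the binding categories it implies; the base-50000 logarithm follows from the stated cut-offs (50 nM maps to 0.638 and 500 nM to 0.426):

```python
import math

def transform_affinity(ic50_nm: float) -> float:
    """Map an IC50 value in nM to the 0-1 training range: 1 - log(IC50)/log(50000)."""
    return 1.0 - math.log(ic50_nm) / math.log(50000.0)

def binding_category(ic50_nm: float) -> str:
    t = transform_affinity(ic50_nm)
    if t > 0.638:   # IC50 below 50 nM
        return "strong binder"
    if t > 0.426:   # IC50 below 500 nM
        return "weak binder"
    return "non-binder"

assert abs(transform_affinity(50) - 0.638) < 1e-3
assert abs(transform_affinity(500) - 0.426) < 1e-3
```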
Step 3, obtaining the peptide sequence and the MHC class I protein domain sequence, and tokenizing the peptide sequence and the MHC class I protein domain sequence;
After the amino acid sequence of the MHC class I protein is obtained, it is processed further to obtain the domain amino acid sequences within the protein amino acid sequence; the start and end positions of the MHC class I protein domains are obtained with hmmscan, all required MHC class I protein domain amino acid sequences are extracted from the known start and end positions, and the amino acid sequences are tokenized with the protein domain token dictionary;
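As a sketch of this step (not code from the patent): hmmscan from the HMMER suite can be run with its per-domain table output and the alignment coordinates parsed from it. The column positions below follow HMMER's --domtblout format; the file and database names are illustrative.

```python
import subprocess

def extract_domains(fasta_path: str, hmm_db: str, seq: str):
    """Run hmmscan and slice out domain subsequences by their start/end positions."""
    subprocess.run(
        ["hmmscan", "--domtblout", "domains.tbl", hmm_db, fasta_path],
        check=True,
    )
    domains = []
    with open("domains.tbl") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            # Fields 18 and 19 (1-based) hold the alignment start/end on the query.
            start, end = int(fields[17]), int(fields[18])
            domains.append(seq[start - 1:end])  # 1-based, inclusive coordinates
    return domains
```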
In step 3, the peptide sequence and the MHC class I protein molecule amino acid sequence are tokenized based on the autonomously constructed segmentation dictionary. The protein domain amino acid tokens are formed by counting the most frequent adjacent amino acid pairs in the protein domain sequences, and the top 10000 tokens are taken to form the protein domain token dictionary. With 10000 tokens, most entries of the dictionary are 3 or 4 amino acid letters long. These protein domain tokens are conserved because they are well adapted to their environment, and can therefore carry the evolutionary characteristics of the protein. The tokenized sequences are expressed as <t1^1, t2^1, …, tm^1> and <t1^2, t2^2, …, tn^2>, where the superscript distinguishes the peptide sequence (1) from the MHC class I protein amino acid sequence (2) and the subscript indexes the amino acid tokens. Because the peptide-MHC class I protein molecule affinity prediction method takes the peptide sequence and the MHC class I protein amino acid sequence in turn, a [SEP] token is inserted between the two sequences as a separator; the input of the method is the pair of protein token sequences, which, after separate tokenization, are combined into one sequence as follows:

[CLS] t1^1 t2^1 … tm^1 [SEP] t1^2 t2^2 … tn^2 [EOS]

After combination, the maximum length of the combined, marked amino acid sequence is normalized to 512; [CLS], [SEP], and [EOS] are special amino acid tokens denoting the category token, separator token, and ending token, respectively.
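A minimal sketch of this tokenization and combination, assuming a greedy longest-match against the domain token dictionary from step 1; the [PAD] token used to reach the fixed length of 512 is an assumption, as the patent does not name its padding token.

```python
def tokenize(seq: str, vocab: set, max_token_len: int = 4) -> list:
    """Greedy longest-match tokenization against the domain token dictionary."""
    tokens, i = [], 0
    while i < len(seq):
        for k in range(min(max_token_len, len(seq) - i), 0, -1):
            piece = seq[i:i + k]
            if k == 1 or piece in vocab:  # single letters are always valid
                tokens.append(piece)
                i += k
                break
    return tokens

def combine(peptide: str, mhc_domain_seq: str, vocab: set, max_len: int = 512) -> list:
    """[CLS] peptide tokens [SEP] MHC domain tokens [EOS], padded/truncated to 512."""
    seq = (["[CLS]"] + tokenize(peptide, vocab)
           + ["[SEP]"] + tokenize(mhc_domain_seq, vocab) + ["[EOS]"])
    seq = seq[:max_len]
    return seq + ["[PAD]"] * (max_len - len(seq))
```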
Step 4, constructing a peptide-MHC class I protein affinity prediction model based on protein domain feature embedding on the Bert model; the Bert model computes the feature-space distance between the peptide sequence and the MHC class I protein to represent the peptide's affinity for the MHC class I protein. The model uses the LAMB optimizer, an adaptive large-batch optimization technique, with the optimizer hyperparameters set to their default values, namely β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay rate λ = 0.01.
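LAMB is not part of core PyTorch; as one possible setup (an assumption, not the patent's code), the Lamb implementation from the third-party torch-optimizer package accepts the stated defaults:

```python
import torch
import torch_optimizer  # third-party package providing a LAMB implementation

model = torch.nn.Linear(768, 768)  # placeholder for the Bert-based affinity model
optimizer = torch_optimizer.Lamb(
    model.parameters(),
    lr=1e-4,                 # learning rate: not specified in the patent
    betas=(0.9, 0.999),      # β1, β2 as stated
    eps=1e-8,                # ε as stated
    weight_decay=0.01,       # weight decay rate λ as stated
)
```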
Step 5, extracting peptide sequence and MHC class I protein amino acid token embedding features, expressed as a peptide-MHC class I protein binding embedding matrix. In step 5, the peptide and MHC class I protein amino acid sequence embedding features are extracted with a multi-head attention mechanism. Given an amino acid token vector list X = <x1, x2, …, xn>, each token vector xi is first processed by multi-head attention, which identifies and attends to positions in X according to the computed correlation between xi and its surrounding tokens; the information of the tokens before and after each vector xi in X is encoded into an output vector yi, weighted by its relevance to xi. The initial vector xi is then added to the output vector yi, the combined vector yi is normalized, and features are extracted by a fully connected feedforward network that uses the GeLU function as activation. The GeLU function has been shown to perform best in multi-head attention models and avoids the vanishing gradient problem; it is defined as:

GeLU(x) = x · P(X ≤ x)

wherein X ~ N(μ, σ²); μ and σ are parameters determined by validation experiments and are set to μ = 0 and σ = 1;
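With μ = 0 and σ = 1, P(X ≤ x) is the standard normal CDF Φ(x), so GeLU can be computed exactly via the error function; a small sketch:

```python
import math

def gelu(x: float) -> float:
    """Exact GeLU: x * Φ(x), with Φ the standard normal CDF (μ = 0, σ = 1)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(1.0))  # ≈ 0.8413
```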
Each vector yi independently generates an output vector zi through the same feedforward network; finally, yi is added to zi and the sum is normalized, giving the vector list Z = <z1, z2, …, zn> of the whole multi-head-attention-based protein feature embedding method;
The attention mechanism formulas are as follows:

Multihead(Q, K, V) = Concat(head_1, …, head_n) W^O

wherein head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V);

the Attention(Q, K, V) term is computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

In these formulas, Q, K, and V are the matrices formed by the query, key, and value vectors of the amino acid token embeddings, obtained by multiplying the token embeddings by the transformation matrices W_Q, W_K, and W_V, respectively; d_k denotes the dimension of the amino acid token key vectors; head_i denotes the attention output of the i-th attention head;
The multi-head attention mechanism captures the attention relations among amino acid tokens by computing the attention values between them, then adds sequence position information and encodes it to obtain the final embedded features; the amino acid sequence embedding features are expressed as a 512 × 768 matrix, and the multi-head attention mechanism is used for training.
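A minimal NumPy sketch of the scaled dot-product and multi-head attention formulas above; the dimensions are kept small for illustration (the patent's model uses 768-dimensional embeddings and 512 tokens):

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multihead(X, W_Q, W_K, W_V, W_O, n_heads=4):
    """Multihead(Q,K,V) = Concat(head_1, ..., head_n) W^O, with Q = K = V = X projections."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = [
        attention(Q[:, i * d_head:(i + 1) * d_head],
                  K[:, i * d_head:(i + 1) * d_head],
                  V[:, i * d_head:(i + 1) * d_head])
        for i in range(n_heads)
    ]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32))                        # 10 tokens, d_model = 32
W = [rng.normal(size=(32, 32)) * 0.1 for _ in range(4)]
out = multihead(X, *W)                               # shape (10, 32)
```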
Step 6, predicting the binding affinity of the peptide to the MHC class I protein. In step 6, after the peptide sequence and MHC class I protein domain amino acid sequence embedding feature vector E = <e0, e1, …, e_{m+n+3}> is obtained from the multi-head attention mechanism, a prediction vector F is generated with a feedforward neural network and normalized by a Softmax function to produce the predicted peptide-MHC class I protein affinity value; the output of the whole peptide-MHC class I protein binding affinity model is a 512 × 768 matrix, where 768 is the vector dimension and 512 corresponds to the number of input amino acid tokens; the first vector corresponds to the relations among the amino acid tokens in the sequence, and the whole output matrix is the feature matrix of peptide-MHC class I protein molecule binding affinity.
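A sketch of this prediction head (an illustration under assumptions, not the patent's exact architecture): the first ([CLS]) row of the 512 × 768 output matrix is passed through a feedforward layer and normalized with Softmax; the two-way output and all weight shapes are assumptions.

```python
import numpy as np

def predict_affinity(E, W1, b1, W2, b2):
    """Feedforward + Softmax over the [CLS] embedding e0 (first row of the 512x768 output)."""
    cls = E[0]                          # e0 summarizes the token relations
    h = np.maximum(0.0, cls @ W1 + b1)  # hidden feedforward layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    p /= p.sum()                        # Softmax normalization
    return p[1]                         # probability-like predicted affinity value

rng = np.random.default_rng(1)
E = rng.normal(size=(512, 768))
W1, b1 = rng.normal(size=(768, 128)) * 0.02, np.zeros(128)
W2, b2 = rng.normal(size=(128, 2)) * 0.02, np.zeros(2)
print(predict_affinity(E, W1, b1, W2, b2))
```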
Peptide-MHC class I protein binding affinity feature extraction is realized by a multi-head self-attention and fully connected neural network model; a protein domain token dictionary is constructed, and multi-head attention is used to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
The binding affinity prediction method of the invention adopts the Spearman rank correlation coefficient (Spearman's rank coefficient of correlation) to evaluate the strength of the monotonic relation between two random variables, i.e. whether their trends of change agree; the monotonic relation between the two variables need not be proportional. The calculation formula is:

ρ = Σi (R(xi) - R̄(x)) (R(yi) - R̄(y)) / sqrt( Σi (R(xi) - R̄(x))² · Σi (R(yi) - R̄(y))² )

wherein R(xi) and R(yi) are the ranks of xi and yi, respectively, and R̄(x) and R̄(y) are the corresponding mean ranks.
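For reference, a short sketch of computing this metric; scipy.stats.spearmanr implements the same rank correlation:

```python
from scipy.stats import spearmanr

measured  = [0.72, 0.41, 0.88, 0.15, 0.63]  # transformed experimental affinities
predicted = [0.70, 0.45, 0.80, 0.20, 0.60]  # model outputs
rho, pvalue = spearmanr(measured, predicted)
print(rho)  # 1.0 here: the two rankings agree perfectly
```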
The other evaluation index of the method is the AUC value: drawing one positive sample and one negative sample at random from the positive and negative sample sets, the AUC is the probability that the predicted value of the positive sample exceeds that of the negative sample. The calculation formula is:

AUC = Σ I(s_pos > s_neg) / (M · N)

wherein the numerator counts the positive-negative sample pairs in which the positive sample's predicted value exceeds the negative sample's, and the denominator M · N is the total number of positive-negative sample pairs formed from M positive and N negative samples.
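A direct sketch of this pairwise definition (sklearn.metrics.roc_auc_score computes the same quantity from labels and scores):

```python
def auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs where the positive scores higher (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.55], [0.6, 0.3]))  # 5/6 ≈ 0.833
```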

Claims (4)

1. A method for predicting the affinity of a peptide for an MHC class I protein based on protein domain feature embedding, characterized in that the method comprises the following steps:
step 1, constructing a protein domain token dictionary;
step 2, using the given unique MHC class I protein identifier (ID), looking up the corresponding amino acid sequence;
Step 3, obtaining the peptide sequence and the MHC class I protein domain sequence and tokenizing both: after the MHC class I protein amino acid sequence is obtained, the sequence is processed further; the start and end positions of all domains of the MHC class I protein are obtained with hmmscan, the domain amino acid sequences are extracted from these known start and end positions, and the domain amino acid sequences are tokenized according to the autonomously constructed protein domain token dictionary;
In step 3, the peptide sequence and the MHC class I protein molecule amino acid sequence are tokenized based on the autonomously constructed segmentation dictionary; the amino acid tokens are formed by counting the most frequent adjacent amino acid pairs in the protein domain sequences, and the top 10000 tokens are taken to form the protein domain token dictionary; with 10000 tokens, the entries of the dictionary are 3 or 4 amino acid letters long; these protein domain tokens are conserved because they are well adapted to their environment, and can therefore carry the evolutionary characteristics of the protein; the tokenized sequences are expressed as <t1^1, t2^1, …, tm^1> and <t1^2, t2^2, …, tn^2>, wherein superscript 1 denotes the peptide sequence, superscript 2 denotes the MHC class I protein amino acid sequence, and the subscript indexes the amino acid tokens; the two are combined into one sequence by inserting special tokens:

[CLS] t1^1 t2^1 … tm^1 [SEP] t1^2 t2^2 … tn^2 [EOS]

wherein [CLS], [SEP], and [EOS] are special tokens denoting category, separator, and ending, respectively; the maximum combined length of the peptide sequence and the MHC class I protein molecule amino acid sequence is normalized to 512;
Step 4, constructing an amino acid token embedding model;
In step 4, a peptide-MHC class I protein affinity prediction model is constructed on the Bert model; the model learns deep representations of the protein amino acid sequences in the Uniprot database by pre-training, and the fine-tuned model computes the feature-space distance between the peptide sequence and the MHC class I protein to represent their affinity; the model uses the LAMB optimizer, with the optimizer hyperparameters set to their default values, namely β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay rate λ = 0.01;
Step 5, extracting peptide sequence and MHC class I protein amino acid token embedding features, expressed as a peptide-MHC class I protein binding embedding matrix;
In step 5, peptide and MHC class I protein amino acid sequence embedding features are extracted with a multi-head attention mechanism; given an amino acid token vector list X = <x1, x2, …, xn>, each token vector xi is first processed by multi-head attention, which identifies and attends to positions in X according to the computed correlation between xi and its surrounding tokens; the information of the tokens before and after each vector xi in X is encoded into an output vector yi, weighted by its relevance to xi; the initial vector xi is then added to the output vector yi, the combined vector yi is normalized, and features are extracted by a fully connected feedforward network that uses the GeLU function as activation; the GeLU function is:

GeLU(x) = x · P(X ≤ x)

wherein X ~ N(μ, σ²); μ and σ are parameters determined by validation experiments and are set to μ = 0 and σ = 1;
Each vector yi independently generates an output vector zi through the same feedforward network; finally, yi is added to zi and the sum is normalized, giving the vector list Z = <z1, z2, …, zn> of the whole multi-head-attention-based protein feature embedding method;
The attention mechanism formulas are as follows:

Multihead(Q, K, V) = Concat(head_1, …, head_n) W^O

wherein head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V);

the Attention(Q, K, V) term is computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

In these formulas, Q, K, and V are the matrices formed by the query, key, and value vectors of the amino acid token embeddings, obtained by multiplying the token embeddings by the transformation matrices W_Q, W_K, and W_V, respectively; d_k denotes the dimension of the amino acid token key vectors; head_i denotes the attention output of the i-th attention head;
The multi-head attention mechanism captures the attention relations among amino acid tokens by computing the attention values between them, then adds sequence position information and encodes it to obtain the final embedded features; the amino acid sequence embedding features are expressed as a 512 × 768 matrix, and the multi-head attention mechanism is used for training;
step 6, predicting the binding affinity of the peptide to the MHC class I protein.
2. The method according to claim 1, characterized in that: in step 2, binding affinity data in the immune epitope database are used, the binding affinity being expressed as the half-inhibitory concentration in nanomolar; the half-inhibitory concentration value is converted to a value in the interval 0 to 1 with the formula:

transformed affinity = 1 - log(affinity) / log(50000)

wherein affinity is the experimentally measured binding affinity of the peptide to the MHC class I molecule.
3. The method according to claim 1, characterized in that: in step 6, after the peptide sequence and MHC class I protein domain amino acid sequence embedding feature vector E = <e0, e1, …, e_{m+n+3}> is obtained from the multi-head attention mechanism, a prediction vector F is generated with a feedforward neural network and normalized by a Softmax function to produce the predicted peptide-MHC class I protein affinity value; the output of the whole peptide-MHC class I protein binding affinity model is a 512 × 768 matrix, where 768 is the vector dimension and 512 corresponds to the number of input amino acid tokens; the first vector corresponds to the relations among the amino acid tokens in the sequence, and the whole output matrix is the feature matrix of peptide-MHC class I protein molecule binding affinity.
4. The method according to claim 1, characterized in that: peptide-MHC class I protein binding affinity feature extraction is realized by a multi-head self-attention and fully connected neural network model; a protein domain token dictionary is constructed, and multi-head attention is used to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
CN202310878264.6A 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding Active CN117037897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310878264.6A CN117037897B (en) 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310878264.6A CN117037897B (en) 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Publications (2)

Publication Number Publication Date
CN117037897A CN117037897A (en) 2023-11-10
CN117037897B true CN117037897B (en) 2024-06-14

Family

ID=88621688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310878264.6A Active CN117037897B (en) 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Country Status (1)

Country Link
CN (1) CN117037897B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117976074B (en) * 2024-03-29 2024-06-25 北京悦康科创医药科技股份有限公司 MHC molecule and antigen epitope affinity determination method, model training method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115588462A (en) * 2022-09-15 2023-01-10 哈尔滨工业大学 Polypeptide and major histocompatibility complex protein molecule combination prediction method based on transfer learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593631B (en) * 2021-08-09 2022-11-29 山东大学 Method and system for predicting protein-polypeptide binding site

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115588462A (en) * 2022-09-15 2023-01-10 哈尔滨工业大学 Polypeptide and major histocompatibility complex protein molecule combination prediction method based on transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MHCRoBERTa: pan-specific peptide-MHC class I binding prediction through transfer learning with label-agnostic protein sequences; Fuxu Wang et al.; Briefings in Bioinformatics; 2022-04-21; Vol. 23, No. 3; pp. 1-9 *

Also Published As

Publication number Publication date
CN117037897A (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111161793B (en) Stacking integration based N in RNA 6 Method for predicting methyladenosine modification site
CN112767997A (en) Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN117037897B (en) Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding
CN115098620B (en) Cross-modal hash retrieval method for attention similarity migration
CN111276252B (en) Construction method and device of tumor benign and malignant identification model
CN111461201A (en) Sensor data classification method based on phase space reconstruction
CN111462820A (en) Non-coding RNA prediction method based on feature screening and integration algorithm
CN114708903A (en) Method for predicting distance between protein residues based on self-attention mechanism
CN113762417A (en) Method for enhancing HLA antigen presentation prediction system based on deep migration
CN116524960A (en) Speech emotion recognition system based on mixed entropy downsampling and integrated classifier
CN108108184B (en) Source code author identification method based on deep belief network
CN112116949B (en) Protein folding identification method based on triple loss
CN113611360A (en) Protein-protein interaction site prediction method based on deep learning and XGboost
CN117497058A (en) Antibody antigen neutralization prediction method and device based on graphic neural network
CN112215826A (en) Depth image feature-based glioma molecule subtype prediction and prognosis method
CN112085245A (en) Protein residue contact prediction method based on deep residual error neural network
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
KR102376212B1 (en) Gene expression marker screening method using neural network based on gene selection algorithm
CN112365924B (en) Bidirectional trinucleotide position specificity preference and point combined mutual information DNA/RNA sequence coding method
CN115481674A (en) Single cell type intelligent identification method based on deep learning
US20230298692A1 (en) Method, System and Computer Program Product for Determining Presentation Likelihoods of Neoantigens
CN111310546B (en) Method for extracting and authenticating writing rhythm characteristics in online handwriting authentication
CN114005529A (en) Recognition method of ncRNA with protein coding potential
CN112151109A (en) Semi-supervised learning method for evaluating randomness of biomolecular cross-linking mass spectrometry identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant