CN117037897B - Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding - Google Patents

Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Info

Publication number
CN117037897B
CN117037897B CN202310878264.6A
Authority
CN
China
Prior art keywords
amino acid
protein
peptide
mhc class
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310878264.6A
Other languages
Chinese (zh)
Other versions
CN117037897A (en)
Inventor
Wang Fuxu (王福旭)
Zang Tianyi (臧天仪)
Wang Haoyan (王皓俨)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310878264.6A priority Critical patent/CN117037897B/en
Publication of CN117037897A publication Critical patent/CN117037897A/en
Application granted granted Critical
Publication of CN117037897B publication Critical patent/CN117037897B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 - ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 - Drug targeting using structural data; Docking or binding prediction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Peptides Or Proteins (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides a method for predicting the affinity of peptides for MHC class I proteins based on protein domain feature embedding. The method uses multi-head attention to learn peptide-bond and amino-acid-residue features and predicts peptide-MHC class I protein affinity; compared with other existing methods, the prediction results are accurate and meet practical needs.

Description

Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding
Technical Field
The invention relates to the technical field of peptide and MHC class I protein affinity prediction, in particular to a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding.
Background
The binding affinity of peptides to MHC class I proteins plays a vital role in oncology, vaccine development, early diagnosis of immune diseases, screening for graft rejection, biochemistry, and neuroscience. In tumor drug and vaccine development, changes in the binding affinity of peptides to MHC class I molecules can alter antigen presentation and recognition and thereby the effect of tumor immunotherapy; in early diagnosis of immune diseases and screening for rejection, peptide-MHC class I affinity can be used to predict autoimmune-disease peptide fragments and thus to diagnose immune diseases and rejection reactions; in biochemistry and neuroscience, changes in peptide binding affinity to MHC class I proteins can also affect the function and activity of neurons and glial cells, while helping to better understand the mechanisms of bioengineering and immune adaptation.
With the development of neoantigen cancer vaccines, identifying neoantigens effectively, accurately, and rapidly has become an urgent problem in the fight against cancer, and efficient prediction of peptide affinity for MHC class I molecules is the basis of efficient neoantigen recognition. With the rapid development of sequencing technology, large numbers of protein sequences have been determined, and abundant sequencing data and tumor immunity data are available as raw material. How to use these protein sequencing data effectively and construct a peptide-MHC class I molecule affinity prediction and analysis method that identifies antigenic peptides rapidly and accurately is a common need of researchers in the field.
Studies have shown that some tumors carry a higher mutational burden than others, so the immunogenic response induced by a neoantigen vaccine may differ across cancer types. Before cancer genome sequencing became available, it was difficult to determine the specific neoantigens of each cancer type. Traditional neoantigen identification typically relies on screening single cDNA libraries, which is very inefficient. The development of second-generation sequencing technology has accelerated neoantigen identification: identified tumor-specific gene mutations are now widely available, and a number of tools have been developed to predict peptide-MHC affinity from them. For example, NetMHC, an algorithm that predicts peptide affinity for MHC class I molecules with a feedforward neural network, is the most widely used allele-specific model. NetMHCpan, a pan-specific model not limited to a particular MHC allele, uses a traditional neural network with a single hidden layer.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding. The method uses multi-head attention to learn peptide-bond and amino-acid-residue features and thereby predicts peptide-MHC class I protein affinity.
The invention is realized by the following technical scheme, and provides a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding, which comprises the following steps:
step 1, constructing a protein domain token dictionary;
step 2, using the given unique MHC class I protein identifier (ID), looking up the corresponding amino acid sequence;
Step 3, obtaining the peptide sequence and the MHC class I protein domain sequence and tokenizing both: after the MHC class I protein amino acid sequence is obtained, the sequence is processed further; the start and end positions of all domains of the MHC class I protein are obtained with hmmscan, the domain amino acid sequences are extracted from these known start and end positions, and the domain amino acid sequences are tokenized according to the autonomously constructed protein domain token dictionary;
Step 4, constructing an amino acid token embedding model;
Step 5, extracting peptide sequence and MHC class I protein amino acid token embedding features, expressed as a peptide-MHC class I protein binding embedding matrix;
step 6, predicting the binding affinity of the peptide to the MHC class I protein.
Further, in step 1, the most frequent adjacent amino acid pairs in the protein domain amino acid sequences are counted to form amino acid tokens, and the top 10000 tokens are taken to form the protein domain token dictionary.
Further, in step 2, binding affinity data in the immune epitope database are used; the binding affinity is expressed as the half-inhibitory concentration in nanomolar, and the half-inhibitory concentration value is converted to a value in the interval 0 to 1 with the formula:

transformed affinity = 1 - log(affinity) / log(50000)

wherein affinity is the experimentally measured binding affinity (IC50, in nM) of the peptide to the MHC class I molecule.
Further, in step 3, the peptide sequence and the MHC class I protein molecule amino acid sequence are tokenized based on the autonomously constructed segmentation dictionary; the amino acid tokens are formed by counting the most frequent adjacent amino acid pairs in the protein domain sequences, and the top 10000 tokens are taken to form the protein domain token dictionary; with 10000 tokens, most entries of the dictionary are 3 or 4 amino acid letters long; these protein domain tokens are conserved because they are well adapted to their environment, and can therefore carry the evolutionary characteristics of the protein. The tokenized sequences are expressed as <t1^1, t2^1, …, tm^1> and <t1^2, t2^2, …, tn^2>, wherein superscript 1 denotes the peptide sequence, superscript 2 denotes the MHC class I protein amino acid sequence, and the subscript indexes the amino acid tokens; the two are combined into one sequence by inserting special tokens:

[CLS] t1^1 t2^1 … tm^1 [SEP] t1^2 t2^2 … tn^2 [EOS]

wherein [CLS], [SEP], and [EOS] are special tokens denoting category, separator, and ending, respectively; the maximum combined length of the peptide sequence and the MHC class I protein molecule amino acid sequence is normalized to 512.
Further, in step 4, a peptide-MHC class I protein affinity prediction model based on protein domain feature embedding is constructed on the Bert model: the model learns deep representations of the protein amino acid sequences in the Uniprot database by pre-training, and the fine-tuned model computes the feature-space distance between the peptide sequence and the MHC class I protein to represent their affinity; the model uses the LAMB optimizer, with the optimizer hyperparameters set to their default values, namely β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay rate λ = 0.01.
Further, in step 5, peptide and MHC class I protein amino acid sequence embedding features are extracted with a multi-head attention mechanism. Given an amino acid token vector list X = <x1, x2, …, xn>, each token vector xi is first processed by multi-head attention, which identifies and attends to positions in X according to the computed correlation between xi and its surrounding tokens; the information of the tokens before and after each vector xi in X is encoded into an output vector yi, weighted by its relevance to xi. The initial vector xi is then added to the output vector yi, the combined vector yi is normalized, and features are extracted by a fully connected feedforward network that uses the GeLU function as activation; the GeLU function is:

GeLU(x) = x · P(X ≤ x)

wherein X ~ N(μ, σ²); μ and σ are parameters determined by validation experiments and are set to μ = 0 and σ = 1;
Each vector yi independently generates an output vector zi through the same feedforward network; finally, yi is added to zi and the sum is normalized, giving the vector list Z = <z1, z2, …, zn> of the whole multi-head-attention-based protein feature embedding method;
The attention mechanism formulas are as follows:

Multihead(Q, K, V) = Concat(head_1, …, head_n) W^O

wherein head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V);

the Attention(Q, K, V) term is computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

In these formulas, Q, K, and V are the matrices formed by the query, key, and value vectors of the amino acid token embeddings, obtained by multiplying the token embeddings by the transformation matrices W_Q, W_K, and W_V, respectively; d_k denotes the dimension of the amino acid token key vectors; head_i denotes the attention output of the i-th attention head;
The multi-head attention mechanism captures the attention relations among amino acid tokens by computing the attention values between them, then adds sequence position information and encodes it to obtain the final embedded features; the amino acid sequence embedding features are expressed as a 512 × 768 matrix, and the multi-head attention mechanism is used for training.
Further, in step 6, after the peptide sequence and MHC class I protein domain amino acid sequence embedding feature vector E = <e0, e1, …, e_{m+n+3}> is obtained from the multi-head attention mechanism, a prediction vector F is generated with a feedforward neural network and normalized by a Softmax function to produce the predicted peptide-MHC class I protein affinity value; the output of the whole peptide-MHC class I protein binding affinity model is a 512 × 768 matrix, where 768 is the vector dimension and 512 corresponds to the number of input amino acid tokens; the first vector corresponds to the relations among the amino acid tokens in the sequence, and the whole output matrix is the feature matrix of peptide-MHC class I protein molecule binding affinity.
Further, peptide-MHC class I protein binding affinity feature extraction is realized by a multi-head self-attention and fully connected neural network model; a protein domain token dictionary is constructed, and multi-head attention is used to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
The beneficial effects of the invention are as follows:
The invention provides a peptide and MHC class I protein affinity prediction method based on protein domain feature embedding, which uses multi-head attention to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
Drawings
FIG. 1 is a block diagram of a method for predicting affinity of peptides to MHC class I proteins based on the characteristic insertion of protein domains according to the present invention;
FIG. 2 is a diagram showing the comparison of the predicted result of the present invention with other reference methods.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-2, the present invention provides a method for predicting the affinity of peptides for MHC class I proteins based on protein domain feature embedding, the method comprising the steps of:
step 1, constructing a protein domain token dictionary;
In step 1, the most frequent adjacent amino acid pairs in the protein domain amino acid sequences are counted to form amino acid tokens, and the top 10000 tokens are taken to form the protein domain token dictionary;
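The patent gives no code for this step; the following is a minimal sketch of the described pair-counting vocabulary construction (byte-pair-encoding style), assuming the domain sequences are plain amino acid strings. All function and variable names are illustrative, not from the patent.

```python
from collections import Counter

def build_domain_vocab(domain_seqs, vocab_size=10000, num_merges=20000):
    """BPE-style sketch: repeatedly merge the most frequent adjacent
    symbol pair in the domain sequences, then keep the top tokens."""
    corpus = [list(seq) for seq in domain_seqs]  # start from single letters
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        merged = a + b
        # Replace every occurrence of the most frequent pair with the merged token.
        for i, toks in enumerate(corpus):
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(merged)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            corpus[i] = out
    token_freq = Counter(t for toks in corpus for t in toks)
    # The top-10000 tokens form the protein domain token dictionary.
    return [t for t, _ in token_freq.most_common(vocab_size)]

vocab = build_domain_vocab(["MAVMAPRTLVLLLSGALALTQTWAGS",
                            "MRVTAPRTVLLLLSGALALTETWAGS"])
```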
step 2, using the given unique MHC class I protein identifier (ID), looking up the corresponding amino acid sequence;
in step 2, the immune epitope database is used, which contains binding affinity data and eluted ligand data; the binding affinity data, in which the binding affinity of a peptide to an MHC class I molecule is expressed as the half-inhibitory concentration (50% inhibiting concentration, IC50) in nanomolar (nM), were used; the affinity values were mapped to the range 0 to 1 for training and testing of the prediction method. The half-inhibitory concentration is converted to a value in the interval 0 to 1 using the following formula:

transformed affinity = 1 - log(affinity) / log(50000)

wherein affinity is the measured binding affinity (IC50, in nM) of the peptide to the MHC class I molecule. The lower the half-inhibitory concentration, the stronger the binding of the peptide to the MHC class I molecule: a peptide binds strongly when the affinity value is below 50 nM, i.e. the transformed value is above 0.638; it binds weakly when the affinity value is below 500 nM, i.e. the transformed value is above 0.426; and it does not bind when the affinity value is above 500 nM, i.e. the transformed value is below 0.426.
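For illustration, a minimal sketch of this transformation and the binding categories it implies; the base-50000 logarithm follows from the stated cut-offs (50 nM maps to 0.638 and 500 nM to 0.426):

```python
import math

def transform_affinity(ic50_nm: float) -> float:
    """Map an IC50 value in nM to the 0-1 training range: 1 - log(IC50)/log(50000)."""
    return 1.0 - math.log(ic50_nm) / math.log(50000.0)

def binding_category(ic50_nm: float) -> str:
    t = transform_affinity(ic50_nm)
    if t > 0.638:   # IC50 below 50 nM
        return "strong binder"
    if t > 0.426:   # IC50 below 500 nM
        return "weak binder"
    return "non-binder"

assert abs(transform_affinity(50) - 0.638) < 1e-3
assert abs(transform_affinity(500) - 0.426) < 1e-3
```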
Step 3, obtaining the peptide sequence and the MHC class I protein domain sequence, and tokenizing the peptide sequence and the MHC class I protein domain sequence;
After the amino acid sequence of the MHC class I protein is obtained, it is processed further to obtain the domain amino acid sequences within the protein amino acid sequence; the start and end positions of the MHC class I protein domains are obtained with hmmscan, all required MHC class I protein domain amino acid sequences are extracted from the known start and end positions, and the amino acid sequences are tokenized with the protein domain token dictionary;
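As a sketch of this step (not code from the patent): hmmscan from the HMMER suite can be run with its per-domain table output and the alignment coordinates parsed from it. The column positions below follow HMMER's --domtblout format; the file and database names are illustrative.

```python
import subprocess

def extract_domains(fasta_path: str, hmm_db: str, seq: str):
    """Run hmmscan and slice out domain subsequences by their start/end positions."""
    subprocess.run(
        ["hmmscan", "--domtblout", "domains.tbl", hmm_db, fasta_path],
        check=True,
    )
    domains = []
    with open("domains.tbl") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            # Fields 18 and 19 (1-based) hold the alignment start/end on the query.
            start, end = int(fields[17]), int(fields[18])
            domains.append(seq[start - 1:end])  # 1-based, inclusive coordinates
    return domains
```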
In step 3, the peptide sequence and the MHC class I protein molecule amino acid sequence are tokenized based on the autonomously constructed segmentation dictionary. The protein domain amino acid tokens are formed by counting the most frequent adjacent amino acid pairs in the protein domain sequences, and the top 10000 tokens are taken to form the protein domain token dictionary. With 10000 tokens, most entries of the dictionary are 3 or 4 amino acid letters long. These protein domain tokens are conserved because they are well adapted to their environment, and can therefore carry the evolutionary characteristics of the protein. The tokenized sequences are expressed as <t1^1, t2^1, …, tm^1> and <t1^2, t2^2, …, tn^2>, where the superscript distinguishes the peptide sequence (1) from the MHC class I protein amino acid sequence (2) and the subscript indexes the amino acid tokens. Because the peptide-MHC class I protein molecule affinity prediction method takes the peptide sequence and the MHC class I protein amino acid sequence in turn, a [SEP] token is inserted between the two sequences as a separator; the input of the method is the pair of protein token sequences, which, after separate tokenization, are combined into one sequence as follows:

[CLS] t1^1 t2^1 … tm^1 [SEP] t1^2 t2^2 … tn^2 [EOS]

After combination, the maximum length of the combined, marked amino acid sequence is normalized to 512; [CLS], [SEP], and [EOS] are special amino acid tokens denoting the category token, separator token, and ending token, respectively.
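A minimal sketch of this tokenization and combination, assuming a greedy longest-match against the domain token dictionary from step 1; the [PAD] token used to reach the fixed length of 512 is an assumption, as the patent does not name its padding token.

```python
def tokenize(seq: str, vocab: set, max_token_len: int = 4) -> list:
    """Greedy longest-match tokenization against the domain token dictionary."""
    tokens, i = [], 0
    while i < len(seq):
        for k in range(min(max_token_len, len(seq) - i), 0, -1):
            piece = seq[i:i + k]
            if k == 1 or piece in vocab:  # single letters are always valid
                tokens.append(piece)
                i += k
                break
    return tokens

def combine(peptide: str, mhc_domain_seq: str, vocab: set, max_len: int = 512) -> list:
    """[CLS] peptide tokens [SEP] MHC domain tokens [EOS], padded/truncated to 512."""
    seq = (["[CLS]"] + tokenize(peptide, vocab)
           + ["[SEP]"] + tokenize(mhc_domain_seq, vocab) + ["[EOS]"])
    seq = seq[:max_len]
    return seq + ["[PAD]"] * (max_len - len(seq))
```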
Step 4, constructing a peptide-MHC class I protein affinity prediction model based on protein domain feature embedding on the Bert model; the Bert model computes the feature-space distance between the peptide sequence and the MHC class I protein to represent the peptide's affinity for the MHC class I protein. The model uses the LAMB optimizer, an adaptive large-batch optimization technique, with the optimizer hyperparameters set to their default values, namely β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay rate λ = 0.01.
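LAMB is not part of core PyTorch; as one possible setup (an assumption, not the patent's code), the Lamb implementation from the third-party torch-optimizer package accepts the stated defaults:

```python
import torch
import torch_optimizer  # third-party package providing a LAMB implementation

model = torch.nn.Linear(768, 768)  # placeholder for the Bert-based affinity model
optimizer = torch_optimizer.Lamb(
    model.parameters(),
    lr=1e-4,                 # learning rate: not specified in the patent
    betas=(0.9, 0.999),      # β1, β2 as stated
    eps=1e-8,                # ε as stated
    weight_decay=0.01,       # weight decay rate λ as stated
)
```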
Step 5, extracting peptide sequence and MHC class I protein amino acid token embedding features, expressed as a peptide-MHC class I protein binding embedding matrix. In step 5, the peptide and MHC class I protein amino acid sequence embedding features are extracted with a multi-head attention mechanism. Given an amino acid token vector list X = <x1, x2, …, xn>, each token vector xi is first processed by multi-head attention, which identifies and attends to positions in X according to the computed correlation between xi and its surrounding tokens; the information of the tokens before and after each vector xi in X is encoded into an output vector yi, weighted by its relevance to xi. The initial vector xi is then added to the output vector yi, the combined vector yi is normalized, and features are extracted by a fully connected feedforward network that uses the GeLU function as activation. The GeLU function has been shown to perform best in multi-head attention models and avoids the vanishing gradient problem; it is defined as:

GeLU(x) = x · P(X ≤ x)

wherein X ~ N(μ, σ²); μ and σ are parameters determined by validation experiments and are set to μ = 0 and σ = 1;
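With μ = 0 and σ = 1, P(X ≤ x) is the standard normal CDF Φ(x), so GeLU can be computed exactly via the error function; a small sketch:

```python
import math

def gelu(x: float) -> float:
    """Exact GeLU: x * Φ(x), with Φ the standard normal CDF (μ = 0, σ = 1)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(gelu(1.0))  # ≈ 0.8413
```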
Each vector yi independently generates an output vector zi through the same feedforward network; finally, yi is added to zi and the sum is normalized, giving the vector list Z = <z1, z2, …, zn> of the whole multi-head-attention-based protein feature embedding method;
The attention mechanism formulas are as follows:

Multihead(Q, K, V) = Concat(head_1, …, head_n) W^O

wherein head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V);

the Attention(Q, K, V) term is computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

In these formulas, Q, K, and V are the matrices formed by the query, key, and value vectors of the amino acid token embeddings, obtained by multiplying the token embeddings by the transformation matrices W_Q, W_K, and W_V, respectively; d_k denotes the dimension of the amino acid token key vectors; head_i denotes the attention output of the i-th attention head;
The multi-head attention mechanism captures the attention relations among amino acid tokens by computing the attention values between them, then adds sequence position information and encodes it to obtain the final embedded features; the amino acid sequence embedding features are expressed as a 512 × 768 matrix, and the multi-head attention mechanism is used for training.
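A minimal NumPy sketch of the scaled dot-product and multi-head attention formulas above; the dimensions are kept small for illustration (the patent's model uses 768-dimensional embeddings and 512 tokens):

```python
import numpy as np

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multihead(X, W_Q, W_K, W_V, W_O, n_heads=4):
    """Multihead(Q,K,V) = Concat(head_1, ..., head_n) W^O, with Q = K = V = X projections."""
    d_model = X.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = [
        attention(Q[:, i * d_head:(i + 1) * d_head],
                  K[:, i * d_head:(i + 1) * d_head],
                  V[:, i * d_head:(i + 1) * d_head])
        for i in range(n_heads)
    ]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 32))                        # 10 tokens, d_model = 32
W = [rng.normal(size=(32, 32)) * 0.1 for _ in range(4)]
out = multihead(X, *W)                               # shape (10, 32)
```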
Step 6, predicting the binding affinity of the peptide to the MHC class I protein. In step 6, after the peptide sequence and MHC class I protein domain amino acid sequence embedding feature vector E = <e0, e1, …, e_{m+n+3}> is obtained from the multi-head attention mechanism, a prediction vector F is generated with a feedforward neural network and normalized by a Softmax function to produce the predicted peptide-MHC class I protein affinity value; the output of the whole peptide-MHC class I protein binding affinity model is a 512 × 768 matrix, where 768 is the vector dimension and 512 corresponds to the number of input amino acid tokens; the first vector corresponds to the relations among the amino acid tokens in the sequence, and the whole output matrix is the feature matrix of peptide-MHC class I protein molecule binding affinity.
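A sketch of this prediction head (an illustration under assumptions, not the patent's exact architecture): the first ([CLS]) row of the 512 × 768 output matrix is passed through a feedforward layer and normalized with Softmax; the two-way output and all weight shapes are assumptions.

```python
import numpy as np

def predict_affinity(E, W1, b1, W2, b2):
    """Feedforward + Softmax over the [CLS] embedding e0 (first row of the 512x768 output)."""
    cls = E[0]                          # e0 summarizes the token relations
    h = np.maximum(0.0, cls @ W1 + b1)  # hidden feedforward layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    p /= p.sum()                        # Softmax normalization
    return p[1]                         # probability-like predicted affinity value

rng = np.random.default_rng(1)
E = rng.normal(size=(512, 768))
W1, b1 = rng.normal(size=(768, 128)) * 0.02, np.zeros(128)
W2, b2 = rng.normal(size=(128, 2)) * 0.02, np.zeros(2)
print(predict_affinity(E, W1, b1, W2, b2))
```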
Peptide-MHC class I protein binding affinity feature extraction is realized by a multi-head self-attention and fully connected neural network model; a protein domain token dictionary is constructed, and multi-head attention is used to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
The binding affinity prediction method of the invention adopts the Spearman rank correlation coefficient (Spearman's rank coefficient of correlation) to evaluate the strength of the monotonic relation between two random variables, i.e. whether their trends of change agree; the monotonic relation between the two variables need not be proportional. The calculation formula is:

ρ = Σi (R(xi) - R̄(x)) (R(yi) - R̄(y)) / sqrt( Σi (R(xi) - R̄(x))² · Σi (R(yi) - R̄(y))² )

wherein R(xi) and R(yi) are the ranks of xi and yi, respectively, and R̄(x) and R̄(y) are the corresponding mean ranks.
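For reference, a short sketch of computing this metric; scipy.stats.spearmanr implements the same rank correlation:

```python
from scipy.stats import spearmanr

measured  = [0.72, 0.41, 0.88, 0.15, 0.63]  # transformed experimental affinities
predicted = [0.70, 0.45, 0.80, 0.20, 0.60]  # model outputs
rho, pvalue = spearmanr(measured, predicted)
print(rho)  # 1.0 here: the two rankings agree perfectly
```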
The other evaluation index of the method is the AUC value: drawing one positive sample and one negative sample at random from the positive and negative sample sets, the AUC is the probability that the predicted value of the positive sample exceeds that of the negative sample. The calculation formula is:

AUC = Σ I(s_pos > s_neg) / (M · N)

wherein the numerator counts the positive-negative sample pairs in which the positive sample's predicted value exceeds the negative sample's, and the denominator M · N is the total number of positive-negative sample pairs formed from M positive and N negative samples.
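A direct sketch of this pairwise definition (sklearn.metrics.roc_auc_score computes the same quantity from labels and scores):

```python
def auc(pos_scores, neg_scores):
    """Fraction of positive-negative pairs where the positive scores higher (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.55], [0.6, 0.3]))  # 5/6 ≈ 0.833
```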

Claims (4)

1. A method for predicting the affinity of a peptide for an MHC class I protein based on protein domain feature embedding, characterized in that the method comprises the following steps:
step 1, constructing a protein domain token dictionary;
step 2, using the given unique MHC class I protein identifier (ID), looking up the corresponding amino acid sequence;
Step 3, obtaining the peptide sequence and the MHC class I protein domain sequence and tokenizing both: after the MHC class I protein amino acid sequence is obtained, the sequence is processed further; the start and end positions of all domains of the MHC class I protein are obtained with hmmscan, the domain amino acid sequences are extracted from these known start and end positions, and the domain amino acid sequences are tokenized according to the autonomously constructed protein domain token dictionary;
In step 3, the peptide sequence and the MHC class I protein molecule amino acid sequence are tokenized based on the autonomously constructed segmentation dictionary; the amino acid tokens are formed by counting the most frequent adjacent amino acid pairs in the protein domain sequences, and the top 10000 tokens are taken to form the protein domain token dictionary; with 10000 tokens, the entries of the dictionary are 3 or 4 amino acid letters long; these protein domain tokens are conserved because they are well adapted to their environment, and can therefore carry the evolutionary characteristics of the protein; the tokenized sequences are expressed as <t1^1, t2^1, …, tm^1> and <t1^2, t2^2, …, tn^2>, wherein superscript 1 denotes the peptide sequence, superscript 2 denotes the MHC class I protein amino acid sequence, and the subscript indexes the amino acid tokens; the two are combined into one sequence by inserting special tokens:

[CLS] t1^1 t2^1 … tm^1 [SEP] t1^2 t2^2 … tn^2 [EOS]

wherein [CLS], [SEP], and [EOS] are special tokens denoting category, separator, and ending, respectively; the maximum combined length of the peptide sequence and the MHC class I protein molecule amino acid sequence is normalized to 512;
Step 4, constructing an amino acid token embedding model;
In step 4, a peptide-MHC class I protein affinity prediction model is constructed on the Bert model; the model learns deep representations of the protein amino acid sequences in the Uniprot database by pre-training, and the fine-tuned model computes the feature-space distance between the peptide sequence and the MHC class I protein to represent their affinity; the model uses the LAMB optimizer, with the optimizer hyperparameters set to their default values, namely β1 = 0.9, β2 = 0.999, ε = 1e-8, and weight decay rate λ = 0.01;
Step 5, extracting peptide sequence and MHC class I protein amino acid token embedding features, expressed as a peptide-MHC class I protein binding embedding matrix;
In step 5, peptide and MHC class I protein amino acid sequence embedding features are extracted with a multi-head attention mechanism; given an amino acid token vector list X = <x1, x2, …, xn>, each token vector xi is first processed by multi-head attention, which identifies and attends to positions in X according to the computed correlation between xi and its surrounding tokens; the information of the tokens before and after each vector xi in X is encoded into an output vector yi, weighted by its relevance to xi; the initial vector xi is then added to the output vector yi, the combined vector yi is normalized, and features are extracted by a fully connected feedforward network that uses the GeLU function as activation; the GeLU function is:

GeLU(x) = x · P(X ≤ x)

wherein X ~ N(μ, σ²); μ and σ are parameters determined by validation experiments and are set to μ = 0 and σ = 1;
Each vector yi independently generates an output vector zi through the same feedforward network; finally, yi is added to zi and the sum is normalized, giving the vector list Z = <z1, z2, …, zn> of the whole multi-head-attention-based protein feature embedding method;
The attention mechanism formulas are as follows:

Multihead(Q, K, V) = Concat(head_1, …, head_n) W^O

wherein head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V);

the Attention(Q, K, V) term is computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

In these formulas, Q, K, and V are the matrices formed by the query, key, and value vectors of the amino acid token embeddings, obtained by multiplying the token embeddings by the transformation matrices W_Q, W_K, and W_V, respectively; d_k denotes the dimension of the amino acid token key vectors; head_i denotes the attention output of the i-th attention head;
The multi-head attention mechanism captures the attention relations among amino acid tokens by computing the attention values between them, then adds sequence position information and encodes it to obtain the final embedded features; the amino acid sequence embedding features are expressed as a 512 × 768 matrix, and the multi-head attention mechanism is used for training;
step 6, predicting the binding affinity of the peptide to the MHC class I protein.
2. The method according to claim 1, characterized in that: in step 2, binding affinity data in the immune epitope database are used, the binding affinity being expressed as the half-inhibitory concentration in nanomolar; the half-inhibitory concentration value is converted to a value in the interval 0 to 1 with the formula:

transformed affinity = 1 - log(affinity) / log(50000)

wherein affinity is the experimentally measured binding affinity of the peptide to the MHC class I molecule.
3. The method according to claim 1, characterized in that: in step 6, after the peptide sequence and MHC class I protein domain amino acid sequence embedding feature vector E = <e0, e1, …, e_{m+n+3}> is obtained from the multi-head attention mechanism, a prediction vector F is generated with a feedforward neural network and normalized by a Softmax function to produce the predicted peptide-MHC class I protein affinity value; the output of the whole peptide-MHC class I protein binding affinity model is a 512 × 768 matrix, where 768 is the vector dimension and 512 corresponds to the number of input amino acid tokens; the first vector corresponds to the relations among the amino acid tokens in the sequence, and the whole output matrix is the feature matrix of peptide-MHC class I protein molecule binding affinity.
4. The method according to claim 1, characterized in that: peptide-MHC class I protein binding affinity feature extraction is realized by a multi-head self-attention and fully connected neural network model; a protein domain token dictionary is constructed, and multi-head attention is used to learn peptide-bond and amino-acid-residue features to predict peptide-MHC class I protein affinity.
CN202310878264.6A 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding Active CN117037897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310878264.6A CN117037897B (en) 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310878264.6A CN117037897B (en) 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Publications (2)

Publication Number Publication Date
CN117037897A CN117037897A (en) 2023-11-10
CN117037897B true CN117037897B (en) 2024-06-14

Family

ID=88621688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310878264.6A Active CN117037897B (en) 2023-07-18 2023-07-18 Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding

Country Status (1)

Country Link
CN (1) CN117037897B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117976074B (en) * 2024-03-29 2024-06-25 北京悦康科创医药科技股份有限公司 MHC molecule and antigen epitope affinity determination method, model training method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115588462A (en) * 2022-09-15 2023-01-10 哈尔滨工业大学 Polypeptide and major histocompatibility complex protein molecule combination prediction method based on transfer learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593631B (en) * 2021-08-09 2022-11-29 山东大学 Method and system for predicting protein-polypeptide binding site

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115588462A (en) * 2022-09-15 2023-01-10 哈尔滨工业大学 Polypeptide and major histocompatibility complex protein molecule combination prediction method based on transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MHCRoBERTa: pan-specific peptide-MHC class I binding prediction through transfer learning with label-agnostic protein sequences; Fuxu Wang et al.; Briefings in Bioinformatics; 2022-04-21; Vol. 23, No. 3; pp. 1-9 *

Also Published As

Publication number Publication date
CN117037897A (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111161793B (en) Stacking integration based N in RNA 6 Method for predicting methyladenosine modification site
CN112767997A (en) Protein secondary structure prediction method based on multi-scale convolution attention neural network
CN117037897B (en) Peptide and MHC class I protein affinity prediction method based on protein domain feature embedding
CN115098620B (en) Cross-modal hash retrieval method for attention similarity migration
CN111276252B (en) Construction method and device of tumor benign and malignant identification model
CN111461201A (en) Sensor data classification method based on phase space reconstruction
CN111462820A (en) Non-coding RNA prediction method based on feature screening and integration algorithm
CN114708903A (en) Method for predicting distance between protein residues based on self-attention mechanism
CN113762417A (en) Method for enhancing HLA antigen presentation prediction system based on deep migration
CN116524960A (en) Speech emotion recognition system based on mixed entropy downsampling and integrated classifier
CN108108184B (en) Source code author identification method based on deep belief network
CN112116949B (en) Protein folding identification method based on triple loss
CN113611360A (en) Protein-protein interaction site prediction method based on deep learning and XGboost
CN117497058A (en) Antibody antigen neutralization prediction method and device based on graphic neural network
CN112215826A (en) Depth image feature-based glioma molecule subtype prediction and prognosis method
CN112085245A (en) Protein residue contact prediction method based on deep residual error neural network
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
KR102376212B1 (en) Gene expression marker screening method using neural network based on gene selection algorithm
CN112365924B (en) Bidirectional trinucleotide position specificity preference and point combined mutual information DNA/RNA sequence coding method
CN115481674A (en) Single cell type intelligent identification method based on deep learning
US20230298692A1 (en) Method, System and Computer Program Product for Determining Presentation Likelihoods of Neoantigens
CN111310546B (en) Method for extracting and authenticating writing rhythm characteristics in online handwriting authentication
CN114005529A (en) Recognition method of ncRNA with protein coding potential
CN112151109A (en) Semi-supervised learning method for evaluating randomness of biomolecular cross-linking mass spectrometry identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant