CN114242159B - Method for constructing antigen peptide presentation prediction model, and antigen peptide prediction method and device


Info

Publication number
CN114242159B
CN114242159B (application CN202210170086.7A)
Authority
CN
China
Prior art keywords
hla
model
data
sequence
sample data
Prior art date
Legal status
Active
Application number
CN202210170086.7A
Other languages
Chinese (zh)
Other versions
CN114242159A (en)
Inventor
王天元
翟珂
Current Assignee
Beijing Jingtai Technology Co ltd
Original Assignee
Beijing Jingtai Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingtai Technology Co ltd
Priority to CN202210170086.7A
Publication of CN114242159A
Application granted
Publication of CN114242159B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B 35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Biochemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The application relates to a method for constructing an antigenic peptide presentation prediction model, and an antigenic peptide prediction method and device. The construction method comprises: acquiring a target HLA of a preselected type, together with positive sample data and negative sample data that correspond to the target HLA in a preset ratio; respectively inputting the target HLA and the corresponding positive and negative sample data into a plurality of BERT-based sub-models with different architectures for training, to obtain a plurality of trained sub-models; and screening the trained sub-models through a preset rule to obtain a prediction model comprising the preferred sub-models, wherein the prediction model combines the presentation results predicted by the preferred sub-models to predict whether a target antigen peptide is presented by the target HLA. With the scheme provided by the application, the result of antigen peptide presentation by HLA can be predicted rapidly by the prediction model, reducing research and development cost and improving prediction efficiency.

Description

Method for constructing antigen peptide presentation prediction model, and antigen peptide prediction method and device
Technical Field
The application relates to the technical field of antigenic peptides, in particular to a construction method of an antigenic peptide presentation prediction model, and an antigenic peptide prediction method and device.
Background
T cell immunity is an important component of adaptive immunity and plays a central role in defending against pathogenic microorganism infection and tumors. A key step in T cell immunity is the interaction of the TCR (T cell antigen receptor) with the corresponding pMHC (antigen peptide-MHC molecule complex); in this interaction, MHC class I molecules present epitope polypeptides for recognition by CTLs (cytotoxic T lymphocytes), triggering the killing of target cells by effector cells. Studying the structure of complexes of antigenic peptides with MHC molecules helps to clarify the details of T cell immune responses and to accelerate the development of T cell epitope vaccines.
Human MHC molecules are also referred to as HLA molecules, HLA, or HLA antigens. The process by which HLA binds and presents an antigenic peptide for TCR recognition necessarily involves binding of HLA to the antigenic peptide, and a particular HLA selectively binds antigenic peptides that carry its required consensus motif. In the related art, wet experiments are generally used to determine which antigenic peptides can elicit T cell immune responses; however, such studies are time-consuming and labor-intensive.
Therefore, how to determine rapidly and inexpensively whether an antigenic peptide is presented by a particular HLA and can elicit a T cell immune response is a problem that needs to be addressed.
Disclosure of Invention
In order to solve or at least partially solve the problems in the related art, the present application provides a method for constructing an antigen peptide presentation prediction model, an antigen peptide prediction method, and corresponding apparatus, which can rapidly predict the result of antigen peptide presentation by HLA through the prediction model, reduce research and development cost, and improve prediction efficiency.
In a first aspect, the present application provides a method for constructing an antigenic peptide presentation prediction model, comprising:
acquiring a target HLA of a preselected type and positive sample data and negative sample data which correspond to the target HLA and have a preset proportion, wherein the positive sample data comprises a positive sample polypeptide sequence, an upstream sequence of the positive sample polypeptide sequence, a downstream sequence of the positive sample polypeptide sequence and a positive presentation result of the positive sample polypeptide sequence and the target HLA; the negative sample data comprises a negative sample polypeptide sequence different from the positive sample polypeptide sequence, an upstream sequence of the negative sample polypeptide sequence, a downstream sequence of the negative sample polypeptide sequence, and a negative presentation result of the negative sample polypeptide sequence and the target HLA;
respectively inputting the target HLA and the corresponding positive sample data and negative sample data into a plurality of sub-models with different architectures based on a BERT model for training to obtain a plurality of trained sub-models;
screening each trained sub-model through a preset rule to obtain a prediction model comprising preferred sub-models; wherein the prediction model combines the presentation results predicted by the preferred sub-models to predict the result of presentation of the target antigen peptide by the target HLA.
In one embodiment, the acquiring a target HLA of a preselected class and positive and negative sample data corresponding to the target HLA in a preset ratio includes:
for the target HLA, generating training data from the positive sample data and the negative sample data at a ratio of 1:(8-10), and generating test data from the positive sample data and the negative sample data at a ratio of 1:(800-1000).
In one embodiment, the training data is divided according to K-fold cross validation to obtain a training set and a validation set; and/or the training data is divided according to K-fold cross validation to obtain a training set and a validation set, and a preset number of pseudo-label data items are added to the training set, wherein the pseudo-label data is formed by using a pre-trained sub-model to predict unlabeled test data, the resulting predictions serving as pseudo labels.
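As an illustration of this embodiment, the following Python sketch shows one way the K-fold split and pseudo-labelling could be wired together. The helper names, K = 5, the confidence ranking, and the 0.5 threshold are assumptions for illustration; the patent specifies only K-fold cross validation and a preset number of pseudo-label samples.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_folds(sequences, labels, k=5, seed=42):
    """Yield (train_idx, val_idx) index pairs for K-fold cross validation."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    yield from skf.split(sequences, labels)

def add_pseudo_labels(train_seqs, train_labels, unlabeled_seqs,
                      pretrained_model, n_pseudo=1000, threshold=0.5):
    """Predict unlabeled test sequences with a pre-trained sub-model and
    append the n_pseudo most confident predictions as pseudo-label data.
    `pretrained_model.predict` is a hypothetical interface returning
    presentation probabilities."""
    probs = np.asarray(pretrained_model.predict(unlabeled_seqs))
    confidence = np.abs(probs - threshold)       # distance from decision boundary
    top = np.argsort(-confidence)[:n_pseudo]     # most confident first (an assumption)
    pseudo_seqs = [unlabeled_seqs[i] for i in top]
    pseudo_labels = (probs[top] >= threshold).astype(int)
    return (train_seqs + pseudo_seqs,
            np.concatenate([np.asarray(train_labels), pseudo_labels]))
```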
In an embodiment, the training of inputting the target HLA and the corresponding positive sample data and negative sample data into a plurality of sub-models with different architectures based on BERT models respectively to obtain a plurality of trained sub-models includes:
taking the positive sample polypeptide sequence and positive presentation result in the positive sample data, and the negative sample polypeptide sequence and negative presentation result in the negative sample data, as training data to train at least some of the BERT-based sub-models with different architectures, obtaining the corresponding trained sub-models; and/or taking the positive sample data and the negative sample data as training data to train at least some of the BERT-based sub-models with different architectures, obtaining the corresponding trained sub-models.
In one embodiment, the plurality of different architecture sub-models based on the BERT model includes at least one of:
a BERT and CNN fusion model, a BERT and LSTM fusion model, a BERT and LSTM and GRU fusion model, a BERT model with a double-layer sentence vector hidden layer, a BERT model with a three-layer sentence vector hidden layer, a BERT model with a global average pooling layer, a BERT model with word vector batch normalization, and a standard BERT model.
In an embodiment, the screening each trained sub-model by a preset rule to obtain a prediction model including a preferred sub-model includes:
respectively acquiring the precision and recall of the predicted presentation results of each sub-model;
determining an accuracy evaluation score for each sub-model through a preset evaluation function according to the precision and recall;
and screening among the sub-models according to the corresponding accuracy evaluation scores to obtain the preferred sub-models.
In one embodiment, the acquiring of the precision and recall of the predicted presentation results of each sub-model comprises:
respectively counting the numbers of TP (true positive), FP (false positive) and FN (false negative) results in the predicted presentation results of each sub-model;
and determining the precision and recall of each corresponding sub-model from the corresponding TP, FP and FN counts.
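A minimal sketch of this screening step follows. The patent does not name the evaluation function, so the F-score is assumed here as a plausible choice; `top_k` and the dictionary interface are likewise illustrative.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def f_score(precision, recall, beta=1.0):
    """F-beta score; F1 (beta=1) is assumed as the evaluation function."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def select_preferred(counts_per_submodel, top_k=5):
    """counts_per_submodel: {name: (tp, fp, fn)}; return the top_k names."""
    scores = {name: f_score(*precision_recall(tp, fp, fn))
              for name, (tp, fp, fn) in counts_per_submodel.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```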
In one embodiment, before the obtaining of the target HLA of the preselected category and the positive sample data and the negative sample data corresponding to the target HLA and having the preset ratio, the method includes:
obtaining polypeptide sequences bound and presented by the candidate HLA, and carrying out clustering processing on the obtained polypeptide sequences bound and presented by the candidate HLA according to sequence similarity to obtain a plurality of candidate HLA and corresponding positive sample polypeptide sequence sets;
and screening various target HLA in each candidate HLA, and taking the positive sample polypeptide sequence set corresponding to the candidate HLA as the positive sample data of the target HLA.
A second aspect of the present application provides a method for predicting an antigenic peptide, comprising:
obtaining a target antigen peptide sequence;
predicting the result of presentation of the target antigen peptide sequence by the target HLA in the prediction model according to the prediction model constructed as described above.
In one embodiment, the predetermined length of the target antigen peptide sequence is 8 to 11 amino acids.
A third aspect of the present application provides an apparatus for constructing an antigenic peptide presentation prediction model, comprising:
a sample acquisition module, configured to acquire a target HLA of a preselected type and positive sample data and negative sample data that correspond to the target HLA in a preset ratio, wherein the positive sample data comprises a positive sample polypeptide sequence, an upstream sequence of the positive sample polypeptide sequence, a downstream sequence of the positive sample polypeptide sequence, and a positive presentation result of the positive sample polypeptide sequence with the target HLA; the negative sample data comprises a negative sample polypeptide sequence different from the positive sample polypeptide sequence, an upstream sequence of the negative sample polypeptide sequence, a downstream sequence of the negative sample polypeptide sequence, and a negative presentation result of the negative sample polypeptide sequence with the target HLA;
the training module is used for inputting the target HLA and the corresponding positive sample data and negative sample data into a plurality of sub-models with different architectures based on a BERT model respectively for training to obtain a plurality of trained sub-models;
the screening module is used for screening each trained sub-model through a preset rule to obtain a prediction model comprising preferred sub-models; wherein the prediction model combines the presentation results predicted by the preferred sub-models to predict the result of presentation of the target antigen peptide by the target HLA.
A fourth aspect of the present application provides an antigenic peptide prediction apparatus comprising:
a sequence acquisition module, configured to acquire a target antigen peptide sequence; and
a prediction module, configured to predict, according to the prediction model constructed as described above, the result of presentation of the target antigen peptide sequence by the target HLA in the prediction model.
A fifth aspect of the present application provides an electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described above.
A sixth aspect of the present application provides a computer-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to perform the method as described above.
The technical scheme provided by the application can comprise the following beneficial effects:
according to the technical scheme, associated positive sample data are respectively obtained for each target HLA, in addition, after negative sample data completely different from the positive sample data are obtained, a plurality of different sub models based on the BERT model are trained, and the optimized sub models are screened from the trained sub models to comprehensively form an integral prediction model, so that the prediction presentation results output by the optimized sub models can be comprehensively obtained, the prediction result of whether the target antigen peptide is presented by the target HLA is obtained, and the accuracy of the prediction result is improved. According to the scheme, the trained prediction model can effectively assist research personnel in predicting the combination and presentation of the antigen peptide and the HLA, so that experiments are reduced, manpower and material resources are reduced, the research and development efficiency is improved, the research and development cost is reduced, and the research and development personnel can judge what antigen peptide is suitable for being used as a tumor polypeptide vaccine to stimulate T cells to generate immunocompetence.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings, wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a schematic flow chart of a method for constructing an antigenic peptide presentation prediction model according to an embodiment of the present application;
FIG. 2 is another schematic flow chart of a method for constructing an antigenic peptide presentation prediction model as shown in the examples herein;
FIG. 3 is a precision-recall (PR) curve plotted from the test results of 5 comparative models and the prediction model of the present application;
FIG. 4 is a schematic flow chart of a method for predicting an antigenic peptide as shown in the examples of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for constructing an antigenic peptide presentation prediction model according to an embodiment of the present application;
FIG. 6 is a schematic view showing another structure of an apparatus for constructing an antigenic peptide presentation prediction model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an antigenic peptide prediction apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are illustrated in the accompanying drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In the related art, antigenic peptides can be presented for recognition by a TCR upon binding to a specific HLA, thereby generating T cell immunity. However, the antigen peptides capable of inducing a T cell immune response are generally determined one by one through wet experiments, and the experimental process takes a great deal of time, labor and materials.
In view of the above problems, embodiments of the present application provide a method for constructing an antigenic peptide presentation prediction model and a method for predicting an antigenic peptide, which can quickly predict an antigenic peptide that can be presented by an HLA through a prediction model, reduce research and development costs, and improve prediction efficiency.
The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a method for constructing an antigenic peptide presentation prediction model according to an embodiment of the present application.
Referring to fig. 1, a method for constructing an antigenic peptide presentation prediction model provided in an embodiment of the present application includes:
s110, obtaining a target HLA of a preselected type and positive sample data and negative sample data which correspond to a single target HLA and have a preset proportion, wherein the positive sample data comprises a positive sample polypeptide sequence, an upstream sequence of the positive sample polypeptide sequence, a downstream sequence of the positive sample polypeptide sequence and a positive presentation result of the positive sample polypeptide sequence and the target HLA; the negative sample data comprises a negative sample polypeptide sequence different from the positive sample polypeptide sequence, an upstream sequence of the negative sample polypeptide sequence, a downstream sequence of the negative sample polypeptide sequence, and a negative presentation result of the negative sample polypeptide sequence and the target HLA.
It is understood that each target HLA has an antigen binding groove for receiving an antigen peptide, and thus can receive and bind a certain length of amino acid residues of the antigen peptide. Since there are tens of thousands of HLA types, the type of target HLA may be preselected according to actual needs, and one or more types may be selected. The target HLA in this embodiment may be selected from human HLA-I and/or HLA-II. For example, 4 HLA-I types common in Chinese and American populations, such as A*11:01, A*24:02, C*07:02 and A*02:01, can be used as target HLAs.
For each class of target HLA, there is corresponding positive and negative sample data, and for the same class of target HLA, the positive and negative sample data are completely different. In this step, the positive sample polypeptide sequence in the positive sample data refers to a polypeptide sequence that can bind to a corresponding target HLA and can be presented by the target HLA, wherein the length of the positive sample polypeptide sequence is within a first preset length threshold range, and the first preset length threshold range may be 8 to 11; the upstream sequence is a sequence connected to the N end of the positive sample polypeptide sequence, the downstream sequence is a sequence connected to the C end of the positive sample polypeptide sequence, the lengths of the upstream sequence and the downstream sequence are respectively within a second preset length threshold range, and the second preset length threshold range can be 7-30. The positive presentation result is that the polypeptide sequence in the positive sample can be presented by the target HLA. Correspondingly, the negative sample polypeptide sequence in the negative sample data is a polypeptide sequence which cannot be presented by the target HLA, and the upstream sequence and the downstream sequence of the negative sample polypeptide sequence are sequences correspondingly connected to the N end and the C end of the negative sample polypeptide sequence. The length of the negative sample polypeptide sequence is within a first preset length threshold range, and the length of the upstream sequence and the length of the downstream sequence are respectively within a second preset length threshold range. Negative presentation means that the negative sample polypeptide sequence cannot be presented by the target HLA. It can be understood that the target HLA of each class has a different polypeptide sequence of the positive sample, and the target HLA needs to obtain different negative sample data from the corresponding positive sample data. In order to improve training efficiency, common negative sample data may be obtained by pre-screening, the common negative sample data is different from the positive sample data of all kinds of target HLAs, and the common negative sample data may be used as the respective negative sample data for each kind of target HLAs.
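For illustration, one possible representation of a single sample record described above is sketched below in Python; the field names and the example sequences are hypothetical, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class PresentationSample:
    hla: str          # target HLA allele, e.g. "A*11:01"
    peptide: str      # sample polypeptide sequence, 8-11 amino acids
    upstream: str     # sequence joined to the N-terminus, 7-30 amino acids
    downstream: str   # sequence joined to the C-terminus, 7-30 amino acids
    presented: int    # 1 = positive presentation result, 0 = negative

# Hypothetical positive sample (sequences invented for illustration):
sample = PresentationSample("A*11:01", "KTFPPTEPK",
                            "MESLVPGFNEK", "LDDKDPNFKDQ", 1)
```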
Further, for a single target HLA, the corresponding positive sample data can be obtained from known experimental data, thereby ensuring the accuracy of the positive sample data. In addition, the data amounts of the positive sample data and the negative sample data are configured according to the preset ratio; for example, making the amount of negative sample data larger than that of positive sample data enriches the training data of the model and improves the accuracy of the model prediction results when positive sample data is scarce. In one embodiment, for a target HLA, training data is generated from positive and negative sample data at a ratio of 1:(8-10), and test data is generated from positive and negative sample data at a ratio of 1:(800-1000). That is, the proportion of negative to positive sample data in the training data is smaller than in the test data, which improves training efficiency, while the number of negative samples in the test data is much larger than the number of positive samples, ensuring the accuracy of the model evaluation.
S120, respectively inputting the target HLA and the corresponding positive sample data and negative sample data into a plurality of BERT-based sub-models with different architectures for training, to obtain a plurality of trained sub-models.
The BERT model (Bidirectional Encoder Representations from Transformers) is a pre-trained language model. Applied to proteins, the BERT model learns from tens of millions of unlabeled protein sequences in the Pfam database during pre-training through two unsupervised tasks, character (residue) masking and next-sentence prediction, thereby learning the rules of natural protein sequences. In this step, various algorithms can be combined with the BERT model to obtain a plurality of BERT-based sub-models with different architectures.
In one embodiment, the sub-model may be at least one of 8 models: a BERT and CNN fusion model (BERT-CNN), a BERT and LSTM fusion model (BERT-LSTM), a BERT and LSTM and GRU fusion model (BERT-LSTM-GRU), a BERT model with a two-layer sentence vector hidden layer (BERT-2), a BERT model with a three-layer sentence vector hidden layer (BERT-3), a BERT model with a global average pooling layer (BERT-pool), a BERT model with word vector batch normalization (BERT-Norm), and a standard BERT model (BERT-standard). Specifically, the BERT-CNN, BERT-LSTM and BERT-LSTM-GRU fusion models acquire deeper information from the data; the BERT models with two-layer and three-layer sentence vector hidden layers capture shallow information; and the BERT model with a global average pooling layer, the BERT model with word vector batch normalization, and the standard BERT model capture current-layer information. Because sub-models with different algorithms and architectures process different levels of information in the input data, each sub-model outputs its own predicted presentation result, which may agree or differ across sub-models, so that the prediction model in the subsequent steps can combine these predicted presentation results into a final prediction.
Specifically, the BERT model can be a mini-BERT comprising 6 hidden layers, a hidden dimension of 504, and 12 attention heads; on the basis of these parameters, the 8 sub-models with different architectures are formed by combining with other algorithms, as shown in Table 1 below:
TABLE 1
Serial number Sub-model Implicit vector sources Deep model addition Policy Output of
1 BERT-CNN Layer-6 3*1dCNN - CNN hidden variables
2 BERT-LSTM Layer-6 1*BiLSTM Global averaging and maximum pooling Average + twoHidden variable to last + Max
3 BERT-LSTM-GRU Layer-6 BiLSTM+BiGRU Global averaging and max pooling Mean + two-way end hidden variable + max
4 BERT-2 Layer-5+6 - - Double-layer sentence vector
5 BERT-3 Layer-4+5+6 - - Three-layer sentence vector
6 BERT-pool Layer-6 - Global average pooling Word vector averaging
7 BERT-Norm Layer-6 - BatchNorm Word vector batch normalization
8 BERT-standard Layer-6 - - Sentence vector
Table 1 shows specific ways of adapting the BERT-based architecture to make fuller use of the information in the training data. The BERT model used in the application has 6 layers, and each layer comprises a sentence vector contained in the [CLS] vector and word vectors contained in the other tokens. The BERT-CNN, BERT-LSTM and BERT-LSTM-GRU models in Table 1 aim to mine deeper information from the training data: on top of the sixth-layer hidden variables of the BERT model, these three sub-models attach 3 one-dimensional CNN layers, 1 BiLSTM layer, and (1 BiLSTM layer + 1 BiGRU layer), respectively, and correspondingly output the CNN hidden variables or the information extracted by the LSTM/GRU. The LSTM and GRU are both bidirectional RNN architectures; global average pooling and global max pooling are applied to their bidirectional hidden variables, the 2 bidirectional final hidden states are taken, and the four parts are concatenated as the extracted LSTM/GRU information.
BERT-2 and BERT-3 are intended to avoid missing shallow information in the training data: these two sub-models concatenate the sentence vectors of the last 2 layers and the last 3 layers of the BERT model, respectively. BERT-pool, BERT-Norm and BERT-standard aim to extract current-layer information: these three sub-models respectively apply global average pooling or batch normalization to the sixth-layer word vectors of the BERT model, or directly output the sentence vector.
The architecture adaptations described above are all based on the BERT model and yield the 8 BERT models with different architectures. On one hand, the [CLS] output vectors of the last 1, 2 or 3 internal hidden layers of the standard BERT model are extracted, pooled and concatenated with all word vectors of the last layer, and finally used for prediction. On the other hand, the hidden-variable information of the last hidden layer of the standard BERT model is passed to deeper network models such as a bidirectional LSTM: higher-dimensional features of the sequence are extracted by the deeper network, the bidirectional GRU outputs and hidden-state features are aggregated through operations such as taking the hidden states, average pooling and max pooling, and finally the pooled BERT outputs are concatenated for prediction, thereby obtaining the plurality of adapted sub-models. Of course, in other embodiments, more BERT models with other architectures may be derived from the BERT model, which is not limited herein. It will be appreciated that the predicted presentation result output by each sub-model is determined from the probability value output by that sub-model: when the probability value is greater than or equal to a preset probability threshold, the polypeptide sequence is predicted to be presented by the corresponding target HLA; when the probability value is smaller than the threshold, the polypeptide sequence is predicted not to be presented. A sketch of one such sub-model head is given below.
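As a concrete illustration, the following PyTorch sketch shows a head in the spirit of the BERT-LSTM row of Table 1: a BiLSTM over the sixth-layer token states, with global average pooling, global max pooling, and the two final bidirectional hidden states concatenated for prediction. It assumes a HuggingFace-style encoder exposing `last_hidden_state`; the dimensions and classifier are illustrative, not the patented implementation.

```python
import torch
import torch.nn as nn

class BertLstmHead(nn.Module):
    """BiLSTM head over the sixth-layer token states (cf. Table 1, row 2)."""
    def __init__(self, bert, hidden=504, lstm_hidden=128):
        super().__init__()
        self.bert = bert
        self.lstm = nn.LSTM(hidden, lstm_hidden, batch_first=True,
                            bidirectional=True)
        # avg pool + max pool + two final hidden states, each 2*lstm_hidden wide
        self.classifier = nn.Linear(6 * lstm_hidden, 1)

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids, attention_mask=attention_mask)
        tokens = states.last_hidden_state             # (B, T, hidden)
        out, (h_n, _) = self.lstm(tokens)             # out: (B, T, 2*lstm_hidden)
        avg = out.mean(dim=1)                         # global average pooling
        mx, _ = out.max(dim=1)                        # global max pooling
        last = torch.cat([h_n[0], h_n[1]], dim=1)     # bidirectional final states
        feats = torch.cat([avg, mx, last], dim=1)     # four parts concatenated
        return torch.sigmoid(self.classifier(feats))  # presentation probability
```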
Further, when the target HLA has only one type, training each sub-model according to the positive sample data and the negative sample data of the target HLA to obtain each trained sub-model. When the types of the target HLA are more than one, aiming at the sub-model of each framework, each target HLA can respectively train the corresponding data to the sub-model of the framework, so that each target HLA has an independent trained sub-model; alternatively, the data of all target HLAs may be synchronously input into a sub-model of the same architecture for training, that is, all target HLAs share a sub-model of the same architecture.
Further, in one embodiment, the positive sample polypeptide sequence and positive presentation result in the positive sample data, together with the negative sample polypeptide sequence and negative presentation result in the negative sample data, are used as training data and input into a plurality of BERT-based sub-models with different architectures for training, giving a plurality of trained sub-models; and/or the full positive sample data and negative sample data are used as training data and input into the sub-models for training. That is, as shown in step S110, the positive and negative sample data each comprise 4 types of data; in this embodiment, for the sub-model of each architecture, either only 2 of these types (the sample polypeptide sequences and the presentation results) or all 4 types may be used to train some or all of the architectures, obtaining the corresponding trained sub-models. Training the sub-models on these two different selections of sample data yields different sub-models along an additional dimension: for sub-models of the same architecture, different choices of training data give different trained sub-models, whose prediction accuracies may differ and whose predicted presentation results for the same antigen peptide may also differ. A hypothetical input-encoding helper for the two cases is sketched below.
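The sketch reuses the hypothetical PresentationSample record shown earlier; the separator token and residue-per-token layout are assumptions, not the patented encoding.

```python
def encode_input(sample, use_flanks=False, sep="[SEP]"):
    """Build the token string fed to a BERT-based sub-model from one
    PresentationSample (the hypothetical record sketched earlier).
    use_flanks=False: peptide only (2 data types, with the label);
    use_flanks=True: upstream + peptide + downstream (all 4 data types).
    Residues are space-separated so each amino acid is one token."""
    fields = ([sample.upstream, sample.peptide, sample.downstream]
              if use_flanks else [sample.peptide])
    text = f" {sep} ".join(" ".join(f) for f in fields)
    return text, sample.presented   # label returned as the training target
```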
It is understood that, when a plurality of kinds of target HLAs share a sub-model of the same architecture, it is possible to predict whether or not any target antigen peptide can be presented by each target HLA, based on the trained sub-models of the respective architectures, that is, to output the predicted presentation results corresponding to the respective target HLAs simultaneously. Alternatively, for any target antigen peptide, the antigen peptide sequence and a certain target HLA can be input simultaneously, and whether the target antigen peptide can be presented by the target HLA can be output specifically. When a plurality of types of target HLAs have independently trained submodels of respective frameworks, the sequence of the same target antigen peptide may be input to each submodel of the corresponding target HLA, and each submodel of each target HLA may output a corresponding predicted presentation result.
S130, screening each trained sub-model through a preset rule to obtain a prediction model comprising preferred sub-models; wherein the prediction model combines the presentation results predicted by the preferred sub-models to predict the result of presentation of the target antigen peptide by the target HLA.
It can be understood that, after sub-models with different architectures are selected in step S120 and trained on different kinds of sample data, the number of trained sub-models obtained varies. For example, when all 8 kinds of sub-models are selected and each is trained on the data of the two cases described above, 16 trained sub-models can be obtained. The trained sub-models are then screened through a preset rule, several sub-models with higher prediction accuracy are selected as the preferred sub-models, and the final prediction model is formed from these preferred sub-models. In this embodiment, by integrating the predicted presentation results of a plurality of different BERT-based preferred sub-models, the prediction model can acquire feature vectors from hidden layers at different depths more completely, output a final prediction result, and further improve the comprehensiveness and accuracy of the prediction. The prediction model combines the predicted presentation results of the preferred sub-models, i.e., it can predict whether any target antigen peptide can be presented by the corresponding target HLA.
For ease of understanding: for example, when the prediction model includes 5 preferred sub-models, each preferred sub-model separately outputs its predicted presentation result for the target antigenic peptide; if 3 preferred sub-models predict that the target antigenic peptide can be presented and the other 2 predict that it cannot, the prediction model outputs the combined result that the target antigenic peptide can be presented by the target HLA, as sketched below.
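A minimal sketch of this majority vote, assuming each preferred sub-model exposes a hypothetical `predict(peptide, hla)` method returning a presentation probability:

```python
def ensemble_predict(models, peptide, hla, threshold=0.5):
    """Majority vote over the preferred sub-models."""
    votes = [m.predict(peptide, hla) >= threshold for m in models]
    presented = sum(votes) > len(votes) / 2    # e.g. 3 of 5 votes -> presented
    return presented, votes
```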
It will be appreciated that when there is only one target HLA class, there is a corresponding one predictive model. When there are more than one target HLA types, each target HLA has a separate prediction model, or each target HLA has a common prediction model. In the latter case, when it is necessary to predict the presentation result of a target antigen peptide, the common prediction model synchronously outputs the presentation results of the target antigen peptide and each target HLA. That is, the prediction model of the present application can be used to predict the presentation result of a target antigen peptide bound to each specific target HLA by constructing a plurality of specific target HLAs individually or collectively.
As can be seen from this embodiment, the method for constructing an antigen peptide presentation prediction model according to the present application obtains associated positive sample data for each target HLA and negative sample data completely different from the positive sample data, trains a plurality of different BERT-based sub-models, and screens preferred sub-models from the trained sub-models to form an integrated prediction model, so that the presentation results output by the preferred sub-models can be combined to predict whether the target antigen peptide is presented by the target HLA, improving the accuracy of the prediction result. With this scheme, the trained prediction model can effectively assist researchers in predicting the binding and presentation of antigen peptides by HLA, thereby reducing experiments, saving manpower and material resources, improving research and development efficiency, and lowering research and development cost, and helps researchers judge which antigen peptides are suitable as tumor polypeptide vaccines for stimulating T cells to generate immune activity.
Fig. 2 is another schematic flow chart of a method for constructing an HLA presentation prediction model according to an embodiment of the present application.
Referring to fig. 2, a method for constructing an HLA presentation prediction model according to an embodiment of the present application includes:
and S210, collecting candidate HLA of different data sources and candidate polypeptide sequences corresponding to the candidate HLA, filtering and sorting the data, and combining to obtain positive sample polypeptide sequence sets corresponding to the candidate HLA and the candidate HLA.
In this step, a plurality of HLA candidates and candidate polypeptide sequences (including positive sample polypeptide sequences and negative sample polypeptide sequences) may be obtained from clinical experimental data and/or various public databases. It is understood that, for different data acquisition sources, based on the complexity of the data, the obtained data may be specifically cleaned and sorted by corresponding processing means to obtain the positive sample polypeptide sequences corresponding to the candidate HLAs that can be bound and presented, so as to facilitate subsequent screening to obtain the target HLAs and the corresponding positive sample polypeptide data.
In clinical experimental data, mass spectral data from different tissues of multiple different human bodies was collected, which in total could contain tens of thousands of candidate polypeptide sequences, of which only a portion could be HLA-I bound and presented as a positive sample polypeptide sequence. Therefore, in order to improve the accuracy of the positive sample data of the subsequent steps, the candidate polypeptide sequences are washed and filtered in advance, the polypeptide sequences which may produce false positive results and the polypeptide sequences which cannot be presented by the binding of each candidate HLA (namely, the negative sample polypeptide sequences) are removed, and the reasonable positive sample polypeptide sequences are obtained through screening.
In one specific embodiment, among the candidate polypeptide sequences, a plurality of first polypeptide sequences with an FDR value of less than 0.1 are screened; according to the retrieval ID of each first polypeptide sequence, ID matching is carried out in a known database respectively, and a second polypeptide sequence which is successfully matched is obtained through screening; acquiring an upstream sequence and a downstream sequence corresponding to each second polypeptide sequence from a known database, and completing sample data corresponding to the second polypeptide sequences; and filtering each second polypeptide sequence according to a preset filtering rule to obtain a preferred polypeptide sequence.
Specifically, a plurality of first polypeptide sequences with FDR (false discovery rate, the mass-spectrometry protein error rate calculated by Percolator) < 0.1 are selected, for example by cleaning and filtering with the Percolator algorithm (a quality-control algorithm for mass spectrometry search data). Matching each first polypeptide sequence according to its protein source yields a corresponding retrieval ID, such as a UniProt ID. From the retrieval ID, the corresponding Ensembl ID can be found in the Ensembl database by means of the BioMart function. If the retrieval ID of a polypeptide sequence matches an Ensembl ID in the Ensembl database, the ID matching is successful; that is, the upstream and downstream sequences of the first polypeptide sequence can be found in the Ensembl database (thus obtaining the upstream and downstream sequences corresponding to the second polypeptide sequence), and the corresponding sample data is completed. If the ID matching fails, the first polypeptide sequence is deleted; that is, a first polypeptide sequence whose upstream and downstream sequences cannot be obtained is not used as a second polypeptide sequence. A schematic sketch of this matching follows.
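The sketch below treats the UniProt-to-Ensembl mapping as a plain dictionary rather than a live BioMart query; the mapping source and record layout are assumptions for illustration.

```python
def match_ids(first_peptides, uniprot_to_ensembl, ensembl_records):
    """Keep peptides whose retrieval ID maps to an Ensembl record, and
    complete them with upstream/downstream sequences from that record.

    first_peptides: [(peptide, uniprot_id), ...]
    uniprot_to_ensembl: {uniprot_id: ensembl_id} (e.g. exported via BioMart)
    ensembl_records: {ensembl_id: (upstream, downstream)}
    """
    second_peptides = []
    for pep, uid in first_peptides:
        eid = uniprot_to_ensembl.get(uid)
        if eid is None or eid not in ensembl_records:
            continue                  # ID matching failed: drop the peptide
        up, down = ensembl_records[eid]
        second_peptides.append((pep, up, down))
    return second_peptides
```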
To improve the accuracy of the prediction results in subsequent steps, the data-completed second polypeptide sequences are further filtered according to the following preset filtering rules to obtain the preferred polypeptide sequences. In a specific embodiment, second polypeptide sequences are deleted if they contain unnatural amino acids (i.e., amino acids not among the 20 natural ones), have an abnormal data format, or contain gaps; second polypeptide sequences with a length outside 8-11 are deleted, because polypeptide sequences outside the first preset length threshold range are difficult to bind to HLA; second polypeptide sequences lacking protein information are deleted; and duplicate second polypeptide sequences are deleted. It will be appreciated that the second polypeptide sequences removed by this filtering are not used as positive sample polypeptide sequences in subsequent steps, while it remains to be confirmed whether each retained preferred polypeptide sequence is actually bound and presented by HLA. That is, each retained preferred polypeptide sequence consists of natural amino acids, has no gaps, has a length of 8-11, and is not duplicated. Further, if the number of amino acids in the upstream or downstream sequence of a retained preferred polypeptide sequence is smaller than the second preset length threshold (for example, when the threshold is 30 and an upstream sequence has fewer than 30 amino acids), the vacant positions are uniformly filled with the character "X" until the sequence reaches 30 amino acids, ensuring that the upstream and downstream sequences each satisfy the second preset length threshold. A sketch of these filtering rules follows.
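A minimal Python sketch of these filtering rules, using the thresholds stated above (length 8-11, flanks padded to 30 with "X"); the record layout and helper names are illustrative.

```python
NATURAL_AA = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 natural amino acids

def pad_flank(seq, width=30, fill="X"):
    """Right-pad a flanking sequence with 'X' up to `width` residues."""
    return seq + fill * (width - len(seq)) if len(seq) < width else seq

def filter_peptides(records):
    """records: iterable of (peptide, upstream, downstream) tuples."""
    seen, kept = set(), []
    for pep, up, down in records:
        if not pep or pep in seen:
            continue                        # drop gaps and duplicates
        if not (8 <= len(pep) <= 11):
            continue                        # outside the binding length range
        if not set(pep) <= NATURAL_AA:
            continue                        # contains unnatural amino acids
        seen.add(pep)
        kept.append((pep, pad_flank(up), pad_flank(down)))
    return kept
```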
Further, among the above-mentioned preferred polypeptide sequences, on the one hand, the positive sample polypeptide sequences that can be bound and presented by a candidate HLA need to be selected, and on the other hand, the mixed candidate HLAs and mixed positive sample polypeptide sequences need to be distinguished. In this embodiment, taking human HLA-I as an example, human HLA-I has 3 gene loci, and since alleles exist in pairs, each human naturally carries 3 × 2 = 6 HLA-I alleles, i.e., each person can carry 6 HLA-I types; the HLA-I types carried by different individuals may be entirely the same, partially the same, or entirely different.
Since there are many HLA-I types across the population, and each HLA-I can bind and present more than one positive sample polypeptide sequence, for the preferred polypeptide sequences obtained above, in one embodiment, the positive sample polypeptide sequences bound and presented by candidate HLAs are obtained and clustered according to sequence similarity, yielding a plurality of candidate HLAs and their corresponding positive sample polypeptide sequence sets.
In a specific embodiment, all preferred polypeptide sequences in clinical experimental data are grouped according to different HLA-I correspondences in human bodies to form a plurality of first sequence sets; in each first sequence set, according to the corresponding known HLA-I type in human body, forming a plurality of groups of second sequence sets which can be combined and presented by the corresponding HLA-I in the corresponding first sequence sets according to sequence similarity; and merging the HLA-I of the same species among different human bodies and the corresponding second sequence sets to obtain a plurality of HLA-I and corresponding positive sample polypeptide sequence sets.
For ease of understanding: in this embodiment, all the collected preferred polypeptide sequences are grouped by individual, which on one hand associates specific individuals with their polypeptide sequences, and on the other hand allows screening against the HLA-I of each individual, so that positive sample polypeptide sequences capable of being bound and presented are retained among the preferred polypeptide sequences and redundant negative sample polypeptide sequences are removed, giving a first sequence set consisting of positive sample polypeptide sequences. The positive sample polypeptide sequences in each first sequence set can then be clustered separately by a Gibbs clustering algorithm. It is to be understood that HLA-I recognition of antigen peptides is specific: the anchoring sites at both ends of an antigen peptide bind HLA class I molecules, 4 to 5 anchoring sites in the core sequence bind HLA class II molecules, and the composition of the corresponding amino acid residues is relatively constant. Thus, antigenic peptides capable of binding the same HLA molecule often share the same or similar anchoring positions and anchoring residues, constituting a consensus motif specific to that HLA molecule; each HLA-I therefore exhibits a specific consensus motif when binding polypeptide sequences. Accordingly, in this embodiment, after the motifs of the candidate HLAs in each individual are obtained by clustering, the similarity between each positive sample polypeptide sequence in the first sequence set and the motif of each candidate HLA can be compared, so as to assign each positive sample polypeptide sequence to one of the candidate HLAs. That is, in the first sequence set corresponding to each individual, multiple sets of polypeptide sequences, i.e., multiple second sequence sets, are obtained by aligning against the motifs of the individual's candidate HLAs and clustering. After multiple second sequence sets are obtained for each individual, candidate HLAs of the same type across different individuals are merged, yielding a plurality of merged candidate HLAs and their corresponding positive sample polypeptide sequence sets. With this design, the obtained positive sample polypeptide sequences bound and presented by candidate HLAs are clustered, which facilitates the sub-models in subsequent steps in specifically extracting the features of the positive sample polypeptide sequences corresponding to each target HLA.
Further, candidate HLAs and their corresponding bound and presented polypeptide sequences (positive sample polypeptide sequences) can also be obtained from various public databases; unlike the polypeptide sequences in clinical experimental data, the polypeptide sequences obtained from these databases are all definitively bound and presented. For example, 58 HLA-I (i.e., candidate HLAs) have been collected from Bulik-Sullivan, together with their corresponding binding and presented polypeptide sequences and upstream and downstream sequences; 16 HLA-I with corresponding polypeptide sequences and upstream and downstream sequences from Abelin 2017; 28 HLA-I with corresponding polypeptide sequences and upstream and downstream sequences from Pearson 2016; 17 HLA-I with corresponding binding and presented polypeptide sequences from Di Marco 2017; 77 HLA-I with corresponding sequences from SysteMHC; 79 HLA-I with corresponding sequences from HLAthena; 36 HLA-I from NetMHCpan; and 119 HLA-I with corresponding binding and presented polypeptide sequences from IEDB.
It is understood that the collected candidate HLAs come from different databases, and data from different databases may partially overlap or be incomplete (some polypeptide sequences lack upstream and downstream sequences). To improve the integrity of the training data for subsequent steps, the collected candidate HLAs and related sequence data may be completed, filtered and sorted, so that the complete data attributes of each candidate HLA include the corresponding HLA sequence, and the positive sample polypeptide sequences, polypeptide sequence lengths, upstream and downstream sequences, and presentation results corresponding to that candidate HLA.
To complete the upstream and downstream sequences corresponding to the positive sample polypeptide sequences, similarly to the processing of clinical experimental data, each candidate HLA collected from an individual database can be matched, according to the retrieval ID of its positive sample polypeptide sequence, against Ensembl IDs in the Ensembl database via the BioMart function; if the two ID numbers are the same, the matching is successful. If the matching fails, i.e., no Ensembl ID corresponding to the retrieval ID can be found in the Ensembl database, the collected HLA and related data are deleted. Further, for the candidate HLAs with successfully matched IDs, the complete data of each candidate HLA can be obtained from the Ensembl database by means of the corresponding Ensembl ID, including the positive sample polypeptide sequence, the protein sequence in which it resides, and the upstream and downstream sequences of the polypeptide sequence, so that incomplete sequence data corresponding to each candidate HLA can be completed. That is, the data corresponding to each candidate HLA includes the HLA sequence, and the positive sample polypeptide sequence, polypeptide sequence length, upstream and downstream sequences, and positive presentation result corresponding to that HLA. In one embodiment, the lengths of the upstream and downstream sequences of the positive sample polypeptide sequence are within the second preset length threshold range, which may be 7-30; for example, the upstream and downstream sequences each comprise 7 to 30 amino acids. With this design, upstream and downstream information of the positive sample polypeptide sequence is not omitted.
Further, the positive sample polypeptide sequences obtained from the various databases and completed with data can be filtered according to the aforementioned preset filtering rules to obtain more reasonable sequence data, i.e., positive sample polypeptide sequences consisting of natural amino acids, with no gaps, lengths of 8-11, and no duplicates. After this same filtering operation, each candidate HLA retains its corresponding positive sample polypeptide sequences with their upstream and downstream sequences and positive presentation results.
It can be understood that, if the data sources include both clinical experimental data and public databases, the candidate HLAs and corresponding sequence data obtained from the different sources, after completion, filtering and sorting, are finally merged to obtain the positive sample polypeptide sequence sets corresponding to the multiple candidate HLAs. Each positive sample polypeptide sequence set contains the polypeptide sequences that can be bound and presented by the corresponding candidate HLA, together with the upstream and downstream sequences of each polypeptide sequence, ensuring that these data can serve as positive sample data in subsequent steps.
S220, screening multiple target HLAs from the candidate HLAs, and taking the positive sample polypeptide sequence set corresponding to each candidate HLA as the positive sample data of that target HLA.
It can be appreciated that HLA classes with wider coverage in the human population have richer real experimental data available as training data. In this step, among the collected HLAs, multiple HLAs may be preselected as target HLAs; for example, 4 target HLAs that are common in Chinese and American populations (e.g., A*11:01, A*24:02, C*07:02, A*02:01) may be screened from the multiple HLA-I alleles sorted in the above step, and the polypeptide sequence set of each HLA used as the corresponding positive sample data.
It can be understood that, after the processing in S210 above, each type of target HLA has its own positive sample data, and the positive sample data of different target HLAs do not interfere with each other.
S230, acquiring negative sample data in a preset proportion according to the data volume of the positive sample data corresponding to each target HLA, and determining training data and test data from the positive sample data and the negative sample data.
In this step, for each target HLA, negative sample data in the preset proportion may be selected independently according to the volume of the corresponding positive sample data; alternatively, a common negative sample data set may be screened, from which each target HLA draws its preset proportion of negative sample data. The negative sample data can be selected from polypeptide sequences in human proteins, and for a given target HLA the negative sample polypeptide sequences must not overlap with the positive sample polypeptide sequences in its positive sample data, ensuring that each negative sample polypeptide sequence cannot be presented, i.e., has a negative presentation result, for that same target HLA.
Further, in order to obtain enough data to train the prediction model, the positive and negative sample data are divided in advance into training data and test data. In one embodiment, for a target HLA, positive and negative sample data are drawn for the training data at a ratio of 1:(8-10), and for the test data at a ratio of 1:(800-1000). That is, each target HLA has its own positive and negative sample data, which are further divided into training data, in which the ratio of the number of positive sample polypeptide sequences to negative sample polypeptide sequences is 1:(8-10), e.g., 1:10, and test data, in which the ratio is 1:(800-1000), e.g., 1:1000. The test data are kept disjoint from the sample data used for training, so that evaluating a trained submodel on the test data remains reliable.
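As a minimal sketch of this sampling and splitting step: negatives are drawn from human-proteome peptides that never appear among the positives, then positives and negatives are combined at the preset ratios. The 80/20 split of positives between training and test data and the uniform random draw are assumptions; the patent fixes only the 1:(8-10) and 1:(800-1000) ratios.

```python
import random

def build_datasets(positives, proteome_peptides,
                   train_neg_ratio=10, test_neg_ratio=1000, seed=0):
    """Assemble (peptide, label) training and test data at the preset ratios."""
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    n_train_pos = int(0.8 * len(pos))  # assumed 80/20 positive split
    train_pos, test_pos = pos[:n_train_pos], pos[n_train_pos:]

    pos_set = set(pos)
    negatives = [p for p in proteome_peptides if p not in pos_set]
    rng.shuffle(negatives)
    n_train_neg = train_neg_ratio * len(train_pos)   # 1:(8-10)
    n_test_neg = test_neg_ratio * len(test_pos)      # 1:(800-1000)

    train = [(p, 1) for p in train_pos] + [(p, 0) for p in negatives[:n_train_neg]]
    test = [(p, 1) for p in test_pos] + \
           [(p, 0) for p in negatives[n_train_neg:n_train_neg + n_test_neg]]
    return train, test
```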
S240, inputting each target HLA and its corresponding training data into a plurality of BERT-based submodels of different architectures for training, and screening the trained submodels by a preset rule to obtain the preferred submodels.
In this step, in order to evaluate the submodels more reliably, this embodiment trains the plurality of BERT-based submodels of different architectures using K-fold cross validation. In one embodiment, the training data is divided into a training set and a validation set according to K-fold cross validation, where K is a natural number, e.g., 3, 4, 5 or 6. For example, when K is 5, the training set and validation set are obtained by five-fold cross validation, so that the ratio of training set to validation set may be 4:1. Each fold of data contains both positive and negative sample data.
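A minimal sketch of the fold division, assuming scikit-learn is available (the patent names no library). Stratified folds are one way to guarantee that every fold contains both positive and negative sample data, as required above.

```python
from sklearn.model_selection import StratifiedKFold

def kfold_splits(peptides, labels, k=5, seed=42):
    """Yield (train_idx, val_idx) index pairs; with k=5 each validation
    set is 1/5 of the data, giving the 4:1 ratio described above."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    yield from skf.split(peptides, labels)
```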
Further, in order to evaluate the influence of different types of training data on submodels of different architectures, two corresponding types of training data are used to train each submodel; for example, with the 8 submodel architectures introduced in step S120, at most 16 trained submodels can be obtained, as described in step S120 and not repeated here. Of course, this embodiment does not limit the number of trained submodels: fewer than 16 may be used, or more than 16 trained submodels may be obtained with additional algorithms, architectures and training data types.
Further, to evaluate each submodel, in one embodiment the predicted presentation result output by each trained submodel is compared with the true positive or true negative presentation result in the corresponding sample data, and the prediction is labeled accordingly. For a positive sample polypeptide sequence (with or without its upstream and downstream sequences) input to a submodel, the prediction is labeled True Positive (TP) when the predicted presentation result is presentation, and False Negative (FN) when the predicted presentation result is no presentation. For a negative sample polypeptide sequence (with or without its upstream and downstream sequences), the prediction is labeled False Positive (FP) when the predicted presentation result is presentation, and True Negative (TN) when it is no presentation. That is, if the predicted presentation result of a trained submodel differs from the true presentation result, the submodel's prediction is wrong; if they are the same, the prediction is correct.
In one embodiment, the numbers of TP, FP and FN in the predicted presentation results output by each BERT-based submodel are counted, and the precision and recall of each submodel are determined from these counts. The precision of each submodel's predictions is determined according to formula (1) below, and the recall according to formula (2). From the precision and recall, the accuracy evaluation score of the corresponding submodel, its F1 score, can be calculated via the preset evaluation function in formula (3):
Precision = TP / (TP + FP)        (1)

Recall = TP / (TP + FN)        (2)

F1 = 2 × Precision × Recall / (Precision + Recall)        (3)
In one embodiment, the precision and recall of the predicted presentation results of each BERT-based submodel are obtained; the accuracy evaluation score (F1 score) of each BERT-based submodel is determined via the preset evaluation function from the precision and recall; and the preferred submodels are screened from all submodels according to the corresponding accuracy evaluation scores. That is, all trained BERT-based submodels can be ranked by the value of the accuracy evaluation score, a higher value indicating higher prediction accuracy, so that several submodels with relatively high prediction accuracy are selected as preferred submodels.
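Formulas (1)-(3) and the ranking step translate directly into code. In this sketch, counts is assumed to map each submodel name to its (TP, FP, FN) counts on the validation data.

```python
def f1_score(tp, fp, fn):
    """F1 score per formulas (1)-(3); zero when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def preferred_submodels(counts, n=8):
    """Rank submodels by F1 score, descending, and keep the top n."""
    ranked = sorted(counts, key=lambda name: f1_score(*counts[name]), reverse=True)
    return ranked[:n]
```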
For ease of understanding, the 8 submodels of different architectures described above were trained on the HLA-A*02:01 gene and the associated positive and negative sample data. Each of the 8 submodels was given the two corresponding types of training data, yielding 16 trained submodels. In this example, the training data input to each submodel is divided by five-fold cross validation into a training set and a validation set. During training, the five probability values predicted across the five fold divisions are averaged to produce the final probability value, from which the corresponding predicted presentation result is output.
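The fold-averaging step can be sketched as follows; the per-fold model objects, their predict method returning a presentation probability, the encode function, and the 0.5 decision threshold are all assumptions for illustration.

```python
import numpy as np

def ensemble_fold_probability(fold_models, encode, peptide, threshold=0.5):
    """Average the probabilities predicted by the five fold models and
    threshold the mean to obtain the final presentation result."""
    probs = [m.predict(encode(peptide)) for m in fold_models]  # one model per fold
    mean_prob = float(np.mean(probs))
    return mean_prob, mean_prob >= threshold
```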
Further, in order to improve the prediction performance of the prediction model, in one embodiment the training data is divided according to K-fold cross validation into a training set and a validation set, and a preset amount of pseudo-label data is added to the training set; the pseudo-label data is formed by predicting blank-label test data with a pre-trained submodel to obtain corresponding pseudo labels. For example, taking the BERT-2 model and the BERT-LSTM model: after the original training data is divided into training and validation sets by five-fold cross validation, submodels of the two architectures are pre-trained according to steps S110 to S120, yielding two pre-trained submodels. Then, a preset amount of pseudo-label data is added to the training set of each fold used in pre-training, giving each fold an updated training set, and the submodels of the two architectures are trained from scratch on the updated training sets according to steps S110 to S120 to obtain the corresponding trained submodels. For example, 10% pseudo-label data may be added to each training set obtained from the five-fold division, with the validation set and the test data unchanged. Thus, taking the BERT-2 architecture as an example, at most 4 trained submodels can be obtained from different training data: positive sample data without pseudo-label data and without upstream and downstream sequences; without pseudo-label data but with upstream and downstream sequences; with pseudo-label data but without upstream and downstream sequences; and with pseudo-label data and with upstream and downstream sequences.
It should be noted that the pre-trained submodels are obtained with the same training method as for training data without pseudo labels, which is not repeated here. Of course, the pre-trained submodels are not limited to the two architectures of the above example; training data augmented with pseudo-label data may also be used to train submodels of other architectures to obtain more trained submodels.
Specifically, the pseudo-label data is obtained as follows: a preset number of blank-label positive and negative samples are randomly selected from the test data and input into a pre-trained submodel, e.g., of the BERT-2 or BERT-LSTM architecture. A blank label means that the sample carries no ground-truth result of whether its polypeptide sequence is presented by the HLA; the pre-trained submodel alone produces the predicted presentation result, and this prediction becomes the pseudo label of each sample (a non-ground-truth label that may or may not agree with the true presentation result). The blank-label samples thus become positive and negative samples with pseudo labels. This pseudo-label data is added to the original training set, and the BERT-2 and BERT-LSTM submodels are trained from scratch to obtain the corresponding trained submodels. By this design, the added pseudo-label data serves as new training data, enriching the training data and helping improve the accuracy of the prediction model's results.
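A minimal sketch of the pseudo-label generation step under the example above; the pretrained model's predict method (returning a presentation probability) and the 10% sampling fraction are assumptions.

```python
import random

def make_pseudo_labels(pretrained, blank_label_samples, fraction=0.1, seed=0):
    """Predict blank-label samples with a pre-trained submodel and adopt
    the predictions as (pseudo) labels for a fresh round of training."""
    rng = random.Random(seed)
    picked = rng.sample(blank_label_samples, int(fraction * len(blank_label_samples)))
    return [(pep, int(pretrained.predict(pep) >= 0.5)) for pep in picked]

# new_training_set = original_training_set + make_pseudo_labels(pretrained, blanks)
```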
As the training data is updated, beyond the 16 submodels above, additional submodels trained with pseudo-label data are added, so that more than 16 trained submodels are obtained in total. In a specific embodiment, the F1 score, precision, accuracy and cross entropy (loss) of each submodel under its different training data are obtained, the submodels are sorted in descending order of F1 score, and the top 8 are selected as the preferred submodels; the specific values are shown in Table 2 below:
TABLE 2

Sub-model            F1 score    Precision   Cross entropy   Accuracy
BERT-LSTM_pse        0.710821    0.784571    0.0703          0.9756
BERT-2_pse           0.685119    0.667891    0.0833          0.9682
BERT-LSTM_ctex       0.645061    0.660463    0.154           0.9367
BERT-LSTM-GRU_ctex   0.642596    0.72525     0.1415          0.9432
BERT-3_ctex          0.628945    0.632106    0.1628          0.9317
BERT-standard_pep    0.62685     0.664218    0.1395          0.9407
BERT-LSTM_pep        0.623211    0.719886    0.1423          0.9413
BERT-2_pep           0.61751     0.633841    0.152           0.9374
In the first column of the table, a submodel with suffix _pse indicates that pseudo-label data was additionally added to the training set, the training data comprising positive and negative sample polypeptide sequences, the corresponding upstream and downstream sequences, and the corresponding presentation results. A suffix _ctex indicates that no pseudo-label data was added, the training data containing the positive and negative sample polypeptide sequences, their upstream and downstream sequences, and the corresponding presentation results. A suffix _pep indicates that no pseudo-label data was added, the training data containing only the positive and negative sample polypeptide sequences and the corresponding presentation results (i.e., without the upstream and downstream sequences). Table 2 shows only the 8 submodels with the highest F1 scores; lower-ranked submodels are not shown.
As can be seen from Table 2, the F1 scores of BERT-LSTM_pse and BERT-2_pse rank highest, indicating that pseudo-label data added to the training set introduces the test data distribution into the training process and can effectively improve the performance of the prediction model. In addition, submodels whose training data include the polypeptide sequences together with their upstream and downstream sequences have higher F1 scores than submodels trained on polypeptide sequences alone, showing that the more detailed the training data, the more accurate the prediction results.
S250, forming a prediction model from the different preferred BERT-based submodels, wherein the prediction model synthesizes the predicted presentation results of the preferred submodels to predict the result of presentation of the target antigen peptide by the target HLA.
In this step, the prediction model can be regarded as an ensemble of the preferred submodels. For a target antigen peptide to be predicted, after each preferred submodel outputs its predicted presentation result, simple voting takes the result with more votes as the prediction of the overall model. For example, if the 8 preferred submodels each predict whether the polypeptide sequence of a certain target antigen peptide is presented by the same target HLA, and 5 preferred submodels predict that it can be bound and presented by the corresponding target HLA while 3 predict that it cannot, then the final predicted presentation result of the prediction model is that the target antigen peptide can be bound and presented by the corresponding target HLA. In this way the final prediction is obtained quickly while its accuracy is preserved to the greatest extent.
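The simple-voting rule is a one-liner; since the patent does not state how a 4-4 tie among 8 preferred submodels would be resolved, this sketch treats a tie as non-presentation.

```python
def majority_vote(predictions):
    """predictions: one boolean per preferred submodel (True = presented).
    The result with more votes becomes the model's final prediction."""
    votes_for = sum(predictions)
    return votes_for > len(predictions) - votes_for

# 5 of 8 submodels predict presentation -> the ensemble predicts presentation.
assert majority_vote([True] * 5 + [False] * 3)
```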
Further, the trained prediction model formed from the 8 preferred submodels in Table 2 above was compared with other models on the market, namely NetMHCpan4.0, MixMHCpred and HLAthena-MSiE. Four HLA types, A*11:01, A*24:02, C*07:02 and A*02:01, each with corresponding presented and non-presented test polypeptide sequences, were selected as test data. The same test data for the same HLA were input into the prediction model of the present application and the 3 comparison models, each test polypeptide sequence yielding a prediction of whether it is presented by the corresponding HLA. For each model, the predictions for the same HLA were sorted in descending order of the corresponding probability value, and the number of TP (i.e., both the predicted and the true result are presentation) among the top 100 predicted positive presentation results was counted. Part of the test results are shown in Table 3 below.
TABLE 3

HLA class   NetMHCpan4.0   MixMHCpred   HLAthena-MSiE   This application
A*24:02     33             38           36              38
A*02:01     29             35           27              35
A*11:01     23             35           33              49
C*07:02     10             15           13              47
As can be seen from the table, for each HLA the TP counts of the prediction model of the present application rank at or near the top among the comparison models, indicating that in terms of overall performance the prediction of the present application is superior to the existing comparison models on the market.
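The Table 3 metric, TP among the top 100 predicted positives, can be sketched as follows; each prediction is assumed to be a (probability, truly_presented) pair.

```python
def tp_at_k(predictions, k=100):
    """Sort by predicted probability, descending, and count how many of
    the top-k are truly presented (TP), as tabulated in Table 3."""
    top = sorted(predictions, key=lambda x: x[0], reverse=True)[:k]
    return sum(bool(truly_presented) for _, truly_presented in top)
```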
In addition, referring to fig. 3, fig. 3 shows the precision-recall curves plotted from the above test results for the 3 comparison models and the prediction model of the present application. The comparison shows that at a recall of 0.1%, the BERT-based prediction model is mid-field in its ability to distinguish false negatives (samples predicted as not presented although the true result is presentation), so under a high preset threshold the reliability of its predictions is likewise mid-field. However, at 40% PPV the BERT-based prediction model of the present application performs well, meaning it distinguishes true negatives (samples predicted as not presented whose true result is indeed non-presentation) well, and under a low preset threshold it predicts samples whose true result is presentation well. Overall, the prediction model of the present application has better classification performance as a whole.
From the above, the method for constructing an antigen peptide presentation prediction model of the present application can build prediction models for one or more target HLAs in a targeted manner, and achieves a rich training data volume by collecting and sorting known positive and negative sample data. In addition, several different preferred submodels can be screened by F1 score, which helps improve the accuracy of the prediction results, while the final prediction is obtained by simple voting, improving prediction efficiency. In practical applications, the prediction model constructed in this way helps researchers preliminarily screen the antigen peptides bound and presented by HLA, reduces the manpower, material resources and time consumed by experiments, and accommodates researchers' needs for different types of input data, so that long-range interactions between HLA and polypeptide sequences, and within polypeptide sequences, can be captured, with further value in reducing cost in more immunological challenges and protein property prediction tasks.
Fig. 4 is a schematic flowchart of an antigenic peptide prediction method shown in the examples of the present application.
Referring to fig. 4, an antigenic peptide prediction method provided by an embodiment of the present application includes:
S410, obtaining a target antigen peptide sequence.
In this step, the preset length of the target antigen peptide sequence may be 8-11; for example, the preset length may be 8, 9, 10 or 11. The target antigen peptide sequence consists of the 20 common natural amino acids.
It is understood that, since neither the polypeptide sequences in the training data nor their upstream and downstream sequences include non-standard amino acids, in other embodiments, if a trained prediction model is instead obtained by training the submodels of the various architectures with positive and negative sample data containing non-standard amino acids, the target antigen peptide sequence may also be a polypeptide sequence containing non-standard amino acids, which is not limited here.
S420, predicting the result of presentation of the target antigen peptide sequence by the target HLA in the prediction model, based on the prediction model.
In this step, the prediction model is obtained by the method of constructing an antigen peptide presentation prediction model in the above embodiments. In one embodiment, a corresponding prediction model is constructed for each target HLA according to the above construction method; that is, the prediction model may target a single HLA, and only needs to predict whether the input target antigen peptide sequence can be bound and presented by that HLA. In other embodiments, the prediction model may be constructed for multiple target HLAs simultaneously, and can synchronously predict whether the input target antigen peptide sequence can be bound and presented by each of those HLAs.
With the antigen peptide prediction method described above, whether the antigen peptide corresponding to an input polypeptide sequence can be bound and presented by HLA can be preliminarily judged from the presentation result predicted by the prediction model. This is of great significance for advanced tumor immunology work such as immunotherapy, cell therapy and neoantigen prediction, offers guidance for developing tumor polypeptide vaccine therapies, and can greatly reduce the manpower, material resources and time consumed by wet experiments. The prediction results can also be applied to de-immunogenicity engineering of antibodies, improving antibody developability.
Corresponding to the embodiment of the application function realization method, the application also provides a construction device of an antigen peptide presentation prediction model, an antigen peptide prediction device, an electronic device and corresponding embodiments.
Fig. 5 is a schematic structural diagram of an apparatus for constructing an antigenic peptide presentation prediction model according to an embodiment of the present application.
Referring to fig. 5, an apparatus for constructing an antigenic peptide presentation prediction model provided in an embodiment of the present application includes a sample acquiring module 510, a training module 520, and a screening module 530, wherein:
the sample obtaining module 510 is configured to obtain a target HLA of a preselected category and positive sample data and negative sample data corresponding to the target HLA and having a preset ratio, where the positive sample data includes a positive sample polypeptide sequence, an upstream sequence of the positive sample polypeptide sequence, a downstream sequence of the positive sample polypeptide sequence, and a positive presentation result of the positive sample polypeptide sequence and the target HLA; the negative sample data comprises a negative sample polypeptide sequence different from the positive sample polypeptide sequence, an upstream sequence of the negative sample polypeptide sequence, a downstream sequence of the negative sample polypeptide sequence, and a negative presentation result of the negative sample polypeptide sequence and the target HLA.
The training module 520 is configured to input the target HLA and corresponding positive and negative sample data into multiple sub-models with different architectures based on the BERT model, respectively, for training, so as to obtain multiple trained sub-models.
The screening module 530 is configured to screen each trained sub-model according to a preset rule to obtain a prediction model including a preferred sub-model; wherein the prediction model synthesizes the predicted presentation results of the preferred submodel to predict the result of presentation of the target antigen peptide by the target HLA.
Further, in an embodiment, the sample obtaining module 510 is configured to, for each target HLA, generate training data from positive sample data and negative sample data at a ratio of 1:(8-10), and generate test data from positive sample data and negative sample data at a ratio of 1:(800-1000).
The training module 520 is configured to divide the training data according to K-fold cross validation to obtain a training set and a validation set; and/or to divide the training data according to K-fold cross validation to obtain a training set and a validation set and add a preset amount of pseudo-label data to the training set, where the pseudo-label data is formed by predicting blank-label test data with a pre-trained submodel to obtain corresponding pseudo labels.
Further, in one embodiment, the training module 520 is configured to train, respectively, a BERT and CNN fusion model, a BERT, LSTM and GRU fusion model, a BERT model with a two-layer sentence-vector hidden layer, a BERT model with a three-layer sentence-vector hidden layer, a BERT model with a global average pooling layer, a BERT model with word-vector batch normalization, and a standard BERT model. Specifically, the training module 520 is configured to train the submodels of different architectures based at least in part on the BERT model, with the positive sample polypeptide sequences and positive presentation results in the positive sample data and the negative sample polypeptide sequences and negative presentation results in the negative sample data as training data, to obtain the corresponding trained submodels; and/or to train the submodels of different architectures based at least in part on the BERT model, with the complete positive sample data and negative sample data as training data, to obtain the corresponding trained submodels.
Further, referring to fig. 6, in the construction apparatus of the present application the screening module 530 is configured to obtain the precision and recall of the predicted presentation results of each submodel; to determine the accuracy evaluation score of each submodel via a preset evaluation function from the precision and recall; and to screen the preferred submodels from among the submodels according to the corresponding accuracy evaluation scores. The screening module 530 is further configured to count the numbers of TP, FP and FN in the predicted presentation results of each submodel, and to determine the precision and recall of each corresponding submodel from these counts.
Further, the construction apparatus of the present application further includes a clustering module 540, configured to obtain the polypeptide sequences bound and presented by candidate HLAs and to cluster them according to sequence similarity, obtaining multiple candidate HLAs and their corresponding positive sample polypeptide sequence sets. The sample obtaining module 510 is configured to screen multiple target HLAs from the candidate HLAs, and to take the positive sample polypeptide sequence set corresponding to each candidate HLA as the positive sample data of that target HLA.
With the construction apparatus of the antigen peptide presentation prediction model described above, the sample obtaining module obtains positive sample data associated with HLA as well as negative sample data entirely distinct from it, so that the training module can train each submodel with rich sample data, improving the accuracy of the prediction results. In addition, the screening module applies a preset rule to select several preferred submodels from the trained BERT-based submodels and combine them into an overall prediction model, so that the final prediction result is synthesized from the predicted presentation results output by the preferred submodels, further improving accuracy. The construction apparatus can build a prediction model that predicts the immune presentation of antigen peptides by specific HLAs and the degree of T cell response, thereby helping researchers reduce experiments, save manpower and material resources, improve R&D efficiency and lower R&D cost.
Fig. 7 is a schematic structural view of an antigenic peptide prediction apparatus according to an embodiment of the present application.
Referring to fig. 7, an antigenic peptide prediction apparatus provided by an embodiment of the present application includes a sequence acquisition module 710 and a prediction module 720. Wherein:
the sequence acquisition module 710 is used for acquiring a target antigen peptide sequence. The predetermined length of the target antigen peptide sequence may be 8 to 11.
The prediction module 720 is used for predicting the result of presentation of the target antigen peptide sequence by the target HLA in the prediction model according to the prediction model constructed in the above embodiment.
With the antigen peptide prediction apparatus described above, whether an antigen peptide can be bound and presented by the corresponding target HLA can be predicted efficiently as an aid, using the prediction model built by the construction apparatus, thereby reducing experimental cost, saving manpower and time, and improving R&D efficiency.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 8 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.
Referring to fig. 8, the electronic device 1000 includes a memory 1010 and a processor 1020.
The Processor 1020 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The permanent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, the permanent storage device is a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at run time. Further, the memory 1010 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks and/or optical disks. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, Mini SD card, Micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or by wire.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, may cause the processor 1020 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing some or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having executable code (or a computer program or computer instruction code) stored thereon, which, when executed by a processor of an electronic device (or server, etc.), causes the processor to perform part or all of the various steps of the above-described method according to the present application.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A method for constructing an antigenic peptide presentation prediction model, comprising:
acquiring a preselected type of target HLA, and positive sample data and negative sample data corresponding to the target HLA in a preset proportion, wherein for each type of target HLA, training data is generated from positive sample data and negative sample data at a ratio of 1:(8-10), and test data is generated from positive sample data and negative sample data at a ratio of 1:(800-1000); dividing the training data according to K-fold cross validation to obtain a training set and a validation set; dividing the training data according to K-fold cross validation to obtain a training set and a validation set, and adding a preset amount of pseudo-label data to the training set; the pseudo-label data being formed by predicting blank-label test data with a pre-trained submodel to obtain corresponding pseudo labels; the positive sample data comprising a positive sample polypeptide sequence, an upstream sequence of the positive sample polypeptide sequence, a downstream sequence of the positive sample polypeptide sequence, and a positive presentation result of the positive sample polypeptide sequence with the target HLA; the negative sample data comprising a negative sample polypeptide sequence different from the positive sample polypeptide sequence, an upstream sequence of the negative sample polypeptide sequence, a downstream sequence of the negative sample polypeptide sequence, and a negative presentation result of the negative sample polypeptide sequence with the target HLA;
respectively inputting the target HLA and the corresponding positive sample data and negative sample data into a plurality of sub-models with different architectures based on a BERT model for training to obtain a plurality of trained sub-models;
screening each trained sub-model through a preset rule to obtain a prediction model comprising an optimal sub-model; wherein the prediction model predicts the result of presentation of the target antigen peptide by the target HLA by synthesizing the predicted presentation results of the preferred submodels.
2. The method according to claim 1, wherein the step of inputting the target HLA and the corresponding positive sample data and negative sample data into a plurality of sub-models with different architectures based on BERT models for training respectively to obtain a plurality of trained sub-models comprises:
taking the positive sample polypeptide sequence and the positive presentation result in the positive sample data and the negative sample polypeptide sequence and the negative presentation result in the negative sample data as training data to train submodels with different architectures based on the BERT model at least partially, and obtaining the corresponding trained submodels; and/or
training the submodels of different architectures based at least in part on the BERT model with the positive sample data and the negative sample data as training data, to obtain the corresponding trained submodels.
3. The method of claim 1, wherein the plurality of different architectural sub-models based on the BERT model comprises at least one of:
the model comprises a BERT and CNN fusion model, a BERT and LSTM and GRU fusion model, a BERT model with a double-layer sentence vector hiding layer, a BERT model with a three-layer sentence vector hiding layer, a BERT model with a global average pooling layer, a BERT model with word vector batch standardization and a standard BERT model.
4. The method of claim 1, wherein the screening each trained submodel through a preset rule to obtain a prediction model including a preferred submodel comprises:
respectively acquiring the precision and recall of the predicted presentation results of each submodel;

determining the accuracy evaluation score of each submodel through a preset evaluation function according to the precision and recall;

and screening among the submodels according to the corresponding accuracy evaluation scores to obtain the preferred submodels.

5. The method of claim 4, wherein the acquiring the precision and recall of the predicted presentation results of each submodel respectively comprises:

respectively counting the numbers of TP, FP and FN in the predicted presentation results of each submodel;

and determining the precision and recall of each corresponding submodel according to the numbers of the corresponding TP, FP and FN.
6. The method of claim 1, wherein before obtaining a target HLA of a preselected class and positive and negative sample data corresponding to the target HLA with a preset ratio, the method comprises:
obtaining polypeptide sequences bound and presented by the candidate HLA, and carrying out clustering processing on the obtained polypeptide sequences bound and presented by the candidate HLA according to sequence similarity to obtain a plurality of candidate HLA and corresponding positive sample polypeptide sequence sets;
and screening various target HLA in each candidate HLA, and taking the positive sample polypeptide sequence set corresponding to the candidate HLA as the positive sample data of the target HLA.
7. An antigenic peptide prediction method, comprising:
obtaining a target antigen peptide sequence;
predicting, according to the prediction model constructed by the method of any one of claims 1 to 6, the result of presentation of the target antigen peptide sequence by a target HLA in the prediction model.
8. The method of claim 7, wherein:
the preset length of the target antigen peptide sequence is 8-11.
9. An apparatus for constructing an antigenic peptide presentation prediction model, comprising:
the sample acquisition module is used for acquiring a preselected type of target HLA, and positive sample data and negative sample data corresponding to the target HLA in a preset proportion, wherein for each target HLA, training data is generated from positive sample data and negative sample data at a ratio of 1:(8-10), and test data is generated from positive sample data and negative sample data at a ratio of 1:(800-1000); dividing the training data according to K-fold cross validation to obtain a training set and a validation set; dividing the training data according to K-fold cross validation to obtain a training set and a validation set, and adding a preset amount of pseudo-label data to the training set; the pseudo-label data being formed by predicting blank-label test data with a pre-trained submodel to obtain corresponding pseudo labels; the positive sample data comprising a positive sample polypeptide sequence, an upstream sequence of the positive sample polypeptide sequence, a downstream sequence of the positive sample polypeptide sequence, and a positive presentation result of the positive sample polypeptide sequence with the target HLA; the negative sample data comprising a negative sample polypeptide sequence different from the positive sample polypeptide sequence, an upstream sequence of the negative sample polypeptide sequence, a downstream sequence of the negative sample polypeptide sequence, and a negative presentation result of the negative sample polypeptide sequence with the target HLA;
the training module is used for inputting the target HLA and the corresponding positive sample data and negative sample data into a plurality of sub-models with different architectures based on a BERT model respectively for training to obtain a plurality of trained sub-models;
the screening module is used for screening each trained submodel through a preset rule to obtain a prediction model comprising an optimal submodel; wherein the prediction model predicts the result of presentation of the target antigen peptide by the target HLA by synthesizing the predicted presentation results of the preferred submodels.
10. An antigenic peptide prediction apparatus, comprising:
a sequence acquisition module for acquiring a target antigen peptide sequence;

a prediction module for predicting the result of presentation of the target antigen peptide sequence by a target HLA in the prediction model, according to the prediction model constructed by the apparatus of claim 9.
11. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-8.
12. A computer-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-8.
GR01 Patent grant