CN113762417B - Method for enhancing HLA antigen presentation prediction system based on deep migration - Google Patents


Info

Publication number
CN113762417B
Authority
CN
China
Prior art keywords
model
training
data set
data
migration
Prior art date
Legal status
Active
Application number
CN202111204491.8A
Other languages
Chinese (zh)
Other versions
CN113762417A (en)
Inventor
方榯楷
费才溢
徐实
Current Assignee
Nanjing Chengshi Biomedical Technology Co ltd
Original Assignee
Nanjing Chengshi Biotechnology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Chengshi Biotechnology Co ltd
Priority to CN202111204491.8A
Publication of CN113762417A
Application granted
Publication of CN113762417B

Classifications

    • G06F18/253 Pattern recognition; Fusion techniques of extracted features
    • G06F18/2135 Feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/044 Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/08 Neural networks; Learning methods

Abstract

The invention provides a method for enhancing an HLA antigen presentation prediction system based on deep migration (deep transfer learning), comprising the following steps: 1) generating negative-sample training sets in different proportions with a global maximum-difference scoring matrix, yielding a source domain data set with balanced positive and negative samples and a target data set with unbalanced positive and negative samples; 2) encoding the known sequence information with several different deep neural networks and performing multi-modal feature fusion; 3) obtaining a pre-trained model on the source domain data, where the positive-negative sample ratio is balanced, and transferring it by a deep transfer method to the target data set, where the positive-negative sample ratio is extremely unbalanced; 4) proposing an innovative strict-accuracy (strict PPV) metric. The method can efficiently fuse multi-modal information, be rapidly deployed and migrated to unseen data sets, and save the computing power and time cost of retraining the model in a new environment and on new data.

Description

Method for enhancing HLA antigen presentation prediction system based on deep migration
Technical Field
The invention relates to the field of bioinformatics, in particular to a method for enhancing an HLA neoantigen presentation prediction system based on deep migration.
Background
Human Leukocyte Antigen (HLA), the gene complex encoding the human Major Histocompatibility Complex (MHC), is closely related to the function of the human immune system. MHC molecules are divided into two major classes: class I MHC presents proteins (e.g. viral proteins) that are broken down inside the cell, while class II MHC binds the fragments formed when an external invader undergoes endocytosis and lysosomal processing, presenting them on the cell surface for recognition by T cells. Some of these genes encode cell-surface antigens, which become an indelible "signature" of each individual's cells and the basis on which the immune system distinguishes self from foreign substances. Cancer vaccines that exploit the principle of HLA presentation are a hot topic in medicine and pharmacology today.
The main steps for the prediction of a personalized HLA Neoantigen (Neoantigen) vaccine are as follows:
(1) Identifying and validating specifically immunogenic non-synonymous somatic mutations expressed in the patient's tumor. Tumor tissue is biopsied for whole-exome or transcriptome sequencing. Non-synonymous somatic mutations of the tumor, such as point mutations, indels and reading-frame shifts, can be identified by comparing the sequences of the tumor and matched healthy tissue.
(2) Mutations with the highest antigen presentation potential were screened, analyzed and identified using Major Histocompatibility Complex (MHC) class I and II epitope prediction algorithms.
(3) The ordered list of candidate antigens is further validated based on in vitro binding assay results.
The prediction of HLA neoantigen presentation involved in step (2) is the core of neoantigen vaccine development. With the wide application of artificial intelligence in bioinformatics, researchers have begun using data-driven machine learning methods to rapidly discover, predict and screen viable neoantigen immune targets. Representative examples include the NetMHCpan series from the Technical University of Denmark (refs: Jurtz, Vanessa, et al. "NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data." The Journal of Immunology 199.9 (2017): 3360-3368; Reynisson, Birkir, et al. "NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data." Nucleic Acids Research 48.W1 (2020): W449-W454), which predict MHC class I and class II antigen presentation from binding affinity and mass-spectrometry eluted ligand data; the same team's prediction of MHC class II antigens based on motif deconvolution of eluted ligand data (ref: Reynisson, Birkir, et al. "Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data." Journal of Proteome Research 19.6 (2020): 2304-2315); and the work of the team at the University of North Carolina at Chapel Hill on MHC antigen prediction models based on immunogenic epitope selection (ref: Smith C, Chai S, Washington AR, et al. "Machine-learning prediction of tumor antigen immunogenicity in the selection of therapeutic epitopes." Cancer Immunology Research 2019, 7(10): 1591-1604).
The work described above mainly contributed applications of machine learning models to specific, pre-constructed MHC data sets. However, in practical scenarios for developing anti-tumor vaccines, most existing mass spectrometry (MS)-based identification data sets still contain relatively few epitopes matched to real human leukocyte antigens (HLA), come from diverse sources, and differ greatly in data distribution, which makes it difficult to build a reliable, robust and reusable HLA antigen prediction system. Such difficulties arise not only in HLA antigen prediction: in bioinformatics and clinical medicine more broadly, the high cost of data acquisition and the difficulty of robust model training are prominent problems whenever a machine learning model is introduced.
Researchers have therefore turned to cutting-edge artificial-intelligence methods such as transfer learning, which deploy an existing model to a new scenario at low model-training and data-collection cost. Transfer learning is a machine learning approach that uses existing knowledge to solve different but related domain problems. Representative efforts applying it to MHC prediction include: scholars at the Southern University of Science and Technology, who trained a model under the transfer-learning paradigm to learn common features of mixed allele-specific epitopes (ref: Hu, Weipeng, Yangping Li, and Xiuqing Zhang. "MHC-I epitope presentation based on transfer learning." Yi Chuan = Hereditas 41.11 (2019): 1041-); and the MINERVA work, which predicts HLA presentation of tumor antigen peptides by encoding physicochemical properties of related naturally presented peptides with transfer learning (ref: Ng FS, Vandeberghe M, Portella G, Cayatte C, Qu X, Handuchi S, Landry A, Chaerkady R, Yu W, Collepardo-Guevara R, Sidders B. "MINERVA: Transfer Learning of HLA Class I Peptide Presentation in Tumours with Neural Networks." Available at SSRN 3704016). Overall, however, related studies remain few.
Another challenge for mainstream models in this field is the extremely unbalanced data ratio in real situations. When training a model, a data set with comparable numbers of positive and negative samples is usually chosen so that training proceeds smoothly. But in real scenarios negative (non-presented) samples far outnumber positive samples, which further increases the difficulty of evaluating a model's true performance. There has been some discussion in academia (ref: Schneider M, Wang L, Marr C. "Evaluation of domain adaptation approaches for robust classification of heterogeneous biological data." In International Conference on Artificial Neural Networks, 2019 Sep 17 (pp. 673-686). Springer, Cham), but the problem has not received wide attention.
Disclosure of Invention
Two major problems affect the mainstream HLA prediction methods mentioned in the background: 1. models transfer poorly to heterogeneous data; and 2. conventional metrics cannot measure a model's actual performance under an extreme positive-negative ratio.
The invention firstly discloses a method for enhancing an HLA antigen presentation prediction system based on deep migration, which comprises the following steps:
s1, feature selection and normalization processing are carried out, and original domain data are constructed to serve as a source domain data set;
s2, solving a pre-training model through feature fusion and training;
s3, constructing special extremely unbalanced migration target domain data as a target domain data set;
s4, transferring the pre-training model obtained in the S2 to target field data in the S3 by using a depth transfer method to construct a depth transfer self-adaptive optimization model;
s5, using a deep migration adaptive optimization model, performing HLA antigen presentation prediction on the target domain dataset.
Preferably, according to the feature selected in S1, a corresponding normalization scheme is chosen to obtain feature vectors of uniform format and dimension that are convenient to fuse, specifically:
- long-sequence features: each amino acid is encoded into a learnable implicit space using a random matrix, then processed with a long short-term memory (LSTM) recurrent neural network;
- short-sequence features: encoded by the one-hot method, then fed into a multi-layer perceptron network for transformation;
- vector features: encoded using principal component analysis (PCA); the vector-form features of all data are combined into a feature matrix, which is decomposed by PCA, and a number of eigenvectors matching the hidden embedding dimension are selected for the coding transformation;
- scalar features: encoded using multi-dimensional scaling with a Gaussian kernel; the scalar-form features of all data are taken as inputs to a Gaussian kernel to obtain its covariance matrix, and each column of the matrix is scaled by multi-dimensional scaling to obtain the transformed feature vector.
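The vector-feature (PCA) and scalar-feature (Gaussian kernel plus classical multi-dimensional scaling) encodings can be sketched as below. This is a minimal NumPy illustration: `pca_encode` and `gaussian_mds_encode` are hypothetical helper names, and the kernel bandwidth `sigma` is an added assumption, not fixed by the patent.

```python
import numpy as np

def pca_encode(X, k):
    """Project vector-form features onto the top-k principal components.

    X: (n_samples, d) matrix stacking every sample's vector feature.
    Returns an (n_samples, k) encoding in a shared basis, so every sample
    ends up with the same hidden embedding dimension k.
    """
    Xc = X - X.mean(axis=0)                  # center each feature column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # coordinates in the top-k basis

def gaussian_mds_encode(s, k, sigma=1.0):
    """Encode scalar features via a Gaussian kernel followed by classical MDS.

    s: (n_samples,) scalar feature; sigma: kernel bandwidth (free choice).
    Returns an (n_samples, k) embedding.
    """
    d2 = (s[:, None] - s[None, :]) ** 2
    K = np.exp(-d2 / (2 * sigma ** 2))       # Gaussian-kernel covariance matrix
    n = len(s)
    J = np.eye(n) - np.ones((n, n)) / n
    B = J @ K @ J                            # double-centered kernel
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # keep the k largest
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))
```

Both helpers return an (n_samples, k) matrix, so features of either form come out with the same embedding dimension and can be concatenated for fusion.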
Preferably, the source domain data set constructed in S1 has a balanced ratio of positive to negative samples (number of positive samples : number of negative samples = 1 : 1), to best suit the structure and optimization method of the pre-trained model.
Preferably, when constructing the source domain data set in S1, a sliding-window method is used to generate negative sequences according to a preset parameter threshold, and the generated sequence segments are screened with a global difference scoring matrix to obtain a non-random negative candidate training set.
Preferably, the features to be fused in S2 are: polypeptide sequence features, upstream and downstream sequence features, and presentation affinity features.
Preferably, in S2:
the polypeptide sequence features are standardized as follows: for a given polypeptide chain amino acid sequence, each amino acid is encoded into a learnable space using a random matrix and processed with a long short-term memory recurrent neural network to obtain the polypeptide sequence features; after random-matrix encoding, sequences are padded to the length of the longest peptide chain in the data so that the parameters of the encoding and mapping models stay consistent;
the upstream and downstream sequence features are standardized as follows: the upstream and downstream peptide chains of a given gene are encoded by the one-hot method, the encoded sequences are cut to a fixed length and fed into a multi-layer perceptron network for transformation, and the extracted features serve as the upstream and downstream sequence features;
the presentation affinity features are scaled to obtain standard features, ensuring numerical stability of the training optimization process.
Preferably, the pre-training optimization model constructed in S2 is:

    W* = argmin_W Σ_{n=1..N} w_n · loss( σ(f_W(x_n)), y_n )

where f_W is the prediction model with learnable parameters;
W denotes the learnable parameters of the model, including the scheme weights used when obtaining each fused feature; w_n denotes the weight given to the loss function of sample n, and N is the total number of samples;
x_n is the specific input data, y_n is the ground-truth value of whether the binding in the training data is presented, and σ is the sigmoid (S-shaped logistic) function; the fused features are not simply summed, and the model captures the potential complex relations among them;
after the model is optimized, the relevant parameters are saved in a structured form as the pre-trained model.
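A minimal sketch of the pre-training objective above, assuming a binary cross-entropy as the per-sample loss term (the patent does not fix the loss form) and `pretrain_loss` as a hypothetical name:

```python
import numpy as np

def sigmoid(z):
    """The S-shaped logistic function sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_loss(logits, y, w):
    """sum_n w_n * loss(sigma(f_W(x_n)), y_n) with a binary cross-entropy loss.

    logits: raw model outputs f_W(x_n); y: 0/1 presentation labels;
    w: per-sample weights w_n (all 1 for a balanced positive-negative ratio).
    """
    p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)   # guard log(0)
    return float(np.sum(w * -(y * np.log(p) + (1 - y) * np.log(1 - p))))
```

Raising w_n for the rarer class is the lever mentioned later for an uneven positive-negative ratio.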
Preferably, after the negative candidate data set is produced in S3, the target domain data is constructed according to different strategies; the target domain data set contains far more negative samples than positive samples, simulating the real prediction environment in which negative samples vastly outnumber positive ones.
Preferably, the deep-migration adaptive optimization model constructed in S4 is:

    W'* = argmin_{W'} loss_S( f_{W'}; {(x_n^(1), y_n^(1))}_{n=1..N_1} ) + λ · loss_C( f_{W'}; {(x_n^(2), y_n^(2))}_{n=1..N_2} )

where f_{W'} is the prediction model to be migrated, with learnable parameters;
W' denotes the learnable parameters of the model, including the scheme weights used when obtaining each fused feature;
loss_S and loss_C denote the target loss functions of the pre-training stage and the model-migration adaptation stage, respectively; λ is the weight given to the target loss of the migration adaptation stage;
(x_n^(1), y_n^(1)) and (x_n^(2), y_n^(2)) denote the training-data features and the ground-truth presentation labels on the source domain data set constructed in S1 and the target domain data set constructed in S3, respectively;
N_1 and N_2 denote the numbers of training samples on the source domain data set constructed in S1 and the target domain data set constructed in S3, respectively;
after the model is optimized, the relevant parameters are saved in a structured form as the adaptive deep-migration optimization model.
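The two-term migration objective can be illustrated as below; `migration_objective` is a hypothetical name, the toy linear model stands in for the fused network f_{W'}, and binary cross-entropy is again an assumed loss form:

```python
import numpy as np

def _bce(logits, y):
    """Binary cross-entropy summed over samples."""
    p = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-7, 1 - 1e-7)
    return np.sum(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def migration_objective(f, W, source, target, lam):
    """loss_S on the balanced source set plus lambda times loss_C on the
    unbalanced target set, both evaluated with the shared parameters W'."""
    (xs, ys), (xt, yt) = source, target
    return float(_bce(f(W, xs), ys) + lam * _bce(f(W, xt), yt))
```

With λ = 0 the objective reduces to the pre-training loss on the source set; increasing λ shifts weight toward adapting to the extremely unbalanced target set.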
Preferably, according to the size of the pre-trained model and the data scale, the deep-migration adaptive optimization model either performs global optimization of all trainable parameters in the pre-trained model, or performs selected-layer optimization of only the last two layers of the neural network.
Preferably, the optimization models in S2 and S4 are solved by traversing all training data multiple times and optimizing with an optimizer based on stochastic gradient methods to obtain the optimal model parameters, yielding the pre-trained prediction model f_W and the migrated prediction model f_{W'}.
Preferably, a single batch of data is split off from the extremely unbalanced data constructed in S3 and used in S5 to verify the prediction performance of the deep-migration adaptive optimization model on the target domain data.
Advantages of the invention
The invention provides a brand-new method for enhancing an HLA neoantigen presentation prediction system based on deep migration, comprising the following steps: 1) generating negative-sample training sets in different proportions with a global maximum-difference scoring matrix, yielding a source domain data set with balanced positive and negative samples and a target data set with unbalanced positive and negative samples; 2) encoding the known sequence information with several different deep neural networks and performing multi-modal feature fusion; 3) obtaining a pre-trained model on the source domain data, where the positive-negative sample ratio is balanced, and transferring it by a deep transfer method to the target data set, where the positive-negative sample ratio is extremely unbalanced; 4) proposing an innovative strict-accuracy (strict PPV) metric. The method can efficiently fuse multi-modal information, be rapidly deployed and migrated to unseen data sets, and save the computing power and time cost of retraining the model in a new environment and on new data.
The multi-modal feature-fusion prediction provided by this application captures the potential complex relations among multiple features instead of the traditional simple summation.
Training on source domain data with a balanced positive-negative sample ratio ensures stable convergence of the pre-trained model and lets the model learn hidden embedded representations of the multi-modal features.
Deep migration to target data with an extremely unbalanced positive-negative sample ratio ensures the reliability and reusability of the migrated model in a real environment.
Drawings
FIG. 1 is a diagram of a method for migrating and enhancing a prediction model for HLA neoantigen presentation based on deep migration
FIG. 2 general flow chart of enhancement method and evaluation system for HLA neoantigen presentation prediction system based on deep migration
Detailed Description
The invention is further illustrated by the following examples, without limiting the scope of the invention:
As shown in FIG. 2, the system of the method for enhancing and evaluating an HLA neoantigen presentation prediction system based on deep migration is divided into three parts, described in detail below: data set construction, model optimization and migration, and model testing.
(a) Data set construction
First, this module collects, from public database resources and specific literature, data tuples such as the polypeptide chain presented by a specific HLA neoantigen, upstream/downstream data pairs and affinity indices, together with a matched data processing and standardization pipeline, specifically:
I. Given a specific protein, the specific polymorphic amino acid sequence of an HLA neoantigen that is successfully expressed and presented.
II. The corresponding upstream and downstream related sequences, 6 and 12 amino acids in length.
III. Affinity indices (affinity scores) of the presented expression data pairs, obtained from a series of professional computing tools based on the protein-polypeptide data pairs in I and related features (ref: Jurtz, Vanessa, et al. "NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data." The Journal of Immunology 199.9 (2017): 3360-3368).
Specifically, the public data sources and literature resources referred to are mainly MARIA (ref: Chen, Binbin, et al. "Predicting HLA class II antigen presentation through integrated deep learning." Nature Biotechnology 37.11 (2019): 1332-1343) and the NetMHCpan series of data (ref: Reynisson, Birkir, et al. "NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data." Nucleic Acids Research 48.W1 (2020): W449-W454).
After collecting, by the above method, positive samples of HLA neoantigens successfully expressed and presented from multiple data sources, the data set used for machine learning training is constructed; specifically, negative samples are generated using a global maximum-difference scoring matrix, and the source domain and target domain data sets are built. The principle is as follows: the similarity between a presented peptide fragment and a normal peptide fragment is generally considered to be negatively correlated, to some degree, with antigen presentation and immunogenicity, so we use the global maximum-difference scoring matrix to generate the peptide fragments with the lowest sequence similarity as the negative sample set of the training set. Concretely, the positive-negative ratio is fixed first, windows are slid sequentially over the sequences, all generated sequences are multiply aligned using the BioPython sequence-alignment package, and the specific negative sequences with the lowest sequence similarity are retained as the negative training set by a bubble-method selection.
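A simplified sketch of this negative-generation step, using a plain positional-identity score as a stand-in for BioPython alignment scoring and a sort in place of the bubble-method selection (all function names are illustrative):

```python
def sliding_windows(protein, length):
    """All candidate peptides of a given length from a protein (window step 1)."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]

def identity(a, b):
    """Fraction of matching positions; a stand-in for a real alignment score."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def pick_negatives(positives, candidates, n_neg):
    """Keep the n_neg candidates least similar to any presented (positive) peptide."""
    scored = [(max(identity(c, p) for p in positives), c)
              for c in candidates if c not in positives]
    scored.sort(key=lambda t: t[0])          # lowest similarity first
    return [c for _, c in scored[:n_neg]]
```

A real pipeline would replace `identity` with a substitution-matrix alignment score, but the select-the-least-similar logic is the same.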
In a real production environment, the probability that HLA neoantigen expression and presentation fails is much higher than the probability that it succeeds, so several negative samples must be generated for each positive sample. Concretely, the presentation expression data pairs from step I of module (a) are input to the open-source computing tool NetMHCpan (ref: Jurtz, Vanessa, et al. "NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data." The Journal of Immunology 199.9 (2017): 3360-3368), and for each successfully presented positive sample pair, between 1 and 1000 negative samples are generated following the inverted ranking of candidate affinity indices. Because data with unbalanced positive and negative samples pose a great challenge to constructing and optimizing a machine learning model, the positive-negative ratio is first set to 1:1 to construct a relatively balanced data set for training; this is exactly the source domain data set used to optimize the pre-trained model.
Next, we build the target domain data set used to simulate the real environment and perform model migration. Here we set the positive-negative sample ratio to 1:100 or 1:10 to simulate the real production environment, where non-presented negative samples occur far more often than successfully presented positive ones. Note that, directly on data with such an extremely unbalanced ratio, most machine learning models struggle to converge stably and to learn the representations of the valuable features in the positive and negative samples.
In addition, for the source domain data set used to optimize the pre-trained model and the target domain data set used for migration, different sampling strategies can be adopted when selecting a specific number of negative samples from the candidate pool, to simulate different possibilities in real production scenarios. Specifically, we adopt three negative-sampling (negative data) strategies: 1. Generalized negative strategy: given a successfully expressed and presented positive data pair, randomly select one negative sample from its entire candidate pool for the data set. 2. Median negative strategy: given a positive data pair, sort its candidate pool in descending order of affinity score and select the negative sample with the smallest affinity score, i.e. the lowest similarity to the positive sample, for the data set. 3. Narrow negative strategy: given a positive data pair, first remove candidates whose original affinity score is below 500, then sort the remainder in descending order and select the negative sample with the smallest affinity score and lowest similarity to the positive sample for the data set.
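The three strategies might be sketched as follows, with each candidate represented as a dict carrying its affinity score (the names and the dict layout are illustrative, not from the patent):

```python
import random

def generalized_negative(pool, rng=random):
    """Generalized strategy: one negative drawn uniformly from the candidate pool."""
    return rng.choice(pool)

def median_negative(pool):
    """Median strategy: sort candidates by affinity score in descending order
    and take the last one, i.e. the smallest affinity score."""
    return sorted(pool, key=lambda c: c["affinity"], reverse=True)[-1]

def narrow_negative(pool, cutoff=500):
    """Narrow strategy: drop candidates whose affinity score is below the
    cutoff, then take the smallest remaining affinity score."""
    kept = [c for c in pool if c["affinity"] >= cutoff]
    return sorted(kept, key=lambda c: c["affinity"], reverse=True)[-1]
```

The narrow strategy thus never returns the easiest decoys (scores below the cutoff), which makes the resulting data set harder and closer to borderline cases.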
To compute the subsequent strict-accuracy (strict PPV) metric, an additional special test data set must be constructed, as follows: 1000 successfully presented protein-HLA pairs are selected from a given data set, and 100 false-positive samples are generated for each pair.
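The strict PPV computation this test set supports can be sketched as below (`strict_ppv` is a hypothetical helper name; the exact definition in the patent may differ):

```python
def strict_ppv(scores, labels, n_positives=None):
    """Strict PPV: rank every candidate by predicted score, keep the top-k
    (k = number of true positives), and report the fraction that are real.

    With 100 decoys per positive, a random model scores about 0.01 here,
    so the metric stays informative under extreme class imbalance.
    """
    k = n_positives if n_positives is not None else sum(labels)
    top = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)[:k]
    return sum(label for _, label in top) / k
```

Unlike accuracy or AUC on a balanced test set, this value directly reflects how many of a model's top-ranked calls would survive in a 1:100 environment.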
Specifically, we selected the MARIA and NetMHCpan data sets (refs: Chen, Binbin, et al. "Predicting HLA class II antigen presentation through integrated deep learning." Nature Biotechnology 37.11 (2019); Reynisson, Birkir, et al. "NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data." Nucleic Acids Research 48.W1 (2020): W449-W454) to construct the positive-negative balanced source domain data set for training the model, and the Tübingen data set (ref: Rammensee, H-G., et al. "SYFPEITHI: database for MHC ligands and peptide motifs." Immunogenetics 50.3 (1999): 213-219) to construct the extremely unbalanced target domain data set and the strict-PPV test set.
In addition, for the target domain data set based on the Tübingen data, k-fold cross validation is used to construct separate training, validation and test sets. The training and validation splits are used for the model-migration process of step (b); the test split is used in step (c) to test both the pre-trained model and the migrated model.
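A minimal k-fold splitting sketch (illustrative; the held-out part of each fold can be subdivided further into validation and test):

```python
def k_fold_splits(indices, k):
    """Yield (train, heldout) index lists for k-fold cross validation.

    Every index appears in exactly one held-out fold, so each sample is
    used for evaluation once and for training k-1 times.
    """
    folds = [indices[i::k] for i in range(k)]          # round-robin assignment
    for i in range(k):
        heldout = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, heldout
```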
(b) Model enhancement and migration
As shown in FIG. 1, we first optimize the pre-trained model on the source domain data set.
Specifically, the different modal features on the source domain data set are divided into long-sequence features, short-sequence features, vector features and scalar features, and a corresponding normalization scheme is defined for each to obtain feature vectors of uniform format and dimension that are convenient to fuse. For example, for a given polypeptide chain amino acid sequence, each amino acid is encoded into a learnable implicit space using a random matrix and then processed with a gated recurrent unit network (GRU) (ref: Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014)). The upstream and downstream peptide chains of a given gene are encoded by the one-hot method, cut to a fixed length, and fed into a multi-layer perceptron network for feature extraction. For the affinity score feature, considering the wide range of the raw data scale (from hundreds to tens of thousands), we transform and scale it with 1 - log50000(Kd). Finally, the processed modal features are input to the feature-fusion layer, yielding the following optimization model:
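The affinity transform mentioned above can be written as a small helper; the clamp bounds here are an added assumption for numerical stability, not specified by the patent:

```python
import math

def scale_affinity(kd_nm, cap=50000.0):
    """Map a binding affinity Kd (nM) spanning hundreds to tens of thousands
    into [0, 1] via 1 - log(Kd)/log(50000).

    Strong binders (small Kd) map near 1, weak binders near 0, so the
    feature enters the fusion layer on a stable, bounded scale.
    """
    kd_nm = min(max(kd_nm, 1.0), cap)     # clamp to keep the log well-defined
    return 1.0 - math.log(kd_nm) / math.log(cap)
```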
    W* = argmin_W Σ_{n=1..N} w_n · loss( σ(f_W(x_n)), y_n )
where f is the prediction model integrating all the sequence-encoding, multi-modal fusion and feature-transformation neural networks of this step, and W are its learnable parameters. w_n denotes the weight given to the loss function of sample n; when the positive-negative ratio of the training data is balanced it is usually set to 1, while for an unbalanced ratio the rarer samples may be given larger weights. x_n is the specific input data (polypeptide, upstream/downstream sequences, affinity index, etc.), y_n is the ground-truth value of whether the binding in the training data is presented, and σ is the sigmoid function.
The above optimization model can be solved with a mini-batch stochastic gradient descent strategy (reference: Goyal, Priya, et al. "Accurate, large minibatch SGD: training ImageNet in 1 hour." arXiv preprint arXiv:1706.02677 (2017)): over multiple epochs, the training data are fed into the model in batches, the loss function and gradients above are computed, and the model is updated by gradient descent. Specifically, we use the Adam optimizer (reference: Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014)), which estimates first- and second-order moments of the gradient and automatically adapts the optimization step size, making the model optimization process more stable and robust. The optimized model parameters and structure are saved as the pre-trained model.
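The Adam update rule referenced above can be sketched in a few lines; this is an illustrative re-implementation of the published algorithm, not the patent's code:

```python
class Adam:
    """Minimal Adam optimizer (Kingma & Ba, 2014) for a flat parameter list."""

    def __init__(self, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = None  # moment estimates, lazily initialized
        self.t = 0              # step counter for bias correction

    def step(self, params, grads):
        """Return the updated parameter list for one gradient step."""
        if self.m is None:
            self.m = [0.0] * len(params)
            self.v = [0.0] * len(params)
        self.t += 1
        out = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g      # 1st moment
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g * g  # 2nd moment
            mhat = self.m[i] / (1 - self.b1 ** self.t)               # bias-corrected
            vhat = self.v[i] / (1 - self.b2 ** self.t)
            out.append(p - self.lr * mhat / (vhat ** 0.5 + self.eps))
        return out
```

For instance, minimizing f(x) = x² from x = 1 drives x toward 0 within a few hundred steps, with the per-coordinate step size adapting automatically.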
Note that this step traverses the source domain dataset multiple times for iterative optimization, at a high computational and time cost.
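The affinity-score scaling step described above can be sketched as follows; we assume the widely used 1 − log(Kd)/log(50000) form (the exact logarithm base in the original text is ambiguous, so the base here is an assumption):

```python
import math

def scale_affinity(kd_nm, max_kd=50000.0):
    """Map a raw Kd/IC50 value (in nM) onto [0, 1].

    Stronger binders (small Kd) map close to 1, weak binders close to 0.
    The cap of 50000 nM follows the common NetMHC convention and is an
    assumption, not a value stated in the patent.
    """
    kd_nm = min(max(kd_nm, 1.0), max_kd)  # clamp to a numerically safe range
    return 1.0 - math.log(kd_nm) / math.log(max_kd)
```

This compresses the raw scale, which spans hundreds to tens of thousands, into a bounded feature that is stable to optimize.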
As shown in FIG. 2, next we perform deep migration learning on the pre-trained model on the target domain dataset.
After the multi-modal features of the target domain dataset undergo data processing and feature fusion similar to those applied to the source domain dataset, less than 5% of the data are randomly sampled from them for deep transfer learning. Two strategies, "global optimization" and "selected layer optimization", can be chosen according to the size of the pre-trained model and the data scale: the former optimizes all trainable parameters of the pre-trained model, while the latter optimizes only the last two layers of the neural network model and keeps all other parameters identical to the pre-trained model. The whole process can be viewed as optimizing the following model:
min_{W'} (1/N_1) ∑_{i=1}^{N_1} loss_S( σ(f'(x_i^S; W')), y_i^S ) + λ · (1/N_2) ∑_{j=1}^{N_2} loss_C( σ(f'(x_j^C; W')), y_j^C )
where f' is the prediction model to be migrated;
W' represents the learnable parameters in the model, including the scheme weights used when each fused feature is obtained; loss_S and loss_C respectively represent the target loss functions of the pre-training stage and the model-migration adaptation stage (in the experiments both take the form of the pre-training-stage optimization model), consistent with the form introduced in the previous step; λ represents the weight given to the target loss function of the model-migration adaptation stage;
(x_i^S, y_i^S) and (x_j^C, y_j^C) respectively represent the training-data features and the ground-truth presentation labels on the constructed source domain and target domain datasets; N_1 and N_2 respectively represent the numbers of training samples on the constructed source domain and target domain datasets, the former generally being much larger than the latter. After the model is optimized, the relevant parameters are saved in a structured way as the migration model.
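The joint source-plus-target objective can be illustrated with a toy computation (a sketch with hypothetical loss values, not the patent's implementation):

```python
def transfer_objective(src_losses, tgt_losses, lam=1.0):
    """Joint transfer objective: mean source-domain loss plus
    lam times mean target-domain loss.

    src_losses / tgt_losses are the per-sample loss values on the source
    and target domains (their lengths play the roles of N1 and N2);
    lam is the weight given to the target-domain adaptation term.
    """
    return (sum(src_losses) / len(src_losses)
            + lam * sum(tgt_losses) / len(tgt_losses))
```

Increasing lam pulls the optimum toward the small target domain sample; lam = 0 recovers plain pre-training on the source domain alone.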
The optimization model is still solved with the mini-batch stochastic gradient descent strategy and the Adam optimizer, but the small sampled subset traverses the target domain dataset for fewer than 5 epochs, so the computational and time costs are extremely low.
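The choice between the "global optimization" and "selected layer optimization" strategies described above amounts to deciding which pre-trained parameters stay trainable; a minimal sketch (the layer names are illustrative, not from the patent):

```python
def select_trainable(layers, strategy="selected"):
    """Mark which pre-trained layers to fine-tune on the target domain.

    'global'   -> all layers trainable (all parameters optimized);
    'selected' -> only the last two layers trainable, the rest frozen
                  and kept identical to the pre-trained model.
    Returns a dict mapping layer name -> trainable flag.
    """
    frozen = set() if strategy == "global" else set(layers[:-2])
    return {name: name not in frozen for name in layers}
```

In a deep-learning framework the same effect is obtained by disabling gradient computation for the frozen layers before fine-tuning.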
To further enhance the model migration effect, we also perform pre-processing steps such as noise whitening (reference: Alam, Md Jahangir, Gautam Bhattacharya, and Patrick Kenny. "Speaker verification in mismatched conditions with frustratingly easy domain adaptation." Odyssey 2018) and data manifold alignment (reference: Wang, Chang, and Sridhar Mahadevan. "Heterogeneous domain adaptation using manifold alignment." IJCAI 2011) on the source domain and target domain datasets.
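Full covariance-level whitening or manifold alignment is beyond a short example, but the underlying idea of matching source- and target-domain statistics can be sketched with a simple per-feature first- and second-moment alignment (a deliberate simplification of the cited methods, not an implementation of them):

```python
import statistics

def align_moments(source, target):
    """Rescale each target-domain feature column to the source-domain
    mean and standard deviation (row-major lists of feature vectors).

    This matches only per-feature moments; the whitening and manifold
    alignment methods cited in the text operate on the full covariance
    or geometric structure of the data.
    """
    cols = []
    for j in range(len(source[0])):
        s_col = [row[j] for row in source]
        t_col = [row[j] for row in target]
        s_mu, s_sd = statistics.mean(s_col), statistics.pstdev(s_col)
        t_mu = statistics.mean(t_col)
        t_sd = statistics.pstdev(t_col) or 1.0  # avoid division by zero
        cols.append([(v - t_mu) / t_sd * s_sd + s_mu for v in t_col])
    # transpose column-major results back to row-major feature vectors
    return [list(row) for row in zip(*cols)]
```

After alignment the target features share the source features' per-dimension location and scale, which reduces the domain shift seen by the migrated model.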
(c) Model inspection
The prediction capability and performance of the model before and after transfer learning are evaluated directly on a test dataset split from the target domain dataset with unbalanced positive and negative samples, using conventional metrics: the area under the receiver operating characteristic (ROC) curve, AUC, and the precision (positive predictive value, PPV) (Table 1). Specifically, 5 rounds of random training and testing are carried out, and the 5 results are averaged and their standard deviation computed (reported as "mean ± standard deviation" in the table).
TABLE 1 evaluation index of prediction model
Evaluation index    Description
precision/PPV       TP/(TP+FP)
AUC                 Area under the ROC curve
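The two metrics in Table 1 can be computed directly; a minimal sketch (function names are ours), using the rank-sum identity for AUC:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the probability that a random positive outscores a random negative,
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ppv(preds, labels):
    """Precision / positive predictive value: TP / (TP + FP)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    return tp / (tp + fp)
```

Running these over 5 random train/test splits and reporting mean ± standard deviation reproduces the evaluation protocol described above.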
The following are the experimental results (Table 2) for a source domain dataset with a 1:1 positive-negative ratio generated from the NetMHCpan data source, and a target domain dataset and test set with a 1:10 positive-negative ratio generated from the Tübingen data source:
TABLE 2 Test results on the NetMHCpan source domain data and the Tübingen target domain data & test set
(Table 2 is reproduced as an image in the original document; the numerical AUC/PPV values are not recoverable here.)
As the table shows, under all negative-data generation strategies, the model trained on 1:1 positive-negative-ratio data performed poorly on the 1:10 positive-negative-ratio test data before transfer learning, while after transfer learning both AUC and PPV improved substantially. This illustrates the effectiveness of deep transfer learning as a tool for handling data with unbalanced positive-negative ratios, and thus demonstrates the effectiveness of model enhancement based on deep migration.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments, or alternatives may be employed, by those skilled in the art, without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A method for enhancing a prediction system for HLA antigen presentation based on deep migration, comprising the steps of:
s1, feature selection and normalization processing are carried out, and original domain data are constructed to serve as a source domain data set;
in S1, selecting a corresponding normalization processing scheme according to different feature selections to obtain feature vectors with uniform format and dimension, which are convenient for fusion, specifically:
-long-sequence features: each amino acid is encoded into a learnable latent space using a random matrix and processed using a long short-term memory recurrent neural network;
-short-sequence features: encoded by the one-hot method, and the encoded sequence is fed into a multilayer perceptron network model for transformation;
-vector features: encoded using principal component analysis (PCA); the features of all data in vector form are combined into a feature matrix, which is decomposed by PCA, and a specific number of matrix eigenvectors are selected for the coding transformation according to the hidden embedding dimension;
-scalar features: encoded using a multidimensional-scaling, Gaussian-kernel approach: the features of all data in scalar form are taken as the input of a Gaussian kernel to obtain its covariance matrix, and each column of the matrix is scaled by multidimensional scaling to obtain the code-transformed feature vector;
s2, solving a pre-training model through feature fusion and training;
s3, constructing special extremely unbalanced migration target domain data as a target domain data set;
s4, transferring the pre-training model obtained in the S2 to target field data in the S3 by using a depth transfer method to construct a depth transfer self-adaptive optimization model;
the depth-migration adaptive optimization model constructed in step S4 is:

min_{W'} (1/N_1) ∑_{i=1}^{N_1} loss_S( σ(f'(x_i^S; W')), y_i^S ) + λ · (1/N_2) ∑_{j=1}^{N_2} loss_C( σ(f'(x_j^C; W')), y_j^C )

in the formula, f' is the prediction model to be migrated, containing learnable parameters; W' represents the learnable parameters in the model, including the scheme weights used when each fused feature is obtained; loss_S and loss_C respectively represent the target loss functions of the pre-training stage and the model-migration adaptation stage; λ represents the weight given to the target loss function of the model-migration adaptation stage; (x_i^S, y_i^S) and (x_j^C, y_j^C) respectively represent the training-data features and the ground-truth presentation labels on the original domain data set constructed in S1 and the target domain data set constructed in S3; N_1 and N_2 respectively represent the numbers of training samples on the original domain data set constructed in S1 and the target domain data set constructed in S3;
after the model is optimized, the relevant parameters are saved by a structured method as the adaptive depth-migration optimization model;
s5, using a deep migration adaptive optimization model, performing HLA antigen presentation prediction on the target domain dataset.
2. The method of claim 1, wherein the original domain data set constructed in S1 has a balanced ratio of positive to negative sample numbers.
3. The method according to claim 1, wherein in constructing the original domain data set in S1, a window-sliding method is used to generate negative sequences according to a preset parameter threshold, and the generated sequence segments are screened using a global difference scoring matrix to obtain a non-random negative candidate training set.
4. The method according to claim 1, wherein the features to be fused in S2 are selected as: polypeptide sequence features, upstream and downstream sequence features, and presentation affinity features.
5. The method according to claim 4, wherein in S2:
the polypeptide sequence features are standardized by the following method: for a given polypeptide peptide-chain amino acid sequence, each amino acid is encoded into a learnable latent space using a random matrix, and the sequence is then processed with a long short-term memory recurrent neural network to obtain the polypeptide sequence features; after random-matrix encoding and mapping, sequences are padded to the length of the longest peptide chain in all the data so that the parameters of the encoding and mapping models remain consistent;
the upstream and downstream sequence features are standardized by the following method: the upstream and downstream peptide chains of a given gene are encoded by the one-hot method, the encoded upstream and downstream sequences are cut to a fixed length, and the encoded sequence is fed into a multilayer perceptron network model for transformation to extract the upstream and downstream sequence features;
the presentation affinity features are standardized by scaling so as to ensure the numerical stability of the model training and optimization process.
6. The method of claim 1, wherein the pre-training optimization model constructed in S2 is:

min_W ∑_{n=1}^{N} w_n · loss( σ(f(x_n; W)), y_n )

in the formula, f is the prediction model with learnable parameters; W represents the learnable parameters in the model, including the scheme weights used when each fused feature is obtained; w_n represents the weight given to the loss function of sample n, and N represents the total number of samples; x_n represents the specific input data, y_n is the ground-truth presentation label in the training data, and σ is the sigmoid logistic function, so that the features are not simply summed and the model formula captures potential complex relationships;
after the model is optimized, the relevant parameters are saved by a structured method as the pre-trained model.
7. The method according to claim 1, wherein after the negative candidate data set is produced in S3, the target domain data are constructed according to different strategies; the target domain data set has far more negative samples than positive samples, so as to simulate the real prediction environment in which the number of negative samples far exceeds the number of positive samples.
8. The method of claim 1, wherein, according to the size of the pre-trained model and the data scale, the deep-migration adaptive optimization model selects either global optimization, which optimizes all trainable parameters of the pre-trained model, or selected-layer optimization, which optimizes only the last two layers of the neural network model.
9. The method of claim 1, wherein the optimization models in S2 and S4 are solved by traversing all training data multiple times and optimizing with an optimizer based on stochastic gradient methods to obtain the optimal model parameters, yielding the pre-trained prediction model f and the migration prediction model f'.
10. The method of claim 1, wherein a separate batch of the extremely unbalanced data constructed in S3 is split off for verifying, in S5, the prediction effect of the deep-migration adaptive optimization model on the target domain data.
CN202111204491.8A 2021-10-15 2021-10-15 Method for enhancing HLA antigen presentation prediction system based on deep migration Active CN113762417B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230705

Address after: Room 201, 2nd Floor, Building A-4, Building 16, Shuwu, No. 73 Tanmi Road, Jiangbei New District, Nanjing City, Jiangsu Province, 211899

Patentee after: Nanjing Chengshi Biomedical Technology Co.,Ltd.

Address before: 210000 room 209, floor 2, building D-2, building 16, tree house, No. 73, tanmi Road, Jiangbei new area, Nanjing, Jiangsu

Patentee before: Nanjing Chengshi Biotechnology Co.,Ltd.