CN113762417B - Method for enhancing HLA antigen presentation prediction system based on deep migration - Google Patents


Info

Publication number
CN113762417B
Authority
CN
China
Prior art keywords
model
training
data set
data
migration
Prior art date
Legal status
Active
Application number
CN202111204491.8A
Other languages
Chinese (zh)
Other versions
CN113762417A (en)
Inventor
方榯楷
费才溢
徐实
Current Assignee
Nanjing Chengshi Biomedical Technology Co ltd
Original Assignee
Nanjing Chengshi Biotechnology Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Chengshi Biotechnology Co ltd
Priority to CN202111204491.8A
Publication of CN113762417A
Application granted
Publication of CN113762417B

Classifications

    • G06F18/253 Pattern recognition; Fusion techniques of extracted features
    • G06F18/2135 Feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/044 Neural networks; Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; Combinations of networks
    • G06N3/08 Neural networks; Learning methods

Abstract

The invention provides a method for enhancing an HLA antigen presentation prediction system based on deep migration (deep transfer learning), comprising the following steps: 1) generating negative-sample training sets in different proportions with a global maximum-difference scoring matrix, yielding a source domain data set with balanced positive and negative samples and a target data set with unbalanced positive and negative samples; 2) encoding the known sequence information with several different deep neural networks and performing multi-modal feature fusion; 3) obtaining a pre-trained model on the source domain data, where the positive-negative sample ratio is balanced, and transferring it by a deep transfer method to the target data set, where the positive-negative sample ratio is extremely unbalanced; 4) proposing an innovative strict-accuracy (strict PPV) metric. The method can efficiently fuse multi-modal information, be rapidly deployed and migrated to unseen data sets, and save the computing power and time cost of retraining the model in a new environment and on new data.

Description

Method for enhancing HLA antigen presentation prediction system based on deep migration
Technical Field
The invention relates to the field of bioinformatics, in particular to a method for enhancing an HLA neoantigen presentation prediction system based on deep migration.
Background
Human Leukocyte Antigen (HLA), the gene complex encoding the human Major Histocompatibility Complex (MHC), is closely related to the function of the human immune system. MHC molecules are divided into two major classes: class I MHC presents proteins (e.g. viral proteins) that are broken down inside the cell, while class II MHC binds the fragments formed when an external invader undergoes endocytosis and lysosomal processing, presenting them on the cell surface for recognition by T cells. Some of these genes encode cell-surface antigens, which become an indelible "signature" of each individual's cells and the basis on which the immune system distinguishes self from foreign substances. Cancer vaccines that exploit the principle of HLA presentation are a hot topic in medicine and pharmacology today.
The main steps for the prediction of a personalized HLA Neoantigen (Neoantigen) vaccine are as follows:
(1) Identifying and validating specifically immunogenic non-synonymous somatic mutations expressed in the patient's tumor. Tumor tissue is biopsied for whole-exome or transcriptome sequencing. Non-synonymous somatic mutations of the tumor, such as point mutations, indels and reading-frame shifts, can be identified by comparing the sequences of the tumor and matched healthy tissue.
(2) Mutations with the highest antigen presentation potential were screened, analyzed and identified using Major Histocompatibility Complex (MHC) class I and II epitope prediction algorithms.
(3) The ordered list of candidate antigens is further validated based on in vitro binding assay results.
The prediction of HLA neoantigen presentation involved in step (2) is the core of neoantigen vaccine development. With the wide application of artificial intelligence in bioinformatics, researchers have begun using data-driven machine learning methods to rapidly discover, predict and screen viable neoantigen immune targets. Representative examples include the NetMHCpan series from the Technical University of Denmark (refs: Jurtz, Vanessa, et al. "NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data." The Journal of Immunology 199.9 (2017): 3360-3368; Reynisson, Birkir, et al. "NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data." Nucleic Acids Research 48.W1 (2020): W449-W454), which predict MHC class I and class II antigen presentation from binding affinity and mass-spectrometry eluted ligand data; the same team's prediction of MHC class II antigens based on motif deconvolution of eluted ligand data (ref: Reynisson, Birkir, et al. "Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data." Journal of Proteome Research 19.6 (2020): 2304-2315); and the work of the team at the University of North Carolina at Chapel Hill on MHC antigen prediction models based on immunogenic epitope selection (ref: Smith C, Chai S, Washington AR, et al. "Machine-learning prediction of tumor antigen immunogenicity in the selection of therapeutic epitopes." Cancer Immunology Research 2019, 7(10): 1591-1604).
The work described above mainly contributed applications of machine learning models to specific, pre-constructed MHC data sets. However, in practical scenarios for developing anti-tumor vaccines, most existing mass spectrometry (MS)-based identification data sets still contain relatively few epitopes matched to real human leukocyte antigens (HLA), come from diverse sources, and differ greatly in data distribution, which makes it difficult to build a reliable, robust and reusable HLA antigen prediction system. Such difficulties arise not only in HLA antigen prediction: in bioinformatics and clinical medicine more broadly, the high cost of data acquisition and the difficulty of robust model training are prominent problems whenever a machine learning model is introduced.
Researchers have therefore turned to cutting-edge artificial-intelligence methods such as transfer learning, which deploy an existing model to a new scenario at low model-training and data-collection cost. Transfer learning is a machine learning approach that uses existing knowledge to solve different but related domain problems. Representative efforts applying it to MHC prediction include: scholars at the Southern University of Science and Technology, who trained a model under the transfer-learning paradigm to learn common features of mixed allele-specific epitopes (ref: Hu, Weipeng, Yangping Li, and Xiuqing Zhang. "MHC-I epitope presentation based on transfer learning." Yi Chuan = Hereditas 41.11 (2019): 1041-); and the MINERVA work, which predicts HLA presentation of tumor antigen peptides by encoding physicochemical properties of related naturally presented peptides with transfer learning (ref: Ng FS, Vandeberghe M, Portella G, Cayatte C, Qu X, Handuchi S, Landry A, Chaerkady R, Yu W, Collepardo-Guevara R, Sidders B. "MINERVA: Transfer Learning of HLA Class I Peptide Presentation in Tumours with Neural Networks." Available at SSRN 3704016). Overall, however, related studies remain few.
Another challenge for mainstream models in this field is the extremely unbalanced data ratio in real situations. When training a model, a data set with comparable numbers of positive and negative samples is usually chosen so that training proceeds smoothly. But in real scenarios negative (non-presented) samples far outnumber positive samples, which further increases the difficulty of evaluating a model's true performance. There has been some discussion in academia (ref: Schneider M, Wang L, Marr C. "Evaluation of domain adaptation approaches for robust classification of heterogeneous biological data." In International Conference on Artificial Neural Networks, 2019 Sep 17 (pp. 673-686). Springer, Cham), but the problem has not received wide attention.
Disclosure of Invention
Two major problems affect the mainstream HLA prediction methods mentioned in the background: 1. models transfer poorly to heterogeneous data; and 2. conventional metrics cannot measure a model's actual performance under an extreme positive-negative ratio.
The invention firstly discloses a method for enhancing an HLA antigen presentation prediction system based on deep migration, which comprises the following steps:
s1, feature selection and normalization processing are carried out, and original domain data are constructed to serve as a source domain data set;
s2, solving a pre-training model through feature fusion and training;
s3, constructing special extremely unbalanced migration target domain data as a target domain data set;
s4, transferring the pre-training model obtained in the S2 to target field data in the S3 by using a depth transfer method to construct a depth transfer self-adaptive optimization model;
s5, using a deep migration adaptive optimization model, performing HLA antigen presentation prediction on the target domain dataset.
Preferably, according to the feature selected in S1, a corresponding normalization scheme is chosen to obtain feature vectors of uniform format and dimension that are convenient to fuse, specifically:
- long-sequence features: each amino acid is encoded into a learnable implicit space using a random matrix, then processed with a long short-term memory (LSTM) recurrent neural network;
- short-sequence features: encoded by the one-hot method, then fed into a multi-layer perceptron network for transformation;
- vector features: encoded using principal component analysis (PCA); the vector-form features of all data are combined into a feature matrix, which is decomposed by PCA, and a number of eigenvectors matching the hidden embedding dimension are selected for the coding transformation;
- scalar features: encoded using multi-dimensional scaling with a Gaussian kernel; the scalar-form features of all data are taken as inputs to a Gaussian kernel to obtain its covariance matrix, and each column of the matrix is scaled by multi-dimensional scaling to obtain the transformed feature vector.
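The vector-feature (PCA) and scalar-feature (Gaussian kernel plus classical multi-dimensional scaling) encodings can be sketched as below. This is a minimal NumPy illustration: `pca_encode` and `gaussian_mds_encode` are hypothetical helper names, and the kernel bandwidth `sigma` is an added assumption, not fixed by the patent.

```python
import numpy as np

def pca_encode(X, k):
    """Project vector-form features onto the top-k principal components.

    X: (n_samples, d) matrix stacking every sample's vector feature.
    Returns an (n_samples, k) encoding in a shared basis, so every sample
    ends up with the same hidden embedding dimension k.
    """
    Xc = X - X.mean(axis=0)                  # center each feature column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # coordinates in the top-k basis

def gaussian_mds_encode(s, k, sigma=1.0):
    """Encode scalar features via a Gaussian kernel followed by classical MDS.

    s: (n_samples,) scalar feature; sigma: kernel bandwidth (free choice).
    Returns an (n_samples, k) embedding.
    """
    d2 = (s[:, None] - s[None, :]) ** 2
    K = np.exp(-d2 / (2 * sigma ** 2))       # Gaussian-kernel covariance matrix
    n = len(s)
    J = np.eye(n) - np.ones((n, n)) / n
    B = J @ K @ J                            # double-centered kernel
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # keep the k largest
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))
```

Both helpers return an (n_samples, k) matrix, so features of either form come out with the same embedding dimension and can be concatenated for fusion.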
Preferably, the source domain data set constructed in S1 has a balanced ratio of positive to negative samples (number of positive samples : number of negative samples = 1 : 1), to best suit the structure and optimization method of the pre-trained model.
Preferably, when constructing the source domain data set in S1, a sliding-window method is used to generate negative sequences according to a preset parameter threshold, and the generated sequence segments are screened with a global difference scoring matrix to obtain a non-random negative candidate training set.
Preferably, the features to be fused in S2 are: polypeptide sequence features, upstream and downstream sequence features, and presentation affinity features.
Preferably, in S2:
the polypeptide sequence features are standardized as follows: for a given polypeptide chain amino acid sequence, each amino acid is encoded into a learnable space using a random matrix and processed with a long short-term memory recurrent neural network to obtain the polypeptide sequence features; after random-matrix encoding, sequences are padded to the length of the longest peptide chain in the data so that the parameters of the encoding and mapping models stay consistent;
the upstream and downstream sequence features are standardized as follows: the upstream and downstream peptide chains of a given gene are encoded by the one-hot method, the encoded sequences are cut to a fixed length and fed into a multi-layer perceptron network for transformation, and the extracted features serve as the upstream and downstream sequence features;
the presentation affinity features are scaled to obtain standard features, ensuring numerical stability of the training optimization process.
Preferably, the pre-training optimization model constructed in S2 is:

    W* = argmin_W Σ_{n=1..N} w_n · loss( σ(f_W(x_n)), y_n )

where f_W is the prediction model with learnable parameters;
W denotes the learnable parameters of the model, including the scheme weights used when obtaining each fused feature; w_n denotes the weight given to the loss function of sample n, and N is the total number of samples;
x_n is the specific input data, y_n is the ground-truth value of whether the binding in the training data is presented, and σ is the sigmoid (S-shaped logistic) function; the fused features are not simply summed, and the model captures the potential complex relations among them;
after the model is optimized, the relevant parameters are saved in a structured form as the pre-trained model.
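A minimal sketch of the pre-training objective above, assuming a binary cross-entropy as the per-sample loss term (the patent does not fix the loss form) and `pretrain_loss` as a hypothetical name:

```python
import numpy as np

def sigmoid(z):
    """The S-shaped logistic function sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_loss(logits, y, w):
    """sum_n w_n * loss(sigma(f_W(x_n)), y_n) with a binary cross-entropy loss.

    logits: raw model outputs f_W(x_n); y: 0/1 presentation labels;
    w: per-sample weights w_n (all 1 for a balanced positive-negative ratio).
    """
    p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)   # guard log(0)
    return float(np.sum(w * -(y * np.log(p) + (1 - y) * np.log(1 - p))))
```

Raising w_n for the rarer class is the lever mentioned later for an uneven positive-negative ratio.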
Preferably, after the negative candidate data set is produced in S3, the target domain data is constructed according to different strategies; the target domain data set contains far more negative samples than positive samples, simulating the real prediction environment in which negative samples vastly outnumber positive ones.
Preferably, the deep-migration adaptive optimization model constructed in S4 is:

    W'* = argmin_{W'} loss_S( f_{W'}; {(x_n^(1), y_n^(1))}_{n=1..N_1} ) + λ · loss_C( f_{W'}; {(x_n^(2), y_n^(2))}_{n=1..N_2} )

where f_{W'} is the prediction model to be migrated, with learnable parameters;
W' denotes the learnable parameters of the model, including the scheme weights used when obtaining each fused feature;
loss_S and loss_C denote the target loss functions of the pre-training stage and the model-migration adaptation stage, respectively; λ is the weight given to the target loss of the migration adaptation stage;
(x_n^(1), y_n^(1)) and (x_n^(2), y_n^(2)) denote the training-data features and the ground-truth presentation labels on the source domain data set constructed in S1 and the target domain data set constructed in S3, respectively;
N_1 and N_2 denote the numbers of training samples on the source domain data set constructed in S1 and the target domain data set constructed in S3, respectively;
after the model is optimized, the relevant parameters are saved in a structured form as the adaptive deep-migration optimization model.
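The two-term migration objective can be illustrated as below; `migration_objective` is a hypothetical name, the toy linear model stands in for the fused network f_{W'}, and binary cross-entropy is again an assumed loss form:

```python
import numpy as np

def _bce(logits, y):
    """Binary cross-entropy summed over samples."""
    p = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-7, 1 - 1e-7)
    return np.sum(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def migration_objective(f, W, source, target, lam):
    """loss_S on the balanced source set plus lambda times loss_C on the
    unbalanced target set, both evaluated with the shared parameters W'."""
    (xs, ys), (xt, yt) = source, target
    return float(_bce(f(W, xs), ys) + lam * _bce(f(W, xt), yt))
```

With λ = 0 the objective reduces to the pre-training loss on the source set; increasing λ shifts weight toward adapting to the extremely unbalanced target set.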
Preferably, according to the size of the pre-trained model and the data scale, the deep-migration adaptive optimization model either performs global optimization of all trainable parameters in the pre-trained model, or performs selected-layer optimization of only the last two layers of the neural network.
Preferably, the optimization models in S2 and S4 are solved by traversing all training data multiple times and optimizing with an optimizer based on stochastic gradient methods to obtain the optimal model parameters, yielding the pre-trained prediction model f_W and the migrated prediction model f_{W'}.
Preferably, a single batch of data is split off from the extremely unbalanced data constructed in S3 and used in S5 to verify the prediction performance of the deep-migration adaptive optimization model on the target domain data.
Advantages of the invention
The invention provides a brand-new method for enhancing an HLA neoantigen presentation prediction system based on deep migration, comprising the following steps: 1) generating negative-sample training sets in different proportions with a global maximum-difference scoring matrix, yielding a source domain data set with balanced positive and negative samples and a target data set with unbalanced positive and negative samples; 2) encoding the known sequence information with several different deep neural networks and performing multi-modal feature fusion; 3) obtaining a pre-trained model on the source domain data, where the positive-negative sample ratio is balanced, and transferring it by a deep transfer method to the target data set, where the positive-negative sample ratio is extremely unbalanced; 4) proposing an innovative strict-accuracy (strict PPV) metric. The method can efficiently fuse multi-modal information, be rapidly deployed and migrated to unseen data sets, and save the computing power and time cost of retraining the model in a new environment and on new data.
The multi-modal feature-fusion prediction provided by this application captures the potential complex relations among multiple features instead of the traditional simple summation.
Training on source domain data with a balanced positive-negative sample ratio ensures stable convergence of the pre-trained model and lets the model learn hidden embedded representations of the multi-modal features.
Deep migration to target data with an extremely unbalanced positive-negative sample ratio ensures the reliability and reusability of the migrated model in a real environment.
Drawings
FIG. 1 is a diagram of a method for migrating and enhancing a prediction model for HLA neoantigen presentation based on deep migration
FIG. 2 general flow chart of enhancement method and evaluation system for HLA neoantigen presentation prediction system based on deep migration
Detailed Description
The invention is further illustrated by the following examples, without limiting the scope of the invention:
As shown in FIG. 2, the system of the method for enhancing and evaluating an HLA neoantigen presentation prediction system based on deep migration is divided into three parts, described in detail below: data set construction, model optimization and migration, and model testing.
(a) Data set construction
First, this module collects, from public database resources and specific literature, data tuples such as the polypeptide chain presented by a specific HLA neoantigen, upstream/downstream data pairs and affinity indices, together with a matched data processing and standardization pipeline, specifically:
I. Given a specific protein, the specific polymorphic amino acid sequence of an HLA neoantigen that is successfully expressed and presented.
II. The corresponding upstream and downstream related sequences, 6 and 12 amino acids in length.
III. Affinity indices (affinity scores) of the presented expression data pairs, obtained from a series of professional computing tools based on the protein-polypeptide data pairs in I and related features (ref: Jurtz, Vanessa, et al. "NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data." The Journal of Immunology 199.9 (2017): 3360-3368).
Specifically, the public data sources and literature resources referred to are mainly MARIA (ref: Chen, Binbin, et al. "Predicting HLA class II antigen presentation through integrated deep learning." Nature Biotechnology 37.11 (2019): 1332-1343) and the NetMHCpan series of data (ref: Reynisson, Birkir, et al. "NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data." Nucleic Acids Research 48.W1 (2020): W449-W454).
After collecting, by the above method, positive samples of HLA neoantigens successfully expressed and presented from multiple data sources, the data set used for machine learning training is constructed; specifically, negative samples are generated using a global maximum-difference scoring matrix, and the source domain and target domain data sets are built. The principle is as follows: the similarity between a presented peptide fragment and a normal peptide fragment is generally considered to be negatively correlated, to some degree, with antigen presentation and immunogenicity, so we use the global maximum-difference scoring matrix to generate the peptide fragments with the lowest sequence similarity as the negative sample set of the training set. Concretely, the positive-negative ratio is fixed first, windows are slid sequentially over the sequences, all generated sequences are multiply aligned using the BioPython sequence-alignment package, and the specific negative sequences with the lowest sequence similarity are retained as the negative training set by a bubble-method selection.
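A simplified sketch of this negative-generation step, using a plain positional-identity score as a stand-in for BioPython alignment scoring and a sort in place of the bubble-method selection (all function names are illustrative):

```python
def sliding_windows(protein, length):
    """All candidate peptides of a given length from a protein (window step 1)."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]

def identity(a, b):
    """Fraction of matching positions; a stand-in for a real alignment score."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def pick_negatives(positives, candidates, n_neg):
    """Keep the n_neg candidates least similar to any presented (positive) peptide."""
    scored = [(max(identity(c, p) for p in positives), c)
              for c in candidates if c not in positives]
    scored.sort(key=lambda t: t[0])          # lowest similarity first
    return [c for _, c in scored[:n_neg]]
```

A real pipeline would replace `identity` with a substitution-matrix alignment score, but the select-the-least-similar logic is the same.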
In a real production environment, the probability that HLA neoantigen expression and presentation fails is much higher than the probability that it succeeds, so several negative samples must be generated for each positive sample. Concretely, the presentation expression data pairs from step I of module (a) are input to the open-source computing tool NetMHCpan (ref: Jurtz, Vanessa, et al. "NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data." The Journal of Immunology 199.9 (2017): 3360-3368), and for each successfully presented positive sample pair, between 1 and 1000 negative samples are generated following the inverted ranking of candidate affinity indices. Because data with unbalanced positive and negative samples pose a great challenge to constructing and optimizing a machine learning model, the positive-negative ratio is first set to 1:1 to construct a relatively balanced data set for training; this is exactly the source domain data set used to optimize the pre-trained model.
Next, we build the target domain data set used to simulate the real environment and perform model migration. Here we set the positive-negative sample ratio to 1:100 or 1:10 to simulate the real production environment, where non-presented negative samples occur far more often than successfully presented positive ones. Note that, directly on data with such an extremely unbalanced ratio, most machine learning models struggle to converge stably and to learn the representations of the valuable features in the positive and negative samples.
In addition, for the source domain data set used to optimize the pre-trained model and the target domain data set used for migration, different sampling strategies can be adopted when selecting a specific number of negative samples from the candidate pool, to simulate different possibilities in real production scenarios. Specifically, we adopt three negative-sampling (negative data) strategies: 1. Generalized negative strategy: given a successfully expressed and presented positive data pair, randomly select one negative sample from its entire candidate pool for the data set. 2. Median negative strategy: given a positive data pair, sort its candidate pool in descending order of affinity score and select the negative sample with the smallest affinity score, i.e. the lowest similarity to the positive sample, for the data set. 3. Narrow negative strategy: given a positive data pair, first remove candidates whose original affinity score is below 500, then sort the remainder in descending order and select the negative sample with the smallest affinity score and lowest similarity to the positive sample for the data set.
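The three strategies might be sketched as follows, with each candidate represented as a dict carrying its affinity score (the names and the dict layout are illustrative, not from the patent):

```python
import random

def generalized_negative(pool, rng=random):
    """Generalized strategy: one negative drawn uniformly from the candidate pool."""
    return rng.choice(pool)

def median_negative(pool):
    """Median strategy: sort candidates by affinity score in descending order
    and take the last one, i.e. the smallest affinity score."""
    return sorted(pool, key=lambda c: c["affinity"], reverse=True)[-1]

def narrow_negative(pool, cutoff=500):
    """Narrow strategy: drop candidates whose affinity score is below the
    cutoff, then take the smallest remaining affinity score."""
    kept = [c for c in pool if c["affinity"] >= cutoff]
    return sorted(kept, key=lambda c: c["affinity"], reverse=True)[-1]
```

The narrow strategy thus never returns the easiest decoys (scores below the cutoff), which makes the resulting data set harder and closer to borderline cases.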
To compute the subsequent strict-accuracy (strict PPV) metric, an additional special test data set must be constructed, as follows: 1000 successfully presented protein-HLA pairs are selected from a given data set, and 100 false-positive samples are generated for each pair.
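The strict PPV computation this test set supports can be sketched as below (`strict_ppv` is a hypothetical helper name; the exact definition in the patent may differ):

```python
def strict_ppv(scores, labels, n_positives=None):
    """Strict PPV: rank every candidate by predicted score, keep the top-k
    (k = number of true positives), and report the fraction that are real.

    With 100 decoys per positive, a random model scores about 0.01 here,
    so the metric stays informative under extreme class imbalance.
    """
    k = n_positives if n_positives is not None else sum(labels)
    top = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)[:k]
    return sum(label for _, label in top) / k
```

Unlike accuracy or AUC on a balanced test set, this value directly reflects how many of a model's top-ranked calls would survive in a 1:100 environment.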
Specifically, we selected the MARIA and NetMHCpan data sets (refs: Chen, Binbin, et al. "Predicting HLA class II antigen presentation through integrated deep learning." Nature Biotechnology 37.11 (2019); Reynisson, Birkir, et al. "NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data." Nucleic Acids Research 48.W1 (2020): W449-W454) to construct the positive-negative balanced source domain data set for training the model, and the Tübingen data set (ref: Rammensee, H-G., et al. "SYFPEITHI: database for MHC ligands and peptide motifs." Immunogenetics 50.3 (1999): 213-219) to construct the extremely unbalanced target domain data set and the strict-PPV test set.
In addition, for the target domain data set based on the Tübingen data, k-fold cross validation is used to construct separate training, validation and test sets. The training and validation splits are used for the model-migration process of step (b); the test split is used in step (c) to test both the pre-trained model and the migrated model.
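A minimal k-fold splitting sketch (illustrative; the held-out part of each fold can be subdivided further into validation and test):

```python
def k_fold_splits(indices, k):
    """Yield (train, heldout) index lists for k-fold cross validation.

    Every index appears in exactly one held-out fold, so each sample is
    used for evaluation once and for training k-1 times.
    """
    folds = [indices[i::k] for i in range(k)]          # round-robin assignment
    for i in range(k):
        heldout = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, heldout
```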
(b) Model enhancement and migration
As shown in FIG. 1, we first optimize the pre-trained model on the source domain data set.
Specifically, the different modal features on the source domain data set are divided into long-sequence features, short-sequence features, vector features and scalar features, and a corresponding normalization scheme is defined for each to obtain feature vectors of uniform format and dimension that are convenient to fuse. For example, for a given polypeptide chain amino acid sequence, each amino acid is encoded into a learnable implicit space using a random matrix and then processed with a gated recurrent unit network (GRU) (ref: Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014)). The upstream and downstream peptide chains of a given gene are encoded by the one-hot method, cut to a fixed length, and fed into a multi-layer perceptron network for feature extraction. For the affinity score feature, considering the wide range of the raw data scale (from hundreds to tens of thousands), we transform and scale it with 1 - log50000(Kd). Finally, the processed modal features are input to the feature-fusion layer, yielding the following optimization model:
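The affinity transform mentioned above can be written as a small helper; the clamp bounds here are an added assumption for numerical stability, not specified by the patent:

```python
import math

def scale_affinity(kd_nm, cap=50000.0):
    """Map a binding affinity Kd (nM) spanning hundreds to tens of thousands
    into [0, 1] via 1 - log(Kd)/log(50000).

    Strong binders (small Kd) map near 1, weak binders near 0, so the
    feature enters the fusion layer on a stable, bounded scale.
    """
    kd_nm = min(max(kd_nm, 1.0), cap)     # clamp to keep the log well-defined
    return 1.0 - math.log(kd_nm) / math.log(cap)
```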
    W* = argmin_W Σ_{n=1..N} w_n · loss( σ(f_W(x_n)), y_n )
where f is the prediction model integrating all the sequence-encoding, multi-modal fusion and feature-transformation neural networks of this step, and W are its learnable parameters. w_n denotes the weight given to the loss function of sample n; when the positive-negative ratio of the training data is balanced it is usually set to 1, while for an unbalanced ratio the rarer samples may be given larger weights. x_n is the specific input data (polypeptide, upstream/downstream sequences, affinity index, etc.), y_n is the ground-truth value of whether the binding in the training data is presented, and σ is the sigmoid function.
The above optimization model can be solved with a mini-batch stochastic gradient descent strategy (reference: Goyal, Priya, et al. "Accurate, large minibatch SGD: training ImageNet in 1 hour." arXiv preprint arXiv:1706.02677 (2017)): over multiple epochs, the training data are fed into the model in batches, the loss function and gradients above are computed, and the model is updated by gradient descent. Specifically, we use the Adam optimizer (reference: Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014)), which estimates first- and second-order moments of the gradient and automatically adapts the optimization step size, making the model optimization process more stable and robust. The optimized model parameters and structure are saved as the pre-trained model.
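The Adam update rule referenced above can be sketched in a few lines; this is an illustrative re-implementation of the published algorithm, not the patent's code:

```python
class Adam:
    """Minimal Adam optimizer (Kingma & Ba, 2014) for a flat parameter list."""

    def __init__(self, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = None  # moment estimates, lazily initialized
        self.t = 0              # step counter for bias correction

    def step(self, params, grads):
        """Return the updated parameter list for one gradient step."""
        if self.m is None:
            self.m = [0.0] * len(params)
            self.v = [0.0] * len(params)
        self.t += 1
        out = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g      # 1st moment
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g * g  # 2nd moment
            mhat = self.m[i] / (1 - self.b1 ** self.t)               # bias-corrected
            vhat = self.v[i] / (1 - self.b2 ** self.t)
            out.append(p - self.lr * mhat / (vhat ** 0.5 + self.eps))
        return out
```

For instance, minimizing f(x) = x² from x = 1 drives x toward 0 within a few hundred steps, with the per-coordinate step size adapting automatically.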
Note that this step traverses the source domain dataset multiple times for iterative optimization, at a high computational and time cost.
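The affinity-score scaling step described above can be sketched as follows; we assume the widely used 1 − log(Kd)/log(50000) form (the exact logarithm base in the original text is ambiguous, so the base here is an assumption):

```python
import math

def scale_affinity(kd_nm, max_kd=50000.0):
    """Map a raw Kd/IC50 value (in nM) onto [0, 1].

    Stronger binders (small Kd) map close to 1, weak binders close to 0.
    The cap of 50000 nM follows the common NetMHC convention and is an
    assumption, not a value stated in the patent.
    """
    kd_nm = min(max(kd_nm, 1.0), max_kd)  # clamp to a numerically safe range
    return 1.0 - math.log(kd_nm) / math.log(max_kd)
```

This compresses the raw scale, which spans hundreds to tens of thousands, into a bounded feature that is stable to optimize.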
As shown in FIG. 2, next we perform deep migration learning on the pre-trained model on the target domain dataset.
After the multi-modal features of the target domain dataset undergo data processing and feature fusion similar to those applied to the source domain dataset, less than 5% of the data are randomly sampled from them for deep transfer learning. Two strategies, "global optimization" and "selected layer optimization", can be chosen according to the size of the pre-trained model and the data scale: the former optimizes all trainable parameters of the pre-trained model, while the latter optimizes only the last two layers of the neural network model and keeps all other parameters identical to the pre-trained model. The whole process can be viewed as optimizing the following model:
min_{W'} (1/N_1) ∑_{i=1}^{N_1} loss_S( σ(f'(x_i^S; W')), y_i^S ) + λ · (1/N_2) ∑_{j=1}^{N_2} loss_C( σ(f'(x_j^C; W')), y_j^C )
where f' is the prediction model to be migrated;
W' represents the learnable parameters in the model, including the scheme weights used when each fused feature is obtained; loss_S and loss_C respectively represent the target loss functions of the pre-training stage and the model-migration adaptation stage (in the experiments both take the form of the pre-training-stage optimization model), consistent with the form introduced in the previous step; λ represents the weight given to the target loss function of the model-migration adaptation stage;
(x_i^S, y_i^S) and (x_j^C, y_j^C) respectively represent the training-data features and the ground-truth presentation labels on the constructed source domain and target domain datasets; N_1 and N_2 respectively represent the numbers of training samples on the constructed source domain and target domain datasets, the former generally being much larger than the latter. After the model is optimized, the relevant parameters are saved in a structured way as the migration model.
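The joint source-plus-target objective can be illustrated with a toy computation (a sketch with hypothetical loss values, not the patent's implementation):

```python
def transfer_objective(src_losses, tgt_losses, lam=1.0):
    """Joint transfer objective: mean source-domain loss plus
    lam times mean target-domain loss.

    src_losses / tgt_losses are the per-sample loss values on the source
    and target domains (their lengths play the roles of N1 and N2);
    lam is the weight given to the target-domain adaptation term.
    """
    return (sum(src_losses) / len(src_losses)
            + lam * sum(tgt_losses) / len(tgt_losses))
```

Increasing lam pulls the optimum toward the small target domain sample; lam = 0 recovers plain pre-training on the source domain alone.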
The optimization model is still solved with the mini-batch stochastic gradient descent strategy and the Adam optimizer, but the small sampled subset traverses the target domain dataset for fewer than 5 epochs, so the computational and time costs are extremely low.
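The choice between the "global optimization" and "selected layer optimization" strategies described above amounts to deciding which pre-trained parameters stay trainable; a minimal sketch (the layer names are illustrative, not from the patent):

```python
def select_trainable(layers, strategy="selected"):
    """Mark which pre-trained layers to fine-tune on the target domain.

    'global'   -> all layers trainable (all parameters optimized);
    'selected' -> only the last two layers trainable, the rest frozen
                  and kept identical to the pre-trained model.
    Returns a dict mapping layer name -> trainable flag.
    """
    frozen = set() if strategy == "global" else set(layers[:-2])
    return {name: name not in frozen for name in layers}
```

In a deep-learning framework the same effect is obtained by disabling gradient computation for the frozen layers before fine-tuning.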
To further enhance the model migration effect, we also perform pre-processing steps such as noise whitening (reference: Alam, Md Jahangir, Gautam Bhattacharya, and Patrick Kenny. "Speaker verification in mismatched conditions with frustratingly easy domain adaptation." Odyssey 2018) and data manifold alignment (reference: Wang, Chang, and Sridhar Mahadevan. "Heterogeneous domain adaptation using manifold alignment." IJCAI 2011) on the source domain and target domain datasets.
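Full covariance-level whitening or manifold alignment is beyond a short example, but the underlying idea of matching source- and target-domain statistics can be sketched with a simple per-feature first- and second-moment alignment (a deliberate simplification of the cited methods, not an implementation of them):

```python
import statistics

def align_moments(source, target):
    """Rescale each target-domain feature column to the source-domain
    mean and standard deviation (row-major lists of feature vectors).

    This matches only per-feature moments; the whitening and manifold
    alignment methods cited in the text operate on the full covariance
    or geometric structure of the data.
    """
    cols = []
    for j in range(len(source[0])):
        s_col = [row[j] for row in source]
        t_col = [row[j] for row in target]
        s_mu, s_sd = statistics.mean(s_col), statistics.pstdev(s_col)
        t_mu = statistics.mean(t_col)
        t_sd = statistics.pstdev(t_col) or 1.0  # avoid division by zero
        cols.append([(v - t_mu) / t_sd * s_sd + s_mu for v in t_col])
    # transpose column-major results back to row-major feature vectors
    return [list(row) for row in zip(*cols)]
```

After alignment the target features share the source features' per-dimension location and scale, which reduces the domain shift seen by the migrated model.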
(c) Model inspection
The prediction capability and performance of the model before and after transfer learning are evaluated directly on a test dataset split from the target domain dataset with unbalanced positive and negative samples, using conventional metrics: the area under the receiver operating characteristic (ROC) curve, AUC, and the precision (positive predictive value, PPV) (Table 1). Specifically, 5 rounds of random training and testing are carried out, and the 5 results are averaged and their standard deviation computed (reported as "mean ± standard deviation" in the table).
TABLE 1 evaluation index of prediction model
Evaluation index    Description
precision/PPV       TP/(TP+FP)
AUC                 Area under the ROC curve
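The two metrics in Table 1 can be computed directly; a minimal sketch (function names are ours), using the rank-sum identity for AUC:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the probability that a random positive outscores a random negative,
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ppv(preds, labels):
    """Precision / positive predictive value: TP / (TP + FP)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    return tp / (tp + fp)
```

Running these over 5 random train/test splits and reporting mean ± standard deviation reproduces the evaluation protocol described above.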
The following are the experimental results (Table 2) for a source domain dataset with a 1:1 positive-negative ratio generated from the NetMHCpan data source, and a target domain dataset and test set with a 1:10 positive-negative ratio generated from the Tübingen data source:
TABLE 2 Test results on the NetMHCpan source domain data and the Tübingen target domain data & test set
(Table 2 is reproduced as an image in the original document; the numerical AUC/PPV values are not recoverable here.)
As the table shows, under all negative-data generation strategies, the model trained on 1:1 positive-negative-ratio data performed poorly on the 1:10 positive-negative-ratio test data before transfer learning, while after transfer learning both AUC and PPV improved substantially. This illustrates the effectiveness of deep transfer learning as a tool for handling data with unbalanced positive-negative ratios, and thus demonstrates the effectiveness of model enhancement based on deep migration.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments, or alternatives may be employed, by those skilled in the art, without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A method for enhancing a prediction system for HLA antigen presentation based on deep migration, comprising the steps of:
s1, feature selection and normalization processing are carried out, and original domain data are constructed to serve as a source domain data set;
in S1, selecting a corresponding normalization processing scheme according to different feature selections to obtain feature vectors with uniform format and dimension, which are convenient for fusion, specifically:
-long-sequence features: each amino acid is encoded into a learnable latent space using a random matrix and processed using a long short-term memory recurrent neural network;
-short-sequence features: encoded by the one-hot method, and the encoded sequence is fed into a multilayer perceptron network model for transformation;
-vector features: encoded using principal component analysis (PCA); the features of all data in vector form are combined into a feature matrix, which is decomposed by PCA, and a specific number of matrix eigenvectors are selected for the coding transformation according to the hidden embedding dimension;
-scalar features: encoded using a multidimensional-scaling, Gaussian-kernel approach: the features of all data in scalar form are taken as the input of a Gaussian kernel to obtain its covariance matrix, and each column of the matrix is scaled by multidimensional scaling to obtain the code-transformed feature vector;
s2, solving a pre-training model through feature fusion and training;
s3, constructing special extremely unbalanced migration target domain data as a target domain data set;
s4, transferring the pre-training model obtained in the S2 to target field data in the S3 by using a depth transfer method to construct a depth transfer self-adaptive optimization model;
the depth-migration adaptive optimization model constructed in step S4 is:

min_{W'} (1/N_1) ∑_{i=1}^{N_1} loss_S( σ(f'(x_i^S; W')), y_i^S ) + λ · (1/N_2) ∑_{j=1}^{N_2} loss_C( σ(f'(x_j^C; W')), y_j^C )

in the formula, f' is the prediction model to be migrated, containing learnable parameters; W' represents the learnable parameters in the model, including the scheme weights used when each fused feature is obtained; loss_S and loss_C respectively represent the target loss functions of the pre-training stage and the model-migration adaptation stage; λ represents the weight given to the target loss function of the model-migration adaptation stage; (x_i^S, y_i^S) and (x_j^C, y_j^C) respectively represent the training-data features and the ground-truth presentation labels on the original domain data set constructed in S1 and the target domain data set constructed in S3; N_1 and N_2 respectively represent the numbers of training samples on the original domain data set constructed in S1 and the target domain data set constructed in S3;
after the model is optimized, the relevant parameters are saved by a structured method as the adaptive depth-migration optimization model;
s5, using a deep migration adaptive optimization model, performing HLA antigen presentation prediction on the target domain dataset.
2. The method of claim 1, wherein the original domain data set constructed in S1 has a balanced ratio of positive to negative sample numbers.
3. The method according to claim 1, wherein in constructing the original domain data set in S1, a window-sliding method is used to generate negative sequences according to a preset parameter threshold, and the generated sequence segments are screened using a global difference scoring matrix to obtain a non-random negative candidate training set.
4. The method according to claim 1, wherein the features to be fused in S2 are selected as: polypeptide sequence features, upstream and downstream sequence features, and presentation affinity features.
5. The method according to claim 4, wherein in S2:
the polypeptide sequence features are standardized by the following method: for a given polypeptide peptide-chain amino acid sequence, each amino acid is encoded into a learnable latent space using a random matrix, and the sequence is then processed with a long short-term memory recurrent neural network to obtain the polypeptide sequence features; after random-matrix encoding and mapping, sequences are padded to the length of the longest peptide chain in all the data so that the parameters of the encoding and mapping models remain consistent;
the upstream and downstream sequence features are standardized by the following method: the upstream and downstream peptide chains of a given gene are encoded by the one-hot method, the encoded upstream and downstream sequences are cut to a fixed length, and the encoded sequence is fed into a multilayer perceptron network model for transformation to extract the upstream and downstream sequence features;
the presentation affinity features are standardized by scaling so as to ensure the numerical stability of the model training and optimization process.
6. The method of claim 1, wherein the pre-training optimization model constructed in S2 is:

min_W ∑_{n=1}^{N} w_n · loss( σ(f(x_n; W)), y_n )

in the formula, f is the prediction model with learnable parameters; W represents the learnable parameters in the model, including the scheme weights used when each fused feature is obtained; w_n represents the weight given to the loss function of sample n, and N represents the total number of samples; x_n represents the specific input data, y_n is the ground-truth presentation label in the training data, and σ is the sigmoid logistic function, so that the features are not simply summed and the model formula captures potential complex relationships;
after the model is optimized, the relevant parameters are saved by a structured method as the pre-trained model.
7. The method according to claim 1, wherein after the negative candidate data set is produced in S3, the target domain data are constructed according to different strategies; the target domain data set has far more negative samples than positive samples, so as to simulate the real prediction environment in which the number of negative samples far exceeds the number of positive samples.
8. The method of claim 1, wherein, according to the size of the pre-trained model and the data scale, the deep-migration adaptive optimization model selects either global optimization, which optimizes all trainable parameters of the pre-trained model, or selected-layer optimization, which optimizes only the last two layers of the neural network model.
9. The method of claim 1, wherein the optimization models in S2 and S4 are solved by traversing all training data multiple times and optimizing with an optimizer based on stochastic gradient methods to obtain the optimal model parameters, yielding the pre-trained prediction model f and the migration prediction model f'.
10. The method of claim 1, wherein a separate batch of the extremely unbalanced data constructed in S3 is split off for verifying, in S5, the prediction effect of the deep-migration adaptive optimization model on the target domain data.
CN202111204491.8A 2021-10-15 2021-10-15 Method for enhancing HLA antigen presentation prediction system based on deep migration Active CN113762417B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230705

Address after: Room 201, 2nd Floor, Building A-4, Building 16, Shuwu, No. 73 Tanmi Road, Jiangbei New District, Nanjing City, Jiangsu Province, 211899

Patentee after: Nanjing Chengshi Biomedical Technology Co.,Ltd.

Address before: 210000 room 209, floor 2, building D-2, building 16, tree house, No. 73, tanmi Road, Jiangbei new area, Nanjing, Jiangsu

Patentee before: Nanjing Chengshi Biotechnology Co.,Ltd.