CN111429965B - T cell receptor corresponding epitope prediction method based on multiconnector characteristics - Google Patents

Publication number
CN111429965B
Authority
CN
China
Prior art keywords
model
prediction
matrix
decision tree
epitope
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010198109.6A
Other languages
Chinese (zh)
Other versions
CN111429965A (en)
Inventor
王嘉寅
童瑶
杨玲
郑田
刘涛
李敏
张选平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Geneplus-Beijing
Xian Jiaotong University
Original Assignee
Geneplus-Beijing
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Geneplus-Beijing, Xian Jiaotong University
Priority to CN202010198109.6A
Publication of CN111429965A
Application granted
Publication of CN111429965B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a T cell receptor corresponding epitope prediction method based on multiconnector characteristics. The method comprises: splitting a CDR3 beta chain and its corresponding epitope into triplets, i.e. short peptide fragments of length 3, and counting the frequency of each triplet as the initial features; building an initial feature matrix from these initial features and reducing its dimension by principal component analysis to perform feature extraction; given n training samples, training a gradient boosting decision tree model which, for input data x, makes its prediction by linearly combining the decision results of all decision trees; and inputting the feature data into the trained model for prediction, selecting different prediction indexes according to the prediction purpose. The method uses only triplet counts as initial features and, combined with the gradient boosting decision tree model, completes model training in a very short time with higher prediction accuracy.

Description

T cell receptor corresponding epitope prediction method based on multiconnector characteristics
Technical Field
The invention belongs to the technical field of data science with precision medicine as its application background, and particularly relates to a T cell receptor corresponding epitope prediction method based on concatemer characteristics.
Background
Specific binding of a T cell receptor (TCR) to an epitope presented by the major histocompatibility complex (pMHC) activates the immune system, thereby triggering a series of specific immune responses. Immunotherapy builds on this specific immune response: by developing corresponding agents, the immune system is artificially activated so that the body's immune system can again eliminate invaders or cancer cells. Predicting the epitope corresponding to a TCR can therefore provide an important theoretical basis for exploring disease mechanisms, cancer immunotherapy, drug development, vaccine manufacture, and related fields.
Although next-generation sequencing (NGS) technologies provide a huge number of nucleotide and amino acid sequences, few of them are labeled, because labeling is costly and time-consuming. If a relatively reliable prediction model can be trained from the small amount of labeled data currently available, it can be applied to the TCR epitope labeling problem and save a large amount of time and money. In addition, TCR gene segments are produced by a series of non-homologous recombinations that combine variable (V), diversity (D) and joining (J) gene segments at the TCR loci with random nucleotide insertions and/or deletions, so a very large number of different TCRs can be produced, on a scale of $10^{15}$ to $10^{61}$. Moreover, because of cross-reactivity, one TCR can recognize multiple epitopes and one epitope can be recognized by multiple TCRs. It is difficult to find the matching patterns of TCR and pMHC in such data manually or by statistics alone, so studying the specific binding mechanism of TCR and pMHC with machine learning algorithms would be of great significance for immunotherapy.
The TCR may be divided into four complementarity-determining regions (CDRs): CDR1, CDR2, CDR2.5 and CDR3. Specific recognition of the antigen depends mainly on these regions. Among them, the CDR3 region has the highest diversity and mainly binds the peptide chain of the epitope, while CDR1, CDR2 and CDR2.5 mainly bind MHC molecules but may also bind the peptide chain. It has been found that the CDR3 beta chain plays a major role in predicting epitopes, but it is not clear whether physicochemical properties, structural properties, or other factors of the CDR3 beta chain dominate.
Researchers at home and abroad have tried to study the relationship between CDR3 and epitope data, and their approaches fall roughly into two categories. The first category defines a similarity measure between TCR or CDR3 sequences and, after computing the similarity between sequences, classifies them with a simple classifier such as the K-nearest neighbor (K-NN) algorithm. The second category extracts physicochemical characteristics of amino acids from the TCR or CDR3 sequence, or encodes the amino acid sequence with the BLOSUM matrix, and then trains a machine learning model to obtain a prediction model.
However, the prediction performance of both categories is not very good, and the following problems remain. First, the first category requires computing the similarity between every pair of TCR sequences, so the time complexity of the similarity computation is $O(n^2)$ and training is time-consuming. Second, the second category is essentially based on per-amino-acid encoding; since different CDR3 sequences do not necessarily have the same length, alignment is required so that the feature vectors of all TCR sequences have the same dimension. Third, the first category mainly considers the overall similarity of two TCR sequences and the second mainly considers the information of each individual amino acid; neither considers the role that information from adjacent amino acids in the TCR sequence plays in the specific recognition between TCR and epitope.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a T cell receptor corresponding epitope prediction method based on multiconnector characteristics that avoids cumbersome feature extraction, completes model training in a short time, and performs multi-class prediction directly.
The invention adopts the following technical scheme:
A T cell receptor corresponding epitope prediction method based on concatemer characteristics comprises the following steps:
S1, splitting the CDR3 beta chain and the corresponding epitope into triplets (short peptide fragments of length 3), and counting the frequency of each triplet as the initial features;
S2, building an initial feature matrix from the initial features obtained in step S1, and reducing the dimension of the initial feature matrix by principal component analysis to perform feature extraction;
S3, given n training samples, training a gradient boosting decision tree model which, for input data x, makes its prediction by linearly combining the decision results of all decision trees;
S4, inputting the feature data of step S2 into the model trained in step S3 for prediction, and selecting different prediction indexes according to the prediction purpose.
Specifically, step S2 comprises:
S201, denoting the initial feature matrix as $X = \{x_1, x_2, \ldots, x_n\}$ and centering each feature;
S202, letting the projection of the sample point $x_i$ onto the hyperplane in the new space be $W^T x_i$; for all sample points to be well separated, the variance of the projected sample points should be maximized, which determines the optimization objective;
S203, solving the optimization objective with the Lagrange multiplier method, performing eigendecomposition of the covariance matrix $XX^T$, and sorting the resulting eigenvalues; the eigenvectors corresponding to the first k eigenvalues are then taken to form the projection matrix W, and the resulting feature matrix $W^T X$ is a matrix of k rows and n columns.
Further, in step S201, the m-dimensional column vector $x_i$ is:
$$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T, \quad i = 1, 2, \ldots, n$$
where n is the number of training samples and m is the feature dimension.
Further, in step S202, the optimization objective is:
$$\max_{W} \ \operatorname{tr}\left(W^T X X^T W\right) \quad \text{s.t.} \quad W^T W = I$$
where W is the transformation matrix, $W^T$ is the transpose of the transformation matrix, X is the initial feature matrix, and $X^T$ is the transpose of the initial feature matrix.
Further, in step S203, solving the optimization objective gives
$$XX^T W = \lambda W$$
The projection matrix W is:
$$W = (w_1, w_2, \ldots, w_k)$$
where $\lambda$ is an eigenvalue, $w_i$ ($1 \le i \le k$) is a column vector of the projection matrix, and the eigenvalues are sorted as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$.
Specifically, step S3 comprises:
S301, initializing the iteration number m = 0, setting the maximum iteration number to M, and initializing the model $f_0(x)$;
S302, in each model iteration, adding a decision tree to the current model and using the residual $L(y, f_{m-1}(x))$ to estimate the parameter $\Theta_m$;
S303, letting m = m + 1; if m is smaller than the maximum iteration number, returning to step S302; otherwise, stopping training and returning all trained decision trees, which completes the training of the epitope prediction model.
Further, in step S301, the initial model $f_0(x)$ is:
$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$
where N is the number of samples, c is the constant fitted by the initial model, and L is the log-likelihood loss function, defined as:
$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$
where Y is the output variable, X is the input variable, L is the loss function, M is the number of epitope classes, $y_{ij}$ is a binary indicator ($y_{ij} = 1$ if class j is the true class of input instance $x_i$, otherwise $y_{ij} = 0$), and $p_{ij}$ is the probability predicted by the model that input instance $x_i$ belongs to class j.
Further, in step S302, the result of the m-th iteration is:
$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$
where $f_{m-1}(x)$ is the decision model of the (m-1)-th iteration, and the set of all $R_{mi}$,
$$\{(x_i, R_{mi})\}, \quad i \in [1..n]$$
is used to fit a classification and regression tree.
Further, the residual $L(y, f_{m-1}(x))$ is used to estimate the parameter $\Theta_m$; the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:
$$\Theta_m = \arg\min_{\Theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + \beta_m T(x_i; \Theta_m)\big)$$
The negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:
$$R_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$$
where i is the index of the i-th training sample.
Further, in step S303,
$$f_M(x) = \sum_{m=1}^{M} \beta_m T(x; \Theta_m)$$
where $f_M(x)$ is the final ensemble model composed of M decision trees, M is the number of epitope classes, $\beta_m$ is the weight of the m-th decision tree, T is the decision tree, x is the input of the decision tree T, and $\Theta_m$ are the parameters of the decision tree.
Compared with the prior art, the invention has at least the following beneficial effects:
The invention predicts TCR epitopes from combined concatemer features of the TCR sequence: it scans each CDR3 beta sequence residue by residue, splits the polypeptide chain into consecutive short peptides of length 3, and counts the number of occurrences of each triplet. The counts form the initial feature matrix, and the epitope corresponding to each CDR3 beta sequence serves as the class label. According to biological knowledge, there are 20 amino acids in the human body, and triplets of these 20 amino acids admit at most $20^3 = 8000$ distinct arrangements, so the dimension of the initial feature matrix does not exceed 8000. This addresses the complex feature extraction and time-consuming training of existing models: training completes in a short time, and the prediction performance is superior to that of existing models.
Furthermore, principal component analysis is used for feature transformation, and the dimensionality of the features is reduced.
Further, the feature matrix is input into a gradient boosting decision tree (GBDT) for training, the optimal model parameters are obtained by grid search, and multiple decision trees are finally obtained.
Furthermore, the test data are encoded in the same way, the test feature matrix is input into the model, and the sum of the predictions of all decision trees is taken as the final prediction.
In conclusion, the method uses only triplet counts as initial features and, combined with the gradient boosting decision tree model, completes model training in a very short time with higher prediction accuracy.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
Fig. 1 is a feature matrix obtained after feature selection is performed on the TCR data in the Dash et al. paper;
FIG. 2 is a schematic flow chart of the present invention;
FIG. 3 shows comparison results of different models in a Dash dataset;
FIG. 4 shows the results of a multi-classification ROC curve performed in the Dash dataset.
Detailed Description
The invention provides a TCR epitope prediction method based on adjacent amino acid information in the TCR sequence, named SETE (Sequence-based Ensemble learning approach for TCR Epitope binding prediction). The training set consists of CDR3 beta sequences together with the corresponding polypeptide chains they specifically recognize, and the test set consists of CDR3 beta sequences.
The method is based on the following general consensus in academia:
1. The CDR3 region of the TCR sequence interacts clearly with MHC-presented polypeptide chains, and the β chain of this region contributes significantly to peptide recognition;
2. The number of amino acids constituting proteins in humans is 20.
Referring to fig. 2, the method for predicting corresponding epitopes of T cell receptors based on the concatemer characteristics of the present invention includes the following steps:
S1, extracting initial features
Referring to fig. 1, the input data are CDR3 β chains and their corresponding epitopes, and the epitopes are the classes predicted by the model. Since the input amino acid sequence cannot be used directly as a feature, it is split into triplets (short peptide fragments of length 3), and the frequency of each triplet is counted as the initial feature. After feature selection on the initial features, a certain similarity can be found among the TCR sequence features corresponding to the different epitope classes. Fig. 1 shows the feature matrix obtained after feature selection on the TCR data of the Dash et al. paper: the x axis represents the features and the y axis represents the samples; the rightmost color bar indicates the epitope class corresponding to each TCR; darker colors indicate a greater number of triplets.
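The triplet counting of step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the two example CDR3 β sequences are hypothetical, and the matrix is stored with one row per sample, whereas the description writes X with samples as columns.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

def triplet_counts(sequence):
    """Slide a window of width 3 along the sequence, one residue at a time."""
    return Counter(sequence[i:i + 3] for i in range(len(sequence) - 2))

def feature_matrix(sequences):
    """Return (matrix, vocabulary): one row per sequence, one column per observed triplet."""
    counts = [triplet_counts(s) for s in sequences]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return matrix, vocabulary

# Hypothetical CDR3 beta sequences, used only to exercise the code.
cdr3_examples = ["CASSLGQAYEQYF", "CASSIRSSYEQYF"]
matrix, vocabulary = feature_matrix(cdr3_examples)

# Upper bound on the feature dimension: every ordered triple of the 20 amino acids.
max_dimension = len(list(product(AMINO_ACIDS, repeat=3)))
```

Only triplets that actually occur are given a column here, so the dimension stays well below the 8000 upper bound for small data sets.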
S2, feature extraction
First, since there are 20 amino acids in total, a short chain of 3 amino acids has at most $20^3$ combinations, so the features can reach 8000 dimensions and feature screening is needed to reduce the dimension. Second, because TCR sequences are similar to one another, there may be redundant information among the triplets of TCR sequences of the same class. Principal component analysis is therefore used to reduce the dimension of the data, specifically as follows:
S201, denoting the initial feature matrix as $X = \{x_1, x_2, \ldots, x_n\}$ and centering each feature;
the m-dimensional column vector $x_i$ is:
$$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T, \quad i = 1, 2, \ldots, n$$
where n is the number of training samples and m is the feature dimension;
S202, letting the projection of the sample point $x_i$ onto the hyperplane in the new space be $W^T x_i$; for all sample points to be well separated, the variance of the projected sample points should be maximized, and the optimization objective is determined as:
$$\max_{W} \ \operatorname{tr}\left(W^T X X^T W\right) \quad \text{s.t.} \quad W^T W = I$$
where W is the transformation matrix, $W^T$ is the transpose of the transformation matrix, X is the initial feature matrix, and $X^T$ is the transpose of the initial feature matrix.
S203, solving the optimization objective with the Lagrange multiplier method gives $XX^T W = \lambda W$; eigendecomposition is performed on the covariance matrix $XX^T$ and the resulting eigenvalues are sorted: $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$; the eigenvectors corresponding to the first k eigenvalues are then taken to form the projection matrix W, and the resulting feature matrix $W^T X$ is a matrix with k rows and n columns;
the projection matrix W is:
$$W = (w_1, w_2, \ldots, w_k)$$
where $\lambda$ is an eigenvalue and $w_i$ ($1 \le i \le k$) is a column vector of the projection matrix.
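Steps S201 to S203 can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions: the function name `pca_project` and the random count matrix are hypothetical, X is stored as m features by n samples as in the description, and $XX^T$ is used without the 1/n factor, which does not change the eigenvectors.

```python
import numpy as np

def pca_project(X, k):
    """X has shape (m, n): m features, n samples stored as columns."""
    # S201: centre each feature (each row) across the n samples.
    X_centered = X - X.mean(axis=1, keepdims=True)
    # S203: eigendecomposition of the symmetric matrix X X^T.
    eigenvalues, eigenvectors = np.linalg.eigh(X_centered @ X_centered.T)
    order = np.argsort(eigenvalues)[::-1]      # sort eigenvalues in descending order
    W = eigenvectors[:, order[:k]]             # projection matrix, shape (m, k)
    return W, W.T @ X_centered                 # projected data, shape (k, n)

# Hypothetical count data: 6 triplet features, 10 samples.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 10)).astype(float)
W, Z = pca_project(X, k=2)
```

`np.linalg.eigh` is chosen because $XX^T$ is symmetric; it returns real eigenvalues in ascending order, hence the explicit reordering.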
S3, training epitope prediction model
A new prediction model based on a gradient boosting decision tree is provided. Given n training samples, for input data x the gradient boosting decision tree model makes its prediction by linearly combining the decision results of all decision trees, specifically as follows:
the n training samples are:
$$\{(x_1, y_1), \ldots, (x_n, y_n)\}$$
where
Figure GDA0002496493320000093
i = 1, 2, ..., n;
S301, initializing the model
Initialize the iteration number m = 0, set the maximum iteration number to M, and initialize the model $f_0(x)$ as:
$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$
where N is the number of samples, c is the constant fitted by the initial model, and L is the log-likelihood loss function, defined as:
$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$
where Y is the output variable, X is the input variable, L is the loss function, M is the number of epitope classes, $y_{ij}$ is a binary indicator ($y_{ij} = 1$ if class j is the true class of input instance $x_i$, otherwise $y_{ij} = 0$), and $p_{ij}$ is the probability predicted by the model that input instance $x_i$ belongs to class j.
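As a small worked example of the loss, the sketch below evaluates the log-likelihood loss from the symbols defined above, assuming the usual multi-class form averaged over the N samples, i.e. minus the mean of $y_{ij} \log p_{ij}$ summed over classes. The function name and the toy probabilities are hypothetical.

```python
import math

def log_loss(labels, probabilities, num_classes):
    """labels[i] is the true class index of sample i; probabilities[i][j] is p_ij."""
    total = 0.0
    for i, label in enumerate(labels):
        for j in range(num_classes):
            y_ij = 1 if label == j else 0   # binary indicator of the true class
            if y_ij:                        # terms with y_ij = 0 contribute nothing
                total += y_ij * math.log(probabilities[i][j])
    return -total / len(labels)

# Two samples, two epitope classes (toy numbers).
labels = [0, 1]
probabilities = [[0.5, 0.5], [0.25, 0.75]]
loss = log_loss(labels, probabilities, num_classes=2)
```

Because $y_{ij}$ is one-hot, only the predicted probability of the true class enters the loss for each sample.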
S302, model iteration
Each model iteration adds a decision tree to the current model, and the result of the m-th iteration is:
$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$
where $f_{m-1}(x)$ is the decision model of the (m-1)-th iteration, and the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:
$$\Theta_m = \arg\min_{\Theta_m} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + \beta_m T(x_i; \Theta_m)\big)$$
Since the basis functions are linearly additive, the goal is to use the residual $L(y, f_{m-1}(x))$ to estimate the parameter $\Theta_m$.
For this purpose, the negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:
$$R_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$$
where i is the index of the i-th training sample.
The set of all $R_{mi}$,
$$\{(x_i, R_{mi})\}, \quad i \in [1..n]$$
is used to fit a classification and regression tree (CART) and to solve for the parameter $\Theta_m$.
S303, assigning m = m + 1; if m is smaller than the maximum iteration number, returning to step S302; otherwise, stopping training and returning all trained decision trees:
$$f_M(x) = \sum_{m=1}^{M} \beta_m T(x; \Theta_m)$$
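The loop of steps S301 to S303 can be illustrated with a deliberately simplified sketch. The assumptions are substantial and differ from the description: the task is binary rather than multi-class, each tree is a single-split regression stump rather than a full CART, and the weight $\beta_m$ is a fixed constant `beta`. All function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(X, residuals):
    """Least-squares regression tree with a single split, fitted to the residuals."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature])[:-1]:
            left = X[:, feature] <= threshold
            left_value = residuals[left].mean()
            right_value = residuals[~left].mean()
            error = ((residuals[left] - left_value) ** 2).sum() \
                + ((residuals[~left] - right_value) ** 2).sum()
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_value, right_value)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]

def predict_stump(stump, X):
    feature, threshold, left_value, right_value = stump
    return np.where(X[:, feature] <= threshold, left_value, right_value)

def fit_gbdt(X, y, max_iterations=20, beta=0.5):
    # S301: the constant minimising the binary log loss is the log-odds of the positives.
    p0 = y.mean()
    f0 = np.log(p0 / (1.0 - p0))
    f = np.full(len(y), f0)
    trees, m = [], 0
    while m < max_iterations:
        residuals = y - sigmoid(f)          # S302: negative gradient of the log loss
        stump = fit_stump(X, residuals)     # fit a tree to the pairs (x_i, R_mi)
        f = f + beta * predict_stump(stump, X)
        trees.append(stump)
        m += 1                              # S303: next iteration or stop
    return f0, beta, trees

def predict_proba(model, X):
    f0, beta, trees = model
    f = np.full(X.shape[0], f0)
    for stump in trees:                     # linear combination of all trees
        f = f + beta * predict_stump(stump, X)
    return sigmoid(f)

# Toy data: two features, four samples, binary labels.
X_toy = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])
model = fit_gbdt(X_toy, y_toy)
probabilities = predict_proba(model, X_toy)
```

In practice a library implementation with grid-searched hyperparameters, as the description mentions, would replace this hand-written loop; the sketch only makes the residual-fitting step concrete.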
S4, epitope prediction
Initial features are extracted from the test data by the same method and transformed by the same feature extraction; the resulting data are input into the trained model for prediction, and different prediction indexes are selected according to the prediction purpose.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a T cell receptor corresponding epitope prediction method based on a multiconnector characteristic, and solves the problems that the existing algorithm is long in training time and the prediction result is not ideal.
Because no existing model can directly handle the multi-class TCR epitope prediction problem, the binary classification performance is tested first in order to verify the effectiveness of the invention. Since the existing method TCRGP uses the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) as its evaluation indexes, AUC is used to evaluate the prediction performance of the invention.
In addition, the running times of the two models on the same data set are compared. A multi-class prediction test is then carried out; since the number of samples per class is unbalanced and ROC is little affected by data imbalance, ROC and AUC are still used to measure the prediction performance of the model. The following quantities are used: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Define false positive rate FPR = FP/(FP + TN).
A true positive ratio TPR = TP/(TP + FN) is defined.
The ROC curve is plotted from the values of FPR and TPR at different thresholds, and AUC is the area under the line of the ROC curve.
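The definitions above translate directly into code. The sketch below sweeps the decision threshold over the observed scores, computes FPR and TPR at each threshold, and integrates the curve with the trapezoidal rule; the function names and the toy labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        fn = positives - tp
        tn = negatives - fp
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: 3 positive and 3 negative samples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

For these toy values, 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.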
Tests were performed on the public data set VDJdb. After screening, 22 classes of epitope data were selected from VDJdb. Because existing models can only handle binary classification tasks, binary classification tests are carried out first for comparison with other models. In each binary classification task, all positive examples are used, and an equal number of TCRs is randomly sampled from the other classes as negative examples. The classification results are shown in Table 1.
Table 1: Comparison of the binary classification results of SETE and TCRGP (* denotes FRDYVDRFYKTLRAEQASBE)
Figure GDA0002496493320000121
Figure GDA0002496493320000131
The above table shows that, in the binary classification task, the method of the present invention achieves prediction performance comparable to the existing method TCRGP while greatly reducing training time; the reduction is especially evident on data sets with a large amount of data.
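The construction of each binary task described above (all positives plus an equal number of randomly sampled negatives) can be sketched as follows. The function name, the seed parameter, and the placeholder sequences and epitope names are hypothetical.

```python
import random

def binary_task(samples, target_epitope, seed=0):
    """samples is a list of (cdr3, epitope) pairs; returns (sequences, labels)."""
    positives = [cdr3 for cdr3, epitope in samples if epitope == target_epitope]
    others = [cdr3 for cdr3, epitope in samples if epitope != target_epitope]
    rng = random.Random(seed)
    # Draw as many negatives as there are positives (or all of them if fewer exist).
    negatives = rng.sample(others, min(len(positives), len(others)))
    sequences = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return sequences, labels

# Placeholder data: sequence and epitope names are illustrative only.
samples = [
    ("SEQ_A1", "epitope_A"), ("SEQ_A2", "epitope_A"),
    ("SEQ_B1", "epitope_B"), ("SEQ_B2", "epitope_B"), ("SEQ_C1", "epitope_C"),
]
sequences, labels = binary_task(samples, "epitope_A")
```

Fixing the seed keeps the sampled negatives reproducible between runs, which matters when comparing running times of two models on the same task.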
For the multi-class task, a series of experiments was also carried out to verify effectiveness. The ROC curve is used as the evaluation index, and a OneVsRest strategy is used to draw the multi-class ROC curve: one classifier is trained for each class, each classifier treats the TCR sequences of its own class as positive examples and the TCR sequences of all other classes as negative examples, and finally the outputs of the ten classifiers are combined by voting to obtain the final classification result. The results obtained with five-fold cross-validation are shown in Table 2.
Table 2: multi-classification prediction results of SETE on VDJdb dataset
Figure GDA0002496493320000132
Figure GDA0002496493320000141
To further validate the ability of the present invention to predict the epitope corresponding to a TCR, tests were performed on the data set published in the Dash et al. paper, which contains 3 classes of human and 7 classes of mouse epitope data.
Since the model is better suited to multi-class tasks, a multi-class test is performed on this data set first, and the model is evaluated by the ROC curve and AUC. The multi-class results on the Dash data set are shown in Table 3.
Table 3: SETE multiple classification results in Dash dataset
Figure GDA0002496493320000142
Figure GDA0002496493320000151
The above table shows that SETE performs well overall on the multi-class problem, while the prediction on individual epitope genes, such as pp65, is poor, which may be related to the small amount of data for epitope genes of this type. The comparison of SETE multi-class results with TCRGP and TCRdist is shown in FIG. 3, where the x axis represents the different prediction models and the y axis represents the area under the ROC curve of each model. In addition, multi-class ROC curves were plotted separately for the human and mouse data, and the results are shown in FIG. 4, where the x axis represents the false positive rate and the y axis represents the true positive rate.
Binary classification tests were also performed on the Dash data set, and the results are shown in Table 4.
Table 4: Comparison of the binary classification results of SETE and TCRGP on the Dash dataset
Figure GDA0002496493320000152
Figure GDA0002496493320000161
As with the previous results, SETE completes training in a very short time, with prediction accuracy superior to the TCRGP model.
In conclusion, compared with the existing method TCRGP, the method completes model training in a shorter time and performs better in the binary classification tasks. In addition, the method can be applied directly to multi-class tasks with high prediction accuracy.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (8)

1. A T cell receptor corresponding epitope prediction method based on a concatemer characteristic is characterized by comprising the following steps:
s1, resolving a CDR3 beta chain and a corresponding epitope into bases with the length of 3, and counting the frequency of each triplet as an initial characteristic;
S2, establishing an initial feature matrix from the initial features obtained in step S1, reducing the dimensionality of the initial feature matrix by principal component analysis, and extracting features, specifically comprising:
S201, denoting the initial feature matrix as $X = \{x_1, x_2, \ldots, x_n\}$ and centering each column of features, wherein $n$ is the number of samples;
S202, letting the projection of the sample point $x_i$ onto the hyperplane in the new space be $W^T x_i$; for the projections of all sample points to be separated as far as possible, the variance of the projected sample points is maximized, which determines the optimization objective;
S203, solving the optimization objective by the Lagrange multiplier method, performing eigendecomposition of the covariance matrix $XX^T$, and sorting the resulting eigenvalues; then taking the eigenvectors corresponding to the first $k$ eigenvalues to form the projection matrix $W$, so that the final feature matrix $W^T X$ is a matrix with $k$ rows and $n$ columns;
S3, taking the prediction data $x$ and the $n$ training samples as input, training a gradient boosting decision tree model, and making a prediction by linearly combining the decision results of the individual decision trees through the gradient boosting decision tree model, specifically comprising:
S301, initializing the iteration number $m = 0$ and initializing the model $f_0(x)$;
S302, in each model iteration, adding a decision tree to the current model and using the residual $L(y, f_{m-1}(x))$ to estimate the parameter $\Theta_m$;
S303, letting $m = m + 1$; if $m$ is smaller than the maximum number of iterations, returning to step S302; otherwise, stopping training and returning all trained decision trees, which completes the training of the epitope prediction model;
S4, inputting the feature data of step S2 into the model trained in step S3 for prediction, and selecting different prediction metrics according to different prediction purposes.
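As an illustration of the triplet counting in step S1, the following is a minimal Python sketch, not the patented implementation. It assumes that triplets are taken as overlapping windows of length 3 over the amino-acid sequence; the function names `kmer_counts` and `feature_vector` and the two example CDR3 sequences are hypothetical and serve only to show the shape of the initial feature matrix.

```python
from collections import Counter
from typing import List

K = 3  # triplet length used in step S1


def kmer_counts(sequence: str, k: int = K) -> Counter:
    """Count every overlapping length-k substring of the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def feature_vector(sequence: str, vocabulary: List[str], k: int = K) -> List[int]:
    """Map a sequence to triplet counts over a fixed, ordered vocabulary."""
    counts = kmer_counts(sequence, k)
    return [counts[kmer] for kmer in vocabulary]


# Hypothetical CDR3 beta-chain sequences, for illustration only.
cdr3_examples = ["CASSLGQAYEQYF", "CASSIRSSYEQYF"]

# The vocabulary is the sorted set of triplets seen in the examples.
vocabulary = sorted({kmer for seq in cdr3_examples for kmer in kmer_counts(seq)})

# One row per sample, one column per triplet: the initial feature matrix.
initial_features = [feature_vector(seq, vocabulary) for seq in cdr3_examples]
```

Each row of `initial_features` can then be passed to the dimensionality reduction of step S2.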
2. The method for predicting the epitope corresponding to a T cell receptor based on concatemer characteristics of claim 1, wherein in step S201, the m-dimensional column vector $x_i$ is:
Figure FDA0003993927790000021
wherein n is the number of training samples, and m is the feature dimension.
3. The method for predicting T cell receptor-corresponding epitopes based on concatemer characteristics according to claim 1, wherein in step S202, the optimization objective is:
Figure FDA0003993927790000022
wherein $W$ is the projection matrix, $W^T$ is the transpose of the projection matrix, $X$ is the initial feature matrix, and $X^T$ is the transpose of the initial feature matrix.
4. The T cell receptor corresponding epitope prediction method based on concatemer characteristics of claim 1, wherein in step S203, solving the optimization objective gives:
$XX^T W = \lambda W$
The projection matrix $W$ is:
$W = (w_1, w_2, \ldots, w_k)$
wherein $\lambda$ is an eigenvalue, $w_i$ is a column vector of the projection matrix, $1 \le i \le k$, and the eigenvalues are ordered as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$.
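The step from variance maximization to the eigenvalue equation can be sketched as follows. This derivation assumes that the optimization objective takes the standard principal component analysis form with an orthonormality constraint $W^T W = I$ (the objective itself appears only as an image in the source), and it introduces a diagonal multiplier matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ that is not named in the text.

```latex
\begin{aligned}
&\max_{W}\ \operatorname{tr}\!\left(W^{T} X X^{T} W\right)
  \quad \text{s.t.} \quad W^{T} W = I \\
&\mathcal{L}(W, \Lambda)
  = \operatorname{tr}\!\left(W^{T} X X^{T} W\right)
  - \operatorname{tr}\!\left(\Lambda \left(W^{T} W - I\right)\right) \\
&\frac{\partial \mathcal{L}}{\partial W}
  = 2\, X X^{T} W - 2\, W \Lambda = 0
  \;\Longrightarrow\; X X^{T} w_i = \lambda_i w_i, \quad 1 \le i \le k \\
&\operatorname{tr}\!\left(W^{T} X X^{T} W\right) = \sum_{i=1}^{k} \lambda_i
\end{aligned}
```

Under these assumptions the objective equals the sum of the selected eigenvalues, which is why the eigenvectors of the $k$ largest eigenvalues are the ones taken to form $W$.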
5. The method for predicting T-cell receptor-corresponding epitopes according to claim 1, wherein in step S301, the initialized model $f_0(x)$ is:
Figure FDA0003993927790000023
where N is the number of samples, c is the constant of the initial model fit, and L is the log-likelihood loss function defined as:
Figure FDA0003993927790000024
wherein $Y$ is the output variable, $X$ is the input variable, $L$ is the loss function, $M$ is the number of epitope classes, and $y_{ij}$ is a binary indicator: if class $j$ is the true class of the input instance $x_i$, then $y_{ij} = 1$, otherwise $y_{ij} = 0$; $p_{ij}$ is the probability, predicted by the model, that the input instance $x_i$ belongs to class $j$.
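The loss described above can be illustrated with a short Python sketch. Because the formula appears only as an image in the source, the sketch assumes the common averaged multi-class form $-(1/N)\sum_i \sum_j y_{ij} \log p_{ij}$; the function name `log_loss` and the clipping constant `eps` are assumptions added for numerical safety and are not taken from the patent.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Averaged multi-class log loss: -(1/N) * sum_i sum_j y_ij * log(p_ij).

    y_true: N rows of one-hot indicators y_ij (1 for the true class, else 0).
    y_prob: N rows of predicted class probabilities p_ij.
    """
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_prob):
        for y_ij, p_ij in zip(y_row, p_row):
            if y_ij:
                # Clip the probability to avoid log(0).
                total += math.log(min(max(p_ij, eps), 1.0))
    return -total / n


# Two samples, three epitope classes; each true class gets probability 0.5.
example_loss = log_loss(
    [[1, 0, 0], [0, 1, 0]],
    [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
)
```

Only the probability assigned to the true class of each sample contributes to the sum, so confident correct predictions drive the loss toward zero.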
6. The method for predicting the epitope corresponding to a T-cell receptor based on the concatemer characteristics of claim 1, wherein in step S302, the result of the mth iteration is:
$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$
wherein $f_{m-1}(x)$ is the decision model of the $(m-1)$-th iteration, and the set of all $R_{mi}$
Figure FDA0003993927790000031
is used to fit a classification and regression decision tree.
7. The method for predicting the epitope corresponding to a T cell receptor based on concatemer characteristics of claim 6, wherein, when the residual $L(y, f_{m-1}(x))$ is used to estimate the parameter $\Theta_m$, the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:
Figure FDA0003993927790000032
The negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:
Figure FDA0003993927790000033
where $i$ is the index of the training sample.
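To make the negative-gradient residual concrete, the following Python sketch works through the binary case. It assumes a label $y \in \{0, 1\}$ and a raw model score $f$ interpreted as log-odds, for which the negative gradient of the log-likelihood loss reduces to $y - \sigma(f)$; the multi-class formula of the source is an image and is not reproduced here. All function names are hypothetical, and the finite-difference helper exists only to check the closed form numerically.

```python
import math


def sigmoid(f):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-f))


def binary_log_loss(y, f):
    """Log-likelihood loss for label y in {0, 1} and raw score f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pseudo_residual(y, f):
    """Negative gradient of binary_log_loss with respect to f."""
    return y - sigmoid(f)


def numeric_negative_gradient(y, f, h=1e-6):
    """Central finite-difference estimate of the negative gradient."""
    return -(binary_log_loss(y, f + h) - binary_log_loss(y, f - h)) / (2 * h)
```

The residual is large when the current model is confidently wrong and shrinks toward zero as the predicted probability approaches the label, which is what the next decision tree is fitted to.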
8. The method for predicting T-cell receptor-corresponding epitopes according to claim 1, wherein, in step S303,
Figure FDA0003993927790000034
wherein $f_M(x)$ is the finally obtained ensemble model consisting of $M$ decision trees, $M$ is the number of epitope classes, $\beta_m$ is the weight of the $m$-th decision tree, $T$ is a decision tree, $x$ is the input of the decision tree $T$, and $\Theta_m$ are the parameters of the decision tree.
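The iterative construction of the ensemble in steps S301 to S303 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: labels are binary, each decision tree is a one-split regression stump fitted by least squares to the negative-gradient residuals, and the tree weight is a constant `beta` rather than a per-tree $\beta_m$. All function names are hypothetical.

```python
import math


def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))


def fit_stump(X, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:
        # No valid split exists: fall back to a constant prediction.
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def train_boosted_model(X, y, n_trees=20, beta=0.5):
    """Build f_M(x) = f_0 + sum_m beta * T(x; Theta_m) for binary labels."""
    # f_0: the constant log-odds that minimizes the log-likelihood loss.
    p = min(max(sum(y) / len(y), 1e-6), 1.0 - 1e-6)
    f0 = math.log(p / (1.0 - p))
    scores = [f0] * len(X)
    trees = []
    for _ in range(n_trees):
        # Negative gradient of the loss at the current model.
        residuals = [yi - sigmoid(fi) for yi, fi in zip(y, scores)]
        stump = fit_stump(X, residuals)
        trees.append(stump)
        scores = [fi + beta * predict_stump(stump, row)
                  for fi, row in zip(scores, X)]
    return f0, beta, trees


def predict_proba(model, row):
    """Probability of the positive class under the additive model."""
    f0, beta, trees = model
    return sigmoid(f0 + sum(beta * predict_stump(t, row) for t in trees))


# Toy one-feature data, for illustration only.
toy_X = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
toy_model = train_boosted_model(toy_X, toy_y)
```

Calling `predict_proba(toy_model, [3.0])` evaluates the linear combination of all trained trees for a new input, which mirrors how the trained model is applied in step S4.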

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010198109.6A CN111429965B (en) 2020-03-19 2020-03-19 T cell receptor corresponding epitope prediction method based on multiconnector characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010198109.6A CN111429965B (en) 2020-03-19 2020-03-19 T cell receptor corresponding epitope prediction method based on multiconnector characteristics

Publications (2)

Publication Number Publication Date
CN111429965A (en) 2020-07-17
CN111429965B (en) 2023-04-07

Family

ID=71548075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010198109.6A Active CN111429965B (en) 2020-03-19 2020-03-19 T cell receptor corresponding epitope prediction method based on multiconnector characteristics

Country Status (1)

Country Link
CN (1) CN111429965B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112086129B (en) * 2020-09-23 2021-04-06 深圳吉因加医学检验实验室 Method and system for predicting cfDNA of tumor tissue
CN114203254B (en) * 2021-12-02 2023-05-23 杭州艾沐蒽生物科技有限公司 Method for analyzing immune characteristic related TCR based on artificial intelligence
CN114360644A (en) * 2021-12-30 2022-04-15 山东师范大学 Method and system for predicting combination of T cell receptor and epitope

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107636152A (en) * 2015-03-16 2018-01-26 马克思-德布鲁克-分子医学中心亥姆霍兹联合会 Method for detecting novel immunogenic T cell epitopes and isolating neoantigen-specific T cell receptors by means of MHC cell libraries
CN109682978A (en) * 2017-11-30 2019-04-26 丁平 Tumor mutation peptide-MHC affinity prediction method and application thereof

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050026215A1 (en) * 2003-07-17 2005-02-03 Predki Paul F. Method for the prediction of an epitope
WO2014180490A1 (en) * 2013-05-10 2014-11-13 Biontech Ag Predicting immunogenicity of t cell epitopes
WO2016128060A1 (en) * 2015-02-12 2016-08-18 Biontech Ag Predicting t cell epitopes useful for vaccination
US20200182884A1 (en) * 2016-06-27 2020-06-11 Juno Therapeutics, Inc. Method of identifying peptide epitopes, molecules that bind such epitopes and related uses
CN107341363B (en) * 2017-06-29 2020-09-22 河北省科学院应用数学研究所 Prediction method of protein epitope
JP2019179356A (en) * 2018-03-30 2019-10-17 株式会社エムティーアイ Epitope prediction method and epitope prediction system

Also Published As

Publication number Publication date
CN111429965A (en) 2020-07-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant