CN111429965B

CN111429965B - T cell receptor corresponding epitope prediction method based on multiconnector characteristics

Info

Publication number: CN111429965B
Application number: CN202010198109.6A
Authority: CN
Inventors: 王嘉寅; 童瑶; 杨玲; 郑田; 刘涛; 李敏; 张选平
Original assignee: Geneplus-Beijing; Xian Jiaotong University
Current assignee: Geneplus-Beijing; Xian Jiaotong University
Priority date: 2020-03-19
Filing date: 2020-03-19
Publication date: 2023-04-07
Anticipated expiration: 2040-03-19
Also published as: CN111429965A

Abstract

The invention discloses a T cell receptor corresponding epitope prediction method based on multiconnector characteristics. The method comprises: parsing the CDR3 β chain sequences, each paired with its corresponding epitope, into consecutive short peptides (triplets) of length 3 and counting the frequency of each triplet as the initial features; building an initial feature matrix from these features and reducing its dimensionality by principal component analysis to extract features; given n training samples, training a gradient boosting decision tree model which, for input data x, makes a prediction by linearly combining the decision results of all decision trees; and inputting the feature data into the trained model for prediction, selecting different prediction metrics according to the prediction purpose. Because the method uses only triplet counts as initial features and combines them with a gradient boosting decision tree model, the model can be trained in a very short time and with high prediction accuracy.

Description

T cell receptor corresponding epitope prediction method based on multiconnector characteristics

Technical Field

The invention belongs to the technical field of data science applied to precision medicine, and in particular relates to a T cell receptor corresponding epitope prediction method based on concatemer characteristics.

Background

Specific binding between a T cell receptor (TCR) and an epitope presented by the major histocompatibility complex (pMHC) activates the immune system and triggers a series of specific immune responses. Immunotherapy builds on this specific immunity: by developing suitable agents, the immune system is artificially activated so that it can again eliminate invaders or cancer cells in the body. Predicting the epitope corresponding to a TCR can therefore provide an important theoretical basis for research on disease mechanisms, cancer immunotherapy, drug development, vaccine manufacture and related fields.

Although next-generation sequencing (NGS) technologies provide a huge number of nucleotide and amino acid sequences, few of them are labeled, because labeling is costly and time-consuming. If a reasonably reliable prediction model can be trained from the small amount of labeled data currently available, it can be applied to the TCR epitope labeling problem and save considerable time and cost. In addition, TCR genes are produced by a series of non-homologous recombination events that combine variable (V), diversity (D) and joining (J) gene segments at the TCR loci, with random nucleotide insertions and/or deletions, so a very large number of different TCRs can be generated, on a scale of $10^{15}$ to $10^{61}$. Moreover, because of cross-reactivity, one TCR can recognize multiple epitopes and one epitope can be recognized by multiple TCRs. It is difficult to find the matching patterns of TCR and pMHC in such data manually or by simple statistics, so studying the specific binding mechanism of TCR and pMHC with machine learning algorithms would be of great significance for immunotherapy.

The TCR contains four complementarity-determining regions (CDRs): CDR1, CDR2, CDR2.5 and CDR3, and the specific recognition of the antigen depends mainly on these regions. Among them, the CDR3 region has the highest diversity and mainly binds the peptide of the epitope, whereas CDR1, CDR2 and CDR2.5 mainly bind the MHC molecule but may also contact the peptide. The CDR3 β chain has been found to play a major role in predicting epitopes, but it is not yet clear whether physicochemical properties, structural properties or other factors of the CDR3 β chain are dominant.

Researchers worldwide have studied the relationship between CDR3 and epitope data, and existing approaches fall roughly into two categories. The first category defines a similarity measure between TCR or CDR3 sequences and, after computing the similarity between sequences, classifies them with a simple classifier such as the K-nearest neighbor (K-NN) algorithm. The second category extracts physicochemical features of the amino acids in the TCR or CDR3 sequence, or encodes the amino acid sequence using the BLOSUM matrix, and then trains a machine learning model to obtain a prediction model.

However, the prediction performance of both categories is unsatisfactory, mainly for the following reasons. First, methods of the first category must compute the similarity between every pair of TCR sequences, so the time complexity of the similarity computation is $O(n^2)$ and training is time-consuming. Second, methods of the second category are essentially based on per-residue amino acid encoding; since different CDR3 sequences do not necessarily have equal length, alignment is required so that the feature vectors of all TCR sequences have the same dimension. Third, the first category mainly considers the overall similarity of two TCR sequences and the second mainly considers the information of each individual amino acid; neither considers the role that information from adjacent amino acids in the TCR sequence plays in the specific recognition between TCR and epitope.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the above defects of the prior art, a T cell receptor corresponding epitope prediction method based on multiconnector characteristics that simplifies feature extraction, shortens model training so that it completes in a short time, and supports direct multi-class prediction.

The invention adopts the following technical scheme:

A T cell receptor corresponding epitope prediction method based on concatemer characteristics comprises the following steps:

S1, parsing the CDR3 β chain sequences, each paired with its corresponding epitope, into consecutive short peptides (triplets) of length 3, and counting the frequency of each triplet as the initial features;

S2, building an initial feature matrix from the initial features obtained in step S1, and reducing the dimensionality of the initial feature matrix by principal component analysis to extract features;

S3, given n training samples, training a gradient boosting decision tree model which, for input data x, makes a prediction by linearly combining the decision results of all decision trees;

S4, inputting the feature data of step S2 into the model trained in step S3 for prediction, and selecting different prediction metrics according to the prediction purpose.

Specifically, step S2 comprises:

S201, denoting the initial feature matrix as $X = \{x_1, x_2, \dots, x_n\}$ and centering each feature;

S202, letting the projection of a sample point $x_i$ onto the hyperplane in the new space be $W^T x_i$; for the projections of all sample points to be separated as much as possible, the variance of the projected sample points is maximized, which determines the optimization objective;

S203, solving the optimization objective by the Lagrange multiplier method, performing eigendecomposition of the covariance matrix $XX^T$, and sorting the resulting eigenvalues; the eigenvectors corresponding to the k largest eigenvalues are then taken to form the projection matrix $W$, and the resulting feature matrix $W^T X$ is a matrix of k rows and n columns.

Further, in step S201, each $x_i$ is an m-dimensional column vector:

$$x_i = (x_{i1}, x_{i2}, \dots, x_{im})^T, \quad i = 1, 2, \dots, n$$

wherein n is the number of training samples and m is the feature dimension.

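Further, in step S202, the optimization objective is:

$$\max_{W} \ \operatorname{tr}\left(W^T X X^T W\right) \quad \text{s.t.} \quad W^T W = I$$

where $W$ is the transformation matrix, $W^T$ is the transpose of the transformation matrix, $X$ is the initial feature matrix, $X^T$ is the transpose of the initial feature matrix, and $I$ is the identity matrix.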

Further, in step S203, solving the optimization objective gives

$$XX^T W = \lambda W$$

The projection matrix $W$ is:

$$W = (w_1, w_2, \dots, w_k)$$

wherein $\lambda$ is an eigenvalue, $w_i$ is a column vector of the projection matrix, $1 \le i \le k$, and the eigenvalues are sorted as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$.

Specifically, step S3 comprises:

S301, initializing the iteration counter m = 0, setting the maximum number of iterations to M, and initializing the model $f_0(x)$;

S302, in each iteration, adding one decision tree to the current model, and using the residual of the loss $L(y, f_{m-1}(x))$ to estimate the parameter $\Theta_m$;

S303, letting m = m + 1; if m is smaller than the maximum number of iterations, returning to step S302; otherwise, stopping training and returning all trained decision trees, which completes the training of the epitope prediction model.

Further, in step S301, the initial model $f_0(x)$ is:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where N is the number of samples, c is the constant fitted by the initial model, and L is the log-likelihood loss function defined as:

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$

where Y is the output variable, X is the input variable, L is the loss function, M is the number of epitope classes, $y_{ij}$ is a binary indicator with $y_{ij} = 1$ if class j is the true class of input instance $x_i$ and $y_{ij} = 0$ otherwise, and $p_{ij}$ is the probability predicted by the model that input instance $x_i$ belongs to class j.

Further, in step S302, the result of the m-th iteration is:

$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$

wherein $f_{m-1}(x)$ is the decision model after the (m-1)-th iteration, and the set of all residuals $\{(x_i, R_{mi})\}$, $i \in [1..n]$, is used to fit a classification and regression tree (CART).

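Further, the residual of the loss $L(y, f_{m-1}(x))$ is used to estimate the parameter $\Theta_m$; the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:

$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + \beta_m T(x_i; \Theta)\big)$$

The negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:

$$R_{mi} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$$

where i is the index of the i-th training sample.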

Further, in step S303, the final model is:

$$f_M(x) = f_0(x) + \sum_{m=1}^{M} \beta_m T(x; \Theta_m)$$

wherein $f_M(x)$ is the final ensemble model composed of M decision trees, M is the number of epitope classes, $\beta_m$ is the weight of the m-th decision tree, T is the decision tree, x is the input of the decision tree T, and $\Theta_m$ are the parameters of the decision tree.

Compared with the prior art, the invention has at least the following beneficial effects:

The invention is a method for predicting TCR epitopes based on concatemer features of the TCR sequence: it scans the CDR3 β sequences one by one, parses each polypeptide chain into consecutive short peptides of length 3, and counts the number of occurrences of each triplet. The counts form the initial feature matrix, and the epitope corresponding to each CDR3 β sequence serves as its class label. The human body uses 20 amino acids, which give at most $20^3 = 8000$ distinct ordered triplets, so the initial feature matrix has at most 8000 dimensions. This addresses the complex feature extraction and time-consuming training of existing models: training completes in a short time, and the prediction performance is superior to that of existing models.

Furthermore, principal component analysis is used for feature transformation, and the dimensionality of the features is reduced.

Further, the feature matrix is input into a gradient boosting decision tree (GBDT) for training, the optimal hyperparameters of the model are obtained by grid search, and the ensemble of decision trees is finally obtained.

Furthermore, the test data is coded by the same method, the test data characteristic matrix is input into the model, and the sum of the predicted results of all the decision trees is taken as the final predicted result.

In conclusion, the method uses only the triplet counts as initial features and combines them with a gradient boosting decision tree model, so that the model can be trained in a very short time and with high prediction accuracy.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is the feature matrix obtained after feature selection on the TCR data from the Dash et al. paper;

FIG. 2 is a schematic flow chart of the present invention;

FIG. 3 shows comparison results of different models in a Dash dataset;

FIG. 4 shows the results of a multi-classification ROC curve performed in the Dash dataset.

Detailed Description

The invention provides a TCR epitope prediction method based on adjacent amino acid information in the TCR sequence, named SETE (Sequence-based Ensemble learning approach for TCR Epitope binding prediction). The training set consists of CDR3 β sequences and the corresponding polypeptide chains (epitopes) that they specifically recognize; the test set consists of CDR3 β sequences.

The method is based on the following generally accepted findings:

1. The CDR3 region of the TCR sequence has a clear interaction with MHC-presented polypeptide chains, and the β chain of this region contributes significantly to peptide recognition;

2. Twenty amino acids constitute proteins in humans.

Referring to fig. 2, the method for predicting corresponding epitopes of T cell receptors based on the concatemer characteristics of the present invention includes the following steps:

S1, extracting initial features

Referring to fig. 1, the input data are CDR3 β chains and their corresponding epitopes, and the epitopes are the classes predicted by the model. Since an amino acid sequence cannot be used directly as a feature, each sequence is parsed into consecutive short peptides (triplets) of length 3, and the frequency of each triplet is counted as an initial feature. After feature selection on the initial features, a certain similarity can be observed among the features of TCR sequences corresponding to the different epitope classes. Fig. 1 shows the feature matrix obtained after feature selection on the TCR data from the Dash et al. paper: the x axis represents the features and the y axis represents the samples; the color bar on the far right indicates the epitope class corresponding to each TCR; darker colors indicate a larger triplet count.
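The triplet counting of step S1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `count_triplets` and `build_feature_matrix` and the example CDR3 β strings are hypothetical, and building the vocabulary from the triplets observed in the input sequences is an assumption (the text only bounds the dimension at 8000).

```python
from collections import Counter

def count_triplets(sequence, k=3):
    """Count overlapping k-mers (k = 3 gives triplets) in one amino acid sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def build_feature_matrix(sequences, k=3):
    """Return (vocabulary, matrix): one row per sequence, one column per observed triplet."""
    counts = [count_triplets(s, k) for s in sequences]
    vocabulary = sorted(set().union(*counts))  # at most 20**3 = 8000 entries for k = 3
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical CDR3 beta sequences, used only to illustrate the encoding.
cdr3_sequences = ["CASSLGQAYEQYF", "CASSLGGAYEQYF"]
vocabulary, matrix = build_feature_matrix(cdr3_sequences)
```

Because every sequence is mapped onto the same triplet vocabulary, sequences of different lengths yield feature vectors of equal dimension without any alignment step.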

S2, feature extraction

On the one hand, since there are 20 amino acids in total, a short chain of 3 amino acids has at most $20^3 = 8000$ possible combinations, so the features can reach 8000 dimensions and feature screening is needed to reduce the dimensionality. On the other hand, because TCR sequences are similar to one another, the triplets of TCR sequences of the same class may carry redundant information. Therefore, principal component analysis is used to reduce the dimensionality of the data, with the following steps:

S201, denoting the initial feature matrix as $X = \{x_1, x_2, \dots, x_n\}$ and centering each feature;

each $x_i$ is an m-dimensional column vector:

$$x_i = (x_{i1}, x_{i2}, \dots, x_{im})^T, \quad i = 1, 2, \dots, n$$

wherein n is the number of training samples and m is the feature dimension;

S202, letting the projection of a sample point $x_i$ onto the hyperplane in the new space be $W^T x_i$; for the projections of all sample points to be separated as much as possible, the variance of the projected sample points is maximized, and the optimization objective is:

$$\max_{W} \ \operatorname{tr}\left(W^T X X^T W\right) \quad \text{s.t.} \quad W^T W = I$$

S203, solving the optimization objective by the Lagrange multiplier method gives $XX^T W = \lambda W$; eigendecomposition of the covariance matrix $XX^T$ is performed and the resulting eigenvalues are sorted as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$; the eigenvectors corresponding to the k largest eigenvalues are then taken to form the projection matrix $W$, and the resulting feature matrix $W^T X$ is a matrix of k rows and n columns;

the projection matrix $W$ is:

$$W = (w_1, w_2, \dots, w_k)$$

wherein $\lambda$ is an eigenvalue, $w_i$ is a column vector of the projection matrix, and $1 \le i \le k$.
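Steps S201 to S203 can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: samples are stored as the columns of `X`, the helper names `fit_pca` and `transform` are hypothetical, and the toy count matrix is randomly generated rather than taken from any data set in the text.

```python
import numpy as np

def fit_pca(X, k):
    """X has shape (m, n): m features, n samples as columns. Returns (mean, W)."""
    mean = X.mean(axis=1, keepdims=True)     # per-feature mean
    Xc = X - mean                            # centering (step S201)
    cov = Xc @ Xc.T                          # XX^T, shape (m, m)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    W = eigvecs[:, order]                    # projection matrix, shape (m, k)
    return mean, W

def transform(X, mean, W):
    """Project the samples (columns of X) to k dimensions: W^T (X - mean)."""
    return W.T @ (X - mean)

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(50, 12)).astype(float)  # toy counts: 50 features, 12 samples
mean, W = fit_pca(X, k=5)
Z = transform(X, mean, W)                          # k rows, n columns
```

Keeping `mean` and `W` from the training data and reusing them in `transform` is what allows test sequences to be projected into the same reduced feature space.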

S3, training the epitope prediction model

A new prediction model based on gradient boosting decision trees is provided. Given n training samples, the gradient boosting decision tree model makes a prediction for input data x by linearly combining the decision results of all decision trees. The steps are as follows:

the n training samples are:

$$\{(x_1, y_1), \dots, (x_n, y_n)\}$$

where i = 1, 2, …, n;

S301, initializing the model

Initialize the iteration counter m = 0, set the maximum number of iterations to M, and initialize the model $f_0(x)$ as:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where N is the number of samples, c is the constant fitted by the initial model, and L is the log-likelihood loss function:

$$L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$

wherein Y is the output variable, X is the input variable, L is the loss function, M is the number of epitope classes, $y_{ij}$ is a binary indicator with $y_{ij} = 1$ if class j is the true class of input instance $x_i$ and $y_{ij} = 0$ otherwise, and $p_{ij}$ is the probability predicted by the model that input instance $x_i$ belongs to class j.
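The log-likelihood loss and a constant initial model can be illustrated as follows. This is a sketch under an assumption not stated in the text: the constant prediction is taken to be the empirical class frequencies, which is the constant probability vector that minimizes the multi-class log loss. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter

def log_loss(y_true, probs, eps=1e-15):
    """Mean multi-class log loss: -(1/N) * sum_i sum_j y_ij * log(p_ij)."""
    total = 0.0
    for label, p in zip(y_true, probs):
        total -= math.log(max(p[label], eps))  # only the true-class term has y_ij = 1
    return total / len(y_true)

def constant_model(y_true, n_classes):
    """Constant prediction: the empirical class frequencies."""
    counts = Counter(y_true)
    return [counts[j] / len(y_true) for j in range(n_classes)]

y = [0, 0, 0, 1, 2, 2]                   # toy epitope class labels
prior = constant_model(y, n_classes=3)   # same probability vector for every sample
loss_prior = log_loss(y, [prior] * len(y))
loss_uniform = log_loss(y, [[1 / 3] * 3] * len(y))
```

On these toy labels the class-frequency constant gives a lower loss than a uniform guess, which is why a fitted constant is a reasonable starting point before any tree is added.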

S302, model iteration

Each iteration adds one decision tree to the current model, and the result of the m-th iteration is:

$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$

wherein $f_{m-1}(x)$ is the decision model after the (m-1)-th iteration, and the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:

$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + \beta_m T(x_i; \Theta)\big)$$

Since the basis functions are combined additively, the goal is to use the residual of the loss $L(y, f_{m-1}(x))$ to estimate the parameter $\Theta_m$.

For this purpose, the negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:

$$R_{mi} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$$

where i is the index of the i-th training sample.

The set of all residuals, $\{(x_i, R_{mi})\}$, $i \in [1..n]$, is used to fit a classification and regression tree (CART) and thereby solve for the parameter $\Theta_m$.

S303, setting m = m + 1; if m is smaller than the maximum number of iterations, returning to step S302; otherwise, stopping training and returning all trained decision trees;
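Steps S301 to S303 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: each tree is a depth-1 regression stump rather than a full CART, the weight β is a fixed constant, one stump is fitted per class in every iteration using softmax probabilities (for which the negative gradient of the log loss is $y_{ij} - p_{ij}$), and the names `fit_stump`, `train_gbdt` and `predict` are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_stump(X, r):
    """Least-squares regression stump: (feature, threshold, left_value, right_value)."""
    n = len(X)
    mean_r = sum(r) / n
    best = (0, float("inf"), mean_r, mean_r)          # fallback: constant predictor
    best_err = sum((v - mean_r) ** 2 for v in r)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X})[:-1]:  # keep both sides non-empty
            left = [r[i] for i in range(n) if X[i][f] <= t]
            right = [r[i] for i in range(n) if X[i][f] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
            if err < best_err:
                best, best_err = (f, t, lv, rv), err
    return best

def stump_predict(stump, x):
    f, t, lv, rv = stump
    return lv if x[f] <= t else rv

def train_gbdt(X, y, n_classes, n_rounds=20, beta=0.5):
    n = len(X)
    counts = [sum(1 for label in y if label == j) for j in range(n_classes)]
    f0 = [math.log(max(c, 1) / n) for c in counts]    # S301: constant initial scores
    scores = [list(f0) for _ in range(n)]
    trees = []
    for _ in range(n_rounds):                         # S302 and S303: one round per iteration
        probs = [softmax(s) for s in scores]
        round_trees = []
        for j in range(n_classes):
            # Negative gradient of the log loss w.r.t. the class-j score: y_ij - p_ij
            resid = [(1.0 if y[i] == j else 0.0) - probs[i][j] for i in range(n)]
            round_trees.append(fit_stump(X, resid))
        for i in range(n):
            for j in range(n_classes):
                scores[i][j] += beta * stump_predict(round_trees[j], X[i])
        trees.append(round_trees)
    return f0, trees, beta

def predict(model, x):
    """Return (predicted class, class probabilities) from the additive model."""
    f0, trees, beta = model
    s = list(f0)
    for round_trees in trees:
        for j, stump in enumerate(round_trees):
            s[j] += beta * stump_predict(stump, x)
    p = softmax(s)
    return max(range(len(p)), key=p.__getitem__), p

# Toy one-dimensional features with three classes, for illustration only.
X_toy = [[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]]
y_toy = [0, 0, 1, 1, 2, 2]
model = train_gbdt(X_toy, y_toy, n_classes=3)
```

The prediction is the linear combination described in the text: the initial constant plus the weighted outputs of all trees, converted to class probabilities at the end.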

S4, epitope prediction

Initial features are extracted from the test data by the same method and passed through the same feature extraction; the resulting data are input into the trained model for prediction, and different prediction metrics are selected according to the prediction purpose.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention discloses a T cell receptor corresponding epitope prediction method based on a multiconnector characteristic, and solves the problems that the existing algorithm is long in training time and the prediction result is not ideal.

Because no existing model can directly handle multi-class TCR epitope prediction, the binary classification performance is tested first in order to verify the effectiveness of the invention. Since the existing method TCRGP uses the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) as model evaluation metrics, AUC is also used here to evaluate the prediction performance of the invention.

In addition, the running times of the two models on the same data set are compared. A multi-class prediction test is then carried out; because the number of samples per class is unbalanced and the ROC curve is relatively insensitive to class imbalance, ROC and AUC are again used to measure the prediction performance of the model. The following quantities are used: true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).

Define false positive rate FPR = FP/(FP + TN).

A true positive ratio TPR = TP/(TP + FN) is defined.

The ROC curve is plotted from the values of FPR and TPR at different thresholds, and AUC is the area under the line of the ROC curve.
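The FPR, TPR and AUC definitions above can be computed as in the following sketch. It assumes binary labels encoded as 1 (positive) and 0 (negative) with at least one example of each, and the function names are hypothetical.

```python
def roc_curve(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping the decision threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)  # FPR = FP / (FP + TN)
        tpr.append(tp / pos)  # TPR = TP / (TP + FN)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

fpr, tpr = roc_curve([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
area = auc(fpr, tpr)
```

The curve always starts at (0, 0) and ends at (1, 1); a classifier that ranks every positive above every negative reaches an area of 1.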

Tests were performed on the public data set VDJdb. After screening, data for 22 epitope classes were selected from VDJdb. Because all existing models can handle only binary classification tasks, binary classification tests are performed first to allow comparison with other models. In each binary classification task, all positive examples are used, and an equal number of TCRs is randomly sampled from the other classes as negative examples. The binary classification results are shown in Table 1.
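The construction of one balanced binary task can be sketched as follows. The function name `make_binary_task`, the fixed random seed, the cap on the number of negatives when too few are available, and the toy sequences and epitope labels are all assumptions made for illustration.

```python
import random

def make_binary_task(samples, positive_epitope, seed=0):
    """samples: list of (cdr3, epitope). Returns a list of (cdr3, label), label 1 or 0."""
    positives = [s for s, e in samples if e == positive_epitope]
    others = [s for s, e in samples if e != positive_epitope]
    rng = random.Random(seed)
    # Draw as many negatives as there are positives (capped by availability).
    negatives = rng.sample(others, min(len(positives), len(others)))
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]

# Hypothetical toy data: (CDR3 beta sequence, epitope label).
toy_samples = [("CASSA", "E1"), ("CASSB", "E1"), ("CASSC", "E2"),
               ("CASSD", "E2"), ("CASSE", "E3")]
task = make_binary_task(toy_samples, "E1")
```

Repeating this once per epitope class yields one binary task per class, each with a 1:1 ratio of positive to negative examples.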

Table 1: Comparison of binary classification results of SETE and TCRGP (the asterisk denotes FRDYVDRFYKTLRAEQASBE)

The table above shows that, in the binary classification tasks, the method of the invention achieves prediction performance comparable to that of the existing method TCRGP while greatly reducing training time, and the reduction is especially pronounced on large data sets.

A series of experiments was also carried out to verify the effectiveness of the invention on the multi-class task. The ROC curve is used as the evaluation metric, and a one-vs-rest (OneVsRest) strategy is used to draw the multi-class ROC curves: one classifier is trained for each class, treating the TCR sequences of that class as positive examples and the TCR sequences of all other classes as negative examples, and the outputs of the ten classifiers are finally combined by voting to obtain the final classification result. The results obtained with five-fold cross-validation are shown in Table 2.
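The one-vs-rest relabeling and the final combination step can be sketched as follows. Interpreting the vote as choosing the class whose classifier outputs the highest score is an assumption, and the function names, class names and scores are hypothetical.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class target for one classifier: 1 for its class, 0 for the rest."""
    return [1 if y == positive_class else 0 for y in labels]

def vote(scores_by_class):
    """Pick the class whose one-vs-rest classifier gives the highest score for a sample."""
    return max(scores_by_class, key=scores_by_class.get)

labels = ["E1", "E2", "E1", "E3"]
binary_for_e1 = one_vs_rest_labels(labels, "E1")
winner = vote({"E1": 0.2, "E2": 0.7, "E3": 0.1})
```

Each relabeled target can also be paired with that classifier's scores to draw one ROC curve per class.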

Table 2: multi-classification prediction results of SETE on VDJdb dataset

To further validate the ability of the invention to predict the epitope corresponding to a TCR, tests were performed on the data set published in the Dash et al. paper, which contains data for 3 human epitope classes and 7 mouse epitope classes.

Since the model is better suited to multi-class tasks, a multi-class test is performed on this data set first, and the model is evaluated with the ROC curve and AUC; the multi-class results on the Dash data set are shown in Table 3.

Table 3: SETE multiple classification results in Dash dataset

The table above shows that SETE performs well overall on the multi-class problem, but the prediction for individual epitopes, such as pp65, is poor, which may be related to the small amount of data for these epitopes. The comparison of SETE multi-class results with TCRGP and TCRdist is shown in FIG. 3, where the x axis represents the different prediction models and the y axis represents the area under the ROC curve of each model. In addition, multi-class ROC curves were plotted separately for the human and mouse data, and the results are shown in FIG. 4, where the x axis represents the false positive rate and the y axis represents the true positive rate.

Binary classification tests were also performed on the Dash dataset, and the results are shown in Table 4.

Table 4: Comparison of binary classification results of SETE and TCRGP on the Dash dataset

As with previous results, SETE is able to complete training in a very short time and with prediction accuracy superior to the TCRGP model.

In conclusion, compared with the existing method TCRGP, the method can train the model in a shorter time, and its performance on the binary classification tasks is better than that of the existing method. In addition, the method can be applied directly to multi-class tasks with high prediction accuracy.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A T cell receptor corresponding epitope prediction method based on a concatemer characteristic is characterized by comprising the following steps:

S2, building an initial feature matrix from the initial features obtained in step S1, and reducing the dimensionality of the initial feature matrix by principal component analysis to extract features, specifically comprising:

S201, denoting the initial feature matrix as $X = \{x_1, x_2, \dots, x_n\}$ and centering each feature, wherein n is the number of samples;

S203, solving the optimization objective by the Lagrange multiplier method, performing eigendecomposition of the covariance matrix $XX^T$, and sorting the resulting eigenvalues; the eigenvectors corresponding to the k largest eigenvalues are then taken to form the projection matrix $W$, and the resulting feature matrix $W^T X$ is a matrix of k rows and n columns;

S3, given n training samples, training a gradient boosting decision tree model which, for input data x, makes a prediction by linearly combining the decision results of the decision trees, specifically comprising:

S301, initializing the iteration counter m = 0 and initializing the model $f_0(x)$;

S303, letting m = m + 1; if m is smaller than the maximum number of iterations, returning to step S302; otherwise, stopping training and returning all trained decision trees, which completes the training of the epitope prediction model;

2. The method for predicting the epitope corresponding to a T cell receptor based on concatemer characteristics of claim 1, wherein in step S201, the m-dimensional column vector $x_i$ is:

$$x_i = (x_{i1}, x_{i2}, \dots, x_{im})^T, \quad i = 1, 2, \dots, n$$

wherein n is the number of training samples, and m is the feature dimension.

3. The method for predicting T cell receptor-corresponding epitopes based on concatemer characteristics according to claim 1, wherein in step S202, the optimization objective is:

$$\max_{W} \ \operatorname{tr}\left(W^T X X^T W\right) \quad \text{s.t.} \quad W^T W = I$$

wherein $W$ is the projection matrix, $W^T$ is the transpose of the projection matrix, $X$ is the initial feature matrix, $X^T$ is the transpose of the initial feature matrix, and $I$ is the identity matrix.

4. The T cell receptor corresponding epitope prediction method based on concatemer characteristics of claim 1, wherein in step S203, solving the optimization objective gives

$$XX^T W = \lambda W$$

The projection matrix $W$ is:

$$W = (w_1, w_2, \dots, w_k)$$

wherein $\lambda$ is an eigenvalue, $w_i$ is a column vector of the projection matrix, $1 \le i \le k$, and the eigenvalues are sorted as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$.

5. The method for predicting T-cell receptor-corresponding epitopes according to claim 1, wherein in step S301 the initial model $f_0(x)$ is:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c), \qquad L(Y, P(Y \mid X)) = -\log P(Y \mid X) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$

wherein N is the number of samples, c is the constant fitted by the initial model, Y is the output variable, X is the input variable, L is the loss function, M is the number of epitope classes, $y_{ij}$ is a binary indicator with $y_{ij} = 1$ if class j is the true class of input instance $x_i$ and $y_{ij} = 0$ otherwise, and $p_{ij}$ is the probability predicted by the model that input instance $x_i$ belongs to class j.

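6. The method for predicting the epitope corresponding to a T-cell receptor based on the concatemer characteristics of claim 1, wherein in step S302, the result of the m-th iteration is:

$$f_m(x) = f_{m-1}(x) + \beta_m T(x; \Theta_m)$$

wherein $f_{m-1}(x)$ is the decision model after the (m-1)-th iteration, and the set of all residuals $\{(x_i, R_{mi})\}$, $i \in [1..n]$, is used to fit a classification and regression tree.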

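7. The method for predicting T cell receptor-corresponding epitopes based on concatemer characteristics of claim 6, wherein the residual of the loss $L(y, f_{m-1}(x))$ is used to estimate the parameter $\Theta_m$, and the parameter $\Theta_m$ of the decision tree is obtained by solving the following optimization objective:

$$\Theta_m = \arg\min_{\Theta} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + \beta_m T(x_i; \Theta)\big)$$

the negative gradient of the loss function at the model $f_{m-1}$ is used to approximate the residual:

$$R_{mi} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}$$

where i is the index of the i-th training sample.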

8. The method for predicting T-cell receptor-corresponding epitopes according to claim 1, wherein, in step S303, the final model is:

$$f_M(x) = f_0(x) + \sum_{m=1}^{M} \beta_m T(x; \Theta_m)$$

wherein $f_M(x)$ is the final ensemble model composed of M decision trees, M is the number of epitope classes, $\beta_m$ is the weight of the m-th decision tree, T is the decision tree, x is the input of the decision tree T, and $\Theta_m$ are the parameters of the decision tree.