CN107341363B - Prediction method of protein epitope - Google Patents
- Publication number: CN107341363B (application CN201710516045.8A)
- Authority
- CN
- China
- Prior art keywords
- classifier
- training
- epitope
- data
- class
- Prior art date
- Legal status: Active
Classifications
- G: PHYSICS
- G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
A method for predicting protein epitopes. The method collects experimentally verified epitope sequences and the corresponding protein sequences from professional databases, constructs positive and negative sample sets for learning and training, and gathers physicochemical property features of the amino acids. A group of classifiers with complementary prediction ability and an independent high-performance classifier are then trained on the sample set with a machine learning algorithm. Finally, the complementary classifier group yields a first candidate epitope set, the high-performance classifier yields a second candidate epitope set, and the sequences in the candidate sets are scored and ranked by a propensity-scoring method. Building on a prediction model with a multilayer classification structure, the invention uses several classifiers with complementary ability to predict protein epitopes cooperatively. The method can markedly improve the accuracy of protein-epitope prediction and provides an effective way to find epitopes accurately and quickly.
Description
Technical Field
The invention relates to a method for accurately and rapidly predicting protein epitopes, and belongs to the field of biotechnology.
Background
An epitope is the basis of a protein's antigenicity, and mapping epitopes accurately and in detail not only aids basic immunological research but is also important for designing bioactive drugs and epitope vaccines. In the immune system, B cells and T cells act together in the body's second line of defense, the process of acquired immunity: non-self antigens are recognized during immune presentation, and once an invading antigen is found, the two cell types mount their respective immune responses.
Traditionally, epitope positions have been determined experimentally, for example by X-ray diffraction; such methods are complex and labor-intensive. With the development of computer technology and the continuing growth of biological databases, the mainstream technical route has become to summarize the sequence and structural features of epitopes from existing data, screen and predict candidate epitopes with machine learning algorithms, and then verify the predictions experimentally. This route greatly reduces cost and improves efficiency.
Computational epitope prediction fuses multiple feature parameters (such as hydrophobicity, hydrophilicity, accessibility, variability, and antigenicity) derived from the physicochemical properties of amino acids. Machine learning algorithms are widely used in epitope prediction for their accuracy and efficiency; predicting epitopes with them mainly involves data collection and processing, model building, parameter optimization, and epitope prediction. The common algorithms include Support Vector Machines (SVM), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN). These algorithms have improved prediction, but a single algorithm rarely achieves high accuracy, and training samples are often selected unscientifically. Current research at home and abroad therefore improves prediction performance mainly through combined models with complementary prediction ability and scientifically constructed sample sets. Most such work searches for complementary classifier combinations by experimenting with combinations of existing prediction tools; although this improves performance to some extent, no more effective prediction method has yet been found.
Disclosure of Invention
The invention aims to address the defects of the prior art by providing a method for predicting protein epitopes, offering an effective way to find epitopes accurately and quickly.
The problems of the invention are solved by the following technical scheme:
A method for predicting protein epitopes: the method collects experimentally verified epitope sequences and the related protein sequences from professional databases, constructs positive and negative sample sets for learning and training, and uses the physicochemical properties of amino acids as features for learning and prediction; a group of classifiers with complementary prediction ability and a high-performance classifier are then trained on the sample set with a machine learning algorithm; finally, the complementary prediction classifier group yields a first candidate epitope set, the high-performance classifier yields a second candidate epitope set, and the sequences in the candidate sets are scored and ranked by a propensity-scoring method;
the prediction is carried out according to the following steps:
a. Data acquisition: epitope data are selected from the IEDB database, and the primary sequences of the proteins containing these epitope samples are retrieved from the UniProt protein database; the epitope sample data are completed and non-epitope sample data are extracted from the same sequences, giving positive and negative sample sets for learning and training. For each sequence in a sample, the physicochemical features of every amino acid (hydrophobicity, accessibility, and the like) together with the mean hydrophobicity and accessibility of each three adjacent amino acids form a feature matrix that serves as the training input;
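As an illustration of the feature-matrix step, the sketch below builds one row per residue from hydrophobicity, accessibility, and their means over each three-residue window. The patent does not name the property scales it uses, so the Kyte-Doolittle hydrophobicity values and the flat placeholder accessibility values here are assumptions, as are the function names.

```python
# One common hydrophobicity scale (Kyte-Doolittle); the patent does not
# specify which published scale it uses.
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}
# Hypothetical relative solvent-accessibility values (placeholder numbers).
ACCESSIBILITY = {aa: 0.5 for aa in KYTE_DOOLITTLE}

def window_mean(values, i):
    """Mean of a property over residues i-1..i+1, clipped at the ends."""
    lo, hi = max(0, i - 1), min(len(values), i + 2)
    win = values[lo:hi]
    return sum(win) / len(win)

def feature_matrix(seq):
    """One row per residue: [hydro, access, hydro 3-mean, access 3-mean]."""
    hyd = [KYTE_DOOLITTLE[aa] for aa in seq]
    acc = [ACCESSIBILITY[aa] for aa in seq]
    return [
        [hyd[i], acc[i], window_mean(hyd, i), window_mean(acc, i)]
        for i in range(len(seq))
    ]
```

A 20-residue sample thus yields a 20 x 4 matrix; any additional physicochemical properties would add further columns in the same way.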
b. Training of the complementary prediction classifier group:
① Training is performed on sample set D with a two-class training method; when the classification accuracy exceeds λ, the first classifier C1 is obtained. Here λ is the target classification accuracy of the trained classifiers; its value can be set according to the actual situation, in the range 50% ≤ λ < 100%. The subset of D that C1 correctly identifies is denoted D1 (in general, Di, 1 ≤ i ≤ n, denotes the sample set that classifier Ci can correctly identify in its training set). Following the four-class training method and the appropriate-increment method, a new training sample set is built from the epitope and non-epitope samples in D1 and D - D1; learning is performed on this new set, and when the classification accuracy exceeds λ the second classifier C2 is obtained. The subset of D that C2 correctly identifies is denoted D2, and the intersection of D1 and D2 is denoted D′1 (in general, D′i, 1 ≤ i ≤ n-1, denotes the sample set correctly identified by both the i-th and the (i+1)-th classifiers). Whether training continues is then decided by the training-termination rule;
② When training is to continue, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - D2 and D2 - D′1; learning is performed on this new set, and when the classification accuracy exceeds λ the third classifier C3 is obtained. The subset of D - D′1 that C3 correctly identifies is denoted D3, and the intersection of D2 and D3 is denoted D′2. Whether training continues is again decided by the training-termination rule. After the (n-1)-th classifier Cn-1 has been obtained and training is to continue, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - Dn-1 and Dn-1 - D′n-2; learning is performed on this new set, and when the classification accuracy exceeds λ the n-th classifier Cn is obtained, the subset it correctly identifies being denoted Dn. Training proceeds in this manner until it is stopped, yielding a group of classifiers with complementary classification ability, i.e., the complementary prediction classifier group;
c. On sample set D, a classifier whose classification accuracy exceeds 90% for each class is trained with a two-class training method; this classifier is called the high-performance classifier EC;
d. For protein antigens with unknown epitopes, prediction is performed according to the following method:
① The primary sequence of the antigen protein is divided into a set SSD of sequence fragments according to the "reference length"; for each fragment, a feature matrix is formed from the hydrophobicity and accessibility of each amino acid and their means over each three adjacent amino acids, serving as the prediction input;
② The trained classifiers first predict in turn. The first classifier C1 and the second classifier C2 each classify the set SSD; the fragments that C1 predicts as class one and C2 predicts as class three form the set ERD1 (in general, ERDi, 1 ≤ i ≤ n-1, is the set of fragments that the i-th classifier Ci predicts as class one and the (i+1)-th classifier Ci+1 predicts as class three). The third classifier C3 then classifies the set SSD - ERD1; the fragments that C2 predicts as class one and C3 predicts as class three form ERD2. This continues until, on the set SSD minus the union of all prediction results of the first n-1 classifiers up to Cn-1, the n-th classifier Cn performs classification; the fragments that Cn-1 predicts as class one and Cn predicts as class three form ERDn-1. The first candidate epitope set FCS is the union of ERD1 through ERDn-1. The classifier EC then classifies the set SSD - FCS, and the fragments that EC predicts as class one form the second candidate epitope set SCS;
③ According to the propensity-scoring method, each sequence fragment is scored, and the fragments in the first candidate epitope set FCS and the second candidate epitope set SCS are ranked by score, highest first.
In the above prediction method of protein epitopes, the specific operation of the four-class training method is as follows:
Let Di be the sample set correctly identified by the i-th classifier Ci, with its epitope and non-epitope data forming two subsets. When the (i+1)-th classifier Ci+1 is trained, the data that Ci could not correctly identify are listed, by their epitope and non-epitope labels, as the class-one epitope samples and class-two non-epitope samples of a new training sample set; data randomly drawn from the epitope and non-epitope subsets of Di are listed as the class-three epitope samples and class-four non-epitope samples. The new set is then trained with a four-class learning algorithm to obtain the (i+1)-th classifier Ci+1.
In the above prediction method, the training-termination rule is as follows:
Let N = number(D) be the total number of elements in the sample set and Ni = number(Di) the number of elements in Di; Ri is the correct ratio of the joint prediction of classifiers i and i+1, and R is the total prediction ratio after the (i+1)-th training (the defining formulas appear as images in the original patent and are omitted here). R is first calculated before the 3rd training. When the number of training rounds is at most 4, the return value of the termination parameter terminate is given by a first threshold test; when it is between 5 and 7, by a second; and when it is 8 or more, by a third (formulas likewise omitted). If terminate returns 0, training ends; if it returns 1, training continues.
In the above prediction method, the appropriate-increment method for constructing a new training sample set is as follows:
let i the i-th classifier Ci1, 2, n, the correctly identified sample set is Di;
When i is 1, D1Two kinds of samples are shared, when the first classifier C is obtained1Then, from set D according to the following rule1Selecting data of a first class and data of a second class to form a new training sample set:
let G1=D-D1,Represents G1The number of elements of the subset of the mesopic and non-epitopic samples,are respectively a set G1Class one, class two data in (1);
when in useFrom the set D1The quantity of the randomly selected data of the first category and the second category isSimultaneous slave aggregationRandom selectionForming a new training sample set by the data;
when in useFrom the set D1The number of the data of the selected category I and the category II isSimultaneous slave aggregationRandom selectionForming a new training sample set by the data;
when i is 2iIn which there are four types of samples, respectively using subsetsShow, byRespectively represent collectionsThe number of elements in (1) is set as the data set which cannot be correctly identifiedBy usingRepresents GiThe element numbers of the middle class one and class two sample subset,are respectively a set GiAccording to the following rule, the data of class one and class two in (1) are collected from the set DiSelecting data of a third category and data of a fourth category to form a new training sample set:
when in useFrom the set DiThe number of randomly selected data of the three and four categories isSimultaneous slave aggregationRandom selectionForming a new training sample set by the data;
when in useThen, selected set DiThe number of the selected data of the category three and the category four isSimultaneous slave aggregationRandom selectionEach data constitutes a new set of training samples.
In the above prediction method, the propensity-scoring method is as follows:
In the epitope data set, the frequency of occurrence of any combination of three consecutive amino acids in the epitopes is calculated by a formula (given as an image in the original patent and omitted here) whose terms are: AAx, AAy, AAz, each any of the 20 amino acids; AAx-AAy-AAz, any combination of three consecutive amino acids; the frequency of occurrence of that combination in the epitopes; the number of times the combination occurs; the total occurrence counts of the individual amino acids AAx, AAy, AAz; and the total occurrence counts of the two-amino-acid combinations AAx-AAy and AAy-AAz.

If the prediction window is k, the propensity score of any sequence fragment into which the primary sequence of the antigen protein is divided is computed from these frequencies (the scoring formula is likewise omitted).
according to the method, on the basis of constructing a prediction model of a multilayer classification structure, a plurality of classifiers with complementary capacity are utilized to carry out cooperative prediction on the protein epitope, prediction experiments are carried out on a plurality of blind data sets, the prediction accuracy rate in the experiments is higher than 70%, and therefore the method can obviously improve the accuracy of the prediction of the protein epitope and provides an effective method for accurately and quickly finding the epitope.
Drawings
The invention will be further explained with reference to the drawings.
FIG. 1 is a "training flow chart of a complementary classifier set" for the epitope prediction method of the present invention;
FIG. 2 is a "epitope prediction process diagram" used in the method of predicting an epitope according to the present invention.
In the figures and in the text, the symbols are: D is the sample set; Ci is the i-th classifier; Di is the sample set correctly identified by classifier Ci; EC is the high-performance classifier; SSD is the sequence-fragment set; FCS is the first candidate epitope set; SCS is the second candidate epitope set; N = number(D) is the total number of elements in the sample set; Ni = number(Di) is the number of elements in Di; Ri is the correct ratio of the joint prediction of classifiers i and i+1; R is the total prediction ratio after the (i+1)-th training; terminate is the termination parameter; AAx, AAy, AAz are each any of the 20 amino acids; and AAx-AAy-AAz denotes any combination of three consecutive amino acids, with its occurrence frequency and the occurrence counts of the constituent amino acids and pairs as defined above.
IEDB refers to the professional database at http://www.iedb.org/; UniProt refers to the protein database at http://www.uniprot.org/.
Detailed Description
Epitope prediction is generally realized with a binary classifier; constructing classifiers with complementary prediction ability breaks through this habitual limitation. The construction starts from a binary classifier: sample data are recombined according to its classification results, and a new classifier is trained on the new sample set. Our research found that building a series of classifiers that exploit the prediction differences between adjacent classifiers in the group achieves gradual optimization; the proposed mechanism for training a complementary classifier group plays an important role in improving epitope-prediction performance.
In order to clearly understand the technical contents of the present invention, the present invention will be described in detail with reference to fig. 1 and 2. It is to be understood that the examples are illustrative of the invention and are not to be construed as limiting the invention.
1. Data acquisition
Epitope sequence data are collected from the IEDB (http://www.iedb.org/) epitope database as training positive samples; the database contains many experimentally verified epitope records covering humans, non-human primates, and other species. The primary sequence of the protein corresponding to each selected epitope sample is found in the UniProt (http://www.uniprot.org/) protein database, and sequence fragments not marked as epitopes (i.e., non-epitope sequences) are extracted from it as training negative samples. In our experiments, 800 protein sequences were extracted in total, from which 5120 continuous epitope sequences and 5200 non-epitope sequence fragments were collected. A reference length of 20 amino acids is used for each sample; non-epitope samples are taken directly from the protein primary sequence as 20-amino-acid fragments not marked as epitopes.
Because epitope sequences differ in the number of amino acids they contain, epitope samples are made to meet the "reference length" as follows. For an epitope sequence whose amino-acid count falls short of 20 by an even number, the same number of amino acids is taken from each side of the protein sequence in which it lies as contiguous supplements; if the shortfall is odd, one more amino acid is supplemented from the front end of the protein sequence than from the rear end. For an epitope sequence that exceeds 20 by an even number, the same number of amino acids is removed from both sides; if the excess is odd, one more amino acid is removed from the front end than from the rear end. For each sample sequence, a feature matrix is formed from the hydrophobicity and accessibility of each amino acid in order, together with the mean hydrophobicity and accessibility of every three adjacent amino acids; this feature matrix is the input for training and prediction.
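The "reference length" rules above are deterministic and can be sketched directly. The helper below (a hypothetical name) assumes the epitope lies far enough from the protein ends that flanking residues exist on both sides.

```python
def normalize_to_reference(protein, start, end, ref_len=20):
    """Return the ref_len-long window around the epitope protein[start:end],
    following the padding/trimming rules in the text: an even difference is
    split equally between the two sides; an odd difference puts the extra
    residue on the front (N-terminal) side. Assumes the epitope is far
    enough from the protein ends."""
    length = end - start
    diff = ref_len - length
    if diff > 0:                      # too short: extend outwards
        front, back = (diff + 1) // 2, diff // 2
        start, end = start - front, end + back
    elif diff < 0:                    # too long: trim inwards
        excess = -diff
        front, back = (excess + 1) // 2, excess // 2
        start, end = start + front, end - back
    return protein[start:end]
```

For example, an 18-residue epitope gains one flanking residue on each side, while a 21-residue epitope loses a single residue from its front end.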
2. Model building
In the sample set, the method for performing the training of the complementary classifier is as follows:
Let the sample set be D. Training is performed with a two-class training method; when the classification accuracy exceeds λ, classifier C1 is obtained (λ is the target classification accuracy of the trained classifiers; its value can be set according to the actual situation, in the range 50% ≤ λ < 100%). The subset of D that C1 correctly identifies is denoted D1 (in general, Di, 1 ≤ i ≤ n, denotes the sample set that classifier Ci can correctly identify in its training set). Following the four-class training method and the appropriate-increment method, a new training sample set is built from the epitope and non-epitope samples in D1 and D - D1, and learning is performed on it; when the classification accuracy exceeds λ, classifier C2 is obtained. The subset of D that C2 correctly identifies is denoted D2 (here and below, the recognized class one and class three are epitopes, and the recognized class two and class four are non-epitopes). The intersection of D1 and D2 is denoted D′1 (in general, D′i, 1 ≤ i ≤ n-1, denotes the sample set correctly identified by both the i-th and the (i+1)-th classifiers). A judgment is then made by the training-termination rule: if the termination parameter terminate returns 0, training ends; if it returns 1, training continues.

When training continues, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - D2 and D2 - D′1; training on this set, when the classification accuracy exceeds λ, yields classifier C3. The subset of D - D′1 that C3 correctly identifies is denoted D3, and the intersection of D2 and D3 is denoted D′2. The termination rule is consulted again. When training continues after the (n-1)-th classifier has been obtained, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - Dn-1 and Dn-1 - D′n-2; training on it, when the classification accuracy exceeds λ, yields classifier Cn, the subset it correctly identifies being denoted Dn. Training proceeds in this way until it stops. The result is a group of classifiers with complementary classification ability, the complementarity holding only between adjacent classifiers in the group. Any machine learning algorithm can be used for the two-class and four-class training in the method, as long as it meets the relevant rules and the epitope classification requirements of the method.
The specific content of the four-class training method is as follows:
After classifier C1 is obtained in training, the two classes of sample data it correctly identifies participate as new sample data in the training of the next classifier; that is, every subsequent classifier is trained with four classes. Let D1 be the sample set correctly identified by classifier C1, with its class-one (epitope) and class-two (non-epitope) data forming two subsets. When the second classifier is trained, the epitope and non-epitope samples in the data set D - D1 that C1 could not correctly identify are listed as the class-one and class-two samples of the new training sample set, while the epitope and non-epitope data correctly identified by C1 are listed as class-three and class-four samples; appropriate amounts of data are drawn according to the appropriate-increment method to form the new training set, which is then trained with a four-class learning algorithm to obtain the classifier. From the training of the third classifier onward, the data the previous classifier could not correctly identify are listed as class-one and class-two samples of the new training set by their epitope and non-epitope labels, part of the data the previous classifier identified correctly is drawn and listed as class-three and class-four samples, appropriate amounts are taken by the appropriate-increment method, and the four-class learning algorithm is again used to obtain the classifier.
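The construction of the four-class training set can be sketched as follows. Samples are assumed to be (data, is_epitope) pairs; the number of class-three/class-four samples drawn follows the appropriate-increment rule, whose exact expression is a formula image in the patent, so a caller-supplied count n_draw stands in for it here. All names are hypothetical.

```python
import random

def build_four_class_set(misidentified, correct_epitopes, correct_non_epitopes,
                         n_draw, rng=None):
    """Build the four-class training set used to train classifier C(i+1):
    class 1 / class 2 = epitope / non-epitope samples the previous
    classifier got wrong; class 3 / class 4 = samples randomly drawn from
    the epitope / non-epitope subsets it got right. n_draw stands in for
    the count prescribed by the appropriate-increment rule."""
    rng = rng or random.Random(0)
    labelled = []
    for sample, is_epitope in misidentified:
        labelled.append((sample, 1 if is_epitope else 2))
    for sample in rng.sample(correct_epitopes,
                             min(n_draw, len(correct_epitopes))):
        labelled.append((sample, 3))
    for sample in rng.sample(correct_non_epitopes,
                             min(n_draw, len(correct_non_epitopes))):
        labelled.append((sample, 4))
    rng.shuffle(labelled)
    return labelled
```

The resulting labelled set is what the four-class learning algorithm is trained on.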
The specific contents of constructing a new training sample set according to the appropriate-increment method are as follows:
After classifier Ci (i = 2, …, n) is obtained, a new sample set for the next round of training is constructed according to the numbers of samples in the different classes. "Sample increment" here means that the relevant classes of samples correctly identified by the previous classifier participate as new sample classes in the training of the next classifier. In general, the number of samples a classifier recognizes in each class is larger than the number it cannot correctly recognize, so the sample counts in the data sets must be compared when constructing the new training sample set. Let Di be the sample set correctly identified by classifier Ci, its four classes of data forming four subsets with their respective element counts, and let Gi be the data set that cannot be correctly identified, with the element counts of its epitope and non-epitope subsets giving the class-one and class-two data of Gi. Class-three and class-four data are then selected from Di by the following rule (the threshold conditions and selection quantities are given by formulas that appear as images in the original patent and are omitted here; when a computed multiple is not an integer, it is rounded to the nearest integer, the same below): under the first condition, prescribed numbers of class-three and class-four data are randomly selected from Di while a corresponding number of data are randomly selected from Gi, together forming the new training sample set; under the opposite condition, the alternative prescribed numbers are selected from Di and Gi.

In particular, after the first classifier C1 is obtained, class-one and class-two data are selected from D1 by the analogous rule: let G1 = D - D1, with the element counts of its epitope and non-epitope subsets giving the class-one and class-two data of G1; under the first condition, prescribed numbers of class-one and class-two data are randomly selected from D1 while a corresponding number of data are randomly selected from G1, together forming the new training sample set, and otherwise the alternative prescribed numbers are selected from D1 and G1.
The specific contents of the training termination rule are as follows:
Let N = number(D) be the total number of elements in the sample set D, and Ni = number(Di) the total number of elements in the set Di. R denotes the correct ratio of the joint prediction of classifier i and classifier i+1, and P denotes the total prediction ratio.
R is computed from the third training onward. The termination parameter terminate then returns 0 or 1 by comparing R and the total prediction ratio against thresholds that are relaxed in three stages: one set of cut-offs applies while the number of trainings is at most 4, a second while it is between 5 and 7, and a third once it reaches 8.
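The three-stage structure of the rule can be sketched as follows; the numeric cut-offs and the function signature are placeholders we assume for illustration, not the patented return-value formulas.

```python
def terminate(num_trainings, r, p, cutoffs=(0.6, 0.7, 0.8)):
    """Return 1 to continue training, 0 to stop (structural sketch).

    r: correct ratio of the joint prediction of the last two classifiers.
    p: total prediction ratio so far.
    cutoffs: assumed thresholds for the three training-count stages
             (<= 4 trainings, 5-7 trainings, >= 8 trainings).
    """
    if num_trainings <= 4:
        t = cutoffs[0]
    elif num_trainings <= 7:
        t = cutoffs[1]
    else:
        t = cutoffs[2]
    # Continue only while the joint ratio clears the stage threshold
    # and the sample set is not yet fully covered.
    return 1 if (r >= t and p < 1.0) else 0
```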
For the training of the high-performance classifier EC, only two-class training in the sample set D is required, until the classification accuracy reaches 90%.
3. Epitope prediction
For a protein antigen whose epitope positions are unknown, epitope prediction is carried out as follows. In the first step, the primary sequence of the antigen protein is divided according to the "reference length" into a set of sequence fragments SSD, and for each fragment a feature matrix is computed as prediction input according to the "data acquisition" method of step 1. In the second step, the trained complementary prediction classifier group predicts in sequence: classifier C1 and classifier C2 each classify the set SSD, and the fragments for which the C1 prediction result is class one while the C2 prediction result is class three form the set ERD1 (ERDi denotes the set of fragments for which the prediction result of classifier Ci is class one while that of classifier Ci+1 is class three, 1 ≤ i ≤ n-1); then classifier C3 classifies the set SSD - ERD1, and the fragments for which the C2 prediction result is class one while the C3 prediction result is class three form the set ERD2; this rule is followed until the last classifier Cn classifies the set SSD minus the union of all fragments already selected by the first n-1 classifiers, and the fragments for which the Cn-1 prediction result is class one while the Cn prediction result is class three form the set ERDn-1. The first candidate epitope set (first candidate set) FCS is the union of all ERDi, 1 ≤ i ≤ n-1. Finally, the classifier EC classifies the set SSD - FCS, and the fragments for which the EC prediction result is class one form the second candidate epitope set (second candidate set) SCS.
And thirdly, each sequence fragment is scored according to the propensity scoring method, and the sequence fragments in the FCS and SCS sets are ranked by score, with higher-scoring fragments ranked first.
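The second step above can be sketched as a cascade; the classifier interfaces below (callables returning a class label, with the high-performance classifier EC returning 1 for an epitope) are assumed for illustration.

```python
def cascade_predict(classifiers, ec, ssd):
    """Produce the two candidate epitope sets (FCS, SCS) from fragment set SSD.

    classifiers: [C1, ..., Cn], each mapping a fragment to a label in {1,2,3,4}.
    ec:          high-performance classifier mapping a fragment to 1 or 2.
    """
    remaining = list(ssd)
    fcs = []
    for c_i, c_next in zip(classifiers, classifiers[1:]):
        # ERDi: Ci predicts class one AND C(i+1) predicts class three
        erd = [f for f in remaining if c_i(f) == 1 and c_next(f) == 3]
        fcs.extend(erd)
        remaining = [f for f in remaining if f not in erd]
    # EC scans SSD - FCS for the second candidate set
    scs = [f for f in remaining if ec(f) == 1]
    return fcs, scs
```

Propensity scoring and ranking are then applied to both returned sets.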
The specific content of the propensity scoring method is as follows:
in the data set of the antigen epitope, the frequency of occurrence of any type of combination of three consecutive amino acids in the epitope is calculated by the following formula:
where AAx, AAy, AAz are any of the 20 amino acids; AAx-AAy-AAz denotes any type of combination of three consecutive amino acids; F(AAx-AAy-AAz) denotes the frequency with which combinations of this type occur in the epitopes; N(AAx-AAy-AAz) denotes the number of times the combination occurs; N(AAx), N(AAy) and N(AAz) are the total numbers of occurrences of the amino acids AAx, AAy and AAz; and N(AAx-AAy), N(AAy-AAz) are the total numbers of occurrences of the amino acid combinations AAx-AAy and AAy-AAz.
If the prediction window is k, the propensity score of any sequence fragment into which the primary sequence of the antigenic protein is divided is the sum of F over the k - 2 consecutive three-amino-acid combinations contained in the fragment: S = F(AA1-AA2-AA3) + F(AA2-AA3-AA4) + ... + F(AAk-2-AAk-1-AAk).
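As a rough illustration of the scoring idea, the sketch below uses a plain observed relative frequency for each tripeptide and sums the k - 2 triplet frequencies of a fragment; the patent's frequency formula additionally involves the single- and double-amino-acid totals listed above, which this simplification omits.

```python
from collections import Counter

def tripeptide_frequencies(epitopes):
    """Relative frequency of each three-amino-acid combination in the
    epitope data set (simplified stand-in for the patented formula)."""
    counts = Counter()
    for seq in epitopes:
        for j in range(len(seq) - 2):
            counts[seq[j:j + 3]] += 1
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()}

def propensity_score(fragment, freq):
    """Score a fragment of window length k as the sum of its k - 2
    consecutive tripeptide frequencies (unseen triplets contribute 0)."""
    return sum(freq.get(fragment[j:j + 3], 0.0)
               for j in range(len(fragment) - 2))
```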
4. Accuracy evaluation of the prediction method
The invention screened 800 antigen protein data and assembled a sample set consisting of 5120 epitope sequences and 5200 non-epitope sequences. Two algorithms, a support vector machine (SVM) and a recurrent neural network (RNN), were used for two-class and four-class training respectively, over four rounds: the first round trained classifier 1 with the SVM, the second round trained classifier 2 with the RNN, the third round trained classifier 3 with the RNN, and the fourth round trained the high-performance classifier 4 with the RNN. The sequences predicted as class one by classifier 1 and class three by classifier 2 number 3325; those predicted as class one by classifier 2 and class three by classifier 3 number 1573; the combined prediction accuracy reaches 95.6%. The accuracy of classifier 4 under five-fold cross-validation is 91%.
We collected 287 proteins outside the training samples as a blind test set containing 2000 validated epitope sequences, from which 1000 were randomly drawn for each test. The results predicted by the trained classifier group are as follows: classifier 1 and classifier 2 jointly predicted 739 epitope sequences, of which 551 were correct, an accuracy of 74.5%; classifier 2 and classifier 3 jointly predicted 492 epitope sequences, of which 327 were correct, an accuracy of 66.5%; the combined prediction accuracy is 71.3%, and the coverage of correct results reaches 87.8%. Classifier 4 predicted 190 epitope sequences, of which 75 were correct, an accuracy of 39.5%. Combining the results of the two kinds of classifiers, the coverage of correct results reaches 95.3%.
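The reported blind-test figures are internally consistent, as a quick recomputation shows (the variable names are ours):

```python
# (predicted, correct) counts reported for the blind test of 1000 epitopes
joint_12 = (739, 551)   # classifiers 1 and 2
joint_23 = (492, 327)   # classifiers 2 and 3
clf4 = (190, 75)        # high-performance classifier 4
drawn = 1000

acc_12 = joint_12[1] / joint_12[0]                                        # ~74.5%
acc_23 = joint_23[1] / joint_23[0]                                        # ~66.5%
combined_acc = (joint_12[1] + joint_23[1]) / (joint_12[0] + joint_23[0])  # ~71.3%
coverage = (joint_12[1] + joint_23[1]) / drawn                            # 87.8%
coverage_all = (joint_12[1] + joint_23[1] + clf4[1]) / drawn              # 95.3%
```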
The experimental results show that the method has high prediction accuracy, the prediction results cover most epitopes, and the method can provide an effective and scientific basis for epitope screening.
Claims (5)
1. A method for predicting a protein epitope, characterized in that: first, experimentally verified antigen epitope sequence information and the sequence information of the related proteins are collected from professional databases, a positive-and-negative sample set for learning and training is constructed, and the physicochemical properties of the amino acids are used as features for learning and prediction; then, a complementary prediction classifier group and a high-performance classifier are trained on the sample set with machine learning algorithms; finally, the complementary prediction classifier group is used to obtain a first candidate epitope set, the high-performance classifier is used to obtain a second candidate epitope set, and the sequences in the candidate epitope sets are scored and ranked with the propensity scoring method;
the prediction is carried out according to the following steps:
a. data acquisition: epitope data information is selected from the IEDB database, and the primary sequence information of the proteins containing the epitope samples is retrieved from the UniProt protein database; the epitope sample data are completed and non-epitope sample data are extracted from these proteins, constructing a positive-and-negative sample set for learning and training; for each amino acid of a sequence in the sample, physicochemical property features such as hydrophobicity and accessibility, together with the mean values of hydrophobicity and accessibility over three adjacent amino acids, form a feature matrix used as training input;
b. training of the complementary prediction classifier group:
① training is performed in the sample set D with a two-class training method, and when the classification accuracy is greater than λ, a first classifier C1 is obtained; λ denotes the classification accuracy required of a trained classifier, its specific value can be set according to the actual situation, and its range is 50% ≤ λ < 100%; the sample set in D that the first classifier C1 can correctly identify is denoted D1, where Di denotes the sample set in the training set that classifier Ci can correctly identify, 1 ≤ i ≤ n; according to the "four-class training method" and the "appropriate addition method", a new training sample set is constructed from the epitope and non-epitope samples in the sets D1 and D - D1, learning and training are performed on the new sample set, and when the classification accuracy is greater than λ, a second classifier C2 is obtained; the sample set in D that the second classifier C2 can correctly identify is denoted D2, and the intersection of the sets D1 and D2 is the set of samples that both the first and the second classifier can correctly identify (in general, the intersection of Di and Di+1 is the set of samples that both the i-th and the (i+1)-th classifier can correctly identify, 1 ≤ i ≤ n-1); whether training continues is then judged according to the training termination rule;
② when training needs to continue, according to the "four-class training method" and the "appropriate addition method", a new training sample set is constructed from the epitope and non-epitope samples in the sets D - D2 and D2 - D1, learning and training are performed on the new sample set, and when the classification accuracy is greater than λ, a third classifier C3 is obtained; the sample set in D - D1 that classifier C3 correctly identifies is denoted D3, and the intersection of the sets D2 and D3 is formed in the same way; whether training continues is then judged according to the training termination rule; after the (n-1)-th classifier Cn-1 is obtained, when training needs to continue, according to the "four-class training method" and the "appropriate addition method", a new training sample set is constructed from the epitope and non-epitope samples in the set of samples not yet correctly identified and in the set Dn-1 - Dn-2, learning and training are performed on the new sample set, and when the classification accuracy is greater than λ, an n-th classifier Cn is obtained; the sample set that classifier Cn correctly identifies is denoted Dn; proceeding in this way until training stops, a group of classifiers with complementary classification capability is obtained, namely the complementary prediction classifier group;
c. in the sample set D, a classifier whose classification accuracy for every class is greater than 90% is trained with a two-class training method; this classifier is called the high-performance classifier EC;
d. for a protein antigen whose epitopes are unknown, prediction is performed according to the following method:
① the primary sequence of the antigen protein is divided according to the "reference length" into a set of sequence fragments SSD, and for each fragment the hydrophobicity and accessibility of each amino acid, together with their mean values over three adjacent amino acids, form a feature matrix used as prediction input;
② prediction is first carried out in sequence with the trained classifiers: the first classifier C1 and the second classifier C2 each classify the set SSD, and the fragments for which the first classifier C1 predicts class one while the second classifier C2 predicts class three form the set ERD1 (ERDi denotes the set of fragments for which the i-th classifier Ci predicts class one while the (i+1)-th classifier Ci+1 predicts class three, 1 ≤ i ≤ n-1); then the third classifier C3 classifies the set SSD - ERD1, and the fragments for which the second classifier C2 predicts class one while the third classifier C3 predicts class three form the set ERD2; and so on, until the n-th classifier Cn classifies the set SSD minus the union of all fragments already selected by the first n-1 classifiers, and the fragments for which the (n-1)-th classifier Cn-1 predicts class one while the n-th classifier Cn predicts class three form the set ERDn-1; the first candidate epitope set (first candidate set) FCS is the union of all ERDi; the classifier EC then classifies the set SSD - FCS, and the fragments for which the EC prediction result is class one form the second candidate epitope set (second candidate set) SCS;
③ each sequence fragment is scored according to the propensity scoring method, and the sequence fragments in the first candidate epitope set FCS and the second candidate epitope set SCS are ranked by score, with higher-scoring fragments ranked first.
2. The method for predicting a protein epitope according to claim 1, wherein the specific operation of the "four-class training method" is as follows:
the sample set correctly identified by the i-th classifier Ci is denoted Di, and the epitope and non-epitope data subsets of Di are used for sampling; when the (i+1)-th classifier Ci+1 is trained, the data that the i-th classifier Ci cannot correctly identify are listed, according to whether they are epitope or non-epitope samples, as the class-one epitope samples and the class-two non-epitope samples of the new training sample set, and data randomly drawn from the correctly identified epitope and non-epitope subsets are listed as the class-three epitope samples and the class-four non-epitope samples; the samples are then trained with a four-class learning algorithm to obtain the (i+1)-th classifier Ci+1.
3. The method for predicting a protein epitope according to claim 1, wherein said "training termination rule" is as follows:
let N = number(D) be the total number of elements in the sample set, Ni = number(Di) the total number of elements in the set Di, R the correct ratio of the joint prediction of classifier Ci and classifier Ci+1, and P the total prediction ratio after the (i+1)-th training; R is computed from the third training onward; the termination parameter terminate returns its value by comparing R and P against thresholds that are relaxed in three stages: one set of cut-offs applies while the number of trainings is at most 4, a second while it is between 5 and 7, and a third once it reaches 8;
if the termination parameter terminate returns 0, training ends; if it returns 1, training continues.
4. The method for predicting a protein epitope according to claim 1, wherein the "appropriate addition method" comprises:
the sample set correctly identified by the i-th classifier Ci, i = 1, 2, ..., n, is denoted Di;
when i = 1, D1 contains two classes of samples; after the first classifier C1 is obtained, class-one and class-two data are selected from the set D1 according to the following rules to form a new training sample set:
let G1 = D - D1, and let g1(1) and g1(2) denote the numbers of elements of the epitope and non-epitope sample subsets of G1, i.e. of its class-one and class-two data;
when the prescribed multiple of g1(1) and g1(2) does not exceed the corresponding subset sizes of D1, the numbers of class-one and class-two data randomly selected from D1 are that multiple of g1(1) and g1(2), and at the same time the corresponding number of data is randomly selected from the remaining sets, together forming the new training sample set;
when the multiple exceeds those subset sizes, the full class-one and class-two subsets of D1 are selected, and at the same time the corresponding number of data is randomly selected from the remaining sets, together forming the new training sample set;
when i = 2, ..., n, Di contains four classes of samples, denoted by the subsets Di(1), Di(2), Di(3) and Di(4), whose element counts are ni(1), ni(2), ni(3) and ni(4); let the set of data that cannot be correctly identified be Gi, and let gi(1) and gi(2) denote the numbers of elements of the class-one and class-two sample subsets of Gi; class-three and class-four data are then selected from the set Di according to the following rule to form a new training sample set:
when the prescribed multiple of gi(1) and gi(2) does not exceed ni(3) and ni(4), the numbers of class-three and class-four data randomly selected from Di are that multiple of gi(1) and gi(2), and at the same time the corresponding number of data is randomly selected from the remaining sets, together forming the new training sample set;
5. The method for predicting a protein epitope according to claim 1, wherein the "propensity scoring method" comprises:
in the data set of the antigen epitope, the frequency of occurrence of any type of combination of three consecutive amino acids in the epitope is calculated by the following formula:
in the formula, AAx, AAy, AAz are any of the 20 amino acids; AAx-AAy-AAz denotes any type of combination of three consecutive amino acids; F(AAx-AAy-AAz) denotes the frequency with which combinations of this type occur in the epitopes; N(AAx-AAy-AAz) denotes the number of times the combination occurs; N(AAx), N(AAy) and N(AAz) are the total numbers of occurrences of the amino acids AAx, AAy and AAz; and N(AAx-AAy), N(AAy-AAz) are the total numbers of occurrences of the amino acid combinations AAx-AAy and AAy-AAz;
if the prediction window is k, the propensity score of any sequence fragment into which the primary sequence of the antigenic protein is divided is the sum of F over the k - 2 consecutive three-amino-acid combinations contained in the fragment: S = F(AA1-AA2-AA3) + F(AA2-AA3-AA4) + ... + F(AAk-2-AAk-1-AAk).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710516045.8A CN107341363B (en) | 2017-06-29 | 2017-06-29 | Prediction method of protein epitope |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710516045.8A CN107341363B (en) | 2017-06-29 | 2017-06-29 | Prediction method of protein epitope |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107341363A CN107341363A (en) | 2017-11-10 |
CN107341363B true CN107341363B (en) | 2020-09-22 |
Family
ID=60219158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710516045.8A Active CN107341363B (en) | 2017-06-29 | 2017-06-29 | Prediction method of protein epitope |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107341363B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326324B (en) * | 2018-09-30 | 2022-01-25 | 河北省科学院应用数学研究所 | Antigen epitope detection method, system and terminal equipment |
CN110060738B (en) * | 2019-04-03 | 2021-10-22 | 中国人民解放军军事科学院军事医学研究院 | Method and system for predicting bacterial protective antigen protein based on machine learning technology |
CN110310708A (en) * | 2019-06-18 | 2019-10-08 | 广东省生态环境技术研究所 | A method of building alienation arsenic reductase enzyme protein database |
CN111429965B (en) * | 2020-03-19 | 2023-04-07 | 西安交通大学 | T cell receptor corresponding epitope prediction method based on multiconnector characteristics |
CN113838523A (en) * | 2021-09-17 | 2021-12-24 | 深圳太力生物技术有限责任公司 | Antibody protein CDR region amino acid sequence prediction method and system |
CN114242169B (en) * | 2021-12-15 | 2023-10-20 | 河北省科学院应用数学研究所 | Antigen epitope prediction method for B cells |
CN116386712B (en) * | 2023-02-20 | 2024-02-09 | 北京博康健基因科技有限公司 | Epitope prediction method and device based on antigen protein dynamic space structure |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521527A (en) * | 2011-12-12 | 2012-06-27 | 同济大学 | Method for predicting space epitope of protein antigen according to antibody species classification |
EP2842068A1 (en) * | 2012-04-24 | 2015-03-04 | Laboratory Corporation of America Holdings | Methods and systems for identification of a protein binding site |
CN105524984A (en) * | 2014-09-30 | 2016-04-27 | 深圳华大基因科技有限公司 | Method and equipment for neoantigen epitope prediction |
CN105868583A (en) * | 2016-04-06 | 2016-08-17 | 东北师范大学 | Method for predicting epitope through cost-sensitive integrating and clustering on basis of sequence |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8121797B2 (en) * | 2007-01-12 | 2012-02-21 | Microsoft Corporation | T-cell epitope prediction |
-
2017
- 2017-06-29 CN CN201710516045.8A patent/CN107341363B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521527A (en) * | 2011-12-12 | 2012-06-27 | 同济大学 | Method for predicting space epitope of protein antigen according to antibody species classification |
EP2842068A1 (en) * | 2012-04-24 | 2015-03-04 | Laboratory Corporation of America Holdings | Methods and systems for identification of a protein binding site |
CN105524984A (en) * | 2014-09-30 | 2016-04-27 | 深圳华大基因科技有限公司 | Method and equipment for neoantigen epitope prediction |
CN105868583A (en) * | 2016-04-06 | 2016-08-17 | 东北师范大学 | Method for predicting epitope through cost-sensitive integrating and clustering on basis of sequence |
Non-Patent Citations (3)
Title |
---|
Prediction of CTL epitopes using QM, SVM and ANN techniques; Manoj Bhasin et al.; Vaccine; 2004-03-05; figure 1, page 3196 *
Research on linear B-cell epitope prediction based on PCA and SVM; Dong Jiaojiao; China Master's Theses Full-text Database, Medicine & Health Sciences; 2015-12-15; chapter 3, pages 12-19 *
Research on conformational B-cell epitope prediction methods based on information fusion and computational intelligence; Zhang Chunhua; China Doctoral Dissertations Full-text Database, Medicine & Health Sciences; 2017-02-15 (No. 2); pages E059-129 *
Also Published As
Publication number | Publication date |
---|---|
CN107341363A (en) | 2017-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107341363B (en) | Prediction method of protein epitope | |
CN104966104B (en) | A kind of video classification methods based on Three dimensional convolution neutral net | |
CN110070909B (en) | Deep learning-based multi-feature fusion protein function prediction method | |
Qi et al. | Random forest similarity for protein-protein interaction prediction from multiple sources | |
CN103559504B (en) | Image target category identification method and device | |
CN109994151B (en) | Tumor driving gene prediction system based on complex network and machine learning method | |
CN105930688B (en) | Based on the protein function module detection method for improving PSO algorithms | |
CN106055928B (en) | A kind of sorting technique of macro genome contig | |
CN113436684B (en) | Cancer classification and characteristic gene selection method | |
Rasheed et al. | Metagenomic taxonomic classification using extreme learning machines | |
CN106548041A (en) | A kind of tumour key gene recognition methods based on prior information and parallel binary particle swarm optimization | |
Zhang et al. | Protein family classification from scratch: a CNN based deep learning approach | |
CN105139031A (en) | Data processing method based on subspace clustering | |
WO2024045989A1 (en) | Graph network data set processing method and apparatus, electronic device, program, and medium | |
CN106951728B (en) | Tumor key gene identification method based on particle swarm optimization and scoring criterion | |
CN109376790A (en) | A kind of binary classification method based on Analysis of The Seepage | |
CN107463799B (en) | Method for identifying DNA binding protein by interactive fusion feature representation and selective integration | |
CN108595909A (en) | TA targeting proteins prediction techniques based on integrated classifier | |
CN113160886B (en) | Cell type prediction system based on single cell Hi-C data | |
CN108052796B (en) | Global human mtDNA development tree classification query method based on ensemble learning | |
CN108388769A (en) | Protein Functional Module Identification Method Based on Edge-Driven Label Propagation Algorithm | |
CN106404878A (en) | Protein tandem mass spectrometry identification method based on multiple omics abundance information | |
CN114999566A (en) | Drug repositioning method and system based on word vector characterization and attention mechanism | |
CN115662504A (en) | Multi-angle fusion-based biological omics data analysis method | |
Mahatma et al. | Prediction and functional characterization of transcriptional activation domains |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||