CN107341363B - Prediction method of protein epitope - Google Patents
- Publication number: CN107341363B (application CN201710516045.8A)
- Authority
- CN
- China
- Prior art keywords
- classifier
- training
- epitope
- data
- class
- Prior art date
- Legal status: Active
Classifications
- G: PHYSICS
- G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B: BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
A method for predicting protein epitopes. The method collects experimentally verified epitope sequences and the corresponding protein sequences from professional databases, constructs positive and negative sample sets for learning and training, and gathers physicochemical property features of the amino acids. A group of classifiers with complementary prediction ability and an independent high-performance classifier are then trained on the sample set with a machine learning algorithm. Finally, the complementary classifier group yields a first candidate epitope set, the high-performance classifier yields a second candidate epitope set, and the sequences in the candidate sets are scored and ranked by a propensity-scoring method. Building on a prediction model with a multilayer classification structure, the invention uses several classifiers with complementary ability to predict protein epitopes cooperatively. The method can markedly improve the accuracy of protein-epitope prediction and provides an effective way to find epitopes accurately and quickly.
Description
Technical Field
The invention relates to a method for accurately and rapidly predicting protein epitopes, and belongs to the field of biotechnology.
Background
An epitope is the basis of a protein's antigenicity, and mapping epitopes accurately and in detail not only aids basic immunological research but is also important for designing bioactive drugs and epitope vaccines. In the immune system, B cells and T cells act together in the body's second line of defense, the process of acquired immunity: non-self antigens are recognized during immune presentation, and once an invading antigen is found, the two cell types mount their respective immune responses.
Traditionally, epitope positions have been determined experimentally, for example by X-ray diffraction; such methods are complex and labor-intensive. With the development of computer technology and the continuing growth of biological databases, the mainstream technical route has become to summarize the sequence and structural features of epitopes from existing data, screen and predict candidate epitopes with machine learning algorithms, and then verify the predictions experimentally. This route greatly reduces cost and improves efficiency.
Computational epitope prediction fuses multiple feature parameters (such as hydrophobicity, hydrophilicity, accessibility, variability, and antigenicity) derived from the physicochemical properties of amino acids. Machine learning algorithms are widely used in epitope prediction for their accuracy and efficiency; predicting epitopes with them mainly involves data collection and processing, model building, parameter optimization, and epitope prediction. The common algorithms include Support Vector Machines (SVM), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN). These algorithms have improved prediction, but a single algorithm rarely achieves high accuracy, and training samples are often selected unscientifically. Current research at home and abroad therefore improves prediction performance mainly through combined models with complementary prediction ability and scientifically constructed sample sets. Most such work searches for complementary classifier combinations by experimenting with combinations of existing prediction tools; although this improves performance to some extent, no more effective prediction method has yet been found.
Disclosure of Invention
The invention aims to address the defects of the prior art by providing a method for predicting protein epitopes, offering an effective way to find epitopes accurately and quickly.
The problems of the invention are solved by the following technical scheme:
A method for predicting protein epitopes: the method collects experimentally verified epitope sequences and the related protein sequences from professional databases, constructs positive and negative sample sets for learning and training, and uses the physicochemical properties of amino acids as features for learning and prediction; a group of classifiers with complementary prediction ability and a high-performance classifier are then trained on the sample set with a machine learning algorithm; finally, the complementary prediction classifier group yields a first candidate epitope set, the high-performance classifier yields a second candidate epitope set, and the sequences in the candidate sets are scored and ranked by a propensity-scoring method;
the prediction is carried out according to the following steps:
a. Data acquisition: epitope data are selected from the IEDB database, and the primary sequences of the proteins containing these epitope samples are retrieved from the UniProt protein database; the epitope sample data are completed and non-epitope sample data are extracted from the same sequences, giving positive and negative sample sets for learning and training. For each sequence in a sample, the physicochemical features of every amino acid (hydrophobicity, accessibility, and the like) together with the mean hydrophobicity and accessibility of each three adjacent amino acids form a feature matrix that serves as the training input;
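As an illustration of the feature-matrix step, the sketch below builds one row per residue from hydrophobicity, accessibility, and their means over each three-residue window. The patent does not name the property scales it uses, so the Kyte-Doolittle hydrophobicity values and the flat placeholder accessibility values here are assumptions, as are the function names.

```python
# One common hydrophobicity scale (Kyte-Doolittle); the patent does not
# specify which published scale it uses.
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}
# Hypothetical relative solvent-accessibility values (placeholder numbers).
ACCESSIBILITY = {aa: 0.5 for aa in KYTE_DOOLITTLE}

def window_mean(values, i):
    """Mean of a property over residues i-1..i+1, clipped at the ends."""
    lo, hi = max(0, i - 1), min(len(values), i + 2)
    win = values[lo:hi]
    return sum(win) / len(win)

def feature_matrix(seq):
    """One row per residue: [hydro, access, hydro 3-mean, access 3-mean]."""
    hyd = [KYTE_DOOLITTLE[aa] for aa in seq]
    acc = [ACCESSIBILITY[aa] for aa in seq]
    return [
        [hyd[i], acc[i], window_mean(hyd, i), window_mean(acc, i)]
        for i in range(len(seq))
    ]
```

A 20-residue sample thus yields a 20 x 4 matrix; any additional physicochemical properties would add further columns in the same way.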
b. Training of the complementary prediction classifier group:
① Training is performed on sample set D with a two-class training method; when the classification accuracy exceeds λ, the first classifier C1 is obtained. Here λ is the target classification accuracy of the trained classifiers; its value can be set according to the actual situation, in the range 50% ≤ λ < 100%. The subset of D that C1 correctly identifies is denoted D1 (in general, Di, 1 ≤ i ≤ n, denotes the sample set that classifier Ci can correctly identify in its training set). Following the four-class training method and the appropriate-increment method, a new training sample set is built from the epitope and non-epitope samples in D1 and D - D1; learning is performed on this new set, and when the classification accuracy exceeds λ the second classifier C2 is obtained. The subset of D that C2 correctly identifies is denoted D2, and the intersection of D1 and D2 is denoted D′1 (in general, D′i, 1 ≤ i ≤ n-1, denotes the sample set correctly identified by both the i-th and the (i+1)-th classifiers). Whether training continues is then decided by the training-termination rule;
② When training is to continue, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - D2 and D2 - D′1; learning is performed on this new set, and when the classification accuracy exceeds λ the third classifier C3 is obtained. The subset of D - D′1 that C3 correctly identifies is denoted D3, and the intersection of D2 and D3 is denoted D′2. Whether training continues is again decided by the training-termination rule. After the (n-1)-th classifier Cn-1 has been obtained and training is to continue, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - Dn-1 and Dn-1 - D′n-2; learning is performed on this new set, and when the classification accuracy exceeds λ the n-th classifier Cn is obtained, the subset it correctly identifies being denoted Dn. Training proceeds in this manner until it is stopped, yielding a group of classifiers with complementary classification ability, i.e., the complementary prediction classifier group;
c. On sample set D, a classifier whose classification accuracy exceeds 90% for each class is trained with a two-class training method; this classifier is called the high-performance classifier EC;
d. For protein antigens with unknown epitopes, prediction is performed according to the following method:
① The primary sequence of the antigen protein is divided into a set SSD of sequence fragments according to the "reference length"; for each fragment, a feature matrix is formed from the hydrophobicity and accessibility of each amino acid and their means over each three adjacent amino acids, serving as the prediction input;
② The trained classifiers first predict in turn. The first classifier C1 and the second classifier C2 each classify the set SSD; the fragments that C1 predicts as class one and C2 predicts as class three form the set ERD1 (in general, ERDi, 1 ≤ i ≤ n-1, is the set of fragments that the i-th classifier Ci predicts as class one and the (i+1)-th classifier Ci+1 predicts as class three). The third classifier C3 then classifies the set SSD - ERD1; the fragments that C2 predicts as class one and C3 predicts as class three form ERD2. This continues until, on the set SSD minus the union of all prediction results of the first n-1 classifiers up to Cn-1, the n-th classifier Cn performs classification; the fragments that Cn-1 predicts as class one and Cn predicts as class three form ERDn-1. The first candidate epitope set FCS is the union of ERD1 through ERDn-1. The classifier EC then classifies the set SSD - FCS, and the fragments that EC predicts as class one form the second candidate epitope set SCS;
③ According to the propensity-scoring method, each sequence fragment is scored, and the fragments in the first candidate epitope set FCS and the second candidate epitope set SCS are ranked by score, highest first.
In the above prediction method of protein epitopes, the specific operation of the four-class training method is as follows:
Let Di be the sample set correctly identified by the i-th classifier Ci, with its epitope and non-epitope data forming two subsets. When the (i+1)-th classifier Ci+1 is trained, the data that Ci could not correctly identify are listed, by their epitope and non-epitope labels, as the class-one epitope samples and class-two non-epitope samples of a new training sample set; data randomly drawn from the epitope and non-epitope subsets of Di are listed as the class-three epitope samples and class-four non-epitope samples. The new set is then trained with a four-class learning algorithm to obtain the (i+1)-th classifier Ci+1.
In the above prediction method, the training-termination rule is as follows:
Let N = number(D) be the total number of elements in the sample set and Ni = number(Di) the number of elements in Di; Ri is the correct ratio of the joint prediction of classifiers i and i+1, and R is the total prediction ratio after the (i+1)-th training (the defining formulas appear as images in the original patent and are omitted here). R is first calculated before the 3rd training. When the number of training rounds is at most 4, the return value of the termination parameter terminate is given by a first threshold test; when it is between 5 and 7, by a second; and when it is 8 or more, by a third (formulas likewise omitted). If terminate returns 0, training ends; if it returns 1, training continues.
In the above prediction method, the appropriate-increment method for constructing a new training sample set is as follows:
let i the i-th classifier Ci1, 2, n, the correctly identified sample set is Di;
When i is 1, D1Two kinds of samples are shared, when the first classifier C is obtained1Then, from set D according to the following rule1Selecting data of a first class and data of a second class to form a new training sample set:
let G1=D-D1,Represents G1The number of elements of the subset of the mesopic and non-epitopic samples,are respectively a set G1Class one, class two data in (1);
when in useFrom the set D1The quantity of the randomly selected data of the first category and the second category isSimultaneous slave aggregationRandom selectionForming a new training sample set by the data;
when in useFrom the set D1The number of the data of the selected category I and the category II isSimultaneous slave aggregationRandom selectionForming a new training sample set by the data;
when i is 2iIn which there are four types of samples, respectively using subsetsShow, byRespectively represent collectionsThe number of elements in (1) is set as the data set which cannot be correctly identifiedBy usingRepresents GiThe element numbers of the middle class one and class two sample subset,are respectively a set GiAccording to the following rule, the data of class one and class two in (1) are collected from the set DiSelecting data of a third category and data of a fourth category to form a new training sample set:
when in useFrom the set DiThe number of randomly selected data of the three and four categories isSimultaneous slave aggregationRandom selectionForming a new training sample set by the data;
when in useThen, selected set DiThe number of the selected data of the category three and the category four isSimultaneous slave aggregationRandom selectionEach data constitutes a new set of training samples.
In the above prediction method, the propensity-scoring method is as follows:
In the epitope data set, the frequency of occurrence of any combination of three consecutive amino acids in the epitopes is calculated by a formula (given as an image in the original patent and omitted here) whose terms are: AAx, AAy, AAz, each any of the 20 amino acids; AAx-AAy-AAz, any combination of three consecutive amino acids; the frequency of occurrence of that combination in the epitopes; the number of times the combination occurs; the total occurrence counts of the individual amino acids AAx, AAy, AAz; and the total occurrence counts of the two-amino-acid combinations AAx-AAy and AAy-AAz.

If the prediction window is k, the propensity score of any sequence fragment into which the primary sequence of the antigen protein is divided is computed from these frequencies (the scoring formula is likewise omitted).
according to the method, on the basis of constructing a prediction model of a multilayer classification structure, a plurality of classifiers with complementary capacity are utilized to carry out cooperative prediction on the protein epitope, prediction experiments are carried out on a plurality of blind data sets, the prediction accuracy rate in the experiments is higher than 70%, and therefore the method can obviously improve the accuracy of the prediction of the protein epitope and provides an effective method for accurately and quickly finding the epitope.
Drawings
The invention will be further explained with reference to the drawings.
FIG. 1 is a "training flow chart of a complementary classifier set" for the epitope prediction method of the present invention;
FIG. 2 is a "epitope prediction process diagram" used in the method of predicting an epitope according to the present invention.
In the figures and in the text, the symbols are: D is the sample set; Ci is the i-th classifier; Di is the sample set correctly identified by classifier Ci; EC is the high-performance classifier; SSD is the sequence-fragment set; FCS is the first candidate epitope set; SCS is the second candidate epitope set; N = number(D) is the total number of elements in the sample set; Ni = number(Di) is the number of elements in Di; Ri is the correct ratio of the joint prediction of classifiers i and i+1; R is the total prediction ratio after the (i+1)-th training; terminate is the termination parameter; AAx, AAy, AAz are each any of the 20 amino acids; and AAx-AAy-AAz denotes any combination of three consecutive amino acids, with its occurrence frequency and the occurrence counts of the constituent amino acids and pairs as defined above.
IEDB refers to the professional database at http://www.iedb.org/; UniProt refers to the protein database at http://www.uniprot.org/.
Detailed Description
Epitope prediction is generally realized with a binary classifier; constructing classifiers with complementary prediction ability breaks through this habitual limitation. The construction starts from a binary classifier: sample data are recombined according to its classification results, and a new classifier is trained on the new sample set. Our research found that building a series of classifiers that exploit the prediction differences between adjacent classifiers in the group achieves gradual optimization; the proposed mechanism for training a complementary classifier group plays an important role in improving epitope-prediction performance.
In order to clearly understand the technical contents of the present invention, the present invention will be described in detail with reference to fig. 1 and 2. It is to be understood that the examples are illustrative of the invention and are not to be construed as limiting the invention.
1. Data acquisition
Epitope sequence data are collected from the IEDB (http://www.iedb.org/) epitope database as training positive samples; the database contains many experimentally verified epitope records covering humans, non-human primates, and other species. The primary sequence of the protein corresponding to each selected epitope sample is found in the UniProt (http://www.uniprot.org/) protein database, and sequence fragments not marked as epitopes (i.e., non-epitope sequences) are extracted from it as training negative samples. In our experiments, 800 protein sequences were extracted in total, from which 5120 continuous epitope sequences and 5200 non-epitope sequence fragments were collected. A reference length of 20 amino acids is used for each sample; non-epitope samples are taken directly from the protein primary sequence as 20-amino-acid fragments not marked as epitopes.
Because epitope sequences differ in the number of amino acids they contain, epitope samples are made to meet the "reference length" as follows. For an epitope sequence whose amino-acid count falls short of 20 by an even number, the same number of amino acids is taken from each side of the protein sequence in which it lies as contiguous supplements; if the shortfall is odd, one more amino acid is supplemented from the front end of the protein sequence than from the rear end. For an epitope sequence that exceeds 20 by an even number, the same number of amino acids is removed from both sides; if the excess is odd, one more amino acid is removed from the front end than from the rear end. For each sample sequence, a feature matrix is formed from the hydrophobicity and accessibility of each amino acid in order, together with the mean hydrophobicity and accessibility of every three adjacent amino acids; this feature matrix is the input for training and prediction.
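The "reference length" rules above are deterministic and can be sketched directly. The helper below (a hypothetical name) assumes the epitope lies far enough from the protein ends that flanking residues exist on both sides.

```python
def normalize_to_reference(protein, start, end, ref_len=20):
    """Return the ref_len-long window around the epitope protein[start:end],
    following the padding/trimming rules in the text: an even difference is
    split equally between the two sides; an odd difference puts the extra
    residue on the front (N-terminal) side. Assumes the epitope is far
    enough from the protein ends."""
    length = end - start
    diff = ref_len - length
    if diff > 0:                      # too short: extend outwards
        front, back = (diff + 1) // 2, diff // 2
        start, end = start - front, end + back
    elif diff < 0:                    # too long: trim inwards
        excess = -diff
        front, back = (excess + 1) // 2, excess // 2
        start, end = start + front, end - back
    return protein[start:end]
```

For example, an 18-residue epitope gains one flanking residue on each side, while a 21-residue epitope loses a single residue from its front end.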
2. Model building
In the sample set, the method for performing the training of the complementary classifier is as follows:
Let the sample set be D. Training is performed with a two-class training method; when the classification accuracy exceeds λ, classifier C1 is obtained (λ is the target classification accuracy of the trained classifiers; its value can be set according to the actual situation, in the range 50% ≤ λ < 100%). The subset of D that C1 correctly identifies is denoted D1 (in general, Di, 1 ≤ i ≤ n, denotes the sample set that classifier Ci can correctly identify in its training set). Following the four-class training method and the appropriate-increment method, a new training sample set is built from the epitope and non-epitope samples in D1 and D - D1, and learning is performed on it; when the classification accuracy exceeds λ, classifier C2 is obtained. The subset of D that C2 correctly identifies is denoted D2 (here and below, the recognized class one and class three are epitopes, and the recognized class two and class four are non-epitopes). The intersection of D1 and D2 is denoted D′1 (in general, D′i, 1 ≤ i ≤ n-1, denotes the sample set correctly identified by both the i-th and the (i+1)-th classifiers). A judgment is then made by the training-termination rule: if the termination parameter terminate returns 0, training ends; if it returns 1, training continues.

When training continues, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - D2 and D2 - D′1; training on this set, when the classification accuracy exceeds λ, yields classifier C3. The subset of D - D′1 that C3 correctly identifies is denoted D3, and the intersection of D2 and D3 is denoted D′2. The termination rule is consulted again. When training continues after the (n-1)-th classifier has been obtained, the four-class training method and the appropriate-increment method build a new training sample set from the epitope and non-epitope samples in D - Dn-1 and Dn-1 - D′n-2; training on it, when the classification accuracy exceeds λ, yields classifier Cn, the subset it correctly identifies being denoted Dn. Training proceeds in this way until it stops. The result is a group of classifiers with complementary classification ability, the complementarity holding only between adjacent classifiers in the group. Any machine learning algorithm can be used for the two-class and four-class training in the method, as long as it meets the relevant rules and the epitope classification requirements of the method.
The specific content of the four-class training method is as follows:
After classifier C1 is obtained in training, the two classes of sample data it correctly identifies participate as new sample data in the training of the next classifier; that is, every subsequent classifier is trained with four classes. Let D1 be the sample set correctly identified by classifier C1, with its class-one (epitope) and class-two (non-epitope) data forming two subsets. When the second classifier is trained, the epitope and non-epitope samples in the data set D - D1 that C1 could not correctly identify are listed as the class-one and class-two samples of the new training sample set, while the epitope and non-epitope data correctly identified by C1 are listed as class-three and class-four samples; appropriate amounts of data are drawn according to the appropriate-increment method to form the new training set, which is then trained with a four-class learning algorithm to obtain the classifier. From the training of the third classifier onward, the data the previous classifier could not correctly identify are listed as class-one and class-two samples of the new training set by their epitope and non-epitope labels, part of the data the previous classifier identified correctly is drawn and listed as class-three and class-four samples, appropriate amounts are taken by the appropriate-increment method, and the four-class learning algorithm is again used to obtain the classifier.
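The construction of the four-class training set can be sketched as follows. Samples are assumed to be (data, is_epitope) pairs; the number of class-three/class-four samples drawn follows the appropriate-increment rule, whose exact expression is a formula image in the patent, so a caller-supplied count n_draw stands in for it here. All names are hypothetical.

```python
import random

def build_four_class_set(misidentified, correct_epitopes, correct_non_epitopes,
                         n_draw, rng=None):
    """Build the four-class training set used to train classifier C(i+1):
    class 1 / class 2 = epitope / non-epitope samples the previous
    classifier got wrong; class 3 / class 4 = samples randomly drawn from
    the epitope / non-epitope subsets it got right. n_draw stands in for
    the count prescribed by the appropriate-increment rule."""
    rng = rng or random.Random(0)
    labelled = []
    for sample, is_epitope in misidentified:
        labelled.append((sample, 1 if is_epitope else 2))
    for sample in rng.sample(correct_epitopes,
                             min(n_draw, len(correct_epitopes))):
        labelled.append((sample, 3))
    for sample in rng.sample(correct_non_epitopes,
                             min(n_draw, len(correct_non_epitopes))):
        labelled.append((sample, 4))
    rng.shuffle(labelled)
    return labelled
```

The resulting labelled set is what the four-class learning algorithm is trained on.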
The specific contents of constructing a new training sample set according to the appropriate-increment method are as follows:
After classifier Ci (i = 2, …, n) is obtained, a new sample set for the next round of training is constructed according to the numbers of samples in the different classes. "Sample increment" here means that the relevant classes of samples correctly identified by the previous classifier participate as new sample classes in the training of the next classifier. In general, the number of samples a classifier recognizes in each class is larger than the number it cannot correctly recognize, so the sample counts in the data sets must be compared when constructing the new training sample set. Let Di be the sample set correctly identified by classifier Ci, its four classes of data forming four subsets with their respective element counts, and let Gi be the data set that cannot be correctly identified, with the element counts of its epitope and non-epitope subsets giving the class-one and class-two data of Gi. Class-three and class-four data are then selected from Di by the following rule (the threshold conditions and selection quantities are given by formulas that appear as images in the original patent and are omitted here; when a computed multiple is not an integer, it is rounded to the nearest integer, the same below): under the first condition, prescribed numbers of class-three and class-four data are randomly selected from Di while a corresponding number of data are randomly selected from Gi, together forming the new training sample set; under the opposite condition, the alternative prescribed numbers are selected from Di and Gi.

In particular, after the first classifier C1 is obtained, class-one and class-two data are selected from D1 by the analogous rule: let G1 = D - D1, with the element counts of its epitope and non-epitope subsets giving the class-one and class-two data of G1; under the first condition, prescribed numbers of class-one and class-two data are randomly selected from D1 while a corresponding number of data are randomly selected from G1, together forming the new training sample set, and otherwise the alternative prescribed numbers are selected from D1 and G1.
The specific contents of the training termination rule are as follows:
Let N = number(D) be the total number of elements in the sample set D, and Ni = number(Di) the total number of elements in the set Di. R denotes the correct ratio of the joint prediction of classifier i and classifier i+1, and P denotes the total prediction ratio.
R is computed from the third training onward. The termination parameter terminate then returns 0 or 1 by comparing R and the total prediction ratio against thresholds that are relaxed in three stages: one set of cut-offs applies while the number of trainings is at most 4, a second while it is between 5 and 7, and a third once it reaches 8.
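The three-stage structure of the rule can be sketched as follows; the numeric cut-offs and the function signature are placeholders we assume for illustration, not the patented return-value formulas.

```python
def terminate(num_trainings, r, p, cutoffs=(0.6, 0.7, 0.8)):
    """Return 1 to continue training, 0 to stop (structural sketch).

    r: correct ratio of the joint prediction of the last two classifiers.
    p: total prediction ratio so far.
    cutoffs: assumed thresholds for the three training-count stages
             (<= 4 trainings, 5-7 trainings, >= 8 trainings).
    """
    if num_trainings <= 4:
        t = cutoffs[0]
    elif num_trainings <= 7:
        t = cutoffs[1]
    else:
        t = cutoffs[2]
    # Continue only while the joint ratio clears the stage threshold
    # and the sample set is not yet fully covered.
    return 1 if (r >= t and p < 1.0) else 0
```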
For the training of the high-performance classifier EC, only two-class training in the sample set D is required, until the classification accuracy reaches 90%.
3. Epitope prediction
For a protein antigen whose epitope positions are unknown, epitope prediction is carried out as follows. In the first step, the primary sequence of the antigen protein is divided according to the "reference length" into a set of sequence fragments SSD, and for each fragment a feature matrix is computed as prediction input according to the "data acquisition" method of step 1. In the second step, the trained complementary prediction classifier group predicts in sequence: classifier C1 and classifier C2 each classify the set SSD, and the fragments for which the C1 prediction result is class one while the C2 prediction result is class three form the set ERD1 (ERDi denotes the set of fragments for which the prediction result of classifier Ci is class one while that of classifier Ci+1 is class three, 1 ≤ i ≤ n-1); then classifier C3 classifies the set SSD - ERD1, and the fragments for which the C2 prediction result is class one while the C3 prediction result is class three form the set ERD2; this rule is followed until the last classifier Cn classifies the set SSD minus the union of all fragments already selected by the first n-1 classifiers, and the fragments for which the Cn-1 prediction result is class one while the Cn prediction result is class three form the set ERDn-1. The first candidate epitope set (first candidate set) FCS is the union of all ERDi, 1 ≤ i ≤ n-1. Finally, the classifier EC classifies the set SSD - FCS, and the fragments for which the EC prediction result is class one form the second candidate epitope set (second candidate set) SCS.
And thirdly, each sequence fragment is scored according to the propensity scoring method, and the sequence fragments in the FCS and SCS sets are ranked by score, with higher-scoring fragments ranked first.
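The second step above can be sketched as a cascade; the classifier interfaces below (callables returning a class label, with the high-performance classifier EC returning 1 for an epitope) are assumed for illustration.

```python
def cascade_predict(classifiers, ec, ssd):
    """Produce the two candidate epitope sets (FCS, SCS) from fragment set SSD.

    classifiers: [C1, ..., Cn], each mapping a fragment to a label in {1,2,3,4}.
    ec:          high-performance classifier mapping a fragment to 1 or 2.
    """
    remaining = list(ssd)
    fcs = []
    for c_i, c_next in zip(classifiers, classifiers[1:]):
        # ERDi: Ci predicts class one AND C(i+1) predicts class three
        erd = [f for f in remaining if c_i(f) == 1 and c_next(f) == 3]
        fcs.extend(erd)
        remaining = [f for f in remaining if f not in erd]
    # EC scans SSD - FCS for the second candidate set
    scs = [f for f in remaining if ec(f) == 1]
    return fcs, scs
```

Propensity scoring and ranking are then applied to both returned sets.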
The specific content of the propensity scoring method is as follows:
in the data set of the antigen epitope, the frequency of occurrence of any type of combination of three consecutive amino acids in the epitope is calculated by the following formula:
where AAx, AAy, AAz are any of the 20 amino acids; AAx-AAy-AAz denotes any type of combination of three consecutive amino acids; F(AAx-AAy-AAz) denotes the frequency with which combinations of this type occur in the epitopes; N(AAx-AAy-AAz) denotes the number of times the combination occurs; N(AAx), N(AAy) and N(AAz) are the total numbers of occurrences of the amino acids AAx, AAy and AAz; and N(AAx-AAy), N(AAy-AAz) are the total numbers of occurrences of the amino acid combinations AAx-AAy and AAy-AAz.
If the prediction window is k, the propensity score of any sequence fragment into which the primary sequence of the antigenic protein is divided is the sum of F over the k - 2 consecutive three-amino-acid combinations contained in the fragment: S = F(AA1-AA2-AA3) + F(AA2-AA3-AA4) + ... + F(AAk-2-AAk-1-AAk).
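As a rough illustration of the scoring idea, the sketch below uses a plain observed relative frequency for each tripeptide and sums the k - 2 triplet frequencies of a fragment; the patent's frequency formula additionally involves the single- and double-amino-acid totals listed above, which this simplification omits.

```python
from collections import Counter

def tripeptide_frequencies(epitopes):
    """Relative frequency of each three-amino-acid combination in the
    epitope data set (simplified stand-in for the patented formula)."""
    counts = Counter()
    for seq in epitopes:
        for j in range(len(seq) - 2):
            counts[seq[j:j + 3]] += 1
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()}

def propensity_score(fragment, freq):
    """Score a fragment of window length k as the sum of its k - 2
    consecutive tripeptide frequencies (unseen triplets contribute 0)."""
    return sum(freq.get(fragment[j:j + 3], 0.0)
               for j in range(len(fragment) - 2))
```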
4. Accuracy evaluation of the prediction method
The invention screened 800 antigen protein data and assembled a sample set consisting of 5120 epitope sequences and 5200 non-epitope sequences. Two algorithms, a support vector machine (SVM) and a recurrent neural network (RNN), were used for two-class and four-class training respectively, over four rounds: the first round trained classifier 1 with the SVM, the second round trained classifier 2 with the RNN, the third round trained classifier 3 with the RNN, and the fourth round trained the high-performance classifier 4 with the RNN. The sequences predicted as class one by classifier 1 and class three by classifier 2 number 3325; those predicted as class one by classifier 2 and class three by classifier 3 number 1573; the combined prediction accuracy reaches 95.6%. The accuracy of classifier 4 under five-fold cross-validation is 91%.
We collected 287 proteins outside the training samples as a blind test set containing 2000 validated epitope sequences, from which 1000 were randomly drawn for each test. The results predicted by the trained classifier group are as follows: classifier 1 and classifier 2 jointly predicted 739 epitope sequences, of which 551 were correct, an accuracy of 74.5%; classifier 2 and classifier 3 jointly predicted 492 epitope sequences, of which 327 were correct, an accuracy of 66.5%; the combined prediction accuracy is 71.3%, and the coverage of correct results reaches 87.8%. Classifier 4 predicted 190 epitope sequences, of which 75 were correct, an accuracy of 39.5%. Combining the results of the two kinds of classifiers, the coverage of correct results reaches 95.3%.
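The reported blind-test figures are internally consistent, as a quick recomputation shows (the variable names are ours):

```python
# (predicted, correct) counts reported for the blind test of 1000 epitopes
joint_12 = (739, 551)   # classifiers 1 and 2
joint_23 = (492, 327)   # classifiers 2 and 3
clf4 = (190, 75)        # high-performance classifier 4
drawn = 1000

acc_12 = joint_12[1] / joint_12[0]                                        # ~74.5%
acc_23 = joint_23[1] / joint_23[0]                                        # ~66.5%
combined_acc = (joint_12[1] + joint_23[1]) / (joint_12[0] + joint_23[0])  # ~71.3%
coverage = (joint_12[1] + joint_23[1]) / drawn                            # 87.8%
coverage_all = (joint_12[1] + joint_23[1] + clf4[1]) / drawn              # 95.3%
```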
The experimental results show that the method has high prediction accuracy, the prediction results cover most epitopes, and the method can provide an effective and scientific basis for epitope screening.
Claims (5)
1. A method for predicting a protein epitope, characterized in that: first, experimentally verified antigen epitope sequence information and the sequence information of the related proteins are collected from professional databases, a positive-and-negative sample set for learning and training is constructed, and the physicochemical properties of the amino acids are used as features for learning and prediction; then, a complementary prediction classifier group and a high-performance classifier are trained on the sample set with machine learning algorithms; finally, the complementary prediction classifier group is used to obtain a first candidate epitope set, the high-performance classifier is used to obtain a second candidate epitope set, and the sequences in the candidate epitope sets are scored and ranked with the propensity scoring method;
the prediction is carried out according to the following steps:
a. data acquisition: epitope data information is selected from the IEDB database, and the primary sequence information of the proteins containing the epitope samples is retrieved from the UniProt protein database; the epitope sample data are completed and non-epitope sample data are extracted from these proteins, constructing a positive-and-negative sample set for learning and training; for each amino acid of a sequence in the sample, physicochemical property features such as hydrophobicity and accessibility, together with the mean values of hydrophobicity and accessibility over three adjacent amino acids, form a feature matrix used as training input;
b. training of the complementary prediction classifier group:
① training is performed in the sample set D with a two-class training method, and when the classification accuracy is greater than λ, a first classifier C1 is obtained; λ denotes the classification accuracy required of a trained classifier, its specific value can be set according to the actual situation, and its range is 50% ≤ λ < 100%; the sample set in D that the first classifier C1 can correctly identify is denoted D1, where Di denotes the sample set in the training set that classifier Ci can correctly identify, 1 ≤ i ≤ n; according to the "four-class training method" and the "appropriate addition method", a new training sample set is constructed from the epitope and non-epitope samples in the sets D1 and D - D1, learning and training are performed on the new sample set, and when the classification accuracy is greater than λ, a second classifier C2 is obtained; the sample set in D that the second classifier C2 can correctly identify is denoted D2, and the intersection of the sets D1 and D2 is the set of samples that both the first and the second classifier can correctly identify (in general, the intersection of Di and Di+1 is the set of samples that both the i-th and the (i+1)-th classifier can correctly identify, 1 ≤ i ≤ n-1); whether training continues is then judged according to the training termination rule;
② when training needs to continue, according to the "four-class training method" and the "appropriate addition method", a new training sample set is constructed from the epitope and non-epitope samples in the sets D - D2 and D2 - D1, learning and training are performed on the new sample set, and when the classification accuracy is greater than λ, a third classifier C3 is obtained; the sample set in D - D1 that classifier C3 correctly identifies is denoted D3, and the intersection of the sets D2 and D3 is formed in the same way; whether training continues is then judged according to the training termination rule; after the (n-1)-th classifier Cn-1 is obtained, when training needs to continue, according to the "four-class training method" and the "appropriate addition method", a new training sample set is constructed from the epitope and non-epitope samples in the set of samples not yet correctly identified and in the set Dn-1 - Dn-2, learning and training are performed on the new sample set, and when the classification accuracy is greater than λ, an n-th classifier Cn is obtained; the sample set that classifier Cn correctly identifies is denoted Dn; proceeding in this way until training stops, a group of classifiers with complementary classification capability is obtained, namely the complementary prediction classifier group;
c. in the sample set D, a classifier whose classification accuracy for every class is greater than 90% is trained with a two-class training method; this classifier is called the high-performance classifier EC;
d. for a protein antigen whose epitopes are unknown, prediction is performed according to the following method:
① the primary sequence of the antigen protein is divided according to the "reference length" into a set of sequence fragments SSD, and for each fragment the hydrophobicity and accessibility of each amino acid, together with their mean values over three adjacent amino acids, form a feature matrix used as prediction input;
② prediction is first carried out in sequence with the trained classifiers: the first classifier C1 and the second classifier C2 each classify the set SSD, and the fragments for which the first classifier C1 predicts class one while the second classifier C2 predicts class three form the set ERD1 (ERDi denotes the set of fragments for which the i-th classifier Ci predicts class one while the (i+1)-th classifier Ci+1 predicts class three, 1 ≤ i ≤ n-1); then the third classifier C3 classifies the set SSD - ERD1, and the fragments for which the second classifier C2 predicts class one while the third classifier C3 predicts class three form the set ERD2; and so on, until the n-th classifier Cn classifies the set SSD minus the union of all fragments already selected by the first n-1 classifiers, and the fragments for which the (n-1)-th classifier Cn-1 predicts class one while the n-th classifier Cn predicts class three form the set ERDn-1; the first candidate epitope set (first candidate set) FCS is the union of all ERDi; the classifier EC then classifies the set SSD - FCS, and the fragments for which the EC prediction result is class one form the second candidate epitope set (second candidate set) SCS;
③ each sequence fragment is scored according to the propensity scoring method, and the sequence fragments in the first candidate epitope set FCS and the second candidate epitope set SCS are ranked by score, with higher-scoring fragments ranked first.
2. The method for predicting a protein epitope according to claim 1, wherein the specific operation of the "four-class training method" is as follows:
the sample set correctly identified by the i-th classifier Ci is denoted Di, and the epitope and non-epitope data subsets of Di are used for sampling; when the (i+1)-th classifier Ci+1 is trained, the data that the i-th classifier Ci cannot correctly identify are listed, according to whether they are epitope or non-epitope samples, as the class-one epitope samples and the class-two non-epitope samples of the new training sample set, and data randomly drawn from the correctly identified epitope and non-epitope subsets are listed as the class-three epitope samples and the class-four non-epitope samples; the samples are then trained with a four-class learning algorithm to obtain the (i+1)-th classifier Ci+1.
3. The method for predicting a protein epitope according to claim 1, wherein said "training termination rule" is as follows:
let N = number(D) be the total number of elements in the sample set, Ni = number(Di) the total number of elements in the set Di, R the correct ratio of the joint prediction of classifier Ci and classifier Ci+1, and P the total prediction ratio after the (i+1)-th training; R is computed from the third training onward; the termination parameter terminate returns its value by comparing R and P against thresholds that are relaxed in three stages: one set of cut-offs applies while the number of trainings is at most 4, a second while it is between 5 and 7, and a third once it reaches 8;
if the termination parameter terminate returns 0, training ends; if it returns 1, training continues.
4. The method for predicting a protein epitope according to claim 1, wherein the "appropriate addition method" comprises:
the sample set correctly identified by the i-th classifier Ci, i = 1, 2, ..., n, is denoted Di;
when i = 1, D1 contains two classes of samples; after the first classifier C1 is obtained, class-one and class-two data are selected from the set D1 according to the following rules to form a new training sample set:
let G1 = D - D1, and let g1(1) and g1(2) denote the numbers of elements of the epitope and non-epitope sample subsets of G1, i.e. of its class-one and class-two data;
when the prescribed multiple of g1(1) and g1(2) does not exceed the corresponding subset sizes of D1, the numbers of class-one and class-two data randomly selected from D1 are that multiple of g1(1) and g1(2), and at the same time the corresponding number of data is randomly selected from the remaining sets, together forming the new training sample set;
when the multiple exceeds those subset sizes, the full class-one and class-two subsets of D1 are selected, and at the same time the corresponding number of data is randomly selected from the remaining sets, together forming the new training sample set;
when i = 2, ..., n, Di contains four classes of samples, denoted by the subsets Di(1), Di(2), Di(3) and Di(4), whose element counts are ni(1), ni(2), ni(3) and ni(4); let the set of data that cannot be correctly identified be Gi, and let gi(1) and gi(2) denote the numbers of elements of the class-one and class-two sample subsets of Gi; class-three and class-four data are then selected from the set Di according to the following rule to form a new training sample set:
when the prescribed multiple of gi(1) and gi(2) does not exceed ni(3) and ni(4), the numbers of class-three and class-four data randomly selected from Di are that multiple of gi(1) and gi(2), and at the same time the corresponding number of data is randomly selected from the remaining sets, together forming the new training sample set;
5. The method for predicting a protein epitope according to claim 1, wherein the "propensity scoring method" comprises:
in the data set of the antigen epitope, the frequency of occurrence of any type of combination of three consecutive amino acids in the epitope is calculated by the following formula:
in the formula, AAx, AAy, AAz are any of the 20 amino acids; AAx-AAy-AAz denotes any type of combination of three consecutive amino acids; F(AAx-AAy-AAz) denotes the frequency with which combinations of this type occur in the epitopes; N(AAx-AAy-AAz) denotes the number of times the combination occurs; N(AAx), N(AAy) and N(AAz) are the total numbers of occurrences of the amino acids AAx, AAy and AAz; and N(AAx-AAy), N(AAy-AAz) are the total numbers of occurrences of the amino acid combinations AAx-AAy and AAy-AAz;
if the prediction window is k, the propensity score of any sequence fragment into which the primary sequence of the antigenic protein is divided is the sum of F over the k - 2 consecutive three-amino-acid combinations contained in the fragment: S = F(AA1-AA2-AA3) + F(AA2-AA3-AA4) + ... + F(AAk-2-AAk-1-AAk).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710516045.8A CN107341363B (en) | 2017-06-29 | 2017-06-29 | Prediction method of protein epitope |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710516045.8A CN107341363B (en) | 2017-06-29 | 2017-06-29 | Prediction method of protein epitope |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107341363A CN107341363A (en) | 2017-11-10 |
CN107341363B true CN107341363B (en) | 2020-09-22 |
Family
ID=60219158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710516045.8A Active CN107341363B (en) | 2017-06-29 | 2017-06-29 | Prediction method of protein epitope |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107341363B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109326324B (en) * | 2018-09-30 | 2022-01-25 | 河北省科学院应用数学研究所 | Antigen epitope detection method, system and terminal equipment |
CN110060738B (en) * | 2019-04-03 | 2021-10-22 | 中国人民解放军军事科学院军事医学研究院 | Method and system for predicting bacterial protective antigen protein based on machine learning technology |
CN110310708A (en) * | 2019-06-18 | 2019-10-08 | 广东省生态环境技术研究所 | A method of building alienation arsenic reductase enzyme protein database |
CN111429965B (en) * | 2020-03-19 | 2023-04-07 | 西安交通大学 | T cell receptor corresponding epitope prediction method based on multiconnector characteristics |
CN113838523A (en) * | 2021-09-17 | 2021-12-24 | 深圳太力生物技术有限责任公司 | Antibody protein CDR region amino acid sequence prediction method and system |
CN114242169B (en) * | 2021-12-15 | 2023-10-20 | 河北省科学院应用数学研究所 | Antigen epitope prediction method for B cells |
CN116386712B (en) * | 2023-02-20 | 2024-02-09 | 北京博康健基因科技有限公司 | Epitope prediction method and device based on antigen protein dynamic space structure |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521527A (en) * | 2011-12-12 | 2012-06-27 | 同济大学 | Method for predicting space epitope of protein antigen according to antibody species classification |
EP2842068A1 (en) * | 2012-04-24 | 2015-03-04 | Laboratory Corporation of America Holdings | Methods and systems for identification of a protein binding site |
CN105524984A (en) * | 2014-09-30 | 2016-04-27 | 深圳华大基因科技有限公司 | Method and equipment for neoantigen epitope prediction |
CN105868583A (en) * | 2016-04-06 | 2016-08-17 | 东北师范大学 | Method for predicting epitope through cost-sensitive integrating and clustering on basis of sequence |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8121797B2 (en) * | 2007-01-12 | 2012-02-21 | Microsoft Corporation | T-cell epitope prediction |
-
2017
- 2017-06-29 CN CN201710516045.8A patent/CN107341363B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102521527A (en) * | 2011-12-12 | 2012-06-27 | 同济大学 | Method for predicting space epitope of protein antigen according to antibody species classification |
EP2842068A1 (en) * | 2012-04-24 | 2015-03-04 | Laboratory Corporation of America Holdings | Methods and systems for identification of a protein binding site |
CN105524984A (en) * | 2014-09-30 | 2016-04-27 | 深圳华大基因科技有限公司 | Method and equipment for neoantigen epitope prediction |
CN105868583A (en) * | 2016-04-06 | 2016-08-17 | 东北师范大学 | Method for predicting epitope through cost-sensitive integrating and clustering on basis of sequence |
Non-Patent Citations (3)
Title |
---|
Prediction of CTL epitopes using QM, SVM and ANN techniques; Manoj Bhasin et al.; Vaccine; 2004-03-05; figure 1, page 3196 *
Research on linear B-cell epitope prediction based on PCA and SVM; Dong Jiaojiao; China Master's Theses Full-text Database, Medicine & Health Sciences; 2015-12-15; chapter 3, pages 12-19 *
Research on conformational B-cell epitope prediction methods based on information fusion and computational intelligence; Zhang Chunhua; China Doctoral Dissertations Full-text Database, Medicine & Health Sciences; 2017-02-15 (No. 2); pages E059-129 *
Also Published As
Publication number | Publication date |
---|---|
CN107341363A (en) | 2017-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107341363B (en) | Prediction method of protein epitope | |
CN104966104B (en) | A kind of video classification methods based on Three dimensional convolution neutral net | |
CN110070909B (en) | Deep learning-based multi-feature fusion protein function prediction method | |
Qi et al. | Random forest similarity for protein-protein interaction prediction from multiple sources | |
CN103559504B (en) | Image target category identification method and device | |
CN109994151B (en) | Tumor driving gene prediction system based on complex network and machine learning method | |
CN105930688B (en) | Based on the protein function module detection method for improving PSO algorithms | |
CN106055928B (en) | A kind of sorting technique of macro genome contig | |
CN113436684B (en) | Cancer classification and characteristic gene selection method | |
Rasheed et al. | Metagenomic taxonomic classification using extreme learning machines | |
CN106548041A (en) | A kind of tumour key gene recognition methods based on prior information and parallel binary particle swarm optimization | |
Zhang et al. | Protein family classification from scratch: a CNN based deep learning approach | |
CN105139031A (en) | Data processing method based on subspace clustering | |
WO2024045989A1 (en) | Graph network data set processing method and apparatus, electronic device, program, and medium | |
CN106951728B (en) | Tumor key gene identification method based on particle swarm optimization and scoring criterion | |
CN109376790A (en) | A kind of binary classification method based on Analysis of The Seepage | |
CN107463799B (en) | Method for identifying DNA binding protein by interactive fusion feature representation and selective integration | |
CN108595909A (en) | TA targeting proteins prediction techniques based on integrated classifier | |
CN113160886B (en) | Cell type prediction system based on single cell Hi-C data | |
CN108052796B (en) | Global human mtDNA development tree classification query method based on ensemble learning | |
CN108388769A (en) | Protein Functional Module Identification Method Based on Edge-Driven Label Propagation Algorithm | |
CN106404878A (en) | Protein tandem mass spectrometry identification method based on multiple omics abundance information | |
CN114999566A (en) | Drug repositioning method and system based on word vector characterization and attention mechanism | |
CN115662504A (en) | Multi-angle fusion-based biological omics data analysis method | |
Mahatma et al. | Prediction and functional characterization of transcriptional activation domains |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||