CN107341363B - Prediction method of protein epitope - Google Patents


Info

Publication number
CN107341363B
CN107341363B (application CN201710516045.8A)
Authority
CN
China
Prior art keywords
classifier
training
epitope
data
class
Prior art date
Legal status
Active
Application number
CN201710516045.8A
Other languages
Chinese (zh)
Other versions
CN107341363A (en)
Inventor
羊红光
成彬
王程
Current Assignee
Institute Of Applied Mathematics Hebei Academy Of Sciences
Original Assignee
Institute Of Applied Mathematics Hebei Academy Of Sciences
Priority date
Filing date
Publication date
Application filed by Institute Of Applied Mathematics Hebei Academy Of Sciences filed Critical Institute Of Applied Mathematics Hebei Academy Of Sciences
Priority to CN201710516045.8A
Publication of CN107341363A
Application granted
Publication of CN107341363B

Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B — BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B20/00 — ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations


Abstract

A prediction method of protein epitope. The method collects experimentally verified antigen epitope sequence information and the related protein sequence information from professional databases, constructs a positive and negative sample set for learning and training, and collects physicochemical property information of the amino acids as features. A complementary prediction classifier group and an independent high-performance classifier are then trained on the sample set with machine learning algorithms. Finally, a first candidate epitope set is obtained with the complementary prediction classifier group, a second candidate epitope set is obtained with the high-performance classifier, and the sequences in the candidate epitope sets are scored and ranked by a tendency scoring method. On the basis of a prediction model with a multilayer classification structure, the invention uses several classifiers with complementary capability to predict protein epitopes cooperatively. The method markedly improves the accuracy of protein epitope prediction and provides an effective way to find epitopes accurately and quickly.

Description

Prediction method of protein epitope
Technical Field
The invention relates to a method capable of accurately and quickly predicting a protein epitope, belonging to the technical field of biology.
Background
The epitope is the basis for recognizing the antigenicity of a protein, and drawing an accurate and detailed epitope map not only aids basic immunological research but is also of great significance for the design of bioactive drugs and epitope vaccines. In the immune system, B cells and T cells act together in the body's second line of defense, the process of "acquired immunity", whose essence is to recognize non-self antigens during immune presentation; once an invading antigen is found, the two cell types produce their respective immune effects.
The traditional approaches to determining epitope positions, X-ray diffraction and other experimental methods, are complex and involve a large workload. With the development of computer technology and the continual expansion of biological information databases, the mainstream technical route has become: summarize the sequence and structural features of antigen epitopes from existing data, screen and predict epitopes with a machine learning algorithm, and then verify them experimentally. This route greatly reduces cost and improves working efficiency.
Computational epitope prediction fuses multiple characteristic parameters based on the physicochemical properties of amino acids, such as hydrophobicity, hydrophilicity, accessibility, variability, and antigenicity. Machine learning algorithms are widely used in epitope prediction for their high accuracy and efficiency; predicting epitopes with them mainly comprises the steps of data collection and processing, model establishment, parameter optimization, and epitope prediction. Commonly used algorithms include Support Vector Machines (SVM), Hidden Markov Models (HMM), and Artificial Neural Networks (ANN). Their application has improved epitope prediction, but two problems remain: high prediction precision is difficult to obtain with a single algorithm, and the selection of training sample data is often unscientific. Current epitope prediction research, at home and abroad, improves prediction performance mainly through combined models with complementary prediction capability and scientifically constructed sample data sets. Most such work searches for classifier combinations with complementary prediction capability via combination experiments on existing prediction tools; although this improves performance to some extent, no markedly more effective prediction method has yet been found.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a prediction method of protein epitope: an effective method for finding epitopes accurately and quickly.
The problems of the invention are solved by the following technical scheme:
A prediction method of protein epitope. The method collects experimentally verified antigen epitope sequence information and the related protein sequence information from professional databases, constructs a positive and negative sample set for learning and training, and uses the physicochemical properties of amino acids as features for learning and prediction. A complementary prediction classifier group and a high-performance classifier are then trained on the sample set with machine learning algorithms. Finally, a first candidate epitope set is obtained with the complementary prediction classifier group, a second candidate epitope set is obtained with the high-performance classifier, and the sequences in the candidate epitope sets are scored and ranked with a tendency scoring method;
the prediction is carried out according to the following steps:
a. Data acquisition: epitope data are selected from the IEDB database, and the primary protein sequence of each adopted epitope sample is retrieved from the UniProt protein database; the epitope sample data are completed to the reference length and non-epitope sample data are extracted from the same sequences, constructing a positive and negative sample set for learning and training. For each sequence in a sample, the physicochemical property values of each amino acid, such as hydrophobicity and accessibility, together with the mean hydrophobicity and accessibility of every three adjacent amino acids, form a feature matrix used as the training input;
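The per-residue feature construction in step a can be sketched as follows. This is an illustrative sketch: the Kyte-Doolittle hydropathy scale stands in for the patent's property tables, and a single property column is shown (the patent stacks several properties, e.g. accessibility, the same way).

```python
# Sketch of the feature-matrix construction in step a: one row per residue,
# holding the residue's own property value plus the mean over the window of
# three adjacent residues centred on it. Hydropathy values are the standard
# Kyte-Doolittle scale; the patent's exact property tables are not given here.

KD_HYDROPATHY = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def feature_rows(seq, scale=KD_HYDROPATHY):
    """One row per residue: [value, mean of the 3-residue window centred here]."""
    vals = [scale[a] for a in seq]
    rows = []
    for i, v in enumerate(vals):
        lo, hi = max(0, i - 1), min(len(vals), i + 2)
        window = vals[lo:hi]          # truncated at the sequence ends
        rows.append([v, sum(window) / len(window)])
    return rows
```

In use, one such two-column block per property would be concatenated to form the full feature matrix for a 20-residue sample.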
b. training of a complementary prediction classifier set:
① Train in the sample set D with a two-class training method; when the classification accuracy exceeds λ, a first classifier C1 is obtained. Here λ denotes the required classification accuracy of a trained classifier; its specific value can be set according to the actual situation, within the range 50% ≤ λ < 100%. The sample set in D that the first classifier C1 can identify correctly is denoted D1; in general, Di (1 ≤ i ≤ n) denotes the sample set that classifier Ci can identify correctly in its training set. According to the four-class training method and the appropriate-increment method, a new training sample set is constructed from the epitope and non-epitope samples in D1 and D − D1; learning training in this new sample set yields a second classifier C2 when the classification accuracy exceeds λ. The sample set in D that C2 can identify correctly is denoted D2, and the intersection of the sets D1 and D2 is denoted D̄1; in general, D̄i (1 ≤ i ≤ n − 1) denotes the sample set that both the i-th and the (i+1)-th classifiers can identify correctly. Whether training continues is then judged according to the training-termination rule;
② When training is to continue, a new training sample set is constructed, according to the four-class training method and the appropriate-increment method, from the epitope and non-epitope samples in D − D2 and D2 − D̄1; learning training in this new sample set yields a third classifier C3 when the classification accuracy exceeds λ. The sample set in D − D̄1 that C3 identifies correctly is denoted D3, and the intersection of the sets D2 and D3 is denoted D̄2; whether training continues is again judged according to the termination rule. After the (n−1)-th classifier Cn−1 has been obtained and training is to continue, a new training sample set is constructed in the same way from the epitope and non-epitope samples in D − Dn−1 and Dn−1 − D̄n−2; training in this new sample set yields the n-th classifier Cn when the classification accuracy exceeds λ, and the sample set in D − D̄n−2 that Cn identifies correctly is denoted Dn. Proceeding in this manner until training stops yields a group of classifiers with complementary classification capability, i.e. the complementary prediction classifier group;
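The control flow of steps ① and ② can be sketched as below. `train` and `terminate` are stand-ins for the patent's two-/four-class training and its termination rule, and the simple pooling here abbreviates the Di / D̄i bookkeeping; it is a sketch of the loop structure, not the patent's exact set algebra.

```python
# Sketch of the complementary-group training loop: each new classifier is
# trained on the samples the previous round got wrong (the new classes one
# and two) together with samples it got right (the new classes three and
# four). `train(pool)` returns a predict function; `terminate(N, n_correct)`
# returns True to stop. Both are caller-supplied stand-ins.

def train_complementary_group(D, train, terminate, max_rounds=10):
    """D: list of (features, label). Returns the list [C1, C2, ...]."""
    classifiers = []
    prev_correct = None              # correctly identified set of the previous round
    pool = list(D)
    for _ in range(max_rounds):
        clf = train(pool)
        correct = [s for s in D if clf(s[0]) == s[1]]
        classifiers.append(clf)
        if prev_correct is not None and terminate(len(D), len(correct)):
            break
        wrong = [s for s in D if clf(s[0]) != s[1]]
        pool = wrong + correct       # next round's classes 1/2 + classes 3/4
        prev_correct = correct
    return classifiers
```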
c. In the sample set D, a classifier whose classification accuracy exceeds 90% for each class is trained with a two-class training method; this classifier is called the high-performance classifier EC;
d. For a protein antigen with unknown epitopes, prediction is performed as follows:
① The primary sequence of the antigen protein is divided into a set of sequence fragments SSD according to the "reference length"; for each fragment, a feature matrix formed from the value for each amino acid, and the mean value over each three adjacent amino acids, of hydrophobicity and accessibility serves as the prediction input;
② With the trained classifiers, the first classifier C1 and the second classifier C2 first classify the set SSD in turn; the fragments that C1 predicts as class one and C2 predicts as class three form the set ERD1. In general, ERDi denotes the set of fragments that the i-th classifier Ci predicts as class one and the (i+1)-th classifier Ci+1 predicts as class three, 1 ≤ i ≤ n − 1. Classification and identification are then performed in the set SSD − ERD1 with the third classifier C3: the fragments that C2 predicts as class one and C3 predicts as class three form ERD2. This continues until, in the set SSD − ∪_{i=1}^{n−2} ERDi, the n-th classifier Cn performs classification and identification: the fragments that Cn−1 predicts as class one and Cn predicts as class three form ERDn−1. The first candidate epitope set FCS is the union ∪_{i=1}^{n−1} ERDi. Classification and identification are then performed in the set SSD − FCS with the classifier EC; the fragments that EC predicts as class one form the second candidate epitope set SCS;
③ According to the tendency scoring method, each sequence fragment is scored, and the sequence fragments in the first candidate epitope set FCS and the second candidate epitope set SCS are sorted by score, with high-scoring fragments first.
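The cascade of step ② can be sketched as follows. Labels 1 and 3 stand for the patent's "class one" and "class three"; classifiers are represented as plain predict functions, which is an assumption of this sketch.

```python
# Sketch of prediction step ②: each adjacent classifier pair (Ci, Ci+1)
# votes a fragment into ERDi when Ci predicts class one and Ci+1 predicts
# class three; fragments never captured this way are passed to the
# high-performance classifier EC, whose class-one predictions form SCS.

def cascade_predict(ssd, classifiers, ec):
    remaining = list(ssd)
    fcs = []                                   # union of the ERDi sets
    for ci, cnext in zip(classifiers, classifiers[1:]):
        erd = [f for f in remaining if ci(f) == 1 and cnext(f) == 3]
        fcs.extend(erd)
        remaining = [f for f in remaining if f not in erd]
    scs = [f for f in remaining if ec(f) == 1]
    return fcs, scs
```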
In the prediction method of the protein epitope above, the specific operation of the four-class training method is as follows:
Let Di be the sample set correctly identified by the i-th classifier Ci, and let the subsets of Di composed of epitope and non-epitope data be Di¹ and Di², respectively. When training the (i+1)-th classifier Ci+1, the data that Ci cannot identify correctly are listed, according to their epitope or non-epitope class, as the class-one epitope samples and class-two non-epitope samples of the new training sample set, and portions of the data randomly drawn from Di¹ and Di² are listed as the class-three epitope samples and class-four non-epitope samples; training these samples with a four-class learning algorithm yields the (i+1)-th classifier Ci+1.
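The relabelling above can be sketched as follows. The draw size is a stand-in for the counts prescribed by the appropriate-increment method, which this sketch does not reproduce.

```python
# Sketch of the four-class training-set construction: samples the previous
# classifier missed keep their epitope/non-epitope identity as classes one
# and two; a random draw from the samples it got right becomes classes
# three and four. `draw` is a stand-in for the appropriate-increment count.
import random

def four_class_set(wrong, right, draw, rng=random):
    """wrong/right: lists of (features, is_epitope). Returns (features, cls) pairs."""
    out = [(x, 1 if epi else 2) for x, epi in wrong]
    drawn = rng.sample(right, min(draw, len(right)))
    out += [(x, 3 if epi else 4) for x, epi in drawn]
    return out
```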
In the prediction method of the protein epitope above, the training-termination rule is as follows:
Let N = number(D) be the total number of elements in the sample set and N̄i = number(D̄i) the total number of elements of the set D̄i; Ri = N̄i / N is the correct ratio of the joint prediction of classifier i and classifier i+1, and R is the total prediction ratio after the (i+1)-th training, which is not calculated before the 3rd training. The return value of the termination parameter terminate is given by three piecewise formulas: one for when the number of trainings is at most 4, one for when it is between 5 and 7, and one for when it is 8 or more (the threshold formulas are given in the original figures). If the termination parameter terminate returns 0, the training ends; if it returns 1, the training continues.
In the prediction method of the protein epitope above, the appropriate-increment method is as follows:
Let Di be the sample set correctly identified by the i-th classifier Ci, i = 1, 2, …, n.
When i = 1, D1 contains two classes of samples. After the first classifier C1 is obtained, class-one and class-two data are selected from the set D1 according to the following rule to form a new training sample set. Let G1 = D − D1, let G1¹ and G1² be the class-one and class-two data in G1, and let g1¹ and g1² denote their element counts. In each of two cases, distinguished by comparing these counts, a prescribed number of class-one and class-two data are randomly selected from D1 while a prescribed number of data are randomly selected from G1¹ and G1², together forming the new training sample set; the case conditions and selection counts are given by the formulas in the original figures.
When i ≥ 2, Di contains four classes of samples, denoted by the subsets Di¹, Di², Di³, Di⁴ with element counts di¹, di², di³, di⁴. Let the data set that cannot be correctly identified be Gi, and let gi¹ and gi² denote the element counts of its class-one and class-two sample subsets Gi¹ and Gi². According to the same two-case rule, a prescribed number of class-three and class-four data are selected from Di while a prescribed number of data are randomly selected from Gi¹ and Gi², together forming the new training sample set; again, the case conditions and selection counts are given by the formulas in the original figures.
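The selection idea can be sketched as below. The patent's multiples come from its figure formulas; the factor used here is a stand-in, and only the rounding behaviour ("rounded to the nearest integer when fractional") is taken from the text.

```python
# Sketch of the appropriate-increment draw: the number of class-three/four
# samples taken from the correctly identified set is a multiple of the
# misidentified count, rounded to the nearest integer when fractional, and
# capped by the available data. `factor=2.0` is a stand-in value.
import random

def increment_draw(correct, n_wrong, factor=2.0, rng=random):
    want = round(factor * n_wrong)
    return rng.sample(correct, min(want, len(correct)))
```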
In the prediction method of the protein epitope above, the tendency scoring method is as follows:
In the antigen epitope data set, the frequency of occurrence of any type of combination of three consecutive amino acids in the epitopes is calculated from occurrence counts. In the formula, AAx, AAy, AAz are any of the 20 amino acids, and AAx-AAy-AAz represents any type of combination of three consecutive amino acids; f(AAx-AAy-AAz) represents the frequency of occurrence of the type combination in the epitope, n(AAx-AAy-AAz) the number of times the combination occurs, n(AAx), n(AAy), n(AAz) the total numbers of occurrences of the amino acids AAx, AAy, AAz, and n(AAx-AAy), n(AAy-AAz) the total numbers of occurrences of the amino acid combinations AAx-AAy and AAy-AAz (the frequency formula is given in the original figure).
If the prediction window is k, the propensity score of any sequence fragment into which the primary sequence of the antigenic protein is divided is the accumulation of f over the k − 2 combinations of three consecutive amino acids in the fragment.
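The scoring can be sketched as follows. The patent's frequency formula combines the triple count with the single- and pair-occurrence counts in a formula given only as a figure; the simple relative frequency below is a stand-in for it. The window score sums f over every run of three consecutive residues, as described above.

```python
# Sketch of the tendency scoring: learn triple frequencies from the epitope
# set, then score a fragment by accumulating f over its k-2 consecutive
# triples. The relative-frequency normalisation is a stand-in for the
# patent's exact frequency formula.
from collections import Counter

def triple_freqs(epitopes):
    counts = Counter()
    for seq in epitopes:
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def propensity_score(fragment, freqs):
    return sum(freqs.get(fragment[i:i + 3], 0.0)
               for i in range(len(fragment) - 2))
```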
On the basis of a prediction model with a multilayer classification structure, the method uses several classifiers with complementary capability to predict protein epitopes cooperatively. Prediction experiments on several blind data sets gave a prediction accuracy above 70%, showing that the method markedly improves the accuracy of protein epitope prediction and provides an effective way to find epitopes accurately and quickly.
Drawings
The invention will be further explained with reference to the drawings.
FIG. 1 is the training flow chart of the complementary classifier group used in the epitope prediction method of the present invention;
FIG. 2 is the epitope prediction process diagram used in the epitope prediction method of the present invention.
In the figures and in the text, the symbols are: D is the sample set; Ci is the i-th classifier; Di is the sample set correctly identified by classifier Ci; EC is the high-performance classifier; SSD is the sequence fragment set; FCS is the first candidate epitope set; SCS is the second candidate epitope set; N = number(D) is the total number of elements in the sample set; N̄i = number(D̄i) is the total number of elements of the set D̄i; Ri is the correct ratio of the joint prediction of classifier i and classifier i+1; R is the total prediction ratio after the (i+1)-th training; terminate is the termination parameter; AAx, AAy, AAz are any of the 20 amino acids; AAx-AAy-AAz represents any type of combination of three consecutive amino acids; f(AAx-AAy-AAz) is the frequency of occurrence of the type combination in the epitope; n(AAx-AAy-AAz) is the number of times the combination occurs; n(AAx), n(AAy), n(AAz) are the total numbers of occurrences of the amino acids AAx, AAy, AAz; and n(AAx-AAy), n(AAy-AAz) are the total numbers of occurrences of the amino acid combinations AAx-AAy and AAy-AAz.
IEDB refers to the professional database at http://www.iedb.org/; UniProt refers to the protein database at http://www.uniprot.org/.
Detailed Description
Epitope prediction is generally realized with a binary classifier; constructing classifiers with complementary prediction capability breaks through this habitual limitation. The construction starts from a binary classifier: sample data are recombined according to the binary classifier's classification results, and a new classifier is trained on the new samples. This research builds multiple classifiers by exploiting the prediction difference between each two adjacent classifiers in a group, realizing stepwise optimization, and provides a mechanism for training a complementary classifier group, which has an important promoting effect on improving the performance of antigen epitope prediction.
In order to clearly understand the technical contents of the present invention, the present invention will be described in detail with reference to fig. 1 and 2. It is to be understood that the examples are illustrative of the invention and are not to be construed as limiting the invention.
1. Data acquisition
Epitope sequence data collected from the IEDB (http://www.iedb.org/) epitope database serve as the training positive samples; the database contains a large number of experimentally verified epitope records covering humans, non-human primates, and other species. The primary sequence of the protein corresponding to each selected epitope sample is found in the UniProt (http://www.uniprot.org/) protein database, and sequence fragments not marked as epitopes (i.e. non-epitope sequences) are extracted from these primary sequences as training negative samples. In our experiments, 800 protein sequences were extracted in total, from which 5120 continuous epitope sequences and 5200 non-epitope sequence fragments were collected. The reference length of each sample is 20 amino acids; for non-epitope samples, fragments of 20 amino acids not marked as epitopes are selected directly from the primary protein sequences.
For epitope samples, because the number of amino acids contained in an epitope sequence varies, the "reference length" requirement is met as follows. For an epitope sequence with fewer than 20 amino acids and an even deficit, the same number of amino acids is taken from each side of the protein sequence in which it lies as contiguous supplements; when the deficit is odd, one more amino acid is supplemented from the front end of the protein sequence than from the rear end. For an epitope sequence with more than 20 amino acids and an even surplus, the same number of amino acids is removed from each side; when the surplus is odd, one more amino acid is removed from the front end than from the rear end. For each sample sequence, a feature matrix is formed from the hydrophobicity and accessibility of each amino acid in order, together with the mean hydrophobicity and accessibility of every three adjacent amino acids; this feature matrix is the input matrix for training and prediction.
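The length-normalisation rules above can be sketched directly. The clipping at the protein ends is my addition for the boundary case the text does not spell out.

```python
# Sketch of the reference-length normalisation: epitopes shorter than 20
# residues are extended with flanking residues from the parent protein
# (one extra from the front when the difference is odd); longer epitopes
# are trimmed the same way. Clipping at the protein ends is an assumption.

REF_LEN = 20

def normalise(protein, start, end, ref_len=REF_LEN):
    """Epitope occupies protein[start:end); returns (new_start, new_end)."""
    diff = ref_len - (end - start)
    front = (abs(diff) + 1) // 2    # odd difference: one more at the front
    back = abs(diff) // 2
    if diff > 0:                     # too short: extend into the flanks
        start, end = start - front, end + back
    elif diff < 0:                   # too long: trim from both ends
        start, end = start + front, end - back
    return max(0, start), min(len(protein), end)
```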
2. Model building
In the sample set, the training of the complementary classifier group is carried out as follows.
Let the sample set be D. Training with a two-class training method yields a classifier C1 when the classification accuracy exceeds λ (λ is the required classification accuracy of a trained classifier; its specific value can be set according to the actual situation, within the range 50% ≤ λ < 100%). Within the set D, the sample set that C1 identifies correctly is denoted D1; in general, Di (1 ≤ i ≤ n) denotes the sample set that classifier Ci can identify correctly in its training set. According to the four-class training method and the appropriate-increment method, a new training sample set is constructed from the epitope and non-epitope samples in D1 and D − D1; learning training in this new set yields a classifier C2 when the classification accuracy exceeds λ. Within D, the sample set that C2 identifies correctly is denoted D2 (here and below, the recognized class one and class three count as epitopes and the recognized class two and class four as non-epitopes). The intersection of the sets D1 and D2 is denoted D̄1; in general, D̄i (1 ≤ i ≤ n − 1) denotes the sample set that both the i-th and the (i+1)-th classifiers identify correctly. A judgment is then made according to the training-termination rule: if the termination parameter terminate returns 0, the training ends; if it returns 1, the training continues. When training continues, a new training sample set is constructed, according to the four-class training method and the appropriate-increment method, from the epitope and non-epitope samples in D − D2 and D2 − D̄1; training and learning in the new set yields a classifier C3 when the classification accuracy exceeds λ. The sample set in D − D̄1 that C3 identifies correctly is denoted D3, and the intersection of the sets D2 and D3 is denoted D̄2; the termination rule is then consulted again in the same way.
When training continues after the (n−1)-th classifier is obtained, a new training sample set is constructed in the same way from the epitope and non-epitope samples in D − Dn−1 and Dn−1 − D̄n−2 (n ≥ 4); training in the new set yields a classifier Cn when the classification accuracy exceeds λ, and the sample set in D − D̄n−2 that Cn identifies correctly is denoted Dn. Training proceeds in this manner until it stops. This produces a group of classifiers whose classification capabilities are complementary, the complementarity holding only between each two adjacent classifiers in the group. Any machine learning algorithm can be used for the two-class and four-class training in this method, as long as it can satisfy the relevant rules and the epitope classification requirements of the method.
The specific content of the four-class training method is as follows.
After classifier C1 is obtained in training, the two classes of sample data it identifies correctly take part, as new sample classes, in the training of the next classifier; that is, from the second classifier onward the training is four-class. Let the sample set correctly identified by C1 be D1, and let the subsets of D1 composed of class-one (epitope) and class-two (non-epitope) data be D1¹ and D1², respectively. When training the second classifier, the epitope and non-epitope samples in the data set D − D1 that C1 identified incorrectly are listed as the class-one and class-two samples of the new training sample set, and the epitope and non-epitope data in D1¹ and D1² are listed as the class-three and class-four samples; an appropriate amount of data is drawn according to the appropriate-increment method to form the new training set, and training with a four-class learning algorithm yields the classifier. From the training of the third classifier onward, the data that the previous classifier could not identify correctly are listed, as epitope and non-epitope, as the class-one and class-two samples of the new training sample set; part of the data the previous classifier identified correctly is drawn and listed as the class-three and class-four samples; an appropriate amount of data is drawn according to the appropriate-increment method to form the new training set, and the classifier is likewise obtained by training with the four-class learning algorithm.
The specific contents of constructing a new training sample set according to the "appropriate amount increasing method" are as follows:
After classifier Ci (i = 2, ..., n) is obtained, a new sample set for the next round of training is constructed according to the sizes of the different classes of sample sets. "Sample addition" in this method means that the classes of samples correctly identified by the previous classifier participate in the next classifier's training as new sample classes. In general, the number of samples of each class that a classifier identifies correctly is larger than the number of samples it cannot identify correctly, so the sample counts of the data sets must be compared when constructing the new training sample set. Let the sample set correctly identified by classifier Ci be Di, and denote the four classes of data in Di by the subsets Di^(1), Di^(2), Di^(3) and Di^(4), with number(Di^(1)), ..., number(Di^(4)) denoting their numbers of elements. Let Gi be the data set that classifier Ci cannot identify correctly, and let number(Gi^(1)) and number(Gi^(2)) denote the numbers of elements in the epitope and non-epitope sample subsets of Gi (Gi^(1) and Gi^(2) are the class-one and class-two data of set Gi). Class-three and class-four data are then selected from set Di according to the following rules to form the new training sample set.
When the first threshold condition holds (both the condition and the selection counts are given by formulas rendered only as images in the original; when the product of the multiple is not an integer, it is rounded off, and the same applies hereinafter), the corresponding numbers of class-three and class-four data are randomly selected from set Di, and at the same time the corresponding numbers of data are randomly selected from the subsets Gi^(1) and Gi^(2); together these data constitute the new training sample set.
Otherwise, when the second threshold condition holds (likewise rendered only as an image in the original), the corresponding numbers of class-three and class-four data are selected from set Di, and the corresponding numbers of data are randomly selected from the subsets Gi^(1) and Gi^(2); together these data constitute the new training sample set.
In particular, after the first trained classifier C1 is obtained, class-one and class-two data are selected from set D1 according to the following rules to form a new training sample set. Let G1 = D − D1, and let number(G1^(1)) and number(G1^(2)) denote the numbers of elements in the epitope and non-epitope sample subsets of G1 (G1^(1) and G1^(2) are the class-one and class-two data of set G1).
When the first threshold condition holds (the condition and selection counts are rendered only as images in the original), the corresponding numbers of class-one and class-two data are randomly selected from set D1, and at the same time the corresponding numbers of data are randomly selected from the subsets G1^(1) and G1^(2); together these data constitute the new training sample set.
Otherwise, when the second threshold condition holds, the corresponding numbers of class-one and class-two data are selected from set D1, and the corresponding numbers of data are randomly selected from the subsets G1^(1) and G1^(2); together these data constitute the new training sample set.
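The selection logic of the two cases above can be sketched as follows. Because the exact threshold multiple appears only as a formula image in the original, max_ratio here is an assumed parameter, not the patent's value; only the structure (cap the over-represented correctly-identified side at a multiple of the misclassified side, rounding non-integer products) is taken from the text.

```python
import random

def select_balanced(correct, wrong, max_ratio=2.0, rng=None):
    """Sketch of the 'appropriate amount increasing' selection.
    correct / wrong: samples the previous classifier identified
    correctly / incorrectly. Returns the new training sample set."""
    rng = rng or random.Random(0)
    cap = round(max_ratio * len(wrong))   # round when the product is not an integer
    if len(correct) > cap:
        picked = rng.sample(correct, cap)  # cap the over-represented side
    else:
        picked = list(correct)             # otherwise keep all correct samples
    return picked + list(wrong)
```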
The specific contents of the training termination rule are as follows:
Let N = number(D) be the total number of elements in the sample set D, and Ni = number(Di) the total number of elements in set Di. Let ri denote the correct ratio of the joint prediction of classifier i and classifier i+1, and R the total prediction ratio; both are defined by formulas rendered only as images in the original. Starting from the 3rd training, R is calculated before training. When the number of trainings is at most 4, the return value of terminate is given by the first piecewise rule; when the number of trainings is at least 5 and at most 7, by the second; and when it is at least 8, by the third (the three piecewise rules are likewise rendered only as images in the original).
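A hedged sketch of the staged termination rule: the three threshold values below are illustrative assumptions, since the actual piecewise formulas appear only as images in the original; only the staging by training count and the 0/1 return convention (0 ends training, 1 continues, as stated in claim 3) are taken from the text.

```python
def terminate(train_count, R, thresholds=(0.90, 0.93, 0.95)):
    """train_count: number of trainings performed so far.
    R: total joint-prediction ratio accumulated so far.
    thresholds: assumed stage thresholds, NOT the patent's values.
    Returns 1 to continue training, 0 to stop."""
    if train_count <= 4:
        t = thresholds[0]
    elif train_count <= 7:
        t = thresholds[1]
    else:
        t = thresholds[2]
    # continue while the accumulated ratio is still below the stage threshold
    return 1 if R < t else 0
```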
for the training of the high-performance classifier EC, only two classification training needs to be adopted in the sample set D until the classification accuracy reaches 90%.
3. Epitope prediction
For protein antigens with unknown epitope positions, epitope prediction is carried out according to the following method. In the first step, the primary sequence of the antigen protein is divided into a set SSD of sequence fragments according to the "reference length", and for each fragment a feature matrix is calculated as prediction input according to the "data acquisition" method of step 1. In the second step, the trained complementary prediction classifier group predicts sequentially: first, classifiers C1 and C2 each classify the set SSD, and the fragments for which C1 predicts class one while C2 predicts class three form the set ERD1 (ERDi denotes the set of fragments for which classifier Ci predicts class one while classifier Ci+1 predicts class three, 1 ≤ i ≤ n−1); then classifier C3 classifies the set SSD − ERD1, and the fragments for which C2 predicts class one while C3 predicts class three form the set ERD2; this rule continues until the last classifier Cn classifies the set SSD − (ERD1 ∪ ERD2 ∪ ... ∪ ERDn−2), the union of the prediction results of the first n−1 classifiers having been removed, and the fragments for which Cn−1 predicts class one while Cn predicts class three form the set ERDn−1. The first candidate epitope set (first candidate set) FCS is the union of all ERDi, i.e. FCS = ERD1 ∪ ERD2 ∪ ... ∪ ERDn−1. The classifier EC then classifies the set SSD − FCS, and the fragments for which EC predicts class one form the second candidate epitope set (second candidate set) SCS. In the third step, each sequence fragment is scored according to the tendency scoring method, and the sequence fragments in the FCS and SCS sets are sorted by score, with higher-scoring fragments ranked first.
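The staged prediction of the second step can be sketched as follows, assuming each cascade classifier returns a class label in {1, 2, 3, 4} and the high-performance classifier EC returns 1 for an epitope; the function and variable names are illustrative, not the patent's.

```python
def predict_epitopes(ssd, classifiers, ec):
    """ssd: sequence fragments. classifiers: cascade predict functions
    returning a class in {1, 2, 3, 4}. ec: binary high-performance
    classifier returning 1 for epitope. Returns (FCS, SCS)."""
    remaining = list(ssd)
    fcs = []                                    # first candidate set
    for i in range(len(classifiers) - 1):
        c_i, c_next = classifiers[i], classifiers[i + 1]
        # ERD_i: classifier i says class one AND classifier i+1 says class three
        erd = [s for s in remaining if c_i(s) == 1 and c_next(s) == 3]
        fcs.extend(erd)
        remaining = [s for s in remaining if s not in erd]
    # EC classifies what the cascade did not pick (SSD - FCS)
    scs = [s for s in remaining if ec(s) == 1]  # second candidate set
    return fcs, scs
```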
The specific content of the tendency scoring method is as follows:
in the epitope data set, the frequency of occurrence in epitopes of any combination type of three consecutive amino acids is calculated by a formula (rendered only as an image in the original) in which AAx, AAy and AAz are any of the 20 amino acids, AAx-AAy-AAz denotes any combination type of three consecutive amino acids, f(AAx-AAy-AAz) denotes the frequency with which the combination type occurs in epitopes, N(AAx-AAy-AAz) the number of times the combination occurs, N(AAx), N(AAy) and N(AAz) the total numbers of occurrences of the amino acids AAx, AAy and AAz, and N(AAx-AAy) and N(AAy-AAz) the total numbers of occurrences of the amino-acid combinations AAx-AAy and AAy-AAz.
If the prediction window is k, the propensity score for any sequence fragment into which the primary sequence of the antigenic protein is divided is:
Figure GDA0002529525130000116
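A simplified sketch of the tendency scoring idea. The patent's exact frequency formula is shown only as an image, so this stand-in uses plain 3-mer relative frequencies and scores a fragment as the sum over its k−2 consecutive triplets, which is one plausible reading of the scoring formula, not the patent's exact computation.

```python
from collections import Counter

def triplet_frequencies(epitopes):
    """Count each 3-mer's relative frequency among all 3-mers in the
    epitope set (a hedged stand-in for the image-rendered formula)."""
    counts = Counter()
    total = 0
    for seq in epitopes:
        for j in range(len(seq) - 2):
            counts[seq[j:j + 3]] += 1
            total += 1
    return {t: n / total for t, n in counts.items()}

def tendency_score(fragment, freqs):
    """Score a fragment of window length k as the sum of the frequencies
    of its k-2 consecutive 3-mers; unseen 3-mers contribute 0."""
    return sum(freqs.get(fragment[j:j + 3], 0.0)
               for j in range(len(fragment) - 2))
```

Candidate fragments in FCS and SCS would then be sorted by this score in descending order.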
4. Accuracy evaluation of the prediction method
The invention screened 800 antigen proteins, giving a sample set of 5120 epitope sequences and 5200 non-epitope sequences. Support vector machine (SVM) and recurrent neural network (RNN) algorithms were used for the two-class and four-class training, with four rounds of training in total: the first round trained classifier 1 with an SVM, the second round trained classifier 2 with an RNN, the third round trained classifier 3 with an RNN, and the fourth round trained the high-performance classifier 4 with an RNN. The sequences predicted as class one by classifier 1 and class three by classifier 2 numbered 3325, the sequences predicted as class one by classifier 2 and class three by classifier 3 numbered 1573, and the comprehensive prediction accuracy reached 95.6%. The accuracy of classifier 4 under five-fold cross-validation was 91%.
We collected 287 proteins outside the training samples as a blind test set containing 2000 validated epitope sequences, from which 1000 were drawn at random for each test. The results predicted by the trained classifier group were as follows: classifier 1 and classifier 2 jointly predicted 739 epitope sequences, of which 551 were correct, an accuracy of 74.5%; classifier 2 and classifier 3 jointly predicted 492 epitope sequences, of which 327 were correct, an accuracy of 66.5%. The comprehensive prediction accuracy was 71.3%, and the coverage of correct results reached 87.8%. Classifier 4 predicted 190 epitope sequences, of which 75 were correct, an accuracy of 39.5%. Combining the results of the two kinds of classifiers, the coverage of correct results reached 95.3%.
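The blind-test percentages reported above can be checked against the raw counts given in the text (1000 epitope sequences per draw); each figure matches to within rounding.

```python
# Raw counts from the blind test described in the text.
pred_12, correct_12 = 739, 551   # classifiers 1 + 2
pred_23, correct_23 = 492, 327   # classifiers 2 + 3
pred_4, correct_4 = 190, 75      # high-performance classifier 4

acc_12 = correct_12 / pred_12                              # reported as 74.5%
acc_23 = correct_23 / pred_23                              # reported as 66.5%
overall = (correct_12 + correct_23) / (pred_12 + pred_23)  # reported as 71.3%
coverage = (correct_12 + correct_23) / 1000                # reported as 87.8%
combined = (correct_12 + correct_23 + correct_4) / 1000    # reported as 95.3%
```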
These experimental results show that the method has high prediction accuracy, that the prediction results cover most epitopes, and that it can provide an effective, scientific basis for epitope screening.

Claims (5)

1. A method for predicting protein epitopes, characterized in that: first, experimentally verified antigen epitope sequence information and the sequence information of the related proteins are collected from professional databases, a positive and negative sample set for learning and training is constructed, and the physicochemical properties of the amino acids are used as features for learning and prediction; then, a complementary prediction classifier group and a high-performance classifier are trained in the sample set with machine learning algorithms; finally, a first candidate epitope set is obtained with the complementary prediction classifier group, a second candidate epitope set is obtained with the high-performance classifier, and the sequences in the candidate epitope sets are scored and sorted with the tendency scoring method;
the prediction is carried out according to the following steps:
a. data acquisition: epitope data information is selected from the IEDB database, the primary sequence information of the proteins from which the epitope samples are taken is searched in the UniProt protein database, the epitope sample data are perfected and non-epitope sample data are extracted therefrom, a positive and negative sample set for learning and training is constructed, and a feature matrix is formed as training input from the physicochemical property features of each amino acid in a sample sequence, such as hydrophobicity and accessibility, together with the mean hydrophobicity and accessibility of three adjacent amino acids;
b. training of the complementary prediction classifier group:
① training is carried out in the sample set D with a two-class training method, and when the classification accuracy exceeds λ the first classifier C1 is obtained; λ denotes the classification accuracy of the learning-trained classifier, its specific value can be set according to the actual situation, and its range is 50% ≤ λ < 100%; the sample set in D that the first classifier C1 can identify correctly is denoted D1, and Di denotes the sample set in the training set that classifier Ci can identify correctly, 1 ≤ i ≤ n; according to the "four-classification training method" and the "appropriate amount increasing method", the epitope and non-epitope samples in sets D1 and D − D1 are used to construct a new training sample set; learning training is carried out in the new sample set, and when the classification accuracy exceeds λ the second classifier C2 is obtained; the sample set in D that the second classifier C2 can identify correctly is denoted D2, the intersection of set D1 and set D2 is denoted D̄1, and D̄i denotes the set of samples that both the i-th and the (i+1)-th classifiers can identify correctly, 1 ≤ i ≤ n−1; whether to continue training is then judged according to the "training termination rule";
② when training needs to continue, according to the "four-classification training method" and the "appropriate amount increasing method", the epitope and non-epitope samples in the sets D − D̄1 and D2 − D̄1 are used to construct a new training sample set; learning training is carried out in the new sample set, and when the classification accuracy exceeds λ the third classifier C3 is obtained; the sample set in D − D̄1 that classifier C3 identifies correctly is denoted D3, and the intersection of sets D2 and D3 is denoted D̄2; whether to continue training is then judged according to the "training termination rule"; after the (n−1)-th classifier Cn−1 is obtained and training needs to continue, according to the "four-classification training method" and the "appropriate amount increasing method", the epitope and non-epitope samples in the sets D − (D̄1 ∪ D̄2 ∪ ... ∪ D̄n−2) and Dn−1 − D̄n−2 are used to construct a new training sample set; learning training is carried out in the new sample set, and when the classification accuracy exceeds λ the n-th classifier Cn is obtained; the sample set in D − (D̄1 ∪ D̄2 ∪ ... ∪ D̄n−2) that classifier Cn identifies correctly is denoted Dn; training proceeds in this manner until it stops, yielding a group of classifiers with complementary classification capabilities, namely the complementary prediction classifier group;
c. in the sample set D, a classifier whose classification accuracy for each class exceeds 90% is trained with a two-class training method; it is called the high-performance classifier EC;
d. for a protein antigen whose epitopes are unknown, prediction is carried out according to the following method:
① the primary sequence of the antigen protein is divided into a set SSD of sequence fragments according to the "reference length", and for each fragment a feature matrix is formed as prediction input from the hydrophobicity and accessibility of each amino acid and the mean values over three adjacent amino acids;
② the trained classifiers predict sequentially: first, classifiers C1 and C2 each classify the set SSD, and the fragments for which the first classifier C1 predicts class one while the second classifier C2 predicts class three form the set ERD1; ERDi denotes the set of fragments for which the i-th classifier Ci predicts class one while the (i+1)-th classifier Ci+1 predicts class three, 1 ≤ i ≤ n−1; then the third classifier C3 classifies the set SSD − ERD1, and the fragments for which the second classifier C2 predicts class one while the third classifier C3 predicts class three form the set ERD2; and so on, until the n-th classifier Cn classifies the set SSD − (ERD1 ∪ ERD2 ∪ ... ∪ ERDn−2), the union of the prediction results of the first n−1 classifiers having been removed, and the fragments for which the (n−1)-th classifier Cn−1 predicts class one while Cn predicts class three form the set ERDn−1; the first candidate epitope set (first candidate set) FCS is ERD1 ∪ ERD2 ∪ ... ∪ ERDn−1; the classifier EC classifies the set SSD − FCS, and the fragments for which EC predicts class one form the second candidate epitope set (second candidate set) SCS;
③ each sequence fragment is scored according to the tendency scoring method, the sequence fragments in the first candidate epitope set FCS and the second candidate epitope set SCS are sorted by score, and fragments with higher scores are ranked first.
2. The method for predicting a protein epitope according to claim 1, wherein the specific operation method of the "four-classification training method" is as follows:
let the sample set correctly identified by the i-th classifier Ci be Di, and let the subsets of epitope and non-epitope data in Di be Di^(1) and Di^(2); when the (i+1)-th classifier Ci+1 is trained, the data that the i-th classifier Ci could not identify correctly are listed, by epitope and non-epitope class, as the class-one epitope samples and class-two non-epitope samples of the new training sample set, part of the data in Di^(1) and Di^(2) is randomly drawn and listed as the class-three epitope samples and class-four non-epitope samples, and the samples are then trained with a four-class learning algorithm to obtain the (i+1)-th classifier Ci+1.
3. The method for predicting a protein epitope according to claim 1, wherein said "training termination rule" is as follows:
let N = number(D) be the total number of elements in the sample set, and Ni = number(Di) the total number of elements in set Di; let ri denote the correct ratio of the joint prediction of classifier Ci and classifier Ci+1, and R the total prediction ratio after the (i+1)-th training (both defined by formulas rendered only as images in the original); starting from the 3rd training, R is calculated before training; when the number of trainings is at most 4, the return value of the termination parameter terminate is given by the first piecewise rule; when the number of trainings is at least 5 and at most 7, by the second; and when it is at least 8, by the third (the piecewise rules are likewise rendered only as images in the original); if the termination parameter terminate returns 0, training ends, and if it returns 1, training continues.
4. The method for predicting a protein epitope according to claim 1, wherein the "appropriate amount increasing method" is as follows:
let the sample set correctly identified by the i-th classifier Ci (i = 1, 2, ..., n) be Di;
When i is 1, D1Two kinds of samples are shared, when the first classifier C is obtained1Then, from set D according to the following rule1Selecting data of a first class and data of a second class to form a new training sample set:
let G1=D-D1
Figure FDA0002565774690000041
Represents G1The number of elements of the subset of the mesopic and non-epitopic samples,
Figure FDA0002565774690000042
are respectively a set G1Class one, class two data in (1);
when the first threshold condition holds (the condition and the selection counts are rendered only as images in the original), the corresponding numbers of class-one and class-two data are randomly selected from set D1, and at the same time the corresponding numbers of data are randomly selected from the subsets G1^(1) and G1^(2), together forming the new training sample set;
otherwise, when the second threshold condition holds, the corresponding numbers of class-one and class-two data are selected from set D1, and the corresponding numbers of data are randomly selected from the subsets G1^(1) and G1^(2), together forming the new training sample set;
when i = 2, ..., n, Di contains four classes of samples, denoted by the subsets Di^(1), Di^(2), Di^(3) and Di^(4), with number(Di^(1)), ..., number(Di^(4)) denoting their numbers of elements; let Gi be the data set that cannot be identified correctly, and let number(Gi^(1)) and number(Gi^(2)) denote the numbers of elements in the class-one and class-two sample subsets of Gi, where Gi^(1) and Gi^(2) are the class-one and class-two data in Gi; class-three and class-four data are then selected from set Di according to the following rules to form a new training sample set:
when the first threshold condition holds, the corresponding numbers of class-three and class-four data are randomly selected from set Di, and at the same time the corresponding numbers of data are randomly selected from the subsets Gi^(1) and Gi^(2), together forming the new training sample set;
otherwise, when the second threshold condition holds, the corresponding numbers of class-three and class-four data are selected from set Di, and the corresponding numbers of data are randomly selected from the subsets Gi^(1) and Gi^(2), together forming the new training sample set.
5. The method for predicting a protein epitope according to claim 1, wherein said "tendency scoring" method comprises:
in the epitope data set, the frequency of occurrence in epitopes of any combination type of three consecutive amino acids is calculated by a formula (rendered only as an image in the original) in which AAx, AAy and AAz are any of the 20 amino acids, AAx-AAy-AAz denotes any combination type of three consecutive amino acids, f(AAx-AAy-AAz) denotes the frequency with which the combination type occurs in epitopes, N(AAx-AAy-AAz) the number of times the combination occurs, N(AAx), N(AAy) and N(AAz) the total numbers of occurrences of the amino acids AAx, AAy and AAz, and N(AAx-AAy) and N(AAy-AAz) the total numbers of occurrences of the amino-acid combinations AAx-AAy and AAy-AAz;
if the prediction window is k, the tendency score of any sequence fragment into which the primary sequence of the antigen protein is divided is computed, by the scoring formula (rendered only as an image in the original), from the frequencies of the fragment's k−2 consecutive three-amino-acid combinations.
CN201710516045.8A 2017-06-29 2017-06-29 Prediction method of protein epitope Active CN107341363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710516045.8A CN107341363B (en) 2017-06-29 2017-06-29 Prediction method of protein epitope


Publications (2)

Publication Number Publication Date
CN107341363A (en) 2017-11-10
CN107341363B (en) 2020-09-22

Family

ID=60219158


Country Status (1)

Country Link
CN (1) CN107341363B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109326324B (en) * 2018-09-30 2022-01-25 河北省科学院应用数学研究所 Antigen epitope detection method, system and terminal equipment
CN110060738B (en) * 2019-04-03 2021-10-22 中国人民解放军军事科学院军事医学研究院 Method and system for predicting bacterial protective antigen protein based on machine learning technology
CN110310708A (en) * 2019-06-18 2019-10-08 广东省生态环境技术研究所 A method of building alienation arsenic reductase enzyme protein database
CN111429965B (en) * 2020-03-19 2023-04-07 西安交通大学 T cell receptor corresponding epitope prediction method based on multiconnector characteristics
CN113838523A (en) * 2021-09-17 2021-12-24 深圳太力生物技术有限责任公司 Antibody protein CDR region amino acid sequence prediction method and system
CN114242169B (en) * 2021-12-15 2023-10-20 河北省科学院应用数学研究所 Antigen epitope prediction method for B cells
CN116386712B (en) * 2023-02-20 2024-02-09 北京博康健基因科技有限公司 Epitope prediction method and device based on antigen protein dynamic space structure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521527A (en) * 2011-12-12 2012-06-27 同济大学 Method for predicting space epitope of protein antigen according to antibody species classification
EP2842068A1 (en) * 2012-04-24 2015-03-04 Laboratory Corporation of America Holdings Methods and systems for identification of a protein binding site
CN105524984A (en) * 2014-09-30 2016-04-27 深圳华大基因科技有限公司 Method and equipment for neoantigen epitope prediction
CN105868583A (en) * 2016-04-06 2016-08-17 东北师范大学 Method for predicting epitope through cost-sensitive integrating and clustering on basis of sequence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8121797B2 (en) * 2007-01-12 2012-02-21 Microsoft Corporation T-cell epitope prediction


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bhasin, Manoj et al., "Prediction of CTL epitopes using QM, SVM and ANN techniques", Vaccine, 2004-03-05, Fig. 1, p. 3196. *
Dong Jiaojiao, "Research on linear B-cell epitope prediction based on PCA and SVM", China Masters' Theses Full-text Database, Medicine and Health Sciences, 2015-12-15, Chapter 3, pp. 12-19. *
Zhang Chunhua, "Research on conformational B-cell epitope prediction methods based on information fusion and computational intelligence", China Doctoral Dissertations Full-text Database, Medicine and Health Sciences, 2017-02-15, No. 2, pp. E059-129. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant