CN105868583A

CN105868583A - Method for predicting epitope through cost-sensitive integrating and clustering on basis of sequence

Info

Publication number: CN105868583A
Application number: CN201610207437.1A
Authority: CN
Inventors: 马志强; 张健; 柴海挺; 高博
Original assignee: Northeast Normal University
Current assignee: Northeastern University China; Northeast Normal University
Priority date: 2016-04-06
Filing date: 2016-04-06
Publication date: 2016-08-17
Anticipated expiration: 2036-04-06
Also published as: CN105868583B

Abstract

The invention belongs to the field of computational biology and relates in particular to a sequence-based method for predicting epitopes using cost-sensitive ensemble learning and clustering. The main steps are: 1) constructing descriptive features for antigen protein residues, comprising evolutionary conservation, secondary structure, disordered region, dipeptide composition, and physicochemical attributes; 2) selecting an optimal feature subset using the Fisher-Markov selector and incremental iterative feature selection; 3) handling the imbalanced data set with cost-sensitive ensemble learning; 4) identifying potential epitope residues among the predicted antigenic determinant residues with a spatial clustering algorithm. The method is applicable to epitope prediction for antigen proteins with or without known structural information and is suited to large-scale application.

Description

A sequence-based method for predicting epitopes using cost-sensitive ensemble learning and clustering

Technical field

The invention belongs to the field of computational biology and relates in particular to a sequence-based method for predicting epitopes using cost-sensitive ensemble learning and clustering.

Background technology

With economic growth and rising living standards, the demand for basic necessities such as clothing, food, housing, and transport no longer goes unmet as it did in the era of the shortage economy. People have turned their attention to health, and the related industries have all entered a period of rapid development. As China becomes an aging society, state and individual spending on medicine increases year by year, and the biopharmaceutical and vaccine production fields face a huge opportunity. According to statistics, medical expenses incurred after the age of 60 account on average for more than 50% of a person's lifetime health care expenditure. The global drug and vaccine market was close to 25 billion US dollars in 2010 and reached 50 billion US dollars in 2014, doubling in only four years. It is estimated that this market will grow to 200 billion US dollars by 2025.

The drug and vaccine market is at the forefront of the medical market and is one of the fields with the highest scientific and technological content. Developing a new and effective drug often takes several years or even decades; it requires the sustained effort of a large scientific workforce as well as substantial research funding and sophisticated equipment. The successful development of a new drug not only benefits vast numbers of patients but also brings great economic and social returns. Capturing the commanding heights of the drug and vaccine market has become a top priority of life science development in the developed countries of Europe and America. The Chinese government is also paying increasing attention to the drug and vaccine field. In recent years, medical schools have flourished, medical equipment has been continually upgraded, and medical knowledge has spread widely among the general public. Over the past decade or more, the state has invested heavily in pharmacy, vaccines, and related fields in terms of science and technology, funding, policy, and talent.

In theory, the key to drug and vaccine development is to locate epitopes accurately and, on that basis, to design the corresponding immune-intervention antibodies or artificial vaccines. At present, the most reliable way to locate an epitope is to determine the spatial structure of the antigen-antibody complex by crystal diffraction or nuclear magnetic resonance (NMR), and then to examine that structure for potential epitopes on its surface. However, this experimental approach is technically demanding and requires substantial manpower and funding. If the resolution of the resulting structure is low or the preparation fails, the entire process must be started again.

Accurately predicting B-cell epitopes by computational methods helps people better understand the molecular mechanism of antigen-antibody interaction and may also bring hope for the prevention, treatment, and diagnosis of some diseases; research in this area therefore has both theoretical value and practical significance. During the critical period of the 2003 SARS epidemic, research institutions including BGI, Peking University, and Fudan University produced a first vaccine within a few months by computing the epitopes of the SARS virus. This achievement was greatly encouraging and strongly promoted the development of the epitope prediction field. Although research on conformational epitope prediction is not yet mature, a growing number of researchers recognize its importance and have begun to work on it.

Summary of the invention

The present invention addresses the shortcomings of current epitope prediction techniques and provides a sequence-based method for predicting epitopes using cost-sensitive ensemble learning and clustering. The method brings together features that accurately describe antigenic determinant residues and non-antigenic determinant residues and, combined with an efficient feature selection method, identifies potential antigenic determinant residues from the primary sequence of an antigen protein. A spatial clustering method is then used to screen the clustered antigenic determinant residues as epitopes. The design is ingenious, the accuracy is high, and the method is suitable for large-scale application.

To achieve the above object, the sequence-based method of the present invention for predicting epitopes using cost-sensitive ensemble learning and clustering is characterized by comprising the following steps:

(1) Feature construction: based on an analysis of the characteristics of antigen surface residues, descriptive features are computed for antigenic determinant residues and non-antigenic determinant residues.

(2) Feature selection: from the full feature matrix thus constructed, features that are highly discriminative and descriptively accurate are selected, and an optimal feature subset is built from them.

(3) Ensemble learning: to address the imbalance of the data samples and improve predictive performance, an ensemble learning strategy is used to build a set of classifiers.

(4) Epitope prediction: by analyzing the epitope positions in the sample data, a spatial distribution threshold for epitopes is computed. Predicted antigenic determinant residues that cluster in groups of three or more within the threshold are regarded as able to form a potential epitope.

The beneficial effects of the present invention are as follows:

1. The present invention works from the primary sequence of the antigen protein, so it can analyze novel proteins of unknown structure and has a wide range of application. It combines a variety of descriptive features of the antigen protein with Fisher-Markov feature selection and an incremental feature selection strategy to distinguish antigenic determinant residues from non-antigenic determinant residues.

2. Compared with traditional epitope prediction methods, which predict residues only and do not consider the tendency of residues to cluster, the present invention adds a further analysis of the prediction results, based on the clustering tendency of epitopes in reality. The method thus reflects the spatial characteristics of protein epitopes more accurately and makes the predictions more reliable.

Brief description of the drawings

Fig. 1 is a flowchart of the sequence-based method of the present invention for predicting epitopes using cost-sensitive ensemble learning and clustering.

Fig. 2 shows the antigenic determinant residue prediction and cluster analysis for protein 1PKO in the experiments of the present invention. Two epitope groups (1 and 2) are marked in the figure; the grey parts are ordinary protein residues and the black parts are antigenic determinant residues. According to the set threshold, the black residues inside the left circle and the black residues inside the right circle belong to the two epitope groups respectively.

Detailed description of the invention

So that the technical content of the present invention may be understood more clearly, the invention is described in detail below with reference to Fig. 1 and Fig. 2. The embodiments serve only to illustrate the invention and do not limit it.

The sequence-based method of the present invention for predicting epitopes using cost-sensitive ensemble learning and clustering comprises the following steps:

Step (1) specifically comprises the following steps:

(1.1) PSI-BLAST is used to compute the position-specific scoring matrix (PSSM) of the antigen protein sequence. Each entry is the score for substituting the residue at a given sequence position with another residue, and is normalized with the logistic function:

$$f(x) = \frac{1}{1 + e^{-x}}$$

where x is the score in the PSSM for substituting the residue at a given position with another residue. The evolutionary conservation feature of a residue consists of all evolutionary conservation scores within a window covering the 5 positions before and the 5 positions after that residue in the sequence.
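The normalization and windowing of step (1.1) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the standard logistic form 1/(1 + e^(-x)), assumes zero-padding where the window runs past the ends of the sequence (the text does not say how termini are handled), and uses a made-up toy matrix; the function names are hypothetical.

```python
import math

def logistic(x):
    """Map a raw PSSM score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def window_features(rows, center, half_width=5):
    """Concatenate the normalized PSSM rows in a window around `center`.

    Positions outside the sequence are zero-padded (an assumption).
    """
    width = len(rows[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(rows):
            features.extend(logistic(score) for score in rows[pos])
        else:
            features.extend([0.0] * width)
    return features

# Toy PSSM: 3 residues x 20 substitution scores (made-up values).
toy_pssm = [[0] * 20, [2] * 20, [-2] * 20]
vec = window_features(toy_pssm, center=0)
print(len(vec))  # 11 window positions x 20 scores = 220
```

Under these assumptions each residue receives 11 × 20 normalized scores from the PSSM; the same windowing could be reused for the per-residue probabilities of steps (1.2) and (1.3).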

(1.2) PSIPRED is used to compute, for each residue of the antigen protein, the probability matrix of forming each secondary structure (helix, coil, or strand). The secondary structure feature of a residue consists of all secondary structure probabilities within a window covering the 5 positions before and the 5 positions after that residue in the sequence.

(1.3) DISORDER is used to compute the probability that each residue of the antigen protein falls in a disordered region of the protein. Because neighboring residues influence the central residue, the disordered region feature of the central residue consists of all disordered region probabilities within a window covering the 5 positions before and the 5 positions after that residue in the sequence.

(1.4) Residue pairs, that is, combinations of two interacting residues, play an important role in forming the functional groups of a protein and are widely used to analyze and predict protein structure and functional sites. There are 20 naturally occurring amino acids, so there are 20 × 20 = 400 corresponding amino acid pairs, namely "AA, AC, ..., VV".
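As an illustration of the 400 pairs in step (1.4), the sketch below counts adjacent residue pairs in a sequence. Several choices here are assumptions rather than details from the text: the pairs are ordered by one-letter code (so the list ends in "YY" rather than "VV"), counts are normalized to frequencies, and pairs containing non-standard letters are skipped.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(sequence):
    """Return the frequency of each of the 400 dipeptides in `sequence`."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]

features = dipeptide_composition("ACACD")
print(len(features))                       # 400
print(features[DIPEPTIDES.index("AC")])    # 0.5 (2 of the 4 adjacent pairs)
```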

(1.5) Physicochemical properties are closely related to the function of protein residues. Six physicochemical attributes are selected here: hydrophilicity, flexibility, accessibility, polarity, exposed surface, and turns.

Step (2) specifically comprises the following steps:

(2.1) The Fisher-Markov selector is used to compute a relevance score between each feature from step (1) and the class label. The features are sorted by relevance score in descending order: a higher score indicates a stronger correlation between the feature and the class label, and a lower score a weaker one.

(2.2) An incremental iterative strategy is applied to the relevance score list from step (2.1) to select the optimal feature subset. Starting from the ranked list, features are added to the feature pool one at a time in descending order of relevance; after each addition a classifier is built, trained, and used for prediction, and its predictive performance is recorded and plotted. The number of features at the peak of the resulting curve, together with the corresponding features, defines the optimal feature subset.
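The incremental selection of step (2.2) can be sketched as below. The relevance scores are taken as given (the Fisher-Markov computation itself is not reproduced), and `evaluate` is a hypothetical callback standing in for "train a classifier on this subset and measure its performance"; the toy scores and gains are invented for illustration only.

```python
def incremental_feature_selection(scores, evaluate):
    """Add features in descending relevance order and keep the best prefix.

    scores:   dict mapping feature name -> relevance score
    evaluate: callable taking a list of feature names, returning performance
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best_subset, best_perf = [], float("-inf")
    curve = []  # performance after each added feature (the plotted chart)
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        perf = evaluate(subset)
        curve.append(perf)
        if perf > best_perf:
            best_subset, best_perf = subset, perf
    return best_subset, curve

# Toy example with invented relevance scores and performance gains.
toy_scores = {"pssm": 0.9, "disorder": 0.4, "polarity": 0.7, "noise": 0.1}

def toy_evaluate(subset):
    gain = {"pssm": 0.30, "polarity": 0.10, "disorder": 0.05, "noise": -0.08}
    return round(0.5 + sum(gain[f] for f in subset), 2)

best, curve = incremental_feature_selection(toy_scores, toy_evaluate)
print(best)   # ['pssm', 'polarity', 'disorder']
print(curve)  # [0.8, 0.9, 0.95, 0.87]
```

In the toy run the curve peaks at three features, so the fourth, uninformative feature is left out of the selected subset.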

Step (3) specifically comprises the following steps:

(3.1) Conventional machine learning assumes a balanced data set: during model construction, misclassifying a positive sample and misclassifying a negative sample incur the same penalty, and the learning algorithm obtains its best predictive performance by minimizing the total penalty. On an imbalanced data set, where the ratio of positive to negative samples is severely skewed, minimizing this penalty tends to treat the minority class as noise and filter it out, so the minority class is never learned. To address this, a cost-sensitive strategy is introduced that assigns different penalties to the misclassification of positive and negative samples: misclassifying the minority class carries a high penalty and misclassifying the majority class a low one.

(3.2) Although a single weak classifier discriminates poorly, a well-designed combination of multiple weak classifiers can discriminate better than the best of its individual sub-classifiers.
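A small worked example may make the effect of unequal penalties concrete. The cost values and the two sets of predictions below are invented for illustration; the text does not state the actual penalty parameters.

```python
def weighted_cost(y_true, y_pred, cost_fn, cost_fp):
    """Total penalty: cost_fn per missed positive, cost_fp per false alarm."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return cost_fn * fn + cost_fp * fp

# Toy imbalanced set: 10 positives (minority), 90 negatives (majority).
y_true = [1] * 10 + [0] * 90
all_negative = [0] * 100                            # ignores the minority
finds_minority = [1] * 8 + [0] * 2 + [1] * 15 + [0] * 75

# Equal penalties: ignoring the minority class looks like the better model.
print(weighted_cost(y_true, all_negative, 1, 1))    # 10
print(weighted_cost(y_true, finds_minority, 1, 1))  # 17

# Higher penalty for missed positives (9:1 is an assumed ratio).
print(weighted_cost(y_true, all_negative, 9, 1))    # 90
print(weighted_cost(y_true, finds_minority, 9, 1))  # 33
```

With equal penalties the all-negative predictor has the lower total; once missed positives cost more, the predictor that actually finds minority samples is preferred.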
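The claim of step (3.2) can be illustrated with a toy majority vote. The combination rule is an assumption, since the text does not specify how the sub-classifier outputs are merged, and the labels below are invented.

```python
def majority_vote(predictions):
    """Combine 0/1 predictions from several classifiers, sample by sample.

    predictions: list of per-classifier label lists of equal length.
    Ties are resolved in favor of the positive class (an assumption).
    """
    n = len(predictions)
    return [1 if 2 * sum(votes) >= n else 0 for votes in zip(*predictions)]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 0, 1, 0]
clf_a = [0, 0, 1, 0, 1, 0]  # wrong on sample 0
clf_b = [1, 1, 1, 0, 1, 0]  # wrong on sample 1
clf_c = [1, 0, 0, 0, 1, 0]  # wrong on sample 2

ensemble = majority_vote([clf_a, clf_b, clf_c])
print(ensemble == y_true)  # True
```

Each toy classifier makes one error out of six, but because the errors fall on different samples the vote corrects all of them; the benefit depends on the sub-classifiers making different mistakes.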

Step (4) specifically comprises the following steps:

(4.1) First, the three-dimensional structure data of all antigen proteins with known epitopes in the sample data are obtained, together with the three-dimensional coordinates of all their epitopes.

(4.2) For each epitope, its distances to the other residues are tallied. The radius of the average clustering sphere is then determined according to the principle of maximum enrichment density and minimum cluster group.

(4.3) Using the radius obtained in step (4.2), all predicted antigenic determinant residues are partitioned into regions. Predicted residues that cluster together are regarded as potential epitope-forming residues; one or two predicted antigenic determinant residues lying away from any cluster region are regarded as false positives.
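The partitioning of step (4.3) can be sketched as a simple region-growing procedure. This is one possible reading, not necessarily the clustering algorithm used by the inventors: two residues are linked when their distance is within the radius, linked residues form a group, and groups of fewer than three residues are treated as false positives. The coordinates, residue numbers, and radius are invented.

```python
import math

def cluster_predicted_residues(coords, radius, min_size=3):
    """Group residues whose coordinates chain together within `radius`.

    coords: dict mapping residue id -> (x, y, z)
    Returns (epitope_groups, false_positives).
    """
    unvisited = set(coords)
    groups = []
    for seed in coords:
        if seed not in unvisited:
            continue
        unvisited.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [r for r in unvisited
                    if math.dist(coords[current], coords[r]) <= radius]
            for r in near:
                unvisited.remove(r)
            group.extend(near)
            frontier.extend(near)
        groups.append(sorted(group))
    epitopes = [g for g in groups if len(g) >= min_size]
    false_pos = sorted(r for g in groups if len(g) < min_size for r in g)
    return epitopes, false_pos

# Toy coordinates (made up): two dense patches and an isolated pair.
toy = {10: (0, 0, 0), 11: (3, 0, 0), 12: (6, 0, 0),
       40: (30, 0, 0), 41: (33, 0, 0), 42: (30, 3, 0),
       77: (100, 0, 0), 78: (103, 0, 0)}
epitopes, false_pos = cluster_predicted_residues(toy, radius=4.0)
print(epitopes)   # [[10, 11, 12], [40, 41, 42]]
print(false_pos)  # [77, 78]
```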

1. The data set comprises two parts: the bound data set of Rubinstein (antigen structures taken from antigen-antibody complexes) and the unbound data set of Liang (structures of the antigen alone, without antibody). This data set is a benchmark data set for conformational epitope prediction.

2. Feature description of antigen protein residues: see Table 1 for details.

Table 1. Feature description of antigen protein residues

After the optimal feature subset has been obtained from the constructed feature space by Fisher-Markov ranking and incremental feature iteration, the conventional method and the ensemble learning methods are applied to the bound and unbound data sets, and their prediction performance is compared with that of the cost-sensitive ensemble strategy. Tables 2 and 3 give the prediction results of the different ensemble learning methods on the bound and unbound data sets.

Table 2. Comparison of the results of different ensemble learning methods on the bound data set

Table 3. Comparison of the results of different ensemble learning strategies on the unbound data set

Tables 2 and 3 show that the conventional machine learning method has almost no predictive ability on the imbalanced data sets. Although its accuracy exceeds 90%, this is because it labels almost every sample as negative indiscriminately; its specificity is therefore very high (reaching 99.9%) while its sensitivity is very low, only about 1%.
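The combination of high accuracy and near-zero sensitivity follows directly from the definitions of the three measures. The counts below are invented to mimic the situation described and are not taken from Tables 2 and 3.

```python
def confusion_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # fraction of positives recovered
    specificity = tn / (tn + fp)  # fraction of negatives recovered
    return accuracy, sensitivity, specificity

# Toy data: 70 positives among 1000 residues; the classifier labels
# almost everything negative (1 true positive, 1 false positive).
y_true = [1] * 70 + [0] * 930
y_pred = [1] + [0] * 69 + [1] + [0] * 929

acc, sens, spec = confusion_metrics(y_true, y_pred)
print(acc)             # 0.93
print(round(sens, 3))  # 0.014
print(round(spec, 3))  # 0.999
```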

Compared with the conventional method, which does not process the samples in any way, the simple ensemble greatly improves the identification of the minority class: from 0.8% to 19.6% on the bound data set and from 1.1% to 25.6% on the unbound data set. The simple ensemble strategy performs multiple rounds of random sampling over the whole sample, and each sampled group yields an independent classification model. It is the simplest ensemble classification strategy; its advantages are ease of implementation and speed, and its drawback is limited performance.

The balance cascade ensemble strategy improves on the simple ensemble. In balance cascade sampling, majority-class data that have already been sampled do not take part in later sampling rounds, which ensures that the samples cover as wide a range of the majority-class data as possible. Compared with the simple ensemble strategy, the balance cascade achieves somewhat better prediction.
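The difference between the two sampling schemes, as described here, can be sketched as follows. The sketch covers only the selection of majority-class samples per round; it assumes that each round's subset would then be combined with the minority-class samples to train one sub-classifier, which the text does not spell out. The function names, subset sizes, and round counts are hypothetical.

```python
import random

def simple_ensemble_rounds(majority, subset_size, rounds, rng):
    """Each round draws independently, so samples may repeat across rounds."""
    return [rng.sample(majority, subset_size) for _ in range(rounds)]

def cascade_rounds(majority, subset_size, rounds, rng):
    """Samples used in one round are excluded from all later rounds."""
    pool = list(majority)
    rng.shuffle(pool)
    subsets = []
    for _ in range(rounds):
        if len(pool) < subset_size:
            break  # majority class exhausted
        subsets.append(pool[:subset_size])
        pool = pool[subset_size:]
    return subsets

rng = random.Random(0)
majority = list(range(100))  # indices of majority-class samples
simple = simple_ensemble_rounds(majority, 10, 5, rng)
cascade = cascade_rounds(majority, 10, 5, rng)

covered_simple = len({i for subset in simple for i in subset})
covered_cascade = len({i for subset in cascade for i in subset})
print(covered_cascade)                    # 50: no sample is reused
print(covered_simple <= covered_cascade)  # True
```

Because the cascade never reuses a majority sample, five rounds of ten always cover fifty distinct samples, whereas independent draws can only match or fall short of that coverage.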

The cost-sensitive ensemble strategy assigns different cost values to positive and negative samples and searches for the classification with the optimal expected cost, so that the classifier automatically seeks the minimum misclassification penalty. In this way every sub-classifier concentrates on the minority-class samples, which greatly improves the recognition rate for the minority class. The cost-sensitive strategy reached recognition rates of 64.8% and 70.4% on the bound and unbound data sets respectively, demonstrating the effectiveness of the method.

Compared with traditional drug and vaccine development approaches, predicting antigen protein epitopes computationally provides potential candidate epitopes very quickly. This gives biologists practical help and reduces the risk associated with the huge investment in drug development. Relative to previous methods, the present invention has two main innovations: 1. it is the first to use a cost-sensitive ensemble strategy, shifting the focus of prediction from overall accuracy to the minority-class (positive sample) data, which markedly improves prediction; 2. it applies a spatial clustering method to further analyze the predictions, excluding scattered residues and recognizing predicted antigenic determinant residues that cluster together as forming a potential epitope. The method further improves prediction accuracy and has considerable practical significance.

Claims

1. A sequence-based method for predicting epitopes using cost-sensitive ensemble learning and clustering, characterized in that it comprises the following steps:

(1) Feature construction: for the sample data, compute the descriptive features of the antigen protein to obtain the feature space of the sample data;

(2) Feature selection: select an optimal feature subset using the Fisher-Markov selector and an incremental iterative feature selection method;

(3) Cost-sensitive ensemble learning: using a cost-sensitive ensemble strategy, assign different misclassification penalty parameters to the severely imbalanced samples, significantly improving the recognition rate for the minority positive samples;

(4) Spatial clustering: apply a spatial clustering method to the predicted antigenic determinant residues, and identify the antigenic determinant residues that fall within the set threshold as an epitope.

2. The sequence-based method for predicting epitopes using cost-sensitive ensemble learning and clustering according to claim 1, characterized in that step (1) specifically comprises the following steps:

(1.1) Evolutionary conservation feature: PSI-BLAST is used to compute the position-specific scoring matrix of the antigen sequence; each amino acid substitution value in the resulting scoring matrix is normalized with the logistic function to obtain the evolutionary conservation score of that position; the evolutionary conservation feature of a residue consists of all evolutionary conservation scores within a window covering the 5 positions before and the 5 positions after that residue in the sequence;

(1.2) Secondary structure feature: PSIPRED is used to compute, for each residue of the antigen protein, the probability matrix of forming each secondary structure, namely helix, coil, or strand; the secondary structure feature of a residue consists of all secondary structure probabilities within a window covering the 5 positions before and the 5 positions after that residue in the sequence;

(1.3) Disordered region feature: DISORDER is used to compute, for each residue of the antigen protein, the probability matrix of the residue belonging to an ordered region or a disordered region; the disordered region feature of a residue consists of all disordered region probabilities within a window covering the 5 positions before and the 5 positions after that residue in the sequence;

(1.4) Dipeptide composition feature: residues in a protein often combine in pairs to form stable functional residue pairs, which are of great significance for analyzing and predicting protein structure and functional sites; based on the different pairings of the 20 amino acids, the composition of the 400 different dipeptides in a given protein is counted;

(1.5) Physicochemical attributes: six physicochemical properties shown to be closely related to the function of antigen protein residues are selected, namely hydrophilicity, flexibility, accessibility, polarity, exposed surface, and turns;

Step (2) specifically comprises the following steps:

(2.1) Rank the features with the Fisher-Markov method: the Fisher-Markov selector is used to compute the relevance between each feature from step (1) and the class label, and the features are sorted in descending order of relevance;

(2.2) Select the optimal feature subset with an incremental feature strategy: features from the ranked list are added to the feature pool one at a time in descending order of relevance, a classifier is built, trained, and used for prediction after each addition, and the optimal number of features is chosen according to predictive performance; the corresponding features form the optimal feature subset;

Step (3) specifically comprises the following steps:

(3.1) A cost-sensitive ensemble is used to handle the severe imbalance between positive and negative sample data: conventional machine learning methods perform poorly on classification problems with imbalanced positive and negative samples, owing to their inherent tendency to ignore the minority class in pursuit of higher accuracy; a cost-sensitive ensemble is introduced to handle the imbalance, in which positive and negative samples are first assigned different costs, so that misclassifying a positive sample and misclassifying a negative sample penalize the prediction result differently, and the classifier, in pursuing a better result, pays attention to identifying the minority class;

(3.2) Support vector machines are used to build the sub-classifiers: the base classifiers are built with the LibSVM machine learning tool, and gridsearchforSVM.m is used to find the optimal parameter c and [symbol missing] value; multiple sub-classifiers are combined into an ensemble classifier to improve the recognition accuracy of the model;

In step (4), because epitopes are generally enriched in the same region, the predicted antigenic determinant residues are spatially clustered, and regions of higher cluster density are identified as regions with a higher probability of forming an epitope; this specifically comprises the following steps:

(4.1) Collect the spatial coordinates of the antigenic determinant residues on the surface of the antigen proteins with known epitopes in the sample data; cluster all antigenic determinant residues according to the principle of maximum enrichment density and minimum cluster group to obtain the radius of the clustering sphere;

(4.2) Using the computed radius of the clustering sphere, cluster and partition the previously predicted antigenic determinant residues; all residues within a region enriched in antigenic determinant residues are identified as an epitope; a region containing only one or two predicted antigenic determinant residues is regarded as a false positive, i.e. a non-epitope.