Summary of the invention
To overcome the shortcomings of a single high-dimensional feature space containing mutually exclusive features, namely a prediction accuracy far from practical requirements and poor interpretability, the object of the present invention is to provide a subspace-fusion-based protein-vitamin binding site prediction method with fast prediction speed and high prediction accuracy.
To achieve the above object, the technical solution adopted by the present invention is as follows:
A subspace-fusion-based protein-vitamin binding site prediction method, comprising the following steps:
Step 1, feature extraction and feature combination: use the PSI-BLAST algorithm and the PSIPRED algorithm to extract the evolutionary information features and the secondary structure features of a protein, respectively, and extract the binding propensity features of the protein from a protein-vitamin binding site propensity table; these three kinds of features form the original feature space. Then convert each amino acid residue of the protein sequence into vector form using a sliding window and serial concatenation;
Step 2, apply three feature selection algorithms, namely Joint Laplacian Feature Weights Learning, Fisher Score and Laplacian Score, to the original feature space, respectively. The feature subset obtained by each selection forms one feature subspace, thereby building multiple feature subspaces;
Step 3, train one SVM classifier on each feature subspace obtained in step 2;
Step 4, fuse the trained SVM classifiers by weighted averaging; and
Step 5, use the fused SVM predictor to perform protein-vitamin binding site prediction on a protein to be predicted.
Further, in an embodiment, the feature extraction and serial concatenation for a training protein in step 1 comprises the following steps:
Step 1-1: for a protein composed of l amino acid residues, obtain its position-specific scoring matrix (PSSM) by the PSI-BLAST algorithm; this matrix has l rows and 20 columns, converting the primary structure (evolutionary) information of the protein into matrix form:
Wherein A, C, ..., Y denote the 20 amino acid residue types, and p_{i,j} denotes the likelihood that the i-th amino acid residue of the protein mutates into the j-th of the 20 amino acid types during evolution;
Then normalize each value in the PSSM row by row using formula (2):
The normalized PSSM is given by formula (3):
Next, using a sliding window of size W, extract the feature matrix of each amino acid residue:
Finally, flatten the feature matrix (4) in row-major order into a feature vector of dimension 20*W:
Step 1-2: for a protein composed of l amino acid residues, obtain its secondary structure probability matrix by PSIPRED; this matrix has l rows and 3 columns, as shown in formula (6):
Wherein C, H and E denote the three protein secondary structure types coil, helix and strand; s_{i,1} is the probability that the secondary structure of the i-th amino acid residue of the protein is coil, s_{i,2} the probability that it is helix, and s_{i,3} the probability that it is strand;
Then, using the sliding window of step 1-1 and row-major flattening, obtain the feature vector of dimension 3*W of each amino acid residue, as shown in formula (7):
f_i = (s_{i,1}, s_{i,2}, ..., s_{i,3W})^T (7)
Step 1-3: for a protein composed of l amino acid residues, obtain the matrix containing its binding propensity information by looking it up in the protein-vitamin binding site propensity table; this matrix has l rows and 1 column, as shown in formula (8):
Wherein b_i denotes the propensity of the i-th amino acid residue of the protein to bind a vitamin;
Then, using the sliding window of step 1-1 and row-major flattening, obtain the feature vector of dimension 1*W of each amino acid residue, as shown in formula (9):
f_i = (b_{i,1}, b_{i,2}, ..., b_{i,W})^T (9)
Step 1-4: serially concatenate the three feature vectors obtained above into a feature vector of length 20*W + 3*W + 1*W.
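The windowing and concatenation of steps 1-1 through 1-4 can be sketched as follows; this is a minimal sketch under assumptions the text does not state (an odd window size W centered on the residue, with positions falling outside the sequence zero-padded):

```python
import numpy as np

def window_features(mat, i, W):
    """Extract a W-row window of `mat` centered on residue i (rows outside
    the sequence are zero-padded, an assumed boundary rule) and flatten it
    in row-major order, as in formulas (4)-(5)."""
    l, d = mat.shape
    half = W // 2
    rows = []
    for j in range(i - half, i + half + 1):
        rows.append(mat[j] if 0 <= j < l else np.zeros(d))
    return np.concatenate(rows)          # length W * d

def residue_feature(pssm, ss_prob, propensity, i, W):
    """Step 1-4: serial concatenation of the three windowed features,
    giving a vector of length 20W + 3W + 1W = 24W."""
    return np.concatenate([
        window_features(pssm, i, W),         # evolutionary information, 20W
        window_features(ss_prob, i, W),      # secondary structure,       3W
        window_features(propensity, i, W),   # binding propensity,        1W
    ])
```

For W = 5, each residue is thus represented by a 120-dimensional vector in the original feature space.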
Further, in an embodiment, the specific implementation of building multiple feature subspaces with the three feature selection algorithms in step 2 comprises the following steps:
Step 2-1: apply the Joint Laplacian Feature Weights Learning algorithm to the original feature space produced by step 1 for feature selection, which comprises:
1) For the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and the diagonal matrix D_{M×M} using formulas (10) and (11):
D_{ii} = Σ_j H_{ij}, 1 ≤ i ≤ M and 1 ≤ j ≤ M (11)
Wherein R^{N×M} denotes the size of the matrix X, i.e. X holds M samples each with an N-dimensional feature; N is the feature dimension and M is the number of samples, i.e. the number of amino acid residues;
2) For the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} obtained above, solve the generalized eigenvalue problem Hy = λDy and take the eigenvector y corresponding to the largest eigenvalue below 1;
3) Using the eigenvector y obtained above, update the weight of each feature dimension according to formula (12) until convergence:
Wherein w = [w_1, w_2, ..., w_i, ..., w_N] holds the weight of each feature dimension, T denotes matrix transposition, t denotes the iteration number, and ε is a relaxation term controlling the number of zero elements in w;
4) From the weight vector w = [w_1, w_2, ..., w_i, ..., w_N] obtained above, select the sample feature dimensions corresponding to all weight components w_i greater than zero; output the feature subspace composed of all selected feature dimensions, together with the number of feature dimensions in the subspace;
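Substeps 1)-2) of step 2-1 can be sketched as follows. Since formula (10) is not reproduced in the text, a standard Gaussian heat kernel is assumed for H, and the iterative weight update of formula (12) is omitted; this is a sketch, not the patented implementation:

```python
import numpy as np
from scipy.linalg import eigh

def jlfwl_eigvec(X, sigma=1.0):
    """Build an (assumed) Gaussian similarity matrix H and its degree matrix
    D with D_ii = sum_j H_ij, then solve the generalized eigenvalue problem
    H y = lambda D y, keeping the eigenvector of the largest eigenvalue
    strictly below 1 (the lambda = 1 eigenvector is constant and useless)."""
    # X is N x M: M samples (amino acid residues) with N-dimensional features
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # M x M sq. dists
    H = np.exp(-sq / (2 * sigma ** 2))                         # assumed form of (10)
    D = np.diag(H.sum(axis=1))                                 # formula (11)
    vals, vecs = eigh(H, D)        # generalized symmetric eigenproblem, ascending
    below = np.where(vals < 1 - 1e-6)[0]
    return vecs[:, below[-1]]      # largest eigenvalue below 1
```

The returned y would then drive the iterative weight update of formula (12), whose exact form the excerpt does not reproduce.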
Step 2-2: apply the Fisher Score algorithm to the original feature space produced by step 1 for feature selection, which comprises:
1) For an original sample space with c classes, wherein the i-th class has a sample set with feature vectors and class labels, and M^{(i)} denotes the number of samples of the i-th class (a sample here is one amino acid residue of a protein), compute the mean and variance of each feature dimension of each class according to formulas (13) and (14):
1 ≤ n ≤ N and 1 ≤ i ≤ c (13)
1 ≤ n ≤ N and 1 ≤ i ≤ c (14)
2) Using all the means and variances computed above, calculate the Fisher Score of each feature dimension according to formula (15):
Wherein u_n denotes the mean of the n-th feature dimension over all data and H_n denotes the Fisher Score of the n-th feature dimension; each of the N feature dimensions has one Fisher Score value. Formula (15) thus yields a Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N];
3) Sort the values of the Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N] in descending order, then select the sample features corresponding to the top-ranked Fisher Score values and output the feature subspace composed of all selected features; the number of retained features is determined by step 2-1;
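Step 2-2 can be sketched as follows. Formula (15) is not reproduced in the text, so a standard Fisher score variant (class-size-weighted between-class scatter over within-class variance) is assumed:

```python
import numpy as np

def fisher_scores(X, labels):
    """Assumed form of formula (15): for each feature dimension, the scatter
    of the per-class means around the overall mean u_n, divided by the summed
    within-class variances. X is M x N (samples in rows)."""
    u = X.mean(axis=0)                       # overall mean u_n of each dimension
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        Mi = len(Xc)                         # M^(i): samples in class i
        num += Mi * (Xc.mean(axis=0) - u) ** 2
        den += Mi * Xc.var(axis=0)
    return num / den

def top_d_features(X, scores, d):
    """Substep 3): keep the d feature dimensions with the largest scores,
    where d is supplied by step 2-1."""
    idx = np.argsort(scores)[::-1][:d]
    return X[:, idx], idx
```

A dimension that separates the classes well has distant class means and small within-class variance, hence a large score.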
Step 2-3: apply the Laplacian Score algorithm to the original feature space produced by step 1 for feature selection, which comprises:
1) For the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and the diagonal matrix D_{M×M} using formulas (16) and (17):
D_{ii} = Σ_j H_{ij}, 1 ≤ i ≤ M and 1 ≤ j ≤ M (17)
Wherein R^{N×M} denotes the size of the matrix X, i.e. X holds M samples each with an N-dimensional feature; N is the feature dimension, M is the number of samples (amino acid residues), and σ is a Gaussian parameter: formula (16) computes the kernel-space distance between two samples (amino acid residues), and σ controls the width of the kernel;
2) Using the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} constructed above, calculate the Laplacian Score of each feature dimension according to formula (18):
Wherein x_{in} denotes the value of the n-th feature of the i-th sample, and the corresponding mean is taken over all samples; L_n denotes the Laplacian Score of the n-th feature dimension, each of the N feature dimensions having one Laplacian Score value. Formula (18) thus yields a Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N];
3) Sort the values of the Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N] in descending order, then select the sample features corresponding to the top-ranked Laplacian Score values and output the feature subspace composed of all selected features; the number of retained features is determined by step 2-1.
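Step 2-3 can be sketched as follows. Formulas (16)-(18) are only partially reproduced in the text, so the classical He et al. Laplacian Score with a Gaussian kernel graph is assumed (in that formulation a smaller score marks a more locality-preserving feature; the ranking direction used for selection follows the document):

```python
import numpy as np

def laplacian_scores(X, sigma=1.0):
    """Assumed form of formulas (16)-(18). X is M x N (samples in rows):
    build a Gaussian similarity graph H, its degree vector D (formula (17)),
    the graph Laplacian L = diag(D) - H, and score each feature by how much
    it varies across similar samples after weighted centring."""
    M, N = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # M x M sq. dists
    H = np.exp(-sq / (2 * sigma ** 2))        # assumed form of formula (16)
    D = H.sum(axis=1)                         # D_ii = sum_j H_ij (formula (17))
    L = np.diag(D) - H                        # graph Laplacian
    scores = np.zeros(N)
    for n in range(N):
        f = X[:, n]
        f = f - (f @ D) / D.sum()             # D-weighted centring of feature n
        scores[n] = (f @ L @ f) / (f @ (D * f))
    return scores
```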
Further, in an embodiment, in step 3, according to the distribution of the original samples in each feature subspace, one subspace SVM predictor is trained with the SVC classification algorithm of LIBSVM; three different SVM predictors are thus trained on the three feature subspaces.
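Step 3 can be sketched with scikit-learn, whose `SVC` class wraps LIBSVM's C-SVC; the kernel and parameters below are assumptions (the text does not specify them), and `probability=True` enables the probabilistic outputs that step 4 requires:

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM's C-SVC

def train_subspace_predictors(X, y, subspace_indices):
    """Train one probability-output SVM per feature subspace. Each entry of
    `subspace_indices` holds the feature dimensions retained by one of the
    three selection algorithms of step 2."""
    predictors = []
    for idx in subspace_indices:
        clf = SVC(kernel='rbf', probability=True)  # RBF kernel is an assumption
        clf.fit(X[:, idx], y)
        predictors.append(clf)
    return predictors
```

Each trained predictor then returns, via `predict_proba`, the 2-dimensional membership vector used by the fusion of step 4.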
Further, in an embodiment, in step 4, the SVM predictors of the three feature subspaces trained in step 3 are fused by weighted averaging, which comprises:
Let ω_1 and ω_2 denote the binding-site class and the non-binding-site class, respectively, and let S_1, S_2 and S_3 denote the three SVM predictors on the different feature subspaces. An evaluation sample set, whose amino acid residues have known classes, is used to determine the weight of the SVM model of each subspace. For the sample features represented by each x_i, the predictors S_1, S_2 and S_3 output three 2-dimensional vectors (s_{1,1}(x_i), s_{1,2}(x_i))^T, (s_{2,1}(x_i), s_{2,2}(x_i))^T and (s_{3,1}(x_i), s_{3,2}(x_i))^T; the two elements of each vector denote the degrees to which x_i belongs to ω_1 and ω_2, and they sum to 1. For the evaluation sample set, the prediction result matrices on S_1, S_2 and S_3 are thus obtained:
First, construct the target result matrix from the true classes of the evaluation set:
p_i = 1 if y_i = ω_1, otherwise p_i = 0 (20)
Next, calculate the error of the SVM classifier of each feature subspace:
Then, construct the weight of each subspace SVM predictor from its prediction error on the evaluation set:
Wherein M_eva denotes the error in the case of complete misclassification;
Finally, integrate the SVM predictors of the different subspaces according to the weights computed on the evaluation sample set, obtaining the fused SVM predictor of formula (23).
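The fusion of step 4 can be sketched as follows. Formulas (21)-(22) are not reproduced in the text, so a simple normalised inverse-error weighting over the evaluation set is assumed; the weighted average itself matches the described formula (23):

```python
import numpy as np

def fusion_weights(probas, targets):
    """Assumed form of formulas (21)-(22): each predictor's error is its mean
    absolute deviation between the predicted omega_1 probability and the
    target vector p of formula (20); weights are normalised inverse errors."""
    errors = np.array([np.mean(np.abs(p[:, 0] - targets)) for p in probas])
    inv = 1.0 / (errors + 1e-12)     # small constant guards a zero error
    return inv / inv.sum()

def fused_predict(probas, weights):
    """Formula (23): weighted average of the per-subspace probability outputs."""
    return sum(w * p for w, p in zip(weights, probas))
```

A predictor that is accurate on the evaluation set thus dominates the fused output, while a near-random one contributes little.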
Further, in an embodiment, in step 5 the fused SVM predictor performs protein-vitamin binding site prediction on a protein to be predicted:
For each amino acid residue of the protein to be predicted, generate its features in the original feature space according to step 1; then apply the three feature selection algorithms of step 2 to produce three subspace feature vectors; input these three subspace feature vectors into the corresponding SVM predictors S_1, S_2 and S_3 of step 3 to obtain three predictions in the form of vitamin-binding probabilities; feed these three predictions into the SVM predictor integrated by the weighted averaging of step 4, which outputs the probability that the amino acid residue binds or does not bind a vitamin. Finally, make the binding decision using the threshold T that maximizes the Matthews correlation coefficient: every amino acid residue whose binding probability is greater than or equal to T is predicted as a binding residue, and every amino acid residue whose binding probability is less than T is predicted as a non-binding residue, where T ∈ [0, 1].
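The threshold selection of step 5 can be sketched as follows, scanning a grid of candidate thresholds and keeping the one that maximises the Matthews correlation coefficient (the grid resolution is an assumption; the text only requires T ∈ [0, 1]):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(probs, labels, grid=101):
    """Choose T in [0, 1] maximising the MCC; residues with binding
    probability >= T are then predicted as binding residues."""
    best_t, best_m = 0.5, -1.0
    for k in range(grid):
        t = k / (grid - 1)
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        m = mcc(tp, fp, tn, fn)
        if m > best_m:
            best_t, best_m = t, m
    return best_t, best_m
```

MCC is a balanced criterion here because binding residues are far rarer than non-binding ones, so maximising raw accuracy would favour predicting everything as non-binding.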
It can be seen from the above technical solution that the beneficial effects of the present invention are:
1. Improved training speed, prediction speed and prediction accuracy: the subspace ensemble technique based on feature selection algorithms builds more compact feature subspaces, effectively resolves the mutual exclusion between features and reduces the dimensionality of the feature space, thereby improving training speed, prediction speed and prediction accuracy;
2. Improved model interpretability: with the subspace ensemble technique, different feature subspaces are selected for predicting the binding sites of proteins with different classes of vitamins, better expressing the differences between these prediction problems and improving the interpretability of the model.
Embodiment
To better understand the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawings.
As shown in Fig. 1, according to a preferred embodiment of the present invention, the subspace-fusion-based protein-vitamin binding site prediction method proceeds as follows. First, PSI-BLAST and PSIPRED are used to obtain the PSSM (evolutionary information matrix) and the secondary structure probability matrix of the protein, respectively, and the binding propensity matrix of the protein is generated from the protein-vitamin binding site propensity table. Second, a sliding window and serial concatenation are used to build the feature vector of each amino acid residue from the PSSM, the secondary structure probability matrix and the propensity table. Then, the three feature selection algorithms Joint Laplacian Feature Weights Learning (algorithm 1), Fisher Score (algorithm 2) and Laplacian Score (algorithm 3) are used to build three feature subspaces, such that features within the same subspace are not mutually exclusive while different subspaces carry complementary characteristics; one SVM predictor is trained on each subspace. Finally, the multiple SVM predictors are integrated by weighted averaging into the final prediction model, which performs protein-vitamin binding site prediction.
A so-called binding site is simply an amino acid residue that binds a vitamin.
With reference to Fig. 1, the specific implementation of each of the above steps in the present embodiment is described in detail below.
As an optional mode, the feature extraction and serial concatenation for a training protein in step 1 comprises the following steps:
Step 1-1: for a protein composed of l amino acid residues, obtain its position-specific scoring matrix (PSSM) by the PSI-BLAST algorithm; this matrix has l rows and 20 columns, converting the primary structure (evolutionary) information of the protein into matrix form:
Wherein A, C, ..., Y denote the 20 amino acid residue types, and p_{i,j} denotes the likelihood that the i-th amino acid residue of the protein mutates into the j-th of the 20 amino acid types (A, C, ..., Y) during evolution;
Then normalize each value in the PSSM row by row using formula (2):
The normalized PSSM is given by formula (3):
Next, using a sliding window of size W, extract the feature matrix of each amino acid residue:
Finally, flatten the feature matrix (4) in row-major order into a feature vector of dimension 20*W:
Step 1-2: for a protein composed of l amino acid residues, obtain its secondary structure probability matrix by PSIPRED; this matrix has l rows and 3 columns, as shown in formula (6):
Wherein C, H and E denote the three protein secondary structure types coil, helix and strand; s_{i,1} is the probability that the secondary structure of the i-th amino acid residue of the protein is coil, s_{i,2} the probability that it is helix, and s_{i,3} the probability that it is strand;
Then, using the sliding window of step 1-1 and row-major flattening, obtain the feature vector of dimension 3*W of each amino acid residue, as shown in formula (7):
f_i = (s_{i,1}, s_{i,2}, ..., s_{i,3W})^T (7)
Step 1-3: for a protein composed of l amino acid residues, obtain the matrix containing its binding propensity information by looking it up in the protein-vitamin binding site propensity table; this matrix has l rows and 1 column, as shown in formula (8):
Wherein b_i denotes the propensity of the i-th amino acid residue of the protein to bind a vitamin;
Then, using the sliding window of step 1-1 and row-major flattening, obtain the feature vector of dimension 1*W of each amino acid residue, as shown in formula (9):
f_i = (b_{i,1}, b_{i,2}, ..., b_{i,W})^T (9)
Step 1-4: serially concatenate the three feature vectors obtained above into a feature vector of length 20*W + 3*W + 1*W.
As an optional embodiment, the specific implementation of building multiple feature subspaces with the three feature selection algorithms in step 2 comprises the following steps:
Step 2-1: apply the Joint Laplacian Feature Weights Learning algorithm to the original feature space produced by step 1 for feature selection, which comprises:
1) For the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and the diagonal matrix D_{M×M} using formulas (10) and (11):
D_{ii} = Σ_j H_{ij}, 1 ≤ i ≤ M and 1 ≤ j ≤ M (11)
Wherein R^{N×M} denotes the size of the matrix X, i.e. X holds M samples each with an N-dimensional feature; N is the feature dimension and M is the number of samples, i.e. the number of amino acid residues;
2) For the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} obtained above, solve the generalized eigenvalue problem Hy = λDy and take the eigenvector y corresponding to the largest eigenvalue below 1 (Hy = λDy always has an eigenvalue equal to 1 with eigenvector y = [1, 1, ..., 1]^T, and this y is useless for feature selection; therefore an eigenvalue less than 1 is required, whose eigenvector is not y = [1, 1, ..., 1]^T);
3) Using the eigenvector y obtained above, update the weight of each feature dimension according to formula (12) until convergence:
Wherein w = [w_1, w_2, ..., w_i, ..., w_N] holds the weight of each feature dimension, T denotes matrix transposition, t denotes the iteration number, and ε is a relaxation term controlling the number of zero elements in w (formula (12) is an iterative formula: t marks the t-th iteration and distinguishes the values of w across iterations);
4) From the weight vector w = [w_1, w_2, ..., w_i, ..., w_N] obtained above, select the sample feature dimensions corresponding to all weight components w_i greater than zero (w_i being one component of w = [w_1, w_2, ..., w_i, ..., w_N]); output the feature subspace composed of all selected feature dimensions, together with the number of feature dimensions in the subspace;
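The parenthetical remark in substep 2), that Hy = λDy always has the eigenpair λ = 1, y = [1, ..., 1]^T, follows directly from D_ii = Σ_j H_ij and can be checked numerically with a small sketch:

```python
import numpy as np

def trivial_eigenpair_check(H):
    """For any similarity matrix H with degree matrix D_ii = sum_j H_ij,
    H @ ones equals D @ ones (both are the row sums of H), so
    (lambda = 1, y = [1, ..., 1]^T) always solves H y = lambda D y.
    This is why step 2-1 takes the largest eigenvalue strictly below 1."""
    D = np.diag(H.sum(axis=1))
    ones = np.ones(H.shape[0])
    return np.allclose(H @ ones, D @ ones)
```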
Step 2-2: apply the Fisher Score algorithm to the original feature space produced by step 1 for feature selection, which comprises:
1) For an original sample space with c classes, wherein the i-th class has a sample set with feature vectors and class labels, and M^{(i)} denotes the number of samples of the i-th class, compute the mean and variance of each feature dimension of each class according to formulas (13) and (14) (it is worth mentioning that a sample of the original sample space represents one concrete object; in the protein-vitamin binding site prediction of the present embodiment, a sample represents one amino acid residue of a protein):
1 ≤ n ≤ N and 1 ≤ i ≤ c (13)
1 ≤ n ≤ N and 1 ≤ i ≤ c (14)
2) Using all the means and variances computed above, calculate the Fisher Score of each feature dimension according to formula (15):
Wherein u_n denotes the mean of the n-th feature dimension over all data and H_n denotes the Fisher Score of the n-th feature dimension; each of the N feature dimensions has one Fisher Score value. Formula (15) thus yields a Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N];
3) Sort the values of the Fisher Score vector H = [H_1, H_2, ..., H_n, ..., H_N] in descending order, then select the sample features corresponding to the top-ranked Fisher Score values and output the feature subspace composed of all selected features; the number of retained features is determined by step 2-1 (namely output by substep 4) of step 2-1);
Step 2-3: apply the Laplacian Score algorithm to the original feature space produced by step 1 for feature selection, which comprises:
1) For the data X = [x_1, x_2, ..., x_M] ∈ R^{N×M} of the original feature space, construct the Laplacian matrix H_{M×M} and the diagonal matrix D_{M×M} using formulas (16) and (17):
D_{ii} = Σ_j H_{ij}, 1 ≤ i ≤ M and 1 ≤ j ≤ M (17)
Wherein R^{N×M} denotes the size of the matrix X, i.e. X holds M samples each with an N-dimensional feature; N is the feature dimension, M is the number of samples (amino acid residues), and σ is a Gaussian parameter: formula (16) computes the kernel-space distance between two samples (amino acid residues), and σ controls the width of the kernel;
2) Using the Laplacian matrix H_{M×M} and diagonal matrix D_{M×M} constructed above, calculate the Laplacian Score of each feature dimension according to formula (18):
Wherein x_{in} denotes the value of the n-th feature of the i-th sample, and the corresponding mean is taken over all samples; L_n denotes the Laplacian Score of the n-th feature dimension, each of the N feature dimensions having one Laplacian Score value. Formula (18) thus yields a Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N];
3) Sort the values of the Laplacian Score vector L = [L_1, L_2, ..., L_n, ..., L_N] in descending order, then select the sample features corresponding to the top-ranked Laplacian Score values and output the feature subspace composed of all selected features; the number of retained features is determined by step 2-1 (namely output by substep 4) of step 2-1).
Because the Fisher Score and Laplacian Score algorithms cannot by themselves determine how many feature dimensions to select, in the present embodiment the number of selected feature dimensions is determined autonomously by the algorithm of step 2-1.
As an optional embodiment, in step 3, according to the distribution of the original samples in each feature subspace, one subspace SVM predictor is trained with the SVC classification algorithm of LIBSVM; three different SVM predictors are thus trained on the three feature subspaces.
Further, in an embodiment, in step 4, the SVM predictors of the three feature subspaces trained in step 3 are fused by weighted averaging, which comprises:
Let ω_1 and ω_2 denote the binding-site class and the non-binding-site class, respectively, and let S_1, S_2 and S_3 denote the three SVM predictors on the different feature subspaces. An evaluation sample set, whose amino acid residues have known classes, is used to determine the weight of the SVM model of each subspace. For the sample features represented by each x_i, the predictors S_1, S_2 and S_3 output three 2-dimensional vectors (s_{1,1}(x_i), s_{1,2}(x_i))^T, (s_{2,1}(x_i), s_{2,2}(x_i))^T and (s_{3,1}(x_i), s_{3,2}(x_i))^T; the two elements of each vector denote the degrees to which x_i belongs to ω_1 and ω_2, and they sum to 1. For the evaluation sample set, the prediction result matrices on S_1, S_2 and S_3 are thus obtained:
First, construct the target result matrix from the true classes of the evaluation set:
p_i = 1 if y_i = ω_1, otherwise p_i = 0 (20)
Next, calculate the error of the SVM classifier of each feature subspace:
Then, construct the weight of each subspace SVM predictor from its prediction error on the evaluation set:
Wherein M_eva denotes the error in the case of complete misclassification;
Finally, integrate the SVM predictors of the different subspaces according to the weights computed on the evaluation sample set, obtaining the fused SVM predictor of formula (23).
In the present embodiment, the evaluation sample set and the protein to be predicted are different, i.e. two disjoint sets. The classes of the amino acid residues of the protein to be predicted are unknown, while those of the evaluation sample set are known; the evaluation sample set is used here to determine the weights of the subspace SVM models, so in practical terms it still belongs to the data used for building the model.
As an optional embodiment, in step 5 the fused SVM predictor performs protein-vitamin binding site prediction on a protein to be predicted:
For each amino acid residue of the protein to be predicted, generate its features in the original feature space according to step 1; then apply the three feature selection algorithms of step 2 to produce three subspace feature vectors; input these three subspace feature vectors into the corresponding SVM predictors S_1, S_2 and S_3 of step 3 to obtain three predictions in the form of vitamin-binding probabilities; feed these three predictions into the SVM predictor integrated by the weighted averaging of step 4, which outputs the probability that the amino acid residue binds or does not bind a vitamin. Finally, make the binding decision using the threshold T that maximizes the Matthews correlation coefficient: every amino acid residue whose binding probability is greater than or equal to T is predicted as a binding residue, and every amino acid residue whose binding probability is less than T is predicted as a non-binding residue, where T ∈ [0, 1].
In the technical solution exemplified above, the prediction method of this embodiment is based on the evolutionary information, secondary structure information and binding propensity information of a protein, and performs protein-vitamin binding site prediction using a subspace ensemble built on multiple feature selection algorithms together with support vector machine (SVM) prediction. The PSI-BLAST algorithm (A. A. Schaffer et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res., vol. 29, pp. 2994-3005, 2001) generates the position-specific scoring matrix representing the evolutionary information of the protein; the PSIPRED algorithm (D. T. Jones, "Protein secondary structure prediction based on position-specific scoring matrices," J Mol Biol, vol. 292, no. 2, pp. 195-202, Sep 17, 1999) extracts the secondary structure information of the protein; and a binding propensity algorithm (D. Yu, J. Hu, J. Yang et al., "Designing template-free predictor for targeting protein-ligand binding sites with classifier ensemble and spatial clustering," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, no. 4, pp. 994-1008, 2013) generates the binding propensity information of the protein. Multiple feature selection algorithms (H. Yan and J. Yang, "Joint Laplacian feature weights learning," Pattern Recognition, vol. 47, no. 3, pp. 1425-1432, 2014; C. Bishop, "Neural Networks for Pattern Recognition," Clarendon Press: Oxford, 1995) construct subspaces carrying complementary information; weighted averaging fuses the multiple predictors; and finally a soft-classification threshold segmentation decides the binding sites. Compared with the existing VitaPred predictor, the method achieves higher prediction accuracy and better interpretability.
The prediction of the vitamin binding sites of protein 2ZZA_A (without distinguishing vitamin species) is taken as an example below; the prediction results are shown in Table 1.
The amino acid sequence of protein 2ZZA_A is as follows:
>2ZZA_A
VIVSMIAALANNRVIGLDNKMPWHLPAELQLFKRATLGKPIVMGRNTFESIGRPLPGRLNIVLSRQTDYQPEGVTVVATLEDAVVAAGDVEELMIIGGATIYNQCLAAADRLYLTHIELTTEGDTWFPDYEQYNWQEIEHESYAADDKNPHNYRFSLLERVX
This protein has 19 vitamin binding sites.
First, according to step 1, the PSI-BLAST algorithm, the PSIPRED algorithm and the protein-vitamin binding site propensity table are used to extract the original features of each amino acid residue of protein 2ZZA_A. Next, the three feature selection algorithms of step 2, namely Joint Laplacian Feature Weights Learning (algorithm 1), Fisher Score (algorithm 2) and Laplacian Score (algorithm 3), perform subspace feature selection on the original features of each amino acid residue of protein 2ZZA_A, forming three subspace feature vectors. These are then input into the corresponding SVM predictors S_1, S_2 and S_3 of step 3, yielding three predictions in the form of vitamin-binding probabilities, which are fed into the SVM predictor integrated by the weighted averaging of step 4 to obtain the final prediction of the vitamin binding of protein 2ZZA_A. The final prediction results are shown in Table 1:
Table 1. Comparison of the prediction results on 2ZZA_A between the method of this embodiment and an existing protein-vitamin binding site predictor
As can be seen from Table 1, the prediction method of this embodiment correctly predicts 15 vitamin binding sites, with 0 false positive and 4 false negative vitamin binding sites; this result is clearly better than that of the existing protein-vitamin binding site predictor in the prior art.
Although the present invention is disclosed above by way of preferred embodiments, they are not intended to limit the invention. Those of ordinary skill in the art may make various modifications and variations without departing from the spirit and scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the appended claims.