Protein crystallization prediction method based on a two-layer SVM learning mechanism
Technical field
The present invention relates to the field of bioinformatics prediction of protein crystallization propensity, and in particular to a protein crystallization prediction method based on a two-layer SVM learning mechanism.
Background art
In proteomics, it is widely accepted that protein structure determines protein function, and accurate three-dimensional structure information helps reveal the specific functions of a protein; the importance of protein structure in proteomics is therefore self-evident. With the rapid development of sequencing technology and the progress of human structural genomics, a large number of protein sequences of unknown structure have accumulated. Structural genomics (A. E. Todd, R. L. Marsden, J. M. Thornton et al., "Progress of structural genomics initiatives: an analysis of solved target structures," J Mol Biol, vol. 348, no. 5, pp. 1235-60, May 20, 2005.) can determine the three-dimensional structure of a protein by techniques such as X-ray diffraction (M. J. Mizianty, X. Fan, J. Yan et al., "Covering complete proteomes with X-ray structures: a current snapshot," Biological Crystallography, vol. 70, no. 11, 2014.), nuclear magnetic resonance (L. Jackman, Dynamic nuclear magnetic resonance spectroscopy: Elsevier, 2012.), and electron microscopy (N. I. Bradshaw, D. C. Soares, J. Zou et al., "15:30 STRUCTURAL ELUCIDATION OF DISC1 PATHWAY PROTEINS USING ELECTRON MICROSCOPY, CHEMICAL CROSS-LINKING AND MASS SPECTROSCOPY," Schizophrenia Research, vol. 136, pp. S74, 2012.). However, these methods are expensive and time-consuming, and not every protein sequence yields a three-dimensional structure with existing measurement techniques. Predicting in advance the crystallization propensity of protein sequences of unknown structure can therefore shorten the structure determination cycle, reduce cost, raise the success rate, and accelerate the discovery of protein function. There is thus an urgent need to apply bioinformatics to develop intelligent methods that predict crystallization propensity quickly and accurately, directly from the protein sequence, which is of considerable biological significance for discovering and understanding protein function.
At present, both the interpretability and the prediction accuracy of models for the protein crystallization prediction problem need to be improved. A survey of the literature shows that the models used to predict protein crystallization include SECRET (P. Smialowski, T. Schmidt, J. Cox et al., "Will my protein crystallize? A sequence-based predictor," Proteins, vol. 62, no. 2, pp. 343-55, Feb 1, 2006.), CRYSTALP (K. Chen, L. Kurgan, and M. Rahbari, "Prediction of protein crystallization using collocation of amino acid pairs," Biochemical and Biophysical Research Communications, vol. 355, no. 3, pp. 764-769, Apr 13, 2007.), MetaCrys (M. J. Mizianty, and L. Kurgan, "Meta prediction of protein crystallization propensity," Biochemical and Biophysical Research Communications, vol. 390, no. 1, pp. 10-15, Dec 4, 2009.), PCCpred (M. J. Mizianty, and L. Kurgan, "Sequence-based prediction of protein crystallization, purification and production propensity," Bioinformatics, vol. 27, no. 13, pp. i24-33, Jul 1, 2011.), CRYSpred (M. J. Mizianty, and L. A. Kurgan, "CRYSpred: Accurate Sequence-Based Protein Crystallization Propensity Prediction Using Sequence-Derived Structural Characteristics," Protein Pept Lett, vol. 19, no. 1, pp. 40-9, Jan 1, 2012.), ParCrys (I. M. Overton, G. Padovani, M. A. Girolami et al., "ParCrys: a Parzen window density estimation approach to protein crystallization propensity prediction," Bioinformatics, vol. 24, no. 7, pp. 901-907, Apr 1, 2008.), SVMCRYS (K. K. Kandaswamy, G. Pugalenthi, P. N. Suganthan et al., "SVMCRYS: An SVM Approach for the Prediction of Protein Crystallization Propensity from Protein Sequence," Protein and Peptide Letters, vol. 17, no. 4, pp. 423-430, Apr, 2010.), RFCRYS (S. Jahandideh, and A. Mahdavi, "RFCRYS: sequence-based protein crystallization propensity prediction by means of random forest," J Theor Biol, vol. 306, pp. 115-9, Aug 7, 2012.), and SCMCRYS (P. Charoenkwan, W. Shoombuatong, H. C. Lee et al., "SCMCRYS: Predicting Protein Crystallization Using an Ensemble Scoring Card Method with Estimating Propensity Scores of P-Collocated Amino Acid Pairs," PloS one, vol. 8, no. 9, Sep, 2013.). The feature views used by these models include physicochemical properties, amino acid composition, dipeptide composition, tripeptide composition, secondary structure, sequence length, pseudo amino acid composition, and protein-protein interaction information. The prediction algorithms used include the Naive Bayes algorithm, the support vector machine (SVM), random forest, the scoring card method, and radial basis function neural networks. All of these models concatenate the multi-view features before feeding them into the prediction algorithm, and they achieve a certain prediction accuracy.
However, none of the protein crystallization prediction models described above uses the evolutionary information features of the protein, fully considers the mutual interference between features of different views, or mines the information in the features in depth, so the poor interpretability of protein crystallization prediction models remains to be overcome. Moreover, their prediction accuracy is still far from what practical applications require and urgently needs further improvement.
Summary of the invention
The above approaches to the protein crystallization prediction problem suffer from feature views with weak discriminative power, mutual interference between features of different views, and prediction algorithms with limited ability to mine information in depth; as a result, prediction accuracy is far from practical needs and interpretability is poor. To overcome these shortcomings, the object of the present invention is to provide a protein crystallization prediction method based on a two-layer SVM learning mechanism that combines protein evolutionary information features, protein sequence features, and amino acid physicochemical property features, and that uses the 2L-SVM prediction algorithm, which avoids mutual interference between features of different views, to achieve high prediction accuracy and strong model interpretability.
To achieve the above object, the present invention adopts the following technical solution:
A protein crystallization prediction method based on a two-layer SVM learning mechanism, comprising the following steps:
Step 1: feature extraction. PSI-BLAST is used to extract the evolutionary information of the protein, which is combined with protein sequence information and amino acid physicochemical property information. By extracting five view features, namely Amino Acid Composition (AAC), Dipeptide Composition (DiAAC), Tripeptide Composition (TriAAC), Pseudo Amino Acid Composition (PseAAC), and Pseudo Position Specific Scoring Matrix (PsePSSM), the protein sequence is converted into a numerical representation;
Step 2: according to the feature representations of step 1, every protein sequence in the training data set is represented under the different views, forming training sample sets for the five views; the two-layer SVM prediction algorithm 2L-SVM is then trained on these five training sample sets to obtain a protein crystallization 2L-SVM prediction model;
Step 3: for each protein sequence whose crystallization propensity is to be predicted, the features of its five views are obtained as in step 1, and the 2L-SVM prediction model trained in step 2 is used to predict its crystallization probability, which is output; and
Step 4: for the protein sequence to be predicted in step 3, a threshold segmentation method is applied to the probability output in step 3 to give the final decision on whether the protein sequence is crystallizable.
In a further embodiment, in step 1, the features of the different views are extracted as follows:
A. Extracting the AAC view feature
For any protein sequence P of length l, the numbers of occurrences of all amino acid types in the sequence are denoted as:
$$\mathrm{Count}_{AA} = (n_A, n_C, \ldots, n_Y)^T \quad (1)$$
where A, C, …, Y denote the 20 common amino acid residues, and $n_A$, $n_C$, and $n_Y$ denote the numbers of amino acids A, C, and Y in the protein sequence P, respectively;
The AAC view feature of the protein can then be expressed as:
$$\mathrm{AAC} = \frac{\mathrm{Count}_{AA}}{l} = \left(\frac{n_A}{l}, \frac{n_C}{l}, \ldots, \frac{n_Y}{l}\right)^T \quad (2)$$
B. Extracting the DiAAC view feature
For a protein sequence P of arbitrary length l, the DiAAC view feature of the protein is given by the following equation:
$$\mathrm{DiAAC} = \frac{(n_{A,A}, n_{A,C}, \ldots, n_{Y,Y})^T}{l-1} \quad (3)$$
where (A,A), (A,C), …, (Y,Y) denote the pairwise combinations of the 20 amino acids, and $n_{A,A}$, $n_{A,C}$, and $n_{Y,Y}$ denote the numbers of occurrences of the amino acid pairs (A,A), (A,C), and (Y,Y) in the protein sequence, respectively;
C. Extracting the TriAAC view feature
For any protein sequence P containing l amino acid residues, the TriAAC view feature is given by the following equation:
$$\mathrm{TriAAC} = \frac{(n_{A,A,A}, n_{A,A,C}, \ldots, n_{Y,Y,Y})^T}{l-2} \quad (4)$$
where (A,A,A), (A,A,C), …, (Y,Y,Y) denote the tripeptide combinations of the 20 amino acids, and $n_{A,A,A}$, $n_{A,A,C}$, and $n_{Y,Y,Y}$ denote the numbers of occurrences of the amino acid triplets (A,A,A), (A,A,C), and (Y,Y,Y) in the protein sequence, respectively;
D. Extracting the PseAAC view feature
Each amino acid has intrinsic physicochemical properties, and the PseAAC view feature is extracted from these properties as follows:
(1) Using the AAC method of step A, calculate the amino acid composition of the protein [equation (5)];
(2) Calculate the correlation information in the protein sequence for each physicochemical property, as follows. First calculate the λ-level correlation information of the protein on the k-th physicochemical property [equation (6)]. The terms of this equation are the λ-level correlation information between the i-th amino acid and the (i+λ)-th amino acid of the protein on the k-th physicochemical property, and the score of the i-th amino acid of the protein on the k-th physicochemical property;
Then calculate the correlation information of the protein at all levels on the k-th physicochemical property, denoted $\tau_k$ [equation (7)], where Λ is the maximum level;
Finally, calculate the correlation information of the protein on all physicochemical properties, denoted as:
$$\tau = (\tau_1, \tau_2, \ldots, \tau_K) \quad (8)$$
where K denotes the number of physicochemical properties in AAIndex;
(3) Combine the AAC information and the correlation information to form the PseAAC view feature, denoted as:
$$\mathrm{PseAAC} = (x_1, \ldots, x_\mu, \ldots, x_{K\cdot\Lambda}, x_{1+K\cdot\Lambda}, \ldots, x_{20+K\cdot\Lambda})^T \quad (9)$$
where the components $x_\mu$ are given by [equation (10)], in which $\lceil\cdot\rceil$ denotes the round-up (ceiling) operation and w denotes the weight of PseAAC;
E. Extracting the PsePSSM view feature
For a protein sequence P containing l amino acid residues, first obtain its position-specific scoring matrix (PSSM) with the PSI-BLAST algorithm. This PSSM is a matrix of l rows and 20 columns, so the primary structure information of the protein is converted into matrix form, expressed as follows:
$$\begin{pmatrix} o_{1,1} & o_{1,2} & \cdots & o_{1,20} \\ o_{2,1} & o_{2,2} & \cdots & o_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ o_{l,1} & o_{l,2} & \cdots & o_{l,20} \end{pmatrix} \quad (11)$$
where the columns correspond to the 20 amino acid residues A, C, …, Y, and $o_{i,j}$ denotes the likelihood that the i-th amino acid residue of the protein mutates into the j-th of the 20 amino acid residue types during evolution;
Then normalize this matrix, standardizing each of its values with the function [equation (12)];
The standardized PSSM is expressed as follows:
$$P_{pssm} = \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{l,1} & p_{l,2} & \cdots & p_{l,20} \end{pmatrix} \quad (13)$$
Next, for the standardized PSSM, the PsePSSM algorithm is used to convert evolutionary information matrices of unequal length into feature vectors of equal length, as follows:
(1) From $P_{pssm}$, mine the amino acid position relation information $\lambda_k$ of the different levels in the protein evolutionary information, expressed as follows:
$$\lambda_k = (\lambda_k^1, \lambda_k^2, \ldots, \lambda_k^{20}) \quad (14)$$
where
$$\lambda_k^j = \frac{1}{l-k}\sum_{i=1}^{l-k}\left(p_{i,j} - p_{i+k,j}\right)^2,$$
1≤j≤20, 1≤k≤K; K denotes the maximum level at which amino acid position relation information can be mined, so that the amino acid position relation information of K different levels is obtained;
(2) Average each column of $P_{pssm}$ to obtain a 20-dimensional feature vector:
$$C_{PSSM} = (p_1, p_2, \ldots, p_j, \ldots, p_{20}) \quad (15)$$
where
$$p_j = \frac{1}{l}\sum_{i=1}^{l} p_{i,j};$$
(3) Finally, concatenate the amino acid position relation information of the K different levels with $C_{PSSM}$ to obtain the PsePSSM feature of the protein sequence:
$$\mathrm{PsePSSM}_K = (\lambda_1, \lambda_2, \ldots, \lambda_K, C_{PSSM})^T. \quad (16)$$
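The PsePSSM construction of step E can be sketched in a few lines of Python. This is a minimal illustration, not the inventors' implementation: it assumes the PSSM has already been standardized, and it assumes the commonly used PsePSSM definition in which the level-k term for column j is the mean squared difference between rows that are k positions apart. The function name `pse_pssm` and the parameter `max_level` (the K of equation (16)) are hypothetical.

```python
def pse_pssm(pssm, max_level):
    """Turn an l x 20 standardized PSSM into a fixed-length feature vector.

    pssm      : list of l rows, each a list of 20 standardized scores
    max_level : largest level K of position-relation information (assumed name)
    Returns 20 * max_level + 20 values, ordered as (lambda_1, ..., lambda_K, C_PSSM).
    """
    length = len(pssm)
    if length <= max_level:
        raise ValueError("sequence must be longer than max_level")
    n_cols = len(pssm[0])
    features = []
    # Level-k terms: mean squared difference between rows i and i + k, per column.
    for k in range(1, max_level + 1):
        for j in range(n_cols):
            total = sum((pssm[i][j] - pssm[i + k][j]) ** 2
                        for i in range(length - k))
            features.append(total / (length - k))
    # Column means of the standardized PSSM (the C_PSSM part).
    for j in range(n_cols):
        features.append(sum(row[j] for row in pssm) / length)
    return features
```

Because the output length depends only on `max_level`, proteins of different lengths map to vectors of the same dimension, which is what allows a single SVM to be trained on this view.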
In a further embodiment, in step 2, the five kinds of view features obtained in step 1 form the training sample sets of five different views, and a 2L-SVM prediction model is trained taking into account the distribution of positive and negative samples in the five training sample sets, as follows:
A. For the training sample set $D_v$ of any view v, which pairs the v-view feature vector of the i-th sample with its class label $y_i$ for each of the N samples, use a standard SVM solver to solve the corresponding SVM optimization problem, in which $w_v$ is the normal vector of the optimal separating hyperplane, $\gamma_v > 0$ is the SVM regularization parameter, each sample i in the training data set $D_v$ carries a penalty term, and $\varphi_v(\cdot)$ is a mapping function that maps feature vectors into a high-dimensional Hilbert space. This yields the SVM prediction model of view v, denoted $\mathrm{SVM}_v$;
B. To train the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model, a cross-validation strategy is applied to the training sample set of each of the five views to obtain the probability outputs under the five views. These five probability outputs, together with the class labels of the training set, form a new training data set, denoted $D_{en}$, in which the i-th sample is represented by its cross-validated probability output on each view v. A standard SVM solver is then used again to train the optimal separating hyperplane on $D_{en}$, which forms the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model;
C. The five output probabilities of the five prediction models $\mathrm{SVM}_v$ obtained in step A are used as the input of the prediction model $\mathrm{SVM}_{en}$ obtained in step B, which constitutes the 2L-SVM prediction model.
In a further embodiment, in step 3, for each protein sequence whose crystallization propensity is to be predicted, the features of its five views are obtained as in step 1 and input, view by view, into the 2L-SVM prediction model trained in step 2 to predict the crystallization probability, which is output.
In a further embodiment, in step 4, a threshold segmentation method is applied to the output probability obtained in step 3 to make the final decision on whether the protein crystallizes. The threshold lies in the range 0 to 1 and is chosen so that the Matthews correlation coefficient of the prediction results is maximized.
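For reference, and assuming that the correlation coefficient named in step 4 is the Matthews correlation coefficient (MCC), its conventional definition in terms of confusion-matrix counts is given below. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives at a given threshold) are introduced here for illustration and do not appear in the original text.

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
```

MCC takes values between -1 and 1 and uses all four counts, so a threshold chosen to maximize it is not dominated by the majority class when positive and negative samples are unevenly distributed.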
As can be seen from the above technical solution, the beneficial effects of the present invention are:
1. Improved prediction accuracy: more effective protein evolutionary information features are used, and the 2L-SVM algorithm effectively avoids mutual interference between features of different views, so that more effective discriminative information can be mined and the prediction accuracy of the protein crystallization prediction model is improved;
2. Improved model interpretability: the first layer of the 2L-SVM prediction algorithm builds a base SVM model on each single view feature, which makes clear which view carries the more prominent discriminative information. The second layer of 2L-SVM mines more deeply on the basis of the first, fusing the effective information of the different views while avoiding mutual interference between them, so that the prediction results are fairer and more reasonable and the interpretability of the model is improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the principle of the protein crystallization prediction method based on a two-layer SVM learning mechanism according to an embodiment of the present invention.
Detailed description of the embodiments
To make the technical content of the present invention easier to understand, specific embodiments are described below with reference to the accompanying drawing.
As shown in Fig. 1, according to a preferred embodiment of the present invention, the protein crystallization prediction method based on a two-layer SVM learning mechanism proceeds as follows:
Step 1: feature extraction. PSI-BLAST is used to extract the evolutionary information of the protein, which is combined with protein sequence information and amino acid physicochemical property information. By extracting the five view features AAC, DiAAC, TriAAC, PseAAC, and PsePSSM, the protein sequence is converted into a numerical representation;
Step 2: according to the feature representations of step 1, every protein sequence in the training data set is represented under the different views, forming training sample sets for the five views; the two-layer SVM prediction algorithm 2L-SVM is then trained on these five training sample sets to obtain a protein crystallization 2L-SVM prediction model;
Step 3: for each protein sequence whose crystallization propensity is to be predicted, the features of its five views are obtained as in step 1, and the 2L-SVM prediction model trained in step 2 is used to predict its crystallization probability, which is output; and
Step 4: for the protein sequence to be predicted in step 3, a threshold segmentation method is applied to the probability output in step 3 to give the final decision on whether the protein sequence is crystallizable.
Some specific implementations of the foregoing method are described below by way of example with reference to the accompanying drawing.
In step 1, the features of the different views are extracted as follows:
A. Extracting the AAC view feature
For any protein sequence P of length l, the numbers of occurrences of all amino acid types in the sequence are denoted as:
$$\mathrm{Count}_{AA} = (n_A, n_C, \ldots, n_Y)^T \quad (1)$$
where A, C, …, Y denote the 20 common amino acid residues, and $n_A$, $n_C$, and $n_Y$ denote the numbers of amino acids A, C, and Y in the protein sequence P, respectively;
The AAC view feature of the protein can then be expressed as:
$$\mathrm{AAC} = \frac{\mathrm{Count}_{AA}}{l} = \left(\frac{n_A}{l}, \frac{n_C}{l}, \ldots, \frac{n_Y}{l}\right)^T \quad (2)$$
B. Extracting the DiAAC view feature
For a protein sequence P of arbitrary length l, the DiAAC view feature of the protein is given by the following equation:
$$\mathrm{DiAAC} = \frac{(n_{A,A}, n_{A,C}, \ldots, n_{Y,Y})^T}{l-1} \quad (3)$$
where (A,A), (A,C), …, (Y,Y) denote the pairwise combinations of the 20 amino acids, and $n_{A,A}$, $n_{A,C}$, and $n_{Y,Y}$ denote the numbers of occurrences of the amino acid pairs (A,A), (A,C), and (Y,Y) in the protein sequence, respectively;
C. Extracting the TriAAC view feature
For any protein sequence P containing l amino acid residues, the TriAAC view feature is given by the following equation:
$$\mathrm{TriAAC} = \frac{(n_{A,A,A}, n_{A,A,C}, \ldots, n_{Y,Y,Y})^T}{l-2} \quad (4)$$
where (A,A,A), (A,A,C), …, (Y,Y,Y) denote the tripeptide combinations of the 20 amino acids, and $n_{A,A,A}$, $n_{A,A,C}$, and $n_{Y,Y,Y}$ denote the numbers of occurrences of the amino acid triplets (A,A,A), (A,A,C), and (Y,Y,Y) in the protein sequence, respectively;
D. Extracting the PseAAC view feature
Each amino acid has intrinsic physicochemical properties, and the PseAAC view feature is extracted from these properties as follows:
(1) Using the AAC method of step A, calculate the amino acid composition of the protein [equation (5)];
(2) Calculate the correlation information in the protein sequence for each physicochemical property, as follows. First calculate the λ-level correlation information of the protein on the k-th physicochemical property [equation (6)]. The terms of this equation are the λ-level correlation information between the i-th amino acid and the (i+λ)-th amino acid of the protein on the k-th physicochemical property, and the score of the i-th amino acid of the protein on the k-th physicochemical property;
Then calculate the correlation information of the protein at all levels on the k-th physicochemical property, denoted $\tau_k$ [equation (7)], where Λ is the maximum level;
Finally, calculate the correlation information of the protein on all physicochemical properties, denoted as:
$$\tau = (\tau_1, \tau_2, \ldots, \tau_K) \quad (8)$$
where K denotes the number of physicochemical properties in AAIndex;
(3) Combine the AAC information and the correlation information to form the PseAAC view feature, denoted as:
$$\mathrm{PseAAC} = (x_1, \ldots, x_\mu, \ldots, x_{K\cdot\Lambda}, x_{1+K\cdot\Lambda}, \ldots, x_{20+K\cdot\Lambda})^T \quad (9)$$
where the components $x_\mu$ are given by [equation (10)], in which $\lceil\cdot\rceil$ denotes the round-up (ceiling) operation and w denotes the weight of PseAAC;
E. Extracting the PsePSSM view feature
For a protein sequence P containing l amino acid residues, first obtain its position-specific scoring matrix (PSSM) with the PSI-BLAST algorithm. This PSSM is a matrix of l rows and 20 columns, so the primary structure information of the protein is converted into matrix form, expressed as follows:
$$\begin{pmatrix} o_{1,1} & o_{1,2} & \cdots & o_{1,20} \\ o_{2,1} & o_{2,2} & \cdots & o_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ o_{l,1} & o_{l,2} & \cdots & o_{l,20} \end{pmatrix} \quad (11)$$
where the columns correspond to the 20 amino acid residues A, C, …, Y, and $o_{i,j}$ denotes the likelihood that the i-th amino acid residue of the protein mutates into the j-th of the 20 amino acid residue types during evolution;
Then normalize this matrix, standardizing each of its values with the function [equation (12)];
The standardized PSSM is expressed as follows:
$$P_{pssm} = \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{l,1} & p_{l,2} & \cdots & p_{l,20} \end{pmatrix} \quad (13)$$
Next, for the standardized PSSM, the PsePSSM algorithm is used to convert evolutionary information matrices of unequal length into feature vectors of equal length, as follows:
(1) From $P_{pssm}$, mine the amino acid position relation information $\lambda_k$ of the different levels in the protein evolutionary information, expressed as follows:
$$\lambda_k = (\lambda_k^1, \lambda_k^2, \ldots, \lambda_k^{20}) \quad (14)$$
where
$$\lambda_k^j = \frac{1}{l-k}\sum_{i=1}^{l-k}\left(p_{i,j} - p_{i+k,j}\right)^2,$$
1≤j≤20, 1≤k≤K; K denotes the maximum level at which amino acid position relation information can be mined, so that the amino acid position relation information of K different levels is obtained;
(2) Average each column of $P_{pssm}$ to obtain a 20-dimensional feature vector:
$$C_{PSSM} = (p_1, p_2, \ldots, p_j, \ldots, p_{20}) \quad (15)$$
where
$$p_j = \frac{1}{l}\sum_{i=1}^{l} p_{i,j};$$
(3) Finally, concatenate the amino acid position relation information of the K different levels with $C_{PSSM}$ to obtain the PsePSSM feature of the protein sequence:
$$\mathrm{PsePSSM}_K = (\lambda_1, \lambda_2, \ldots, \lambda_K, C_{PSSM})^T. \quad (16)$$
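The three composition views of steps A to C (AAC, DiAAC, TriAAC) differ only in the window size, so they can be illustrated with one helper. The sketch below is an assumption-laden illustration rather than the patented implementation: the function name `kmer_composition` and the constant `AMINO_ACIDS` are hypothetical, counts are divided by the number of overlapping windows (l - k + 1), and windows containing residues outside the 20 common amino acids are skipped.

```python
from itertools import product

# The 20 common amino acid residues in alphabetical one-letter order (A ... Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k):
    """Return the 20**k-dimensional frequency vector of overlapping k-mers.

    k = 1, 2, 3 correspond to the AAC, DiAAC, and TriAAC views.
    Normalizing by the window count l - k + 1 is an assumption of this sketch.
    """
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_windows = len(sequence) - k + 1
    for i in range(n_windows):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with non-standard residues
            counts[window] += 1
    return [counts[m] / n_windows for m in kmers]
```

For example, `kmer_composition(seq, 1)`, `kmer_composition(seq, 2)`, and `kmer_composition(seq, 3)` give vectors of 20, 400, and 8000 dimensions respectively, one per view.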
Next, in step 2, the five kinds of view features obtained in step 1 form the training sample sets of five different views, and a 2L-SVM prediction model is trained taking into account the distribution of positive and negative samples in the five training sample sets, as follows:
A. For the training sample set $D_v$ of any view v, which pairs the v-view feature vector of the i-th sample with its class label $y_i$ for each of the N samples, use a standard SVM solver to solve the corresponding SVM optimization problem, in which $w_v$ is the normal vector of the optimal separating hyperplane, $\gamma_v > 0$ is the SVM regularization parameter, each sample i in the training data set $D_v$ carries a penalty term, and $\varphi_v(\cdot)$ is a mapping function that maps feature vectors into a high-dimensional Hilbert space. This yields the SVM prediction model of view v, denoted $\mathrm{SVM}_v$;
B. To train the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model, a cross-validation strategy is applied to the training sample set of each of the five views to obtain the probability outputs under the five views. These five probability outputs, together with the class labels of the training set, form a new training data set, denoted $D_{en}$, in which the i-th sample is represented by its cross-validated probability output on each view v. A standard SVM solver is then used again to train the optimal separating hyperplane on $D_{en}$, which forms the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model;
C. The five output probabilities of the five prediction models $\mathrm{SVM}_v$ obtained in step A are used as the input of the prediction model $\mathrm{SVM}_{en}$ obtained in step B, which constitutes the 2L-SVM prediction model.
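The data flow of steps A to C can be sketched independently of any particular SVM library. In the illustration below, the learner is passed in as a function, so a probability-producing SVM would be plugged in where the description calls for one; `fit_centroid` is a hypothetical stand-in learner used only to make the sketch runnable and is not an SVM. All function names and the interleaved fold assignment are assumptions of this sketch.

```python
import math


def out_of_fold_probabilities(fit, X, y, n_folds=5):
    """Cross-validated probability for every training sample of one view.

    fit(X_train, y_train) must return a function mapping one feature
    vector to an estimated probability of the positive class.
    """
    n = len(X)
    probs = [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            probs[i] = model(X[i])
    return probs


def train_two_layer(fit_first, fit_second, views, y, n_folds=5):
    """views: one feature matrix per view (five in the method described)."""
    # Step A: one first-layer model per view, trained on all samples.
    first_layer = [fit_first(X, y) for X in views]
    # Step B: out-of-fold probabilities form the second-layer training set.
    oof = [out_of_fold_probabilities(fit_first, X, y, n_folds) for X in views]
    d_en = [[column[i] for column in oof] for i in range(len(y))]
    second_layer = fit_second(d_en, y)

    # Step C: first-layer probabilities feed the second-layer model.
    def predict(sample_views):
        p = [model(x) for model, x in zip(first_layer, sample_views)]
        return second_layer(p)

    return predict


def fit_centroid(X, y):
    """Hypothetical stand-in learner (NOT an SVM): nearest-centroid score."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos, neg = centroid(1), centroid(0)

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return 1.0 / (1.0 + math.exp(d_pos - d_neg))

    return predict
```

Using out-of-fold probabilities in step B means the second-layer model is trained on first-layer outputs for samples the first-layer models did not see during fitting, which is the usual reason for the cross-validation strategy in stacked models.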
After the 2L-SVM prediction model has been trained, in step 3, for each protein sequence whose crystallization propensity is to be predicted, the features of its five views are obtained as in step 1 and input, view by view, into the 2L-SVM prediction model trained in step 2 to predict the crystallization probability, which is output.
Finally, in step 4, a threshold segmentation method is applied to the output probability obtained in step 3 to make the final decision on whether the protein crystallizes. The threshold lies in the range 0 to 1 and is chosen so that the Matthews correlation coefficient of the prediction results is maximized.
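The threshold selection of step 4 can be illustrated with a simple grid search. This is a minimal sketch under stated assumptions: the correlation coefficient is taken to be the Matthews correlation coefficient, the grid resolution `steps` and the function names are hypothetical, and the text does not specify on which data the threshold is tuned, so the labelled probabilities passed in are whatever set the practitioner chooses.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(y_true, probs, steps=100):
    """Scan thresholds in [0, 1] and keep the first one with the highest MCC."""
    best_t, best_score = 0.0, -1.0
    for s in range(steps + 1):
        t = s / steps
        preds = [1 if p >= t else 0 for p in probs]
        score = mcc(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

A protein is then called crystallizable when its predicted probability is at least the selected threshold.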
In summary, compared with existing prediction methods, the notable advantages of the present invention are that it provides multi-view features that include protein evolutionary information, a model architecture that effectively avoids mutual interference between multi-view features, and a prediction algorithm, 2L-SVM, capable of deeply mining effective discriminative information, so that the prediction model gains in both interpretability and prediction accuracy.
Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. A person of ordinary skill in the art to which the invention pertains may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the present invention is therefore defined by the appended claims.