CN104636635A - Protein crystallization predicting method based on two-layer SVM learning mechanism - Google Patents

Publication number
CN104636635A
Authority
CN
China
Prior art keywords
protein
svm
amino acid
information
visual angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510047426.7A
Other languages
Chinese (zh)
Other versions
CN104636635B (en)
Inventor
胡俊
於东军
何雪
李阳
沈红斌
杨静宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology

Abstract

The invention provides a protein crystallization prediction method based on a two-layer SVM learning mechanism. The method comprises the following steps: first, the evolutionary information of the protein is obtained from its sequence by PSI-BLAST; then five types of view features, namely AAC, DiAAC, TriAAC, PseAAC and PsePSSM, are extracted from the sequence information, the evolutionary information and the physicochemical properties of amino acids; a two-layer SVM prediction model (2L-SVM) is trained on the five types of view features; the 2L-SVM model is then used for prediction, in which each of the five view features is input to its corresponding first-layer model and the five resulting probability outputs are input to the second-layer model to obtain the predicted probability; finally, threshold segmentation is applied to obtain the final decision. The advantages of the method are that features from five different views add discriminative information and improve the predictive power of the model, and that the 2L-SVM prediction model effectively avoids the information loss caused by mutual interference between views, improving the prediction accuracy of the model.

Description

Protein crystallization prediction method based on a two-layer SVM learning mechanism
Technical field
The present invention relates to the field of bioinformatics prediction of protein crystallization propensity, and in particular to a protein crystallization prediction method based on a two-layer SVM learning mechanism.
Background
In proteomics it is generally accepted that protein structure determines protein function, and that accurate three-dimensional structure information helps reveal the specific functions of a protein; the central role of protein structure in proteomics is therefore self-evident. With the rapid development of sequencing technology and the progress of human structural genomics, a large number of protein sequences of unknown structure have accumulated. Structural genomics (A. E. Todd, R. L. Marsden, J. M. Thornton et al., "Progress of structural genomics initiatives: an analysis of solved target structures," J Mol Biol, vol. 348, no. 5, pp. 1235-60, May 20, 2005.) can determine the three-dimensional structure of a protein by X-ray diffraction (M. J. Mizianty, X. Fan, J. Yan et al., "Covering complete proteomes with X-ray structures: a current snapshot," Biological Crystallography, vol. 70, no. 11, 2014.), nuclear magnetic resonance (L. Jackman, Dynamic nuclear magnetic resonance spectroscopy: Elsevier, 2012.), electron microscopy (N. I. Bradshaw, D. C. Soares, J. Zou et al., "15:30 STRUCTURAL ELUCIDATION OF DISC1 PATHWAY PROTEINS USING ELECTRON MICROSCOPY, CHEMICAL CROSS-LINKING AND MASS SPECTROSCOPY," Schizophrenia Research, vol. 136, pp. S74, 2012.) and other such techniques.
However, the methods of structural genomics are expensive and time-consuming, and not every protein sequence yields a three-dimensional structure with existing measurement techniques. Predicting in advance the crystallization propensity of a protein sequence of unknown structure can therefore shorten the structure-determination cycle, reduce cost, raise the success rate, and speed up the discovery of protein function. There is thus an urgent need to apply bioinformatics to develop fast, accurate, intelligent methods that predict crystallization propensity directly from the protein sequence, which is of considerable biological significance for discovering and understanding protein function.
At present, both the interpretability and the prediction accuracy of models for the protein crystallization propensity prediction problem need improvement. A survey of the literature shows that the existing prediction models for protein crystallization include SECRET (P. Smialowski, T. Schmidt, J. Cox et al., "Will my protein crystallize? A sequence-based predictor," Proteins, vol. 62, no. 2, pp. 343-55, Feb 1, 2006.), CRYSTALP (K. Chen, L. Kurgan, and M. Rahbari, "Prediction of protein crystallization using collocation of amino acid pairs," Biochemical and Biophysical Research Communications, vol. 355, no. 3, pp. 764-769, Apr 13, 2007.), MetaCrys (M. J. Mizianty, and L. Kurgan, "Meta prediction of protein crystallization propensity," Biochemical and Biophysical Research Communications, vol. 390, no. 1, pp. 10-15, Dec 4, 2009.), PCCpred (M. J. Mizianty, and L. Kurgan, "Sequence-based prediction of protein crystallization, purification and production propensity," Bioinformatics, vol. 27, no. 13, pp. i24-33, Jul 1, 2011.), CRYSpred (M. J. Mizianty, and L. A. Kurgan, "CRYSpred: Accurate Sequence-Based Protein Crystallization Propensity Prediction Using Sequence-Derived Structural Characteristics," Protein Pept Lett, vol. 19, no. 1, pp. 40-9, Jan 1, 2012.), ParCrys (I. M. Overton, G. Padovani, M. A. Girolami et al., "ParCrys: a Parzen window density estimation approach to protein crystallization propensity prediction," Bioinformatics, vol. 24, no. 7, pp. 901-907, Apr 1, 2008.), SVMCRYS (K. K. Kandaswamy, G. Pugalenthi, P. N. Suganthan et al., "SVMCRYS: An SVM Approach for the Prediction of Protein Crystallization Propensity from Protein Sequence," Protein and Peptide Letters, vol. 17, no. 4, pp. 423-430, Apr, 2010.), RFCRYS (S. Jahandideh, and A. Mahdavi, "RFCRYS: sequence-based protein crystallization propensity prediction by means of random forest," J Theor Biol, vol. 306, pp. 115-9, Aug 7, 2012.), SCMCRYS (P. Charoenkwan, W. Shoombuatong, H. C. Lee et al., "SCMCRYS: Predicting Protein Crystallization Using an Ensemble Scoring Card Method with Estimating Propensity Scores of P-Collocated Amino Acid Pairs," PloS one, vol. 8, no. 9, Sep, 2013.) and others.
The feature views used by these prediction models include physicochemical properties, amino acid composition, dipeptide composition, tripeptide composition, secondary structure, sequence length, pseudo amino acid composition, and protein-protein interaction information. The prediction algorithms used include the naive Bayes algorithm, the support vector machine (SVM), random forest, the scoring card method, and radial basis function neural networks. All of these models concatenate the multi-view features before feeding them to the prediction algorithm, and they achieve a certain prediction accuracy.
However, none of the protein crystallization prediction models described above uses the evolutionary information of the protein, fully considers the mutual interference between features of different views, or mines the information in the features in depth. As a result, the poor interpretability of these models remains to be overcome, and their prediction accuracy is still far from what practical application requires and urgently needs further improvement.
Summary of the invention
The protein crystallization prediction problem described above suffers from feature views with weak discriminative power, mutual interference between features of different views, and prediction algorithms that mine information only shallowly, so that prediction accuracy falls well short of practical needs and interpretability is poor. To overcome these shortcomings, the object of the invention is to provide a protein crystallization prediction method based on a two-layer SVM learning mechanism that combines protein evolutionary-information view features, protein-sequence view features and amino acid physicochemical-property view features, and that uses a 2L-SVM prediction algorithm capable of avoiding mutual interference between views, giving high prediction accuracy and strong model interpretability.
To achieve this object, the invention adopts the following technical solution:
A protein crystallization prediction method based on a two-layer SVM learning mechanism, comprising the following steps:
Step 1: Feature extraction. PSI-BLAST is used to extract the evolutionary information of the protein; combined with the protein sequence information and the physicochemical property information of amino acids, five view features are extracted, namely Amino Acid Composition (AAC), Dipeptide Composition (DiAAC), Tripeptide Composition (TriAAC), Pseudo Amino Acid Composition (PseAAC) and Pseudo Position Specific Scoring Matrix (PsePSSM), so that the protein sequence is converted into numerical form;
Step 2: Using the feature representations of step 1, every protein sequence in the training data set is represented under each view to form five per-view training sample sets; the two-layer SVM prediction algorithm 2L-SVM is then trained on these five sets to obtain a protein crystallization 2L-SVM prediction model;
Step 3: For each protein sequence whose crystallization propensity is to be predicted, its five view features are obtained by step 1, and the 2L-SVM prediction model trained in step 2 is used to predict and output its crystallization probability; and
Step 4: For the protein sequence of step 3, threshold segmentation is applied to the probability output in step 3, and the final decision on whether the sequence is crystallizable is output.
In a further embodiment, in step 1 the features of the different views are extracted as follows:
A. Extracting the AAC view feature
For any protein sequence P of length l, the numbers of occurrences of each amino acid type in the sequence are denoted:
$$\mathrm{Count}_{AA} = (n_A, n_C, \ldots, n_Y)^T \tag{1}$$
where A, C, ..., Y denote the 20 common amino acid residues, and $n_A$, $n_C$ and $n_Y$ denote the numbers of amino acids A, C and Y in the protein sequence P;
The AAC view feature of the protein can then be expressed as:
$$\mathrm{AAC} = \left(\frac{n_A}{l}, \frac{n_C}{l}, \ldots, \frac{n_Y}{l}\right)^T \tag{2}$$
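As an illustration of equations (1) and (2), the following minimal Python sketch counts residues and divides by the sequence length. The function name `aac` and the alphabetical one-letter ordering in `AMINO_ACIDS` are assumptions made for this example; the source only specifies the order as A, C, ..., Y.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order (A, C, ..., Y).
# The exact ordering is an assumption; the source only writes "A, C, ..., Y".
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20-dimensional AAC vector (n_A/l, n_C/l, ..., n_Y/l)."""
    l = len(sequence)
    if l == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)  # equation (1): occurrences of each residue
    # Equation (2): each count divided by the full sequence length l.
    return [counts[a] / l for a in AMINO_ACIDS]
```

For example, `aac("ACCA")` gives 0.5 for A, 0.5 for C and 0 for every other residue.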
B. Extracting the DiAAC view feature
For any protein sequence P of length l, the DiAAC view feature of the protein is given by:
$$\mathrm{DiAAC} = \left(\frac{n_{A,A}}{l-1}, \frac{n_{A,C}}{l-1}, \ldots, \frac{n_{Y,Y}}{l-1}\right)^T \tag{3}$$
where (A,A), (A,C), ..., (Y,Y) denote the pairwise combinations of the 20 amino acids, and $n_{A,A}$, $n_{A,C}$ and $n_{Y,Y}$ denote the numbers of amino acid pairs (A,A), (A,C) and (Y,Y) in the protein sequence;
C. Extracting the TriAAC view feature
For any protein sequence P containing l amino acid residues, the TriAAC view feature is given by:
$$\mathrm{TriAAC} = \left(\frac{n_{A,A,A}}{l-2}, \frac{n_{A,A,C}}{l-2}, \ldots, \frac{n_{Y,Y,Y}}{l-2}\right)^T \tag{4}$$
where (A,A,A), (A,A,C), ..., (Y,Y,Y) denote the tripeptide combinations of the 20 amino acids, and $n_{A,A,A}$, $n_{A,A,C}$ and $n_{Y,Y,Y}$ denote the numbers of tripeptides (A,A,A), (A,A,C) and (Y,Y,Y) in the protein sequence;
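Equations (3) and (4) share one pattern, so a single helper can sketch both. The code below assumes that pairs and triples are counted over overlapping windows of adjacent residues, which is consistent with the denominators l-1 and l-2 but is not stated explicitly in the source. The name `kmer_composition` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering A, C, ..., Y


def kmer_composition(sequence, k):
    """Frequencies of all 20**k ordered k-mers over overlapping windows.

    k=2 gives the 400-dimensional DiAAC vector (divisor l-1);
    k=3 gives the 8000-dimensional TriAAC vector (divisor l-2).
    """
    windows = len(sequence) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    counts = {}
    for i in range(windows):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    # Enumerate k-mers in the order AA, AC, ..., YY (or AAA, AAC, ..., YYY).
    return [counts.get("".join(p), 0) / windows
            for p in product(AMINO_ACIDS, repeat=k)]
```

Calling `kmer_composition(seq, 2)` and `kmer_composition(seq, 3)` yields the two feature vectors. Most entries of the tripeptide vector are zero for a typical sequence, since it has 8000 dimensions.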
D. Extracting the PseAAC view feature
Each amino acid has intrinsic physicochemical properties, from which the PseAAC view feature is extracted as follows:
(1) Using the AAC calculation of step A, compute the amino acid composition of the protein, denoted:
$$F_{AAC} = (f_1, f_2, \ldots, f_{20}) = \left(\frac{n_A}{l}, \frac{n_C}{l}, \ldots, \frac{n_Y}{l}\right) \tag{5}$$
(2) Compute the correlation information in the protein sequence for each physicochemical property, as follows. First compute the λ-th level correlation information of the protein on the k-th physicochemical property:
$$\tau_\lambda^k = \frac{1}{l-\lambda}\sum_{i=1}^{l-\lambda} \mathrm{Corr}_{i,i+\lambda}^k \tag{6}$$
where $\mathrm{Corr}_{i,i+\lambda}^k$ denotes the λ-th level correlation information between the i-th and (i+λ)-th amino acids of the protein on the k-th physicochemical property, and [symbol missing in source] denotes the score of the i-th amino acid of the protein on the k-th physicochemical property;
Then compute the correlation information of all levels of the protein on the k-th physicochemical property, denoted:
$$\tau^k = (\tau_1^k, \tau_2^k, \ldots, \tau_\Lambda^k) \tag{7}$$
where Λ is the maximum level;
Finally compute the correlation information of the protein on all physicochemical properties, denoted:
$$\tau = (\tau^1, \tau^2, \ldots, \tau^K) \tag{8}$$
where K denotes the number of physicochemical properties in AAIndex;
(3) Combine the AAC information with the correlation information to form the final PseAAC view feature, denoted:
$$\mathrm{PseAAC} = (x_1, \ldots, x_\mu, \ldots, x_{K\cdot\Lambda}, x_{1+K\cdot\Lambda}, \ldots, x_{20+K\cdot\Lambda})^T \tag{9}$$
where [equation (10), defining the components $x_\mu$, is missing in source]
in which $\lceil\cdot\rceil$ denotes the round-up operation and w denotes the weight of the PseAAC;
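The averaging in equation (6) can be sketched as below for a single physicochemical property. The source does not give the formula for the pairwise term Corr, so the default here, a squared difference of the two residues' scores, is an assumption borrowed from common pseudo amino acid composition practice; callers can pass a different `corr`. The names `correlation_factors`, `scores` and `max_lag` are hypothetical. The sketch stops at τ and does not assemble the PseAAC vector, because the defining equation for its components is not available in the source.

```python
def correlation_factors(sequence, scores, max_lag,
                        corr=lambda a, b: (a - b) ** 2):
    """Return [tau_1, ..., tau_max_lag] of equation (6) for one property.

    scores maps each residue letter to its value on that property.
    corr is an ASSUMED pairwise term (squared difference by default);
    the source does not define Corr.
    """
    l = len(sequence)
    if max_lag >= l:
        raise ValueError("max_lag must be smaller than the sequence length")
    h = [scores[r] for r in sequence]  # per-position property scores
    taus = []
    for lag in range(1, max_lag + 1):
        # Sum the pairwise term over all residue pairs that are `lag` apart,
        # then average over the l - lag such pairs.
        total = sum(corr(h[i], h[i + lag]) for i in range(l - lag))
        taus.append(total / (l - lag))
    return taus
```

Running this once per property k and concatenating the results gives the vector τ of equation (8), with K·Λ entries in total.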
E. Extracting the PsePSSM view feature
For a protein sequence P containing l amino acid residues, first compute its position-specific scoring matrix (PSSM) with the PSI-BLAST algorithm. The PSSM has l rows and 20 columns, so the primary-structure information of the protein is converted into matrix form, expressed as follows:
$$\begin{pmatrix} o_{1,1} & o_{1,2} & \cdots & o_{1,20} \\ o_{2,1} & o_{2,2} & \cdots & o_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ o_{l,1} & o_{l,2} & \cdots & o_{l,20} \end{pmatrix} \tag{11}$$
where the 20 columns correspond to the amino acid residues A, C, ..., Y, and $o_{i,j}$ denotes the likelihood that the i-th amino acid residue of the protein mutates into the j-th of the 20 amino acid residues during evolution;
This matrix is then normalized by applying the following function to each of its values:
$$f(x) = \frac{1}{1 + e^{-x}} \tag{12}$$
The normalized PSSM is expressed as follows:
$$P_{pssm} = \left(p_{i,j}\right)_{l \times 20}, \quad p_{i,j} = f(o_{i,j}) \tag{13}$$
Next, the PsePSSM algorithm is applied to the normalized PSSM to convert evolutionary-information matrices of unequal length into feature vectors of equal length, as follows:
(1) From $P_{pssm}$, mine the amino acid positional-relationship information $\lambda^k$ of different levels in the protein's evolutionary information, expressed as follows:
$$\lambda^k = (\lambda_1^k, \lambda_2^k, \ldots, \lambda_j^k, \ldots, \lambda_{20}^k) \tag{14}$$
where 1≤j≤20 and 1≤k≤K; K denotes the maximum level at which positional-relationship information can be mined, so that positional-relationship information of K different levels is obtained;
(2) Average each column of $P_{pssm}$ to obtain a 20-dimensional feature vector:
$$C_{PSSM} = (p_1, p_2, \ldots, p_j, \ldots, p_{20}) \tag{15}$$
where $p_j = \left(\sum_{t=1}^{l} p_{t,j}\right)/l$;
(3) Finally, concatenate the positional-relationship information of the K levels with $C_{PSSM}$ to obtain the PsePSSM feature information of the protein sequence:
$$\mathrm{PsePSSM}_K = (\lambda^1, \lambda^2, \ldots, \lambda^K, C_{PSSM})^T \tag{16}$$
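The PsePSSM pipeline (sigmoid normalization, column means, and concatenation as in equations (12), (15) and (16)) can be sketched as follows. The source does not state how each component of λ^k is computed, so `lag_features` uses an assumed form, the mean squared difference between normalized rows k positions apart, which is the usual choice in the PsePSSM literature. All function names are hypothetical, and the PSSM is taken as a plain list of rows rather than parsed from PSI-BLAST output.

```python
import math


def sigmoid_normalize(pssm):
    """Apply f(x) = 1 / (1 + e^-x) of equation (12) to every PSSM entry."""
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in pssm]


def column_means(p):
    """C_PSSM of equation (15): mean of each column over the l rows."""
    l = len(p)
    return [sum(row[j] for row in p) / l for j in range(len(p[0]))]


def lag_features(p, k):
    """ASSUMED form of lambda^k: mean squared difference of rows k apart."""
    l = len(p)
    return [sum((p[i][j] - p[i + k][j]) ** 2 for i in range(l - k)) / (l - k)
            for j in range(len(p[0]))]


def pse_pssm(pssm, max_level):
    """Concatenate lambda^1..lambda^K and C_PSSM as in equation (16)."""
    if max_level >= len(pssm):
        raise ValueError("max_level must be smaller than the number of rows")
    p = sigmoid_normalize(pssm)
    features = []
    for k in range(1, max_level + 1):
        features.extend(lag_features(p, k))
    features.extend(column_means(p))
    return features
```

For a 20-column PSSM the result always has 20·K + 20 entries, whatever the sequence length, which is what lets sequences of different lengths share one feature space.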
In a further embodiment, in step 2 the five kinds of view feature information obtained in step 1 form five per-view training sample sets and, taking into account the positive/negative sample distribution of the five sets, a 2L-SVM prediction model is trained as follows:
A. For the training sample set of any v-th view, $D^v = \{(x_i^v, y_i)\}_{i=1}^{N}$, where $x_i^v$ denotes the v-th view feature vector of the i-th sample, $y_i$ denotes the class of the i-th sample and N denotes the number of samples, a standard SVM solver is used to solve the following SVM optimization problem:
$$\begin{aligned} \min\ & \frac{1}{2}\|w^v\|^2 + \gamma^v \sum_{i=1}^{N} \xi_i^v \\ \text{s.t.}\ & y_i\left((w^v)^T \phi^v(x_i^v) + b^v\right) \ge 1 - \xi_i^v \\ & \xi_i^v \ge 0,\quad i = 1, \ldots, N \end{aligned} \tag{17}$$
where $w^v$ is the normal vector of the optimal separating hyperplane, $\gamma^v > 0$ is the SVM regularization parameter, $\xi_i^v$ is the penalty term of the i-th sample in the training data set $D^v$, and $\phi^v(\cdot)$ is a mapping function that maps feature vectors into a high-dimensional Hilbert space. This yields the SVM prediction model of the v-th view, denoted $\mathrm{SVM}_v$;
B. To train the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model, a cross-validation strategy is applied to each of the five per-view training sample sets to obtain the probability outputs under the five views. These five probability outputs, together with the class labels of the training set, form a new training data set, denoted $D_{en}$, in which each sample consists of its five cross-validated probability outputs (one per view) and its class label. A standard SVM solver is then used again to train the optimal separating hyperplane on $D_{en}$, giving the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model;
C. The five output probabilities of the five prediction models obtained in step A are used as the input of the prediction model $\mathrm{SVM}_{en}$ obtained in step B, which constitutes the 2L-SVM prediction model.
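Steps A to C describe a stacking arrangement, which the sketch below illustrates in a library-agnostic way. It assumes any classifier object with `fit(X, y)` and `predict_proba(X)` methods (for example scikit-learn's `SVC` created with `probability=True`), class labels coded as 0 and 1 so that column 1 of `predict_proba` is the positive class, and a simple index-modulo fold assignment. The helper names `out_of_fold_probs`, `train_two_layer` and `predict_two_layer` are hypothetical, and the fold scheme is not taken from the source.

```python
def out_of_fold_probs(make_model, X, y, n_folds=5):
    """Cross-validated positive-class probabilities for one view (step B)."""
    n = len(X)
    probs = [0.0] * n
    for fold in range(n_folds):
        # Assumed fold scheme: sample i belongs to fold i % n_folds.
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = make_model()
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        fold_probs = model.predict_proba([X[i] for i in test_idx])
        for i, p in zip(test_idx, fold_probs):
            probs[i] = p[1]  # column 1 = positive (crystallizable) class
    return probs


def train_two_layer(make_model, views, y, n_folds=5):
    """views: one feature matrix per view, all aligned with the labels y."""
    # Step B: second-layer inputs are the out-of-fold probabilities per view.
    columns = [out_of_fold_probs(make_model, X, y, n_folds) for X in views]
    meta_X = [list(row) for row in zip(*columns)]
    # Step A: one first-layer model per view, fitted on the full training set.
    first_layer = []
    for X in views:
        model = make_model()
        model.fit(X, y)
        first_layer.append(model)
    second_layer = make_model()
    second_layer.fit(meta_X, y)
    return first_layer, second_layer


def predict_two_layer(first_layer, second_layer, sample_views):
    """Step C: sample_views holds one feature vector per view for a protein."""
    meta = [m.predict_proba([x])[0][1]
            for m, x in zip(first_layer, sample_views)]
    return second_layer.predict_proba([meta])[0][1]
```

Using out-of-fold probabilities for the second layer keeps it from being trained on outputs that the first-layer models produced for samples they had already seen, which is the role the cross-validation strategy plays in step B.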
In a further embodiment, in step 3, for each protein sequence whose crystallization propensity is to be predicted, its five view features are obtained by step 1 and input respectively into the 2L-SVM prediction model trained in step 2, which predicts and outputs the crystallization probability.
In a further embodiment, in step 4, threshold segmentation is applied to the output probability obtained in step 3 to make the final decision on whether the protein crystallizes. The threshold lies in the range 0 to 1 and is chosen to satisfy the following condition: the Matthews correlation coefficient (MCC) of the prediction results is maximized.
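The threshold selection in step 4 can be illustrated with a simple grid search. The sketch below assumes labels coded as 0 and 1, a uniform grid of candidate thresholds over [0, 1], and that the search is run on data with known labels; the grid resolution and the function names `mcc` and `best_threshold` are choices made for this example, not details from the source.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is taken as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(y_true, probs, steps=100):
    """Scan thresholds 0, 1/steps, ..., 1 and keep the first MCC maximum."""
    best_t, best_score = 0.0, -2.0
    for s in range(steps + 1):
        t = s / steps
        preds = [1 if p >= t else 0 for p in probs]
        score = mcc(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

A protein is then declared crystallizable when its predicted probability is at or above the selected threshold.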
As the above technical solution shows, the beneficial effects of the invention are:
1. Improved prediction accuracy: more effective protein evolutionary-information features are used, and the 2L-SVM algorithm effectively avoids mutual interference between features of different views, so that more discriminative information can be mined and the prediction accuracy of the protein crystallization prediction model is improved;
2. Improved interpretability: the first layer of the 2L-SVM prediction algorithm builds one basic SVM model on each single view feature, which makes clear which view's discriminative information is stronger. The second layer mines further on top of the first, effectively fusing the useful information of the different views while avoiding their mutual interference, so that the predictions are fairer and more reasonable and the interpretability of the model is improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the principle of the protein crystallization prediction method based on a two-layer SVM learning mechanism according to an embodiment of the invention.
Detailed description
To make the technical content of the invention easier to understand, specific embodiments are described below with reference to the accompanying drawing.
As shown in Fig. 1, according to a preferred embodiment of the invention, the protein crystallization prediction method based on a two-layer SVM learning mechanism proceeds as follows:
Step 1: Feature extraction. PSI-BLAST is used to extract the evolutionary information of the protein; combined with the protein sequence information and the physicochemical property information of amino acids, the five view features AAC, DiAAC, TriAAC, PseAAC and PsePSSM are extracted, so that the protein sequence is converted into numerical form;
Step 2: Using the feature representations of step 1, every protein sequence in the training data set is represented under each view to form five per-view training sample sets; the two-layer SVM prediction algorithm 2L-SVM is then trained on these five sets to obtain a protein crystallization 2L-SVM prediction model;
Step 3: For each protein sequence whose crystallization propensity is to be predicted, its five view features are obtained by step 1, and the 2L-SVM prediction model trained in step 2 is used to predict and output its crystallization probability; and
Step 4: For the protein sequence of step 3, threshold segmentation is applied to the probability output in step 3, and the final decision on whether the sequence is crystallizable is output.
Some specific implementations of the method are described below by way of example with reference to the drawing.
In step 1, the five view features are extracted by sub-steps A to E exactly as set out in the Summary of the invention above, equations (1) to (16).
Next, in step 2, the five kinds of view feature information obtained in step 1 form five per-view training sample sets and, taking into account the positive/negative sample distribution of the five sets, a 2L-SVM prediction model is trained by sub-steps A to C exactly as set out in the Summary of the invention above, including the per-view SVM optimization problem of equation (17).
After the 2L-SVM prediction model has been trained, in step 3, for each protein sequence whose crystallization propensity is to be predicted, its five view features are obtained by step 1 and input respectively into the 2L-SVM prediction model trained in step 2, which predicts and outputs the crystallization probability.
Finally, in step 4, threshold segmentation is applied to the output probability obtained in step 3 to make the final decision on whether the protein crystallizes. The threshold lies in the range 0 to 1 and is chosen to satisfy the following condition: the Matthews correlation coefficient (MCC) of the prediction results is maximized.
In summary, compared with existing prediction methods, the notable advantages of the invention are: multi-view features that include the protein evolutionary-information view, a model architecture that effectively avoids mutual interference between the multi-view features, and the 2L-SVM prediction algorithm with its capacity to mine discriminative information in depth, which together make the prediction model both more interpretable and more accurate.
Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Persons of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from its spirit and scope. The scope of protection of the invention is therefore defined by the appended claims.

Claims (5)

1. A protein crystallization prediction method based on a two-layer SVM learning mechanism, characterized by comprising the following steps:
Step 1: feature extraction. PSI-BLAST is used to extract the evolution information of the protein; combining this with the protein sequence information and the physicochemical property information of the amino acids, five view features (AAC, DiAAC, TriAAC, PseAAC and PsePSSM) are extracted so that the protein sequence is converted into a numerical representation;
Step 2: according to the feature representations of step 1, every protein sequence in the training data set is represented under the different views, forming training sample sets for the five views; the two-layer SVM prediction algorithm 2L-SVM is then trained on the five training sample sets to obtain a protein crystallization 2L-SVM prediction model;
Step 3: for each protein sequence whose crystallization propensity is to be predicted, the features of the sequence under the five views are obtained as in step 1, and the protein crystallization 2L-SVM prediction model trained in step 2 is used to predict the crystallization probability, which is output as the prediction probability; and
Step 4: for the protein sequence to be predicted in step 3, a threshold segmentation method is applied to the output probability of step 3, and the final decision on whether the protein sequence is crystallizable is output.
2. The protein crystallization prediction method based on a two-layer SVM learning mechanism according to claim 1, characterized in that, in said step 1, the features of the different views are extracted according to the following steps:
A. Extracting the AAC view feature
For any protein sequence P of length l, the number of occurrences of each amino acid type in the sequence is denoted:
$\mathrm{Count}_{AA} = (n_A, n_C, \ldots, n_Y)^{T}$ (1)
where A, C, …, Y denote the 20 common amino acid residues, and $n_A$, $n_C$ and $n_Y$ denote the numbers of amino acids A, C and Y in the protein sequence P, respectively;
The AAC view feature of the protein, i.e. its amino acid composition, is then expressed as:
$\mathrm{AAC} = \mathrm{Count}_{AA} / l$ (2)
B. Extracting the DiAAC view feature
For a protein sequence P of arbitrary length l, the DiAAC view feature of the protein is represented by an equation over the dipeptide counts $n_{A,A}, n_{A,C}, \ldots, n_{Y,Y}$,
where AA, AC, …, YY denote the pairwise combinations of the 20 amino acids, and $n_{A,A}$, $n_{A,C}$ and $n_{Y,Y}$ denote the numbers of occurrences of the amino acid pairs AA, AC and YY in the protein sequence, respectively;
C. Extracting the TriAAC view feature
For any protein sequence P containing l amino acid residues, the TriAAC view feature is represented by an equation over the tripeptide counts $n_{A,A,A}, n_{A,A,C}, \ldots, n_{Y,Y,Y}$,
where AAA, AAC, …, YYY denote the tripeptide combinations of the 20 amino acids, and $n_{A,A,A}$, $n_{A,A,C}$ and $n_{Y,Y,Y}$ denote the numbers of occurrences of the tripeptides AAA, AAC and YYY in the protein sequence, respectively;
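The three composition features in steps A to C share one counting scheme and can be sketched with a single helper. This is an illustration under stated assumptions: overlapping windows are counted, each count is divided by the number of windows (l, l−1 and l−2 for k = 1, 2, 3), and residues outside the 20 standard letters are skipped. The name `kmer_composition` and the alphabetical ordering of the output vector are hypothetical choices.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, A ... Y


def kmer_composition(seq, k):
    """Frequency of every length-k residue combination in seq.

    k=1 gives a 20-dim AAC-style vector, k=2 a 400-dim DiAAC-style vector,
    k=3 an 8000-dim TriAAC-style vector. Normalizing by the number of
    windows is an assumption.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    counts = [0] * len(kmers)
    windows = len(seq) - k + 1
    for i in range(windows):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing non-standard residues
            counts[j] += 1
    if windows <= 0:
        return [0.0] * len(kmers)
    return [c / windows for c in counts]
```

For example, `kmer_composition(seq, 2)` returns the 400 dipeptide frequencies in the order AA, AC, …, YY.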
D. Extracting the PseAAC view feature
Each amino acid has intrinsic physicochemical properties, and the PseAAC view feature is extracted from these properties in the following steps:
(1) Using the method for calculating AAC in step A, the amino acid composition of the protein is calculated;
(2) The correlation information in the protein sequence corresponding to each physicochemical property is calculated as follows. First, the λ-th level correlation information of the protein on the k-th physicochemical property is calculated. The calculation uses the λ-th level correlation between the i-th and the (i+λ)-th amino acids of the protein on the k-th physicochemical property, together with the score of the i-th amino acid of the protein on the k-th physicochemical property;
Then the correlation information of all levels of the protein on the k-th physicochemical property is collected into $\tau_k$,
where Λ is the maximum level;
Finally, the correlation information of the protein on all physicochemical properties is calculated, denoted:
$\tau = (\tau_1, \tau_2, \ldots, \tau_K)$ (8)
where K denotes the number of physicochemical properties in AAIndex;
(3) Combining the AAC information with the correlation information finally forms the PseAAC view feature, denoted:
$\mathrm{PseAAC} = (x_1, \ldots, x_\mu, \ldots, x_{K\cdot\Lambda}, x_{1+K\cdot\Lambda}, \ldots, x_{20+K\cdot\Lambda})^{T}$ (9)
where the definition of each component $x_\mu$ involves a ceiling (round-up) operation, and w denotes the weight of PseAAC;
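The PseAAC construction in step D can be sketched as below. Several details are assumptions, because the defining formulas are not legible in this text: the coupling between two residues is taken as the squared difference of their property scores (as in Chou's original pseudo amino acid composition), the components are normalized by the sum of the composition plus w times the correlation terms, and w defaults to 0.05. The property tables are supplied by the caller (for example from AAIndex); the names `correlation_factors` and `pse_aac` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def correlation_factors(seq, scores, max_lag):
    """Correlation information of levels 1..max_lag for one property table.

    scores maps each residue letter to its property score. The
    squared-difference coupling is an assumption.
    """
    h = [scores[a] for a in seq]
    taus = []
    for lag in range(1, max_lag + 1):
        pairs = len(h) - lag
        total = sum((h[i] - h[i + lag]) ** 2 for i in range(pairs))
        taus.append(total / pairs if pairs > 0 else 0.0)
    return taus


def pse_aac(seq, property_tables, max_lag, w=0.05):
    """Return K*max_lag correlation components followed by 20 AAC components.

    The ordering follows equation (9); the normalization is assumed.
    seq must be non-empty and contain only the 20 standard residues.
    """
    aac = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    tau = [t for table in property_tables
           for t in correlation_factors(seq, table, max_lag)]
    denom = sum(aac) + w * sum(tau)
    return [w * t / denom for t in tau] + [f / denom for f in aac]
```

With K property tables and maximum level Λ, the returned vector has K·Λ + 20 components, matching the dimension of equation (9).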
E. Extracting the PsePSSM view feature
For a protein sequence P containing l amino acid residues, its position-specific scoring matrix (PSSM) is first calculated by the PSI-BLAST algorithm. This PSSM is a matrix of l rows and 20 columns, so that the primary-structure information of the protein is converted into matrix form,
where A, C, …, Y denote the 20 amino acid residues, and $o_{i,j}$ denotes the likelihood that the i-th amino acid residue of the protein mutates into the j-th of the 20 amino acid residues during evolution;
The PSSM is then normalized by applying a standardization function to each of its values;
The PSSM after standardization is denoted $P_{pssm}$;
Next, for the standardized PSSM, the PsePSSM algorithm is used to convert evolution information matrices of unequal length into feature vectors of equal length, as follows:
(1) The amino acid position relation information $\lambda_k$ of different levels in the protein evolution information is mined from $P_{pssm}$,
where 1 ≤ j ≤ 20 and 1 ≤ k ≤ K; K denotes the maximum level at which amino acid position relation information can be mined, so that the amino acid position relation information of K different levels is obtained;
(2) Each column of $P_{pssm}$ is averaged, giving a 20-dimensional feature vector:
$C_{PSSM} = (p_1, p_2, \ldots, p_j, \ldots, p_{20})$ (15)
where $p_j$ is the mean of the j-th column of $P_{pssm}$;
(3) Finally, the amino acid position relation information of the K different levels and $C_{PSSM}$ are serially combined to obtain the PsePSSM feature of the protein sequence:
$\mathrm{PsePSSM}_K = (\lambda_1, \lambda_2, \ldots, \lambda_K, C_{PSSM})^{T}$ (16).
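The PsePSSM conversion in step E can be sketched as follows. Two ingredients are assumptions, since the corresponding formulas are not legible in this text: the standardization function is taken to be the logistic function, and the level-k position relation information for column j is taken to be the mean squared difference between rows that are k positions apart, as in the commonly used PsePSSM definition. The names `logistic` and `pse_pssm` are hypothetical.

```python
import math


def logistic(x):
    """Assumed standardization function mapping a raw score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def pse_pssm(pssm, max_lag):
    """Convert an l x 20 PSSM into a fixed-length vector.

    Returns 20 * max_lag position-relation values (levels 1..max_lag)
    followed by the 20 column means, in the order of equation (16).
    """
    p = [[logistic(v) for v in row] for row in pssm]
    length, cols = len(p), len(p[0])
    features = []
    for k in range(1, max_lag + 1):
        pairs = length - k
        for j in range(cols):
            total = sum((p[i][j] - p[i + k][j]) ** 2 for i in range(pairs))
            features.append(total / pairs if pairs > 0 else 0.0)
    # Column means of the standardized matrix (the C_PSSM part).
    features.extend(sum(row[j] for row in p) / length for j in range(cols))
    return features
```

Whatever the sequence length l, the output has 20·(K + 1) components, which is what allows proteins of different lengths to share one SVM input space.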
3. The protein crystallization prediction method based on a two-layer SVM learning mechanism according to claim 1, characterized in that, in said step 2, the training sample sets of the five views are formed from the five kinds of view feature information obtained in step 1, and a 2L-SVM prediction model is trained taking into account the distribution of positive and negative samples in the five training sample sets, with the following specific steps:
A. For the training sample set $D_v$ of any view v, which contains, for each of the N samples, the feature vector of the i-th sample under view v and the class label $y_i$ of the i-th sample, a standard SVM program is used to solve the corresponding SVM optimization problem, in which $w_v$ is the normal vector of the optimal separating hyperplane, $\gamma_v > 0$ is the SVM regularization parameter, each sample i of the training data set $D_v$ carries a penalty term, and $\phi_v(\cdot)$ is a mapping function that maps feature vectors into a high-dimensional Hilbert space. This yields the SVM prediction model of view v, denoted $\mathrm{SVM}_v$;
B. To train the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model, a cross-validation strategy is applied to the training sample set of each of the five views to obtain the probability outputs under the five views. These five probability outputs, together with the class labels of the training set, form a new training data set, denoted $D_{en}$, in which each sample is represented by the probability outputs that cross-validation produced for it on each view v. A standard SVM program is then trained on $D_{en}$ to find the optimal separating hyperplane, which yields the second-layer model $\mathrm{SVM}_{en}$ of the 2L-SVM prediction model;
C. The five output probabilities of the five prediction models obtained in step A serve as the input of the prediction model $\mathrm{SVM}_{en}$ obtained in step B, which together constitute the 2L-SVM prediction model.
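For orientation, the per-view optimization problem referred to in step A plausibly has the standard soft-margin form shown below. This is a reconstruction under assumptions rather than the formula of the claim: the symbols $x_i^v$ (feature vector of sample i under view v), $\xi_i^v$ (penalty term) and $b_v$ (bias) and the label convention $y_i \in \{-1, +1\}$ are introduced here for illustration, while $w_v$, $\gamma_v$ and $\phi_v$ are as named in the text.

```latex
\begin{aligned}
\min_{w_v,\, b_v,\, \xi^v}\quad & \frac{1}{2}\,\lVert w_v \rVert^2 + \gamma_v \sum_{i=1}^{N} \xi_i^v \\
\text{s.t.}\quad & y_i \left( w_v^{T} \phi_v(x_i^v) + b_v \right) \ge 1 - \xi_i^v, \\
& \xi_i^v \ge 0, \qquad i = 1, \ldots, N .
\end{aligned}
```

In this form a larger $\gamma_v$ penalizes margin violations more heavily, which matches its described role as the regularization parameter.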
4. The protein crystallization prediction method based on a two-layer SVM learning mechanism according to claim 1, characterized in that, in said step 3, for each protein sequence whose crystallization propensity is to be predicted, the features of the sequence under the five views are obtained as in step 1 and fed, view by view, into the 2L-SVM prediction model trained in step 2, which predicts the crystallization probability and outputs it as the prediction probability.
5. The protein crystallization prediction method based on a two-layer SVM learning mechanism according to claim 1, characterized in that, in said step 4, a threshold segmentation method is applied to the output probability obtained in step 3 to make the final decision on whether the protein crystallizes, the threshold lies in the range 0 to 1, and the threshold is chosen so that the Matthews correlation coefficient (MCC) of the prediction results is maximized.
CN201510047426.7A 2015-01-29 2015-01-29 Protein crystallization prediction method based on two-layer SVM learning mechanism Active CN104636635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510047426.7A CN104636635B (en) 2015-01-29 2015-01-29 Crystallization of protein Forecasting Methodology based on two layers of SVM study mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510047426.7A CN104636635B (en) 2015-01-29 2015-01-29 Crystallization of protein Forecasting Methodology based on two layers of SVM study mechanism

Publications (2)

Publication Number Publication Date
CN104636635A true CN104636635A (en) 2015-05-20
CN104636635B CN104636635B (en) 2018-06-12

Family

ID=53215376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510047426.7A Active CN104636635B (en) 2015-01-29 2015-01-29 Protein crystallization prediction method based on two-layer SVM learning mechanism

Country Status (1)

Country Link
CN (1) CN104636635B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105046103A (en) * 2015-07-03 2015-11-11 景德镇陶瓷学院 Novel representation method for protein sequence fusing genetic information
CN107169312A (en) * 2017-05-27 2017-09-15 南开大学 A kind of Forecasting Methodology of the natural unordered protein of low complex degree
CN107967321A (en) * 2017-11-23 2018-04-27 北京信息科技大学 A kind of crop breeding evaluation method based on hierarchical support vector machines
CN109979525A (en) * 2019-02-28 2019-07-05 天津大学 Improved hormonebinding protein qualitative classification method
CN110473592A (en) * 2019-07-31 2019-11-19 广东工业大学 The multi-angle of view mankind for having supervision based on figure convolutional network cooperate with lethal gene prediction technique
CN111627494A (en) * 2020-05-29 2020-09-04 北京晶派科技有限公司 Protein property prediction method and device based on multi-dimensional features and computing equipment
CN115240775A (en) * 2022-07-18 2022-10-25 东北林业大学 Cas protein prediction method based on stacking ensemble learning strategy

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760210A (en) * 2012-06-19 2012-10-31 南京理工大学常熟研究院有限公司 Adenosine triphosphate binding site predicting method for protein
CN103324933A (en) * 2013-06-08 2013-09-25 南京理工大学常熟研究院有限公司 Membrane protein sub-cell positioning method based on complex space multi-view feature fusion
CN103617203A (en) * 2013-11-15 2014-03-05 南京理工大学 Protein-ligand binding site predicting method based on inquiry drive
CN103955628A (en) * 2014-04-22 2014-07-30 南京理工大学 Subspace fusion-based protein-vitamin binding location point predicting method
CN104077499A (en) * 2014-05-25 2014-10-01 南京理工大学 Supervised up-sampling learning based protein-nucleotide binding positioning point prediction method
CN104239751A (en) * 2014-09-05 2014-12-24 南京理工大学 GPCR(G Protein-Coupled Receptor)-drug interaction prediction method based on postprocessing study

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
Dong-Jun Yu et al.: "Enhancing protein-vitamin binding residues prediction by multiple heterogeneous subspace SVMs ensemble", 《BMC Bioinformatics》 *
Dong-Jun Yu et al.: "Improving protein-ATP binding residues prediction by boosting SVMs with random under-sampling", 《Neurocomputing》 *
Dong-Jun Yu et al.: "Learning protein multi-view features in complex space", 《Amino Acids》 *
Jun Hu et al.: "A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction", 《PLoS ONE》 *
Kuo-Chen Chou: "Prediction of protein cellular attributes using pseudo amino acid composition", 《Proteins: Structure, Function, and Genetics》 *
Loris Nanni et al.: "Combining ontologies and dipeptide composition for predicting DNA-binding proteins", 《Amino Acids》 *
Mohd Hilmi Muda: "Two-layer SVM classifier for remote protein homology detection and fold recognition", 《Published online: EPRINTS.UTM.MY/11488》 *
Samad Jahandideh et al.: "RFCRYS: sequence-based protein crystallization propensity prediction by means of random forest", 《Journal of Theoretical Biology》 *
吴小伟 (Wu Xiaowei): "Protein attribute prediction based on multi-view feature fusion", 《China Master's Theses Full-text Database, Basic Sciences》 *
张亚楠 (Zhang Yanan): "Research on protein interactions and their active sites based on data mining algorithms", 《China Master's Theses Full-text Database, Basic Sciences》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105046103A (en) * 2015-07-03 2015-11-11 景德镇陶瓷学院 Novel representation method for protein sequence fusing genetic information
CN107169312A (en) * 2017-05-27 2017-09-15 南开大学 A kind of Forecasting Methodology of the natural unordered protein of low complex degree
CN107169312B (en) * 2017-05-27 2020-05-08 南开大学 Low-complexity natural disordered protein prediction method
CN107967321A (en) * 2017-11-23 2018-04-27 北京信息科技大学 A kind of crop breeding evaluation method based on hierarchical support vector machines
CN109979525A (en) * 2019-02-28 2019-07-05 天津大学 Improved hormonebinding protein qualitative classification method
CN110473592A (en) * 2019-07-31 2019-11-19 广东工业大学 The multi-angle of view mankind for having supervision based on figure convolutional network cooperate with lethal gene prediction technique
CN110473592B (en) * 2019-07-31 2023-05-23 广东工业大学 Multi-view human synthetic lethal gene prediction method
CN111627494A (en) * 2020-05-29 2020-09-04 北京晶派科技有限公司 Protein property prediction method and device based on multi-dimensional features and computing equipment
CN111627494B (en) * 2020-05-29 2023-12-01 北京晶泰科技有限公司 Protein property prediction method and device based on multidimensional features and computing equipment
CN115240775A (en) * 2022-07-18 2022-10-25 东北林业大学 Cas protein prediction method based on stacking ensemble learning strategy
CN115240775B (en) * 2022-07-18 2023-10-03 东北林业大学 Cas protein prediction method based on stacking integrated learning strategy

Also Published As

Publication number Publication date
CN104636635B (en) 2018-06-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Yu Dongjun

Inventor after: Hu Jun

Inventor after: He Xue

Inventor after: Li Yang

Inventor after: Shen Hongbin

Inventor after: Yang Jingyu

Inventor before: Hu Jun

Inventor before: Yu Dongjun

Inventor before: He Xue

Inventor before: Li Yang

Inventor before: Shen Hongbin

Inventor before: Yang Jingyu

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant