CN109147870A

CN109147870A - Method for identifying intrinsically disordered proteins based on conditional random fields

Info

Publication number: CN109147870A
Application number: CN201810834590.6A
Authority: CN
Inventors: 刘滨; 刘羽朦
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-07-26
Filing date: 2018-07-26
Publication date: 2019-01-04

Abstract

The present invention provides a method for identifying intrinsically disordered proteins based on conditional random fields. The identification method, IDP-CRF, is built by combining a conditional random field with the evolutionary information, amino acid composition information, secondary structure information, and relative solvent accessibility information of a protein. When predicting sites in a biological sequence, capturing the dependencies between site labels has always been an important problem, and it is one that identification methods built on traditional classification algorithms cannot solve. In addition, using the rich numeric features that can be extracted from biological sequences is key to improving performance. The present invention therefore builds its prediction model with a conditional random field algorithm that can handle numeric features. The model both captures the dependencies between site labels and handles numeric features, which further improves prediction performance.

Description

Method for identifying intrinsically disordered proteins based on conditional random fields

Technical field

The present invention relates to the technical field of bioinformatics, and in particular to a method for identifying intrinsically disordered proteins.

Background art

Most methods for identifying intrinsically disordered proteins are built on traditional classification algorithms such as support vector machines, random forests, and feedforward neural networks. Such methods first cut the protein sequence into a series of subsequences, in which the amino acid at the center of each subsequence is the target amino acid (that is, the amino acid to be predicted). Features are then extracted from these subsequences, and finally a classification algorithm predicts each subsequence (that is, it predicts the target amino acid). Besides these, there are also identification methods built on a sequence labeling algorithm, the conditional random field (CRF), that can handle only character-type (categorical) features. Such a method uses feature templates to convert the primary sequence of a protein and its predicted secondary structure sequence into a series of features, and then labels the target amino acids with a conditional random field based on these features.

The PDB and DisProt databases are two important repositories of intrinsically disordered proteins, and both have been updated rapidly in recent years. However, the training sets of most existing prediction models were built from proteins in older versions of these databases. As a result, these models do not cover the newest protein sequences, which limits their generalization ability. In addition, adjacent amino acids in a protein tend to behave similarly with respect to whether they form an intrinsically disordered state. Prediction models built on traditional classification algorithms, however, train on each target amino acid as an independent sample and therefore cannot capture the dependencies between the labels of adjacent amino acids. Furthermore, the rich numeric features of a protein play an important role in identifying intrinsically disordered proteins. Although existing prediction methods built on conditional random fields overcome this limitation of traditional classification algorithms, they cannot yet handle numeric features, which greatly limits the prediction performance of the model.

Summary of the invention

To solve the problems in the prior art, the present invention provides a method for identifying intrinsically disordered proteins based on conditional random fields. The method captures the dependencies between the labels of adjacent sites in a protein sequence and makes use of the rich numeric features extracted from the protein sequence, thereby improving the prediction performance for intrinsically disordered proteins.

The present invention is implemented through the following technical solution:

A method for identifying intrinsically disordered proteins based on conditional random fields, comprising the following steps: S1, constructing the features of the conditional random field model, the features comprising transition features and state features; to construct the state features, the protein sequence is first cut into a series of subsequences using a sliding window technique, and the state features are then constructed for each target amino acid, namely the evolutionary information features and amino acid composition features within the window, and the secondary structure feature and relative solvent accessibility feature of the target amino acid; S2, training the model using conditional random field software that can handle numeric features; during training, a positive and negative sample set with a fixed ratio is first constructed by randomly removing negative samples, the balanced ratio used being positive samples : negative samples = 1:2; S3, performing step S1 on the training set and inputting the result into the conditional random field model to train the model parameters; S4, performing step S1 on the test set and inputting the result into the conditional random field model to obtain the identification result.

As a further improvement of the present invention, assuming that the label set of an amino acid is L = {ordered, disordered}, the transition feature is given by the following formula:

where y_{i-1} and y_i are the labels of the amino acids at positions i-1 and i in the protein sequence, and y and y' belong to L.
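The formula itself is not reproduced in this text. As a hedged reconstruction, a form consistent with the symbols defined above is the standard indicator-style transition feature of a linear-chain conditional random field. The function name t_{y',y} is an assumption introduced here for illustration and may differ from the notation of the original formula:

```latex
t_{y',y}(y_{i-1}, y_i) =
\begin{cases}
1, & \text{if } y_{i-1} = y' \text{ and } y_i = y, \\
0, & \text{otherwise,}
\end{cases}
\qquad y,\, y' \in L = \{\text{ordered}, \text{disordered}\}
```

With two labels, this form yields four transition features, one for each ordered pair (y', y).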

As a further improvement of the present invention, an up-to-date and comprehensive data set is constructed from the MobiDB database and the latest DisProt database, and the prediction model is built on this data set.

As a further improvement of the present invention, the evolutionary information within the window is constructed as follows: first, PSI-BLAST is used to search a large protein database to obtain the position-specific scoring matrix (PSSM) of the protein, with the PSI-BLAST E-value and number of iterations set to 0.001 and 3, respectively, and all other parameters left at their defaults; the PSSM is then normalized using the following formula:

where x is the value of each element in the PSSM; finally, the PSSM information of all amino acids in the window of each target amino acid is concatenated to obtain the evolutionary information features of the target amino acid.

As a further improvement of the present invention, the amino acid composition features within the window are the frequencies with which each run of k consecutive amino acids occurs in the window.

As a further improvement of the present invention, the secondary structure feature of the target amino acid is obtained by using the sequence-profile-based PSIPRED software to predict which of three structures the target amino acid adopts: helix, strand, or random coil. However, when no PSSM is obtained for a protein sequence after the database search, the version of PSIPRED that relies only on the protein sequence is used instead.

As a further improvement of the present invention, the relative solvent accessibility feature of the target amino acid is predicted with the Sable software, with the SA_ACTION and SA_OUT parameters set to SVR and RELATIVE, respectively, and all other parameters left at their defaults.

The beneficial effects of the present invention are as follows. The method combines a conditional random field with the evolutionary information, amino acid composition information, secondary structure information, and relative solvent accessibility information of a protein to build an identification method for intrinsically disordered proteins (IDP-CRF). When predicting sites in a biological sequence, capturing the dependencies between site labels has always been an important problem, and it is one that identification methods built on traditional classification algorithms cannot solve. In addition, using the rich numeric features that can be extracted from biological sequences is key to improving performance. The present invention therefore builds its prediction model with a conditional random field algorithm that can handle numeric features. The model both captures the dependencies between site labels and handles numeric features, which further improves prediction performance.

Brief description of the drawings

Fig. 1 is a flowchart of the identification method of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawing and specific embodiments.

The present invention is a method for identifying intrinsically disordered proteins (IDPs) based on conditional random fields. To cover a diverse set of protein sequences, the present invention constructs an up-to-date and comprehensive data set from the MobiDB database and the latest DisProt database, and builds the prediction model on this data set. In addition, to overcome the shortcomings of models built on traditional classification algorithms, the present invention adopts the conditional random field algorithm, which can capture the dependencies between the labels of adjacent amino acids in a protein sequence. The present invention also uses conditional random field software that can handle numeric features, so that the prediction model can incorporate the rich numeric features extracted from protein sequences, thereby improving the prediction performance for intrinsically disordered proteins.

The flowchart of the present invention is shown in Fig. 1. The features used to construct the conditional random field model comprise transition features and state features. Assuming that the label set of an amino acid is L = {ordered, disordered}, the transition feature is given by formula (1):

where y_{i-1} and y_i are the labels of the amino acids at positions i-1 and i in the protein sequence, and y and y' belong to L. To construct the state features, the protein sequence is first cut into a series of subsequences using a sliding window technique, and the state features are then constructed for each target amino acid, namely the evolutionary information features and amino acid composition features within the window, and the secondary structure feature and relative solvent accessibility feature of the target amino acid.
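The sliding window step can be illustrated with a minimal sketch. It assumes that each window is centered on its target amino acid and that positions beyond the ends of the sequence are filled with a padding symbol; the padding symbol `X` and the function name `sliding_windows` are hypothetical, since the description does not state how the sequence ends are handled. The default window size of 11 is the value given in this description.

```python
from typing import List

PAD = "X"  # hypothetical padding symbol; end handling is not specified in the source


def sliding_windows(sequence: str, window_size: int = 11) -> List[str]:
    """Return one window per residue, each centered on its target residue."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the target residue is centered")
    half = window_size // 2
    # Pad both ends so that the first and last residues also get full windows.
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + window_size] for i in range(len(sequence))]
```

Under these assumptions, a sequence of n residues yields n windows, so every amino acid is treated as a target exactly once.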

The evolutionary information within the window is constructed as follows. First, PSI-BLAST is used to search a large protein database (for example NRdb90) to obtain the position-specific scoring matrix (PSSM) of the protein. In the present invention, the PSI-BLAST E-value and number of iterations are set to 0.001 and 3, respectively, and all other parameters are left at their defaults. The PSSM is then normalized using the following formula:

where x is the value of each element in the PSSM. Finally, the PSSM information of all amino acids in the window of each target amino acid is concatenated to obtain the evolutionary information features of the target amino acid. The window size set in the present invention is 11, so the evolutionary information has 11 × 20 = 220 feature dimensions. The amino acid composition features within the window are the frequencies with which each run of k consecutive amino acids occurs in the window; the present invention uses k = 1, so this feature has 20 dimensions. The secondary structure feature of the target amino acid is the predicted state of the target amino acid among three structures: helix, strand, and random coil. The present invention predicts the secondary structure of the target amino acid with the PSIPRED software, using the version of PSIPRED that predicts from sequence profile information. However, when no PSSM is obtained for a protein sequence after searching the NRdb90 database, the version of PSIPRED that relies only on the protein sequence is used instead. All PSIPRED parameters are left at their defaults in the present invention. This feature has 1 dimension. The relative solvent accessibility feature of the target amino acid is predicted with the Sable software, with the SA_ACTION and SA_OUT parameters set to SVR and RELATIVE, respectively, and all other parameters left at their defaults. This feature has 1 dimension.
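The dimensions stated above add up to 220 + 20 + 1 + 1 = 242 state features per target amino acid. The following sketch shows one way such a vector could be assembled. It rests on several assumptions: the normalization formula is not reproduced in this text, so the logistic function used in `normalize` is a common choice for PSSM scores and not necessarily the formula of the invention; the secondary structure and relative solvent accessibility values are assumed to be supplied as single numbers by the external predictors; and all function names are hypothetical.

```python
import math
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def normalize(x: float) -> float:
    # ASSUMPTION: logistic squashing into (0, 1); the source formula is missing.
    return 1.0 / (1.0 + math.exp(-x))


def evolution_features(pssm_window: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the normalized PSSM rows of a window (11 rows x 20 = 220)."""
    return [normalize(v) for row in pssm_window for v in row]


def composition_features(window: str) -> List[float]:
    """k = 1 composition: frequency of each of the 20 amino acids in the window."""
    return [window.count(a) / len(window) for a in AMINO_ACIDS]


def state_features(pssm_window: Sequence[Sequence[float]], window: str,
                   secondary_structure: float, rsa: float) -> List[float]:
    """Assemble evolution (220), composition (20), SS (1), and RSA (1) features."""
    return (evolution_features(pssm_window)
            + composition_features(window)
            + [secondary_structure, rsa])
```

The numeric encoding of the three secondary structure classes into one dimension is left to the caller here, because the description states only that the feature has 1 dimension.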

In data sets of intrinsically disordered proteins, the ratio of positive to negative samples is extremely unbalanced: negative samples far outnumber positive samples, which can make it difficult for the resulting prediction model to capture the disorder information. Therefore, during training, a positive and negative sample set with a fixed ratio is first constructed. The set is constructed by randomly removing negative samples, which ensures that the retained amino acids keep the same relative order as in the original protein sequence. The balanced ratio used in the present invention is positive samples : negative samples = 1:2. The conditional random field prediction model is then built from this data set, the state features constructed above, and the transition features that capture the dependencies between labels. Finally, this prediction model can be used to predict any protein.
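The undersampling step can be sketched as follows. This is a minimal illustration under stated assumptions: label 1 marks a positive (disordered) residue and label 0 a negative (ordered) residue, the 1:2 ratio is applied per sequence (the description does not say whether it is applied per sequence or over the whole data set), and the function name `undersample_negatives` and the `seed` parameter are hypothetical.

```python
import random
from typing import List


def undersample_negatives(labels: List[int], ratio: int = 2, seed: int = 0) -> List[int]:
    """Return the sorted indices of retained residues for one sequence.

    All positives (label 1) are kept; negatives (label 0) are removed at
    random until at most `ratio` negatives remain per positive.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    kept_negatives = rng.sample(negatives, n_keep)
    # Sorting restores sequence order, so retained residues keep their
    # relative positions from the original protein sequence.
    return sorted(positives + kept_negatives)
```

Returning indices rather than the residues themselves makes it straightforward to select the matching rows of the feature matrix and the label sequence with the same index list.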

In addition, the identification method of the present invention is not limited to identifying intrinsically disordered proteins. It can also be extended to similar biological problems that involve predicting sites in DNA, RNA, and protein sequences, such as the prediction of protein binding sites and the prediction of protein secondary structure.

The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the present invention belongs may make a number of simple deductions or substitutions without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying intrinsically disordered proteins based on conditional random fields, characterized in that the method comprises the following steps: S1, constructing the features of the conditional random field model, the features comprising transition features and state features; to construct the state features, the protein sequence is first cut into a series of subsequences using a sliding window technique, and the state features are then constructed for each target amino acid, namely the evolutionary information features and amino acid composition features within the window, and the secondary structure feature and relative solvent accessibility feature of the target amino acid; S2, training the model using conditional random field software that can handle numeric features; during training, a positive and negative sample set with a fixed ratio is first constructed by randomly removing negative samples, the balanced ratio used being positive samples : negative samples = 1:2; S3, performing step S1 on the training set and inputting the result into the conditional random field model to train the model parameters; S4, performing step S1 on the test set and inputting the result into the conditional random field model to obtain the identification result.

2. The method according to claim 1, characterized in that: assuming that the label set of an amino acid is L = {ordered, disordered}, the transition feature is given by the following formula:

3. The method according to claim 1, characterized in that: the method constructs a data set from the MobiDB database and the DisProt database, and builds the prediction model on this data set.

4. The method according to claim 1, characterized in that: the evolutionary information within the window is constructed as follows: first, PSI-BLAST is used to search a large protein database to obtain the position-specific scoring matrix (PSSM) of the protein, with the PSI-BLAST E-value and number of iterations set to 0.001 and 3, respectively, and all other parameters left at their defaults; the PSSM is then normalized using the following formula:

where x is the value of each element in the PSSM; finally, the PSSM information of all amino acids in the window of each target amino acid is concatenated to obtain the evolutionary information features of the target amino acid.

5. The method according to claim 1, characterized in that: the amino acid composition features within the window are the frequencies with which each run of k consecutive amino acids occurs in the window.

6. The method according to claim 1, characterized in that: the secondary structure feature of the target amino acid is obtained by using the sequence-profile-based PSIPRED software to predict which of three structures the target amino acid adopts: helix, strand, or random coil; however, when no PSSM is obtained for a protein sequence after the database search, the version of PSIPRED that relies only on the protein sequence is used instead.

7. The method according to claim 1, characterized in that: the relative solvent accessibility feature of the target amino acid is predicted with the Sable software, with the SA_ACTION and SA_OUT parameters set to SVR and RELATIVE, respectively, and all other parameters left at their defaults.

8. The method according to claim 1, characterized in that: the method is further applicable to biological problems that involve predicting sites in DNA, RNA, and protein sequences, such as the prediction of protein binding sites and the prediction of protein secondary structure.