CN109147870A - Method for identifying intrinsically disordered proteins based on conditional random fields - Google Patents

Publication number
CN109147870A
Authority
CN
China
Prior art keywords
amino acid
protein
feature
random field
window
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810834590.6A
Other languages
Chinese (zh)
Inventor
刘滨
刘羽朦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-07-26
Filing date
2018-07-26
Publication date
2019-01-04
Application filed by Individual
Priority to CN201810834590.6A
Publication of CN109147870A
Current legal status: Pending

Abstract

The present invention provides a method for identifying intrinsically disordered proteins based on conditional random fields. The method, IDP-CRF, combines the evolutionary information, amino acid composition information, secondary structure information, and relative solvent accessibility information of a protein with a conditional random field. When predicting sites in a biological sequence, capturing the dependencies between site labels has always been an important problem, and it is one that identification methods built on traditional classification algorithms cannot solve. Exploiting the rich numerical features that can be extracted from biological sequences is likewise key to improving performance. The invention therefore builds its prediction model with a conditional random field algorithm that can handle numerical features. The model both captures the dependencies between site labels and handles numerical features, which further improves prediction performance.

Description

Method for identifying intrinsically disordered proteins based on conditional random fields
Technical field
The present invention relates to the technical field of bioinformatics, and in particular to a method for identifying intrinsically disordered proteins.
Background
Most methods for identifying intrinsically disordered proteins are built on traditional classification algorithms such as support vector machines, random forests, and feedforward neural networks. Such methods first divide a protein sequence into a series of subsequences, where the amino acid in the middle of each subsequence is the target amino acid (that is, the amino acid to be predicted). Features are then extracted from these subsequences, and a classification algorithm predicts each subsequence (that is, predicts the target amino acid). There are also identification methods built on a sequence labeling algorithm, the conditional random field (CRF), but these can only handle character-type features. Such a method uses feature templates to convert the protein primary sequence and its predicted secondary structure sequence into a series of features, and then labels the target amino acids with a conditional random field based on those features.
The PDB database and the DisProt database are two important databases that store intrinsically disordered proteins, and both have been updated rapidly in recent years. However, the training sets of most existing prediction models were built from proteins in older versions of these databases. As a result, the prediction models do not cover the newest protein sequences, which affects their generalization ability. In addition, adjacent amino acids in a protein tend to behave similarly with respect to whether they form an intrinsically disordered state. Prediction models built on traditional classification algorithms, however, treat each target amino acid as an independent training sample and therefore cannot capture the dependencies between the labels of adjacent amino acids. On the other hand, the rich numerical features of proteins play an important role in identifying intrinsically disordered proteins. Although existing prediction methods built on conditional random fields overcome this limitation of traditional classification algorithms, they cannot yet handle numerical features, which greatly limits the prediction performance of the model.
Summary of the invention
To solve the problems in the prior art, the present invention provides a method for identifying intrinsically disordered proteins based on conditional random fields. The method captures the dependencies between the labels of adjacent sites in a protein sequence and uses the rich numerical features extracted from the protein sequence, thereby improving the prediction performance for intrinsically disordered proteins.
The present invention is implemented by the following technical solution:
A method for identifying intrinsically disordered proteins based on conditional random fields, comprising the following steps:
S1. Construct the features of the conditional random field model, which comprise transition features and state features. To construct the state features, the protein sequence is first cut into a series of subsequences using a sliding window, and the state features are then constructed for each target amino acid, namely the evolutionary information features and amino acid composition features within the window, and the secondary structure feature and relative solvent accessibility feature of the target amino acid.
S2. Train the model using conditional random field software that can handle numerical features. During training, a positive and negative sample set with a fixed ratio is first constructed by randomly removing negative samples; the balance ratio used is positive samples : negative samples = 1:2.
S3. Apply step S1 to the training set and input the result into the conditional random field model to train the model parameters.
S4. Apply step S1 to the test set and input the result into the conditional random field model to obtain the identification result.
As a further improvement of the present invention, assuming that the label set of amino acids is L = {ordered, disordered}, the transition feature is given by the following formula:
where y_{i-1} and y_i are the labels of the amino acids at positions i-1 and i in the protein sequence, and y and y' belong to L.
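The formula referred to above is not reproduced in this text. A common indicator form for the transition feature of a linear-chain conditional random field, consistent with the symbols defined here, would be the following. The name t_{y',y} is an assumption introduced for illustration and is not necessarily the notation of the original filing.

```latex
t_{y',y}(y_{i-1}, y_i) =
\begin{cases}
1, & \text{if } y_{i-1} = y' \text{ and } y_i = y,\\
0, & \text{otherwise,}
\end{cases}
\qquad y, y' \in L
```

Under this form, the two labels in L give four transition features, one for each ordered pair of labels.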
As a further improvement of the present invention, the newest and most complete data set is built from the MobiDB database and the latest DisProt database, and the prediction model is built on this data set.
As a further improvement of the present invention, the evolutionary information within the window is constructed as follows. First, PSI-BLAST is used to search a large protein database to obtain the position-specific scoring matrix (PSSM) of the protein; the PSI-BLAST E-value and number of iterations are set to 0.001 and 3, respectively, and the other parameters are left at their defaults. The PSSM is then normalized according to the following formula:
where x is the value of each element of the PSSM. Finally, the PSSM information of all amino acids within the window of each target amino acid is concatenated to obtain the evolutionary information features of the target amino acid.
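The PSI-BLAST settings described above could be passed on the command line roughly as sketched below. The text does not say which PSI-BLAST release or which flags were used, so this sketch assumes the BLAST+ `psiblast` program and assumes that the stated E-value maps to `-evalue` (it could instead refer to the inclusion threshold, `-inclusion_ethresh`). The function name, file names, and database name are hypothetical.

```python
from typing import List


def build_psiblast_command(query_fasta: str, database: str, pssm_out: str) -> List[str]:
    """Return an argument list for the BLAST+ `psiblast` program (assumed tool)."""
    return [
        "psiblast",
        "-query", query_fasta,        # FASTA file with one protein sequence
        "-db", database,              # hypothetical local database name
        "-evalue", "0.001",           # E-value from the description (flag mapping assumed)
        "-num_iterations", "3",       # number of iterations from the description
        "-out_ascii_pssm", pssm_out,  # write the PSSM as a text file
    ]


cmd = build_psiblast_command("protein.fasta", "nrdb90", "protein.pssm")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ and a formatted database are installed.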
As a further improvement of the present invention, the amino acid composition feature within the window is the frequency of occurrence of each run of k consecutive amino acids in the window.
As a further improvement of the present invention, the secondary structure feature of the target amino acid is obtained by using the PSIPRED software, based on sequence profile information, to predict which of three structures the target amino acid adopts: helix, strand, or random coil. However, when no PSSM is obtained for a protein sequence after searching the database, the version of PSIPRED that relies on the protein sequence alone is used.
As a further improvement of the present invention, the relative solvent accessibility feature of the target amino acid is predicted with the Sable software; the SA_ACTION and SA_OUT parameters are set to SVR and RELATIVE, respectively, and the other parameters are left at their defaults.
The beneficial effects of the present invention are as follows. The method combines the evolutionary information, amino acid composition information, secondary structure information, and relative solvent accessibility information of a protein with a conditional random field to build an intrinsically disordered protein identification method (IDP-CRF). When predicting sites in a biological sequence, capturing the dependencies between site labels has always been an important problem, and it is one that identification methods built on traditional classification algorithms cannot solve. Exploiting the rich numerical features extracted from biological sequences is likewise key to improving performance. The invention therefore builds its prediction model with a conditional random field algorithm that can handle numerical features. The model both captures the dependencies between site labels and handles numerical features, which further improves prediction performance.
Brief description of the drawings
Fig. 1 is a flowchart of the identification method of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and specific embodiments.
The present invention is a method for identifying intrinsically disordered proteins (IDPs) based on conditional random fields. To cover a diverse range of protein sequences, the invention builds the newest and most complete data set from the MobiDB database and the latest DisProt database, and builds the prediction model on this data set. In addition, to overcome the shortcomings of models built on traditional classification algorithms, the invention adopts a conditional random field algorithm, which captures the dependencies between the labels of adjacent amino acids in a protein sequence. The invention also uses conditional random field software that can handle numerical features, so that the prediction model can include the rich numerical features extracted from the protein sequence, thereby improving the prediction performance for intrinsically disordered proteins.
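To illustrate how transition scores let neighbouring labels influence each other, the following minimal sketch decodes a two-label linear chain with the Viterbi algorithm. It is not the software used in the invention; the function name and all scores are made up for illustration and are not trained parameters.

```python
from typing import Dict, List, Tuple

LABELS = ("ordered", "disordered")


def viterbi(state_scores: List[Dict[str, float]],
            transition_scores: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    state_scores[i][y] is the score of label y at position i;
    transition_scores[(y_prev, y)] is the score of moving from y_prev to y.
    """
    best = [{y: state_scores[0][y] for y in LABELS}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(state_scores)):
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transition_scores[(p, y)])
            scores[y] = best[-1][prev] + transition_scores[(prev, y)] + state_scores[i][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    last = max(LABELS, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative numbers only: the third residue weakly prefers "ordered" on its own.
state = [
    {"ordered": 0.0, "disordered": 2.0},
    {"ordered": 0.0, "disordered": 2.0},
    {"ordered": 0.5, "disordered": 0.0},
    {"ordered": 0.0, "disordered": 2.0},
]
# Keeping the same label is rewarded, switching labels is penalized.
sticky = {("ordered", "ordered"): 1.0, ("disordered", "disordered"): 1.0,
          ("ordered", "disordered"): -1.0, ("disordered", "ordered"): -1.0}
independent = {pair: 0.0 for pair in sticky}

print(viterbi(state, independent))  # third residue is labelled "ordered"
print(viterbi(state, sticky))       # all four residues are labelled "disordered"
```

With all transition scores at zero, each residue is labelled on its own evidence, which mirrors a per-residue classifier. With the transition scores switched on, the weak isolated "ordered" call is overruled by its disordered neighbours.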
The flowchart of the invention is shown in Fig. 1. The features used to build the conditional random field model comprise transition features and state features. Assuming that the label set of amino acids is L = {ordered, disordered}, the transition feature is given by formula (1):
where y_{i-1} and y_i are the labels of the amino acids at positions i-1 and i in the protein sequence, and y and y' belong to L. To construct the state features, the protein sequence is first cut into a series of subsequences using a sliding window, and the state features are then constructed for each target amino acid, namely the evolutionary information features and amino acid composition features within the window, and the secondary structure feature and relative solvent accessibility feature of the target amino acid.
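The sliding-window step can be sketched as follows. The text does not say how the ends of the sequence are handled, so padding with the placeholder residue "X" is an assumption, and the function name is hypothetical. The default window size of 11 is the value the description gives for this embodiment; the usage example uses 5 to keep the output short.

```python
from typing import List


def sliding_windows(sequence: str, window_size: int = 11, pad: str = "X") -> List[str]:
    """Return one window per residue, centred on that residue."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that the target is centred")
    half = window_size // 2
    # Assumed handling of the termini: pad both ends with a placeholder.
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window_size] for i in range(len(sequence))]


sequence = "MKTAYIAKQR"
windows = sliding_windows(sequence, window_size=5)
for target, window in zip(sequence, windows):
    print(target, window)
```

Each window has the target amino acid at its centre, so there is exactly one subsequence per residue to be predicted.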
The evolutionary information within the window is constructed as follows. First, PSI-BLAST is used to search a large protein database (such as NRdb90) to obtain the position-specific scoring matrix (PSSM) of the protein; in the present invention the PSI-BLAST E-value and number of iterations are set to 0.001 and 3, respectively, and the other parameters are left at their defaults. The PSSM is then normalized according to the following formula:
where x is the value of each element of the PSSM. Finally, the PSSM information of all amino acids within the window of each target amino acid is concatenated to obtain the evolutionary information features of the target amino acid. The window size is set to 11 in the present invention, so the evolutionary information feature has 11 × 20 = 220 dimensions. The amino acid composition feature within the window is the frequency of occurrence of each run of k consecutive amino acids in the window; k is set to 1 in the present invention, so this feature has 20 dimensions. The secondary structure feature of the target amino acid is a prediction of which of three structures the target amino acid adopts: helix, strand, or random coil. The invention predicts the secondary structure of the target amino acid with the PSIPRED software, using the version of PSIPRED that predicts from sequence profile information. However, when no PSSM is obtained for a protein sequence after searching the NRdb90 database, the version of PSIPRED that relies on the protein sequence alone is used. All PSIPRED parameters are left at their defaults. This feature has 1 dimension. The relative solvent accessibility feature of the target amino acid is predicted with the Sable software; the SA_ACTION and SA_OUT parameters are set to SVR and RELATIVE, respectively, and the other parameters are left at their defaults. This feature has 1 dimension.
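The dimensions stated above can be checked with a small sketch that assembles the state features for one target amino acid. The normalization formula is not reproduced in this text, so the logistic function used here is an assumption (it is a common choice for PSSM scores). The zero padding beyond the termini, the numeric encoding of the secondary structure, and all function names are also assumptions.

```python
import math
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def normalize(x: float) -> float:
    """Logistic squashing to (0, 1). ASSUMED: the source formula is not shown."""
    return 1.0 / (1.0 + math.exp(-x))


def window_pssm_features(pssm: Sequence[Sequence[float]], center: int,
                         window_size: int = 11) -> List[float]:
    """Concatenate the normalized PSSM rows of all residues in the window."""
    half = window_size // 2
    features: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(normalize(v) for v in pssm[pos])
        else:
            features.extend([0.0] * 20)  # ASSUMED padding beyond the termini
    return features


def window_composition(sequence: str, center: int, window_size: int = 11) -> List[float]:
    """Frequency of each of the 20 amino acids in the window (k = 1)."""
    half = window_size // 2
    window = sequence[max(0, center - half): center + half + 1]
    return [window.count(aa) / len(window) for aa in AMINO_ACIDS]


def state_features(sequence: str, pssm: Sequence[Sequence[float]],
                   secondary: Sequence[float], rsa: Sequence[float],
                   center: int) -> List[float]:
    """220 + 20 + 1 + 1 = 242 numeric values for one target amino acid."""
    return (window_pssm_features(pssm, center)
            + window_composition(sequence, center)
            + [secondary[center], rsa[center]])


# Toy inputs: an all-zero PSSM and made-up per-residue predictions.
seq = "MKTAYIAKQRQISFVK"
toy_pssm = [[0.0] * 20 for _ in seq]
toy_ss = [0.0] * len(seq)   # hypothetical numeric encoding of helix/strand/coil
toy_rsa = [0.5] * len(seq)  # hypothetical relative solvent accessibility values
vec = state_features(seq, toy_pssm, toy_ss, toy_rsa, center=8)
print(len(vec))  # 242
```

The total of 242 is simply the sum of the four stated dimensions; a real pipeline would read the PSSM, PSIPRED, and Sable outputs from files instead of using toy values.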
In data sets of intrinsically disordered proteins, the ratio of positive to negative samples is extremely unbalanced: negative samples far outnumber positive samples, which can make it difficult for the prediction model to capture the disorder information. Therefore, during training, a positive and negative sample set with a fixed ratio is constructed first. This is done by randomly removing negative samples, which guarantees that the retained amino acids keep the same relative order as in the original protein sequence. The balance ratio used in the present invention is positive samples : negative samples = 1:2. The conditional random field prediction model is then built from this data set, the state features constructed above, and the transition features that capture the dependencies between labels. Finally, this prediction model is used to make predictions for any protein.
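The balancing step can be sketched as random undersampling that keeps the surviving residues in their original order. Treating disordered residues as the positive class (label 1) and sampling within a single sequence are assumptions; the text does not state whether the ratio is enforced per protein or over the whole data set. The function name and the fixed seed are hypothetical.

```python
import random
from typing import List, Sequence


def undersample_negatives(labels: Sequence[int], ratio: int = 2, seed: int = 0) -> List[int]:
    """Return sorted indices to keep: all positives plus `ratio` times as many negatives."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    kept_negatives = random.Random(seed).sample(negatives, n_keep)
    # Sorting restores the original sequence order of the retained residues.
    return sorted(positives + kept_negatives)


labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = disordered (assumed positive class)
kept = undersample_negatives(labels)
print(kept)
```

Because only indices are removed and the rest are sorted, the retained residues appear in the same relative order as in the original sequence, as the description requires.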
In addition, the identification method of the invention is not limited to intrinsically disordered proteins. It can also be extended to similar biological problems that involve predicting sites in DNA, RNA, and protein sequences, such as the prediction of protein binding sites and the prediction of protein secondary structure.
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make a number of simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the protection scope of the invention.

Claims (8)

1. A method for identifying intrinsically disordered proteins based on conditional random fields, characterized in that the method comprises the following steps: S1, constructing the features of a conditional random field model, the features comprising transition features and state features, wherein constructing the state features comprises first cutting the protein sequence into a series of subsequences using a sliding window, and then constructing, for each target amino acid, its state features, namely the evolutionary information features and amino acid composition features within the window and the secondary structure feature and relative solvent accessibility feature of the target amino acid; S2, training the model using conditional random field software that can handle numerical features, wherein during training a positive and negative sample set with a fixed ratio is first constructed by randomly removing negative samples, the balance ratio being positive samples : negative samples = 1:2; S3, applying step S1 to the training set and inputting the result into the conditional random field model to train the model parameters; S4, applying step S1 to the test set and inputting the result into the conditional random field model to obtain the identification result.
2. The method according to claim 1, characterized in that: assuming that the label set of amino acids is L = {ordered, disordered}, the transition feature is given by the following formula:
where y_{i-1} and y_i are the labels of the amino acids at positions i-1 and i in the protein sequence, and y and y' belong to L.
3. The method according to claim 1, characterized in that: the method builds a data set from the MobiDB database and the DisProt database, and builds the prediction model on this data set.
4. The method according to claim 1, characterized in that: the evolutionary information within the window is constructed as follows: first, PSI-BLAST is used to search a large protein database to obtain the position-specific scoring matrix (PSSM) of the protein, the PSI-BLAST E-value and number of iterations being set to 0.001 and 3, respectively, and the other parameters being left at their defaults; the PSSM is then normalized according to the following formula:
where x is the value of each element of the PSSM; finally, the PSSM information of all amino acids within the window of each target amino acid is concatenated to obtain the evolutionary information features of the target amino acid.
5. The method according to claim 1, characterized in that: the amino acid composition feature within the window is the frequency of occurrence of each run of k consecutive amino acids in the window.
6. The method according to claim 1, characterized in that: the secondary structure feature of the target amino acid is obtained by using the PSIPRED software, based on sequence profile information, to predict which of three structures the target amino acid adopts: helix, strand, or random coil; however, when no PSSM is obtained for a protein sequence after searching the database, the version of PSIPRED that relies on the protein sequence alone is used.
7. The method according to claim 1, characterized in that: the relative solvent accessibility feature of the target amino acid is predicted with the Sable software, the SA_ACTION and SA_OUT parameters being set to SVR and RELATIVE, respectively, and the other parameters being left at their defaults.
8. The method according to claim 1, characterized in that: the method is further applicable to biological problems that involve predicting sites in DNA, RNA, and protein sequences, such as the prediction of protein binding sites and the prediction of protein secondary structure.
CN201810834590.6A 2018-07-26 2018-07-26 Method for identifying intrinsically disordered proteins based on conditional random fields Pending CN109147870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810834590.6A CN109147870A (en) 2018-07-26 2018-07-26 Method for identifying intrinsically disordered proteins based on conditional random fields

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810834590.6A CN109147870A (en) 2018-07-26 2018-07-26 Method for identifying intrinsically disordered proteins based on conditional random fields

Publications (1)

Publication Number Publication Date
CN109147870A true CN109147870A (en) 2019-01-04

Family

ID=64797912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810834590.6A (Pending, published as CN109147870A) Method for identifying intrinsically disordered proteins based on conditional random fields 2018-07-26 2018-07-26

Country Status (1)

Country Link
CN (1) CN109147870A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150284405A1 (en) * 2014-04-04 2015-10-08 Pfizer Inc. Bicyclic-Fused Heteroaryl or Aryl Compounds
CN106709273A (en) * 2016-12-15 2017-05-24 国家海洋局第海洋研究所 Rapid protein detection method and system based on matching characteristic sequence tags of microalgae proteins
CN106980609A (en) * 2017-03-21 2017-07-25 Dalian University of Technology Named entity recognition method using a conditional random field based on word-vector representations
WO2017165856A1 (en) * 2016-03-24 2017-09-28 Baker Brian M Biomolecule design model and uses thereof
CN107463802A (en) * 2017-08-02 2017-12-12 Nanchang University Method for predicting acetylation sites in prokaryotic proteins

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ASHISH V T et al.: "Analysis of Protein Structure using Geometric and Machine Learning", 《RESEARCHGATE》 *
MAN-WAI MAK et al.: "Fusion of Conditional Random Field and SignalP for Protein Cleavage Site Prediction", 《RESEARCHGATE》 *
MING-HUI LI et al.: "Protein–protein interaction site prediction based on conditional random fields", 《BIOINFORMATICS》 *
LUO Liang (罗亮) et al.: "Protein secondary structure prediction based on conditional random fields", 《Application Research of Computers》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190104