CN106650309A

CN106650309A - Prediction method and prediction device for membrane protein residue interaction relation

Info

Publication number: CN106650309A
Application number: CN201611264831.5A
Authority: CN
Inventors: 张慧玲; 魏彦杰; 郭宁; 贝振东; 朱昱寰
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2017-05-10

Abstract

A method for predicting membrane protein residue interactions includes: acquiring membrane proteins with resolved protein structures as a training set; extracting, from those membrane proteins, imbalanced-classification features that distinguish interacting residue pairs from non-interacting residue pairs; training a prediction model on the extracted features with the SMOTE-Boost algorithm; and predicting the residue interactions of a membrane protein of unknown structure with the trained prediction model. Because the prediction model is trained on imbalanced-classification features, the trained model avoids losing useful information, and the precision and coverage of the prediction are improved.

Description

Method and device for predicting membrane protein residue interactions

Technical field

The invention belongs to the interdisciplinary field of data mining, machine learning, and computational biology, and in particular relates to a method and device for predicting membrane protein residue interactions.

Background art

Membrane proteins account for 60% of the drug targets currently known. Because membrane protein structures are difficult to resolve experimentally, they make up only about 1% of the more than 90,000 known protein structures in the Protein Data Bank (PDB).

Existing experimental methods for resolving the three-dimensional structure of proteins mainly include X-ray and NMR methods. These experimental methods are complex to operate, time-consuming, and costly. These shortcomings have made the development of computational methods inevitable. Current computational methods for three-dimensional protein structure prediction mainly include BLAST search, fold recognition, and ab initio prediction. Such methods generally treat the task as a balanced classification problem and train the model on interacting and non-interacting residue pairs at a 1:1 ratio. Here, a protein is a polymer formed by linking 20 different kinds of amino acids. When the protein forms, the amino and carboxyl groups of these amino acids bond through dehydration; because part of each amino acid takes part in forming the peptide bond, the remaining structural part is called an amino acid residue. A residue interaction refers to a pair of residues that are not adjacent in the primary sequence of the protein but are close to each other in the tertiary structure.

Because the actual ratio of interacting residue pairs to non-interacting residue pairs departs greatly from 1:1, existing prediction methods lose a large amount of useful information, which reduces the accuracy and coverage of the prediction.

Summary of the invention

The object of the invention is to provide a method for predicting membrane protein residue interactions, so as to solve the problem that prediction methods in the prior art lose a large amount of useful information, which reduces the accuracy and coverage of the prediction.

In a first aspect, an embodiment of the invention provides a method for predicting membrane protein residue interactions, the method including:

acquiring membrane proteins with resolved protein structures as a training set;

extracting, from the membrane proteins with resolved protein structures, imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs;

training a prediction model on the extracted imbalanced-classification features with the SMOTE-Boost algorithm to obtain a trained prediction model;

predicting, according to the trained prediction model, the residue interactions of a membrane protein of unknown protein structure.

With reference to the first aspect, in a first possible implementation of the first aspect, in the step of extracting, from the membrane proteins with resolved protein structures, the imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs, the imbalanced-classification features include one or more of: position-specific scoring matrix (PSSM) features, features of the relative position of a residue within an α-helix, sequence interval features, residue type features, α-helix number features, and sequence length features.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, each residue in the PSSM is represented by a 20-dimensional vector, and the PSSM features include:

taking a sliding window of size a centered on residue i and on residue j of the residue pair (i, j) respectively, so that each residue pair yields 40a PSSM features;

taking a sliding window of size b centered on the middle position (i+j)/2 of the residue pair (i, j), which yields 20*b PSSM features.

With reference to the first possible implementation of the first aspect, in a third possible implementation of the first aspect, a residue interaction pair includes two amino acids, and the residue type features include the 10 combinations produced by any two of acidic amino acids, basic amino acids, polar amino acids, and nonpolar amino acids.

With reference to the first aspect, in a fourth possible implementation of the first aspect, an interacting residue pair is a residue pair located on the α-helices of a membrane protein whose CB-CB atomic distance is less than 8 angstroms.

In a second aspect, an embodiment of the invention provides a device for predicting membrane protein residue interactions, the device including:

a training set acquiring unit, configured to acquire membrane proteins with resolved protein structures as a training set;

a feature extraction unit, configured to extract, from the membrane proteins with resolved protein structures, imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs;

a training unit, configured to train a prediction model on the extracted imbalanced-classification features with the SMOTE-Boost algorithm to obtain a trained prediction model;

a prediction unit, configured to predict, according to the trained prediction model, the residue interactions of a membrane protein of unknown protein structure.

With reference to the second aspect, in a first possible implementation of the second aspect, in the feature extraction unit, the imbalanced-classification features include one or more of: position-specific scoring matrix (PSSM) features, features of the relative position of a residue within an α-helix, sequence interval features, residue type features, α-helix number features, and sequence length features.

With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, each residue in the PSSM is represented by a 20-dimensional vector, and the PSSM features include:

With reference to the first possible implementation of the second aspect, in a third possible implementation of the second aspect, a residue interaction pair includes two amino acids, and the residue type features include the 10 combinations produced by any two of acidic amino acids, basic amino acids, polar amino acids, and nonpolar amino acids.

With reference to the second aspect, in a fourth possible implementation of the second aspect, an interacting residue pair is a residue pair located on the α-helices of a membrane protein whose CB-CB atomic distance is less than 8 angstroms.

In the invention, membrane proteins with resolved protein structures are acquired as a training set; imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs are extracted from those membrane proteins; a prediction model is trained on the extracted features with the SMOTE-Boost algorithm to obtain a trained prediction model; and the residue interactions of a membrane protein of unknown protein structure are predicted according to the trained prediction model. Because the prediction model is trained on imbalanced-classification features, the trained model avoids losing useful information, which helps improve the precision and coverage of the prediction.

Brief description of the drawings

Fig. 1 is a flowchart of the method for predicting membrane protein residue interactions provided by an embodiment of the invention;

Fig. 2 is a schematic structural diagram of the device for predicting membrane protein residue interactions provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

The purpose of the embodiments of the invention is to provide a method for predicting membrane protein residue interactions, so as to solve the following problem in the prior art: when the residue interactions of a membrane protein of unknown structure are predicted, the task is generally treated as balanced classification and the model is trained on interacting and non-interacting residue pairs at a 1:1 ratio, whereas in fact the ratio of interacting to non-interacting residue pairs departs greatly from 1:1. Training the model at a balanced ratio loses a large amount of useful information, so the precision and coverage of the predicted membrane protein residue interactions are not high. The invention is further described below with reference to the drawings.

Fig. 1 shows the implementation flow of the method for predicting membrane protein residue interactions provided by the first embodiment of the invention, detailed as follows:

In step S101, membrane proteins with resolved protein structures are acquired as a training set.

Specifically, the membrane proteins with resolved protein structures should have determined residue interactions. In a preferred embodiment, membrane proteins in PDBTM (Protein Data Bank of Transmembrane Proteins) that were resolved before February 2012 may be used as the training set.

Of course, choosing the training set from this transmembrane protein database is only one preferred embodiment. As structure determination and identification technologies develop, more and more membrane protein structures are resolved and their residue interactions can be determined, so the sample data in the training set will become increasingly rich, which in turn helps improve the accuracy of the trained prediction model.

In step S102, imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs are extracted from the membrane proteins with resolved protein structures;

Specifically, the imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs in the embodiment of the invention may include one or more of: position-specific scoring matrix (PSSM) features, features of the relative position of a residue within an α-helix, sequence interval features, residue type features, α-helix number features, and sequence length features.

The position-specific scoring matrix (PSSM) features can be obtained by running PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool). When PSI-BLAST is run, the database used may be the UNIREF90 database, the number of iterations may be 2, and the E-value cutoff may be 1e-10 (that is, 1 × 10^-10).
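The PSI-BLAST settings above can be written down as a command line. The following is a minimal sketch that only builds the argument list for the NCBI BLAST+ `psiblast` program. The file names and the database name `uniref90` are placeholders, and mapping the "E-value cutoff" to the `-evalue` option is an assumption (the `-inclusion_ethresh` option is the other plausible reading; the text does not say which was meant).

```python
def build_psiblast_command(fasta_path, pssm_path, db="uniref90"):
    """Build an argument list for NCBI BLAST+ `psiblast`.

    The paths and the database name are hypothetical placeholders.
    Using `-evalue` for the E-value cutoff is an assumption.
    """
    return [
        "psiblast",
        "-query", fasta_path,
        "-db", db,
        "-num_iterations", "2",        # two iterations, as stated in the text
        "-evalue", "1e-10",            # E-value cutoff stated in the text
        "-out_ascii_pssm", pssm_path,  # write the PSSM as a text file
    ]


cmd = build_psiblast_command("protein.fasta", "protein.pssm")
# To actually run it (requires BLAST+ and a formatted database):
#   subprocess.run(cmd, check=True)
```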

In the embodiment of the invention, each residue in the PSSM is represented by a 20-dimensional vector, which gives the frequencies with which the 20 kinds of amino acids occur at the corresponding position. During feature extraction, the PSSM features are divided into two classes.

For example, in one specific embodiment:

The first class takes a sliding window of size 7 centered on residue i and on residue j of the residue pair (i, j) respectively, so that each residue pair yields 2 × 7 × 20 = 280 PSSM features;

The second class takes a sliding window of size 3 centered on the middle position (i+j)/2 of the residue pair (i, j), which yields 3 × 20 = 60 PSSM features.

The total number of PSSM features in the two classes is 280 + 60 = 340.
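The two window classes can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the PSSM is taken to be a list of 20-value rows, windows that run past either end of the sequence are zero-padded, and the middle position is rounded down with integer division. None of these three choices is specified in the text, and the function names are hypothetical.

```python
def window_rows(pssm, center, size):
    """Return `size` PSSM rows centered on `center`.

    Positions outside the sequence are filled with zero rows (an assumption).
    `size` is expected to be odd, as in the text (7 and 3).
    """
    half = size // 2
    zero_row = [0.0] * 20
    rows = []
    for pos in range(center - half, center + half + 1):
        rows.append(pssm[pos] if 0 <= pos < len(pssm) else zero_row)
    return rows


def pssm_pair_features(pssm, i, j, a=7, b=3):
    """Flatten the windows around i, j, and the midpoint into one list."""
    mid = (i + j) // 2  # rounding down is an assumption
    features = []
    for center, size in ((i, a), (j, a), (mid, b)):
        for row in window_rows(pssm, center, size):
            features.extend(row)
    return features


# Toy PSSM: 50 residues, 20 scores each.
pssm = [[float(r)] * 20 for r in range(50)]
features = pssm_pair_features(pssm, 2, 30)
```

With a = 7 and b = 3 the list has (7 + 7 + 3) × 20 = 340 entries, matching the count in the text.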

The feature of the relative position of a residue within an α-helix is defined as follows: let p be the position of one residue of the pair on a helix of length l; the feature is then p/l. Each residue pair contains two residues, and one relative position feature is extracted for each of them, giving 2 such features in total.
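A minimal sketch of the p/l feature follows. It assumes that a helix is given by its first and last residue indices and that p is counted from 1 at the first helix residue; the text does not fix either convention, and the function name is hypothetical.

```python
def relative_helix_position(residue_index, helix_start, helix_end):
    """Return p / l for a residue on a helix spanning [helix_start, helix_end].

    p counts from 1 at the first helix residue (an assumption).
    """
    if not helix_start <= residue_index <= helix_end:
        raise ValueError("residue is not on this helix")
    length = helix_end - helix_start + 1          # l
    position = residue_index - helix_start + 1    # p
    return position / length


# Two features per residue pair, one for each residue.
pair_features = [
    relative_helix_position(12, 10, 29),  # p = 3, l = 20
    relative_helix_position(45, 40, 59),  # p = 6, l = 20
]
```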

The sequence interval feature can be divided according to the positions of the residue pair in the primary sequence. For example, one specific division uses the following intervals:

<25, 25-50, 50-75, 75-100, 100-125, 125-150, 150-175, 175-200, and >200, nine intervals in total.

The sequence interval feature can be represented by a nine-bit code, 000000000, in which each bit is set to 0 or 1 (0 means the pair is not in that interval, 1 means it is). Under the division above, each residue pair falls into exactly one of the 9 sequence intervals.
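The nine-bit code can be sketched as below. Two points are assumptions: the interval is taken to be the sequence separation |i - j| of the two residues, and a shared boundary value such as 25 or 50 is assigned to the upper interval, except 200, which stays in 175-200 because the last interval is strictly ">200".

```python
def sequence_interval_code(i, j):
    """One-hot code (9 bits) for the sequence separation |i - j|.

    Bins: <25, 25-50, 50-75, ..., 175-200, >200.
    Boundary handling is an assumption, see the note above.
    """
    separation = abs(i - j)
    if separation > 200:
        index = 8
    else:
        index = min(separation // 25, 7)
    code = [0] * 9
    code[index] = 1
    return code


example = sequence_interval_code(10, 70)  # separation 60 falls in 50-75
```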

For the residue type feature, consider the 20 kinds of amino acids that make up proteins. According to the polarity of the amino acid R group, they can be divided into acidic amino acids (glutamic acid and aspartic acid), basic amino acids (lysine, arginine, and histidine), and neutral amino acids. The neutral amino acids can be further divided into polar amino acids (glycine, serine, cysteine, threonine, tyrosine, asparagine, and glutamine) and nonpolar amino acids (alanine, leucine, isoleucine, phenylalanine, methionine, tryptophan, valine, and proline). With these 4 amino acid classes (acidic, basic, polar, and nonpolar), a residue interaction pair (two amino acids) can produce 10 different combinations: 4 in which both residues belong to the same class and 6 in which they belong to different classes. The combination type can be represented by a ten-bit binary code, 0000000000, in which each bit is set to 0 or 1. This gives 10 residue type features.
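The 10 combinations can be enumerated directly. The sketch below uses the class membership listed above with standard one-letter amino acid codes; the order of the ten bits is an assumption, since the text does not specify it.

```python
from itertools import combinations_with_replacement

CLASSES = ["acidic", "basic", "polar", "nonpolar"]

# Class membership as listed in the text, in one-letter codes.
CLASS_OF = {}
for name, letters in {
    "acidic": "DE",
    "basic": "KRH",
    "polar": "GSCTYNQ",
    "nonpolar": "ALIFMWVP",
}.items():
    for letter in letters:
        CLASS_OF[letter] = name

# Unordered pairs of classes, a class may pair with itself: 4 + 6 = 10.
PAIR_TYPES = list(combinations_with_replacement(CLASSES, 2))


def residue_type_code(aa_i, aa_j):
    """One-hot code (10 bits) for the class combination of two residues."""
    pair = tuple(sorted((CLASS_OF[aa_i], CLASS_OF[aa_j]), key=CLASSES.index))
    code = [0] * len(PAIR_TYPES)
    code[PAIR_TYPES.index(pair)] = 1
    return code


example = residue_type_code("D", "K")  # acidic + basic
```

Sorting the two classes before the lookup makes the code independent of the order of the residues in the pair.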

The α-helix number feature can be divided into intervals according to the number of α-helices the membrane protein contains. For example, it can be divided into 4 intervals: 2-4, 5-7, 8-10, and more than 10. The α-helix number feature is represented by a binary vector, 0000, in which each bit is set to 0 or 1 (0 means not in that interval, 1 means in it). This feature is the same for all residue pairs in a given membrane protein. Each residue pair feature vector includes 4 features of this kind.

The sequence length feature can be divided according to the length of the primary sequence of the membrane protein into 4 intervals: <100, 100-400, 400-800, and >800. It is represented by a binary vector, 0000, in which each bit is set to 0 or 1 (0 means not in that interval, 1 means in it). This feature is the same for all residue pairs in the same membrane protein. Each residue pair feature vector includes 4 features of this kind.
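Both protein-level features are four-bit one-hot codes, which can be sketched as follows. The handling of values on a shared boundary (a length of exactly 400, for instance) is an assumption, as is raising an error for fewer than 2 helices, since the text defines no interval for that case.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    code = [0] * size
    code[index] = 1
    return code


def helix_count_code(n_helices):
    """Intervals from the text: 2-4, 5-7, 8-10, more than 10."""
    if n_helices < 2:
        raise ValueError("no interval is defined below 2 helices")
    if n_helices <= 4:
        return one_hot(0, 4)
    if n_helices <= 7:
        return one_hot(1, 4)
    if n_helices <= 10:
        return one_hot(2, 4)
    return one_hot(3, 4)


def sequence_length_code(length):
    """Intervals from the text: <100, 100-400, 400-800, >800.

    Placing both 400 and 800 in the third interval is an assumption.
    """
    if length < 100:
        return one_hot(0, 4)
    if length < 400:
        return one_hot(1, 4)
    if length <= 800:
        return one_hot(2, 4)
    return one_hot(3, 4)


protein_features = helix_count_code(7) + sequence_length_code(350)
```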

In summary, the invention can use 340 PSSM features, 2 features of relative position within an α-helix, 9 sequence interval features, 10 residue type features, 4 α-helix number features, and 4 sequence length features, for a total of 369 features.
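The total can be checked by laying the six groups out in one vector. The group names and their order in the sketch below are assumptions made for illustration; only the widths come from the text.

```python
# Widths from the text; names and ordering are hypothetical.
FEATURE_GROUPS = {
    "pssm": 340,
    "relative_helix_position": 2,
    "sequence_interval": 9,
    "residue_type": 10,
    "helix_count": 4,
    "sequence_length": 4,
}


def feature_slices(groups):
    """Map each group to its (start, stop) range in the concatenated vector."""
    slices = {}
    start = 0
    for name, width in groups.items():
        slices[name] = (start, start + width)
        start += width
    return slices, start


slices, total = feature_slices(FEATURE_GROUPS)
```

Here `total` is 340 + 2 + 9 + 10 + 4 + 4 = 369.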

In addition, the ratio of interacting residue pairs to non-interacting residue pairs in the embodiment of the invention may be from 1:50 to 1:80; in a preferred embodiment it may be set to 1:67.

Specifically, there are several definitions of a protein residue interaction pair, for example definitions based on atomic van der Waals distance, on CA-CA atomic distance, and on CB-CB atomic distance. For the definition of a residue interaction pair, the invention follows a widely adopted one: a residue pair located on the α-helices of a membrane protein whose CB-CB atomic distance is less than 8 Å (angstroms) is defined as an interacting residue pair. CA and CB (the alpha and beta carbons) are atom types as named in the GROMACS molecular dynamics software.
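The CB-CB definition can be sketched as a simple distance test. The function name and the `min_separation` parameter are assumptions: the parameter skips residues that are adjacent in sequence, in line with the earlier definition of a residue interaction, but the text gives no numeric threshold for it. Glycine has no CB atom, and how that case is handled is not stated in the text, so the sketch simply expects one coordinate per residue.

```python
import math


def interacting_pairs(cb_coords, cutoff=8.0, min_separation=2):
    """Return pairs (i, j), i < j, whose CB-CB distance is below `cutoff`.

    cb_coords: one (x, y, z) tuple in angstroms per helix residue.
    min_separation: smallest allowed j - i (an assumed value).
    """
    pairs = []
    for i in range(len(cb_coords)):
        for j in range(i + min_separation, len(cb_coords)):
            if math.dist(cb_coords[i], cb_coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs


# Toy coordinates along a line, 3.8 angstroms apart, plus one far residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
pairs = interacting_pairs(coords)
```

Every pair not returned by the function is a non-interacting pair, which is why the negative class is so much larger than the positive class.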

In step S103, a prediction model is trained on the extracted imbalanced-classification features with the SMOTE-Boost algorithm to obtain a trained prediction model;

After the imbalanced-classification features have been extracted, they can be fed into the prediction model for training. The prediction model may be, for example, a support vector machine model.

The SMOTE-Boost training algorithm is a training method that combines the SMOTE technique with the boosting technique. In each iteration, the boosting method increases the weights of misclassified samples and reduces the weights of correctly classified samples, so that more attention is paid to the misclassified samples. Because minority-class samples are more easily misclassified, this improves prediction performance on the minority class. SMOTE (synthetic minority over-sampling technique) is a method for learning from imbalanced data sets: it raises the proportion of minority-class samples by synthesizing new minority samples, which reduces the excessive skew of the data. Combining SMOTE with boosting effectively avoids the overfitting that could result from giving the minority samples larger weights.
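The combination described above can be sketched in a few dozen lines. This is a simplified illustration, not the disclosed implementation: it uses decision stumps as the weak learner instead of the model named in the text, and a discrete AdaBoost-style weight update rather than the AdaBoost.M2 pseudo-loss update of the original SMOTEBoost formulation. All function names, the number of rounds, and the number of synthetic samples per round are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, rng=random):
    """Create n_new points by interpolating minority points toward near neighbors."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((m for m in minority if m is not x),
                           key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def stump_predict(stump, x):
    feature, threshold, sign = stump
    return sign if x[feature] > threshold else -sign


def fit_stump(X, y, w):
    """Return the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict((f, t, sign), x) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best[1:]


def smote_boost(X, y, rounds=5, n_new=4, seed=0):
    """Boosting loop that adds fresh SMOTE samples of the +1 class in each round."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    minority = [x for x, yi in zip(X, y) if yi == 1]
    ensemble = []
    for _ in range(rounds):
        synth = smote(minority, n_new, rng=rng)
        # Synthetic points are used for this round only, each with weight 1/n.
        stump = fit_stump(X + synth, y + [1] * n_new, w + [1.0 / n] * n_new)
        # The error and the weight update use the original samples only.
        err = sum(wi for x, yi, wi in zip(X, y, w) if stump_predict(stump, x) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, x))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


# Toy imbalanced data: 10 negatives, 3 positives, one feature.
X = [[float(v)] for v in range(10)] + [[12.0], [13.0], [14.0]]
y = [-1] * 10 + [1] * 3
model = smote_boost(X, y)
```

Labels are +1 for the minority (interacting) class and -1 for the majority class. In practice each sample would be the 369-value feature vector of a residue pair rather than a single number.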

In step S104, the residue interactions of a membrane protein of unknown protein structure are predicted according to the trained prediction model.

The invention acquires membrane proteins with resolved protein structures as a training set; extracts, from those membrane proteins, imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs; trains a prediction model on the extracted features with the SMOTE-Boost algorithm to obtain a trained prediction model; and predicts the residue interactions of a membrane protein of unknown protein structure according to the trained prediction model. Because the prediction model is trained on imbalanced-classification features, the trained model avoids losing useful information, which helps improve the precision and coverage of the prediction.

Fig. 2 shows a schematic structural diagram of a device for predicting membrane protein residue interactions provided by an embodiment of the invention, detailed as follows:

The device for predicting membrane protein residue interactions in the embodiment of the invention includes:

a training set acquiring unit 201, configured to acquire membrane proteins with resolved protein structures as a training set;

a feature extraction unit 202, configured to extract, from the membrane proteins with resolved protein structures, imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs;

a training unit 203, configured to train a prediction model on the extracted imbalanced-classification features with the SMOTE-Boost algorithm to obtain a trained prediction model;

a prediction unit 204, configured to predict, according to the trained prediction model, the residue interactions of a membrane protein of unknown protein structure.

Preferably, in the feature extraction unit, the imbalanced-classification features include one or more of: position-specific scoring matrix (PSSM) features, features of the relative position of a residue within an α-helix, sequence interval features, residue type features, α-helix number features, and sequence length features.

Preferably, each residue in the PSSM is represented by a 20-dimensional vector, and the PSSM features include:

Preferably, a residue interaction pair includes two amino acids, and the residue type features include the 10 combinations produced by any two of acidic amino acids, basic amino acids, polar amino acids, and nonpolar amino acids.

Preferably, an interacting residue pair is a residue pair located on the α-helices of a membrane protein whose CB-CB atomic distance is less than 8 angstroms.

The device for predicting membrane protein residue interactions shown in Fig. 2 corresponds to the method for predicting membrane protein residue interactions described in the first embodiment, and the details are not repeated here.

In the several embodiments provided by the invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiment described above is only schematic. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or of other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

The above are only preferred embodiments of the invention and do not limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for predicting membrane protein residue interactions, characterized in that the method includes:

extracting, from membrane proteins with resolved protein structures, imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs;

training a prediction model on the extracted imbalanced-classification features with the SMOTE-Boost algorithm to obtain a trained prediction model;

2. The method according to claim 1, characterized in that, in the step of extracting, from the membrane proteins with resolved protein structures, the imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs, the imbalanced-classification features include one or more of: position-specific scoring matrix (PSSM) features, features of the relative position of a residue within an α-helix, sequence interval features, residue type features, α-helix number features, and sequence length features.

3. The method according to claim 2, characterized in that each residue in the PSSM is represented by a 20-dimensional vector, and the PSSM features include:

taking a sliding window of size a centered on residue i and on residue j of the residue pair (i, j) respectively, so that each residue pair yields 40a PSSM features;

taking a sliding window of size b centered on the middle position (i+j)/2 of the residue pair (i, j), which yields 20*b PSSM features.

4. The method according to claim 2, characterized in that a residue interaction pair includes two amino acids, and the residue type features include the 10 combinations produced by any two of acidic amino acids, basic amino acids, polar amino acids, and nonpolar amino acids.

5. The method according to claim 1, characterized in that an interacting residue pair is a residue pair located on the α-helices of a membrane protein whose CB-CB atomic distance is less than 8 angstroms.

6. A device for predicting membrane protein residue interactions, characterized in that the device includes:

a feature extraction unit, configured to extract, from membrane proteins with resolved protein structures, imbalanced-classification features for distinguishing interacting residue pairs from non-interacting residue pairs;

a training unit, configured to train a prediction model on the extracted imbalanced-classification features with the SMOTE-Boost algorithm to obtain a trained prediction model;

a prediction unit, configured to predict, according to the trained prediction model, the residue interactions of a membrane protein of unknown protein structure.

7. The device according to claim 6, characterized in that, in the feature extraction unit, the imbalanced-classification features include one or more of: position-specific scoring matrix (PSSM) features, features of the relative position of a residue within an α-helix, sequence interval features, residue type features, α-helix number features, and sequence length features.

8. The device according to claim 7, characterized in that each residue in the PSSM is represented by a 20-dimensional vector, and the PSSM features include:

9. The device according to claim 7, characterized in that a residue interaction pair includes two amino acids, and the residue type features include the 10 combinations produced by any two of acidic amino acids, basic amino acids, polar amino acids, and nonpolar amino acids.

10. The device according to claim 6, characterized in that an interacting residue pair is a residue pair located on the α-helices of a membrane protein whose CB-CB atomic distance is less than 8 angstroms.