CN109801672A

CN109801672A - Method for predicting protein-protein interactions using multivariate mutual information and residue binding energy

Info

Publication number: CN109801672A
Application number: CN201811366481.2A
Authority: CN
Inventors: 郭菲; 邹权; 丁漪杰; 潘高峰; 唐继军
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-11-16
Filing date: 2018-11-16
Publication date: 2019-05-24

Abstract

The invention relates to bioinformatics and proposes an accurate and efficient method for predicting protein-protein interactions. The method increases the contribution of useful information in the amino acid sequence to the prediction while effectively reducing the influence of noise. The method for predicting protein-protein interactions using multivariate mutual information and residue binding energy comprises: Step (1): group the amino acids by category; Step (2): define the feature representations; Step (3): build the feature frequency table; Step (4): compute the mutual information features; Step (5): compute the 3-tuple mutual information features; Step (6): compute the amino acid physicochemical property features; Step (7): compute the amino acid contact matrix AAC; Step (8): extract features from the amino acid sequence; Step (9): perform singular value decomposition; Step (10): obtain the interaction between the two proteins. The invention is mainly applicable to protein-protein interaction prediction.

Description

Method for predicting protein-protein interactions using multivariate mutual information and residue binding energy

Technical field

The invention relates to methods in bioinformatics that predict protein-protein interactions from amino acid sequence information, and belongs to the field of macromolecular structure prediction algorithms in proteomics. Specifically, it relates to a method for predicting protein-protein interactions using multivariate mutual information and residue binding energy.

Background art

Protein-protein interactions are at the core of many biological processes. Identifying them is essential for elucidating protein function and for identifying biological processes in the cell. Interaction information helps researchers better understand disease mechanisms and therefore design drugs more efficiently and accurately. In the past few years, computational techniques have matured to the point where large-scale analysis is feasible. Computational methods for detecting protein-protein interactions fall broadly into three classes: methods based on evolutionary information, methods based on natural language processing, and methods based on amino acid sequence features. Methods based on evolutionary information extract evolutionary information from multiple sequence alignments of homologous proteins and build phylogenetic trees to analyze the relationships between protein functions. They require large amounts of homologous protein data together with interaction labels for those proteins, which greatly limits their use in large-scale computation. Methods based on natural language processing rely on widely used natural language processing techniques to mine useful information from the many known protein-protein interactions recorded in the biological and medical literature. Because some information is missing from the literature, their predictions may be incomplete. It is therefore particularly important to improve prediction accuracy, and to keep the method widely applicable, by combining a multivariate mutual information feature extraction method with a residue binding energy feature extraction method, both based on the amino acid sequence.

Feature extraction is the key technique in sequence-based protein-protein interaction prediction. It defines a set of mapping functions that map an amino acid sequence of a protein to a series of numerical features representing that sequence. These values should capture the useful characteristics of the protein as completely as possible while excluding noise that would degrade the prediction. Classical amino acid sequence feature extraction methods include auto covariance, conjoint triad, local protein sequence descriptors, multi-scale local feature descriptors, local phase quantization descriptors, and matrix-based protein sequence representation. These methods abstract the amino acid sequence from different perspectives, and their prediction results differ considerably. Designing an effective feature extraction method that abstracts the amino acid sequence, increases the separation between sequences, and reduces the interference of noise with the prediction has therefore become the key problem in protein-protein interaction prediction.

Summary of the invention

To overcome the deficiencies of the prior art, the invention proposes an accurate and efficient method for predicting protein-protein interactions. The feature extraction functions used in the method increase the contribution of useful information in the amino acid sequence to the prediction while effectively reducing the influence of noise. To this end, the technical solution adopted by the invention is a method for predicting protein-protein interactions using multivariate mutual information and residue binding energy, with the following steps:

Step (1): Amino acid grouping. The 20 standard amino acids are assigned to n functional groups according to dipole and volume. The n groups are denoted C₀, C₁, C₂, ..., C_n, and the original amino acid sequence is converted into a group-category sequence by replacing each amino acid with the category of its functional group;

Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple features are "C₀C₀C₀", "C₀C₀C₁", ..., "C_nC_nC_n"; the 2-tuple features are "C₀C₀", "C₀C₁", ..., "C_nC_n";

Step (3): Count the number of occurrences of the 3-tuple and 2-tuple features in the group-category sequence and build the feature frequency table. The frequency function f(a) = (n_a + 1)/(L + 1) is used to compute the frequency with which each of the n categories occurs in the sequence;

Step (4): Compute the 2-tuple mutual information features. The calculation formula is:

where f(ab) is the frequency with which categories a and b occur together in a 2-tuple;

Step (5): Compute the 3-tuple mutual information features. The calculation formula is:

$$I(abc) = I(ab) + f(a \mid c)\,\ln f(a \mid c) - f(a \mid bc)\,\ln f(a \mid bc)$$

where f(a|c) is the frequency with which category a occurs in the 2-tuples that contain category c, and f(a|bc) is the frequency with which category a occurs in the 3-tuples that contain categories b and c;

The first part of the feature values, the mutual information features, is obtained from the five steps above;

Step (6): Compute the amino acid physicochemical property features;

Step (7): By statistically analyzing a protein complex database, compute the amino acid contact matrix AAC from residue pairing frequencies:

where i and j denote two amino acid types and $N_{i,j} = \sum_D n_{ij}$ is the number of contacts between i and j.

Compute the substitution matrix SMR, $\mathrm{SMR}_{i,l} = \mathrm{AAC}(i, A_l)$, where i = 1, ..., 20 indexes the 20 amino acid types, l = 1, ..., L indexes the L positions of the given protein sequence, and A_l is the amino acid type at position l. This step yields a 20 × L substitution matrix SMR;

Step (8): Extract features from the amino acid sequence using the histogram of oriented gradients (HOG) feature extraction algorithm;

Step (9): Perform singular value decomposition on the transpose of the SMR matrix. The decomposition yields 20 right singular vectors;

Step (10): Input the feature values obtained in steps (1) to (9) into a random forest model for prediction, thereby obtaining the interaction between the two proteins.

The specific calculation steps of step (6) are as follows:

Step (6.1): Compute the Moreau-Broto autocorrelation feature values. The calculation formula is:

where lag is the distance between residues, p denotes the p-th physicochemical property of the natural amino acids, l is the position in the sequence, l = 1, 2, ..., L − lag, and lag = 1, 2, ..., lg. Using six physicochemical properties yields lg × 6 feature values.

Step (6.2): Normalize the lg × 6 feature values obtained;

Step (6.3): Count the frequency with which each of the 20 amino acids occurs in the sequence.
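Steps (6.1) and (6.3) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the autocorrelation formula is not reproduced in this text, so the code assumes the normalized Moreau-Broto form (the mean of P(l)·P(l+lag) over l = 1, ..., L − lag); the six physicochemical scales are not listed, so the Kyte-Doolittle hydropathy scale is used as a stand-in; and the normalization of step (6.2) is omitted. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy, used here only as a stand-in property scale.
# ASSUMPTION: the source does not list its six physicochemical scales.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def autocorrelation(seq, scale, lg):
    """Return lg values, one per lag = 1..lg, for a single property scale."""
    values = [scale[aa] for aa in seq]
    L = len(values)
    if L <= lg:
        raise ValueError("sequence must be longer than lg")
    feats = []
    for lag in range(1, lg + 1):
        # ASSUMPTION: normalized Moreau-Broto form, averaged over l = 1..L-lag
        total = sum(values[l] * values[l + lag] for l in range(L - lag))
        feats.append(total / (L - lag))
    return feats


def composition(seq):
    """Step (6.3): frequency of each of the 20 amino acids in the sequence."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def physicochemical_features(seq, scales, lg=30):
    """Concatenate lg values per scale with the 20 composition frequencies."""
    feats = []
    for scale in scales:
        feats.extend(autocorrelation(seq, scale, lg))
    return feats + composition(seq)
```

With six scales and lg = 30, `physicochemical_features` returns 30 × 6 + 20 = 200 values per sequence, which agrees with the counts given in the embodiment.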

The specific calculation process of step (8) is as follows:

Step (8.1): Compute the gradient values in the horizontal and vertical directions, G_h(i, l) and G_v(i, l). The calculation formula is:

Step (8.2): Compute the gradient magnitude.

Step (8.3): Compute the gradient direction.

Step (8.4): Divide the gradient magnitude matrix and the gradient direction matrix into 9 submatrices of the same size;

Step (8.5): Compute the histogram over gradient directions. The size of the histogram bin for each gradient direction is used as one feature value.

Through the above steps, each sequence yields x feature values, and the two sequences together yield 2x feature values.
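Steps (8.1) to (8.5) can be illustrated with the following pure-Python sketch. The gradient formulas are not reproduced in this text, so several choices here are assumptions rather than the patented method: central differences with zero gradient at the borders, a 3 × 3 grid of roughly equal blocks for the 9 submatrices, 9 direction bins per block, and magnitude-weighted voting. The function name `hog_features` is hypothetical.

```python
import math


def hog_features(M, n_blocks=3, n_bins=9):
    """M is a list of rows (for example a 20 x L matrix).

    Returns n_blocks * n_blocks * n_bins histogram values.
    """
    rows, cols = len(M), len(M[0])
    mag = [[0.0] * cols for _ in range(rows)]
    ang = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for l in range(cols):
            # ASSUMPTION: central differences, zero gradient at the borders.
            gh = M[i][l + 1] - M[i][l - 1] if 0 < l < cols - 1 else 0.0
            gv = M[i + 1][l] - M[i - 1][l] if 0 < i < rows - 1 else 0.0
            mag[i][l] = math.hypot(gh, gv)      # step (8.2): gradient magnitude
            ang[i][l] = math.atan2(gv, gh)      # step (8.3): direction in [-pi, pi]
    feats = []
    for bi in range(n_blocks):                  # step (8.4): 3 x 3 grid of blocks
        for bl in range(n_blocks):
            hist = [0.0] * n_bins
            for i in range(bi * rows // n_blocks, (bi + 1) * rows // n_blocks):
                for l in range(bl * cols // n_blocks, (bl + 1) * cols // n_blocks):
                    b = int((ang[i][l] + math.pi) / (2 * math.pi) * n_bins)
                    hist[min(b, n_bins - 1)] += mag[i][l]
            feats.extend(hist)                  # step (8.5): one value per bin
    return feats
```

Under these assumptions the function returns 9 × 9 = 81 values per matrix, so x = 81 and a pair of sequences gives 2x = 162 values.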

The features and beneficial effects of the invention are:

The invention integrates the multivariate mutual information and the residue binding energy information of the amino acid sequence. Compared with conventional sequence information, multivariate mutual information considers each amino acid together with its two adjacent amino acids, and also considers the mutual information among these components. In addition, the gradient histogram and singular value decomposition operations extract texture features of the protein matrix. This added information and these added features greatly help to predict the interactions between proteins accurately, so that when the method is used to analyze and predict protein-protein interactions, its accuracy exceeds that of other existing methods. The method not only predicts protein-protein interactions accurately but can also discover new interactions in protein interaction networks, which is of great significance for completing the interaction networks of various proteins.

Brief description of the drawings:

Fig. 1: Flow chart of the calculation process of the invention;

Fig. 2: Feature representations of the 2-tuples and 3-tuples and construction of the frequency table;

Fig. 3: Schematic diagram of computing the histogram of oriented gradients;

Fig. 4: Accuracy of the Moreau-Broto autocorrelation features for different lg values;

Fig. 5: Prediction results of the method on the single-core network;

Fig. 6: Prediction results of the method on the multi-core network;

Fig. 7: Prediction results of the method on the crossover network.

Detailed description of the embodiments

The object of the invention is to provide an accurate and efficient method for predicting protein-protein interactions. The feature extraction functions used in the method increase the contribution of useful information in the amino acid sequence to the prediction while effectively reducing the influence of noise.

The invention is characterized in that it comprises the following steps in order:

Step (1): Amino acid grouping. The 20 standard amino acids are assigned to 7 functional groups according to dipole and volume. The 7 groups are denoted C₀, C₁, C₂, ..., C₆. The original amino acid sequence is converted into a group-category sequence by replacing each amino acid with the category of its functional group.

Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple features are "C₀C₀C₀", "C₀C₀C₁", ..., "C₆C₆C₆". The 2-tuple features are "C₀C₀", "C₀C₁", ..., "C₆C₆".

Step (3): Count the number of occurrences of the 3-tuple and 2-tuple features in the group-category sequence and build the feature frequency table, as shown in Fig. 2. The frequency function f(a) = (n_a + 1)/(L + 1) is used to compute the frequency with which each of the 7 categories occurs in the sequence.

Step (4): Compute the 28 2-tuple mutual information features. The calculation formula is:

where f(ab) is the frequency with which the 2-tuple ab occurs.

Step (5): Compute the 84 3-tuple mutual information features. The calculation formula is:

$$I(abc) = I(ab) + f(a \mid c)\,\ln f(a \mid c) - f(a \mid bc)\,\ln f(a \mid bc)$$

where f(a|c) is the frequency with which category a occurs in the 2-tuples that contain category c, and f(a|bc) is the frequency with which category a occurs in the 3-tuples that contain categories b and c.

The five steps above yield 238 mutual information feature values.
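Steps (1) to (5) can be sketched in Python as below. This is a hedged illustration, not the patented implementation. The following points are assumptions not stated in this text: the seven groups follow the partition commonly used by the conjoint triad method; tuples are unordered windows of adjacent residues (which is what makes the counts 28 and 84 work out); the 2-tuple formula, which is not reproduced here, takes the standard form f(ab) ln(f(ab) / (f(a) f(b))); the conditional frequencies are computed as f(a|c) = f(ac)/f(c) and f(a|bc) = f(abc)/f(bc); and the 7 single-category frequencies are counted among the features. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import log

# ASSUMPTION: seven-group partition as used by the conjoint triad method.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: k for k, members in enumerate(GROUPS) for aa in members}


def tuple_counts(classes, k):
    """Count unordered k-tuples of adjacent categories (C0C1 equals C1C0)."""
    return Counter(
        tuple(sorted(classes[i:i + k])) for i in range(len(classes) - k + 1)
    )


def mi_features(seq):
    classes = [CLASS_OF[aa] for aa in seq]        # step (1)
    L = len(classes)
    c1, c2, c3 = (tuple_counts(classes, k) for k in (1, 2, 3))

    def f(counter, *cats):                        # step (3): (n + 1) / (L + 1)
        return (counter[tuple(sorted(cats))] + 1) / (L + 1)

    def i2(a, b):
        # ASSUMPTION: standard form f(ab) * ln(f(ab) / (f(a) * f(b))).
        fab = f(c2, a, b)
        return fab * log(fab / (f(c1, a) * f(c1, b)))

    def i3(a, b, c):
        f_a_c = f(c2, a, c) / f(c1, c)            # assumed f(a|c)
        f_a_bc = f(c3, a, b, c) / f(c2, b, c)     # assumed f(a|bc)
        return i2(a, b) + f_a_c * log(f_a_c) - f_a_bc * log(f_a_bc)

    cats = range(len(GROUPS))
    feats = [f(c1, a) for a in cats]                                          # 7
    feats += [i2(a, b) for a, b in combinations_with_replacement(cats, 2)]    # 28
    feats += [i3(a, b, c) for a, b, c in combinations_with_replacement(cats, 3)]  # 84
    return feats
```

Under these assumptions each sequence yields 7 + 28 + 84 = 119 values, and concatenating the vectors of the two proteins in a pair gives the 238 values stated above.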

Step (6): Compute the amino acid physicochemical property features. Each amino acid sequence yields 200 feature values, so a pair of amino acid sequences whose interaction is to be predicted yields 400 feature values. The calculation is as follows:

Step (6.1): Compute the Moreau-Broto autocorrelation feature values. The calculation formula is:

where lag is the distance between residues, p denotes the p-th physicochemical property of the natural amino acids, l is the position in the sequence, l = 1, 2, ..., L − lag, and lag = 1, 2, ..., lg, where lg is typically set to 30. Using six physicochemical properties yields 30 × 6 = 180 feature values.

Step (6.2): Normalize the 180 feature values obtained.

Step (6.3): Count the frequency with which each of the 20 amino acids occurs in the sequence.

where i and j denote two amino acid types and $N_{i,j} = \sum_D n_{ij}$ is the number of contacts between i and j.

Step (8): Extract features from the amino acid sequence using the histogram of oriented gradients (HOG) feature extraction algorithm. The calculation process is as follows:

Step (8.2): Compute the gradient magnitude.

Step (8.3): Compute the gradient direction.

Step (8.4): Divide the gradient magnitude matrix and the gradient direction matrix into 9 submatrices of the same size.

Step (8.5): Compute the histogram over gradient directions. The size of the histogram bin for each gradient direction is used as one feature value.

Through the above steps, each sequence yields 81 feature values, and the two sequences together yield 162 feature values.

Step (9): Perform singular value decomposition on the transpose of the SMR matrix. The decomposition yields 20 right singular vectors. This step yields 800 feature values.

Step (10): Steps (1) to (9) yield 238 + 400 + 162 + 800 = 1600 feature values in total. These feature values are input into a random forest model for prediction, thereby obtaining the interaction between the two proteins.

Following the above calculation method, we evaluated the performance of our prediction method with a random forest model on 12 protein-protein interaction datasets that were constructed by earlier researchers and are widely accepted, including S.cerevisiae, H.pylori2918, human8161, and E.coli. We also tested the method on three real protein-protein interaction networks: the single-core network CD9, the multi-core network of the Ras-Raf-Mek-Erk-Elk-Srf pathway, and the crossover network Wnt. On the S.cerevisiae dataset, interaction prediction using binary mutual information, ternary mutual information, and multivariate mutual information achieved accuracies of 93.56%, 93.88%, and 94.23%, respectively. Feature extraction with the combined multivariate mutual information clearly performs better than either feature type used alone. For the Moreau-Broto autocorrelation features, we tested nine different lg values (lg = 5, 10, 15, 20, 25, 30, 35, 40, 45) to find the best one. Fig. 4 shows the prediction accuracy for each lg value. The curve shows that accuracy rises as lg increases from 5 to 30 and falls as lg increases from 30 to 45. The best accuracy, 92.76%, is obtained at lg = 30. Used alone, the histogram of oriented gradients features and the singular value decomposition features achieve accuracies of 93.86% and 92.93%, respectively. Our method integrates the four classes of feature extraction methods and reaches a prediction accuracy of 94.56%. The random forest classifier used in the method gives better prediction results than a support vector machine classifier. A random forest is an ensemble model that can also assess the importance of features, and it improves prediction accuracy by 2%.

The method also achieves high accuracy when applied to protein-protein interaction network prediction. On the single-core network CD9, our method identifies 14 of the 16 protein interactions, an accuracy of 87.50%. On the multi-core network of the Ras-Raf-Mek-Erk-Elk-Srf pathway, it correctly predicts 174 of the 189 protein interactions, an accuracy of 92.06%. The crossover network of the Wnt-related pathway is essential in signal transduction. Our method finds 91 of its 96 interactions, an accuracy of 94.79%, which is better than existing prediction methods. On these three network structures, other existing methods achieve accuracies of 81.25%, 90.00%, and 76.04%, respectively, so our method is more accurate than existing methods. Figs. 5, 6, and 7 show the results of our method when discovering and predicting protein-protein interactions on the single-core, multi-core, and crossover networks. In the figures, blue lines indicate correct predictions and red lines indicate incorrect predictions. These three figures show that our method is highly reliable and practical.

In proteomics, the greatest difficulty in protein-protein interaction prediction is that the available information is not explicit enough: the useful information is hidden in overly simple sequence information. Accurate interaction prediction cannot rely on the raw sequence information alone. It requires information extraction methods that derive richer interaction information and physicochemical property information from the underlying sequence. The main contribution of the invention is a general feature extraction method that extracts useful information from sequence information for the protein-protein interaction prediction problem.

The basic idea of the invention is to extract and fuse multiple types of attribute information and to predict interactions with an effective classifier. The invention first computes the frequencies with which each amino acid, and the 2-tuples and 3-tuples it forms, occur in the sequence. On the basis of these frequencies it then derives the multivariate mutual information, which uncovers the relationships among different amino acids and their tuples from plain sequence data. The invention also fully considers the influence of amino acid physicochemical properties on interactions, and extracts residue binding energy information from the sequence to further improve prediction accuracy.

The invention mainly comprises the following steps. The multivariate mutual information in the amino acid sequences is computed, giving 238 mutual information feature values. The Moreau-Broto autocorrelation feature values are computed and the frequencies of the 20 amino acids in the sequence are counted, giving 400 feature values. The amino acid contact matrix is computed from residue pairing frequencies, and the substitution matrix is then computed from it. Processing the substitution matrix with the histogram of oriented gradients gives 162 feature values, and performing singular value decomposition on the substitution matrix gives 800 feature values. The resulting 1600 feature values are classified with a random forest classifier to decide whether the two proteins interact.
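The construction of the substitution matrix and the singular value decomposition step can be sketched with NumPy as follows. This is an illustrative sketch only. The real contact matrix AAC is derived from a protein complex database and its formula is not reproduced in this text, so a random symmetric 20 × 20 matrix is used as a stand-in; the amino acid ordering and the function names are also assumptions.

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"   # ASSUMPTION: arbitrary fixed ordering
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def substitution_matrix(seq, aac):
    """SMR[i, l] = AAC[i, A_l]: column l is the AAC column of residue A_l."""
    cols = [INDEX[aa] for aa in seq]
    return aac[:, cols]                  # shape (20, L)


def svd_features(smr):
    """Flatten the 20 right singular vectors of the transposed SMR (L x 20)."""
    _, _, vt = np.linalg.svd(smr.T, full_matrices=True)
    return vt.ravel()                    # 20 * 20 = 400 values


# ASSUMPTION: a random symmetric matrix stands in for the real contact matrix.
rng = np.random.default_rng(0)
a = rng.random((20, 20))
toy_aac = (a + a.T) / 2

smr = substitution_matrix("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", toy_aac)
features = svd_features(smr)
```

Each sequence gives 20 right singular vectors of length 20, that is 400 values, so a protein pair gives the 800 values stated above. Singular vectors are only defined up to sign, so a practical implementation would likely need a sign convention to make the features reproducible.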

The calculation process of the invention is simple and easy to implement, and its requirements on hardware and computing resources are relatively low, so it is widely applicable. Our method can be implemented in C++ and MATLAB. On an ordinary computer with a 2.5 GHz 6-core CPU and 32 GB of memory, prediction for thousands of samples completes in a very short time. To balance performance and efficiency, the number of decision trees in the random forest classifier and the number of features used by each tree are set to 500 and 400, respectively. Adjusting these parameters can also speed up classification so that prediction runs faster.
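The classifier settings above might look as follows in scikit-learn. This is a sketch under assumptions, not the authors' code: the text states that the method is implemented in C++ and MATLAB, so scikit-learn is swapped in here purely for illustration, the data are synthetic, and scikit-learn's `max_features` limits the features considered at each split, which only approximates "features used by each tree".

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 40 protein pairs x 1600 features, binary labels.
rng = np.random.default_rng(0)
X = rng.random((40, 1600))
y = np.array([0, 1] * 20)

clf = RandomForestClassifier(
    n_estimators=500,   # number of decision trees, as stated in the text
    max_features=400,   # features considered per split (scikit-learn semantics)
    random_state=0,
)
clf.fit(X, y)

pred = clf.predict(X[:5])                # 1 = interacting, 0 = non-interacting
importances = clf.feature_importances_   # one importance score per feature
```

Lowering `n_estimators` or `max_features` reduces training and prediction time, which is the speed and accuracy trade-off the paragraph above refers to.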

Claims

1. A method for predicting protein-protein interactions using multivariate mutual information and residue binding energy, characterized in that the steps are as follows:

Step (1): Amino acid grouping. The 20 standard amino acids are assigned to n functional groups according to dipole and volume. The n groups are denoted C₀, C₁, C₂, ..., C_n, and the original amino acid sequence is converted into a group-category sequence by replacing each amino acid with the category of its functional group;

Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple features are "C₀C₀C₀", "C₀C₀C₁", ..., "C_nC_nC_n"; the 2-tuple features are "C₀C₀", "C₀C₁", ..., "C_nC_n";

Step (3): Count the number of occurrences of the 3-tuple and 2-tuple features in the group-category sequence and build the feature frequency table. The frequency function f(a) = (n_a + 1)/(L + 1) is used to compute the frequency with which each of the n categories occurs in the sequence;

$$I(abc) = I(ab) + f(a \mid c)\,\ln f(a \mid c) - f(a \mid bc)\,\ln f(a \mid bc)$$

where f(a|c) is the frequency with which category a occurs in the 2-tuples that contain category c, and f(a|bc) is the frequency with which category a occurs in the 3-tuples that contain categories b and c;

Step (7): By statistically analyzing a protein complex database, compute the amino acid contact matrix AAC from residue pairing frequencies:

Compute the substitution matrix SMR, $\mathrm{SMR}_{i,l} = \mathrm{AAC}(i, A_l)$, where i = 1, ..., 20 indexes the 20 amino acid types, l = 1, ..., L indexes the L positions of the given protein sequence, and A_l is the amino acid type at position l. This step yields a 20 × L substitution matrix SMR;

Step (9): Perform singular value decomposition on the transpose of the SMR matrix. The decomposition yields 20 right singular vectors;

Step (10): Input the feature values obtained in steps (1) to (9) into a random forest model for prediction, thereby obtaining the interaction between the two proteins.

2. The method for predicting protein-protein interactions using multivariate mutual information and residue binding energy according to claim 1, characterized in that the specific calculation steps of step (6) are as follows:

where lag is the distance between residues, p denotes the p-th physicochemical property of the natural amino acids, l is the position in the sequence, l = 1, 2, ..., L − lag, and lag = 1, 2, ..., lg. Using six physicochemical properties yields lg × 6 feature values.

Step (6.2): Normalize the lg × 6 feature values obtained;

Step (6.3): Count the frequency with which each of the 20 amino acids occurs in the sequence.

3. The method for predicting protein-protein interactions using multivariate mutual information and residue binding energy according to claim 1, characterized in that the specific calculation process of step (8) is as follows:

Step (8.2): Compute the gradient magnitude.

Step (8.3): Compute the gradient direction.

Step (8.5): Compute the histogram over gradient directions. The size of the histogram bin for each gradient direction is used as one feature value.