CN102043910A

CN102043910A - Remote protein homology detection and fold recognition method based on Top-n-gram

Info

Publication number: CN102043910A
Application number: CN 201010600321
Authority: CN
Inventors: 林磊; 刘滨; 孙承杰; 王晓龙; 刘秉权; 刘远超
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2010-12-22
Filing date: 2010-12-22
Publication date: 2011-05-04
Anticipated expiration: 2030-12-22
Also published as: CN102043910B

Abstract

The invention discloses a Top-n-gram-based method for remote protein homology detection and fold recognition. The method addresses two problems of existing remote protein homology detection and fold recognition methods: a binary spectrum offers no way to find an optimal threshold, and it cannot distinguish differences in the frequency of occurrence of amino acids. The method comprises the following steps: 1, running PSI-BLAST with a test protein sequence as input to perform multiple sequence alignment, and calculating the pseudocount of amino acid i; 2, generating a frequency spectrum; 3, converting the frequency spectrum into Top-n-grams; 4, obtaining the latent semantic representation vector corresponding to the test protein sequence; 5, inputting that latent semantic representation vector into an SVM classifier for classification to obtain a prediction result. The method is used in the field of protein homology detection and fold recognition.

Description

A Top-n-gram-based method for remote protein homology detection and fold recognition

Technical field

The present invention relates to a method for remote protein homology detection and fold recognition.

Background technology

At present, methods for remote protein homology detection, in China and abroad, fall roughly into the following types: dynamic programming algorithms, generative models, and discriminative models. Discriminative models give the best prediction performance in this field, and among them methods based on the support vector machine (SVM) are currently the most widely used. An effective way to improve the prediction performance of SVM-based methods is to find an appropriate protein representation and then vectorize the protein sequence accordingly.

The multiple sequence alignments output by PSI-BLAST (Position-Specific Iterated BLAST) contain a large amount of evolutionary information. Because a frequency spectrum contains more information than the protein sequence itself, using the evolutionary information in the frequency spectrum is important for improving the prediction performance of remote protein homology detection and fold recognition. Researchers have previously proposed a feature vector based on the binary spectrum; that method converts the frequency spectrum into a binary spectrum by means of a frequency threshold. Amino acids whose frequency is greater than the threshold are represented by 1, and amino acids whose frequency is less than the threshold are represented by 0. The binary spectrum is a kind of building block of proteins and has been used to solve several biological problems, for example protein domain boundary prediction, the design of mean force potentials, and protein interaction site prediction. Although binary-spectrum-based methods have been successful, the binary spectrum has some shortcomings. First, the frequency threshold used to convert the frequency spectrum into a binary spectrum is chosen empirically; no systematic method exists to optimize it, so there is no guarantee that the optimal threshold is found. Second, the binary spectrum cannot distinguish differences in the frequency of occurrence of amino acids: every amino acid whose frequency exceeds the threshold is represented by 1, which ignores the fact that these amino acids occur with different frequencies and have different importance during evolution.

Summary of the invention

To solve the problem that, in existing remote protein homology detection and fold recognition methods, the binary spectrum offers no way to find an optimal threshold and cannot distinguish differences in the frequency of occurrence of amino acids, the present invention provides a Top-n-gram-based method for remote protein homology detection and fold recognition. The concrete steps of the method are:

Step 1: run PSI-BLAST with the test protein sequence as input to perform multiple sequence alignment, and calculate the pseudocount g_{i} of amino acid i:

g_{i} = Σ_{j = 1}^{20} f_{j} * (q_{ij} / p_{j})

where f_{j} is the observed frequency of amino acid j, p_{j} is the background frequency of amino acid j, and q_{ij} is the score of the substitution matrix entry for amino acid i and amino acid j;

Step 2: generate the frequency spectrum from the pseudocount of amino acid i;

Step 3: convert the frequency spectrum into Top-n-grams;

Step 4: by counting the number of occurrences of each kind of Top-n-gram, convert the test protein sequence into a fixed-length vector, then build the word-document matrix W;

Step 5: perform singular value decomposition on the generated word-document matrix W to obtain the latent semantic representation vector corresponding to the test protein sequence;

Step 6: input the latent semantic representation vector corresponding to the test protein sequence into the SVM classifier for classification; the SVM classifier assigns a score to the test protein sequence, and a test protein sequence whose score is greater than 0 is predicted to have the homology or fold, which yields the prediction result.
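The pseudocount formula of step 1 can be illustrated with a minimal Python sketch. It is not the PSI-BLAST implementation: a toy 3-letter alphabet stands in for the 20 standard amino acids, and the function name and all numeric values are hypothetical illustration values.

```python
# Sketch of the step 1 pseudocount: g_i = sum_j f_j * (q_ij / p_j).
# A 3-letter toy alphabet replaces the 20 standard amino acids;
# every number below is a made-up illustration value.

def pseudocounts(f, p, q):
    """Return [g_0, ..., g_{k-1}] for one alignment column.

    f[j]    observed frequency of amino acid j in the column
    p[j]    background frequency of amino acid j
    q[i][j] substitution-matrix term for amino acids i and j
    """
    k = len(f)
    return [sum(f[j] * q[i][j] / p[j] for j in range(k)) for i in range(k)]


f = [0.5, 0.5, 0.0]            # hypothetical observed frequencies
p = [0.5, 0.25, 0.25]          # hypothetical background frequencies
q = [[0.30, 0.10, 0.10],       # hypothetical substitution-matrix terms
     [0.10, 0.15, 0.00],
     [0.10, 0.00, 0.15]]

g = pseudocounts(f, p, q)
print(g)
```

With the real data, the same function would be called once per alignment column with 20-element inputs.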

The method for generating the frequency spectrum in step 2 is:

Calculate the target frequency Q_{i} of the 20 standard amino acids at each amino acid position of the test protein sequence:

Q_{i} = (α f_{i} + β g_{i}) / (α + β)

where β is a free parameter set to 10, the PSI-BLAST default, and α is the number of amino acid types occurring in a given column of the multiple sequence alignment, minus 1;

The frequency spectrum is represented as a matrix M of dimension L × N, where L is the length of the protein sequence and N is the constant 20, i.e. the number of standard amino acids; the elements of M are the target frequencies Q_{i}.
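As a worked illustration of the target frequency formula, the following Python sketch computes one row of the frequency spectrum. It again uses a toy 3-letter alphabet; the function names, the example column, and the values of f and g are assumptions made for illustration, and the observed frequencies are unweighted for simplicity.

```python
# Sketch of step 2: Q_i = (alpha * f_i + beta * g_i) / (alpha + beta).
BETA = 10  # PSI-BLAST default quoted in the text


def column_alpha(column):
    """Number of distinct amino acid types in an alignment column, minus 1."""
    return len(set(column)) - 1


def target_frequencies(f, g, alpha, beta=BETA):
    """Blend observed frequencies f with pseudocounts g for one position."""
    return [(alpha * fi + beta * gi) / (alpha + beta) for fi, gi in zip(f, g)]


column = "AAGA"                 # hypothetical alignment column, no gaps
alpha = column_alpha(column)    # two types (A, G) minus 1
f = [0.75, 0.25, 0.0]           # unweighted frequencies of A, G, S in the column
g = [0.5, 0.4, 0.1]             # hypothetical pseudocounts for A, G, S

Q = target_frequencies(f, g, alpha)
spectrum = [Q]                  # one row per sequence position (L x N overall)
print(alpha, Q)
```

When a column contains a single amino acid type, α is 0 and the row reduces to the pseudocounts alone.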

The method for converting the frequency spectrum into Top-n-grams in step 3 is:

In each row of the frequency spectrum, the 20 standard amino acids are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the position of an amino acid distinguishes its frequency rank. This yields L Top-n-grams in total, where n is an integer greater than or equal to 1 and less than or equal to 5.
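The conversion described above can be sketched in a few lines of Python. The alphabet ordering, the helper `make_row`, and the tie-breaking rule (ties keep alphabet order) are assumptions of this sketch; the text does not specify how ties are resolved.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, one-letter codes


def top_n_grams(spectrum, n):
    """Convert an L x 20 frequency spectrum into L Top-n-grams."""
    if not 1 <= n <= len(AMINO_ACIDS):
        raise ValueError("n must lie between 1 and 20 (the text recommends 1 to 5)")
    grams = []
    for row in spectrum:
        # Rank amino acid indices by descending target frequency.
        # sorted() is stable, so ties keep alphabet order (an assumption).
        ranked = sorted(range(len(AMINO_ACIDS)), key=lambda i: -row[i])
        grams.append("".join(AMINO_ACIDS[i] for i in ranked[:n]))
    return grams


def make_row(freqs):
    """Hypothetical helper: build a 20-element row from a sparse dict."""
    return [freqs.get(a, 0.0) for a in AMINO_ACIDS]


spectrum = [
    make_row({"A": 0.5, "G": 0.3, "S": 0.2}),
    make_row({"L": 0.6, "V": 0.25, "I": 0.15}),
]
print(top_n_grams(spectrum, 3))
```

Each sequence position contributes exactly one Top-n-gram, so a protein of length L yields L of them.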

In the word-document matrix W of step 4, each word corresponds to a Top-n-gram and each document corresponds to a test protein sequence.
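A minimal sketch of how Top-n-gram counts might be arranged into a word-document matrix follows. The base-20 indexing of Top-n-grams, the function names, and the choice of rows for words and columns for documents are assumptions of this sketch; the vector length of 20ⁿ is the dimension the patent states for step 4.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gram_index(gram):
    """Map a Top-n-gram to an index in [0, 20**n) by reading it in base 20."""
    index = 0
    for letter in gram:
        index = index * 20 + AMINO_ACIDS.index(letter)
    return index


def count_vector(grams, n):
    """Fixed-length vector of Top-n-gram occurrence counts for one protein."""
    vec = [0] * (20 ** n)
    for gram in grams:
        vec[gram_index(gram)] += 1
    return vec


def word_document_matrix(docs, n):
    """Rows are Top-n-grams (words), columns are protein sequences (documents)."""
    columns = [count_vector(grams, n) for grams in docs]
    return [list(row) for row in zip(*columns)]


# Two hypothetical proteins, already converted to Top-2-grams.
docs = [["AG", "AG", "LV"], ["LV", "GA"]]
W = word_document_matrix(docs, 2)
print(len(W), len(W[0]))
```

Because position inside a Top-n-gram encodes frequency rank, "AG" and "GA" are counted as different words.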

The method for performing singular value decomposition on the generated word-document matrix W in step 5 is: decompose the word-document matrix W into three matrices:

W = U S V^{T}

where U is the left singular matrix of dimension M × K; S is the diagonal matrix of dimension K × K, whose diagonal elements are the singular values of W and satisfy s_{1} ≥ s_{2} ≥ ... ≥ s_{K} > 0; and V is the right singular matrix of dimension N × K. Here M × N is the dimension of W, so M and N are distinct from the spectrum matrix M and the constant N = 20 defined above. Keeping only the first R singular values reduces dimensionality and removes noise; after reduction, the dimensions of U, S and V are M × R, R × R and N × R respectively, and the value of R is 300.
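The truncation described above can be sketched with NumPy's `numpy.linalg.svd`. The function names and the random toy matrix are hypothetical, and representing each document by a row of V S is one common latent semantic analysis convention assumed here; the patent does not spell out how the vector is read off. Note also that NumPy returns min(M, N) singular values, some of which may be zero, whereas the text counts only the K positive ones.

```python
import numpy as np


def lsa_reduce(W, R):
    """Truncated SVD of a word-document matrix W (M x N), keeping R values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s is sorted descending
    R = min(R, len(s))              # cannot keep more values than exist
    U_r = U[:, :R]                  # M x R
    S_r = np.diag(s[:R])            # R x R
    V_r = Vt[:R, :].T               # N x R
    return U_r, S_r, V_r


def document_vectors(S_r, V_r):
    """One R-dimensional latent vector per document (rows of V_r S_r)."""
    return V_r @ S_r


rng = np.random.default_rng(0)
W = rng.integers(0, 4, size=(8, 5)).astype(float)  # toy 8 words x 5 documents

U_r, S_r, V_r = lsa_reduce(W, 3)
print(U_r.shape, S_r.shape, V_r.shape)
```

With R = 300 as in the text, the same call would be `lsa_reduce(W, 300)` on the full matrix.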

The SVM classifier of step 6 is obtained by the following training method:

In the training method, a plurality of training protein sequences serve as training samples, and each training protein sequence is processed as follows:

Step A: run PSI-BLAST with the training protein sequence as input to perform multiple sequence alignment, and calculate the pseudocount g_{i} of amino acid i:

g_{i} = Σ_{j = 1}^{20} f_{j} * (q_{ij} / p_{j})

Step B: generate the frequency spectrum from the pseudocount of amino acid i:

Calculate the target frequency Q_{i} of the 20 standard amino acids at each amino acid position of the training protein sequence:

Q_{i} = (α f_{i} + β g_{i}) / (α + β)

Step C: convert the frequency spectrum into Top-n-grams:

In each row of the frequency spectrum, the 20 standard amino acids are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the position of an amino acid distinguishes its frequency rank. This yields L Top-n-grams in total, where n is an integer greater than or equal to 1 and less than or equal to 5;

Step D: by counting the number of occurrences of each kind of Top-n-gram, convert the training protein sequence into a fixed-length vector, then build the word-document matrix W;

Step E: perform singular value decomposition on the generated word-document matrix W to obtain the latent semantic representation vector corresponding to the training protein sequence;

Step F: train the SVM classifier on the latent semantic representation vectors corresponding to the training protein sequences.

The present invention uses the Gist SVM toolkit, which is commonly used in the field of remote homology detection and fold recognition, as the implementation of the SVM algorithm. Except that the kernel function is the Gaussian kernel, all other parameters take the default values of the Gist toolkit.
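To make the scoring rule concrete, the following Python sketch shows a generic kernel SVM decision function with a Gaussian kernel and the "score greater than 0" rule of step 6. It is not Gist code and does not reproduce Gist's defaults: the kernel width `gamma`, the support vectors, the coefficients and the bias are all hypothetical values standing in for a trained model.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma is a hypothetical width."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def svm_score(x, support_vectors, coefficients, bias, gamma=1.0):
    """Score = bias + sum_k c_k * K(sv_k, x), where c_k = alpha_k * y_k."""
    return bias + sum(c * gaussian_kernel(sv, x, gamma)
                      for sv, c in zip(support_vectors, coefficients))


def predict(x, support_vectors, coefficients, bias, gamma=1.0):
    """A score above 0 is read as 'has the homology or fold'."""
    return svm_score(x, support_vectors, coefficients, bias, gamma) > 0


# Hypothetical trained model over 2-dimensional latent vectors.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
coefficients = [1.0, -1.0]      # positive example near origin, negative at (2, 2)
bias = 0.0

print(predict([0.1, 0.0], support_vectors, coefficients, bias))
print(predict([2.0, 2.0], support_vectors, coefficients, bias))
```

In the method itself, x would be the 300-dimensional latent semantic representation vector of a protein sequence.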

In the method of the invention, the process of converting the frequency spectrum of a test protein sequence into a fixed-length vector is shown in Figs. 2 to 5: Fig. 2 is the frequency spectrum of the test protein sequence; Fig. 3 is the frequency spectrum obtained by sorting the frequency spectrum of Fig. 2 in descending order of target frequency; Fig. 4 shows the Top-n-grams obtained from the frequency spectrum of Fig. 3 for n = 3; and Fig. 5 is the fixed-length vector obtained from the Top-n-grams of Fig. 4.

The method of the invention converts a protein sequence into a fixed-length vector by counting the number of occurrences of each kind of Top-n-gram in the protein sequence. A Top-n-gram extracts the evolutionary information in the frequency spectrum by combining the n amino acids with the highest frequencies, and thus emphasizes the importance of the n most frequent amino acids in the frequency spectrum. Compared with the binary spectrum, the Top-n-gram involves no threshold, so no parameter optimization step is needed and overfitting is avoided; it can also distinguish the frequency magnitudes of different amino acid types in the frequency spectrum. The invention applies latent semantic analysis to the resulting word-document matrix to reduce dimensionality and remove noise, which further improves the prediction performance of remote protein homology detection and fold recognition.

Description of the drawings

Fig. 1 is a flowchart of the Top-n-gram-based method for remote protein homology detection and fold recognition described in embodiment one; Fig. 2 is the frequency spectrum of a test protein sequence; Fig. 3 is the frequency spectrum obtained by sorting the frequency spectrum of Fig. 2 in descending order of target frequency; Fig. 4 shows the Top-n-grams obtained from the frequency spectrum of Fig. 3 for n = 3; Fig. 5 is the fixed-length vector obtained from the Top-n-grams of Fig. 4.

Embodiments

The technical solution of the present invention is not limited to the embodiments listed below, and also includes any combination of the embodiments.

Embodiment one: this embodiment is described with reference to Fig. 1. A Top-n-gram-based method for remote protein homology detection and fold recognition, whose concrete steps are:

g_{i} = Σ_{j = 1}^{20} f_{j} * (q_{ij} / p_{j})

Step 3: convert the frequency spectrum into Top-n-grams;

In step 1 of this embodiment, PSI-BLAST is run for 10 iterations, and the non-redundant database searched by PSI-BLAST is the nrdb90 database. The frequency spectrum is calculated from the multiple sequence alignment of sequences with less than 98% sequence similarity, and the weight of each sequence in the multiple sequence alignment is assigned by the position-based sequence weight method.
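The patent names the position-based sequence weight method without giving its formula. The sketch below assumes the widely used scheme of Henikoff and Henikoff, in which each column contributes 1 / (k × n) to a sequence, with k the number of distinct residue types in the column and n the number of sequences sharing that sequence's residue. The function name and toy alignment are hypothetical, and gap handling is ignored.

```python
from collections import Counter


def position_based_weights(alignment):
    """Normalized position-based weights, one per aligned sequence.

    Assumes the Henikoff and Henikoff scheme; gaps are not treated specially.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)                     # distinct residue types in the column
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


# Toy alignment of three sequences and three columns (hypothetical).
alignment = ["AAG", "AAS", "ACG"]
weights = position_based_weights(alignment)
print(weights)
```

Sequences that carry rarer residues receive larger weights, which reduces the influence of over-represented, near-identical sequences on the observed frequencies.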

In step 1 of this embodiment, the background frequency of an amino acid is calculated as the mean, over the protein sequences in the PDB25 database, of the frequency of occurrence of each of the 20 standard amino acids in each sequence; the background frequencies of the 20 standard amino acids are:

The substitution matrix in step 1 of this embodiment is BLOSUM62, the default scoring matrix of PSI-BLAST, i.e. the substitution matrix built from blocks in which sequences sharing 62% or more identical amino acids are grouped:

The dimension of the vector in step 4 of this embodiment is 20ⁿ.

Embodiment two: this embodiment further describes step 2 of the Top-n-gram-based method for remote protein homology detection and fold recognition of embodiment one. The method for generating the frequency spectrum in step 2 is:

Q_{i} = (α f_{i} + β g_{i}) / (α + β)

The target frequency Q_{i} represents the frequency with which a given amino acid occurs at a specific position of the protein sequence during evolution.

Embodiment three: this embodiment further describes step 3 of the Top-n-gram-based method for remote protein homology detection and fold recognition of embodiment one. The method for converting the frequency spectrum into Top-n-grams in step 3 is:

The value of n may be any integer greater than or equal to 1 and less than or equal to 20, but in practice an integer n greater than or equal to 1 and less than or equal to 5 gives the best results.
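The patent does not explain why small n works best. As a purely illustrative aid, the sketch below tabulates the length 20ⁿ of the fixed-length vector for n from 1 to 5, which shows how quickly the feature space grows; the function name is hypothetical.

```python
def vector_dimension(n, alphabet_size=20):
    """Length of the fixed-length count vector for Top-n-grams: 20**n."""
    return alphabet_size ** n


dims = {n: vector_dimension(n) for n in range(1, 6)}
for n, dim in dims.items():
    print(n, dim)
```

The count vector of a protein of length L has at most L non-zero entries, so for larger n the vectors become increasingly sparse.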

Embodiment four: this embodiment further describes step 4 of the Top-n-gram-based method for remote protein homology detection and fold recognition of embodiment one. In the word-document matrix W of step 4, each word corresponds to a Top-n-gram and each document corresponds to a test protein sequence.

Embodiment five: this embodiment further describes step 5 of the Top-n-gram-based method for remote protein homology detection and fold recognition of embodiment one. The word-document matrix W of step 5 can be decomposed into three matrices:

W = U S V^{T}

where U is the left singular matrix of dimension M × K; S is the diagonal matrix of dimension K × K, whose diagonal elements are the singular values of W and satisfy s_{1} ≥ s_{2} ≥ ... ≥ s_{K} > 0; and V is the right singular matrix of dimension N × K. Here M × N is the dimension of W, so M and N are distinct from the spectrum matrix M and the constant N = 20 used for the frequency spectrum. Keeping only the first R singular values reduces dimensionality and removes noise; after reduction, the dimensions of U, S and V are M × R, R × R and N × R respectively, and the value of R is 300.

Embodiment six: this embodiment further describes step 6 of the Top-n-gram-based method for remote protein homology detection and fold recognition of embodiment one. The SVM classifier of step 6 is obtained by the following training method:

g_{i} = Σ_{j = 1}^{20} f_{j} * (q_{ij} / p_{j})

Q_{i} = (α f_{i} + β g_{i}) / (α + β)

Step C: convert the frequency spectrum into Top-n-grams:

Claims

1. A Top-n-gram-based method for remote protein homology detection and fold recognition, characterized in that its concrete steps are:

g_{i} = Σ_{j = 1}^{20} f_{j} * (q_{ij} / p_{j})

Step 3: convert the frequency spectrum into Top-n-grams;

2. The Top-n-gram-based method for remote protein homology detection and fold recognition according to claim 1, characterized in that the method for generating the frequency spectrum in step 2 is:

Calculate the target frequency Q_{i} of the 20 standard amino acids at each amino acid position of the test protein sequence:

Q_{i} = (α f_{i} + β g_{i}) / (α + β)

3. The Top-n-gram-based method for remote protein homology detection and fold recognition according to claim 1, characterized in that the method for converting the frequency spectrum into Top-n-grams in step 3 is:

4. The Top-n-gram-based method for remote protein homology detection and fold recognition according to claim 1, characterized in that, in the word-document matrix W of step 4, each word corresponds to a Top-n-gram and each document corresponds to a test protein sequence.

5. The Top-n-gram-based method for remote protein homology detection and fold recognition according to claim 1, characterized in that the method for performing singular value decomposition on the generated word-document matrix W in step 5 is: decompose the word-document matrix W into three matrices:

W = U S V^{T}

6. The Top-n-gram-based method for remote protein homology detection and fold recognition according to claim 1, characterized in that the SVM classifier of step 6 is obtained by the following training method:

g_{i} = Σ_{j = 1}^{20} f_{j} * (q_{ij} / p_{j})

Q_{i} = (α f_{i} + β g_{i}) / (α + β)

Step C: convert the frequency spectrum into Top-n-grams: