CN102043910A - Remote protein homology detection and fold recognition method based on Top-n-gram - Google Patents


Info

Publication number
CN102043910A
Authority
CN
China
Prior art keywords
amino acid
gram
protein sequence
matrix
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010600321
Other languages
Chinese (zh)
Other versions
CN102043910B (en)
Inventor
林磊
刘滨
孙承杰
王晓龙
刘秉权
刘远超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN 201010600321 priority Critical patent/CN102043910B/en
Publication of CN102043910A publication Critical patent/CN102043910A/en
Application granted granted Critical
Publication of CN102043910B publication Critical patent/CN102043910B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a protein remote homology detection and fold recognition method based on Top-n-gram. The method addresses the problem that, in existing protein remote homology detection and fold recognition methods, the binary spectrum offers no way of finding an optimal threshold and cannot distinguish differences in the occurrence frequencies of amino acids. The method comprises the following steps: 1, running PSI-BLAST with the test protein sequence as input to perform multiple sequence alignment, and computing the pseudocount of amino acid i; 2, generating a frequency spectrum; 3, converting the frequency spectrum into Top-n-grams; 4, obtaining the latent semantic representation vector of the test protein sequence; 5, inputting the latent semantic representation vector of the test protein sequence into an SVM classifier for classification to obtain the prediction result. The method is used in the field of protein remote homology detection and fold recognition.

Description

A protein remote homology detection and fold recognition method based on Top-n-gram
Technical field
The present invention relates to a protein remote homology detection and fold recognition method.
Background art
At present, protein remote homology detection methods in China and abroad fall roughly into the following types: dynamic programming algorithms, generative models, and discriminative models. Discriminative models give the best prediction performance in this field, and among them methods based on the support vector machine (SVM) are currently the most widely used. An effective way to improve the prediction performance of SVM-based methods is to find an appropriate protein representation with which to vectorize the protein sequence.
The multiple sequence alignments output by running PSI-BLAST (Position-Specific Iterated BLAST) contain a large amount of evolutionary information. Because a frequency spectrum contains more information than the protein sequence itself, using the evolutionary information in the frequency spectrum to improve the prediction performance of protein remote homology detection and fold recognition is of great significance. Researchers have previously proposed a feature vector based on the binary spectrum; this method converts the frequency spectrum into a binary spectrum by means of a frequency threshold. An amino acid whose frequency is greater than the threshold is represented by 1, and an amino acid whose frequency is less than the threshold is represented by 0. The binary spectrum is a kind of protein building block and has been used to solve several biological problems, for example protein domain boundary prediction, the design of mean-force potentials, and protein interaction site prediction. Although methods based on the binary spectrum have been successful, the binary spectrum has some shortcomings. First, the frequency threshold used to convert the frequency spectrum into a binary spectrum is chosen empirically, so there is no systematic way to optimize this threshold and no guarantee that the optimal threshold will be found. Second, the binary spectrum cannot distinguish differences in the occurrence frequencies of amino acids: every amino acid whose frequency exceeds the threshold is represented by 1, which ignores the fact that these amino acids have different frequencies and therefore different importance during evolution.
Summary of the invention
To solve the problem that, in existing protein remote homology detection and fold recognition methods, the binary spectrum cannot find an optimal threshold and cannot distinguish differences in the occurrence frequencies of amino acids, the present invention provides a protein remote homology detection and fold recognition method based on Top-n-gram. The specific steps of the method are:
Step 1: Run PSI-BLAST with the test protein sequence as input to perform multiple sequence alignment, and compute the pseudocount $g_i$ of amino acid i:
$$g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$$
where $f_j$ is the observed frequency of amino acid j, $p_j$ is the background frequency of amino acid j, and $q_{ij}$ is the score of the substitution matrix for amino acid i and amino acid j;
Step 2: Generate the frequency spectrum from the pseudocounts of amino acid i;
Step 3: Convert the frequency spectrum into Top-n-grams;
Step 4: Count the number of occurrences of each kind of Top-n-gram to convert the test protein sequence into a fixed-length vector, and then construct the word-document matrix W;
Step 5: Perform singular value decomposition (SVD) on the generated word-document matrix W to obtain the latent semantic representation vector of the test protein sequence;
Step 6: Input the latent semantic representation vector of the test protein sequence into the SVM classifier for classification. The SVM classifier assigns a score to the test protein sequence, and a test protein sequence whose score is greater than 0 is predicted to be homologous or to have the fold, which gives the prediction result.
The method of generating the frequency spectrum in Step 2 is:
Compute the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position of the test protein sequence:
$$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$
where β is a free parameter, set to the PSI-BLAST default value of 10, and α is the number of amino acid types occurring in the given column of the multiple sequence alignment minus 1;
The frequency spectrum is represented as a matrix M of dimension L × N, where L is the length of the protein sequence and N is the constant 20, i.e. the number of standard amino acids; the elements of M are the target frequencies $Q_i$.
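The two formulas above can be sketched in a few lines of Python for a single alignment column. This is only an illustrative sketch: the function names, the three-letter toy alphabet and all numeric values are assumptions made for brevity and do not come from the patent, which uses the 20 standard amino acids and BLOSUM62.

```python
def alpha_for_column(column):
    """Number of distinct amino acid types in an alignment column, minus 1.
    Gap characters ('-') are ignored (an assumption of this sketch)."""
    return len({aa for aa in column if aa != "-"}) - 1


def pseudocounts(f, p, q):
    """g_i = sum_j f_j * (q[i][j] / p[j]) for one alignment column."""
    n = len(f)
    return [sum(f[j] * q[i][j] / p[j] for j in range(n)) for i in range(n)]


def target_frequencies(f, g, alpha, beta=10.0):
    """Q_i = (alpha * f_i + beta * g_i) / (alpha + beta)."""
    return [(alpha * fi + beta * gi) / (alpha + beta) for fi, gi in zip(f, g)]


# Toy data over a hypothetical 3-letter alphabet (illustrative values only).
p = [0.5, 0.3, 0.2]                      # background frequencies
q = [[0.30, 0.10, 0.10],                 # substitution-matrix terms q[i][j]
     [0.10, 0.15, 0.05],
     [0.10, 0.05, 0.05]]
f = [0.5, 0.5, 0.0]                      # observed frequencies in the column
alpha = alpha_for_column("AACC")         # two distinct types, so alpha = 1

g = pseudocounts(f, p, q)
Q = target_frequencies(f, g, alpha)
print(Q)
```

Repeating the computation for every column of the alignment would give the L rows of the matrix M.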
The method of converting the frequency spectrum into Top-n-grams in Step 3 is:
The 20 standard amino acids in each row of the frequency spectrum are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the different frequencies of the amino acids are distinguished by their positions in the Top-n-gram. L Top-n-grams are obtained in total, where n is an integer greater than or equal to 1 and less than or equal to 5.
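As a minimal sketch of the conversion just described, the following Python function sorts each row of a frequency spectrum and keeps the n most frequent amino acids in order. The alphabet ordering, the tie-breaking rule, the helper `make_row` and the toy frequencies are assumptions of this example rather than details given in the patent.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the spectrum


def top_n_grams(profile, n=2):
    """Return one Top-n-gram per row of the frequency spectrum.

    profile: list of rows, each holding 20 target frequencies in
    AMINO_ACIDS order. The n most frequent amino acids of a row are
    concatenated in descending order of frequency.
    """
    if not 1 <= n <= len(AMINO_ACIDS):
        raise ValueError("n must be between 1 and 20")
    grams = []
    for row in profile:
        # Ties are broken by alphabet position (an assumption of this sketch).
        order = sorted(range(len(row)), key=lambda k: (-row[k], k))
        grams.append("".join(AMINO_ACIDS[k] for k in order[:n]))
    return grams


def make_row(**freqs):
    """Build a 20-entry row from keyword arguments such as A=0.5."""
    return [freqs.get(aa, 0.0) for aa in AMINO_ACIDS]


# Toy spectrum with L = 2 positions (illustrative values only).
profile = [make_row(A=0.5, L=0.3, V=0.2), make_row(G=0.6, S=0.3, A=0.1)]
grams = top_n_grams(profile, n=3)
print(grams)
```

Because position inside the Top-n-gram encodes rank, "ALV" and "LAV" are different words even though they contain the same amino acids.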
In the word-document matrix W of Step 4, a word corresponds to a Top-n-gram and a document corresponds to a test protein sequence.
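One way to picture the word-document matrix is the sketch below, which assumes the common convention that rows are words and columns are documents; the patent does not state the orientation explicitly. The function name and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def word_document_matrix(documents, n):
    """Build W with W[w][d] = number of times word w occurs in document d.

    documents: list of lists of Top-n-gram strings, one list per protein.
    The vocabulary holds all 20**n strings of length n, matching the
    fixed vector length; strings with a repeated letter can never occur
    as a Top-n-gram, so their rows simply stay zero.
    """
    vocabulary = ["".join(t) for t in product(AMINO_ACIDS, repeat=n)]
    row_of = {word: r for r, word in enumerate(vocabulary)}
    W = [[0] * len(documents) for _ in vocabulary]
    for d, grams in enumerate(documents):
        for word, count in Counter(grams).items():
            W[row_of[word]][d] = count
    return vocabulary, W


# Two toy proteins already converted to Top-2-grams (illustrative only).
documents = [["AL", "GS", "AL"], ["GS", "VA"]]
vocabulary, W = word_document_matrix(documents, n=2)
print(len(vocabulary), W[vocabulary.index("AL")])
```

Each column of W is the fixed-length count vector of one protein sequence.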
The method of performing singular value decomposition on the generated word-document matrix W in Step 5 is: the word-document matrix W is decomposed into three matrices:
$$W = U S V^{T}$$
where U is the left singular matrix of dimension M × K, S is the diagonal matrix of dimension K × K whose diagonal elements are the singular values of W and satisfy $s_1 \ge s_2 \ge \dots \ge s_K > 0$, and V is the right singular matrix of dimension N × K. Retaining the first R singular values reduces the dimension and removes noise; after dimension reduction, the dimensions of U, S and V are M × R, R × R and N × R respectively, and the value of R is 300.
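Written out, the truncation keeps only the leading R columns of U and V and the leading R singular values. The block below restates this in the notation of the decomposition above; the subscript R, the hat notation and the symbol $v_j$ are introduced here for illustration. The last line assumes the usual latent semantic analysis convention, in which W has one row per Top-n-gram (M rows) and one column per protein sequence (N columns); the patent text does not spell out how the representation vector is read off.

```latex
\begin{aligned}
W &\approx \hat{W} = U_R \, S_R \, V_R^{T},
  \qquad U_R \in \mathbb{R}^{M \times R},\quad
  S_R = \mathrm{diag}(s_1, \dots, s_R),\quad
  V_R \in \mathbb{R}^{N \times R},\\
\hat{d}_j &= v_j \, S_R
  \qquad \text{($v_j$ is the $j$-th row of $V_R$; $\hat{d}_j$ is the $R$-dimensional vector of protein $j$).}
\end{aligned}
```

With R = 300, each protein sequence is then described by 300 numbers instead of $20^n$ counts.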
The SVM classifier of Step 6 is obtained by the following training method:
In the training method, a plurality of training protein sequences are used as training samples, and the following training is carried out on each training protein sequence:
Step A: Run PSI-BLAST with the training protein sequence as input to perform multiple sequence alignment, and compute the pseudocount $g_i$ of amino acid i:
$$g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$$
where $f_j$ is the observed frequency of amino acid j, $p_j$ is the background frequency of amino acid j, and $q_{ij}$ is the score of the substitution matrix for amino acid i and amino acid j;
Step B: Generate the frequency spectrum from the pseudocounts of amino acid i:
Compute the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position of the training protein sequence:
$$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$
where β is a free parameter, set to the PSI-BLAST default value of 10, and α is the number of amino acid types occurring in the given column of the multiple sequence alignment minus 1;
The frequency spectrum is represented as a matrix M of dimension L × N, where L is the length of the protein sequence and N is the constant 20, i.e. the number of standard amino acids; the elements of M are the target frequencies $Q_i$;
Step C: Convert the frequency spectrum into Top-n-grams:
The 20 standard amino acids in each row of the frequency spectrum are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the different frequencies of the amino acids are distinguished by their positions in the Top-n-gram. L Top-n-grams are obtained in total, where n is an integer greater than or equal to 1 and less than or equal to 5;
Step D: Count the number of occurrences of each kind of Top-n-gram to convert the training protein sequence into a fixed-length vector, and then construct the word-document matrix W;
Step E: Perform singular value decomposition on the generated word-document matrix W to obtain the latent semantic representation vector of the training protein sequence;
Step F: Train the SVM classifier using the latent semantic representation vectors of the training protein sequences.
The present invention uses the Gist SVM toolkit, which is commonly used in the field of remote homology detection and fold recognition, as the implementation of the SVM algorithm. The kernel function is the Gaussian kernel; all other parameters take the default values of the Gist toolkit.
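To illustrate how a Gaussian-kernel SVM turns a latent semantic vector into a score whose sign gives the prediction, here is a generic sketch. It is not the Gist toolkit and does not reproduce its defaults; the function names, the `gamma` parameter, the bias term and the toy support vectors are all assumptions made for this example.

```python
import math


def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_score(x, support_vectors, coefficients, bias=0.0, gamma=1.0):
    """Generic kernel-SVM decision value.

    coefficients[i] stands for alpha_i * y_i, the learned weight of
    support vector i multiplied by its label (+1 or -1).
    """
    return bias + sum(
        c * gaussian_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coefficients)
    )


def is_positive(x, support_vectors, coefficients, bias=0.0, gamma=1.0):
    """A sequence is predicted positive when its score is greater than 0."""
    return svm_score(x, support_vectors, coefficients, bias, gamma) > 0


# Hypothetical trained model with one positive and one negative support vector.
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
coefficients = [1.0, -1.0]

print(is_positive((0.1, 0.0), support_vectors, coefficients))
print(is_positive((1.9, 0.0), support_vectors, coefficients))
```

In the method described here, the input vectors would be the 300-dimensional latent semantic representations rather than the two-dimensional toy points.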
In the method of the present invention, the process of converting the frequency spectrum of the test protein sequence into a fixed-length vector is shown in Fig. 2 to Fig. 5. Fig. 2 is the frequency spectrum of the test protein sequence; Fig. 3 is the frequency spectrum obtained by sorting the frequency spectrum of Fig. 2 in descending order of target frequency; Fig. 4 shows the Top-n-grams obtained from the frequency spectrum of Fig. 3 for n = 3; Fig. 5 is the fixed-length vector obtained from the Top-n-grams of Fig. 4.
The method of the present invention converts a protein sequence into a fixed-length vector by counting the number of occurrences of each kind of Top-n-gram in the protein sequence. A Top-n-gram extracts the evolutionary information in the frequency spectrum by combining the n amino acids with the highest frequencies, and thus emphasizes the importance of the n most frequent amino acids in the frequency spectrum. Compared with the binary spectrum, the Top-n-gram involves no threshold, so no parameter optimization step is needed, which avoids overfitting; it can also distinguish the frequencies of the different amino acid types in the frequency spectrum. The present invention applies latent semantic analysis to the resulting word-document matrix to reduce its dimension and remove noise, thereby improving the prediction performance of protein remote homology detection and fold recognition.
Description of drawings
Fig. 1 is a flowchart of the protein remote homology detection and fold recognition method based on Top-n-gram described in Embodiment one; Fig. 2 is the frequency spectrum of the test protein sequence; Fig. 3 is the frequency spectrum obtained by sorting the frequency spectrum of Fig. 2 in descending order of target frequency; Fig. 4 shows the Top-n-grams obtained from the frequency spectrum of Fig. 3 for n = 3; Fig. 5 is the fixed-length vector obtained from the Top-n-grams of Fig. 4.
Embodiments
The technical solution of the present invention is not limited to the embodiments listed below, and also includes any combination of the embodiments.
Embodiment one: This embodiment is described with reference to Fig. 1. A protein remote homology detection and fold recognition method based on Top-n-gram, whose specific steps are:
Step 1: Run PSI-BLAST with the test protein sequence as input to perform multiple sequence alignment, and compute the pseudocount $g_i$ of amino acid i:
$$g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$$
where $f_j$ is the observed frequency of amino acid j, $p_j$ is the background frequency of amino acid j, and $q_{ij}$ is the score of the substitution matrix for amino acid i and amino acid j;
Step 2: Generate the frequency spectrum from the pseudocounts of amino acid i;
Step 3: Convert the frequency spectrum into Top-n-grams;
Step 4: Count the number of occurrences of each kind of Top-n-gram to convert the test protein sequence into a fixed-length vector, and then construct the word-document matrix W;
Step 5: Perform singular value decomposition (SVD) on the generated word-document matrix W to obtain the latent semantic representation vector of the test protein sequence;
Step 6: Input the latent semantic representation vector of the test protein sequence into the SVM classifier for classification. The SVM classifier assigns a score to the test protein sequence, and a test protein sequence whose score is greater than 0 is predicted to be homologous or to have the fold, which gives the prediction result.
In Step 1 of this embodiment, PSI-BLAST is run for 10 iterations, and the non-redundant database searched by PSI-BLAST is the nrdb90 database. The frequency spectrum is computed from the multiple sequence alignment of sequences whose sequence similarity is below 98%, and the weight of each sequence in the multiple sequence alignment is assigned by the position-based sequence weighting method.
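The text names a position-based sequence weighting method without giving its formula. The sketch below assumes that this refers to the widely used Henikoff and Henikoff scheme, in which each alignment column shares one unit of weight among the sequences so that rarer residues receive more. The function names and the ungapped toy alignment are illustrative assumptions.

```python
from collections import Counter


def position_based_weights(alignment):
    """Henikoff-style position-based sequence weights, normalized to sum to 1.

    alignment: list of equal-length strings (gaps are not treated
    specially in this sketch). In each column, a sequence whose residue
    occurs c times among k distinct residues receives 1 / (k * c).
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def weighted_frequencies(alignment, weights, position):
    """Weighted observed frequency of each residue at one alignment column."""
    freqs = Counter()
    for sequence, weight in zip(alignment, weights):
        freqs[sequence[position]] += weight
    return dict(freqs)


# Toy alignment of three short sequences (illustrative only).
alignment = ["ACD", "ACE", "GCE"]
weights = position_based_weights(alignment)
f0 = weighted_frequencies(alignment, weights, position=0)
print(weights, f0)
```

The weighted frequencies computed this way would play the role of the observed frequencies $f_j$ in the pseudocount formula.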
In Step 1 of this embodiment, the background frequency of an amino acid is computed as the average, over the protein sequences in the PDB25 database, of the frequency with which each of the 20 standard amino acids occurs in each sequence. The background frequencies of the 20 standard amino acids are:
[Figure BDA0000039963110000051: table of the background frequencies of the 20 standard amino acids; image not reproduced]
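The averaging rule for the background frequencies can be sketched as follows. The function name and the toy sequences are assumptions; the actual values in the patent come from the PDB25 database and are given in the figure, which is not reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def background_frequencies(sequences):
    """Average, over sequences, of each amino acid's within-sequence frequency.

    Characters outside the 20 standard amino acids are skipped, and
    sequences with no standard residue are ignored (assumptions of this sketch).
    """
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    used = 0
    for sequence in sequences:
        residues = [aa for aa in sequence if aa in totals]
        if not residues:
            continue
        used += 1
        for aa, count in Counter(residues).items():
            totals[aa] += count / len(residues)
    if used == 0:
        raise ValueError("no usable sequences")
    return {aa: total / used for aa, total in totals.items()}


# Toy database of two sequences (illustrative only).
p = background_frequencies(["AAGG", "AL"])
print(p["A"], p["G"], p["L"])
```

Note that this averages per-sequence frequencies, so a short sequence counts as much as a long one, which differs from pooling all residues of the database together.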
The substitution matrix in Step 1 of this embodiment is BLOSUM62, the default scoring matrix of PSI-BLAST, i.e. the substitution matrix constructed from blocks with more than 62% identical amino acids:
[Figure BDA0000039963110000052: the BLOSUM62 substitution matrix; image not reproduced]
The dimension of the vector in Step 4 of this embodiment is $20^n$.
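The $20^n$ dimension follows from treating a Top-n-gram as an n-digit number in base 20. A small sketch, assuming an alphabetical ordering of the amino acids (the patent does not fix an ordering) and using hypothetical function names:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def vector_dimension(n):
    """Length of the fixed-length vector for Top-n-grams of size n."""
    return len(AMINO_ACIDS) ** n


def gram_index(gram):
    """Position of a Top-n-gram in the vector, reading it as a base-20 number."""
    index = 0
    for aa in gram:
        index = index * len(AMINO_ACIDS) + INDEX[aa]
    return index


for n in (1, 2, 3):
    print(n, vector_dimension(n))
print(gram_index("AL"), gram_index("ALV"))
```

The dimension grows quickly with n (400 for n = 2, 8000 for n = 3), which is one reason a later dimension-reduction step is useful.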
Embodiment two: This embodiment further describes Step 2 of the protein remote homology detection and fold recognition method based on Top-n-gram described in Embodiment one. The method of generating the frequency spectrum in Step 2 is:
Compute the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position of the test protein sequence:
$$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$
where β is a free parameter, set to the PSI-BLAST default value of 10, and α is the number of amino acid types occurring in the given column of the multiple sequence alignment minus 1;
The frequency spectrum is represented as a matrix M of dimension L × N, where L is the length of the protein sequence and N is the constant 20, i.e. the number of standard amino acids; the elements of M are the target frequencies $Q_i$.
The target frequency $Q_i$ represents the frequency with which a given amino acid occurs at a specific position of the protein sequence during evolution.
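As a worked instance of the target-frequency formula, take a hypothetical alignment column in which four different amino acid types occur, so that α = 4 − 1 = 3, with β = 10. The values $f_i = 0.5$ and $g_i = 0.2$ are invented for illustration:

```latex
Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}
    = \frac{3 \times 0.5 + 10 \times 0.2}{3 + 10}
    = \frac{3.5}{13}
    \approx 0.269
```

The more varied the column (larger α), the more weight the observed frequency $f_i$ receives relative to the pseudocount $g_i$.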
Embodiment three: This embodiment further describes Step 3 of the protein remote homology detection and fold recognition method based on Top-n-gram described in Embodiment one. The method of converting the frequency spectrum into Top-n-grams in Step 3 is:
The 20 standard amino acids in each row of the frequency spectrum are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the different frequencies of the amino acids are distinguished by their positions in the Top-n-gram. L Top-n-grams are obtained in total, where n is an integer greater than or equal to 1 and less than or equal to 5.
The value of n may be any integer greater than or equal to 1 and less than or equal to 20, but in practice the best results are obtained when n is an integer from 1 to 5.
Embodiment four: This embodiment further describes Step 4 of the protein remote homology detection and fold recognition method based on Top-n-gram described in Embodiment one. In the word-document matrix W of Step 4, a word corresponds to a Top-n-gram and a document corresponds to a test protein sequence.
Embodiment five: This embodiment further describes Step 5 of the protein remote homology detection and fold recognition method based on Top-n-gram described in Embodiment one. The word-document matrix W of Step 5 can be decomposed into three matrices:
$$W = U S V^{T}$$
where U is the left singular matrix of dimension M × K, S is the diagonal matrix of dimension K × K whose diagonal elements are the singular values of W and satisfy $s_1 \ge s_2 \ge \dots \ge s_K > 0$, and V is the right singular matrix of dimension N × K. Retaining the first R singular values reduces the dimension and removes noise; after dimension reduction, the dimensions of U, S and V are M × R, R × R and N × R respectively, and the value of R is 300.
Embodiment six: This embodiment further describes Step 6 of the protein remote homology detection and fold recognition method based on Top-n-gram described in Embodiment one. The SVM classifier of Step 6 is obtained by the following training method:
In the training method, a plurality of training protein sequences are used as training samples, and the following training is carried out on each training protein sequence:
Step A: Run PSI-BLAST with the training protein sequence as input to perform multiple sequence alignment, and compute the pseudocount $g_i$ of amino acid i:
$$g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$$
where $f_j$ is the observed frequency of amino acid j, $p_j$ is the background frequency of amino acid j, and $q_{ij}$ is the score of the substitution matrix for amino acid i and amino acid j;
Step B: Generate the frequency spectrum from the pseudocounts of amino acid i:
Compute the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position of the training protein sequence:
$$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$
where β is a free parameter, set to the PSI-BLAST default value of 10, and α is the number of amino acid types occurring in the given column of the multiple sequence alignment minus 1;
The frequency spectrum is represented as a matrix M of dimension L × N, where L is the length of the protein sequence and N is the constant 20, i.e. the number of standard amino acids; the elements of M are the target frequencies $Q_i$;
Step C: Convert the frequency spectrum into Top-n-grams:
The 20 standard amino acids in each row of the frequency spectrum are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the different frequencies of the amino acids are distinguished by their positions in the Top-n-gram. L Top-n-grams are obtained in total, where n is an integer greater than or equal to 1 and less than or equal to 5;
Step D: Count the number of occurrences of each kind of Top-n-gram to convert the training protein sequence into a fixed-length vector, and then construct the word-document matrix W;
Step E: Perform singular value decomposition on the generated word-document matrix W to obtain the latent semantic representation vector of the training protein sequence;
Step F: Train the SVM classifier using the latent semantic representation vectors of the training protein sequences.

Claims (6)

1. A protein remote homology detection and fold recognition method based on Top-n-gram, characterized in that its specific steps are:
Step 1: Run PSI-BLAST with the test protein sequence as input to perform multiple sequence alignment, and compute the pseudocount $g_i$ of amino acid i:
$$g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$$
where $f_j$ is the observed frequency of amino acid j, $p_j$ is the background frequency of amino acid j, and $q_{ij}$ is the score of the substitution matrix for amino acid i and amino acid j;
Step 2: Generate the frequency spectrum from the pseudocounts of amino acid i;
Step 3: Convert the frequency spectrum into Top-n-grams;
Step 4: Count the number of occurrences of each kind of Top-n-gram to convert the test protein sequence into a fixed-length vector, and then construct the word-document matrix W;
Step 5: Perform singular value decomposition (SVD) on the generated word-document matrix W to obtain the latent semantic representation vector of the test protein sequence;
Step 6: Input the latent semantic representation vector of the test protein sequence into the SVM classifier for classification. The SVM classifier assigns a score to the test protein sequence, and a test protein sequence whose score is greater than 0 is predicted to be homologous or to have the fold, which gives the prediction result.
2. The protein remote homology detection and fold recognition method based on Top-n-gram according to claim 1, characterized in that the method of generating the frequency spectrum in Step 2 is:
Compute the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position of the test protein sequence:
$$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$
where β is a free parameter, set to the PSI-BLAST default value of 10, and α is the number of amino acid types occurring in the given column of the multiple sequence alignment minus 1;
The frequency spectrum is represented as a matrix M of dimension L × N, where L is the length of the protein sequence and N is the constant 20, i.e. the number of standard amino acids; the elements of M are the target frequencies $Q_i$.
3. The protein remote homology detection and fold recognition method based on Top-n-gram according to claim 1, characterized in that the method of converting the frequency spectrum into Top-n-grams in Step 3 is:
The 20 standard amino acids in each row of the frequency spectrum are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the different frequencies of the amino acids are distinguished by their positions in the Top-n-gram. L Top-n-grams are obtained in total, where n is an integer greater than or equal to 1 and less than or equal to 5.
4. The protein remote homology detection and fold recognition method based on Top-n-gram according to claim 1, characterized in that, in the word-document matrix W of Step 4, a word corresponds to a Top-n-gram and a document corresponds to a test protein sequence.
5. The protein remote homology detection and fold recognition method based on Top-n-gram according to claim 1, characterized in that the method of performing singular value decomposition on the generated word-document matrix W in Step 5 is: the word-document matrix W is decomposed into three matrices:
$$W = U S V^{T}$$
where U is the left singular matrix of dimension M × K, S is the diagonal matrix of dimension K × K whose diagonal elements are the singular values of W and satisfy $s_1 \ge s_2 \ge \dots \ge s_K > 0$, and V is the right singular matrix of dimension N × K. Retaining the first R singular values reduces the dimension and removes noise; after dimension reduction, the dimensions of U, S and V are M × R, R × R and N × R respectively, and the value of R is 300.
6. The protein remote homology detection and fold recognition method based on Top-n-gram according to claim 1, characterized in that the SVM classifier of Step 6 is obtained by the following training method:
In the training method, a plurality of training protein sequences are used as training samples, and the following training is carried out on each training protein sequence:
Step A: Run PSI-BLAST with the training protein sequence as input to perform multiple sequence alignment, and compute the pseudocount $g_i$ of amino acid i:
$$g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$$
where $f_j$ is the observed frequency of amino acid j, $p_j$ is the background frequency of amino acid j, and $q_{ij}$ is the score of the substitution matrix for amino acid i and amino acid j;
Step B: Generate the frequency spectrum from the pseudocounts of amino acid i:
Compute the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position of the training protein sequence:
$$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$
where β is a free parameter, set to the PSI-BLAST default value of 10, and α is the number of amino acid types occurring in the given column of the multiple sequence alignment minus 1;
The frequency spectrum is represented as a matrix M of dimension L × N, where L is the length of the protein sequence and N is the constant 20, i.e. the number of standard amino acids; the elements of M are the target frequencies $Q_i$;
Step C: Convert the frequency spectrum into Top-n-grams:
The 20 standard amino acids in each row of the frequency spectrum are sorted in descending order of target frequency, and the n amino acids with the highest target frequencies are then combined, in order of frequency, into one Top-n-gram. Within each Top-n-gram, the different frequencies of the amino acids are distinguished by their positions in the Top-n-gram. L Top-n-grams are obtained in total, where n is an integer greater than or equal to 1 and less than or equal to 5;
Step D: Count the number of occurrences of each kind of Top-n-gram to convert the training protein sequence into a fixed-length vector, and then construct the word-document matrix W;
Step E: Perform singular value decomposition on the generated word-document matrix W to obtain the latent semantic representation vector of the training protein sequence;
Step F: Train the SVM classifier using the latent semantic representation vectors of the training protein sequences.
CN 201010600321 2010-12-22 2010-12-22 Remote protein homology detection and fold recognition method based on Top-n-gram Expired - Fee Related CN102043910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010600321 CN102043910B (en) 2010-12-22 2010-12-22 Remote protein homology detection and fold recognition method based on Top-n-gram

Publications (2)

Publication Number Publication Date
CN102043910A true CN102043910A (en) 2011-05-04
CN102043910B CN102043910B (en) 2012-12-12

Family

ID=43910044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010600321 Expired - Fee Related CN102043910B (en) 2010-12-22 2010-12-22 Remote protein homology detection and fold recognition method based on Top-n-gram

Country Status (1)

Country Link
CN (1) CN102043910B (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
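China Doctoral Dissertations Full-text Database 20110815 Liu Bin Prediction of protein structure and interaction sites based on frequency profiles 33-73 1-6 No. 8 *
Science in China Series C 20050228 Dong Qiwen et al. Protein secondary structure prediction: a word-based maximum entropy Markov approach 87-96 1-6 Vol. 35, No. 1 *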

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077226A (en) * 2012-12-31 2013-05-01 Zhejiang University of Technology Spatial search method for multi-modal protein conformations
CN103077226B (en) * 2012-12-31 2015-10-07 Zhejiang University of Technology Multi-modal protein conformation space search method
CN106709273A (en) * 2016-12-15 2017-05-24 First Institute of Oceanography, State Oceanic Administration Protein rapid detection method based on matched microalgae protein characteristics sequence label and system thereof
CN106709273B (en) * 2016-12-15 2019-06-18 First Institute of Oceanography, State Oceanic Administration Rapid detection method and system based on matching of microalgae protein characteristic sequence tags
CN113362900A (en) * 2021-06-15 2021-09-07 Shaoyang University Mixed model for predicting N4-acetylcytidine

Also Published As

Publication number Publication date
CN102043910B (en) 2012-12-12

Similar Documents

Publication Publication Date Title
Zheng et al. A rolling bearing fault diagnosis method based on multi-scale fuzzy entropy and variable predictive model-based class discrimination
CN102081655B (en) Information retrieval method based on Bayesian classification algorithm
WO2018120077A1 (en) Three-level inverter fault diagnosis method based on empirical mode decomposition and decision tree rvm
Manimala et al. Hybrid soft computing techniques for feature selection and parameter optimization in power quality data mining
CN107462785A (en) Power quality multi-disturbance signal classification and identification method based on GA-SVM
CN104535905A (en) Partial discharge diagnosis method based on naive bayesian classification
CN102915448B (en) Automatic three-dimensional model classification method based on AdaBoost
Guo et al. Improved adversarial learning for fault feature generation of wind turbine gearbox
CN110738232A (en) Grid voltage out-of-limit cause diagnosis method based on data mining technology
Zhou et al. Text categorization based on clustering feature selection
CN104915679A (en) Large-scale high-dimensional data classification method based on random forest weighted distance
CN104809233A (en) Attribute weighting method based on information gain ratios and text classification methods
CN102043910B (en) Remote protein homology detection and fold recognition method based on Top-n-gram
CN103440275A (en) Prim-based K-means clustering method
CN104820702A (en) Attribute weighting method based on decision tree and text classification method
Ma et al. Cluster analysis of wind turbines of large wind farm with diffusion distance method
CN104809229A (en) Method and system for extracting text characteristic words
Wang et al. A new process industry fault diagnosis algorithm based on ensemble improved binary‐tree SVM
CN115796231B (en) Temporal analysis ultra-short term wind speed prediction method
CN104573331A (en) K neighbor data prediction method based on MapReduce
CN115936926A (en) SMOTE-GBDT-based unbalanced electricity stealing data classification method and device, computer equipment and storage medium
CN106845799B (en) Evaluation method for typical working condition of battery energy storage system
CN105116323A (en) Motor fault detection method based on RBF
Xu et al. NWP feature selection and GCN-based ultra-short-term wind farm cluster power forecasting method
CN109753990B (en) User electric energy substitution potential prediction method, system and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121212

Termination date: 20141222

EXPY Termination of patent right or utility model