CN102819693A

CN102819693A - Method for predicting protein subcellular location based on improved Chou's pseudo amino acid composition

Info

Publication number: CN102819693A
Application number: CN2012102934168A
Authority: CN
Inventors: 李立奇; 张瑗; 朱洁; 周跃; 杨桦
Original assignee: Second Affiliated Hospital of TMMU
Current assignee: Second Affiliated Hospital of TMMU
Priority date: 2012-08-17
Filing date: 2012-08-17
Publication date: 2012-12-12

Abstract

The invention relates to a method for predicting protein subcellular location based on an improved Chou's pseudo amino acid composition. Its strategy is to construct an ensemble classifier from the K-nearest neighbor (KNN) method and the support vector machine (SVM) method under a one-versus-one scheme. The method aims to predict the subcellular location of proteins, and thereby accelerate the study of protein function, and belongs to the field of bioinformatics. The ensemble classifier combines a KNN method based on Euclidean distance with an SVM method based on a radial basis function (RBF) kernel. The protein feature information consists of the improved Chou's pseudo amino acid composition: starting from gene ontology (GO), amino acid composition (AAC), amino acid pair composition (AAP), and amino acid hydrophilicity and hydrophobicity features, the fselect.py method extracts the high-scoring features most closely related to subcellular location. Using the two prediction methods, KNN and SVM, together with the high-scoring features is intended to improve prediction accuracy. In the implementation, the method is evaluated with the jackknife test on indices such as overall prediction accuracy, per-location prediction accuracy, and the Matthews correlation coefficient (MCC). The method is suitable for predicting the subcellular location of proteins from different species.

Description

A method for predicting protein subcellular location based on improved Chou's pseudo amino acid composition

Technical field

The present invention relates to a method for predicting the subcellular location of proteins with a KNN-SVM ensemble classifier, and belongs to the field of bioinformatics.

Background technology

Studying the subcellular location of proteins plays an important role in elucidating their functions in the cell. Although protein subcellular localization can currently be studied experimentally, experimental methods are time-consuming and expensive, and are not suitable for large-scale studies. Computational methods, in contrast, can predict protein subcellular locations quickly, accurately, and on a large scale.

In the past few decades, many computational methods have been applied to protein subcellular location prediction. They fall into two broad categories. The first is based on amino acid composition. Nakashima et al. [1] found that extracellular and intracellular proteins differ significantly in amino acid composition, and distinguished proteins of these two location classes on that basis. Following this idea, many methods based on amino acid composition, dipeptide composition [2], and n-th order dipeptide composition [3] were proposed. To incorporate more protein sequence features, other features such as amino acid hydrophilicity and hydrophobicity [4], functional domain composition [5], and PSI-BLAST [6] were also introduced. The second category is based on sorting signals, including signal peptides, mitochondrial targeting peptides, and chloroplast transit peptides [7,8]. For example, Emanuelsson et al. [8] described in detail the use of SignalP and ChloroP to predict the cleavage sites of secretory pathway signal peptides and chloroplast transit peptides. However, the reliability of these methods depends heavily on the N-terminal sequence of the protein, and the molecular mechanisms behind sorting signals are quite complex and not yet fully understood.

Prediction accuracy is affected not only by the protein sequence information but also by the prediction algorithm. To date, many algorithms have been used to predict protein subcellular location, such as hidden Markov models (HMM) [9,10], neural networks [11], the K-nearest neighbor method (KNN) [12], and support vector machines (SVM) [13]. However, most predictors are based on a single algorithm, and every algorithm has its own inherent defects, which can lead to poor prediction performance. For example, the HMM algorithm requires many parameters to be estimated [14], and neural network models may encounter many local minima [15]. Some ensemble classifiers [2,16,17] have also been used to predict protein subcellular location, but most of them are in fact built on a single algorithm, for example fuzzy KNN [2], KNN [16], and Bayesian theory [17]. Other ensemble classifiers, such as CE-PLoc [18], are based on different algorithms, and these all include the KNN and SVM algorithms. Following this idea, we build an ensemble classifier from the KNN and SVM algorithms to predict protein subcellular location.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention provides a method for predicting protein subcellular location based on improved Chou's pseudo amino acid composition. It has two purposes. First, it predicts protein subcellular location using KNN and SVM, the two most commonly used prediction and classification methods, so as to compensate for the inherent defects of any single method; the parameter optimization tool grid.py is used to optimize the parameters of the prediction method, which helps improve prediction accuracy. Second, it addresses the information redundancy of conventional protein feature information, whose sheer volume leads to poor prediction performance: the feature screening tool fselect.py extracts, from a large number of features, the high-scoring features most closely related to protein subcellular location, improving prediction accuracy.

The technical solution of the present invention is as follows:

The present invention relates to a method for predicting protein subcellular location based on improved Chou's pseudo amino acid composition, characterized in that the prediction method is an ensemble method based on the KNN and SVM methods: the KNN method uses Euclidean distance, the SVM method uses the RBF kernel function, and parameters are optimized with the grid.py method. The method is named the KNN-SVM ensemble classifier. The protein feature information is the improved Chou's pseudo amino acid composition, formed by screening gene ontology (GO), amino acid composition (AAC), amino acid pair composition (AAP), and amino acid hydrophilicity and hydrophobicity features with the fselect.py method. The ensemble classifier uses a one-versus-one strategy to construct multiple binary classifiers, makes predictions with the KNN and SVM methods separately, and compares and fuses the predictions of the two methods. The protein data set is a eukaryotic, prokaryotic, or viral protein data set, selected according to the type of protein to be predicted.

Referring to Fig. 1, the main construction steps of the ensemble classifier are as follows:

1. Construction of the protein data sets: ① the eukaryotic protein data set Euk7579, ② the prokaryotic protein data set Gneg1456, and ③ the viral protein data set Virus252 are obtained from the following addresses, respectively:

① http://web.kuicr.kyoto-u.ac.jp/~park/Seqdata/ [3];

② http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/ [19];

③ http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/ [20].

Redundant information is deleted with the find-and-replace function of Word, leaving the protein ID, the number of the subcellular location it belongs to, and the amino acid sequence.

2. GO feature extraction: the GO data set is downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/ [21]; redundant information is deleted using file splitting and Excel, leaving the protein IDs and their corresponding GO numbers.

3. Extraction of the AAC, AAP, hydrophobicity, and hydrophilicity features of the proteins: the AAC, AAP, hydrophilicity, and hydrophobicity features are obtained according to Chou's amphiphilic pseudo amino acid composition theory [13,22].

4. Prediction of the subcellular location of the proteins in the data sets with the KNN-SVM ensemble classifier:

① Construction of the feature vectors: each protein has five groups of features (GO, AAC, AAP, hydrophobicity, and hydrophilicity), which together form the feature vector of that protein.

② Feature scoring and ranking: the fselect.py feature scoring and ranking method is used to score each feature, and the features are then sorted by score from high to low.

③ Generation of the Top features (i.e. the high-scoring features): based on the score ranking, the features ranked in the top 10 through top 60 (in steps of 10) are taken as the reduced feature sets after feature screening.

④ SVM parameter optimization for the Top features: the grid.py method is used to optimize the parameters C and γ of the SVM method.

⑤ Determination of the Top feature dimension Dim: the SVM method is used to compute the prediction accuracy for Top10 through Top60; the accuracies of the Top feature sets are compared, and the dimension Dim with the highest accuracy is taken as the final dimension.

⑥ Determination of the parameter K of the KNN method: the KNN method is used to compute the prediction accuracy for K from 1 to 10; the accuracies for each K value are compared, and the K value with the highest accuracy is taken as the parameter K of the KNN method.

⑦ Prediction of protein subcellular location with the KNN-SVM ensemble classifier: a protein data set with known locations contains n subcellular locations (n ≥ 1), so the one-versus-one strategy yields n × (n-1)/2 binary classifiers (for example, the viral data set contains 8 subcellular locations, so the one-versus-one strategy yields 8 × (8-1)/2 = 28 binary classifiers). For each binary classifier, the KNN and SVM prediction accuracies are compared, and the prediction of the more accurate method is taken as that classifier's prediction; these predictions are then combined to give the fused prediction. For a viral protein P of unknown location, the fused prediction is the location that the binary classifiers point to most often for P, which is taken as the predicted subcellular location of P.
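The one-versus-one construction in step 4 can be illustrated with a short Python sketch. It only enumerates the location pairs and extracts the training rows for one pair; the function names `one_vs_one_pairs` and `pair_subset` are hypothetical and do not appear in the patent, and the tiny data set in the test is made up.

```python
from itertools import combinations


def one_vs_one_pairs(n_sites):
    """Return every unordered pair of site labels 1..n_sites."""
    return list(combinations(range(1, n_sites + 1), 2))


def pair_subset(X, y, pair):
    """Keep only the samples whose label is one of the two sites in `pair`."""
    rows = [(x, label) for x, label in zip(X, y) if label in pair]
    return [x for x, _ in rows], [label for _, label in rows]


# 8 sites give 8 * (8 - 1) / 2 binary classifiers.
pairs = one_vs_one_pairs(8)
print(len(pairs))
```

Each binary classifier is trained only on the proteins of its two locations, which is why the two classes tend to be closer in size than under a one-versus-rest split.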

The present invention adopts a one-versus-one strategy, i.e. each classifier discriminates between two subcellular locations. Compared with a one-versus-rest strategy, the protein data of the two locations being discriminated are more balanced, so prediction bias is less likely to occur.

In application, the KNN-SVM ensemble classifier obtained by the invention allows the eukaryotic, prokaryotic, or viral protein data set to be selected independently as the training sample set, so that the subcellular locations of different kinds of proteins can be predicted in a more targeted way.

The present invention can be used to predict the subcellular location of various proteins whose location is unknown. It provides a fast and reliable prediction method for studying protein subcellular location, and offers a useful reference for the further study of protein function.

Description of drawings

Fig. 1 is a schematic diagram of protein subcellular location prediction with the KNN-SVM ensemble classifier;

Fig. 2 shows the Top30 features of the Euk7579 protein data set and their scores;

Fig. 3 shows the Top30 features of the Gneg1456 protein data set and their scores;

Fig. 4 shows the Top30 features of the Virus252 protein data set and their scores.

Embodiment

The construction process of the present invention is described in detail below with reference to an embodiment:

1. Construction of the protein data sets: the eukaryotic protein data set Euk7579, the prokaryotic protein data set Gneg1456, and the viral protein data set Virus252 are downloaded from the following addresses, respectively:

① http://web.kuicr.kyoto-u.ac.jp/~park/Seqdata/;

② http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/;

③ http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/.

The eukaryotic, prokaryotic, and viral protein data sets are transferred into Word files. Redundant information is deleted with the find-and-replace function of Word, and the protein IDs, subcellular location numbers, and amino acid sequences are stored in the new files A.xls and A2.xls, respectively.

2. GO feature extraction and vector construction: the GO annotation file local_gene_association.goa_uniprot.sorted is downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/. File-splitting software is used to divide this file into smaller files of 100 MB each, which are transferred into Excel files. The redundant information in them is deleted, and the protein IDs and their corresponding GO numbers are stored in the new file GO.xls. The GO information in GO.xls is then vectorized to construct the GO feature vector file, which is stored in GONo2.txt.
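As a rough illustration of step 2, the Python sketch below reads GAF-style annotation lines (tab-separated, with the protein accession in column 2 and the GO ID in column 5) and turns them into 0/1 vectors. The patent does not state how the GO information is encoded in GONo2.txt, so the binary encoding is an assumption; the function names and the sample accessions are made up for the example.

```python
def parse_goa_lines(lines):
    """Map protein accession to its set of GO IDs (GAF-style columns 2 and 5)."""
    go_by_protein = {}
    for line in lines:
        if not line.strip() or line.startswith("!"):  # skip blanks and header comments
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 5:
            go_by_protein.setdefault(cols[1], set()).add(cols[4])
    return go_by_protein


def go_vectors(go_by_protein, proteins):
    """Build one 0/1 vector per protein over the sorted GO vocabulary (assumed encoding)."""
    vocab = sorted({g for p in proteins for g in go_by_protein.get(p, ())})
    index = {g: i for i, g in enumerate(vocab)}
    vectors = []
    for p in proteins:
        row = [0] * len(vocab)
        for g in go_by_protein.get(p, ()):
            row[index[g]] = 1
        vectors.append(row)
    return vocab, vectors


# Made-up annotation lines, for illustration only.
sample = [
    "!gaf-version: 2.0",
    "UniProtKB\tP00001\tX1\t\tGO:0005634",
    "UniProtKB\tP00001\tX1\t\tGO:0005737",
    "UniProtKB\tP00002\tX2\t\tGO:0005737",
]
vocab, vectors = go_vectors(parse_goa_lines(sample), ["P00001", "P00002", "P00003"])
```

A protein without any annotation, such as the made-up P00003, simply receives an all-zero GO vector in this sketch.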

3. Extraction of the AAC, AAP, hydrophobicity, and hydrophilicity features and vector construction: according to Chou's amphiphilic pseudo amino acid composition theory, self-written MATLAB programs compute the AAC, AAP, hydrophilicity, and hydrophobicity feature vectors, which are stored in the files AAC.txt, AAP0.txt, Hydrophobicity1.txt, and Hydrophilicity1.txt, respectively. Taking the Euk7579 data set as an example, each of these four feature files contains 7579 rows of feature vectors, corresponding to the 7579 eukaryotic proteins in the Euk7579 data set.

4. Prediction of the subcellular location of the proteins in the data sets with the KNN-SVM ensemble classifier: relying on the Langchao Tiansuo high-performance server cluster platform of the Bioinformatics Center of the Third Military Medical University, MATLAB programs are written to run the KNN and SVM methods for predicting protein subcellular location. The program and the detailed steps are listed below.

⑴ Writing the prediction program: create a new m-file in MATLAB named file5_libsvm_knn_jackknife_linux.m, with the following code:
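The three sequence-derived feature groups in step 3 can be sketched in Python as follows: 20 amino acid frequencies (AAC), 400 adjacent-pair frequencies (AAP), and 9 sequence-order correlation factors per physicochemical scale, which matches the 20+400+2×9 dimensions quoted later in the text. The patent's own MATLAB programs are not reproduced in the source, so this is only an assumed reading; in particular `TOY_SCALE` is a placeholder rather than a real hydrophobicity or hydrophilicity scale, and no normalization or weighting of the correlation factors is shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Placeholder scale for illustration only; a real run would use published
# hydrophobicity and hydrophilicity values instead.
TOY_SCALE = {a: (i - 9.5) / 9.5 for i, a in enumerate(AMINO_ACIDS)}


def aac(seq):
    """Amino acid composition: 20 relative frequencies."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def aap(seq):
    """Amino acid pair composition: 400 relative frequencies of adjacent pairs."""
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    total = len(seq) - 1
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return [c / total for c in counts.values()]


def correlation_factors(seq, scale, max_lag=9):
    """Mean product of scale values of residues k positions apart, k = 1..max_lag."""
    values = [scale[a] for a in seq]
    factors = []
    for k in range(1, max_lag + 1):
        products = [values[i] * values[i + k] for i in range(len(values) - k)]
        factors.append(sum(products) / len(products))
    return factors


seq = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence
features = aac(seq) + aap(seq) + correlation_factors(seq, TOY_SCALE) + correlation_factors(seq, TOY_SCALE)
```

The sketch requires the sequence to be longer than `max_lag` residues; shorter sequences would need a smaller lag.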

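function f=libsvm_knn(a36,a41,a56,a63)
% a36: LIBSVM option string; a41: K for KNN; a56: number of Top features;
% a63: 'yesGO' or 'notGO' (with or without the GO features)
% Load the data files
a1=importdata('B_sheet2_linux.txt');
a5=importdata('A_sheet1_num_linux.txt');a6=importdata('A_sheet1_char_linux.txt');
a9=importdata('GONo2.txt');a9=sparse(a9);
a11=importdata('AAC.txt');
a12=importdata('AAP0.txt');
a15=[];
a15(:,:,1)=importdata('Hydrophobicity1.txt');
a15(:,:,2)=importdata('Hydrophilicity1.txt');
% Keep only the rows whose second column in a1 is not NaN
a3=isnan(a1(:,2));
a4=find(a3==0);
a7=a5(a4,1);      % site numbers (class labels)
a8=a6(a4,1);      % protein identifiers (text)
a10=a9(a4,:);     % GO features
a13=a11(a4,:);    % AAC features
a14=a12(a4,:);    % AAP features
a16=a15(a4,:,1);  % hydrophobicity features
a17=a15(a4,:,2);  % hydrophilicity features
clear a1 a9 a11 a12 a15
% Assemble the feature matrix (strcmp replaces the original == on strings)
if strcmp(a63,'yesGO')
    a25=[a13,a14,a16,a17,a10];
elseif strcmp(a63,'notGO')
    a25=[a13,a14,a16,a17];
end
a25=sparse(a25);
size(a25);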

% Write all features to features_<a63>.txt in LIBSVM format (label index:value ...)
while exist(strcat(['features_',a63,'.txt']))~=2
    a65=fopen(strcat(['features_',a63,'.txt']),'wt');
    a66=1:1:size(a25,2);
    for a69=1:1:size(a25,1)
        a69%%
        a67=[];
        a67(:,1:3:(3*size(a25,2)))=a66;          % feature index
        a67(:,2:3:(3*size(a25,2)))=Inf;          % placeholder, becomes ':'
        a67(:,3:3:(3*size(a25,2)))=a25(a69,:);   % feature value
        a68=[a7(a69,1),0,a67];
        a68(:,2)=NaN;                            % placeholder, removed below
        a70=mat2str(a68);
        a70(a70=='[')='';a70(a70==']')='';a70=regexprep(a70,' NaN','');a70=regexprep(a70,' Inf ',':');
        fprintf(a65,'%s\n',a70);
    end
    fclose(a65);
    pause
end

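% Keep the a56 highest-scoring features listed in the .fscore file
a55=importdata(strcat(['features_',a63,'.txt.fscore']));
a25=a25(:,a55(1:a56,1));
a57=[];
% Write the Top features to top<a56>_features_<a63>.txt in LIBSVM format
while exist(strcat(['top',num2str(a56),'_features_',a63,'.txt']))~=2
    a58=fopen(strcat(['top',num2str(a56),'_features_',a63,'.txt']),'wt');
    a64=num2str(a7);
    for a59=1:1:size(a25,1)
        a61=[];
        for a60=1:1:size(a25,2)
            a61=strcat([a61,' ',num2str(a60),':',num2str(a25(a59,a60))]);
        end
        % Use the whole label row; the original a64(a59,1) kept only its first character
        a62=strcat(strtrim(a64(a59,:)),a61);
        fprintf(a58,'%s\n',a62);
    end
    fclose(a58);
    pause
end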

% Jackknife test: each protein a18 is held out in turn and predicted by every
% one-versus-one binary classifier (site a19 vs site a20) with both SVM and KNN
a43=[];
for a18=1:1:size(a4,1)
    tic%%
    a42=1;
    for a19=1:1:(max(a7)-1)
        for a20=(a19+1):1:(max(a7))
            a21=a8(find(a7==a19),1);
            a22=a8(find(a7==a20),1);
            a26=a25(find(a7==a19),:);
            a27=a25(find(a7==a20),:);
            % Mark training rows that carry the identifier of the held-out protein
            a23=strcmp([a21;a22],a8(a18,1));
            a24=length(find(a23==1));
            a29=[a19*ones(length(find(a7==a19)),1);a20*ones(length(find(a7==a20)),1)];
            a28=[a26;a27];
            if a24==0
                a31=a29;
                a32=a28;
            else
                % Remove the held-out protein from the training set
                a30=find(a23==0);
                a31=a29(a30,1);
                a32=a28(a30,:);
            end
            a33=a7(a18,1);
            a34=a25(a18,:);
            a35=svmtrain(a31,full(a32),a36);
            [a37,a38,a39]=svmpredict(a33,full(a34),a35);
            a40=knnclassify(a34,a32,a31,a41);
            a43(a18,a42,1)=a37;   % SVM prediction of this binary classifier
            a43(a18,a42,2)=a40;   % KNN prediction of this binary classifier
            a42=a42+1;
        end
    end   % closes the a19 loop (this end was missing in the source listing)
    a18;%%
    toc%%
end
a43;
a44=a43(:,:,1);save(strcat(['result_libsvm',' ',a36,' ',a63,' top',num2str(a56),'.txt']),'a44','-ascii');
a45=a43(:,:,2);save(strcat(['result_knn',' -k ',num2str(a41),' ',a63,' top',num2str(a56),'.txt']),'a45','-ascii');

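% Evaluation: columns of b13 are 1=TP, 2=TN, 3=FP, 4=FN, 5=MCC per site;
% page 1 of b13 is the SVM result, page 2 the KNN result
b1=[];
b1(:,:,1)=a44;
b1(:,:,2)=a45;
b13=zeros(max(a7),5,size(b1,3));
for b2=1:1:size(b1,3)
    b3=b1(:,:,b2);
    b9=zeros(size(b3,1),1);
    for b4=1:1:size(b3,1)
        if b9(b4,1)==1
            continue
        end
        % Count the votes each site received from the binary classifiers
        b7=[];
        for b5=1:1:max(a7)
            b6=length(find(b3(b4,:)==b5));
            b7=[b7,b6];
        end
        b8=find(b7==max(b7));        % predicted site(s): all sites with the most votes
        b10=strcmp(a8,a8(b4,1));
        b11=find(b10);
        if length(b11)>1
            b9(b11,1)=1;             % multi-site protein: evaluate it only once
        end
        b12=a7(b11,1);               % true site(s) of this protein
        for b14=1:1:size(b8,2)
            if length(find(b12==b8(1,b14)))==1
                b13(b8(1,b14),1,b2)=b13(b8(1,b14),1,b2)+1;
            else
                b13(b8(1,b14),3,b2)=b13(b8(1,b14),3,b2)+1;
            end
        end
        for b15=1:1:size(b12,1)
            % Assumed intent: is the true site among the predicted sites?
            % (the source listing reads find(b8,b12(b15,1)) here)
            if length(find(b8==b12(b15,1)))==1
                %b13(b12(b15,1),1,b2)=b13(b12(b15,1),1,b2)+1;
            else
                b13(b12(b15,1),4,b2)=b13(b12(b15,1),4,b2)+1;
            end
        end
        for b17=1:1:max(a7)
            b16=[b8';b12];
            b18=length(find(b16==b17));
            if b18==0
                b13(b17,2,b2)=b13(b17,2,b2)+1;
            end
        end
    end   % the ends closing the b14, b15, b17 and b4 loops were missing in the source listing
    for b19=1:1:max(a7)
        b21=b13(b19,1,b2);%TP
        b22=b13(b19,2,b2);%TN
        b23=b13(b19,3,b2);%FP
        b24=b13(b19,4,b2);%FN
        b13(b19,5,b2)=(b21*b22-b23*b24)/(sqrt((b21+b23)*(b21+b24)*(b22+b23)*(b22+b24)));%MCC
    end
    b20=sum(b13(:,1,b2))/size(a7,1)%overall accuracy
end
b13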

Save and run the above file. Before the first pause set in the program, the total feature file combining all the features extracted above is generated, namely features_yesGO.txt.
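The total feature file is written in the sparse text format read by the LIBSVM tools: one protein per line, the class label first, followed by `index:value` pairs with indices starting at 1. A minimal Python sketch of writing and reading such a line is given below; the function names are hypothetical, and like the MATLAB listing it writes every feature, including zeros.

```python
def to_libsvm_line(label, features):
    """Format one sample as 'label 1:v1 2:v2 ...' (indices start at 1)."""
    parts = [str(label)]
    parts += [f"{i}:{v:g}" for i, v in enumerate(features, start=1)]
    return " ".join(parts)


def parse_libsvm_line(line):
    """Inverse of to_libsvm_line: return (label, {index: value})."""
    label, *pairs = line.split()
    features = {int(i): float(v) for i, v in (p.split(":") for p in pairs)}
    return int(label), features


line = to_libsvm_line(3, [0.05, 0, 0.1])
```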

⑵ Feature scoring and ranking (see Fig. 2, Fig. 3, and Fig. 4): taking the Euk7579 protein data set as an example, the features_yesGO.txt file contains the GO, AAC, AAP, and amino acid hydrophobicity and hydrophilicity features, and the total feature dimension is 6533+20+400+2×9=6971. Python is installed on the server, and the feature scoring program fselect.py is downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. fselect.py is placed in the same folder as the file features_yesGO.txt.

Open Xshell, enter this folder, and enter the following command, followed by Enter:

python fselect.py features_yesGO.txt

This generates the file features_yesGO.txt.fscore, which records the score of each feature.
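fselect.py is part of the LIBSVM tools and is understood to rank features by an F-score, i.e. the ratio of between-class separation to within-class spread of each feature. The Python sketch below is a simplified multi-class version written for illustration; it is not a copy of fselect.py and may differ from it in detail. The function names, the small constant `eps`, and the toy data are assumptions.

```python
def f_scores(X, y, eps=1e-12):
    """Per-feature F-score: between-class spread divided by within-class variance."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)
        between, within = 0.0, 0.0
        for c in classes:
            vals = [v for v, label in zip(column, y) if label == c]
            mean_c = sum(vals) / len(vals)
            between += (mean_c - overall_mean) ** 2
            if len(vals) > 1:
                within += sum((v - mean_c) ** 2 for v in vals) / (len(vals) - 1)
        scores.append(between / (within + eps))  # eps avoids division by zero
    return scores


def rank_features(scores):
    """1-based feature indices sorted from highest to lowest score."""
    return sorted(range(1, len(scores) + 1), key=lambda i: -scores[i - 1])


# Toy data: feature 1 separates the two classes, feature 2 does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 5.1], [3.1, 0.9]]
y = [1, 1, 2, 2]
scores = f_scores(X, y)
ranking = rank_features(scores)
```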

⑶ Generation of the Top features (i.e. the high-scoring features): based on the feature scores in the file features_yesGO.txt.fscore obtained in ⑵, the file5_libsvm_knn_jackknife_linux.m program is run again to generate the Top features. For each data set, the features ranked in the top 10 through top 60 (in steps of 10) are generated, producing the following Top feature files:

top10_features_yesGO.txt

top20_features_yesGO.txt

top30_features_yesGO.txt

top40_features_yesGO.txt

top50_features_yesGO.txt

top60_features_yesGO.txt
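Once the features are ranked, building the Top10 to Top60 feature sets amounts to keeping the first k ranked columns. A minimal Python sketch follows; `top_k_columns` and `TOP_SIZES` are hypothetical names, and the ranking is assumed to hold 1-based feature indices sorted from highest to lowest score.

```python
TOP_SIZES = range(10, 61, 10)  # 10, 20, ..., 60


def top_k_columns(X, ranking, k):
    """Keep the k highest-ranked feature columns, in ranking order."""
    keep = ranking[:k]
    return [[row[i - 1] for i in keep] for row in X]


# Tiny made-up example: three features, ranked 3 > 1 > 2, keep the best two.
reduced = top_k_columns([[1, 2, 3], [4, 5, 6]], [3, 1, 2], 2)
```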

⑷ SVM parameter optimization for the Top features: the program grid.py and the file gp442win32.zip are downloaded from http://www.csie.ntu.edu.tw/~cjlin/. Taking the Euk7579 protein data set as an example, after entering the Euk7579 data set directory, enter python grid.py top10_features_yesGO.txt and press Enter. Running grid.py gives the optimized values of the parameters C and γ of the SVM algorithm for the Top10 features of the Euk7579 data set. In the same way, the optimal C and γ for the other Top feature sets of the Euk7579 data set, and for the Top feature sets of the other protein data sets, can be obtained. Since there are 3 data sets and each has 6 Top feature sets (Top10 to Top60), 18 parameter optimization results are obtained in total, i.e. 18 pairs of optimal C and γ values.
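The idea behind this step is an exhaustive search over a grid of (C, γ) pairs on a log2 scale, keeping the pair with the best cross-validation accuracy. The Python sketch below shows that search together with the RBF kernel the SVM uses. The exponent ranges are those commonly documented as the grid.py defaults and should be treated as an assumption; `toy_cv_accuracy` is a made-up stand-in for an actual cross-validation run.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def exponent_range(start, stop, step):
    """Inclusive range of exponents, ascending or descending."""
    return list(range(start, stop + (1 if step > 0 else -1), step))


def grid_search(cv_accuracy, log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Try every (C, gamma) on the grid and return (best accuracy, C, gamma)."""
    best = None
    for ec in exponent_range(*log2c):
        for eg in exponent_range(*log2g):
            C, gamma = 2.0 ** ec, 2.0 ** eg
            acc = cv_accuracy(C, gamma)
            if best is None or acc > best[0]:
                best = (acc, C, gamma)
    return best


def toy_cv_accuracy(C, gamma):
    """Made-up score that peaks at C = 2^3, gamma = 2^-5 (illustration only)."""
    return 100 - abs(math.log2(C) - 3) - abs(math.log2(gamma) + 5)


best_acc, best_C, best_gamma = grid_search(toy_cv_accuracy)
```

With these ranges the grid holds 11 × 10 = 110 candidate pairs per feature set.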

⑸ Determination of the Top feature dimension Dim: because the Virus252 data set requires the shortest run time and is the easiest to handle, it is used as the standard for determining the Top feature dimension. In the Virus252 data set directory, the file5_libsvm_knn_jackknife_linux.m program is run again to obtain the SVM prediction accuracy for Top10 to Top60 with the optimized C and γ. The accuracies of the Top feature sets are compared, and the Top feature dimension with the highest accuracy is taken as the final dimension. The optimal C and γ of the Virus252 data set at this dimension are taken as the final parameters of the SVM method.

⑹ Determination of the parameter K of the KNN method: in the Virus252 data set directory, the file5_libsvm_knn_jackknife_linux.m program is run again, and the prediction accuracy of the KNN method is computed for K from 1 to 10. The K value with the highest accuracy is taken as the final parameter of the KNN method. With the combination of optimal C, γ, K, and Dim determined in the steps above, the file5_libsvm_knn_jackknife_linux.m program is then run again to compute the KNN and SVM prediction accuracies for the other two protein data sets.
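The selection of K can be sketched in Python as a jackknife (leave-one-out) test of a Euclidean-distance KNN classifier, repeated for each candidate K. This is an illustration under simplifying assumptions, not the patent's program: it handles a plain multi-class problem rather than the pairwise classifiers, breaks voting ties by the smallest label, and uses a made-up six-point data set.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    top = max(votes.values())
    return min(label for label, c in votes.items() if c == top)  # tie: smallest label


def jackknife_accuracy(X, y, k):
    """Hold out each sample in turn, predict it from the rest, return the hit rate."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)


# Made-up data: two well-separated clusters of three points each.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [1, 1, 1, 2, 2, 2]
best_k = max(range(1, 6), key=lambda k: jackknife_accuracy(X, y, k))
```

On real data the candidate range would be K = 1 to 10 as in the text; the toy set is too small for that, so the example stops at 5.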

⑺ Prediction of protein subcellular location with the KNN-SVM ensemble classifier: taking the Virus252 data set as an example, it contains 8 subcellular locations, so the one-versus-one strategy yields 8 × (8-1)/2 = 28 binary classifiers. For each binary classifier, the KNN and SVM prediction accuracies are compared, and the prediction of the more accurate method is taken as that classifier's prediction; the 28 predictions are then combined to give the fused prediction. For a viral protein P of unknown location, the fused prediction is the location that the binary classifiers point to most often for P, which is taken as the predicted subcellular location of P.

Cited references:
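The fusion rule can be sketched in Python as two small steps: pick, for every location pair, whichever of KNN and SVM had the higher accuracy, then count how often each location is predicted across the pairs. All names and numbers below are made up for illustration. The source does not say how equal accuracies are resolved (SVM is preferred here), and tied vote counts are all returned, as the evaluation part of the MATLAB listing appears to do.

```python
from collections import Counter


def choose_methods(pair_accuracy):
    """For each site pair, keep the method with the higher accuracy (SVM on a tie)."""
    return {
        pair: max(("svm", "knn"), key=lambda m: acc[m])
        for pair, acc in pair_accuracy.items()
    }


def ensemble_predict(chosen, pair_outputs):
    """Vote over the chosen method's output per pair; return all top-voted sites."""
    votes = Counter(pair_outputs[pair][method] for pair, method in chosen.items())
    top = max(votes.values())
    return sorted(site for site, c in votes.items() if c == top)


# Made-up accuracies and predictions for a 3-site problem and one protein P.
pair_accuracy = {
    (1, 2): {"knn": 0.90, "svm": 0.85},
    (1, 3): {"knn": 0.80, "svm": 0.88},
    (2, 3): {"knn": 0.75, "svm": 0.75},
}
pair_outputs = {
    (1, 2): {"knn": 1, "svm": 2},
    (1, 3): {"knn": 3, "svm": 1},
    (2, 3): {"knn": 2, "svm": 3},
}
chosen = choose_methods(pair_accuracy)
predicted_sites = ensemble_predict(chosen, pair_outputs)
```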

[1] Nakashima H, Nishikawa K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol 1994;238(1):54-61.

[2] Gu Q, Ding YS, Jiang XY, Zhang TL. Prediction of subcellular location apoptosis proteins with ensemble classifier and feature selection. Amino Acids 2010;38(4):975-83.

[3] Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003;19(13):1656-63.

[4] Zhou XB, Chen C, Li ZC, Zou XY. Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J Theor Biol 2007;248(3):546-51.

[5] Chou KC, Cai YD. Using functional domain composition and support vector machines for prediction of protein subcellular location. J Biol Chem 2002;277(48):45765-9.

[6] Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res 2004;32(Web Server issue):W414-9.

[7] Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000;300(4):1005-16.

[8] Emanuelsson O, Brunak S, von Heijne G, Nielsen H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2007;2(4):953-71.

[9] Rashid M, Saha S, Raghava GP. Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics 2007;8:337.

[10] Lin TH, Murphy RF, Bar-Joseph Z. Discriminative motif finding for predicting protein subcellular localization. IEEE/ACM Trans Comput Biol Bioinform 2011;8(2):441-51.

[11] Zou L, Wang Z, Huang J. Prediction of subcellular localization of eukaryotic proteins using position-specific profiles and neural network with weighted inputs. J Genet Genomics 2007;34(12):1080-7.

[12] Li L, Zhang Y, Zou L, Li C, Yu B, Zheng X, et al. An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity. PLoS One 2012;7(1):e31057.

[13] Li LQ, Zhang Y, Zou LY, Zhou Y, Zheng XQ. Prediction of protein subcellular multi-localization based on the general form of Chou's pseudo amino acid composition. Protein Pept Lett 2012;19(4):375-87.

[14] Mount DW. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc 2009;2009(7):pdb.top41.

[15] Marinov M, Weeks DE. The complexity of linkage analysis with neural networks. Hum Hered 2001;51(3):169-76.

[16] Shen HB, Yang J, Chou KC. Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction. Amino Acids 2007;33(1):57-67.

[17] Bulashevska A, Eils R. Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains. BMC Bioinformatics 2006;7:298.

[18] Khan A, Majid A, Hayat M. CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 2011;35(4):218-29.

[19] Shen HB, Chou KC. Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. J Theor Biol 2010;264(2):326-33.

[20] Shen HB, Chou KC. Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites. J Biomol Struct Dyn 2010;28(2):175-86.

[21] Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32(Database issue):D258-61.

[22] Chou KC, Shen HB. Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers. J Proteome Res 2006;5(8):1888-97.

Claims

1. A method for predicting protein subcellular location based on improved Chou's pseudo amino acid composition, characterized in that the prediction method is an ensemble prediction method based on the K-nearest neighbor method (KNN) and the support vector machine method (SVM), wherein the KNN method uses Euclidean distance, the SVM method uses the RBF kernel function, and parameters are optimized with the grid.py method, the method being named the KNN-SVM ensemble classifier; the prediction method comprises the following steps:

(1) Construction of the protein training data sets: the training data sets comprise a eukaryotic protein data set, a prokaryotic protein data set, and a viral protein data set, each containing the protein IDs, the numbers of the subcellular locations they belong to, and the amino acid sequences;

(2) GO feature extraction: the GO data set is downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/, and only the protein IDs and their corresponding GO numbers need to be retained;

(3) Extraction of the amino acid composition (AAC), amino acid pair composition (AAP), hydrophobicity, and hydrophilicity features of the proteins: the AAC, AAP, hydrophilicity, and hydrophobicity features are obtained using Chou's amphiphilic pseudo amino acid composition theory;

(4) Prediction of the subcellular location of the proteins in the data sets with the KNN-SVM ensemble classifier:

① Construction of the feature vectors: each protein has five groups of features (GO, AAC, AAP, hydrophobicity, and hydrophilicity), which together form the feature vector of that protein;

② Feature scoring and ranking: the fselect.py feature scoring and ranking method is used to score each feature, and the features are then sorted by score from high to low;

③ Generation of the Top features, i.e. the high-scoring features: based on the score ranking, the features ranked in the top 10 through top 60, in steps of 10, are taken as the reduced feature sets after feature screening;

④ SVM parameter optimization for the Top features: the grid.py method is used to optimize the parameters C and γ of the SVM method;

⑤ Determination of the Top feature dimension Dim: the SVM method is used to compute the prediction accuracy for Top10 through Top60; the accuracies of the Top feature sets are compared, and the dimension Dim with the highest accuracy is taken as the final dimension;

⑥ Determination of the parameter K of the KNN method: the KNN method is used to compute the prediction accuracy for K from 1 to 10; the accuracies for each K value are compared, and the K value with the highest accuracy is taken as the parameter K of the KNN method;

⑦ Prediction of protein subcellular location with the KNN-SVM ensemble classifier: a protein data set with known locations contains n subcellular locations, n ≥ 1, so the one-versus-one strategy yields n × (n-1)/2 binary classifiers; for each binary classifier, the KNN and SVM prediction accuracies are compared, and the prediction of the more accurate method is taken as that classifier's prediction; these predictions are then combined to give the fused prediction; for a viral protein P of unknown location, the fused prediction is the location that the binary classifiers point to most often for P, which is taken as the predicted subcellular location of P.

2. The prediction method according to claim 1, characterized in that the ensemble classifier performs the parameter optimization of the SVM and KNN methods on the viral protein data set, applies the optimized parameter values as the standard to the prediction of all protein data sets, and compares and fuses the predictions of the two methods, thereby differing from ensemble classifiers constructed from a single prediction method.