CN102819693A - Prediction method for protein subcellular site formed based on improved-period pseudo amino acid - Google Patents


Info

Publication number
CN102819693A
CN102819693A CN2012102934168A CN201210293416A CN102819693A CN 102819693 A CN102819693 A CN 102819693A CN 2012102934168 A CN2012102934168 A CN 2012102934168A CN 201210293416 A CN201210293416 A CN 201210293416A CN 102819693 A CN102819693 A CN 102819693A
Authority
CN
China
Prior art keywords
protein
knn
svm
amino acid
site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012102934168A
Other languages
Chinese (zh)
Inventor
李立奇
张瑗
朱洁
周跃
杨桦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Second Affiliated Hospital of TMMU
Original Assignee
Second Affiliated Hospital of TMMU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Second Affiliated Hospital of TMMU
Priority to CN2012102934168A
Publication of CN102819693A
Legal status: Pending

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to a method for predicting protein subcellular sites based on an improved Chou's pseudo amino acid composition. Its strategy is to construct an ensemble classifier from the KNN (K-nearest-neighbor) method and the SVM (support vector machine) method using a one-vs-one scheme. The method aims to predict the subcellular site of proteins and thereby accelerate protein function research, and belongs to the field of bioinformatics. The ensemble classifier combines a KNN method based on the Euclidean distance with an SVM method based on an RBF (radial basis function) kernel. The protein feature information is an improved Chou's pseudo amino acid composition: starting from GO (gene ontology), AAC (amino acid composition), AAP (amino acid pair composition) and amino acid hydrophilicity and hydrophobicity features, the fselect.py tool extracts the high-scoring features most closely related to the protein subcellular site. Applying the two prediction methods, KNN and SVM, to these high-scoring features is intended to improve the prediction accuracy. In the embodiment, the method is evaluated with the jackknife test on indexes such as overall prediction accuracy, per-site prediction accuracy and the MCC (Matthews correlation coefficient). The disclosed method is suitable for predicting the subcellular site of proteins from different species.

Description

A method for predicting protein subcellular sites based on an improved Chou's pseudo amino acid composition
Technical field
The present invention relates to a method for predicting the subcellular site of proteins with a KNN-SVM ensemble classifier, and belongs to the field of bioinformatics.
Background art
Studying the subcellular site of proteins is important for elucidating their functions in the cell. Although protein subcellular localization can currently be studied experimentally, experimental methods are time-consuming and expensive, and are not suited to large-scale studies. Computational methods, by contrast, can predict the subcellular site of proteins quickly, accurately and on a large scale.
Over the past few decades, many computational methods have been applied to protein subcellular site prediction. These methods fall into two main classes. The first class is based on amino acid composition. Nakashima et al. [1] found that extracellular and intracellular proteins differ significantly in amino acid composition, and used this to distinguish proteins of these two subcellular locations. Following this idea, many computational methods based on amino acid composition, dipeptide composition [2] and n-th order dipeptide composition [3] have been proposed. To incorporate more protein sequence features, many other features (such as amino acid hydrophilicity and hydrophobicity [4], functional domain composition [5] and PSI-BLAST [6]) have also been introduced. The second class is based on sorting signals, including signal peptides, mitochondrial targeting peptides and chloroplast transit peptides [7,8]. For example, Emanuelsson et al. [8] described in detail the use of SignalP and ChloroP to predict the cleavage sites of secretory pathway signal peptides and chloroplast transit peptides. The reliability of these methods, however, depends heavily on the N-terminal sequence of the protein, and the molecular mechanisms behind sorting signals are complex and not yet fully understood.
Prediction accuracy depends not only on the protein sequence information but also on the prediction algorithm. Many computational methods have been used to predict the subcellular site of proteins, such as hidden Markov models (HMM) [9,10], neural networks [11], the K-nearest-neighbor method (KNN) [12] and support vector machines (SVM) [13]. Most prediction classifiers, however, are based on a single algorithm, and every algorithm has its own inherent defects, which can lead to poor prediction performance. For example, the HMM algorithm requires many parameters to be estimated [14], and neural network models may run into many local minima [15]. Some ensemble classifiers [2,16,17] have also been used to predict the subcellular site of proteins, but most of them are in fact based on a single algorithm, for example fuzzy KNN [2], KNN [16] and Bayesian theory [17]. Other ensemble classifiers, such as CE-PLoc [18], are based on different algorithms, and these ensemble classifiers all include the KNN and SVM algorithms. Following this idea, we construct an ensemble classifier from the KNN and SVM algorithms to predict the subcellular site of proteins.
Summary of the invention
To overcome the deficiencies of the prior art, the present invention provides a method for predicting protein subcellular sites based on an improved Chou's pseudo amino acid composition. It has two purposes. First, it predicts the subcellular site of proteins with KNN and SVM, the two most commonly used prediction and classification methods, so as to compensate for the inherent defects of any single method; the parameter optimization tool grid.py is used to optimize the parameters of the prediction method, which helps to improve prediction accuracy. Second, it addresses the information redundancy that arises because conventionally used protein feature information is too large, which leads to poor prediction performance: the invention uses the feature selection tool fselect.py to extract, from a large amount of feature information, the high-scoring features most closely related to the protein subcellular site, thereby improving the accuracy of subcellular site prediction.
The technical solution of the present invention is as follows:
The present invention relates to a method for predicting protein subcellular sites based on an improved Chou's pseudo amino acid composition, characterized in that the prediction method is an ensemble prediction method based on the KNN and SVM methods. The KNN method uses the Euclidean distance, the SVM method uses the RBF kernel function, and the grid.py method is used for parameter optimization; the method is named the KNN-SVM ensemble classifier. The protein feature information is an improved Chou's pseudo amino acid composition, obtained by screening gene ontology (GO), amino acid composition (AAC), amino acid pair composition (AAP) and amino acid hydrophilicity and hydrophobicity features with the fselect.py method. The ensemble classifier constructs multiple binary classifiers with the one-vs-one strategy, predicts with the KNN and SVM methods separately, and compares and fuses the predictions of the two methods. The protein dataset is a eukaryotic, prokaryotic or viral protein dataset, selected according to the kind of protein to be predicted.
Referring to Fig. 1, the main steps for constructing the ensemble classifier are as follows:
1. Construction of the protein datasets: ① the eukaryotic protein dataset Euk7579, ② the prokaryotic protein dataset Gneg1456 and ③ the viral protein dataset Virus252 are obtained from the following addresses, respectively:
① http://web.kuicr.kyoto-u.ac.jp/~park/Seqdata/ [3];
② http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/ [19];
③ http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/ [20].
The find-and-replace function of Word is used to delete redundant information, leaving the protein ID, the ID of the subcellular site to which the protein belongs, and the amino acid sequence.
2. GO feature extraction: the GO dataset is downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/ [21]. File splitting and Excel are used to delete redundant information, yielding the protein IDs and their corresponding GO IDs.
3. Extraction of the AAC, AAP, hydrophobicity and hydrophilicity features of the proteins: the AAC, AAP, hydrophilicity and hydrophobicity features are obtained according to Chou's amphiphilic pseudo amino acid composition theory [13,22].
4. Prediction of the subcellular site of the proteins in the dataset with the KNN-SVM ensemble classifier:
(1) Building the feature vectors: each protein has five groups of features, namely GO, AAC, AAP, hydrophobicity and hydrophilicity, which together constitute the feature vector of that protein.
(2) Feature scoring and ranking: the fselect.py feature scoring and ranking method is used to score each feature, and the features are then sorted by score from high to low.
(3) Generating the Top features (i.e. the high-scoring features): based on the score ranking, the top 10 to top 60 features (in steps of 10) are taken as the reduced feature sets after feature selection.
(4) SVM parameter optimization for the Top features: the grid.py method is used to optimize the parameters C and γ of the SVM method.
(5) Determining the Top feature dimension Dim: the SVM method is used to compute the prediction accuracy for Top10 to Top60. The accuracies of the Top feature sets are compared, and the Top feature dimension Dim with the highest accuracy is taken as the final dimension.
(6) Determining the parameter K of the KNN method: the KNN method is used to compute the prediction accuracy for K from 1 to 10. The accuracies of the K values are compared, and the K value with the highest accuracy is taken as the parameter K of the KNN method.
(7) Predicting protein subcellular sites with the KNN-SVM ensemble classifier: a protein dataset with known sites contains n subcellular sites, n ≥ 1, so the one-vs-one strategy forms n × (n-1)/2 binary classifiers (taking the viral dataset as an example, it contains 8 subcellular sites, so the one-vs-one strategy forms 8 × (8-1)/2 = 28 binary classifiers). Within each classifier, the prediction of whichever method, KNN or SVM, has the higher prediction accuracy is taken as the prediction of that classifier, and these predictions are then combined to obtain the fused prediction. For a viral protein P of unknown site, the fused prediction is the site to which protein P is assigned most often, and this site is the predicted subcellular site of protein P.
The present invention adopts the one-vs-one strategy, i.e. it discriminates between two subcellular sites at a time. Compared with the one-vs-rest strategy, the protein data of the two sites being discriminated are more balanced, so prediction bias is less likely.
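The pair construction described above can be sketched in a few lines of Python. This is only an illustration of the one-vs-one bookkeeping, not the patent's MATLAB program; the function names `one_vs_one_pairs` and `pair_training_subset` are hypothetical, and site labels are assumed to be the integers 1..n.

```python
from itertools import combinations


def one_vs_one_pairs(n_sites):
    """Return every unordered pair of site labels 1..n_sites (one binary classifier per pair)."""
    return list(combinations(range(1, n_sites + 1), 2))


def pair_training_subset(labels, pair):
    """Indices of the training proteins that belong to either site of the pair."""
    a, b = pair
    return [i for i, y in enumerate(labels) if y in (a, b)]


pairs = one_vs_one_pairs(8)
print(len(pairs))  # 8 * (8 - 1) / 2 = 28 binary classifiers
```

Each binary classifier is trained only on the proteins of its two sites, which is why the two classes tend to be more balanced than in a one-vs-rest split.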
In application, the KNN-SVM ensemble classifier obtained by the present invention allows the eukaryotic, prokaryotic or viral protein dataset to be chosen freely as the training sample set, so that the subcellular site of proteins of different kinds can be predicted in a more targeted way.
The present invention can be used to predict the subcellular site of various proteins of unknown site. It provides a fast and reliable prediction method for studying the subcellular site of proteins, and also offers some reference value for further study of protein function.
Description of the drawings
Fig. 1 is a schematic diagram of protein subcellular site prediction with the KNN-SVM ensemble classifier;
Fig. 2 shows the Top30 features and their scores for the Euk7579 protein dataset;
Fig. 3 shows the Top30 features and their scores for the Gneg1456 protein dataset;
Fig. 4 shows the Top30 features and their scores for the Virus252 protein dataset.
Embodiment
The construction process of the present invention is described in detail below with reference to an embodiment:
1. Construction of the protein datasets: ① the eukaryotic protein dataset Euk7579, ② the prokaryotic protein dataset Gneg1456 and ③ the viral protein dataset Virus252 are obtained from the following addresses, respectively:
① http://web.kuicr.kyoto-u.ac.jp/~park/Seqdata/;
② http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/;
③ http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/.
The above eukaryotic, prokaryotic and viral protein datasets are transferred into Word files. The find-and-replace function of Word is used to delete redundant information, and the protein IDs, the IDs of the subcellular sites to which they belong and the amino acid sequences are stored in the new files A.xls and A2.xls, respectively.
2. GO feature extraction and vector construction: the GO data file local_gene_association.goa_uniprot.sorted is downloaded from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/. File splitting software is used to divide this file into smaller files of 100 MB each, which are transferred into Excel files. Redundant information is deleted, and the protein IDs and their corresponding GO IDs are stored in the new file GO.xls. The GO information in GO.xls is then vectorized to build the GO feature vector file, which is stored in GONo2.txt.
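A minimal sketch of the vectorization step follows. The patent does not state how GO IDs are encoded in GONo2.txt, so the binary presence/absence encoding used here is an assumption, and the function name `build_go_vectors` and the protein IDs `P1` to `P3` are hypothetical.

```python
def build_go_vectors(protein_go, proteins):
    """Build one 0/1 row per protein over the sorted vocabulary of all GO IDs.

    protein_go: dict mapping protein ID -> set of GO IDs annotated to it.
    proteins:   list of protein IDs, in the row order wanted for the output.
    """
    vocabulary = sorted({go for terms in protein_go.values() for go in terms})
    column = {go: j for j, go in enumerate(vocabulary)}
    rows = []
    for pid in proteins:
        row = [0] * len(vocabulary)
        for go in protein_go.get(pid, ()):  # proteins without annotation keep an all-zero row
            row[column[go]] = 1
        rows.append(row)
    return vocabulary, rows


annotations = {"P1": {"GO:0005634"}, "P2": {"GO:0005634", "GO:0005737"}}
vocabulary, rows = build_go_vectors(annotations, ["P1", "P2", "P3"])
print(vocabulary)
print(rows)
```

Under this encoding, the number of GO columns equals the number of distinct GO IDs seen in the dataset.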
3. Extraction of the AAC, AAP, hydrophobicity and hydrophilicity features of the proteins and vector construction: according to Chou's amphiphilic pseudo amino acid composition theory, self-written MATLAB programs are used to compute the AAC, AAP, hydrophilicity and hydrophobicity feature vectors, which are stored in the files AAC.txt, AAP0.txt, Hydrophobicity1.txt and Hydrophilicity1.txt, respectively. Taking the Euk7579 dataset as an example, each of these 4 feature files contains 7579 rows of feature vectors, corresponding to the 7579 eukaryotic proteins in the Euk7579 dataset.
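The two composition features can be illustrated with a short Python sketch: AAC as the 20 residue frequencies and AAP as the 400 frequencies of adjacent residue pairs. The patent computes these with its own MATLAB programs, which are not listed, so the normalization and the handling of non-standard residues below (they are simply skipped) are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in a fixed order
PAIR_INDEX = {a + b: k for k, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}


def aac(sequence):
    """20-dimensional amino acid composition: frequency of each standard residue."""
    counts = [sequence.count(a) for a in AMINO_ACIDS]
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 20


def aap(sequence):
    """400-dimensional amino acid pair composition: frequency of each adjacent residue pair."""
    counts = [0] * 400
    for i in range(len(sequence) - 1):
        k = PAIR_INDEX.get(sequence[i:i + 2])
        if k is not None:  # pairs containing a non-standard residue are skipped
            counts[k] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 400


print(len(aac("MKVLA")), len(aap("MKVLA")))  # 20 400
```

These two blocks account for the 20 + 400 columns of the total feature dimension given later in step ⑵.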
4. Prediction of the subcellular site of the proteins in the dataset with the KNN-SVM ensemble classifier: relying on the Langchao Tiansuo high-performance server cluster platform of the Bioinformatics Center of the Third Military Medical University, MATLAB programs are written to predict the subcellular site of proteins with the KNN and SVM methods. The program and the detailed steps are listed one by one below.
⑴. Writing the prediction program: create a new m-file in MATLAB, name it file5_libsvm_knn_jackknife_linux.m, and enter the following code:
function f=libsvm_knn(a36,a41,a56,a63)
% a36: LIBSVM option string; a41: K of the KNN method;
% a56: number of Top features; a63: 'yesGO' or 'notGO'
a1=importdata('B_sheet2_linux.txt');
a5=importdata('A_sheet1_num_linux.txt');a6=importdata('A_sheet1_char_linux.txt');
a9=importdata('GONo2.txt');a9=sparse(a9);
a11=importdata('AAC.txt');
a12=importdata('AAP0.txt');
a15=[];
a15(:,:,1)=importdata('Hydrophobicity1.txt');
a15(:,:,2)=importdata('Hydrophilicity1.txt');
a3=isnan(a1(:,2));
a4=find(a3==0);   % keep the rows whose second column in B_sheet2 is not NaN
a7=a5(a4,1);      % site labels
a8=a6(a4,1);      % protein IDs
a10=a9(a4,:);
a13=a11(a4,:);
a14=a12(a4,:);
a16=a15(a4,:,1);
a17=a15(a4,:,2);
clear a1 a9 a11 a12 a15
if strcmp(a63,'yesGO')
    a25=[a13,a14,a16,a17,a10];
elseif strcmp(a63,'notGO')
    a25=[a13,a14,a16,a17];
end
a25=sparse(a25);
size(a25);
% write the full feature file in LIBSVM format, then pause so that
% fselect.py can be run on it
while exist(strcat(['features_',a63,'.txt']),'file')~=2
    a65=fopen(strcat(['features_',a63,'.txt']),'wt');
    a66=1:1:size(a25,2);
    for a69=1:1:size(a25,1)
        a69 %% progress
        a67=[];
        a67(:,1:3:(3*size(a25,2)))=a66;
        a67(:,2:3:(3*size(a25,2)))=Inf;
        a67(:,3:3:(3*size(a25,2)))=a25(a69,:);
        a68=[a7(a69,1),0,a67];
        a68(:,2)=NaN;
        a70=mat2str(a68);
        % turn '[label NaN 1 Inf v1 2 Inf v2 ...]' into 'label 1:v1 2:v2 ...'
        a70(a70=='[')='';a70(a70==']')='';a70=regexprep(a70,' NaN','');a70=regexprep(a70,' Inf ',':');
        fprintf(a65,'%s\n',a70);
    end
    fclose(a65);
    pause
end
% keep the a56 highest-scoring features and write them in LIBSVM format
a55=importdata(strcat(['features_',a63,'.txt.fscore']));
a25=a25(:,a55(1:a56,1));
a57=[];
while exist(strcat(['top',num2str(a56),'_features_',a63,'.txt']),'file')~=2
    a58=fopen(strcat(['top',num2str(a56),'_features_',a63,'.txt']),'wt');
    a64=num2str(a7);
    for a59=1:1:size(a25,1)
        a61=[];
        for a60=1:1:size(a25,2)
            a61=strcat([a61,' ',num2str(a60),':',num2str(a25(a59,a60))]);
        end
        % use the whole label row so that labels of 10 or more are not truncated
        a62=[strtrim(a64(a59,:)),a61];
        fprintf(a58,'%s\n',a62);
    end
    fclose(a58);
    pause
end
% jackknife test: each protein in turn is predicted by every one-vs-one
% classifier trained on the remaining proteins
a43=[];
for a18=1:1:size(a4,1)
    tic%%
    a42=1;
    for a19=1:1:(max(a7)-1)
        for a20=(a19+1):1:(max(a7))
            a21=a8(find(a7==a19),1);
            a22=a8(find(a7==a20),1);
            a26=a25(find(a7==a19),:);
            a27=a25(find(a7==a20),:);
            a23=strcmp([a21;a22],a8(a18,1));
            a24=length(find(a23==1));
            a29=[a19*ones(length(find(a7==a19)),1);a20*ones(length(find(a7==a20)),1)];
            a28=[a26;a27];
            if a24==0
                a31=a29;
                a32=a28;
            else
                % remove every copy of the test protein from the training set
                a30=find(a23==0);
                a31=a29(a30,1);
                a32=a28(a30,:);
            end
            a33=a7(a18,1);
            a34=a25(a18,:);
            a35=svmtrain(a31,full(a32),a36);
            [a37,a38,a39]=svmpredict(a33,full(a34),a35);
            a40=knnclassify(a34,a32,a31,a41);
            a43(a18,a42,1)=a37;   % SVM prediction of this pair
            a43(a18,a42,2)=a40;   % KNN prediction of this pair
            a42=a42+1;
        end
    end
    a18;%%
    toc%%
end
a43;
a44=a43(:,:,1);save(strcat(['result_libsvm',' ',a36,' ',a63,' top',num2str(a56),'.txt']),'a44','-ascii');
a45=a43(:,:,2);save(strcat(['result_knn',' -k ',num2str(a41),' ',a63,' top',num2str(a56),'.txt']),'a45','-ascii');
% evaluation: b13(site,:,method) holds TP, TN, FP, FN and MCC
b1=[];
b1(:,:,1)=a44;
b1(:,:,2)=a45;
b13=zeros(max(a7),5,size(b1,3));
for b2=1:1:size(b1,3)
    b3=b1(:,:,b2);
    b9=zeros(size(b3,1),1);
    for b4=1:1:size(b3,1)
        if b9(b4,1)==1
            continue   % another row of a multi-site protein that was already counted
        end
        b7=[];
        for b5=1:1:max(a7)
            b6=length(find(b3(b4,:)==b5));
            b7=[b7,b6];
        end
        b8=find(b7==max(b7));    % predicted site(s): most votes over all pairs
        b10=strcmp(a8,a8(b4,1));
        b11=find(b10);
        if length(b11)>1
            b9(b11,1)=1;
        end
        b12=a7(b11,1);           % true site(s) of this protein
        for b14=1:1:size(b8,2)
            if length(find(b12==b8(1,b14)))==1
                b13(b8(1,b14),1,b2)=b13(b8(1,b14),1,b2)+1;
            else
                b13(b8(1,b14),3,b2)=b13(b8(1,b14),3,b2)+1;
            end
        end
        for b15=1:1:size(b12,1)
            if length(find(b8==b12(b15,1)))==1
                %b13(b12(b15,1),1,b2)=b13(b12(b15,1),1,b2)+1;
            else
                b13(b12(b15,1),4,b2)=b13(b12(b15,1),4,b2)+1;
            end
        end
        for b17=1:1:max(a7)
            b16=[b8';b12];
            b18=length(find(b16==b17));
            if b18==0
                b13(b17,2,b2)=b13(b17,2,b2)+1;
            end
        end
    end
    for b19=1:1:max(a7)
        b21=b13(b19,1,b2);%TP
        b22=b13(b19,2,b2);%TN
        b23=b13(b19,3,b2);%FP
        b24=b13(b19,4,b2);%FN
        b13(b19,5,b2)=(b21*b22-b23*b24)/(sqrt((b21+b23)*(b21+b24)*(b22+b23)*(b22+b24)));%MCC
    end
    b20=sum(b13(:,1,b2))/size(a7,1)%overall accuracy
end
b13
f=b13;
Save and run the above file. Before the first pause set in the program, the overall feature file that combines the features from the preceding steps, features_yesGO.txt, is generated.
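The feature file is written in the sparse text format read by the LIBSVM tools, one protein per line as `label index:value ...` with 1-based feature indices. The following Python sketch shows the same layout; the function name `libsvm_line` is hypothetical, and, like the MATLAB listing, it writes every feature, including zeros.

```python
def libsvm_line(label, features):
    """Format one protein as 'label 1:v1 2:v2 ...' (feature indices start at 1)."""
    parts = [str(label)]
    for j, value in enumerate(features, start=1):
        parts.append(f"{j}:{value:g}")
    return " ".join(parts)


print(libsvm_line(3, [0.5, 0, 2]))  # 3 1:0.5 2:0 3:2
```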
⑵. Feature scoring and ranking (see Fig. 2, Fig. 3 and Fig. 4): taking the Euk7579 protein dataset as an example, the file features_yesGO.txt contains the GO, AAC, AAP, hydrophobicity and hydrophilicity features, and the total feature dimension is 6533 + 20 + 400 + 2 × 9 = 6971. Install Python on the server, and download the feature scoring program fselect.py from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Place fselect.py in the same folder as the file features_yesGO.txt.
Open Xshell, change to this folder, and enter the following command:
python fselect.py features_yesGO.txt (then press Enter)
This generates the file features_yesGO.txt.fscore, which records the score of each feature.
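To make the idea of a per-feature score concrete, the sketch below computes a two-class F-score, the ratio of between-class separation to within-class variance for a single feature. This is an assumption about what kind of score is meant: the patent only names fselect.py, and the exact (multi-class) scoring that tool applies may differ. The function names `f_score` and `rank_features` are hypothetical.

```python
def f_score(pos, neg):
    """Two-class F-score of one feature; pos and neg hold its values in each class (>= 2 each)."""
    mean_all = (sum(pos) + sum(neg)) / (len(pos) + len(neg))
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (len(pos) - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (len(neg) - 1)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = var_pos + var_neg
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y, positive):
    """Return (1-based feature index, score) pairs sorted from highest to lowest score."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == positive]
        neg = [row[j] for row, label in zip(X, y) if label != positive]
        scores.append((j + 1, f_score(pos, neg)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


X = [[1, 0.0], [3, 0.1], [5, 0.0], [7, 0.1]]
y = [1, 1, 2, 2]
print(rank_features(X, y, positive=1))  # feature 1 separates the classes, feature 2 does not
```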
⑶. Generating the Top features (i.e. the high-scoring features): using the file features_yesGO.txt.fscore obtained in ⑵ as the basis for the feature scores, run the file5_libsvm_knn_jackknife_linux.m program again to generate the Top features. For each dataset, the top 10 to top 60 features by score (in steps of 10) are generated, which produces the following Top feature files:
top10_features_yesGO.txt
top20_features_yesGO.txt
top30_features_yesGO.txt
top40_features_yesGO.txt
top50_features_yesGO.txt
top60_features_yesGO.txt
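The selection of the six Top feature sets can be sketched as follows, assuming the scores are already available as a mapping from 1-based feature index to score. The names `top_feature_subsets` and `project` are hypothetical; in the patent this step is carried out by the MATLAB listing.

```python
def top_feature_subsets(scores, sizes=range(10, 61, 10)):
    """Map each size k (10, 20, ..., 60) to the indices of the k highest-scoring features."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {k: ranked[:k] for k in sizes}


def project(row, indices):
    """Keep only the selected features of one protein (indices are 1-based)."""
    return [row[i - 1] for i in indices]


scores = {i: 100 - i for i in range(1, 101)}  # toy scores: feature 1 is the best
subsets = top_feature_subsets(scores)
print(sorted(subsets))   # [10, 20, 30, 40, 50, 60]
print(subsets[10])       # [1, 2, ..., 10]
```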
⑷. SVM parameter optimization for the Top features: download the program grid.py and the file gp442win32.zip from http://www.csie.ntu.edu.tw/~cjlin/. Take the Euk7579 protein dataset as an example. After changing to the Euk7579 dataset directory, enter python grid.py top10_features_yesGO.txt and press Enter. Running grid.py yields the optimized parameters C and γ of the SVM algorithm for the Top10 features of the Euk7579 dataset. In the same way, the optimal C and γ can be determined for the other Top feature sets of the Euk7579 dataset and for the Top feature sets of the other protein datasets. Since there are 3 datasets and each has 6 Top feature sets (Top10 to Top60), 18 sets of optimization results are obtained in total, i.e. 18 pairs of optimal C and γ values.
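The search that grid.py performs can be pictured as an exhaustive scan over powers of two for C and γ, keeping the pair with the best cross-validation accuracy. The sketch below is not grid.py itself: `grid_search` is a hypothetical function, the exponent ranges are assumed defaults, and `toy_accuracy` merely stands in for a real cross-validation run.

```python
import math


def grid_search(cv_accuracy,
                log2c=range(-5, 16, 2),
                log2g=range(3, -16, -2)):
    """Return (C, gamma, accuracy) for the best grid point.

    cv_accuracy(C, gamma) must return the cross-validation accuracy
    of an RBF-kernel SVM trained with those parameters.
    """
    best = (None, None, float("-inf"))
    for lc in log2c:
        for lg in log2g:
            c, gamma = 2.0 ** lc, 2.0 ** lg
            accuracy = cv_accuracy(c, gamma)
            if accuracy > best[2]:
                best = (c, gamma, accuracy)
    return best


def toy_accuracy(c, gamma):
    """Stand-in for cross-validation: peaks at C = 2^3, gamma = 2^-5."""
    return 1.0 - 0.01 * (abs(math.log2(c) - 3) + abs(math.log2(gamma) + 5))


best_c, best_gamma, best_acc = grid_search(toy_accuracy)
print(best_c, best_gamma, best_acc)
```

Running this search once per Top feature file and per dataset corresponds to the 18 pairs of C and γ mentioned above.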
⑸. Determining the Top feature dimension Dim: because the Virus252 dataset requires the shortest program running time and is easy to handle, it is used as the standard for determining the Top feature dimension. Change to the Virus252 dataset directory and run the file5_libsvm_knn_jackknife_linux.m program again to obtain the prediction accuracy of the SVM method, with optimized C and γ, for Top10 to Top60. The accuracies of the Top feature sets are compared, and the Top feature dimension with the highest accuracy is taken as the final dimension. The optimal C and γ of the Virus252 dataset at this dimension are taken as the final parameters of the SVM method.
⑹. Determining the parameter K of the KNN method: change to the Virus252 dataset directory and run the file5_libsvm_knn_jackknife_linux.m program again. The prediction accuracy is computed for each value of K from 1 to 10 in the KNN method, and the K value with the highest accuracy is taken as the final parameter of the KNN method. Then, with the combination of optimal parameters C, γ, K and Dim determined in the above steps, run the file5_libsvm_knn_jackknife_linux.m program again to compute the prediction accuracy of the KNN and SVM methods on the other two protein datasets.
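The choice of K by jackknife (leave-one-out) accuracy can be sketched in plain Python with a Euclidean-distance KNN. This is an illustrative re-implementation, not the patent's MATLAB program: the function names are hypothetical, vote ties are broken in favor of the nearest neighbor, and ties between K values are broken in favor of the smallest K, neither of which the patent specifies.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority label among the k training points nearest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def jackknife_accuracy(X, y, k):
    """Leave each sample out in turn, predict it from the rest, and return the hit rate."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


def choose_k(X, y, candidates=range(1, 11)):
    """Return the candidate K with the highest jackknife accuracy."""
    return max(candidates, key=lambda k: jackknife_accuracy(X, y, k))


X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [1, 1, 1, 2, 2, 2]
print(choose_k(X, y))
```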
⑺. Predicting protein subcellular sites with the KNN-SVM ensemble classifier: taking the Virus252 dataset as an example, it contains 8 subcellular sites, so the one-vs-one strategy forms 8 × (8-1)/2 = 28 binary classifiers. Within each classifier, the prediction of whichever method, KNN or SVM, has the higher prediction accuracy is taken as the prediction of that classifier, and these 28 predictions are then combined to obtain the fused prediction. For a viral protein P of unknown site, the fused prediction is the site to which protein P is assigned most often, and this site is the predicted subcellular site of protein P.
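The fusion rule of this step can be sketched as follows: for every binary classifier, keep the prediction of the more accurate method, then take the site that collects the most votes. The function name `fuse` and the dictionary keys are hypothetical, and preferring SVM when both accuracies are equal is an assumption, since the text does not say how such ties are resolved.

```python
from collections import Counter


def fuse(pair_results):
    """Fuse the per-pair predictions for one protein into a single site.

    pair_results: one dict per binary classifier with the keys
      'knn_acc', 'svm_acc'   : accuracy of each method for that pair
      'knn_pred', 'svm_pred' : site each method predicts for the protein
    """
    votes = Counter()
    for result in pair_results:
        if result["svm_acc"] >= result["knn_acc"]:
            chosen = result["svm_pred"]
        else:
            chosen = result["knn_pred"]
        votes[chosen] += 1
    return votes.most_common(1)[0][0]


# Toy example with 3 sites, i.e. 3 * (3 - 1) / 2 = 3 binary classifiers.
results = [
    {"knn_acc": 0.90, "svm_acc": 0.80, "knn_pred": 1, "svm_pred": 2},  # pair (1, 2): KNN kept
    {"knn_acc": 0.70, "svm_acc": 0.85, "knn_pred": 3, "svm_pred": 1},  # pair (1, 3): SVM kept
    {"knn_acc": 0.60, "svm_acc": 0.60, "knn_pred": 2, "svm_pred": 3},  # pair (2, 3): tie, SVM kept
]
print(fuse(results))  # site 1 receives two of the three votes
```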
References:
[1] Nakashima H, Nishikawa K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol 1994;238(1):54-61.
[2] Gu Q, Ding YS, Jiang XY, Zhang TL. Prediction of subcellular location apoptosis proteins with ensemble classifier and feature selection. Amino Acids 2010;38(4):975-83.
[3] Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003;19(13):1656-63.
[4] Zhou XB, Chen C, Li ZC, Zou XY. Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J Theor Biol 2007;248(3):546-51.
[5] Chou KC, Cai YD. Using functional domain composition and support vector machines for prediction of protein subcellular location. J Biol Chem 2002;277(48):45765-9.
[6] Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res 2004;32(Web Server issue):W414-9.
[7] Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 2000;300(4):1005-16.
[8] Emanuelsson O, Brunak S, von Heijne G, Nielsen H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2007;2(4):953-71.
[9] Rashid M, Saha S, Raghava GP. Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics 2007;8:337.
[10] Lin TH, Murphy RF, Bar-Joseph Z. Discriminative motif finding for predicting protein subcellular localization. IEEE/ACM Trans Comput Biol Bioinform 2011;8(2):441-51.
[11] Zou L, Wang Z, Huang J. Prediction of subcellular localization of eukaryotic proteins using position-specific profiles and neural network with weighted inputs. J Genet Genomics 2007;34(12):1080-7.
[12] Li L, Zhang Y, Zou L, Li C, Yu B, Zheng X, et al. An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity. PLoS One 2012;7(1):e31057.
[13] Li LQ, Zhang Y, Zou LY, Zhou Y, Zheng XQ. Prediction of protein subcellular multi-localization based on the general form of Chou's pseudo amino acid composition. Protein Pept Lett 2012;19(4):375-87.
[14] Mount DW. Using hidden Markov models to align multiple sequences. Cold Spring Harb Protoc 2009;2009(7):pdb top41.
[15] Marinov M, Weeks DE. The complexity of linkage analysis with neural networks. Hum Hered 2001;51(3):169-76.
[16] Shen HB, Yang J, Chou KC. Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction. Amino Acids 2007;33(1):57-67.
[17] Bulashevska A, Eils R. Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains. BMC Bioinformatics 2006;7:298.
[18] Khan A, Majid A, Hayat M. CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 2011;35(4):218-29.
[19] Shen HB, Chou KC. Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. J Theor Biol 2010;264(2):326-33.
[20] Shen HB, Chou KC. Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites. J Biomol Struct Dyn 2010;28(2):175-86.
[21] Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004;32(Database issue):D258-61.
[22] Chou KC, Shen HB. Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers. J Proteome Res 2006;5(8):1888-97.

Claims (2)

1. protein subcellular fraction site estimation method of forming based on the pseudo-amino acid in improvement week; It is characterized in that; Said Forecasting Methodology is the integrated prediction method based on K nearest neighbor method (KNN) and SVMs method (SVM), and the KNN method adopts Euclidean distance, and the SVM method adopts the RBF kernel function; And adopt the grid.py method to carry out parameter optimization, said method called after KNN-SVM integrated classifier; Said Forecasting Methodology may further comprise the steps:
(1) foundation of protein training dataset: training dataset comprises eukaryotic protein data set, protokaryon protein data collection and virus protein data set, and data centralization contains protein numbering, affiliated subcellular fraction site numbering and amino acid sequence;
(2) GO feature extraction: the GO data set is from ftp: //ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/ downloads and obtains, and only needs retaining protein numbering and corresponding GO numbering thereof;
(3) amino acid of protein is formed AAC, amino acid to forming AAP, hydrophobic, hydrophilic feature extraction: utilize the pseudo-amino acid formational theory of both sexes in week can obtain AAC, AAP, hydrophilic, hydrophobic character;
The subcellular fraction site of (4) adopting KNN-SVM integrated classifier predicted data to concentrate protein:
1. proper vector is set up: the equal corresponding GO of each protein, AAC, AAP, hydrophobic, hydrophilic this five Partial Feature, and these characteristics have constituted the proper vector of each protein;
2. characteristic marking, ordering: use the marking of fselect.py characteristic, sort method, each characteristic is given a mark, sort from high to low according to mark again;
3. generating the Top characteristic is the high score characteristic: with the feature scores ordering is foundation, 10 to be the interval, and the simplification feature set of the Top characteristic of getting mark row preceding 10 to preceding 60 after as Feature Selection;
4. the SVM parameter optimization of Top characteristic: use the grid.py method that the parameters C in the SVM method, γ are optimized;
5. confirming of Top characteristic dimension Dim: use the SVM method to calculate the predictablity rate of Top10 to Top60 respectively, the corresponding predictablity rate of each Top characteristic relatively confirms that the corresponding Top characteristic dimension Dim of high-accuracy is final dimension;
6. the confirming of parameter K in the KNN method: use the KNN method predictablity rate of calculating parameter K from 1 to 10 respectively, the corresponding predictablity rate of each K value relatively is with the K value of the high-accuracy correspondence parameter K value as the KNN method;
7. KNN-SVM integrated classifier predicted protein matter subcellular fraction site: for the protein data collection of known site, because it includes n subcellular fraction site, n >=1 is so adopt the two types of sorters of the 1 couple 1 strategy meeting formation n * (n-1)/2; Select predicting the outcome of the higher method of accuracy rate according to the height of KNN method in each sorter and SVM method predictablity rate, again these stacks that predict the outcome are obtained merging and predict the outcome as the predicting the outcome of this sorter; For the virus protein P in unknown site, merging predicts the outcome is the subcellular fraction site that protein P is predicted with the maximum site of protein P sensing number of times.
2. The prediction method according to claim 1, characterized in that the ensemble classifier optimizes the parameters of the SVM and KNN methods on a virus protein data set and applies the optimized parameter values as the standard in the prediction of all protein data sets, and in that the predictions of the two methods are compared and fused, which distinguishes it from an ensemble classifier built from a single prediction method.
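The comparison and fusion of the two methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each trained pairwise KNN or SVM model is represented as a plain callable that maps a feature vector to one of its two sites, the accuracies are supplied as dictionaries keyed by site pair, and the names choose_pairwise_models and predict_site are hypothetical.

```python
from collections import Counter
from itertools import combinations


def choose_pairwise_models(sites, knn_models, svm_models, knn_accuracy, svm_accuracy):
    """For every pair of sites, keep whichever method scored the higher accuracy."""
    chosen = {}
    for pair in combinations(sites, 2):  # n * (n - 1) / 2 pairs for n sites
        # Assumption: on an exact tie in accuracy the KNN model is kept.
        if knn_accuracy[pair] >= svm_accuracy[pair]:
            chosen[pair] = knn_models[pair]
        else:
            chosen[pair] = svm_models[pair]
    return chosen


def predict_site(chosen, features):
    """Let each selected pairwise model vote once and return the most-voted site."""
    votes = Counter(model(features) for model in chosen.values())
    return votes.most_common(1)[0][0]
```

With three sites there are three pairwise slots, so each protein receives three votes. If two sites receive the same number of votes, this sketch returns the one that was voted for first, which is an assumption; the claim does not state a tie-breaking rule.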
CN2012102934168A 2012-08-17 2012-08-17 Prediction method for protein subcellular site formed based on improved-period pseudo amino acid Pending CN102819693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102934168A CN102819693A (en) 2012-08-17 2012-08-17 Prediction method for protein subcellular site formed based on improved-period pseudo amino acid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012102934168A CN102819693A (en) 2012-08-17 2012-08-17 Prediction method for protein subcellular site formed based on improved-period pseudo amino acid

Publications (1)

Publication Number Publication Date
CN102819693A true CN102819693A (en) 2012-12-12

Family

ID=47303803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012102934168A Pending CN102819693A (en) 2012-08-17 2012-08-17 Prediction method for protein subcellular site formed based on improved-period pseudo amino acid

Country Status (1)

Country Link
CN (1) CN102819693A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268576A (en) * 2014-10-11 2015-01-07 国家电网公司 Electric system transient stability classification method based on TNN-SVM
CN104899477A (en) * 2015-06-18 2015-09-09 江南大学 Protein subcellular interval prediction method using bag-of-word model
CN105447340A (en) * 2015-07-21 2016-03-30 郑州轻工业学院 Protein subchloroplast multi-position prediction method
CN105760711A (en) * 2016-02-02 2016-07-13 江南大学 Method for using KNN calculation and similarity comparison to predict protein subcellular section
CN108830042A (en) * 2018-06-13 2018-11-16 深圳大学 A kind of feature extraction based on multi-modal protein sequence and coding method and system
CN112201300A (en) * 2020-10-23 2021-01-08 天津大学 Protein subcellular localization method based on depth image features and threshold learning strategy

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
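LIQI LI et al.: "An ensemble classifier for eukaryotic protein subcellular location prediction using Gene Ontology categories and amino acid hydrophobicity", 《PLOS ONE》 *
刘洪霖 et al.: "Artificial Intelligence Optimization of Chemical and Metallurgical Processes", 31 January 1999, Metallurgical Industry Press *
姜小莹 et al.: "Prediction of protein subcellular localization based on pseudo amino acids and support vector machines", 《Journal of Guangxi Agricultural and Biological Science》 *
孙亮 et al.: "Principles of Pattern Recognition", 28 February 2009, Beijing University of Technology Press *
李立奇 et al.: "Research progress in the prediction of protein subcellular localization", 《Immunological Journal》 *
樊玉才 et al.: "Subcellular localization of apoptosis proteins based on an improved GO-PseAA method", 《Journal of Inner Mongolia University of Technology》 *
湖北省土地学会: "Research on Land Issues in Ecological Civilization", 31 December 2008, Hubei Science and Technology Press *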

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268576A (en) * 2014-10-11 2015-01-07 国家电网公司 Electric system transient stability classification method based on TNN-SVM
CN104899477A (en) * 2015-06-18 2015-09-09 江南大学 Protein subcellular interval prediction method using bag-of-word model
CN104899477B (en) * 2015-06-18 2018-01-26 江南大学 A kind of Protein Subcellular interval prediction method using bag of words
CN105447340A (en) * 2015-07-21 2016-03-30 郑州轻工业学院 Protein subchloroplast multi-position prediction method
CN105760711A (en) * 2016-02-02 2016-07-13 江南大学 Method for using KNN calculation and similarity comparison to predict protein subcellular section
CN108830042A (en) * 2018-06-13 2018-11-16 深圳大学 A kind of feature extraction based on multi-modal protein sequence and coding method and system
CN108830042B (en) * 2018-06-13 2021-09-21 深圳大学 Feature extraction and coding method and system based on multi-modal protein sequence
CN112201300A (en) * 2020-10-23 2021-01-08 天津大学 Protein subcellular localization method based on depth image features and threshold learning strategy

Similar Documents

Publication Publication Date Title
Hong et al. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning
Su et al. iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC
Liu et al. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC
Huang et al. BERMP: a cross-species classifier for predicting m6A sites by integrating a deep learning algorithm and a random forest approach
Zhang et al. RBPPred: predicting RNA-binding proteins from sequence using SVM
Afiahayati et al. MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning
Jamali et al. DrugMiner: comparative analysis of machine learning algorithms for prediction of potential druggable proteins
Ali et al. Alignment-free protein interaction network comparison
Lan et al. Computational approaches for prioritizing candidate disease genes based on PPI networks
Philips et al. LigandRNA: computational predictor of RNA–ligand interactions
Xu et al. Peer: a comprehensive and multi-task benchmark for protein sequence understanding
Zhu et al. A comprehensive comparison and analysis of computational predictors for RNA N6-methyladenosine sites of Saccharomyces cerevisiae
CN102819693A (en) Prediction method for protein subcellular site formed based on improved-period pseudo amino acid
Dong et al. Identification of DNA-binding proteins by auto-cross covariance transformation
Tang et al. A boosting approach for prediction of protein-RNA binding residues
Zhao et al. A new method for predicting protein functions from dynamic weighted interactome networks
Zhang et al. RNAcmap: a fully automatic pipeline for predicting contact maps of RNAs by evolutionary coupling analysis
Zhou et al. Deep learning reveals many more inter-protein residue-residue contacts than direct coupling analysis
Yang et al. HCRNet: high-throughput circRNA-binding event identification from CLIP-seq data using deep temporal convolutional network
Liu et al. Predicting the multi-label protein subcellular localization through multi-information fusion and MLSI dimensionality reduction based on MLFE classifier
Chen et al. An automated RNA‐Seq analysis pipeline to identify and visualize differentially expressed genes and pathways in CHO cells
Wang et al. Identification of hormone-binding proteins using a novel ensemble classifier
Tahir et al. An intelligent computational model for prediction of promoters and their strength via natural language processing
Qiu et al. Prediction of protein–protein interaction sites using patch-based residue characterization
Acera Mateos et al. Concepts and methods for transcriptome-wide prediction of chemical messenger RNA modifications with machine learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20121212