CN106778070A - A method for predicting the subcellular location of human proteins - Google Patents

Publication number
CN106778070A
Authority
CN
China
Prior art keywords
protein
feature
features
sequence
correlation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710204499.1A
Other languages
Chinese (zh)
Inventor
沈红斌
周航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201710204499.1A
Publication of CN106778070A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for predicting the subcellular location of human proteins. It predicts the subcellular location of a protein from its sequence, using a human protein subcellular classification algorithm optimized with Gene Ontology (GO) features and conserved-domain correlations. First, the sequence statistical features (amino acid composition features and normalized position-specific scoring matrix features), the conserved domain features, and the GO features of the protein are obtained from its sequence. Next, a feature subset is extracted from the sequence statistical features with the CFS feature selection method; similarity measures are computed for the conserved domain features and the GO features, and probability information is calculated with a weighted KNN method. Finally, the resulting features are integrated and classified with SVM classifiers.

Description

A method for predicting the subcellular location of human proteins
Technical Field
The invention belongs to the field of bioinformatics and relates in particular to a method for predicting the subcellular location of human proteins.
Background
Knowing the subcellular location of a protein is important for understanding its function, its interactions with other proteins, and the targeted delivery of drugs. Determining subcellular location experimentally, however, takes a great deal of time and money, so predictors that can annotate large numbers of proteins are valuable. By our count, the SWISS-PROT protein database released in February 2016 contains 550,552 proteins, of which only 10.4% have an experimentally verified subcellular location. The remaining proteins, whose subcellular location is unknown, urgently need a reliable prediction method.
Many tools for predicting protein subcellular location already exist. Commonly used web servers include BaCelLo, YLoc, MultiLoc, GOASVM, WoLF PSORT, Cell-PLoc, and HSLPred. These predictors have been a great convenience to biologists in related fields.
Subcellular location information is frequently used in gene therapy and in targeted drug therapy. For example, the role of the Hippo/YAP pathway in the evolution of pediatric hepatocellular carcinoma has been studied by examining the expression and subcellular localization of the YAP protein in tumors. Easy-to-use, high-accuracy predictors would therefore greatly help such laboratories in their clinical research. The web server we released previously, Hum-mPLoc 2.0, was designed specifically for predicting the location of human proteins. Its annual usage grew from 20,000 requests in 2010 to more than 80,000 in 2015. This indicates that, to provide a better prediction service, it is worthwhile to further strengthen predictive power with new techniques and with more comprehensive and accurate annotation databases.
Computational methods for predicting protein subcellular location generally fall into two classes: methods based on homology search and methods based on machine learning. A homology-search method can be viewed as nearest-neighbor prediction in which the distance between two proteins is measured by their sequence homology. By computing the homology between the query protein and a large number of sequences annotated with subcellular location, the method finds the K most similar proteins and transfers their annotations to the query protein as the classification result. This is a fairly direct approach, but its performance depends heavily on whether a highly similar homologous sequence with an existing subcellular location annotation can be found. Moreover, two proteins with highly similar sequences can sometimes have very different structures or functions, which causes the method to fail.
Predictors based on machine learning are a more flexible class of models for protein subcellular location prediction. They require a training data set, from which classification rules are learned by a statistical learning algorithm. The quality of the training data is therefore closely tied to the quality of the learned rules. Thanks to the growing number of increasingly reliable subcellular location annotations in protein databases, large training sets can now be collected to train classification models more thoroughly. Another important issue in machine learning models is how to encode the protein sequence. Because most algorithms take a feature vector as input, how features are extracted from the raw protein sequence and from associated prior knowledge is critical to the final performance of the classifier. Existing machine learning tools for predicting subcellular location use the following kinds of features:
(1) residue-based statistical features, such as pseudo amino acid composition and the position-specific scoring matrix;
(2) features based on signal peptides and functional domains;
(3) features based on database annotation, such as Gene Ontology (GO) features.
Because GO features are a high-level abstraction of domain knowledge, they are generally more accurate than features extracted from sequence when enough annotations are available. The large volume of annotation data, however, poses new algorithmic challenges. For example, applying a Bernoulli event model to each GO feature, that is, binary coding of whether the GO feature is present, often produces a feature space of very high dimension. As the GO database is regularly extended and updated, the dimension keeps growing with our knowledge of proteins. High-dimensional feature vectors increase the complexity of the machine learning process, and the influence of potential noise in the annotation database must also be considered. Although the GO database as a whole is huge, each protein actually carries only a few GO features. By our count, the proteins in the SWISS-PROT database that have at least one GO feature possess 6 GO annotations on average. In other words, the GO feature of a protein is a sparse vector with thousands of dimensions but only about 6 GO annotations. Different methods have been proposed in the field to deal with this problem. For example, YLoc selects only the GO annotations and PROSITE patterns that are clearly correlated with a specific subcellular location. This removes unnecessary features and makes the results easier to interpret, but it also loses information. WegoLoc assigns a weight to each GO feature to highlight the prominent GO features.
Summary of the Invention
The invention provides a method for predicting the subcellular location of human proteins. Its purpose is to improve the prediction accuracy of human protein subcellular classifiers by exploiting the latent correlation information between annotation features.
A method for predicting the subcellular location of human proteins, which predicts protein subcellular location from the human protein sequence and comprises the following steps:
First step: using the human protein sequence information, extract residue statistical features from the full-length sequence and from multiple N-terminal and C-terminal sequence fragments. These comprise amino acid composition features and position-specific scoring matrix features obtained from protein homology information, the latter being normalized. After the two kinds of features are combined, their dimensionality is reduced with Correlation-based Feature Selection, a supervised feature selection algorithm;
Second step: extract the GO features of all human proteins in the protein database, and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);
Third step: search the Swiss-Prot database for homologous proteins with BLAST and extract the GO features of the homologous proteins; obtain the GO features of the proteins in the training set in the same way;
Fourth step: divide the three parts (BP, MF, CC) of the protein GO features into seven parts, as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
Fifth step: using the correlations between GO features, compute the correlation between two proteins under each of the seven parts; then, with tuned parameters, use the ten most correlated proteins in the training set in a weighted KNN method to obtain the protein's probability value at each subcellular location;
Sixth step: obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlation between features by information gain to obtain the conserved-domain feature similarity matrix; then obtain the conserved domain features of the target protein with rps-blast to compute the correlation between two proteins and, with tuned parameters, use the ten most correlated proteins in the training set in a weighted KNN method to obtain the protein's probability value at each subcellular location;
Seventh step: fuse the resulting sequence features, the probability features of the seven GO parts, and the conserved-domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the following 12 subcellular locations: centrosome, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, secretory pathway, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, and cell membrane.
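As an illustration of the residue statistics in the first step, the following minimal Python sketch computes amino acid composition vectors for the full-length sequence and for terminal fragments. The function names and the particular fragment lengths (10 and 60 at the N-terminus, 10 and 100 at the C-terminus) are assumptions chosen for illustration; the text only states that several N-terminal and C-terminal fragments are used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(fragment):
    """Return the 20-dimensional residue frequency vector of a fragment."""
    counts = Counter(fragment)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def terminal_fragments(sequence, n_lengths=(10, 60), c_lengths=(10, 100)):
    """Full sequence plus N-terminal prefixes and C-terminal suffixes.

    The default lengths are illustrative assumptions, not values
    confirmed by the source.
    """
    fragments = {"full": sequence}
    for n in n_lengths:
        fragments[f"N{n}"] = sequence[:n]
    for c in c_lengths:
        fragments[f"C{c}"] = sequence[-c:]
    return fragments


def composition_features(sequence):
    """Concatenate the composition vectors of all fragments."""
    features = []
    for fragment in terminal_fragments(sequence).values():
        features.extend(amino_acid_composition(fragment))
    return features
```

With the default lengths this gives five fragments and therefore a 100-dimensional vector, which a feature selection step such as CFS would then reduce.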
A method for predicting the subcellular location of human proteins, which predicts protein subcellular location from the human protein sequence and comprises the following steps:
S101: using the human protein sequence information, extract the amino acid composition features and the normalized PSSM features of the full-length sequence, of N-terminal fragments of 10 to 60 residues, and of C-terminal fragments of 10 to 100 residues, and reduce their dimensionality with CFS. The PSSM is normalized, and each part is converted into a 20-dimensional feature, as follows.
In the PSSM, $S_{i,j}$ is the probability score for the amino acid at position $i$ ($1 \le i \le L$) of the sequence evolving into amino acid type $j$ ($1 \le j \le 20$), and $L$ is the length of the protein sequence.
$$S^{0}_{i,j} = \frac{S_{i,j} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}}{\sqrt{\frac{1}{N-1}\sum_{u=1}^{N}\left(S_{i,u} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}\right)^{2}}} \tag{2}$$
Here $S^{0}_{i,j}$ is the normalized score of the position-specific scoring matrix and $N$ is the number of amino acid types, so $N = 20$ in formula (2).
$$\overline{S^{0}_{j}} = \frac{1}{L}\sum_{i=1}^{L} S^{0}_{i,j} \tag{3}$$
where $\overline{S^{0}_{j}}$ is the average of the normalized scores in column $j$;
$$\overline{S_{PSSM}} = \left[\overline{S^{0}_{1}}, \overline{S^{0}_{2}}, \overline{S^{0}_{3}}, \ldots, \overline{S^{0}_{20}}\right] \tag{4}$$
$\overline{S_{PSSM}}$ is the resulting normalized PSSM feature.
S102: extract the GO features of all human proteins in the Swiss-Prot database, and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);
S103: search the Swiss-Prot database for homologous proteins with BLAST and extract their GO features; obtain the GO features of the proteins in the training set in the same way;
S104: divide the three parts (BP, MF, CC) of the protein GO features into seven parts, as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
S105: using the correlations between GO features, compute the correlation between two proteins under each of the seven parts:
$$Cor(x_i, K) = \max_{1 \le j \le m} Cor(x_i, y_j) \tag{5}$$
where $Cor(x_i, K)$ is the correlation, under the given part, between the GO annotation feature $x_i$ and the $K$-th protein.
$$Sim_k = \sum_{i=1}^{n} Cor(x_i, K)^{2} \tag{6}$$
where $Sim_k$ is the correlation between the $k$-th protein in the training set and the protein to be predicted.
Once the correlations between every training-set protein and the protein to be predicted have been obtained, the ten most correlated training-set proteins are used in a weighted KNN method to obtain the protein's probability value at each subcellular location:
$$pro_a = \frac{\sum_{j \in IN_a} sim_j + \frac{num_a}{num}}{\sum_{i \in IN} sim_i + 1} \tag{7}$$
where $num_a$ and $num$ are, respectively, the number of training-set proteins at the $a$-th subcellular location and the total number of training-set proteins, and $pro_a$ is the probability that the predicted protein is at the $a$-th subcellular location.
S106: obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlation between features by information gain:
$$H(f_i^{cdd}) = -\sum_{m \in \{0,1\}} p(f_i^{cdd} = m) \log p(f_i^{cdd} = m) \tag{9}$$
$$S_{i,j}^{cdd} = \frac{2\left(H(f_i^{cdd}) + H(f_j^{cdd}) - H(f_i^{cdd}, f_j^{cdd})\right)}{H(f_i^{cdd}) + H(f_j^{cdd})} \tag{11}$$
where $H(f_i^{cdd})$ is the entropy of the $i$-th CDD feature, $p(f_i^{cdd} = 1)$ is the probability that the $i$-th CDD feature is present in the protein training set, $H(f_i^{cdd}, f_j^{cdd})$ is the joint entropy of the $i$-th and $j$-th features, and $S_{i,j}^{cdd}$ is the correlation between the $i$-th and $j$-th CDD features.
This yields the conserved-domain feature similarity matrix. The conserved domain features of the target protein are then obtained with rps-blast to compute the correlation between two proteins, and the ten most correlated training-set proteins are used in a weighted KNN method to obtain the protein's probability value at each subcellular location;
S107: fuse the resulting sequence features, the probability features of the seven GO parts, and the conserved-domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the subcellular locations of the protein and the probability at each subcellular location.
In the invention, feature vectors are encoded with GO correlation information rather than with the frequency of GO terms in the annotation. GO features are well known to fall into three parts: biological process (BP), molecular function (MF), and cellular component (CC). All three parts have a hierarchical structure. Based on this hierarchy, many methods for defining the semantic similarity between GO features have been proposed in the field, such as methods based on information entropy and methods based on graph theory. As far as we know, however, few protein subcellular location prediction algorithms currently consider the correlation between GO features. This motivated us to use the hidden correlations between GO features to obtain a better similarity measure between two high-dimensional but sparse GO feature vectors. We propose a new method that exploits the hidden correlations between the annotation features of proteins. To handle cases where, because the GO database is incomplete, a protein to be predicted lacks GO annotations, we also combine statistical features of the protein sequence residues with peptide-based functional domain features extracted from the Conserved Domain Database (CDD). The result is a new predictor, called Hum-mPLoc 3.0, which takes its name from the human protein location predictors we developed previously but is given an entirely new feature representation.
Compared with existing methods in the field, the invention has the following notable advantages:
(1) the model exploits the latent correlations between annotation features, which effectively improves the accuracy of human protein subcellular location prediction;
(2) sequence statistical features, conserved domain features, and GO features are combined, which effectively improves the accuracy of human protein subcellular location prediction.
Brief Description of the Drawings
Fig. 1 is a system structure diagram of the human protein sequence-based prediction method of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawing.
Fig. 1 shows the system structure of the human protein sequence-based prediction method of the invention:
First, the sequence statistical features, the conserved domain features, and the GO features of the protein are obtained from its sequence. Next, a feature subset is extracted from the sequence statistical features with the CFS feature selection method; similarity measures are computed for the conserved domain features and the GO features, and probability information is calculated with a weighted KNN method. Finally, the resulting features are integrated and classified with SVM classifiers. The steps are described in detail below:
S101: using the human protein sequence information, extract the amino acid composition features and the normalized PSSM features of the full-length sequence, of N-terminal fragments of 10 to 60 residues, and of C-terminal fragments of 10 to 100 residues, and reduce their dimensionality with CFS. The PSSM is normalized, and each part is converted into a 20-dimensional feature, as follows.
In the PSSM, $S_{i,j}$ is the probability score for the amino acid at position $i$ ($1 \le i \le L$) of the sequence evolving into amino acid type $j$ ($1 \le j \le 20$), and $L$ is the length of the protein sequence.
$$S^{0}_{i,j} = \frac{S_{i,j} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}}{\sqrt{\frac{1}{N-1}\sum_{u=1}^{N}\left(S_{i,u} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}\right)^{2}}} \tag{2}$$
Here $S^{0}_{i,j}$ is the normalized score of the position-specific scoring matrix and $N$ is the number of amino acid types, so $N = 20$ in formula (2).
$$\overline{S^{0}_{j}} = \frac{1}{L}\sum_{i=1}^{L} S^{0}_{i,j} \tag{3}$$
where $\overline{S^{0}_{j}}$ is the average of the normalized scores in column $j$;
$$\overline{S_{PSSM}} = \left[\overline{S^{0}_{1}}, \overline{S^{0}_{2}}, \overline{S^{0}_{3}}, \ldots, \overline{S^{0}_{20}}\right] \tag{4}$$
$\overline{S_{PSSM}}$ is the resulting normalized PSSM feature.
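The PSSM normalization of S101 can be sketched in a few lines of Python. This is a minimal illustration, assuming the denominator of formula (2) is the sample standard deviation of each row (a square root over the 1/(N-1) sum) and assuming the PSSM is supplied as a list of L rows of 20 scores; the function name is hypothetical, and a real PSSM would come from a PSI-BLAST profile.

```python
import math


def normalize_pssm(pssm):
    """Map an L x 20 PSSM to a 20-dimensional feature vector.

    Each row is standardized (formula 2), then the normalized scores
    are averaged down each column (formulas 3 and 4).
    """
    normalized = []
    for row in pssm:
        n = len(row)  # N = 20 amino acid types in the method described
        mean = sum(row) / n
        variance = sum((s - mean) ** 2 for s in row) / (n - 1)
        std = math.sqrt(variance)
        # Assumption: a constant row has no spread, so map it to zeros
        # instead of dividing by zero.
        normalized.append([(s - mean) / std if std > 0 else 0.0 for s in row])

    length = len(normalized)  # L, the sequence length
    n_cols = len(normalized[0])
    return [sum(r[j] for r in normalized) / length for j in range(n_cols)]
```

Because every row is standardized before averaging, the resulting vector does not depend on the overall scale of the raw scores at each position.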
S102: extract the GO features of all human proteins in the Swiss-Prot database, and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);
S103: search the Swiss-Prot database for homologous proteins with BLAST and extract their GO features; obtain the GO features of the proteins in the training set in the same way;
S104: divide the three parts (BP, MF, CC) of the protein GO features into seven parts, as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
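The seven parts of S104 are simply all non-empty combinations of the three GO namespaces. The sketch below enumerates them; the dictionary layout for a protein's annotations and the helper names are assumptions made for illustration.

```python
from itertools import combinations

GO_NAMESPACES = ("BP", "MF", "CC")


def go_partitions(namespaces=GO_NAMESPACES):
    """Enumerate singletons, pairs, and the triple: seven parts in total."""
    parts = []
    for size in range(1, len(namespaces) + 1):
        parts.extend(combinations(namespaces, size))
    return parts


def terms_for_partition(annotations, partition):
    """Union of a protein's GO terms over the namespaces in one part.

    `annotations` maps a namespace to a set of GO identifiers
    (a hypothetical layout, not one specified by the source).
    """
    merged = set()
    for namespace in partition:
        merged |= annotations.get(namespace, set())
    return merged
```

For a protein annotated with `{"BP": {"GO:0006355"}, "CC": {"GO:0005634"}}`, the part `("BP", "CC")` yields both terms, while `("MF",)` yields an empty set.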
S105: using the correlations between GO features, compute the correlation between two proteins under each of the seven parts:
$$Cor(x_i, K) = \max_{1 \le j \le m} Cor(x_i, y_j) \tag{5}$$
where $Cor(x_i, K)$ is the correlation, under the given part, between the GO annotation feature $x_i$ and the $K$-th protein.
$$Sim_k = \sum_{i=1}^{n} Cor(x_i, K)^{2} \tag{6}$$
where $Sim_k$ is the correlation between the $k$-th protein in the training set and the protein to be predicted.
Once the correlations between every training-set protein and the protein to be predicted have been obtained, the ten most correlated training-set proteins are used in a weighted KNN method to obtain the protein's probability value at each subcellular location:
$$pro_a = \frac{\sum_{j \in IN_a} sim_j + \frac{num_a}{num}}{\sum_{i \in IN} sim_i + 1} \tag{7}$$
where $num_a$ and $num$ are, respectively, the number of training-set proteins at the $a$-th subcellular location and the total number of training-set proteins, and $pro_a$ is the probability that the predicted protein is at the $a$-th subcellular location.
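One way to read formulas (5) to (7) is sketched below in Python. Several points are assumptions rather than statements from the source: `sim` stands in for a lookup into the GOSSTO term similarity matrix, $y_j$ is taken to be the $j$-th GO term of the training protein, formula (6) is read literally as a sum of squared correlations, and $IN$ and $IN_a$ are interpreted as the ten nearest neighbors and the subset of them annotated with location $a$.

```python
def term_to_protein_cor(term, other_terms, sim):
    """Formula (5): best similarity between one term and any term of the other protein."""
    return max((sim(term, t) for t in other_terms), default=0.0)


def protein_similarity(query_terms, train_terms, sim):
    """Formula (6), read as the sum of squared term-to-protein correlations."""
    return sum(term_to_protein_cor(x, train_terms, sim) ** 2 for x in query_terms)


def knn_location_probabilities(query_terms, training_set, sim, locations, k=10):
    """Formula (7): similarity-weighted vote of the k nearest training proteins.

    `training_set` is a list of (go_terms, location_set) pairs.
    """
    scored = sorted(
        ((protein_similarity(query_terms, terms, sim), locs)
         for terms, locs in training_set),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total_sim = sum(score for score, _ in scored)
    num = len(training_set)

    probabilities = {}
    for a in locations:
        # num_a / num acts as a prior taken from the whole training set.
        num_a = sum(1 for _, locs in training_set if a in locs)
        sim_a = sum(score for score, locs in scored if a in locs)
        probabilities[a] = (sim_a + num_a / num) / (total_sim + 1)
    return probabilities
```

The prior term keeps the output defined even when no neighbor shares any correlated GO term with the query, in which case each probability reduces to the fraction of training proteins at that location.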
S106: obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlation between features by information gain:
$$H(f_i^{cdd}) = -\sum_{m \in \{0,1\}} p(f_i^{cdd} = m) \log p(f_i^{cdd} = m) \tag{9}$$
$$S_{i,j}^{cdd} = \frac{2\left(H(f_i^{cdd}) + H(f_j^{cdd}) - H(f_i^{cdd}, f_j^{cdd})\right)}{H(f_i^{cdd}) + H(f_j^{cdd})} \tag{11}$$
where $H(f_i^{cdd})$ is the entropy of the $i$-th CDD feature, $p(f_i^{cdd} = 1)$ is the probability that the $i$-th CDD feature is present in the protein training set, $H(f_i^{cdd}, f_j^{cdd})$ is the joint entropy of the $i$-th and $j$-th features, and $S_{i,j}^{cdd}$ is the correlation between the $i$-th and $j$-th CDD features.
This yields the conserved-domain feature similarity matrix. The conserved domain features of the target protein are then obtained with rps-blast to compute the correlation between two proteins, and the ten most correlated training-set proteins are used in a weighted KNN method to obtain the protein's probability value at each subcellular location;
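The entropy-based correlation of S106 can be illustrated for two binary presence/absence columns over the training proteins. The sketch assumes each CDD feature is coded as 0 or 1 per protein and uses base-2 logarithms; the base cancels in formula (11), so that choice does not affect the result. The function names are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a discrete distribution (formula 9 for a binary feature)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def cdd_correlation(column_i, column_j):
    """Formula (11) for two equal-length 0/1 columns."""
    n = len(column_i)

    def marginal(column):
        p_present = sum(column) / n
        return [1 - p_present, p_present]

    joint_counts = {}
    for a, b in zip(column_i, column_j):
        joint_counts[(a, b)] = joint_counts.get((a, b), 0) + 1

    h_i = entropy(marginal(column_i))
    h_j = entropy(marginal(column_j))
    h_ij = entropy([count / n for count in joint_counts.values()])

    if h_i + h_j == 0:
        # Assumption: two constant features share no information.
        return 0.0
    return 2 * (h_i + h_j - h_ij) / (h_i + h_j)
```

Two identical columns score 1 and two statistically independent columns score 0, so the value behaves as a normalized similarity between domain features.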
S107: fuse the resulting sequence features, the probability features of the seven GO parts, and the conserved-domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the subcellular locations of the protein and the probability at each subcellular location.
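The data preparation behind S107 can be sketched as follows. Binary Relevance turns one multi-label problem into 12 independent yes/no problems, one per location, each of which would then be handed to a binary classifier such as an SVM. The sketch covers only feature fusion and label decomposition; the classifier itself is omitted, and the layout of seven GO probability vectors of length 12 plus one conserved-domain vector of length 12 is an assumption inferred from the steps above.

```python
LOCATIONS = [
    "centrosome", "cytoplasm", "cytoskeleton", "endoplasmic reticulum",
    "endosome", "secretory pathway", "Golgi apparatus", "lysosome",
    "mitochondrion", "nucleus", "peroxisome", "cell membrane",
]


def fuse_features(sequence_features, go_probabilities, cdd_probabilities):
    """Concatenate sequence features, per-part GO probabilities, and CDD probabilities."""
    fused = list(sequence_features)
    for part in go_probabilities:  # one probability vector per GO part
        fused.extend(part)
    fused.extend(cdd_probabilities)
    return fused


def binary_relevance_targets(label_sets, locations=LOCATIONS):
    """Build one 0/1 target vector per location from multi-label annotations.

    Each vector would be the training target of one binary classifier.
    """
    return {
        location: [1 if location in labels else 0 for labels in label_sets]
        for location in locations
    }
```

A protein annotated with several locations contributes a positive example to each of the corresponding classifiers, which is how the approach handles proteins with multiple subcellular locations.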
Example:
Consider an input sequence with the following data:
>query protein 1; example of multiple subcellular locations
MSAVGAATPYLHHPGDSHSGRVSFLGAQLPPEVAAMARLLGDLDRSTFRKLLKFVVSSLQGEDCREAV
QRLGVSANLPEEQLGALLAGMHTLLQQALRLPPTSLKPDTFRDQLQELCIPQDLVGDLASVVFGSQRPLLDSVAQQQ
GAWLPHVADFRWRVDVAISTSALARSLQPSVLMQLKLSDGSAYRFEVPTAKFQELRYSVALVLKEMADLEKRCERRL
QD
This is the sequence to be predicted. The output of the software implementing the method of the invention is as follows:
The results show that the method effectively and accurately predicts the subcellular locations of this human protein.
The above embodiment does not limit the invention in any way; any technical solution obtained by equivalent substitution or equivalent transformation falls within the scope of protection of the invention.

Claims (2)

1. A method for predicting the subcellular location of human proteins, which predicts protein subcellular location from the human protein sequence, characterized in that it comprises the following steps:
First step: using the human protein sequence information, extract residue statistical features from the full-length sequence and from multiple N-terminal and C-terminal sequence fragments, comprising amino acid composition features and position-specific scoring matrix features obtained from protein homology information, the latter being normalized; after the two kinds of features are combined, reduce their dimensionality with Correlation-based Feature Selection, a supervised feature selection algorithm;
Second step: extract the GO features of all human proteins in the protein database, and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);
Third step: search the Swiss-Prot database for homologous proteins with BLAST and extract the GO features of the homologous proteins; obtain the GO features of the proteins in the training set in the same way;
Fourth step: divide the three parts (BP, MF, CC) of the protein GO features into seven parts, as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
Fifth step: using the correlations between GO features, compute the correlation between two proteins under each of the seven parts; then, with tuned parameters, use the ten most correlated proteins in the training set in a weighted KNN method to obtain the protein's probability value at each subcellular location;
Sixth step: obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlation between features by information gain to obtain the conserved-domain feature similarity matrix; then obtain the conserved domain features of the target protein with rps-blast to compute the correlation between two proteins and, with tuned parameters, use the ten most correlated proteins in the training set in a weighted KNN method to obtain the protein's probability value at each subcellular location;
Seventh step: fuse the resulting sequence features, the probability features of the seven GO parts, and the conserved-domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the following 12 subcellular locations: centrosome, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, secretory pathway, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, and cell membrane.
2. A method for predicting the subcellular location of human proteins, which predicts protein subcellular location from the human protein sequence, characterized in that it comprises the following steps:
S101: using the human protein sequence information, extract the amino acid composition features and the normalized PSSM features of the full-length sequence, of N-terminal fragments of 10 to 60 residues, and of C-terminal fragments of 10 to 100 residues, and reduce their dimensionality with CFS, wherein the PSSM is normalized, and each part is converted into a 20-dimensional feature, as follows:
in the PSSM, $S_{i,j}$ is the probability score for the amino acid at position $i$ ($1 \le i \le L$) of the sequence evolving into amino acid type $j$ ($1 \le j \le 20$), and $L$ is the length of the protein sequence,
$$S^{0}_{i,j} = \frac{S_{i,j} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}}{\sqrt{\frac{1}{N-1}\sum_{u=1}^{N}\left(S_{i,u} - \frac{1}{N}\sum_{k=1}^{N} S_{i,k}\right)^{2}}} \tag{2}$$
$S^{0}_{i,j}$ is the normalized score of the position-specific scoring matrix and $N$ is the number of amino acid types, with $N = 20$ in formula (2),
$$\overline{S^{0}_{j}} = \frac{1}{L}\sum_{i=1}^{L} S^{0}_{i,j} \tag{3}$$
where $\overline{S^{0}_{j}}$ is the average of the normalized scores in column $j$;
$$\overline{S_{PSSM}} = \left[\overline{S^{0}_{1}}, \overline{S^{0}_{2}}, \overline{S^{0}_{3}}, \ldots, \overline{S^{0}_{20}}\right] \tag{4}$$
$\overline{S_{PSSM}}$ is the resulting normalized PSSM feature;
S102: extract the GO features of all human proteins in the Swiss-Prot database, and use GOSSTO to obtain three similarity matrices, one for each GO feature space (BP, MF, CC);
S103: search the Swiss-Prot database for homologous proteins with BLAST and extract their GO features; obtain the GO features of the proteins in the training set in the same way;
S104: divide the three parts (BP, MF, CC) of the protein GO features into seven parts, as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
S105: using the correlations between GO features, compute the correlation between two proteins under each of the seven parts:
$$Cor(x_i, K) = \max_{1 \le j \le m} Cor(x_i, y_j) \tag{5}$$
where $Cor(x_i, K)$ is the correlation, under the given part, between the GO annotation feature $x_i$ and the $K$-th protein,
$$Sim_k = \sum_{i=1}^{n} Cor(x_i, K)^{2} \tag{6}$$
where $Sim_k$ is the correlation between the $k$-th protein in the training set and the protein to be predicted,
once the correlations between every training-set protein and the protein to be predicted have been obtained, the ten most correlated training-set proteins are used in a weighted KNN method to obtain the protein's probability value at each subcellular location:
$$pro_a = \frac{\sum_{j \in IN_a} sim_j + \frac{num_a}{num}}{\sum_{i \in IN} sim_i + 1} \tag{7}$$
where $num_a$ and $num$ are, respectively, the number of training-set proteins at the $a$-th subcellular location and the total number of training-set proteins, and $pro_a$ is the probability that the predicted protein is at the $a$-th subcellular location.
S106: obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlation between features by information gain:
$$H(f_i^{cdd}) = -\sum_{m \in \{0,1\}} p(f_i^{cdd} = m) \log p(f_i^{cdd} = m) \tag{9}$$
$$S_{i,j}^{cdd} = \frac{2\left(H(f_i^{cdd}) + H(f_j^{cdd}) - H(f_i^{cdd}, f_j^{cdd})\right)}{H(f_i^{cdd}) + H(f_j^{cdd})} \tag{11}$$
where $H(f_i^{cdd})$ is the entropy of the $i$-th CDD feature, $p(f_i^{cdd} = 1)$ is the probability that the $i$-th CDD feature is present in the protein training set, $H(f_i^{cdd}, f_j^{cdd})$ is the joint entropy of the $i$-th and $j$-th features, and $S_{i,j}^{cdd}$ is the correlation between the $i$-th and $j$-th CDD features,
this yields the conserved-domain feature similarity matrix; the conserved domain features of the target protein are then obtained with rps-blast to compute the correlation between two proteins, and the ten most correlated training-set proteins are used in a weighted KNN method to obtain the protein's probability value at each subcellular location;
S107: merge the obtained sequence features, the probability features from the seven GO parts, and the conserved-domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the subcellular locations of the protein and the probability at each subcellular location.
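The Binary Relevance strategy of S107 trains one independent binary classifier per subcellular location. The sketch below shows only that decomposition. As a stand-in for whichever SVM implementation the method actually uses, it trains a small linear SVM by full-batch subgradient descent on the regularised hinge loss; the function names and the hyperparameters `lam`, `epochs` and `lr` are assumptions made for illustration. It returns decision scores rather than probabilities; turning scores into probabilities would need a further calibration step that is not shown.

```python
from typing import List, Sequence, Set, Tuple

Model = Tuple[List[float], float]  # (weight vector, bias)


def train_linear_svm(xs: Sequence[Sequence[float]], ys: Sequence[int],
                     lam: float = 0.01, epochs: int = 200,
                     lr: float = 0.1) -> Model:
    """Stand-in linear SVM: subgradient descent on the L2-regularised
    hinge loss. Labels ys must be +1 or -1."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # only margin violators contribute to the hinge
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision(model: Model, x: Sequence[float]) -> float:
    """Signed score of feature vector x under one binary model."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def fit_binary_relevance(xs: Sequence[Sequence[float]],
                         label_sets: Sequence[Set[int]],
                         n_labels: int = 12) -> List[Model]:
    """One binary SVM per location: +1 if the protein has that location."""
    models = []
    for a in range(n_labels):
        ys = [1 if a in labels else -1 for labels in label_sets]
        models.append(train_linear_svm(xs, ys))
    return models


def predict_labels(models: Sequence[Model], x: Sequence[float]) -> Set[int]:
    """A protein may receive zero, one, or several locations."""
    return {a for a, model in enumerate(models) if decision(model, x) > 0}
```

Here `xs` stands for the merged feature vectors (sequence features plus GO and conserved-domain probability features). Because each location has its own classifier, a protein that resides in several compartments can be assigned all of them.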
CN201710204499.1A 2017-03-31 2017-03-31 A kind of human protein's subcellular location Forecasting Methodology Pending CN106778070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710204499.1A CN106778070A (en) 2017-03-31 2017-03-31 A kind of human protein's subcellular location Forecasting Methodology

Publications (1)

Publication Number Publication Date
CN106778070A true CN106778070A (en) 2017-05-31

Family

ID=58965603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710204499.1A Pending CN106778070A (en) 2017-03-31 2017-03-31 A kind of human protein's subcellular location Forecasting Methodology

Country Status (1)

Country Link
CN (1) CN106778070A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164507A (en) * 2019-05-31 2019-08-23 郑州大学第一附属医院 A kind of determination method and system of protein similarity and similar protein matter
CN110739028A (en) * 2019-10-18 2020-01-31 中国矿业大学 cell line drug response prediction method based on K-nearest neighbor constraint matrix decomposition
CN110797080A (en) * 2019-10-18 2020-02-14 湖南大学 Predicting synthetic lethal genes based on cross-species migratory learning
CN111009287A (en) * 2019-12-20 2020-04-14 东软集团股份有限公司 SLiMs prediction model generation method, device, equipment and storage medium
CN111091874A (en) * 2019-12-20 2020-05-01 东软集团股份有限公司 Protein feature construction method, device, equipment, storage medium and program product
CN112259160A (en) * 2020-11-19 2021-01-22 广东工业大学 Protein subcellular localization method, system, storage medium and computer equipment
CN112542213A (en) * 2020-12-11 2021-03-23 沈阳师范大学 Protein compound identification method fusing local topological attribute and gene expression information of node
CN114882954A (en) * 2022-05-24 2022-08-09 南京邮电大学 Integrated learning-based automatic cell type classification method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760711A (en) * 2016-02-02 2016-07-13 江南大学 Method for using KNN calculation and similarity comparison to predict protein subcellular section

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANG ZHOU ET AL.: "Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features", 《BIOINFORMATICS》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110164507A (en) * 2019-05-31 2019-08-23 郑州大学第一附属医院 A kind of determination method and system of protein similarity and similar protein matter
CN110739028B (en) * 2019-10-18 2023-08-15 中国矿业大学 Cell line drug response prediction method based on K-nearest neighbor constraint matrix decomposition
CN110739028A (en) * 2019-10-18 2020-01-31 中国矿业大学 cell line drug response prediction method based on K-nearest neighbor constraint matrix decomposition
CN110797080A (en) * 2019-10-18 2020-02-14 湖南大学 Predicting synthetic lethal genes based on cross-species migratory learning
CN111009287A (en) * 2019-12-20 2020-04-14 东软集团股份有限公司 SLiMs prediction model generation method, device, equipment and storage medium
CN111091874A (en) * 2019-12-20 2020-05-01 东软集团股份有限公司 Protein feature construction method, device, equipment, storage medium and program product
CN111009287B (en) * 2019-12-20 2023-12-15 东软集团股份有限公司 SLiMs prediction model generation method, device, equipment and storage medium
CN111091874B (en) * 2019-12-20 2024-01-19 东软集团股份有限公司 Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product
CN112259160A (en) * 2020-11-19 2021-01-22 广东工业大学 Protein subcellular localization method, system, storage medium and computer equipment
CN112259160B (en) * 2020-11-19 2023-05-26 广东工业大学 Protein subcellular localization method, system, storage medium and computer device
CN112542213A (en) * 2020-12-11 2021-03-23 沈阳师范大学 Protein compound identification method fusing local topological attribute and gene expression information of node
CN112542213B (en) * 2020-12-11 2024-02-02 沈阳师范大学 Protein complex identification method fusing local topological attribute of node and gene expression information
CN114882954A (en) * 2022-05-24 2022-08-09 南京邮电大学 Integrated learning-based automatic cell type classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170531