CN106778070A - Method for predicting the subcellular location of human proteins - Google Patents
- Publication number: CN106778070A
- Application number: CN201710204499.1A
- Authority: CN (China)
- Prior art keywords: protein, feature, features, sequence, correlation
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The invention discloses a method for predicting the subcellular location of human proteins. Starting from the sequence of a human protein, it predicts the protein's subcellular location with a classification algorithm that is optimized using the correlations among Gene Ontology (GO) features and among conserved domain features. First, the sequence statistical features (amino acid composition features and normalized position-specific scoring matrix features), the conserved domain features, and the GO features of the protein are obtained from its sequence. Second, a feature subset is extracted from the sequence statistical features with the CFS feature selection method; for the conserved domain features and the GO features, similarity measures are computed and probability information is calculated with a weighted KNN method. Finally, the resulting features are integrated and classified with SVM classifiers.
Description
Technical field
The invention belongs to the technical field of bioinformatics, and in particular relates to a method for predicting the subcellular location of human proteins.
Background art
Knowing the subcellular location of a protein is important for understanding its function and its interactions with other proteins, and for targeted drug therapy. However, determining subcellular location experimentally takes a great deal of time and money, so tools that predict the subcellular location of large numbers of proteins are of great value. According to our statistics, the SWISS-PROT protein database released in February 2016 contains 550,552 proteins, of which only 10.4% have an experimentally verified subcellular location. The remaining proteins of unknown location urgently need a reliable prediction method.
To date, many tools for predicting protein subcellular location have become available. Commonly used web servers include BaCeLlo, YLoc, MultiLoc, GOASVM, WoLF PSORT, CellPLoc, and HSLPred. These prediction tools have brought great convenience to biologists in related fields.
Subcellular location information is frequently used in gene therapy and in targeted drug therapy. For example, the role of the Hippo/YAP pathway in the progression of pediatric hepatocellular carcinoma has been studied by examining the expression and subcellular localization of the YAP protein in tumors. An easy-to-use, high-accuracy prediction tool would therefore greatly help such laboratories carry out clinical research. The web server Hum-mPLoc 2.0, which we released previously, was designed specifically for predicting the localization of human proteins. Its annual usage grew from 20,000 requests in 2010 to more than 80,000 in 2015. This shows that, to provide a better prediction service, it is important to further enhance predictive power using new techniques and more comprehensive and accurate annotation databases.
Computational methods for predicting protein subcellular location generally fall into two classes: methods based on homology search and methods based on machine learning. Homology-search methods can be viewed as nearest-neighbor prediction, in which the distance between two proteins is usually measured by their sequence homology. By computing the homology between the query protein and a large number of sequences with subcellular location annotations, the method finds the K most similar proteins and transfers their annotations to the query protein as the classification result. This approach is fairly direct, but its performance depends heavily on whether highly similar homologous sequences with subcellular location annotations can be found. In addition, two proteins with high sequence similarity can sometimes have very different structures or functions, which causes the method to fail.
Predictors based on machine learning are a more flexible class of models for protein subcellular location prediction. They require a training data set, from which classification rules are learned by a statistical learning algorithm. The quality of the training data is therefore closely tied to the quality of the learned rules. Thanks to the growing number of increasingly reliable subcellular location annotations in protein databases, large-scale training data can be collected to train classification models more thoroughly. Another key issue in machine learning models is how to encode the protein sequence. Most algorithms take feature vectors as input, so how features are extracted from the raw protein sequence and from associated prior knowledge is critical to the final performance of the classifier. Existing machine learning tools for predicting subcellular location use features such as the following:
(1) Residue-based statistical features, such as pseudo amino acid composition and position-specific scoring matrices.
(2) Features based on signal peptides and functional domains.
(3) Features based on database annotations, such as Gene Ontology (GO) features.
Because GO features are a high-level abstraction of domain knowledge, they generally give higher accuracy than features extracted from the sequence when enough annotations are available. However, the large volume of annotation data brings new algorithmic challenges. For example, applying a Bernoulli event model to each GO feature, that is, binary-encoding whether the feature is present, often yields a very high-dimensional feature space. As GO databases are regularly extended and updated, the dimensionality keeps growing with our knowledge of proteins. High-dimensional feature vectors increase the complexity of the learning process, and the potential noise in annotation databases must also be taken into account. Although the GO database as a whole is huge, each protein actually carries only a few GO features. According to our statistics, the proteins in the SWISS-PROT database that have at least one GO feature carry 6 GO annotations on average. The GO features of a protein therefore form a sparse feature vector with thousands of dimensions but only about 6 GO annotations. Different methods have been proposed in the field to handle this problem. For example, YLoc selects only the GO annotations and PROSITE patterns that are clearly correlated with specific subcellular locations. This removes unnecessary features and makes the results easier to interpret, but it also loses information. WegoLoc assigns a weight to each GO feature in order to highlight the important GO features.
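To make the sparsity point concrete, the following minimal sketch encodes a protein's GO annotations as a presence/absence (Bernoulli) vector over a GO vocabulary. The vocabulary, the GO identifiers, and the function names are illustrative assumptions and do not come from the patent.

```python
def encode_go_binary(annotations, vocabulary):
    """Return a 0/1 vector with a 1 wherever a GO term annotates the protein."""
    present = set(annotations)
    return [1 if term in present else 0 for term in vocabulary]


def as_sparse(vector):
    """Keep only the indices of the non-zero entries."""
    return [i for i, value in enumerate(vector) if value]


# Hypothetical vocabulary of 1000 terms; a real GO release is far larger.
vocabulary = ["GO:%07d" % i for i in range(1, 1001)]
# Hypothetical annotations of one protein.
protein_go = ["GO:0000005", "GO:0000042", "GO:0000512"]

dense = encode_go_binary(protein_go, vocabulary)
sparse = as_sparse(dense)
print(len(dense), sum(dense), sparse)
```

Almost every entry of the dense vector is zero, which is why a plain binary encoding gives a poor similarity measure between two proteins that share no identical GO term but do share closely related ones.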
Summary of the invention
The invention provides a method for predicting the subcellular location of human proteins. Its purpose is to improve the prediction accuracy of human protein subcellular location classifiers by exploiting the potential correlations among annotation features.
A method for predicting the subcellular location of human proteins, which predicts the subcellular location of a protein from the human protein sequence, comprises the following steps:
Step 1: Using the human protein sequence information, extract residue statistical features separately from several sequence fragments: the full-length sequence, the N-terminal region, and the C-terminal region. The features comprise amino acid composition features and position-specific scoring matrix features obtained from protein homology information, the latter being normalized. After the two kinds of features are combined, their dimensionality is reduced with the supervised feature selection algorithm Correlation-based Feature Selection (CFS).
Step 2: Extract the GO features of all human proteins in the protein database, and use GOSSTO to obtain the similarity matrices of the three GO feature spaces (BP, MF, CC).
Step 3: Search the Swiss-Prot database for homologous proteins with BLAST and extract the GO features of these homologous proteins; obtain the GO features of the proteins in the training set in the same way.
Step 4: Divide the three parts of the protein GO features (BP, MF, CC) into 7 parts as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC).
Step 5: Using the correlations of the protein GO features, compute the correlation between two proteins separately for each of the seven parts. With the parameters optimized, select the ten most correlated proteins in the training set for a weighted KNN method, and obtain the probability of the protein at each subcellular location.
Step 6: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlations between features by information gap to obtain the conserved domain feature similarity matrix. Then obtain the conserved domain features of the target protein with rps-blast to compute the correlation between two proteins. With the parameters optimized, select the ten most correlated proteins in the training set for a weighted KNN method, and obtain the probability of the protein at each subcellular location.
Step 7: Fuse the sequence features, the probability features of the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build SVM classifiers that predict 12 subcellular locations: centrosome, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, secretory pathway, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, and cell membrane.
More specifically, the method for predicting the subcellular location of human proteins from the human protein sequence comprises the following steps:
S101: Using the human protein sequence information, extract the amino acid composition features and the normalized PSSM features separately from the full-length sequence, from N-terminal fragments of 10 to 60 residues, and from C-terminal fragments of 10 to 100 residues, and reduce their dimensionality with CFS. The PSSM is normalized and each part is converted into a 20-dimensional feature by the following formulas:
[formulas not reproduced in the source text]
Here $S_{i,j}$ is the probability score that the amino acid at position i (1 ≤ i ≤ L) of the sequence mutates into amino acid type j (1 ≤ j ≤ 20) during evolution, and L is the length of the protein sequence.
$S^0_{i,j}$ is the score of the position-specific scoring matrix after normalization, and N is the number of amino acid types, so N = 20 in formula (2).
The final quantity is obtained by summing the scores in each column and taking the average; it is the normalized PSSM feature that is used.
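The fragment-wise feature extraction in S101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the fragment lengths `n_len` and `c_len` are assumed values picked from the stated ranges, the function names are hypothetical, and the PSSM normalization itself is omitted because its formula is not available in the text. Only the column-averaging step that turns an L x 20 matrix into 20 values is shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(segment):
    """Relative frequency of the 20 standard amino acids in a fragment."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in segment:
        if residue in counts:  # non-standard residue codes are ignored
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def segment_features(sequence, n_len=30, c_len=50):
    """Concatenate compositions of the full sequence and both termini.

    n_len and c_len are assumed example lengths (both must be >= 1).
    """
    full = aa_composition(sequence)
    n_term = aa_composition(sequence[:n_len])
    c_term = aa_composition(sequence[-c_len:])
    return full + n_term + c_term


def column_means(matrix):
    """Average every column of an L x N score matrix (N = 20 for a PSSM)."""
    length = len(matrix)
    return [sum(row[j] for row in matrix) / length
            for j in range(len(matrix[0]))]


features = segment_features("ACDEFGHIKLMNPQRSTVWY" * 3)
print(len(features))
```

Each fragment contributes 20 composition values, so one choice of N-terminal and C-terminal length gives 60 values; trying several lengths from the stated ranges multiplies the feature count, which is one reason a selection step such as CFS follows.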
S102: Extract the GO features of all human proteins in the Swiss-Prot database, and use GOSSTO to obtain the similarity matrices of the three GO feature spaces (BP, MF, CC).
S103: Search the Swiss-Prot database for homologous proteins with BLAST and extract their GO features; obtain the GO features of the proteins in the training set in the same way.
S104: Divide the three parts of the protein GO features (BP, MF, CC) into 7 parts as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC).
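The seven parts in S104 are simply all non-empty subsets of the three GO aspects. A short sketch, with a hypothetical function name, enumerates them:

```python
from itertools import combinations

ASPECTS = ("BP", "MF", "CC")


def go_parts(aspects=ASPECTS):
    """List every non-empty combination of GO aspects, smallest first."""
    parts = []
    for size in range(1, len(aspects) + 1):
        parts.extend(combinations(aspects, size))
    return parts


parts = go_parts()
for part in parts:
    print("&".join(part))
```

Three aspects give 3 singletons, 3 pairs, and 1 triple, which accounts for the 7 parts used in the later steps.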
S105: Using the correlations of the protein GO features, compute the correlation between two proteins separately for each of the seven parts:
[formulas not reproduced in the source text]
Here $Cor(x_i, k)$ is the correlation, within the given part, between the GO annotation feature represented by $x_i$ and the k-th protein, and $Sim_k$ is the correlation between the k-th protein in the training set and the protein to be predicted.
After the correlations between all proteins in the training set and the protein to be predicted have been obtained, the ten most correlated proteins in the training set are selected for a weighted KNN method, which gives the probability of the protein at each subcellular location:
[formula not reproduced in the source text]
Here $num_a$ and $num$ are, respectively, the number of proteins in the training set at the a-th subcellular location and the total number of proteins in the training set, and $pro_a$ is the probability that the predicted protein is at the a-th subcellular location.
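The weighted KNN step can be illustrated with a simple similarity-weighted vote over the ten most similar training proteins. This sketch is an assumption: the patent's own formula is not available in the text and also involves $num_a$ and $num$, which this simplified version does not use. The function and variable names are hypothetical.

```python
def weighted_knn_probabilities(similarities, labels, locations, k=10):
    """Similarity-weighted vote of the k most similar training proteins.

    similarities[i] is the similarity of training protein i to the query;
    labels[i] is the set of locations annotated for training protein i.
    """
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    top = ranked[:k]
    total = sum(similarities[i] for i in top)
    if total == 0:
        return {loc: 0.0 for loc in locations}
    return {
        loc: sum(similarities[i] for i in top if loc in labels[i]) / total
        for loc in locations
    }


# Toy data: three training proteins, two locations, k = 2 neighbours.
sims = [0.9, 0.1, 0.5]
labels = [{"nucleus"}, {"cytoplasm"}, {"nucleus", "cytoplasm"}]
probs = weighted_knn_probabilities(sims, labels, ["nucleus", "cytoplasm"], k=2)
print(probs)
```

Because a protein may occupy several locations, the per-location values are scored independently and need not sum to 1.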
S106: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlations between features by information gap:
[formulas not reproduced in the source text]
Here $H(f_i^{cdd})$ is the entropy of the i-th CDD feature, and $p(f_i^{cdd}=1)$ is the probability that the i-th CDD feature occurs in the protein training set. $H(f_j^{cdd}, f_i^{cdd})$ is the differential entropy of the i-th and j-th features, and $S_{i,j}^{cdd}$ is the correlation between the i-th and the j-th CDD feature.
This yields the conserved domain feature similarity matrix. The conserved domain features of the target protein are then obtained with rps-blast to compute the correlation between two proteins, and the ten most correlated proteins in the training set are selected for a weighted KNN method, which gives the probability of the protein at each subcellular location.
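An entropy-based feature correlation of the kind described in S106 can be sketched as follows. Since the patent's formulas are not available in the text, this example makes its own assumption: it scores two presence/absence features by 1 minus the normalised information distance, which equals their mutual information divided by their joint entropy. It may differ from the measure actually used, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a sequence of discrete observations."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def information_similarity(x, y):
    """Mutual information divided by joint entropy, a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))  # joint entropy of the paired features
    if h_xy == 0:
        return 1.0  # both features are constant over the training set
    return (h_x + h_y - h_xy) / h_xy


# Presence (1) or absence (0) of three hypothetical CDD features
# across four training proteins.
f1 = [1, 1, 0, 0]
f2 = [1, 1, 0, 0]
f3 = [1, 0, 1, 0]
print(information_similarity(f1, f2), information_similarity(f1, f3))
```

Features that always co-occur score 1, and statistically independent features score 0, so the resulting matrix lets two proteins with different but related domains still count as similar.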
S107: Fuse the sequence features, the probability features of the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the subcellular location of the protein and its probability at each subcellular location.
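The Binary Relevance strategy in S107 trains one independent binary classifier per location, so a protein can be assigned to several locations at once. The sketch below shows only that wrapper. To stay dependency-free it plugs in a nearest-centroid stand-in where the patent uses an SVM; the class names, the stand-in learner, and the toy data are all assumptions.

```python
LOCATIONS = [
    "centrosome", "cytoplasm", "cytoskeleton", "endoplasmic reticulum",
    "endosome", "secretory pathway", "Golgi apparatus", "lysosome",
    "mitochondrion", "nucleus", "peroxisome", "cell membrane",
]


class CentroidClassifier:
    """Stand-in binary learner (NOT an SVM): nearest class centroid."""

    def fit(self, X, y):
        self.pos = self._centroid([x for x, t in zip(X, y) if t])
        self.neg = self._centroid([x for x, t in zip(X, y) if not t])
        return self

    @staticmethod
    def _centroid(rows):
        if not rows:
            return None
        return [sum(col) / len(rows) for col in zip(*rows)]

    @staticmethod
    def _dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    def predict(self, x):
        if self.pos is None:  # no positive training example for this label
            return False
        if self.neg is None:
            return True
        return self._dist(x, self.pos) < self._dist(x, self.neg)


class BinaryRelevance:
    """One independent binary classifier per label."""

    def __init__(self, labels, make_classifier):
        self.labels = labels
        self.make_classifier = make_classifier

    def fit(self, X, label_sets):
        self.models = {
            label: self.make_classifier().fit(
                X, [label in s for s in label_sets])
            for label in self.labels
        }
        return self

    def predict(self, x):
        return {label for label, m in self.models.items() if m.predict(x)}


# Toy fused feature vectors and multi-label annotations.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
Y = [{"nucleus"}, {"cytoplasm"}, {"nucleus", "cytoplasm"}]
model = BinaryRelevance(LOCATIONS, CentroidClassifier).fit(X, Y)
print(model.predict([0.0, 0.0]), model.predict([1.0, 1.0]))
```

In practice `make_classifier` would return an SVM with probability outputs, which gives both the predicted locations and a probability for each of the 12 locations.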
In the present invention, feature vectors are encoded from GO correlation information rather than from the frequency of GO terms in the annotations. GO features fall into three parts: biological process (BP), molecular function (MF), and cellular component (CC). Each of the three parts has a hierarchical structure. Based on this hierarchy, many methods for defining the semantic similarity between GO features have been proposed in the field, such as methods based on information entropy and methods based on graph theory. As far as we know, however, few prediction algorithms for protein subcellular location currently take the correlations among GO features into account. This motivated us to use the hidden correlations among GO features to obtain a better similarity measure between two high-dimensional but sparse GO feature vectors. We propose a new method that exploits the hidden correlations among the annotation features of proteins. To handle the incompleteness of the GO database and the lack of GO annotations for some proteins to be predicted, we additionally combine the statistical features of protein sequence residues with peptide-based functional domain features extracted from the Conserved Domain Database (CDD). The resulting new predictor is called Hum-mPLoc 3.0. It is named after our previously developed predictor of human protein localization but uses an entirely new feature representation.
Compared with existing methods in the field, the invention has the following notable advantages:
(1) The model exploits the potential correlations among annotation features, which effectively improves the accuracy of human protein subcellular location prediction.
(2) It integrates sequence statistical features, conserved domain features, and GO features, which effectively improves the accuracy of human protein subcellular location prediction.
Brief description of the drawings
Fig. 1 is a system structure diagram of the human protein sequence prediction method of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawing. Fig. 1 shows the system structure of the human protein sequence prediction method of the invention.
First, the sequence statistical features, the conserved domain features, and the GO features of the protein are obtained from its sequence. Second, a feature subset is extracted from the sequence statistical features with the CFS feature selection method; for the conserved domain features and the GO features, similarity measures are computed and probability information is calculated with a weighted KNN method. The resulting features are then integrated and classified with SVM classifiers. The individual steps are carried out exactly as set out in steps S101 to S107 above.
Example:
Given an input sequence with the following data:
>query protein 1; example of multiple subcellular locations
MSAVGAATPYLHHPGDSHSGRVSFLGAQLPPEVAAMARLLGDLDRSTFRKLLKFVVSSLQGEDCREAV
QRLGVSANLPEEQLGALLAGMHTLLQQALRLPPTSLKPDTFRDQLQELCIPQDLVGDLASVVFGSQRPLLDSVAQQQ
GAWLPHVADFRWRVDVAISTSALARSLQPSVLMQLKLSDGSAYRFEVPTAKFQELRYSVALVLKEMADLEKRCERRL
QD
This is the query sequence. The output of the software that implements the method of the invention is as follows:
[output not reproduced in the source text]
The result shows that the method effectively and accurately predicts the subcellular location of this human protein.
The above embodiment does not limit the invention in any way. Every technical solution obtained by equivalent substitution or equivalent transformation falls within the scope of protection of the invention.
Claims (2)
1. A method for predicting the subcellular location of human proteins, which predicts the subcellular location of a protein from the human protein sequence, characterized in that it comprises the following steps:
Step 1: Using the human protein sequence information, extract residue statistical features separately from several sequence fragments: the full-length sequence, the N-terminal region, and the C-terminal region. The features comprise amino acid composition features and position-specific scoring matrix features obtained from protein homology information, the latter being normalized. After the two kinds of features are combined, their dimensionality is reduced with the supervised feature selection algorithm Correlation-based Feature Selection (CFS);
Step 2: Extract the GO features of all human proteins in the protein database, and use GOSSTO to obtain the similarity matrices of the three GO feature spaces (BP, MF, CC);
Step 3: Search the Swiss-Prot database for homologous proteins with BLAST and extract the GO features of these homologous proteins; obtain the GO features of the proteins in the training set in the same way;
Step 4: Divide the three parts of the protein GO features (BP, MF, CC) into 7 parts as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
Step 5: Using the correlations of the protein GO features, compute the correlation between two proteins separately for each of the seven parts; with the parameters optimized, select the ten most correlated proteins in the training set for a weighted KNN method, and obtain the probability of the protein at each subcellular location;
Step 6: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlations between features by information gap to obtain the conserved domain feature similarity matrix; then obtain the conserved domain features of the target protein with rps-blast to compute the correlation between two proteins; with the parameters optimized, select the ten most correlated proteins in the training set for a weighted KNN method, and obtain the probability of the protein at each subcellular location;
Step 7: Fuse the sequence features, the probability features of the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build SVM classifiers that predict 12 subcellular locations: centrosome, cytoplasm, cytoskeleton, endoplasmic reticulum, endosome, secretory pathway, Golgi apparatus, lysosome, mitochondrion, nucleus, peroxisome, and cell membrane.
2. A method for predicting the subcellular location of human proteins, which predicts the subcellular location of a protein from the human protein sequence, characterized in that it comprises the following steps:
S101: Using the human protein sequence information, extract the amino acid composition features and the normalized PSSM features separately from the full-length sequence, from N-terminal fragments of 10 to 60 residues, and from C-terminal fragments of 10 to 100 residues, and reduce their dimensionality with CFS, wherein the PSSM is normalized and each part is converted into a 20-dimensional feature by the following formulas:
[formulas not reproduced in the source text]
where $S_{i,j}$ is the probability score that the amino acid at position i (1 ≤ i ≤ L) of the sequence mutates into amino acid type j (1 ≤ j ≤ 20) during evolution, and L is the length of the protein sequence;
$S^0_{i,j}$ is the score of the position-specific scoring matrix after normalization, and N is the number of amino acid types, with N = 20 in formula (2);
the final quantity is obtained by summing the scores in each column and taking the average, and is the normalized PSSM feature;
S102: Extract the GO features of all human proteins in the Swiss-Prot database, and use GOSSTO to obtain the similarity matrices of the three GO feature spaces (BP, MF, CC);
S103: Search the Swiss-Prot database for homologous proteins with BLAST and extract their GO features; obtain the GO features of the proteins in the training set in the same way;
S104: Divide the three parts of the protein GO features (BP, MF, CC) into 7 parts as singletons, pairs, and the triple: (BP, MF, CC), (BP&MF, BP&CC, MF&CC), (BP&MF&CC);
S105: Using the correlations of the protein GO features, compute the correlation between two proteins separately for each of the seven parts:
[formulas not reproduced in the source text]
where $Cor(x_i, k)$ is the correlation, within the given part, between the GO annotation feature represented by $x_i$ and the k-th protein, and $Sim_k$ is the correlation between the k-th protein in the training set and the protein to be predicted;
after the correlations between all proteins in the training set and the protein to be predicted have been obtained, the ten most correlated proteins in the training set are selected for a weighted KNN method, which gives the probability of the protein at each subcellular location:
[formula not reproduced in the source text]
where $num_a$ and $num$ are, respectively, the number of proteins in the training set at the a-th subcellular location and the total number of proteins in the training set, and $pro_a$ is the probability that the predicted protein is at the a-th subcellular location;
S106: Obtain the conserved domain features of all human proteins in the Swiss-Prot database with rps-blast, and compute the correlations between features by information gap:
[formulas not reproduced in the source text]
where $H(f_i^{cdd})$ is the entropy of the i-th CDD feature, $p(f_i^{cdd}=1)$ is the probability that the i-th CDD feature occurs in the protein training set, $H(f_j^{cdd}, f_i^{cdd})$ is the differential entropy of the i-th and j-th features, and $S_{i,j}^{cdd}$ is the correlation between the i-th and the j-th CDD feature;
this yields the conserved domain feature similarity matrix; the conserved domain features of the target protein are then obtained with rps-blast to compute the correlation between two proteins, and the ten most correlated proteins in the training set are selected for a weighted KNN method, which gives the probability of the protein at each subcellular location;
S107: Fuse the sequence features, the probability features of the seven GO parts, and the conserved domain probability features, and use the Binary Relevance strategy to build 12 SVM classifiers that predict the subcellular location of the protein and its probability at each subcellular location.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710204499.1A CN106778070A (en) | 2017-03-31 | 2017-03-31 | Method for predicting the subcellular location of human proteins |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710204499.1A CN106778070A (en) | 2017-03-31 | 2017-03-31 | Method for predicting the subcellular location of human proteins |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106778070A true CN106778070A (en) | 2017-05-31 |
Family
ID=58965603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710204499.1A Pending CN106778070A (en) | Method for predicting the subcellular location of human proteins | 2017-03-31 | 2017-03-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106778070A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110164507A (en) * | 2019-05-31 | 2019-08-23 | 郑州大学第一附属医院 | A kind of determination method and system of protein similarity and similar protein matter |
CN110739028A (en) * | 2019-10-18 | 2020-01-31 | 中国矿业大学 | cell line drug response prediction method based on K-nearest neighbor constraint matrix decomposition |
CN110797080A (en) * | 2019-10-18 | 2020-02-14 | 湖南大学 | Predicting synthetic lethal genes based on cross-species migratory learning |
CN111009287A (en) * | 2019-12-20 | 2020-04-14 | 东软集团股份有限公司 | SLiMs prediction model generation method, device, equipment and storage medium |
CN111091874A (en) * | 2019-12-20 | 2020-05-01 | 东软集团股份有限公司 | Protein feature construction method, device, equipment, storage medium and program product |
CN112259160A (en) * | 2020-11-19 | 2021-01-22 | 广东工业大学 | Protein subcellular localization method, system, storage medium and computer equipment |
CN112542213A (en) * | 2020-12-11 | 2021-03-23 | 沈阳师范大学 | Protein compound identification method fusing local topological attribute and gene expression information of node |
CN114882954A (en) * | 2022-05-24 | 2022-08-09 | 南京邮电大学 | Integrated learning-based automatic cell type classification method |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760711A (en) * | 2016-02-02 | 2016-07-13 | 江南大学 | Method for using KNN calculation and similarity comparison to predict protein subcellular section |
2017-03-31: CN application CN201710204499.1A filed (CN106778070A), status: Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760711A (en) * | 2016-02-02 | 2016-07-13 | 江南大学 | Method for using KNN calculation and similarity comparison to predict protein subcellular section |
Non-Patent Citations (1)
Title |
---|
HANG ZHOU ET AL.: "Hum-mPLoc3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features", 《BIOINFORMATICS》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110164507A (en) * | 2019-05-31 | 2019-08-23 | 郑州大学第一附属医院 | A kind of determination method and system of protein similarity and similar protein matter |
CN110739028B (en) * | 2019-10-18 | 2023-08-15 | 中国矿业大学 | Cell line drug response prediction method based on K-nearest neighbor constraint matrix decomposition |
CN110739028A (en) * | 2019-10-18 | 2020-01-31 | 中国矿业大学 | cell line drug response prediction method based on K-nearest neighbor constraint matrix decomposition |
CN110797080A (en) * | 2019-10-18 | 2020-02-14 | 湖南大学 | Predicting synthetic lethal genes based on cross-species migratory learning |
CN111009287A (en) * | 2019-12-20 | 2020-04-14 | 东软集团股份有限公司 | SLiMs prediction model generation method, device, equipment and storage medium |
CN111091874A (en) * | 2019-12-20 | 2020-05-01 | 东软集团股份有限公司 | Protein feature construction method, device, equipment, storage medium and program product |
CN111009287B (en) * | 2019-12-20 | 2023-12-15 | 东软集团股份有限公司 | SLiMs prediction model generation method, device, equipment and storage medium |
CN111091874B (en) * | 2019-12-20 | 2024-01-19 | 东软集团股份有限公司 | Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product |
CN112259160A (en) * | 2020-11-19 | 2021-01-22 | 广东工业大学 | Protein subcellular localization method, system, storage medium and computer equipment |
CN112259160B (en) * | 2020-11-19 | 2023-05-26 | 广东工业大学 | Protein subcellular localization method, system, storage medium and computer device |
CN112542213A (en) * | 2020-12-11 | 2021-03-23 | 沈阳师范大学 | Protein compound identification method fusing local topological attribute and gene expression information of node |
CN112542213B (en) * | 2020-12-11 | 2024-02-02 | 沈阳师范大学 | Protein complex identification method fusing local topological attribute of node and gene expression information |
CN114882954A (en) * | 2022-05-24 | 2022-08-09 | 南京邮电大学 | Integrated learning-based automatic cell type classification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106778070A (en) | A kind of human protein's subcellular location Forecasting Methodology | |
Liu et al. | A new feature selection method based on a validity index of feature subset | |
Wei et al. | An improved protein structural classes prediction method by incorporating both sequence and structure information | |
RU2607999C2 (en) | Use of machine learning techniques for extraction of association rules in datasets of plants and animals containing molecular genetic markers accompanied by classification or prediction using features created by these association rules | |
Rao et al. | A new intelligence-based approach for computer-aided diagnosis of dengue fever | |
CN101751455B (en) | Method for automatically generating title by adopting artificial intelligence technology | |
Sathya et al. | [Retracted] Cancer Categorization Using Genetic Algorithm to Identify Biomarker Genes | |
CN107291895B (en) | Quick hierarchical document query method | |
Daelemans et al. | Skousen's analogical modelling algorithm: a comparison with lazy learning | |
Dellert | Combining information-weighted sequence alignment and sound correspondence models for improved cognate detection | |
CN102346817B (en) | Prediction method for establishing allergen of allergen-family featured peptides by means of SVM (Support Vector Machine) | |
Cao et al. | Combining contents and citations for scientific document classification | |
Zhang et al. | MpsLDA-ProSVM: predicting multi-label protein subcellular localization by wMLDAe dimensionality reduction and ProSVM classifier | |
Sarkate et al. | Classification of chemical medicine or drug using K nearest neighbor (KNN) and genetic algorithm | |
Yu et al. | An automatic recognition method of journal impact factor manipulation | |
CN113033176B (en) | Court case judgment prediction method | |
Zhang et al. | A hierarchical feature selection model using clustering and recursive elimination methods | |
Belete et al. | Wrapper based feature selection techniques on EDHS-HIV/AIDS dataset | |
AlShwaish et al. | Mortality prediction based on imbalanced new born and perinatal period data | |
Natarajan | Early disease diagnosis using multivariate linear regression | |
Hemmerich et al. | A study of residue correlation within protein sequences and its application to sequence classification | |
Upadhyay et al. | Exploratory data analysis and prediction of human genetic disorder and species using dna sequencing | |
Mostafavi et al. | Classification of Persian News Articles using Machine Learning Techniques | |
Jagdale et al. | Extending the Classifier Algorithms in Machine Learning to Improve the Performance in Spoken Language Understanding Systems Under Deficient Training Data | |
CN116563646B (en) | Brain image classification method based on discretization data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170531 |