CN105760711A - Method for using KNN calculation and similarity comparison to predict protein subcellular section - Google Patents
- Publication number
- CN105760711A (application CN201610072828.7A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- protein
- protein sequence
- knn
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
The invention discloses a method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison. The method comprises the following steps: (1) the AAC features of all protein sequences in a protein sequence data set are extracted; (2) a KNN algorithm determines the set of protein sequences within the prediction range; (3) a BLAST similarity comparison is carried out to obtain the most similar sequence, and the compartment to which that sequence belongs is taken as the compartment of the sequence being predicted. The method reports high prediction accuracy, with a marked improvement in recognition precision for subcellular classes on which traditional methods predict poorly, and is useful for accurately predicting the subcellular location of unknown proteins.
Description
Technical field
The invention belongs to the field of bioinformatics and relates to methods for predicting the subcellular compartment of a protein, in particular to a method that predicts the compartment using a KNN calculation and similarity comparison.
Background art
The function of a protein is closely tied to the subcellular compartment in which it resides, so predicting that compartment from the protein sequence is an important research problem. Machine learning has become the main approach for obtaining compartment information.
Since machine learning was first applied to subcellular compartment prediction in 1991, more than 20 years of research have produced a series of advances. The main prediction methods include: applying a covariant discriminant function to the amino acid composition features of a protein sequence; prediction based on fusing features such as the N-terminus, the C-terminus, and hydrophobicity; and the fuzzy k-nearest neighbor (Fuzzy K-Nearest Neighbor, FKNN) algorithm combined with pseudo amino acid composition features.
In these methods, features are extracted from the protein sequence and fed to a classifier that determines the compartment. Because only the features of the sequence itself are considered, and the similarity relationships between sequences that arise from genetic variation are ignored, the prediction accuracy is comparatively low.
Summary of the invention
To address the deficiencies of the prior art, the invention discloses a method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison.
The technical scheme is as follows:
A method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison, comprising the following steps:
Step 1: extract the AAC features of all protein sequences in the protein sequence data set;
Step 2: select one protein sequence from the data set as the test sequence and use the remaining protein sequences as the training set; determine, by the KNN algorithm, the set of protein sequences within the prediction range;
Step 3: perform a BLAST similarity comparison between the sequence being predicted and the set of protein sequences in the prediction range to obtain the most similar sequence; the compartment of the most similar sequence is taken as the compartment of the sequence being predicted.
In a further technical scheme, step 1 specifically comprises:
Step 1-A: represent each protein sequence in the data set as P:
$$P = R_1 R_2 R_3 R_4 R_5 \dots R_L$$
where L is the length of the protein sequence and $R_i$ (i = 1, …, L) is the i-th amino acid residue in the sequence;
Step 1-B: compute the AAC feature of each protein sequence P:
$$P_{AAC} = [f_1, f_2, \dots, f_{20}]^T$$
where $f_u$ (u = 1, 2, 3, …, 20) is the frequency with which the u-th amino acid occurs in protein sequence P.
In a further technical scheme, the frequency $f_u$ with which the u-th amino acid occurs in protein sequence P in step 1-B is computed as:
$$f_u = \frac{1}{N}\sum_{i=1}^{L} \delta\big(R_i, A(u)\big), \qquad \delta\big(R_i, A(u)\big) = \begin{cases} 1, & R_i = A(u) \\ 0, & \text{otherwise} \end{cases}$$
where L is the length of the protein sequence, N is the total number of amino acid residues that the protein sequence contains, and A(u) is the amino acid residue corresponding to index u.
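The frequency computation of step 1-B can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes that $f_u$ is the count of the u-th amino acid divided by the number of residues counted, that A(u) follows the alphabetical one-letter ordering in `AMINO_ACIDS` (the source does not fix this mapping), and that non-standard characters are simply skipped. The names `AMINO_ACIDS` and `aac_features` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code. The ordering is an
# assumption; the source does not say which residue A(u) maps to.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence: str) -> list[float]:
    """Return the 20-dimensional amino acid composition of a sequence."""
    # Assumption: characters outside the 20 standard residues are ignored.
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acid residues")
    counts = Counter(residues)
    total = len(residues)
    # One frequency per amino acid, in the order given by AMINO_ACIDS.
    return [counts[a] / total for a in AMINO_ACIDS]
```

For the toy sequence `"MKKA"`, the entry for `K` is 0.5 and the entries for `M` and `A` are 0.25 each, with all other entries zero.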
In a further technical scheme, step 2 specifically comprises:
Step 2-A: determine the threshold K of the KNN algorithm;
Step 2-B: select one sequence from the protein sequence data set as the sequence to be predicted and use the remaining sequences as the training set;
Step 2-C: based on the AAC features obtained in step 1, compute the Euclidean distance between the sequence to be predicted and each protein sequence in the training set, and take the K protein sequences with the shortest Euclidean distance as the prediction range.
In a further technical scheme, the Euclidean distance in step 2-C is computed as follows:
Let the AAC feature of the sequence to be predicted be $P'_{AAC} = (f'_1, f'_2, f'_3, \dots, f'_n)$ and the AAC feature of any protein sequence in the training set be $P''_{AAC} = (f''_1, f''_2, f''_3, \dots, f''_n)$. The Euclidean distance d is then:
$$d = \sqrt{\sum_{i=1}^{n} \left(f'_i - f''_i\right)^2}$$
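The neighbor selection of step 2-C can be sketched as below. This is an illustrative sketch under stated assumptions: feature vectors are plain Python lists of equal length, ties in distance are broken by training-set order, and the function name `nearest_neighbors` is hypothetical. It uses `math.dist` from the standard library (Python 3.8 or later) for the Euclidean distance.

```python
import math


def nearest_neighbors(query, training, k):
    """Return the indices of the k training vectors closest to `query`.

    `query` is one feature vector; `training` is a list of feature
    vectors of the same length. Distance is Euclidean.
    """
    # Pair each distance with its index so that sorting keeps track of
    # which training sequence it belongs to (ties fall back to index order).
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(training)]
    distances.sort()
    return [idx for _, idx in distances[:k]]
```

With AAC vectors as input and K = 20, the returned indices would identify the 20 training sequences that form the prediction range.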
In a further technical scheme, step 3 specifically comprises:
Step 3-A: use the set of protein sequences in the prediction range as the protein sequence alignment database;
Step 3-B: perform a BLAST similarity comparison of the sequence being predicted against the protein sequence alignment database; the database sequence with the highest score against the sequence being predicted is taken as the most similar sequence, and its compartment is taken as the compartment of the sequence being predicted.
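The final assignment in step 3-B amounts to picking the top-scoring hit and copying its label. The sketch below assumes the comparison results are available in the 12-column tabular format that NCBI BLAST+ writes with `-outfmt 6` (subject identifier in column 2, bit score in column 12), and it assumes the bit score is the score being ranked, which the source does not specify. The function names and the `labels` mapping are hypothetical.

```python
def best_hit(tabular_output: str):
    """Return (subject_id, bit_score) of the highest-scoring hit, or None."""
    best = None
    for line in tabular_output.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumed column layout: field 1 is the subject id, field 11 the bit score.
        subject_id, bit_score = fields[1], float(fields[11])
        if best is None or bit_score > best[1]:
            best = (subject_id, bit_score)
    return best


def predict_compartment(tabular_output: str, labels: dict):
    """Assign the compartment label of the best hit; None if there is no hit."""
    hit = best_hit(tabular_output)
    return None if hit is None else labels[hit[0]]
```

Returning `None` when there are no hits leaves the fallback policy to the caller; the source does not describe what happens when no neighbor aligns with the query.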
The beneficial effects of the invention are as follows:
As research in the life sciences deepens, large-scale data are produced continuously, and extracting useful information from them efficiently and accurately is of great significance. Extracting sequence-structure and functional features that can be described numerically from protein sequences is one of the core tasks of subcellular localization prediction research. Compared with prior-art approaches that simply apply a traditional protein sequence feature extraction algorithm such as AAC and feed the features to a classifier, the method of the invention achieves higher localization prediction accuracy.
The invention uses a KNN algorithm improved with BLAST alignment on features extracted from protein sequences. Tests on two apoptosis protein data sets show that the method has higher prediction accuracy; in particular, recognition precision improves markedly for subcellular classes on which traditional methods predict poorly, which is important for accurately predicting the subcellular location of unknown proteins.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the invention.
Detailed description of the embodiments
The method is illustrated on a protein sequence data set of 98 apoptosis protein sequences obtained from the SWISS-PROT database, using the KNN algorithm improved with BLAST alignment to predict protein subcellular compartments. Fig. 1 is a block diagram of the invention. As shown in Fig. 1, the specific steps are as follows:
Step 1: extract the AAC features of all protein sequences in the protein sequence data set.
Step 1-A: represent each protein sequence as P:
$$P = R_1 R_2 R_3 R_4 R_5 \dots R_L$$
where L is the length of the protein sequence and $R_i$ (i = 1, …, L) is the i-th amino acid residue in the sequence.
Step 1-B: compute the AAC feature of each protein sequence.
The AAC (Amino Acid Composition) feature of a protein sequence can be expressed as:
$$P_{AAC} = [f_1, f_2, \dots, f_{20}]^T$$
In the formula above, $f_u$ (u = 1, 2, 3, …, 20) is the frequency with which each of the 20 amino acids occurs in the protein sequence; $f_u$ can be obtained from:
$$f_u = \frac{1}{N}\sum_{i=1}^{L} \delta\big(R_i, A(u)\big), \qquad \delta\big(R_i, A(u)\big) = \begin{cases} 1, & R_i = A(u) \\ 0, & \text{otherwise} \end{cases}$$
where L is the length of the protein sequence, N is the total number of amino acid residues that a sequence word contains, and A(u) is the amino acid residue corresponding to index u. After this counting, every sequence word can be represented by a 20-dimensional vector, which gives the sequence word features of all protein sequences, that is, the AAC features of the protein sequences.
Step 2: select one protein sequence as the test sequence and use the remaining sequences as the training set; determine the prediction range by the KNN algorithm.
Step 2-A: determine the threshold K of the KNN algorithm. In this embodiment, K = 20.
Step 2-B: select one of the protein sequences as the sequence to be predicted; the remaining sequences form the training set. In this embodiment, any one of the 98 sequences is taken as the sequence to be predicted, and the remaining 97 sequences form the training set.
Step 2-C: based on the AAC features of the protein sequences, compute the Euclidean distance between the sequence to be predicted and each sequence in the training set, and take the set of the K protein sequences with the shortest Euclidean distance as the prediction range. Since K = 20, the resulting prediction range contains 20 protein sequences.
For any two n-dimensional feature vectors $(s_1, s_2, s_3, \dots, s_n)$ and $(t_1, t_2, t_3, \dots, t_n)$, the Euclidean distance is computed as:
$$d(s, t) = \sqrt{\sum_{i=1}^{n} \left(s_i - t_i\right)^2}$$
In this embodiment, the AAC feature of the sequence to be predicted is $P'_{AAC} = (f'_1, f'_2, f'_3, \dots, f'_n)$ and the AAC feature of any protein sequence in the training set is $P''_{AAC} = (f''_1, f''_2, f''_3, \dots, f''_n)$. Based on the AAC features, the Euclidean distance d between the sequence to be predicted and a protein sequence in the training set is:
$$d = \sqrt{\sum_{i=1}^{n} \left(f'_i - f''_i\right)^2}$$
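The hold-one-out split described in step 2-B (one sequence to predict, the other 97 for training) can be sketched as a small generator. Repeating the split for every sequence and averaging the hits into an accuracy figure is an assumption made here for illustration; the source does not spell out its evaluation protocol. The names `leave_one_out`, `leave_one_out_accuracy`, and the `predict` callback are hypothetical.

```python
def leave_one_out(items):
    """Yield (held_out, rest) pairs, holding out each item exactly once."""
    for i, held_out in enumerate(items):
        yield held_out, items[:i] + items[i + 1:]


def leave_one_out_accuracy(samples, predict):
    """Fraction of held-out samples whose label is predicted correctly.

    `samples` is a list of (sequence, label) pairs;
    `predict(sequence, training)` returns a label given the remaining pairs.
    """
    correct = 0
    for (sequence, label), training in leave_one_out(samples):
        if predict(sequence, training) == label:
            correct += 1
    return correct / len(samples)
```

With 98 samples the generator yields 98 splits, each with a 97-element training list, matching the sizes given in the embodiment.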
Step 3: perform a BLAST similarity comparison between the sequence being predicted and the protein sequences in the prediction range to obtain the most similar sequence; the compartment of the most similar sequence is taken as the compartment of the sequence being predicted, that is, its protein subcellular compartment.
Step 3-A: use the set of protein sequences in the prediction range as the protein sequence alignment database, and write it to the database file fasta.
Step 3-B: perform a BLAST similarity comparison of the sequence being predicted against the protein sequence alignment database in the database file fasta; the database sequence with the highest score against the sequence being predicted is taken as the most similar sequence P′, and the compartment of P′ is taken as the compartment of the sequence being predicted, which is the protein subcellular compartment of the invention.
The BLAST similarity comparison itself is prior art, and existing computational methods can be used to carry it out.
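One way to realize steps 3-A and 3-B with existing tools is to write the neighbor set to a FASTA file and hand it to the NCBI BLAST+ command-line programs. The sketch below only writes the file and builds the argument lists; it does not run BLAST. It assumes BLAST+ (`makeblastdb`, `blastp`) is the tool used, which the source does not state, and the file names and function names are hypothetical.

```python
from pathlib import Path


def write_fasta(records, path):
    """Write (identifier, sequence) pairs to `path` in FASTA format."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        lines.append(sequence)
    Path(path).write_text("\n".join(lines) + "\n")


def blast_commands(db_fasta, query_fasta, result_path):
    """Build argument lists for makeblastdb and blastp (not executed here)."""
    # Index the neighbor sequences as a protein database; without -out,
    # the database takes the name of the input file.
    make_db = ["makeblastdb", "-in", str(db_fasta), "-dbtype", "prot"]
    # Search the query against that database, writing tabular output.
    search = ["blastp", "-query", str(query_fasta), "-db", str(db_fasta),
              "-outfmt", "6", "-out", str(result_path)]
    return make_db, search
```

If BLAST+ is installed, each list could be passed to `subprocess.run(cmd, check=True)` in turn, and the resulting tabular file then scanned for the top-scoring hit.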
The above is only a preferred embodiment of the invention, and the invention is not limited to it. Other improvements and changes that those skilled in the art derive directly or conceive of without departing from the spirit and concept of the invention are all considered to fall within the scope of protection of the invention.
Claims (6)
1. A method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison, characterized by comprising the following steps:
Step 1: extract the AAC features of all protein sequences in the protein sequence data set;
Step 2: select one protein sequence from the data set as the test sequence and use the remaining protein sequences as the training set; determine, by the KNN algorithm, the set of protein sequences within the prediction range;
Step 3: perform a BLAST similarity comparison between the sequence being predicted and the set of protein sequences in the prediction range to obtain the most similar sequence; the compartment of the most similar sequence is taken as the compartment of the sequence being predicted.
2. The method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison as claimed in claim 1, characterized in that step 1 specifically comprises:
Step 1-A: represent each protein sequence in the data set as P:
$$P = R_1 R_2 R_3 R_4 R_5 \dots R_L$$
where L is the length of the protein sequence and $R_i$ (i = 1, …, L) is the i-th amino acid residue in the sequence;
Step 1-B: compute the AAC feature of each protein sequence P:
$$P_{AAC} = [f_1, f_2, \dots, f_{20}]^T$$
where $f_u$ (u = 1, 2, 3, …, 20) is the frequency with which the u-th amino acid occurs in protein sequence P.
3. The method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison as claimed in claim 2, characterized in that the frequency $f_u$ with which the u-th amino acid occurs in protein sequence P in step 1-B is computed as:
$$f_u = \frac{1}{N}\sum_{i=1}^{L} \delta\big(R_i, A(u)\big), \qquad \delta\big(R_i, A(u)\big) = \begin{cases} 1, & R_i = A(u) \\ 0, & \text{otherwise} \end{cases}$$
where L is the length of the protein sequence, N is the total number of amino acid residues that the protein sequence contains, and A(u) is the amino acid residue corresponding to index u.
4. The method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison as claimed in claim 1, characterized in that step 2 specifically comprises:
Step 2-A: determine the threshold K of the KNN algorithm;
Step 2-B: select one sequence from the protein sequence data set as the sequence to be predicted and use the remaining sequences as the training set;
Step 2-C: based on the AAC features obtained in step 1, compute the Euclidean distance between the sequence to be predicted and each protein sequence in the training set, and take the set of the K protein sequences with the shortest Euclidean distance as the prediction range.
5. The method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison as claimed in claim 4, characterized in that the Euclidean distance in step 2-C is computed as follows:
Let the AAC feature of the sequence to be predicted be $P'_{AAC} = (f'_1, f'_2, f'_3, \dots, f'_n)$ and the AAC feature of any protein sequence in the training set be $P''_{AAC} = (f''_1, f''_2, f''_3, \dots, f''_n)$. The Euclidean distance d is then:
$$d = \sqrt{\sum_{i=1}^{n} \left(f'_i - f''_i\right)^2}$$
6. The method for predicting the subcellular compartment of a protein using a KNN calculation and similarity comparison as claimed in claim 1, characterized in that step 3 specifically comprises:
Step 3-A: use the set of protein sequences in the prediction range as the protein sequence alignment database;
Step 3-B: perform a BLAST similarity comparison of the sequence being predicted against the protein sequence alignment database; the database sequence with the highest score against the sequence being predicted is taken as the most similar sequence, and its compartment is taken as the compartment of the sequence being predicted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610072828.7A CN105760711A (en) | 2016-02-02 | 2016-02-02 | Method for using KNN calculation and similarity comparison to predict protein subcellular section |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610072828.7A CN105760711A (en) | 2016-02-02 | 2016-02-02 | Method for using KNN calculation and similarity comparison to predict protein subcellular section |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105760711A true CN105760711A (en) | 2016-07-13 |
Family
ID=56342992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610072828.7A Pending CN105760711A (en) | 2016-02-02 | 2016-02-02 | Method for using KNN calculation and similarity comparison to predict protein subcellular section |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105760711A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106778070A (en) * | 2017-03-31 | 2017-05-31 | 上海交通大学 | A kind of human protein's subcellular location Forecasting Methodology |
CN109273054A (en) * | 2018-08-31 | 2019-01-25 | 南京农业大学 | Protein Subcellular interval prediction method based on relation map |
CN112634988A (en) * | 2021-01-07 | 2021-04-09 | 内江师范学院 | Python language-based gene variation detection method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006090868A1 (en) * | 2005-02-22 | 2006-08-31 | Riken | Gene structure predicting method and gene structure predicting program |
CN102819693A (en) * | 2012-08-17 | 2012-12-12 | 中国人民解放军第三军医大学第二附属医院 | Prediction method for protein subcellular site formed based on improved-period pseudo amino acid |
Non-Patent Citations (3)
Title |
---|
HU LL et al.: "Using protein-protein interaction network information to predict the subcellular locations of proteins in budding yeast", 《PROTEIN PEPT LETT》 * |
YAO YH et al.: "Apoptosis protein subcellular location prediction based on position-specific scoring matrix", 《J COMPUTAT THEORET NANOSCI》 * |
LIU Liyuan: "Protein subcellular localization prediction based on ensemble learning", 《China Master's Theses Full-text Database》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106778070A (en) * | 2017-03-31 | 2017-05-31 | 上海交通大学 | A kind of human protein's subcellular location Forecasting Methodology |
CN109273054A (en) * | 2018-08-31 | 2019-01-25 | 南京农业大学 | Protein Subcellular interval prediction method based on relation map |
CN109273054B (en) * | 2018-08-31 | 2021-07-13 | 南京农业大学 | Protein subcellular interval prediction method based on relational graph |
CN112634988A (en) * | 2021-01-07 | 2021-04-09 | 内江师范学院 | Python language-based gene variation detection method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lin et al. | In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval | |
CN108009405A (en) | A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter | |
CN108765383B (en) | Video description method based on deep migration learning | |
Xiao et al. | Based on grid-search and PSO parameter optimization for Support Vector Machine | |
CN103778227A (en) | Method for screening useful images from retrieved images | |
Yin et al. | Feature-based map matching for low-sampling-rate GPS trajectories | |
CN105760711A (en) | Method for using KNN calculation and similarity comparison to predict protein subcellular section | |
CN103955628A (en) | Subspace fusion-based protein-vitamin binding location point predicting method | |
CN106227884B (en) | A kind of recommended method of calling a taxi online based on collaborative filtering | |
CN103440471A (en) | Human body action identifying method based on lower-rank representation | |
CN104077499A (en) | Supervised up-sampling learning based protein-nucleotide binding positioning point prediction method | |
CN109544600A (en) | It is a kind of based on it is context-sensitive and differentiate correlation filter method for tracking target | |
CN112085247A (en) | Protein residue contact prediction method based on deep learning | |
CN106250925A (en) | A kind of zero Sample video sorting technique based on the canonical correlation analysis improved | |
CN102945553A (en) | Remote sensing image partition method based on automatic difference clustering algorithm | |
CN103324933A (en) | Membrane protein sub-cell positioning method based on complex space multi-view feature fusion | |
CN103177121A (en) | Locality preserving projection method for adding pearson relevant coefficient | |
CN109034237A (en) | Winding detection method based on convolutional Neural metanetwork road sign and sequence search | |
CN109215737A (en) | Protein characteristic extracts, functional mode generates, the method and device of function prediction | |
Sun et al. | Dun: Dual-path temporal matching network for natural language-based vehicle retrieval | |
CN105046106B (en) | A kind of Prediction of Protein Subcellular Location method realized with nearest _neighbor retrieval | |
Kim | Deep active learning for sequence labeling based on diversity and uncertainty in gradient | |
CN112116949B (en) | Protein folding identification method based on triple loss | |
CN106250818A (en) | A kind of total order keeps the face age estimation method of projection | |
CN103439441A (en) | Peptide identification method based on subset error rate estimation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160713 |