CN105760711A - Method for using KNN calculation and similarity comparison to predict protein subcellular section - Google Patents

Info

Publication number
CN105760711A
Authority
CN
China
Prior art keywords
sequence
protein
protein sequence
knn
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610072828.7A
Other languages
Chinese (zh)
Inventor
张梁
薛卫
王雄飞
杨荣丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-02-02
Filing date
2016-02-02
Publication date
2016-07-13
Application filed by Jiangnan University
Priority to CN201610072828.7A
Publication of CN105760711A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a method for predicting the subcellular compartment of a protein using KNN calculation and similarity comparison. The method comprises the following steps: (1) extract the AAC features of all protein sequences in a protein sequence data set; (2) use the KNN algorithm to determine the set of protein sequences within the prediction range; (3) perform a BLAST similarity comparison to obtain the most similar sequence, whose compartment is taken as the compartment of the sequence being predicted. The method achieves high prediction accuracy. In particular, recognition accuracy improves markedly for subcellular classes that traditional methods predict poorly, which makes the method useful for accurately predicting the subcellular location of uncharacterized proteins.

Description

Method for predicting protein subcellular compartments using KNN calculation and similarity comparison
Technical field
The invention belongs to the field of bioinformatics and relates to methods for predicting the subcellular compartment of a protein, and in particular to a method that predicts protein subcellular compartments using KNN calculation and similarity comparison.
Background art
The function of a protein is closely tied to the subcellular compartment in which it resides, so predicting the subcellular compartment of a protein sequence is of considerable research importance. Machine-learning-based prediction has become the main approach to obtaining compartment information.
Since machine learning was first applied to subcellular compartment prediction in 1991, research over the following two decades and more has made steady progress. The main prediction methods include: predicting from the amino acid composition of a protein sequence with a covariant discriminant function; prediction based on fusing features such as the N-terminus, the C-terminus, and hydrophobicity; and prediction that combines the fuzzy K-nearest-neighbor (Fuzzy K-Nearest Neighbor, FKNN) algorithm with pseudo amino acid composition features.
These methods extract features from a protein sequence and feed them to a classifier to determine the compartment. Because they consider only the features of the sequence itself and ignore the similarity relationships between sequences that arise from genetic variation, their prediction accuracy is on the low side.
Summary of the invention
To address the deficiencies of the prior art, the invention discloses a method for predicting protein subcellular compartments using KNN calculation and similarity comparison.
The technical solution is as follows:
A method for predicting protein subcellular compartments using KNN calculation and similarity comparison, comprising the following steps:
Step 1: extract the AAC features of all protein sequences in the protein sequence data set;
Step 2: choose one protein sequence from the data set as the test sequence and use the remaining protein sequences as the training set; use the KNN algorithm to determine the set of protein sequences within the prediction range;
Step 3: perform a BLAST similarity comparison between the sequence to be predicted and the set of protein sequences in the prediction range to obtain the most similar sequence; the compartment to which the most similar sequence belongs is the compartment of the sequence to be predicted.
In a further technical solution, step 1 specifically comprises:
Step 1-A: represent each protein sequence in the data set as P:
$$P = R_1 R_2 R_3 R_4 R_5 \cdots R_L$$
where $L$ is the length of the protein sequence and $R_i$ ($i = 1, \ldots, L$) is the i-th amino acid residue in the sequence;
Step 1-B: calculate the AAC feature of each protein sequence P:
$$P_{AAC} = [f_1, f_2, \ldots, f_{20}]^T$$
where $f_u$ ($u = 1, 2, 3, \ldots, 20$) is the frequency with which the u-th amino acid occurs in protein sequence P.
In a further technical solution, the frequency $f_u$ with which the u-th amino acid occurs in protein sequence P in step 1-B is calculated as:
$$f_u = \frac{1}{N}\sum_{i=1}^{L} R_i, \qquad R_i = \begin{cases} 1, & \text{if } R_i = A(u) \\ 0, & \text{if } R_i \neq A(u) \end{cases}$$
where $L$ is the length of the protein sequence, $N$ is the total number of amino acid residues the protein sequence contains, and $A(u)$ is the amino acid residue corresponding to index $u$.
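As a small worked illustration of this frequency calculation, consider a hypothetical four-residue sequence P = MKKA (chosen only for illustration, it is not part of the patent's data set). Here L = N = 4, and each frequency is the number of residues equal to A(u) divided by N:

```latex
f_{\mathrm{K}} = \frac{1}{4}\,(0 + 1 + 1 + 0) = 0.5, \qquad
f_{\mathrm{M}} = f_{\mathrm{A}} = \frac{1}{4} = 0.25, \qquad
f_u = 0 \ \text{for the remaining 17 amino acids}
```

The 20 frequencies sum to 1, as expected when every residue is one of the 20 standard amino acids.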
In a further technical solution, step 2 specifically comprises:
Step 2-A: determine the threshold K of the KNN algorithm;
Step 2-B: choose one sequence from the protein sequence data set as the sequence to be predicted and use the remaining sequences as the training set;
Step 2-C: using the AAC features obtained in step 1, calculate the Euclidean distance between the sequence to be predicted and each protein sequence in the training set, and take the K protein sequences with the shortest Euclidean distance as the prediction range.
In a further technical solution, the Euclidean distance in step 2-C is calculated as follows:
Let the AAC feature of the sequence to be predicted be $P'_{AAC} = (f'_1, f'_2, f'_3, \ldots, f'_n)$ and the AAC feature of any protein sequence in the training set be $P''_{AAC} = (f''_1, f''_2, f''_3, \ldots, f''_n)$. The Euclidean distance $d$ is then:
$$d = \sqrt{\sum_{i=1}^{n} (f'_i - f''_i)^2}$$
In a further technical solution, step 3 specifically comprises:
Step 3-A: use the set of protein sequences in the prediction range as the protein sequence comparison database;
Step 3-B: run a BLAST similarity comparison of the sequence to be predicted against the protein sequence comparison database, and take the highest-scoring protein sequence as the most similar sequence; the compartment of the most similar sequence is the compartment of the sequence to be predicted.
The beneficial effects of the invention are as follows:
As life-science research deepens, large-scale data are produced continuously, and extracting useful information from these data efficiently and accurately is of great importance. Extracting sequence-structure and functional features that can be described numerically from protein sequences is one of the core tasks of subcellular localization prediction research. Compared with prior-art approaches that simply apply a traditional protein sequence feature extraction algorithm, such as AAC, and feed the features to a classifier, the method of the invention achieves higher localization prediction accuracy.
The invention uses a KNN algorithm improved with BLAST alignment, applied to features extracted from protein sequences. Tests on two apoptosis protein data sets show that the method achieves higher prediction accuracy. In particular, recognition accuracy improves markedly for subcellular classes on which traditional methods perform poorly, which is valuable for accurately predicting the subcellular location of unknown proteins.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the invention.
Detailed description of the embodiments
The method is illustrated with a protein sequence data set of 98 apoptosis protein sequences obtained from the SWISS-PROT database, using the KNN algorithm improved with BLAST alignment to predict protein subcellular compartments. Fig. 1 is a block diagram of the invention. As shown in Fig. 1, the specific steps are as follows:
Step 1: extract the AAC features of all protein sequences in the data set.
Step 1-A: represent each protein sequence as P:
$$P = R_1 R_2 R_3 R_4 R_5 \cdots R_L$$
where $L$ is the length of the protein sequence and $R_i$ ($i = 1, \ldots, L$) is the i-th amino acid residue in the sequence.
Step 1-B: calculate the AAC feature of each protein sequence.
The AAC (Amino Acid Composition) feature of a protein sequence can be expressed as:
$$P_{AAC} = [f_1, f_2, \ldots, f_{20}]^T$$
where $f_u$ ($u = 1, 2, 3, \ldots, 20$) is the frequency with which each of the 20 amino acids occurs in the protein sequence. $f_u$ can be obtained from the following formula:
$$f_u = \frac{1}{N}\sum_{i=1}^{L} R_i, \qquad R_i = \begin{cases} 1, & \text{if } R_i = A(u) \\ 0, & \text{if } R_i \neq A(u) \end{cases}$$
where $L$ is the length of the protein sequence, $N$ is the total number of amino acid residues the sequence contains, and $A(u)$ is the amino acid residue corresponding to index $u$. Inside the sum, $R_i$ stands for the 0/1 indicator value given by the two cases. After this counting, every sequence can be represented as a 20-dimensional vector, which gives the composition feature, i.e. the AAC feature, of every protein sequence.
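The AAC calculation of step 1 can be sketched in Python as below. This is a minimal illustration rather than the patent's implementation: the function name `aac_features` and the alphabetical ordering of the one-letter codes are assumptions, since the patent does not say which amino acid corresponds to which index u.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids (one-letter codes);
# the patent does not specify which residue maps to which index u.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(sequence):
    """Return the 20-dimensional amino acid composition vector of a sequence."""
    seq = sequence.upper()
    n = len(seq)  # N: total number of residues in the sequence
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    # f_u = (number of residues equal to A(u)) / N
    return [counts[aa] / n for aa in AMINO_ACIDS]
```

Residues outside the 20 standard letters (for example X) still count toward N in this sketch but contribute to no component, so the vector then sums to less than 1.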
Step 2: choose one protein sequence as the test sequence and use the remaining sequences as the training set; use the KNN algorithm to determine the prediction range.
Step 2-A: determine the threshold K of the KNN algorithm. In this embodiment, K = 20.
Step 2-B: choose one of the protein sequences as the sequence to be predicted and use the remaining sequences as the training set. In this embodiment, one of the 98 sequences is chosen arbitrarily as the sequence to be predicted, and the remaining 97 sequences form the training set.
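The split in step 2-B can be sketched as a leave-one-out loop. The patent describes taking one arbitrary sequence out of 98. Repeating the split for every sequence in turn, as below, is an assumption about how the method would be evaluated over a whole data set, and `leave_one_out` is a hypothetical helper name.

```python
def leave_one_out(records):
    """Yield (test_record, training_records) with each record held out in turn."""
    for i, test in enumerate(records):
        # Everything except the held-out record forms the training set.
        training = records[:i] + records[i + 1:]
        yield test, training
```

With 98 records, each yielded training set holds the remaining 97.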
Step 2-C: based on the AAC features of the protein sequences, calculate the Euclidean distance between the sequence to be predicted and each sequence in the training set, and take the set of the K protein sequences with the shortest Euclidean distance as the prediction range. Because K = 20, the resulting prediction range contains 20 protein sequences.
For any two n-dimensional feature vectors $(s_1, s_2, s_3, \ldots, s_n)$ and $(t_1, t_2, t_3, \ldots, t_n)$, the Euclidean distance is calculated as:
$$d = \sqrt{\sum_{i=1}^{n} (s_i - t_i)^2}$$
In this embodiment, let the AAC feature of the sequence to be predicted be $P'_{AAC} = (f'_1, f'_2, f'_3, \ldots, f'_n)$ and the AAC feature of any protein sequence in the training set be $P''_{AAC} = (f''_1, f''_2, f''_3, \ldots, f''_n)$. Based on the AAC features, the Euclidean distance $d$ between the sequence to be predicted and a protein sequence in the training set is:
$$d = \sqrt{\sum_{i=1}^{n} (f'_i - f''_i)^2}$$
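A minimal sketch of step 2-C follows: compute the Euclidean distance between feature vectors and keep the K closest training records as the prediction range. The `(identifier, feature_vector)` record layout and the function names are assumptions made for illustration. The default k=20 follows the embodiment.

```python
import math

def euclidean_distance(s, t):
    """Euclidean distance between two equal-length feature vectors."""
    if len(s) != len(t):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

def nearest_k(query_features, training, k=20):
    """Return the k training records closest to the query.

    `training` is a list of (identifier, feature_vector) pairs; this
    record layout is an assumption, not something the patent prescribes.
    """
    ranked = sorted(training, key=lambda rec: euclidean_distance(query_features, rec[1]))
    return ranked[:k]
```

Because only the ranking matters for choosing the K nearest sequences, the squared distance would select the same set. The square root is kept here to match the Euclidean distance named in the text.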
Step 3: perform a BLAST similarity comparison between the sequence to be predicted and the protein sequences in the prediction range to obtain the most similar sequence. The compartment to which the most similar sequence belongs is the compartment of the sequence to be predicted, that is, its protein subcellular compartment.
Step 3-A: use the set of protein sequences in the prediction range as the protein sequence comparison database and write it to the database file fasta.
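Writing the prediction-range sequences to a FASTA file, as step 3-A describes, could look like the sketch below. The helper name `write_fasta`, the `(identifier, sequence)` record layout, and the 60-character line width are assumptions. The patent only states that the set is written to a database file.

```python
def write_fasta(records, path, line_width=60):
    """Write (identifier, sequence) pairs to `path` in FASTA format."""
    with open(path, "w", encoding="ascii") as handle:
        for identifier, sequence in records:
            # Header line: '>' followed by the sequence identifier.
            handle.write(f">{identifier}\n")
            # Sequence body, wrapped to a fixed line width.
            for start in range(0, len(sequence), line_width):
                handle.write(sequence[start:start + line_width] + "\n")
```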
Step 3-B: run a BLAST similarity comparison of the sequence to be predicted against the protein sequence comparison database in the database file fasta, and take the highest-scoring protein sequence as the most similar sequence P′. The compartment of P′ is the compartment of the sequence to be predicted, i.e. the protein subcellular compartment referred to in the invention.
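One way to carry out step 3-B is sketched below, under several assumptions: the patent does not name a BLAST implementation or say which score is ranked, so the sketch assumes the NCBI BLAST+ command-line tools (`makeblastdb`, `blastp`) with tabular output (`-outfmt 6`) and ranks hits by bit score. All file names and function names are hypothetical.

```python
def blast_commands(candidates_fasta, query_fasta,
                   db_name="candidates_db", out_path="hits.tsv"):
    """Build NCBI BLAST+ command lines (all file names are hypothetical)."""
    make_db = ["makeblastdb", "-in", candidates_fasta,
               "-dbtype", "prot", "-out", db_name]
    search = ["blastp", "-query", query_fasta, "-db", db_name,
              "-outfmt", "6", "-out", out_path]
    return make_db, search

def best_hit(tabular_text):
    """Return (subject_id, bit_score) of the top-scoring hit, or None if no hits.

    Expects the default 12-column tabular layout (-outfmt 6), in which
    column 2 is the subject identifier and column 12 is the bit score.
    """
    best = None
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id, bit_score = fields[1], float(fields[11])
        if best is None or bit_score > best[1]:
            best = (subject_id, bit_score)
    return best
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine where BLAST+ is installed. The identifier returned by `best_hit` is then looked up in the training set to read off its compartment label.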
BLAST similarity comparison is prior art, and existing computational methods can be used to carry it out.
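The overall flow of steps 1 to 3 can be sketched end to end as follows. To keep the example self-contained and runnable without external tools, the similarity comparison is passed in as a function. The `shared_prefix_length` scorer is a toy stand-in used only to exercise the code and is not BLAST. The record layout, function names, and amino acid ordering are likewise assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed index order, not given in the patent

def aac(seq):
    """20-dimensional amino acid composition of a non-empty sequence."""
    counts = Counter(seq.upper())
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def distance(s, t):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

def predict_compartment(query_seq, training, similarity, k=20):
    """Predict the compartment label of `query_seq`.

    training   -- list of (sequence, compartment_label) pairs
    similarity -- callable(query, candidate) -> score; the patent uses
                  BLAST here, this sketch accepts any stand-in scorer
    """
    q = aac(query_seq)
    # Step 2: shortlist the k training sequences nearest in AAC space.
    shortlist = sorted(training, key=lambda rec: distance(q, aac(rec[0])))[:k]
    # Step 3: take the label of the most similar sequence in the shortlist.
    best_seq, best_label = max(shortlist, key=lambda rec: similarity(query_seq, rec[0]))
    return best_label

def shared_prefix_length(a, b):
    """Toy similarity used only to exercise the sketch; it is NOT BLAST."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

The two-stage design means the similarity comparison only ever runs against K candidates rather than the whole training set, while the final label comes from sequence similarity rather than from a vote over the K neighbours.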
The above is only a preferred embodiment of the invention, and the invention is not limited to this embodiment. Other improvements and variations that those skilled in the art can directly derive or conceive of without departing from the spirit and concept of the invention are all considered to fall within the scope of protection of the invention.

Claims (6)

1. A method for predicting protein subcellular compartments using KNN calculation and similarity comparison, characterized by comprising the following steps:
Step 1: extract the AAC features of all protein sequences in the protein sequence data set;
Step 2: choose one protein sequence from the data set as the test sequence and use the remaining protein sequences as the training set; use the KNN algorithm to determine the set of protein sequences within the prediction range;
Step 3: perform a BLAST similarity comparison between the sequence to be predicted and the set of protein sequences in the prediction range to obtain the most similar sequence; the compartment to which the most similar sequence belongs is the compartment of the sequence to be predicted.
2. The method for predicting protein subcellular compartments using KNN calculation and similarity comparison according to claim 1, characterized in that step 1 specifically comprises:
Step 1-A: represent each protein sequence in the data set as P:
$$P = R_1 R_2 R_3 R_4 R_5 \cdots R_L$$
where $L$ is the length of the protein sequence and $R_i$ ($i = 1, \ldots, L$) is the i-th amino acid residue in the sequence;
Step 1-B: calculate the AAC feature of each protein sequence P:
$$P_{AAC} = [f_1, f_2, \ldots, f_{20}]^T$$
where $f_u$ ($u = 1, 2, 3, \ldots, 20$) is the frequency with which the u-th amino acid occurs in protein sequence P.
3. The method for predicting protein subcellular compartments using KNN calculation and similarity comparison according to claim 2, characterized in that the frequency $f_u$ with which the u-th amino acid occurs in protein sequence P in step 1-B is calculated as:
$$f_u = \frac{1}{N}\sum_{i=1}^{L} R_i, \qquad R_i = \begin{cases} 1, & \text{if } R_i = A(u) \\ 0, & \text{if } R_i \neq A(u) \end{cases}$$
where $L$ is the length of the protein sequence, $N$ is the total number of amino acid residues the protein sequence contains, and $A(u)$ is the amino acid residue corresponding to index $u$.
4. The method for predicting protein subcellular compartments using KNN calculation and similarity comparison according to claim 1, characterized in that step 2 specifically comprises:
Step 2-A: determine the threshold K of the KNN algorithm;
Step 2-B: choose one sequence from the protein sequence data set as the sequence to be predicted and use the remaining sequences as the training set;
Step 2-C: using the AAC features obtained in step 1, calculate the Euclidean distance between the sequence to be predicted and each protein sequence in the training set, and take the set of the K protein sequences with the shortest Euclidean distance as the prediction range.
5. The method for predicting protein subcellular compartments using KNN calculation and similarity comparison according to claim 4, characterized in that the Euclidean distance in step 2-C is calculated as follows:
Let the AAC feature of the sequence to be predicted be $P'_{AAC} = (f'_1, f'_2, f'_3, \ldots, f'_n)$ and the AAC feature of any protein sequence in the training set be $P''_{AAC} = (f''_1, f''_2, f''_3, \ldots, f''_n)$. The Euclidean distance $d$ is then:
$$d = \sqrt{\sum_{i=1}^{n} (f'_i - f''_i)^2}.$$
6. The method for predicting protein subcellular compartments using KNN calculation and similarity comparison according to claim 1, characterized in that step 3 specifically comprises:
Step 3-A: use the set of protein sequences in the prediction range as the protein sequence comparison database;
Step 3-B: run a BLAST similarity comparison of the sequence to be predicted against the protein sequence comparison database, and take the highest-scoring protein sequence as the most similar sequence; the compartment of the most similar sequence is the compartment of the sequence to be predicted.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201610072828.7A CN105760711A (en) | 2016-02-02 | 2016-02-02 | Method for using KNN calculation and similarity comparison to predict protein subcellular section

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201610072828.7A CN105760711A (en) | 2016-02-02 | 2016-02-02 | Method for using KNN calculation and similarity comparison to predict protein subcellular section

Publications (1)

Publication Number | Publication Date
CN105760711A (en) | 2016-07-13

Family

ID=56342992

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN201610072828.7A CN105760711A (en) | Pending | 2016-02-02 | 2016-02-02 | Method for using KNN calculation and similarity comparison to predict protein subcellular section

Country Status (1)

Country | Link
CN (1) | CN105760711A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN106778070A (en) * | 2017-03-31 | 2017-05-31 | Shanghai Jiao Tong University | Method for predicting human protein subcellular location
CN109273054A (en) * | 2018-08-31 | 2019-01-25 | Nanjing Agricultural University | Protein subcellular interval prediction method based on relational graph
CN112634988A (en) * | 2021-01-07 | 2021-04-09 | Neijiang Normal University | Python language-based gene variation detection method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
WO2006090868A1 (en) * | 2005-02-22 | 2006-08-31 | Riken | Gene structure predicting method and gene structure predicting program
CN102819693A (en) * | 2012-08-17 | 2012-12-12 | Second Affiliated Hospital of the Third Military Medical University of the Chinese People's Liberation Army | Prediction method for protein subcellular site formed based on improved-period pseudo amino acid

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HU LL et al.: "Using protein-protein interaction network information to predict the subcellular locations of proteins in budding yeast", PROTEIN PEPT LETT *
YAO YH et al.: "Apoptosis protein subcellular location prediction based on position-specific scoring matrix", J COMPUTAT THEORET NANOSCI *
LIU Liyuan: "Protein subcellular localization prediction based on ensemble learning", China Master's Theses Full-text Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN106778070A (en) * | 2017-03-31 | 2017-05-31 | Shanghai Jiao Tong University | Method for predicting human protein subcellular location
CN109273054A (en) * | 2018-08-31 | 2019-01-25 | Nanjing Agricultural University | Protein subcellular interval prediction method based on relational graph
CN109273054B (en) * | 2018-08-31 | 2021-07-13 | Nanjing Agricultural University | Protein subcellular interval prediction method based on relational graph
CN112634988A (en) * | 2021-01-07 | 2021-04-09 | Neijiang Normal University | Python language-based gene variation detection method and system

Similar Documents

Publication Publication Date Title
Lin et al. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval
CN108009405A (en) Method for predicting bacterial outer membrane proteins based on machine learning techniques
CN108765383B (en) Video description method based on deep transfer learning
Xiao et al. Based on grid-search and PSO parameter optimization for Support Vector Machine
CN103778227A (en) Method for screening useful images from retrieved images
Yin et al. Feature-based map matching for low-sampling-rate GPS trajectories
CN105760711A (en) Method for using KNN calculation and similarity comparison to predict protein subcellular section
CN103955628A (en) Subspace fusion-based protein-vitamin binding site prediction method
CN106227884B (en) Online taxi-hailing recommendation method based on collaborative filtering
CN103440471A (en) Human action recognition method based on low-rank representation
CN104077499A (en) Protein-nucleotide binding site prediction method based on supervised up-sampling learning
CN109544600A (en) Target tracking method based on context-aware and discriminative correlation filters
CN112085247A (en) Protein residue contact prediction method based on deep learning
CN106250925A (en) Zero-shot video classification method based on improved canonical correlation analysis
CN102945553A (en) Remote sensing image partition method based on automatic difference clustering algorithm
CN103324933A (en) Membrane protein subcellular localization method based on complex space multi-view feature fusion
CN103177121A (en) Locality preserving projection method incorporating the Pearson correlation coefficient
CN109034237A (en) Loop closure detection method based on convolutional neural network landmarks and sequence search
CN109215737A (en) Method and device for protein feature extraction, function model generation, and function prediction
Sun et al. Dun: Dual-path temporal matching network for natural language-based vehicle retrieval
CN105046106B (en) Protein subcellular localization prediction method implemented with nearest-neighbor retrieval
Kim Deep active learning for sequence labeling based on diversity and uncertainty in gradient
CN112116949B (en) Protein folding identification method based on triple loss
CN106250818A (en) Face age estimation method based on total-order-preserving projection
CN103439441A (en) Peptide identification method based on subset error rate estimation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160713
