CN105760711A - Method for using KNN calculation and similarity comparison to predict protein subcellular section

Info

Publication number: CN105760711A
Application number: CN201610072828.7A
Authority: CN
Inventors: 张梁; 薛卫; 王雄飞; 杨荣丽
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2016-02-02
Filing date: 2016-02-02
Publication date: 2016-07-13

Abstract

The invention discloses a method for predicting the subcellular location of a protein using KNN calculation and similarity comparison. The method comprises the following steps: 1, the amino acid composition (AAC) features of all protein sequences in a protein sequence data set are extracted; 2, a KNN algorithm determines the set of protein sequences within the prediction range; 3, a Blast similarity comparison is carried out to obtain the most similar sequence, and the location to which that sequence belongs is taken as the location of the sequence being predicted. The method achieves high prediction accuracy; in particular, it markedly improves recognition precision for subcellular classes that traditional methods predict poorly, and it plays an important role in accurately predicting the subcellular location of unknown proteins.

Description

Method for predicting protein subcellular location using KNN calculation and similarity comparison

Technical field

The invention belongs to the field of bioinformatics and relates to methods for predicting protein subcellular location, and in particular to a method for predicting protein subcellular location using KNN calculation and similarity comparison.

Background technology

The function of a protein is closely linked to the subcellular location in which it resides, so research on predicting the subcellular location of protein sequences is of great significance. Predicting protein subcellular location with machine learning is currently becoming the main way of obtaining location information.

Since machine learning was first applied to subcellular location prediction in 1991, research in this area has made a series of advances over more than 20 years. The main prediction methods include: prediction from the amino acid composition of a protein sequence using a covariant discriminant function; prediction based on the fusion of features such as the N-terminus, C-terminus and hydrophobicity; and prediction using the fuzzy K-nearest neighbor (FKNN) algorithm combined with pseudo amino acid composition features.

In the above prediction methods, features are extracted from the protein sequence and fed to a classifier to determine the location. Because these methods consider only the features of the sequence itself and ignore the similarity relationships between sequences that arise from genetic variation, their prediction accuracy is on the low side.

Summary of the invention

To address the deficiencies of the prior art, the invention discloses a method for predicting protein subcellular location using KNN calculation and similarity comparison.

The technical scheme of the invention is as follows:

A method for predicting protein subcellular location using KNN calculation and similarity comparison, comprising the following steps:

Step 1: extract the AAC features of all protein sequences in the protein sequence data set;

Step 2: select one protein sequence from the data set as the test sequence and set the remaining protein sequences as the training set; use the KNN algorithm to determine the set of protein sequences within the prediction range;

Step 3: carry out a Blast similarity comparison between the sequence to be predicted and the set of protein sequences within the prediction range to obtain the most similar sequence; the location to which the most similar sequence belongs is the location of the sequence to be predicted.

In a further technical scheme, step 1 specifically includes:

Step 1-A: represent each protein sequence in the data set as P:

P=R₁R₂R₃R₄R₅…R_L;

In the above formula, L is the length of the protein sequence and R_i (i=1, …, L) is the i-th amino acid residue in the protein sequence;

Step 1-B: calculate the AAC feature of each protein sequence P:

P_AAC=[f₁,f₂,…,f₂₀]^T;

In the above formula, f_u (u=1,2,3,…,20) is the frequency with which the u-th amino acid occurs in protein sequence P.

In a further technical scheme, the frequency f_u with which the u-th amino acid occurs in protein sequence P in step 1-B is calculated as:

f_{u} = \frac{1}{N} \sum_{i=1}^{L} R_{i}, \quad R_{i} = \begin{cases} 1, & \text{if } R_{i} = A(u) \\ 0, & \text{if } R_{i} \neq A(u) \end{cases};

In the above formula, L is the length of the protein sequence, N is the total number of amino acid residues contained in the protein sequence, and A(u) is the amino acid residue corresponding to index u.
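As an illustration of the frequency calculation described above, the following is a minimal Python sketch. The function name `aac_features` and the alphabetical ordering of the 20 one-letter amino acid codes in `AMINO_ACIDS` are assumptions made for this example; the text does not fix which index u corresponds to which amino acid.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
# Assumption: this alphabetical ordering defines the index u; the text does not specify one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence."""
    residues = sequence.upper()
    n = len(residues)  # N: total number of residues in the sequence
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(residues)
    # f_u = (number of residues equal to A(u)) / N
    return [counts[aa] / n for aa in AMINO_ACIDS]
```

In this sketch, residues outside the 20 standard codes (for example X) still count towards N but contribute to no f_u, so the features of such a sequence sum to less than 1.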

In a further technical scheme, step 2 specifically includes:

Step 2-A: determine the threshold K of the KNN algorithm;

Step 2-B: select one sequence from the protein sequence data set as the sequence to be predicted, with the remaining sequences as the training set;

Step 2-C: based on the AAC features of the protein sequences obtained in step 1, calculate the Euclidean distance between the sequence to be predicted and each protein sequence in the training set, and take the K protein sequences with the shortest Euclidean distance as the prediction range.

In a further technical scheme, the Euclidean distance in step 2-C is calculated as follows:

Let the AAC feature of the sequence to be predicted be P′_AAC=(f′₁,f′₂,f′₃,…,f′_n) and the AAC feature of any protein sequence in the training set be P″_AAC=(f″₁,f″₂,f″₃,…,f″_n); the Euclidean distance d is then calculated as:

d = \sqrt{\sum_{i=1}^{n} (f'_{i} - f''_{i})^{2}}.
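The distance calculation and the selection of the K nearest training sequences can be sketched as follows. This is a minimal illustration, not the patented implementation; the names `euclidean_distance` and `k_nearest`, and the representation of the training set as (name, feature vector) pairs, are assumptions.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query_features, training, k):
    """Return the names of the k training sequences closest to the query.

    training: list of (name, feature_vector) pairs (hypothetical layout).
    """
    ranked = sorted(
        training,
        key=lambda item: euclidean_distance(query_features, item[1]),
    )
    return [name for name, _ in ranked[:k]]
```

If k exceeds the size of the training set, the sketch simply returns every training sequence, ordered by distance.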

In a further technical scheme, step 3 specifically includes:

Step 3-A: take the set of protein sequences within the prediction range as the protein sequence alignment database;

Step 3-B: carry out a Blast similarity comparison between the sequence to be predicted and the protein sequence alignment database; the protein sequence that scores highest against the sequence to be predicted is taken as the most similar sequence, and the location to which the most similar sequence belongs is the location of the sequence to be predicted.
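One way to pick the highest-scoring sequence from a Blast run is sketched below. It assumes the comparison was run with NCBI BLAST+ and tabular output (`-outfmt 6`), whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. The text does not say which score is used, so ranking by bit score is an assumption, and the names `best_hit`, `predict_location` and the `locations` mapping are hypothetical.

```python
def best_hit(blast_tabular):
    """Return the subject ID with the highest bit score, or None if there are no hits.

    blast_tabular: text in BLAST+ tabular format (-outfmt 6, default columns).
    """
    best_id, best_score = None, float("-inf")
    for line in blast_tabular.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]        # sseqid
        bitscore = float(fields[11])  # bitscore (assumed ranking criterion)
        if bitscore > best_score:
            best_id, best_score = subject_id, bitscore
    return best_id


def predict_location(blast_tabular, locations):
    """Assign the query the known location of its highest-scoring hit.

    locations: dict mapping sequence ID to its subcellular location label.
    """
    hit = best_hit(blast_tabular)
    return locations.get(hit) if hit is not None else None
```

When Blast reports no hit among the K candidates, this sketch returns `None`; how such cases are handled is not described in the text.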

The beneficial effects of the invention are as follows:

As research in the life sciences continues to deepen, large-scale data are constantly being produced, and extracting useful information from these data efficiently and accurately is of great significance. Extracting sequence, structural and functional features that can be described numerically from protein sequences is one of the core tasks of subcellular localization prediction research. Compared with prior-art approaches that simply apply a traditional protein sequence feature extraction algorithm, such as AAC, and feed the extracted features to a classifier, the method of the invention achieves higher localization prediction accuracy.

The invention uses a KNN algorithm improved by Blast alignment, operating on features extracted from the protein sequences. Tests on two apoptosis protein data sets show that the method achieves high prediction accuracy; in particular, recognition precision improves markedly for subcellular classes that traditional methods predict poorly, which is important for accurately predicting the subcellular location of unknown proteins.

Brief description of the drawings

Fig. 1 is a schematic diagram of the steps of the invention.

Detailed description of the invention

The following description takes as an example a protein sequence data set of 98 apoptosis protein sequences obtained from the SWISS-PROT database, and uses the KNN algorithm improved by Blast comparison to predict protein subcellular location. Fig. 1 is a block diagram of the invention. As shown in Fig. 1, the specific steps are as follows:

Step 1-A: represent each protein sequence as P:

P=R₁R₂R₃R₄R₅…R_L;

In the above formula, L is the length of the protein sequence and R_i (i=1, …, L) is the i-th amino acid residue in the protein sequence;

Step 1-B: calculate the AAC feature of each protein sequence.

The AAC (Amino Acid Composition) feature of a protein sequence can be expressed as:

P_AAC=[f₁,f₂,…,f₂₀]^T;

In the above formula, f_u (u=1,2,3,…,20) is the frequency with which the u-th of the 20 amino acids occurs in the protein sequence; f_u can be obtained from the following formula:

f_{u} = \frac{1}{N} \sum_{i=1}^{L} R_{i}, \quad R_{i} = \begin{cases} 1, & \text{if } R_{i} = A(u) \\ 0, & \text{if } R_{i} \neq A(u) \end{cases};

In the above formula, L is the length of the protein sequence, N is the total number of amino acid residues contained in a sequence, and A(u) is the amino acid residue corresponding to index u. After this statistical calculation, every sequence can be represented by a 20-dimensional vector, which gives the AAC feature of each protein sequence.
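As a small worked example of the frequency calculation, consider a hypothetical four-residue sequence (chosen only for illustration; it is not taken from the data set):

```latex
P = \mathrm{MKKA}, \qquad L = N = 4, \\
f_{\mathrm{K}} = \tfrac{2}{4} = 0.5, \qquad
f_{\mathrm{M}} = f_{\mathrm{A}} = \tfrac{1}{4} = 0.25, \qquad
f_{u} = 0 \ \text{for the remaining 17 amino acids.}
```

The 20 frequencies sum to 1, so the AAC vector records only the composition of the sequence and discards the order of its residues.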

Step 2: select one protein sequence from the protein sequences as the test sequence, set the remaining sequences as the training set, and use the KNN algorithm to determine the prediction range;

Step 2-A: determine the threshold K of the KNN algorithm; in this embodiment, K=20 is selected.

Step 2-B: select one of the protein sequences as the sequence to be predicted, with the remaining sequences as the training set; in this embodiment, any one of the 98 sequences is taken as the sequence to be predicted and the remaining 97 sequences form the training set.
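The embodiment describes holding out one of the 98 sequences and training on the other 97. Repeating this for every sequence in turn (a leave-one-out protocol) is a common way to evaluate such a method, but that repetition is an assumption here and is not stated in the text. A minimal sketch, with the hypothetical helper name `leave_one_out`:

```python
def leave_one_out(items):
    """Yield (held_out, remaining) pairs, holding out each item once."""
    for i, held_out in enumerate(items):
        # The training set is every item except the one at position i.
        yield held_out, items[:i] + items[i + 1:]
```

For a list of 98 sequences, this yields 98 pairs, each with a training set of 97 sequences.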

Step 2-C: based on the AAC features of the protein sequences, calculate the Euclidean distance between the sequence to be predicted and each sequence in the training set, and take the set of the K protein sequences with the shortest Euclidean distance as the prediction range. Since K=20, the prediction range obtained contains 20 protein sequences.

For any two n-dimensional feature vectors (s₁,s₂,s₃,…,s_n) and (t₁,t₂,t₃,…,t_n), the Euclidean distance is calculated as follows:

d = \sqrt{\sum_{i=1}^{n} (s_{i} - t_{i})^{2}}

In this embodiment, let the AAC feature of the sequence to be predicted be P′_AAC=(f′₁,f′₂,f′₃,…,f′_n) and the AAC feature of any protein sequence in the training set be P″_AAC=(f″₁,f″₂,f″₃,…,f″_n); then, based on the AAC features of the protein sequences, the Euclidean distance d between the sequence to be predicted and a protein sequence in the training set is:

d = \sqrt{\sum_{i=1}^{n} (f'_{i} - f''_{i})^{2}}

Step 3: carry out a Blast similarity comparison between the sequence to be predicted and the protein sequences in the prediction range to obtain the most similar sequence; the location to which the most similar sequence belongs is the location of the sequence to be predicted. The location of the sequence to be predicted is its protein subcellular location.

Step 3-A: take the set of protein sequences in the prediction range as the protein sequence alignment database and write it to the database file fasta.
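Writing the candidate set to a file might look like the sketch below, assuming that the "database file fasta" refers to a file in the standard FASTA text format (a `>` header line followed by the sequence). The function name `to_fasta`, the record layout and the 60-character line width are assumptions.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text.

    Each record becomes a '>' header line followed by the sequence,
    wrapped at `width` characters per line.
    """
    lines = []
    for identifier, sequence in records:
        lines.append(">" + identifier)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)
```

The resulting text can be saved to a file and, if NCBI BLAST+ is the tool in use, indexed with `makeblastdb -in <file> -dbtype prot` before running `blastp`; the text itself does not name a specific Blast implementation.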

Step 3-B: carry out a Blast similarity comparison between the sequence to be predicted and the protein sequence alignment database in the database file fasta; the protein sequence that scores highest against the sequence to be predicted is taken as the most similar sequence P′, and the location to which P′ belongs is the location of the sequence to be predicted, i.e., the protein subcellular location of the invention.

Blast similarity comparison is prior art, and existing computation methods can be used for the relevant calculations.

The above is only a preferred embodiment of the invention, and the invention is not limited to the above embodiment. It should be understood that other improvements and changes that those skilled in the art can directly derive or conceive of without departing from the spirit and concept of the invention are all considered to fall within the protection scope of the invention.

Claims

1. A method for predicting protein subcellular location using KNN calculation and similarity comparison, characterized by comprising the following steps:

2. The method for predicting protein subcellular location using KNN calculation and similarity comparison as claimed in claim 1, characterized in that step 1 specifically includes:

P=R₁R₂R₃R₄R₅…R_L;

Step 1-B: calculate the AAC feature of each protein sequence P:

P_AAC=[f₁,f₂,…,f₂₀]^T;

3. The method for predicting protein subcellular location using KNN calculation and similarity comparison as claimed in claim 2, characterized in that the frequency f_u with which the u-th amino acid occurs in protein sequence P in step 1-B is calculated as:

f_{u} = \frac{1}{N} \sum_{i=1}^{L} R_{i}, \quad R_{i} = \begin{cases} 1, & \text{if } R_{i} = A(u) \\ 0, & \text{if } R_{i} \neq A(u) \end{cases};

4. The method for predicting protein subcellular location using KNN calculation and similarity comparison as claimed in claim 1, characterized in that step 2 specifically includes:

Step 2-A: determine the threshold K of the KNN algorithm;

Step 2-C: based on the AAC features of the protein sequences obtained in step 1, calculate the Euclidean distance between the sequence to be predicted and each protein sequence in the training set, and take the set of the K protein sequences with the shortest Euclidean distance as the prediction range.

5. The method for predicting protein subcellular location using KNN calculation and similarity comparison as claimed in claim 4, characterized in that the Euclidean distance in step 2-C is calculated as follows:

Let the AAC feature of the sequence to be predicted be P′_AAC=(f′₁,f′₂,f′₃,…,f′_n) and the AAC feature of any protein sequence in the training set be P″_AAC=(f″₁,f″₂,f″₃,…,f″_n); the Euclidean distance d is then calculated as:

d = \sqrt{\sum_{i=1}^{n} (f'_{i} - f''_{i})^{2}}.

6. The method for predicting protein subcellular location using KNN calculation and similarity comparison as claimed in claim 1, characterized in that step 3 specifically includes: