CN105046106A

CN105046106A - Protein subcellular localization and prediction method realized by using nearest-neighbor retrieval

Info

Publication number: CN105046106A
Application number: CN201510411973.9A
Authority: CN
Inventors: 薛卫; 王雄飞; 赵南; 任守纲
Original assignee: Nanjing Agricultural University
Current assignee: Nanjing Agricultural University
Priority date: 2015-07-14
Filing date: 2015-07-14
Publication date: 2015-11-11
Anticipated expiration: 2035-07-14
Also published as: CN105046106B

Abstract

A protein subcellular localization prediction method realized by nearest-neighbor retrieval comprises the following steps: (1) taking the amino acid composition (AAC) feature vector as the feature of a protein sequence, and storing the AAC feature vector of each protein sequence in a training set in a plurality of hash tables by a locality-sensitive hashing (LSH) method; (2) during prediction, calculating the hash value of the AAC feature vector of a target sequence in each hash table by the LSH method, and obtaining a set of similar-sequence vectors; and (3) selecting, from the set of similar-sequence vectors, the Q vectors closest in Euclidean distance to the AAC feature vector of the target sequence, calculating the expected protein sequence distance between the target sequence and the sequence of each of the Q vectors by a global-alignment dynamic programming method, and taking the subcellular compartment of the protein whose sequence has the highest expected distance to the target sequence as the predicted compartment.

Description

A protein subcellular localization prediction method realized by nearest-neighbor retrieval

Technical field

The invention belongs to the field of bioinformatics and relates to a protein subcellular localization prediction method implemented with machine learning techniques, specifically a protein subcellular localization prediction method realized by nearest-neighbor retrieval.

Background technology

Protein subcellular localization refers to the specific site within the cell at which a protein or gene expression product resides; prediction means inferring that subcellular location from a given protein sequence. The subcellular localization of a protein is closely related to its biological function, and knowledge of where proteins reside in the cell plays a vital role in biology, cell biology, pharmacology and medicine. Although subcellular localization can be determined experimentally, doing so is time-consuming and expensive. As the volume of sequenced genomic data grows, methods for predicting protein subcellular localization become increasingly important, and automated, accurate tools are needed. Several effective localization prediction methods have appeared in recent years, ranging from single classifiers to ensemble machine learning. Common single-classifier algorithms include support vector machines, neural networks, hidden Markov models, Bayesian methods and K-nearest neighbors. Ensemble learning combines multiple weak classifiers into one strong ensemble classifier, which can improve model performance. Single and ensemble classifiers have been applied repeatedly to subcellular localization prediction, but accuracy has become difficult to improve, and most of these methods depend on a fairly complex model training process; unless new methods or features are devised, accuracy is hard to raise further.

Summary of the invention

The object of the invention is to address the problem of protein subcellular localization by proposing a protein subcellular localization prediction method realized by nearest-neighbor retrieval. The method uses the simple AAC vector as the feature of a protein sequence and stores the feature vectors of the training-set sequences in multiple hash tables with the LSH algorithm. During prediction, the hash value of the target sequence's AAC feature vector in each hash table is calculated by the LSH method to obtain a set of similar-sequence vectors. From this set, the Q vectors nearest to the target vector in Euclidean distance are chosen. The expected protein sequence distance for each of these vectors is then calculated by global-alignment dynamic programming, and the subcellular compartment of the protein whose sequence has the highest expected distance to the target sequence is the predicted compartment.

The technical scheme of the present invention is:

A protein subcellular localization prediction method realized by nearest-neighbor retrieval, the method comprising the following steps:

(1) Taking the AAC feature vector as the feature of a protein sequence, store the AAC feature vector of each protein sequence in the training set in multiple hash tables by the LSH method;

(2) During prediction, calculate by the LSH method the hash value of the target sequence's AAC feature vector in each hash table, and obtain the set of similar-sequence vectors;

(3) From the set of similar-sequence vectors, choose the Q vectors nearest in Euclidean distance to the AAC feature vector of the target sequence; calculate by global-alignment dynamic programming the expected protein sequence distance between the target sequence and the sequence of each of the Q vectors; and take the subcellular compartment of the protein whose sequence, among the Q vectors, has the highest expected distance to the target sequence as the predicted compartment.

Step (1) of the present invention specifically comprises the following steps:

(A) Extract the AAC feature vector of the protein sequence:

Let the protein sequence P be:

P = R₁R₂R₃…R_t (1)

where t is the length of the protein sequence, i.e. the number of amino acid residues; R₁ is the first amino acid residue in the sequence P, R₂ is the second amino acid residue, and so on, and R_t is the t-th amino acid residue;

AAC feature extraction: the amino acid composition information of the protein sequence P, i.e. its AAC feature vector, is:

v = [f₁, f₂, …, f_d] (2)

where f₁, f₂, …, f₂₀ are given by the following equation:
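Equation (3) is not reproduced in this text. A form consistent with the symbol definitions that follow, assuming f_u is the plain relative frequency of residue A(u) in P, would be

f_{u} = \frac{1}{t} \sum_{i = 1}^{t} \delta (R_{i}, A (u)), \quad \delta (R_{i}, A (u)) = 1 \text{ if } R_{i} = A (u), \text{ otherwise } 0

A minimal Python sketch under the same assumption is given here. The function name `aac_vector` and the alphabetical ordering of the 20 standard one-letter codes are illustrative choices; the source does not fix the ordering A(u).

```python
from collections import Counter

# Assumed ordering A(1)..A(20): the 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_vector(sequence):
    """Return the 20-dimensional amino acid composition (AAC) vector of a sequence."""
    seq = sequence.upper()
    t = len(seq)
    if t == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    # f_u = (occurrences of A(u)) / t. Non-standard letters still count toward t
    # but have no coordinate of their own (an assumption of this sketch).
    return [counts[aa] / t for aa in AMINO_ACIDS]
```

For example, `aac_vector("AACD")` puts 0.5 on the coordinate for A and 0.25 on each of the coordinates for C and D.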

where f_u (u = 1, 2, …, d) is the frequency of occurrence of each amino acid, d = 20, t is the length of the protein sequence, i is the index of an amino acid residue, and A(u) is the amino acid residue corresponding to index u;

(B) Build the hash tables:

For the n protein sequences in the training set, the d-dimensional AAC feature vector of each protein sequence is stored in L hash tables: each vector is placed, by the LSH method, in the bucket with the corresponding key in each of the L hash tables.

Step (B) of the present invention specifically comprises the following steps:

(B-1) For the n protein sequences in the training set, each coordinate of the d-dimensional AAC feature vector v of each protein sequence is scaled by a factor C and rounded according to formula (4), which converts v into a vector v′ whose coordinates are all non-negative integers:

v′ = [C × v] (4)

where [ ] denotes the rounding operation;

(B-2) The d coordinates of v′ are transformed as follows: let r be a coordinate of v′; then g(r) = 000…0111…1, a bit string of length C whose left end is all 0s and whose right end is all 1s, the number of 1s being equal to the value of r;

The operator | concatenates the codes of adjacent coordinates, so the vector v′ is transformed by F(v′), where f_u here denotes the u-th coordinate of v′:

v″ = F(v′) = g(f₁) | g(f₂) | g(f₃) | … | g(f_d);

(B-3) From the integers 0 to Cd - 1, k numbers n₁, n₂, n₃, …, n_k are selected at random. Let h(v″, n) denote the n-th coordinate of v″; then v‴ = G(v″) = h(v″, n₁) h(v″, n₂) … h(v″, n_k), and G(v″) is a hash value of the AAC feature vector v;

(B-4) For the n protein sequences in the training set, n hash values are obtained according to step (B-3), and one hash table is built from them;

(B-5) To raise the collision rate of similar vectors, L hash tables are built by repeating steps (B-3) and (B-4).
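Steps (B-1) to (B-5) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names, the use of string keys, the fixed random seed, and drawing the k positions without replacement are all choices made for this sketch. The bit h(v″, n) is computed directly from v′, because bit n lies in block u = n div C at offset p = n mod C, and g(r) has its r ones at the right end.

```python
import random


def scale_to_integers(v, C):
    """Formula (4): v' = [C * v], each coordinate rounded to an integer in 0..C."""
    return [int(round(C * f)) for f in v]


def sampled_bit(v_int, n, C):
    """h(v'', n): bit n of v'' = g(f_1) | ... | g(f_d), without building the string.

    g(r) is C bits long with r ones at the right end, so the bit at offset p
    inside block u is 1 exactly when p >= C - r.
    """
    u, p = divmod(n, C)
    return 1 if p >= C - v_int[u] else 0


def make_hash_function(d, C, k, rng):
    """(B-3): pick k positions in 0..C*d-1 and return G, mapping v' to a k-bit key."""
    positions = rng.sample(range(C * d), k)  # assumption: distinct positions

    def G(v_int):
        return "".join(str(sampled_bit(v_int, n, C)) for n in positions)

    return G


def build_tables(vectors, C, k, L, seed=0):
    """(B-4), (B-5): build L tables, each mapping a key to the indices in that bucket."""
    rng = random.Random(seed)
    d = len(vectors[0])
    hash_functions = [make_hash_function(d, C, k, rng) for _ in range(L)]
    tables = []
    for G in hash_functions:
        table = {}
        for idx, v in enumerate(vectors):
            table.setdefault(G(scale_to_integers(v, C)), []).append(idx)
        tables.append(table)
    return hash_functions, tables
```

With this unary encoding, two coordinates r₁ and r₂ differ in exactly |r₁ - r₂| bits, so the Hamming distance between two strings v″ equals the L1 distance between the corresponding integer vectors v′; sampling bits therefore tends to send nearby AAC vectors to the same bucket.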

Step (2) of the present invention specifically comprises the following steps: extract the AAC feature vector T of the target protein sequence, and calculate by the LSH method the hash value of T in each hash table: J₁, J₂, …, J_L; retrieve from each hash table the vectors stored under the corresponding hash value to obtain the set of similar-sequence vectors; then choose from this set the Q vectors nearest to T in Euclidean distance, calculate by global-alignment dynamic programming the expected protein sequence distance M between the target sequence and the sequence of each of the Q vectors, and take the subcellular compartment of the protein whose sequence has the highest M as the predicted compartment.
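The retrieval part of this step (bucket lookup, union, and Euclidean filtering) can be sketched as below. The data layout is an assumption of the sketch: each table is a dictionary from key to a list of training-vector indices, and `target_keys` holds the keys J₁, …, J_L already computed for the target. The function names are hypothetical.

```python
import math


def candidate_set(target_keys, tables):
    """Union of the buckets addressed by the target's key in each hash table."""
    candidates = set()
    for key, table in zip(target_keys, tables):
        candidates.update(table.get(key, []))  # an empty bucket contributes nothing
    return candidates


def q_nearest(target, candidates, vectors, Q):
    """Indices of the Q candidates closest to the target in Euclidean distance."""
    ranked = sorted(candidates, key=lambda idx: (math.dist(target, vectors[idx]), idx))
    return ranked[:Q]
```

Only the vectors in the candidate set are compared with T, so the Euclidean ranking touches a small part of the training set rather than all n vectors.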

The global-alignment dynamic programming calculation of the present invention is as follows: let a and b be two sequences of lengths x and y, and let the expected distance between them be M(a_x, b_y); by evaluating the distance M(a_i, b_j) between the first i positions of sequence a and the first j positions of sequence b, i ∈ [1, x], j ∈ [1, y], the distance M(a_x, b_y) is obtained recursively.

The recursive alignment of the present invention proceeds in x × y steps over the ranges i ∈ [1, x], j ∈ [1, y]; each step that advances a position corresponds to one of three events:

A vertical move from cell (i-1, j) to (i, j), equivalent to inserting a gap in sequence b to extend the alignment; the distance value decreases by 2;

A diagonal move from cell (i-1, j-1) to (i, j), equivalent to appending the letters a_i and b_j to extend the alignment; if the letters are identical the distance value increases by 1, and if they differ it decreases by 1;

A horizontal move from cell (i, j-1) to (i, j), equivalent to inserting a gap in sequence a to extend the alignment; the distance value decreases by 2;

The distance of cell (i, j) is taken as the maximum of the distances of the three adjacent cells after adding the respective weights, namely

M (a_{i}, b_{j}) = \max \left\{ \begin{matrix} M (a_{i - 1}, b_{j}) - 2 \\ M (a_{i - 1}, b_{j - 1}) + S (i, j) \\ M (a_{i}, b_{j - 1}) - 2 \end{matrix} \right\}

where max means taking the best of the three possible scores, M(a₀, b₀) = 0, and S(i, j) compares the i-th letter of a with the j-th letter of b: it is 1 if they are the same and -1 if they differ.
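A minimal Python sketch of this recurrence follows. The source states only M(a₀, b₀) = 0; the sketch assumes the usual global-alignment boundary in which the first row and column are filled with accumulated gap penalties (M(a_i, b₀) = -2i and M(a₀, b_j) = -2j). The function name and keyword parameters are illustrative.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global-alignment score M(a_x, b_y) computed row by row."""
    x, y = len(a), len(b)
    # prev[j] holds M(a_{i-1}, b_j); row 0 is j leading gaps (assumed boundary).
    prev = [gap * j for j in range(y + 1)]
    for i in range(1, x + 1):
        curr = [gap * i] + [0] * y  # column 0 is i leading gaps (assumed boundary)
        for j in range(1, y + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # S(i, j)
            curr[j] = max(
                prev[j] + gap,      # vertical move: gap in b
                prev[j - 1] + s,    # diagonal move: a_i aligned with b_j
                curr[j - 1] + gap,  # horizontal move: gap in a
            )
        prev = curr
    return prev[y]
```

Because matches add to the value and gaps and mismatches subtract from it, a larger M means a better alignment, which is why the neighbour with the highest M is selected.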

Beneficial effect of the present invention:

The present invention proposes a prediction model for protein subcellular compartment localization based on LSH approximate nearest-neighbor search and global-alignment dynamic programming. The prediction model does not rely on complex sequence features and is highly adaptable: even when sequences are added to or removed from the training set, the LSH hash tables that serve as the prediction parameters need not be entirely recomputed. The prediction model achieves relatively high overall accuracy in the jackknife test on a benchmark data set, and the prediction method obtains the prediction result for a target sequence quickly and effectively.

Brief description of the drawings

Fig. 1 is the MAP curve for the experiment on the number of hash tables.

Fig. 2 is the MRR curve for the experiment on the number of hash tables.

Fig. 3 is the MAP curve for the experiment on the number of hash table bits.

Fig. 4 is the MRR curve for the experiment on the number of hash table bits.

Embodiment

The present invention is further described below with reference to the drawings and embodiments.

1. Selection of the test data set

The description uses a data set of 317 apoptosis protein sequences obtained from the SWISS-PROT database. The 317 protein sequences are distributed over 6 subcellular compartments: 112 cytoplasmic proteins, 55 membrane proteins, 34 mitochondrial proteins, 17 secreted proteins, 52 nuclear proteins and 47 endoplasmic reticulum proteins.

2. Experimental evaluation methods and metrics

Three methods are commonly used to evaluate predictions: the resubstitution test, K-fold cross-validation and the jackknife test. In the resubstitution test the training set contains the sequence to be predicted, so the success rate of the present method can be expected to be 100%. Compared with K-fold cross-validation, the jackknife test uses a one-to-many prediction mode, in which each sequence is predicted from all the others, and it is regarded in statistics as the more objective and rigorous validation method; the prediction results in the implementation steps are therefore verified with the jackknife test.
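The jackknife procedure amounts to a leave-one-out loop, sketched below. The `predict` argument is a hypothetical callable standing in for any predictor with the signature `predict(train_sequences, train_labels, target_sequence)`; it is not an interface defined by the source.

```python
def jackknife_predictions(sequences, labels, predict):
    """Leave-one-out: predict each sample from a training set that excludes it."""
    predicted = []
    for i in range(len(sequences)):
        train_sequences = sequences[:i] + sequences[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predicted.append(predict(train_sequences, train_labels, sequences[i]))
    return predicted
```

Excluding the target from the training data is what separates this test from resubstitution, where a nearest-neighbor method would always retrieve the target itself.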

The experiments use sensitivity, specificity, the correlation coefficient and overall accuracy as evaluation metrics. Sensitivity (SN_i), specificity (SP_i), the correlation coefficient (MCC_i) and the overall accuracy OA are defined as follows:

SN_i = TP_i / (TP_i + FN_i)

SP_i = TP_i / (TP_i + FP_i)

{MCC}_{i} = \frac{({TP}_{i} \times {TN}_{i}) - ({FP}_{i} \times {FN}_{i})}{\sqrt{({TP}_{i} + {FP}_{i}) \times ({TN}_{i} + {FN}_{i}) \times ({TP}_{i} + {FN}_{i}) \times ({TN}_{i} + {FP}_{i})}}

OA = ∑_i TP_i / ∑_i (TP_i + FP_i)

In the formulas above, TP_i is the number of sequences of the i-th subcellular compartment that are correctly predicted, FN_i is the number of sequences of the i-th compartment that are not correctly predicted, FP_i is the number of sequences that do not belong to the i-th compartment but are predicted as belonging to it, and TN_i is the number of sequences outside the i-th compartment that are correctly predicted. These metrics assess the retrieval method objectively and effectively from several angles: sensitivity (SN_i) reflects the accuracy of the prediction algorithm in each compartment, specificity (SP_i) evaluates the confidence of the algorithm, the correlation coefficient MCC_i reflects the overall effectiveness of the prediction algorithm, and the overall accuracy OA reflects its accuracy over all compartments.
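These definitions can be computed from paired lists of true and predicted labels, as in the sketch below. Two points are assumptions of the sketch: TN_i is taken in the standard sense (sequences outside compartment i that are not predicted as i), and a metric whose denominator is zero is reported as 0.0. The function names are illustrative.

```python
import math


def per_class_metrics(true_labels, predicted_labels, cls):
    """Return (SN, SP, MCC) for one compartment, following the formulas above."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)  # assumed standard TN
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc


def overall_accuracy(true_labels, predicted_labels):
    """OA: each sequence gets exactly one prediction, so the sum of TP_i + FP_i is the sample count."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

Note that SP_i as defined here, TP_i / (TP_i + FP_i), is the quantity often called precision elsewhere.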

3. Setting of the prediction method parameters

The prediction algorithm requires values for three parameters: the number of hash tables L, the number of bits k of each hash table key, and the number Q of vectors passed to global alignment. To study how these three parameters affect the LSH prediction algorithm, the default settings are L = 10, k = 200, Q = 6. When the effect of one parameter is studied, the other two are fixed at their default values, and 10 experiments are run for each parameter setting.

Figs. 1 and 2 show how the number of hash tables affects the performance of the hashing algorithm. As L increases, the mean average precision (MAP) first rises steadily and then levels off, while the mean number of rows returned by the hash table search (mean return rows, MRR) grows linearly; more returned rows mean a longer prediction time. On the two data sets, L = 4 already gives our algorithm good prediction results. The results show that, weighing accuracy against computational efficiency, L in the interval [5, 20] is reasonable and yields a high prediction success rate.

Figs. 3 and 4 show how the number of bits k of the hash table affects the performance of the hashing algorithm. As the figures show, MAP and MRR both fall as k grows. The reason is that a larger k lowers the collision rate of similar vectors, which reduces the number of rows the hash tables return. Weighing accuracy against efficiency, k = 200 is a reasonable setting.
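The opposite effects of k and L can be made explicit with the standard collision estimate for bit-sampling LSH. This is a sketch under an assumption not stated in the source, namely that the k positions behave as independent uniform draws (an approximation when they are drawn without replacement). Let D be the number of positions in which the bit strings v″ of two vectors differ, out of Cd positions in total. Then the probability that the two vectors agree on one sampled position is

p = 1 - \frac{D}{C d}

the probability that they receive the same key in one hash table is

P_{\text{table}} = p^{k}

and the probability that they share a bucket in at least one of the L hash tables is

P_{\text{any}} = 1 - (1 - p^{k})^{L}

Since p ≤ 1, raising k shrinks p^k and so fewer rows are returned, while raising L pushes P_any toward 1 and so more rows are returned, in line with the trends described for Figs. 1 to 4.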

In a specific implementation:

Based on the results and analysis of the parameter-setting experiments, the final prediction parameters are set to L = 10, k = 200, Q = 4. For the 317 sequences, the prediction procedure is as follows:

(1) Extract the AAC features of the protein sequences to obtain 317 20-dimensional feature vectors.

(2) Build the hash tables: store the sample library of 317 20-dimensional feature vectors in 10 hash tables; each vector is placed, by the LSH method described above, in the bucket with the corresponding key in each of the 10 hash tables.

1) Convert the 317 20-dimensional AAC vectors into vectors whose coordinates are all non-negative integers.

2) Each vector v is converted into a 0/1 string of length 1000 × 20.

3) From the integers 0 to 1000 × 20 - 1, select 200 numbers n₁, n₂, n₃, …, n₂₀₀ at random. Let h(v″, n) denote the n-th coordinate of v″; then v‴ = G(v″) = h(v″, n₁) h(v″, n₂) … h(v″, n₂₀₀).

4) G(v″) is then a hash value of the AAC feature vector v.

5) To raise the collision rate of similar vectors, build 10 hash tables by repeating steps 2) to 4).

(3) Search the sample library for the AAC feature vector T of the target sequence to be predicted. Calculate by the LSH method the hash value of T in each hash table: h₁, h₂, …, h₁₀. The 10 sets of vectors retrieved from the 10 hash tables are then merged by taking their union. From this union, the 4 vectors nearest to T in Euclidean distance are chosen. The expected protein sequence distance M between the target sequence and the sequence of each of the 4 vectors is calculated by global-alignment dynamic programming, and the subcellular compartment of the protein whose sequence has the highest M is the predicted compartment.
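The final selection in the procedure above reduces to an argmax over the Q retained neighbours, sketched here. The `score` argument is a hypothetical callable returning the expected distance M for two sequences; the function name and the tie-breaking rule (the first neighbour with the highest score wins) are assumptions of this sketch.

```python
def predict_compartment(target_seq, neighbour_seqs, neighbour_labels, score):
    """Return the compartment label of the neighbour with the highest score M."""
    best = max(
        range(len(neighbour_seqs)),
        key=lambda i: score(target_seq, neighbour_seqs[i]),
    )
    return neighbour_labels[best]
```

With Q = 4, `neighbour_seqs` would hold the four sequences whose AAC vectors are closest to T, so only four alignments are computed per prediction.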

Table 1. Jackknife prediction results for the 317 sequences

Parts not addressed by the present invention are the same as the prior art or can be implemented using the prior art.

Claims

1. A protein subcellular localization prediction method realized by nearest-neighbor retrieval, characterized in that the method comprises the following steps:

2. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 1, characterized in that step (1) specifically comprises the following steps:

(A) Extract the AAC feature vector of the protein sequence:

Let the protein sequence P be:

P = R₁R₂R₃…R_t (1)

where t is the length of the protein sequence, i.e. the number of amino acid residues; R₁ is the first amino acid residue in the sequence P, R₂ is the second amino acid residue, and so on, and R_t is the t-th amino acid residue;

v = [f₁, f₂, …, f_d] (2)

where f₁, f₂, …, f₂₀ are given by the following equation:

3. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 2, characterized in that step (B) specifically comprises the following steps:

v′ = [C × v] (4)

where [ ] denotes the rounding operation;

The operator | concatenates the codes of adjacent coordinates, so the vector v′ is transformed by F(v′):

v″ = F(v′) = g(f₁) | g(f₂) | g(f₃) | … | g(f_d);

4. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 1, characterized in that step (2) specifically comprises the following steps: extract the AAC feature vector T of the target protein sequence, and calculate by the LSH method the hash value of T in each hash table: J₁, J₂, …, J_L; retrieve from each hash table the vectors stored under the corresponding hash value to obtain the set of similar-sequence vectors; then choose from this set the Q vectors nearest to T in Euclidean distance, calculate by global-alignment dynamic programming the expected protein sequence distance M between the target sequence and the sequence of each of the Q vectors, and take the subcellular compartment of the protein whose sequence has the highest M as the predicted compartment.

5. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 4, characterized in that the global-alignment dynamic programming calculation is as follows: let a and b be two sequences of lengths x and y, and let the expected distance between them be M(a_x, b_y); by evaluating the distance M(a_i, b_j) between the first i positions of sequence a and the first j positions of sequence b, i ∈ [1, x], j ∈ [1, y], the distance M(a_x, b_y) is obtained recursively.

6. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 5, characterized in that the recursive alignment proceeds in x × y steps over the ranges i ∈ [1, x], j ∈ [1, y], and each step that advances a position corresponds to one of three events:

M (a_{i}, b_{j}) = \max \left\{ \begin{matrix} M (a_{i - 1}, b_{j}) - 2 \\ M (a_{i - 1}, b_{j - 1}) + S (i, j) \\ M (a_{i}, b_{j - 1}) - 2 \end{matrix} \right\}