CN105046106A - Protein subcellular localization and prediction method realized by using nearest-neighbor retrieval - Google Patents

Info

Publication number
CN105046106A
Authority
CN
China
Prior art keywords
vector
sequence
aac
protein
protein sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510411973.9A
Other languages
Chinese (zh)
Other versions
CN105046106B (en)
Inventor
薛卫
王雄飞
赵南
任守纲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Agricultural University
Original Assignee
Nanjing Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Agricultural University
Priority to CN201510411973.9A
Publication of CN105046106A
Application granted
Publication of CN105046106B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

A protein subcellular localization prediction method realized by nearest-neighbor retrieval comprises the following steps: (1) using amino acid composition (AAC) feature vectors as the features of protein sequences, and storing the AAC feature vector of each protein sequence in a training set in a plurality of hash tables by a locality-sensitive hashing (LSH) method; (2) during prediction, computing the hash value of the target sequence's AAC feature vector in each hash table by the LSH method, thereby obtaining a set of similar-sequence vectors; and (3) selecting from that set the Q vectors closest in Euclidean distance to the AAC feature vector of the target sequence, computing the expected protein sequence distance between the target sequence and the sequence of each of the Q vectors by global-alignment dynamic programming, and taking the subcellular location of the protein whose sequence has the highest expected distance (alignment score) to the target sequence as the predicted location.

Description

A protein subcellular localization prediction method realized by nearest-neighbor retrieval
Technical field
The invention belongs to the field of bioinformatics, in particular to protein subcellular localization prediction using machine learning techniques, and specifically to a protein subcellular localization prediction method realized by nearest-neighbor retrieval.
Background art
Protein subcellular localization refers to the specific site within the cell where a protein or gene expression product resides; the prediction task is to infer that site from a given protein sequence. The subcellular localization of a protein is closely related to its biological function. Knowledge of where proteins are located in the cell plays a vital role in biology, cell biology, pharmacology and medicine. Although subcellular localization can be determined experimentally, doing so is time-consuming and expensive. As the volume of sequenced genomic data grows, methods for predicting protein subcellular localization become increasingly important, and automated, accurate tools are needed. Several effective localization prediction methods have appeared in recent years, ranging from single classifiers to ensemble machine learning. Common single-classifier algorithms include support vector machines, neural networks, hidden Markov models, Bayesian methods and K-nearest neighbors. Ensemble learning combines multiple weak classifiers into one strong ensemble classifier, which can improve model performance. Both single and ensemble classifiers have been tried repeatedly for subcellular localization prediction, but accuracy has become hard to raise, and most of these methods depend on a fairly complex model training process; unless new methods or features are devised, accuracy is unlikely to improve further.
Summary of the invention
The object of the invention is to address the problem of protein subcellular localization by providing a prediction method realized by nearest-neighbor retrieval. The method uses the simple AAC vector as the feature of a protein sequence and stores the feature vectors of the training-set sequences in multiple hash tables using an LSH algorithm. During prediction, the hash value of the target sequence's AAC feature vector in each hash table is computed by the LSH method, yielding a set of similar-sequence vectors. From this set, the Q vectors nearest to the target vector in Euclidean distance are selected. The expected protein sequence distance between the target sequence and the sequence of each selected vector is then computed by global-alignment dynamic programming, and the subcellular location of the protein whose sequence has the highest expected distance to the target sequence is taken as the predicted location.
The technical scheme of the invention is as follows:
A protein subcellular localization prediction method realized by nearest-neighbor retrieval, comprising the following steps:
(1) Using the AAC feature vector as the feature of a protein sequence, store the AAC feature vector of each protein sequence in the training set in multiple hash tables by the LSH method.
(2) During prediction, compute the hash value of the target sequence's AAC feature vector in each hash table by the LSH method, and obtain the set of similar-sequence vectors.
(3) From the set of similar-sequence vectors, select the Q vectors nearest in Euclidean distance to the AAC feature vector of the target sequence; compute, by global-alignment dynamic programming, the expected protein sequence distance between the target sequence and the sequence of each of the Q vectors; and take the subcellular location of the protein whose sequence has the highest expected distance to the target sequence among the Q vectors as the predicted location.
Step (1) of the invention specifically comprises the following steps:
(A) Extract the AAC feature vector of the protein sequence:
Let the protein sequence P be:
$$P = R_1 R_2 R_3 \cdots R_t \qquad (1)$$
where t is the length of the protein sequence, i.e. the number of amino acid residues, $R_1$ is the first amino acid residue of sequence P, $R_2$ is the second, and so on, up to $R_t$, the t-th residue.
AAC feature extraction: the amino acid composition information of protein sequence P, i.e. its AAC feature vector, is:
$$v = [f_1, f_2, \ldots, f_d] \qquad (2)$$
where $f_1, f_2, \ldots, f_{20}$ are computed by the following formula:
$$f_u = \frac{1}{t} \sum_{i=1}^{t} \delta(R_i, A(u)), \qquad \delta(R_i, A(u)) = \begin{cases} 1, & R_i = A(u) \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
where $f_u$ (u = 1, 2, ..., d) is the frequency of occurrence of each amino acid, d = 20, t is the length of the protein sequence, i is the index of an amino acid residue, and A(u) is the amino acid residue corresponding to index u.
(B) Build the hash tables:
For the n protein sequences in the training set, the d-dimensional AAC feature vector of each protein sequence is stored in L hash tables: each vector is placed, by the LSH method, into the bucket with the corresponding key in each of the L hash tables.
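The AAC feature extraction of step (A) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `aac_vector` and the alphabetical ordering of the 20 standard amino acids used for A(1)..A(20) are assumptions, since the text does not state an ordering.

```python
from collections import Counter

# Assumed ordering of the 20 standard amino acids, A(1)..A(20);
# the patent text does not say which ordering it uses.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_vector(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    t = len(sequence)
    if t == 0:
        raise ValueError("empty sequence")
    counts = Counter(sequence.upper())
    # f_u = (number of residues equal to A(u)) / t
    return [counts[aa] / t for aa in AMINO_ACIDS]
```

For a sequence made only of standard residues the coordinates sum to 1; non-standard letters, if present, still count toward t in this sketch.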
Step (B) of the invention specifically comprises the following steps:
(B-1) For the n protein sequences in the training set, take the d-dimensional AAC feature vector v of each protein sequence, scale its d coordinates by a factor C and round them according to formula (4), so that every coordinate of the resulting vector is a non-negative integer:
$$v' = [C \times v] \qquad (4)$$
where [ ] denotes the rounding operation.
(B-2) Transform each of the d coordinates as follows: let r be a coordinate of the vector; then g(r) = 000...0111...1 is a bit string of length C with 0s on the left and 1s on the right, in which the number of 1s equals the value of r.
Using the operator | to concatenate adjacent coordinates, the vector v' is transformed by F(v'): $v'' = F(v') = g(f'_1) \,|\, g(f'_2) \,|\, g(f'_3) \,|\, \cdots \,|\, g(f'_d)$, where $f'_u$ is the u-th coordinate of v'.
(B-3) Randomly choose k integers $n_1, n_2, n_3, \ldots, n_k$ from the integers 0 to Cd-1. Let h(v'', n) denote the n-th coordinate of v''; then $v''' = G(v'') = h(v'', n_1)\, h(v'', n_2) \cdots h(v'', n_k)$. G(v'') is a hash value of the AAC feature vector v.
(B-4) For the n protein sequences in the training set, obtain n hash values according to step (B-3) and build one hash table.
(B-5) To increase the collision rate of similar vectors, build L hash tables by repeating steps (B-3) and (B-4).
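Steps (B-1) to (B-5) amount to a bit-sampling LSH over a unary encoding of the scaled vector. The following Python sketch shows one way to build the L tables. All function names are hypothetical, `round` is assumed for the rounding operation [ ], g(r) is assumed to be C bits long, and the k positions are sampled without repetition; the patent does not fix these details.

```python
import random
from collections import defaultdict

def to_integer_vector(v, C):
    """Formula (4): scale each coordinate by C and round to an integer."""
    return [int(round(C * f)) for f in v]

def unary_bits(v_int, C):
    """F(v'): concatenate g(r) per coordinate; g(r) has C bits, r ones on the right."""
    bits = []
    for r in v_int:
        bits.extend([0] * (C - r) + [1] * r)
    return bits

def hash_key(bits, positions):
    """G(v''): the sampled bits, joined into a string used as the bucket key."""
    return "".join(str(bits[n]) for n in positions)

def build_tables(vectors, C, k, L, seed=0):
    """Steps (B-3) to (B-5): L tables, each with its own k sampled positions."""
    rng = random.Random(seed)
    d = len(vectors[0])
    encoded = [unary_bits(to_integer_vector(v, C), C) for v in vectors]
    tables = []
    for _ in range(L):
        positions = rng.sample(range(C * d), k)  # k positions in 0 .. C*d - 1
        buckets = defaultdict(list)
        for idx, bits in enumerate(encoded):
            buckets[hash_key(bits, positions)].append(idx)
        tables.append((positions, buckets))
    return tables
```

Each table keeps its sampled positions next to its buckets so that a query vector can later be hashed with the same positions. Storing training indices rather than vectors keeps the buckets small.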
Step (2) of the invention specifically comprises the following steps: extract the AAC feature vector T of the target protein sequence, and compute by the LSH method the hash value of T in each hash table: $J_1, J_2, \ldots, J_L$. Retrieve the vectors stored under the corresponding hash value in each hash table to obtain the set of similar-sequence vectors. From this set, select the Q vectors nearest to T in Euclidean distance, and compute by global-alignment dynamic programming the expected protein sequence distance M between the target sequence and the sequence of each of the Q vectors. The subcellular location of the protein whose sequence has the highest M is the predicted location.
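The retrieval and decision logic just described can be sketched as follows. This is an illustrative outline under stated assumptions: `predict_location`, `bucket_hits` (one list of training indices per hash table, i.e. the bucket found under key J_l) and the `score` callback (standing in for the global-alignment expected distance M) are hypothetical names, and returning `None` when no bucket is hit is a choice made here, since the text does not cover that case.

```python
import math

def predict_location(target_vec, target_seq, bucket_hits, train_vecs,
                     train_seqs, train_labels, Q, score):
    """Union the buckets, keep the Q nearest by Euclidean distance,
    then return the label of the candidate with the highest score."""
    candidates = set()
    for hits in bucket_hits:
        candidates.update(hits)
    if not candidates:
        return None  # no collision in any table
    nearest = sorted(
        candidates,
        key=lambda i: math.dist(target_vec, train_vecs[i]),
    )[:Q]
    best = max(nearest, key=lambda i: score(target_seq, train_seqs[i]))
    return train_labels[best]
```

Passing the scoring function in as a parameter keeps the cheap vector filter separate from the more expensive sequence comparison, which is only run on at most Q candidates.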
The global-alignment dynamic programming computation of the invention is as follows: let a and b be two sequences of lengths x and y, and let the expected distance between them be $M(a_x, b_y)$. By computing the distance $M(a_i, b_j)$ between the first i positions of sequence a and the first j positions of sequence b, for i ∈ [1, x] and j ∈ [1, y], the distance $M(a_x, b_y)$ is obtained recursively.
The recursive alignment of the invention is divided into steps: over the ranges i ∈ [1, x], j ∈ [1, y], x × y steps are performed, and each step that advances a position corresponds to one of three events:
A vertical move from cell (i-1, j) to cell (i, j), equivalent to inserting a gap in sequence b to extend the alignment; the distance value decreases by 2.
A diagonal move from cell (i-1, j-1) to cell (i, j), equivalent to appending the letters $a_i$ and $b_j$ to extend the alignment; if the letters are identical the distance value increases by 1, and if they differ it decreases by 1.
A horizontal move from cell (i, j-1) to cell (i, j), equivalent to inserting a gap in sequence a to extend the alignment; the distance value decreases by 2.
The distance of cell (i, j) is taken as the maximum of the distances of the three adjacent cells after the respective weights are added, namely
$$M(a_i, b_j) = \max \begin{cases} M(a_{i-1}, b_j) - 2 \\ M(a_{i-1}, b_{j-1}) + S(i, j) \\ M(a_i, b_{j-1}) - 2 \end{cases}$$
where max takes the best of the three possible scores, $M(a_0, b_0) = 0$, and S(i, j) compares the i-th letter of a with the j-th letter of b: it is 1 if they are identical and -1 if they differ.
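The recurrence above is the standard global-alignment (Needleman-Wunsch style) scoring scheme with match +1, mismatch -1 and gap -2. A minimal Python sketch follows; the function name is hypothetical, and the boundary values M(a_i, b_0) = -2i and M(a_0, b_j) = -2j are an assumption, since the text only gives M(a_0, b_0) = 0.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return M(a_x, b_y) for sequences a and b under the recurrence above."""
    x, y = len(a), len(b)
    # M[i][j]: best score aligning the first i letters of a with the first j of b
    M = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        M[i][0] = M[i - 1][0] + gap  # assumed boundary: leading gaps cost 2 each
    for j in range(1, y + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch  # S(i, j)
            M[i][j] = max(
                M[i - 1][j] + gap,    # vertical move: gap in b
                M[i - 1][j - 1] + s,  # diagonal move: align a_i with b_j
                M[i][j - 1] + gap,    # horizontal move: gap in a
            )
    return M[x][y]
```

Because a higher value means a better alignment, the quantity the patent calls a distance behaves as a similarity score, which is why the candidate with the highest M is chosen.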
Beneficial effects of the invention:
The invention provides a protein subcellular location prediction model based on LSH approximate nearest-neighbor search and global-alignment dynamic programming. The prediction model does not depend on complex sequence features and adapts well: even when sequences in the training set are changed, the LSH hash tables that serve as the prediction parameters do not have to be recomputed in full. The prediction model achieves a high overall accuracy in the jackknife test on the benchmark data set, and the method obtains the prediction result for a target sequence quickly and effectively.
Brief description of the drawings
Fig. 1: MAP curve for the experiment on the number of hash tables
Fig. 2: MRR curve for the experiment on the number of hash tables
Fig. 3: MAP curve for the experiment on the number of hash bits
Fig. 4: MRR curve for the experiment on the number of hash bits
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
1. Selection of the test data set
The description uses a data set of 317 apoptosis protein sequences obtained from the SWISS-PROT database. The 317 protein sequences are distributed over 6 subcellular locations: 112 cytoplasmic proteins, 55 membrane proteins, 34 mitochondrial proteins, 17 secreted proteins, 52 nuclear proteins and 47 endoplasmic reticulum proteins.
2. Experimental evaluation methods and indices
Three methods are commonly used to evaluate predictions: the resubstitution test, K-fold cross-validation and the jackknife test. In the resubstitution test the training data include the sequence to be predicted, so the method described here would trivially reach a success rate of 100%. Compared with K-fold cross-validation, the jackknife test predicts each sequence from all of the others (leave-one-out) and is regarded in statistics as the more objective and rigorous validation method. The prediction results in the implementation steps are therefore validated with the jackknife test.
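As an illustration of the jackknife (leave-one-out) procedure, the sketch below predicts each sample from a model built on all the others. The names `jackknife_predictions` and `fit_predict` are hypothetical; `fit_predict` is a placeholder for any predictor, including the LSH-based one described in this document.

```python
def jackknife_predictions(samples, labels, fit_predict):
    """Leave-one-out: predict each sample using all other samples as training data.

    samples and labels are lists; fit_predict(train_x, train_y, query)
    returns the predicted label for query.
    """
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, samples[i]))
    return predictions
```

With n sequences this runs the predictor n times, each time on n - 1 training sequences, which for the 317-sequence data set means 317 predictions.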
The experiments use four evaluation indices: sensitivity, specificity, correlation coefficient and overall accuracy. Sensitivity ($SN_i$), specificity ($SP_i$), correlation coefficient ($MCC_i$) and overall accuracy (OA) are defined as follows:
$$SN_i = \frac{TP_i}{TP_i + FN_i}$$
$$SP_i = \frac{TP_i}{TP_i + FP_i}$$
$$MCC_i = \frac{(TP_i \times TN_i) - (FP_i \times FN_i)}{\sqrt{(TP_i + FP_i) \times (TN_i + FN_i) \times (TP_i + FN_i) \times (TN_i + FP_i)}}$$
$$OA = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}$$
In the formulas above, $TP_i$ is the number of sequences of the i-th subcellular location that are predicted correctly, $FN_i$ is the number of sequences of the i-th location that are not predicted correctly, $FP_i$ is the number of sequences that do not belong to the i-th location but are predicted as belonging to it, and $TN_i$ is the number of sequences not belonging to the i-th location that are correctly predicted as such. These indices give an objective and effective assessment of the retrieval method from several aspects: sensitivity ($SN_i$) reflects the accuracy of the prediction algorithm for each location, specificity ($SP_i$) evaluates the confidence of the algorithm, the correlation coefficient $MCC_i$ reflects the overall validity of the prediction algorithm, and the overall accuracy OA reflects its accuracy over all locations.
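The four indices can be computed directly from lists of true and predicted labels. The sketch below follows the definitions above as written (note that $SP_i$ here is TP/(TP+FP)); the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something the text specifies.

```python
import math

def per_class_metrics(true_labels, predicted_labels, cls):
    """Return (SN, SP, MCC) for one class, following the definitions above."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc

def overall_accuracy(true_labels, predicted_labels):
    """OA: when every sequence receives exactly one predicted label, the sum of
    TP_i + FP_i over all classes equals the number of sequences."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```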
3. Setting of the prediction method parameters
The prediction algorithm requires values for three parameters: the number of hash tables L, the number of hash bits k, and the number Q of vectors passed to global alignment. To examine how these three parameters affect the LSH prediction algorithm, the default settings are L = 10, k = 200, Q = 6. When the effect of one parameter is studied, the other two are fixed at their defaults, and 10 experiments are run for each parameter setting.
Figs. 1 and 2 show how the number of hash tables affects the performance of the hashing algorithm. As L increases, the mean average precision (MAP) first rises steadily and then levels off, while the mean number of rows returned by the hash-table search (mean return rows, MRR) grows linearly; more returned results mean a longer prediction time. On both data sets, L = 4 already gives the algorithm good prediction results. The results show that, when accuracy and computational efficiency are both taken into account, a value of L in the interval [5, 20] is reasonable and yields a high prediction success rate.
Figs. 3 and 4 show how the number of hash bits k affects the performance of the hashing algorithm. As the figures show, MAP and MRR both decline as k grows. The reason is that a larger k lowers the collision rate of similar vectors, which reduces the number of rows the hash tables return. When accuracy and efficiency are both taken into account, k = 200 is a reasonable setting.
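The opposite effects of k and L can be illustrated with the usual bit-sampling LSH estimate. The sketch below is an approximation under an assumption not stated in the patent: the k positions are treated as independent draws, so that two bit strings differing at D of the C·d positions agree on one sampled bit with probability p = 1 - D/(C·d). The function name is hypothetical.

```python
def collision_probability(p, k, L):
    """Chance that two vectors share a bucket in at least one of L tables.

    p: probability that a single sampled bit agrees for the two vectors.
    A single table collides only if all k sampled bits agree (p ** k).
    """
    one_table = p ** k
    return 1.0 - (1.0 - one_table) ** L
```

Under this estimate a larger k makes each table more selective, so fewer rows are returned, while a larger L gives similar vectors more chances to collide, so more rows are returned. This matches the trends reported for Figs. 1 to 4.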
Specific implementation:
Based on the results and analysis of the parameter-setting experiments, the final prediction parameters are set to L = 10, k = 200, Q = 4. For the 317 sequences, the prediction method is carried out as follows:
(1) Extract the AAC features of the protein sequences, obtaining 317 feature vectors of dimension 20.
(2) Build the hash tables: store the sample library of 317 20-dimensional feature vectors in 10 hash tables; each vector is placed, by the LSH method described above, into the bucket with the corresponding key in each of the 10 hash tables.
1) Convert the 317 AAC vectors of dimension 20 into vectors whose coordinates are all non-negative integers.
2) Each vector v is converted into a 0/1 string of length 1000 × 20 (that is, C = 1000).
3) Randomly choose 200 integers $n_1, n_2, n_3, \ldots, n_{200}$ from the integers 0 to 1000 × 20 - 1. Let h(v'', n) denote the n-th coordinate of v''; then $v''' = G(v'') = h(v'', n_1)\, h(v'', n_2) \cdots h(v'', n_{200})$.
4) G(v'') is a hash value of the AAC feature vector v.
5) To increase the collision rate of similar vectors, build 10 hash tables by repeating steps 2) to 4).
(3) Search the sample library for the AAC feature vector T of the target sequence to be predicted. Compute by the LSH method the hash value of T in each hash table: $h_1, h_2, \ldots, h_{10}$. Retrieve the 10 sets of vectors from the 10 hash tables and take their union. From this union, select the 4 vectors nearest to T in Euclidean distance. Compute by global-alignment dynamic programming the expected protein sequence distance M between the target sequence and the sequence of each of the 4 vectors. The subcellular location of the protein whose sequence has the highest M is the predicted location.
Table 1: Jackknife prediction results for the 317 sequences
Parts of the invention that are not described are the same as the prior art or can be realized using the prior art.

Claims (6)

1. A protein subcellular localization prediction method realized by nearest-neighbor retrieval, characterized in that the method comprises the following steps:
(1) using the AAC feature vector as the feature of a protein sequence, storing the AAC feature vector of each protein sequence in the training set in multiple hash tables by the LSH method;
(2) during prediction, computing the hash value of the target sequence's AAC feature vector in each hash table by the LSH method, and obtaining the set of similar-sequence vectors;
(3) selecting, from the set of similar-sequence vectors, the Q vectors nearest in Euclidean distance to the AAC feature vector of the target sequence; computing, by global-alignment dynamic programming, the expected protein sequence distance between the target sequence and the sequence of each of the Q vectors; and taking the subcellular location of the protein whose sequence has the highest expected distance to the target sequence among the Q vectors as the predicted location.
2. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 1, characterized in that step (1) specifically comprises the following steps:
(A) extracting the AAC feature vector of the protein sequence:
let the protein sequence P be:
$$P = R_1 R_2 R_3 \cdots R_t \qquad (1)$$
where t is the length of the protein sequence, i.e. the number of amino acid residues, $R_1$ is the first amino acid residue of sequence P, $R_2$ is the second, and so on, up to $R_t$, the t-th residue;
AAC feature extraction: the amino acid composition information of protein sequence P, i.e. its AAC feature vector, is:
$$v = [f_1, f_2, \ldots, f_d] \qquad (2)$$
where $f_1, f_2, \ldots, f_{20}$ are computed by the following formula:
$$f_u = \frac{1}{t} \sum_{i=1}^{t} \delta(R_i, A(u)), \qquad \delta(R_i, A(u)) = \begin{cases} 1, & R_i = A(u) \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
where $f_u$ (u = 1, 2, ..., d) is the frequency of occurrence of each amino acid, d = 20, t is the length of the protein sequence, i is the index of an amino acid residue, and A(u) is the amino acid residue corresponding to index u;
(B) building the hash tables:
for the n protein sequences in the training set, the d-dimensional AAC feature vector of each protein sequence is stored in L hash tables, each vector being placed, by the LSH method, into the bucket with the corresponding key in each of the L hash tables.
3. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 2, characterized in that step (B) specifically comprises the following steps:
(B-1) for the n protein sequences in the training set, taking the d-dimensional AAC feature vector v of each protein sequence, scaling its d coordinates by a factor C and rounding them according to formula (4), so that every coordinate of the resulting vector is a non-negative integer:
$$v' = [C \times v] \qquad (4)$$
where [ ] denotes the rounding operation;
(B-2) transforming each of the d coordinates as follows: let r be a coordinate of the vector; then g(r) = 000...0111...1 is a bit string of length C with 0s on the left and 1s on the right, in which the number of 1s equals the value of r;
using the operator | to concatenate adjacent coordinates, the vector v' is transformed by F(v'):
$v'' = F(v') = g(f'_1) \,|\, g(f'_2) \,|\, g(f'_3) \,|\, \cdots \,|\, g(f'_d)$, where $f'_u$ is the u-th coordinate of v';
(B-3) randomly choosing k integers $n_1, n_2, n_3, \ldots, n_k$ from the integers 0 to Cd-1; letting h(v'', n) denote the n-th coordinate of v'', then $v''' = G(v'') = h(v'', n_1)\, h(v'', n_2) \cdots h(v'', n_k)$; G(v'') is a hash value of the AAC feature vector v;
(B-4) for the n protein sequences in the training set, obtaining n hash values according to step (B-3) and building one hash table;
(B-5) to increase the collision rate of similar vectors, building L hash tables by repeating steps (B-3) and (B-4).
4. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 1, characterized in that step (2) specifically comprises the following steps: extracting the AAC feature vector T of the target protein sequence, and computing by the LSH method the hash value of T in each hash table: $J_1, J_2, \ldots, J_L$; retrieving the vectors stored under the corresponding hash value in each hash table to obtain the set of similar-sequence vectors; then selecting from this set the Q vectors nearest to T in Euclidean distance, and computing by global-alignment dynamic programming the expected protein sequence distance M between the target sequence and the sequence of each of the Q vectors; the subcellular location of the protein whose sequence has the highest M is the predicted location.
5. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 4, characterized in that the global-alignment dynamic programming computation is as follows: let a and b be two sequences of lengths x and y, and let the expected distance between them be $M(a_x, b_y)$; by computing the distance $M(a_i, b_j)$ between the first i positions of sequence a and the first j positions of sequence b, for i ∈ [1, x] and j ∈ [1, y], the distance $M(a_x, b_y)$ is obtained recursively.
6. The protein subcellular localization prediction method realized by nearest-neighbor retrieval according to claim 5, characterized in that the recursive alignment is divided into steps: over the ranges i ∈ [1, x], j ∈ [1, y], x × y steps are performed, and each step that advances a position corresponds to one of three events:
a vertical move from cell (i-1, j) to cell (i, j), equivalent to inserting a gap in sequence b to extend the alignment, the distance value decreasing by 2;
a diagonal move from cell (i-1, j-1) to cell (i, j), equivalent to appending the letters $a_i$ and $b_j$ to extend the alignment, the distance value increasing by 1 if the letters are identical and decreasing by 1 if they differ;
a horizontal move from cell (i, j-1) to cell (i, j), equivalent to inserting a gap in sequence a to extend the alignment, the distance value decreasing by 2;
the distance of cell (i, j) is taken as the maximum of the distances of the three adjacent cells after the respective weights are added, namely
$$M(a_i, b_j) = \max \begin{cases} M(a_{i-1}, b_j) - 2 \\ M(a_{i-1}, b_{j-1}) + S(i, j) \\ M(a_i, b_{j-1}) - 2 \end{cases}$$
where max takes the best of the three possible scores, $M(a_0, b_0) = 0$, and S(i, j) compares the i-th letter of a with the j-th letter of b: it is 1 if they are identical and -1 if they differ.
CN201510411973.9A 2015-07-14 2015-07-14 A protein subcellular localization prediction method realized by nearest-neighbor retrieval Expired - Fee Related CN105046106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510411973.9A CN105046106B (en) 2015-07-14 2015-07-14 A protein subcellular localization prediction method realized by nearest-neighbor retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510411973.9A CN105046106B (en) 2015-07-14 2015-07-14 A protein subcellular localization prediction method realized by nearest-neighbor retrieval

Publications (2)

Publication Number Publication Date
CN105046106A true CN105046106A (en) 2015-11-11
CN105046106B CN105046106B (en) 2018-02-23

Family

ID=54452646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510411973.9A Expired - Fee Related CN105046106B (en) 2015-07-14 2015-07-14 A protein subcellular localization prediction method realized by nearest-neighbor retrieval

Country Status (1)

Country Link
CN (1) CN105046106B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595909A (en) * 2018-03-29 2018-09-28 山东师范大学 TA targeting proteins prediction techniques based on integrated classifier
CN109273054A (en) * 2018-08-31 2019-01-25 南京农业大学 Protein Subcellular interval prediction method based on relation map
CN112259160A (en) * 2020-11-19 2021-01-22 广东工业大学 Protein subcellular localization method, system, storage medium and computer equipment
CN112585686A (en) * 2018-09-21 2021-03-30 渊慧科技有限公司 Machine learning to determine protein structure

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324933A (en) * 2013-06-08 2013-09-25 南京理工大学常熟研究院有限公司 Membrane protein sub-cell positioning method based on complex space multi-view feature fusion
CN104156634A (en) * 2014-08-14 2014-11-19 中南大学 Key protein identification method based on subcellular localization specificity

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
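宋杰 (Song Jie): "Nearest-neighbor algorithm for protein subcellular localization prediction", 《计算机应用研究》 (Application Research of Computers) *
张继福 等 (Zhang Jifu et al.): "Local outlier data mining algorithm based on MapReduce and correlation subspaces", 《软件学报》 (Journal of Software) *
李立奇 等 (Li Liqi et al.): "Application of the KNN method to subcellular localization of fibronectin-domain-containing proteins", 《山东医药》 (Shandong Medical Journal) *
樊玉才 等 (Fan Yucai et al.): "Subcellular localization of apoptosis proteins based on an improved GO-PseAA method", 《内蒙古工业大学学报》 (Journal of Inner Mongolia University of Technology) *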

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595909A (en) * 2018-03-29 2018-09-28 山东师范大学 TA targeting proteins prediction techniques based on integrated classifier
CN109273054A (en) * 2018-08-31 2019-01-25 南京农业大学 Protein Subcellular interval prediction method based on relation map
CN109273054B (en) * 2018-08-31 2021-07-13 南京农业大学 Protein subcellular interval prediction method based on relational graph
CN112585686A (en) * 2018-09-21 2021-03-30 渊慧科技有限公司 Machine learning to determine protein structure
CN112259160A (en) * 2020-11-19 2021-01-22 广东工业大学 Protein subcellular localization method, system, storage medium and computer equipment
CN112259160B (en) * 2020-11-19 2023-05-26 广东工业大学 Protein subcellular localization method, system, storage medium and computer device

Also Published As

Publication number Publication date
CN105046106B (en) 2018-02-23

Similar Documents

Publication Publication Date Title
Liu et al. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning
Wei et al. An improved protein structural classes prediction method by incorporating both sequence and structure information
Dong et al. Identification of DNA-binding proteins by auto-cross covariance transformation
Zhang et al. StackPDB: predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier
CN108009405A (en) A kind of method based on machine learning techniques prediction Bacterial outer membrane proteins matter
Li et al. Protein contact map prediction based on ResNet and DenseNet
CN105046106A (en) Protein subcellular localization and prediction method realized by using nearest-neighbor retrieval
CN105550715A (en) Affinity propagation clustering-based integrated classifier constructing method
Zhang et al. Predicting linear B-cell epitopes by using sequence-derived structural and physicochemical features
CN103617203A (en) Protein-ligand binding site predicting method based on inquiry drive
CN110060738A (en) Method and system based on machine learning techniques prediction bacterium protective antigens albumen
CN103473416A (en) Protein-protein interaction model building method and device
Wang et al. PredDBP-stack: prediction of DNA-binding proteins from HMM profiles using a stacked ensemble method
Yang et al. PseKNC and Adaboost-based method for DNA-binding proteins recognition
Ma et al. Kernel soft-neighborhood network fusion for miRNA-disease interaction prediction
Wang A Modified Machine Learning Method Used in Protein Prediction in Bioinformatics.
Chrysostomou et al. Structural classification of protein sequences based on signal processing and support vector machines
CN101609486B (en) Identification method of superclass of G-protein-coupled receptors and Web service system thereof
CN108388774A (en) A kind of on-line analysis of polypeptide spectrum matched data
Zaki et al. Features extraction for protein homology detection using Hidden Markov Models combining scores
Arango-Argoty et al. An adaptation of Pfam profiles to predict protein sub-cellular localization in Gram positive bacteria
Chen et al. FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures
Fu et al. Prediction of anuran antimicrobial peptides using AdaBoost and improved PSSM profiles
Hassan et al. COMPARATIVE ANALYSIS OF CLASSIFICATION BASED ON CELLULAR LOCALIZATION DATA USING MACHINE LEARNING
CN111951889B (en) Recognition prediction method and system for M5C locus in RNA sequence

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180223

Termination date: 20210714
